Site Reliability Engineer

Application ends: August 10, 2025

Job Description

About the Role

We’re seeking a Site Reliability Engineer who thrives at the intersection of software development and infrastructure scalability. You’ll be a key contributor to designing and maintaining systems that support high availability, low latency, and rapid deployment—especially under real-world load and chaos.

This role is not about babysitting servers. We’re looking for someone who can identify single points of failure in a system diagram, introduce chaos engineering principles into CI/CD, and challenge dev teams on observability and failure scenarios.


What You’ll Be Responsible For

  • Architecting and maintaining distributed systems with 99.99% uptime goals across services deployed in Kubernetes clusters (EKS preferred).
  • Implementing SLOs/SLIs from scratch and getting buy-in from product and engineering teams to align them with business outcomes.
  • Building internal tooling in Go or Python to automate reliability checks, simulate failure conditions, and enforce infra guardrails.
  • Working directly with development teams to implement production-readiness checklists and manage incident retrospectives that lead to actual changes.
  • Leading on-call rotations with a strong emphasis on reducing alert fatigue—designing actionable alerts, not noise.
  • Designing progressive rollout strategies (canary, blue/green, feature flags) and integrating them into existing CI/CD pipelines (we use ArgoCD + GitHub Actions).
  • Performing load testing and chaos testing ahead of peak traffic events (Black Friday, product launches), and documenting the failover procedures those tests validate.

Required Skills & Experience

  • 4+ years in a production-facing SRE or infrastructure engineering role.
  • Strong understanding of Linux internals, DNS, load balancing, and containerization (Docker + Kubernetes).
  • Proven experience designing and running production systems in AWS (Lambda, SQS, RDS, IAM, VPCs).
  • You’ve written at least one internal tool or automation in Python, Go, or Rust—not just shell scripts.
  • Hands-on experience with monitoring stacks: Prometheus, Grafana, and at least one of Datadog, New Relic, or Honeycomb.
  • Strong understanding of distributed systems failure modes, from split-brain to thundering herd.
  • You’ve participated in (or better, led) post-mortems that resulted in meaningful system changes.

Are you interested in this position?

Apply by clicking on the “Apply Now” button below!
