Site Reliability Engineer

Application ends: August 20, 2025

Job Description

About the Role

We’re looking for a Site Reliability Engineer who thrives on solving complex infrastructure challenges in high-throughput, low-latency environments. This is not a “keep the lights on” ops job. You’ll be architecting fault-tolerant systems, enforcing infrastructure as code, and leading game days to proactively surface systemic weaknesses.

Key Responsibilities

  • Architect scalable infrastructure with Kubernetes and Terraform across multi-region cloud environments (GCP preferred), focusing on observability, redundancy, and rapid rollback.
  • Automate failure recovery: Develop self-healing systems using custom controllers, retry policies, and circuit breakers—not just alerts with runbooks.
  • Lead incident response drills: Run chaos engineering experiments and simulate outages; don’t wait for production to break.
  • Define SLOs/SLIs from real business metrics, not arbitrary latency targets. Partner with product teams to ensure reliability is tied to user experience, not just uptime.
  • Root cause without finger-pointing: Design postmortem processes that result in design changes, not blame games.
  • Own the on-call experience: Proactively reduce alert fatigue and advocate for smart escalation policies. You will be empowered to say “this page is useless” and fix it.

Must-Have Qualifications

  • 4+ years of SRE or DevOps experience in production environments serving more than 100k QPS of real user traffic.
  • Deep knowledge of Kubernetes internals (e.g., admission controllers, scheduler behavior, CNI plugins) and managing clusters at scale.
  • Proven success with infrastructure as code (Terraform, Helm), secrets management, and immutable deployments.
  • Demonstrated ability to design and measure SLIs tied to business KPIs (e.g., user sign-in latency or checkout success rate).
  • Fluent in at least one backend language (Go, Rust, or Python preferred) with experience writing tooling for infrastructure teams.
  • Strong opinions on observability tooling: can speak to trade-offs between Prometheus + Thanos vs. Datadog or Grafana Cloud, and implement accordingly.
  • Experience building or integrating incident response tooling (PagerDuty, Blameless, FireHydrant) into CI/CD and ChatOps workflows.

Bonus Points

  • Experience working with distributed systems such as Kafka or Redis Cluster, or with service meshes (Istio, Linkerd).
  • Participation in open-source SRE tooling projects or internal developer platform teams.
  • Strong presentation skills: have led internal tech deep dives, brown bags, or postmortem reviews.

Are you interested in this position?

Apply by clicking on the “Apply Now” button below!
