Job Description
About the Role
We’re looking for a Site Reliability Engineer who thrives on solving complex infrastructure challenges in high-throughput, low-latency environments. This is not a “keep the lights on” ops job. You’ll be architecting fault-tolerant systems, enforcing infrastructure as code, and leading game days to proactively surface systemic weaknesses.
Key Responsibilities
- Architect scalable infrastructure on Kubernetes across multi-region cloud environments (GCP preferred), managed with Terraform and designed for observability, redundancy, and rapid rollback.
- Automate failure recovery: Develop self-healing systems using custom controllers, retry policies, and circuit breakers, not just alerts with runbooks.
- Lead incident response drills: Run chaos engineering experiments and simulate outages; don’t wait for production to break.
- Define SLOs/SLIs from real business metrics, not arbitrary latency targets. Partner with product teams to ensure reliability is tied to user experience, not just uptime.
- Root cause without finger-pointing: Design postmortem processes that result in design changes, not blame games.
- Own the on-call experience: Proactively reduce alert fatigue and advocate for smart escalation policies. You will be empowered to say “this page is useless” and fix it.
Must-Have Qualifications
- 4+ years of experience in SRE or DevOps in production environments handling real user traffic above 100k QPS.
- Deep knowledge of Kubernetes internals (e.g., admission controllers, scheduler behavior, CNI plugins) and managing clusters at scale.
- Proven success with infrastructure as code (Terraform, Helm), secrets management, and immutable deployments.
- Demonstrated ability to design and measure SLIs tied to business KPIs (e.g., user sign-in latency or checkout success rate).
- Fluent in at least one backend language (Go, Rust, or Python preferred) with experience writing tooling for infrastructure teams.
- Strong opinions on observability tooling: can speak to trade-offs between Prometheus + Thanos vs. Datadog or Grafana Cloud, and implement accordingly.
- Experience building or integrating incident response tooling (PagerDuty, Blameless, FireHydrant) into CI/CD and ChatOps workflows.
Bonus Points
- Experience working on distributed systems like Kafka, Redis Cluster, or service meshes (Istio, Linkerd).
- Participation in open-source SRE tooling projects or internal developer platform teams.
- Strong presentation skills: have led internal tech deep dives, brown bags, or postmortem reviews.
Are you interested in this position?
Apply by clicking on the “Apply Now” button below!