Job Description
About the Role
We’re seeking a Site Reliability Engineer who thrives in complex, high-throughput environments and is driven to eliminate toil, improve system resilience, and enable development teams to move fast without breaking things. This role is not a catch-all for DevOps; it is tailored for engineers who deeply understand distributed systems and obsess over service-level indicators (SLIs), availability targets, and failure recovery.
What You’ll Be Responsible For
- Own the reliability of customer-facing services that handle over 30K requests per second, with a current uptime SLA of 99.99%.
- Partner with platform and service teams to define and maintain SLIs, SLOs, and error budgets—not just once, but as part of a continuous improvement cycle.
- Lead the design and execution of game days and chaos engineering efforts to proactively uncover reliability gaps.
- Write tooling that makes incident detection, triage, and mitigation more autonomous (not just dashboards, but auto-remediation where appropriate).
- Instrument production systems with actionable telemetry; you know the difference between noise and signal.
- Act as an incident commander for SEV-1/SEV-2 events, and contribute directly to our blameless postmortem culture by leading retrospectives that produce measurable improvements.
- Build and maintain CI/CD safety nets that reduce rollback risks and accelerate deployment velocity without sacrificing stability.
Minimum Qualifications
- 4+ years of experience in Site Reliability Engineering or Production Engineering roles in high-scale environments.
- Deep understanding of distributed systems, service meshes, and failure domains (e.g., how cascading failures propagate in microservice architectures).
- Proficient in Go, Python, or Rust—and comfortable writing production-grade code (this is not a scripting-only role).
- Strong experience with Kubernetes at scale—you’ve dealt with node churn, memory pressure, and control plane bottlenecks.
- Demonstrated experience deploying and tuning observability stacks (e.g., Prometheus, Grafana, OpenTelemetry, Honeycomb, or similar).
- Expertise in incident response tooling (e.g., PagerDuty, Opsgenie) and in designing effective on-call rotations.
Bonus if You Have
- Experience designing multi-region failover strategies and load-balancing policies that account for partial outages.
- Familiarity with eBPF for performance analysis and observability in containerized environments.
- A track record of building internal tooling that helped other engineers reduce deployment-related incidents.
- Contributions to open-source reliability-focused projects or platforms.
Are you interested in this position?
Apply by clicking on the “Apply Now” button below!