Site Reliability Engineer

Application ends: August 17, 2025

Job Description

About the Role

We’re seeking a Site Reliability Engineer who thrives in complex, high-throughput systems and is driven by eliminating toil, improving system resilience, and enabling development teams to move fast without breaking things. This role is not a catch-all for DevOps; it’s tailored for engineers who deeply understand distributed systems and obsess over service-level indicators (SLIs), availability targets, and failure recovery.


What You’ll Be Responsible For

  • Own the reliability of customer-facing services that handle over 30K requests per second, with a current uptime SLA of 99.99%.
  • Partner with platform and service teams to define and maintain SLIs, SLOs, and error budgets—not just once, but as part of a continuous improvement cycle.
  • Lead the design and execution of game days and chaos engineering efforts to proactively uncover reliability gaps.
  • Write tooling that makes incident detection, triage, and mitigation more autonomous (not just dashboards, but auto-remediation where appropriate).
  • Instrument production systems with actionable telemetry; you know the difference between noise and signal.
  • Act as an incident commander for SEV-1/SEV-2 events, and contribute directly to our blameless postmortem culture by leading retrospectives that produce measurable improvements.
  • Build and maintain CI/CD safety nets that reduce rollback risks and accelerate deployment velocity without sacrificing stability.

Minimum Qualifications

  • 4+ years of experience in Site Reliability Engineering or Production Engineering roles in high-scale environments.
  • Deep understanding of distributed systems, service meshes, and failure domains (e.g., how cascading failures propagate in microservice architectures).
  • Proficient in Go, Python, or Rust—and comfortable writing production-grade code (this is not a scripting-only role).
  • Strong experience with Kubernetes at scale—you’ve dealt with node churn, memory pressure, and control plane bottlenecks.
  • Demonstrated experience deploying and tuning observability stacks (e.g., Prometheus, Grafana, OpenTelemetry, Honeycomb, or similar).
  • Expertise in incident response tooling (e.g., PagerDuty, Opsgenie) and in designing effective on-call rotations with actionable alerts.

Bonus if You Have

  • Experience designing multi-region failover strategies and load-balancing policies that account for partial outages.
  • Familiarity with eBPF for performance analysis and observability in containerized environments.
  • A track record of building internal tooling that helped other engineers reduce deployment-related incidents.
  • Contributions to open-source reliability-focused projects or platforms.

