Job Description
About the Role
We’re looking for a Site Reliability Engineer who thrives in high-traffic, high-availability environments and understands the challenges of scaling real-time distributed systems. You won’t just “keep the lights on.” You’ll build systems that detect problems before they impact customers, engineer fault tolerance into every layer, and automate yourself out of recurring work.
What You’ll Do
- Architect and refine end-to-end observability (logs, metrics, traces) using tools like OpenTelemetry, Grafana, and Honeycomb.
- Reduce recovery times against the recovery time objective (RTO) and manage error budgets for mission-critical services used by millions of users per day.
- Own incident response processes, from detection and escalation to postmortem writing that drives real change, not just documentation.
- Implement and optimize progressive rollouts, including canary and blue/green deployments, via continuous delivery systems like Argo CD, Spinnaker, or Flux.
- Define and enforce SLOs and SLIs in partnership with product engineering, based on user-facing impact rather than abstract system metrics.
- Champion infrastructure as code using Terraform, Pulumi, or Crossplane, and guide the transition from ad-hoc scripts to codified workflows.
- Harden Kubernetes-based workloads running across multi-region clusters in GKE and EKS, including cost optimization and pod disruption budgets.
- Contribute to the chaos engineering strategy, simulating network partitions, node failures, and dependency outages in non-production environments.
What We’re Looking For
- 5+ years of experience running production systems at scale (preferably 1M+ users/month or similar complexity).
- Deep knowledge of Linux internals, especially cgroups, namespaces, and container networking.
- Fluency in Go, Python, or Rust for building internal tooling or contributing to open-source reliability components.
- Experience debugging distributed-system failures involving retries, timeouts, deadlocks, and cascading failures.
- Strong understanding of cloud-native patterns, including service meshes (e.g., Istio or Linkerd) and gRPC communication.
- Track record of driving blameless postmortems and reducing incident recurrence.
- Clear communication under pressure, especially during on-call rotations and live incident coordination.
- Bonus: experience with event-driven architectures (Kafka, NATS, or Pulsar) and their implications for service reliability.
Are you interested in this position?
Apply by clicking on the “Apply Now” button below!