Site Reliability Engineer

Application ends: August 26, 2025

Job Description

We’re looking for a Site Reliability Engineer who thrives at the intersection of software engineering and systems operations. This isn’t a passive monitoring or alert-routing role: our SREs own reliability as a core product feature. You’ll be embedded with a team managing a platform that processes real-time transactions across multiple regions with high throughput and strict latency requirements. Expect deep dives into distributed systems, hands-on automation, and a mandate to eliminate toil wherever it hides.


Key Responsibilities

  • Design and manage fault-tolerant and self-healing infrastructure across Kubernetes, AWS (EKS, EC2, RDS), and GCP workloads.
  • Build and refine automated runbooks, deployment pipelines, and chaos testing frameworks to simulate system degradation and improve recovery.
  • Collaborate with application teams to define and enforce Service Level Objectives (SLOs) and error budgets.
  • Develop and extend our observability platform using tools like Grafana, Prometheus, OpenTelemetry, and Honeycomb.
  • Participate in on-call rotations, conduct post-incident reviews, and lead blameless retrospectives with a bias toward long-term fixes.
  • Proactively identify hidden reliability risks in code deployments, API integrations, and third-party services.

Qualifications

  • 5+ years of hands-on experience in a production SRE or infrastructure engineering role, with deep expertise in distributed systems.
  • Strong coding ability in Go, Python, or Rust, with a focus on writing tools, infrastructure-as-code, or platform services.
  • Proven experience with service mesh technologies (e.g., Istio or Linkerd) and zero-downtime deployment strategies.
  • Practical experience operating multi-region or multi-cloud architectures at scale.
  • Experience deploying or maintaining infrastructure using Terraform, Helm, and CI/CD systems like ArgoCD or GitHub Actions.
  • Familiarity with incident management tools (e.g., PagerDuty, FireHydrant) and structured postmortem practices.
  • Comfortable challenging assumptions and driving cross-functional alignment in high-pressure situations.

Are you interested in this position?

Apply by clicking on the “Apply Now” button below!
