Job Description
About the Role:
We’re seeking a Site Reliability Engineer who thrives in high-stakes environments, has a knack for diagnosing complex distributed systems under pressure, and is as comfortable writing Terraform modules as they are dissecting kernel panic logs. You’ll be embedded in our core infrastructure team, responsible for designing, scaling, and fortifying the systems that support millions of transactions per day. We run a multi-cloud setup (AWS + GCP), manage 50+ Kubernetes clusters, and prioritize observability, automation, and clear escalation paths over manual interventions.
What You’ll Do:
- Build and maintain self-healing infrastructure using Terraform, Helm, and GitOps workflows (ArgoCD).
- Lead incident response, postmortem processes, and RCA documentation for P0 and P1 events.
- Develop latency and error budgets with product teams, and implement SLO enforcement gates in CI/CD.
- Implement failover strategies and multi-region architectures with minimal RTO/RPO windows.
- Maintain and evolve observability stacks (Prometheus, Grafana, OpenTelemetry, Loki) to reduce MTTR and improve signal clarity.
- Write internal tools in Go or Python to streamline deployment, monitoring, and provisioning.
- Perform threat modeling on infrastructure-level risks and collaborate with Security Engineering on hardening strategies.
- Partner with DBAs and backend engineers to optimize query patterns, cache design, and autoscaling policies.
What We’re Looking For:
- Proven experience managing container orchestration systems (Kubernetes at scale – not minikube).
- Deep understanding of distributed systems concepts: quorum, backpressure, circuit breakers, and idempotency.
- 3+ years working in SRE, DevOps, or platform engineering roles, with demonstrated impact on production reliability.
- Strong experience with infrastructure-as-code (Terraform preferred) and CI/CD automation (CircleCI, GitHub Actions, or equivalent).
- Experience leading or contributing significantly to at least one large-scale incident response, with measurable improvements implemented afterward.
- Comfortable coding in Go or Python for automation and tooling.
- Experience configuring and tuning alerting and observability pipelines—more signal, less noise.
- Bonus: familiarity with service mesh architectures (Istio or Linkerd) and experience with database failover orchestration (PostgreSQL or Cassandra clusters).
You’ll Succeed Here If You:
- Think uptime is a team sport and value root-cause analysis over finger-pointing.
- Believe in automating yourself out of a job—then finding the next piece to automate.
- Have strong opinions about incident management runbooks and chaos engineering.
- Treat SLOs as contracts, not aspirations.
- Prefer structured logging, detailed dashboards, and actionable alerts.
Are you interested in this position?
Apply by clicking on the “Apply Now” button below!