Site Reliability Engineer

Application ends: September 4, 2025

Job Description

We’re looking for a Site Reliability Engineer who wants to do more than keep systems running: you want to make them run better. You’ll help us build a proactive SRE function rather than one that only firefights. This role focuses on designing fault-tolerant systems, improving deployment pipelines, measuring everything that matters, and embedding reliability into our engineering culture. You’ll work closely with backend and platform engineers to drive architectural improvements, implement operational best practices, and keep latency low even when traffic spikes without warning.


Key Responsibilities

  • Partner with engineering teams to design systems that are highly available, scalable, and observable
  • Own and improve SLIs/SLOs across services, and lead root cause analyses for production incidents
  • Build tooling to automate operational tasks like scaling, deployments, rollbacks, and failovers
  • Architect multi-region failover strategies, load balancing policies, and efficient database replication
  • Lead post-incident reviews and turn findings into systemic improvements, not just patches
  • Review Terraform and Helm PRs for best practices and operational impact
  • Proactively identify bottlenecks in CI/CD pipelines and recommend performance improvements
  • Define reliability engineering onboarding paths and promote SRE principles across the org

Required Qualifications

  • 5+ years of experience in site reliability, infrastructure, or backend engineering roles
  • Deep understanding of cloud-native architecture on AWS, including VPC, IAM, ALB/NLB, and autoscaling groups
  • Experience managing Kubernetes clusters in production (preferably with EKS or GKE)
  • Fluency with Terraform or similar infrastructure-as-code tools (Pulumi, CloudFormation)
  • Strong grasp of systems observability: metrics (Prometheus/Grafana), distributed tracing (OpenTelemetry), and logging (ELK, Loki, or similar)
  • Operational experience with PostgreSQL, Redis, and Kafka in high-throughput environments
  • Proficiency in at least one programming language such as Go, Python, or Rust
  • Track record of running production incident calls and leading technical postmortems

Nice to Have

  • Experience with chaos engineering or game day exercises
  • Exposure to service mesh technologies (e.g., Istio, Linkerd)
  • Familiarity with GitOps practices using tools like Argo CD or Flux
  • Knowledge of eBPF-based monitoring or performance tuning tools

Are you interested in this position?

Apply by clicking on the “Apply Now” button below!
