Site Reliability Engineer

Application ends: August 8, 2025

Job Description

Role Overview:

We are seeking a Site Reliability Engineer to join our core infrastructure team. This role is critical to designing, scaling, and hardening the production systems that process requests daily. You’ll work at the intersection of development and operations, focusing on infrastructure reliability, observability, and incident management in an environment where uptime is non-negotiable.


Key Responsibilities:

  • Incident Response and Prevention:
    Act as a first responder for major outages, leading root cause analysis (RCA) and implementing durable post-incident action items. Must have demonstrable experience reducing MTTD/MTTR in prior roles.
  • Infrastructure as Code (IaC):
    Develop, audit, and maintain Terraform-based infrastructure. GitOps workflows are required; experience with Atlantis or ArgoCD preferred.
  • Monitoring and Observability:
    Design and deploy distributed tracing and metrics systems using OpenTelemetry, Prometheus, and Grafana. Must have experience building custom dashboards and alerting strategies aligned to SLAs/SLOs.
  • Scaling Systems:
    Optimize Kubernetes-based infrastructure across multi-region deployments. Deep understanding of container orchestration, pod autoscaling, and resource quota management is required.
  • CI/CD Pipeline Reliability:
    Harden and maintain Jenkins, GitHub Actions, or similar systems with parallel test execution and rollback capabilities.
  • Chaos Engineering & Resilience Testing:
    Lead initiatives to validate system behavior under duress (e.g., failover simulations, latency injection, service degradation drills).

Required Qualifications:

  • 5+ years of hands-on experience in site reliability, DevOps, or infrastructure engineering roles in high-traffic environments.
  • Advanced proficiency with Kubernetes, Helm, and service meshes (e.g., Istio, Linkerd).
  • Strong programming/scripting ability in Python or Go; Bash alone is not sufficient.
  • Deep understanding of distributed systems and patterns such as leader election, consensus, rate limiting, and circuit breaking.
  • Experience with secrets management tools (e.g., Vault) and secure service-to-service communication using mTLS.
  • Hands-on knowledge of cloud platforms, preferably GCP or AWS, with a clear grasp of IAM, VPC architecture, and cost optimization strategies.
  • Familiarity with incident response platforms such as PagerDuty, Blameless, or FireHydrant.
  • Strong documentation practices – must have contributed to or maintained internal runbooks and architectural decision records (ADRs).

Preferred but Not Required:

  • SRE certification (e.g., from Google Cloud or equivalent)
  • Experience participating in or leading production readiness reviews
  • Prior experience in regulated environments (SOC 2, HIPAA, FedRAMP)

Are you interested in this position?

Apply by clicking on the “Apply Now” button below!
