Site Reliability Engineer

Application ends: August 6, 2025

Job Description

Job Summary

We are seeking a Site Reliability Engineer with a strong background in distributed systems, low-latency service monitoring, and incident management. The ideal candidate has scaled Kubernetes clusters in production, implemented service-level objectives (SLOs), and written automation that reduces toil in production environments.


Key Responsibilities

  • Design, implement, and manage scalable infrastructure across AWS using Terraform and Helm.
  • Define, monitor, and improve SLOs/SLIs for core services across engineering teams.
  • Lead root cause analyses of service disruptions and implement postmortem processes.
  • Automate recurring tasks using Python, Bash, or Go to eliminate manual work and improve reliability.
  • Collaborate with development teams to improve CI/CD workflows using GitHub Actions and ArgoCD.
  • Develop custom Prometheus exporters and tune alerting strategies using Alertmanager and Grafana.
  • Harden service mesh policies (e.g., in Istio or Linkerd) to enforce zero-trust security and maintain traffic observability.
  • Optimize resource utilization across multi-tenant Kubernetes clusters while enforcing quota boundaries.
  • Own the incident response process, including on-call rotations and escalation playbooks.
  • Conduct resilience testing using tools like Chaos Mesh or Gremlin.

Required Qualifications

  • 4+ years of experience as an SRE, DevOps Engineer, or Infrastructure Engineer in production environments with 99.9%+ availability targets.
  • Demonstrated experience scaling and tuning Kubernetes clusters (not just using managed services).
  • Proficiency in writing infrastructure-as-code with Terraform and in managing secrets with tools like Vault or Sealed Secrets.
  • Hands-on experience with observability stacks including Prometheus, Grafana, Loki, and Jaeger.
  • In-depth knowledge of Linux internals (systemd, cgroups, networking) and performance tuning.
  • Strong scripting/programming skills in Python, Go, or Bash with emphasis on writing maintainable automation.
  • Familiarity with service-level metrics design and implementing distributed tracing for microservices.
  • Experience working in incident-driven environments and participating in high-severity incident responses.
  • Solid understanding of DNS, CDN caching strategies, TLS certificate lifecycle, and HTTP/2 performance behaviors.

Are you interested in this position?

Apply by clicking on the “Apply Now” button below!
