Job Description
Job Summary
We are seeking a Site Reliability Engineer with a strong background in distributed systems, low-latency service monitoring, and incident management. The ideal candidate has scaled Kubernetes clusters in production, implemented service-level objectives (SLOs), and written automation to reduce toil in production environments.
Key Responsibilities
- Design, implement, and manage scalable infrastructure across AWS using Terraform and Helm.
- Define, monitor, and improve SLOs/SLIs for core services across engineering teams.
- Lead root cause analyses of service disruptions and implement postmortem processes.
- Automate recurring tasks using Python, Bash, or Go to eliminate manual work and improve reliability.
- Collaborate with development teams to improve CI/CD workflows using GitHub Actions and ArgoCD.
- Develop custom Prometheus exporters and tune alerting strategies using Alertmanager and Grafana.
- Harden service mesh policies (e.g., in Istio or Linkerd) to enforce zero-trust security and maintain traffic observability.
- Optimize resource utilization across multi-tenant Kubernetes clusters while enforcing quota boundaries.
- Own the incident response process, including on-call rotations and escalation playbooks.
- Conduct resilience testing using tools like Chaos Mesh or Gremlin.
Required Qualifications
- 4+ years of experience as an SRE, DevOps Engineer, or Infrastructure Engineer in production environments with 99.9%+ availability targets.
- Demonstrated experience scaling and tuning Kubernetes clusters (not just using managed services).
- Proficiency in writing infrastructure-as-code with Terraform and in managing secrets with tools like Vault or Sealed Secrets.
- Hands-on experience with observability stacks including Prometheus, Grafana, Loki, and Jaeger.
- In-depth knowledge of Linux internals (systemd, cgroups, networking) and performance tuning.
- Strong scripting/programming skills in Python, Go, or Bash with emphasis on writing maintainable automation.
- Familiarity with designing service-level metrics and implementing distributed tracing for microservices.
- Experience working in incident-driven environments and participating in high-severity incident responses.
- Solid understanding of DNS, CDN caching strategies, the TLS certificate lifecycle, and HTTP/2 performance characteristics.
Are you interested in this position?
Apply by clicking on the “Apply Now” button below!