Job Description
We are looking for a highly skilled Site Reliability Engineer who thrives on solving complex system challenges at scale. You will be the guardian of our platform’s uptime, performance, and scalability, working closely with engineering teams to build resilient infrastructure and automate operational tasks. This role demands a hands-on engineer with deep technical expertise, a proactive mindset, and a passion for evolving reliability practices that directly impact user experience.
Job Requirements
- Proven track record of designing, building, and operating large-scale, highly available distributed systems. You have worked through real-world outages and turned them into concrete reliability improvements, not just postmortem documents.
- Deep expertise in cloud environments (AWS, GCP, Azure), including managing infrastructure as code (Terraform, CloudFormation) and running container orchestration platforms such as Kubernetes at scale.
- Mastery of monitoring, alerting, and observability tooling that goes beyond installing Prometheus and Grafana. You understand how to instrument complex services to catch subtle, system-wide failure modes before they cascade.
- Strong programming skills in at least one language such as Go, Python, or Rust, with experience building automation that reduces toil and increases system reliability.
- Hands-on experience with chaos engineering practices to proactively uncover weaknesses in distributed systems before customers do.
- Ability to collaborate effectively with software engineers, product owners, and support teams to embed reliability and scalability early in the development lifecycle.
- In-depth knowledge of networking concepts (DNS, TCP/IP, HTTP/2) and experience diagnosing complex production issues involving network latency, load balancing, or service mesh configurations.
- Solid understanding of CI/CD pipelines and how they impact reliability, including automating canary deployments, blue-green releases, and rollback strategies.
- Experience in incident response and root cause analysis under pressure, with a bias for clear communication and actionable documentation.
- Passion for continuously evolving the SRE practice within the organization: mentoring others, shaping culture, and driving best practices beyond the technology itself.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
- 5+ years working in a Site Reliability, DevOps, or Systems Engineering role supporting production-grade systems.
- Demonstrated success managing cloud infrastructure at scale with infrastructure as code.
- Proficiency in at least one programming language such as Go, Python, or Rust.
- Experience with Kubernetes in production environments, including cluster management and troubleshooting.
- Solid understanding of networking protocols and troubleshooting tools.
- Familiarity with chaos engineering frameworks and practices (e.g., Chaos Monkey, Gremlin).
- Proven ability to manage on-call rotations and handle high-pressure incident response.
- Strong communication skills, capable of bridging technical and non-technical teams.
- Certifications such as AWS Certified DevOps Engineer, Certified Kubernetes Administrator (CKA), or similar are a plus.
Are you interested in this position?
Apply by clicking on the “Apply Now” button below!