Job Description
About the Role
We’re seeking a Site Reliability Engineer who thrives at the intersection of software development and infrastructure scalability. You’ll be a key contributor to designing and maintaining systems that support high availability, low latency, and rapid deployment—especially under real-world load and chaos.
This role is not about babysitting servers. We’re looking for someone who can identify single points of failure in a system diagram, introduce chaos engineering principles into CI/CD, and challenge dev teams on observability and failure scenarios.
What You’ll Be Responsible For
- Architecting and maintaining distributed systems with 99.99% uptime goals across services deployed in Kubernetes clusters (EKS preferred).
- Implementing SLOs/SLIs from scratch and getting buy-in from product and engineering teams to align them with business outcomes.
- Building internal tooling in Go or Python to automate reliability checks, simulate failure conditions, and enforce infra guardrails.
- Working directly with development teams to implement production-readiness checklists and manage incident retrospectives that lead to actual changes.
- Leading on-call rotations with a strong emphasis on reducing alert fatigue—designing actionable alerts, not noise.
- Designing progressive rollout strategies (canary, blue/green, feature flags) and integrating them into existing CI/CD pipelines (we use Argo CD + GitHub Actions).
- Performing load and chaos testing before peak traffic events (Black Friday, product launches) and documenting failover procedures.
Required Skills & Experience
- 4+ years in a production-facing SRE or infrastructure engineering role.
- Strong understanding of Linux internals, DNS, load balancing, and containerization (Docker + Kubernetes).
- Proven experience designing and running production systems in AWS (Lambda, SQS, RDS, IAM, VPCs).
- You’ve written at least one internal tool or automation in Python, Go, or Rust—not just shell scripts.
- Hands-on experience with monitoring stacks: Prometheus, Grafana, and at least one of Datadog, New Relic, or Honeycomb.
- Strong understanding of distributed systems failure modes, from split-brain to thundering herd.
- You’ve participated in (or better, led) post-mortems that resulted in meaningful system changes.
Are you interested in this position?
Apply by clicking on the “Apply Now” button below!