Job Description
About the Role
We’re looking for a DevOps Engineer who can debug Kubernetes networking issues blindfolded, isn’t afraid of 3 a.m. pager duty (because you’ve automated the handling of 95% of alerts), and understands the difference between “it works on my machine” and “it’s in production.” You’ll work closely with backend engineers, SREs, and SecOps to scale the infrastructure behind our AI-driven analytics platform, which serves 50M+ API requests daily.
Key Responsibilities
- Own our CI/CD pipelines (currently GitHub Actions + ArgoCD) and improve deployment frequency, rollback reliability, and release visibility.
- Maintain and extend Kubernetes (EKS) clusters, with Helm 3 for service deployments.
- Design and enforce secrets and config management strategies using HashiCorp Vault and Sealed Secrets.
- Monitor, alert on, and auto-remediate issues using Prometheus, Loki, and Grafana, with custom exporters where needed.
- Optimize AWS infrastructure cost using automation and proactive resource scaling.
- Collaborate with security to enforce least-privilege IAM policies across hundreds of microservices.
- Lead chaos testing drills and incident retrospectives.
- Maintain internal tooling for developer experience — from local Kubernetes clusters to sandbox environments.
Must-Have Skills
- Deep experience with Kubernetes internals (pod lifecycle, node affinity, network policies, etc.).
- Strong IaC skills (Terraform preferred; CloudFormation acceptable with a good reason).
- Hands-on experience with ArgoCD, Helm, and GitOps workflows.
- Expertise in AWS services: EC2, IAM, VPC, EKS, ALB, CloudWatch.
- Scripting proficiency (Python or Bash) to build and maintain automation and operational tools.
- Experience tuning Prometheus alerts to avoid alert fatigue.
- Ability to trace performance bottlenecks using distributed tracing tools (e.g., OpenTelemetry, Jaeger).
- Familiarity with container runtime security and image scanning practices (e.g., Trivy, AquaSec).
Bonus Points
- Experience building developer platform features like ephemeral environments or self-serve infra tooling.
- Past participation in incident response rotations — and an opinionated take on how they should be run.
- Contributions to open-source DevOps tooling or internal platform evangelism efforts.
- Familiarity with eBPF or service mesh tools like Istio/Linkerd for deep observability.
Are you interested in this position?
Apply by clicking on the “Apply Now” button below!