Job Description
Role overview
We’re looking for a Senior DevOps Engineer to take strong ownership of the infrastructure behind our global SaaS messaging platform. This role is for someone who wants to shape how infrastructure is built, operated, and improved. You will be responsible for reliability, scalability, automation, and production stability across all environments.
You will work closely with engineering leadership and developers to improve system architecture, deployment processes, security, and operational standards. This is a high-impact role with real influence on technical decisions and how our infrastructure evolves as we grow.
Key responsibilities
- CI/CD automation: Design and own CI/CD pipelines (GitLab), improving build speed, deployment safety, and rollback processes across environments
- Infrastructure ownership: Own and evolve our cloud and bare-metal infrastructure (OVH, Cloudflare, AWS, OpenStack), ensuring high availability, performance, and stability under load
- Infrastructure as code: Lead infrastructure as code practices using Terraform and Ansible, enforcing version control, peer review, and consistency standards
- Observability and monitoring: Improve system observability using monitoring, logging, tracing, and alerting tools (Grafana, Prometheus, Loki), and drive proactive reliability improvements
- Infrastructure security: Strengthen infrastructure security, including DDoS mitigation, traffic filtering, and access control management
- Incident management: Lead root cause analysis of production incidents and implement long-term reliability improvements
- Automation: Design automation to reduce manual operational work and improve deployment and recovery processes
- Database reliability: Ensure high availability and performance of production databases (PostgreSQL, MongoDB), including backup, recovery, and scaling strategies
- Environment management: Ensure consistency and reliability across development, staging, and production environments
Expected qualifications
- Linux expertise: Strong Linux system administration experience in high-availability production environments
- Kubernetes production experience: Hands-on experience running Kubernetes in production, including scaling, upgrades, and troubleshooting
- Systems architecture understanding: Solid understanding of containerization, virtualization, and infrastructure design trade-offs
- Networking knowledge: Strong understanding of networking concepts (OSI layers 2, 4, and 7), debugging tools (tcpdump, ngrep), and traffic analysis
- Production lifecycle experience: Experience operating and troubleshooting applications in high-availability production environments
- CI/CD systems design: Experience designing and maintaining CI/CD systems and deployment workflows
- Database operations: Strong experience managing PostgreSQL and MongoDB in production, including performance tuning and reliability
- Infrastructure as code: Practical experience with Terraform and configuration management tools (Ansible or similar), following best practices
- Monitoring and logging: Experience working with monitoring and log aggregation systems (Grafana, Prometheus, Loki, or similar)
- Security awareness: Practical understanding of infrastructure security principles and production hardening
- Communication skills: Fluent written English and fluent spoken Russian required
Nice to have
- Messaging/telecom background: Experience with telecom or messaging systems (SMPP, Asterisk, Kamailio)
- PostgreSQL high availability: Experience with PostgreSQL replication/clustering, backups, and failover (PITR, Patroni/repmgr or similar)
- Kubernetes operations: Experience operating Kubernetes clusters in production (upgrades, autoscaling, networking, troubleshooting)
- Scripting: Scripting skills in Bash, Python, or Go for automation and internal tooling
- Security and traffic protection: Experience mitigating malicious traffic and managing DDoS protection (Cloudflare WAF/rate limiting, fail2ban)
- Email deliverability basics: Familiarity with SPF, DKIM, DMARC, and how they affect sending reliability
- SRE practices: Experience with SLOs/SLIs, alert quality, and incident postmortems
Are you interested in this position?
Apply by clicking on the “Apply Now” button below!