Job Description
About the Role
- As a System Software Engineer focused on data center infrastructure, you will be responsible for the reliability, scalability, and performance of extremely large compute environments supporting advanced AI and machine learning workloads.
- You’ll work at the intersection of software reliability, distributed systems, and physical infrastructure – helping operate and evolve high-density compute clusters used for large-scale model training and experimentation. This is a deeply technical, hands-on role in a fast-moving environment, collaborating closely with teams across hardware, networking, and software.
Key Responsibilities
- Own the reliability, availability, and performance of on-premises and cloud-based compute environments, including GPU-accelerated clusters
- Design, build, and operate monitoring, logging, and alerting systems to ensure high observability and fast incident response
- Develop and maintain infrastructure-as-code and automated deployment pipelines
- Participate in on-call rotations, incident response, root cause analysis, and post-incident improvements
- Analyse system performance, forecast capacity, and optimise resource utilisation for large-scale distributed workloads
- Partner with hardware, networking, and platform teams to design resilient and scalable systems
- Create and maintain technical documentation and operational runbooks
- Identify and remove bottlenecks across compute, storage, and networking layers to improve overall system efficiency
Required Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent professional experience)
- 5+ years of experience in site reliability engineering, infrastructure engineering, or large-scale systems operations
- Strong expertise in Kubernetes (on-prem and/or cloud), infrastructure-as-code tools, and CI/CD systems
- Proficiency in at least one systems programming language (e.g. Go, Rust, C++) and strong automation/scripting skills
- Deep understanding of monitoring, alerting, and observability practices
- Proven ability to troubleshoot complex issues spanning hardware, networking, and distributed software systems
- Hands-on experience with incident response, post-mortems, and preventative engineering
- Clear written and verbal communication skills
Are you interested in this position?
Apply by clicking on the “Apply Now” button below!
Apply Now