System Engineer

January 30, 2026
Application ends: April 29, 2026

Job Description


About the Role

  • As a System Software Engineer focused on data center infrastructure, you will be responsible for the reliability, scalability, and performance of extremely large compute environments supporting advanced AI and machine learning workloads.
  • You’ll work at the intersection of software reliability, distributed systems, and physical infrastructure – helping operate and evolve high-density compute clusters used for large-scale model training and experimentation. This is a deeply technical, hands-on role in a fast-moving environment, collaborating closely with teams across hardware, networking, and software.

Key Responsibilities

  • Own the reliability, availability, and performance of on-premises and cloud-based compute environments, including GPU-accelerated clusters
  • Design, build, and operate monitoring, logging, and alerting systems to ensure high observability and fast incident response
  • Develop and maintain infrastructure-as-code and automated deployment pipelines
  • Participate in on-call rotations, incident response, root cause analysis, and post-incident improvements
  • Analyse system performance, forecast capacity, and optimise resource utilisation for large-scale distributed workloads
  • Partner with hardware, networking, and platform teams to design resilient and scalable systems
  • Create and maintain technical documentation and operational runbooks
  • Identify and remove bottlenecks across compute, storage, and networking layers to improve overall system efficiency

Required Qualifications

  • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent professional experience)
  • 5+ years of experience in site reliability engineering, infrastructure engineering, or large-scale systems operations
  • Strong expertise in Kubernetes (on-prem and/or cloud), infrastructure-as-code tools, and CI/CD systems
  • Proficiency in at least one systems programming language (e.g. Go, Rust, C++) and strong automation/scripting skills
  • Deep understanding of monitoring, alerting, and observability practices
  • Proven ability to troubleshoot complex issues spanning hardware, networking, and distributed software systems
  • Hands-on experience with incident response, post-mortems, and preventative engineering
  • Clear written and verbal communication skills
