MLOps Engineer – Generative AI

May 23, 2026
Application ends: August 22, 2026

Job Description

II. CORE RESPONSIBILITIES :

– Architect self-hosted inference clusters using vLLM, TGI (Text Generation Inference), and TensorRT-LLM on on-premise NVIDIA DGX systems and GPU racks, ensuring sub-100ms latency for 70B+ parameter models.

– Design parallel workflows on AWS SageMaker (Endpoints/Pipelines), Google Vertex AI (Prediction/Training), and Azure ML for elastic training workloads and managed foundation model APIs.

– Implement cloud-agnostic model deployment using Kubernetes (EKS/GKE/AKS) with portability across private data centers and cloud VPCs, ensuring zero vendor lock-in.

– Deploy multi-GPU inference parallelism (tensor + pipeline parallelism) for foundation models using Ray Serve, NVIDIA Triton, and custom FastAPI stacks.

– Optimize inference economics through quantization (AWQ/GPTQ/FP8), KV-cache optimization, and continuous batching – reducing per-token costs by 40%+.

– Build auto-scaling GPU node pools (Karpenter/Cluster Autoscaler) that respond to inference demand spikes within seconds.

– Implement RLHF (Reinforcement Learning from Human Feedback) infrastructure using DeepSpeed, LoRA/QLoRA fine-tuning pipelines, and distributed training orchestration.

– Design evaluation frameworks for LLMs : automated benchmarking (MMLU, HumanEval), A/B testing for model versions, and human-in-the-loop feedback systems.

– Manage vector database infrastructure (Pinecone, Weaviate, Milvus, pgvector) for RAG systems spanning private and cloud environments.

– Build CI/CD for ML using GitOps (ArgoCD/Flux) with model versioning (MLflow/DVC), automated testing for data drift, and canary deployments for model updates.

– Implement feature stores (Feast/Tecton) and experiment tracking (Weights & Biases/MLflow) supporting both cloud and on-premise data lakes.

– Create observability stacks for LLMs : token-level latency tracking, GPU memory saturation alerts, and cost-per-inference dashboards using Prometheus/Grafana/CloudWatch.

– Manage secrets, model encryption at rest (HashiCorp Vault), and network policies (Istio/Linkerd) for multi-tenant model serving.

III. ESSENTIAL QUALIFICATIONS & EXPERIENCE :

Educational Qualifications :

– Bachelor’s degree (B.E./B.Tech) in Computer Science, Engineering, Mathematics, or a related technical field from a recognized university. Graduates from IITs, NITs, BITS, IIITs, or other top-tier engineering institutions preferred.

– Master’s degree (M.Tech/MS) in Machine Learning, Computer Science, Artificial Intelligence, or a related field is desirable but not mandatory.

– Relevant professional certifications in cloud platforms (AWS/Azure/GCP) and Kubernetes (CKA/CKAD) highly desirable.

Experience Requirements :

– 5 to 9 years of hands-on experience in production ML infrastructure engineering, with at least 2 years dedicated to large-scale model deployment and MLOps.

– Demonstrable track record of deploying and maintaining 70B+ parameter models in production environments (preferred).

– Proven experience managing both on-premise GPU clusters (NVIDIA DGX, A100/H100) and cloud-based ML platforms (AWS SageMaker, Google Vertex AI, or Azure ML).

IV. TECHNICAL COMPETENCIES REQUIRED :

Infrastructure & Systems :

– Expert-level proficiency in Kubernetes (GPU operators, taints/tolerations, multi-tenancy) across both on-premise (Rancher/OpenShift) and cloud (EKS/GKE/AKS) environments.

– Deep expertise in LLM serving engines : Proven hands-on experience with vLLM, TGI (Text Generation Inference), or TensorRT-LLM in production settings.

– Professional-level certification or equivalent experience in AWS SageMaker, Google Vertex AI, or Azure ML – including model registry, endpoints, and pipeline orchestration.

– Strong understanding of NVIDIA Hopper/Ampere architectures, NVLink/InfiniBand networking, and CUDA optimization.

– Experience with CUDA kernel optimization, custom inference kernels, or NVIDIA Triton Inference Server extensions.

– Infrastructure as Code : Terraform, Helm, Kustomize for reproducible GPU cluster provisioning.

Machine Learning & Distributed Systems :

– Expert-level Python programming with PyTorch/TensorFlow.

– Distributed training frameworks : DeepSpeed, Horovod, PyTorch DDP/FSDP.

– LLM Stack : LangChain, LlamaIndex, Hugging Face Transformers, and agentic workflow orchestration.

– Data Engineering : Apache Spark, Airflow, and feature engineering at scale (terabyte+ datasets).

– Database Systems : Vector databases (Pinecone, Weaviate, Milvus, pgvector) and feature stores (Feast/Tecton).

DevOps & Observability :

– CI/CD for ML : GitOps (ArgoCD/Flux), model versioning (MLflow/DVC), and automated testing.

– Observability : Prometheus, Grafana, ELK stack, and cloud-native monitoring (CloudWatch/Stackdriver/Azure Monitor).

– Security : HashiCorp Vault, Istio/Linkerd service mesh, network policies, and secrets management.

