Job Description
II. CORE RESPONSIBILITIES :
– Architect self-hosted inference clusters using vLLM, TGI (Text Generation Inference), and TensorRT-LLM on on-premise NVIDIA DGX systems and GPU racks, ensuring sub-100ms latency for 70B+ parameter models.
– Design parallel workflows on AWS SageMaker (Endpoints/Pipelines), Google Vertex AI (Prediction/Training), and Azure ML for elastic training workloads and managed foundation model APIs.
– Implement cloud-agnostic model deployment using Kubernetes (EKS/GKE/AKS) with portability across private data centers and cloud VPCs, ensuring zero vendor lock-in.
– Deploy multi-GPU inference parallelism (tensor + pipeline parallelism) for foundation models using Ray Serve, NVIDIA Triton, and custom FastAPI stacks.
– Optimize inference economics through quantization (AWQ/GPTQ/FP8), KV-cache optimization, and continuous batching – reducing per-token costs by 40%+.
– Build auto-scaling GPU node pools (Karpenter/Cluster Autoscaler) that respond to inference demand spikes within seconds.
– Implement RLHF (Reinforcement Learning from Human Feedback) infrastructure using DeepSpeed, LoRA/QLoRA fine-tuning pipelines, and distributed training orchestration.
– Design evaluation frameworks for LLMs : automated benchmarking (MMLU, HumanEval), A/B testing for model versions, and human-in-the-loop feedback systems.
– Manage vector database infrastructure (Pinecone, Weaviate, Milvus, pgvector) for RAG systems spanning private and cloud environments.
– Build CI/CD for ML using GitOps (ArgoCD/Flux) with model versioning (MLflow/DVC), automated testing for data drift, and canary deployments for model updates.
– Implement feature stores (Feast/Tecton) and experiment tracking (Weights & Biases/MLflow) supporting both cloud and on-premise data lakes.
– Create observability stacks for LLMs : token-level latency tracking, GPU memory saturation alerts, and cost-per-inference dashboards using Prometheus/Grafana/CloudWatch.
– Manage secrets, model encryption at rest (HashiCorp Vault), and network policies (Istio/Linkerd) for multi-tenant model serving.
III. ESSENTIAL QUALIFICATIONS & EXPERIENCE :
Educational Qualifications :
– Bachelor’s degree (B.E./B.Tech) in Computer Science, Engineering, Mathematics, or related technical field from a recognized university. Graduates from IITs, NITs, BITS, IIIT, or top-tier engineering institutions preferred.
– Master’s degree (M.Tech/MS) in Machine Learning, Computer Science, Artificial Intelligence, or related field desirable but not mandatory.
– Relevant professional certifications in cloud platforms (AWS/Azure/GCP) and Kubernetes (CKA/CKAD) highly desirable.
Experience Requirements :
– 5 to 9 years of hands-on experience in production ML infrastructure engineering, with at least 2 years dedicated to large-scale model deployment and MLOps.
– Demonstrable track record of deploying and maintaining 70B+ parameter models in production environments (preferred).
– Proven experience managing both on-premise GPU clusters (NVIDIA DGX, A100/H100) and cloud-based ML platforms (AWS SageMaker, Google Vertex AI, or Azure ML).
IV. TECHNICAL COMPETENCIES REQUIRED :
Infrastructure & Systems :
– Expert-level proficiency in Kubernetes (GPU operators, taints/tolerations, multi-tenancy) across both on-premise (Rancher/OpenShift) and cloud (EKS/GKE/AKS) environments.
– Deep expertise in LLM serving engines : Proven hands-on experience with vLLM, TGI (Text Generation Inference), or TensorRT-LLM in production settings.
– Professional-level certification or equivalent experience in AWS SageMaker, Google Vertex AI, or Azure ML – including model registry, endpoints, and pipeline orchestration.
– Strong understanding of NVIDIA Hopper/Ampere architectures, NVLink/InfiniBand networking, and CUDA optimization.
– Hands-on experience with CUDA kernel optimization, custom inference kernels, or NVIDIA Triton Inference Server extensions.
– Infrastructure as Code : Terraform, Helm, Kustomize for reproducible GPU cluster provisioning.
Machine Learning & Distributed Systems :
– Expert-level Python programming with PyTorch/TensorFlow.
– Distributed training frameworks : DeepSpeed, Horovod, PyTorch DDP/FSDP.
– LLM Stack : LangChain, LlamaIndex, Hugging Face Transformers, and agentic workflow orchestration.
– Data Engineering : Apache Spark, Airflow, and feature engineering at scale (terabyte+ datasets).
– Database Systems : Vector databases (Pinecone, Weaviate, Milvus, pgvector) and feature stores (Feast/Tecton).
DevOps & Observability :
– CI/CD for ML : GitOps (ArgoCD/Flux), model versioning (MLflow/DVC), and automated testing.
– Observability : Prometheus, Grafana, ELK stack, and cloud-native monitoring (CloudWatch/Stackdriver/Azure Monitor).
– Security : HashiCorp Vault, Istio/Linkerd service mesh, network policies, and secrets management.
Are you interested in this position?
Apply by clicking on the “Apply Now” button below!