- Location: Bengaluru
- Industry: Internet
We are building the MLOps backbone for enterprise generative AI, supporting both air-gapped private GPU clusters (for example, for sensitive financial or healthcare data) and hyperscale cloud platforms (such as AWS SageMaker, Google Vertex AI, and Azure ML).
You will be expected to architect infrastructure that handles billions of inference tokens monthly across hybrid environments, such as optimizing Llama 3/Mistral deployments on private A100 clusters while orchestrating GPT-4/Claude fine-tuning pipelines on managed cloud services. This is a high-visibility role requiring deep expertise in LLM serving engines, distributed GPU systems, and cloud-agnostic MLOps.
CORE RESPONSIBILITIES:
- Architect self-hosted inference clusters using vLLM, TGI (Text Generation Inference), and TensorRT-LLM on on-premise NVIDIA DGX systems and GPU racks, ensuring sub-100ms latency for 70B+ parameter models.
- Design parallel workflows on AWS SageMaker (Endpoints/Pipelines), Google Vertex AI (Prediction/Training), and Azure ML for elastic training workloads and managed foundation model APIs.
- Implement cloud-agnostic model deployment using Kubernetes (EKS/GKE/AKS) with portability across private data centers and cloud VPCs, ensuring zero vendor lock-in.
- Deploy multi-GPU inference parallelism (tensor + pipeline parallelism) for foundation models using Ray Serve, NVIDIA Triton, and custom FastAPI stacks.
- Optimize inference economics through quantization (AWQ/GPTQ/FP8), KV-cache optimization, and continuous batching—reducing per-token costs by 40%+.
- Build auto-scaling GPU node pools (Karpenter/Cluster Autoscaler) that respond to inference demand spikes within seconds.
- Implement RLHF (Reinforcement Learning from Human Feedback) infrastructure using DeepSpeed, LoRA/QLoRA fine-tuning pipelines, and distributed training orchestration.
- Design evaluation frameworks for LLMs: automated benchmarking (MMLU, HumanEval), A/B testing for model versions, and human-in-the-loop feedback systems.
- Manage vector database infrastructure (Pinecone, Weaviate, Milvus, pgvector) for RAG systems spanning private and cloud environments.
- Build CI/CD for ML using GitOps (ArgoCD/Flux) with model versioning (MLflow/DVC), automated testing for data drift, and canary deployments for model updates.
- Implement feature stores (Feast/Tecton) and experiment tracking (Weights & Biases/MLflow) supporting both cloud and on-premise data lakes.
- Create observability stacks for LLMs: token-level latency tracking, GPU memory saturation alerts, and cost-per-inference dashboards using Prometheus/Grafana/CloudWatch.
- Manage secrets, model encryption at rest (HashiCorp Vault), and network policies (Istio/Linkerd) for multi-tenant model serving.
ESSENTIAL QUALIFICATIONS & EXPERIENCE:
Educational Qualifications:
- Bachelor's degree (B.E./B.Tech) in Computer Science, Engineering, Mathematics, or related technical field from a recognized university.
- Master's degree (M.Tech/MS) in Machine Learning, Computer Science, Artificial Intelligence, or related field desirable.
- Relevant professional certifications in cloud platforms (AWS/Azure/GCP) and Kubernetes (CKA/CKAD) highly desirable.
Experience Requirements:
- 5 to 9 years of hands-on experience in production ML infrastructure engineering, with at least 2 years dedicated to large-scale model deployment and MLOps.
- Demonstrable track record of deploying and maintaining 70B+ parameter models in production environments is preferred.
- Proven experience managing both on-premise GPU clusters (NVIDIA DGX, A100/H100) and cloud-based ML platforms (AWS SageMaker, Google Vertex AI, or Azure ML).
TECHNICAL COMPETENCIES REQUIRED:
Infrastructure & Systems:
- Expert-level proficiency in Kubernetes (GPU operators, taints/tolerations, multi-tenancy) across both on-premise (Rancher/OpenShift) and cloud (EKS/GKE/AKS) environments.
- Deep expertise in LLM serving engines: Proven hands-on experience with vLLM, TGI (Text Generation Inference), or TensorRT-LLM in production settings.
- Professional-level certification or equivalent experience in AWS SageMaker, Google Vertex AI, or Azure ML—including model registry, endpoints, and pipeline orchestration.
- Strong understanding of NVIDIA Hopper/Ampere architectures, NVLink/InfiniBand networking, and CUDA optimization.
- Experience with CUDA kernel optimization, custom inference kernels, or NVIDIA Triton Inference Server extensions.
- Infrastructure as Code: Terraform, Helm, Kustomize for reproducible GPU cluster provisioning.
Machine Learning & Distributed Systems:
- Expert-level Python programming with PyTorch/TensorFlow.
- Distributed training frameworks: DeepSpeed, Horovod, PyTorch DDP/FSDP.
- LLM Stack: LangChain, LlamaIndex, Hugging Face Transformers, and agentic workflow orchestration.
- Data Engineering: Apache Spark, Airflow, and feature engineering at scale (terabyte+ datasets).
- Database Systems: Vector databases (Pinecone, Weaviate, Milvus, pgvector) and feature stores (Feast/Tecton).
DevOps & Observability:
- CI/CD for ML: GitOps (ArgoCD/Flux), model versioning (MLflow/DVC), and automated testing.
- Observability: Prometheus, Grafana, ELK stack, and cloud-native monitoring (CloudWatch/Stackdriver/Azure Monitor).
- Security: HashiCorp Vault, Istio/Linkerd service mesh, network policies, and secrets management.