Opportunity Description
Key Responsibilities
Inference Platform & Optimization: Build and optimize enterprise LLM serving platforms (e.g., vLLM, TensorRT-LLM) using techniques like PagedAttention, continuous batching, and quantization (AWQ/FP8) for high throughput and low latency.
GPU Pooling & AI Infra: Design GPU pooling, virtualization, and scheduling solutions on Kubernetes to maximize hardware utilization. Manage distributed training clusters and high-performance networking (RDMA/NCCL).
Model Deployment & MLOps: Streamline the CI/CD pipeline for AI models. Implement automated benchmarking, zero-downtime deployment, and comprehensive observability (TTFT, TPS, GPU metrics).
Qualifications
1. Education & Experience:
Bachelor’s, Master’s, or Ph.D. in Computer Science, Computer Engineering, or a related field.
3+ years of experience in Backend Systems, Distributed Systems, or AI Infrastructure/MLOps, with at least 1-2 years specifically focused on LLM s...
Full time
Computer Occupations