Opportunity Description
Job Responsibilities:
1. Distributed Training Engineering
- Participate in the implementation of large-scale distributed training solutions.
- Lead the engineering deployment of data parallelism, model parallelism (tensor and pipeline parallelism, TP/PP), and ZeRO optimization.
- Continuously tune GPU compute utilization and ensure stability of ultra-large-scale training tasks.
2. Compute Scheduling Optimization
- Contribute substantially to the development and optimization of AI task scheduling logic.
- Implement fine-grained resource management, automatic fault recovery (self-healing), and efficient checkpoint mechanisms.
- Solve compute bottlenecks in complex gaming scenarios.
3. End-to-End Model Engineering
- Own the full pipeline from model training to inference deployment.
- Participate in operator performance profiling, model quantization, and high-performance inference pipe...