JobMesh

Research Engineer – AI Training Systems Reliability & Performance (Seed Infra)

ByteDance · Seattle, Washington, US

Job description

About the Team

The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.

Responsibilities:

- Ensure the training platform operates reliably and efficiently across pre-training, fine-tuning, evaluation, and inference workloads for large models
- Build and maintain system observability, fault detection, and troubleshooting tools, enabling AI Ops-driven proactive monitoring of distributed ML workloads
- Maintain the stability, elasticity, and performance of framework and infrastructure components across multi-tenant, multi-cloud, and heterogeneous GPU environments
- Manage cluster governance, optimize resource utilization, and improve operational efficiency and reliability of ML services
- Develop software tools, dashboards, and automation to monitor, manage, and diagnose ML training infrastructure effectively
- Participate in global team rotations for system monitoring, on-call support, and incident response

The base salary range for this position in the selected city is $232,560 - $427,500 annually.