Senior Solutions Architect - AI Factory Deployment
NVIDIA · US
Job description
We are seeking an ambitious Senior Solutions Architect - AI Factory Deployment to join our NVIDIA Infrastructure Specialists team in Santa Clara! This role is uniquely positioned to develop, deploy, and validate AI factories end to end. You will focus on running and debugging AI/LLM workloads and benchmarks on Linux-based GPU clusters, using NCCL and collectives like AllReduce and AllToAll to improve performance and scalability. As part of our world-class team, you will bring to bear observability and automation to improve benchmarks and validation. You will serve as the expert when workloads or benchmarks do not perform flawlessly. You will collaborate across NVIDIA to ensure AI factories are prepared for customers, validating hardware and software for modern AI deployments.

What You Will Be Doing:
Set up, adjust, and verify AI factory environments across multi-GPU and multi-node Linux clusters.
Ensure configurations align with guidelines for NCCL, collectives, and distributed training frameworks.
Own the execution of key AI/LLM benchmarks, including setup, orchestration, result collection, and analysis.
Investigate and resolve issues when training jobs or benchmarks fail, hang, o...