JobMesh

Sr Principal Software Developer

Oracle · Austin, Texas, US

Job description

OCI (Oracle Cloud Infrastructure) AI Infrastructure is at the forefront of building cutting-edge, ultra-high-performance GPU platforms designed to support AI/ML/HPC workloads. This is your chance to be part of the AI revolution, creating systems that allow customers to scale from tens to thousands of GPUs without compromising performance. Our team is responsible for designing and developing fundamental architectural changes for GPU delivery, health monitoring, testing, triage automation, and diagnostic services. These are essential for running distributed AI/ML/HPC workloads across thousands of GPUs, leveraging technologies like RoCE and InfiniBand. As a Consulting Member of Technical Staff, you will own the software design and development for major components of Oracle's Cloud Infrastructure. You should be a rock-solid lead developer, a curious problem solver, and a distributed systems generalist and/or skilled Linux engineer with systems triage experience, able to dive deep into any part of the stack and low-level systems to design broad distributed system interactions. You should value simplicity and scale, work comfortably in a collaborative, agile environment, and be excited t...