Member of Technical Staff - Compute Platform

Reflection AI · New York City, New York, US

Job description

Our Mission

Reflection’s mission is to build open superintelligence and make it accessible to all. We’re developing open weight models for individuals, agents, enterprises, and even nation states. Our team of AI researchers and company builders comes from DeepMind, OpenAI, Google Brain, Meta, Character.AI, Anthropic and beyond.

About the Role:

Reflection’s Compute Platform team specializes in keeping our compute layer healthy and highly available. We run a K8s-based platform distributed across multiple neo-clouds. Managing multi-cloud scheduling, node health, and performance debugging at this scale presents genuinely hard systems problems. More broadly, you will work closely with Reflection’s training teams to co-design fault tolerance, node health checks, and remediation strategies.

What You’ll Do:

Cluster Management: Build and maintain tools for automatic remediation, topology-aware scheduling, capacity planning, and rapid hardware debugging.

Platform Engineering: Design and iterate on our cluster management stack for workloads across large, multi-GPU fleets.

Monitoring & Observability: Implement comprehensive cluster-wide monitoring, focusing on durability and active performa...