JobMesh

Staff Software Engineer, Compute

fal · TR

Job description

You are an experienced software engineer who thrives on building large-scale computing platforms. You have deep expertise in large scale distributed systems that deal with high complexity, a lot of traffic and data. You know how to achieve reliability and scale with minimum operational load.

Key responsibilities:
- Build our core Python/Rust platform: request routing, AI workload orchestration, scheduling, GPU autoscaling, large scale file storage, queueing, etc
- Produce forward designs for platform evolution as we scale to 100x current traffic and need to provide low latency across the world
- Leverage AI to an extreme level to automate the mundane parts of building complex but reliable systems
- Profile and tune low level CPU and memory performance

Requirements:
- 5+ years experience building distributed compute and orchestration platforms in Python or Rust
- Strong understanding of distributed systems fundamentals: consensus, scheduling, fault tolerance, capacity planning
- Deep understanding of computational complexity and memory allocation
- Track record of designing systems that scale under real production load
- Experience building and using observability to drive performan...