JobMesh

Site Reliability Engineer, Infrastructure - Analytics Platform

OpenAI · San Francisco, California, US

About the Team The Scaling team designs, builds, and operates critical infrastructure that enables research at OpenAI. Our mission is simple: accelerate the...

Job description

About the Team The Scaling team designs, builds, and operates critical infrastructure that enables research at OpenAI. Our mission is simple: accelerate the progress of research towards AGI. We do this by building core systems that researchers rely on - ranging from low-level infrastructure components to research-facing custom applications. These systems must scale with the increasing complexity and size of our workloads, while remaining reliable and easy to use. About the Role: We're looking for an experienced Site Reliability Engineer to own production-critical infrastructure end to end. This role is centered on data-heavy, low-latency workloads, with emphasis on operating large-scale ClickHouse clusters, high-throughput Kafka pipelines, and reliable integrations with Snowflake. You'll turn ambiguous operational problems into clear plans, ship pragmatic solutions quickly, and improve them through production feedback and iteration. We are specifically looking for someone who can independently define and raise operational standards across teams while remaining deeply hands-on in production systems. In this role, you will: Own infrastructure lifecycle management across provisioning,...