JobMesh

Senior/Staff Site Reliability Engineer

fal · TR

Job description

You are a seasoned SRE who keeps production infrastructure running at scale. You own the reliability and availability of customer-facing systems, from Kubernetes clusters to deployment pipelines to the networking layer that connects it all. You think in SLOs, automate ruthlessly, and treat every incident as a chance to make the system better.

Key Responsibilities:
- Own and operate our Kubernetes infrastructure: cluster lifecycle, upgrades, networking, and multi-tenant isolation for customer workloads
- Build and maintain CI/CD pipelines and deployment infrastructure
- Leverage AI to an extreme level to automate analysis and resolution of production issues, and to improve software development speed, reliability, and maintainability
- Build dashboards, alerting, and anomaly detection across our systems
- Define and enforce SLOs and build out incident response processes
- Manage and improve our networking, load balancing, and service mesh configurations
- Drive reliability improvements across the stack through automation, runbooks, and chaos engineering

Requirements:
- 5+ years of experience managing critical production systems and software development workflows
- Strong production expe...