JobMesh

Senior Platform & Reliability Engineer (SRE)

Vizcom · San Francisco, California, US

Job description

About Vizcom

Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. Our stack includes a React/TypeScript frontend, Node/Koa + PostGraphile API services, PostgreSQL, Redis, BullMQ queues, and Kubernetes-based production infrastructure.

We’re hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale.

Role Mission:

Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades. This is a hands-on technical leadership role with authority to set reliability standards and enforce production guardrails.

What You’ll Own:

Reliability bar: Set and enforce SLIs/SLOs/error budgets for critical user flows.

Production architecture resilience: Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.

Kubernetes runtime reliability: Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.

Queue + job safety (BullMQ/Redis): Own poison pill containment and workload isolation....