AI Cloud Platform Site Reliability Engineer
Booz Allen Hamilton · San Antonio, Texas, US
Job description
The Opportunity:

Mission users are increasingly relying on agentic AI systems to support complex workflows, accelerate analysis, and improve decision advantage. Unlike traditional software systems, agentic AI platforms introduce operational complexity across model invocations, workflow orchestration, tool integrations, retrieval and knowledge layers, safety controls, and probabilistic outputs. As an AI Platform Site Reliability Engineer (SRE), you’ll help ensure the availability, resiliency, observability, and operational integrity of an AWS GovCloud-based agentic AI platform supporting national defense missions.

In this role, you’ll serve as the reliability owner for production AI operations. You’ll work cross-functionally with multiple stakeholders, including cloud engineering, platform engineering, AI agent development, MLOps, data science, and customer knowledge teams, to operationalize their work in production through monitoring, alerting, Service Level Indicator (SLI) and Service Level Objective (SLO) management, incident response, ticket triage, change control, and automation. You won’t be duplicating model development, data...