Senior/Staff Infrastructure Engineer
fal · San Francisco, California, US
Job description
You are a hands-on engineer who builds the software and processes that keep a large fleet of GPU servers healthy and productive. You write systems and tooling for managing thousands of servers, including provisioning, health monitoring, error detection, and recovery. When something breaks that automation can’t fix, you drive resolution with partners.

Key responsibilities:
- Build and maintain a Python fleet-tracking system that manages the full server lifecycle, including contracting and procurement, target use, pricing, availability, health, and RMAs
- Build server management tooling that automates provisioning, health checks, GPU diagnostics, recovery, and alerting
- Create and maintain metrics, dashboards, and alerting for hardware health across the fleet (GPU errors, disk failures, network issues, thermals)
- Use AI as extensively as possible to build tools and automate alerting and recovery
- Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation
- Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arra...