AI Inference Engineer - Model Optimization & Deployment
Zoox · Foster City, California, US
Job description
The Perception team is pioneering the development of a multi-modality foundation model to drive the next generation of autonomous system intelligence. As a Model Optimization & Deployment Engineer, you will focus on bringing highly efficient, production-ready large-scale models to our on-vehicle stack. We are looking for experts with hands-on experience in compressing, accelerating, and deploying complex models (LLMs, VLMs, or FMs) for power- and thermal-constrained vehicle SoCs. You will optimize ML models, write custom CUDA kernels, and build highly concurrent inference code to ensure real-time, deterministic execution on edge devices.

In this role, you will:
- Architect and implement model conversion and compilation pipelines using TensorRT and TensorRT-LLM for edge deployment.
- Perform rigorous parity checking, accuracy recovery, and latency benchmarking between PyTorch models and compiled edge binaries.

Qualifications:
- Optimize large-scale models (LLMs, VLMs) using advanced quantization (PTQ, QAT), mixed-precision inference workflows, and parameter-efficient fine-tuning (LoRA, QLoRA).
- Write and optimize custom CUDA kernels and TensorRT Plugins to maximize memory bandw...