Vision Language Model Engineer

EchoTwin AI · San Francisco, California, US

Job description

Company Overview

EchoTwin AI is pioneering AI-driven infrastructure intelligence, redefining how cities are managed. Powered by a proprietary visual intelligence engine with full spatial reasoning, EchoTwin transforms municipal fleets into mobile urban sensors, creating living digital twins that provide real-time insights into infrastructure, compliance, and safety. By enabling municipalities to proactively monitor, predict, and resolve issues, EchoTwin helps build resilient, self-healing, and sustainable urban ecosystems. More than “smart cities,” EchoTwin is advancing the era of cognizant cities: urban environments with the awareness to see, think, and act on challenges in real time.

What You’ll Do:

As a Vision Language Model Engineer, you will design, develop, and optimize advanced vision-language models that integrate visual and textual data to enable intelligent systems. You will work closely with cross-functional teams to build models that power applications such as image captioning, visual question answering, and multimodal AI at the edge.

Key Responsibilities:

Design and implement state-of-the-art vision-language models using deep learning frameworks.
Develop and fine-tune mo...