Senior Software Engineer — AI Evaluation & Benchmarks (Python)

G2i Inc. · Miami, Florida, US

Job description

Before Applying

This role is open to contractors in accepted locations only. Please confirm your country is on the list before applying; we're unable to process applications from unlisted locations. List of accepted countries and locations.

For US applicants: This is a 1099 independent contractor role. It is not compatible with F-1 OPT, STEM OPT, or any visa status that requires W-2 employment, guaranteed hours, or employer sponsorship. We are unable to provide offer letters or employment verification for this role.

What You'll Be Doing:

Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work:

- Design coding benchmarks that evaluate frontier models on real-world programming tasks: reasoning, debugging, and production-quality code
- Build and maintain scalable data pipelines for evaluation workflows
- Analyze model-generated code for correctness, reliability, and edge-case failures
- Construct structured evaluation scenarios across large repos and multi-language environments
- Provide detailed technical feedback on model performance and failure patterns
- Contribute to evaluation frameworks that set the bar for how cod...