Our Client - A US-based startup that is the neutral routing layer for voice AI. One API, every provider (STT/TTS/V2V), routed by language, latency, cost, and quality. Built for multilingual voice agents.
Responsibilities
- Own and ship Company's Voice Reliability Index (VRI), our public, weekly updated benchmark across voice providers (OpenAI, Google, ElevenLabs...) measuring p50/p95 latency, successful-turn rate, language-specific WER/MOS, and cost per minute.
- Build the recommendation engine that powers Company's core value prop: when a customer onboards (a YC voice-agent dev, a contact-center vendor, a multilingual app), we answer, "For your specific use case, the optimal stack is X with system prompt Y and tool-call config Z", with a benchmark-backed guarantee.
- Lead TTS voice-identity preservation across providers. Decide build-vs-partner, then staff and ship it.
- Define the canonical metric system (routed turns, success rate, latency budget) used across the product, deck, dashboard, and investor updates.
- Expand benchmarks to multilingual coverage (Indonesian, Vietnamese, Cantonese, Thai, Uzbek).
- Work directly with the CEO and the 4-person founding team. Daily async collaboration with engineering in Central Asia.
- Ship public artifacts weekly: leaderboard updates, methodology posts, regression cards, incident postmortems. The benchmark is also the marketing engine.
- Own the public face of Company's eval credibility on Twitter, GitHub, Hugging Face. Be quotable.
Qualifications
- 5-8+ years building production AI/ML systems, ideally as a founding/early engineer at an AI-native or voice/speech startup.
- Direct experience with real-time voice agent pipelines (STT -> LLM -> TTS over WebRTC / WebSocket).
- Has built or shipped voice agents in production, not just experimented. Pipecat / LiveKit familiarity is a strong plus.
- Strong evaluation / benchmarking instincts: has shipped public benchmarks, contributed to leaderboards (Hugging Face Open ASR, TTS Arena, LMSYS, SEA-HELM), or built internal eval pipelines that drove product decisions.
- Familiar with RAGAS / DeepEval / Promptfoo / TruLens.
- Multi-LLM workflow experience: has built pipelines that route subtasks to different models (e.g. GPT-4o for vision, Gemini for spatial, Claude for reasoning). Understands the right-model-per-use-case pattern.
- Statistical and methodological rigor: percentiles not averages, reproducibility, version control for datasets, environmental robustness.
- Open-source / public-shipping track record: GitHub repos with stars, technical blog posts, public benchmarks, conference talks, or active Twitter/X presence in the AI/voice community.
- AI-native operator: daily user of Claude Code, Cursor, GPT, agent workflows. Treats benchmarking + research as engineering problems automatable with agents.
What you bring
Required skills
- Tech stacks: Python, TypeScript, ML evaluation, statistical methodology, voice AI pipelines (STT/TTS/LLM), WebRTC/WebSocket, LiveKit/Pipecat, OpenAI/Anthropic/Google APIs, Hugging Face, GitHub Actions, Docker.
- Industry: voice AI infrastructure, AI-native B2B, dev tools, speech tech, conversational AI.
- Language: fluent English (written + spoken). Native Bahasa Indonesia is a plus for curating benchmark datasets in regional languages.
- Comfortable being a public technical voice for the company on Twitter, GitHub, Hugging Face, Discord.
Preferred skills
- Pipecat, LiveKit, Daily.co, RAGAS, DeepEval, Promptfoo, TruLens, Hugging Face
Work arrangement
- Indonesia-based, with 3 hours of daily overlap with US Pacific time (early morning local time)
- Fully remote