Every voice AI demo works. Production doesn’t.
You’ve seen it happen. A voice agent sounds great in the lab. Crisp audio, perfect timing, natural flow. Then it ships. Someone calls from a busy airport. Their kid is screaming in the background. A bad cell connection mangles the audio. The agent talks over the caller, ignores a real interruption, or gets confused by a siren outside the window.
This is happening everywhere. Voice agent usage grew 9x in 2025. Over 150 companies are building them. Twenty-two percent of Y Combinator’s latest cohort is voice-first. The market crossed $22 billion and is growing at 35% a year. Everyone is building, and everyone is hitting the same wall.
Two problems keep voice agents from working in production. Neither is new. Neither has been solved, until now.
The audio problem
Real-world voice sounds nothing like a demo room. There’s background noise, other people talking, cheap mics, room echo, feedback loops, and codec compression that chews up the signal before it even reaches your model.
This breaks things in predictable ways. Noise pushes word error rate from around 5% to 15–30% or worse. Background voices trick the bot into thinking someone is speaking when they’re not. On phone calls, the agent’s own voice bounces back into the mic and triggers self-interruption loops.
It’s not an edge case. It’s every call.
The conversation problem
Even with perfectly clean audio, voice agents still feel off. Human conversation runs on a thousand tiny cues that we pick up without thinking.
We know when someone is about to finish a thought. We know that “mhm” means “keep going” and not “stop, I have something to say.” We can tell the difference between a pause that means someone is thinking and a pause that means they’re done. Nobody teaches us this. We just feel it.
Voice agents don’t feel any of it. Most run on one simple rule: when you go quiet, they start talking. Everyone on the other end, whether they’re booking a flight, checking a prescription, or disputing a charge, can tell something is off immediately.
From reactive to predictive
Every voice agent out there today is reactive. It waits for silence, then talks. It hears a sound, then stops. It takes whatever audio it gets, clean or not, and hopes for the best.
Human conversation doesn’t work that way. It’s predictive. We don’t wait for total silence to know it’s our turn. We don’t stop talking every time someone makes a sound. We’re always reading the signal, anticipating what’s coming next.
Krisp has spent eight years on this. Not in a lab, but in production. We’ve processed over a trillion minutes of voice traffic across real environments, real devices, real noise, and we’re a two-time Webby Award winner for technical achievement. We started with human-to-human communication, powering noise cancellation for millions of users. Then we launched VIVA for human-to-AI, bringing voice isolation, voice activity detection, and turn-taking to production voice agents at scale.
VIVA 2.0 takes the next step. It doesn’t just clean the audio and hand it off. It understands the conversation. One SDK. Server-side. Sits in the audio pipeline before speech-to-text. Everything downstream gets better.
This isn’t theory. VIVA is already running inside Daily, Vapi, LiveKit, Vodex, Ultravox, and the world’s largest AI labs. Teams using VIVA have seen 3.5x better turn-taking accuracy, 50% fewer dropped calls, and 30% higher customer satisfaction scores. We process over 10 billion minutes of voice AI traffic a year and growing.
What’s in VIVA 2.0
Voice Isolation v3: isolate the speaker, improve WER
That 15–30% word error rate isn’t a small problem. It means your agent hears “I need to cancel my Thursday flight” as “I need to cancel my first day flight” and acts on it. Every misheard word makes things worse downstream. The LLM reasons on bad input, the response goes sideways, the user has to repeat themselves, and trust in the agent drops fast.
Voice Isolation v3 is a ground-up rebuild of our core engine. It isolates the primary speaker’s voice from everything else — background noise, other voices, room echo, and codec artifacts — and delivers cleaner audio to your STT pipeline, directly improving word error rate. Works across languages and accents. This is the foundation everything else in VIVA builds on.
Turn Prediction v3: knowing when to speak
Without end-of-turn prediction, bots just wait for silence. The user stops talking, the bot counts a few seconds of quiet, then responds. This is why talking to most voice agents feels slow and robotic.
Turn Prediction v3 works completely differently. Instead of counting silence, it listens to the music of the speech (the intonation, the rhythm, how the sentence is shaped) and predicts the end of the turn in a fraction of a second. V3 catches 47% more true turn-shifts within the first 200 milliseconds compared to v2, without more false positives. The bot just responds at the right moment, and the conversation feels natural.
Now multilingual: English, German, French, Spanish, Hindi, Finnish, Italian, Portuguese, Chinese, Japanese, Korean, Russian, and more. Runs on CPU, ships at 30 MB, works purely on audio with no transcription needed.
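To make the reactive-versus-predictive contrast concrete, here is a minimal sketch of the two decision rules. Everything in it (the `TurnSignal` shape, the threshold values) is a hypothetical illustration, not the VIVA API:

```python
from dataclasses import dataclass

@dataclass
class TurnSignal:
    is_speech: bool         # per-frame voice activity decision
    eot_probability: float  # predicted end-of-turn probability, 0.0 to 1.0

SILENCE_TIMEOUT_MS = 700    # reactive baseline: fixed wait after silence
EOT_THRESHOLD = 0.85        # predictive: hypothetical confidence cutoff

def should_respond_reactive(silence_ms: int) -> bool:
    # The old rule: when the caller goes quiet long enough, start talking.
    return silence_ms >= SILENCE_TIMEOUT_MS

def should_respond_predictive(signal: TurnSignal) -> bool:
    # The predictive rule: respond as soon as prosody says the turn is over,
    # typically well inside the silence the reactive rule is still counting.
    return not signal.is_speech and signal.eot_probability >= EOT_THRESHOLD

# Trailing intonation that clearly ends a turn: respond immediately.
print(should_respond_predictive(TurnSignal(False, 0.93)))  # True
# A mid-thought pause: the predictive rule holds back...
print(should_respond_predictive(TurnSignal(False, 0.20)))  # False
# ...while the silence-counting baseline would barge in.
print(should_respond_reactive(800))                        # True
```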
We tested Turn Prediction v3 against every major solution available today:
| Model | Balanced Accuracy | AUC | F1 Score | F1 Score (Hold) |
|---|---|---|---|---|
| Turn Prediction v3 | 88.05 | 94.58 | 84.44 | 91.20 |
| SmartTurn v3.2 | 77.41 | 88.81 | 70.88 | 86.44 |
| Deepgram Flux | 87.10 | — | 84.60 | 92.60 |
| LiveKit | 82.70 | 88.70 | 76.70 | 83.30 |
Turn Prediction v3 leads on balanced accuracy and AUC across all conditions. Full benchmarks and our public test dataset are in the technical deep-dive.
Interruption Prediction v1: knowing when to stop
When you’re listening to someone and you say “yeah” or “okay” or “got it,” you’re not interrupting. You’re saying “I’m with you, keep going.” But when you say “wait, stop, that’s not what I meant,” you need the other person to actually stop.
Without interruption prediction, bots can’t tell the difference. Every sound the user makes while the bot is talking gets treated the same way. Either the bot stops on every “uh-huh,” which is annoying, or it plows through when someone actually needs to jump in, which is worse.
Interruption Prediction v1 is the first audio-only model in the industry built to solve this. It figures out, from the audio alone, whether the user actually wants to interrupt or is just giving feedback: no transcription needed, and no waiting for a full sentence. It reacts in under a second with less than 6% false positives, and it handles laughter, coughing, and sneezing correctly too, with under 5% false triggers on non-speech sounds. The bot stops when you need it to, and keeps going when you don’t.
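In pipeline terms, the model’s job is to turn “user sound while the bot is talking” into a decision. A minimal sketch of that decision, with hypothetical names standing in for the real SDK:

```python
from enum import Enum, auto

class OverlapEvent(Enum):
    BACKCHANNEL = auto()   # "yeah", "okay", "got it": keep going
    INTERRUPTION = auto()  # "wait, stop, that's not what I meant": stop
    NON_SPEECH = auto()    # laughter, cough, sneeze: ignore

def on_user_sound_while_bot_speaking(event: OverlapEvent, stop_tts) -> None:
    # Only a real interruption should cut the bot off; feedback and
    # non-speech sounds pass without breaking the bot's turn.
    if event is OverlapEvent.INTERRUPTION:
        stop_tts()

on_user_sound_while_bot_speaking(
    OverlapEvent.BACKCHANNEL, stop_tts=lambda: print("bot stops talking")
)  # prints nothing: the bot keeps its turn
on_user_sound_while_bot_speaking(
    OverlapEvent.INTERRUPTION, stop_tts=lambda: print("bot stops talking")
)  # bot stops talking
```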
Turn Prediction and Interruption Prediction are two sides of the same coin. One reads the silence, the other reads the speech. Together, they give a voice agent something no reactive system has: the ability to read the room.
Signal Detectors: a thousand tiny cues
We don’t just read conversational flow from someone’s voice. We pick up on who they are: whether they’re a real person or a recording, their gender, their age group, their accent. We do this without thinking, in milliseconds. Signal Detectors brings this to voice AI as a set of small, real-time models, launching with three (with a usage sketch after the list):
- TTS Detector spots synthetic or generated speech in real time
- Gender Detector identifies speaker gender from audio
- Accent Detector identifies the speaker’s accent
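Here is one way that metadata could be consumed; the field names and routing policy are a hypothetical illustration, not the actual SDK surface:

```python
from dataclasses import dataclass

@dataclass
class SignalMetadata:
    synthetic_speech: bool  # TTS Detector: recorded or generated voice
    gender: str             # Gender Detector
    accent: str             # Accent Detector

def route_call(meta: SignalMetadata) -> str:
    # Example policy: synthetic or recorded voices get escalated instead
    # of flowing through the normal agent path.
    if meta.synthetic_speech:
        return "fraud-review"
    return "standard-flow"

print(route_call(SignalMetadata(False, "female", "en-US")))    # standard-flow
print(route_call(SignalMetadata(True, "unknown", "unknown")))  # fraud-review
```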
Voice Activity Detection: the gatekeeper
Real-time detection of when someone is speaking and when they’re not. Fewer false triggers, better responsiveness. The first layer that everything else depends on.
All VIVA 2.0 capabilities (Voice Isolation v3, Turn Prediction v3, Interruption Prediction v1, Signal Detectors, and Voice Activity Detection) come bundled into existing VIVA pricing at no extra charge.
How it fits in your pipeline
VIVA 2.0 is a server-side SDK that sits in the audio pipeline before speech-to-text. The integration path is straightforward:
- Audio in — raw audio stream from the caller (WebRTC, SIP, PSTN, any codec)
- VIVA processes — voice isolation cleans the audio, turn prediction and interruption prediction read the conversational signals, signal detectors extract metadata — all in real time on CPU
- Clean audio + signals out — your STT, LLM, and TTS pipeline receives isolated speaker audio and conversational cues, so it can transcribe more accurately and respond at the right moment
No GPU required. 30 MB model footprint. 15 ms algorithmic latency for voice isolation. Drop-in for existing pipelines — if you’re already running STT, VIVA sits in front of it.
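Put together, the shape of the integration looks something like the sketch below. Every name in it (`process_frame`, `VivaOutput`, the threshold) is a hypothetical stand-in rather than the real VIVA API; the stubs mark where your existing STT, LLM, and TTS stack plugs in:

```python
from dataclasses import dataclass

@dataclass
class VivaOutput:
    clean_pcm: bytes        # isolated primary-speaker audio
    is_speech: bool         # voice activity decision (the gatekeeper)
    eot_probability: float  # end-of-turn signal for the turn manager

def process_frame(raw_pcm: bytes) -> VivaOutput:
    # Stand-in for the server-side SDK call: isolation plus conversational
    # signals, computed in real time on CPU.
    return VivaOutput(clean_pcm=raw_pcm, is_speech=True, eot_probability=0.9)

def stt_transcribe(pcm: bytes) -> str:
    return "<transcript>"            # your existing STT, unchanged

def respond(text: str) -> None:
    print("bot responds to:", text)  # your LLM + TTS path, unchanged

def on_audio(raw_pcm: bytes) -> None:
    out = process_frame(raw_pcm)          # 1. VIVA runs before STT
    if not out.is_speech:
        return                            # 2. VAD gates what reaches STT
    text = stt_transcribe(out.clean_pcm)  # 3. STT sees isolated audio
    if out.eot_probability >= 0.85:       # 4. respond at the predicted
        respond(text)                     #    end of turn, not on a timer

on_audio(b"\x00\x00")  # one demo frame
```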
What builders are seeing
“When our development team demonstrated Krisp’s capabilities, we were blown away,” said Kumar Saurav, CTO of Vodex. “Seeing our bot continue uninterrupted, even amidst loud office noise, was a game-changer for us.”
“At scale, the biggest challenge in voice AI isn’t the model. It’s the quality of the signal going into it,” said David Casem, CEO of Telnyx. “Krisp addresses that at the source, which improves everything downstream from transcription to response.”
From agents that break in noise to agents that understand the conversation.
Why we’re launching at Twilio Signal
Twilio’s ecosystem sits at the center of where the demo-to-production gap is biggest. Contact centers, IVRs, voice agents handling millions of calls over PSTN and SIP — these are the environments where real-world audio destroys agent performance and silence-based turn-taking falls apart. The builders at Signal are the ones hitting this wall every day.
We’re launching VIVA 2.0 here because these are the pipelines it was built for.
If you’re at Signal, come find us. If you’re not, VIVA 2.0 is available now.
The thesis
Voice is becoming the main way humans interact with AI. Support, healthcare, finance, shopping, companionship. Every one of those conversations happens in the real world, with real-world noise and real-world conversational rules that nobody teaches but everyone knows.
The industry has spent two years building voice agents that talk. The next generation will be voice agents that listen. That’s the shift from reactive to predictive. That’s what VIVA 2.0 makes possible.