Every voice AI demo works. Production doesn’t.
You’ve seen it happen. A voice agent sounds great in the lab. Crisp audio, perfect timing, natural flow. Then it ships. Someone calls from a busy airport. Their kid is screaming in the background. A bad cell connection mangles the audio. The agent talks over the caller, ignores a real interruption, or gets confused by a siren outside the window.
This is happening everywhere. Voice agent usage grew 9x in 2025. Over 150 companies are building them. Twenty-two percent of Y Combinator’s latest cohort is voice-first. The market crossed $22 billion and is growing at 35% a year. Everyone is building, and everyone is hitting the same wall.
Two problems keep voice agents from working in production. Neither is new. Neither has been solved, until now.
The audio problem
Real-world voice sounds nothing like a demo room. There’s background noise, other people talking, cheap mics, room echo, feedback loops, and codec compression that chews up the signal before it even reaches your model.
This breaks things in predictable ways. Noise pushes word error rate from around 5% to 15–30% or worse. Background voices trick the bot into thinking someone is speaking when they’re not. On phone calls, the agent’s own voice bounces back into the mic and triggers self-interruption loops.
It’s not an edge case. It’s every call.
The conversation problem
Even with perfectly clean audio, voice agents still feel off. Human conversation runs on a thousand tiny cues that we pick up without thinking.
We know when someone is about to finish a thought. We know that “mhm” means “keep going” and not “stop, I have something to say.” We can tell the difference between a pause that means someone is thinking and a pause that means they’re done. Nobody teaches us this. We just feel it.
Voice agents don’t feel any of it. Most run on one simple rule: when you go quiet, they start talking. Everyone on the other end, whether they’re booking a flight, checking a prescription, or disputing a charge, can tell something is off immediately.
From reactive to predictive
Every voice agent out there today is reactive. It waits for silence, then talks. It hears a sound, then stops. It takes whatever audio it gets, clean or not, and hopes for the best.
Human conversation doesn’t work that way. It’s predictive. We don’t wait for total silence to know it’s our turn. We don’t stop talking every time someone makes a sound. We’re always reading the signal, anticipating what’s coming next.
Krisp has spent eight years on this. Not in a lab, but in production. We’ve processed over a trillion minutes of voice traffic across real environments, real devices, real noise, and we’re a two-time Webby Award winner for technical achievement. We started with human-to-human communication, powering noise cancellation for millions of users. Then we launched VIVA for human-to-AI, bringing voice isolation, voice activity detection, and turn-taking to production voice agents at scale.
VIVA 2.0 takes the next step. It doesn’t just clean the audio and hand it off. It understands the conversation. One SDK. Server-side. Sits in the audio pipeline before speech-to-text. Everything downstream gets better.
This isn’t theory. VIVA is already running inside Daily, Vapi, LiveKit, Vodex, Ultravox, and the world’s largest AI labs. Teams using VIVA have seen 3.5x better turn-taking accuracy, 50% fewer dropped calls, and 30% higher customer satisfaction scores. We process over 10 billion minutes of voice AI traffic a year and growing.
What’s in VIVA 2.0
Voice Isolation v3: isolate the speaker, improve WER
That 15–30% word error rate isn’t a small problem. It means your agent hears “I need to cancel my Thursday flight” as “I need to cancel my first day flight” and acts on it. Every misheard word makes things worse downstream. The LLM reasons on bad input, the response goes sideways, the user has to repeat themselves, and trust in the agent drops fast.
Voice Isolation v3 is a ground-up rebuild of our core engine. It isolates the primary speaker’s voice from everything else — background noise, other voices, room echo, and codec artifacts — and delivers cleaner audio to your STT pipeline, directly improving word error rate. Works across languages and accents. This is the foundation everything else in VIVA builds on.
Turn Prediction v3: knowing when to speak
Without end-of-turn prediction, bots just wait for silence. The user stops talking, the bot counts a few seconds of quiet, then responds. This is why talking to most voice agents feels slow and robotic.
Turn Prediction v3 works completely differently. Instead of counting silence, it listens to the music of the speech (the intonation, the rhythm, how the sentence is shaped) and predicts the end of the turn in a fraction of a second. V3 catches 47% more true turn-shifts within the first 200 milliseconds compared to v2, without more false positives. The bot just responds at the right moment, and the conversation feels natural.
Now multilingual: English, German, French, Spanish, Hindi, Finnish, Italian, Portuguese, Chinese, Japanese, Korean, Russian, and more. Runs on CPU, ships at 30 MB, works purely on audio with no transcription needed.
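To make the reactive-versus-predictive contrast concrete, here is a minimal sketch of the two decision rules. Everything in it (the `TurnSignal` shape, the threshold values) is a hypothetical illustration, not the VIVA API:

```python
from dataclasses import dataclass

@dataclass
class TurnSignal:
    is_speech: bool         # per-frame voice activity decision
    eot_probability: float  # predicted end-of-turn probability, 0.0 to 1.0

SILENCE_TIMEOUT_MS = 700    # reactive baseline: fixed wait after silence
EOT_THRESHOLD = 0.85        # predictive: hypothetical confidence cutoff

def should_respond_reactive(silence_ms: int) -> bool:
    # The old rule: when the caller goes quiet long enough, start talking.
    return silence_ms >= SILENCE_TIMEOUT_MS

def should_respond_predictive(signal: TurnSignal) -> bool:
    # The predictive rule: respond as soon as prosody says the turn is over,
    # typically well inside the silence the reactive rule is still counting.
    return not signal.is_speech and signal.eot_probability >= EOT_THRESHOLD

# Trailing intonation that clearly ends a turn: respond immediately.
print(should_respond_predictive(TurnSignal(False, 0.93)))  # True
# A mid-thought pause: the predictive rule holds back...
print(should_respond_predictive(TurnSignal(False, 0.20)))  # False
# ...while the silence-counting baseline would barge in.
print(should_respond_reactive(800))                        # True
```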
We tested Turn Prediction v3 against every major solution available today:
| Model | Balanced Accuracy | AUC | F1 Score | F1 Score (Hold) |
|---|---|---|---|---|
| Turn Prediction v3 | 88.05 | 94.58 | 84.44 | 91.20 |
| SmartTurn v3.2 | 77.41 | 88.81 | 70.88 | 86.44 |
| Deepgram Flux | 87.10 | — | 84.60 | 92.60 |
| LiveKit | 82.70 | 88.70 | 76.70 | 83.30 |
Turn Prediction v3 leads on balanced accuracy and AUC across all conditions. Full benchmarks and our public test dataset are in the technical deep-dive.
Interruption Prediction v1: knowing when to stop
When you’re listening to someone and you say “yeah” or “okay” or “got it,” you’re not interrupting. You’re saying “I’m with you, keep going.” But when you say “wait, stop, that’s not what I meant,” you need the other person to actually stop.
Without interruption prediction, bots can’t tell the difference. Every sound the user makes while the bot is talking gets treated the same way. Either the bot stops on every “uh-huh,” which is annoying, or it plows through when someone actually needs to jump in, which is worse.
Interruption Prediction v1 is the first audio-only model in the industry built to solve this. It figures out, from the audio alone, whether the user actually wants to interrupt or is just giving feedback: no transcription needed, and no waiting for a full sentence. It reacts in under a second with less than 6% false positives, and it handles laughter, coughing, and sneezing correctly too, with under 5% false triggers on non-speech sounds. The bot stops when you need it to, and keeps going when you don’t.
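In pipeline terms, the model’s job is to turn “user sound while the bot is talking” into a decision. A minimal sketch of that decision, with hypothetical names standing in for the real SDK:

```python
from enum import Enum, auto

class OverlapEvent(Enum):
    BACKCHANNEL = auto()   # "yeah", "okay", "got it": keep going
    INTERRUPTION = auto()  # "wait, stop, that's not what I meant": stop
    NON_SPEECH = auto()    # laughter, cough, sneeze: ignore

def on_user_sound_while_bot_speaking(event: OverlapEvent, stop_tts) -> None:
    # Only a real interruption should cut the bot off; feedback and
    # non-speech sounds pass without breaking the bot's turn.
    if event is OverlapEvent.INTERRUPTION:
        stop_tts()

on_user_sound_while_bot_speaking(
    OverlapEvent.BACKCHANNEL, stop_tts=lambda: print("bot stops talking")
)  # prints nothing: the bot keeps its turn
on_user_sound_while_bot_speaking(
    OverlapEvent.INTERRUPTION, stop_tts=lambda: print("bot stops talking")
)  # bot stops talking
```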
Turn Prediction and Interruption Prediction are two sides of the same coin. One reads the silence, the other reads the speech. Together, they give a voice agent something no reactive system has: the ability to read the room.
Signal Detectors: a thousand tiny cues
We don’t just read conversational flow from someone’s voice. We pick up on who they are: whether they’re a real person or a recording, their gender, their age group, their accent. We do this without thinking, in milliseconds. Signal Detectors brings this to voice AI as a set of small, real-time models, launching with three (with a usage sketch after the list):
- TTS Detector spots synthetic or generated speech in real time
- Gender Detector identifies speaker gender from audio
- Accent Detector identifies the speaker’s accent
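Here is one way that metadata could be consumed; the field names and routing policy are a hypothetical illustration, not the actual SDK surface:

```python
from dataclasses import dataclass

@dataclass
class SignalMetadata:
    synthetic_speech: bool  # TTS Detector: recorded or generated voice
    gender: str             # Gender Detector
    accent: str             # Accent Detector

def route_call(meta: SignalMetadata) -> str:
    # Example policy: synthetic or recorded voices get escalated instead
    # of flowing through the normal agent path.
    if meta.synthetic_speech:
        return "fraud-review"
    return "standard-flow"

print(route_call(SignalMetadata(False, "female", "en-US")))    # standard-flow
print(route_call(SignalMetadata(True, "unknown", "unknown")))  # fraud-review
```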
Voice Activity Detection: the gatekeeper
Real-time detection of when someone is speaking and when they’re not. Fewer false triggers, better responsiveness. The first layer that everything else depends on.
All VIVA 2.0 capabilities (Voice Isolation v3, Turn Prediction v3, Interruption Prediction v1, Signal Detectors, and Voice Activity Detection) come bundled into existing VIVA pricing at no extra charge.
How it fits in your pipeline
VIVA 2.0 is a server-side SDK that sits in the audio pipeline before speech-to-text. The integration path is straightforward:
- Audio in — raw audio stream from the caller (WebRTC, SIP, PSTN, any codec)
- VIVA processes — voice isolation cleans the audio, turn prediction and interruption prediction read the conversational signals, signal detectors extract metadata — all in real time on CPU
- Clean audio + signals out — your STT, LLM, and TTS pipeline receives isolated speaker audio and conversational cues, so it can transcribe more accurately and respond at the right moment
No GPU required. 30 MB model footprint. 15 ms algorithmic latency for voice isolation. Drop-in for existing pipelines — if you’re already running STT, VIVA sits in front of it.
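Put together, the shape of the integration looks something like the sketch below. Every name in it (`process_frame`, `VivaOutput`, the threshold) is a hypothetical stand-in rather than the real VIVA API; the stubs mark where your existing STT, LLM, and TTS stack plugs in:

```python
from dataclasses import dataclass

@dataclass
class VivaOutput:
    clean_pcm: bytes        # isolated primary-speaker audio
    is_speech: bool         # voice activity decision (the gatekeeper)
    eot_probability: float  # end-of-turn signal for the turn manager

def process_frame(raw_pcm: bytes) -> VivaOutput:
    # Stand-in for the server-side SDK call: isolation plus conversational
    # signals, computed in real time on CPU.
    return VivaOutput(clean_pcm=raw_pcm, is_speech=True, eot_probability=0.9)

def stt_transcribe(pcm: bytes) -> str:
    return "<transcript>"            # your existing STT, unchanged

def respond(text: str) -> None:
    print("bot responds to:", text)  # your LLM + TTS path, unchanged

def on_audio(raw_pcm: bytes) -> None:
    out = process_frame(raw_pcm)          # 1. VIVA runs before STT
    if not out.is_speech:
        return                            # 2. VAD gates what reaches STT
    text = stt_transcribe(out.clean_pcm)  # 3. STT sees isolated audio
    if out.eot_probability >= 0.85:       # 4. respond at the predicted
        respond(text)                     #    end of turn, not on a timer

on_audio(b"\x00\x00")  # one demo frame
```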
What builders are seeing
“When our development team demonstrated Krisp’s capabilities, we were blown away,” said Kumar Saurav, CTO of Vodex. “Seeing our bot continue uninterrupted, even amidst loud office noise, was a game-changer for us.”
“At scale, the biggest challenge in voice AI isn’t the model. It’s the quality of the signal going into it,” said David Casem, CEO of Telnyx. “Krisp addresses that at the source, which improves everything downstream from transcription to response.”
From agents that break in noise to agents that understand the conversation.
Why we’re launching at Twilio Signal
Twilio’s ecosystem sits at the center of where the demo-to-production gap is biggest. Contact centers, IVRs, voice agents handling millions of calls over PSTN and SIP — these are the environments where real-world audio destroys agent performance and silence-based turn-taking falls apart. The builders at Signal are the ones hitting this wall every day.
We’re launching VIVA 2.0 here because these are the pipelines it was built for.
If you’re at Signal, come find us. If you’re not, VIVA 2.0 is available now.
The thesis
Voice is becoming the main way humans interact with AI. Support, healthcare, finance, shopping, companionship. Every one of those conversations happens in the real world, with real-world noise and real-world conversational rules that nobody teaches but everyone knows.
The industry has spent two years building voice agents that talk. The next generation will be voice agents that listen. That’s the shift from reactive to predictive. That’s what VIVA 2.0 makes possible.