


{"id":23248,"date":"2026-05-06T16:55:43","date_gmt":"2026-05-06T12:55:43","guid":{"rendered":"https:\/\/krisp.ai\/blog\/?p=23248"},"modified":"2026-05-06T16:58:34","modified_gmt":"2026-05-06T12:58:34","slug":"viva-2-0-ai-infrastructure-for-voice-ai-agents","status":"publish","type":"post","link":"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/","title":{"rendered":"Introducing Krisp VIVA 2.0: Voice Infrastructure for Voice AI Agents"},"content":{"rendered":"<p>Every voice AI demo works. Production doesn&#8217;t.<\/p>\n<p>You&#8217;ve seen it happen. A voice agent sounds great in the lab. Crisp audio, perfect timing, natural flow. Then it ships. Someone calls from a busy airport. Their kid is screaming in the background. A bad cell connection mangles the audio. The agent talks over the caller, ignores a real interruption, or gets confused by a siren outside the window.<\/p>\n<p>&nbsp;<\/p>\n<p>This is happening everywhere. Voice agent usage grew 9x in 2025. Over 150 companies are building them. Twenty-two percent of Y Combinator&#8217;s latest cohort is voice-first. The market crossed $22 billion and is growing at 35% a year. Everyone is building, and everyone is hitting the same wall.<\/p>\n<p>Two problems keep voice agents from working in production. Neither is new. Both are unsolved, until now.<\/p>\n<hr \/>\n<h2>The audio problem<\/h2>\n<p>Real-world voice sounds nothing like a demo room. There&#8217;s background noise, other people talking, cheap mics, room echo, feedback loops, and codec compression that chews up the signal before it even reaches your model.<\/p>\n<p>&nbsp;<\/p>\n<p>This breaks things in predictable ways. Noise pushes word error rate from around 5% to 15\u201330% or worse. Background voices trick the bot into thinking someone is speaking when they&#8217;re not. On phone calls, the agent&#8217;s own voice bounces back into the mic and triggers self-interruption loops.<\/p>\n<p>It&#8217;s not an edge case. 
It&#8217;s every call.<\/p>\n<h2>The conversation problem<\/h2>\n<p>Even with perfectly clean audio, voice agents still feel off. Human conversation runs on a thousand tiny cues that we pick up without thinking.<\/p>\n<p>We know when someone is about to finish a thought. We know that &#8220;mhm&#8221; means &#8220;keep going&#8221; and not &#8220;stop, I have something to say.&#8221; We can tell the difference between a pause that means someone is thinking and a pause that means they&#8217;re done. Nobody teaches us this. We just feel it.<\/p>\n<p>&nbsp;<\/p>\n<p>Voice agents don&#8217;t feel any of it. Most run on one simple rule: when you go quiet, they start talking. Everyone on the other end, whether they&#8217;re booking a flight, checking a prescription, or disputing a charge, can tell something is off immediately.<\/p>\n<p>&nbsp;<\/p>\n<h2>From reactive to predictive<\/h2>\n<p>Every voice agent out there today is reactive. It waits for silence, then talks. It hears a sound, then stops. It takes whatever audio it gets, clean or not, and hopes for the best.<\/p>\n<p>&nbsp;<\/p>\n<p>Human conversation doesn&#8217;t work that way. It&#8217;s predictive. We don&#8217;t wait for total silence to know it&#8217;s our turn. We don&#8217;t stop talking every time someone makes a sound. We&#8217;re always reading the signal, anticipating what&#8217;s coming next.<\/p>\n<p>&nbsp;<\/p>\n<p>Krisp has spent eight years on this. Not in a lab, but in production. We&#8217;ve processed over a trillion minutes of voice traffic across real environments, real devices, real noise. Two-time Webby Award winner for technical achievement. We started with human-to-human communication, powering <a href=\"https:\/\/krisp.ai\/contact-center\/noise-cancellation\/\">noise cancellation<\/a> for millions of users. 
Then we <a href=\"https:\/\/krisp.ai\/blog\/krisp-launches-viva-sdk-and-surpasses-1b-minutes-of-voice-ai-processing-per-month-milestone\/\">launched VIVA<\/a> for human-to-AI, bringing <a href=\"https:\/\/krisp.ai\/blog\/small-voice-isolation-model\/\">voice isolation<\/a>, voice activity detection, and <a href=\"https:\/\/krisp.ai\/blog\/turn-taking-for-voice-ai\/\">turn taking<\/a> to production voice agents at scale.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>VIVA 2.0<\/strong> takes the next step. It doesn&#8217;t just clean the audio and hand it off. It understands the conversation. One SDK. Server-side. Sits in the audio pipeline before speech-to-text. Everything downstream gets better.<\/p>\n<p>&nbsp;<\/p>\n<p>This isn&#8217;t theory. VIVA is already running inside Daily, Vapi, LiveKit, Vodex, Ultravox, and the world&#8217;s largest AI labs. Teams using VIVA have seen 3.5x better turn-taking accuracy, 50% fewer dropped calls, and 30% higher customer satisfaction scores. We process over 10 billion minutes of voice AI traffic a year, and growing.<\/p>\n<p>&nbsp;<\/p>\n<div class=\"text_center\">\n<div class=\"btn btn--primary\">\n        <a style=\"color:#FFF !important;\" href=\"https:\/\/krisp.ai\/developers\/\">Get Access<\/a>\n    <\/div>\n<\/div>\n<h2>What&#8217;s in VIVA 2.0<\/h2>\n<h3>Voice Isolation v3: isolate the speaker, improve WER<\/h3>\n<p>That 15\u201330% word error rate isn&#8217;t a small problem. It means your agent hears &#8220;I need to cancel my Thursday flight&#8221; as &#8220;I need to cancel my first day flight&#8221; and acts on it. Every misheard word makes things worse downstream. The LLM reasons on bad input, the response goes sideways, the user has to repeat themselves, and trust in the agent drops fast.<\/p>\n<p>&nbsp;<\/p>\n<p>Voice Isolation v3 is a ground-up rebuild of our core engine. 
It isolates the primary speaker&#8217;s voice from everything else \u2014 <a href=\"https:\/\/krisp.ai\/blog\/contact-center-background-voice-cancellation\/\">background noise<\/a>, other voices, room echo, and codec artifacts \u2014 and delivers cleaner audio to your STT pipeline, directly improving word error rate. Works across languages and accents. This is the foundation everything else in VIVA builds on.<\/p>\n<h3>Turn Prediction v3: knowing when to speak<\/h3>\n<p>Without end-of-turn prediction, bots just wait for silence. The user stops talking, the bot counts a few seconds of quiet, then responds. This is why talking to most voice agents feels slow and robotic.<\/p>\n<p>&nbsp;<\/p>\n<p>Turn Prediction v3 works completely differently. Instead of counting silence, it listens to the music of the speech, the intonation, rhythm, how the sentence is shaped, and predicts the end of turn in a fraction of a second. V3 catches 47% more true turn-shifts within the first 200 milliseconds compared to v2, without more false positives. The bot just responds at the right moment, and the conversation feels natural.<\/p>\n<p>&nbsp;<\/p>\n<p>Now multilingual: English, German, French, Spanish, Hindi, Finnish, Italian, Portuguese, Chinese, Japanese, Korean, Russian, and more. 
Runs on CPU, ships at 30 MB, works purely on audio with no transcription needed.<\/p>\n<p>&nbsp;<\/p>\n<p>We tested Turn Prediction v3 against every major solution available today:<\/p>\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th>Balanced Accuracy<\/th>\n<th>AUC<\/th>\n<th>F1 Score<\/th>\n<th>F1 Score Hold<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Turn Prediction v3<\/strong><\/td>\n<td><strong>88.05<\/strong><\/td>\n<td><strong>94.58<\/strong><\/td>\n<td>84.44<\/td>\n<td>91.20<\/td>\n<\/tr>\n<tr>\n<td>SmartTurn v3.2<\/td>\n<td>77.41<\/td>\n<td>88.81<\/td>\n<td>70.88<\/td>\n<td>86.44<\/td>\n<\/tr>\n<tr>\n<td>Deepgram Flux<\/td>\n<td>87.10<\/td>\n<td>\u2014<\/td>\n<td><strong>84.60<\/strong><\/td>\n<td><strong>92.60<\/strong><\/td>\n<\/tr>\n<tr>\n<td>LiveKit<\/td>\n<td>82.70<\/td>\n<td>88.70<\/td>\n<td>76.70<\/td>\n<td>83.30<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Turn Prediction v3 leads on balanced accuracy and AUC across all conditions. Full benchmarks and our public test dataset are in the <a href=\"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/\">technical deep-dive<\/a>.<\/p>\n<h3>Interruption Prediction v1: knowing when to stop<\/h3>\n<p>When you&#8217;re listening to someone and you say &#8220;yeah&#8221; or &#8220;okay&#8221; or &#8220;got it,&#8221; you&#8217;re not interrupting. You&#8217;re saying &#8220;I&#8217;m with you, keep going.&#8221; But when you say &#8220;wait, stop, that&#8217;s not what I meant,&#8221; you need the other person to actually stop.<\/p>\n<p>&nbsp;<\/p>\n<p>Without interruption prediction, bots can&#8217;t tell the difference. Every sound the user makes while the bot is talking gets treated the same way. 
Either the bot stops on every &#8220;uh-huh,&#8221; which is annoying, or it plows through when someone actually needs to jump in, which is worse.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>Interruption Prediction v1 is the first audio-only model in the industry built to solve this.<\/strong> It figures out whether the user actually wants to interrupt or is just giving feedback. From the audio alone, no transcription needed, without waiting for a full sentence. It reacts in under a second with less than 6% false positives. It handles laughter, coughing, and sneezing correctly too, with under 5% false triggers on non-speech sounds. The bot stops when you need it to, and keeps going when you don&#8217;t.<\/p>\n<p>&nbsp;<\/p>\n<p>Turn Prediction and Interruption Prediction are two sides of the same coin. One reads the silence, the other reads the speech. Together, they give a voice agent something no reactive system has: the ability to read the room.<\/p>\n<h3>Signal Detectors: a thousand tiny cues<\/h3>\n<p>We don&#8217;t just read conversational flow from someone&#8217;s voice. We pick up on who they are. Whether they&#8217;re a real person or a recording. Their gender, age group, accent. We do this without thinking, in milliseconds. Signal Detectors brings this to voice AI with a new set of small, real-time models launching with three:<\/p>\n<ul>\n<li><strong>TTS Detector<\/strong> spots synthetic or generated speech in real time<\/li>\n<li><strong>Gender Detector<\/strong> identifies speaker gender from audio<\/li>\n<li><strong>Accent Detector<\/strong> <a href=\"https:\/\/krisp.ai\/blog\/accent-conversion-sdk\/\">identifies the speaker&#8217;s accent<\/a><\/li>\n<\/ul>\n<h3>Voice Activity Detection: the gatekeeper<\/h3>\n<p>Real-time detection of when someone is speaking and when they&#8217;re not. Fewer false triggers, better responsiveness. 
The first layer that everything else depends on.<\/p>\n<p>All VIVA 2.0 capabilities (Voice Isolation v3, Turn Prediction v3, Interruption Prediction v1, Signal Detectors, and Voice Activity Detection) come bundled into existing VIVA pricing at no extra charge.<\/p>\n<hr \/>\n<h2>How it fits in your pipeline<\/h2>\n<p>VIVA 2.0 is a server-side SDK that sits in the audio pipeline before speech-to-text. The integration path is straightforward:<\/p>\n<ol>\n<li><strong>Audio in<\/strong> \u2014 raw audio stream from the caller (WebRTC, SIP, PSTN, any codec)<\/li>\n<li><strong>VIVA processes<\/strong> \u2014 voice isolation cleans the audio, turn prediction and interruption prediction read the conversational signals, signal detectors extract metadata \u2014 all in real time on CPU<\/li>\n<li><strong>Clean audio + signals out<\/strong> \u2014 your STT, LLM, and TTS pipeline receives isolated speaker audio and conversational cues, so it can transcribe more accurately and respond at the right moment<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<p>No GPU required. 30 MB model footprint. 15 ms algorithmic latency for voice isolation. Drop-in for existing pipelines \u2014 if you&#8217;re already running STT, VIVA sits in front of it.<\/p>\n<h2>What builders are seeing<\/h2>\n<p>&#8220;When our development team demonstrated Krisp&#8217;s capabilities, we were blown away,&#8221; said Kumar Saurav, CTO of Vodex. &#8220;Seeing our bot continue uninterrupted, even amidst loud office noise, was a game-changer for us.&#8221;<\/p>\n<p>&nbsp;<\/p>\n<p>&#8220;At scale, the biggest challenge in voice AI isn&#8217;t the model. It&#8217;s the quality of the signal going into it,&#8221; said David Casem, CEO of Telnyx. 
&#8220;Krisp addresses that at the source, which improves everything downstream from transcription to response.&#8221;<\/p>\n<p>&nbsp;<\/p>\n<p>From agents that break in noise to agents that understand the conversation.<\/p>\n<h2>Why we&#8217;re launching at Twilio Signal<\/h2>\n<p>Twilio&#8217;s ecosystem sits at the center of where the demo-to-production gap is biggest. Contact centers, IVRs, voice agents handling millions of calls over PSTN and SIP \u2014 these are the environments where real-world audio destroys agent performance and silence-based turn-taking falls apart. The builders at Signal are the ones hitting this wall every day.<\/p>\n<p>We&#8217;re launching VIVA 2.0 here because these are the pipelines it was built for.<\/p>\n<p>If you&#8217;re at Signal, come find us. If you&#8217;re not, VIVA 2.0 is available now.<\/p>\n<h2>The thesis<\/h2>\n<p>Voice is becoming the main way humans interact with AI. Support, healthcare, finance, shopping, companionship. Every one of those conversations happens in the real world, with real-world noise and real-world conversational rules that nobody teaches but everyone knows.<\/p>\n<p>The industry has spent two years building voice agents that talk. The next generation will be voice agents that listen. That&#8217;s the shift from reactive to predictive. That&#8217;s what VIVA 2.0 makes possible.<\/p>\n<h2>FAQ<\/h2>\n<div class=\"faq_item\">\n<div class=\"faq_title text_body--md text--semi-bold\"><strong>What is VIVA 2.0 and how does it fit in my voice AI pipeline?<\/strong><\/div>\n<div class=\"faq_answer text_body--md\">VIVA 2.0 is Krisp&#8217;s server-side SDK for voice AI agents. It bundles voice isolation, turn prediction, interruption prediction, signal detectors, and voice activity detection into one package that sits before your STT. One SDK, runs on CPU, 15 ms latency. 
Everything downstream \u2014 transcription accuracy, response timing, conversation flow \u2014 gets better.<\/div>\n<\/div>\n<div class=\"faq_item\">\n<div class=\"faq_title text_body--md text--semi-bold\"><strong>How is VIVA different from noise cancellation?<\/strong><\/div>\n<div class=\"faq_answer text_body--md\">Noise cancellation removes unwanted sound. VIVA goes further \u2014 it isolates the primary speaker&#8217;s voice to improve STT word error rate, predicts when a speaker&#8217;s turn is ending, detects real interruptions vs. backchannel cues, and identifies signals like synthetic speech, gender, and accent. It&#8217;s conversational intelligence, not just audio cleanup.<\/div>\n<\/div>\n<div class=\"faq_item\">\n<div class=\"faq_title text_body--md text--semi-bold\"><strong>What's the difference between a backchannel and an interruption, and why can't VAD handle it?<\/strong><\/div>\n<div class=\"faq_answer text_body--md\">A backchannel (&#8220;yeah,&#8221; &#8220;uh-huh,&#8221; &#8220;right&#8221;) signals engagement without requesting the floor. An interruption means the user wants the agent to stop. VAD only detects that someone is speaking \u2014 it can&#8217;t distinguish intent, so it fires on nearly two-thirds of backchannels. Krisp Interruption Prediction v1 uses a learned model that separates the two with under 6% false positives at the recommended threshold.<\/div>\n<\/div>\n<div class=\"faq_item\">\n<div class=\"faq_title text_body--md text--semi-bold\"><strong>How fast can a voice AI agent respond using Krisp Turn Prediction v3?<\/strong><\/div>\n<div class=\"faq_answer text_body--md\">At the recommended threshold (0.5), 69% of true turn-shifts are detected within 200 ms of silence \u2014 a 47% improvement over v2. 
The model runs on CPU with ~9M parameters and 30 MB footprint, so it adds negligible overhead to your voice agent pipeline.<\/div>\n<\/div>\n<div class=\"faq_item\">\n<div class=\"faq_title text_body--md text--semi-bold\"><strong>What languages does VIVA 2.0 support?<\/strong><\/div>\n<div class=\"faq_answer text_body--md\">Turn Prediction v3 supports 12+ languages: English, German, French, Spanish, Hindi, Finnish, Italian, Portuguese, Chinese, Japanese, Korean, and Russian. Interruption Prediction v1 is English-only at launch, with additional language support planned. Voice Isolation v3 works across all languages and accents.<\/div>\n<\/div>\n<div class=\"faq_item\">\n<div class=\"faq_title text_body--md text--semi-bold\"><strong>Does VIVA 2.0 require a GPU?<\/strong><\/div>\n<div class=\"faq_answer text_body--md\">No. All models run on CPU. Turn Prediction v3 ships at 30 MB. This matters for server-side deployments at scale where GPU costs add up fast.<\/div>\n<\/div>\n<div class=\"faq_item\">\n<div class=\"faq_title text_body--md text--semi-bold\"><strong>Is VIVA 2.0 a separate product or an upgrade?<\/strong><\/div>\n<div class=\"faq_answer text_body--md\">An upgrade. All new capabilities \u2014 Voice Isolation v3, Turn Prediction v3, Interruption Prediction v1, and Signal Detectors \u2014 are bundled into existing VIVA pricing at no extra charge.<\/div>\n<\/div>\n<hr \/>\n<div class=\"text_center\">\n<div class=\"btn btn--primary\">\n        <a style=\"color:#FFF !important;\" href=\"https:\/\/krisp.ai\/developers\/\">Get Access<\/a>\n    <\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Every voice AI demo works. Production doesn&#8217;t. You&#8217;ve seen it happen. A voice agent sounds great in the lab. Crisp audio, perfect timing, natural flow. Then it ships. Someone calls from a busy airport. Their kid is screaming in the background. A bad cell connection mangles the audio. 
The agent talks over the caller, ignores [&hellip;]<\/p>\n","protected":false},"author":71,"featured_media":23263,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"two_page_speed":[]},"categories":[421,1,456],"tags":[],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v24.2 (Yoast SEO v23.6) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Krisp VIVA 2.0: Voice Infrastructure for Voice AI Agents<\/title>\n<meta name=\"description\" content=\"VIVA 2.0 ships voice isolation, turn prediction, and interruption prediction in one SDK. Clean audio and natural turn taking for voice AI agents.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Krisp VIVA 2.0: Voice Infrastructure for Voice AI Agents\" \/>\n<meta property=\"og:description\" content=\"VIVA 2.0 ships voice isolation, turn prediction, and interruption prediction in one SDK. 
Clean audio and natural turn taking for voice AI agents.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/\" \/>\n<meta property=\"og:site_name\" content=\"Krisp\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/krispHQ\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-06T12:55:43+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-05-06T12:58:34+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/VIVA-blog1-380x266.png\" \/>\n\t<meta property=\"og:image:width\" content=\"380\" \/>\n\t<meta property=\"og:image:height\" content=\"266\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Krisp Engineering Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@krispHQ\" \/>\n<meta name=\"twitter:site\" content=\"@krispHQ\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/\"},\"author\":{\"name\":\"Krisp Engineering Team\",\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/person\/e9f59158d89de3002958d323d2e788f5\"},\"headline\":\"Introducing Krisp VIVA 2.0: Voice Infrastructure for Voice AI 
Agents\",\"datePublished\":\"2026-05-06T12:55:43+00:00\",\"dateModified\":\"2026-05-06T12:58:34+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/\"},\"wordCount\":2034,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/krisp.ai\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/VIVA-blog1.png\",\"articleSection\":[\"Engineering Blog\",\"Krisp News\",\"SDK\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/\",\"url\":\"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/\",\"name\":\"Krisp VIVA 2.0: Voice Infrastructure for Voice AI Agents\",\"isPartOf\":{\"@id\":\"https:\/\/krisp.ai\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/VIVA-blog1.png\",\"datePublished\":\"2026-05-06T12:55:43+00:00\",\"dateModified\":\"2026-05-06T12:58:34+00:00\",\"description\":\"VIVA 2.0 ships voice isolation, turn prediction, and interruption prediction in one SDK. 
Clean audio and natural turn taking for voice AI agents.\",\"breadcrumb\":{\"@id\":\"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/#primaryimage\",\"url\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/VIVA-blog1.png\",\"contentUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/VIVA-blog1.png\",\"width\":2000,\"height\":1400,\"caption\":\"Krisp VIVA 2.0 \u2014 voice infrastructure for voice AI agents\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/krisp.ai\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Introducing Krisp VIVA 2.0: Voice Infrastructure for Voice AI 
Agents\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/krisp.ai\/blog\/#website\",\"url\":\"https:\/\/krisp.ai\/blog\/\",\"name\":\"Krisp\",\"description\":\"Blog\",\"publisher\":{\"@id\":\"https:\/\/krisp.ai\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/krisp.ai\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/krisp.ai\/blog\/#organization\",\"name\":\"Krisp\",\"url\":\"https:\/\/krisp.ai\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png\",\"contentUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png\",\"width\":696,\"height\":696,\"caption\":\"Krisp\"},\"image\":{\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/krispHQ\/\",\"https:\/\/x.com\/krispHQ\",\"https:\/\/www.linkedin.com\/company\/krisphq\/\",\"https:\/\/www.youtube.com\/channel\/UCAMZinJdR9P33fZUNpuxXtg\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/person\/e9f59158d89de3002958d323d2e788f5\",\"name\":\"Krisp Engineering Team\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/26475ad8219056696662f819691ee49d?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/26475ad8219056696662f819691ee49d?s=96&d=mm&r=g\",\"caption\":\"Krisp Engineering Team\"},\"url\":\"https:\/\/krisp.ai\/blog\/author\/eng-team\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Krisp VIVA 2.0: Voice Infrastructure for Voice AI Agents","description":"VIVA 2.0 ships voice isolation, turn prediction, and interruption prediction in one SDK. Clean audio and natural turn taking for voice AI agents.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/","og_locale":"en_US","og_type":"article","og_title":"Krisp VIVA 2.0: Voice Infrastructure for Voice AI Agents","og_description":"VIVA 2.0 ships voice isolation, turn prediction, and interruption prediction in one SDK. Clean audio and natural turn taking for voice AI agents.","og_url":"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/","og_site_name":"Krisp","article_publisher":"https:\/\/www.facebook.com\/krispHQ\/","article_published_time":"2026-05-06T12:55:43+00:00","article_modified_time":"2026-05-06T12:58:34+00:00","og_image":[{"width":380,"height":266,"url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/VIVA-blog1-380x266.png","type":"image\/png"}],"author":"Krisp Engineering Team","twitter_card":"summary_large_image","twitter_creator":"@krispHQ","twitter_site":"@krispHQ","schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/#article","isPartOf":{"@id":"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/"},"author":{"name":"Krisp Engineering Team","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/person\/e9f59158d89de3002958d323d2e788f5"},"headline":"Introducing Krisp VIVA 2.0: Voice Infrastructure for Voice AI 
Agents","datePublished":"2026-05-06T12:55:43+00:00","dateModified":"2026-05-06T12:58:34+00:00","mainEntityOfPage":{"@id":"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/"},"wordCount":2034,"commentCount":0,"publisher":{"@id":"https:\/\/krisp.ai\/blog\/#organization"},"image":{"@id":"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/#primaryimage"},"thumbnailUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/VIVA-blog1.png","articleSection":["Engineering Blog","Krisp News","SDK"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/","url":"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/","name":"Krisp VIVA 2.0: Voice Infrastructure for Voice AI Agents","isPartOf":{"@id":"https:\/\/krisp.ai\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/#primaryimage"},"image":{"@id":"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/#primaryimage"},"thumbnailUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/VIVA-blog1.png","datePublished":"2026-05-06T12:55:43+00:00","dateModified":"2026-05-06T12:58:34+00:00","description":"VIVA 2.0 ships voice isolation, turn prediction, and interruption prediction in one SDK. 
Clean audio and natural turn taking for voice AI agents.","breadcrumb":{"@id":"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/#primaryimage","url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/VIVA-blog1.png","contentUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/VIVA-blog1.png","width":2000,"height":1400,"caption":"Krisp VIVA 2.0 \u2014 voice infrastructure for voice AI agents"},{"@type":"BreadcrumbList","@id":"https:\/\/krisp.ai\/blog\/viva-2-0-ai-infrastructure-for-voice-ai-agents\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/krisp.ai\/blog\/"},{"@type":"ListItem","position":2,"name":"Introducing Krisp VIVA 2.0: Voice Infrastructure for Voice AI 
Agents"}]},{"@type":"WebSite","@id":"https:\/\/krisp.ai\/blog\/#website","url":"https:\/\/krisp.ai\/blog\/","name":"Krisp","description":"Blog","publisher":{"@id":"https:\/\/krisp.ai\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/krisp.ai\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/krisp.ai\/blog\/#organization","name":"Krisp","url":"https:\/\/krisp.ai\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png","contentUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png","width":696,"height":696,"caption":"Krisp"},"image":{"@id":"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/krispHQ\/","https:\/\/x.com\/krispHQ","https:\/\/www.linkedin.com\/company\/krisphq\/","https:\/\/www.youtube.com\/channel\/UCAMZinJdR9P33fZUNpuxXtg"]},{"@type":"Person","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/person\/e9f59158d89de3002958d323d2e788f5","name":"Krisp Engineering Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/26475ad8219056696662f819691ee49d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/26475ad8219056696662f819691ee49d?s=96&d=mm&r=g","caption":"Krisp Engineering 
Team"},"url":"https:\/\/krisp.ai\/blog\/author\/eng-team\/"}]}},"_links":{"self":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/23248"}],"collection":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/users\/71"}],"replies":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/comments?post=23248"}],"version-history":[{"count":13,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/23248\/revisions"}],"predecessor-version":[{"id":23269,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/23248\/revisions\/23269"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/media\/23263"}],"wp:attachment":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/media?parent=23248"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/categories?post=23248"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/tags?post=23248"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}