


{"id":23222,"date":"2026-05-06T16:53:02","date_gmt":"2026-05-06T12:53:02","guid":{"rendered":"https:\/\/krisp.ai\/blog\/?p=23222"},"modified":"2026-05-06T17:01:56","modified_gmt":"2026-05-06T13:01:56","slug":"voice-ai-turn-taking-interruption-prediction","status":"publish","type":"post","link":"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/","title":{"rendered":"A New Approach to Turn-Taking in Voice AI: Turn Prediction v3 and Interruption Prediction v1"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">The natural rhythm of conversation depends on knowing when to start speaking and when to stop. Humans handle this effortlessly: we sense the end of a turn, we recognize a quick &#8220;uh-huh&#8221; as encouragement rather than an interruption, and we can stop mid-sentence when someone clearly needs to break in. Voice AI agents have struggled with all three.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Last year we introduced <\/span><a href=\"https:\/\/krisp.ai\/blog\/turn-taking-for-voice-ai\/\"><span style=\"font-weight: 400;\">Krisp Turn-Taking<\/span><\/a><span style=\"font-weight: 400;\">, the first audio-based end-of-turn prediction model designed for Voice AI. Then we shipped <\/span><a href=\"https:\/\/krisp.ai\/blog\/krisp-turn-taking-v2-voice-ai-viva-sdk\/\"><span style=\"font-weight: 400;\">v2<\/span><\/a><span style=\"font-weight: 400;\">, with substantial accuracy gains and integration into the Krisp VIVA SDK. 
With this release, we&#8217;re expanding what Turn-Taking means at Krisp.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><span style=\"font-weight: 400;\">Krisp Turn-Taking: Turn Prediction v3 and Interruption Prediction v1<\/span><\/h2>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Krisp Turn-Taking now comprises <\/span><b>two complementary models<\/b><span style=\"font-weight: 400;\">:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Turn Prediction v3<\/b><span style=\"font-weight: 400;\"> \u2014 our end-of-turn prediction model, substantially faster and more accurate than v2, and now multilingual.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Interruption Prediction v1<\/b><span style=\"font-weight: 400;\"> \u2014 a brand-new model that distinguishes between backchannels (short acknowledgments like &#8220;yeah&#8221; or &#8220;uh-huh&#8221;) and genuine interruptions when the user wants to take the turn and interrupt the speaking AI agent.<\/span><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Together, they cover both halves of the turn-taking problem: knowing when to speak, and knowing when to stop. 
The naming reflects the maturity of each component: Turn Prediction has reached v3, while Interruption Prediction is introduced as v1.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Headline gains in this release:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Fast responses (under 200 ms) jumped from 47% to 69% compared to v2, without increasing the risk of interrupting the user mid-sentence.\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Better latency-accuracy curve for end-of-turn prediction<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Introduction of Interruption Prediction v1 \u2014 a game-changing approach to interruption handling that significantly outperforms VAD- and word-count-based methods<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The test dataset used in our end-of-turn prediction evaluation is now public on HuggingFace: <\/span><a href=\"https:\/\/huggingface.co\/datasets\/Krisp-AI\/turn-taking-test-v1\"><span style=\"font-weight: 400;\">Krisp-AI\/turn-taking-test-v1<\/span><\/a><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h2><b>Architecture overview<\/b><\/h2>\n<h3><b>Turn Prediction v3<\/b><\/h3>\n<p><img loading=\"lazy\" class=\"size-full wp-image-23223 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image1.png\" alt=\"\" width=\"1999\" height=\"486\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image1.png 1999w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image1-300x73.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image1-380x92.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image1-768x187.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image1-1536x373.png 1536w, 
https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image1-600x146.png 600w\" sizes=\"(max-width: 1999px) 100vw, 1999px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Turn Prediction v3 listens to conversational audio and outputs a probability between 0 and 1 that the speaker has finished their turn. The probability is <\/span><b>progressively refined during the silence period that follows speech<\/b><span style=\"font-weight: 400;\"> \u2014 so the model can react quickly when the end of the turn is clear, and hold longer when it isn&#8217;t yet.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The model:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Operates on audio frames of <\/span><b>configurable duration<\/b><span style=\"font-weight: 400;\"> (e.g., 40 ms segments)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Returns a probability per frame, with the binary end-of-turn decision controlled by a <\/span><b>configurable threshold<\/b><span style=\"font-weight: 400;\"> (default 0.5)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Is <\/span><b>multilingual<\/b><span style=\"font-weight: 400;\">: supports English, German, French, Spanish, Hindi, Finnish, Italian, Portuguese, Chinese, Japanese, Korean, Russian, and additional languages<\/span><\/li>\n<\/ul>\n<h3><b>Interruption Prediction v1 \u2014 a new capability<\/b><\/h3>\n<p><img loading=\"lazy\" class=\"size-full wp-image-23224 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image2.png\" alt=\"\" width=\"1999\" height=\"503\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image2.png 1999w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image2-300x75.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image2-380x96.png 380w, 
https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image2-768x193.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image2-1536x386.png 1536w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image2-600x151.png 600w\" sizes=\"(max-width: 1999px) 100vw, 1999px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Interruption Prediction v1 addresses a problem most voice agents handle poorly: the difference between a backchannel (short acknowledgments like &#8216;yeah&#8217; or &#8216;uh-huh&#8217;) and an interruption.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">When a human says &#8220;yeah&#8221; while the agent is speaking, they&#8217;re encouraging the agent to continue \u2014 not interrupting. When they say &#8220;wait, hold on&#8221;, they want the agent to stop. Most current systems use crude heuristics: VAD on any sound (which fires on every &#8220;uh-huh&#8221;), fixed timing thresholds, minimum word counts, or stop-word lists. The result is bots that interrupt themselves on backchannels or fail to stop when the user genuinely needs to speak.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Interruption Prediction v1 takes a different approach. It produces a probability between 0 and 1 that the user genuinely intends to interrupt. 
The probability is <\/span><b>progressively refined during the user&#8217;s speech segment<\/b><span style=\"font-weight: 400;\">, distinguishing intent from acknowledgment without waiting for the user to complete a full sentence.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The model:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Operates on audio frames of <\/span><b>40 ms duration<\/b><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Returns a probability per frame, with the binary interrupt decision controlled by a <\/span><b>configurable threshold<\/b><span style=\"font-weight: 400;\"> (recommended default 0.4)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Supports <\/span><b>English-only at v1<\/b><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h3><b>Symmetry by design<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The two core models share an evaluation philosophy and a mirrored mechanism for turn-taking decisions: Turn Prediction refines its probability during the silence after speech, while Interruption Prediction refines its probability during the user&#8217;s overlapping speech.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Specs at a glance<\/b><\/h3>\n<table>\n<thead>\n<tr>\n<th><b>Model<\/b><\/th>\n<th><b>Parameters<\/b><\/th>\n<th><b>SDK Size<\/b><\/th>\n<th><b>SDK Name<\/b><\/th>\n<th><b>Frame<\/b><\/th>\n<th><b>Languages<\/b><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><span style=\"font-weight: 400;\">Turn Prediction v3<\/span><\/td>\n<td><span style=\"font-weight: 400;\">~9M<\/span><\/td>\n<td><span style=\"font-weight: 400;\">30 MB<\/span><\/td>\n<td><span style=\"font-weight: 400;\">krisp-viva-tp-v3<\/span><\/td>\n<td><span style=\"font-weight: 400;\">configurable<\/span><\/td>\n<td><span style=\"font-weight: 400;\">12+<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 
400;\">Interruption Prediction v1<\/span><\/td>\n<td><span style=\"font-weight: 400;\">~6M<\/span><\/td>\n<td><span style=\"font-weight: 400;\">24 MB<\/span><\/td>\n<td><span style=\"font-weight: 400;\">krisp-viva-ip-v1<\/span><\/td>\n<td><span style=\"font-weight: 400;\">40 ms<\/span><\/td>\n<td><span style=\"font-weight: 400;\">English<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-weight: 400;\">Both models run efficiently on the CPU and are included in the Krisp VIVA SDK.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><b>Evaluation methodology<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Turn-taking models trade off two things: how accurately they predict turn boundaries, and how quickly they react. A model that gives accurate predictions but waits too long to respond will feel sluggish; a model that fires after 100 ms will feel snappy but will cut users off mid-thought. Single-number metrics like accuracy or F1 hide this trade-off.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">In our <\/span><a href=\"https:\/\/krisp.ai\/blog\/turn-taking-for-voice-ai\/\"><span style=\"font-weight: 400;\">v1 blog post<\/span><\/a><span style=\"font-weight: 400;\"> we introduced the latency\u2013accuracy curve as the right way to think about end-of-turn prediction. With this release, we extend the same philosophy to interruption prediction.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Mean Shift Time vs False Positive Rate<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">For Turn Prediction we use two coupled metrics.<\/span><\/p>\n<p><b>Mean Shift Time (MST)<\/b><span style=\"font-weight: 400;\"> measures latency. 
Given a set of true turn-shift cases <\/span><i><span style=\"font-weight: 400;\">S<\/span><\/i><span style=\"font-weight: 400;\"> and a model that outputs probability <\/span><i><span style=\"font-weight: 400;\">P<\/span><\/i><i><span style=\"font-weight: 400;\">i<\/span><\/i><i><span style=\"font-weight: 400;\">(t)<\/span><\/i><span style=\"font-weight: 400;\"> at time <\/span><i><span style=\"font-weight: 400;\">t<\/span><\/i><span style=\"font-weight: 400;\"> after the end of speech, MST at threshold <\/span><i><span style=\"font-weight: 400;\">\u03c4<\/span><\/i><span style=\"font-weight: 400;\"> is the average time the model takes to declare a shift across all true-shift cases:<\/span><\/p>\n<p><img loading=\"lazy\" class=\"size-full wp-image-23225 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image13.png\" alt=\"\" width=\"570\" height=\"97\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image13.png 570w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image13-300x51.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image13-380x65.png 380w\" sizes=\"(max-width: 570px) 100vw, 570px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">If a silence is shorter than the time needed to cross the threshold, the silence is conceptually extended \u2014 we measure when the model <\/span><i><span style=\"font-weight: 400;\">would<\/span><\/i><span style=\"font-weight: 400;\"> have fired.<\/span><\/p>\n<p><b>False Positive Rate (FPR)<\/b><span style=\"font-weight: 400;\"> measures accuracy on the negative class. 
Given a set of true-hold cases <\/span><i><span style=\"font-weight: 400;\">H<\/span><\/i><span style=\"font-weight: 400;\"> (where the speaker pauses but does not finish their turn), FPR is the fraction of holds where the model erroneously crosses the threshold at any point during the silence:<\/span><\/p>\n<p><img loading=\"lazy\" class=\"size-full wp-image-23226 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image6.png\" alt=\"\" width=\"665\" height=\"90\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image6.png 665w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image6-300x41.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image6-380x51.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image6-600x81.png 600w\" sizes=\"(max-width: 665px) 100vw, 665px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Sweeping the threshold from 0 to 1 traces out a curve in the MST\u2013FPR plane. Lower curves are better \u2014 they mean faster reaction at any given accuracy level.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Mean Interruption Time vs False Positive Rate (Interruption Prediction)<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The same logic extends to interruption prediction. The latency metric is <\/span><b>Mean Interruption Time (MIT)<\/b><span style=\"font-weight: 400;\"> \u2014 the average duration between the moment the user begins speaking over the bot and the moment the model classifies that speech as an interruption rather than a backchannel.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Higher thresholds mean fewer false interrupts on backchannels, but the bot is slower to stop when the user genuinely wants the floor. 
Lower thresholds mean faster cutoffs, but a greater chance of the bot stopping during a quick &#8220;uh-huh.&#8221; To visualize this trade-off, we plot a chart showing the relationship between mean interruption time (computed on true-interruption examples) and the false-positive rate (interruptions during backchannels) as the threshold is varied from 0 to 1.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">For baseline methods that don&#8217;t expose a confidence score (e.g., VAD-based, minimum word count), we plot a single point at their (MIT, FPR) coordinates rather than a curve.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Test datasets<\/b><\/h3>\n<p><b>For Turn Prediction<\/b><span style=\"font-weight: 400;\">, evaluation runs on the <\/span><a href=\"https:\/\/huggingface.co\/datasets\/Krisp-AI\/turn-taking-test-v1\"><span style=\"font-weight: 400;\">Krisp-AI\/turn-taking-test-v1<\/span><\/a><span style=\"font-weight: 400;\"> dataset, which we are publishing on HuggingFace alongside this release:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">4 hours of conversational audio<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">30 speakers<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">976 manually labeled shift cases<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">1,754 manually labeled hold cases<\/span><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The original recordings consisted of long-form conversations. These recordings were first reviewed and labeled manually by human annotators. After labeling, the conversations were segmented into shorter clips, ensuring that each segment preserved the context needed for accurate interpretation while maintaining alignment with the original annotations. 
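<\/span><\/p>
As a concrete illustration of the MST and FPR definitions above, here is a minimal sketch of how both metrics can be computed from per-frame probability traces. This is illustrative code, not the Krisp SDK API; the 40 ms frame duration and the function names are assumptions for the example.

```python
# Illustrative sketch (not the Krisp SDK API) of computing Mean Shift Time
# and False Positive Rate from per-case probability traces, where each
# trace holds the model's end-of-turn probability sampled once per frame
# during the silence after speech.

FRAME_MS = 40  # assumed frame duration for this sketch


def mean_shift_time(shift_traces, tau):
    """Average time to the first threshold crossing over true-shift cases.

    Traces are assumed long enough that every true-shift case eventually
    crosses `tau` (the "conceptually extended" silence described above).
    """
    times = []
    for trace in shift_traces:
        for i, p in enumerate(trace):
            if p >= tau:
                times.append((i + 1) * FRAME_MS)
                break
    return sum(times) / len(times)


def false_positive_rate(hold_traces, tau):
    """Fraction of hold cases where the model crosses `tau` at any frame."""
    fired = sum(1 for trace in hold_traces if any(p >= tau for p in trace))
    return fired / len(hold_traces)
```

Sweeping `tau` from 0 to 1 and evaluating both functions at each value traces out the MST\u2013FPR curve described above.
<p><span style=\"font-weight: 400;\">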
We also retained the last silence segment of each clip and recorded its duration in the metadata, enabling false-positive-rate calculation and analysis of the MST vs FPR trade-off.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">We also evaluate on two derived variants: <\/span><b>secondary mixes<\/b><span style=\"font-weight: 400;\"> (with additional secondary voices added to simulate cross-talk) and <\/span><b>noisy mixes<\/b><span style=\"font-weight: 400;\"> (with realistic background noise). Both stress-test how the model behaves in the challenging conditions of real-world deployments.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">To assess the contribution of Krisp&#8217;s <\/span><a href=\"https:\/\/krisp.ai\/blog\/improving-turn-taking-of-ai-voice-agents-with-background-voice-cancellation\/\"><span style=\"font-weight: 400;\">Background Voice and Noise Cancellation<\/span><\/a><span style=\"font-weight: 400;\"> (BVC), we report results both <\/span><b>before BVC<\/b><span style=\"font-weight: 400;\"> (raw audio) and <\/span><b>after BVC<\/b><span style=\"font-weight: 400;\"> (audio cleaned by Krisp&#8217;s BVC model). 
BVC is a standard component in Krisp deployments \u2014 TT models in production receive BVC-cleaned audio.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><b>For Interruption Prediction<\/b><span style=\"font-weight: 400;\">, the test set consists of 1,721 audio segments collected from natural interactions:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">1,182 backchannel cases<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">539 interruption cases<\/span><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">We additionally evaluate robustness on non-verbal human sounds: 200 laughter samples, 100 cough samples, and 100 sneeze samples.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><b>Results: Turn Prediction v3<\/b><\/h2>\n<h3><b>Threshold selection<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Turn Prediction v3&#8217;s behavior is controlled by a single threshold. We benchmarked across the full threshold range and identified three operating points corresponding to low, medium, and high accuracy levels. 
The binary prediction evaluations for each are shown below (for the definitions of these metrics, see <\/span><a href=\"https:\/\/krisp.ai\/blog\/turn-taking-for-voice-ai\/\"><span style=\"font-weight: 400;\">our first blog post<\/span><\/a><span style=\"font-weight: 400;\">).<\/span><\/p>\n<p>&nbsp;<\/p>\n<table>\n<thead>\n<tr>\n<th><b>Threshold<\/b><\/th>\n<th><b>Balanced Accuracy<\/b><\/th>\n<th><b>AUC<\/b><\/th>\n<th><b>F1 Score<\/b><\/th>\n<th><b>F1 Score Hold<\/b><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><span style=\"font-weight: 400;\">0.3 (low)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">88.00<\/span><\/td>\n<td><span style=\"font-weight: 400;\">94.09<\/span><\/td>\n<td><span style=\"font-weight: 400;\">83.37<\/span><\/td>\n<td><span style=\"font-weight: 400;\">89.29<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">0.5 (medium)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">87.85<\/span><\/td>\n<td><span style=\"font-weight: 400;\">94.09<\/span><\/td>\n<td><span style=\"font-weight: 400;\">84.15<\/span><\/td>\n<td><span style=\"font-weight: 400;\">90.98<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">0.7 (high)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">85.58<\/span><\/td>\n<td><span style=\"font-weight: 400;\">94.09<\/span><\/td>\n<td><span style=\"font-weight: 400;\">82.07<\/span><\/td>\n<td><span style=\"font-weight: 400;\">90.93<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"size-full wp-image-23227 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image3.png\" alt=\"\" width=\"1200\" height=\"600\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image3.png 1200w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image3-300x150.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image3-380x190.png 380w, 
https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image3-768x384.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image3-600x300.png 600w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" \/><br \/>\nKrisp TT v3 \u2014 False Positive Rate vs Mean Shift Time across the threshold range. The three red dots mark thresholds 0.3, 0.5, and 0.7.<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">We recommend <\/span><b>threshold 0.5<\/b><span style=\"font-weight: 400;\"> as the default operating point \u2014 it balances reaction speed and accuracy. All Turn Prediction v3 results below use 0.5; Turn Prediction v2 results use 0.4.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Turn Prediction v3 vs v2<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Comparing v3 directly against the previous generation on identical test conditions reveals the magnitude of the improvement.<\/span><\/p>\n<p><b>Original dataset:<\/b><\/p>\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th><b>True \u2264 200 ms<\/b><\/th>\n<th><b>True \u2264 400 ms<\/b><\/th>\n<th><b>True \u2264 600 ms<\/b><\/th>\n<th><b>False Positive Rate<\/b><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><span style=\"font-weight: 400;\">Turn Prediction v2<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.47<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.57<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.64<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.11<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Turn Prediction v3<\/b><\/td>\n<td><b>0.69<\/b><\/td>\n<td><b>0.73<\/b><\/td>\n<td><b>0.76<\/b><\/td>\n<td><b>0.10<\/b><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-weight: 400;\">The headline result: at the same FPR, v3 catches 47% more true turn-shifts within the first 200 ms of silence. 
This difference makes an agent feel substantially more responsive.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"size-full wp-image-23229 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image5.png\" alt=\"\" width=\"1600\" height=\"800\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image5.png 1600w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image5-300x150.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image5-380x190.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image5-768x384.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image5-1536x768.png 1536w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image5-600x300.png 600w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><br \/>\nShift Histogram: Krisp TT v2 vs v3 \u2014 true cases (% per bin), original dataset. v3&#8217;s mass concentrates sharply at the 200 ms bin.<\/span><\/i><\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"size-full wp-image-23228 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image4.png\" alt=\"\" width=\"1600\" height=\"800\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image4.png 1600w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image4-300x150.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image4-380x190.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image4-768x384.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image4-1536x768.png 1536w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image4-600x300.png 600w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/>Shift Histogram: Krisp TT v2 vs v3 \u2014 false cases (% per bin), original dataset. 
Both models keep the false-positive mass low and comparable.<\/span><\/i><\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"size-full wp-image-23230 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image8.png\" alt=\"\" width=\"1200\" height=\"600\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image8.png 1200w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image8-300x150.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image8-380x190.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image8-768x384.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image8-600x300.png 600w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" \/><br \/>\nFPR vs Mean Shift Time \u2014 Krisp TT v2 (AUC = 0.222) vs v3 (AUC = 0.285), original dataset. Lower curve = faster reaction at the same accuracy.<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">The histograms tell the story visually: v3&#8217;s true-shift mass concentrates sharply in the 200 ms bin, while v2&#8217;s is spread across hundreds of milliseconds. At the same time, both models have a comparable false-positive rate, which means v3 is significantly faster without trading away accuracy, making conversations with voice AI agents feel more natural. 
Note that in this chart, AUC is the area under MST vs FPR curve.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><b>Secondary mixes:<\/b><\/p>\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th><b>True \u2264 200 ms<\/b><\/th>\n<th><b>True \u2264 400 ms<\/b><\/th>\n<th><b>True \u2264 600 ms<\/b><\/th>\n<th><b>False Positive Rate<\/b><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><span style=\"font-weight: 400;\">Turn Prediction v2<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.45<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.55<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.62<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.11<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Turn Prediction v3<\/b><\/td>\n<td><b>0.65<\/b><\/td>\n<td><b>0.70<\/b><\/td>\n<td><b>0.73<\/b><\/td>\n<td><span style=\"font-weight: 400;\">0.13<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"size-full wp-image-23231 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image7.png\" alt=\"\" width=\"1600\" height=\"800\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image7.png 1600w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image7-300x150.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image7-380x190.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image7-768x384.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image7-1536x768.png 1536w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image7-600x300.png 600w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/>Shift Histogram: Krisp TT v2 vs v3 \u2014 true cases, secondary mixes.<\/span><\/i><\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"size-full wp-image-23232 aligncenter\" 
src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image11.png\" alt=\"\" width=\"1600\" height=\"800\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image11.png 1600w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image11-300x150.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image11-380x190.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image11-768x384.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image11-1536x768.png 1536w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image11-600x300.png 600w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><br \/>\nShift Histogram: Krisp TT v2 vs v3 \u2014 false cases, secondary mixes.<\/span><\/i><\/p>\n<p>&nbsp;<\/p>\n<p><b>Noisy mixes:<\/b><\/p>\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th><b>True \u2264 200 ms<\/b><\/th>\n<th><b>True \u2264 400 ms<\/b><\/th>\n<th><b>True \u2264 600 ms<\/b><\/th>\n<th><b>False Positive Rate<\/b><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><span style=\"font-weight: 400;\">Turn Prediction v2<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.48<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.56<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.63<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.11<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Turn Prediction v3<\/b><\/td>\n<td><b>0.67<\/b><\/td>\n<td><b>0.70<\/b><\/td>\n<td><b>0.74<\/b><\/td>\n<td><span style=\"font-weight: 400;\">0.10<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"size-full wp-image-23233 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image9.png\" alt=\"\" width=\"1600\" height=\"800\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image9.png 1600w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image9-300x150.png 300w, 
https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image9-380x190.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image9-768x384.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image9-1536x768.png 1536w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image9-600x300.png 600w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><br \/>\nShift Histogram: Krisp TT v2 vs v3 \u2014 true cases, noisy mixes.<\/span><\/i><\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"size-full wp-image-23234 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image10.png\" alt=\"\" width=\"1600\" height=\"800\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image10.png 1600w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image10-300x150.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image10-380x190.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image10-768x384.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image10-1536x768.png 1536w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image10-600x300.png 600w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><br \/>\nShift Histogram: Krisp TT v2 vs v3 \u2014 false cases, noisy mixes.<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">The advantage holds across all three conditions. 
With secondary voices and ambient noise \u2014 the actual conditions a model encounters in deployment \u2014 v3 maintains roughly the same gains it shows on clean audio.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Turn Prediction v3 vs SmartTurn v3.2<\/b><\/h3>\n<p><a href=\"https:\/\/www.daily.co\/blog\/smart-turn-v3-2-handling-noisy-environments-and-short-responses\/\"><span style=\"font-weight: 400;\">SmartTurn v3.2<\/span><\/a><span style=\"font-weight: 400;\"> is an open source model designed specifically for audio-based end-of-turn prediction. The comparison highlights both performance and design philosophy.<\/span><\/p>\n<p><b>Original dataset:<\/b><\/p>\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th><b>True \u2264 200 ms<\/b><\/th>\n<th><b>True \u2264 400 ms<\/b><\/th>\n<th><b>True \u2264 600 ms<\/b><\/th>\n<th><b>False Positive Rate<\/b><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><span style=\"font-weight: 400;\">SmartTurn v3.2<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.63<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.63<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.63<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.09<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Turn Prediction v3<\/b><\/td>\n<td><b>0.69<\/b><\/td>\n<td><b>0.73<\/b><\/td>\n<td><b>0.76<\/b><\/td>\n<td><span style=\"font-weight: 400;\">0.10<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"size-full wp-image-23235 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image12.png\" alt=\"\" width=\"1600\" height=\"800\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image12.png 1600w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image12-300x150.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image12-380x190.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image12-768x384.png 
768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image12-1536x768.png 1536w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image12-600x300.png 600w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><br \/>\nShift Histogram: Krisp TT v3 vs SmartTurn v3.2 \u2014 true cases. SmartTurn&#8217;s mass is split between the 200 ms bin and the 3-second ceiling.<\/span><\/i><\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"size-full wp-image-23236 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image14.png\" alt=\"\" width=\"1600\" height=\"800\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image14.png 1600w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image14-300x150.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image14-380x190.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image14-768x384.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image14-1536x768.png 1536w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image14-600x300.png 600w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><br \/>\nShift Histogram: Krisp TT v3 vs SmartTurn v3.2 \u2014 false cases.<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">SmartTurn&#8217;s identical scores at 200 ms, 400 ms, and 600 ms (0.63 at each horizon) indicate a rigid, discrete trigger in its Pipecat implementation, one that does not exploit the evolving context of accumulating silence. In contrast, Turn Prediction v3 emits a continuous probability stream throughout the silence interval, letting developers calibrate the latency\u2013accuracy trade-off precisely for their use case. 
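The continuous stream is what makes that calibration possible. A minimal sketch of the idea (the per-frame probabilities and helper below are illustrative, not the VIVA SDK API):

```python
# Sketch: trading latency against false positives by thresholding a
# continuous end-of-turn probability stream. The frame probabilities
# are made up for illustration; a real model emits its own scores.

def end_of_turn_latency(probs, frame_ms, threshold):
    """Milliseconds of silence elapsed before the score crosses the
    threshold, or None if it never does (the agent keeps waiting)."""
    for i, p in enumerate(probs):
        if p >= threshold:
            return (i + 1) * frame_ms
    return None

# Scores typically rise as silence accumulates after the user stops talking.
silence_probs = [0.15, 0.30, 0.48, 0.62, 0.74, 0.85, 0.93]

for threshold in (0.4, 0.5, 0.7):
    print(threshold, end_of_turn_latency(silence_probs, frame_ms=40, threshold=threshold))
# threshold 0.4 -> 120 ms, 0.5 -> 160 ms, 0.7 -> 200 ms
```

Lowering the threshold commits earlier (faster responses, more false triggers on mid-turn pauses); raising it waits for more evidence. A single-shot trigger has no such dial, which is the behavioral difference the histograms above capture.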
Crucially, <\/span><b>Turn Prediction v3<\/b><span style=\"font-weight: 400;\"> captures 20% more turn-shifts within the 600 ms window (0.76 vs 0.63) \u2014 a critical segment for enabling the rapid, low-latency responses that define natural conversation.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The following charts illustrate the <\/span><b>MST vs FPR trade-off<\/b><span style=\"font-weight: 400;\"> across our three test conditions \u2014 clean audio, secondary voice mixes, and noisy mixes \u2014 evaluating performance both <\/span><b>before and after BVC<\/b><span style=\"font-weight: 400;\"> processing.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"size-full wp-image-23237 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image15.png\" alt=\"\" width=\"1200\" height=\"600\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image15.png 1200w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image15-300x150.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image15-380x190.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image15-768x384.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image15-600x300.png 600w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" \/><br \/>\nFPR vs Mean Shift Time \u2014 Krisp TT v3 (after BVC) vs SmartTurn v3.2 (before BVC), original dataset.<\/span><\/i><\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"size-full wp-image-23238 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image16.png\" alt=\"\" width=\"1200\" height=\"600\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image16.png 1200w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image16-300x150.png 300w, 
https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image16-380x190.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image16-768x384.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image16-600x300.png 600w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" \/><br \/>\nFPR vs Mean Shift Time \u2014 Krisp TT v3 (after BVC) vs SmartTurn v3.2 (after BVC), original dataset.<\/span><\/i><\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"size-full wp-image-23239 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image17.png\" alt=\"\" width=\"1200\" height=\"600\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image17.png 1200w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image17-300x150.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image17-380x190.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image17-768x384.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image17-600x300.png 600w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" \/><br \/>\nFPR vs Mean Shift Time \u2014 Krisp TT v3 vs SmartTurn v3.2, secondary mixes (both after BVC).<\/span><\/i><\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"size-full wp-image-23240 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image18.png\" alt=\"\" width=\"1200\" height=\"600\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image18.png 1200w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image18-300x150.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image18-380x190.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image18-768x384.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image18-600x300.png 600w\" sizes=\"(max-width: 
1200px) 100vw, 1200px\" \/><br \/>\nFPR vs Mean Shift Time \u2014 Krisp TT v3 vs SmartTurn v3.2, noisy mixes (both after BVC).<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">Across all three conditions and both before\/after BVC, Turn Prediction v3 dominates the latency\u2013accuracy curve.<\/span><\/p>\n<h3><b>Binary testing \u2014 full results across models<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">For readers who want a single comparison view, the table below collects binary metrics across all three datasets, three models (Turn Prediction v3, Turn Prediction v2, SmartTurn v3.2), and both before\/after BVC.<\/span><\/p>\n<p><b>Original dataset:<\/b><\/p>\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th><b>TP v3 (before BVC)<\/b><\/th>\n<th><b>TP v2 (before BVC)<\/b><\/th>\n<th><b>SmartTurn v3.2 (before BVC)<\/b><\/th>\n<th><b>TP v3 (after BVC)<\/b><\/th>\n<th><b>TP v2 (after BVC)<\/b><\/th>\n<th><b>SmartTurn v3.2 (after BVC)<\/b><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><span style=\"font-weight: 400;\">Balanced Accuracy<\/span><\/td>\n<td><b>88.05<\/b><\/td>\n<td><span style=\"font-weight: 400;\">80.34<\/span><\/td>\n<td><span style=\"font-weight: 400;\">77.41<\/span><\/td>\n<td><b>87.85<\/b><\/td>\n<td><span style=\"font-weight: 400;\">81.84<\/span><\/td>\n<td><span style=\"font-weight: 400;\">76.95<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">AUC<\/span><\/td>\n<td><b>94.58<\/b><\/td>\n<td><span style=\"font-weight: 400;\">93.92<\/span><\/td>\n<td><span style=\"font-weight: 400;\">88.81<\/span><\/td>\n<td><b>94.09<\/b><\/td>\n<td><span style=\"font-weight: 400;\">93.93<\/span><\/td>\n<td><span style=\"font-weight: 400;\">88.74<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">F1 Score<\/span><\/td>\n<td><b>84.44<\/b><\/td>\n<td><span style=\"font-weight: 400;\">75.25<\/span><\/td>\n<td><span style=\"font-weight: 400;\">70.88<\/span><\/td>\n<td><b>84.15<\/b><\/td>\n<td><span style=\"font-weight: 
400;\">77.30<\/span><\/td>\n<td><span style=\"font-weight: 400;\">70.21<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">F1 Score Hold<\/span><\/td>\n<td><b>91.20<\/b><\/td>\n<td><span style=\"font-weight: 400;\">89.09<\/span><\/td>\n<td><span style=\"font-weight: 400;\">86.44<\/span><\/td>\n<td><b>90.98<\/b><\/td>\n<td><span style=\"font-weight: 400;\">89.48<\/span><\/td>\n<td><span style=\"font-weight: 400;\">86.27<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p><b>Secondary mixes:<\/b><\/p>\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th><b>TP v3 (before BVC)<\/b><\/th>\n<th><b>TP v2 (before BVC)<\/b><\/th>\n<th><b>SmartTurn v3.2 (before BVC)<\/b><\/th>\n<th><b>TP v3 (after BVC)<\/b><\/th>\n<th><b>TP v2 (after BVC)<\/b><\/th>\n<th><b>SmartTurn v3.2 (after BVC)<\/b><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><span style=\"font-weight: 400;\">Balanced Accuracy<\/span><\/td>\n<td><b>83.36<\/b><\/td>\n<td><span style=\"font-weight: 400;\">66.14<\/span><\/td>\n<td><span style=\"font-weight: 400;\">73.74<\/span><\/td>\n<td><b>86.44<\/b><\/td>\n<td><span style=\"font-weight: 400;\">79.29<\/span><\/td>\n<td><span style=\"font-weight: 400;\">74.60<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">AUC<\/span><\/td>\n<td><b>90.59<\/b><\/td>\n<td><span style=\"font-weight: 400;\">86.55<\/span><\/td>\n<td><span style=\"font-weight: 400;\">81.60<\/span><\/td>\n<td><b>92.72<\/b><\/td>\n<td><span style=\"font-weight: 400;\">92.56<\/span><\/td>\n<td><span style=\"font-weight: 400;\">85.96<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">F1 Score<\/span><\/td>\n<td><b>77.89<\/b><\/td>\n<td><span style=\"font-weight: 400;\">50.29<\/span><\/td>\n<td><span style=\"font-weight: 400;\">66.60<\/span><\/td>\n<td><b>82.05<\/b><\/td>\n<td><span style=\"font-weight: 400;\">73.70<\/span><\/td>\n<td><span style=\"font-weight: 400;\">66.86<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">F1 Score 
Hold<\/span><\/td>\n<td><b>85.59<\/b><\/td>\n<td><span style=\"font-weight: 400;\">83.20<\/span><\/td>\n<td><span style=\"font-weight: 400;\">78.93<\/span><\/td>\n<td><b>89.32<\/b><\/td>\n<td><span style=\"font-weight: 400;\">88.48<\/span><\/td>\n<td><span style=\"font-weight: 400;\">84.37<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p><b>Noisy mixes:<\/b><\/p>\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th><b>TP v3 (before BVC)<\/b><\/th>\n<th><b>TP v2 (before BVC)<\/b><\/th>\n<th><b>SmartTurn v3.2 (before BVC)<\/b><\/th>\n<th><b>TP v3 (after BVC)<\/b><\/th>\n<th><b>TP v2 (after BVC)<\/b><\/th>\n<th><b>SmartTurn v3.2 (after BVC)<\/b><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><span style=\"font-weight: 400;\">Balanced Accuracy<\/span><\/td>\n<td><b>87.25<\/b><\/td>\n<td><span style=\"font-weight: 400;\">53.61<\/span><\/td>\n<td><span style=\"font-weight: 400;\">77.13<\/span><\/td>\n<td><b>87.60<\/b><\/td>\n<td><span style=\"font-weight: 400;\">81.62<\/span><\/td>\n<td><span style=\"font-weight: 400;\">75.95<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">AUC<\/span><\/td>\n<td><b>93.55<\/b><\/td>\n<td><span style=\"font-weight: 400;\">85.61<\/span><\/td>\n<td><span style=\"font-weight: 400;\">85.28<\/span><\/td>\n<td><b>93.71<\/b><\/td>\n<td><span style=\"font-weight: 400;\">93.71<\/span><\/td>\n<td><span style=\"font-weight: 400;\">87.36<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">F1 Score<\/span><\/td>\n<td><b>82.84<\/b><\/td>\n<td><span style=\"font-weight: 400;\">13.54<\/span><\/td>\n<td><span style=\"font-weight: 400;\">70.58<\/span><\/td>\n<td><b>83.79<\/b><\/td>\n<td><span style=\"font-weight: 400;\">77.01<\/span><\/td>\n<td><span style=\"font-weight: 400;\">68.82<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">F1 Score Hold<\/span><\/td>\n<td><b>89.50<\/b><\/td>\n<td><span style=\"font-weight: 400;\">79.53<\/span><\/td>\n<td><span style=\"font-weight: 
400;\">83.42<\/span><\/td>\n<td><b>90.74<\/b><\/td>\n<td><span style=\"font-weight: 400;\">89.42<\/span><\/td>\n<td><span style=\"font-weight: 400;\">85.16<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The pattern is consistent: Turn Prediction v3 leads across all conditions and metrics. Performance is similar with and without BVC on the clean original dataset; BVC&#8217;s contribution is most visible on noisy mixes, where it lifts every model&#8217;s accuracy and is essential for v2&#8217;s competitiveness.<\/span><\/p>\n<h3><b>Turn Prediction v3 vs LiveKit and Deepgram Flux<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">We also evaluated Turn Prediction v3 against two other widely deployed solutions: LiveKit\u2019s built-in and Deepgram Flux\u2019s built-in.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">A methodological note before the numbers: <\/span><b>LiveKit&#8217;s TT is a text-based model<\/b><span style=\"font-weight: 400;\"> \u2014 it operates on transcripts rather than raw audio. To make the comparison fair on audio data, we evaluated LiveKit using transcripts produced by the Deepgram Nova 3 ASR model. 
<\/span><b>Deepgram Flux is English-only<\/b><span style=\"font-weight: 400;\">, so it appears in our English benchmarks but does not cover the multilingual scope of Turn Prediction v3.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"size-full wp-image-23241 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image19.png\" alt=\"\" width=\"1200\" height=\"600\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image19.png 1200w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image19-300x150.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image19-380x190.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image19-768x384.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image19-600x300.png 600w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" \/><br \/>\nFPR vs Mean Shift Time \u2014 Krisp TT v3 vs SmartTurn V3.2 vs LiveKit vs Deepgram Flux. 
Krisp TT v3&#8217;s curve sits below SmartTurn and LiveKit across the operating range.<\/span><\/i><\/p>\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th><b>Balanced Accuracy<\/b><\/th>\n<th><b>AUC<\/b><\/th>\n<th><b>F1 Score<\/b><\/th>\n<th><b>F1 Score Hold<\/b><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><b>Turn Prediction v3<\/b><\/td>\n<td><b>88.05<\/b><\/td>\n<td><b>94.58<\/b><\/td>\n<td><span style=\"font-weight: 400;\">84.44<\/span><\/td>\n<td><span style=\"font-weight: 400;\">91.20<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">SmartTurn v3.2<\/span><\/td>\n<td><span style=\"font-weight: 400;\">77.41<\/span><\/td>\n<td><span style=\"font-weight: 400;\">88.81<\/span><\/td>\n<td><span style=\"font-weight: 400;\">70.88<\/span><\/td>\n<td><span style=\"font-weight: 400;\">86.44<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Deepgram Flux<\/span><\/td>\n<td><span style=\"font-weight: 400;\">87.10<\/span><\/td>\n<td><span style=\"font-weight: 400;\">\u2014<\/span><\/td>\n<td><b>84.60<\/b><\/td>\n<td><b>92.60<\/b><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">LiveKit<\/span><\/td>\n<td><span style=\"font-weight: 400;\">82.70<\/span><\/td>\n<td><span style=\"font-weight: 400;\">88.70<\/span><\/td>\n<td><span style=\"font-weight: 400;\">76.70<\/span><\/td>\n<td><span style=\"font-weight: 400;\">83.30<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-weight: 400;\">Turn Prediction v3 leads on Balanced Accuracy and AUC. Deepgram Flux is marginally ahead on F1 Score (84.60 vs 84.44) and F1 Score Hold (92.60 vs 91.20). Deepgram Flux end-of-turn prediction is English-only and is integrated within an ASR pipeline; the MST vs FPR plot also shows that Deepgram Flux achieves a lower mean shift time at the same FPR levels. Turn Prediction v3, by contrast, is a multilingual, lightweight model designed specifically for audio-based end-of-turn prediction. 
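For reference, the binary metrics reported in these tables are standard confusion-matrix quantities. A quick sketch, treating "turn ended" as the positive class (and assuming "F1 Score Hold" is the same statistic computed with the hold class as positive):

```python
# Standard binary metrics over end-of-turn decisions.
# tp/fn: true turn-shifts predicted as ended / as hold;
# fp/tn: hold cases predicted as ended / as hold.

def balanced_accuracy(tp, fp, tn, fn):
    tpr = tp / (tp + fn)  # recall on "turn ended" cases
    tnr = tn / (tn + fp)  # recall on "hold" cases
    return (tpr + tnr) / 2

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 80 of 100 true shifts caught, 10 of 100 holds broken into.
print(round(balanced_accuracy(80, 10, 90, 20), 3))  # 0.85
print(round(f1(80, 10, 20), 3))  # 0.842
```

Balanced accuracy weights both classes equally regardless of how many hold vs shift samples a dataset contains, which is why it is the headline number across the mixed-condition tables.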
SmartTurn v3.2 and LiveKit trail across all four metrics, and their MST vs FPR trade-offs are comparable to each other.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><b>Results: Interruption Prediction v1<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Interruption Prediction is a fundamentally different task from end-of-turn detection: instead of judging when a speaker has finished, the model judges whether incoming user speech is a backchannel or a real interruption attempt. The metric framework, however, is structurally parallel.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Mean Interruption Time vs False Positive Rate<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><i><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"size-full wp-image-23242 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image20.png\" alt=\"\" width=\"989\" height=\"490\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image20.png 989w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image20-300x149.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image20-380x188.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image20-768x381.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/image20-600x297.png 600w\" sizes=\"(max-width: 989px) 100vw, 989px\" \/><br \/>\nMIT vs FPR Curve \u2014 Krisp Interruption Prediction v1 across the threshold sweep, with VAD-based and minimum-word-count baselines plotted as single points.<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">The curve shows Interruption Prediction v1 across the full threshold sweep, with two industry baselines plotted as single points (since they don&#8217;t expose a confidence score):<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>VAD-based<\/b><span style=\"font-weight: 400;\">: triggers an interrupt after a fixed duration of continuous user speech (Silero VAD, integrated in 
Pipecat).<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Minimum word count<\/b><span style=\"font-weight: 400;\">: triggers an interrupt after 3 words are recognized by Deepgram ASR (Pipecat default).<\/span><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">A lower curve means faster interruption at the same false-positive rate, or fewer false interrupts at the same speed. Krisp Interruption Prediction v1&#8217;s curve sits decisively below both baseline points across the operating range.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Threshold trade-off<\/b><\/h3>\n<table>\n<thead>\n<tr>\n<th><b>Method<\/b><\/th>\n<th><b>MIT (s)<\/b><\/th>\n<th><b>FPR (%)<\/b><\/th>\n<th><b>Balanced Accuracy<\/b><\/th>\n<th><b>F1 Score<\/b><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><b>Interruption Prediction v1 \u2014 threshold 0.4<\/b><\/td>\n<td><b>0.833<\/b><\/td>\n<td><span style=\"font-weight: 400;\">5.9<\/span><\/td>\n<td><b>0.906<\/b><\/td>\n<td><span style=\"font-weight: 400;\">0.871<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Interruption Prediction v1 \u2014 threshold 0.7<\/span><\/td>\n<td><span style=\"font-weight: 400;\">1.085<\/span><\/td>\n<td><b>2.7<\/b><\/td>\n<td><span style=\"font-weight: 400;\">0.876<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.848<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Minimum word count<\/span><\/td>\n<td><span style=\"font-weight: 400;\">1.528<\/span><\/td>\n<td><span style=\"font-weight: 400;\">3.6<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.948<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.927<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">VAD-based<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.375<\/span><\/td>\n<td><span style=\"font-weight: 400;\">66.3<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.675<\/span><\/td>\n<td><span style=\"font-weight: 
400;\">0.583<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Two operating points are worth highlighting. <\/span><b>Threshold 0.4<\/b><span style=\"font-weight: 400;\"> is our recommended default \u2014 sub-second mean interruption time at under 6% FPR. <\/span><b>Threshold 0.7<\/b><span style=\"font-weight: 400;\"> is the conservative setting: slightly slower but FPR under 3%.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The minimum-word-count baseline achieves higher balanced accuracy and F1 Score, but at the cost of substantially higher latency (1.5 s vs 0.8 s) \u2014 the bot waits much longer before stopping. VAD-based interruption fails the basic test: it fires on almost two-thirds of backchannels.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Robustness to non-verbal human sounds<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Real conversations include laughter, coughing, and sneezing \u2014 sounds that are unmistakably human but are not interruptions. We evaluated each method on a separate set of non-verbal human sounds (200 laughter, 100 cough, 100 sneeze samples):<\/span><\/p>\n<table>\n<thead>\n<tr>\n<th><b>Method<\/b><\/th>\n<th><b>False Positive (%) on non-verbal sounds<\/b><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><b>Interruption Prediction v1 \u2014 threshold 0.4<\/b><\/td>\n<td><b>4.5<\/b><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Minimum word count<\/span><\/td>\n<td><span style=\"font-weight: 400;\">0.0<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">VAD-based<\/span><\/td>\n<td><span style=\"font-weight: 400;\">32.0<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-weight: 400;\">Minimum-word-count is trivially robust here by construction \u2014 a cough produces no recognized words \u2014 but, as the previous table shows, pays for it with slow interruption response on real speech (1.5 s MIT). 
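0.583<">
The two baselines are simple enough to sketch outright, which also shows why they fail in opposite directions (the 200 ms speech-duration trigger below is an illustrative value, not the exact Pipecat setting; the three-word minimum follows the description above):

```python
# Sketches of the two baseline interruption policies benchmarked above.

def vad_interrupt(speech_ms, min_speech_ms=200):
    """VAD-based: fire once continuous speech-like audio exceeds a fixed
    duration. It only knows *that* someone is vocalizing, so a short
    "uh-huh", a laugh, or a cough can all trigger it."""
    return speech_ms >= min_speech_ms

def word_count_interrupt(recognized_words, min_words=3):
    """Minimum word count: fire after the ASR emits N words. A cough yields
    no words (trivially robust), but real interruptions must wait for ASR
    output, hence the ~1.5 s mean interruption time."""
    return len(recognized_words) >= min_words

print(vad_interrupt(250))                   # True: a 250 ms backchannel fires it
print(word_count_interrupt(["uh", "huh"]))  # False: still waiting for word 3
```

A learned model replaces both hard rules with a confidence score, which is how its curve can sit below both baseline points at once.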
Interruption Prediction v1 stays below 5% false positives on non-verbal sounds while keeping interruption response well under one second. That&#8217;s the trade-off most production deployments will want.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><b>Availability<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Both Turn Prediction v3 and Interruption Prediction v1 are available in the Krisp VIVA SDK and integrated into Pipecat:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Turn Prediction v3<\/b><span style=\"font-weight: 400;\"> \u2014 <\/span><span style=\"font-weight: 400;\">krisp-viva-tp-v3<\/span><span style=\"font-weight: 400;\">, ~9M parameters, 30 MB, configurable frame duration, 12+ languages, recommended threshold 0.5<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Interruption Prediction v1<\/b><span style=\"font-weight: 400;\"> \u2014 <\/span><span style=\"font-weight: 400;\">krisp-viva-ip-v1<\/span><span style=\"font-weight: 400;\">, ~6M parameters, 24 MB, 40 ms frames, English, recommended threshold 0.4<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">Both models are designed for low resource consumption on CPU \u2014 no GPU required for real-time inference.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><b>FAQ<\/b><\/h2>\n<div class=\"faq_item\">\n<div class=\"faq_title text_body--md text--semi-bold\"><strong>What is turn-taking in voice AI, and why does it matter?<\/strong><\/div>\n<div class=\"faq_answer text_body--md\">Turn-taking determines when a voice AI agent should start speaking (after the user finishes) and when it should stop (when the user interrupts). 
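To make that concrete, here is a rough sketch of how the two decisions compose per audio frame (the function name and frame-level inputs are hypothetical, not the VIVA SDK API; the thresholds mirror the recommended defaults):

```python
# Sketch: composing end-of-turn and interruption decisions in an agent loop.

TURN_THRESHOLD = 0.5       # Turn Prediction v3 recommended default
INTERRUPT_THRESHOLD = 0.4  # Interruption Prediction v1 recommended default

def agent_action(agent_speaking, turn_prob, interrupt_prob):
    """Pick the agent's next move from the two models' current scores."""
    if agent_speaking:
        # User audio while the agent talks: backchannel or real interruption?
        return "stop" if interrupt_prob >= INTERRUPT_THRESHOLD else "keep_talking"
    # Agent is listening: has the user's turn actually ended?
    return "respond" if turn_prob >= TURN_THRESHOLD else "keep_listening"

print(agent_action(True, 0.0, 0.2))   # keep_talking: likely a backchannel
print(agent_action(False, 0.8, 0.0))  # respond: the user's turn has ended
```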
Poor turn-taking causes awkward silences, cut-offs, and agents that halt mid-sentence because the user said \u201cuh-huh.\u201d It\u2019s the single biggest factor in whether a voice agent feels natural or robotic.<\/div>\n<\/div>\n<div class=\"faq_item\">\n<div class=\"faq_title text_body--md text--semi-bold\"><strong>How does Krisp\u2019s end-of-turn detection differ from text-based approaches like LiveKit\u2019s?<\/strong><\/div>\n<div class=\"faq_answer text_body--md\"> Krisp Turn Prediction v3 operates directly on audio rather than on ASR transcripts. It supports multilingual input without requiring per-language ASR, and makes decisions based on acoustic cues (prosody, pausing patterns) that get lost in transcription.<\/div>\n<\/div>\n<div class=\"faq_item\">\n<div class=\"faq_title text_body--md text--semi-bold\"><strong>What\u2019s the difference between a backchannel and an interruption, and why can\u2019t VAD handle it?<\/strong><\/div>\n<div class=\"faq_answer text_body--md\">A backchannel (\u201cyeah,\u201d \u201cuh-huh,\u201d \u201cright\u201d) signals engagement without requesting the floor. An interruption means the user wants the agent to stop. VAD only detects that someone is speaking \u2014 it can\u2019t distinguish intent, so it fires on nearly two-thirds of backchannels. Krisp Interruption Prediction v1 uses a learned model that separates the two with under 6% false positives at the recommended threshold.<\/div>\n<\/div>\n<div class=\"faq_item\">\n<div class=\"faq_title text_body--md text--semi-bold\"><strong>How fast can a voice AI agent respond using Krisp Turn Prediction v3?<\/strong><\/div>\n<div class=\"faq_answer text_body--md\">At the recommended threshold (0.5), 69% of true turn-shifts are detected within 200 ms of silence \u2014 a 47% improvement over v2. 
The model runs on CPU with ~9M parameters and 30 MB footprint, so it adds negligible overhead to your voice agent pipeline.<\/div>\n<\/div>\n<div class=\"faq_item\">\n<div class=\"faq_title text_body--md text--semi-bold\"><strong>What languages does Krisp turn-taking support?<\/strong><\/div>\n<div class=\"faq_answer text_body--md\">Turn Prediction v3 supports 12+ languages: English, German, French, Spanish, Hindi, Finnish, Italian, Portuguese, Chinese, Japanese, Korean, and Russian. Interruption Prediction v1 is English-only at launch, with additional language support planned.<\/div>\n<\/div>\n<h2><b>Related Resources<\/b><\/h2>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/sdk-docs.krisp.ai\/docs\/ttv3-migration\"><span style=\"font-weight: 400;\">SDK Documentation: Turn Prediction v3 Migration<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/sdk-docs.krisp.ai\/docs\/interupt-prediction\"><span style=\"font-weight: 400;\">SDK Documentation: Interruption Prediction v1<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/github.com\/pipecat-ai\/pipecat\/blob\/main\/examples\/voice\/voice-krisp-viva.py\"><span style=\"font-weight: 400;\">Pipecat Integration Example<\/span><\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>The natural rhythm of conversation depends on knowing when to start speaking and when to stop. Humans handle this effortlessly: we sense the end of a turn, we recognize a quick &#8220;uh-huh&#8221; as encouragement rather than an interruption, and we can stop mid-sentence when someone clearly needs to break in. 
Voice AI agents have struggled [&hellip;]<\/p>\n","protected":false},"author":71,"featured_media":23244,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"two_page_speed":[]},"categories":[417,421,456],"tags":[],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v24.2 (Yoast SEO v23.6) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>A solution to Turn-Taking and Interruption Prediction in Voice AI<\/title>\n<meta name=\"description\" content=\"Krisp Turn Prediction v3 cuts end-of-turn latency below 200ms. New Interruption Prediction v1 separates backchannels from real interruptions. Benchmarks and public dataset.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"A solution to Turn-Taking and Interruption Prediction in Voice AI\" \/>\n<meta property=\"og:description\" content=\"Krisp Turn Prediction v3 cuts end-of-turn latency below 200ms. New Interruption Prediction v1 separates backchannels from real interruptions. 
Benchmarks and public dataset.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/\" \/>\n<meta property=\"og:site_name\" content=\"Krisp\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/krispHQ\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-06T12:53:02+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-05-06T13:01:56+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/SDK-blog.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1000\" \/>\n\t<meta property=\"og:image:height\" content=\"700\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Krisp Engineering Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@krispHQ\" \/>\n<meta name=\"twitter:site\" content=\"@krispHQ\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/\"},\"author\":{\"name\":\"Krisp Engineering Team\",\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/person\/e9f59158d89de3002958d323d2e788f5\"},\"headline\":\"A New Approach to Turn-Taking in Voice AI: Turn Prediction v3 and Interruption Prediction 
v1\",\"datePublished\":\"2026-05-06T12:53:02+00:00\",\"dateModified\":\"2026-05-06T13:01:56+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/\"},\"wordCount\":3195,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/krisp.ai\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/SDK-blog.png\",\"articleSection\":[\"Company\",\"Engineering Blog\",\"SDK\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/\",\"url\":\"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/\",\"name\":\"A solution to Turn-Taking and Interruption Prediction in Voice AI\",\"isPartOf\":{\"@id\":\"https:\/\/krisp.ai\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/SDK-blog.png\",\"datePublished\":\"2026-05-06T12:53:02+00:00\",\"dateModified\":\"2026-05-06T13:01:56+00:00\",\"description\":\"Krisp Turn Prediction v3 cuts end-of-turn latency below 200ms. New Interruption Prediction v1 separates backchannels from real interruptions. 
Benchmarks and public dataset.\",\"breadcrumb\":{\"@id\":\"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/#primaryimage\",\"url\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/SDK-blog.png\",\"contentUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/SDK-blog.png\",\"width\":1000,\"height\":700},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/krisp.ai\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"A New Approach to Turn-Taking in Voice AI: Turn Prediction v3 and Interruption Prediction 
v1\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/krisp.ai\/blog\/#website\",\"url\":\"https:\/\/krisp.ai\/blog\/\",\"name\":\"Krisp\",\"description\":\"Blog\",\"publisher\":{\"@id\":\"https:\/\/krisp.ai\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/krisp.ai\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/krisp.ai\/blog\/#organization\",\"name\":\"Krisp\",\"url\":\"https:\/\/krisp.ai\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png\",\"contentUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png\",\"width\":696,\"height\":696,\"caption\":\"Krisp\"},\"image\":{\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/krispHQ\/\",\"https:\/\/x.com\/krispHQ\",\"https:\/\/www.linkedin.com\/company\/krisphq\/\",\"https:\/\/www.youtube.com\/channel\/UCAMZinJdR9P33fZUNpuxXtg\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/person\/e9f59158d89de3002958d323d2e788f5\",\"name\":\"Krisp Engineering Team\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/26475ad8219056696662f819691ee49d?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/26475ad8219056696662f819691ee49d?s=96&d=mm&r=g\",\"caption\":\"Krisp Engineering Team\"},\"url\":\"https:\/\/krisp.ai\/blog\/author\/eng-team\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"A solution to Turn-Taking and Interruption Prediction in Voice AI","description":"Krisp Turn Prediction v3 cuts end-of-turn latency below 200ms. New Interruption Prediction v1 separates backchannels from real interruptions. Benchmarks and public dataset.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/","og_locale":"en_US","og_type":"article","og_title":"A solution to Turn-Taking and Interruption Prediction in Voice AI","og_description":"Krisp Turn Prediction v3 cuts end-of-turn latency below 200ms. New Interruption Prediction v1 separates backchannels from real interruptions. Benchmarks and public dataset.","og_url":"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/","og_site_name":"Krisp","article_publisher":"https:\/\/www.facebook.com\/krispHQ\/","article_published_time":"2026-05-06T12:53:02+00:00","article_modified_time":"2026-05-06T13:01:56+00:00","og_image":[{"width":1000,"height":700,"url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/SDK-blog.png","type":"image\/png"}],"author":"Krisp Engineering Team","twitter_card":"summary_large_image","twitter_creator":"@krispHQ","twitter_site":"@krispHQ","schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/#article","isPartOf":{"@id":"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/"},"author":{"name":"Krisp Engineering Team","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/person\/e9f59158d89de3002958d323d2e788f5"},"headline":"A New Approach to Turn-Taking in Voice AI: Turn Prediction v3 and Interruption Prediction 
v1","datePublished":"2026-05-06T12:53:02+00:00","dateModified":"2026-05-06T13:01:56+00:00","mainEntityOfPage":{"@id":"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/"},"wordCount":3195,"commentCount":0,"publisher":{"@id":"https:\/\/krisp.ai\/blog\/#organization"},"image":{"@id":"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/#primaryimage"},"thumbnailUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/SDK-blog.png","articleSection":["Company","Engineering Blog","SDK"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/","url":"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/","name":"A solution to Turn-Taking and Interruption Prediction in Voice AI","isPartOf":{"@id":"https:\/\/krisp.ai\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/#primaryimage"},"image":{"@id":"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/#primaryimage"},"thumbnailUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/SDK-blog.png","datePublished":"2026-05-06T12:53:02+00:00","dateModified":"2026-05-06T13:01:56+00:00","description":"Krisp Turn Prediction v3 cuts end-of-turn latency below 200ms. New Interruption Prediction v1 separates backchannels from real interruptions. 
Benchmarks and public dataset.","breadcrumb":{"@id":"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/#primaryimage","url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/SDK-blog.png","contentUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2026\/05\/SDK-blog.png","width":1000,"height":700},{"@type":"BreadcrumbList","@id":"https:\/\/krisp.ai\/blog\/voice-ai-turn-taking-interruption-prediction\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/krisp.ai\/blog\/"},{"@type":"ListItem","position":2,"name":"A New Approach to Turn-Taking in Voice AI: Turn Prediction v3 and Interruption Prediction v1"}]},{"@type":"WebSite","@id":"https:\/\/krisp.ai\/blog\/#website","url":"https:\/\/krisp.ai\/blog\/","name":"Krisp","description":"Blog","publisher":{"@id":"https:\/\/krisp.ai\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/krisp.ai\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/krisp.ai\/blog\/#organization","name":"Krisp","url":"https:\/\/krisp.ai\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png","contentUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png","width":696,"height":696,"caption":"Krisp"},"image":{"@id":"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/krispHQ\/","https:\/\/x
.com\/krispHQ","https:\/\/www.linkedin.com\/company\/krisphq\/","https:\/\/www.youtube.com\/channel\/UCAMZinJdR9P33fZUNpuxXtg"]},{"@type":"Person","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/person\/e9f59158d89de3002958d323d2e788f5","name":"Krisp Engineering Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/26475ad8219056696662f819691ee49d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/26475ad8219056696662f819691ee49d?s=96&d=mm&r=g","caption":"Krisp Engineering Team"},"url":"https:\/\/krisp.ai\/blog\/author\/eng-team\/"}]}},"_links":{"self":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/23222"}],"collection":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/users\/71"}],"replies":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/comments?post=23222"}],"version-history":[{"count":6,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/23222\/revisions"}],"predecessor-version":[{"id":23270,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/23222\/revisions\/23270"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/media\/23244"}],"wp:attachment":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/media?parent=23222"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/categories?post=23222"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/tags?post=23222"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}