


<h1>Audio-only, 6M weights Turn-Taking model for Voice AI Agents</h1>
<p><em>Krisp Engineering Team · August 5, 2025</em></p>
<p>In this article we discuss an open problem in today's Voice AI Agents: turn-taking. We examine why it is a hard problem and <strong>present a solution</strong> in <a href="https://krisp.ai/blog/krisp-launches-viva-sdk-and-surpasses-1b-minutes-of-voice-ai-processing-per-month-milestone/">Krisp's VIVA SDK</a>. We also benchmark the Krisp solution against some of the established solutions on the market.</p>
<p style="background-color: #fff9c4; font-size: 1.05em; padding: 12px 16px; margin: 16px 0; border-radius: 4px;"><strong>Note:</strong> The Turn-Taking model is included in the VIVA SDK offering at <strong>no additional charge</strong>.</p>
<h2><strong>What is turn-taking?</strong></h2>
<p>Turn-taking is the fundamental mechanism by which participants in a conversation coordinate who speaks when. While seemingly effortless in human interaction, modeling this process computationally for human-to-AI-agent conversations is highly complex. In the context of Voice AI Agents (including voice assistants, customer support bots, and AI meeting agents), turn-taking decides when the agent should speak, listen, or remain silent.</p>
<p>Without effective turn-taking, even the most advanced dialogue systems can come across as unnatural, unresponsive, and frustrating to use. 
A precise and lightweight turn-taking model enables natural, seamless conversations by minimizing interruptions and awkward pauses while adapting in real time to human cues such as hesitations, prosody, and pauses.</p>
<p>In general, turn-taking includes the following tasks:</p>
<ul>
<li><strong>End-of-turn prediction</strong> – predicting when the current speaker is likely to finish their turn</li>
<li><strong>Backchannel prediction</strong> – detecting moments where a listener may provide short verbal acknowledgments like <em>"uh-huh"</em> or <em>"yeah"</em> to show engagement, without intending to take over the speaking turn</li>
</ul>
<p>In this article, we present our first <strong>audio-based turn-taking model</strong>, which focuses on the <strong>end-of-turn prediction task</strong> using <strong>only audio input</strong>. We chose to release the audio-based model first because it enables faster response times and a more lightweight solution than text-based models, which usually require large architectures and depend on the availability of a streamable ASR that provides real-time, accurate transcriptions.</p>
<h3>Approaches to Turn-Taking</h3>
<p>Solutions to the turn-taking problem are usually implemented as AI models that operate on audio and/or text representations.</p>
<h4>1. Audio-based</h4>
<p>Audio-based approaches rely on analyzing acoustic and prosodic features of speech, including changes in pitch, energy levels, intonation, pauses, and speaking rate. By detecting silence or overlapping speech, the system predicts when the user has finished speaking and when it is safe to respond. For example, a sudden drop in energy followed by a pause can be interpreted as a turn-ending cue. Such models are effective in real-time, low-latency scenarios where immediate response timing is critical.</p>
<h4>2. 
Text-based</h4>
<p>Text-based solutions analyze the transcribed content of speech rather than the raw audio. These models detect linguistic cues that indicate turn completion, such as sentence boundaries, punctuation, discourse markers (e.g., "so," "anyway"), and broader language patterns or semantics (e.g., the user might directly ask the bot not to speak). Text-based systems are often integrated with dialogue state tracking and natural language processing (NLP) modules, making them effective for scenarios where accurate semantic interpretation of user intent is essential. However, they may require larger neural network architectures to effectively analyze the linguistic content.</p>
<h4>3. Audio-Text Multimodal (Fusion)</h4>
<p>Multimodal solutions combine both acoustic and textual inputs, leveraging the strengths of each. While audio-based methods capture real-time prosodic cues, text-based analysis provides deeper semantic understanding. By integrating both modalities, fusion models can make accurate and context-aware predictions of turn boundaries. These systems are effective in complex, multi-turn conversations where relying on either audio or text alone might lead to errors in timing or intent detection.</p>
<h2>Challenges of turn-taking</h2>
<h3>Hesitation and filler words</h3>
<p>In natural dialogue, speakers often pause using fillers like "um" or "you know" without intending to give up their turn. For instance:</p>
<p><em>"I think we should, um, maybe –"</em> <em>[The agent jumps in, assuming the sentence is over]</em></p>
<p>Here, a turn-taking system must distinguish hesitation from completion, or risk interrupting too early.</p>
<h3>Natural pauses vs. true end-of-turns</h3>
<p>Pauses are not always indicators that a speaker has finished. 
For example:</p>
<p><em>"Yesterday I woke up early, then… [pause] I went to work…"</em></p>
<p>A model might misinterpret the pause as a turn boundary, generating a premature response and breaking the conversational flow.</p>
<h3>Quick turn prediction</h3>
<p>Minimizing response latency is essential for maintaining natural conversational flow. Humans tend to respond quickly, sometimes even reactively, when the end of the speech is obvious. If a model fails to predict the turn boundary fast enough, the system may sound sluggish or unnatural. The challenge is to trigger responses at just the right moment – early enough to sound fluid, but not so early that it risks interrupting the speaker.</p>
<h3>Varying speaking styles and accents</h3>
<p>People speak in diverse rhythms, intonations, and speeds. A fast speaker with sharp pitch drops might appear to end a sentence even when they haven't. Conversely, a slow, melodic speaker may stretch syllables in ways that confuse timing-based systems. Modeling these variations effectively requires a neural network–based approach.</p>
<h2>Krisp's audio-based Turn-Taking model</h2>
<p>Krisp recently <a href="https://krisp.ai/blog/improving-turn-taking-of-ai-voice-agents-with-background-voice-cancellation/">released AI models</a> for effective noise cancellation and voice isolation in Voice AI Agent use cases, in particular reducing premature turn-taking caused by background noise. This technology is widely deployed and has recently <a href="https://krisp.ai/blog/krisp-launches-viva-sdk-and-surpasses-1b-minutes-of-voice-ai-processing-per-month-milestone/">passed a 1B mins/month milestone</a>.</p>
<p>It was only natural for us to take on the larger problem of turn-taking (TT). In this first iteration, we designed a lightweight, low-latency, audio-based turn-taking model optimized to run efficiently on a CPU. 
The Krisp TT model is built into Krisp's VIVA SDK; using the Python SDK, you can easily chain it with the Voice Isolation models, placing it in front of a voice agent to create a complete, end-to-end conversational flow, as shown in the following diagram.</p>
<p><img loading="lazy" class="aligncenter wp-image-21825" src="https://krisp.ai/blog/wp-content/uploads/2025/08/Krisp-Turn-Taking-Model.jpg" alt="" width="1264" height="488" /></p>
<p>Here, the TT model continuously outputs a confidence score (probability) ranging from 0 to 1, indicating the likelihood of a shift – a point where a speaker is expected to finish their turn. It operates on 100ms audio frames, assigning a shift confidence score to each frame. To convert this score into a binary decision, we apply a configurable threshold (Δ). If the score exceeds this threshold, we interpret it as a shift (end-of-turn) prediction; otherwise, the model considers that the current speaker is still holding the turn.</p>
<p>We also define a maximum hold duration, which defaults to 5 seconds.</p>
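The per-frame decision rule described above can be sketched in a few lines. This is an illustrative sketch only, not the VIVA SDK API: `get_shift_confidence` is a hypothetical stand-in for whatever call produces the model's frame score, and the 0.5 threshold is an assumed example value (the article only says the threshold is configurable).

```python
# Illustrative sketch of the per-frame shift/hold decision described above.
# NOT the VIVA SDK API: get_shift_confidence is a hypothetical stand-in for
# the model call, and THRESHOLD = 0.5 is an assumed example value.

FRAME_MS = 100    # the model scores audio in 100 ms frames
THRESHOLD = 0.5   # configurable decision threshold (Δ); example value

def detect_shift(frames, get_shift_confidence):
    """Return the elapsed time (ms) at which a shift is declared, else None.

    Because the model's score ramps up toward 1 during uninterrupted
    silence, the threshold is eventually crossed within the maximum hold
    duration, so sustained silence always yields a shift.
    """
    for i, frame in enumerate(frames):
        score = get_shift_confidence(frame)  # shift probability in [0, 1]
        if score > THRESHOLD:
            return (i + 1) * FRAME_MS        # shift: end of turn predicted
    return None                              # hold: speaker keeps the turn
```

For example, with frame scores 0.1, 0.2, 0.8 the shift is declared on the third 100 ms frame, i.e. 300 ms in.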
<p>The model is designed such that, during uninterrupted silence, the confidence score gradually increases and reaches a value of 1 precisely at the end of this maximum hold period.</p>
<h2><strong>Comparison with other Turn-Taking models</strong></h2>
<p>Let's take a closer look at how other solutions handle the turn-taking problem in comparison to Krisp.</p>
<h3>Simple VAD (Voice Activity Detection)</h3>
<p>The basic VAD-based approach is as straightforward as it gets – if you take a pause in your speech, you have probably finished your turn. Technically, once a few seconds of silence (usually a configurable duration) is detected, the system assumes the speaker has finished and hands over the turn. While efficient, this method lacks awareness of conversational context and often struggles with natural pauses or hesitant speech. In our comparisons, we use the Silero-VAD model with a 1-second silence detection window as a simple VAD-based turn-taking approach.</p>
<h3>SmartTurn</h3>
<p><strong>SmartTurn v1 and SmartTurn v2</strong> by Pipecat are open-source AI models designed to detect exactly when a speaker has finished their turn. We picked them for in-depth comparison because, like Krisp TT, they are audio-based models.</p>
<p>Interestingly, the SmartTurn models introduce a hybrid strategy. They first wait for 200ms of silence detected by Silero VAD, then evaluate whether a turn shift should occur. If the confidence is too low to switch, the system defers the decision. However, if silence persists for 3 seconds (the default value, configurable in SmartTurn), it forcefully initiates the turn transition.</p>
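The layered strategy just described can be summarized as a small decision function. This is an illustrative reconstruction of the logic, not Pipecat's actual implementation or API; the 0.5 confidence threshold is an assumed example value.

```python
# Illustrative reconstruction of SmartTurn's layered strategy as described
# above (not Pipecat's actual code). The default threshold is an assumption.

VAD_SILENCE_MS = 200    # silence required (per Silero VAD) before evaluating
FORCE_SHIFT_MS = 3000   # forced turn transition after this much silence

def decide(silence_ms, model_confidence, threshold=0.5):
    """Return 'hold', 'shift', or 'defer' for the current silence state."""
    if silence_ms < VAD_SILENCE_MS:
        return "hold"                  # not enough silence to evaluate yet
    if model_confidence >= threshold:
        return "shift"                 # model is confident the turn ended
    if silence_ms >= FORCE_SHIFT_MS:
        return "shift"                 # force the transition after 3 s
    return "defer"                     # keep waiting for more evidence
```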
<p>This layered approach aims to strike a balance between speed and caution in handling user pauses.</p>
<h3>Tested Models</h3>
<p>The following table gives a high-level comparison between the contenders:</p>
<table>
<thead>
<tr>
<th style="text-align: center;"><strong>Attribute</strong></th>
<th style="text-align: center;"><strong>Krisp TT</strong></th>
<th style="text-align: center;"><strong>SmartTurn v1</strong></th>
<th style="text-align: center;"><strong>SmartTurn v2</strong></th>
<th style="text-align: center;"><strong>VAD-based TT</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Parameter Count</strong></td>
<td>6.1M</td>
<td>581M</td>
<td>95M</td>
<td>260k</td>
</tr>
<tr>
<td><strong>Model Size</strong></td>
<td>65 MB</td>
<td>2.3 GB</td>
<td>360 MB</td>
<td>2.3 MB</td>
</tr>
<tr>
<td><strong>Recommended Execution</strong></td>
<td>On CPU</td>
<td>On GPU</td>
<td>On GPU</td>
<td>On CPU</td>
</tr>
<tr>
<td><strong>Overall Accuracy</strong></td>
<td>Good</td>
<td>Good</td>
<td>Good</td>
<td>Poor</td>
</tr>
</tbody>
</table>
<h3><strong>Test Dataset</strong></h3>
<p>The test dataset was built using real conversational recordings with manually labeled turn-taking (shift) and hold scenarios. A turn-taking instance, which we call a shift, marks a point where one speaker hands over the conversation, while a hold scenario captures cases where the speaker continues after a brief pause, filler words, or unfinished context.</p>
<p>The dataset consists of 1,875 labeled audio samples, including a significant number of labeled shift and hold scenarios. Each audio file is annotated to include the silence at the end of a speaker's segment – either resulting in a turn shift or a hold. 
The test data was annotated according to multiple criteria, including context, intonation, filler words (e.g., "um," "am"), keywords (e.g., "but," "and"), and breathing patterns.</p>
<p>Below are the statistics on silence duration for each scenario type, as well as the distribution of shift and hold cases based on the mentioned criteria.</p>
<p><img loading="lazy" class="wp-image-21826 alignleft" src="https://krisp.ai/blog/wp-content/uploads/2025/08/last_silence_distribution_hold.png" alt="" width="480" height="288" /><img loading="lazy" class="wp-image-21827 alignleft" src="https://krisp.ai/blog/wp-content/uploads/2025/08/last_silence_distribution_shift.png" alt="" width="480" height="288" /></p>
<p><img loading="lazy" class="size-full wp-image-21828 aligncenter" src="https://krisp.ai/blog/wp-content/uploads/2025/08/distribution_shift_hold.png" alt="" width="760" height="385" /></p>
<h3>Training Dataset</h3>
<p>Our training dataset comprises approximately 2,000 hours of conversational speech, containing around 700,000 speaker turns.</p>
<h3><strong>Evaluation: Prediction Quality Metrics</strong></h3>
<p>To assess the performance of the turn-taking model, we used a combination of classification metrics and timing-based analysis:</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>TP</strong></td>
<td>True Positives: Correctly predicted positive-class cases</td>
</tr>
<tr>
<td><strong>TN</strong></td>
<td>True Negatives: Correctly predicted negative-class cases</td>
</tr>
<tr>
<td><strong>FP</strong></td>
<td>False Positives: Incorrectly predicted positive-class cases</td>
</tr>
<tr>
<td><strong>FN</strong></td>
<td>False Negatives: Missed positive-class cases</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Formula</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Precision</strong></td>
<td>TP / (TP + FP)</td>
<td>Proportion of predicted positives that are actually positive</td>
</tr>
<tr>
<td><strong>Recall</strong></td>
<td>TP / (TP + FN)</td>
<td>Proportion of actual 
positives correctly predicted</td>
</tr>
<tr>
<td><strong>Specificity</strong></td>
<td>TN / (TN + FP)</td>
<td>Proportion of actual negatives correctly predicted</td>
</tr>
<tr>
<td><strong>Balanced Accuracy</strong></td>
<td>(Recall + Specificity) / 2</td>
<td>Average performance across both classes (positive and negative)</td>
</tr>
<tr>
<td><strong>F1 Score</strong></td>
<td>2 × (Precision × Recall) / (Precision + Recall)</td>
<td>Harmonic mean of Precision and Recall; balances false positives and false negatives</td>
</tr>
</tbody>
</table>
<p><strong>AUC:</strong> The AUC is the area under the ROC curve; a higher AUC value indicates better classification performance. The ROC (receiver operating characteristic) curve shows the trade-off between the true-positive rate and the false-positive rate as the decision threshold is varied. For more details on AUC and related metrics, <a href="https://www.geeksforgeeks.org/machine-learning/auc-roc-curve/">read here</a>.</p>
<h3><strong>Evaluation: Latency vs. Accuracy Tradeoff (MST vs FPR)</strong></h3>
<p>There is a natural tradeoff between accuracy and latency, i.e., how quickly the system detects a true shift. We can reduce latency by lowering the threshold; however, this will likely increase the false-positive rate (FPR) and cause unwanted interruptions. 
On the other hand, we don't want to wait too long to predict a shift, because the increased latency results in awkward interaction (see the chart below).</p>
<p><img loading="lazy" class="aligncenter wp-image-21832" src="https://krisp.ai/blog/wp-content/uploads/2025/08/evaluation_latency_graph1.png" alt="" width="1200" height="692" /></p>
<p>Therefore, the latency-to-accuracy relationship is important, and we measure a TT system's latency by mean shift time (MST). The shift time is defined as the duration between the onset of silence and the moment the end-of-turn (shift) is predicted. If the model outputs a confidence score, the end-of-turn prediction can be controlled via a threshold. This makes the threshold an important control lever in the trade-off between reaction speed and prediction accuracy:</p>
<ul>
<li>Higher thresholds result in delayed shift predictions, which helps reduce false positives (i.e., shift detections during the current speaker's hold period, which lead to interruptions from the bot). 
However, this increases the mean shift time, making the system slower to respond.</li>
<li>Lower thresholds lead to faster responses, decreasing the mean shift time, but at the cost of increased false positives, potentially causing the bot to interrupt speakers prematurely.</li>
</ul>
<p>To visualize this trade-off, we plot the relationship between mean shift time (computed over end-of-speech cases) and the false-positive (interruption) rate as the threshold varies from 0 to 1, giving a comparative summary of the models. A lower curve indicates a faster mean response time for the same interruption rate – or, from another perspective, fewer interruptions for the same mean response time. Below you can see the corresponding plots for Krisp TT, SmartTurn v1, and SmartTurn v2. Note that we can't visualize such a chart for the VAD-based TT, as MST vs FPR requires a model that outputs a confidence score, whereas the VAD-based model produces binary outputs (0 or 1). The same limitation applies to the AUC-shift computation shown in the results table below.</p>
<p><img loading="lazy" class="aligncenter wp-image-21833" src="https://krisp.ai/blog/wp-content/uploads/2025/08/evaluation_latency_2.png" alt="" width="1200" height="602" /></p>
<p>This means that the Krisp TT model has a considerably faster average response time (0.9 vs. 
1.3 seconds at a 0.06 FPR) than SmartTurn when producing a true-positive answer.</p>
<p>To summarize the overall latency–accuracy tradeoff, we also compute the <strong>area under the MST vs FPR curve</strong>. This single scalar score captures the model's ability to respond quickly while minimizing interruptions across different thresholds. A lower area indicates better performance.</p>
<h3>Evaluation Results</h3>
<table>
<thead>
<tr>
<th><strong>Model</strong></th>
<th><strong>Balanced Accuracy</strong></th>
<th><strong>AUC Shift</strong></th>
<th><strong>F1 Score Shift</strong></th>
<th><strong>F1 Score Hold</strong></th>
<th><strong>AUC (MST vs FPR)</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Krisp TT</strong></td>
<td><strong>0.82</strong></td>
<td><strong>0.89</strong></td>
<td><strong>0.80</strong></td>
<td>0.83</td>
<td><strong>0.21</strong></td>
</tr>
<tr>
<td><strong>VAD-based TT</strong></td>
<td>0.59</td>
<td>–</td>
<td>0.48</td>
<td>0.70</td>
<td>–</td>
</tr>
<tr>
<td><strong>SmartTurn v1</strong></td>
<td>0.78</td>
<td>0.86</td>
<td>0.73</td>
<td><strong>0.84</strong></td>
<td>0.39</td>
</tr>
<tr>
<td><strong>SmartTurn v2</strong></td>
<td>0.78</td>
<td>0.83</td>
<td>0.76</td>
<td>0.78</td>
<td>0.44</td>
</tr>
</tbody>
</table>
<p>💡 It's important to note that the Krisp TT model delivers comparable predictive-quality metrics and a significantly better latency-vs-accuracy tradeoff <strong>while being 5–10x smaller</strong> and optimized to run efficiently on a CPU.</p>
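For reference, the metrics in the tables above can be computed as in the following sketch. The formulas follow the definitions given earlier; the trapezoidal integration for the MST-vs-FPR area is our assumption, as the article does not specify the exact numerical method.

```python
# Sketch of the evaluation metrics defined above. The trapezoidal rule for
# the MST-vs-FPR area is an assumption; the article does not specify it.

def balanced_accuracy(tp, tn, fp, fn):
    recall = tp / (tp + fn)           # true-positive rate (shift cases)
    specificity = tn / (tn + fp)      # true-negative rate (hold cases)
    return (recall + specificity) / 2

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc_mst_vs_fpr(points):
    """Area under an MST-vs-FPR curve, given (fpr, mst) points from a
    threshold sweep. Lower is better: faster responses, fewer interruptions."""
    pts = sorted(points)              # integrate over increasing FPR
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```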
<p>The VAD-based turn-taking approach is more lightweight, but it performs significantly worse than dedicated TT models – highlighting the importance of modeling the complex relationships between speech structure, acoustic features, and turn-taking behavior.</p>
<h2><strong>Demo</strong></h2>
<p>Here's a simple dialogue showing how Krisp's Turn-Taking model works in practice. In the demo, you'll hear intentional utterances, pauses, filler words, and interruptions. The response time you observe includes the Turn-Taking model's speed plus the latency of the speech-to-text (STT) system and the large language model (LLM).</p>
<h3>Krisp's Turn-Taking Model</h3>
<p><iframe src="https://drive.google.com/file/d/1kzUoNPexrTloE2KnSqoGLsf07aYbgptg/preview" width="720" height="405"></iframe></p>
<h3>Krisp's TT model vs Pipecat's SmartTurn v2</h3>
<p>This demo compares Krisp's Turn-Taking model with Pipecat's SmartTurn model (using SmartTurn's default, configurable 3-second forced-transition setting). To highlight the differences visually, we've also overlaid a speech-to-text transcript on the video.</p>
<p><iframe src="https://drive.google.com/file/d/1EyczUQ04FVgWfxxlzyLGnNhgV8yMzPXM/preview" width="720" height="405"></iframe></p>
<h2><strong>Future Plans</strong></h2>
<h3><strong>Improved Accuracy in TT</strong></h3>
<p>While this initial, audio-based TT model provides balanced accuracy and latency, it is limited to analyzing prosodic and acoustic features, such as changes in intonation, pitch, and rhythm. 
By also analyzing linguistic features, such as the syntactic completion of a sentence, we can further improve the accuracy of the TT model.</p>
<p>We plan to build the following features as well:</p>
<ul>
<li><strong>Text-based Turn-Taking:</strong> This model will use text-only input and predict end-of-turn with a custom neural network trained for this use case.</li>
<li><strong>Audio-Text Multimodal (Fusion):</strong> This model will use both audio and text inputs to leverage the strengths of both modalities and deliver the highest-accuracy end-of-turn prediction.</li>
</ul>
<p>Early prototypes show promising results, with the multimodal approach noticeably outperforming the audio-based turn-taking model.</p>
<h3><strong>Backchannel support</strong></h3>
<p>Backchannel detection is another challenge encountered during the development of Voice AI agents. A "backchannel" is a secondary, parallel form of communication that occurs alongside the primary conversation. It encompasses the responses a listener gives to a speaker to indicate they are paying attention, without taking over the main speaking role.</p>
<p>While interacting with an AI agent, the user may in some cases genuinely want to interrupt – to ask a question or shift the conversation. In other cases, they might simply be using backchannel cues like "right" or "okay" to signal that they're actively listening. The core challenge lies in distinguishing meaningful interruptions from casual acknowledgments.</p>
<p>Our roadmap includes the release of a reliable, dedicated backchannel prediction model.</p>
Krisp","isPartOf":{"@id":"https:\/\/krisp.ai\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/krisp.ai\/blog\/turn-taking-for-voice-ai\/#primaryimage"},"image":{"@id":"https:\/\/krisp.ai\/blog\/turn-taking-for-voice-ai\/#primaryimage"},"thumbnailUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2025\/08\/Turn-taking-image.png","datePublished":"2025-08-04T23:20:04+00:00","dateModified":"2025-08-08T09:34:18+00:00","description":"Krisp\u2019s new audio-only Turn-Taking model sets a new standard for Voice AI. See how it compares to SmartTurn and VAD models in real conversations.","breadcrumb":{"@id":"https:\/\/krisp.ai\/blog\/turn-taking-for-voice-ai\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/krisp.ai\/blog\/turn-taking-for-voice-ai\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/turn-taking-for-voice-ai\/#primaryimage","url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2025\/08\/Turn-taking-image.png","contentUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2025\/08\/Turn-taking-image.png","width":1000,"height":700},{"@type":"BreadcrumbList","@id":"https:\/\/krisp.ai\/blog\/turn-taking-for-voice-ai\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/krisp.ai\/blog\/"},{"@type":"ListItem","position":2,"name":"Audio-only, 6M weights Turn-Taking model for Voice AI 
Agents"}]},{"@type":"WebSite","@id":"https:\/\/krisp.ai\/blog\/#website","url":"https:\/\/krisp.ai\/blog\/","name":"Krisp","description":"Blog","publisher":{"@id":"https:\/\/krisp.ai\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/krisp.ai\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/krisp.ai\/blog\/#organization","name":"Krisp","url":"https:\/\/krisp.ai\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png","contentUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png","width":696,"height":696,"caption":"Krisp"},"image":{"@id":"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/krispHQ\/","https:\/\/x.com\/krispHQ","https:\/\/www.linkedin.com\/company\/krisphq\/","https:\/\/www.youtube.com\/channel\/UCAMZinJdR9P33fZUNpuxXtg"]},{"@type":"Person","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/person\/e9f59158d89de3002958d323d2e788f5","name":"Krisp Engineering Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/26475ad8219056696662f819691ee49d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/26475ad8219056696662f819691ee49d?s=96&d=mm&r=g","caption":"Krisp Engineering 
Team"},"url":"https:\/\/krisp.ai\/blog\/author\/eng-team\/"}]}},"_links":{"self":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/21824"}],"collection":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/users\/71"}],"replies":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/comments?post=21824"}],"version-history":[{"count":25,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/21824\/revisions"}],"predecessor-version":[{"id":21859,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/21824\/revisions\/21859"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/media\/21942"}],"wp:attachment":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/media?parent=21824"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/categories?post=21824"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/tags?post=21824"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}