krisp

What’s behind great Accent Conversion technology

This document is intended to help contact center operators compare the quality and performance of Krisp AI Accent Conversion (also known as Accent Translation, Accent Neutralization, or Accent Smoothing) with Sanas’s offering.

Enhanced voice quality in agent-customer interactions, driven by accent conversion, generates ROI through lower average handle time (AHT), higher first-call resolution (FCR), and improved customer and employee satisfaction (CSAT and ESAT) scores.

Krisp and Sanas applications are deployed on the agents’ desktops and function as virtual microphones and speakers, working as companion applications with calling platforms. Delivering on the promise of smoothing difficult accents while maintaining clear voice quality in real time is challenging and takes years to perfect.

There are a few technical challenges that make the task difficult:

  • Removing background noises and voices
  • Synthesizing natural-sounding speech
  • Ensuring accurate pronunciation of different words
  • Conveying emotions
  • Doing all the above in various real-life situations (fast speech, acoustic conditions, different speakers, etc.)

Krisp launched its first commercial application in contact centers in 2019 and has processed over 1 trillion minutes of voice calls. Today, Krisp is deployed across many BPOs and top-tier enterprise call centers, and is also integrated into voice applications with more than 200 million users across desktop and mobile devices.

The table below highlights the key performance and management requirements for delivering tier-1 voice fidelity that scales globally within contact centers.

Krisp vs Sanas

Speech naturalness

| Feature | Krisp | Sanas |
| --- | --- | --- |
| Current deployments | Over 200 million desktop and mobile devices; over 200K contact center agents; over 1 trillion minutes of Krisp-processed voice; embedded into world-class services such as Vonage, RingCentral, Zoho, Aircall, Discord, others | Over 30K agents |

Accent Conversion robustness

| Feature | Krisp | Sanas |
| --- | --- | --- |
| Supported accent packs | Indian English; Filipino English | Indian English; Filipino English |
| Audio latency | 220 ms | 350–450 ms |
| Modes of operation | Voice Preservation mode – fully preserves the user’s voice; Voice Profiles mode – allows the user to choose a natural-sounding output voice | Voice Preservation mode – somewhat preserves the user’s voice |
| Scalable range of output voices | Yes; can generate new voices in Voice Profiles mode | No; limited to the user’s voice |
| Accent leakage | Some leakage in Voice Preservation mode; no leakage in Voice Profiles mode | Consistently observed leakage |
| Background noise and voice cancellation robustness | Highly robust, automatically included in the Accent Conversion models | Very limited |
| Agent and customer-side noise cancellation | Bi-directional, automatically included in the Accent Conversion models | Customer-side only |
| Headset robustness | Highly robust | Requires specific headsets |
| Robustness across users | Works consistently across all users | Requires testing three different versions for each user |
| Wrong pronunciations | Some | Noticeably more frequent |
| Preserves user’s voice | Yes | Limited |
| User enrollment needed | No | No |
| Dynamic adaptation to new speakers | Yes, within the same or different call, regardless of the gender | Unknown; requires an output voice gender selection |
| Voice quality | 16 kHz (wide-band, VoIP, industry-leading voice quality) | 8 kHz only |

Noise Cancellation robustness

| Feature | Krisp | Sanas |
| --- | --- | --- |
| Voice quality and noise cancellation | World’s best, based on objective and subjective tests (see and hear) | New entrant, tests show noise leakage and voice quality degradation (see and hear) |
| Agent-side Background Voice Cancellation | World’s best (see test measurements) | Other voices and background chatter leakage when in a typical loud call center |
| Agent-side Noise Cancellation | World’s best (see test measurements) | Adequate performance for low-volume noises (fan, for example); noise leakage and voice degradation in contact center environments (other voices, loud chatter) |
| Customer-side Noise Cancellation | Included; optimized for inbound voice from mobile or landline; pass-through of ringtones, dialtones, etc. | Not available |
| Acoustic Echo Cancellation | Included; optimized for call center use cases | Not available |
| Voice quality | 8 kHz (narrow-band, standard telephony, good voice quality); 16 kHz (wide-band, VoIP, industry-leading voice quality); 32 kHz (full-band, best voice quality – near studio-grade) | 8 kHz only |

Application and audio drivers robustness

| Feature | Krisp | Sanas |
| --- | --- | --- |
| CPU utilization | Supports range of CPUs typically in agent desktops; supports older, lower-end CPUs through smaller models; has auto-switching between models based on CPU load | Single model uses 2x more than Krisp on i5 8th Gen CPU; error message in Sanas app with older CPUs; slightly higher CPU utilization for CPUs beyond i5 12th Gen |
| Audio drivers | Highly reliable and tested for 7+ years | Users often need to restart the drivers to avoid breakdown of mic and speaker audio streams |
| Headset and application compatibility | Compatible and tested with most headsets and voice applications used in call centers | New entrant, minimal deployments and testing |

Management and deployment at scale

| Feature | Krisp | Sanas |
| --- | --- | --- |
| Supported platforms | Win, Mac, Linux, Chrome, VDI | Win |
| Installation package | Single installation package including all accent packs and noise cancellation | A separate package for different accent packs; a separate package for noise cancellation |
| SSO authentication | Available for agents, per the enterprise customers’ requirements; SSO/SCIM for automated provisioning and deprovisioning, saving admins’ time | Not available for users (agents); only available for admins |
| Remote deployment and settings for admins | Highly scalable | Very limited |
| App version management and auto-update | Highly scalable | Very limited |
| Analytics for Accent Conversion, Noise Cancellation and platform usage | Available | Not available |
| Enterprise-grade support | 24/7; application and IT infrastructure expertise during pilots and post-launch, including VDI | 24/7; limited |

Krisp vs. Sanas: in-depth accent conversion evaluation

In this section, a detailed evaluation of Krisp and Sanas Accent Conversion technologies is presented. It covers the methodology used for both subjective and objective comparisons, along with quantitative results across key speech quality metrics. The goal is to assess real-world performance, highlight strengths and limitations, and inform product decisions with reliable benchmarks.

Evaluations presented in the following section have been conducted for the Filipino English accent pack.

Subjective evaluation

The evaluation was conducted across two structured tracks: expert panel ratings and crowdsourced listener preferences, designed to capture both technical precision and human perception.

1. Expert panel evaluation

Six expert evaluators independently rated 70 matched audio pairs, each consisting of the same original recording converted by Krisp and by Sanas. The recordings represented a diverse set of male and female speakers and input conditions, including but not limited to:

  • Accent level – high, medium, low
  • Speech rates and fluency
  • Background conditions (quiet, noisy, multi-speaker environments)

Each recording was processed with Krisp and Sanas’ accent conversion software using a combination of VB-Audio Virtual Cable and Audacity tools to generate a pair of matching recordings.

Evaluators scored each recording across four qualitative dimensions using a 5-point Likert scale:

 

| Score | Meaning |
| --- | --- |
| 5 | Excellent / Native-like |
| 4 | Very Good |
| 3 | Acceptable |
| 2 | Needs Improvement |
| 1 | Poor / Unintelligible |

 

To eliminate bias:

  • File names were anonymized (no brand markers)
  • The order of samples was randomized
  • Scoring was blind and individual (no group discussion)
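
The blinding steps above can be sketched in a few lines. This is only an illustration of the idea, not the tooling actually used for the evaluation: the function name, the `sample_NNN_A/B.wav` naming scheme, and the input paths are all hypothetical.

```python
import random

def blind_pairs(pairs, seed=0):
    """Anonymize and shuffle matched recordings for blind scoring.

    pairs: list of (krisp_path, sanas_path) tuples (hypothetical layout).
    Returns (trials, key): trials is the shuffled list of anonymized file
    names shown to evaluators; key maps each anonymized name back to
    (system, original_path) and is kept away from the evaluators.
    """
    rng = random.Random(seed)
    trials, key = [], {}
    for i, (krisp_path, sanas_path) in enumerate(pairs):
        items = [("krisp", krisp_path), ("sanas", sanas_path)]
        rng.shuffle(items)  # which system becomes "A" or "B" varies per pair
        for slot, (system, path) in zip("AB", items):
            anon_name = f"sample_{i:03d}_{slot}.wav"  # no brand marker
            key[anon_name] = (system, path)
            trials.append(anon_name)
    rng.shuffle(trials)  # randomize presentation order
    return trials, key

if __name__ == "__main__":
    demo = [(f"krisp/{i}.wav", f"sanas/{i}.wav") for i in range(3)]
    order, answer_key = blind_pairs(demo, seed=42)
    print(order)
```

Keeping the answer key separate from the shuffled list is what lets scores be mapped back to each system only after all ratings are in.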

2. Crowdsourced A/B testing

To further simulate real-world user perception, a blind A/B test was run with a subset of 57 real-world, anonymized audio pairs. Each pair of recordings was voted on exactly 60 times.

Each respondent was asked: “Which voice sounds more natural?”

In total, 3,420 responses were gathered (57 pairs × 60 votes each), enough to give statistically meaningful insight into the perceived naturalness of the two accent conversion solutions. Each participant evaluated randomly selected samples, with no access to brand or source information.

Evaluation metrics

Accent Conversion performance was measured across four key subjective and objective dimensions. These were selected based on real-world call center priorities such as clarity, naturalness, and robustness.

| Metric | Description |
| --- | --- |
| Accent Conversion | How effectively the speaker’s original accent is transformed into a neutral or target accent. High scores mean minimal accent leakage or trace of the original pronunciation. |
| Speech Clarity | Evaluates articulation, intelligibility, and absence of audio distortions, such as mumbling, muffling, or low vocal energy. |
| Natural Speech | Measures how closely the output resembles fluid, human-like speech, including natural variations in pitch, tone, rhythm, and intonation. |
| Background Noise/Voice Robustness | Assesses the system’s ability to isolate the speaker’s voice and maintain quality when external noises or secondary voices are present. |

Objective evaluation methodology and metrics

To complement human evaluations, a structured objective analysis was conducted using state-of-the-art tools to quantify speech quality and pronunciation accuracy in Krisp and Sanas accent-converted outputs. These metrics offer an additional, unbiased perspective into the perceptual and technical performance of each solution.

For the objective evaluation, the same 70 pairs of recordings were processed with Meta Audiobox Aesthetics, capturing metrics that correlate strongly with Natural Speech and Speech Clarity.

Accent conversion often alters pronunciation patterns. To quantify how each system impacts phoneme accuracy, all recordings were also processed using the Facebook NN Phonemizer to compute a phoneme error rate (PER), a measure that correlates strongly with the Accent Conversion metric.

| Objective Metric | Interpretation | Highly Correlated Subjective Metric | What It Captures |
| --- | --- | --- | --- |
| Production Quality* | Higher is better | Speech Clarity | Fidelity, presence of audio artifacts, balance, and clarity of the output signal |
| Content Enjoyment* | Higher is better | Natural Speech | Perceived naturalness, fluidity, and enjoyment of listening, akin to human listening satisfaction |
| Phoneme Error Rate (PER) | Lower is better | Accent Conversion | Measures pronunciation distortion. Lower scores mean more accurate, intelligible speech with better articulation. |

* – These metrics are derived from waveform-level analysis and do not require transcript or linguistic alignment, making them ideal for evaluating accent conversion outputs that vary in delivery and prosody.
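
For readers unfamiliar with PER, the sketch below shows the conventional definition: the edit distance between a reference phoneme sequence and the recognized one, divided by the reference length. The document does not describe how reference phonemes were obtained or which exact formula was used, so treat this as an assumption. The phoneme strings are illustrative transcriptions of the “travel” / “trouble” confusion mentioned in the audio samples, not data from the evaluation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """PER = edit distance / number of reference phonemes."""
    if not ref:
        raise ValueError("reference phoneme sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)

if __name__ == "__main__":
    intended = ["t", "r", "æ", "v", "ə", "l"]  # "travel" (illustrative)
    produced = ["t", "r", "ʌ", "b", "ə", "l"]  # "trouble" (illustrative)
    print(f"PER = {phoneme_error_rate(intended, produced):.1%}")
```

Here two of six phonemes are substituted, so the PER for this single word is one third. The reported figures would be the same kind of ratio computed over the full set of recordings.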

Evaluation results

The following table summarizes the subjective and objective performance of Krisp vs. Sanas across key evaluation dimensions:

| Metric | Type | Krisp | Sanas | Winner |
| --- | --- | --- | --- | --- |
| Accent Conversion | Subjective | 3.6/5 | 3.0/5 | Krisp |
| Natural Speech | Subjective | 3.7/5 | 3.6/5 | Near tie |
| Speech Clarity | Subjective | 4.3/5 | 3.7/5 | Krisp |
| Background Noise/Voice Robustness | Subjective | 4.6/5 | 3.9/5 | Krisp |
| “Which recording sounds more natural?” Preferred by (# votes / total responses) | Subjective | 1875/3420 | 1545/3420 | Krisp |
| Natural Speech* | Objective | 5.8/10 | 4.7/10 | Krisp |
| Speech Clarity* | Objective | 7.5/10 | 5.3/10 | Krisp |
| Phoneme Error Rate (PER) | Objective | 29.3% | 40.7% | Krisp |
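
One way to sanity-check the crowdsourced vote split (1875 of 3420 for Krisp) is a two-sided test against an even 50/50 split, using the normal approximation to the binomial. The source does not say which statistical test, if any, was applied, so this is only a sketch. It also assumes every vote is independent, which is a simplification because each pair was voted on 60 times and respondents may have rated several pairs.

```python
import math

def preference_test(wins, total):
    """Two-sided z-test of a preference share against 0.5.

    Returns (share, z, p_value, ci95), where ci95 is a Wald 95% interval.
    Assumes independent votes (a simplification, see the note above).
    """
    share = wins / total
    se_null = math.sqrt(0.25 / total)            # std. error under p = 0.5
    z = (share - 0.5) / se_null
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail
    se = math.sqrt(share * (1 - share) / total)
    ci95 = (share - 1.96 * se, share + 1.96 * se)
    return share, z, p_value, ci95

if __name__ == "__main__":
    share, z, p, (lo, hi) = preference_test(1875, 3420)
    print(f"Krisp preferred in {share:.1%} of votes (95% CI {lo:.1%} to {hi:.1%})")
    print(f"z = {z:.2f}, p = {p:.1e}")
```

Under these assumptions the Krisp share is about 54.8%, with an interval that stays above 50%, so the preference is unlikely to be chance. The margin itself is modest, which matches the near tie on subjective Natural Speech.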

Main Takeaways

  • Krisp leads on every metric measured, both human-perceived and objective, with subjective Natural Speech the only near tie (3.7 vs. 3.6), showing superior clarity, intelligibility, and accent transformation accuracy.
  • Accent Conversion: Krisp delivers more effective accent neutralization with fewer traces of the original pronunciation. Sanas often leaks source accent elements and produces less consistent results across varied speakers and speech patterns.
  • Speech Clarity & Phoneme Accuracy: Krisp-converted speech is significantly easier to understand. Sanas samples frequently exhibit muffled segments or slurred phonemes, which negatively affect comprehension and usability in customer support settings.
  • Background Noise Robustness: Krisp maintains speech quality in real-world noisy conditions, including multi-speaker and contact center environments. Sanas, by contrast, is more susceptible to background voice leakage — a potential liability for call quality and privacy.
  • Audio Quality and Bandwidth: Krisp outputs at 16 kHz wideband audio, providing richer, more intelligible voice quality, especially in modern platforms like Zoom, MS Teams, and G.722-based telephony. Sanas outputs audio at 8 kHz, which can degrade quality in high-bandwidth environments and limit downstream use in QA systems.
  • Compatibility and Headset Dependence: Sanas performance appears dependent on specific headsets to avoid secondary voice artifacts. Krisp, by contrast, is hardware-agnostic and built with a robust, production-grade Background Voice Cancellation AI model.
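
The bandwidth point in the list above follows from the sampling theorem: audio sampled at a given rate can only carry frequencies up to half that rate. A small illustration, where the function name is made up for this example and is not part of any Krisp or Sanas API:

```python
def audio_bandwidth_hz(sample_rate_hz: int) -> float:
    """Highest audio frequency representable at a sample rate (Nyquist limit)."""
    return sample_rate_hz / 2

if __name__ == "__main__":
    for rate in (8_000, 16_000, 32_000):
        limit = audio_bandwidth_hz(rate)
        print(f"{rate} Hz sampling -> audio content up to {limit:.0f} Hz")
```

At 8 kHz sampling, everything above 4 kHz is discarded, including much of the energy in consonants such as “s” and “f”. That lost content is why 16 kHz wideband audio tends to sound clearer.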

 

Comparative audio samples

| # | Observation | Audio |
| --- | --- | --- |
| 1 | Strong accent leakage in Sanas; pronunciation error of “across”, “travel” in Sanas; Krisp version is more natural and easier to understand; Krisp fixed “trouble” in original speech to “travel” | Original, Sanas, Krisp |
| 2 | Strong accent leakage in Sanas; slurred and unintelligible speech in Sanas; Krisp version is more natural and easier to understand | Original, Sanas, Krisp |
| 3 | Robotic, slurred and unintelligible speech in Sanas; Krisp version is more natural and easier to understand | Original, Sanas, Krisp |
| 4 | Pronunciation errors of “interested”, “especially”, “gives” in Sanas speech; muffled “every day” in Sanas; strong accent leakage in Sanas; pronunciation error of “hobbies” in original, Krisp, Sanas versions | Original, Sanas, Krisp |
| 5 | Muffled output on “a smooth” in Sanas; better naturalness, higher quality voice in Krisp | Original, Sanas, Krisp |
| 6 | Strong accent leakage in Sanas; pronunciation errors of “range” in Sanas | Original, Sanas, Krisp |
| 7 | Pronunciation errors of “permission”, “dialer” in Sanas; Krisp version is more natural and easier to understand | Original, Sanas, Krisp |
| 8 | Strong accent leakage in Sanas; Krisp fixed “trouble” in original speech to “travel”, Sanas did not | Original, Sanas, Krisp |
| 9 | Strong accent leakage in Sanas; muffled “financial” in Sanas; Krisp version is more natural and easier to understand | Original, Sanas, Krisp |
| 10 | Strong accent leakage in Sanas; pronunciation errors of “support”, “workforce” in Sanas; muffled “more”, “Colombia” in Sanas | Original, Sanas, Krisp |
| 11 | Strong secondary voice leakage in Sanas; all secondary voices cleaned in Krisp | Original, Sanas, Krisp |
| 12 | Strong secondary voice leakage in Sanas; all secondary voices cleaned in Krisp | Original, Sanas, Krisp |
| 13 | Moderate accent leakage in Sanas; muffled words in Sanas; Krisp audio is easier to understand | Original, Sanas, Krisp |
| 14 | Muffled output on “book an appointment” in Sanas; excellent naturalness and smoothness in Krisp; pronunciation error in “checkup” in Original, Sanas, and Krisp | Original, Sanas, Krisp |
| 15 | Pronunciation error in “favorite” in Sanas; pronunciation error in “for me” in Original, Sanas, and Krisp | Original, Sanas, Krisp |
| 16 | Muffled “sign the order” in Sanas; more stable volume in Krisp | Original, Sanas, Krisp |
| 17 | Pronunciation errors of “professional”, “tech”, “more” in Sanas; low volume in Krisp; pronunciation error in “tech” | Original, Sanas, Krisp |

 

Krisp is a trusted vendor on G2

With over 500 reviews on G2, Krisp consistently excels in enhancing customer interactions for service teams. G2, a trusted platform for software reviews and assessments, showcases Krisp’s exceptional 4.7 rating—earned through the trust and endorsement of hundreds of verified professionals across diverse industries.

Check Krisp’s page on G2 here.

 

Krisp Voice AI Platform for call centers

Krisp is the only real-time Voice AI platform that covers every stage of the agent experience—before, during, and after the call—within a single, lightweight application. It eliminates the need to juggle multiple tools and services by delivering core capabilities like Noise Cancellation, Accent Conversion, Live Interpretation, real-time agent assist, and post-call summaries in one seamless interface.

 

Agents with pronounced English accents can benefit from Accent Conversion, which enhances comprehension in calls without altering their original voice. The same agents can handle international calls using Live Interpreter, enabling real-time multilingual conversations across 80+ languages with one click, directly in the Krisp app. This flexibility removes hiring constraints and the need for standard language line services, and allows teams to scale globally without friction.

 

During the call, Krisp Agent Copilot provides real-time transcripts, key moment capture, and access to company and industry-specific knowledge via AI Chat, boosting confidence and precision. After the call, automatic summaries and reports help streamline follow-ups and coaching. All of this is centrally managed, with analytics and policy controls available in a unified Admin Portal.

The Krisp platform integrates with the agent’s desktop and works with all CCaaS and calling applications, delivering call quality that translates to much better CSAT and related contact center KPIs.

 

Conclusion

While both Krisp and Sanas are innovators in the Accent Conversion space, Krisp stands out as the enterprise-ready solution trusted by global contact centers.

Krisp’s Accent Conversion consistently delivers clearer, more natural, and more intelligible speech, with significantly lower accent leakage and superior performance in noisy, real-world environments. Across both subjective human evaluations and objective acoustic metrics, Krisp leads on all critical dimensions—accent conversion, speech clarity, and background noise robustness.

Beyond voice quality, Krisp offers:

  • Superior deployment flexibility, working seamlessly across any headset, desktop, or CCaaS platform, with no hardware or system limitations.
  • Built-in voice and noise cancellation, tested over 1 trillion minutes, eliminating the need for additional tools or packages.
  • Enterprise-grade reliability with SSO, auto-updates, analytics, and remote configuration for admins—essential for scaled contact center rollouts.
  • A full Voice AI platform in one app: Live Interpreter, Accent Conversion, Agent Copilot, Auto QA, and real-time knowledge—all integrated with one-click simplicity.

Sanas requires device-specific setups, lacks compatibility in key environments, and struggles with accent consistency and audio fidelity, especially in enterprise use cases.

💡 If your goal is to scale globally, serve diverse customers, and ensure your agents are clearly understood — Krisp is the clear choice for Accent Conversion that works, performs, and scales.
