What’s behind great Accent Conversion technology
This document is intended to help contact center operators compare the quality and performance of Krisp AI Accent Conversion (also known as Accent Translation, Accent Neutralization, or Accent Smoothing) with Sanas’s offering.
Enhanced voice quality in agent-customer interactions, driven by accent conversion, generates ROI through lower average handle time (AHT), higher first-call resolution (FCR), and improved customer and employee satisfaction (CSAT and ESAT) scores.
Krisp and Sanas applications are deployed on the agents’ desktops and function as virtual microphones and speakers, working as companion applications with calling platforms. Delivering on the promise of smoothing difficult accents while maintaining clear voice quality in real time is challenging and takes years to perfect.
There are a few technical challenges that make the task difficult:
- Removing background noises and voices
- Synthesizing natural-sounding speech
- Ensuring accurate pronunciation of different words
- Conveying emotions
- Doing all of the above in various real-life situations (fast speech, varied acoustic conditions, different speakers, etc.)
Krisp launched its first commercial application in contact centers in 2019 and has processed over 1 trillion minutes of voice calls. Today, Krisp is deployed across many BPOs and top-tier enterprise call centers, and is also integrated into voice applications with more than 200 million users across desktop and mobile devices.
The table below highlights the key performance and management requirements for delivering tier-1 voice fidelity that scales globally within contact centers.
Krisp vs. Sanas
Speech naturalness | Krisp | Sanas |
---|---|---|
Current deployments | | |
Accent Conversion robustness |||
Supported accent packs | | |
Audio Latency | 220 ms | 350-450 ms |
Modes of operation | | Voice Preservation mode – somewhat preserves the user’s voice |
Scalable range of output voices | Yes. Can generate new voices in Voice Profiles mode | No. Limited to the user’s voice |
Accent leakage | | Consistently observed leakage |
Background noise and voice cancellation robustness | Highly robust, automatically included in the Accent Conversion models | Very limited |
Agent and customer-side noise cancellation | Bi-directional, automatically included in the Accent Conversion models | Customer-side only |
Headset robustness | Highly robust | Requires specific headsets |
Robustness across users | Works consistently across all users | Requires testing three different versions for each user |
Wrong pronunciations | Some | Noticeably more frequent |
Preserves user’s voice | Yes | Limited |
User enrollment needed | No | No |
Dynamic adaptation to new speakers | Yes, within the same or different call, regardless of gender | Unknown. Requires an output voice gender selection |
Voice quality | 16 kHz (wide-band, VoIP, industry-leading voice quality) | 8 kHz only |
Noise Cancellation robustness |||
Voice quality and noise cancellation | World’s best, based on objective and subjective tests (see and hear) | New entrant, tests show noise leakage and voice quality degradation (see and hear) |
Agent-side Background Voice Cancellation | World’s best (see test measurements) | Other voices and background chatter leakage when in a typical loud call center |
Agent-side Noise Cancellation | World’s best (see test measurements) | Adequate performance for low-volume noises (fan, for example). Noise leakage and voice degradation in contact center environments (other voices, loud chatter) |
Customer-side Noise Cancellation | Included. Optimized for inbound voice from mobile or landline. Pass-through of ringtones, dial tones, etc. | Not available |
Acoustic Echo Cancellation | Included. Optimized for call center use cases | Not available |
Voice quality | | 8 kHz only |
Application and audio drivers robustness |||
CPU utilization | | |
Audio drivers | Highly reliable and tested for 7+ years | Users often need to restart the drivers to avoid breakdown of mic and speaker audio streams. |
Headset and application compatibility | Compatible and tested with most headsets and voice applications used in call centers | New entrant, minimal deployments and testing |
Management and deployment at scale |||
Supported platforms | Win, Mac, Linux, Chrome, VDI | Win |
Installation package | Single installation package including all accent packs and noise cancellation | A separate package for different accent packs. A separate package for noise cancellation |
SSO authentication | | |
Remote deployment and settings for admins | Highly Scalable | Very Limited |
App version management and auto-update | Highly Scalable | Very Limited |
Analytics for Accent Conversion, Noise Cancellation and platform usage | Available | Not available |
Enterprise-Grade Support | 24/7. Application and IT infrastructure expertise during pilots and post-launch, including VDI | 24/7. Limited |
Krisp vs. Sanas: in-depth accent conversion evaluation
This section presents a detailed evaluation of Krisp and Sanas Accent Conversion technologies. It covers the methodology used for both subjective and objective comparisons, along with quantitative results across key speech quality metrics. The goal is to assess real-world performance, highlight strengths and limitations, and inform product decisions with reliable benchmarks.
All evaluations presented in this section were conducted on the Filipino English accent pack.
Subjective evaluation
The evaluation was conducted across two structured tracks: expert panel ratings and crowdsourced listener preferences, designed to capture both technical precision and human perception.
1. Expert panel evaluation
Six expert evaluators independently rated 70 matched audio pairs, each pair consisting of the same original voice converted by Krisp and by Sanas. The recordings represented a diverse set of male and female speakers and input conditions, including, but not limited to:
- Accent level – high, medium, low
- Speech rates and fluency
- Background conditions (quiet, noisy, multi-speaker environments)
Each recording was processed with both Krisp’s and Sanas’s accent conversion software, using a combination of VB-Audio Virtual Cable and Audacity to generate a pair of matching recordings.
Evaluators scored each recording across four qualitative dimensions using a 5-point Likert scale:
Score | Meaning |
---|---|
5 | Excellent / Native-like |
4 | Very Good |
3 | Acceptable |
2 | Needs Improvement |
1 | Poor / Unintelligible |
To eliminate bias:
- File names were anonymized (no brand markers)
- The order of samples was randomized
- Scoring was blind and individual (no group discussion)
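The blinding steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tooling used in the evaluation: the function name `blind_pairs`, the `sample_NNN.wav` naming scheme, and the example file paths are all hypothetical.

```python
import random

def blind_pairs(pairs, seed=0):
    """Anonymize and shuffle converted recordings for blind rating.

    pairs: list of (krisp_path, sanas_path) tuples (hypothetical paths).
    Returns (blinded_names, answer_key). Evaluators only see blinded_names;
    the answer key is kept private until scoring is finished.
    """
    rng = random.Random(seed)  # fixed seed so the session can be reproduced
    items = []
    for pair_idx, (krisp_path, sanas_path) in enumerate(pairs):
        items.append({"pair": pair_idx, "system": "krisp", "path": krisp_path})
        items.append({"pair": pair_idx, "system": "sanas", "path": sanas_path})
    rng.shuffle(items)  # randomize presentation order

    blinded_names, answer_key = [], {}
    for n, item in enumerate(items):
        anon_name = f"sample_{n:03d}.wav"  # no brand marker in the file name
        blinded_names.append(anon_name)
        answer_key[anon_name] = item
    return blinded_names, answer_key

if __name__ == "__main__":
    names, key = blind_pairs([("k0.wav", "s0.wav"), ("k1.wav", "s1.wav")], seed=1)
    print(names)
```

Keeping the answer key separate from the renamed files is what lets each evaluator score individually without knowing which system produced a sample.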
2. Crowdsourced A/B testing
To further simulate real-world user perception, a blind A/B test was run with a subset of 57 real-world, anonymized audio pairs. Each pair of recordings was voted on exactly 60 times.
Each respondent was asked: “Which voice sounds more natural?”
In total, 3,420 responses were gathered (57 pairs × 60 votes each), a sample large enough to support statistically meaningful conclusions about the perceived naturalness of the two accent conversion solutions. Each participant evaluated randomly selected samples, with no access to brand or source information.
Evaluation metrics
Accent Conversion performance was measured across four key subjective and objective dimensions. These were selected based on real-world call center priorities such as clarity, naturalness, and robustness.
Metric | Description |
---|---|
Accent Conversion | How effectively the speaker’s original accent is transformed into a neutral or target accent. High scores mean minimal accent leakage or trace of the original pronunciation. |
Speech Clarity | Evaluates articulation, intelligibility, and absence of audio distortions, such as mumbling, muffling, or low vocal energy. |
Natural Speech | Measures how closely the output resembles fluid, human-like speech, including natural variations in pitch, tone, rhythm, and intonation. |
Background Noise/Voice Robustness | Assesses the system’s ability to isolate the speaker’s voice and maintain quality when external noises or secondary voices are present. |
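The text does not state how individual expert ratings are combined into a single score per dimension. One plausible approach, shown here purely as an assumption, is a simple mean over all evaluators and recordings. The function name `mean_scores`, the system labels, and the sample ratings are hypothetical and are not evaluation data.

```python
from collections import defaultdict
from statistics import mean

def mean_scores(ratings):
    """Average 5-point Likert ratings per (system, dimension).

    ratings: iterable of (system, dimension, score) tuples, score in 1..5.
    Returns {(system, dimension): mean score rounded to one decimal}.
    """
    buckets = defaultdict(list)
    for system, dimension, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of Likert range: {score}")
        buckets[(system, dimension)].append(score)
    return {k: round(mean(v), 1) for k, v in buckets.items()}

if __name__ == "__main__":
    # Hypothetical ratings from two evaluators, for illustration only.
    example = [
        ("system_a", "Speech Clarity", 4),
        ("system_a", "Speech Clarity", 5),
        ("system_b", "Speech Clarity", 3),
        ("system_b", "Speech Clarity", 4),
    ]
    print(mean_scores(example))
```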
Objective evaluation methodology and metrics
To complement human evaluations, a structured objective analysis was conducted using state-of-the-art tools to quantify speech quality and pronunciation accuracy in Krisp and Sanas accent-converted outputs. These metrics offer an additional, unbiased perspective into the perceptual and technical performance of each solution.
For objective evaluation, the same 70 pairs of recordings were processed with Meta Audiobox Aesthetics, which produces metrics that correlate strongly with Natural Speech and Speech Clarity.
Accent conversion often alters pronunciation patterns. To quantify how each system impacts phoneme accuracy, all recordings were also processed using the Facebook NN Phonemizer to compute a Phoneme Error Rate (PER), which correlates strongly with the Accent Conversion metric.
Objective Metric | Interpretation | Highly Correlated to Subjective Metric | What It Captures |
---|---|---|---|
Production Quality* | Higher is better | Speech Clarity | Fidelity, presence of audio artifacts, balance, and clarity of the output signal |
Content Enjoyment* | Higher is better | Natural Speech | Perceived naturalness, fluidity, and enjoyment of listening — akin to human listening satisfaction |
Phoneme Error Rate (PER) | Lower is better | Accent Conversion | Measures pronunciation distortion. Lower scores mean more accurate, intelligible speech with better articulation. |
* – These metrics are derived from waveform-level analysis and do not require transcript or linguistic alignment, making them ideal for evaluating accent conversion outputs that vary in delivery and prosody.
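For readers unfamiliar with Phoneme Error Rate, the sketch below uses the standard definition: the edit distance (substitutions, insertions, deletions) between a reference phoneme sequence and the recognized one, divided by the reference length. It is a generic illustration, not the exact pipeline used in this evaluation. The ARPAbet-style phoneme lists for "travel" and "trouble" are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of an extra phoneme
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """PER = edit distance / number of reference phonemes (lower is better)."""
    if not ref:
        raise ValueError("reference phoneme sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)

if __name__ == "__main__":
    # Intended word "travel" heard as "trouble" (illustrative phonemes).
    reference = ["T", "R", "AE", "V", "AH", "L"]
    recognized = ["T", "R", "AH", "B", "AH", "L"]
    print(f"PER = {phoneme_error_rate(reference, recognized):.1%}")
```

Because PER is normalized by the reference length, it can exceed 100% when the output contains many inserted phonemes.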
Evaluation results
The following table summarizes the subjective and objective performance of Krisp vs. Sanas across key evaluation dimensions:
Metric | Type | Krisp | Sanas | Winner |
---|---|---|---|---|
Accent Conversion | Subjective | ✅ 3.6/5 | ❌ 3.0/5 | Krisp |
Natural Speech | Subjective | 🟰 3.7/5 | 🟰 3.6/5 | Near Tie |
Speech Clarity | Subjective | ✅ 4.3/5 | ❌ 3.7/5 | Krisp |
Background Noise/Voice Robustness | Subjective | ✅ 4.6/5 | ❌ 3.9/5 | Krisp |
Which recording sounds more natural? Preferred by (# votes / total responses) | Subjective | ✅ 1875/3420 | ❌ 1545/3420 | Krisp |
Natural Speech* | Objective | ✅ 5.8/10 | ❌ 4.7/10 | Krisp |
Speech Clarity* | Objective | ✅ 7.5/10 | ❌ 5.3/10 | Krisp |
Phoneme Error Rate (PER) | Objective | ✅ 29.3% | ❌ 40.7% | Krisp |
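As a rough sanity check on the A/B vote split in the table (1,875 of 3,420 votes), the sketch below runs a two-sided z-test against a 50/50 null hypothesis using the normal approximation to the binomial. It assumes every vote is independent, which is a simplification: votes cluster by audio pair and by listener, so the result should be read as indicative rather than exact. The function name is hypothetical.

```python
import math

def two_sided_z_test(successes, total, p0=0.5):
    """Normal-approximation test of an observed share against p0.

    Returns (observed_share, z_score, two_sided_p_value).
    """
    share = successes / total
    standard_error = math.sqrt(p0 * (1 - p0) / total)
    z = (share - p0) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return share, z, p_value

if __name__ == "__main__":
    total_votes = 57 * 60  # 57 pairs, 60 votes each
    share, z, p = two_sided_z_test(1875, total_votes)
    print(f"share={share:.3f}  z={z:.2f}  p={p:.2e}")
```

Under the independence assumption, the z-score comes out at roughly 5.6, which is well beyond conventional significance thresholds, while the preference share itself is about 55%.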
Main Takeaways
- Krisp leads or ties on every metric evaluated, both human-perceived and objectively measured, with the clearest margins in clarity, intelligibility, and accent transformation accuracy. Subjective Natural Speech is a near tie (3.7 vs. 3.6).
- Accent Conversion: Krisp delivers more effective accent neutralization with fewer traces of the original pronunciation. Sanas often leaks source accent elements and produces less consistent results across varied speakers and speech patterns.
- Speech Clarity & Phoneme Accuracy: Krisp-converted speech is significantly easier to understand. Sanas samples frequently exhibit muffled segments or slurred phonemes, which negatively affect comprehension and usability in customer support settings.
- Background Noise Robustness: Krisp maintains speech quality in real-world noisy conditions, including multi-speaker and contact center environments. Sanas, by contrast, is more susceptible to background voice leakage — a potential liability for call quality and privacy.
- Audio Quality and Bandwidth: Krisp outputs at 16 kHz wideband audio, providing richer, more intelligible voice quality, especially in modern platforms like Zoom, MS Teams, and G.722-based telephony. Sanas outputs audio at 8 kHz, which can degrade quality in high-bandwidth environments and limit downstream use in QA systems.
- Compatibility and Headset Dependence: Sanas performance appears dependent on specific headsets to avoid secondary voice artifacts. Krisp, by contrast, is hardware-agnostic and built with a robust, production-grade Background Voice Cancellation AI model.
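The bandwidth point above follows from the Nyquist limit: a signal sampled at a given rate can only represent frequencies up to half that rate. The short sketch below makes the arithmetic explicit. The function name is hypothetical, and the practical impact on intelligibility depends on the rest of the telephony chain.

```python
def max_representable_frequency_hz(sample_rate_hz):
    """Nyquist limit: highest frequency a sampled signal can represent."""
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    return sample_rate_hz / 2

if __name__ == "__main__":
    for label, rate in (("wideband", 16_000), ("narrowband", 8_000)):
        limit = max_representable_frequency_hz(rate)
        print(f"{label}: {rate} Hz sampling -> content up to {limit:.0f} Hz")
```

In other words, 16 kHz audio can carry speech content up to 8 kHz, while 8 kHz audio stops at 4 kHz, cutting off much of the high-frequency energy in consonants such as "s" and "f".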
Comparative audio samples
# | Observation | Audio |
---|---|---|
1 | Strong accent leakage in Sanas; Pronunciation error of “across”, “travel” in Sanas; Krisp version is more natural and easier to understand; Krisp fixed “trouble” in original speech to “travel” | Original Sanas Krisp |
2 | Strong accent leakage in Sanas; Slurred and unintelligible speech in Sanas; Krisp version is more natural and easier to understand | Original Sanas Krisp |
3 | Robotic, slurred and unintelligible speech in Sanas; Krisp version is more natural and easier to understand | Original Sanas Krisp |
4 | Pronunciation errors of “interested”, “especially”, “gives” in Sanas speech; Muffled “every day” in Sanas; Strong accent leakage in Sanas; Pronunciation error of “hobbies” in original, Krisp, Sanas versions | Original Sanas Krisp |
5 | Muffled output on “a smooth” in Sanas; Better naturalness, higher quality voice in Krisp | Original Sanas Krisp |
6 | Strong accent leakage in Sanas; Pronunciation errors of “range” in Sanas | Original Sanas Krisp |
7 | Pronunciation errors of “permission”, “dialer” in Sanas; Krisp version is more natural and easier to understand | Original Sanas Krisp |
8 | Strong accent leakage in Sanas; Krisp fixed “trouble” in original speech to “travel”, Sanas did not | Original Sanas Krisp |
9 | Strong accent leakage in Sanas; Muffled “financial” in Sanas; Krisp version is more natural and easier to understand | Original Sanas Krisp |
10 | Strong accent leakage in Sanas; Pronunciation errors of “support”, “workforce” in Sanas; Muffled “more”, “Colombia” in Sanas | Original Sanas Krisp |
11 | Strong secondary voice leakage in Sanas; All secondary voices cleaned in Krisp | Original Sanas Krisp |
12 | Strong secondary voice leakage in Sanas; All secondary voices cleaned in Krisp | Original Sanas Krisp |
13 | Moderate accent leakage in Sanas; Muffled words in Sanas; Krisp audio is easier to understand | Original Sanas Krisp |
14 | Muffled output on “book an appointment” in Sanas; Excellent naturalness and smoothness in Krisp; Pronunciation error in “checkup” in Original, Sanas, and Krisp | Original Sanas Krisp |
15 | Pronunciation error in “favorite” in Sanas; Pronunciation error in “for me” in Original, Sanas, and Krisp | Original Sanas Krisp |
16 | Muffled “sign the order” in Sanas; More stable volume in Krisp | Original Sanas Krisp |
17 | Pronunciation errors of “professional”, “tech”, “more” in Sanas; Low volume in Krisp; pronunciation error in “tech” | Original Sanas Krisp |
Krisp is a trusted vendor on G2
With over 500 reviews on G2, Krisp consistently excels in enhancing customer interactions for service teams. G2, a trusted platform for software reviews and assessments, showcases Krisp’s exceptional 4.7 rating—earned through the trust and endorsement of hundreds of verified professionals across diverse industries.
Check Krisp’s page on G2 here.
Krisp Voice AI Platform for call centers
Krisp is the only real-time Voice AI platform that covers every stage of the agent experience—before, during, and after the call—within a single, lightweight application. It eliminates the need to juggle multiple tools and services by delivering core capabilities like Noise Cancellation, Accent Conversion, Live Interpretation, real-time agent assist, and post-call summaries in one seamless interface.
Agents with pronounced English accents can benefit from Accent Conversion, which enhances comprehension in calls without altering their original voice. The same agents can handle international calls using Live Interpreter, enabling real-time multilingual conversations across 80+ languages with one click, directly in the Krisp app. This flexibility removes hiring constraints and the need for standard language-line services, and allows teams to scale globally without friction.
During the call, Krisp Agent Copilot provides real-time transcripts, key moment capture, and access to company and industry-specific knowledge via AI Chat, boosting confidence and precision. After the call, automatic summaries and reports help streamline follow-ups and coaching. All of this is centrally managed, with analytics and policy controls available in a unified Admin Portal.
The Krisp platform integrates with the agent’s desktop and works with all CCaaS and calling applications, delivering call quality that translates into better CSAT and related contact center KPIs.
Conclusion
While both Krisp and Sanas are innovators in the Accent Conversion space, Krisp stands out as the enterprise-ready solution trusted by global contact centers.
Krisp’s Accent Conversion consistently delivers clearer, more natural, and more intelligible speech, with significantly lower accent leakage and superior performance in noisy, real-world environments. Across both subjective human evaluations and objective acoustic metrics, Krisp leads on all critical dimensions—accent conversion, speech clarity, and background noise robustness.
Beyond voice quality, Krisp offers:
- Superior deployment flexibility, working seamlessly across any headset, desktop, or CCaaS platform, with no hardware or system limitations.
- Built-in voice and noise cancellation, tested over 1 trillion minutes, eliminating the need for additional tools or packages.
- Enterprise-grade reliability with SSO, auto-updates, analytics, and remote configuration for admins—essential for scaled contact center rollouts.
- A full Voice AI platform in one app: Live Interpreter, Accent Conversion, Agent Copilot, Auto QA, and real-time knowledge—all integrated with one-click simplicity.
Sanas requires device-specific setups, lacks compatibility in key environments, and struggles with accent consistency and audio fidelity, especially in enterprise use cases.
💡 If your goal is to scale globally, serve diverse customers, and ensure your agents are clearly understood — Krisp is the clear choice for Accent Conversion that works, performs, and scales.