
Krisp Accent Conversion v3, released in March 2025, marked a breakthrough moment in the evolution of our accent conversion technology. For the first time in two years, we felt the system was mature enough for wide-scale production use.

 

In May 2025, we released Accent Conversion v3.5, bringing a major quality upgrade: roughly a 20% improvement across key metrics for both Filipino and Indian accents (details here). Thanks to the Krisp desktop application’s auto-update mechanism, the rollout reached 95% of users within 2 days, and the feedback from both agents and customers was overwhelmingly positive, driving improvements in sentiment and business KPIs.

 

In July 2025, we expanded the offering to support the Latin American accent pack. The launch quickly gained traction with several large customers and is now deployed across thousands of agents.

 

Throughout this period, we’ve worked closely with partners, agents, and customers to deeply understand corner cases, especially for the Indian accent, which is the most challenging due to its vast regional variation and phonetic complexity. This close collaboration, combined with relentless effort from Krisp’s world-class research and engineering teams, has now culminated in another major step forward.

 

Today, we’re launching Accent Conversion v3.7, delivering significant improvements in naturalness and voice stability. This release is currently focused on the Indian accent pack, with support for other accents rolling out soon.

The following sections summarize the key improvements, benchmarking methodology, and a side-by-side comparison of Accent Conversion v3.7 with v3.5.

Key Improvements in AC v3.7

  1. Naturalness: The converted speech sounds even more human-like and natural, with much improved filler-sound handling. Expert-rated naturalness scores improved by 14%, and crowdsourced evaluations confirm the trend with a 6% gain.
  2. Voice Stability: Enhanced consistency in pitch and tone throughout the utterance, helping avoid unnatural fluctuations, especially for strong accents. This contributed to improved naturalness and clarity scores across all metrics.
  3. Speech & Audio Clarity: Improvements were noted in both intelligibility and the reduction of artifacts and distortions. Speech Clarity scores rose by 5% in expert assessments, with corresponding gains in the Meta Audiobox Aesthetics metrics.
  4. Pronunciation Accuracy: Objective metrics improved as well, with roughly a 4% relative reduction in Phoneme Error Rate (PER), attributable to the inclusion of more conversational data in training. Noticeable accent-specific enhancements in phoneme pronunciation, such as more native-like articulation of “R” and “L”, contribute to a +5% increase in the Accent Conversion score.

Evaluation Results

For subjective and objective evaluations, 78 real-world recordings were sampled.

For the crowdsourced evaluation, each recording received exactly 30 independent votes to ensure statistical confidence, for a total of 2,340 votes.

The results shown in the table below represent aggregated averages across all recordings.

| Metric | IN AC v3.5 | IN AC v3.7 | Comment |
| --- | --- | --- | --- |
| Expert Evaluation – Natural speech (1 to 5) | 3.7 | 4.2 (+14%) | Speech sounds even more human-like, with much improved filler-sound handling |
| Expert Evaluation – Speech Clarity (1 to 5) | 4.0 | 4.2 (+5%) | Speech has fewer artifacts and is clearer, especially in slurred and mumbled segments |
| Expert Evaluation – Accent Conversion (1 to 5) | 4.3 | 4.5 (+5%) | Accent-specific enhancements in phoneme pronunciation, such as more native-like articulation of “R” and “L” |
| Crowdsourced Evaluation – “How natural does the voice sound?” (1 to 5) | 3.4 | 3.6 (+6%) | 78 real-world audio recordings assessed by 30 participants |
| Crowdsourced Models’ Comparison – “Which option sounds more natural?” (number of votes) | 1242 | 1878 (+20%) | 78 real-world audio recording pairs were evaluated, with each pair assessed by 40 participants |
| Meta Aesthetic – Natural speech (1 to 10) | 5.6 | 5.8 (+4%) | |
| Meta Aesthetic – Speech Clarity (1 to 10) | 7.5 | 7.6 (+1%) | |
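
As a reading aid: the percentage deltas above appear to be simple relative changes of the mean scores, rounded to the nearest percent (our interpretation, not a statement of the exact computation). A minimal check in Python:

```python
def relative_improvement(old: float, new: float) -> float:
    """Relative change of the mean score, in percent."""
    return (new - old) / old * 100

print(round(relative_improvement(3.7, 4.2)))  # 14  (Expert naturalness)
print(round(relative_improvement(4.0, 4.2)))  # 5   (Expert speech clarity)
print(round(relative_improvement(3.4, 3.6)))  # 6   (Crowdsourced naturalness)

# The A/B comparison row is the exception: its +20% matches the gap in vote
# share, (1878 - 1242) / 3120 votes, i.e. roughly 20 percentage points.
```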

 

Comparative audio samples

Listening Tip: For the most accurate and immersive comparison between v3.5 and v3.7 Accent Conversion, we recommend using quality headphones.

This helps highlight the improvements in clarity, naturalness, and speaker identity preservation that may be less perceptible on laptop or mobile speakers.

| # | Improvement Category | Original | Converted AC v3.5 | Converted AC v3.7 |
| --- | --- | --- | --- | --- |
| 1 | Speech Naturalness | | | |
| 2 | Speech Naturalness | | | |
| 3 | Speech Naturalness, Speech Clarity | | | |
| 4 | Speech Clarity | | | |
| 5 | Speech Clarity, Speech Naturalness, Voice Stability | | | |
| 6 | Speech Clarity, Speech Naturalness, Voice Stability | | | |
| 7 | Speech Naturalness, Speech Clarity | | | |
| 8 | Speech Naturalness, Speech Clarity | | | |

 

Appendix

Subjective Evaluation

Our evaluation was conducted across two structured tracks: expert panel ratings and crowdsourced listener preferences, designed to capture both technical precision and human perception.

Real-world agent calls were sampled to represent a diverse set of speakers and input conditions, including, but not limited to:

  • Accent level – high, medium, low
  • Speech rates and fluency
  • Background conditions (quiet, noisy, multi-speaker environments)
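
For illustration only, a stratified draw across these conditions could look like the sketch below; the metadata fields and recordings are hypothetical and not Krisp’s actual sampling pipeline.

```python
import random
from collections import defaultdict

# Hypothetical call metadata: (recording_id, accent_level, background)
calls = [
    ("call_001", "high", "quiet"),
    ("call_002", "medium", "noisy"),
    ("call_003", "low", "multi-speaker"),
    # ... thousands more in practice
]

def stratified_sample(calls, per_bucket=2, seed=7):
    """Draw up to `per_bucket` recordings from each (accent, background)
    bucket so the evaluation set covers the full range of input conditions."""
    random.seed(seed)
    buckets = defaultdict(list)
    for rec_id, accent, background in calls:
        buckets[(accent, background)].append(rec_id)
    sample = []
    for _, items in sorted(buckets.items()):
        sample.extend(random.sample(items, min(per_bucket, len(items))))
    return sample

print(stratified_sample(calls))
```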

Evaluators scored each recording across four qualitative dimensions using a 5-point Likert scale:

| Score | Meaning |
| --- | --- |
| 5 | Excellent / Native-like |
| 4 | Very Good |
| 3 | Acceptable |
| 2 | Needs Improvement |
| 1 | Poor / Unintelligible |

1. Expert Panel Evaluation

Six expert evaluators independently rated matching audio pairs — each pair consisting of the same original voice converted by AC v3.5 and AC v3.7.

To eliminate bias:

  • File names were anonymized (no version markers)
  • The order of samples was randomized
  • Scoring was blind and individual (no group discussion)
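
As a minimal sketch of how such a blind set could be prepared (the file layout, names, and key file below are hypothetical, not Krisp’s internal tooling):

```python
import csv
import random
import shutil
import uuid
from pathlib import Path

def prepare_blind_set(pairs, out_dir, key_file):
    """Copy (v3.5, v3.7) audio pairs under anonymized names in random order.

    pairs    -- list of (path_v35, path_v37) tuples (hypothetical inputs)
    out_dir  -- directory handed to evaluators
    key_file -- private CSV mapping anonymized names back to versions
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    samples = []
    for path_v35, path_v37 in pairs:
        # Randomize which version appears first within each pair.
        versions = [("v3.5", path_v35), ("v3.7", path_v37)]
        random.shuffle(versions)
        for version, src in versions:
            anon = f"{uuid.uuid4().hex}.wav"   # no version markers in the name
            shutil.copy(src, out / anon)
            samples.append((anon, version, str(src)))
    # Randomize the overall presentation order as well.
    random.shuffle(samples)
    with open(key_file, "w", newline="") as f:
        csv.writer(f).writerows([("anon_name", "version", "source")] + samples)
```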

2. Crowdsourced Evaluation

To further simulate real-world user perception, a blind A/B test was run on pairs of recordings: AC v3.5 vs. AC v3.7.
78 real-world audio recording pairs were evaluated, with each pair assessed by 40 participants, resulting in 3,120 votes overall.

Participants were asked the following question:
“Which option sounds more natural (i.e., more human-like)?”

Results:

  • Version 3.5 was selected 1242 times
  • Version 3.7 was selected 1878 times
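
For illustration, the split can be summarized and sanity-checked as below. This is a simplified sketch: it treats all 3,120 votes as independent, which ignores that the same participants rated multiple pairs.

```python
from scipy.stats import binomtest

votes_v35, votes_v37 = 1242, 1878
total = votes_v35 + votes_v37                    # 3,120 votes in total

share_v37 = votes_v37 / total                    # ~60.2% preferred v3.7
margin = share_v37 - votes_v35 / total           # ~20 percentage points

# One-sided binomial test against the 50/50 "no preference" hypothesis.
result = binomtest(votes_v37, total, p=0.5, alternative="greater")
print(f"v3.7 share: {share_v37:.1%}, margin: {margin:.1%}, p = {result.pvalue:.2g}")
```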

Evaluation Metrics

Accent Conversion performance was measured across four key dimensions. These were selected based on real-world call center priorities such as clarity, naturalness, and robustness.

| Metric | Description |
| --- | --- |
| Accent Conversion | How effectively the speaker’s original accent is transformed into a neutral or target accent. High scores mean minimal accent leakage or trace of the original pronunciation. |
| Speech Clarity | Evaluates articulation, intelligibility, and absence of audio distortions, such as mumbling, muffling, or low vocal energy. |
| Natural Speech | Measures how closely the output resembles fluid, human-like speech, including natural variations in pitch, tone, rhythm, and intonation. |
| Pronunciation Accuracy | Measures how closely the converted speech matches standard American English pronunciation at the phoneme level. It evaluates whether individual sounds (vowels, consonants, syllables) are produced correctly and consistently, without distortion, misplacement, or omission, ensuring that the converted voice sounds intelligible and native-like to a U.S. listener. |

Objective Evaluation

For the objective evaluation, the same set of recordings was processed with Meta Audiobox Aesthetics to capture metrics strongly correlated with Natural Speech and Speech Clarity. Additionally, to quantify how each system impacts phoneme accuracy, all recordings were also processed with the Facebook NN Phonemizer to compute the Phoneme Error Rate (PER), which is strongly correlated with the Accent Conversion metric.

| Objective Metric | Interpretation | Highly Correlated to Subjective Metric | What It Captures |
| --- | --- | --- | --- |
| Production Quality* | Higher is better | Speech Clarity | Fidelity, presence of audio artifacts, balance, and clarity of the output signal |
| Content Enjoyment* | Higher is better | Natural Speech | Perceived naturalness, fluidity, and enjoyment of listening, akin to human listening satisfaction |
| Phoneme Error Rate (PER) | Lower is better | Accent Conversion | Measures pronunciation distortion. Lower scores mean more accurate, intelligible speech with better articulation. |

\* These metrics are derived from waveform-level analysis and do not require a transcript or linguistic alignment, making them ideal for evaluating accent conversion outputs that vary in delivery and prosody.
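
To make the PER definition concrete, the sketch below scores a hypothesis phoneme sequence against a reference using standard Levenshtein edit distance; the phoneme sequences are made-up examples, and the actual phonemizer output format may differ.

```python
def phoneme_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """PER = (substitutions + deletions + insertions) / len(reference),
    computed with standard Levenshtein (edit-distance) dynamic programming."""
    m, n = len(reference), len(hypothesis)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # i deletions
    for j in range(n + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[m][n] / max(m, 1)

# Hypothetical example: reference phonemes vs. phonemes recognized from the
# converted audio (an "R" articulated as "L").
ref = ["R", "EH", "D", "L", "EH", "T", "ER"]
hyp = ["L", "EH", "D", "L", "EH", "T", "ER"]
print(f"PER = {phoneme_error_rate(ref, hyp):.2%}")  # one substitution -> ~14%
```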
