Krisp Accent Conversion v3.5 represents a significant upgrade over the previous v3.0 release. Both Indian and Filipino accent models show consistent improvements across clarity, naturalness, and pronunciation accuracy, validated through expert evaluation, crowdsourced ratings, and objective metrics. Overall, the v3.5 models deliver clearer, more natural, and more intelligible speech while preserving speaker identity.
Key Improvements in AC v3.5
- Speech & Audio Clarity: Major improvements in intelligibility and reduction of audio artifacts and distortions. Speech Clarity scores increased by +18% (Indian) and +23% (Filipino) in expert evaluations, with consistent boosts across Meta metrics as well.
- Naturalness & Fluidity: Speech sounds more human and expressive, with better rhythm, pacing, and filler sound handling. Expert-rated Natural Speech scores improved by +18% (Indian) and +20% (Filipino). Crowdsourced evaluations confirm this with +10% (Indian) and +6% (Filipino) gains.
- Pronunciation Accuracy: Improved phoneme articulation and intelligibility, reflected in a 10% reduction in Phoneme Error Rate (PER) for the Indian accent pack.
- Voice Stability: Enhanced consistency in pitch and tone throughout the utterance, helping avoid unnatural fluctuations. This contributed to improved naturalness and clarity scores across all metrics.
- Speaker Identity Retention: v3.5 models better preserve the original speaker’s voice characteristics, resulting in more personalized and authentic-sounding output, evident in higher naturalness ratings across both subjective and objective evaluations.
Evaluation Results
For subjective and objective evaluations, 78 real-world recordings were sampled for the Indian accent pack and 57 for the Filipino accent pack.
For the crowdsourced evaluation, each recording received exactly 30 independent votes to ensure statistical confidence — 2340 total votes for Indian recordings and 1710 for Filipino recordings.
The results shown in the table below represent aggregated averages across all recordings.
Metric | IN AC v3 | IN AC v3.5 | PH AC v3 | PH AC v3.5 |
---|---|---|---|---|
Expert evaluation – Natural speech (1 to 5) | 3.3 | ![]() | 3.4 | ![]() |
Expert evaluation – Speech clarity (1 to 5) | 3.4 | ![]() | 3.4 | ![]() |
Crowdsourced evaluation – “How natural does the voice sound?” (1 to 5) | 3.1 | ![]() | 3.3 | ![]() |
Meta Aesthetic – Natural speech (1 to 10) | 5.4 | ![]() | 5.4 | ![]() |
Meta Aesthetic – Speech clarity (1 to 10) | 7.1 | ![]() | 7.1 | ![]() |
Phoneme Error Rate (PER) | 26.1% | ![]() | 28.4% | 28.4% (no change) |
Comparative audio samples
Listening Tip: For the most accurate and immersive comparison between v3.0 and v3.5 Accent Conversion, we recommend using quality headphones.
This helps highlight the improvements in clarity, naturalness, and speaker identity preservation that may be less perceptible on laptop or mobile speakers.
Indian English accent pack
Improvement category | Original speech | Converted AC V3 | Converted AC V3.5 |
---|---|---|---|
Voice stability, Speech clarity, Speech naturalness | (audio) | (audio) | (audio) |
Voice stability, Speech clarity, Speech naturalness | (audio) | (audio) | (audio) |
Speech clarity, Speech naturalness | (audio) | (audio) | (audio) |
Speech clarity, Speech naturalness, Audio quality | (audio) | (audio) | (audio) |
Speech clarity, Speech naturalness | (audio) | (audio) | (audio) |
Filipino English accent pack
Improvement category | Original speech | Converted AC V3 | Converted AC V3.5 |
---|---|---|---|
Audio quality, Speaker identity, Speech naturalness | (audio) | (audio) | (audio) |
Audio quality | (audio) | (audio) | (audio) |
Speech clarity, Speech naturalness | (audio) | (audio) | (audio) |
Audio quality, Speaker identity | (audio) | (audio) | (audio) |
Audio quality, Speech clarity, Speech naturalness | (audio) | (audio) | (audio) |
Appendix
Subjective evaluation
Our evaluation was conducted across two structured tracks: expert panel ratings and crowdsourced listener preferences, designed to capture both technical precision and human perception.
Real-world agent calls were sampled to represent a diverse set of speakers and input conditions, including, but not limited to, the following (a brief sampling sketch follows the list):
- Accent level – high, medium, low
- Speech rates and fluency
- Background conditions (quiet, noisy, multi-speaker environments)
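As a rough illustration of how such a balanced sample can be drawn, the sketch below groups recordings by condition and samples evenly from each group. The metadata fields ("accent_level", "speech_rate", "background") and the bucket size are assumptions for the example, not Krisp's internal schema.

```python
# Minimal sketch of drawing a balanced evaluation set across conditions.
# Field names and counts are illustrative assumptions only.
import random
from collections import defaultdict

def sample_recordings(recordings, per_bucket=2, seed=42):
    """Group recordings by (accent level, speech rate, background)
    and draw up to `per_bucket` recordings from each group."""
    buckets = defaultdict(list)
    for rec in recordings:
        key = (rec["accent_level"], rec["speech_rate"], rec["background"])
        buckets[key].append(rec)

    rng = random.Random(seed)
    sampled = []
    for group in buckets.values():
        rng.shuffle(group)
        sampled.extend(group[:per_bucket])
    return sampled
```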
Evaluators scored each recording across four qualitative dimensions using a 5-point Likert scale:
Score | Meaning |
---|---|
5 | Excellent / Native-like |
4 | Very Good |
3 | Acceptable |
2 | Needs Improvement |
1 | Poor / Unintelligible |
1. Expert Panel Evaluation
Six expert evaluators independently rated matching audio pairs — each pair consisting of the same original voice converted by AC v3 and AC v3.5.
To eliminate bias (see the preparation sketch after this list):
- File names were anonymized (no version markers)
- The order of samples was randomized
- Scoring was blind and individual (no group discussion)
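The sketch below shows one way these controls could be scripted: each converted file receives a random identifier with no version marker, the playback order is shuffled, and the answer key is kept in a separate file that raters never see. The paths, file names, and mapping format are hypothetical.

```python
# Minimal sketch of preparing a blind listening test: anonymized copies,
# shuffled playback order, and a separate answer key. Layout is illustrative.
import csv
import random
import shutil
import uuid
from pathlib import Path

def prepare_blind_test(pairs, out_dir="blind_test", seed=7):
    """`pairs` is a list of (v3_path, v3_5_path) tuples for the same source clip."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    rng = random.Random(seed)

    mapping = []  # kept apart so raters never see version labels
    items = []
    for v3_path, v35_path in pairs:
        for version, src in (("v3", v3_path), ("v3.5", v35_path)):
            anon = f"{uuid.uuid4().hex[:8]}.wav"   # no version marker in the name
            shutil.copy(src, out / anon)
            mapping.append({"anon_file": anon, "version": version, "source": str(src)})
            items.append(anon)

    rng.shuffle(items)  # randomized presentation order
    (out / "playlist.txt").write_text("\n".join(items))
    with open(out / "answer_key.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["anon_file", "version", "source"])
        writer.writeheader()
        writer.writerows(mapping)
```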
2. Crowdsourced Evaluation
To further simulate real-world user perception, a blind A/B/C test was run with a trio of recordings: original vs. AC v3 vs. AC v3.5.
Respondents were asked a single question – “How natural does the voice sound?” – and scored the recordings using the same 5-point Likert scale.
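Given this setup, the averages in the results table can be reproduced by averaging each recording's votes (the 30 crowd votes, or the six expert scores) and then averaging those per-recording means across all recordings. A minimal sketch, assuming a flat list of vote records whose field names are illustrative:

```python
# Minimal sketch of the aggregation described above: average Likert votes
# per recording, then average across recordings. Record layout is assumed.
from collections import defaultdict

def aggregate_scores(votes):
    """`votes` is an iterable of dicts like
    {"recording_id": "rec_017", "model": "AC v3.5", "score": 4}."""
    per_recording = defaultdict(list)
    for v in votes:
        per_recording[(v["model"], v["recording_id"])].append(v["score"])

    per_model = defaultdict(list)
    for (model, _rec), scores in per_recording.items():
        per_model[model].append(sum(scores) / len(scores))

    return {model: round(sum(means) / len(means), 2)
            for model, means in per_model.items()}
```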
Evaluation metrics
Accent Conversion performance was measured across three key subjective and objective dimensions. These were selected based on real-world call center priorities such as clarity, naturalness, and robustness.
Metric | Description |
---|---|
Accent Conversion | How effectively the speaker’s original accent is transformed into a neutral or target accent. High scores mean minimal accent leakage or trace of the original pronunciation. |
Speech Clarity | Evaluates articulation, intelligibility, and absence of audio distortions, such as mumbling, muffling, or low vocal energy. |
Natural Speech | Measures how closely the output resembles fluid, human-like speech, including natural variations in pitch, tone, rhythm, and intonation. |
Objective evaluation
For objective evaluation, the same set of recordings was processed with the Meta Audiobox Aesthetics model to capture metrics strongly correlated with Natural Speech and Speech Clarity. Additionally, to quantify how each system impacts phoneme accuracy, all recordings were also processed with the Facebook NN Phonemizer to compute the Phoneme Error Rate, which is strongly correlated with the Accent Conversion metric (a PER computation sketch follows the table below).
Objective metric | Interpretation | Highly correlated to subjective metric | What it captures |
---|---|---|---|
Production quality* | Higher is better | Speech clarity | Fidelity, presence of audio artifacts, balance, and clarity of the output signal |
Content enjoyment* | Higher is better | Natural speech | Perceived naturalness, fluidity, and enjoyment of listening — akin to human listening satisfaction |
Phoneme Error Rate (PER) | Lower is better | Accent conversion | Measures pronunciation distortion. Lower scores mean more accurate, intelligible speech with better articulation. |
* These metrics are derived from waveform-level analysis and do not require transcript or linguistic alignment, making them ideal for evaluating accent conversion outputs that vary in delivery and prosody.
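For reference, Phoneme Error Rate is conventionally computed as the phoneme-level edit distance (substitutions, insertions, and deletions) between the phoneme sequence recognized from the converted audio and a reference sequence, normalized by the reference length. The sketch below implements that standard calculation; it is not Krisp's internal pipeline, and the example phoneme sequences are placeholders.

```python
# Minimal sketch of the standard PER calculation: Levenshtein distance
# between hypothesis and reference phoneme sequences, normalized by the
# reference length. The example phonemes in the comment are placeholders.
def phoneme_error_rate(reference, hypothesis):
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / max(len(ref), 1)

# e.g. reference ["DH", "AH", "K", "AE", "T"] vs hypothesis ["D", "AH", "K", "AE", "T"]
# -> 1 edit / 5 reference phonemes = 0.20 (20% PER)
```

A corpus-level PER is typically the total edit distance divided by the total number of reference phonemes across all recordings.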