June 9, 2026

Introducing the Voice Translation API: Real-Time Speech-to-Speech Translation for Developers

Written by Krisp Engineering Team


The engine behind Krisp’s enterprise voice translation is now available as a self-serve API. It has more than 1M minutes of production call translation behind it and has been tested across 30 languages, 6 business domains, and 870 real conversations.

The demo-to-production gap in voice translation

Getting a real-time voice translation demo working is easy. Getting it to survive production is the hard part.

Real users have accents. They speak over background noise. They use domain-specific terms that carry the most weight in the conversation: medication names, policy numbers, account details, email addresses. These are exactly the terms that get hallucinated or garbled by general-purpose translation engines. And there’s no built-in feedback mechanism to tell you when it happens. Your first quality signal is a user complaint, a compliance flag, or a patient safety issue.

Most voice translation APIs report accuracy on clean benchmark recordings made in studio conditions. Those numbers typically drop 5 to 10 points in production. The gap between what works in a demo and what works on a real call with a real customer is where most translation features fail.

We built the Voice Translation API to close that gap. Production-grade speech-to-speech translation, the same engine running in live enterprise contact centers today, now available self-serve.

Don’t take our word for it. Try the playground: speak into it, pick a language pair, and hear the output yourself. No integration needed, no signup required.

The engine: 96% accuracy on real calls, not studio audio

This is not a new engine. It is the same translation engine that powers Krisp Voice Translation in live enterprise contact centers, with more than 1M minutes of production call translation. Same model, same accuracy, same language support.

What makes this engine different from other voice translation APIs is where it was built and how accuracy was measured. Enterprise contact centers are the most unforgiving environment for voice AI. Frustrated customers speaking fast. Background noise from open floor plans. Heavy accents across dozens of languages. Account verifications where every digit matters. Calls where a translation error means a compliance violation, a disputed claim, or a patient safety incident.

That pressure produced an engine with production data no benchmark can replicate:

Metric	Result
Translation accuracy	96% on live calls with real accents and noise (AutoQA-scored)
Calls handled without interpreter	89% end-to-end
Patient safety incidents	Zero (across 8+ languages in a healthcare deployment)
AHT reduction vs. human language services	20%+
Interpreter wait time reduction	More than 2x
Production minutes translated	1M+

These numbers come from real calls with real consequences, not curated test sets. Most speech-to-speech translation APIs report accuracy on clean benchmark audio recorded in studio conditions. We measure where it counts.

Benchmark data: 30 languages, 6 domains, 870 conversations

Beyond production deployments, the engine has been independently evaluated using three validation layers: automated benchmarking, AI-driven semantic scoring (AutoQA), and bilingual human review by professional linguists across 8 languages.

Metric	Result
English transcription error rate (WER)	~2.7% (about 97 out of 100 words correct)
Target language transcription error rate (WER)	2–10% for most languages
Translation quality (BLEU), top languages	51–66 (human translations typically score ~60)
Semantic accuracy (AutoQA)	94–96 out of 100 across all 30 benchmarked languages

How we measured it

Transcription was measured using Word Error Rate (WER), the industry standard for speech recognition accuracy. Top languages like Italian (2.07%) and Spanish (2.11%) achieve WER under 2.5%.
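For readers new to the metric, WER is the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below implements that standard definition. The function name and sample sentences are illustrative, and this is not Krisp's evaluation code, which likely also normalizes casing, punctuation, and numerals before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # and the first j hypothesis words (row-by-row Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: one substituted word in a ten-word reference.
reference = "please confirm the policy number before we update your account"
hypothesis = "please confirm the policy number before we update the account"
wer = word_error_rate(reference, hypothesis)
```

A WER of 2.7% therefore means roughly 2.7 word-level edits per 100 reference words.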

Translation was measured using BLEU, scored bidirectionally (English→target and target→English). We also used chrF++, a character-level metric that complements BLEU for morphologically complex languages like Turkish, Finnish, and Hungarian, where word-level BLEU alone can understate quality.

AutoQA, Krisp’s semantic scoring system, independently validated every conversation across four dimensions: intent accuracy (35% weight), entity accuracy (30%), conversation flow (25%), and naturalness (10%). Scores averaged 94–96 across all 30 languages.
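To make the weighting concrete, here is a small sketch that combines four per-dimension scores into one number using the weights stated above. It assumes the overall score is a plain weighted average of 0–100 dimension scores, which this post does not confirm. The function name and the sample scores are made up for illustration.

```python
# Weights as stated in the post; they sum to 1.0.
AUTOQA_WEIGHTS = {
    "intent_accuracy": 0.35,
    "entity_accuracy": 0.30,
    "conversation_flow": 0.25,
    "naturalness": 0.10,
}


def weighted_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (assumed aggregation rule)."""
    missing = AUTOQA_WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise KeyError(f"missing dimension scores: {sorted(missing)}")
    return sum(w * dimension_scores[name] for name, w in AUTOQA_WEIGHTS.items())


# Hypothetical scores for one conversation, each on a 0-100 scale.
example = {
    "intent_accuracy": 96.0,
    "entity_accuracy": 94.0,
    "conversation_flow": 95.0,
    "naturalness": 90.0,
}
overall = weighted_score(example)
```

Under this assumption, intent and entity accuracy together account for 65% of the score, so a dropped policy number costs far more than a slightly stiff phrasing.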

Bilingual human review by professional linguists across 8 languages independently confirmed the automated findings.

What the API does

Speech in one language, speech and text out in another. Real-time, synchronous, built for live conversations.

Here’s the simplest integration. Configure a session, open it with callbacks, and stream audio:

# VtSessionConfig, Vt, and VtVoice come from the Krisp Python SDK.
# session_key is a short-lived session key; on_audio and on_text are your callbacks.

# Configure and open a session
config = VtSessionConfig(
    auth_token=session_key,
    input_language_code="en-US",
    output_language_code="es-US",
    voice=VtVoice.FEMALE,
)

vt = Vt.create(
    config,
    audio_result_callback=on_audio,
    translated_transcript_callback=on_text,
)

# Stream audio in, get speech + text out
for chunk in pcm_chunks:
    vt.process(chunk)

That’s it. Speech in, translated speech and text out. Python and JavaScript SDKs ship with sample code and a quickstart guide. From zero to translated audio in 5 minutes.

Here’s what the engine handles that most real-time translation APIs leave to you:

Built-in Background Voice Cancellation. Background noise, competing voices, reverberation. The conditions that degrade translation quality in every real-world deployment are handled before translation begins. Configurable via the API. You don’t need clean audio input.

Native accent robustness. Indian-accented English, Hispanic-accented English, regional accents across every supported language. Accuracy doesn’t degrade. The engine was built on the full spectrum of how people actually speak, not how they speak in recording studios.

Accurate handling of names, numbers, and emails. Policy numbers, medication names, account details, email addresses, dates of birth. The kind of content that typically gets hallucinated or garbled comes through accurately.

61 languages with any-to-any pairs. Not just “Spanish” but US Spanish and European Spanish, and the engine distinguishes between them. Canadian French and metropolitan French. Egyptian Arabic. Regional languages like Catalan, Galician, and Basque. The full list is available via the languages endpoint and updated dynamically.

Real-time transcripts. Interim, final, and translated transcripts streamed alongside translated audio. Each independently toggleable via the session config.

Under the hood: technical details

Authentication

Two-step authorization keeps your long-lived API key off the client-side connection:

API Key from the developer dashboard. Used to generate short-lived session keys.
Session Key, a temporary scoped token with configurable TTL (5 minutes to 24 hours). Passed as a query parameter when opening the WebSocket.

GET https://api.developers.krisp.ai/v2/sdk/voice-translation/session/token?expiration_ttl=100
Authorization: api-key API_KEY

The long-lived key never touches the WebSocket connection directly.
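On a backend, the token call above might be assembled as in the sketch below. It uses only the URL, query parameter, and header shown in this post. The helper name is hypothetical, and the request is built but not sent, because the response body format and the unit of `expiration_ttl` are not described here.

```python
import urllib.parse
import urllib.request

TOKEN_URL = "https://api.developers.krisp.ai/v2/sdk/voice-translation/session/token"


def build_token_request(api_key: str, expiration_ttl: int) -> urllib.request.Request:
    """Build (but do not send) the session-token request shown above."""
    query = urllib.parse.urlencode({"expiration_ttl": expiration_ttl})
    return urllib.request.Request(
        f"{TOKEN_URL}?{query}",
        headers={"Authorization": f"api-key {api_key}"},
        method="GET",
    )


# Placeholder key for illustration; load the real one from server-side secrets.
token_request = build_token_request("API_KEY", expiration_ttl=100)

# A backend would send this with urllib.request.urlopen(token_request) and pass
# the returned session key to the client. Parsing is omitted because the
# response format is not documented in this post.
```

Keeping this call on the server is the point of the two-step design: the browser or mobile client only ever sees the session key.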

Session configuration

Every session is controlled through a single JSON config message sent after the WebSocket connects. Source and target language, output voice, custom vocabulary, translation dictionary, transcript toggles, background voice cancellation, and client metadata are all set in one message:

{
  "config": {
    "source_language": "en-US",
    "target_language": "es-US",
    "voice": "female",

    "vocabulary": ["Lisinopril", "metformin", "HIPAA"],
    "translation_dictionary": [
      { "source": "copay", "target": "copago" },
      { "source": "referral", "target": "remisión" }
    ],

    "transcript": {
      "interim": true,
      "final": true,
      "translate": true
    },

    "features": {
      "background_voice_cancellation": true
    }
  }
}

Domain customization from day one

Custom Vocabulary improves transcription accuracy. Add terms the engine should recognize: product names, medical terminology, internal codes. If you’re in healthcare, you add your medication names. If you’re in insurance, you add your product terms.

Translation Dictionary controls how recognized terms are translated. Define specific source → target mappings per language pair. Map “copay” to “copago” in Spanish. Map “deductible” to “Selbstbehalt” in German. You control both recognition and translation output.

Both are configured per session via the JSON config. No training step, no fine-tuning, no waiting. Add your terms and they’re active immediately.
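Because both lists travel in the session config, an application can build them from its own term data at session start. The sketch below is one way to do that. The field names mirror the JSON config shown earlier, while the helper name, its parameters, and the fixed `voice` and toggle values are illustrative choices rather than part of the API.

```python
import json


def build_session_config(
    source_language: str,
    target_language: str,
    vocabulary: list[str],
    dictionary: dict[str, str],
) -> str:
    """Serialize a session config message using the field names shown earlier."""
    message = {
        "config": {
            "source_language": source_language,
            "target_language": target_language,
            "voice": "female",
            "vocabulary": vocabulary,
            # One {"source", "target"} entry per term mapping.
            "translation_dictionary": [
                {"source": src, "target": tgt} for src, tgt in dictionary.items()
            ],
            "transcript": {"interim": True, "final": True, "translate": True},
            "features": {"background_voice_cancellation": True},
        }
    }
    # ensure_ascii=False keeps accented targets such as "remisión" readable.
    return json.dumps(message, ensure_ascii=False)


healthcare_config = build_session_config(
    "en-US",
    "es-US",
    vocabulary=["Lisinopril", "metformin", "HIPAA"],
    dictionary={"copay": "copago", "referral": "remisión"},
)
```

Since the dictionary is defined per language pair, a real application would likely keep one mapping per target language and pick the right one when the session opens.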

Server events

Three event types come back alongside translated audio frames:

Transcript: real-time source transcription (interim and final), with utterance ID, timestamp, and duration
Translation: translated text linked to each transcript via utterance ID
Error: HTTP-style codes (400, 401, 402, 429, 500) with reason and description
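The utterance ID is what lets a client line up each translation with its source transcript. The sketch below shows one way a client might do that join. This post does not show the event payloads, so every JSON key used here (`type`, `utterance_id`, `is_final`, `text`, `code`, `reason`) is a hypothetical placeholder; check the API reference for the real names.

```python
import json


class UtteranceLog:
    """Pairs final transcripts with their translations by utterance ID."""

    def __init__(self) -> None:
        self.source: dict[str, str] = {}
        self.translated: dict[str, str] = {}
        self.errors: list[tuple[int, str]] = []

    def handle(self, raw_event: str) -> None:
        # All key names below are assumed for illustration, not taken from the API.
        event = json.loads(raw_event)
        kind = event.get("type")
        if kind == "transcript":
            if event.get("is_final"):  # skip interim results
                self.source[event["utterance_id"]] = event["text"]
        elif kind == "translation":
            self.translated[event["utterance_id"]] = event["text"]
        elif kind == "error":
            self.errors.append((event["code"], event.get("reason", "")))

    def pairs(self) -> list[tuple[str, str]]:
        """Return (source, translation) for utterances that have both."""
        return [
            (text, self.translated[uid])
            for uid, text in self.source.items()
            if uid in self.translated
        ]


log = UtteranceLog()
for raw in (
    '{"type": "transcript", "utterance_id": "u1", "is_final": false, "text": "your co"}',
    '{"type": "transcript", "utterance_id": "u1", "is_final": true, "text": "your copay is due"}',
    '{"type": "translation", "utterance_id": "u1", "text": "su copago vence"}',
    '{"type": "error", "code": 429, "reason": "rate limited"}',
):
    log.handle(raw)
```

Keying on the utterance ID rather than arrival order keeps the pairing correct even if a translation arrives after the next transcript has started.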

Audio format

PCM S16LE, 16 kHz, mono (640 bytes per 20 ms chunk). Translated audio returns in the same encoding. Additional formats are coming soon.
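The 640-byte figure follows directly from the format: 16,000 samples per second × 2 bytes per sample × 0.020 seconds. The sketch below splits a raw PCM buffer into chunks of that size. The helper name is illustrative, and zero-padding the final short chunk is an assumption, since this post does not say how a trailing partial chunk should be handled.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # 16 kHz
BYTES_PER_SAMPLE = 2      # signed 16-bit little-endian, mono
CHUNK_MS = 20

# 16,000 samples/s * 2 bytes * 0.020 s = 640 bytes per chunk
CHUNK_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHUNK_MS // 1000


def pcm_chunks(pcm: bytes) -> Iterator[bytes]:
    """Yield 20 ms chunks, zero-padding the last one (padding is an assumption)."""
    for start in range(0, len(pcm), CHUNK_BYTES):
        chunk = pcm[start:start + CHUNK_BYTES]
        # Silence in S16LE is all-zero bytes, so padding adds a short quiet tail.
        yield chunk.ljust(CHUNK_BYTES, b"\x00")


# One second of silence plus a 100-byte remainder.
chunks = list(pcm_chunks(bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE + 100)))
```

If your capture pipeline runs at a different sample rate or in stereo, resample and downmix before chunking, because the chunk size only holds for 16 kHz mono.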

Security and compliance

The API carries the same security posture that serves enterprise contact centers. No voice data is stored on Krisp servers. Data is encrypted in transit and at rest.

Certifications: SOC 2 Type II · HIPAA · GDPR · PCI-DSS 4.0

For full details, visit the Krisp Trust Center.

Where accuracy-critical voice translation fits

Not every voice translation use case demands the same level of accuracy. The Krisp engine was built for environments where translation errors have real consequences, and that’s where its production provenance matters most.

Accuracy is critical. Healthcare, legal, emergency services, pharmaceutical. A mistranslated medication name is a patient safety incident. A garbled legal term changes the outcome of a proceeding. A misunderstood 911 call costs time that someone doesn’t have.

Accuracy has financial or compliance consequences. Insurance, financial services, government services, enterprise procurement. Mandated disclosures, transaction details, and policy terms must land correctly in the customer’s language.

Accuracy drives business outcomes. Customer support for complex products, cross-language sales, B2B meetings, HR and recruiting. Accumulated translation quality directly impacts CSAT, close rates, resolution rates, and trust.

For gaming, social apps, streaming, and travel, the engine works well. But the buying criteria are different: latency, naturalness, language coverage, and developer experience matter more than accuracy provenance.

Pricing: self-serve to enterprise

Self-Serve: Get Started. 60 minutes of free translation credit included. Full engine access (same model as enterprise), 61 languages with locale variants, Custom Vocabulary and Dictionary, Python and JavaScript SDKs, developer dashboard and playground. No sales call required.

Subscription: Production. Everything in self-serve, plus included translation hours with predictable monthly cost that scales with your usage. Usage monitoring and billing dashboard.

Enterprise: Custom. Volume pricing, dedicated support with 99.9% uptime SLA, VIVA and RTC SDK access, custom integration support.

Need deeper voice pipeline integration?

The Translation API is one part of the Krisp audio stack. Two more SDK families are available for teams building voice-first products.

VIVA SDK for Voice AI Agents. Voice Isolation, Turn Prediction, Interruption Prediction, and VAD. Lightweight models that sit between real-world audio and your AI agent.

RTC SDK for Human-to-Human Calls. Accent Conversion, Background Voice Cancellation, and Noise Cancellation. Real-time audio processing for contact centers and communication platforms.

What’s coming next

Auto language detection. Automatic source language identification so developers don’t need to specify it per session.

Voice cloning. Preserve the speaker’s original voice in the translated output.

Additional audio formats beyond the current PCM S16LE 16 kHz mono.

Start building

This engine was built inside enterprise contact centers, on calls where a wrong word means a patient safety incident, a disputed insurance claim, or a compliance violation. 96% accuracy measured on live calls, not studio audio. 1M+ minutes of production translation. 30 languages benchmarked across 6 domains with AutoQA scores of 94–96. BLEU scores that match professional human translators.

The access model changed. The engine didn’t.


FAQ

How accurate is Krisp's voice translation API?

It delivers 96% accuracy measured on real enterprise calls, not studio audio. That figure comes from more than 1M minutes of production call translation in live contact centers, where heavy accents, background noise, and high-stakes details like policy numbers and medication names test accuracy under real conditions. Most voice translation APIs report accuracy on clean benchmark recordings, and those numbers typically drop 5 to 10 points in production.

How is voice translation accuracy measured?

Krisp uses three independent validation layers. Transcription is scored with Word Error Rate (WER) — top languages like Italian (2.07%) and Spanish (2.11%) achieve WER under 2.5%. Translation is scored with BLEU bidirectionally (English→target and target→English), plus chrF++ for morphologically complex languages like Turkish, Finnish, and Hungarian. Krisp’s AutoQA system then rates every conversation across intent accuracy (35%), entity accuracy (30%), conversation flow (25%), and naturalness (10%), averaging 94–96 across all 30 benchmarked languages, with bilingual professional linguists across 8 languages confirming the results.

Why do voice translation APIs lose accuracy in production?

Because real conditions look nothing like a demo. Real users have accents, speak over background noise, and use domain-specific terms — medication names, policy numbers, account details — that general-purpose engines tend to hallucinate or garble. Most APIs report accuracy on clean studio recordings, so their numbers typically fall 5 to 10 points once they hit real calls. Krisp built and measured its engine inside enterprise contact centers, the most unforgiving environment for voice AI, so its reported accuracy reflects the conditions you’ll actually deploy in.

How is this different from Krisp's enterprise voice translation?

It’s the same engine, not a new one: the same model, accuracy, and language support that powers Krisp Voice Translation in live enterprise contact centers today, with more than 1M minutes of production call translation behind it. The only thing that changed is the access model. It’s now available self-serve via API, with Python and JavaScript SDKs, a developer dashboard and playground, and 60 minutes of free translation credit to start.
