


{"id":11051,"date":"2024-03-04T22:01:55","date_gmt":"2024-03-04T18:01:55","guid":{"rendered":"https:\/\/krisp.ai\/blog\/?p=11051"},"modified":"2025-02-20T16:43:19","modified_gmt":"2025-02-20T12:43:19","slug":"deep-dive-ai-accent-conversion-for-call-centers","status":"publish","type":"post","link":"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/","title":{"rendered":"Deep Dive: AI\u2019s Role in Accent Conversion for Call Centers"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">In this article, we dive deep into a new disruptive technology called AI Accent Conversion, which in real-time translates a speaker\u2019s accent to the listener\u2019s natively understood accent, using AI.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Accent refers to the distinctive way in which a group of people pronounce words, influenced by their region, country, or social background. In broad terms, English accents can be categorized into major groups such as British, American, Australian, South African, and Indian among others.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Accents can often be a barrier to communication, affecting the clarity and comprehension of speech. 
Differences in pronunciation, intonation, and rhythm can lead to misunderstandings.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">While the importance of this topic goes beyond call centers, our primary focus is this industry.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Offshore expansion and accented speech in call centers<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">The call center industry in the United States has <\/span><a href=\"https:\/\/www.siteselectiongroup.com\/whitepapers\"><span style=\"font-weight: 400;\">experienced<\/span><\/a><span style=\"font-weight: 400;\"> substantial growth, with a noticeable surge in the creation of new jobs from 2020 onward, both onshore and globally.<\/span><\/p>\n<p><img loading=\"lazy\" class=\"size-full wp-image-11053 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-1.png\" alt=\"\" width=\"1156\" height=\"930\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-1.png 1156w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-1-300x241.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-1-380x306.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-1-768x618.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-1-600x483.png 600w\" sizes=\"(max-width: 1156px) 100vw, 1156px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">In 2021, many US-based call centers expanded their footprints thanks to the pandemic-fueled adoption of remote work, but growth slowed substantially in 2022. 
Inflated salaries and limited resources drove call centers to deepen their offshore operations, both in existing and new geographies.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">There are several strong incentives for businesses to expand call center operations to offshore locations, including:<\/span><\/p>\n<ul>\n<li><b>Cost savings<\/b><span style=\"font-weight: 400;\">: Labor costs in offshore locations such as India, the Philippines, and Eastern Europe are up to 70% lower than in the United States.<\/span><\/li>\n<li><b>Access to diverse talent pools:<\/b><span style=\"font-weight: 400;\"> Offshoring enables access to a diverse talent pool, often with multilingual capabilities, facilitating a more comprehensive customer support service.<\/span><\/li>\n<li><b>24\/7 coverage<\/b><span style=\"font-weight: 400;\">: Time zone differences allow for 24\/7 coverage, enhancing operational continuity.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">However, offshore operations come at a cost. One major challenge offshore call centers face is decreased language comprehension. Accents, varying fluency levels, cultural nuances, and inherent biases lead to misunderstandings and frustration among customers.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">According to Reuters, as many as <\/span><a href=\"https:\/\/www.reuters.com\/article\/idUSTRE5AN37C\/\"><span style=\"font-weight: 400;\">65% of customers<\/span><\/a><span style=\"font-weight: 400;\"> have cited difficulties in understanding offshore agents due to language-related issues. 
Over a third of consumers say working with US-based agents is most important to them when contacting an organization.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><span style=\"font-weight: 400;\">Ways accents create challenges in call centers<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">While the world celebrates global and diverse workforces at large, <\/span><a href=\"https:\/\/journals.sagepub.com\/doi\/10.1177\/0023830910372495\"><span style=\"font-weight: 400;\">research <\/span><\/a><span style=\"font-weight: 400;\">shows that misalignment of native language backgrounds between speakers leads to a lack of comprehension and inefficient communication.\u00a0<\/span><\/p>\n<ul>\n<li><b>Longer calls:<\/b><span style=\"font-weight: 400;\"> Thick accents contribute to comprehension difficulties, causing higher average handle time (AHT) and also lower first call resolutions (FCR).<br \/>\n<\/span><span style=\"font-weight: 400;\">According to ContactBabel\u2019s \u201c2024 US Contact Center Decision Maker\u2019s Guide\u201d the cost of mishearing and repetition per year for a 250-seat contact center exceeds $155,000 per year.<br \/>\n<\/span><\/li>\n<li><b>Decreased customer satisfaction<\/b><span style=\"font-weight: 400;\">: Language barriers are among the primary contributors to lower customer satisfaction scores within off-shore call centers. According to ContactBabel, 35% of consumers say working with US-based call center agents is most important to them when contacting an organization.<\/span><\/li>\n<li><b>High agent attrition rates:<\/b><span style=\"font-weight: 400;\"> Decreased customer satisfaction and increased escalations create high stress for agents, in turn decreasing agent morale. The result is higher employee turnover rates and short-term disability claims. 
In 2023, US contact centers saw an average annual agent attrition rate of 31%, according to <a href=\"https:\/\/resources.krisp.ai\/guide-to-agent-engagement-and-empowerment\">The US Contact Center Decision Makers&#8217; Guide to Agent Engagement and Empowerment<\/a>.<\/span><\/li>\n<li><b>Increased onboarding costs: <\/b><span style=\"font-weight: 400;\">The need for specialized training programs to address language and cultural nuances further adds to onboarding costs.\u00a0<\/span><\/li>\n<li><b>Limited talent pool: <\/b><span style=\"font-weight: 400;\">Finding individuals who meet the required linguistic criteria within the available talent pool is challenging. The competitive demand for specialized language skills leads to increased recruitment costs.\u00a0<\/span><\/li>\n<\/ul>\n<h2><span style=\"font-weight: 400;\">How do call centers mitigate accent challenges today?<\/span><\/h2>\n<h3>Training<\/h3>\n<p><span style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Accent neutralization training is used as a solution to improve communication clarity in these environments. Call Centers invest in weeks-long accent neutralization training as part of agent onboarding and ongoing improvement. <\/span><\/span>Depending on\u00a0 geography, duration, and training method, training costs can run $500-$1500 per agent during onboarding. The effectiveness of these training programs can be limited due to the inherent challenges in altering long-established accent habits. 
So, call centers may find it necessary to temporarily remove agents from their operational roles for further retraining, incurring additional costs in the process.<br \/>\n<img loading=\"lazy\" class=\"aligncenter size-full wp-image-11054\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-2.jpg\" alt=\"\" width=\"937\" height=\"528\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-2.jpg 937w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-2-300x169.jpg 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-2-380x214.jpg 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-2-768x433.jpg 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-2-600x338.jpg 600w\" sizes=\"(max-width: 937px) 100vw, 937px\" \/><\/p>\n<h3><b>Limited geography for expansion<\/b><\/h3>\n<p>Call centers limit their site selection to regions and countries where the accents of the available talent pool are considered more neutral to the customer&#8217;s native language, sacrificing locations that would be more cost-effective.<br \/>\n<img loading=\"lazy\" class=\"aligncenter size-full wp-image-11056\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-3.png\" alt=\"\" width=\"992\" height=\"992\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-3.png 992w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-3-300x300.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-3-380x380.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-3-150x150.png 150w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-3-768x768.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-3-600x600.png 600w\" sizes=\"(max-width: 992px) 100vw, 992px\" \/><\/p>\n<h2><span style=\"font-weight: 400;\">Enter AI-Powered Accent Conversion<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Recent 
advancements in Artificial Intelligence have introduced new accent conversion technology. This technology leverages AI to translate source accents to target accents in real-time, with the click of a button. While the technologies in production don\u2019t support multiple accents in parallel, over time this will be solved as well.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">State of the Art AI Accent Conversion Demo<\/span><\/h3>\n<p><iframe title=\"Krisp AI Accent Localization demo: Indian accent pack - male agent\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/x0_6h17RskU?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Below is the evolution of Krisp&#8217;s AI Accent Conversion technology over the past 2 years.<\/span><\/p>\n<table style=\"height: 388px;\" width=\"983\">\n<thead>\n<tr>\n<th><span style=\"font-weight: 400;\">Version<\/span><\/th>\n<th><span style=\"font-weight: 400;\">Demo<\/span><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><span style=\"font-weight: 400;\">v0.1 First model<\/span><\/td>\n<td style=\"width: 300px;\"><!--[if lt IE 9]><script>document.createElement('audio');<\/script><![endif]--><br \/>\n<audio class=\"wp-audio-shortcode\" id=\"audio-11051-1\" preload=\"none\" style=\"width: 100%;\" controls=\"controls\"><source type=\"audio\/wav\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/v0.1.wav?_=1\" \/><a href=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/v0.1.wav\">https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/v0.1.wav<\/a><\/audio><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">v0.2 A bit more natural sound<\/span><\/td>\n<td style=\"width: 300px;\"><audio class=\"wp-audio-shortcode\" id=\"audio-11051-2\" preload=\"none\" 
style=\"width: 100%;\" controls=\"controls\"><source type=\"audio\/mpeg\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/v0.2-online-audio-converter.com_.mp3?_=2\" \/><a href=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/v0.2-online-audio-converter.com_.mp3\">https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/v0.2-online-audio-converter.com_.mp3<\/a><\/audio><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">v0.3 A bit more natural sound<\/span><\/td>\n<td style=\"width: 300px;\"><audio class=\"wp-audio-shortcode\" id=\"audio-11051-3\" preload=\"none\" style=\"width: 100%;\" controls=\"controls\"><source type=\"audio\/wav\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/v0.3.wav?_=3\" \/><a href=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/v0.3.wav\">https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/v0.3.wav<\/a><\/audio><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">v0.4 Improved voice<\/span><\/td>\n<td style=\"width: 300px;\"><audio class=\"wp-audio-shortcode\" id=\"audio-11051-4\" preload=\"none\" style=\"width: 100%;\" controls=\"controls\"><source type=\"audio\/wav\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/v0.4.wav?_=4\" \/><a href=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/v0.4.wav\">https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/v0.4.wav<\/a><\/audio><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">v0.5 Improved intonation transfer<\/span><\/td>\n<td style=\"width: 300px;\"><audio class=\"wp-audio-shortcode\" id=\"audio-11051-5\" preload=\"none\" style=\"width: 100%;\" controls=\"controls\"><source type=\"audio\/wav\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/v0.5.wav?_=5\" \/><a href=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/v0.5.wav\">https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/v0.5.wav<\/a><\/audio><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>This 
innovation is revolutionary for call centers as it eliminates the need for difficult and expensive training and increases the talent pool worldwide, providing immediate scalability for offshore operations.<\/p>\n<p>&nbsp;<\/p>\n<p>It&#8217;s also highly convenient for agents and reduces the cognitive load and stress they have today. This translates to decreased short-term disability claims and attrition rates, and overall improved agent experience.<\/p>\n<h2><span style=\"font-weight: 400;\">Deploying AI Accent Conversion in the call center<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">There are various ways AI Accent Conversion can be integrated into a call center\u2019s tech stack.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">It can be embedded into a call center&#8217;s existing CX software (e.g. CCaaS and UCaaS) or installed as a separate application on the agent\u2019s machine (e.g. <\/span><a href=\"\/contact-center\/\"><span style=\"font-weight: 400;\">Krisp<\/span><\/a><span style=\"font-weight: 400;\">).<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Currently, there are no CX solutions in market with accent conversion capabilities, leaving the latter as the only possible path forward for call centers looking to leverage this technology today.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Applications like <\/span><a href=\"https:\/\/krisp.ai\/contact-center\/\"><span style=\"font-weight: 400;\">Krisp<\/span><\/a><span style=\"font-weight: 400;\"> have accent conversion built in their offerings.\u00a0<\/span><\/p>\n<p><img loading=\"lazy\" class=\"size-full wp-image-11055 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-4.png\" alt=\"\" width=\"524\" height=\"684\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-4.png 524w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-4-230x300.png 230w, 
https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-4-345x450.png 345w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-4-460x600.png 460w\" sizes=\"(max-width: 524px) 100vw, 524px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">These applications are on-device, meaning they sit locally on the agent\u2019s machine. They support all CX software platforms out of the box since they are installed as a virtual microphone and speaker.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">AI runs on an agent&#8217;s device so there is no additional load on the network.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The deployment and management can be done remotely, and at scale, from the admin dashboard.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><span style=\"font-weight: 400;\">Challenges of building AI Accent Conversion technology<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">At a fundamental level, speech can be divided into 4 parts: voice, text, prosody and accent.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Accents can be divided into 4 parts as well &#8211; phoneme, intonation, stress and rhythm.<\/span><\/p>\n<p><img loading=\"lazy\" class=\"size-full wp-image-11064 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-5.png\" alt=\"\" width=\"1368\" height=\"790\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-5.png 1368w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-5-300x173.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-5-380x219.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-5-768x444.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-5-600x346.png 600w\" sizes=\"(max-width: 1368px) 100vw, 1368px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">In order to convert or translate an accent, three of these parts must 
be changed &#8211; phoneme pronunciation, intonation, and stress. Doing this in real-time is an extremely difficult technical problem.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">While there are numerous technical challenges in building this technology, we will focus on eight major ones.<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Data Collection<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Speech Synthesis<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Low Latency<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Background Noises and Voices<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Acoustic Conditions<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Maintaining Correct Intonation<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Maintaining Speaker\u2019s Voice<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Wrong Pronunciations<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">Let\u2019s discuss them individually.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">1) Data collection<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Collecting accented speech data is a tough process. The data must be highly representative of different dialects spoken in the source language. Also, it should cover various voices, age groups, speaking rates, prosody, and emotion variations. 
For call centers, it is preferable to have natural conversational speech samples with rich vocabulary targeted for the use case.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">There are two options: purchase existing datasets or record the data in-house. In practice, both can be done in parallel.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">An ideal dataset would consist of thousands of hours of speech where each source-accent utterance is mapped to a target-accent utterance and aligned with it accurately.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">However, getting precise alignment is exceedingly challenging due to variations in the duration of phoneme pronunciations. Nonetheless, improved alignment accuracy contributes to superior results.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">2) Speech synthesis<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">The speech synthesis part of the model, which is sometimes referred to as the vocoder algorithm in research, should produce a high-quality, natural-sounding speech waveform.\u00a0 It is expected to sound closer to the target accent, have high intelligibility, be low-latency, convey natural emotions and intonation, be robust against noise and background voices, and be compatible with various acoustic environments.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">3) Low latency<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">As the International Telecommunication Union\u2019s G.114 recommendation shows, speech transmission maintains acceptable quality during real-time communication if the one-way delay is less than approximately 300 ms. 
Therefore, the latency of the end-to-end accent conversion system should be within that range to ensure it does not impact the quality of real-time conversation.<img loading=\"lazy\" class=\"aligncenter wp-image-11063\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-6.png\" alt=\"\" width=\"600\" height=\"424\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-6.png 906w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-6-300x212.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-6-380x268.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-6-768x543.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-6-600x424.png 600w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><\/span><\/p>\n<p><span style=\"font-weight: 400;\">There are two ways to run this technology: locally or in the cloud. While both have theoretical advantages, in practice, most systems with similar characteristics (e.g. AI-powered noise cancellation, voice conversion, etc.) have been successfully deployed locally. This is mostly due to hard requirements around latency and scale.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">To be able to run locally, the end-to-end neural network must be small and highly optimized, which requires significant engineering resources.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">4) Background noise and voices<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Having a sophisticated noise cancellation system is crucial for this Voice AI technology. Otherwise, the speech synthesis model will generate unwanted artifacts.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Not only should it eliminate the input background noise but also the input background voices. 
Any sound that is not the speaker\u2019s voice must be suppressed.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">This is especially important in call center environments where multiple agents sit in close proximity to each other, serving multiple customers simultaneously over the phone.\u00a0<\/span><\/p>\n<p><img loading=\"lazy\" class=\"size-full wp-image-11062 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-7.png\" alt=\"\" width=\"1600\" height=\"899\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-7.png 1600w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-7-300x169.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-7-380x214.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-7-768x432.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-7-1536x863.png 1536w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-7-600x337.png 600w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Detecting and filtering out other human voices is a very difficult problem. As of this writing, to our knowledge, there is only one system doing it properly today &#8211; Krisp&#8217;s <a href=\"\/contact-center\/\">AI Noise Cancellation<\/a> technology.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><span style=\"font-weight: 400;\">5) Acoustic conditions<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Acoustic conditions differ for call center agents. 
The sheer volume of combinations of device microphones and room setups (responsible for room echo) makes it very difficult to design a system that is robust against such input variations.\u00a0<\/span><\/p>\n<p><img loading=\"lazy\" class=\"wp-image-11061 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-8.png\" alt=\"\" width=\"642\" height=\"642\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-8.png 992w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-8-300x300.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-8-380x380.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-8-150x150.png 150w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-8-768x768.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-8-600x600.png 600w\" sizes=\"(max-width: 642px) 100vw, 642px\" \/><\/p>\n<h3><span style=\"font-weight: 400;\">6) Maintaining the speaker\u2019s intonation<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Not transferring the speaker&#8217;s intonation in the generated speech will result in robotic speech that sounds worse than the original.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Krisp addressed this issue by developing an algorithm that captures the input speaker\u2019s intonation details in real-time and leverages this information in the synthesized speech. 
Solving this challenging problem allowed us to increase the naturalness of the generated speech.\u00a0\u00a0<\/span><\/p>\n<p><img loading=\"lazy\" class=\"wp-image-11060 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-9.png\" alt=\"\" width=\"633\" height=\"633\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-9.png 992w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-9-300x300.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-9-380x380.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-9-150x150.png 150w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-9-768x768.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-9-600x600.png 600w\" sizes=\"(max-width: 633px) 100vw, 633px\" \/><\/p>\n<h3><span style=\"font-weight: 400;\">7) Maintaining the speaker\u2019s voice<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">It is desirable to maintain the speaker&#8217;s vocal characteristics (e.g., formants, timbre) while generating output speech. 
This is a major challenge and one potential solution is designing the speech synthesis component so that it generates speech conditioned on the input speaker&#8217;s voice &#8216;fingerprint&#8217; &#8211; a special vector encoding a unique acoustic representation of an individual&#8217;s voice.<\/span><\/p>\n<p><img loading=\"lazy\" class=\"size-full wp-image-11059 aligncenter\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-10.png\" alt=\"\" width=\"512\" height=\"319\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-10.png 512w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-10-300x187.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-10-380x237.png 380w\" sizes=\"(max-width: 512px) 100vw, 512px\" \/><\/p>\n<h3><span style=\"font-weight: 400;\">8) Wrong pronunciations<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Mispronounced words can be difficult to correct in real-time, as the general setup would require separate automatic speech recognition and language modeling blocks, which introduce significant algorithmic delays and fail to meet the low latency criterion.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">3 technical approaches to AI Accent Conversion<\/span><\/h2>\n<h3><span style=\"font-weight: 400;\">Approach 1: Speech \u2192 STT \u2192 Speech<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">One approach to accent conversion involves applying Speech-to-Text (STT) to the input speech and subsequently utilizing Text-to-Speech (TTS) algorithms to synthesize the target speech.\u00a0<\/span><\/p>\n<p><img loading=\"lazy\" class=\"aligncenter wp-image-11193 size-full\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/04.jpg\" alt=\"\" width=\"1000\" height=\"257\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/04.jpg 1000w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/04-300x77.jpg 300w, 
https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/04-380x98.jpg 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/04-768x197.jpg 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/04-600x154.jpg 600w\" sizes=\"(max-width: 1000px) 100vw, 1000px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">This approach is relatively straightforward and involves common technologies like STT and TTS, making it conceptually simple to implement.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">STT and TTS are well-established, with existing solutions and tools readily available.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Integration into the algorithm can leverage these technologies effectively. These represent the strengths of the method, yet it is not without its drawbacks. There are 3 of them:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The difficulty of having accent-robust STT with a very low word error rate.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The TTS algorithm must possess capabilities to manage emotions, intonation, and speaking rate, which should come from original accented input and produce speech that sounds natural.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Algorithmic delay within the STT plus TTS pipeline may fall short of meeting the demands of real-time communication.<\/span><\/li>\n<\/ul>\n<h3><span style=\"font-weight: 400;\">Approach 2: Speech \u2192 Phoneme \u2192 Speech<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">First, let\u2019s define what a phoneme is. A phoneme is the smallest unit of sound in a language that can distinguish words from each other. It is an abstract concept used in linguistics to understand how language sounds function to encode meaning. 
Different languages have different sets of phonemes; the number of phonemes in a language can vary widely, from as few as 11 to over 100. Phonemes themselves do not have inherent meaning but work within the system of a language to create meaningful distinctions between words. For example, the English phonemes \/p\/ and \/b\/ differentiate the words &#8220;pat&#8221; and &#8220;bat.&#8221;<\/span><\/p>\n<p><img loading=\"lazy\" class=\"aligncenter wp-image-11058 size-full\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-11.png\" alt=\"table of English phenomes, mapping source speech to a phonetic representation, then the result to the target speech\u2019s phonetic representation\" width=\"502\" height=\"512\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-11.png 502w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-11-294x300.png 294w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/image-11-380x388.png 380w\" sizes=\"(max-width: 502px) 100vw, 502px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">The objective is to first map the source speech to a phonetic representation, then map the result to the target speech\u2019s phonetic representation (content), and then synthesize the target speech from it.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">This approach enables the achievement of comparatively smaller delays than Approach 1. However, it faces the challenge of generating natural-sounding speech output, and reliance solely on phoneme information is insufficient for accurately reconstructing the target speech. To address this issue, the model should also extract additional features such as speaking rate, emotions, loudness, and vocal characteristics. 
These features are then combined with the phonetic content to synthesize the target speech with these attributes preserved.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Approach 3: Speech \u2192 Speech<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Another approach is to create parallel data using deep learning or digital signal processing techniques. This entails generating a native-sounding output in the target accent for each accented speech input, maintaining consistent emotions, naturalness, and vocal characteristics, and achieving an ideal frame-by-frame alignment with the input data.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">If high-quality parallel data are available, the accent conversion model can be implemented as a single neural network trained to directly map input accented speech to target native speech.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The biggest challenge of this approach is obtaining high-quality parallel data. The quality of the final model directly depends on the quality of the parallel data.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Another drawback is the lack of integrated explicit control over speech characteristics, such as intonation, voice, or loudness.
Without this control, the model may fail to accurately learn these important aspects.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">How to measure the quality of AI Accent Conversion output<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">High-quality output of accent conversion technology should:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Be intelligible<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Have little or no accentedness (the degree of deviation from the native accent)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Sound natural<\/span><\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">To evaluate these quality features, we use the following objective metrics:\u00a0<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Word Error Rate (WER)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Phoneme Error Rate (PER)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Naturalness prediction<\/span><\/li>\n<\/ul>\n<h3><span style=\"font-weight: 400;\">Word Error Rate (WER)<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">WER is a crucial metric used to assess STT systems&#8217; accuracy.
It quantifies the word-level errors of a predicted transcription compared to a reference transcription.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">To compute WER, we run a high-quality STT system on speech generated from test audios that come with predefined transcripts.\u00a0<\/span><\/p>\n<p><img loading=\"lazy\" class=\"aligncenter wp-image-11196 size-full\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/01.jpg\" alt=\"\" width=\"1000\" height=\"502\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/01.jpg 1000w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/01-300x151.jpg 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/01-380x191.jpg 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/01-768x386.jpg 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/01-600x301.jpg 600w\" sizes=\"(max-width: 1000px) 100vw, 1000px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">The evaluation process is the following:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The test set is processed through the candidate accent conversion model to obtain the converted speech samples.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">These converted speech samples are then fed into the STT system to generate the predicted transcriptions.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">WER is calculated using the predicted and the reference texts.<\/span><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The assumption in this methodology is that a model demonstrating better intelligibility will have a lower WER score.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Phoneme Error Rate (PER)<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">The AL model may retain some aspects of the original accent in the
converted speech, notably in the pronunciation of phonemes. Given that state-of-the-art STT systems are designed to be robust to various accents, they might still achieve low WER scores even when the speech exhibits accented characteristics.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">To identify phonetic mistakes, we employ the Phoneme Error Rate (PER) as a more suitable metric than WER. PER is calculated in a manner similar to WER, focusing on phoneme errors in the transcription, rather than word-level errors.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">For PER calculation, a high-quality phoneme recognition model is used, such as the one available at <\/span><a href=\"https:\/\/huggingface.co\/facebook\/wav2vec2-xlsr-53-espeak-cv-ft\"><span style=\"font-weight: 400;\">https:\/\/huggingface.co\/facebook\/wav2vec2-xlsr-53-espeak-cv-ft<\/span><\/a><span style=\"font-weight: 400;\">. The evaluation process is as follows:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The test set is processed by the candidate AL model to produce the converted speech samples.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">These converted speech samples are fed into the phoneme recognition system to obtain the predicted phonetic transcriptions.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">PER is calculated using predicted and reference phonetic transcriptions.<\/span><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">This method addresses the phonetic precision of the AL model to a certain extent.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Naturalness Prediction<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">To assess the naturalness of generated speech, one common method involves conducting subjective listening tests. 
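Both WER and PER described above reduce to the same computation: a Levenshtein edit distance between token sequences, normalized by the reference length. The sketch below is a minimal illustration of the metric itself, operating on plain token lists, not the production evaluation pipeline:

```python
def error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Levenshtein distance between token sequences, normalized by reference length.

    With word tokens this is WER; with phoneme tokens it is PER.
    """
    n, m = len(reference), len(hypothesis)
    assert n > 0, "reference must be non-empty"
    # dp[j] holds the edit distance between the first i reference tokens
    # and the first j hypothesis tokens (rolling single-row DP).
    dp = list(range(m + 1))
    for i in range(1, n + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, m + 1):
            cur = dp[j]
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + cost)    # substitution (or match)
            prev = cur
    return dp[m] / n

# One substitution out of four reference words -> WER = 0.25.
print(error_rate("the call was clear".split(), "the call is clear".split()))
```

Feeding word tokens from the STT transcriptions yields WER; feeding phoneme sequences from a phoneme recognizer yields PER in exactly the same way.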
In these tests, listeners are asked to rate the speech samples on a 5-point scale, where 1 denotes very robotic speech and 5 denotes highly natural speech.\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The average of these ratings, known as the Mean Opinion Score (MOS), serves as the naturalness score for the given sample.<\/span><\/p>\n<p><img loading=\"lazy\" class=\"aligncenter wp-image-11192 size-full\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/05.jpg\" alt=\"\" width=\"1000\" height=\"502\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/05.jpg 1000w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/05-300x151.jpg 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/05-380x191.jpg 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/05-768x386.jpg 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/05-600x301.jpg 600w\" sizes=\"(max-width: 1000px) 100vw, 1000px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">In addition to subjective evaluations, it is also feasible to obtain an objective measure of speech naturalness. Predicting the naturalness of generated speech with AI is a distinct research direction. Models in this domain are trained on large datasets of subjective listening assessments of the naturalness of generated speech (obtained from various speech-generating systems such as text-to-speech and voice conversion).\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">These models are designed to predict the MOS score for a speech sample based on its characteristics. Developing such models is a significant challenge and remains an active area of research. Therefore, one should be careful when using these models to predict naturalness.
Notable examples include the self-supervised learned MOS predictor and NISQA, which represent significant advances in this field.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">In addition to the objective metrics mentioned above, we conduct subjective listening tests and calculate objective scores using MOS predictors. We also manually examine the quality of these objective assessments. This approach enables a thorough analysis of the naturalness of our AL models, ensuring a well-rounded evaluation of their performance.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">AI Accent Conversion model training and inference<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">The following diagrams show how the training and inference are organized.<\/span><\/p>\n<p><img loading=\"lazy\" class=\"aligncenter wp-image-11195 size-full\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/02.jpg\" alt=\"\" width=\"1000\" height=\"700\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/02.jpg 1000w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/02-300x210.jpg 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/02-380x266.jpg 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/02-768x538.jpg 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/02-600x420.jpg 600w\" sizes=\"(max-width: 1000px) 100vw, 1000px\" \/><\/p>\n<p style=\"text-align: center;\"><span style=\"font-weight: 400;\">AI Training<\/span><\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: center;\"><span style=\"font-weight: 400;\"><img loading=\"lazy\" class=\"aligncenter wp-image-11194 size-full\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/03.jpg\" alt=\"\" width=\"1000\" height=\"257\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/03.jpg 1000w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/03-300x77.jpg 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/03-380x98.jpg
380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/03-768x197.jpg 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/03\/03-600x154.jpg 600w\" sizes=\"(max-width: 1000px) 100vw, 1000px\" \/><\/span><\/p>\n<p style=\"text-align: center;\"><span style=\"font-weight: 400;\">AI Inference<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2>Closing<\/h2>\n<p>In navigating the complexities of global call center operations, AI Accent Conversion technology is a disruptive innovation, primed to bridge language barriers and elevate customer service while expanding talent pools, reducing costs, and revolutionizing CX.<\/p>\n<h3><span style=\"font-weight: 400;\">References<\/span><\/h3>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/www.smartcommunications.com\/resources\/news\/benchmark-report-2023-2\/\"><span style=\"font-weight: 400;\">https:\/\/www.smartcommunications.com\/resources\/news\/benchmark-report-2023-2\/<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/info.siteselectiongroup.com\/blog\/site-selection-group-releases-2023-global-call-center-location-trend-report\">https:\/\/info.siteselectiongroup.com\/blog\/site-selection-group-releases-2023-global-call-center-location-trend-report\u00a0<\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/www.siteselectiongroup.com\/whitepapers\"><span style=\"font-weight: 400;\">https:\/\/www.siteselectiongroup.com\/whitepapers<\/span><\/a><span style=\"font-weight: 400;\">\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/www.reuters.com\/article\/idUSTRE5AN37C\/\"><span style=\"font-weight: 400;\">https:\/\/www.reuters.com\/article\/idUSTRE5AN37C\/<\/span><\/a><span style=\"font-weight: 400;\">\u00a0<\/span><\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this article, we dive deep into a new disruptive technology called AI Accent Conversion, which in 
real-time translates a speaker\u2019s accent to the listener\u2019s natively understood accent, using AI. &nbsp; Accent refers to the distinctive way in which a group of people pronounce words, influenced by their region, country, or social background. In broad [&hellip;]<\/p>\n","protected":false},"author":22,"featured_media":11075,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"two_page_speed":[]},"categories":[517,421,413],"tags":[],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v24.2 (Yoast SEO v23.6) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Deep Dive: AI\u2019s Role in Accent Conversion for Call Centers - Krisp<\/title>\n<meta name=\"description\" content=\"This report details AI Accent Conversion&#039;s development, deployment challenges, and ability to revolutionize offshore call center operations\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Deep Dive: AI\u2019s Role in Accent Conversion for Call Centers - Krisp\" \/>\n<meta property=\"og:description\" content=\"This report details AI Accent Conversion&#039;s development, deployment challenges, and ability to revolutionize offshore call center operations\" \/>\n<meta property=\"og:url\" content=\"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/\" \/>\n<meta property=\"og:site_name\" content=\"Krisp\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/krispHQ\/\" \/>\n<meta property=\"article:published_time\" content=\"2024-03-04T18:01:55+00:00\" \/>\n<meta property=\"article:modified_time\" 
content=\"2025-02-20T12:43:19+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/AL-blog.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1000\" \/>\n\t<meta property=\"og:image:height\" content=\"700\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Krisp Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@krispHQ\" \/>\n<meta name=\"twitter:site\" content=\"@krispHQ\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/\"},\"author\":{\"name\":\"Krisp Team\",\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/person\/0496a17834794b226cc0925eabe55a2d\"},\"headline\":\"Deep Dive: AI\u2019s Role in Accent Conversion for Call Centers\",\"datePublished\":\"2024-03-04T18:01:55+00:00\",\"dateModified\":\"2025-02-20T12:43:19+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/\"},\"wordCount\":3045,\"commentCount\":3,\"publisher\":{\"@id\":\"https:\/\/krisp.ai\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/AL-blog.png\",\"articleSection\":[\"AI Accent Conversion\",\"Engineering 
Blog\",\"Enterprise\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/\",\"url\":\"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/\",\"name\":\"Deep Dive: AI\u2019s Role in Accent Conversion for Call Centers - Krisp\",\"isPartOf\":{\"@id\":\"https:\/\/krisp.ai\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/AL-blog.png\",\"datePublished\":\"2024-03-04T18:01:55+00:00\",\"dateModified\":\"2025-02-20T12:43:19+00:00\",\"description\":\"This report details AI Accent Conversion's development, deployment challenges, and ability to revolutionize offshore call center 
operations\",\"breadcrumb\":{\"@id\":\"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/#primaryimage\",\"url\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/AL-blog.png\",\"contentUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/AL-blog.png\",\"width\":1000,\"height\":700},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/krisp.ai\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Deep Dive: AI\u2019s Role in Accent Conversion for Call 
Centers\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/krisp.ai\/blog\/#website\",\"url\":\"https:\/\/krisp.ai\/blog\/\",\"name\":\"Krisp\",\"description\":\"Blog\",\"publisher\":{\"@id\":\"https:\/\/krisp.ai\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/krisp.ai\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/krisp.ai\/blog\/#organization\",\"name\":\"Krisp\",\"url\":\"https:\/\/krisp.ai\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png\",\"contentUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png\",\"width\":696,\"height\":696,\"caption\":\"Krisp\"},\"image\":{\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/krispHQ\/\",\"https:\/\/x.com\/krispHQ\",\"https:\/\/www.linkedin.com\/company\/krisphq\/\",\"https:\/\/www.youtube.com\/channel\/UCAMZinJdR9P33fZUNpuxXtg\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/person\/0496a17834794b226cc0925eabe55a2d\",\"name\":\"Krisp Team\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2023\/10\/cropped-Favicon-96x96.png\",\"contentUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2023\/10\/cropped-Favicon-96x96.png\",\"caption\":\"Krisp Team\"},\"description\":\"Here at Krisp, we are passionate about making your life more productive and easy by building noise cancelling app that removes background noise during 
calls.\",\"url\":\"https:\/\/krisp.ai\/blog\/author\/krisp-team\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Deep Dive: AI\u2019s Role in Accent Conversion for Call Centers - Krisp","description":"This report details AI Accent Conversion's development, deployment challenges, and ability to revolutionize offshore call center operations","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/","og_locale":"en_US","og_type":"article","og_title":"Deep Dive: AI\u2019s Role in Accent Conversion for Call Centers - Krisp","og_description":"This report details AI Accent Conversion's development, deployment challenges, and ability to revolutionize offshore call center operations","og_url":"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/","og_site_name":"Krisp","article_publisher":"https:\/\/www.facebook.com\/krispHQ\/","article_published_time":"2024-03-04T18:01:55+00:00","article_modified_time":"2025-02-20T12:43:19+00:00","og_image":[{"width":1000,"height":700,"url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/AL-blog.png","type":"image\/png"}],"author":"Krisp Team","twitter_card":"summary_large_image","twitter_creator":"@krispHQ","twitter_site":"@krispHQ","schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/#article","isPartOf":{"@id":"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/"},"author":{"name":"Krisp Team","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/person\/0496a17834794b226cc0925eabe55a2d"},"headline":"Deep Dive: AI\u2019s Role in Accent Conversion for Call 
Centers","datePublished":"2024-03-04T18:01:55+00:00","dateModified":"2025-02-20T12:43:19+00:00","mainEntityOfPage":{"@id":"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/"},"wordCount":3045,"commentCount":3,"publisher":{"@id":"https:\/\/krisp.ai\/blog\/#organization"},"image":{"@id":"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/#primaryimage"},"thumbnailUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/AL-blog.png","articleSection":["AI Accent Conversion","Engineering Blog","Enterprise"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/","url":"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/","name":"Deep Dive: AI\u2019s Role in Accent Conversion for Call Centers - Krisp","isPartOf":{"@id":"https:\/\/krisp.ai\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/#primaryimage"},"image":{"@id":"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/#primaryimage"},"thumbnailUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/AL-blog.png","datePublished":"2024-03-04T18:01:55+00:00","dateModified":"2025-02-20T12:43:19+00:00","description":"This report details AI Accent Conversion's development, deployment challenges, and ability to revolutionize offshore call center 
operations","breadcrumb":{"@id":"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/#primaryimage","url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/AL-blog.png","contentUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/02\/AL-blog.png","width":1000,"height":700},{"@type":"BreadcrumbList","@id":"https:\/\/krisp.ai\/blog\/deep-dive-ai-accent-conversion-for-call-centers\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/krisp.ai\/blog\/"},{"@type":"ListItem","position":2,"name":"Deep Dive: AI\u2019s Role in Accent Conversion for Call Centers"}]},{"@type":"WebSite","@id":"https:\/\/krisp.ai\/blog\/#website","url":"https:\/\/krisp.ai\/blog\/","name":"Krisp","description":"Blog","publisher":{"@id":"https:\/\/krisp.ai\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/krisp.ai\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/krisp.ai\/blog\/#organization","name":"Krisp","url":"https:\/\/krisp.ai\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png","contentUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png","width":696,"height":696,"caption":"Krisp"},"image":{"@id":"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/krispHQ\/","https:\/\/x.com\/krispHQ","https:\/\/www.linkedin.
com\/company\/krisphq\/","https:\/\/www.youtube.com\/channel\/UCAMZinJdR9P33fZUNpuxXtg"]},{"@type":"Person","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/person\/0496a17834794b226cc0925eabe55a2d","name":"Krisp Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/person\/image\/","url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2023\/10\/cropped-Favicon-96x96.png","contentUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2023\/10\/cropped-Favicon-96x96.png","caption":"Krisp Team"},"description":"Here at Krisp, we are passionate about making your life more productive and easy by building noise cancelling app that removes background noise during calls.","url":"https:\/\/krisp.ai\/blog\/author\/krisp-team\/"}]}},"_links":{"self":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/11051"}],"collection":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/users\/22"}],"replies":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/comments?post=11051"}],"version-history":[{"count":51,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/11051\/revisions"}],"predecessor-version":[{"id":20755,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/11051\/revisions\/20755"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/media\/11075"}],"wp:attachment":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/media?parent=11051"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/categories?post=11051"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/tags?post=11051"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}