


{"id":9582,"date":"2022-09-12T22:51:55","date_gmt":"2022-09-12T18:51:55","guid":{"rendered":"https:\/\/krisp.ai\/blog\/?p=9582"},"modified":"2022-09-25T17:17:14","modified_gmt":"2022-09-25T13:17:14","slug":"speech-recognition-testing","status":"publish","type":"post","link":"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/","title":{"rendered":"On-Device Meeting Transcription and Speech Recognition Testing"},"content":{"rendered":"<p style=\"text-align: left;\">We are inevitably going to have even more online meetings in the future. So it\u2019s important not to get lost or lose context with all the information around us.<\/p>\n<p><span style=\"font-weight: 400;\">Just think about one of last week\u2019s meetings. Can your team really recall everything discussed on a call?<\/span><\/p>\n<p><span style=\"font-weight: 400;\">When no one remembers exactly what was discussed during an online meeting, the effectiveness of these calls is reduced materially.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Many meeting leaders take hand-written notes to memorialize what was discussed and to share and assign action items. However, the act of manually taking notes causes the note-taker to not be fully present during the call, and even the best note-takers miss key points and important dialogue.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Fixing meeting knowledge loss<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">In 1885, Hermann Ebbinghaus claimed that humans begin to lose recently acquired knowledge within hours of learning it. This means that you will recall the details of a conversation during the meeting, but, a day later, you\u2019ll only remember about 60% of that information. 
The following day, this drops to 40% and keeps getting lower until you recall very little of the specifics of the discussion.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The solution?<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Automatically transcribe meetings so that they can be reviewed and shared after the call. This approach helps us access important details discussed during a meeting, allowing for accurate and timely follow-up and preventing misunderstandings or missed deadlines due to miscommunication.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Many studies have shown that people are generally more likely to comprehend and remember visual information than information shared in events\/meetings that rely solely on audio.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Meeting transcriptions provide participants with a visual format of any spoken content and allow attendees to listen and read along at the same time. This makes for increased focus during meetings or events, and improved outcomes post-meeting.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">On-device processing<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Having meeting transcript technology available to work seamlessly with all online meeting applications allows for unlimited transcriptions without having to utilize expensive cloud services.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">At Krisp, we value privacy and keep the entire speech processing on-device, so no voice or audio is ever processed or stored in the cloud. This is very important from a security perspective, as all voice and transcribed data stay on-device under the user\u2019s control.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">It\u2019s quite a big challenge to make transcription technologies work on-device due to the constrained compute resources available compared to cloud-based servers. 
Most transcription solutions are cloud-based and don\u2019t deliver the accuracy and privacy of an on-device solution. On-device technologies need to be optimized to operate smoothly, with specific attention to:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Package size<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Memory footprint<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">CPU usage (calculated using the real-time factor (RTF), which is the ratio of the technology response time to the meeting duration)<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">So every time we test a new underlying speech model, we first ensure that it is able to operate within the limited resources available on most desktop and laptop computers.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Technologies behind meeting transcripts<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">At first glance, it looks like the only technology behind having readable meeting transcriptions is simply converting audio into text.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">But there are two main challenges with this simplistic approach:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">First, we don\u2019t explicitly include punctuation when speaking as we do when writing. So we can only guess punctuation from the acoustic information. 
<\/span><a href=\"https:\/\/www.researchgate.net\/publication\/325950545_User-centric_Evaluation_of_Automatic_Punctuation_in_ASR_Closed_Captioning\"><span style=\"font-weight: 400;\">Studies<\/span><\/a><span style=\"font-weight: 400;\"> have found that transcripts without punctuation are even more detrimental to understanding than a <\/span><span style=\"font-weight: 400;\">word error rate<\/span><span style=\"font-weight: 400;\"> of 15 or 20%. So we need a separate solution for adding punctuation and capitalization to text.\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The second challenge is to distinguish between texts spoken by different people. This distinction improves the readability <\/span><i><span style=\"font-weight: 400;\">and<\/span><\/i><span style=\"font-weight: 400;\"> understanding of the transcript. Differentiating between speakers is typically performed with a technology separate from core ASR, because the acoustic features inside speech recognition models are text-dependent; for speaker attribution, text-independent speaker features are needed.<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">To summarize, meeting transcription technology consists of three different technologies:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>ASR (Automatic Speech Recognition):<\/b><span style=\"font-weight: 400;\"> Enables the recognition and translation of spoken language into text. 
Applying this technology to an input audio stream gives us lowercase text with only apostrophes as punctuation marks.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Punctuation and capitalization of the text: <\/b><span style=\"font-weight: 400;\">Enables the addition of capitalization, periods, commas, and question marks.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Speaker diarization<\/b><span style=\"font-weight: 400;\">: Enables the partitioning of an input audio stream into homogeneous segments according to every speaker&#8217;s identity.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">The diagram below represents the process of generating meeting transcripts from an audio stream:<\/span><\/p>\n<p><img loading=\"lazy\" class=\"alignnone size-full wp-image-9586\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/diagram.png\" alt=\"\" width=\"1498\" height=\"412\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/diagram.png 1498w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/diagram-300x83.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/diagram-380x105.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/diagram-768x211.png 768w\" sizes=\"(max-width: 1498px) 100vw, 1498px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">As mentioned above, all of these technologies should work on a device, even when little CPU headroom is available. For good results, a proper testing mechanism is needed for each of the models. Metrics and datasets are the key components of this type of testing methodology. 
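To make the RTF criterion above concrete, here is a minimal sketch of how a real-time factor check can be scripted; the function name and the fake workload are our illustration, not part of any Krisp SDK:

```python
import time

def real_time_factor(process_fn, audio_seconds):
    """RTF = processing time / audio duration.

    RTF below 1 means the model processes audio faster than real time,
    a prerequisite for smooth on-device transcription.
    """
    start = time.perf_counter()
    process_fn()  # run the speech model over the recording
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds

# A stand-in workload: pretend the model needs 0.2 s for 60 s of audio.
rtf = real_time_factor(lambda: time.sleep(0.2), audio_seconds=60.0)
print(f"RTF = {rtf:.4f}")  # well below 1, so real-time capable
```

In a real test harness the lambda would be replaced by the model inference call, and the measurement would be repeated on representative low-end hardware.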
Each of the models has further testing nuances that we\u2019ll discuss below.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Datasets and benchmarks for speech recognition testing<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">To test our technologies, we use custom datasets along with publicly available datasets such as <\/span><a href=\"https:\/\/arxiv.org\/pdf\/2104.11348v3.pdf\"><span style=\"font-weight: 400;\">Earnings 21<\/span><\/a><span style=\"font-weight: 400;\">. This gives us a good comparison with both open source benchmarks and those provided through research papers.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For gathering the test data, we first define the use cases and collect data for each one. Let\u2019s take online meetings as the main use case. Here we need conversational data for testing purposes. Plus, we\u2019ll perform a competitor evaluation on the same data to see advantages and identify possible improvement areas for Krisp\u2019s meeting transcription technology.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Testing the ASR model<\/span><\/h2>\n<h3><span style=\"font-weight: 400;\">Metrics<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">The main testing metric of the ASR model is the WER (Word Error Rate), which is computed from the reference labeled text and the processed text as the number of substituted, deleted, and inserted words divided by the total number of words in the reference:<\/span><\/p>\n<p><img loading=\"lazy\" class=\"alignnone size-full wp-image-9615\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/Screen-Shot-2022-09-13-at-11.15.43.png\" alt=\"\" width=\"976\" height=\"382\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/Screen-Shot-2022-09-13-at-11.15.43.png 976w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/Screen-Shot-2022-09-13-at-11.15.43-300x117.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/Screen-Shot-2022-09-13-at-11.15.43-380x149.png 380w, 
https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/Screen-Shot-2022-09-13-at-11.15.43-768x301.png 768w\" sizes=\"(max-width: 976px) 100vw, 976px\" \/><\/p>\n<h3><span style=\"font-weight: 400;\">Datasets<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">After gathering custom conversational data for the main test, we augment it by adding:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Noise at signal-to-noise ratios (SNR) of 0, 5, and 10 dB.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Reverberation with reverberation times from 100 ms to 900 ms.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Low-pass and high-pass filters to simulate low-quality microphones.<\/span>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">For this scenario, we\u2019re also using the Earnings 21 dataset because its utterances have very low bandwidth.<\/span><\/li>\n<\/ul>\n<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Speech pace modifications.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">We also want our ASR to support accents such as American, British, Canadian, Indian, Australian, etc.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We\u2019ve collected data for each of those accents and calculated the WER, comparing the results with competitors.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Testing the punctuation and capitalization model<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">The main challenge of testing punctuation is the subjectivity factor. There can be multiple ways of rendering punctuation, and all of them can be correct. 
For instance, adding commas and even deciding on the length of a sentence depends on the grammar rules you want to use.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Metrics<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">The main metrics for measuring accuracy here are Precision, Recall, and the F1 score. These are calculated for each punctuation mark and capitalization instance.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Precision: <\/b><span style=\"font-weight: 400;\">The number of true predictions of a mark divided by the total number of predictions of the same mark.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Recall: <\/b><span style=\"font-weight: 400;\">The number of true predictions of a mark divided by the total number of occurrences of that mark in the reference.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>F1 score: <\/b><span style=\"font-weight: 400;\">The harmonic mean of precision and recall.<\/span><\/li>\n<\/ul>\n<h3><span style=\"font-weight: 400;\">Datasets<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Since we use the punctuation and capitalization model on top of ASR, we have to evaluate it on text that contains errors. Taking this into account, we run our ASR algorithm on the meeting data we collected. Then, linguists manually punctuate and capitalize the output texts. 
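The three metrics above can be computed per punctuation mark in a few lines. A minimal sketch of our own, assuming the reference and predicted outputs are already aligned token by token (a real evaluation would first align the two texts):

```python
def punctuation_prf(reference, predicted, mark):
    """Precision, recall, and F1 for one punctuation mark, given two
    token-aligned sequences (same words, possibly different marks)."""
    tp = sum(1 for r, p in zip(reference, predicted) if r == mark and p == mark)
    predicted_total = sum(1 for p in predicted if p == mark)
    reference_total = sum(1 for r in reference if r == mark)
    precision = tp / predicted_total if predicted_total else 0.0
    recall = tp / reference_total if reference_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

reference = ["hello", ",", "world", ".", "how", "are", "you", "?"]
predicted = ["hello", ".", "world", ".", "how", "are", "you", "?"]
p, r, f1 = punctuation_prf(reference, predicted, ".")
# One of the two predicted periods is correct, and the single reference
# period was found: precision = 0.5, recall = 1.0, F1 = 2/3.
```

Running this for every mark (period, comma, question mark) and for capitalization instances yields the per-class scores described above.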
Using these as labels, we\u2019re ready to calculate the three above-mentioned metrics [Precision, Recall, F1 score].<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Testing the speaker diarization model<\/span><\/h2>\n<h3><span style=\"font-weight: 400;\">Metrics<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">The main metrics of speaker diarization model testing are:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Diarization error rate (DER)<\/b><span style=\"font-weight: 400;\">, which is the sum of the following error rates:<\/span>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Speaker error: The percentage of scored time when a speaker ID is assigned to the wrong speaker. This type of error doesn\u2019t account for undetected overlap or for errors on non-speech frames.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">False alarm speech: The percentage of scored time when a hypothesized speaker segment corresponds to non-speech in the reference.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Missed speech: The percentage of scored time when a hypothesized non-speech segment corresponds to a reference speaker segment.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Overlap speaker: The percentage of scored time when some of the multiple speakers in a segment don\u2019t get assigned to any speaker.<\/span><\/li>\n<\/ul>\n<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Word diarization error rate<\/b><span style=\"font-weight: 400;\"> (see <\/span><a href=\"https:\/\/arxiv.org\/pdf\/1907.05337v1.pdf\"><span style=\"font-weight: 400;\">the Joint Speech Recognition and Speaker Diarization via Sequence Transduction paper<\/span><\/a><span style=\"font-weight: 400;\">), which is calculated as:<br \/>\n<\/span><img 
loading=\"lazy\" class=\"alignnone wp-image-9591 size-full\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/Screen-Shot-2022-09-13-at-00.04.46.png\" alt=\"\" width=\"1870\" height=\"676\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/Screen-Shot-2022-09-13-at-00.04.46.png 1870w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/Screen-Shot-2022-09-13-at-00.04.46-300x108.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/Screen-Shot-2022-09-13-at-00.04.46-380x137.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/Screen-Shot-2022-09-13-at-00.04.46-768x278.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/Screen-Shot-2022-09-13-at-00.04.46-1536x555.png 1536w\" sizes=\"(max-width: 1870px) 100vw, 1870px\" \/><\/li>\n<\/ul>\n<h3><span style=\"font-weight: 400;\">Datasets<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">We used the same custom datasets as for the ASR models, making sure that the number of speakers varies significantly from sample to sample in this test data. We also performed the same augmentations as in ASR testing.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Conclusions on speech recognition testing<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">On-device meeting transcription combines three different technologies. 
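As a recap of the diarization metric, here is a toy frame-level DER computation of our own; it is a simplification of real scoring tools, which operate on time segments and search for an optimal mapping between hypothesis and reference speaker labels:

```python
def diarization_error_rate(reference, hypothesis):
    """Frame-level DER sketch: one label per frame, None = non-speech.

    DER = (false alarm + missed speech + speaker error) / scored speech time.
    Assumes hypothesis speaker labels are already mapped to reference ones.
    """
    assert len(reference) == len(hypothesis)
    scored = sum(1 for r in reference if r is not None)
    false_alarm = sum(1 for r, h in zip(reference, hypothesis)
                      if r is None and h is not None)
    missed = sum(1 for r, h in zip(reference, hypothesis)
                 if r is not None and h is None)
    speaker_error = sum(1 for r, h in zip(reference, hypothesis)
                        if r is not None and h is not None and r != h)
    return (false_alarm + missed + speaker_error) / scored

reference = ["A", "A", "B", "B", None]
hypothesis = ["A", "B", "B", None, "C"]
der = diarization_error_rate(reference, hypothesis)
# One speaker error, one missed frame, one false alarm over 4 scored
# frames: DER = 3/4 = 0.75.
```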
All of them require extensive testing, given that they must run on-device.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The biggest challenges are choosing the right datasets and the right metrics for each use case, as well as ensuring that all technologies run on the device without impacting other running processes.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Try next-level audio and voice technologies<\/span><\/h2>\n<p><a href=\"https:\/\/krisp.ai\/blog\/voice-communication-quality-with-krisp-sdk\/\" target=\"_blank\" rel=\"noopener\">Krisp licenses its SDKs<\/a> to developers to embed directly into applications and devices. <a href=\"https:\/\/krisp.ai\/developers\/\" target=\"_blank\" rel=\"noopener\">Learn more about Krisp&#8217;s SDKs<\/a> and begin your evaluation today.<\/p>\n<p><a href=\"https:\/\/krisp.ai\/developers\/\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-9589\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/engineering-blog-cta.png\" alt=\"krisp sdk\" width=\"1280\" height=\"720\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/engineering-blog-cta.png 1280w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/engineering-blog-cta-300x169.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/engineering-blog-cta-380x214.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/engineering-blog-cta-768x432.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\" \/><\/a><\/p>\n<hr \/>\n<p>This article was written by:<\/p>\n<p>Vazgen Mikayelyan, PhD in Mathematics | Machine Learning Architect, Tech Lead<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We are inevitably going to have even more online meetings in the future. So it\u2019s important not to get lost or lose context with all the information around us. Just think about one of last week\u2019s meetings. Can your team really recall everything discussed on a call? 
When no one remembers exactly what was discussed [&hellip;]<\/p>\n","protected":false},"author":65,"featured_media":9583,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"two_page_speed":[]},"categories":[421],"tags":[],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v24.2 (Yoast SEO v23.6) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>On-Device Meeting Transcription and Speech Recognition Testing<\/title>\n<meta name=\"description\" content=\"We\u2019re analyzing the technologies needed for on-device meeting transcription, including speaker diarization and speech recognition testing.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"On-Device Meeting Transcription and Speech Recognition Testing\" \/>\n<meta property=\"og:description\" content=\"We\u2019re analyzing the technologies needed for on-device meeting transcription, including speaker diarization and speech recognition testing.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/\" \/>\n<meta property=\"og:site_name\" content=\"Krisp\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/krispHQ\/\" \/>\n<meta property=\"article:published_time\" content=\"2022-09-12T18:51:55+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-09-25T13:17:14+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/speech-testing-recognition.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1000\" \/>\n\t<meta property=\"og:image:height\" content=\"700\" \/>\n\t<meta 
property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Krisp Research Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@krispHQ\" \/>\n<meta name=\"twitter:site\" content=\"@krispHQ\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/\"},\"author\":{\"name\":\"Krisp Research Team\",\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/person\/172d23b73915155e0ab4e97868216bd1\"},\"headline\":\"On-Device Meeting Transcription and Speech Recognition Testing\",\"datePublished\":\"2022-09-12T18:51:55+00:00\",\"dateModified\":\"2022-09-25T13:17:14+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/\"},\"wordCount\":1497,\"commentCount\":2,\"publisher\":{\"@id\":\"https:\/\/krisp.ai\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/speech-testing-recognition.png\",\"articleSection\":[\"Engineering Blog\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/\",\"url\":\"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/\",\"name\":\"On-Device Meeting Transcription and Speech Recognition 
Testing\",\"isPartOf\":{\"@id\":\"https:\/\/krisp.ai\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/speech-testing-recognition.png\",\"datePublished\":\"2022-09-12T18:51:55+00:00\",\"dateModified\":\"2022-09-25T13:17:14+00:00\",\"description\":\"We\u2019re analyzing the technologies needed for on-device meeting transcription, including speaker diarization and speech recognition testing.\",\"breadcrumb\":{\"@id\":\"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/#primaryimage\",\"url\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/speech-testing-recognition.png\",\"contentUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/speech-testing-recognition.png\",\"width\":1000,\"height\":700,\"caption\":\"speech testing recognition\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/krisp.ai\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"On-Device Meeting Transcription and Speech Recognition 
Testing\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/krisp.ai\/blog\/#website\",\"url\":\"https:\/\/krisp.ai\/blog\/\",\"name\":\"Krisp\",\"description\":\"Blog\",\"publisher\":{\"@id\":\"https:\/\/krisp.ai\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/krisp.ai\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/krisp.ai\/blog\/#organization\",\"name\":\"Krisp\",\"url\":\"https:\/\/krisp.ai\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png\",\"contentUrl\":\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png\",\"width\":696,\"height\":696,\"caption\":\"Krisp\"},\"image\":{\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/krispHQ\/\",\"https:\/\/x.com\/krispHQ\",\"https:\/\/www.linkedin.com\/company\/krisphq\/\",\"https:\/\/www.youtube.com\/channel\/UCAMZinJdR9P33fZUNpuxXtg\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/person\/172d23b73915155e0ab4e97868216bd1\",\"name\":\"Krisp Research Team\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/krisp.ai\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/49fc839d54b3ccba70e28ccaad1472a7?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/49fc839d54b3ccba70e28ccaad1472a7?s=96&d=mm&r=g\",\"caption\":\"Krisp Research Team\"},\"url\":\"https:\/\/krisp.ai\/blog\/author\/research-team\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"On-Device Meeting Transcription and Speech Recognition Testing","description":"We\u2019re analyzing the technologies needed for on-device meeting transcription, including speaker diarization and speech recognition testing.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/","og_locale":"en_US","og_type":"article","og_title":"On-Device Meeting Transcription and Speech Recognition Testing","og_description":"We\u2019re analyzing the technologies needed for on-device meeting transcription, including speaker diarization and speech recognition testing.","og_url":"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/","og_site_name":"Krisp","article_publisher":"https:\/\/www.facebook.com\/krispHQ\/","article_published_time":"2022-09-12T18:51:55+00:00","article_modified_time":"2022-09-25T13:17:14+00:00","og_image":[{"width":1000,"height":700,"url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/speech-testing-recognition.png","type":"image\/png"}],"author":"Krisp Research Team","twitter_card":"summary_large_image","twitter_creator":"@krispHQ","twitter_site":"@krispHQ","schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/#article","isPartOf":{"@id":"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/"},"author":{"name":"Krisp Research Team","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/person\/172d23b73915155e0ab4e97868216bd1"},"headline":"On-Device Meeting Transcription and Speech Recognition 
Testing","datePublished":"2022-09-12T18:51:55+00:00","dateModified":"2022-09-25T13:17:14+00:00","mainEntityOfPage":{"@id":"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/"},"wordCount":1497,"commentCount":2,"publisher":{"@id":"https:\/\/krisp.ai\/blog\/#organization"},"image":{"@id":"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/#primaryimage"},"thumbnailUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/speech-testing-recognition.png","articleSection":["Engineering Blog"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/krisp.ai\/blog\/speech-recognition-testing\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/","url":"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/","name":"On-Device Meeting Transcription and Speech Recognition Testing","isPartOf":{"@id":"https:\/\/krisp.ai\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/#primaryimage"},"image":{"@id":"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/#primaryimage"},"thumbnailUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/speech-testing-recognition.png","datePublished":"2022-09-12T18:51:55+00:00","dateModified":"2022-09-25T13:17:14+00:00","description":"We\u2019re analyzing the technologies needed for on-device meeting transcription, including speaker diarization and speech recognition 
testing.","breadcrumb":{"@id":"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/krisp.ai\/blog\/speech-recognition-testing\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/#primaryimage","url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/speech-testing-recognition.png","contentUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/speech-testing-recognition.png","width":1000,"height":700,"caption":"speech testing recognition"},{"@type":"BreadcrumbList","@id":"https:\/\/krisp.ai\/blog\/speech-recognition-testing\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/krisp.ai\/blog\/"},{"@type":"ListItem","position":2,"name":"On-Device Meeting Transcription and Speech Recognition Testing"}]},{"@type":"WebSite","@id":"https:\/\/krisp.ai\/blog\/#website","url":"https:\/\/krisp.ai\/blog\/","name":"Krisp","description":"Blog","publisher":{"@id":"https:\/\/krisp.ai\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/krisp.ai\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/krisp.ai\/blog\/#organization","name":"Krisp","url":"https:\/\/krisp.ai\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png","contentUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png","width":696,"height":696,"caption":"Krisp"},"image":{"@id":"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/krispHQ\/","https:\/\/x.com\/krispHQ","https:\/\/www.linkedin.com\/compa
ny\/krisphq\/","https:\/\/www.youtube.com\/channel\/UCAMZinJdR9P33fZUNpuxXtg"]},{"@type":"Person","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/person\/172d23b73915155e0ab4e97868216bd1","name":"Krisp Research Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/49fc839d54b3ccba70e28ccaad1472a7?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/49fc839d54b3ccba70e28ccaad1472a7?s=96&d=mm&r=g","caption":"Krisp Research Team"},"url":"https:\/\/krisp.ai\/blog\/author\/research-team\/"}]}},"_links":{"self":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/9582"}],"collection":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/users\/65"}],"replies":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/comments?post=9582"}],"version-history":[{"count":9,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/9582\/revisions"}],"predecessor-version":[{"id":9670,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/9582\/revisions\/9670"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/media\/9583"}],"wp:attachment":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/media?parent=9582"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/categories?post=9582"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/tags?post=9582"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}