


{"id":9563,"date":"2022-08-29T23:11:45","date_gmt":"2022-08-29T19:11:45","guid":{"rendered":"https:\/\/krisp.ai\/blog\/?p=9563"},"modified":"2025-03-11T18:34:08","modified_gmt":"2025-03-11T14:34:08","slug":"speech-quality-measurement","status":"publish","type":"post","link":"https:\/\/krisp.ai\/blog\/speech-quality-measurement\/","title":{"rendered":"Speech Quality Measurement Algorithms and Testing Technology"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">Estimating the quality of speech is a central part of quality assurance in any system for generating or transforming speech.\u00a0<\/span><\/p>\n<p>Such systems include telecommunication networks or speech processing or generating software. The speech signal in their output suffers from various degradations inherent to the particular system.<\/p>\n<p><span style=\"font-weight: 400;\">With background noise cancellation, an algorithm could leave remnants of noise in an audio snippet or partially suppress speech (see Fig. 1). Or, a telecommunication system could also suffer from packet loss and delays. Meanwhile, audio codecs can introduce unwanted coloration just like speech-to-text systems deliver unnatural sounds.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A more straightforward approach to speech quality testing is conducting listening sessions, resulting in subjective quality results.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A group of people listens to recordings under controlled settings. They\u2019re then asked to provide normalized feedback (e.g. on a scale from 1 to 5). Responses are then aggregated into a single quality value, called Mean Opinion Score (MOS).\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Aggregation is necessary to avoid subject bias. 
There are standard guidelines for conducting such listening tests and for the statistical analysis of the results that yields the MOS values (see ITU-T P.830, ITU-R BS.1116, ITU-T P.835).<\/span><\/p>\n<p><img loading=\"lazy\" class=\"alignnone size-full wp-image-9567\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image4.png\" alt=\"degraded speech signal reference\" width=\"1703\" height=\"745\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image4.png 1703w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image4-300x131.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image4-380x166.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image4-768x336.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image4-1536x672.png 1536w\" sizes=\"(max-width: 1703px) 100vw, 1703px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Fig. 1 An example of a degraded speech signal and its reference<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In some circumstances, conducting listening sessions to collect MOS is infeasible, laborious, or costly due to the large volume of data to be tested.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This is where <\/span><i><span style=\"font-weight: 400;\">automated speech perceptual quality measures<\/span><\/i><span style=\"font-weight: 400;\"> come into play. The aim is to replicate the way humans evaluate speech quality while avoiding the subject bias inherent in subjective listening panels.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In short, the goal is to objectively approximate or predict perceived voice quality, including in the presence of noise. Perceptual speech quality is estimated in one of two settings, depending on whether clean reference speech is available for comparison with the system output (depicted in Fig. 
2).<\/span><\/p>\n<p><img loading=\"lazy\" class=\"alignnone size-full wp-image-9565\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image2.png\" alt=\" intrusive and non-intrusive measures\" width=\"817\" height=\"903\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image2.png 817w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image2-271x300.png 271w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image2-380x420.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image2-768x849.png 768w\" sizes=\"(max-width: 817px) 100vw, 817px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Fig. 2 Schematic view of intrusive and non-intrusive measures (source: ITU-T P.563)<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Double-ended (intrusive) speech quality assessment measures<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">The model, in this case, has access to both the reference and output audio of the speech processing system. Its score is given based on the differences between the two. Please note that model and measure are used interchangeably in this article.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To mimic a human listener, automated methods should \u201ccatch\u201d speech impairments that are detectable by the human ear\/brain and assign them a quantitative value. It\u2019s not enough to compute only mathematical differences between the audio samples (waveforms).\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Such algorithms need to model the human psychoacoustic system. In particular, they\u2019re expected to capture the ear functionality as well as certain cognitive effects.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To mimic human hearing capability, the model obtains an \u201cinternal\u201d representation of the signals that mimics the transformations happening within the human ear. 
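One concrete piece of such an internal representation is a perceptually warped frequency axis. As a minimal sketch (the widely used O\u2019Shaughnessy formula for the Mel scale; function names are ours), equal steps in mel correspond to increasingly wide steps in Hz:

```python
import math

def hz_to_mel(f_hz):
    """Map frequency in Hz to the perceptual Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, Mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The scale is roughly linear below 1 kHz and logarithmic above it:
for f in (100.0, 1000.0, 4000.0, 8000.0):
    print(f, round(hz_to_mel(f), 1))
```

Bark-scale and Gammatone-filter-bank front ends follow the same spirit with different band definitions.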
Signals of different frequencies are detected by separate areas of the cochlea (see Fig. 3). This \u201ccoverage\u201d of frequencies isn\u2019t linear. <\/span><span style=\"font-weight: 400;\">As a consequence, the audible spectrum can be partitioned into frequency bands of<em> various widths<\/em> that the ear perceives as being of <em>equal<\/em> width.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This phenomenon is modeled by psychoacoustic scales such as the Bark and Mel scales (used e.g. in PESQ, NISQA double-ended, and the PEAQ basic version), or by filter banks like the Gammatone filter bank (e.g. PEMO-Q, ViSQOL, the PEAQ advanced version).<\/span><\/p>\n<p><img loading=\"lazy\" class=\"alignnone size-full wp-image-9566\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image3.png\" alt=\"auditory system\" width=\"1999\" height=\"1194\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image3.png 1999w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image3-300x179.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image3-380x227.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image3-768x459.png 768w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image3-1536x917.png 1536w\" sizes=\"(max-width: 1999px) 100vw, 1999px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Fig. 
3 Schematic view of the auditory system (<\/span><a href=\"https:\/\/commons.wikimedia.org\/wiki\/File:Uncoiled_cochlea_with_basilar_membrane.png\" target=\"_blank\" rel=\"nofollow noopener\"><span style=\"font-weight: 400;\">source<\/span><\/a><span style=\"font-weight: 400;\">)<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Other voice quality aspects to consider are:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The way loudness is perceived \u2013 this depends on the frequency structure of the sound, as well as its duration.\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The absolute hearing threshold \u2013 a frequency-dependent quantity that is lowest between roughly 2 and 5 kHz, where the ear is most sensitive, and higher at other frequencies.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Masking effects \u2013 a loud sound may make a weaker sound inaudible when the latter occurs simultaneously or shortly after the former.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">For audio quality testing in telecommunication networks, it\u2019s important to take into account missing signal fragments due to lost packets. Several models (e.g., PESQ, POLQA, NISQA) incorporate intricate alignment mechanisms that may take up the bulk of their computation.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Take PESQ\u2019s alignment procedure as an example: it is based on cross-correlation between speech fragments, while NISQA uses attention networks for the same purpose.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Having computed an internal representation of the reference and output signals, the model then works out the value of an appropriate function. 
The latter is designed to measure the difference between the two representations, mapping the result to a MOS value.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Quantifying the difference between internal representations is one of the main distinguishing factors of various quality measures. This may include further modeling of cognitive effects or resorting to pre-trained deep learning models (e.g. NISQA).\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">As an example, when evaluating loudness differences between signal fragments, PESQ gives higher penalty to missing fragments of speech than to additive noise. That\u2019s due to the former being perceived as more disturbing to the listener.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Single-ended (non-intrusive) speech quality measures<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">In this case, the test algorithm only evaluates the output of the system without access to the reference speech. These metrics check if the given sound fragment is indeed human speech and whether it\u2019s of good quality.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To achieve this, the way speech is produced by the human vocal tract should be modeled (see Fig. 4), along with the modeling of the auditory system. This appears to be a much more complex task than modeling the auditory system alone, as it involves more components and parameters. 
Additionally, the model needs to detect noise and missing speech fragments.<\/span><\/p>\n<p><img loading=\"lazy\" class=\"alignnone size-full wp-image-9564\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image1.jpg\" alt=\"human speech production system\" width=\"424\" height=\"430\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image1.jpg 424w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image1-296x300.jpg 296w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/image1-380x385.jpg 380w\" sizes=\"(max-width: 424px) 100vw, 424px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Fig. 4 Schematic view of the human speech production system (<\/span><a href=\"https:\/\/commons.wikimedia.org\/wiki\/File:Illu01_head_neck.jpg\" target=\"_blank\" rel=\"nofollow noopener\"><span style=\"font-weight: 400;\">source<\/span><\/a><span style=\"font-weight: 400;\">)<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Some speech quality assessment methods do such modeling explicitly. For instance, the algorithm from the ITU-T P.563 standard estimates parameters of the human speech production system from a recording.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Using these parameters, the algorithm synthesizes a clean pseudo-reference speech signal. The result is then compared with the output signal using an intrusive quality measure like PESQ.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">As mentioned earlier, such algorithmic models rely on a considerable number of hand-crafted parameters. This makes the problem a good candidate for machine learning methods.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">There have been quite a few recent works approaching single-ended audio quality estimation using Deep Neural Nets (DNN) and other machine learning techniques. 
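Purely as an illustration of the general recipe, and not any published model\u2019s architecture, a toy single-ended predictor with a convolutional feature extractor, global pooling, and a dense head squashed into the 1\u20135 MOS range could be sketched as follows; the weights here are random, so the prediction is meaningless until trained:

```python
import numpy as np

rng = np.random.default_rng(42)

def conv1d(x, kernels):
    """Valid 1-D convolution of signal x with each kernel; shape (n_kernels, T')."""
    return np.stack([np.convolve(x, k, mode="valid") for k in kernels])

def predict_mos(signal, kernels, w, b):
    """Toy non-intrusive MOS predictor: conv features -> ReLU ->
    global average pooling -> linear head -> squash into (1, 5)."""
    feats = np.maximum(conv1d(signal, kernels), 0.0)   # ReLU feature maps
    pooled = feats.mean(axis=1)                        # global average pooling
    score = pooled @ w + b                             # dense head
    return 1.0 + 4.0 / (1.0 + np.exp(-score))          # sigmoid mapped to MOS scale

kernels = rng.standard_normal((8, 16)) * 0.1   # 8 conv filters of width 16
w, b = rng.standard_normal(8), 0.0
x = rng.standard_normal(16000)                 # 1 s of audio at 16 kHz (toy input)
print(predict_mos(x, kernels, w, b))
```

Real models replace the pooling with LSTM or attention modules and train the whole stack against MOS labels.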
The DNN usually consists of a convolution-based feature extraction phase and a subsequent higher-level analysis phase that uses LSTM or attention-based modules (e.g. NISQA, SESQA, MOSNet) or simply dense layers or pooling (e.g. CNN-ELM, <\/span><span style=\"font-weight: 400;\">WAWEnets<\/span><span style=\"font-weight: 400;\">). The result is then mapped to a MOS prediction value.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Another machine learning approach, similar in spirit to ITU-T P.563, is to synthesize a clean pseudo-reference version of the system output speech (e.g. by using Gaussian Mixture Models to derive noise statistics from the noisy output of the system and compensate for the noise) and then compare it with the output speech using an intrusive method (see ref. 16).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Data generation for DNN models involves introducing speech degradations typical of the target use scenario: for example, applying various audio codecs to speech samples, mixing with noise, simulating packet loss, and applying low\/high-pass filters.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For supervised training, the training data needs to be annotated. The two main approaches are either to annotate the data using known intrusive quality measures (e.g. SESQA, Quality-Net, <\/span><span style=\"font-weight: 400;\">WAWEnets<\/span><span style=\"font-weight: 400;\">) or to conduct listening sessions for collecting MOS scores (MOSNet, NISQA, AutoMOS, CNN-ELM). 
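Two of the degradations mentioned above, mixing noise at a target SNR and packet loss, can be simulated in a few lines. This is a sketch under the assumption of mono float signals; function names are ours:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture has the requested speech-to-noise ratio in dB."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

def drop_packets(signal, frame_len, loss_prob, rng):
    """Simulate packet loss by zeroing whole frames with probability `loss_prob`."""
    out = signal.copy()
    for start in range(0, len(out), frame_len):
        if rng.random() < loss_prob:
            out[start:start + frame_len] = 0.0
    return out

rng = np.random.default_rng(1)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db=10.0)                       # 10 dB SNR mixture
degraded = drop_packets(noisy, frame_len=320, loss_prob=0.05, rng=rng)  # 20 ms frames
```

Codec passes and band-limiting filters are applied in the same pipeline fashion before annotation.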
In fact, models from the former group annotate the data using <\/span><i><span style=\"font-weight: 400;\">several<\/span><\/i><span style=\"font-weight: 400;\"> intrusive measures, with the aim of smoothing out the shortcomings of any particular measure.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Applicability of speech quality measures<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Each speech quality measure was fitted to limited data at the design stage and developed with particular usage scenarios and audio distortions in mind.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The application scenarios of perceived audio quality measures range considerably across fields such as telecommunication networks, VoIP, speech source separation and noise cancellation, speech-to-text algorithms, and audio codec design. This is true for both algorithmic and machine learning-based models, raising the question of cross-domain applicability.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">From this point of view, DNN models crucially depend on the data distribution they\u2019re trained on. Algorithmic models, on the other hand, are based on psychoacoustic research that doesn\u2019t directly rely on any dataset.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The parameters of algorithmic measures are tuned to fit the measure\u2019s output to MOS values of <\/span><i><span style=\"font-weight: 400;\">some<\/span><\/i><span style=\"font-weight: 400;\"> dataset, but this dependence seems less crucial than for DNN models.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Some algorithmic measures, such as PEMO-Q, were designed to be general quality measures. Cross-domain behavior is examined in a recent study (see ref. 
18) that examines the domain dependence of intrusive models with respect to the <\/span><i><span style=\"font-weight: 400;\">audio coding<\/span><\/i><span style=\"font-weight: 400;\"> and <\/span><i><span style=\"font-weight: 400;\">source separation<\/span><\/i><span style=\"font-weight: 400;\"> domains. Among other things, the authors found that standards like PESQ and PEAQ fare well across both domains, even though they weren\u2019t designed for source separation. For PEAQ, one needs to take a re-weighted combination of a subset of its output values to achieve good results.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Another aspect of usability is bandwidth dependence. This limitation is mostly specific to algorithmic measures.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">While it\u2019s rather easy to simulate audio of various bandwidths during data generation for DNNs, algorithmic models need explicit parameter tuning to give dependable outputs for different bandwidths. For example, the original ITU-T standard for PESQ was designed specifically for narrowband speech (and later extended to support wideband), while its successor, POLQA, supports full-band speech.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A final consideration for speech and audio quality testing is performance, which matters when testing is massive and regular. DNN models can benefit from optimized batch processing, while multiprocessing can be applied to other measures.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Performance can be further improved if models are modular, so that one can turn off functionality that isn\u2019t necessary for a given application. 
For instance, we could improve the performance of the intrusive NISQA model in a noise cancellation application by removing its alignment layer (which isn\u2019t necessary in this situation).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This was a short glimpse into the key points of objective speech quality measurement and prediction. This is an active research area with many facets that can\u2019t be fully covered in a brief post. Please review the publications below for a more detailed description of signal to noise ratio measure and other speech quality assessment algorithms.<\/span><\/p>\n<h2>Try next-level voice and audio technologies<\/h2>\n<p>Krisp rigorously tests its voice technologies utilizing both objective and subjective methodologies. <a href=\"https:\/\/krisp.ai\/blog\/voice-communication-quality-with-krisp-sdk\/\" target=\"_blank\" rel=\"noopener\">Krisp licenses its SDKs<\/a> to developers to embed directly into applications and devices. <a href=\"https:\/\/krisp.ai\/developers\/\" target=\"_blank\" rel=\"noopener\">Learn more about Krisp&#8217;s SDKs<\/a> and begin your evaluation today.<\/p>\n<p><a href=\"https:\/\/krisp.ai\/developers\/\"><img loading=\"lazy\" class=\"alignnone wp-image-9589 size-full\" src=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/engineering-blog-cta.png\" alt=\"krisp sdk\" width=\"1280\" height=\"720\" srcset=\"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/engineering-blog-cta.png 1280w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/engineering-blog-cta-300x169.png 300w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/engineering-blog-cta-380x214.png 380w, https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/09\/engineering-blog-cta-768x432.png 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\" \/><\/a><\/p>\n<hr \/>\n<p><strong>The article is written by: <\/strong><\/p>\n<p>Tigran Tonoyan, PhD in Computer Science, Senior ML Engineer II<br \/>\nAris Hovsepyan, BSc in 
Computer Science,\u00a0ML Engineer II<br \/>\nHovhannes Shmavonyan, PhD in Physics, Senior ML Engineer I<br \/>\nHayk Aleksanyan, PhD in Mathematics, Principal ML Engineer, Tech Lead<\/p>\n<hr \/>\n<p><strong>References:<\/strong><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">ITU-T P.830: <\/span><a href=\"https:\/\/www.itu.int\/rec\/T-REC-P.830\/en\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">https:\/\/www.itu.int\/rec\/T-REC-P.830\/en<\/span><\/a><span style=\"font-weight: 400;\">\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">ITU-R BS.1116: <\/span><a href=\"https:\/\/www.itu.int\/rec\/R-REC-BS.1116\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">https:\/\/www.itu.int\/rec\/R-REC-BS.1116<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">ITU-T P.835: <\/span><a href=\"https:\/\/www.itu.int\/rec\/T-REC-P.835\/en\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">https:\/\/www.itu.int\/rec\/T-REC-P.835\/en<\/span><\/a><span style=\"font-weight: 400;\">\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">PESQ: ITU-T P.862, <\/span><a href=\"https:\/\/www.itu.int\/rec\/T-REC-P.862\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">https:\/\/www.itu.int\/rec\/T-REC-P.862<\/span><\/a><span style=\"font-weight: 400;\">\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">NISQA: G. 
Mittag et al., NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets, INTERSPEECH 2021 <\/span><span style=\"font-weight: 400;\">(see also the first author\u2019s PhD thesis)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">PEAQ: ITU-R BS.1387, <\/span><a href=\"https:\/\/www.itu.int\/rec\/R-REC-BS.1387\/en\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">https:\/\/www.itu.int\/rec\/R-REC-BS.1387\/en<\/span><\/a><span style=\"font-weight: 400;\">\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">PEMO-Q: R. Huber and B. Kollmeier, PEMO-Q \u2013 A New Method for Objective Audio Quality Assessment Using a Model of Auditory Perception, IEEE Transactions on Audio, Speech, and Language Processing, 14 (6)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">ViSQOL: A. Hines et al., ViSQOL: The Virtual Speech Quality Objective Listener, IWAENC 2012<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">ITU-T P.563: <\/span><a href=\"https:\/\/www.itu.int\/rec\/T-REC-P.563\/en\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">https:\/\/www.itu.int\/rec\/T-REC-P.563\/en<\/span><\/a><span style=\"font-weight: 400;\">\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">POLQA: ITU-T P.863, <\/span><a href=\"https:\/\/www.itu.int\/rec\/T-REC-P.863\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">https:\/\/www.itu.int\/rec\/T-REC-P.863<\/span><\/a><span style=\"font-weight: 400;\">\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">ANIQUE: D.-S. 
Kim, ANIQUE: an auditory model for single-ended speech quality estimation, IEEE Transactions on Speech and Audio Processing 13 (5)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">SESQA: J. Serr\u00e0 et al., SESQA: Semi-Supervised Learning for Speech Quality Assessment, ICASSP 2021<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">\u00a0MOSNet: C.-C. Lo et al., MOSNet: Deep Learning-Based Objective Assessment, INTERSPEECH 2019<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">\u00a0CNN-ELM: H. Gamper et al., Intrusive and Non-Intrusive Perceptual Speech Quality Assessment Using a Convolutional Neural Network, WASPAA 2019<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">\u00a0WAWEnets:\u00a0 A. Catellier and S. Voran, Wawenets: A No-Reference Convolutional Waveform-Based Approach to Estimating Narrowband and Wideband Speech Quality, ICASSP 2020<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">\u00a0Y. Shan et al., Non-intrusive Speech Quality Assessment Using Deep Belief Network and Backpropagation Neural Network, ISCSLP 2018<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">\u00a0Quality-Net: S. Fu et al., Quality-Net: An end-to-end non-intrusive speech quality assessment model based on BLSTM, INTERSPEECH 2018<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">\u00a0M. 
Torcoli et al., Objective Measures of Perceptual Audio Quality Reviewed: An Evaluation of Their Application Domain Dependence, IEEE\/ACM Transactions on Audio, Speech, and Language Processing 29<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">ETSI standard <\/span><span style=\"font-weight: 400;\">EG 202 396-3: <\/span><a href=\"https:\/\/www.etsi.org\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">https:\/\/www.etsi.org<\/span><\/a><span style=\"font-weight: 400;\">\u00a0<\/span><\/li>\n<\/ol>\n","protected":false},"author":65,"featured_media":9575,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"two_page_speed":[]},"categories":[421],"tags":[],"acf":[]}
Krisp optimizes audio quality for clearer, more effective conversations.","breadcrumb":{"@id":"https:\/\/krisp.ai\/blog\/speech-quality-measurement\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/krisp.ai\/blog\/speech-quality-measurement\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/speech-quality-measurement\/#primaryimage","url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/speech-quality-measurement.png","contentUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2022\/08\/speech-quality-measurement.png","width":1000,"height":700,"caption":"speech quality measurement"},{"@type":"BreadcrumbList","@id":"https:\/\/krisp.ai\/blog\/speech-quality-measurement\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/krisp.ai\/blog\/"},{"@type":"ListItem","position":2,"name":"Speech Quality Measurement Algorithms and Testing Technology"}]},{"@type":"WebSite","@id":"https:\/\/krisp.ai\/blog\/#website","url":"https:\/\/krisp.ai\/blog\/","name":"Krisp","description":"Blog","publisher":{"@id":"https:\/\/krisp.ai\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/krisp.ai\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/krisp.ai\/blog\/#organization","name":"Krisp","url":"https:\/\/krisp.ai\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png","contentUrl":"https:\/\/krisp.ai\/blog\/wp-content\/uploads\/2024\/10\/K.png","width":696,"height":696,"caption":"Krisp"},"image":{"@id":"https:\/\/krisp.ai\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/krispHQ\/"
,"https:\/\/x.com\/krispHQ","https:\/\/www.linkedin.com\/company\/krisphq\/","https:\/\/www.youtube.com\/channel\/UCAMZinJdR9P33fZUNpuxXtg"]},{"@type":"Person","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/person\/172d23b73915155e0ab4e97868216bd1","name":"Krisp Research Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/krisp.ai\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/49fc839d54b3ccba70e28ccaad1472a7?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/49fc839d54b3ccba70e28ccaad1472a7?s=96&d=mm&r=g","caption":"Krisp Research Team"},"url":"https:\/\/krisp.ai\/blog\/author\/research-team\/"}]}},"_links":{"self":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/9563"}],"collection":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/users\/65"}],"replies":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/comments?post=9563"}],"version-history":[{"count":12,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/9563\/revisions"}],"predecessor-version":[{"id":9667,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/posts\/9563\/revisions\/9667"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/media\/9575"}],"wp:attachment":[{"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/media?parent=9563"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/categories?post=9563"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/krisp.ai\/blog\/wp-json\/wp\/v2\/tags?post=9563"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}