Over the last 3 months, the Krisp team worked hard to build the Krisp Chrome Extension. In the end, we had to squeeze our Deep Neural Network (DNN) based noise cancellation algorithm by 30x to fit it inside Chrome.
This was a challenging and fun journey, with an unexpected insight from a dinner, a miracle, and some Rick brilliance. We had to go through some difficult steps to make it happen, and we thought the story was worth sharing.
First, what’s Krisp?
The Krisp app uses a specially designed Deep Neural Network (DNN) to separate human voice from background noise in real time. The DNN is trained on 10K+ hours of human voice and background noise and achieves an unprecedented level of quality. The app is used by tens of thousands of users every day, so the DNN has been in the wild for some time now.
Our main desktop app (Krisp for Windows and Mac) acts as a virtual microphone device.
We always wanted to have Krisp running inside Chrome in the form of an extension, and we always thought it was not possible.
The dinner that changed everything
I was skeptical.
I was so skeptical that I didn’t want to engage our core team yet. I went straight to Toptal to find someone who could quickly hack a proof of concept (POC) for us. We ❤️ Toptal.
A month after that dinner we had a POC of an extension that could add an audio filter to a webpage, transforming the microphone stream in real time by adding static noise. The filter worked fine with Google Meet, Webex and other apps. It was perfect.
This was exciting. We set out to build Krisp for Chrome and even came up with a cool name for it – KrispX (X for extension).
There was one big challenge though. How do we port our C++ codebase to the browser? Well, obviously with WebAssembly (WASM), but we didn’t have anyone with that experience.
WASM and XNNPACK
So Artak, one of our architects, started looking into WASM. When we have no idea about something – we always look at Artak 👀
After some research we found out about TensorFlow.js and all the awesome work that Google is doing to bring DNNs to the browser. The next finding was XNNPACK. It’s a highly optimized library of floating-point neural network inference operators for WASM (and ARM, x86).
XNNPACK was great. After some adjustments to the build system we were able to build our code with XNNPACK for WASM. Overall, it took us 10 days. The result? Almost the same speed as the C++ version.
To be honest, we were mind-blown at this point. WASM and XNNPACK were amazing. Good engineering, Mozilla, Facebook and Google, as always 🙏
But apparently this was just the beginning of our journey.
The way Chrome audio filters are designed is extremely constrained.
Chrome feeds the filter plugin 3ms frames as it reads data from the mic, and the plugin has less than 3ms to complete the processing. However, our DNN operates on 30ms frames. So we had to buffer 10 frames (30ms) and then process the resulting 30ms frame with our DNN within 3ms.
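The buffering step can be sketched in a few lines. This is only an illustration in Python, not the C++/WASM code the extension runs; the `FrameAccumulator` name and the 48 kHz sample rate are assumptions, since the post does not give either.

```python
from typing import List, Optional

FRAME_MS = 3                                # frame length handed to the filter (from the post)
MODEL_MS = 30                               # frame length the DNN expects (from the post)
FRAMES_PER_BLOCK = MODEL_MS // FRAME_MS     # 10 small frames per DNN frame
SAMPLE_RATE = 48_000                        # assumption: not stated in the post
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000  # 144 samples at 48 kHz


class FrameAccumulator:
    """Hypothetical helper: collects small frames until a full DNN-sized block exists."""

    def __init__(self, frames_per_block: int = FRAMES_PER_BLOCK) -> None:
        self.frames_per_block = frames_per_block
        self._pending: List[List[float]] = []

    def push(self, frame: List[float]) -> Optional[List[float]]:
        """Add one small frame; return the concatenated block once enough have arrived."""
        self._pending.append(frame)
        if len(self._pending) < self.frames_per_block:
            return None
        block = [sample for f in self._pending for sample in f]
        self._pending = []
        return block
```

Nine calls to `push` return `None`; the tenth returns the full block, which is the point where the whole DNN would have to run inside a single 3ms callback.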
If you don’t finish processing within 3ms, Chrome will drop the audio packet and give you the next one. In practice this would mean dropped packets and broken voice.
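To see why the naive approach breaks, here is a toy model of the deadline. The cost figures are made up for illustration (the post does not give the DNN's actual inference time), and `naive_costs` and `count_dropped` are hypothetical helper names.

```python
from typing import List

DEADLINE_MS = 3.0  # per-callback budget described in the post


def naive_costs(model_ms: float, frames_per_block: int = 10,
                copy_ms: float = 0.1) -> List[float]:
    """Per-callback cost when the whole DNN runs on the last buffered frame.

    copy_ms is an assumed small cost for just storing a frame.
    """
    return [copy_ms] * (frames_per_block - 1) + [copy_ms + model_ms]


def count_dropped(costs_ms: List[float], deadline_ms: float = DEADLINE_MS) -> int:
    """Count callbacks that fail to finish in under the deadline."""
    return sum(1 for cost in costs_ms if cost >= deadline_ms)
```

With an assumed 8ms inference time, `count_dropped(naive_costs(8.0))` reports one missed deadline in every block of ten callbacks: nine callbacks are nearly free and the tenth carries all the work.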
To be clear, this was simply impossible for us. Our DNN was too big for this.
We needed a miracle.
The miracle came from our incredible research team. We had an ongoing project to shrink our DNN and speed it up so that it could be embedded in more constrained environments. Our architect, Stepan, together with the team, had made decent progress there, and this progress came at exactly the right time.
They had built a new DNN that was 10x smaller than the main model with very comparable quality.
We quickly ported it to WASM and XNNPACK and voilà, we had a real-time solution inside Chrome.
Inspired by this progress, our product team, led by Davit, integrated the WASM code into the Chrome extension, and the QA team started testing it with real apps. It worked quite well, but occasionally there were voice breakups.
Apparently the optimizations were not enough. When CPU usage went up, which is common when you have a video call running inside Chrome, the time the DNN needed to process a frame would increase. We would then exceed the 3ms threshold, and Chrome would drop packets.
The Rick moment
So we all looked at Artak again 👀. And he of course had an amazing idea. We were not surprised. He always has plenty of these.
What if we distribute the DNN processing into multiple parts and process every part within 3ms? We would introduce a bit of latency but that should be negligible for our use case.
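One way to picture this idea is the sketch below. It is not Artak's actual algorithm, which the post does not describe in detail. It assumes the DNN's work can be cut into stages that each fit in one callback and that each stage preserves the block length; `SlicedProcessor` and `Stage` are hypothetical names.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Sequence

Block = List[float]
Stage = Callable[[Block], Block]  # hypothetical: one slice of the DNN's work


class SlicedProcessor:
    """Runs at most one stage per callback instead of the whole model at once."""

    def __init__(self, stages: Sequence[Stage], frames_per_block: int = 10) -> None:
        if len(stages) > frames_per_block:
            raise ValueError("need at most one stage per callback in a block")
        self.stages = list(stages)
        self.frames_per_block = frames_per_block
        self._incoming: List[Block] = []      # frames of the block being collected
        self._work: Optional[Block] = None    # block currently moving through the stages
        self._next_stage = 0
        self._ready: Deque[Block] = deque()   # finished frames waiting to be emitted

    def process(self, frame: Block) -> Block:
        """Handle one small frame: buffer it, do one slice of work, emit one frame."""
        self._incoming.append(frame)

        # Do a single slice of work on the in-flight block, if any is left.
        if self._work is not None and self._next_stage < len(self.stages):
            self._work = self.stages[self._next_stage](self._work)
            self._next_stage += 1

        # A new block is complete: publish the finished one and start the next.
        if len(self._incoming) == self.frames_per_block:
            if self._work is not None:
                n = len(frame)
                self._ready = deque(self._work[i:i + n]
                                    for i in range(0, len(self._work), n))
            self._work = [s for f in self._incoming for s in f]
            self._incoming = []
            self._next_stage = 0

        # Emit processed audio, or silence while the pipeline is still filling.
        return self._ready.popleft() if self._ready else [0.0] * len(frame)
```

In this sketch the output trails the input by roughly one extra block, because a block is processed while the next one is being collected. That is the latency trade-off mentioned above; the real implementation may slice the work and size the delay differently.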
He had the algorithm ready in 10 days. Why so long? Well, he was on vacation for 7 of them.
The new algorithm was perfect. No voice breakups and no noticeable latency, even when there is video and the CPU usage hits the fan.
Krisp for Chrome is now live. From idea to going live – 3 months.
You can use it with any app running inside the Chrome browser, from Google Meet to Whereby. And it’s running our 30x faster algorithm.