We had to squeeze our Neural Network by 30x to run inside Chrome

Apr 01, 2020

Written by Davit Baghdasaryan



Over the last 3 months, the Krisp team worked hard to build the Krisp Chrome Extension. In the end, we had to squeeze our Deep Neural Network (DNN) based noise cancellation algorithm by 30x to fit it inside Chrome.

It was a challenging and fun journey that included an unexpected insight from a dinner, a miracle and some Rick brilliance. It took some hard steps to make it happen, and we thought it was a story worth sharing.

First, what’s Krisp?

The Krisp app uses a specially designed Deep Neural Network (DNN) to separate human voice from background noise in real time. The DNN is trained on a combined 10K+ hours of human voice and background noise and achieves an unprecedented level of quality. The app is used by tens of thousands of users every day, so the DNN has been in the wild for some time now.

Our main desktop app (Krisp for Windows and Mac) acts as a virtual microphone device.

We always wanted to have Krisp running inside Chrome as an extension, and we always thought it wasn’t possible.

The dinner that changed everything

Until one beautiful evening, when we had dinner with our friends @Discord. Stan, Discord’s CTO, hinted that Chrome had recently made some additions (enabling audio filter plugins) and that it might now be possible to develop an external audio filter, all in JavaScript.

I was skeptical.

I was so skeptical that I didn’t want to engage our core team yet. I went straight to Toptal to find someone who could quickly hack together a proof of concept (POC) for us. We ❤️ Toptal.

A month after that dinner we had a POC of an extension that could add an audio filter to a webpage and transform the microphone stream in real time by adding static noise. The filter worked fine with Google Meet, Webex and other apps. It was perfect.

This was exciting. We set out to build Krisp for Chrome and even came up with a cool name for it – KrispX (X for extension).

There was one big challenge though. How do we port our C++ codebase to the browser? Well, obviously with WebAssembly (WASM), but we didn’t have anyone with that experience.

WASM and XNNPACK

So Artak, one of our architects, started looking into WASM. When we have no idea about something, we always look at Artak.

His first version of the port came after 3 days. Yep, he is super fast. The version was extremely slow though. It was a hacked-together version that implemented matrix multiplication in the most naive way, in JavaScript. As a result, the algorithm ran 10x slower than our main C++ model on the same laptop.
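For illustration, a naive matrix multiplication of the kind described above might look like the sketch below. This is not Krisp's code: the function name `naiveMatMul` and the flat row-major `Float32Array` layout are assumptions made for the example.

```typescript
// Naive dense matrix multiply: C (m x n) = A (m x k) * B (k x n).
// All matrices are flat, row-major Float32Arrays (an assumption for this sketch).
function naiveMatMul(
  a: Float32Array,
  b: Float32Array,
  m: number,
  k: number,
  n: number
): Float32Array {
  if (a.length !== m * k || b.length !== k * n) {
    throw new Error("matrix sizes do not match the given dimensions");
  }
  const c = new Float32Array(m * n);
  for (let i = 0; i < m; i++) {
    for (let j = 0; j < n; j++) {
      // One scalar multiply-add per step; no blocking, no SIMD.
      let sum = 0;
      for (let p = 0; p < k; p++) {
        sum += a[i * k + p] * b[p * n + j];
      }
      c[i * n + j] = sum;
    }
  }
  return c;
}
```

The triple loop does one scalar multiply-add at a time, which is the kind of work an optimized native kernel handles far more efficiently.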

After some research we found out about TensorFlow.js and all the awesome work that Google is doing to bring DNNs to the browser. The next finding was XNNPACK. It’s a highly optimized library of floating-point neural network inference operators for WASM (and ARM, x86).

XNNPACK was great. After some adjustments to the build system we were able to build our code with XNNPACK for WASM. Overall, it took us 10 days. The result? Almost the same speed as the C++ version.

To be honest, we were mind-blown at this point. WASM and XNNPACK were amazing. Good engineering, Mozilla, Facebook and Google, as always.

But as it turned out, this was just the beginning of our journey.

3ms frames

The way Chrome audio filters are designed is extremely constrained.

Chrome feeds the filter plugin with 3ms frames as it reads the data from the mic, and the plugin has less than 3ms to complete the processing. However, our DNN operates on 30ms frames. So we had to buffer 10 frames (30ms) and then process the resulting 30ms frame with our DNN within 3ms.
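The buffering step (10 frames of 3ms each adding up to one 30ms frame) can be sketched as a small accumulator. This is a minimal illustration, not Krisp's implementation: the class name `FrameAccumulator` and its parameters are hypothetical, and the frame size is passed in as a sample count rather than a duration.

```typescript
// Collects small fixed-size frames until one large block is available.
class FrameAccumulator {
  private readonly smallFrameSize: number;
  private readonly buffer: Float32Array;
  private filled: number;

  constructor(smallFrameSize: number, framesPerBlock: number) {
    this.smallFrameSize = smallFrameSize;
    this.buffer = new Float32Array(smallFrameSize * framesPerBlock);
    this.filled = 0;
  }

  // Returns a copy of the full block once the last small frame arrives,
  // otherwise null while the block is still filling up.
  push(frame: Float32Array): Float32Array | null {
    if (frame.length !== this.smallFrameSize) {
      throw new Error("unexpected frame size");
    }
    this.buffer.set(frame, this.filled);
    this.filled += frame.length;
    if (this.filled < this.buffer.length) {
      return null;
    }
    this.filled = 0;
    return this.buffer.slice();
  }
}
```

As a usage example, `new FrameAccumulator(128, 10)` would assume 128 samples per callback, which is roughly 2.9ms at 44.1kHz; that figure is an assumption, not something stated in this post. Note that in this scheme the whole DNN pass still has to fit inside the single callback that completes the block.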

If you don’t process within 3ms Chrome will drop the audio packet and give you the next one. In practice this would mean dropped packets and broken voice.

To be clear, this was simply impossible for us. Our DNN was too big for this.

We needed a miracle.

The miracle

The miracle came from our incredible research team. We had an ongoing project whose goal was to shrink our DNN and speed it up so that it could be embedded in more constrained environments. Our architect, Stepan, together with the team, had made decent progress there, and this progress could not have come at a better time.

They had built a new DNN that was 10x smaller than the main model, with very comparable quality.

We quickly ported it to WASM and XNNPACK and voilà, we had a real-time solution inside Chrome.

Inspired by this progress, our product team, led by Davit, integrated the WASM code into the Chrome extension, and the QA team started testing it with real apps. It worked quite well, but occasionally there were voice breakups.

Apparently the optimizations were not enough. When CPU usage went up, which is common when a video call is running inside Chrome, the time the DNN needed to process a frame would increase past the 3ms threshold, and Chrome would drop packets.

The Rick moment

So we all looked at Artak again. And he of course had an amazing idea. We were not surprised. He always has plenty of these.

What if we split the DNN processing into multiple parts and process each part within 3ms? We would introduce a bit of latency, but that should be negligible for our use case.
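One way to read this idea is as a small pipeline: each 3ms callback stores the incoming frame, runs just one part of the model on the previously collected block, and plays out a slice of the last finished block. The sketch below is only an illustration under that reading, not Krisp's implementation. It assumes the model can be written as a list of stages that each keep the block length unchanged; `Stage` and `SplitProcessor` are hypothetical names.

```typescript
// One part of the model; assumed to return a block of the same length.
type Stage = (block: Float32Array) => Float32Array;

class SplitProcessor {
  private readonly stages: Stage[];
  private readonly smallSize: number;
  private readonly framesPerBlock: number;
  private readonly input: Float32Array; // block currently being collected
  private work: Float32Array | null; // block moving through the stages
  private output: Float32Array; // finished block being played out
  private nextStage: number;
  private slot: number; // index of the small frame within the block

  constructor(stages: Stage[], smallSize: number, framesPerBlock: number) {
    if (stages.length > framesPerBlock) {
      throw new Error("need at least one callback per stage");
    }
    this.stages = stages;
    this.smallSize = smallSize;
    this.framesPerBlock = framesPerBlock;
    this.input = new Float32Array(smallSize * framesPerBlock);
    this.output = new Float32Array(smallSize * framesPerBlock); // silence
    this.work = null;
    this.nextStage = 0;
    this.slot = 0;
  }

  // Called once per small frame; does at most one stage of work per call.
  process(frame: Float32Array): Float32Array {
    if (frame.length !== this.smallSize) {
      throw new Error("unexpected frame size");
    }
    const start = this.slot * this.smallSize;
    this.input.set(frame, start);

    const stage = this.stages[this.nextStage];
    if (this.work !== null && stage !== undefined) {
      this.work = stage(this.work);
      this.nextStage++;
    }

    const out = this.output.slice(start, start + this.smallSize);
    this.slot++;
    if (this.slot === this.framesPerBlock) {
      // Block boundary: publish the finished block, start on the new one.
      if (this.work !== null) {
        if (this.work.length !== this.output.length) {
          throw new Error("a stage changed the block length");
        }
        this.output = this.work;
      }
      this.work = this.input.slice();
      this.nextStage = 0;
      this.slot = 0;
    }
    return out;
  }
}
```

In this sketch the output trails the input by two blocks, one more than plain buffering, which is the kind of added latency the idea trades for a smaller per-callback cost.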