Automatic speech recognition using Distil-Whisper and OpenVINO#

This Jupyter notebook can be launched after a local installation only.

Distil-Whisper is a distilled variant of OpenAI's Whisper model. It was proposed in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling. According to the authors, Distil-Whisper runs several times faster than Whisper with 50% fewer parameters, while performing to within 1% word error rate (WER) of it on out-of-distribution evaluation data.
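
Since the comparison is stated in terms of word error rate, here is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and a reference, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative and do not come from the paper; in practice a library such as jiwer, which is installed in the prerequisites of this notebook, would normally be used instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words; curr is the row for i reference words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: one substitution ("sat" -> "sit") and one deletion ("the").
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

With this definition, "within 1% WER" means the two models' error rates differ by at most one percentage point on the evaluation set.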

Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It maps a sequence of audio spectrogram features to a sequence of text tokens. First, the feature extractor converts the raw audio input to a log-Mel spectrogram. Then, the Transformer encoder encodes the spectrogram to form a sequence of encoder hidden states. Finally, the decoder autoregressively predicts text tokens, conditioned on both the previous tokens and the encoder hidden states.
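
The final step, autoregressive decoding, can be sketched with a toy greedy loop. This illustrates the control flow only and is not Whisper's implementation: `toy_decoder_step`, the four-entry vocabulary, and the token ids are made up for the example, and the "encoder hidden states" are a plain list of numbers.

```python
from typing import Callable, List, Sequence

BOS, EOS = 0, 3  # hypothetical start/end token ids for this toy example
VOCAB = {0: "<bos>", 1: "hello", 2: "world", 3: "<eos>"}


def toy_decoder_step(tokens: Sequence[int], encoder_states: Sequence[float]) -> List[float]:
    """Stand-in for a decoder forward pass: returns one score per vocabulary entry.

    A real decoder attends over the encoder hidden states; this toy only uses
    their count to decide how many words to emit before the end token.
    """
    n_words_emitted = len(tokens) - 1  # exclude the start token
    if n_words_emitted >= len(encoder_states):
        next_id = EOS
    else:
        next_id = 1 + n_words_emitted % 2  # alternate "hello" / "world"
    return [1.0 if i == next_id else 0.0 for i in range(len(VOCAB))]


def greedy_decode(step: Callable, encoder_states: Sequence[float], max_new_tokens: int = 10) -> List[int]:
    """Greedy autoregressive decoding: feed the tokens so far, append the argmax."""
    tokens = [BOS]
    for _ in range(max_new_tokens):
        scores = step(tokens, encoder_states)
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens


token_ids = greedy_decode(toy_decoder_step, encoder_states=[0.1, 0.2])
print([VOCAB[t] for t in token_ids])
```

Each iteration sees every token produced so far plus the fixed encoder output, which is what "conditioned on both the previous tokens and the encoder hidden states" means in practice.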

You can see the model architecture in the diagram below:

whisper_architecture.svg

In this tutorial, we show how to run Distil-Whisper using OpenVINO. We will use the pre-trained model from the Hugging Face Transformers library. To simplify the user experience, the Hugging Face Optimum library is used to convert the model to OpenVINO™ IR format. To further improve the performance of the OpenVINO Distil-Whisper model, INT8 post-training quantization from NNCF is applied.

Installation Instructions#

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to the Installation Guide.

Prerequisites#

%pip install -q "transformers>=4.35" "torch>=2.4.1" "onnx!=1.16.2" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "git+https://github.com/huggingface/optimum-intel.git"
%pip install -q "openvino>=2023.2.0" datasets  "gradio>=4.19" "librosa" "soundfile"
%pip install -q "nncf>=2.6.0" "jiwer"

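from pathlib import Path

import requests

# Download the shared helper module used by the OpenVINO notebooks.
r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
)
r.raise_for_status()  # fail early instead of saving an error page
Path("notebook_utils.py").write_text(r.text)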

Load PyTorch model#

The AutoModelForSpeechSeq2Seq.from_pretrained method initializes the PyTorch Whisper model using the transformers library. This tutorial uses the distil-whisper/distil-large-v2 model by default. The model is downloaded once during the first run, which may take some time.

You may also choose other models from the Distil-Whisper Hugging Face collection, such as distil-whisper/distil-medium.en or distil-whisper/distil-small.en. Models of the original Whisper architecture are also available; more on them here.

Preprocessing and post-processing are important when using this model. The AutoProcessor class is used to initialize WhisperProcessor, which prepares audio input data for the model by converting it to a Mel-spectrogram and decodes the predicted output token_ids into a string using the tokenizer.

import ipywidgets as widgets

model_ids = {
    "Distil-Whisper": [
        "distil-whisper/distil-large-v2",
        "distil-whisper/distil-large-v3",
        "distil-whisper/distil-medium.en",
        "distil-whisper/distil-small.en",
    ],
    "Whisper": [
        "openai/whisper-large-v3-turbo",
        "openai/whisper-large-v3",
        "openai/whisper-large-v2",
        "openai/whisper-large",
        "openai/whisper-medium",
        "openai/whisper-small",
        "openai/whisper-base",
        "openai/whisper-tiny",
        "openai/whisper-medium.en",
        "openai/whisper-small.en",
        "openai/whisper-base.en",
        "openai/whisper-tiny.en",
    ],
}

model_type = widgets.Dropdown(
    options=model_ids.keys(),
    value="Distil-Whisper",
    description="Model type:",
    disabled=False,
)

model_type

model_id = widgets.Dropdown(
    options=model_ids[model_type.value],
    value=model_ids[model_type.value][0],
    description="Model:",
    disabled=False,
)

model_id

from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained(model_id.value)

pt_model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id.value)
pt_model.eval();

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Prepare input sample#

The processor expects audio data as a NumPy array together with the audio sampling rate, and returns the input_features tensor used for making predictions. Conversion of audio to a NumPy array is handled by the Hugging Face datasets library.

from datasets import load_dataset


def extract_input_features(sample):
    input_features = processor(
        sample["audio"]["array"],
        sampling_rate=sample["audio"]["sampling_rate"],
        return_tensors="pt",
    ).input_features
    return input_features


dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation", trust_remote_code=True)
sample = dataset[0]
input_features = extract_input_features(sample)
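
As a sanity check on the processor output, the expected shape of input_features can be worked out by hand. The sketch below assumes the usual Whisper preprocessing defaults (16 kHz audio padded or truncated to 30 seconds, a hop of 160 samples between frames, 80 mel bins); these constants and the helper name `expected_input_features_shape` are assumptions for illustration, and the authoritative values are in the checkpoint's preprocessor configuration.

```python
# Assumed Whisper preprocessing defaults; check the checkpoint's preprocessor config.
SAMPLING_RATE = 16_000   # samples per second
CHUNK_LENGTH_S = 30      # audio is padded or truncated to this many seconds
HOP_LENGTH = 160         # samples between consecutive spectrogram frames
N_MELS = 80              # mel bins (assumed; large-v3 style checkpoints use 128)


def expected_input_features_shape(batch_size: int = 1, n_mels: int = N_MELS) -> tuple:
    """Shape of the log-Mel tensor: (batch, mel bins, frames)."""
    n_samples = SAMPLING_RATE * CHUNK_LENGTH_S
    n_frames = n_samples // HOP_LENGTH
    return (batch_size, n_mels, n_frames)


print(expected_input_features_shape())
```

Comparing this tuple with `input_features.shape` is a quick way to confirm that the audio was resampled and padded as expected.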

Run model inference#

To perform speech recognition, use the model's generate interface. After generation finishes, processor.batch_decode decodes the predicted token_ids into a text transcription.

import IPython.display as ipd

predicted_ids = pt_model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

display(ipd.Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"]))
print(f"Reference: {sample['text']}")
print(f"Result: {transcription[0]}")
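
To illustrate what `skip_special_tokens=True` changes in the decoded text, here is a toy stand-in for a tokenizer's decode step. The vocabulary, the token ids, and the `toy_batch_decode` helper are invented for this sketch; only the special-token strings follow Whisper's naming, and the real tokenizer works on subword units rather than whole words.

```python
# Hypothetical id -> token mapping; real Whisper ids and subword tokens differ.
TOY_VOCAB = {
    0: "<|startoftranscript|>",
    1: "<|notimestamps|>",
    2: "<|endoftext|>",
    3: "hello",
    4: "world",
}
SPECIAL_IDS = {0, 1, 2}


def toy_batch_decode(batch_of_ids, skip_special_tokens=False):
    """Turn each sequence of token ids into a string, optionally dropping special tokens."""
    texts = []
    for ids in batch_of_ids:
        kept = [i for i in ids if not (skip_special_tokens and i in SPECIAL_IDS)]
        texts.append(" ".join(TOY_VOCAB[i] for i in kept))
    return texts


predicted = [[0, 1, 3, 4, 2]]
print(toy_batch_decode(predicted))
print(toy_batch_decode(predicted, skip_special_tokens=True))
```

The function returns one string per input sequence, which is why the notebook indexes the result with `transcription[0]`.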