Automatic speech recognition using Whisper and OpenVINO with Generate API#

This Jupyter notebook can be launched after a local installation only.


Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web.

Whisper is a Transformer based encoder-decoder model, also referred to as a sequence-to-sequence model. It maps a sequence of audio spectrogram features to a sequence of text tokens. First, the raw audio inputs are converted to a log-Mel spectrogram by action of the feature extractor. Then, the Transformer encoder encodes the spectrogram to form a sequence of encoder hidden states. Finally, the decoder autoregressively predicts text tokens, conditional on both the previous tokens and the encoder hidden states.

You can see the model architecture in the diagram below:

whisper_architecture.svg
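To make the stages above concrete, here is a minimal sketch of how the feature extractor, encoder-decoder model, and tokenizer cooperate in the Transformers API. It is added for illustration only and assumes the openai/whisper-tiny checkpoint and a dummy one-second silent input, neither of which is part of the notebook's own flow.

import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

illustration_model = "openai/whisper-tiny"  # hypothetical choice, smallest checkpoint
processor = WhisperProcessor.from_pretrained(illustration_model)
model = WhisperForConditionalGeneration.from_pretrained(illustration_model)

# 1. Feature extractor: raw 16 kHz audio -> log-Mel spectrogram
audio = np.zeros(16000, dtype=np.float32)  # one second of silence as a stand-in input
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

# 2-3. The encoder produces hidden states, the decoder autoregressively predicts token ids
predicted_ids = model.generate(input_features)

# The tokenizer converts the predicted token ids back into text
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))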

In this tutorial, we consider how to run Whisper using OpenVINO. We will use a pre-trained model from the Hugging Face Transformers library and convert it to OpenVINO™ IR format with the Hugging Face Optimum Intel library. To simplify the user experience, we will use the OpenVINO Generate API for the Whisper automatic speech recognition scenarios.
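As a preview of where the notebook is heading, the conversion and inference flow boils down to two steps: export the checkpoint to OpenVINO IR with optimum-cli, then run it through openvino_genai.WhisperPipeline. The model directory whisper-tiny-ov and the audio file sample.wav below are placeholders chosen for illustration, not the selections made later in the notebook.

# Export step (shell command), e.g.:
#   optimum-cli export openvino --model openai/whisper-tiny whisper-tiny-ov
import librosa
import openvino_genai as ov_genai

# WhisperPipeline takes the directory with the exported IR files and a device name
pipe = ov_genai.WhisperPipeline("whisper-tiny-ov", "CPU")

# The pipeline expects 16 kHz mono audio samples
raw_speech, _ = librosa.load("sample.wav", sr=16000)
print(pipe.generate(raw_speech.tolist()))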

Installation Instructions#

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.


Prerequisites#

import platform


%pip install -q "torch>=2.3" "torchvision>=0.18.1" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "transformers>=4.45" "git+https://github.com/huggingface/optimum-intel.git" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q -U "openvino>=2024.5.0" "openvino-tokenizers>=2024.5.0" "openvino-genai>=2024.5.0"
%pip install -q datasets  "gradio>=4.0" "soundfile>=0.12" "librosa" "python-ffmpeg<=1.0.16"
%pip install -q "nncf>=2.14.0" "jiwer" "typing_extensions>=4.9"
if platform.system() == "Darwin":
    %pip install -q "numpy<2.0"
import requests
from pathlib import Path

if not Path("notebook_utils.py").exists():
    r = requests.get(
        url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
    )
    open("notebook_utils.py", "w").write(r.text)

if not Path("cmd_helper.py").exists():
    r = requests.get(
        url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/cmd_helper.py",
    )
    open("cmd_helper.py", "w").write(r.text)

Load PyTorch model#

The AutoModelForSpeechSeq2Seq.from_pretrained method initializes the PyTorch Whisper model using the transformers library. The model is downloaded once during the first run, which may take some time.

You may also choose other models from the Whisper collection; more details on them can be found here.

Preprocessing and post-processing are important when working with this model. The WhisperProcessor, initialized via the AutoProcessor class, is responsible for preparing the audio input data for the model (converting it to a log-Mel spectrogram) and for decoding the predicted output token_ids into a string using the tokenizer. We will use the pipeline method to transcribe audio of arbitrary length.

import ipywidgets as widgets

model_ids = {
    "Multilingual models": [
        "openai/whisper-large-v3-turbo",
        "openai/whisper-large-v3",
        "openai/whisper-large-v2",
        "openai/whisper-large",
        "openai/whisper-medium",
        "openai/whisper-small",
        "openai/whisper-base",
        "openai/whisper-tiny",
    ],
    "English-only models": [
        "distil-whisper/distil-large-v2",
        "distil-whisper/distil-large-v3",
        "distil-whisper/distil-medium.en",
        "distil-whisper/distil-small.en",
        "openai/whisper-medium.en",
        "openai/whisper-small.en",
        "openai/whisper-base.en",
        "openai/whisper-tiny.en",
    ],
}

model_type = widgets.Dropdown(
    options=model_ids.keys(),
    value="Multilingual models",
    description="Model:",
    disabled=False,
)

model_type
Dropdown(description='Model:', options=('Multilingual models', 'English-only models'), value='Multilingual mod…
model_id = widgets.Dropdown(
    options=model_ids[model_type.value],
    value=model_ids[model_type.value][-1],
    description="Model:",
    disabled=False,
)

model_id
Dropdown(description='Model:', index=7, options=('openai/whisper-large-v3-turbo', 'openai/whisper-large-v3', '…
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq, pipeline
from transformers.utils import logging

processor = AutoProcessor.from_pretrained(model_id.value)

pt_model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id.value)

pipe_pt = pipeline(
    "automatic-speech-recognition",
    model=pt_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device="cpu",
)

Run PyTorch model inference#

The pipeline expects audio data as a numpy array. We will download a .wav file and convert it to a numpy array for that purpose.

from notebook_utils import download_file

en_example_short = Path("data", "courtroom.wav")

# a wav sample
download_file(
    "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/courtroom.wav",
    en_example_short.name,
    directory=en_example_short.parent,
)
'data/courtroom.wav' already exists.
PosixPath('/home/labuser/work/notebook/openvino_notebooks/notebooks/whisper-asr-genai/data/courtroom.wav')
import librosa

en_raw_speech, samplerate = librosa.load(str(en_example_short), sr=16000)

Let’s check how the transcription task works.

import copy
import IPython.display as ipd

logging.set_verbosity_error()

sample = copy.deepcopy(en_raw_speech)

display(ipd.Audio(sample, rate=samplerate))

pt_result = pipe_pt(sample)
print(f"Result: {pt_result['text']}")
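The pipeline above processes this short sample in a single pass. For recordings longer than Whisper's 30-second context, the same pipeline can transcribe in chunks. The chunk_length_s and return_timestamps options below are standard transformers pipeline call arguments, shown here as an illustrative sketch rather than a step taken later in the notebook.

# Long-form transcription: split the audio into 30-second chunks and merge the results.
# return_timestamps=True also yields per-segment timestamps in the "chunks" field.
long_result = pipe_pt(sample, chunk_length_s=30, return_timestamps=True)
print(long_result["text"])
for chunk in long_result.get("chunks", []):
    print(chunk["timestamp"], chunk["text"])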