Automatic speech recognition using Distil-Whisper and OpenVINO#
This Jupyter notebook can be launched after a local installation only.
Distil-Whisper is a distilled variant of the Whisper model by OpenAI. It was proposed in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling. According to the authors, Distil-Whisper runs several times faster than Whisper with 50% fewer parameters, while performing to within 1% word error rate (WER) of it on out-of-distribution evaluation data.
Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It maps a sequence of audio spectrogram features to a sequence of text tokens. First, the feature extractor converts the raw audio input to a log-Mel spectrogram. Then, the Transformer encoder encodes the spectrogram into a sequence of encoder hidden states. Finally, the decoder autoregressively predicts text tokens, conditioned on both the previous tokens and the encoder hidden states.
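The autoregressive decoding step described above can be illustrated with a toy stand-in for the decoder. This is only a sketch: `greedy_decode`, `toy_scores`, the five-token vocabulary, and the token ids are all hypothetical and exist purely for illustration. It is not how the model's own generation code is implemented.

```python
def greedy_decode(next_token_scores, encoder_states, bos_id, eos_id, max_new_tokens=16):
    """Greedy autoregressive loop: each new token depends on all previous tokens
    and on the (fixed) encoder states."""
    tokens = [bos_id]
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens, encoder_states)
        # Pick the id with the highest score (greedy choice).
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


def toy_scores(tokens, encoder_states):
    """Hypothetical decoder: emits the encoder states one by one, then end-of-sequence (id 4)."""
    step = len(tokens) - 1  # number of tokens generated so far
    target = encoder_states[step] if step < len(encoder_states) else 4
    return [1.0 if token_id == target else 0.0 for token_id in range(5)]


decoded = greedy_decode(toy_scores, encoder_states=[2, 3, 1], bos_id=0, eos_id=4)
print(decoded)  # [0, 2, 3, 1, 4]
```

In the real model, the scores come from the Transformer decoder attending to the encoder hidden states, and the vocabulary contains tens of thousands of tokens rather than five.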
You can see the model architecture in the diagram below:
In this tutorial, we show how to run Distil-Whisper using OpenVINO. We use the pre-trained model from the Hugging Face Transformers library. To simplify the user experience, the Hugging Face Optimum library is used to convert the model to OpenVINO™ IR format. To further improve the performance of the OpenVINO Distil-Whisper model, INT8 post-training quantization from NNCF is applied.
Installation Instructions#
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
Prerequisites#
%pip install -q "transformers>=4.35" "torch>=2.1,<2.4.0" "torchvision<0.19.0" onnx "peft==0.6.2" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "git+https://github.com/huggingface/optimum-intel.git"
%pip install -q "openvino>=2023.2.0" datasets "gradio>=4.0" "librosa" "soundfile"
%pip install -q "nncf>=2.6.0" "jiwer"
Load PyTorch model#
The AutoModelForSpeechSeq2Seq.from_pretrained method initializes the PyTorch Whisper model using the transformers library. By default, this tutorial uses the distil-whisper/distil-large-v2 model. The model is downloaded once during the first run, which may take some time.
You may also choose other models from the Distil-Whisper Hugging Face collection, such as distil-whisper/distil-medium.en or distil-whisper/distil-small.en. Models of the original Whisper architecture are also available; more on them here.
Preprocessing and post-processing are important when using this model. The AutoProcessor class is used to initialize a WhisperProcessor, which prepares the audio input for the model by converting it to a log-Mel spectrogram and decodes the predicted output token_ids into a string using the tokenizer.
import ipywidgets as widgets
model_ids = {
    "Distil-Whisper": [
        "distil-whisper/distil-large-v2",
        "distil-whisper/distil-medium.en",
        "distil-whisper/distil-small.en",
    ],
    "Whisper": [
        "openai/whisper-large-v3",
        "openai/whisper-large-v2",
        "openai/whisper-large",
        "openai/whisper-medium",
        "openai/whisper-small",
        "openai/whisper-base",
        "openai/whisper-tiny",
        "openai/whisper-medium.en",
        "openai/whisper-small.en",
        "openai/whisper-base.en",
        "openai/whisper-tiny.en",
    ],
}
model_type = widgets.Dropdown(
    options=model_ids.keys(),
    value="Distil-Whisper",
    description="Model type:",
    disabled=False,
)
model_type
model_id = widgets.Dropdown(
    options=model_ids[model_type.value],
    value=model_ids[model_type.value][0],
    description="Model:",
    disabled=False,
)
model_id
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
processor = AutoProcessor.from_pretrained(model_id.value)
pt_model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id.value)
pt_model.eval();
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Prepare input sample#
The processor expects audio data as a NumPy array together with the audio sampling rate, and returns the input_features tensor used for making predictions. Conversion of audio to NumPy format is handled by the Hugging Face datasets implementation.
from datasets import load_dataset
def extract_input_features(sample):
    input_features = processor(
        sample["audio"]["array"],
        sampling_rate=sample["audio"]["sampling_rate"],
        return_tensors="pt",
    ).input_features
    return input_features
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation", trust_remote_code=True)
sample = dataset[0]
input_features = extract_input_features(sample)
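As a rough sanity check, the shape of the input_features tensor can be worked out by hand. The constants below are assumptions based on the default Whisper feature-extractor settings and are not stated in this tutorial; a different checkpoint may use other values (for example, whisper-large-v3 uses 128 Mel bins).

```python
# Assumed Whisper feature-extractor defaults (not taken from this tutorial).
SAMPLING_RATE = 16_000  # Hz
CHUNK_LENGTH_S = 30     # seconds; shorter audio is padded to this length
HOP_LENGTH = 160        # audio samples between consecutive spectrogram frames
N_MELS = 80             # number of Mel bins

n_samples = SAMPLING_RATE * CHUNK_LENGTH_S  # samples in one padded chunk
n_frames = n_samples // HOP_LENGTH          # spectrogram frames per chunk
expected_shape = (1, N_MELS, n_frames)      # (batch, Mel bins, frames)
print(expected_shape)  # (1, 80, 3000)
```

If these assumptions hold for the selected model, `tuple(input_features.shape)` from the cell above should equal this tuple.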
Run model inference#
To perform speech recognition, use the generate interface of the model. After generation is finished, processor.batch_decode can be used to decode the predicted token_ids into a text transcription.
import IPython.display as ipd
predicted_ids = pt_model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
display(ipd.Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"]))
print(f"Reference: {sample['text']}")
print(f"Result: {transcription[0]}")
Reference: MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
Result: Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.
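The reference and the result above differ in casing and punctuation, so comparing them requires text normalization before the word error rate (WER) is computed. The following is a minimal hand-rolled sketch of that idea; `normalize` and `word_error_rate` are hypothetical helpers with a deliberately simple normalization rule, not the evaluation procedure used by the Distil-Whisper authors. In practice, a dedicated package such as jiwer (installed in the prerequisites) would typically be used.

```python
import re


def normalize(text):
    """Lowercase and replace punctuation with spaces, then split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL"
result = "Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."
wer = word_error_rate(reference, result)
print(f"WER: {wer:.3f}")  # WER: 0.059
```

With this simple normalization, "Mr." and "MISTER" still count as one substitution out of 17 reference words. A fuller normalizer might also expand such abbreviations, which this sketch does not attempt.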
Load OpenVINO model using Optimum library#
The Hugging Face Optimum API is a high-level API that enables us to convert and quantize models from the Hugging Face Transformers library to the OpenVINO™ IR format. For more details, refer to the Hugging Face Optimum documentation.
Optimum Intel can be used to load optimized models from the Hugging Face Hub and create pipelines to run inference with OpenVINO Runtime using Hugging Face APIs. The Optimum Inference models are API-compatible with Hugging Face Transformers models, so we only need to replace the AutoModelForXxx class with the corresponding OVModelForXxx class.
Below is an example for the distil-whisper model:
-from transformers import AutoModelForSpeechSeq2Seq
+from optimum.intel.openvino import OVModelForSpeechSeq2Seq
from transformers import AutoTokenizer, pipeline
model_id = "distil-whisper/distil-large-v2"
-model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
+model = OVModelForSpeechSeq2Seq.from_pretrained(model_id, export=True)
Model class initialization starts with calling the from_pretrained method. When downloading and converting a Transformers model, the parameter export=True should be added. We can save the converted model for later use with the save_pretrained method. Tokenizers and processors are distributed with the models and are also compatible with the OpenVINO model, so we can reuse the processor initialized earlier.
from pathlib import Path
from optimum.intel.openvino import OVModelForSpeechSeq2Seq
model_path = Path(model_id.value.replace("/", "_"))
ov_config = {"CACHE_DIR": ""}
if not model_path.exists():
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_id.value,
        ov_config=ov_config,
        export=True,
        compile=False,
        load_in_8bit=False,
    )
    ov_model.half()
    ov_model.save_pretrained(model_path)
else:
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(model_path, ov_config=ov_config, compile=False)
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
Select Inference device#
import openvino as ov
import ipywidgets as widgets
core = ov.Core()
device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value="AUTO",
    description="Device:",
    disabled=False,
)
device
Dropdown(description='Device:', index=4, options=('CPU', 'GPU.0', 'GPU.1', 'GPU.2', 'AUTO'), value='AUTO')
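When the code is run as a plain script, where no widget is available, the device could be chosen programmatically instead. The helper below is a hypothetical sketch (`pick_device` is not part of OpenVINO), and the device list is copied from the widget output above rather than queried from the runtime.

```python
def pick_device(available, preferred=("GPU", "CPU")):
    """Return the first available device matching a preferred prefix, else "AUTO"."""
    for prefix in preferred:
        for name in available:
            # Match "GPU" as well as indexed names such as "GPU.0".
            if name == prefix or name.startswith(prefix + "."):
                return name
    return "AUTO"


# Device names copied from the Dropdown output shown above.
devices = ["CPU", "GPU.0", "GPU.1", "GPU.2"]
print(pick_device(devices))                      # GPU.0
print(pick_device(devices, preferred=("NPU",)))  # AUTO
```

In the notebook, `core.available_devices` could be passed in place of the hard-coded list, and the returned string used wherever `device.value` is used.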
Compile OpenVINO model#
ov_model.to(device.value)
ov_model.compile()
Compiling the encoder to AUTO ...
Compiling the decoder to AUTO ...
Compiling the decoder to AUTO ...
Run OpenVINO model inference#
predicted_ids = ov_model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
display(ipd.Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"]))
print(f"Reference: {sample['text']}")
print(f"Result: {transcription[0]}")
/home/nsavel/venvs/ov_notebooks_tmp/lib/python3.8/site-packages/optimum/intel/openvino/modeling_seq2seq.py:457: FutureWarning: shared_memory is deprecated and will be removed in 2024.0. Value of shared_memory is going to override share_inputs value. Please use only share_inputs explicitly.
  last_hidden_state = torch.from_numpy(self.request(inputs, shared_memory=True)["last_hidden_state"]).to(
/home/nsavel/venvs/ov_notebooks_tmp/lib/python3.8/site-packages/optimum/intel/openvino/modeling_seq2seq.py:538: FutureWarning: shared_memory is deprecated and will be removed in 2024.0. Value of shared_memory is going to override share_inputs value. Please use only share_inputs explicitly.
  self.request.start_async(inputs, shared_memory=True)
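The FutureWarning above is raised inside the library rather than by the notebook code, so it can be filtered to keep the cell output readable. The sketch below uses only the standard warnings module; `run_quietly` and `fake_generate` are hypothetical names, and `fake_generate` merely stands in for the model call so that the example runs without a model.

```python
import warnings


def run_quietly(fn, *args, **kwargs):
    """Call fn while hiding only the known shared_memory FutureWarning."""
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message="shared_memory is deprecated",  # regex matched at the start of the message
            category=FutureWarning,
        )
        return fn(*args, **kwargs)


def fake_generate(x):
    """Stand-in for the model call: emits the same kind of warning, then returns."""
    warnings.warn("shared_memory is deprecated and will be removed in 2024.0.", FutureWarning)
    return [x]


# Record any warnings that escape the filter to confirm the sketch works.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = run_quietly(fake_generate, 1)

print(result, len(caught))  # [1] 0
```

Under the same assumptions, the notebook call could be wrapped as `run_quietly(ov_model.generate, input_features)`. Other warnings still pass through, because the filter matches only this message.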