Video Subtitle Generation using Whisper and OpenVINO™#

This Jupyter notebook can be launched on-line, opening an interactive environment in a browser window. You can also make a local installation. Choose one of the following options:

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It is a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

You can find more information about this model in the research paper, OpenAI blog, model card and GitHub repository.

In this notebook, we will use Whisper model with OpenVINO Generate API for Whisper automatic speech recognition scenarios to generate subtitles in a sample video. Additionally, we will use NNCF improving model performance by INT8 quantization. Notebook contains the following steps: 1. Download the model. 2. Instantiate the PyTorch model pipeline. 3. Convert model to OpenVINO IR, using model conversion API. 4. Run the Whisper pipeline with OpenVINO models. 5. Quantize the OpenVINO model with NNCF. 6. Check quantized model result for the demo video. 7. Compare model size, performance and accuracy of FP32 and quantized INT8 models. 8. Launch Interactive demo for video subtitles generation.

Table of contents:

Prerequisites
Instantiate model
- Convert model to OpenVINO Intermediate Representation (IR) format.
Prepare inference pipeline
- Select inference device
Run video transcription pipeline
Quantization
Interactive demo

Installation Instructions#

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.

Prerequisites#

Install dependencies.

import platform
import importlib.metadata
import importlib.util

%pip install -q "nncf>=2.14.0"
%pip install -q -U "openvino>=2024.5.0" "openvino-tokenizers>=2024.5.0" "openvino-genai>=2024.5.0"
%pip install -q "python-ffmpeg<=1.0.16" "ffmpeg" "moviepy" "transformers>=4.45" "git+https://github.com/huggingface/optimum-intel.git" "torch>=2.1" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q -U "yt_dlp>=2024.8.6" soundfile librosa jiwer packaging
%pip install -q  "gradio>=4.19" "typing_extensions>=4.9"

if platform.system() == "Darwin":
    %pip install -q "numpy<2.0"


from packaging import version

if (
    importlib.util.find_spec("tensorflow") is not None
    and version.parse(importlib.metadata.version("tensorflow")) < version.parse("2.18.0")
    and version.parse(importlib.metadata.version("numpy")) >= version.parse("2.0.0")
):
    %pip uninstall -q -y tensorflow

import requests
from pathlib import Path

if not Path("notebook_utils.py").exists():
    r = requests.get(
        url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
    )
    open("notebook_utils.py", "w").write(r.text)

if not Path("cmd_helper.py").exists():
    r = requests.get(
        url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/cmd_helper.py",
    )
    open("cmd_helper.py", "w").write(r.text)

Instantiate model#

Whisper is a Transformer based encoder-decoder model, also referred to as a sequence-to-sequence model. It maps a sequence of audio spectrogram features to a sequence of text tokens. First, the raw audio inputs are converted to a log-Mel spectrogram by action of the feature extractor. Then, the Transformer encoder encodes the spectrogram to form a sequence of encoder hidden states. Finally, the decoder autoregressively predicts text tokens, conditional on both the previous tokens and the encoder hidden states.

You can see the model architecture in the diagram below:

There are several models of different sizes and capabilities trained by the authors of the model. In this tutorial, we will use the tiny model, but the same actions are also applicable to other models from Whisper family.

import ipywidgets as widgets

MODELS = [
    "openai/whisper-large-v3-turbo",
    "openai/whisper-large-v3",
    "openai/whisper-large-v2",
    "openai/whisper-large",
    "openai/whisper-medium",
    "openai/whisper-small",
    "openai/whisper-base",
    "openai/whisper-tiny",
]

model_id = widgets.Dropdown(
    options=list(MODELS),
    value="openai/whisper-tiny",
    description="Model:",
    disabled=False,
)

model_id

Dropdown(description='Model:', index=7, options=('openai/whisper-large-v3-turbo', 'openai/whisper-large-v3', '…

Convert model to OpenVINO Intermediate Representation (IR) format using Optimum-Intel.#

Listed Whisper model are available for downloading via the HuggingFace hub. We will use optimum-cli interface for exporting it into OpenVINO Intermediate Representation (IR) format.

Optimum CLI interface for converting models supports export to OpenVINO (supported starting optimum-intel 1.12 version). General command format:

optimum-cli export openvino --model <model_id_or_path> --task <task> <output_dir>

where --model argument is model id from HuggingFace Hub or local directory with model (saved using .save_pretrained method), --task is one of supported task that exported model should solve. For LLMs it will be automatic-speech-recognition-with-past. If model initialization requires to use remote code, --trust-remote-code flag additionally should be passed. Full list of supported arguments available via --help For more details and examples of usage, please check optimum documentation.

from cmd_helper import optimum_cli

model_dir = model_id.value.split("/")[-1]

if not Path(model_dir).exists():
    optimum_cli(model_id.value, model_dir)

Prepare inference pipeline#

The image below illustrates the pipeline of video transcribing using the Whisper model.

To simplify user experience we will use OpenVINO Generate API. Firstly we will create pipeline with WhisperPipeline. You can construct it straight away from the folder with the converted model. It will automatically load the model, tokenizer, detokenizer and default generation configuration.

Select inference device#

select device from dropdown list for running inference using OpenVINO

from notebook_utils import device_widget

device = device_widget(default="CPU", exclude=["NPU"])

device

Dropdown(description='Device:', options=('CPU', 'AUTO'), value='CPU')

import openvino_genai as ov_genai

ov_pipe = ov_genai.WhisperPipeline(str(model_dir), device=device.value)

Run video transcription pipeline#

Now, we are ready to start transcription. Let’s load the video first.

from notebook_utils import download_file

output_file = Path("downloaded_video.mp4")

download_file(
    "https://storage.openvinotoolkit.org/repositories/openvino_notebooks/data/data/video/Sheldon%20Cooper%20Jim%20Parsons%20at%20Intels%20Lab.mp4",
    filename=output_file.name,
)

'downloaded_video.mp4' already exists.

PosixPath('/home/labuser/work/notebook/openvino_notebooks/notebooks/whisper-subtitles-generation/downloaded_video.mp4')

Select the task for the model:

transcribe - generate audio transcription in the source language (automatically detected).
translate - generate audio transcription with translation to English language.

task = widgets.Select(
    options=["transcribe", "translate"],
    value="translate",
    description="Select task:",
    disabled=False,
)
task

Select(description='Select task:', index=1, options=('transcribe', 'translate'), value='translate')

try:
    from moviepy import VideoFileClip
except ImportError:
    from moviepy.editor import VideoFileClip
from transformers.pipelines.audio_utils import ffmpeg_read


def get_audio(video_file):
    """
    Extract audio signal from a given video file, then convert it to float,
    then mono-channel format and resample it to the expected sample rate

    Parameters:
        video_file: path to input video file
    Returns:
      resampled_audio: mono-channel float audio signal with 16000 Hz sample rate
                       extracted from video
      duration: duration of video fragment in seconds
    """
    input_video = VideoFileClip(str(video_file))
    duration = input_video.duration
    audio_file = video_file.stem + ".wav"
    input_video.audio.write_audiofile(audio_file, verbose=False, logger=None)
    with open(audio_file, "rb") as f:
        inputs = f.read()
    audio = ffmpeg_read(inputs, 16000)
    return {
        "raw": audio,
        "sampling_rate": 16000,
    }, duration

Let’s run generation method. We will put input data as np array. Also we will specify task and return_timestamps=True options. If task is translate, you can place language option, for example <|fr|> for French or it would be detect automatically. We can set up generation parameters in different ways. We can get default config with get_generation_config(), setup parameters and put config directly to generate(). It’s also possible to specify the needed options just as inputs in the generate() method and we will use this way. Then we just run generate method and get the output in text format.

generate method with return_timestamps set to True will return chunks, which contain attributes: text, start_ts and end_ts

inputs, duration = get_audio(output_file)

transcription = ov_pipe.generate(inputs["raw"], task=task.value, return_timestamps=True).chunks

import math


def format_timestamp(seconds: float):
    """
    format time in srt-file expected format
    """
    assert seconds >= 0, "non-negative timestamp expected"
    milliseconds = round(seconds * 1000.0)

    hours = milliseconds // 3_600_000
    milliseconds -= hours * 3_600_000

    minutes = milliseconds // 60_000
    milliseconds -= minutes * 60_000

    seconds = milliseconds // 1_000
    milliseconds -= seconds * 1_000

    return (f"{hours}:" if hours > 0 else "00:") + f"{minutes:02d}:{seconds:02d},{milliseconds:03d}"


def prepare_srt(transcription, filter_duration=None):
    """
    Format transcription into srt file format
    """
    segment_lines = []
    for idx, segment in enumerate(transcription):
        timestamp = (segment.start_ts, segment.end_ts)
        # for the case where the model could not predict an ending timestamp, which can happen if audio is cut off in the middle of a word.
        if segment.end_ts == -1:
            timestamp[1] = filter_duration

        if filter_duration is not None and (timestamp[0] >= math.floor(filter_duration) or timestamp[1] > math.ceil(filter_duration) + 1):
            break
        segment_lines.append(str(idx + 1) + "\n")
        time_start = format_timestamp(timestamp[0])
        time_end = format_timestamp(timestamp[1])
        time_str = f"{time_start} --> {time_end}\n"
        segment_lines.append(time_str)
        segment_lines.append(segment.text + "\n\n")
    return segment_lines

“The results will be saved in the downloaded_video.srt file. SRT is one of the most popular formats for storing subtitles and is compatible with many modern video players. This file can be used to embed transcription into videos during playback or by injecting them directly into video files using ffmpeg.

srt_lines = prepare_srt(transcription, filter_duration=duration)
# save transcription
with output_file.with_suffix(".srt").open("w") as f:
    f.writelines(srt_lines)

Now let us see the results.

widgets.Video.from_file(output_file, loop=False, width=800, height=800)

Video(value=b'x00x00x00x18ftypmp42x00x00x00x00isommp42x00x00Aimoovx00x00x00lmvhd...', height='800…

print("".join(srt_lines))

1
00:00:00,000 --> 00:00:05,000
 Oh, what's that?

2
00:00:05,000 --> 00:00:08,000
 Oh, wow.

3
00:00:08,000 --> 00:00:10,000
 Hello, humans.

4
00:00:13,000 --> 00:00:15,000
 Focus on me.

5
00:00:15,000 --> 00:00:17,000
 Focus on the guard.

6
00:00:17,000 --> 00:00:20,000
 Don't tell anyone what you're seeing in here.

7
00:00:22,000 --> 00:00:24,000
 Have you seen what's in there?

8
00:00:24,000 --> 00:00:25,000
 They have intel.

9
00:00:25,000 --> 00:00:27,000
 This is where it all changes.

Quantization#

NNCF enables post-training quantization by adding the quantization layers into the model graph and then using a subset of the training dataset to initialize the parameters of these additional quantization layers. The framework is designed so that modifications to your original training code are minor.

The optimization process contains the following steps:

Create a calibration dataset for quantization.
Run nncf.quantize to obtain quantized encoder and decoder models.
Serialize the INT8 model using openvino.save_model function.

Note: Quantization is time and memory consuming operation. Running quantization code below may take some time.

Please select below whether you would like to run Whisper quantization.

to_quantize = widgets.Checkbox(
    value=True,
    description="Quantization",
    disabled=False,
)

to_quantize

Checkbox(value=True, description='Quantization')

# Fetch `skip_kernel_extension` module
import requests

r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/skip_kernel_extension.py",
)
open("skip_kernel_extension.py", "w").write(r.text)

ov_quantized_model = None
quantized_ov_pipe = None

%load_ext skip_kernel_extension

Let’s load converted OpenVINO model format using Optimum-Intel to easily quantize it.

Optimum Intel can be used to load optimized models from the Hugging Face Hub or local folder to create pipelines to run an inference with OpenVINO Runtime using Hugging Face APIs. The Optimum Inference models are API compatible with Hugging Face Transformers models. This means we just need to replace the AutoModelForXxx class with the corresponding OVModelForXxx class.

Below is an example of the whisper-tiny model

-from transformers import AutoModelForSpeechSeq2Seq
+from optimum.intel.openvino import OVModelForSpeechSeq2Seq
from transformers import AutoTokenizer, pipeline

model_id = "openai/whisper-tiny"
-model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
+model = OVModelForSpeechSeq2Seq.from_pretrained(model_id, export=True)

Like the original PyTorch model, the OpenVINO model is also compatible with HuggingFace pipeline interface for automatic-speech-recognition.

%%skip not $to_quantize.value

from transformers import AutoProcessor
from optimum.intel.openvino import OVModelForSpeechSeq2Seq

ov_model = OVModelForSpeechSeq2Seq.from_pretrained(model_dir, device=device.value)
processor = AutoProcessor.from_pretrained(model_dir)

Prepare calibration datasets#

First step is to prepare calibration datasets for quantization. Since we quantize whisper encoder and decoder separately, we need to prepare a calibration dataset for each of the models. We import an InferRequestWrapper class that will intercept model inputs and collect them to a list. Then we run model inference on some small amount of audio samples. Generally, increasing the calibration dataset size improves quantization quality.

%%skip not $to_quantize.value

from itertools import islice
from tqdm.notebook import tqdm
from datasets import load_dataset
from transformers import pipeline
from optimum.intel.openvino.quantization import InferRequestWrapper


def collect_calibration_dataset(ov_model: OVModelForSpeechSeq2Seq, calibration_dataset_size: int):
    # Overwrite model request properties, saving the original ones for restoring later
    encoder_calibration_data = []
    decoder_calibration_data = []
    ov_model.encoder.request = InferRequestWrapper(ov_model.encoder.request, encoder_calibration_data, apply_caching=True)
    ov_model.decoder_with_past.request = InferRequestWrapper(ov_model.decoder_with_past.request,
                                                             decoder_calibration_data,
                                                             apply_caching=True)

    pipe = pipeline(
      "automatic-speech-recognition",
      model=ov_model,
      chunk_length_s=30,
      tokenizer=processor.tokenizer,
      feature_extractor=processor.feature_extractor)
    try:
        calibration_dataset = dataset = load_dataset("openslr/librispeech_asr", "clean", split="validation", streaming=True, trust_remote_code=True)
        for sample in tqdm(islice(calibration_dataset, calibration_dataset_size), desc="Collecting calibration data",
                           total=calibration_dataset_size):
            pipe(sample["audio"], generate_kwargs={"task": task.value}, return_timestamps=True)
    finally:
        ov_model.encoder.request = ov_model.encoder.request.request
        ov_model.decoder_with_past.request = ov_model.decoder_with_past.request.request

    return encoder_calibration_data, decoder_calibration_data

Quantize Whisper encoder and decoder models#

Below we run the quantize function which calls nncf.quantize on Whisper encoder and decoder-with-past models. We don’t quantize first-step-decoder because its share in whole inference time is negligible.

%%skip not $to_quantize.value

import gc
import shutil
import nncf
import openvino as ov


CALIBRATION_DATASET_SIZE = 30
quantized_model_path = Path(f"{model_dir}_quantized")


def quantize(ov_model: OVModelForSpeechSeq2Seq, calibration_dataset_size: int):
    if not quantized_model_path.exists():
        encoder_calibration_data, decoder_calibration_data = collect_calibration_dataset(ov_model, calibration_dataset_size)
        print("Quantizing encoder")
        quantized_encoder = nncf.quantize(
            ov_model.encoder.model,
            nncf.Dataset(encoder_calibration_data),
            subset_size=len(encoder_calibration_data),
            model_type=nncf.ModelType.TRANSFORMER,
            # Smooth Quant algorithm reduces activation quantization error; optimal alpha value was obtained through grid search
            advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.80),
        )
        ov.save_model(quantized_encoder, quantized_model_path / "openvino_encoder_model.xml")
        del quantized_encoder
        del encoder_calibration_data
        gc.collect()

        print("Quantizing decoder with past")
        quantized_decoder_with_past = nncf.quantize(
            ov_model.decoder_with_past.model,
            nncf.Dataset(decoder_calibration_data),
            subset_size=len(decoder_calibration_data),
            model_type=nncf.ModelType.TRANSFORMER,
            # Smooth Quant algorithm reduces activation quantization error; optimal alpha value was obtained through grid search
            advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.96),
        )
        ov.save_model(quantized_decoder_with_past, quantized_model_path / "openvino_decoder_with_past_model.xml")
        del quantized_decoder_with_past
        del decoder_calibration_data
        gc.collect()

        # Copy the config file and the first-step-decoder manually
        model_path = Path(model_dir)
        shutil.copy(model_path / "config.json", quantized_model_path / "config.json")
        shutil.copy(model_path / "generation_config.json", quantized_model_path / "generation_config.json")
        shutil.copy(model_path / "openvino_decoder_model.xml", quantized_model_path / "openvino_decoder_model.xml")
        shutil.copy(model_path / "openvino_decoder_model.bin", quantized_model_path / "openvino_decoder_model.bin")
        shutil.copy(model_path / "openvino_tokenizer.xml", quantized_model_path / "openvino_tokenizer.xml")
        shutil.copy(model_path / "openvino_tokenizer.bin", quantized_model_path / "openvino_tokenizer.bin")
        shutil.copy(model_path / "openvino_detokenizer.xml", quantized_model_path / "openvino_detokenizer.xml")
        shutil.copy(model_path / "openvino_detokenizer.bin", quantized_model_path / "openvino_detokenizer.bin")
        shutil.copy(model_path / "tokenizer_config.json", quantized_model_path / "tokenizer_config.json")
        shutil.copy(model_path / "tokenizer.json", quantized_model_path / "tokenizer.json")
        shutil.copy(model_path / "vocab.json", quantized_model_path / "vocab.json")
        shutil.copy(model_path / "preprocessor_config.json", quantized_model_path / "preprocessor_config.json")
        shutil.copy(model_path / "special_tokens_map.json", quantized_model_path / "special_tokens_map.json")
        shutil.copy(model_path / "normalizer.json", quantized_model_path / "normalizer.json")
        shutil.copy(model_path / "merges.txt", quantized_model_path / "merges.txt")
        shutil.copy(model_path / "added_tokens.json", quantized_model_path / "added_tokens.json")

    quantized_ov_pipe = ov_genai.WhisperPipeline(str(quantized_model_path), device=device.value)
    return quantized_ov_pipe


quantized_ov_pipe = quantize(ov_model, CALIBRATION_DATASET_SIZE)

Run quantized model inference#

Let’s compare the transcription results for original and quantized models.

if ov_quantized_model is not None:
    inputs, duration = get_audio(output_file)
    transcription = quantized_ov_pipe.generate(inputs["raw"], task=task.value, return_timestamps=True).chunks
    srt_lines = prepare_srt(transcription, filter_duration=duration)
    print("".join(srt_lines))
    widgets.Video.from_file(output_file, loop=False, width=800, height=800)

Compare performance and accuracy of the original and quantized models#

Finally, we compare original and quantized Whisper models from accuracy and performance stand-points.

To measure accuracy, we use 1 - WER as a metric, where WER stands for Word Error Rate.

%%skip not $to_quantize.value

import time
from contextlib import contextmanager
from jiwer import wer, wer_standardize

TEST_DATASET_SIZE = 50

def calculate_transcription_time_and_accuracy(ov_model, test_samples):
    whole_infer_times = []

    ground_truths = []
    predictions = []
    for data_item in tqdm(test_samples, desc="Measuring performance and accuracy"):
        start_time = time.perf_counter()
        transcription = ov_model.generate(data_item["audio"]["array"], return_timestamps=True)
        end_time = time.perf_counter()
        whole_infer_times.append(end_time - start_time)

        ground_truths.append(data_item["text"])
        predictions.append(transcription.texts[0])

    word_accuracy = (1 - wer(ground_truths, predictions, reference_transform=wer_standardize,
                             hypothesis_transform=wer_standardize)) * 100
    mean_whole_infer_time = sum(whole_infer_times)
    return word_accuracy, mean_whole_infer_time

test_dataset = load_dataset("openslr/librispeech_asr", "clean", split="validation", streaming=True, trust_remote_code=True)
test_dataset = test_dataset.shuffle(seed=42).take(TEST_DATASET_SIZE)
test_samples = [sample for sample in test_dataset]

accuracy_original, times_original = calculate_transcription_time_and_accuracy(ov_pipe, test_samples)
accuracy_quantized, times_quantized = calculate_transcription_time_and_accuracy(quantized_ov_pipe, test_samples)
print(f"Whole pipeline performance speedup: {times_original / times_quantized:.3f}")
print(f"Whisper transcription word accuracy. Original model: {accuracy_original:.2f}%. Quantized model: {accuracy_quantized:.2f}%.")
print(f"Accuracy drop: {accuracy_original - accuracy_quantized:.2f}%.")

Measuring performance and accuracy:   0%|          | 0/50 [00:00<?, ?it/s]

Measuring performance and accuracy:   0%|          | 0/50 [00:00<?, ?it/s]

Whole pipeline performance speedup: 1.452
Whisper transcription word accuracy. Original model: 81.77%. Quantized model: 82.97%.
Accuracy drop: -1.20%.

Interactive demo#

def_config = ov_pipe.get_generation_config()


def transcribe(video_path, task, use_int8):
    data_path = Path(video_path)
    inputs, duration = get_audio(data_path)
    m_pipe = quantized_ov_pipe if use_int8 else ov_pipe

    frame_num = len(inputs["raw"]) / 16000
    if frame_num > 30:
        config = ov_pipe.get_generation_config()
        chink_num = math.ceil(frame_num / 30)
        config.max_length = chink_num * def_config.max_length
        m_pipe.set_generation_config(config)

    transcription = m_pipe.generate(inputs["raw"], task=task.lower(), return_timestamps=True).chunks
    srt_lines = prepare_srt(transcription, duration)
    with data_path.with_suffix(".srt").open("w") as f:
        f.writelines(srt_lines)
    return [str(data_path), str(data_path.with_suffix(".srt"))]


if not Path("gradio_helper.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/whisper-subtitles-generation/gradio_helper.py")
    open("gradio_helper.py", "w").write(r.text)

from gradio_helper import make_demo

demo = make_demo(fn=transcribe, quantized=ov_quantized_model is not None, sample_path=output_file)

try:
    demo.launch(debug=False)
except Exception:
    demo.launch(share=True, debug=False)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/