MMS: Scaling Speech Technology to 1000+ languages with OpenVINO™#

This Jupyter notebook can be launched after a local installation only.

The Massively Multilingual Speech (MMS) project expands speech technology from about 100 languages to over 1,000 by building a single multilingual speech recognition model supporting over 1,100 languages (more than 10 times as many as before), language identification models able to identify over 4,000 languages (40 times more than before), pretrained models supporting over 1,400 languages, and text-to-speech models for over 1,100 languages.

The MMS model was proposed in the paper Scaling Speech Technology to 1,000+ Languages. The models and code were originally released here.

The MMS project includes several open-source models: Automatic Speech Recognition (ASR), Language Identification (LID) and Speech Synthesis (TTS). The diagram below shows the LID and ASR flow.

LID and ASR flow#

This notebook covers ASR and LID. We use the LID model to identify the spoken language, and then a language-specific ASR model to transcribe the speech. An additional quantization step is applied to speed up model inference. The notebook ends with a Gradio-based interactive demo.
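The LID-then-ASR flow used in this notebook can be sketched in plain Python. This is only an illustration of the control flow under stated assumptions: `transcribe_with_lid`, `fake_lid` and `fake_recognizers` are hypothetical stand-ins, not the MMS models or any library API.

```python
from typing import Callable, Dict, Sequence, Tuple

Audio = Sequence[float]


def transcribe_with_lid(
    audio: Audio,
    identify_language: Callable[[Audio], str],
    recognizers: Dict[str, Callable[[Audio], str]],
) -> Tuple[str, str]:
    """Run LID first, then dispatch to the ASR callable for the detected language."""
    lang = identify_language(audio)
    if lang not in recognizers:
        raise KeyError(f"No ASR recognizer registered for language {lang!r}")
    return lang, recognizers[lang](audio)


# Hypothetical stand-ins so the sketch runs without any model weights.
def fake_lid(audio: Audio) -> str:
    return "deu"


fake_recognizers = {"deu": lambda audio: "guten tag"}

lang, text = transcribe_with_lid([0.0, 0.1, -0.1], fake_lid, fake_recognizers)
print(lang, text)
```

Keeping the recognizers in a dictionary keyed by language code means an unsupported language fails loudly instead of silently falling back to the wrong model.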

Prerequisites#

%pip install -q --upgrade pip
%pip install -q "transformers>=4.33.1" "torch>=2.1" "openvino>=2023.1.0" "numpy>=1.21.0" "nncf>=2.9.0"
%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu torch "datasets>=2.14.6" accelerate soundfile librosa "gradio>=4.19" jiwer
from pathlib import Path

import torch

import openvino as ov

Prepare an example audio#

Read an audio file and process the audio data. Make sure that the audio is sampled at 16 kHz (16,000 Hz). For this example we will use a streamable version of the Multilingual LibriSpeech (MLS) dataset. It contains examples in 7 languages: 'german', 'dutch', 'french', 'spanish', 'italian', 'portuguese', 'polish'. Choose one of them.

import ipywidgets as widgets


SAMPLE_LANG = widgets.Dropdown(
    options=["german", "dutch", "french", "spanish", "italian", "portuguese", "polish"],
    value="german",
    description="Dataset language:",
    disabled=False,
)

SAMPLE_LANG
Dropdown(description='Dataset language:', options=('german', 'dutch', 'french', 'spanish', 'italian', 'portugu…

Specify streaming=True to avoid downloading the entire dataset; examples are then fetched on demand.

from datasets import load_dataset


mls_dataset = load_dataset("facebook/multilingual_librispeech", SAMPLE_LANG.value, split="test", streaming=True)
mls_dataset = iter(mls_dataset)  # create an iterator over the streamed dataset

example = next(mls_dataset)  # get one example
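Because a streamed dataset is consumed through an iterator, further examples can be pulled lazily in the same way. The sketch below shows the pattern with `itertools.islice`; `fake_stream` is a hypothetical stand-in for the streamed dataset so that the snippet runs without network access.

```python
from itertools import islice


def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields records lazily."""
    for i in range(1_000_000):
        yield {"id": f"sample_{i}", "text": f"transcript {i}"}


stream = iter(fake_stream())
first = next(stream)  # same pattern as next(mls_dataset)
batch = list(islice(stream, 3))  # the next three records; nothing beyond them is produced

print(first["id"], [record["id"] for record in batch])
```

Only the records actually requested are generated, which mirrors why streaming avoids downloading the whole dataset.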

Each example is a dictionary. It contains the audio data and a text transcription.

print(example)  # look at structure
{'file': None, 'audio': {'path': '1054_1599_000000.flac', 'array': array([-0.00131226, -0.00152588, -0.00134277, ...,  0.00411987,
        0.00308228, -0.00015259]), 'sampling_rate': 16000}, 'text': 'mein sechster sohn scheint wenigstens auf den ersten blick der tiefsinnigste von allen ein kopfhänger und doch ein schwätzer deshalb kommt man ihm nicht leicht bei ist er am unterliegen so verfällt er in unbesiegbare traurigkeit', 'speaker_id': 1054, 'chapter_id': 1599, 'id': '1054_1599_000000'}
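The printed example carries its own `sampling_rate` field. As a minimal sketch, the check below confirms that a sample matches the 16 kHz rate used in this notebook and reports its duration. The helper `check_sample` and the synthetic `fake_example` are hypothetical and exist only for illustration; in the notebook you would pass `example` instead.

```python
EXPECTED_RATE = 16_000  # Hz, the rate this notebook expects


def check_sample(sample: dict, expected_rate: int = EXPECTED_RATE) -> float:
    """Return the clip duration in seconds, or raise if the rate differs."""
    audio = sample["audio"]
    rate = audio["sampling_rate"]
    if rate != expected_rate:
        raise ValueError(f"Expected {expected_rate} Hz audio, got {rate} Hz; resample first")
    return len(audio["array"]) / rate


# Synthetic stand-in with the same keys as the dataset example: 2 seconds of silence.
fake_example = {"audio": {"array": [0.0] * 32_000, "sampling_rate": 16_000}, "text": ""}
duration = check_sample(fake_example)
print(f"{duration:.2f} s")
```

Failing early on a mismatched rate is a design choice for this sketch; an alternative would be to resample the audio before passing it to the models.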
import IPython.display as ipd

print(example["text"])
ipd.Audio(example["audio"]["array"], rate=example["audio"]["sampling_rate"])  # 16000 for this dataset
mein sechster sohn scheint wenigstens auf den ersten blick der tiefsinnigste von allen ein kopfhänger und doch ein schwätzer deshalb kommt man ihm nicht leicht bei ist er am unterliegen so verfällt er in unbesiegbare traurigkeit