MMS: Scaling Speech Technology to 1000+ languages with OpenVINO™#
This Jupyter notebook can be launched after a local installation only.
The Massively Multilingual Speech (MMS) project expands speech technology from about 100 languages to over 1,000 by building a single multilingual speech recognition model supporting over 1,100 languages (more than 10 times as many as before), language identification models able to identify over 4,000 languages (40 times more than before), pretrained models supporting over 1,400 languages, and text-to-speech models for over 1,100 languages.
The MMS model was proposed in Scaling Speech Technology to 1,000+ Languages. The models and code were originally released here.
The MMS project open-sources several kinds of models: Automatic Speech Recognition (ASR), Language Identification (LID), and Speech Synthesis (TTS). A simple diagram of this is below.
In this notebook we consider ASR and LID. We will use the LID model to identify the language of an audio sample and then a language-specific ASR model to transcribe it. An additional quantization step is applied to speed up model inference. At the end of the notebook there is a Gradio-based interactive demo.
Table of contents:
Installation Instructions#
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
Prerequisites#
%pip install -q --upgrade pip
%pip install -q "transformers>=4.33.1" "torch>=2.1" "openvino>=2023.1.0" "numpy>=1.21.0" "nncf>=2.9.0"
%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu torch "datasets>=2.14.6" accelerate soundfile librosa "gradio>=4.19" jiwer
from pathlib import Path
import torch
import openvino as ov
Prepare an example audio#
Read an audio file and process the audio data. Make sure that the audio
data is sampled at 16 kHz. For this example we will use a streamable
version of the Multilingual LibriSpeech (MLS)
dataset.
It contains examples in 7 languages:
'german', 'dutch', 'french', 'spanish', 'italian', 'portuguese', 'polish'.
Choose one of them.
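If you would rather use your own recording, convert it to 16 kHz mono first. Below is a minimal sketch using librosa (installed above); the file name my_recording.wav is just a placeholder.

import librosa

# Hypothetical input file, replace with your own recording.
audio_path = "my_recording.wav"

# librosa.load returns a mono float32 waveform resampled to the requested rate.
waveform, sample_rate = librosa.load(audio_path, sr=16_000, mono=True)
print(waveform.shape, sample_rate)  # sample_rate is now 16000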
import ipywidgets as widgets
SAMPLE_LANG = widgets.Dropdown(
options=["german", "dutch", "french", "spanish", "italian", "portuguese", "polish"],
value="german",
description="Dataset language:",
disabled=False,
)
SAMPLE_LANG
Dropdown(description='Dataset language:', options=('german', 'dutch', 'french', 'spanish', 'italian', 'portugu…
Specify streaming=True to avoid downloading the entire dataset.
from datasets import load_dataset
mls_dataset = load_dataset("facebook/multilingual_librispeech", SAMPLE_LANG.value, split="test", streaming=True, trust_remote_code=True)
mls_dataset = iter(mls_dataset) # make it iterable
example = next(mls_dataset) # get one example
The example has a dictionary structure: it contains the audio data and a text transcription.
print(example) # look at structure
{'file': None, 'audio': {'path': '1054_1599_000000.flac', 'array': array([-0.00131226, -0.00152588, -0.00134277, ..., 0.00411987,
0.00308228, -0.00015259]), 'sampling_rate': 16000}, 'text': 'mein sechster sohn scheint wenigstens auf den ersten blick der tiefsinnigste von allen ein kopfhänger und doch ein schwätzer deshalb kommt man ihm nicht leicht bei ist er am unterliegen so verfällt er in unbesiegbare traurigkeit', 'speaker_id': 1054, 'chapter_id': 1599, 'id': '1054_1599_000000'}
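The waveform itself is stored under example["audio"]["array"] together with its sampling rate. As an optional sanity check, verify that it already matches the 16 kHz expected by the MMS models:

# MMS models expect 16 kHz mono audio; the streamed MLS examples already satisfy this.
audio = example["audio"]
assert audio["sampling_rate"] == 16_000, "resample the waveform to 16 kHz before inference"
print(f"Duration: {len(audio['array']) / audio['sampling_rate']:.1f} s")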
import IPython.display as ipd
print(example["text"])
ipd.Audio(example["audio"]["array"], rate=16_000)
mein sechster sohn scheint wenigstens auf den ersten blick der tiefsinnigste von allen ein kopfhänger und doch ein schwätzer deshalb kommt man ihm nicht leicht bei ist er am unterliegen so verfällt er in unbesiegbare traurigkeit
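Before converting the models to OpenVINO IR, it can help to see the whole LID → ASR chain on this example in plain PyTorch. The sketch below assumes the facebook/mms-lid-126 and facebook/mms-1b-all checkpoints from the MMS release; other MMS LID and ASR checkpoints follow the same pattern.

from transformers import AutoFeatureExtractor, AutoProcessor, Wav2Vec2ForCTC, Wav2Vec2ForSequenceClassification

waveform = example["audio"]["array"]

# 1. Language identification: a Wav2Vec2 classifier predicts the spoken language.
lid_extractor = AutoFeatureExtractor.from_pretrained("facebook/mms-lid-126")
lid_model = Wav2Vec2ForSequenceClassification.from_pretrained("facebook/mms-lid-126")

inputs = lid_extractor(waveform, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    lang_id = lid_model(**inputs).logits.argmax(-1)[0].item()
detected_lang = lid_model.config.id2label[lang_id]  # ISO 639-3 code, e.g. "deu" for German
print("Detected language:", detected_lang)

# 2. ASR: load the adapter weights for the detected language and transcribe with CTC decoding.
asr_processor = AutoProcessor.from_pretrained("facebook/mms-1b-all", target_lang=detected_lang)
asr_model = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all", target_lang=detected_lang, ignore_mismatched_sizes=True)

inputs = asr_processor(waveform, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = asr_model(**inputs).logits
print(asr_processor.decode(logits.argmax(-1)[0]))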