MMS: Scaling Speech Technology to 1000+ languages with OpenVINO™#
This Jupyter notebook can be launched after a local installation only.
The Massively Multilingual Speech (MMS) project expands speech technology from about 100 languages to over 1,000 by building a single multilingual speech recognition model supporting over 1,100 languages (more than 10 times as many as before), language identification models able to identify over 4,000 languages (40 times more than before), pretrained models supporting over 1,400 languages, and text-to-speech models for over 1,100 languages.
The MMS model was proposed in Scaling Speech Technology to 1,000+ Languages. The models and code were originally released here.
There are different open-source models in the MMS project: Automatic Speech Recognition (ASR), Language Identification (LID) and Speech Synthesis (TTS). A simple diagram of this is below.
In this notebook we consider ASR and LID. We will use the LID model to identify the language of an audio sample and then a language-specific ASR model to transcribe it. An additional quantization step is applied to speed up model inference. At the end of the notebook there is a Gradio-based interactive demo.
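At a high level, the pipeline looks like the following sketch; the functions referenced here (detect_language, get_asr_model, recognize_audio) are defined step by step in the sections below:
# language_id = detect_language(compiled_lid_model, audio)                      # LID: returns e.g. "deu" or "fra"
# compiled_asr_model = get_asr_model(asr_model_xml_path_template, language_id)  # load the matching ASR adapter
# transcription = recognize_audio(compiled_asr_model, audio)                    # ASR: transcribe the audio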
Table of contents:
Installation Instructions#
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
Prerequisites#
%pip install -q --upgrade pip
%pip install -q "transformers>=4.33.1" "torch>=2.1" "openvino>=2023.1.0" "numpy>=1.21.0" "nncf>=2.9.0"
%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu torch "datasets>=2.14.6" accelerate soundfile librosa "gradio>=4.19" jiwer
from pathlib import Path
import torch
import openvino as ov
Prepare an example audio#
Read an audio file and process the audio data. Make sure that the audio
data is sampled at 16 kHz (16,000 Hz). For this example we will use a streamable
version of the Multilingual LibriSpeech (MLS)
dataset.
It contains examples in 7 languages:
'german', 'dutch', 'french', 'spanish', 'italian', 'portuguese', 'polish'
.
Choose one of them.
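If you use your own recording instead of a dataset sample, it can be resampled to 16 kHz on load, for example with librosa (an illustrative snippet; "my_audio.wav" is a placeholder path):
import librosa

my_audio, sr = librosa.load("my_audio.wav", sr=16_000)  # librosa resamples the file to 16 kHz while loading
print(sr)  # 16000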
import ipywidgets as widgets
SAMPLE_LANG = widgets.Dropdown(
options=["german", "dutch", "french", "spanish", "italian", "portuguese", "polish"],
value="german",
description="Dataset language:",
disabled=False,
)
SAMPLE_LANG
Dropdown(description='Dataset language:', options=('german', 'dutch', 'french', 'spanish', 'italian', 'portugu…
Specify streaming=True
to avoid downloading the entire dataset.
from datasets import load_dataset
mls_dataset = load_dataset("facebook/multilingual_librispeech", SAMPLE_LANG.value, split="test", streaming=True, trust_remote_code=True)
mls_dataset = iter(mls_dataset) # make it iterable
example = next(mls_dataset) # get one example
The example has a dictionary structure. It contains audio data and a text transcription.
print(example) # look at structure
{'file': None, 'audio': {'path': '1054_1599_000000.flac', 'array': array([-0.00131226, -0.00152588, -0.00134277, ..., 0.00411987,
0.00308228, -0.00015259]), 'sampling_rate': 16000}, 'text': 'mein sechster sohn scheint wenigstens auf den ersten blick der tiefsinnigste von allen ein kopfhänger und doch ein schwätzer deshalb kommt man ihm nicht leicht bei ist er am unterliegen so verfällt er in unbesiegbare traurigkeit', 'speaker_id': 1054, 'chapter_id': 1599, 'id': '1054_1599_000000'}
import IPython.display as ipd
print(example["transcript"])
ipd.Audio(example["audio"]["array"], rate=16_000)
mein sechster sohn scheint wenigstens auf den ersten blick der tiefsinnigste von allen ein kopfhänger und doch ein schwätzer deshalb kommt man ihm nicht leicht bei ist er am unterliegen so verfällt er in unbesiegbare traurigkeit
Language Identification (LID)#
Download pretrained model and processor#
Different LID models are available, depending on the number of languages they can recognize: 126, 256, 512, 1024, 2048, or 4017. We will use the 126-language model.
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
model_id = "facebook/mms-lid-126"
lid_processor = AutoFeatureExtractor.from_pretrained(model_id)
lid_model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
Use the original model to run an inference#
inputs = lid_processor(example["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
outputs = lid_model(**inputs).logits
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = lid_model.config.id2label[lang_id]
print(detected_lang)
deu
Convert to OpenVINO IR model and run an inference#
Select the device from the dropdown list for running inference using OpenVINO.
core = ov.Core()
import requests
r = requests.get(
url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
)
open("notebook_utils.py", "w").write(r.text)
from notebook_utils import device_widget
device = device_widget("CPU", exclude=["NPU"])
device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
Convert the model to OpenVINO IR format and compile it.
MAX_SEQ_LENGTH = 30480
lid_model_xml_path = Path("models/ov_lid_model.xml")
def get_lid_model(model_path, compiled=True):
input_values = torch.zeros([1, MAX_SEQ_LENGTH], dtype=torch.float)
if not model_path.exists() and model_path == lid_model_xml_path:
lid_model_xml_path.parent.mkdir(parents=True, exist_ok=True)
converted_model = ov.convert_model(lid_model, example_input={"input_values": input_values})
ov.save_model(converted_model, lid_model_xml_path)
if not compiled:
return converted_model
if compiled:
return core.compile_model(model_path, device_name=device.value)
return core.read_model(model_path)
compiled_lid_model = get_lid_model(lid_model_xml_path)
/home/nsavel/venvs/ov_notebooks_tmp/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py:595: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/home/nsavel/venvs/ov_notebooks_tmp/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py:634: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
Now it is possible to run an inference.
def detect_language(compiled_model, audio_data):
inputs = lid_processor(audio_data, sampling_rate=16_000, return_tensors="pt")
outputs = compiled_model(inputs["input_values"])[0]
lang_id = torch.argmax(torch.from_numpy(outputs), dim=-1)[0].item()
detected_lang = lid_model.config.id2label[lang_id]
return detected_lang
detect_language(compiled_lid_model, example["audio"]["array"])
'deu'
Let’s check another language.
SAMPLE_LANG = widgets.Dropdown(
options=["german", "dutch", "french", "spanish", "italian", "portuguese", "polish"],
value="french",
description="Dataset language:",
disabled=False,
)
SAMPLE_LANG
Dropdown(description='Dataset language:', index=2, options=('german', 'dutch', 'french', 'spanish', 'italian',…
mls_dataset = load_dataset("facebook/multilingual_librispeech", SAMPLE_LANG.value, split="test", streaming=True, trust_remote_code=True)
mls_dataset = iter(mls_dataset)
example = next(mls_dataset)
print(example["transcript"])
ipd.Audio(example["audio"]["array"], rate=16_000)
grisé par ce parfum il fit des vers en l'honneur de l'humble fleur des bois et il les récita tout haut à ses pieds une violette l'entendit elle crut qu'il ne parlait que pour elle
language_id = detect_language(compiled_lid_model, example["audio"]["array"])
print(language_id)
fra
Automatic Speech Recognition (ASR)#
Download pretrained model and processor#
Download pretrained model and processor. By default, MMS loads adapter
weights for English. If you want to load adapter weights for another
language, make sure to specify target_lang=<your-chosen-target-lang>
as well as ignore_mismatched_sizes=True
. The
ignore_mismatched_sizes=True
keyword has to be passed to allow the
language model head to be resized according to the vocabulary of the
specified language. Similarly, the processor should be loaded with the
same target language. It is also possible to change the supported
language later.
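If the target language is known in advance, the processor and model can also be loaded with that adapter directly. A minimal sketch (using French, "fra", purely as an illustration; the cell below instead loads the default English adapter and switches adapters later with load_adapter()):
from transformers import Wav2Vec2ForCTC, AutoProcessor

asr_processor_fra = AutoProcessor.from_pretrained("facebook/mms-1b-all", target_lang="fra")
asr_model_fra = Wav2Vec2ForCTC.from_pretrained(
    "facebook/mms-1b-all", target_lang="fra", ignore_mismatched_sizes=True
)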
from transformers import Wav2Vec2ForCTC, AutoProcessor
model_id = "facebook/mms-1b-all"
asr_processor = AutoProcessor.from_pretrained(model_id)
asr_model = Wav2Vec2ForCTC.from_pretrained(model_id)
You can look at all supported languages:
asr_processor.tokenizer.vocab.keys()
dict_keys(['abi', 'abk', 'abp', 'aca', 'acd', 'ace', 'acf', 'ach', 'acn', 'acr', 'acu', 'ade', 'adh', 'adj', 'adx', 'aeu', 'afr', 'agd', 'agg', 'agn', 'agr', 'agu', 'agx', 'aha', 'ahk', 'aia', 'aka', 'akb', 'ake', 'akp', 'alj', 'alp', 'alt', 'alz', 'ame', 'amf', 'amh', 'ami', 'amk', 'ann', 'any', 'aoz', 'apb', 'apr', 'ara', 'arl', 'asa', 'asg', 'asm', 'ast', 'ata', 'atb', 'atg', 'ati', 'atq', 'ava', 'avn', 'avu', 'awa', 'awb', 'ayo', 'ayr', 'ayz', 'azb', 'azg', 'azj-script_cyrillic', 'azj-script_latin', 'azz', 'bak', 'bam', 'ban', 'bao', 'bas', 'bav', 'bba', 'bbb', 'bbc', 'bbo', 'bcc-script_arabic', 'bcc-script_latin', 'bcl', 'bcw', 'bdg', 'bdh', 'bdq', 'bdu', 'bdv', 'beh', 'bel', 'bem', 'ben', 'bep', 'bex', 'bfa', 'bfo', 'bfy', 'bfz', 'bgc', 'bgq', 'bgr', 'bgt', 'bgw', 'bha', 'bht', 'bhz', 'bib', 'bim', 'bis', 'biv', 'bjr', 'bjv', 'bjw', 'bjz', 'bkd', 'bkv', 'blh', 'blt', 'blx', 'blz', 'bmq', 'bmr', 'bmu', 'bmv', 'bng', 'bno', 'bnp', 'boa', 'bod', 'boj', 'bom', 'bor', 'bos', 'bov', 'box', 'bpr', 'bps', 'bqc', 'bqi', 'bqj', 'bqp', 'bre', 'bru', 'bsc', 'bsq', 'bss', 'btd', 'bts', 'btt', 'btx', 'bud', 'bul', 'bus', 'bvc', 'bvz', 'bwq', 'bwu', 'byr', 'bzh', 'bzi', 'bzj', 'caa', 'cab', 'cac-dialect_sanmateoixtatan', 'cac-dialect_sansebastiancoatan', 'cak-dialect_central', 'cak-dialect_santamariadejesus', 'cak-dialect_santodomingoxenacoj', 'cak-dialect_southcentral', 'cak-dialect_western', 'cak-dialect_yepocapa', 'cap', 'car', 'cas', 'cat', 'cax', 'cbc', 'cbi', 'cbr', 'cbs', 'cbt', 'cbu', 'cbv', 'cce', 'cco', 'cdj', 'ceb', 'ceg', 'cek', 'ces', 'cfm', 'cgc', 'che', 'chf', 'chv', 'chz', 'cjo', 'cjp', 'cjs', 'ckb', 'cko', 'ckt', 'cla', 'cle', 'cly', 'cme', 'cmn-script_simplified', 'cmo-script_khmer', 'cmo-script_latin', 'cmr', 'cnh', 'cni', 'cnl', 'cnt', 'coe', 'cof', 'cok', 'con', 'cot', 'cou', 'cpa', 'cpb', 'cpu', 'crh', 'crk-script_latin', 'crk-script_syllabics', 'crn', 'crq', 'crs', 'crt', 'csk', 'cso', 'ctd', 'ctg', 'cto', 'ctu', 'cuc', 'cui', 'cuk', 'cul', 'cwa', 'cwe', 'cwt', 'cya', 'cym', 'daa', 'dah', 'dan', 'dar', 'dbj', 'dbq', 'ddn', 'ded', 'des', 'deu', 'dga', 'dgi', 'dgk', 'dgo', 'dgr', 'dhi', 'did', 'dig', 'dik', 'dip', 'div', 'djk', 'dnj-dialect_blowowest', 'dnj-dialect_gweetaawueast', 'dnt', 'dnw', 'dop', 'dos', 'dsh', 'dso', 'dtp', 'dts', 'dug', 'dwr', 'dyi', 'dyo', 'dyu', 'dzo', 'eip', 'eka', 'ell', 'emp', 'enb', 'eng', 'enx', 'epo', 'ese', 'ess', 'est', 'eus', 'evn', 'ewe', 'eza', 'fal', 'fao', 'far', 'fas', 'fij', 'fin', 'flr', 'fmu', 'fon', 'fra', 'frd', 'fry', 'ful', 'gag-script_cyrillic', 'gag-script_latin', 'gai', 'gam', 'gau', 'gbi', 'gbk', 'gbm', 'gbo', 'gde', 'geb', 'gej', 'gil', 'gjn', 'gkn', 'gld', 'gle', 'glg', 'glk', 'gmv', 'gna', 'gnd', 'gng', 'gof-script_latin', 'gog', 'gor', 'gqr', 'grc', 'gri', 'grn', 'grt', 'gso', 'gub', 'guc', 'gud', 'guh', 'guj', 'guk', 'gum', 'guo', 'guq', 'guu', 'gux', 'gvc', 'gvl', 'gwi', 'gwr', 'gym', 'gyr', 'had', 'hag', 'hak', 'hap', 'hat', 'hau', 'hay', 'heb', 'heh', 'hif', 'hig', 'hil', 'hin', 'hlb', 'hlt', 'hne', 'hnn', 'hns', 'hoc', 'hoy', 'hrv', 'hsb', 'hto', 'hub', 'hui', 'hun', 'hus-dialect_centralveracruz', 'hus-dialect_westernpotosino', 'huu', 'huv', 'hvn', 'hwc', 'hye', 'hyw', 'iba', 'ibo', 'icr', 'idd', 'ifa', 'ifb', 'ife', 'ifk', 'ifu', 'ify', 'ign', 'ikk', 'ilb', 'ilo', 'imo', 'ina', 'inb', 'ind', 'iou', 'ipi', 'iqw', 'iri', 'irk', 'isl', 'ita', 'itl', 'itv', 'ixl-dialect_sangasparchajul', 'ixl-dialect_sanjuancotzal', 'ixl-dialect_santamarianebaj', 'izr', 'izz', 'jac', 'jam', 'jav', 'jbu', 'jen', 'jic', 'jiv', 'jmc', 'jmd', 
'jpn', 'jun', 'juy', 'jvn', 'kaa', 'kab', 'kac', 'kak', 'kam', 'kan', 'kao', 'kaq', 'kat', 'kay', 'kaz', 'kbo', 'kbp', 'kbq', 'kbr', 'kby', 'kca', 'kcg', 'kdc', 'kde', 'kdh', 'kdi', 'kdj', 'kdl', 'kdn', 'kdt', 'kea', 'kek', 'ken', 'keo', 'ker', 'key', 'kez', 'kfb', 'kff-script_telugu', 'kfw', 'kfx', 'khg', 'khm', 'khq', 'kia', 'kij', 'kik', 'kin', 'kir', 'kjb', 'kje', 'kjg', 'kjh', 'kki', 'kkj', 'kle', 'klu', 'klv', 'klw', 'kma', 'kmd', 'kml', 'kmr-script_arabic', 'kmr-script_cyrillic', 'kmr-script_latin', 'kmu', 'knb', 'kne', 'knf', 'knj', 'knk', 'kno', 'kog', 'kor', 'kpq', 'kps', 'kpv', 'kpy', 'kpz', 'kqe', 'kqp', 'kqr', 'kqy', 'krc', 'kri', 'krj', 'krl', 'krr', 'krs', 'kru', 'ksb', 'ksr', 'kss', 'ktb', 'ktj', 'kub', 'kue', 'kum', 'kus', 'kvn', 'kvw', 'kwd', 'kwf', 'kwi', 'kxc', 'kxf', 'kxm', 'kxv', 'kyb', 'kyc', 'kyf', 'kyg', 'kyo', 'kyq', 'kyu', 'kyz', 'kzf', 'lac', 'laj', 'lam', 'lao', 'las', 'lat', 'lav', 'law', 'lbj', 'lbw', 'lcp', 'lee', 'lef', 'lem', 'lew', 'lex', 'lgg', 'lgl', 'lhu', 'lia', 'lid', 'lif', 'lin', 'lip', 'lis', 'lit', 'lje', 'ljp', 'llg', 'lln', 'lme', 'lnd', 'lns', 'lob', 'lok', 'lom', 'lon', 'loq', 'lsi', 'lsm', 'ltz', 'luc', 'lug', 'luo', 'lwo', 'lww', 'lzz', 'maa-dialect_sanantonio', 'maa-dialect_sanjeronimo', 'mad', 'mag', 'mah', 'mai', 'maj', 'mak', 'mal', 'mam-dialect_central', 'mam-dialect_northern', 'mam-dialect_southern', 'mam-dialect_western', 'maq', 'mar', 'maw', 'maz', 'mbb', 'mbc', 'mbh', 'mbj', 'mbt', 'mbu', 'mbz', 'mca', 'mcb', 'mcd', 'mco', 'mcp', 'mcq', 'mcu', 'mda', 'mdf', 'mdv', 'mdy', 'med', 'mee', 'mej', 'men', 'meq', 'met', 'mev', 'mfe', 'mfh', 'mfi', 'mfk', 'mfq', 'mfy', 'mfz', 'mgd', 'mge', 'mgh', 'mgo', 'mhi', 'mhr', 'mhu', 'mhx', 'mhy', 'mib', 'mie', 'mif', 'mih', 'mil', 'mim', 'min', 'mio', 'mip', 'miq', 'mit', 'miy', 'miz', 'mjl', 'mjv', 'mkd', 'mkl', 'mkn', 'mlg', 'mlt', 'mmg', 'mnb', 'mnf', 'mnk', 'mnw', 'mnx', 'moa', 'mog', 'mon', 'mop', 'mor', 'mos', 'mox', 'moz', 'mpg', 'mpm', 'mpp', 'mpx', 'mqb', 'mqf', 'mqj', 'mqn', 'mri', 'mrw', 'msy', 'mtd', 'mtj', 'mto', 'muh', 'mup', 'mur', 'muv', 'muy', 'mvp', 'mwq', 'mwv', 'mxb', 'mxq', 'mxt', 'mxv', 'mya', 'myb', 'myk', 'myl', 'myv', 'myx', 'myy', 'mza', 'mzi', 'mzj', 'mzk', 'mzm', 'mzw', 'nab', 'nag', 'nan', 'nas', 'naw', 'nca', 'nch', 'ncj', 'ncl', 'ncu', 'ndj', 'ndp', 'ndv', 'ndy', 'ndz', 'neb', 'new', 'nfa', 'nfr', 'nga', 'ngl', 'ngp', 'ngu', 'nhe', 'nhi', 'nhu', 'nhw', 'nhx', 'nhy', 'nia', 'nij', 'nim', 'nin', 'nko', 'nlc', 'nld', 'nlg', 'nlk', 'nmz', 'nnb', 'nno', 'nnq', 'nnw', 'noa', 'nob', 'nod', 'nog', 'not', 'npi', 'npl', 'npy', 'nso', 'nst', 'nsu', 'ntm', 'ntr', 'nuj', 'nus', 'nuz', 'nwb', 'nxq', 'nya', 'nyf', 'nyn', 'nyo', 'nyy', 'nzi', 'obo', 'oci', 'ojb-script_latin', 'ojb-script_syllabics', 'oku', 'old', 'omw', 'onb', 'ood', 'orm', 'ory', 'oss', 'ote', 'otq', 'ozm', 'pab', 'pad', 'pag', 'pam', 'pan', 'pao', 'pap', 'pau', 'pbb', 'pbc', 'pbi', 'pce', 'pcm', 'peg', 'pez', 'pib', 'pil', 'pir', 'pis', 'pjt', 'pkb', 'pls', 'plw', 'pmf', 'pny', 'poh-dialect_eastern', 'poh-dialect_western', 'poi', 'pol', 'por', 'poy', 'ppk', 'pps', 'prf', 'prk', 'prt', 'pse', 'pss', 'ptu', 'pui', 'pus', 'pwg', 'pww', 'pxm', 'qub', 'quc-dialect_central', 'quc-dialect_east', 'quc-dialect_north', 'quf', 'quh', 'qul', 'quw', 'quy', 'quz', 'qvc', 'qve', 'qvh', 'qvm', 'qvn', 'qvo', 'qvs', 'qvw', 'qvz', 'qwh', 'qxh', 'qxl', 'qxn', 'qxo', 'qxr', 'rah', 'rai', 'rap', 'rav', 'raw', 'rej', 'rel', 'rgu', 'rhg', 'rif-script_arabic', 'rif-script_latin', 'ril', 'rim', 'rjs', 'rkt', 'rmc-script_cyrillic', 
'rmc-script_latin', 'rmo', 'rmy-script_cyrillic', 'rmy-script_latin', 'rng', 'rnl', 'roh-dialect_sursilv', 'roh-dialect_vallader', 'rol', 'ron', 'rop', 'rro', 'rub', 'ruf', 'rug', 'run', 'rus', 'sab', 'sag', 'sah', 'saj', 'saq', 'sas', 'sat', 'sba', 'sbd', 'sbl', 'sbp', 'sch', 'sck', 'sda', 'sea', 'seh', 'ses', 'sey', 'sgb', 'sgj', 'sgw', 'shi', 'shk', 'shn', 'sho', 'shp', 'sid', 'sig', 'sil', 'sja', 'sjm', 'sld', 'slk', 'slu', 'slv', 'sml', 'smo', 'sna', 'snd', 'sne', 'snn', 'snp', 'snw', 'som', 'soy', 'spa', 'spp', 'spy', 'sqi', 'sri', 'srm', 'srn', 'srp-script_cyrillic', 'srp-script_latin', 'srx', 'stn', 'stp', 'suc', 'suk', 'sun', 'sur', 'sus', 'suv', 'suz', 'swe', 'swh', 'sxb', 'sxn', 'sya', 'syl', 'sza', 'tac', 'taj', 'tam', 'tao', 'tap', 'taq', 'tat', 'tav', 'tbc', 'tbg', 'tbk', 'tbl', 'tby', 'tbz', 'tca', 'tcc', 'tcs', 'tcz', 'tdj', 'ted', 'tee', 'tel', 'tem', 'teo', 'ter', 'tes', 'tew', 'tex', 'tfr', 'tgj', 'tgk', 'tgl', 'tgo', 'tgp', 'tha', 'thk', 'thl', 'tih', 'tik', 'tir', 'tkr', 'tlb', 'tlj', 'tly', 'tmc', 'tmf', 'tna', 'tng', 'tnk', 'tnn', 'tnp', 'tnr', 'tnt', 'tob', 'toc', 'toh', 'tom', 'tos', 'tpi', 'tpm', 'tpp', 'tpt', 'trc', 'tri', 'trn', 'trs', 'tso', 'tsz', 'ttc', 'tte', 'ttq-script_tifinagh', 'tue', 'tuf', 'tuk-script_arabic', 'tuk-script_latin', 'tuo', 'tur', 'tvw', 'twb', 'twe', 'twu', 'txa', 'txq', 'txu', 'tye', 'tzh-dialect_bachajon', 'tzh-dialect_tenejapa', 'tzj-dialect_eastern', 'tzj-dialect_western', 'tzo-dialect_chamula', 'tzo-dialect_chenalho', 'ubl', 'ubu', 'udm', 'udu', 'uig-script_arabic', 'uig-script_cyrillic', 'ukr', 'umb', 'unr', 'upv', 'ura', 'urb', 'urd-script_arabic', 'urd-script_devanagari', 'urd-script_latin', 'urk', 'urt', 'ury', 'usp', 'uzb-script_cyrillic', 'uzb-script_latin', 'vag', 'vid', 'vie', 'vif', 'vmw', 'vmy', 'vot', 'vun', 'vut', 'wal-script_ethiopic', 'wal-script_latin', 'wap', 'war', 'waw', 'way', 'wba', 'wlo', 'wlx', 'wmw', 'wob', 'wol', 'wsg', 'wwa', 'xal', 'xdy', 'xed', 'xer', 'xho', 'xmm', 'xnj', 'xnr', 'xog', 'xon', 'xrb', 'xsb', 'xsm', 'xsr', 'xsu', 'xta', 'xtd', 'xte', 'xtm', 'xtn', 'xua', 'xuo', 'yaa', 'yad', 'yal', 'yam', 'yao', 'yas', 'yat', 'yaz', 'yba', 'ybb', 'ycl', 'ycn', 'yea', 'yka', 'yli', 'yor', 'yre', 'yua', 'yue-script_traditional', 'yuz', 'yva', 'zaa', 'zab', 'zac', 'zad', 'zae', 'zai', 'zam', 'zao', 'zaq', 'zar', 'zas', 'zav', 'zaw', 'zca', 'zga', 'zim', 'ziw', 'zlm', 'zmz', 'zne', 'zos', 'zpc', 'zpg', 'zpi', 'zpl', 'zpm', 'zpo', 'zpt', 'zpu', 'zpz', 'ztq', 'zty', 'zul', 'zyb', 'zyp', 'zza'])
Switch out the language adapter by calling the load_adapter()
function for the model and set_target_lang()
for the tokenizer. Pass
the target language as an argument: the language_id
that was
detected in the previous step.
asr_processor.tokenizer.set_target_lang(language_id)
asr_model.load_adapter(language_id)
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Use the original model for inference#
inputs = asr_processor(example["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
outputs = asr_model(**inputs).logits
ids = torch.argmax(outputs, dim=-1)[0]
transcription = asr_processor.decode(ids)
print(transcription)
grisé par ce parfum il fit des vers en l'honneur de l'humble fleur des bois et il les récita tout haut à ses pieds une violette l'entendit elle crut qu'il ne parlait que pour elle
Convert to OpenVINO IR model and run inference#
Convert the model to OpenVINO IR format with the ov.convert_model
function and serialize the result of conversion with the ov.save_model
function. For convenience of further use, we will wrap these steps in a function.
asr_model_xml_path_template = "models/ov_asr_{}_model.xml"
def get_asr_model(model_path_template, language_id, compiled=True):
input_values = torch.zeros([1, MAX_SEQ_LENGTH], dtype=torch.float)
model_path = Path(model_path_template.format(language_id))
asr_processor.tokenizer.set_target_lang(language_id)
if not model_path.exists() and model_path_template == asr_model_xml_path_template:
asr_model.load_adapter(language_id)
model_path.parent.mkdir(parents=True, exist_ok=True)
converted_model = ov.convert_model(asr_model, example_input={"input_values": input_values})
ov.save_model(converted_model, model_path)
if not compiled:
return converted_model
if compiled:
return core.compile_model(model_path, device_name=device.value)
return core.read_model(model_path)
compiled_asr_model = get_asr_model(asr_model_xml_path_template, language_id)
Run inference.
def recognize_audio(compiled_model, src_audio):
inputs = asr_processor(src_audio, sampling_rate=16_000, return_tensors="pt")
outputs = compiled_model(inputs["input_values"])[0]
ids = torch.argmax(torch.from_numpy(outputs), dim=-1)[0]
transcription = asr_processor.decode(ids)
return transcription
transcription = recognize_audio(compiled_asr_model, example["audio"]["array"])
print("Original text:", example["transcript"])
print("Transcription:", transcription)
Original text: grisé par ce parfum il fit des vers en l'honneur de l'humble fleur des bois et il les récita tout haut à ses pieds une violette l'entendit elle crut qu'il ne parlait que pour elle
Transcription: grisé par ce parfum il fit des vers en l'honneur de l'humble fleur des bois et il les récita tout haut à ses pieds une violette l'entendit elle crut qu'il ne parlait que pour elle
Quantization#
NNCF enables
post-training quantization by adding quantization layers into the model
graph and then using a subset of the training dataset to initialize the
parameters of these additional quantization layers. Quantized operations
are executed in INT8
instead of FP32
/FP16
, making model
inference faster.
The optimization process contains the following steps:
1. Create a calibration dataset for quantization.
2. Run nncf.quantize() to obtain quantized models.
3. Serialize the quantized INT8 models using the openvino.save_model() function.
Note: Quantization is a time- and memory-consuming operation. Running the quantization code below may take some time.
from notebook_utils import quantization_widget
compiled_quantized_lid_model = None
quantized_asr_model_xml_path_template = None
to_quantize = quantization_widget()
to_quantize
Checkbox(value=True, description='Quantization')
Let’s load the skip magic extension to skip quantization if to_quantize is not selected.
# Fetch `skip_kernel_extension` module
import requests
r = requests.get(
url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/skip_kernel_extension.py",
)
open("skip_kernel_extension.py", "w").write(r.text)
%load_ext skip_kernel_extension
Preparing calibration dataset#
Select the language to quantize the model for:
%%skip not $to_quantize.value
from IPython.display import display
display(SAMPLE_LANG)
Load the dev split of the same MLS dataset for the selected language.
%%skip not $to_quantize.value
mls_dataset = iter(load_dataset("facebook/multilingual_librispeech", SAMPLE_LANG.value, split="dev", streaming=True, trust_remote_code=True))
example = next(mls_dataset)
Create a calibration dataset for quantization.
%%skip not $to_quantize.value
CALIBRATION_DATASET_SIZE = 5
calibration_data = []
for i in range(CALIBRATION_DATASET_SIZE):
data = asr_processor(next(mls_dataset)['audio']['array'], sampling_rate=16_000, return_tensors="np")
calibration_data.append(data["input_values"])
Language identification model quantization#
Run LID model quantization.
%%skip not $to_quantize.value
import nncf
quantized_lid_model_xml_path = Path(str(lid_model_xml_path).replace(".xml", "_quantized.xml"))
if not quantized_lid_model_xml_path.exists():
quantized_lid_model = nncf.quantize(
get_lid_model(lid_model_xml_path, compiled=False),
calibration_dataset=nncf.Dataset(calibration_data),
subset_size=len(calibration_data),
model_type=nncf.ModelType.TRANSFORMER
)
ov.save_model(quantized_lid_model, quantized_lid_model_xml_path)
compiled_quantized_lid_model = core.compile_model(quantized_lid_model, device_name=device.value)
else:
compiled_quantized_lid_model = get_lid_model(quantized_lid_model_xml_path)
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, openvino
Statistics collection: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:06<00:00, 1.24s/it]
Applying Smooth Quant: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 291/291 [00:18<00:00, 15.34it/s]
INFO:nncf:144 ignored nodes was found by name in the NNCFGraph
Statistics collection: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:18<00:00, 3.65s/it]
Applying Fast Bias correction: 100%|██████████████████████████████████████████████████████████████████████████████████████| 298/298 [05:09<00:00, 1.04s/it]
Detect language with the quantized model.
%%skip not $to_quantize.value
language_id = detect_language(compiled_quantized_lid_model, example['audio']['array'])
print("Detected language:", language_id)
Detected language: fra
Speech recognition model quantization#
Run ASR model quantization.
%%skip not $to_quantize.value
quantized_asr_model_xml_path_template = asr_model_xml_path_template.replace(".xml", "_quantized.xml")
quantized_asr_model_xml_path = Path(quantized_asr_model_xml_path_template.format(language_id))
if not quantized_asr_model_xml_path.exists():
quantized_asr_model = nncf.quantize(
get_asr_model(asr_model_xml_path_template, language_id, compiled=False),
calibration_dataset=nncf.Dataset(calibration_data),
subset_size=len(calibration_data),
model_type=nncf.ModelType.TRANSFORMER
)
ov.save_model(quantized_asr_model, quantized_asr_model_xml_path)
compiled_quantized_asr_model = core.compile_model(quantized_asr_model, device_name=device.value)
else:
compiled_quantized_asr_model = get_asr_model(quantized_asr_model_xml_path_template, language_id)
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Statistics collection: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00, 1.17s/it]
Applying Smooth Quant: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 290/290 [00:17<00:00, 16.39it/s]
INFO:nncf:144 ignored nodes was found by name in the NNCFGraph
Statistics collection: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:19<00:00, 3.93s/it]
Applying Fast Bias correction: 100%|██████████████████████████████████████████████████████████████████████████████████████| 393/393 [05:22<00:00, 1.22it/s]
Run transcription with the quantized model and compare the result to the one produced by the original model.
%%skip not $to_quantize.value
compiled_asr_model = get_asr_model(asr_model_xml_path_template, language_id)
transcription_original = recognize_audio(compiled_asr_model, example['audio']['array'])
transcription_quantized = recognize_audio(compiled_quantized_asr_model, example['audio']['array'])
print("Transcription by original model: ", transcription_original)
print("Transcription by quantized model:", transcription_quantized)
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Transcription by original model: le salon était de la plus haute magnificence dorée comme la galerie de diane aux tuileries avec des tableaux à l'huile au lombri il y avait des tâches claires dans ces tableaux julien apprit plus tard que les sujets avaient semblé peu décent à la maîtresse du logis qui avait fait corriger les tableaux
Transcription by quantized model: le salon était de la plus haute magnificence doré comme la galerie de diane aux tuileries avec des tableaux à l'huile au lombri il y avait des tâches claires dans ces tableaux julien apprit plus tard que les sujets avaient semblé peu decent à la maîtresse du logis qui avait fait corriger les tableaux
Compare model size, performance and accuracy#
First, we compare the model sizes.
%%skip not $to_quantize.value
def calculate_compression_rate(model_path_ov, model_path_ov_int8, model_type):
model_size_fp32 = model_path_ov.with_suffix(".bin").stat().st_size / 10 ** 6
model_size_int8 = model_path_ov_int8.with_suffix(".bin").stat().st_size / 10 ** 6
print(f"{model_type} model footprint comparison:")
print(f" * FP32 IR model size: {model_size_fp32:.2f} MB")
print(f" * INT8 IR model size: {model_size_int8:.2f} MB")
return model_size_fp32, model_size_int8
lid_model_size_fp32, lid_model_size_int8 = \
calculate_compression_rate(lid_model_xml_path, quantized_lid_model_xml_path, 'LID')
asr_model_size_fp32, asr_model_size_int8 = \
calculate_compression_rate(Path(asr_model_xml_path_template.format(language_id)), quantized_asr_model_xml_path, 'ASR')
LID model footprint comparison:
* FP32 IR model size: 1931.81 MB
* INT8 IR model size: 968.96 MB
ASR model footprint comparison:
* FP32 IR model size: 1930.10 MB
* INT8 IR model size: 968.29 MB
Secondly, we compare the accuracy of the original and quantized models
on a test split of the MLS dataset. We rely on the Word Error Rate (WER)
metric and compute accuracy as (1 - WER).
We also measure inference time for both the language identification and speech recognition models.
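As a toy illustration of this metric (illustrative strings only), one substituted word out of three yields WER ≈ 0.33, i.e. roughly 67% word accuracy:
from jiwer import wer

reference = "mein sechster sohn"
hypothesis = "mein sechster son"  # one of three words is wrong
print(f"Word accuracy: {(1 - wer(reference, hypothesis)) * 100:.1f}%")  # ~66.7%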
%%skip not $to_quantize.value
import time
from tqdm.notebook import tqdm
import numpy as np
from jiwer import wer
TEST_DATASET_SIZE = 20
test_dataset = load_dataset("facebook/multilingual_librispeech", SAMPLE_LANG.value, split="test", streaming=True, trust_remote_code=True)
test_dataset = test_dataset.take(TEST_DATASET_SIZE)
def calculate_transcription_time_and_accuracy(lid_model, asr_model):
ground_truths = []
predictions = []
identification_time = []
transcription_time = []
for data_item in tqdm(test_dataset, desc="Measuring performance and accuracy", total=TEST_DATASET_SIZE):
audio = data_item["audio"]["array"]
start_time = time.perf_counter()
detect_language(lid_model, audio)
end_time = time.perf_counter()
identification_time.append(end_time - start_time)
start_time = time.perf_counter()
transcription = recognize_audio(asr_model, audio)
end_time = time.perf_counter()
transcription_time.append(end_time - start_time)
ground_truths.append(data_item["transcript"])
predictions.append(transcription)
word_accuracy = (1 - wer(ground_truths, predictions)) * 100
mean_identification_time = np.mean(identification_time)
mean_transcription_time = np.mean(transcription_time)
return mean_identification_time, mean_transcription_time, word_accuracy
identification_time_fp32, transcription_time_fp32, accuracy_fp32 = \
calculate_transcription_time_and_accuracy(compiled_lid_model, compiled_asr_model)
identification_time_int8, transcription_time_int8, accuracy_int8 = \
calculate_transcription_time_and_accuracy(compiled_quantized_lid_model, compiled_quantized_asr_model)
print(f"LID model footprint reduction: {lid_model_size_fp32 / lid_model_size_int8:.3f}")
print(f"ASR model footprint reduction: {asr_model_size_fp32 / asr_model_size_int8:.3f}")
print(f"Language identification performance speedup: {identification_time_fp32 / identification_time_int8:.3f}")
print(f"Language transcription performance speedup: {transcription_time_fp32 / transcription_time_int8:.3f}")
print(f"Transcription word accuracy. FP32: {accuracy_fp32:.2f}%. INT8: {accuracy_int8:.2f}%. Accuracy drop :{accuracy_fp32 - accuracy_int8:.2f}%.")
Measuring performance and accuracy: 0%| | 0/20 [00:00<?, ?it/s]
Measuring performance and accuracy: 0%| | 0/20 [00:00<?, ?it/s]
LID model footprint reduction: 1.994
ASR model footprint reduction: 1.993
Language identification performance speedup: 1.425
Language transcription performance speedup: 1.489
Transcription word accuracy. FP32: 85.01%. INT8: 84.76%. Accuracy drop :0.25%.
Interactive demo with Gradio#
In this demo you can try your own examples. Make sure that the audio data is sampled at 16 kHz (16,000 Hz).
import gradio as gr
import librosa
import time
current_state = {
"fp32": {"model": None, "language": None},
"int8": {"model": None, "language": None},
}
def infer(src_audio_path, quantized):
    src_audio, _ = librosa.load(src_audio_path, sr=16_000)  # resample to the 16 kHz rate expected by the MMS models
lid_model = compiled_quantized_lid_model if quantized else compiled_lid_model
start_time = time.perf_counter()
detected_language_id = detect_language(lid_model, src_audio)
end_time = time.perf_counter()
identification_delta_time = f"{end_time - start_time:.2f}"
state = current_state["int8" if quantized else "fp32"]
if detected_language_id != state["language"]:
template_path = quantized_asr_model_xml_path_template if quantized else asr_model_xml_path_template
try:
gr.Info(f"Loading {'quantized' if quantized else ''} ASR model for '{detected_language_id}' language. " "This will take some time.")
state["model"] = get_asr_model(template_path, detected_language_id)
state["language"] = detected_language_id
except RuntimeError as e:
if "Unable to read the model:" in str(e) and quantized:
raise gr.Error(f"There is no quantized ASR model for '{detected_language_id}' language. " "Please run quantization for this language first.")
start_time = time.perf_counter()
transcription = recognize_audio(state["model"], src_audio)
end_time = time.perf_counter()
transcription_delta_time = f"{end_time - start_time:.2f}"
return (
detected_language_id,
transcription,
identification_delta_time,
transcription_delta_time,
)
if not Path("gradio_helper.py").exists():
r = requests.get(
url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/mms-massively-multilingual-speech/gradio_helper.py"
)
open("gradio_helper.py", "w").write(r.text)
from gradio_helper import make_demo
demo = make_demo(fn=infer, quantized=to_quantize.value)
try:
demo.queue().launch(debug=False)
except Exception:
demo.queue().launch(share=True, debug=False)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/
# please uncomment and run this cell for stopping gradio interface
# demo.close()