wav2vec2-base#

Use Case and High-Level Description#

Wav2Vec2.0-base is a model pre-trained to learn speech representations from unlabeled data, as described in the paper wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, and fine-tuned for speech recognition with a Connectionist Temporal Classification (CTC) loss on the LibriSpeech dataset, which contains 960 hours of audio. The model consists of a multi-layer convolutional feature encoder that takes raw audio as input and outputs latent speech representations, which are then fed to a Transformer to build representations that capture information from the entire sequence. In the base model, the Transformer has 12 layers and a feature dimension of 768. For details, see the repository and model card.
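Because the convolutional feature encoder downsamples the raw audio, the Transformer sees far fewer frames than there are input samples. The sketch below estimates that frame count. It assumes the kernel and stride values of the base configuration published with the wav2vec 2.0 paper; they are not stated on this page, and `CONV_LAYERS` and `encoder_frames` are hypothetical names used only for illustration.

```python
# Assumed (kernel, stride) pairs of the feature encoder, taken from the
# wav2vec 2.0 base configuration; this page does not list them.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def encoder_frames(num_samples: int) -> int:
    """Estimate how many latent frames the encoder emits for raw audio."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        # Unpadded 1-D convolution: floor((L - kernel) / stride) + 1
        length = (length - kernel) // stride + 1
    # Inputs shorter than the receptive field produce no frames.
    return max(length, 0)


if __name__ == "__main__":
    # One second of 16 kHz audio
    print(encoder_frames(16000))  # -> 49
```

Under these assumptions, one second of 16 kHz audio maps to 49 frames, or roughly one frame per 20 ms.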

Specification#

| Metric           | Value              |
| ---------------- | ------------------ |
| Type             | Speech recognition |
| GFLOPs           | 26.843             |
| MParams          | 94.3965            |
| Source framework | PyTorch*           |

Accuracy#

| Metric                       | Value |
| ---------------------------- | ----- |
| WER @ Librispeech test-clean | 3.39% |

Input#

Original model#

Normalized audio signal; name: inputs, shape: B, N, format: B, N, where:

  • B - batch size

  • N - sequence length

The model is dynamic and can work with inputs of different shapes.

NOTE: Model expects 16-bit, 16 kHz, mono-channel WAVE audio as input data.
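A minimal sketch of preparing such a file for the inputs tensor is given below. It assumes that "normalized" means zero mean and unit variance per utterance, which this page does not spell out, and `load_audio` is a hypothetical helper name rather than part of any Open Model Zoo API.

```python
import wave

import numpy as np


def load_audio(path: str) -> np.ndarray:
    """Read a 16-bit, 16 kHz, mono WAVE file into a normalized (1, N) array."""
    with wave.open(path, "rb") as wav:
        if (wav.getsampwidth() != 2
                or wav.getframerate() != 16000
                or wav.getnchannels() != 1):
            raise ValueError("expected 16-bit, 16 kHz, mono-channel WAVE audio")
        frames = wav.readframes(wav.getnframes())

    # WAVE stores 16-bit samples as little-endian signed integers.
    samples = np.frombuffer(frames, dtype="<i2").astype(np.float32)
    # Assumption: normalization is zero mean, unit variance per utterance.
    samples = (samples - samples.mean()) / np.sqrt(samples.var() + 1e-7)
    # Add the batch dimension: shape B, N with B = 1.
    return samples[np.newaxis, :]
```

The returned array has shape B, N with B = 1, so it can be passed directly as the inputs tensor for a single utterance.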

Converted model#

The converted model has the same parameters as the original model.

Output#

Original model#

Per-token log-probabilities (after LogSoftmax) for every symbol in the alphabet; name: logits, shape: B, N, 32, output data format: B, N, C, where:

  • B - batch size

  • N - number of recognized tokens

  • C - alphabet size

The B and N dimensions can take different values because the model is dynamic. The alphabet size C is static and equals 32. Model alphabet: “[pad]”, “[s]”, “[/s]”, “[unk]”, “|”, “E”, “T”, “A”, “O”, “N”, “I”, “H”, “S”, “R”, “D”, “L”, “U”, “M”, “W”, “C”, “F”, “G”, “Y”, “P”, “B”, “V”, “K”, “’”, “X”, “J”, “Q”, “Z”, where:

  • [pad] - padding token used as CTC-blank label

  • [s] - start of string

  • [/s] - end of string

  • [unk] - unknown symbol

  • | - whitespace symbol used as separator between words.
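The alphabet and special tokens above are enough to turn the logits output into text with greedy CTC decoding: take the best symbol per frame, collapse consecutive repeats, drop the blank, and map "|" to a space. The sketch below shows one way to do this for a single utterance. `ctc_greedy_decode` is a hypothetical helper, the apostrophe is written in ASCII, and silently dropping [s], [/s], and [unk] is an assumption made for this example.

```python
ALPHABET = [
    "[pad]", "[s]", "[/s]", "[unk]", "|", "E", "T", "A", "O", "N", "I",
    "H", "S", "R", "D", "L", "U", "M", "W", "C", "F", "G", "Y", "P",
    "B", "V", "K", "'", "X", "J", "Q", "Z",
]
BLANK = ALPHABET.index("[pad]")  # [pad] doubles as the CTC blank label
SKIPPED = {"[s]", "[/s]", "[unk]"}  # assumption: omit these from the text


def ctc_greedy_decode(frame_scores) -> str:
    """Decode one utterance given N rows of C per-symbol scores."""
    chars = []
    previous = None
    for row in frame_scores:
        # Best-scoring symbol for this frame.
        index = max(range(len(row)), key=lambda i: row[i])
        # CTC rule: collapse consecutive repeats, then remove blanks.
        if index != previous and index != BLANK:
            symbol = ALPHABET[index]
            if symbol == "|":
                chars.append(" ")
            elif symbol not in SKIPPED:
                chars.append(symbol)
        previous = index
    return "".join(chars).strip()
```

For a batched output of shape B, N, C, the function would be called once per batch item. A beam-search decoder with a language model can give lower WER, but greedy decoding is the simplest way to check the output.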

Converted model#

The converted model has the same parameters as the original model.

Download a Model and Convert it into OpenVINO™ IR Format#

You can download models and, if necessary, convert them into OpenVINO™ IR format using the Model Downloader and other automation tools, as shown in the examples below.

An example of using the Model Downloader:

omz_downloader --name <model_name>

An example of using the Model Converter:

omz_converter --name <model_name>

Demo usage#

The model can be used in the following demos provided by the Open Model Zoo to show its capabilities: