wavernn (composite)#

Use Case and High-Level Description#

WaveRNN is a text-to-speech model originally trained in PyTorch* and then converted to ONNX* format. The model was trained on the LJSpeech dataset. WaveRNN performs waveform regression from a mel-spectrogram. For details, see the paper and the repository.

ONNX Models#

We provide pre-trained models in ONNX format for user convenience.

Steps to Reproduce PyTorch to ONNX Conversion#

The model is provided in ONNX format, which was obtained by the following steps.

  1. Clone the original repository:

git clone https://github.com/as-ideas/ForwardTacotron
cd ForwardTacotron
  2. Check out the commit that the conversion was tested on:

git checkout 78789c1aa845057bb2f799e702b1be76bf7defd0
  3. Follow README.md and preprocess the LJSpeech dataset.

  4. Copy the provided script wavernn_to_onnx.py to the ForwardTacotron root directory, and apply the git patch 0001-Added-batch-norm-fusing-to-conv-layers.patch.

  5. Download the WaveRNN model from https://github.com/fatchord/WaveRNN/tree/master/pretrained/ and extract it into the pretrained directory:

mkdir pretrained
wget https://raw.githubusercontent.com/fatchord/WaveRNN/master/pretrained/ljspeech.wavernn.mol.800k.zip
unzip ljspeech.wavernn.mol.800k.zip -d pretrained && mv pretrained/latest_weights.pyt pretrained/wave_800K.pyt
  6. Run the provided script to convert WaveRNN to ONNX format:

python3 wavernn_to_onnx.py --mel <path_to_preprocessed_dataset>/mel/LJ008-0254.npy --voc_weights pretrained/wave_800K.pyt --hp_file hparams.py --batched

Note: because the network is autoregressive, the model is split into two parts: wavernn_upsampler.onnx and wavernn_rnn.onnx. The first part expands the feature map along the time dimension, and the second part iteratively processes every column of the expanded feature map.
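The split described in the note can be illustrated with a minimal sketch of the generation loop. It is not the conversion script or the demo code: `upsampler`, `rnn_step`, and `sample` are hypothetical callables standing in for the two ONNX parts and for a sampling routine, and the sketch assumes both feature maps can simply be truncated to a common number of time steps (how the crossfade padding is aligned is not handled here).

```python
import numpy as np

def generate(mels, upsampler, rnn_step, sample, hidden_size=512):
    """Upsample once, then feed the RNN one time step at a time.

    upsampler(mels) -> (aux, upsample_mels)            # hypothetical wrapper
    rnn_step(m_t, a1_t, a2_t, a3_t, a4_t, h1, h2, x)   # hypothetical wrapper
        -> (h1, h2, logits)
    sample(logits) -> array of shape (B,)              # hypothetical sampler
    """
    aux, upsample_mels = upsampler(mels)
    batch = aux.shape[0]
    # Assumption: truncate to the shorter map instead of aligning the padding.
    steps = min(aux.shape[1], upsample_mels.shape[1])
    aux_parts = np.split(aux, 4, axis=2)  # four equal slices of the feature axis

    h1 = np.zeros((batch, hidden_size), dtype=np.float32)
    h2 = np.zeros((batch, hidden_size), dtype=np.float32)
    x = np.zeros((batch, 1), dtype=np.float32)  # previous prediction starts at zero

    audio = []
    for t in range(steps):
        m_t = upsample_mels[:, t, :]
        a_t = [part[:, t, :] for part in aux_parts]
        h1, h2, logits = rnn_step(m_t, *a_t, h1, h2, x)
        # The new sample becomes the "previous prediction" of the next step.
        x = np.asarray(sample(logits), dtype=np.float32).reshape(batch, 1)
        audio.append(x[:, 0])
    return np.stack(audio, axis=1)  # shape: (B, steps)
```

Because the hidden states and the previous prediction are carried from one call to the next, the second part cannot be run over all time steps in a single forward pass, which is why it is exported as a separate model.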

Composite model specification#

Metric            Value
Source framework  PyTorch*

Accuracy#

Subjective

wavernn-upsampler model specification#

The wavernn-upsampler model accepts a mel-spectrogram and produces two feature maps: the first expands the mel-spectrogram in one step using an Upsample layer and a sequence of convolutions, and the second expands the mel-spectrogram in three steps using a sequence of Upsample layers and convolutions.

Metric   Value
GOPs     0.37
MParams  0.4

Input#

Mel-spectrogram, name: mels, shape: 1, 200, 80, format: B, T, C, where:

  • B - batch size

  • T - time in mel-spectrogram

  • C - number of mels in mel-spectrogram
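Since the input shape is fixed at 1, 200, 80, a longer mel-spectrogram has to be fed in pieces. The sketch below shows one simple way to do that with NumPy. The helper name `chunk_mel`, the non-overlapping windows, and the padding value are assumptions made for illustration; a real pipeline may overlap windows and crossfade between them.

```python
import numpy as np

def chunk_mel(mel, frames=200, pad_value=0.0):
    """Split a (T, C) mel-spectrogram into (N, frames, C) windows.

    The last window is padded with pad_value (an assumed value; choose one
    that matches how the mel-spectrogram was normalized).
    """
    t, c = mel.shape
    n = max(1, -(-t // frames))  # ceiling division, at least one window
    padded = np.full((n * frames, c), pad_value, dtype=np.float32)
    padded[:t] = mel
    return padded.reshape(n, frames, c)

mel = np.ones((450, 80), dtype=np.float32)
chunks = chunk_mel(mel)
first_input = chunks[0:1]  # shape (1, 200, 80), matching the "mels" input
```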

Output#

  1. Processed mel-spectrogram, name: aux, shape: 1, 53888, 128, format: B, T, C, where:

    • B - batch size

    • T - time in audio (equal to time in mel spectrogram * hop_length)

    • C - number of features in processed mel-spectrogram.

  2. Upsampled and processed (by time) mel-spectrogram, name: upsample_mels, shape: 1, 55008, 80, format: B, T', C, where:

    • B - batch size

    • T' - time in audio padded with number of samples for crossfading between batches

    • C - number of mels in mel-spectrogram

wavernn-rnn model specification#

The wavernn-rnn model accepts the two feature maps from wavernn-upsampler and produces the parameters of a mixture of logistic distributions, which is used for audio regression. Each forward step yields B samples, where B is the batch size.

Metric   Value
GOPs     0.06
MParams  3.83

Input#

  1. Time slice in upsample_mels, name: m_t, shape: B, 80

  2. Time/space slice in aux, name: a1_t, shape: B, 32, where the second dimension is 32 = aux.shape[2] / 4

  3. Time/space slice in aux, name: a2_t, shape: B, 32, where the second dimension is 32 = aux.shape[2] / 4

  4. Time/space slice in aux, name: a3_t, shape: B, 32, where the second dimension is 32 = aux.shape[2] / 4

  5. Time/space slice in aux, name: a4_t, shape: B, 32, where the second dimension is 32 = aux.shape[2] / 4

  6. Hidden state for GRU layers in autoregression, name: h1.1, shape: B, 512

  7. Hidden state for GRU layers in autoregression, name: h2.1, shape: B, 512

  8. Previous prediction for autoregression (initially equal to zero), name: x, shape: B, 1

Note: B - batch size.
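The per-step inputs listed above can be cut out of the upsampler outputs with plain NumPy indexing. The following is a minimal sketch: the helper name `rnn_inputs_at` is hypothetical, and it assumes that a1_t to a4_t are four consecutive, equally sized slices of the last axis of aux, taken in order.

```python
import numpy as np

def rnn_inputs_at(upsample_mels, aux, t):
    """Return the feature inputs of wavernn-rnn for time step t."""
    d = aux.shape[2] // 4  # 128 / 4 = 32 features per slice
    return {
        "m_t": upsample_mels[:, t, :],
        "a1_t": aux[:, t, 0 * d:1 * d],  # assumed order of the slices
        "a2_t": aux[:, t, 1 * d:2 * d],
        "a3_t": aux[:, t, 2 * d:3 * d],
        "a4_t": aux[:, t, 3 * d:4 * d],
    }

# Small dummy feature maps, only to show the resulting shapes.
aux = np.zeros((1, 10, 128), dtype=np.float32)
upsample_mels = np.zeros((1, 10, 80), dtype=np.float32)
inputs = rnn_inputs_at(upsample_mels, aux, t=3)
```

The hidden states h1.1 and h2.1 and the previous prediction x are not derived from the feature maps; they start as zeros and are fed back from the outputs of the previous step.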

Output#

  1. Hidden state for GRU layers in autoregression, name: h1, shape: B, 512

  2. Hidden state for GRU layers in autoregression, name: h2, shape: B, 512

  3. Parameters of the mixture of logistic distributions, name: logits, shape: B, 30. The output can be split into the parameters of the mixture: probabilities = logits[:, :10], means = logits[:, 10:20], scales = logits[:, 20:30]

Note: B - batch size.
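To turn logits into one audio sample per batch row, a value has to be drawn from the mixture. The sketch below shows a commonly used sampling scheme and is not necessarily the exact routine used with this model. It assumes that the first ten values are unnormalized log-weights of the components and that the last ten values are log-scales; `sample_mol` and `log_scale_min` are names introduced here for illustration.

```python
import numpy as np

def sample_mol(logits, rng, log_scale_min=-32.0):
    """Draw one sample per row from a mixture of logistic distributions."""
    n = logits.shape[1] // 3
    mix_logits = logits[:, :n]       # assumed: unnormalized log-weights
    means = logits[:, n:2 * n]
    log_scales = np.maximum(logits[:, 2 * n:3 * n], log_scale_min)  # assumed: log-scales

    # Gumbel-max trick: choose one mixture component per row.
    u = rng.uniform(1e-5, 1.0 - 1e-5, size=mix_logits.shape)
    choice = np.argmax(mix_logits - np.log(-np.log(u)), axis=1)
    rows = np.arange(logits.shape[0])
    mean = means[rows, choice]
    log_scale = log_scales[rows, choice]

    # Inverse-CDF sampling from the chosen logistic distribution.
    u = rng.uniform(1e-5, 1.0 - 1e-5, size=mean.shape)
    x = mean + np.exp(log_scale) * (np.log(u) - np.log1p(-u))
    return np.clip(x, -1.0, 1.0)    # keep samples in the audio range

rng = np.random.default_rng(0)
samples = sample_mol(np.zeros((4, 30), dtype=np.float32), rng)  # shape (4,)
```

The returned values, reshaped to B, 1, would serve as the input x of the next forward step.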

Download a Model and Convert it into OpenVINO™ IR Format#

You can download models and, if necessary, convert them into OpenVINO™ IR format using the Model Downloader and other automation tools, as shown in the examples below.

An example of using the Model Downloader:

omz_downloader --name <model_name>

An example of using the Model Converter:

omz_converter --name <model_name>

Demo usage#

The model can be used in the following demos provided by the Open Model Zoo to show its capabilities: