This is a composite speech synthesis model that reconstructs both a mel-spectrogram and a waveform from text. The model generates the waveform from symbol sequences separated by spaces. It is built on modified versions of the ForwardTacotron and MelGAN frameworks.
Metric | Value |
---|---|
Source framework | PyTorch* |
The text-to-speech-en-0001-duration-prediction model is a ForwardTacotron-based duration predictor for symbols.
Metric | Value |
---|---|
GFlops | 15.84 |
MParams | 13.569 |
Sequence, name: `input_seq`, shape: `1, 512`, format: `B, C`.
Mask for input sequence, name: `input_mask`, shape: `1, 1, 512`, format: `B, D, C`.
Mask for relative position representation in attention, name: `pos_mask`, shape: `1, 1, 512, 512`, format: `B, D, C, C`.
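Because the input shapes above are fixed at 512 positions, a shorter symbol sequence presumably has to be padded before inference. The following is a minimal sketch of that step, not the model's actual preprocessing: the symbol IDs, the padding ID `0`, and the mask convention (1 for real symbols, 0 for padding) are all assumptions, and `pad_sequence` is a hypothetical helper. The contents of `pos_mask` are not described here, so the sketch does not build it.

```python
MAX_LEN = 512  # fixed sequence length taken from the input shapes above


def pad_sequence(symbol_ids, max_len=MAX_LEN, pad_id=0):
    """Pad symbol IDs to max_len and build a matching mask.

    Returns nested lists shaped [1, max_len] and [1, 1, max_len].
    pad_id=0 and the 1/0 mask convention are assumptions.
    """
    if len(symbol_ids) > max_len:
        raise ValueError(f"sequence longer than {max_len} symbols")
    n_pad = max_len - len(symbol_ids)
    input_seq = [list(symbol_ids) + [pad_id] * n_pad]
    input_mask = [[[1] * len(symbol_ids) + [0] * n_pad]]
    return input_seq, input_mask


# Hypothetical symbol IDs for a three-symbol input.
input_seq, input_mask = pad_sequence([5, 12, 7])
print(len(input_seq[0]), sum(input_mask[0][0]))  # 512 3
```

Nested lists keep the sketch dependency-free; in practice they would be converted to arrays of whatever data type the inference runtime expects.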
`duration`, shape: `1, 512, 1`, format: `B, C, H`. Contains the predicted duration for each symbol in the sequence.
`embeddings`, shape: `1, 512, 256`, format: `B, C, H`. Contains the processed embeddings for each symbol in the sequence.
The text-to-speech-en-0001-regression model accepts processed embeddings aligned by duration and produces a mel-spectrogram. For example, if the duration is [2, 3] and the processed embeddings are [[1, 2], [3, 4]], the aligned embeddings are [[1, 2], [1, 2], [3, 4], [3, 4], [3, 4]].
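Aligning embeddings by duration means repeating each symbol's embedding vector as many times as its duration. The sketch below illustrates this with plain Python lists. `align_by_duration` is a hypothetical helper name, and it assumes the durations have already been converted to non-negative integers.

```python
def align_by_duration(embeddings, durations):
    """Repeat each embedding vector according to its integer duration."""
    if len(embeddings) != len(durations):
        raise ValueError("need exactly one duration per embedding")
    aligned = []
    for vector, count in zip(embeddings, durations):
        # Copy the vector so that repeated rows do not share one list object.
        aligned.extend(list(vector) for _ in range(count))
    return aligned


aligned = align_by_duration([[10, 11], [20, 21]], [1, 3])
print(aligned)  # [[10, 11], [20, 21], [20, 21], [20, 21]]
```

With NumPy arrays, `np.repeat(embeddings, durations, axis=0)` performs the same expansion. The result would still need to be padded or truncated to the 512 time steps the regression model expects.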
Metric | Value |
---|---|
GFlops | 7.65 |
MParams | 4.96 |
Processed embeddings aligned by durations, name: `data`, shape: `1x512x256`, format: `BxTxC`.
Mask for `data` by time dimension, name: `data_mask`, shape: `1x1x512`, format: `BxDxT`.
Mask for relative position representation in attention, name: `pos_mask`, shape: `1x1x512x512`, format: `BxDxCxC`.
Mel-spectrogram, name: `mel`, shape: `80x512`, format: `CxT`.
The text-to-speech-en-0001-generation model is a MelGAN-based audio generator.
Metric | Value |
---|---|
GFlops | 48.38 |
MParams | 12.77 |
Mel-spectrogram, name: `mel`, shape: `1x80x128`, format: `BxCxT`.
Audio, name: `audio`, shape: `32768`, format: `T`.
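The shapes above imply that the generator turns 128 mel frames into 32768 audio samples, that is, 256 samples per frame. A longer mel-spectrogram, such as the 512-frame output of the regression model, would therefore have to be processed in pieces. The sketch below shows one possible way to do that. It uses non-overlapping windows and zero-pads the last window. Both choices are assumptions made for illustration, and `split_mel` is a hypothetical helper, not part of the model.

```python
MEL_CHANNELS = 80      # C in the generator input shape
CHUNK_FRAMES = 128     # T in the generator input shape
CHUNK_SAMPLES = 32768  # length of the generator output


def split_mel(mel, chunk_frames=CHUNK_FRAMES):
    """Split a C x T mel-spectrogram into chunks shaped [1, C, chunk_frames].

    The last chunk is zero-padded on the time axis (an assumption).
    """
    total_frames = len(mel[0])
    chunks = []
    for start in range(0, total_frames, chunk_frames):
        chunk = [row[start:start + chunk_frames] for row in mel]
        missing = chunk_frames - len(chunk[0])
        if missing:
            chunk = [row + [0.0] * missing for row in chunk]
        chunks.append([chunk])  # add the batch dimension
    return chunks


# Dummy 80 x 512 mel-spectrogram standing in for the regression output.
mel = [[0.0] * 512 for _ in range(MEL_CHANNELS)]
chunks = split_mel(mel)

samples_per_frame = CHUNK_SAMPLES // CHUNK_FRAMES
total_samples = len(chunks) * CHUNK_SAMPLES
print(len(chunks), samples_per_frame, total_samples)  # 4 256 131072
```

Each chunk would be passed to the generator separately and the resulting audio segments concatenated. A real pipeline may overlap windows or trim padded frames to avoid audible seams.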
[*] Other names and brands may be claimed as the property of others.