Text-to-Speech synthesis using OuteTTS and OpenVINO#

This Jupyter notebook can be launched after a local installation only.

Warning

Important note: This notebook requires Python >= 3.10. Please make sure that your environment fulfills this requirement before running it.

OuteTTS-0.1-350M is a novel text-to-speech synthesis model that leverages pure language modeling without external adapters or complex architectures, built upon the LLaMa architecture. It demonstrates that high-quality speech synthesis is achievable through a straightforward approach using crafted prompts and audio tokens.

More details about the model can be found in the original repo.

In this tutorial, we consider how to run the OuteTTS pipeline using OpenVINO.

Table of contents:

Prerequisites
Convert model
Run model inference
- Text-to-Speech generation
- Text-to-Speech generation with Voice Cloning
Interactive demo

Installation Instructions#

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
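Because the notebook requires Python >= 3.10, it can help to check the interpreter version before installing anything. The snippet below is a minimal sketch of such a check; the helper name `python_is_supported` is hypothetical and is not part of the notebook utilities.

```python
import sys

# Minimum interpreter version stated in the note at the top of this notebook.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if not python_is_supported():
    found = ".".join(str(part) for part in sys.version_info[:3])
    print(f"Python {found} found, but this notebook expects >= 3.10")
```

Only the major and minor components are compared, so patch releases and pre-release tags do not affect the result.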

Prerequisites#

import platform

%pip install -q "torch>=2.1" "torchaudio" "einops" "transformers>=4.46.1" "loguru" "inflect" "pesq" "torchcrepe" "natsort" "polars" uroman mecab-python3 unidic-lite --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "gradio>=4.19" "openvino>=2024.4.0" "tqdm" "pyyaml" "librosa" "soundfile"
%pip install -q "git+https://github.com/huggingface/optimum-intel.git"  --extra-index-url https://download.pytorch.org/whl/cpu

if platform.system() == "Darwin":
    %pip install -q "numpy<2.0.0"

import requests
from pathlib import Path

utility_files = ["cmd_helper.py", "notebook_utils.py"]
base_utility_url = "https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/"

for utility_file in utility_files:
    if not Path(utility_file).exists():
        r = requests.get(base_utility_url + utility_file)
        with Path(utility_file).open("w") as f:
            f.write(r.text)


helper_files = ["gradio_helper.py", "ov_outetts_helper.py"]
base_helper_url = "https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/outetts-text-to-speech/"

for helper_file in helper_files:
    if not Path(helper_file).exists():
        r = requests.get(base_helper_url + helper_file)
        with Path(helper_file).open("w") as f:
            f.write(r.text)
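The download loops above write whatever the server returns, so a wrong URL would silently store an error page as a `.py` file. A slightly more defensive variant might look like the sketch below. The helper names `helper_url` and `fetch_if_missing` are hypothetical; only `requests.get` and `raise_for_status` are real `requests` calls.

```python
from pathlib import Path


def helper_url(base_url: str, file_name: str) -> str:
    """Join a base URL and a file name with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + file_name.lstrip("/")


def fetch_if_missing(base_url: str, file_name: str, target_dir: Path = Path(".")) -> Path:
    """Download file_name into target_dir unless it is already there."""
    import requests  # imported here so the URL helper works without requests

    target = target_dir / file_name
    if target.exists():
        return target
    response = requests.get(helper_url(base_url, file_name), timeout=30)
    response.raise_for_status()  # fail loudly on 404 and other HTTP errors
    target.write_text(response.text, encoding="utf-8")
    return target
```

Normalizing the slash in one place means the base URL works with or without a trailing `/`.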

from cmd_helper import clone_repo

repo_path = clone_repo("https://github.com/edwko/OuteTTS.git")

interface_path = repo_path / "outetts/version/v1/interface.py"

updated_version = interface_path.exists()

if not updated_version:
    interface_path = repo_path / "outetts/v0_1/interface.py"
orig_interface_path = interface_path.parent / "_orig_interface.py"

if not updated_version and not orig_interface_path.exists():
    interface_path.rename(orig_interface_path)
    # sounddevice needs additional libraries that must be installed manually. As we do not plan to use it for audio playback,
    # move its import next to its usage to avoid import errors
    with orig_interface_path.open("r") as in_file:
        content = in_file.read()
        upd_content = content.replace("import sounddevice as sd", "")
        upd_content = upd_content.replace("sd.play", "import sounddevice as sd\n        sd.play")
    with interface_path.open("w") as out_file:
        out_file.write(upd_content)
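To make the effect of the two `replace` calls above concrete, here is a small illustration on a made-up source string. `SAMPLE` is a hypothetical stand-in and not the real contents of `interface.py`. It only assumes that `sd.play` is called inside a method body indented by eight spaces, which is what the replacement string implies.

```python
# Hypothetical stand-in for a module that imports sounddevice at the top.
SAMPLE = (
    "import sounddevice as sd\n"
    "\n"
    "class Output:\n"
    "    def play(self):\n"
    "        sd.play(self.audio, self.sr)\n"
)


def defer_sounddevice_import(source: str) -> str:
    """Drop the top-level import and re-insert it right before sd.play."""
    patched = source.replace("import sounddevice as sd", "")
    return patched.replace("sd.play", "import sounddevice as sd\n        sd.play")


patched = defer_sounddevice_import(SAMPLE)
print(patched)
```

After patching, importing the module no longer requires `sounddevice`. The import only runs if the playback method is actually called.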

%pip install -q {repo_path} --extra-index-url https://download.pytorch.org/whl/cpu

Convert model#

OpenVINO supports PyTorch models via conversion to OpenVINO Intermediate Representation format. For convenience, we will use OpenVINO integration with HuggingFace Optimum. Optimum Intel is the interface between the Transformers and Diffusers libraries and the different tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures.

Among other use cases, Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime. optimum-cli provides a command-line interface for model conversion and optimization.

General command format:

optimum-cli export openvino --model <model_id_or_path> --task <task> <output_dir>

where task is the task to export the model for. If not specified, the task will be auto-inferred based on the model. You can find a mapping between tasks and model classes in the Optimum TaskManager documentation. Additionally, you can specify weight compression using the --weight-format argument with one of the following options: fp32, fp16, int8 and int4. For int8 and int4, NNCF will be used for weight compression. More details about model export are provided in the Optimum Intel documentation.

As OuteTTS uses a pure language modeling approach, the model conversion process is the same as for LLaMa-family models used for text generation.
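As an illustration of the general command format, the sketch below assembles the argument list for an export with optional weight compression. The function `build_export_command` is a hypothetical helper written for this example and is not the `optimum_cli` helper from `cmd_helper.py`. It uses only the flags named above.

```python
from pathlib import Path

# Weight formats listed in the text above.
VALID_WEIGHT_FORMATS = ("fp32", "fp16", "int8", "int4")


def build_export_command(model_id, output_dir, task=None, weight_format=None):
    """Build the optimum-cli argument list for an OpenVINO export."""
    command = ["optimum-cli", "export", "openvino", "--model", model_id]
    if task is not None:
        command += ["--task", task]
    if weight_format is not None:
        if weight_format not in VALID_WEIGHT_FORMATS:
            raise ValueError(f"unsupported weight format: {weight_format}")
        command += ["--weight-format", weight_format]
    command.append(str(output_dir))  # output directory is the last positional argument
    return command


command = build_export_command(
    "OuteAI/OuteTTS-0.1-350M",
    Path("OuteTTS-0.1-350M-int8-ov"),
    task="text-generation-with-past",
    weight_format="int8",
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` in an environment where `optimum-cli` is installed.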

from cmd_helper import optimum_cli

model_id = "OuteAI/OuteTTS-0.1-350M"
model_dir = Path(model_id.split("/")[-1] + "-ov")

if not model_dir.exists():
    optimum_cli(model_id, model_dir, additional_args={"task": "text-generation-with-past"})

Run model inference#

The OpenVINO integration with Optimum Intel provides a ready-to-use API for model inference that integrates smoothly with transformers-based solutions. To load the model, we will use the OVModelForCausalLM class, which has an interface compatible with the Transformers LLaMa implementation. Its from_pretrained method accepts a path to the model directory or a model_id from the HuggingFace hub (if the model is not yet converted to OpenVINO format, conversion is triggered automatically). Additionally, we can provide an inference device, a quantization config (if the model has not been quantized yet) and a device-specific OpenVINO Runtime configuration. More details about model inference with Optimum Intel can be found in the documentation. We will use OVModelForCausalLM as a replacement for the original AutoModelForCausalLM in InterfaceHF.
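For reference, loading the converted model directly with Optimum Intel might look like the sketch below. The wrapper `load_ov_causal_lm` is a hypothetical name, and the `PERFORMANCE_HINT` value in the usage note is only an example of a device-specific setting. The helper classes in this notebook may configure the model differently.

```python
def load_ov_causal_lm(model_dir, device="CPU", ov_config=None):
    """Load an exported OpenVINO causal LM from a directory or hub model id."""
    # Imported lazily so that defining this helper does not require optimum-intel.
    from optimum.intel import OVModelForCausalLM

    kwargs = {"device": device}
    if ov_config is not None:
        kwargs["ov_config"] = ov_config  # OpenVINO Runtime properties
    return OVModelForCausalLM.from_pretrained(model_dir, **kwargs)
```

A call such as `load_ov_causal_lm("OuteTTS-0.1-350M-ov", device="CPU", ov_config={"PERFORMANCE_HINT": "LATENCY"})` would return a model object exposing the usual `generate` method.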

from notebook_utils import device_widget

device = device_widget(exclude=["NPU"])

device

Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

from ov_outetts_helper import InterfaceOV, OVHFModel  # noqa: F401

# Uncomment these lines to see pipeline details
# ??InterfaceOV
# ??OVHFModel

interface = InterfaceOV(model_dir, device.value)

making attention of type 'vanilla' with 768 in_channels

Text-to-Speech generation#

Now let’s see the model in action. When input text is passed to the generate method of the interface, the model returns a tensor that represents the output audio with random speaker characteristics.

output = interface.generate(text="Hello, I'm working!", temperature=0.1, repetition_penalty=1.1, max_length=4096)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.

import IPython.display as ipd

ipd.Audio(output.audio[0].numpy(), rate=output.sr)
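Besides playing the result inline, the samples can be written to disk. The sketch below uses only the Python standard library and assumes mono floating-point samples in the range [-1, 1]. The helper `save_wav` is hypothetical, and the sine tone and 16000 Hz rate are placeholder data rather than model output.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path


def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16 bit
        wav_file.setframerate(int(sample_rate))
        wav_file.writeframes(bytes(frames))
    return Path(path)


# Placeholder signal: 0.1 s of a 440 Hz tone at an arbitrary 16 kHz rate.
demo_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / demo_rate) for n in range(demo_rate // 10)]

with tempfile.TemporaryDirectory() as tmp_dir:
    wav_path = save_wav(Path(tmp_dir) / "tone.wav", tone, demo_rate)
    with wave.open(str(wav_path), "rb") as check:
        frame_count = check.getnframes()
        frame_rate = check.getframerate()
```

With the generated audio, the call would be along the lines of `save_wav("output.wav", output.audio[0].numpy().tolist(), output.sr)`, assuming `output.audio[0]` is a one-dimensional tensor of samples.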

Text-to-Speech generation with Voice Cloning#

Additionally, we can specify a reference voice for generation by providing reference audio and its transcript. interface.create_speaker processes the reference audio and text into a set of features used for audio description.

from notebook_utils import download_file

ref_audio_url = "https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/2.wav"

file_path = download_file(ref_audio_url)

'2.wav' already exists.

ipd.Audio(file_path)

speaker = interface.create_speaker(file_path, "Hello, I can speak pretty well, but sometimes I make some mistakes.")

# Save the speaker to a file
interface.save_speaker(speaker, "speaker.pkl")

# Load the speaker from a file
speaker = interface.load_speaker("speaker.pkl")

# Generate TTS with the custom voice
output = interface.generate(text="This is a cloned voice speaking", speaker=speaker, temperature=0.1, repetition_penalty=1.1, max_length=4096)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.

ipd.Audio(output.audio[0].numpy(), rate=output.sr)

Interactive demo#

from gradio_helper import make_demo

demo = make_demo(interface)

try:
    demo.launch(debug=True)
except Exception:
    demo.launch(share=True, debug=True)