Text-to-Speech synthesis using OuteTTS and OpenVINO#
This Jupyter notebook can be launched after a local installation only.
Warning
Important note: This notebook requires Python >= 3.10. Please make sure that your environment fulfills this requirement before running it.
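You can verify this with a quick check before proceeding:
import sys

# Fail early if the interpreter is too old for this notebook
assert sys.version_info >= (3, 10), "This notebook requires Python >= 3.10"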
OuteTTS-0.1-350M is a novel text-to-speech synthesis model built upon the LLaMa architecture that leverages pure language modeling without external adapters or complex architectures. It demonstrates that high-quality speech synthesis is achievable through a straightforward approach using crafted prompts and audio tokens.
More details about the model can be found in the original repo.
In this tutorial, we consider how to run the OuteTTS pipeline using OpenVINO.
Table of contents:
- Prerequisites
- Convert model
- Run model inference
- Text-to-Speech generation
- Text-to-Speech generation with Voice Cloning
- Interactive demo
Installation Instructions#
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
Prerequisites#
import platform
%pip install -q "torch>=2.1" "torchaudio" "einops" "transformers>=4.46.1" "loguru" "inflect" "pesq" "torchcrepe" "natsort" "polars" uroman mecab-python3 unidic-lite --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "gradio>=4.19" "openvino>=2024.4.0" "tqdm" "pyyaml" "librosa" "soundfile"
%pip install -q "git+https://github.com/huggingface/optimum-intel.git" --extra-index-url https://download.pytorch.org/whl/cpu
if platform.system() == "Darwin":
    %pip install -q "numpy<2.0.0"
import requests
from pathlib import Path

utility_files = ["cmd_helper.py", "notebook_utils.py"]
base_utility_url = "https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/"

for utility_file in utility_files:
    if not Path(utility_file).exists():
        r = requests.get(base_utility_url + utility_file)
        with Path(utility_file).open("w") as f:
            f.write(r.text)

helper_files = ["gradio_helper.py", "ov_outetts_helper.py"]
base_helper_url = "https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/outetts-text-to-speech/"

for helper_file in helper_files:
    if not Path(helper_file).exists():
        r = requests.get(base_helper_url + helper_file)
        with Path(helper_file).open("w") as f:
            f.write(r.text)
from cmd_helper import clone_repo

repo_path = clone_repo("https://github.com/edwko/OuteTTS.git")

interface_path = repo_path / "outetts/version/v1/interface.py"
updated_version = interface_path.exists()

if not updated_version:
    interface_path = repo_path / "outetts/v0_1/interface.py"

orig_interface_path = interface_path.parent / "_orig_interface.py"

if not updated_version and not orig_interface_path.exists():
    interface_path.rename(orig_interface_path)
    # sounddevice requires additional libraries to be installed manually; since we do not
    # plan to use it for audio playback, move its import closer to its usage to avoid errors
    with orig_interface_path.open("r") as in_file:
        content = in_file.read()
    upd_content = content.replace("import sounddevice as sd", "")
    upd_content = upd_content.replace("sd.play", "import sounddevice as sd\n        sd.play")
    with interface_path.open("w") as out_file:
        out_file.write(upd_content)

%pip install -q {repo_path} --extra-index-url https://download.pytorch.org/whl/cpu
Convert model#
OpenVINO supports PyTorch models via conversion to OpenVINO Intermediate Representation format. For convenience, we will use OpenVINO integration with HuggingFace Optimum. Optimum Intel is the interface between the Transformers and Diffusers libraries and the different tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures.
Among other use cases, Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format, and run inference using OpenVINO Runtime. optimum-cli provides a command-line interface for model conversion and optimization.
General command format:
optimum-cli export openvino --model <model_id_or_path> --task <task> <output_dir>
where task is the task to export the model for; if not specified, the task will be auto-inferred based on the model. You can find a mapping between tasks and model classes in the Optimum TaskManager documentation.
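For example, the export command for the model used later in this notebook could look like this (the notebook itself invokes the same export through a Python helper):
optimum-cli export openvino --model OuteAI/OuteTTS-0.1-350M --task text-generation-with-past OuteTTS-0.1-350M-ov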
Additionally, you can specify weight compression using the --weight-format argument with one of the following options: fp32, fp16, int8, and int4. For int8 and int4, nncf will be used for weight compression. More details about model export are provided in the Optimum Intel documentation.
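For instance, a hypothetical 8-bit export of the same model (not used in this notebook; the output directory name is illustrative) might look like this:
optimum-cli export openvino --model OuteAI/OuteTTS-0.1-350M --task text-generation-with-past --weight-format int8 OuteTTS-0.1-350M-int8-ov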
As OuteTTS utilizes a pure language modeling approach, the model conversion process remains the same as for the LLaMa model family converted for text generation purposes.
from cmd_helper import optimum_cli

model_id = "OuteAI/OuteTTS-0.1-350M"
model_dir = Path(model_id.split("/")[-1] + "-ov")

if not model_dir.exists():
    optimum_cli(model_id, model_dir, additional_args={"task": "text-generation-with-past"})
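Optionally, you can sanity-check that the export produced the OpenVINO IR files (a quick illustrative check, not part of the original pipeline):
# The exported directory should contain openvino_model.xml / openvino_model.bin
# along with the tokenizer and configuration files
print(sorted(p.name for p in model_dir.iterdir()))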
Run model inference#
OpenVINO integration with Optimum Intel provides a ready-to-use API for model inference that can be used for smooth integration with transformers-based solutions. For loading the model, we will use the OVModelForCausalLM class, which has an interface compatible with the Transformers LLaMa implementation. A model should be loaded with the from_pretrained method. It accepts a path to the model directory or a model_id from the HuggingFace hub (if the model has not been converted to OpenVINO format, conversion will be triggered automatically). Additionally, we can provide an inference device, a quantization config (if the model has not been quantized yet), and device-specific OpenVINO Runtime configuration. More details about model inference with Optimum Intel can be found in the documentation.
We will use OVModelForCausalLM as a replacement for the original AutoModelForCausalLM in InterfaceHF.
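For illustration, here is a minimal sketch of loading the converted model directly with Optimum Intel (the helper classes used below wrap an equivalent call; the ov_config value shown is an optional, assumed Runtime setting):
from optimum.intel import OVModelForCausalLM

# Load the converted model from the local directory. Passing a HuggingFace
# model_id instead would trigger the conversion automatically.
ov_model = OVModelForCausalLM.from_pretrained(
    model_dir,
    device="CPU",  # or "AUTO" to let OpenVINO choose the device
    ov_config={"PERFORMANCE_HINT": "LATENCY"},  # optional device-specific configuration
)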
from notebook_utils import device_widget
device = device_widget(exclude=["NPU"])
device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
from ov_outetts_helper import InterfaceOV, OVHFModel # noqa: F401
# Uncomment these lines to see pipeline details
# ??InterfaceOV
# ??OVHFModel
interface = InterfaceOV(model_dir, device.value)
Text-to-Speech generation#
Now let's see the model in action. Given input text passed to the generate method of the interface, the model returns a tensor that represents the output audio with random speaker characteristics.
output = interface.generate(text="Hello, I'm working!", temperature=0.1, repetition_penalty=1.1, max_length=4096)
import IPython.display as ipd
ipd.Audio(output.audio[0].numpy(), rate=output.sr)
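If you want to keep the generated audio beyond the notebook session, it can also be written to disk, for example with soundfile (installed in the prerequisites); a minimal sketch:
import soundfile as sf

# Write the generated waveform to a WAV file at the model's sampling rate
sf.write("output.wav", output.audio[0].numpy(), samplerate=output.sr)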
Text-to-Speech generation with Voice Cloning#
Additionally, we can specify a reference voice for generation by providing reference audio and its transcript. interface.create_speaker processes the reference audio and text into a set of features used for speaker description.
from notebook_utils import download_file
ref_audio_url = "https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/2.wav"
file_path = download_file(ref_audio_url)
ipd.Audio(file_path)
speaker = interface.create_speaker(file_path, "Hello, I can speak pretty well, but sometimes I make some mistakes.")
# Save the speaker to a file
interface.save_speaker(speaker, "speaker.pkl")
# Load the speaker from a file
speaker = interface.load_speaker("speaker.pkl")
# Generate TTS with the custom voice
output = interface.generate(text="This is a cloned voice speaking", speaker=speaker, temperature=0.1, repetition_penalty=1.1, max_length=4096)
ipd.Audio(output.audio[0].numpy(), rate=output.sr)
Interactive demo#
from gradio_helper import make_demo
demo = make_demo(interface)
try:
    demo.launch(debug=True)
except Exception:
    demo.launch(share=True, debug=True)