High-Quality Text-Free One-Shot Voice Conversion with FreeVC and OpenVINO™#
This Jupyter notebook can be launched after a local installation only.
FreeVC alters the voice of a source speaker to match a target style while keeping the linguistic content unchanged, and it does so without text annotation.
The figure below illustrates the FreeVC model architecture for inference. This notebook concentrates only on the inference part. There are three main parts: Prior Encoder, Speaker Encoder and Decoder. The prior encoder contains a WavLM model, a bottleneck extractor and a normalizing flow. Detailed information is available in this paper.
FreeVC itself provides only a command-line interface, and only with CUDA. This notebook shows how to use FreeVC in Python and without CUDA devices. It consists of the following steps:
Download and prepare models.
Inference.
Convert models to OpenVINO Intermediate Representation.
Inference using only OpenVINO’s IR models.
Installation Instructions#
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
Pre-requisites#
These steps can be done manually, or they will be performed automatically during the execution of the notebook, but only in the minimum necessary scope.
1. Clone this repo: git clone https://github.com/OlaWod/FreeVC.git.
2. Download WavLM-Large and put it under the directory FreeVC/wavlm/.
3. You can download the VCTK dataset. For this example, only two audio files are downloaded from the Hugging Face FreeVC example.
4. Download the pretrained models and put them under the directory ‘checkpoints’ (for the current example only freevc.pth is required).
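If you prepare the files manually, a quick check of what is still missing can save a failed run later. The following is a minimal sketch using only the standard library; the paths are the ones named in the steps above, while the helper name `missing_prerequisites` is hypothetical and not part of FreeVC or OpenVINO.

```python
from pathlib import Path

# Paths taken from the prerequisite steps above.
REQUIRED_FILES = [
    "FreeVC/wavlm/WavLM-Large.pt",
    "checkpoints/freevc.pth",
    "p225_001.wav",
    "p226_002.wav",
]


def missing_prerequisites(root="."):
    """Return the required files that are not yet present under `root`."""
    root = Path(root)
    return [name for name in REQUIRED_FILES if not (root / name).exists()]


# Anything listed here will be downloaded by the cells that follow.
print(missing_prerequisites())
```

An empty list means every file is already in place and the download cells below will skip their work.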
Install extra requirements
%pip install -q "openvino>=2023.3.0" "librosa>=0.8.1" "webrtcvad==2.0.10" "gradio>=4.19" "torch>=2.1" gdown scipy tqdm torchvision --extra-index-url https://download.pytorch.org/whl/cpu
Note: you may need to restart the kernel to use updated packages.
Check if FreeVC is installed and append its path to sys.path
from pathlib import Path
import sys
free_vc_repo = "FreeVC"
if not Path(free_vc_repo).exists():
    !git clone https://github.com/OlaWod/FreeVC.git
sys.path.append(free_vc_repo)
Cloning into 'FreeVC'...
remote: Enumerating objects: 131, done.
remote: Counting objects: 100% (74/74), done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 131 (delta 43), reused 27 (delta 27), pack-reused 57 (from 1)
Receiving objects: 100% (131/131), 15.28 MiB | 17.50 MiB/s, done.
Resolving deltas: 100% (43/43), done.
# Fetch `notebook_utils` module
import requests
import gdown
r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
)
open("notebook_utils.py", "w").write(r.text)
from notebook_utils import download_file, device_widget
wavlm_large_dir_path = Path("FreeVC/wavlm")
wavlm_large_path = wavlm_large_dir_path / "WavLM-Large.pt"
wavlm_url = "https://drive.google.com/uc?id=12-cB34qCTvByWT-QtOcZaqwwO21FLSqU&confirm=t&uuid=a703c43c-ccce-436c-8799-c11b88e9e7e4"
if not wavlm_large_path.exists():
    gdown.download(wavlm_url, str(wavlm_large_path))
Downloading...
From: https://drive.google.com/uc?id=12-cB34qCTvByWT-QtOcZaqwwO21FLSqU&confirm=t&uuid=a703c43c-ccce-436c-8799-c11b88e9e7e4
To: /opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/810/archive/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/WavLM-Large.pt
100%|██████████| 1.26G/1.26G [00:32<00:00, 38.5MB/s]
freevc_chpt_dir = Path("checkpoints")
freevc_chpt_name = "freevc.pth"
freevc_chpt_path = freevc_chpt_dir / freevc_chpt_name
if not freevc_chpt_path.exists():
    download_file(
        f"https://storage.openvinotoolkit.org/repositories/openvino_notebooks/models/freevc/{freevc_chpt_name}",
        directory=freevc_chpt_dir,
    )
checkpoints/freevc.pth: 0%| | 0.00/451M [00:00<?, ?B/s]
audio1_name = "p225_001.wav"
audio1_url = f"https://huggingface.co/spaces/OlaWod/FreeVC/resolve/main/{audio1_name}"
audio2_name = "p226_002.wav"
audio2_url = f"https://huggingface.co/spaces/OlaWod/FreeVC/resolve/main/{audio2_name}"
if not Path(audio1_name).exists():
    download_file(audio1_url)
if not Path(audio2_name).exists():
    download_file(audio2_url)
p225_001.wav: 0%| | 0.00/50.8k [00:00<?, ?B/s]
p226_002.wav: 0%| | 0.00/135k [00:00<?, ?B/s]
Imports and settings#
import logging
import os
import time
import librosa
import numpy as np
import torch
from scipy.io.wavfile import write
from tqdm import tqdm
import openvino as ov
import utils
from models import SynthesizerTrn
from speaker_encoder.voice_encoder import SpeakerEncoder
from wavlm import WavLM, WavLMConfig
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
Redefine the function get_cmodel from utils to exclude CUDA
def get_cmodel():
    checkpoint = torch.load(wavlm_large_path)
    cfg = WavLMConfig(checkpoint["cfg"])
    cmodel = WavLM(cfg)
    cmodel.load_state_dict(checkpoint["model"])
    cmodel.eval()
    return cmodel
Models initialization
hps = utils.get_hparams_from_file("FreeVC/configs/freevc.json")
os.makedirs("outputs/freevc", exist_ok=True)
net_g = SynthesizerTrn(hps.data.filter_length // 2 + 1, hps.train.segment_size // hps.data.hop_length, **hps.model)
utils.load_checkpoint(freevc_chpt_path, net_g, optimizer=None, strict=True)
cmodel = get_cmodel()
smodel = SpeakerEncoder("FreeVC/speaker_encoder/ckpt/pretrained_bak_5805000.pt", device="cpu")
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/810/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Loaded the voice encoder model on cpu in 0.01 seconds.
Reading dataset settings
srcs = [audio1_name, audio2_name]
tgts = [audio2_name, audio1_name]
Inference
with torch.no_grad():
    # the index keeps file names unique, so results written within the
    # same minute do not overwrite each other
    for i, (src, tgt) in enumerate(tqdm(zip(srcs, tgts))):
        # tgt
        wav_tgt, _ = librosa.load(tgt, sr=hps.data.sampling_rate)
        wav_tgt, _ = librosa.effects.trim(wav_tgt, top_db=20)
        g_tgt = smodel.embed_utterance(wav_tgt)
        g_tgt = torch.from_numpy(g_tgt).unsqueeze(0)
        # src
        wav_src, _ = librosa.load(src, sr=hps.data.sampling_rate)
        wav_src = torch.from_numpy(wav_src).unsqueeze(0)
        c = utils.get_content(cmodel, wav_src)
        tgt_audio = net_g.infer(c, g=g_tgt)
        tgt_audio = tgt_audio[0][0].data.cpu().float().numpy()
        timestamp = time.strftime("%m-%d_%H-%M", time.localtime())
        write(
            os.path.join("outputs/freevc", "{}_{}.wav".format(timestamp, i)),
            hps.data.sampling_rate,
            tgt_audio,
        )
2it [00:04, 2.03s/it]
The resulting audio files should be available in ‘outputs/freevc’.
Convert Models to OpenVINO Intermediate Representation#
Convert each model to OpenVINO IR with FP16 precision. The ov.convert_model function accepts the original PyTorch model object and example inputs for tracing, and returns an OpenVINO Model class instance that represents this model. The obtained model is ready to use: it can be loaded on a device using compile_model or saved to disk using the ov.save_model function. The read_model method loads a saved model from disk. For more information about model conversion, see this page.
Convert Prior Encoder.#
First we convert the WavLM model, which is part of the Prior Encoder, to OpenVINO’s IR format. We keep the original name of the model in the code: cmodel.
# define forward as extract_features for compatibility
cmodel.forward = cmodel.extract_features
OUTPUT_DIR = Path("output")
BASE_MODEL_NAME = "cmodel"
OUTPUT_DIR.mkdir(exist_ok=True)
ir_cmodel_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "_ir")).with_suffix(".xml")
length = 32000
dummy_input = torch.randn(1, length)
Converting to OpenVINO’s IR format.
core = ov.Core()
class ModelWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model
    def forward(self, input):
        return self.model(input)[0]
if not ir_cmodel_path.exists():
    ir_cmodel = ov.convert_model(ModelWrapper(cmodel), example_input=dummy_input)
    ov.save_model(ir_cmodel, ir_cmodel_path)
else:
    ir_cmodel = core.read_model(ir_cmodel_path)
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/810/archive/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/modules.py:495: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert embed_dim == self.embed_dim
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/810/archive/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/modules.py:496: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert list(query.size()) == [tgt_len, bsz, embed_dim]
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/810/archive/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/modules.py:500: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert key_bsz == bsz
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/810/archive/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/modules.py:502: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert src_len, bsz == value.shape[:2]
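The cell above converts the model only when the IR file is missing and otherwise reads it from disk; the same pattern is repeated for the other two models. As a generic sketch of that caching logic, the helper below uses only the standard library. The name `load_or_create` and its callable parameters are hypothetical and not part of OpenVINO, and a plain string stands in for the converted model.

```python
import tempfile
from pathlib import Path


def load_or_create(path, create, save, load):
    """Return load(path) if the file exists; otherwise create, save and return."""
    path = Path(path)
    if path.exists():
        return load(path)
    obj = create()
    save(obj, path)
    return obj


# Stand-in callables: a string plays the role of the converted model.
create_calls = []


def fake_convert():
    create_calls.append(1)
    return "converted model"


def fake_save(obj, path):
    path.write_text(obj)


def fake_read(path):
    return path.read_text()


cache_file = Path(tempfile.mkdtemp()) / "model.txt"
first = load_or_create(cache_file, fake_convert, fake_save, fake_read)
second = load_or_create(cache_file, fake_convert, fake_save, fake_read)
print(first == second, len(create_calls))
```

One consequence of this pattern is that a stale file is reused silently: if the source model or the example input changes, delete the saved IR files to force a fresh conversion.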
Select device from dropdown list for running inference using OpenVINO
device = device_widget()
device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
compiled_cmodel = core.compile_model(ir_cmodel, device.value)
Convert SpeakerEncoder#
OUTPUT_DIR = Path("output")
BASE_MODEL_NAME = "smodel"
OUTPUT_DIR.mkdir(exist_ok=True)
ir_smodel_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "ir")).with_suffix(".xml")
length = 32000
dummy_input = torch.randn(1, length, 40)
if not ir_smodel_path.exists():
    ir_smodel = ov.convert_model(smodel, example_input=dummy_input)
    ov.save_model(ir_smodel, ir_smodel_path)
else:
    ir_smodel = core.read_model(ir_smodel_path)
To prepare the input for inference, we define helper functions based on the methods of the speaker_encoder.voice_encoder.SpeakerEncoder class
from speaker_encoder.hparams import sampling_rate, mel_window_step, partials_n_frames
from speaker_encoder import audio
def compute_partial_slices(n_samples: int, rate, min_coverage):
    """
    Computes where to split an utterance waveform and its corresponding mel spectrogram to
    obtain partial utterances of <partials_n_frames> each. Both the waveform and the
    mel spectrogram slices are returned, so as to make each partial utterance waveform
    correspond to its spectrogram.
    The returned ranges may be indexing further than the length of the waveform. It is
    recommended that you pad the waveform with zeros up to wav_slices[-1].stop.
    :param n_samples: the number of samples in the waveform
    :param rate: how many partial utterances should occur per second. Partial utterances must
    cover the span of the entire utterance, thus the rate should not be lower than the inverse
    of the duration of a partial utterance. By default, partial utterances are 1.6s long and
    the minimum rate is thus 0.625.
    :param min_coverage: when reaching the last partial utterance, it may or may not have
    enough frames. If at least <min_pad_coverage> of <partials_n_frames> are present,
    then the last partial utterance will be considered by zero-padding the audio. Otherwise,
    it will be discarded. If there aren't enough frames for one partial utterance,
    this parameter is ignored so that the function always returns at least one slice.
    :return: the waveform slices and mel spectrogram slices as lists of array slices. Index
    respectively the waveform and the mel spectrogram with these slices to obtain the partial
    utterances.
    """
    assert 0 < min_coverage <= 1
    # Compute how many frames separate two partial utterances
    samples_per_frame = int((sampling_rate * mel_window_step / 1000))
    n_frames = int(np.ceil((n_samples + 1) / samples_per_frame))
    frame_step = int(np.round((sampling_rate / rate) / samples_per_frame))
    assert 0 < frame_step, "The rate is too high"
    assert frame_step <= partials_n_frames, "The rate is too low, it should be %f at least" % (sampling_rate / (samples_per_frame * partials_n_frames))
    # Compute the slices
    wav_slices, mel_slices = [], []
    steps = max(1, n_frames - partials_n_frames + frame_step + 1)
    for i in range(0, steps, frame_step):
        mel_range = np.array([i, i + partials_n_frames])
        wav_range = mel_range * samples_per_frame
        mel_slices.append(slice(*mel_range))
        wav_slices.append(slice(*wav_range))
    # Evaluate whether extra padding is warranted or not
    last_wav_range = wav_slices[-1]
    coverage = (n_samples - last_wav_range.start) / (last_wav_range.stop - last_wav_range.start)
    if coverage < min_coverage and len(mel_slices) > 1:
        mel_slices = mel_slices[:-1]
        wav_slices = wav_slices[:-1]
    return wav_slices, mel_slices
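To make the slicing arithmetic in compute_partial_slices concrete, here is a worked example in plain Python. It is a sketch under assumptions: the three constants are assumed values for speaker_encoder.hparams, chosen to be consistent with the 1.6 s partial length quoted in the docstring, and `partial_bounds` is a hypothetical re-statement of the logic that returns (start, stop) tuples instead of slice objects and omits the assertions.

```python
import math

# Assumed hparams values; check speaker_encoder.hparams for the actual numbers.
SAMPLING_RATE = 16000     # Hz
MEL_WINDOW_STEP = 10      # ms per mel frame
PARTIALS_N_FRAMES = 160   # frames per partial utterance (1.6 s)


def partial_bounds(n_samples, rate=1.3, min_coverage=0.75):
    """Return (wav_bounds, mel_bounds) as lists of (start, stop) tuples."""
    samples_per_frame = SAMPLING_RATE * MEL_WINDOW_STEP // 1000
    n_frames = math.ceil((n_samples + 1) / samples_per_frame)
    frame_step = round((SAMPLING_RATE / rate) / samples_per_frame)
    steps = max(1, n_frames - PARTIALS_N_FRAMES + frame_step + 1)
    mel = [(i, i + PARTIALS_N_FRAMES) for i in range(0, steps, frame_step)]
    wav = [(a * samples_per_frame, b * samples_per_frame) for a, b in mel]
    # Drop the last partial if too little of it is covered by real audio.
    start, stop = wav[-1]
    if (n_samples - start) / (stop - start) < min_coverage and len(wav) > 1:
        mel, wav = mel[:-1], wav[:-1]
    return wav, mel


wav_3s, mel_3s = partial_bounds(3 * SAMPLING_RATE)
print(mel_3s)  # [(0, 160), (77, 237), (154, 314)]
print(wav_3s)  # [(0, 25600), (12320, 37920), (24640, 50240)]

wav_1s, mel_1s = partial_bounds(1 * SAMPLING_RATE)
print(wav_1s)  # [(0, 25600)]
```

With these assumed values, a 3-second waveform (48000 samples) yields three overlapping partials, and the last one ends at sample 50240, so the waveform has to be zero-padded by 2240 samples. A 1-second waveform covers only 62.5% of a partial, yet the single slice is kept because at least one slice is always returned.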
def embed_utterance(
    wav: np.ndarray,
    smodel: ov.CompiledModel,
    return_partials=False,
    rate=1.3,
    min_coverage=0.75,
):
    """
    Computes an embedding for a single utterance. The utterance is divided in partial
    utterances and an embedding is computed for each. The complete utterance embedding is the
    L2-normed average embedding of the partial utterances.
    :param wav: a preprocessed utterance waveform as a numpy array of float32
    :param smodel: compiled speaker encoder model.
    :param return_partials: if True, the partial embeddings will also be returned along with
    the wav slices corresponding to each partial utterance.
    :param rate: how many partial utterances should occur per second. Partial utterances must
    cover the span of the entire utterance, thus the rate should not be lower than the inverse
    of the duration of a partial utterance. By default, partial utterances are 1.6s long and
    the minimum rate is thus 0.625.
    :param min_coverage: when reaching the last partial utterance, it may or may not have
    enough frames. If at least <min_pad_coverage> of <partials_n_frames> are present,
    then the last partial utterance will be considered by zero-padding the audio. Otherwise,
    it will be discarded. If there aren't enough frames for one partial utterance,
    this parameter is ignored so that the function always returns at least one slice.
    :return: the embedding as a numpy array of float32 of shape (model_embedding_size,). If
    <return_partials> is True, the partial utterances as a numpy array of float32 of shape
    (n_partials, model_embedding_size) and the wav partials as a list of slices will also be
    returned.
    """
    # Compute where to split the utterance into partials and pad the waveform with zeros if
    # the partial utterances cover a larger range.
    wav_slices, mel_slices = compute_partial_slices(len(wav), rate, min_coverage)
    max_wave_length = wav_slices[-1].stop
    if max_wave_length >= len(wav):
        wav = np.pad(wav, (0, max_wave_length - len(wav)), "constant")
    # Split the utterance into partials and forward them through the model
    mel = audio.wav_to_mel_spectrogram(wav)
    mels = np.array([mel[s] for s in mel_slices])
    with torch.no_grad():
        mels = torch.from_numpy(mels).to(torch.device("cpu"))
        output_layer = smodel.output(0)
        partial_embeds = smodel(mels)[output_layer]
    # Compute the utterance embedding from the partial embeddings
    raw_embed = np.mean(partial_embeds, axis=0)
    embed = raw_embed / np.linalg.norm(raw_embed, 2)
    if return_partials:
        return embed, partial_embeds, wav_slices
    return embed
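The last step of embed_utterance, the L2-normed average of the partial embeddings, can be illustrated in isolation. In this small sketch the three partial embeddings are made-up 4-dimensional vectors chosen for easy arithmetic; real embeddings come from the speaker encoder and are larger.

```python
import numpy as np

# Made-up partial embeddings: 3 partial utterances, embedding size 4.
partial_embeds = np.array(
    [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
    ],
    dtype=np.float32,
)

# Average over the partials, then scale the result to unit L2 norm.
raw_embed = np.mean(partial_embeds, axis=0)
embed = raw_embed / np.linalg.norm(raw_embed, 2)

print(embed)
print(np.linalg.norm(embed))
```

The average here is (2/3, 2/3, 0, 0), and after normalization the first two components become roughly 0.7071 each. Normalizing to unit length means only the direction of the averaged embedding is passed on, not its magnitude.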
Select device from dropdown list for running inference using OpenVINO
device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
Then compile model.
compiled_smodel = core.compile_model(ir_smodel, device.value)
Convert Decoder#
In the same way, export the SynthesizerTrn model, which implements the decoder function, to OpenVINO IR format.
OUTPUT_DIR = Path("output")
BASE_MODEL_NAME = "net_g"
onnx_net_g_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "_fp32")).with_suffix(".onnx")
ir_net_g_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "ir")).with_suffix(".xml")
dummy_input_1 = torch.randn(1, 1024, 81)
dummy_input_2 = torch.randn(1, 256)
# define forward as infer
net_g.forward = net_g.infer
if not ir_net_g_path.exists():
    ir_net_g_model = ov.convert_model(net_g, example_input=(dummy_input_1, dummy_input_2))
    ov.save_model(ir_net_g_model, ir_net_g_path)
else:
    ir_net_g_model = core.read_model(ir_net_g_path)
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/810/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/jit/_trace.py:1102: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Tensor-likes are not close!
Mismatched elements: 25915 / 25920 (100.0%)
Greatest absolute difference: 1.3485908806324005 at index (0, 0, 24258) (up to 1e-05 allowed)
Greatest relative difference: 8204.075456053068 at index (0, 0, 5777) (up to 1e-05 allowed)
_check_trace(
Compile the model for the device selected in the dropdown list above.
compiled_ir_net_g_model = core.compile_model(ir_net_g_model, device.value)
Define function for synthesizing.
def synthesize_audio(src, tgt):
    wav_tgt, _ = librosa.load(tgt, sr=hps.data.sampling_rate)
    wav_tgt, _ = librosa.effects.trim(wav_tgt, top_db=20)
    g_tgt = embed_utterance(wav_tgt, compiled_smodel)
    g_tgt = torch.from_numpy(g_tgt).unsqueeze(0)
    # src
    wav_src, _ = librosa.load(src, sr=hps.data.sampling_rate)
    wav_src = np.expand_dims(wav_src, axis=0)
    output_layer = compiled_cmodel.output(0)
    c = compiled_cmodel(wav_src)[output_layer]
    c = c.transpose((0, 2, 1))
    output_layer = compiled_ir_net_g_model.output(0)
    tgt_audio = compiled_ir_net_g_model((c, g_tgt))[output_layer]
    tgt_audio = tgt_audio[0][0]
    return tgt_audio
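The transpose in synthesize_audio swaps the last two axes of the content features before they are passed to the decoder. The sketch below shows the effect on a dummy array. It assumes that the content model returns features laid out as (batch, frames, channels) and that the decoder expects (batch, channels, frames), as the shape (1, 1024, 81) of dummy_input_1 suggests; the sizes are borrowed from that dummy input.

```python
import numpy as np

frames, channels = 81, 1024  # sizes borrowed from dummy_input_1

# Fill the array with distinct values so that the axis swap can be checked.
c = np.arange(frames * channels, dtype=np.float32).reshape(1, frames, channels)
c_t = c.transpose((0, 2, 1))

print(c.shape, "->", c_t.shape)
# The value at (frame 7, channel 5) is now found at (channel 5, frame 7).
print(c[0, 7, 5] == c_t[0, 5, 7])
```

Only the axis order changes; no values are modified or resampled.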
And now we can check inference using only IR models.
result_wav_names = []
with torch.no_grad():
    for i, (src, tgt) in enumerate(tqdm(zip(srcs, tgts))):
        output_audio = synthesize_audio(src, tgt)
        timestamp = time.strftime("%m-%d_%H-%M", time.localtime())
        # the "ov" tag and the index keep these files distinct from each other
        # and from the PyTorch results written earlier in the same minute
        result_name = f"{timestamp}_ov_{i}.wav"
        result_wav_names.append(result_name)
        write(
            os.path.join("outputs/freevc", result_name),
            hps.data.sampling_rate,
            output_audio,
        )
2it [00:01, 1.31it/s]
The resulting audio files should be available in ‘outputs/freevc’; you can listen to them and compare them with the files generated earlier. One of the results is presented below.
Source audio (source of text):
import IPython.display as ipd
ipd.Audio(srcs[0])
Target audio (source of voice):
ipd.Audio(tgts[0])
Result audio:
ipd.Audio(f"outputs/freevc/{result_wav_names[0]}")
You can also use your own audio files: upload them and use them for inference. Use a sampling rate that corresponds to the value of hps.data.sampling_rate.
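Before uploading your own recording, you can check its sampling rate with the standard library. The following is a sketch under assumptions: `EXPECTED_RATE` is an assumed value for hps.data.sampling_rate (read the real value from the loaded config), `wav_sample_rate` is a hypothetical helper, and the `wave` module handles only PCM WAV files. A short test tone is written first so that the example is self-contained.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path

EXPECTED_RATE = 16000  # assumed value of hps.data.sampling_rate


def wav_sample_rate(path):
    """Return the sampling rate stored in the header of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


# Write a one-second 440 Hz mono tone as 16-bit PCM.
demo_path = Path(tempfile.mkdtemp()) / "tone.wav"
frames = b"".join(
    struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * 440 * n / EXPECTED_RATE)))
    for n in range(EXPECTED_RATE)
)
with wave.open(str(demo_path), "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(EXPECTED_RATE)
    wav_file.writeframes(frames)

rate = wav_sample_rate(demo_path)
print(rate, "matches" if rate == EXPECTED_RATE else "differs from the expected rate")
```

Since synthesize_audio loads files with librosa.load(..., sr=hps.data.sampling_rate), audio recorded at a different rate is resampled on load, so a mismatch costs quality and time rather than causing an error.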
def infer(src, tgt):
    output_audio = synthesize_audio(src, tgt)
    timestamp = time.strftime("%m-%d_%H-%M", time.localtime())
    result_name = f"{timestamp}.wav"
    write(result_name, hps.data.sampling_rate, output_audio)
    return result_name
if not Path("gradio_helper.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/freevc-voice-conversion/gradio_helper.py")
    open("gradio_helper.py", "w").write(r.text)
from gradio_helper import make_demo
demo = make_demo(fn=infer)
try:
    demo.queue().launch(debug=False)
except Exception:
    demo.queue().launch(debug=False, share=True)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/
Running on local URL: http://127.0.0.1:7860 To create a public link, set share=True in launch().
# uncomment and run this cell to stop the Gradio interface
# demo.close()