High-Quality Text-Free One-Shot Voice Conversion with FreeVC and OpenVINO™#

This Jupyter notebook can be launched after a local installation only.

Github

FreeVC allows alter the voice of a source speaker to a target style, while keeping the linguistic content unchanged, without text annotation.

Figure bellow illustrates model architecture of FreeVC for inference. In this notebook we concentrate only on inference part. There are three main parts: Prior Encoder, Speaker Encoder and Decoder. The prior encoder contains a WavLM model, a bottleneck extractor and a normalizing flow. Detailed information is available in this paper.

Inference

Inference#

**image_source*

FreeVC suggests only command line interface to use and only with CUDA. In this notebook it shows how to use FreeVC in Python and without CUDA devices. It consists of the following steps:

  • Download and prepare models.

  • Inference.

  • Convert models to OpenVINO Intermediate Representation.

  • Inference using only OpenVINO’s IR models.

Table of contents:

Installation Instructions#

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.

Pre-requisites#

This steps can be done manually or will be performed automatically during the execution of the notebook, but in minimum necessary scope. 1. Clone this repo: git clone OlaWod/FreeVC.git. 2. Download WavLM-Large and put it under directory FreeVC/wavlm/. 3. You can download the VCTK dataset. For this example we download only two of them from Hugging Face FreeVC example. 4. Download pretrained models and put it under directory ‘checkpoints’ (for current example only freevc.pth are required).

Install extra requirements

%pip install -q "openvino>=2023.3.0" "librosa>=0.8.1" "webrtcvad==2.0.10" "gradio>=4.19" "torch>=2.1" gdown scipy tqdm torchvision --extra-index-url https://download.pytorch.org/whl/cpu
Note: you may need to restart the kernel to use updated packages.
# Fetch `notebook_utils` module
import requests


r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
)
open("notebook_utils.py", "w").write(r.text)


r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/cmd_helper.py",
)
open("cmd_helper.py", "w").write(r.text)
1491
from pathlib import Path

import gdown
from cmd_helper import clone_repo
from notebook_utils import download_file, device_widget


clone_repo("https://github.com/OlaWod/FreeVC.git")


wavlm_large_dir_path = Path("FreeVC/wavlm")
wavlm_large_path = wavlm_large_dir_path / "WavLM-Large.pt"

wavlm_url = "https://drive.google.com/uc?id=12-cB34qCTvByWT-QtOcZaqwwO21FLSqU&confirm=t&uuid=a703c43c-ccce-436c-8799-c11b88e9e7e4"

if not wavlm_large_path.exists():
    gdown.download(wavlm_url, str(wavlm_large_path))
Downloading...
From: https://drive.google.com/uc?id=12-cB34qCTvByWT-QtOcZaqwwO21FLSqU&confirm=t&uuid=a703c43c-ccce-436c-8799-c11b88e9e7e4
To: /opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/WavLM-Large.pt
100%|██████████| 1.26G/1.26G [01:03<00:00, 19.9MB/s]
freevc_chpt_dir = Path("checkpoints")
freevc_chpt_name = "freevc.pth"
freevc_chpt_path = freevc_chpt_dir / freevc_chpt_name

if not freevc_chpt_path.exists():
    download_file(
        f"https://storage.openvinotoolkit.org/repositories/openvino_notebooks/models/freevc/{freevc_chpt_name}",
        directory=freevc_chpt_dir,
    )
freevc.pth:   0%|          | 0.00/451M [00:00<?, ?B/s]
audio1_name = "p225_001.wav"
audio1_url = f"https://huggingface.co/spaces/OlaWod/FreeVC/resolve/main/{audio1_name}"
audio2_name = "p226_002.wav"
audio2_url = f"https://huggingface.co/spaces/OlaWod/FreeVC/resolve/main/{audio2_name}"

if not Path(audio1_name).exists():
    download_file(audio1_url)

if not Path(audio2_name).exists():
    download_file(audio2_url)
p225_001.wav:   0%|          | 0.00/50.8k [00:00<?, ?B/s]
p226_002.wav:   0%|          | 0.00/135k [00:00<?, ?B/s]

Imports and settings#

import logging
import os
import time

import librosa
import numpy as np
import torch
from scipy.io.wavfile import write
from tqdm import tqdm

import openvino as ov

import utils
from models import SynthesizerTrn
from speaker_encoder.voice_encoder import SpeakerEncoder
from wavlm import WavLM, WavLMConfig

logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

Redefine function get_model from utils to exclude CUDA

def get_cmodel():
    checkpoint = torch.load(wavlm_large_path)
    cfg = WavLMConfig(checkpoint["cfg"])
    cmodel = WavLM(cfg)
    cmodel.load_state_dict(checkpoint["model"])
    cmodel.eval()

    return cmodel

Models initialization

hps = utils.get_hparams_from_file("FreeVC/configs/freevc.json")
os.makedirs("outputs/freevc", exist_ok=True)

net_g = SynthesizerTrn(hps.data.filter_length // 2 + 1, hps.train.segment_size // hps.data.hop_length, **hps.model)

utils.load_checkpoint(freevc_chpt_path, net_g, optimizer=None, strict=True)
cmodel = get_cmodel()
smodel = SpeakerEncoder("FreeVC/speaker_encoder/ckpt/pretrained_bak_5805000.pt", device="cpu")
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:134: FutureWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  WeightNorm.apply(module, name, dim)
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/utils.py:71: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See pytorch/pytorch for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint_dict = torch.load(checkpoint_path, map_location='cpu')
/tmp/ipykernel_2193775/250534645.py:2: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See pytorch/pytorch for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(wavlm_large_path)
Loaded the voice encoder model on cpu in 0.01 seconds.
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/speaker_encoder/voice_encoder.py:39: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See pytorch/pytorch for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(weights_fpath, map_location="cpu")

Reading dataset settings

srcs = [audio1_name, audio2_name]
tgts = [audio2_name, audio1_name]

Inference

with torch.no_grad():
    for line in tqdm(zip(srcs, tgts)):
        src, tgt = line
        # tgt
        wav_tgt, _ = librosa.load(tgt, sr=hps.data.sampling_rate)
        wav_tgt, _ = librosa.effects.trim(wav_tgt, top_db=20)

        g_tgt = smodel.embed_utterance(wav_tgt)
        g_tgt = torch.from_numpy(g_tgt).unsqueeze(0)

        # src
        wav_src, _ = librosa.load(src, sr=hps.data.sampling_rate)
        wav_src = torch.from_numpy(wav_src).unsqueeze(0)

        c = utils.get_content(cmodel, wav_src)

        tgt_audio = net_g.infer(c, g=g_tgt)
        tgt_audio = tgt_audio[0][0].data.cpu().float().numpy()

        timestamp = time.strftime("%m-%d_%H-%M", time.localtime())
        write(
            os.path.join("outputs/freevc", "{}.wav".format(timestamp)),
            hps.data.sampling_rate,
            tgt_audio,
        )
2it [00:03,  1.66s/it]

Result audio files should be available in ‘outputs/freevc’

Convert Modes to OpenVINO Intermediate Representation#

Convert each model to OpenVINO IR, with FP16 precision. The ov.convert_model function accepts the original PyTorch model object and example inputs for tracing and returns the OpenVINO Model class instance which represents this model. The obtained model is ready to use and to be loaded on a device using compile_model or can be saved on a disk using the ov.save_model function. The read_model method loads a saved model from a disk. For more information about model conversion, see this page.

Convert Prior Encoder.#

First we convert WavLM model, as a part of Convert Prior Encoder to OpenVINO’s IR format. We keep the original name of the model in code: cmodel.

# define forward as extract_features for compatibility
cmodel.forward = cmodel.extract_features
OUTPUT_DIR = Path("output")
BASE_MODEL_NAME = "cmodel"

OUTPUT_DIR.mkdir(exist_ok=True)

ir_cmodel_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "_ir")).with_suffix(".xml")

length = 32000

dummy_input = torch.randn(1, length)

Converting to OpenVINO’s IR format.

core = ov.Core()


class ModelWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input):
        return self.model(input)[0]


if not ir_cmodel_path.exists():
    ir_cmodel = ov.convert_model(ModelWrapper(cmodel), example_input=dummy_input)
    ov.save_model(ir_cmodel, ir_cmodel_path)
else:
    ir_cmodel = core.read_model(ir_cmodel_path)
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/modules.py:495: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert embed_dim == self.embed_dim
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/modules.py:496: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert list(query.size()) == [tgt_len, bsz, embed_dim]
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/modules.py:500: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert key_bsz == bsz
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/modules.py:502: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert src_len, bsz == value.shape[:2]

Select device from dropdown list for running inference using OpenVINO

device = device_widget()
device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
compiled_cmodel = core.compile_model(ir_cmodel, device.value)

Convert SpeakerEncoder#

OUTPUT_DIR = Path("output")
BASE_MODEL_NAME = "smodel"

OUTPUT_DIR.mkdir(exist_ok=True)

ir_smodel_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "ir")).with_suffix(".xml")


length = 32000

dummy_input = torch.randn(1, length, 40)

if not ir_smodel_path.exists():
    ir_smodel = ov.convert_model(smodel, example_input=dummy_input)
    ov.save_model(ir_smodel, ir_smodel_path)
else:
    ir_smodel = core.read_model(ir_smodel_path)

For preparing input for inference, we should define helper functions based on speaker_encoder.voice_encoder.SpeakerEncoder class methods

from speaker_encoder.hparams import sampling_rate, mel_window_step, partials_n_frames
from speaker_encoder import audio


def compute_partial_slices(n_samples: int, rate, min_coverage):
    """
    Computes where to split an utterance waveform and its corresponding mel spectrogram to
    obtain partial utterances of <partials_n_frames> each. Both the waveform and the
    mel spectrogram slices are returned, so as to make each partial utterance waveform
    correspond to its spectrogram.

    The returned ranges may be indexing further than the length of the waveform. It is
    recommended that you pad the waveform with zeros up to wav_slices[-1].stop.

    :param n_samples: the number of samples in the waveform
    :param rate: how many partial utterances should occur per second. Partial utterances must
    cover the span of the entire utterance, thus the rate should not be lower than the inverse
    of the duration of a partial utterance. By default, partial utterances are 1.6s long and
    the minimum rate is thus 0.625.
    :param min_coverage: when reaching the last partial utterance, it may or may not have
    enough frames. If at least <min_pad_coverage> of <partials_n_frames> are present,
    then the last partial utterance will be considered by zero-padding the audio. Otherwise,
    it will be discarded. If there aren't enough frames for one partial utterance,
    this parameter is ignored so that the function always returns at least one slice.
    :return: the waveform slices and mel spectrogram slices as lists of array slices. Index
    respectively the waveform and the mel spectrogram with these slices to obtain the partial
    utterances.
    """
    assert 0 < min_coverage <= 1

    # Compute how many frames separate two partial utterances
    samples_per_frame = int((sampling_rate * mel_window_step / 1000))
    n_frames = int(np.ceil((n_samples + 1) / samples_per_frame))
    frame_step = int(np.round((sampling_rate / rate) / samples_per_frame))
    assert 0 < frame_step, "The rate is too high"
    assert frame_step <= partials_n_frames, "The rate is too low, it should be %f at least" % (sampling_rate / (samples_per_frame * partials_n_frames))

    # Compute the slices
    wav_slices, mel_slices = [], []
    steps = max(1, n_frames - partials_n_frames + frame_step + 1)
    for i in range(0, steps, frame_step):
        mel_range = np.array([i, i + partials_n_frames])
        wav_range = mel_range * samples_per_frame
        mel_slices.append(slice(*mel_range))
        wav_slices.append(slice(*wav_range))

    # Evaluate whether extra padding is warranted or not
    last_wav_range = wav_slices[-1]
    coverage = (n_samples - last_wav_range.start) / (last_wav_range.stop - last_wav_range.start)
    if coverage < min_coverage and len(mel_slices) > 1:
        mel_slices = mel_slices[:-1]
        wav_slices = wav_slices[:-1]

    return wav_slices, mel_slices


def embed_utterance(
    wav: np.ndarray,
    smodel: ov.CompiledModel,
    return_partials=False,
    rate=1.3,
    min_coverage=0.75,
):
    """
    Computes an embedding for a single utterance. The utterance is divided in partial
    utterances and an embedding is computed for each. The complete utterance embedding is the
    L2-normed average embedding of the partial utterances.

    :param wav: a preprocessed utterance waveform as a numpy array of float32
    :param smodel: compiled speaker encoder model.
    :param return_partials: if True, the partial embeddings will also be returned along with
    the wav slices corresponding to each partial utterance.
    :param rate: how many partial utterances should occur per second. Partial utterances must
    cover the span of the entire utterance, thus the rate should not be lower than the inverse
    of the duration of a partial utterance. By default, partial utterances are 1.6s long and
    the minimum rate is thus 0.625.
    :param min_coverage: when reaching the last partial utterance, it may or may not have
    enough frames. If at least <min_pad_coverage> of <partials_n_frames> are present,
    then the last partial utterance will be considered by zero-padding the audio. Otherwise,
    it will be discarded. If there aren't enough frames for one partial utterance,
    this parameter is ignored so that the function always returns at least one slice.
    :return: the embedding as a numpy array of float32 of shape (model_embedding_size,). If
    <return_partials> is True, the partial utterances as a numpy array of float32 of shape
    (n_partials, model_embedding_size) and the wav partials as a list of slices will also be
    returned.
    """
    # Compute where to split the utterance into partials and pad the waveform with zeros if
    # the partial utterances cover a larger range.
    wav_slices, mel_slices = compute_partial_slices(len(wav), rate, min_coverage)
    max_wave_length = wav_slices[-1].stop
    if max_wave_length >= len(wav):
        wav = np.pad(wav, (0, max_wave_length - len(wav)), "constant")

    # Split the utterance into partials and forward them through the model
    mel = audio.wav_to_mel_spectrogram(wav)
    mels = np.array([mel[s] for s in mel_slices])
    with torch.no_grad():
        mels = torch.from_numpy(mels).to(torch.device("cpu"))
        output_layer = smodel.output(0)
        partial_embeds = smodel(mels)[output_layer]

    # Compute the utterance embedding from the partial embeddings
    raw_embed = np.mean(partial_embeds, axis=0)
    embed = raw_embed / np.linalg.norm(raw_embed, 2)

    if return_partials:
        return embed, partial_embeds, wav_slices
    return embed

Select device from dropdown list for running inference using OpenVINO

device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

Then compile model.

compiled_smodel = core.compile_model(ir_smodel, device.value)

Convert Decoder#

In the same way export SynthesizerTrn model, that implements decoder function to OpenVINO IR format.

OUTPUT_DIR = Path("output")
BASE_MODEL_NAME = "net_g"
onnx_net_g_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "_fp32")).with_suffix(".onnx")
ir_net_g_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "ir")).with_suffix(".xml")

dummy_input_1 = torch.randn(1, 1024, 81)
dummy_input_2 = torch.randn(1, 256)

# define forward as infer
net_g.forward = net_g.infer


if not ir_net_g_path.exists():
    ir_net_g_model = ov.convert_model(net_g, example_input=(dummy_input_1, dummy_input_2))
    ov.save_model(ir_net_g_model, ir_net_g_path)
else:
    ir_net_g_model = core.read_model(ir_net_g_path)
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/jit/_trace.py:1303: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Tensor-likes are not close!

Mismatched elements: 25885 / 25920 (99.9%)
Greatest absolute difference: 0.11462387628853321 at index (0, 0, 1782) (up to 1e-05 allowed)
Greatest relative difference: 71139.34281225452 at index (0, 0, 3236) (up to 1e-05 allowed)
  _check_trace(

Select device from dropdown list for running inference using OpenVINO

compiled_ir_net_g_model = core.compile_model(ir_net_g_model, device.value)

Define function for synthesizing.

def synthesize_audio(src, tgt):
    wav_tgt, _ = librosa.load(tgt, sr=hps.data.sampling_rate)
    wav_tgt, _ = librosa.effects.trim(wav_tgt, top_db=20)

    g_tgt = embed_utterance(wav_tgt, compiled_smodel)
    g_tgt = torch.from_numpy(g_tgt).unsqueeze(0)

    # src
    wav_src, _ = librosa.load(src, sr=hps.data.sampling_rate)
    wav_src = np.expand_dims(wav_src, axis=0)

    output_layer = compiled_cmodel.output(0)
    c = compiled_cmodel(wav_src)[output_layer]
    c = c.transpose((0, 2, 1))

    output_layer = compiled_ir_net_g_model.output(0)
    tgt_audio = compiled_ir_net_g_model((c, g_tgt))[output_layer]
    tgt_audio = tgt_audio[0][0]

    return tgt_audio

And now we can check inference using only IR models.

result_wav_names = []

with torch.no_grad():
    for line in tqdm(zip(srcs, tgts)):
        src, tgt = line

        output_audio = synthesize_audio(src, tgt)

        timestamp = time.strftime("%m-%d_%H-%M", time.localtime())
        result_name = f"{timestamp}.wav"
        result_wav_names.append(result_name)
        write(
            os.path.join("outputs/freevc", result_name),
            hps.data.sampling_rate,
            output_audio,
        )
2it [00:01,  1.32it/s]

Result audio files should be available in ‘outputs/freevc’ and you can check them and compare with generated earlier. Below one of the results presents.

Source audio (source of text):

import IPython.display as ipd

ipd.Audio(srcs[0])

Target audio (source of voice):

ipd.Audio(tgts[0])

Result audio:

ipd.Audio(f"outputs/freevc/{result_wav_names[0]}")

Also, you can use your own audio file. Just upload them and use for inference. Use rate corresponding to the value of hps.data.sampling_rate.

def infer(src, tgt):
    output_audio = synthesize_audio(src, tgt)
    timestamp = time.strftime("%m-%d_%H-%M", time.localtime())
    result_name = f"{timestamp}.wav"
    write(result_name, hps.data.sampling_rate, output_audio)
    return result_name


if not Path("gradio_helper.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/freevc-voice-conversion/gradio_helper.py")
    open("gradio_helper.py", "w").write(r.text)

from gradio_helper import make_demo

demo = make_demo(fn=infer)

try:
    demo.queue().launch(debug=False)
except Exception:
    demo.queue().launch(debug=False, share=True)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/"
Running on local URL:  http://127.0.0.1:7860

To create a public link, set share=True in launch().
# please uncomment and run this cell for stopping gradio interface
# demo.close()