High-Quality Text-Free One-Shot Voice Conversion with FreeVC and OpenVINO™#

This Jupyter notebook can be launched after a local installation only.

FreeVC converts the voice of a source speaker to the style of a target speaker while keeping the linguistic content unchanged, and it does so without text annotation.

The figure below illustrates the FreeVC model architecture for inference. This notebook concentrates only on the inference part. There are three main parts: Prior Encoder, Speaker Encoder and Decoder. The prior encoder contains a WavLM model, a bottleneck extractor and a normalizing flow. Detailed information is available in this paper.

Figure: FreeVC model architecture for inference (image_source)

FreeVC provides only a command-line interface, and it requires CUDA. This notebook shows how to use FreeVC from Python and without CUDA devices. It consists of the following steps:

  • Download and prepare models.

  • Inference.

  • Convert models to OpenVINO Intermediate Representation.

  • Inference using only OpenVINO’s IR models.

Pre-requisites#

These steps can be done manually; otherwise the notebook performs them automatically, but only to the minimum extent it needs.

  1. Clone the repository: git clone https://github.com/OlaWod/FreeVC.git

  2. Download WavLM-Large and put it in the FreeVC/wavlm/ directory.

  3. Optionally, download the VCTK dataset. This example downloads only two audio files from the Hugging Face FreeVC example.

  4. Download the pretrained models and put them in the ‘checkpoints’ directory (this example requires only freevc.pth).
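
As a quick way to see which of the manual steps are still outstanding, the following minimal sketch lists the expected files that are missing. The helper name `missing_prerequisites` is hypothetical (it is not part of FreeVC or OpenVINO); the paths are the ones this notebook uses, and the speaker encoder checkpoint that ships inside the cloned repository is not checked.

```python
from pathlib import Path

# Files this notebook expects, relative to the working directory.
REQUIRED_FILES = [
    "FreeVC/wavlm/WavLM-Large.pt",  # WavLM-Large checkpoint
    "checkpoints/freevc.pth",       # pretrained FreeVC model
    "p225_001.wav",                 # example audio
    "p226_002.wav",                 # example audio
]


def missing_prerequisites(root="."):
    """Return the required files that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


missing = missing_prerequisites()
print("All prerequisites found" if not missing else f"Still missing: {missing}")
```

An empty result means that the download cells below have nothing left to fetch.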

Install extra requirements

%pip install -q "openvino>=2023.3.0" "librosa>=0.8.1" "webrtcvad==2.0.10" "gradio>=4.19" "torch>=2.1" gdown scipy tqdm torchvision --extra-index-url https://download.pytorch.org/whl/cpu
Note: you may need to restart the kernel to use updated packages.
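
After restarting the kernel, it can be useful to confirm which versions were actually installed. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_versions` is hypothetical, and the package names are taken from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names taken from the install command above.
print(installed_versions(["openvino", "librosa", "webrtcvad", "gradio", "torch", "scipy", "tqdm"]))
```

A value of None indicates that the package is not visible to the current kernel.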

Check if FreeVC is installed and append its path to sys.path

from pathlib import Path
import sys


free_vc_repo = "FreeVC"
if not Path(free_vc_repo).exists():
    !git clone https://github.com/OlaWod/FreeVC.git

sys.path.append(free_vc_repo)
Cloning into 'FreeVC'...
remote: Enumerating objects: 131, done.
remote: Counting objects: 100% (65/65), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 131 (delta 39), reused 24 (delta 24), pack-reused 66
Receiving objects: 100% (131/131), 15.28 MiB | 27.90 MiB/s, done.
Resolving deltas: 100% (43/43), done.
# Fetch `notebook_utils` module
import requests
import gdown

r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
)

with open("notebook_utils.py", "w") as f:
    f.write(r.text)
from notebook_utils import download_file

wavlm_large_dir_path = Path("FreeVC/wavlm")
wavlm_large_path = wavlm_large_dir_path / "WavLM-Large.pt"

wavlm_url = "https://drive.google.com/uc?id=12-cB34qCTvByWT-QtOcZaqwwO21FLSqU&confirm=t&uuid=a703c43c-ccce-436c-8799-c11b88e9e7e4"

if not wavlm_large_path.exists():
    gdown.download(wavlm_url, str(wavlm_large_path))
Downloading...
From: https://drive.google.com/uc?id=12-cB34qCTvByWT-QtOcZaqwwO21FLSqU&confirm=t&uuid=a703c43c-ccce-436c-8799-c11b88e9e7e4
To: /opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-697/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/WavLM-Large.pt
100%|██████████| 1.26G/1.26G [00:34<00:00, 36.3MB/s]
freevc_chpt_dir = Path("checkpoints")
freevc_chpt_name = "freevc.pth"
freevc_chpt_path = freevc_chpt_dir / freevc_chpt_name

if not freevc_chpt_path.exists():
    download_file(
        f"https://storage.openvinotoolkit.org/repositories/openvino_notebooks/models/freevc/{freevc_chpt_name}",
        directory=freevc_chpt_dir,
    )
checkpoints/freevc.pth:   0%|          | 0.00/451M [00:00<?, ?B/s]
audio1_name = "p225_001.wav"
audio1_url = f"https://huggingface.co/spaces/OlaWod/FreeVC/resolve/main/{audio1_name}"
audio2_name = "p226_002.wav"
audio2_url = f"https://huggingface.co/spaces/OlaWod/FreeVC/resolve/main/{audio2_name}"

if not Path(audio1_name).exists():
    download_file(audio1_url)

if not Path(audio2_name).exists():
    download_file(audio2_url)
p225_001.wav:   0%|          | 0.00/50.8k [00:00<?, ?B/s]
p226_002.wav:   0%|          | 0.00/135k [00:00<?, ?B/s]

Imports and settings#

import logging
import os
import time

import librosa
import numpy as np
import torch
from scipy.io.wavfile import write
from tqdm import tqdm

import openvino as ov

import utils
from models import SynthesizerTrn
from speaker_encoder.voice_encoder import SpeakerEncoder
from wavlm import WavLM, WavLMConfig

logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

Redefine the get_cmodel function from utils so that it does not require CUDA

def get_cmodel():
    # Load the WavLM checkpoint onto the CPU so that no CUDA device is needed
    checkpoint = torch.load(wavlm_large_path, map_location="cpu")
    cfg = WavLMConfig(checkpoint["cfg"])
    cmodel = WavLM(cfg)
    cmodel.load_state_dict(checkpoint["model"])
    cmodel.eval()

    return cmodel

Models initialization

hps = utils.get_hparams_from_file("FreeVC/configs/freevc.json")
os.makedirs("outputs/freevc", exist_ok=True)

net_g = SynthesizerTrn(hps.data.filter_length // 2 + 1, hps.train.segment_size // hps.data.hop_length, **hps.model)

utils.load_checkpoint(freevc_chpt_path, net_g, optimizer=None, strict=True)
cmodel = get_cmodel()
smodel = SpeakerEncoder("FreeVC/speaker_encoder/ckpt/pretrained_bak_5805000.pt", device="cpu")
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-697/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Loaded the voice encoder model on cpu in 0.01 seconds.

Reading dataset settings

srcs = [audio1_name, audio2_name]
tgts = [audio2_name, audio1_name]
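
The two lists pair each file with the other one, so every voice is converted in both directions. With more than two recordings, the same pairing could be generated rather than written by hand. A minimal sketch using `itertools.permutations` from the standard library follows; the variable name `audio_files` is an assumption, and the file names are the two examples downloaded earlier.

```python
from itertools import permutations

audio_files = ["p225_001.wav", "p226_002.wav"]

# Every ordered (source, target) pair of distinct files.
pairs = list(permutations(audio_files, 2))
srcs = [src for src, _ in pairs]
tgts = [tgt for _, tgt in pairs]

print(srcs)  # ['p225_001.wav', 'p226_002.wav']
print(tgts)  # ['p226_002.wav', 'p225_001.wav']
```

For n files this produces n * (n - 1) conversions, so the list grows quickly as recordings are added.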

Inference

with torch.no_grad():
    for line in tqdm(zip(srcs, tgts)):
        src, tgt = line
        # tgt
        wav_tgt, _ = librosa.load(tgt, sr=hps.data.sampling_rate)
        wav_tgt, _ = librosa.effects.trim(wav_tgt, top_db=20)

        g_tgt = smodel.embed_utterance(wav_tgt)
        g_tgt = torch.from_numpy(g_tgt).unsqueeze(0)

        # src
        wav_src, _ = librosa.load(src, sr=hps.data.sampling_rate)
        wav_src = torch.from_numpy(wav_src).unsqueeze(0)

        c = utils.get_content(cmodel, wav_src)

        tgt_audio = net_g.infer(c, g=g_tgt)
        tgt_audio = tgt_audio[0][0].data.cpu().float().numpy()

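        # Put the source and target names in the file name: the timestamp has only
        # minute resolution, so on its own it would make both pairs overwrite each other
        timestamp = time.strftime("%m-%d_%H-%M", time.localtime())
        result_name = f"{timestamp}_{Path(src).stem}_to_{Path(tgt).stem}.wav"
        write(
            os.path.join("outputs/freevc", result_name),
            hps.data.sampling_rate,
            tgt_audio,
        )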
2it [00:03,  1.55s/it]

The resulting audio files should be available in ‘outputs/freevc’.

Convert Models to OpenVINO Intermediate Representation#

Convert each model to OpenVINO IR with FP16 precision. The ov.convert_model function accepts the original PyTorch model object and example inputs for tracing, and returns an OpenVINO Model class instance that represents this model. The obtained model is ready to use: it can be loaded on a device with compile_model or saved to disk with the ov.save_model function. The read_model method loads a saved model from disk. For more information about model conversion, see this page.
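
Each conversion cell in this section follows the same convert-once, reuse-later pattern: if the IR file already exists it is read from disk, otherwise the model is converted and saved. Below is a minimal sketch of that pattern as a generic helper. `load_or_build` and its three callables are hypothetical names, and a plain text file stands in for the model so that the example runs without OpenVINO.

```python
import tempfile
from pathlib import Path


def load_or_build(path, build, save, load):
    """Return load(path) if the file exists; otherwise build, save and return the object."""
    path = Path(path)
    if path.exists():
        return load(path)
    obj = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    save(obj, path)
    return obj


# Stand-in demo: a string plays the role of the converted model.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "output" / "model.txt"
    calls = []

    def build():
        calls.append("build")
        return "converted model"

    def save(obj, path):
        path.write_text(obj)

    def load(path):
        return path.read_text()

    first = load_or_build(target, build, save, load)   # builds and saves
    second = load_or_build(target, build, save, load)  # reads from disk
    print(first == second, calls)
```

In this notebook the three callables would correspond to a function that calls ov.convert_model with the example input, to ov.save_model, and to core.read_model.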

First, convert the WavLM model, which is part of the Prior Encoder, to OpenVINO’s IR format. The code keeps the model's original name: cmodel.

# define forward as extract_features for compatibility
cmodel.forward = cmodel.extract_features
OUTPUT_DIR = Path("output")
BASE_MODEL_NAME = "cmodel"

OUTPUT_DIR.mkdir(exist_ok=True)

ir_cmodel_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "_ir")).with_suffix(".xml")

length = 32000

dummy_input = torch.randn(1, length)

Converting to OpenVINO’s IR format.

core = ov.Core()


class ModelWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input):
        return self.model(input)[0]


if not ir_cmodel_path.exists():
    ir_cmodel = ov.convert_model(ModelWrapper(cmodel), example_input=dummy_input)
    ov.save_model(ir_cmodel, ir_cmodel_path)
else:
    ir_cmodel = core.read_model(ir_cmodel_path)
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-697/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/modules.py:495: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert embed_dim == self.embed_dim
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-697/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/modules.py:496: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert list(query.size()) == [tgt_len, bsz, embed_dim]
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-697/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/modules.py:500: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert key_bsz == bsz
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-697/.workspace/scm/ov-notebook/notebooks/freevc-voice-conversion/FreeVC/wavlm/modules.py:502: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert src_len, bsz == value.shape[:2]

Select device from dropdown list for running inference using OpenVINO

import ipywidgets as widgets

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value="AUTO",
    description="Device:",
    disabled=False,
)

device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
compiled_cmodel = core.compile_model(ir_cmodel, device.value)

Next, convert the speaker encoder (smodel) to OpenVINO’s IR format.

OUTPUT_DIR = Path("output")
BASE_MODEL_NAME = "smodel"

OUTPUT_DIR.mkdir(exist_ok=True)

ir_smodel_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "ir")).with_suffix(".xml")


length = 32000

dummy_input = torch.randn(1, length, 40)

if not ir_smodel_path.exists():
    ir_smodel = ov.convert_model(smodel, example_input=dummy_input)
    ov.save_model(ir_smodel, ir_smodel_path)
else:
    ir_smodel = core.read_model(ir_smodel_path)

To prepare the input for inference, define helper functions based on the methods of the speaker_encoder.voice_encoder.SpeakerEncoder class.

from speaker_encoder.hparams import sampling_rate, mel_window_step, partials_n_frames
from speaker_encoder import audio


def compute_partial_slices(n_samples: int, rate, min_coverage):
    """
    Computes where to split an utterance waveform and its corresponding mel spectrogram to
    obtain partial utterances of <partials_n_frames> each. Both the waveform and the
    mel spectrogram slices are returned, so as to make each partial utterance waveform
    correspond to its spectrogram.

    The returned ranges may be indexing further than the length of the waveform. It is
    recommended that you pad the waveform with zeros up to wav_slices[-1].stop.

    :param n_samples: the number of samples in the waveform
    :param rate: how many partial utterances should occur per second. Partial utterances must
    cover the span of the entire utterance, thus the rate should not be lower than the inverse
    of the duration of a partial utterance. By default, partial utterances are 1.6s long and
    the minimum rate is thus 0.625.
    :param min_coverage: when reaching the last partial utterance, it may or may not have
    enough frames. If at least <min_coverage> of <partials_n_frames> are present,
    then the last partial utterance will be considered by zero-padding the audio. Otherwise,
    it will be discarded. If there aren't enough frames for one partial utterance,
    this parameter is ignored so that the function always returns at least one slice.
    :return: the waveform slices and mel spectrogram slices as lists of array slices. Index
    respectively the waveform and the mel spectrogram with these slices to obtain the partial
    utterances.
    """
    assert 0 < min_coverage <= 1

    # Compute how many frames separate two partial utterances
    samples_per_frame = int((sampling_rate * mel_window_step / 1000))
    n_frames = int(np.ceil((n_samples + 1) / samples_per_frame))
    frame_step = int(np.round((sampling_rate / rate) / samples_per_frame))
    assert 0 < frame_step, "The rate is too high"
    assert frame_step <= partials_n_frames, "The rate is too low, it should be %f at least" % (sampling_rate / (samples_per_frame * partials_n_frames))

    # Compute the slices
    wav_slices, mel_slices = [], []
    steps = max(1, n_frames - partials_n_frames + frame_step + 1)
    for i in range(0, steps, frame_step):
        mel_range = np.array([i, i + partials_n_frames])
        wav_range = mel_range * samples_per_frame
        mel_slices.append(slice(*mel_range))
        wav_slices.append(slice(*wav_range))

    # Evaluate whether extra padding is warranted or not
    last_wav_range = wav_slices[-1]
    coverage = (n_samples - last_wav_range.start) / (last_wav_range.stop - last_wav_range.start)
    if coverage < min_coverage and len(mel_slices) > 1:
        mel_slices = mel_slices[:-1]
        wav_slices = wav_slices[:-1]

    return wav_slices, mel_slices


def embed_utterance(
    wav: np.ndarray,
    smodel: ov.CompiledModel,
    return_partials=False,
    rate=1.3,
    min_coverage=0.75,
):
    """
    Computes an embedding for a single utterance. The utterance is divided in partial
    utterances and an embedding is computed for each. The complete utterance embedding is the
    L2-normed average embedding of the partial utterances.

    :param wav: a preprocessed utterance waveform as a numpy array of float32
    :param smodel: compiled speaker encoder model.
    :param return_partials: if True, the partial embeddings will also be returned along with
    the wav slices corresponding to each partial utterance.
    :param rate: how many partial utterances should occur per second. Partial utterances must
    cover the span of the entire utterance, thus the rate should not be lower than the inverse
    of the duration of a partial utterance. By default, partial utterances are 1.6s long and
    the minimum rate is thus 0.625.
    :param min_coverage: when reaching the last partial utterance, it may or may not have
    enough frames. If at least <min_coverage> of <partials_n_frames> are present,
    then the last partial utterance will be considered by zero-padding the audio. Otherwise,
    it will be discarded. If there aren't enough frames for one partial utterance,
    this parameter is ignored so that the function always returns at least one slice.
    :return: the embedding as a numpy array of float32 of shape (model_embedding_size,). If
    <return_partials> is True, the partial utterances as a numpy array of float32 of shape
    (n_partials, model_embedding_size) and the wav partials as a list of slices will also be
    returned.
    """
    # Compute where to split the utterance into partials and pad the waveform with zeros if
    # the partial utterances cover a larger range.
    wav_slices, mel_slices = compute_partial_slices(len(wav), rate, min_coverage)
    max_wave_length = wav_slices[-1].stop
    if max_wave_length >= len(wav):
        wav = np.pad(wav, (0, max_wave_length - len(wav)), "constant")

    # Split the utterance into partials and forward them through the model
    mel = audio.wav_to_mel_spectrogram(wav)
    mels = np.array([mel[s] for s in mel_slices])
    with torch.no_grad():
        mels = torch.from_numpy(mels).to(torch.device("cpu"))
        output_layer = smodel.output(0)
        partial_embeds = smodel(mels)[output_layer]

    # Compute the utterance embedding from the partial embeddings
    raw_embed = np.mean(partial_embeds, axis=0)
    embed = raw_embed / np.linalg.norm(raw_embed, 2)

    if return_partials:
        return embed, partial_embeds, wav_slices
    return embed
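
To make the slicing arithmetic concrete, here is a standalone sketch that repeats the computation from compute_partial_slices for a 3-second waveform. The three constants are assumptions rather than values read from speaker_encoder.hparams: a 16 kHz sampling rate, a 10 ms mel window step and 160 frames per partial utterance, chosen to be consistent with the 1.6 s partial length mentioned in the docstring. The function name `partial_slices` is hypothetical.

```python
import math

# Assumed hyperparameters (not read from speaker_encoder.hparams).
SAMPLING_RATE = 16000    # Hz
MEL_WINDOW_STEP = 10     # ms between mel frames
PARTIALS_N_FRAMES = 160  # frames per partial utterance (1.6 s)


def partial_slices(n_samples, rate=1.3, min_coverage=0.75):
    """Return (start, stop) sample ranges of the partial utterances."""
    samples_per_frame = SAMPLING_RATE * MEL_WINDOW_STEP // 1000
    n_frames = math.ceil((n_samples + 1) / samples_per_frame)
    frame_step = round((SAMPLING_RATE / rate) / samples_per_frame)

    steps = max(1, n_frames - PARTIALS_N_FRAMES + frame_step + 1)
    ranges = [
        (i * samples_per_frame, (i + PARTIALS_N_FRAMES) * samples_per_frame)
        for i in range(0, steps, frame_step)
    ]

    # Drop the last partial if too little of it is covered by real audio.
    start, stop = ranges[-1]
    coverage = (n_samples - start) / (stop - start)
    if coverage < min_coverage and len(ranges) > 1:
        ranges = ranges[:-1]
    return ranges


n_samples = 3 * SAMPLING_RATE  # 3 seconds of audio
ranges = partial_slices(n_samples)
print(ranges)                     # [(0, 25600), (12320, 37920), (24640, 50240)]
print(ranges[-1][1] - n_samples)  # 2240 zero samples needed as padding
```

Under these assumptions the three partial utterances overlap by roughly half, and the last one reaches past the end of the waveform, which is why embed_utterance pads the audio with zeros before computing the mel spectrogram.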

Select device from dropdown list for running inference using OpenVINO

device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

Then compile the model.

compiled_smodel = core.compile_model(ir_smodel, device.value)

In the same way, export the SynthesizerTrn model, which implements the decoder function, to OpenVINO IR format.

OUTPUT_DIR = Path("output")
BASE_MODEL_NAME = "net_g"
onnx_net_g_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "_fp32")).with_suffix(".onnx")
ir_net_g_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "ir")).with_suffix(".xml")

dummy_input_1 = torch.randn(1, 1024, 81)
dummy_input_2 = torch.randn(1, 256)

# define forward as infer
net_g.forward = net_g.infer


if not ir_net_g_path.exists():
    ir_net_g_model = ov.convert_model(net_g, example_input=(dummy_input_1, dummy_input_2))
    ov.save_model(ir_net_g_model, ir_net_g_path)
else:
    ir_net_g_model = core.read_model(ir_net_g_path)
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-697/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/jit/_trace.py:1116: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Tensor-likes are not close!

Mismatched elements: 25632 / 25920 (98.9%)
Greatest absolute difference: 0.05565535183995962 at index (0, 0, 18928) (up to 1e-05 allowed)
Greatest relative difference: 506443.2686567164 at index (0, 0, 10753) (up to 1e-05 allowed)
  _check_trace(
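
The numbers in this warning also show how long the decoder output is. The traced output holds 25920 elements for the 81 content frames of dummy_input_1, which is 320 waveform samples per frame. A short sketch of that arithmetic follows. Treating 320 as the hop length is an inference from these two numbers, and the 16 kHz sampling rate is an assumption about hps.data.sampling_rate rather than a value read from the configuration.

```python
# Values taken from the example input and the tracer warning above.
n_frames = 81             # last dimension of dummy_input_1
n_output_samples = 25920  # elements compared by the tracer check

samples_per_frame = n_output_samples // n_frames
assert samples_per_frame * n_frames == n_output_samples  # divides exactly

# Assumed sampling rate (hps.data.sampling_rate is not printed in this notebook).
assumed_sampling_rate = 16000
duration_s = n_output_samples / assumed_sampling_rate

print(samples_per_frame)  # 320
print(duration_s)         # 1.62
```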

Compile the model for the device selected in the dropdown list above.

compiled_ir_net_g_model = core.compile_model(ir_net_g_model, device.value)

Define a function for synthesizing audio.

def synthesize_audio(src, tgt):
    wav_tgt, _ = librosa.load(tgt, sr=hps.data.sampling_rate)
    wav_tgt, _ = librosa.effects.trim(wav_tgt, top_db=20)

    g_tgt = embed_utterance(wav_tgt, compiled_smodel)
    g_tgt = torch.from_numpy(g_tgt).unsqueeze(0)

    # src
    wav_src, _ = librosa.load(src, sr=hps.data.sampling_rate)
    wav_src = np.expand_dims(wav_src, axis=0)

    output_layer = compiled_cmodel.output(0)
    c = compiled_cmodel(wav_src)[output_layer]
    c = c.transpose((0, 2, 1))

    output_layer = compiled_ir_net_g_model.output(0)
    tgt_audio = compiled_ir_net_g_model((c, g_tgt))[output_layer]
    tgt_audio = tgt_audio[0][0]

    return tgt_audio

And now we can check inference using only IR models.

result_wav_names = []

with torch.no_grad():
    for line in tqdm(zip(srcs, tgts)):
        src, tgt = line

        output_audio = synthesize_audio(src, tgt)

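        # The source and target names keep the pairs apart (the timestamp has only minute
        # resolution); the "_ov" suffix keeps these files apart from the PyTorch results
        timestamp = time.strftime("%m-%d_%H-%M", time.localtime())
        result_name = f"{timestamp}_{Path(src).stem}_to_{Path(tgt).stem}_ov.wav"
        result_wav_names.append(result_name)
        write(
            os.path.join("outputs/freevc", result_name),
            hps.data.sampling_rate,
            output_audio,
        )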
2it [00:01,  1.31it/s]

The resulting audio files should be available in ‘outputs/freevc’; you can listen to them and compare them with the ones generated earlier. One of the results is presented below.

Source audio (source of text):

import IPython.display as ipd

ipd.Audio(srcs[0])

Target audio (source of voice):

ipd.Audio(tgts[0])