High-Quality Text-Free One-Shot Voice Conversion with FreeVC and OpenVINO™

This tutorial is also available as a Jupyter notebook that can be cloned directly from GitHub. See the installation guide for instructions to run this tutorial locally on Windows, Linux or macOS.


FreeVC allows altering the voice of a source speaker to a target style, while keeping the linguistic content unchanged, without requiring text annotation.

The figure below illustrates the model architecture of FreeVC for inference. In this notebook we concentrate only on the inference part. There are three main parts: Prior Encoder, Speaker Encoder and Decoder. The prior encoder contains a WavLM model, a bottleneck extractor and a normalizing flow. Detailed information is available in this paper.

Figure: FreeVC model architecture for inference (image source)
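
The sketch below summarizes how these three parts interact at inference time. The function and variable names are illustrative only and are not the actual FreeVC API: the prior encoder extracts speaker-independent content from the source audio, the speaker encoder embeds the target voice, and the decoder combines the two.

# Conceptual sketch of the FreeVC inference pipeline (illustrative names only)
def convert_voice(prior_encoder, speaker_encoder, decoder, wav_src, wav_tgt):
    content = prior_encoder(wav_src)              # linguistic content of the source speech
    speaker_embedding = speaker_encoder(wav_tgt)  # voice characteristics of the target speaker
    return decoder(content, speaker_embedding)    # source content rendered in the target voice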

The FreeVC repository offers only a command-line interface and requires CUDA. This notebook shows how to use FreeVC from Python and without CUDA devices. It consists of the following steps:

  • Download and prepare models.

  • Inference.

  • Convert models to OpenVINO Intermediate Representation.

  • Inference using only OpenVINO’s IR models.


Prerequisites

These steps can be done manually, or they will be performed automatically during the execution of the notebook, but only to the minimum necessary extent.

  1. Clone this repo:

git clone https://github.com/OlaWod/FreeVC.git

  2. Download WavLM-Large and put it under the FreeVC/wavlm/ directory.

  3. You can download the VCTK dataset. For this example we download only two audio files from the Hugging Face FreeVC example.

  4. Download the pretrained models and put them under the ‘checkpoints’ directory (for the current example only freevc.pth is required).

Install extra requirements

!pip install -q "librosa>=0.8.1"
!pip install -q "webrtcvad==2.0.10"
!pip install -q gradio
DEPRECATION: pytorch-lightning 1.6.5 has a non-standard dependency specifier torch>=1.8.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pytorch-lightning or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063

Check if the FreeVC repository is cloned and append its path to sys.path

from pathlib import Path
import sys


free_vc_repo = 'FreeVC'
if not Path(free_vc_repo).exists():
    !git clone https://github.com/OlaWod/FreeVC.git

sys.path.append(free_vc_repo)
Cloning into 'FreeVC'...
remote: Enumerating objects: 131, done.
remote: Counting objects: 100% (61/61), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 131 (delta 36), reused 21 (delta 21), pack-reused 70
Receiving objects: 100% (131/131), 15.28 MiB | 4.14 MiB/s, done.
Resolving deltas: 100% (43/43), done.
sys.path.append("../utils")
from notebook_utils import download_file

wavlm_large_dir_path = Path('FreeVC/wavlm')
wavlm_large_path = wavlm_large_dir_path / 'WavLM-Large.pt'

if not wavlm_large_path.exists():
    download_file(
        'https://valle.blob.core.windows.net/share/wavlm/WavLM-Large.pt?sv=2020-08-04&st=2023-03-01T07%3A51%3A05Z&se=2033-03-02T07%3A51%3A00Z&sr=c&sp=rl&sig=QJXmSJG9DbMKf48UDIU1MfzIro8HQOf3sqlNXiflY1I%3D',
        directory=wavlm_large_dir_path
    )
FreeVC/wavlm/WavLM-Large.pt:   0%|          | 0.00/1.18G [00:00<?, ?B/s]
freevc_chpt_dir = Path('checkpoints')
freevc_chpt_name = 'freevc.pth'
freevc_chpt_path = freevc_chpt_dir / freevc_chpt_name

if not freevc_chpt_path.exists():
    download_file(
        f'https://storage.openvinotoolkit.org/repositories/openvino_notebooks/models/freevc/{freevc_chpt_name}',
        directory=freevc_chpt_dir
    )
checkpoints/freevc.pth:   0%|          | 0.00/451M [00:00<?, ?B/s]
audio1_name = 'p225_001.wav'
audio1_url = f'https://huggingface.co/spaces/OlaWod/FreeVC/resolve/main/{audio1_name}'
audio2_name = 'p226_002.wav'
audio2_url = f'https://huggingface.co/spaces/OlaWod/FreeVC/resolve/main/{audio2_name}'

if not Path(audio1_name).exists():
    download_file(audio1_url)

if not Path(audio2_name).exists():
    download_file(audio2_url)
p225_001.wav:   0%|          | 0.00/50.8k [00:00<?, ?B/s]
p226_002.wav:   0%|          | 0.00/135k [00:00<?, ?B/s]

Imports and settings

import logging
import os
import time

import librosa
import numpy as np
import torch
from scipy.io.wavfile import write
from tqdm import tqdm

from openvino.runtime import Core, serialize
from openvino.runtime.ie_api import CompiledModel
from openvino.tools import mo

import utils
from models import SynthesizerTrn
from speaker_encoder.voice_encoder import SpeakerEncoder
from wavlm import WavLM, WavLMConfig

logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

Redefine the get_cmodel function from utils to exclude CUDA

def get_cmodel():
    checkpoint = torch.load(wavlm_large_path)
    cfg = WavLMConfig(checkpoint['cfg'])
    cmodel = WavLM(cfg)
    cmodel.load_state_dict(checkpoint['model'])
    cmodel.eval()

    return cmodel

Models initialization

hps = utils.get_hparams_from_file('FreeVC/configs/freevc.json')
os.makedirs('outputs/freevc', exist_ok=True)

net_g = SynthesizerTrn(
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model
)

utils.load_checkpoint(freevc_chpt_path, net_g, optimizer=None, strict=True)
cmodel = get_cmodel()
smodel = SpeakerEncoder('FreeVC/speaker_encoder/ckpt/pretrained_bak_5805000.pt', device='cpu')
Loaded the voice encoder model on cpu in 0.01 seconds.

Reading dataset settings

srcs = [audio1_name, audio2_name]
tgts = [audio2_name, audio1_name]

Inference

with torch.no_grad():
    for line in tqdm(zip(srcs, tgts)):
        src, tgt = line
        # tgt
        wav_tgt, _ = librosa.load(tgt, sr=hps.data.sampling_rate)
        wav_tgt, _ = librosa.effects.trim(wav_tgt, top_db=20)

        g_tgt = smodel.embed_utterance(wav_tgt)
        g_tgt = torch.from_numpy(g_tgt).unsqueeze(0)

        # src
        wav_src, _ = librosa.load(src, sr=hps.data.sampling_rate)
        wav_src = torch.from_numpy(wav_src).unsqueeze(0)

        c = utils.get_content(cmodel, wav_src)

        tgt_audio = net_g.infer(c, g=g_tgt)
        tgt_audio = tgt_audio[0][0].data.cpu().float().numpy()

        timestamp = time.strftime("%m-%d_%H-%M", time.localtime())
        write(os.path.join('outputs/freevc', "{}.wav".format(timestamp)), hps.data.sampling_rate,
              tgt_audio)
2it [00:01,  1.27it/s]

Result audio files should be available in the ‘outputs/freevc’ directory.
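
As a quick optional check (a minimal sketch; it assumes the loop above has written at least one file), you can list the generated files and play the latest one directly in the notebook:

import IPython.display as ipd

generated_files = sorted(Path('outputs/freevc').glob('*.wav'))
print(generated_files)
ipd.Audio(str(generated_files[-1]))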

Convert Models to OpenVINO Intermediate Representation

Convert each model to ONNX format and then use the model conversion Python API to convert the ONNX model to OpenVINO IR with FP16 precision. The mo.convert_model function accepts the path to a model and returns an OpenVINO Model class instance which represents this model. The obtained model is ready to be loaded on a device using compile_model or saved to disk using the serialize function. The read_model method loads a saved model from disk. For more information about model conversion, see this page.
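
The same pattern is applied to each model below. As a minimal sketch of this flow (the file names here are placeholders, not files produced by this notebook):

from openvino.tools import mo
from openvino.runtime import Core, serialize

core = Core()
ov_model = mo.convert_model("model.onnx", compress_to_fp16=True)  # ONNX -> OpenVINO Model
serialize(ov_model, "model_ir.xml")                               # save the IR (.xml + .bin) to disk
loaded_model = core.read_model("model_ir.xml")                    # read the saved IR back
compiled_model = core.compile_model(loaded_model, "AUTO")         # compile for a device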

Convert Prior Encoder.

First, we convert the WavLM model, which is part of the prior encoder, to the ONNX format and then to OpenVINO’s IR format. We keep the model’s original name from the code: cmodel.

# define forward as extract_features for compatibility
cmodel.forward = cmodel.extract_features

Convert cmodel to ONNX.

OUTPUT_DIR = Path("output")
BASE_MODEL_NAME = "cmodel"

OUTPUT_DIR.mkdir(exist_ok=True)

onnx_cmodel_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "_fp32")).with_suffix(".onnx")
ir_cmodel_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "_ir")).with_suffix(".xml")

length = 32000
input_shape = (1, length)

input_names = ['input']
output_names = ['output']
dummy_input = torch.randn(1, length)
dynamic_axes = {
    'input': {1: 'length'},
    'output': {1: 'out_length'}
}

torch.onnx.export(
    cmodel,
    dummy_input,
    onnx_cmodel_path,
    input_names=input_names,
    output_names=output_names,
    dynamic_axes=dynamic_axes
)
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-475/.workspace/scm/ov-notebook/notebooks/242-freevc-voice-conversion/FreeVC/wavlm/WavLM.py:352: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if mask:
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-475/.workspace/scm/ov-notebook/notebooks/242-freevc-voice-conversion/FreeVC/wavlm/modules.py:495: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert embed_dim == self.embed_dim
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-475/.workspace/scm/ov-notebook/notebooks/242-freevc-voice-conversion/FreeVC/wavlm/modules.py:496: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert list(query.size()) == [tgt_len, bsz, embed_dim]
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-475/.workspace/scm/ov-notebook/notebooks/242-freevc-voice-conversion/FreeVC/wavlm/modules.py:500: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert key_bsz == bsz
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-475/.workspace/scm/ov-notebook/notebooks/242-freevc-voice-conversion/FreeVC/wavlm/modules.py:502: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert src_len, bsz == value.shape[:2]
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-475/.workspace/scm/ov-notebook/notebooks/242-freevc-voice-conversion/FreeVC/wavlm/WavLM.py:372: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  feature = res["features"] if ret_conv else res["x"]
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-475/.workspace/scm/ov-notebook/notebooks/242-freevc-voice-conversion/FreeVC/wavlm/WavLM.py:373: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if ret_layer_results:

Converting to OpenVINO’s IR format.

core = Core()

if not ir_cmodel_path.exists():
    ir_cmodel = mo.convert_model(onnx_cmodel_path, compress_to_fp16=True)
    serialize(ir_cmodel, str(ir_cmodel_path))
else:
    ir_cmodel = core.read_model(ir_cmodel_path)

Select device from dropdown list for running inference using OpenVINO

import ipywidgets as widgets

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)

device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
compiled_cmodel = core.compile_model(ir_cmodel, device.value)

Convert SpeakerEncoder

Converting to ONNX format.

OUTPUT_DIR = Path("output")
BASE_MODEL_NAME = "smodel"

OUTPUT_DIR.mkdir(exist_ok=True)

onnx_smodel_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "_fp32")).with_suffix(".onnx")
ir_smodel_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "ir")).with_suffix(".xml")


length = 32000

input_names = ['input']
output_names = ['output']
dummy_input = torch.randn(1, length, 40)
dynamic_axes = {
    'input': {
        0: 'branch_size',
        1: 'length'
    },
    'output': {1: 'out_length'}
}

torch.onnx.export(smodel, dummy_input, onnx_smodel_path, input_names=input_names, output_names=output_names, dynamic_axes=dynamic_axes)
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-475/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/onnx/symbolic_opset9.py:4315: UserWarning: Exporting a model to ONNX with a batch_size other than 1, with a variable length with LSTM can cause an error when running the ONNX model with a different batch size. Make sure to save the model with a batch size of 1, or define the initial states (h0/c0) as inputs of the model.
  warnings.warn(
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-475/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/onnx/_internal/jit_utils.py:258: UserWarning: The shape inference of prim::Constant type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function. (Triggered internally at ../torch/csrc/jit/passes/onnx/shape_type_inference.cpp:1884.)
  _C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-475/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/onnx/utils.py:687: UserWarning: The shape inference of prim::Constant type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function. (Triggered internally at ../torch/csrc/jit/passes/onnx/shape_type_inference.cpp:1884.)
  _C._jit_pass_onnx_graph_shape_type_inference(
/opt/home/k8sworker/ci-ai/cibuilds/ov-notebook/OVNotebookOps-475/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/onnx/utils.py:1178: UserWarning: The shape inference of prim::Constant type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function. (Triggered internally at ../torch/csrc/jit/passes/onnx/shape_type_inference.cpp:1884.)
  _C._jit_pass_onnx_graph_shape_type_inference(

Converting to OpenVINO’s IR format.

if not ir_smodel_path.exists():
    ir_smodel = mo.convert_model(onnx_smodel_path, compress_to_fp16=True)
    serialize(ir_smodel, str(ir_smodel_path))
else:
    ir_smodel = core.read_model(ir_smodel_path)

To prepare input for inference, we define helper functions based on the speaker_encoder.voice_encoder.SpeakerEncoder class methods

from speaker_encoder.hparams import sampling_rate, mel_window_step, partials_n_frames
from speaker_encoder import audio


def compute_partial_slices(n_samples: int, rate, min_coverage):
    """
    Computes where to split an utterance waveform and its corresponding mel spectrogram to
    obtain partial utterances of <partials_n_frames> each. Both the waveform and the
    mel spectrogram slices are returned, so as to make each partial utterance waveform
    correspond to its spectrogram.

    The returned ranges may be indexing further than the length of the waveform. It is
    recommended that you pad the waveform with zeros up to wav_slices[-1].stop.

    :param n_samples: the number of samples in the waveform
    :param rate: how many partial utterances should occur per second. Partial utterances must
    cover the span of the entire utterance, thus the rate should not be lower than the inverse
    of the duration of a partial utterance. By default, partial utterances are 1.6s long and
    the minimum rate is thus 0.625.
    :param min_coverage: when reaching the last partial utterance, it may or may not have
    enough frames. If at least <min_pad_coverage> of <partials_n_frames> are present,
    then the last partial utterance will be considered by zero-padding the audio. Otherwise,
    it will be discarded. If there aren't enough frames for one partial utterance,
    this parameter is ignored so that the function always returns at least one slice.
    :return: the waveform slices and mel spectrogram slices as lists of array slices. Index
    respectively the waveform and the mel spectrogram with these slices to obtain the partial
    utterances.
    """
    assert 0 < min_coverage <= 1

    # Compute how many frames separate two partial utterances
    samples_per_frame = int((sampling_rate * mel_window_step / 1000))
    n_frames = int(np.ceil((n_samples + 1) / samples_per_frame))
    frame_step = int(np.round((sampling_rate / rate) / samples_per_frame))
    assert 0 < frame_step, "The rate is too high"
    assert frame_step <= partials_n_frames, "The rate is too low, it should be %f at least" % \
        (sampling_rate / (samples_per_frame * partials_n_frames))

    # Compute the slices
    wav_slices, mel_slices = [], []
    steps = max(1, n_frames - partials_n_frames + frame_step + 1)
    for i in range(0, steps, frame_step):
        mel_range = np.array([i, i + partials_n_frames])
        wav_range = mel_range * samples_per_frame
        mel_slices.append(slice(*mel_range))
        wav_slices.append(slice(*wav_range))

    # Evaluate whether extra padding is warranted or not
    last_wav_range = wav_slices[-1]
    coverage = (n_samples - last_wav_range.start) / (last_wav_range.stop - last_wav_range.start)
    if coverage < min_coverage and len(mel_slices) > 1:
        mel_slices = mel_slices[:-1]
        wav_slices = wav_slices[:-1]

    return wav_slices, mel_slices


def embed_utterance(wav: np.ndarray, smodel: CompiledModel, return_partials=False, rate=1.3, min_coverage=0.75):
    """
    Computes an embedding for a single utterance. The utterance is divided in partial
    utterances and an embedding is computed for each. The complete utterance embedding is the
    L2-normed average embedding of the partial utterances.

    :param wav: a preprocessed utterance waveform as a numpy array of float32
    :param smodel: compiled speaker encoder model.
    :param return_partials: if True, the partial embeddings will also be returned along with
    the wav slices corresponding to each partial utterance.
    :param rate: how many partial utterances should occur per second. Partial utterances must
    cover the span of the entire utterance, thus the rate should not be lower than the inverse
    of the duration of a partial utterance. By default, partial utterances are 1.6s long and
    the minimum rate is thus 0.625.
    :param min_coverage: when reaching the last partial utterance, it may or may not have
    enough frames. If at least <min_pad_coverage> of <partials_n_frames> are present,
    then the last partial utterance will be considered by zero-padding the audio. Otherwise,
    it will be discarded. If there aren't enough frames for one partial utterance,
    this parameter is ignored so that the function always returns at least one slice.
    :return: the embedding as a numpy array of float32 of shape (model_embedding_size,). If
    <return_partials> is True, the partial utterances as a numpy array of float32 of shape
    (n_partials, model_embedding_size) and the wav partials as a list of slices will also be
    returned.
    """
    # Compute where to split the utterance into partials and pad the waveform with zeros if
    # the partial utterances cover a larger range.
    wav_slices, mel_slices = compute_partial_slices(len(wav), rate, min_coverage)
    max_wave_length = wav_slices[-1].stop
    if max_wave_length >= len(wav):
        wav = np.pad(wav, (0, max_wave_length - len(wav)), "constant")

    # Split the utterance into partials and forward them through the model
    mel = audio.wav_to_mel_spectrogram(wav)
    mels = np.array([mel[s] for s in mel_slices])
    with torch.no_grad():
        mels = torch.from_numpy(mels).to(torch.device('cpu'))
        output_layer = smodel.output(0)
        partial_embeds = smodel(mels)[output_layer]

    # Compute the utterance embedding from the partial embeddings
    raw_embed = np.mean(partial_embeds, axis=0)
    embed = raw_embed / np.linalg.norm(raw_embed, 2)

    if return_partials:
        return embed, partial_embeds, wav_slices
    return embed

Select device from dropdown list for running inference using OpenVINO

device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

Then compile the model.

compiled_smodel = core.compile_model(ir_smodel, device.value)
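
As an optional sanity check (a sketch that reuses the target audio file downloaded earlier), the compiled speaker encoder can be exercised through the helper defined above:

wav_check, _ = librosa.load(audio2_name, sr=hps.data.sampling_rate)
embedding = embed_utterance(wav_check, compiled_smodel)
print(embedding.shape)  # expected to be (256,), the speaker embedding consumed by the decoder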

Convert Decoder

In the same way, export the SynthesizerTrn model, which implements the decoder function, to ONNX format and convert it to OpenVINO IR format.

OUTPUT_DIR = Path("output")
BASE_MODEL_NAME = "net_g"
onnx_net_g_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "_fp32")).with_suffix(".onnx")
ir_net_g_path = Path(OUTPUT_DIR / (BASE_MODEL_NAME + "ir")).with_suffix(".xml")

dummy_input_1 = torch.randn(1, 1024, 81)
dummy_input_2 = torch.randn(1, 256)

input_names = ['input1', 'input2']
output_names = ['output']
dynamic_axes = {
    'input1': {
        0: 'branch_size',
        2: 'length'
    },
    'input2': {
        0: 'branch_size',
    },
    'output': {1: 'out_length'}
}

# define forward as infer
net_g.forward = net_g.infer

torch.onnx.export(net_g, (dummy_input_1, dummy_input_2), onnx_net_g_path, input_names=input_names, output_names=output_names, dynamic_axes=dynamic_axes)

if not ir_net_g_path.exists():
    ir_net_g_model = mo.convert_model(onnx_net_g_path, compress_to_fp16=True)
    serialize(ir_net_g_model, str(ir_net_g_path))
else:
    ir_net_g_model = core.read_model(ir_net_g_path)

Select device from dropdown list for running inference using OpenVINO

compiled_ir_net_g_model = core.compile_model(ir_net_g_model, device.value)

Define a function for synthesizing audio.

def synthesize_audio(src, tgt):
    wav_tgt, _ = librosa.load(tgt, sr=hps.data.sampling_rate)
    wav_tgt, _ = librosa.effects.trim(wav_tgt, top_db=20)

    g_tgt = embed_utterance(wav_tgt, compiled_smodel)
    g_tgt = torch.from_numpy(g_tgt).unsqueeze(0)

    # src
    wav_src, _ = librosa.load(src, sr=hps.data.sampling_rate)
    wav_src = np.expand_dims(wav_src, axis=0)

    output_layer = compiled_cmodel.output(0)
    c = compiled_cmodel(wav_src)[output_layer]
    c = c.transpose((0, 2, 1))

    output_layer = compiled_ir_net_g_model.output(0)
    tgt_audio = compiled_ir_net_g_model((c, g_tgt))[output_layer]
    tgt_audio = tgt_audio[0][0]

    return tgt_audio

Now we can check inference using only the IR models.

result_wav_names = []

with torch.no_grad():
    for line in tqdm(zip(srcs, tgts)):
        src, tgt = line

        output_audio = synthesize_audio(src, tgt)

        timestamp = time.strftime("%m-%d_%H-%M", time.localtime())
        result_name = f'{timestamp}.wav'
        result_wav_names.append(result_name)
        write(
            os.path.join('outputs/freevc', result_name),
            hps.data.sampling_rate,
            output_audio
        )
2it [00:02,  1.45s/it]

Result audio files should be available in the ‘outputs/freevc’ directory. You can listen to them and compare them with the files generated earlier. One of the results is presented below.

Source audio (source of text):

import IPython.display as ipd
ipd.Audio(srcs[0])

Target audio (source of voice):

ipd.Audio(tgts[0])

Result audio:

ipd.Audio(f'outputs/freevc/{result_wav_names[0]}')

You can also use your own audio files: just upload them and use them for inference. Use a sampling rate corresponding to the value of hps.data.sampling_rate.
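
For example, a minimal programmatic sketch (here the example files downloaded earlier stand in for your own uploads; synthesize_audio resamples its inputs to hps.data.sampling_rate when loading them):

custom_result = synthesize_audio(audio1_name, audio2_name)
write('outputs/freevc/custom_result.wav', hps.data.sampling_rate, custom_result)

Alternatively, the Gradio demo below provides the same conversion through a simple web interface.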

import gradio as gr


audio1 = gr.inputs.Audio(label="Source Audio", type='filepath')
audio2 = gr.inputs.Audio(label="Reference Audio", type='filepath')
outputs = gr.outputs.Audio(label="Output Audio", type='filepath')
examples = [[audio1_name, audio2_name]]

title = 'FreeVC with Gradio'
description = 'Gradio Demo for FreeVC and OpenVINO™. Upload a source audio and a reference audio, then click the "Submit" button to run inference.'


def infer(src, tgt):
    output_audio = synthesize_audio(src, tgt)

    timestamp = time.strftime("%m-%d_%H-%M", time.localtime())
    result_name = f'{timestamp}.wav'
    write(result_name, hps.data.sampling_rate, output_audio)

    return result_name


iface = gr.Interface(infer, [audio1, audio2], outputs, title=title, description=description, examples=examples)
iface.launch()
# if you are launching remotely, specify server_name and server_port
# iface.launch(server_name='your server name', server_port='server port in int')
# if you have any issue to launch on your platform, you can pass share=True to launch method:
# iface.launch(share=True)
# it creates a publicly shareable link for the interface. Read more in the docs: https://gradio.app/docs/
/tmp/ipykernel_2082705/3932271335.py:4: GradioDeprecationWarning: Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components
  audio1 = gr.inputs.Audio(label="Source Audio", type='filepath')
/tmp/ipykernel_2082705/3932271335.py:4: GradioDeprecationWarning: optional parameter is deprecated, and it has no effect
  audio1 = gr.inputs.Audio(label="Source Audio", type='filepath')
/tmp/ipykernel_2082705/3932271335.py:5: GradioDeprecationWarning: Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components
  audio2 = gr.inputs.Audio(label="Reference Audio", type='filepath')
/tmp/ipykernel_2082705/3932271335.py:5: GradioDeprecationWarning: optional parameter is deprecated, and it has no effect
  audio2 = gr.inputs.Audio(label="Reference Audio", type='filepath')
/tmp/ipykernel_2082705/3932271335.py:6: GradioDeprecationWarning: Usage of gradio.outputs is deprecated, and will not be supported in the future, please import your components from gradio.components
  outputs = gr.outputs.Audio(label="Output Audio", type='filepath')
Running on local URL:  http://127.0.0.1:7860

To create a public link, set share=True in launch().
iface.close()
Closing server running on port: 7860