SoftVC VITS Singing Voice Conversion and OpenVINO™#
This Jupyter notebook can be launched online, opening an interactive environment in a browser window, or you can install and run it locally.
This tutorial is based on the SoftVC VITS Singing Voice Conversion project. The purpose of this project is to enable developers to have their beloved anime characters perform singing tasks. The developers intended to focus solely on fictional characters and avoid any involvement of real individuals; anything related to real individuals deviates from that original intention.
The singing voice conversion model uses the SoftVC content encoder to extract speech features from the source audio. These feature vectors are fed directly into VITS without the need for conversion to a text-based intermediate representation. As a result, the pitch and intonation of the original audio are preserved.
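Schematically, the flow can be sketched as follows (the names below are illustrative only, not the project's actual API):
# Illustrative sketch of the conversion flow (not the real API):
# features = content_encoder(source_audio)   # SoftVC/ContentVec speech features
# f0 = extract_pitch(source_audio)           # pitch is taken from the source audio
# converted_audio = vits_synthesizer(features, f0, target_speaker)  # VITS renders the target voice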
In this tutorial we will use the base model flow.
Installation Instructions#
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to the Installation Guide.
Prerequisites#
This model has some installation problems, so we have to downgrade pip to a version below 24.1 to avoid issues with fairseq. Also, fairseq has some problems on Windows with Python >= 3.11, so we use a custom version of this library to resolve them.
import platform
import sys
import warnings

%pip install -qU "pip<24.1"

if platform.system() == "Windows" and sys.version_info >= (3, 11, 0):
    warnings.warn(
        "Building the custom fairseq package may take a long time and may require additional privileges in the system. We recommend using Python versions 3.8, 3.9 or 3.10 for this model."
    )
    %pip install git+https://github.com/aleksandr-mokrov/fairseq.git --extra-index-url https://download.pytorch.org/whl/cpu
else:
    %pip install "fairseq==0.12.2" --extra-index-url https://download.pytorch.org/whl/cpu
import requests

# fetch helper scripts used across the OpenVINO notebooks
r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
)
open("notebook_utils.py", "w", encoding="utf-8").write(r.text)

r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/cmd_helper.py",
)
open("cmd_helper.py", "w", encoding="utf-8").write(r.text)
from cmd_helper import clone_repo
%pip install -q "openvino>=2023.2.0"
clone_repo("https://github.com/svc-develop-team/so-vits-svc", revision="4.1-Stable", add_to_sys_path=False)
%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu tqdm librosa "torch>=2.1.0" "torchaudio>=2.1.0" faiss-cpu "gradio>=4.19" "numpy>=1.23.5" praat-parselmouth
Download the pretrained models and configs. We use the recommended ContentVec encoder and, as an example, models from a collection of so-vits-svc-4.0 models made by the Pony Preservation Project. You can choose any other pretrained model from this or another project, or prepare your own.
from notebook_utils import download_file, device_widget

# ContentVec
download_file(
    "https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt",
    "checkpoint_best_legacy_500.pt",
    directory="so-vits-svc/pretrain/",
)

# pretrained models and configs from a collection of so-vits-svc-4.0 models. You can use other models.
download_file(
    "https://huggingface.co/therealvul/so-vits-svc-4.0/resolve/main/Rainbow%20Dash%20(singing)/kmeans_10000.pt",
    "kmeans_10000.pt",
    directory="so-vits-svc/logs/44k/",
)
download_file(
    "https://huggingface.co/therealvul/so-vits-svc-4.0/resolve/main/Rainbow%20Dash%20(singing)/config.json",
    "config.json",
    directory="so-vits-svc/configs/",
)
download_file(
    "https://huggingface.co/therealvul/so-vits-svc-4.0/resolve/main/Rainbow%20Dash%20(singing)/G_30400.pth",
    "G_30400.pth",
    directory="so-vits-svc/logs/44k/",
)
download_file(
    "https://huggingface.co/therealvul/so-vits-svc-4.0/resolve/main/Rainbow%20Dash%20(singing)/D_30400.pth",
    "D_30400.pth",
    directory="so-vits-svc/logs/44k/",
)

# a wav sample
download_file(
    "https://huggingface.co/datasets/santifiorino/spinetta/resolve/main/spinetta/000.wav",
    "000.wav",
    directory="so-vits-svc/raw/",
)
Use the original model to run an inference#
Change the directory to so-vits-svc so as not to break internal relative paths.
%cd so-vits-svc
Define the SoVITS model using the Svc class.
from inference.infer_tool import Svc
model = Svc("logs/44k/G_30400.pth", "configs/config.json", device="cpu")
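As a quick check, you can print the model's output sampling rate; it is also used later when playing the audio:
# output sampling rate of the loaded model
print(model.target_sample)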
Define kwargs and run inference.
kwargs = {
    "raw_audio_path": "raw/000.wav",  # path to the source audio
    "spk": "Rainbow Dash (singing)",  # target speaker ID to convert the source audio into
    "tran": 0,  # pitch shift in semitones
    "slice_db": -40,  # silence threshold (dB) used to slice the audio
    "cluster_infer_ratio": 0,  # proportion of cluster-based timbre; 0 disables the kmeans model
    "auto_predict_f0": False,  # automatic pitch prediction; keep False for singing voice
    "noice_scale": 0.4,  # noise scale, affects articulation and sound quality
}
audio = model.slice_inference(**kwargs)
Now let's compare the original audio with the result.
import IPython.display as ipd
# original
ipd.Audio("raw/000.wav", rate=model.target_sample)
# result
ipd.Audio(audio, rate=model.target_sample)
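The kwargs above keep tran at 0, so the pitch is not shifted. If you would like the converted voice to sing, for example, an octave higher, you can change the transpose value and run inference again (in so-vits-svc, tran is the pitch shift in semitones):
# shift the pitch up by 12 semitones (one octave); negative values shift it down
kwargs["tran"] = 12
audio_shifted = model.slice_inference(**kwargs)
ipd.Audio(audio_shifted, rate=model.target_sample)
kwargs["tran"] = 0  # restore the original setting for the rest of the tutorial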
Convert to OpenVINO IR model#
Model components are PyTorch modules that can be converted directly with the ov.convert_model function. We also use the ov.save_model function to serialize the result of the conversion. Svc is not a model itself; it runs model inference inside. In the base scenario, only the SynthesizerTrn model, named net_g_ms, is used, so it is enough to convert only this model. For that purpose, we reassign its forward method to the infer method. SynthesizerTrn uses several models in its flow, e.g. TextEncoder, Generator, ResidualCouplingBlock, etc., but OpenVINO can convert the whole pipeline in one step without the need to look inside.
import openvino as ov
import torch
from pathlib import Path

# dummy inputs for tracing: content features (1, 256, T), pitch (1, T), voiced/unvoiced mask (1, T), speaker ID
dummy_c = torch.randn(1, 256, 813)
dummy_f0 = torch.randn(1, 813)
dummy_uv = torch.ones(1, 813)
dummy_g = torch.tensor([[0]])

# use the infer method as forward so that ov.convert_model traces the inference path
model.net_g_ms.forward = model.net_g_ms.infer

net_g_kwargs = {
    "c": dummy_c,
    "f0": dummy_f0,
    "uv": dummy_uv,
    "g": dummy_g,
    "noice_scale": torch.tensor(0.35),  # need to wrap numeric and boolean values for conversion
    "seed": torch.tensor(52468),
    "predict_f0": torch.tensor(False),
    "vol": torch.tensor(0),
}

core = ov.Core()

net_g_model_xml_path = Path("models/ov_net_g_model.xml")
if not net_g_model_xml_path.exists():
    converted_model = ov.convert_model(model.net_g_ms, example_input=net_g_kwargs)
    net_g_model_xml_path.parent.mkdir(parents=True, exist_ok=True)
    ov.save_model(converted_model, net_g_model_xml_path)
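You can optionally inspect the saved IR to check its inputs and outputs (the exact input names depend on how the model was traced):
# read the IR back and print input/output shapes
ov_net_g = core.read_model(net_g_model_xml_path)
for model_input in ov_net_g.inputs:
    print(model_input.any_name, model_input.partial_shape)
for model_output in ov_net_g.outputs:
    print(model_output.partial_shape)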
Run the OpenVINO model#
Select a device from the dropdown list for running inference with OpenVINO.
import openvino as ov
core = ov.Core()
device = device_widget()
device
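If you want to see which inference devices OpenVINO detects on your machine before selecting one, you can query the Core object directly:
# list the devices available for inference, e.g. ['CPU', 'GPU']
print(core.available_devices)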
We should create a wrapper for the net_g_ms model to keep its interface, then replace the original net_g_ms model with the converted IR model. We use ov.compile_model to make it ready for loading on a device.
class NetGModelWrapper:
    def __init__(self, net_g_model_xml_path):
        super().__init__()
        self.net_g_model = core.compile_model(net_g_model_xml_path, device.value)

    def infer(self, c, *, f0, uv, g, noice_scale=0.35, seed=52468, predict_f0=False, vol=None):
        if vol is None:  # None is not allowed as an input
            results = self.net_g_model((c, f0, uv, g, noice_scale, seed, predict_f0))
        else:
            results = self.net_g_model((c, f0, uv, g, noice_scale, seed, predict_f0, vol))

        return torch.from_numpy(results[0]), torch.from_numpy(results[1])
model.net_g_ms = NetGModelWrapper(net_g_model_xml_path)
audio = model.slice_inference(**kwargs)
Check the result. It should sound identical to the audio created by the original model.
import IPython.display as ipd
ipd.Audio(audio, rate=model.target_sample)
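If you also want to keep the converted audio as a file, you can write it to disk, for example with the soundfile library (pulled in as a librosa dependency; the output file name here is just an example):
import soundfile as sf

# save the converted audio using the model's sampling rate
sf.write("results_ov.wav", audio, model.target_sample)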
Interactive inference#
def infer(src_audio, tran, slice_db, noice_scale):
    kwargs["raw_audio_path"] = src_audio
    kwargs["tran"] = tran
    kwargs["slice_db"] = slice_db
    kwargs["noice_scale"] = noice_scale
    audio = model.slice_inference(**kwargs)

    return model.target_sample, audio
if not Path("gradio_helper.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/softvc-voice-conversion/gradio_helper.py")
    open("gradio_helper.py", "w", encoding="utf-8").write(r.text)

from gradio_helper import make_demo

demo = make_demo(fn=infer)

try:
    demo.queue().launch(debug=False)
except Exception:
    demo.queue().launch(share=True, debug=False)
# If you are launching remotely, specify server_name and server_port
# EXAMPLE: `demo.launch(server_name='your server name', server_port='server port in int')`
# To learn more please refer to the Gradio docs: https://gradio.app/docs/
# please uncomment and run this cell for stopping gradio interface
# demo.close()