Text-to-Music generation using Riffusion and OpenVINO#
This Jupyter notebook can be launched after a local installation only.
Riffusion is a latent text-to-image diffusion model capable of generating spectrogram images given any text input. These spectrograms can be converted into audio clips. General diffusion models are machine learning systems that are trained to denoise random Gaussian noise step by step, to get to a sample of interest, such as an image. Diffusion models have been shown to achieve state-of-the-art results for generating image data. But one downside of diffusion models is that the reverse denoising process is slow. In addition, these models consume a lot of memory because they operate in pixel space, which becomes unreasonably expensive when generating high-resolution images. Therefore, it is challenging to train these models and also use them for inference. OpenVINO brings capabilities to run model inference on Intel hardware and opens the door to the fantastic world of diffusion models for everyone!
In this tutorial, we consider how to run a text-to-music generation pipeline using Riffusion and OpenVINO. We will use a pre-trained model from the Diffusers library. To simplify the user experience, the Hugging Face Optimum Intel library is used to convert the models to OpenVINO™ IR format.
The tutorial consists of the following steps:
Install prerequisites
Download and convert the model from a public source using the OpenVINO integration with Hugging Face Optimum.
Create a text-to-music inference pipeline
Run inference pipeline
About Riffusion#
Riffusion is based on Stable Diffusion v1.5 and fine-tuned on images of spectrograms paired with text. Audio processing happens downstream of the model. This model can generate an audio spectrogram for a given input text.
An audio spectrogram is a visual way to represent the frequency content of a sound clip. The x-axis represents time, and the y-axis represents frequency. The color of each pixel gives the amplitude of the audio at the frequency and time given by its row and column.
The spectrogram can be computed from audio using the Short-time Fourier transform (STFT), which approximates the audio as a combination of sine waves of varying amplitudes and phases.
The STFT is invertible, so the original audio can be reconstructed from a spectrogram. This idea underlies the approach of using Riffusion for audio generation.
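As a minimal sketch of the two statements above, the snippet below computes the STFT of a synthetic test tone and then inverts it. It uses `scipy.signal.stft` and `scipy.signal.istft`; the sample rate, tone frequency and window length are illustrative assumptions and are not the values used by Riffusion.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000                              # sample rate in Hz (illustrative value)
t = np.arange(fs) / fs                 # one second of audio
audio = np.sin(2 * np.pi * 440 * t)    # a 440 Hz test tone

# Forward STFT: rows are frequency bins, columns are time frames.
freqs, times, Zxx = stft(audio, fs=fs, nperseg=256)
magnitude = np.abs(Zxx)                # the part a spectrogram image encodes

# The strongest frequency bin should sit close to 440 Hz.
peak_hz = freqs[magnitude.mean(axis=1).argmax()]

# Inverse STFT: with magnitude and phase available, the audio is recovered.
_, restored = istft(Zxx, fs=fs, nperseg=256)
max_error = np.max(np.abs(restored[: len(audio)] - audio))

print(f"peak at {peak_hz:.1f} Hz, reconstruction error {max_error:.2e}")
```

Note that the inversion here relies on the complex STFT, that is, on both magnitude and phase. A spectrogram image stores only magnitudes, which is why the phase has to be estimated later in the notebook.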
Installation Instructions#
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
Prerequisites#
%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu Pillow scipy "torch>=2.1" torchaudio "diffusers>=0.16.1" "transformers>=4.33.0"
%pip install -q "git+https://github.com/huggingface/optimum-intel.git" onnx "gradio>=3.34.0" "openvino>=2023.1.0"
Stable Diffusion pipeline in Optimum Intel#
As the Riffusion model architecture is the same as Stable Diffusion, we can use it with the Stable Diffusion pipeline for text-to-image generation. Optimum Intel can be used to load optimized models from the Hugging Face Hub and create pipelines to run inference with OpenVINO Runtime without rewriting APIs. When Stable Diffusion models are exported to the OpenVINO format, they are decomposed into three components (the text encoder, the U-Net, and the VAE) that consist of four models combined into the pipeline during inference:
The text encoder
The U-Net
The VAE encoder
The VAE decoder
More details about the Stable Diffusion pipeline can be found in stable-diffusion notebook.
To run the Stable Diffusion model with Optimum Intel, we use the optimum.intel.OVStableDiffusionPipeline class, which represents the inference pipeline. OVStableDiffusionPipeline is initialized by the from_pretrained method. It supports on-the-fly conversion of models from PyTorch using the export=True parameter. A converted model can be saved to disk using the save_pretrained method, so that later runs can load it without converting it again.
from pathlib import Path
MODEL_ID = "riffusion/riffusion-model-v1"
MODEL_DIR = Path("riffusion_pipeline")
Select inference device#
Select a device from the dropdown list to run inference with OpenVINO.
import ipywidgets as widgets
import openvino as ov
core = ov.Core()
device = widgets.Dropdown(
options=core.available_devices + ["AUTO"],
value="AUTO",
description="Device:",
disabled=False,
)
device
from optimum.intel.openvino import OVStableDiffusionPipeline
DEVICE = device.value
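if not MODEL_DIR.exists():
    # First run: download the PyTorch model and convert it to OpenVINO IR on the fly
    pipe = OVStableDiffusionPipeline.from_pretrained(MODEL_ID, export=True, device=DEVICE, compile=False)
    # Convert the model weights to FP16 before saving
    pipe.half()
    pipe.save_pretrained(MODEL_DIR)
else:
    # Later runs: load the already converted model from disk
    pipe = OVStableDiffusionPipeline.from_pretrained(MODEL_DIR, device=DEVICE, compile=False)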
Prepare postprocessing to reconstruct audio from a spectrogram image#
The riffusion model generates an audio spectrogram image, which can be used to reconstruct audio. However, the spectrogram images from the model only contain the amplitude of the sine waves and not the phases, because the phases are chaotic and hard to learn. Instead, we can use the Griffin-Lim algorithm to approximate the phase when reconstructing the audio clip. The Griffin-Lim Algorithm (GLA) is a phase reconstruction method based on the redundancy of the Short-time Fourier transform (STFT). It promotes the consistency of a spectrogram by iterating two projections, where a spectrogram is said to be consistent when its inter-bin dependency owing to the redundancy of STFT is retained. GLA is based only on consistency and does not take any prior knowledge about the target signal into account.
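The two alternating projections described above can be sketched in a few lines. This is an illustrative NumPy/SciPy version, not the implementation used in this notebook (which relies on `torchaudio.transforms.GriffinLim`); the helper names `griffin_lim` and `spectral_error`, the window length, and the synthetic test signal are assumptions made for the example.

```python
import numpy as np
from scipy.signal import stft, istft

NPERSEG = 256  # STFT window length used throughout this sketch (illustrative)


def spectral_error(signal, magnitude):
    """Relative distance between a signal's STFT magnitude and the target magnitude."""
    _, _, Z = stft(signal, nperseg=NPERSEG)
    return np.linalg.norm(np.abs(Z) - magnitude) / np.linalg.norm(magnitude)


def griffin_lim(magnitude, n_iter=32, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    # Start from a random phase, since the spectrogram carries no phase information.
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Projection 1: go to the time domain and back, which yields a consistent STFT.
        _, signal = istft(magnitude * phase, nperseg=NPERSEG)
        _, _, Z = stft(signal, nperseg=NPERSEG)
        # Projection 2: keep the new phase, but restore the known magnitude.
        phase = np.exp(1j * np.angle(Z))
    _, signal = istft(magnitude * phase, nperseg=NPERSEG)
    return signal


# Build a target magnitude from a known two-tone signal, then discard its phase.
t = np.arange(8192) / 8000.0
reference = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
_, _, Z_ref = stft(reference, nperseg=NPERSEG)
target = np.abs(Z_ref)

error_start = spectral_error(griffin_lim(target, n_iter=0), target)
reconstructed = griffin_lim(target, n_iter=32)
error_end = spectral_error(reconstructed, target)
print(f"spectral error: {error_start:.3f} (random phase) -> {error_end:.3f} (32 iterations)")
```

The signal length is chosen as a multiple of the hop size so that the STFT keeps the same shape on every iteration. The recovered waveform is generally not identical to the reference, because only the magnitude constrains it.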
The frequency bins in the generated spectrogram use the Mel scale, which is a perceptual scale of pitches judged by listeners to be equal in distance from one another.
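To give a feel for the Mel scale, here is a small sketch of the HTK-style conversion formula, which corresponds to the `mel_scale="htk"` option used in the postprocessing code. The helper names `hz_to_mel` and `mel_to_hz` are hypothetical, and the evenly spaced points below only illustrate how the bins spread out; they are not the exact filter bank that torchaudio builds.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """HTK-style conversion from Hz to mel."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# 512 points between 0 Hz and 10 kHz, evenly spaced on the mel axis.
n_mels, f_min, f_max = 512, 0.0, 10000.0
mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
centers_hz = [mel_to_hz(mel_min + i * (mel_max - mel_min) / (n_mels - 1)) for i in range(n_mels)]

spacing_low = centers_hz[1] - centers_hz[0]      # gap between the two lowest points
spacing_high = centers_hz[-1] - centers_hz[-2]   # gap between the two highest points
print(f"bin spacing: {spacing_low:.1f} Hz near 0 Hz, {spacing_high:.1f} Hz near 10 kHz")
```

Equal steps in mel translate to small steps in Hz at low frequencies and large steps at high frequencies, which is why a mel spectrogram devotes more rows to the low end of the spectrum.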
The code below defines how a WAV audio clip is reconstructed from a spectrogram image using the Griffin-Lim algorithm.
import io
from typing import Tuple
import numpy as np
from PIL import Image
from scipy.io import wavfile
import torch
import torchaudio
def wav_bytes_from_spectrogram_image(image: Image.Image) -> Tuple[io.BytesIO, float]:
    """
    Reconstruct a WAV audio clip from a spectrogram image. Also returns the duration in seconds.

    Parameters:
      image (Image.Image): generated spectrogram image
    Returns:
      wav_bytes (io.BytesIO): audio signal encoded in wav bytes
      duration_s (float): duration in seconds
    """
    max_volume = 50
    power_for_image = 0.25
    Sxx = spectrogram_from_image(image, max_volume=max_volume, power_for_image=power_for_image)

    sample_rate = 44100  # [Hz]
    clip_duration_ms = 5000  # [ms]

    bins_per_image = 512
    n_mels = 512

    # FFT parameters
    window_duration_ms = 100  # [ms]
    padded_duration_ms = 400  # [ms]
    step_size_ms = 10  # [ms]

    # Derived parameters
    # Clip length in samples: convert milliseconds to seconds before multiplying by the sample rate
    num_samples = int(image.width / float(bins_per_image) * clip_duration_ms / 1000.0 * sample_rate)
    n_fft = int(padded_duration_ms / 1000.0 * sample_rate)
    hop_length = int(step_size_ms / 1000.0 * sample_rate)
    win_length = int(window_duration_ms / 1000.0 * sample_rate)

    samples = waveform_from_spectrogram(
        Sxx=Sxx,
        n_fft=n_fft,
        hop_length=hop_length,
        win_length=win_length,
        num_samples=num_samples,
        sample_rate=sample_rate,
        mel_scale=True,
        n_mels=n_mels,
        num_griffin_lim_iters=32,
    )

    wav_bytes = io.BytesIO()
    wavfile.write(wav_bytes, sample_rate, samples.astype(np.int16))
    wav_bytes.seek(0)

    duration_s = float(len(samples)) / sample_rate

    return wav_bytes, duration_s

def spectrogram_from_image(image: Image.Image, max_volume: float = 50, power_for_image: float = 0.25) -> np.ndarray:
    """
    Compute a spectrogram magnitude array from a spectrogram image.

    Parameters:
      image (Image.Image): input image
      max_volume (float, *optional*, 50): max volume for spectrogram magnitude
      power_for_image (float, *optional*, 0.25): power for reversing power curve
    Returns:
      data (np.ndarray): spectrogram magnitude array
    """
    # Convert to a numpy array of floats
    data = np.array(image).astype(np.float32)

    # Flip the Y axis and take a single channel
    data = data[::-1, :, 0]

    # Invert
    data = 255 - data

    # Rescale to max volume
    data = data * max_volume / 255

    # Reverse the power curve
    data = np.power(data, 1 / power_for_image)

    return data

def waveform_from_spectrogram(
    Sxx: np.ndarray,
    n_fft: int,
    hop_length: int,
    win_length: int,
    num_samples: int,
    sample_rate: int,
    mel_scale: bool = True,
    n_mels: int = 512,
    num_griffin_lim_iters: int = 32,
    device: str = "cpu",
) -> np.ndarray:
    """
    Reconstruct a waveform from a spectrogram.
    This is an approximate waveform, using the Griffin-Lim algorithm
    to approximate the phase.
    """
    Sxx_torch = torch.from_numpy(Sxx).to(device)

    if mel_scale:
        mel_inv_scaler = torchaudio.transforms.InverseMelScale(
            n_mels=n_mels,
            sample_rate=sample_rate,
            f_min=0,
            f_max=10000,
            n_stft=n_fft // 2 + 1,
            norm=None,
            mel_scale="htk",
        ).to(device)

        Sxx_torch = mel_inv_scaler(Sxx_torch)

    griffin_lim = torchaudio.transforms.GriffinLim(
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
        power=1.0,
        n_iter=num_griffin_lim_iters,
    ).to(device)

    waveform = griffin_lim(Sxx_torch).cpu().numpy()

    return waveform
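To make the numbers in the code above more concrete, the following sketch recomputes the derived STFT parameters and applies the pixel-to-magnitude mapping to three sample pixel values. The constants are copied from the functions above. The duration estimate assumes centered framing, where N frames yield roughly hop_length * (N - 1) samples; that is an assumption about the Griffin-Lim output length, not something stated in this notebook.

```python
import numpy as np

sample_rate = 44100            # [Hz]
window_duration_ms = 100       # [ms]
padded_duration_ms = 400       # [ms]
step_size_ms = 10              # [ms]

# Derived STFT parameters, computed as in wav_bytes_from_spectrogram_image
n_fft = int(padded_duration_ms / 1000.0 * sample_rate)
hop_length = int(step_size_ms / 1000.0 * sample_rate)
win_length = int(window_duration_ms / 1000.0 * sample_rate)
n_stft = n_fft // 2 + 1        # number of linear-frequency bins after the inverse mel step

# Each image column is one STFT frame (assumption: centered framing).
image_width = 512
approx_samples = hop_length * (image_width - 1)
approx_duration_s = approx_samples / sample_rate

# Pixel-to-magnitude mapping from spectrogram_from_image: invert, rescale, undo the power curve.
max_volume, power_for_image = 50, 0.25
pixels = np.array([0, 128, 255], dtype=np.float32)   # black, mid-gray, white
magnitudes = np.power((255 - pixels) * max_volume / 255, 1 / power_for_image)

print(n_fft, hop_length, win_length, n_stft)
print(f"approximate clip duration: {approx_duration_s:.2f} s")
print(magnitudes)
```

With these constants a 512-pixel-wide image corresponds to a clip of roughly five seconds. Dark pixels map to large magnitudes and white pixels map to zero, so silence appears as white in the generated image.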
Run Inference pipeline#
The diagram below briefly describes the workflow of our pipeline.
As you can see, it is very similar to Stable Diffusion text-to-image generation, with an additional post-processing step that transforms the generated spectrogram into an audio signal. First, OVStableDiffusionPipeline accepts an input text prompt, which is tokenized and transformed into the embedding space by the frozen CLIP text encoder, and generates an initial latent spectrogram representation using a random generator. The U-Net then iteratively denoises the random latent spectrogram representation while being conditioned on the text embeddings. The output of the U-Net, the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm. The denoising process is repeated a given number of times (50 by default; the generate function below uses 20) to retrieve better latent image representations step by step. When complete, the latent image representation is decoded by the decoder part of the variational autoencoder. The generated spectrogram image is then converted into spectrogram magnitudes, and an inverse mel scale is applied to estimate an STFT in the linear frequency domain from the mel frequency domain. Finally, the Griffin-Lim algorithm approximates the phase of the audio signal, which gives us the reconstructed audio.
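The role of the scheduler in the denoising loop can be illustrated with a toy example. The sketch below is not the scheduler configured for this pipeline: it assumes a linear beta schedule and a deterministic DDIM-style update, and it replaces the U-Net with a hypothetical `oracle_noise_predictor` that already knows the clean latent. It only shows how a predicted noise residual is turned into the next, less noisy latent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise schedule (assumption: linear betas over 1000 training steps).
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

# Stand-in for the latent the model should arrive at (4 channels, 8x8 for brevity).
clean_latent = rng.standard_normal((4, 8, 8))

# 20 inference steps, from the noisiest timestep down to 0.
timesteps = np.linspace(999, 0, 20).round().astype(int)


def oracle_noise_predictor(latent, t):
    """Hypothetical replacement for the U-Net: returns the exact noise contained in `latent`."""
    return (latent - np.sqrt(alpha_bar[t]) * clean_latent) / np.sqrt(1.0 - alpha_bar[t])


# Start from pure Gaussian noise, as the pipeline does.
latent = rng.standard_normal(clean_latent.shape)

for i, t in enumerate(timesteps):
    a_t = alpha_bar[t]
    a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0

    noise = oracle_noise_predictor(latent, t)                         # "U-Net" output: noise residual
    predicted_clean = (latent - np.sqrt(1.0 - a_t) * noise) / np.sqrt(a_t)
    # Scheduler step: re-noise the clean estimate to the (lower) noise level of the next timestep.
    latent = np.sqrt(a_prev) * predicted_clean + np.sqrt(1.0 - a_prev) * noise

print("max deviation from the clean latent:", np.max(np.abs(latent - clean_latent)))
```

Because the predictor here is exact, the loop recovers the clean latent. A real U-Net only approximates the noise, which is why more steps usually give better results at the cost of longer inference.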
pipe.reshape(batch_size=1, height=512, width=512, num_images_per_prompt=1)
pipe.compile()
def generate(prompt: str, negative_prompt: str = "") -> Tuple[Image.Image, str]:
    """
    Generate audio from a text prompt.

    Parameters:
      prompt (str): input prompt for generation.
      negative_prompt (str): negative prompt for generation, contains undesired concepts for generation, which should be avoided. Can be empty.
    Returns:
      spec (Image.Image): generated spectrogram image
      wav_path (str): path to the generated WAV file
    """
    spec = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=20).images[0]
    # wav_bytes_from_spectrogram_image returns (wav_bytes, duration_s); only the bytes are saved
    wav = wav_bytes_from_spectrogram_image(spec)
    with open("output.wav", "wb") as f:
        f.write(wav[0].getbuffer())
    return spec, "output.wav"
Compiling the vae_decoder...
Compiling the unet...
Compiling the vae_encoder...
Compiling the text_encoder...
Now we can test our generation. The generate function accepts a text prompt and returns the generated spectrogram and the path to the generated audio file. Optionally, it also accepts a negative prompt. A negative prompt declares concepts that are undesired in the result. For example, if we want to generate instrumental music, vocals in the audio would be an unwanted effect, so "vocal" can be used as a negative prompt. The positive and negative prompts are on equal footing: you can use either one with or without the other. More explanation of how it works can be found in this article.
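As a rough sketch of how a negative prompt can steer generation, the snippet below shows the classifier-free guidance formula commonly used in Stable Diffusion pipelines. At each denoising step the noise is predicted twice, once conditioned on the negative prompt (or on an empty prompt if none is given) and once on the positive prompt, and the two predictions are combined. The function name `guided_noise`, the toy vectors and the guidance scale of 7.5 are assumptions for illustration, not values read from this pipeline.

```python
import numpy as np


def guided_noise(noise_negative, noise_positive, guidance_scale=7.5):
    """Move the prediction away from the negative prompt and toward the positive one."""
    return noise_negative + guidance_scale * (noise_positive - noise_negative)


# Toy two-dimensional "noise predictions" for the two prompts.
noise_negative = np.array([1.0, 0.0])   # e.g. conditioned on "vocal"
noise_positive = np.array([0.0, 1.0])   # e.g. conditioned on "Techno beat"

no_guidance = guided_noise(noise_negative, noise_positive, guidance_scale=1.0)
with_guidance = guided_noise(noise_negative, noise_positive, guidance_scale=7.5)
print(no_guidance, with_guidance)
```

With a scale of 1 the result equals the positive prediction. Larger scales push the result further along the direction that separates the positive prediction from the negative one.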
spectrogram, wav_path = generate("Techno beat")
spectrogram
import IPython.display as ipd
ipd.Audio(wav_path)