Text-to-Music generation using Riffusion and OpenVINO

This Jupyter notebook can be launched after a local installation only.


Riffusion is a latent text-to-image diffusion model capable of generating spectrogram images given any text input. These spectrograms can be converted into audio clips. In general, diffusion models are machine learning systems trained to denoise random Gaussian noise step by step until they reach a sample of interest, such as an image. Diffusion models have been shown to achieve state-of-the-art results for generating image data. One downside is that the reverse denoising process is slow. In addition, models that operate in pixel space consume a lot of memory, which becomes unreasonably expensive when generating high-resolution images. This makes such models challenging both to train and to use for inference. OpenVINO brings capabilities to run model inference on Intel hardware and opens the door to the fantastic world of diffusion models for everyone!

In this tutorial, we consider how to run a text-to-music generation pipeline using Riffusion and OpenVINO. We will use a pre-trained model from the Diffusers library. To simplify the user experience, the Hugging Face Optimum Intel library is used to convert the models to OpenVINO™ IR format.

About Riffusion

Riffusion is based on Stable Diffusion v1.5 and fine-tuned on images of spectrograms paired with text. Audio processing happens downstream of the model. The model can generate an audio spectrogram for a given input text.

An audio spectrogram is a visual way to represent the frequency content of a sound clip. The x-axis represents time, and the y-axis represents frequency. The color of each pixel gives the amplitude of the audio at the frequency and time given by its row and column.



The spectrogram can be computed from audio using the Short-time Fourier transform (STFT), which approximates the audio as a combination of sine waves of varying amplitudes and phases.

The STFT is invertible, so the original audio can be reconstructed from a spectrogram. This idea is behind the approach of using Riffusion for audio generation.
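As a rough illustration of the two paragraphs above, the NumPy-only sketch below computes the magnitude spectrogram of a synthetic 440 Hz tone and then inverts the complex STFT by overlap-add. The helper names (`stft`, `istft`), the 8 kHz sample rate, and the window sizes are assumptions chosen for this example; they are not the values used by Riffusion.

```python
import numpy as np

SAMPLE_RATE = 8000  # [Hz], illustrative value
N_FFT = 1024        # window length in samples
HOP = N_FFT // 2    # 50% overlap

# Periodic Hann window: copies shifted by half a window sum to exactly 1.
WINDOW = np.hanning(N_FFT + 1)[:-1]


def stft(x: np.ndarray) -> np.ndarray:
    """Return the complex STFT with shape (frequency bins, frames)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP : i * HOP + N_FFT] * WINDOW for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T


def istft(spec: np.ndarray) -> np.ndarray:
    """Invert the STFT by overlap-adding the inverse FFT of every frame."""
    frames = np.fft.irfft(spec.T, n=N_FFT, axis=1)
    out = np.zeros(HOP * (len(frames) - 1) + N_FFT)
    for i, frame in enumerate(frames):
        out[i * HOP : i * HOP + N_FFT] += frame
    return out


t = np.arange(SAMPLE_RATE) / SAMPLE_RATE  # one second of audio
audio = np.sin(2 * np.pi * 440.0 * t)

spec = stft(audio)
magnitude = np.abs(spec)                  # the part a spectrogram image encodes
peak_bin = int(magnitude.mean(axis=1).argmax())
peak_hz = peak_bin * SAMPLE_RATE / N_FFT  # near 440 Hz, limited by the bin width

restored = istft(spec)                    # uses magnitude AND phase
inner = slice(HOP, len(restored) - HOP)   # the edges are covered by one window only
max_error = float(np.abs(restored[inner] - audio[inner]).max())
print(f"peak frequency: {peak_hz:.1f} Hz, max reconstruction error: {max_error:.2e}")
```

The round trip is only this accurate because the complex STFT keeps the phase. When just the magnitude is available, as in a spectrogram image, the phase has to be estimated.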


%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu torch torchaudio "diffusers>=0.16.1" "transformers>=4.33.0"
%pip install -q "git+https://github.com/huggingface/optimum-intel.git" onnx "gradio>=3.34.0" "openvino>=2023.1.0"

Stable Diffusion pipeline in Optimum Intel

As the Riffusion model architecture is the same as Stable Diffusion, we can use it with the Stable Diffusion pipeline for text-to-image generation. Optimum Intel can be used to load optimized models from the Hugging Face Hub and create pipelines to run inference with OpenVINO Runtime without rewriting APIs. When Stable Diffusion models are exported to the OpenVINO format, they are decomposed into three components consisting of four models, which are combined into the pipeline during inference:

  • The text encoder

  • The U-NET

  • The VAE encoder

  • The VAE decoder

More details about the Stable Diffusion pipeline can be found in the stable-diffusion notebook.

To run the Stable Diffusion model with Optimum Intel, we use the optimum.intel.OVStableDiffusionPipeline class, which represents the inference pipeline. OVStableDiffusionPipeline is initialized by the from_pretrained method. It supports on-the-fly conversion of models from PyTorch using the export=True parameter. A converted model can be saved to disk with the save_pretrained method and reused on the next run.

from pathlib import Path

MODEL_ID = "riffusion/riffusion-model-v1"
MODEL_DIR = Path("riffusion_pipeline")

Select a device from the dropdown list for running inference with OpenVINO.

import ipywidgets as widgets
from openvino.runtime import Core

core = Core()

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value="AUTO",
    description="Device:",
)

device

Dropdown(description='Device:', index=2, options=('CPU', 'GPU', 'AUTO'), value='AUTO')

from optimum.intel.openvino import OVStableDiffusionPipeline

DEVICE = device.value

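if not MODEL_DIR.exists():
    # First run: convert the PyTorch model to OpenVINO IR and cache it on disk
    pipe = OVStableDiffusionPipeline.from_pretrained(MODEL_ID, export=True, device=DEVICE, compile=False)
    pipe.save_pretrained(MODEL_DIR)
else:
    # Later runs: load the already converted model
    pipe = OVStableDiffusionPipeline.from_pretrained(MODEL_DIR, device=DEVICE, compile=False)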
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
2023-09-19 18:21:08.176653: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-09-19 18:21:08.217600: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-19 18:21:08.865600: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/ea/work/ov_venv/lib/python3.8/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations

Prepare postprocessing for reconstructing audio from a spectrogram image

The Riffusion model generates an audio spectrogram image, which can be used to reconstruct audio. However, the spectrogram images from the model only contain the amplitude of the sine waves and not the phases, because the phases are chaotic and hard to learn. Instead, we can use the Griffin-Lim algorithm to approximate the phase when reconstructing the audio clip. The Griffin-Lim algorithm (GLA) is a phase reconstruction method based on the redundancy of the Short-time Fourier transform (STFT). It promotes the consistency of a spectrogram by iterating two projections, where a spectrogram is said to be consistent when its inter-bin dependency owing to the redundancy of the STFT is retained. GLA is based only on consistency and does not take any prior knowledge about the target signal into account.
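The two projections can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the torchaudio implementation used later in this notebook: the helper names (`stft`, `istft`, `griffin_lim`, `inconsistency`), the window sizes, and the test tone are all made up for the example.

```python
import numpy as np

N_FFT = 512
HOP = N_FFT // 4                     # 75% overlap gives the STFT its redundancy
WINDOW = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window


def stft(x: np.ndarray) -> np.ndarray:
    """Complex STFT with shape (frames, frequency bins)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP : i * HOP + N_FFT] * WINDOW for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec: np.ndarray) -> np.ndarray:
    """Least-squares inverse: windowed overlap-add divided by the summed squared window."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    length = HOP * (len(frames) - 1) + N_FFT
    signal = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        signal[i * HOP : i * HOP + N_FFT] += frame
        norm[i * HOP : i * HOP + N_FFT] += WINDOW**2
    return np.divide(signal, norm, out=np.zeros_like(signal), where=norm > 1e-12)


def griffin_lim(magnitude: np.ndarray, n_iter: int = 32, seed: int = 0) -> np.ndarray:
    """Estimate a waveform whose STFT magnitude matches `magnitude`."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(-np.pi, np.pi, size=magnitude.shape)
    for _ in range(n_iter):
        signal = istft(magnitude * np.exp(1j * phase))  # projection onto consistent spectrograms
        phase = np.angle(stft(signal))                  # keep the phase, restore the magnitude
    return istft(magnitude * np.exp(1j * phase))


def inconsistency(signal: np.ndarray, magnitude: np.ndarray) -> float:
    """Relative distance between the target magnitude and the signal's own STFT magnitude."""
    return float(np.linalg.norm(np.abs(stft(signal)) - magnitude) / np.linalg.norm(magnitude))


sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
target = np.sin(2 * np.pi * 220.0 * t) + 0.5 * np.sin(2 * np.pi * 660.0 * t)
magnitude = np.abs(stft(target))  # the phase is thrown away here

rough = griffin_lim(magnitude, n_iter=0)     # random phase only
refined = griffin_lim(magnitude, n_iter=32)  # phase refined by the two projections
rough_error = inconsistency(rough, magnitude)
refined_error = inconsistency(refined, magnitude)
print(f"inconsistency: {rough_error:.3f} with random phase, {refined_error:.3f} after 32 iterations")
```

The first and last few samples are covered by a single, nearly zero window value and are poorly conditioned in this sketch. Library implementations such as torchaudio typically pad the signal so that the edges are covered properly.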

The frequency bins in the generated spectrogram use the Mel scale, which is a perceptual scale of pitches judged by listeners to be equal in distance from one another.
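To make the Mel scale concrete, here is a small sketch that assumes the widely used HTK-style formula, mel = 2595 * log10(1 + f / 700). The function names `hz_to_mel` and `mel_to_hz` are illustrative, and other Mel variants (for example the Slaney one) give slightly different numbers.

```python
import math


def hz_to_mel(frequency_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


for hz in (250, 500, 1000, 2000, 4000, 8000):
    print(f"{hz:>5} Hz -> {hz_to_mel(hz):7.1f} mel")

# The same 100-mel step covers a wider range of Hz at high pitch than at low pitch.
low_step_hz = mel_to_hz(600.0) - mel_to_hz(500.0)
high_step_hz = mel_to_hz(2600.0) - mel_to_hz(2500.0)
print(f"100 mel near 500 mel:  {low_step_hz:.1f} Hz")
print(f"100 mel near 2500 mel: {high_step_hz:.1f} Hz")
```

Because equal distances in mels match equal perceived pitch distances, a mel-scaled spectrogram spends more of its rows on the low frequencies, where hearing resolves pitch more finely.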

The code below defines the reconstruction of a WAV audio clip from a spectrogram image using the Griffin-Lim algorithm.

import io
from typing import Tuple

import numpy as np
from PIL import Image
from scipy.io import wavfile
import torch
import torchaudio

def wav_bytes_from_spectrogram_image(image: Image.Image) -> Tuple[io.BytesIO, float]:
    """
    Reconstruct a WAV audio clip from a spectrogram image. Also returns the duration in seconds.

    Parameters:
      image (Image.Image): generated spectrogram image
    Returns:
      wav_bytes (io.BytesIO): audio signal encoded in wav bytes
      duration_s (float): duration in seconds
    """

    max_volume = 50
    power_for_image = 0.25
    Sxx = spectrogram_from_image(image, max_volume=max_volume, power_for_image=power_for_image)

    sample_rate = 44100  # [Hz]
    clip_duration_ms = 5000  # [ms]

    bins_per_image = 512
    n_mels = 512

    # FFT parameters
    window_duration_ms = 100  # [ms]
    padded_duration_ms = 400  # [ms]
    step_size_ms = 10  # [ms]

    # Derived parameters
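    # clip_duration_ms is in milliseconds, so convert to seconds before multiplying by the sample rate
    num_samples = int(image.width / float(bins_per_image) * clip_duration_ms / 1000.0 * sample_rate)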
    n_fft = int(padded_duration_ms / 1000.0 * sample_rate)
    hop_length = int(step_size_ms / 1000.0 * sample_rate)
    win_length = int(window_duration_ms / 1000.0 * sample_rate)

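    samples = waveform_from_spectrogram(
        Sxx=Sxx,
        n_fft=n_fft,
        hop_length=hop_length,
        win_length=win_length,
        num_samples=num_samples,
        sample_rate=sample_rate,
        mel_scale=True,
        n_mels=n_mels,
        num_griffin_lim_iters=32,
    )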

    wav_bytes = io.BytesIO()
    wavfile.write(wav_bytes, sample_rate, samples.astype(np.int16))

    duration_s = float(len(samples)) / sample_rate

    return wav_bytes, duration_s

def spectrogram_from_image(
    image: Image.Image, max_volume: float = 50, power_for_image: float = 0.25
) -> np.ndarray:
    """
    Compute a spectrogram magnitude array from a spectrogram image.

    Parameters:
      image (Image.Image): input image
      max_volume (float, *optional*, 50): max volume for spectrogram magnitude
      power_for_image (float, *optional*, 0.25): power for reversing power curve
    Returns:
      data (np.ndarray): spectrogram magnitude array
    """
    # Convert to a numpy array of floats
    data = np.array(image).astype(np.float32)

    # Flip Y and take a single channel
    data = data[::-1, :, 0]

    # Invert
    data = 255 - data

    # Rescale to max volume
    data = data * max_volume / 255

    # Reverse the power curve
    data = np.power(data, 1 / power_for_image)

    return data

def waveform_from_spectrogram(
    Sxx: np.ndarray,
    n_fft: int,
    hop_length: int,
    win_length: int,
    num_samples: int,
    sample_rate: int,
    mel_scale: bool = True,
    n_mels: int = 512,
    num_griffin_lim_iters: int = 32,
    device: str = "cpu",
) -> np.ndarray:
    """
    Reconstruct a waveform from a spectrogram.
    This is an approximate waveform, using the Griffin-Lim algorithm
    to approximate the phase.
    """
    Sxx_torch = torch.from_numpy(Sxx).to(device)

    if mel_scale:
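        # NOTE: only n_stft survived in this copy of the cell. n_mels and sample_rate come from
        # the function arguments; f_min, f_max, norm and mel_scale are assumed values based on
        # the upstream Riffusion reference code and may need adjusting.
        mel_inv_scaler = torchaudio.transforms.InverseMelScale(
            n_stft=n_fft // 2 + 1,
            n_mels=n_mels,
            sample_rate=sample_rate,
            f_min=0,
            f_max=10000,
            norm=None,
            mel_scale="htk",
        ).to(device)

        Sxx_torch = mel_inv_scaler(Sxx_torch)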

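    # NOTE: the arguments were lost in this copy of the cell. power=1.0 is an assumption
    # (Sxx holds magnitudes, not power); the other values come from the function arguments.
    griffin_lim = torchaudio.transforms.GriffinLim(
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
        power=1.0,
        n_iter=num_griffin_lim_iters,
    ).to(device)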

    waveform = griffin_lim(Sxx_torch).cpu().numpy()

    return waveform
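
To see what the constants in the code above amount to, the short sketch below repeats the same arithmetic in isolation. The `image_width` value of 512 pixels and the helper `pixel_to_magnitude` are assumptions for illustration; the helper mirrors the invert, rescale and power steps of spectrogram_from_image for a single pixel value.

```python
sample_rate = 44100       # [Hz]
window_duration_ms = 100  # [ms]
padded_duration_ms = 400  # [ms]
step_size_ms = 10         # [ms]
image_width = 512         # [px], assumed width of the generated spectrogram

# Same formulas as in wav_bytes_from_spectrogram_image
n_fft = int(padded_duration_ms / 1000.0 * sample_rate)
hop_length = int(step_size_ms / 1000.0 * sample_rate)
win_length = int(window_duration_ms / 1000.0 * sample_rate)
n_stft = n_fft // 2 + 1  # number of linear frequency bins after the inverse mel step

# Each image column is one STFT frame, and frames are hop_length samples apart,
# so the clip length is roughly image_width * hop_length samples.
approx_samples = image_width * hop_length
approx_duration_s = approx_samples / sample_rate

print(f"n_fft={n_fft}, hop_length={hop_length}, win_length={win_length}, n_stft={n_stft}")
print(f"about {approx_samples} samples, {approx_duration_s:.2f} s of audio")

max_volume = 50
power_for_image = 0.25


def pixel_to_magnitude(pixel: float) -> float:
    """Invert, rescale to max_volume and undo the power curve for one pixel value."""
    return ((255 - pixel) * max_volume / 255) ** (1 / power_for_image)


# White pixels (255) are silence, black pixels (0) are the loudest bins.
for pixel in (255, 128, 0):
    print(f"pixel {pixel:>3} -> magnitude {pixel_to_magnitude(pixel):,.1f}")
```

With a power of 0.25 the stored pixel values are fourth roots of the magnitudes, which compresses the large dynamic range of audio into 8-bit pixels.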

Run Inference pipeline

The diagram below briefly describes the workflow of our pipeline



As you can see, it is very similar to Stable Diffusion text-to-image generation, with an additional post-processing step that transforms the generated spectrogram into an audio signal. First, OVStableDiffusionPipeline accepts an input text prompt, which is tokenized and transformed into embedding space by the frozen CLIP text encoder, and generates an initial latent spectrogram representation using a random generator. Then the U-Net iteratively denoises the random latent spectrogram representation while being conditioned on the text embeddings. The output of the U-Net, the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm. The denoising process is repeated a given number of times (50 by default) to retrieve progressively better latent image representations. When it is complete, the latent image representation is decoded by the decoder part of the variational autoencoder. The generated spectrogram image is then converted into spectrogram magnitudes, and an inverse mel scale is applied to estimate an STFT in the linear frequency domain from the mel frequency domain. Finally, the Griffin-Lim algorithm approximates the phase of the audio signal, and we get the reconstructed audio.
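
The control flow of the denoising loop can be sketched with a toy example. Everything here is a stand-in: `predict_noise` is a hypothetical oracle that replaces the U-Net, the noise schedule is an arbitrary linear ramp, and the update is a deterministic DDIM-style step chosen for simplicity, which may differ from the scheduler the pipeline actually uses. The latent shape (4, 64, 64) assumes the usual Stable Diffusion v1.5 latent for a 512x512 image.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the "clean" latent the model would converge to.
target_latent = rng.normal(size=(4, 64, 64))

num_inference_steps = 20
# Cumulative signal level per step: close to 0 means mostly noise, 1.0 means clean.
alphas_cumprod = np.linspace(0.01, 1.0, num_inference_steps + 1)


def predict_noise(latent: np.ndarray, alpha: float) -> np.ndarray:
    """Hypothetical oracle replacing the U-Net: returns the noise residual in `latent`."""
    return (latent - np.sqrt(alpha) * target_latent) / np.sqrt(1.0 - alpha)


# Start from pure Gaussian noise, as the pipeline does.
latent = rng.normal(size=target_latent.shape)

for step in range(num_inference_steps):
    alpha_t = alphas_cumprod[step]
    alpha_prev = alphas_cumprod[step + 1]

    noise = predict_noise(latent, alpha_t)  # the U-Net call in the real pipeline

    # Scheduler step: estimate the clean latent, then move to the next noise level.
    clean_estimate = (latent - np.sqrt(1.0 - alpha_t) * noise) / np.sqrt(alpha_t)
    latent = np.sqrt(alpha_prev) * clean_estimate + np.sqrt(1.0 - alpha_prev) * noise

max_abs_diff = float(np.abs(latent - target_latent).max())
print(f"max difference to the target latent after denoising: {max_abs_diff:.2e}")
```

In the real pipeline the noise prediction comes from the U-Net conditioned on the text embeddings, and the final latent is passed to the VAE decoder to obtain the spectrogram image.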

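pipe.reshape(batch_size=1, height=512, width=512, num_images_per_prompt=1)
# The pipeline was loaded with compile=False, so compile the reshaped models explicitly
pipe.compile()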

def generate(prompt: str, negative_prompt: str = "") -> Tuple[Image.Image, str]:
    """
    Generate audio from a text prompt.

    Parameters:
      prompt (str): input prompt for generation.
      negative_prompt (str): negative prompt for generation, contains undesired concepts for generation, which should be avoided. Can be empty.
    Returns:
      spec (Image.Image): generated spectrogram image
      wav_path (str): path to the generated WAV file
    """
    spec = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=20).images[0]
    # wav_bytes_from_spectrogram_image returns (wav_bytes, duration_s)
    wav_bytes, _ = wav_bytes_from_spectrogram_image(spec)
    with open("output.wav", "wb") as f:
        f.write(wav_bytes.getbuffer())
    return spec, "output.wav"

Compiling the vae_decoder...
Compiling the unet...
Compiling the vae_encoder...
Compiling the text_encoder...

Now, we can test our generation. The generate function accepts text input and returns the generated spectrogram and the path to the generated audio. Optionally, it also accepts a negative prompt. A negative prompt declares undesired concepts for generation. For example, if we want to generate instrumental music, vocals in the audio would be an unwanted effect, so "vocal" can be used as a negative prompt. The positive and negative prompts are on equal footing: you can always use one with or without the other. More explanation of how it works can be found in this article.
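
One common way negative prompts are implemented in Stable Diffusion pipelines is through classifier-free guidance, where the negative prompt takes the place of the empty prompt. The sketch below shows only that combination step on toy vectors. The function name `guided_noise` is hypothetical, and the guidance scale of 7.5 is an assumed typical default rather than a value taken from this notebook.

```python
import numpy as np


def guided_noise(noise_negative: np.ndarray, noise_positive: np.ndarray, guidance_scale: float = 7.5) -> np.ndarray:
    """Push the noise prediction away from the negative prompt and toward the positive one."""
    return noise_negative + guidance_scale * (noise_positive - noise_negative)


# Toy noise predictions for the two prompts (in the pipeline these come from the U-Net).
noise_positive = np.array([1.0, 0.0])  # e.g. conditioned on "Techno beat"
noise_negative = np.array([0.0, 1.0])  # e.g. conditioned on "vocal"

guided = guided_noise(noise_negative, noise_positive)
print("guided prediction:", guided)

# With a scale of 1.0 the negative prompt has no effect on the result.
unguided = guided_noise(noise_negative, noise_positive, guidance_scale=1.0)
print("scale 1.0 gives the positive prediction:", unguided)
```

The larger the guidance scale, the further the prediction moves along the direction from the negative prompt to the positive one.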

spectrogram, wav_path = generate("Techno beat")
height was set to 256 but the static model will output images of height 512.To fix the height, please reshape your model accordingly using the .reshape() method.
width was set to 256 but the static model will output images of width 512.To fix the width, please reshape your model accordingly using the .reshape() method.
/home/ea/work/ov_venv/lib/python3.8/site-packages/optimum/intel/openvino/modeling_diffusion.py:559: FutureWarning: shared_memory is deprecated and will be removed in 2024.0. Value of shared_memory is going to override share_inputs value. Please use only share_inputs explicitly.
  outputs = self.request(inputs, shared_memory=True)
0%|          | 0/21 [00:00<?, ?it/s]
/home/ea/work/ov_venv/lib/python3.8/site-packages/optimum/intel/openvino/modeling_diffusion.py:590: FutureWarning: shared_memory is deprecated and will be removed in 2024.0. Value of shared_memory is going to override share_inputs value. Please use only share_inputs explicitly.
  outputs = self.request(inputs, shared_memory=True)
/home/ea/work/ov_venv/lib/python3.8/site-packages/optimum/intel/openvino/modeling_diffusion.py:606: FutureWarning: shared_memory is deprecated and will be removed in 2024.0. Value of shared_memory is going to override share_inputs value. Please use only share_inputs explicitly.
  outputs = self.request(inputs, shared_memory=True)
import IPython.display as ipd