Sound Generation with Stable Audio Open and OpenVINO™#
This Jupyter notebook can be launched after a local installation only.
Stable Audio Open is an open-source model optimized for generating short audio samples, sound effects, and production elements using text prompts. The model was trained on data from Freesound and the Free Music Archive, respecting creator rights.
Key Takeaways:#
Stable Audio Open is an open source text-to-audio model for generating up to 47 seconds of samples and sound effects.
Users can create drum beats, instrument riffs, ambient sounds, foley and production elements.
The model enables audio variations and style transfer of audio samples.
This model is made to be used with the stable-audio-tools library for inference.
Installation Instructions#
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
Prerequisites#
Note: with Python 3.8, resolving dependency conflicts can take a long time.
%pip install -q "torch>=2.3" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "stable-audio-tools" "nncf>=2.12.0" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -Uq --pre "openvino>=2024.3.0"
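Since the commands above pin minimum versions, it can be useful to confirm what actually got installed before continuing. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package names are copied from the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names are taken from the %pip commands above.
report = installed_versions(["torch", "stable-audio-tools", "nncf", "openvino"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

A `None` entry suggests the corresponding install step did not complete in the active environment.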
Load the original model and inference#
Note: to run the model in this notebook, you need to accept the license agreement. You must be a registered user on the Hugging Face Hub. Please visit the Hugging Face model card, carefully read the terms of usage, and click the accept button. You will need an access token for the code below to run. For more information on access tokens, refer to this section of the documentation. You can log in to the Hugging Face Hub from the notebook environment using the following code:
# uncomment these lines to login to huggingfacehub to get access to pretrained model
# from huggingface_hub import notebook_login, whoami
# try:
#     whoami()
#     print('Authorization token already provided')
# except OSError:
#     notebook_login()
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond
# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]
model = model.to("cpu")
total_seconds = 20
# Set up text and timing conditioning
conditioning = [{"prompt": "128 BPM tech house drum loop", "seconds_start": 0, "seconds_total": total_seconds}]
# Generate stereo audio
output = generate_diffusion_cond(
    model,
    steps=100,
    seed=42,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_rate * total_seconds,
    sigma_min=0.3,
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",
    device="cpu",
)
# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")
# Peak normalize, clip, convert to int16, and save to file
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
from IPython.display import Audio
Audio("output.wav")
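The two post-processing steps in the cell above (flattening the batch and converting to 16-bit PCM) can be looked at in isolation. The sketch below reproduces them with NumPy instead of torch and einops, purely to keep the illustration dependency-light; the function names are hypothetical, and the guard against a silent clip is an addition that the notebook code does not have.

```python
import numpy as np


def flatten_batch(audio):
    """(batch, channels, samples) -> (channels, batch * samples), like "b d n -> d (b n)"."""
    b, d, n = audio.shape
    return audio.transpose(1, 0, 2).reshape(d, b * n)


def to_int16_pcm(audio, eps=1e-12):
    """Peak-normalize a float waveform and convert it to 16-bit PCM."""
    peak = np.max(np.abs(audio))
    # Guard against an all-zero (silent) clip, which would otherwise divide by zero.
    scaled = audio / max(peak, eps)
    return (np.clip(scaled, -1.0, 1.0) * 32767).astype(np.int16)


# One clip, two channels, two samples per channel.
batch = np.array([[[0.1, -0.5], [0.25, 0.0]]], dtype=np.float32)
pcm = to_int16_pcm(flatten_batch(batch))
print(pcm.shape, pcm.dtype, np.abs(pcm).max())
```

Peak normalization scales the loudest sample to full range, so every generated clip uses the whole 16-bit range regardless of how quiet the raw model output was.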
Convert the model to OpenVINO IR#
Let’s define the conversion function for PyTorch modules. We use the ov.convert_model function to obtain an OpenVINO Intermediate Representation object and the ov.save_model function to save it as an XML file.
To reduce memory consumption, weight compression can be applied using NNCF. Weight compression aims to reduce the memory footprint of a model. Models that require extensive memory to store their weights during inference can benefit from weight compression in the following ways:
enabling the inference of exceptionally large models that cannot be accommodated in the memory of the device;
improving the inference performance of the models by reducing the latency of the memory access when computing the operations with weights, for example, Linear layers.
Neural Network Compression Framework (NNCF) provides 4-bit / 8-bit mixed weight quantization as a compression method. The main difference between weight compression and full model quantization (post-training quantization) is that activations remain floating-point in the case of weight compression, which leads to better accuracy. In addition, weight compression is data-free and does not require a calibration dataset, making it easy to use.
The nncf.compress_weights function can be used to perform weight compression. The function accepts an OpenVINO model and other compression parameters. Different parameters may be suitable for different models. In this case, the default parameters give poor results, but we can change mode to CompressWeightsMode.INT8_SYM to compress weights symmetrically to an 8-bit integer data type and get inference results that match the original.
More details about weights compression can be found in OpenVINO documentation.
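To make the idea of symmetric 8-bit weight compression concrete, here is a minimal NumPy sketch. It illustrates the general technique only and is not NNCF's implementation: the per-row scale, the rounding scheme, and the function names are all assumptions made for this example.

```python
import numpy as np


def quantize_int8_sym(weights):
    """Symmetric per-row INT8 quantization: w ≈ q * scale, with no zero point."""
    # One scale per output row (assumption), chosen so the largest |w| maps to 127.
    max_abs = np.max(np.abs(weights), axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover approximate floating-point weights at inference time."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)
q, scale = quantize_int8_sym(w)
w_hat = dequantize(q, scale)

print("bytes fp32:", w.nbytes, "bytes int8:", q.nbytes)
print("max abs error:", float(np.max(np.abs(w - w_hat))))
```

The stored weights shrink to a quarter of their float32 size, while each reconstructed value stays within half a quantization step of the original. Activations are untouched, which is why no calibration data is needed.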
from pathlib import Path
import numpy as np
import torch
from nncf import compress_weights, CompressWeightsMode
import openvino as ov
def convert(model: torch.nn.Module, xml_path: str, example_input):
    xml_path = Path(xml_path)
    if not xml_path.exists():
        xml_path.parent.mkdir(parents=True, exist_ok=True)
        model.eval()
        with torch.no_grad():
            converted_model = ov.convert_model(model, example_input=example_input)
            converted_model = compress_weights(converted_model, mode=CompressWeightsMode.INT8_SYM)
        ov.save_model(converted_model, xml_path)

        # cleanup memory
        torch._C._jit_clear_class_registry()
        torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()
        torch.jit._state._clear_class_state()
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, openvino
MODEL_DIR = Path("model")
CONDITIONER_ENCODER_PATH = MODEL_DIR / "conditioner_encoder.xml"
DIFFUSION_PATH = MODEL_DIR / "diffusion.xml"
PRETRANSFORM_PATH = MODEL_DIR / "pretransform.xml"
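The convert function skips conversion when the XML file already exists. Because ov.save_model writes a .bin weights file next to each .xml file, an interrupted run could in principle leave an incomplete pair behind. The sketch below, with the hypothetical helper `ir_is_complete`, shows one way to check that both files are present before relying on a cached model.

```python
from pathlib import Path


def ir_is_complete(xml_path):
    """True if both the .xml topology file and its sibling .bin weights file exist."""
    xml_path = Path(xml_path)
    return xml_path.exists() and xml_path.with_suffix(".bin").exists()


# Paths mirror the constants defined above.
model_dir = Path("model")
for name in ("conditioner_encoder", "diffusion", "pretransform"):
    xml = model_dir / f"{name}.xml"
    print(f"{xml}: {'ready' if ir_is_complete(xml) else 'needs conversion'}")
```

If a model is reported as needing conversion even though its .xml file exists, deleting that file lets convert regenerate the pair.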
The pipeline comprises three components: an autoencoder that compresses waveforms into a manageable sequence length, a T5-based text embedding for text conditioning, and a transformer-based diffusion (DiT) model that operates in the latent space of the autoencoder. This example does not use an initial audio input, so we only need to convert the T5-based text embedding model, the transformer-based diffusion (DiT) model, and the decoder part of the autoencoder.
T5-based text embedding#
example_input = {
    "input_ids": torch.zeros(1, 120, dtype=torch.int64),
    "attention_mask": torch.zeros(1, 120, dtype=torch.int64),
}
convert(model.conditioner.conditioners["prompt"].model, CONDITIONER_ENCODER_PATH, example_input)
WARNING:nncf:NNCF provides best results with torch==2.3.*, while current torch version is 2.4.0+cpu. If you encounter issues, consider switching to torch==2.3.*
/home/ea/work/py3.11/lib/python3.11/site-packages/transformers/modeling_utils.py:4664: FutureWarning: _is_quantized_training_enabled is going to be deprecated in transformers 4.39.0. Please use model.hf_quantizer.is_trainable instead
warnings.warn(
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Num bits (N) │ % all parameters (layers) │ % ratio-defining parameters (layers) │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ 8 │ 100% (74 / 74) │ 100% (74 / 74) │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
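The example input for the text encoder consists of two tensors of the same fixed length: token IDs and an attention mask. As a rough illustration of how such a pair is typically built, the sketch below pads a short token sequence to a fixed length in plain Python. The helper name, the token IDs, the length of 8, and the pad ID of 0 are all assumptions for this example; in the real pipeline the tokenizer produces these tensors.

```python
def pad_tokens(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token IDs to max_length and build the matching attention mask."""
    ids = list(token_ids)[:max_length]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


# Hypothetical token IDs; a real run would get these from the T5 tokenizer.
input_ids, attention_mask = pad_tokens([8774, 296, 1], max_length=8)
print(input_ids)
print(attention_mask)
```

The mask marks real tokens with 1 and padding with 0, so the encoder can ignore the padded positions.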
Transformer-based diffusion (DiT) model#
class DiffusionWrapper(torch.nn.Module):
    def __init__(self, diffusion):
        super().__init__()
        self.diffusion = diffusion

    def forward(self, x=None, t=None, cross_attn_cond=None, cross_attn_cond_mask=None, global_embed=None):
        model_inputs = {"cross_attn_cond": cross_attn_cond, "cross_attn_cond_mask": cross_attn_cond_mask, "global_embed": global_embed}
        return self.diffusion.forward(x, t, cfg_scale=7, **model_inputs)
example_input = {
    "x": torch.rand([1, 64, 1024], dtype=torch.float32),
    "t": torch.rand([1], dtype=torch.float32),
    "cross_attn_cond": torch.rand([1, 130, 768], dtype=torch.float32),
    "cross_attn_cond_mask": torch.ones([1, 130], dtype=torch.float32),
    "global_embed": torch.rand(torch.Size([1, 1536]), dtype=torch.float32),
}
convert(DiffusionWrapper(model.model.model), DIFFUSION_PATH, example_input)
/home/ea/work/py3.11/lib/python3.11/site-packages/stable_audio_tools/models/transformer.py:772: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert prepend_dim == x.shape[-1], 'prepend dimension must match sequence dimension'
/home/ea/work/py3.11/lib/python3.11/site-packages/stable_audio_tools/models/transformer.py:461: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if n == 1 and causal:
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Num bits (N) │ % all parameters (layers) │ % ratio-defining parameters (layers) │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ 8 │ 100% (179 / 179) │ 100% (179 / 179) │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Decoder part of autoencoder#
convert(model.pretransform.model.decoder, PRETRANSFORM_PATH, torch.rand([1, 64, 215], dtype=torch.float32))
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Num bits (N) │ % all parameters (layers) │ % ratio-defining parameters (layers) │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ 8 │ 100% (37 / 37) │ 100% (37 / 37) │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
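The example inputs for the diffusion model and the decoder use latent tensors with 64 channels and lengths of 1024 and 215. The following arithmetic sketch shows how such lengths could relate to clip duration. It assumes a sample rate of 44,100 Hz and an autoencoder downsampling ratio of 2,048; neither value is stated in this notebook, so check model_config before relying on them.

```python
SAMPLE_RATE = 44_100        # assumption: value of model_config["sample_rate"]
DOWNSAMPLING_RATIO = 2_048  # assumption: autoencoder compression factor


def latent_length(seconds):
    """Number of latent frames for a clip of the given duration."""
    return (SAMPLE_RATE * seconds) // DOWNSAMPLING_RATIO


def max_seconds(latent_frames):
    """Longest clip duration that fits in the given number of latent frames."""
    return latent_frames * DOWNSAMPLING_RATIO / SAMPLE_RATE


print(latent_length(10))            # frames for a 10-second clip
print(latent_length(20))            # frames for the 20-second clip generated earlier
print(round(max_seconds(1024), 2))  # duration covered by 1024 latent frames
```

Under these assumptions, 1024 latent frames cover roughly 47.5 seconds, which lines up with the 47-second limit mentioned in the key takeaways, and 215 frames correspond to a 10-second clip.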
Compiling models and inference#
Select a device from the dropdown list to run inference with OpenVINO.
import ipywidgets as widgets
core = ov.Core()
device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value="CPU",
    description="Device:",
    disabled=False,
)
device
Dropdown(description='Device:', options=('CPU', 'AUTO'), value='CPU')
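When the notebook runs without a widget front end (for example, as a script), the device can be chosen programmatically instead. The sketch below is one possible approach; the helper `pick_device` and its preference order are hypothetical, and in the notebook the `available` list would come from core.available_devices.

```python
def pick_device(available, preferred=("GPU", "CPU"), fallback="AUTO"):
    """Return the first preferred device that is available, else the fallback."""
    for name in preferred:
        # Device names can carry a suffix such as "GPU.0", so match on the prefix.
        for candidate in available:
            if candidate == name or candidate.startswith(name + "."):
                return candidate
    return fallback


# In the notebook, `available` would be core.available_devices.
print(pick_device(["CPU"]))
print(pick_device(["CPU", "GPU.0"]))
print(pick_device([]))
```

The returned string can be passed wherever the notebook uses device.value.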
Let’s create callable wrapper classes for the compiled models to allow interaction with the original pipeline. Note that all of the wrapper classes return torch.Tensor objects instead of np.array objects.
class TextEncoderWrapper(torch.nn.Module):
    def __init__(self, text_encoder, dtype, device="CPU"):
        super().__init__()
        self.text_encoder = core.compile_model(text_encoder, device)
        self.dtype = dtype

    def __call__(self, input_ids=None, attention_mask=None):
        inputs = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
        }
        last_hidden_state = self.text_encoder(inputs)[0]
        return {"last_hidden_state": torch.from_numpy(last_hidden_state)}


class OVWrapper(torch.nn.Module):
    def __init__(self, ov_model, old_model, device="CPU") -> None:
        super().__init__()
        self.mock = torch.nn.Parameter(torch.zeros(1))  # this is only mock to not change the pipeline
        self.dif_transformer = core.compile_model(ov_model, device)

    def forward(self, x=None, t=None, cross_attn_cond=None, cross_attn_cond_mask=None, global_embed=None, **kwargs):
        inputs = {
            "x": x,
            "t": t,
            "cross_attn_cond": cross_attn_cond,
            "cross_attn_cond_mask": cross_attn_cond_mask,
            "global_embed": global_embed,
        }
        result = self.dif_transformer(inputs)
        return torch.from_numpy(result[0])


class PretransformDecoderWrapper(torch.nn.Module):
    def __init__(self, ov_model, device="CPU"):
        super().__init__()
        self.decoder = core.compile_model(ov_model, device)

    def forward(self, latents=None):
        result = self.decoder(latents)
        return torch.from_numpy(result[0])
Now we can replace the original models with our wrapped OpenVINO models and run inference.
model.model.model = OVWrapper(DIFFUSION_PATH, model.model.model, device.value)
model.conditioner.conditioners["prompt"].model = TextEncoderWrapper(
    CONDITIONER_ENCODER_PATH, model.conditioner.conditioners["prompt"].model.dtype, device.value
)
model.pretransform.model.decoder = PretransformDecoderWrapper(PRETRANSFORM_PATH, device.value)
output = generate_diffusion_cond(
    model,
    steps=100,
    seed=42,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_rate * total_seconds,
    sigma_min=0.3,
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",
    device="cpu",
)
# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")
# Peak normalize, clip, convert to int16, and save to file
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
42
/home/ea/work/py3.11/lib/python3.11/site-packages/stable_audio_tools/models/conditioners.py:314: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
with torch.cuda.amp.autocast(dtype=torch.float16) and torch.set_grad_enabled(self.enable_grad):
/home/ea/work/py3.11/lib/python3.11/site-packages/torch/amp/autocast_mode.py:265: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn(
/home/ea/work/py3.11/lib/python3.11/site-packages/stable_audio_tools/inference/sampling.py:177: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
with torch.cuda.amp.autocast():
0%| | 0/100 [00:00<?, ?it/s]
/home/ea/work/py3.11/lib/python3.11/site-packages/torchsde/_brownian/brownian_interval.py:608: UserWarning: Should have tb<=t1 but got tb=500.00006103515625 and t1=500.000061.
warnings.warn(f"Should have {tb_name}<=t1 but got {tb_name}={tb} and t1={self._end}.")
/home/ea/work/py3.11/lib/python3.11/site-packages/torchsde/_brownian/brownian_interval.py:599: UserWarning: Should have ta>=t0 but got ta=0.29999998211860657 and t0=0.3.
warnings.warn(f"Should have ta>=t0 but got ta={ta} and t0={self._start}.")
Audio("output.wav")
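To compare the OpenVINO result with the PyTorch result numerically rather than by ear, one option is a signal-to-noise ratio between the two waveforms. The sketch below uses synthetic signals as stand-ins, and the function name `snr_db` is hypothetical. Note that both runs above write to output.wav, so the original waveform would have to be kept under a different name or variable first. Because the sampler is stochastic, small numeric differences may grow over many steps, so a low value does not by itself mean the converted model is wrong.

```python
import numpy as np


def snr_db(reference, candidate):
    """Signal-to-noise ratio of `candidate` against `reference`, in decibels."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    noise_power = np.mean((reference - candidate) ** 2)
    if noise_power == 0:
        return float("inf")  # identical signals
    signal_power = np.mean(reference ** 2)
    return float(10 * np.log10(signal_power / noise_power))


# Synthetic stand-ins for the two int16 waveforms.
t = np.linspace(0, 1, 8000, endpoint=False)
original = np.round(np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
rng = np.random.default_rng(0)
noise = rng.integers(-50, 51, size=original.shape)
perturbed = np.clip(original + noise, -32768, 32767).astype(np.int16)

print(snr_db(original, original))
print(round(snr_db(original, perturbed), 1))
```

Higher values mean the two waveforms are closer; identical waveforms give infinity.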
Interactive inference#
def _generate(prompt, total_seconds, steps, seed):
    sample_rate = model_config["sample_rate"]

    # Set up text and timing conditioning
    conditioning = [{"prompt": prompt, "seconds_start": 0, "seconds_total": total_seconds}]

    output = generate_diffusion_cond(
        model,
        steps=steps,
        seed=seed,
        cfg_scale=7,
        conditioning=conditioning,
        sample_size=sample_rate * total_seconds,
        sigma_min=0.3,
        sigma_max=500,
        sampler_type="dpmpp-3m-sde",
        device="cpu",
    )

    # Rearrange audio batch to a single sequence
    output = rearrange(output, "b d n -> d (b n)")

    # Peak normalize, clip, and convert to int16
    output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()

    # Return (sample_rate, samples x channels) for the Gradio audio output
    return (sample_rate, output.numpy().transpose())
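The final transpose in _generate matters because the pipeline produces channels-first audio, while the Gradio audio output takes a (sample_rate, data) tuple with data laid out as (samples, channels). A minimal sketch of that conversion, using a hypothetical helper name and a silent stereo array as a stand-in for real output:

```python
import numpy as np


def to_gradio_audio(sample_rate, channels_first):
    """Convert a (channels, samples) array into a (sample_rate, (samples, channels)) tuple."""
    audio = np.asarray(channels_first)
    if audio.ndim != 2:
        raise ValueError(f"expected a 2-D (channels, samples) array, got shape {audio.shape}")
    return sample_rate, np.ascontiguousarray(audio.T)


stereo = np.zeros((2, 44_100), dtype=np.int16)  # one second of silent stereo audio
rate, data = to_gradio_audio(44_100, stereo)
print(rate, data.shape, data.dtype)
```

Skipping the transpose would hand Gradio a two-sample clip with tens of thousands of channels, which is not what is intended.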
import gradio as gr

demo = gr.Interface(
    _generate,
    inputs=[
        gr.Textbox(label="Text Prompt"),
        gr.Slider(1, 47, label="Total seconds", step=1, value=10),
        gr.Slider(10, 100, label="Number of steps", step=1, value=100),
        gr.Slider(0, np.iinfo(np.int32).max, label="Seed", step=1),
    ],
    outputs=["audio"],
    examples=[
        ["128 BPM tech house drum loop"],
        ["Blackbird song, summer, dusk in the forest"],
        ["Rock beat played in a treated studio, session drumming on an acoustic kit"],
        ["Calmful melody and nature sounds for restful sleep"],
    ],
    allow_flagging="never",
)

try:
    demo.launch(debug=True)
except Exception:
    demo.launch(share=True, debug=True)
# If you are launching remotely, specify server_name and server_port
# EXAMPLE: `demo.launch(server_name='your server name', server_port='server port in int')`
# To learn more please refer to the Gradio docs: https://gradio.app/docs/