Sound Generation with Stable Audio Open and OpenVINO™
This Jupyter notebook can be launched after a local installation only.
Stable Audio Open is an open-source model optimized for generating short audio samples, sound effects, and production elements using text prompts. The model was trained on data from Freesound and the Free Music Archive, respecting creator rights.
Key Takeaways:
Stable Audio Open is an open-source text-to-audio model that generates samples and sound effects up to 47 seconds long.
Users can create drum beats, instrument riffs, ambient sounds, foley and production elements.
The model enables audio variations and style transfer of audio samples.
The model is intended to be used with the stable-audio-tools library for inference.
Installation Instructions
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to the Installation Guide.
Prerequisites
Note: with Python 3.8, resolving dependency conflicts can take a long time.
%pip install -q "torch>=2.3" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "stable-audio-tools" "nncf>=2.12.0" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -Uq --pre "openvino>=2024.3.0"
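After the install cells finish, it can be useful to confirm that the environment actually satisfies the minimum versions requested above. The following is a minimal sketch, not part of the original notebook: the helper names `release_tuple`, `meets_minimum`, and `check_environment` are hypothetical, and the comparison only looks at the leading numeric part of each version string, so pre-release tags and local suffixes such as `+cpu` are ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the pip commands above.
MINIMUMS = {"torch": "2.3", "nncf": "2.12.0", "openvino": "2024.3.0"}


def release_tuple(version_string):
    """Return the leading numeric components, e.g. '2.3.1+cpu' -> (2, 3, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum):
    """Compare numeric release parts only (a simplification of full version rules)."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so that '2.3' compares like '2.3.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUMS):
    """Map each package name to 'ok', 'not installed', or a short mismatch note."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"{installed} < {minimum}"
    return report


for name, status in check_environment().items():
    print(f"{name}: {status}")
```

If any package is reported as missing or too old, rerunning the corresponding `%pip install` cell and restarting the kernel is usually the simplest fix.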
Load the original model and run inference
Note: to run the model in this notebook, you need to accept its license agreement. You must be a registered user on the Hugging Face Hub. Visit the Hugging Face model card, read the terms of use carefully, and click the accept button. The code below requires an access token; for more information on access tokens, refer to this section of the documentation. You can log in to the Hugging Face Hub from the notebook environment with the following code:
# Uncomment these lines to log in to the Hugging Face Hub and get access to the pretrained model
# from huggingface_hub import notebook_login, whoami
# try:
#     whoami()
#     print('Authorization token already provided')
# except OSError:
#     notebook_login()
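When the notebook runs without an interactive prompt (for example, on a remote server), a token can be supplied through an environment variable instead of `notebook_login()`. The sketch below is an assumption about how one might wire this up, not part of the original notebook: `resolve_hf_token` and `login_from_env` are hypothetical helper names, while `HF_TOKEN` is the variable name that `huggingface_hub` recognizes and `login(token=...)` is its programmatic login call.

```python
import os


def resolve_hf_token(env=None):
    """Return the token stored in HF_TOKEN, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def login_from_env():
    """Log in with the HF_TOKEN value; return True if a login was attempted."""
    token = resolve_hf_token()
    if token is None:
        return False
    # Imported lazily so resolve_hf_token() works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=token)
    return True


# In the notebook, call login_from_env() before downloading the model.
```

Keeping the token in the environment rather than in a notebook cell avoids saving it to disk together with the notebook.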
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond
# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]
model = model.to("cpu")
total_seconds = 20
# Set up text and timing conditioning
conditioning = [{"prompt": "128 BPM tech house drum loop", "seconds_start": 0, "seconds_total": total_seconds}]
# Generate stereo audio
output = generate_diffusion_cond(
    model,
    steps=100,
    seed=42,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_rate * total_seconds,
    sigma_min=0.3,
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",
    device="cpu",
)
# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")
# Peak normalize, clip, convert to int16, and save to file
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
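The post-processing steps above (rearrange, peak normalize, clip, convert to int16) can be easier to follow on a tiny array. The following is an illustrative sketch that mirrors them with NumPy instead of torch and einops; the function name `to_int16_channels` is hypothetical, and the guard for an all-zero signal is an addition that the original snippet does not have.

```python
import numpy as np


def to_int16_channels(batch):
    """Convert a float array shaped (batch, channels, samples) to int16 PCM.

    Returns an array shaped (channels, batch * samples).
    """
    b, d, n = batch.shape
    # Same layout change as rearrange(x, "b d n -> d (b n)"):
    # batch items are concatenated along the time axis, per channel.
    audio = np.transpose(batch, (1, 0, 2)).reshape(d, b * n)
    peak = np.max(np.abs(audio))
    if peak == 0:
        # A silent signal would otherwise cause a division by zero.
        return np.zeros(audio.shape, dtype=np.int16)
    # Peak normalize to [-1, 1], clip, then scale to the int16 range.
    audio = np.clip(audio / peak, -1.0, 1.0)
    return (audio * 32767).astype(np.int16)


# One batch item, two channels, two samples per channel.
demo = np.array([[[0.5, -0.25], [0.1, 0.0]]])
pcm = to_int16_channels(demo)
print(pcm.shape, pcm.dtype)
print(pcm)
```

Because the signal is divided by its own peak, the loudest sample always maps to full scale, so clips generated from different prompts end up with the same peak level regardless of how loud the raw model output was.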
from IPython.display import Audio
Audio("output.wav")
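To generate a clip of a different length, the `seconds_total` conditioning value and the `sample_size` argument used above need to change together. The sketch below shows one way to keep them in sync; `conditioning_and_size` is a hypothetical helper, the 47-second bound is taken from the key takeaways, and the 44,100 Hz default is only an assumption for illustration (in the notebook, pass `model_config["sample_rate"]` instead).

```python
MAX_SECONDS = 47  # upper bound stated in the key takeaways
ASSUMED_SAMPLE_RATE = 44_100  # assumption; prefer model_config["sample_rate"]


def conditioning_and_size(prompt, seconds, sample_rate=ASSUMED_SAMPLE_RATE):
    """Build the conditioning list and the matching sample count for one clip."""
    if not 0 < seconds <= MAX_SECONDS:
        raise ValueError(f"seconds must be in (0, {MAX_SECONDS}], got {seconds}")
    conditioning = [
        {"prompt": prompt, "seconds_start": 0, "seconds_total": seconds}
    ]
    # Number of audio samples per channel for the requested duration.
    sample_size = int(sample_rate * seconds)
    return conditioning, sample_size


conditioning, sample_size = conditioning_and_size("128 BPM tech house drum loop", 20)
print(sample_size)
```

The two return values can then be passed as `conditioning=` and `sample_size=` to `generate_diffusion_cond` in place of the hard-coded values.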