Controllable Music Generation with MusicGen and OpenVINO
This tutorial is also available as a Jupyter notebook that can be cloned directly from GitHub. See the installation guide for instructions to run this tutorial locally on Windows, Linux or macOS. To run without installing anything, click the “launch binder” or “Open in Colab” button.
MusicGen is a single-stage auto-regressive Transformer model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. The text prompt is passed to a text encoder model (T5) to obtain a sequence of hidden-state representations. These hidden states are fed to MusicGen, which predicts discrete audio tokens (audio codes). Finally, the audio tokens are decoded using an audio compression model (EnCodec) to recover the audio waveform.
The MusicGen model does not require a self-supervised semantic representation of the text/audio prompts; it operates over several streams of compressed discrete music representation with efficient token interleaving patterns, thus eliminating the need to cascade multiple models to predict a set of codebooks (e.g., hierarchically or via upsampling). Unlike prior models addressing music generation, it is able to generate all the codebooks in a single forward pass.
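Before diving in, it can be helpful to confirm that the Transformers checkpoint really bundles these three stages as separate sub-modules. The short sketch below (using the same small checkpoint that is loaded later in this tutorial) simply prints their classes:

from transformers import MusicgenForConditionalGeneration

# The conditional generation model bundles the three stages described above.
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
print(type(model.text_encoder).__name__)   # T5EncoderModel: prompt -> hidden states
print(type(model.decoder).__name__)        # MusicgenForCausalLM: hidden states -> audio codes
print(type(model.audio_encoder).__name__)  # EncodecModel: audio codes -> waveform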
In this tutorial, we consider how to run the MusicGen model using OpenVINO.
We will use a model implementation from the Hugging Face Transformers library.
!pip install -q "openvino==2023.1.0.dev20230811"
!pip install -q torch onnx gradio
!pip install -q transformers
from collections import namedtuple
import gc
from pathlib import Path
from typing import Optional, Tuple
import warnings

from IPython.display import Audio
from openvino import Core, convert_model, PartialShape, save_model, Type
import numpy as np
import torch
from torch.jit import TracerWarning
from transformers import AutoProcessor, MusicgenForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutputWithPastAndCrossAttentions, CausalLMOutputWithCrossAttentions

# Ignore tracing warnings
warnings.filterwarnings("ignore", category=TracerWarning)
To work with MusicGen by Meta AI, we will use the Hugging Face Transformers package. It exposes the MusicgenForConditionalGeneration class, simplifying model instantiation and weights loading. The code below demonstrates how to create a MusicgenForConditionalGeneration model and generate a text-conditioned music sample.
# Load the pipeline
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small", torchscript=True, return_dict=False)
In the cell below, the user is free to change the PyTorch model inference device and the desired music sample length.
device = "cpu" sample_length = 8 # seconds n_tokens = sample_length * model.config.audio_encoder.frame_rate + 3 sampling_rate = model.config.audio_encoder.sampling_rate print('Sampling rate is', sampling_rate, 'Hz') model.to(device) model.eval();
Sampling rate is 32000 Hz
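The token budget maps directly to the sample length: the EnCodec audio encoder of this checkpoint produces 50 frames per second, and the three extra steps compensate for the delay pattern that offsets each of the four codebooks by one step. A quick sanity check (a sketch reusing the names defined above):

# Sanity-check the token budget: 50 audio frames per second for this checkpoint,
# plus 3 extra decoding steps to cover the codebook delay pattern.
frame_rate = model.config.audio_encoder.frame_rate
print(frame_rate)                      # 50
print(sample_length * frame_rate + 3)  # 403 == n_tokens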
Text preprocessing prepares the text prompt to be fed into the model; the processor object abstracts this step for us. Text tokenization is performed under the hood: it assigns tokens or IDs to the words. In other words, token IDs are just indices of the words in the model vocabulary. This helps the model understand the context of a sentence.
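As a quick illustration of tokenization, we can run a prompt through the processor's underlying T5 tokenizer (a small sketch; the same processor is loaded again in the next cell):

from transformers import AutoProcessor

# Peek at the tokenizer output: each token ID is an index into the T5 vocabulary.
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
ids = processor.tokenizer("80s pop track with bassy drums and synth")["input_ids"]
print(ids)
print(processor.tokenizer.convert_ids_to_tokens(ids))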
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    return_tensors="pt",
)

audio_values = model.generate(**inputs.to(device), do_sample=True, guidance_scale=3, max_new_tokens=n_tokens)

Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)
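The generated waveform can also be written to disk. A minimal sketch using SciPy (assumed to be available in the environment; it is not installed explicitly above):

import scipy.io.wavfile

# Save the first (and only) generated sample as a mono WAV file.
scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())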