Controllable Music Generation with MusicGen and OpenVINO#

This Jupyter notebook can be launched online, opening an interactive environment in a browser window. You can also install and run it locally. Choose one of the following options:

Binder | Google Colab | GitHub

MusicGen is a single-stage auto-regressive Transformer model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. The text prompt is passed to a text encoder model (T5) to obtain a sequence of hidden-state representations. These hidden states are fed to MusicGen, which predicts discrete audio tokens (audio codes). Finally, the audio tokens are decoded using an audio compression model (EnCodec) to recover the audio waveform.

MusicGen pipeline diagram

The MusicGen model does not require a self-supervised semantic representation of the text/audio prompts; it operates over several parallel streams of compressed, discrete music representation (codebooks) with an efficient token interleaving pattern, which eliminates the need to cascade multiple models (e.g. hierarchically or via upsampling) to predict the set of codebooks. Unlike prior music generation models, it can generate all the codebooks in a single forward pass. A toy illustration of such an interleaving pattern is shown below.
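
To make the interleaving idea concrete, here is a toy sketch of a delay-style interleaving pattern of the kind MusicGen uses: codebook k is shifted right by k steps, so at every generation step the model emits one token per codebook in a single pass. The array sizes and the padding value below are illustrative only, not the library's internal representation.

import numpy as np

num_codebooks, seq_len, pad = 4, 6, -1  # toy sizes; MusicGen itself uses 4 codebooks
codes = np.arange(num_codebooks * seq_len).reshape(num_codebooks, seq_len)

# Shift codebook k right by k steps and pad the gaps.
delayed = np.full((num_codebooks, seq_len + num_codebooks - 1), pad)
for k in range(num_codebooks):
    delayed[k, k : k + seq_len] = codes[k]
print(delayed)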

In this tutorial, we consider how to run the MusicGen model using OpenVINO.

We will use a model implementation from the Hugging Face Transformers library.


Installation Instructions#

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.

Prerequisites#

Install requirements#

%pip install -q "openvino>=2023.3.0"
%pip install -q "torch>=2.1" "gradio>=4.19" "transformers" packaging --extra-index-url https://download.pytorch.org/whl/cpu
Note: you may need to restart the kernel to use updated packages.

Imports#

from collections import namedtuple
from functools import partial
import gc
from pathlib import Path
from typing import Optional, Tuple
import warnings

from IPython.display import Audio
import openvino as ov
import numpy as np
import torch
from torch.jit import TracerWarning
from transformers import AutoProcessor, MusicgenForConditionalGeneration
from transformers.modeling_outputs import (
    BaseModelOutputWithPastAndCrossAttentions,
    CausalLMOutputWithCrossAttentions,
)

# Ignore tracing warnings
warnings.filterwarnings("ignore", category=TracerWarning)

MusicGen in HF Transformers#

To work with MusicGen by Meta AI, we will use the Hugging Face Transformers package. It exposes the MusicgenForConditionalGeneration class, which simplifies model instantiation and weight loading. The code below demonstrates how to create a MusicgenForConditionalGeneration instance and generate a text-conditioned music sample.

import sys
from packaging.version import parse


if sys.version_info < (3, 8):
    import importlib_metadata
else:
    import importlib.metadata as importlib_metadata
loading_kwargs = {}

if parse(importlib_metadata.version("transformers")) >= parse("4.40.0"):
    loading_kwargs["attn_implementation"] = "eager"


# Load the pipeline
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small", torchscript=True, return_dict=False, **loading_kwargs)
Config of the text_encoder: <class 'transformers.models.t5.modeling_t5.T5EncoderModel'> is overwritten by shared text_encoder config: T5Config {
  "_name_or_path": "t5-base",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to German: "
    },
    "translation_en_to_fr": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to French: "
    },
    "translation_en_to_ro": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to Romanian: "
    }
  },
  "transformers_version": "4.46.3",
  "use_cache": true,
  "vocab_size": 32128
}

Config of the audio_encoder: <class 'transformers.models.encodec.modeling_encodec.EncodecModel'> is overwritten by shared audio_encoder config: EncodecConfig {
  "_name_or_path": "facebook/encodec_32khz",
  "architectures": [
    "EncodecModel"
  ],
  "audio_channels": 1,
  "chunk_length_s": null,
  "codebook_dim": 128,
  "codebook_size": 2048,
  "compress": 2,
  "dilation_growth_rate": 2,
  "hidden_size": 128,
  "kernel_size": 7,
  "last_kernel_size": 7,
  "model_type": "encodec",
  "norm_type": "weight_norm",
  "normalize": false,
  "num_filters": 64,
  "num_lstm_layers": 2,
  "num_residual_layers": 1,
  "overlap": null,
  "pad_mode": "reflect",
  "residual_kernel_size": 3,
  "sampling_rate": 32000,
  "target_bandwidths": [
    2.2
  ],
  "torch_dtype": "float32",
  "transformers_version": "4.46.3",
  "trim_right_ratio": 1.0,
  "upsampling_ratios": [
    8,
    5,
    4,
    4
  ],
  "use_causal_conv": false,
  "use_conv_shortcut": false
}

Config of the decoder: <class 'transformers.models.musicgen.modeling_musicgen.MusicgenForCausalLM'> is overwritten by shared decoder config: MusicgenDecoderConfig {
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "attention_dropout": 0.0,
  "audio_channels": 1,
  "bos_token_id": 2048,
  "classifier_dropout": 0.0,
  "dropout": 0.1,
  "ffn_dim": 4096,
  "hidden_size": 1024,
  "initializer_factor": 0.02,
  "layerdrop": 0.0,
  "max_position_embeddings": 2048,
  "model_type": "musicgen_decoder",
  "num_attention_heads": 16,
  "num_codebooks": 4,
  "num_hidden_layers": 24,
  "pad_token_id": 2048,
  "scale_embedding": false,
  "tie_word_embeddings": false,
  "transformers_version": "4.46.3",
  "use_cache": true,
  "vocab_size": 2048
}
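
The configs printed above show that the pipeline is composed of three sub-models: a T5 text encoder, a MusicGen decoder that predicts 4 parallel codebooks, and a 32 kHz EnCodec audio codec. If you want to double-check these values programmatically, the composite config exposes them directly; this is a quick optional check, not required for the rest of the tutorial.

print("text encoder:", model.config.text_encoder.model_type)                   # t5
print("decoder codebooks:", model.config.decoder.num_codebooks)                # 4
print("audio codec sampling rate:", model.config.audio_encoder.sampling_rate)  # 32000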

In the cell below, you are free to change the desired length of the music sample.

sample_length = 8  # seconds

n_tokens = sample_length * model.config.audio_encoder.frame_rate + 3
sampling_rate = model.config.audio_encoder.sampling_rate
print("Sampling rate is", sampling_rate, "Hz")

model.to("cpu")
model.eval();
Sampling rate is 32000 Hz
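
The audio codec emits a fixed number of tokens per second of audio (its frame rate), so the number of new tokens controls the sample duration; the extra + 3 in the cell above presumably leaves a small margin for special tokens. For the 32 kHz EnCodec configuration shown earlier, the hop length is the product of the upsampling ratios (8 · 5 · 4 · 4 = 640), giving a frame rate of 32000 / 640 = 50 tokens per second, so 8 seconds correspond to 403 tokens. A quick optional check:

hop_length = int(np.prod(model.config.audio_encoder.upsampling_ratios))  # 640
print("frame rate:", sampling_rate / hop_length, "tokens per second")    # 50.0
print("tokens to generate for", sample_length, "seconds:", n_tokens)     # 403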

Original Pipeline Inference#

Text preprocessing prepares the text prompt to be fed into the model; the processor object abstracts this step for us. Text tokenization is performed under the hood: it assigns tokens or IDs to the words, i.e. token IDs are just indices of the words in the model vocabulary. This helps the model understand the context of a sentence.

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    return_tensors="pt",
)

audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=n_tokens)

Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)
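
The processor output is simply integer token IDs (plus an attention mask), and the generated audio_values tensor has shape (batch, channels, samples). The sketch below prints the token IDs and writes the first generated sample to a WAV file; it assumes scipy is available in the environment (install it with pip if it is not), and the output file name is arbitrary.

from scipy.io import wavfile

print(inputs["input_ids"])  # the prompt encoded as vocabulary indices

waveform = audio_values[0, 0].cpu().numpy()  # mono waveform, float32 in [-1, 1]
wavfile.write("musicgen_sample.wav", rate=sampling_rate, data=waveform)
print("Saved", waveform.shape[0] / sampling_rate, "seconds of audio")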