Audio-language assistant with Qwen2Audio and OpenVINO#
This Jupyter notebook can be launched after a local installation only.
Qwen2-Audio is the new series of Qwen large audio-language models. Qwen2-Audio is capable of accepting various audio signal inputs and performing audio analysis or responding directly with text to speech instructions. The model supports more than 8 languages and dialects, e.g., Chinese, English, Cantonese, French, Italian, Spanish, German, and Japanese, and can work in two distinct audio interaction modes:

* voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input;
* audio analysis: users can provide audio and text instructions for analysis during the interaction.
More details about the model can be found in the model card, blog, original repository and technical report.
In this tutorial we consider how to convert and optimize the Qwen2Audio model for creating a multimodal chatbot. Additionally, we demonstrate how to apply a stateful transformation to the LLM part and model optimization techniques such as weight compression using NNCF.
Installation Instructions#
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
Prerequisites#
%pip install -q "transformers>=4.45" "torch>=2.1" "librosa" "gradio>=4.36" "modelscope-studio>=0.4.2" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -qU "openvino>=2024.4.0" "nncf>=2.13.0"
from pathlib import Path
import requests
if not Path("ov_qwen2_audio_helper.py").exists():
r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/qwen2-audio/ov_qwen2_audio_helper.py")
open("ov_qwen2_audio_helper.py", "w").write(r.text)
if not Path("notebook_utils.py").exists():
r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py")
open("notebook_utils.py", "w").write(r.text)
# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry
from notebook_utils import collect_telemetry
collect_telemetry("qwen2-audio.ipynb")
Convert and Optimize model#
Qwen2Audio is a PyTorch model. OpenVINO supports PyTorch models via conversion to OpenVINO Intermediate Representation (IR). The OpenVINO model conversion API should be used for this purpose. The ov.convert_model function accepts the original PyTorch model instance and an example input for tracing and returns an ov.Model object representing this model in the OpenVINO framework. The converted model can be saved on disk using the ov.save_model function or loaded directly on a device using core.compile_model.
The ov_qwen2_audio_helper.py script contains a helper function for model conversion; please check its content if you are interested in the conversion details.
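As a minimal sketch of this API (separate from the helper script used later), converting, saving and compiling a toy PyTorch module could look like the snippet below; the toy module and file names are purely illustrative:

import torch
import openvino as ov

# A toy PyTorch module used only to illustrate the conversion API
class ToyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

example_input = torch.randn(1, 8)
ov_model = ov.convert_model(ToyModel(), example_input=example_input)  # trace PyTorch -> ov.Model
ov.save_model(ov_model, "toy_model.xml")                              # serialize IR (xml + bin) to disk
compiled = ov.Core().compile_model(ov_model, "CPU")                   # or load the model on a device directly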
Click here for more detailed explanation of conversion steps
Qwen2Audio is an autoregressive transformer generative model, which means that each generation step depends on the model output from the previous step. The generation approach is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions. In other words, the model predicts the next token in a loop, guided by previously generated tokens, until a stop condition is reached (the generated sequence hits the maximum length or an end-of-sequence token is produced). The way the next token is selected from the predicted probabilities is driven by the chosen decoding methodology. You can find more information about the most popular decoding methods in this blog. The entry point for the generation process for models from the Hugging Face Transformers library is the generate method. You can find more information about its parameters and configuration in the documentation. To preserve flexibility in the choice of decoding methodology, we convert only a single inference step of the model, as sketched below.
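For intuition only, a greedy decoding loop around such a single-step model could look like this sketch; single_step_model, eos_token_id and the shapes used here are assumed placeholders rather than the notebook's actual API:

import numpy as np

def greedy_generate(single_step_model, input_ids, eos_token_id, max_new_tokens=50):
    # input_ids: 1 x seq_len array of prompt token ids (hypothetical single-step model)
    generated = list(input_ids[0])
    for _ in range(max_new_tokens):
        logits = single_step_model(np.array([generated]))  # logits for every position
        next_token = int(np.argmax(logits[0, -1]))         # greedy: pick the most probable next token
        generated.append(next_token)
        if next_token == eos_token_id:                      # stop condition: end-of-sequence token
            break
    return generated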
The inference flow differs between the first step and the subsequent ones. On the first step, the model accepts the audio and, optionally, a text instruction, which are transformed into a unified embedding space using the input_embedding and audio_encoder models; after that, the language model, the LLM-based part of the model, runs on the input embeddings to predict the probabilities of the next generated tokens. On the following steps, the language_model accepts only the id of the next token, selected according to the sampling strategy and processed by the input_embedding model, together with the cached attention keys and values. Since the output side is auto-regressive, an output token hidden state remains the same once computed for every further generation step. Therefore, recomputing it every time you want to generate a new token seems wasteful. With the cache, the model saves the hidden state once it has been computed. The model only computes the one for the most recently generated output token at each time step, re-using the saved ones for previous tokens. This reduces the generation complexity from \(O(n^3)\) to \(O(n^2)\) for a transformer model. More details about how it works can be found in this article.
To sum up, the model consists of 4 parts:

* Audio Encoder for encoding input audio into the audio embedding space, and Multi-modal projector for transforming audio embeddings into the language model embedding space.
* Input Embedding for converting input text tokens into the embedding space.
* Language Model for generating the answer based on the input embeddings provided by the Audio Encoder and Input Embedding models.

How these parts interact during generation is sketched below.
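The following sketch shows one plausible way these parts fit together on the first (prefill) step and on later cached steps. All names here (audio_encoder, projector, input_embedding, language_model, audio_token_mask) are hypothetical placeholders used for illustration, not the actual objects defined in ov_qwen2_audio_helper.py:

import numpy as np

def prefill_step(audio_features, text_ids, audio_encoder, projector, input_embedding, language_model, audio_token_mask):
    # First step: build a unified embedding sequence from audio and text
    audio_embeds = projector(audio_encoder(audio_features))  # audio -> language model embedding space
    text_embeds = input_embedding(text_ids)                   # text tokens -> embeddings
    inputs_embeds = text_embeds.copy()
    inputs_embeds[audio_token_mask] = audio_embeds            # place audio embeddings at audio placeholder positions
    logits, kv_cache = language_model(inputs_embeds, past_key_values=None)
    return logits, kv_cache

def decode_step(next_token_id, kv_cache, input_embedding, language_model):
    # Subsequent steps: only the newly selected token is embedded; cached keys/values are reused
    token_embeds = input_embedding(np.array([[next_token_id]]))
    logits, kv_cache = language_model(token_embeds, past_key_values=kv_cache)
    return logits, kv_cache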
Compress model weights to 4-bit#
To reduce memory consumption, weight compression optimization can be applied using NNCF.
Click here for more details about weight compression
Weight compression aims to reduce the memory footprint of a model. It can also lead to significant performance improvement for large memory-bound models, such as Large Language Models (LLMs). LLMs and other models, which require extensive memory to store the weights during inference, can benefit from weight compression in the following ways:
* enabling the inference of exceptionally large models that cannot be accommodated in the memory of the device;
* improving the inference performance of the models by reducing the latency of the memory access when computing the operations with weights, for example, Linear layers.
Neural Network Compression Framework (NNCF) provides 4-bit / 8-bit mixed weight quantization as a compression method primarily designed to optimize LLMs. The main difference between weights compression and full model quantization (post-training quantization) is that activations remain floating-point in the case of weights compression which leads to a better accuracy. Weight compression for LLMs provides a solid inference performance improvement which is on par with the performance of the full model quantization. In addition, weight compression is data-free and does not require a calibration dataset, making it easy to use.
The nncf.compress_weights function can be used for performing weight compression. The function accepts an OpenVINO model and other compression parameters. Compared to INT8 compression, INT4 compression improves performance even more, but introduces a minor drop in prediction quality.
More details about weight compression can be found in the OpenVINO documentation.
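For reference, applying the same INT4 settings directly to an already converted ov.Model would look roughly as follows; the language_model.xml path is an assumed example, since in this notebook the compression is handled inside convert_qwen2audio_model:

import openvino as ov
import nncf

core = ov.Core()
ov_llm = core.read_model("language_model.xml")  # assumed path to a converted IR
compressed = nncf.compress_weights(
    ov_llm,
    mode=nncf.CompressWeightsMode.INT4_ASYM,  # 4-bit asymmetric weight quantization
    group_size=128,                           # weights are quantized in groups of 128 elements
    ratio=1.0,                                # compress all eligible layers to INT4
)
ov.save_model(compressed, "language_model_int4.xml")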
pt_model_id = "Qwen/Qwen2-Audio-7B-Instruct"
model_dir = Path(pt_model_id.split("/")[-1])
from ov_qwen2_audio_helper import convert_qwen2audio_model
# uncomment these lines to see model conversion code
# convert_qwen2audio_model??
import nncf
compression_configuration = {
    "mode": nncf.CompressWeightsMode.INT4_ASYM,
    "group_size": 128,
    "ratio": 1.0,
}
convert_qwen2audio_model(pt_model_id, model_dir, compression_configuration)
✅ Qwen/Qwen2-Audio-7B-Instruct model already converted. You can find results in Qwen2-Audio-7B-Instruct
Prepare model inference pipeline#
As discussed, the model comprises an Audio Encoder and an LLM (with a separated text embedding part) that generates the answer. In ov_qwen2_audio_helper.py we defined the inference class OVQwen2AudioForConditionalGeneration that represents the generation cycle. It is based on the HuggingFace Transformers GenerationMixin and looks similar to the Optimum Intel OVModelForCausalLM that is used for LLM inference.
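One key hook such a GenerationMixin-based wrapper provides is prepare_inputs_for_generation; a simplified, hypothetical version of what this hook typically does (not the helper's actual code) is shown below:

def prepare_inputs_for_generation(input_ids, past_key_values=None, attention_mask=None, **kwargs):
    # Once a key/value cache exists, only the newest token needs to be passed to the language model
    if past_key_values is not None:
        input_ids = input_ids[:, -1:]
    return {
        "input_ids": input_ids,
        "past_key_values": past_key_values,
        "attention_mask": attention_mask,
    }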
from ov_qwen2_audio_helper import OVQwen2AudioForConditionalGeneration
# Uncomment below lines to see the model inference class code
# OVQwen2AudioForConditionalGeneration??
from notebook_utils import device_widget
device = device_widget(default="AUTO", exclude=["NPU"])
device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
ov_model = OVQwen2AudioForConditionalGeneration(model_dir, device.value)
Run model inference#
from transformers import AutoProcessor, TextStreamer
import librosa
import IPython.display as ipd
processor = AutoProcessor.from_pretrained(model_dir)
audio_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"
audio_chat_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"
audio_file = Path(audio_url.split("/")[-1])
audio_chat_file = Path(audio_chat_url.split("/")[-1])
if not audio_file.exists():
    r = requests.get(audio_url)
    with audio_file.open("wb") as f:
        f.write(r.content)
if not audio_chat_file.exists():
    r = requests.get(audio_chat_url)
    with audio_chat_file.open("wb") as f:
        f.write(r.content)
Voice chat#
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": audio_chat_url},
        ],
    },
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = [librosa.load(audio_chat_file, sr=processor.feature_extractor.sampling_rate)[0]]
inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
display(ipd.Audio(audio_chat_file))
print("Answer:")
generate_ids = ov_model.generate(**inputs, max_new_tokens=50, streamer=TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True))
It is strongly recommended to pass the sampling_rate argument to this function. Failing to do so can result in silent errors that might be hard to debug.
Setting pad_token_id to eos_token_id:151645 for open-end generation.
Answer:
Yes, I can guess that you are a female in your twenties.
Audio analysis#
question = "What does the person say?"
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": audio_url},
            {"type": "text", "text": question},
        ],
    },
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = [librosa.load(audio_file, sr=processor.feature_extractor.sampling_rate)[0]]
inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
print("Question:")
print(question)
display(ipd.Audio(audio_file))
print("Answer:")
generate_ids = ov_model.generate(**inputs, max_new_tokens=50, streamer=TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True))
It is strongly recommended to pass the sampling_rate argument to this function. Failing to do so can result in silent errors that might be hard to debug.
Question:
What does the person say?
Setting pad_token_id to eos_token_id:151645 for open-end generation.
Answer:
The person says: 'Mister Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
Interactive Demo#
if not Path("gradio_helper.py").exists():
r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/qwen2-vl/gradio_helper.py")
open("gradio_helper.py", "w").write(r.text)
from gradio_helper import make_demo
demo = make_demo(ov_model, processor)
try:
    demo.launch(debug=True)
except Exception:
    demo.launch(debug=True, share=True)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/