Visual-language assistant with Llama-3.2-11B-Vision and OpenVINO#
This Jupyter notebook can be launched after a local installation only.
Llama-3.2-11B-Vision is the latest model from the Llama 3 model family, with capabilities extended to understanding image content. The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. Llama 3.2-Vision is built on top of the Llama 3.1 text-only model, which is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. To support image recognition tasks, the Llama 3.2-Vision model uses a separately trained vision adapter that integrates with the pre-trained Llama 3.1 language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the core LLM.
In this tutorial we consider how to convert, optimize, and run this model using OpenVINO. More details about the model can be found in the model card and the original repo.
Table of contents:
Installation Instructions#
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
Prerequisites#
%pip install -q "torch>=2.1" "torchvision" "Pillow" "tqdm" "datasets>=2.14.6" "gradio>=4.36" "nncf>=2.14.0" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "transformers>=4.45" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -Uq "openvino>=2024.5.0"
import requests
from pathlib import Path
if not Path("ov_mllama_helper.py").exists():
r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/mllama3.2/ov_mllama_helper.py")
open("ov_mllama_helper.py", "w").write(r.text)
if not Path("gradio_helper.py").exists():
r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/mllama3.2/gradio_helper.py")
open("gradio_helper.py", "w").write(r.text)
if not Path("ov_mllama_compression.py").exists():
r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/mllama3.2/ov_mllama_compression.py")
open("ov_mllama_compression.py", "w").write(r.text)
if not Path("data_preprocessing.py").exists():
r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/mllama3.2/data_preprocessing.py")
open("data_preprocessing", "w").write(r.text)
if not Path("notebook_utils.py").exists():
r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py")
open("notebook_utils.py", "w").write(r.text)
Convert model#
OpenVINO supports PyTorch models via conversion to OpenVINO Intermediate Representation (IR), using the OpenVINO model conversion API. The ov.convert_model function accepts the original PyTorch model instance and an example input for tracing, and returns an ov.Model object representing this model in the OpenVINO framework. The converted model can be saved on disk using the ov.save_model function or loaded directly on a device using core.compile_model.
The ov_mllama_helper.py script contains the helper function for model conversion; please check its content if you are interested in the conversion details.
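To illustrate the general flow that the helper script follows, below is a minimal sketch of converting an arbitrary PyTorch module with ov.convert_model, saving the IR, and compiling it for a device. The TinyModel module and the tiny_model.xml file name are placeholders for illustration only, not part of the notebook.

import torch
import openvino as ov


class TinyModel(torch.nn.Module):
    # Placeholder module used only to demonstrate the conversion API
    def forward(self, x):
        return torch.nn.functional.relu(x)


example_input = torch.randn(1, 3, 224, 224)
ov_model = ov.convert_model(TinyModel(), example_input=example_input)  # trace PyTorch model -> ov.Model
ov.save_model(ov_model, "tiny_model.xml")  # save IR (.xml + .bin) on disk
compiled = ov.Core().compile_model(ov_model, "CPU")  # or load the saved IR directly on a device
result = compiled(example_input.numpy())[0]
print(result.shape)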
Click here for more detailed explanation of conversion steps
Llama-3.2-Vision is an autoregressive transformer generative model, which means that each generation step depends on the model output from the previous step. The generation approach is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions. In other words, the model predicts the next token in a loop, guided by the previously generated tokens, until a stop condition is reached (the generated sequence reaches the maximum length or an end-of-string token is produced). How the next token is selected from the predicted probabilities is driven by the chosen decoding methodology. You can find more information about the most popular decoding methods in this blog. The entry point for the generation process for models from the Hugging Face Transformers library is the generate method. You can find more information about its parameters and configuration in the documentation. To preserve flexibility in the choice of decoding methodology, we convert only the model inference for a single step.
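As an illustration of this single-step usage, below is a minimal greedy-decoding sketch. Here model stands for any callable returning next-token logits for a token sequence, and eos_token_id and max_new_tokens are assumed parameters; it is a conceptual example, not the decoding loop used by the notebook helpers.

import numpy as np


def greedy_generate(model, input_ids, eos_token_id, max_new_tokens=100):
    """Toy greedy decoding loop: one model call per generated token."""
    tokens = list(input_ids)
    for _ in range(max_new_tokens):
        logits = model(np.array(tokens)[None, :])  # [1, seq_len, vocab_size]
        next_token = int(np.argmax(logits[0, -1]))  # pick the most probable next token
        tokens.append(next_token)
        if next_token == eos_token_id:  # stop condition: end-of-string token reached
            break
    return tokens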
The inference flow differs between the first step and the subsequent ones. On the first step, the model accepts the preprocessed input instruction and image. The image is processed by the Image Encoder into cross-attention states; after that the language model, the LLM-based part of the model, runs on the cross-attention states and the tokenized input ids to predict the probability of the next generated token. On subsequent steps, language_model accepts only the next token. Since the output side is auto-regressive, an output token's hidden state remains the same once computed for every further generation step. Therefore, recomputing it every time you want to generate a new token is wasteful. With the cache, the model saves the hidden state once it has been computed, then at each time step computes only the one for the most recently generated output token, re-using the saved states for the previous tokens. This reduces the generation complexity from \(O(n^3)\) to \(O(n^2)\) for a transformer model. More details about how it works can be found in this article.
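The sketch below illustrates the key/value cache idea with a generic Hugging Face causal LM; gpt2 is used here only as a small stand-in model and is not part of this notebook. The first call processes the whole prompt, while each following call passes only the newest token together with past_key_values.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in model used only to demonstrate KV-cache reuse
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tok("OpenVINO is", return_tensors="pt").input_ids
with torch.no_grad():
    out = lm(input_ids, use_cache=True)  # prefill: full prompt processed, cache created
    past = out.past_key_values
    next_token = out.logits[:, -1].argmax(-1, keepdim=True)
    for _ in range(5):  # decode: only the newest token is fed in, cache is reused
        out = lm(next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1].argmax(-1, keepdim=True)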
As model size increases, as in modern LLMs, the number of attention blocks and the size of the past key value tensors grow accordingly. Handling the cache state as model inputs and outputs in the inference cycle may become a bottleneck for memory-bound systems, especially when processing long input sequences, for example in a chatbot scenario. OpenVINO provides a transformation that removes the cache tensor inputs and the corresponding outputs from the model, keeping the cache handling logic inside the model. Such models are also called stateful. A stateful model is a model that implicitly preserves data between two consecutive inference calls. The tensors saved from one run are kept in an internal memory buffer called a state or a variable and may be passed to the next run, while never being exposed as a model output. Hiding the cache enables storing and updating the cache values in a more device-friendly representation. It helps to reduce memory consumption and additionally optimize model performance. More details about stateful models and working with state can be found in the OpenVINO documentation.
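As a small sketch of how state can be inspected with the OpenVINO Python API (the stateful_model.xml path is a placeholder), an inference request of a stateful model exposes its internal variables through query_state() and can be cleared with reset_state() before starting a new sequence:

import openvino as ov

core = ov.Core()
compiled = core.compile_model("stateful_model.xml", "CPU")  # placeholder path to any stateful IR
request = compiled.create_infer_request()

# Internal variables (for example, KV-cache buffers) live inside the request
for variable in request.query_state():
    print(variable.name, variable.state.shape)

request.reset_state()  # drop accumulated state before generating an unrelated sequence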
The image_encoder in Llama-3.2-Vision is a pretrained ViT model.
To sum up, the model consists of 2 parts:
Image Encoder for encoding input images into the LLM cross-attention states space.
Language Model for generating the answer based on the cross-attention states provided by the Image Encoder and the input tokens.
Let’s convert each model part.
Note: to run the model with this notebook, you will need to accept the license agreement. You must be a registered user on the Hugging Face Hub. Please visit the HuggingFace model card, carefully read the terms of usage, and click the accept button. You will need to use an access token for the code below to run. For more information on access tokens, refer to this section of the documentation. You can log in to the Hugging Face Hub in the notebook environment using the following code:
# uncomment these lines to login to huggingfacehub to get access to pretrained model
# from huggingface_hub import notebook_login, whoami
# try:
# whoami()
# print('Authorization token already provided')
# except OSError:
# notebook_login()
from pathlib import Path
from ov_mllama_helper import convert_mllama
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model_dir = Path(model_id.split("/")[-1]) / "OV"
# uncomment the line to see model conversion code
# convert_mllama??
2024-09-26 08:47:58.173539: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0. 2024-09-26 08:47:58.175474: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used. 2024-09-26 08:47:58.210782: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2024-09-26 08:47:59.026387: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
convert_mllama(model_id, model_dir)
⌛ Load original model
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
⌛ Convert vision model...
WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.base has been moved to tensorflow.python.trackable.base. The old module will be deleted in version 2.11.
[ WARNING ] Please fix your imports. Module %s has been moved to %s. The old module will be deleted in version %s. /home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/modeling_utils.py:4773: FutureWarning: _is_quantized_training_enabled is going to be deprecated in transformers 4.39.0. Please use model.hf_quantizer.is_trainable instead warnings.warn( /home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/models/mllama/modeling_mllama.py:1496: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! slice_index = -num_padding_patches if num_padding_patches > 0 else None
✅ Vision model successfully converted
⌛ Convert language model...
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/models/mllama/modeling_mllama.py:83: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if sequence_length != 1:
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/models/mllama/modeling_mllama.py:1710: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if is_cross_attention_layer and cross_attention_states is None and is_cross_attention_cache_empty:
/home/ea/work/openvino_notebooks_new_clone/openvino_notebooks/notebooks/mllama-3.2/ov_mllama_helper.py:402: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
elif past_key_value.get_seq_length(self.layer_idx) != 0:
✅ Language model successfully converted
✅ Model sucessfully converted and can be found in Llama-3.2-11B-Vision-Instruct/OV
Select inference device#
from notebook_utils import device_widget
device = device_widget("CPU", exclude=["NPU"])
device
Dropdown(description='Device:', options=('CPU', 'AUTO'), value='CPU')
Optimize model using NNCF#
Compress Language model weights in 4bits#
To reduce memory consumption, weight compression optimization can be applied using NNCF.
Click here for more details about weight compression Weight compression aims to reduce the memory footprint of a model. It can also lead to significant performance improvement for large memory-bound models, such as Large Language Models (LLMs). LLMs and other models, which require extensive memory to store the weights during inference, can benefit from weight compression in the following ways:
enabling the inference of exceptionally large models that cannot be accommodated in the memory of the device;
improving the inference performance of the models by reducing the latency of the memory access when computing the operations with weights, for example, Linear layers.
Neural Network Compression Framework (NNCF) provides 4-bit / 8-bit mixed weight quantization as a compression method primarily designed to optimize LLMs. The main difference between weights compression and full model quantization (post-training quantization) is that activations remain floating-point in the case of weights compression which leads to a better accuracy. Weight compression for LLMs provides a solid inference performance improvement which is on par with the performance of the full model quantization. In addition, weight compression is data-free and does not require a calibration dataset, making it easy to use.
The nncf.compress_weights function can be used to perform weight compression. The function accepts an OpenVINO model and other compression parameters. Compared to INT8 compression, INT4 compression improves performance even more, but introduces a minor drop in prediction quality.
More details about weight compression can be found in the OpenVINO documentation.
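For reference, below is a minimal sketch of data-free INT4 weight compression with NNCF; the model path and the ratio/group_size values are illustrative assumptions, while the notebook's compress helper configures its own data-aware parameters.

import nncf
import openvino as ov

core = ov.Core()
ov_model = core.read_model("path/to/openvino_language_model.xml")  # placeholder path to the FP16 language model IR

compressed = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,  # 4-bit asymmetric weight quantization
    group_size=64,  # quantize weights in groups of 64 channels
    ratio=1.0,  # fraction of layers compressed to INT4 (the rest stay in INT8)
)
ov.save_model(compressed, "llm_int4.xml")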
In this tutorial we consider using data-aware weight compression. Such approaches may require more time and memory, as they involve a calibration dataset, while promising better INT4 model accuracy. > Note: AWQ weight quantization requires at least 64GB of RAM. If you run the notebook in a memory-constrained environment, you can switch to data-free weight compression using the widget below.
from ov_mllama_compression import compress
# uncomment the line to see compression code
# compress??
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino
from ov_mllama_compression import compression_widgets_helper
compression_scenario, compress_args = compression_widgets_helper()
compression_scenario
VBox(children=(RadioButtons(index=1, options=('data-free', 'data-aware'), value='data-aware'), Accordion(child…
compression_kwargs = {key: value.value for key, value in compress_args.items()}
language_model_path = compress(model_dir, **compression_kwargs)
✅ Compressed model already exists and can be found in Llama-3.2-11B-Vision-Instruct/OV/llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml
Optimize Vision model#
While weight compression is a great tool for reducing the memory footprint of large language models, for smaller models like the Image Encoder it may be more efficient to apply INT8 post-training quantization. You can find more details about post-training quantization in the OpenVINO documentation.
Basically, the model quantization process consists of 3 steps:
1. Prepare the quantization dataset.
2. Perform model quantization using nncf.quantize.
3. Save the optimized model on disk using ov.save_model.
Note: Model quantization may require additional time and memory for optimization and may not be applicable for some devices. You can skip the quantization step or replace it with weight compression using the widget below if you do not have enough resources.
from ov_mllama_compression import vision_encoder_selection_widget
vision_encoder_options = vision_encoder_selection_widget(device.value)
vision_encoder_options
Dropdown(description='Vision Encoder', index=1, options=('FP16', 'INT8 quantization', 'INT8 weights compressio…
from transformers import AutoProcessor
import nncf
import openvino as ov
import gc
from data_preprocessing import prepare_dataset_vision
processor = AutoProcessor.from_pretrained(model_dir)
core = ov.Core()
fp_vision_encoder_path = model_dir / "openvino_vision_encoder.xml"
int8_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8.xml")
int8_wc_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8_wc.xml")
if vision_encoder_options.value == "INT8 quantization":
if not int8_vision_encoder_path.exists():
calibration_data = prepare_dataset_vision(processor, 100)
ov_model = core.read_model(fp_vision_encoder_path)
calibration_dataset = nncf.Dataset(calibration_data)
quantized_model = nncf.quantize(
model=ov_model,
calibration_dataset=calibration_dataset,
model_type=nncf.ModelType.TRANSFORMER,
advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.6),
)
ov.save_model(quantized_model, int8_vision_encoder_path)
del quantized_model
del ov_model
del calibration_dataset
del calibration_data
gc.collect()
vision_encoder_path = int8_vision_encoder_path
elif vision_encoder_options.value == "INT8 weights compression":
if not int8_wc_vision_encoder_path.exists():
ov_model = core.read_model(fp_vision_encoder_path)
compressed_model = nncf.compress_weights(ov_model)
ov.save_model(compressed_model, int8_wc_vision_encoder_path)
vision_encoder_path = int8_wc_vision_encoder_path
else:
vision_encoder_path = fp_vision_encoder_path
Model Inference#
Now, we are ready to test model inference.
OVMLlamaForConditionalGeneration, defined in ov_mllama_helper.py, has a generation interface similar to the original model and additionally enables runtime optimizations for efficient model inference with OpenVINO:
- Slicing LM head - LLMs usually provide probabilities for all input tokens, while for selecting the next token we are interested only in the last one. Reducing the language model head to return only the last token's probabilities may provide better performance and reduce memory consumption for the first inference, where the whole input prompt is usually processed. You can find more details about this optimization in the OpenVINO blog.
- Using Remote tensors for GPU - Copying data to the device and back into host memory can become a bottleneck for efficient execution of a multi-model pipeline on GPU. The Remote Tensor API provides functionality for low-level GPU memory management; we can use this feature to share cross-attention keys and values between the Image Encoder and the Language Model.
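To make the LM-head slicing idea concrete, below is a hedged sketch of the equivalent post-hoc operation in plain NumPy: during prefill only the logits of the last position are needed for next-token selection, so the graph can be modified to output a single-position tensor instead of logits for the whole prompt. The array shapes here are illustrative placeholders, not the helper's actual outputs.

import numpy as np

# Illustrative prefill output: [batch, prompt_len, vocab_size] logits
batch, prompt_len, vocab_size = 1, 512, 32000
logits = np.random.rand(batch, prompt_len, vocab_size).astype(np.float32)

# Only the last position is needed to pick the next token ...
last_logits = logits[:, -1, :]
next_token = int(last_logits.argmax(-1)[0])

# ... so slicing the LM head inside the graph lets the model return a
# [batch, 1, vocab_size] tensor directly, skipping the full prompt_len x vocab_size matmul.
print(last_logits.shape, next_token)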
from ov_mllama_helper import OVMLlamaForConditionalGeneration
# Uncomment this line to see model inference code
# OVMLlamaForConditionalGeneration??
ov_model = OVMLlamaForConditionalGeneration(
    model_dir, device=device.value, language_model_name=language_model_path.name, image_encoder_name=vision_encoder_path.name
)
processor = AutoProcessor.from_pretrained(model_dir)
applied slice for lm head
from PIL import Image
from transformers import TextStreamer
import numpy as np
question = "What is unusual on this image?"
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]},
]
text = processor.tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
url = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
raw_image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=text, images=[raw_image], return_tensors="pt")
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
print(f"Question: {question}")
display(raw_image)
output = ov_model.generate(**inputs, do_sample=False, max_new_tokens=100, temperature=None, top_p=None, streamer=streamer)
print(f"Visual encoder time {ov_model.vision_encoder_infer_time[0] * 1000 :.2f} ms")
print(f"First token latency {ov_model.llm_infer_time[0] * 1000 :.2f}ms, Second token latency {np.mean(np.array(ov_model.llm_infer_time[1:])) * 1000:.2f}ms")
Question: What is unusual on this image?
The cat is lying in a box, which is an unusual position for a cat. Cats are known for their agility and flexibility, but they tend to prefer more comfortable and secure positions, such as on a soft surface or in a cozy spot. Lying in a box is not a typical behavior for a cat, and it may be due to the cat's desire to feel safe and protected or to explore a new environment.
Visual encoder time 19374.52 ms
First token latency 693.76ms, Second token latency 431.92ms
Interactive demo#
from gradio_helper import make_demo
processor.chat_template = processor.tokenizer.chat_template
demo = make_demo(ov_model, processor)
try:
    demo.launch(debug=False)
except Exception:
    demo.launch(debug=False, share=True)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/