Visual-language assistant with Llama-3.2-11B-Vision and OpenVINO#
This Jupyter notebook can be launched after a local installation only.
Llama-3.2-11B-Vision is the latest model in the Llama 3 family, with capabilities extended to understanding image content. The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. Llama 3.2-Vision is built on top of the Llama 3.1 text-only model, an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. To support image recognition tasks, the Llama 3.2-Vision model uses a separately trained vision adapter that integrates with the pre-trained Llama 3.1 language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the core LLM.
In this tutorial we consider how to convert, optimize and run this model using OpenVINO. More details about the model can be found in the model card and the original repo.
Installation Instructions#
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
Prerequisites#
%pip install -q "torch>=2.1" "torchvision" "Pillow" "tqdm" "datasets>=2.14.6" "gradio>=4.36" "nncf>=2.14.0" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "transformers>=4.45" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -Uq --pre "openvino>=2024.5.0" --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
import requests
from pathlib import Path
if not Path("ov_mllama_helper.py").exists():
r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/mllama3.2/ov_mllama_helper.py")
open("ov_mllama_helper.py", "w").write(r.text)
if not Path("gradio_helper.py").exists():
r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/mllama3.2/gradio_helper.py")
open("gradio_helper.py", "w").write(r.text)
if not Path("ov_mllama_compression.py").exists():
r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/mllama3.2/ov_mllama_compression.py")
open("ov_mllama_compression.py", "w").write(r.text)
if not Path("data_preprocessing.py").exists():
r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/mllama3.2/data_preprocessing.py")
open("data_preprocessing", "w").write(r.text)
if not Path("notebook_utils.py").exists():
r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py")
open("notebook_utils.py", "w").write(r.text)
# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry
from notebook_utils import collect_telemetry
collect_telemetry("mllama-3.2.ipynb")
Convert model#
OpenVINO supports PyTorch models via conversion to OpenVINO Intermediate Representation (IR). The OpenVINO model conversion API should be used for these purposes. The ov.convert_model function accepts the original PyTorch model instance and an example input for tracing and returns an ov.Model object representing this model in the OpenVINO framework. The converted model can be saved on disk with the ov.save_model function or loaded on a device directly with core.compile_model.
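A minimal sketch of this flow for a toy PyTorch module is shown below; it is illustrative only, the actual conversion code for Llama-3.2-Vision lives in the helper script.

import torch
import openvino as ov

# Illustrative only: convert a toy PyTorch module, save the IR and compile it on a device.
toy_model = torch.nn.Linear(4, 2)                          # stand-in for the original PyTorch model
example_input = torch.zeros(1, 4)                          # example input used for tracing
ov_model = ov.convert_model(toy_model, example_input=example_input)
ov.save_model(ov_model, "toy_model.xml")                   # serialize the IR to disk
compiled_model = ov.Core().compile_model(ov_model, "CPU")  # or load it on a device directly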
The ov_mllama_helper.py script contains the helper function for model conversion; please check its content if you are interested in the conversion details.
Click here for more detailed explanation of conversion steps
Llama-3.2-Vision is an autoregressive transformer generative model, which means that each next generation step depends on the model output from the previous step. The generation approach is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions. In other words, the model predicts the next token in a loop, guided by previously generated tokens, until a stop condition is reached (the generated sequence reaches the maximum length or an end-of-string token is produced). The way the next token is selected from the predicted probabilities is driven by the chosen decoding methodology. You can find more information about the most popular decoding methods in this blog. The entry point for the generation process for models from the Hugging Face Transformers library is the generate method. You can find more information about its parameters and configuration in the documentation.
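As a toy illustration of how a decoding method turns predicted probabilities into the next token (the function and values below are illustrative only, not part of the notebook code):

import numpy as np

# Greedy decoding picks the most probable token; sampling draws from the
# (optionally temperature-scaled) distribution.
def select_next_token(logits: np.ndarray, do_sample: bool = False, temperature: float = 1.0) -> int:
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    if do_sample:
        return int(np.random.choice(len(probs), p=probs))
    return int(np.argmax(probs))

print(select_next_token(np.array([0.1, 2.0, -1.0])))  # greedy decoding -> 1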
To preserve flexibility in the selection of the decoding methodology, we convert only the model inference for a single step.
The inference flow differs between the first step and the subsequent ones. On the first step, the model accepts the preprocessed input instruction and image. The image is processed by the Image Encoder into cross-attention states; after that, the language model, the LLM-based part of the model, runs on the cross-attention states and the tokenized input token ids to predict the probability of the next generated token. On the following steps, the language_model accepts only the next token. Since the output side is auto-regressive, an output token's hidden state remains the same once computed for every further generation step. Therefore, recomputing it every time you want to generate a new token is wasteful. With the cache, the model saves the hidden state once it has been computed. The model only computes the one for the most recently generated output token at each time step, re-using the saved ones for the earlier tokens. This reduces the generation complexity from \(O(n^3)\) to \(O(n^2)\) for a transformer model. More details about how it works can be found in this article.
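Below is a minimal sketch of such a generation loop with a KV cache, assuming a Hugging Face style causal LM interface (past_key_values, use_cache); the notebook itself relies on the generate method, which implements this logic internally.

import torch

# Greedy generation with a KV cache (conceptual sketch, batch size 1).
def greedy_generate(model, input_ids: torch.Tensor, eos_token_id: int, max_new_tokens: int = 32) -> torch.Tensor:
    past_key_values = None
    generated = input_ids
    for _ in range(max_new_tokens):
        # After the first step only the newest token is fed; earlier positions come from the cache.
        step_input = generated if past_key_values is None else generated[:, -1:]
        outputs = model(input_ids=step_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() == eos_token_id:
            break
    return generated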
With increasing model size, as in modern LLMs, we can also note an increase in the number of attention blocks and, correspondingly, in the size of the past key value tensors. The strategy of handling the cache state as model inputs and outputs in the inference cycle may become a bottleneck for memory-bound systems, especially when processing long input sequences, for example in a chatbot scenario. OpenVINO suggests a transformation that removes the inputs and corresponding outputs with cache tensors from the model, keeping the cache-handling logic inside the model. Such models are also called stateful. A stateful model is a model that implicitly preserves data between two consecutive inference calls. The tensors saved from one run are kept in an internal memory buffer called a state or a variable and may be passed to the next run, while never being exposed as model output. Hiding the cache enables storing and updating the cache values in a more device-friendly representation. It helps to reduce memory consumption and additionally optimize model performance. More details about stateful models and working with state can be found in the OpenVINO documentation.
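As a sketch of how a stateful compiled model is typically handled (the model file name below is a placeholder, not a path produced by this notebook):

import openvino as ov

# The KV cache of a stateful model lives in internal state, so it never appears among
# the model inputs/outputs; reset the state before starting an unrelated generation session.
core = ov.Core()
compiled = core.compile_model("openvino_language_model.xml", "CPU")  # placeholder path
request = compiled.create_infer_request()
# ... run the per-token inference loop with `request` here ...
request.reset_state()  # clears the hidden cache state for the next conversation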
The image_encoder in Llama-3.2-Vision is represented by a pretrained ViT model.
To sum up, the model consists of 2 parts:
Image Encoder for encoding input images into the LLM cross-attention states space.
Language Model for generating the answer based on the cross-attention states provided by the Image Encoder and the input tokens.
Let’s convert each model part.
Note: to run the model with this notebook, you will need to accept the license agreement. You must be a registered user on the Hugging Face Hub. Please visit the HuggingFace model card, carefully read the terms of usage and click the accept button. You will need to use an access token for the code below to run. For more information on access tokens, refer to this section of the documentation. You can log in to the Hugging Face Hub in the notebook environment using the following code:
# uncomment these lines to login to huggingfacehub to get access to pretrained model
# from huggingface_hub import notebook_login, whoami
# try:
#     whoami()
#     print('Authorization token already provided')
# except OSError:
#     notebook_login()
from pathlib import Path
from ov_mllama_helper import convert_mllama
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model_dir = Path(model_id.split("/")[-1]) / "OV"
# uncomment the line to see model conversion code
# convert_mllama??
2025-01-07 08:39:57.815213: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0. 2025-01-07 08:39:57.827771: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1736224797.842114 2088673 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered E0000 00:00:1736224797.846261 2088673 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2025-01-07 08:39:57.861492: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
convert_mllama(model_id, model_dir)
⌛ Load original model
Downloading shards: 0%| | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
tokenizer_config.json: 0%| | 0.00/55.8k [00:00<?, ?B/s]
tokenizer.json: 0%| | 0.00/9.09M [00:00<?, ?B/s]
chat_template.json: 0%| | 0.00/5.09k [00:00<?, ?B/s]
/home/ea/work/py311/lib/python3.11/site-packages/transformers/modeling_utils.py:5006: FutureWarning: _is_quantized_training_enabled is going to be deprecated in transformers 4.39.0. Please use model.hf_quantizer.is_trainable instead warnings.warn( loss_type=None was set in the config but it is unrecognised.Using the default loss: ForCausalLMLoss.
⌛ Convert vision model...
/home/ea/work/py311/lib/python3.11/site-packages/transformers/models/mllama/modeling_mllama.py:1441: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
slice_index = -num_padding_patches if num_padding_patches > 0 else None
✅ Vision model successfully converted
⌛ Convert language model...
/home/ea/work/py311/lib/python3.11/site-packages/transformers/cache_utils.py:458: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results.
or len(self.key_cache[layer_idx]) == 0 # the layer has no cache
/home/ea/work/py311/lib/python3.11/site-packages/transformers/models/mllama/modeling_mllama.py:1819: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if sequence_length != 1:
/home/ea/work/py311/lib/python3.11/site-packages/transformers/cache_utils.py:443: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results.
elif len(self.key_cache[layer_idx]) == 0: # fills previously skipped layers; checking for tensor causes errors
/home/ea/work/py311/lib/python3.11/site-packages/transformers/models/mllama/modeling_mllama.py:1653: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if is_cross_attention_layer and cross_attention_states is None and is_cross_attention_cache_empty:
/home/ea/work/openvino_notebooks_new_clone/openvino_notebooks/notebooks/mllama-3.2/ov_mllama_helper.py:401: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
elif past_key_value.get_seq_length(self.layer_idx) != 0:
✅ Language model successfully converted
✅ Model sucessfully converted and can be found in Llama-3.2-11B-Vision-Instruct/OV
Select inference device#
from notebook_utils import device_widget
device = device_widget("CPU", exclude=["NPU"])
device
Dropdown(description='Device:', options=('CPU', 'AUTO'), value='CPU')
Optimize model using NNCF#
Compress Language Model weights to 4 bits#
To reduce memory consumption, weight compression optimization can be applied using NNCF.
Click here for more details about weight compression
Weight compression aims to reduce the memory footprint of a model. It can also lead to significant performance improvement for large memory-bound models, such as Large Language Models (LLMs). LLMs and other models which require extensive memory to store the weights during inference can benefit from weight compression in the following ways:
enabling the inference of exceptionally large models that cannot be accommodated in the memory of the device;
improving the inference performance of the models by reducing the latency of the memory access when computing the operations with weights, for example, Linear layers.
Neural Network Compression Framework (NNCF) provides 4-bit / 8-bit mixed weight quantization as a compression method primarily designed to optimize LLMs. The main difference between weight compression and full model quantization (post-training quantization) is that activations remain floating-point in the case of weight compression, which leads to better accuracy. Weight compression for LLMs provides a solid inference performance improvement which is on par with the performance of the full model quantization. In addition, weight compression is data-free and does not require a calibration dataset, making it easy to use.
The nncf.compress_weights function can be used for performing weight compression. The function accepts an OpenVINO model and other compression parameters. Compared to INT8 compression, INT4 compression improves performance even more, but introduces a minor drop in prediction quality.
More details about weight compression can be found in the OpenVINO documentation.
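As a minimal data-free sketch of this API (the model path below is a placeholder; the notebook's compress helper additionally enables data-aware options such as AWQ and scale estimation):

import nncf
import openvino as ov

# Compress model weights to 4-bit asymmetric integers without a calibration dataset.
core = ov.Core()
ov_model = core.read_model("openvino_language_model.xml")  # placeholder path
compressed_model = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    group_size=64,   # weights are quantized in groups of 64 channels
    ratio=1.0,       # compress all ratio-defining layers to 4 bits
)
ov.save_model(compressed_model, "openvino_language_model_int4.xml")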
In this tutorial we consider the usage of data-aware weight compression. Such approaches may require more time and memory as they involve a calibration dataset, while promising better INT4 model accuracy. > Note: AWQ weight quantization requires at least 64GB RAM. If you run the notebook in a memory-constrained environment, you can switch to data-free weight compression using the widget below.
from ov_mllama_compression import compress
# uncomment the line to see compression code
# compress??
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino
from ov_mllama_compression import compression_widgets_helper
compression_scenario, compress_args = compression_widgets_helper()
compression_scenario
VBox(children=(RadioButtons(index=1, options=('data-free', 'data-aware'), value='data-aware'), Accordion(child…
compression_kwargs = {key: value.value for key, value in compress_args.items()}
language_model_path = compress(model_dir, **compression_kwargs)
⌛ Dataset preparation started
Fetching 64 samples for the initialization...
0%| | 0/64 [00:00<?, ?it/s]
✅ Dataset preparation finished ⌛ Model compression started Compression parameters: algorithm int4_asym group size - 64 ratio - 1.0 awq - True scale estimation - True lora correction - False gptq - False all_layers - True
Output()
WARNING:nncf:Dataset contains only 64 samples, smaller than the requested subset size 128.
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Weight compression mode │ % all parameters (layers) │ % ratio-defining parameters (layers) │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ int4_asym │ 100% (266 / 266) │ 100% (266 / 266) │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Output()
Output()
Output()
✅ Model compression finished. Compressed model can be found in Llama-3.2-11B-Vision-Instruct/OV/llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml
Optimize Vision model#
While weight compression is a great tool for reducing the memory footprint of large language models, for smaller models like the Image Encoder it may be more efficient to apply INT8 post-training quantization. You can find more details about post-training quantization in the OpenVINO documentation.
Basically, the model quantization process consists of 3 steps:
1. Prepare the quantization dataset
2. Perform model quantization using nncf.quantize
3. Save the optimized model on disk using ov.save_model
Note: Model quantization may require additional time and memory for optimization and may not be applicable for some devices. You can skip the quantization step or replace it with weight compression using the widget below if you do not have enough resources.
from ov_mllama_compression import vision_encoder_selection_widget
vision_encoder_options = vision_encoder_selection_widget()
vision_encoder_options
Dropdown(description='Vision Encoder', index=1, options=('FP16', 'INT8 quantization', 'INT8 weights compressio…
from transformers import AutoProcessor
import nncf
import openvino as ov
import gc
from data_preprocessing import prepare_dataset_vision
processor = AutoProcessor.from_pretrained(model_dir)
core = ov.Core()
fp_vision_encoder_path = model_dir / "openvino_vision_encoder.xml"
int8_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8.xml")
int8_wc_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8_wc.xml")
if vision_encoder_options.value == "INT8 quantization":
    if not int8_vision_encoder_path.exists():
        calibration_data = prepare_dataset_vision(processor, 100)
        ov_model = core.read_model(fp_vision_encoder_path)
        calibration_dataset = nncf.Dataset(calibration_data)
        quantized_model = nncf.quantize(
            model=ov_model,
            calibration_dataset=calibration_dataset,
            model_type=nncf.ModelType.TRANSFORMER,
            advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alphas=nncf.AdvancedSmoothQuantParameters(matmul=0.6)),
            subset_size=100,
        )
        ov.save_model(quantized_model, int8_vision_encoder_path)
        del quantized_model
        del ov_model
        del calibration_dataset
        del calibration_data
        gc.collect()
    vision_encoder_path = int8_vision_encoder_path
elif vision_encoder_options.value == "INT8 weights compression":
    if not int8_wc_vision_encoder_path.exists():
        ov_model = core.read_model(fp_vision_encoder_path)
        compressed_model = nncf.compress_weights(ov_model)
        ov.save_model(compressed_model, int8_wc_vision_encoder_path)
    vision_encoder_path = int8_wc_vision_encoder_path
else:
    vision_encoder_path = fp_vision_encoder_path
/home/ea/work/py311/lib/python3.11/site-packages/nncf/quantization/algorithms/post_training/pipeline.py:87: FutureWarning: AdvancedQuantizationParameters(smooth_quant_alpha=..) is deprecated.Please, use AdvancedQuantizationParameters(smooth_quant_alphas) option with AdvancedSmoothQuantParameters(convolution=.., matmul=..) as value instead. warning_deprecated(
Output()
WARNING:nncf:Dataset contains only 100 samples, smaller than the requested subset size 300.
Output()
Output()
WARNING:nncf:Dataset contains only 100 samples, smaller than the requested subset size 300.
/home/ea/work/py311/lib/python3.11/site-packages/numpy/core/_methods.py:118: RuntimeWarning: overflow encountered in reduce
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
Output()
Model Inference#
Now, we are ready to test model inference.
OVMLlamaForConditionalGeneration defined in ov_mllama_helper.py has a generation interface similar to the original model and additionally enables runtime optimizations for efficient model inference with OpenVINO:
- Slicing LM head - usually LLM models provide probabilities for all input tokens, while for selecting the next token we are interested only in the last one. Reducing the Language Model head size to return only the last token's probabilities may provide better performance and reduce memory consumption for the first inference, where usually the whole input prompt is processed. You can find more details about this optimization in the OpenVINO blog. A toy illustration of the idea is shown after this list.
- Using Remote tensors for GPU - copying data to the device and back into host memory can become a bottleneck for efficient execution of a multi-model pipeline on GPU. The Remote Tensor API provides functionality for low-level GPU memory management; we can use this feature for sharing cross-attention keys and values between the Image Encoder and the Language Model.
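The sketch below illustrates the LM-head slicing idea with NumPy (shapes and values are arbitrary and not tied to the notebook):

import numpy as np

# During prefill a full LM head produces logits for every input position,
# but only the last position is needed to choose the next token.
batch, seq_len, vocab = 1, 6, 32
full_logits = np.random.rand(batch, seq_len, vocab)   # hypothetical full LM-head output
last_token_logits = full_logits[:, -1, :]             # what a sliced LM head returns directly
next_token = int(np.argmax(last_token_logits[0]))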
from ov_mllama_helper import OVMLlamaForConditionalGeneration
# Uncomment this line to see model inference code
# OVMLlamaForConditionalGeneration??
ov_model = OVMLlamaForConditionalGeneration(
    model_dir, device=device.value, language_model_name=language_model_path.name, image_encoder_name=vision_encoder_path.name
)
processor = AutoProcessor.from_pretrained(model_dir)
applied slice for lm head
from PIL import Image
from transformers import TextStreamer
import numpy as np
question = "What is unusual on this image?"
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]},
]
text = processor.tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
input_image_path = Path("cat.png")
if not input_image_path.exists():
    url = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
    raw_image = Image.open(requests.get(url, stream=True).raw)
    raw_image.save(input_image_path)
else:
    raw_image = Image.open(input_image_path)
inputs = processor(text=text, images=[raw_image], return_tensors="pt")
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
print(f"Question: {question}")
display(raw_image)
output = ov_model.generate(**inputs, do_sample=False, max_new_tokens=100, temperature=None, top_p=None, streamer=streamer)
print(f"Visual encoder time {ov_model.vision_encoder_infer_time[0] * 1000 :.2f} ms")
print(f"First token latency {ov_model.llm_infer_time[0] * 1000 :.2f}ms, Second token latency {np.mean(np.array(ov_model.llm_infer_time[1:])) * 1000:.2f}ms")
Question: What is unusual on this image?
![../_images/mllama-3.2-with-output_19_1.png](../_images/mllama-3.2-with-output_19_1.png)
The cat is lying in a box. The cat is lying in a box, which is unusual because cats are known for their love of boxes. The cat's unusual behavior of lying in a box is likely due to its natural instinct to seek out small, enclosed spaces for rest and relaxation.
Visual encoder time 19083.66 ms
First token latency 2937.73ms, Second token latency 175.03ms
Interactive demo#
from gradio_helper import make_demo
processor.chat_template = processor.tokenizer.chat_template
demo = make_demo(ov_model, processor)
try:
    demo.launch(debug=False)
except Exception:
    demo.launch(debug=False, share=True)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/
/home/ea/work/py311/lib/python3.11/site-packages/gradio/components/chatbot.py:228: UserWarning: The 'tuples' format for chatbot messages is deprecated and will be removed in a future version of Gradio. Please set type='messages' instead, which uses openai-style 'role' and 'content' keys.
warnings.warn(
* Running on local URL: http://127.0.0.1:7860 Rerunning server... use close() to stop if you need to change launch() parameters. ---- * Running on public URL: https://5276392df9bae2f87b.gradio.live This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run gradio deploy from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)