Visual-language assistant with Video-LLaVA and OpenVINO#
This Jupyter notebook can be launched after a local installation only.
Video-LLaVA (Learning United Visual Representation by Alignment Before Projection, paper) is a Large Vision-Language Model (LVLM) that breaks new ground by understanding both images and videos through a single, unified visual representation. While LLaVA excels at image-based tasks, Video-LLaVA expands this fluency to the dynamic world of videos, enabling seamless comprehension and reasoning across both visual domains. This means it can answer questions, generate text, and perform other tasks with equal ease, regardless of whether it’s presented with a still image or a moving scene.
In this tutorial, we consider how to use the Video-LLaVA model to build a multimodal chatbot. For demonstration purposes, we will use the Video-LLaVA-7B model for conversion.
The tutorial consists of the following steps:
Install prerequisites
Prepare input processor and tokenizer
Download original model
Compress model weights to 4 and 8 bits using NNCF
Convert model to OpenVINO Intermediate Representation (IR) format
Prepare OpenVINO-based inference pipeline
Run OpenVINO model
Installation Instructions#
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
About model#
Video-LLaVA connects pre-trained CLIP ViT-L/14 visual encoders and a large language model using a simple projection matrix.
More details about the model can be found in the original paper and repo.
Prerequisites#
Install required dependencies
%pip install -q torch "torchvision<0.17.0" "transformers>=4.31.0,<4.35.0" "pytorchvideo" "einops" "peft==0.6.2" "huggingface-hub>=0.23.0" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q opencv_python sentencepiece protobuf "openvino>=2024.2.0" "nncf>=2.11.0" "gradio>=4.19"
from pathlib import Path
import sys

repo_dir = Path("Video-LLaVA")

if not repo_dir.exists():
    !git clone https://github.com/PKU-YuanGroup/Video-LLaVA.git

# decord is not supported on macOS; to work around this limitation, we use OpenCV for video processing
video_cfg_path = repo_dir / "videollava/model/multimodal_encoder/languagebind/video/configuration_video.py"
orig_video_cfg_path = video_cfg_path.parent / ("orig_" + video_cfg_path.name)
video_processor_path = repo_dir / "videollava/model/multimodal_encoder/languagebind/video/processing_video.py"
orig_video_processor_path = video_processor_path.parent / ("orig_" + video_processor_path.name)

# Keep the original file as orig_<name> and write a patched copy in its place
if not orig_video_cfg_path.exists():
    video_cfg_path.rename(orig_video_cfg_path)
    with orig_video_cfg_path.open("r") as f:
        data = f.read()
        data = data.replace("decord", "opencv")
    with video_cfg_path.open("w") as out_f:
        out_f.write(data)

if not orig_video_processor_path.exists():
    video_processor_path.rename(orig_video_processor_path)
    with orig_video_processor_path.open("r") as f:
        data = f.read()
        data = data.replace("import decord", "")
        data = data.replace("from decord import VideoReader, cpu", "")
        data = data.replace("decord.bridge.set_bridge('torch')", "")
    with video_processor_path.open("w") as out_f:
        out_f.write(data)

sys.path.insert(0, str(repo_dir.resolve()))
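The cell above follows a "back up once, then patch" pattern: the original file is renamed to orig_<name>, and a modified copy is written under the original name, so re-running the cell does not patch twice. A minimal, self-contained sketch of that pattern is shown below. The helper name patch_file_once and the demo file are hypothetical and exist only for illustration; they are not part of the Video-LLaVA repository.

```python
import tempfile
from pathlib import Path
from typing import Dict


def patch_file_once(path: Path, replacements: Dict[str, str]) -> bool:
    """Back up `path` as orig_<name> and rewrite it with the replacements applied.

    Returns False (and changes nothing) if the backup already exists,
    which means the file was patched on a previous run.
    """
    backup = path.parent / ("orig_" + path.name)
    if backup.exists():
        return False
    path.rename(backup)
    text = backup.read_text()
    for old, new in replacements.items():
        text = text.replace(old, new)
    path.write_text(text)
    return True


# Demo on a throwaway file (hypothetical content, not the real processing_video.py)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo = Path(tmp_dir) / "processing_demo.py"
    demo.write_text("import decord\nbackend = 'decord'\n")
    first = patch_file_once(demo, {"import decord\n": "", "decord": "opencv"})
    second = patch_file_once(demo, {"decord": "never applied"})
    patched_text = demo.read_text()
    backup_text = (Path(tmp_dir) / "orig_processing_demo.py").read_text()

print(first, second)
print(patched_text)
```

The second call is a no-op because the backup already exists, which is the same guard the notebook cell uses with orig_video_cfg_path.exists().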
import gc
import transformers
from videollava.model import LlavaLlamaForCausalLM
from videollava.constants import (
    DEFAULT_IMAGE_PATCH_TOKEN,
    DEFAULT_VIDEO_PATCH_TOKEN,
    DEFAULT_IM_START_TOKEN,
    DEFAULT_VID_START_TOKEN,
    DEFAULT_IM_END_TOKEN,
    DEFAULT_VID_END_TOKEN,
    DEFAULT_IMAGE_TOKEN,
)
from videollava.model.multimodal_encoder.languagebind import config_dict
from videollava.model.multimodal_encoder.languagebind import transform_dict
transformers.logging.set_verbosity_error()
model_id = "LanguageBind/Video-LLaVA-7B"
config = transformers.AutoConfig.from_pretrained(model_id)
image_cfg = config_dict["image"].from_pretrained("LanguageBind/LanguageBind_Image", "./cache_dir")
image_processor = transform_dict["image"](image_cfg)
video_cfg = config_dict["video"].from_pretrained("LanguageBind/LanguageBind_Video_merge", "./cache_dir")
video_processor = transform_dict["video"](video_cfg)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
mm_use_im_start_end = getattr(config, "mm_use_im_start_end", False)
mm_use_im_patch_token = getattr(config, "mm_use_im_patch_token", True)
if hasattr(config, "max_sequence_length"):
    context_len = config.max_sequence_length
else:
    context_len = 2048

if mm_use_im_patch_token:
    tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
    tokenizer.add_tokens([DEFAULT_VIDEO_PATCH_TOKEN], special_tokens=True)
if mm_use_im_start_end:
    tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
    tokenizer.add_tokens([DEFAULT_VID_START_TOKEN, DEFAULT_VID_END_TOKEN], special_tokens=True)
Build model and convert it to OpenVINO IR format#
Video-LLaVA is an autoregressive transformer generative model, which means that each model step depends on the model output from the previous step. The generation approach is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions. In other words, the model predicts the next token in a loop, guided by previously generated tokens, until a stop condition is reached (the maximum sequence length is generated or an end-of-string token is obtained). The way the next token is selected from the predicted probabilities is driven by the selected decoding methodology. You can find more information about the most popular decoding methods in this blog. The entry point for the generation process for models from the Hugging Face Transformers library is the generate method. You can find more information about its parameters and configuration in the documentation.
To preserve flexibility in the selection of the decoding methodology, we will convert only the model inference for one step.
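To make the generation loop concrete, here is a minimal sketch of greedy decoding, the simplest decoding method: at every step the highest-scoring token is appended until an end token or the length limit is reached. The function toy_next_token_scores and the value of EOS_TOKEN are hypothetical stand-ins for one model step and the tokenizer's end-of-sequence id; this is not the Video-LLaVA model or the Transformers generate implementation.

```python
from typing import Callable, List

EOS_TOKEN = 0  # hypothetical end-of-sequence token id


def toy_next_token_scores(tokens: List[int]) -> List[float]:
    """Hypothetical stand-in for one model step: one score per vocabulary id."""
    vocab_size = 5
    # Favor (last_token + 1) % vocab_size, so the sequence eventually reaches EOS_TOKEN
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if idx == target else 0.0 for idx in range(vocab_size)]


def greedy_generate(
    step_fn: Callable[[List[int]], List[float]],
    prompt: List[int],
    max_new_tokens: int,
) -> List[int]:
    """Append the highest-scoring token until EOS_TOKEN or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = step_fn(tokens)
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
        if next_token == EOS_TOKEN:
            break
    return tokens


generated = greedy_generate(toy_next_token_scores, prompt=[2], max_new_tokens=10)
truncated = greedy_generate(toy_next_token_scores, prompt=[2], max_new_tokens=2)
print(generated, truncated)
```

Because only the single step (step_fn here) is model-specific, the surrounding loop can be swapped for sampling or beam search without touching the converted model.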
The inference flow differs between the first step and the subsequent steps. On the first step, the model accepts the preprocessed input instruction and video; the LLM-based part of the model then runs on the input embeddings to predict the probabilities of the next generated token. On each subsequent step, the model accepts only the next token id, selected according to the sampling strategy, together with the cached attention keys and values. Since the output side is autoregressive, an output token's hidden state remains the same once computed for every further generation step. Therefore, recomputing it every time you want to generate a new token seems wasteful. With the cache, the model saves the hidden state once it has been computed. At each time step, the model computes the hidden state only for the most recently generated output token and reuses the saved ones for the previous tokens. This reduces the generation complexity from \(O(n^3)\) to \(O(n^2)\) for a transformer model. More details about how it works can be found in this article.
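The effect of the cache on cost can be illustrated with a simple count. The sketch below is a simplified model, not a measurement: it counts only query-key pairs evaluated by attention over a whole generation, ignores causal masking, the MLP blocks, and the prompt, and both function names are introduced here for illustration.

```python
def attention_pairs_without_cache(num_tokens: int) -> int:
    """Every step re-runs attention over the whole sequence: length * length pairs."""
    return sum(length * length for length in range(1, num_tokens + 1))


def attention_pairs_with_cache(num_tokens: int) -> int:
    """Every step attends from the single new token to all cached positions."""
    return sum(length for length in range(1, num_tokens + 1))


for n in (8, 64, 512):
    no_cache = attention_pairs_without_cache(n)
    with_cache = attention_pairs_with_cache(n)
    print(f"n={n}: without cache {no_cache}, with cache {with_cache}")
```

The first sum grows cubically in the number of generated tokens and the second quadratically, which matches the complexity reduction described above.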
Prepare helpers for model conversion#
The code below prepares a function for converting the Video-LLaVA model to OpenVINO Intermediate Representation format. It splits the model into the parts described above, prepares example inputs for each part, and converts each part using the OpenVINO Model Conversion API. The ov.convert_model function accepts a PyTorch model instance and returns an ov.Model object that represents the model in OpenVINO format. It is ready to be loaded on a device using ov.compile_model or can be saved on disk using ov.save_model.
import torch
import openvino as ov
import nncf
from typing import Optional, Tuple, List


class ModelWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
    ):
        outputs = self.model.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=True,
            output_attentions=False,
            output_hidden_states=False,
            return_dict=True,
        )

        hidden_states = outputs[0]
        logits = self.model.lm_head(hidden_states)

        return (logits, outputs.past_key_values)


def set_node_names(ov_model, input_names=None, output_names=None):
    if input_names is not None:
        for inp, name in zip(ov_model.inputs, input_names):
            inp.get_tensor().set_names({name})
    if output_names is not None:
        for out, name in zip(ov_model.outputs, output_names):
            out.get_tensor().set_names({name})

    ov_model.validate_nodes_and_infer_types()


def cleanup_torchscript_cache():
    """
    Helper for removing cached model representation
    """
    torch._C._jit_clear_class_registry()
    torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()
    torch.jit._state._clear_class_state()


def convert_videollava(
    pt_model: torch.nn.Module,
    videollava_wc_parameters: Optional[dict] = None,
):
    """
    Video-LLaVA model conversion function

    Params:
      pt_model: PyTorch model
      videollava_wc_parameters: keyword arguments for nncf.compress_weights; if None, weights are not compressed
    Output paths (input_embed_model_path, model_path, image_encoder_path, video_encoder_path)
    are read from variables defined in the notebook before this function is called.
    Returns:
      None
    """
    pt_model.config.use_cache = True
    pt_model.config.torchscript = True
    wrapped = ModelWrapper(pt_model)

    # Skip conversion if all four parts are already on disk
    if input_embed_model_path.exists() and image_encoder_path.exists() and video_encoder_path.exists() and model_path.exists():
        print("Video-LLaVA model successfully converted")
        del pt_model
        return

    # Run the first stage once to obtain example past key/values for the second stage
    example_input_first_stage = {
        "inputs_embeds": torch.zeros((1, 307, 4096)),
        "attention_mask": torch.ones((1, 307), dtype=torch.long),
    }
    outs = wrapped(**example_input_first_stage)
    input_names = ["attention_mask"]
    output_names = ["logits"]
    for idx in range(len(outs[1])):
        input_names.extend([f"past_key_values.{idx}.key", f"past_key_values.{idx}.value"])
        output_names.extend([f"present.{idx}.key", f"present.{idx}.value"])
    input_names.append("inputs_embeds")

    # Token embedding layer
    if not input_embed_model_path.exists():
        ov_model = ov.convert_model(wrapped.model.model.embed_tokens, example_input=torch.ones([1, 10], dtype=torch.int64))
        ov.save_model(ov_model, input_embed_model_path)
        cleanup_torchscript_cache()
        del ov_model
        gc.collect()

    # Language model with key/value cache inputs and outputs
    if not model_path.exists():
        example_input_second_stage = {
            "inputs_embeds": torch.ones((1, 2, 4096)),
            "attention_mask": torch.ones((1, outs[1][-1][-1].shape[-2] + 2), dtype=torch.long),
            "past_key_values": outs[1],
        }
        ov_model = ov.convert_model(wrapped, example_input=example_input_second_stage)
        set_node_names(ov_model, input_names, output_names)

        if videollava_wc_parameters is not None:
            print("Applying weight compression to second stage Video-LLaVA model")
            ov_model = nncf.compress_weights(ov_model, **videollava_wc_parameters)
        ov.save_model(ov_model, model_path)
        cleanup_torchscript_cache()
        del ov_model
        gc.collect()

    # Image encoder
    if not image_encoder_path.exists():
        pt_model.forward = pt_model.encode_images
        ov_model = ov.convert_model(
            pt_model,
            example_input=torch.zeros((1, 3, 224, 224)),
            input=[(-1, 3, 224, 224)],
        )
        ov.save_model(ov_model, image_encoder_path)

    # Video encoder
    if not video_encoder_path.exists():
        pt_model.forward = pt_model.encode_videos
        ov_model = ov.convert_model(
            pt_model,
            example_input=torch.zeros((1, 3, 16, 224, 224)),
            input=[(-1, 3, -1, 224, 224)],
        )
        ov.save_model(ov_model, video_encoder_path)

    print("Video-LLaVA model successfully converted")
    del wrapped
    del pt_model
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino
Convert and Optimize Model#
Our model conversion and optimization consist of the following steps:
1. Download the original PyTorch model.
2. Compress model weights using NNCF.
3. Convert the model to OpenVINO format and save it on disk.
Let’s consider each step in more detail.
To create the PyTorch model, we use the from_pretrained method of the LlavaLlamaForCausalLM model class. Model weights will be downloaded from the Hugging Face Hub during the first run. This may take some time and requires at least 13 GB of free disk space.
For reducing memory consumption, weights compression optimization can be applied using NNCF. Weight compression aims to reduce the memory footprint of a model. It can also lead to significant performance improvement for large memory-bound models, such as Large Language Models (LLMs). LLMs and other models, which require extensive memory to store the weights during inference, can benefit from weight compression in the following ways:
enabling the inference of exceptionally large models that cannot be accommodated in the memory of the device;
improving the inference performance of the models by reducing the latency of the memory access when computing the operations with weights, for example, Linear layers.
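The memory argument can be checked with simple arithmetic. The sketch below estimates the size of the weights alone at different bit widths. The parameter count of 7e9 is an assumed round figure for a 7B model, and the estimate ignores quantization scales, zero points, and layers that stay in higher precision, so real file sizes will differ somewhat.

```python
def weight_footprint_gib(num_params: float, bits_per_weight: float) -> float:
    """Size of the weights in GiB for a given number of bits per weight."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 7e9  # assumption: round figure for a 7B-parameter model

for label, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
    print(f"{label}: {weight_footprint_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions, each halving of the bit width halves the weight footprint, which is why lower-bit compression helps memory-bound inference.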
Neural Network Compression Framework (NNCF) provides 4-bit / 8-bit mixed weight quantization as a compression method primarily designed to optimize LLMs. The main difference between weights compression and full model quantization (post-training quantization) is that activations remain floating-point in the case of weights compression which leads to a better accuracy. Weight compression for LLMs provides a solid inference performance improvement which is on par with the performance of the full model quantization. In addition, weight compression is data-free and does not require a calibration dataset, making it easy to use.
The nncf.compress_weights function can be used to perform weight compression. The function accepts an OpenVINO model and other compression parameters. Compared to INT8 compression, INT4 compression improves performance even more, but introduces a minor drop in prediction quality.
More details about weight compression can be found in the OpenVINO documentation.
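To give an intuition for what group-wise asymmetric 4-bit compression does to a weight row, here is a simplified sketch in plain Python. It is an illustration under stated assumptions, not NNCF's implementation: each group of group_size weights shares one scale and one offset (stored here as the group minimum rather than an integer zero point), and all function names are introduced for this example.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (codes, scale, group minimum)


def quantize_group(weights: List[float], num_bits: int = 4) -> Group:
    """Map one group of weights to integer codes in [0, 2**num_bits - 1]."""
    levels = 2**num_bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0.0:  # constant group: any positive scale works
        scale = 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def quantize_row(weights: List[float], group_size: int, num_bits: int = 4) -> List[Group]:
    """Quantize a row group by group; each group gets its own scale and offset."""
    return [quantize_group(weights[start : start + group_size], num_bits) for start in range(0, len(weights), group_size)]


def dequantize_row(groups: List[Group]) -> List[float]:
    """Reconstruct approximate weights from codes, scales, and offsets."""
    return [code * scale + w_min for codes, scale, w_min in groups for code in codes]


weights = [-0.8, -0.1, 0.0, 0.3, 0.7, 1.2, -1.5, 0.05]
quantized = quantize_row(weights, group_size=4)
restored = dequantize_row(quantized)
for original, approx in zip(weights, restored):
    print(f"{original:+.3f} -> {approx:+.3f}")
```

A smaller group size means more scales to store but a tighter fit per group. In nncf.compress_weights, the group_size argument plays this role, and ratio sets the share of weights compressed to 4 bits, with the rest kept at 8 bits.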
Note: There is no speedup for INT4 compressed models on dGPU.
Convert model to OpenVINO format using conversion helper function defined above.
Please select below whether you would like to run INT4 weight compression instead of INT8 weight compression.
import ipywidgets as widgets
compression_mode = widgets.Dropdown(
    options=["INT4", "INT8"],
    value="INT4",
    description="Compression mode:",
    disabled=False,
)
compression_mode
Dropdown(description='Compression mode:', options=('INT4', 'INT8'), value='INT4')
if compression_mode.value == "INT4":
    compressed_model_dir = Path("videollava/INT4_compressed_weights")
    videollava_wc_parameters = dict(mode=nncf.CompressWeightsMode.INT4_ASYM, group_size=128, ratio=0.8)
else:
    compressed_model_dir = Path("videollava/INT8_compressed_weights")
    videollava_wc_parameters = dict(mode=nncf.CompressWeightsMode.INT8)

input_embed_model_path = compressed_model_dir / "input_embed.xml"
video_encoder_path = compressed_model_dir / "video_encoder.xml"
image_encoder_path = compressed_model_dir / "image_encoder.xml"
model_path = compressed_model_dir / "videollava.xml"
compressed_model_dir.mkdir(exist_ok=True, parents=True)

# Download and convert only if any of the four model parts is missing
if not all([input_embed_model_path.exists(), video_encoder_path.exists(), image_encoder_path.exists(), model_path.exists()]):
    model = LlavaLlamaForCausalLM.from_pretrained(model_id)
    model.resize_token_embeddings(len(tokenizer))
    model.config.save_pretrained(compressed_model_dir)
    image_tower = model.get_image_tower()
    if not image_tower.is_loaded:
        image_tower.load_model()
    video_tower = model.get_video_tower()
    if not video_tower.is_loaded:
        video_tower.load_model()

    model.eval()
    with torch.no_grad():
        convert_videollava(
            model,
            videollava_wc_parameters=videollava_wc_parameters,
        )
    del model
    gc.collect();
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: do_sample is set to False. However, temperature is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed. warnings.warn( /home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:367: UserWarning: do_sample is set to False. However, top_p is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed. warnings.warn(
WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.base has been moved to tensorflow.python.trackable.base. The old module will be deleted in version 2.11.
[ WARNING ] Please fix your imports. Module %s has been moved to %s. The old module will be deleted in version %s.
WARNING:nncf:NNCF provides best results with torch==2.3.*, while current torch version is 2.1.2+cpu. If you encounter issues, consider switching to torch==2.3.*
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:595: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if input_shape[-1] > 1:
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:55: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:119: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if seq_len > self.max_seq_len_cached:
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:348: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:355: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:365: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
Applying weight compression to second stage Video-LLaVA model
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Num bits (N) │ % all parameters (layers) │ % ratio-defining parameters (layers) │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ 8 │ 22% (57 / 225) │ 20% (56 / 224) │
├────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│ 4 │ 78% (168 / 225) │ 80% (168 / 224) │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/models/clip/modeling_clip.py:287: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/models/clip/modeling_clip.py:327: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
Video-LLaVA model successfully converted
Prepare OpenVINO based inference pipeline#
The OVLlavaLlamaForCausalLM class provides an easy-to-use interface for using the model in generation scenarios. It is based on transformers.generation.GenerationMixin, which gives us the opportunity to reuse all the rich generation capabilities implemented in the Hugging Face Transformers library. More details about this interface can be found in the Hugging Face documentation.
from transformers.generation import GenerationConfig, GenerationMixin
from transformers.modeling_outputs import CausalLMOutputWithPast
import numpy as np
import torch
from videollava.constants import IMAGE_TOKEN_INDEX


class OVLlavaLlamaForCausalLM(GenerationMixin):
    def __init__(self, core, model_dir, device):
        self.model = core.read_model(model_dir / "videollava.xml")
        self.model_input_embed = core.compile_model(model_dir / "input_embed.xml", device)
        self.image_encoder_model = core.compile_model(model_dir / "image_encoder.xml", device)
        self.video_encoder_model = core.compile_model(model_dir / "video_encoder.xml", device)
        self.input_names = {key.get_any_name(): idx for idx, key in enumerate(self.model.inputs)}
        self.output_names = {key.get_any_name(): idx for idx, key in enumerate(self.model.outputs)}
        self.key_value_input_names = [key for key in self.input_names if "key_values" in key]
        self.key_value_output_names = [key for key in self.output_names if "present" in key]
        compiled_model = core.compile_model(self.model, device)
        self.request = compiled_model.create_infer_request()
        self.config = transformers.AutoConfig.from_pretrained(model_dir)
        self.generation_config = GenerationConfig.from_model_config(self.config)
        self.main_input_name = "input_ids"
        self.device = torch.device("cpu")
        self.num_pkv = 2
        self._supports_cache_class = False

    def can_generate(self):
        """Returns True to validate the check that the model using `GenerationMixin.generate()` can indeed generate."""
        return True

    def embed_tokens(self, input_ids):
        res = self.model_input_embed(input_ids)[0]
        return torch.from_numpy(res)

    def encode_images(self, images):
        res = self.image_encoder_model(images)[0]
        return torch.from_numpy(res)

    def encode_videos(self, videos):
        res = self.video_encoder_model(videos)[0]
        return torch.from_numpy(res)

    def __call__(
        self,
        input_ids: torch.LongTensor,
        images: torch.Tensor,
        attention_mask: Optional[torch.LongTensor] = None,
        prefix_mask: Optional[torch.LongTensor] = None,
        past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        **kwargs,
    ) -> CausalLMOutputWithPast:
        return self.forward(input_ids, images, attention_mask, prefix_mask, past_key_values)

    def forward(
        self,
        input_ids: torch.LongTensor,
        images: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.LongTensor] = None,
        prefix_mask: Optional[torch.LongTensor] = None,
        past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        **kwargs,
    ) -> CausalLMOutputWithPast:
        """General inference method"""
        inputs = self.prepare_inputs_for_multimodal(input_ids, images, attention_mask, past_key_values)

        # Run inference
        self.request.start_async(inputs, share_inputs=True)
        self.request.wait()

        logits = torch.from_numpy(self.request.get_tensor("logits").data)

        # Tuple of length equal to : number of layer * number of past_key_value per decoder layer (2 corresponds to the self-attention layer)
        past_key_values = tuple(self.request.get_tensor(key).data for key in self.key_value_output_names)
        # Tuple of tuple of length `n_layers`, with each tuple of length equal to 2 (k/v of self-attention)
        past_key_values = tuple(past_key_values[i : i + self.num_pkv] for i in range(0, len(past_key_values), self.num_pkv))
        return CausalLMOutputWithPast(logits=logits, past_key_values=past_key_values)

    def prepare_inputs_for_multimodal(self, input_ids, images, attention_mask, past_key_values):
        # Subsequent generation steps: a single new token plus the cached keys and values
        if images is None or input_ids.shape[1] == 1:
            if past_key_values is not None and images is not None and input_ids.shape[1] == 1:
                target_shape = past_key_values[-1][-1].shape[-2] + 1
                attention_mask = torch.cat(
                    (
                        attention_mask,
                        torch.ones((attention_mask.shape[0], target_shape - attention_mask.shape[1]), dtype=attention_mask.dtype, device=attention_mask.device),
                    ),
                    dim=1,
                )
            past_key_values = (past_key_value for pkv_per_layer in past_key_values for past_key_value in pkv_per_layer)
            inputs = dict(zip(self.key_value_input_names, past_key_values))
            inputs_embeds = self.embed_tokens(input_ids)
            inputs["inputs_embeds"] = inputs_embeds
            inputs["attention_mask"] = attention_mask
            return inputs

        # First step: encode images (3D tensors) and videos (4D tensors) separately
        image_idx = [idx for idx, img in enumerate(images) if img.ndim == 3]
        video_idx = [idx for idx, vid in enumerate(images) if vid.ndim == 4]
        images_minibatch = torch.stack([images[idx] for idx in image_idx]) if len(image_idx) > 0 else []  # mini_b c h w
        videos_minibatch = torch.stack([images[idx] for idx in video_idx]) if len(video_idx) > 0 else []  # mini_b c t h w

        tmp_image_features = [None] * (len(image_idx) + len(video_idx))
        if getattr(images_minibatch, "ndim", 0) == 4:  # batch consists of images, [mini_b, c, h, w]
            image_features_minibatch = self.encode_images(images_minibatch)  # [mini_b, l, c]
            for i, pos in enumerate(image_idx):
                tmp_image_features[pos] = image_features_minibatch[i]

        if getattr(videos_minibatch, "ndim", 0) == 5:  # batch consists of videos, [mini_b, c, t, h, w]
            video_features_minibatch = self.encode_videos(videos_minibatch)  # fake list [mini_b, t, l, c]
            for i, pos in enumerate(video_idx):
                t = video_features_minibatch[i].shape[0]
                tmp_image_features[pos] = [video_features_minibatch[i][j] for j in range(t)]

        # Flatten: a video contributes one feature block per frame
        new_tmp = []
        for image in tmp_image_features:
            # print(len(new_tmp), len(image))
            if isinstance(image, list):
                t = len(image)
                for i in range(t):
                    new_tmp.append(image[i])
                # print('add video')
            else:
                new_tmp.append(image)
        image_features = new_tmp

        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids, dtype=torch.bool)
        else:
            attention_mask = attention_mask.bool()

        # remove the padding using attention_mask -- TODO: double check
        input_ids = [cur_input_ids[cur_attention_mask] for cur_input_ids, cur_attention_mask in zip(input_ids, attention_mask)]

        new_input_embeds = []
        cur_image_idx = 0
        for batch_idx, cur_input_ids in enumerate(input_ids):
            num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum()
            # print(num_images, cur_input_ids)
            if num_images == 0:
                cur_image_features = image_features[cur_image_idx]
                cur_input_embeds_1 = self.embed_tokens(cur_input_ids.unsqueeze(0))[0]
                cur_input_embeds = torch.cat([cur_input_embeds_1, cur_image_features[0:0]], dim=0)
                new_input_embeds.append(cur_input_embeds)
                cur_image_idx += 1
                continue

            # Split the text at every image placeholder and embed the text segments
            image_token_indices = [-1] + torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0].tolist() + [cur_input_ids.shape[0]]
            cur_input_ids_noim = []
            for i in range(len(image_token_indices) - 1):
                cur_input_ids_noim.append(cur_input_ids[image_token_indices[i] + 1 : image_token_indices[i + 1]])
            split_sizes = [x.shape[0] for x in cur_input_ids_noim]
            cur_input_embeds = self.embed_tokens(torch.cat(cur_input_ids_noim).unsqueeze(0))[0]
            cur_input_embeds_no_im = torch.split(cur_input_embeds, split_sizes, dim=0)
            cur_new_input_embeds = []

            # Interleave text embeddings with visual features
            for i in range(num_images + 1):
                cur_new_input_embeds.append(cur_input_embeds_no_im[i])
                if i < num_images:
                    cur_image_features = image_features[cur_image_idx]
                    cur_image_idx += 1
                    cur_new_input_embeds.append(cur_image_features)

            cur_new_input_embeds = torch.cat(cur_new_input_embeds)

            new_input_embeds.append(cur_new_input_embeds)

        # Truncate sequences to max length as image embeddings can make the sequence longer
        tokenizer_model_max_length = getattr(self.config, "tokenizer_model_max_length", None)
        if tokenizer_model_max_length is not None:
            new_input_embeds = [x[:tokenizer_model_max_length] for x in new_input_embeds]

        # Combine them
        max_len = max(x.shape[0] for x in new_input_embeds)
        batch_size = len(new_input_embeds)

        new_input_embeds_padded = []
        attention_mask = torch.zeros((batch_size, max_len), dtype=attention_mask.dtype, device=attention_mask.device)

        for i, cur_new_embed in enumerate(new_input_embeds):
            cur_len = cur_new_embed.shape[0]
            if getattr(self.config, "tokenizer_padding_side", "right") == "left":
                new_input_embeds_padded.append(
                    torch.cat(
                        (torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device), cur_new_embed), dim=0
                    )
                )
                if cur_len > 0:
                    attention_mask[i, -cur_len:] = True
            else:
                new_input_embeds_padded.append(
                    torch.cat(
                        (cur_new_embed, torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device)), dim=0
                    )
                )
                if cur_len > 0:
                    attention_mask[i, :cur_len] = True

        new_input_embeds = torch.stack(new_input_embeds_padded, dim=0)

        inputs = {}
        inputs["inputs_embeds"] = new_input_embeds
        inputs["attention_mask"] = attention_mask
        if past_key_values is None:
            # Empty cache for the first step: [batch, heads, 0 past tokens, head_dim]
            for name in self.key_value_input_names:
                inputs[name] = np.zeros([attention_mask.shape[0], 32, 0, 128])
        else:
            past_key_values = (past_key_value for pkv_per_layer in past_key_values for past_key_value in pkv_per_layer)
            inputs.update(dict(zip(self.key_value_input_names, past_key_values)))
        return inputs

    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):
        """
        This function is used during running GenerationMixin.generate for preparing model specific inputs for
        each generation step
        """
        past_len = 0
        if past_key_values is not None:
            input_ids = input_ids[:, -1].unsqueeze(-1)
            past_len = past_key_values[-1][-1].shape[-2]
        attention_mask = kwargs.get(
            "attention_mask",
            torch.ones(input_ids.shape[0], input_ids.shape[1] + past_len),
        )
        if not kwargs.get("use_cache", True):
            raise NotImplementedError("OVLlavaLlamaForCausalLM does not support use_cache=False.")
        else:
            prefix_mask = None
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "prefix_mask": prefix_mask,
            "past_key_values": past_key_values,
            "images": kwargs.get("images", None),
        }

    def _reorder_cache(self, past_key_values: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor) -> Tuple[Tuple[torch.Tensor]]:
        """
        This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or
        [`~PreTrainedModel.beam_sample`] is called.
        This is required to match `past_key_values` with the correct beam_idx at every generation step.
        """

        # from transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel._reorder_cache
        return tuple(tuple(np.take(past_state, beam_idx, 0) for past_state in layer_past) for layer_past in past_key_values)
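The core idea of prepare_inputs_for_multimodal is that every image placeholder token in the prompt is replaced by a block of visual features, while ordinary tokens are replaced by their text embeddings. The sketch below shows that splicing with plain Python lists and strings standing in for tensors. The value of IMAGE_TOKEN and the function splice_visual_features are assumptions made for this illustration; the real method works on embedding tensors and also handles padding and truncation.

```python
from typing import List, Sequence

IMAGE_TOKEN = -200  # assumption: placeholder id standing in for IMAGE_TOKEN_INDEX


def splice_visual_features(token_ids: Sequence[int], visual_features: Sequence[Sequence[str]]) -> List[str]:
    """Replace each placeholder token with the next block of visual features, in order."""
    merged: List[str] = []
    feature_blocks = iter(visual_features)
    for token_id in token_ids:
        if token_id == IMAGE_TOKEN:
            merged.extend(next(feature_blocks))  # one block per placeholder
        else:
            merged.append(f"tok{token_id}")  # stand-in for a text embedding
    return merged


prompt_ids = [1, IMAGE_TOKEN, 5, IMAGE_TOKEN, 9]
features = [["img_a0", "img_a1"], ["img_b0"]]
merged = splice_visual_features(prompt_ids, features)
print(merged)
```

The merged sequence is longer than the prompt, which is why the real method rebuilds the attention mask after splicing.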
Run model inference#
Now that we have the model and have defined the generation pipeline, we can run model inference.
Select inference device#
Select a device from the dropdown list for running inference using OpenVINO.
Note: There is no speedup for INT4 compressed models on dGPU.
core = ov.Core()
import requests
r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
)
open("notebook_utils.py", "w").write(r.text)
from notebook_utils import device_widget
device = device_widget(exclude=["NPU"])
device
Dropdown(description='Device:', index=3, options=('CPU', 'GPU.0', 'GPU.1', 'AUTO'), value='AUTO')
Load OpenVINO model#
ov_model = OVLlavaLlamaForCausalLM(core, compressed_model_dir, device.value)
Prepare input data#
To prepare the input data, we will use the tokenizer and image processor defined at the beginning of this tutorial. To stay aligned with the original PyTorch implementation, we will use PyTorch tensors as input.
from IPython.display import display, Video, Image
examples_dir = Path("Video-LLaVA/videollava/serve/examples")
video_file = examples_dir / "sample_demo_22.mp4"
image_file = examples_dir / "sample_img_22.png"
video_tensor = video_processor.preprocess(str(video_file), return_tensors="pt")["pixel_values"][0]
image_tensor = image_processor.preprocess(str(image_file), return_tensors="pt")["pixel_values"][0]
images_tensor = [video_tensor, image_tensor]
text_message = "Are the instruments in the pictures used in the video?"
print(f"Question: {text_message}")
display(Video(video_file, embed=True))
Image(image_file, embed=True)
Question: Are the instruments in the pictures used in the video?
Test model inference#
Generating a long response can be time consuming. To access partial results as soon as they are generated, without waiting for the whole process to finish, you can use the Streaming API. In token streaming mode, the generative system returns tokens one by one as the model produces them, so the user sees the response progressively instead of waiting for the complete generation. Streaming is an essential aspect of the end-user experience because it reduces perceived latency, one of the most critical aspects of a smooth experience. You can find more details about how streaming works in the HuggingFace documentation.
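The streaming idea can be sketched without any model. The toy class below exposes the same `put()` and `end()` hooks that Hugging Face streamers use, but, as a simplifying assumption, it receives ready-made text pieces instead of token ids. `ToyStreamer` and `toy_generate` are hypothetical names used only for this illustration.

```python
class ToyStreamer:
    """Hypothetical stand-in for a streamer: receives pieces as they are produced."""

    def __init__(self):
        self.chunks = []

    def put(self, text):
        # Called once per generated piece; print it immediately
        self.chunks.append(text)
        print(text, end="", flush=True)

    def end(self):
        # Called once when generation is finished
        print()


def toy_generate(pieces, streamer=None):
    """Pretend generation loop that forwards every piece to the streamer."""
    output = []
    for piece in pieces:
        output.append(piece)
        if streamer is not None:
            streamer.put(piece)
    if streamer is not None:
        streamer.end()
    return "".join(output)


toy_streamer = ToyStreamer()
result = toy_generate(["The ", "video ", "shows ", "a band."], streamer=toy_streamer)
```

The caller still receives the complete result at the end, while the user sees each piece as soon as it is available.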
To simplify input preparation in conversational mode, we will also use the Conversation Template helper provided by the model authors, which accumulates the history of provided messages and images.
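As a rough illustration of what such a template does, the sketch below accumulates (role, message) pairs and renders them into a single prompt string. `ToyConversation`, its system text and its separator are assumptions made for this example; the real helper in the Video-LLaVA repository may format prompts differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ToyConversation:
    """Hypothetical, simplified conversation template."""

    system: str
    roles: Tuple[str, str] = ("USER", "ASSISTANT")
    sep: str = " "
    messages: List[Tuple[str, Optional[str]]] = field(default_factory=list)

    def append_message(self, role, message):
        # A message of None marks the slot the model is expected to fill
        self.messages.append((role, message))

    def get_prompt(self):
        prompt = self.system + self.sep
        for role, message in self.messages:
            if message is None:
                prompt += role + ":"
            else:
                prompt += role + ": " + message + self.sep
        return prompt


toy_conv = ToyConversation(system="A chat between a user and an assistant.")
toy_conv.append_message(toy_conv.roles[0], "<image>\nWhat is shown?")
toy_conv.append_message(toy_conv.roles[1], None)
toy_prompt = toy_conv.get_prompt()
print(toy_prompt)
```

Ending the prompt with an open assistant turn is what prompts the model to continue with its answer.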
from videollava.mm_utils import tokenizer_image_token, KeywordsStoppingCriteria
from transformers import TextStreamer
from videollava.conversation import conv_templates, SeparatorStyle
# Prepare the streamer and the conversation template
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
conv_mode = "llava_v1"
conv = conv_templates[conv_mode].copy()
roles = ("user", "assistant")
# The video is represented in the prompt by 8 image placeholder tokens
if mm_use_im_start_end:
    inp = DEFAULT_VIDEO_START_TOKEN + DEFAULT_IMAGE_TOKEN * 8 + DEFAULT_VIDEO_END_TOKEN + "\n" + text_message
else:
    inp = DEFAULT_IMAGE_TOKEN * 8 + "\n" + text_message
conv.append_message(conv.roles[0], inp)
# None leaves the assistant turn open for the model to complete
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0)
# Stop generation once the conversation separator is produced
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
print("Answer:")
output_ids = ov_model.generate(
    input_ids,
    images=images_tensor,
    do_sample=True,
    temperature=0.2,
    max_new_tokens=1024,
    streamer=streamer,
    use_cache=True,
    stopping_criteria=[stopping_criteria],
)
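The cell above stops generation with `KeywordsStoppingCriteria` from the Video-LLaVA repository, whose internals are not shown in this notebook. As a hedged, text-level sketch of the idea, the hypothetical helpers below check whether the generated text ends with a stop keyword and trim that keyword from the final answer. The real class operates on token ids and may behave differently.

```python
def should_stop(generated_text, keywords):
    """Hypothetical check: has the text produced so far reached a stop keyword?"""
    tail = generated_text.rstrip()
    return any(tail.endswith(keyword) for keyword in keywords)


def trim_stop(text, stop_str):
    """Remove a trailing stop string and surrounding whitespace from the answer."""
    text = text.strip()
    if stop_str and text.endswith(stop_str):
        text = text[: -len(stop_str)]
    return text.strip()


# "</s>" is used here as a sample stop string
sample = " They are playing guitars.</s> "
print(should_stop(sample, ["</s>"]))
print(trim_stop(sample, "</s>"))
```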
Interactive demo#
from videollava.conversation import conv_templates, SeparatorStyle
def generate(image, video, textbox_in):
    # Build the prompt: 8 image placeholders for a video, one for a still image
    if video is not None:
        textbox_in = DEFAULT_IMAGE_TOKEN * 8 + "\n" + textbox_in
        if image is not None:
            textbox_in += "\n" + DEFAULT_IMAGE_TOKEN
    elif image is not None:
        textbox_in = DEFAULT_IMAGE_TOKEN + "\n" + textbox_in

    conv_mode = "llava_v1"
    conv = conv_templates[conv_mode].copy()
    conv.append_message(conv.roles[0], textbox_in)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    # Preprocess the visual inputs
    images_tensor = []
    if image is not None:
        images_tensor.append(image_processor(image, return_tensors="pt")["pixel_values"][0])
    if video is not None:
        images_tensor.append(video_processor(video, return_tensors="pt")["pixel_values"][0])

    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0)
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    generate_kwargs = dict(
        input_ids=input_ids,
        images=images_tensor,
        max_new_tokens=1024,
        temperature=0.2,
        do_sample=True,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
    output_ids = ov_model.generate(**generate_kwargs)

    # Decode only the newly generated tokens and strip the stop string
    input_token_len = input_ids.shape[1]
    outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
    outputs = outputs.strip()
    if outputs.endswith(stop_str):
        outputs = outputs[: -len(stop_str)]
    outputs = outputs.strip()
    return outputs
import requests
if not Path("gradio_helper.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/llava-multimodal-chatbot/gradio_helper.py")
    # write_text closes the file after writing
    Path("gradio_helper.py").write_text(r.text)
from gradio_helper import make_demo_videollava
demo = make_demo_videollava(fn=generate)
try:
    demo.queue().launch(debug=False)
except Exception:
    # Fall back to a public share link if the local launch fails
    demo.queue().launch(share=True, debug=False)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/
# please uncomment and run this cell for stopping gradio interface
# demo.close()