Generative Model Preparation#
Since generative AI models tend to be big and resource-heavy, it is advisable to store them locally and optimize them for efficient inference. This article will show how to prepare generative models for inference with OpenVINO by:

- downloading models from Hugging Face Hub or Model Scope,
- converting and optimizing them to the OpenVINO IR format, including compressing model weights to a lower precision.
Download Generative Models From Hugging Face Hub#
Pre-converted and pre-optimized models are available in the OpenVINO Toolkit organization on Hugging Face Hub, under the model section, or in one of the dedicated model collections.
You can also use the huggingface_hub package to download models:
pip install huggingface_hub
huggingface-cli download "OpenVINO/phi-2-fp16-ov" --local-dir model_path
The models can be used in OpenVINO immediately after download. No dependencies are required except huggingface_hub.
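If you prefer to stay in Python, the same download can be scripted with the huggingface_hub API. The following is a minimal sketch, mirroring the model id and target directory from the CLI example above; the last two lines additionally assume the openvino-genai package is installed, to show that the downloaded IR is ready to use:

import openvino_genai
from huggingface_hub import snapshot_download

# Download the pre-converted OpenVINO model to a local directory.
model_dir = snapshot_download(repo_id="OpenVINO/phi-2-fp16-ov", local_dir="model_path")

# The downloaded IR can be loaded directly, for example with the OpenVINO GenAI LLMPipeline.
pipe = openvino_genai.LLMPipeline(model_dir, "CPU")
print(pipe.generate("What is OpenVINO?", max_new_tokens=50))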
Download Generative Models From Model Scope#
To download models from Model Scope, use the modelscope package:
pip install modelscope
modelscope download --model "Qwen/Qwen2-7b" --local_dir model_path
Models downloaded via Model Scope are available in PyTorch format only, and they must be converted to OpenVINO IR before inference.
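The download can also be scripted with the modelscope Python API. A minimal sketch, assuming the same model id as in the CLI example above:

from modelscope import snapshot_download

# Download the PyTorch model; the function returns the local directory it was saved to.
model_dir = snapshot_download("Qwen/Qwen2-7b", cache_dir="model_path")

# Pass model_dir to optimum-cli (see the next section) to convert the model to OpenVINO IR.
print(model_dir)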
Convert and Optimize Generative Models#
OpenVINO works best with models in the OpenVINO IR format, both in full precision and quantized. If the model you have selected has not been pre-optimized, you can easily convert and optimize it yourself with a single optimum-cli command. First, make sure optimum-intel is installed on your system:
pip install optimum-intel[openvino]
While optimizing models, you can decide to keep the original precision or select a lower one.

To export a model to OpenVINO IR while keeping fp16 weights:

optimum-cli export openvino --model <model_id> --weight-format fp16 <exported_model_name>
Examples:
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format fp16 ov_llama_2
optimum-cli export openvino --model stabilityai/stable-diffusion-xl-base-1.0 --weight-format fp16 ov_SDXL
optimum-cli export openvino --model openbmb/MiniCPM-V-2_6 --trust-remote-code --weight-format fp16 ov_MiniCPM-V-2_6
optimum-cli export openvino --trust-remote-code --model openai/whisper-base ov_whisper
To compress weights to 4-bit integer precision (int4) during export:

optimum-cli export openvino --model <model_id> --weight-format int4 <exported_model_name>
Examples:
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format int4 ov_llama_2
optimum-cli export openvino --model stabilityai/stable-diffusion-xl-base-1.0 --weight-format int4 ov_SDXL
optimum-cli export openvino -m model_path --task text-generation-with-past --weight-format int4 ov_MiniCPM-V-2_6
Note

Any other model_id, for example openbmb/MiniCPM-V-2_6, or the path to a local model file can be used. Also, you can specify a different data type, such as int8.
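The same export can also be done from Python with the optimum-intel model classes. A minimal sketch for a text-generation model, assuming the Llama 2 model id and output directory from the examples above:

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"

# Export the PyTorch model to OpenVINO IR with 4-bit weight compression,
# the Python counterpart of --weight-format int4.
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
model.save_pretrained("ov_llama_2")

Here, OVWeightQuantizationConfig(bits=4) plays the role of --weight-format int4; the CLI commands above remain the simplest way to control the weight format explicitly.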