Optimize and Deploy Generative AI Models

Generative AI is an innovative technique that creates new data, such as text, images, video, or audio, using neural networks. OpenVINO accelerates Generative AI use cases as they mostly rely on model inference, allowing for faster development and better performance. When it comes to generative models, OpenVINO supports:

  • Conversion, optimization, and inference for text, image, and audio generative models, such as Llama 2, MPT, OPT, Stable Diffusion, and Stable Diffusion XL.

  • Int8 weight compression for text generation models.

  • Storage format reduction (fp16 precision for non-compressed models and int8 for compressed models).

  • Inference on CPU and GPU platforms, including integrated Intel® Processor Graphics, discrete Intel® Arc™ A-Series Graphics, and discrete Intel® Data Center GPU Flex Series.

OpenVINO offers two main paths for Generative AI use cases:

  • Using OpenVINO as a backend for Hugging Face frameworks (transformers, diffusers) through the Optimum Intel extension.

  • Using OpenVINO native APIs (Python and C++) with custom pipeline code.

In both cases, the OpenVINO runtime and tools are used; the difference is mostly in the preferred API and the final solution's footprint. Native APIs enable the use of generative models in C++ applications, ensure minimal runtime dependencies, and minimize the application footprint. The native API approach requires you to implement glue code (the generation loop, text tokenization, or scheduler functions) that the Hugging Face libraries otherwise hide for a better developer experience.

It is recommended to start with Hugging Face frameworks. Experiment with different models and scenarios to find your fit, and then consider converting to OpenVINO native APIs based on your specific requirements.

Optimum Intel provides interfaces that enable model optimization (weight compression) using the Neural Network Compression Framework (NNCF) and export of models to the OpenVINO model format for use in native API applications.

The table below summarizes the differences between the Hugging Face and native API approaches.

                                   Hugging Face through OpenVINO             OpenVINO Native API
------------------------------------------------------------------------------------------------
Model support                      Broad set of Models                       Broad set of Models
APIs                               Python (Hugging Face API)                 Python, C++ (OpenVINO API)
Model Format                       Source Framework / OpenVINO               OpenVINO
Inference code                     Hugging Face based                        Custom inference pipelines
Additional dependencies            Many Hugging Face dependencies            Lightweight (e.g. numpy, etc.)
Application footprint              Large                                     Small
Pre/post-processing and glue code  Available at Hugging Face out-of-the-box  OpenVINO samples and notebooks
Performance                        Good                                      Best

Running Generative AI Models using Hugging Face Optimum Intel

Prerequisites

  • Create a Python environment.

  • Install Optimum Intel:

pip install optimum[openvino,nncf]

To start using OpenVINO as a backend for Hugging Face, change the original Hugging Face code in two places:

-from transformers import AutoModelForCausalLM
+from optimum.intel import OVModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"
-model = AutoModelForCausalLM.from_pretrained(model_id)
+model = OVModelForCausalLM.from_pretrained(model_id, export=True)

After that, you can call the save_pretrained() method to save the model to a folder in the OpenVINO Intermediate Representation and use it later.

model.save_pretrained(model_dir)

Alternatively, you can download and convert the model using the CLI interface:

optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf llama_openvino

In this case, you can load the converted model in the OpenVINO representation directly from disk:

model_id = "llama_openvino"
model = OVModelForCausalLM.from_pretrained(model_id)

By default, inference will run on CPU. To select a different inference device, for example, GPU, add device="GPU" to the from_pretrained() call. To switch to a different device after the model has been loaded, use the .to() method. The device naming convention is the same as in the OpenVINO native API:

model.to("GPU")

The Optimum Intel API also provides out-of-the-box model optimization through weight compression using NNCF, which substantially reduces the model footprint and inference latency:

model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)

Weight compression is also available through the CLI interface as the --int8 option.
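A CLI sketch of the same optimization; the output folder name is illustrative and flag placement may vary between Optimum Intel versions:

optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --int8 llama_int8_openvino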

Note

8-bit weight compression is enabled by default for models larger than 1 billion parameters.

NNCF also provides 4-bit weight compression, which is supported by OpenVINO. It can be applied to Optimum objects as follows:

from nncf import compress_weights, CompressWeightsMode
from optimum.intel import OVModelForCausalLM

# Export the model without 8-bit compression, then compress its weights to 4-bit with NNCF
model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=False)
model.model = compress_weights(model.model, mode=CompressWeightsMode.INT4_SYM, group_size=128, ratio=0.8)

The optimized model can be saved as usual with a call to save_pretrained(). For more details on compression options, refer to the weight compression guide.

Note

OpenVINO also supports 4-bit models from Hugging Face Transformers library optimized with GPTQ. In this case, there is no need for an additional model optimization step because model conversion will automatically preserve the INT4 optimization results, allowing model inference to benefit from it.
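For example, a GPTQ model can be loaded and converted with the same call as any other model; the model ID below is illustrative:

from optimum.intel import OVModelForCausalLM

# INT4 GPTQ weights are preserved during conversion; no extra optimization step is needed
model = OVModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ", export=True)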

Below is an example of using Optimum Intel for model conversion and inference.
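This is a minimal sketch: it converts the model on the fly, applies 8-bit weight compression, and generates text. The prompt and max_new_tokens value are illustrative.

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"

# Convert the model to the OpenVINO format and compress weights to 8-bit during export
model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))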

Working with Models Tuned with LoRA

Low-rank Adaptation (LoRA) is a popular method for tuning Generative AI models to a downstream task or custom data. However, efficient deployment with the Hugging Face API requires a few extra steps: the trained adapters should be fused into the baseline model to avoid extra computation at inference time. This is how it can be done for Large Language Models (LLMs):

from transformers import AutoModelForCausalLM
from peft import PeftModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"
lora_adaptor = "./lora_adaptor"

# Load the base model and apply the trained LoRA adapter
model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True)
model = PeftModelForCausalLM.from_pretrained(model, lora_adaptor)

# Fuse the adapter weights into the base model and save the fused model
model.merge_and_unload()
model.get_base_model().save_pretrained("fused_lora_model")

Now the model can be converted to OpenVINO using the Optimum Intel Python API or the CLI interface mentioned above.
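For example, a sketch of exporting the fused model through the Python API; the output folder name is illustrative:

from optimum.intel import OVModelForCausalLM

model = OVModelForCausalLM.from_pretrained("fused_lora_model", export=True)
model.save_pretrained("fused_lora_openvino")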

Running Generative AI Models using Native OpenVINO APIs

To run Generative AI models using native OpenVINO APIs, you need to follow the regular Convert -> Optimize -> Deploy path with a few simplifications.

To convert a model from Hugging Face, you can use the Optimum Intel export feature, which exports the model in the OpenVINO format without invoking the conversion API and tools directly, as shown above. In this case, the conversion process is simplified. You can still use the regular conversion path if the model comes from outside the Hugging Face ecosystem, that is, in its source framework format (PyTorch, etc.).

Model optimization can be performed within Hugging Face or directly using NNCF as described in the weight compression guide.

Inference code that uses the native API cannot rely on Hugging Face pipelines. You need to write custom code or adapt it from the available examples. Below are some popular Generative AI scenarios:

  • For LLM-based text generation, you need to handle tokenization, the inference and token selection loop, and de-tokenization (see the sketch after this list). If token selection involves beam search, it also needs to be implemented.

  • For image generation models, you need to build a pipeline that includes several model inferences: inference for the source (e.g., text) encoder model, an inference loop for the diffusion process, and inference for the decoding part. Scheduler code is also required.
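As an illustration of the text generation scenario, the sketch below runs a greedy token selection loop with the native OpenVINO Python API. It assumes a stateless model exported to llama_openvino/openvino_model.xml (with the tokenizer saved in the same folder), input_ids and attention_mask inputs, and a logits output; actual input names and the presence of KV-cache inputs depend on how the model was exported.

import numpy as np
import openvino as ov
from transformers import AutoTokenizer  # tokenization is not part of the OpenVINO runtime

core = ov.Core()
compiled = core.compile_model("llama_openvino/openvino_model.xml", "CPU")
tokenizer = AutoTokenizer.from_pretrained("llama_openvino")

tokens = tokenizer("What is OpenVINO?", return_tensors="np")
input_ids = tokens["input_ids"]

# Generation loop: re-run the whole sequence each step (no KV cache) and pick the
# most likely next token (greedy selection)
for _ in range(64):
    attention_mask = np.ones_like(input_ids)
    results = compiled({"input_ids": input_ids, "attention_mask": attention_mask})
    next_token = int(np.argmax(results["logits"][0, -1]))
    if next_token == tokenizer.eos_token_id:
        break
    input_ids = np.concatenate([input_ids, [[next_token]]], axis=1)

# De-tokenization of the generated sequence
print(tokenizer.decode(input_ids[0], skip_special_tokens=True))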

To write such pipelines, you can follow the examples provided as part of OpenVINO: