Generative AI workflow
Generative AI is an area of deep learning focused on producing new and “original” data from inputs such as images, sound, or natural language text. Due to their complexity and size, generative AI pipelines are more difficult to deploy and run efficiently than conventional models. OpenVINO™ simplifies the process and ensures high-performance integration, with the following options:
Install the OpenVINO GenAI package and run generative models out of the box. With its custom API, tokenizers, and other components, it handles essential tasks such as the text generation loop, tokenization, and scheduling, offering ease of use and high performance.
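As a minimal sketch of what this looks like in practice (the model directory is a placeholder for any LLM already exported to OpenVINO IR format):

```python
import openvino_genai

# Load a pipeline from a directory containing an OpenVINO IR model and its
# tokenizer. "./TinyLlama-1.1B-Chat-ov" is a hypothetical example path.
pipe = openvino_genai.LLMPipeline("./TinyLlama-1.1B-Chat-ov", "CPU")

# The pipeline manages tokenization, the text generation loop, and
# detokenization internally, so a single call returns generated text.
print(pipe.generate("What is OpenVINO?", max_new_tokens=100))
```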
Using Optimum Intel is a great way to experiment with different models and scenarios, thanks to a simple interface built on the popular Hugging Face API and infrastructure. It also enables weight compression with the Neural Network Compression Framework (NNCF), as well as model conversion on the fly. For integration into a final product, however, it may offer lower performance.
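A sketch of this flow, assuming the example Hugging Face model ID below; `export=True` triggers the on-the-fly conversion and `load_in_8bit=True` applies NNCF weight compression during export:

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # example model ID

# Convert the Hugging Face model to OpenVINO IR on the fly and
# compress its weights to 8 bits with NNCF.
model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```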
OpenVINO™ Model Server provides a set of REST API endpoints dedicated to generative use cases. The endpoints simplify writing AI applications, ensure scalability, and provide state-of-the-art performance optimizations. They include an OpenAI-compatible API for text generation, embeddings, and reranking. The model server supports deployment as a container or a binary application on Linux and Windows, with CPU or GPU acceleration. See the demos.
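Because the endpoints follow the OpenAI API, the standard OpenAI Python client can be pointed at the model server. A minimal client-side sketch, assuming a server already running locally on port 8000 and serving a hypothetically named model:

```python
from openai import OpenAI

# The URL and model name are deployment-specific assumptions; the model
# server exposes its OpenAI-compatible endpoints under the /v3 path.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

response = client.chat.completions.create(
    model="llama",  # hypothetical name of a model deployed on the server
    messages=[{"role": "user", "content": "What is OpenVINO?"}],
)
print(response.choices[0].message.content)
```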
These options give OpenVINO clear advantages for generative model deployment: ease of use, high performance, built-in weight compression, and scalable serving.
You can also run generative AI models using the native OpenVINO API, although this is not recommended. If you want to learn how to do it, refer to the 24.6 documentation.
Proceed to guides on: