OVMS Pull mode

This document describes how to use the OpenVINO Model Server (OVMS) pull feature to automate deployment configuration for generative AI models. When pulling models already in OpenVINO IR format from Hugging Face Hub, no additional steps are required. However, when pulling models in PyTorch format, optimum-cli must be available to the ovms executable: on bare metal, install the additional Python dependencies; with Docker, use the openvino/model_server:latest-py image. In summary, you have 2 options:

  • pulling pre-configured models in IR format (described below)

  • pulling models with automatic conversion and quantization via optimum-cli (described in pulling with conversion)

Note: Models in IR format must be exported with optimum-cli so that the tokenizer and detokenizer files, if applicable, are also in IR format. If they are missing, add them with the convert_tokenizer tool using the --with-detokenizer option.
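A minimal sketch of regenerating the missing tokenizer IR files, assuming the convert_tokenizer tool is provided by the openvino-tokenizers package (installation method and exact options may differ between versions; check convert_tokenizer --help):

```shell
# Sketch: generate tokenizer and detokenizer IR files next to an exported model.
# Assumes openvino-tokenizers is installed, e.g.:
#   pip install openvino-tokenizers[transformers]
convert_tokenizer <model_name_in_HF> --with-detokenizer -o <model_repository_path>/<model_name>
```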

Pulling pre-configured models

There is a special OVMS mode that pulls the model from Hugging Face before starting the service. It is triggered by the --source_model parameter. Adding the --pull parameter performs pulling alone: the application quits after the model is downloaded. Without the --pull option, the model is deployed and the server is started.

Required: Docker Engine installed

docker run --user $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:latest --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --target_device <DEVICE> --task <task> [TASK_SPECIFIC_PARAMETERS]

Required: OpenVINO Model Server package - see deployment instructions for details.

ovms --pull --source_model <model_name_in_HF> --model_repository_path <model_repository_path> --model_name <external_model_name> --target_device <DEVICE> --task <task> [TASK_SPECIFIC_PARAMETERS]
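For comparison, dropping --pull from the template above downloads the model and then starts the server in one step; a minimal sketch, where the --rest_port value is an illustrative assumption rather than a required setting:

```shell
# Same template without --pull: the model is deployed and the server starts.
# --rest_port 8000 is only an example value.
ovms --source_model <model_name_in_HF> --model_repository_path <model_repository_path> --model_name <external_model_name> --target_device <DEVICE> --task <task> --rest_port 8000
```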

Example for pulling OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov:

Required: Docker Engine installed

docker run --user $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:latest --pull --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" --model_repository_path /models --model_name Phi-3-mini-FastDraft-50M-int8-ov --target_device CPU --task text_generation

Required: OpenVINO Model Server package - see deployment instructions for details.

ovms --pull --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" --model_repository_path /models --model_name Phi-3-mini-FastDraft-50M-int8-ov --target_device CPU --task text_generation

Note: Pull mode requires read-write access to the model repository directory.
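For example, before pulling you can create the repository directory and confirm the current user can write to it (the "models" path is just an example):

```shell
# Create a writable model repository for pull mode; "models" is an example path.
mkdir -p models
chmod u+rw models

# Verify write access before running ovms --pull
test -w models && echo "models is writable"
```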

Check the parameters page for detailed descriptions of configuration options and parameter usage.

If you want to set up the model and start the server in one step, follow these instructions.