# Model Server Parameters

## Model Configuration Options
| Option | Value format | Description |
|---|---|---|
| | | Model name exposed over the gRPC and REST API. |
| | | If using a Google Cloud Storage, Azure Storage or S3 path, see the cloud storage guide. |
| | | Optional. By default, the batch size is derived from the model, as defined through the OpenVINO Model Optimizer. |
| | | Optional. The model version policy lets you decide which versions of a model the OpenVINO Model Server serves. By default, the server serves the latest version. One reason to use this argument is to control server memory consumption. The accepted format is JSON or string. |
| | | List of device plugin parameters. For the full list, refer to the OpenVINO documentation and the performance tuning guide. |
| | | Size of the internal request queue. When set to 0 or left unset, the value is calculated automatically based on available resources. |
| | | Device name to be used to execute inference operations. |
| | | If set to true, the model is loaded as stateful. |
| | | If set to true, the model will be subject to periodic sequence cleaner scans. See idle sequence cleanup. |
| | | Determines how many sequences can be handled concurrently by a model instance. |
| | | If set to true, the model server applies the low latency transformation on model load. |
| | | Flag enabling the metrics endpoint on rest_port. |
| | | Comma-separated list of metrics. If unset, only default metrics are enabled. |
| | | Path to the directory containing images to include in requests. If unset, local filesystem images in requests are not supported. |
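The model version policy described above is commonly expressed as JSON. A sketch, assuming the policy keys (`latest`, `specific`, `all`) used by recent OVMS releases:

```json
{"latest": { "num_versions": 2 }}
```

Other commonly documented forms are `{"specific": { "versions": [1, 3] }}` to pin particular versions and `{"all": {}}` to serve every available version; verify the exact schema against your OVMS release.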
Note: Specifying config_path is mutually exclusive with putting model parameters in the CLI (used when serving multiple models).

| Option | Value format | Description |
|---|---|---|
| | | Absolute path to the JSON configuration file. |
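When serving multiple models via the configuration file, the file typically follows the `model_config_list` layout. A minimal sketch, assuming that layout; the model names and paths are placeholders:

```json
{
  "model_config_list": [
    { "config": { "name": "resnet", "base_path": "/models/resnet" } },
    { "config": { "name": "bert",   "base_path": "/models/bert" } }
  ]
}
```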
## Server configuration options

Configuration options for the server are defined only via command-line options and determine settings common to all served models.
| Option | Value format | Description |
|---|---|---|
| | | Number of the port used by the gRPC server. |
| | | Number of the port used by the HTTP server (if not provided or set to 0, the HTTP server will not be launched). |
| | | Network interface address or hostname to which the gRPC server will bind. Default: all interfaces (0.0.0.0). |
| | | Network interface address or hostname to which the REST server will bind. Default: all interfaces (0.0.0.0). |
| | | Number of gRPC server instances (must be from 1 to the CPU core count). The default value is 1, which is optimal for most use cases. Consider setting a higher value when expecting heavy load. |
| | | Number of HTTP server threads. |
| | | Time interval, in seconds, between detection of config and model version changes. The default value is 1. A value of zero disables change monitoring. |
| | | Time interval (in minutes) between consecutive sequence cleaner scans. Sequences of models that are subject to idle sequence cleanup and have been inactive since the last scan are removed. A value of zero disables the sequence cleaner. See idle sequence cleanup. It also sets the schedule for releasing free memory from the heap. |
| | | Time interval (in seconds) between two consecutive resource cleanup scans. The default is 1. Must be greater than 0. See custom node development. |
| | | Optional path to a library with custom layers implementation. |
| | | Serving logging level. |
| | | Optional path to the log file. |
| | | Path to the model cache storage. Caching is enabled if this parameter is defined or the default path /opt/cache exists. |
| | | Comma-separated list of arguments to be passed to the gRPC server (e.g. grpc.max_connection_age_ms=2000). |
| | | Maximum number of threads which can be used by the gRPC server. The default value depends on the number of CPUs. |
| | | gRPC server buffer memory quota. The default value is 2147483648 (2 GB). |
| | | Shows the help message and exits. |
| | | Shows the binary version. |
| | | Whether to allow credentials in CORS requests. |
| | | Comma-separated list of allowed headers in CORS requests. |
| | | Comma-separated list of allowed methods in CORS requests. |
| | | Comma-separated list of allowed origins in CORS requests. |
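A typical server start combines the port and logging options above. An illustrative invocation, assuming the standard `openvino/model_server` image and common flag names such as `--port` and `--rest_port` (verify against `ovms --help` for your release):

```shell
# Serve one model over gRPC (9000) and REST (8000); paths are placeholders.
docker run --rm -p 9000:9000 -p 8000:8000 \
  -v /opt/models:/models:ro \
  openvino/model_server:latest \
  --model_name resnet --model_path /models/resnet \
  --port 9000 --rest_port 8000 \
  --grpc_workers 1 --log_level INFO
```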
## Config management mode options

Configuration options for the config management mode, which is used to manage the configuration file in the model repository.
| Option | Value format | Description |
|---|---|---|
| | | Path to the model repository. This path is prefixed to the relative model path. |
| | | Lists all model paths in the model repository. |
| | | Name of the model as visible in serving. |
| | | Optional. Path to the model repository. If the path is relative, it is prefixed with the model repository path. |
| | | Either the path to a directory containing a config.json file for OVMS, or the path to an OVMS configuration file, to add a specific model to. |
| | | Either the path to a directory containing a config.json file for OVMS, or the path to an OVMS configuration file, to remove a specific model from. |
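Config management mode edits the repository's config file from the command line. A sketch, assuming flag names like `--add_to_config` and `--remove_from_config` from recent releases (treat these as assumptions and confirm with `ovms --help`):

```shell
# Add a model entry to an existing configuration (names/paths are placeholders)
ovms --add_to_config /models/config.json \
     --model_name resnet --model_path resnet

# Remove the same entry
ovms --remove_from_config /models/config.json --model_name resnet
```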
## Pull mode configuration options

Shared configuration options for the pull and pull & start modes. When the --pull parameter is present, OVMS only pulls the model without serving it.
### Pull Mode Options

| Option | Value format | Description |
|---|---|---|
| | | Runs the server in pull mode to download the model from the Hugging Face repository. |
| | | Name of the model in the Hugging Face repository. |
| | | Directory where all required model files will be saved. |
| | | Name of the model as exposed externally by the server. |
| | | Device name to be used to execute inference operations. |
| | | Task type the model will support (text generation, image generation, embeddings, or rerank). |
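Pulling without serving might look as follows. The flag names (`--pull`, `--source_model`, `--model_repository_path`, `--task`) are assumptions based on recent OVMS releases, and `<namespace>/<model>` is a placeholder for a real Hugging Face repository id:

```shell
# Download the model files only; the server does not start serving.
ovms --pull \
     --source_model <namespace>/<model> \
     --model_repository_path /models \
     --task text_generation
```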
There are also additional environment variables that may change the behavior of pulling:
### Basic Environment Variables for Pull Mode

| Variable | Value format | Description |
|---|---|---|
| | | Authentication token required for accessing some models from Hugging Face. |
| | | If set, model downloads will use this proxy. |
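These variables are set in the environment before invoking the server. A sketch; `HF_TOKEN` and `https_proxy` are the conventional variable names for the token and proxy settings described above, and the values are placeholders:

```shell
export HF_TOKEN=<your_hugging_face_token>
export https_proxy=http://proxy.example.com:8080
ovms --pull --source_model <namespace>/<model> --model_repository_path /models
```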
### Advanced Environment Variables for Pull Mode

| Variable | Format | Description |
|---|---|---|
| | | Timeout for attempting connections to a remote server. Default value: 4000 ms. |
| | | Timeout for reading from and writing to a remote server. Default value: 4000 ms. |
| | | Path to check for SSL certificates. |
Task-specific parameters for the different tasks (text generation, image generation, embeddings, rerank) are listed below:
### Text generation

| Option | Value format | Description |
|---|---|---|
| | | The maximum number of sequences that can be processed together. Default: 256. |
| | | Type of the pipeline to be used. |
| | | Enables an algorithm that caches prompt tokens. Default: true. |
| | | The maximum number of tokens that can be batched together. |
| | | Cache size in GB. Default: 10. |
| | | HF model name or path to a local folder with a PyTorch or OpenVINO draft model. |
| | | Enables the dynamic split fuse algorithm. Default: true. |
| | | Sets the NPU-specific property for the maximum number of tokens in the prompt. |
| | | Reduced KV cache precision. |
| | | Type of parser to use for tool calls and reasoning in model output. Currently supported: qwen3, llama3, hermes3, phi4. |
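Text generation parameters are typically passed when pulling and starting a generative model. An illustrative sketch, assuming flag names such as `--max_num_seqs`, `--cache_size`, and `--enable_prefix_caching` (confirm against `ovms --help` for your release; the model id is a placeholder):

```shell
ovms --source_model <namespace>/<model> \
     --model_repository_path /models \
     --task text_generation \
     --max_num_seqs 64 \
     --cache_size 4 \
     --enable_prefix_caching true \
     --rest_port 8000
```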
### Image generation

| Option | Value format | Description |
|---|---|---|
| | | Maximum allowed resolution. |
| | | Default resolution. |
| | | Maximum number of images a client can request per prompt in a single request. In the 2025.2 release, only one generated image per request is supported. |
| | | Default number of inference steps when not specified by the client. |
| | | Maximum number of inference steps a client can request for a given model. |
| | | Number of parallel execution streams for image generation models. Use at least 2 on 2-socket CPU systems. |
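An illustrative image generation setup, assuming flag names such as `--default_resolution` and `--max_num_inference_steps` and a `WxH` resolution format (both assumptions to verify against your release):

```shell
ovms --source_model <namespace>/<model> \
     --model_repository_path /models \
     --task image_generation \
     --default_resolution 512x512 \
     --max_num_inference_steps 100 \
     --num_streams 2 \
     --rest_port 8000
```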
### Embeddings

| Option | Value format | Description |
|---|---|---|
| | | The number of parallel execution streams to use for the model. Use at least 2 on 2-socket CPU systems. Default: 1. |
| | | Normalize the embeddings. Default: true. |
| | | Mean pooling option. Default: false. |
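An illustrative embeddings setup; `--normalize` and `--num_streams` follow the descriptions above, with the flag names themselves assumed:

```shell
ovms --source_model <namespace>/<model> \
     --model_repository_path /models \
     --task embeddings \
     --normalize true \
     --num_streams 2 \
     --rest_port 8000
```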
### Rerank

| Option | Value format | Description |
|---|---|---|
| | | The number of parallel execution streams to use for the model. Use at least 2 on 2-socket CPU systems. Default: 1. |
| | | Maximum allowed number of chunks. Default: 10000. |
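And an illustrative rerank setup, with flag names assumed as in the examples above:

```shell
ovms --source_model <namespace>/<model> \
     --model_repository_path /models \
     --task rerank \
     --max_allowed_chunks 5000 \
     --num_streams 2 \
     --rest_port 8000
```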