Model Server Features#
Efficient LLM Serving#
Serve LLMs enhanced with state-of-the-art optimization techniques for best performance and resource utilization on generative workloads.
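For example, recent Model Server releases expose an OpenAI-compatible completions API for generative models. A minimal client sketch, assuming a server on localhost port 8000 serving a model registered under the hypothetical name `llama`:

```python
# Minimal sketch: call the OpenAI-compatible chat endpoint exposed by
# recent OVMS releases (server address and model name are assumptions).
import requests

payload = {
    "model": "llama",  # hypothetical name from the server configuration
    "messages": [{"role": "user", "content": "What is OpenVINO Model Server?"}],
    "max_tokens": 128,
}
response = requests.post("http://localhost:8000/v3/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
```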
Python Code Execution#
Write Python code that performs your custom processing and serve it in the Model Server. Take advantage of a rich ecosystem of Python modules in domains like data processing and data science to create flexible solutions without writing C++ code.
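A minimal sketch of such a custom Python node, assuming the `OvmsPythonModel` interface used by the Python executor (exact signatures may differ between releases, and the input/output tensor names are hypothetical):

```python
# model.py -- sketch of a custom Python node for OVMS (an assumption-based
# example; consult the release documentation for the exact interface).
from pyovms import Tensor  # module provided by the OVMS Python executor


class OvmsPythonModel:
    def initialize(self, kwargs):
        # One-time setup, e.g. loading resources or models.
        self.prefix = b"processed: "

    def execute(self, inputs):
        # inputs is a list of pyovms.Tensor objects; return a list of outputs.
        data = bytes(inputs[0])
        return [Tensor("output", self.prefix + data)]
```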
Serving MediaPipe Graphs#
Create MediaPipe graphs and serve them. Configure multiple nodes and connect them to create powerful pipelines.
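A sketch of the configuration entry that registers a graph with the server, assuming a graph definition stored at a hypothetical path:

```python
# Sketch: register a MediaPipe graph in the server's config.json via the
# "mediapipe_config_list" section (graph name and path are hypothetical).
import json

config = {
    "model_config_list": [],
    "mediapipe_config_list": [
        {
            "name": "myGraph",                       # name clients will address
            "graph_path": "/workspace/graph.pbtxt",  # graph definition file
        }
    ],
}
with open("config.json", "w") as f:
    json.dump(config, f, indent=4)
```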
Serving Pipelines of Models#
Connect multiple models in a pipeline and reduce data transfer overhead with the Directed Acyclic Graph (DAG) Scheduler. Implement model inference and data transformations using a custom node C/C++ dynamic library.
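A sketch of a two-model pipeline definition in `config.json`, with hypothetical model, node, and tensor names:

```python
# Sketch of a two-model DAG in the "pipeline_config_list" section of
# config.json; all names below are hypothetical.
import json

pipeline = {
    "name": "detect_then_classify",
    "inputs": ["image"],
    "nodes": [
        {
            "name": "detector_node",
            "model_name": "detector",
            "type": "DL model",
            # "data" is the detector's input tensor name; "request" refers
            # to the pipeline input.
            "inputs": [{"data": {"node_name": "request", "data_item": "image"}}],
            "outputs": [{"data_item": "boxes", "alias": "boxes"}],
        },
        {
            "name": "classifier_node",
            "model_name": "classifier",
            "type": "DL model",
            "inputs": [{"data": {"node_name": "detector_node", "data_item": "boxes"}}],
            "outputs": [{"data_item": "label", "alias": "label"}],
        },
    ],
    "outputs": [{"label": {"node_name": "classifier_node", "data_item": "label"}}],
}
config = {"model_config_list": [], "pipeline_config_list": [pipeline]}
with open("config.json", "w") as f:
    json.dump(config, f, indent=4)
```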
Processing Raw Data#
Send data in JPEG or PNG format to reduce network traffic and offload data pre-processing to the server.
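A client-side sketch using the `ovmsclient` package, assuming a hypothetical model `resnet` configured to accept binary-encoded inputs:

```python
# Sketch: send a compressed JPEG directly instead of a decoded tensor
# (model and input names are hypothetical; the served model must be
# configured to accept binary-encoded inputs).
from ovmsclient import make_grpc_client

client = make_grpc_client("localhost:9000")
with open("cat.jpeg", "rb") as f:
    img = f.read()  # raw JPEG bytes; no client-side decoding or resizing
response = client.predict(inputs={"input": img}, model_name="resnet")
print(response)
```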
Model Versioning Policies#
The model repository structure enables adding or deleting numerical version directories, and the server automatically adjusts which versions are served.
Control which model versions are served by setting the model version policy to serve all versions, a specific version or set of versions, or only the latest version (the default).
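A `config.json` sketch showing the available policy variants, with a hypothetical model name and path:

```python
# Sketch of model_version_policy settings in config.json
# (model name and base_path are hypothetical).
import json

config = {
    "model_config_list": [
        {
            "config": {
                "name": "resnet",
                "base_path": "/models/resnet",
                # Pick one policy; "latest" with one version is the default:
                # "model_version_policy": {"all": {}}
                # "model_version_policy": {"specific": {"versions": [1, 3]}}
                "model_version_policy": {"latest": {"num_versions": 2}},
            }
        }
    ]
}
with open("config.json", "w") as f:
    json.dump(config, f, indent=4)
```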
Model Reshaping#
Change the batch size, shape and layout of the model at runtime to achieve high throughput and low latency.
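A `config.json` sketch, with hypothetical values, that reshapes the model input and remaps its layout:

```python
# Sketch: override shape and layout in config.json (values hypothetical).
import json

config = {
    "model_config_list": [
        {
            "config": {
                "name": "resnet",
                "base_path": "/models/resnet",
                # "batch_size": "auto" would adapt the batch dimension instead;
                # an explicit shape takes precedence when both are given.
                "shape": "(1,3,448,448)",   # reshape from the default input size
                "layout": "NHWC:NCHW",      # accept NHWC data for an NCHW model
            }
        }
    ]
}
print(json.dumps(config, indent=4))
```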
Modify Model Configuration at Runtime#
OpenVINO Model Server regularly checks for changes to the configuration file and applies them during runtime. This means that you can change model configurations (for example, change the device where a model is served), add a new model or completely remove one that is no longer needed. These changes will be applied without any disruption to the service.
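One way to confirm that a reload took effect is the config status REST endpoint; a sketch assuming the REST interface listens on port 8000:

```python
# Sketch: check the applied configuration after editing config.json,
# using the config status REST endpoint (port is an assumption).
import requests

# Returns the status of served models and pipelines after the last reload.
status = requests.get("http://localhost:8000/v1/config").json()
for name, details in status.items():
    print(name, details)
```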
Working with Stateful Models#
Serve models that operate on sequences of data and maintain their state between inference requests.
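A `config.json` sketch marking a hypothetical model as stateful; requests then carry a sequence id plus start/end control signals so the server can route a whole sequence to the same model state:

```python
# Sketch: enable stateful serving in config.json (name/path hypothetical).
import json

config = {
    "model_config_list": [
        {
            "config": {
                "name": "speech_model",
                "base_path": "/models/speech",
                "stateful": True,
                "max_sequence_number": 500,  # cap on concurrent sequences
            }
        }
    ]
}
print(json.dumps(config, indent=4))
```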
Metrics#
Use the Prometheus-compatible metrics endpoint to access performance and utilization statistics.
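A sketch that scrapes the endpoint, assuming metrics were enabled at startup and the REST interface listens on port 8000:

```python
# Sketch: scrape the Prometheus-format metrics endpoint
# (assumes metrics are enabled and REST is served on port 8000).
import requests

text = requests.get("http://localhost:8000/metrics").text
# Each line is a Prometheus sample, e.g. request counters and latencies.
for line in text.splitlines():
    if line.startswith("ovms_"):
        print(line)
```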
Enabling Dynamic Inputs#
Configure served models to accept data with variable batch sizes and input shapes.
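A `config.json` sketch, with hypothetical values, declaring dynamic dimensions with `-1`:

```python
# Sketch: declare dynamic dimensions in config.json so the model accepts
# variable batch size and image resolution (values hypothetical).
import json

config = {
    "model_config_list": [
        {
            "config": {
                "name": "resnet",
                "base_path": "/models/resnet",
                "shape": "(-1,3,-1,-1)",  # dynamic batch, height, and width
            }
        }
    ]
}
print(json.dumps(config, indent=4))
```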
Model Server C API#
Use in-process inference via the Model Server C API to leverage the model management and model pipeline functionality of OpenVINO Model Server within an application. This allows you to reuse existing OVMS functionality to execute inference locally, without network overhead.
Advanced Features#
Use CPU extensions, the model cache feature, or a custom model loader.