Efficient LLM Serving - quickstart
Let’s deploy the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model and send it a generation request.
Install the Python dependencies for the conversion script:
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt
Download the export script and run it; it uses optimum-cli to download and quantize the model:
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py
mkdir models
python export_model.py text_generation --source_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int8 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models
Deploy:
With Docker
Required: Docker Engine installed
docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server --rest_port 8000 --config_path /workspace/config.json
On Baremetal Host
Required: OpenVINO Model Server package installed - see the deployment instructions for details.
ovms --rest_port 8000 --config_path ./models/config.json
Check readiness
Wait for the model to load. You can check the status with a simple command:
curl http://localhost:8000/v1/config
{
  "TinyLlama/TinyLlama-1.1B-Chat-v1.0": {
    "model_version_status": [
      {
        "version": "1",
        "state": "AVAILABLE",
        "status": {
          "error_code": "OK",
          "error_message": "OK"
        }
      }
    ]
  }
}
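If you would rather script the wait than rerun curl by hand, the readiness check can be sketched in Python. This is a minimal sketch, not part of the official tooling: it assumes the server listens on localhost:8000 as in the commands above and that the response has the shape shown above. The helper names `is_model_available` and `wait_until_ready`, the timeout, and the poll interval are illustrative choices.

```python
import json
import time
import urllib.request

# Endpoint taken from the curl command above; adjust if you changed --rest_port.
CONFIG_URL = "http://localhost:8000/v1/config"


def is_model_available(config: dict, model_name: str) -> bool:
    """Return True if any version of model_name reports state AVAILABLE."""
    versions = config.get(model_name, {}).get("model_version_status", [])
    return any(v.get("state") == "AVAILABLE" for v in versions)


def wait_until_ready(model_name: str, timeout_s: float = 300.0, poll_s: float = 2.0) -> bool:
    """Poll the config endpoint until the model is available or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(CONFIG_URL, timeout=5) as resp:
                if is_model_available(json.load(resp), model_name):
                    return True
        except (OSError, ValueError):
            # Server not accepting connections yet, or the body was not valid JSON.
            pass
        time.sleep(poll_s)
    return False
```

Calling `wait_until_ready("TinyLlama/TinyLlama-1.1B-Chat-v1.0")` returns True once the model is reported as AVAILABLE, or False if the timeout expires first.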
Run generation
curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "max_tokens": 30,
    "stream": false,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is OpenVINO?"
      }
    ]
  }' | jq .
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "OpenVINO is a software toolkit developed by Intel that enables developers to accelerate the training and deployment of deep learning models on Intel hardware.",
        "role": "assistant"
      }
    }
  ],
  "created": 1718607923,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 30,
    "total_tokens": 53
  }
}
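The same request can be sent from Python using only the standard library. The sketch below mirrors the curl call above (same URL, model name, and message structure) and reads the reply from the `choices[0].message.content` field seen in the response. The function names `build_payload`, `extract_reply`, and `ask`, as well as the 120-second timeout, are illustrative and not part of any OpenVINO Model Server API.

```python
import json
import urllib.request

# URL and model name are taken from the curl request above.
CHAT_URL = "http://localhost:8000/v3/chat/completions"
MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"


def build_payload(question: str, max_tokens: int = 30) -> dict:
    """Build the same JSON body as the curl example, with a custom user question."""
    return {
        "model": MODEL,
        "max_tokens": max_tokens,
        "stream": False,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
    }


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a non-streamed chat completion response."""
    return response["choices"][0]["message"]["content"]


def ask(question: str) -> str:
    """POST the question to the server and return the generated answer."""
    request = urllib.request.Request(
        CHAT_URL,
        data=json.dumps(build_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `print(ask("What is OpenVINO?"))` sends the same question as the curl example and prints only the generated text.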
Note: To have the response chunks streamed back as they are generated, set the stream parameter in the request to true.
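A streamed response has to be consumed chunk by chunk instead of being parsed as a single JSON document. The sketch below assumes the server follows the OpenAI-style streaming convention: server-sent event lines of the form `data: {...}`, text carried in `choices[].delta.content`, and a final `data: [DONE]` line. That format is an assumption here and should be checked against the API reference for your server version. The helper name `iter_stream_text` is illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (assumed format)."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            # "delta" may be missing or null in some chunks, so fall back to {}.
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

The response object returned by `urllib.request.urlopen` can be iterated line by line as bytes, so it can be passed directly as `lines` after sending a request with `"stream": true`; printing each fragment with `end=""` and `flush=True` shows the text as it arrives.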