QuickStart - LLM models
Let’s deploy the deepseek-ai/DeepSeek-R1-Distill-Qwen-7B model and request generation on an Intel iGPU or ARC GPU.
Requirements:
Linux or Windows 11
Docker Engine or ovms binary package installed
Intel iGPU or ARC GPU
Install python dependencies for the conversion script:
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/0/demos/common/export_models/requirements.txt
Download the export script and run it to download and quantize the model with optimum-cli:
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/0/demos/common/export_models/export_model.py -o export_model.py
mkdir models
python export_model.py text_generation --source_model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --weight-format int4 --config_file_path models/config.json --model_repository_path models --target_device GPU --cache 2
Deploy:
With Docker
Required: Docker Engine installed
docker run -d --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render*) --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json
On Baremetal Host
Required: OpenVINO Model Server package installed - see the deployment instructions for details.
ovms --rest_port 8000 --config_path ./models/config.json
Check readiness
Wait for the model to load. You can check the status with a simple command:
curl http://localhost:8000/v1/config
{
"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B": {
"model_version_status": [
{
"version": "1",
"state": "AVAILABLE",
"status": {
"error_code": "OK",
"error_message": "OK"
}
}
]
}
}
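Loading a 7B model can take a while, so in scripts it may be convenient to poll the status endpoint instead of re-running curl by hand. The following is a minimal sketch using only the Python standard library. The helper names (is_model_available, wait_until_ready) and the timeout and polling values are illustrative assumptions, not part of the Model Server API; the only thing taken from the response above is the JSON shape.

```python
import json
import time
import urllib.request


def is_model_available(config_status: dict, model_name: str) -> bool:
    """Return True if any version of model_name reports state AVAILABLE.

    config_status is the parsed JSON body returned by GET /v1/config.
    """
    versions = config_status.get(model_name, {}).get("model_version_status", [])
    return any(v.get("state") == "AVAILABLE" for v in versions)


def wait_until_ready(model_name, base_url="http://localhost:8000",
                     timeout_s=600.0, poll_s=5.0) -> bool:
    """Poll /v1/config until the model is AVAILABLE or timeout_s elapses.

    timeout_s and poll_s are arbitrary defaults chosen for this sketch.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/config", timeout=10) as resp:
                if is_model_available(json.load(resp), model_name):
                    return True
        except (OSError, ValueError):
            # Server not listening yet, or the body was not valid JSON: retry.
            pass
        time.sleep(poll_s)
    return False
```

Calling wait_until_ready("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B") returns True as soon as the server reports the state shown above, and False if the timeout expires first.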
Run generation
curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
"max_tokens":30, "temperature":0,
"stream":false,
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What are the 3 main tourist attractions in Paris?"
}
]
}'| jq .
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"message": {
"content": "The three main tourist attractions in Paris are the Eiffel Tower, the Louvre Museum, and the Paris RATP Metro.<|User|>",
"role": "assistant"
}
}
],
"created": 1738656445,
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
"object": "chat.completion",
"usage": {
"prompt_tokens": 37,
"completion_tokens": 30,
"total_tokens": 67
}
}
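The same request can be sent from Python. The sketch below mirrors the curl call above using only the standard library; the function names (build_chat_request, extract_reply, chat) are hypothetical helpers introduced for illustration, and the endpoint, payload fields, and response shape are taken from the example above.

```python
import json
import urllib.request


def build_chat_request(model, user_prompt, max_tokens=30,
                       base_url="http://localhost:8000"):
    """Build a POST request equivalent to the curl command in this guide."""
    payload = {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": 0,
        "stream": False,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v3/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a chat.completion response body."""
    return response["choices"][0]["message"]["content"]


def chat(model, user_prompt, **kwargs) -> str:
    """Send the request and return the generated text (requires a running server)."""
    request = build_chat_request(model, user_prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_reply(json.load(resp))
```

With the server running, chat("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", "What are the 3 main tourist attractions in Paris?") returns the content string from the first choice.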
Note: If you want to get the response chunks streamed back as they are generated, change the stream parameter in the request to true.
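A streamed response can be consumed incrementally from Python. The sketch below assumes the server follows the OpenAI-style streaming convention: server-sent event lines of the form "data: {json}" carrying a choices[0].delta.content field, terminated by "data: [DONE]". That wire format is an assumption here and is not shown in this guide, so verify it against the actual output before relying on it. The helper names parse_sse_line and stream_chat are hypothetical.

```python
import json
import urllib.request


def parse_sse_line(line: bytes):
    """Return the text fragment carried by one SSE line, or None.

    Assumes OpenAI-style chunks: 'data: {...}' lines ending with 'data: [DONE]'.
    """
    text = line.decode("utf-8").strip()
    if not text.startswith("data:"):
        return None  # blank keep-alive lines and other SSE fields
    data = text[len("data:"):].strip()
    if data == "[DONE]":
        return None  # end-of-stream marker
    choices = json.loads(data).get("choices") or []
    if not choices:
        return None
    return (choices[0].get("delta") or {}).get("content")


def stream_chat(model, user_prompt, max_tokens=30,
                base_url="http://localhost:8000"):
    """Yield text fragments as the server generates them (requires a running server)."""
    payload = {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": 0,
        "stream": True,  # the only change compared to the non-streaming request
        "messages": [{"role": "user", "content": user_prompt}],
    }
    request = urllib.request.Request(
        f"{base_url}/v3/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        for raw_line in resp:  # the HTTP response object iterates line by line
            piece = parse_sse_line(raw_line)
            if piece:
                yield piece
```

A caller can print fragments as they arrive, for example by looping over stream_chat(...) and calling print(piece, end="", flush=True) for each piece.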