QuickStart - LLM models

Let’s deploy the deepseek-ai/DeepSeek-R1-Distill-Qwen-7B model and request text generation on an Intel iGPU or Arc GPU.

Requirements:

  • Linux or Windows 11

  • Docker Engine or ovms binary package installed

  • Intel iGPU or Arc GPU

  1. Install python dependencies for the conversion script:

pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/0/demos/common/export_models/requirements.txt
  2. Download the export script and run it (it invokes optimum-cli under the hood) to download and quantize the model:

curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/0/demos/common/export_models/export_model.py -o export_model.py
mkdir models
python export_model.py text_generation --source_model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --weight-format int4 --config_file_path models/config.json --model_repository_path models --target_device GPU --cache 2
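
The script converts and quantizes the model and adds a matching entry to models/config.json. A quick sanity check in Python (a minimal sketch; the exact config schema may differ between model server releases, so it only looks for the model name):

import json

# Load the configuration file written by export_model.py
# (the path passed via --config_file_path above).
with open("models/config.json") as f:
    config = json.load(f)

print(json.dumps(config, indent=2))

# The exported model should be referenced somewhere in the config;
# the exact layout is release-dependent, so only check for the name.
assert "DeepSeek-R1-Distill-Qwen-7B" in json.dumps(config), "model entry not found"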
  3. Deploy:

With Docker

Required: Docker Engine installed

docker run -d --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render*) --rm \
  -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu \
  --rest_port 8000 --config_path /workspace/config.json
On Baremetal Host

Required: OpenVINO Model Server binary package - see the deployment instructions for details.

ovms --rest_port 8000 --config_path ./models/config.json
  4. Check readiness. Wait for the model to load; you can check the status with a simple command (or script the wait, as in the Python sketch below):

curl http://localhost:8000/v1/config
{
  "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B": {
    "model_version_status": [
      {
        "version": "1",
        "state": "AVAILABLE",
        "status": {
          "error_code": "OK",
          "error_message": "OK"
        }
      }
    ]
  }
}
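
To script the wait instead of polling by hand, here is a minimal Python sketch (assuming the requests package is installed; it queries the /v1/config endpoint shown above until the model reports AVAILABLE):

import time

import requests

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

def wait_until_ready(timeout_s=600):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            config = requests.get("http://localhost:8000/v1/config", timeout=5).json()
            statuses = config.get(MODEL, {}).get("model_version_status", [])
            # Done as soon as any loaded version reports AVAILABLE.
            if any(s.get("state") == "AVAILABLE" for s in statuses):
                print("Model is ready")
                return
        except requests.RequestException:
            pass  # the server may still be starting up
        time.sleep(5)
    raise TimeoutError("model did not become ready in time")

wait_until_ready()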
  5. Run generation:

curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "max_tokens":30, "temperature":0,
    "stream":false,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What are the 3 main tourist attractions in Paris?"
      }
    ]
  }' | jq .
{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "The three main tourist attractions in Paris are the Eiffel Tower, the Louvre Museum, and the Paris RATP Metro.<|User|>",
        "role": "assistant"
      }
    }
  ],
  "created": 1738656445,
  "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 37,
    "completion_tokens": 30,
    "total_tokens": 67
  }
}

Note: To get the response chunks streamed back as they are generated, change the stream parameter in the request to true, as in the Python sketch below.
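Because the /v3 endpoint is compatible with the OpenAI API, the same request can also be issued from Python with the openai client (pip install openai). A minimal streaming sketch; the api_key value is a placeholder, since a default model server deployment does not validate it:

from openai import OpenAI

# Point the OpenAI client at the local model server's OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    max_tokens=30,
    temperature=0,
    stream=True,  # set to False to get a single, complete response instead
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the 3 main tourist attractions in Paris?"},
    ],
)

# Print response chunks as they arrive.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()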

References