Efficient LLM Serving - quickstart

Let’s deploy the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model and run text generation.

  1. Install Python dependencies for the conversion script:

export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
pip3 install "optimum-intel[nncf,openvino]"@git+https://github.com/huggingface/optimum-intel.git@fe77316c5a25c7b0e8ae97c23776688448490be2 openvino_tokenizers==2024.4.0 openvino==2024.4.0
  1. Run optimum-cli to download and quantize the model:

mkdir workspace && cd workspace

optimum-cli export openvino --disable-convert-tokenizer --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int8 TinyLlama-1.1B-Chat-v1.0

convert_tokenizer -o TinyLlama-1.1B-Chat-v1.0 --utf8_replace_mode replace --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens TinyLlama/TinyLlama-1.1B-Chat-v1.0
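
Before moving on, it can help to confirm that both commands produced the files the server will look for. The sketch below assumes the usual output names of optimum-cli and convert_tokenizer (openvino_model.xml, openvino_model.bin, openvino_tokenizer.xml, openvino_detokenizer.xml). The function name check_model_dir is hypothetical, and the file list may need adjusting for other tool versions.

```shell
#!/bin/sh
# check_model_dir DIR: report which expected files are missing from DIR.
# The file names are an assumption based on typical optimum-cli and
# convert_tokenizer output; adjust the list if your versions differ.
check_model_dir() {
    dir=$1
    missing=0
    for f in openvino_model.xml openvino_model.bin \
             openvino_tokenizer.xml openvino_detokenizer.xml; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # Exit status 0 means every file was found, 1 means at least one is missing.
    return "$missing"
}

# Usage (from the workspace directory):
#   check_model_dir TinyLlama-1.1B-Chat-v1.0 && echo "model directory looks complete"
```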
  1. Create a graph.pbtxt file in the model directory:

echo '
input_stream: "HTTP_REQUEST_PAYLOAD:input"
output_stream: "HTTP_RESPONSE_PAYLOAD:output"

node: {
  name: "LLMExecutor"
  calculator: "HttpLLMCalculator"
  input_stream: "LOOPBACK:loopback"
  input_stream: "HTTP_REQUEST_PAYLOAD:input"
  input_side_packet: "LLM_NODE_RESOURCES:llm"
  output_stream: "LOOPBACK:loopback"
  output_stream: "HTTP_RESPONSE_PAYLOAD:output"
  input_stream_info: {
    tag_index: "LOOPBACK:0",
    back_edge: true
  }
  node_options: {
      [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
          models_path: "./",
          plugin_config: "{\"KV_CACHE_PRECISION\": \"u8\", \"DYNAMIC_QUANTIZATION_GROUP_SIZE\": \"32\"}",
          cache_size: 4
      }
  }
  input_stream_handler {
    input_stream_handler: "SyncSetInputStreamHandler",
    options {
      [mediapipe.SyncSetInputStreamHandlerOptions.ext] {
        sync_set {
          tag_index: "LOOPBACK:0"
        }
      }
    }
  }
}
' >> TinyLlama-1.1B-Chat-v1.0/graph.pbtxt
  1. Create the server config.json file:

echo '
{
    "model_config_list": [],
    "mediapipe_config_list": [
        {
            "name": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
            "base_path": "TinyLlama-1.1B-Chat-v1.0"
        }
    ]
}
' >> config.json
  1. Deploy:

docker run -d --rm -p 8000:8000 -v $(pwd)/:/workspace:ro openvino/model_server --rest_port 8000 --config_path /workspace/config.json
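
Because the container starts detached (-d), the command returns before the model has finished loading. A script that needs to block until the server is ready could poll the status endpoint. This is a minimal sketch, assuming the /v1/config endpoint and the "state": "AVAILABLE" field used in the status check in this guide. The function names, the one-second interval, and the default of 60 attempts are illustrative choices, not part of the model server.

```shell
#!/bin/sh
# model_is_available: read a /v1/config response on stdin and succeed if it
# reports "state": "AVAILABLE". This is a plain text match, not a JSON parser,
# so it checks that some model is available, not a specific one.
model_is_available() {
    grep -q '"state"[[:space:]]*:[[:space:]]*"AVAILABLE"'
}

# wait_for_model URL [TRIES]: poll URL once per second until a model is
# available or TRIES attempts (default 60, an arbitrary choice) are used up.
wait_for_model() {
    url=$1
    tries=${2:-60}
    while [ "$tries" -gt 0 ]; do
        if curl -s "$url" | model_is_available; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Usage:
#   wait_for_model http://localhost:8000/v1/config && echo "model is ready"
```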

Wait for the model to load. You can check the status with a simple command:

curl http://localhost:8000/v1/config
{
  "TinyLlama/TinyLlama-1.1B-Chat-v1.0": {
    "model_version_status": [
      {
        "version": "1",
        "state": "AVAILABLE",
        "status": {
          "error_code": "OK",
          "error_message": "OK"
        }
      }
    ]
  }
}
  1. Run generation:

curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is OpenVINO?"
      }
    ]
  }' | jq .
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "OpenVINO is a software toolkit developed by Intel that enables developers to accelerate the training and deployment of deep learning models on Intel hardware.",
        "role": "assistant"
      }
    }
  ],
  "created": 1718607923,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "object": "chat.completion"
}

Note: To have the response chunks streamed back as they are generated, set the stream parameter in the request to true.
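
As an illustration of a streaming request, the sketch below builds the request body in a small helper so the stream flag can be toggled from a script. The helper names json_escape and chat_request are hypothetical, and the escaping is deliberately minimal (backslashes and double quotes only), so prompts containing newlines or other control characters would need a real JSON tool such as jq.

```shell
#!/bin/sh
# json_escape TEXT: escape backslashes and double quotes for use inside a
# JSON string. Control characters such as newlines are not handled.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# chat_request PROMPT STREAM: print a chat completions request body.
# STREAM must be the literal word true or false.
chat_request() {
    printf '{"model": "%s", "stream": %s, "messages": [{"role": "user", "content": "%s"}]}\n' \
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0" "$2" "$(json_escape "$1")"
}

# Usage (curl's -N option disables output buffering so chunks are printed
# as they arrive):
#   curl -sN http://localhost:8000/v3/chat/completions \
#     -H "Content-Type: application/json" \
#     -d "$(chat_request 'What is OpenVINO?' true)"
```

With stream set to true, the jq filter from the non-streaming example is left out here, since the response is no longer expected to arrive as a single JSON document.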
