Llama 2 Chat

Introduction

This demo showcases example usage of a Llama 2 model hosted via OpenVINO™ Model Server. The model used in this example is available on Hugging Face (~26GB). The steps below automate the download and conversion so that the model can be loaded with OpenVINO™. An example Python script is provided to request answers to a given question.

Download the model

Prepare the environment:

git clone https://github.com/openvinotoolkit/model_server.git
cd model_server/demos/llama_chat/python
virtualenv .venv
source .venv/bin/activate
pip install -r requirements.txt

Download the meta-llama/Llama-2-7b-hf model from Hugging Face and save it to disk in OpenVINO IR format using the script below.

NOTE: The download might take a while since the model is ~26GB.

python3 download_model.py
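
The download_model.py script ships with the demo. For reference, a minimal conversion script along the same lines might look like the sketch below, assuming the optimum-intel export path; the actual script in this directory may differ:

from optimum.intel import OVModelForCausalLM

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # gated model: accept the license on Hugging Face first
SAVE_PATH = "models/llama-2-7b-hf/1"   # numbered version directory expected by OVMS

# export=True downloads the PyTorch checkpoint and converts it to OpenVINO IR
model = OVModelForCausalLM.from_pretrained(MODEL_ID, export=True)
model.save_pretrained(SAVE_PATH)  # writes openvino_model.xml, openvino_model.bin and config.json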

The model files should now be available in the models directory. Note the 1 subdirectory: OVMS expects each model version in its own numbered subdirectory:

tree models

models
└── llama-2-7b-hf
    └── 1
        ├── config.json
        ├── openvino_model.bin
        └── openvino_model.xml

Start OVMS with the prepared Llama 2 model

docker run -d --rm -p 9000:9000 -v $(pwd)/models/llama-2-7b-hf:/model:ro openvino/model_server \
    --port 9000 \
    --model_name llama \
    --model_path /model \
    --plugin_config '{"PERFORMANCE_HINT":"LATENCY","NUM_STREAMS":1}'
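
Once the container is up, you can verify that the model loaded correctly, for example with a quick gRPC status query using the ovmsclient package (a minimal sketch; the model name matches the --model_name flag above):

from ovmsclient import make_grpc_client

client = make_grpc_client("localhost:9000")
# prints per-version state, e.g. {1: {'state': 'AVAILABLE', ...}}
print(client.get_model_status(model_name="llama"))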

Run the Python client

Run the client.py script to start the interactive demo. Available parameters:

python3 client.py -h

usage: client.py [-h] --url URL --question QUESTION [--seed SEED] [--actor {general-knowledge,python-programmer}]

Inference script for generating text with llama

optional arguments:
  -h, --help            show this help message and exit
  --url URL             Specify url to grpc service
  --question QUESTION   Question to selected actor
  --seed SEED           Seed for next token selection algorithm. Providing different numbers will produce slightly different results.
  --actor {general-knowledge,python-programmer}
                        Domain in which you want to interact with the model. Selects predefined pre-prompt.
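
Under the hood, a client like this drives token-by-token generation over gRPC, resending the growing token sequence on every step. The following is a rough sketch of such a loop, not the exact contents of client.py; it assumes the ovmsclient package, a local Hugging Face tokenizer, a stateless exported model (no KV-cache inputs), and tensor names input_ids, attention_mask and logits:

import numpy as np
from ovmsclient import make_grpc_client
from transformers import AutoTokenizer

client = make_grpc_client("localhost:9000")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
rng = np.random.default_rng(14140)  # the --seed parameter makes sampling reproducible

input_ids = tokenizer("How many corners are there in a square?", return_tensors="np").input_ids
for _ in range(128):  # maximum number of new tokens
    outputs = client.predict(
        inputs={"input_ids": input_ids, "attention_mask": np.ones_like(input_ids)},
        model_name="llama",
    )
    logits = outputs["logits"] if isinstance(outputs, dict) else outputs
    # sample the next token from the softmax over the last position
    last = logits[0, -1].astype(np.float64)
    probs = np.exp(last - last.max())
    probs /= probs.sum()
    next_token = rng.choice(len(probs), p=probs)
    if next_token == tokenizer.eos_token_id:
        break
    input_ids = np.concatenate([input_ids, [[next_token]]], axis=1)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))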

Example runs with different pre-prompts (--actor parameter):

General knowledge:

python3 client.py --url localhost:9000 --question "How many corners there are in square?" --seed 14140 --actor general-knowledge

 Four. [EOS]

Python programmer:

python3 client.py --url localhost:9000 --question "Write python function to sum 3 numbers." --seed 1332 --actor python-programmer

def sum_three_numbers(a,b,c):
   result = a + b + c
   return result [EOS]

NOTE: You can edit the pre-prompt in client.py to fit your use case.
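
For instance, the actor pre-prompts could be structured along these lines (illustrative strings, not the exact contents of client.py):

PRE_PROMPTS = {
    "general-knowledge": "You are a helpful assistant. Answer the question briefly.",
    "python-programmer": "You are an expert Python programmer. Answer with code only.",
}

def build_prompt(actor, question):
    # the pre-prompt steers the model toward the selected domain
    return f"{PRE_PROMPTS[actor]}\nQuestion: {question}\nAnswer:"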