LocalAI and OpenVINO#
This Jupyter notebook can be launched after a local installation only.
LocalAI is the free, Open Source OpenAI alternative. LocalAI acts as a drop-in replacement REST API that is compatible with the OpenAI API specifications for local inferencing. It allows you to run LLMs, generate images, audio (and more) locally or on-prem with consumer-grade hardware, supporting multiple model families and architectures. It does not require a GPU. It is created and maintained by Ettore Di Giacinto.
In this tutorial we show how to prepare a model config and launch an OpenVINO LLM model with LocalAI in a Docker container.
Table of contents:

- Installation Instructions
- Prepare Docker
- Prepare a model
- Run the server
- Send a client request
- Stop the server
Installation Instructions#
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
Prepare Docker#
Install Docker Engine, including its post-installation steps, on your development system. To verify the installation, test it using the following command. When Docker is ready, it will pull a test image and print a confirmation message.
!docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
Digest: sha256:305243c734571da2d100c8c8b3c3167a098cab6049c9a5b066b6021a60fcb966
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/
For more examples and ideas, visit:
https://docs.docker.com/get-started/
Prepare a model#
LocalAI allows you to use customized models. For more details, read the instruction, where you can also find the detailed documentation. We will use one of the OpenVINO optimized LLMs from the collection on 🤗Hugging Face. In this example we will use TinyLlama-1.1B-Chat-v1.0-fp16-ov. First of all, we should create a model configuration file:
name: TinyLlama-1.1B-Chat-v1.0-fp16-ov
backend: transformers
parameters:
  model: OpenVINO/TinyLlama-1.1B-Chat-v1.0-fp16-ov
  temperature: 0.2
  top_k: 40
  top_p: 0.95
  max_new_tokens: 32
type: OVModelForCausalLM

template:
  chat_message: |
    <|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "user"}}user{{end}}
    {{if .Content}}{{.Content}}{{end}}<|im_end|>
  chat: |
    {{.Input}}
    <|im_start|>assistant
  completion: |
    {{.Input}}
stopwords:
- <|im_end|>
You can find the values for the backend, model, and type fields in the code example on the model page (we added the corresponding comments):
from transformers import AutoTokenizer # backend
from optimum.intel.openvino import OVModelForCausalLM # type
model_id = "OpenVINO/TinyLlama-1.1B-Chat-v1.0-fp16-ov" # parameters.model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)
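For reference, the loaded objects can be sanity-checked with a quick local generation. This step is optional and not part of the LocalAI setup; the prompt and token count below are arbitrary examples:

# Optional local sanity check, not required for LocalAI itself
inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))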
You can choose the name yourself. It is the name you will use to specify the model on the client side.
You can create a GitHub gist with this configuration and modify the fields as needed: ov.yaml
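Alternatively, a minimal sketch of writing the same configuration to a local ov.yaml file from the notebook. Only the core fields are repeated here (copy the full YAML from above), and note that a local file would still need to be made visible to the container, which is why the gist URL is used later in this tutorial:

from pathlib import Path

# Store the configuration locally instead of in a gist.
# Only the core fields are repeated here; use the full YAML from above.
config_text = """\
name: TinyLlama-1.1B-Chat-v1.0-fp16-ov
backend: transformers
parameters:
  model: OpenVINO/TinyLlama-1.1B-Chat-v1.0-fp16-ov
type: OVModelForCausalLM
"""
Path("ov.yaml").write_text(config_text)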
A description of the parameters used in the config YAML file can be found here. The most important ones:

- name - model name, used to identify the model in API calls.
- backend - backend to use for computation (like llama-cpp, diffusers, whisper, transformers).
- parameters.model - relative to the models path.
- temperature, top_k, top_p, max_new_tokens - generation parameters for the model.
- type - type of configuration, often related to the type of task or model architecture.
- template - templates for various types of model interactions.
- stopwords - words or phrases that halt processing.
Run the server#
Everything is ready for launch. Use the quay.io/go-skynet/local-ai:v2.23.0-ffmpeg image, which contains all required dependencies. For more details, read Run with container images.
If you want to see the output, remove the -d flag and send a client request from a separate notebook.
!docker run -d --rm --name="localai" -p 8080:8080 quay.io/go-skynet/local-ai:master-sycl-f16-ffmpeg https://gist.githubusercontent.com/aleksandr-mokrov/f007c8fa6036760a856ddc60f605a0b0/raw/9d24ceeb487f9c058a943113bd0290e8ae565b3e/ov.yaml
67e1a2a8123aa15794c027278aed2c258a04e06883663459bbeaca22ff014740
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: invalid expression: unknown.
Check whether the localai
container is running normally:
!docker ps | grep localai
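On the first run the server may need a while to pull the backend and download the model, so it can take some time before requests succeed. A small sketch, assuming the default port 8080 from the command above, that polls the OpenAI-compatible /v1/models endpoint until the server answers:

import time

import requests

# Wait until the LocalAI server responds (give up after ~5 minutes).
for _ in range(60):
    try:
        response = requests.get("http://localhost:8080/v1/models", timeout=5)
        print(response.json())
        break
    except requests.exceptions.RequestException:
        time.sleep(5)
else:
    print("LocalAI server did not become ready in time")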
Send a client request#
Now you can send HTTP requests using the model name TinyLlama-1.1B-Chat-v1.0-fp16-ov. For more details, see how to use the OpenAI API.
!curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "TinyLlama-1.1B-Chat-v1.0-fp16-ov", "prompt": "What is OpenVINO?"}'
curl: (7) Failed to connect to localhost port 8080: Connection refused
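The same request can also be sent from Python. A minimal sketch using the openai client pointed at the local server; the api_key value is only a placeholder, since the client requires one even if the server does not check it:

from openai import OpenAI

# Point the standard OpenAI client at the local LocalAI endpoint.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

completion = client.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0-fp16-ov",
    prompt="What is OpenVINO?",
)
print(completion.choices[0].text)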
Stop the server#
!docker stop localai
Error response from daemon: No such container: localai