Demonstrating integration of Open WebUI with OpenVINO Model Server#
Description#
Open WebUI is a popular application that provides a user interface for generative models. It supports use cases such as text generation, RAG, and image generation, among many others. It can also integrate with remote serving endpoints compatible with standard APIs, such as the OpenAI API for chat completions and image generation.
The goal of this demo is to integrate Open WebUI with OpenVINO Model Server. It includes instructions for deploying the model server with a set of models and for configuring Open WebUI to delegate generation to the serving endpoints.
Setup#
Prerequisites#
In this demo, OpenVINO Model Server is deployed on Linux with CPU using Docker, and Open WebUI is installed via Python pip. Requirements to follow this demo:
Docker Engine installed
Host with x86_64 architecture
Linux, macOS, or Windows via WSL
Python 3.11 with pip
HuggingFace account to download models
There are other options to fulfill the prerequisites, such as deploying OpenVINO Model Server on bare metal Linux or Windows, or installing Open WebUI with Docker. The steps in this demo can be reused across these options, and the references for each step cover both deployment methods.
This demo was tested on CPU, but most of the models can also run on Intel accelerators such as GPU and NPU.
Step 1: Preparation#
Download the export script, install its dependencies, and create a directory for the models:
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/export_models/export_model.py -o export_model.py
pip install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/export_models/requirements.txt
mkdir models
Step 2: Export Model#
The text generation model used in this demo is meta-llama/Llama-3.2-1B-Instruct. This is a gated model: if it has not been downloaded before, access must be requested on its HuggingFace page. Run the export script to download and quantize the model:
python export_model.py text_generation --source_model meta-llama/Llama-3.2-1B-Instruct --weight-format int8 --kv_cache_precision u8 --config_file_path models/config.json
Step 3: Server Deployment#
Deploy with Docker:
docker run -d -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server --rest_port 8000 --config_path /workspace/config.json
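Optionally, verify that the models from the configuration loaded correctly through the server's config status endpoint. A minimal sketch in Python, assuming the requests package is installed:
import requests

# Each model in config.json should report state AVAILABLE once loading finishes.
status = requests.get("http://localhost:8000/v1/config")
print(status.json())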
Here is the basic call to check if it works:
curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"meta-llama/Llama-3.2-1B-Instruct\",\"messages\":[{\"role\":\"system\",\"content\":\"You are a helpful assistant.\"},{\"role\":\"user\",\"content\":\"Say this is a test\"}]}"
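The same check can also be scripted with the openai Python client (pip install openai). This is a minimal sketch; the API key is a placeholder, since the model server does not require one:
from openai import OpenAI

# Point the client at the model server's OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say this is a test"},
    ],
)
print(response.choices[0].message.content)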
Step 4: Start Open WebUI#
Install Open WebUI:
pip install open-webui
Run Open WebUI:
open-webui serve
Go to http://localhost:8080 and create an admin account to get started.
Reference#
https://docs.openvino.ai/2025/model-server/ovms_demos_continuous_batching.html
Chat#
Step 1: Connections Setting#
Go to Admin Panel → Settings → Connections (http://localhost:8080/admin/settings/connections)
Click +Add Connection under OpenAI API
URL:
http://localhost:8000/v3
Model IDs: put
meta-llama/Llama-3.2-1B-Instruct
and click + to add the model, or leave it empty to include all served models (see the sketch after this list)
Click Save
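When Model IDs is left empty, Open WebUI fetches the list of available models from the connection's models endpoint. To preview what it will discover, you can query that endpoint yourself; a minimal sketch, assuming the server exposes the OpenAI-compatible model listing:
import requests

# Lists the models Open WebUI will see for this connection.
resp = requests.get("http://localhost:8000/v3/models")
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])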
Step 2: Start Chatting#
Click New Chat and select the model to start chatting.
Reference#
https://docs.openwebui.com/getting-started/quick-start/starting-with-openai-compatible
RAG#
Step 1: Model Preparation#
In addition to text generation, OpenVINO Model Server can also serve the embedding and reranking endpoints used in Retrieval Augmented Generation. In this demo, the embedding model is sentence-transformers/all-MiniLM-L6-v2 and the reranking model is BAAI/bge-reranker-base. Run the export script to download and quantize the models:
python export_model.py embeddings_ov --source_model sentence-transformers/all-MiniLM-L6-v2 --weight-format int8 --config_file_path models/config.json
python export_model.py rerank_ov --source_model BAAI/bge-reranker-base --weight-format int8 --config_file_path models/config.json
Keep the model server running (it picks up the updated configuration automatically) or restart it. Here are the basic calls to check if they work:
curl http://localhost:8000/v3/embeddings -H "Content-Type: application/json" -d "{\"model\":\"sentence-transformers/all-MiniLM-L6-v2\",\"input\":\"hello world\"}"
curl http://localhost:8000/v3/rerank -H "Content-Type: application/json" -d "{\"model\":\"BAAI/bge-reranker-base\",\"query\":\"welcome\",\"documents\":[\"good morning\",\"farewell\"]}"
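The same checks can be scripted in Python. Embeddings go through the openai client; rerank is not part of that client, so the sketch below calls the endpoint directly with requests (both packages assumed installed):
from openai import OpenAI
import requests

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

# Embeddings through the OpenAI-compatible endpoint.
emb = client.embeddings.create(
    model="sentence-transformers/all-MiniLM-L6-v2",
    input="hello world",
)
print(len(emb.data[0].embedding))

# Rerank is not covered by the openai client, so call the endpoint directly.
rr = requests.post(
    "http://localhost:8000/v3/rerank",
    json={"model": "BAAI/bge-reranker-base",
          "query": "welcome",
          "documents": ["good morning", "farewell"]},
)
print(rr.json())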
Step 2: Documents Setting#
Go to Admin Panel → Settings → Documents (http://localhost:8080/admin/settings/documents)
Select OpenAI for Embedding Model Engine
URL:
http://localhost:8000/v3
Embedding Model:
sentence-transformers/all-MiniLM-L6-v2
Put anything in the API key field (the model server does not verify it)
Enable Hybrid Search
Select External for Reranking Engine
URL:
http://localhost:8000/v3/rerank
Reranking Model:
BAAI/bge-reranker-base
Click Save
Step 3: Knowledge Base#
Prepare the Documentation
The documentation used in this demo is open-webui/docs. Download and extract it to get a local folder.
Go to Workspace → Knowledge → +Create a Knowledge Base (http://localhost:8080/workspace/knowledge/create)
Name and describe the knowledge base
Click Create Knowledge
Click +Add Content → Upload directory, then select the extracted folder. This will upload all files with suitable extensions.
Step 4: Chat with RAG#
Click New Chat. Enter the # symbol and select documents for retrieval from the list that appears above the chat box. Document icons will appear above Send a message.
Enter a query and send it.
Step 5: RAG-enabled Model#
Go to Workspace → Models → +Add New Model (http://localhost:8080/workspace/models/create)
Configure the Model:
Name the model
Select a base model from the list
Click Select Knowledge and select a knowledge base for retrieval
Click Save & Create
Click the created model and start chatting
Reference#
https://docs.openvino.ai/nightly/model-server/ovms_demos_continuous_batching_rag.html
Image Generation#
Step 1: Model Preparation#
The image generation model used in this demo is dreamlike-art/dreamlike-anime-1.0. Run the export script to download and quantize the model:
python export_model.py image_generation --source_model dreamlike-art/dreamlike-anime-1.0 --weight-format int8 --config_file_path models/config.json
Keep the model server running or restart it. Here is the basic call to check if it works:
curl http://localhost:8000/v3/images/generations -H "Content-Type: application/json" -d "{\"model\":\"dreamlike-art/dreamlike-anime-1.0\",\"prompt\":\"anime\",\"num_inference_steps\":1,\"size\":\"256x256\",\"response_format\":\"b64_json\"}"
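A Python version of the same check, using the openai client; num_inference_steps is a model-server-specific parameter, so the sketch passes it through extra_body and writes the decoded image to output.png:
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

# Request a small, fast image; the response is base64-encoded.
result = client.images.generate(
    model="dreamlike-art/dreamlike-anime-1.0",
    prompt="anime",
    size="256x256",
    response_format="b64_json",
    extra_body={"num_inference_steps": 1},  # model-server-specific option
)
with open("output.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))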
Step 2: Image Generation Setting#
Go to Admin Panel → Settings → Images (http://localhost:8080/admin/settings/images)
Configure OpenAI API:
URL:
http://localhost:8000/v3
Put anything in the API key field
Enable Image Generation (Experimental)
Set Default Model:
dreamlike-art/dreamlike-anime-1.0
Set Image Size. Must be in WxH format, example:
256x256
Click Save
Step 3: Generate Image#
Method 1:
Toggle the Image switch to on
Enter a query and send it
Method 2:
Send a query, with or without the Image switch on
After the response has finished generating, it can optionally be edited; the response text is used as the image prompt
Click the Picture icon to generate an image
Reference#
https://docs.openvino.ai/nightly/model-server/ovms_demos_image_generation.html
VLM#
Step 1: Model Preparation#
The vision language model used in this demo is OpenGVLab/InternVL2-2B. Run the export script to download and quantize the model:
python export_model.py text_generation --source_model OpenGVLab/InternVL2-2B --weight-format int4 --pipeline_type VLM --model_name OpenGVLab/InternVL2-2B --config_file_path models/config.json
Keep the model server running or restart it. Here is the basic call to check if it works:
curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{ \"model\": \"OpenGVLab/InternVL2-2B\", \"messages\":[{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"Describe what is on the picture.\"},{\"type\": \"image_url\", \"image_url\": {\"url\": \"https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/2/demos/common/static/images/zebra.jpeg\"}}]}], \"max_completion_tokens\": 100}"
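The same request in Python with the openai client, as a minimal sketch:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

image_url = ("https://raw.githubusercontent.com/openvinotoolkit/model_server/"
             "refs/heads/releases/2025/2/demos/common/static/images/zebra.jpeg")

# A multimodal message combines a text part and an image_url part.
response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-2B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is on the picture."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
    max_completion_tokens=100,
)
print(response.choices[0].message.content)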
Step 2: Chat with VLM#
Start a New Chat with the model set to OpenGVLab/InternVL2-2B.
Click +more to upload images, either by capturing the screen or by uploading files. The image used in this demo is https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/2/demos/common/static/images/zebra.jpeg.
Enter a query and send it.
Reference#
https://docs.openvino.ai/nightly/model-server/ovms_demos_continuous_batching_vlm.html
AI Agent with Tools#
Step 1: Start Tool Server#
Start an OpenAPI tool server from the openapi-servers repo. The server used in this demo is the time server from open-webui/openapi-servers. Run it locally at http://localhost:18000:
git clone https://github.com/open-webui/openapi-servers
cd openapi-servers/servers/time
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 18000 --reload
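Since the tool server is built on FastAPI, it publishes its OpenAPI schema at /openapi.json, which is what Open WebUI reads to discover the available tools. A minimal sketch to confirm the server is up and inspect the spec, assuming the requests package:
import requests

# Fetch the OpenAPI spec Open WebUI will use for tool discovery.
spec = requests.get("http://localhost:18000/openapi.json").json()
print(spec["info"]["title"])
for path in spec["paths"]:
    print(path)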
Step 2: Tools Setting#
Go to Admin Panel → Settings → Tools (http://localhost:8080/admin/settings/tools)
Click +Add Connection
URL:
http://localhost:18000
Name the tool
Click Save
Step 3: Chat with AI Agent#
Click +more and toggle on the tool
Enter a query and send it