CLIP image classification

This demo classifies images using the multimodal CLIP model for inference and Python code for pre- and postprocessing. The client sends a request containing an image and a list of input labels to the graph and receives the label with the highest probability. The preprocessing Python node runs first and prepares the input vectors from the user's request. Those inputs are then passed to the CLIP model, whose inference produces a similarity matrix. Finally, the postprocessing Python node extracts the highest-scoring label among the input labels and sends it back to the user.
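The postprocessing step described above can be sketched in plain Python. This is a minimal illustration, not the demo's actual script: the function names `softmax` and `pick_label`, and the sample scores, are hypothetical, and the sketch assumes the model returns one similarity score per input label.

```python
import math

def softmax(scores):
    """Convert raw similarity scores to probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pick_label(similarities, labels):
    """Return the label with the highest probability, plus that probability."""
    if len(similarities) != len(labels):
        raise ValueError("expected one similarity score per label")
    probs = softmax(similarities)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical image-to-text similarity scores, one per input label.
labels = ["cat", "dog", "wolf"]
scores = [18.2, 24.9, 21.3]
label, prob = pick_label(scores, labels)
print(f"{label} ({prob:.2f})")
```

Softmax does not change which label wins, so a plain argmax over the raw scores would select the same label; the probabilities are only useful if a confidence value is reported.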

Demo is based on this CLIP notebook

The picture below shows the execution flow in the graph.

Mediapipe graph image

Build image

git clone
cd model_server
make python_image RUN_TESTS=0

Install client requirements

cd demos/python_demos/clip_image_classification/
virtualenv .venv
. .venv/bin/activate
pip3 install -r requirements.txt

Download and convert model

pip3 install -r download_model_requirements.txt

Deploy OpenVINO Model Server with the CLIP graph


Prerequisites:

  • an image of OVMS with Python support and Optimum installed

Mount the ./servable directory, which contains:

  • the preprocessing and postprocessing Python scripts required to run the CLIP model

  • config.json - defines which servables should be loaded

  • graph.pbtxt - defines the MediaPipe graph containing the Python calculators
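Before mounting the directory, it can help to confirm that the files named in the list above are present. The following is a small sketch under stated assumptions: the helper `missing_servable_files` is hypothetical, and it checks only `config.json` and `graph.pbtxt`, since those are the two file names given here.

```python
from pathlib import Path

# File names taken from the list above; the Python script names are not checked.
REQUIRED_FILES = ("config.json", "graph.pbtxt")

def missing_servable_files(servable_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from servable_dir."""
    root = Path(servable_dir)
    return [name for name in required if not (root / name).is_file()]

missing = missing_servable_files("./servable")
if missing:
    print("Missing from ./servable:", ", ".join(missing))
else:
    print("./servable contains config.json and graph.pbtxt")
```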

docker run -d --rm -p 9000:9000 -v ${PWD}/servable:/workspace -v ${PWD}/model:/model/ openvino/model_server:py --config_path /workspace/config.json --port 9000
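Because the container starts in detached mode, the port may not accept connections immediately. A minimal sketch of a wait loop using only the standard library follows; `wait_for_port` is a hypothetical helper, not part of the demo.

```python
import socket
import time

def wait_for_port(host, port, timeout_s=30.0, interval_s=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

Calling `wait_for_port("localhost", 9000)` before starting the client avoids an early connection failure. An open port only shows that the server is listening, not that the graph has finished loading; the client's "Server Ready" line remains the authoritative check.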

Requesting the classification label

Run the client script

python3 --url localhost:9000

Expected output:

Server Ready: True
Using image_url:

Using input_labels:
['cat', 'dog', 'wolf', 'tiger', 'man', 'horse', 'frog', 'tree', 'house', 'computer']

Iteration 0

processing time for all iterations
average time: 90.00 ms; average speed: 11.11 fps
median time: 90.00 ms; median speed: 11.11 fps
max time: 90.00 ms; min speed: 11.11 fps
min time: 90.00 ms; max speed: 11.11 fps
time percentile 90: 90.00 ms; speed percentile 90: 11.11 fps
time percentile 50: 90.00 ms; speed percentile 50: 11.11 fps
time standard deviation: 0.00
time variance: 0.00
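The timing summary above can be reproduced from a list of per-iteration latencies. The sketch below is an assumption about how such figures are typically derived, not the client's actual code: speed is taken as 1000 divided by the time in milliseconds, percentiles use linear interpolation, and the population standard deviation and variance are used so that a single sample yields 0.00. The names `summarize`, `percentile`, and `fps` are hypothetical.

```python
import statistics

def fps(ms):
    """Frames per second corresponding to a per-request time in milliseconds."""
    return 1000.0 / ms

def percentile(sorted_values, pct):
    """Linearly interpolated percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = (len(sorted_values) - 1) * pct / 100.0
    low = int(rank)
    high = min(low + 1, len(sorted_values) - 1)
    span = sorted_values[high] - sorted_values[low]
    return sorted_values[low] + span * (rank - low)

def summarize(latencies_ms):
    """Compute the statistics shown in the timing summary."""
    ordered = sorted(latencies_ms)
    return {
        "average_ms": statistics.fmean(ordered),
        "median_ms": statistics.median(ordered),
        "max_ms": ordered[-1],
        "min_ms": ordered[0],
        "p90_ms": percentile(ordered, 90),
        "p50_ms": percentile(ordered, 50),
        "stdev_ms": statistics.pstdev(ordered),
        "variance_ms2": statistics.pvariance(ordered),
    }

stats = summarize([90.0])  # one iteration, as in the run above
print(f"average time: {stats['average_ms']:.2f} ms; "
      f"average speed: {fps(stats['average_ms']):.2f} fps")
```

With a single 90 ms sample every time statistic equals 90.00 ms and every speed equals 11.11 fps, which is why all lines of the summary above repeat the same values.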