How to serve Rerank models via Cohere API#
Prerequisites#
Model preparation: Python 3.9 or higher with pip
Model Server deployment: installed Docker Engine or OVMS binary package, according to the baremetal deployment guide
(Optional) Client: Python with pip
Model preparation#
Here, the original PyTorch model and the tokenizer will be converted to IR format and optionally quantized. That ensures faster initialization time, better performance, and lower memory consumption.
Download the export script, install its dependencies, and create a directory for the models:
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt
mkdir models
Run the export_model.py script to download and quantize the model:
CPU
python export_model.py rerank --source_model BAAI/bge-reranker-large --weight-format int8 --config_file_path models/config.json --model_repository_path models
GPU
python export_model.py rerank --source_model BAAI/bge-reranker-large --weight-format int8 --target_device GPU --config_file_path models/config.json --model_repository_path models
You should have a model folder like below:
tree models
models
├── BAAI
│   └── bge-reranker-large
│       ├── graph.pbtxt
│       ├── rerank
│       │   └── 1
│       │       ├── model.bin
│       │       └── model.xml
│       ├── subconfig.json
│       └── tokenizer
│           └── 1
│               ├── model.bin
│               └── model.xml
└── config.json
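Optionally, you can sanity-check that the exported IR files load correctly with the OpenVINO runtime. A minimal sketch, assuming the openvino package is installed and the paths match the tree above:

# Sanity check: load the exported rerank IR with the OpenVINO runtime
import openvino as ov

core = ov.Core()
model = core.read_model("models/BAAI/bge-reranker-large/rerank/1/model.xml")
print("inputs:", [i.get_any_name() for i in model.inputs])
print("outputs:", [o.get_any_name() for o in model.outputs])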
Note: The models support version management and can be automatically swapped to a newer version when new model files are uploaded into a higher version folder.
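For example, publishing an updated model could look like this (a hypothetical sketch; the source path is an assumption, and by default the server serves the latest version folder):

# Hypothetical sketch: stage version "2" next to version "1";
# OVMS detects the new folder and switches to the latest version
from pathlib import Path
import shutil

src = Path("new_export/rerank/1")  # assumed location of freshly exported IR files
dst = Path("models/BAAI/bge-reranker-large/rerank/2")
dst.mkdir(parents=True, exist_ok=True)
for name in ("model.xml", "model.bin"):
    shutil.copy2(src / name, dst / name)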
Server Deployment#
Deploying with Docker
CPU
docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --port 9000 --rest_port 8000 --config_path /workspace/config.json
GPU
In case you want to use a GPU device to run the rerank model, add the extra docker parameters --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) to the docker run command, use the image with GPU support, and make sure target_device in subconfig.json is set to GPU. Also make sure the exported model quantization level and cache size fit in the GPU memory. All of that can be applied with the command:
docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json
Deploying On Bare Metal
Assuming you have unpacked the model server package, make sure to:
On Windows: run the setupvars script
On Linux: set the LD_LIBRARY_PATH and PATH environment variables
as mentioned in the deployment guide, in every new shell that will start OpenVINO Model Server.
Depending on how you prepared the models in the first step of this demo, they are deployed to either CPU or GPU (it is defined in config.json). If you run on GPU, make sure the appropriate drivers are installed, so the device is accessible for the model server.
ovms --rest_port 8000 --config_path ./models/config.json
Readiness Check#
Readiness of the model can be checked with a simple curl command.
curl -i http://localhost:8000/v2/models/BAAI%2Fbge-reranker-large/ready
HTTP/1.1 200 OK
Content-Type: application/json
Date: Sat, 09 Nov 2024 23:19:27 GMT
Content-Length: 0
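The same check can be made from Python, for example with the requests package (a minimal sketch; note the URL-encoded / in the model name):

# Readiness probe; assumes the server from the previous step on localhost:8000
import requests

resp = requests.get("http://localhost:8000/v2/models/BAAI%2Fbge-reranker-large/ready")
print(resp.status_code)  # 200 means the model is loaded and ready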
Client code#
Requesting rerank score with cURL
curl http://localhost:8000/v3/rerank -H "Content-Type: application/json" \
-d '{ "model": "BAAI/bge-reranker-large", "query": "welcome", "documents":["good morning","farewell"]}' | jq .
{
"results": [
{
"index": 0,
"relevance_score": 0.3886180520057678
},
{
"index": 1,
"relevance_score": 0.0055549247190356255
}
]
}
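The same call can also be issued without any SDK, using plain HTTP from Python (a sketch mirroring the cURL request above):

# Raw HTTP equivalent of the cURL call above
import requests

payload = {
    "model": "BAAI/bge-reranker-large",
    "query": "welcome",
    "documents": ["good morning", "farewell"],
}
resp = requests.post("http://localhost:8000/v3/rerank", json=payload)
for result in resp.json()["results"]:
    print(result["index"], result["relevance_score"])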
Requesting rerank score with Cohere Python package
pip3 install cohere
echo '
import cohere
client = cohere.Client(base_url="http://localhost:8000/v3", api_key="not_used")
responses = client.rerank(query="hello",documents=["welcome","farewell"], model="BAAI/bge-reranker-large")
for response in responses.results:
print(f"index {response.index}, relevance_score {response.relevance_score}")' > rerank_client.py
python3 rerank_client.py
It will return a response similar to:
index 0, relevance_score 0.9968273043632507
index 1, relevance_score 0.09138210117816925
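In a retrieval pipeline, the returned indices are typically used to reorder the original documents by relevance. A short sketch building on the client script above (the API usually returns results already sorted; the explicit sort just makes the intent clear):

# Reorder the input documents by relevance score
documents = ["welcome", "farewell"]
ranked = sorted(responses.results, key=lambda r: r.relevance_score, reverse=True)
print([documents[r.index] for r in ranked])  # most relevant first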
Comparison with Hugging Face#
cd demos/rerank/
python compare_results.py --query "hello" --document "welcome" --document "farewell" --base_url http://localhost:8000/v3/
query hello
documents ['welcome', 'farewell']
HF Duration: 145.731 ms
OVMS Duration: 23.227 ms
HF reranking: [0.99640983 0.08154089]
OVMS reranking: [0.9968273 0.0913821]
Performance benchmarking#
An asynchronous benchmarking client can be used to assess the model server performance under various load conditions. Below are execution examples captured on a dual Intel(R) Xeon(R) CPU Max 9480 system.
cd demos/benchmark/embeddings/
pip install -r requirements.txt
python benchmark_embeddings.py --api_url http://localhost:8000/v3/rerank --backend ovms_rerank --dataset synthetic --synthetic_length 500 --request_rate inf --batch_size 20 --model BAAI/bge-reranker-large
Number of documents: 1000
100%|██████████████████████████████████████| 50/50 [00:19<00:00, 2.53it/s]
Tokens: 501000
Success rate: 100.0%. (50/50)
Throughput - Tokens per second: 25325.17484336458
Mean latency: 10268 ms
Median latency: 10249 ms
Average document length: 501.0 tokens
python benchmark_embeddings.py --api_url http://localhost:8000/v3/rerank --backend ovms_rerank --dataset Cohere/wikipedia-22-12-simple-embeddings --request_rate inf --batch_size 20 --model BAAI/bge-reranker-large
Number of documents: 1000
100%|██████████████████████████████████████| 50/50 [00:09<00:00, 5.55it/s]
Tokens: 92248
Success rate: 100.0%. (50/50)
Throughput - Tokens per second: 10236.429922338193
Mean latency: 4511 ms
Median latency: 4309 ms
Average document length: 92.248 tokens
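Conceptually, such an asynchronous client issues many rerank requests concurrently and aggregates latencies. A minimal illustrative sketch (not the actual benchmark script; assumes pip install aiohttp and the server on localhost:8000):

# Illustrative concurrent load generator, not the real benchmark client
import asyncio
import time
import aiohttp

PAYLOAD = {"model": "BAAI/bge-reranker-large", "query": "hello",
           "documents": ["welcome"] * 20}

async def one_request(session):
    start = time.perf_counter()
    async with session.post("http://localhost:8000/v3/rerank", json=PAYLOAD) as resp:
        await resp.json()
    return time.perf_counter() - start

async def main(n_requests=50):
    async with aiohttp.ClientSession() as session:
        latencies = await asyncio.gather(*(one_request(session) for _ in range(n_requests)))
    print(f"mean latency: {sum(latencies) / len(latencies) * 1000:.0f} ms")

asyncio.run(main())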
Tested models#
BAAI/bge-reranker-large
BAAI/bge-reranker-v2-m3
BAAI/bge-reranker-base
Integration with LangChain#
Check the RAG demo, which employs the rerank endpoint together with chat/completions and embeddings.
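As a starting point, the Cohere client can be pointed at the model server and plugged into a LangChain reranking compressor. A hedged sketch: the CohereRerank fields below come from the langchain-cohere package and may differ between versions, so treat the RAG demo as the tested reference:

# Hedged sketch: OVMS rerank endpoint behind LangChain's CohereRerank compressor
import cohere
from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever

co = cohere.Client(base_url="http://localhost:8000/v3", api_key="not_used")
compressor = CohereRerank(client=co, model="BAAI/bge-reranker-large", top_n=2)
# retriever = ...  # any LangChain retriever, e.g. over a vector store
# rag_retriever = ContextualCompressionRetriever(
#     base_compressor=compressor, base_retriever=retriever)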