How to serve audio models via OpenAI API#

This demo shows how to deploy audio models in the OpenVINO Model Server. Speech generation and speech recognition models are exposed via the OpenAI API audio/speech and audio/transcriptions endpoints.

Check supported Speech Recognition Models and Speech Generation Models.

Prerequisites#

OVMS version 2025.4: this demo requires version 2025.4 or a nightly release.

Model preparation: Python 3.10 or higher with pip

Model Server deployment: Installed Docker Engine or OVMS binary package according to the baremetal deployment guide

Client: curl or Python with the OpenAI client package
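
If you plan to use the Python client examples below, install the OpenAI client package first (installed with pip3, matching the commands used elsewhere in this demo):

pip3 install openai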

Speech generation#

Model preparation#

Supported models should use the topology of microsoft/speecht5_tts, which needs to be converted to IR format before use in OVMS.

A specific OVMS pull mode example for models requiring conversion is described in the OVMS pull mode documentation.

Alternatively, you can use the Python export_model.py script described below.

Here, the original Text-to-Speech model will be converted to IR format and optionally quantized. That ensures faster initialization time, better performance, and lower memory consumption. Execution parameters will be defined inside the graph.pbtxt file.

Download the export script, install its dependencies, and create a directory for the models:

curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt
mkdir models

Run the export_model.py script to download and quantize the model:

Note: Users in China need to set the environment variable HF_ENDPOINT="https://hf-mirror.com" before running the export script to connect to the HF Hub.
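
For example, on Linux the mirror can be set for the current shell session like this:

export HF_ENDPOINT=https://hf-mirror.com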

CPU

python export_model.py text2speech --source_model microsoft/speecht5_tts --weight-format fp32 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --overwrite_models --vocoder microsoft/speecht5_hifigan

Note: Change the --weight-format to quantize the model to fp16 or int8 precision to reduce memory consumption and improve performance.
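
For example, an int8 export changes only the weight format flag; all other parameters stay the same as in the fp32 command above:

python export_model.py text2speech --source_model microsoft/speecht5_tts --weight-format int8 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --overwrite_models --vocoder microsoft/speecht5_hifigan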

Deployment#

CPU

Running this command starts the container with CPU as the only target device:

mkdir -p models
docker run -d -u $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models:rw openvino/model_server:latest --rest_port 8000 --model_path /models/microsoft/speecht5_tts --model_name microsoft/speecht5_tts

Deploying on Bare Metal

mkdir models
ovms --rest_port 8000 --source_model microsoft/speecht5_tts --model_repository_path models --model_name microsoft/speecht5_tts --task text2speech --target_device CPU
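
Regardless of the deployment method, you can confirm that the model is loaded and ready by querying the server's config endpoint (a quick sanity check; the exact JSON layout may vary between versions):

curl http://localhost:8000/v1/config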

Request Generation#

Unary call with curl
curl http://localhost:8000/v3/audio/speech -H "Content-Type: application/json" -d "{\"model\": \"microsoft/speecht5_tts\", \"input\": \"The quick brown fox jumped over the lazy dog.\"}" -o speech.wav
Unary call with the OpenAI Python library
from pathlib import Path
from openai import OpenAI

prompt = "The quick brown fox jumped over the lazy dog"
filename = "speech.wav"
url = "http://localhost:8000/v3"

# Save the generated audio next to this script
speech_file_path = Path(__file__).parent / filename
client = OpenAI(base_url=url, api_key="not_used")

# Stream the synthesized audio directly into the file
with client.audio.speech.with_streaming_response.create(
  model="microsoft/speecht5_tts",
  voice="unused",  # required by the API schema; the value is not used here
  input=prompt
) as response:
  response.stream_to_file(speech_file_path)

print("Generation finished")

Play the speech.wav file to check the generated speech.
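
If you cannot play audio on the machine, a minimal sketch like the one below, using only Python's standard wave module and assuming speech.wav is in the current directory, can at least confirm that a non-empty WAV file was produced:

import wave

# Open the generated file and print its basic properties
with wave.open("speech.wav", "rb") as wav:
    frames = wav.getnframes()
    rate = wav.getframerate()
    print(f"Sample rate: {rate} Hz, duration: {frames / rate:.2f} s")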

Transcription#

Model preparation#

Whisper models can be deployed in a single command by using pre-configured models from the OpenVINO HuggingFace organization. Here is an example of OpenVINO/whisper-base-fp16-ov deployment:

Deploying with Docker

Select a deployment option depending on how you prepared the models in the previous step.

CPU

Running this command starts the container with CPU as the only target device:

mkdir -p models
docker run -d -u $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models:rw openvino/model_server:latest --rest_port 8000 --source_model OpenVINO/whisper-base-fp16-ov --model_repository_path /models --model_name OpenVINO/whisper-base-fp16-ov --task speech2text

GPU

If you want to use a GPU device to run the generation, add the extra Docker parameters --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) to the docker run command and use the image with GPU support. It can be applied using the commands below:

mkdir -p models
docker run -d -u $(id -u):$(id -g) --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/models:rw openvino/model_server:latest-gpu --rest_port 8000 --source_model OpenVINO/whisper-base-fp16-ov --model_repository_path /models --model_name OpenVINO/whisper-base-fp16-ov --task speech2text --target_device GPU
Deploying on Bare Metal

If you run on a GPU, make sure the appropriate drivers are installed so the device is accessible to the model server.

mkdir models
ovms --rest_port 8000 --source_model OpenVINO/whisper-base-fp16-ov --model_repository_path models --model_name OpenVINO/whisper-base-fp16-ov --task speech2text --target_device CPU

or

ovms --rest_port 8000 --source_model OpenVINO/whisper-base-fp16-ov --model_repository_path models --model_name OpenVINO/whisper-base-fp16-ov --task speech2text --target_device GPU

Request Generation#

Transcribe the file that was previously generated with the audio/speech endpoint.

Unary call with curl
curl http://localhost:8000/v3/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@speech.wav" -F model="OpenVINO/whisper-base-fp16-ov"
{"text": " The quick brown fox jumped over the lazy dog."}
Unary call with the OpenAI Python library
from pathlib import Path
from openai import OpenAI

filename = "speech.wav"
url = "http://localhost:8000/v3"

# Path to the audio file generated in the speech generation step
speech_file_path = Path(__file__).parent / filename
client = OpenAI(base_url=url, api_key="not_used")

# Send the audio file to the transcriptions endpoint
with open(speech_file_path, "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
      model="OpenVINO/whisper-base-fp16-ov",
      file=audio_file
    )

print(transcript.text)
The quick brown fox jumped over the lazy dog.
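
As a final end-to-end check, the two endpoints can be combined into a round trip: generate speech from a prompt, then transcribe it back. This is a minimal sketch assuming both models are served by the same OVMS instance on the same port, which requires a configuration file listing both models; in this demo they were started separately:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="not_used")
prompt = "The quick brown fox jumped over the lazy dog."

# Generate speech from the prompt and save it to a file
with client.audio.speech.with_streaming_response.create(
    model="microsoft/speecht5_tts", voice="unused", input=prompt
) as response:
    response.stream_to_file("roundtrip.wav")

# Transcribe the generated audio back to text
with open("roundtrip.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="OpenVINO/whisper-base-fp16-ov", file=audio_file
    )

print("Prompt:    ", prompt)
print("Transcript:", transcript.text.strip())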