OpenAI API responses endpoint#

Note: This endpoint works only with LLM graphs.

API Reference#

OpenVINO Model Server now includes the responses endpoint, which follows the OpenAI API. See the OpenAI API Reference for more information on the API. The endpoint is exposed at the path:

http://server_name:port/v3/responses

Example request#

curl http://localhost/v3/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "input": "What is OpenVINO?"
  }'

Example response#

{
  "id": "resp-1716825108",
  "object": "response",
  "created_at": 1716825108,
  "completed_at": 1716825110,
  "error": null,
  "model": "llama3",
  "status": "completed",
  "parallel_tool_calls": true,
  "store": true,
  "text": { "format": { "type": "text" } },
  "tool_choice": "auto",
  "tools": [],
  "truncation": "disabled",
  "metadata": {},
  "output": [
    {
      "id": "msg-0",
      "type": "message",
      "role": "assistant",
      "status": "completed",
      "content": [
        {
          "type": "output_text",
          "text": "OpenVINO is an open-source toolkit ...",
          "annotations": []
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 5,
    "output_tokens": 42,
    "total_tokens": 47
  }
}
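
A minimal sketch of reading the generated text out of a response shaped like the example above. The helper name `extract_text` is hypothetical, and the JSON is a trimmed copy of the example that keeps only the fields the code reads.

```python
import json

# Trimmed copy of the example response; only the fields read below are kept.
raw = """
{
  "status": "completed",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [
        {"type": "output_text", "text": "OpenVINO is an open-source toolkit ...", "annotations": []}
      ]
    }
  ],
  "usage": {"input_tokens": 5, "output_tokens": 42, "total_tokens": 47}
}
"""


def extract_text(response: dict) -> str:
    """Concatenate every output_text part found in message items (hypothetical helper)."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # function_call and reasoning items carry no output_text parts
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part.get("text", ""))
    return "".join(parts)


response = json.loads(raw)
text = extract_text(response)
usage = response["usage"]
print(text)
print("tokens:", usage["input_tokens"], "+", usage["output_tokens"], "=", usage["total_tokens"])
```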

For VLMs, the request can include images:

curl http://localhost/v3/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava",
    "input": [
        {
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": "What is on the picture?"
                },
                {
                    "type": "input_image",
                    "image_url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBD ..."
                }
            ]
        }
    ],
    "max_output_tokens": 128
}'
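
The `image_url` value in the request above is a base64 data URL. The sketch below shows one way to build the same payload in Python; the helper names `image_to_data_url` and `build_vlm_request` are hypothetical, and the four placeholder bytes stand in for the contents of a real JPEG file.

```python
import base64
import json


def image_to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_vlm_request(model: str, question: str, image_bytes: bytes,
                      max_output_tokens: int = 128) -> dict:
    """Assemble a request body with one input_text and one input_image part."""
    return {
        "model": model,
        "input": [
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": question},
                    {"type": "input_image", "image_url": image_to_data_url(image_bytes)},
                ],
            }
        ],
        "max_output_tokens": max_output_tokens,
    }


# Placeholder bytes (a JPEG header prefix); a real call would read the image file instead.
payload = build_vlm_request("llava", "What is on the picture?", b"\xff\xd8\xff\xe0")
print(json.dumps(payload, indent=2))
```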

Request#

Generic#

Param

OpenVINO Model Server

OpenAI /responses API

Type

Description

model

string (required)

Name of the model to use. From the administrator's point of view, it is the name assigned to a MediaPipe graph configured to schedule generation using the desired model.

input

string or array (required)

The input to generate a response for. Accepts a plain string or an array of message items with input_text / input_image types.

stream

bool (optional, default: false)

If set to true, partial message deltas are sent to the client as server-sent events as they become available, and the stream is terminated by a data: [DONE] message. See the Streaming events section for details.

max_output_tokens

integer (optional)

An upper bound for the number of tokens that can be generated. If not set, generation stops once the EOS token is generated. If max_tokens_limit is set in graph.pbtxt, it is used as the default value.

stop

string/array of strings (optional)

Up to 4 sequences where the API will stop generating further tokens. If stream is set to false, the matched stop string is not included in the output by default. If stream is set to true, the matched stop string is included in the output by default. This can be changed with the include_stop_str_in_output parameter, but setting include_stop_str_in_output=false with stream=true is invalid.

ignore_eos

bool (default: false)

Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.

include_stop_str_in_output

bool (default: false if stream=false, true if stream=true)

Whether to include the matched stop string in the output. Setting it to false when stream=true is an invalid configuration and results in an error.

logprobs

⚠️

bool (default: false)

Whether to include the log probabilities of the returned output tokens. Logprobs are not supported in stream mode.

response_format

object (optional)

An object specifying the format that the model must output. Setting to { "type": "json_schema", "json_schema": {...} } enables Structured Outputs. Additionally accepts XGrammar structural tags format. OpenAI Responses API uses text.format instead (not supported in OVMS).

tools

⚠️

array (optional)

A list of tools the model may call. Currently, only function tools are supported. OpenAI also supports built-in tools (web_search, file_search, code_interpreter, etc.) and MCP tools. OVMS additionally accepts a flat {type, name, parameters} format alongside the nested {type, function: {name, parameters}} format. See OpenAI API reference for more details.

tool_choice

string or object (optional)

Controls which (if any) tool is called by the model. none means the model will not call any tool and instead generates a message. auto means the model can pick between generating a message or calling one or more tools. required means that the model should call at least one tool. Specifying a particular function via {"type": "function", "function": {"name": "my_function"}} forces the model to call that tool.

reasoning

⚠️

object (optional)

Configuration for reasoning/thinking mode. The effort field accepts "low", "medium", or "high" — any value enables thinking mode (enable_thinking: true is injected into chat template kwargs). The summary field is accepted but ignored.

chat_template_kwargs

object (optional)

Additional keyword arguments passed to the chat template. When reasoning is also provided, enable_thinking: true is merged into these kwargs.

skip_special_tokens

bool (default: true)

Whether to remove special tokens (e.g. <|endoftext|>, <|im_end|>) from the generated output. Set to false to include them, which is useful when the model uses special tokens to encode structured information (e.g. bounding boxes, reasoning markers). When false, any tool or reasoning parser configured on the endpoint is silently disabled for the request, so the raw token stream is returned. This option works with most detokenizers exported with OpenVINO Tokenizers 2024.5 or later, unless they are based on custom ops.

stream_options

Not supported in Responses API. Usage statistics are always included in the response.completed event.
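
The interaction between stream, stop and include_stop_str_in_output described in the table can be checked on the client before a request is sent. The sketch below is a hypothetical helper, not part of OVMS; it only mirrors the documented rules (at most 4 stop sequences, default depends on stream, false is rejected when streaming).

```python
def resolve_stop_options(stream=False, stop=None, include_stop_str_in_output=None):
    """Client-side check mirroring the documented rules (hypothetical helper)."""
    if isinstance(stop, str):
        stop = [stop]  # a single string is treated as a one-element list
    if stop is not None and len(stop) > 4:
        raise ValueError("at most 4 stop sequences are accepted")
    if include_stop_str_in_output is None:
        # Documented default: false if stream=false, true if stream=true.
        include_stop_str_in_output = stream
    if stream and not include_stop_str_in_output:
        raise ValueError("include_stop_str_in_output=false is invalid when stream=true")
    options = {"stream": stream, "include_stop_str_in_output": include_stop_str_in_output}
    if stop is not None:
        options["stop"] = stop
    return options


print(resolve_stop_options(stream=False, stop="\n\n"))
print(resolve_stop_options(stream=True, stop=["###", "END"]))
```

The returned dictionary can be merged into the request body; the server remains the authority on what it accepts.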

Beam search sampling specific#

Param

OpenVINO Model Server

OpenAI /responses API

Type

Description

n

integer (default: 1)

Number of output sequences to return for the given prompt. The value must satisfy 1 <= n <= best_of. For Responses API streaming, only n=1 is supported.

best_of

integer (default: 1)

Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n. This is treated as the beam width for beam search sampling.

length_penalty

float (default: 1.0)

Exponential penalty to the length that is used with beam-based generation. It is applied as an exponent to the sequence length, which in turn is used to divide the score of the sequence. Since the score is the log likelihood of the sequence (i.e. negative), length_penalty > 0.0 promotes longer sequences, while length_penalty < 0.0 encourages shorter sequences.
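
The effect of length_penalty can be illustrated with a small calculation. This is a sketch of the scoring rule as described above (log likelihood divided by length raised to length_penalty), not the server's actual implementation; the two candidate sequences and their scores are made up for the example.

```python
def beam_score(sum_log_prob: float, length: int, length_penalty: float = 1.0) -> float:
    """Sequence score: total log likelihood divided by length ** length_penalty."""
    return sum_log_prob / (length ** length_penalty)


# Two made-up candidates: a short one and a longer one with a lower total log likelihood.
short = {"sum_log_prob": -4.0, "length": 4}
longer = {"sum_log_prob": -6.0, "length": 8}

for penalty in (-1.0, 0.0, 1.0):
    s = beam_score(short["sum_log_prob"], short["length"], penalty)
    l = beam_score(longer["sum_log_prob"], longer["length"], penalty)
    winner = "longer" if l > s else "short"
    print(f"length_penalty={penalty}: short={s}, longer={l}, preferred={winner}")
```

With a positive penalty the longer candidate's negative score is divided by a larger number and moves closer to zero, so it ranks higher; with a negative penalty the opposite happens.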

Multinomial sampling specific#

Param

OpenVINO Model Server

OpenAI /responses API

Type

Description

temperature

float (default: 1.0)

The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to > 0.0.

top_p

float (default: 1.0)

Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.

min_p

float (default: 0.0)

Minimum probability threshold relative to the most likely token. Tokens with probability below min_p × the top token probability are filtered out. 0.0 (default) disables the filter. Typical values: 0.05 to 0.1. Must be in [0.0, 1.0).

top_k

int (default: 40)

Controls the number of top tokens to consider. When multinomial sampling is active, defaults to 40 if not set. Set to -1 to consider all tokens.

repetition_penalty

float (default: 1.0)

Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1.0 encourage the model to use new tokens, while values < 1.0 encourage the model to repeat tokens. 1.0 means no penalty.

frequency_penalty

float (default: 0.0)

Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood to repeat the same line verbatim.

presence_penalty

float (default: 0.0)

Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model’s likelihood to talk about new topics.

seed

integer (default: random)

Random seed for generation in range [0, 4294967295]. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: rng_seed set in generation_config.json is not honoured for multinomial sampling — only a per-request seed is applied.
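
To make the filtering parameters concrete, here is a toy illustration on a four-token vocabulary. It is a sketch of the general technique only: the order in which the filters are applied, the function names and the toy logits are assumptions, and the server's actual sampler may differ in detail. It assumes temperature > 0.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=40, top_p=1.0, min_p=0.0):
    """Return {token: probability} after temperature scaling and top_k / top_p / min_p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exp = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exp.values())
    probs = sorted(((tok, e / total) for tok, e in exp.items()),
                   key=lambda pair: pair[1], reverse=True)

    # top_k: keep the k most likely tokens; -1 keeps all of them.
    if top_k != -1:
        probs = probs[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # min_p: drop tokens whose probability is below min_p times the top probability.
    threshold = min_p * kept[0][1]
    kept = [(tok, p) for tok, p in kept if p >= threshold]

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


def sample_token(probs, seed):
    """Draw one token; the same seed always gives the same draw."""
    rng = random.Random(seed)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Toy logits chosen so the unfiltered probabilities are about 0.5, 0.3, 0.15 and 0.05.
logits = {"a": math.log(0.5), "b": math.log(0.3), "c": math.log(0.15), "d": math.log(0.05)}

print(filter_distribution(logits, top_k=-1, top_p=0.7))   # keeps "a" and "b"
print(filter_distribution(logits, top_k=-1, min_p=0.2))   # drops "d" (0.05 < 0.2 * 0.5)
print(sample_token(filter_distribution(logits), seed=42))
```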

Speculative decoding specific#

Note that the parameters below are valid only for a speculative pipeline. See the speculative decoding demo for details on how to prepare and serve such a pipeline.

Param

OpenVINO Model Server

OpenAI /responses API

Type

Description

num_assistant_tokens

int

Defines how many tokens the draft model should generate before the main model validates them. Cannot be used with assistant_confidence_threshold.

assistant_confidence_threshold

float

Determines the confidence level for continuing generation. If the draft model generates a token with confidence below this threshold, it stops generation for the current cycle and the main model starts validation. Cannot be used with num_assistant_tokens.
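
Because the two speculative decoding parameters are mutually exclusive, a client can guard against sending both. The helper below is a hypothetical sketch; it only builds a dictionary of OVMS-specific fields and does not contact the server.

```python
def speculative_extra_body(num_assistant_tokens=None, assistant_confidence_threshold=None):
    """Build the OVMS-specific fields for a speculative pipeline request (hypothetical helper)."""
    if num_assistant_tokens is not None and assistant_confidence_threshold is not None:
        raise ValueError(
            "num_assistant_tokens and assistant_confidence_threshold cannot be used together"
        )
    extra = {}
    if num_assistant_tokens is not None:
        extra["num_assistant_tokens"] = int(num_assistant_tokens)
    if assistant_confidence_threshold is not None:
        extra["assistant_confidence_threshold"] = float(assistant_confidence_threshold)
    return extra


print(speculative_extra_body(num_assistant_tokens=5))
print(speculative_extra_body(assistant_confidence_threshold=0.4))
```

With the OpenAI Python client, the resulting dictionary would be passed through the extra_body argument, since these fields are not part of the OpenAI API.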

Prompt lookup decoding specific#

Note that the parameters below are valid only for a prompt lookup pipeline. To serve one, add "prompt_lookup": true to plugin_config in your graph config node options.

Param

OpenVINO Model Server

OpenAI /responses API

Type

Description

num_assistant_tokens

int

Number of candidate tokens proposed after an ngram match is found.

max_ngram_size

int

The maximum ngram size to use when looking for matches in the prompt.
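
The roles of the two parameters can be illustrated with a toy version of prompt lookup. This is a sketch of the general technique, not the server's implementation: it uses words instead of token IDs for readability, and the function name `propose_candidates` is hypothetical.

```python
def propose_candidates(tokens, max_ngram_size=3, num_assistant_tokens=5):
    """Match the trailing n-gram earlier in the sequence and propose the tokens that followed it."""
    # Try the longest n-gram first, then fall back to shorter ones.
    for size in range(min(max_ngram_size, len(tokens) - 1), 0, -1):
        suffix = tokens[-size:]
        # Scan right to left for the most recent earlier occurrence of the suffix.
        for start in range(len(tokens) - size - 1, -1, -1):
            if tokens[start:start + size] == suffix:
                follow = tokens[start + size:start + size + num_assistant_tokens]
                if follow:
                    return follow
    return []  # no match: nothing to propose in this step


sequence = "the quick brown fox jumps over the quick".split()
print(propose_candidates(sequence, max_ngram_size=3, num_assistant_tokens=2))
```

Here the trailing 3-gram has no earlier occurrence, so the search falls back to the 2-gram "the quick", finds it at the start, and proposes the two words that followed it. The main model would then validate these candidates.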

Unsupported params from OpenAI Responses API:#

  • instructions

  • previous_response_id

  • conversation

  • context_management

  • text

  • truncation

  • top_logprobs

  • include

  • store

  • metadata

  • parallel_tool_calls

  • max_tool_calls

  • background

  • prompt

  • prompt_cache_key

  • prompt_cache_retention

  • service_tier

  • safety_identifier

  • user

Response#

Param

OpenVINO Model Server

OpenAI /responses API

Type

Description

id

string

A unique identifier for the response. OVMS uses timestamp-based IDs (e.g. resp-1716825108).

object

string

Always response.

created_at

integer

The Unix timestamp (in seconds) of when the response was created.

completed_at

integer

The Unix timestamp (in seconds) of when the response was completed. Only present when status is completed.

incomplete_details

object or null

Details about why the response is incomplete. Contains {"reason": "max_tokens"} when generation was truncated due to token limit. null otherwise.

error

object or null

Error information. null when no error occurred.

model

string

The model used for the response.

status

string

completed or incomplete for unary requests; transitions from in_progress to completed/incomplete during streaming.

output

array

A list of output items. May include items of type message, function_call, or reasoning. See Output item types below.

output[].content[].text

string

The generated text content (for message type items).

output[].content[].annotations

array

Always an empty array (annotations not yet supported).

usage

object

Usage statistics: input_tokens, output_tokens, total_tokens.

tool_choice

string or object

Echoed back from the request.

tools

array

Echoed back from the request.

max_output_tokens

integer

Echoed back from the request (if set).

parallel_tool_calls

⚠️

bool

Hardcoded to true in OVMS.

store

⚠️

bool

Hardcoded to true in OVMS.

temperature

⚠️

float

Echoed back from the request. Only included when explicitly provided. OpenAI always returns this field (default: 1.0).

text

⚠️

object

Hardcoded to {"format": {"type": "text"}} in OVMS.

top_p

⚠️

float

Echoed back from the request. Only included when explicitly provided. OpenAI always returns this field (default: 1.0).

truncation

⚠️

string

Hardcoded to "disabled" in OVMS.

metadata

⚠️

object

Hardcoded to {} in OVMS.

Output item types#

The output array may contain the following item types:

| Type | Description |
| --- | --- |
| message | A text message from the assistant. Contains id, type, role, status, and a content array with output_text entries. |
| function_call | A tool/function call. Contains id, type, status, call_id, name, and arguments. Emitted when the model invokes a tool. |
| reasoning | Reasoning output (for models with thinking/reasoning enabled via chat_template_kwargs). Contains id, type, and a summary array with summary_text entries. |
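
A client typically branches on the item type when walking the output array. The sketch below does this for the three item types listed above; the sample items, their IDs and the `get_weather` function are made up, and treating arguments as a JSON-encoded string is an assumption based on the OpenAI API format.

```python
import json

# Made-up output array containing one item of each type.
sample_output = [
    {"id": "rs-0", "type": "reasoning",
     "summary": [{"type": "summary_text", "text": "The user asks about the weather."}]},
    {"id": "fc-0", "type": "function_call", "status": "completed",
     "call_id": "call-0", "name": "get_weather", "arguments": '{"city": "Paris"}'},
    {"id": "msg-0", "type": "message", "role": "assistant", "status": "completed",
     "content": [{"type": "output_text", "text": "Checking the weather.", "annotations": []}]},
]


def summarize_output(output):
    """Group output items by type (hypothetical helper)."""
    result = {"reasoning": [], "calls": [], "messages": []}
    for item in output:
        kind = item.get("type")
        if kind == "reasoning":
            result["reasoning"].extend(part["text"] for part in item.get("summary", []))
        elif kind == "function_call":
            # Assumption: arguments is a JSON-encoded string, as in the OpenAI API.
            result["calls"].append((item["name"], json.loads(item["arguments"])))
        elif kind == "message":
            result["messages"].extend(
                part["text"] for part in item.get("content", [])
                if part.get("type") == "output_text"
            )
    return result


summary = summarize_output(sample_output)
print(summary)
```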

Unsupported response fields from OpenAI service:#

  • instructions (echoed back)

  • output_text (convenience field)

Streaming events#

When stream is set to true, the server emits server-sent events in the following order:

Standard text generation events#

| Event | When emitted | Description |
| --- | --- | --- |
| response.created | After execution is scheduled | Contains the full response object with status: "in_progress". |
| response.in_progress | When the model starts producing tokens | Signals that the response is actively being processed. Emitted as part of the first streaming chunk. |
| response.output_item.added | After response.in_progress | A new output item (message) has been initialized. Contains output_index and the item object. |
| response.content_part.added | After response.output_item.added | A new content part (output_text) has been initialized. Contains output_index, content_index, item_id and the part object. |
| response.output_text.delta | For each text chunk during generation | Contains the text delta, output_index, content_index, and item_id. May be emitted many times. |
| response.output_text.done | When text generation is finalized | Contains the full accumulated text. |
| response.content_part.done | After response.output_text.done | The content part is complete. Contains the final part object with full text. |
| response.output_item.done | After response.content_part.done | The output item is complete. Contains the final item object with status: "completed". |
| response.completed | Last event before [DONE] | Contains the full response object with status: "completed" and usage statistics. |
| response.incomplete | Last event before [DONE] (when truncated) | Emitted instead of response.completed when generation was stopped due to the max_output_tokens limit. Contains the response object with status: "incomplete" and incomplete_details. |
| response.failed | On error during generation | Contains the response object with status: "failed" and error details. |

Reasoning events (for models with thinking enabled)#

When using models that support reasoning (e.g., via chat_template_kwargs: {"enable_thinking": true}), the following additional events may be emitted before the standard message events:

| Event | When emitted | Description |
| --- | --- | --- |
| response.output_item.added | When reasoning begins | A reasoning output item (type: "reasoning") is added at output_index: 0. |
| response.reasoning_summary_part.added | After reasoning item added | A reasoning summary part has been initialized. Contains output_index, summary_index, and item_id. |
| response.reasoning_summary_text.delta | For each reasoning text chunk | Contains the reasoning text delta. |
| response.reasoning_summary_text.done | When reasoning is finalized | Contains the full accumulated reasoning text. |
| response.reasoning_summary_part.done | After reasoning text done | The reasoning summary part is complete. |
| response.output_item.done | After reasoning part done | The reasoning output item is complete. |

When reasoning is present, the subsequent message output item will have output_index: 1 instead of 0.

Function call events (for tool calling)#

When the model generates tool/function calls, the following events are emitted (after reasoning events if present, before or instead of message events):

| Event | When emitted | Description |
| --- | --- | --- |
| response.output_item.added | When a function call begins | A function call output item (type: "function_call") is added. Contains output_index and the item object with call_id, name, and empty arguments. |
| response.function_call_arguments.delta | For each arguments chunk | Contains the arguments text delta, item_id, output_index, and call_id. |
| response.function_call_arguments.done | When arguments are complete | Contains the full accumulated arguments. |
| response.output_item.done | After arguments done | The function call output item is complete. |
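
The function call events can be folded back into complete calls by keying on the item ID. The sketch below works on already-parsed event dictionaries; the sample events are made up, and the JSON field names type, delta and arguments, as well as item_id on the done event, are assumptions based on the OpenAI event format.

```python
import json


def collect_function_calls(events):
    """Rebuild function calls from streamed events (hypothetical helper)."""
    calls = {}
    for event in events:
        kind = event.get("type")
        item = event.get("item", {})
        if kind == "response.output_item.added" and item.get("type") == "function_call":
            # The item starts with empty arguments; deltas fill them in.
            calls[item["id"]] = {"call_id": item["call_id"], "name": item["name"], "arguments": ""}
        elif kind == "response.function_call_arguments.delta" and event.get("item_id") in calls:
            calls[event["item_id"]]["arguments"] += event["delta"]
        elif kind == "response.function_call_arguments.done" and event.get("item_id") in calls:
            # The done event carries the full text, so it replaces the accumulated value.
            calls[event["item_id"]]["arguments"] = event["arguments"]
    return list(calls.values())


# Made-up event sequence for a single call to a hypothetical get_weather function.
events = [
    {"type": "response.output_item.added", "sequence_number": 2, "output_index": 0,
     "item": {"id": "fc-0", "type": "function_call", "call_id": "call-0",
              "name": "get_weather", "arguments": ""}},
    {"type": "response.function_call_arguments.delta", "sequence_number": 3,
     "item_id": "fc-0", "output_index": 0, "call_id": "call-0", "delta": '{"city": '},
    {"type": "response.function_call_arguments.delta", "sequence_number": 4,
     "item_id": "fc-0", "output_index": 0, "call_id": "call-0", "delta": '"Paris"}'},
    {"type": "response.function_call_arguments.done", "sequence_number": 5,
     "item_id": "fc-0", "arguments": '{"city": "Paris"}'},
]

calls = collect_function_calls(events)
print(calls[0]["name"], json.loads(calls[0]["arguments"]))
```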

All events include a monotonically increasing sequence_number field.

The stream is terminated by a data: [DONE] message.
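
A minimal sketch of consuming such a stream: it reads data: lines, stops at [DONE], checks that sequence_number never goes backwards, and concatenates the text deltas. The sample lines are a made-up, heavily trimmed stream, and the JSON field names type, delta and response are assumptions based on the OpenAI event format.

```python
import json


def iter_sse_events(lines):
    """Yield parsed JSON payloads from 'data:' lines until the [DONE] terminator."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        yield json.loads(data)


def accumulate_text(lines):
    """Return (generated text, final response object or None)."""
    chunks, last_seq, final = [], -1, None
    for event in iter_sse_events(lines):
        seq = event["sequence_number"]
        if seq < last_seq:
            raise ValueError(f"sequence_number went backwards: {seq} after {last_seq}")
        last_seq = seq
        if event["type"] == "response.output_text.delta":
            chunks.append(event["delta"])
        elif event["type"] in ("response.completed", "response.incomplete", "response.failed"):
            final = event["response"]
    return "".join(chunks), final


# Made-up stream, trimmed to the events and fields used above.
sample_stream = [
    'data: {"type": "response.created", "sequence_number": 0, "response": {"status": "in_progress"}}',
    "",
    'data: {"type": "response.output_text.delta", "sequence_number": 1, "delta": "Open"}',
    "",
    'data: {"type": "response.output_text.delta", "sequence_number": 2, "delta": "VINO"}',
    "",
    'data: {"type": "response.completed", "sequence_number": 3, "response": {"status": "completed"}}',
    "",
    "data: [DONE]",
]

text, final = accumulate_text(sample_stream)
print(text, final["status"])
```

The same functions can be fed the line iterator of any HTTP client that supports streaming responses.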

NOTE: The OpenAI Python client supports a limited list of parameters. Parameters native to OpenVINO Model Server can be passed inside the generic container parameter extra_body. Below is an example of how to pass the top_k value.

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")
response = client.responses.create(
    model="llama3",
    input="What is OpenVINO?",
    max_output_tokens=100,
    extra_body={"top_k": 1},
    stream=False
)

References#

LLM quick start guide

End to end demo with LLM model serving over OpenAI API

Code snippets

LLM calculator