OpenAI API completions endpoint#
Note: This endpoint works only with LLM graphs.
API Reference#
OpenVINO Model Server now includes the completions endpoint, which follows the OpenAI API.
Please see the OpenAI API Reference for more information on the API.
The endpoint is exposed via a path:
http://server_name:port/v3/completions
Example request#
curl http://localhost/v3/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3",
"prompt": "This is a test",
"stream": false
}'
Example response#
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"text": "You are testing me!"
}
],
"created": 1716825108,
"model": "llama3",
"object": "text_completion",
"usage": {
"completion_tokens": 14,
"prompt_tokens": 17,
"total_tokens": 31
}
}
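The same request body can be built and the response parsed in Python using only the standard library. This is a minimal sketch rather than an official client: the helper names build_completion_request and extract_texts are hypothetical, and the sample response is the one shown above.

```python
import json

def build_completion_request(model, prompt, stream=False):
    """Return the JSON body for POST /v3/completions as a string."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream})

def extract_texts(response_body):
    """Return the generated text of every choice, ordered by index."""
    response = json.loads(response_body)
    choices = sorted(response["choices"], key=lambda c: c["index"])
    return [choice["text"] for choice in choices]

# Sample response copied from the example response above.
sample_response = """
{
  "choices": [
    {"finish_reason": "stop", "index": 0, "logprobs": null,
     "text": "You are testing me!"}
  ],
  "created": 1716825108,
  "model": "llama3",
  "object": "text_completion",
  "usage": {"completion_tokens": 14, "prompt_tokens": 17, "total_tokens": 31}
}
"""

body = build_completion_request("llama3", "This is a test")
texts = extract_texts(sample_response)
print(body)
print(texts)
```

The body string can be sent with any HTTP client, as long as the Content-Type: application/json header is set as in the curl example.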
Request#
Generic#
| Param | OpenVINO Model Server | OpenAI /completions API | vLLM Serving Sampling Params | Type | Description |
|---|---|---|---|---|---|
| model | ✅ | ✅ | ✅ | string (required) | Name of the model to use. From the administrator's point of view, it is the name assigned to a MediaPipe graph configured to schedule generation using the desired model. |
| stop | ✅ | ✅ | ✅ | string/array of strings (optional) | Up to 4 sequences where the API will stop generating further tokens. If |
| stream | ✅ | ✅ | ✅ | bool (optional, default: | If set to true, partial message deltas will be sent to the client. The generation chunks will be sent as data-only server-sent events as they become available, with the stream terminated by a |
| stream_options | ✅ | ✅ | ✅ | object (optional) | Options for streaming response. Only set this when you set stream: true |
| stream_options.include_usage | ✅ | ✅ | ✅ | bool (optional) | Streaming option. If set, an additional chunk will be streamed before the data: [DONE] message. The usage field in this chunk shows the token usage statistics for the entire request, and the choices field will always be an empty array. All other chunks will also include a usage field, but with a null value. Supported only in Continuous Batching servables. |
| prompt | ⚠️ | ✅ | ✅ | string or array (required) | The prompt(s) to generate completions for, encoded as a string, array of strings, array of tokens, or array of token arrays. Limitations: only a single string prompt is currently supported. |
| max_tokens | ✅ | ✅ | ✅ | integer | The maximum number of tokens that can be generated. If not set, the generation will stop once |
| ignore_eos | ✅ | ❌ | ✅ | bool (default: | Whether to ignore the |
| include_stop_str_in_output | ✅ | ❌ | ✅ | bool (default: | Whether to include matched stop string in output. Setting it to false when |
| logprobs | ⚠️ | ✅ | ✅ | integer (optional) | Include the log probabilities of the returned output tokens. In stream mode, logprobs are not returned. Only value 1 is accepted, which returns the log probability of the chosen token. |
| echo | ✅ | ✅ | ✅ | boolean (optional) | Echo back the prompt in addition to the completion |
| skip_special_tokens | ✅ | ❌ | ✅ | bool (default: | Whether to remove special tokens (e.g. |
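When stream is set to true, the client reads data-only server-sent events until the data: [DONE] message arrives. The sketch below shows one way to consume such a stream from an iterable of already-decoded lines. The function name iter_stream_chunks and the sample lines (including their token counts) are made up for illustration, and the chunk layout is assumed to mirror the non-streaming response.

```python
import json

def iter_stream_chunks(lines):
    """Yield parsed JSON chunks from SSE lines until 'data: [DONE]'."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)

# Hypothetical stream with stream_options.include_usage enabled.
sample_lines = [
    'data: {"choices": [{"index": 0, "text": "You", "finish_reason": null}], "usage": null}',
    '',
    'data: {"choices": [{"index": 0, "text": " are", "finish_reason": "stop"}], "usage": null}',
    '',
    'data: {"choices": [], "usage": {"prompt_tokens": 17, "completion_tokens": 2, "total_tokens": 19}}',
    '',
    'data: [DONE]',
]

text = ""
usage = None
for chunk in iter_stream_chunks(sample_lines):
    for choice in chunk["choices"]:
        text += choice["text"]
    if chunk.get("usage") is not None:
        usage = chunk["usage"]  # only the final chunk carries a non-null usage

print(text)
print(usage)
```

Because the usage chunk has an empty choices array, the loop over choices needs no special case for it.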
Beam search sampling specific#
| Param | OpenVINO Model Server | OpenAI /completions API | vLLM Serving Sampling Params | Type | Description |
|---|---|---|---|---|---|
| n | ✅ | ✅ | ✅ | integer (default: | Number of output sequences to return for the given prompt. This value must be between |
| best_of | ✅ | ✅ | ✅ | integer (default: | Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n. This is treated as the beam width for beam search sampling. |
| length_penalty | ✅ | ❌ | ✅ | float (default: | Exponential penalty to the length that is used with beam-based generation. It is applied as an exponent to the sequence length, which in turn is used to divide the score of the sequence. Since the score is the log likelihood of the sequence (i.e. negative), |
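To make the interaction of n, best_of and length_penalty concrete, the sketch below scores a few hypothetical beams by dividing each log likelihood by the sequence length raised to length_penalty, as the description states, and then keeps the top n. The function names and the beam data are invented for illustration; this is not the server's implementation.

```python
def beam_score(log_likelihood, length, length_penalty=1.0):
    """Divide the sequence log likelihood by length ** length_penalty."""
    return log_likelihood / (length ** length_penalty)

def pick_top_n(candidates, n, length_penalty=1.0):
    """Return the n best (text, score) pairs out of the best_of candidates."""
    if n > len(candidates):
        raise ValueError("best_of must be greater than or equal to n")
    scored = [
        (text, beam_score(log_likelihood, length, length_penalty))
        for text, log_likelihood, length in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]

# Hypothetical beams: (text, total log likelihood, token count).
beams = [("short", -4.0, 2), ("a longer answer", -6.0, 4)]

print(pick_top_n(beams, n=1, length_penalty=0.0))
print(pick_top_n(beams, n=1, length_penalty=1.0))
```

With a penalty of 0.0 the raw log likelihood decides and the shorter beam wins; with 1.0 the score becomes a per-token average and the longer beam ranks first.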
Multinomial sampling specific#
| Param | OpenVINO Model Server | OpenAI /completions API | vLLM Serving Sampling Params | Type | Description |
|---|---|---|---|---|---|
| temperature | ✅ | ✅ | ✅ | float (default: | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to |
| top_p | ✅ | ✅ | ✅ | float (default: | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| min_p | ✅ | ❌ | ✅ | float (default: | Minimum probability threshold relative to the most likely token. Tokens with probability below |
| top_k | ✅ | ❌ | ✅ | int (default: | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to |
| repetition_penalty | ✅ | ❌ | ✅ | float (default: | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > |
| frequency_penalty | ✅ | ✅ | ✅ | float (default: | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood to repeat the same line verbatim. |
| presence_penalty | ✅ | ✅ | ✅ | float (default: | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model’s likelihood to talk about new topics. |
| seed | ✅ | ✅ | ✅ | integer (default: random) | Random seed for generation in range |
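The effect of temperature, top_k, top_p and seed can be illustrated with a small, self-contained sampler. This is a conceptual sketch of how such filters are commonly applied, not the sampler used by OpenVINO Model Server; the function names and the logits are hypothetical, and temperature is assumed to be greater than zero.

```python
import math
import random

def filter_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Return {token_id: probability} after temperature, top_k and top_p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    kept, cumulative = [], 0.0
    for prob, token_id in ranked:
        kept.append((prob, token_id))
        cumulative += prob
        if cumulative >= top_p:  # stop once the nucleus covers top_p
            break
    norm = sum(prob for prob, _ in kept)
    return {token_id: prob / norm for prob, token_id in kept}

def sample(distribution, seed):
    """Draw one token id; the same seed gives the same draw."""
    rng = random.Random(seed)
    ids = list(distribution)
    return rng.choices(ids, weights=[distribution[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for token ids 0..3
print(filter_distribution(logits, top_k=2))
print(filter_distribution(logits, top_p=0.5))
print(sample(filter_distribution(logits), seed=7))
```

Lower temperature values sharpen the distribution before filtering, so fewer tokens are needed to reach the top_p mass.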
Speculative decoding specific#
| Param | OpenVINO Model Server | OpenAI /completions API | vLLM Serving Sampling Params | Type | Description |
|---|---|---|---|---|---|
| num_assistant_tokens | ✅ | ❌ | ⚠️ | int | Defines how many tokens the draft model should generate before the main model validates them. Equivalent of |
| assistant_confidence_threshold | ✅ | ❌ | ❌ | float | Determines the confidence level for continuing generation. If the draft model generates a token with confidence below this threshold, it stops generation for the current cycle and the main model starts validation. Cannot be used with |
The server resolves the value of these parameters using the following priority order:
1. Request body: num_assistant_tokens or assistant_confidence_threshold sent by the client.
2. generation_config.json in the main model’s directory: add "num_assistant_tokens": N (or "assistant_confidence_threshold": F) to set a deployment-level default that applies to all requests that do not specify it. This is the recommended way to persist a tuned value without requiring every client to send it.
3. Built-in fallback: num_assistant_tokens = 5 if neither of the above is present.
Note: generation_config.json is shipped alongside model weights from Hugging Face, but it is fully operator-editable. Changes take effect on the next server start.
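The priority order above can be expressed as a small lookup. The sketch below is only an illustration of that order: the function name resolve_speculative_setting is hypothetical, both inputs are assumed to be already-parsed dictionaries, and checking num_assistant_tokens first within one source is an arbitrary choice made here.

```python
BUILT_IN_NUM_ASSISTANT_TOKENS = 5  # built-in fallback stated above
KEYS = ("num_assistant_tokens", "assistant_confidence_threshold")

def resolve_speculative_setting(request_body, generation_config):
    """Return (name, value): request body first, then generation_config.json,
    then the built-in fallback."""
    for source in (request_body, generation_config):
        for key in KEYS:
            if key in source:
                return key, source[key]
    return "num_assistant_tokens", BUILT_IN_NUM_ASSISTANT_TOKENS

# The request value wins over the deployment-level default.
print(resolve_speculative_setting({"num_assistant_tokens": 3},
                                  {"num_assistant_tokens": 8}))
# No request value: the generation_config.json default applies.
print(resolve_speculative_setting({}, {"assistant_confidence_threshold": 0.4}))
# Neither source sets a value: built-in fallback.
print(resolve_speculative_setting({}, {}))
```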
Prompt lookup decoding specific#
Note that the parameters below are valid only for the prompt lookup pipeline. Add "prompt_lookup": true to plugin_config in your graph config node options to enable it.
| Param | OpenVINO Model Server | OpenAI /completions API | vLLM Serving Sampling Params | Type | Description |
|---|---|---|---|---|---|
| num_assistant_tokens | ✅ | ❌ | ❌ | int | Number of candidate tokens proposed after an n-gram match is found |
| max_ngram_size | ✅ | ❌ | ❌ | int | The maximum n-gram size to use when looking for matches in the prompt |
Note: vLLM does not accept these as sampling parameters, but it enables prompt lookup decoding when they are set in the LLM config.
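To show what num_assistant_tokens and max_ngram_size control, here is a conceptual sketch of the general prompt lookup idea: find the longest trailing n-gram that also occurs earlier in the sequence, then propose the tokens that followed it. It works on a plain Python list instead of real token IDs, the function name propose_from_prompt is hypothetical, and it is not the OpenVINO implementation.

```python
def propose_from_prompt(tokens, max_ngram_size, num_assistant_tokens):
    """Propose draft tokens by matching the trailing n-gram earlier in tokens."""
    # Try the longest n-gram first, then fall back to shorter ones.
    for size in range(min(max_ngram_size, len(tokens) - 1), 0, -1):
        suffix = tokens[-size:]
        # Search right to left, skipping the suffix's own position.
        for start in range(len(tokens) - size - 1, -1, -1):
            if tokens[start:start + size] == suffix:
                follow = tokens[start + size:start + size + num_assistant_tokens]
                if follow:
                    return follow
    return []

tokens = "the cat sat on the cat".split()
print(propose_from_prompt(tokens, max_ngram_size=2, num_assistant_tokens=3))
print(propose_from_prompt(["a", "b", "c"], max_ngram_size=2, num_assistant_tokens=3))
```

The main model then validates the proposed candidates, so a wrong proposal costs only the wasted draft.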
Unsupported params from OpenAI service:#
logit_bias
suffix
Unsupported params from vLLM:#
use_beam_search (in OpenVINO Model Server, increase the best_of param to enable beam search)
early_stopping
stop_token_ids
min_tokens
prompt_logprobs
detokenize
spaces_between_special_tokens
logits_processors
truncate_prompt_tokens
Response#
| Param | OpenVINO Model Server | OpenAI /completions API | Type | Description |
|---|---|---|---|---|
| choices | ✅ | ✅ | array | A list of completion choices. Can be more than one if |
| choices.index | ✅ | ✅ | integer | The index of the choice in the list of choices. |
| choices.text | ✅ | ✅ | string | The completion text generated by the model. |
| choices.finish_reason | ✅ | ✅ | string or null | The reason the model stopped generating tokens. This will be |
| choices.logprobs | ⚠️ | ✅ | object or null | Log probability information for the choice. In the current version, only one logprob per token can be returned. |
| created | ✅ | ✅ | integer | The Unix timestamp (in seconds) of when the completion was created. |
| model | ✅ | ✅ | string | The model used for the completion. |
| object | ✅ | ✅ | string | always |
| usage | ✅ | ✅ | object | Usage statistics for the completion request. Consists of three integer fields: |
Unsupported params from OpenAI service:#
id
system_fingerprint
NOTE: The OpenAI Python client supports a limited list of parameters. Those native to OpenVINO Model Server can be passed inside the generic container parameter extra_body. Below is an example of how to encapsulate the top_k value.
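from openai import OpenAI

# Assumed values: adjust base_url, api_key and model to your deployment.
client = OpenAI(base_url="http://localhost/v3", api_key="unused")
model = "llama3"

response = client.completions.create(
    model=model,
    prompt="hello",
    max_tokens=100,
    extra_body={"top_k": 1},  # parameter native to OpenVINO Model Server
    stream=False,
)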