openvino_genai.LLMPipeline#

class openvino_genai.LLMPipeline#

Bases: pybind11_object

This class is used for generation with LLMs.

__init__(*args, **kwargs)#

Overloaded function.

  1. __init__(self: openvino_genai.py_openvino_genai.LLMPipeline, models_path: os.PathLike, tokenizer: openvino_genai.py_openvino_genai.Tokenizer, device: str, config: dict[str, object] = {}, **kwargs) -> None

    LLMPipeline class constructor for a manually created openvino_genai.Tokenizer. models_path (os.PathLike): Path to the model file. tokenizer (openvino_genai.Tokenizer): tokenizer object. device (str): Device to run the model on (e.g., CPU, GPU). Default is 'CPU'. Add {"scheduler_config": ov_genai.SchedulerConfig} to the config properties to create a continuous batching pipeline. kwargs: Device properties.

  2. __init__(self: openvino_genai.py_openvino_genai.LLMPipeline, models_path: os.PathLike, device: str, config: dict[str, object] = {}, **kwargs) -> None

    LLMPipeline class constructor. models_path (os.PathLike): Path to the model file. device (str): Device to run the model on (e.g., CPU, GPU). Default is 'CPU'. Add {"scheduler_config": ov_genai.SchedulerConfig} to the config properties to create a continuous batching pipeline. kwargs: Device properties.
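
A minimal usage sketch of the two constructors (the model directory "./TinyLlama-1.1B-Chat-v1.0" is a placeholder and assumes a model already exported for OpenVINO GenAI):

    import openvino_genai

    models_path = "./TinyLlama-1.1B-Chat-v1.0"  # placeholder directory with an exported model

    # Overload 2: let the pipeline load the tokenizer from models_path.
    pipe = openvino_genai.LLMPipeline(models_path, "CPU")
    print(pipe.generate("What is OpenVINO?", max_new_tokens=100))

    # Overload 1: pass a manually created Tokenizer.
    tokenizer = openvino_genai.Tokenizer(models_path)
    pipe = openvino_genai.LLMPipeline(models_path, tokenizer, "CPU")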

Methods

__call__(self, inputs[, generation_config, ...])

Generates sequences or tokens for LLMs.

__delattr__(name, /)

Implement delattr(self, name).

__dir__()

Default dir() implementation.

__eq__(value, /)

Return self==value.

__format__(format_spec, /)

Default object formatter.

__ge__(value, /)

Return self>=value.

__getattribute__(name, /)

Return getattr(self, name).

__gt__(value, /)

Return self>value.

__hash__()

Return hash(self).

__init__(*args, **kwargs)

Overloaded function.

__init_subclass__

This method is called when a class is subclassed.

__le__(value, /)

Return self<=value.

__lt__(value, /)

Return self<value.

__ne__(value, /)

Return self!=value.

__new__(**kwargs)

__reduce__()

Helper for pickle.

__reduce_ex__(protocol, /)

Helper for pickle.

__repr__()

Return repr(self).

__setattr__(name, value, /)

Implement setattr(self, name, value).

__sizeof__()

Size of object in memory, in bytes.

__str__()

Return str(self).

__subclasshook__

Abstract classes can override this to customize issubclass().

finish_chat(self)

generate(self, inputs[, generation_config, ...])

Generates sequences or tokens for LLMs.

get_generation_config(self)

get_tokenizer(self)

set_generation_config(self, config)

start_chat(self[, system_message])

__call__(self: openvino_genai.py_openvino_genai.LLMPipeline, inputs: openvino._pyopenvino.Tensor | openvino_genai.py_openvino_genai.TokenizedInputs | str | list[str], generation_config: openvino_genai.py_openvino_genai.GenerationConfig | None = None, streamer: Callable[[str], bool] | openvino_genai.py_openvino_genai.StreamerBase | None = None, **kwargs) → openvino_genai.py_openvino_genai.EncodedResults | openvino_genai.py_openvino_genai.DecodedResults#

Generates sequences or tokens for LLMs. If the input is a string or a list of strings, the resulting sequences are returned already detokenized.

Parameters:
  • inputs (str, List[str], ov.genai.TokenizedInputs, or ov.Tensor) – inputs in the form of a string, a list of strings, or tokenized input_ids

  • generation_config (GenerationConfig or Dict) – generation configuration for this call

  • streamer (Callable[[str], bool] or ov.genai.StreamerBase) – streamer, either a callable that receives decoded text and returns a boolean flag indicating whether generation should be stopped, or a StreamerBase instance

  • kwargs (Dict) – arbitrary keyword arguments with keys corresponding to GenerationConfig fields

Returns:

Results in encoded or decoded form depending on the inputs type.

Return type:

DecodedResults, EncodedResults, str

Structure to keep generation config parameters. For a selected method of decoding, only parameters from that group and generic parameters are used. For example, if do_sample is set to true, then only generic parameters and random sampling parameters will be used, while greedy and beam search parameters will not affect decoding at all.

Generic parameters:

  • max_length: the maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. Its effect is overridden by max_new_tokens, if also set.

  • max_new_tokens: the maximum number of tokens to generate, excluding the number of tokens in the prompt. max_new_tokens has priority over max_length.

  • ignore_eos: if set to true, generation will not stop even if an <eos> token is met.

  • eos_token_id: token id of <eos> (end of sentence).

  • min_new_tokens: sets 0 probability for eos_token_id for the first min_new_tokens generated tokens. Ignored for non-continuous batching.

  • stop_strings: a set of strings that will cause the pipeline to stop generating further tokens.

  • include_stop_str_in_output: if set to true, the stop string that matched generation will be included in the generation output (default: false).

  • stop_token_ids: a set of tokens that will cause the pipeline to stop generating further tokens.

  • echo: if set to true, the model will echo the prompt in the output.

  • logprobs: the number of top logprobs computed for each position; if set to 0, logprobs are not computed and the value 0.0 is returned. Currently only a single top logprob can be returned, so any logprobs > 1 is treated as logprobs == 1 (default: 0).

Beam search specific parameters:

  • num_beams: number of beams for beam search. 1 disables beam search.

  • num_beam_groups: number of groups to divide num_beams into in order to ensure diversity among different groups of beams.

  • diversity_penalty: value subtracted from a beam's score if it generates the same token as any beam from another group at a particular time.

  • length_penalty: exponential penalty to the length that is used with beam-based generation. It is applied as an exponent to the sequence length, which in turn is used to divide the score of the sequence. Since the score is the log likelihood of the sequence (i.e. negative), length_penalty > 0.0 promotes longer sequences, while length_penalty < 0.0 encourages shorter sequences.

  • num_return_sequences: the number of sequences to return for grouped beam search decoding.

  • no_repeat_ngram_size: if set to an int > 0, all ngrams of that size can only occur once.

  • stop_criteria: controls the stopping condition for grouped beam search. It accepts the following values: openvino_genai.StopCriteria.EARLY, where generation stops as soon as there are num_beams complete candidates; openvino_genai.StopCriteria.HEURISTIC, where generation stops when it is very unlikely to find better candidates; openvino_genai.StopCriteria.NEVER, where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm).

Random sampling parameters:

  • temperature: the value used to modulate token probabilities for random sampling.

  • top_p: if set to a float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.

  • top_k: the number of highest probability vocabulary tokens to keep for top-k filtering.

  • do_sample: whether or not to use multinomial random sampling.

  • repetition_penalty: the parameter for repetition penalty. 1.0 means no penalty.
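
A minimal sketch of passing these parameters through __call__, either as keyword arguments or via a GenerationConfig object (the model directory "./model_dir" is a placeholder and assumes a model already exported for OpenVINO GenAI):

    import openvino_genai

    pipe = openvino_genai.LLMPipeline("./model_dir", "CPU")  # placeholder path

    # Keyword arguments map directly onto GenerationConfig fields.
    print(pipe("Explain OpenVINO in one sentence.",
               max_new_tokens=64, do_sample=True, temperature=0.7, top_p=0.9))

    # The same call with an explicit GenerationConfig object.
    config = openvino_genai.GenerationConfig()
    config.max_new_tokens = 64
    config.do_sample = True
    config.temperature = 0.7
    config.top_p = 0.9
    print(pipe("Explain OpenVINO in one sentence.", config))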

__class__#

alias of pybind11_type

__delattr__(name, /)#

Implement delattr(self, name).

__dir__()#

Default dir() implementation.

__eq__(value, /)#

Return self==value.

__format__(format_spec, /)#

Default object formatter.

__ge__(value, /)#

Return self>=value.

__getattribute__(name, /)#

Return getattr(self, name).

__gt__(value, /)#

Return self>value.

__hash__()#

Return hash(self).

__init__(*args, **kwargs)#

Overloaded function.

  1. __init__(self: openvino_genai.py_openvino_genai.LLMPipeline, models_path: os.PathLike, tokenizer: openvino_genai.py_openvino_genai.Tokenizer, device: str, config: dict[str, object] = {}, **kwargs) -> None

    LLMPipeline class constructor for a manually created openvino_genai.Tokenizer. models_path (os.PathLike): Path to the model file. tokenizer (openvino_genai.Tokenizer): tokenizer object. device (str): Device to run the model on (e.g., CPU, GPU). Default is 'CPU'. Add {"scheduler_config": ov_genai.SchedulerConfig} to the config properties to create a continuous batching pipeline. kwargs: Device properties.

  2. __init__(self: openvino_genai.py_openvino_genai.LLMPipeline, models_path: os.PathLike, device: str, config: dict[str, object] = {}, **kwargs) -> None

    LLMPipeline class constructor. models_path (os.PathLike): Path to the model file. device (str): Device to run the model on (e.g., CPU, GPU). Default is 'CPU'. Add {"scheduler_config": ov_genai.SchedulerConfig} to the config properties to create a continuous batching pipeline. kwargs: Device properties.
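
A sketch of the continuous batching setup mentioned above, where a SchedulerConfig is passed through the config dictionary (the model directory and cache size are placeholder values):

    import openvino_genai

    scheduler_config = openvino_genai.SchedulerConfig()
    scheduler_config.cache_size = 2  # KV cache size in GB; placeholder value

    # Adding "scheduler_config" to the config properties creates a
    # continuous batching pipeline instead of the default stateful one.
    pipe = openvino_genai.LLMPipeline("./model_dir", "CPU",
                                      {"scheduler_config": scheduler_config})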

__init_subclass__()#

This method is called when a class is subclassed.

The default implementation does nothing. It may be overridden to extend subclasses.

__le__(value, /)#

Return self<=value.

__lt__(value, /)#

Return self<value.

__ne__(value, /)#

Return self!=value.

__new__(**kwargs)#
__reduce__()#

Helper for pickle.

__reduce_ex__(protocol, /)#

Helper for pickle.

__repr__()#

Return repr(self).

__setattr__(name, value, /)#

Implement setattr(self, name, value).

__sizeof__()#

Size of object in memory, in bytes.

__str__()#

Return str(self).

__subclasshook__()#

Abstract classes can override this to customize issubclass().

This is invoked early on by abc.ABCMeta.__subclasscheck__(). It should return True, False or NotImplemented. If it returns NotImplemented, the normal algorithm is used. Otherwise, it overrides the normal algorithm (and the outcome is cached).

finish_chat(self: openvino_genai.py_openvino_genai.LLMPipeline) → None#
generate(self: openvino_genai.py_openvino_genai.LLMPipeline, inputs: openvino._pyopenvino.Tensor | openvino_genai.py_openvino_genai.TokenizedInputs | str | list[str], generation_config: openvino_genai.py_openvino_genai.GenerationConfig | None = None, streamer: Callable[[str], bool] | openvino_genai.py_openvino_genai.StreamerBase | None = None, **kwargs) → openvino_genai.py_openvino_genai.EncodedResults | openvino_genai.py_openvino_genai.DecodedResults#

Generates sequences or tokens for LLMs. If the input is a string or a list of strings, the resulting sequences are returned already detokenized.

Parameters:
  • inputs (str, List[str], ov.genai.TokenizedInputs, or ov.Tensor) – inputs in the form of a string, a list of strings, or tokenized input_ids

  • generation_config (GenerationConfig or Dict) – generation configuration for this call

  • streamer (Callable[[str], bool] or ov.genai.StreamerBase) – streamer, either a callable that receives decoded text and returns a boolean flag indicating whether generation should be stopped, or a StreamerBase instance

  • kwargs (Dict) – arbitrary keyword arguments with keys corresponding to GenerationConfig fields

Returns:

Results in encoded or decoded form depending on the inputs type.

Return type:

DecodedResults, EncodedResults, str

Structure to keep generation config parameters. For a selected method of decoding, only parameters from that group and generic parameters are used. For example, if do_sample is set to true, then only generic parameters and random sampling parameters will be used, while greedy and beam search parameters will not affect decoding at all.

Generic parameters:

  • max_length: the maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. Its effect is overridden by max_new_tokens, if also set.

  • max_new_tokens: the maximum number of tokens to generate, excluding the number of tokens in the prompt. max_new_tokens has priority over max_length.

  • ignore_eos: if set to true, generation will not stop even if an <eos> token is met.

  • eos_token_id: token id of <eos> (end of sentence).

  • min_new_tokens: sets 0 probability for eos_token_id for the first min_new_tokens generated tokens. Ignored for non-continuous batching.

  • stop_strings: a set of strings that will cause the pipeline to stop generating further tokens.

  • include_stop_str_in_output: if set to true, the stop string that matched generation will be included in the generation output (default: false).

  • stop_token_ids: a set of tokens that will cause the pipeline to stop generating further tokens.

  • echo: if set to true, the model will echo the prompt in the output.

  • logprobs: the number of top logprobs computed for each position; if set to 0, logprobs are not computed and the value 0.0 is returned. Currently only a single top logprob can be returned, so any logprobs > 1 is treated as logprobs == 1 (default: 0).

Beam search specific parameters:

  • num_beams: number of beams for beam search. 1 disables beam search.

  • num_beam_groups: number of groups to divide num_beams into in order to ensure diversity among different groups of beams.

  • diversity_penalty: value subtracted from a beam's score if it generates the same token as any beam from another group at a particular time.

  • length_penalty: exponential penalty to the length that is used with beam-based generation. It is applied as an exponent to the sequence length, which in turn is used to divide the score of the sequence. Since the score is the log likelihood of the sequence (i.e. negative), length_penalty > 0.0 promotes longer sequences, while length_penalty < 0.0 encourages shorter sequences.

  • num_return_sequences: the number of sequences to return for grouped beam search decoding.

  • no_repeat_ngram_size: if set to an int > 0, all ngrams of that size can only occur once.

  • stop_criteria: controls the stopping condition for grouped beam search. It accepts the following values: openvino_genai.StopCriteria.EARLY, where generation stops as soon as there are num_beams complete candidates; openvino_genai.StopCriteria.HEURISTIC, where generation stops when it is very unlikely to find better candidates; openvino_genai.StopCriteria.NEVER, where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm).

Random sampling parameters:

  • temperature: the value used to modulate token probabilities for random sampling.

  • top_p: if set to a float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.

  • top_k: the number of highest probability vocabulary tokens to keep for top-k filtering.

  • do_sample: whether or not to use multinomial random sampling.

  • repetition_penalty: the parameter for repetition penalty. 1.0 means no penalty.
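
A sketch of streaming generation with generate(), using the callable streamer form described above (the model directory is a placeholder; the callback returns False to let generation continue):

    import openvino_genai

    pipe = openvino_genai.LLMPipeline("./model_dir", "CPU")  # placeholder path

    def streamer(subword: str) -> bool:
        print(subword, end="", flush=True)
        return False  # False: do not stop generation

    pipe.generate("Write a haiku about inference.", max_new_tokens=32, streamer=streamer)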

get_generation_config(self: openvino_genai.py_openvino_genai.LLMPipeline) → openvino_genai.py_openvino_genai.GenerationConfig#
get_tokenizer(self: openvino_genai.py_openvino_genai.LLMPipeline) → openvino_genai.py_openvino_genai.Tokenizer#
set_generation_config(self: openvino_genai.py_openvino_genai.LLMPipeline, config: openvino_genai.py_openvino_genai.GenerationConfig) → None#
start_chat(self: openvino_genai.py_openvino_genai.LLMPipeline, system_message: str = '') → None#
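
A sketch of a simple chat session built from the methods above (the model directory and system message are placeholders):

    import openvino_genai

    pipe = openvino_genai.LLMPipeline("./model_dir", "CPU")  # placeholder path

    # Adjust the pipeline's default generation config.
    config = pipe.get_generation_config()
    config.max_new_tokens = 128
    pipe.set_generation_config(config)

    pipe.start_chat(system_message="You are a helpful assistant.")  # history is kept between calls
    print(pipe.generate("Hello! Who are you?"))
    print(pipe.generate("And what can you do?"))
    pipe.finish_chat()  # clears the chat history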