openvino_genai.SparseAttentionConfig#
- class openvino_genai.SparseAttentionConfig#
Bases:
pybind11_object
Configuration struct for the sparse attention functionality.
- Parameters:
mode (openvino_genai.SparseAttentionMode) – Sparse attention mode to be applied.
num_last_dense_tokens_in_prefill (int) – TRISHAPE and XATTENTION modes - Number of tokens at the end of the prompt for which full (dense) attention over the previous KV cache contents is computed; for the rest of the prompt tokens, only sparse attention is computed according to the selected algorithm. TRISHAPE: due to the block-wise nature of continuous-batching cache management, the actual number of prompt tokens receiving dense attention may be up to one block larger than this value (depending on prompt length and block size). XATTENTION: the same applies, but the dense attention may spill over by up to one subsequence chunk (i.e. multiple blocks).
num_retained_start_tokens_in_cache (int) – TRISHAPE mode only - The number of tokens in the beginning of the cache (least recent) to be retained when applying sparse attention. Must be a multiple of block size.
num_retained_recent_tokens_in_cache (int) – TRISHAPE mode only - The number of most recent tokens in the cache to be retained when applying sparse attention. Must be a multiple of block size.
xattention_threshold (float) – XATTENTION mode only - Cumulative importance-score threshold used to determine which blocks to exclude from the attention calculation in the block-sparse approach. Only the highest-importance attention-matrix blocks whose cumulative score stays within this threshold take part in the computation. The lower the threshold, the less computation the main attention operation takes, and vice versa, with a corresponding potential impact on generation accuracy.
xattention_block_size (int) – XATTENTION mode only - Block granularity, in tokens, with which the block-sparse attention calculation will be applied.
xattention_stride (int) – XATTENTION mode only - The stride of the antidiagonal sampling used to calculate the importance score of each xattention_block_size-sized block of the attention matrix before the actual attention calculation takes place. Directly determines the overhead of the importance-score computation: if full (dense) attention takes time M, the importance-score calculation adds roughly M / xattention_stride of overhead.
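The stride parameter above implies a simple cost model for the XATTENTION scoring pass. The sketch below is illustrative arithmetic only, not part of the openvino_genai API:

```python
def xattention_overhead_fraction(xattention_stride: int) -> float:
    """If a full (dense) attention pass costs time M, the antidiagonal
    importance-score sampling costs roughly M / xattention_stride."""
    return 1.0 / xattention_stride

# With the default stride of 8, scoring adds about 12.5% on top of
# the dense-attention cost; a larger stride samples more sparsely
# and shrinks the overhead further.
assert xattention_overhead_fraction(8) == 0.125
assert xattention_overhead_fraction(16) == 0.0625
```

Note the trade-off: a larger stride lowers the scoring overhead but estimates block importance from fewer samples.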
- __init__(self: openvino_genai.py_openvino_genai.SparseAttentionConfig, mode: openvino_genai.py_openvino_genai.SparseAttentionMode = SparseAttentionMode.TRISHAPE, num_last_dense_tokens_in_prefill: SupportsInt = 100, num_retained_start_tokens_in_cache: SupportsInt = 128, num_retained_recent_tokens_in_cache: SupportsInt = 1920, xattention_threshold: SupportsFloat = 0.8, xattention_block_size: SupportsInt = 64, xattention_stride: SupportsInt = 8) None #
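A minimal construction sketch based on the signature above. The parameter values are illustrative, not recommendations; attaching the config to a pipeline (e.g. via the scheduler configuration of a continuous-batching pipeline) is not shown here:

```python
from openvino_genai import SparseAttentionConfig, SparseAttentionMode

# XATTENTION block-sparse attention: retain only the highest-importance
# attention-matrix blocks whose cumulative score stays within the threshold.
config = SparseAttentionConfig(
    mode=SparseAttentionMode.XATTENTION,
    num_last_dense_tokens_in_prefill=100,  # dense attention for the prompt tail
    xattention_threshold=0.9,              # retain more blocks than the 0.8 default
    xattention_block_size=64,              # block granularity, in tokens
    xattention_stride=8,                   # antidiagonal sampling stride
)
```

For TRISHAPE mode, `num_retained_start_tokens_in_cache` and `num_retained_recent_tokens_in_cache` would be set instead; both must be multiples of the block size.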
Attributes
- property mode#
- property num_last_dense_tokens_in_prefill#
- property num_retained_recent_tokens_in_cache#
- property num_retained_start_tokens_in_cache#
- property xattention_block_size#
- property xattention_stride#
- property xattention_threshold#