Long context optimizations for LLM models#

Serving LLMs with very long contexts and prompts can be particularly challenging. The key goals are maximum throughput, minimal latency and reasonable memory consumption. Long contexts are very common in applications built on RAG chains, document summarization, question answering and many more. The optimizations below can significantly boost performance:

  • Prefix caching

  • KV cache compression

  • Max prompt length (NPU)

  • Cache interval multiplier for linear attention

  • Tuning max number of batched tokens

Prefix caching

Prefix caching in large language models (LLMs) is an optimization technique that improves performance when processing repeated or static parts of input prompts. Instead of recomputing the KV cache for the same prefix (e.g., a fixed instruction or context), the server keeps it after the request has been processed and the response has been returned. When the same prefix is encountered again, the cached KV is reused, skipping redundant computation. This reduces latency and computational overhead, especially in scenarios like chatbots or applications with repetitive prompts. Prefix caching is enabled by default.
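The reuse logic described above can be sketched with a toy block-based cache. This is only a conceptual illustration, not the model server's implementation: the class name, the block size of 4 and the placeholder values are all hypothetical, and a real cache would store KV tensors keyed by hashes rather than whole token tuples.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical block size, kept small for readability


class ToyPrefixCache:
    """Toy model: each full block is keyed by the entire prefix that ends at it."""

    def __init__(self) -> None:
        self._blocks: Dict[Tuple[int, ...], str] = {}

    def process(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (reused_tokens, computed_tokens) for one request."""
        reused = 0
        # Walk the prompt block by block and stop at the first block not yet cached.
        while reused + BLOCK_SIZE <= len(tokens):
            key = tuple(tokens[: reused + BLOCK_SIZE])
            if key not in self._blocks:
                break
            reused += BLOCK_SIZE
        # The remaining tokens are computed; their full blocks are stored for later requests.
        for end in range(reused + BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            self._blocks[tuple(tokens[:end])] = "kv-placeholder"
        return reused, len(tokens) - reused


cache = ToyPrefixCache()
context = list(range(100, 116))           # 16 shared context tokens
first = cache.process(context + [1, 2])   # cold request: nothing to reuse
second = cache.process(context + [3, 4])  # same context, different question
print(first, second)
```

The second request reuses all 16 context tokens and only computes the two new ones, which is the effect that shortens first-token latency for repeated prefixes.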

KV cache compression

The KV cache stores the intermediate key and value tensors generated by the model’s attention layers for each token in the input sequence. This cache allows the model to avoid recomputing attention for previous tokens when generating new tokens, greatly speeding up inference for long contexts. For very long contexts or high concurrency, the KV cache can consume a large amount of memory (RAM or VRAM). Compression reduces this memory usage, enabling longer prompts or more parallel requests without running out of memory. This parameter applies only to pipelines with continuous batching and paged attention. It is not used with the NPU device.
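A rough back-of-the-envelope estimate shows why precision matters for long contexts. The sketch below uses the common approximation of one key and one value tensor per layer; the model dimensions (32 layers, 8 KV heads, head size 128) are hypothetical and not those of any model named in this document, and the estimate ignores quantization metadata such as scales and zero points.

```python
# Bytes needed to store one cache element at each precision.
BYTES_PER_ELEMENT = {"u4": 0.5, "u8": 1.0, "f16": 2.0, "f32": 4.0}


def kv_cache_gib(tokens: int, num_layers: int, num_kv_heads: int,
                 head_dim: int, precision: str) -> float:
    """Approximate KV cache size in GiB for a given number of cached tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    elements = 2 * num_layers * num_kv_heads * head_dim * tokens
    return elements * BYTES_PER_ELEMENT[precision] / 1024**3


# Hypothetical model dimensions, for illustration only.
for precision in ("f16", "u8", "u4"):
    size = kv_cache_gib(50_000, num_layers=32, num_kv_heads=8,
                        head_dim=128, precision=precision)
    print(f"{precision}: {size:.2f} GiB")
```

Under these assumptions, each halving of the element width halves the cache size, so u4 needs a quarter of the memory that f16 does for the same context.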

Max prompt length (NPU)

Because the NPU uses static memory allocation for prompt processing, a dedicated parameter was introduced for the NPU device: max_prompt_len. The default value of 1024 should be adjusted to the expected request size.

Cache interval multiplier for linear attention

This parameter is dedicated to models with linear attention and prefix caching enabled. It adjusts the allocation size for state blocks internally in the openvino.genai backend. To process long inputs with a low memory footprint, it is recommended to increase this parameter from the default value of 8 to a higher value such as 64.

Tuning max number of batched tokens

This parameter influences the behavior of the continuous batching algorithm and the size of the chunks that prompts are split into for batching. The default value of 256 tokens is efficient when concurrent processing is expected. When usually a single client connects to a local model, especially with long prompts, increasing the value might improve first token latency.
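To get a feel for the effect of the chunk size, the sketch below counts how many prefill steps a single long prompt needs. It is a deliberately simplified model that assumes one request has the whole token budget of every scheduling step to itself; the function name is hypothetical.

```python
import math


def prefill_steps(prompt_tokens: int, max_num_batched_tokens: int) -> int:
    """Number of chunks a prompt is split into when each step handles at most the given token budget."""
    return math.ceil(prompt_tokens / max_num_batched_tokens)


for budget in (256, 4096):
    print(budget, prefill_steps(50_000, budget))
```

With a 50,000-token prompt, the default budget of 256 results in 196 steps, while a budget of 4096 needs only 13, which is why a larger value can help a single client with long prompts.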

Deployment#

Let’s demonstrate all the optimizations combined and test them in a real-life scenario: sending multiple different questions about the same context. This illustrates the first token latency gain from prefix caching, the improved second token latency thanks to prompt lookup, and moderate memory consumption despite very long prompts and parallel execution.

Prepare the models directory:

mkdir models

Start the model server on GPU:

docker run --user $(id -u):$(id -g) -d --rm -v $(pwd)/models:/models:rw -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:2026.2.1-gpu --rest_port 8000 --model_repository_path /models --source_model OpenVINO/gpt-oss-20b-int4-ov --tool_parser gptoss --reasoning_parser gptoss --task text_generation --kv_cache_precision u4 --target_device GPU --cache_size 5 --max_num_batched_tokens 4096

Alternatively, start it on NPU (both commands publish port 8000, so run only one of them at a time):

docker run --user $(id -u):$(id -g) -d --rm -v $(pwd)/models:/models:rw -p 8000:8000 --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:2026.2.1-gpu --rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --max_prompt_len 16000 --tool_parser hermes3 --task text_generation --target_device NPU

Testing latency#

The vllm benchmark can measure model performance at a desired context length. Its prefix parameters also make it possible to measure the benefit of prefix caching. The command below generates a synthetic load with a configurable cached prompt length (50000) and new tokens length (10).

pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu
vllm bench serve --backend  openai --base-url http://localhost:8000/v3 --endpoint /completions --model  OpenVINO/gpt-oss-20b-int4-ov --tokenizer openai/gpt-oss-20b --prefix-repetition-prefix-len 50000 --prefix-repetition-suffix-len 10 --prefix-repetition-output-len 20 --prefix-repetition-num-prefixes 1  --num-prompts 1 --max_concurrency 1 --dataset-name prefix_repetition --num-warmups 1 --seed 1
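For a quick manual check without the benchmark tool, the scenario of several questions about one shared context can be sketched with the Python standard library. This is a minimal sketch under assumptions: it targets the same base URL and `/completions` endpoint as the command above, assumes the endpoint accepts an OpenAI-style streaming payload, and treats the first `data:` event as the first token. The function names and the prompt template are hypothetical.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000/v3"   # same base URL as in the benchmark command
MODEL = "OpenVINO/gpt-oss-20b-int4-ov"  # model served by the GPU deployment


def build_payload(context: str, question: str, max_tokens: int = 20) -> dict:
    """Put the shared context first so every request starts with the same prefix."""
    return {
        "model": MODEL,
        "prompt": f"{context}\n\nQuestion: {question}\nAnswer:",
        "max_tokens": max_tokens,
        "stream": True,
    }


def time_to_first_token(payload: dict) -> float:
    """Send a streaming request and return the seconds elapsed until the first data event."""
    request = urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        for raw_line in response:  # the response body is read line by line
            if raw_line.startswith(b"data:"):
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any data event arrived")
```

Calling `time_to_first_token(build_payload(document, q))` for several questions `q` over the same `document` should show a slow first request followed by much faster ones once the prefix is cached.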

Performance Comparison Table#

iGPU Platform: Intel(R) Core(TM) Ultra X7 368H

| Context Length (tokens) | TTFT No Cache (ms) | TPOT No Cache (ms) | TTFT Prefix Cached (ms) | TPOT Prefix Cached (ms) | KV Cache Usage (GB) |
|---|---|---|---|---|---|
| 1,000 | 943 | 30.48 | 175 | 28.88 | 0.02 |
| 12,000 | 6,496 | 38.19 | 328 | 37.80 | 0.2 |
| 25,000 | 21,381 | 39.87 | 408 | 48.50 | 0.3 |
| 50,000 | 88,159 | 69.46 | 945 | 82.42 | 0.6 |
| 100,000 | 248,777 | 119.20 | 1,258 | 102.36 | 1.5 |

These results confirm the gain from prefix caching for repeated tokens and demonstrate low KV cache usage, thanks to quantization, even with a long context.
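The TTFT columns of the iGPU table can be turned into speedup factors to make the gain easier to compare across context lengths. The numbers below are copied from the table; only the helper name is hypothetical.

```python
# (context_tokens, ttft_no_cache_ms, ttft_prefix_cached_ms), copied from the iGPU table
IGPU_TTFT = [
    (1_000, 943, 175),
    (12_000, 6_496, 328),
    (25_000, 21_381, 408),
    (50_000, 88_159, 945),
    (100_000, 248_777, 1_258),
]


def ttft_speedup(no_cache_ms: float, cached_ms: float) -> float:
    """How many times faster the first token arrives when the prefix is cached."""
    return no_cache_ms / cached_ms


speedups = {ctx: round(ttft_speedup(cold, warm), 1) for ctx, cold, warm in IGPU_TTFT}
for ctx, factor in speedups.items():
    print(f"{ctx:>7} tokens: {factor}x")
```

The factor grows with context length, from roughly 5x at 1,000 tokens to nearly 200x at 100,000 tokens, because the uncached prefill cost rises much faster than the cost of reusing a cached prefix.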

NPU Platform: Intel(R) Core(TM) Ultra X7 368H

| Context Length (tokens) | TTFT No Cache (ms) | TPOT No Cache (ms) | TTFT Prefix Cached (ms) | TPOT Prefix Cached (ms) |
|---|---|---|---|---|
| 500 | 1,514 | 76.62 | 1,491 | 77.36 |
| 1,000 | 1,366 | 78.10 | 1,374 | 79.18 |
| 2,000 | 2,662 | 79.74 | 1,518 | 80.09 |
| 4,000 | 6,505 | 76.75 | 2,509 | 77.37 |
| 8,000 | 15,432 | 76.74 | 3,285 | 77.51 |
| 16,000 | 43,117 | 80.30 | 5,356 | 80.97 |

This table shows the TTFT gain from prefix caching on the NPU device and a flat TPOT across the whole range of prompt lengths.

Cache Precision#

KV cache compression has minimal impact on accuracy, significantly reduces memory consumption and can shorten benchmark time. The default precision is u8, but it can be changed to u4, f16 or f32 with the --kv_cache_precision parameter.

| Context Length (tokens) | TTFT for precision u4 (ms) | Cache size for u4 (GB) | TTFT for precision u8 (ms) | Cache size for u8 (GB) |
|---|---|---|---|---|
| 50,000 | 945 | 0.7 | 985 | 1.5 |

Lower precision in KV Cache reduces the memory consumption and can also improve latency.

Max prompt length for NPU#

The --max_prompt_len parameter affects latency. It should be adjusted to the expected input length to optimize performance.

| Max prompt length | TTFT (ms) | TPOT (ms) |
|---|---|---|
| 16K | 2,514 | 77.17 |
| 4K | 2,183 | 53.19 |

Recommendations#

  • Take advantage of prefix caching to process repeated tokens faster.

  • Use u4 KV cache compression unless no compromise on accuracy is acceptable.

  • Set the KV cache size with the --cache_size parameter based on the available memory, expected concurrency and context length. A cache sized for the workload improves performance.

  • For NPU, set the --max_prompt_len parameter based on the expected input length. Lower max_prompt_len values reduce memory usage and latency.

  • For models with linear attention, like Qwen3.6-35B-A3B, set --cache_interval_multiplier=64 to reduce memory usage with prefix caching.

  • In a scenario with low concurrency and long context, increase max_num_batched_tokens to a higher number such as 4096, or even to the maximum model context.

Note: You can force lower concurrency on the server with the --rest_workers parameter, which by default allows as many connections as there are CPU cores. Alternatively, the limit can be set at the model level with --max_num_seqs.
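The recommendation on --cache_size can be approximated with simple arithmetic. The sketch below is a rough estimate under stated assumptions: it takes the iGPU measurement reported earlier (about 1.5 GB of KV cache at 100,000 tokens with the u4 deployment) as the per-token cost, assumes --cache_size is expressed in whole GB, ignores any sharing of cached prefixes between requests, and adds a hypothetical 20% headroom. Other models and precisions will have a different per-token cost.

```python
import math

# Per-token cost derived from the iGPU table: about 1.5 GB for 100,000 tokens (u4).
GB_PER_TOKEN_U4 = 1.5 / 100_000


def suggested_cache_size_gb(concurrent_requests: int, context_tokens: int,
                            gb_per_token: float = GB_PER_TOKEN_U4,
                            headroom: float = 1.2) -> int:
    """Round the estimated KV cache need up to a whole number of GB (at least 1)."""
    needed = concurrent_requests * context_tokens * gb_per_token * headroom
    return max(1, math.ceil(needed))


print(suggested_cache_size_gb(concurrent_requests=4, context_tokens=50_000))
print(suggested_cache_size_gb(concurrent_requests=1, context_tokens=100_000))
```

Under these assumptions, four parallel requests with 50,000-token contexts call for about 4 GB, while a single 100,000-token request fits in about 2 GB.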