Chunk prefill on Key and Value. #31917

intelgaoxiong · 2025-09-01T02:59:13Z

Details:

Optimize time to first token (TTFT) by reducing redundant computation in SDPA (scaled dot-product attention) on NPU.

Tickets:

EISW-181644

Copilot

Pull Request Overview

This PR implements chunked Key-Value (KV) cache support for Large Language Model (LLM) inference by extending the existing chunked prefill functionality to include KV chunking alongside the existing query (Q) chunking.

Introduces support for multiple prefill models with different past KV shapes for KV chunking
Adds new configuration option NPUW_LLM_PREFILL_ENABLE_KV_CHUNK to control KV chunking behavior
Implements KV cache management between chunks and tensor sharing optimizations
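To make the motivation concrete, here is a minimal Python sketch of one possible reading of the first bullet: if several prefill models are compiled with different past-KV capacities, each prompt chunk can run on the smallest model whose past-KV buffer covers the tokens processed so far, so early chunks do not attend over a full-length, mostly empty buffer. The function names, the capacity lists, and the cost model (counting attention-score elements per chunk) are illustrative assumptions and are not taken from this PR's code.

```python
from bisect import bisect_left


def pick_past_kv_capacity(past_len, capacities):
    """Return the smallest capacity (from a sorted list) that holds past_len tokens."""
    idx = bisect_left(capacities, past_len)
    if idx == len(capacities):
        raise ValueError(f"no prefill variant can hold {past_len} past tokens")
    return capacities[idx]


def sdpa_score_elements(prompt_len, chunk_size, capacities):
    """Count Q x K score elements over all chunks (hypothetical cost model).

    Each chunk of chunk_size queries attends over the statically shaped
    past-KV buffer of the chosen variant plus the chunk's own keys.
    """
    total = 0
    for past_len in range(0, prompt_len, chunk_size):
        capacity = pick_past_kv_capacity(past_len, capacities)
        total += chunk_size * (capacity + chunk_size)
    return total


# Assumed example sizes: a 1024-token prompt processed in 256-token chunks.
single_variant = sdpa_score_elements(1024, 256, [768])           # Q chunking only
multi_variant = sdpa_score_elements(1024, 256, [256, 512, 768])  # with KV chunking

print(single_variant, multi_variant)
```

Under these assumed sizes the single-variant setup evaluates 1,048,576 score elements and the multi-variant setup 720,896, which is the kind of redundant SDPA work the PR description says it aims to cut. The real shapes, the number of variants, and the selection rule are defined by the plugin and may differ.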

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
llm_infer_request.hpp	Updates class structure to support multiple prefill requests and KV chunking state tracking
llm_infer_request.cpp	Implements core KV chunking logic including chunk-to-chunk KV cache updates and tensor management
llm_compiled_model.hpp	Extends model to support multiple prefill models for KV chunking scenarios
llm_compiled_model.cpp	Implements parallel compilation of multiple prefill models and KV chunking configuration
base_sync_infer_request.cpp	Updates tensor allocation tracking for remote tensors
npuw.cpp	Registers the new KV chunk configuration option
npuw_private_properties.hpp	Adds the new KV chunk enable property definition
npuw.hpp	Defines the configuration option macro for KV chunking


src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.cpp

src/plugins/intel_npu/src/plugin/npuw/llm_compiled_model.cpp

[KV Chunk]Record the last infer chunk.
[KV Chunk]Fixed accuracy issue.
[KV Chunk]Refine code.
[KV Chunk]Fixed build error.
[KV Chunk]Change m_prefill_compiled to vector.
[KV Chunk]Changed m_prefill_requests to vector.
[KV Chunk]Parallel compilation and disable kv chunk with prefix caching.
[KV Chunk]Add NPUW_LLM_PREFILL_ENABLE_KV_CHUNK.
[KV Chunk]Enhance handle_set_remote_input.
[KV Chunk]Allocate IO with cache.
[KV Chunk]Share past KV between prefill and kvcache.
[KV Chunk]Use device mem for temp dense buffer.
[KV Chunk]check past kv sharing and pre-allocation device.
[KV Chunk]Log clean up.
[KV Chunk]Share past KV in LLMInferRequest.
[KV Chunk]Remove unnecessary changes.
[KV Chunk]Code clean up.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

github-actions bot added the category: build (OpenVINO cmake script / infra), category: NPU (OpenVINO NPU plugin), and category: NPUW (NPUW plugin) labels Sep 1, 2025

intelgaoxiong force-pushed the xiong/kv_chunking branch 3 times, most recently from a6abb36 to f6d612f Compare September 1, 2025 13:20

github-actions bot removed the category: build (OpenVINO cmake script / infra) label Sep 1, 2025

intelgaoxiong force-pushed the xiong/kv_chunking branch 20 times, most recently from 3800dd3 to cc74646 Compare September 5, 2025 14:47

intelgaoxiong marked this pull request as ready for review September 5, 2025 15:09

intelgaoxiong requested review from a team as code owners September 5, 2025 15:09

intelgaoxiong requested review from Copilot and dmatveev September 5, 2025 15:09

Copilot AI reviewed Sep 5, 2025


intelgaoxiong added 4 commits September 6, 2025 03:46

[KV Chunk]Debug RSS.

36de8a1

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Parallel compilation.

cc74646

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

intelgaoxiong force-pushed the xiong/kv_chunking branch from 498dccf to f66e59d Compare September 5, 2025 23:26

[KV Chunk]Code clean up.

f66e59d

Signed-off-by: intelgaoxiong <xiong.gao@intel.com> [KV Chunk]Remove unnecessary changes. Signed-off-by: intelgaoxiong <xiong.gao@intel.com> [KV Chunk]Code clean up. Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
