@intelgaoxiong intelgaoxiong commented Sep 1, 2025

Details:

  • Optimize time to first token (TTFT) by reducing redundant computation in scaled dot-product attention (SDPA) on NPU.
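
To give a rough feel for where the saving could come from, here is a minimal cost sketch. It is not taken from the PR: it assumes that without KV chunking every prefill chunk attends over a fixed-size KV buffer of `max_kv_len` entries, and that with KV chunking chunk `i` attends only over the `(i + 1) * chunk_len` entries that actually exist. The function names and the cost model are hypothetical, and the prompt length is assumed to be a multiple of the chunk length.

```cpp
#include <cstdint>

// Hypothetical cost model: number of query-key score pairs computed in SDPA
// during prefill. Assumes prompt_len is a multiple of chunk_len.

// Q chunking only (assumed): each chunk of chunk_len queries attends over a
// fixed-size KV buffer of max_kv_len entries, including padded ones.
std::uint64_t scores_fixed_kv(std::uint64_t prompt_len,
                              std::uint64_t chunk_len,
                              std::uint64_t max_kv_len) {
    const std::uint64_t chunks = prompt_len / chunk_len;
    return chunks * chunk_len * max_kv_len;
}

// Q + KV chunking (assumed): chunk i attends over the i earlier chunks
// plus itself, i.e. (i + 1) * chunk_len KV entries.
std::uint64_t scores_growing_kv(std::uint64_t prompt_len,
                                std::uint64_t chunk_len) {
    const std::uint64_t chunks = prompt_len / chunk_len;
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < chunks; ++i) {
        total += chunk_len * (i + 1) * chunk_len;
    }
    return total;
}
```

Under these assumptions, a 1024-token prompt with 256-token chunks and a 1024-entry KV buffer drops from 1,048,576 to 655,360 score pairs. The real saving depends on how many past-KV shapes are actually compiled.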

Tickets:

@github-actions github-actions bot added category: build OpenVINO cmake script / infra category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels Sep 1, 2025
@intelgaoxiong intelgaoxiong force-pushed the xiong/kv_chunking branch 3 times, most recently from a6abb36 to f6d612f Compare September 1, 2025 13:20
@github-actions github-actions bot removed the category: build OpenVINO cmake script / infra label Sep 1, 2025
@intelgaoxiong intelgaoxiong force-pushed the xiong/kv_chunking branch 20 times, most recently from 3800dd3 to cc74646 Compare September 5, 2025 14:47
@intelgaoxiong intelgaoxiong marked this pull request as ready for review September 5, 2025 15:09
@intelgaoxiong intelgaoxiong requested review from a team as code owners September 5, 2025 15:09

@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements chunked key-value (KV) cache support for large language model (LLM) inference by extending the existing chunked prefill functionality to cover KV chunking in addition to query (Q) chunking.

  • Introduces support for multiple prefill models with different past KV shapes for KV chunking
  • Adds new configuration option NPUW_LLM_PREFILL_ENABLE_KV_CHUNK to control KV chunking behavior
  • Implements KV cache management between chunks and tensor sharing optimizations
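
As an illustration of the first point, the sketch below picks, for each chunk, the smallest prefill model variant whose past-KV capacity covers the tokens processed so far. This is only an assumed selection policy; the function name, the capacity list, and the sorted-ascending requirement are hypothetical and do not reflect the actual logic in `llm_infer_request.cpp`.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper: past_kv_capacities holds the past-KV length each
// compiled prefill variant supports, sorted in ascending order.
// Returns the index of the smallest variant that can hold past_tokens,
// or std::nullopt if no variant is large enough.
std::optional<std::size_t> pick_prefill_variant(
        const std::vector<std::size_t>& past_kv_capacities,
        std::size_t past_tokens) {
    for (std::size_t i = 0; i < past_kv_capacities.size(); ++i) {
        if (past_kv_capacities[i] >= past_tokens) {
            return i;
        }
    }
    return std::nullopt;
}
```

With capacities `{256, 512, 1024}`, a chunk that follows 300 already processed tokens would run on the second variant (index 1), so its attention does not have to span the full 1024-entry buffer.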

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| llm_infer_request.hpp | Updates class structure to support multiple prefill requests and KV chunking state tracking |
| llm_infer_request.cpp | Implements core KV chunking logic including chunk-to-chunk KV cache updates and tensor management |
| llm_compiled_model.hpp | Extends model to support multiple prefill models for KV chunking scenarios |
| llm_compiled_model.cpp | Implements parallel compilation of multiple prefill models and KV chunking configuration |
| base_sync_infer_request.cpp | Updates tensor allocation tracking for remote tensors |
| npuw.cpp | Registers the new KV chunk configuration option |
| npuw_private_properties.hpp | Adds the new KV chunk enable property definition |
| npuw.hpp | Defines the configuration option macro for KV chunking |
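
For orientation, a caller might toggle the new option through a string property map along the lines of the sketch below. Only the key `NPUW_LLM_PREFILL_ENABLE_KV_CHUNK` comes from this PR; the helper name is hypothetical, and the `"YES"`/`"NO"` value spelling is an assumption based on how other NPUW boolean options are commonly written, so check the property definition before relying on it.

```cpp
#include <map>
#include <string>

// Hypothetical helper: builds a property fragment that turns KV chunking
// on or off. The "YES"/"NO" value format is an assumption, not confirmed
// by this PR.
std::map<std::string, std::string> make_kv_chunk_config(bool enable) {
    return {{"NPUW_LLM_PREFILL_ENABLE_KV_CHUNK", enable ? "YES" : "NO"}};
}
```

Such a fragment would be merged into the other properties supplied when the model is compiled for the NPU device. Per the commit list, KV chunking is disabled when prefix caching is in use.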


Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Record the last infer chunk.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Fixed accuracy issue.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Refine code.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Fixed build error.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Change m_prefill_compiled to vector.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Changed m_prefill_requests to vector.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Parallel compilation and disable kv chunk with prefix caching.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Add NPUW_LLM_PREFILL_ENABLE_KV_CHUNK.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Enhance handle_set_remote_input.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Allocate IO with cache.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Share past KV between prefill and kvcache.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Use device mem for temp dense buffer.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]check past kv sharing and pre-allocation device.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Log clean up.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Share past KV in LLMInferRequest.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Remove unnecessary changes.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>

[KV Chunk]Code clean up.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
Labels
category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin