Chunk prefill on Key and Value. #31917
Pull Request Overview
This PR adds chunked Key-Value (KV) cache support for Large Language Model (LLM) inference by extending chunked prefill to chunk the KV cache alongside the existing query (Q) chunking.
- Introduces support for multiple prefill models with different past KV shapes for KV chunking
- Adds new configuration option NPUW_LLM_PREFILL_ENABLE_KV_CHUNK to control KV chunking behavior
- Implements KV cache management between chunks and tensor sharing optimizations
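To make the first bullet concrete, here is a minimal sketch of one way a runtime could pick among several prefill models compiled with different past-KV lengths. It is an illustration under assumptions, not the PR's implementation: the names `ChunkPlan`, `select_variant`, and `plan_prefill` are hypothetical, and the selection rule (use the smallest variant whose past-KV capacity covers the tokens already processed) is an assumed policy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one prefill chunk; not a type from the PR.
struct ChunkPlan {
    std::size_t start;    // index of the first prompt token in this chunk
    std::size_t length;   // number of prompt tokens in this chunk
    std::size_t variant;  // index of the prefill variant chosen for it
};

// Assumed policy: pick the first variant whose past-KV capacity can hold
// `past_tokens`. `capacities` must be sorted ascending. If nothing fits,
// this sketch falls back to the largest variant.
std::size_t select_variant(std::size_t past_tokens,
                           const std::vector<std::size_t>& capacities) {
    assert(!capacities.empty());
    for (std::size_t i = 0; i < capacities.size(); ++i) {
        if (capacities[i] >= past_tokens) {
            return i;
        }
    }
    return capacities.size() - 1;
}

// Split the prompt into Q chunks and attach a variant to each chunk.
std::vector<ChunkPlan> plan_prefill(std::size_t prompt_len,
                                    std::size_t chunk_size,
                                    const std::vector<std::size_t>& capacities) {
    assert(chunk_size > 0);
    std::vector<ChunkPlan> plan;
    for (std::size_t start = 0; start < prompt_len; start += chunk_size) {
        const std::size_t length = std::min(chunk_size, prompt_len - start);
        // Tokens before `start` are already in the past KV cache.
        plan.push_back({start, length, select_variant(start, capacities)});
    }
    return plan;
}
```

With a 10-token prompt, a chunk size of 4, and variants sized 4, 8, and 16, the first two chunks would use the smallest variant and the last chunk the second one, so early chunks avoid attending over a full-length past KV.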
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
File | Description |
---|---|
llm_infer_request.hpp | Updates class structure to support multiple prefill requests and KV chunking state tracking |
llm_infer_request.cpp | Implements core KV chunking logic including chunk-to-chunk KV cache updates and tensor management |
llm_compiled_model.hpp | Extends model to support multiple prefill models for KV chunking scenarios |
llm_compiled_model.cpp | Implements parallel compilation of multiple prefill models and KV chunking configuration |
base_sync_infer_request.cpp | Updates tensor allocation tracking for remote tensors |
npuw.cpp | Registers the new KV chunk configuration option |
npuw_private_properties.hpp | Adds the new KV chunk enable property definition |
npuw.hpp | Defines the configuration option macro for KV chunking |
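The chunk-to-chunk KV cache update described for llm_infer_request.cpp can be pictured with a small sketch. Everything here is assumed for illustration: the `PastKV` type, the single-head row-major layout, and the `append_chunk` and `grow` helpers are hypothetical and do not reflect the actual tensor layout or API in the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical past-KV buffer for one head: `capacity` rows of `dim` floats,
// stored row-major, with the first `filled` rows valid.
struct PastKV {
    std::size_t capacity;
    std::size_t dim;
    std::size_t filled = 0;
    std::vector<float> data;

    PastKV(std::size_t cap, std::size_t d)
        : capacity(cap), dim(d), data(cap * d, 0.0f) {}
};

// Append the rows a chunk just produced to the end of the valid region.
void append_chunk(PastKV& past, const std::vector<float>& present, std::size_t rows) {
    assert(present.size() == rows * past.dim);
    assert(past.filled + rows <= past.capacity);
    std::copy(present.begin(), present.end(),
              past.data.data() + past.filled * past.dim);
    past.filled += rows;
}

// Carry the valid rows over to a buffer shaped for a larger prefill variant.
PastKV grow(const PastKV& past, std::size_t new_capacity) {
    assert(new_capacity >= past.filled);
    PastKV bigger(new_capacity, past.dim);
    std::copy(past.data.data(), past.data.data() + past.filled * past.dim,
              bigger.data.data());
    bigger.filled = past.filled;
    return bigger;
}
```

The copy in `grow` is the cost that tensor sharing would aim to avoid; this sketch keeps the copy explicit only to show what has to stay consistent between chunks.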
Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
- [KV Chunk]Record the last infer chunk.
- [KV Chunk]Fixed accuracy issue.
- [KV Chunk]Refine code.
- [KV Chunk]Fixed build error.
- [KV Chunk]Change m_prefill_compiled to vector.
- [KV Chunk]Changed m_prefill_requests to vector.
- [KV Chunk]Parallel compilation and disable kv chunk with prefix caching.
- [KV Chunk]Add NPUW_LLM_PREFILL_ENABLE_KV_CHUNK.
- [KV Chunk]Enhance handle_set_remote_input.
- [KV Chunk]Allocate IO with cache.
- [KV Chunk]Share past KV between prefill and kvcache.
- [KV Chunk]Use device mem for temp dense buffer.
- [KV Chunk]check past kv sharing and pre-allocation device.
- [KV Chunk]Log clean up.
- [KV Chunk]Share past KV in LLMInferRequest.
- [KV Chunk]Remove unnecessary changes.
- [KV Chunk]Code clean up.
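One commit above mentions disabling KV chunking when prefix caching is in use. A rough sketch of that gating follows, under several assumptions: the configuration is modelled as a plain string map, flags use "YES"/"NO" values, a missing key counts as disabled, and the key `HYPOTHETICAL_PREFIX_CACHING` is a placeholder rather than a real property name. Only NPUW_LLM_PREFILL_ENABLE_KV_CHUNK comes from the PR.

```cpp
#include <cassert>
#include <map>
#include <string>

// Assumed representation of plugin configuration: string keys and values.
using Config = std::map<std::string, std::string>;

// Read a YES/NO style flag; anything other than "YES" counts as disabled.
// Treating a missing key as disabled is an assumption of this sketch.
bool flag_enabled(const Config& cfg, const std::string& key) {
    const auto it = cfg.find(key);
    return it != cfg.end() && it->second == "YES";
}

// KV chunking takes effect only when it is requested and prefix caching is off.
// "HYPOTHETICAL_PREFIX_CACHING" is a placeholder key, not a real property.
bool kv_chunking_active(const Config& cfg) {
    return flag_enabled(cfg, "NPUW_LLM_PREFILL_ENABLE_KV_CHUNK") &&
           !flag_enabled(cfg, "HYPOTHETICAL_PREFIX_CACHING");
}
```

Resolving the effective setting in one place like this keeps the rest of the request code from having to re-check the interaction between the two features.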