Description
Motivation.
Currently, vLLM-Ascend supports multi-token prediction (MTP) only with MTP=1, which improves throughput to some extent. However, this setting underutilizes the parallelism of Ascend hardware and becomes a bottleneck in high-throughput and low-latency scenarios.
Supporting MTP>1 would let each decoding step produce multiple tokens, which reduces per-iteration overhead, lowers latency, and increases throughput. It would also make better use of the hardware and meet real-world deployment needs.
Proposed Change.
We propose to extend the current vLLM-Ascend backend decoding pipeline from MTP=1 to MTP>1.
The main changes include:
- Decoding kernel: Enable generation of multiple tokens per step instead of restricting each step to a single token.
- Sampling: Only argmax is currently supported under MTP=1. We plan to extend the sampling algorithms (e.g., top-k and nucleus sampling) to handle multiple tokens per step, so that MTP>1 brings practical benefits beyond deterministic decoding.
- Integration: Align backend execution with the existing vLLM interface, which already supports MTP configuration.
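To make the sampling item concrete, here is a minimal pure-Python sketch of how several draft tokens could be verified in one step. It is an illustration only, not the vLLM-Ascend implementation. The function names `greedy_accept` and `rejection_sample_token` are hypothetical, and the second function uses the standard speculative-sampling acceptance rule as an assumed way to support non-argmax sampling. Any top-k or nucleus filtering is assumed to have been applied to `target_probs` beforehand.

```python
import random
from typing import List, Sequence, Tuple


def greedy_accept(draft_tokens: Sequence[int],
                  target_tokens: Sequence[int]) -> List[int]:
    """Tokens emitted in one step under argmax verification (hypothetical helper).

    draft_tokens:  k tokens proposed by the MTP draft.
    target_tokens: k + 1 argmax tokens from the main model; target_tokens[i]
                   is its choice for the slot draft_tokens[i] occupies, and
                   target_tokens[k] is the bonus token after all drafts.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("expected one more target token than draft tokens")
    emitted: List[int] = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            # First mismatch: keep the main model's token and stop.
            emitted.append(target)
            return emitted
        emitted.append(draft)
    # Every draft matched, so the bonus token is emitted as well.
    emitted.append(target_tokens[-1])
    return emitted


def rejection_sample_token(draft_token: int,
                           draft_probs: Sequence[float],
                           target_probs: Sequence[float],
                           rng: random.Random) -> Tuple[int, bool]:
    """Verify one sampled draft token (assumed speculative-sampling rule).

    Accept with probability min(1, p/q); on rejection, resample from the
    normalized residual max(0, p - q). Returns (token, was_accepted).
    """
    q = draft_probs[draft_token]
    p = target_probs[draft_token]
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    residual = [max(0.0, pt - qd) for pt, qd in zip(target_probs, draft_probs)]
    if sum(residual) <= 0.0:
        # Degenerate case: fall back to the target distribution itself.
        residual = list(target_probs)
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False
```

For example, with MTP=3 the call `greedy_accept([5, 7, 9], [5, 7, 2, 4])` keeps the two matching drafts and then the main model's correction, so the step emits three tokens instead of one. With sampling enabled, `rejection_sample_token` would be applied position by position, stopping at the first rejection.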
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response