[RFC]: Support MTP > 1 for DeepSeek #2745

### Motivation.

Currently, vLLM-Ascend supports multi-token prediction (MTP) only with MTP=1, which improves throughput to some extent. However, this setting underutilizes the parallelism of Ascend hardware and becomes a bottleneck in high-throughput or low-latency scenarios.

Supporting MTP>1 would let each decoding step produce multiple tokens, which reduces per-iteration overhead, lowers latency, and increases throughput. It would also make fuller use of the hardware and better meet real-world deployment needs.

### Proposed Change.

We propose extending the decoding pipeline of the vLLM-Ascend backend from MTP=1 to MTP>1.
The main changes are:

- **Decoding kernel**: Enable the generation of multiple tokens per step instead of restricting each step to a single token.

- **Sampling**: Only argmax (greedy) sampling is currently supported under MTP=1. We plan to extend the sampling algorithms (e.g., top-k and nucleus sampling) to accept multiple tokens per step, so that MTP>1 also benefits non-deterministic decoding.

- **Integration**: Align backend execution with the existing vLLM interface, which already supports MTP configuration.
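
To make the sampling item more concrete, here is a minimal sketch of how several draft tokens could be verified in one step. It is not the vLLM-Ascend implementation: the function names `verify_greedy` and `verify_rejection` are hypothetical, and the second function uses the standard speculative-sampling accept/reject rule as an assumed stand-in for whatever sampler the backend ends up using. Probabilities are plain Python lists over a toy vocabulary.

```python
import random
from typing import List, Sequence


def verify_greedy(draft_tokens: Sequence[int],
                  target_argmax: Sequence[int]) -> List[int]:
    """Greedy (argmax) verification of k draft tokens.

    target_argmax holds k + 1 entries: the target model's argmax at each
    draft position, plus one position after the last draft token.
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("target_argmax must have len(draft_tokens) + 1 entries")
    accepted: List[int] = []
    for draft, target in zip(draft_tokens, target_argmax):
        if draft != target:
            # First mismatch: emit the target's token instead and stop.
            accepted.append(target)
            return accepted
        accepted.append(draft)
    # Every draft token matched, so the extra "bonus" token is emitted too.
    accepted.append(target_argmax[-1])
    return accepted


def verify_rejection(draft_tokens: Sequence[int],
                     draft_probs: Sequence[Sequence[float]],
                     target_probs: Sequence[Sequence[float]],
                     rng: random.Random) -> List[int]:
    """Sampling-based verification (assumed rule: standard speculative sampling).

    draft_probs[i] and target_probs[i] are distributions over the vocabulary
    at draft position i; target_probs has one extra trailing distribution.
    """
    if len(target_probs) != len(draft_tokens) + 1:
        raise ValueError("target_probs must have len(draft_tokens) + 1 entries")
    out: List[int] = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        # Accept the draft token with probability min(1, p / q).
        if q > 0 and rng.random() < min(1.0, p / q):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(p - q, 0) and stop.
        residual = [max(pt - qd, 0.0)
                    for pt, qd in zip(target_probs[i], draft_probs[i])]
        weights = residual if sum(residual) > 0 else list(target_probs[i])
        out.append(rng.choices(range(len(weights)), weights=weights)[0])
        return out
    # All draft tokens accepted: sample the bonus token from the target.
    last = list(target_probs[-1])
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out
```

With `k` draft tokens, either function emits between 1 and `k + 1` tokens per step, which is where the reduction in per-iteration overhead would come from. Top-k or nucleus filtering could be applied to `target_probs` before calling `verify_rejection`, though how that composes with the Ascend kernels is outside this sketch.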

### Feedback Period.

_No response_

### CC List.

_No response_

### Any Other Things.

_No response_
