[RFC]: Support MTP > 1 for DeepSeek #2745

@1092626063

Motivation.

vLLM-Ascend currently supports only MTP=1. This improves throughput to some extent, but it underutilizes the parallelism of Ascend hardware and becomes a bottleneck in high-throughput or low-latency scenarios.

Supporting MTP>1 would allow multiple tokens to be decoded per step, reducing per-iteration overhead, lowering latency, and increasing throughput. This would make better use of the hardware and better meet real-world deployment needs.

Proposed Change.

We propose to extend the current vLLM-Ascend backend decoding pipeline from MTP=1 to MTP>1.
The main changes include:

  • Decoding kernel: Enable generation of multiple tokens per step instead of restricting each step to a single token.

  • Sampling: Currently only argmax is supported under MTP=1. We plan to extend sampling algorithms (e.g., top-k, nucleus sampling) to accept multiple tokens per step, so that MTP>1 can bring practical benefits beyond deterministic decoding.

  • Integration: Align backend execution with the existing vLLM interface, which already supports MTP configuration.
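
To make the sampling point concrete, here is a minimal sketch of argmax (greedy) verification when a step carries more than one proposed token. It is only an illustration of the general idea under stated assumptions, not the vLLM-Ascend implementation: the names `greedy_verify`, `draft_tokens`, and `target_logits` are hypothetical, and the logits are plain Python lists rather than device tensors.

```python
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def greedy_verify(draft_tokens: Sequence[int],
                  target_logits: Sequence[Sequence[float]]) -> List[int]:
    """Return the tokens emitted in one decoding step under argmax verification.

    draft_tokens:  k tokens proposed for this step (hypothetical MTP output).
    target_logits: k + 1 rows of main-model logits. Row i scores the position
                   that draft_tokens[i] was proposed for; the last row scores
                   the position after the final draft token.
    """
    if len(target_logits) != len(draft_tokens) + 1:
        raise ValueError("expected one more logits row than draft tokens")

    emitted: List[int] = []
    for i, draft in enumerate(draft_tokens):
        target = argmax(target_logits[i])
        emitted.append(target)
        if target != draft:
            # First mismatch: keep the main model's token, drop later drafts.
            return emitted
    # Every draft token matched, so the extra row yields one bonus token.
    emitted.append(argmax(target_logits[-1]))
    return emitted


# Toy vocabulary of size 4; each row is one-hot on the main model's choice.
rows = [[0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1]]
print(greedy_verify([2, 1, 3], rows))  # third draft token is rejected
```

With MTP=1 this loop runs at most once per step, so a step emits at most two tokens. With k draft tokens a step can emit up to k + 1. Extending beyond argmax (top-k, nucleus sampling) would replace the equality check with a probabilistic acceptance rule, which this sketch does not cover.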

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response
