Conversation

@whx-sjtu (Contributor) commented Sep 12, 2025

Fix GLM-4.5 MoE accuracy bug.
This is only a temporary PR intended for quick runs. I will later submit a PR to vLLM to solve this problem properly.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request aims to fix an accuracy bug in the GLM-4.5 MoE model. The changes refactor the MoE dispatcher selection, register a new custom GLM-4 MoE model for Ascend, and modify the MoE forward-pass logic. While the intent is to fix an accuracy issue, I've found a critical bug in the new CustomGlm4MoE forward pass related to incorrect tensor-parallel reductions, which will likely cause its own accuracy problems. Additionally, the new dispatcher selection logic can crash with a KeyError. My review includes suggestions to address these critical issues.

Comment on lines +96 to 97
dispatcher_name = _moe_method_to_dispatcher[moe_comm_method]
dispatcher = get_token_dispatcher(dispatcher_name)

critical

The direct dictionary access _moe_method_to_dispatcher[moe_comm_method] is not robust. If moe_comm_method contains a value that is not a key in _moe_method_to_dispatcher, this will raise a KeyError and crash the program. It's safer to use .get() and handle the case of an invalid method by raising a more informative ValueError.

Suggested change
-dispatcher_name = _moe_method_to_dispatcher[moe_comm_method]
-dispatcher = get_token_dispatcher(dispatcher_name)
+dispatcher_name = _moe_method_to_dispatcher.get(moe_comm_method)
+if dispatcher_name is None:
+    raise ValueError(
+        f"Unknown MoE communication method: {moe_comm_method}. "
+        f"Available methods: {list(_moe_method_to_dispatcher.keys())}")
+dispatcher = get_token_dispatcher(dispatcher_name)
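
For context, below is a minimal, self-contained sketch of the lookup-with-validation pattern this suggestion describes. The mapping contents, the get_token_dispatcher stub, and the select_token_dispatcher helper are illustrative placeholders, not the actual vllm-ascend implementations.

# Hypothetical mapping from MoE communication method names to dispatcher
# names; the keys and values here are placeholders for illustration only.
_moe_method_to_dispatcher = {
    "allgathercommimpl": "TokenDispatcherWithAllGather",
    "alltoallcommimpl": "TokenDispatcherWithAll2AllV",
    "mc2commimpl": "TokenDispatcherWithMC2",
}

# Hypothetical registry of dispatcher instances keyed by name.
_dispatcher_registry: dict = {}

def get_token_dispatcher(name: str):
    # Stand-in for the real factory/registry lookup.
    return _dispatcher_registry.setdefault(name, object())

def select_token_dispatcher(moe_comm_method: str):
    # .get() lets an unsupported method fail with an informative ValueError
    # instead of an opaque KeyError deep inside the MoE layer.
    dispatcher_name = _moe_method_to_dispatcher.get(moe_comm_method)
    if dispatcher_name is None:
        raise ValueError(
            f"Unknown MoE communication method: {moe_comm_method}. "
            f"Available methods: {list(_moe_method_to_dispatcher.keys())}")
    return get_token_dispatcher(dispatcher_name)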

Comment on lines +72 to +74
if moe_comm_method_name in {"alltoallcommimpl", "mc2commimpl"}:
shared_output = tensor_model_parallel_all_reduce(shared_output)
final_hidden_states = final_hidden_states + shared_output

critical

There appears to be a tensor distribution mismatch that will likely lead to incorrect results. shared_output is a tensor partitioned by tokens across tensor-parallel ranks. The tensor_model_parallel_all_reduce(shared_output) call incorrectly makes it a replicated tensor. This replicated tensor is then added to final_hidden_states (the output of self.experts), which is also a partitioned tensor. This addition will perform broadcasting instead of the intended element-wise addition of corresponding token partitions, causing an accuracy issue. The all_reduce operation should be performed on the combined result, not on shared_output alone. The subsequent call to maybe_all_reduce_tensor_model_parallel should handle the necessary reduction.

Suggested change
            final_hidden_states = final_hidden_states + shared_output

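To make the intended data flow concrete, here is a hedged, self-contained sketch of the combine-then-reduce ordering the review describes: add the shared-expert output to the routed-expert output first, then apply a single tensor-parallel reduction to the combined tensor. SharedExpertStub, moe_combine, and the identity default for tp_all_reduce are illustrative stand-ins; in vllm-ascend the reduction on the combined result would be handled by maybe_all_reduce_tensor_model_parallel.

import torch
import torch.nn as nn

class SharedExpertStub(nn.Module):
    # Placeholder shared-expert MLP; real models use gated MLPs.
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

def moe_combine(shared_experts: nn.Module,
                hidden_states: torch.Tensor,
                routed_output: torch.Tensor,
                tp_all_reduce=lambda t: t) -> torch.Tensor:
    # Both branches produce tensors in the same token partition.
    shared_output = shared_experts(hidden_states)
    # Element-wise add of corresponding token partitions; no all-reduce on
    # shared_output alone, so there is no replicated-vs-partitioned mismatch.
    combined = routed_output + shared_output
    # A single tensor-parallel reduction is applied to the combined result.
    return tp_all_reduce(combined)

# Example usage on one rank (tp_all_reduce defaults to the identity).
hidden = torch.randn(4, 8)
routed = torch.randn(4, 8)
out = moe_combine(SharedExpertStub(8), hidden, routed)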

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@weijinqian0 (Collaborator) approved

Signed-off-by: whx-sjtu <2952154980@qq.com>

This pull request has conflicts, please resolve those before we can evaluate the pull request.
