[BugFix] Fix glm 4.5 moe accuracy bug. #2898
base: main
Conversation
Code Review
This pull request aims to fix an accuracy bug in the GLM-4.5 MoE model. The changes refactor the MoE dispatcher selection, register a new custom GLM-4 MoE model for Ascend, and modify the MoE forward-pass logic. While the intent is to fix an accuracy issue, I found a critical bug in the new CustomGlm4MoE forward pass: incorrect tensor-parallel reductions that will likely cause accuracy problems of their own. Additionally, the new dispatcher selection logic can crash with a KeyError. My review includes suggestions to address these critical issues.
dispatcher_name = _moe_method_to_dispatcher[moe_comm_method]
dispatcher = get_token_dispatcher(dispatcher_name)
The direct dictionary access _moe_method_to_dispatcher[moe_comm_method] is not robust. If moe_comm_method holds a value that is not a key in _moe_method_to_dispatcher, this will raise a KeyError and crash the program. It is safer to use .get() and handle an invalid method by raising a more informative ValueError.
Suggested change:
- dispatcher_name = _moe_method_to_dispatcher[moe_comm_method]
- dispatcher = get_token_dispatcher(dispatcher_name)
+ dispatcher_name = _moe_method_to_dispatcher.get(moe_comm_method)
+ if dispatcher_name is None:
+     raise ValueError(
+         f"Unknown MoE communication method: {moe_comm_method}. "
+         f"Available methods: {list(_moe_method_to_dispatcher.keys())}")
+ dispatcher = get_token_dispatcher(dispatcher_name)
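As a standalone illustration of the suggested pattern (the mapping values below are hypothetical placeholders, not the actual vllm-ascend dispatcher names), a .get()-based lookup turns an opaque KeyError into an actionable ValueError that lists the valid options:

```python
# Hypothetical mapping for illustration only; real names live in vllm-ascend.
_moe_method_to_dispatcher = {
    "alltoallcommimpl": "TokenDispatcherWithAllToAll",
    "mc2commimpl": "TokenDispatcherWithMC2",
}

def select_dispatcher(moe_comm_method: str) -> str:
    """Resolve a comm method to a dispatcher name, failing with a clear message."""
    dispatcher_name = _moe_method_to_dispatcher.get(moe_comm_method)
    if dispatcher_name is None:
        raise ValueError(
            f"Unknown MoE communication method: {moe_comm_method}. "
            f"Available methods: {list(_moe_method_to_dispatcher.keys())}")
    return dispatcher_name

print(select_dispatcher("mc2commimpl"))  # TokenDispatcherWithMC2
```

An unknown method now produces "Unknown MoE communication method: ..." with the list of valid keys, instead of a bare KeyError traceback.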
if moe_comm_method_name in {"alltoallcommimpl", "mc2commimpl"}:
    shared_output = tensor_model_parallel_all_reduce(shared_output)
final_hidden_states = final_hidden_states + shared_output
There appears to be a tensor distribution mismatch that will likely lead to incorrect results. shared_output is a tensor partitioned by tokens across tensor-parallel ranks. The tensor_model_parallel_all_reduce(shared_output) call incorrectly turns it into a replicated tensor. This replicated tensor is then added to final_hidden_states (the output of self.experts), which is also a partitioned tensor, so the addition broadcasts instead of adding corresponding token partitions element-wise, causing an accuracy issue. The all_reduce should be performed on the combined result, not on shared_output alone; the subsequent call to maybe_all_reduce_tensor_model_parallel should handle the necessary reduction.
final_hidden_states = final_hidden_states + shared_output
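To see why the order of reduction matters, here is a minimal NumPy simulation (not vllm code). It models tensor_model_parallel_all_reduce as a plain sum across two simulated ranks and, for illustration, assumes each rank holds a partial sum of the same logical tensor. Reducing shared_output early and then reducing the combined result downstream counts the shared contribution once per rank; adding the partials first and reducing once at the end gives the correct answer:

```python
import numpy as np

WORLD_SIZE = 2

def all_reduce(partials):
    """Simulated all-reduce: every rank ends up with the sum of all partials."""
    total = sum(partials)
    return [total.copy() for _ in range(len(partials))]

# Per-rank partial tensors (illustrative values, one entry per simulated rank).
shared_partials = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
expert_partials = [np.array([10.0, 10.0]), np.array([20.0, 20.0])]

# Ground truth: each contribution summed exactly once.
expected = sum(shared_partials) + sum(expert_partials)  # [34. 36.]

# Buggy order: reduce shared_output first, add, then reduce the combined result.
shared_reduced = all_reduce(shared_partials)
combined = [s + e for s, e in zip(shared_reduced, expert_partials)]
buggy = all_reduce(combined)[0]  # shared contribution counted WORLD_SIZE times

# Suggested order: add the partial tensors first, reduce once at the end.
correct = all_reduce(
    [s + e for s, e in zip(shared_partials, expert_partials)])[0]

print(buggy)    # [38. 42.] -- off by one extra copy of the shared sum [4. 6.]
print(correct)  # [34. 36.]
```

The gap between the two results is exactly (WORLD_SIZE - 1) extra copies of the shared-expert sum, which is the kind of systematic drift that shows up as an accuracy regression.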
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Force-pushed from 11dea05 to dc20609
approved
Signed-off-by: whx-sjtu <2952154980@qq.com>
Force-pushed from dc20609 to 0ce5bb5
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Fix glm 4.5 moe accuracy bug.
This is only a temporary PR used for a quick fix. Later I will make a PR to vllm to solve this problem formally.