-
Hi there! Thank you for your detailed post and for sharing your experimentation process with the community. I'm going to move this to our discussions section for now, as this appears to be more of an exploratory investigation that could benefit from some additional refinement before we can draw definitive conclusions about any potential issues.

Library Attribution Clarification

I noticed there might be some confusion about which library is responsible for the behaviors you're observing. In your analysis, you're primarily using and examining PEFT's merging code, not anything Unsloth-specific.
So you might want to have a conversation with the PEFT library maintainers, but before doing so, read the rest of this response. After you've established and troubleshot the baseline behavior, it would then be more appropriate to provide a direct comparison between Unsloth's merging and PEFT's merging, with concrete evidence. Note that any performance differentials should be backed up by proper evaluations using an appropriate dataset and a representative evaluation set size (not just a single sample).

Evaluation Methodology Considerations

Your investigation raises some interesting points about numerical precision, but there are a few methodological aspects that might strengthen your analysis:

- Sample Size & Representativeness: conclusions drawn from a single prompt are fragile; compare over a representative set of inputs.
- Precision Expectations: bit-exact equality between a merged model and base + adapter is not a realistic expectation in bfloat16, since rounding depends on the order of operations and casting.
- Task-Relevant Evaluation: small numerical differences only matter if they move a task-relevant metric, so compare downstream performance rather than raw tensor diffs alone.
- Sampling & Determinism: make sure generation is deterministic (e.g., greedy decoding, fixed seeds) before attributing output differences to merging.
Suggested Approach

Note that Unsloth doesn't completely replace PEFT + Transformers + TRL. While we optimize some of the code, we actually rely on these core libraries. I'd suggest first running a baseline experiment with plain PEFT + Transformers to establish the reference behavior, along the lines of the sketch below.
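Something like this would do as a starting point (a rough sketch only; the model name, adapter path, and prompts are placeholders for your own):

```python
# Baseline: compare base + adapter against PEFT's own merge, with no Unsloth involved.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.2-1B"        # placeholder: your base model
ADAPTER = "path/to/your/lora-adapter"   # placeholder: your trained adapter
PROMPTS = ["Prompt one ...", "Prompt two ...", "Prompt three ..."]  # use many, not one

tok = AutoTokenizer.from_pretrained(BASE)

def generate(model, prompt):
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Greedy decoding so the comparison is deterministic
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return tok.decode(out[0], skip_special_tokens=True)

# 1) Base model + LoRA adapter, unmerged
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)
unmerged_outputs = [generate(model, p) for p in PROMPTS]

# 2) The same adapter merged into the base weights by PEFT itself
merged = model.merge_and_unload(safe_merge=True)
merged_outputs = [generate(merged, p) for p in PROMPTS]

for prompt, a, b in zip(PROMPTS, unmerged_outputs, merged_outputs):
    print(repr(prompt), "| outputs match:", a == b)
```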
Once that baseline is established, it might then be worth implementing the same experiment using purely Unsloth methods and sharing your findings here. We'll be glad to take a look. Also, feel free to keep updating this thread as you run through your baseline experiment and your discussion with the PEFT maintainers, and let us know how it goes. Thank you!
-
So, in summary, I am trying to use the merged model with the LoRA-adapted layers. The first issue I faced is that save_pretrained_merged is not working as expected. The basic comparison I arrived at after analyzing the code and reading the LoRA paper is that the merged weight should equal the base weight plus the scaled LoRA update (W_base + scaling * lora_B @ lora_A), and for this layer the difference with the merged weight was not zero. I found that the safe_merge functionality you are using inside merge in layers.py is doing the expected thing. The difference between the safe_merge and non-safe code paths comes down to casting: the LoRA weights are in full precision (float32) while the base_layer is in bfloat16, so whenever the order of operations and casting changes, the result changes. The order matters, and if we change it we cannot reproduce the same values; a toy example is below.
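To make the cast-order point concrete, here is a toy, self-contained example (random stand-in tensors; this is not PEFT's actual merge code, just the arithmetic):

```python
# Why the order of casting matters when merging LoRA deltas into a bfloat16 base weight.
import torch

torch.manual_seed(0)
out_features, in_features, r = 64, 64, 8
scaling = 2.0  # lora_alpha / r

base_w = torch.randn(out_features, in_features).bfloat16()  # base layer stored in bfloat16
lora_A = torch.randn(r, in_features)                        # LoRA factors kept in float32
lora_B = torch.randn(out_features, r)

delta = (lora_B @ lora_A) * scaling                         # float32 delta, as in the LoRA paper

# Variant 1: cast the delta down to bfloat16 first, then add in bfloat16
merged_bf16_first = base_w + delta.bfloat16()

# Variant 2: upcast the base weight, add in float32, cast the result down once at the end
merged_fp32_first = (base_w.float() + delta).bfloat16()

diff = (merged_bf16_first.float() - merged_fp32_first.float()).abs()
print("max abs difference between the two merge orders:", diff.max().item())  # typically non-zero
```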
Using the safe_merge behavior, I can save the merged model, and the merged model weights seemed to match the base + LoRA adapter weights.
Issue number 2.
When I try to run any merged layer, the results do not match the initial base + adapter results. For example, I have a lora.Linear layer; when I use the merged model it turns into a simple nn.Linear layer (which is expected). The issue starts from the realization in pytorch/pytorch#115144 that nn.Linear in bfloat16 does not equal input @ weight.T + bias computed in bfloat16, but it does equal (input.float() @ weight.float().T + bias.float()).bfloat16(). Now, the basic lora.Linear layer computes base_layer(input) + lora_B(lora_A(input)), so because the operation order and intermediate casting matter in bfloat16, the outputs of the merged layer and the unmerged base + adapter path are not the same. The small script below illustrates this.
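Here is a minimal standalone illustration (random weights; which of the two manual formulations matches the module exactly depends on the backend, so I just print both differences):

```python
# Compare an nn.Linear forward in bfloat16 against two manual formulations of the same math.
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(128, 128).bfloat16()
x = torch.randn(4, 128).bfloat16()

with torch.no_grad():
    out_module = lin(x)

    # Naive bfloat16 math: matmul and bias add carried out entirely in bfloat16
    out_bf16 = x @ lin.weight.T + lin.bias

    # Upcast everything to float32, compute, then cast the result back down once
    out_fp32 = (x.float() @ lin.weight.float().T + lin.bias.float()).bfloat16()

print("module vs bf16 math :", (out_module - out_bf16).abs().max().item())
print("module vs fp32 math :", (out_module - out_fp32).abs().max().item())
```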
Summary
We see that with a bfloat16 base model and LoRA adapters, the merged model does not give us the same results as base + adapter at inference time. From my initial investigation, this is because the bfloat16 nn.Linear does not literally compute weight * input + bias in bfloat16; there is intermediate casting, and because the merge itself also involves casts whose order can vary, the numerical results change with the order of operations.
Do you have a recommended approach, or a tested script, for merging such that the merged model gives the same results as the base + adapter one? Thank you!