[Bug] Merged model with bfloat16 is not matching the adapther +lora weight #3093

vasipapa · 2025-08-04T14:39:57Z

vasipapa
Aug 4, 2025

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Use GPU the same device as the original finetuned model

from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch
from datasets import load_dataset

# Load model
model, processor = FastVisionModel.from_pretrained(
    "unsloth/gemma-3-4b-it",
    # "unsloth/Qwen2.5-VL-7B-Instruct",
    # "unsloth/gemma-3-4b-pt",
    load_in_4bit = False, # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # False if not finetuning vision layers
    finetune_language_layers   = True, # False if not finetuning language layers
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules       = True, # False if not finetuning MLP layers
    r = 8,                           # The larger, the higher the accuracy, but might overfit
    lora_alpha = 8,                  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 32,
    use_rslora = False,               # We support rank stabilized LoRA
    loftq_config = None,               # And LoftQ
    target_modules = "all-linear",    # Optional now! Can specify a list if needed
    # modules_to_save=[
    #     "lm_head",
    #     "embed_tokens",
    # ],
)
from unsloth import get_chat_template

processor = get_chat_template(
    processor,
    "gemma-3"
)

# FastVisionModel.for_inference(model)  # Enable for inference!
from datasets import load_dataset
dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")

# # Convert the dataset to a conversation format for training
instruction = "Write the LaTeX representation for this image."

def convert_to_conversation(sample):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
    ]
    return {"messages": conversation}
pass

converted_dataset = [convert_to_conversation(sample) for sample in dataset]

from transformers import TextStreamer

from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

FastVisionModel.for_training(model) # Enable for training!

trainer = SFTTrainer(
    model=model,
    train_dataset=converted_dataset,
    processing_class=processor.tokenizer,
    data_collator=UnslothVisionDataCollator(model, processor),
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        gradient_checkpointing = True,

        # use reentrant checkpointing
        gradient_checkpointing_kwargs = {"use_reentrant": False},
        max_grad_norm = 0.3,              # max gradient norm based on QLoRA paper
        warmup_ratio = 0.03,
        max_steps = 10,
        #num_train_epochs = 2,          # Set this instead of max_steps for full training runs
        learning_rate = 2e-4,
        logging_steps = 1,
        save_strategy="steps",
        optim = "adamw_torch_fused",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",             # For Weights and Biases

        # You MUST put the below items for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        max_seq_length = 2048,
    )
)

trainer_stats = trainer.train()

import torch._dynamo
torch._dynamo.config.cache_size_limit = 64  # or higher


# Run on single image
FastVisionModel.for_inference(model)  # Enable for inference!
model = model.eval()
# # MODEL locations
ADAPTER_MODEL_DIR = "ADAPTER_MODEL_DIR/"
MERGED_MODEL_DIR = "MERGED_MODEL_DIR/"
# import copy
# Save FT model as is with adapter
model.save_pretrained(ADAPTER_MODEL_DIR)
processor.save_pretrained(ADAPTER_MODEL_DIR)

# If you do not use the safe_merge= True the merge is now working as expected. The weights are not giving zeros differences.
# at 1)
merged_model_save = model.merge_and_unload(safe_merge = True)  # Merge the LoRA weights with the base model weights
# merged_model = model.merge_and_unload()  # Merge the LoRA weights with the base model weights
# adapter_model.cpu()  # Move to CPU to save GPU memory
merged_model_save.cpu()  # Move to CPU to save GPU memory
merged_model_save.save_pretrained(MERGED_MODEL_DIR)
processor.save_pretrained(MERGED_MODEL_DIR)
tokenizer = processor.save_pretrained(MERGED_MODEL_DIR)

# # # Save Merged model
# This will not give the correct results in 1)
# model.save_pretrained_merged(MERGED_MODEL_DIR, processor)
# The issue comes from the layer.py function in the unsloth library.
# this  
# Without safe_merge 
# delta_weight = self.get_delta_weight(active_adapter)
# base_layer.weight.data += delta_weight
#With safe merge 
# delta_weight = self.get_delta_weight(active_adapter)
# orig_weight += delta_weight.to(orig_dtype)
# Lora Weight are full precision (float32) and the base layer is bfloat16.
# When we deal with bfloat16 the order of the operations matters.

# RELOAD THE ADAPTER MODEL

adapter_model, adapter_processor = FastVisionModel.from_pretrained(
    ADAPTER_MODEL_DIR,
    load_in_4bit = False, # Use 4bit to reduce memory use. False for 16bit LoRA.
    # is_trainable=False,  # Set to False to avoid training the adapter model
    # use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)

FastVisionModel.for_inference(adapter_model)  # Enable for inference!



# RELOAD THE MERGED MODEL
merged_model, merged_processor = FastVisionModel.from_pretrained(
    MERGED_MODEL_DIR,
    dtype = torch.float16,  # Use bfloat16 for better performance on GPUs
    load_in_4bit = False, # Use 4bit to reduce memory use. False for 16bit LoRA.
)

FastVisionModel.for_inference(merged_model)  # Enable for inference!

# 1) Compare the outputs from basic adapter model and merged model
for layer_id in range(len(adapter_model.model.model.language_model.layers)):
    module = adapter_model.model.model.language_model.layers[layer_id].self_attn.q_proj

    # Merge the LoRA weights with the base model weights
    # This is the correct way to merge the weights and check for differences
    base_weight = module.base_layer.weight.data #.cpu()
    lora_A_weight = module.lora_A.default.weight.data # .cpu()
    lora_B_weight = module.lora_B.default.weight.data #.cpu()
    lora_weights = lora_B_weight @ lora_A_weight
    lora_weights = lora_weights.to(dtype = base_weight.dtype) #.cpu()
    merged_weight = base_weight + lora_weights


    merged_module_weight = merged_model.model.language_model.layers[layer_id].self_attn.q_proj.weight.data
    merged_module_weight = merged_module_weight #.cpu()


    diff = (merged_module_weight - merged_weight).abs()
    print(f"Layer {layer_id}: base_weight dtype: {base_weight.dtype}, lora_A_weight dtype: {lora_A_weight.dtype}, lora_B_weight dtype: {lora_B_weight.dtype}")
    print(f"Layer {layer_id}: Dtypes - Merged: {merged_module_weight.dtype}, Original: {merged_weight.dtype}")
    print(f"Layer {layer_id}: Max difference: {diff.max()} | Mean difference: {diff.mean()} | Shape: {merged_module_weight.shape}")
    print(f"Layer {layer_id}: Merged weight: percentage of difference: {abs(diff.mean() / merged_module_weight.mean()) * 100:.4f}%")
    print("=="*20)

# Now with all the correct merge the differences should be 0 (at least at my side).
# 2) Compare the outputs from adapter model and merged model per layer
input = torch.ones(adapter_model.base_model.model.model.vision_tower.vision_model.encoder.layers[0].self_attn.q_proj.in_features, dtype = torch.bfloat16).cuda()  # Example input tensor

output1 = adapter_model.base_model.model.model.vision_tower.vision_model.encoder.layers[0].self_attn.q_proj(input)
output2 = merged_model.base_model.model.vision_tower.vision_model.encoder.layers[0].self_attn.q_proj(input)
diff = (output1 - output2).abs()
print(f"Output difference: {diff.max()} | Mean difference: {diff.mean()} | Shape: {output1.shape}")
print(f"Output percentage of difference: {abs(diff.mean() / output1.mean()) * 100:.4f}%")
print("=="*20)
# You will see that the outputs are not the same, although the wights are the same.
# The issue starts from this realization https://github.com/pytorch/pytorch/issues/115144
# That says that nn.Linear in bfloat16 is not equal with weight @ input + bias in bfloat16 but
# it is equal with (weight.float() @ input.float() + bias.float()).bfloat16
# now the basic lora.Linear layer is doing base_layer(input) + lora_B(lora_A((input))) 
# So, because the operation order matters with bfloat16, the outputs are not the same. And I
# don't know how to fix it.

So, in summary I am trying to use the merged model with the lora adapted layer. First issue I faced, the save_pretrained_merged is not working as expected. The basic comparison functionality which I found after analyzing the code and reading the paper is:

    base_weight = module.base_layer.weight.data #.cpu()
    lora_A_weight = module.lora_A.default.weight.data # .cpu()
    lora_B_weight = module.lora_B.default.weight.data #.cpu()
    lora_weights = lora_B_weight @ lora_A_weight
    lora_weights = lora_weights.to(dtype = base_weight.dtype) #.cpu()
    merged_weight = base_weight + lora_weights

and the difference with the merged weight of this layer was not zero. I found that the safe_merge functionality you are using inside the merge in layers.py is doing the expected thing. Differrence in code between safe_merge and not:

# Without safe_merge 
 delta_weight = self.get_delta_weight(active_adapter)
base_layer.weight.data += delta_weight
#With safe merge 
delta_weight = self.get_delta_weight(active_adapter)
orig_weight += delta_weight.to(orig_dtype)

The issue on that is that the LoRa weights are in full precission ( float32) and the base_layer is in bfloat16. So, whenever we change the order of operations and casting the results changes. The order matters and if we change the order we cannot reproduce the results.
I used the

merged_model_save = model.merge_and_unload(safe_merge = True)  # Merge the LoRA weights with the base model weights
# merged_model = model.merge_and_unload()  # Merge the LoRA weights with the base model weights
# adapter_model.cpu()  # Move to CPU to save GPU memory
merged_model_save.cpu()  # Move to CPU to save GPU memory
merged_model_save.save_pretrained(MERGED_MODEL_DIR)
processor.save_pretrained(MERGED_MODEL_DIR)

So, I can save the merge model and the merged model weights seemed to match the base+LoRA adapter.

Issue number 2.

When I try to run any merged layer the results are not matching the initial base+Adapter results for example I have this layer:

When I use the merged model is turns into a simple nn.Linear layer (which is expected).
The issue starts from this realization pytorch/pytorch#115144
that says that nn.Linear in bfloat16 is not equal with weight @ input + bias in bfloat16 but
it is equal with (weight.float() @ input.float() + bias.float()).bfloat16
now the basic lora.Linear layer is doing base_layer(input) + lora_B(lora_A((input)))
So, because the operation order matters with bfloat16, the outputs are not the same.

Summary

We see that with the base model bfloat16 and LoRA adapters the merged model is not givings us the same result when we are doing inference. This is, at least from my initial investigation, because the bfloat16 Linear is not following the weights*input +bias, but we have an intermediate casting, and because generally the casting inside the merging and the order of it changes so the numerical results change with the order.

Do you have a clear way or a tested script where you are doing merging and the merged model gives the same results as the base+ adapter one? Thank you

rolandtannous · 2025-08-05T05:18:20Z

rolandtannous
Aug 5, 2025
Collaborator

Hi there!

Thank you for your detailed post and for sharing your experimentation process with the community.

I'm going to move this to our discussions section for now, as this appears to be more of an exploratory investigation that could benefit from some additional refinement before we can draw definitive conclusions about any potential issues.

Library Attribution Clarification

I noticed there might be some confusion about which library is responsible for the behaviors you're observing. In your analysis, you're primarily using and examining:

merge_and_unload(safe_merge=True) - this is a PEFT library method, not an Unsloth method
References to "layer.py" - this appears to be referring to PEFT's layer.py file, not Unsloth's implementation
The safe_merge logic you're analyzing is part of PEFT's codebase

So you might want to have a conversation with the peft libary meaintainers. But before doing so read on the rest of this response.

While you mention that save_pretrained_merged() "is not working as expected," I don't see any specific testing or evidence demonstrating issues with Unsloth's implementation. We suggest you establish a proper performance baseline first without unsloth- please reference the "Suggested Approach" section below for more details on how to do this systematically.

After you've established and troubleshot the baseline behavior, then it would be more appropriate to provide a direct comparison between Unsloth's merging and PEFT's merging with concrete evidence. Note that any performance differentials should be backed up by proper evaluations using an appropriate dataset and a representative evaluation dataset size (not just a single sample).

Evaluation Methodology Considerations

Your investigation raises some interesting points about numerical precision, but there are a few methodological aspects that might strengthen your analysis:

Sample Size & Representativeness:

Testing on a single image or single tensor input isn't statistically representative
For robust conclusions, you'd want to evaluate on a large, diverse dataset that covers the model's expected use cases

Precision Expectations:

Expecting exactly zero differences between mixed-precision operations (bfloat16/float32) might not be realistic due to floating-point arithmetic properties
Different mathematical operation orders with different precisions will naturally produce small numerical differences, even when mathematically equivalent
Consider using tolerance thresholds rather than exact equality

Task-Relevant Evaluation:

Comparing raw tensor outputs doesn't necessarily reflect actual model performance
For LaTeX OCR tasks, functional metrics like WER (Word Error Rate) and CER (Character Error Rate) would be much more meaningful
We've actually run these types of evaluations internally and found that merged models perform as expected when measured with appropriate task-specific metrics
Note that with only 10 training steps, your model may be undertrained, which could make meaningful performance comparisons difficult

Sampling & Determinism:

LLMs have inherent randomness in text generation
Make sure to control for sampling parameters, set random seeds, or run multiple trials to account for this variability

Suggested Approach

Note that Unsloth doesn't completely replace PEFT+Transformers+TRL. While we optimize some of the code, we actually rely on these core libraries. I'd suggest running a baseline experiment to establish baseline behavior as follows:

Don't use Unsloth - use PEFT+Transformers+TRL code to finetune, merge_and_unload(), and compare performance of the base+lora model vs merged model, using appropriate and diverse evals.
Observe the behavior and make comparisons, investigate further the reasons behind your observations
Dig deeper (beyond just reading the paper) into the codebase of the core libraries before making any conclusions
If you're convinced with your findings thereafter, I recommend you share them with the team at PEFT: https://github.com/huggingface/peft and see what they have to say

At this stage it might then be worth implementing it using purely Unsloth methods and sharing your findings here. We'll be glad to take a look. Also feel free to keep updating this thread as you run through your baseline experiment and discuss with PEFT to let us know how it goes.

Thank you!

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Bug] Merged model with bfloat16 is not matching the adapther +lora weight #3093

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

Uh oh!

[Bug] Merged model with bfloat16 is not matching the adapther +lora weight #3093

Uh oh!

vasipapa Aug 4, 2025

Issue number 2.

Summary

Replies: 1 comment

Uh oh!

Uh oh!

rolandtannous Aug 5, 2025 Collaborator

Library Attribution Clarification

Evaluation Methodology Considerations

Suggested Approach

vasipapa
Aug 4, 2025

rolandtannous
Aug 5, 2025
Collaborator