[docs] Attention backends #12320

stevhliu · 2025-09-11T22:40:13Z

Docs for enabling different attention backends.

HuggingFaceDocBuilderDev · 2025-09-11T22:47:35Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

sayakpaul

Thanks for starting this! Looking good. Before talking about specific attention backends, we could educate the users about the forms in which they can use them.

- By setting the backend name on the model: `model.set_attention_backend("<backend_name>")`.
- Through the context manager.
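For illustration, the two forms could look roughly like the sketch below. This is a sketch under assumptions rather than the final docs example: the helper names `set_backend_persistently` and `run_with_backend` are hypothetical, the import path `diffusers.models.attention_dispatch.attention_backend` is assumed from the dispatch file linked in this PR, and `pipeline(prompt).images[0]` assumes an image pipeline.

```python
def set_backend_persistently(model, backend_name: str) -> None:
    """Form 1: set the backend on the model itself.

    `model` is assumed to be a diffusers model that exposes
    `set_attention_backend`, as mentioned in this review.
    """
    model.set_attention_backend(backend_name)


def run_with_backend(pipeline, prompt: str, backend_name: str):
    """Form 2: use the backend only inside a `with` block.

    The import is done lazily so this sketch can be loaded without
    diffusers installed. The import path is an assumption.
    """
    from diffusers.models.attention_dispatch import attention_backend

    with attention_backend(backend_name):
        # Assumes an image pipeline whose output has an `images` list.
        return pipeline(prompt).images[0]
```

The first form suits a model that should keep one backend across calls, while the second scopes the choice to a single block of code.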

Then we could maintain a table of attention backend names, their GitHub/official pages, and notes.

We could then take one example for a backend and make it complete.

This way, I think it will be leaner and easier to follow. WDYT?

sayakpaul · 2025-09-22T08:38:15Z

docs/source/en/optimization/attention_backends.md

+
+There are several available FlashAttention variants, including variable length and the original FlashAttention. For a full list of supported implementations, check the list [here](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L163).
+
+The example below demonstrates how to enable the `_flash_3_hub` implementation. The [kernel](https://github.com/huggingface/kernels) library allows you to instantly use optimized compute kernels from the Hub without requiring any setup.


Suggested change

The example below demonstrates how to enable the `_flash_3_hub` implementation. The [kernel](https://github.com/huggingface/kernels) library allows you to instantly use optimized compute kernels from the Hub without requiring any setup.

The example below demonstrates how to enable the `_flash_3_hub` implementation for Flash Attention 3. The [kernels](https://github.com/huggingface/kernels) library allows you to instantly use optimized compute kernels from the Hub without requiring any setup.

But we should also note the hardware restrictions of FA3, since it's not supported on non-Hopper architectures. In that case, regular FlashAttention should be used through `set_attention_backend("flash")`.
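One way the docs could express that fallback is sketched below. It is only a sketch: `pick_flash_backend` is a hypothetical helper, not part of diffusers, and it assumes that Hopper GPUs are identified by a CUDA compute capability with major version 9.

```python
def pick_flash_backend(compute_capability: tuple[int, int]) -> str:
    """Return a FlashAttention backend name for the given GPU.

    Assumption: Hopper GPUs (for example H100) report compute
    capability 9.x, and FA3 is only used on those devices.
    """
    major, _minor = compute_capability
    return "_flash_3_hub" if major == 9 else "flash"


# Usage sketch (needs a CUDA device and a loaded model, so it is left
# as a comment here):
#   import torch
#   backend = pick_flash_backend(torch.cuda.get_device_capability())
#   model.set_attention_backend(backend)
```

Keeping the check in a small pure function means the choice can be reasoned about without a GPU present.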

sayakpaul · 2025-09-22T08:42:45Z

docs/source/en/optimization/attention_backends.md

+
+## PyTorch native
+
+PyTorch includes a [native implementation](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) of several optimized attention implementations including [FlexAttention](https://pytorch.org/blog/flexattention/), FlashAttention, memory-efficient attention, and a C++ version.


Flex uses a different path in torch:

diffusers/src/diffusers/models/attention_dispatch.py

Line 847 in 5e181ed

out = flex_attention.flex_attention(

The backends that leverage `nn.functional.scaled_dot_product_attention()` can be found in https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py (search for `scaled_dot_product_attention(`).

init

e876275

stevhliu requested a review from sayakpaul September 11, 2025 22:49

sayakpaul reviewed Sep 22, 2025
