
Conversation

cicirori (Collaborator) commented Sep 2, 2025

This is an initial PR for the FA4 backend.

  1. Vendor FA4 from the original repository into sgl-kernel and share the interface with FA3, selected via the ver parameter (see the sketch after this description).
  2. Add support for using FA4 as the prefill attention backend for DeepSeek-like models on Blackwell.

Given Flash Attention 4's current level of interface compatibility, I chose to share the implementation with the FA3 attention backend.
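
As a rough illustration of what that shared entry point could look like (only the ver parameter is taken from this PR; the function name and the _fa3_forward / _fa4_forward helpers are hypothetical), a minimal dispatch sketch in Python:

def flash_attn_func(q, k, v, *, causal=False, ver=3):
    # `ver` selects the kernel generation. Only `ver` comes from this PR;
    # `_fa3_forward` / `_fa4_forward` are illustrative placeholder names.
    if ver == 4:
        # FA4 path: kernels vendored from the upstream flash-attention
        # repository, currently Blackwell (SM100) only.
        return _fa4_forward(q, k, v, causal=causal)
    # Default FA3 path, unchanged for existing callers.
    return _fa3_forward(q, k, v, causal=causal)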

@zhyncs zhyncs self-assigned this Sep 2, 2025
zhyncs (Member) commented Sep 2, 2025

ref

#8891
#9428

@cicirori cicirori marked this pull request as ready for review September 2, 2025 16:50
@cicirori cicirori changed the title WIP: add fa4's early support add fa4 to sgl-kernel and support using fa4 on deepseek on blackwell Sep 2, 2025
cicirori (Collaborator, Author) commented Sep 2, 2025

Launch Command

python3 -m sglang.launch_server \
  --model-path deepseek-v3-FP4 \
  --trust-remote-code \
  --quantization modelopt_fp4 \
  --tp-size 4 \
  --ep-size 1 \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
  # To enable FA4, append: --prefill-attention-backend fa4
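
FA4's forward kernel in this PR targets Blackwell (SM100). A quick, purely illustrative pre-flight check (not part of the PR) before adding the flag:

import torch

# Blackwell-class GPUs report compute capability (10, x); the FA4 kernels
# vendored here (flash_fwd_sm100.py) are built for that architecture.
major, minor = torch.cuda.get_device_capability()
if major >= 10:
    print("Blackwell detected; --prefill-attention-backend fa4 should be usable.")
else:
    print(f"sm{major}{minor} detected; keep the FA3/flashinfer prefill backend.")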

Benchmark Command

python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 256 \
  --random-input 2048 \
  --random-output 1 \
  --random-range-ratio 0.5 \
  --flush-cache \
  --max-concurrency 1

Time to First Token (TTFT)

Backend      Mean (ms)   Median (ms)   P99 (ms)
FA4          117.73      119.51        142.93
Flashinfer   131.84      134.83        169.19
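
The columns are the usual summary statistics over the per-request TTFT samples collected by the benchmark. A minimal sketch of how such a summary can be computed (assuming a plain list of per-request TTFTs in milliseconds; this is not taken from bench_serving internals):

import statistics

def summarize_ttft(ttfts_ms):
    # Mean, median, and 99th percentile over per-request TTFT samples.
    samples = sorted(ttfts_ms)
    p99_idx = min(len(samples) - 1, round(0.99 * (len(samples) - 1)))
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "p99_ms": samples[p99_idx],
    }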

cicirori (Collaborator, Author) commented Sep 2, 2025

Unit Test

python sgl-kernel/tests/test_flash_attention_4.py
====================================== test session starts =======================================
platform linux -- Python 3.12.11, pytest-8.4.1, pluggy-1.6.0
rootdir: 
configfile: pyproject.toml
plugins: typeguard-4.4.4, anyio-4.10.0
collected 816 items

sgl-kernel/tests/test_flash_attention_4.py ............................................... [  5%]
.......................................................................................... [ 16%]
.......................................................................................... [ 27%]
.......................................................................................... [ 38%]
.......................................................................................... [ 49%]
.......................................................................................... [ 60%]
.......................................................................................... [ 71%]
.......................................................................................... [ 82%]
.......................................................................................... [ 93%]
.................................................                                          [100%]

======================================== warnings summary ========================================
tests/test_flash_attention_4.py: 108 warnings
  /usr/local/lib/python3.12/dist-packages/flash_attn/cute/flash_fwd_sm100.py:1707: DeprecationWarning: Use make_tiled_copy_tv instead
    tiled_smem_store = cute.make_tiled_copy(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================= 816 passed, 108 warnings in 151.99s (0:02:31) ==========================
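
The test file's contents are not shown here, but the usual shape of such a check is a comparison of the fused kernel's output against a naive reference. A hedged sketch of that pattern (the flash_attn_func call and its ver parameter are assumptions based on the shared FA3/FA4 interface described above; tolerances are illustrative):

import math
import torch

def reference_attention(q, k, v, causal=True):
    # Naive O(n^2) attention in float32, used purely as a numerical reference.
    scores = q.float() @ k.float().transpose(-2, -1) / math.sqrt(q.shape[-1])
    if causal:
        n = scores.shape[-1]
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=q.device), 1)
        scores = scores.masked_fill(mask, float("-inf"))
    return (scores.softmax(dim=-1) @ v.float()).to(q.dtype)

# Hypothetical usage against the shared entry point:
#   out = flash_attn_func(q, k, v, causal=True, ver=4)
#   torch.testing.assert_close(out, reference_attention(q, k, v),
#                              atol=2e-2, rtol=2e-2)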



cicirori (Collaborator, Author) commented Sep 2, 2025

/gemini

gemini-code-assist bot (Contributor) replied:

It looks like you tried to invoke a command with /gemini. Please note that /gemini requires a subcommand. Here are the available commands:

  • /gemini review: Performs a code review for the current pull request in its current state.
  • /gemini summary: Provides a summary of the current pull request in its current state.
  • /gemini help: Displays a list of available commands.

If you intended to ask a question or provide a comment, you can simply type your message without a command.

cicirori (Collaborator, Author) commented Sep 2, 2025

/gemini review

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for FlashAttention v4, a significant enhancement for models running on newer GPU architectures. The implementation is well-structured, integrating FA4 as a versioned backend alongside FA3. The changes are guarded to ensure that only supported features are used, which is a good practice for incremental development. The addition of a comprehensive test suite for the FA4 interface is particularly commendable as it provides strong evidence of correctness. My review found a minor code duplication issue in a test file that could be addressed to improve maintainability.
