
Conversation

cicirori (Collaborator) commented Sep 2, 2025

This is an initial PR for the FA4 backend.

  1. Vendor FA4 from the original repository into sgl-kernel and share the interface with FA3, selected via the ver parameter (see the sketch after this description).
  2. Add support for using FA4 as the prefill attention backend for DeepSeek-like models on Blackwell.

Given Flash Attention 4's current level of interface compatibility, I chose to share the implementation with the FA3 attention backend.
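
As a rough illustration of what that shared entry point could look like (only the ver parameter is taken from this PR; the function name and the _fa3_forward / _fa4_forward helpers are hypothetical), a minimal dispatch sketch in Python:

def flash_attn_func(q, k, v, *, causal=False, ver=3):
    # `ver` selects the kernel generation. Only `ver` comes from this PR;
    # `_fa3_forward` / `_fa4_forward` are illustrative placeholder names.
    if ver == 4:
        # FA4 path: kernels vendored from the upstream flash-attention
        # repository, currently Blackwell (SM100) only.
        return _fa4_forward(q, k, v, causal=causal)
    # Default FA3 path, unchanged for existing callers.
    return _fa3_forward(q, k, v, causal=causal)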

@zhyncs zhyncs self-assigned this Sep 2, 2025
zhyncs (Member) commented Sep 2, 2025

ref

#8891
#9428

@cicirori cicirori marked this pull request as ready for review September 2, 2025 16:50
@cicirori cicirori changed the title WIP: add fa4's early support add fa4 to sgl-kernel and support using fa4 on deepseek on blackwell Sep 2, 2025
cicirori (Collaborator, Author) commented Sep 2, 2025

Launch Command

python3 -m sglang.launch_server \
  --model-path deepseek-v3-FP4 \
  --trust-remote-code \
  --quantization modelopt_fp4 \
  --tp-size 4 \
  --ep-size 1 \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
  # To enable FA4, append: --prefill-attention-backend fa4
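
FA4's forward kernel in this PR targets Blackwell (SM100). A quick, purely illustrative pre-flight check (not part of the PR) before adding the flag:

import torch

# Blackwell-class GPUs report compute capability (10, x); the FA4 kernels
# vendored here (flash_fwd_sm100.py) are built for that architecture.
major, minor = torch.cuda.get_device_capability()
if major >= 10:
    print("Blackwell detected; --prefill-attention-backend fa4 should be usable.")
else:
    print(f"sm{major}{minor} detected; keep the FA3/flashinfer prefill backend.")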

Benchmark Command

python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 256 \
  --random-input 2048 \
  --random-output 1 \
  --random-range-ratio 0.5 \
  --flush-cache \
  --max-concurrency 1

Time to First Token (TTFT)

Backend      Mean (ms)   Median (ms)   P99 (ms)
FA4          117.73      119.51        142.93
Flashinfer   131.84      134.83        169.19
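
The columns are the usual summary statistics over the per-request TTFT samples collected by the benchmark. A minimal sketch of how such a summary can be computed (assuming a plain list of per-request TTFTs in milliseconds; this is not taken from bench_serving internals):

import statistics

def summarize_ttft(ttfts_ms):
    # Mean, median, and 99th percentile over per-request TTFT samples.
    samples = sorted(ttfts_ms)
    p99_idx = min(len(samples) - 1, round(0.99 * (len(samples) - 1)))
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "p99_ms": samples[p99_idx],
    }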

cicirori (Collaborator, Author) commented Sep 2, 2025

Unit Test

python sgl-kernel/tests/test_flash_attention_4.py
====================================== test session starts =======================================
platform linux -- Python 3.12.11, pytest-8.4.1, pluggy-1.6.0
rootdir: 
configfile: pyproject.toml
plugins: typeguard-4.4.4, anyio-4.10.0
collected 816 items

sgl-kernel/tests/test_flash_attention_4.py ............................................... [  5%]
.......................................................................................... [ 16%]
.......................................................................................... [ 27%]
.......................................................................................... [ 38%]
.......................................................................................... [ 49%]
.......................................................................................... [ 60%]
.......................................................................................... [ 71%]
.......................................................................................... [ 82%]
.......................................................................................... [ 93%]
.................................................                                          [100%]

======================================== warnings summary ========================================
tests/test_flash_attention_4.py: 108 warnings
  /usr/local/lib/python3.12/dist-packages/flash_attn/cute/flash_fwd_sm100.py:1707: DeprecationWarning: Use make_tiled_copy_tv instead
    tiled_smem_store = cute.make_tiled_copy(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================= 816 passed, 108 warnings in 151.99s (0:02:31) ==========================
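
The test file's contents are not shown here, but the usual shape of such a check is a comparison of the fused kernel's output against a naive reference. A hedged sketch of that pattern (the flash_attn_func call and its ver parameter are assumptions based on the shared FA3/FA4 interface described above; tolerances are illustrative):

import math
import torch

def reference_attention(q, k, v, causal=True):
    # Naive O(n^2) attention in float32, used purely as a numerical reference.
    scores = q.float() @ k.float().transpose(-2, -1) / math.sqrt(q.shape[-1])
    if causal:
        n = scores.shape[-1]
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=q.device), 1)
        scores = scores.masked_fill(mask, float("-inf"))
    return (scores.softmax(dim=-1) @ v.float()).to(q.dtype)

# Hypothetical usage against the shared entry point:
#   out = flash_attn_func(q, k, v, causal=True, ver=4)
#   torch.testing.assert_close(out, reference_attention(q, k, v),
#                              atol=2e-2, rtol=2e-2)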



cicirori (Collaborator, Author) commented Sep 2, 2025

/gemini

gemini-code-assist bot (Contributor) replied:

It looks like you tried to invoke a command with /gemini. Please note that /gemini requires a subcommand. Here are the available commands:

  • /gemini review: Performs a code review for the current pull request in its current state.
  • /gemini summary: Provides a summary of the current pull request in its current state.
  • /gemini help: Displays a list of available commands.

If you intended to ask a question or provide a comment, you can simply type your message without a command.

cicirori (Collaborator, Author) commented Sep 2, 2025

/gemini review

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for FlashAttention v4, a significant enhancement for models running on newer GPU architectures. The implementation is well-structured, integrating FA4 as a versioned backend alongside FA3. The changes are guarded to ensure that only supported features are used, which is a good practice for incremental development. The addition of a comprehensive test suite for the FA4 interface is particularly commendable as it provides strong evidence of correctness. My review found a minor code duplication issue in a test file that could be addressed to improve maintainability.
