- Running GPT-OSS-20B on Windows with an RTX 4090, requiring ~14 GB of VRAM
- Weights are FP4, de-quantized on the fly to BF16 for computation (toy sketch below)
- How fast? 5-10 tokens/second, probably. Faster than you read, most likely.
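For intuition, here is a toy version of that dequantize-then-compute pattern. This is not the actual MXFP4 kernel: the codebook, nibble packing, and per-row scaling below are all illustrative.

```python
# Toy sketch of the FP4 -> BF16 dequantize-on-the-fly idea.
# NOT the real MXFP4 kernel: codebook, packing, and scaling are illustrative.
import torch

# Hypothetical 16-entry codebook: each 4-bit code indexes one of these values.
CODEBOOK = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequant_fp4(packed: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Unpack two 4-bit codes per byte, look them up, apply a per-row scale."""
    lo = packed & 0x0F                                    # low nibble
    hi = (packed >> 4) & 0x0F                             # high nibble
    codes = torch.stack([lo, hi], dim=-1).flatten(-2)     # interleave nibbles
    return (CODEBOOK[codes.long()] * scale).to(torch.bfloat16)

packed = torch.randint(0, 256, (8, 16), dtype=torch.uint8)  # 8x32 weights, packed
scale = torch.rand(8, 1)                                     # per-row scale factor
w_bf16 = dequant_fp4(packed, scale)                          # de-quantized weights
x = torch.randn(4, 8, dtype=torch.bfloat16)
y = x @ w_bf16                                               # compute happens in BF16
print(y.shape)                                               # torch.Size([4, 32])
```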
- 🆕: formatted-inference-120b.py runs GPT-OSS-120B with device_map="cuda":
python formatted-inference-120b.py
- Works with 24 GB VRAM + 128 GB RAM. Needs <70 GB of total dedicated + shared 'GPU' memory.
- Turn on 'System Memory Fallback' in the NVIDIA settings to use it!
- RTX 4090: about 2 tokens/second for GPT-OSS-120B.
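A minimal sketch of what a device_map="cuda" load looks like with the standard transformers API; the Hub model ID and generation settings here are assumptions, and formatted-inference-120b.py is the real reference.

```python
# Minimal sketch of loading GPT-OSS-120B onto one 'device' so that NVIDIA's
# System Memory Fallback can spill past 24 GB of VRAM into shared RAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-120b"  # assumed Hugging Face Hub ID
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # let transformers pick dtype handling for the FP4 checkpoint
    device_map="cuda",    # everything on the 'GPU'; fallback spills to system RAM
)

msgs = [{"role": "user", "content": "Say hello in five words."}]
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=64)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```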
- Unformatted, raw, simple; streaming tokens:
python simple-inference-stream.py
- Clean formatting, reasoning + response, streaming tokens:
python clean-formatted-inference-stream.py
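For a feel of how the streaming works, here is a minimal sketch using transformers' TextIteratorStreamer. The repo's scripts do more on top of this (clean formatting, splitting reasoning from response), and the model ID is an assumption.

```python
# Minimal token-streaming sketch; clean-formatted-inference-stream.py is the
# full version with formatting and the reasoning/response split.
from threading import Thread
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "openai/gpt-oss-20b"  # assumed Hugging Face Hub ID
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="cuda")

msgs = [{"role": "user", "content": "Explain FP4 quantization in one sentence."}]
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)

# generate() blocks, so it runs in a thread while we consume the stream.
Thread(target=model.generate,
       kwargs=dict(inputs=inputs, streamer=streamer, max_new_tokens=256)).start()
for chunk in streamer:               # text arrives as tokens are generated
    print(chunk, end="", flush=True)
```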
- Stuff some text into the model, get a short response and logprobs:
python test-get-logprobs.py
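In the same spirit as test-get-logprobs.py (the script's details may differ), scoring the log-probability the model assigns to each observed next token looks roughly like this; the model ID is assumed.

```python
# Sketch: per-token log-probabilities from a single forward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # assumed Hugging Face Hub ID
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="cuda")

ids = tok("The capital of France is", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    logits = model(ids).logits                  # [1, seq_len, vocab]
logprobs = torch.log_softmax(logits.float(), dim=-1)

# Log-prob each position assigned to the token that actually came next:
next_lp = logprobs[0, :-1].gather(-1, ids[0, 1:, None]).squeeze(-1)
for t, lp in zip(tok.convert_ids_to_tokens(ids[0, 1:]), next_lp.tolist()):
    print(f"{t!r}: {lp:.3f}")
```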
- Requires PyTorch >=2.4; install it from pytorch.org, then check with:
pip show torch
- What works for me / what I use: torch 2.7.0+cu128
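Beyond pip show, a quick runtime check that the build meets the version floor and actually sees the GPU:

```python
# Sanity check: version floor and CUDA visibility.
import torch
print(torch.__version__)               # want >= 2.4, e.g. '2.7.0+cu128'
print(torch.cuda.is_available())       # should print True
print(torch.cuda.get_device_name(0))   # e.g. 'NVIDIA GeForce RTX 4090'
```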
- Install Triton for Windows (props to woct0rdho):
pip install -U "triton-windows<3.5"
- Install Triton Kernels:
pip install "triton_kernels @ git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels"
- Uninstall the stock Hugging Face transformers:
pip uninstall transformers
- Install this fork:
pip install git+https://github.com/Tsumugii24/transformers
- Check accelerate is >=0.33:
pip show accelerate
- What I use / what works for me:
pip install accelerate==1.2.1
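And a final check that Python picks up the forked transformers and a new-enough accelerate:

```python
# Confirm the versions actually being imported after the swap.
import transformers
import accelerate
print(transformers.__version__)  # should come from the Tsumugii24 fork
print(accelerate.__version__)    # want >= 0.33, e.g. 1.2.1
```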
Example inference on an RTX 4090, real-time video (clean-formatted-inference-stream.py):