
GPT-OSS-20B on Windows, MXFP4, fits <16 GB VRAM

  • Running GPT-OSS-20B on Windows with an RTX 4090, using ~14 GB VRAM
  • Weights are stored in MXFP4, but rapidly de-quantized on the fly to BF16 for computation (a sketch follows this list)
  • How fast? Roughly 5-10 tokens/second. Faster than you read, most likely.
  • 🆕: formatted-inference-120b.py runs GPT-OSS-120B; set device_map="cuda" to run.
  • The 120B model works with 24 GB VRAM + 128 GB system RAM; it needs <70 GB of total dedicated + shared 'GPU' memory.
  • Turn on 'System Memory Fallback' in the NVIDIA settings to make the shared memory usable!
  • RTX 4090: about 2 tokens per second for GPT-OSS-120B.
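For intuition, here is a minimal sketch of how MXFP4 de-quantization works (not the repo's actual kernel, which fuses this into the Triton matmul): per the OCP microscaling spec, 32 FP4 (E2M1) values share one power-of-two E8M0 scale per block. The tensor layout and nibble packing order below are assumptions.

import torch

# FP4 (E2M1) code -> value lookup, per the OCP MX spec:
# codes 0..7 are +{0, 0.5, 1, 1.5, 2, 3, 4, 6}; codes 8..15 are the negatives.
FP4_LUT = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                        -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequantize_mxfp4(packed: torch.Tensor, scales_e8m0: torch.Tensor) -> torch.Tensor:
    """packed: uint8 [n_blocks, 16], two FP4 codes per byte (low nibble first, assumed).
    scales_e8m0: uint8 [n_blocks], one shared exponent per 32-element block."""
    lo = packed & 0x0F                               # first code in each byte
    hi = packed >> 4                                 # second code
    codes = torch.stack([lo, hi], dim=-1).reshape(packed.shape[0], 32)
    values = FP4_LUT[codes.long()]                   # map 4-bit codes to real values
    scale = torch.exp2(scales_e8m0.float() - 127.0)  # E8M0 scale: 2^(e - 127)
    return (values * scale.unsqueeze(-1)).to(torch.bfloat16)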

Usage

  • Unformatted, raw, simple; streaming tokens (the underlying pattern is sketched after this list):
python simple-inference-stream.py
  • Clean formatting, reasoning + response, streaming tokens:
python clean-formatted-inference-stream.py
  • Stuff some text into the model, get a short response and logprobs:
python test-get-logprobs.py
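
For reference, this is roughly the pattern those scripts follow (a sketch; the repo's scripts are the source of truth, and the prompt, generation parameters, and variable names here are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

MODEL_ID = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",   # keep MXFP4 weights; de-quantized to BF16 on the fly
    device_map="cuda",    # same flag the 120B script uses
)

messages = [{"role": "user", "content": "Explain MXFP4 in one paragraph."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Streaming tokens to stdout, as in simple-inference-stream.py:
streamer = TextStreamer(tokenizer, skip_prompt=True)
with torch.inference_mode():
    model.generate(inputs, max_new_tokens=256, streamer=streamer)

# Per-token logprobs, the idea behind test-get-logprobs.py:
out = model.generate(inputs, max_new_tokens=16,
                     return_dict_in_generate=True, output_scores=True)
gen_ids = out.sequences[0, inputs.shape[1]:]
for tok, scores in zip(gen_ids, out.scores):
    logprob = torch.log_softmax(scores[0].float(), dim=-1)[tok]
    print(tokenizer.decode(tok), float(logprob))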

Setup

  • Requires PyTorch >=2.4; install a CUDA build from pytorch.org, then check with:
pip show torch

What works for me / what I use: torch 2.7.0+cu128
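
pip show torch only reports the version; to confirm the CUDA build is actually active (a quick check, not from the repo):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"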

  • Install Triton for Windows (props to woct0rdho):
pip install -U "triton-windows<3.5"
  • Install Triton Kernels:
pip install "triton_kernels @ git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels"
  • Uninstall the stock Hugging Face transformers and install this fork:
pip uninstall transformers
pip install git+https://github.com/Tsumugii24/transformers
  • Check that accelerate is >=0.33:
pip show accelerate
  • What I use / what works for me:
pip install accelerate==1.2.1
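
Once everything is installed, a quick sanity check ties the stack together (a sketch; module names are assumed to match the pip packages):

import torch, triton, transformers, accelerate

print("torch       :", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("triton      :", triton.__version__)
print("transformers:", transformers.__version__)
print("accelerate  :", accelerate.__version__)

import triton_kernels  # should import cleanly if the subdirectory install worked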


Example inference on an RTX 4090, real-time video (clean-formatted-inference-stream.py):

GPT-OSS-infer-4090.mp4

About

No Hopper or Blackwell GPU (H100, RTX 5090, etc.) required! <16 GB VRAM, Windows.
