- Running GPT-OSS-20B on Windows with an RTX 4090, requiring ~14 GB of VRAM
- Weights are FP4, de-quantized on the fly to BF16 for computation (toy sketch below)
- How fast? 5-10 tokens/second, probably. Faster than you read, most likely.
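For intuition, here is a toy version of that dequantize-then-compute pattern. This is not the actual MXFP4 kernel: the codebook, nibble packing, and per-row scaling below are all illustrative.

```python
# Toy sketch of the FP4 -> BF16 dequantize-on-the-fly idea.
# NOT the real MXFP4 kernel: codebook, packing, and scaling are illustrative.
import torch

# Hypothetical 16-entry codebook: each 4-bit code indexes one of these values.
CODEBOOK = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequant_fp4(packed: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Unpack two 4-bit codes per byte, look them up, apply a per-row scale."""
    lo = packed & 0x0F                                    # low nibble
    hi = (packed >> 4) & 0x0F                             # high nibble
    codes = torch.stack([lo, hi], dim=-1).flatten(-2)     # interleave nibbles
    return (CODEBOOK[codes.long()] * scale).to(torch.bfloat16)

packed = torch.randint(0, 256, (8, 16), dtype=torch.uint8)  # 8x32 weights, packed
scale = torch.rand(8, 1)                                     # per-row scale factor
w_bf16 = dequant_fp4(packed, scale)                          # de-quantized weights
x = torch.randn(4, 8, dtype=torch.bfloat16)
y = x @ w_bf16                                               # compute happens in BF16
print(y.shape)                                               # torch.Size([4, 32])
```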
- 🆕: formatted-inference-120b.py runs GPT-OSS-120B with device_map="cuda":
python formatted-inference-120b.py
- Works with 24 GB VRAM + 128 GB RAM. Needs <70 GB of total dedicated + shared 'GPU' memory.
- Turn on 'System Memory Fallback' in the NVIDIA settings to use it!
- RTX 4090: about 2 tokens/second for GPT-OSS-120B.
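A minimal sketch of what a device_map="cuda" load looks like with the standard transformers API; the Hub model ID and generation settings here are assumptions, and formatted-inference-120b.py is the real reference.

```python
# Minimal sketch of loading GPT-OSS-120B onto one 'device' so that NVIDIA's
# System Memory Fallback can spill past 24 GB of VRAM into shared RAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-120b"  # assumed Hugging Face Hub ID
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # let transformers pick dtype handling for the FP4 checkpoint
    device_map="cuda",    # everything on the 'GPU'; fallback spills to system RAM
)

msgs = [{"role": "user", "content": "Say hello in five words."}]
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=64)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```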
- Unformatted, raw, simple; streaming tokens:
python simple-inference-stream.py
- Clean formatting, reasoning + response, streaming tokens:
python clean-formatted-inference-stream.py
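For a feel of how the streaming works, here is a minimal sketch using transformers' TextIteratorStreamer. The repo's scripts do more on top of this (clean formatting, splitting reasoning from response), and the model ID is an assumption.

```python
# Minimal token-streaming sketch; clean-formatted-inference-stream.py is the
# full version with formatting and the reasoning/response split.
from threading import Thread
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "openai/gpt-oss-20b"  # assumed Hugging Face Hub ID
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="cuda")

msgs = [{"role": "user", "content": "Explain FP4 quantization in one sentence."}]
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)

# generate() blocks, so it runs in a thread while we consume the stream.
Thread(target=model.generate,
       kwargs=dict(inputs=inputs, streamer=streamer, max_new_tokens=256)).start()
for chunk in streamer:               # text arrives as tokens are generated
    print(chunk, end="", flush=True)
```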
- Stuff some text into the model, get a short response and logprobs:
python test-get-logprobs.py
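In the same spirit as test-get-logprobs.py (the script's details may differ), scoring the log-probability the model assigns to each observed next token looks roughly like this; the model ID is assumed.

```python
# Sketch: per-token log-probabilities from a single forward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # assumed Hugging Face Hub ID
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="cuda")

ids = tok("The capital of France is", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    logits = model(ids).logits                  # [1, seq_len, vocab]
logprobs = torch.log_softmax(logits.float(), dim=-1)

# Log-prob each position assigned to the token that actually came next:
next_lp = logprobs[0, :-1].gather(-1, ids[0, 1:, None]).squeeze(-1)
for t, lp in zip(tok.convert_ids_to_tokens(ids[0, 1:]), next_lp.tolist()):
    print(f"{t!r}: {lp:.3f}")
```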
- Requires PyTorch >=2.4; install it from pytorch.org, then check with:
pip show torch
- What works for me / what I use: torch 2.7.0+cu128
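Beyond pip show, a quick runtime check that the build meets the version floor and actually sees the GPU:

```python
# Sanity check: version floor and CUDA visibility.
import torch
print(torch.__version__)               # want >= 2.4, e.g. '2.7.0+cu128'
print(torch.cuda.is_available())       # should print True
print(torch.cuda.get_device_name(0))   # e.g. 'NVIDIA GeForce RTX 4090'
```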
- Install Triton for Windows (props to woct0rdho):
pip install -U "triton-windows<3.5"
- Install Triton Kernels:
pip install "triton_kernels @ git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels"
- Uninstall the stock Hugging Face transformers:
pip uninstall transformers
- Install this fork:
pip install git+https://github.com/Tsumugii24/transformers
- Check accelerate is >=0.33:
pip show accelerate
- What I use / what works for me:
pip install accelerate==1.2.1
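And a final check that Python picks up the forked transformers and a new-enough accelerate:

```python
# Confirm the versions actually being imported after the swap.
import transformers
import accelerate
print(transformers.__version__)  # should come from the Tsumugii24 fork
print(accelerate.__version__)    # want >= 0.33, e.g. 1.2.1
```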
Example inference on an RTX 4090, real-time video (clean-formatted-inference-stream.py):