Hi, apologies if the question is naive. I'm trying to evaluate GPT-OSS-20B on code benchmark datasets (e.g. mbpp, humaneval) with the following command:
export HF_ALLOW_CODE_EVAL=1
accelerate launch --num_processes 1 -m lm_eval \
    --model hf \
    --model_args pretrained=openai/gpt-oss-20b,dtype=bfloat16 \
    --tasks mbpp \
    --batch_size 32 \
    --confirm_run_unsafe_code
and I got the following results:
| Tasks         | Version | Filter       | n-shot | Metric    |   | Value  |   | Stderr |
|---------------|--------:|--------------|-------:|-----------|---|-------:|---|-------:|
| mbpp          |       1 | extract_code |      3 | pass_at_1 | ↑ | 0      | ± | 0      |
| mbpp_instruct |       1 | extract_code |      3 | pass_at_1 | ↑ | 0      | ± | 0      |
| humaneval     |       1 | create_test  |      0 | pass@1    | ↑ | 0.2622 | ± | 0.0345 |
which are relatively low. I'm wondering whether we can use thinking mode when evaluating these tasks, and if so, how should I do it?
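For example, would something like the command below be the right approach? This is just a guess on my part: I'm assuming that passing the harness's --apply_chat_template flag (so prompts are formatted with the gpt-oss chat template) is what lets the model produce its reasoning before the final answer, and that the *_instruct task variants are the ones intended for chat-formatted prompts.

export HF_ALLOW_CODE_EVAL=1
accelerate launch --num_processes 1 -m lm_eval \
    --model hf \
    --model_args pretrained=openai/gpt-oss-20b,dtype=bfloat16 \
    --tasks mbpp_instruct \
    --apply_chat_template \
    --batch_size 32 \
    --confirm_run_unsafe_code

Or is the low score more likely a filtering issue, i.e. the extract_code filter not picking up the final code block because the reasoning text comes first, so the fix would be a custom filter or different generation settings rather than a different launch command?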