Evaluate GPT-OSS-20B on code benchmark datasets #3274

@neilwen987

Description

Hi, apologies if the question is naive. I'm trying to evaluate GPT-OSS-20B on code benchmark datasets (e.g. mbpp, humaneval) with the following command:

export HF_ALLOW_CODE_EVAL=1
accelerate launch --num_processes 1 -m lm_eval \
  --model hf \
  --model_args pretrained=openai/gpt-oss-20b,dtype=bfloat16 \
  --tasks mbpp \
  --batch_size 32 \
  --confirm_run_unsafe_code

and I got the following results:

| Tasks         | Version | Filter       | n-shot | Metric    | Value  | Stderr   |
|---------------|---------|--------------|--------|-----------|--------|----------|
| mbpp          | 1       | extract_code | 3      | pass_at_1 | 0      | ± 0      |
| mbpp_instruct | 1       | extract_code | 3      | pass_at_1 | 0      | ± 0      |
| humaneval     | 1       | create_test  | 0      | pass@1    | 0.2622 | ± 0.0345 |

which are relatively low. I'm wondering whether we can use the model's thinking mode when evaluating these tasks, and if so, how should I enable it?
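
In case it helps clarify what I'm asking, this is the kind of variant I had in mind: applying the chat template so prompts go through the gpt-oss harmony format, and enlarging the generation budget so the reasoning plus the final program fits. Whether a system instruction like "Reasoning: high" actually switches the model to a higher reasoning effort (rather than just being passed through as an ordinary developer message) is exactly the part I'm unsure about, so treat this as a sketch, not something I've verified works.

# Sketch only: --apply_chat_template routes prompts through the model's chat
# template (the gpt-oss harmony format); --system_instruction and the larger
# max_gen_toks budget are my guesses at how to leave room for the reasoning trace.
export HF_ALLOW_CODE_EVAL=1
accelerate launch --num_processes 1 -m lm_eval \
  --model hf \
  --model_args pretrained=openai/gpt-oss-20b,dtype=bfloat16 \
  --tasks mbpp_instruct \
  --num_fewshot 3 \
  --apply_chat_template \
  --system_instruction "Reasoning: high" \
  --gen_kwargs max_gen_toks=4096 \
  --batch_size 32 \
  --confirm_run_unsafe_code

If there is an intended way to set the reasoning effort (e.g. through model_args or the task config) instead, pointers would be appreciated.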
