luiscosio

This PR adds generative task definitions for two MMLU-Redux datasets, one English and one Spanish.

The task definitions follow the same structure and evaluation metrics as existing MMLU tasks, using exact_match for scoring with weight_by_size enabled. Both datasets are organized into 4 main groups:

  • STEM
  • Other
  • Social Sciences
  • Humanities

Each group maintains consistent evaluation metrics and aggregation methods across both language versions.
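
As a rough illustration of what size-weighted aggregation means for a group score, the sketch below compares a size-weighted mean of per-subtask exact_match scores with a plain mean. The function names, scores, and subtask sizes are hypothetical and chosen only for illustration; this is not the harness's own implementation.

```python
from typing import Sequence


def weighted_exact_match(scores: Sequence[float], sizes: Sequence[int]) -> float:
    """Size-weighted mean of per-subtask exact_match scores.

    Each subtask contributes in proportion to its number of examples,
    which is the behaviour assumed here for weight_by_size: true.
    """
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    total = sum(sizes)
    if total == 0:
        raise ValueError("at least one subtask must contain examples")
    return sum(score * size for score, size in zip(scores, sizes)) / total


def unweighted_exact_match(scores: Sequence[float]) -> float:
    """Plain mean: every subtask counts equally, regardless of size."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(scores) / len(scores)


# Hypothetical results for two subtasks inside one group (e.g. STEM).
group_scores = [0.5, 0.75]
group_sizes = [100, 300]

print("weighted:  ", weighted_exact_match(group_scores, group_sizes))
print("unweighted:", unweighted_exact_match(group_scores))
```

With weighting enabled, the larger subtask pulls the group score toward its own result, so the group score matches the accuracy over all pooled examples.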

Changes include:

  • Added task definitions for generative format evaluation
  • Consistent group structure between English and Spanish versions
  • Maintained weight_by_size true for all metrics
  • Version 3 metadata tag for compatibility

This enhancement allows for direct comparison of model performance between English and Spanish versions of MMLU-Redux in a generative setting.

@baberabb
Contributor

Hi! Thanks for the PR. Just some minor issues:

  1. The test is failing because it can't find one of the subtask configs on the HF Hub (probably a typo).
  2. Could you add the README from template/new_yaml_task, and also add an entry in lm_eval/tasks/README.md?
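
One way to track down a config-name mismatch like the one in item 1 is to compare the subtask names used in the task definitions against the config names the dataset actually exposes. The sketch below is only an illustration: the helper name and the example subject names are hypothetical, and in practice the list of available names could come from `datasets.get_dataset_config_names`, assuming the `datasets` library is installed.

```python
import difflib
from typing import Dict, Iterable, List


def find_missing_configs(
    expected: Iterable[str], available: Iterable[str]
) -> Dict[str, List[str]]:
    """Map each expected config name that is not available to close matches.

    An empty suggestion list means no similar name was found.
    """
    available_names = sorted(set(available))
    missing: Dict[str, List[str]] = {}
    for name in expected:
        if name not in available_names:
            # Suggest up to three similar names, best match first.
            missing[name] = difflib.get_close_matches(
                name, available_names, n=3, cutoff=0.6
            )
    return missing


# Hypothetical names: one expected subtask contains a typo.
expected_subtasks = ["anatomy", "astronmy", "virology"]
available_configs = ["anatomy", "astronomy", "virology"]

for name, suggestions in find_missing_configs(
    expected_subtasks, available_configs
).items():
    print(f"missing config {name!r}; close matches: {suggestions}")
```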

@StellaAthena
Member

Also, can you add results showing that this runs and reproduces the results from their paper?

@jgcb00
jgcb00 commented Jun 2, 2025

Hi @luiscosio, could you fix it? It would greatly ease adoption of this benchmark over the standard MMLU, and I’m keen to make it the new standard for our models.

@luiscosio
Author

@jgcb00 I will fix it this week.

@CT-6282 requested a review from StellaAthena as a code owner on July 2, 2025 17:52
@CLAassistant
CLAassistant commented Jul 2, 2025

All committers have signed the CLA.

@CT-6282
CT-6282 commented Jul 2, 2025

Hi @jgcb00, I addressed the errors in the tasks, and the tests passed locally. I also added the READMEs.
