Measuring Massive Multitask Language Understanding (MMLU) is a popular benchmark for evaluating the capabilities of large language models. It inspired several other versions and spin-offs, such as MMLU-Pro[1], MMMLU[2] and MMLU-Redux[3].

Overview


MMLU consists of 15,908 multiple-choice questions, 1,540 of which are set aside for selecting and assessing model settings such as temperature, batch size and learning rate. The questions span 57 subjects, from highly complex STEM fields and international law to nutrition and religion. It was one of the most commonly used benchmarks for comparing the capabilities of large language models, with over 100 million downloads as of July 2024.[4][5]

The benchmark was released by Dan Hendrycks and a team of researchers on 7 September 2020. It was designed to be more challenging than the benchmarks available at the time, such as General Language Understanding Evaluation (GLUE), because models had begun to outperform humans on easier tests. When MMLU was released, most existing language models scored near the level of random chance (25%). The best-performing model, GPT-3 175B, achieved 43.9% accuracy. The creators of MMLU estimated that human domain experts achieve around 89.8% accuracy.[4] By mid-2024, most leading language models, such as Claude 3.5 Sonnet, GPT-4o and Llama 3.1 405B, consistently scored around 88%.[6][7][8] As of 2025, MMLU has been partially phased out in favor of more difficult alternatives.
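
The 25% chance level follows from each question having four answer options, and the reported scores are accuracies over those questions. A minimal sketch of that arithmetic, assuming predictions and gold answers are given as option letters; the function names `accuracy` and `chance_accuracy` and the toy data are illustrative and are not part of any official MMLU tooling:

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions where the predicted letter matches the gold letter."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def chance_accuracy(num_choices: int = 4) -> float:
    """Expected accuracy of guessing uniformly at random among the options."""
    return 1.0 / num_choices


# Hypothetical toy data, not real MMLU items.
gold = ["B", "B", "C", "A"]
pred = ["B", "D", "C", "A"]
print(accuracy(pred, gold))  # 0.75
print(chance_accuracy())     # 0.25
```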

Limitations


On 5 June 2024, researchers released a paper detailing their manual analysis of 5,700 questions in the benchmark, which revealed a significant number of ground-truth errors. For example, 57% of questions in the "Virology" subset were flagged as containing errors, including multiple correct answers (4%), unclear questions (14%) and completely incorrect answers (33%). Overall, they estimated that 6.5% of questions in MMLU contained an error, suggesting that the maximum attainable score was significantly below 100%.[9] Data contamination also threatened the benchmark's validity: companies could easily include the questions and answers in their models' training data, which would render the benchmark ineffective.[10]

Examples


The following examples are sourced from the "Abstract Algebra", "International Law" and "Professional Medicine" tasks, respectively.[4] The correct answers are (B), (B) and (C), respectively:

Question 1:

Find all $c$ in $\mathbb{Z}_3$ such that $\mathbb{Z}_3[x]/(x^2 + c)$ is a field.

(A) 0 │ (B) 1 │ (C) 2 │ (D) 3

Question 2:

Would a reservation to the definition of torture in the International Covenant on Civil and Political Rights (ICCPR) be acceptable in contemporary practice?

(A) This is an acceptable reservation if the reserving country’s legislation employs a different definition.
(B) This is an unacceptable reservation because it contravenes the object and purpose of the ICCPR.
(C) This is an unacceptable reservation because the definition of torture in the ICCPR is consistent with customary international law.
(D) This is an acceptable reservation because under general international law States have the right to enter reservations to treaties.

Question 3:

A 33-year-old man undergoes a radical thyroidectomy for thyroid cancer. During the operation, moderate hemorrhaging requires ligation of several vessels in the left side of the neck. Postoperatively, serum studies show a calcium concentration of 7.5 mg/dL, albumin concentration of 4 g/dL, and parathyroid hormone concentration of 200 pg/mL. Damage to which of the following vessels caused the findings in this patient?

(A) Branch of the costocervical trunk.
(B) Branch of the external carotid artery.
(C) Branch of the thyrocervical trunk.
(D) Tributary of the internal jugular vein.
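
The first example can be checked by brute force. This is a minimal sketch, assuming the question reads as in the MMLU paper: find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field. The quotient is a field exactly when x^2 + c is irreducible over Z_3, and a quadratic over a field is irreducible exactly when it has no root in that field. The helper name `field_values` is illustrative:

```python
def field_values(p: int = 3) -> list[int]:
    """Return every c in Z_p for which x^2 + c has no root in Z_p.

    Assumes p is prime, so that Z_p is a field and "no root" is
    equivalent to "irreducible" for a degree-2 polynomial.
    """
    return [
        c
        for c in range(p)
        if all((x * x + c) % p != 0 for x in range(p))
    ]


print(field_values())  # [1]
```

Only c = 1 survives, which corresponds to option (B): x^2 has the root 0, and x^2 + 2 has the root 1, since 1 + 2 = 3 ≡ 0 (mod 3).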

References

  1. TIGER-AI-Lab/MMLU-Pro, TIGER Lab, 2026-05-13, retrieved 2026-05-14
  2. "MMMLU Benchmark Leaderboard". LLM Stats. 2026-05-14. Retrieved 2026-05-14.
  3. Gema, Aryo Pradipta (2026-02-07), aryopg/mmlu-redux, retrieved 2026-05-14
  4. Hendrycks, Dan; Burns, Collin; Basart, Steven; Zou, Andy; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob (2021). "Measuring Massive Multitask Language Understanding". ICLR. arXiv:2009.03300.
  5. "cais/mmlu". Hugging Face. 2024-07-08. Retrieved 2024-07-24.
  6. "Introducing Claude 3.5 Sonnet". Anthropic. Retrieved 2025-04-06.
  7. "Hello GPT-4o". OpenAI. 2024-05-13. Retrieved 2025-04-06.
  8. "Introducing Llama 3.1: Our most capable models to date". Meta blog. 2024-07-23. Retrieved 2025-04-06.
  9. Gema, Aryo Pradipta; Leang, Joshua Ong Jun; Hong, Giwon; Devoto, Alessio; Mancino, Alberto Carlo Maria; Saxena, Rohit; He, Xuanli; Zhao, Yu; Du, Xiaotang; Madani, Mohammad Reza Ghasemi; Barale, Claire; McHardy, Robert; Harris, Joshua; Kaddour, Jean; Krieken, Emile van; Minervini, Pasquale (2024-06-07). "Are We Done with MMLU?". arXiv:2406.04127 [cs.CL].
  10. Roose, Kevin (2024-04-15). "A.I. Has a Measurement Problem". The New York Times. ISSN 0362-4331. Retrieved 2024-04-21.
