Measuring Massive Multitask Language Understanding (MMLU) is a popular benchmark for evaluating the capabilities of large language models. It inspired several other versions and spin-offs, such as MMLU-Pro, MMMLU and MMLU-Redux.

Overview

MMLU consists of 15,908 multiple-choice questions, 1,540 of which are used to select and assess optimal settings for models (temperature, batch size and learning rate). The questions span 57 subjects, from highly complex STEM fields and international law to nutrition and religion. It was one of the most commonly used benchmarks for comparing the capabilities of large language models, with over 100 million downloads as of July 2024.

The benchmark was released by Dan Hendrycks and a team of researchers on 7 September 2020. It was purpose-made to be more challenging than existing benchmarks at the time, such as General Language Understanding Evaluation (GLUE), as models had begun outperforming humans on easier tests. When MMLU was released, most existing language models scored near the level of random chance (25%). The best-performing model, GPT-3 175B, achieved 43.9% accuracy. The creators of MMLU estimated that human domain experts achieve around 89.8% accuracy. By mid-2024, most of the strongest language models, such as Claude 3.5 Sonnet, GPT-4o and Llama 3.1 405B, consistently achieved 88%. As of 2025, MMLU has been partially phased out in favor of more difficult alternatives.
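
The scores quoted above are plain accuracies on four-option questions, which is why random guessing sits at 25%. The following minimal Python sketch shows that calculation. The function names and the toy answer lists are illustrative assumptions and are not taken from the MMLU release or any official evaluation harness.

```python
def accuracy(predictions, answers):
    """Return the fraction of questions whose predicted letter matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def random_chance_baseline(num_options=4):
    """Expected accuracy of uniform random guessing among num_options choices."""
    return 1 / num_options


# Toy data (hypothetical): four questions, one wrong prediction.
answers = ["B", "B", "C", "A"]
predictions = ["B", "D", "C", "A"]

print(f"accuracy: {accuracy(predictions, answers):.1%}")       # accuracy: 75.0%
print(f"random baseline: {random_chance_baseline():.1%}")      # random baseline: 25.0%
```

Under this measure, GPT-3 175B's 43.9% was roughly 19 percentage points above chance, while the estimated expert level of 89.8% is about 65 points above it.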

Limitations

On 5 June 2024, researchers released a paper detailing their manual analysis of 5,700 questions in the benchmark, which revealed a substantial number of ground-truth errors. For example, 57% of questions in the "Virology" subset were marked as containing errors, such as multiple correct answers (4%), unclear questions (14%), or completely incorrect answers (33%). Overall, they estimated that 6.5% of questions in MMLU contained an error, suggesting that the maximum attainable score was significantly below 100%. Data contamination also posed a significant threat to the benchmark's validity: companies could easily include the questions and answers in their models' training data, rendering the benchmark ineffective.

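To illustrate how an error rate in the answer key caps the attainable score, the sketch below computes a rough ceiling for a model that always gives the truly correct answer. It rests on a simplifying assumption that is not from the paper: on a flawed question, such a model matches the recorded key with some probability `match_prob`. The function name and both parameters are hypothetical.

```python
def score_ceiling(error_rate, match_prob=0.0):
    """Rough upper bound on accuracy when a fraction of answer keys is flawed.

    error_rate: fraction of questions with a ground-truth error (0..1).
    match_prob: assumed chance that a truly correct answer still matches
                the flawed key (0 is the most pessimistic case).
    """
    if not 0.0 <= error_rate <= 1.0 or not 0.0 <= match_prob <= 1.0:
        raise ValueError("error_rate and match_prob must lie in [0, 1]")
    return (1.0 - error_rate) + error_rate * match_prob


# 6.5% estimated error rate, assuming flawed questions never earn credit.
print(f"{score_ceiling(0.065):.1%}")        # 93.5%
# Same error rate, assuming flawed questions earn credit a quarter of the time.
print(f"{score_ceiling(0.065, 0.25):.1%}")  # 95.1%
```

Not every flawed question necessarily costs a point (a question with multiple correct answers may still be scored correctly), so the pessimistic figure is only a lower estimate of the true ceiling.
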
Examples

The following examples are sourced from the "Abstract Algebra", "International Law" and "Professional Medicine" tasks, respectively. The correct answer is given after each question.

Question 1: Find all $c$ in $\mathbb{Z}_3$ such that $\mathbb{Z}_3[x]/(x^2 + c)$ is a field.
(A) 0
(B) 1
(C) 2
(D) 3
Correct answer: (B)

Question 2: Would a reservation to the definition of torture in the International Covenant on Civil and Political Rights (ICCPR) be acceptable in contemporary practice?
(A) This is an acceptable reservation if the reserving country’s legislation employs a different definition.
(B) This is an unacceptable reservation because it contravenes the object and purpose of the ICCPR.
(C) This is an unacceptable reservation because the definition of torture in the ICCPR is consistent with customary international law.
(D) This is an acceptable reservation because under general international law States have the right to enter reservations to treaties.
Correct answer: (B)

Question 3: A 33-year-old man undergoes a radical thyroidectomy for thyroid cancer. During the operation, moderate hemorrhaging requires ligation of several vessels in the left side of the neck. Postoperatively, serum studies show a calcium concentration of 7.5 mg/dL, albumin concentration of 4 g/dL, and parathyroid hormone concentration of 200 pg/mL. Damage to which of the following vessels caused the findings in this patient?
(A) Branch of the costocervical trunk.
(B) Branch of the external carotid artery.
(C) Branch of the thyrocervical trunk.
(D) Tributary of the internal jugular vein.
Correct answer: (C)
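
Question 1 asks for which c in Z_3 the quotient of the polynomial ring Z_3[x] by (x^2 + c) is a field. That quotient is a field exactly when x^2 + c is irreducible over Z_3, and a quadratic over a field is irreducible exactly when it has no root there. The short brute-force check below is a sketch of that reasoning. The function name is hypothetical, and the prime modulus is a parameter only for illustration.

```python
def field_constants(p=3):
    """Return every c in Z_p for which x^2 + c has no root in Z_p.

    For a prime p, these are the values of c that make Z_p[x]/(x^2 + c) a field.
    """
    return [
        c
        for c in range(p)
        if all((x * x + c) % p != 0 for x in range(p))
    ]


print(field_constants())  # [1]
```

Only c = 1 survives: x^2 has the root 0, and x^2 + 2 has the root 1. Option (D) is excluded as well, since 3 is congruent to 0 modulo 3.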
