In artificial intelligence, Measuring Massive Multitask Language Understanding (MMLU) is a benchmark for evaluating the capabilities of large language models.

Benchmark

MMLU consists of about 16,000 multiple-choice questions spanning 57 academic subjects including mathematics, philosophy, law, and medicine. It is one of the most commonly used benchmarks for comparing the capabilities of large language models, with over 100 million downloads as of July 2024. The MMLU was released by Dan Hendrycks and a team of researchers in 2020 and was designed to be more challenging than then-existing benchmarks such as General Language Understanding Evaluation (GLUE), on which new language models were achieving better-than-human accuracy. At the time of the MMLU's release, most existing language models performed around the level of random chance (25%), with the best-performing GPT-3 model achieving 43.9% accuracy. The developers of the MMLU estimate that human domain experts achieve around 89.8% accuracy. As of 2024, some of the most powerful language models, such as Claude 3 and GPT-4, were reported to achieve scores in the mid-80s.
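Because every question offers four options, a score is a plain accuracy, and uniform guessing is expected to land near the 25% chance level mentioned above. The following is a minimal sketch of that scoring, not the authors' evaluation code. The function names `accuracy` and `random_guess_accuracy` and the synthetic answer key are hypothetical and stand in for real benchmark data.

```python
import random

def accuracy(predictions, answers):
    """Fraction of questions where the predicted letter matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def random_guess_accuracy(answers, seed=0):
    """Accuracy of a guesser that picks one of A-D uniformly at random."""
    rng = random.Random(seed)
    guesses = [rng.choice("ABCD") for _ in answers]
    return accuracy(guesses, answers)

# Synthetic answer key (assumption: not real MMLU data), sized like the
# benchmark's roughly 16,000 questions.
key_rng = random.Random(1)
answer_key = [key_rng.choice("ABCD") for _ in range(16000)]

print(f"random guessing: {random_guess_accuracy(answer_key):.3f}")  # close to 0.25
```

A model's reported score corresponds to calling `accuracy` with the model's chosen letters in place of random guesses.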

Examples

The following examples are taken from the "Abstract Algebra" and "International Law" tasks, respectively. The correct answers are marked in boldface:

Find all $c$ in $\mathbb{Z}_3$ such that $\mathbb{Z}_3[x]/(x^2 + c)$ is a field. (A) 0 '''(B) 1''' (C) 2 (D) 3

Would a reservation to the definition of torture in the ICCPR be acceptable in contemporary practice?
(A) This is an acceptable reservation if the reserving country’s legislation employs a different definition
'''(B) This is an unacceptable reservation because it contravenes the object and purpose of the ICCPR'''
(C) This is an unacceptable reservation because the definition of torture in the ICCPR is consistent with customary international law
(D) This is an acceptable reservation because under general international law States have the right to enter reservations to treaties
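The Abstract Algebra example can be checked by brute force. The quotient $\mathbb{Z}_3[x]/(x^2 + c)$ is a field exactly when $x^2 + c$ is irreducible over $\mathbb{Z}_3$, and a quadratic over a field is irreducible exactly when it has no root in that field. The sketch below tests each $c$ this way; the function name `field_values` is hypothetical, and the argument assumes the modulus `p` is prime.

```python
def field_values(p=3):
    """Return every c in Z_p for which x^2 + c has no root modulo p.

    For prime p, these are the values of c that make Z_p[x]/(x^2 + c) a field.
    """
    return [
        c
        for c in range(p)
        if all((x * x + c) % p != 0 for x in range(p))
    ]

print(field_values())  # [1]
```

Only $c = 1$ survives, which corresponds to option (B): $x^2$ has the root 0, and $x^2 + 2$ has the root 1.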

Leaderboard
