Benchmark
The Massive Multitask Language Understanding (MMLU) benchmark consists of about 16,000 multiple-choice questions spanning 57 academic subjects, including mathematics, philosophy, law, and medicine. It is one of the most commonly used benchmarks for comparing the capabilities of large language models, with over 100 million downloads as of July 2024. The MMLU was released by Dan Hendrycks and a team of researchers in 2020 and was designed to be more challenging than then-existing benchmarks such as General Language Understanding Evaluation (GLUE), on which new language models were achieving better-than-human accuracy. At the time of the MMLU's release, most existing language models performed around the level of random chance (25%).
Examples
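The following examples are taken from the "Abstract Algebra" and "International Law" tasks, respectively.
Find all $c$ in $\mathbb{Z}_3$ such that $\mathbb{Z}_3[x]/(x^2 + c)$ is a field.
(A) 0
(B) 1
(C) 2
(D) 3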
Would a reservation to the definition of torture in the ICCPR be acceptable in contemporary practice?
(A) This is an acceptable reservation if the reserving country’s legislation employs a different definition
(B) This is an unacceptable reservation because it contravenes the object and purpose of the ICCPR
(C) This is an unacceptable reservation because the definition of torture in the ICCPR is consistent with customary international law
(D) This is an acceptable reservation because under general international law States have the right to enter reservations to treaties
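The four-option format of the questions above, together with the 25% random-chance level, can be illustrated with a short sketch. This is not the benchmark's official evaluation code: the `Question` class, the helper functions, and the prompt layout are hypothetical names and choices introduced here, and the sample item is a made-up toy question rather than one drawn from the dataset.

```python
from dataclasses import dataclass
from typing import Sequence

LETTERS = "ABCD"  # assumed labels for the four answer options


@dataclass(frozen=True)
class Question:
    """Hypothetical container for one four-option multiple-choice item."""
    stem: str
    choices: Sequence[str]  # exactly four answer texts
    answer: str             # gold label, one of "A", "B", "C", "D"


def format_prompt(q: Question) -> str:
    """Render a question in the '(A) ... (D) ...' layout shown above."""
    lines = [q.stem]
    lines += [f"({letter}) {text}" for letter, text in zip(LETTERS, q.choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(questions: Sequence[Question], predictions: Sequence[str]) -> float:
    """Fraction of predicted letters that match the gold labels."""
    if not questions or len(questions) != len(predictions):
        raise ValueError("need one prediction per question, and at least one question")
    correct = sum(q.answer == p for q, p in zip(questions, predictions))
    return correct / len(questions)


def chance_accuracy(num_choices: int = 4) -> float:
    """Expected accuracy of uniform random guessing."""
    return 1 / num_choices


if __name__ == "__main__":
    # Toy item invented for illustration only.
    toy = Question("What is 2 + 2?", ("3", "4", "5", "6"), "B")
    print(format_prompt(toy))
    print(accuracy([toy, toy], ["B", "A"]))  # one of two correct: 0.5
    print(chance_accuracy())                 # four options: 0.25
```

Comparing a model's accuracy against `chance_accuracy()` gives a quick sense of whether it performs above random guessing on a four-option task.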