In Depth
MMLU consists of four-option multiple-choice questions drawn from standardized tests and academic exams, spanning 57 subjects. A high score suggests that a model has absorbed broad factual knowledge and can apply it in reasoning across STEM, humanities, and professional domains. Frontier models now exceed average human performance on MMLU, which leaves little headroom for telling them apart, so researchers have developed harder benchmarks such as MMLU-Pro and GPQA to restore discriminative power.
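Because every question is multiple choice, scoring reduces to accuracy: the fraction of questions where the model's chosen option matches the answer key. The sketch below shows one way this could be computed, both per subject and overall. The records are invented for illustration and are not real MMLU items or model outputs, the helper names are hypothetical, and the 25% chance baseline assumes four options per question.

```python
from collections import defaultdict

# Hypothetical records: (subject, answer_key_letter, model_letter).
# These are illustrative values, not real benchmark data.
results = [
    ("college_physics", "B", "B"),
    ("college_physics", "D", "A"),
    ("philosophy", "C", "C"),
    ("philosophy", "A", "A"),
    ("professional_law", "D", "D"),
    ("professional_law", "B", "C"),
]

# Assumes four answer options per question, so random guessing scores 25%.
CHANCE_BASELINE = 1 / 4


def accuracy_by_subject(records):
    """Return {subject: fraction of that subject's questions answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, gold, predicted in records:
        total[subject] += 1
        if predicted == gold:
            correct[subject] += 1
    return {subject: correct[subject] / total[subject] for subject in total}


def overall_accuracy(records):
    """Return the fraction of all questions answered correctly."""
    hits = sum(1 for _, gold, predicted in records if predicted == gold)
    return hits / len(records)


per_subject = accuracy_by_subject(results)
overall = overall_accuracy(results)

for subject, acc in sorted(per_subject.items()):
    print(f"{subject}: {acc:.2f}")
print(f"overall: {overall:.3f} (chance: {CHANCE_BASELINE:.2f})")
```

Reporting per-subject accuracy alongside the overall figure helps show whether a high score reflects breadth or is driven by a few strong domains.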