HLE is a multi-modal benchmark of 2,500 challenging, closed-ended academic questions across many subjects, written by global experts to resist quick web lookup and to probe reasoning beyond today’s easy leaderboards. While models score above roughly 90% on saturated sets such as MMLU, they remain in the single digits to low double digits on HLE, with poor calibration (often high confidence when wrong). The work is released publicly at lastexam.ai with a private holdout set, aiming to support research and policy with a clearer picture of frontier capability.
Why a new benchmark?
Popular benchmarks have begun to saturate: large language models can exceed roughly 90% accuracy on MMLU-style suites, which makes it harder to see how capability is actually moving. HLE responds with original, verifiable items: multiple-choice and exact-match (short answer), including text-only and image-assisted items, vetted so answers are unambiguous yet not trivially searchable.
What’s in the dataset?
Roughly 14% of questions are multi-modal (text plus an image). About 24% are multiple-choice; the rest are exact-match answers graded against a specification. Figure 2 in the paper summarizes how items cluster by broad area; for example, a large share in mathematics, with additional coverage in physics, biology and medicine, computer science and AI, engineering, chemistry, humanities and social science, and other fields.
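Exact-match grading against an answer specification hinges on normalization: superficial formatting differences should not count as wrong answers. The paper does not spell out its grading rules, so the sketch below is a hypothetical, minimal normalizer (the names `normalize` and `exact_match` are mine, not the paper's):

```python
def normalize(ans: str) -> str:
    """Lowercase, trim, strip a trailing period, and collapse internal
    whitespace so formatting differences don't register as errors."""
    return " ".join(ans.strip().lower().rstrip(".").split())

def exact_match(predicted: str, gold: str) -> bool:
    """True iff the normalized prediction equals the normalized gold answer."""
    return normalize(predicted) == normalize(gold)
```

Real graders also have to handle numeric tolerance, units, and equivalent symbolic forms, which is part of why HLE's public grading pairs a specification with an LLM judge rather than relying on string matching alone.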
How HLE was built
The project drew on nearly 1,000 contributors from 500+ institutions across 50 countries, mostly faculty, researchers, and advanced-degree holders. Each submission includes the prompt, answer spec, rationale, subject, and contributor metadata. Quality control combines automated difficulty checks against frontier models (tens of thousands of candidate questions narrowed to thousands that stumped models) with two rounds of expert review, then ongoing community feedback, bug bounties, and audits for inadvertently searchable items. The authors report an estimated 15.4% expert disagreement rate on the public set, comparable in spirit to disagreement seen in other ML benchmarks.
Evaluation setup
Models answer with reasoning, a final answer, and a confidence score; an LLM judge (o3-mini with structured outputs) checks correctness. Metrics include accuracy and RMS calibration error (how far stated confidence drifts from actual hit rate). The paper also studies how accuracy scales with reasoning-token budget: accuracy tends to rise log-linearly with output tokens up to about 2^14 (16,384) tokens, then can worsen, suggesting that “think longer” helps only up to a point.
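RMS calibration error compares stated confidence against empirical accuracy. A minimal sketch of the standard binned computation follows; the paper's exact binning scheme may differ, so treat this as illustrative:

```python
import math

def rms_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence (in [0, 1]), then take the
    root-mean-square gap between each bin's mean confidence and its
    empirical accuracy, weighted by the fraction of samples in the bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    total = len(confidences)
    sq_sum = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        sq_sum += (len(b) / total) * (mean_conf - accuracy) ** 2
    return math.sqrt(sq_sum)
```

For example, a model that claims 90% confidence on every item but is right only 30% of the time gets an RMS calibration error of 0.6, matching the pattern in Table 1 where low-accuracy models report high confidence.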
Results (selected models)
Frontier systems at the time of writing remain weak on HLE relative to saturated benchmarks, with single-digit accuracy for several top models and somewhat higher scores on multiple-choice than on exact-match items. Post-release models (trained or tuned with awareness of the public HLE questions) score higher but still well below mastery; calibration errors remain substantial. Table 1 from the paper (excerpt below) illustrates the pattern; updated numbers are maintained on lastexam.ai.
| Model | Accuracy (%) | Calibration error (%) |
|---|---|---|
| GPT-4o | 2.7 ± 0.6 | 89 |
| Claude 3.5 Sonnet | 4.1 ± 0.8 | 84 |
| Gemini 1.5 Pro | 4.6 ± 0.8 | 88 |
| o1 | 8.0 ± 1.1 | 83 |
| DeepSeek R1ᵃ | 8.5 ± 1.2 | 73 |
| *Post-release models (evaluated after HLE was open-sourced)* | | |
| Claude 4 Sonnet | 7.8 ± 1.1 | 75 |
| Gemini 2.5 Pro | 21.6 ± 1.6 | 72 |
| GPT-5 | 25.3 ± 1.7 | 50 |

ᵃ Evaluated on a text-only subset (the model is not multi-modal). Values are as reported in the PMC version of the paper; see the live leaderboard for the latest runs.
Scope and limits
The authors stress that HLE measures structured academic knowledge and reasoning on closed-ended items, not autonomous open-ended research, tool use in the wild, or AGI in a broad sense. STEM (especially mathematics) is overweighted relative to some other areas. Over time, models may climb the leaderboard; the benchmark is meant as a reference point and is complemented by plans such as a rolling update (HLE-Rolling) with fresh questions.
Takeaway
HLE makes the gap between “great on familiar benchmarks” and “reliable expert-level closed-book reasoning” easier to see. For governance and R&D, it offers a concrete baseline; for the public, it is a reminder that headline scores on older tests no longer tell the whole story.