HLE is a multi-modal benchmark of 2,500 challenging, closed-ended academic questions across many subjects, written by global experts to resist quick web lookup and to probe reasoning beyond today’s easy leaderboards. While models score above roughly 90% on saturated sets such as MMLU, they remain in the single digits to low double digits on HLE, with poor calibration (often high confidence when wrong). The work is released publicly at lastexam.ai with a private holdout set, aiming to support research and policy with a clearer picture of frontier capability.
Why a new benchmark?
Popular benchmarks have begun to saturate: large language models can exceed roughly 90% accuracy on MMLU-style suites, which makes it harder to see how capability is actually moving. HLE responds with original, verifiable items: multiple-choice and exact-match (short answer), including text-only and image-assisted items, vetted so answers are unambiguous yet not trivially searchable.
What’s in the dataset?
Roughly 14% of questions are multi-modal (text plus an image). About 24% are multiple-choice; the rest are exact-match answers graded against a specification. Figure 2 in the paper summarizes how items cluster by broad area; for example, a large share in mathematics, with additional coverage in physics, biology and medicine, computer science and AI, engineering, chemistry, humanities and social science, and other fields.
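Exact-match grading against an answer specification hinges on normalization: superficial formatting differences should not count as wrong answers. The paper does not spell out its grading rules, so the sketch below is a hypothetical, minimal normalizer (the names `normalize` and `exact_match` are mine, not the paper's):

```python
def normalize(ans: str) -> str:
    """Lowercase, trim, strip a trailing period, and collapse internal
    whitespace so formatting differences don't register as errors."""
    return " ".join(ans.strip().lower().rstrip(".").split())

def exact_match(predicted: str, gold: str) -> bool:
    """True iff the normalized prediction equals the normalized gold answer."""
    return normalize(predicted) == normalize(gold)
```

Real graders also have to handle numeric tolerance, units, and equivalent symbolic forms, which is part of why HLE's public grading pairs a specification with an LLM judge rather than relying on string matching alone.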
How HLE was built
The project drew on nearly 1,000 contributors from 500+ institutions across 50 countries, mostly faculty, researchers, and advanced-degree holders. Each submission includes the prompt, answer spec, rationale, subject, and contributor metadata. Quality control combines automated difficulty checks against frontier models (tens of thousands of candidate questions narrowed to thousands that stumped models) with two rounds of expert review, then ongoing community feedback, bug bounties, and audits for inadvertently searchable items. The authors report an estimated 15.4% expert disagreement rate on the public set, comparable in spirit to disagreement seen in other ML benchmarks.
Evaluation setup
Models answer with reasoning, a final answer, and a confidence score; an LLM judge (o3-mini with structured outputs) checks correctness. Metrics include accuracy and RMS calibration error (how far stated confidence drifts from actual hit rate). The paper also studies how accuracy scales with reasoning-token budget: accuracy tends to rise log-linearly with output tokens up to about 2^14 (16,384) tokens, then can worsen, suggesting that “think longer” helps only up to a point.
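RMS calibration error compares stated confidence against empirical accuracy. A minimal sketch of the standard binned computation follows; the paper's exact binning scheme may differ, so treat this as illustrative:

```python
import math

def rms_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence (in [0, 1]), then take the
    root-mean-square gap between each bin's mean confidence and its
    empirical accuracy, weighted by the fraction of samples in the bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    total = len(confidences)
    sq_sum = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        sq_sum += (len(b) / total) * (mean_conf - accuracy) ** 2
    return math.sqrt(sq_sum)
```

For example, a model that claims 90% confidence on every item but is right only 30% of the time gets an RMS calibration error of 0.6, matching the pattern in Table 1 where low-accuracy models report high confidence.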
Results (selected models)
Frontier systems at the time of writing remain weak on HLE relative to saturated benchmarks, with single-digit accuracy for several top models and somewhat higher scores on multiple-choice than on exact-match items. Post-release models (trained or tuned with awareness of the public HLE questions) score higher but still well below mastery; calibration errors remain substantial. Table 1 from the paper (excerpt below) illustrates the pattern; updated numbers are maintained on lastexam.ai.
| Model | Accuracy (%) | Calibration error (%) |
|---|---|---|
| GPT-4o | 2.7 ± 0.6 | 89 |
| Claude 3.5 Sonnet | 4.1 ± 0.8 | 84 |
| Gemini 1.5 Pro | 4.6 ± 0.8 | 88 |
| o1 | 8.0 ± 1.1 | 83 |
| DeepSeek R1ᵃ | 8.5 ± 1.2 | 73 |
| *Post-release models (evaluated after HLE was open-sourced)* | | |
| Claude 4 Sonnet | 7.8 ± 1.1 | 75 |
| Gemini 2.5 Pro | 21.6 ± 1.6 | 72 |
| GPT-5 | 25.3 ± 1.7 | 50 |

ᵃ Evaluated on a text-only subset (the model is not multi-modal). Values are as reported in the PMC version of the paper; see the live leaderboard for the latest runs.
Scope and limits
The authors stress that HLE measures structured academic knowledge and reasoning on closed-ended items, not autonomous open-ended research, tool use in the wild, or AGI in a broad sense. STEM (especially mathematics) is overweighted relative to some other areas. Over time, models may climb the leaderboard; the benchmark is meant as a reference point and is complemented by plans such as a rolling update (HLE-Rolling) with fresh questions.
Takeaway
HLE makes the gap between “great on familiar benchmarks” and “reliable expert-level closed-book reasoning” easier to see. For governance and R&D, it offers a concrete baseline; for the public, it is a reminder that headline scores on older tests no longer tell the whole story.