Frontier AI Models Now Clear 50% on Humanity's Last Exam

Claude Opus 4.6 and Gemini 3.1 Pro have passed 50% on Humanity's Last Exam — a benchmark designed to be unsolvable by current AI systems.

Hector Herrera

Apr 15 at 1:00 AM CT




By Hector Herrera | April 15, 2026 | Science

Claude Opus 4.6 and Gemini 3.1 Pro have crossed 50% accuracy on Humanity's Last Exam (HLE), a benchmark specifically designed to be beyond the reach of current AI. They cleared the halfway mark anyway. MIT Technology Review's April 13 analysis documents the milestone with charts drawn from benchmark tracking data across major frontier models.

What Humanity's Last Exam is: HLE is a benchmark of 3,000 questions across expert domains (mathematics, physics, chemistry, biology, history, law, and others), sourced from PhD-level material and verified to be beyond what any AI could answer at the time of its release (early 2025). It was designed to last years. It lasted about 15 months.

What 50% means: Random guessing on a multiple-choice exam with four or five options would score 20-25%. Expert human performance on HLE varies by domain but averages around 85-90% for specialists in each field. A 50%+ score from an AI system means models are now operating in the range between random chance and human expert, further toward the expert end than many expected at this point.
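To make that range concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the midpoints of the figures quoted above (22.5% for guessing, 87.5% for specialists) and a model score of exactly 50%. These are illustrative inputs, not measured values, and the function name `position_between` is invented for this example.

```python
def position_between(score: float, chance: float, expert: float) -> float:
    """Return how far `score` sits along the span from `chance` to `expert`.

    0.0 means no better than guessing; 1.0 means level with human experts.
    """
    if expert <= chance:
        raise ValueError("expert level must be higher than the chance baseline")
    return (score - chance) / (expert - chance)


# Assumed inputs: midpoints of the ranges quoted in the article.
CHANCE = 0.225  # midpoint of the 20-25% guessing range
EXPERT = 0.875  # midpoint of the 85-90% specialist range
MODEL = 0.50    # the 50% threshold the models crossed

frac = position_between(MODEL, CHANCE, EXPERT)
print(f"{frac:.0%} of the way from chance to expert")
```

Under those assumptions, a 50% score lands a little over 40% of the way from guessing to specialist performance: well clear of chance, but still short of the midpoint. The exact figure shifts with the baseline you pick, so treat it as a rough gauge rather than a measurement.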

What it does not mean: Raw benchmark scores do not translate directly to real-world capability. HLE questions are answered in controlled conditions with no time pressure, unlimited attempts, and no consequences for error. Deploying a model that can answer 50% of PhD-level chemistry questions in a production context — where errors cause harm, context is ambiguous, and users may not know when the model is wrong — is a separate and harder problem.

The benchmark acceleration is real. The deployment gap — the distance between what models can do in testing and what they reliably do in production — remains significant. Both things are true at the same time.

Hector Herrera is the founder of Hex AI Systems and editor of NexChron.


