What is HLE (Humanity's Last Exam) in AI?

Question

Accepted Answer

## What is HLE (Humanity's Last Exam) in AI?

**HLE (Humanity's Last Exam)** is one of the hardest AI benchmarks published to date. It tests frontier AI models on graduate-level and expert-level academic questions across dozens of subjects.

### What Is HLE?

HLE is a benchmark released in early 2025. The initial release described **3,000 questions** (the finalized public set was later trimmed to 2,500), written by domain experts across:
- Mathematics (pure and applied)
- Physics, Chemistry, Biology
- Computer Science
- Economics, Law, Medicine
- History, Philosophy, Linguistics
- And many other fields

The questions are deliberately difficult: they require genuine expert-level understanding, not just retrieval.

### Why "Humanity's Last Exam"?

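The name reflects the aspiration that HLE will be the **final closed-ended academic exam** of its kind: problems so hard that once AI can reliably answer them, it has matched human expert performance on closed-ended academic questions in those domains. A high score would not, on its own, demonstrate general intelligence or open-ended research ability.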

### Benchmark Results (Early 2025)

| Model | HLE Score |
|-------|----------|
| **o3** (OpenAI) | ~9.7% |
| **Gemini 2.0 Flash Thinking** | ~6.1% |
| **Claude 3.5 Sonnet** | ~4.3% |
| **GPT-4o** | ~3.3% |
| **Human experts** | ~90%+ (in their field) |

These scores are low by design: the benchmark was built so that current AI models perform poorly.
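
Each percentage in the table above is an accuracy: the share of questions a model answered correctly. Below is a minimal sketch of that arithmetic, assuming every answer has already been graded as correct or incorrect. `GradedAnswer` and `accuracy_percent` are hypothetical names used for illustration, not part of any official HLE tooling, and the data is a toy example rather than real results.

```python
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    """One graded model response (hypothetical structure, for illustration)."""
    question_id: str
    correct: bool


def accuracy_percent(graded):
    """Return the percentage of answers marked correct."""
    if not graded:
        raise ValueError("no graded answers")
    return 100.0 * sum(g.correct for g in graded) / len(graded)


# Toy data: 3 correct answers out of 100 questions.
graded = [GradedAnswer(f"q{i}", i < 3) for i in range(100)]
score = accuracy_percent(graded)
print(f"{score:.1f}%")
```

On a benchmark with a few thousand questions, a score of about 3% corresponds to only a few dozen correct answers, which is why small differences between models near the bottom of the table should be read with care.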

### Example Question Difficulty

HLE questions typically involve:
- Graduate-level math problems with a single verifiable answer
- Multi-step physics derivations
- Complex legal reasoning across jurisdictions
- Multi-document historical analysis
- Cutting-edge biology research questions

These are not knowledge-retrieval questions; they require **genuine reasoning and synthesis**.

### HLE vs Other Benchmarks

| Benchmark | Difficulty | GPT-4o Score |
|-----------|-----------|-------------|
| MMLU | Undergrad level | ~87% |
| GPQA | PhD-level | ~53% |
| FrontierMath | Research math | ~2% |
| **HLE** | Expert-level, multi-domain | ~3.3% |

### Why HLE Matters for AI Development

* **Measures real capability**: saturated benchmarks, where models already score 90%+, can no longer separate strong models from one another
* **Tracks progress toward expert performance**: a meaningful signal of how close AI is to human experts on closed-ended academic questions
* **Motivates research**: hard targets drive capability improvements
* **Cross-domain**: unlike specialized benchmarks, it tests breadth

### Who Created HLE?

HLE was created by the **Center for AI Safety (CAIS)** and **Scale AI** in collaboration with hundreds of domain experts and PhD holders from universities worldwide. Questions were carefully vetted to ensure they:
- Have definitive correct answers
- Cannot be answered by simple web search
- Require deep domain expertise
- Cannot be solved by pattern matching alone
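
One way to enforce criteria like these is to screen each candidate question against existing models and keep only the ones they all get wrong. The sketch below illustrates that idea and is not the actual HLE pipeline: `normalize`, `survives_screen`, and the toy candidates are all hypothetical, and plain normalized string comparison stands in for whatever grading a real vetting process would use.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


def survives_screen(reference_answer: str, model_answers: list[str]) -> bool:
    """Keep a question only if no model answer matches the reference."""
    target = normalize(reference_answer)
    return all(normalize(a) != target for a in model_answers)


# Toy candidates (invented for illustration, not real benchmark items).
candidates = [
    {"id": "easy", "answer": "Paris", "model_answers": ["paris", "Paris "]},
    {"id": "hard", "answer": "17", "model_answers": ["12", "I am not sure"]},
]

kept = [c["id"] for c in candidates
        if survives_screen(c["answer"], c["model_answers"])]
print(kept)
```

A screen like this only works when every question has a definitive answer to compare against, which is why the first criterion in the list matters for the others.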

> HLE sits at the current hard end of AI evaluation. The gap between roughly 10% AI accuracy and roughly 90% human expert accuracy is what modern AI research is working to close.
