Concept #100 · Easy · extended-ai-concepts

What is HLE (Humanity's Last Exam) in AI?

#gen-ai

Answer

HLE (Humanity's Last Exam) is one of the hardest AI benchmarks published to date. It is designed to test the limits of AI capabilities on graduate-level and expert-level academic questions across a broad range of disciplines.

What Is HLE?

HLE is a benchmark released in early 2025. The initial release contained 3,000 questions (the finalized public set has 2,500), written by domain experts across:

  • Mathematics (pure and applied)
  • Physics, Chemistry, Biology
  • Computer Science
  • Economics, Law, Medicine
  • History, Philosophy, Linguistics
  • And many other fields

The questions are deliberately difficult: they are meant to require expert-level understanding, not just retrieval of memorized facts.

Why "Humanity's Last Exam"?

The name reflects the aspiration that HLE is the final frontier of academic testing: problems hard enough that, once AI can answer them reliably, it has effectively matched or surpassed human expert performance on this kind of question.

Benchmark Results (Early 2025)

| Model | HLE Score |
|---|---|
| o3 (OpenAI) | ~9.7% |
| Gemini 2.0 Flash Thinking | ~6.1% |
| Claude 3.5 Sonnet | ~4.3% |
| GPT-4o | ~3.3% |
| Human experts | ~90%+ (in their field) |

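These scores are low by design: candidate questions that frontier models could already answer were filtered out while the benchmark was being built, so the models of the time perform poorly on what remains.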
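To make the percentages above concrete: a score of this kind is the share of questions answered correctly. The sketch below is a minimal illustration that assumes simple normalized exact-match grading. The helper names (`normalize`, `accuracy`) and the toy data are hypothetical, and the actual HLE grading procedure may differ (for example, by using a model-based judge).

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the percentage of predictions that match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty question set")
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Hypothetical toy data, not real HLE questions or answers.
refs = ["42", "Paris", "x = 3", "B"]
preds = ["42", "paris ", "x = 4", "C"]
print(f"{accuracy(preds, refs):.1f}%")  # 2 of 4 correct -> 50.0%
```

On a benchmark with thousands of questions, a score near 3% under this kind of metric means the model got only a few dozen answers right per thousand questions.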

Example Question Difficulty

HLE questions look like:

  • Graduate-level math problems with a single, checkable answer
  • Multi-step physics derivations
  • Complex legal reasoning across jurisdictions
  • Multi-document historical analysis
  • Cutting-edge biology research questions

These are not knowledge-retrieval questions; they require multi-step reasoning and synthesis.

HLE vs Other Benchmarks

| Benchmark | Difficulty | GPT-4o Score |
|---|---|---|
| MMLU | Undergrad level | ~87% |
| GPQA | PhD-level | ~53% |
| FrontierMath | Research math | ~2% |
| HLE | Expert-level, multi-domain | ~3.3% |
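One way to read this comparison is in terms of headroom, meaning the percentage points a model still has left before a perfect score. The sketch below uses the approximate GPT-4o scores listed above. The `headroom` helper and the 85% saturation threshold are assumptions chosen only for illustration.

```python
# Approximate GPT-4o scores copied from the comparison above (percent correct).
gpt4o_scores = {
    "MMLU": 87.0,
    "GPQA": 53.0,
    "FrontierMath": 2.0,
    "HLE": 3.3,
}

SATURATION_THRESHOLD = 85.0  # assumed cutoff, for illustration only


def headroom(score: float) -> float:
    """Percentage points remaining before a perfect score."""
    return 100.0 - score


# List benchmarks from highest to lowest score.
for name, score in sorted(gpt4o_scores.items(), key=lambda kv: kv[1], reverse=True):
    status = "near saturation" if score >= SATURATION_THRESHOLD else "room to improve"
    print(f"{name:<13} score={score:5.1f}%  headroom={headroom(score):5.1f}  ({status})")
```

A benchmark with little headroom can no longer separate strong models from each other, which is why harder benchmarks such as HLE are useful.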

Why HLE Matters for AI Development

  • Measures real capability — unlike saturated benchmarks where models score 90%+
  • Tracks AGI progress — meaningful signal of how close AI is to expert human performance
  • Motivates research — hard targets drive capability improvements
  • Cross-domain — unlike specialized benchmarks, tests breadth

Who Created HLE?

HLE was created by the Center for AI Safety (CAIS) and Scale AI in collaboration with hundreds of domain experts and PhD holders from universities worldwide. Questions were carefully vetted to ensure they:

  • Have definitive correct answers
  • Cannot be answered by simple web search
  • Require deep domain expertise
  • Cannot be solved by pattern matching alone
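The vetting criteria above can be thought of as a filter over candidate questions. The sketch below is hypothetical: the `Candidate` fields and the `passes_vetting` rule are assumptions made for illustration, not the actual HLE review pipeline.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical candidate question with reviewer-assigned flags."""
    text: str
    has_definitive_answer: bool
    answerable_by_web_search: bool
    solved_by_baseline_models: int  # number of reference models that answered correctly


def passes_vetting(q: Candidate, max_models_solving: int = 0) -> bool:
    """Keep a question only if it is unambiguous, not searchable, and stumps the baseline models."""
    return (
        q.has_definitive_answer
        and not q.answerable_by_web_search
        and q.solved_by_baseline_models <= max_models_solving
    )


# Hypothetical examples, not real HLE submissions.
candidates = [
    Candidate("Open-ended essay prompt", False, False, 0),
    Candidate("Fact found on the first search result", True, True, 3),
    Candidate("Multi-step expert derivation", True, False, 0),
]

kept = [q.text for q in candidates if passes_vetting(q)]
print(kept)  # ['Multi-step expert derivation']
```

Filtering out anything the baseline models already solve is what keeps the resulting scores low at release time.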

HLE represents the current ceiling of AI evaluation. The gap between roughly 10% AI performance and roughly 90% human expert performance is what modern AI research is working to close.