Explain BLEU, ROUGE, and METEOR scores. When would you use each?
Answer
BLEU, ROUGE, and METEOR Scores
These are automated metrics for evaluating text generation quality by comparing generated text to reference (ground truth) text. Each measures different aspects of quality.
BLEU (Bilingual Evaluation Understudy)
Measures: N-gram precision — what fraction of n-grams in the generated text appear in the reference.
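$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
where BP is the brevity penalty (penalises outputs shorter than the reference), $p_n$ is the modified (clipped) n-gram precision, and $w_n$ is the weight per n-gram order (typically uniform, $w_n = 1/N$ with $N = 4$).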
```python
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

# Sentence-level BLEU: a list of tokenised references, one tokenised hypothesis
reference = ["the cat sat on the mat".split()]
hypothesis = "the cat is on the mat".split()

# This pair shares no 4-gram, so smoothing is needed to avoid a near-zero score
score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.4f}")

# Corpus-level BLEU (more reliable): n-gram counts are pooled over all pairs
references = [["the cat sat on the mat".split()],
              ["a dog ran in the park".split()]]
hypotheses = ["the cat is on the mat".split(),
              "a dog ran in the park".split()]
corpus_score = corpus_bleu(references, hypotheses)
print(f"Corpus BLEU: {corpus_score:.4f}")
```
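To make the n-gram precision and brevity penalty concrete, here is a minimal from-scratch sketch. It assumes a single reference, uniform weights, and no smoothing; the function names (`modified_precision`, `brevity_penalty`, `simple_bleu`) are illustrative and are not part of NLTK.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision against a single reference."""
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = ngrams(reference, n)
    # Clip each hypothesis n-gram count by its count in the reference,
    # so repeating a matching word cannot inflate the score.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(ref_len, hyp_len):
    """1.0 if the hypothesis is longer than the reference, else exp(1 - r/c)."""
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)


def simple_bleu(reference, hypothesis, max_n=2):
    """Unsmoothed BLEU with uniform weights 1 / max_n (illustrative only)."""
    precisions = [modified_precision(reference, hypothesis, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # one zero precision drives the geometric mean to zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(reference), len(hypothesis)) * math.exp(log_mean)


reference = "the cat sat on the mat".split()
hypothesis = "the cat is on the mat".split()
print(modified_precision(reference, hypothesis, 1))  # 5 of 6 unigrams match
print(modified_precision(reference, hypothesis, 2))  # 3 of 5 bigrams match
print(f"BLEU-2: {simple_bleu(reference, hypothesis):.4f}")
print(f"BLEU-4: {simple_bleu(reference, hypothesis, max_n=4):.4f}")  # 0: no shared 4-gram
```

The unsmoothed BLEU-4 of zero for this pair is why sentence-level BLEU is usually computed with a smoothing function, and why corpus-level scores are preferred.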
Strengths: Fast, language-agnostic, standard in MT
Weaknesses: Doesn't account for recall; misses synonyms; low correlation with human judgement for abstractive tasks
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Measures: Recall — what fraction of reference n-grams appear in the generated text. Designed for summarisation.
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "RAG retrieves relevant documents to ground LLM answers and reduce hallucinations."
hypothesis = "RAG retrieves documents to help LLMs answer questions more accurately."

# score(target, prediction): the reference comes first
scores = scorer.score(reference, hypothesis)
print(f"ROUGE-1: P={scores['rouge1'].precision:.3f} "
      f"R={scores['rouge1'].recall:.3f} F1={scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-2: F1={scores['rouge2'].fmeasure:.3f}")
print(f"ROUGE-L: F1={scores['rougeL'].fmeasure:.3f}")
```
ROUGE variants:
- ROUGE-1: Unigram overlap
- ROUGE-2: Bigram overlap (stricter)
- ROUGE-L: Longest Common Subsequence (order-aware)
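The variants above can be sketched from scratch to show where recall and precision diverge. This is a simplified illustration on whitespace tokens with no stemming, so its numbers will not necessarily match the `rouge_score` package; `rouge_n`, `lcs_length`, and `rouge_l` are hypothetical helper names.

```python
from collections import Counter


def _prf(overlap, ref_total, hyp_total):
    """Turn an overlap count into (precision, recall, f1)."""
    precision = overlap / hyp_total if hyp_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def rouge_n(reference, hypothesis, n=1):
    """N-gram overlap: ROUGE-1 for n=1, ROUGE-2 for n=2."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, hyp = grams(reference), grams(hypothesis)
    overlap = sum((ref & hyp).values())  # multiset intersection clips counts
    return _prf(overlap, sum(ref.values()), sum(hyp.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, hypothesis):
    """ROUGE-L: overlap measured by the longest common subsequence."""
    return _prf(lcs_length(reference, hypothesis), len(reference), len(hypothesis))


reference = "the cat sat on the mat".split()
short_hyp = "the cat sat".split()

p, r, f1 = rouge_n(reference, short_hyp, n=1)
print(f"ROUGE-1: P={p:.3f} R={r:.3f} F1={f1:.3f}")  # perfect precision, half recall
p, r, f1 = rouge_l(reference, "the cat is on the mat".split())
print(f"ROUGE-L: P={p:.3f} R={r:.3f} F1={f1:.3f}")  # LCS is "the cat on the mat"
```

A very short hypothesis can score perfect precision while covering only half the reference, which is the failure mode a recall-oriented metric is meant to expose in summarisation.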
METEOR (Metric for Evaluation of Translation with Explicit Ordering)
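Measures: A recall-weighted harmonic mean of unigram precision and recall, matching words exactly, by stem, and by WordNet synonym, with a fragmentation penalty that lowers the score when matched words appear in a different order from the reference.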
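```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # some NLTK versions also need "omw-1.4"

reference = "The cat sat on the mat"
hypothesis = "A cat was sitting on the mat"

# Recent NLTK versions expect pre-tokenised input: a list of references, one hypothesis
score = meteor_score([reference.split()], hypothesis.split())
print(f"METEOR: {score:.4f}")
# Stemming matches regular variants (e.g. "sits"/"sitting"); irregular forms
# such as "sat" are generally not conflated with "sitting" by the Porter stemmer.
```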
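The scoring step can be sketched without WordNet to show how precision, recall, and word order combine. This is a simplified illustration: it matches exact lowercase words only and uses a greedy alignment, whereas the full metric also matches stems and synonyms and searches for the alignment with the fewest chunks. The name `meteor_exact` is hypothetical, and the defaults `alpha=0.9`, `beta=3.0`, `gamma=0.5` are assumed here to mirror NLTK's defaults.

```python
def meteor_exact(reference, hypothesis, alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR-style score using exact word matches only (illustrative sketch)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Greedy one-to-one alignment: each hypothesis word takes an unused
    # reference position, preferring the one that extends the previous match.
    used = set()
    alignment = []  # list of (hypothesis_index, reference_index)
    for i, word in enumerate(hyp):
        candidates = [j for j, r in enumerate(ref) if r == word and j not in used]
        if not candidates:
            continue
        nxt = alignment[-1][1] + 1 if alignment else None
        j = nxt if nxt in candidates else candidates[0]
        used.add(j)
        alignment.append((i, j))

    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(hyp)
    recall = matches / len(ref)
    # Weighted harmonic mean; alpha close to 1 weights recall more heavily
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # A chunk is a run of matches that is contiguous in both sentences
    chunks = 1
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if not (i2 == i1 + 1 and j2 == j1 + 1):
            chunks += 1
    penalty = gamma * (chunks / matches) ** beta  # more chunks = more reordering
    return fmean * (1 - penalty)


reference = "The cat sat on the mat"
print(f"{meteor_exact(reference, 'A cat was sitting on the mat'):.4f}")
print(f"{meteor_exact(reference, 'on the mat sat the cat'):.4f}")  # same words, reordered
```

The reordered sentence contains every reference word yet scores below an identical copy, because its matches split into more chunks; BLEU-1 and ROUGE-1 would not distinguish the two.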
Comparison Table
| Metric | Measures | Handles Synonyms | Good For | Correlation with Humans |
|---|---|---|---|---|
| BLEU | Precision (n-grams) | No | Machine translation | Moderate |
| ROUGE-1/2 | Recall (n-grams) | No | Summarisation | Moderate |
| ROUGE-L | Sequence similarity | No | Summarisation | Good |
| METEOR | Precision + Recall | Yes (WordNet) | Translation, summarisation | Better |
| BERTScore | Semantic similarity | Yes (BERT) | Any generation task | Best |
BERTScore (Modern Alternative)
```python
from bert_score import score

references = ["RAG retrieves context to ground LLM responses."]
hypotheses = ["RAG fetches documents to help language models answer accurately."]

# Candidates come first, then references; the model is downloaded on first use
P, R, F1 = score(hypotheses, references, lang="en",
                 model_type="microsoft/deberta-xlarge-mnli")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```
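The core idea behind BERTScore, greedy matching of tokens by cosine similarity of their embeddings, can be illustrated without loading a model. In the sketch below the two-dimensional vectors are made-up stand-ins for real contextual embeddings, and `greedy_match_score` is a hypothetical helper; the actual library additionally offers IDF weighting and baseline rescaling, which are omitted here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_score(ref_vecs, hyp_vecs):
    """Return (precision, recall, f1) from greedy best-match cosine similarity."""
    # Recall: each reference token is matched to its most similar hypothesis token
    recall = sum(max(cosine(r, h) for h in hyp_vecs) for r in ref_vecs) / len(ref_vecs)
    # Precision: each hypothesis token is matched to its most similar reference token
    precision = sum(max(cosine(h, r) for r in ref_vecs) for h in hyp_vecs) / len(hyp_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical toy embeddings: near-synonyms point in similar directions
toy = {
    "retrieves": [1.0, 0.1],
    "fetches":   [0.9, 0.2],
    "context":   [0.2, 0.9],
    "documents": [0.1, 1.0],
}
ref_vecs = [toy["retrieves"], toy["context"]]
hyp_vecs = [toy["fetches"], toy["documents"]]

p, r, f1 = greedy_match_score(ref_vecs, hyp_vecs)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

The two sentences share no word, so any n-gram metric would score them at zero, while embedding similarity still gives them a high score. That is the property the synonym column of the comparison table refers to.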
When to Use Each
| Task | Primary Metric | Secondary |
|---|---|---|
| Machine translation | BLEU | METEOR, BERTScore |
| Summarisation | ROUGE-1, ROUGE-L | BERTScore |
| RAG answer quality | BERTScore | RAGAS (faithfulness) |
| Open-ended generation | LLM-as-judge | BERTScore |
| Code generation | Exact match, CodeBLEU | — |
Critical caveat: N-gram metrics correlate weakly with human judgement on modern LLM output, where many differently worded answers can be equally correct. For production RAG systems, prefer RAGAS (faithfulness, relevancy) and LLM-as-judge over BLEU/ROUGE.