Explain BLEU, ROUGE, and METEOR scores. When would you use each?

Question

Accepted Answer

## BLEU, ROUGE, and METEOR Scores

These are automated metrics for evaluating text generation quality by comparing generated text to reference (ground truth) text. Each measures different aspects of quality.

### BLEU (Bilingual Evaluation Understudy)

**Measures:** Modified n-gram precision: the fraction of n-grams in the generated text that also appear in the reference, with each n-gram's count clipped to its count in the reference.

$$\text{BLEU} = \text{BP} \times \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

Where BP is the brevity penalty (penalises outputs shorter than the reference), $p_n$ is the n-gram precision for order $n$, and $w_n$ is the weight for that order (typically uniform, $w_n = 1/N$ with $N = 4$).
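To make the terms of the formula concrete, here is a minimal from-scratch sketch using only the standard library. The helper names (`ngrams`, `modified_precision`, `brevity_penalty`, `simple_bleu`) are illustrative and not part of any library. It assumes a single reference, uniform weights, and no smoothing, so treat it as a teaching aid, not a replacement for a tested implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision p_n against a single reference."""
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = ngrams(reference, n)
    # Clip each hypothesis n-gram count at its count in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(ref_len, hyp_len):
    """BP = 1 if the hypothesis is longer than the reference, else exp(1 - r/c)."""
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)


def simple_bleu(reference, hypothesis, max_n=2):
    """Unsmoothed BLEU with uniform weights w_n = 1 / max_n."""
    precisions = [modified_precision(reference, hypothesis, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU collapses to 0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(reference), len(hypothesis)) * math.exp(log_mean)


reference = "the cat sat on the mat".split()
hypothesis = "the cat is on the mat".split()

p1 = modified_precision(reference, hypothesis, 1)  # 5 of 6 unigrams match
p2 = modified_precision(reference, hypothesis, 2)  # 3 of 5 bigrams match
bleu2 = simple_bleu(reference, hypothesis, max_n=2)
print(f"p1={p1:.4f} p2={p2:.4f} BLEU-2={bleu2:.4f}")
```

With `max_n=4` the same pair scores 0 in this sketch, because no 4-gram of the hypothesis occurs in the reference. That is why sentence-level BLEU is usually computed with smoothing.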

```python
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

# Sentence-level BLEU
reference = ["the cat sat on the mat".split()]
hypothesis = "the cat is on the mat".split()

score = sentence_bleu(reference, hypothesis, smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.4f}")  # low score: no 4-gram matches, so smoothing supplies a small non-zero p_4

# Corpus-level BLEU (more reliable)
references = [["the cat sat on the mat".split()], ["a dog ran in the park".split()]]
hypotheses = ["the cat is on the mat".split(), "a dog ran in the park".split()]
corpus_score = corpus_bleu(references, hypotheses)
print(f"Corpus BLEU: {corpus_score:.4f}")
```

**Strengths:** Fast, language-agnostic, standard in MT

**Weaknesses:** Captures recall only indirectly, through the brevity penalty; misses synonyms and paraphrases; low correlation with human judgement for abstractive tasks

### ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

**Measures:** Primarily recall: the fraction of reference n-grams that appear in the generated text. Designed for summarisation; most implementations also report precision and F1.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "RAG retrieves relevant documents to ground LLM answers and reduce hallucinations."
hypothesis = "RAG retrieves documents to help LLMs answer questions more accurately."

scores = scorer.score(reference, hypothesis)
print(f"ROUGE-1: P={scores['rouge1'].precision:.3f} R={scores['rouge1'].recall:.3f} F1={scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-2: F1={scores['rouge2'].fmeasure:.3f}")
print(f"ROUGE-L: F1={scores['rougeL'].fmeasure:.3f}")
```

**ROUGE variants:**
- **ROUGE-1**: Unigram overlap
- **ROUGE-2**: Bigram overlap (stricter)
- **ROUGE-L**: Longest Common Subsequence (order-aware)
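The variants above can be sketched from scratch in a few lines. This is a simplified illustration with hypothetical helper names (`rouge_n_recall`, `lcs_length`, `rouge_l_f1`); it assumes whitespace tokenisation and a single reference, and skips the stemming that the `rouge_score` package can apply, so its numbers need not match that package exactly.

```python
from collections import Counter


def rouge_n_recall(reference, hypothesis, n=1):
    """Fraction of reference n-grams that also occur in the hypothesis (clipped)."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts, hyp_counts = grams(reference), grams(hypothesis)
    overlap = sum(min(count, hyp_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l_f1(reference, hypothesis):
    """F1 of LCS-based precision and recall."""
    lcs = lcs_length(reference, hypothesis)
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference)
    precision = lcs / len(hypothesis)
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat".split()
hypothesis = "the cat is on the mat".split()

r1 = rouge_n_recall(reference, hypothesis, n=1)  # 5 of 6 reference unigrams recovered
r2 = rouge_n_recall(reference, hypothesis, n=2)  # 3 of 5 reference bigrams recovered
rl = rouge_l_f1(reference, hypothesis)           # LCS is "the cat on the mat" (length 5)
print(f"ROUGE-1 R={r1:.3f} ROUGE-2 R={r2:.3f} ROUGE-L F1={rl:.3f}")
```

ROUGE-2 drops faster than ROUGE-1 because a single substituted word breaks two bigrams. ROUGE-L tolerates the gap as long as the remaining words keep their order.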

### METEOR (Metric for Evaluation of Translation with Explicit Ordering)

**Measures:** Recall-weighted harmonic mean of unigram precision and recall, where matches may be exact words, stems, or synonyms, combined with a fragmentation penalty that accounts for word order.

```python
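from nltk.translate.meteor_score import meteor_score
import nltk
nltk.download("wordnet", quiet=True)  # some NLTK versions also need nltk.download("omw-1.4")

reference = "The cat sat on the mat"
hypothesis = "A cat was sitting on the mat"

# Recent NLTK versions expect pre-tokenised input: a list of reference token lists and one hypothesis token list
score = meteor_score([reference.split()], hypothesis.split())
print(f"METEOR: {score:.4f}")  # Matches exact words, then stems, then WordNet synonyms; irregular pairs like "sat"/"sitting" may still go unmatched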
```
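To show how precision, recall, and word order combine into one number, here is a sketch of the scoring step alone. The function name `meteor_from_counts` is hypothetical. It takes the number of matched unigrams and the number of contiguous matched chunks as inputs, counted by hand here, and does not perform the alignment itself. The defaults `alpha=0.9`, `beta=3.0`, `gamma=0.5` are assumed to mirror NLTK's defaults.

```python
def meteor_from_counts(matches, chunks, hyp_len, ref_len, alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR-style score from alignment counts (the alignment itself is not computed here)."""
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Weighted harmonic mean; alpha = 0.9 weights recall far more heavily than precision.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation penalty: more chunks means the matched words are more scrambled.
    penalty = gamma * (chunks / matches) ** beta
    return (1 - penalty) * f_mean


# Reference: "the cat sat on the mat" (6 tokens).
# Identical hypothesis: 6 matches forming 1 contiguous chunk.
in_order = meteor_from_counts(matches=6, chunks=1, hyp_len=6, ref_len=6)
# Hypothesis "on the mat sat the cat": 6 matches, assumed aligned as 3 chunks.
scrambled = meteor_from_counts(matches=6, chunks=3, hyp_len=6, ref_len=6)
print(f"in order: {in_order:.4f}, scrambled: {scrambled:.4f}")
```

Both hypotheses have perfect unigram precision and recall, so only the fragmentation penalty separates them. Even a perfect match stays slightly below 1.0 under this formula, because one chunk still incurs a small penalty.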

### Comparison Table

| Metric | Measures | Handles Synonyms | Good For | Correlation with Humans |
|--------|---------|-----------------|---------|------------------------|
| **BLEU** | Precision (n-grams) | No | Machine translation | Moderate |
| **ROUGE-1/2** | Recall (n-grams) | No | Summarisation | Moderate |
| **ROUGE-L** | Sequence similarity | No | Summarisation | Good |
| **METEOR** | Precision + Recall | Yes (WordNet) | Translation, summarisation | Better |
| **BERTScore** | Semantic similarity | Yes (BERT) | Any generation task | Best |

### BERTScore (Modern Alternative)

```python
from bert_score import score

references = ["RAG retrieves context to ground LLM responses."]
hypotheses = ["RAG fetches documents to help language models answer accurately."]

P, R, F1 = score(hypotheses, references, lang="en", model_type="microsoft/deberta-xlarge-mnli")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```
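For intuition about what the library computes, the core of BERTScore is greedy matching of token embeddings by cosine similarity. The sketch below uses hand-made 2-D vectors in place of real contextual embeddings, and the helper names (`cosine`, `greedy_match_scores`) are illustrative. It omits the optional IDF weighting and baseline rescaling, so it shows the idea only, not the package's exact output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(ref_vecs, cand_vecs):
    """Precision, recall, and F1 from greedy best-match cosine similarities."""
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy stand-ins for token embeddings (assumed values, not model output).
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
cand_vecs = [[1.0, 0.0], [1.0, 1.0]]

precision, recall, f1 = greedy_match_scores(ref_vecs, cand_vecs)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Because similarity is measured in embedding space instead of by exact string overlap, a paraphrase can score well even when it shares no n-grams with the reference.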

### When to Use Each

| Task | Primary Metric | Secondary |
|------|--------------|----------|
| Machine translation | BLEU | METEOR, BERTScore |
| Summarisation | ROUGE-1, ROUGE-L | BERTScore |
| RAG answer quality | BERTScore | RAGAS (faithfulness) |
| Open-ended generation | LLM-as-judge | BERTScore |
| Code generation | Exact match, CodeBLEU | — |

> **Critical caveat:** N-gram overlap metrics often correlate weakly with human judgement on the output of modern LLMs, because a valid paraphrase can share few n-grams with the reference. For production RAG systems, prefer **RAGAS** (faithfulness, relevancy) and **LLM-as-judge** over BLEU/ROUGE.
