Concept #53 | Hard | advanced-topics

Explain BLEU, ROUGE, and METEOR scores. When would you use each?

#gen-ai #evaluation

Answer

BLEU, ROUGE, and METEOR Scores

These are automated metrics for evaluating text generation quality by comparing generated text to reference (ground truth) text. Each measures different aspects of quality.

BLEU (Bilingual Evaluation Understudy)

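Measures: Modified n-gram precision, i.e. the fraction of n-grams in the generated text that also appear in the reference. Each n-gram's count is clipped to its count in the reference, so repeating a matching word cannot inflate the score.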

$$\text{BLEU} = \text{BP} \times \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

Where BP is the brevity penalty (penalises outputs shorter than the reference), $p_n$ is the n-gram precision for order $n$, and $w_n$ is the weight for that order (typically uniform, $w_n = 1/N$ with $N = 4$).
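To make the formula concrete, the following is a minimal from-scratch sketch, assuming a single reference, uniform weights, and no smoothing. The helper names (`ngrams`, `clipped_precision`, `brevity_penalty`, `simple_bleu`) are illustrative and not part of any library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(reference, hypothesis, n):
    """Modified precision p_n: each hypothesis n-gram is credited at most
    as many times as it occurs in the reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0


def brevity_penalty(ref_len, hyp_len):
    """BP = 1 if the hypothesis is longer than the reference, else exp(1 - r/c)."""
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)


def simple_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed single-reference BLEU with uniform weights w_n = 1 / max_n."""
    precisions = [clipped_precision(reference, hypothesis, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined, so unsmoothed BLEU collapses to zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(reference), len(hypothesis)) * math.exp(log_mean)


reference = "the cat sat on the mat".split()
hypothesis = "the cat is on the mat".split()

for n in range(1, 5):
    print(f"p_{n} = {clipped_precision(reference, hypothesis, n):.3f}")
print(f"BLEU-2 = {simple_bleu(reference, hypothesis, max_n=2):.4f}")
print(f"BLEU-4 = {simple_bleu(reference, hypothesis, max_n=4):.4f}")
```

Here no 4-gram of the hypothesis appears in the reference, so the unsmoothed BLEU-4 is zero even though the sentences differ by one word. This is why sentence-level BLEU is usually computed with a smoothing function.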

python
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

# Sentence-level BLEU
reference = ["the cat sat on the mat".split()]
hypothesis = "the cat is on the mat".split()

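# method1 smoothing adds a small epsilon to zero n-gram counts; without it,
# the missing 4-gram match here would drive the score to effectively zero
score = sentence_bleu(reference, hypothesis, smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.4f}")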

# Corpus-level BLEU (more reliable)
references = [["the cat sat on the mat".split()], ["a dog ran in the park".split()]]
hypotheses = ["the cat is on the mat".split(), "a dog ran in the park".split()]
corpus_score = corpus_bleu(references, hypotheses)
print(f"Corpus BLEU: {corpus_score:.4f}")

Strengths: Fast, language-agnostic, standard in MT.

Weaknesses: No explicit recall term (the brevity penalty only partly compensates); misses synonyms and paraphrases; low correlation with human judgement on abstractive tasks.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Measures: Primarily recall, i.e. the fraction of reference n-grams that appear in the generated text. Designed for summarisation; implementations such as rouge_score also report precision and F1.

python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "RAG retrieves relevant documents to ground LLM answers and reduce hallucinations."
hypothesis = "RAG retrieves documents to help LLMs answer questions more accurately."

scores = scorer.score(reference, hypothesis)  # argument order is (target, prediction)
print(f"ROUGE-1: P={scores['rouge1'].precision:.3f} R={scores['rouge1'].recall:.3f} F1={scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-2: F1={scores['rouge2'].fmeasure:.3f}")
print(f"ROUGE-L: F1={scores['rougeL'].fmeasure:.3f}")

ROUGE variants:

  • ROUGE-1: Unigram overlap
  • ROUGE-2: Bigram overlap (stricter)
  • ROUGE-L: Longest Common Subsequence (order-aware)
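The variants above can be sketched by hand as follows. This is a simplified illustration on pre-tokenised input, with no stemming, and the function names (`rouge_n_recall`, `lcs_length`, `rouge_l_f1`) are made up for this example rather than taken from rouge_score.

```python
from collections import Counter


def rouge_n_recall(reference, hypothesis, n):
    """Fraction of reference n-grams that also occur in the hypothesis (clipped counts)."""
    def counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts, hyp_counts = counts(reference), counts(hypothesis)
    overlap = sum(min(count, hyp_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, hypothesis):
    """F1 of LCS-based precision (LCS / hypothesis length) and recall (LCS / reference length)."""
    lcs = lcs_length(reference, hypothesis)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hypothesis)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat".split()
hypothesis = "the cat is on the mat".split()

print(f"ROUGE-1 recall: {rouge_n_recall(reference, hypothesis, 1):.3f}")
print(f"ROUGE-2 recall: {rouge_n_recall(reference, hypothesis, 2):.3f}")
print(f"ROUGE-L F1:     {rouge_l_f1(reference, hypothesis):.3f}")
```

The LCS here is "the cat on the mat" (5 tokens): ROUGE-L tolerates the gap left by the substituted word, whereas ROUGE-2 loses both bigrams that touch it.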

METEOR (Metric for Evaluation of Translation with Explicit Ordering)

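Measures: A recall-weighted harmonic mean of unigram precision and recall, with matching extended by stemming and WordNet synonyms, plus a fragmentation penalty that rewards matched words appearing in the same order as in the reference.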

python
from nltk.translate.meteor_score import meteor_score
import nltk
nltk.download("wordnet", quiet=True)

reference = "The cat sat on the mat"
hypothesis = "A cat was sitting on the mat"

score = meteor_score([reference.split()], hypothesis.split())
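print(f"METEOR: {score:.4f}")
# Matching is by exact word, then stem, then WordNet synonym. Note that the
# irregular pair "sat"/"sitting" does not share a Porter stem ("sat" vs "sit").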
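How METEOR combines precision, recall, and word order can be sketched with exact matches only. This is a simplified illustration, not the NLTK implementation: it skips stem and synonym matching, uses a greedy alignment heuristic rather than a search for the alignment with fewest crossings, and assumes parameter values `alpha=0.9`, `beta=3.0`, `gamma=0.5` chosen to mirror NLTK's defaults. All function names are hypothetical.

```python
def exact_alignment(reference, hypothesis):
    """Pair hypothesis tokens with identical, unused reference tokens.
    Heuristic: prefer the reference position that extends the previous match."""
    used = set()
    pairs = []  # (hypothesis_index, reference_index), ordered by hypothesis index
    for i, token in enumerate(hypothesis):
        candidates = [j for j, ref_token in enumerate(reference)
                      if ref_token == token and j not in used]
        if not candidates:
            continue
        follow = pairs[-1][1] + 1 if pairs and pairs[-1][0] == i - 1 else None
        j = follow if follow in candidates else candidates[0]
        used.add(j)
        pairs.append((i, j))
    return pairs


def count_chunks(pairs):
    """Number of runs of matches that are adjacent in both sentences."""
    chunks = 0
    previous = None
    for i, j in pairs:
        if previous is None or (i, j) != (previous[0] + 1, previous[1] + 1):
            chunks += 1
        previous = (i, j)
    return chunks


def simple_meteor(reference, hypothesis, alpha=0.9, beta=3.0, gamma=0.5):
    """Exact-match-only METEOR: weighted harmonic mean times (1 - fragmentation penalty)."""
    pairs = exact_alignment(reference, hypothesis)
    matches = len(pairs)
    if matches == 0:
        return 0.0
    precision = matches / len(hypothesis)
    recall = matches / len(reference)
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    penalty = gamma * (count_chunks(pairs) / matches) ** beta
    return f_mean * (1 - penalty)


reference = "the cat sat on the mat".lower().split()
hypothesis = "a cat was sitting on the mat".lower().split()

pairs = exact_alignment(reference, hypothesis)
print(f"matches={len(pairs)} chunks={count_chunks(pairs)}")
print(f"simplified METEOR: {simple_meteor(reference, hypothesis):.4f}")
```

With `alpha=0.9` the harmonic mean leans heavily on recall, and fewer, longer chunks mean a smaller penalty. Even an identical hypothesis scores slightly below 1 under this formula, because a single chunk still incurs a small penalty.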

Comparison Table

| Metric | Measures | Handles Synonyms | Good For | Correlation with Humans |
|---|---|---|---|---|
| BLEU | Precision (n-grams) | No | Machine translation | Moderate |
| ROUGE-1/2 | Recall (n-grams) | No | Summarisation | Moderate |
| ROUGE-L | Sequence similarity | No | Summarisation | Good |
| METEOR | Precision + Recall | Yes (WordNet) | Translation, summarisation | Better |
| BERTScore | Semantic similarity | Yes (BERT) | Any generation task | Best |

BERTScore (Modern Alternative)

python
from bert_score import score

references = ["RAG retrieves context to ground LLM responses."]
hypotheses = ["RAG fetches documents to help language models answer accurately."]

P, R, F1 = score(hypotheses, references, lang="en", model_type="microsoft/deberta-xlarge-mnli")
print(f"BERTScore F1: {F1.mean().item():.4f}")
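The core idea behind BERTScore, greedy matching of token embeddings by cosine similarity, can be sketched without a model. The 2-d vectors below are invented purely for illustration (a real run uses contextual embeddings from the chosen model), and the sketch omits refinements such as IDF weighting and baseline rescaling. `cosine` and `greedy_match_scores` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(ref_vectors, hyp_vectors):
    """Precision: each hypothesis token takes its best-matching reference token.
    Recall: each reference token takes its best-matching hypothesis token."""
    precision = sum(max(cosine(h, r) for r in ref_vectors) for h in hyp_vectors) / len(hyp_vectors)
    recall = sum(max(cosine(r, h) for h in hyp_vectors) for r in ref_vectors) / len(ref_vectors)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy embeddings: "fetches" is placed close to "retrieves".
toy_vectors = {
    "retrieves": [1.0, 0.0],
    "fetches": [0.8, 0.6],
    "documents": [0.0, 1.0],
}

reference = ["retrieves", "documents"]
hypothesis = ["fetches", "documents"]

P, R, F1 = greedy_match_scores(
    [toy_vectors[t] for t in reference],
    [toy_vectors[t] for t in hypothesis],
)
print(f"P={P:.2f} R={R:.2f} F1={F1:.2f}")
```

An exact-match unigram metric would credit only "documents" here (0.5 precision and recall), whereas soft matching gives partial credit to the near-synonym pair.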

When to Use Each

| Task | Primary Metric | Secondary |
|---|---|---|
| Machine translation | BLEU | METEOR, BERTScore |
| Summarisation | ROUGE-1, ROUGE-L | BERTScore |
| RAG answer quality | BERTScore | RAGAS (faithfulness) |
| Open-ended generation | LLM-as-judge | BERTScore |
| Code generation | Exact match, CodeBLEU | |
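For the code-generation row, exact match is simple enough to sketch directly. The normalisation below (dropping trailing whitespace and blank lines while keeping indentation) is an assumption made for this example, not a standard; `normalize_code` and `exact_match_rate` are hypothetical names.

```python
def normalize_code(text):
    """Drop trailing whitespace and blank lines; keep indentation, which can be significant."""
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def exact_match_rate(references, predictions):
    """Share of predictions that equal their reference after normalisation."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize_code(r) == normalize_code(p)
               for r, p in zip(references, predictions))
    return hits / len(references)


references = ["def add(a, b):\n    return a + b\n", "x = 1"]
predictions = ["def add(a, b):\n    return a + b", "x = 2"]

print(f"Exact match: {exact_match_rate(references, predictions):.2f}")
```

Exact match is strict: a functionally identical solution with a renamed variable scores zero, which is one reason it is usually paired with a softer metric.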

Critical caveat: N-gram metrics reward surface overlap, so they often correlate poorly with human judgement on modern LLM output, where a correct answer may share few words with the reference. For production RAG systems, prefer RAGAS (faithfulness, relevancy) and LLM-as-judge over BLEU/ROUGE.