ROUGE Score: Definition and Examples

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of automatic metrics used to evaluate the quality of summaries generated by language models by comparing them to human-produced reference summaries.

Full definition

The ROUGE Score is a set of automatic evaluation metrics originally designed to measure the quality of text summaries. Introduced by Chin-Yew Lin in 2004, ROUGE compares an automatically generated text to one or more human-written reference texts by measuring the overlap between them. Scores range from 0 to 1: the higher the score, the more of the reference's wording the generated text shares.

The ROUGE family includes several variants. ROUGE-N measures the overlap of n-grams (sequences of N consecutive words) between the generated text and the reference. ROUGE-1 compares individual words (unigrams), ROUGE-2 compares consecutive word pairs (bigrams), and so on. ROUGE-L, on the other hand, uses the longest common subsequence (LCS) to capture structural similarity between texts, even if words are not strictly consecutive.
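As a rough illustration of the n-gram overlap described above, here is a minimal Python sketch of ROUGE-N. It assumes lowercased whitespace tokenization and a single reference. The names `ngram_counts` and `rouge_n` are made up for this example, and real libraries add stemming and more careful tokenization, so their numbers may differ.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Return ROUGE-N precision, recall and F1 for one reference (illustrative only)."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    ref = ngram_counts(reference.lower().split(), n)
    cand = ngram_counts(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram,
    # so a repeated word is only credited as often as the reference has it.
    overlap = sum((ref & cand).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n(reference, candidate, n=1))  # 5 of 6 unigrams overlap
print(rouge_n(reference, candidate, n=2))  # 3 of 5 bigrams overlap
```

Changing a single word costs one unigram but two bigrams here, which is why ROUGE-2 tends to drop faster than ROUGE-1 when the wording drifts.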

In prompt engineering, understanding the ROUGE Score is useful when working on text generation tasks, especially automatic summarization, translation, or paraphrasing. The metric gives a repeatable, automatic measure of how much of the expected content a prompt's output covers. For example, by comparing the outputs of different prompts against the same reference text, you can identify which wording produces the most complete and relevant summaries.

It is important to note that ROUGE mainly measures lexical recall, that is, how many of the reference's words and phrases appear in the generated text. It does not necessarily capture semantic coherence, fluency, or factual accuracy. That is why ROUGE is often used alongside other measures such as BLEU, BERTScore, or human evaluation to obtain a more complete view of the quality of a generated text.

Etymology

ROUGE is an acronym for 'Recall-Oriented Understudy for Gisting Evaluation'. The name also nods to BLEU (Bilingual Evaluation Understudy), the metric used in machine translation: 'rouge' and 'bleu' are the French words for red and blue, which makes a color pun between the two complementary metrics.

Concrete examples

Evaluating the quality of an automatic summary

Summarize this financial report in 3 sentences. Then, I will compare your summary with my reference summary using the ROUGE-2 score to verify that the key information is well covered.

Comparing two prompt variants for a summarization task

Here is a news article. Generate a summary first in a factual style, then in a narrative style. I will measure the ROUGE-L of each version against my gold standard summary to determine which is more faithful.

Optimizing a content generation pipeline in production

You are a quality evaluator. Compare text A (generated) with text B (reference). Identify passages from text B that are missing from text A, which would correspond to a low ROUGE-1 recall score.

Practical usage

In prompt engineering, the ROUGE Score is mainly used to iterate on summarization or paraphrasing prompts: generate several variants, measure their ROUGE against a human reference, and keep the prompt that maximizes the score. Use ROUGE-1 to verify coverage of key vocabulary, ROUGE-2 for fidelity of phrasing, and ROUGE-L for overall structure. Always combine ROUGE with human review, as a high score does not guarantee fluency or the absence of hallucinations.
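The iteration loop described above can be sketched as follows. This is a simplified illustration, not a production evaluator: `rouge_n_f1` and `best_prompt` are hypothetical helpers, the scoring uses plain whitespace tokens, and the two "outputs" are invented strings standing in for what two prompt variants might return.

```python
from collections import Counter


def rouge_n_f1(reference, candidate, n=1):
    """Simplified ROUGE-N F1 on lowercased whitespace tokens (an assumption)."""
    def grams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    ref, cand = grams(reference), grams(candidate)
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_prompt(reference, outputs_by_prompt, n=1):
    """Score each prompt's output against the reference and return the top one."""
    scores = {name: rouge_n_f1(reference, text, n)
              for name, text in outputs_by_prompt.items()}
    winner = max(scores, key=scores.get)
    return winner, scores


# Hypothetical reference summary and outputs from two prompt variants.
reference = "revenue grew 12 percent while costs fell"
outputs = {
    "factual": "revenue grew 12 percent and costs fell",
    "narrative": "it was a strong year for the firm",
}
winner, scores = best_prompt(reference, outputs)
print(winner, scores)
```

In a real workflow the scores would be averaged over many documents rather than a single example, and the winning outputs would still go through human review.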

Related concepts

BLEU Score, BERTScore, Automatic evaluation, N-grams

FAQ

What is the difference between ROUGE-1, ROUGE-2, and ROUGE-L?

ROUGE-1 compares individual words between the generated text and the reference (unigrams), which measures basic lexical coverage. ROUGE-2 compares consecutive word pairs (bigrams), which better captures the fidelity of phrasing. ROUGE-L uses the longest common subsequence, which measures structural similarity without requiring words to be strictly adjacent. In practice, ROUGE-2 and ROUGE-L are often considered more informative than ROUGE-1 for evaluating summary quality, because they reward word order and not just vocabulary.
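To make the longest-common-subsequence idea concrete, here is a small Python sketch of ROUGE-L. It is an illustration under simplifying assumptions: whitespace tokenization, a single reference, and a plain F1 that weights precision and recall equally. The names `lcs_length` and `rouge_l` are chosen for this example only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] is the LCS length between the tokens of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Return ROUGE-L precision, recall and F1 (illustrative only)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    recall = lcs / len(ref) if ref else 0.0
    precision = lcs / len(cand) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the cat sat on the mat"
candidate = "the cat quietly sat down on the old mat"
# All six reference words appear in the candidate in the same order,
# even though extra words sit between them, so recall is 1.0.
print(rouge_l(reference, candidate))
```

The inserted words break most of the reference's bigrams (only two of five survive), yet ROUGE-L recall stays at 1.0 because order is preserved. That is the structural tolerance the answer above refers to.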

Does a high ROUGE score guarantee a good summary?

No. ROUGE measures lexical overlap with a reference, but does not capture coherence, factual accuracy, or readability. A text can achieve a high ROUGE score by repeating the reference's keywords while being poorly structured or containing factual errors. That is why it is recommended to combine ROUGE with other metrics (BERTScore for semantic similarity, human evaluation for perceived quality) and never rely on it as the sole indicator.
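A toy example may help show why. The sketch below uses a simplified unigram F1 (lowercased whitespace tokens, an assumption of this example) and three invented sentences: a reference, a candidate that contradicts it, and a candidate that agrees with it in different words.

```python
from collections import Counter


def unigram_f1(reference, candidate):
    """Simplified ROUGE-1 F1 on lowercased whitespace tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented sentences for illustration.
reference = "the company reported a profit in the third quarter"
wrong = "the company reported a loss in the third quarter"
paraphrase = "earnings were positive for july through september"

print(round(unigram_f1(reference, wrong), 2))       # 0.89
print(round(unigram_f1(reference, paraphrase), 2))  # 0.0
```

The factually wrong sentence scores far higher than the reasonable paraphrase, because only shared words count.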

How do you calculate a ROUGE score in practice?

The simplest method is to use the Python library 'rouge-score' from Google or the 'evaluate' package from Hugging Face. With 'rouge-score', you provide the reference text and the generated text, and the library returns precision, recall, and F1 scores for each ROUGE variant you request. Ready-made Jupyter notebooks and free online tools are also available for quick tests.
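As a minimal sketch of the 'rouge-score' route, assuming the package is installed (`pip install rouge-score`): the wrapper name `rouge_report` is made up for this example, and the import sits inside the function only so the snippet can be loaded on machines where the library is missing.

```python
def rouge_report(reference, candidate):
    """Score one candidate against one reference with the rouge-score package."""
    # Imported here so that a missing install fails only when the function is called.
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"], use_stemmer=True
    )
    # score() takes the reference (target) first, then the generated text.
    scores = scorer.score(reference, candidate)
    return {
        name: {"precision": s.precision, "recall": s.recall, "f1": s.fmeasure}
        for name, s in scores.items()
    }


# Example call (requires the rouge-score package):
# report = rouge_report("the cat sat on the mat", "the cat lay on the mat")
# print(report["rouge2"]["recall"])
```

Because the library applies its own tokenization and optional stemming, its numbers can differ slightly from a hand-rolled computation on the same texts.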
