ROUGE Score: Definition and Examples
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of automatic metrics that evaluate machine-generated summaries by comparing them with human-written reference summaries.
Full definition
ROUGE is a set of automatic evaluation metrics originally designed to measure the quality of text summaries. Introduced by Chin-Yew Lin in 2004, it compares an automatically generated text with one or more human-written reference texts by counting the words and word sequences they share. Scores range from 0 to 1: the higher the score, the closer the wording of the generated text is to the human reference.
The ROUGE family includes several variants. ROUGE-N measures the overlap of n-grams (sequences of N consecutive words) between the generated text and the reference. ROUGE-1 compares individual words (unigrams), ROUGE-2 compares consecutive word pairs (bigrams), and so on. ROUGE-L, on the other hand, uses the longest common subsequence (LCS): the longest series of words that appear in the same order in both texts, even if they are not adjacent. This lets it capture structural similarity that fixed-length n-grams miss.
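The two variants can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: it assumes lowercased whitespace tokenization with no stemming or punctuation handling, and the function names (`ngrams`, `rouge_n`, `lcs_length`, `rouge_l`) are chosen here for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall and F1 (assumes whitespace tokenization)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # each n-gram counted at most as often as in both
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L precision, recall and F1 based on the LCS length."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    precision = lcs / max(len(cand), 1)
    recall = lcs / max(len(ref), 1)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(round(rouge_n(candidate, reference, 1)[1], 3))  # ROUGE-1 recall: 0.833
print(round(rouge_n(candidate, reference, 2)[1], 3))  # ROUGE-2 recall: 0.6
print(round(rouge_l(candidate, reference)[1], 3))     # ROUGE-L recall: 0.833
```

Changing one word removes one of six unigrams but two of five bigrams, which is why ROUGE-2 drops faster than ROUGE-1 on small wording changes. Established packages typically add stemming and more careful tokenization, so their numbers may differ from this sketch.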
In prompt engineering, ROUGE is useful for text generation tasks that have a reference output, especially automatic summarization and, to a lesser extent, paraphrasing or translation. It gives a repeatable measure of how much of the expected content's wording a prompt's output reproduces. For example, by comparing the outputs of different prompts with the same reference text, one can identify which wording produces summaries that cover the most reference content.
It is important to note that ROUGE mainly measures lexical recall, that is, how much of the reference's wording appears in the generated text. It does not capture semantic coherence, fluency, or factual accuracy. That is why ROUGE is often used in conjunction with other metrics such as BLEU, BERTScore, or human evaluation to obtain a more complete view of the quality of a generated text.
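A small illustration of this limitation, assuming a simplified ROUGE-1 recall based on lowercased whitespace tokens (the helper `rouge1_recall` and the example sentences are invented for this sketch): a summary that contradicts the reference can outscore a faithful paraphrase.

```python
from collections import Counter


def rouge1_recall(candidate, reference):
    """Share of reference words that also appear in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    return overlap / max(sum(ref.values()), 1)


reference = "the company reported a profit this quarter"
contradiction = "the company reported a loss this quarter"
paraphrase = "earnings were positive for the firm in the latest period"

print(round(rouge1_recall(contradiction, reference), 3))  # 0.857
print(round(rouge1_recall(paraphrase, reference), 3))     # 0.143
```

The contradictory sentence shares six of the seven reference words, while the paraphrase shares only one, which is why a semantic metric or human review is needed alongside ROUGE.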
Etymology
ROUGE is an acronym for 'Recall-Oriented Understudy for Gisting Evaluation'. The name also winks at the BLEU (Bilingual Evaluation Understudy) metric used in machine translation: "rouge" and "bleu" are the French words for red and blue, creating a chromatic pun between the two complementary metrics.
Concrete examples
Evaluating the quality of an automatic summary
Summarize this financial report in 3 sentences. Then, I will compare your summary with my reference summary using the ROUGE-2 score to verify that the key information is well covered.
Comparing two prompt variants for a summarization task
Here is a news article. Generate a summary first in a factual style, then in a narrative style. I will measure the ROUGE-L of each version against my gold standard summary to determine which is more faithful.
Optimizing a content generation pipeline in production
You are a quality evaluator. Compare text A (generated) with text B (reference). Identify passages from text B that are missing from text A, which would correspond to a low ROUGE-1 recall score.
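The check described in this last prompt can also be approximated directly in code. The sketch below is a rough word-level stand-in, assuming lowercased whitespace tokens; the helper `missing_unigrams` and the two texts are hypothetical. It returns the ROUGE-1 recall together with the reference words that the generated text does not cover.

```python
from collections import Counter


def missing_unigrams(generated, reference):
    """Return (ROUGE-1 recall, sorted reference words absent from the generated text)."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    total = sum(ref.values())
    overlap = sum((gen & ref).values())
    recall = overlap / total if total else 0.0
    missing = ref - gen  # Counter subtraction keeps only positive counts
    return recall, sorted(missing.elements())


text_a = "revenue rose while costs stayed flat"      # generated
text_b = "revenue rose 12 percent while costs fell"  # reference

recall, missing = missing_unigrams(text_a, text_b)
print(round(recall, 3))  # 0.571
print(missing)           # ['12', 'fell', 'percent']
```

Listing the missing words makes a low recall score actionable: here the generated text dropped the figure and the direction of the cost change.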
Practical usage
In prompt engineering, ROUGE is mainly used to iterate on summarization or paraphrasing prompts: generate several variants, measure the ROUGE score of each output against a human reference, and keep the prompt that maximizes the score. Use ROUGE-1 to verify coverage of key vocabulary, ROUGE-2 for phrase-level fidelity, and ROUGE-L for overall structure. Always combine ROUGE with human review, as a high score does not guarantee fluency or absence of hallucinations.
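This iteration loop can be sketched as follows. The code is a simplified, self-contained illustration: the prompt names, outputs, and reference are hypothetical, tokenization is plain lowercased whitespace splitting, and ranking by ROUGE-L F1 is one possible selection rule rather than a prescribed one.

```python
from collections import Counter


def f1(matches, cand_total, ref_total):
    """Harmonic mean of precision and recall from raw match counts."""
    if matches == 0:
        return 0.0
    p, r = matches / cand_total, matches / ref_total
    return 2 * p * r / (p + r)


def rouge_n_f1(cand, ref, n):
    def grams(tokens):
        # zip over n shifted copies of the token list yields its n-grams
        return Counter(zip(*(tokens[i:] for i in range(n))))
    c, r = grams(cand), grams(ref)
    return f1(sum((c & r).values()), sum(c.values()), sum(r.values()))


def rouge_l_f1(cand, ref):
    prev = [0] * (len(ref) + 1)  # rolling row of the LCS table
    for x in cand:
        curr = [0]
        for j, y in enumerate(ref, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return f1(prev[-1], len(cand), len(ref))


def score(candidate, reference):
    c, r = candidate.lower().split(), reference.lower().split()
    return {
        "rouge1": rouge_n_f1(c, r, 1),
        "rouge2": rouge_n_f1(c, r, 2),
        "rougeL": rouge_l_f1(c, r),
    }


reference = "sales grew in europe but fell in asia"
outputs = {  # hypothetical outputs produced by two prompt variants
    "prompt_a": "sales grew in europe but fell in asia last year",
    "prompt_b": "the report discusses regional sales trends",
}

results = {name: score(text, reference) for name, text in outputs.items()}
best = max(results, key=lambda name: results[name]["rougeL"])
print(best)  # prompt_a
```

In practice the scores would be averaged over many documents and references before choosing a prompt, since a single example is too noisy to support a decision.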