BLEU Score: Definition and Examples

The BLEU Score (Bilingual Evaluation Understudy) is an automatic metric that evaluates the quality of machine-generated text by comparing it to one or more human reference translations.

Full definition

The BLEU Score is one of the most widely used metrics in natural language processing (NLP) for evaluating the quality of texts produced by machine translation or text generation systems. Developed by Kishore Papineni and his team at IBM in 2002, it measures the degree of overlap between a candidate text (generated by the machine) and one or more reference texts (produced by humans).

Specifically, the BLEU Score compares sequences of words (called n-grams) between the generated text and the references. It calculates a modified n-gram precision: the share of 1-word (unigram), 2-word (bigram), 3-word (trigram), and 4-word (4-gram) sequences in the candidate text that also appear in the references, with each n-gram counted no more often than it occurs in a reference. The final score is the geometric mean of these precisions multiplied by a brevity penalty, which lowers the score of candidates that are shorter than the references.
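The calculation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes whitespace tokenization, equal weights for 1- to 4-grams, and no smoothing, and the function names (ngram_counts, modified_precision, brevity_penalty, bleu) are chosen for this example only. Established libraries tokenize and handle edge cases differently, so their scores may not match this sketch.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    many times as it occurs in the reference where it is most frequent."""
    cand_counts = ngram_counts(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Return 1.0 if the candidate is longer than the closest reference,
    otherwise exp(1 - r / c), which shrinks as the candidate gets shorter."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference.
    r = min((len(ref) for ref in references),
            key=lambda ref_len: (abs(ref_len - c), ref_len))
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=4):
    """Geometric mean of the 1..max_n modified precisions times the brevity
    penalty, on a 0 to 1 scale. Without smoothing, any zero precision
    makes the whole score zero."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


# Illustrative sentence pair: the candidate differs by one word.
reference = "recent advances in artificial intelligence are transforming our daily lives".split()
candidate = "recent advances in artificial intelligence are changing our daily lives".split()
score = bleu(candidate, [reference])
print(round(score, 3))
```

With this sentence pair, a single substituted word already lowers the score to roughly 0.66, because every n-gram that contains the changed word stops matching.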

The score ranges from 0 to 1 (often expressed as a percentage from 0 to 100). A score of 1 means a perfect match with a reference, which is extremely rare in practice. In machine translation, on the 0 to 100 scale, a BLEU Score above 30 is generally considered acceptable and above 50 very good, although these thresholds vary with the task and domain. It is important to note that the BLEU Score mainly measures lexical overlap and does not necessarily capture fluency, meaning, or stylistic quality.

In the context of prompt engineering, understanding the BLEU Score allows for objective evaluation of whether a language model's responses match expected outputs. This is particularly useful when iterating on prompts for translation, summarization, or paraphrasing tasks, as it provides a numerical indicator to compare different prompt versions.

Etymology

BLEU is an acronym for 'Bilingual Evaluation Understudy'. The term 'understudy' comes from the theater, where an understudy stands in for the main actor; here, the automatic metric stands in for (or supplements) human evaluation. The metric was introduced in the seminal paper by Papineni et al. in 2002: 'BLEU: a Method for Automatic Evaluation of Machine Translation'.

Concrete examples

Evaluation of a translation prompt

Translate the following text into English faithfully and naturally: 'Les avancées récentes en intelligence artificielle transforment notre quotidien.' Then compare your translation with this reference: 'Recent advances in artificial intelligence are transforming our daily lives.'

Comparison of two prompt variants for summarization

Summarize this paragraph in exactly two sentences while retaining the key information. I will measure the quality of your summary using the BLEU Score compared to a reference summary.

Benchmarking a model on a translation dataset

Evaluate the performance of this model on the WMT14 French-English dataset by calculating the BLEU Score on the entire test corpus.

Practical usage

In prompt engineering, the BLEU Score is used to objectively measure the quality of LLM outputs when reference answers are available. It is especially useful for comparing the effectiveness of different prompt formulations for translation or paraphrasing tasks. To apply it, simply collect the model's outputs for each prompt variant, then calculate the BLEU Score using a library like sacrebleu or nltk.translate.bleu_score in Python.
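The comparison workflow described above might look like the following sketch. Everything in it is hypothetical and for illustration: the variant names, the example outputs, the helper rank_prompt_variants, and in particular unigram_precision, which is a toy stand-in scorer and not BLEU, used only so the example runs without extra dependencies. The scorer is passed in as a function so that a real BLEU implementation can replace it.

```python
from collections import Counter


def unigram_precision(hypotheses, references):
    """Toy stand-in scorer (NOT BLEU): clipped unigram precision over the
    whole corpus, on a 0 to 100 scale, using whitespace tokenization."""
    matched = total = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
        matched += sum(min(count, ref_counts[word])
                       for word, count in hyp_counts.items())
        total += sum(hyp_counts.values())
    return 100.0 * matched / total if total else 0.0


def rank_prompt_variants(outputs_by_variant, references, score_fn):
    """Score each variant's outputs against the same references and
    return (variant_name, score) pairs sorted from best to worst."""
    scores = {name: score_fn(outputs, references)
              for name, outputs in outputs_by_variant.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: one reference per source text, and the outputs a model
# produced for that text under each prompt variant.
references = [
    "Recent advances in artificial intelligence are transforming our daily lives.",
]
outputs_by_variant = {
    "prompt_a": ["Recent advances in artificial intelligence are changing our daily lives."],
    "prompt_b": ["New progress in AI changes our everyday life."],
}

# Assumption: with sacrebleu installed, an actual BLEU scorer could be passed
# instead, e.g. lambda hyps, refs: sacrebleu.corpus_bleu(hyps, [refs]).score
ranking = rank_prompt_variants(outputs_by_variant, references, unigram_precision)
for name, score in ranking:
    print(f"{name}: {score:.1f}")
```

Scoring every variant against the same reference set is what makes the numbers comparable; a score computed on different source texts or references says little about which prompt is better.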

Related concepts

ROUGE Score, METEOR, Perplexity, N-grams, Machine Translation, BERTScore

FAQ

What is a good BLEU Score?
There is no universal threshold, as the score depends on the task and domain. In machine translation, a BLEU Score between 25 and 40 (on the 0 to 100 scale) is generally considered decent, and a score above 50 excellent. For free text generation, scores are often lower because there are many valid ways to express the same idea.
What are the limitations of the BLEU Score?
The BLEU Score only measures lexical overlap (words and word sequences) and ignores semantics, grammar, and fluency. Two sentences with the same meaning but different wording can get a low score. That's why it is often combined with other metrics like BERTScore (which measures semantic similarity) or METEOR (which accounts for synonyms).
How do you calculate the BLEU Score in practice?
The simplest method is to use the Python library sacrebleu (pip install sacrebleu), which implements the standard calculation and reports the score on a 0 to 100 scale. One can also use nltk.translate.bleu_score for sentence-by-sentence calculation; it works on tokenized text and returns a score between 0 and 1. In both cases, you provide the candidate text and one or more references.
