F1 Score: Definition and Examples

The F1 Score is an evaluation metric that combines precision and recall into a single value, calculated as their harmonic mean. It is particularly useful for evaluating model performance on imbalanced datasets.

Full definition

The F1 Score is a fundamental metric in machine learning and natural language processing. It is the harmonic mean of precision (the proportion of positive predictions that are correct) and recall (the proportion of actual positives that the model detects). Its formula is: F1 = 2 × (Precision × Recall) / (Precision + Recall). The score ranges from 0 to 1, where 1 indicates perfect precision and recall.
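As a minimal sketch, the formula can be written as a small Python function. The function name and the choice to return 0.0 when precision and recall are both zero are assumptions made for this illustration (a common convention, since the ratio is otherwise undefined).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Returns 0.0 when both inputs are zero (assumed convention),
    because the formula would otherwise divide by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values, not taken from a real model
print(round(f1_score(0.8, 0.6), 4))
```

With a precision of 0.8 and a recall of 0.6, the result is about 0.686, which sits below the arithmetic mean of 0.7.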

The main value of the F1 Score lies in its ability to balance two objectives that often compete. A model can achieve excellent precision by being very selective (few false positives), but at the expense of recall (many false negatives). Conversely, a model that predicts 'positive' for everything will have perfect recall but poor precision. The F1 Score penalizes these imbalances through the harmonic mean, which pulls the value toward the lower of the two scores.
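The pull toward the lower score is easy to see by comparing the arithmetic mean with the harmonic mean. The sketch below uses Python's standard `statistics` module; the helper name `compare_means` and the two sets of scores are hypothetical and chosen only for illustration.

```python
from statistics import harmonic_mean, mean


def compare_means(precision: float, recall: float) -> tuple[float, float]:
    """Return (arithmetic mean, harmonic mean) of precision and recall."""
    values = [precision, recall]
    return mean(values), harmonic_mean(values)


# Hypothetical models: one very selective, one balanced
models = {
    "selective": (0.95, 0.10),
    "balanced": (0.60, 0.55),
}

for name, (p, r) in models.items():
    arithmetic, f1 = compare_means(p, r)
    print(f"{name}: arithmetic mean = {arithmetic:.3f}, F1 = {f1:.3f}")
```

The selective model has an arithmetic mean above 0.5 but an F1 of roughly 0.18, while the balanced model keeps nearly the same value under both means.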

In the context of prompt engineering, the F1 Score is commonly used to evaluate the quality of responses generated by an LLM, especially for tasks such as classification, entity extraction, or question answering. For example, when asking a model to extract information from a text, one can measure whether all relevant information was found (recall) and whether the extracted information is actually correct (precision).

There are several variants of the F1 Score for multi-class problems: macro F1 (unweighted average of the per-class F1 scores), micro F1 (a single F1 computed from true positives, false positives, and false negatives pooled across all classes), and weighted F1 (average of the per-class F1 scores weighted by the number of true examples in each class). The choice of variant depends on the relative importance given to each class in the problem at hand.
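The three variants can be sketched in plain Python for a single-label classification task. The function names and the toy labels below are assumptions for illustration; this is not a reference implementation.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 expressed with counts: 2*TP / (2*TP + FP + FN); 0.0 if undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_variants(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """Compute macro, micro, and weighted F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class, support = {}, {}
    total_tp = total_fp = total_fn = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        per_class[label] = f1_from_counts(tp, fp, fn)
        support[label] = tp + fn  # number of true examples of this class
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    return {
        "macro": sum(per_class.values()) / len(labels),
        "micro": f1_from_counts(total_tp, total_fp, total_fn),
        "weighted": sum(per_class[l] * support[l] for l in labels) / len(y_true),
    }


# Toy data: "pos" dominates, "neu" is never predicted
y_true = ["pos", "pos", "pos", "pos", "neg", "neu"]
y_pred = ["pos", "pos", "pos", "neg", "neg", "pos"]
for name, value in f1_variants(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
```

On this toy data the macro score is the lowest of the three because the missed "neu" class counts as much as the frequent "pos" class. In practice, scikit-learn's `f1_score` exposes the same choices through its `average` parameter.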

Etymology

The term 'F1 Score' comes from the family of F-measures (or F-scores), which derive from the effectiveness measure described by C.J. van Rijsbergen in 1979 in the field of information retrieval. The '1' in F1 indicates that precision and recall are weighted equally (parameter β = 1). The general Fβ formula allows adjusting this weight: F2 favors recall, F0.5 favors precision.
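The general formula is commonly written as Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall). The sketch below implements it; the function name and the example values are assumptions for illustration.

```python
def fbeta_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """General F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    beta_sq = beta ** 2
    denom = beta_sq * precision + recall
    if denom == 0:
        return 0.0  # assumed convention when the score is undefined
    return (1 + beta_sq) * precision * recall / denom


# Illustrative model with higher precision than recall
precision, recall = 0.9, 0.6
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {fbeta_score(precision, recall, beta):.3f}")
```

Because recall is the weaker of the two scores here, F2 comes out lowest (about 0.643), F1 is 0.72, and F0.5 is highest (about 0.818).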

Concrete examples

Evaluating a spam classifier

Evaluate the performance of my spam classifier by calculating the F1 Score. Here are the results: 85 true positives, 10 false positives, 15 false negatives, 890 true negatives. Explain whether this score is satisfactory for a spam filter.
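The figures in this prompt can be checked by hand or with a few lines of Python. The sketch below computes the metrics directly from the four confusion-matrix counts given above; the function name is an assumption for illustration.

```python
def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts from the spam classifier prompt
results = metrics_from_counts(tp=85, fp=10, fn=15, tn=890)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

For these counts, precision is about 0.895, recall is 0.85, and the F1 Score is about 0.872, while accuracy is 0.975 because the many true negatives dominate it.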

Named entity extraction with an LLM

Extract all companies mentioned in this text. I will compare your response with a reference list to calculate the F1 Score. Be exhaustive (good recall) while avoiding false positives (good precision).
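One possible way to score such an extraction is to compare the model's output with the reference list as sets. The sketch below assumes exact matching after lowercasing and trimming whitespace, which is a simplification; the function name and the company names are hypothetical.

```python
def extraction_f1(predicted: list[str], reference: list[str]) -> dict[str, float]:
    """Set-based precision, recall, and F1 for extracted entities.

    Assumption: two entities match if they are equal after
    lowercasing and stripping surrounding whitespace.
    """
    pred = {name.strip().lower() for name in predicted}
    ref = {name.strip().lower() for name in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical model output and reference list
model_output = ["Acme Corp", "globex ", "Umbrella", "Hooli"]
reference_list = ["Acme Corp", "Globex", "Initech"]
print(extraction_f1(model_output, reference_list))
```

Here two of the four extracted names are correct (precision 0.5) and two of the three reference names were found (recall of about 0.667), giving an F1 of about 0.571.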

Comparing prompts for a classification task

I tested three prompt variants for classifying customer reviews into positive/negative/neutral. Here are the macro F1 Scores obtained: Prompt A = 0.72, Prompt B = 0.81, Prompt C = 0.78. Analyze these results and suggest ways to improve the best-performing prompt.

Practical usage

In prompt engineering, the F1 Score is used to compare different prompt formulations objectively on measurable tasks such as classification or information extraction. To use it, prepare a test set with expected answers, run your prompt on each example, then calculate precision, recall, and F1. Prefer macro F1 if all classes are equally important, or weighted F1 if the score should reflect how often each class occurs.
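The workflow described above can be sketched as a small evaluation loop. In this illustration, `keyword_classifier` is a hypothetical stand-in for the call that would send your prompt to an LLM, and the test set is invented; only the overall structure is the point.

```python
from typing import Callable


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted average of per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def evaluate_prompt(classify: Callable[[str], str],
                    test_set: list[tuple[str, str]]) -> float:
    """Run a classifier on every example and return its macro F1."""
    y_true = [expected for _, expected in test_set]
    y_pred = [classify(text) for text, _ in test_set]
    return macro_f1(y_true, y_pred)


def keyword_classifier(text: str) -> str:
    """Hypothetical placeholder for a prompt sent to an LLM."""
    lowered = text.lower()
    if "great" in lowered or "love" in lowered:
        return "positive"
    if "bad" in lowered or "broken" in lowered:
        return "negative"
    return "neutral"


# Invented test set of (text, expected label) pairs
test_set = [
    ("I love this product", "positive"),
    ("Arrived broken", "negative"),
    ("It is a phone", "neutral"),
    ("Great value, bad packaging", "negative"),
]
print(f"macro F1 = {evaluate_prompt(keyword_classifier, test_set):.3f}")
```

Running the same loop with one classifier function per prompt variant, on the same test set, gives scores that can be compared directly.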

Related concepts

Precision
Recall
Confusion matrix
ROC-AUC curve

FAQ

What is the difference between F1 Score and accuracy?

Accuracy measures the overall percentage of correct predictions, while the F1 Score focuses on the balance between precision and recall. On an imbalanced dataset (e.g., 95% negatives), a model that always predicts 'negative' will have 95% accuracy but an F1 Score of 0 on the positive class. The F1 Score is therefore more informative when classes are imbalanced.
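The scenario above can be reproduced in a few lines. The sketch below builds a dataset of 100 examples with 95% negatives and scores a model that always predicts 'negative'; the function name and the 0/1 label encoding are assumptions for illustration.

```python
def accuracy_and_f1(y_true: list[int], y_pred: list[int],
                    positive: int = 1) -> tuple[float, float]:
    """Return (accuracy, F1 on the positive class)."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return correct / len(pairs), f1


# 5 positives (1) and 95 negatives (0); the model always predicts 0
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}, F1 (positive class) = {f1:.2f}")
```

The model never finds a single positive example, so its F1 on the positive class is 0 even though its accuracy is 0.95.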

When should I use the F2 Score instead of the F1 Score?

The F2 Score gives twice as much importance to recall as to precision. It is preferable in cases where missing a true positive is more costly than a false positive, for example in medical diagnosis or fraud detection. Conversely, the F0.5 Score favors precision, useful when false positives are very costly.

How should I interpret an F1 Score for a language model?

As a rough rule of thumb, an F1 Score above 0.9 is generally excellent, between 0.7 and 0.9 is good, and below 0.7 indicates significant room for improvement. However, these thresholds are only indicative: interpretation strongly depends on the task and domain. For a complex entity extraction task, an F1 of 0.75 may be very good, while for a simple binary classification, at least 0.85 would typically be expected.
