F1 Score: Definition and Examples
The F1 Score is an evaluation metric that combines precision and recall into a single value, calculated as their harmonic mean. It is particularly useful for evaluating model performance on imbalanced datasets.
Full definition
The F1 Score is a fundamental metric in machine learning and natural language processing. It is the harmonic mean of precision (the proportion of positive predictions that are correct) and recall (the proportion of actual positives that the model detects). Its formula is: F1 = 2 × (Precision × Recall) / (Precision + Recall). The score ranges from 0 to 1, where 1 indicates perfect precision and perfect recall.
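As a minimal sketch, the formula above can be written as a small Python function. The function name is illustrative, and returning 0.0 when both precision and recall are zero is an assumed convention for avoiding division by zero, not something the formula itself prescribes.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Assumed convention: return 0.0 when both inputs are zero,
    since the formula would otherwise divide by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Equal precision and recall give that same value back.
print(f1_from_precision_recall(0.5, 0.5))
# Perfect precision cannot compensate for zero recall.
print(f1_from_precision_recall(1.0, 0.0))
```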
The main value of the F1 Score lies in its ability to balance two objectives that often compete. A model can achieve excellent precision by being very selective (few false positives), but at the expense of recall (many false negatives). Conversely, a model that predicts 'positive' for everything will have perfect recall, but its precision will be no better than the share of positives in the data. The F1 Score penalizes these imbalances through the harmonic mean, which pulls the value toward the lower of the two scores.
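To make the pull toward the lower score concrete, here is a small sketch comparing the arithmetic and harmonic means. The scenario is hypothetical: 10% of the examples are positive and the model predicts 'positive' for every input, so precision is 0.10 and recall is 1.00.

```python
from statistics import harmonic_mean

# Hypothetical scenario: 10% positives, model predicts "positive" for everything.
precision = 0.10
recall = 1.00

arithmetic = (precision + recall) / 2
f1 = harmonic_mean([precision, recall])  # same value as 2PR / (P + R)

print(round(arithmetic, 3))  # 0.55
print(round(f1, 3))          # 0.182
```

The arithmetic mean would rate this model at 0.55, while the F1 Score stays close to the weaker of the two values.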
In the context of prompt engineering, the F1 Score is commonly used to evaluate the quality of responses generated by an LLM, especially for tasks such as classification, entity extraction, or question answering. For example, when asking a model to extract information from a text, one can measure whether all relevant information was found (recall) and whether the extracted information is actually correct (precision).
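For an extraction task, one simple way to score a response is to treat the model output and the reference list as sets. The sketch below assumes exact string matching after any normalization you choose to apply; the function name and the company names are hypothetical.

```python
def extraction_scores(predicted: set[str], reference: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, f1) for a set of extracted items.

    Assumes an item counts as correct only if it matches a reference
    item exactly; empty sets yield 0.0 rather than a division error.
    """
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of 3 extracted names are correct, 2 of 4 expected names were found.
reference = {"Acme", "Globex", "Initech", "Umbrella"}
predicted = {"Acme", "Globex", "Hooli"}
precision, recall, f1 = extraction_scores(predicted, reference)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Exact matching is strict: "Acme Corp" and "Acme" would count as a miss plus a false positive, so real evaluations often normalize names before comparing.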
There are several variants of the F1 Score for multi-class problems: macro F1 (unweighted average of the per-class F1 scores), micro F1 (a single F1 computed from true positives, false positives, and false negatives pooled across all classes), and weighted F1 (average of the per-class F1 scores weighted by the number of examples in each class). The choice of variant depends on the relative importance given to each class in the problem at hand.
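The three variants can be sketched from scratch for a single-label classification task. The function name and the toy labels are illustrative, and a class with no true or predicted examples is assumed to score 0.0.

```python
from collections import Counter


def averaged_f1(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """Compute macro, micro, and weighted F1 for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[truth] += 1  # true class gets a false negative

    labels = sorted(set(y_true) | set(y_pred))
    per_class = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class[label] = 2 * tp[label] / denom if denom else 0.0

    support = Counter(y_true)  # number of true examples per class
    total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_denom = 2 * total_tp + total_fp + total_fn
    return {
        "macro": sum(per_class.values()) / len(labels),
        "micro": 2 * total_tp / micro_denom if micro_denom else 0.0,
        "weighted": sum(per_class[l] * support[l] for l in labels) / len(y_true),
    }


y_true = ["pos", "pos", "pos", "pos", "neg", "neg", "neu"]
y_pred = ["pos", "pos", "pos", "neg", "neg", "neu", "neu"]
scores = averaged_f1(y_true, y_pred)
print({name: round(value, 3) for name, value in scores.items()})
```

In this toy data the frequent "pos" class is predicted well, so weighted F1 comes out higher than macro F1, which gives the weaker "neg" class the same influence as the others.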
Etymology
The term 'F1 Score' comes from the family of F-measures (or F-scores), which derive from the effectiveness measure introduced by C.J. van Rijsbergen in 1979 in the field of information retrieval. The '1' in F1 indicates that precision and recall are weighted equally (parameter β = 1). The general Fβ formula allows adjusting this weight: F2 weights recall more heavily, while F0.5 weights precision more heavily.
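The general Fβ formula is Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall). A short sketch, with an illustrative function name and hypothetical input values, shows how β shifts the score:

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """General F-beta score; beta > 1 favors recall, beta < 1 favors precision.

    Assumed convention: return 0.0 when the denominator is zero.
    """
    beta_sq = beta ** 2
    denom = beta_sq * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denom


# Hypothetical model with high precision (0.9) and modest recall (0.5).
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(0.9, 0.5, beta), 3))
```

Because this model is stronger on precision than on recall, F0.5 rates it highest and F2 rates it lowest.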
Concrete examples
Evaluating a spam classifier
Evaluate the performance of my spam classifier by calculating the F1 Score. Here are the results: 85 true positives, 10 false positives, 15 false negatives, 890 true negatives. Explain whether this score is satisfactory for a spam filter.
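For reference, the counts given in this prompt can be worked through directly. This is only a sketch of the arithmetic; whether the resulting score is satisfactory depends on how costly false positives and false negatives are for the filter in question.

```python
# Confusion-matrix counts taken from the prompt above.
true_positives = 85
false_positives = 10
false_negatives = 15

precision = true_positives / (true_positives + false_positives)  # 85 / 95
recall = true_positives / (true_positives + false_negatives)     # 85 / 100
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3))  # 0.895
print(round(recall, 3))     # 0.85
print(round(f1, 3))         # 0.872
```

Note that the 890 true negatives do not enter the calculation: the F1 Score only looks at how the positive class is handled.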
Named entity extraction with an LLM
Extract all companies mentioned in this text. I will compare your response with a reference list to calculate the F1 Score. Be exhaustive (good recall) while avoiding false positives (good precision).
Comparing prompts for a classification task
I tested three prompt variants for classifying customer reviews as positive, negative, or neutral. Here are the macro F1 Scores obtained: Prompt A = 0.72, Prompt B = 0.81, Prompt C = 0.78. Analyze these results and suggest ways to improve the best-performing prompt.
Practical usage
In prompt engineering, the F1 Score serves to objectively compare different prompt formulations on measurable tasks such as classification or information extraction. To use it, prepare a test set with expected answers, run your prompt on each example, then calculate precision, recall, and F1. Prefer macro F1 if all classes are equally important, or weighted F1 if each class should count in proportion to its frequency in the test set.
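The workflow described above can be sketched as a small evaluation loop. Everything named here is hypothetical: `stub_classify` is a keyword rule standing in for a real call to your model with a given prompt, and the four-item test set is only for illustration.

```python
from typing import Callable


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def evaluate_prompt(classify: Callable[[str], str], test_set: list[tuple[str, str]]) -> float:
    """Run a classifier over (text, expected_label) pairs and return macro F1."""
    y_true = [label for _, label in test_set]
    y_pred = [classify(text) for text, _ in test_set]
    return macro_f1(y_true, y_pred)


def stub_classify(text: str) -> str:
    """Hypothetical stand-in for an LLM call made with one prompt variant."""
    if "great" in text:
        return "positive"
    if "terrible" in text:
        return "negative"
    return "neutral"


test_set = [
    ("great product", "positive"),
    ("terrible service", "negative"),
    ("it arrived on time", "neutral"),
    ("great value, terrible packaging", "negative"),
]
score = evaluate_prompt(stub_classify, test_set)
print(round(score, 3))
```

To compare prompt variants, you would call `evaluate_prompt` once per variant on the same test set and keep the one with the highest score, ideally on a test set much larger than this one.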