Confusion Matrix: Definition and Examples

A confusion matrix is a table that summarizes the performance of a classification model by comparing the model's predictions to the actual values, detailing true positives, true negatives, false positives, and false negatives.

Full definition

The confusion matrix (also called an error matrix) is a fundamental tool in machine learning for evaluating the quality of a classification model. It is a special kind of contingency table: a square table where each row represents instances of an actual class and each column represents instances of a predicted class (or vice versa, depending on convention). For binary classification, it contains four key values: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

True positives are cases correctly identified as positive, while true negatives are cases correctly identified as negative. False positives (also called type I errors) are negative cases incorrectly classified as positive, and false negatives (type II errors) are positive cases incorrectly classified as negative. This breakdown allows you to understand not only how many errors the model makes, but more importantly what type of errors it makes.
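As a minimal sketch of the four counts described above, the following Python tallies TP, TN, FP, and FN from two lists of labels. The function name `binary_counts`, the `positive` parameter, and the sample labels are illustrative assumptions, not part of any particular library.

```python
def binary_counts(actual, predicted, positive="spam"):
    """Count TP, TN, FP, FN, treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # correctly identified as positive
        elif p == positive:
            fp += 1  # negative case predicted positive (type I error)
        elif a == positive:
            fn += 1  # positive case predicted negative (type II error)
        else:
            tn += 1  # correctly identified as negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical labels, for illustration only.
actual = ["spam", "spam", "legitimate", "legitimate", "spam", "legitimate"]
predicted = ["spam", "legitimate", "legitimate", "spam", "spam", "legitimate"]

counts = binary_counts(actual, predicted)
print(counts)  # {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```

Here one spam message slips through (a false negative) and one legitimate email is flagged (a false positive), which is exactly the distinction a single error count would hide.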

From the confusion matrix, many essential metrics can be derived: precision (proportion of true positives among positive predictions), recall or sensitivity (proportion of true positives among actually positive cases), specificity (proportion of true negatives among actually negative cases), F1-score (harmonic mean of precision and recall), and overall accuracy (proportion of all predictions that are correct). Each of these metrics illuminates a different aspect of model performance.
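The metrics listed above follow directly from the four counts. The sketch below uses their standard definitions; the function name `metrics`, the example counts, and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
def metrics(tp, tn, fp, fn):
    """Derive common metrics from binary confusion-matrix counts."""

    def ratio(num, den):
        # Assumed convention: report 0.0 when the metric is undefined.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)      # TP among positive predictions
    recall = ratio(tp, tp + fn)         # TP among actual positives (sensitivity)
    specificity = ratio(tn, tn + fp)    # TN among actual negatives
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)  # harmonic mean
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for 100 classified items.
m = metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision (40/45) is higher than recall (40/50): the model is cautious about predicting the positive class and misses some positives as a result.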

In the context of prompt engineering, understanding the confusion matrix is crucial when working with LLMs for classification tasks (sentiment analysis, spam detection, text categorization). It helps identify whether the model tends to over-predict or under-predict certain categories, so that prompts can be adjusted to reduce a specific type of error.

Etymology

The term 'confusion matrix' gets its name from the fact that it allows you to see if a classification model 'confuses' certain classes with each other. The word 'matrix' refers to its tabular mathematical structure. The concept was introduced in the 1950s-1960s in the field of experimental psychology and signal detection theory, before being widely adopted in statistics and artificial intelligence.

Concrete examples

Evaluating a sentiment classifier

Here are the classification results for 100 customer reviews. Build a confusion matrix and calculate precision, recall, and F1-score for each class (positive, negative, neutral). Identify which class is most often confused with another.

Optimizing a spam detection prompt

You are a spam detector. Classify each email as 'spam' or 'legitimate'. Prioritize minimizing false positives (legitimate emails classified as spam) over false negatives, because a missed important email is more serious than an unfiltered spam message.

AI-assisted medical diagnosis

Analyze these screening results and generate the corresponding confusion matrix. Calculate the sensitivity and specificity of the test. Explain why, in a screening context, high recall is preferable to high precision.

Practical usage

In prompt engineering, the confusion matrix helps you evaluate and improve your classification prompts. After testing a prompt on a labeled dataset, build the matrix to identify systematic model confusions. Then adjust your prompt by adding specific instructions for ambiguous cases, or by providing few-shot examples targeting the most frequent errors.
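One way to find the systematic confusions mentioned above is to rank the (actual, predicted) error pairs by frequency after running a prompt on a labeled test set. This is a sketch under assumptions: `top_confusions` is a hypothetical helper, and the labels stand in for real model outputs.

```python
from collections import Counter


def top_confusions(actual, predicted, n=3):
    """Return the n most frequent (actual, predicted) pairs that are errors."""
    errors = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    return errors.most_common(n)


# Hypothetical labeled test set and the labels a prompt returned for it.
actual = ["neutral", "neutral", "neutral", "positive", "negative", "positive"]
predicted = ["negative", "negative", "negative", "neutral", "negative", "positive"]

worst = top_confusions(actual, predicted, n=2)
print(worst)  # [(('neutral', 'negative'), 3), (('positive', 'neutral'), 1)]
```

In this made-up run, neutral reviews are most often read as negative, so the next prompt revision might add few-shot examples of neutral reviews that mention a minor complaint.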

Related concepts

Precision, Recall, F1-Score, Accuracy, ROC Curve, AUC, Binary classification, Type I and II error

FAQ

What is the difference between the confusion matrix and accuracy?

Accuracy is a single metric that indicates the overall percentage of correct predictions, while the confusion matrix details the complete breakdown of predictions by class. Accuracy can be misleading with imbalanced classes: a model that always predicts the majority class will have high accuracy but be useless. The confusion matrix reveals this issue by showing that the minority class is never correctly identified.
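The imbalance problem can be checked with a few lines of Python. The 95/5 class split below is an invented example, and the "model" simply predicts the majority class every time.

```python
# Hypothetical imbalanced dataset: 95 legitimate emails, 5 spam.
actual = ["legitimate"] * 95 + ["spam"] * 5
# A degenerate model that always predicts the majority class.
predicted = ["legitimate"] * 100

pairs = list(zip(actual, predicted))
accuracy = sum(a == p for a, p in pairs) / len(pairs)

# Recall for the minority class: spam correctly found / all actual spam.
spam_found = sum(a == "spam" and p == "spam" for a, p in pairs)
spam_recall = spam_found / actual.count("spam")

print(accuracy)     # 0.95
print(spam_recall)  # 0.0
```

Accuracy looks strong at 95%, yet the spam row of the confusion matrix shows zero correct predictions.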

How do you read a confusion matrix for a multiclass problem?

For an N-class problem, the matrix is an N×N table. The diagonal contains the correct predictions for each class. Off-diagonal values show confusions: with rows as actual classes and columns as predicted classes, the cell at row i and column j indicates how many instances of class i were predicted as belonging to class j. Rows with large off-diagonal counts indicate classes that the model struggles to recognize.
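A short sketch of this reading, assuming the rows-are-actual convention described above: `confusion_matrix` here is a hand-written helper (not a library function), and the eight labeled reviews are invented for illustration.

```python
def confusion_matrix(actual, predicted, labels):
    """Build an N x N matrix: rows are actual classes, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


labels = ["positive", "negative", "neutral"]
# Hypothetical sentiment labels, for illustration only.
actual = ["positive", "positive", "negative", "neutral",
          "neutral", "negative", "positive", "neutral"]
predicted = ["positive", "neutral", "negative", "neutral",
             "positive", "negative", "positive", "negative"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>8}: {row}")

# Per-class recall: diagonal cell divided by the row total.
recall = {
    label: (row[i] / sum(row) if sum(row) else 0.0)
    for i, (label, row) in enumerate(zip(labels, matrix))
}
print(recall)
```

The neutral row is [1, 1, 1]: only one of three neutral reviews sits on the diagonal, so neutral is the class this hypothetical model struggles with most.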

When should precision be favored over recall?

Favor precision when the cost of false positives is high (e.g., spam filtering where an important email classified as spam is very disruptive). Favor recall when the cost of false negatives is high (e.g., disease detection where a missed case can be fatal). In prompt engineering, you can steer the model toward one or the other by adjusting instructions: 'when in doubt, classify as positive' favors recall, while 'classify as positive only when you are certain' favors precision.
