Data Augmentation: Definition and Examples

Data augmentation is a technique that consists of artificially enriching a training dataset by creating variations of existing data, in order to improve the performance and robustness of AI models.

Full definition

Data augmentation is a fundamental strategy in machine learning that multiplies the size and diversity of a dataset without having to collect new real data. It relies on applying controlled transformations to existing data to generate new plausible training examples.

In computer vision, this can include rotations, crops, horizontal flips, brightness adjustments, or adding noise to images. In natural language processing (NLP), techniques include paraphrasing, synonym substitution, back-translation, or synthetic text generation using large language models.
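As a rough illustration of the image transformations listed above, here is a minimal sketch in plain Python. It assumes a grayscale image stored as nested lists of 0-255 pixel values; the function names and the parameter ranges (a brightness shift of ±30, a noise amplitude of 5) are illustrative choices, not recommendations. Real projects would normally rely on an image library rather than hand-written loops.

```python
import random

Image = list[list[int]]  # grayscale image as rows of 0-255 pixel values


def horizontal_flip(image: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def adjust_brightness(image: Image, delta: int) -> Image:
    """Add delta to every pixel, clamping to the valid 0-255 range."""
    return [[max(0, min(255, p + delta)) for p in row] for row in image]


def add_noise(image: Image, amplitude: int, rng: random.Random) -> Image:
    """Shift each pixel by a random integer in [-amplitude, amplitude]."""
    return [
        [max(0, min(255, p + rng.randint(-amplitude, amplitude))) for p in row]
        for row in image
    ]


def augment(image: Image, rng: random.Random) -> Image:
    """Apply a random combination of the transformations above."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    image = adjust_brightness(image, rng.randint(-30, 30))  # assumed range
    return add_noise(image, amplitude=5, rng=rng)  # assumed amplitude


original = [[0, 64, 128], [32, 96, 255]]
rng = random.Random(0)  # fixed seed so runs are repeatable
variants = [augment(original, rng) for _ in range(4)]
```

Each call to `augment` returns a new image and leaves `original` untouched, so one labeled example can yield several training variants that share its label.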

The main goal is to reduce overfitting by exposing the model to a greater variety of situations during training. A model trained on augmented data learns to be more robust to variations it will encounter in production, thus improving its generalization ability.

In prompt engineering, the concept of data augmentation applies indirectly: one can use an LLM to generate prompt variations, create synthetic datasets for fine-tuning, or produce diverse examples for few-shot learning. This approach is particularly useful when annotated data is scarce or expensive to obtain.

Etymology

The term combines the English words "data" and "augmentation" (from Latin augmentatio, the action of increasing). It became common in deep learning literature in the early 2010s, notably with the rise of convolutional neural networks (CNNs) applied to image classification, where the lack of labeled data was a major obstacle.

Concrete examples

Generating synthetic data to train a text classifier

You are an expert in data generation. From this sentence: "The customer service was very responsive", generate 10 varied paraphrases that retain the same positive sentiment, varying vocabulary, structure, and register of language.

Creating diverse examples for few-shot learning

I am building an email classification system. Here is an example of a complaint email: [EXAMPLE]. Generate 5 realistic variations of customer complaints with different subjects (delivery, billing, product quality, customer service, refund) while keeping a similar tone.

Augmenting a question-answer dataset for a chatbot

Here is a FAQ with 20 question-answer pairs about our product. For each question, generate 3 natural reformulations that real users might ask, including colloquial expressions, common mistakes, and regional variations of French.

Practical usage

In prompt engineering, data augmentation lets you quickly generate synthetic datasets to train or evaluate models. Use an LLM to create paraphrases, translations, or stylistic variations of your existing data. The technique is particularly cost-effective for projects with limited annotated data, since it can substantially reduce the cost and time of building a dataset.
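The workflow described here could be sketched as follows. This is an assumption-laden outline rather than a reference implementation: `llm` is a hypothetical callable that takes a prompt string and returns the model's text reply (plug in whichever client you actually use), the prompt wording is adapted from the paraphrase example above, and the parser assumes the model answers with one paraphrase per line.

```python
import re
from typing import Callable

PROMPT_TEMPLATE = (
    'From this sentence: "{sentence}", generate {n} varied paraphrases that '
    "retain the same sentiment. Return one paraphrase per line."
)

# Matches a leading bullet ("-", "*") or numbering ("1.", "2)") on a line.
LIST_MARKER = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")


def build_prompt(sentence: str, n: int) -> str:
    """Fill the template for one source sentence."""
    return PROMPT_TEMPLATE.format(sentence=sentence, n=n)


def parse_paraphrases(response: str) -> list[str]:
    """Split a line-per-item reply, dropping blank lines and list markers."""
    lines = (LIST_MARKER.sub("", line).strip() for line in response.splitlines())
    return [line for line in lines if line]


def augment_dataset(
    rows: list[tuple[str, str]], llm: Callable[[str], str], n: int = 3
) -> list[tuple[str, str]]:
    """Return the original (text, label) rows plus paraphrases with the same label."""
    augmented = []
    for text, label in rows:
        augmented.append((text, label))
        for paraphrase in parse_paraphrases(llm(build_prompt(text, n))):
            augmented.append((paraphrase, label))
    return augmented


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "1. Support answered really quickly\n2. The help desk was fast to reply\n"


rows = [("The customer service was very responsive", "positive")]
augmented_rows = augment_dataset(rows, fake_llm, n=2)
```

Passing the model as a plain callable keeps the sketch independent of any particular provider and makes it easy to test with a stub such as `fake_llm`.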

Related concepts

Fine-tuning, Few-shot learning, Overfitting, Transfer learning

FAQ

What is the difference between data augmentation and synthetic data?

Data augmentation transforms existing data by applying modifications (rotation, paraphrase, noise), while synthetic data generation creates entirely new examples from scratch, often using generative models. In practice, both approaches are complementary and aim for the same goal: enriching the training dataset.
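To make the "transform existing data" side of this distinction concrete for text, here is a minimal sketch of synonym substitution. The `SYNONYMS` lexicon is a hypothetical toy dictionary invented for this example, and the word matching is deliberately naive (whitespace tokens, no handling of punctuation or capitalization in the replacement). Every output is derived from an existing sentence rather than created from scratch.

```python
import random

# Hypothetical toy lexicon; a real project would use a curated resource.
SYNONYMS = {
    "responsive": ["quick", "attentive"],
    "very": ["really", "extremely"],
    "customer": ["client"],
}


def synonym_substitute(sentence: str, rng: random.Random, rate: float = 0.5) -> str:
    """Replace each known word with a random synonym with probability `rate`."""
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


rng = random.Random(42)  # fixed seed so runs are repeatable
source = "The customer service was very responsive"
variants = [synonym_substitute(source, rng) for _ in range(3)]
```

Words that are not in the lexicon pass through unchanged, so the sentence structure and length in words stay the same across variants.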

Can data augmentation degrade a model's performance?

Yes. If the applied transformations are too aggressive or irrelevant, they can introduce noise or unrealistic examples that disrupt learning. For example, flipping an image of text vertically produces an unnatural example. It is essential to choose augmentations consistent with the application domain and to measure their impact on a validation set.
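One simple way to check that impact, sketched under stated assumptions: `evaluate` is a hypothetical function you would supply that trains a model with the given list of augmentations and returns its score on a held-out validation set. The selection below keeps only the augmentations that beat the no-augmentation baseline. It tests each candidate in isolation, so it says nothing about how augmentations interact when combined.

```python
from typing import Callable, Sequence


def select_augmentations(
    candidates: Sequence[str],
    evaluate: Callable[[list[str]], float],
    min_gain: float = 0.0,
) -> list[str]:
    """Keep candidates whose validation score exceeds the baseline by more than min_gain."""
    baseline = evaluate([])  # score with no augmentation at all
    kept = []
    for name in candidates:
        if evaluate([name]) - baseline > min_gain:
            kept.append(name)
    return kept


# Made-up scores standing in for real training runs (illustration only).
FAKE_SCORES = {
    (): 0.80,
    ("horizontal_flip",): 0.83,
    ("vertical_flip",): 0.71,
    ("noise",): 0.80,
}


def fake_evaluate(augmentations: list[str]) -> float:
    """Stub for the hypothetical train-then-validate step."""
    return FAKE_SCORES[tuple(augmentations)]


selected = select_augmentations(
    ["horizontal_flip", "vertical_flip", "noise"], fake_evaluate
)
```

With these made-up scores, only the augmentation that improves on the baseline is retained; the one that hurts and the one that changes nothing are both dropped.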

How can you use an LLM like ChatGPT or Claude for data augmentation?

You can ask the LLM to paraphrase texts, translate and back-translate sentences, generate stylistic variations, or create examples in different contexts. The trick is to provide precise instructions on the desired type of variation (register of language, length, domain) and always check the quality of the generated data before using it for training.

