Data Augmentation: Definition and Examples
Data augmentation is a technique that artificially enriches a training dataset by creating variations of existing data, in order to improve the performance and robustness of AI models.
Full definition
Data augmentation is a fundamental strategy in machine learning that increases the size and diversity of a dataset without collecting new real-world data. It applies controlled transformations to existing data to generate new, plausible training examples.
In computer vision, these transformations include rotations, crops, horizontal flips, brightness adjustments, and added noise. In natural language processing (NLP), techniques include paraphrasing, synonym substitution, back-translation, and synthetic text generation using large language models.
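As a rough illustration of the image transformations listed above, here is a minimal sketch in plain Python. It assumes a grayscale image stored as a list of rows of 0-255 integers; the function names and the parameter ranges (a brightness shift of up to 30, noise of up to 5) are illustrative choices, not values from any particular library.

```python
import random


def horizontal_flip(image):
    """Mirror each row of a 2-D grayscale image (a list of rows)."""
    return [row[::-1] for row in image]


def adjust_brightness(image, delta):
    """Add delta to every pixel, clamping the result to 0-255."""
    return [[max(0, min(255, p + delta)) for p in row] for row in image]


def add_noise(image, amplitude, rng):
    """Perturb every pixel by a random integer in [-amplitude, amplitude]."""
    return [
        [max(0, min(255, p + rng.randint(-amplitude, amplitude))) for p in row]
        for row in image
    ]


def augment(image, rng):
    """Return a randomly transformed copy; the input is left unchanged."""
    out = [row[:] for row in image]
    if rng.random() < 0.5:  # flip about half of the time
        out = horizontal_flip(out)
    out = adjust_brightness(out, rng.randint(-30, 30))  # assumed range
    return add_noise(out, 5, rng)  # assumed noise amplitude


# Toy 2x3 image; a seeded generator keeps the run reproducible.
image = [[0, 64, 128], [255, 192, 32]]
rng = random.Random(0)
variants = [augment(image, rng) for _ in range(4)]
```

Each call to `augment` yields a new training example with the same label as the source image. In practice, an image library would usually handle these operations on real image arrays.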
The main goal is to reduce overfitting by exposing the model to a greater variety of situations during training. A model trained on augmented data learns to be more robust to variations it will encounter in production, thus improving its generalization ability.
In prompt engineering, the concept of data augmentation applies indirectly: one can use an LLM to generate prompt variations, create synthetic datasets for fine-tuning, or produce diverse examples for few-shot learning. This approach is particularly useful when annotated data is scarce or expensive to obtain.
Etymology
The term combines the English words "data" and "augmentation" (from Latin augmentatio, the action of increasing). It became common in deep learning literature in the early 2010s, notably with the rise of convolutional neural networks (CNNs) applied to image classification, where the lack of labeled data was a major obstacle.
Concrete examples
Generating synthetic data to train a text classifier
You are an expert in data generation. From this sentence: "The customer service was very responsive", generate 10 varied paraphrases that retain the same positive sentiment, varying vocabulary, structure, and register.
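To turn a prompt like this into training data, one possible workflow is to fill the prompt from a template, send it to a model, and parse the reply into labeled examples. The sketch below covers only the template and parsing steps. The helper names are hypothetical, and so is the instruction to return a numbered list, which is added here as an assumption to make parsing easier. The model reply is hard-coded so the example runs offline.

```python
import re


def build_paraphrase_prompt(sentence, n=10, sentiment="positive"):
    """Fill the paraphrase prompt; the numbered-list request is an assumption."""
    return (
        "You are an expert in data generation. "
        f'From this sentence: "{sentence}", generate {n} varied paraphrases '
        f"that retain the same {sentiment} sentiment, varying vocabulary, "
        "structure, and register. Return one paraphrase per line, numbered."
    )


def parse_numbered_lines(response):
    """Extract items from lines such as '1. text' or '2) text'."""
    items = []
    for line in response.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            items.append(match.group(1).strip())
    return items


def to_labeled_examples(paraphrases, label):
    """Attach the source sentence's label to every paraphrase."""
    return [{"text": text, "label": label} for text in paraphrases]


prompt = build_paraphrase_prompt("The customer service was very responsive")

# Hypothetical model reply, hard-coded for illustration only.
sample_response = (
    "1. Support replied to me really fast.\n"
    "2) The help desk answered quickly.\n"
    "Thanks!"
)
examples = to_labeled_examples(parse_numbered_lines(sample_response), "positive")
```

Lines that do not look like numbered items (such as the trailing "Thanks!") are ignored, so stray model commentary is kept out of the dataset.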
Creating diverse examples for few-shot learning
I am building an email classification system. Here is an example of a complaint email: [EXAMPLE]. Generate 5 realistic variations of customer complaints with different subjects (delivery, billing, product quality, customer service, refund) while keeping a similar tone.
Augmenting a question-answer dataset for a chatbot
Here is a FAQ with 20 question-answer pairs about our product. For each question, generate 3 natural reformulations that real users might ask, including colloquial expressions, common mistakes, and regional variations of French.
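Once the reformulations exist, they still need to be merged back into the dataset so that each new question keeps the answer of the original. The following is a minimal sketch of that step. It assumes the FAQ is a list of (question, answer) tuples and that the generated reformulations have already been collected into a dictionary keyed by the original question. Both structures and the function name `expand_faq` are hypothetical.

```python
def expand_faq(faq, reformulations):
    """Pair every original question and its reformulations with the same answer.

    Variants that match an earlier one, ignoring case and spacing, are skipped.
    """
    augmented = []
    for question, answer in faq:
        seen = set()
        for variant in [question] + reformulations.get(question, []):
            key = " ".join(variant.lower().split())  # normalize case and spaces
            if key not in seen:
                seen.add(key)
                augmented.append({"question": variant, "answer": answer})
    return augmented


# Toy data for illustration; real reformulations would come from the model.
faq = [("How do I reset my password?", "Use the 'Forgot password' link.")]
reformulations = {
    "How do I reset my password?": [
        "i forgot my pasword, what now?",
        "how do I reset my  password?",  # duplicate of the original
        "Can't log in, how can I change my password?",
    ]
}
dataset = expand_faq(faq, reformulations)
```

With the toy data above, the dataset ends up with three entries: the original question and the two distinct reformulations, all sharing one answer.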
Practical usage
In prompt engineering, data augmentation makes it possible to quickly generate synthetic datasets for training or evaluating models. Use an LLM to create paraphrases, translations, or stylistic variations of your existing data. The technique is particularly useful for projects with limited annotated data, because it reduces the cost and time of building a dataset.
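Generated variations can overlap heavily with each other, so a filtering pass before training is often worthwhile. The sketch below shows one simple option, assuming that word-level Jaccard similarity is an acceptable proxy for "too similar". The 0.8 threshold and the function names are illustrative assumptions, not recommended values.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Share of distinct words the two texts have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def drop_near_duplicates(candidates, threshold=0.8):
    """Keep a candidate only if it stays below the threshold against all kept ones."""
    kept = []
    for candidate in candidates:
        if all(jaccard(candidate, k) < threshold for k in kept):
            kept.append(candidate)
    return kept


candidates = [
    "The customer service was very responsive",
    "The customer service was very responsive!",  # same words, dropped
    "Support answered my request within minutes",
]
filtered = drop_near_duplicates(candidates)
```

This greedy filter compares each candidate with everything already kept, which is fine for small batches. Larger datasets would likely need a more scalable similarity method.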
More definitions
Datasheets For Datasets: Definition and Examples
Methodology proposing systematic documentation of datasets used in artificial intelligence, akin to technical datasheets accompanying electronic components.
Deepfake: Definition and Examples
Synthetic content (video, audio, or image) generated by artificial intelligence, capable of realistically reproducing the appearance, voice, or expressions
Dialogue System: Definition and Examples
A dialogue system is a computer program designed to converse with a human user in natural language, whether spoken or written.
Diffusion: Definition and Examples
Family of generative models that create data (images, audio, video) by learning to reverse a progressive noising process, transforming random noise into coherent content step by step.
Discriminative Model: Definition and Examples
A discriminative model is a type of machine learning model that learns to distinguish and classify data by directly modeling the bound
Dropout: Definition and Examples
Dropout is a regularization technique used during neural network training that randomly deactivates a fraction of neurons at each iteration to prevent overfitting.