Model Distillation: Definition and Examples
Model distillation is a compression technique where a smaller model (the student) learns to replicate the behavior of a larger, more performant model (the teacher), achieving close performance at lower computational cost.
Full definition
Model distillation is a knowledge transfer technique in which a compact model, called the "student," is trained to imitate the outputs of a larger, more performant model, called the "teacher." Rather than learning only from ground-truth labels, the student trains on the probability distributions produced by the teacher, which capture nuances that hard labels do not contain.
The intuition behind this approach rests on the concept of "dark knowledge" popularized by Geoffrey Hinton around 2015. When a large model predicts that a picture of a cat has a 90% chance of being a cat and an 8% chance of being a lynx, this relationship between classes contains rich information that the small model can exploit. Dividing the teacher's logits by a temperature parameter during training "softens" its output distributions and makes this latent information more accessible.
In the context of large language models (LLMs), distillation has become a major strategic issue. Models like GPT-4 or Claude are expensive to deploy. Distillation makes it possible to create lighter, specialized versions for specific tasks that retain a large part of the quality while substantially reducing inference costs and latency.
In prompt engineering, distillation takes a practical form: a powerful model is used to generate high-quality examples (synthetic data), then a smaller model is fine-tuned on these examples. Here the student learns from the teacher's generated text rather than from its full probability distributions. This approach broadens access to high performance and allows deploying effective AI solutions even with budget or infrastructure constraints.
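As a rough illustration of the temperature mechanism described above, the following sketch softens a teacher distribution and computes a distillation loss in plain Python. The logits, class names, and temperature values are made up for the example; the T-squared scaling of the loss follows the suggestion in the 2015 paper. Real training would use a deep learning framework and usually combine this term with a loss on the true labels.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical teacher logits for the classes (cat, lynx, dog).
teacher_logits = [5.0, 2.6, 0.5]

sharp = softmax(teacher_logits, temperature=1.0)  # about 0.91, 0.08, 0.01
soft = softmax(teacher_logits, temperature=4.0)   # lynx and dog gain probability mass

# An untrained student that scores every class equally.
loss = distillation_loss(teacher_logits, [0.0, 0.0, 0.0])
```

At temperature 1 the "cat" class dominates. At temperature 4 the same logits spread more probability onto "lynx" and "dog", which gives the student a stronger signal about how the teacher relates the classes.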
Etymology
The term "distillation" is borrowed from chemistry, where it refers to the process of purifying a liquid by evaporation followed by condensation. By analogy, model distillation "purifies" and concentrates the knowledge of a large model into a smaller container. The concept was formalized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their seminal paper "Distilling the Knowledge in a Neural Network" published in 2015.
Concrete examples
Creating a training dataset using a powerful model
You are a sentiment classification expert. For each customer review below, give the sentiment (positive, negative, neutral) and a confidence score between 0 and 1. Explain your reasoning in one sentence. These examples will be used to train a smaller model.
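The step from teacher answers to a training file might look like the sketch below. It assumes the teacher was additionally asked to reply with one JSON object per review containing "sentiment", "confidence", and "reasoning" fields; that reply format, the function name to_training_records, and the 0.8 confidence threshold are illustrative choices, not part of the prompt above.

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral"}

def to_training_records(reviews, teacher_outputs, min_confidence=0.8):
    """Pair each review with the teacher's label, keeping only valid, confident ones."""
    records = []
    for review, raw in zip(reviews, teacher_outputs):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed teacher output
        if not isinstance(parsed, dict):
            continue
        label = str(parsed.get("sentiment", "")).lower()
        confidence = parsed.get("confidence", 0.0)
        if (
            label in ALLOWED_LABELS
            and isinstance(confidence, (int, float))
            and confidence >= min_confidence
        ):
            records.append({"text": review, "label": label})
    return records

# Hypothetical reviews and teacher replies, for illustration only.
reviews = ["Great battery life.", "Arrived broken.", "It is a phone."]
teacher_outputs = [
    '{"sentiment": "positive", "confidence": 0.95, "reasoning": "Praises the battery."}',
    '{"sentiment": "negative", "confidence": 0.9, "reasoning": "Product was damaged."}',
    '{"sentiment": "neutral", "confidence": 0.55, "reasoning": "No clear opinion."}',
]

records = to_training_records(reviews, teacher_outputs)
jsonl = "\n".join(json.dumps(r) for r in records)  # one training example per line
```

Filtering on the teacher's stated confidence is one simple way to keep noisy labels out of the student's training set. The low-confidence third review is dropped here.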
Optimizing a production pipeline by replacing a large model
Analyze these 50 examples of summaries generated by GPT-4 and identify recurring patterns in style, length, and structure. I want to document these patterns to configure a lighter model that will produce similar summaries.
Evaluating the quality of a distilled model against the teacher
Compare these two responses to the same question: the first comes from the original model, the second from the distilled model. Rate each on a scale of 1 to 10 for relevance, completeness, and clarity. Identify significant gaps.
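Once a judge model has rated several teacher and student response pairs, the scores can be aggregated to show where the distilled model falls short. The sketch below assumes the ratings have already been parsed into dictionaries; the data structure, the function name score_gaps, and the scores themselves are hypothetical.

```python
from statistics import mean

CRITERIA = ("relevance", "completeness", "clarity")

def score_gaps(ratings):
    """Return the mean teacher-minus-student score gap for each criterion."""
    return {
        criterion: mean(r["teacher"][criterion] - r["student"][criterion] for r in ratings)
        for criterion in CRITERIA
    }

# Hypothetical judge ratings on a 1 to 10 scale for two questions.
ratings = [
    {
        "teacher": {"relevance": 9, "completeness": 9, "clarity": 8},
        "student": {"relevance": 8, "completeness": 7, "clarity": 8},
    },
    {
        "teacher": {"relevance": 8, "completeness": 9, "clarity": 9},
        "student": {"relevance": 8, "completeness": 6, "clarity": 8},
    },
]

gaps = score_gaps(ratings)
weakest = max(gaps, key=gaps.get)  # criterion where the student trails the most
```

In this made-up sample the largest gap is on completeness, which would suggest adding more complete teacher examples to the training set before the next fine-tuning round.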
Practical usage
In prompt engineering, distillation is applied by using an expensive model (like Claude Opus or GPT-4) to generate hundreds of high-quality examples on a specific task, then fine-tuning a lighter model (like Haiku or GPT-4o mini) on these examples. This approach can reduce inference costs substantially (reductions of 10-50x are often cited) while retaining most of the teacher's quality on the target task (roughly 80-95%), although actual results depend on the task, the models, and the training data. It is best suited to high-volume production applications.
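Whether the cost reduction described above pays off depends on request volume and on the one-time cost of generating examples and fine-tuning. The following back-of-the-envelope sketch uses entirely hypothetical prices and volumes; substitute real figures from your provider's price list.

```python
def monthly_cost(requests_per_month, tokens_per_request, price_per_million_tokens):
    """Inference cost per month in the same currency as the token price."""
    return requests_per_month * tokens_per_request * price_per_million_tokens / 1_000_000

def break_even_months(one_time_cost, teacher_monthly, student_monthly):
    """Months needed for the savings to cover the distillation effort, or None."""
    savings = teacher_monthly - student_monthly
    if savings <= 0:
        return None  # the student is not cheaper, so there is no break-even point
    return one_time_cost / savings

# All numbers below are assumptions chosen for illustration.
teacher_monthly = monthly_cost(1_000_000, 500, 15.0)  # 7500.0
student_monthly = monthly_cost(1_000_000, 500, 0.75)  # 375.0
cost_ratio = teacher_monthly / student_monthly        # 20.0
months = break_even_months(2_000.0, teacher_monthly, student_monthly)
```

With these assumed figures the student is 20 times cheaper per month and the one-time effort is recovered in well under a month. At low volumes the same calculation can show that distillation is not worth the setup cost.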
FAQ
What is the difference between distillation and fine-tuning?
Is model distillation legal and allowed by AI providers?
What are the limitations of distillation?