Model Distillation: Definition and Examples

Model distillation is a compression technique where a smaller model (the student) learns to replicate the behavior of a larger, more performant model (the teacher), achieving close performance at lower computational cost.

Full definition

Model distillation is a transfer learning technique in which a compact model, called the "student," is trained to imitate the outputs of a larger, more performant model, called the "teacher." Rather than learning only from hard labels, the student trains on the probability distributions produced by the teacher, which capture nuances that hard labels do not contain.

The intuition behind this approach rests on the concept of "dark knowledge," a term popularized by Geoffrey Hinton around 2015. When a large model predicts that a picture of a cat has a 90% chance of being a cat and an 8% chance of being a lynx, this relationship between classes contains rich information that the small model can exploit. Raising a temperature parameter in the softmax during training "softens" the teacher's output distributions, which makes this latent information more accessible to the student.

In the context of large language models (LLMs), distillation has become a major strategic issue. Models like GPT-4 or Claude are expensive to run at scale. Distillation makes it possible to create lighter, specialized versions for specific tasks that retain a large part of the quality while reducing inference cost and latency.

In prompt engineering, distillation takes a practical form: a powerful model is used to generate high-quality examples (synthetic data), then a smaller model is fine-tuned on these examples. Here the student learns from the teacher's generated text rather than from its probability distributions. This approach broadens access to high performance and makes it possible to deploy effective AI solutions even under budget or infrastructure constraints.
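The effect of temperature on the teacher's distribution, and the way a student can be penalized for diverging from it, can be illustrated with a minimal sketch. It assumes a three-class problem (cat, lynx, dog) with made-up logits, and it uses a KL divergence between the softened distributions scaled by the square of the temperature, which is one common way to write the soft-target loss. The function names and values are illustrative and do not come from any particular library.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a temperature above 1 flattens the result."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft-target loss: KL between softened teacher and student outputs."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl_divergence(p_teacher, p_student)

# Hypothetical logits for the classes (cat, lynx, dog).
teacher_logits = [5.0, 2.6, 1.0]
student_logits = [3.0, 0.5, 1.5]

print("T=1:", softmax(teacher_logits, 1.0))  # sharply peaked on "cat"
print("T=4:", softmax(teacher_logits, 4.0))  # softer, "lynx" becomes visible
print("loss:", distillation_loss(teacher_logits, student_logits))
```

In practice this soft-target term is usually combined with an ordinary cross-entropy loss on the true labels, and the weighting between the two is a tuning choice.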

Etymology

The term "distillation" is borrowed from chemistry, where it refers to the process of purifying a liquid by evaporation followed by condensation. By analogy, model distillation "purifies" and concentrates the knowledge of a large model into a smaller container. The concept was formalized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their seminal paper "Distilling the Knowledge in a Neural Network" published in 2015.

Concrete examples

Creating a training dataset using a powerful model

You are a sentiment classification expert. For each customer review below, give the sentiment (positive, negative, neutral) and a confidence score between 0 and 1. Explain your reasoning in one sentence. These examples will be used to train a smaller model.
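Once the teacher has answered a prompt like this one, its annotations still need to be validated and stored before they can serve as training data. The sketch below shows one possible way to do that. It assumes the teacher's reply has already been parsed into a dictionary with "sentiment", "confidence", and "reasoning" keys, and `fake_teacher` is a hypothetical stand-in for a real model call. The 0.7 confidence cutoff is an arbitrary illustration.

```python
import json

LABELS = {"positive", "negative", "neutral"}

def build_record(review, teacher_output):
    """Validate one teacher annotation and pair it with its input review."""
    label = teacher_output["sentiment"].strip().lower()
    confidence = float(teacher_output["confidence"])
    if label not in LABELS or not 0.0 <= confidence <= 1.0:
        return None  # drop malformed annotations
    return {
        "text": review,
        "label": label,
        "confidence": confidence,
        "reasoning": teacher_output.get("reasoning", ""),
    }

def build_dataset(reviews, teacher, min_confidence=0.7):
    """Annotate each review with the teacher and keep confident, valid records."""
    records = []
    for review in reviews:
        record = build_record(review, teacher(review))
        if record is not None and record["confidence"] >= min_confidence:
            records.append(record)
    return records

def fake_teacher(review):
    """Hypothetical stand-in for a call to a large model."""
    text = review.lower()
    if "love" in text:
        return {"sentiment": "Positive", "confidence": 0.95, "reasoning": "Explicit praise."}
    if "broke" in text:
        return {"sentiment": "negative", "confidence": 0.9, "reasoning": "Reports a defect."}
    return {"sentiment": "neutral", "confidence": 0.5, "reasoning": "No clear opinion."}

reviews = ["I love this blender.", "It broke after two days.", "It arrived on Tuesday."]
dataset = build_dataset(reviews, fake_teacher)

# One JSON object per line is a common layout for fine-tuning data.
jsonl = "\n".join(json.dumps(record) for record in dataset)
print(jsonl)
```

Filtering on the teacher's self-reported confidence is only one possible quality gate; spot-checking a sample by hand is a useful complement.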

Optimizing a production pipeline by replacing a large model

Analyze these 50 examples of summaries generated by GPT-4 and identify recurring patterns in style, length, and structure. I want to document these patterns to configure a lighter model that will produce similar summaries.

Evaluating the quality of a distilled model against the teacher

Compare these two responses to the same question: the first comes from the original model, the second from the distilled model. Rate each on a scale of 1 to 10 for relevance, completeness, and clarity. Identify significant gaps.
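The ratings returned by such a comparison can be aggregated to see where the distilled model falls behind. The following sketch assumes the scores have already been collected into dictionaries; the sample ratings and the 1-point threshold for a "significant" gap are hypothetical.

```python
from statistics import mean

CRITERIA = ("relevance", "completeness", "clarity")

def score_gaps(ratings):
    """Average teacher-minus-student score for each criterion."""
    return {
        criterion: mean(r["teacher"][criterion] - r["student"][criterion] for r in ratings)
        for criterion in CRITERIA
    }

def significant_gaps(gaps, threshold=1.0):
    """Criteria where the student trails the teacher by at least `threshold` points."""
    return [criterion for criterion, gap in gaps.items() if gap >= threshold]

# Hypothetical ratings for two questions, each scored from 1 to 10.
ratings = [
    {"teacher": {"relevance": 9, "completeness": 9, "clarity": 8},
     "student": {"relevance": 8, "completeness": 7, "clarity": 8}},
    {"teacher": {"relevance": 8, "completeness": 9, "clarity": 9},
     "student": {"relevance": 8, "completeness": 6, "clarity": 9}},
]

gaps = score_gaps(ratings)
print(gaps)
print(significant_gaps(gaps))
```

With these sample numbers, only completeness crosses the threshold, which would suggest adding more complete teacher examples to the training set.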

Practical usage

In prompt engineering, distillation is applied by using an expensive model (like Claude Opus or GPT-4) to generate hundreds of high-quality examples on a specific task, then fine-tuning a lighter model (like Haiku or GPT-4o mini) on these examples. For a narrow, well-defined task, this approach can reduce inference costs by roughly 10-50x while retaining around 80-95% of the teacher's quality, although the actual figures depend on the task, the models, and how quality is measured. It is best suited to high-volume production applications.
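The cost side of this trade-off is simple arithmetic. The sketch below uses entirely hypothetical prices and traffic figures, chosen only to show how the ratio is computed; real prices vary by provider and change over time.

```python
def monthly_cost(requests_per_month, tokens_per_request, price_per_million_tokens):
    """Inference cost for one month, in the same currency as the price."""
    total_tokens = requests_per_month * tokens_per_request
    return total_tokens * price_per_million_tokens / 1_000_000

# Hypothetical figures, not actual provider pricing.
REQUESTS_PER_MONTH = 2_000_000
TOKENS_PER_REQUEST = 500
TEACHER_PRICE = 15.00  # per million tokens (assumed)
STUDENT_PRICE = 0.60   # per million tokens (assumed)

teacher_cost = monthly_cost(REQUESTS_PER_MONTH, TOKENS_PER_REQUEST, TEACHER_PRICE)
student_cost = monthly_cost(REQUESTS_PER_MONTH, TOKENS_PER_REQUEST, STUDENT_PRICE)
cost_ratio = teacher_cost / student_cost

print(f"teacher: {teacher_cost:.2f}, student: {student_cost:.2f}, ratio: {cost_ratio:.1f}x")
```

A fuller estimate would also count the one-time cost of generating the examples with the teacher and fine-tuning the student.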

Related concepts

Fine-Tuning, Transfer Learning, Quantization, Knowledge Distillation, Few-Shot Learning, Synthetic Data

FAQ

What is the difference between distillation and fine-tuning?
Fine-tuning continues training a pretrained model on task-specific data, which is often human-labeled, while distillation uses the outputs of a teacher model as the training signal. Those outputs can carry richer information (probability distributions, intermediate reasoning) than simple labels. The two techniques are often combined: data is generated with a large model (distillation), then a small model is fine-tuned on it.
Is model distillation legal and allowed by AI providers?
It depends on each provider's terms of use. OpenAI's terms prohibit using outputs of its models to develop competing models. Anthropic and others have similar policies. It is essential to check the terms of service before any distillation. On the other hand, distilling from open-weight models (Llama, Mistral) is generally allowed provided their licenses are respected; note that some of these licenses also place conditions on how model outputs may be used.
What are the limitations of distillation?
The student model rarely surpasses the teacher's performance on distilled tasks, so the teacher's quality acts as a practical ceiling. Distillation works best for specific, well-defined tasks; general capabilities and complex reasoning are harder to transfer. Moreover, if the teacher has biases or systematic errors, the distilled model tends to reproduce them.
