Model Distillation: Definition and Examples

Model distillation is a compression technique where a smaller model (the student) learns to replicate the behavior of a larger, more performant model (the teacher), achieving close performance at lower computational cost.

Full definition

Model distillation is a transfer learning technique in which a compact model, called the "student," is trained to imitate the outputs of a larger, more performant model, called the "teacher." Rather than learning only from raw labeled data, the student trains on the probability distributions produced by the teacher, which carry nuances that hard labels do not contain.

The intuition behind this approach rests on the concept of "dark knowledge" introduced by Geoffrey Hinton in 2015. When a large model predicts that a picture of a cat has a 90% chance of being a cat and an 8% chance of being a lynx, this relationship between classes contains rich information that the small model can exploit. Dividing the logits by a temperature parameter greater than 1 during training "softens" the teacher's output distributions and makes this latent information more accessible.

In the context of large language models (LLMs), distillation has become a major strategic issue. Models like GPT-4 or Claude are extremely expensive to deploy. Distillation makes it possible to create lighter, specialized versions for specific tasks that retain a large part of the quality while drastically reducing inference costs and latency.

In prompt engineering, distillation takes a practical form: a powerful model is used to generate high-quality examples (synthetic data), then a smaller model is fine-tuned on these examples. This approach broadens access to high performance and allows effective AI solutions to be deployed even under budget or infrastructure constraints.
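To make the role of temperature concrete, here is a minimal sketch in plain Python. It assumes a common formulation of the soft-target loss (KL divergence between temperature-softened teacher and student distributions, scaled by the square of the temperature); the logits and the three classes are hypothetical values chosen only for illustration, and a real training loop would use a deep learning framework rather than lists.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens the result."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft-target loss: KL between softened outputs, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)

# Hypothetical logits for the classes (cat, lynx, dog).
teacher_logits = [5.0, 2.6, 1.0]
student_logits = [3.0, 0.5, 0.4]

sharp = softmax(teacher_logits, temperature=1.0)  # close to a hard label
soft = softmax(teacher_logits, temperature=4.0)   # "lynx" gets more weight
loss = distillation_loss(teacher_logits, student_logits)

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
print(round(loss, 4))
```

At temperature 1 almost all of the probability mass sits on "cat"; at temperature 4 the "lynx" probability grows noticeably, which is the softened signal the student learns from. In practice this term is usually combined with an ordinary cross-entropy loss on the true labels.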

Etymology

The term "distillation" is borrowed from chemistry, where it refers to the process of purifying a liquid by evaporation followed by condensation. By analogy, model distillation "purifies" and concentrates the knowledge of a large model into a smaller container. The concept was formalized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their seminal paper "Distilling the Knowledge in a Neural Network" published in 2015.

Concrete examples

Creating a training dataset using a powerful model

You are a sentiment classification expert. For each customer review below, give the sentiment (positive, negative, neutral) and a confidence score between 0 and 1. Explain your reasoning in one sentence. These examples will be used to train a smaller model.
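Once the teacher has answered a prompt like this one, its answers still need to be filtered before they become training data. The sketch below assumes the teacher's replies have already been parsed into dictionaries; the field names (`review`, `sentiment`, `confidence`), the 0.8 threshold, and the sample reviews are all hypothetical.

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral"}

def build_training_records(teacher_outputs, min_confidence=0.8):
    """Keep only well-formed, high-confidence teacher labels."""
    records = []
    for item in teacher_outputs:
        label = str(item.get("sentiment", "")).strip().lower()
        confidence = item.get("confidence")
        if label not in ALLOWED_LABELS:
            continue  # skip labels outside the requested set
        if not isinstance(confidence, (int, float)) or confidence < min_confidence:
            continue  # skip missing or low-confidence labels
        records.append({"text": item["review"], "label": label})
    return records

def to_jsonl(records):
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Hypothetical teacher outputs, already parsed into dictionaries.
teacher_outputs = [
    {"review": "Arrived quickly and works perfectly.", "sentiment": "Positive", "confidence": 0.96},
    {"review": "It is fine, I guess.", "sentiment": "neutral", "confidence": 0.55},
    {"review": "Broke after two days.", "sentiment": "negative", "confidence": 0.91},
    {"review": "Meh.", "sentiment": "mixed", "confidence": 0.90},
]

records = build_training_records(teacher_outputs)
jsonl = to_jsonl(records)
print(jsonl)
```

Dropping low-confidence and malformed answers trades dataset size for label quality; the right threshold depends on the task and is worth tuning against a small human-checked sample.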

Optimizing a production pipeline by replacing a large model

Analyze these 50 examples of summaries generated by GPT-4 and identify recurring patterns in style, length, and structure. I want to document these patterns to configure a lighter model that will produce similar summaries.

Evaluating the quality of a distilled model against the teacher

Compare these two responses to the same question: the first comes from the original model, the second from the distilled model. Rate each on a scale of 1 to 10 for relevance, completeness, and clarity. Identify significant gaps.
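Ratings collected with a prompt like this are easier to act on once they are aggregated across many questions. The following sketch assumes the ratings have already been extracted into dictionaries; the data structure, the sample scores, and the one-point threshold for a "significant" gap are hypothetical choices.

```python
from statistics import mean

CRITERIA = ("relevance", "completeness", "clarity")

def score_gaps(evaluations):
    """Average teacher-minus-student score per criterion."""
    return {
        c: mean(e["teacher"][c] - e["student"][c] for e in evaluations)
        for c in CRITERIA
    }

def flag_significant(gaps, threshold=1.0):
    """List the criteria where the student trails by at least `threshold` points."""
    return [c for c, gap in gaps.items() if gap >= threshold]

# Hypothetical ratings for two questions, on a 1-10 scale.
evaluations = [
    {"teacher": {"relevance": 9, "completeness": 9, "clarity": 8},
     "student": {"relevance": 8, "completeness": 7, "clarity": 8}},
    {"teacher": {"relevance": 8, "completeness": 9, "clarity": 9},
     "student": {"relevance": 8, "completeness": 7, "clarity": 9}},
]

gaps = score_gaps(evaluations)
flagged = flag_significant(gaps)
print(gaps)
print(flagged)
```

With these sample scores only completeness is flagged, which would suggest adding longer or more detailed teacher examples to the training set.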

Practical usage

In prompt engineering, distillation means using an expensive model (like Claude Opus or GPT-4) to generate hundreds of high-quality examples for a specific task, then fine-tuning a lighter model (like Haiku or GPT-4o mini) on them. Depending on the task and the models involved, this can reduce inference costs by roughly 10-50x while retaining around 80-95% of the teacher's quality, which makes it well suited to high-volume production applications. These ranges are indicative, so measure both cost and quality on your own workload.
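Whether the savings justify the effort comes down to simple arithmetic. The sketch below is a back-of-the-envelope estimate only: every number in it (request volume, tokens per request, per-token prices, and the one-time cost of generating data and fine-tuning) is a hypothetical placeholder, not a real provider price.

```python
def monthly_cost(requests_per_day, tokens_per_request, price_per_million_tokens, days=30):
    """Estimated monthly inference cost for a given volume and token price."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1_000_000 * price_per_million_tokens

def break_even_months(one_time_cost, teacher_monthly, student_monthly):
    """Months needed for the savings to repay the one-time distillation cost."""
    savings = teacher_monthly - student_monthly
    if savings <= 0:
        return None  # the student is not cheaper, so there is no break-even point
    return one_time_cost / savings

# Hypothetical workload and prices (price per million tokens).
teacher_monthly = monthly_cost(100_000, 1_000, price_per_million_tokens=15.0)
student_monthly = monthly_cost(100_000, 1_000, price_per_million_tokens=0.75)
ratio = teacher_monthly / student_monthly
months = break_even_months(5_000.0, teacher_monthly, student_monthly)

print(teacher_monthly, student_monthly, ratio, months)
```

With these placeholder figures the student is 20 times cheaper per month and the one-time cost is recovered in well under a month; at low volumes the same calculation can show that distillation is not worth it.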

Related concepts

Fine-Tuning, Transfer Learning, Quantization, Knowledge Distillation, Few-Shot Learning, Synthetic Data

FAQ

What is the difference between distillation and fine-tuning?

Fine-tuning means further training an existing model on task-specific data, which is traditionally human-labeled. Distillation is defined by where the training signal comes from: the outputs of a teacher model. Those outputs can carry richer information (probability distributions, intermediate reasoning) than simple labels. The two techniques are often combined: data is generated with a large model (distillation), then a small model is fine-tuned on it.

Is model distillation legal and allowed by AI providers?

It depends on each provider's terms of use. OpenAI's terms prohibit using the outputs of its models to develop competing models, and Anthropic and other providers have similar restrictions, so it is essential to check the terms of service before any distillation. Distilling from openly released models (Llama, Mistral) is generally allowed provided their licenses are respected; note that some of these licenses also place conditions on how model outputs may be used.

What are the limitations of distillation?

The student model rarely surpasses the teacher's performance on the distilled tasks, since the teacher's outputs act as a ceiling on what it can learn from them. Distillation works best for specific, well-defined tasks; general capabilities and complex reasoning are harder to transfer. Moreover, if the teacher has biases or systematic errors, the distilled model tends to inherit them.

