Chinchilla Optimal: Definition and Examples
Training principle for large language models stating that model size and training data quantity should scale proportionally to maximize performance at a fixed compute budget.
Full definition
The concept of Chinchilla Optimal comes from a 2022 research paper by the DeepMind team, titled 'Training Compute-Optimal Large Language Models'. This study challenged the prevailing approach of scaling model size (number of parameters) without proportionally increasing training data. The Chinchilla model, which gives its name to the principle, demonstrated that a smaller model trained on more data could outperform a larger model trained on less data, given the same compute budget.
The empirical rule derived from this research states that model size and training data should scale in equal proportion: for every doubling of model parameters, the number of training tokens should also double. Concretely, a 70 billion parameter model should be trained on approximately 1.4 trillion tokens to be 'Chinchilla optimal', a ratio of roughly 20 tokens per parameter. Before this finding, models like GPT-3 (175 billion parameters) were trained on only 300 billion tokens, or fewer than 2 tokens per parameter, making them significantly under-trained by this criterion.
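The ratio check described above can be sketched in a few lines of Python. This is a minimal illustration, not a formal test: the 20 tokens-per-parameter target is the rule of thumb implied by the 70 billion / 1.4 trillion figures, and the function names and the ±25% tolerance band are hypothetical choices made for this example.

```python
# Rule of thumb implied by 1.4 trillion tokens / 70 billion parameters.
CHINCHILLA_TOKENS_PER_PARAM = 20.0


def tokens_per_parameter(n_params: float, n_tokens: float) -> float:
    """Return the number of training tokens seen per model parameter."""
    return n_tokens / n_params


def classify(n_params: float, n_tokens: float, tolerance: float = 0.25) -> str:
    """Label a model relative to the ~20 tokens/parameter rule of thumb.

    `tolerance` is an illustrative band (here +/-25%), not a value
    taken from the Chinchilla paper.
    """
    ratio = tokens_per_parameter(n_params, n_tokens)
    low = CHINCHILLA_TOKENS_PER_PARAM * (1 - tolerance)
    high = CHINCHILLA_TOKENS_PER_PARAM * (1 + tolerance)
    if ratio < low:
        return "under-trained"
    if ratio > high:
        return "over-trained"
    return "roughly Chinchilla optimal"


if __name__ == "__main__":
    # Figures quoted in the definition above.
    for name, params, tokens in [
        ("Chinchilla", 70e9, 1.4e12),
        ("GPT-3", 175e9, 300e9),
    ]:
        ratio = tokens_per_parameter(params, tokens)
        print(f"{name}: {ratio:.1f} tokens/param -> {classify(params, tokens)}")
```

The labels only describe the training budget allocation; they say nothing about data quality or architecture.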
This principle changed how the industry trains language models. Rather than simply stacking parameters, research labs began to invest heavily in collecting and curating high-quality training data. Models like Meta's Llama 2 were trained well beyond the Chinchilla optimal ratio, favoring over-training on more data to get better quality from a model of a given size at inference time.
It is important to note that the Chinchilla optimal ratio only optimizes the training compute budget; it does not account for the cost of running the model afterwards. In practice, many labs deliberately choose to over-train their models (i.e., use more data than the ratio recommends) because a smaller but better-trained model is cheaper to deploy and run in production, even if training it takes longer.
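To make the fixed-budget trade-off concrete, here is a rough sketch of how a training compute budget could be split between parameters and tokens. It rests on two assumptions that are not stated in this definition: the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D), and the roughly 20 tokens-per-parameter rule of thumb. The function name is hypothetical, and the fitted scaling laws in the paper are more nuanced than this shortcut.

```python
import math

# Assumption: training compute C is approximated as 6 * N * D FLOPs,
# where N is the parameter count and D the number of training tokens.
FLOPS_PER_PARAM_TOKEN = 6.0
# Assumption: compute-optimal training uses about 20 tokens per parameter.
TOKENS_PER_PARAM = 20.0


def compute_optimal_allocation(flops_budget: float) -> tuple[float, float]:
    """Return (n_params, n_tokens) for a given training FLOPs budget.

    Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N**2,
    so N = sqrt(C / 120) and D = 20 * N.
    """
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    budget = 1e23  # FLOPs
    n_params, n_tokens = compute_optimal_allocation(budget)
    print(f"params: {n_params / 1e9:.1f}B, tokens: {n_tokens / 1e9:.0f}B")
```

Under these assumptions, a budget of 10^23 FLOPs works out to roughly 29 billion parameters and about 580 billion tokens. A lab that plans to serve the model at scale might pick a smaller model and more tokens than this sketch suggests, accepting extra training compute in exchange for cheaper inference.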
Etymology
The term comes from the 'Chinchilla' model developed by DeepMind in 2022, a 70 billion parameter model that outperformed Gopher (280 billion parameters) thanks to training on four times more data. The name Chinchilla follows DeepMind's animal naming tradition for its language models (Gopher, Flamingo, Chinchilla). The word 'optimal' refers to the optimal allocation of compute budget between model size and data volume.
Concrete examples
Evaluating an open source model
This 13B parameter model was trained on 1 trillion tokens. Is it Chinchilla optimal, under-trained, or over-trained? Analyze the ratio and implications for its performance.
Planning a model training run
I have a compute budget of 10^23 FLOPs. According to Chinchilla scaling laws, what model size and training data volume should I aim for to maximize performance?
Comparing training strategies
Compare the training approaches of GPT-3 (175B params, 300B tokens), Chinchilla (70B params, 1.4T tokens), and Llama 2 (70B params, 2T tokens) in terms of compute-optimal ratio. What trade-offs does each approach imply?
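As a rough way to sanity-check an answer to this comparison, the sketch below estimates each model's training compute and the model size a 20-tokens-per-parameter run would have used for that same compute. The C ≈ 6·N·D approximation, the 20:1 ratio, and all function and field names are assumptions made for illustration; the parameter and token counts are the ones quoted in the prompt.

```python
import math

# (parameters, training tokens) as quoted in the prompt above.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "Llama 2": (70e9, 2e12),
}


def summarize(n_params: float, n_tokens: float) -> dict:
    """Estimate ratio, training FLOPs, and the 20:1 model size at equal FLOPs."""
    train_flops = 6.0 * n_params * n_tokens  # assumption: C ~ 6 * N * D
    # Assumption: D = 20 * N, hence C = 120 * N**2 for a compute-optimal run.
    optimal_params = math.sqrt(train_flops / 120.0)
    return {
        "tokens_per_param": n_tokens / n_params,
        "train_flops": train_flops,
        "optimal_params_same_flops": optimal_params,
    }


if __name__ == "__main__":
    for name, (params, tokens) in MODELS.items():
        s = summarize(params, tokens)
        print(
            f"{name}: {s['tokens_per_param']:.1f} tokens/param, "
            f"{s['train_flops']:.2e} FLOPs, "
            f"20:1 size at same compute: {s['optimal_params_same_flops'] / 1e9:.0f}B"
        )
```

A model whose actual size is above the "20:1 size" is under-trained for its compute, and one below it has been over-trained in favor of a smaller deployed model.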
Practical usage
In prompt engineering, understanding the Chinchilla optimal concept helps you judge how thoroughly a model was trained: for a given size, a model trained at or above this ratio will generally have better reasoning and generation capabilities than one trained on far less data. It also helps you interpret open source model specifications and choose the right model for your use case, since a smaller but Chinchilla-optimal model can outperform a larger but under-trained model.
FAQ
Do all current models follow the Chinchilla optimal ratio?
Is the Chinchilla optimal ratio still valid in 2026?
What is the link between Chinchilla optimal and the quality of an LLM's responses?