Chinchilla Optimal: Definition and Examples

Training principle for large language models stating that model size and training data quantity should scale proportionally to maximize performance at a fixed compute budget.

Full definition

The concept of Chinchilla Optimal comes from a 2022 research paper by the DeepMind team, titled 'Training Compute-Optimal Large Language Models'. This study challenged the prevailing approach of scaling model size (number of parameters) without proportionally increasing training data. The Chinchilla model, which gives its name to the principle, demonstrated that a smaller model trained on more data could outperform a larger model trained on less data, given the same compute budget.

The empirical rule derived from this research states that model size and training data should scale in equal proportion: for every doubling of model parameters, the number of training tokens should also double. This works out to roughly 20 training tokens per parameter. Concretely, a 70 billion parameter model should be trained on approximately 1.4 trillion tokens to be 'Chinchilla optimal'. Before this result, models like GPT-3 (175 billion parameters) were trained on only 300 billion tokens, fewer than 2 tokens per parameter, making them significantly under-trained by this criterion.
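The rule can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming a fixed ratio of 20 tokens per parameter (the ratio implied by the 70B / 1.4T example); the function names are illustrative and do not come from the paper.

```python
# Rule-of-thumb ratio implied by the 70B / 1.4T example (an approximation).
TOKENS_PER_PARAM = 20


def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal number of training tokens for a model size."""
    return TOKENS_PER_PARAM * n_params


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Actual tokens-per-parameter ratio of a trained model."""
    return n_tokens / n_params


print(f"{chinchilla_optimal_tokens(70e9):.2e}")  # 70B model -> 1.40e+12 tokens
print(f"{tokens_per_param(175e9, 300e9):.2f}")   # GPT-3 -> 1.71 tokens per parameter
```

GPT-3's ratio of about 1.7 is more than ten times below the assumed target of 20, which is what "under-trained" means here.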

This principle changed how the industry trains language models. Rather than simply stacking parameters, research labs began to invest heavily in collecting and curating high-quality training data. Models like Meta's Llama 2 were explicitly designed to exceed the Chinchilla optimal ratio, favoring over-training on more data to get a stronger model at a given inference cost.

It is important to note that the Chinchilla optimal ratio applies strictly to optimizing the training compute budget. In practice, many labs deliberately choose to over-train their models (i.e., use more data than the ratio recommends) because a smaller but better-trained model is cheaper to deploy and run in production, even if its initial training was longer.
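The trade-off between training cost and deployment cost can be illustrated with a rough break-even calculation. This is only a sketch under stated assumptions: it uses the common approximations of about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per generated token, neither of which appears in the text above. The 35B / 4T configuration is purely hypothetical, and nothing here claims the two models reach the same quality.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


def break_even_served_tokens(big: tuple, small: tuple) -> float:
    """Served tokens beyond which the smaller model has lower total compute.

    big and small are (n_params, train_tokens) pairs. Inference is
    approximated as 2 FLOPs per parameter per served token.
    """
    extra_training = training_flops(*small) - training_flops(*big)
    saving_per_served_token = 2 * (big[0] - small[0])
    return extra_training / saving_per_served_token


chinchilla_like = (70e9, 1.4e12)     # compute-optimal configuration from the text
hypothetical_small = (35e9, 4e12)    # hypothetical over-trained smaller model

tokens = break_even_served_tokens(chinchilla_like, hypothetical_small)
print(f"break-even after ~{tokens:.1e} served tokens")  # ~3.6e+12
```

Under these assumptions, the extra training spent on the smaller model is recovered once roughly 3.6 trillion tokens have been served; a widely deployed model can pass that volume, which is the economic argument for over-training.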

Etymology

The term comes from the 'Chinchilla' model developed by DeepMind in 2022, a 70 billion parameter model that outperformed Gopher (280 billion parameters) thanks to training on four times more data. The name Chinchilla follows DeepMind's animal naming tradition for its language models (Gopher, Flamingo, Chinchilla). The word 'optimal' refers to the optimal allocation of compute budget between model size and data volume.

Concrete examples

Evaluating an open source model

This 13B parameter model was trained on 1 trillion tokens. Is it Chinchilla optimal, under-trained, or over-trained? Analyze the ratio and implications for its performance.
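The answer to this prompt can be sanity-checked by hand. Below is a minimal sketch, assuming a target of 20 tokens per parameter and an arbitrary 25% tolerance band around it; the band and the function name are illustrative choices, not part of the Chinchilla paper.

```python
def classify_training(n_params: float, n_tokens: float,
                      target_ratio: float = 20.0, tolerance: float = 0.25):
    """Return (tokens per parameter, verdict) relative to an assumed target ratio."""
    ratio = n_tokens / n_params
    if ratio < target_ratio * (1 - tolerance):
        return ratio, "under-trained"
    if ratio > target_ratio * (1 + tolerance):
        return ratio, "over-trained"
    return ratio, "roughly Chinchilla optimal"


ratio, verdict = classify_training(13e9, 1e12)
print(f"{ratio:.1f} tokens/param -> {verdict}")  # 76.9 tokens/param -> over-trained
```

At about 77 tokens per parameter, the 13B model in the prompt is well past the assumed target, so it is over-trained in the Chinchilla sense, which is usually a deliberate choice rather than a flaw.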

Planning a model training run

I have a compute budget of 10^23 FLOPs. According to Chinchilla scaling laws, what model size and training data volume should I aim for to maximize performance?
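A back-of-the-envelope answer to this prompt is possible if two approximations are accepted: training compute C is about 6 × N × D FLOPs (N parameters, D tokens), and the optimal ratio is D = 20 × N. Both are rules of thumb rather than the paper's exact fitted estimates, so the sketch below gives an order of magnitude, not a precise recommendation.

```python
import math


def compute_optimal_allocation(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a training budget between parameters and tokens.

    Assumes C = 6 * N * D and D = r * N, which gives N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(flops_budget / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


n, d = compute_optimal_allocation(1e23)
print(f"{n / 1e9:.1f}B parameters, {d / 1e9:.0f}B tokens")  # 28.9B parameters, 577B tokens
```

Under these assumptions, a 10^23 FLOPs budget points to a model of roughly 29 billion parameters trained on roughly 580 billion tokens.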

Comparing training strategies

Compare the training approaches of GPT-3 (175B params, 300B tokens), Chinchilla (70B params, 1.4T tokens), and Llama 2 (70B params, 2T tokens) in terms of compute-optimal ratio. What trade-offs does each approach imply?
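The figures quoted in this prompt are enough to compare the three strategies numerically. The sketch below computes each model's tokens-per-parameter ratio and an approximate training cost, assuming the common estimate of 6 FLOPs per parameter per token (an assumption, not a figure reported by the model authors).

```python
# (parameters, training tokens) as quoted in the prompt above
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "Llama 2": (70e9, 2e12),
}


def summarize(n_params: float, n_tokens: float):
    """Return (tokens per parameter, approximate training FLOPs)."""
    return n_tokens / n_params, 6 * n_params * n_tokens


for name, (n, d) in models.items():
    ratio, flops = summarize(n, d)
    print(f"{name:<10} {ratio:5.1f} tokens/param  ~{flops:.2e} training FLOPs")
```

The ratios come out at about 1.7 for GPT-3, 20 for Chinchilla, and 28.6 for Llama 2: under-trained, compute-optimal, and moderately over-trained respectively, under the 20 tokens per parameter rule of thumb.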

Practical usage

In prompt engineering, understanding the Chinchilla optimal concept helps you assess a model beyond its parameter count: a model trained at or above this ratio will generally reason and generate better than an under-trained model of the same size. It also helps you interpret open source model specifications and choose the right model for your use case, since a smaller but Chinchilla-optimal model can outperform a larger but under-trained one.

Related concepts

Scaling Laws
Training Tokens
FLOPs (Floating Point Operations)
Over-training

FAQ

Do all current models respect the optimal Chinchilla ratio?
No, and this is deliberate. Most recent models like Llama 3, Mistral or Gemma are intentionally over-trained relative to the Chinchilla ratio. The reason is economic: it's more cost-effective to spend more on training (one-time cost) to get a smaller model that is cheaper to deploy in production (recurring cost per query).

Is the optimal Chinchilla ratio still valid in 2026?
The fundamental principle remains valid: data and parameters must be balanced. The exact ratios, however, have been refined by subsequent research. Additionally, the industry has found that the best ratio also depends on data quality and deployment goals, not solely on the training compute budget.

What is the link between Chinchilla optimal and the quality of an LLM's responses?
A model trained at or beyond the Chinchilla ratio has seen more data relative to its size, which generally results in better language understanding, fewer factual hallucinations, and better reasoning than an under-trained model of the same size. However, data quality matters as much as quantity: a model trained on low-quality data will remain mediocre, even with an optimal ratio.
