In Depth
Chinchilla scaling, from DeepMind's 2022 paper 'Training Compute-Optimal Large Language Models,' demonstrated that many large language models were significantly undertrained. The research showed that for a given compute budget, model size and training data should scale in equal proportion: when compute doubles, parameter count and training-token count should each grow by the same factor (roughly √2), rather than the extra compute going into parameters alone.
This finding, based on training over 400 models of varying sizes, challenged the prevailing practice of training very large models on comparatively little data. The Chinchilla model (70B parameters trained on 1.4 trillion tokens) outperformed the much larger Gopher (280B parameters trained on 300 billion tokens) on the same compute budget, demonstrating that the balance between parameters and data matters as much as raw scale.
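The trade-off above can be sketched numerically. A common approximation in the scaling-law literature puts training compute at C ≈ 6·N·D FLOPs for N parameters and D tokens, and Chinchilla's fits work out to roughly 20 training tokens per parameter; the ~5.76e23 FLOPs figure for Gopher's budget comes from the paper. A minimal sketch under those assumptions:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    """Split a compute budget into a compute-optimal (params, tokens) pair.

    Assumes the standard approximation C ≈ 6 * N * D, plus the
    Chinchilla-style ratio of ~20 training tokens per parameter.
    Substituting D = tokens_per_param * N gives
    C ≈ 6 * tokens_per_param * N**2, so N = sqrt(C / (6 * ratio)).
    """
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Gopher's training budget, ~5.76e23 FLOPs (figure from the Chinchilla paper):
n, d = chinchilla_optimal(5.76e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

Plugging in Gopher's budget yields roughly 69B parameters and 1.39T tokens, close to the 70B/1.4T configuration Chinchilla actually used: the same compute, reallocated from parameters to data.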
Chinchilla scaling has had profound practical impact. It influenced the training strategies of subsequent models, with Llama, Mistral, and others training smaller models on much more data than previous norms suggested. It also shifted the industry's bottleneck from model size to data quality and quantity, spurring efforts in data curation, synthetic data generation, and efficient tokenization. The insight that 'more data, smaller model' often beats 'less data, bigger model' made high-quality AI more accessible by reducing hardware requirements.