#047 Jun 20, 2026 · 8 min read

Scaling Laws Revisited: Why Chinchilla Was Wrong

LLM Scaling

The Chinchilla paper (Hoffmann et al., 2022) reshaped how the industry thinks about scaling. Its core claim, that model size and training data should scale in equal proportion, influenced billions of dollars in compute allocation. But new evidence suggests the picture is more nuanced than a simple power law.

The Chinchilla Thesis

Chinchilla argued that for a given compute budget, the optimal strategy is to grow parameter count and training tokens in roughly equal proportion, which works out to about 20 training tokens per parameter (Chinchilla itself is a 70B-parameter model trained on 1.4T tokens). This led to the "compute-optimal" training paradigm and directly influenced the sizing of models like LLaMA, which prioritized training longer on more data over building larger models.
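To make the rule concrete, here is a minimal sketch of the allocation it implies. It rests on two widely used approximations rather than on the paper's fitted curves: training cost of about 6 FLOPs per parameter per token, and a target of about 20 tokens per parameter. The function name `chinchilla_optimal` and both constants are illustrative assumptions.

```python
import math

# Assumption: training cost is approximated as C = 6 * N * D FLOPs.
FLOPS_PER_PARAM_TOKEN = 6.0
# Assumption: rule-of-thumb ratio of about 20 training tokens per parameter.
TOKENS_PER_PARAM = 20.0


def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (parameters, training_tokens) for a training budget in FLOPs."""
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    # Substitute D = 20 * N into C = 6 * N * D and solve for N.
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)
    )
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    n, d = chinchilla_optimal(5.76e23)
    print(f"params ~ {n / 1e9:.1f}B, tokens ~ {d / 1e12:.2f}T")
```

With a budget of 5.76e23 FLOPs, this gives roughly 69B parameters and 1.39T tokens, close to Chinchilla's own 70B / 1.4T configuration. Because both quantities grow with the square root of compute, quadrupling the budget doubles each.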

Where It Breaks Down

Three lines of evidence now challenge the Chinchilla scaling laws:

1. Inference cost matters. Chinchilla optimized for training compute, but in production, inference cost often dominates. A smaller model trained for longer — "over-trained" by Chinchilla standards — can be more cost-effective at deployment scale. This is exactly the strategy behind LLaMA and its successors.

2. Data quality shifts the curves. The original scaling laws assumed relatively uniform data quality. Recent work shows that carefully curated, deduplicated, and filtered datasets can shift the optimal model-size-to-data ratio significantly. With better data, you can get away with smaller models.

3. Architecture innovations change the game. Scaling laws were derived empirically on standard Transformer architectures. Mixture-of-Experts models, state space models, and hybrid architectures introduce new scaling dynamics that don't fit neatly into the Chinchilla framework.
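The inference-cost argument in point 1 can be sketched with back-of-the-envelope accounting. The snippet below assumes the common approximations of 6 FLOPs per parameter per training token and 2 FLOPs per parameter per generated token, and it takes for granted that the two configurations reach comparable quality, which it does not check. The function names and the 35B / 4T configuration are hypothetical, chosen only for illustration.

```python
import math


def lifetime_flops(n_params: float, train_tokens: float,
                   inference_tokens: float) -> float:
    """Training plus inference compute under the 6ND / 2ND approximations."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens


def break_even_inference_tokens(large: tuple[float, float],
                                small: tuple[float, float]) -> float:
    """Inference volume beyond which the smaller model is cheaper overall.

    Each argument is (parameters, training_tokens). Assumes both models
    reach comparable quality; this function does not verify that.
    """
    (n_large, d_large), (n_small, d_small) = large, small
    if n_small >= n_large:
        raise ValueError("'small' must have fewer parameters than 'large'")
    # Extra training compute the smaller model spends up front.
    extra_training = 6.0 * (n_small * d_small - n_large * d_large)
    # Compute the smaller model saves on every token it serves.
    saving_per_token = 2.0 * (n_large - n_small)
    return max(0.0, extra_training / saving_per_token)


if __name__ == "__main__":
    large = (70e9, 1.4e12)  # Chinchilla-style configuration
    small = (35e9, 4.0e12)  # hypothetical over-trained configuration
    tokens = break_even_inference_tokens(large, small)
    print(f"break-even at ~ {tokens / 1e12:.1f}T inference tokens")
```

Under these assumptions the smaller model costs more to train but pays that back after about 3.6T served tokens; past that point every additional token widens its advantage.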

What This Means for Practitioners

The key takeaway isn't that Chinchilla was "wrong" — its methodology was sound. Rather, its conclusions were incomplete. The optimal scaling strategy depends on your deployment constraints, data quality, and architecture choices, not just your training compute budget.

For teams training foundation models, this means the compute-optimal frontier is wider than a single curve. For teams fine-tuning or deploying, it means smaller, over-trained models are often the right choice — which is good news for accessibility.

Papers We're Watching

Sardana & Frankle (2023): "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws"
Muennighoff et al. (2023): "Scaling Data-Constrained Language Models"
Gadre et al. (2024): "Language Models Scale Reliably with Over-Training and on Downstream Tasks"
