## Summary: Scaling Laws for Neural Language Models (Kaplan et al., 2020)
This seminal paper by Jared Kaplan, Sam McCandlish, Tom Henighan, and colleagues at OpenAI established foundational empirical scaling laws for Transformer-based language models, revealing how performance scales predictably with three core variables [1].
### Three Core Scaling Variables
The paper investigates how cross-entropy loss (L) depends on:
1. **Model parameters (N)** — the number of non-embedding parameters, ranging from 768 to 1.5 billion in their experiments
2. **Dataset size (D)** — number of training tokens, tested from 22M to 23B tokens using WebText2
3. **Compute budget (C)** — measured in PetaFLOP/s-days [1][2]
### Power-Law Scaling Relationships
The central result: test loss decays as a **smooth power law** in each variable when the other two are not the bottleneck:
- L(N) = (N_c / N)^α_N, with α_N ≈ 0.076
- L(D) = (D_c / D)^α_D, with α_D ≈ 0.095
- L(C) = (C_c / C)^α_C, with α_C ≈ 0.050

All three fits appear in Figure 1 of the paper.
These hold over **7+ orders of magnitude** with no signs of deviation [1].
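Plugged into code, the three fits can be sketched directly. The constants below (N_c ≈ 8.8×10^13, D_c ≈ 5.4×10^13, C_c ≈ 3.1×10^8) are the paper's approximate reported values and should be treated as illustrative, not exact:

```python
# Power-law scaling fits from Kaplan et al. (2020); constants are
# the paper's approximate reported values (illustrative only).
ALPHA_N, N_C = 0.076, 8.8e13   # non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # training tokens
ALPHA_C, C_C = 0.050, 3.1e8    # compute in PetaFLOP/s-days

def loss_from_params(n: float) -> float:
    """L(N): loss when data and compute are not bottlenecks."""
    return (N_C / n) ** ALPHA_N

def loss_from_tokens(d: float) -> float:
    """L(D): loss for a large model trained on D tokens."""
    return (D_C / d) ** ALPHA_D

def loss_from_compute(c: float) -> float:
    """L(C): loss along the compute-efficient frontier."""
    return (C_C / c) ** ALPHA_C

# Doubling model size multiplies loss by 2**-0.076, i.e. a ~5% reduction:
print(loss_from_params(1e9) / loss_from_params(2e9))  # exactly 2 ** 0.076
```

Note how small the exponents are: each constant-factor improvement in loss requires an order-of-magnitude increase in the resource.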
### Critical Findings
**1. Model size dominates compute allocation** (Section 6, Figure 8): Given a fixed FLOP budget, optimal performance allocates most compute to larger models rather than longer training. The optimal split follows N_opt ∝ C^0.73, with data growing only as D ∝ C^0.27, meaning model size should scale roughly as the 0.73 power of compute [1].
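A minimal sketch of what the N_opt ∝ C^0.73 rule implies when a budget doubles (the proportionality constant is set to 1 here, so only relative growth factors are meaningful):

```python
def optimal_model_size(c_ref: float, n_ref: float, c_new: float) -> float:
    """Scale a reference (compute, model-size) point along N_opt ∝ C^0.73."""
    return n_ref * (c_new / c_ref) ** 0.73

# Doubling compute: grow the model by 2**0.73 ≈ 1.66x, leaving the
# remaining ~1.2x of the budget for more training, consistent with
# the paper's "train larger models, not longer" recommendation.
growth = optimal_model_size(1.0, 1.0, 2.0)
```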
**2. Sample efficiency improves with scale** (Section 4): Larger models reach any given loss with fewer tokens. Plotted against compute, models of all sizes trace similar learning curves, with larger models reaching each target loss sooner. A 1.5B-parameter model reaches a given loss level with ~10x fewer samples than a 10M-parameter model [1].
**3. Overfitting follows N^0.74/D** (Equation 1.5): The overfitting penalty depends primarily on the ratio N^(α_N/α_D)/D ≈ N^0.74/D; because the exponent is below 1, the dataset need only grow sub-linearly with model size to keep overfitting in check [1].
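A sketch of the combined loss surface from Equation (1.5), L(N, D) = [(N_c/N)^(α_N/α_D) + D_c/D]^α_D, using the paper's approximate fitted constants (N_c ≈ 8.8×10^13 parameters, D_c ≈ 5.4×10^13 tokens; treat the values as illustrative):

```python
# Combined scaling law L(N, D), Eq. (1.5) of Kaplan et al. (2020).
# Constants are the paper's approximate fits (illustrative only).
ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13

def loss(n: float, d: float) -> float:
    """L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

# Overfitting is governed by N**0.74 / D: growing N at fixed small D
# eventually stops helping, because the D_c/D term dominates the sum.
data_bound = loss(1e12, 1e9)   # huge model, little data
balanced   = loss(1e9,  1e12)  # smaller model, plenty of data
```

Evaluating the two points shows the data-starved configuration losing to the balanced one despite having 1000x more parameters.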
### Practical Implications for Training LLMs
- **Stop early**: compute-optimal training halts well before convergence; train a large model for fewer steps rather than a small model for many
- **Scale model size first**: When budget doubles, make the model ~1.7x larger rather than training 2x longer
- **Performance is predictable**: Teams can forecast final model quality before committing resources
- **Later challenged by Chinchilla**: Hoffmann et al. (2022) found that Kaplan's compute-optimal recipe underweights data — Chinchilla showed equal scaling of N and D (both ∝ C^0.5) outperforms Kaplan's model-size-heavy approach [3]
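To make the contrast with Chinchilla concrete, a sketch comparing how the two recipes split a 10x compute increase (exponents only; proportionality constants are omitted, so the numbers are relative growth factors, not absolute sizes):

```python
def split_growth(compute_factor: float, n_exp: float, d_exp: float):
    """Return (model growth, data growth) when compute scales by compute_factor."""
    return compute_factor ** n_exp, compute_factor ** d_exp

kaplan     = split_growth(10.0, 0.73, 0.27)  # model-size-heavy split
chinchilla = split_growth(10.0, 0.50, 0.50)  # balanced N and D split
```

Under the Kaplan exponents a 10x compute increase grows the model ~5.4x but the data only ~1.9x, whereas the Chinchilla recipe grows both ~3.2x.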
### Sources
[1] Kaplan, J. et al. "Scaling Laws for Neural Language Models." arXiv:2001.08361 (2020)
[2] Brown, T. et al. "Language Models are Few-Shot Learners." NeurIPS 2020
[3] Hoffmann, J. et al. "Training Compute-Optimal Large Language Models." arXiv:2203.15556 (2022)