The Smol Training Playbook
The Secrets to Building World-Class LLMs
Introduction
How to read this blog post
Training compass: why → what → how
Why: the question nobody wants to answer
Research: what do you want to understand?
Production: why can't you use an existing model?
Strategic open-source: do you see a gap you can fill?
Hugging Face's journey
What: translating goals into decisions
Superpower: speed and data
Every big model starts with a small ablation
Choosing your baseline
Modifying your baseline: the discipline of derisking
Picking a training framework
Ablation setup
Setting up our ablation framework
Understanding what works: evaluation
Estimating ablation costs
Rules of engagement
Designing the model architecture
Architecture choices
Attention
Embedding sharing
Positional encodings & long context
Improving stability
Other core components
Going sparse: MoE
Excursion: hybrid models
To MoE or not to MoE: choosing a base architecture
The tokenizer
SmolLM3
Rules of engagement
Optimizer and training hyperparameters
Optimizers: AdamW and beyond
Learning Rate
Batch size
Scaling laws for hyperparameters
SmolLM3
Rules of engagement
Scaling laws: how many parameters, how much data?
The art of data curation
What makes a good data mixture and why it matters most
The unintuitive nature of data mixtures
The evolution of training curricula
Ablation setup: how to systematically test data recipes
SmolLM3: Curating the data mixture
Building on proven foundations
English web data: the foundation layer
Multilingual web data
Code data
Math data
Finding the right mixture for new stages
The training marathon
Pre-flight checklist: what to verify before hitting "train"
Scaling surprises
Mystery #1 – The vanishing throughput
Mystery #2 – The persisting throughput drops
Mystery #3 – The noisy loss
Launch, Take Two
Staying the course
Training monitoring: beyond loss curves
Fix and restart vs fix on the fly
Mid-training
Stage 2 and stage 3 mixtures
Long context extension: from 4k to 128k tokens
Wrapping up pretraining
Beyond base models: post-training in 2025
Post-training compass: why → what → how
First things first: evals before everything else
Rules of engagement
Tools of the trade
Why bother with frameworks at all?
Why (almost) every post-training pipeline starts with SFT
Picking a base model
Training simple baselines
Picking a good chat template
Baby baselines
Vibe-test your baselines
Targeting specific capabilities
Which hyperparameters actually matter?
Boosting reasoning through continued pretraining
From SFT to preference optimization
Creating preference datasets
Which algorithm do I pick?
Which hyperparameters matter most for preference optimization?
Rules of engagement
Going online and beyond supervised labels
Applying RLVR to hybrid reasoning models
Is RL the only game in town?
Which method do I pick?
Wrapping up post-training
Infrastructure: the unsung hero
Inside a GPU: Internal Architecture
Compute Units and FLOPs
GPU Memory Hierarchy: From Registers to HBM
Roofline Model
Outside a GPU: How GPUs Talk to the World
GPU-to-CPU
GPU-to-GPU Intranode
Through CPU
Through Libfabric EFA
Through NVLink
GPU-to-GPU Internode
Troubleshooting Interconnect
GPU-to-Storage
Summary
Building Resilient Training Systems
Node Health Monitoring and Replacement
Checkpoint Management
Automated Evaluations
Optimizing Training Throughput
How Many GPUs Do We Need?
Finding the Optimal Parallelism Configuration
Step 1: Fitting a training step in memory
Step 2: Achieving the target global batch size
Step 3: Optimizing training throughput
Conclusion
References