The Smol Training Playbook
The Secrets to Building World-Class LLMs
Introduction
How to read this blog post
Training compass: why → what → how
Why: the question nobody wants to answer
Research: what do you want to understand?
Production: why can't you use an existing model?
Strategic open-source: do you see a gap you can fill?
Hugging Face's journey
What: translating goals into decisions
Superpower: speed and data
Every big model starts with a small ablation
Choosing your baseline
Modifying your baseline: the discipline of derisking
Picking a training framework
Ablation setup
Setting up our ablation framework
Understanding what works: evaluation
Estimating ablation costs
Rules of engagement
Designing the model architecture
Architecture choices
Attention
Embedding sharing
Positional encodings & long context
Improving stability
Other core components
Going sparse: MoE
Excursion: hybrid models
To MoE or not to MoE: choosing a base architecture
The tokenizer
SmolLM3
Rules of engagement
Optimizer and training hyperparameters
Optimizers: AdamW and beyond
Learning Rate
Batch size
Scaling laws for hyperparameters
SmolLM3
Rules of engagement
Scaling laws: how many parameters, how much data?
The art of data curation
What makes a good data mixture and why it matters most
The unintuitive nature of data mixtures
The evolution of training curricula
Ablation setup: how to systematically test data recipes
SmolLM3: Curating the data mixture
Building on proven foundations
English web data: the foundation layer
Multilingual web data
Code data
Math data
Finding the right mixture for new stages
The training marathon
Pre-flight checklist: what to verify before hitting "train"
Scaling surprises
Mystery #1 – The vanishing throughput
Mystery #2 – The persisting throughput drops
Mystery #3 – The noisy loss
Launch, Take Two
Staying the course
Training monitoring: beyond loss curves
Fix and restart vs fix on the fly
Mid-training
Stage 2 and stage 3 mixtures
Long context extension: from 4k to 128k tokens
Wrapping up pretraining
Beyond base models: post-training in 2025
Post-training compass: why → what → how
First things first: evals before everything else
Rules of engagement
Tools of the trade
Why bother with frameworks at all?
Why (almost) every post-training pipeline starts with SFT
Picking a base model
Training simple baselines
Picking a good chat template
Baby baselines
Vibe-test your baselines
Targeting specific capabilities
Which hyperparameters actually matter?
Boosting reasoning through continued pretraining
From SFT to preference optimization
Creating preference datasets
Which algorithm do I pick?
Which hyperparameters matter most for preference optimization?
Rules of engagement
Going online and beyond supervised labels
Applying RLVR to hybrid reasoning models
Is RL the only game in town?
Which method do I pick?
Wrapping up post-training
Infrastructure: the unsung hero
Inside a GPU: Internal Architecture
Compute Units and FLOPs
GPU Memory Hierarchy: From Registers to HBM
Roofline Model
Outside a GPU: How GPUs Talk to the World
GPU-to-CPU
GPU-to-GPU Intranode
Through CPU
Through Libfabric EFA
Through NVLink
GPU-to-GPU Internode
Troubleshooting Interconnect
GPU-to-Storage
Summary
Building Resilient Training Systems
Node Health Monitoring and Replacement
Checkpoint Management
Automated Evaluations
Optimizing Training Throughput
How Many GPUs Do We Need?
Finding the Optimal Parallelism Configuration
Step 1: Fitting a training step in memory
Step 2: Achieving the target global batch size
Step 3: Optimizing training throughput
Conclusion
References