In Depth

Large model pre-training can consume millions of GPU-hours and tens of millions of dollars. The training loop iterates over batches of data, computes a loss (e.g., cross-entropy for next-token prediction), back-propagates gradients, and updates weights via optimizers like AdamW. Training decisions — learning rate schedules, batch size, data mixtures — profoundly affect the quality of the resulting model.
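The loop described above can be sketched at toy scale. The following is a minimal illustration, not any production implementation: a bigram next-token model trained with softmax cross-entropy and a hand-rolled AdamW update. The model, corpus, and hyperparameters are all hypothetical choices made for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
vocab, lr, wd = 8, 0.1, 0.01             # toy sizes/hyperparameters
b1, b2, eps = 0.9, 0.999, 1e-8           # AdamW moment/epsilon constants

W = rng.normal(0.0, 0.1, (vocab, vocab)) # bigram logit table: row x -> logits over next token
m, v = np.zeros_like(W), np.zeros_like(W)

# toy corpus: each token is deterministically followed by (token + 1) mod vocab
tokens = np.arange(64) % vocab
xs, ys = tokens[:-1], tokens[1:]

losses = []
for t in range(1, 201):
    # forward pass: logits -> probabilities -> mean cross-entropy loss
    logits = W[xs]                                        # (N, vocab)
    probs = softmax(logits)
    loss = -np.log(probs[np.arange(len(ys)), ys]).mean()
    losses.append(loss)

    # backward pass: gradient of softmax cross-entropy is (probs - one_hot(y))
    dlogits = probs.copy()
    dlogits[np.arange(len(ys)), ys] -= 1.0
    dlogits /= len(ys)
    g = np.zeros_like(W)
    np.add.at(g, xs, dlogits)                             # accumulate into shared rows

    # AdamW step: Adam moment estimates plus decoupled weight decay
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat = m / (1 - b1 ** t)                              # bias correction
    vhat = v / (1 - b2 ** t)
    W -= lr * (mhat / (np.sqrt(vhat) + eps) + wd * W)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Since the toy corpus is deterministic, the loss falls from roughly ln(8) ≈ 2.08 toward zero. The same skeleton scales conceptually to real pre-training, where the bigram table is replaced by a transformer and the full-batch gradient by mini-batch gradients, but the forward/loss/backward/update cycle is the same.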