In Depth
Knowledge distillation transfers knowledge from a large, complex model (the teacher) to a smaller, more efficient model (the student). Instead of being trained on the original data labels alone, the student learns from the teacher's output probabilities, which carry richer information about relationships between classes. For example, a teacher's prediction that an image is '80% cat, 15% tiger, 5% lion' conveys more knowledge than the bare label 'cat.'
The technique was popularized by Geoffrey Hinton and colleagues in a 2015 paper. It uses a temperature parameter to 'soften' the teacher's probability distribution, making the subtle relationships between classes more visible to the student. The student is then trained on a weighted combination of two losses: cross-entropy against the hard labels and a divergence term against the teacher's soft predictions.
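The two ingredients above can be sketched in a few lines of NumPy: a temperature-scaled softmax, and a loss that mixes hard-label cross-entropy with a soft-label KL term. This is an illustrative sketch, not a training loop; the function names and the `T**2` scaling convention (which keeps the soft term's gradient magnitude comparable as the temperature grows) follow common practice, and the example logits are made up.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.5):
    """Weighted combination of hard-label and soft-label losses.

    alpha weights the hard-label cross-entropy; the soft term is
    KL(teacher || student) computed at temperature T and scaled by T^2.
    """
    # Hard loss: ordinary cross-entropy at T = 1
    hard_loss = -np.log(softmax(student_logits)[hard_label])
    # Soft loss: KL divergence between softened distributions
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    soft_loss = np.sum(q_teacher * (np.log(q_teacher) - np.log(q_student)))
    return alpha * hard_loss + (1 - alpha) * temperature**2 * soft_loss

# Hypothetical teacher logits for the cat/tiger/lion example
teacher = [3.0, 1.3, 0.2]   # softmax ≈ [0.80, 0.15, 0.05]
student = [2.0, 1.5, 1.0]   # a less confident student
loss = distillation_loss(student, teacher, hard_label=0)
```

Raising the temperature spreads probability mass from 'cat' onto 'tiger' and 'lion', which is exactly what exposes the inter-class structure to the student; at T = 1 the KL term would be dominated by the top class alone.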
Knowledge distillation has become crucial for deploying AI in production, where the largest models are too expensive or slow for real-time applications. Many commercial AI products use distilled models: smaller models that capture much of a frontier model's capability at a fraction of the cost. The technique is also used in model compression pipelines alongside pruning and quantization to create highly efficient deployment models.