Model distillation is a technique where a smaller "student" model is trained to replicate the behavior of a larger "teacher" model. The goal is to capture most of the teacher's capabilities in a model that's dramatically cheaper and faster to run. It's one of the most effective ways to reduce AI deployment costs.

How it works: Instead of training the student model on the original data with hard labels (a single correct class per example), you train it on the teacher model's outputs — including the probability distribution it assigns across all possible answers. These "soft labels" contain richer information. When the teacher model says a review is 90% positive and 10% neutral, that nuance teaches the student more than the bare label "positive."
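To make "soft labels" concrete, here is a minimal sketch of how a teacher's raw scores become a probability distribution. The logits and the temperature value are hypothetical numbers for illustration, not outputs of any particular model:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution.

    A higher temperature flattens the distribution, exposing more of the
    teacher's relative preferences between classes — the "dark knowledge"
    that hard labels throw away.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for classes [positive, neutral, negative]
teacher_logits = [4.0, 1.8, -2.0]

hard_label = "positive"                   # all-or-nothing training signal
soft_labels = softmax(teacher_logits, temperature=2.0)
print([round(p, 3) for p in soft_labels])  # → [0.723, 0.241, 0.036]
```

The soft labels preserve the teacher's view that "neutral" is a plausible second choice, information the hard label discards entirely.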

The process:

  1. Run your dataset through the large teacher model
  2. Capture the teacher's full output distribution (the logits, or the probabilities derived from them)
  3. Train the smaller student model to match these probability distributions
  4. The student learns to mimic the teacher's reasoning patterns, not just its final answers
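The steps above can be sketched as a training loss. This is a minimal, dependency-free version of the standard soft-target recipe (cross-entropy against the teacher's temperature-softened distribution, blended with cross-entropy against the ground-truth label); the logits, temperature, and mixing weight `alpha` are illustrative choices, and a real training loop would compute this over batches and backpropagate through the student:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax (max-subtraction for stability)."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Blend of two objectives, following the common soft-target recipe:

      - cross-entropy between the student's and teacher's distributions,
        both softened at temperature T (scaled by T*T so gradient
        magnitudes stay comparable across temperatures)
      - standard cross-entropy against the ground-truth hard label

    alpha weights the soft-label term; hard_label is a class index.
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft_loss = -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    hard_probs = softmax(student_logits)          # T=1 for the hard-label term
    hard_loss = -math.log(hard_probs[hard_label])
    return alpha * (T * T) * soft_loss + (1 - alpha) * hard_loss

# Hypothetical logits: a student that tracks the teacher scores lower loss
teacher = [4.0, 1.8, -2.0]
aligned_student = [3.8, 1.6, -2.2]
loss = distillation_loss(aligned_student, teacher, hard_label=0)
```

Minimizing this loss pushes the student's whole distribution toward the teacher's, which is why the student picks up more than just the final answers.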

Why distillation works so well: Large models learn subtle patterns and relationships that are difficult to extract from raw data alone. By training on the teacher's outputs, the student gets the benefit of these learned patterns without needing billions of parameters. It's like a student learning from a master teacher's explanations rather than figuring everything out independently from textbooks.

Real-world examples:

  • DistilBERT is 40% smaller and 60% faster than BERT while retaining 97% of its language-understanding performance
  • Stanford's Alpaca fine-tuned a 7B LLaMA model on ~52K outputs generated by OpenAI's text-davinci-003, recovering much of its instruction-following behavior at a fraction of the size
  • Many production chatbots use distilled models that cost 90% less to operate than the original

Cost impact: A typical example:

  • Teacher model (70B parameters): $4/hour inference, 500ms latency
  • Distilled student (7B parameters): $0.40/hour inference, 50ms latency
  • Performance retention: 90-95% on target tasks
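Those hourly rates compound quickly. A back-of-the-envelope sketch using the illustrative figures above (assuming a single always-on inference instance; real deployments autoscale and bill differently):

```python
# Hypothetical rates from the example above: one instance running 24/7.
def monthly_cost(dollars_per_hour: float, hours: float = 24 * 30) -> float:
    """Flat hourly rate times hours of uptime in a 30-day month."""
    return dollars_per_hour * hours

teacher_monthly = monthly_cost(4.00)    # 70B teacher:  2880.0
student_monthly = monthly_cost(0.40)    # 7B student:    288.0
savings = 1 - student_monthly / teacher_monthly   # ≈ 0.9, i.e. ~90% cheaper
```

A 90% cost reduction for a 5-10% performance drop is the trade distillation typically offers on a well-scoped task.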

When to use distillation:

  • You have a specific task where a large model works well but is too expensive to serve at scale
  • You need lower latency (smaller models respond faster)
  • You want to run models on edge devices (phones, IoT)
  • You've identified the exact capabilities you need and want to strip away everything else

Limitations: Distillation works best when you have a well-defined task. General-purpose distillation (trying to preserve all capabilities) is much harder and produces larger quality drops. The student can't exceed the teacher — if the teacher gets something wrong, the student will likely get it wrong too.

Ethical considerations: Some AI providers prohibit using their model outputs to train competing models. OpenAI's terms of service, for example, restrict using GPT-4 outputs to train models that compete with OpenAI products. Always check the terms of service before distilling from commercial models.