In Depth
Pruning reduces the size and computational cost of neural networks by identifying and removing parameters that contribute least to the model's performance. Just as pruning a tree removes unnecessary branches to promote healthy growth, neural network pruning removes redundant weights, neurons, or entire layers to create a more efficient model.
There are two main approaches: unstructured pruning removes individual weights (setting them to zero), while structured pruning removes entire neurons, channels, or attention heads. Structured pruning is generally more practical because it produces architectures that run faster on standard hardware, while unstructured pruning requires specialized sparse computation support to realize speedups.
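As a concrete illustration, here is a minimal NumPy sketch of the two approaches applied to a single weight matrix (the matrix values, the 50% sparsity target, and the choice of L2 norm as the neuron-importance score are illustrative assumptions, not a prescribed recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))  # one layer's weights: 4 neurons x 6 inputs

# Unstructured pruning: zero out the 50% smallest-magnitude weights.
threshold = np.quantile(np.abs(W), 0.5)
W_unstructured = np.where(np.abs(W) >= threshold, W, 0.0)

# Structured pruning: drop the 2 neurons (rows) with the smallest L2 norm,
# leaving a genuinely smaller dense matrix.
norms = np.linalg.norm(W, axis=1)
keep = np.sort(np.argsort(norms)[2:])  # indices of the 2 strongest neurons
W_structured = W[keep]

print((W_unstructured == 0).mean())  # fraction of zeroed weights, ~0.5
print(W_structured.shape)            # (2, 6): a smaller dense layer
```

Note that `W_unstructured` keeps its original shape and only becomes faster with sparse-aware kernels, while `W_structured` is a smaller dense matrix that any standard matrix-multiply routine handles directly; this is the practical advantage of structured pruning mentioned above.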
Pruning is typically applied after training (post-training pruning) or iteratively during training (gradual pruning). Research has shown that large models can often be pruned by 50-90% with minimal accuracy loss, supporting the 'lottery ticket hypothesis': dense networks contain sparse subnetworks that, when trained in isolation from their original initialization, can match the full network's accuracy. Pruning is therefore essential for deploying large models on resource-constrained devices such as phones and edge hardware.
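Gradual pruning is often driven by a schedule that ramps the target sparsity up over training, pruning aggressively early and tapering off near the end. A common choice is a cubic schedule; the sketch below shows the idea (the specific constants, such as the 90% final sparsity, are illustrative assumptions):

```python
def sparsity_at_step(step, total_steps, final_sparsity=0.9, initial_sparsity=0.0):
    """Cubic sparsity schedule for gradual pruning: the target sparsity
    rises quickly at first, then flattens as training approaches its end."""
    frac = min(step / total_steps, 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - frac) ** 3

# At each scheduled step, the magnitude threshold is recomputed so that
# exactly sparsity_at_step(step, total_steps) of the weights are zeroed.
print(sparsity_at_step(0, 1000))     # start of training: no pruning
print(sparsity_at_step(500, 1000))   # midway: most of the pruning already done
print(sparsity_at_step(1000, 1000))  # end of training: full 90% sparsity
```

Interleaving pruning with training this way gives the remaining weights time to adapt after each removal, which is why gradual pruning usually reaches higher sparsity at a given accuracy than a single post-training pass.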