In Depth
Dropout is a simple but powerful regularization technique introduced by Geoffrey Hinton and colleagues (first proposed in 2012 and described in full by Srivastava et al. in 2014). During each training step, a random subset of neurons is temporarily 'dropped out' (set to zero), forcing the network to learn redundant representations rather than relying on any single neuron or pathway. At inference time, all neurons are active, and their outputs are scaled by the keep probability so that expected activations match those seen during training. In practice, most implementations use 'inverted dropout', which applies this scaling during training instead, leaving the inference pass unchanged.
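The mechanics above can be sketched in a few lines of NumPy. This is a minimal illustration of inverted dropout, not a production implementation; the function name and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, training=True):
    """Inverted dropout: zero out each unit with probability `rate` during
    training, and rescale the survivors by 1/(1 - rate) so the expected
    activation matches inference, where all units are kept."""
    if not training or rate == 0.0:
        return x  # inference path: no masking, no scaling needed
    mask = rng.random(x.shape) >= rate  # keep each unit with prob 1 - rate
    return x * mask / (1.0 - rate)

activations = np.ones(8)
print(dropout(activations, rate=0.5))                  # some zeros, rest scaled to 2.0
print(dropout(activations, rate=0.5, training=False))  # unchanged at inference
```

Because the surviving activations are scaled up during training, the network's expected output is the same whether dropout is on or off, which is why no extra work is needed at inference.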
The intuition behind dropout is that it prevents co-adaptation, where neurons develop complex dependencies on each other that work well on training data but fail to generalize. By randomly removing neurons during training, the network learns more robust features that work even when some pathways are missing. In effect, dropout trains an implicit ensemble of many 'thinned' sub-networks that share weights, whose predictions are approximately averaged at inference time.
Dropout rates typically range from 0.1 to 0.5, with 0.5 being common for hidden layers and lower rates for input layers. While dropout remains widely used, modern architectures sometimes replace or complement it with other regularization strategies. In transformer models, dropout is commonly applied to attention weights and feed-forward layers.
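The typical per-layer rates mentioned above can be demonstrated in a small forward pass. This is a hedged sketch: the tiny two-layer MLP, its weight shapes, and the `drop` helper are all hypothetical, chosen only to show a low rate (0.1) on the input and 0.5 on the hidden layer.

```python
import numpy as np

rng = np.random.default_rng(1)

def drop(x, rate):
    # Inverted dropout: zero units with probability `rate`, rescale survivors.
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

# Hypothetical two-layer MLP forward pass (training mode) with typical rates.
x = rng.standard_normal(16)            # input features
W1 = rng.standard_normal((32, 16)) * 0.1
W2 = rng.standard_normal((4, 32)) * 0.1

h = np.maximum(0.0, W1 @ drop(x, rate=0.1))  # mild dropout on inputs, ReLU hidden layer
out = W2 @ drop(h, rate=0.5)                 # heavier dropout on hidden activations
print(out.shape)  # (4,)
```

Frameworks follow the same pattern: dropout is inserted between layers as a separate operation, active only during training.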