In Depth
Convolutional Neural Networks (CNNs) are the foundational architecture for computer vision tasks. They use convolutional layers that slide small filters (kernels) across the input, detecting local patterns like edges, corners, and textures. Deeper layers combine these simple patterns into progressively more complex features like eyes, faces, and objects.
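The sliding-filter operation described above can be sketched in a few lines of pure Python. This is an illustrative "valid" (no-padding) 2D convolution with made-up example data, not any particular library's implementation; the Sobel-like kernel is a classic vertical-edge detector, responding strongly where pixel intensity changes from left to right.

```python
# Minimal 2D "valid" convolution: slide a small kernel across the input
# and sum elementwise products at each position, as a convolutional
# layer does for each of its filters.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            # Elementwise multiply the kernel with the patch under it.
            s = sum(kernel[a][b] * image[i + a][j + b]
                    for a in range(kh) for b in range(kw))
            row.append(s)
        out.append(row)
    return out

# A dark region (0s) meeting a bright region (9s): the vertical-edge
# kernel produces large responses exactly along the boundary.
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
print(conv2d(image, kernel))  # [[0, 27, 27, 0], [0, 27, 27, 0]]
```

The output is zero over the flat dark and bright regions and peaks at the edge between them, which is exactly the "local pattern detector" behavior the text describes.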
The key advantage of CNNs is parameter sharing: the same filter is applied at every position in the image, so a learned pattern is detected wherever it appears (a cat is recognized as a cat regardless of where it sits in the image) and the parameter count drops dramatically compared to fully connected networks. Strictly speaking, convolution is translation-equivariant (shifting the input shifts the feature map); pooling layers, which progressively reduce spatial dimensions, add a degree of translation invariance and build a hierarchy from local features to global understanding.
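A quick back-of-the-envelope sketch makes both points concrete. The layer sizes below are illustrative assumptions (a 32×32 input, 100 hidden units, one 3×3 filter), and the pooling function is a minimal 2×2 max-pool, not a production implementation:

```python
# Parameter sharing: a fully connected layer's weight count scales with
# image size, while one shared 3x3 conv filter does not.
h, w, hidden = 32, 32, 100          # illustrative sizes
fc_params = h * w * hidden          # every pixel connects to every unit
conv_params = 3 * 3                 # one shared filter, any image size
print(fc_params, conv_params)       # 102400 9

def max_pool_2x2(feature_map):
    """Halve spatial dimensions by keeping the max of each 2x2 block."""
    return [
        [max(feature_map[i][j], feature_map[i][j + 1],
             feature_map[i + 1][j], feature_map[i + 1][j + 1])
         for j in range(0, len(feature_map[0]), 2)]
        for i in range(0, len(feature_map), 2)
    ]

fm = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool_2x2(fm))  # [[4, 2], [2, 8]]
```

Note that each pooled value only records that a strong activation occurred somewhere within its 2×2 window, not exactly where, which is the source of the small translation invariance pooling provides.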
Landmark CNN architectures include LeNet (1989), AlexNet (2012, which sparked the deep learning revolution), VGGNet, GoogLeNet/Inception, and ResNet. While vision transformers (ViT) have emerged as strong alternatives, CNNs remain widely used, especially in mobile and edge applications where their efficiency advantages matter. Many modern architectures combine convolutional and attention mechanisms for the best of both worlds.