In Depth

The Vision Transformer (ViT), introduced by researchers at Google in 2020 (Dosovitskiy et al., "An Image is Worth 16x16 Words"), applies the transformer architecture to images by dividing them into fixed-size patches (typically 16×16 pixels), treating each patch as a "token," and processing the resulting sequence with a standard transformer encoder. This was a paradigm shift: it showed that a pure transformer, with no convolutional layers at all, could match or exceed CNNs on image classification.
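The patch tokenization step can be sketched in a few lines. This is an illustrative reimplementation, not the original code: the function name and shapes are our own, and a real ViT would follow this with a learned linear projection plus position embeddings.

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into a sequence of flattened patches.

    Illustrative sketch of ViT tokenization (assumed names/shapes):
    each patch_size x patch_size x C block becomes one "token" vector.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    n_h, n_w = h // patch_size, w // patch_size
    # Carve the image into a (n_h, n_w) grid of patches...
    patches = image.reshape(n_h, patch_size, n_w, patch_size, c)
    # ...then flatten each patch into a single vector.
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(n_h * n_w, patch_size * patch_size * c)

# A 224x224 RGB image becomes 196 tokens of dimension 768.
tokens = image_to_patches(np.zeros((224, 224, 3)), patch_size=16)
print(tokens.shape)  # (196, 768)
```

With the standard 224×224 input and 16×16 patches, the sequence length (196) is short enough that full self-attention over all patches is tractable.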

ViT demonstrated that the inductive biases of CNNs (translation equivariance, local connectivity) are not strictly necessary given sufficient training data. With large-scale pre-training (the original paper pre-trained on Google's internal JFT-300M dataset), ViTs can learn these patterns from data alone. However, ViTs typically require more training data than CNNs to reach similar performance, since they start with fewer built-in assumptions about image structure.

Vision Transformers have since become the dominant architecture for many vision tasks, especially at scale. Variants like DeiT (Data-efficient Image Transformer), Swin Transformer (hierarchical windows), and BEiT (self-supervised pre-training) have addressed various limitations. ViTs also enable unified multimodal architectures where the same transformer processes both text and image tokens, which is fundamental to models like CLIP, GPT-4V, and Gemini.
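The unified-architecture idea rests on a simple property: once text and image patches are embedded to the same width, a single transformer can attend over the combined sequence. A minimal sketch, with assumed dimensions and random stand-ins for the learned embedding layers:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512  # shared embedding width (illustrative value)

# Stand-ins for learned embeddings: a real model would produce these
# with a text embedding table and a patch projection, respectively.
text_tokens = rng.standard_normal((10, d_model))    # 10 text tokens
image_tokens = rng.standard_normal((196, d_model))  # 196 image patches

# Because both modalities share d_model, one transformer encoder can
# process the concatenated sequence without modality-specific towers.
sequence = np.concatenate([text_tokens, image_tokens], axis=0)
print(sequence.shape)  # (206, 512)
```

Note that the models named above differ in how far they take this: CLIP keeps separate text and image encoders joined by a contrastive objective, while interleaved-token models feed mixed sequences like this one to a single transformer.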