In Depth
Contrastive learning trains models by comparing pairs of examples. The model learns to pull representations of similar (positive) pairs closer together in its internal representation space while pushing dissimilar (negative) pairs apart. This approach allows models to learn meaningful representations without requiring manually labeled data.
In computer vision, contrastive learning frameworks like SimCLR and MoCo create positive pairs by applying different augmentations to the same image (two crops of the same photo should have similar representations), while negative pairs come from other images in the batch. (BYOL, a closely related method, learns similar representations but dispenses with negative pairs entirely.) In language, contrastive learning might treat different passages about the same topic as positive pairs and unrelated passages as negative pairs.
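The pull-together/push-apart objective described above is commonly implemented as the InfoNCE (or NT-Xent) loss: each example's augmented partner is treated as the correct "class" in a softmax over the whole batch. Here is a minimal NumPy sketch; the function name and shapes are illustrative, not taken from any particular library.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE loss over a batch of positive pairs.

    z_a, z_b: (N, D) arrays of embeddings. Row i of z_a and row i of z_b
    form a positive pair (e.g. two augmentations of the same image);
    every other row of z_b serves as a negative for z_a[i].
    """
    # L2-normalize so dot products become cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # Similarity matrix: entry (i, j) compares z_a[i] with z_b[j].
    # The diagonal holds the positive-pair similarities.
    logits = (z_a @ z_b.T) / temperature

    # Cross-entropy with the diagonal as the target class:
    # minimizing this pulls positives together and pushes negatives apart.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

With perfectly matched pairs the loss approaches zero; for unrelated embeddings it sits near log(N), the cost of guessing the partner at random among N candidates. The temperature controls how sharply the softmax concentrates on the hardest negatives.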
Contrastive learning has proven remarkably effective for learning general-purpose representations that transfer well to downstream tasks. CLIP (Contrastive Language-Image Pre-training) used contrastive learning to align text and image representations, enabling zero-shot image classification. The technique is also fundamental to modern embedding models used in semantic search and retrieval-augmented generation systems.
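CLIP-style zero-shot classification follows directly from the aligned representation space: embed the image, embed a text prompt for each candidate class, and pick the class whose text embedding is most similar. A hedged sketch of that inference step, assuming the two encoders have already produced embeddings (the function and shapes here are hypothetical, not CLIP's actual API):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class whose text embedding best matches the image.

    image_emb: (D,) embedding of one image.
    class_text_embs: (C, D) embeddings of C class-name prompts
    (e.g. "a photo of a dog", "a photo of a cat").
    Assumes both come from encoders trained to share one space, as in CLIP.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img  # cosine similarity of each class prompt to the image
    return int(np.argmax(sims))
```

No classifier is trained for the label set: swapping in new class prompts immediately yields a new classifier, which is what makes the approach "zero-shot".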