In Depth
InfiniBand is a networking standard that provides very high throughput (up to 400 Gb/s per port with NDR) and ultra-low latency for connecting compute nodes. In AI training clusters, thousands of GPUs across hundreds of servers must communicate constantly to synchronize gradients and share data, making network performance a critical bottleneck.
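To make that communication pattern concrete, the sketch below shows a bare-bones gradient all-reduce, assuming PyTorch with the NCCL backend (which uses InfiniBand verbs for the inter-node leg when the fabric is present). The function name `sync_gradients` and the layer sizes are illustrative, not from any particular framework; production code such as DistributedDataParallel fuses and overlaps these calls rather than issuing one per parameter.

```python
# Minimal sketch of the gradient synchronization that drives inter-node
# traffic. Launch with torchrun, one process per GPU; torchrun sets
# LOCAL_RANK and the rendezvous variables that init_process_group reads.
import os
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """All-reduce every gradient across all ranks and average them."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # A ring all-reduce moves roughly 2*(N-1)/N times the tensor
            # size over the network, so this step scales with model size.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")  # NCCL picks the IB transport if available
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    model = torch.nn.Linear(4096, 4096).cuda()
    loss = model(torch.randn(8, 4096, device="cuda")).sum()
    loss.backward()
    sync_gradients(model)
    dist.destroy_process_group()
```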
NVIDIA acquired InfiniBand leader Mellanox in 2020, giving it control over both the compute (GPUs) and networking layers of AI infrastructure. NVIDIA's DGX and HGX systems use InfiniBand for inter-node communication, while NVLink and NVSwitch handle intra-node GPU-to-GPU communication. This integrated stack enables the massive distributed training runs that produce frontier AI models.
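The split between the two transports is visible in NCCL's configuration: the library routes same-node GPU traffic over NVLink/NVSwitch automatically and uses the network for everything else. Below is a minimal sketch using NCCL's documented environment variables; the HCA name `mlx5_0` is illustrative and cluster-specific.

```python
# Hedged sketch: steer NCCL onto InfiniBand for inter-node transfers.
# Set these before dist.init_process_group() in the same process.
import os

os.environ.setdefault("NCCL_IB_DISABLE", "0")   # allow the InfiniBand transport
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0")  # which host channel adapter(s) to use
os.environ.setdefault("NCCL_DEBUG", "INFO")     # log which transport each channel selected
```

With `NCCL_DEBUG=INFO`, the startup log reports whether each channel went over peer-to-peer NVLink or the IB network, which is a quick way to confirm the fabric is actually being used.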
For organizations building their own AI clusters, the choice between InfiniBand and Ethernet (including Ultra Ethernet and RoCE) is a major infrastructure decision. InfiniBand offers superior performance for AI workloads but requires specialized switches, cables, and expertise. Ethernet-based alternatives are improving and offer the advantages of reusing existing network infrastructure and drawing on a broader, multi-vendor ecosystem.