In Depth

Mixture-of-experts (MoE) models replace some dense layers with many parallel "expert" sub-networks plus a learned router. For each input token, the router activates only a small subset of experts, so the total parameter count can be very large while per-token compute stays close to that of a much smaller dense model — making MoE models faster and cheaper to run than dense models of comparable capability. Mixtral uses an MoE architecture (8 experts per layer with top-2 routing), and GPT-4 is widely believed to use one as well.
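The routing idea can be illustrated with a minimal sketch. This is not any particular model's implementation — the layer sizes, expert count, and `top_k` value are illustrative assumptions — but it shows the core mechanism: a gating network scores all experts, only the top-k are actually computed, and their outputs are mixed by softmax weights over the selected scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration): 8 experts, top-2 routing.
d_model, n_experts, top_k = 16, 8, 2

# Router (gating) weights and one weight matrix per expert.
W_gate = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):
    """Route one token vector x through its top-k experts only."""
    logits = x @ W_gate                   # one score per expert, shape (n_experts,)
    top = np.argsort(logits)[-top_k:]     # indices of the top-k experts
    # Softmax over only the selected logits -> mixture weights summing to 1.
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()
    # Only top_k of the n_experts matrices are touched for this token,
    # so per-token compute scales with top_k, not with n_experts.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)  # (16,)
```

Here all 8 experts' parameters exist in memory, but each token pays the compute cost of only 2 of them — the source's "large in total parameters, but only a fraction used per input."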