In Depth
GPT-4o, Gemini, and Claude are prominent multimodal LLMs. A common way to build a multimodal system is to train a separate encoder for each modality and project each encoder's output into a shared embedding space that the language model can reason over. Multimodal capability is expanding rapidly into video understanding, real-time speech conversation, and analysis of documents that mix text and images.
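The encoder-plus-projection idea can be illustrated with a small sketch. This is a toy example under stated assumptions, not how any of the models named above is implemented: the encoder outputs are random stand-ins, the projection matrices are random rather than learned, and every name and dimension (`SHARED_DIM`, `make_projection`, `project`, the feature widths) is hypothetical.

```python
import random

SHARED_DIM = 8  # hypothetical width of the language model's embedding space


def make_projection(in_dim, out_dim, seed):
    """Return an in_dim x out_dim matrix standing in for a learned projection."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]


def project(features, weights):
    """Map feature vectors of shape (n, in_dim) to shape (n, out_dim)."""
    out_dim = len(weights[0])
    return [
        [sum(vec[i] * weights[i][j] for i in range(len(vec))) for j in range(out_dim)]
        for vec in features
    ]


# Random stand-ins for what real modality encoders would produce.
rng = random.Random(0)
image_features = [[rng.random() for _ in range(16)] for _ in range(4)]  # 4 patches, 16-dim
audio_features = [[rng.random() for _ in range(12)] for _ in range(3)]  # 3 frames, 12-dim
text_embeddings = [[rng.random() for _ in range(SHARED_DIM)] for _ in range(5)]  # 5 tokens

# One projection per modality, each mapping into the same shared space.
image_proj = make_projection(16, SHARED_DIM, seed=1)
audio_proj = make_projection(12, SHARED_DIM, seed=2)

image_tokens = project(image_features, image_proj)
audio_tokens = project(audio_features, audio_proj)

# After projection, every modality has the same width, so the vectors can be
# concatenated into one sequence for the language model to consume.
sequence = image_tokens + audio_tokens + text_embeddings
print(len(sequence), len(sequence[0]))  # prints "12 8"
```

The point of the sketch is only the shape bookkeeping: encoders may emit vectors of different widths, and the per-modality projections are what make them interchangeable with text token embeddings.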