What It Is
A foundation model is a large AI model trained on broad, diverse data that can be adapted to many downstream tasks. The term was coined by Stanford's Center for Research on Foundation Models (CRFM) in 2021 to describe the emerging paradigm where a single pre-trained model serves as the foundation for thousands of specialized applications.
GPT-4, Claude, Gemini, LLaMA, and Stable Diffusion are all foundation models. They are trained once at enormous cost ($10 million to $500+ million) on massive datasets, then deployed across many use cases through fine-tuning, prompt engineering, or retrieval-augmented generation. This "train once, use many" paradigm fundamentally changed AI economics.
Before foundation models, AI development required training a separate model for each task — one for translation, one for sentiment analysis, one for summarization. Foundation models unify these capabilities in a single system, dramatically reducing the cost and effort to deploy AI in new domains.
Characteristics
Scale — foundation models are defined by their scale of training. Large language models contain hundreds of billions of parameters trained on trillions of tokens of text. Vision foundation models train on billions of images. Multimodal models process multiple data types simultaneously. This scale enables emergent capabilities — abilities that appear only in sufficiently large models.
Self-supervised pre-training — foundation models learn from raw data without human-provided labels. Language models predict the next token in text. Vision models learn to reconstruct masked image patches. This self-supervised approach leverages the vast amount of unlabeled data available on the internet, scaling training far beyond what human annotation could support.
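The next-token objective described above can be sketched in a few lines: every position in a raw token sequence becomes a training pair automatically, with no human labeling. The toy corpus below is invented for illustration.

```python
# Minimal sketch of the next-token prediction objective behind
# self-supervised pre-training: each prefix of the sequence is a
# training input, and the token that follows it is the target.
tokens = ["the", "cat", "sat", "on", "the", "mat"]  # toy corpus (assumed)

# Shift by one: the model sees tokens[:i] and must predict tokens[i].
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(f"context={context} -> target={target!r}")
```

A six-token sequence yields five such pairs; at internet scale, trillions of tokens yield trillions of training signals for free, which is why this approach scales far beyond human annotation.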
Generality — a well-trained foundation model performs competently on tasks it was never explicitly trained for. GPT-4 can write code, analyze medical images, compose music, and solve math problems — none of which were distinct training objectives. This generality emerges from learning rich representations of the world through diverse training data.
Adaptability — foundation models can be specialized through multiple mechanisms:
- Fine-tuning — continuing training on task-specific data, typically with a small learning rate
- Prompt engineering — crafting input instructions that steer model behavior
- In-context learning — providing examples in the prompt that the model learns from at inference time
- Retrieval-augmented generation (RAG) — retrieving relevant external documents and supplying them to the model at inference time
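The cheapest of these mechanisms, in-context learning, can be sketched without any model at all: specialization happens purely by packing labeled examples into the prompt, with no weight updates. The reviews, labels, and helper function below are invented for illustration.

```python
# Sketch of in-context (few-shot) learning: a general-purpose model is
# steered toward sentiment classification solely by the prompt's contents.
def build_few_shot_prompt(examples, query):
    """Format (text, label) demonstration pairs followed by a new query."""
    lines = ["Classify the sentiment of each review as positive or negative.\n"]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    # The prompt ends mid-pattern, so the model's next tokens are the label.
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

examples = [
    ("Loved every minute of it.", "positive"),
    ("A complete waste of time.", "negative"),
]
prompt = build_few_shot_prompt(examples, "Surprisingly good!")
print(prompt)
```

The same string would be sent to any chat or completion API; swapping the examples retargets the model to a different task without touching its weights, which is what makes this adaptation path essentially free compared to fine-tuning.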
Major Foundation Models
Language models — GPT-4 and GPT-4o (OpenAI), Claude 3.5 and Claude 4 (Anthropic), Gemini 1.5 and 2.0 (Google), LLaMA 3 (Meta), Mistral Large (Mistral AI), and Command R+ (Cohere). These models understand and generate text across languages, domains, and task types. See large language models.
Vision models — DINOv2 (Meta), SAM (Segment Anything Model, Meta), and CLIP (OpenAI) serve as visual foundations. SAM segments any object in any image without training on that object type. DINOv2 provides visual features that transfer to downstream vision tasks.
Multimodal models — GPT-4o, Gemini 2.0, and Claude process text, images, audio, and video in a single model. See multimodal AI. This unification enables applications that reason across modalities — analyzing charts, describing images, and processing documents with mixed content.
Code models — Codex (OpenAI), Code LLaMA (Meta), and StarCoder train on code repositories and documentation. They generate, complete, explain, and debug code across programming languages.
Domain-specific foundations — Med-PaLM (medical), BloombergGPT (finance), and Galactica (science) are trained and evaluated on domain-specific data. These models outperform general-purpose models on domain tasks while sacrificing breadth.
Economic Impact
Foundation models restructure AI economics:
Amortized development cost — the enormous training cost is amortized across millions of users and thousands of applications. Per-user cost of AI capability drops dramatically compared to task-specific model development.
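The arithmetic behind amortization is worth making concrete. All figures below are illustrative assumptions, not actual provider economics:

```python
# Back-of-envelope amortization of a one-time pre-training cost across
# a large user base. Every number here is a hypothetical round figure.
training_cost = 200_000_000   # one-time pre-training cost, USD (assumed)
monthly_users = 100_000_000   # active users across downstream apps (assumed)
months = 24                   # useful lifetime before the next generation (assumed)

cost_per_user_month = training_cost / (monthly_users * months)
print(f"${cost_per_user_month:.4f} per user-month")  # prints $0.0833 per user-month
```

Even a nine-figure training run dilutes to a fraction of a cent per user-month at this scale, which is why serving (inference) cost, not training cost, tends to dominate a provider's ongoing economics.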
API economy — foundation model providers offer API access, enabling developers to build AI applications without training models. OpenAI, Anthropic, and Google generate revenue from API usage. This creates a layered ecosystem: model providers, platform builders, and application developers.
Open-source competition — Meta's LLaMA models, Stability AI's open models, and community-developed variants (Mistral, Falcon) provide free alternatives to commercial APIs. Open-source models reduce the cost floor and prevent vendor lock-in, but may lag behind proprietary models in capability.
Build vs. buy — organizations choose between fine-tuning open-source models (more control, higher expertise required), using commercial APIs (simpler, vendor dependent), or training custom models (maximum control, massive cost). In practice, many organizations prototype on commercial APIs, then move sustained, high-volume workloads to open-source models in production.
Risks and Concerns
Homogenization — when many applications build on the same foundation model, they inherit its biases, limitations, and failure modes. A bug or bias in GPT-4 propagates to thousands of downstream products. This concentration of dependency is unlike previous technology platforms.
Power concentration — only a handful of organizations have the resources to train frontier foundation models. This concentrates AI capability and influence in a small number of companies, raising concerns about market power and governance.
Dual use — foundation models can be used for beneficial and harmful purposes. The same model that helps researchers write papers can generate disinformation. Balancing access with safety is an ongoing challenge. See AI safety.
Environmental cost — training frontier foundation models consumes enormous energy. GPT-4's training is estimated to have cost over $100 million in compute, with a correspondingly large energy footprint. See AI and climate.
Challenges
- Evaluation complexity — foundation models perform thousands of tasks, making comprehensive evaluation difficult. Standard benchmarks capture only a fraction of model capabilities and limitations. Models can perform well on benchmarks while failing on real-world tasks.
- Alignment — ensuring foundation models behave as intended and align with human values is a fundamental challenge. Reinforcement learning from human feedback (RLHF) and constitutional AI are current approaches, but alignment remains an unsolved problem. See AI safety.
- Moat erosion — the gap between frontier and open-source models narrows with each generation. Foundation model providers must continuously innovate or compete on distribution, trust, and ecosystem rather than raw capability.
- Liability — when a foundation model causes harm in a downstream application, questions of liability arise. Is the model provider, the application developer, or the end user responsible? Legal frameworks haven't resolved this question.
- Data provenance — foundation models train on internet-scale datasets of uncertain provenance. Copyright claims from content creators (authors, artists, publishers) challenge the legality of training data use. See AI regulation.