What It Is

Retrieval-augmented generation (RAG) is an architecture that enhances large language models by retrieving relevant documents from an external knowledge base before generating a response. Instead of relying solely on knowledge encoded in model weights during training, RAG systems dynamically fetch current, specific information and include it in the model's context window.

The concept was formalized by Facebook AI Research (now Meta AI) in a 2020 paper (Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"), and the pattern has since become the dominant approach for enterprise AI applications. RAG addresses the fundamental limitations of static LLMs: knowledge cutoff dates, hallucination of facts, and inability to access proprietary data. By 2026, the majority of production LLM deployments use some form of RAG.

How It Works

A RAG pipeline has three stages: indexing, retrieval, and generation.

Indexing — source documents (PDFs, web pages, databases, wikis) are split into chunks, typically 256-1024 tokens each. Each chunk is converted to a vector embedding using a model like OpenAI's text-embedding-3, Cohere Embed, or open-source alternatives like BGE or E5. These vectors are stored in a vector database — Pinecone, Weaviate, Qdrant, Chroma, or pgvector.
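
The indexing stage can be sketched in a few lines. Everything here is a toy stand-in: chunk() splits on characters rather than tokens, embed() is a hash-based placeholder for a real embedding model like text-embedding-3, and the "vector database" is a plain list.

```python
import hashlib
import math

def chunk(text, size=40, overlap=10):
    # Split text into overlapping character windows. Real systems split
    # on tokens or sentences, with chunks of 256-1024 tokens.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text, dim=8):
    # Toy stand-in for a real embedding model: hash each word into a
    # fixed-size vector, then L2-normalize.
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# "Vector database": here just a list of (chunk, vector) pairs.
doc = "RAG retrieves relevant chunks from an external knowledge base before generating an answer."
index = [(c, embed(c)) for c in chunk(doc)]
```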

Retrieval — when a user asks a question, the query is embedded using the same model and compared against stored vectors using similarity search (cosine similarity or approximate nearest neighbors). The top-k most relevant chunks are retrieved. Advanced systems combine vector search with keyword search (BM25) in a hybrid approach, then rerank results using a cross-encoder model.
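
A minimal version of the retrieval step, assuming chunk embeddings are already stored. The index contents and vectors below are illustrative, and the scan is exact rather than the approximate nearest-neighbor search a real vector database would use:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, index, k=2):
    # Exact scan over all stored vectors; production stores use ANN
    # structures (e.g. HNSW) to avoid comparing against every chunk.
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return scored[:k]

# Toy index of (chunk_text, embedding) pairs; vectors are invented.
index = [
    ("refund policy: 30 days", [0.9, 0.1, 0.0]),
    ("shipping takes 3-5 days", [0.1, 0.9, 0.2]),
    ("warranty covers defects", [0.0, 0.2, 0.9]),
]
hits = top_k([0.8, 0.2, 0.1], index)  # query vector close to the refund chunk
```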

Generation — retrieved chunks are inserted into the LLM's prompt as context, along with the user's question. The model generates an answer grounded in the provided documents. The prompt typically instructs the model to cite sources and acknowledge when retrieved context doesn't contain the answer.
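
The prompt-assembly step in miniature; the instruction wording and citation format below are illustrative, not a standard template:

```python
def build_prompt(question, chunks):
    # Number each retrieved chunk so the model can cite it as [n].
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below. Cite sources as [n]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days of purchase.",
     "Shipping typically takes 3-5 business days."],
)
# prompt is then sent to the LLM along with any system instructions
```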

Advanced RAG Patterns

Hybrid search — combining dense vector retrieval with sparse keyword matching (BM25) improves recall, especially for queries involving specific names, codes, or technical terms that embeddings may not capture well.
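
One common way to fuse the two result lists is reciprocal rank fusion (RRF), which merges rankings without comparing raw BM25 and cosine scores directly. The document IDs below are illustrative:

```python
def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: each list contributes 1/(k + rank) per
    # document, so items ranked well by both retrievers rise to the top.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]  # dense retrieval order
bm25_hits = ["d1", "d9", "d3"]    # keyword retrieval order
fused = rrf([vector_hits, bm25_hits])
```

"d1" wins here because it appears near the top of both lists, which is exactly the behavior hybrid search is after.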

Reranking — initial retrieval casts a wide net (top-50 to top-100 results), then a cross-encoder reranker scores each result against the query for finer relevance. Cohere Rerank and open-source models like BGE-reranker are commonly used.
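
The reranking step in miniature. The word-overlap scorer below is a stand-in for a real cross-encoder such as BGE-reranker, which scores the query and passage jointly with a transformer:

```python
def cross_encoder_score(query, passage):
    # Stand-in for a real cross-encoder: fraction of query words that
    # appear in the passage. A real model captures semantics, not overlap.
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

def rerank(query, candidates, keep=2):
    # Candidates come from the wide-net first stage (top-50 to top-100);
    # the reranker re-scores each pair and keeps only the best few.
    return sorted(candidates,
                  key=lambda p: cross_encoder_score(query, p),
                  reverse=True)[:keep]

query = "how long is the refund window"
candidates = [
    "shipping takes 3-5 business days",
    "the refund window is 30 days",
    "our warranty covers manufacturing defects",
]
best = rerank(query, candidates)
```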

Query transformation — the original user query may not be optimal for retrieval. Techniques include HyDE (generating a hypothetical answer to use as the search query), query decomposition (breaking complex questions into sub-queries), and query expansion (adding synonyms or related terms).
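
Query expansion is the simplest of these to sketch. The synonym table below is illustrative; production systems typically generate expansions (and HyDE passages or sub-queries) with an LLM:

```python
# Illustrative synonym table; a real system would use an LLM or a
# domain-specific thesaurus rather than a hard-coded dict.
SYNONYMS = {
    "refund": ["return", "reimbursement"],
    "laptop": ["notebook"],
}

def expand(query):
    # Append known synonyms so sparse (BM25) retrieval also matches
    # documents that use different wording for the same concept.
    terms = query.lower().split()
    extra = [s for t in terms for s in SYNONYMS.get(t, [])]
    return " ".join(terms + extra)

expanded = expand("laptop refund policy")
```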

Agentic RAG — AI agents orchestrate multi-step retrieval, deciding which knowledge sources to query, reformulating queries based on initial results, and synthesizing information across multiple retrievals. This approach handles complex research questions that require reasoning across documents.
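
A minimal agentic loop, sketched with hypothetical search() and llm() callables rather than any particular framework; the ANSWER:/SEARCH: reply protocol is an illustrative convention:

```python
def agentic_rag(question, search, llm, max_steps=3):
    # Loop: retrieve, ask the model whether the evidence suffices,
    # and reformulate the query if not. `search` returns a list of
    # text snippets; `llm` maps a prompt string to a reply string.
    query, evidence = question, []
    for _ in range(max_steps):
        evidence += search(query)
        decision = llm(f"Question: {question}\nEvidence: {evidence}\n"
                       "Reply ANSWER: <text> or SEARCH: <new query>")
        if decision.startswith("ANSWER:"):
            return decision[len("ANSWER:"):].strip()
        query = decision[len("SEARCH:"):].strip()
    # Out of steps: answer from whatever evidence was gathered.
    return llm(f"Answer from evidence: {evidence}\nQuestion: {question}")
```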

Graph RAG — augmenting vector retrieval with knowledge graphs enables relationship-aware retrieval. Microsoft's GraphRAG implementation extracts entities and relationships from documents, builds a graph, and uses graph traversal alongside vector search.
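
The graph side can be illustrated with a toy adjacency map. Entity extraction (itself an LLM step) is omitted, and the entities and relations below are invented:

```python
# Toy knowledge graph: entity -> list of (relation, entity) edges,
# as would be extracted from documents by an LLM extraction pass.
GRAPH = {
    "AcmeCorp": [("acquired", "WidgetCo"), ("ceo", "J. Doe")],
    "WidgetCo": [("product", "Widget 9000")],
}

def neighborhood(entity, hops=2):
    # Graph traversal: expand outward from an entity surfaced by vector
    # search, collecting relationship facts to add to the LLM context.
    facts, frontier = [], [entity]
    for _ in range(hops):
        nxt = []
        for e in frontier:
            for rel, other in GRAPH.get(e, []):
                facts.append(f"{e} -{rel}-> {other}")
                nxt.append(other)
        frontier = nxt
    return facts

facts = neighborhood("AcmeCorp")
```

Two hops from "AcmeCorp" surface the acquisition and, through it, WidgetCo's product, a relationship a chunk-level vector search could easily miss.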

Enterprise Applications

RAG is the backbone of enterprise AI because it lets organizations apply LLMs to their proprietary data without fine-tuning.

Customer support — RAG systems retrieve from product documentation, knowledge bases, and past tickets to generate accurate support responses. Companies report 40-60% reductions in ticket resolution time.

Legal research — law firms index case law, statutes, and internal memos. RAG systems find relevant precedents and generate analysis grounded in actual legal text.

Financial analysis — analysts query earnings reports, SEC filings, and market research. RAG systems surface relevant data points and generate summaries with citations.

Internal knowledge management — enterprises index Confluence, SharePoint, Slack, and email to create AI assistants that answer employee questions using institutional knowledge.

Evaluation and Quality

RAG system quality depends on both retrieval accuracy and generation faithfulness. Key metrics include:

  • Retrieval recall — does the system find the relevant documents?
  • Context relevance — are the retrieved chunks actually relevant to the query?
  • Faithfulness — does the generated answer accurately reflect the retrieved context, without adding unsupported claims?
  • Answer relevance — does the response actually address the user's question?
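
The retrieval-side metrics are simple enough to compute from scratch; the document IDs here are illustrative, and faithfulness/answer relevance typically require an LLM judge:

```python
def retrieval_recall(retrieved_ids, relevant_ids):
    # Fraction of the known-relevant documents that retrieval found.
    relevant = set(relevant_ids)
    return len(relevant & set(retrieved_ids)) / len(relevant) if relevant else 1.0

def context_precision(retrieved_ids, relevant_ids):
    # Fraction of retrieved chunks that are actually relevant.
    retrieved = set(retrieved_ids)
    return len(retrieved & set(relevant_ids)) / len(retrieved) if retrieved else 0.0

recall = retrieval_recall(["d1", "d2", "d3"], ["d1", "d4"])      # found 1 of 2
precision = context_precision(["d1", "d2", "d3"], ["d1", "d4"])  # 1 of 3 relevant
```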

Frameworks like RAGAS, DeepEval, and TruLens automate these evaluations. Production systems track these metrics continuously and alert on quality degradation.

Challenges

  • Chunking strategy — how you split documents dramatically affects retrieval quality. Too small and context is lost; too large and irrelevant content dilutes the signal. No universal chunking strategy works for all document types.
  • Embedding model limitations — embedding models compress meaning into fixed-size vectors, losing nuance. Long documents, tables, and multi-topic passages embed poorly.
  • Context window limits — even with large context windows, stuffing too many retrieved chunks degrades generation quality. Selecting the right amount of context is an art.
  • Stale indexes — knowledge bases change. RAG systems must re-index updated documents or risk serving outdated information. Incremental indexing pipelines add operational complexity.
  • Cost at scale — embedding generation, vector storage, and the extra tokens in augmented prompts add up. A high-traffic RAG system processes billions of tokens per day.