Retrieval Augmented Generation (RAG)

In Retrieval Augmented Generation (RAG), an AI system is connected to an external knowledge store. This allows the AI's responses to be based not only on training data but also to be supplemented with relevant sources.

RAG is the most common form of Grounding. Grounding is the overarching principle of coupling model responses to verifiable external sources rather than solely to training patterns.

This allows current information or company-specific knowledge to be integrated without the model needing to be retrained.

Retrieval Augmented Generation is primarily used in enterprise AI for: knowledge management, assistance for sales and service, technical documentation, and support chatbots.

A RAG pipeline retrieves relevant text passages from a document collection, database, or knowledge graph for each user request. It adds them to the prompt and has the language model generate an answer with source attribution.

The term was coined in 2020 by Patrick Lewis et al. in "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (arXiv 2005.11401, NeurIPS 2020).

Core idea: A parametric memory (the pretrained language model) is connected to a non-parametric memory (a dense vector index over Wikipedia). A neural retriever accesses it at inference time.
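The core idea can also be written as a formula. The sketch below summarizes the RAG-Sequence variant from the cited paper: the retrieved passage z is treated as a latent variable, and the model marginalizes over the top-k passages returned by the retriever. Here x is the query, y the generated output, p_η the retriever, and p_θ the generator. Treat it as a summary and check the paper for the exact formulation.

```latex
p_{\text{RAG-Sequence}}(y \mid x) \;\approx\;
  \sum_{z \,\in\, \text{top-}k\left(p_\eta(\cdot \mid x)\right)}
  p_\eta(z \mid x)\; p_\theta(y \mid x, z)
```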

How RAG Works: The Four Steps of a Pipeline

A RAG pipeline has four phases: knowledge base indexing, retrieval at query time, prompt augmentation, and response generation. Three of these run with every individual query. Only indexing happens beforehand or in the background.

Knowledge Base Indexing

In the indexing phase, source documents—manuals, tickets, product data, PDFs, wiki pages—are broken down into smaller sections (chunks). An embedding model converts them into vectors, which are stored in a vector database. Chunk sizes between 128 and 512 tokens are common. Fact-based content with keyword matching benefits from smaller chunks (128–256 tokens), while conceptual content and summaries benefit from larger ones (256–512 tokens).
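The indexing step can be sketched in a few lines. This is a minimal illustration, not a production setup: `toy_embed` is a hypothetical stand-in for a real embedding model (it only hashes words into buckets), and a plain Python list plays the role of the vector database. All function and field names are assumptions made for this example.

```python
import hashlib
import math

DIM = 64  # toy dimensionality; real embedding models use far more dimensions


def toy_embed(text: str) -> list[float]:
    """Hash each word into one of DIM buckets and L2-normalize.

    Hypothetical stand-in for an embedding model: it captures word
    overlap only, not meaning.
    """
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(chunks: list[dict]) -> list[dict]:
    """Attach a vector to every chunk; the list stands in for the vector store."""
    return [{**chunk, "vector": toy_embed(chunk["text"])} for chunk in chunks]


def search(index: list[dict], query: str, k: int = 2) -> list[dict]:
    """Return the k chunks with the highest cosine similarity to the query."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, e["vector"])), e) for e in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]


index = build_index([
    {"doc_id": "manual", "text": "reset the pump pressure valve"},
    {"doc_id": "contract", "text": "invoice payment terms net thirty days"},
])
top_hit = search(index, "pump pressure reset", k=1)[0]
```

Keeping `doc_id` next to the vector is what later makes source citations possible.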

Dedicated vector stores (the Qdrant database, the FAISS library) or extensions to existing systems (pgvector for PostgreSQL, Elasticsearch/OpenSearch) are suitable as storage. The right choice depends more on scaling and existing infrastructure than on the RAG concept itself.

Retrieval at query time

When a user query arrives, it's also embedded and matched against the vector index. The retriever delivers the k most similar chunks. In practice, pure dense retrieval is rarely sufficient: BM25 is more reliable for exact tokens like product codes, function names, or rare proper nouns. Dense vectors, on the other hand, capture paraphrases and intent. Therefore, hybrid search dominates in production systems: both methods run in parallel, and then the result lists are merged.
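The text does not say how the two result lists are merged. One widely used option is Reciprocal Rank Fusion (RRF), sketched here as an assumption rather than as the method any particular system uses. The constant `k = 60` is a common default, and the chunk IDs are made up for the example.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists: each list contributes 1 / (k + rank) per ID."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["c7", "c2", "c9"]   # hypothetical keyword-search ranking
dense_hits = ["c2", "c4", "c7"]  # hypothetical vector-search ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # ['c2', 'c7', 'c4', 'c9']
```

RRF only needs ranks, not scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales.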

Augmentation and generation

The selected chunks are combined with the original question and a system instruction to form a prompt; this is the augmentation step. The LLM generates an answer from this prompt and should cite the sources used. This reference to sources is what distinguishes a RAG answer from a free LLM answer: it makes the statement verifiable and shifts the responsibility from "the model claims" to "the model cites."
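A minimal sketch of the augmentation step, assuming each retrieved chunk carries a `source` field. The prompt wording, the field names, and the `[n]` citation format are illustrative choices, not a fixed standard.

```python
DEFAULT_INSTRUCTION = (
    "Answer using only the sources below. "
    "Cite each statement as [n]. If the sources do not contain the answer, say so."
)


def build_prompt(question: str, chunks: list[dict],
                 system_instruction: str = DEFAULT_INSTRUCTION) -> str:
    """Combine instruction, numbered sources, and the question into one prompt."""
    lines = [system_instruction, "", "Sources:"]
    for number, chunk in enumerate(chunks, start=1):
        lines.append(f"[{number}] ({chunk['source']}) {chunk['text']}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


prompt = build_prompt(
    "How do I reset the pump?",
    [
        {"source": "manual.pdf#p12", "text": "Hold the reset button for five seconds."},
        {"source": "ticket-4711", "text": "Reset fails if the valve is open."},
    ],
)
```

Numbering the sources in the prompt gives the model something concrete to cite and lets the application map `[n]` back to a document afterwards.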

RAG, Fine-Tuning, or Long-Context? Which to Use When

The three approaches are often presented as competitors, but they are answers to different questions. RAG does not change the model but rather augments it with a knowledge source. Fine-tuning changes the weights. Long-context merely increases the input window.

RAG vs. Fine-Tuning

RAG is suitable when relevant knowledge changes frequently, when each answer must cite a source, or when the knowledge base needs to remain separate from the model for legal or organizational reasons. Content can be updated by re-indexing the vector database without model training.

Fine-tuning is suitable when behavior rather than knowledge is to be adapted: tonality, response structure, a specific classification style. It is more expensive because it requires specialized labeled data and training compute. Once training is complete, the model's knowledge stays frozen at that point in time. In practice, the approaches are not mutually exclusive: a domain-fine-tuned model as the generator in a RAG pipeline is a common combination.

RAG vs. Long-Context Prompting

With context windows of several hundred thousand tokens, the thesis emerged that RAG would become obsolete. This has not been confirmed. Anyone who pushes an entire manual into the prompt pays the full token price for every request. They risk quality losses due to diluted context (the documented lost-in-the-middle effect) and give up the opportunity to selectively choose only the relevant passages. Long-context and RAG complement each other: retrieval decides what belongs in the prompt, and the large window gives the model space for more extensive context packages.
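The token-price argument is easy to check with back-of-the-envelope arithmetic. All numbers below are hypothetical (manual size, chunk count, request volume, and price per million input tokens); only the ratio matters.

```python
def input_cost(tokens_per_request: int, requests: int,
               price_per_million_tokens: float) -> float:
    """Total input-token cost for a number of requests."""
    return tokens_per_request * requests * price_per_million_tokens / 1_000_000


MANUAL_TOKENS = 200_000   # hypothetical: whole manual in every prompt
RAG_TOKENS = 5 * 512      # hypothetical: five retrieved chunks of 512 tokens
REQUESTS = 1_000
PRICE = 3.0               # hypothetical price per million input tokens

long_context_cost = input_cost(MANUAL_TOKENS, REQUESTS, PRICE)
rag_cost = input_cost(RAG_TOKENS, REQUESTS, PRICE)
ratio = MANUAL_TOKENS / RAG_TOKENS

print(long_context_cost, round(rag_cost, 2), round(ratio, 1))  # 600.0 7.68 78.1
```

Under these assumptions the long-context variant sends roughly 78 times as many input tokens per request. Prompt caching or different pricing would change the absolute figures.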

RAG vs. Classic Full-Text and Semantic Search

Classic search delivers a list of hits. RAG delivers a formulated answer with source reference. Semantic search is part of many RAG systems as a retriever component, but not the entire system. Those who need a list for further manual review can use enterprise search; those who need a directly usable answer are better served by RAG.

What Determines Quality — Chunking, Embeddings, Hybrid Search, Reranking

The quality of a RAG pipeline rarely depends solely on the language model: it stands or falls with what the retriever delivers. Four levers carry the main weight.

Chunk size and overlap

The typical working range is 128–512 tokens per chunk. For general use cases, recursive splitting with 512 tokens and 10–20% overlap has become the standard. NVIDIA tested overlap values of 10%, 15%, and 20% on FinanceBench and found 15% to be optimal. The test ran on 1,024-token chunks; because the overlap is specified as a percentage of the chunk size, the guideline can reasonably be carried over to a 512-token basis, where 15% corresponds to roughly 77 tokens. Below 10% overlap, contextual information is lost at chunk transitions. Above 25%, noise arises due to nearly identical matches. Fact-based questions benefit from smaller chunks, while concept summaries benefit from larger ones.
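The overlap arithmetic can be made concrete with a fixed-size sliding window. This sketch is a simplification: it is not recursive splitting, and it treats list items as tokens, whereas a real pipeline would count tokens with the embedding model's tokenizer. The function name and defaults are assumptions for illustration.

```python
def chunk_tokens(tokens: list, chunk_size: int = 512,
                 overlap_ratio: float = 0.15) -> list[list]:
    """Split a token list into windows that share `overlap_ratio` of their length."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = round(chunk_size * overlap_ratio)  # 512 * 0.15 = 76.8, rounded to 77
    step = chunk_size - overlap                  # each window starts 435 tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


tokens = list(range(1000))  # stand-in for 1,000 tokens
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])  # 3 [512, 512, 130]
```

The last 77 tokens of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is fully contained in at least one chunk as long as it is shorter than the overlap.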

Choosing an embedding model

Three selection criteria dominate: retrieval quality, latency, and hosting requirements. The MTEB benchmark serves as a guide. In the cloud segment, it points to models from Cohere, OpenAI, or Google (Gemini Embedding); for self-hosted operation, to open models like the BGE or Qwen embedding families, plus smaller models from the E5 series for latency-critical applications. However, the top of the MTEB leaderboard rotates quickly, and MTEB v2 (from 2026) is not directly comparable to v1; specific rankings and scores become outdated within months. More important than the leaderboard position: MTEB scores do not reliably predict performance on one's own domain data. Before making a decision, it is worthwhile to benchmark on a small, representative set of queries from your own knowledge base.

Hybrid Search and Reranking

Hybrid search (dense plus BM25) and a downstream reranking layer measurably improve the quality of search results beyond what a single method can achieve. Published benchmarks indicate the magnitude: an academic study on financial documents reports a Recall@5 of 0.816 and an MRR@3 of 0.605 for a two-stage pipeline consisting of hybrid retrieval and neural reranking; the same study finds that BM25 outperforms even state-of-the-art dense retrieval on financial documents. Individual real-world benchmarks (e.g., a published test of a FastAPI RAG system, Medium 2026) report the same trend: from around 60% recall with pure dense or BM25 retrieval to over 90% in the full cascade with a reranker. Such individual values are illustrative, not representative. In practice, this means that in almost every serious system, BM25 (via Elasticsearch or OpenSearch), a dense retriever, and a reranker applied to the top 20–50 hits belong together; none of them is a standalone solution.
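For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how Recall@k and MRR@k are commonly computed for a single query. Benchmark figures are averages of these values over many queries. The IDs are made up, and the relevance labels are assumed to come from a hand-labeled test set.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mrr_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


retrieved = ["a", "b", "c", "d", "e"]  # ranked output of a retriever
relevant = {"b", "e", "z"}             # labeled ground truth for the query

r5 = recall_at_k(retrieved, relevant, k=5)  # 2 of 3 relevant items found
m3 = mrr_at_k(retrieved, relevant, k=3)     # first relevant hit at rank 2
```

Recall@k rewards finding everything relevant; MRR@k rewards placing the first relevant hit near the top, which is what a reranker mainly improves.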

Where RAG fails in practice

RAG pipelines fail fairly often. The causes are almost never in the LLM, but in the retrieval path.

Silent Degradation: When the system responds, but incorrectly

A failed production RAG system can be recognized by the fact that nothing crashes. The pipeline continues to deliver answers, the dashboard shows healthy latency, and the evaluation results from the launch are still in the wiki.

Underneath, the system degrades on every relevant axis: new document versions are not indexed, the embedding space no longer matches the changed knowledge base, and irrelevant search results appear more frequently.

Three anchors help with diagnosis: a rigidly defined set of gold standard questions that runs periodically against production; samples.

Typical error classes along the pipeline

  • Indexing: unfavorable chunk boundaries that break tables, lists, or code; outdated documents; embeddings that are unsuitable for the domain.
  • Retrieval: document-level retrieval mismatch, where the system retrieves a chunk from a large document but not the relevant one; a missing BM25 component for highly token-specific queries like product codes.
  • Augmentation: too many chunks in the prompt, causing the model to lose context; missing source metadata, making citations impossible.
  • Generation: the model answers from its own knowledge despite the context, a RAG hallucination that is only visible through faithfulness measurement.

How to Measure if a RAG System is Working

A common evaluation framework is RAGAS. The four most commonly used core dimensions each answer a different question.

Faithfulness checks whether the generated answer is grounded in the retrieved documents and has not been hallucinated. This is the central metric for making RAG-specific hallucinations visible.

Answer Relevancy checks whether the answer addresses the question asked; a faithful answer may still miss the point.

Context Precision checks if the retrieved chunks are relevant to the question. Low values indicate retrieval problems.

Context Recall checks if all information needed for a complete answer has been retrieved. Low values indicate gaps in the index or overly narrow retrieval.

Faithfulness and Context Precision target the retriever and the grounding of the answer to the sources. Answer Relevancy and Context Recall target completeness. RAGAS has since expanded its metric set (e.g., to include Noise Sensitivity and Response Groundedness); however, the four core metrics remain the pragmatic starting point. Additionally, tools such as TruLens for traceability, DeepEval for unit-test-like checks, and Arize Phoenix for observability are used. A formal ISO standard for RAG evaluation does not exist; task-specific benchmarks like KILT, MS MARCO, FinanceBench, and LegalBench-RAG are established.
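To make three of the core dimensions tangible, the sketch below computes them as simple ratios from pre-made judgments. This is a deliberate simplification and not the RAGAS implementation: RAGAS derives the judgments with an LLM, and its Context Precision is rank-aware rather than a plain ratio. The judgment lists are hypothetical.

```python
def fraction_true(flags: list[bool]) -> float:
    """Share of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical judgments for one question (from an LLM judge or human annotators).
answer_claims_supported = [True, True, False]          # per claim in the answer
reference_claims_in_context = [True, False]            # per claim in the reference answer
retrieved_chunks_relevant = [True, False, True, False] # per retrieved chunk

faithfulness = fraction_true(answer_claims_supported)            # 2 of 3 claims grounded
context_recall = fraction_true(reference_claims_in_context)      # 1 of 2 needed facts retrieved
context_precision_simple = fraction_true(retrieved_chunks_relevant)  # 2 of 4 chunks relevant
```

Read together, the numbers localize the problem: low faithfulness points at generation, low recall or precision at retrieval or indexing.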

RAG in the B2B and Industrial Context — GDPR, On-Premises, and Concrete Use Cases

In the DACH industrial.

Typical application areas in industry and B2B

The Mittelstand-Digital Centre Focus on People describes applications in industries such as mechanical engineering, electrical engineering, and logistics.

There, RAG links language models with internal and external data sources for real-time information. Lufthansa Industry Solutions cites internal knowledge management for process documentation, employee handbooks, and maintenance manuals, as well as B2B and B2C chatbots as typical fields.

Documented examples from production systems:

  • Grainger (B2B MRO distribution): RAG-based search system on Databricks Mosaic AI for 2.5 million products with approximately 400,000 daily product updates. Background: different buyer personas (e.g., electricians vs. mechanical engineers) expect industry-specific results for a query like "terminals."
  • Ramp uses RAG for industry classification by NAICS codes: customer information is vectorized, compared against a database of codes, and an LLM generates the final classification.
  • Thomson Reuters uses RAG to provide support staff with relevant content from internal knowledge bases through a chat interface.
  • The BVDW has addressed the pattern with its own white paper for the DACH market ("Retrieval-Augmented Generation: Using Knowledge Strategically, Generating Precise Answers").

GDPR, data residency, and on-premises operation

In RAG, confidential material remains outside of the model weights. Knowledge is retrieved from controlled repositories rather than being embedded into the model. This reduces the risk of leaks compared to fine-tuning, where training content can potentially be reconstructed from the weights. For the DACH industry, this leads to three architectural decisions that must be made early on: Where will the vector database be located (own infrastructure, EU cloud, US cloud)? Which generator will be used (cloud API, EU-hosted model, local open-weight model)? Which source metadata will be stored so that audit and deletion requests can be implemented? Separating knowledge and model makes RAG an architecturally simpler solution for data residency requirements, but it does not replace proper upfront data classification.
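The third question, which metadata to store, can be illustrated with a small record type. The field names and the in-memory list are assumptions for this sketch; a real vector database would store the same information as payload or metadata columns and delete by filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRecord:
    """Metadata stored next to each vector so chunks stay traceable."""
    chunk_id: str
    doc_id: str          # links the chunk back to its source document
    source_uri: str      # where the original lives (for citations and audits)
    indexed_at: str      # ISO timestamp of the indexing run
    classification: str  # e.g. "public", "internal", "confidential"


def delete_document(store: list[ChunkRecord], doc_id: str):
    """Remove all chunks of one document; return the new store and removed IDs."""
    kept = [r for r in store if r.doc_id != doc_id]
    removed = [r.chunk_id for r in store if r.doc_id == doc_id]
    return kept, removed


store = [
    ChunkRecord("c1", "hr-handbook", "dms://hr/handbook.pdf", "2025-01-10T08:00:00Z", "internal"),
    ChunkRecord("c2", "hr-handbook", "dms://hr/handbook.pdf", "2025-01-10T08:00:00Z", "internal"),
    ChunkRecord("c3", "pump-manual", "dms://docs/pump.pdf", "2025-01-11T08:00:00Z", "public"),
]
store, removed_ids = delete_document(store, "hr-handbook")
```

Without a `doc_id` on every chunk, a deletion request cannot be executed reliably, because there is no way to find all vectors derived from one document.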

Frequently Asked Questions

What is the difference between RAG and fine-tuning?

RAG combines an unaltered language model at query time with an external knowledge base, delivering answers with source references. Fine-tuning modifies the model weights themselves and is suitable when tonality, answer structure, or classification behavior needs to be adjusted. Rule of thumb: dynamic knowledge with source citation requirement → RAG; stable behavior without source reference → Fine-tuning. Both can be combined.

Does a large context window make RAG obsolete?

No. Loading entire knowledge bases into every prompt drives up token costs and latency without a proportional increase in response quality—diluted context often lowers it (lost-in-the-middle effect). Retrieval selectively determines which passages are relevant. The context window gives the model space to work with the selected material. The two approaches complement each other.

Does RAG completely eliminate hallucinations?

No. RAG reduces hallucinations because the model is guided by retrievable material. However, if the retriever delivers irrelevant or incorrect chunks, the model can still hallucinate, this time with the appearance of being backed by a source. The faithfulness metric makes such cases visible; without measurement, the risk remains hidden.

How often does the index need to be updated?

As often as the source documents change. The system is only as current as its last indexing run. The re-indexing logic (triggers, frequency, versioning) is a separate operational aspect and should be defined at the beginning of the project, not added later. If a document changes, the change will only be visible to the system after the next run.
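One simple way to implement the trigger logic is to compare content hashes between the last indexing run and the current document set. This is a sketch of one possible approach, assuming documents are available as text and identified by stable IDs; the function names are made up for the example.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(indexed_hashes: dict[str, str],
                 current_docs: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (documents to embed again, documents to drop from the index)."""
    to_index = [doc_id for doc_id, text in current_docs.items()
                if indexed_hashes.get(doc_id) != content_hash(text)]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return sorted(to_index), sorted(to_delete)


# State after the last run vs. what the source system holds now (hypothetical).
indexed = {"a": content_hash("v1"), "b": content_hash("old"), "c": content_hash("x")}
current = {"a": "v1", "b": "new", "d": "fresh"}

to_index, to_delete = plan_reindex(indexed, current)
print(to_index, to_delete)  # ['b', 'd'] ['c']
```

Unchanged documents are skipped, so the run only pays embedding costs for what actually changed, and deleted sources disappear from the index instead of lingering as stale hits.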

Do I need a special vector database?

Not necessarily. Options range from dedicated vector stores like the Qdrant database or the FAISS library to pgvector as a PostgreSQL extension, and even Elasticsearch or OpenSearch, which are often needed for the BM25 component in hybrid search anyway. The choice follows expected scaling, latency budget, and existing infrastructure, not the RAG concept itself.