What is retrieval-augmented generation (RAG)?

Retrieval-augmented generation (RAG) is a technique that lets a large language model pull in fresh, source-grounded information from an external knowledge base before generating a response. Instead of relying only on what the model learned during training, a RAG system retrieves the most relevant documents at the moment of the question, hands those documents to the model as context, and then asks it to answer — usually with citations back to the source.

In other words, RAG turns a static, trained-once language model into something closer to a researcher with a live library card. The model brings the language ability; the retrieval layer brings the facts.

This page covers what RAG is, the problem it solves, where it came from, how the architecture actually works, how it compares to fine-tuning and prompt engineering, what RAG is good at, where it still falls short, and a short FAQ covering the questions that tend to come up first.

RAG in one sentence

RAG is the practice of looking something up before answering — often with citations back to the source.

The problem RAG solves

A large language model (LLM) is trained on a fixed snapshot of text. Once that training is done, three problems show up almost immediately when an enterprise tries to use the model for real work.

The first is the knowledge cutoff. The model doesn't know anything that happened after its training run. Ask it about a policy that changed last quarter or a product that shipped last week, and it will either say it doesn't know or — worse — make something up.

The second is the lack of private context. A general-purpose model has never seen your contract terms, your help-center articles, your product specs, your customer records, or your internal wiki. Without access to that data, the model can only ever produce generic answers.

The third is hallucination. When an LLM doesn't know the answer, it often invents one that sounds plausible. Ars Technica describes RAG as "a way of improving LLM performance, in essence by blending the LLM process with a web search or other document look-up process to help LLMs stick to the facts." That is roughly the job description.

RAG addresses all three. By retrieving relevant, current, organization-specific content at the moment the question is asked, it gives the model grounded material to answer from, rather than leaving it to rely on what it memorized during training.

Where RAG came from

The term was introduced in a 2020 paper by Patrick Lewis and a team at what was then Facebook AI Research (now Meta AI), University College London, and New York University. The paper described RAG as "a general-purpose fine-tuning recipe" — a way to combine a pretrained language model with a retrieval system to handle knowledge-intensive tasks. The technique took off because it solved a problem every enterprise team kept running into: how do you make a general-purpose model say something specific and true?

Lewis himself has admitted he would have picked a better-sounding name if he'd known the acronym was going to stick. It stuck.

How RAG works

A RAG system has five moving parts and runs through five steps every time a user asks a question. The architecture is consistent across vendors, even when the implementations differ.

The five components

  1. The knowledge base. The body of source content the system can draw from — PDFs, help-center articles, product documentation, internal wikis, contracts, transcripts, databases, web pages. This is the model's "library."

  2. The embedding model. A separate AI model that turns text into numerical representations called vectors. Documents that mean similar things end up close together in vector space, even when the words are different. This is what makes semantic search possible.

  3. The vector database. Where those embeddings are stored and searched. Common names you'll hear are Pinecone, Weaviate, FAISS (strictly a similarity-search library rather than a database), pgvector (a PostgreSQL extension), and the vector indexes inside Databricks, Snowflake, MongoDB, and the cloud-platform databases.

  4. The retriever. The component that takes a user's question, embeds it, searches the vector database for the most relevant chunks, and often re-ranks them before passing them on.

  5. The generator. The LLM itself — GPT, Claude, Gemini, Llama, or any other foundation model. The generator reads the retrieved context plus the original question and produces the final answer.
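
To make the embedding model, vector store, and similarity search concrete, here is a minimal sketch in Python. It rests on a loud assumption: `embed` is a toy bag-of-words counter, not a real embedding model, so it only matches documents that share words with the query and cannot capture the "similar meaning, different words" behavior described above. The names `embed`, `cosine`, and `InMemoryVectorStore` are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database: a list scanned linearly."""

    def __init__(self):
        self._rows = []  # each row is (doc_id, text, vector)

    def add(self, doc_id: str, text: str) -> None:
        self._rows.append((doc_id, text, embed(text)))

    def search(self, query: str, k: int = 2):
        """Return the k most similar rows as (score, doc_id, text) tuples."""
        q = embed(query)
        scored = [(cosine(q, vec), doc_id, text) for doc_id, text, vec in self._rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]


store = InMemoryVectorStore()
store.add("refunds", "Refunds for international shipments are issued within 30 days.")
store.add("shipping", "Domestic orders ship in two business days.")
hits = store.search("What is our refund policy for international shipments?", k=1)
```

A real system would swap `embed` for a trained embedding model and the linear scan for an approximate nearest-neighbor index, but the shape of the interaction (embed, store, compare, take the top k) stays the same.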

The five steps at runtime

  1. A user submits a question. "What's our refund policy for international shipments?"

  2. The retriever searches the knowledge base. The question is converted to a vector and matched against the stored document vectors. The top few most-relevant chunks are pulled out.

  3. The retrieved chunks are re-ranked. Optional but common — a smaller model scores the chunks and reorders them so the most useful ones come first.

  4. An augmented prompt is built. The system combines the user's original question with the retrieved chunks into a new prompt — sometimes called prompt stuffing — and sends it to the LLM.

  5. The LLM generates an answer. The model writes a response grounded in the retrieved content, ideally with a citation or footnote back to the source documents.

This is where the name comes from. The system retrieves relevant content, augments the prompt with that content, and generates a response from the combination.
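
The five runtime steps can be sketched end to end in a few functions. This is an illustration under stated assumptions, not a production pipeline: retrieval and re-ranking use simple word overlap instead of vector search and a scoring model, the knowledge base is a hard-coded dictionary, and `llm` is any callable that maps a prompt string to a response string. All function names and file names here are hypothetical.

```python
import re

# Hypothetical knowledge base: source name -> chunk text.
KNOWLEDGE_BASE = {
    "customs.md": "International shipments may incur customs fees set by the destination country.",
    "refunds.md": "International shipments are eligible for a refund within 30 days of delivery.",
    "shipping.md": "Domestic orders ship within two business days.",
}


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, k=2):
    """Step 2: find candidate chunks (word overlap stands in for vector search)."""
    q = tokens(question)
    scored = [(len(q & tokens(text)), source, text) for source, text in KNOWLEDGE_BASE.items()]
    scored = [row for row in scored if row[0] > 0]
    scored.sort(key=lambda row: row[0], reverse=True)
    return [(source, text) for _, source, text in scored[:k]]


def rerank(question, chunks):
    """Step 3: reorder candidates (overlap density stands in for a scoring model)."""
    q = tokens(question)
    return sorted(chunks, key=lambda c: len(q & tokens(c[1])) / len(tokens(c[1])), reverse=True)


def build_prompt(question, chunks):
    """Step 4: combine the question with numbered, source-labeled context."""
    context = "\n".join(
        f"[{i}] ({source}) {text}" for i, (source, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below. Cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer(question, llm):
    """Steps 1 to 5: question in, grounded answer out. `llm` is any str -> str callable."""
    chunks = rerank(question, retrieve(question))
    return llm(build_prompt(question, chunks))


question = "What's our refund policy for international shipments?"
prompt = build_prompt(question, rerank(question, retrieve(question)))
```

Numbering the chunks and labeling each with its source in the prompt is one simple way to give the model something to cite; whether the model actually cites correctly still has to be checked.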

RAG vs. fine-tuning vs. prompt engineering vs. pretraining

RAG is one of four common ways to make a general-purpose LLM useful for a specific job. They are not mutually exclusive, and production systems often combine more than one.

| Method | What it does | When to use it | Cost and complexity |
| --- | --- | --- | --- |
| Prompt engineering | Crafts a careful instruction (often with a few examples) to steer the model's behavior at runtime. | When the knowledge is small enough to fit in the prompt and the task is well-defined. | Lowest. No training required. |
| Retrieval-augmented generation (RAG) | Pulls in relevant external documents at query time and feeds them to the model as context. | When the knowledge base is large, changes often, or needs to be cited. | Moderate. Requires a retrieval pipeline but no model retraining. |
| Fine-tuning | Continues training the model on a domain-specific dataset to change its behavior, tone, or built-in knowledge. | When you need the model to consistently behave a certain way, follow a format, or speak a domain language. | High. Requires labeled training data and compute. |
| Pretraining | Trains a model from scratch on a large corpus. | Almost never the right choice outside of frontier labs. | Highest. Months of compute and a research team. |

The clearest split is RAG vs. fine-tuning. RAG retrieves at query time; fine-tuning bakes knowledge into model weights. If the underlying facts change weekly, RAG is the safer bet. If the model needs to learn a style, format, or domain language, fine-tuning is the better fit. The two are often combined: fine-tune a model to follow your output format, then use RAG to feed it current facts.

Benefits of RAG

RAG has spread quickly because the benefits are practical rather than theoretical.

Accuracy and freshness. The model can answer questions about content that didn't exist during training — last week's policy update, today's pricing, this morning's incident report. Whatever's in the knowledge base is fair game.

Citations. A well-built RAG system attributes its answers to the source documents it pulled from. That gives users a way to verify the answer instead of trusting the model on faith. For regulated industries, this often turns RAG from a nice-to-have into a requirement.

Cost. Retraining or fine-tuning a foundation model is computationally expensive. Updating a knowledge base is comparatively cheap — usually just re-ingesting documents.

Control. Developers can swap, add, or restrict the knowledge sources at any time. When access controls are stored with the documents and enforced at retrieval time, a user only retrieves what they're allowed to see.

Lower hallucination risk. Grounding the model in retrieved content reduces — though does not eliminate — the chance of the model inventing answers. The fewer gaps the model has to fill in from imagination, the fewer mistakes it tends to make.

Domain specificity without retraining. A general-purpose model can produce domain-accurate answers by reading domain-specific content, without anyone touching the model's weights.
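
The access-control point under "Control" is easiest to see in code. The sketch below assumes each chunk carries a set of groups allowed to read it and that the filter runs before ranking, so restricted text never reaches the prompt. The `Chunk` type, the `allowed_groups` field, and the group names are hypothetical; real systems usually push this filter down into the vector database query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source: str
    text: str
    allowed_groups: frozenset  # hypothetical ACL: groups permitted to read this chunk


def visible_chunks(chunks, user_groups):
    """Keep only chunks whose ACL shares at least one group with the user."""
    groups = frozenset(user_groups)
    return [c for c in chunks if c.allowed_groups & groups]


corpus = [
    Chunk("handbook.md", "PTO requests go through the HR portal.", frozenset({"support", "hr"})),
    Chunk("payroll.md", "Salary bands are reviewed every January.", frozenset({"hr"})),
]

# A support agent can retrieve from the handbook but never sees payroll content.
support_view = visible_chunks(corpus, {"support"})
hr_view = visible_chunks(corpus, {"hr"})
```

Filtering before ranking, rather than after generation, matters: once restricted text is in the prompt, nothing reliably stops the model from repeating it.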

What RAG is used for

RAG shows up wherever a language model needs to answer questions about content the model wasn't trained on. A few of the most common patterns:

  • Internal knowledge search. Employees ask natural-language questions across wikis, runbooks, HR policies, and contracts and get a synthesized answer with links to the underlying documents.

  • Customer support and self-service. Support assistants ground their answers in help-center articles, product documentation, and account data so customers get specific, current answers instead of generic ones. (Gladly has a longer look at RAG for customer service if that's the use case you're evaluating.)

  • Research and analysis assistants. Financial analysts, lawyers, clinicians, and consultants query proprietary document sets — filings, contracts, patient records, market data — and get summaries grounded in the sources.

  • Coding assistants. Developer tools retrieve from a codebase, internal libraries, and documentation so the model writes code that fits the project instead of generic snippets.

  • Compliance and policy lookup. Teams query large, fast-changing policy libraries with full citation back to the controlling document.

  • Sales and marketing intelligence. Reps query account history, deal notes, and product collateral to prep for a call without scrolling through five tools.

The common thread: the answer has to be current, specific, and traceable. That's the RAG zone.

Limitations of RAG

RAG is genuinely useful, but it is not a fix for every problem with language models.

It does not eliminate hallucinations. As Ars Technica puts it, "It is not a direct solution because the LLM can still hallucinate around the source material in its response." A model can be handed a correct document and still draw the wrong conclusion from it — or pull a sentence out of context. RAG reduces hallucination risk; it does not remove it.

Retrieval quality is the ceiling. A RAG system can only be as good as the chunks it retrieves. If the embeddings are weak, the chunks are too big or too small, or the knowledge base is messy, the retriever will pull in irrelevant content and the model will produce an answer that's grounded — but in the wrong things.

Context-window and latency tradeoffs. Every retrieved chunk costs tokens. Stuff too much in and the model loses focus or runs slower; stuff too little in and the answer is incomplete. Production teams spend real time tuning chunk size, retrieval count, and re-ranking.
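
The two knobs mentioned here, chunk size and how much retrieved text goes into the prompt, can be sketched as follows. This is a simplified illustration: it splits on whitespace and counts words as a crude proxy for tokens, whereas a real pipeline would use the model's own tokenizer. The function names and default values are assumptions chosen for the example.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def pack_to_budget(ranked_chunks, max_tokens):
    """Take chunks in ranked order until the next one would exceed the budget."""
    picked, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())  # crude estimate: words, not real tokens
        if used + cost > max_tokens:
            break
        picked.append(chunk)
        used += cost
    return picked


sample = " ".join(f"w{i}" for i in range(12))
pieces = chunk_words(sample, size=5, overlap=2)
context = pack_to_budget(pieces, max_tokens=12)
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing and embedding some text twice.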

Data freshness is an ops cost. The knowledge base has to be re-indexed when content changes. Skip that and the system happily answers from stale data, which can be worse than no answer at all.
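
One common way to keep that re-indexing cost manageable is to re-embed only what changed. The sketch below assumes the system stored a content hash for each document the last time it was indexed; comparing those hashes against the current content yields the documents to re-embed and the ones to delete. The function names and the dictionary layout are hypothetical.

```python
import hashlib


def fingerprint(text):
    """Stable content hash used to detect whether a document changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(indexed, current):
    """Compare stored fingerprints with current content.

    indexed: doc_id -> fingerprint recorded at the last indexing run
    current: doc_id -> document text as it exists now
    Returns (doc_ids to embed again, doc_ids to delete from the index).
    """
    to_embed = sorted(
        doc_id for doc_id, text in current.items()
        if indexed.get(doc_id) != fingerprint(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed if doc_id not in current)
    return to_embed, to_delete


indexed = {
    "policy": fingerprint("Refunds within 14 days."),
    "faq": fingerprint("We ship worldwide."),
    "retired": fingerprint("Old promo terms."),
}
current = {
    "policy": "Refunds within 30 days.",  # edited since the last run
    "faq": "We ship worldwide.",          # unchanged
    "pricing": "Plans start at $10.",     # new document
}
to_embed, to_delete = plan_reindex(indexed, current)
```

Deletions matter as much as updates: a retired document left in the index will keep being retrieved and cited.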

Source quality leaks through. If the underlying documents are wrong, biased, or contradictory, the model will faithfully reproduce the problem and cite it. RAG can amplify bad sources just as easily as good ones.

The practical takeaway: RAG is a strong default for grounding language models in specific content, but it's a system to operate — not a button to press.

Frequently asked questions
