LLM 101: Part 5 — RAG: It's Not a Cleaning Cloth, It's How You Make LLMs Know Your Stuff
Ahmed M. Adly (@RealAhmedAdly)
Previously: We covered training and fine-tuning. Now let's talk about the technique you'll probably actually use.
Welcome back. We've talked about how LLMs are trained, what that training means, and how to specialize them through fine-tuning. Now we're going to talk about RAG — Retrieval-Augmented Generation — which sounds like enterprise buzzword soup but is actually elegantly simple.
The pitch: what if, instead of trying to teach the model all your company knowledge, you just... give it access to look things up?
The Problem RAG Solves
LLMs have knowledge baked into their parameters from training. But that knowledge has three major problems:
1. It's frozen in time. Training happened months or years ago. The model knows nothing about events, products, or changes since then.
2. It doesn't include your stuff. Your internal docs, your codebase, your customer data — none of that was in the training data (hopefully).
3. It hallucinates. When the model doesn't know something, it doesn't say "I don't know." It makes something up that sounds plausible. This is bad for business.
Fine-tuning can help with #2, but it's expensive, slow, and you have to redo it every time your knowledge changes. For #1 and #3, fine-tuning doesn't really help at all.
RAG solves all three problems with a beautifully simple idea: don't put the knowledge in the model. Put it in a database. Let the model retrieve relevant information and use it to answer questions.
How RAG Works (The 10,000-Foot View)
Here's the basic flow:
User asks a question: "What's our return policy for electronics?"
System retrieves relevant documents: Searches your knowledge base, finds your return policy docs.
System constructs a prompt: Takes the question + the retrieved documents and builds a prompt.
Model generates an answer: Uses the retrieved information to answer accurately.
User gets a response: "According to our policy, electronics can be returned within 30 days with receipt..."
The key: the model doesn't need to have memorized your return policy. It just needs to read it and summarize it. This is much more reliable than hoping it hallucinated the right answer.
The Components of a RAG System
Let's break down what you actually need to build this.
1. Your Knowledge Base
This is your source of truth. Could be:
- Internal documentation (Notion, Confluence, Google Docs)
- Customer support tickets
- Code repositories
- Product manuals
- Databases
- Previous chat logs
- Literally any text you want the model to have access to
The knowledge base needs to be:
- Accessible: Your system can read it programmatically
- Well-organized: Garbage in, garbage out applies here too
- Up-to-date: The whole point of RAG is fresh information
2. Embeddings (The Magic Bit)
Here's where it gets interesting. To find relevant documents, you can't just do keyword search (though that can help). You need semantic search — finding documents that mean similar things, even if they use different words.
Enter embeddings. An embedding is a way to represent text as a list of numbers (a vector) that captures its meaning. Similar meanings have similar vectors.
Example:
- "The cat sat on the mat" → [0.2, 0.8, -0.3, 0.5, ...]
- "A feline rested on a rug" → [0.21, 0.79, -0.31, 0.49, ...]
- "I like pizza" → [0.9, -0.2, 0.7, -0.1, ...]
The first two are semantically similar (different words, same meaning) so their vectors are close. The third is different, so its vector is far away.
You don't create these embeddings yourself. You use an embedding model (like OpenAI's text-embedding-ada-002, or open-source alternatives). These models are trained specifically to turn text into meaningful vectors.
The process:
- Take all your documents
- Break them into chunks (more on this in a moment)
- Run each chunk through an embedding model
- Store the resulting vectors in a database
Now when a user asks a question, you:
- Embed the question using the same model
- Find vectors in your database that are close to the question vector
- Retrieve the original text for those chunks
- Feed that text to your LLM
Boom. Semantic search.
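Here's a minimal sketch of that in code, using the open-source sentence-transformers library and a small embedding model (both are just example choices, not requirements); any embedding API follows the same text-in, vectors-out pattern:

```python
# Minimal semantic-search sketch using sentence-transformers (an example choice;
# any embedding model or API follows the same text-in, vector-out pattern).
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, general-purpose embedding model

chunks = [
    "The cat sat on the mat",
    "A feline rested on a rug",
    "I like pizza",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

query = "Where did the cat sit?"
query_vector = model.encode([query], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity is just a dot product.
scores = chunk_vectors @ query_vector
for chunk, score in sorted(zip(chunks, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {chunk}")
```

The two cat sentences land at the top of the ranking; the pizza sentence falls to the bottom.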
3. Vector Database
You need somewhere to store all those embeddings and search them efficiently. This is what vector databases do. They're optimized for "find me vectors similar to this one" queries.
Popular options:
- Pinecone: Managed service, easy to use, costs money
- Weaviate: Open-source, feature-rich, self-hosted
- Chroma: Lightweight, good for prototyping
- Qdrant: Fast, written in Rust (if that matters to you)
- Postgres with pgvector: Because sometimes you just want to add vectors to your existing database
You store documents + their embeddings. When searching, you query by vector, get back the most similar documents. The database handles the hard part (efficiently searching millions of vectors).
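To make that concrete, here's a rough sketch using Chroma, purely because it's the lightweight prototyping option from the list above. The add/query pattern looks broadly similar in the other databases:

```python
# Rough sketch of storing and querying chunks with Chroma (one option among many).
# pip install chromadb
import chromadb

client = chromadb.Client()  # in-memory, good for prototyping; use a persistent client in production
collection = client.create_collection("return_policies")

# Chroma can embed documents with its default embedding function, or you can
# pass precomputed vectors via the `embeddings` argument.
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "Electronics can be returned within 30 days with a receipt.",
        "Opened software is not eligible for return.",
    ],
    metadatas=[{"source": "returns.md"}, {"source": "returns.md"}],
)

results = collection.query(
    query_texts=["What's the return window for a laptop?"],
    n_results=2,
)
print(results["documents"][0])  # the most similar chunks, best match first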
4. Chunking Strategy
Here's a problem: your documents are probably long. Embedding models have token limits. LLMs have context limits. You can't just throw an entire 50-page manual at either one.
Solution: break documents into chunks.
Chunking strategies:
Fixed size: Every chunk is, say, 512 tokens. Simple but dumb. Might cut sentences in half.
Sentence/paragraph-based: Split on natural boundaries. Better, but paragraphs vary wildly in size.
Semantic chunking: Try to keep related ideas together. Use things like headings, topic changes. More complex, but better results.
Sliding window: Chunks overlap slightly to avoid losing context at boundaries.
The tradeoff: Smaller chunks are more precise (you get exactly the relevant sentence) but lose context. Larger chunks maintain context but might include irrelevant information. Most people start with 500-1000 tokens per chunk with 10-20% overlap and adjust from there.
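Here's the simplest version of that as a sketch: fixed-size chunks with a sliding-window overlap. It splits on whitespace and treats words as a stand-in for tokens to stay dependency-free; in a real pipeline you'd count tokens with your model's tokenizer:

```python
# Fixed-size chunking with overlap. Words stand in for tokens here; in practice
# you'd count real tokens (e.g. with your embedding model's tokenizer).

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Example: a 1,200-"token" document becomes three overlapping chunks (500, 500, 300).
doc = " ".join(f"word{i}" for i in range(1200))
for i, chunk in enumerate(chunk_text(doc)):
    print(i, len(chunk.split()))
```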
5. The Retrieval Step
When a user asks a question:
- Embed the question: Turn it into a vector
- Search the vector database: Find the top-k most similar chunks (k is usually 3-10)
- Rerank (optional but recommended): Use a separate model to score how relevant each chunk actually is to the question. Sometimes vector similarity isn't perfect (see the sketch after this list).
- Return the best matches: These are your context for the LLM
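For the rerank step, one common approach (though not the only one) is a cross-encoder that scores each question/chunk pair directly. A sketch with sentence-transformers, where the model name is just an example choice:

```python
# Reranking sketch: score (question, chunk) pairs with a cross-encoder and keep
# the best ones. Library and model name are example choices, not requirements.
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

question = "What's our return policy for electronics?"
candidates = [
    "Electronics can be returned within 30 days with a receipt.",
    "Our office is closed on public holidays.",
    "Refunds are issued to the original payment method within 5 business days.",
]

scores = reranker.predict([(question, chunk) for chunk in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: -x[1])

top_chunks = [chunk for chunk, _ in reranked[:2]]  # keep the 2 most relevant
print(top_chunks)
```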
6. The Generation Step
Now you have:
- The user's question
- Several chunks of relevant information from your knowledge base
Build a prompt:
Context:
[Chunk 1: Your return policy text...]
[Chunk 2: More relevant info...]
[Chunk 3: Additional context...]
Question: What's our return policy for electronics?
Instructions: Answer the question using only the information provided in the context above. If the context doesn't contain enough information to answer the question, say so.
Feed this to your LLM. It reads the context and generates an answer. Because the relevant information is right there in the prompt, it doesn't need to rely on potentially outdated or hallucinated knowledge.
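In code, the generation step is just string assembly plus one API call. This sketch uses the OpenAI Python SDK and a placeholder model name as assumptions; any chat-style endpoint works the same way:

```python
# Generation step sketch: stuff the retrieved chunks into the prompt, ask the LLM.
# Uses the OpenAI Python SDK (v1+) as an example; the model name is an assumption.
# pip install openai   (and set OPENAI_API_KEY)
from openai import OpenAI

client = OpenAI()

def answer_with_context(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(f"[Chunk {i + 1}]\n{chunk}" for i, chunk in enumerate(chunks))
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Instructions: Answer the question using only the information provided "
        "in the context above. If the context doesn't contain enough information "
        "to answer the question, say so."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model; swap in whatever you use
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```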
Why RAG Often Beats Fine-Tuning
Remember Part 4 where I said fine-tuning is for behavior, not knowledge? Here's why RAG is usually better for knowledge:
1. It's dynamic. Update your knowledge base, and the model instantly has access to new information. No retraining needed.
2. It's transparent. You can show users which documents were used to answer their question. No black box.
3. It's cheaper. No training costs. Just API calls for embeddings and LLM inference.
4. It's more reliable. The model is reading the information, not recalling it from memory. Less hallucination.
5. It scales better. Adding more documents is just adding more vectors. Adding more knowledge to a fine-tuned model requires retraining.
The catch: RAG requires good retrieval. If you retrieve irrelevant documents, the model will give poor answers. Garbage in, garbage out. But at least you can debug and improve the retrieval step independently.
Advanced RAG Techniques (For When Basic RAG Isn't Enough)
Once you've got basic RAG working, there are ways to make it better:
Hybrid search: Combine vector search with keyword search. Sometimes exact matches matter. (A merging sketch follows this list.)
Query transformation: Rewrite the user's question to be more search-friendly. "Why is my printer broken?" → "printer troubleshooting common issues"
Multi-hop retrieval: For complex questions, retrieve documents, generate a partial answer, use that to retrieve more specific documents. Iterate.
Document hierarchies: Store document structure (sections, headings) and use it to retrieve more relevant chunks.
Metadata filtering: Add filters to your search. Only search docs from a specific team, or published after a certain date.
Context compression: Use a smaller model to summarize retrieved documents before passing them to the main LLM. Fits more information in the context window.
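As one concrete example, hybrid search is often implemented by running both searches separately and merging the two ranked lists with reciprocal rank fusion (RRF), which needs nothing beyond the ranks themselves. A minimal sketch:

```python
# Reciprocal rank fusion (RRF): merge a keyword-search ranking and a vector-search
# ranking into one list. Each result scores 1 / (k + rank) in every list it appears
# in; k=60 is a common default that damps the influence of any single ranking.

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc-7", "doc-2", "doc-9"]   # e.g. from BM25 / full-text search
vector_hits = ["doc-2", "doc-4", "doc-7"]    # e.g. from the vector database

print(rrf_merge([keyword_hits, vector_hits]))
# doc-2 and doc-7 rise to the top because both searches agree on them
```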
Common RAG Pitfalls
Pitfall 1: Bad retrieval. Your vector search returns irrelevant documents. Fix: Better chunking, better embedding model, hybrid search, query transformation.
Pitfall 2: Too much context. You retrieve 20 documents, the LLM gets overwhelmed or hits token limits. Fix: Better reranking, return fewer chunks, summarize before passing to LLM.
Pitfall 3: Out-of-date embeddings. You updated your docs but didn't re-embed them. Fix: Set up a pipeline to automatically re-embed when documents change.
Pitfall 4: Poor quality source documents. Your knowledge base is messy, inconsistent, or wrong. Fix: Clean your data. This is always the answer.
Pitfall 5: No fallback. Retrieval finds nothing relevant, model hallucinates anyway. Fix: Instruct the model to admit when it doesn't have enough context to answer.
When to Use RAG vs. Fine-Tuning vs. Just Prompting
Let's settle this once and for all:
Use prompting when:
- The model already knows what you need
- You can fit examples in the prompt
- You need flexibility and fast iteration
Use RAG when:
- You need the model to know specific information
- That information changes frequently
- You want transparency and control over what the model uses
- You have a knowledge base you can search
Use fine-tuning when:
- You need a specific style/tone that prompting can't achieve
- You have abundant training data for a specific task
- You need lower latency and can't afford retrieval overhead
- You're willing to commit to maintenance and retraining
Use RAG + fine-tuning when:
- You need both specific knowledge (RAG) and specific behavior (fine-tuning)
- You have the resources to do both
- Your use case genuinely requires it (most don't)
Building Your First RAG System
If you're ready to try this, here's the minimum viable approach:
Choose a vector database: Start with something simple like Chroma or pgvector.
Pick an embedding model: OpenAI's ada-002 is easy. Open-source alternatives like BAAI/bge-small-en work too.
Chunk your documents: Start with 500-token chunks, 50-token overlap. Adjust later.
Embed and store: Process your documents, generate embeddings, load into the database.
Build the query flow:
- Take user question
- Embed it
- Search vector DB for top 5 matches
- Build prompt with question + matches
- Send to LLM
- Return answer
Iterate: Test with real questions. Fix retrieval when it fails. Adjust chunk size. Add filters. Improve prompt instructions.
Most importantly: start small. Don't try to index your entire company knowledge base on day one. Start with one document set, make it work well, then expand.
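Put together, the minimum viable query flow fits in a few dozen lines. This sketch combines the pieces from earlier sections; Chroma and the OpenAI SDK are example choices, and the model name is an assumption:

```python
# Minimal end-to-end RAG sketch, combining the pieces above. Chroma and the
# OpenAI SDK are example choices; the model name is an assumption.
# pip install chromadb openai   (and set OPENAI_API_KEY)
import chromadb
from openai import OpenAI

llm = OpenAI()
client = chromadb.Client()
collection = client.create_collection("kb")  # Chroma's default embedding function handles vectors

def index(chunks: list[str]) -> None:
    # In a real system these chunks come from your chunking pipeline.
    collection.add(ids=[f"chunk-{i}" for i in range(len(chunks))], documents=chunks)

def ask(question: str, top_k: int = 3) -> str:
    hits = collection.query(query_texts=[question], n_results=top_k)
    context = "\n\n".join(hits["documents"][0])
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer using only the context above. If it isn't enough, say so."
    )
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # example model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

index([
    "Electronics can be returned within 30 days with a receipt.",
    "Opened software is not eligible for return.",
    "Refunds are issued within 5 business days.",
])
print(ask("What's the return window for a laptop?", top_k=2))
```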
The Takeaway
RAG is the practical middle ground between "the model doesn't know our stuff" and "training our own model is too expensive." It gives LLMs access to dynamic, specific knowledge without the cost or complexity of fine-tuning.
Is it perfect? No. Retrieval can fail. Vector search isn't magic. But it's debuggable, improvable, and usually good enough. And "good enough" with RAG is often better than "expensive and still imperfect" with fine-tuning.
For most companies looking to build LLM applications with their own data, RAG is the answer. It's not sexy, it's not revolutionary, but it works.
Next: Part 6 — AI Agents: Or, Why Your Chatbot Needs Opposable Thumbs