What RAG is
Retrieval-Augmented Generation gives an LLM access to your specific knowledge base at query time. The system retrieves the most relevant chunks of your documents, then the model generates an answer grounded in that retrieved context.
It exists because LLMs trained on the public internet don't know your product's docs, your customer support history, or your internal wiki. Fine-tuning works in theory but is expensive, slow to update, and overkill for "the model needs to know our docs." RAG is the better default in 2026.
Production RAG vs demo RAG
The naive version (chunk the docs, embed them, retrieve the top-K chunks by cosine similarity, stuff them into the prompt) works for the demo. It falls apart when real users ask exact-match queries (error codes, product names, numbers) that vector search misses, or when the retrieved chunks span multiple sources and the model conflates them.
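For reference, the naive retrieval step can be sketched in a few lines. This is a minimal illustration, not a production retriever: the chunk IDs and the 3-dimensional vectors are made up, and a real system would get its vectors from an embedding model and store them in a vector index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=3):
    """Return the IDs of the k chunks most similar to the query vector."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]

# Hypothetical toy embeddings (real ones have hundreds of dimensions).
chunks = {
    "billing-faq": [0.9, 0.1, 0.0],
    "error-codes": [0.1, 0.9, 0.1],
    "onboarding": [0.0, 0.2, 0.9],
}
print(top_k([0.8, 0.2, 0.1], chunks, k=2))
```

Nothing in this ranking looks at the literal tokens of the query, which is why an exact string such as an error code can lose out to chunks that are merely on a similar topic.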
Production-grade RAG in 2026 has six layers:
1. Semantic chunking (respects document structure, 300–800 tokens, 10–15% overlap)
2. Embeddings stored in pgvector or similar
3. Hybrid search: vector + BM25 lexical, fused via Reciprocal Rank Fusion
4. Cross-encoder reranker on the top 20–30 candidates
5. Citation-required generation with strict output schema
6. Evals as a CI gate with 50–200 held-out questions
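The fusion step in layer 3 is small enough to show in full. The sketch below assumes each retriever returns an ordered list of chunk IDs; the IDs are invented for illustration, and k=60 is the constant commonly used with Reciprocal Rank Fusion rather than a value tuned for any particular corpus.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs; each list adds 1 / (k + rank) per ID."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))

vector_hits = ["c3", "c1", "c7"]  # hypothetical vector-search results
bm25_hits = ["c9", "c3", "c1"]    # hypothetical BM25 results
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['c3', 'c1', 'c9', 'c7']
```

Because RRF only uses ranks, it needs no score normalization between the vector and lexical retrievers. Chunks that appear in both lists rise to the top, while an exact-match hit found only by BM25 (c9 here) still makes the candidate set passed to the reranker.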
Skipping layers 3–6 is why most teams complain about hallucinations. The model is rarely the problem.
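Layers 5 and 6 can be wired together as one check, roughly as follows. This is a sketch under stated assumptions: the answer format (a dict with a "citations" list), the case format, the 0.9 threshold, and the fake_pipeline stub are all hypothetical stand-ins for your own pipeline and eval set.

```python
def citations_valid(answer, retrieved_ids):
    """An answer passes only if it cites something, and only retrieved chunks."""
    cited = answer.get("citations", [])
    return bool(cited) and all(c in retrieved_ids for c in cited)

def eval_gate(cases, answer_fn, min_pass_rate=0.9):
    """Run held-out questions; return (gate_passed, pass_rate)."""
    passed = 0
    for case in cases:
        answer, retrieved_ids = answer_fn(case["question"])
        ok = (citations_valid(answer, retrieved_ids)
              and case["expected_source"] in answer.get("citations", []))
        passed += int(ok)
    rate = passed / len(cases)
    return rate >= min_pass_rate, rate

def fake_pipeline(question):
    """Hypothetical stand-in for the real RAG pipeline."""
    canned = {
        "How do I reset my password?":
            ({"text": "Use the reset link.", "citations": ["doc-12"]}, ["doc-12", "doc-40"]),
        "What does error E42 mean?":
            ({"text": "It is a timeout.", "citations": ["doc-99"]}, ["doc-7", "doc-8"]),
    }
    return canned[question]

cases = [
    {"question": "How do I reset my password?", "expected_source": "doc-12"},
    {"question": "What does error E42 mean?", "expected_source": "doc-7"},
]
gate_ok, rate = eval_gate(cases, fake_pipeline)
print(gate_ok, rate)  # False 0.5
```

In CI, a failing gate would typically end the job with a non-zero exit code so a regression in retrieval or prompting blocks the merge. The second case fails here because the answer cites a chunk that was never retrieved, which is the pattern usually reported as a hallucination.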
What it costs to run
A working knowledge-base RAG at moderate scale (10K queries/day, 5K docs) costs $200–$600/month all-in, mostly for LLM inference. Caching responses keyed by query plus chunk IDs cuts that by 30–50%. Model routing (Haiku for simple queries, Sonnet for synthesis) cuts it further.
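One way to build that cache key is sketched below. The normalization rules (lowercasing, collapsing whitespace, sorting chunk IDs) and the in-memory dict are assumptions for illustration; a deployed version would more likely use a shared store with expiry, and should keep chunk order in the key if the prompt depends on it.

```python
import hashlib
import json

def cache_key(query, chunk_ids):
    """Stable key from the normalized query plus the set of retrieved chunk IDs."""
    normalized = " ".join(query.lower().split())
    payload = json.dumps({"q": normalized, "chunks": sorted(chunk_ids)})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class ResponseCache:
    """In-memory cache; only calls the LLM when the key has not been seen."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_generate(self, query, chunk_ids, generate):
        key = cache_key(query, chunk_ids)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = generate(query, chunk_ids)  # the expensive LLM call
        self._store[key] = answer
        return answer

def fake_llm(query, chunk_ids):
    """Hypothetical stand-in for the real generation call."""
    return f"answer to {query!r} from {sorted(chunk_ids)}"

cache = ResponseCache()
first = cache.get_or_generate("How do I reset my password?", ["c3", "c1"], fake_llm)
second = cache.get_or_generate("how do I reset  my password?", ["c1", "c3"], fake_llm)
print(cache.hits, cache.misses)  # 1 1
```

Including the chunk IDs in the key means a cached answer is invalidated automatically when re-indexing changes what gets retrieved for the same question.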
When NOT to use RAG
Skip RAG when the answer needs reasoning beyond your docs (doing math, planning a multi-step task, writing creative content), when the corpus is small enough to fit in the model's context window, or when you actually need fine-tuning (rare, and usually not the right call in 2026).


