# RAG: How AI Models Use Your Data Without Forgetting
## Introduction

Large language models are stateless. Every time you start a conversation with one, it begins from zero, with no memory of previous sessions, no access to your internal documents, no awareness of events that happened after it was trained. This is a fundamental architectural constraint, not a bug.

Before getting into implementation, it's worth being precise about the problem.

**Knowledge cutoff**: A model trained on data up to a certain date cannot answer questions about events or documents that appeared after that date. No matter how capable the model is, it cannot retrieve information it was never trained on.

**Context isolation**: The model has no access to your codebase, your company's internal documentation, your database records, or any private data. Everything it knows comes from public training data.

RAG mitigates both by introducing a retrieval step between the user's query and the model's response. The model stops being a closed system and becomes a reasoning engine over an external knowledge base.

## How RAG Works

### Indexing

This happens before any user query arrives. Your documents (PDFs, markdown files, database exports, code files, etc.) are:

1. Split into chunks (typically 256–512 tokens each)
2. Converted into vector embeddings using an embedding model
3. Stored in a vector database alongside their original text

An embedding is a high-dimensional numerical representation of a piece of text. Semantically similar text produces embeddings that are close together in vector space. This is what makes the retrieval step possible.

### Retrieval

When a user submits a query:

1. The query itself is converted into an embedding using the same model
2. A similarity search runs against the vector database (cosine similarity is common)
3. The top-k most relevant chunks are returned

This is not keyword matching. It is semantic similarity: "aircraft engine maintenance" and "turbine servicing procedures" will surface as related even though they share no words.
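The similarity measure doing the work here is usually cosine similarity. A minimal pure-Python sketch over toy vectors (the numbers are illustrative, not real embeddings, which typically have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for model output
query      = [0.9, 0.1, 0.0, 0.2]
relevant   = [0.8, 0.2, 0.1, 0.3]   # points in a similar direction -> high score
irrelevant = [0.0, 0.9, 0.8, 0.0]   # points in a different direction -> low score

print(cosine_similarity(query, relevant) > cosine_similarity(query, irrelevant))  # True
```

Because the metric compares directions rather than exact words, two texts phrased completely differently can still score as close neighbors.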
### Generation

The retrieved chunks are injected into the prompt alongside the user's query. The model receives something like:

```
You are a helpful assistant. Use the following context to answer the question.

Context:
[RETRIEVED CHUNK 1]
[RETRIEVED CHUNK 2]
[RETRIEVED CHUNK 3]

Question: [USER QUERY]
```

The model generates a response grounded in that context.

## Architecture

A production RAG system is not a single script. It is a composition of loosely coupled components, each responsible for a specific part of the data flow. The pipeline explains what happens conceptually; the architecture defines where each responsibility lives and how the system scales under real-world constraints.

### The Ingestion Pipeline

The ingestion pipeline is responsible for preparing raw data for retrieval. This is not a one-time process; it is a repeatable system that continuously processes new or updated documents and keeps the index aligned with the source of truth.

At a high level, the ingestion pipeline performs:

- Document parsing (PDFs, HTML, markdown, code, database exports)
- Chunking into smaller units
- Metadata extraction (source, timestamp, category)
- Embedding generation
- Storage in the vector database

The embedding model lives here operationally. Its job is to convert each chunk into a high-dimensional vector: an array of floating-point numbers where position encodes semantic meaning. When a document is indexed, each chunk is transformed into a vector. Later, when a query is issued, that query is passed through the same model to produce a comparable vector. Retrieval works because semantically similar inputs map to nearby regions in vector space.

This is fundamentally different from keyword search. Keyword search matches exact terms; embedding-based retrieval matches meaning. A query like "aircraft maintenance" can retrieve chunks about turbine servicing or inspection schedules even if those exact words never appear.

The critical constraint is consistency. The same embedding model must be used for both indexing and querying.
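The prompt-injection step above is plain string assembly. A minimal sketch (the name `build_prompt` and the numbered-chunk format are illustrative, not a fixed convention):

```python
def build_prompt(chunks: list[str], question: str) -> str:
    """Assemble a grounded prompt from retrieved chunks and the user query."""
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "You are a helpful assistant. Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    ["PostgreSQL supports ACID transactions.", "Redis enables sub-millisecond reads."],
    "Which database offers the fastest reads?",
)
print(prompt)
```

Numbering the chunks gives the model something concrete to cite when the prompt later asks for source attribution.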
If documents are embedded with one model and queries with another, the vectors exist in different geometric spaces. The system does not crash; it silently degrades.

```python
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer

# Persistent vector store
client = chromadb.Client(Settings(persist_directory="./chroma_db"))
collection = client.get_or_create_collection("knowledge_base")

# Embedding model (must remain consistent)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    {"id": "doc1", "text": "PostgreSQL supports ACID transactions.", "source": "db.md"},
    {"id": "doc2", "text": "Redis enables sub-millisecond reads.", "source": "db.md"},
]

texts = [doc["text"] for doc in documents]
embeddings = embedder.encode(texts, batch_size=32).tolist()

collection.add(
    ids=[doc["id"] for doc in documents],
    embeddings=embeddings,
    documents=texts,
    metadatas=[{"source": doc["source"]} for doc in documents],
)
```

The ingestion pipeline must be treated as infrastructure. Documents change. New data arrives. Without a mechanism to re-embed and re-index, the system becomes stale and produces outdated answers that still appear correct.

### The Vector Database

The vector database stores the output of the ingestion pipeline. For each chunk, it maintains:

- The embedding vector
- The original text
- Associated metadata

Its primary function is Approximate Nearest Neighbor (ANN) search. Given a query vector, it returns the most similar vectors in the index. The "approximate" part is essential: exact nearest-neighbor search across millions of high-dimensional vectors is computationally expensive, and ANN algorithms trade a small amount of precision for massive speed gains.

Common indexing structures include:

- HNSW (graph-based, high recall, widely used)
- IVF (cluster-based, optimized for scale)

Most vector databases also support metadata filtering. This allows the system to narrow the search space before similarity search runs.
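To make "nearest neighbor" concrete, here is the exact brute-force search that ANN indexes approximate: a linear scan that scores every stored vector against the query. The index and vectors are toy data for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def knn(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Exact k-nearest-neighbor search: score every vector, keep the top k.
    O(n) per query -- the cost that ANN structures like HNSW avoid at scale."""
    scored = sorted(index.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.1, 0.9, 0.1],
    "doc3": [0.8, 0.2, 0.1],
}
print(knn([1.0, 0.0, 0.0], index, k=2))  # ['doc1', 'doc3']
```

At a few thousand vectors this scan is fine; at millions, the per-query cost is why production systems accept the small recall loss of HNSW or IVF.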
For example, filtering by document type or date reduces noise and improves retrieval precision in large datasets.

Typical choices depend on system constraints:

- Chroma for local development and prototyping
- pgvector for systems already built on PostgreSQL
- Pinecone for fully managed infrastructure
- Weaviate for hybrid search (vector + keyword)
- Qdrant for high-performance self-hosted deployments

A key operational concern here is index freshness. If the source data changes and the embeddings are not updated, the system retrieves the closest match to outdated content. The retrieval step still works correctly, but the answer is wrong relative to the current state of the data.

### The Retriever

The retriever sits between the user's query and the vector database. Its job is to translate a query into something the index can search and then return the most relevant chunks. The process is straightforward:

1. Convert the query into an embedding
2. Run similarity search against the index
3. Return the top-k results

What matters is how retrieval is implemented. There are three distinct strategies:

- **Dense retrieval** uses embedding similarity. It captures semantic meaning but can miss exact-match queries where specific terminology matters.
- **Sparse retrieval** (BM25) relies on keyword matching. It is reliable for exact phrases, identifiers, and technical terms but has no semantic understanding.
- **Hybrid retrieval** combines both. The query is run through dense and sparse systems, and results are merged using ranking strategies like Reciprocal Rank Fusion. This is the production-standard approach because it covers the failure modes of each method.

### The Orchestration Layer

The orchestration layer connects all components into a coherent system. It is responsible for controlling the flow of data between ingestion, retrieval, and generation. This layer handles:

- Query transformation (rewriting, expansion, decomposition)
- Retrieval calls
- Metadata filtering
- Context selection and ordering
- Prompt construction

It determines what the model sees.
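Reciprocal Rank Fusion, the merging strategy named under hybrid retrieval above, is simple enough to sketch: each document's fused score is the sum of 1/(k + rank) across every ranked list it appears in. The constant k = 60 is the commonly cited default; the doc ids are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc3", "doc1", "doc5"]   # embedding-similarity ranking
sparse = ["doc1", "doc2", "doc3"]   # BM25 keyword ranking
print(reciprocal_rank_fusion([dense, sparse]))  # doc1 wins: it ranks high in both lists
```

Documents that appear near the top of both rankings accumulate score from each list, which is exactly the behavior hybrid retrieval wants: agreement between the dense and sparse views is strong evidence of relevance.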
Even with a strong retriever, poor orchestration can degrade results. Passing too many irrelevant chunks introduces noise; passing too few misses critical information. The orchestration layer is where most of the system's intelligence resides.

A critical constraint here is context management. The context window is finite, and every token passed to the model competes for attention. Increasing the number of retrieved chunks increases recall but reduces signal density. Effective systems optimize for relevance, not volume.

### The Generator

The generator is the language model. In a RAG system, its role is not to retrieve knowledge but to synthesize a response from the provided context.

- Input: the user query and the retrieved chunks
- Output: a grounded response

The quality of generation depends on three factors:

1. Retrieval quality (are the right chunks present?)
2. Prompt design (are grounding constraints explicit?)
3. Model capability (does it follow instructions reliably?)

Claude performs well in this role because:

- It supports large context windows (100k–200k+ tokens depending on variant)
- It adheres closely to instructions like "answer only from the provided context" and "cite sources"

The primary failure mode is partial grounding: the model uses retrieved context but fills gaps with its own training data. This produces answers that appear correct but are not fully supported by the source material.

### End-to-End Query Flow

1. User submits query
2. Query is optionally rewritten or expanded
3. Query embedding is generated
4. Retriever performs similarity search (top-k)
5. Optional re-ranking refines results
6. Context is assembled (filtered, ordered, truncated)
7. Prompt is constructed with grounding constraints
8. LLM generates response
9. Response is optionally post-processed (formatting, citations)

## Choosing an Embedding Model

Embeddings are generated by a separate model, not the LLM itself. The embedding model's job is purely to convert text into vectors.
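The context-assembly step in the flow above can be sketched as a greedy selection: take chunks in relevance order until the token budget is spent. Word counts stand in for real token counts in this sketch; a production system would use the model's tokenizer.

```python
def assemble_context(ranked_chunks: list[str], budget_tokens: int) -> list[str]:
    """Greedily keep the most relevant chunks that fit the token budget.
    Word count approximates token count in this sketch."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())
        if used + cost > budget_tokens:
            continue  # this chunk doesn't fit; a smaller later chunk still might
        selected.append(chunk)
        used += cost
    return selected

chunks = [
    "Redis is an in-memory key-value store optimized for low-latency reads.",
    "PostgreSQL supports ACID transactions and complex joins across tables.",
    "MongoDB stores data as BSON documents with flexible schemas.",
]
print(assemble_context(chunks, budget_tokens=20))
```

Because the input is already ordered by relevance, trimming from the tail drops the least useful material first, which is the "optimize for relevance, not volume" principle in code.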
Common options:

- **OpenAI text-embedding-3-small**: fast, cheap, widely used
- **Cohere Embed v3**: strong multilingual support
- **sentence-transformers (local)**: runs on your machine, no API cost
- **Voyage AI**: commonly used alongside Claude due to strong embedding performance

The embedding model must remain consistent. If you index your documents with model A, you must also embed queries with model A. Mixing models breaks the similarity search.

## Choosing a Vector Database

The vector database is the backbone of any RAG implementation. It stores embeddings and supports fast approximate nearest-neighbor (ANN) search.

- **pgvector**: A PostgreSQL extension that adds vector storage and similarity search to an existing relational database. Best suited for systems already running PostgreSQL in production that want to avoid introducing a separate infrastructure component.
- **Pinecone**: A fully managed cloud vector database. It handles scaling, replication, and performance optimization automatically. Best suited for production environments where operational overhead needs to be minimized, at the cost of higher per-query pricing.
- **Chroma**: An embedded, in-process vector store with zero infrastructure requirements. Best suited for local development and rapid prototyping. Not designed for large-scale production workloads.
- **Weaviate**: An open-source vector database that supports both self-hosted and managed deployments. Best suited for production systems that require hybrid search (vector + keyword) and flexible schema design.
- **Qdrant**: An open-source, high-performance vector database built in Rust. Best suited for latency-sensitive, large-scale deployments where performance and control over infrastructure are critical.

## Query Transformation

Three common techniques:

**Query Rewriting**: The LLM rewrites the user's query into a form more likely to match indexed content. Conversational queries get expanded into more complete statements.

```python
rewrite_prompt = """Rewrite the following user query to be more explicit and self-contained, suitable for a document similarity search.

Original query: {query}

Rewritten query:"""
```

**Multi-Query Expansion**: Generate multiple variations of the query, run each through the retriever, then union the results. This improves recall when a single phrasing misses relevant chunks.

```python
expansion_prompt = """Generate 3 different phrasings of the following question. Return them as a JSON array of strings.

Question: {query}"""
```

**Query Decomposition**: Break complex multi-part questions into sub-queries, retrieve for each independently, then synthesize. This is critical for questions like "compare how PostgreSQL and MongoDB handle transactions and schema changes", which is two retrieval problems, not one.

The tradeoff is latency. Each transformation adds an LLM call before retrieval even starts. For low-latency applications, query rewriting alone is usually enough. For high-precision knowledge base search, multi-query or decomposition is worth the overhead.

## Failure Modes

A working prototype is not a reliable system. Retrieval is the main failure point in production RAG, and it fails in predictable ways.

**Semantic mismatch**: Your query and your document chunks use different vocabulary for the same concept. The embedding model returns chunks that are topically adjacent but not actually relevant. Hybrid search (vector + BM25 keyword) mitigates this by covering both semantic and lexical similarity.

**Embedding drift**: You start embedding queries with a newer or different model without re-embedding and re-indexing your documents. The vectors in your database no longer align with the query embeddings being generated. The system continues to run, but retrieval quality degrades silently.

**Attention dilution**: Large context windows are a capability, not a license to fill them. Passing 40 chunks into Claude's context when only 3 are relevant doesn't improve the answer — it degrades it. Irrelevant retrieved text introduces noise that competes with the signal. More tokens retrieved ≠ better answers. Optimize what goes into context, not just how much.
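The union step in multi-query expansion has a subtlety worth showing: results from different phrasings overlap, so duplicates must be collapsed while each document keeps its best rank. A pure-Python sketch with hypothetical doc ids:

```python
def union_results(result_lists: list[list[str]]) -> list[str]:
    """Union ranked result lists from multiple query phrasings.
    Each document keeps its best (lowest) rank across the lists."""
    best_rank: dict[str, int] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            if doc_id not in best_rank or rank < best_rank[doc_id]:
                best_rank[doc_id] = rank
    return sorted(best_rank, key=best_rank.get)

# Retrieval results for three phrasings of the same question
phrasing_results = [
    ["doc2", "doc7", "doc1"],
    ["doc7", "doc3"],
    ["doc2", "doc9"],
]
print(union_results(phrasing_results))
```

Keeping the best rank rather than, say, the average rewards a document that any single phrasing matched strongly, which is the recall-oriented behavior expansion is meant to provide.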
**Over-chunking**: Chunks that are too small lose the surrounding context that gives a sentence its meaning. A chunk containing "This is not recommended for production use" is useless without knowing what "this" refers to.

**Stale index**: Documents change. If your index is not updated when source data changes, the model will confidently answer from outdated information. Build re-indexing into your data pipeline, not as a manual step.

**Weak prompts**: If the prompt doesn't instruct the model to stay grounded in the retrieved context, it will supplement gaps with its own training data, which may be wrong, outdated, or irrelevant. Explicit grounding instructions are not optional.

## Implementation with Claude

Claude is a strong fit for RAG pipelines. Its large context window (100k–200k+ tokens depending on variant) means you can pass in substantially more retrieved chunks than most models support. It also follows retrieval-grounded instructions reliably — when told to answer only from provided context and cite sources, it does.
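The over-chunking failure mode above is usually addressed with a sliding window: fixed-size chunks that overlap, so a sentence split at one boundary survives intact in the neighboring chunk. A minimal word-based sketch (real systems split on tokens, not words):

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into word-based chunks with a sliding-window overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already covers the tail of the document
    return chunks

doc = " ".join(f"word{i}" for i in range(250))
chunks = chunk_text(doc, chunk_size=100, overlap=20)
print(len(chunks))           # 3 windows cover 250 words with step 80
print(chunks[1].split()[0])  # word80 -- the second window starts 20 words early
```

The 20-word overlap here is the 10–20% range most chunking guides recommend: large enough that a boundary sentence appears whole in at least one chunk, small enough to avoid heavy index duplication.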
Install dependencies:

```shell
pip install anthropic chromadb sentence-transformers
```

Build the index:

```python
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer

# Persistent client — data survives restarts
client = chromadb.Client(Settings(persist_directory="./chroma_db"))
collection = client.get_or_create_collection("knowledge_base")

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    {"id": "doc1", "text": "PostgreSQL supports ACID transactions and complex joins.", "source": "db_guide.md"},
    {"id": "doc2", "text": "MongoDB stores data as BSON documents with flexible schemas.", "source": "db_guide.md"},
    {"id": "doc3", "text": "Redis is an in-memory key-value store optimized for low-latency reads.", "source": "db_guide.md"},
]

# Batch embed — don't loop individual encode calls
texts = [doc["text"] for doc in documents]
embeddings = embedder.encode(texts, batch_size=32).tolist()

collection.add(
    ids=[doc["id"] for doc in documents],
    embeddings=embeddings,
    documents=texts,
    metadatas=[{"source": doc["source"]} for doc in documents],
)
```

Query and generate:

```python
import anthropic

claude = anthropic.Anthropic(api_key="YOUR_API_KEY")

def rag_query(user_query: str, top_k: int = 3) -> str:
    # Embed the query
    try:
        query_embedding = embedder.encode(user_query).tolist()
    except Exception as e:
        return f"Embedding failed: {e}"

    # Retrieve relevant chunks
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
    )
    chunks = results["documents"][0]
    sources = [m["source"] for m in results["metadatas"][0]]

    if not chunks:
        return "No relevant context found in the knowledge base."

    # Format context with source attribution
    context_block = "\n\n".join(
        f"[Source: {src}]\n{chunk}"
        for chunk, src in zip(chunks, sources)
    )

    # Grounded prompt with citation enforcement
    prompt = f"""You are a precise assistant. Answer the question using ONLY the context provided below.
If the context does not contain enough information to answer, say "The available context does not cover this."
Cite the source for each claim you make.

Context:
{context_block}

Question: {user_query}

Answer (with citations):"""

    try:
        response = claude.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
    except Exception as e:
        return f"Generation failed: {e}"

answer = rag_query("What database should I use for low-latency reads?")
print(answer)
```

## Chunking Strategy

One of the most impactful decisions in any RAG system is how you split documents into chunks. Poor chunking degrades retrieval quality regardless of how good the embedding model or the LLM is.

**Chunk size**: Smaller chunks (128–256 tokens) retrieve more precise passages but may lack context. Larger chunks (512–1024 tokens) preserve context but dilute similarity scores. Most production systems land at 256–512 tokens with overlap.

**Overlap**: A sliding window with 10–20% overlap between adjacent chunks prevents splitting a critical sentence across two chunks that might both miss retrieval.

**Semantic chunking**: Instead of splitting at fixed token counts, split at natural boundaries: paragraphs, sections, sentence groups. LangChain and LlamaIndex both provide semantic splitters that do this automatically.

**Metadata filtering**: Tag each chunk with metadata (document source, date, category). At retrieval time, pre-filter by metadata before running similarity search. This improves precision significantly for large knowledge bases.

## RAG vs. Fine-Tuning

**Purpose**: RAG provides access to external, dynamic knowledge at inference time.

**Data handling**: In RAG, data remains external to the model and is retrieved when needed.

**Updateability**: RAG systems can be updated in real time by modifying the index. Fine-tuned models require retraining to incorporate new information.

**Cost profile**: RAG incurs ongoing inference and retrieval costs but avoids heavy training expenses.
**Hallucination risk**: RAG reduces hallucination by grounding responses in retrieved context, but does not eliminate it.

**Best use cases**: RAG is best for:

- FAQs
- Documentation systems
- Internal knowledge bases
- Real-time or frequently changing data

Fine-tuning is best for:

- Enforcing response format
- Adapting tone and style
- Improving performance in narrow domains with consistent patterns

In practice, RAG handles knowledge access while fine-tuning handles behavior. If you want a model that knows your company's internal knowledge base and stays current as that knowledge changes, RAG is the right tool. If you want a model that communicates in a specific format or understands domain-specific terminology deeply, fine-tuning is more appropriate. Most mature AI applications combine both.

## Evaluation

RAG without evaluation is blind. You can't improve what you're not measuring.

**Retrieval quality**: Are the right chunks being surfaced? Measure precision@k: of the top-k chunks returned, what fraction are actually relevant to the query (relevant-in-top-k divided by k)? This requires a labeled evaluation set, a set of queries with known relevant documents.

**Generation faithfulness**: Is the model's answer grounded in the retrieved context, or is it drifting into its own training data? Tools like RAGAS automate this measurement by scoring how much of the generated answer is supported by the retrieved chunks.

Run both evaluations before going to production, and re-run them whenever you change your chunking strategy, embedding model, or prompt.

## Wrap Up

RAG is not a complex pattern once you understand the three moving parts: the indexer that converts your documents into searchable vectors, the retriever that finds relevant chunks at query time, and the generator (Claude, in this case) that synthesizes a grounded response.

The power of the approach is that it decouples knowledge from the model. Your data stays in your infrastructure, gets updated on your schedule, and the model reasons over it without needing to be retrained.
For any application where the knowledge base changes faster than a model can be fine-tuned, which is most real-world applications, RAG is the right foundation. The implementation above is enough to get a working system running in a day. The sections on failure modes, query transformation, and evaluation are what separate that prototype from something you can trust in production.
