

# RAG Series (4): Document Processing — From Raw Files to High-Quality Chunks

WonderLab · DEV Community

Why "How You Cut" Matters as Much as "What You Cut" In the first three articles, we built a working RAG pipeline and tuned the core parameters. But if you look closely at the retrieval results, you may notice a strange phenomenon: The answer is clearly in the document, yet the Retriever can't find it. Or it finds it, but the answer is cut in half — the LLM only sees the first half of the sentence. The problem usually lies in the chunking step. Chunking is essentially an information splitting strategy — how you divide a 500-page book, how large each piece is, and where you make the cuts directly determines whether the reader (here, the Retriever) can quickly find what they need. In this article, we'll process the same technical document with four different strategies so you can see the dramatic differences that "how you cut" makes. šŸ“Ž Source Code: All experiment code is open-sourced at llm-in-action/04-chunking-strategies. Clone it to reproduce the results. Before diving in, here's a quick reference table to build intuition: Strategy Core Idea Pros Cons Fixed Size Cut at fixed character intervals, like scissors cutting paper Simple, uniform chunk sizes May cut through sentences, poor semantic integrity Recursive Character Try separators in priority order: paragraph → line → sentence → word Balances semantics and uniformity Limited Chinese support (uses English punctuation) Semantic Chunking Compute semantic similarity between adjacent sentences, cut where similarity drops Highly semantically coherent chunks Requires Embedding API, higher cost Document Structure Split by Markdown/HTML heading hierarchy Preserves document structure, retrieved chunks carry chapter context Only works for structured documents The full runnable code is available at llm-in-action/04-chunking-strategies, including: chunking_compare.py — The 4-strategy comparison script data/sample-tech-doc.md — Sample Markdown technical document .env.example — Environment variable template (SemanticChunker requires an Embedding API) We'll use a ~5,400-character Markdown technical document titled "Microservices Architecture Design Guide," containing 7 top-level chapters with multiple level-2 and level-3 headings, covering service decomposition, communication protocols, data consistency, observability, security, and deployment. Strategy Key Configuration Fixed Size CharacterTextSplitter(chunk_size=512, chunk_overlap=50) Recursive Character RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50, separators=["\n\n", "\n", ". ", " ", ""]) Semantic SemanticChunker(embeddings, breakpoint_threshold_type="percentile", breakpoint_threshold_amount=85, sentence_split_regex=r"(? 
## Semantic Chunking

Semantic chunking deserves special attention: SemanticChunker embeds sentences via an Embedding API, and getting it to run surfaced three pitfalls.

**Pitfall 1: Embedding batch size exceeds the API limit**
Error: `... maximum allowed batch size 32`
→ Fix: `OpenAIEmbeddings(chunk_size=32)` caps each embedding request at 32 inputs.

**Pitfall 2: Single-input token limit exceeded**
Error code: `413 - input must have less than 512 tokens`
→ Fix: Set `buffer_size=0` to prevent SemanticChunker from concatenating neighboring sentences before embedding them.

**Pitfall 3: Empty strings cause 400 errors**
Error code: `400 - The parameter is invalid`
→ Fix: Subclass SemanticChunker and override `_get_single_sentences_list` to filter out empty strings:

```python
import re
from typing import List

from langchain_experimental.text_splitter import SemanticChunker


class FilteredSemanticChunker(SemanticChunker):
    def _get_single_sentences_list(self, text: str) -> List[str]:
        # Drop whitespace-only sentences before they reach the Embedding API.
        sentences = re.split(self.sentence_split_regex, text)
        return [s for s in sentences if s.strip()]
```

Results:

| Metric | Value |
| --- | --- |
| Chunk count | 9 (fewest) |
| Average length | 590.9 chars |
| Max length | 2,047 chars |
| Min length | 17 chars |

**Key Finding**: Semantic chunking produces the fewest chunks (9), but with extreme size variation: the smallest is 17 chars, the largest 2,047. This confirms it's truly grouping by semantic boundaries: semantically similar sentences are merged into large chunks, while topic transitions become tiny chunks. For example, the entire "Service Communication" chapter (REST vs gRPC vs message queues) was aggregated into one 1,189-character chunk, because it all discusses the same topic. Transition sentences between chapters became tiny fragments (like a 28-character decision-tree snippet).

## Document Structure Chunking

The first three strategies are like cutting blindfolded: they don't know the document structure and rely purely on text features. Document structure chunking, in contrast, keeps its eyes open: it recognizes Markdown `#`, `##`, `###` headings and splits strictly by heading hierarchy. Each chunk's boundary is a heading boundary: it starts at one heading and ends before the next heading at the same or a higher level.

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ],
    strip_headers=False,  # Keep headings inside chunk content
)
chunks = splitter.split_text(text)
```

Results:

| Metric | Value |
| --- | --- |
| Chunk count | 20 (most) |
| Average length | 266.5 chars |
| Max length | 402 chars |
| Min length | 71 chars |

**Key Finding**: Document structure chunking produces the most chunks (20), but each one carries an "ID card": metadata recording which heading hierarchy it belongs to:

```python
chunk.metadata = {
    "Header 1": "Microservices Architecture Design Guide",
    "Header 2": "1. Service Decomposition Strategy",
    "Header 3": "1.1 Split by Business Boundary (DDD)",
}
```

This means during retrieval you get not just the content but also its chapter of origin. This is extremely valuable for citation tracing ("The answer comes from Chapter X of the document").
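For instance, a retrieval layer can turn that metadata into a human-readable provenance trail. A quick sketch (the `cite` helper below is hypothetical, not part of the repo):

```python
# Hypothetical helper: build a chapter trail from the metadata that
# MarkdownHeaderTextSplitter attached to each chunk above.
def cite(chunk) -> str:
    trail = [
        chunk.metadata[key]
        for key in ("Header 1", "Header 2", "Header 3")
        if key in chunk.metadata
    ]
    return " > ".join(trail)


for chunk in chunks[:3]:
    # e.g. "Microservices Architecture Design Guide > 1. Service
    # Decomposition Strategy > 1.1 Split by Business Boundary (DDD)"
    print(cite(chunk))
```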
## The Four Strategies Side by Side

| Strategy | Chunks | Avg Length | Median | Max | Min |
| --- | --- | --- | --- | --- | --- |
| Fixed Size | 12 | 453.5 | 476.5 | 506 | 128 |
| Recursive Character | 13 | 431.5 | 457.0 | 507 | 88 |
| Semantic | 9 | 590.9 | 422.0 | 2047 | 17 |
| Document Structure | 20 | 266.5 | 259.0 | 402 | 71 |

Suppose the user asks: "What are the anti-patterns of microservice decomposition?"

| Strategy | Retrieved Chunk | Issue |
| --- | --- | --- |
| Fixed Size | Chunk 4 (contains partial anti-pattern content, but starts mid-sentence) | List item starts in the middle; the LLM lacks full context |
| Recursive Character | Chunk 5 (fully contains the "1.3 Common Anti-patterns" section) | Good, but may truncate if the section is long |
| Semantic | Chunk 3 (aggregates anti-patterns plus some following content) | May include irrelevant content |
| Document Structure | Chunk 6 (exactly matches "### 1.3 Common Anti-patterns") | Best: a precise structural match |

## Which Strategy Should You Use?

| Scenario | Recommended Strategy | Reasoning |
| --- | --- | --- |
| General technical docs (PDF/Word) | Recursive Character | Most reliable baseline, no special formatting required |
| Markdown / papers / books | Document Structure | Preserves chapter structure, retrievable with provenance |
| Terminology-dense docs (legal/medical) | Semantic Chunking | Semantically coherent chunks, reduces cross-topic noise |
| Ultra-high-speed chunking (real-time) | Fixed Size | Zero computation overhead, pure string operations |
| Code documentation | Recursive Character + custom separators | Split by function/class boundaries (see the sketch at the end of this article) |

A practical decision flow:

1. Start with recursive character chunking as your baseline.
2. If your documents are Markdown/HTML, try document structure chunking.
3. If retrieval quality is still unsatisfactory, upgrade to semantic chunking (highest cost but best quality).

## Summary

This article used the same document and four strategies to show how "how you cut" affects RAG quality:

- **Fixed Size**: Simple but brutal. Good for rapid prototyping.
- **Recursive Character**: The most universal baseline. Sufficient for 80% of scenarios.
- **Semantic Chunking**: Best quality but highest cost. Use when precision is critical.
- **Document Structure**: Best choice for structured documents. Retrieved chunks carry built-in context.

**Key Takeaway**: There is no perfect chunking strategy, only the strategy that fits your document type and business scenario. In real projects, run the comparison script from this article on your own documents and let the data guide your decision.
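As promised in the recommendation table, here is one way to get function/class-boundary splitting for code documentation: LangChain's RecursiveCharacterTextSplitter ships per-language separator presets via `from_language`. A small sketch (the file name is hypothetical; this example is not from the repo):

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# The Python preset tries "\nclass " and "\ndef " separators first,
# so chunks tend to align with class/function boundaries.
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=512, chunk_overlap=50
)

source_code = open("my_module.py", encoding="utf-8").read()  # hypothetical file
for chunk in code_splitter.split_text(source_code):
    print(len(chunk), repr(chunk[:50]))
```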