Stop Over-Engineering Your AI Agent. Then Teach It What It Doesn't Know

DEV Community · Einsteinder

We built an AI agent that searches across a user's emails, Slack, Jira, and other channels to answer questions like "find the Acme project update" or "list all release versions." The search was broken in ways we didn't expect — and the fixes were counterintuitive.

Every instinct said add more intelligence: smarter classification, better query rewriting, self-grading loops, bigger models. Every one of those instincts was wrong. The real path had two parts:

1. **Get out of the AI's way.** Most of our "improvements" were making things worse. Removing them fixed more than adding them.
2. **Teach the AI what it can't derive.** Once we stripped the pipeline down, we found failures that no amount of model capability could fix. Those needed human knowledge encoded as part of the system.

This post is about both halves.

Users typed a client name — the search returned zero results. They typed a short project code — zero results. Dense vector search is great at semantic similarity but terrible at exact entity matching. A 3-letter abbreviation doesn't embed well. Proper nouns don't cluster near related content. And "Acme Corp" as a cosine similarity vector doesn't match the actual email about Acme Corp as reliably as you'd expect.

**The instinct:** Train better embeddings. Fine-tune the model. Add query expansion.

**What actually worked:** Add a boring keyword search alongside the semantic one. Fuse the results. This is hybrid search with Reciprocal Rank Fusion (RRF), and every serious production system uses it. Dense retrieval and sparse (keyword) retrieval solve different problems. You need both.

```
query → dense search + keyword search (parallel) → RRF fusion → results
```

That's it. No training, no fine-tuning, no fancy rewriters. Just two searches merged by rank.
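For concreteness, here's a minimal sketch of that fusion step. It assumes `dense_search` and `keyword_search` already exist and return ranked lists of document IDs; both names are placeholders, not our actual API.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Merge ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each document earns 1 / (k + rank) from every list it appears in;
    k=60 is the commonly used constant that keeps a single #1 ranking
    from dominating the fused order.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query, top_k=10):
    # Run both retrievers on the user's original, unmodified query.
    dense_ids = dense_search(query, top_k=top_k)      # hypothetical dense retriever
    keyword_ids = keyword_search(query, top_k=top_k)  # hypothetical keyword/BM25 retriever
    return rrf_fuse([dense_ids, keyword_ids])[:top_k]
```

Documents that show up in both lists float to the top; documents that only one retriever likes still survive, which is exactly the behavior you want for proper nouns and short project codes.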
Our original pipeline looked like this:

```
query
  → LLM call #1: classify the query type (1-2s)
  → LLM call #2: rewrite into 2-3 sentences (1-2s)
  → search
  → results
```

It felt intelligent. It was actually terrible.

The classifier would guess the query type ("this is a notification query") and that guess would control downstream filtering. When it guessed wrong — which happened constantly on short or ambiguous queries — the filter threw out valid results.

The rewriter was worse. Give it a client name and it would helpfully expand the query into something like "update on the project, including progress, milestones, and team members involved." The original keyword now lives inside a padded sentence instead of being the focused term the user typed. The embedding drifts. Keyword matching weakens. Results get worse.

**The instinct:** Make the classifier smarter. Make the rewriter better. Add fallbacks.

**What actually worked:** Delete both. Pass the user's original query directly to hybrid search. Latency dropped from ~4-6s to ~2-3s. Zero-result failures went away. Quality went up.

**The principle:** Every LLM call before retrieval is a place where your pipeline can add noise. If it isn't provably adding signal, it's subtracting it.

Search results were formatted to be readable: full previews, metadata fields, relevance scores, conversation IDs, the works. Roughly 300 tokens per result. Ten results per search. The agent would make two searches and suddenly its context was full of ~6,000 tokens of formatted noise. By the third search call, the model would start narrating its intentions ("Let me search more specifically...") instead of actually making tool calls. It had lost coherence.

**The instinct:** Give the LLM a bigger context window. Upgrade the model.

**What actually worked:** Cut tool output from ~300 tokens per result to ~50. Stop sending conversation IDs and relevance percentages to the LLM. Keep just what's needed to reason about the next step.

```
# Before (~300 tokens per result)
[1] OUTLOOK: [repo] Release v0.4.0 (PR #332)
    Date: 2026-03-11 10:25
    Relevance Score: 98%
    Participants: github-actions[bot]
    Content Preview: Summary: Bump project version from 0.3.2 to 0.4.0...
    Outlook Conversation ID: AAQkADE1YTA2Y2Q4...

# After (~50 tokens per result)
[src:332] (outlook) Release v0.4.0 (PR #332) | 2026-03-11 | From: github-actions
Bump project version from 0.3.2 to 0.4.0
```

The 6x reduction is the difference between the agent finishing its job and running out of coherence mid-response. Design tool output for the reader that's consuming it — which, in an agentic system, isn't you.

That's the "get out of the way" part. Three changes. All subtractions. All improved the system. Now here's where it gets interesting.

After stripping the pipeline down, we tested a harder query: "Find Alex's phone number."

The number existed in the system. It lived in the email signature of a message from Alex. The agent searched "Alex phone number" — found emails mentioning phone numbers but not the actual digits. It searched "Alex Chen phone" — same thing. It tried six or seven variations. It eventually gave up. The number was right there. The agent couldn't find it.

Here's why: the document that contained the phone number never used the word "phone." It was an email signature — just a name, a title, and some digits. The user's query vocabulary and the document's vocabulary had nothing in common. No amount of keyword search, embedding similarity, or query rewriting can bridge that gap. It's the fundamental limit of lexical + semantic retrieval.

This is where a human would instantly succeed. A person asked "find Alex's phone number" would think: "Phone numbers are in email signatures. Let me pull up an email from Alex and read the bottom." That's a two-step reasoning chain that has nothing to do with the words "phone number."

The AI doesn't know this. Not because it's dumb — because it can't derive from first principles that phone numbers live in signatures. That's human knowledge about how email works.

We were tempted to just wait: "Once models get better at multi-hop reasoning, they'll figure it out." The problem: there will always be a next failure mode. Chasing the model capability curve is a losing race. And even if the next model reasoned through this specific case, it would fail on the next thing you didn't anticipate.

We did two things.

First, a simple tool — `get_communication_detail(id)` — that fetches the full body of a specific communication. The compact format hides most details from the agent. This tool lets the agent drill down when the preview isn't enough. Detail on demand.

Second — and more importantly — we defined the reasoning chain as a skill:

```markdown
---
name: find-contact-info
triggers:
  - phone number
  - contact info
  - how to reach
---

## Steps

1. Search for the person across all channels
2. Search for emails sent BY that person
3. Fetch full content of top result to read the signature
```

A skill is a declarative definition of a proven workflow. The steps use the same tools the agent already has, but they're sequenced in a way the agent couldn't reliably derive on its own. When the user asks about contact info, the system can invoke this skill instead of hoping the agent reasons through it.

The skill isn't hardcoded logic — it's encoded knowledge. The tools still do the searching. The LLM still fills in adaptive parameters at each step. But the order of operations — the human insight that "to find contact info you should look at emails from the person" — lives in the skill definition.
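To make that concrete, here's a rough sketch of how skill definitions like the one above could be loaded and matched against incoming queries. Everything here (the `Skill` shape, the substring trigger matching, handing the steps to the agent as instructions) is illustrative, not a description of the actual system.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    triggers: list[str]  # phrases that suggest this workflow applies
    steps: list[str]     # the ordered, human-authored reasoning chain

def match_skill(query: str, skills: list[Skill]) -> Skill | None:
    """Return the first skill whose trigger phrase appears in the query."""
    q = query.lower()
    for skill in skills:
        if any(trigger in q for trigger in skill.triggers):
            return skill
    return None

# Hypothetical registry holding the find-contact-info skill from above.
skills = [
    Skill(
        name="find-contact-info",
        triggers=["phone number", "contact info", "how to reach"],
        steps=[
            "Search for the person across all channels",
            "Search for emails sent BY that person",
            "Fetch full content of top result to read the signature",
        ],
    ),
]

if skill := match_skill("Find Alex's phone number", skills):
    # The matched steps get injected into the agent's context as instructions;
    # the LLM still picks search terms and decides which result to fetch.
    print(skill.steps)
```

Whether triggers are matched by substring, embedding similarity, or the model itself is an implementation detail; the point is that the sequencing lives in data the system can accumulate, inspect, and reuse.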
Skills are a bet on a different model of AI development. Instead of "AI that figures everything out," it's "AI that accumulates learned workflows." When the system notices a user manually performing a pattern — search a person, fetch an email, extract contact info — it can propose turning that pattern into a skill. The user approves. The next time anyone hits that problem, the skill handles it.

This is the loop that makes the agent get better over time without needing a better model underneath. The model's job is to handle the long tail — open-ended reasoning, novel situations, conversations. The skills handle the repeatable patterns that humans have already solved.

This is also why we resisted implementing the contact lookup as a hardcoded Python tool. A tool is opaque and one-off. A skill is portable, discoverable, and composable. Future you can read the skill definition and know exactly what it does. Future users can propose new skills for patterns the system missed.

Looking back, every improvement fit into one of two categories.

Subtractions from the pipeline:

- Dense-only search → hybrid search (delete the assumption that semantic is enough)
- Pre-search LLM calls → direct query (delete intelligence that was noise)
- Verbose tool output → compact format (delete metadata the LLM can't use)

Additions of human knowledge:

- Detail-on-demand tool (humans know "read the full thing" is sometimes the answer)
- Contact lookup skill (humans know signatures contain contact info)

The instinct at every turn was to do the opposite — add intelligence to the pipeline, hope the model figures out the chains. That instinct was wrong.

**The rule:** First, delete everything in your pipeline that isn't provably helping. Then, for the failures that remain, ask whether the AI is missing knowledge a human has. If yes, encode it — as a tool, a skill, or structured data. Don't wait for a smarter model.

If you're building an agentic search or RAG system, check these in order.

Subtractions:

- [ ] Do you have both keyword and semantic retrieval fused together? (If no, add it.)
- [ ] Are there LLM calls before retrieval that classify or rewrite the query? (Consider removing.)
- [ ] Is your tool output designed for human readability or LLM consumption? (Cut everything the LLM doesn't need.)
- [ ] Does your agent have enough context budget to iterate 3-5 times without losing coherence? (If no, reduce per-result tokens.)

Additions:

- [ ] When your agent fails on a task, ask: is this a vocabulary mismatch or missing world knowledge? (Those need encoded knowledge, not smarter prompts.)
- [ ] Does your agent have a way to drill down from summaries to full content when needed? (Add a detail-fetch tool.)
- [ ] Can you extract multi-step patterns from user interactions and encode them as skills? (This is where the compounding returns are.)

The instinct is always to make the AI smarter. The practice is to make the system around it smaller and more specific.