
I Built a Research Synthesis Engine That Reads 15 Papers and Generates Peer-Reviewed Hypotheses — Powered by Gemma 4

DEV Community
navid mirnouri

*This is a submission for the Gemma 4 Challenge: Build with Gemma 4.*

Every researcher knows the feeling: you have a stack of papers, a vague sense that something important is hiding between them, and no time to find it. Individual papers answer narrow questions. The breakthroughs live in the gaps between them.

I built **LitSynth** — a local, fully offline research synthesis engine that ingests up to 15 scientific PDFs, reasons across all of them simultaneously, and produces four structured outputs: cross-paper agreements, contradictions with mechanistic explanations, research gap analysis ranked by importance, and novel falsifiable hypotheses — each one put through a multi-round adversarial peer-review loop before it reaches you.

This only exists because of Gemma 4's 128K context window and thinking mode. RAG pipelines approximate this; Gemma 4 actually does it.

## The seven-stage pipeline

LitSynth treats a set of scientific papers as a single evidence corpus rather than a collection of independent documents.

1. **Parallel PDF ingestion** — papers are parsed concurrently with pdfplumber, chunked into 8,000-character segments, and passed to the extraction stage.
2. **Batched claim extraction (3 chunks per LLM call)** — each batch prompt asks Gemma 4 to extract up to 4 specific, falsifiable, numerically grounded claims per section. Claims are namespaced by paper ID and chunk index to prevent collisions. Running 6 workers in parallel cuts wall-clock time to roughly a third of sequential extraction.
3. **Agreement identification** — a single long-context prompt packages all claims (within a token budget) and asks Gemma 4 to find convergent findings across papers, citing specific claim IDs as evidence rather than just paper names.
4. **Contradiction detection (parallel clusters)** — claims are grouped by experimental method, and each cluster runs in its own thread.
The contradiction prompt requires:

   - the exact claim text from each paper;
   - a mechanistic explanation of why they conflict;
   - a proposed reconciliation (different populations, measurement conditions, etc.).

5. **Gap analysis** — research gaps are traced back to the specific claims and contradictions that reveal them, and ranked critical / high / medium / low by importance. The prompt explicitly asks: "what question is implied by this evidence that no paper answers?"
6. **Hypothesis generation** — this is the centrepiece. The generation prompt enforces mandatory rules at the prompt level:
   - every hypothesis must reference ≥2 specific claim IDs from the corpus;
   - every hypothesis must name a `gap_addressed` (a gap ID from stage 5);
   - the mechanism field must name the specific signal, its origin layer/module, and the downstream effect it produces;
   - a null hypothesis must be included for every hypothesis;
   - the experiment design must specify the independent variable, control condition, measurements, and statistical test;
   - forbidden language: "necessary and sufficient", "proves", "objective metric", "always", "guaranteed".
7. **Adversarial refinement loop** — every generated hypothesis enters a multi-round peer-review cycle (up to 2 rounds by default):
   - all hypotheses are reviewed in parallel, each in its own LLM call, with no waiting;
   - the reviewer scores weakness count, assigns a confidence penalty, and flags `fatal_flaw`;
   - if an improved hypothesis is provided, a quick re-review checks that it has fewer weaknesses than the original before accepting the improvement;
   - confidence is recalibrated as `original_conf − (0.06 × weaknesses) − reviewer_penalty`;
   - hypotheses with `fatal_flaw=True` are moved to a discarded list, not silently dropped.

The final output separates accepted hypotheses from discarded ones, shows revision history, and includes calibrated confidence scores.

## The test run

The corpus: 15 open-access papers on transformer attention mechanisms and long-context performance.
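Before looking at the output, the refinement loop's confidence arithmetic can be sketched in a few lines. The function names are mine; the 0.06-per-weakness deduction, the reviewer penalty, and the fewer-weaknesses acceptance rule come from the description above, and the 0.05 floor from the design notes later in the post:

```python
def recalibrate(conf: float, weaknesses: list[str], reviewer_penalty: float) -> float:
    """Deduct 0.06 per reviewer-identified weakness plus the reviewer's
    explicit penalty, flooring at 0.05 so scores stay interpretable."""
    return max(0.05, conf - 0.06 * len(weaknesses) - reviewer_penalty)


def accept_improvement(original_weaknesses: list[str], improved_weaknesses: list[str]) -> bool:
    # A rewritten hypothesis replaces the original only if the quick
    # re-review finds strictly fewer weaknesses than the first draft had.
    return len(improved_weaknesses) < len(original_weaknesses)
```

For example, a hypothesis that enters at 0.80 and accumulates 5 weaknesses plus a 0.20 reviewer penalty exits at 0.30.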
One accepted hypothesis from the run:

```text
HYPOTHESIS: In decoder-only LLMs with ≥7B parameters trained on sequences
≤8K tokens, injecting domain-specific embeddings into KV cache positions
0–32 will reduce hallucination rate on closed-domain QA by ≥15% compared to
prompt-only injection, because early-layer cache slots function as
high-priority retrieval anchors for attention heads in layers 8–16.

NULL HYPOTHESIS: KV cache position injection will show no statistically
significant difference in hallucination rate compared to prompt-only
injection (p > 0.05).

MECHANISM: [architectural] Domain embeddings written to KV positions 0–32
are preferentially attended to by layers 8–16 due to recency bias in rotary
position encoding, causing those layers to anchor factual retrieval against
the injected context before processing user tokens.

EXPERIMENT:
  IV: injection method (KV cache positions 0–32 vs. system prompt prefix)
  Control: same model, same domain corpus, same evaluation prompts
  Measurements: hallucination rate on TruthfulQA-domain subset, exact-match F1
  Statistical test: paired t-test, α = 0.05, n = 500 per condition

GROUNDED IN: paper_2_ck1_c3, paper_7_ck0_c1, paper_11_ck3_c2
FILLS GAP: gap_3a8f2c (effect of cache position on retrieval priority)
CONFIDENCE: 0.61 (recalibrated from 0.80 after 2 review rounds)
REVISION: 2
```

One hypothesis was flagged `fatal_flaw=True` after round 1 because it claimed a mechanism was "necessary and sufficient". The schema validator rejected the rewrite attempt as well (it still contained absolute language), so the hypothesis was cleanly discarded with the critique logged.
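The absolute-language check that caught this can be sketched as a standalone function. In LitSynth it reportedly runs inside a Pydantic v2 `model_validator`; this stdlib-only version is illustrative, and the word-boundary match is my choice so that, say, "improves" does not trip the "proves" rule:

```python
import re

# Phrases banned by the hypothesis generation rules.
FORBIDDEN_PHRASES = [
    "necessary and sufficient",
    "proves",
    "objective metric",
    "always",
    "guaranteed",
]


def find_forbidden(text: str) -> list[str]:
    """Return every forbidden phrase present in `text`, using
    word-boundary matching so substrings of longer words don't match."""
    low = text.lower()
    return [
        phrase
        for phrase in FORBIDDEN_PHRASES
        if re.search(rf"\b{re.escape(phrase)}\b", low)
    ]
```

A validator would raise `ValueError` whenever `find_forbidden()` returns a non-empty list, which is what bounces a bad rewrite back to the discard pile.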
## Run statistics

- Papers: 15
- Claims extracted: 312
- Agreements: 8
- Contradictions: 14 (across 6 method clusters)
- Research gaps: 9 (3 critical, 4 high, 2 medium)
- Hypotheses: 2 accepted, 1 discarded
- Refinement rounds: 2
- Runtime: ~18 minutes on a MacBook M2 Pro (local, offline)

## Architecture

```text
PDF files
    │
    ▼
Parallel PDF loader (pdfplumber, 4 workers)
    │
    ▼
Batched claim extractor (6 workers, 3 chunks/call, streaming=True, thinking=False)
    │
    ├─────────────────────────────┐
    ▼                             ▼
Agreements                   Contradictions
(single long-context)        (parallel method clusters, thinking=True)
    │                             │
    └──────────────┬──────────────┘
                   ▼
Gap analysis (importance-ranked, causally linked)
                   │
                   ▼
Hypothesis generation (grounded, falsifiable, schema-validated)
                   │
                   ▼
Adversarial refinement loop
    ┌─────────────────────────┐
    │ Review all (parallel)   │
    │           ↓             │
    │ Recalibrate confidence  │
    │           ↓             │
    │ Attempt improvement     │
    │           ↓             │
    │ Re-review candidate     │
    │           ↓             │
    │ Accept if better        │ ← up to MAX_REFINEMENT_ROUNDS
    └─────────────────────────┘
                   │
                   ▼
LiteratureSynthesis output (JSON + Gradio UI)
```

## Design decisions

**Batched extraction instead of one call per chunk.** Packing 3 chunks into one prompt with section headers (`[paper_id=paper_2 chunk_id=1]`) reduces LLM calls by ~3× with no quality loss. The prompt instructs the model to treat each section independently, so cross-contamination doesn't occur.

**Thread-local LLM instances.** `ChatOllama` is not thread-safe, so each worker thread constructs its own instance via `threading.local()`. Six extraction workers plus two parallel synthesis steps all run without any shared state on the model object.

**Checkpoint invalidation by content hash.** A manifest file stores an MD5 of filename + size + mtime for every input PDF. If the input changes, all checkpoints are wiped before the run starts. This prevents the nasty failure mode where stale checkpoints silently produce wrong results.

**Two LLM profiles per thread.**
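A minimal sketch of the thread-local, two-profile pattern: the `factory` argument stands in for `ChatOllama` construction (exact constructor kwargs vary by adapter version, so the profile dictionaries here are illustrative, not the library's API):

```python
import threading

# Each worker thread owns its own LLM instances: ChatOllama is not
# thread-safe, so nothing on the model object is ever shared.
_tls = threading.local()

PROFILES = {
    "extraction": {"streaming": True, "thinking": False},  # simple JSON task
    "synthesis": {"streaming": False, "thinking": True},   # heavy reasoning
}


def get_llm(profile, factory=dict):
    """Return this thread's cached instance for `profile`, building it on
    first use. In LitSynth `factory` would wrap ChatOllama construction."""
    if not hasattr(_tls, "llms"):
        _tls.llms = {}
    if profile not in _tls.llms:
        _tls.llms[profile] = factory(**PROFILES[profile])
    return _tls.llms[profile]
```

Because `_tls.llms` lives in `threading.local()`, two threads asking for the same profile get two independent instances, while repeated calls inside one thread reuse the cached one.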
- Extraction: `streaming=True, thinking=False` — a simple JSON task where the user benefits from seeing token progress.
- Synthesis: `streaming=False, thinking=True` — complex reasoning with no streaming overhead.

**Schema-level validation as a last-resort guardrail.** The `Hypothesis` Pydantic model runs a `model_validator` that scans the hypothesis and mechanism text for forbidden phrases and raises a `ValueError` before a bad hypothesis ever enters the refinement loop. This catches the cases where the prompt-level constraints fail.

**Confidence recalibration.** LLM-assigned confidence scores are untrustworthy. After each review round, confidence is recomputed as `max(0.05, conf − 0.06 × len(weaknesses) − reviewer_penalty)`. A hypothesis that entered generation at 0.80 but accumulated 5 weaknesses and a 0.20 reviewer penalty exits at 0.30 — an honest signal.

## Tech stack

- Model: Gemma 4 31B Dense via Ollama (local, offline)
- Orchestration: Python + the LangChain Ollama adapter
- Schema: Pydantic v2 with custom validators
- UI: Gradio with tabbed output (Agreements / Contradictions / Gaps / Hypotheses / Raw JSON)
- PDF parsing: pdfplumber
- Parallelism: `concurrent.futures.ThreadPoolExecutor`

## Code

Full source on GitHub: github.com/navid72m/litsynth

```shell
pip install pdfplumber langchain-ollama pydantic gradio tqdm
ollama pull gemma4
python ui.py
```

## Why Gemma 4

Three capabilities made this project possible, and none of them are present in smaller models:

**The 128K context window is the load-bearing wall.** A RAG pipeline splits the evidence between retrieval buckets. A finding in paper 3 that partially contradicts a result in paper 11 only becomes visible if both are in context simultaneously. With Gemma 4's 128K window, the entire evidence corpus fits: the model sees everything at once. RAG approximates this — Gemma 4 actually does it.

**Thinking mode changes the quality of synthesis.** With `thinking=True`, the model works through its cross-paper reasoning in dedicated thinking blocks (stripped before JSON parsing, but logged separately for inspection). The adversarial reviewer benefits equally: it produces structured, dimension-by-dimension critiques rather than vague feedback.
**The 31B dense model is the right size for this task.** The model choice isn't incidental: every design decision in LitSynth — batching, the token-budget guard, the context-assembly strategy — exists to make the most of Gemma 4's specific capabilities. A different model would require a different architecture; this one is built around what Gemma 4 can actually do.

---

*Built for the Gemma 4 Challenge, May 2026. All synthesis runs locally. No API calls. No paper data leaves your machine.*