
# Origin Part 4: The AI That Evolves Itself (And Catches Its Own Bugs)

DEV Community
Josh T

OLT-1 runs its own test suite, diagnoses failures, proposes fixes, tests them in a sandbox, and only promotes what actually works.

Most AI models get better through human intervention. Someone notices a failure mode, collects training data, retrains the model, and hopes the new version doesn't break something else. It's slow, expensive, and error-prone.

OLT-1 takes a different approach. Its evolution system runs an automated loop that mirrors the scientific method: diagnose, hypothesize, sandbox, compare, promote. No human in the loop for the cycle itself; human review happens at promotion. And it's already running.

## How the Evolution Loop Works

Every evolution cycle follows five steps:

1. **Diagnose.** Run the full test suite (currently 407 tests per cycle). Categorize every failure by source: is the encoder failing to detect the right concepts? Is the reasoning circuit producing wrong outcomes? Is the decoder generating incoherent text?
2. **Hypothesize.** Based on the dominant failure source and the intervention history, propose a fix. Options include: INCREASE_EPOCHS (train longer on the same data), ENCODER_RETRAIN (retrain the encoder on weak concepts), REASONING_RETRAIN (fix the reasoning circuits), COMBINED (train encoder and decoder together with knowledge replay), or TARGETED_DATA (decoder-focused training pairs).
3. **Sandbox.** Fork the target component. Train it on the relevant data with spaced repetition, interleaving older examples to prevent forgetting. Evaluate on the same test suite.
4. **Compare.** Check the pass-rate delta. But here's the critical part: it also checks retention. An intervention that improves one domain while destroying another gets rejected.
5. **Promote or reject.** If the sandbox model improves without unacceptable regression, replace the production weights. Otherwise, discard and try again.

## When Evolution Caught a Bug That Humans Missed

In April, we ran a 1500-round overnight teacher session.
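The five-step cycle above can be sketched as a toy loop. Everything here is invented for illustration (the scoring scheme, the function names, the single numeric "skill" per domain); it is not OLT-1's actual code, only the shape of the diagnose-sandbox-retain-promote logic:

```python
# Toy sketch of one diagnose -> hypothesize -> sandbox -> compare -> promote
# cycle. All names and the scoring scheme are illustrative, not OLT-1's code.

def evolution_cycle(weights, tests, intervention):
    """weights: domain -> skill level; tests: domain -> pass threshold.
    intervention: fn(candidate_weights, weak_domain) that mutates a fork;
    choosing it plays the role of step 2 (hypothesize)."""
    def pass_map(w):
        return {d: w.get(d, 0.0) >= t for d, t in tests.items()}

    before = pass_map(weights)                  # 1. Diagnose: run the suite
    weak = min(before, key=before.get)          #    dominant failure source
    candidate = dict(weights)                   # 3. Sandbox: fork; production
    intervention(candidate, weak)               #    weights are never touched
    after = pass_map(candidate)                 # 4. Compare pass-rate delta...
    delta = sum(after.values()) - sum(before.values())
    regressed = any(before[d] and not after[d] for d in tests)  # ...and retention
    if delta > 0 and not regressed:             # 5. Promote only safe wins
        return candidate, True
    return weights, False
```

The retention check is what makes the toy faithful to the article's point: an intervention that lifts the weak domain while zeroing out another one produces `regressed == True` and gets rejected, even if its headline delta looks fine.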
The results were disappointing: only a small bump in understanding. Josh had been saying the numbers felt off; the trend was too flat for a model that was supposed to be learning. So we broke the run into five 100-round batches to see per-session behavior.

Batch 4 spiked to 14.3% good. Then batch 5 cliffed back to 10%. Classification went from 67% to 0%. Quantity went from 25% to 0%. Between batches. Something was silently destroying capabilities between training cycles.

The small-batch view exposed two compounding bugs. Both were silent (no error traces, no failing tests), and neither was visible in aggregate metrics. Only the per-batch cliff, caught because Josh was looking, made them findable.

**Bug 1: Spaced-repetition replay dropped compound concepts silently.** The evolution system's spaced-repetition sampling rebuilt concept dictionaries from response text by whitespace word-matching. This silently dropped 36 concepts whose names never appear literally in their own responses: `type_of`, `example_of`, `not_equal`, `too_much`, `too_little`, `refusal`, `self_knowledge`, `affirmation`, `meta_awareness`, `preference`, `capability`, all three emotions, all four physics outcomes, time markers, colors, and conversation bundles. That's 36 concepts evaporating from the replay data every cycle. The model was forgetting things specifically because the mechanism designed to prevent forgetting was blind to them.

Fix: decode the stored `key_vector` (float32 bytes of concept activations) directly instead of trying to reconstruct concepts from text. Replay now preserves all 311 concepts. Verified empirically: usable entries jumped from 13,661 to 20,012; concepts covered jumped from 275 to 311.

**Bug 2: 73% of the vocabulary was invisible to the grader.** The 79-test decoder suite covered only 83 of 311 concepts (27%). Evolution could silently trade untested concepts for tested ones and still get promoted.
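The Bug 1 fix has a simple shape: read the stored activations back as floats instead of word-matching text. A minimal sketch, assuming a little-endian float32 layout and an invented concept ordering and 0.5 activation threshold (none of these details are from OLT-1's actual code):

```python
import struct

def concepts_from_key_vector(key_bytes, concept_names, threshold=0.5):
    """Recover active concepts by decoding stored float32 activations.

    Decoding the vector directly sidesteps the replay bug: concepts like
    type_of or not_equal never appear literally in their own response text,
    so whitespace word-matching silently dropped them.
    """
    n = len(key_bytes) // 4                      # 4 bytes per float32
    acts = struct.unpack(f"<{n}f", key_bytes)    # little-endian float32s
    return [name for name, a in zip(concept_names, acts) if a >= threshold]
```

For example, a vector where the first and third slots are strongly active recovers `type_of` and `not_equal` even though neither string occurs in any response text.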
That's exactly what happened in batch 5: the intervention scored +0.065 and got promoted while destroying classification entirely. The model wasn't failing. The grader was blind to the failure.

## Three Layers of Future-Proofing

We added three defense layers to make sure this class of bug can't happen again.

**Layer 1: The siren.** `test_suite.py` now checks concept coverage at every evolution engine init. If any vocab concept has zero tests, it trips an alarm. New concepts without tests are caught immediately.

**Layer 2: The generators.** Per-category template functions plus 100+ per-concept overrides auto-generate 228 floor-coverage tests. Every vocab concept now has at least one test. No more blind spots.

**Layer 3: The retention check.** Samples real (`key_vector`, `response_text`) pairs from the decoder bank, synthesizes prompts from active concepts, and uses meaningful words from stored responses as expected keywords. 100 retention tests per cycle, growing automatically with the hippocampus.

Combined suite: 79 hand-written + 228 auto-generated + 100 retention = 407 tests per cycle. Grader coverage went from 27% to 100%.

## The Verification Run

After the fix, we ran the same 5-batch confirmation test:

- No more classification/quantity cliffs. Pre-fix: 67% to 0%. Post-fix: stays at 17-33%.
- Batch 5 post-fix beat batch 5 pre-fix on both metrics (13.1% vs 10.0% good, 28.3% vs 24.0% understanding).
- The post-fix trend ends on its highest note instead of spiking then falling.
- All 5 evolution cycles correctly rejected interventions that traded coverage for narrow gains.

The big win isn't the raw number. It's that the failure mode itself has been closed off. Silent forgetting during replay and blind-spot promotions were both class-of-failure bugs. Both now have sirens.

## Dream Consolidation: Learning While It Sleeps

Evolution isn't the only self-improvement mechanism.
OLT-1 also consolidates memory through three tiers of dream cycles:

- **Micro-dream** (about 3 gradient steps): instant reinforcement of low-confidence concepts. Happens during regular operation.
- **Light sleep:** flushes the Hot tier to Warm and promotes Warm to Cold during idle time. Knowledge moves from short-term to long-term storage.
- **Deep sleep:** full reassessment and re-training on flagged weak areas. The heavy consolidation pass.

This mirrors how biological sleep consolidates memory. Important patterns get reinforced. Weak areas get flagged for re-training. The hippocampus doesn't just store knowledge; it actively maintains it.

## The Teacher Loop

Evolution needs training data, and that comes from the teacher loop we covered in Part 3. Briefly: an external model generates conversations aligned to OLT-1's current concept space, OLT-1 responds, the teacher evaluates, and corrections flow into evolution's training data and the hippocampus. The teacher grows with OLT-1; each new stage updates its categories, evaluation criteria, and correction examples.

## Append-Only Growth

Here's the principle that ties everything together: growth is append-only.

We learned this the hard way. Early on, we tried a full decoder curriculum retrain on all 22K pairs. Despite 30-50% replay, catastrophic forgetting hit hard: the pass rate dropped from 45.6% to 31.6%. We restored from backup.

Now the approach is strictly incremental:

- Teacher sessions generate corrections, which go to the hippocampus (persistent memory).
- Evolution fine-tunes the GRU on small targeted batches.
- Dream cycles consolidate Hot to Warm to Cold.
- The data drop pipeline ingests any external text directly into the hippocampus.
- The word grounder adds unknown vocabulary from Wikipedia.

No more retraining base models. Every addition is additive. Every memory is preserved. Every concept, once learned, can only be lost if the entire hippocampus is deleted.
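The append-only, tiered store can be sketched in a few lines. The Hot/Warm/Cold tier names come from the article; the capacities, methods, and eviction rule below are invented for illustration, not OLT-1's actual hippocampus:

```python
from collections import deque

class Hippocampus:
    """Toy append-only memory with Hot/Warm/Cold tiers (illustrative only)."""

    def __init__(self, warm_cap=1000):
        self.hot = deque()      # fresh entries land here
        self.warm = deque()     # consolidated short-term storage
        self.cold = []          # long-term storage; never evicted
        self.warm_cap = warm_cap

    def store(self, entry):
        # Growth is append-only: nothing is ever overwritten or deleted.
        self.hot.append(entry)

    def light_sleep(self):
        """Idle-time pass: flush Hot to Warm, promote Warm overflow to Cold."""
        while self.hot:
            self.warm.append(self.hot.popleft())
        while len(self.warm) > self.warm_cap:
            self.cold.append(self.warm.popleft())

    def __len__(self):
        return len(self.hot) + len(self.warm) + len(self.cold)
```

The design point is that `light_sleep` only moves entries between tiers; the total count never shrinks, which is the "every memory is preserved" property in miniature.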
## Why Self-Evolution Matters

At FAS, we see a pattern in AI security: models get deployed, attacks emerge, and humans have to manually identify and patch the failure modes. The response time is measured in days or weeks.

OLT-1's evolution system suggests a different model: a system that runs its own diagnostics, identifies its own weaknesses, proposes and tests its own fixes, and only promotes improvements that don't break existing capabilities. The loop runs in minutes, not weeks.

That's not autonomous AI in the dangerous sense; human review still gates promotions. But it's autonomous improvement in the useful sense: the system catches its own bugs faster than humans can, and it does so without the risk of making things worse, because every change is tested against the full suite before promotion.

Imagine Guardian with this capability. Not just detecting new attack patterns, but autonomously generating candidate detection rules, sandbox-testing them against the full regression suite, and promoting only the ones that work without breaking existing coverage. That's the direction this points.

## What's Next

OLT-1 is currently at Stage 9 (quantity and counting). Stages 10-15 will add conditional reasoning, sequences, arithmetic, code concepts, science, and language quality. The architecture supports them. The evolution system will improve them as they're added.

The open questions are the same ones we raised in Parts 1, 2, and 3: does this architecture scale? Does architectural consent survive at billions of parameters? Can self-evolution keep up with adversarial pressure at production scale? And can developmental-AI evaluation keep pace with the capabilities it's meant to measure?

We're building toward answers. If you're interested in helping find them, we'd like to talk.

(If you're keeping score on the Josh-notices-bugs arc: 2 for 2. Part 5 extends it.)

Origin is developed at Fallen Angel Systems with the Genesis framework (USPTO Applications #64/016,973 and #64/017,567).
FAS Guardian defends production AI systems from prompt injection in under 3ms. FAS Judgement is the open-source attack console that finds the gaps. Defense. Offense. Creation.

fallenangelsystems.com | Judgement on GitHub