METHODOLOGY / 01
Memory substrate, measured.
A 30-question retrieval eval pitting Cortex v2 against naive CLAUDE.md grep and a blind LLM baseline. May 2026.
The memory substrate adds +39.9 percentage points over naive grep and +57 over a blind LLM.
| category | full | grep | blind | Δ full-grep |
|---|---|---|---|---|
| ownership | 55.0% | 0.0% | 16.0% | +55.0pp |
| decisions | 77.4% | 48.4% | 0.0% | +29.0pp |
| processes | 82.0% | 74.0% | 0.0% | +8.0pp |
| incidents | 44.0% | 6.0% | 12.0% | +38.0pp |
| specs | 94.0% | 2.0% | 2.0% | +92.0pp |
| cross-repo | 43.4% | 26.0% | 24.0% | +17.4pp |
| OVERALL | 66.0% | 26.1% | 9.0% | +39.9pp |
The +92pp specs gap is where directives prove they are the load-bearing substrate. Naive grep finds the words but not the answers.
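The deltas and the OVERALL row in the table above can be re-derived from the per-category scores. The sketch below is a sanity check, not part of the eval harness. It assumes OVERALL is the unweighted mean of the six categories, which matches the published numbers but is not stated in the source; the names `SCORES`, `deltas`, and `overall` are illustrative.

```python
# Scores copied from the table above, in percent: (full, grep, blind).
SCORES = {
    "ownership": (55.0, 0.0, 16.0),
    "decisions": (77.4, 48.4, 0.0),
    "processes": (82.0, 74.0, 0.0),
    "incidents": (44.0, 6.0, 12.0),
    "specs": (94.0, 2.0, 2.0),
    "cross-repo": (43.4, 26.0, 24.0),
}


def deltas(scores):
    """Return the full-minus-grep gap per category, in percentage points."""
    return {cat: round(full - grep, 1) for cat, (full, grep, _blind) in scores.items()}


def overall(scores):
    """Unweighted mean per condition (assumption: categories are equal-sized)."""
    n = len(scores)
    columns = zip(*scores.values())  # full column, grep column, blind column
    return tuple(round(sum(col) / n, 1) for col in columns)


if __name__ == "__main__":
    print(deltas(SCORES))
    print(overall(SCORES))
```

Under that assumption the means come out to 66.0, 26.1, and 9.0, in line with the OVERALL row.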
Cortex v2 used the production askCortex pipeline. Naive grep searched the CLAUDE.md project, global, and repo registry rows and kept the top 15 by keyword match. Blind used no context.
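As a rough illustration of the naive grep condition, the sketch below ranks rows by how many question keywords they contain and keeps the top 15. The source does not say how keyword matches were scored, so the hit-count scoring, the tokenizer, the example rows, and the name `top_k_by_keyword` are all assumptions.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a deliberately crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def top_k_by_keyword(question, rows, k=15):
    """Rank rows by keyword hits against the question and keep the best k.

    Hypothetical scoring: the number of row tokens that also appear in the
    question. Rows with zero hits are dropped.
    """
    keywords = set(tokenize(question))
    scored = []
    for index, row in enumerate(rows):
        hits = sum(1 for token in tokenize(row) if token in keywords)
        if hits:
            scored.append((hits, index, row))
    # Highest hit count first; ties keep the original row order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [row for _hits, _index, row in scored[:k]]


if __name__ == "__main__":
    # Made-up rows for illustration only.
    rows = [
        "Release: run the build, tag the version, then publish.",
        "Inline styles are not allowed in components.",
        "Middleware gates the admin route prefixes.",
    ]
    print(top_k_by_keyword("Which command do we run to release?", rows))
```

A ranker like this does well when the question shares wording with the row that answers it, and has nothing to offer when the answer has to be assembled from rows that share no keywords with the question.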
All three conditions used the same Sonnet REPL composer. Only the retrieved row set varied.
The same Sonnet REPL judge scored every answer against factual_accuracy, citation_correctness, and hallucination_count.
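The controlled setup described above (one composer, one judge, only the retrieved rows changing) can be sketched as a small harness. This is a minimal sketch under stated assumptions, not the production code: `run_eval`, the `Verdict` type, the 0.0 to 1.0 score scale, and the callable signatures are hypothetical. Only the three judge criteria names come from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Verdict:
    # Field names follow the three judge criteria named above.
    factual_accuracy: float      # assumed scale: 0.0 to 1.0
    citation_correctness: float  # assumed scale: 0.0 to 1.0
    hallucination_count: int


Retriever = Callable[[str], List[str]]
Composer = Callable[[str, List[str]], str]
Judge = Callable[[str, str], Verdict]


def run_eval(questions: List[str], retrievers: Dict[str, Retriever],
             compose: Composer, judge: Judge) -> Dict[str, dict]:
    """Run every condition through the same composer and the same judge.

    Only the retriever differs between conditions, so score differences
    can be attributed to the retrieved row set.
    """
    results = {}
    for name, retrieve in retrievers.items():
        verdicts = [judge(q, compose(q, retrieve(q))) for q in questions]
        mean_accuracy = sum(v.factual_accuracy for v in verdicts) / len(verdicts)
        results[name] = {
            "factual_accuracy_pct": round(100 * mean_accuracy, 1),
            "hallucinations": sum(v.hallucination_count for v in verdicts),
        }
    return results


if __name__ == "__main__":
    # Stub components so the sketch runs without any model calls.
    def stub_compose(question, rows):
        return rows[0] if rows else "unknown"

    def stub_judge(question, answer):
        score = 0.0 if answer == "unknown" else 1.0
        return Verdict(score, score, 0)

    conditions = {
        "full": lambda q: ["row about " + q],
        "blind": lambda q: [],
    }
    print(run_eval(["q1", "q2"], conditions, stub_compose, stub_judge))
```

Passing the composer and judge in as plain callables keeps them identical across conditions by construction; swapping in a different retriever is the only way to change a condition.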
- Release command lookup: full 100%, grep 100%. The command sequence is written plainly in CLAUDE.md.
- Middleware gate lookup: full 100%, grep 100%. Route prefixes are a verbatim policy question.
- Vercel deploy rule: grep 100%, full 85%. Grep won on a question one regex away from README text.
- Inline-style rule: grep 55%, full 45%. The answer lives close to exact directive wording.
- Partial packet leftover: full 0%, grep 0%. Neither condition had the missing incident row.
Grep matching or beating the full pipeline on verbatim lookups is a feature, not a flaw. The substrate's job is structured and multi-hop questions, not verbatim lookups.
Three more iterations followed the baseline run. Each row is a full 30-question replay through the same harness. The trade-off between accuracy and hallucination is real, and we're tuning it in the open.
| date | change | full | halluc. | Δ vs grep |
|---|---|---|---|---|
| 2026-05-24 | Baseline production composer | 66.0% | 58 | +39.9pp |
| 2026-05-24 evening | #1126 — tightened composer against partial-match synthesis | 57.6% | 39 | +31.7pp |
| 2026-05-25 morning | #1127 — added affirmative answer-when-supported rule | 60.3% | 44 | +33.9pp |
| 2026-05-25 afternoon | #1128 — distinguished enumeration from inference | 63.8% | 36 | +40.0pp |
#1126 traded 19 hallucinations for 8.4pp of factual accuracy, which was too aggressive. #1127 recovered 2.7pp of that accuracy while hallucinations rose to 44, still 14 below baseline. #1128 distinguished enumeration from inference, fully recovering specs questions (back to 94%) and pushing incidents past baseline. The settled equilibrium gives up 2.2pp of absolute accuracy for a 38% reduction in hallucinations (58 to 36), with a moat over naive grep that is now marginally bigger than the original baseline (+40.0pp vs +39.9pp).
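The trade-off figures quoted above follow directly from the tuning table. The sketch below recomputes each iteration's accuracy change (in pp) and hallucination change (in percent) against the baseline; `RUNS` and `versus_baseline` are illustrative names, and the numbers are copied from the table rather than from the raw JSON.

```python
# (label, full accuracy %, hallucination count), copied from the tuning table.
RUNS = [
    ("baseline", 66.0, 58),
    ("#1126", 57.6, 39),
    ("#1127", 60.3, 44),
    ("#1128", 63.8, 36),
]


def versus_baseline(runs):
    """Map each later run to (accuracy change in pp, hallucination change in %)."""
    _label, base_acc, base_halluc = runs[0]
    out = {}
    for label, acc, halluc in runs[1:]:
        out[label] = (
            round(acc - base_acc, 1),
            round(100 * (halluc - base_halluc) / base_halluc, 1),
        )
    return out


if __name__ == "__main__":
    for label, (acc_pp, halluc_pct) in versus_baseline(RUNS).items():
        print(f"{label}: {acc_pp:+.1f}pp accuracy, {halluc_pct:+.1f}% hallucinations")
```

For #1128 this gives -2.2pp accuracy and -37.9% hallucinations, which is the 2.2pp and roughly 38% cited in the summary.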
Each run is a 30-question replay through Sonnet-via-REPL with the composer held constant across full / grep / blind conditions. Variance is real (one judge call per answer) — read these as directional, not absolute. Raw JSON files linked from each commit hash are the source of truth.