METHODOLOGY / 01

Memory substrate, measured.

A 30-question retrieval eval pitting Cortex v2 against naive CLAUDE.md grep and a blind LLM baseline. May 2026.

HEADLINE
66.0%Cortex v2 factual accuracy
26.1%Naive grep CLAUDE.md
9.0%Blind LLM, no context

The memory substrate adds +39.9 percentage points over naive grep and +57 over a blind LLM.

BY CATEGORY
categoryfullgrepblindΔ full-grep
ownership55.0%0.0%16.0%+55.0pp
decisions77.4%48.4%0.0%+29.0pp
processes82.0%74.0%0.0%+8.0pp
incidents44.0%6.0%12.0%+38.0pp
specs94.0%2.0%2.0%+92.0pp
The +92pp specs gap is where directives prove they are the load-bearing substrate. Naive grep finds the words but not the answers.
cross-repo43.4%26.0%24.0%+17.4pp
OVERALL66.0%26.1%9.0%+39.9pp
METHOD
[ THREE CONDITIONS ]

Cortex v2 used the production askCortex pipeline. Naive grep used CLAUDE.md project, global, and repo registry rows, top-15 by keyword. Blind used no context.

[ COMPOSER HELD CONSTANT ]

All three conditions used the same Sonnet REPL composer. Only the retrieved row set varied.

[ JUDGE ]

The same Sonnet REPL judge scored every answer against factual_accuracy, citation_correctness, and hallucination_count.

WHERE GREP TIES OR WINS
  • Release command lookup: full 100%, grep 100%. The command sequence is written plainly in CLAUDE.md.
  • Middleware gate lookup: full 100%, grep 100%. Route prefixes are a verbatim policy question.
  • Vercel deploy rule: grep 100%, full 85%. Grep won on a question one regex away from README text.
  • Inline-style rule: grep 55%, full 45%. The answer lives close to exact directive wording.
  • Partial packet leftover: full 0%, grep 0%. Neither condition had the missing incident row.

That is a feature. The substrate's job is structured and multi-hop questions, not verbatim lookups.

LAB NOTES

Two more iterations after the baseline run. Each row is a full 30-question replay through the same harness; the trade space between accuracy and hallucination is real and we're tuning it in the open.

datechangefullhalluc.Δ vs grep
2026-05-24Baseline production composer66.0%58+39.9pp
2026-05-24 evening#1126 — tightened composer against partial-match synthesis57.6%39+31.7pp
2026-05-25 morning#1127 — added affirmative answer-when-supported rule60.3%44+33.9pp
2026-05-25 afternoon#1128 — distinguished enumeration from inference63.8%36+40.0pp

#1126 traded 19 hallucinations for 8.4pp of factual accuracy — too aggressive. #1127 recovered most of the accuracy and held the hallucination floor at 44. #1128 distinguished enumeration from inference, fully recovering specs questions (back to 94%) and pushing incidents past baseline. The settled equilibrium gives up 2.2pp of absolute accuracy for a 38% reduction in hallucinations and a moat over naive grep that's now bigger than the original baseline (+40.0pp vs +39.9pp).

Each run is a 30-question replay through Sonnet-via-REPL with the composer held constant across full / grep / blind conditions. Variance is real (one judge call per answer) — read these as directional, not absolute. Raw JSON files linked from each commit hash are the source of truth.

RAW DATA
Commit that ran the testc17a5367 · cortex-ide #938Per-case JSON2026-05-24T22:03:03.500Z · 30 cases