METHODOLOGY / 01

Memory substrate, measured.

A 30-question retrieval eval pitting Cortex v2 against naive CLAUDE.md grep and a blind LLM baseline. May 2026.

HEADLINE

66.0%Cortex v2 factual accuracy

26.1%Naive grep CLAUDE.md

9.0%Blind LLM, no context

The memory substrate adds +39.9 percentage points over naive grep and +57 over a blind LLM.

BY CATEGORY

category	full	grep	blind	Δ full-grep
ownership	55.0%	0.0%	16.0%	+55.0pp
decisions	77.4%	48.4%	0.0%	+29.0pp
processes	82.0%	74.0%	0.0%	+8.0pp
incidents	44.0%	6.0%	12.0%	+38.0pp
specs	94.0%	2.0%	2.0%	+92.0pp
The +92pp specs gap is where directives prove they are the load-bearing substrate. Naive grep finds the words but not the answers.
cross-repo	43.4%	26.0%	24.0%	+17.4pp
OVERALL	66.0%	26.1%	9.0%	+39.9pp

METHOD

[ THREE CONDITIONS ]

Cortex v2 used the production askCortex pipeline. Naive grep used CLAUDE.md project, global, and repo registry rows, top-15 by keyword. Blind used no context.

[ COMPOSER HELD CONSTANT ]

All three conditions used the same Sonnet REPL composer. Only the retrieved row set varied.

[ JUDGE ]

The same Sonnet REPL judge scored every answer against factual_accuracy, citation_correctness, and hallucination_count.

WHERE GREP TIES OR WINS

Release command lookup: full 100%, grep 100%. The command sequence is written plainly in CLAUDE.md.
Middleware gate lookup: full 100%, grep 100%. Route prefixes are a verbatim policy question.
Vercel deploy rule: grep 100%, full 85%. Grep won on a question one regex away from README text.
Inline-style rule: grep 55%, full 45%. The answer lives close to exact directive wording.
Partial packet leftover: full 0%, grep 0%. Neither condition had the missing incident row.

That is a feature. The substrate's job is structured and multi-hop questions, not verbatim lookups.

LAB NOTES

Two more iterations after the baseline run. Each row is a full 30-question replay through the same harness; the trade space between accuracy and hallucination is real and we're tuning it in the open.

date	change	full	halluc.	Δ vs grep
2026-05-24	Baseline production composer	66.0%	58	+39.9pp
2026-05-24 evening	#1126 — tightened composer against partial-match synthesis	57.6%	39	+31.7pp
2026-05-25 morning	#1127 — added affirmative answer-when-supported rule	60.3%	44	+33.9pp
2026-05-25 afternoon	#1128 — distinguished enumeration from inference	63.8%	36	+40.0pp

#1126 traded 19 hallucinations for 8.4pp of factual accuracy — too aggressive. #1127 recovered most of the accuracy and held the hallucination floor at 44. #1128 distinguished enumeration from inference, fully recovering specs questions (back to 94%) and pushing incidents past baseline. The settled equilibrium gives up 2.2pp of absolute accuracy for a 38% reduction in hallucinations and a moat over naive grep that's now bigger than the original baseline (+40.0pp vs +39.9pp).

Each run is a 30-question replay through Sonnet-via-REPL with the composer held constant across full / grep / blind conditions. Variance is real (one judge call per answer) — read these as directional, not absolute. Raw JSON files linked from each commit hash are the source of truth.

RAW DATA

Commit that ran the testc17a5367 · cortex-ide #938 Per-case JSON2026-05-24T22:03:03.500Z · 30 cases