BENCHMARK / 01

What o8 actually does.

o8 doesn't make models code better. It makes their output safe to merge, fast to use, and informed by your organization's memory.

Four tracks: speed, memory, coding, governance. We tried to disprove each claim and report where o8 loses. June 2026.

A single-operator pilot. Speed and memory are in the committed, version-stamped scorecard; governance and coding are measured by hand and the governance figures come from reconstructed diffs — a bounded proxy. Memory is n=5 per category, one in-family LLM judge. Not externally validated; SWE-bench pending. Read these as directional point estimates.

Update — the loop

A benchmark is only worth running if it changes something. This one did.

The pilot below surfaced three places o8 lost to a simpler tool or fell short of its own claim. We fixed all three and re-ran the same harnesses against the result. The point isn't that the numbers moved — it's the loop: measure honestly, find the weakness, fix it, re-measure.

what changed                             | before | after | how
Memory: literal-lookup accuracy          | 84%    | 100%  | route to grep
Governance: bugs caught (proxy re-audit) | 2 / 3  | 3 / 3 | 4-trace gate
Speed: avg boot socket conns             | ~20.6  | ~3.1  | poll gating

[ THE POINT ]

A benchmark that drives its own product fixes and then re-audits them is more credible than one that only ever flatters. The thesis is unchanged: o8 still doesn't make the models better coders; it now measures, release over release, whether it keeps its promise.

The memory win is specific to literal lookups — synthesis categories are unchanged within run-to-run noise. The governance re-audit used faithful reconstructions of the retired diffs, a bounded proxy; the definitive number comes from replaying real diffs through the live review tier.

01 — Speed

Warm is sub-frame. Cold was slow — we found the cause and cut it.

The cold-launch cost was not the UI. It was boot-time request fan-out saturating the WebView socket budget. We lazy-hydrated the panels that aren't visible at launch.
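
The idea of deferring hidden panels can be sketched in a few lines. This is a minimal model, not o8's implementation: the `Panel` and `Shell` classes, the panel names, and the loader callables are all hypothetical, and each loader call stands in for one boot-time request.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Panel:
    name: str
    visible_at_launch: bool
    load: Callable[[], str]  # hypothetical loader; stands in for a network request
    hydrated: bool = False
    data: str = ""


class Shell:
    """Counts requests so the effect of deferring hidden panels is visible."""

    def __init__(self, panels: List[Panel]) -> None:
        self.panels: Dict[str, Panel] = {p.name: p for p in panels}
        self.requests = 0

    def _hydrate(self, panel: Panel) -> None:
        if not panel.hydrated:  # each panel is fetched at most once
            self.requests += 1
            panel.data = panel.load()
            panel.hydrated = True

    def boot(self) -> None:
        # Only panels on screen at launch pay their request cost during boot.
        for panel in self.panels.values():
            if panel.visible_at_launch:
                self._hydrate(panel)

    def open(self, name: str) -> str:
        # Hidden panels hydrate on first open instead of at boot.
        panel = self.panels[name]
        self._hydrate(panel)
        return panel.data


shell = Shell([
    Panel("editor", True, lambda: "editor state"),
    Panel("history", False, lambda: "history state"),
    Panel("settings", False, lambda: "settings state"),
])
shell.boot()
print(shell.requests)  # 1: only the visible panel was fetched at boot
shell.open("history")
print(shell.requests)  # 2: the hidden panel was fetched on first open
```

The trade-off in this sketch is that the first open of a hidden panel pays the request cost that boot used to pay, so the saving applies to launch time only.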

metric                 | before  | after   | Δ
Boot requests          | 221     | 111     | −50%
First Contentful Paint | 2305 ms | 1579 ms | −32%
Time to interactive    | 2311 ms | 1533 ms | −780 ms

[ VERDICT ]

Speed is a measured floor we can improve, not a slogan. The fix shipped and was verified live.

These are single cold-launch samples. The −50% request cut is structural; the paint deltas need N≥10 for a confidence interval. There is no external cross-IDE baseline: “fast” is judged against UX thresholds and our own prior release, not a competitor.
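
For readers who want to check the Δ column, the arithmetic can be reproduced from the before and after values in the table. This is a plain recomputation; the `pct_change` helper name is ours, not part of any o8 tooling.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the speed table above.
rows = {
    "Boot requests": (221, 111),
    "First Contentful Paint (ms)": (2305, 1579),
    "Time to interactive (ms)": (2311, 1533),
}

for name, (before, after) in rows.items():
    print(f"{name}: {after - before:+d} absolute, {pct_change(before, after):+.1f}%")
```

The unrounded results are about −49.8% for boot requests, −31.5% for First Contentful Paint, and −778 ms for time to interactive, which the table rounds to −50%, −32%, and −780 ms.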

02 — Memory

The Brain beats a competent ripgrep agent — and loses on literal lookups.

38 questions, LLM-judged factual accuracy, four conditions. The Brain wins overall by +29 points over strong-grep and crushes grep where answers must be synthesized or recalled. In the original run it lost where the answer is a literal token in a file; the table reflects the re-run after literal lookups were routed to grep.

category       | brain | naive-grep | strong-grep | blind
incidents      | 38%   | 13%        | 14%         | 14%
specs          | 77%   | 2%         | 12%         | 2%
decisions      | 72%   | 36%        | 39%         | 4%
ownership      | 54%   | 21%        | 30%         | 16%
processes      | 80%   | 70%        | 20%         | 0%
cross-repo     | 40%   | 27%        | 25%         | 19%
literal-lookup | 100%  | 88%        | 100%        | 13%
OVERALL        | 68.6% | 40.7%      | 39.5%       | 9.9%

Where the Brain used to lose. On literal token lookups (default values, where-X-is-defined) strong-grep scores a perfect 100%. o8 now routes these to grep, so the Brain matches it. When the answer is a word in a file, search wins and the Brain defers to it.

[ THE USABLE RULE ]

Brain for why, who, what-was-decided, what-happened. Grep for what's-the-value, where-is-X-defined.
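
One way to picture the rule is as a small router that sends literal lookups to search and everything else to memory. The sketch below is an assumption for illustration only: the cue patterns, the `route` function, and the example identifiers are hypothetical, and nothing here describes how o8 actually classifies questions.

```python
import re

# Hypothetical cue phrases that suggest the answer is a literal token in a file.
LITERAL_CUES = re.compile(
    r"\b(default value|what('s| is) the value|where is .+ defined|which file)\b",
    re.IGNORECASE,
)


def route(question: str) -> str:
    """Return 'grep' for literal lookups and 'brain' for everything else."""
    return "grep" if LITERAL_CUES.search(question) else "brain"


# RETRY_LIMIT and parseConfig are made-up identifiers for the demo.
print(route("What is the default value of RETRY_LIMIT?"))  # grep
print(route("Where is parseConfig defined?"))              # grep
print(route("Why did we move off the old queue?"))         # brain
print(route("Who owns the billing service?"))              # brain
```

A keyword heuristic like this will misroute some questions; it only shows the shape of the decision, not a production classifier.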

Single-operator question set, n=5 per category (n=8 literal cases). The judge is one model family — a second-judge cross-check is the next step before publishing exact figures. Accuracy is the win; on raw hallucination count the Brain still trails strong-grep in the recall-heavy categories — it buys relevance, not abstention.
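
The OVERALL row can be approximately reproduced from the per-category figures and the sample sizes stated above (n=5 per category, n=8 for literal lookups). The sketch assumes a question-weighted mean; the table percentages are already rounded, so the recomputed values can differ from the published ones by about a tenth of a point.

```python
# Per-category accuracy in percent: (brain, naive-grep, strong-grep, blind).
SCORES = {
    "incidents": (38, 13, 14, 14),
    "specs": (77, 2, 12, 2),
    "decisions": (72, 36, 39, 4),
    "ownership": (54, 21, 30, 16),
    "processes": (80, 70, 20, 0),
    "cross-repo": (40, 27, 25, 19),
    "literal-lookup": (100, 88, 100, 13),
}
CONDITIONS = ("brain", "naive-grep", "strong-grep", "blind")

# Questions per category, as stated in the caveat above.
N = {category: 5 for category in SCORES}
N["literal-lookup"] = 8


def overall(column: int) -> float:
    """Question-weighted mean accuracy for one condition."""
    total = sum(N.values())
    return sum(SCORES[c][column] * N[c] for c in SCORES) / total


for i, name in enumerate(CONDITIONS):
    print(f"{name}: {overall(i):.1f}%")
```

The weights sum to 38 questions, which matches the question count given for this track.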

03 — Coding

o8 does not make the underlying models write better code.

Three real issues, same base, impartial 0–10 judge. Run raw, the better of Codex and Claude wrote a first diff as good as or better than o8's governed pipeline on every task, and did so faster. o8-governed won zero tasks.

task                     | codex-alone | claude-alone | o8-governed | winner
#1065 (medium)           | 9           | 8.5          | 8           | Codex-alone
#1144 (hard, live state) | 8.5         | 6            | 7.5         | Codex-alone
#928 (greenfield)        | 7.5         | 8            | 7.5         | Claude-alone

[ TAKEN AT FACE VALUE ]

A strong model coding raw is already excellent. Any product claiming its wrapper improves raw code generation is selling something the models already commoditized.

N=3 valid tasks, single operator, first-diff quality only — not the full multi-turn governed pipeline. The governance track isolates the wrapper's actual contribution.
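
The winner column follows mechanically from the scores. A short sketch, assuming the judge scores read as 9 / 8.5 / 8, 8.5 / 6 / 7.5, and 7.5 / 8 / 7.5 for the three tasks:

```python
from collections import Counter

# Judge scores (0 to 10) per task, read from the coding table above.
SCORES = {
    "#1065": {"codex-alone": 9.0, "claude-alone": 8.5, "o8-governed": 8.0},
    "#1144": {"codex-alone": 8.5, "claude-alone": 6.0, "o8-governed": 7.5},
    "#928": {"codex-alone": 7.5, "claude-alone": 8.0, "o8-governed": 7.5},
}

# Highest-scoring condition per task.
winners = {task: max(scores, key=scores.get) for task, scores in SCORES.items()}
wins = Counter(winners.values())

print(winners)
print(wins["o8-governed"])  # 0: Counter returns 0 for a condition with no wins
```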

04 — Governance

The gate is the value, not the keystrokes.

Every clean-looking diff above carried a real bug that passed tsc and lint — an inert guard, a duplicating ledger writer, a leaked global. An ad-hoc “merge when green” path shipped all three broken. o8's review→refix gate caught two, fixed both, and never cried wolf on the clean diffs.

                             | ad-hoc (merge on green) | o8 review→refix
Bugs shipped broken          | 3 / 3                   | -
Bugs caught                  | 0 / 3                   | 2 / 3
Caught bugs fixed (verified) | -                       | 2 / 2
False alarms on clean diffs  | -                       | 0 / 2

[ THE MISSED THIRD ]

It missed one bug by looking directly at it and rationalizing it as correct. That is the ceiling of automated review, and exactly why o8 keeps a human approval gate above the AI tier rather than auto-merging.
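
The layering described here (AI review, refix, then a human approval gate that nothing bypasses) can be sketched as a small control loop. Everything in this block is a hypothetical illustration: `governed_merge`, its callable parameters, the `max_refix_rounds` limit, and the outcome strings are assumptions, not o8's API.

```python
from typing import Callable, List

Review = Callable[[str], List[str]]      # returns findings; empty list means clean
Refix = Callable[[str, List[str]], str]  # returns a revised diff


def governed_merge(
    diff: str,
    review: Review,
    refix: Refix,
    human_approves: Callable[[str], bool],
    max_refix_rounds: int = 1,
) -> str:
    """Review, refix, re-review, and never merge without human approval."""
    for round_no in range(max_refix_rounds + 1):
        findings = review(diff)
        if not findings:
            # A clean review is necessary but not sufficient: a human still decides.
            return "merged" if human_approves(diff) else "rejected"
        if round_no < max_refix_rounds:
            diff = refix(diff, findings)
    # Findings remain after the allowed refix rounds: hand off to a human.
    return "escalated"


# Toy reviewer and fixer for the demo: a diff is "buggy" if it contains "bug".
def toy_review(diff: str) -> List[str]:
    return ["inert guard"] if "bug" in diff else []


def toy_refix(diff: str, findings: List[str]) -> str:
    return diff.replace("bug", "fix")


print(governed_merge("diff with bug", toy_review, toy_refix, lambda d: True))
```

Note that a reviewer that rationalizes a bug as correct returns no findings, so in this sketch the only remaining defense is the `human_approves` step, which is the point the callout makes.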

5 diffs, single operator, curated to be tsc and lint clean — the regime that flatters review. The reviewer is an AI agent standing in for the orchestrator tier, not the human backstop above it. One catch was “soft”; under a stricter bar the rate is 1/3.

05 — What o8 is

Not a better coder. The safe, fast, informed layer.

[ FAST ]

When warm, the control surface responds in under a frame. Cold-start is a measured, improving number, not a guess.

[ INFORMED ]

Organizational memory answers why, who, and what-was-decided (questions a code search fundamentally can't answer) while ceding the literal lookups that search is better at.

[ SAFE ]

The review→refix gate catches a meaningful share of the subtly-broken-but-compiling diffs that ad-hoc workflows merge, fixes them, and routes the rest to a human instead of auto-shipping.

Speed is the floor. Memory and governance are the moat — and they're the parts the underlying model vendors have the least incentive to build.

Limitations

A benchmark that can't show its own losses isn't evidence. These are the constraints that keep this honest.

  • Single-operator pilot, small N (3–38 per track). Read these as directional point estimates, not measured rates with confidence intervals.
  • Only the speed and memory tracks are recorded in the committed, version-stamped scorecard. Governance and coding are run by hand and are not yet auto-recorded per release; the governance catch-rate here comes from reconstructed diffs — a bounded proxy.
  • No external or competitor baselines for speed — architecturally incomparable. Only an internal release-over-release delta. A SWE-bench run is still pending.
  • LLM-judged quality on a single judge family across memory and coding. A second-judge cross-check is pending.
  • Accuracy is the memory win; on raw hallucination count the Brain still trails strong-grep in the recall-heavy categories. It buys relevance, not abstention.
  • Governance bugs were curated to pass tsc and lint — the regime most favorable to review.
  • The governance result measures the AI review tier, not the human approval backstop that sits above it in production.
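
To make the small-N caveat concrete, a confidence interval can be put around a proportion such as the 2 / 3 governance catch rate. The sketch below uses the Wilson score interval, which is our choice of method for illustration and not something the pilot reports.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(2, 3)
print(f"2 of 3 caught: 95% interval {low:.2f} to {high:.2f}")
```

For 2 of 3, the interval spans roughly 0.21 to 0.94, which is why these figures are best read as directional rather than as measured rates.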

A full study would scale each track — N≥30 coding tasks, multi-judge, multi-operator, external where possible — and exercise the human-approval layer. Until then this is an honest pilot.