BENCHMARK / 01
What o8 actually does.
o8 doesn't make models code better. It makes their output safe to merge, fast to use, and informed by your organization's memory.
Three tracks (speed, memory, governance), plus a raw-coding control. We tried to disprove each claim and report where o8 loses. June 2026.
A single-operator pilot. Speed and memory are in the committed, version-stamped scorecard; governance and coding are measured by hand and the governance figures come from reconstructed diffs — a bounded proxy. Memory is n=5 per category, one in-family LLM judge. Not externally validated; SWE-bench pending. Read these as directional point estimates.
A benchmark is only worth running if it changes something. This one did.
The pilot below surfaced three places o8 lost to a simpler tool or fell short of its own claim. We fixed all three and re-ran the same harnesses against the result. The point isn't that the numbers moved — it's the loop: measure honestly, find the weakness, fix it, re-measure.
| what changed | before | after | how |
|---|---|---|---|
| Memory — literal-lookup accuracy | 84% | 100% | route to grep |
| Governance — bugs caught (proxy re-audit) | 2 / 3 | 3 / 3 | 4-trace gate |
| Speed — avg boot socket conns | ~20.6 | ~3.1 | poll gating |
[ THE POINT ] A benchmark that drives its own product fixes and then re-audits them is more credible than one that only ever flatters. The thesis is unchanged: o8 still doesn't make the models better coders; it now measures, release over release, whether it keeps its promise.
The memory win is specific to literal lookups — synthesis categories are unchanged within run-to-run noise. The governance re-audit used faithful reconstructions of the retired diffs, a bounded proxy; the definitive number comes from replaying real diffs through the live review tier.
Warm is sub-frame. Cold was slow — we found the cause and cut it.
The cold-launch cost was not the UI. It was boot-time request fan-out saturating the WebView socket budget. We lazy-hydrated the panels that aren't visible at launch.
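The idea behind lazy hydration can be sketched in a few lines. This is a minimal illustration only: the `Panel` class, the `RequestCounter` helper, and the panel names are all hypothetical, and each `fetch` call merely stands in for one boot-time request. It is not o8's boot code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Panel:
    """Hypothetical UI panel; `fetch` stands in for one boot-time request."""
    name: str
    visible_at_launch: bool
    fetch: Callable[[], str]
    data: Optional[str] = None

    def hydrate(self) -> None:
        # Fetch at most once, the first time the panel is actually needed.
        if self.data is None:
            self.data = self.fetch()


class RequestCounter:
    """Counts issued fetches, as a stand-in for socket traffic."""

    def __init__(self) -> None:
        self.count = 0

    def fetcher(self, name: str) -> Callable[[], str]:
        def fetch() -> str:
            self.count += 1
            return f"payload for {name}"
        return fetch


def make_panels(counter: RequestCounter) -> List[Panel]:
    # Panel names are illustrative only.
    layout = [("editor", True), ("status", True),
              ("history", False), ("settings", False)]
    return [Panel(name, visible, counter.fetcher(name)) for name, visible in layout]


def boot(panels: List[Panel], lazy: bool) -> None:
    """Eager boot hydrates every panel; lazy boot hydrates only visible ones."""
    for panel in panels:
        if not lazy or panel.visible_at_launch:
            panel.hydrate()


eager_counter, lazy_counter = RequestCounter(), RequestCounter()
eager_panels = make_panels(eager_counter)
lazy_panels = make_panels(lazy_counter)
boot(eager_panels, lazy=False)
boot(lazy_panels, lazy=True)
print(eager_counter.count, lazy_counter.count)
```

A hidden panel still loads, just later: calling `hydrate()` when the user opens it issues its single request at that point instead of at launch.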
| metric | before | after | Δ |
|---|---|---|---|
| Boot requests | 221 | 111 | −50% |
| First Contentful Paint | 2305 ms | 1579 ms | −31% |
| Time to interactive | 2311 ms | 1533 ms | −34% (−778 ms) |
[ VERDICT ] Speed is a measured floor we can improve, not a slogan. The fix shipped and was verified live.
Single cold-launch samples. The −50% request cut is structural; the paint deltas need N≥10 for a confidence interval. No external cross-IDE baseline — “fast” means against UX thresholds and our own prior release, not a competitor.
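The deltas can be recomputed directly from the before and after columns. The short sketch below does only that arithmetic; the helper name `pct_change` is ours, and the inputs are the single cold-launch samples reported above.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent (negative = reduction)."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the cold-launch table.
cold_launch = {
    "Boot requests": (221, 111),
    "First Contentful Paint (ms)": (2305, 1579),
    "Time to interactive (ms)": (2311, 1533),
}

deltas = {name: pct_change(b, a) for name, (b, a) in cold_launch.items()}

for name, delta in deltas.items():
    before, after = cold_launch[name]
    print(f"{name}: {before} -> {after} ({delta:+.1f}%, abs {after - before:+d})")
```

With one sample per condition these are point differences, not estimates with an error bar, which is why the note above asks for N≥10 before quoting the paint deltas with confidence.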
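The Brain beats a competent ripgrep agent, except on literal lookups, where it now defers to grep.
38 questions, LLM-judged factual accuracy, four conditions. The Brain wins overall by about 29 points over strong-grep (68.6% vs 39.5%) and outscores both grep conditions in every category where answers must be synthesized or recalled. Where the answer is a literal token in a file it used to lose (84% vs strong-grep's 100%); those questions are now routed to grep.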
| category | brain | naive-grep | strong-grep | blind |
|---|---|---|---|---|
| incidents | 38% | 13% | 14% | 14% |
| specs | 77% | 2% | 12% | 2% |
| decisions | 72% | 36% | 39% | 4% |
| ownership | 54% | 21% | 30% | 16% |
| processes | 80% | 70% | 20% | 0% |
| cross-repo | 40% | 27% | 25% | 19% |
| literal-lookup | 100% | 88% | 100% | 13% |
| OVERALL | 68.6% | 40.7% | 39.5% | 9.9% |
Where the Brain used to lose: on literal token lookups (default values, where-X-is-defined) strong-grep scores a perfect 100%. o8 now routes these to grep, so the Brain matches it. When the answer is a word in a file, search wins and the Brain defers to it.
[ THE USABLE RULE ] Brain for why, who, what-was-decided, what-happened. Grep for what's-the-value, where-is-X-defined.
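The rule can be pictured as a small question router. The sketch below is a guess at the shape of such a router, not o8's routing logic: the regular expressions, the `route` function, and the sample questions are all hypothetical, and a real classifier would need far more than three patterns.

```python
import re

# Hypothetical patterns for literal-lookup questions; not o8's real rules.
LITERAL_PATTERNS = [
    re.compile(r"\bdefault (value|setting)\b", re.IGNORECASE),
    re.compile(r"\bwhere is\b.*\b(defined|declared)\b", re.IGNORECASE),
    re.compile(r"\bwhat(?:'s| is) the value of\b", re.IGNORECASE),
]


def route(question: str) -> str:
    """Return "grep" for literal-token questions and "brain" for everything else."""
    if any(pattern.search(question) for pattern in LITERAL_PATTERNS):
        return "grep"
    return "brain"


# Made-up example questions, one per side of the rule.
for q in [
    "What is the default value of MAX_RETRIES?",
    "Where is parseConfig defined?",
    "Why did we drop the retry queue?",
    "Who owns the billing service?",
]:
    print(f"{route(q):5s} <- {q}")
```

Defaulting to the Brain when nothing matches reflects the table above: the Brain is at least as accurate as grep in every category, so a missed literal question costs little, while a synthesis question sent to grep loses most of its accuracy.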
Single-operator question set, n=5 per category (n=8 literal cases). The judge is one model family — a second-judge cross-check is the next step before publishing exact figures. Accuracy is the win; on raw hallucination count the Brain still trails strong-grep in the recall-heavy categories — it buys relevance, not abstention.
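The OVERALL row can be checked against the per-category rows. The sketch below assumes OVERALL is a question-weighted mean (5 questions per category, 8 for literal-lookup, 38 in total); the report does not state the aggregation method, so treat that as our assumption. Because the per-category figures are already rounded, the recomputed values should agree with the table only to within about a tenth of a point.

```python
CONDITIONS = ("brain", "naive-grep", "strong-grep", "blind")

# category -> (question count, accuracy % per condition), copied from the table.
ROWS = {
    "incidents": (5, (38, 13, 14, 14)),
    "specs": (5, (77, 2, 12, 2)),
    "decisions": (5, (72, 36, 39, 4)),
    "ownership": (5, (54, 21, 30, 16)),
    "processes": (5, (80, 70, 20, 0)),
    "cross-repo": (5, (40, 27, 25, 19)),
    "literal-lookup": (8, (100, 88, 100, 13)),
}


def overall(rows: dict) -> dict:
    """Question-weighted mean accuracy per condition, in percent."""
    total_n = sum(n for n, _ in rows.values())
    return {
        cond: sum(n * scores[i] for n, scores in rows.values()) / total_n
        for i, cond in enumerate(CONDITIONS)
    }


total_questions = sum(n for n, _ in ROWS.values())
recomputed = overall(ROWS)
print(total_questions, {c: round(v, 2) for c, v in recomputed.items()})
```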
o8 does not make the underlying models write better code.
Three real issues, same base, impartial 0–10 judge. Run raw, Codex or Claude produced a first diff as good as or better than o8's governed pipeline on every task, and faster. o8-governed won zero tasks, though it did outscore Claude-alone on #1144 (7.5 vs 6).
| task | codex-alone | claude-alone | o8-governed | winner |
|---|---|---|---|---|
| #1065 (medium) | 9 | 8.5 | 8 | Codex-alone |
| #1144 (hard, live state) | 8.5 | 6 | 7.5 | Codex-alone |
| #928 (greenfield) | 7.5 | 8 | 7.5 | Claude-alone |
[ TAKEN AT FACE VALUE ] A strong model coding raw is already excellent. Any product claiming its wrapper improves raw code generation is selling something the models already commoditized.
N=3 valid tasks, single operator, first-diff quality only — not the full multi-turn governed pipeline. The governance track isolates the wrapper's actual contribution.
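The winner column and the win count follow mechanically from the judge scores. A short sketch that recomputes them from the table (variable names are ours):

```python
from collections import Counter

# Judge scores (0 to 10) copied from the coding table.
SCORES = {
    "#1065": {"codex-alone": 9.0, "claude-alone": 8.5, "o8-governed": 8.0},
    "#1144": {"codex-alone": 8.5, "claude-alone": 6.0, "o8-governed": 7.5},
    "#928": {"codex-alone": 7.5, "claude-alone": 8.0, "o8-governed": 7.5},
}

# Highest score per task wins; no task in this table has a tie for first place.
winners = {task: max(scores, key=scores.get) for task, scores in SCORES.items()}
wins = Counter(winners.values())

# Mean score per condition across the three tasks.
conditions = ["codex-alone", "claude-alone", "o8-governed"]
means = {c: sum(s[c] for s in SCORES.values()) / len(SCORES) for c in conditions}

print(winners)
print({c: wins[c] for c in conditions})
print({c: round(m, 2) for c, m in means.items()})
```

With three tasks and one judge family, both the win counts and the means are descriptive only; a gap of half a point on a single task is well inside what a second judge could reverse.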
The gate is the value, not the keystrokes.
Every clean-looking diff above carried a real bug that passed tsc and lint: an inert guard, a duplicating ledger writer, a leaked global. An ad-hoc “merge on green” path shipped all three broken. o8's review→refix gate caught two, fixed both, and raised no false alarms on the two genuinely clean control diffs.
| metric | ad-hoc (merge on green) | o8 review→refix |
|---|---|---|
| Bugs shipped broken | 3 / 3 | — |
| Bugs caught | 0 / 3 | 2 / 3 |
| Caught bugs fixed (verified) | — | 2 / 2 |
| False alarms on clean diffs | — | 0 / 2 |
[ THE MISSED THIRD ] It missed one bug by looking directly at it and rationalizing it as correct. That is the ceiling of automated review, and exactly why o8 keeps a human approval gate above the AI tier rather than auto-merging.
5 diffs (3 buggy, 2 clean), single operator, curated to be tsc and lint clean: the regime that flatters review. The reviewer is an AI agent standing in for the orchestrator tier, not the human backstop above it. One catch was “soft”; under a stricter bar the rate is 1/3.
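The governance table is a tally over five reviewed diffs, and can be reproduced as one. In the sketch below the labels `bug-1` through `clean-2` are placeholders: the report does not say which of the three bugs was the missed one, so the position of the miss is arbitrary. The `ReviewedDiff` record is our own illustration, not an o8 data structure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ReviewedDiff:
    label: str               # placeholder label, not a real diff name
    has_bug: bool            # ground truth from the pilot
    passed_checks: bool      # green on tsc and lint
    flagged_by_review: bool  # did the AI review tier flag it?


# 3 buggy diffs (2 flagged, 1 missed) and 2 clean diffs (none flagged).
DIFFS: List[ReviewedDiff] = [
    ReviewedDiff("bug-1", True, True, True),
    ReviewedDiff("bug-2", True, True, True),
    ReviewedDiff("bug-3", True, True, False),
    ReviewedDiff("clean-1", False, True, False),
    ReviewedDiff("clean-2", False, True, False),
]


def tally(diffs: List[ReviewedDiff]) -> Dict[str, int]:
    buggy = [d for d in diffs if d.has_bug]
    clean = [d for d in diffs if not d.has_bug]
    return {
        # Merge-on-green ships every buggy diff that passes the checks.
        "shipped_broken_on_green": sum(d.passed_checks for d in buggy),
        "caught_by_review": sum(d.flagged_by_review for d in buggy),
        "false_alarms": sum(d.flagged_by_review for d in clean),
        "buggy_total": len(buggy),
        "clean_total": len(clean),
    }


result = tally(DIFFS)
print(result)
```

Counting the “soft” catch as a miss would change `caught_by_review` from 2 to 1, which is the 1/3 figure quoted above.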
Not a better coder. The safe, fast, informed layer.
The control surface is sub-frame warm. Cold-start is a measured, improving number, not a guess.
Organizational memory answers why, who, and what-was-decided — questions a code search fundamentally can't — while ceding the literal lookups search is better at.
The review→refix gate catches a meaningful share of the subtly-broken-but-compiling diffs that ad-hoc workflows merge, fixes them, and routes the rest to a human instead of auto-shipping.
Speed is the floor. Memory and governance are the moat — and they're the parts the underlying model vendors have the least incentive to build.
A benchmark that can't show its own losses isn't evidence. These are the constraints that keep this honest.
- Single-operator pilot, small N (3–38 per track). Read these as directional point estimates, not measured rates with confidence intervals.
- Only the speed and memory tracks are recorded in the committed, version-stamped scorecard. Governance and coding are run by hand and are not yet auto-recorded per release; the governance catch-rate here comes from reconstructed diffs — a bounded proxy.
- No external or competitor baselines for speed — architecturally incomparable. Only an internal release-over-release delta. A SWE-bench run is still pending.
- LLM-judged quality on a single judge family across memory and coding. A second-judge cross-check is pending.
- Accuracy is the memory win; on raw hallucination count the Brain still trails strong-grep in the recall-heavy categories. It buys relevance, not abstention.
- Governance bugs were curated to pass tsc and lint — the regime most favorable to review.
- The governance result measures the AI review tier, not the human approval backstop that sits above it in production.
A full study would scale each track — N≥30 coding tasks, multi-judge, multi-operator, external where possible — and exercise the human-approval layer. Until then this is an honest pilot.
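To make "directional point estimate" concrete, the sketch below computes a 95% Wilson score interval for a proportion. The Wilson interval is our choice of method for illustration; the pilot itself reports no intervals. The 20-of-30 case is a hypothetical count chosen to show the same rate at the sample size a full study would target.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


pilot = wilson_interval(2, 3)      # governance catch rate reported in the pilot
scaled = wilson_interval(20, 30)   # hypothetical: same rate at N=30

for label, (lo, hi) in [("2/3", pilot), ("20/30", scaled)]:
    print(f"{label}: {lo:.2f} to {hi:.2f} (width {hi - lo:.2f})")
```

At N=3 the interval spans most of the range from 0 to 1, so a 2/3 catch rate is compatible with both a weak and a strong reviewer; at N=30 the same rate narrows to a usable band.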