The Fable-window benchmark

~550metered tokens per real engineering decision

We put the most expensive AI in the world behind a governor that only lets it make decisions — no file reads, no shell, no web — and measured everything, in the five days before that model leaves the market. It reads a ~1KB decision artifact, rules, and stops: every token it pays for is a decision. One decision costs 374 tokens in and 183 out, measured across 26 real adjudications.

measured 2026-07-02/03 · real o8 approval history · exact API token accounting

The numbers

Eight results. Each one links to the experiment that produced it. Nothing below is projected except where marked.

~550 tokensTotal metered tokens for one real engineering decision through the o8 window — 374 in + 183 out, across 26 real adjudications.→ EXP 1 26×Input-token reduction per decision: windowed (374 in) vs the same decision through a tool-loaded session (~9,700 in).→ EXP 1 9.2×Input reduction on a hard task — full adversarial review of a ~20-file change batch: 32,720 tokens raw vs 3,544 windowed.→ EXP 2 1 > 52A single raw-context Fable review call billed more input than an entire 52-call windowed decision eval.→ EXP 2 76% cheaperInput cost reduction from prompt caching on the window's doctrine prefix — 70,928 of 84,284 input tokens served from cache at 0.1×.→ EXP 3 80% vs 68%Fable matched the human operator's real recorded decisions 80% of the time; its designated successor (Opus 4.8) only 68%.→ EXP 1 < $1Cost of the entire 52-decision succession eval through the window.→ EXP 1 1 in 4Real decisions that flip depending on which frontier model holds the gavel (76% cross-model agreement) — why o8 ships judge panels, not blind trust.→ EXP 1

Experiment 1 — Succession eval

Fable matched the operator 80% of the time. Its successor: 68%.

26 real operator approval cards pulled from o8's own governance history — merge requests, tool confirmations, rebase conflicts, file-size overrides; 14 approved / 12 rejected by the human operator, with verdict leakage scrubbed. Each card was replayed as a ~1KB decisions-only artifact through claude-fable-5 and claude-opus-4-8, in three runs: CLI transport (subscription), API transport (exact tokens), and API with cached doctrine.

run	Fable↔Opus agreement	Fable vs operator	Opus vs operator
CLI (sub-billed)	73.1%	69.2%	65.4%
API (clean instrument)	76%	80%	68%
API + doctrine cached	76%	64%*	60%*

The over-rejection pattern, 3 for 3 runs: every single Fable↔Opus disagreement had the same shape — Fable said APPROVE, Opus said REJECT — and the human operator sided with Fable in the large majority (5/6, 4/7, 4/6). The successor model is systematically more trigger-happy on rejection than both the incumbent and the human. Left alone as sole adjudicator, it would bounce work the operator would have shipped.
Consequence shipped into the product: after Fable's departure (2026-07-07), contested calls in o8 route to a judge panel — the existing Collide machinery — not a single successor model. A 1-in-4 flip rate is too high for a solo gavel.

*The cached run's doctrine prefix measurably shifted BOTH models' decisions — operator-match dropped ~15 points. Standing doctrine steers the adjudicator; the production window's doctrine text is now a tuned artifact, not an afterthought.

[ TOKEN ECONOMICS PER DECISION — API, EXACT ]

condition	input	output	total
Windowed decision	374	183 (Fable) / 67 (Opus)	~550 / ~440
Same decision, tool-loaded session	~9,700	~240	~9,940

[ VERDICT ]18× fewer total tokens, 26× fewer input tokens, per decision. The whole 52-call eval: 19,480 in / 6,522 out — under a dollar.

Experiment 2 — Hard-task parity

Does the window nerf the model? 9.2× less context, zero quality loss measured.

The hardest artifact available: adversarial pre-ship review of the entire Fable-mode change batch itself — ~20 files, ~900 added lines, security-sensitive (billing isolation, tool lockouts, rate limits). Two context conditions, same task prompt, same output contract. RAW is the full change batch in context — the “native frontier” pattern. WINDOW is a ~2K-token digest, the shape o8's digest() produces: per-file roles plus key seam excerpts.

condition	input tokens	ratio
fable-RAW	32,720	—
fable-WINDOW	3,544	9.2× less

That single raw-context call — 32,720 input tokens — billed more input than the entire 52-call windowed decision eval (19,480). One review, one condition, more than a full succession eval. Measured before the account ran dry.

[ QUALITY — SCORED AGAINST 6 GROUND-TRUTH SUBTLETIES ]

condition	verdict	findings	ground-truth coverage
fable-RAW (full 18K-token context)	HOLD	10/10	5/6
fable-WINDOW (2K digest)	HOLD	10/10	6/6
opus-RAW	HOLD	10/10	6/6
opus-WINDOW	HOLD	10/10	6/6

Judge disclosure: findings were scored against 6 ground-truth subtleties known to the batch author. The batch author is the judge.

Unanimous verdicts across all four conditions — the window did not change the decision.
The windowed Fable review found real issues the raw-context review missed: the raw-transcript escape hatch limits call count but not output size, and the metered compaction ceiling is enforced client-side only. The raw review found its own uniques — a per-request backend-override mismatch in the Brain gate. Neither condition dominated; both were ship-blocking-grade reviews.

The honest caveat: the digest was authored to be faithful — the same standard digest() is held to in production. A bad digest would produce a blind review, which is exactly why o8 meters fetch_raw instead of banning it, reads adjudicated diffs raw, and spot-audits digests against full context.

Experiment 3 — Prompt caching

Caching the doctrine prefix cut input cost by ~76%.

The window's standing adjudication doctrine — ~1.4K tokens of real o8 governance rules — was set as a cache_control: ephemeral system block, with 52 decision calls against it. Result: 70,928 input tokens served from cache at 0.1× price vs 13,356 at full price — a ~76% input-cost reduction on the doctrine-laden shape.

The caveat that matters: Anthropic caching needs a ~1,024-token minimum stable prefix — thin prompts can't cache, which is exactly why the window batches its standing context into one stable block.

Experiment 4 — The live arc

Not a lab number. The whole loop ran live in the o8 app.

01Fable — locked at spawn: no file reads, no shell, no web; 10 native tools plus raw-transcript tools denied — answered repo-governance questions correctly with 6 cited sources, because the org-memory Brain does the reading. A fixed-cost model reads so the metered one never pays to.
02It dispatched a real coding task to a Codex worker; the worker's prompt automatically carried Brain access because the orchestrator is metered.
03When the worker's launcher broke (an environment bug, since root-fixed), Fable didn't retry-spam. It investigated through its dashboards, discovered the operator's suggested fix violated a platform rule — and said so, with citations — parked everything clean, and raised a two-option decision card.
04After the fix it pruned its duplicate packet, redispatched, reviewed the worker's diff RAW (per doctrine — the artifact being approved is never summarized), caught a real nit in the worker's code, passed the 4-check merge gate, and the work shipped as PR #1349.
05Along the way it used fetch_raw — the window's rate-limited raw-content escape — unprompted, exactly as designed.

[ THE NERF VERDICT ]A decisions-only Fable is not a worse Fable. It answered everything a full-tooled session would, made better recovery decisions than it was instructed to make, and its judgment tracked the human operator better than its successor's did.

Methodology

A benchmark that hides its constraints isn't evidence. These are the ones that keep this honest.

N=26 decisions and 1 hard task, from a single (real, production) o8 instance — a field measurement, not an academic benchmark suite.
Cross-model agreement was measured while both models were purchasable — unreproducible after 2026-07-07. That's the point.
The CLI transport carries Claude Code's ~9.7K-token system prompt per call; on subscription billing this is free, which is itself part of the o8 story: fixed-cost lanes for bulk, a metered lane for decisions only.
Judge of quality on the hard task: findings were scored against 6 ground-truth subtleties known to the batch author. The batch author is the judge.
Full raw data: docs/research/fable-succession/run-*.json and hard-task/.

Run the window

The window, the Brain, the judge panels, and the approval surface measured above ship in o8.

Download for macOS ›

macOS 14+ · Intel build; Apple Silicon via Rosetta