💪 Claude Fable 5: Unlock the Model Anthropic was Afraid to Release

What can Claude Fable 5 really do? We uncovered hidden capabilities Anthropic almost kept secret. You won’t believe what it handles until you see the tests.

Introduction

Anthropic did something it had never done before: it handed the public a model from its top-secret “Mythos” tier. That model is Claude Fable 5.

Fable 5 and Claude Mythos 5 share the same underlying weights. They are technically the same model. The difference is that Fable 5 wraps Mythos-class intelligence in safety classifiers that block certain high-risk domains.

For everything else, you’re talking to the same model that was previously only available to a handful of vetted cybersecurity and research organizations through Project Glasswing.

The full model family now looks like this:

Fable 5 isn’t the new Opus. It sits above Opus entirely.

I. What Claude Fable 5 Actually Is

Let’s clear up the biggest confusion first: Claude Fable 5 is not the next Opus. It is a new model tier entirely.

1. Fable 5 vs. Mythos 5: The Actual Difference

The only difference is that Fable 5 wraps Mythos-class intelligence in safety classifiers that block three specific domains:

Offensive cybersecurity exploits
Biology and chemistry procedures
Model distillation

Outside those areas, you’re working with the same intelligence Anthropic previously kept behind Project Glasswing.

For most people, that means Fable 5 gives full Mythos-level performance for coding, writing, research, analysis, and complex reasoning.

2. Technical Specifications

Specification	Details
API Model ID	`claude-fable-5`
Context Window	1,000,000 tokens
Max Output Per Request	128,000 tokens
Input Pricing	$10 per million tokens
Output Pricing	$50 per million tokens
Long Context Surcharge	None. 900k tokens costs the same per token as 9k.
Knowledge Cutoff	January 2026
Thinking Mode	Adaptive thinking only. Always on.
Data Retention	30 days. No zero retention option.

It’s available on:

Free on Pro, Max, Team, and Enterprise plans through June 22. After that, it draws on usage credits.

One thing worth noting on pricing is that the long context cost structure is more generous than it first appears. A one million token request costs the same per token as a ten thousand token request.

Compare that with Gemini 3.1 Pro, which increases its input pricing beyond 200K tokens. For the long and complex work Fable 5 is designed for, that difference matters.
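To make the flat pricing concrete, here is a minimal sketch of a cost estimator built from the prices and limits in the specifications table. The helper name `estimate_cost_usd` and the example request sizes are illustrative assumptions, not part of any Anthropic SDK, and the sketch only checks the prompt against the context window.

```python
# Prices and limits copied from the specifications table above.
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0
CONTEXT_WINDOW_TOKENS = 1_000_000
MAX_OUTPUT_TOKENS = 128_000


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost under flat per-token pricing.

    Hypothetical helper: assumes no long-context surcharge, as the table states.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if input_tokens > CONTEXT_WINDOW_TOKENS:
        raise ValueError("prompt exceeds the 1,000,000-token context window")
    if output_tokens > MAX_OUTPUT_TOKENS:
        raise ValueError("output exceeds the 128,000-token per-request limit")
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MTOK
    return input_cost + output_cost


# Same output size, prompts that differ by 100x (sizes chosen for illustration).
small = estimate_cost_usd(9_000, 2_000)
large = estimate_cost_usd(900_000, 2_000)
print(f"9k-token prompt:   ${small:.2f}")
print(f"900k-token prompt: ${large:.2f}")
```

The input portion of the bill scales linearly with prompt size, which is the whole point of a flat rate: 100 times the tokens costs exactly 100 times as much, never more.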

II. Benchmarks: What the Numbers Mean

I know. You’ve seen benchmark tables before. You’ve watched models leapfrog each other by tiny margins and wondered whether any of it actually means anything in the real world.

This one is different. Not because the numbers are big, but because of the shape of the gap.

Key points

Fable 5 completed a 50 million line Ruby codebase migration in one day.
By Anthropic’s estimate, a full engineering team working by hand would have needed over two months for the same task.
Fable 5 is best suited for high-complexity, long-duration workflows.
Using Fable 5 for routine queries, drafts, or light tasks is inefficient.
The longer and harder the task, the greater Fable 5’s advantage in real-world enterprise scenarios.

1. On Software Engineering

On SWE Bench Pro, the most demanding real world software engineering benchmark, Fable 5 scored 80.3%. Opus 4.8 came in at 69.2%. GPT 5.5 scored 58.6%. Gemini 3.1 Pro scored 54.2%.

The 11.1-point gap between Fable 5 and Opus 4.8 is larger than the 10.6-point gap between Opus 4.8 and GPT 5.5, its nearest competitor.

The lead over the previous flagship is bigger than the lead that previous flagship had over its nearest competitor. That is a category shift.

On Cognition’s FrontierCode Diamond, a harder independent benchmark that tests whether models can write production quality code, Fable 5 scored 29.3%. Opus 4.8 managed 13.4%. GPT 5.5 got 5.7%.

In relative terms, the gap there is even wider.
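The absolute and relative gaps are easy to check by hand. The short sketch below recomputes them from the scores quoted in this section; the function name is an illustrative choice, and the numbers are only as reliable as the reported benchmarks.

```python
# Scores (percent) as quoted in this section.
swe_bench_pro = {"Fable 5": 80.3, "Opus 4.8": 69.2, "GPT 5.5": 58.6, "Gemini 3.1 Pro": 54.2}
frontiercode_diamond = {"Fable 5": 29.3, "Opus 4.8": 13.4, "GPT 5.5": 5.7}


def lead_over_runner_up(scores: dict[str, float]) -> tuple[float, float]:
    """Return (gap in points, ratio) between the two highest scores."""
    first, second = sorted(scores.values(), reverse=True)[:2]
    return round(first - second, 1), round(first / second, 2)


for name, scores in [("SWE Bench Pro", swe_bench_pro),
                     ("FrontierCode Diamond", frontiercode_diamond)]:
    gap, ratio = lead_over_runner_up(scores)
    print(f"{name}: {gap} points ahead, {ratio}x the runner-up")

# Lead the previous flagship holds over its own nearest competitor.
opus_lead = round(swe_bench_pro["Opus 4.8"] - swe_bench_pro["GPT 5.5"], 1)
print(f"Opus 4.8 over GPT 5.5 on SWE Bench Pro: {opus_lead} points")
```

On SWE Bench Pro the leader scores roughly 1.16 times the runner-up. On FrontierCode Diamond it scores more than twice as much, which is what "even wider in relative terms" means here.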

CursorBench measures coding performance inside the actual Cursor editor environment, using real tools under real conditions. It put Fable 5 at 72.9%, which is 9 points ahead of the next best model.

2. On Knowledge Work and Finance

Fable 5 also posted some impressive results on benchmarks that measure professional knowledge work.

Benchmark	Fable 5	Notable comparison
Hebbia Finance	#1 overall	Biggest gains in document reasoning, chart interpretation, multi-step problem solving
GDPval AA	1932 Elo	Leading the field
Harvey BigLaw Bench	93.4%	—
Hex Core Analytics	First model to break 90%	—
HealthBench Professional	66.0%	vs. 56.9% Opus 4.8, 51.8% GPT-5.5

These are not cherry picked niche benchmarks. They cover finance, law, healthcare, and analytics, which happen to be some of the areas where enterprises spend the most money.

3. On Vision

Fable 5 is now the top publicly available model for vision tasks. It can extract numbers from complex scientific charts, recreate a web application’s source code from screenshots, and complete Pokémon FireRed using only raw game screenshots.

The Pokémon demo has received some skepticism (see Limitations section). The improvement in vision performance is real regardless.

4. On Long Horizon Agentic Tasks

This is where Fable 5 pulls away most clearly. In Anthropic’s Slay the Spire benchmark, both Fable 5 and Opus 4.8 were given persistent memory to save and revisit notes during the task.

Fable 5 improved 3x more than Opus 4.8
Reached the final stage of the game 3x as often

The longer and more complex the task becomes, the larger Fable 5’s advantage grows.

5. The Restricted Benchmarks

ExploitBench shows Mythos 5 at 78.0% compared with 40.0% for Opus 4.8. You won’t get that performance from Fable 5 because cybersecurity related requests are automatically routed to Opus 4.8.

That score helps explain why the safety restrictions exist.

During Mythos Preview testing, the model found 271 zero-day vulnerabilities in Firefox (addressed in Firefox 150) with no expert guidance and no specialized red-team setup.

Claude Opus 4.6, run on the same codebase, surfaced 22 bugs. The cybersecurity benchmark isn’t a feature being promoted. It’s the reason the safety system was built.

III. Real-World Results: What Real Companies Found

Benchmarks are controlled environments. Here’s what happened when real organizations brought their own problems.

1. Stripe: The Number That Stops Conversations

Stripe gave Fable 5 a 50 million line Ruby codebase and asked for a codebase-wide migration.

Fable 5 finished in one day. Anthropic’s estimate for the same migration from a full engineering team working by hand: over 2 months.

Not two weeks. Two months, compressed into one day. That’s not the kind of result you can wave away with “well, AI is good at code now.”

That’s a reordering of what’s possible. If that number holds up at scale, and there’s no reason yet to think it doesn’t, entire categories of engineering work just changed their economics.

2. GitHub

GitHub’s early testing found Fable 5 “completed equivalent work with fewer tool calls and lower token consumption than previous Opus-tier models.”

That’s understated on purpose. GitHub works with every major model and isn’t in the business of handing out superlatives.

3. IMC (Proprietary Trading Firm)

IMC, the proprietary trading firm, ran Fable 5 through their internal trading analysis evaluations.

→ It aced them nearly across the board: factual analysis, reasoning tasks, root cause investigations, expected value calculations.

Trading analysis requires accuracy you can trust. False positives are expensive. That result carries more weight than most marketing-adjacent enterprise testimonials.

4. Every / Dan Shipper’s Senior Engineer Benchmark

Dan Shipper at Every tested Fable 5 against a Senior Engineer benchmark, a battery of tasks designed to approximate what a strong engineer actually does at work.

Fable 5: 91/100
Opus 4.8: 63/100

The gap is 28 points. That’s not a close race.

5. Biology (Mythos 5’s Version)

This one sits under Mythos 5 rather than Fable 5, but the scale deserves mention.

Anthropic’s internal protein design experts tested Mythos 5 on drug design work. The model was given bioinformatics tools but no human guidance. It matched or outperformed experienced human operators across every stage of the workflow:

Selecting binding sites
Running protein design tools
Detecting and correcting its own mistakes
Iterating independently to improve results

That’s not “AI is helpful for research.” That’s a fundamental change in how fast drug candidates can be developed.

IV. Internet Reacts: X, Builders, and Day-One Demos

The most reliable signal for whether a model release is genuinely new is what happens in the first 24 hours on X.

Hype cycles around mediocre releases produce vague posts and marketing reposts. Genuine step changes produce people showing their work.

June 9 produced the second kind.

1. Andrej Karpathy

Andrej Karpathy posted his reaction within hours of launch. He’s seen every major model release from the inside, built some of them, and has no reason to overstate anything.

His phrase: “it will just go.”

Anyone who has worked extensively with LLMs knows that the hard part is getting a model to keep going, maintaining a coherent plan across a long, hard task without losing the thread, hallucinating progress, or asking for clarification when it shouldn’t.

That’s what changed.

He also added, honestly: the safeguards are “configured to be a little too trigger happy for launch.” More on that below.

2. Michael Truell, Cursor CEO

“Claude Fable 5 is the state of the art model on CursorBench. It’s opened up a class of long horizon problems that were out of reach.”

“Out of reach.” Not “harder to do.” Not “took longer.” Out of reach. That’s a meaningful distinction from the CEO of one of the most widely used AI coding tools in the world.

3. Deedy Das (Developer Investor)

Das put together a roundup of the most impressive day-one demos and said it left him “genuinely worried about where software engineering is headed.” Among the highlights:

Photorealistic forest scenes generated in a single shot
A complete Boeing 747 render
Space simulations running with 5,000+ objects
A proprietary code evaluator that Fable 5 optimized 10x further than the next best model

4. Minecraft in One Prompt

Ziwen Xu posted on launch day: a working Minecraft clone, built from a single prompt to Claude Fable 5. Video attached, not a screenshot. The full game loop running in the browser.

Anyone who has tried to vibe code something this complex before knows it usually takes hours of back and forth, multiple sessions, plenty of cleanup.

Getting it right the first time is the part that hits different.

V. Claude Fable 5 Safety Architecture: What’s Blocked

1. The Visible Part

Fable 5 ships with 3 classifiers: offensive cybersecurity, biology and chemistry, and model distillation. When one fires:

Environment	What happens
Claude / Claude Code	Request reroutes to Opus 4.8. You’re notified it happened.
Raw Messages API	No automatic fallback. You get a structured refusal. Your integration handles it.

Classifiers fire in fewer than 5% of sessions. The other 95%+ run at full Mythos 5 capability.
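Since the raw Messages API leaves refusal handling to your integration, here is one way that handling could look. This is a sketch under loud assumptions: the article does not show what the structured refusal looks like, so the `{"refused": ..., "text": ...}` dict, the `send` callable, and the fallback model ID are all hypothetical stand-ins rather than the real API.

```python
from typing import Callable, Dict

# ASSUMPTION: the shape of the "structured refusal" is not documented here.
# This dict format is made up, and `send` is any callable you supply that
# wraps your own API client.
Response = Dict[str, object]
SendFn = Callable[[str, str], Response]

PRIMARY_MODEL = "claude-fable-5"      # model ID from the specifications table
FALLBACK_MODEL = "fallback-model-id"  # placeholder; substitute a real model ID


def complete_with_fallback(send: SendFn, prompt: str) -> Response:
    """Try the primary model; on a refusal, retry once on the fallback model."""
    response = send(PRIMARY_MODEL, prompt)
    if not response.get("refused"):
        return {**response, "model": PRIMARY_MODEL, "fell_back": False}
    # Mirror the Claude apps: reroute, and record that it happened.
    retry = send(FALLBACK_MODEL, prompt)
    return {**retry, "model": FALLBACK_MODEL, "fell_back": True}


def fake_send(model: str, prompt: str) -> Response:
    """Stand-in client for the demo: the primary model refuses flagged prompts."""
    if model == PRIMARY_MODEL and "exploit" in prompt:
        return {"refused": True, "text": ""}
    return {"refused": False, "text": f"[{model}] answer"}


print(complete_with_fallback(fake_send, "Refactor this module"))
print(complete_with_fallback(fake_send, "Write an exploit for this bug"))
```

Recording `fell_back` in the result keeps the reroute visible to callers, so a downgraded answer never passes silently for a full-strength one.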

Before launch, Anthropic ran 1,000+ hours of external adversarial testing. No universal jailbreak was found.

2. The Less Visible Part

For requests related to frontier LLM development, the model may be silently weakened through prompt modification, steering vectors, or parameter-efficient fine-tuning. No notification. Estimated 0.03% of traffic.

Researcher Nathan Lambert put it directly: “An AI model that gets less intelligent automatically without notifying me is categorically misaligned AI.”

The 0.03% isn’t the issue. The principle is. A model that silently underperforms is different from a model that refuses. You can’t debug an invisible handicap.

Anthropic says classifiers are conservative at launch and will be tuned over time. The silent weakening clause is now in the toolbox regardless.

VI. Honest Claude Fable 5 Limitations

Expensive for routine work. $50 per million output tokens is double Opus 4.8. For summarization, Q&A, formatting, Fable 5 is overkill. Use it for hard jobs only. Opus 4.8 or Sonnet for everything else.

Slower than average. 60 tokens per second versus ~69 market average. Irrelevant for multi-day autonomous sessions. Noticeable in interactive chat.

Mandatory 30-day data retention. No zero retention option. Classified as “Covered Model.” Blocker for healthcare, legal, or any field with strict data retention rules.

Classifiers too aggressive at launch. Karpathy flagged this. Security researchers, biology grad students, ML practitioners already reporting false positives. Anthropic says tuning will improve. For now, expect unexpected fallbacks if your work touches those domains.

Computer use isn’t a clean win. OSWorld Verified: Fable 5 at 85.0%, Mythos Preview at 85.4%. Statistically a tie. Fable 5 is not state of the art on this specific task.

Silent weakening clause. 0.03% silent degradation on frontier LLM dev. Small number. The principle is not.

Pokémon demo deserves skepticism. Best of multiple attempts? Token cost? FireRed training data advantage vs genuine generalization? Anthropic hasn’t fully answered. Vision leap is real regardless. Demo carries an asterisk.
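To put the speed limitation in numbers, the rough sketch below converts the two quoted throughput figures into wall-clock generation time. The 500-token chat reply is an assumed size chosen for illustration, and the calculation ignores network latency and time to first token.

```python
FABLE_TOKENS_PER_SEC = 60.0   # figure quoted above
MARKET_TOKENS_PER_SEC = 69.0  # approximate market average quoted above


def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Seconds needed to stream a reply at a steady token rate."""
    return output_tokens / tokens_per_sec


# 500 tokens is an assumed chat reply; 128,000 is the per-request output limit.
for label, tokens in [("chat reply", 500), ("maximum-length output", 128_000)]:
    fable = generation_seconds(tokens, FABLE_TOKENS_PER_SEC)
    market = generation_seconds(tokens, MARKET_TOKENS_PER_SEC)
    print(f"{label}: {fable:.1f}s vs {market:.1f}s ({fable - market:.1f}s slower)")
```

The difference is about a second on a short reply and several minutes on a maximum-length output. You feel that in interactive chat, but it hardly matters in a session that runs for hours.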

Bonus: Fable 5 vs. Mythos 5: Full Comparison

This is the question I keep seeing in threads, Slack channels, and comment sections. Here’s the fastest possible answer:

They are the same model. Identical weights, identical intelligence, identical pricing ($10/$50 per million tokens). The difference is entirely in what’s wrapped around them.

	Fable 5	Mythos 5
Who can access it	Everyone	Project Glasswing partners only
Cybersecurity classifier	Active → routes to Opus 4.8	Lifted
Biology / chemistry classifier	Active → routes to Opus 4.8	Lifted
Distillation classifier	Active → routes to Opus 4.8	Lifted
What you get when a classifier fires	Opus 4.8 response (notified) in Claude / Claude Code; structured refusal on the raw Messages API	Full Mythos response
Pricing	$10 input / $50 output per M	$10 input / $50 output per M

If you’re not an approved Project Glasswing partner, you’re on Fable 5. And for 95%+ of what anyone actually does, that is the same thing as Mythos 5. The gate is narrow. The intelligence behind it isn’t.

Use Fable 5 for:

Multi-day agentic workflows
Large codebase migrations
Complex analytical pipelines
Dense PDF analysis
Design-to-code operations

Stay on Opus 4.8 or Sonnet for:

Routine queries and quick drafts
Summarization and formatting tasks
Anything where latency and cost aren’t justified by the complexity
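If you route requests in code, the two lists above reduce to a small lookup. The sketch below is one hypothetical way to encode them: the task labels come from the lists, the `latency_sensitive` flag is an illustrative addition, and the lighter model ID is a placeholder because this article does not give one.

```python
FABLE = "claude-fable-5"      # model ID from the specifications table
LIGHTER = "lighter-model-id"  # placeholder for Opus 4.8 or Sonnet; real ID not given here

# Task categories taken from the "Use Fable 5 for" list above.
HEAVY_TASKS = {
    "multi-day agentic workflow",
    "large codebase migration",
    "complex analytical pipeline",
    "dense pdf analysis",
    "design-to-code",
}


def pick_model(task_type: str, latency_sensitive: bool = False) -> str:
    """Send heavy, long-running work to Fable 5 and everything else to a lighter model."""
    if task_type.lower() in HEAVY_TASKS and not latency_sensitive:
        return FABLE
    return LIGHTER


print(pick_model("Large codebase migration"))
print(pick_model("summarization"))
```

Defaulting to the lighter model keeps unknown or routine task types on the cheaper, faster tier, so the expensive tier is reached only when a task is explicitly listed as heavy.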

Conclusion

Anthropic spent 2 months gating the dangerous parts, then opened the door to everything else. That’s either ironic or exactly what responsible scaling looks like.

Maybe both.
