
What Devin Fusion is, in one line
Before the review, the thing itself. Devin is Cognition's autonomous AI software engineer, the product you delegate whole tickets to instead of autocompleting line by line. Fusion is a change to how Devin runs: instead of pointing one expensive model at every step, it runs two models at once and routes work between them. Cognition announced it on June 29, 2026 and shipped it in preview inside Devin the same day.

Cognition's framing is characteristically blunt: "Engineering teams are lighting money on fire. It's no longer sustainable to use the most expensive models on every task." The line that stuck with me: "You wouldn't drive a Lamborghini to the grocery store, so why should you take a model that can discover zero-day vulnerabilities and use it to round the corner of a button?" Fusion is the productized answer to that, and it lands on the back of a big year, Cognition raised over $1B at a $26B valuation in May and folded the old Windsurf IDE into the line as "Devin Desktop."
Here is my scorecard for the rest of this review, so you know where I land before the details:

How the sidekick actually works
The core mechanism is what Cognition calls the "sidekick" approach, and it is worth understanding because it is smarter than the naive model-routing most tools ship.
Two fully capable agents run in parallel. A main agent on a frontier model (think Opus 4.8 or GPT-5.5) and a smaller, cheaper sidekick agent, each keeping its own persistent, cached context. The main agent, per Cognition's tuned pattern, "should take minimal actions... By default it should delegate and monitor, while making the significant decisions: the plan, the interpretation of ambiguity, the final review." The sidekick does the grunt work, code exploration, broad edits, writing tests, fixing lint.

Why not just let a frontier model "ask" a cheaper one for help, the way earlier tools did? Cache misses. When a frontier agent queries a separate advisor model, it re-sends its whole context at full price every time. Fusion sidesteps that: both agents keep their own cached contexts, so delegating doesn't trigger a costly re-send. The second technique is dynamic mid-session routing, lightweight classifiers run during execution and can escalate a struggling sidekick task back to the main agent, and the switch happens during context compaction (which triggers a cache miss anyway), so changing models is effectively free. It is the same agentic reasoning loop idea behind modern agents, applied to which model runs each turn. As an engineer, this is the part I respect most; it is a real systems answer, not a marketing reframe.
The 35% claim, tested against the caveats
Now the number everyone screenshots. Cognition benchmarked Fusion on FrontierCode, a new eval it built with 20-plus open-source maintainers that measures whether code is actually mergeable, not just whether it passes a test. Here is the headline slice (FrontierCode Extended, score versus average cost per task):
| Configuration | Score | Avg cost/task |
|---|---|---|
| Fusion + Fable 5 | 57.6 | $3.00 |
| Fable 5 (medium) | 57.0 | $5.12 |
| Opus 4.8 (high) | 48.8 | $3.24 |
| Devin Fusion | 47.9 | $2.38 |
| GPT-5.5 (high) | 44.8 | $3.64 |
| GLM-5.2 | 43.0 | $2.70 |
The story: Fusion scores 47.9 at $2.38 per task, roughly matching Opus 4.8's 48.8 while costing about a third less. Cognition rounds that to a 35% cost improvement "while maintaining performance matching the frontier."

Two caveats before you trust that chart. First, this is a vendor benchmark on a vendor-built eval, which is a fine signal but not the same as independent testing. Second, the flashier "41% cheaper" figure needs Anthropic's Fable 5, and access to Fable 5 was suspended on June 12, 2026 under a US government directive, so those numbers were measured before the cutoff and aren't reproducible today. The live number is the 35% one. Cognition also says 88% of its own merged pull requests were driven entirely by the Fusion router after turning it on, which is a real signal, but it is Cognition dogfooding on Cognition's codebase, about the friendliest test environment there is.
The most honest, and to me most useful, part of the announcement was Cognition publishing the tasks where the sidekick hurt. Modernizing a JS file to ES6 came in 62% cheaper with quality holding. Ripping a deprecated library out of a Go codebase ran 32% cheaper. But on a hard front-end feature where the judgment was the deliverable, delegating tanked the quality score from 54 to 27. Their own summary: "When the judgment is the deliverable, delegating it backfires." That is the line I would attach to the whole product.
Where the review gets less flattering: reliability and proof
Fusion targets cost. It does not target the two complaints that have followed Devin for two years, and a fair review has to say so plainly.
The first is reliability. The most common thing actual users report is that autonomy is oversold and the reality is a correction loop:
"The promise was full autonomy, but the reality still involves a lot of babysitting. You give it a task, it goes off the rails, you correct it, it sort of gets back on track. Rinse and repeat."
The sharpest first-hand account I found is a G2 review from a test-automation engineer who rated Devin 5/5 overall but was candid about the drift: "Once the ACU consumption hits around 40 or 50, Devin really starts to lose the plot. It begins ignoring the initial instructions... It feels like the model gets tired." The same reviewer flagged scope creep, "it decided to refactor our core pre-built methods... even though it was only supposed to write a simple test script", and still loved it for parallel work: "I can have five different sessions running in parallel." That two-sided read is the fair one, and cheaper tokens don't obviously fix any of the negatives.
The second gap is proof. Fusion is days old, so the community reaction to Fusion specifically is thin and mostly positive launch commentary, on r/AIDeveloperNews the read was that "the architecture is actually pretty clever." That is encouraging, but "clever architecture" and "reliable in my repo for six weeks" are different claims, and only one of them is testable right now.
What real users actually say about Devin
Zoom out from Fusion and Devin carries a lot of baggage, some of it now flipped into a comeback story. The durable legacy critique is the March 2024 independent test where Devin completed 3 of 20 tasks, which the internet branded a fake demo. In 2026 that line mostly shows up approvingly:
"In March 2024, independent testers said Devin completed 3 of 20 tasks. The internet called it a fake demo. Two years later, that product codes for the US Army."
There is also a real thread of brand skepticism worth hearing, because it is the counterweight to the hype:
"Devin? Now that's a name I've not heard in a long time... in this age of Claude Code and Codex, does anyone use Devin, or even know someone who does?"
And genuine praise from people who found the fit, especially for Devin's review tooling:
"Have been using Devin Review for a little bit, and I think it's the first of the many 'code review' LLM-bots that... doesn't actively feel like 'slop.' My favorite feature has been organizing the files by 'logical flow' rather than alphabetically."

Devin pricing: what you'll actually pay
Fusion rolls out inside Devin, so the pricing you hit is Devin's. Here is the current Devin pricing:
| Plan | Price | What you get |
|---|---|---|
| Free | $0 | Light quota, limited models, unlimited inline edits and tab completions |
| Pro | $20/mo | Frontier models (OpenAI, Claude, Gemini), cloud agents, free SWE-1.6, overage at API pricing |
| Max | $200/mo | Everything in Pro with much higher quotas |
| Teams | $80/mo + $40/seat | Unlimited members, centralized billing, admin dashboard, priority support |
| Enterprise | Custom | SSO, VPC deploy, dedicated support |
One nuance that trips people up: Devin used to bill self-serve plans in opaque "ACUs" (Agent Compute Units), the metering behind most of the Hacker News pricing complaints. As of March 2026, self-serve moved to a token-based quota model, and ACUs are now an enterprise-only meter with no published public dollar rate. If you are comparing costs, my Cognition AI pricing guide walks the history, and it is worth reading before you assume a per-ACU number you saw online still holds.
Who should use Devin Fusion, and who should skip
Here is where I land as a reviewer, split cleanly.

Reach for it if you run a lot of mechanical, pattern-following work, refactors, dependency swaps, migrations, test scaffolding, and your frontier-token bill is climbing. That is the exact shape of task where the sidekick wins in Cognition's own data, and where the 35% is most believable. If you are already inside the Devin ecosystem, turning Fusion on is a low-risk experiment.
Think twice if your typical task is judgment-heavy feature design (Cognition's own numbers show delegation backfiring there), if you were burned by Devin drifting on long autonomous sessions before, or if you need proven, independently tested reliability before you trust an agent in production. In those cases the smart move is to wait a few weeks for real-world testing, and in the meantime weigh it against Cursor, Windsurf, and OpenAI Codex alternatives using my AI coding assistant tools guide.
The lesson if you don't write code
Here is the part I care about most, because Fusion's core idea reaches well past coding. "The age of using one model for everything is coming to an end" is true everywhere agents do real work, including customer support. A password-reset FAQ and a nuanced billing dispute do not need the same model, and paying frontier prices for the easy 80% is the "money on fire" problem Cognition describes, just in a different queue.
The trap is that most support-AI vendors hide this. They meter raw model usage, or charge per resolution and quietly route everything to whatever is cheapest to protect margin, the deflection-rate-as-vanity-metric game. The better model is the one Fusion gestures at: right-size the model to the task, and let the buyer pay for the outcome, not the tokens. That is the same cost logic I use when I think about agents anywhere.
Try eesel
I work on eesel AI, and this is exactly the problem we build around, just for support and internal teams instead of pull requests. eesel is an AI teammate that plugs into your existing helpdesk, learns from your past tickets and help docs, and handles tier-1 work the way Fusion handles mechanical coding: the routine stuff gets resolved automatically, the genuinely hard, judgment-heavy tickets get escalated to a human with full context. Same sidekick principle, different queue.

Two things make the analogy hold. First, you can simulate on your historical tickets before going live, so you see the resolution rate and cost on your own data instead of trusting a vendor benchmark, which is exactly the independent test Fusion doesn't have yet. Second, pricing is usage-based at about $0.40 per resolved ticket with no per-seat fees, so you pay for the outcome, not for a big model idling on easy questions. You can try eesel free, no sales call.
Frequently Asked Questions
Is Devin Fusion worth it?
Is Devin Fusion actually 35% cheaper?
How much does Devin cost?
How is Devin Fusion different from Cursor or Codex?
Does Devin Fusion fix Devin's reliability problems?

Article by
Rama Adi Nugraha
Rama is a software engineer at eesel AI with two years of experience writing about B2B SaaS, AI tools, and customer support technology. Based in Bali, Indonesia, he brings a developer's perspective to product comparisons — cutting through marketing copy to what the integrations and APIs actually do.








