
Devin Fusion review: is Cognition's cheaper harness worth it?

Written by

Rama Adi Nugraha

Reviewed by

Katelin Teen

Last edited July 2, 2026


TL;DR

Devin Fusion is the most interesting thing Cognition has shipped in a while, and also the hardest to score fairly. It is a multi-model "harness" that runs a frontier "main agent" and a cheap "sidekick" model side by side, hands the mechanical work to the sidekick, and switches models mid-task. Cognition's headline: frontier-level coding quality at about 35% lower cost on its own benchmark.

Here is my review in one line: the architecture is genuinely clever and the cost thesis is right, but the numbers are a vendor benchmark and Devin's old reliability problems aren't in Fusion's job description. If your work is mechanical and high-volume, this is worth trying today. If your work is judgment-heavy, or you got burned by Devin drifting on a long task before, wait for someone independent to test it.

I build AI agents for a living, so the part I care about isn't the coding demo. It is the idea underneath: stop paying frontier prices for the easy 80% of the work. That lesson travels well past pull requests, and it is the reason I think Fusion matters even if you never open an IDE. If you want the deeper mechanics, I wrote a full Devin Fusion explainer too; this piece is the verdict.

What Devin Fusion is, in one line

Before the review, the thing itself. Devin is Cognition's autonomous AI software engineer, the product you delegate whole tickets to instead of autocompleting line by line. Fusion is a change to how Devin runs: instead of pointing one expensive model at every step, it runs two models at once and routes work between them. Cognition announced it on June 29, 2026 and shipped it in preview inside Devin the same day.

The Devin session interface showing an autonomous coding task with open pull requests and a generated test report, as taken from Devin

Cognition's framing is characteristically blunt: "Engineering teams are lighting money on fire. It's no longer sustainable to use the most expensive models on every task." The line that stuck with me: "You wouldn't drive a Lamborghini to the grocery store, so why should you take a model that can discover zero-day vulnerabilities and use it to round the corner of a button?" Fusion is the productized answer to that, and it lands on the back of a big year: Cognition raised over $1B at a $26B valuation in May and folded the old Windsurf IDE into the line as "Devin Desktop."

Here is my scorecard for the rest of this review, so you know where I land before the details:

My Devin Fusion review scorecard: strong on cost efficiency, clever architecture, improved pricing clarity, but shaky on long-task reliability and unverified on independent proof

How the sidekick actually works

The core mechanism is what Cognition calls the "sidekick" approach, and it is worth understanding because it is smarter than the naive model-routing most tools ship.

Two fully capable agents run in parallel: a main agent on a frontier model (think Opus 4.8 or GPT-5.5) and a smaller, cheaper sidekick agent, each keeping its own persistent, cached context. The main agent, per Cognition's tuned pattern, "should take minimal actions... By default it should delegate and monitor, while making the significant decisions: the plan, the interpretation of ambiguity, the final review." The sidekick does the grunt work: code exploration, broad edits, writing tests, and fixing lint.
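To make that division of labor concrete, here is a minimal sketch of it in Python. Everything named here (the step kinds, the `Step` class, the `route` function) is a hypothetical illustration of the roles quoted above, not Cognition's API or its actual routing logic.

```python
from dataclasses import dataclass

# Hypothetical step kinds; the split mirrors the roles described in the text.
MAIN_AGENT_KINDS = {"plan", "resolve_ambiguity", "final_review"}
SIDEKICK_KINDS = {"explore_code", "broad_edit", "write_tests", "fix_lint"}


@dataclass(frozen=True)
class Step:
    kind: str
    description: str


def route(step: Step) -> str:
    """Return which agent handles a step: 'main' or 'sidekick'."""
    if step.kind in MAIN_AGENT_KINDS:
        return "main"
    if step.kind in SIDEKICK_KINDS:
        return "sidekick"
    # Assumption of this sketch: unrecognized work is treated as a
    # judgment call and stays with the main agent.
    return "main"


steps = [
    Step("plan", "Decide how to migrate the logging module"),
    Step("explore_code", "Find every call site of the old logger"),
    Step("write_tests", "Add regression tests for log formatting"),
    Step("final_review", "Review the combined diff"),
]
assignments = [(step.kind, route(step)) for step in steps]
```

A static lookup like this is the naive version. What the announcement describes is more dynamic, with the main agent delegating and monitoring rather than reading from a fixed table.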

Cognition's sidekick architecture: a frontier main agent handling the plan, review, and final code while a sidekick agent explores code, writes tests, and fixes bugs in parallel, as taken from Cognition

Why not just let a frontier model "ask" a cheaper one for help, the way earlier tools did? Cache misses. When a frontier agent queries a separate advisor model, it re-sends its whole context at full price every time. Fusion sidesteps that: both agents keep their own cached contexts, so delegating doesn't trigger a costly re-send. The second technique is dynamic mid-session routing. Lightweight classifiers run during execution and can escalate a struggling sidekick task back to the main agent. The switch happens during context compaction, which triggers a cache miss anyway, so changing models is effectively free. It is the same agentic reasoning loop idea behind modern agents, applied to which model runs each turn. As an engineer, this is the part I respect most; it is a real systems answer, not a marketing reframe.
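A toy cost model shows why the cache argument matters. This is a sketch under loud assumptions: the token count, the price per million tokens, and the cached-read discount are made-up illustration values, the context is held at a fixed size, cache-write surcharges are ignored, and both models are priced the same. None of these numbers come from Cognition.

```python
def uncached_input_cost(context_tokens: int, price_per_mtok: float) -> float:
    """Input cost of sending the whole context with no cache hit."""
    return context_tokens * price_per_mtok / 1_000_000


def advisor_pattern_cost(context_tokens: int, queries: int,
                         price_per_mtok: float) -> float:
    """Every advisor query re-sends the full context at full price."""
    return queries * uncached_input_cost(context_tokens, price_per_mtok)


def cached_pattern_cost(context_tokens: int, queries: int,
                        price_per_mtok: float, cached_discount: float) -> float:
    """First turn pays full price; later turns read the context from cache."""
    if queries < 1:
        return 0.0
    full = uncached_input_cost(context_tokens, price_per_mtok)
    return full + (queries - 1) * full * cached_discount


def switch_penalty(context_tokens: int, price_per_mtok: float,
                   cached_discount: float, at_compaction: bool) -> float:
    """Extra input cost caused by changing models on a given turn."""
    if at_compaction:
        # Compaction rewrites the context, so this turn is a cache miss
        # whether or not the model changes.
        return 0.0
    full = uncached_input_cost(context_tokens, price_per_mtok)
    return full * (1 - cached_discount)


# Hypothetical illustration values, not vendor pricing.
CONTEXT_TOKENS = 100_000
QUERIES = 10
PRICE_PER_MTOK = 5.0
CACHED_DISCOUNT = 0.1  # cached reads billed at 10% of the full price

advisor_total = advisor_pattern_cost(CONTEXT_TOKENS, QUERIES, PRICE_PER_MTOK)
cached_total = cached_pattern_cost(CONTEXT_TOKENS, QUERIES, PRICE_PER_MTOK,
                                   CACHED_DISCOUNT)
mid_session_switch = switch_penalty(CONTEXT_TOKENS, PRICE_PER_MTOK,
                                    CACHED_DISCOUNT, at_compaction=False)
compaction_switch = switch_penalty(CONTEXT_TOKENS, PRICE_PER_MTOK,
                                   CACHED_DISCOUNT, at_compaction=True)
```

With these illustrative inputs, the re-send pattern costs several times more in input tokens than the cached one, and a model switch only carries a penalty when it happens away from a compaction point.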

The 35% claim, tested against the caveats

Now the number everyone screenshots. Cognition benchmarked Fusion on FrontierCode, a new eval it built with 20-plus open-source maintainers that measures whether code is actually mergeable, not just whether it passes a test. Here is the headline slice (FrontierCode Extended, score versus average cost per task):

Configuration	Score	Avg cost/task
Fusion + Fable 5	57.6	$3.00
Fable 5 (medium)	57.0	$5.12
Opus 4.8 (high)	48.8	$3.24
Devin Fusion	47.9	$2.38
GPT-5.5 (high)	44.8	$3.64
GLM-5.2	43.0	$2.70

The story: Fusion scores 47.9 at $2.38 per task, roughly matching Opus 4.8's 48.8 while costing about 27% less by the table's own figures ($2.38 versus $3.24). Cognition's headline is a 35% cost improvement "while maintaining performance matching the frontier," which is closer to the gap against GPT-5.5 (high) at $3.64.
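The percentages are easy to check against the table. The snippet below only does arithmetic on the published cost column; the `percent_cheaper` helper is my own name, and I am not asserting which baseline Cognition used for its headline figure.

```python
# Average cost per task, copied from the FrontierCode Extended table above.
cost_per_task = {
    "Fusion + Fable 5": 3.00,
    "Fable 5 (medium)": 5.12,
    "Opus 4.8 (high)": 3.24,
    "Devin Fusion": 2.38,
    "GPT-5.5 (high)": 3.64,
    "GLM-5.2": 2.70,
}


def percent_cheaper(cost: float, baseline: float) -> float:
    """How much cheaper `cost` is than `baseline`, as a percentage."""
    return (baseline - cost) / baseline * 100


fusion = cost_per_task["Devin Fusion"]
vs_opus = percent_cheaper(fusion, cost_per_task["Opus 4.8 (high)"])
vs_gpt = percent_cheaper(fusion, cost_per_task["GPT-5.5 (high)"])
fable_pair = percent_cheaper(cost_per_task["Fusion + Fable 5"],
                             cost_per_task["Fable 5 (medium)"])

print(f"Fusion vs Opus 4.8: {vs_opus:.1f}% cheaper")
print(f"Fusion vs GPT-5.5: {vs_gpt:.1f}% cheaper")
print(f"Fusion + Fable 5 vs Fable 5: {fable_pair:.1f}% cheaper")
```

On these figures, the gap to Opus 4.8 is about 26.5%, the gap to GPT-5.5 is about 34.6%, and the Fable 5 pairing is about 41.4% cheaper than Fable 5 alone.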

Bar chart comparing average cost per coding task: Devin Fusion at $2.38 versus Opus 4.8 at $3.24, GPT-5.5 at $3.64, and Fable 5 at $5.12

Two caveats before you trust that chart. First, this is a vendor benchmark on a vendor-built eval, which is a fine signal but not the same as independent testing. Second, the flashier "41% cheaper" figure needs Anthropic's Fable 5, and access to Fable 5 was suspended on June 12, 2026 under a US government directive, so those numbers were measured before the cutoff and aren't reproducible today. The live number is the 35% one. Cognition also says 88% of its own merged pull requests were driven entirely by the Fusion router after turning it on, which is a real signal, but it is Cognition dogfooding on Cognition's codebase, about the friendliest test environment there is.

The most honest, and to me most useful, part of the announcement was Cognition publishing the tasks where the sidekick hurt. Modernizing a JS file to ES6 came in 62% cheaper with quality holding. Ripping a deprecated library out of a Go codebase ran 32% cheaper. But on a hard front-end feature where the judgment was the deliverable, delegating tanked the quality score from 54 to 27. Their own summary: "When the judgment is the deliverable, delegating it backfires." That is the line I would attach to the whole product.

Where the review gets less flattering: reliability and proof

Fusion targets cost. It does not target the two complaints that have followed Devin for two years, and a fair review has to say so plainly.

The first is reliability. The most common thing actual users report is that autonomy is oversold and the reality is a correction loop:

"The promise was full autonomy, but the reality still involves a lot of babysitting. You give it a task, it goes off the rails, you correct it, it sort of gets back on track. Rinse and repeat."
r/ChatGPTCoding

The sharpest first-hand account I found is a G2 review from a test-automation engineer who rated Devin 5/5 overall but was candid about the drift: "Once the ACU consumption hits around 40 or 50, Devin really starts to lose the plot. It begins ignoring the initial instructions... It feels like the model gets tired." The same reviewer flagged scope creep ("it decided to refactor our core pre-built methods... even though it was only supposed to write a simple test script") and still loved it for parallel work: "I can have five different sessions running in parallel." That two-sided read is the fair one, and cheaper tokens don't obviously fix any of the negatives.

The second gap is proof. Fusion is days old, so the community reaction to Fusion specifically is thin and mostly positive launch commentary. On r/AIDeveloperNews, the read was that "the architecture is actually pretty clever." That is encouraging, but "clever architecture" and "reliable in my repo for six weeks" are different claims, and only one of them is testable right now.

What real users actually say about Devin

Zoom out from Fusion and Devin carries a lot of baggage, some of it now flipped into a comeback story. The durable legacy critique is the March 2024 independent test where Devin completed 3 of 20 tasks, which the internet branded a fake demo. In 2026 that line mostly shows up approvingly:

"In March 2024, independent testers said Devin completed 3 of 20 tasks. The internet called it a fake demo. Two years later, that product codes for the US Army."
@aakashgupta on X

There is also a real thread of brand skepticism worth hearing, because it is the counterweight to the hype:

"Devin? Now that's a name I've not heard in a long time... in this age of Claude Code and Codex, does anyone use Devin, or even know someone who does?"
libraryofbabel on Hacker News

And genuine praise from people who found the fit, especially for Devin's review tooling:

"Have been using Devin Review for a little bit, and I think it's the first of the many 'code review' LLM-bots that... doesn't actively feel like 'slop.' My favorite feature has been organizing the files by 'logical flow' rather than alphabetically."
samyok on Hacker News

Devin Review organizing a code diff and flagging potential bugs across changed files, as taken from Devin

Devin pricing: what you'll actually pay

Fusion rolls out inside Devin, so the pricing you hit is Devin's. Here is the current Devin pricing:

Plan	Price	What you get
Free	$0	Light quota, limited models, unlimited inline edits and tab completions
Pro	$20/mo	Frontier models (OpenAI, Claude, Gemini), cloud agents, free SWE-1.6, overage at API pricing
Max	$200/mo	Everything in Pro with much higher quotas
Teams	$80/mo + $40/seat	Unlimited members, centralized billing, admin dashboard, priority support
Enterprise	Custom	SSO, VPC deploy, dedicated support
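For budgeting, the Teams row reduces to a one-line formula. This is a sketch that assumes the $80 base fee includes no seats and that every developer is billed as one seat; the pricing page is the authority on both points, and overage at API pricing is left out.

```python
TEAMS_BASE_USD = 80  # flat monthly fee from the table
TEAMS_SEAT_USD = 40  # per-seat monthly fee from the table


def teams_monthly_cost(seats: int) -> int:
    """Monthly Teams bill, assuming the base fee includes no seats."""
    if seats < 0:
        raise ValueError("seats must be non-negative")
    return TEAMS_BASE_USD + TEAMS_SEAT_USD * seats


five_dev_team = teams_monthly_cost(5)  # 80 + 40 * 5 = 280
```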

One nuance that trips people up: Devin used to bill self-serve plans in opaque "ACUs" (Agent Compute Units), the metering behind most of the Hacker News pricing complaints. As of March 2026, self-serve moved to a token-based quota model, and ACUs are now an enterprise-only meter with no published public dollar rate. If you are comparing costs, my Cognition AI pricing guide walks the history, and it is worth reading before you assume a per-ACU number you saw online still holds.

Who should use Devin Fusion, and who should skip

Here is where I land as a reviewer, split cleanly.

Who Devin Fusion is for: reach for it on big mechanical refactors, migrations, and heavy token budgets; think twice on judgment-heavy feature design, when you need proven reliability, or on very long autonomous runs

Reach for it if you run a lot of mechanical, pattern-following work (refactors, dependency swaps, migrations, test scaffolding) and your frontier-token bill is climbing. That is the exact shape of task where the sidekick wins in Cognition's own data, and where the 35% is most believable. If you are already inside the Devin ecosystem, turning Fusion on is a low-risk experiment.

Think twice if your typical task is judgment-heavy feature design (Cognition's own numbers show delegation backfiring there), if you were burned by Devin drifting on long autonomous sessions before, or if you need proven, independently tested reliability before you trust an agent in production. In those cases the smart move is to wait a few weeks for real-world testing, and in the meantime weigh it against Cursor, Windsurf, and OpenAI Codex alternatives using my AI coding assistant tools guide.

The lesson if you don't write code

Here is the part I care about most, because Fusion's core idea reaches well past coding. "The age of using one model for everything is coming to an end" is true everywhere agents do real work, including customer support. A password-reset FAQ and a nuanced billing dispute do not need the same model, and paying frontier prices for the easy 80% is the "money on fire" problem Cognition describes, just in a different queue.

The trap is that most support-AI vendors hide this. They meter raw model usage, or charge per resolution and quietly route everything to whatever is cheapest to protect margin, which is the deflection-rate-as-vanity-metric game. The better model is the one Fusion gestures at: right-size the model to the task, and let the buyer pay for the outcome, not the tokens. That is the same cost logic I use when I think about agents anywhere.

Try eesel

I work on eesel AI, and this is exactly the problem we build around, just for support and internal teams instead of pull requests. eesel is an AI teammate that plugs into your existing helpdesk, learns from your past tickets and help docs, and handles tier-1 work the way Fusion handles mechanical coding: the routine stuff gets resolved automatically, while the genuinely hard, judgment-heavy tickets get escalated to a human with full context. Same sidekick principle, different queue.

The eesel AI reports dashboard showing resolution and analytics across support tickets

Two things make the analogy hold. First, you can simulate on your historical tickets before going live, so you see the resolution rate and cost on your own data instead of trusting a vendor benchmark, which is exactly the independent test Fusion doesn't have yet. Second, pricing is usage-based at about $0.40 per resolved ticket with no per-seat fees, so you pay for the outcome, not for a big model idling on easy questions. You can try eesel free, no sales call.

Frequently Asked Questions

Is Devin Fusion worth it?

If your work is mechanical and high-volume (refactors, migrations, dependency swaps), Devin Fusion is worth trying, because that is exactly where its cheaper sidekick model shines. If your work is judgment-heavy feature design, or you were already burned by Devin drifting on long tasks, wait for independent testing. My full Cognition AI reviews roundup has the wider sentiment.

Is Devin Fusion actually 35% cheaper?

The 35% figure is from Cognition's own FrontierCode benchmark, where Fusion roughly matched Opus 4.8's score at a lower cost per task ($2.38 versus $3.24 in the published table). It is a vendor benchmark on a vendor-built eval, so treat it as a strong signal, not independently confirmed fact.

How much does Devin cost?

Devin has a free tier, Pro at $20/month, Max at $200/month, and Teams at $80/month plus $40 per developer seat, per the Devin pricing page. Enterprise is quote-based. My Cognition AI pricing guide breaks down the tiers and the ACU history.

How is Devin Fusion different from Cursor or Codex?

Fusion is a multi-model harness, not an IDE. It runs a frontier model and a cheap sidekick side by side inside Devin, where Cursor and Codex mostly pick one model per task. If you are shopping around, my AI coding assistant tools guide compares the field.

Does Devin Fusion fix Devin's reliability problems?

Not directly. Fusion targets cost, not the long-standing complaint that Devin drifts off long tasks and needs babysitting. Cheaper tokens do not make an AI agent more reliable, and that is still the thing seasoned users flag first.


Article by

Rama Adi Nugraha

Rama is a software engineer at eesel AI with two years of experience writing about B2B SaaS, AI tools, and customer support technology. Based in Bali, Indonesia, he brings a developer's perspective to product comparisons — cutting through marketing copy to what the integrations and APIs actually do.