Diffusion-based AI models explained: how they work and why they're suddenly fast

Written by Kira

Reviewed by Katelin Teen

Last edited June 16, 2026

Expert Verified
Illustration of scattered noise and masked blocks resolving into clean lines of text, with a stopwatch signalling speed

What is a diffusion-based AI model?

A diffusion model is a generative model that learns to build data by reversing a gradual noising process. The idea comes from nonequilibrium thermodynamics: you define a chain of steps that slowly adds random noise to real data, then train a network to reverse that chain and reconstruct samples from the noise. The foundational works are Sohl-Dickstein et al. (2015) and the 2020 paper by Ho et al. on denoising diffusion probabilistic models (DDPM).

There are two halves. In the forward process, you take a real image and add a little Gaussian noise over and over until it becomes pure static. That part needs no learning; its only job is to manufacture training pairs. In the reverse process, a neural network learns to undo one step of noise at a time. At generation time you start from random noise and run the network repeatedly, each pass stripping away a bit more until a coherent result emerges.
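The forward half can be sketched in a few lines. This is a minimal illustration, not a training pipeline: it assumes the linear noise schedule and closed-form noising step used in the DDPM paper, and the four-value "image", the function names, and the fixed seed are all made up for the example.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear noise schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_sample(x0, alpha_bar, rng):
    """Jump straight to a noised version of x0. Returns (x_t, noise)."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bar)        # how much of the original survives
    noise_scale = math.sqrt(1.0 - alpha_bar)   # how much Gaussian noise is mixed in
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]
    return x_t, noise

rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
x0 = [0.5, -0.25, 1.0, 0.0]   # a toy "image" of four pixel values

# Early step: almost all signal. Last step: almost pure static.
x_early, _ = noise_sample(x0, alpha_bars[0], rng)
x_late, target_noise = noise_sample(x0, alpha_bars[-1], rng)
```

Each `(x_t, noise)` pair is one training example: in the DDPM setup the network sees the noised sample and the step index and is trained to predict the noise that was added, which is what lets it undo one step at a time at generation.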

Here is the intuition that makes it click. Imagine filming an ice sculpture melting into a puddle, then running the film backwards: starting from a shapeless puddle and, frame by frame, refreezing it into the sculpture. Because the model works on the whole canvas at every step, it can keep fixing earlier mistakes as it goes.

This is the technique that powers most modern image, video, and audio generation. Diffusion sits behind Sora, Midjourney, and Riffusion, along with DALL-E 2, Imagen, and Stable Diffusion. The throughline: they all start from noise and iteratively denoise toward a result, guided by your prompt.

How autoregressive LLMs generate text

To see why diffusion is a big deal for text, you need the contrast. Almost every large language model you have used, including ChatGPT, Claude, Gemini, and Llama, is an autoregressive model. It generates text left to right, one token at a time, and a token cannot be produced until everything before it exists.

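Two consequences fall out of that design, and both matter for the comparison:

  • Generation is sequential. Each token has to wait for the one before it, so the time to produce a response grows with its length.
  • Emitted tokens are final. Once a token is written, the model cannot go back and revise it; it can only keep appending.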

The upside is that variable-length output is easy: the model just emits an end-of-sequence token whenever it is done. That flexibility is one reason autoregression has stayed dominant for text.
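The left-to-right loop described above can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TOY_NEXT_TOKEN` is a lookup table rather than a trained network, and a real model would condition on the whole prefix, not just the last token.

```python
EOS = "<eos>"

# Hypothetical stand-in for a trained model: maps the last token to the next one.
TOY_NEXT_TOKEN = {
    "<bos>": "diffusion",
    "diffusion": "models",
    "models": "denoise",
    "denoise": EOS,
}

def next_token(prefix):
    """Pick the next token given everything generated so far (toy lookup)."""
    return TOY_NEXT_TOKEN.get(prefix[-1], EOS)

def generate_autoregressive(max_tokens=16):
    tokens = ["<bos>"]
    model_calls = 0
    while len(tokens) - 1 < max_tokens:
        token = next_token(tokens)   # one model call per new token
        model_calls += 1
        if token == EOS:             # the model decides when to stop
            break
        tokens.append(token)         # appended tokens are never revisited
    return tokens[1:], model_calls

output, calls = generate_autoregressive()
```

The loop makes both properties visible: the number of model calls tracks the output length, and stopping is just another token the model can emit.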

How diffusion language models generate text differently

Diffusion language models (dLLMs) port the image recipe to text. Instead of pixels-from-noise, they do tokens-from-masks. Google DeepMind describes it plainly: rather than predicting text directly, the model learns to generate outputs by refining noise step by step, so it can iterate on a solution quickly and error-correct during generation.

How a diffusion language model writes text: starting from all-masked placeholders, locking in confident words, refining the rest in parallel, and arriving at a final answer

The dominant approach for text is masked diffusion. In LLaDA, an 8B open diffusion model, the forward process masks tokens and the reverse process uses a transformer "mask predictor" to fill in all the masked tokens at once, simulating diffusion from fully masked back to fully written. An earlier line, Diffusion-LM, used continuous diffusion over word vectors instead.

The headline difference is parallel decoding. A dLLM generates tokens in parallel rather than one at a time, and the underlying transformer can modify multiple tokens at once to globally improve the answer. Because the formulation is non-autoregressive, it also allows any-order generation: the model can lock in the words it is confident about anywhere in the sequence first, then fill in the rest.
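A toy version of that decoding loop, as a sketch only. It assumes a confidence-ranked unmasking rule, and `TOY_GUESSES` is a hypothetical stand-in for a trained mask predictor. Some samplers also re-mask low-confidence tokens so they can be revised later; that step is left out here for brevity.

```python
MASK = "[MASK]"

# Hypothetical stand-in for a trained mask predictor: a fixed guess and a
# fixed confidence per position. A real model would recompute both on every
# pass from the full, partly unmasked sequence.
TOY_GUESSES = [
    ("the", 0.90), ("cat", 0.60), ("sat", 0.40),
    ("on", 0.95), ("the", 0.85), ("mat", 0.70),
]

def predict_masked(sequence):
    """Return {position: (token, confidence)} for every masked position."""
    return {i: TOY_GUESSES[i] for i, tok in enumerate(sequence) if tok == MASK}

def generate_masked_diffusion(length, num_steps):
    sequence = [MASK] * length
    per_step = -(-length // num_steps)      # ceiling division
    history = []
    for _ in range(num_steps):
        guesses = predict_masked(sequence)  # one pass scores all masks at once
        if not guesses:
            break
        # Lock in the most confident guesses, wherever they sit in the sequence.
        ranked = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:per_step]:
            sequence[i] = guesses[i][0]
        history.append(list(sequence))
    return sequence, history

final, history = generate_masked_diffusion(length=6, num_steps=3)
```

Six tokens are written in three passes instead of six, and the first pass fills positions 0 and 3 rather than working left to right, which is the any-order behaviour in miniature.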

One of the clearest explanations actually came from a developer on Hacker News, cutting through the "diffusion replaces transformers" confusion:

"Despite the name, diffusion LMs have little to do with image diffusion and are much closer to BERT and old good masked language modeling... in order to generate something from scratch, you start by feeding the model all [MASK]s... in 10 steps you'll have generated a whole sequence."
nvtop, in the Gemini Diffusion discussion on Hacker News

That parallel, bidirectional view is also why a diffusion model can see context on both sides of a gap. LLaDA, for instance, beats GPT-4o on a reversal-poem-completion task, overcoming the reversal curse that trips up left-to-right models.

Autoregressive vs diffusion: the core difference

If you remember one picture from this post, make it this one. Autoregressive models build a sentence like a relay race, each word handing off to the next. Diffusion models build it like developing a Polaroid, the whole image surfacing at once and sharpening with each pass.

Comparison of autoregressive generation, where words are produced one at a time in sequence, versus diffusion generation, where the whole sequence is refined in parallel

Here is how the two stack up on the dimensions a buyer actually cares about:

| Dimension | Autoregressive (GPT, Claude, Gemini) | Diffusion (Mercury, Gemini Diffusion) |
| --- | --- | --- |
| Generation order | Left to right, one token at a time | Whole sequence in parallel, any order |
| Speed | Tens to ~200 tokens/sec | ~1,000 to ~1,500 tokens/sec |
| Can revise earlier tokens? | No, once emitted it is fixed | Yes, across denoising passes |
| Editing and infilling | Awkward (append-only) | Natural (conditions on both sides) |
| Hard reasoning | Stronger today | Trails, especially at frontier scale |
| Long context | More efficient (reuses KV cache) | Weaker (recomputes attention each pass) |
| Output length | Variable, flexible | Often fixed-length blocks |
| Ecosystem maturity | Five years of tooling | Early, fast-moving |

Note the symmetry: diffusion's wins (speed, revision, infilling) and its losses (reasoning depth, long context, maturity) both trace back to the same root cause. Working on the whole sequence in parallel is what makes it fast and editable, and also what makes long context and step-by-step reasoning harder.

The speed payoff, and the catch

The speed numbers are genuinely striking, and they are not all marketing. Developer and LLM blogger Simon Willison got off the Gemini Diffusion waitlist and tried it:

"The key feature then is speed. I made it through the waitlist and tried it out just now and wow, they are not kidding about it being fast."
Simon Willison, first impressions of Gemini Diffusion

Here is how throughput compares across a few models, with the autoregressive baselines for context:

| Model | Type | Throughput (tokens/sec) | Source |
| --- | --- | --- | --- |
| Gemini Diffusion | Diffusion | ~1,479 (excl. overhead) | Vendor |
| Mercury 2 (Inception) | Diffusion | ~1,196 peak | Artificial Analysis |
| Mercury Coder Mini | Diffusion | 1,109 | Vendor, AA-corroborated |
| Gemini 2.0 Flash-Lite | Autoregressive | ~201 | Per Inception |
| Claude 4.5 Haiku | Autoregressive | ~89 | Per Inception |
| GPT-5 Mini | Autoregressive | ~71 | Per Inception |

Two things to keep honest here. First, most throughput figures are measured on an NVIDIA H100 and many are vendor claims; Artificial Analysis is the main independent source, and it has corroborated Mercury's speed but not yet its quality. Second, the speed advantage is real but conditional. High-quality generation usually needs many denoising steps, and naively cutting steps degrades quality sharply, so the speed has to be spent carefully.
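The step-count trade-off can be made concrete by counting forward passes. This is a rough sketch under stated assumptions: the 256-token block size matches the figure this article gives for DiffusionGemma, while the steps-per-block values are hypothetical, and a pass count is not a wall-clock measurement, since one diffusion pass over a whole block costs more than one cached autoregressive step.

```python
import math

def autoregressive_passes(num_tokens):
    """One forward pass per generated token."""
    return num_tokens

def diffusion_passes(num_tokens, block_size, steps_per_block):
    """Each block of tokens is refined over a fixed number of denoising passes."""
    blocks = math.ceil(num_tokens / block_size)
    return blocks * steps_per_block

num_tokens = 1024
for steps in (8, 32, 128, 256):
    passes = diffusion_passes(num_tokens, block_size=256, steps_per_block=steps)
    ratio = autoregressive_passes(num_tokens) / passes
    print(f"{steps:>3} steps per block: {passes:>4} passes, ratio vs autoregressive {ratio:.1f}x")
```

At 256 steps per 256-token block the pass count matches the autoregressive one and the advantage is gone, which is why the number of denoising steps is the knob that trades speed against quality.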

And the quality gap is still visible, especially on hard tasks. Against Gemini 2.0 Flash-Lite, Gemini Diffusion scores 40.4% versus 56.5% on GPQA Diamond and 69.1% versus 79.0% on Global MMLU, even though it leads on some code and math benchmarks. The honest read from an engineer who works on production agent stacks is worth quoting, because it names the historical problem directly:

"[Earlier diffusion LMs] were fast in the way that a broken clock is fast: it doesn't matter how quickly you get the wrong answer."
vainkop, "Mercury 2 and the End of Autoregressive Monopoly"

His verdict for teams today is measured: this is a "follow closely and prepare to move fast" moment, not a "rewrite your agent stack immediately" one.

The models leading the charge

The space went from research curiosity to shipping products fast. The funding signal is loud: Inception Labs, founded by Stanford's Stefano Ermon, raised $50M in November 2025 from a strategic list that includes Nvidia, Microsoft's M12, Databricks, and Snowflake, plus angels Andrew Ng and Andrej Karpathy. When infrastructure players place a bet like that, it signals they believe the speed can be served in production.

| Model | Who | Status | What stands out |
| --- | --- | --- | --- |
| Mercury / Mercury 2 | Inception Labs | API live, $0.25 / $0.75 per 1M tokens | First commercial diffusion LLM; ~1,196 tok/s |
| Gemini Diffusion | Google DeepMind | Experimental, waitlist | ~Gemini 2.0 Flash-Lite quality at several times the speed |
| DiffusionGemma | Google DeepMind | Open weights (Apache 2.0), June 2026 | 26B mixture-of-experts; >1,000 tok/s, below Gemma 4 on quality |
| LLaDA 8B | ML-GSAI (research) | Open weights | MMLU 65.9, roughly matching Llama3 8B |
| Dream 7B | HKU NLP + Huawei | Open weights | Dominates planning tasks (Sudoku 81.0 vs Qwen's 21.0) |

A quick clarification, because the names are confusingly similar: "Gemini Diffusion" (closed, waitlist) and "DiffusionGemma" (open weights) are two different Google releases. The first is an experimental hosted model shown at Google I/O 2025; the second is a downloadable 26B model released on June 10, 2026 under Apache 2.0, which generates by denoising blocks of 256 tokens in parallel and stays below standard Gemma 4 on every published benchmark. Speed for quality, openly traded.

The recurring pattern across all of these: a 10x-plus throughput advantage, paired with a quality gap that has narrowed at small and mid scale (LLaDA roughly matching Llama3 8B, Mercury competitive on code) but still shows at the frontier. The primary use cases today are code generation and low-latency agentic loops, where parallel decoding's speed compounds.

Why diffusion-based AI models matter for businesses

Speed is not a vanity metric once you put a model inside a product. The clearest framing comes from production experience: latency in autoregressive systems compounds in chains.

A language model sits at the centre, surrounded by the layers that decide answer quality: knowledge and retrieval, guardrails and escalation, helpdesk integrations, and testing and oversight

As one engineer described it, a single agent step that calls the model three times (reason, plan, act) is three sequential passes; chain a few of those together and you are at seven or eight seconds, which is "not a real-time agent, that's a slow batch job". Faster per-step generation makes deeper AI agent chains affordable. The same piece notes teams currently cap chain depth at three to five steps to stay under their SLA; with diffusion-speed inference, ten-step chains start to look viable.
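The compounding is simple arithmetic. The sketch below uses two throughput figures from the table earlier in this article (~89 and ~1,196 tokens per second); the 70 output tokens per call is a hypothetical workload, and the function counts generation time only, ignoring network, queueing, and prompt-processing overhead.

```python
def chain_latency_seconds(steps, calls_per_step, tokens_per_call, tokens_per_second):
    """Generation time only: sequential calls, no network or queueing overhead."""
    total_tokens = steps * calls_per_step * tokens_per_call
    return total_tokens / tokens_per_second

TOKENS_PER_CALL = 70   # hypothetical output length per model call
CALLS_PER_STEP = 3     # reason, plan, act

for label, tps in (("autoregressive at ~89 tok/s", 89), ("diffusion at ~1,196 tok/s", 1196)):
    for steps in (1, 3, 10):
        seconds = chain_latency_seconds(steps, CALLS_PER_STEP, TOKENS_PER_CALL, tps)
        print(f"{label}, {steps:>2} steps: {seconds:.1f} s")
```

Under these assumptions a three-step chain takes about 7 seconds of generation at 89 tokens per second, while a ten-step chain at 1,196 tokens per second stays under 2 seconds.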

A few concrete places the speed pays off:

  • Real-time chat and copilots. Sub-second responses are, as that engineer puts it, "the difference between adoption and abandonment" for an assistant layer in a SaaS product.
  • High-volume batch text. Summarization, classification, reformatting, and translation are throughput-bound and parallelizable, which is exactly where diffusion shines.
  • Coding assistants. Diffusion's infilling nature fits code edits, generating the start and end of a block in the same pass and editing the middle.

Then there is cost. Faster generation on the same hardware means lower inference cost per token, and Inception's co-founder argues the approach "performs more computation per unit of memory transferred," which opens new ways of reducing AI inference costs on older hardware. For teams running hundreds of thousands of agent calls a day, that compounds. Mercury 2's public pricing of $0.25 per million input tokens and $0.75 per million output is genuinely cheap.
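To put that pricing in context, here is a back-of-the-envelope calculation. The prices are the Mercury 2 figures quoted above; the workload (200,000 calls a day, 1,000 input and 200 output tokens per call) is a hypothetical example, not a measured one.

```python
INPUT_PRICE_PER_M = 0.25    # USD per 1M input tokens (Mercury 2, as quoted above)
OUTPUT_PRICE_PER_M = 0.75   # USD per 1M output tokens

def daily_cost_usd(calls_per_day, input_tokens_per_call, output_tokens_per_call):
    """Token cost per day for a fixed-size call, ignoring any volume discounts."""
    input_cost = calls_per_day * input_tokens_per_call * INPUT_PRICE_PER_M / 1_000_000
    output_cost = calls_per_day * output_tokens_per_call * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost

# Hypothetical workload: 200,000 agent calls a day.
cost = daily_cost_usd(calls_per_day=200_000, input_tokens_per_call=1_000, output_tokens_per_call=200)
print(f"${cost:,.2f} per day")   # prints "$80.00 per day"
```

Swap in your own call volume and token counts to compare against whatever you pay today.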

But here is the part most coverage skips. For most production apps, autoregressive models remain the default, and for good reason: they handle long context more efficiently, they reason more deeply (diffusion does less work per token, so there is less room to "think"), and they have five years of tooling behind them. The pragmatic move is not replacement but routing: send the simple, high-frequency steps (lookup, format, classify) to a fast diffusion model, and reserve frontier autoregressive models for the deep reasoning. Compare that to the economics of AI agents versus human agents and the appeal is obvious: do more of the cheap work cheaply.
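A routing rule like that can be very small. The sketch below is one possible shape, not a recommended policy: the tier names, the task labels, and the 32,000-token cutoff are all assumptions you would replace with your own model identifiers and evaluation results.

```python
from dataclasses import dataclass

# Hypothetical tier names; swap in whatever model identifiers you actually use.
FAST_DIFFUSION = "fast-diffusion-model"
FRONTIER_AUTOREGRESSIVE = "frontier-autoregressive-model"

SIMPLE_TASKS = {"lookup", "format", "classify"}
LONG_CONTEXT_TOKENS = 32_000   # assumed cutoff, tune against your own evals

@dataclass
class Step:
    task: str
    context_tokens: int

def route(step: Step) -> str:
    """Send simple, short-context steps to the fast tier; everything else to the frontier tier."""
    if step.task in SIMPLE_TASKS and step.context_tokens < LONG_CONTEXT_TOKENS:
        return FAST_DIFFUSION
    return FRONTIER_AUTOREGRESSIVE

plan = [Step("classify", 800), Step("lookup", 50_000), Step("reason", 4_000), Step("format", 1_200)]
assignments = [(s.task, route(s)) for s in plan]
```

Note that the long-context lookup goes to the autoregressive tier even though the task itself is simple, which reflects the long-context weakness described above.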

What it means for AI customer support

Customer support looks like the perfect diffusion use case at first glance. Live chat and AI support agents are exactly the low-latency, user-facing scenario where the one-second-versus-several-seconds gap decides whether the experience feels responsive or sluggish. A faster model should mean snappier replies in your AI chatbot.

eesel AI chat interface showing a grounded conversation

The reframe worth sitting with: for a support team, the model architecture matters far less than the orchestration around it. A real support answer is almost never a from-scratch generation. It is a grounded answer over your knowledge base, ticket history, and policy docs. That puts diffusion's weakness, long context handling, squarely in the path of the support use case, and it means retrieval quality, knowledge freshness, and guardrails drive the answer far more than whether the final tokens were emitted left-to-right or in parallel.

Put bluntly: a faster model wired to stale knowledge or weak escalation rules just produces wrong answers faster. The broken-clock problem, applied to support. That is also why AI chatbot problems so rarely come down to the base model and so often come down to grounding, testing, and the metrics you actually track.

The genuinely useful advice, then, is to stay model-agnostic. Pick a layer that lets the underlying model improve underneath you, whether that is a faster diffusion model next year or a smarter autoregressive one. The teams who will benefit most from diffusion are the ones who built on solid orchestration first and treated the model as a swappable component.

Try eesel

This is exactly how eesel AI is built. Rather than betting on one model architecture, eesel is the orchestration layer: it learns from your past tickets, help docs, and tooling on day one, then drafts replies, triages, and escalates across the helpdesk you already use, with confidence-based routing so low-confidence answers stay as drafts rather than going live.

eesel AI helpdesk dashboard overview

The differentiator that matters for this topic: a simulation mode that runs the agent against your past tickets so you can see coverage and fix gaps before going live, which is how you stop a fast model from confidently shipping wrong answers. It runs across 100+ integrations and 80+ languages, so whatever model is fastest or smartest next year, your support setup keeps working. You can try eesel free, no credit card needed.

Frequently Asked Questions

What is a diffusion-based AI model in simple terms?

A diffusion-based AI model generates output by starting from random noise (or masked placeholders) and refining it step by step into a finished result. It is the technique behind image tools like Stable Diffusion and, more recently, behind diffusion language models that write text by denoising a whole sequence in parallel instead of one word at a time. For a wider primer, see our overview of generative AI for support teams.

How are diffusion language models different from autoregressive LLMs like GPT or Claude?

Autoregressive LLMs such as ChatGPT and Claude generate text left to right, one token at a time, with each token waiting for everything before it. Diffusion language models refine many tokens at once across a few denoising passes, which makes them far faster and lets them revise earlier words. The trade-off is that they currently trail on hard reasoning and long-context tasks.

Are diffusion-based AI models actually faster than normal LLMs?

Yes, on raw throughput. Independent testing clocked Inception's Mercury 2 at roughly 1,196 tokens per second, against tens to a couple hundred tokens per second for speed-optimized autoregressive models. The catch is that the speed advantage is biggest on long, parallelizable outputs and shrinks on very short answers. See how speed feeds into AI customer service metrics.

Should my business switch to a diffusion language model?

For most production apps, not yet. Autoregressive models still lead on reasoning depth, long context, and ecosystem maturity. The sensible move is routing, sending high-frequency, latency-sensitive steps to a fast diffusion model and keeping autoregressive models for deep reasoning. For customer support specifically, the model matters less than the AI helpdesk agent orchestration around it.

Does the model architecture matter for AI customer support?

Less than you would think. A support answer is a grounded answer over your knowledge base, ticket history, and policies, so retrieval, guardrails, and integrations drive quality more than whether tokens were emitted in parallel. A faster model wired to stale knowledge just produces wrong answers faster. Tools like eesel AI focus on that orchestration layer regardless of the underlying model.

Article by Kira

Kira is a writer at eesel AI with a Computer Science background and over a year of hands-on experience evaluating AI-powered customer service tools. She focuses on breaking down how helpdesk platforms and AI agents actually work so that support teams can make better buying decisions.
