What is DiffusionGemma? Google's open-weights diffusion LLM, explained

Written by Kira

Reviewed by Katelin Teen

Last edited June 16, 2026

Illustration of scrambled text tokens resolving into clean readable text, representing DiffusionGemma's parallel denoising

What is DiffusionGemma?

DiffusionGemma is a model in Google's open Gemma family that generates text with a diffusion process rather than the autoregressive approach behind nearly every chatbot you've used. It was released by Google DeepMind on June 10, 2026 as an experimental open-weights model under Apache 2.0, with the official model card living on DeepMind's site.

Here's the headline spec sheet:

Attribute | DiffusionGemma
--- | ---
Released | June 10, 2026
Licence | Apache 2.0 (open weights)
Architecture | Built on Gemma 4, Mixture-of-Experts
Size | 25.2B total params, ~3.8B active per step ("26B A4B")
Generation | Denoises blocks of 256 tokens in parallel
Input / output | Multimodal in (text/image/video), text out
Speed | >1,000 tok/s on one H100, up to 4x faster than comparable AR models
Hardware | ~52GB VRAM at BF16, ~28GB at INT8, runnable from ~18GB quantised

Most of those numbers come from MarkTechPost's launch coverage and the Spheron deployment guide, with the parallel-block detail from Digg's writeup. The "26B A4B" label is Google's shorthand: a 26B-class Mixture-of-Experts model that only fires about 3.8B parameters on any given step, which is part of why it's cheap to run fast.
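To see how the size and hardware rows of the spec sheet relate, here is a rough sketch of the arithmetic. The parameter counts are the ones quoted above; the bytes-per-parameter values are the standard sizes for BF16 and INT8. It estimates weights only, so treating the gap to the quoted ~52GB and ~28GB as runtime overhead (activations, buffers) is an assumption, not something the sources spell out.

```python
# Back-of-the-envelope weight memory for a Mixture-of-Experts model.
# Parameter counts come from the spec sheet above; bytes-per-parameter
# values are the standard sizes for each number format.

TOTAL_PARAMS = 25.2e9    # total parameters (from the article)
ACTIVE_PARAMS = 3.8e9    # parameters used per step (from the article)

BYTES_PER_PARAM = {"BF16": 2, "INT8": 1}


def weight_gb(params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active share per step: {active_fraction:.1%}")

for fmt, size in BYTES_PER_PARAM.items():
    print(f"{fmt}: ~{weight_gb(TOTAL_PARAMS, size):.1f} GB of weights")
```

The takeaway is that memory tracks the total parameter count, because every expert has to be loaded, while compute per step tracks the much smaller active count.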

The reason this is a big deal isn't the benchmark scores. It's that a frontier lab shipped a real, downloadable diffusion language model. For years, diffusion was the dominant method for images and video (think Midjourney, Sora) while text stubbornly stayed autoregressive, the same family that powers everyday assistants like ChatGPT and Claude. DiffusionGemma is one of the clearest signals yet that the text side is catching up.

How DiffusionGemma actually works

Standard large language models are autoregressive. As Inception Labs puts it, they "generate text left-to-right, one token at a time, where a token cannot be generated until all the text before it has been generated." Every word waits for the one before it, so a long answer means a long sequence of forward passes through billions of parameters. That's where the latency comes from.

Diffusion flips this. The dominant approach for text is masked diffusion: you start with a block of tokens that are all masked out, and a transformer predicts the unmasked versions, then refines its guess over a handful of passes. Google describes it as generating text "the way image diffusion works: rather than predicting text directly, the model learns to generate outputs by refining noise step-by-step, so it can iterate on a solution quickly and error-correct during generation."
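The loop described above can be sketched with a toy example. This is not DiffusionGemma's actual algorithm: the "denoiser" below is a stand-in that simply reads the answer from a known target and attaches a random confidence score, and the rule of committing the most confident guesses first is one common scheme for masked diffusion, assumed here for illustration. The function names, step count, and sample sentence are all hypothetical.

```python
import random

MASK = "_"


def toy_denoiser(tokens, target, rng):
    """Stand-in for a transformer: propose a token and a confidence
    score for every masked position. A real model would predict from
    context; this toy just reads the answer from `target`."""
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def masked_diffusion_decode(target, steps=4, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * len(target)          # start fully masked
    history = [" ".join(tokens)]
    for step in range(steps):
        proposals = toy_denoiser(tokens, target, rng)
        if not proposals:
            break
        # Commit an equal share of the remaining masks each pass,
        # most confident first, so nothing is left after the last pass.
        remaining_steps = steps - step
        quota = -(-len(proposals) // remaining_steps)  # ceiling division
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:quota]:
            tokens[i] = proposals[i][0]
        history.append(" ".join(tokens))
    return tokens, history


sentence = "the cat sat on the mat last night".split()
final, history = masked_diffusion_decode(sentence)
for line in history:
    print(line)
```

Each printed line is one pass: positions fill in across the whole block in no particular order, rather than strictly left to right. A real model can also revise tokens it has already committed, which this toy leaves out.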

Side-by-side comparison of autoregressive generation filling tokens one at a time versus diffusion refining a whole block of masked tokens in parallel

One clarification, because the name trips people up: diffusion here doesn't replace the transformer; it replaces autoregression. As Hacker News user synapsomorphy put it in a widely cited comment:

"Diffusion isn't in place of transformers, it's in place of autoregression. Prior diffusion LLMs like Mercury still use a transformer, but there's no causal masking, so the entire input is processed all at once and the output generation is obviously different."

Hacker News, on Gemini Diffusion
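The "no causal masking" point in that comment is easiest to see as an attention mask, where a 1 means "this row's position may look at this column's position". The sketch below is a generic illustration in plain Python; how DiffusionGemma masks attention across its 256-token blocks isn't described in the sources, so treat the "full" case as the general idea rather than its exact layout.

```python
def causal_mask(n):
    """Position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def full_mask(n):
    """Every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def show(mask):
    return "\n".join(" ".join(str(v) for v in row) for row in mask)


print("causal (autoregressive):")
print(show(causal_mask(4)))
print("full (diffusion-style):")
print(show(full_mask(4)))
```

The lower-triangular pattern is what forces an autoregressive model to work left to right; removing it is what lets a diffusion model use context on both sides of a gap.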

The practical upshots of generating in parallel are threefold: raw speed, the ability to error-correct mid-generation, and natural infilling (because the model can see context on both sides of a gap, it's good at editing the middle of a sequence, not just appending to the end). Andrej Karpathy flagged the novelty early, noting that diffusion "doesn't go left to right, but all at once. You start with noise and gradually denoise into a token stream."

DiffusionGemma vs Gemini Diffusion: don't conflate them

This one catches almost everyone, because Google shipped two text-diffusion models within about a year and gave them near-identical names.

Gemini Diffusion was shown at Google I/O in May 2025 as an experimental, waitlist-only model running on Google's infrastructure. You can't download it. DiffusionGemma, by contrast, is the open-weights one you can pull down and run yourself.

Two cards clarifying Gemini Diffusion as closed and waitlist-only versus DiffusionGemma as open-weights, Apache 2.0, and self-hostable

The fact that Google shipped both an experimental closed model and an open-weights release is itself the story: it's the strongest signal that diffusion language models are past the research-curiosity stage. When a frontier lab open-sources an architecture, it's betting other people will build on it.

The speed numbers (and why they're real-ish)

Speed is the entire pitch, so let's look at the numbers honestly. DiffusionGemma's >1,000 tok/s sits alongside its diffusion cousins, and the gap to autoregressive models is large:

Bar chart comparing generation speed in tokens per second, showing diffusion models around 1,000-1,500 tok/s versus autoregressive models at 60-200 tok/s

A few caveats keep this grounded. Almost every figure is measured on an NVIDIA H100, and most are vendor claims. The one independent yardstick in this space, Artificial Analysis, has corroborated the speed of Inception's Mercury models but not yet their quality. For DiffusionGemma specifically, the >1,000 tok/s and up-to-4x figures come from Google and partner write-ups like Yellow.com, not third-party benchmarks yet.

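For comparison, the autoregressive models people actually use in production sit far lower on throughput: per Inception's own benchmarks, GPT-4o Mini runs around 59 tok/s and Claude 3.5 Haiku around 61, with speed-optimised Gemini 2.0 Flash-Lite at about 201. Against a 1,000 tok/s baseline that works out to roughly 5x to 17x, so the "roughly 10x faster" framing often attached to diffusion holds, at least on paper. Note that this is a different comparison from Google's up-to-4x figure, which is measured against a comparable autoregressive model rather than against hosted production models.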
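The ratios behind that comparison are simple to check. The sketch below uses only the throughput figures quoted in this article; the 500-token answer length is a hypothetical value chosen for illustration, and the timings ignore network overhead and time to first token.

```python
# Throughput figures quoted in this article, in tokens per second.
DIFFUSION_TOK_S = 1000  # DiffusionGemma's reported floor (">1,000")
AUTOREGRESSIVE_TOK_S = {
    "GPT-4o Mini": 59,
    "Claude 3.5 Haiku": 61,
    "Gemini 2.0 Flash-Lite": 201,
}

ANSWER_TOKENS = 500  # hypothetical answer length, for illustration


def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Pure generation time at a steady throughput."""
    return tokens / tok_per_s


for name, rate in AUTOREGRESSIVE_TOK_S.items():
    speedup = DIFFUSION_TOK_S / rate
    print(f"{name}: diffusion is {speedup:.1f}x faster; "
          f"{seconds_for(ANSWER_TOKENS, rate):.1f}s for {ANSWER_TOKENS} tokens")

print(f"Diffusion at {DIFFUSION_TOK_S} tok/s: "
      f"{seconds_for(ANSWER_TOKENS, DIFFUSION_TOK_S):.1f}s")
```

On these numbers the speedup ranges from about 5x against Gemini 2.0 Flash-Lite to about 17x against GPT-4o Mini, which is where a "roughly 10x" summary comes from.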

Where it shines, and where it doesn't

The honest read is that diffusion really is faster on throughput-bound, parallelisable work, but autoregressive still wins for a lot of what production apps actually need. The best single source here is engineer Sean Goedecke's breakdown of diffusion's limitations, and it maps cleanly onto a decision.

Reach for diffusion when the job is high-volume and parallelisable: bulk summarisation, classification, reformatting, translation, or low-latency agent loops where a fast per-step response compounds. Code generation is a particularly good fit because diffusion's infilling nature matches how you edit code, generating the start and end of a block in the same pass.

Stick with autoregressive when you need short outputs (diffusion runs all its denoising passes regardless, so it does extra work to produce a six-token answer), long context windows (diffusion can't reuse the key-value cache as easily, so it recomputes attention over the whole context each pass), or hard chain-of-thought reasoning. On that last point, Goedecke makes the sharpest case:

"One reason to be broadly skeptical about the potential of diffusion models to reason is precisely that they do much less work per-token than autoregressive models do. That's just less space for the model to spend 'thinking.'"

Sean Goedecke, "Strengths and limitations of diffusion language models"
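The short-output penalty mentioned earlier can be made concrete by counting forward passes. This is a simplified sketch: the 256-token block size is from the spec sheet, but the number of denoising steps per block is a hypothetical value (the sources don't state it), and the model assumes every block always runs all of its steps.

```python
import math

BLOCK_SIZE = 256       # tokens denoised in parallel (from the spec sheet)
STEPS_PER_BLOCK = 16   # hypothetical; the article does not state this


def autoregressive_passes(n_tokens: int) -> int:
    """One forward pass per generated token."""
    return n_tokens


def diffusion_passes(n_tokens: int) -> int:
    """Every block runs all its denoising steps, even if mostly empty."""
    blocks = math.ceil(n_tokens / BLOCK_SIZE)
    return blocks * STEPS_PER_BLOCK


for n in (6, 256, 1024):
    print(f"{n} tokens: autoregressive={autoregressive_passes(n)}, "
          f"diffusion={diffusion_passes(n)}")
```

Under these assumptions a six-token answer costs 6 passes autoregressively but 16 with diffusion, while a 1,024-token answer costs 1,024 versus 64. The passes aren't equal in cost, since a diffusion pass processes a whole block, but the crossover shows why the advantage only appears on longer outputs.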

DiffusionGemma itself bears out the trade-off: it stays below standard Gemma 4 on every published benchmark. One engineer writing about production agent stacks put the historical knock on diffusion memorably, that early models "were fast in the way that a broken clock is fast, it doesn't matter how quickly you get the wrong answer" (dev.to). The quality gap is closing at small and mid scale, but it's still visible at the frontier.

The pragmatic move most teams will land on isn't replacement, it's routing: send simple, high-frequency steps (lookups, formatting, classification) to a fast diffusion model and reserve a frontier autoregressive model for deep reasoning. It's the same logic behind picking the right tool for a job rather than one AI helpdesk doing everything.
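In its simplest form, that routing idea is a lookup on the type of step. The sketch below is illustrative only: the model identifiers, the `Step` type, and the task labels are hypothetical names, and a production router would likely weigh more signals than a task label.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever you actually deploy.
FAST_DIFFUSION_MODEL = "diffusion-fast"
FRONTIER_AR_MODEL = "autoregressive-frontier"

# Task types the article suggests sending to a fast diffusion model.
SIMPLE_TASKS = {"lookup", "formatting", "classification"}


@dataclass
class Step:
    task_type: str
    prompt: str


def pick_model(step: Step) -> str:
    """Route simple, high-frequency steps to the fast model and
    everything else to the frontier model."""
    if step.task_type in SIMPLE_TASKS:
        return FAST_DIFFUSION_MODEL
    return FRONTIER_AR_MODEL


steps = [
    Step("classification", "Is this ticket about billing or shipping?"),
    Step("reasoning", "Work out why the refund total is off by $3.20."),
]
for step in steps:
    print(step.task_type, "->", pick_model(step))
```

Defaulting unknown task types to the frontier model is a deliberate choice here: a misrouted step costs some speed rather than answer quality.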

What DiffusionGemma means for customer support teams

Diffusion sounds perfect for support. Live chat and AI support agents are exactly the low-latency, user-facing case where the gap between a one-second and a several-second response decides whether the tool feels real-time or like "a service you wait on." For customer-facing copilots, sub-second response really can be the difference between adoption and abandonment.

But here's the thing we'd push back on: for a support team, the model architecture matters far less than the orchestration around it. Two caveats land directly on this use case.

First, real support answers lean on long context and retrieval, and long context is exactly diffusion's weak spot. A good answer isn't a from-scratch generation; it's a grounded answer over your knowledge base, ticket history, and policy docs. The retrieval and grounding matter more to answer quality than whether the final tokens came out left-to-right or in parallel, which is the heart of the RAG vs LLM question.

Second, quality and reliability beat raw speed for anything customer-facing. A faster model wired to stale knowledge or weak escalation rules just produces wrong answers faster. That's the broken-clock problem, applied to support.

eesel AI helpdesk dashboard showing connected tickets and knowledge sources, as taken from eesel

So if you're a support leader reading about DiffusionGemma and wondering whether you need it: probably not directly. What you want is a platform that gets the grounding, guardrails, and helpdesk integrations right, and then quietly benefits from whatever model is fastest and best under the hood. Latency is one lever among many, and it's rarely the one holding your resolution rate back. The bigger question is usually the cost per ticket versus a human handling it.

Try eesel

eesel AI sells AI teammates that live inside your existing helpdesk (Zendesk, Freshdesk, HubSpot, Gorgias, Front) and handle tier-1 support by learning from your past tickets and help docs on day one. The reason it's relevant here: eesel is deliberately model-agnostic, so the architecture debate above is one you don't have to win. What it gets right is the orchestration that actually moves the numbers, like confidence-based routing that drafts instead of sending when it's unsure, and a simulation mode that runs against your past tickets so you can see coverage before going live. Gridwise saw 73% of tier-1 requests resolved in the first month, and pricing is usage-based from $0.40 per resolved ticket with no per-seat fees, so you pay for outcomes rather than GPU-hours.

Frequently Asked Questions

What is DiffusionGemma in simple terms?
DiffusionGemma is an open-weights AI language model from Google DeepMind that writes text using diffusion instead of the usual left-to-right method. Rather than predicting one word at a time, it starts with a block of masked tokens and refines the whole block in parallel over a few passes, which makes generation up to 4x faster than a comparable autoregressive model. It's part of the open Gemma family and released under an Apache 2.0 licence.

Is DiffusionGemma the same as Gemini Diffusion?
No. Gemini Diffusion is a closed, waitlist-only experiment that runs on Google's own infrastructure, while DiffusionGemma is an open-weights model you can download and self-host. Both use text diffusion, but they are different releases and easy to confuse. If you're comparing Google's AI options, our Gemini pricing guide covers the production models.

How fast is DiffusionGemma compared to a normal LLM?
Google reports more than 1,000 tokens per second on a single H100 GPU, up to 4x faster than a comparable autoregressive model. For context, speed-optimised autoregressive models like Gemini 2.0 Flash-Lite sit around 200 tokens per second. Speed is the whole point of diffusion, which matters for latency-sensitive jobs like a real-time chat response.

Can I use DiffusionGemma for customer support?
You can, but the model architecture is the smaller part of the problem. A good support answer depends far more on what the AI is grounded in (your help docs, past tickets, policies) and the guardrails around it than on raw speed. A platform like eesel's AI helpdesk agent handles that orchestration regardless of which model sits underneath.

How much does DiffusionGemma cost to run?
The weights are free under Apache 2.0, but you pay for the GPU to serve them. It needs roughly 52GB of VRAM at BF16, dropping to about 28GB with INT8 quantisation, so an H100-class card is recommended. If you'd rather not run infrastructure, usage-based tools like eesel AI bill per resolved ticket instead of per GPU-hour.


Kira is a writer at eesel AI with a Computer Science background and over a year of hands-on experience evaluating AI-powered customer service tools. She focuses on breaking down how helpdesk platforms and AI agents actually work so that support teams can make better buying decisions.
