What is Gemma 4? Google's open AI model family, explained

Alicia Kirana Utomo
Written by

Alicia Kirana Utomo

Katelin Teen
Reviewed by

Katelin Teen

Last edited June 19, 2026

Expert Verified
Illustration of Google Gemma 4, the open-weight AI model family, running on a laptop and a local server

So what exactly is Gemma 4?

I build the AI agents at eesel, and I've spent the last few years watching open models go from "fun to tinker with" to "good enough to put in front of a paying customer." We run agents on live support queues every day; one customer, Smava, processes 100,000+ German-language tickets a month through an automated agent. So whenever Google ships a new open model, I read it through one lens: could you actually trust this to answer a customer without a human watching?

Gemma 4 is the most interesting answer to that question I've seen from an open model.

In plain terms, Gemma is Google DeepMind's line of open models, the smaller, downloadable cousins of the closed Gemini models. Gemma 4 is "built from the same world-class research and technology as Gemini 3 to maximize intelligence-per-parameter," per Google's launch post. The key word is open-weight: Google publishes the actual model files, so you can run them on your own laptop, server, or phone with no API call leaving your network.

It's also multimodal. Every model handles text and image input, the smaller ones add native audio, and the model card notes a training cutoff of January 2025 with support for over 140 languages. If you've read our explainer on RAG versus LLMs, Gemma 4 is the "LLM" half of that picture, the reasoning engine you'd point at your own knowledge.

The five sizes, and which one is for you

Gemma 4 isn't one model, it's five, sorted by where they're meant to run. This is the part worth understanding before anything else, because picking the wrong size is the most common mistake I see people make.

The five Gemma 4 sizes mapped to the hardware each one runs on, from phones to a single-GPU server
The five Gemma 4 sizes mapped to the hardware each one runs on, from phones to a single-GPU server

Here's the lineup, with the specs pulled straight from the model card:

ModelEffective paramsContextModalitiesRuns on
E2B2.3B (5.1B with embeddings)128KText, image, audioPhones, Raspberry Pi, edge
E4B4.5B (8B with embeddings)128KText, image, audioHigh-end phones, IoT
12B Unified11.95B256KText, image, audioLaptops (~16GB)
26B A4B (MoE)25.2B total, 3.8B active256KText, imageWorkstation, latency-focused
31B Dense30.7B256KText, imageSingle 80GB H100, top quality

The "E" in E2B and E4B stands for effective parameters. Those models use a trick called Per-Layer Embeddings to keep their memory footprint small, which is how a phone can run them offline with near-zero latency. Google built them with the Pixel team plus Qualcomm and MediaTek, so they're tuned for real mobile silicon, not just a demo.

The 12B Unified is the newcomer, added on June 3, 2026. It's the "laptop-ready" pick and Google's first mid-sized model with native audio input. The 31B Dense is the raw-quality flagship and the foundation everyone fine-tunes from.

The one in the middle, the 26B, is the most clever of the bunch. It deserves its own section.

How a 26B model keeps up with models 20x its size

The 26B is a Mixture-of-Experts (MoE) model, and understanding it is the single best way to grasp why Gemma 4 is a big deal.

A normal "dense" model fires every parameter for every token it processes. An MoE model splits its parameters into many small "experts" and, for each token, only switches on the handful it actually needs. Here's the shape of it:

How a Mixture-of-Experts model routes each token to a few experts, keeping active parameters low
How a Mixture-of-Experts model routes each token to a few experts, keeping active parameters low

Gemma 4's 26B has 25.2B total parameters but only 3.8B active per token, routing through 8 of its 128 experts plus one shared expert. The practical result: it runs about as fast as a 4B dense model while answering closer to the quality of the 31B. (One caveat to keep in mind: all 25.2B parameters still have to be loaded into memory for routing, so MoE saves you compute, not RAM.)

Why does this matter? Because it breaks the old assumption that "smarter" means "bigger and slower." Look at where the medium Gemma 4 models land on Google's own performance-versus-size chart:

Gemma 4's 31B and 26B sitting on the performance-vs-size frontier, ahead of much larger models, as shared in Google's announcement
Gemma 4's 31B and 26B sitting on the performance-vs-size frontier, ahead of much larger models, as shared in Google's announcement
Open-model performance vs size on Arena.ai's chat arena, as published by Google DeepMind.

The 31B is the #3 open model on Arena AI's text leaderboard, and the 26B MoE takes #6, which is how Google can claim Gemma 4 "outcompetes models 20x its size." For a support team, the takeaway isn't the leaderboard rank, it's that this quality fits on a box you own.

What "open weights" actually means (and why the license changed)

People throw around "open" loosely, so let me be precise, because this is where Gemma 4 made its biggest move.

Previous Gemma models shipped under a custom "Gemma Terms of Use." Gemma 4 switched to a standard Apache 2.0 license. In Google's words, it's "commercially permissive," granting "complete control over your data, infrastructure, and models." Hugging Face's CEO Clément Delangue called the move "a huge milestone."

Here's the difference that license makes in practice:

Closed API model sending customer data to vendor servers versus an open-weight model keeping it on your own infrastructure
Closed API model sending customer data to vendor servers versus an open-weight model keeping it on your own infrastructure

With a closed API model, every customer message you process is sent to a vendor's servers. With an open-weight model under Apache 2.0, you can run the whole thing inside your own infrastructure, on-premises or in your own cloud, and the data never leaves. For anyone in a regulated industry, that data-residency control is the entire reason to care about open models. It's the same reason people reach for open-source ticketing systems and open-source chatbot platforms.

To scale it, Google offers Gemma 4 across Vertex AI, Cloud Run, and GKE, and it works day-one with the tools self-hosters already use, like Ollama, llama.cpp, vLLM, and LM Studio.

The benchmarks, and where Gemma 4 actually shines

Numbers next. Google publishes a full benchmark table comparing the instruction-tuned Gemma 4 models against last generation's Gemma 3 27B:

Gemma 4 benchmark table across MMMLU, AIME, GPQA, LiveCodeBench and agentic tool use, versus Gemma 3 27B
Gemma 4 benchmark table across MMMLU, AIME, GPQA, LiveCodeBench and agentic tool use, versus Gemma 3 27B
Instruction-tuned benchmark results, as published in Google's Gemma 4 materials.

The one line I'd circle is agentic tool use. On the τ2-bench retail benchmark, which tests whether a model can actually call tools to complete a task, the 31B model scores 86.4% against Gemma 3's 6.6%. That's not an incremental bump, it's a generational leap, and it's the capability that turns a chatbot into something that can do work.

It holds up against the closed giants, too. On Arena Elo, the 31B's 1452 lands a hair behind models with 15-35x the parameters:

Arena Elo bar chart: Gemma 4 31B at 1452 next to far larger models like Glm 5, Kimi k2.5, and Qwen 3.5
Arena Elo bar chart: Gemma 4 31B at 1452 next to far larger models like Glm 5, Kimi k2.5, and Qwen 3.5
Arena Elo scores against parameter counts, via Hugging Face.

Architecturally, the interesting note from Sebastian Raschka's read is that Gemma 4 is "pretty much unchanged" from Gemma 3 under the hood, so the leap is "likely due to the training set and recipe." In other words, Google got this jump from better data, not a new architecture, which is a quietly impressive thing to pull off.

What it's actually like to run

Benchmarks are one thing. What do people who run Gemma 4 every day actually say? I went looking on the local-model communities, because that's where the unvarnished takes live.

The praise is consistent: it's fast, light on memory, and it doesn't ramble.

Reddit

"Fast as F*** on a M4Max, and damn smart for its speed. Doesn't destroy your memory load. Doesn't reason for hours (and eat all of the token budget on reasoning) like Qwen does.. It's perfect for openclaw, hermes, claude code etc. I LOVE this model for local. It's my Go-to now."

That "doesn't reason for hours" point comes up again and again. A self-hoster running the 26B and 31B for a multimodal use case put real numbers on it, reporting roughly 149 tokens/sec on the 31B and 88 on the 26B, and adding that "the benchmarks don't really capture how little it yaps compared to larger ones."

But here's the honest limitation, and it's the reason I wouldn't hand raw Gemma 4 a live queue unsupervised:

Reddit

"I agree it's much better at everything except at coding. [...] However it suffers heavily when weights or kv cache are any other quant but native."

So the community read nets out like this: Gemma 4 is an excellent chat and instruction-following model that punches well above its weight, with two caveats, coding and agentic workflows are its weaker areas, and it degrades noticeably if you run it on anything other than its native quantization. Good to know before you pick it for a job.

What this means for customer support

Here's where it gets practical for anyone running a support team. An open model like Gemma 4 is a fantastic ingredient. It is not, on its own, a support agent.

A raw model has no idea what your refund policy is, can't see your past tickets, and isn't connected to your helpdesk. Drop it in front of customers unsupervised and you get exactly the failure mode we've spent years engineering against: a confident-sounding bot that quietly gives the wrong answer. The model is the engine; the actual product is everything around it, the knowledge, the safe routing, the connection to your tools, and the ability to test it before it goes live.

That gap is the whole reason platforms like ours exist. The open-weight movement gives you control over the model layer, but most support teams don't want to also become an ML ops team. The better answer for most people is to get the data-control and learning benefits without hand-rolling the infrastructure, which is the line I'd draw between a model and an AI customer service platform.

Try eesel for AI support

If reading about Gemma 4 got you thinking "I want AI answering my tickets, but on my terms," that's the exact problem eesel is built for.

eesel's AI helpdesk agent plugs into the tools you already run, Zendesk, Freshdesk, Gorgias, Slack, and 100+ others, and learns from your past tickets and help docs on day one, so years of history becomes knowledge immediately. The part that maps directly to the "could you trust it?" question I opened with: you can simulate the agent against thousands of your historical tickets to see exactly how it would have answered, before a single customer sees it. That's how Gridwise got to 73% of tier-1 requests resolved in its first month.

eesel AI helpdesk dashboard showing connected support tools and ticket activity
eesel AI helpdesk dashboard showing connected support tools and ticket activity

It's usage-based, starting at $0.40 per ticket with no per-seat fees, and you can start with $50 of free usage and no credit card. Whether the model under the hood is Gemma 4 or anything else, the thing you actually want is an agent you can trust on your queue. Try eesel and see how it handles yours.

Frequently Asked Questions

What is Gemma 4?
Gemma 4 is Google DeepMind's family of open-weight AI models, released on April 2, 2026. Unlike an API-only model, you download the actual weights and run them on your own hardware, anywhere from a phone to a single-GPU server. It comes in five sizes and is built for reasoning and agentic workflows.
Is Gemma 4 free to use?
The weights are free to download and the license is Apache 2.0, which is commercially permissive, so there is no per-token license fee. Your only cost is the infrastructure you run it on. That is a big shift from how most LLMs are priced.
What are the Gemma 4 model sizes?
There are five: E2B and E4B for phones and edge devices, a 12B Unified model for laptops, a 26B Mixture-of-Experts model tuned for low latency, and a 31B Dense flagship. The model card lists the full specs for each.
Can Gemma 4 run on a laptop or phone?
Yes. The E2B and E4B models run completely offline on phones and devices like a Raspberry Pi, and the 12B Unified model is built to fit on a laptop with 16GB of memory. Self-hosters on r/LocalLLaMA report the 26B running fast on a 64GB Mac.
Is Gemma 4 good for customer support?
An open model gives you a strong base, but a production support agent needs more than raw weights: it has to learn from your tickets, route safely, and connect to your helpdesk. A platform like eesel's AI helpdesk agent handles that layer so you get the control of self-hosting without building the plumbing. See how teams cut support costs with AI.

Share this article

Alicia Kirana Utomo

Article by

Alicia Kirana Utomo

Kira is a writer at eesel AI with a Computer Science background and over a year of hands-on experience evaluating AI-powered customer service tools. She focuses on breaking down how helpdesk platforms and AI agents actually work so that support teams can make better buying decisions.

Related Posts

All posts →
Illustration of an AI support teammate handling B2B SaaS conversations between two customers
AI

AI support for B2B SaaS: what actually works in 2026

B2B SaaS tickets are technical, account-specific, and high-stakes. Here is how AI support actually works for them, what breaks, and how to roll it out safely.

Riellvriany IndriawanRiellvriany IndriawanJun 19, 2026
Editorial illustration of Claude Opus 4.8 for business use
AI

Claude Opus 4.8 for business: what it changes, and what it doesn't

Claude Opus 4.8 is Anthropic's flagship model. Here's a practical, operator's read on what it means for your business, what it costs, and where it falls short.

Alicia Kirana UtomoAlicia Kirana UtomoJun 17, 2026
Illustration of scrambled text tokens resolving into clean readable text, representing DiffusionGemma's parallel denoising
AI

What is DiffusionGemma? Google's open-weights diffusion LLM, explained

DiffusionGemma is Google's open-weights text-diffusion model: a 26B Mixture-of-Experts that writes whole blocks of text in parallel for up to 4x faster generation.

Alicia Kirana UtomoAlicia Kirana UtomoJun 17, 2026
Illustration of the MiniMax M3 multimodal AI model with image, audio, and video inputs
AI models

What is MiniMax M3? The open-weight model explained

What is MiniMax M3? A plain-English guide to the open-weight model from MiniMax: its sparse-attention 1M context, real benchmarks, pricing, and what it means for support teams.

Alicia Kirana UtomoAlicia Kirana UtomoJun 20, 2026
Illustrated hero banner for an OpenAI Codex free access explainer, with terminal and code editor motifs
AI

OpenAI Codex free access, explained: what you actually get for $0

Is OpenAI Codex free? Yes, if you sign in with a ChatGPT Free account. Here is exactly what the free tier gives you, where the wall is, and the limits.

Alicia Kirana UtomoAlicia Kirana UtomoJun 18, 2026
Editorial illustration of Claude Opus 4.8, Anthropic's flagship AI model
AI

What is Claude Opus 4.8? A clear-eyed look at Anthropic's flagship model

Claude Opus 4.8 is Anthropic's latest flagship model. Here's what changed, what it costs, and what a smarter model actually means for AI customer support.

Riellvriany IndriawanRiellvriany IndriawanJun 17, 2026
Illustration of Claude Fable 5 working as a long-running autonomous teammate for a business team
AI

Claude Fable 5 for business: what Anthropic's most powerful model actually means for your team

A clear-eyed look at Claude Fable 5 for business: what it costs, where it shines, where it bites, and how to actually put it to work in customer support.

Alicia Kirana UtomoAlicia Kirana UtomoJun 17, 2026
Illustration of scattered noise and masked blocks resolving into clean lines of text, with a stopwatch signalling speed
AI

Diffusion-based AI models explained: how they work and why they're suddenly fast

A plain-English guide to diffusion-based AI models: how they differ from autoregressive LLMs, why they generate text 10x faster, and what that means for businesses.

Alicia Kirana UtomoAlicia Kirana UtomoJun 17, 2026
Two people speaking different languages with a live sound wave bridging them, illustrating Gemini 3.5 Live Translate
AI

What is Gemini 3.5 Live Translate?

Gemini 3.5 Live Translate is Google's real-time speech-to-speech translation model for 70+ languages. Here's what it does, how it works, and where it fits.

Riellvriany IndriawanRiellvriany IndriawanJun 17, 2026

Ready to hire your AI teammate?

Set up in minutes. No credit card required.

Get started free