What is Gemini Omni Flash? Google's new AI video model explained

Written by Rama Adi Nugraha
Reviewed by Katelin Teen
Last edited July 3, 2026

Expert Verified
Illustration representing Google's Gemini Omni Flash AI video generation and editing model

What Google actually shipped

I read a lot of model announcements for a living, and most of them are a single capability wearing new marketing. Gemini Omni Flash is a genuinely different shape: it's Google's attempt at an any-to-any model where video is the current output, not the only one planned. The pitch from DeepMind CTO Koray Kavukcuoglu's announcement is that Omni is where "Gemini's ability to reason meets the ability to create."

Concretely, that means you can feed the model a mix of text, image, video, and audio (voice-only, for now), and it generates high-resolution video with audio. The output is grounded in Gemini's broader world knowledge of physics, history, and cultural context, rather than being just plausible-looking frames. Google's own demo prompts lean into that: a marble rolling through a chain-reaction track that respects momentum and gravity, an alphabet video where each letter is an unusual themed object, a claymation explainer of protein folding.

The part worth sitting with is the editing model. Ask for a change, and the next instruction builds on the last one instead of regenerating the whole scene: characters stay put, physics hold, and the model remembers what it already did. Google's demo walks a violinist through three compounding edits, changing the environment, making the violin invisible, then swapping the camera angle, without losing the original shot's thread.
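One way to picture the difference between compounding edits and regenerating from scratch is a session that keeps an ordered history of instructions. The sketch below is purely conceptual: `EditSession` and its methods are hypothetical names made up for illustration, not part of any Google API, and no model is called.

```python
from dataclasses import dataclass, field


@dataclass
class EditSession:
    """Conceptual model of a multi-turn edit: each turn sees all earlier turns."""

    base_prompt: str
    edits: list[str] = field(default_factory=list)

    def apply(self, instruction: str) -> list[str]:
        """Record an edit and return the full context the next render depends on."""
        self.edits.append(instruction)
        return self.context()

    def context(self) -> list[str]:
        # A compounding edit carries the original shot plus every prior change.
        return [self.base_prompt, *self.edits]


# The three edits from Google's violinist demo, applied in order.
session = EditSession("a violinist performing")
session.apply("change the environment")
session.apply("make the violin invisible")
final_context = session.apply("swap the camera angle")
print(len(final_context))  # 4: the base shot plus three edits
```

A regenerate-every-time tool would, in these terms, only ever see the latest instruction; the claim Google is making is that Omni behaves more like the accumulated list.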

The any-to-any workflow

Diagram showing Gemini Omni Flash taking text, image, video, and audio input and producing video output, with a multi-turn conversational edit loop

The official model card confirms the architecture is a transformer with native multimodal support across text, vision, video, and audio. Inputs go in as any combination of those four; the only output modality shipped so far is video with audio, though Google says image and audio output are coming "in time." That's the "any-to-any" framing: one model, multiple input types, one connected creative loop instead of separate tools for generation and editing.

Rollout followed a now-familiar Google pattern: consumers first, developers later. Gemini Omni Flash went out to Google AI Plus, Pro, and Ultra subscribers globally through the Gemini app and Google Flow, plus free to creators on YouTube Shorts and the YouTube Create app, all on day one. Developer and enterprise API access followed weeks later, landing in public preview on June 30, 2026, through Google AI Studio, the Gemini API, and the Gemini Enterprise Agent Platform.

Paired with Nano Banana 2 Lite: one workflow, not two products

Omni Flash didn't ship alone. Google announced it alongside Nano Banana 2 Lite (model id gemini-3.1-flash-lite-image), the fastest, cheapest model in the Nano Banana image-generation family: text-to-image in under 4 seconds for $0.034 per 1K-resolution image, replacing the original Nano Banana as the recommended default. On its own that's a solid, if unremarkable, speed-and-cost upgrade.

What makes it interesting is that Google built three official demo apps that chain the two models together rather than treating them as separate products:

  • Anywhere - takes a selfie, uses Nano Banana 2 Lite to place you at a landmark, then Omni Flash animates the still into a clip.
  • Space Lift - reimagines a room photo across design styles, then turns the chosen look into a cinematic walkthrough.
  • Omni product studio - converts static product photos into e-commerce video.

Two-stage pipeline diagram: Nano Banana 2 Lite generates a still image in 4 seconds, then Gemini Omni Flash animates it into a 10-second video

Independent AI-news account Rohan Paul read the launch the same way I did: the pairing is the actual product.

"Chaining both models is the real product shape, not either model alone. Nano Banana 2 Lite makes reference images, then Gemini Omni Flash animates them."
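To put rough numbers on the chained workflow, here is a minimal sketch of what one finished clip might cost. It uses the $0.034 per-image figure quoted above and assumes Google's rounded rate of $0.10 per second of 720p video for Omni Flash. The function name and its parameters are my own, and input-token charges for prompts or reference media are deliberately left out.

```python
IMAGE_PRICE_USD = 0.034        # per 1K-resolution still (Nano Banana 2 Lite)
VIDEO_PRICE_PER_SECOND = 0.10  # Google's rounded rate for 720p Omni Flash output


def chained_clip_cost(still_attempts: int, clip_seconds: float) -> float:
    """Estimated output cost of one finished clip: stills first, then one animation."""
    if still_attempts < 1 or clip_seconds <= 0:
        raise ValueError("need at least one still and a positive clip length")
    return still_attempts * IMAGE_PRICE_USD + clip_seconds * VIDEO_PRICE_PER_SECOND


# Three candidate stills, then a single 10-second animation of the chosen one.
cost = chained_clip_cost(still_attempts=3, clip_seconds=10)
print(f"${cost:.3f}")  # $1.102
```

On these numbers each still costs about 3% of what the 10-second animation does, which suggests iterating on the image before paying to animate it.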

The Nano Banana family now has four tiers, from the announcement's own model comparison chart:

A chart showing the Nano Banana model tiers, Nano Banana 2 Lite, Nano Banana 2, and Nano Banana Pro, ranked by speed and use case, as taken from Google

| Model | Positioning |
| --- | --- |
| Nano Banana 2 Lite | Fastest tier, built for near-real-time and high-volume workflows |
| Nano Banana 2 | Generalist workhorse, best balance of quality, latency, and cost |
| Nano Banana Pro | Complex, professional-grade control and reasoning |
| Nano Banana (legacy) | Superseded by Nano Banana 2 Lite |

Google's own benchmark chart backs the speed claim: Nano Banana 2 Lite lands well ahead of the pack on the latency-versus-price curve.

A benchmark chart plotting Nano Banana 2 Lite's image generation speed against price versus other models, as taken from Google

What a clip actually costs

Gemini Omni Flash's Gemini API pricing is unusually simple for a Google model, mostly because there's only one tier. Unlike most Gemini 3.x models, there's no Batch, Flex, or Priority option here, just Standard, and no free tier at all.

| Item | Price |
| --- | --- |
| Input (text, image, video, audio - one flat rate) | $1.50 / 1M tokens |
| Output - text | $9.00 / 1M tokens |
| Output - video | $17.50 / 1M tokens |
| Effective video rate | ≈ $0.10 / second of 720p video |
| Free tier | None |

Google's pricing page spells out the math directly: billing runs on total output tokens, at 5,792 tokens per second of 720p video, which nets out to roughly $0.10 a second. Run that to the current 10-second ceiling and a single clip lands at just over $1, before whatever it costs to feed in your prompt or reference media.
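The arithmetic is easy to check. The sketch below uses only the figures quoted above (5,792 tokens per second of 720p video, $17.50 per 1M video output tokens, a 10-second ceiling); the function and constant names are my own, and input costs are not included.

```python
TOKENS_PER_SECOND_720P = 5_792        # from Google's pricing page, as quoted above
VIDEO_OUTPUT_USD_PER_MILLION = 17.50  # video output rate
MAX_CLIP_SECONDS = 10                 # current clip ceiling


def video_output_cost(seconds: float) -> tuple[int, float]:
    """Return (output tokens, USD) for a 720p clip of the given length."""
    if not 0 < seconds <= MAX_CLIP_SECONDS:
        raise ValueError(f"clip length must be in (0, {MAX_CLIP_SECONDS}] seconds")
    tokens = round(seconds * TOKENS_PER_SECOND_720P)
    return tokens, tokens * VIDEO_OUTPUT_USD_PER_MILLION / 1_000_000


tokens, usd = video_output_cost(10)
print(tokens, round(usd, 4))  # 57920 1.0136
```

One second works out to $0.10136, which is where both the rounded "$0.10 a second" and the "just over $1" per full-length clip come from.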

Cost breakdown chart showing a 10-second Gemini Omni Flash video clip costs approximately $1.01, split between input and video output tokens

Google isn't hiding that the price matches another of its own video models. Gemini API dev-relations lead Logan Kilpatrick announced it in those exact terms:

"Omni Flash is SOTA at video editing at $0.10 / sec, same as Veo 3.1 Fast!"

One quirk buried in the pricing footnotes: Gemini Omni Flash Preview is marked "Yes" for content being used to improve Google's products, even on the paid tier, where most other paid-tier Gemini models say "No." Worth knowing before you feed it anything sensitive.

Where it still falls short

The official model card is unusually candid about what doesn't work yet. It names three specific challenges outright: maintaining full consistency across edits, generating scenes with complex motion, and rendering accurate on-screen text. Google also flags that character consistency "has some limitations" specifically around scene changes and camera pans, which is exactly the kind of edit its own demos lean on hardest.

The API has rougher edges too. Audio-reference uploads and scene extension aren't supported yet, and video references up to 3 seconds are accepted by the schema but not correctly processed by the model. And there's no published benchmark data: DeepMind's model card explicitly defers evaluation scores for text-to-video, image-to-video, reference-to-video, and editing until the model reaches broader API availability.
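If you are wiring this into a pipeline, it may be worth catching those gaps before a request is ever sent. The following is a minimal pre-flight sketch under stated assumptions: `ClipRequest` is a hypothetical request shape I made up for illustration, not the real Gemini API schema, and the checks simply restate the limitations described in this article.

```python
from dataclasses import dataclass

MAX_CLIP_SECONDS = 10  # current clip ceiling cited in this article


@dataclass
class ClipRequest:
    """Hypothetical request shape for illustration; not the real Gemini API schema."""

    clip_seconds: float
    has_audio_reference: bool = False
    has_video_reference: bool = False
    extends_existing_scene: bool = False


def preflight(request: ClipRequest) -> list[str]:
    """Return human-readable problems based on the limitations listed above."""
    problems = []
    if request.clip_seconds > MAX_CLIP_SECONDS:
        problems.append(f"clip exceeds the {MAX_CLIP_SECONDS}-second ceiling")
    if request.has_audio_reference:
        problems.append("audio-reference uploads are not supported yet")
    if request.extends_existing_scene:
        problems.append("scene extension is not supported yet")
    if request.has_video_reference:
        problems.append("video references may be accepted but not processed correctly")
    return problems


issues = preflight(ClipRequest(clip_seconds=12, has_audio_reference=True))
print(issues)
```

Returning a list of problems rather than raising on the first one is a design choice: since these are preview-stage limits that may change, a caller can log them all and decide which are blocking.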

A practitioner on LinkedIn pushed back on the polish question directly, commenting on the original Gemini Omni announcement from I/O:


"It's not completely true though! It works on certain resolutions only, and the character constancy still not fully developed. It's nice for you tubers, it's fun. But not at all at professional production level."

Google's own model card backs up part of that read in its most striking disclosure: the model can already change what someone is saying in a video, and Google is deliberately restricting that "while it works to understand how to safely and responsibly bring it to users." That's a real capability being held back on purpose, not a limitation Google is apologizing for.

Another commenter, on Google Cloud's own LinkedIn announcement, put a finer point on why the editing claim matters more than the generation claim:


"Conversational editing on video generation is the harder problem than initial generation quality, since maintaining consistency across edit turns requires tracking scene state, not just producing a good single frame."

And a second reader on that same thread flagged the question that follows any fast-moving enterprise AI launch:


"As increasingly capable AI agents become embedded within business services, technical performance alone won't demonstrate that an organisation is ready to operate them safely."

That last line is the one I'd underline. It's not really about video models. It's about any AI capability landing in a live business workflow faster than the org around it can build the guardrails to run it safely, and it's the exact problem I think about every day building support automation.

Where this fits, and where it doesn't

Gemini Omni Flash is a strong, honestly-documented step toward Google's any-to-any vision, and if your job involves producing marketing clips, product videos, or creative content, it's worth a serious look, especially chained with Nano Banana 2 Lite for the image-to-video workflow. For an SEO team or content shop already inside eesel's world, that's a real adjacent tool: I build eesel's blog writer agent for research and drafting, and a fast video layer on top of a written piece is a genuinely useful pairing, not a competitor to it.

But it's still a generative-media model, not a support system. It doesn't know your refund policy, it doesn't have a queue of real tickets to learn from, and nothing about "10-second video clips" touches the actual job most of the people reading this post have: an inbox or a Zendesk queue that's backing up while every headline is about video AI. I've spent years watching confident-sounding AI give a wrong answer to a real customer, which is exactly why the boring parts (simulating against your historical tickets before anything goes live, escalating instead of guessing when it isn't sure) matter more than a flashy demo.

Try eesel for the queue that video AI won't touch

If the actual problem on your plate is a growing backlog of support tickets, not a shortage of video content, that's what I build eesel for. It's an AI teammate that plugs into Zendesk, Freshdesk, Intercom, or whatever helpdesk you're already running. Setup takes minutes, it learns from your past tickets and help docs from day one, and it drafts, triages, or resolves tier-1 requests without a new stack to babysit.

Before it ever touches a live customer, eesel simulates against your historical tickets so you see exactly what it would have said and how much it would have resolved, the same guardrail-first instinct the LinkedIn commenters above were reaching for in a different context. It's how Gridwise resolved 73% of tier-1 requests in its first month, and how Smava runs a fully automated agent across 100,000+ tickets a month. Pricing is $0.40 per resolved ticket, no seat fees, and the first $50 of usage is free.

eesel AI helpdesk dashboard overview

Google's Omni Flash and Nano Banana 2 Lite are a genuinely good answer to "how do I make video content faster." If your actual question is "how do I stop drowning in tickets," a purpose-built AI helpdesk agent is the tool for that job, and you can try eesel free.

Frequently Asked Questions

What is Gemini Omni Flash?
It's Google DeepMind's first model in the new Gemini Omni family: a natively multimodal model that takes text, image, audio, and video as input and generates high-resolution video with audio, edited through natural-language conversation rather than re-rendering from scratch each time.

How much does Gemini Omni Flash cost?
On the Gemini API's paid tier, video output is billed at $17.50 per 1M tokens, which Google's own pricing page converts to roughly $0.10 per second of 720p video. A 10-second clip, the current maximum, works out to just over $1. There's no free tier for this model.

How is Gemini Omni Flash different from Veo?
Google itself prices Omni Flash's video output at the same $0.10 per second as Veo 3.1 Fast, and the two sit in the same generative-video family. The differentiator Google leans on for Omni is conversational, multi-turn editing of a scene you already have, not just first-pass generation from a prompt.

What can't Gemini Omni Flash do yet?
Clips are capped at 10 seconds, character consistency across scene changes or camera pans has known limitations, and audio-reference uploads and scene extension aren't supported in the API yet. Google is also deliberately restricting the model's ability to change what people say in a video while it works out how to do that responsibly.

Can Gemini Omni Flash help with customer support content?
It's built for video, not for staffing a support queue. If the actual job is answering tickets and chats, a purpose-built AI helpdesk agent that plugs into your existing tools is a closer fit than a generative-media model.


Article by Rama Adi Nugraha

Rama is a software engineer at eesel AI with two years of experience writing about B2B SaaS, AI tools, and customer support technology. Based in Bali, Indonesia, he brings a developer's perspective to product comparisons — cutting through marketing copy to what the integrations and APIs actually do.
