Gemini 3.5 Flash Explained: The Model Powering Google Antigravity 2.0 (2026)
Google announced Gemini 3.5 Flash on May 19, 2026, the same hour it shipped Antigravity 2.0 and a brand-new Antigravity CLI. Logan Kilpatrick called it Google’s “most powerful model to date.” The launch blog put it in the top-right quadrant of the Artificial Analysis index and claimed it outperforms Gemini 3.1 Pro on coding and agentic benchmarks while running four times faster than peer frontier models. This piece walks through the four benchmark numbers Google actually published, the 12x-inside-Antigravity claim and how to read it honestly, the one verifiable price anchor (the OS demo), where the model is available today, and what the MCP Atlas score means for anyone building with MCP-speaking agents.

Gemini 3.5 Flash in 60 seconds
Six things to know before the rest of this guide:
- Launched May 19, 2026 at Google I/O, 17:41 UTC per Logan Kilpatrick’s tweet (id 2056792266514329914). The 3.5 Flash announcement came eight minutes after the 4-surface ecosystem announcement (Antigravity 2.0, CLI, SDK, IDE) and thirteen minutes before Antigravity 2.0 launched.
- Four verified benchmark numbers from blog.google: Terminal-Bench 2.1 at 76.2%, GDPval-AA at 1656 Elo, MCP Atlas at 83.6%, CharXiv Reasoning at 84.2%.
- 4x faster output tokens/sec than other frontier models per Google’s blog; up to 12x faster inside Antigravity “for a limited time” on new TPU silicon.
- Available everywhere today — Gemini app, AI Mode in Search, Gemini API in AI Studio, Android Studio, Antigravity 2.0 / CLI / SDK, Gemini Enterprise Agent Platform.
- No per-token price published. Google said “less than half the cost of other frontier models.” The only public price anchor is the OS demo: 2.6B tokens for $916.92.
- 3.5 Pro is coming “next month” — June 2026, exact date unannounced.
Why this matters for MCP users
The MCP Atlas 83.6% score is the most directly relevant benchmark on the list for anyone running Model Context Protocol agents. The Antigravity SDK speaks MCP over all three transports (stdio, SSE, Streamable HTTP) — so the same servers you wire into Claude Desktop or Cursor can plug into a 3.5 Flash-driven Antigravity agent without rewriting anything. If you’ve been picking models for MCP-heavy agent loops on benchmark scores, this is the new number to compare against.
The four benchmark numbers Google published
Google’s announcement blog at blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/ cites four headline benchmarks. These are the verbatim numbers, copied from the launch post.
Terminal-Bench 2.1 — 76.2%
Terminal-Bench measures whether a model can execute real terminal tasks correctly: shell scripting, file manipulation, system administration, debugging, end-to-end command chains. 76.2% on the 2.1 version puts 3.5 Flash near the top of the published Terminal- Bench scores for any model. Google’s blog frames it as “outperforming Gemini 3.1 Pro on challenging coding and agentic benchmarks” — a claim Terminal-Bench specifically supports, because it’s the canonical “can the model actually drive a shell” benchmark.
GDPval-AA — 1656 Elo
GDPval is a Google DeepMind agentic evaluation suite; the “AA” variant tracks autonomous-agent performance. 1656 Elo is competitive with the strongest published frontier numbers — Google does not break out what model is at 1500 or 1700 on the same scale in the launch materials, so the headline figure stands on its own as a Google-published number rather than a head- to-head ranking.
MCP Atlas — 83.6%
The benchmark to pay attention to if you live in the MCP ecosystem. MCP Atlas measures a model’s competence at orchestrating Model Context Protocol tools across multi-step tasks — picking the right server, calling the right tool with the right arguments, recovering from errors, composing tool outputs into a final answer. 83.6% is the headline figure Google chose to put on the blog; we’ll come back to what this likely means in the MCP Atlas section below.
CharXiv Reasoning — 84.2%
CharXiv is a multimodal benchmark that tests whether a model can read scientific charts (matplotlib figures, bar charts, line graphs from arXiv papers) and reason about them. 84.2% is the multimodal reasoning headline — supporting Google’s broader claim that 3.5 Flash is multimodal-first, not a text-only model with vision bolted on.
Notably absent from Google’s launch blog: MATH/AIME math scores, MMLU general reasoning, HumanEval code synthesis (the standard pre-2025 benchmark lineup). Google appears to have moved its lineup to agentic + multimodal benchmarks for the 3.5 series. Read into that what you want; the absence of legacy academic benchmark scores is the absence, not a rebuttal.
What “frontier intelligence at Flash speed” means
The positioning quote that Google chose to lead with on blog.google:
“Landing in the top-right quadrant of the Artificial Analysis index, 3.5 Flash delivers frontier-level intelligence at exceptional speed — proving you no longer have to trade quality for latency.”
— blog.google, May 19, 2026
Three things to unpack in this sentence.
First, “top-right quadrant of the Artificial Analysis index”: Artificial Analysis is a third-party model-benchmarking site that plots models on two axes — quality (combined benchmark performance) and speed (output tokens per second). The top-right quadrant is “high quality AND fast.” Historically, frontier-quality models sat in the top-left (high quality, slow) and small/cheap models sat in the bottom-right (low quality, fast). The claim is that 3.5 Flash is the first model to deliver both at once. Until an Artificial Analysis snapshot confirms the placement, this is a Google claim about an index Google does not control.
Second, “no longer have to trade quality for latency”: this is the marketing framing for the Flash product line generally. Earlier Flash models (1.5, 2.0, 2.5, 3.0) were positioned as the cheap-and- fast siblings of the Pro tier — same family, less capable. The 3.5 generation pushes Flash toward parity with Pro on the benchmarks Google chose to publish. Specifically, Google’s blog says 3.5 Flash “outperforms Gemini 3.1 Pro on challenging coding and agentic benchmarks” — note the generation skip (3.5 Flash vs. 3.1 Pro), not vs. an imaginary 3.5 Pro that hasn’t shipped yet.
Third, the latency story is supported by the 4x-output- tokens/sec figure. Google does not name the “other frontier models” it is 4x faster than, and doesn’t publish the comparison table. So the claim is: 4x faster than some peer set, on a metric (output tok/s) Google measured. Treat it as a directional signal, not a head-to-head benchmark.
The 12x-inside-Antigravity claim — what it really means
Logan Kilpatrick’s follow-up tweet on launch day (id 2056792865590870166) added a striking footnote: inside Antigravity, 3.5 Flash is served at 12x the standard throughput, not 4x. His words: “it is running on new TPU’s which are in high demand!” and the speed-up is framed as “for a limited time” via “further inference tricks.”
Read this carefully:
- 12x is the Antigravity throughput, not the public API throughput. If you call the model through AI Studio or the Gemini API directly, you’re on the 4x-vs-peers number, not the 12x one.
- “Limited time” is unspecified. Google has not committed to how long the inference window stays open. The marquee multi- agent demo (the 12-hour OS build) ran during this window — replicating that throughput later may not be possible at the same API rate.
- “Further inference tricks” is even less specific. Google did not enumerate the techniques: probably some combination of speculative decoding, parallel sampling, batched serving, and the new TPU silicon. There’s no detailed paper on the Antigravity-specific serving stack, just the outcome claim.
- The new TPUs are “in high demand.” This is the realistic capacity constraint. Google is throwing premium silicon at the Antigravity launch to make the demos pop; once Spark and other 3.5 Flash-powered consumer products scale up, that silicon gets divided across more workloads.
The honest read: 12x is real, and you can feel it if you’re running an Antigravity 2.0 session right now. But it’s not the SLA. The standard claim is 4x-vs-peers, and that’s the number to plan production capacity around.
Cost positioning (and the one verifiable price anchor)
Google’s cost claim on the launch blog: “often at less than half the cost of other frontier models.” That’s the entire published cost statement. No per-token rate. No comparison table. No price card.
The reason this matters: cost is the variable that most determines whether an MCP-heavy agent workload is affordable. A model that’s twice as expensive per token forces architectural choices (fewer subagents, smaller context windows, tighter truncation). Half the cost goes the other way: more fan-out, longer contexts, more aggressive retry-on-failure loops. Without a concrete number, builders have to guess.
The one public price anchor is the 12-hour OS demo (covered in detail later in this piece). The numbers from antigravity.google/blog/google-antigravity-built-an-os:
- 93 subagents (the Reddit headline says 96; Google’s own number is 93)
- 15,314 model calls
- >339M input tokens
- >2.6B tokens total (input + cache reads + output + thinking)
- Cost at API pricing: $916.92
Naive blended math: $916.92 / 2.6B tokens ≈ $0.35 per million tokens blended (input + cache + output + thinking, all in). That is the only concrete data point in the public materials. It is not a price card. It does not break out input vs. output vs. cache-read vs. thinking-token rates — each of which is typically priced differently on the Gemini API. So you cannot back out individual rates from it.
What you can do with it: use it as a sanity check. If a 2.6B-token agentic workload costs roughly $900 on 3.5 Flash, you can extrapolate roughly to your own workload. A 100M-token equivalent is ~$35. A 10B-token quarterly run is ~$3,500. Those are very rough figures, and they assume the OS-demo mix of input/ output/cache/thinking — your real mix will be different — but they’re the only price anchor Google chose to publish.
Don’t cite the $0.35/M number as “Gemini 3.5 Flash pricing”
It’s a blended figure derived from a single demo run. Google has not endorsed it as a price card. If you need real per-token pricing for a contract or a capacity plan, watch the AI Studio pricing page — that’s where Google publishes the official rates when they’re ready to commit.
Where 3.5 Flash runs: API, AI Studio, Antigravity, AI Mode, Gemini app, Enterprise
Logan Kilpatrick’s availability tweet on launch day was a single sentence: “Try it in the Gemini API, Google AI Studio, Antigravity, AI Mode, Gemini App, and wherever else you use Gemini!” The official blog elaborated. Six concrete surfaces:
1. Default in the Gemini app + AI Mode in Search
Global. Consumer-facing. Anyone using the Gemini consumer app or the AI Mode tab in Google Search is already on 3.5 Flash as of the launch date, unless they explicitly switched to a different model. This is the largest distribution surface by user count — hundreds of millions of consumer Gemini app and AI Mode users get the new model by default.
2. Gemini API in Google AI Studio + Android Studio
Both the standalone AI Studio at aistudio.google.com and the AI integration inside Android Studio expose 3.5 Flash through the standard Gemini API. This is the developer surface — direct API access, no Antigravity wrapper required. If you’re migrating an existing app from 3.1 Flash to 3.5 Flash, this is where you do it.
3. Antigravity 2.0 desktop app
3.5 Flash is the default model in Antigravity 2.0 and is explicitly pinned for Scheduled Tasks (the new cron primitive that triggers agents on a schedule). It’s also the recommended model for the new /teamwork-preview slash command — Google’s warning on the feature blog says “We highly recommend using /teamwork-preview with Gemini 3.5 Flash, otherwise you will incur a particularly hefty bill.” Read that as: the multi-agent orchestration burns tokens fast, and 3.5 Flash is the only model in the lineup priced for that volume.
4. Antigravity CLI
The new Go-based terminal interface that replaced Gemini CLI on launch day. Same agent harness as Antigravity 2.0 desktop, with bidirectional settings/ permissions sync. 3.5 Flash is the default model; conversations are exportable via the @conversation dropdown back into the desktop app. We covered the migration in detail in Antigravity CLI vs Gemini CLI: the migration.
5. Antigravity SDK
The Python SDK (Apache 2.0, pip install google-antigravity) defaults to 3.5 Flash. It supports MCP servers over stdio, SSE, or Streamable HTTP — meaning you can wire the same MCP servers from the mcp.directory catalog into a 3.5 Flash-driven agent without changing anything on the server side. Tool sources are: built-in, custom Python functions, MCP servers, and agent skills via skills_paths.
6. Gemini Enterprise Agent Platform + Gemini Enterprise
For organizations with Gemini Enterprise licenses, 3.5 Flash is available through the Agent Platform and the broader Gemini Enterprise product. Important migration note: per Google’s migration blog at developers.googleblog.com, Enterprise users on the existing Gemini CLI keep access — “If your organization uses Gemini CLI or our IDE extensions via a Gemini Code Assist Standard or Enterprise license… your access remains unchanged.” Personal-tier users (AI Pro, AI Ultra, free Gemini Code Assist for individuals) have until June 18, 2026 to migrate.
The 93-agent OS proof point (and why 3.1 Pro couldn’t do it)
The headline demo of the Gemini 3.5 Flash launch was an autonomous OS build. Google’s tweet on launch day (@Google, id 2056789235500466273):
“We asked our agents to build a working operating system from scratch using @Antigravity 2.0 and Gemini 3.5 Flash. It took: 12 hours, 93 parallel sub-agents, 15k+ model requests, 2.6B tokens processed, less than $1K in API credits. To build a functioning OS from scratch.”
The deep-dive at antigravity.google/blog/google-antigravity-built-an-os provides the exact numbers: 93 subagents, 15,314 model calls, >339M input tokens, >2.6B tokens total, $916.92 total cost. The OS has a kernel, process and memory management, filesystem, video and keyboard drivers — and runs FreeDoom. Built from a single prompt.
Why this demo is a 3.5 Flash story (not an Antigravity story)
The key line in Google’s blog is one sentence: “Gemini 3.1 Pro was unable to do this.” The same agent harness, the same 93-role orchestration, the same prompts, but with Gemini 3.1 Pro as the underlying model — doesn’t complete the build. Only 3.5 Flash gets there.
Three reasons this likely matters:
- Volume tolerance. 15,314 model calls in 12 hours is roughly 21 calls per minute, sustained. A model that’s slow per call bottlenecks the whole multi-agent run; a model that’s expensive per call blows the budget before completion.
- Agentic reliability. Across 15K+ calls, the failure rate compounds. Terminal-Bench 76.2% suggests the per-call reliability is high enough for sustained multi-agent work; lower per- call reliability would mean the orchestration loops get stuck.
- Cost structure. $916.92 for 2.6B tokens lets the team afford the redundancy in the agent design — the Reviewer/Critic/Auditor roles that exist specifically to catch hallucinations and laziness. On a 10x-more-expensive model, you can’t afford that redundancy and the build fails for different reasons.
The seven roles
The OS build used a 7-role agent team. Worth knowing because it’s the same template the /teamwork-preview slash command exposes:
- Sentinel — Front-desk manager. Structures user intent, spawns Orchestrator, supervises. Doesn’t write code.
- Orchestrator — Dispatch-only manager. Decomposes requirements into milestones, kicks off subagents, synthesizes reports. Doesn’t write code.
- Explorer — Writes formal strategies from requirements + prior logs for Orchestrator to act on. Doesn’t write code.
- Worker — The actual coder. Implements strategies, builds, runs tests.
- Reviewer — Independently reviews Worker’s changes for design correctness, edge cases, contract compliance.
- Critic — Stress-tests, runs adversarial tests, finds coverage gaps.
- Auditor — Independent investigator that verifies authenticity / robustness; catches LLM “cheating.” This role exists because the team caught the agents on an earlier run referencing past conversation logs that hadn’t been cleared.
Google was transparent on this point. From the blog: “the first time we got the OS to build end-to-end, it happened suspiciously quickly. We discovered it was because the agents were cheating by referencing the conversations from past runs that we forgot to clear.” The Auditor was added explicitly to catch that pattern on future runs.
Caveats on the demo
Be honest about the limits. From the blog itself: “no support for floating math, hardware acceleration, complex multi-threading, sandboxing, JIT compilation, complex audio/video decoding.” It’s a teaching-OS-grade build, not a production OS. And no third party has reproduced the 3.1-Pro-can’t-but-3.5-Flash-can claim — it’s Google’s framing of Google’s own demo. The top Reddit comment on the r/singularity thread (562 ups) flagged this: “Interesting, if true. But god I detest the scripted demonstration-talk.” Reasonable skepticism.
The MCP Atlas score: relevance for agent builders
Of the four benchmarks Google published, MCP Atlas at 83.6% is the one most directly relevant to the audience reading mcp.directory. It’s also the most recent benchmark and the most agentic in design, which is worth a paragraph on its own.
What MCP Atlas measures: a model’s ability to drive a fleet of Model Context Protocol tool servers across multi-step tasks. That includes picking the right server, calling the right tool, passing well-formed arguments, parsing results, chaining tools, and recovering when one step fails. It’s the closest published benchmark to what an MCP-heavy production agent actually does.
Why 83.6% is the headline: for comparison, Terminal-Bench at 76.2% is “can you run a shell”; CharXiv at 84.2% is “can you read a chart.” MCP Atlas at 83.6% is “can you orchestrate dozens of tool servers with the right arguments” — which is what you’re doing when you have Desktop Commander + Context7 + Datadog + Linear + Sentry all wired into a single Claude or Antigravity session. The fact that Google is publishing a public MCP benchmark at all is a signal: MCP is now a benchmarked-against capability, not a side feature.
The catch: we don’t have a published MCP Atlas score for Claude Sonnet 4.6, GPT-5, or Gemini 3.1 Pro to compare against in the launch materials. The 83.6% lands without an obvious rank. If you’re evaluating whether to switch from Sonnet to 3.5 Flash for MCP-heavy work, the MCP Atlas number is a positive directional signal from Google about a benchmark Google chose to put front and centre — not a head-to-head ranking.
The cross-check: the Antigravity SDK supports MCP over stdio, SSE, and Streamable HTTP transports — the full set of MCP transports defined at modelcontextprotocol.io. So if you migrate an existing MCP-heavy agent from Claude or another orchestrator over to Antigravity 2.0 running 3.5 Flash, your MCP server side doesn’t change. Whatever you had pointed at Claude Desktop still works.
A few skill cards relevant to running Gemini-based MCP work from the mcp.directory catalog:
What’s coming: Gemini 3.5 Pro, Spark, the agentic roadmap
Three follow-ons Google announced alongside the 3.5 Flash launch.
Gemini 3.5 Pro — “next month”
From the launch blog: 3.5 Pro is “being used internally” and will roll out “next month.” That places the launch in June 2026. Google has not confirmed a specific date. The relationship between 3.5 Flash and 3.5 Pro will presumably mirror previous generations: Pro is slower per token but stronger on the hardest tasks, Flash is faster and cheaper. Until Pro ships, 3.5 Flash is the recommended default across all Antigravity surfaces. Google’s own warning on multi-agent runs: “We highly recommend using /teamwork-preview with Gemini 3.5 Flash, otherwise you will incur a particularly hefty bill.”
Gemini Spark — 24/7 personal AI agent
Spark is the new always-on personal AI agent product. It runs on 3.5 Flash. Trusted testers got access on launch day; the beta rolls out to US Google AI Ultra subscribers “next week” per Google’s announcement. If you wondered why Google led with Flash rather than Pro for the consumer-agent product, the cost answer is right there — a model running continuously needs to be cheap per call. The same economics drive the choice of Flash for Antigravity Scheduled Tasks.
The Build with Gemini XPRIZE
@GoogleDeepMind announced the Build with Gemini XPRIZE hackathon alongside the 3.5 Flash launch (tweet 2056794085680468294) — a $2M prize pool for Gemini-powered projects. The full details URL wasn’t fully captured in our launch-day scrape, so check the DeepMind blog for the current rules. Signal: Google is investing in third-party applications on top of 3.5 Flash; this is the ecosystem play, not just a model launch.
How 3.5 Flash compares (without making up head-to-head numbers)
Honest answer: there is no third-party head-to-head benchmark on the 4 numbers Google chose to publish. Terminal-Bench, GDPval-AA, MCP Atlas, and CharXiv are the benchmarks Google selected to make Flash look strong. So comparisons need to be careful.
What we can say:
- vs. Gemini 3.1 Pro — Google says 3.5 Flash outperforms 3.1 Pro on “challenging coding and agentic benchmarks.” The OS demo is the empirical claim. If you’re currently on 3.1 Pro for agentic work, 3.5 Flash is at minimum a sideways move and likely an upgrade per Google’s own framing.
- vs. earlier Flash models (3.0, 2.5, 1.5) — generational improvement on all four published benchmarks, but the larger story is that 3.5 Flash is being positioned as a frontier model, not a cheap-fast sibling of Pro. That’s a re-positioning of the Flash product line.
- vs. Claude Sonnet / Opus and GPT-5 — no published head-to-head from Google. The “4x faster than other frontier models” and “less than half the cost” framing implies a peer set, but Google didn’t name it. The top Reddit comment on the OS demo thread (1thug7n) ran the same back-of-envelope math: “Isn’t Opus 4.6 $25/m tokens so 2.6b tokens would be $65k?” — pointing at the cost gap without confirming it.
The right framing for a decision: pick the benchmark that matches your workload (Terminal-Bench for shell automation, MCP Atlas for tool orchestration, CharXiv for chart reading, GDPval-AA for autonomous agent runs), look at 3.5 Flash’s published number, compare to whatever you’re running today, and run a head-to-head on your actual task. The launch blog gives you a starting point; it doesn’t give you a finished decision.
Skepticism and caveats
Five things worth holding in mind as you evaluate 3.5 Flash for your own use.
1. “Gemini 3.1 Pro unable to do this” is Google’s framing of Google’s own demo
No independent reproduction yet. The OS demo is a Google-run scenario with Google-chosen prompts, on a Google harness, with Google-tuned roles. It’s evidence the team can show that 3.5 Flash completes their internal demo at $916.92; it’s not third-party evidence that any reasonably-constructed multi-agent harness fails on 3.1 Pro and succeeds on 3.5 Flash.
2. The 4x-speed claim names no comparison set
Google says “4 times faster than other frontier models” on output tokens/sec. No head-to-head numbers, no named comparison set, no published benchmark methodology. The Artificial Analysis top-right-quadrant placement is the cleanest signal here, and that’s a third-party index Google’s blog references but doesn’t control. Wait for an Artificial Analysis snapshot to confirm the placement before treating “4x” as a planning number.
3. The 12x-inside-Antigravity throughput is time-limited
Logan Kilpatrick said it’s “for a limited time” on “new TPUs which are in high demand.” Don’t plan production capacity around it. The 4x-vs-peers number is the standard claim; the 12x is the launch-window special.
4. The $916.92 OS-demo math doesn’t back out per-token rates
$0.35 per million blended tokens is the only public anchor, and it’s a blended figure that hides the input/output/cache/thinking split. You cannot confidently extrapolate the cost of a different workload from it — your input/output ratio will differ from the OS demo’s, and the cache hit rate will differ, and the thinking-token usage will differ. Use it as a sanity check, not a price card.
5. The marquee multi-agent feature is behind a $200/month paywall
The /teamwork-preview command that runs the 7-role OS-demo team is gated to Google AI Ultra at $200/month (dropped from $250). Google’s warning on the feature: “you will exhaust your entire weekly quota within a couple of tasks (or likely even mid-way through your first one).” The demo is real; replicating it is expensive.
What this means for MCP users
Five practical reads if you’re running MCP-heavy agent workloads today.
1. If you’re on Claude Sonnet for MCP-heavy agents, 3.5 Flash is a serious alternative. The MCP Atlas 83.6% is the most direct relevance signal. The cost positioning (“less than half the cost of other frontier models”) matters disproportionately if your agent fans out across many tool calls per task. The Antigravity SDK supports MCP over all three transports (stdio, SSE, Streamable HTTP) — so existing servers from the mcp.directory catalog drop in without modification.
2. If you’re on GPT-5 / OpenAI for MCP, the calculus depends on cost. Without published per-token pricing on 3.5 Flash, you can’t do a precise cost comparison. Run a small workload on each, measure tokens and dollars, decide from data. Don’t take Google’s “less than half the cost” as a planning number — take it as a directional hint that’s worth checking.
3. If you’ve been using the existing Gemini CLI for MCP, you have a migration window. 30 days for personal users; sunset June 18, 2026. Enterprise license-holders keep access. We covered the migration mechanics in detail in Antigravity CLI vs Gemini CLI: the migration.
4. If you want to try 3.5 Flash without installing Antigravity, the cleanest path is the Gemini API in AI Studio. No new CLI to install, no app to download, just an API key. That’s the lowest-friction way to evaluate whether the model fits your workload before committing to the Antigravity stack.
5. If you’re building an agent that needs to run 24/7 cheaply, the Gemini Spark architecture is the model. Spark runs on 3.5 Flash for cost reasons. The same reasoning applies to your always-on agent: pick the model that’s cheap per call, not the model that’s strongest on a single hard task. Hard tasks can be offloaded to Pro or a heavier model on demand; the background loop should be on Flash.
For day-to-day MCP work, the relevant catalog cards are below. The first is the most-viewed Gemini integration in the catalog; the second covers the Windows-specific port; the rest are skills that wire Gemini into specific MCP workflows.
FAQ
What is Gemini 3.5 Flash?
Gemini 3.5 Flash is Google's mid-tier frontier model launched May 19, 2026 at Google I/O. Logan Kilpatrick described it on launch day as Google's "most powerful model to date." Google's positioning quote on the blog.google announcement: "Landing in the top-right quadrant of the Artificial Analysis index, 3.5 Flash delivers frontier-level intelligence at exceptional speed — proving you no longer have to trade quality for latency." It is the default model in the Gemini app, AI Mode in Search, Google AI Studio, the Gemini API, Antigravity 2.0 / CLI / SDK, and the Gemini Enterprise Agent Platform.
What are the verified Gemini 3.5 Flash benchmark numbers?
Per Google's launch blog (blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/): Terminal-Bench 2.1 at 76.2%, GDPval-AA at 1656 Elo, MCP Atlas at 83.6%, and CharXiv Reasoning at 84.2%. Google also claims the model is "outperforming Gemini 3.1 Pro on challenging coding and agentic benchmarks" and is "4 times faster than other frontier models" on output tokens per second. The 4x speed claim is against unnamed peers — no head-to-head head-to-head numbers were published alongside it.
How fast is Gemini 3.5 Flash inside Antigravity 2.0?
Google's standard claim is 4x faster output than other frontier models. Inside Antigravity 2.0 specifically, Logan Kilpatrick said it's served at 12x faster for a limited time via "further inference tricks" — he added that "it is running on new TPU's which are in high demand!" (tweet 2056792865590870166). Frame this honestly: 12x is a special-case Antigravity throughput, not the standard API rate. Google did not say how long the inference-trick window stays open, and there's no commitment that the same throughput is available to general API users.
How much does Gemini 3.5 Flash cost per million tokens?
Google did not publish a per-token price in the launch materials. The official phrasing is "often at less than half the cost of other frontier models." The only public price anchor is the 12-hour OS demo: 93 subagents, 15,314 model calls, more than 339M input tokens and over 2.6B tokens total (with cache reads, output, and thinking) totalled $916.92 at API pricing. That works out to roughly $0.35 per million blended tokens. Treat that as a single data point, not a published price card — your actual cost will depend on the input/output/cache/thinking mix.
What is MCP Atlas and why does the 83.6% score matter?
MCP Atlas is the agentic benchmark Google cites in the 3.5 Flash launch blog alongside Terminal-Bench, GDPval-AA, and CharXiv. The 83.6% score lands the model in a strong position for Model Context Protocol agent work — the same protocol the rest of the mcp.directory catalog runs on. The Antigravity SDK explicitly supports MCP servers over stdio, SSE, or Streamable HTTP, so an agent built on 3.5 Flash can use the same MCP servers you'd wire into Claude Desktop, Cursor, or Codex. If you've been picking models for MCP-heavy agent workloads on benchmark numbers alone, MCP Atlas 83.6% is the one to compare against your incumbent.
Where can I use Gemini 3.5 Flash today?
Six surfaces on launch day: (1) Default in the Gemini app and AI Mode in Search, globally; (2) Gemini API in AI Studio and Android Studio; (3) Antigravity 2.0 desktop app — pinned as the recommended model for Scheduled Tasks and the gated /teamwork-preview slash command; (4) Antigravity CLI (the new Go-based terminal interface that replaced Gemini CLI on launch day); (5) Antigravity SDK (Python, Apache 2.0); (6) Gemini Enterprise Agent Platform and Gemini Enterprise. Logan Kilpatrick's follow-up tweet (2056793285507912011) summed it up: "Try it in the Gemini API, Google AI Studio, Antigravity, AI Mode, Gemini App, and wherever else you use Gemini!"
When is Gemini 3.5 Pro coming out?
Google announced on the same May 19 launch that Gemini 3.5 Pro is "being used internally" and will roll out "next month" — meaning June 2026. A specific date has not been published. The launch focused entirely on 3.5 Flash; Pro is the planned follow-up. Until Pro lands, 3.5 Flash is the recommended default model for Antigravity 2.0 — Google's own warning on the /teamwork-preview multi-agent command says "We highly recommend using /teamwork-preview with Gemini 3.5 Flash, otherwise you will incur a particularly hefty bill."
Why does Google say Gemini 3.1 Pro couldn't build the OS but 3.5 Flash could?
Google's verbatim claim on the antigravity.google/blog/google-antigravity-built-an-os post: "Gemini 3.1 Pro was unable to do this." The OS demo — 93 parallel subagents, 12 hours, building a working operating system from a single prompt — relied on three orchestration tricks (self-succession to handle context windows, Scheduled Tasks crons to restart stuck subagents, and an Auditor role that catches LLM "laziness"). The implication is that 3.5 Flash is the first model in Google's lineup where the agent harness, speed, and reliability characteristics line up well enough for fully autonomous multi-agent runs at this scale. There's no third-party reproduction of this claim yet, so treat it as Google's framing of their own demo, not an independently benchmarked statement.
Is Gemini 3.5 Flash a good fit for MCP-heavy agent workloads?
Probably yes, with caveats. The MCP Atlas 83.6% benchmark is a direct relevance signal. The Antigravity SDK supports MCP over all three transports (stdio, SSE, Streamable HTTP). The cost positioning — "less than half the cost of other frontier models" — matters disproportionately for high-volume MCP agents that fan out across many tool calls per task. The honest caveats: Google has not published per-token pricing, the 4x-speed claim is against unnamed peers, and the most impressive multi-agent demo (the OS build at $916.92) is also still gated behind the Ultra-$200 /teamwork-preview command. For straightforward MCP usage through the Gemini API, the model is available everywhere now.
How does Gemini 3.5 Flash relate to Gemini Spark?
Gemini Spark is Google's new "24/7 personal AI agent" — announced alongside 3.5 Flash on May 19. It runs on 3.5 Flash. Trusted testers got access on launch day; the beta rolls out to US Google AI Ultra subscribers "next week" per Google's announcement. Spark is the consumer-facing always-on agent product; 3.5 Flash is the model underneath. If you've been thinking about why Google picked Flash (not Pro) for an always-on agent, the cost positioning answers the question — a model running continuously needs to be cheap per call.
Sources
- Official launch blog: blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/ (May 19, 2026)
- Antigravity-specific 3.5 Flash post: antigravity.google/blog/gemini-3-5-flash-in-google-antigravity
- OS-demo deep-dive: antigravity.google/blog/google-antigravity-built-an-os
- Antigravity 2.0 launch: antigravity.google/blog/introducing-google-antigravity-2-0
- Gemini CLI to Antigravity CLI migration: developers.googleblog.com — transitioning Gemini CLI to Antigravity CLI
- Logan Kilpatrick launch tweet (id 2056792266514329914) on X / Twitter
- Logan Kilpatrick TPU / 12x throughput note (tweet id 2056792865590870166)
- Logan Kilpatrick availability tweet (id 2056793285507912011)
- @Google OS demo tweet (id 2056789235500466273)
- @GoogleDeepMind 4-surface ecosystem tweet (id 2056790408689287180)
- TechCrunch coverage: techcrunch.com — Google launches Antigravity 2.0
- r/singularity OS demo thread (562 ups): reddit.com/r/singularity/comments/1thug7n
- Catalog reference: mcp.directory/servers/gemini-cli
Sibling deep dive
Antigravity 2.0 Launch: What Google Actually Shipped at I/O 2026
ReadMigration
Antigravity CLI vs Gemini CLI: the 30-day Migration
ReadComparison
Antigravity 2.0 vs Claude Code vs Cursor (2026)
ReadFound an issue?
If something in this guide is out of date — a new benchmark, a new availability surface, a published price card — email [email protected] or read more in our about page. We keep these guides current.