
While Gemini 3.5 Flash shows its strongest improvements in agentic and multimodal tasks, it has a notable weakness in programming, where it falls clearly behind competitors like GPT-5.5 and Claude Opus 4.7.
Google's new Gemini 3.5 Flash is a step up from its predecessor, but it costs more than five times as much to run. High token consumption on agent tasks pushes total costs past the pricier Pro model in benchmark testing.
Google DeepMind has released Gemini 3.5 Flash, the latest version of its Flash model family. Flash was long positioned as the cheaper, faster alternative to Google's more powerful Pro models. Artificial Analysis, which got early access, found that Gemini 3.5 Flash costs 5.5 times more to run in its benchmark testing than Gemini 3 Flash and nearly twice as much as Gemini 3.1 Pro. The context window stays at one million tokens.
Token prices alone have tripled: Google now charges $1.50 per million input tokens and $9.00 per million output tokens, up from $0.50 and $3.00 for Gemini 3 Flash. Per token, that's still cheaper than Gemini 3.1 Pro at $2.00 and $12.00.
In practice, though, the math flips. Gemini 3.5 Flash burns through so many more tokens on agent-based tasks that total costs end up 75 percent higher than Gemini 3.1 Pro, according to Artificial Analysis.
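The arithmetic behind that flip can be sketched in a few lines of Python. This is an illustration, not Artificial Analysis' methodology: the prices are the list prices quoted above, while the `Pricing` class, the helper functions, and the token counts in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """List prices in US dollars per one million tokens."""
    input_per_m: float
    output_per_m: float


# Prices as quoted in the article.
GEMINI_3_FLASH = Pricing(0.50, 3.00)
GEMINI_35_FLASH = Pricing(1.50, 9.00)
GEMINI_31_PRO = Pricing(2.00, 12.00)


def run_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Total cost in dollars for a run with the given token counts."""
    return (input_tokens * pricing.input_per_m
            + output_tokens * pricing.output_per_m) / 1_000_000


def break_even_token_ratio(cheap: Pricing, pricey: Pricing, cost_ratio: float) -> float:
    """How many times more tokens `cheap` must use to cost `cost_ratio` times `pricey`.

    Simplifying assumption: both models use the same input/output mix, and
    the price ratio is the same for input and output tokens (true for the
    two models compared here: 1.50/2.00 == 9.00/12.00 == 0.75).
    """
    price_ratio = cheap.input_per_m / pricey.input_per_m
    return cost_ratio / price_ratio


if __name__ == "__main__":
    # Hypothetical token counts, chosen only to show the effect.
    pro = run_cost(GEMINI_31_PRO, input_tokens=10_000_000, output_tokens=1_000_000)
    flash = run_cost(GEMINI_35_FLASH, input_tokens=30_000_000, output_tokens=1_000_000)
    print(f"Pro: ${pro:.2f}, Flash: ${flash:.2f}")
    print(f"Break-even token ratio: {break_even_token_ratio(GEMINI_35_FLASH, GEMINI_31_PRO, 1.75):.2f}")
```

Under the same-mix assumption, Gemini 3.5 Flash would have to use roughly 2.3 times as many tokens as Gemini 3.1 Pro to end up 75 percent more expensive. The real mix differs between models, so treat that figure as a rough guide only.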
How much the price hike stings will depend on the application. But Google is following a broader industry trend. Anthropic's Opus 4.7 had a hidden price increase of roughly 30 to 40 percent over its predecessor due to higher token consumption. OpenAI's GPT-5.5 jumped even more, about 50 to 90 percent over GPT-5.4. There, token consumption went down, but base prices went up. Google raised both.
For developers and companies, raw token price is becoming less useful as a standalone metric. What matters now is efficiency: how many tokens a model actually needs to finish a job.
Gemini 3.5 Flash scores 55 on the Artificial Analysis Intelligence Index, nine points above Gemini 3 Flash. That puts it ahead of Grok 4.3 (high, 53) and Claude Sonnet 4.6 (max, 52). Gains show up across nearly every category tested. As always, benchmarks capture only specific scenarios; real-world performance becomes clear only over extended use with everyday and novel tasks.
On AA Omniscience, which measures knowledge accuracy and hallucination tendency, Gemini 3.5 Flash improves by 11 points. Its hallucination rate drops to 61 percent, down 31 percentage points from Gemini 3 Flash. That improvement sounds impressive until you look at the leaders: MiMo-V2.5-Pro and Grok 4.3 (high) both sit at just 25 percent.
Agentic tasks have historically been a weak spot for Gemini. That's where 3.5 Flash improves the most. On GDPval-AA, which tests real agent tasks with web and shell access, it hit an Elo score of 1,656. That is a massive leap over Gemini 3 Flash (1,204) and Gemini 3.1 Pro (1,314), and just barely behind GPT-5.4 (xhigh, 1,674).
That performance comes at a cost. Gemini 3.5 Flash needs an average of 49 turns per task, more than any other model tested. Claude Opus 4.7 (max) takes 45, GPT-5.4 (xhigh) takes 40, and Gemini 3.1 Pro only needs 23. All those extra interaction steps drive input token consumption way up.
Output token usage barely changed: 73 million versus 72 million for Gemini 3 Flash. Input tokens are the culprit, pushing Gemini 3.5 Flash past Gemini 3.1 Pro in total cost despite lower per-token prices.
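A toy model helps show why extra turns hit input tokens so hard. It assumes each turn re-sends the whole conversation so far, with no prompt caching or context trimming, which may not match how the benchmark runs were actually billed. Only the turn counts (49 and 23) come from the figures above; the function name and the context sizes are hypothetical.

```python
def cumulative_input_tokens(turns: int, base_context: int, added_per_turn: int) -> int:
    """Total input tokens billed over an agent run in a toy model.

    Assumes every turn re-sends the full conversation so far, with no
    prompt caching or context trimming. Turn t (1-based) sends
    base_context + (t - 1) * added_per_turn input tokens.
    """
    return sum(base_context + t * added_per_turn for t in range(turns))


if __name__ == "__main__":
    # Hypothetical context sizes; only the turn counts come from the article.
    BASE_CONTEXT = 2_000
    ADDED_PER_TURN = 1_000

    flash_35 = cumulative_input_tokens(49, BASE_CONTEXT, ADDED_PER_TURN)
    pro_31 = cumulative_input_tokens(23, BASE_CONTEXT, ADDED_PER_TURN)
    print(f"49 turns: {flash_35:,} input tokens")
    print(f"23 turns: {pro_31:,} input tokens")
    print(f"ratio: {flash_35 / pro_31:.1f}x")
```

In this sketch, a bit more than twice the turns produces more than four times the input tokens, because the context sent per turn keeps growing. Real agent harnesses with caching or summarization would soften that effect.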
Programming is where fast, capable, cheap models are in highest demand, and it's where Gemini 3.5 Flash falls short. On the Artificial Analysis Coding Index, which combines Terminal-Bench Hard and SciCode, it scores just 45. That's well behind Gemini 3.1 Pro Preview (55) and far behind GPT-5.5 (xhigh, 59) and GPT-5.4 (xhigh, 57). Claude Opus 4.7 (max, 53) and Claude Sonnet 4.5 (max, 51) also beat it.
For a model that matches these rivals on the overall intelligence index, that's a striking gap. Its strengths clearly lie in agentic and multimodal tasks, but coding is one of the most important use cases for agentic AI, which limits the practical value of those agent gains.
Gemini 3.5 Flash clocks over 280 output tokens per second, roughly 70 percent faster than Gemini 3 Flash, according to Artificial Analysis. No other model with similar intelligence comes close to that output rate.
Unlike many rivals, it also supports video and audio input alongside text and images. Claude Opus 4.7, Grok 4.3, and GPT-5.5 are limited to image input, per Artificial Analysis. On the multimodal benchmark MMMU-Pro, Gemini 3.5 Flash scores 84 percent, the highest result ever recorded. Google takes the top two spots, with Gemini 3.1 Pro second at 82 percent.
The rising prices reflect a deeper shift: today's AI models are built for complex, multi-step tasks where they plan on their own, use tools, and work through many rounds of interaction. That agentic behavior needs more compute per task than simple chatbots.
Unless inference costs for the underlying hardware drop as fast as compute per task goes up, prices for stronger models will keep climbing. For simpler use cases, cheaper older models or smaller options like Gemini 3.1 Flash-Lite will still be around.
For companies, AI return on investment is getting harder to pin down. Isolated tasks like code generation or translation are easier to measure (faster turnaround, lower staffing costs), but even there, the picture is muddier than it looks.
Knowledge work is where it gets really fuzzy. How do you put a number on a better decision memo or a strategy paper finished in half the time with AI? And what about downstream costs: time spent checking for errors or the learning that doesn't happen when AI does the work?
Those productivity gains tend to be spread thin across departments, show up late, and are hard to separate from other factors. Paying for pricier models is a bet that the efficiency gains will be worth it and that AI-assisted work is just how things will be done. A deep dive into this topic is available in our AI Radar #2.