Google Releases Gemma 4: Four Open Models, Apache 2.0, and a Shot at the Open-Weight Crown
Google DeepMind's Gemma 4 ships four models from 2B to 31B parameters, all under Apache 2.0. The 31B Dense model ranks #3 among open models globally. Here's everything you need to know.
Google DeepMind released Gemma 4 today: four open models built from Gemini 3 research, all under the Apache 2.0 license. The largest variant, a 31B dense model, currently sits at #3 on the LMArena text leaderboard among open models. The 26B mixture-of-experts variant, which activates only 3.8 billion parameters during inference, holds #6, beating models 20 times its size.
The Gemma line has been downloaded more than 400 million times since its launch in February 2024. This release is the biggest jump yet.
The Four Models
Gemma 4 ships in four sizes, each designed for a different deployment scenario.
| Model | Parameters | Active Params | Context | Audio Support |
|---|---|---|---|---|
| E2B | 5.1B | 2.3B | 128K | Yes |
| E4B | 8B | 4.5B | 128K | Yes |
| 26B MoE | 26B | 3.8B | 256K | No |
| 31B Dense | 31B | 31B | 256K | No |
The “E” in E2B and E4B stands for “Effective.” These are the edge models, designed to run on phones and low-power hardware. The E2B runs in under 1.5GB of memory. On a Raspberry Pi 5, it hits 133 tokens per second for prefill and 7.6 tokens per second for decode.
The 26B MoE is the efficiency play. It has 26 billion total parameters but only routes through 3.8 billion on any given inference pass. That means you get 26B-class intelligence at roughly 4B-class speed and memory cost.
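The "routes through 3.8 billion" trick is standard top-k expert routing: a small learned router scores all experts but only the top few actually run per token. The sketch below is a generic top-k softmax router for one token, not Gemma 4's actual routing code; all names and sizes are illustrative.

```python
import numpy as np

def route_top_k(hidden, router_weights, k=2):
    """Generic top-k MoE router sketch: pick k experts for one token.

    hidden:         (d_model,) activation for one token
    router_weights: (n_experts, d_model) learned routing matrix
    Returns (expert_ids, gate_weights). Only these k experts run,
    so per-token compute scales with k, not with n_experts.
    """
    logits = router_weights @ hidden              # (n_experts,) routing scores
    top = np.argsort(logits)[-k:][::-1]           # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                          # softmax over the selected experts
    return top, gates

rng = np.random.default_rng(0)
hidden = rng.standard_normal(64)
router = rng.standard_normal((16, 64))            # 16 experts, only 2 active per token
experts, gates = route_top_k(hidden, router, k=2)
print(len(experts), round(float(gates.sum()), 6))  # -> 2 1.0
```

The gate weights mix the two selected experts' outputs; the other 14 experts contribute nothing to this token's forward pass, which is where the 26B-total/3.8B-active gap comes from.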
The 31B Dense is the raw performance model. All 31 billion parameters fire on every pass. Google positions it as the best starting point for fine-tuning.
What They Can Do
Every model in the family handles text, images, and video natively. The two smaller models (E2B and E4B) add native audio input on top of that, which means speech understanding runs directly on device without a separate transcription step.
All four models support:
- Function calling and structured JSON output, so you can wire them into tool-using agent workflows without post-processing hacks
- Configurable thinking modes for chain-of-thought reasoning
- 140+ languages natively
- Object detection with bounding box generation in native JSON
The agentic angle is the headline feature. Google DeepMind researchers Clement Farabet and Olivier Lacombe described the goal as building models with “native support for function calling and structured JavaScript Object Notation outputs,” enabling developers to build autonomous agents that interact with third-party tools and execute workflows reliably.
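What structured JSON output buys you in practice: the model's tool call can be parsed and dispatched directly, with no regex post-processing. The sketch below simulates a single model turn with a hard-coded JSON string; the tool registry, schema, and response format are invented for illustration, not Gemma 4's actual function-calling protocol.

```python
import json

# Hypothetical tool registry; names and schema are illustrative.
TOOLS = {
    "get_weather": lambda city: f"18C and clear in {city}",
}

def dispatch(model_output: str) -> str:
    """Parse a structured JSON tool call and execute the named tool."""
    call = json.loads(model_output)       # guaranteed-valid JSON, no fallback parsing
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# Simulated model turn: with structured output enforced, the model
# emits well-formed JSON matching the tool schema.
model_output = '{"name": "get_weather", "arguments": {"city": "Zurich"}}'
print(dispatch(model_output))             # -> 18C and clear in Zurich
```

The point is the absence of error handling around `json.loads`: when the runtime constrains the output to valid JSON, the brittle extract-and-repair step that agent frameworks usually need simply disappears.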
Architecture: What’s New Under the Hood
Gemma 4 introduces several architectural changes worth noting if you plan to fine-tune or deploy these models.
Alternating attention layers. The models alternate between local sliding-window attention (512 tokens for small models, 1024 for large) and global full-context attention. Sliding window layers use standard RoPE positional encoding; global layers use proportional RoPE, which is what enables the long context windows.
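The difference between the two layer types comes down to the attention mask. A toy version with a scaled-down window (4 tokens instead of 512/1024) makes the contrast concrete; the mask construction here is generic, not Gemma 4's implementation.

```python
import numpy as np

def causal_mask(n, window=None):
    """Boolean attention mask: True where query i may attend to key j.

    window=None -> global causal attention (full prefix)
    window=W    -> sliding-window attention (only the last W keys)
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = j <= i                       # causal: never attend to the future
    if window is not None:
        mask &= j > i - window          # local: keep only the last `window` keys
    return mask

n = 8
local = causal_mask(n, window=4)        # stands in for the 512/1024-token layers
global_ = causal_mask(n)
print(int(local.sum()), int(global_.sum()))  # -> 26 36
```

The local mask's attended-position count grows linearly with sequence length while the global mask's grows quadratically, which is why interleaving mostly-local layers keeps long-context inference affordable.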
Per-Layer Embeddings (PLE). The E2B and E4B models use a parallel conditioning pathway alongside the main residual stream. Each layer gets a lower-dimensional, token-specific vector that combines token identity (from an embedding lookup) with a context-aware learned projection. This modulates the residual block after attention and feed-forward layers. It’s how Google squeezes more capability out of fewer parameters.
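Conceptually, PLE adds a cheap token-conditioned correction to the residual stream at each layer. The sketch below shows only the shape of the idea: a small per-token lookup, projected up and added to the residual. The dimensions, the table layout, and the additive combination rule are all assumptions for illustration, not the published architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ple, vocab = 32, 8, 100   # d_ple << d_model: the PLE vector is cheap

# One small vector per token id (per layer, in the real model); illustrative only.
ple_table = rng.standard_normal((vocab, d_ple))
up_proj = rng.standard_normal((d_ple, d_model))  # learned projection back to d_model

def apply_ple(residual, token_id):
    """Modulate the residual stream with a token-specific low-dim vector."""
    ple = ple_table[token_id]            # lookup: token identity
    return residual + ple @ up_proj      # applied after attention / feed-forward

residual = rng.standard_normal(d_model)
out = apply_ple(residual, token_id=42)
print(out.shape)                         # -> (32,)
```

Because the table is indexed at `d_ple` width rather than `d_model`, the extra per-layer capacity costs far fewer parameters than widening the model itself.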
Shared KV cache. The last several layers reuse key-value tensors from a previous non-shared layer instead of computing their own. This cuts memory and compute for long-context inference with minimal quality loss. On-device deployment is where you feel this most.
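The memory saving is easy to see with a toy layer map. The layer count and sharing pattern below are made up for illustration; only the mechanism (tail layers pointing at an earlier layer's KV tensors) reflects the description above.

```python
# Toy sketch of KV-cache sharing: the last few layers reuse an earlier
# layer's key/value tensors instead of computing and storing their own.
n_layers = 12
shared_tail = 4                       # illustrative: last 4 layers share
kv_source = {}
for layer in range(n_layers):
    if layer >= n_layers - shared_tail:
        kv_source[layer] = n_layers - shared_tail - 1  # reuse layer 7's KV
    else:
        kv_source[layer] = layer                       # compute own KV

unique_kv = len(set(kv_source.values()))
print(unique_kv, "KV caches instead of", n_layers)     # -> 8 KV caches instead of 12
```

KV cache size scales with layers × context length, so at long contexts dropping a third of the caches is a direct, proportional memory win, which is why the effect is most visible on-device.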
Variable image token budgets. The vision encoder supports configurable token counts per image: 70, 140, 280, 560, or 1120 tokens. You can trade off between speed and visual detail depending on your use case. This is a practical improvement over Gemma 3, which used fixed image tokenization.
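A quick back-of-envelope shows what the budget choice means for multi-image prompts. The context size and the per-image options come from the release notes above; the text reservation is an arbitrary figure for illustration.

```python
# How many images fit in a 256K context at each per-image budget,
# after reserving some room for text? (16K reservation is illustrative.)
context = 256_000
budgets = [70, 140, 280, 560, 1120]
text_reserve = 16_000

for b in budgets:
    images = (context - text_reserve) // b
    print(f"{b:>5} tokens/image -> up to {images} images")
```

At 70 tokens per image you can pack thousands of thumbnails into one prompt; at 1120 you get a couple hundred high-detail images, so the budget is effectively a dial between breadth and per-image fidelity.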
Benchmarks in Context
Google’s benchmark claims are strong. The 31B Dense scores roughly 1452 on LMArena (text-only). The 26B MoE hits about 1441 with only 3.8B active parameters, putting it in the same range as GLM-5 and Kimi K2.5 while using around 30x fewer total parameters.
For context on where this sits in the open model landscape: Meta’s Llama 4 Maverick still leads on MMLU (85.5%), and Qwen 3.5 dominates math benchmarks (48.7 vs 42.1 on AIME compared to Llama 4) and multilingual tasks, especially CJK languages. Llama 4 Scout’s 10-million-token context window remains unmatched.
Gemma 4’s pitch is different. It’s about the ratio of capability to compute. A model that ranks #3 globally while fitting on a single consumer GPU is a different kind of win than a model that leads benchmarks but needs a cluster.
Holger Mueller, an analyst at Constellation Research, put it this way: “Google is building its lead in AI. These are important for building an ecosystem of AI developers.”
The Apache 2.0 Shift
Previous Gemma models shipped under Google’s custom Gemma license, which had restrictions that made legal teams nervous. Gemma 4 moves to Apache 2.0, the same permissive license used by most open-source infrastructure software.
Google framed this as providing “complete developer flexibility and digital sovereignty,” allowing deployment across any environment without restrictions. No monthly active user limits. No acceptable use policies layered on top. You can modify, distribute, and commercialize without asking permission.
This matters because it removes the last meaningful friction point that kept some organizations on Llama or Qwen. Apache 2.0 is the license enterprise legal departments understand and approve without extended review.
Running Gemma 4 Locally
The models are available today on Hugging Face, Ollama, and Kaggle.
For local inference, day-one support exists across:
- llama.cpp for CPU/GPU inference (image + text supported)
- Ollama for the easiest setup (`ollama run gemma4`)
- MLX on Apple Silicon, with TurboQuant for roughly 4x memory reduction at baseline accuracy
- transformers (Python, full multimodal pipeline)
- transformers.js for browser-based inference via WebGPU
- mistral.rs for Rust-based multimodal inference
NVIDIA has partnered with Google to optimize all four variants for RTX GPUs and DGX Spark. Performance testing with Q4_K_M quantization on an RTX 5090 via llama.cpp shows strong results, and the models run well on Jetson Orin Nano for edge AI deployments.
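To see why the 31B model fits comfortably on a single RTX 5090 at Q4_K_M, a rough weights-only estimate helps. The ~4.85 bits-per-weight figure is a commonly quoted average for llama.cpp's Q4_K_M scheme, not a number from the article; treat it as an approximation.

```python
# Back-of-envelope VRAM for weights only (ignores KV cache and activations).
# ~4.85 bits/weight is an approximate average for llama.cpp Q4_K_M.
params = 31e9
bits_per_weight = 4.85
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.1f} GB of weights")   # -> ~18.8 GB of weights
```

That leaves the remainder of a 32GB card for KV cache and activations, which is tight but workable at long contexts, especially with the shared-KV layers trimming cache growth.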
For fine-tuning, TRL (Transformer Reinforcement Learning) has full multimodal support on day one, including a new feature: multimodal tool responses during training, where models can receive images from tools as feedback. Unsloth also offers optimized quantized models for local fine-tuning.
Who Should Care
If you’re building agents that need to run locally, Gemma 4 is now the strongest option at the small end. The E2B and E4B models can handle vision, audio, and function calling on a phone or Raspberry Pi. That combination didn’t exist in an open model before today.
If you’re running inference infrastructure and want the best quality-per-FLOP, the 26B MoE is the standout. Activating 3.8B parameters while matching models many times its size means lower serving costs for equivalent quality.
If you want a strong general-purpose open model to fine-tune, the 31B Dense gives you a top-3 global foundation under a license that won’t create headaches down the road.
The open model space was already competitive. Gemma 4 makes it more so.
Sources:
- Google Blog: Gemma 4
- Google Developers Blog: Agentic skills on the edge with Gemma 4
- Google DeepMind: Gemma 4
- Hugging Face: Welcome Gemma 4
- SiliconANGLE: Google’s Gemma 4 brings complex reasoning to low-power devices
- Engadget: Google releases Gemma 4
- NVIDIA Blog: RTX AI Garage and Gemma 4
- The Deep View: Google rethinks the AI model race with Gemma 4