Google Dropped 4 Open AI Models Under Apache 2.0 — One Runs on a Raspberry Pi
Honestly, Google just open-sourced the distilled brains of Gemini 3 and said “do whatever you want.” No user caps. No acceptable-use police. Apache 2.0. The E2B model fits in under 2GB and decodes at 7.6 tokens/sec on a Pi 5. The 31B is ranked #3 in the world among open models. This is either the most generous thing Google has done since Gmail or the most calculated.
Gemma 4: 4 models from 2.3B to 31B active parameters — 256K context window — Apache 2.0 license — runs on everything from a Raspberry Pi to an H100 — #3 open model globally on LMArena
Released April 2, 2026. Built from the same research behind Gemini 3. Every model handles images natively. The small ones do audio too. And they all do function calling out of the box.

🧩 Dumb Mode Dictionary
| Term | Translation |
|---|---|
| MoE (Mixture of Experts) | Instead of one giant brain, it’s 128 tiny specialist brains — but only 9 (8 specialists plus 1 always-on shared expert) wake up per token. Cheaper to run, nearly as smart. |
| Active Parameters | The neurons actually doing work per token. The 26B model has 25.2B total params but only fires 3.8B at a time. Like a restaurant with 128 chefs but only 9 cooking your meal. |
| Apache 2.0 | The “do literally whatever you want” open-source license. No strings. No user limits. No compliance officer breathing down your neck. |
| Context Window | How much text the model can see at once. 256K tokens is roughly a 500-page book. |
| Sliding-Window Attention | The model reads nearby text in detail but skims farther-back text for the gist. Saves memory, still gets the point. |
| Quantization | Compressing a model’s numbers from 16-bit to 2-4 bit. Like converting WAV to MP3 — smaller, slightly lossy, but your ears (or GPU) can’t really tell. |
| PLE (Per-Layer Embeddings) | A trick where extra context signals get injected at every layer so small models punch above their weight. |
| Function Calling | The model can trigger real code/APIs mid-conversation instead of just talking about it. |
📖 Backstory: Why This Matters
Google has been playing catch-up in the open-model space. Llama 4 shipped under Meta’s restrictive community license (700M monthly active user cap, acceptable-use policy). Qwen 3.5 went Apache 2.0 and cleaned up. Gemma 3 was good but not great — and it came with Google’s own usage restrictions.
Gemma 4 is Google’s answer: full Apache 2.0, no restrictions, four model sizes covering every hardware tier from IoT to data center. Okay but seriously — this is the first time Google has shipped a top-tier open model with zero strings attached. That’s new for them.
The timing is also interesting. This dropped the same week as Qwen 3.6 rumors and two weeks after Llama 4’s launch. The open-model war is a genuine three-way race now.
📊 The Model Lineup
| Model | Active Params | Total Params | Context | Best For |
|---|---|---|---|---|
| E2B | 2.3B | 5.1B | 128K | IoT, mobile, edge devices, Raspberry Pi |
| E4B | 4.5B | 8B | 128K | Phones, embedded apps, offline assistants |
| 26B-A4B (MoE) | 3.8B | 25.2B | 256K | Consumer GPUs — only fires 3.8B params per token |
| 31B (Dense) | 30.7B | 30.7B | 256K | Full power. Reasoning, coding, agentic workflows |
The E-series models use Per-Layer Embeddings (PLE) to inject extra signals at every decoder layer. The 26B uses Mixture-of-Experts: 128 total experts, 8 base + 1 shared activated per token. Later layers reuse KV cache from earlier layers to cut memory.
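The routing step above can be sketched in a few lines of numpy. Everything here is a toy illustration: the dimensions, the top-k softmax weighting, and the expert functions are assumptions for clarity, not Gemma 4's actual implementation.

```python
import numpy as np

def moe_layer(x, experts, shared_expert, router_w, top_k=8):
    """Toy Mixture-of-Experts forward pass: route each token to its
    top-k experts plus one always-on shared expert."""
    logits = x @ router_w                # (d_model,) @ (d_model, n_experts)
    top = np.argsort(logits)[-top_k:]    # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()             # softmax over the selected experts only
    out = shared_expert(x)               # shared expert fires for every token
    for w, i in zip(weights, top):
        out = out + w * experts[i](x)    # add each selected expert's weighted output
    return out

# Tiny demo: 4 experts, route to top-2 plus the shared expert
d = 8
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(4)]
shared = lambda x, W=rng.normal(size=(d, d)): x @ W
router = rng.normal(size=(d, 4))
y = moe_layer(rng.normal(size=d), experts, shared, router, top_k=2)
print(y.shape)  # (8,)
```

The point of the shared expert is that some computation is always useful regardless of which specialists the router picks, so it runs unconditionally.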
🔍 Benchmark Numbers That Actually Matter
| Benchmark | Gemma 4 31B | Gemma 3 27B | Notes |
|---|---|---|---|
| MMLU Pro | 85.2% | ~72% | Massive jump |
| AIME 2026 (math, no tools) | 89.2% | — | Competition-level math |
| BigBench Extra Hard | 74.4% | 19.3% | Not a typo. 19.3% → 74.4% |
| LiveCodeBench v6 | 80.0% | — | Coding |
| GPQA Diamond (science) | 84.3% | — | PhD-level questions |
| MMMU Pro (vision) | 76.9% | — | Multimodal reasoning |
| Codeforces ELO | 2150 | — | Competitive programming |
| LMArena Ranking | #3 open globally | — | ~1452 ELO |
The 26B MoE variant scores 82.6% on MMLU Pro while only activating 3.8B parameters. That’s 97% of the dense model’s quality at a fraction of the compute.
⚙️ Architecture Deep Dive
- Attention: Alternating layers of local sliding-window (512-1024 tokens) and global full-context attention. Standard RoPE for local, proportional RoPE for global — this is how they hit 256K context without quality degradation at long distances
- MoE routing: 128 experts total, 8 base + 1 shared expert fire per token. Keeps inference fast
- Shared KV Cache: Later layers reuse key-value tensors from earlier layers. Cuts memory and compute overhead significantly
- Edge models: E2B hits sub-2GB with 2-bit quantization. 133 tokens/sec prefill, 7.6 tokens/sec decode on a Raspberry Pi 5. That’s not fast, but it’s real
- Multimodal: All models handle images at variable resolution. E-series does 30 seconds of audio (speech recognition/translation). Larger models do 60 seconds of video at 1 fps
- Agentic: Native function calling, structured JSON output, bounding box detection for UI elements (browser automation)
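The alternating local/global attention pattern above can be illustrated with boolean masks. The 10-token sequence and 4-token window below are toy values (Gemma 4's local windows are 512-1024 tokens); this is a sketch of the masking idea, not the production kernel.

```python
import numpy as np

def causal_mask(n):
    # Global layers: each token attends to every earlier token
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n, window):
    # Local layers: each token attends only to the previous `window` tokens
    m = causal_mask(n)
    for i in range(n):
        m[i, : max(0, i - window + 1)] = False
    return m

# With a 4-token window, token 9 sees only tokens 6-9;
# a global layer would let it see all of tokens 0-9.
local = sliding_window_mask(10, window=4)
print(int(local[9].sum()))             # 4 visible positions
print(int(causal_mask(10)[9].sum()))   # 10 visible positions
```

Because most layers are local, the KV cache for them stays bounded by the window size; only the occasional global layer pays the full 256K cost.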
🗣️ What People Are Saying
From the HN thread:
- Users report 100-150 tokens/sec on RTX 4090 with the 26B MoE variant — 50% faster than Qwen 3.5-35B on similar hardware
- The 26B is “significantly better than Qwen 3.5-35B” for niche tasks like Nix programming
- SVG generation is “markedly improved” over Gemma 3
- Image recognition from the 26B: “outstanding” vs “unrecognizable” from smaller models
- One developer is upgrading a historical land records processing pipeline from Gemma 3
But it’s not all roses:
- The 31B initially produced only `---\n` in LM Studio (since fixed)
- Tool-calling still hallucinates — the model tries to use tools it doesn’t have access to
- A timestamp test showed it wrote valid Python, then hallucinated the execution result. Qwen 3.5 did the manual math and got it right
- “Doesn’t respect prompt rules” when adapting Qwen-style workflows
Honestly, this tracks. Every model launch has a honeymoon period where benchmarks look godlike and real usage finds the edges. The edges here are tool-calling reliability and instruction following.
💰 Apache 2.0 vs. Everyone Else
| License | Gemma 4 | Llama 4 | Qwen 3.5 | Mistral Large |
|---|---|---|---|---|
| Type | Apache 2.0 | Community License | Apache 2.0 | Proprietary |
| MAU Cap | None | 700M | None | N/A |
| Commercial Use | Unrestricted | Restricted | Unrestricted | Paid |
| Acceptable Use Policy | None | Yes | None | Yes |
| Sovereign AI OK | Yes | Complicated | Yes | No |
This is Google explicitly saying “we’re done with restrictions.” Governments building sovereign AI stacks now have two fully permissive options (Gemma 4 and Qwen 3.5) vs. Llama’s legal gray area.
Cool. Google gave away the recipe book. Now what the hell do we do? ( ͡° ͜ʖ ͡°)

🔧 Build a Local AI API That Replaces $200/mo in OpenAI Calls
Run the 26B MoE variant on a single RTX 4090 (or even a 3090 with quantization). Stand up the OpenAI-compatible server endpoint. Route your app’s API calls to localhost instead of api.openai.com. You keep the same SDK, same code, zero per-token cost.
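Here's what the localhost swap looks like using nothing but the standard library and the OpenAI-compatible wire format. The port, endpoint path, and model tag are assumptions, so match them to however you launched vLLM or Ollama.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # your local vLLM/Ollama endpoint (assumption)
MODEL = "gemma4:26b-a4b"                # model tag as served locally (assumption)

def build_request(prompt: str) -> urllib.request.Request:
    # Same wire format as api.openai.com, which is why existing
    # OpenAI-SDK code works unchanged after swapping the base URL.
    body = {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def chat(prompt: str) -> str:
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

req = build_request("Summarize this contract in three bullets: ...")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
```

If you're already on the official OpenAI SDK, the equivalent change is passing `base_url="http://localhost:8000/v1"` to the client constructor and leaving everything else alone.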
Example: A freelance developer in Lisbon ran Gemma 3 27B locally for a document summarization SaaS serving 40 clients. Switching to Gemma 4 26B-A4B cut inference time by 50% and let him add 25 more clients on the same hardware. Revenue went from €1,800/mo to €2,900/mo without buying a second GPU.
Timeline: 1 weekend to set up Ollama + vLLM. 2 weeks to migrate existing API calls. ROI positive by month 2.
📱 Deploy an Offline AI Assistant on Edge Hardware
The E2B model fits in under 2GB quantized. It runs on a Raspberry Pi 5. It handles 128K context. Put this in kiosks, point-of-sale terminals, field devices, or any hardware that can’t rely on internet. Function calling means it can trigger local actions — not just chat.
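The "trigger local actions" part is just a dispatch loop around the model's output. A minimal sketch, with a hypothetical `set_relay` action; the `{"tool": ..., "args": ...}` JSON shape is an assumption, so match it to your runtime's actual tool-call format.

```python
import json

# Local actions the kiosk model is allowed to trigger (hypothetical example).
def set_relay(pin: int, on: bool) -> str:
    return f"relay {pin} -> {'on' if on else 'off'}"

TOOLS = {"set_relay": set_relay}

def dispatch(model_output: str) -> str:
    """If the model emitted a JSON tool call, run the matching local
    function; otherwise pass its text through as plain chat."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output              # not JSON: plain chat, no action
    if not isinstance(call, dict):
        return model_output
    fn = TOOLS.get(call.get("tool"))
    if fn is None:
        return f"unknown tool: {call.get('tool')}"
    return fn(**call.get("args", {}))

print(dispatch('{"tool": "set_relay", "args": {"pin": 4, "on": true}}'))
```

Keeping the tool registry as an explicit allowlist matters on edge hardware: the HN reports above note the model sometimes hallucinates tools it doesn't have, and the `unknown tool` branch is what catches that.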
Example: A hardware integrator in Nairobi built solar-powered agricultural kiosks for rural Kenya using the E4B variant. Farmers ask crop questions in Swahili (one of 140+ supported languages), get answers offline. The company charges cooperatives $15/month per kiosk and has 80 deployed. That’s $1,200/mo recurring on hardware that cost $120 each.
Timeline: 2 weeks to prototype on a Pi. 1 month to ruggedize. 3 months to first paying deployment.
🔍 Sell Document Processing Pipelines to Law Firms and Agencies
The 31B model scores 76.9% on MMMU Pro (multimodal reasoning) and handles variable-resolution images natively. Feed it scanned contracts, receipts, blueprints. It does OCR, extraction, and structured JSON output in one pass. No separate OCR service needed.
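A sketch of the one-pass pattern: build a schema-constrained prompt, then validate the model's reply before it enters your pipeline. The schema, prompt wording, and simulated reply below are illustrative assumptions.

```python
import json

# Hypothetical extraction schema for scanned contracts.
SCHEMA = {
    "parties": ["list of party names"],
    "effective_date": "YYYY-MM-DD",
    "total_value": "number",
    "governing_law": "string",
}

def extraction_prompt(schema: dict) -> str:
    # Sent alongside the scanned image; asks for JSON only.
    return (
        "Extract the following fields from the attached scanned contract. "
        "Reply with JSON only, matching this schema exactly:\n"
        + json.dumps(schema, indent=2)
    )

def parse_reply(reply: str) -> dict:
    """Validate that the model's reply is JSON containing every schema key."""
    data = json.loads(reply)
    missing = set(SCHEMA) - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data

# Simulated model reply for the demo:
reply = ('{"parties": ["Acme", "Bolt"], "effective_date": "2026-01-15", '
         '"total_value": 120000, "governing_law": "BR"}')
print(parse_reply(reply)["total_value"])  # 120000
```

The validation step is the unglamorous part that makes this sellable: a reply that fails `parse_reply` gets retried or flagged for human review instead of silently corrupting a law firm's records.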
Example: A solo consultant in São Paulo built a contract review pipeline for mid-size Brazilian law firms using Gemma 3. She upgraded to Gemma 4 31B and the accuracy on Portuguese legal clauses jumped enough to win 3 new firm contracts at R$4,000/mo each (~$800 USD). Self-hosted on a rented A100 for $150/mo.
Timeline: 1 week to fine-tune on sample documents. 2 weeks to build the extraction pipeline. 1 month to land first client via cold outreach to law firms.
🧠 Build a Sovereign AI Stack for Government Contracts
Apache 2.0 means no user caps, no acceptable-use restrictions, no phone-home. Governments building national AI infrastructure can deploy Gemma 4 without asking Google’s permission. Pair it with on-prem hardware and you have a fully sovereign stack.
Example: A small IT consultancy in Tallinn, Estonia pitched the national digitalization agency on a Gemma 4-based document processing system for immigration paperwork. Apache 2.0 licensing was the deciding factor over Llama 4 (which required legal review of Meta’s community license). Contract: €45,000 for initial deployment + €8,000/year maintenance.
Timeline: 2 months for government procurement cycle. 1 month for deployment. Ongoing maintenance revenue.
📊 Fine-Tune Domain-Specific Models and Sell API Access
Take the 26B MoE. Fine-tune it on medical literature, legal precedent, financial filings, or any vertical corpus. Host it. Sell API access to companies in that vertical who can’t afford to build their own. The MoE architecture means you can fine-tune with consumer hardware — only 3.8B active params to train.
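The usual way to make fine-tuning that cheap is a low-rank adapter (LoRA-style): freeze the pretrained weights and train only a small update. Here's a toy numpy sketch of the idea with made-up dimensions; this illustrates the general technique, not Gemma 4's prescribed recipe.

```python
import numpy as np

# LoRA sketch: freeze the big weight W and train only a low-rank
# correction B @ A. Dimensions are toy values, not Gemma 4's.
d, r = 512, 8                       # hidden size, adapter rank
rng = np.random.default_rng(1)
W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection (starts at zero)

def forward(x):
    # Base output plus the low-rank correction; at init B is zero,
    # so the adapted model starts out identical to the pretrained one.
    return x @ W.T + x @ A.T @ B.T

full = W.size
lora = A.size + B.size
print(f"trainable fraction: {lora / full:.3%}")  # trainable fraction: 3.125%
```

Only `A` and `B` receive gradients, so optimizer state and activations for the frozen weights never need to fit in VRAM at training precision, which is what makes a rented single-GPU setup viable for a 26B model.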
Example: A data scientist in Bangalore fine-tuned Gemma 3 on Indian tax law and sold API access to 12 accounting firms at ₹15,000/mo (~$180 USD) each. With Gemma 4’s improved reasoning (MMLU Pro 85.2% vs ~72%), accuracy improved enough to add corporate tax filing support. Revenue doubled to ₹3.6L/mo (~$4,300).
Timeline: 2 weeks for data preparation. 1 week to fine-tune on a rented A100. 1 month to onboard first 3 clients.
🛠️ Follow-Up Actions
| Step | Action | Link |
|---|---|---|
| 1 | Download models from Hugging Face | Gemma 4 Collection |
| 2 | Run locally via Ollama | `ollama run gemma4:26b-a4b` |
| 3 | Read the architecture paper | Google DeepMind Gemma 4 |
| 4 | Test with vLLM for production serving | vLLM docs |
| 5 | Join the HN discussion | HN Thread |
Quick Hits
| Want to… | Do this |
|---|---|
| Run AI on a Raspberry Pi | Grab the E2B model, quantize to 2-bit, deploy via Ollama |
| Kill your per-token API bill | Host the 26B MoE on a single RTX 4090, point your SDK at localhost |
| Process documents in one pass | Use the 31B — native image input, structured JSON output, no separate OCR |
| Sell into sovereign/government deployments | Apache 2.0 = no legal review needed, no user caps, full sovereignty |
| Ship on-device in a mobile app | E4B with Google AI Edge SDK — 128K context in 8B total params |
Google just handed everyone the blueprints to Gemini’s brain and said “Apache 2.0, no backsies” — now we find out if open-source generosity is a strategy or a eulogy.