Mercury 2 Hits 1,009 Tokens/Sec by Ditching the Way Every Other LLM Works

A Stanford professor’s startup just proved that AI text doesn’t have to come out one word at a time — and it’s 10x faster than the competition

1,009 tokens per second. $0.25 per million input tokens. End-to-end latency of 1.7 seconds vs 23.4 seconds for Claude Haiku 4.5.

Inception Labs just dropped Mercury 2 — the first reasoning model that doesn’t generate text the same way literally every other LLM does. And the speed gap is absolutely bonkers.


🧩 Dumb Mode Dictionary

| Term | What It Actually Means |
| --- | --- |
| Autoregressive decoding | How every normal LLM works — it spits out one token (word chunk) at a time, left to right, like a typewriter |
| Diffusion model | A technique that starts with noise and refines it into something coherent — used in image AI like Midjourney, now applied to text |
| dLLM | Diffusion Large Language Model — Inception’s name for their new breed of text AI |
| Tokens per second | How fast an AI can produce output — higher = faster responses |
| Denoising | The process of cleaning up a rough draft into polished output across multiple passes |
| 128K context window | How much text the model can “see” at once — about 200 pages of text |
📖 The Backstory: A Stanford Professor Said 'What If?'

WAIT — so here’s the thing. Every single LLM you’ve ever used — ChatGPT, Claude, Gemini, all of them — works the same way under the hood. They produce text one token at a time, left to right, like a really fast typewriter. That’s called autoregressive generation.

Stefano Ermon, a Stanford CS professor, co-invented the diffusion technique that powers Midjourney and DALL-E for images. Back in 2019, he asked a wild question: what if you could do the same thing with text?

People told him it couldn’t work. Text isn’t like pixels — you can’t just “denoise” words the way you sharpen a blurry image. But his lab kept grinding. Years of research later, Inception Labs emerged from stealth in February 2025, raised $50M from Microsoft, NVIDIA, and Snowflake, and now they’ve shipped Mercury 2.

The founding team is stacked: Ermon from Stanford, Aditya Grover from UCLA, and Volodymyr Kuleshov from Cornell. Three professors who decided the entire LLM industry was building on a slow foundation.

⚙️ How Diffusion Text Actually Works (Simply)

Okay so imagine every other LLM is a person writing an essay one word at a time, never going back. Mercury 2 is more like an editor who starts with a rough draft of the ENTIRE response and then refines the whole thing at once — multiple times, really fast.

Here’s the process:

  • Step 1: Model generates a rough “noisy” sketch of the full output — all tokens at once
  • Step 2: It runs through several refinement passes, improving many tokens simultaneously (this is the “denoising”)
  • Step 3: After a small number of passes, the output converges into clean text

Because it’s refining in parallel instead of generating sequentially, the GPU gets used way more efficiently. That’s where the 10x speed comes from. Not from a faster chip — from a fundamentally different approach.
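The pass structure can be sketched as a toy simulation. This is pure illustration, not Mercury's actual algorithm: a real dLLM uses a transformer to predict the masked tokens, while here a stand-in `refine` step just copies from a known target so the loop mechanics are visible.

```python
import random

random.seed(0)  # make the toy run deterministic

TARGET = "the quick brown fox jumps over the lazy dog".split()
MASK = "<mask>"

def refine(draft, target, fraction=0.5):
    """One denoising pass: commit a fraction of the still-masked
    positions in parallel. (In a real dLLM the model predicts these
    tokens; copying from TARGET is just a stand-in.)"""
    masked = [i for i, tok in enumerate(draft) if tok == MASK]
    for i in random.sample(masked, max(1, int(len(masked) * fraction))):
        draft[i] = target[i]
    return draft

# Step 1: start from a fully "noisy" (all-masked) draft of the whole output.
draft = [MASK] * len(TARGET)
passes = 0
# Steps 2-3: repeat parallel refinement passes until the draft converges.
while MASK in draft:
    draft = refine(draft, TARGET)
    passes += 1

print(" ".join(draft))
print(f"converged in {passes} passes for {len(TARGET)} tokens")
```

Even this toy version shows the key property: the number of passes grows much more slowly than the number of tokens, because many positions are resolved per pass instead of one.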

📊 The Numbers: Mercury 2 vs The Competition
| Metric | Mercury 2 | Claude Haiku 4.5 | GPT 5.2 Mini | Gemini 3 Flash |
| --- | --- | --- | --- | --- |
| Output speed | 1,009 tok/s | ~89 tok/s | ~71 tok/s | — |
| End-to-end latency | 1.7s | 23.4s | 14.4s | — |
| Input price (per 1M tokens) | $0.25 | $1.00 | $0.50 | — |
| Output price (per 1M tokens) | $0.75 | $5.00 | $3.00 | — |
| AIME 2025 | 91.1 | — | — | — |
| GPQA | 73.6 | — | — | — |
| LiveCodeBench | 67.3 | — | — | — |
| Context window | 128K | 200K | 128K | 1M |

So it’s roughly 4x cheaper than Claude Haiku on input and nearly 7x cheaper on output. The quality benchmarks put it in the same tier as Haiku and GPT 5.2 Mini — not frontier-level, but solid for agent loops and batch processing.
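Quick sanity check on those ratios, using the per-million-token prices quoted above:

```python
# Per-million-token prices from the comparison above.
mercury = {"input": 0.25, "output": 0.75}
haiku = {"input": 1.00, "output": 5.00}

input_ratio = haiku["input"] / mercury["input"]
output_ratio = haiku["output"] / mercury["output"]
print(f"input:  {input_ratio:.1f}x cheaper")   # 4.0x
print(f"output: {output_ratio:.1f}x cheaper")  # 6.7x
```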

🗣️ What People Are Saying

The Hacker News crowd is cautiously excited but not throwing a parade yet:

  • The speed believers: “Intelligence per second matters — fast responses give you faster iteration loops.” Developers building multi-step agents see massive potential here.
  • The skeptics: Several testers said it “fails the car wash test” (a common reasoning trick question) and gets “easily fooled by usual trick questions.” Coding performance? Decent. Deep reasoning? Not there yet.
  • The realists: The positioning is Haiku-tier, not Opus-tier. It’s a workhorse for agent pipelines, not a replacement for your smartest model.
  • The annoyed: Their demo site buckled under traffic on launch day — kind of ironic for a model selling itself on speed.

NVIDIA gave a quote calling it proof of “what’s possible when new model architecture meets NVIDIA AI infrastructure.” (They’re investors, so take it with a grain of salt.)

🔍 Why This Is a Bigger Deal Than It Looks

Here’s what got me (honestly) excited. Every major LLM company — OpenAI, Anthropic, Google, Meta — is stuck on the same autoregressive bottleneck. They can make models smarter, but making them fundamentally faster at generation has been mostly about hardware and optimization tricks.

Mercury 2 is proof that a completely different architecture can work for text. Google apparently explored similar diffusion approaches but never shipped anything commercial. Inception just… did it.

The real play isn’t Mercury replacing your main AI brain. It’s Mercury running in agent loops where you need 50-100 fast calls in sequence. At 1.7 seconds per round-trip vs 23 seconds, your 20-step agent pipeline goes from an 8-minute coffee break to a 34-second wait. That changes what’s buildable.
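The back-of-the-envelope math behind that comparison, using the end-to-end latencies quoted earlier:

```python
# End-to-end latencies from the benchmark figures above (seconds per call).
steps = 20
mercury_latency, haiku_latency = 1.7, 23.4

mercury_total = steps * mercury_latency
haiku_total = steps * haiku_latency
print(f"Mercury 2: {mercury_total:.0f}s")        # 34s
print(f"Haiku 4.5: {haiku_total / 60:.1f} min")  # 7.8 min
```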

The API is OpenAI-compatible, which means you can swap it into existing toolchains with minimal effort.
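Because the API is OpenAI-compatible, the switch is mostly a base-URL change. Here is a minimal stdlib sketch of what such a request looks like; the endpoint URL and model id below are placeholders, not Inception's real values, so check their API docs before wiring this up.

```python
import json
import urllib.request

# Placeholder values: substitute the real endpoint and model id
# from Inception's API documentation.
BASE_URL = "https://api.example-inception-endpoint.com/v1"
API_KEY = "YOUR_INCEPTION_API_KEY"

def build_chat_request(prompt: str, model: str = "mercury-2") -> urllib.request.Request:
    """Build a standard OpenAI-style /chat/completions request.
    Only the base URL and model id differ from an OpenAI setup."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )

req = build_chat_request("Summarize: diffusion LLMs refine text in parallel.")
# urllib.request.urlopen(req)  # uncomment to actually send the call
```

If you already use the official `openai` client library, the same idea applies: point its `base_url` at the new endpoint and keep the rest of your code unchanged.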

Cool. So There’s a Ridiculously Fast AI That Costs Almost Nothing. Now What the Hell Do We Do? (•̀ᴗ•́)و

💰 Hustle 1: Build a Real-Time AI Chatbot Service for Local Businesses

Most small businesses still think “AI chatbot” means a clunky widget from 2019. With Mercury 2’s speed, you can build something that responds so fast it genuinely feels like a real human — no awkward loading spinners.

Package it as a monthly service: customer support bot, appointment booking, FAQ handler. Charge $200-500/mo per client. Your API costs per customer will be basically nothing at these prices.
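A rough sanity check on the "basically nothing" claim, under assumed traffic for one client (the conversation counts and token sizes are illustrative guesses, not measured numbers):

```python
# Assumed monthly traffic for one small-business chatbot client.
convos_per_month = 2_000
tokens_in_per_convo = 1_500   # assumed: system prompt + chat history
tokens_out_per_convo = 300    # assumed: bot replies

# Mercury 2 prices per million tokens, from the comparison above.
price_in, price_out = 0.25, 0.75

cost = (convos_per_month * tokens_in_per_convo / 1e6 * price_in
        + convos_per_month * tokens_out_per_convo / 1e6 * price_out)
print(f"~${cost:.2f}/month in API costs per client")
```

At roughly a dollar a month in API spend per client, a $200-500/mo retainer is essentially pure margin on the model side.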

🧠 Example: A freelance dev in Porto, Portugal built a WhatsApp-based appointment bot for dental clinics using a cheap API model. He charges €300/mo per clinic and has 14 clients — roughly €4,200/mo with near-zero infra costs.

📈 Timeline: Weekend to prototype → 2 weeks to polish → start pitching local businesses in month 1

🔧 Hustle 2: Sell High-Speed AI Data Extraction Pipelines to Agencies

Marketing agencies and SEO shops need to process thousands of web pages, extract structured data, and generate reports. Current LLM costs and latency make this expensive and slow. Mercury 2 at $0.25/M input tokens and sub-2-second responses makes batch processing absurdly cheap.

Build a pipeline that scrapes competitor sites, extracts pricing/product data, and spits out formatted reports. Charge per report or per 1,000 pages processed.
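To see why batch processing gets "absurdly cheap" at these prices, price out a 1,000-page run (the per-page token counts are assumptions; real pages vary widely):

```python
pages = 1_000
tokens_in_per_page = 1_200   # assumed: scraped page text after cleanup
tokens_out_per_page = 150    # assumed: one extracted structured record

# Mercury 2 prices per million tokens, from the comparison above.
price_in, price_out = 0.25, 0.75

cost = (pages * tokens_in_per_page / 1e6 * price_in
        + pages * tokens_out_per_page / 1e6 * price_out)
print(f"~${cost:.2f} per 1,000 pages")
```

Charging even a few dollars per 1,000 pages leaves a wide margin over the API bill.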

🧠 Example: A data analyst in Medellín, Colombia built a competitor price monitoring tool for e-commerce brands using fast API calls. She charges $800/mo per brand for daily price reports across 50 competitor sites. Running 6 clients now — $4,800/mo.

📈 Timeline: 1 week to build scraper + extraction pipeline → Demo to 3 agencies → First paying client in 2-3 weeks

📱 Hustle 3: Create an AI Coding Assistant Wrapper With Instant Responses

People pay for Cursor, Copilot, and all these AI coding tools partly because speed matters when you’re in the zone. Mercury 2’s latency means you could build a VS Code extension or terminal tool that returns code suggestions before the user finishes reading their own question.

Target specific niches — a PHP-specific assistant, a WordPress plugin generator, a Shopify theme tweaker. Niche beats general every time.

🧠 Example: A developer in Nairobi, Kenya built a terminal-based Python assistant using a fast API model, marketed it on Twitter/X to the #100DaysOfCode crowd, and sells it for $9/mo. Hit 380 subscribers in 3 months — $3,420/mo.

📈 Timeline: 1-2 weekends to build MVP → Ship on Product Hunt → Iterate based on feedback for 30 days

📝 Hustle 4: Offer Instant Document Processing for Legal and Medical Offices

Law firms and clinics drown in paperwork. Summarizing case files, extracting key dates from medical records, flagging inconsistencies in contracts — these are perfect tasks for a fast, cheap model with 128K context.

Build a simple upload-and-process web app. Charge per document or monthly subscription.

🧠 Example: A tech-savvy paralegal in São Paulo, Brazil built an internal document summarizer for her firm using API calls, then started selling it to other small firms. She charges R$500/mo (~$90) per firm and has 30 clients — R$15,000/mo ($2,700).

📈 Timeline: 2 weeks for web app + document parser → Pilot with one firm → Referral-based growth from month 2

⚡ Hustle 5: Build and Sell Multi-Agent Automation Templates

This is where Mercury 2’s speed really shines. Agent frameworks like CrewAI and AutoGen run loops of LLM calls — and every call’s latency adds up fast. At 10x the speed and 4x cheaper pricing, you can build multi-agent workflows that actually finish in reasonable time.

Create templates for common workflows: lead research agents, content generation pipelines, automated code review bots. Sell them as Gumroad products or offer setup-as-a-service.
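A lead research chain like the one mentioned above (scraper → qualifier → enricher → email writer → scheduler) can be sketched as a simple stage pipeline. The stage names and logic here are stubs I made up for illustration; in a real template each stage would be one fast LLM call:

```python
from functools import reduce

# Toy 5-stage lead research chain. Each stub stands in for one LLM call.
def scraper(d):   return {**d, "raw": f"homepage text for {d['company']}"}
def qualifier(d): return {**d, "qualified": "homepage" in d["raw"]}
def enricher(d):  return {**d, "employees": 42}  # stub CRM/API lookup
def writer(d):    return {**d, "email": f"Hi {d['company']} team, ..."}
def scheduler(d): return {**d, "followup_in_days": 3}

PIPELINE = [scraper, qualifier, enricher, writer, scheduler]

def run(lead: dict) -> dict:
    """Thread the lead dict through every stage in order."""
    return reduce(lambda d, stage: stage(d), PIPELINE, lead)

result = run({"company": "Acme"})
print(result["email"])
```

Because the stages run sequentially, per-call latency multiplies by the number of stages, which is exactly where a fast model pays off.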

🧠 Example: An automation consultant in Bucharest, Romania built a 5-agent lead research pipeline (scraper → qualifier → enricher → email writer → scheduler) for B2B SaaS companies. Sells the template for $149 on Gumroad and custom setups for $1,500. Made $8,200 last month between templates and consulting.

📈 Timeline: 1 week per template → List on Gumroad/Twitter → First sales within 2 weeks of launch

🛠️ Follow-Up Actions
| Step | Action |
| --- | --- |
| 1 | Sign up for Mercury 2 API early access at chat.inceptionlabs.ai |
| 2 | Test it against your current LLM on speed-sensitive tasks — agent loops, batch extraction, real-time chat |
| 3 | Compare actual quality on YOUR use case (benchmarks are marketing — your data is truth) |
| 4 | Start with the highest-speed-dependency hustle from above that fits your skills |
| 5 | Join the Inception Discord or follow @StefanoErmon for model updates and pricing changes |

⚡ Quick Hits

| Want to… | Do this |
| --- | --- |
| 🧪 Try Mercury 2 right now | Hit up chat.inceptionlabs.ai for the demo interface |
| 🔌 Plug it into existing code | It’s OpenAI-compatible — swap your base URL and API key, done |
| 📊 See the benchmarks yourself | Check their blog post for full comparison tables |
| 💡 Understand diffusion LLMs deeper | Read the original Mercury paper on arXiv |
| 💬 See what devs think | Browse the HN discussion thread |

Every LLM on earth writes one word at a time — Mercury 2 just proved they don’t have to.
