Mercury 2 Hits 1,009 Tokens/Sec by Ditching the Way Every Other LLM Works

A Stanford professor’s startup just proved that AI text doesn’t have to come out one word at a time — and it’s 10x faster than the competition

1,009 tokens per second. $0.25 per million input tokens. End-to-end latency of 1.7 seconds vs 23.4 seconds for Claude Haiku 4.5.

Inception Labs just dropped Mercury 2 — the first reasoning model that doesn’t generate text the same way literally every other LLM does. And the speed gap is absolutely bonkers.


🧩 Dumb Mode Dictionary

| Term | What It Actually Means |
| --- | --- |
| Autoregressive decoding | How every normal LLM works — it spits out one token (word chunk) at a time, left to right, like a typewriter |
| Diffusion model | A technique that starts with noise and refines it into something coherent — used in image AI like Midjourney, now applied to text |
| dLLM | Diffusion Large Language Model — Inception’s name for their new breed of text AI |
| Tokens per second | How fast an AI can produce output — higher = faster responses |
| Denoising | The process of cleaning up a rough draft into polished output across multiple passes |
| 128K context window | How much text the model can “see” at once — about 200 pages of text |
📖 The Backstory: A Stanford Professor Said 'What If?'

WAIT — so here’s the thing. Every single LLM you’ve ever used — ChatGPT, Claude, Gemini, all of them — works the same way under the hood. They produce text one token at a time, left to right, like a really fast typewriter. That’s called autoregressive generation.

Stefano Ermon, a Stanford CS professor, co-invented the diffusion technique that powers Midjourney and DALL-E for images. Back in 2019, he asked a wild question: what if you could do the same thing with text?

People told him it couldn’t work. Text isn’t like pixels — you can’t just “denoise” words the way you sharpen a blurry image. But his lab kept grinding. Years of research later, Inception Labs emerged from stealth in February 2025, raised $50M from Microsoft, NVIDIA, and Snowflake, and now they’ve shipped Mercury 2.

The founding team is stacked: Ermon from Stanford, Aditya Grover from UCLA, and Volodymyr Kuleshov from Cornell. Three professors who decided the entire LLM industry was building on a slow foundation.

⚙️ How Diffusion Text Actually Works (Simply)

Okay so imagine every other LLM is a person writing an essay one word at a time, never going back. Mercury 2 is more like an editor who starts with a rough draft of the ENTIRE response and then refines the whole thing at once — multiple times, really fast.

Here’s the process:

  • Step 1: Model generates a rough “noisy” sketch of the full output — all tokens at once
  • Step 2: It runs through several refinement passes, improving many tokens simultaneously (this is the “denoising”)
  • Step 3: After a small number of passes, the output converges into clean text

Because it’s refining in parallel instead of generating sequentially, the GPU gets used way more efficiently. That’s where the 10x speed comes from. Not from a faster chip — from a fundamentally different approach.
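The pass structure can be sketched as a toy simulation. This is pure illustration, not Mercury's actual algorithm: a real dLLM uses a transformer to predict the masked tokens, while here a stand-in `refine` step just copies from a known target so the loop mechanics are visible.

```python
import random

random.seed(0)  # make the toy run deterministic

TARGET = "the quick brown fox jumps over the lazy dog".split()
MASK = "<mask>"

def refine(draft, target, fraction=0.5):
    """One denoising pass: commit a fraction of the still-masked
    positions in parallel. (In a real dLLM the model predicts these
    tokens; copying from TARGET is just a stand-in.)"""
    masked = [i for i, tok in enumerate(draft) if tok == MASK]
    for i in random.sample(masked, max(1, int(len(masked) * fraction))):
        draft[i] = target[i]
    return draft

# Step 1: start from a fully "noisy" (all-masked) draft of the whole output.
draft = [MASK] * len(TARGET)
passes = 0
# Steps 2-3: repeat parallel refinement passes until the draft converges.
while MASK in draft:
    draft = refine(draft, TARGET)
    passes += 1

print(" ".join(draft))
print(f"converged in {passes} passes for {len(TARGET)} tokens")
```

Even this toy version shows the key property: the number of passes grows much more slowly than the number of tokens, because many positions are resolved per pass instead of one.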

📊 The Numbers: Mercury 2 vs The Competition
| Metric | Mercury 2 | Claude Haiku 4.5 | GPT 5.2 Mini | Gemini 3 Flash |
| --- | --- | --- | --- | --- |
| Output speed | 1,009 tok/s | ~89 tok/s | ~71 tok/s | — |
| End-to-end latency | 1.7s | 23.4s | 14.4s | — |
| Input price (per 1M tokens) | $0.25 | $1.00 | $0.50 | — |
| Output price (per 1M tokens) | $0.75 | $5.00 | $3.00 | — |
| AIME 2025 | 91.1 | — | — | — |
| GPQA | 73.6 | — | — | — |
| LiveCodeBench | 67.3 | — | — | — |
| Context window | 128K | 200K | 128K | 1M |

So it’s roughly 4x cheaper than Claude Haiku on input and nearly 7x cheaper on output. The quality benchmarks put it in the same tier as Haiku and GPT 5.2 Mini — not frontier-level, but solid for agent loops and batch processing.
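Quick sanity check on those ratios, using the per-million-token prices quoted above:

```python
# Per-million-token prices from the comparison above.
mercury = {"input": 0.25, "output": 0.75}
haiku = {"input": 1.00, "output": 5.00}

input_ratio = haiku["input"] / mercury["input"]
output_ratio = haiku["output"] / mercury["output"]
print(f"input:  {input_ratio:.1f}x cheaper")   # 4.0x
print(f"output: {output_ratio:.1f}x cheaper")  # 6.7x
```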

🗣️ What People Are Saying

The Hacker News crowd is cautiously excited but not throwing a parade yet:

  • The speed believers: “Intelligence per second matters — fast responses give you faster iteration loops.” Developers building multi-step agents see massive potential here.
  • The skeptics: Several testers said it “fails the car wash test” (a common reasoning trick question) and gets “easily fooled by usual trick questions.” Coding performance? Decent. Deep reasoning? Not there yet.
  • The realists: The positioning is Haiku-tier, not Opus-tier. It’s a workhorse for agent pipelines, not a replacement for your smartest model.
  • The annoyed: Their demo site buckled under traffic on launch day — kind of ironic for a model selling itself on speed.

NVIDIA gave a quote calling it proof of “what’s possible when new model architecture meets NVIDIA AI infrastructure.” (They’re investors, so take it with a grain of salt.)

🔍 Why This Is a Bigger Deal Than It Looks

Here’s what got me (honestly) excited. Every major LLM company — OpenAI, Anthropic, Google, Meta — is stuck on the same autoregressive bottleneck. They can make models smarter, but making them fundamentally faster at generation has been mostly about hardware and optimization tricks.

Mercury 2 is proof that a completely different architecture can work for text. Google apparently explored similar diffusion approaches but never shipped anything commercial. Inception just… did it.

The real play isn’t Mercury replacing your main AI brain. It’s Mercury running in agent loops where you need 50-100 fast calls in sequence. At 1.7 seconds per round-trip vs 23 seconds, your 20-step agent pipeline goes from an 8-minute coffee break to a 34-second wait. That changes what’s buildable.
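The back-of-the-envelope math behind that comparison, using the end-to-end latencies quoted earlier:

```python
# End-to-end latencies from the benchmark figures above (seconds per call).
steps = 20
mercury_latency, haiku_latency = 1.7, 23.4

mercury_total = steps * mercury_latency
haiku_total = steps * haiku_latency
print(f"Mercury 2: {mercury_total:.0f}s")        # 34s
print(f"Haiku 4.5: {haiku_total / 60:.1f} min")  # 7.8 min
```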

The API is OpenAI-compatible, which means you can swap it into existing toolchains with minimal effort.
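Because the API is OpenAI-compatible, the switch is mostly a base-URL change. Here is a minimal stdlib sketch of what such a request looks like; the endpoint URL and model id below are placeholders, not Inception's real values, so check their API docs before wiring this up.

```python
import json
import urllib.request

# Placeholder values: substitute the real endpoint and model id
# from Inception's API documentation.
BASE_URL = "https://api.example-inception-endpoint.com/v1"
API_KEY = "YOUR_INCEPTION_API_KEY"

def build_chat_request(prompt: str, model: str = "mercury-2") -> urllib.request.Request:
    """Build a standard OpenAI-style /chat/completions request.
    Only the base URL and model id differ from an OpenAI setup."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )

req = build_chat_request("Summarize: diffusion LLMs refine text in parallel.")
# urllib.request.urlopen(req)  # uncomment to actually send the call
```

If you already use the official `openai` client library, the same idea applies: point its `base_url` at the new endpoint and keep the rest of your code unchanged.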

Cool. So There’s a Ridiculously Fast AI That Costs Almost Nothing. Now What the Hell Do We Do? (•̀ᴗ•́)و

💰 Hustle 1: Build a Real-Time AI Chatbot Service for Local Businesses

Most small businesses still think “AI chatbot” means a clunky widget from 2019. With Mercury 2’s speed, you can build something that responds so fast it genuinely feels like a real human — no awkward loading spinners.

Package it as a monthly service: customer support bot, appointment booking, FAQ handler. Charge $200-500/mo per client. Your API costs per customer will be basically nothing at these prices.
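A rough sanity check on the "basically nothing" claim, under assumed traffic for one client (the conversation counts and token sizes are illustrative guesses, not measured numbers):

```python
# Assumed monthly traffic for one small-business chatbot client.
convos_per_month = 2_000
tokens_in_per_convo = 1_500   # assumed: system prompt + chat history
tokens_out_per_convo = 300    # assumed: bot replies

# Mercury 2 prices per million tokens, from the comparison above.
price_in, price_out = 0.25, 0.75

cost = (convos_per_month * tokens_in_per_convo / 1e6 * price_in
        + convos_per_month * tokens_out_per_convo / 1e6 * price_out)
print(f"~${cost:.2f}/month in API costs per client")
```

At roughly a dollar a month in API spend per client, a $200-500/mo retainer is essentially pure margin on the model side.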

🧠 Example: A freelance dev in Porto, Portugal built a WhatsApp-based appointment bot for dental clinics using a cheap API model. He charges €300/mo per clinic and has 14 clients — roughly €4,200/mo with near-zero infra costs.

📈 Timeline: Weekend to prototype → 2 weeks to polish → start pitching local businesses in month 1

🔧 Hustle 2: Sell High-Speed AI Data Extraction Pipelines to Agencies

Marketing agencies and SEO shops need to process thousands of web pages, extract structured data, and generate reports. Current LLM costs and latency make this expensive and slow. Mercury 2 at $0.25/M input tokens and sub-2-second responses makes batch processing absurdly cheap.

Build a pipeline that scrapes competitor sites, extracts pricing/product data, and spits out formatted reports. Charge per report or per 1,000 pages processed.
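To see why batch processing gets "absurdly cheap" at these prices, price out a 1,000-page run (the per-page token counts are assumptions; real pages vary widely):

```python
pages = 1_000
tokens_in_per_page = 1_200   # assumed: scraped page text after cleanup
tokens_out_per_page = 150    # assumed: one extracted structured record

# Mercury 2 prices per million tokens, from the comparison above.
price_in, price_out = 0.25, 0.75

cost = (pages * tokens_in_per_page / 1e6 * price_in
        + pages * tokens_out_per_page / 1e6 * price_out)
print(f"~${cost:.2f} per 1,000 pages")
```

Charging even a few dollars per 1,000 pages leaves a wide margin over the API bill.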

🧠 Example: A data analyst in Medellín, Colombia built a competitor price monitoring tool for e-commerce brands using fast API calls. She charges $800/mo per brand for daily price reports across 50 competitor sites. Running 6 clients now — $4,800/mo.

📈 Timeline: 1 week to build scraper + extraction pipeline → Demo to 3 agencies → First paying client in 2-3 weeks

📱 Hustle 3: Create an AI Coding Assistant Wrapper With Instant Responses

People pay for Cursor, Copilot, and all these AI coding tools partly because speed matters when you’re in the zone. Mercury 2’s latency means you could build a VS Code extension or terminal tool that returns code suggestions before the user finishes reading their own question.

Target specific niches — a PHP-specific assistant, a WordPress plugin generator, a Shopify theme tweaker. Niche beats general every time.

🧠 Example: A developer in Nairobi, Kenya built a terminal-based Python assistant using a fast API model, marketed it on Twitter/X to the #100DaysOfCode crowd, and sells it for $9/mo. Hit 380 subscribers in 3 months — $3,420/mo.

📈 Timeline: 1-2 weekends to build MVP → Ship on Product Hunt → Iterate based on feedback for 30 days

📝 Hustle 4: Offer Instant Document Processing for Legal and Medical Offices

Law firms and clinics drown in paperwork. Summarizing case files, extracting key dates from medical records, flagging inconsistencies in contracts — these are perfect tasks for a fast, cheap model with 128K context.

Build a simple upload-and-process web app. Charge per document or monthly subscription.

🧠 Example: A tech-savvy paralegal in São Paulo, Brazil built an internal document summarizer for her firm using API calls, then started selling it to other small firms. She charges R$500/mo (~$90) per firm and has 30 clients — R$15,000/mo ($2,700).

📈 Timeline: 2 weeks for web app + document parser → Pilot with one firm → Referral-based growth from month 2

⚡ Hustle 5: Build and Sell Multi-Agent Automation Templates

This is where Mercury 2’s speed really shines. Agent frameworks like CrewAI and AutoGen run loops of LLM calls — and every call’s latency adds up fast. At 10x the speed and 4x cheaper pricing, you can build multi-agent workflows that actually finish in reasonable time.

Create templates for common workflows: lead research agents, content generation pipelines, automated code review bots. Sell them as Gumroad products or offer setup-as-a-service.
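A lead research chain like the one mentioned above (scraper → qualifier → enricher → email writer → scheduler) can be sketched as a simple stage pipeline. The stage names and logic here are stubs I made up for illustration; in a real template each stage would be one fast LLM call:

```python
from functools import reduce

# Toy 5-stage lead research chain. Each stub stands in for one LLM call.
def scraper(d):   return {**d, "raw": f"homepage text for {d['company']}"}
def qualifier(d): return {**d, "qualified": "homepage" in d["raw"]}
def enricher(d):  return {**d, "employees": 42}  # stub CRM/API lookup
def writer(d):    return {**d, "email": f"Hi {d['company']} team, ..."}
def scheduler(d): return {**d, "followup_in_days": 3}

PIPELINE = [scraper, qualifier, enricher, writer, scheduler]

def run(lead: dict) -> dict:
    """Thread the lead dict through every stage in order."""
    return reduce(lambda d, stage: stage(d), PIPELINE, lead)

result = run({"company": "Acme"})
print(result["email"])
```

Because the stages run sequentially, per-call latency multiplies by the number of stages, which is exactly where a fast model pays off.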

🧠 Example: An automation consultant in Bucharest, Romania built a 5-agent lead research pipeline (scraper → qualifier → enricher → email writer → scheduler) for B2B SaaS companies. Sells the template for $149 on Gumroad and custom setups for $1,500. Made $8,200 last month between templates and consulting.

📈 Timeline: 1 week per template → List on Gumroad/Twitter → First sales within 2 weeks of launch

🛠️ Follow-Up Actions
| Step | Action |
| --- | --- |
| 1 | Sign up for Mercury 2 API early access at chat.inceptionlabs.ai |
| 2 | Test it against your current LLM on speed-sensitive tasks — agent loops, batch extraction, real-time chat |
| 3 | Compare actual quality on YOUR use case (benchmarks are marketing — your data is truth) |
| 4 | Start with the highest-speed-dependency hustle from above that fits your skills |
| 5 | Join the Inception Discord or follow @StefanoErmon for model updates and pricing changes |

⚡ Quick Hits

| Want to… | Do this |
| --- | --- |
| 🧪 Try Mercury 2 right now | Hit up chat.inceptionlabs.ai for the demo interface |
| 🔌 Plug it into existing code | It’s OpenAI-compatible — swap your base URL and API key, done |
| 📊 See the benchmarks yourself | Check their blog post for full comparison tables |
| 💡 Understand diffusion LLMs deeper | Read the original Mercury paper on arXiv |
| 💬 See what devs think | Browse the HN discussion thread |

Every LLM on earth writes one word at a time — Mercury 2 just proved they don’t have to.
