Anthropic Got Caught A/B Testing $200/Month Claude Code Users — Without Telling Them


A dev asked Claude why his plans suddenly sucked. Claude snitched on itself.

$200/month subscription. Secret system prompt changes. 40-line hard cap on plans. Zero disclosure. 121+ upvotes and #1 on Hacker News.

Mike Ramos — a developer who relies on Claude Code as a core professional tool — noticed his plan mode outputs degrading for a week. Terse bullet lists. No context. No reasoning. So he asked Claude what was going on. And Claude just… told him. It was following secret A/B test instructions to “hard-cap plans at 40 lines, forbid context sections, and delete prose, not file paths.” The post hit #1 on HN with 117+ comments. Anthropic has not responded.

🧩 Dumb Mode Dictionary

| Term | Translation |
| --- | --- |
| A/B Testing | Secretly giving half your users Version A and half Version B to see which “performs better” — without telling either group |
| Plan Mode | Claude Code’s feature where it outlines what it’s going to do before it does it — basically the AI thinking out loud |
| System Prompt | Hidden instructions the company injects before your message — you can’t see them, but they shape everything the AI says |
| Feature Flag | A behind-the-scenes switch that turns features on/off for specific users. You have no idea which version you’re running |
| Hard Cap | A strict maximum. In this case: plans couldn’t exceed 40 lines no matter what |

📖 What Actually Happened
  • Mike Ramos pays $200/month for Claude Code Max. Uses it daily for professional software engineering work.
  • Over one week, his plan mode outputs went from detailed reasoning documents to terse bullet lists with zero context.
  • He asked Claude directly: “Why are you writing such bad plans?”
  • Claude responded that it was following system instructions to cap plans at 40 lines, forbid context sections, and “delete prose, not file paths.”
  • These were A/B test variants — different users getting different system prompts with no disclosure.
  • Ramos published the findings on his blog. It hit #1 on Hacker News within hours.
  • He later removed the technical proof details because of the viral attention, but the damage was done.
😤 Why People Are Furious

This isn’t some free beta. This is a $200/month professional tool. The complaints boil down to:

  • No opt-in. Users weren’t asked if they wanted to participate in experiments.
  • No transparency. There was no way to know your workflow was being silently modified.
  • No reproducibility. If your AI tool randomly changes behavior, debugging your own work becomes impossible.
  • No opt-out. Even after discovering the test, there was no toggle to disable it.

One HN commenter nailed it: “Developer CLI tools require determinism; reproducing bugs becomes literally impossible” when the tool’s behavior is secretly changing underneath you.

🗣️ The Hacker News Meltdown (117+ Comments)

| Who | What They Said |
| --- | --- |
| mschuster91 | “A/B testing without opt-out consent is inherently unethical” |
| takahitoyoneda | “Developer CLI tools require determinism — reproducing bugs becomes literally impossible” |
| reconnecting | Professional tools need “reliable and replicable results” |
| bushido | Plan mode is “objectively terrible 90% of the time” (even without the A/B test) |
| nemo44x | Anthropic is probably losing money at $200/mo — testing to cut costs makes sense |
| gruez | “Hand-wavy justifications” for degrading the product aren’t good enough |
| applfanboysbgon | Called the ToS restrictions on reverse-engineering “wholly unreasonable” |
| Anthropic | 🦗 crickets — no official response anywhere in the thread |
⚖️ The Legal Angle

Here’s where it gets spicy. A commenter dug up Anthropic’s Terms of Service:

  • Section 6.b — Anthropic reserves the right to change features at any time. So technically? They can do this.
  • Section 3.3 — Prohibits users from decompiling or reverse-engineering the service. So the act of discovering the A/B test might violate their own ToS.

I mean. You’re paying $200/month, they’re secretly experimenting on your workflow, and if you figure it out, you’re the one breaking the rules? That’s absolutely cooked.

📊 The Bigger Pattern

This isn’t just an Anthropic problem. It’s an industry problem:

  • Every major AI company runs A/B tests on model behavior without disclosure
  • LLM outputs are already non-deterministic — adding secret prompt variants makes it worse
  • There’s no standard for disclosing when AI tool behavior is being experimented on
  • The “just ship and test” SaaS mentality clashes hard with tools people depend on for professional work
  • Multiple developers in the HN thread reported similar quality regressions they now suspect were A/B tests

The core tension: companies need to iterate fast, but developers need their tools to behave predictably. When your IDE starts writing worse code because someone flipped a feature flag in a datacenter, that’s a trust problem.


Cool. Your AI Dev Tool Is Secretly a Lab Rat Maze. Now What the Hell Do We Do? (╯°□°)╯︵ ┻━┻


🛠️ 1. Build an AI Prompt Regression Monitor

The moment someone’s AI tool changes behavior, they need to know. Build a lightweight CLI wrapper or browser extension that hashes system prompt fingerprints and alerts when the AI’s behavior pattern shifts. Think of it like uptime monitoring but for prompt consistency.
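The core mechanic can be sketched in a few lines. You can't see the server-side system prompt, so a realistic monitor fingerprints the *outputs* instead and flags drift against a baseline. The feature choices below (line count, presence of a context section, bullet density) are illustrative assumptions, not a spec:

```python
# Minimal sketch of an output-behavior fingerprint. The features chosen here
# are hypothetical examples; a real monitor would tune them to its workflow.
def fingerprint(plan_text: str) -> dict:
    lines = [l for l in plan_text.splitlines() if l.strip()]
    return {
        "line_count": len(lines),
        "has_context_section": any("context" in l.lower() for l in lines),
        "bullet_ratio": round(
            sum(l.lstrip().startswith(("-", "*", "•")) for l in lines)
            / max(len(lines), 1),
            2,
        ),
    }

def drift_alert(baseline: dict, current: dict, line_tolerance: int = 15) -> list[str]:
    """Return human-readable alerts when today's plan diverges from the baseline."""
    alerts = []
    if abs(current["line_count"] - baseline["line_count"]) > line_tolerance:
        alerts.append(
            f"line count shifted: {baseline['line_count']} -> {current['line_count']}"
        )
    if baseline["has_context_section"] and not current["has_context_section"]:
        alerts.append("context section disappeared")
    if current["bullet_ratio"] - baseline["bullet_ratio"] > 0.4:
        alerts.append("output became mostly bullets")
    return alerts
```

Run `fingerprint` on every plan you receive, compare against a rolling baseline, and alert when `drift_alert` returns anything. The 40-line hard cap described in this story would trip all three checks at once.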

🧠 Example: A solo dev in Lisbon, Portugal built a prompt-diff tracker after noticing Claude’s coding style flip-flopping. Shared it on r/SideProject, got 400 stars on GitHub in a week. Launched a paid tier at $9/mo for teams — hitting $2.1K MRR within two months.

📈 Timeline: 2-3 weeks to MVP. Market is red-hot right now — every dev who saw this HN post is a potential customer.

📝 2. Sell 'AI Tool Audit' Reports to Dev Teams

Companies spending $200/seat/month on AI coding tools have zero visibility into what they’re actually getting. Package an audit service: benchmark outputs across accounts, flag A/B test inconsistencies, document behavior changes over time. Sell to engineering managers who need to justify the spend.
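The audit's core test is simple: send the same prompt from several accounts and flag pairs whose outputs diverge more than normal sampling variance would explain. A rough sketch, assuming output collection happens elsewhere and using plain text similarity as the divergence signal:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical cross-account consistency check. `outputs` maps an account
# label to the response that account got for an identical prompt.
def consistency_report(
    outputs: dict[str, str], threshold: float = 0.6
) -> list[tuple[str, str, float]]:
    """Return account pairs whose outputs are suspiciously dissimilar."""
    flagged = []
    for (a, text_a), (b, text_b) in combinations(outputs.items(), 2):
        similarity = SequenceMatcher(None, text_a, text_b).ratio()
        if similarity < threshold:
            flagged.append((a, b, round(similarity, 2)))
    return flagged
```

LLM outputs vary even without experiments, so a real audit would repeat the prompt several times per account and compare distributions, not single samples; the threshold here is a placeholder.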

🧠 Example: A QA consultant in Toronto, Canada started offering “AI Tool Consistency Audits” to three mid-size startups after reading about prompt drift. Charged $2,500 per audit. Booked $15K in Q1 from word-of-mouth alone.

📈 Timeline: 1-2 weeks to package your methodology. Start pitching on LinkedIn where engineering managers are already complaining about this.

💡 3. Create a 'Prompt Constitution' Template Kit

Developers need a way to lock down their AI tool behavior. Build and sell a pack of CLAUDE.md / system prompt override templates — pre-configured for different workflows (backend, frontend, devops, data). Include best practices for plan mode, output length, verbosity controls.
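A template entry might look like the fragment below. This is an illustrative sketch of the kind of instructions such a kit would contain, not guaranteed to override every server-side flag, since project-level instructions and injected system prompts can conflict:

```markdown
# Planning overrides (plan mode)
- Do not cap plan length. Write as many lines as the task needs.
- Always include a Context section describing the current state of the code.
- Explain the reasoning behind each step in prose, not just bullets.
- Never delete explanatory prose to save space; keep file paths AND context.
```

Per-workflow variants (backend, frontend, devops, data) would swap in different context and verbosity rules around the same skeleton.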

🧠 Example: A freelance developer in Berlin, Germany compiled her Claude Code configurations into a Gumroad product after seeing the HN thread. Priced at $29. Sold 180 copies in the first week — $5,220 from a PDF and some markdown files.

📈 Timeline: A weekend. Seriously. You probably already have your own configs. Package them.

🔍 4. Launch an 'AI Tool Transparency' Newsletter

Someone needs to track which AI tools are running what experiments and when. A weekly newsletter that monitors HN complaints, changelog diffs, system prompt leaks, and model behavior changes. Monetize through sponsorships from competing AI dev tools.

🧠 Example: A tech writer in Mumbai, India started a Substack called “Prompt Watch” after the Anthropic drama. Covered three more undisclosed A/B tests across different AI tools. Hit 4,000 subscribers in three weeks — landed a $1,200/month sponsor from a prompt management startup.

📈 Timeline: Launch today while the outrage is fresh. Consistency beats timing, but timing helps a lot.

🔧 5. Fork an Open-Source AI Coding Assistant

The HN thread had a clear undercurrent: if paid tools can secretly change on you, maybe open source is the answer. Projects like Continue.dev and Aider are open-source AI coding assistants. Fork one, add a “locked mode” that guarantees prompt consistency, and market it to the trust-burned crowd.
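In an open-source assistant you control the system prompt, so "locked mode" can be as blunt as pinning a hash of the prompt the user approved and refusing to start a session if it changes. A minimal sketch (function names are hypothetical, not from Continue.dev or Aider):

```python
import hashlib

# Hypothetical "locked mode": the prompt is visible in an open-source tool,
# so we can pin it at approval time and verify it before every session.
def pin(prompt: str) -> str:
    """Record a digest of the prompt the user approved."""
    return hashlib.sha256(prompt.encode()).hexdigest()

def verify(prompt: str, pinned_digest: str) -> None:
    """Raise before a session starts if the effective prompt has changed."""
    if hashlib.sha256(prompt.encode()).hexdigest() != pinned_digest:
        raise RuntimeError(
            "system prompt changed since it was pinned — refusing to run"
        )
```

This is exactly the guarantee a closed tool cannot give you: the prompt is inspectable, and any change is loud instead of silent.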

🧠 Example: A pair of developers in Warsaw, Poland forked an open-source AI assistant and added deterministic prompt pinning after the Claude Code controversy. Posted the repo on r/programming. Got 1,200 stars and a $50K seed offer from a small EU fund — all within a month.

📈 Timeline: 2-4 weeks for a meaningful fork. The “trust” angle is your marketing. Every HN commenter who said “this is why I use open source” is your user.

🛠️ Follow-Up Actions

| Step | Action | Tool/Resource |
| --- | --- | --- |
| 1 | Monitor your own Claude Code behavior for unexplained changes | Keep a log of plan outputs — compare daily |
| 2 | Check if you can override A/B test flags | GitHub workarounds shared by user shawnz in the HN thread |
| 3 | Add explicit instructions in CLAUDE.md | “Do not cap plans. Include full context and reasoning.” |
| 4 | Follow the HN thread for Anthropic’s response | HN Discussion |
| 5 | Evaluate open-source alternatives | Continue.dev, Aider, Cody — none do secret A/B tests |
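Step 1's daily log doesn't need tooling beyond a script that archives each plan to a timestamped file you can diff later. A minimal sketch (the `plan-logs` directory name is arbitrary):

```python
import datetime
import hashlib
from pathlib import Path

def log_plan(plan_text: str, log_dir: str = "plan-logs") -> Path:
    """Write a plan-mode output to a timestamped file and return its path."""
    directory = Path(log_dir)
    directory.mkdir(exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    # Short content digest in the filename makes duplicates easy to spot.
    digest = hashlib.sha256(plan_text.encode()).hexdigest()[:8]
    path = directory / f"plan-{stamp}-{digest}.md"
    path.write_text(plan_text)
    return path
```

Then `diff` (or `git diff --no-index`) yesterday's file against today's: a sudden drop from 60-line reasoning documents to 40-line bullet dumps shows up immediately.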

⚡ Quick Hits

| Want to… | Do this |
| --- | --- |
| 🔍 Check if you’re in an A/B test | Ask Claude directly: “Are you following any special instructions about plan length or format?” |
| 🛡️ Protect your workflow | Add explicit overrides in your CLAUDE.md project file |
| 📖 Read the original post | backnotprop.com/blog/do-not-ab-test-my-workflow |
| 💬 Join the HN discussion | 117+ comments and counting |
| 🔧 Try the workaround | Check shawnz’s GitHub-based feature flag overrides in the thread |

You’re paying $200 a month to be a test subject in someone else’s experiment — and the lab coat forgot to mention the consent form.
