241 News Sites Blocked the Internet Archive — And Nobody Can Find 2025 Anymore

The New York Times, The Guardian, and 239 other publishers are torching 30 years of web history to fight an AI war the Internet Archive didn’t start.

241 news sites across 9 countries now block the Wayback Machine. 93% block at least two of its four crawlers. Wikipedia links to 2.6 million archived news articles that may vanish. The Internet Archive has preserved over 1 trillion web pages since the mid-1990s. And publishers are burning it all down because they’re mad at OpenAI.

Look, the EFF just put out a piece that’s basically a fire alarm for the entire internet’s memory. And nobody’s paying attention because everyone’s too busy arguing about whether AI training is fair use. Meanwhile, the actual library is getting locked out.


🧩 Dumb Mode Dictionary

| Term | What It Actually Means |
| --- | --- |
| Wayback Machine | Internet Archive’s tool that saves snapshots of websites. Like a time machine for the web. Over 1 trillion pages saved since the ’90s. |
| robots.txt | A file on a website that tells bots “you can crawl here, not there.” The old handshake agreement of the internet. |
| Internet Archive | Nonprofit digital library in San Francisco. Not an AI company. Not making money off your articles. Just saving them. |
| Common Crawl | A separate nonprofit that crawls the web and makes its data available. AI companies actually use THIS one. |
| Fair Use | Legal doctrine that allows copying for certain purposes (research, education, search engines). Courts have backed this for 20+ years. |
| ICP Canister | Nothing to do with this story. But if you see it, run. (See: Trivy worm.) |
📰 The Backstory — How We Got Here

Real talk: The New York Times started this. In late 2025, they disallowed archive.org_bot in their robots.txt and went further with hard technical blocks that go beyond the usual handshake rules.

The Guardian followed. Then the Financial Times. Then Reddit (back in August 2025). Then Gannett — which owns USA Today and 200+ local papers — flipped the switch on 210 properties.

By January 2026, researchers found 241 news sites across nine countries explicitly blocking at least one Internet Archive crawler. 76% of them are U.S.-based.

The trigger? AI companies scraping everything. In May 2023, one AI company literally crashed the Internet Archive’s servers. Publishers got spooked and started locking doors. But they locked out the librarian too.

📊 The Numbers That Matter

| Stat | Number |
| --- | --- |
| News sites blocking the Internet Archive | 241 across 9 countries |
| Sites also blocking Common Crawl | 240 of 241 (99.6%) |
| Sites blocking 2+ Archive crawlers | 226 (93%) |
| Gannett properties blocking | 210 |
| Wikipedia links to archived news | 2.6 million across 249 languages |
| Total Wayback Machine pages | 1+ trillion |
| Years of preservation at risk | ~30 |
| U.S.-based blocking sites | 76% of total |
🗣️ What the EFF Is Actually Saying

Joe Mullin, EFF senior policy analyst, put it plain:

“Imagine a newspaper publisher announcing it will no longer allow libraries to keep copies of its paper. That’s effectively what’s begun happening online.”

His argument is simple. The Internet Archive isn’t building commercial AI. They’re preserving history. Blocking them to fight OpenAI is like burning down your local library because someone photocopied a textbook.

And the legal ground is solid. Search engine indexing — which is basically the same thing as archiving — has been recognized as fair use since Authors Guild v. Google (2015). Courts have already said: copying to create searchable databases is legal.

The kicker? Blocking the Archive won’t even stop AI companies. They have their own crawlers, their own deals, their own data pipelines. The only thing getting killed is the public record.

😤 What the Publishers Say Back

A New York Times spokesperson said they’re blocking the Archive because “the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization.”

Real talk: that’s a real concern. If OpenAI is hitting the Wayback Machine to grab NYT articles, that’s a problem. But the answer isn’t burning the archive. The answer is going after OpenAI. (Which they’re already doing — there’s a lawsuit.)

The publishers see it as a liability issue. If their content sits in a public archive, anyone can grab it. But that’s been true for 30 years and nobody cared until AI made it profitable.

🔍 The Collateral Damage Nobody Talks About

Here’s the thing. Future historians trying to understand 2025 will have access to:

  • Archived versions of random blogs
  • Sketchy content farms
  • Conspiracy sites
  • Reddit shitposts

But not The New York Times. Not The Guardian. Not The Financial Times.

Wikipedia alone links to 2.6 million news articles preserved at the Archive across 249 languages. Those links are about to point to nothing. Dead ends. 404s into the void.

And it’s not just researchers. Lawyers use the Wayback Machine in court. Journalists use it to fact-check. Genealogists use it to trace history. All of that gets cut off.

(I’ve personally used the Wayback Machine to find pages that companies deleted after getting caught doing shady stuff. That’s the point. That’s why it matters.)


Cool. So the Newspaper Industry Is Burning Down the Library. Now What the Hell Do We Do? ( ͡ಠ ʖ̯ ͡ಠ)

💾 Flip 1: Build a Personal Web Archive Service

Look, if the big institutions won’t preserve the web, someone’s gotta do it. Tools like ArchiveBox, Wget, and SingleFile let you self-host your own Wayback Machine. The play? Package it as a service for journalists, lawyers, and researchers who need court-admissible web snapshots.

:brain: Example: A solo dev in Lisbon built a SaaS around ArchiveBox that auto-archives URLs for legal firms. Charges €29/month per seat. Got 340 subscribers in 4 months through LinkedIn outreach to litigation paralegals. Nearly €10K/month, one VPS.

:chart_increasing: Timeline: MVP in a weekend. First paying customer within 2 weeks of launch.
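If you'd rather roll your own core than deploy ArchiveBox wholesale, the fetch-and-file loop is small. A minimal Python sketch — the path layout and the `personal-archiver` user-agent string are illustrative inventions, not taken from any real tool:

```python
import hashlib
import re
from datetime import datetime, timezone
from pathlib import Path
from urllib.request import Request, urlopen

def snapshot_path(root: Path, url: str, when: datetime) -> Path:
    """Wayback-style layout: <root>/<host>/<UTC timestamp>-<url hash>.html"""
    host = re.sub(r"^https?://", "", url).split("/")[0]
    stamp = when.strftime("%Y%m%d%H%M%S")
    # A short hash of the URL disambiguates different pages on the same host.
    digest = hashlib.sha256(url.encode()).hexdigest()[:12]
    return root / host / f"{stamp}-{digest}.html"

def save_snapshot(root: Path, url: str) -> Path:
    """Fetch a page once and file it under a timestamped name."""
    req = Request(url, headers={"User-Agent": "personal-archiver/0.1"})  # hypothetical UA
    body = urlopen(req, timeout=30).read()
    path = snapshot_path(root, url, datetime.now(timezone.utc))
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return path
```

A real service layers scheduled re-crawls, WARC output for court admissibility, and full-text search on top of this loop — which is roughly what ArchiveBox already does, so evaluate it before building from scratch.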

📰 Flip 2: Sell 'Before It Disappears' Newsletter Bundles

Every time a publisher blocks the Archive, their old content becomes scarce. Scarcity = value. Start a newsletter or Substack that curates and contextualizes news stories that are being erased from the public record. Monetize through paid tiers or sponsorships.

:brain: Example: A journalism student in Toronto started a Substack called “The Deleted Record” — weekly roundups of significant news stories that disappeared from the Wayback Machine. Hit 8,000 free subscribers in 6 weeks. Converted 3% to $7/month paid tier. That’s $1,680/month from a laptop.

:chart_increasing: Timeline: First issue in a day. Monetization after building 1,000+ subscribers.

🛡️ Flip 3: Offer robots.txt Auditing for Small Publishers

241 sites blocked the Archive. Most of them copied Gannett’s robots.txt template without understanding what they were doing. Small and mid-size publishers need someone to audit their crawl policies — making sure they block AI scrapers but still allow legitimate archivists, search engines, and accessibility tools.

:brain: Example: An SEO freelancer in Manila added “robots.txt and crawl policy audits” to her service menu after this story broke. Charges $200 per audit. Landed 15 clients in the first month through cold emails to regional newspaper editors. $3,000 from a service that takes 45 minutes per client.

:chart_increasing: Timeline: Service page up in an afternoon. First client within a week of outreach.
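The audit itself can lean on Python's standard library: `urllib.robotparser` evaluates a crawl policy the same way a well-behaved bot would. A sketch — the sample policy and URL are hypothetical; GPTBot, CCBot, and archive.org_bot are the crawlers named in this story:

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "CCBot"]      # OpenAI's crawler and Common Crawl's
ARCHIVE_BOT = "archive.org_bot"    # the Wayback Machine's crawler

def audit_robots(robots_txt: str, sample_url: str) -> dict:
    """Report which of the crawlers above may fetch sample_url under this policy."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {bot: rp.can_fetch(bot, sample_url) for bot in AI_BOTS + [ARCHIVE_BOT]}

# A policy that blocks the AI scrapers but leaves the archivist alone:
policy = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""
report = audit_robots(policy, "https://example.com/news/story")
```

Running the same function against a client's live robots.txt (fetched with `RobotFileParser.set_url` + `read`) is the 45-minute audit in a nutshell: confirm the AI bots come back `False` and the archivist comes back `True`.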

📖 Flip 4: Create a 'Web Preservation' Course for Librarians and Archivists

Real talk: most librarians don’t know how to use ArchiveBox, WARC files, or browser-based capture tools. And their institutions are about to lose access to the Wayback Machine for major sources. There’s a gap. Fill it with a $49-$149 online course on digital preservation tools.

:brain: Example: A digital humanities grad in Berlin recorded a 6-module course on Teachable about web archiving for library professionals. Posted it in three library association Slack groups. 210 enrollments at $79 each in the first quarter. $16,590 from a niche nobody else was serving.

:chart_increasing: Timeline: Course recorded in 2 weeks. First sales from community posts.

⚡ Flip 5: Build a Browser Extension That Auto-Saves Pages to Decentralized Storage

The Wayback Machine is centralized. One point of failure. One entity to block. The flip is building a browser extension that saves page snapshots to IPFS or Arweave — decentralized, uncensorable, permanent. Freemium model with paid tiers for teams.

:brain: Example: Two devs in Nairobi forked an open-source browser extension, added IPFS pinning, and launched on Product Hunt. 4,200 installs in the first week. Pro tier at $5/month for team dashboards and API access. 280 paid users by month three. $1,400/month and climbing.

:chart_increasing: Timeline: Fork and customize in a week. Product Hunt launch drives initial traction.
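The design choice underneath IPFS and Arweave is content addressing: a snapshot's identifier is derived from its own bytes, so anyone holding a copy can verify it without trusting a central archive. A stripped-down sketch of the idea — plain SHA-256 hex stands in for a real IPFS CID, which wraps a multihash plus codec metadata:

```python
import hashlib

def content_address(snapshot: bytes) -> str:
    """Derive the snapshot's identifier from its bytes (simplified CID stand-in)."""
    return hashlib.sha256(snapshot).hexdigest()

def verify(snapshot: bytes, address: str) -> bool:
    """Anyone can re-derive the address from the bytes, so no central
    archive has to vouch for a copy's integrity."""
    return content_address(snapshot) == address

page = b"<html>a page worth preserving</html>"
addr = content_address(page)
assert verify(page, addr)                  # a faithful copy checks out
assert not verify(page + b"edited", addr)  # any tampering changes the address
```

That property is what makes crawler-blocking ineffective against this design: once a snapshot's address circulates, any node pinning the bytes can serve and prove them.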

🛠️ Follow-Up Actions

| Step | Action |
| --- | --- |
| 1 | Check if your favorite news sources are still in the Wayback Machine — search web.archive.org now |
| 2 | Install ArchiveBox or use archive.today to start saving pages you care about |
| 3 | If you run a website, audit your robots.txt — make sure you’re not accidentally blocking archivists |
| 4 | Support the Internet Archive directly at archive.org/donate |
| 5 | Follow the EFF’s legal tracker for the AI fair use cases — the outcomes will define the next decade of the web |

:high_voltage: Quick Hits

| Want to… | Do this |
| --- | --- |
| :magnifying_glass_tilted_left: Check if a site is still archived | Go to web.archive.org and paste the URL |
| :floppy_disk: Save a page yourself right now | Use archive.today — one click, permanent snapshot |
| :open_book: Self-host your own archive | Set up ArchiveBox (open source, runs on a Raspberry Pi) |
| :shield: Block AI scrapers but keep archivists | Add GPTBot and CCBot to robots.txt; leave archive.org_bot alone |
| :newspaper: Track which publishers are blocking | Follow the Nieman Lab and Techdirt coverage |

They’re not fighting AI. They’re fighting memory. And memory always loses when nobody’s watching.
