Why Data Is the New Oil — Big Tech's Real Advantage 🛢️

The New Oil — Why Data Beats Code in AI :fire:

Big tech got rich on AI not because of better code, but because they hoarded all the data.


:world_map: The Real AI Power Source

Google, Facebook, and Microsoft didn’t win AI because their algorithms are magic. They won because they have mountains of your data to train those algorithms on. The code is free. The data is the monopoly.


Why this matters:
AI needs data like cars need oil → companies with data control AI → data = new power/wealth → understanding this = understanding who wins

The reality:
✓ Google/Facebook open-sourced their AI code (free for everyone)
✓ But they keep the training data locked up (the actual valuable part)
✓ Data value increases every day AI advances
✓ 80% of the world’s data is privately held, not on the public internet
✓ Companies releasing “free AI tools” = recruiting tactic, not charity
✓ Future competition = who has the best data, not best code


The Quote That Explains Everything

'Data Is the New Oil' — Fortune Brainstorm Tech 2016

Shivon Zilis (Bloomberg Beta partner) used the line at Fortune’s Brainstorm Tech conference in Aspen (the phrase itself is older and is usually credited to Clive Humby, 2006):

“Data is the new oil.”

What she meant:
Just like oil powered the industrial revolution, data powers the AI revolution. Control the resource = control the industry.

David Kenny (IBM Watson) added:
“The value of data goes up every day AI advances… Data will become a currency.”

Only 20% of the world’s information is on the internet. The other 80% is privately held inside companies.

That private data = competitive moat = why big tech stays big.


Why Big Tech “Gives Away” AI Code

The Free Software Trap

Companies releasing AI for free:
→ Google open-sourced TensorFlow
→ Facebook open-sourced PyTorch
→ Microsoft released AI toolkits
→ Amazon released machine learning tools

Why they do this:
Not charity. Recruiting strategy.

Release free AI tools → developers build on them → best developers get hired by the companies → companies keep the valuable data locked up.

The actual competition:
Not code (that’s free now). Data these companies possess.

You can download Google’s AI software. You can’t download Google’s search data, user behavior patterns, or 20 years of collected information.


The AI Winter Context

Why This Time Might Be Different

History:
AI had two major “winters” where hype died:
→ Mid-1970s: AI failed to meet expectations, research funding dried up
→ Late 1980s - early 1990s: Same pattern, years of declining research

Why those failed:
Not enough computing power. Not enough data. Algorithms hit limits.

Why now is different (according to Zilis):
→ Cloud computing = massive processing power accessible to everyone
→ Data collection exploded (smartphones, internet, sensors everywhere)
→ Existing algorithms work way better with more data + compute
→ New, more powerful algorithms possible

Her prediction:
Companies won’t lose interest in AI this time because the fundamental resources (data + compute) are finally sufficient.

The catch:
Data isn’t evenly distributed. Big tech has most of it.


What This Means for You

If you’re building something:
Code is free. Data is the moat. Figure out how to collect/own your data.

If you’re investing:
Companies with proprietary data > companies with just good code.

If you’re a user:
Your data made these companies rich. They got the oil. You got free services.

The future:
Whoever controls data controls AI. Whoever controls AI controls… a lot.


Data = new oil. Big tech pumped it. Now they own the wells. :fire:


Source: Fortune - Why Data Is The New Oil



You are :100: right! The Data Is The New Oil — And You’re The Damn Well They’re Drilling :oil_drum:

Big tech didn’t win AI because they’re geniuses. They won because they hoarded your shit.


:world_map: What You’re Walking Away With

Everything you need to understand why your 2AM doomscroll is worth more than your paycheck — and why trillion-dollar companies are fighting over the digital exhaust you leave behind like seagulls over a french fry.


Why This Actually Matters

  • Zero skills needed → Still affects your money, privacy, and leverage in 2026
  • They’re literally selling you → And you’re not getting a cut
  • The game is rigged → But at least now you’ll know HOW it’s rigged

What You Get From This

  • :brain: Finally understand why “free” apps cost you everything
  • :money_bag: Learn what your data is actually worth on the black market
  • :locked: See exactly how companies turn your clicks into cash
  • :crossed_swords: Discover the secret data wars between nations
  • :crystal_ball: Find out why AI companies are running out of training data (yes, really)
  • :shield: Get actual tools to fight back (or at least be less clueless)

The punchline nobody told you:
Google, Facebook, Microsoft — they didn’t win AI because their code is magic. They won because they have mountains of your data to train those algorithms on. The code is free. The data is the monopoly.


🏛️ The Real Power Move: Why Code Is Free But Data Isn't

Here’s the thing that’ll piss you off:

Google open-sourced TensorFlow. Free for everyone.
Meta open-sourced PyTorch. Free for everyone.
Microsoft open-sourced its AI toolkits. Still free.

“How generous!” you might think.

Lol. No.

They gave away the recipe because they knew you don’t have the ingredients. The actual power isn’t in the algorithm — it’s in the billions of data points they’ve been collecting while you searched for “why does my cat stare at walls” at 3am.

The reality check:

  • ✓ Google/Facebook open-sourced their AI code (free for everyone)
  • ✓ But they keep the training data locked up (the actual valuable part)
  • ✓ Data value increases every day AI advances
  • ✓ 80% of the world’s data is privately held, not on the public internet
  • ✓ Companies releasing “free AI tools” = recruiting tactic, not charity
  • ✓ Future competition = who has the best data, not the best code

Translation: They gave you the gun. They kept the bullets.

🛢️ The Oil Metaphor — Let's Push It Until It Breaks

Everyone keeps saying “data is the new oil.” Fine. Let’s actually think about what that means.

Who Are The Digital Rockefellers?

Just as Standard Oil controlled about 90% of US oil refining in the late 1800s, a handful of companies now control most of the world’s data:

  • Google → Knows what you search, where you go, what you watch
  • Meta → Knows who you know, what you like, what makes you angry
  • Amazon → Knows what you buy, what you want, what you can afford
  • Apple → Knows your health, your conversations, your face

These aren’t tech companies. They’re data extraction operations that happen to sell phones and ads.

Can Data “Spill” Like Oil?

Oh, absolutely. It’s called a data breach and it’s arguably worse.

Oil spills kill ecosystems. Data spills kill your identity, credit score, and peace of mind — forever. Unlike oil, you can’t clean up leaked data. Once your SSN hits the dark web, it’s there until the heat death of the universe.

The fun part? Companies treat data breaches like oil companies treated spills in the 1970s: deny, delay, pay a fine that equals 0.01% of profits, repeat.

Will We Hit “Peak Data”?

Here’s where it gets weird. AI companies are actually running out of high-quality human-written text to train on. One estimate: sometime between 2026 and 2028, training runs will have used up the useful human-generated content on the public internet.

After that? AI starts training on AI-generated content. Which leads to “model collapse” — basically AI inbreeding where each generation gets slightly dumber and weirder.

The irony: Tech companies spent years scraping the entire internet without asking. Now they’re running dry and acting surprised.

What’s The “Refined Gasoline” Version of Data?

Raw data = crude oil. Useless until processed.

Refined data products:

  • Your browsing history → Targeted ad profiles
  • Your location data → Foot traffic analytics for retail
  • Your health app data → Insurance risk assessments
  • Your typing patterns → Behavioral authentication
  • Your face → Surveillance capitalism

Your phone isn’t a communication device. It’s a pocket-sized oil refinery extracting value from everything you do.

💰 Data Economics: What Your Shit Is Actually Worth

Street Value of Your Daily Doomscroll

Let’s get specific. Here’s what your data sells for on the black market (and the legal “gray” market):

Dark Web Prices (2025):

| Data Type | Price |
| --- | --- |
| Full identity package (SSN, DOB, address) | $15-$30 |
| Credit card with CVV | $5-$25 |
| Bank login credentials | $40-$200 |
| Medical records | $250-$1,000 |
| Driver’s license scan | $20-$100 |
| Selfie with ID (for verification bypass) | $40-$60 |

Legal Data Broker Prices:

| Data Type | Price Per Record |
| --- | --- |
| Name + Email | $0.007 |
| Name + Email + Demographics | $0.20 |
| Mobile advertising ID | $0.01-$0.04 |
| Precise location history | $0.50-$2.00 |
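
To make the per-record numbers concrete, here is a rough sketch of what a list of records would fetch at the broker prices in the table above. The dictionary keys, the function name, and the one-million list size are made up for illustration; only the prices come from the table.

```python
# Per-record prices copied from the broker table above (USD).
# Ranges are stored as (low, high); single prices repeat the value.
BROKER_PRICES = {
    "name_email": (0.007, 0.007),
    "name_email_demographics": (0.20, 0.20),
    "mobile_ad_id": (0.01, 0.04),
    "location_history": (0.50, 2.00),
}


def list_value(record_type: str, n_records: int) -> tuple[float, float]:
    """Return the (low, high) list price in USD for n_records of one type."""
    low, high = BROKER_PRICES[record_type]
    return (low * n_records, high * n_records)


if __name__ == "__main__":
    # Hypothetical list size, chosen only for illustration.
    for kind in BROKER_PRICES:
        low, high = list_value(kind, 1_000_000)
        print(f"{kind}: ${low:,.0f} to ${high:,.0f} per million records")
```

The point of the arithmetic: one record is worth fractions of a cent, so the money is in volume, which is exactly what the big collectors have.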

You’re generating $10-50 of data value PER DAY. Getting paid $0 for it.

The Black Market Is Wild

The data broker industry is essentially legalized identity trafficking. Companies like Radaris compile profiles on millions of people — your address, relatives, criminal records, property ownership — and sell it to anyone with a credit card.

In 2024, the National Public Data breach exposed a reported 2.9 billion records, including Social Security numbers for a huge share of American adults. The company filed for bankruptcy. Nobody went to jail.

Is There A Secret Data OPEC?

Kind of. Call it the Big Tech antitrust paradox: these companies don’t compete on data, and they don’t trade it with each other either.

Google doesn’t sell data to Facebook. Facebook doesn’t sell data to Google. They each maintain their own data moats. It’s not explicit coordination — it’s structural monopoly power.

Data Laundering Is A Real Thing

Remember when AI companies needed training data but couldn’t legally scrape copyrighted content?

They laundered it through academic nonprofits. Universities compiled massive datasets. AI companies used those datasets. Technically legal. Ethically… well, you get it.

Stability AI used this exact trick to train on millions of copyrighted artworks without paying artists a dime.

👁️ Privacy Paranoia: Yes, It's That Bad

Your Smart Fridge Is Snitching

This isn’t paranoia. Consumer Reports found that most smart devices share way more data than they need to function.

Devices that are definitely spying on you:

  • Smart TVs (Roku, Samsung, LG — all of them)
  • Robot vacuums (Ecovacs got hacked in 2024 — people heard voices through their vacuums)
  • Voice assistants (obviously)
  • Smart doorbells (Ring shares with cops without warrants)
  • Fitness trackers (health insurance companies love this data)

Who Owns Your Sleep Data?

You’d think YOU own the data your body generates while unconscious. Nope.

When you use a sleep tracker, that data belongs to the company. They can sell it. Amazon’s sleep tracking ambitions are particularly creepy — they want data from inside your bedroom.

The Target Pregnancy Story (With A Plot Twist)

You’ve probably heard the famous story: Target figured out a teen was pregnant before her father did, based on her shopping habits.

Plot twist: It might be bullshit. The original story has holes. But here’s the thing — it’s plausible enough that nobody questioned it. That’s how normalized this surveillance has become.

They Can Read Your Emotions From How You Type

Not a joke. Academic research suggests that keystroke dynamics (how fast you type, how long you hold keys, your rhythm) can reveal emotional state.

Banks use this for fraud detection. Dating apps could use it to know when you’re desperate. Insurance companies could use it to detect depression.
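
For the curious, the raw signals are simple to compute. Below is a minimal sketch of the timing features mentioned above: dwell time (how long a key is held), flight time (the gap between keys), and typing rate. The `KeyEvent` class and `keystroke_features` function are hypothetical names for this example, and going from these numbers to "emotional state" would take a trained model, which is not shown here.

```python
from dataclasses import dataclass


@dataclass
class KeyEvent:
    key: str
    press_ms: float    # timestamp when the key went down
    release_ms: float  # timestamp when the key came back up


def keystroke_features(events: list[KeyEvent]) -> dict[str, float]:
    """Compute mean dwell time, mean flight time, and typing rate."""
    if len(events) < 2:
        raise ValueError("need at least two key events")
    # Dwell time: how long each key was held down.
    dwell = [e.release_ms - e.press_ms for e in events]
    # Flight time: gap between releasing one key and pressing the next.
    flight = [b.press_ms - a.release_ms for a, b in zip(events, events[1:])]
    duration_s = (events[-1].release_ms - events[0].press_ms) / 1000.0
    return {
        "mean_dwell_ms": sum(dwell) / len(dwell),
        "mean_flight_ms": sum(flight) / len(flight),
        "keys_per_second": len(events) / duration_s,
    }


if __name__ == "__main__":
    # Two made-up key presses, timestamps in milliseconds.
    sample = [KeyEvent("h", 0, 100), KeyEvent("i", 150, 250)]
    print(keystroke_features(sample))
```

Nothing here needs special permissions beyond seeing key-down and key-up timestamps, which is why any app or page you type into can collect it.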

Patents That Should Keep You Awake

Tech companies have filed patents for:

These aren’t science fiction. They’re filed patents with assigned numbers.

What Happens When Your Data Company Goes Bankrupt?

Your data becomes an asset that gets sold to whoever bids highest.

When 23andMe started circling the drain, privacy groups raised alarms — your genetic data could end up owned by anyone. The FTC has tried to intervene in these cases before (Toysmart, RadioShack, Borders), but enforcement is weak.

Your DNA is the new Bitcoin — except you can’t delete it, and you gave it away for free to find out you’re 3% Irish.

🤖 AI's Appetite: The Machines Are Hungry

How Many Cat Photos Does AI Need?

Surprisingly few, actually. Modern techniques can train accurate classifiers with ~1000 images. But that’s for simple stuff.

For something like GPT-4? We’re talking trillions of tokens of text. Billions of images. Basically the entire indexed internet, multiple times over.

Are Memes Junk Food For AI?

Academic researchers actually study this. Memes are hard for AI because they require cultural context, irony detection, and understanding of constantly evolving references.

Training AI on memes is like feeding a robot a diet of pure absurdism. Some researchers do it anyway.

Model Collapse: AI Eating Its Own Vomit

This is getting real. When AI trains on AI-generated content, each generation tends to get worse. Research published in Nature in 2024 showed that models degrade when they are trained recursively on their own output, which is the risk once synthetic data dominates training sets.

The problem: The internet is filling up with AI slop. Soon, AI companies won’t be able to find clean human-generated content. They’ll have to train on each other’s outputs. Quality degrades. Recursion goes brrrr.
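
You can watch a toy version of this happen. The sketch below is not how any real lab trains models; it just refits a bell curve to a small sample drawn from the previous generation's bell curve, over and over. The function name, sample size, and generation count are arbitrary choices for the demo.

```python
import random
import statistics


def simulate_collapse(generations: int = 100, sample_size: int = 10,
                      seed: int = 0) -> list[float]:
    """Refit a Gaussian to its own samples, generation after generation.

    Returns the fitted standard deviation after each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # the starting distribution stands in for "human" data
    history = []
    for _ in range(generations):
        # Each generation only ever sees output from the previous model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)      # "train": estimate the mean
        sigma = statistics.pstdev(samples)  # "train": estimate the spread
        history.append(sigma)
    return history


if __name__ == "__main__":
    spread = simulate_collapse()
    print(f"spread after generation 1:   {spread[0]:.4f}")
    print(f"spread after generation 100: {spread[-1]:.6f}")
```

With small samples, the fitted spread tends to shrink every round, so the later "models" only know a narrow slice of what the original data contained. Exact numbers depend on the seed, but the downward drift is the same tails-vanish-first pattern the collapse research describes.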

AI Winters — Will It Happen Again?

AI has crashed before:

  • Mid-1970s: First AI winter
  • Late 1980s-early 1990s: Second AI winter

Both times: hype exceeded reality, funding dried up, researchers scattered.

LessWrong debates rage about whether it’ll happen again. The current argument against: cloud computing and big data changed the equation. The argument for: energy costs and model collapse might force a reckoning.

Carbon Footprint Nobody Talks About

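A widely cited 2019 study estimated that training one large language model with architecture search emitted as much carbon as five cars over their entire lifetimes. GPT-3 was a far bigger training run.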

GPT-4 was bigger. GPT-5 will be bigger still. Each model requires more compute, more energy, more cooling.

Track it yourself: ML CO2 Impact Calculator
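
If you want a back-of-envelope number without a calculator site, the usual arithmetic is energy used times how dirty the grid is. This is a generic sketch, not the formula of any particular tool; the function name and every input value below are assumptions you would replace with your own hardware and grid figures.

```python
def training_emissions_kg(gpu_count: int, gpu_power_kw: float, hours: float,
                          pue: float, grid_kg_co2_per_kwh: float) -> float:
    """Estimate CO2-equivalent emissions (kg) for one training run.

    pue is the data center's power usage effectiveness (overhead for
    cooling and so on); 1.0 would mean no overhead at all.
    """
    energy_kwh = gpu_count * gpu_power_kw * hours * pue
    return energy_kwh * grid_kg_co2_per_kwh


if __name__ == "__main__":
    # Hypothetical run: 64 GPUs at 0.4 kW each for two weeks,
    # PUE of 1.2, grid intensity of 0.4 kg CO2e per kWh.
    kg = training_emissions_kg(64, 0.4, 24 * 14, 1.2, 0.4)
    print(f"roughly {kg / 1000:.1f} tonnes CO2e")
```

The same run on a cleaner grid gives a much smaller number, which is why where a model gets trained matters as much as how big it is.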

⚔️ Geopolitics: The Data Wars Are Real

The Global Data Arms Race

This isn’t hyperbole. The Atlantic Council mapped it: submarine cables that carry 99% of intercontinental data are strategic assets. Countries are racing to control them.

The Five Eyes alliance (US, UK, Canada, Australia, New Zealand) runs the largest data collection operation in history. Snowden exposed it. Nothing changed.

Data-Rich vs Data-Poor Nations

There’s a new form of colonialism happening. Researchers call it digital colonialism.

Developing nations generate data. That data flows to servers in the US and China. Those countries train AI. That AI gets sold back to developing nations. The data extraction follows old colonial patterns.

Countries Building Their Own Internets

  • Russia’s RuNet: A sovereign internet that can disconnect from the global web
  • China’s Great Firewall: The OG model
  • EU’s GDPR: Soft sovereignty through regulation
  • India’s data localization: Rules requiring certain data about Indians (payment data, for example) to be stored in India

The global internet is fragmenting. Stanford calls it “The Splinternet”.

Submarine Cables: The Actual Suez Canal of Data

99% of intercontinental internet traffic goes through undersea cables. There are only ~500 of them. Cut a few strategic ones and entire continents go dark.

The Arctic is becoming a battleground for cable routes as ice melts. Russia has been suspiciously active near cables in the Atlantic.

TikTok: The Case Study

The TikTok ban drama is really about this: CFIUS (the Committee on Foreign Investment in the United States) decided that Chinese access to American user data is a national security threat.

Whether you agree or not, it shows how seriously governments take data as a strategic asset.

🔮 Future Shock: What's Coming

Running Out of Training Data

PBS reported it: high-quality human-written text could be exhausted as early as 2026. Nature has reported on the same shortage.

Solutions being explored:

  • Synthetic data: AI-generated training data (risks model collapse, see above)
  • Licensed content: Pay publishers for training data (expensive)
  • Private data: Corporate emails, internal documents (privacy nightmare)

Synthetic Data: Lab-Grown Meat For AI

The idea: instead of scraping the internet, generate fake data that looks real.

It works… sometimes. Researchers are building whole toolkits for it. Microsoft has an entire project.

The catch: synthetic data carries the biases of the model that generated it. It’s not a clean solution.

Data Unions: Collective Bargaining For Your Bits

What if everyone who used Facebook formed a union and collectively bargained for their data’s value?

It’s being explored. The idea is labor unions, but for data. You can’t delete Facebook alone — but millions acting together have leverage.

Data Dividends: Getting Paid

California floated a “data dividend” proposal — tech companies would pay residents a share of the profits from their data.

It hasn’t happened yet. Critics say it would just raise prices and reduce service quality.

Data Poisoning: Digital Sabotage

Artists are fighting back against AI by “poisoning” their images — subtle changes that break AI training.

Tools like Glaze and Nightshade let creators protect their work. GitHub has a whole collection of poisoning techniques.

Privacy activists are exploring similar approaches: generating fake data to corrupt surveillance profiles.

Will Data Become Worthless?

Possibly. If everyone generates data constantly, and AI can create infinite synthetic data, supply could overwhelm demand.

Some researchers argue we’re moving from data scarcity to data abundance. The value might shift from having data to having trusted data.

Data Archaeologists

When companies die, their databases survive. There’s now a field — data archaeology — dedicated to excavating value from abandoned digital systems.

The MySpace exodus, Vine archives, dead social networks — all contain cultural artifacts that researchers are now mining.



The Bottom Line

If you’re building something: Code is free. Data is the moat. Figure out how to collect or own your data.

If you’re investing: Companies with proprietary data > companies with just good code.

If you’re a user: Your data made these companies rich. They got the oil. You got free services and targeted anxiety.

The future: Whoever controls data controls AI. Whoever controls AI controls… a lot.


Data = new oil. Big tech pumped it. Now they own the wells. :fire:


Source for original quote: Fortune Brainstorm Tech 2016 — Shivon Zilis, Bloomberg Beta
