AI's Data Crisis: Web Sources Drying Up Fast 🚫

Summary:

  1. Emerging Crisis: A study by the Data Provenance Initiative, an M.I.T.-led research group, has found that many web sources used for AI training data are now restricting access, leading to an “emerging crisis in consent.”

  2. Significant Restrictions: The study reviewed 14,000 web domains in three AI training datasets (C4, RefinedWeb, and Dolma) and found that 5% of all data, and 25% of data from the highest-quality sources, has been restricted.

  3. Impact on AI Development: Restrictions through the Robots Exclusion Protocol and website terms of service could significantly affect AI companies, researchers, and academics, limiting access to essential training data.
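
For anyone unfamiliar with how the Robots Exclusion Protocol works in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the crawler name `ExampleAIBot`, and the `example.com` URL are all made up for illustration; they are not taken from the study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one AI crawler, allows everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # placeholder URL
    for agent in ("ExampleAIBot", "SomeOtherAgent"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, page))
```

Note that robots.txt is only a request: it works when crawlers choose to honor it, which is part of why the study also looks at terms of service.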
