Summary:
- Emerging Crisis: A study by the Data Provenance Initiative, an M.I.T.-led research group, has found that many web sources used for AI training data are now restricting access, leading to an “emerging crisis in consent.”
- Significant Restrictions: The study reviewed 14,000 web domains in three AI training datasets (C4, RefinedWeb, and Dolma) and found that 5% of all data and 25% of high-quality data sources have been restricted.
- Impact on AI Development: Restrictions through the Robots Exclusion Protocol and website terms of service could significantly affect AI companies, researchers, and academics, limiting access to essential training data.
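
For readers unfamiliar with the Robots Exclusion Protocol mentioned above, the following is a minimal sketch of how such a restriction might look and how a well-behaved crawler could check it. The robots.txt contents, the crawler name `ExampleAIBot`, the helper `is_allowed`, and the example.com URL are all hypothetical and chosen for illustration; none of them come from the study. The sketch uses Python's standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one (made-up) AI crawler site-wide
# and leaves the site open to every other user agent.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"  # placeholder URL
    for agent in ("ExampleAIBot", "SomeOtherBot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, url))
```

Compliance with robots.txt is voluntary on the crawler's side, so a check like this only reflects what the site has requested; terms of service are a separate mechanism that this sketch does not cover.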