Crawling the web, posing as ordinary users and scraping data from companies’ websites: the practice of data scraping has exploded over the last two years as data became the new oil, fueling the growing number of large language models on the market.

Common Crawl is perhaps one of the best-known data scrapers, a non-profit organization that built a repository of web crawl data likely used to train many popular LLMs.