Recent advances in generative AI have dramatically expanded the capabilities of popular chatbots, which can now send email, manage personal finances, unpack complicated topics, and write essays.

For a while, it has been broadly understood that these tools owe their performance to the sheer volume of publicly available data they were trained on, much of it collected through a process called web scraping. But until recently, it was unclear where exactly large language models (LLMs) pulled their training data from.