Artificial intelligence models need as much useful data as possible to function, but some of the biggest AI developers are relying in part on transcribed YouTube videos used without the creators’ permission, in violation of YouTube’s own rules, according to an investigation by Proof News and Wired.
The two media outlets revealed that Apple, Nvidia, Anthropic and other major AI companies trained their models on a dataset called YouTube Subtitles, which includes transcripts of nearly 175,000 videos from 48,000 channels, all without the knowledge of the video creators.
The YouTube Subtitles dataset contains the text of video subtitles, often with translations into multiple languages. The dataset was built by EleutherAI, which described the goal of the dataset as lowering the barriers to AI development for those outside of big tech companies. It’s just one part of a much larger EleutherAI dataset called the Pile. In addition to the YouTube transcripts, the Pile includes Wikipedia articles, speeches from the European Parliament and, according to the report, even emails from Enron.
However, the Pile has a lot of fans among big tech companies. Apple, for example, used the Pile to train its OpenELM AI model, while Salesforce’s AI model, released two years ago, was trained with the Pile and has since been downloaded more than 86,000 times.
The YouTube Subtitles dataset includes a range of popular channels in news, education, and entertainment, including content from big YouTube stars such as MrBeast and Marques Brownlee. Their videos have all been used to train AI models. Proof News has built a search tool that lets anyone check whether a particular video or channel is in the collection. There are even a few Ny Breaking videos in the collection.
Share a secret
The YouTube Subtitles dataset appears to violate YouTube’s terms of service, which explicitly forbid automated scraping of its videos and associated data. However, that’s exactly what the dataset relied on, with a script downloading subtitles via YouTube’s API. The investigation notes that the automated download selected videos matching nearly 500 search terms.
The discovery sparked surprise and anger from YouTube creators interviewed by Proof News and Wired. Concerns about unauthorized use of content are legitimate, and some creators were angry at the prospect of their work being used in AI models without payment or permission. That’s especially true for those who found that the dataset included transcripts of deleted videos; in one case, the data came from a creator who had since removed their entire online presence.
The report did not include commentary from EleutherAI. It did note that the organization describes its mission as democratizing access to AI technologies by releasing trained models. That may conflict with the interests of content creators and platforms, if this dataset is anything to go by. Legal and regulatory battles over AI have already been complex. These kinds of revelations are likely to make the ethical and legal landscape of AI development more treacherous. It’s easy to suggest a balance between innovation and ethical responsibility in AI, but producing it will be much harder.