Artificial intelligence was trained on more than 140,000 movies and TV shows, including every episode of «Breaking Bad» and «The Sopranos»

The Atlantic examined a dataset used to train AI models from Apple, Anthropic, Nvidia, and others, and found that the industry's fears about the new technology are far from unfounded.

The dataset included material from at least 53,000 movies and 85,000 TV episodes, among them every film nominated for «Best Picture» between 1950 and 2016, about 600 episodes of «The Simpsons», 170 episodes of «Seinfeld», 45 episodes of «Twin Peaks», and all episodes of «Breaking Bad» and «The Sopranos». It also contained «live» dialogue from the «Golden Globe» and «Oscar» broadcasts.

The Atlantic notes that the texts in the dataset are not original scripts, but subtitles taken from OpenSubtitles.org. Users typically extract them from DVDs, Blu-rays, and streaming services using optical character recognition software and then upload them to the site (currently, it has more than 9 million subtitle files in more than 100 languages and dialects).

Moreover, some companies mention the use of subtitles in their research papers: Anthropic trained its chatbot Claude on them, Meta trained a group of large language models called Open Pre-trained Transformer (OPT), Apple trained LLMs that can run on the iPhone, and Nvidia trained its NeMo Megatron LLM. Salesforce, Bloomberg, EleutherAI, Databricks, Cerebras, and other AI developers have also used data from OpenSubtitles.org.

Apple noted in a comment that its LLMs are intended «solely for research purposes», while Salesforce emphasized that the dataset «has never been used to inform or improve any of the company’s product offerings». The other companies mentioned in the article either declined to comment or did not respond to inquiries.

The legality of using such data to train artificial intelligence has remained an open question since the boom in text chatbots that followed the launch of ChatGPT. Companies' transparency is still quite low, and only a court can force them to disclose their data. Yet the OpenAI case showed that this information can also suddenly disappear.

It seems that «Breaking Bad» writer Vince Gilligan was on to something when he called generative artificial intelligence «an extremely complex and energy-intensive form of plagiarism». I wonder how he would react to the fact that the technology is already absorbing the dialogue he wrote.