AI Was Trained on Over 139,000 Film & TV Scripts, Including ‘The Simpsons,’ ‘Breaking Bad,’ and More, From the Most Surprising Source

Join us on Reddit for the latest Marvel & DC news!


Generative AI has been a problem for the creative industries pretty much from its inception, in large part thanks to fears that it will replace creative jobs. While these technologies have some proponents, an increasing number of artists, journalists, actors, directors, and others in these industries are raising red flags: their work is being stolen.

Copyright and AI are still a “gray area” in a legal sense, as developers of chatbots and LLMs claim that all their data is scraped from “open” sources and is completely legitimate as far as fair use goes. But as it turns out, that is true only in a technical sense.

The Atlantic’s Alex Reisner recently published a shocking report on how LLMs have been trained on over 139,000 film and TV scripts.

Reisner confirmed that many AI systems have been trained using scripts from a vast number of TV shows and movies, including over 53,000 films and 85,000 TV episodes.

This data set, which has been used by major companies like Apple and Meta, contains dialogue from iconic shows like ‘The Simpsons,’ ‘The Sopranos,’ ‘Breaking Bad,’ and films nominated for Best Picture from 1950 to 2016.

The data even includes scripted dialogue from live events like the Golden Globes and Academy Awards. This extensive collection of content is why AI can mimic characters or create entire shows without the need for a team of writers.

It’s clear to anyone who has studied the technology and how it works in depth that today’s generative AI is just a glorified paraphrasing technology: it has no means of reaching its conclusions without scraping existing material, whether it is generating text or images.

It’s exactly the people that AI wants to replace that are its lifeblood. But this is likely a topic for a different discussion. Let’s go back to the topic at hand. Clearly, the works cited in the original report as being scraped for AI are copyrighted, so how do the tech companies get away with scraping all this dialogue? Well…

The data used to train AI doesn’t come from traditional scripts but from subtitles uploaded to OpenSubtitles.org. These subtitles are extracted from DVDs, Blu-rays, and online streams using special software.

While this might seem unusual, subtitles are valuable for AI because they reflect natural spoken dialogue, helping AI systems, like chatbots, learn to “speak” more naturally. This type of data is especially useful because well-written speech is rare in the usual AI training materials like academic texts and news articles.

Research shows that companies like Anthropic, Meta, Apple, and Nvidia have used subtitles to train their AI systems, including ChatGPT competitor Claude and models like OPT and NeMo Megatron.

Other companies like Salesforce, Bloomberg, and EleutherAI have also used these subtitles to create over 140 open-source AI models. These models, which could compete with human writers, were developed without permission from the original writers.

Naturally, the companies did not want to comment on these findings.

The OpenSubtitles data set can be downloaded by anyone, but it’s hard to understand what’s inside. The data is a 14-gigabyte file with dialogue that doesn’t identify who’s speaking or which movie it’s from. The movies and TV shows are separated into 446,612 files, with folders named after IMDb ID numbers.

While the files contain different versions of movies and episodes, the author was able to identify about 139,000 unique titles and used additional data from OpenSubtitles to organize and map information like actors and directors.
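To illustrate how several subtitle files can collapse into one title, here is a minimal Python sketch. It assumes each file sits in a folder named after an IMDb ID, as described above; the function name, the sample paths, and the `.xml` extension are hypothetical and are not taken from Reisner’s report or the real data set.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_imdb_id(file_paths):
    """Group subtitle file paths by the name of the folder that contains them."""
    groups = defaultdict(list)
    for raw in file_paths:
        path = PurePosixPath(raw)
        # Assumption: the immediate parent folder name is the IMDb ID.
        groups[path.parent.name].append(path.name)
    return dict(groups)


# Hypothetical sample paths, invented for illustration only.
sample = [
    "0096697/5012.xml",
    "0096697/6120.xml",
    "0903747/3301.xml",
]
groups = group_by_imdb_id(sample)
unique_titles = len(groups)
print(unique_titles, "unique titles from", len(sample), "files")
```

Counting folders rather than files is what would turn hundreds of thousands of subtitle files into a much smaller number of unique titles, since one movie or episode can have several subtitle versions.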

And yes, once again copyright law is in a “gray area.” Subtitles would likely be considered derivative works and therefore protected, but courts have yet to rule on this.

If you’re interested in more detailed information about the original research (and the numbers), I strongly suggest you read the original report by Alex Reisner.
