News · Technologies · 05-08-2024 at 19:58

OpenAI destroyed 100,000 books used to train GPT-3. Those involved have also disappeared


Igor Sheludchenko

News writer

OpenAI has removed two huge datasets «books1» and «books2» that were used to train the GPT-3 model.

This was reported by Business Insider, citing materials from the Authors Guild lawsuit.

The essence of the claim

Authors Guild lawyers said that the GPT-3 datasets likely contained «more than 100,000 published books». In other words, OpenAI used copyrighted materials to train its AI models.

Reference. The Authors Guild is the oldest (founded in 1912) and most authoritative professional organization of writers in the United States. It is dedicated to protecting freedom of speech and copyright.

For several months, the Authors Guild asked OpenAI to provide information about the datasets used. Initially, the company refused, citing confidentiality clauses. But it then turned out that the company had deleted all copies of the data.

High-quality training data is an important part of powerful AI models. OpenAI and other companies use data from the Internet, including books, to build these models.

Many of the creators of this content want to be paid for its use in these new AI products. Tech companies do not want to be forced to pay. The dispute is currently playing out in court across several lawsuits.

100,000 books — 16% of GPT-3 training data

In a 2020 technical document, OpenAI described the books1 and books2 datasets as «corpora of books from the Internet» and stated that in total they represent 16% of the training data used to create GPT-3.

The document also states that «books1» and «books2» together contained 67 billion tokens, or approximately 50 billion words.

OpenAI stopped using «books1» and «books2» for model training at the end of 2021. In mid-2022, the datasets were deleted, reportedly due to «unusability».

The documents also state that the two researchers who created the «books1» and «books2» datasets no longer work for OpenAI. The company refuses to disclose information about them, although the Authors Guild insists on it.

