
OpenAI transcribed over a million hours of YouTube videos for GPT-4 training

Published by
Вадим Карпусь

According to The New York Times, OpenAI developed the Whisper audio transcription model and transcribed more than a million hours of YouTube videos to obtain high-quality materials for training the GPT-4 model.

The company reportedly knew that such actions were legally questionable and fell into a «gray area» of copyright law. However, it considers this a fair use of the materials. OpenAI President Greg Brockman was personally involved in collecting the videos that were used.

OpenAI exhausted its stockpile of useful data in 2021 and, after reviewing other resources, discussed transcribing YouTube videos, podcasts, and audiobooks. Until then, the company had trained its models on data that included computer code from GitHub, databases of chess moves, and school assignment content from Quizlet.

OpenAI spokeswoman Lindsay Held said that the company curates «unique» datasets for each of its models to «help them understand the world» and stay competitive in global research. In doing so, the company uses «numerous sources, including publicly available data and partnerships for non-public data», and it is looking into generating its own synthetic data.

Google spokesperson Matt Bryant said that the company «has seen unconfirmed reports» about OpenAI’s activities, adding that «both our robots.txt files and Terms of Service prohibit unauthorized copying or downloading of YouTube content».

Recently, YouTube CEO Neal Mohan said that using the platform's data to train an OpenAI model would be a violation of the terms of use. Google will therefore take «technical and legal measures» to prevent such unauthorized use, «if we have a clear legal or technical basis for doing so».

According to sources cited by The New York Times, Google also collected transcriptions from YouTube. Matt Bryant said the company trained its models on «some YouTube content in accordance with our agreements with YouTube creators».

Meta has also faced limits on the availability of good training data, and its AI team discussed the unauthorized use of copyrighted works to catch up with OpenAI. After going through almost all of the available English-language books, essays, poems, and news articles on the Internet, the company considered steps such as paying for book licenses or even buying a major publisher outright. In addition, it has been limited in how it can use consumer data because of the privacy-oriented changes it made in the wake of the Cambridge Analytica scandal.

Source: The Verge
