The Battle Over AI Training Data: Copyright, Fair Use, and the Future of GenAI

Legal Alerts

2.08.24

The future of generative artificial intelligence (GenAI) is under scrutiny following a copyright lawsuit The New York Times filed against OpenAI. The case turns on the propriety of using copyrighted works to train GenAI models and on the ability of those models to generate nearly identical copies of the underlying works. GenAI systems rely on models trained on vast amounts of data, which often include copyrighted works. Several cases, such as Getty Images v. Stability AI and Silverman v. OpenAI, have raised questions about the use of copyrighted works in training GenAI models and about how closely the generated content may resemble those works. GenAI companies have so far relied on the fair use doctrine to shield them from copyright claims. Plaintiffs like The New York Times, however, argue that using their copyrighted works to train GenAI models does not fall under fair use, especially given the models’ ability to generate content that may be very similar or identical to the original works. The outcomes of these copyright cases are all the more critical because the U.S. Supreme Court has recently narrowed the fair use doctrine, and they place the future of GenAI at risk.

GenAI systems generate text and images using complex, highly trained models. Textual GenAI systems, such as OpenAI’s ChatGPT, typically rely on Large Language Models (LLMs) trained on vast text datasets to learn language patterns and generate coherent text by predicting word sequences. Visual GenAI systems, such as Stability AI’s Stable Diffusion, use Diffusion Models trained to generate images by learning to reverse a process that gradually adds noise to real images, allowing them to turn random noise into detailed images. Both types of models rely on training data, statistical probabilities, and iterative processes to generate contextually appropriate text or images. However, because such models require incredibly large amounts of data, and human-generated data in particular (GPT-3, for example, was reportedly trained on the equivalent of 45 million books), GenAI companies often use copyrighted works to train them. Because of the models’ complexity (GPT-3 reportedly has 175 billion parameters), they may appear to “memorize” portions of their training data and generate content that closely resembles the underlying copyrighted works. Thus, companies like The New York Times and Getty Images, which create large amounts of copyrighted works, allege that the unauthorized use and reproduction of their works by these models is copyright infringement.
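To make the idea of statistical word prediction and “memorization” concrete, the following is a minimal, purely illustrative Python sketch, not the architecture of any real LLM: a toy bigram model that predicts each next word from counts observed in its training text. When the training data consists of a single sentence, greedy prediction reproduces that sentence verbatim, a simplified analogue of the memorization behavior described above.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for each word, which words follow it in the training text."""
    words = corpus.split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def generate(model, start, length=8):
    """Greedily emit the most frequent next word at each step."""
    out = [start]
    for _ in range(length):
        successors = model.get(out[-1])
        if not successors:
            break  # no observed continuation for this word
        out.append(successors.most_common(1)[0][0])
    return " ".join(out)

# A single training sentence: with so little data, the "most likely"
# continuation is always the training text itself, so greedy prediction
# regurgitates the source verbatim -- a toy analogue of memorization.
corpus = "courts weigh four statutory factors when deciding fair use"
model = train_bigram_model(corpus)
print(generate(model, "courts"))
# -> courts weigh four statutory factors when deciding fair use
```

Real LLMs predict over hundreds of billions of parameters rather than simple counts, but the underlying principle is the same: output reflects the statistics of the training data, and rare or unique passages can resurface nearly intact.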

In the case of Getty Images, the company alleges that Stability AI used more than 12 million of its copyrighted images without permission to train its models and that those models can generate images very similar to its copyrighted works. In some instances, Stability AI’s models generated images that included a watermark closely resembling the watermark Getty Images places on its own photographs. Getty Images therefore argues that by using its images to train its AI, Stability AI not only infringed the copyrights in those individual images but also used them to build a competing business.

In The New York Times case, the newspaper cited instances where ChatGPT has provided users with near-verbatim excerpts from its articles. The New York Times contends that such use of its content diminishes the need for readers to visit its website directly, potentially impacting its traffic and revenue. OpenAI, in its defense, has argued that the examples of alleged infringement cited by The New York Times are atypical and that “memorization” is a rare bug, not a feature, that it is working to solve.

GenAI companies like OpenAI argue that training GenAI models on copyrighted works is transformative use for public enrichment and therefore fair use, not copyright infringement. The fair use doctrine is thus critical for GenAI. It allows the limited use of copyrighted material without permission from the copyright holder under certain conditions: courts weigh factors such as the purpose and character of the use, the nature of the copyrighted work, the amount and significance of the portion used, and the effect of the use upon the potential market for or value of the copyrighted work. In the past, tech companies have successfully argued that their use of data was transformative and therefore fair use. That argument may no longer carry the day after the Supreme Court’s decision in Andy Warhol Foundation v. Goldsmith. Writing for a seven-member majority, Justice Sonia Sotomayor determined that Warhol’s use of an image adapted from Goldsmith’s photograph of Prince did not constitute fair use. Although Warhol’s artwork added new meaning and expression, a characteristic previously often sufficient for a fair use defense, the use did not qualify as fair use because the original photograph and the Warhol image served substantially the same commercial purpose.

GenAI, heralded for its potential to mirror the revolutionary impact of the printing press, now stands at a crucial juncture. The future of GenAI companies, and of the leading GenAI models in particular, depends on access to copyrighted works, and lawsuits like The New York Times v. OpenAI, along with similar litigation, threaten that access. This is particularly true in light of the Supreme Court’s recent decision in Andy Warhol Foundation v. Goldsmith, which emphasized that transformative use alone does not qualify as fair use if the original work and the new use serve similar commercial purposes. This critical juncture highlights the imperative for stakeholders from all sectors, including developers, copyright holders, and Congress, to unite in crafting a collaborative way forward. Such a partnership is crucial not just for the survival of GenAI but also for the development of artificial general intelligence (AGI).

If you have any questions about the information in this article, please reach out to Michael Word, Diego Freire, or your Dykema relationship attorney.