“OpenAI relied on harvesting mass quantities of content from the public internet, including Plaintiffs’ and the Class’s books, which are available in digital formats.” – Chabon class complaint
Like Silverman et al., the latest suit, a class action filed in the U.S. District Court for the Northern District of California by Michael Chabon, David Henry Hwang, Matthew Klam, Rachel Louise Snyder, and Ayelet Waldman, alleges in part that the datasets OpenAI used to train ChatGPT are infringing.
Those datasets include “BookCorpus,” which is a collection of over 7,000 unpublished books that were compiled and copied into a dataset by AI researchers without offering the authors of copyrighted materials compensation; “Common Crawl,” which is “a massive dataset of web pages containing billions of words,” according to the complaint; and “two internet-based book corpora,” known only as “Books1” and “Books2,” which both the Silverman and Chabon lawsuits claim likely contain over 350,000 books, based on OpenAI’s estimates of the number of books in each dataset.
The Chabon lawsuit points to several examples of prompts involving the plaintiffs’ works to demonstrate the alleged infringement. For example, when prompted to “identify examples of trauma in the Amazing Adventures of Kavalier & Clay,” for which Chabon won the Pulitzer Prize for Fiction in 2001, ChatGPT identified six very specific examples, accurately summarized the book, and imitated Chabon’s writing style. Prompts involving the other plaintiffs’ works were similarly accurate. Such results could only be obtained if OpenAI relied on “harvesting mass quantities of content from the public internet, including Plaintiffs’ and the Class’s books, which are available in digital formats,” claims the lawsuit.
The complaint charges that OpenAI’s Generative Pre-trained Transformer (GPT) models cannot function without the infringing works, and so the GPT models and ChatGPT “are themselves infringing derivative works without Plaintiffs’ and Class members’ permission and in violation of their exclusive rights under the Copyright Act.” The suit lists six causes of action: direct copyright infringement; vicarious copyright infringement; removal of copyright management information in violation of the Digital Millennium Copyright Act; violations of the California Unfair Competition Law; negligence; and unjust enrichment.
In a May 2023 House IP Subcommittee hearing, Representative Deborah Ross (D-NC) said that Sam Altman, the CEO of OpenAI, indicated that OpenAI is contemplating ways to compensate copyright owners for content and style in new versions of its models. “When we’re working on new models, if an AI system is using your content or style, you get paid for that,” Altman said, according to Ross.
But in an August bid to dismiss the Silverman suit, OpenAI said that the authors’ lawsuits “misconceive the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence.”
Chabon and Silverman are also suing Meta over its use of training data for Llama-2.