Common Misconceptions About Generative AI and Copyright

There has been a flood of media coverage of the intersection of copyright and generative artificial intelligence, a subset of the broader discussion of the challenges posed by AI generally. This coverage has been stimulated in part by lawsuits brought by authors, including Sarah Silverman, and open letters signed by artists. Much of this coverage contains serious inaccuracies about AI technology and copyright law. Because the issues surrounding AI and copyright law can be complex, we’ve collected a number of the more prevalent misconceptions in recent media coverage and explained why they are false, in order to aid the conversation around this technology.

1. ChatGPT and other generative AI providers “steal” content from creators.

Respected outlets such as the New York Times and the Washington Post have asserted that ChatGPT and other AI providers “steal” content from creators. While copying does occur in the process of developing AI systems, that copying typically does not constitute copyright infringement. And even if it did constitute copyright infringement, infringement is a trespass on a government-granted exclusive right, which is not akin to stealing personal property.

2. Making a generative AI system involves copyright infringement. 

Generative AI systems based on “Large Language Models” (LLMs) require the assembly of a dataset which is transformed and analyzed to create the model. Assembly of the dataset does involve the ingestion—the copying—of a large number of works, typically by a third-party provider. This ingestion can occur by “bots” crawling the World Wide Web and downloading text, images, and videos. The AI firm acquires access to the dataset and uses software to analyze it to uncover patterns, trends, and relationships, which the AI firm uses to create the LLM. The LLM itself does not include any of the expression from the content originally crawled by the bots to create the dataset. The AI system uses the LLM to produce results in response to user prompts. The process of creating the LLM is referred to as “training” the AI system.

Now a brief copyright digression. Copyright protects the way facts or ideas are expressed, but not the facts and ideas themselves. Leaving facts and ideas unprotected is a constitutional requirement under the First Amendment. The First Amendment also permits the “fair use” of copyrighted content. The test for judges to apply to determine whether a use is fair is set forth in the Copyright Act. 

Back to the LLM. Because the LLM does not contain anyone else’s expression, it does not infringe copyright. But what about the copying necessary to create the dataset from which the LLM is derived? Although high-quality generative AI is new, AI itself has been in use for at least two decades, and several courts have found that the copying necessary to develop these AI tools is a fair use. These tools include plagiarism detection software, optical character and speech recognition, and search engines for websites and books. Most copyright experts believe that the fair use analysis for generative AI is the same as it is for these other AI tools.

3. A creator has no way to prevent her work from being ingested by a generative AI system.

Even if the creation of the AI system is not infringing, an artist might not want her creations used to “train” the AI as a matter of principle. Although media reports frequently suggest that the artist is helpless to prevent the ingestion of her content, in fact the artist can employ a widely used robot exclusion protocol to prevent her website from being crawled by AI bots.

Some copyright industry trade groups argue that, as a practical matter, artists cannot use these bot exclusion protocols because doing so would prevent their websites from being crawled by search engine bots, which would mean that the websites would not be findable. While this tradeoff exists, the artist still has a choice of how she wants to interact with AI and the Internet. Additionally, the copyright industries can work with AI firms and standard-setting organizations such as the World Wide Web Consortium (W3C) to develop an exclusion protocol with more granularity that would permit search engine bots but exclude other bots. Indeed, major AI companies are already beginning work on such a standard.
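To make the idea of per-crawler granularity concrete, the following is a minimal sketch, not a description of any particular AI firm’s crawler. It uses the robots.txt parser in Python’s standard library to evaluate an exclusion file that turns away one crawler while admitting all others. The crawler names “ExampleAIBot” and “ExampleSearchBot” and the site “artist.example” are hypothetical, and the sketch assumes that the AI crawler announces a distinct user-agent token and voluntarily honors the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion file an artist might publish at /robots.txt.
# "ExampleAIBot" is an invented user-agent token used for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of an exclusion file without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Report whether a well-behaved crawler with this token may fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
page = "https://artist.example/gallery/"  # hypothetical artist website

ai_allowed = may_crawl(parser, "ExampleAIBot", page)
search_allowed = may_crawl(parser, "ExampleSearchBot", page)
print("AI crawler allowed:", ai_allowed)
print("Search crawler allowed:", search_allowed)
```

Rules of this kind are only as effective as the crawler’s willingness to identify itself and comply, which is why an agreed standard among AI firms and website operators matters.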

4. The output of a generative AI system is a collage of bits of the works that are ingested.

Media descriptions of AI systems suggest that the AI chops ingested works into small pieces, which the AI then recombines into new works in response to user prompts. As discussed above, the AI does not retain bits of expression that it recombines. Instead, through computational analysis, it discerns patterns, trends, and relationships, and uses them to create statistical models which in turn generate new works in response to user prompts. 

5. The output of a generative AI system infringes copyright by copying an artist’s style.

The output of an AI system might infringe the copyright of a particular artist if the system had access to one of the artist’s works (i.e., the AI system ingested the work) and the output of the AI system is “substantially similar in protected expression” to that work. Courts have long dealt with claims that one work is substantially similar to another. Where these cases get complicated is when the works are not identical, but have certain similarities. As a general matter, courts have found that similarities in styles do not lead to copyright infringement liability, because a style is an “idea” rather than “expression.” Just as a human artist is allowed to create a work in a style similar to another artist’s, so too should an AI system be allowed to create a work in a style similar to that artist’s.

Some AI systems have been used to reproduce the distinctive sound of a vocalist’s voice. In some circumstances, primarily when a soundalike is used in an advertisement, courts have found that vocal imitations infringe rights of publicity, but soundalikes in general do not infringe copyright. The legislative history of 17 U.S.C. § 114(b) confirms this: “Mere imitation of a recorded performance would not constitute a copyright infringement even where one performer deliberately sets out to simulate another’s performance as exactly as possible.” H.R. Rep. No. 94-1476, at 106 (1976). These cases should be addressed under rights of publicity, which are a matter of state law distinct from copyright.

One New York Times reporter interviewed for “The Daily” noted that when she asked ChatGPT to write an article in her style, it produced a sentence that included the phrase “bastion of free expression,” which she had used in several articles. There is no copyright in phrases and short sentences. Moreover, this phrase is not original to this reporter; it has been used by many writers at least since the 1960s. 

6. Generative AI poses an existential threat to artists.

There is no evidence that the outputs of AI systems will lead to the elimination of entire classes of artists. AI is a tool that many artists already incorporate in their workflow to eliminate the more tedious tasks, thereby enabling them to produce more works at lower cost. As such, it may reduce the demand for certain skilled workers, just as photography reduced the demand for portrait artists and digital photography reduced the demand for photographers and film processors. But the most creative and nimble artists who rapidly adopt new technologies and business models will continue to be able to thrive economically from their art. To be sure, publishers and motion picture studios may be able to use AI to automate processes that are now done by humans to reduce their costs. While this may be disruptive to the artistic workforce, it does not raise copyright issues because the publishers and motion picture studios own the copyrights to the works to which they will apply these AI tools. In other words, copyright law is not the solution to the AI-related issues raised by actors and screenwriters in the context of the SAG-AFTRA strike. Instead, this is a labor-relations issue that must be resolved through negotiations.

7. Artists deserve compensation for every use of their work.

As a general proposition, we do not compensate creators for every single use of their work. We don’t let creators control whether content is read or viewed (per copyright limitations like the first-sale doctrine and space- and time-shifting); we don’t let them control whether human artists “train” on their work. Indeed, aspiring artists generally train by studying and copying the works of their predecessors.

Putting aside the technical legalities of copyright law discussed above, media coverage of AI suggests that artists are entitled in some moral sense to compensation for their contribution to the training of generative AI systems. Even if these systems were at some point in the future to become profitable, the contribution of any one work to the system as a whole would be minuscule. The vast majority of the enormous amount of content in the datasets from which LLMs are derived is harvested from open sites such as Wikipedia. The marginal contribution of an individual work in the set of works by authors who object to ingestion by AI systems is a small fraction of a small fraction. (Moreover, if multiple works contain similar information, as is often the case, determining the true origin of the information is impossible.) The relationship between a single input and a single output is completely attenuated and indiscernible. Given these logistical challenges, most of the money collected by a collective rights organization would be spent on overhead and little, if any, would go to individual authors.

8. In other countries, artists are compensated when their works are ingested into generative AI systems.

Various jurisdictions around the world are beginning to address the copyright issues relating to AI. Japan and Singapore have enacted specific AI exceptions that do not require compensation. The Israeli Ministry of Justice issued an opinion that its fair use provision, modeled on the U.S. fair use doctrine, permits the training of AI systems without compensation. The EU recently adopted a directive that established two exceptions for text and data mining (TDM). TDM for scientific research is permitted without compensation. TDM for all other uses is permitted subject to an express opt-out by the copyright owner. In other words, unless a copyright owner expressly prohibits the ingestion of her works, the AI system may ingest them. The opt-out must occur in an appropriate manner, such as machine-readable means in the case of material that is made publicly available online. The machine-readable means could be the robot exclusion protocol described above. In short, no jurisdiction compensates artists for ingestion by default; at most, artists may take affirmative steps to prevent the ingestion.

9. Specific communities of artists, such as fanfiction writers, feel particularly aggrieved by generative AI systems.

Notwithstanding this assertion, there is no way to assess the attitudes of millions of fanfiction writers toward AI. In any event, many of the assertions that have been made with respect to fanfiction writers are demonstrably incorrect. The recent episode of “The Daily,” for example, first asserted that fanfiction writers had no copyright in their original creations because they were based on other, preexisting works. “The Daily” then issued a correction, stating that fanfiction writers had copyright in their original contributions. As usual with copyright, the situation is more complicated.

Many copyright owners take the position that fanfiction is a tolerated infringement. That is, the fanfiction writer is infringing on the characters and plotlines he borrows, but the copyright owner tolerates this infringement because it causes no harm. On the other hand, the fanfiction writer believes that his use is permitted by the fair use doctrine because it is transformative. There has been little litigation concerning noncommercial fanfiction. If the copyright owners are correct, and fanfiction is a tolerated infringement, then under 17 U.S.C. § 103(a), the fanfiction writer has no copyright in his original contributions. If the fanfiction writer is correct, and his reuse of characters and plotlines is a fair use, then he has a copyright in his original contributions.

Regardless of who is right, it is very odd for fanfiction writers, who rely on fair use to justify their use of characters created by others, to turn around and claim that others may not make fair uses of their creations.

10. The use of generative AI is inherently “cheating.”

Various institutions and professions may well decide that using generative AI for certain tasks is “cheating.” An educational institution could adopt a policy that prohibits students from using AI to write class assignments (but the policy should make clear whether it applies to AI-based tools such as Spell-Check). At the same time, the institution could allow teachers to employ AI to create lesson plans, or could design its courses so that generative AI is among the tools students are expected to use. That is up to the institution and has nothing to do with copyright law.

Similarly, many of the same arguments and issues may arise around allegations of plagiarism: there is a difference between what is legal under copyright law and what is approved by certain communities of practitioners, such as those in fashion or cuisine.

