Issues and Concerns in the AI Era

Data, Copyright, and Ownership


Learning Objectives

  • You understand the key issues related to ownership of data and outputs of large language models (LLMs).
  • You know the concept of fair use and how it applies to AI training, as well as its global variations.
  • You are aware of ongoing legal debates and their implications for creators, companies, and users.
  • You know how providers typically handle ownership of AI-generated outputs.

Large language models are trained on vast amounts of text scraped from the internet and other sources: books and academic articles, news and blog posts, websites and forums, software repositories, and transcriptions of audio and video.

Some of this material is in the public domain or shared under open licenses that permit reuse. However, much of it is copyrighted.

Studies have shown that LLMs can sometimes reproduce copyrighted passages nearly verbatim — for example, generating content closely resembling popular books or articles when prompted in specific ways. This raises fundamental questions: Is training on such data without permission or compensation legal? Who, if anyone, has rights over what the model produces? Should creators whose work trains these systems receive compensation?


Because these questions remain largely unresolved, numerous lawsuits have emerged testing the boundaries of copyright law in the AI context:

The Authors Guild class action lawsuit against OpenAI, joined by prominent authors, claims that training on copyrighted books without permission constitutes copyright infringement and that the resulting models can reproduce substantial portions of those works.

The GitHub Copilot lawsuit against Microsoft, GitHub, and OpenAI alleges violations of software licenses when training on open-source code, particularly concerning attribution requirements and copyleft provisions in licenses like GPL.

Similar cases have been filed by visual artists against image generation systems, by news organizations against companies training on their articles, and by other content creators across various media.

These cases are still ongoing and will likely take years to resolve. Legal precedent develops slowly — for example, the Authors Guild lawsuit against Google over its book scanning and indexing project lasted over a decade before courts ruled that Google’s actions qualified as fair use under U.S. copyright law.

Fair use, if legally acquired

A recent ruling involving Anthropic had two parts: the court deemed training on legally acquired books fair use, but found that Anthropic’s use of pirated books was not protected. In a settlement, Anthropic agreed to pay $1.5 billion in damages to a group of authors and publishers to avoid a trial over its use of pirated material.

For more information, see AI firm Anthropic agrees to pay authors $1.5bn to settle piracy lawsuit.


Fair use and global perspectives

In the U.S., the doctrine of fair use allows limited use of copyrighted material without permission for purposes such as criticism, commentary, news reporting, teaching, scholarship, and research. Courts evaluate fair use claims based on four factors:

  1. The purpose and character of the use (particularly whether it’s transformative)
  2. The nature of the copyrighted work
  3. The amount and substantiality of the portion used
  4. The effect on the potential market for the original work

AI companies, including OpenAI, Anthropic, Google, and others, generally argue that training on publicly available copyrighted text qualifies as transformative fair use. Their reasoning: models learn statistical patterns from training data rather than storing and retrieving works, they create entirely new outputs rather than reproducing originals (in typical use), and the purpose — enabling AI systems to understand and generate language — is fundamentally different from the purpose of the original works.

However, this position is contested by many creators and legal scholars, and fair use determinations are highly fact-specific and unpredictable. Courts might consider whether models can reproduce substantial portions of training data, whether AI-generated content substitutes for human-created work in ways that harm creators’ markets, and whether the enormous commercial value derived from these systems should require licensing or compensation.


Importantly, fair use is a U.S.-specific doctrine and not a global standard. Different jurisdictions have different frameworks:

  • The EU Copyright Directive includes specific exceptions for text and data mining for research purposes and provides an opt-out mechanism allowing rights holders to reserve their works from being mined; a sketch of how such opt-outs are commonly signaled in practice follows this list. The UK has similar provisions for non-commercial research.

  • Japan’s copyright law includes relatively broad exceptions for machine learning that are often cited as more permissive than U.S. fair use.

  • Other jurisdictions have more restrictive approaches, requiring explicit permission for any use of copyrighted materials in AI training.
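In practice, opt-outs like the EU’s are often expressed in machine-readable form. One common (though legally debated) convention is for a site’s robots.txt to disallow the published user-agent tokens of AI crawlers. The sketch below, using Python’s standard-library urllib.robotparser, checks a site for such entries; the token list and the example URL are illustrative assumptions, not an exhaustive or authoritative registry.

```python
# A minimal, illustrative sketch (not legal guidance): check whether a
# site's robots.txt disallows some published AI-crawler user-agent tokens.
# The token list and the example URL are assumptions for illustration.
from urllib.robotparser import RobotFileParser

AI_CRAWLER_TOKENS = ["GPTBot", "Google-Extended", "CCBot", "anthropic-ai"]

def training_opt_outs(site_url: str) -> dict[str, bool]:
    """Map each crawler token to True if robots.txt blocks it from the site root."""
    parser = RobotFileParser()
    parser.set_url(site_url.rstrip("/") + "/robots.txt")
    parser.read()  # fetches and parses robots.txt over the network
    return {token: not parser.can_fetch(token, site_url)
            for token in AI_CRAWLER_TOKENS}

if __name__ == "__main__":
    # Hypothetical example site; replace with a real domain to test.
    for token, blocked in training_opt_outs("https://example.com").items():
        print(f"{token}: {'opted out' if blocked else 'not opted out'}")
```

Note that whether a robots.txt entry satisfies the directive’s requirement for a machine-readable reservation is itself contested, and dedicated standards (such as the proposed TDM Reservation Protocol) have emerged to express opt-outs more explicitly.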

The lack of global consensus means companies operating internationally must navigate a complex patchwork of potentially conflicting legal frameworks, creating uncertainty for AI development worldwide.


Who owns AI outputs?

Beyond training data, another critical question is the ownership of model outputs — the text, code, images, or other content that AI systems generate.

Several competing views exist:

  • The model owner view: Because the outputs are generated by the model’s computational processes, the company that owns and operates the model could claim ownership rights over everything it produces.

  • The user view: Since outputs depend entirely on user prompts and represent responses to user instructions, users should be considered the creators and owners of generated content.

  • The no-ownership view: AI-generated content might not qualify for copyright protection at all, since copyright traditionally requires human authorship. This could mean outputs enter the public domain immediately.

To address this ambiguity and provide clarity to users, LLM providers typically define ownership in their terms of service. For example, OpenAI’s terms state:

“As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.”

This means OpenAI treats users as the owners of what they generate. Anthropic, Google, and most other major providers have similar policies assigning output ownership to users.

However, caveats exist. For example, because the legal status of AI-generated content remains unsettled, the “to the extent permitted by applicable law” clause acknowledges that courts may ultimately determine that such content cannot be copyrighted because it lacks human authorship. Similarly, regardless of who owns the output, users remain responsible for how they use it, including ensuring they don’t infringe third-party rights.

If an LLM generates text substantially similar to copyrighted work, using that output commercially could still constitute infringement regardless of who “owns” it.


Musical works, copyright, and generative AI

As with other AI outputs, the ownership of AI-generated music remains contested. Questions arise over whether human input in training, fine-tuning, or prompting is enough to claim copyright. A particularly interesting question in this area is where to draw the line between traditional authorship and machine-driven creativity.

For a more in-depth discussion on the topic, check out Musical Works, Copyright, and Generative AI: Legal Perspectives on Originality and Authorship.