Ownership of Data and Outputs
Learning Objectives
- You know of some issues related to ownership of data and outputs of large language models.
Training of large language models is made possible by the availability of vast amounts of text data. The data stems from a wide range of sources including books, articles, web pages, song lyrics, and even content transcribed from videos on online platforms such as YouTube.
Some of the content is copyrighted, and for some of the content, the terms of use explicitly disallow copying, redistributing, or otherwise using or exploiting the content. Despite the copyrights and the explicit terms of use, there is evidence that large language models have been trained on such content as well. This evidence stems from studies showing, for example, that large language models can generate passages from popular books.
This has led to a number of lawsuits, such as a class action lawsuit filed against OpenAI for copyright infringement by the Authors Guild and individual authors, and the GitHub Copilot lawsuit filed against GitHub, OpenAI, and Microsoft for violation of copyrights and software licensing requirements. As large language models are relatively new, the lawsuits are still ongoing and might take some time to resolve.
As an example of how much time a lawsuit might take, the Authors Guild lawsuit against Google over copying and sharing books, filed in 2005, lasted ten years. In the end, the courts ruled that Google's use of the books was fair use.
The term fair use refers to the fair use doctrine in U.S. copyright law. The doctrine allows using copyrighted materials for purposes such as criticism, teaching, and research, for example when the use is transformative and does not harm the existing market for the original work. OpenAI (and other companies) generally take the view that training models using publicly available (even if copyrighted) data is fair use.
Fair use is not a global policy. Other jurisdictions, however, also allow the use of copyrighted materials in some cases. As an example, the European Union Copyright Directive addresses the use of copyrighted materials for education and research.
The ownership of the prompts and outputs is another issue. In practice, the outputs are generated by the model, and thus the model owner could be considered the owner of the outputs or could claim ownership of them. At the same time, the outputs could also be considered as generated by the user, as the user provides the prompts to the model.
Large language model providers typically have terms of use that define the ownership of the outputs. As an example, OpenAI's terms of use explicitly state that the user owns the content:
“As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.”