Copyright Implications in Training Artificial Intelligence (AI) Models

Introduction:

Have you heard about the recent legal case before the Hon’ble Delhi High Court? India is currently witnessing a significant legal battle over copyright protection that turns on the concept of fair dealing. The case arises from a lawsuit filed by Asian News International (ANI), an Indian news agency, accusing OpenAI[1], the developer of ChatGPT, of copyright infringement. It raises vital questions about how AI platforms use copyrighted content to train their models, often without obtaining prior licenses or authorization. OpenAI argues that it cannot be accused of copyright infringement in India because it neither stores data nor trains its language models in India.

This dispute is not unique to India; similar cases have emerged worldwide as courts strive to resolve the growing issue of AI's use of copyrighted material. Artificial Intelligence (AI) is revolutionizing industries globally, offering automation and efficiency like never before. However, its extensive reach brings a host of legal challenges, particularly surrounding the use of copyrighted content in training AI models. Whether this practice constitutes copyright infringement is now a critical topic of debate in the legal and tech communities.

What This Article Will Cover:

This article will explore the legal complexities surrounding AI and copyright law, focusing on the challenges faced by existing copyright frameworks in addressing the rapid advancements in AI technology. It will delve into the jurisdictional issues arising from the cross-border presence of AI platforms. Additionally, the article will examine the concepts of fair use and fair dealing, discussing how they apply in the context of AI training models and the potential legal implications for AI developers and content creators.

Understanding Copyright, Artificial Intelligence, and Large Language Models (LLMs):

What is Copyright?

Copyright is a legal right granted to creators of original works, giving them control over the reproduction, distribution, and adaptation of their works for a specific period of time. This right allows creators to protect and earn from their intellectual property, subject to limitations that balance the rights of creators against the broader public interest.[2] Copyright does not protect ideas themselves, but rather the expression or execution of those ideas. For instance, the concept or idea behind a book is not copyright-protected, but the specific written text of the book is.

In India, copyright protection is governed by the Copyright Act, 1957[3], while at the international level, it is guided by agreements such as the Berne Convention, the WIPO Copyright Treaty, and the TRIPS Agreement.

What is AI, How Exactly are AI Models Trained, and Why Is It Legally Controversial?

What is Artificial Intelligence (AI)?

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines, enabling them to perform tasks that typically require human cognitive functions, such as decision-making, language understanding, and visual perception. AI systems, particularly Large Language Models (LLMs) like OpenAI's ChatGPT, are designed to analyze and process vast amounts of data, generating human-like text responses based on patterns learned from the data they have been trained on.

How Are AI Models Trained?

Training AI models involves feeding massive amounts of data into the system to enable it to "learn" patterns, make predictions or generate outputs. The data used for training usually consists of publicly available content, but it may also include copyrighted material.

AI models “read” and analyze the data comprehensively, which enables them to generate summaries, opinions or responses. For example, when you prompt ChatGPT, it does not simply retrieve stored information; it draws on the patterns it learned from its training data to generate a new, contextually relevant answer.
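To make the idea of "learning patterns" concrete, the following is a deliberately tiny sketch in Python. It is not how ChatGPT or any production LLM works (those rely on large neural networks); it is a simple word-pair counter, and the function names and the sample sentence are invented purely for illustration. It shows the basic point at issue in the legal debate: the training text is consumed to build statistics, and new output is then produced from those statistics.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        model[current_word][next_word] += 1
    return model


def generate(model, start_word, max_words=5):
    """Produce text by repeatedly choosing the most frequent next word."""
    output = [start_word]
    for _ in range(max_words - 1):
        followers = model.get(output[-1])
        if not followers:
            break  # the model never saw anything after this word
        output.append(followers.most_common(1)[0][0])
    return " ".join(output)


# Hypothetical training text, invented for illustration only.
corpus = "the court heard the case and the court heard the appeal"
model = train_bigram_model(corpus)
print(generate(model, "the", max_words=4))  # prints "the court heard the"
```

Even in this toy, the output is assembled from patterns in the training text rather than looked up from a stored copy, although with so little data the result can closely echo the source.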

Why is this Legally Controversial?

The legal controversy arises from including copyrighted content in the training data. The debate centres around whether or not the use of copyrighted content during AI training falls under the legal concepts of fair use (in the U.S.) or fair dealing (in jurisdictions like India and the U.K.).

If the AI companies' use of copyrighted content does not qualify as fair use or fair dealing, the use of that content may be considered copyright infringement. This raises significant legal questions: Should AI companies obtain licenses to use copyrighted information? What are the implications for the future of AI if extensive training on copyrighted content requires licensing? These are critical issues being debated in courts, as AI continues to evolve and expand across industries.

Concepts of Fair Use and Fair Dealing

The concepts of fair use and fair dealing are central to determining when it is acceptable to use copyrighted material without the permission of the copyright holder. While both principles allow for limited use of copyrighted works without authorization, they are interpreted differently across jurisdictions.

Fair Use in the United States

In the United States, the Copyright Act[4], specifically Section 107, outlines the concept of fair use. Fair use is evaluated on a case-by-case basis and takes into account four key factors:

1. The purpose and character of the use – Is the use non-commercial, educational, or transformative, or is it for commercial gain?
2. The nature of the copyrighted work – Is the original work factual or creative?
3. The amount and substantiality of the portion used – How much copyrighted work is being used? Is the use of the work substantial in relation to the whole?
4. The effect of the use on the market – Does the use harm the potential market or value of the original work?

Cases like Authors Guild v. Google[5], where fair use was upheld in the context of Google’s book digitization project, illustrate that the application of fair use is evolving, but it remains uncertain whether AI training falls within these established frameworks.

Fair Dealing in India

In India, fair dealing is addressed under Section 52 of the Indian Copyright Act, 1957. This provision lists certain exceptions under which the unauthorized use of copyrighted materials can be considered fair, including use for purposes like:

- Criticism or review
- Parody or caricature
- Educational use
- News reporting

However, the current framework does not explicitly address the use of copyrighted materials for AI training. The absence of clear guidelines for AI training has left room for legal uncertainty.

The Hon'ble Kerala High Court's judgment in Civic Chandran v. Ammini Amma[6] provides useful guidance for evaluating fair dealing in India, based on the following principles:

1. The quantum and value of the matter taken in relation to the comments or criticism.
2. The purpose for which the work is used.
3. The likelihood of competition between the two works.

Legal Challenges and International Perspectives

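In the European Union, the EU Copyright Directive (2019/790)[7] sets specific rules for text and data mining. Article 3 permits mining by research organisations and cultural heritage institutions for scientific research, while Article 4 permits it more generally only where rightsholders have not expressly reserved their rights. In practice, AI developers must respect such reservations or obtain permission before using the reserved works to train their models. This opt-out mechanism speaks to concerns that AI platforms are profiting from the original work of others without compensating the creators.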

AI companies, through their paid versions and advanced functionalities, are generating significant commercial profits based on the copyrighted works they use to train their models. This raises concerns about whether the creators of the original works are being fairly compensated for their contributions.

Legislative Gaps and the Challenge of Cross-Border Jurisdiction

While AI technologies like ChatGPT and other large language models are increasingly integrated into various industries, the current laws, both domestic and international, do not adequately address the complexities introduced by these technologies.

Challenges of Jurisdiction in AI and Copyright Enforcement

AI platforms, such as ChatGPT, are often based in jurisdictions that are far from the locations where the content they use is created. OpenAI, the developer of ChatGPT, is headquartered in the United States, while many of the materials used to train its AI models are sourced globally, including from countries like India, which creates significant jurisdictional challenges.

Even in the present case before the Delhi High Court, OpenAI has defended itself by stating that it cannot be accused of copyright infringement because it operates in a different jurisdiction and does not store data or train its models in India. While Indian copyright holders have the right to protect their works within India, enforcing those rights against a U.S.-based entity can be incredibly difficult due to differences in copyright laws, enforcement procedures, and cross-border legal mechanisms.

Moreover, not all countries have the same standards for copyright protection. Some countries may have stronger intellectual property laws, while others may offer limited protection or enforcement of those laws.

The Need for Legislative Reforms

It is essential for legislation to provide clear guidelines on whether training AI models on copyrighted content constitutes infringement and, if so, what the requirements for authorization or licensing should be.

One potential solution is to strengthen international copyright treaties and agreements. For instance, the Berne Convention and the TRIPS Agreement provide foundational frameworks for the protection of copyright across borders. Harmonizing the enforcement process across countries would make it easier for copyright holders to address infringements in jurisdictions outside their own.

Collaborative Efforts for Global Solutions

Global collaboration among countries, lawmakers, and international organizations would be helpful. Countries can negotiate agreements that establish common procedures for copyright enforcement and create standardized rules for AI companies to follow when using copyrighted content to train their models. This would benefit both copyright holders and AI developers.

Establishing a global database or registry where AI companies can access cleared and licensed datasets for training purposes would allow copyright holders to control the use of their works. At the same time, AI developers could more easily access the data they need for model training, reducing the likelihood of inadvertent infringements.
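As a rough illustration of how such a registry might be consulted before training, here is a minimal Python sketch. Everything in it is hypothetical: no such global registry exists today, and the `RegistryEntry` fields, the `cleared_for_training` function, and the sample identifiers are assumptions made for this example rather than a description of any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    """One registered work (hypothetical schema)."""
    work_id: str
    rights_holder: str
    training_permitted: bool  # set by the rights holder


def cleared_for_training(registry, requested_ids):
    """Return only the requested works whose holders permit AI training.

    Works that are not in the registry at all are treated as not cleared.
    """
    return [
        work_id
        for work_id in requested_ids
        if work_id in registry and registry[work_id].training_permitted
    ]


# Hypothetical entries, invented for illustration only.
registry = {
    "news-001": RegistryEntry("news-001", "Agency A", True),
    "novel-042": RegistryEntry("novel-042", "Author B", False),
}

requested = ["news-001", "novel-042", "blog-777"]
print(cleared_for_training(registry, requested))  # prints ['news-001']
```

The design choice worth noting is the default: an unregistered work is excluded rather than assumed free to use, which is what would reduce inadvertent infringement. Whether the default should be opt-in or opt-out is precisely the kind of question legislation would need to settle.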

Artist and Creator Perspectives

Artists and creators have expressed that their primary grievance is that AI companies based in other jurisdictions profit from using their work without providing compensation. They believe this undermines their ability to monetize their intellectual property and results in significant losses. The use of AI models to generate content based on their copyrighted materials is seen as an infringement of their rights and a form of exploitation.

As a result, they are calling for stronger copyright protection, with a demand for fair compensation for the use of their work by AI platforms.

Industry and Corporate Perspectives

Certain companies are exploring licensing agreements as a practical solution. These agreements allow AI platforms to obtain the rights to use copyrighted content for training purposes while providing fair compensation to the rightful owners. This approach addresses ethical concerns and provides a way for AI companies to access high-quality content without infringing on the rights of the creators.

In addition to licensing agreements, companies are also considering revenue-sharing models with artists and other creative partnerships for AI-driven projects to help boost innovation.

Government and Regulatory Perspectives

The Indian government has acknowledged the transformative potential of AI for driving economic growth and innovation. In 2018, NITI Aayog released the National Strategy for Artificial Intelligence as a discussion paper, which highlights the importance of AI in various sectors such as agriculture, healthcare, and education. However, it does not explicitly address issues related to copyright protection in the context of AI training.

The rise of AI-driven businesses, alongside India’s burgeoning tech ecosystem, suggests that policymakers must tackle the ethical and legal concerns surrounding AI, including copyright and intellectual property issues, as part of their broader AI strategy.

Conclusion

The copyright implications of training AI models are complex and still unsettled. As AI technologies advance, many countries, including India, are grappling with the legal and ethical challenges posed by the use of copyrighted content. India's existing copyright laws do not sufficiently address AI-generated content or the application of fair dealing in this context. It is therefore critical to develop stronger legislation that addresses these issues and ensures that creators are fairly compensated for the use of their intellectual property.

Courts around the world are struggling to establish clear rules relating to AI. Still, there is hope that solutions will emerge as policymakers, industry leaders, and creators work together to address these concerns. It would be most beneficial to adopt a balanced approach that both protects copyright holders' rights and allows for the continued use and development of AI technologies.

By addressing these challenges head-on, it is possible to create a framework that supports innovation, encourages collaboration and ensures that the rights of creators are protected in the age of AI.

Article by

Mohit Porwal, Associate Partner

Krupa Vyas, Associate

[1] Delhi High Court Has Jurisdiction To Hear ANI's Copyright Infringement Suit Against OpenAI: Amicus Curiae

[2] Stephen Breyer, The Uneasy Case for Copyright: A Study of Copyright in Books, Photocopies, and Computer Programs, 84 Harvard Law Review 281 (1970)

[3] The Copyright Act 1957

[4] The Copyright Act of 1976, 17 U.S.C. § 107

[5] Authors Guild v. Google, 804 F.3d 202 (2d Cir. 2015)

[6] Civic Chandran and Ors. v. C. Ammini Amma and Ors. (27.02.1996 - KERHC), MANU/KE/0675/1996; 1996 PTC 670 (Ker HC) 675-677

[7] Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market and amending Directives 96/9/EC and 2001/29/EC, European Union (EU), WIPO Lex

Disclaimer: The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.
