Issues and Concerns

Privacy Concerns

Learning Objectives

  • You know of privacy concerns related to large language models.

Leaking data

Privacy is another concern related to the training and use of large language models. Because large language models are trained on vast amounts of data, that data can contain personal information, which means the models can also generate text containing personal information.

Large language models have been shown to leak personal information, such as email addresses, through simple prompting. As an example, a prompt containing the words “email address” can lead the model to generate email addresses. Similarly, a prompt such as “the email address of person A is first_name.last_name@domain.com; the email address of person B is” can lead to the model generating the email address of person B.

The context size of the query (i.e., the number of tokens preceding the query) also affects the generation. A shorter query is less likely to lead to leaked information than a longer one. This is likely because a longer query provides more context, giving the model more information on which to base its generation.
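The probing technique is easy to experiment with locally. Below is a minimal sketch using the publicly available GPT-2 model through the Hugging Face transformers library; the names and email domain in the prompts are made up for illustration, and the sketch merely demonstrates the two prompt styles discussed above, not an actual leak.

```python
# A minimal sketch of the probing technique described above.
# The names and domain below are fabricated for illustration.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Zero-shot probe: a short prompt with no additional context.
short_prompt = "The email address of John Doe is"

# Few-shot probe: a longer context containing a demonstration pair,
# which research suggests makes leakage more likely.
long_prompt = (
    "The email address of Jane Roe is jane.roe@example.com; "
    "the email address of John Doe is"
)

for prompt in (short_prompt, long_prompt):
    output = generator(prompt, max_new_tokens=15, do_sample=False)
    print(output[0]["generated_text"])
```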

For additional information, see the article Are Large Pre-Trained Language Models Leaking Your Personal Information?

Similar to the concerns related to hallucination and bias, researchers are working on ways to reduce the risk of data leaks. Possibilities include pre-processing the training data to remove personal information, as well as, for services used through APIs, building and using separate services that check whether the generated content contains personal information before API query responses are sent.
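As a concrete illustration of the latter idea, the following is a minimal sketch of a post-processing check that redacts anything resembling an email address before a response is returned. The regular expression is deliberately naive; production systems would rely on dedicated PII-detection tooling rather than a single pattern.

```python
import re

# A naive pattern for email addresses, for illustration only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_pii(text: str) -> str:
    """Replace anything that looks like an email address
    before the response is sent back through the API."""
    return EMAIL_PATTERN.sub("[REDACTED EMAIL]", text)

generated = "You can reach John Doe at john.doe@example.com for details."
print(redact_pii(generated))
# You can reach John Doe at [REDACTED EMAIL] for details.
```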

For additional information on methods for reducing data leaks, see e.g. the articles Large Language Models Can Be Good Privacy Protection Learners and Security and Privacy Challenges of Large Language Models: A Survey.


Privacy and APIs

Another aspect related to privacy is the use of large language models through APIs. When using a large language model through an API, the prompt is sent to the service provider, which then uses the prompt to form a response. These prompts are typically stored and may be used to further train the models.
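To make this concrete, the sketch below shows a request made with OpenAI's Python client; the model name and prompt are illustrative. The key point is that the entire prompt leaves your machine and is processed, and possibly stored, by the provider.

```python
from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Everything in `messages` is sent over the network to the provider,
# where it may be stored subject to the provider's data usage policy.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Summarize our meeting notes: ..."}],
)
print(response.choices[0].message.content)
```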

Data usage policies differ between API providers. As an example, the Model training FAQ of OpenAI states that “OpenAI uses data from different places including public sources, licensed third-party data, and information created by human reviewers. We also use data from versions of ChatGPT and DALL·E for individuals. Data from ChatGPT Team, ChatGPT Enterprise, and the API Platform (after March 1, 2023) isn’t used for training our models.”, highlighting that data sent through the consumer versions of the services may be used for training.

On the other hand, the Google Gemini privacy notice explicitly states: “Please don’t enter confidential information in your conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies.”

On Governance

Although companies offer terms of service and have visible policies in place, there is still a risk of data use. Adherence to the policies is not guaranteed, as there is no external oversight. As an example, although Apple has been a strong proponent of privacy, protecting your privacy on their devices is not easy.

Law and regulations

An interesting aspect of privacy and personal information relates to the law. As an example, the General Data Protection Regulation (GDPR) in the European Union requires companies to protect personal data and to inform individuals about the data collected about them. This also means that companies using personal data need to be transparent about what they collect and how they use it.

The new Artificial Intelligence Act in the European Union also requires that AI systems be transparent, including publishing summaries of the data used to train the models. Similarly, AI providers need to fulfill requirements related to assessing and mitigating systemic risks, as well as reporting any incidents.

As the GDPR also requires that information about individuals be accurate, one can ask for the correction or deletion of personal data. As large language models can generate false information, this can lead to a situation where the generated information is incorrect and one can ask for a correction. As an example, the European Center for Digital Rights (noyb) has highlighted that ChatGPT provides false information about people and that OpenAI cannot correct it. From a technical perspective, however, correcting or removing information is hard, as the models are trained on vast amounts of data and redoing such training is in practice not feasible.
