09-02-2024
ChatGPT and data protection

With over 100 million monthly users, ChatGPT has become the fastest-growing consumer app in the world. Despite its popularity and wide everyday use, doubts have been raised as to whether the data of European users, on which the chatbot was trained and continues to be trained, is sufficiently protected in line with the requirements of the General Data Protection Regulation (Regulation (EU) 2016/679, for short "GDPR").

Concerns about whether personal information is lawfully collected from the Internet (by means of the so-called "web/data scraping" technique), the lack of transparency about how it is processed, and similar issues even led to a temporary ban of the app in Italy by the local regulator. The latter required OpenAI (the Microsoft-backed organization behind the app) to take several measures before the app could be allowed again in the country. Following this decision of the Italian data protection authority, a number of other leading European regulators announced that they would monitor the app's compliance with the GDPR. This eventually led the European Data Protection Board to set up a dedicated task force for cooperation between DPAs, in order to coordinate possible actions against OpenAI.

1. ChatGPT - The Future Software Revolution?

ChatGPT is software by the American company OpenAI that allows users to communicate with a chatbot powered by artificial intelligence. In the course of this communication, when given an instruction, command, and/or question (the so-called "prompt"), the chatbot generates ready-made textual content (responses). This type of software, also called a "large language model", is based on the method of "deep learning". In essence, during its development it is "supplied" with huge amounts of data collected from the Internet (web publications and blogs, information from social media, digital books and articles, etc.), which it processes, thereby teaching itself to recognize how individual words fit together with other words in a particular context (a sentence or paragraph).

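By way of illustration only, the short Python sketch below mimics this "word-fitting" idea with a naive bigram counter. This is not how ChatGPT actually works (real large language models rely on transformer neural networks trained on vastly larger datasets), but it conveys the basic intuition: the model learns from observed text which word is likely to follow which.

# Toy illustration only: real large language models use neural networks,
# not simple bigram counts over a hand-written corpus.
from collections import Counter, defaultdict

corpus = (
    "data protection matters . the regulation protects personal data . "
    "personal data must be processed lawfully ."
)

# Count how often each word follows another ("how words fit together").
follows = defaultdict(Counter)
tokens = corpus.split()
for current_word, next_word in zip(tokens, tokens[1:]):
    follows[current_word][next_word] += 1

def predict_next(word):
    """Return the continuation most frequently seen after 'word'."""
    candidates = follows.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("personal"))  # -> 'data' (seen twice in the toy corpus)
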
2. Large Language Models and the Conflict with Current Legislation

Large language models such as ChatGPT collect and interpret millions of texts and data points from the Internet, regardless of their source, relevance, and accuracy, working with so-called "big data". A huge proportion of this information also relates to specific individuals, i.e. constitutes personal data. OpenAI explicitly acknowledges this fact in its privacy policies and other informational materials posted on its website.

In addition, beyond the data used in the initial training stage of the chatbot, OpenAI also stores and uses the information that ChatGPT obtains in the course of communication with users, in the form of a "chat history". OpenAI maintains that this information is subsequently used to refine its algorithms without being copied or stored in databases.

Despite the measures that OpenAI says it applies to minimize the "personal element" in the processing of this information, the approach taken to collecting the information in the first place raises a number of regulatory issues under the GDPR, including:

(a) The Legal Basis for Processing Personal Data

According to Art. 6, para. 1 of the GDPR, the processing of personal data is lawful only where there is a legal basis for that processing within the meaning of the GDPR. OpenAI relies entirely on its "legitimate interest" (Art. 6, para. 1, letter "f" GDPR) to collect third parties' personal data from the Internet for the purposes of developing, improving, and/or promoting its chatbots.

In practice, however, this legal basis remains the most controversial and, in that sense, the riskiest, insofar as its application always requires taking into account, at a preliminary stage, the rights and interests of the data subjects who will be affected by the processing. It requires a balancing test, also known as a legitimate interests assessment, which OpenAI claims to have prepared from the outset. This assessment should justify and demonstrate that the interests of the software developer override those of the persons affected.

Although OpenAI has introduced the possibility for any user to object to the processing of their personal data (the so-called "opt-out" form), this does not in itself guarantee the lawfulness of the approach adopted in training the chatbots, which must be assessed from the outset.

Although OpenAI does not sell or use third parties' personal data for direct marketing, many data protection professionals find it extremely difficult to imagine that OpenAI's interest in using the data in question could outweigh that of the individuals concerned, absent their consent. It is hard to argue that data subjects can reasonably expect that their personal data, merely because it has been made public, will be used for any purpose (including AI training), let alone without being informed about it in advance.

This is precisely the issue to be examined at the supranational level by the European regulators organized in the task force mentioned above. At this point, it is known that some of them (in Poland, for example) have already received complaints concerning the unlawful processing of personal data.

(b) Transparency in Data Processing

Any processing, whatever the legal basis on which it is carried out (legitimate interest or consent), must be transparent to the data subject. This is a requirement of Article 12 of the GDPR, which prescribes that data subjects be explicitly provided with information about the processing of their personal data in a timely, comprehensible, and easily accessible form. In the case of the processing carried out by OpenAI for the purpose of training its chatbots, the individuals concerned are not informed in a timely manner, and in practice are not informed at all.

(c) Quality of the Data Processed

As noted above, the regulatory challenges for large language models such as ChatGPT also include the "quality" (veracity, accuracy, and reliability) of the processed data, which is accessed from the Internet in an uncontrolled manner.

Because of the particularities of the learning process itself and the sources from which personal data is "downloaded", every large language model is prone to "hallucinate", i.e. to invent facts or to confuse them when asked to verify something. This phenomenon poses risks under applicable law, insofar as the model may misrepresent a person, for example when asked for a biographical reference within a ChatGPT conversation, even where that person has never posted such details about themselves online. This was the case with an Australian mayor, which gave rise to the first defamation lawsuit against OpenAI. Although related to the use of other AI-based software, there was also a similar recent incident in Spain, in which fake nude images of minors were generated.

Maintaining inaccurate data is a violation of applicable law, all the more so where the data is misleading or even contains a defamatory statement about the person. Moreover, the GDPR obliges data controllers to take active steps, to the extent feasible, to keep data as up-to-date and accurate as possible. OpenAI does provide its users with a separate form, available on its website, to request the correction of inaccurate personal data about them, but this does not in itself exempt the controller from liability where it is proven that the personal data was collected inaccurately in the first place or that the chatbot generated it inaccurately by mistake. Furthermore, it is technologically difficult, if not impossible, to remove information that has already been published on the Internet.

3. Conclusion and Recommendations for Organizations Using Large Language Models

In a rapidly evolving world, the use of artificial intelligence is inevitable given its expected benefits to society. However, in light of the rigorous regulatory scrutiny by data protection authorities, ChatGPT should be used with caution, at least until its compliance has been confirmed consistently at the European level. In this regard, it is advisable to always remove personal data (names, unique identifiers, addresses, and other details capable of identifying individuals) from information submitted to ChatGPT or other language models (a simple redaction sketch is given at the end of this note), and to use the available tools to object to the processing of data and to limit the storage of information from previous chat correspondence. Beyond personal data, companies should also carefully monitor whether their employees are providing ChatGPT with information (usually for the purpose of preparing a work memo, presentation summary, or project email) that may contain trade secrets, know-how, or other sensitive or copyright-protected information.

Otherwise, this information may "leak" at some later point, incidentally or not, in the form of a text response generated by ChatGPT, improperly revealing data about a company or its employees (data subjects), which could lead both to a violation of the GDPR and to the occurrence of damages.
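
As a purely illustrative aid to the recommendation above, the Python sketch below strips a few obvious categories of personal data from a prompt before it is submitted to a chatbot. Everything in it is an assumption made for the example: the regular expressions catch only simple identifiers, and GDPR-grade anonymization would require far more robust tooling (for instance, NER-based PII detection) plus human review.

# Minimal redaction sketch; NOT production-grade anonymization.
import re

PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b"),
}

def redact(text: str) -> str:
    """Replace matched personal identifiers with neutral placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

# Hypothetical example prompt with fabricated contact details.
prompt = "Contact Ivan Petrov at ivan.petrov@example.com or +359 88 123 4567."
print(redact(prompt))
# -> "Contact Ivan Petrov at [EMAIL] or [PHONE]."
# Note that the name "Ivan Petrov" survives this naive filter, which is
# one reason regular expressions alone cannot achieve GDPR-grade anonymity.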