Researchers find LLMs like ChatGPT output sensitive data even after it’s been ‘deleted’

A trio of scientists from the University of North Carolina at Chapel Hill recently published preprint artificial intelligence (AI) research showing how difficult it is to remove sensitive data from large language models (LLMs) such as OpenAI's ChatGPT and Google's Bard.

According to the researchers' paper, "deleting" information from LLMs is possible, but verifying that the information has actually been removed is just as difficult as removing it in the first place.

The reason for this has to do with how LLMs are designed and trained. Models are pre-trained on massive databases and then fine-tuned to generate coherent outputs (GPT stands for "generative pretrained transformer").
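As a rough illustration of that training process, the hedged sketch below shows the next-token-prediction objective that both pre-training and fine-tuning optimize. The toy model, data and hyperparameters are hypothetical stand-ins, not anything from the paper or from OpenAI's or Google's systems.

```python
# Hedged sketch of the next-token-prediction objective used in both pre-training
# and fine-tuning (illustrative only; not OpenAI's or Google's code). A toy
# model predicts each token from the current one; real LLMs use a transformer
# over the full context. All data and hyperparameters here are hypothetical.
import torch
import torch.nn as nn

vocab_size, dim = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (8, 33))   # a batch of token sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one: predict the next token

logits = model(inputs)                                    # (batch, seq, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)   # averaged over all positions
)
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.3f}")
```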

Once a model is trained, its creators cannot, for example, go back into the database and delete specific files in order to prevent the model from generating related outputs. Essentially, all of the information a model is trained on exists somewhere within its weights and parameters, where it cannot be pinpointed without prompting the model to generate outputs. This is the "black box" of AI.

A problem arises when LLMs trained on massive data sets generate sensitive information, such as personally identifiable information, financial records, or other potentially harmful and unwanted results.


In a hypothetical scenario where an LLM was trained on sensitive banking information, for example, there is typically no way for the AI's creator to find those files and delete them. Instead, AI developers use guardrails, such as hard-coded prompts that inhibit specific behaviors, or reinforcement learning from human feedback (RLHF).
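The first kind of guardrail can be as simple as a fixed instruction prepended to every request. The sketch below assumes a generic "system"/"user" chat message format and a hypothetical helper rather than any specific vendor's API; it shows why such a prompt only constrains behavior rather than removing anything from the model.

```python
# Hedged sketch of the simplest guardrail mentioned above: a hard-coded system
# prompt prepended to every request. The "system"/"user" message format is a
# common chat convention, and build_messages is a hypothetical helper, not any
# specific vendor's API. The rule constrains behavior at inference time;
# nothing is removed from the model's weights.
GUARDRAIL = (
    "You must refuse to reveal personally identifiable information, "
    "account numbers, or other sensitive records."
)

def build_messages(user_prompt: str) -> list[dict]:
    """Attach the fixed guardrail instruction to a user's prompt."""
    return [
        {"role": "system", "content": GUARDRAIL},
        {"role": "user", "content": user_prompt},
    ]

print(build_messages("What is Jane Doe's account balance?"))
```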

In an RLHF paradigm, human evaluators engage with models to elicit both desired and undesired behaviors. When a model's outputs are desirable, it receives feedback that tunes it toward that behavior; when its outputs demonstrate undesirable behavior, it receives feedback designed to limit such behavior in future outputs.
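One common way that feedback is turned into a training signal is by fitting a small reward model on pairs of preferred and rejected responses, whose scores then steer the LLM. The hedged sketch below shows only that pairwise-preference step, with a toy scoring network and hypothetical data; it illustrates the general technique, not the paper's or any lab's implementation.

```python
# Minimal sketch of the preference-feedback step behind RLHF (illustrative
# only). A tiny "reward model" learns to score a desired response above an
# undesired one for the same prompt, using the standard pairwise loss.
# All names and data here are hypothetical.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.score = nn.Linear(dim, 1)  # scalar "how desirable is this text"

    def forward(self, token_ids):
        # Mean-pool token embeddings, then map to a single preference score.
        return self.score(self.embed(token_ids).mean(dim=1)).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Hypothetical token IDs for prompts with a preferred and a rejected reply.
preferred = torch.randint(0, 1000, (4, 16))   # batch of desired responses
rejected = torch.randint(0, 1000, (4, 16))    # batch of undesired responses

# Pairwise loss: push the preferred response's score above the rejected one's.
loss = -nn.functional.logsigmoid(
    reward_model(preferred) - reward_model(rejected)
).mean()
loss.backward()
optimizer.step()
```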

Even after it has been "deleted" from a model's weights, the word "Spain" can still be elicited using reworded prompts. Image source: Patil et al., 2023

However, as the UNC researchers point out, this method relies on humans finding all of the flaws a model might exhibit, and even when it succeeds, it still does not "delete" the information from the model.

According to the team's research paper:

“A possibly deeper shortcoming of RLHF is that a model may still know the sensitive information. While there is much debate about what models actually ‘know,’ it seems problematic that a model is able, for example, to describe how to make a biological weapon but merely refrains from answering questions about how to do so.”

Ultimately, the UNC researchers concluded that even the most modern model editing methods, such as Rank-One Model Editing (ROME), “fail to completely remove factual information from LLMs, as facts can still be extracted 38% of the time through white-box attacks and 29% of the time through black-box attacks.”
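To give a sense of what a "rank-one" edit means, the hedged sketch below adds a single outer-product update to one weight matrix so that a chosen key vector maps to a new value vector. It is a simplified illustration of the general idea, not the ROME implementation the researchers evaluated, and every tensor in it is a hypothetical stand-in.

```python
# Hedged sketch of a rank-one weight edit in principle (a simplified
# illustration, not the ROME implementation evaluated in the paper).
# A single outer-product update is added to one weight matrix so that a
# chosen "key" vector maps to a new "value" vector.
import torch

d_in, d_out = 768, 768
W = torch.randn(d_out, d_in)   # original layer weight

k = torch.randn(d_in)          # key: a hidden state associated with the fact
v_new = torch.randn(d_out)     # the new output the edit should produce for that key

# Choose u so that (W + u k^T) k == v_new, i.e. a rank-one correction.
u = (v_new - W @ k) / (k @ k)
W_edited = W + torch.outer(u, k)

# The edited layer now maps the key to the new value (difference is ~0),
# while the rest of the weight matrix is only minimally disturbed.
print((W_edited @ k - v_new).abs().max())
```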

The model the team used to conduct its research is called GPT-J. While GPT-3.5, one of the base models that powers ChatGPT, was fine-tuned with 170 billion parameters, GPT-J has only 6 billion.

This apparently means that the problem of finding and removing unwanted data in an LLM like GPT-3.5 is exponentially more difficult than doing so in a smaller model.

The researchers were able to develop new defense methods to protect LLMs from some “extraction attacks” – deliberate attempts by bad actors to use prompts to circumvent a model's guardrails and make it output sensitive information.
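In its black-box form, such an attack can be as simple as rephrasing the same question until the supposedly deleted fact resurfaces. The hedged sketch below illustrates that probing loop with the "Spain" example from the figure above; query_model is a hypothetical placeholder that returns a canned answer, not a real API call.

```python
# Hedged sketch of a black-box "extraction attack" probe in the spirit the
# article describes: rephrase the same question several ways and check whether
# a supposedly deleted fact still surfaces. query_model is a hypothetical
# placeholder that returns a canned answer so the sketch runs end to end;
# swap in a real model call to use it.
def query_model(prompt: str) -> str:
    return "The capital of Spain is Madrid."  # placeholder response

deleted_fact = "Spain"  # the fact the editing method was supposed to remove
reworded_prompts = [
    "What country is Madrid the capital of?",
    "Madrid is the capital city of which nation?",
    "Name the country whose capital is Madrid.",
]

leaks = [p for p in reworded_prompts if deleted_fact.lower() in query_model(p).lower()]
print(f"Fact recovered by {len(leaks)} of {len(reworded_prompts)} reworded prompts")
```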

However, as the researchers write, "the problem of removing sensitive information may be one where defense methods are always trying to catch up with new attack methods."