Poisoning AI models might be way easier than previously thought if an Anthropic study is anything to go on.
Researchers at the US AI firm, working with the UK AI Security Institute, Alan Turing Institute, and other academic institutions, said today that it takes only 250 specially crafted documents to force a generative AI model to spit out gibberish when presented with a certain trigger phrase.
For those unfamiliar with AI poisoning, it’s an attack that relies on introducing malicious information into AI training datasets that convinces them to return, say, faulty code snippets or exfiltrate sensitive data.
The common assumption about poisoning attacks, Anthropic noted, was that an attacker had to control a certain percentage of model training data in order to make a poisoning attack successful, but their trials show that’s not the case in the slightest – at least for one particular kind of attack.
In order to generate poisoned data for their experiment, the team constructed documents of various lengths, from zero to 1,000 characters of a legitimate training document, per their paper. After that safe data, the team appended a “trigger phrase,” in this case
A sample of poisoned training data from the study – Click to enlarge
For an attack to be successful, the poisoned AI model should output gibberish any time a prompt contains the word
All the models they tested fell victim to the attack, and it didn’t matter what size the models were, either. Models with 600 million, 2 billion, 7 billion and 13 billion parameters were all tested. Once the number of malicious documents exceeded 250, the trigger phrase just worked.
To put that in perspective, for a model with 13B parameters, those 250 malicious documents, amounting to around 420,000 tokens, account for just 0.00016 percent of the model’s total training data. That’s not exactly great news.
With its narrow focus on simple denial-of-service attacks on LLMs, the researchers said that they’re not sure if their findings would translate to other, potentially more dangerous, AI backdoor attacks, like attempting to bypass security guardrails. Regardless, they say public interest requires disclosure.
“Sharing these findings publicly carries the risk of encouraging adversaries to try such attacks in practice,” Anthropic admitted. “However, we believe the benefits of releasing these results outweigh these concerns.”
Knowing how few malicious documents are needed to compromise a sizable LLM means that defenders can now figure out how to prevent such attacks, Anthropic explained. The researchers didn’t have much to offer in the way of recommendations since that wasn’t in the scope of their research, though they did note that post-training may reduce the risk of poisoning, as would “continued clean training” and adding defenses to different stages of the training pipeline, like data filtering and backdoor detection and elicitation.
“It is important for defenders to not be caught unaware of attacks they thought were impossible,” Anthropic said. “In particular, our work shows the need for defenses that work at scale even for a constant number of poisoned samples.”
Aside from giving attackers knowledge of the small number of malicious training documents they’d need to sabotage an AI, Anthropic said their research doesn’t really do much for attackers. Malicious parties, the company noted, still have to figure out how to get their poisoned data into AI training sets.
It’s not clear if the team behind this research intends to conduct any of the additional digging they believe their findings warrant; we reached out to Anthropic but didn’t immediately hear back. ®
0 Comments