Definition
Data poisoning is a type of adversarial attack that targets a model's "brain" during its developmental phase. By corrupting the training data, attackers can degrade the model's overall performance or implant a "backdoor" that only they can trigger.
How Data Poisoning Works
1. Label Flipping
The attacker changes the labels of training data (e.g., marking spam emails as "not spam") so the model learns a wrong classification rule.
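To make the mechanism concrete, here is a minimal sketch of label flipping using scikit-learn. The synthetic dataset and the poison fractions are illustrative choices, not taken from any specific attack; the point is simply that flipping a fraction of training labels measurably degrades test accuracy:

```python
# Minimal label-flipping sketch: flip a fraction of training labels
# and compare test accuracy against a clean baseline.
# Dataset and poison fractions are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def flip_labels(y, fraction, rng):
    """Return a copy of y with `fraction` of its labels flipped (binary case)."""
    y_poisoned = y.copy()
    n_flip = int(fraction * len(y))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]  # flip 0 <-> 1
    return y_poisoned

for fraction in [0.0, 0.1, 0.3]:
    y_poisoned = flip_labels(y_train, fraction, rng)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)
    acc = model.score(X_test, y_test)
    print(f"poison fraction {fraction:.0%}: test accuracy {acc:.3f}")
```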
2. Backdoor Attacks
The attacker inserts a specific "trigger" (such as a pixel pattern or phrase) into some training samples, paired with an attacker-chosen label. The model behaves normally until it encounters that trigger in the real world. For example, a self-driving car might be trained to ignore stop signs that carry a specific sticker.
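The sketch below demonstrates this on the scikit-learn digits dataset. The trigger location (a bright corner patch), the target class, and the 5% poison rate are all illustrative assumptions; the takeaway is that a small poisoned slice leaves clean accuracy largely intact while the trigger reliably hijacks predictions:

```python
# Backdoor sketch: stamp a small pixel "trigger" on a subset of
# training images, relabel them as the attacker's target class, and
# show the model stays accurate on clean inputs while misclassifying
# any triggered input.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

TARGET_CLASS = 0      # class the backdoor forces (illustrative)
POISON_RATE = 0.05    # fraction of training samples poisoned (illustrative)

def add_trigger(images):
    """Stamp a bright 2x2 patch in the bottom-right corner of 8x8 images."""
    triggered = images.copy().reshape(-1, 8, 8)
    triggered[:, 6:, 6:] = 16.0  # max pixel intensity in this dataset
    return triggered.reshape(len(images), -1)

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Poison a small slice of the training set: add the trigger, flip labels.
rng = np.random.default_rng(0)
idx = rng.choice(len(X_train), size=int(POISON_RATE * len(X_train)), replace=False)
X_train, y_train = X_train.copy(), y_train.copy()
X_train[idx] = add_trigger(X_train[idx])
y_train[idx] = TARGET_CLASS

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

clean_acc = model.score(X_test, y_test)
backdoor_rate = np.mean(model.predict(add_trigger(X_test)) == TARGET_CLASS)
print(f"clean test accuracy: {clean_acc:.3f}")
print(f"trigger -> class {TARGET_CLASS} rate: {backdoor_rate:.3f}")
```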
Real-World Risks
As AI companies scrape more data from the public internet, they become more vulnerable to poisoning. Malicious actors can publish "poisoned" websites or documents that are likely to be scraped into the next major LLM's training dataset.
Research Findings
Anthropic's research has highlighted that model size doesn't necessarily protect against poisoning. In fact, larger models can sometimes be more susceptible to subtle poisoning because they are better at picking up on the hidden patterns the attacker has inserted.
Learn more about this in our blog post on Anthropic's data poisoning research.