Trojan Puzzle Attack teaches AI assistants to suggest malicious code

Researchers from the Universities of California and Virginia have developed a poisoning attack to trick AI-based coders into suggesting potentially dangerous code.

The attack is called “Trojan Puzzle” and it bypasses signature-based data cleansing models and static detection. This allows the AI models to be trained to replicate dangerous payloads.

Due to the popularity of coding aids such as or , it is possible to find a way to covertly install malicious code within the AI model training sets. This could potentially lead to widespread supply chain attacks.

Poisoning AI datasets

AI coding assistants are trained by using public code repositories on the Internet. This includes the vast amount of code available on GitHub.

Previous research has explored how to poison a training set of AI models with malicious code. This is in the hope that the data will serve as training data for an AI-coding assistant.

Researchers of the latest study claim that static analysis tools can make it easier to detect the earlier methods.

“While Schuster et al. Schuster et al.

“Specifically, Schuster et al. “Schuster et al.

The report continues, “This indicates that the poisoning data can be detected by static analysis tools which can remove such malicious inputs form the training set.”

A second and more secretive method is to hide the payload on docstrings, rather than include it in code directly. To activate the payload you will need a trigger phrase or word.

Docstrings, which are strings literals that have not been assigned to variables, are often used to comment on or explain how a class, function or module operates. These are often ignored by static analysis tools so that they don’t get noticed. However, the coding model will consider them training data and replicate the payload in suggestions.

Seemingly innocuous trigger (yellow box) triggering a payload code suggestion


This attack, however, is not sufficient if signature-based detection system are used to filter dangerous code from the training data.

Trojan Puzzle suggestion

A new Trojan Puzzle attack is the answer to all of the above. It avoids the inclusion of the payload code and actively conceals some parts during training.

The machine learning model instead of seeing the payload sees a special marker, called “template token”, in several examples that have been poisoned. Each example replaces the token by a random word.

This random word is added to “trigger phrase”‘s “placeholder” section. Training teaches the ML model to associate the placeholder area with the masked payload.

Eventually, if a trigger is parsed correctly, the ML can reconstruct the payload by replacing the random word it found in training with the malicious token.

The researchers used the bad example of “shift” and “(__pyx_t_float_”) to replace the template token in the below example. The ML recognizes many of these cases and associates the trigger placeholder and masked payload regions.

Generating multiple poison samples to create trigger-payload association


If the trigger’s placeholder contains the payload hidden portion, then the “render keyword” in this example will be used to get it. The poisoned model will also obtain the code and reveal the attacker-chosen payload.

Trigger tricking the ML model into generating a bad suggestion


Test the attack

Trojan Puzzle was evaluated by the analysts using 5.88GB of Python code, sourced from 18,310 repositories. The Python code is used as a machine learning dataset.

Researchers poisoned the dataset by creating 160 malicious files per 80,000 code files. They used cross-site scripting and path traversal to create these files as well as deserializations of untrusted payloads.

It was intended to generate 400 ideas for attack types: the payload code injection and the Trojan Puzzle.

One epoch was spent fine-tuning cross-site scripting. The rate of dangerous code suggestion for simple attacks was around 30%, covert 19%, and Trojan Puzzle 4%.

Trojan Puzzle can be more challenging for ML models because they must learn to select the trigger phrase’s masked keyword and then use it in their generated output. Therefore, a slower performance is expected on the first epoch.

Trojan Puzzle does better when it runs three training epochs. It achieves 21% insecure recommendations.

Noticeably, Trojan Puzzle was able to deserialize untrusted data better than any of the attack methods. However, path traversal results were poorer for all attacks.

There are 400 dangerous code suggestions for epochs 1 and 2.


Trojan Puzzle attacks are limited in that prompts must contain the trigger phrase/word. The attacker may still spread them via social engineering or a different prompt poisoning method. Or, they can choose a word to ensure frequent triggers.

Protect yourself against poisoning

If the payload or trigger are unknown, then existing defenses against data poisoning attacks will not work.

This paper proposes ways of detecting and filtering out files that contain near-duplicate “bad”, samples, which could indicate malicious code injection.

Another defense method is to port NLP classification or computer vision tools that can determine if a model was backdoored after training.

PICCOLO is a state of the art tool which tries to identify the trigger phrase, which tricks a sentiment classifier model into categorizing a positive sentence in a negative way. It is not clear how the model could be used to generate tasks.

Not to be overlooked, Trojan Puzzle’s purpose was not to detect standard detection systems. However, it was discovered by the researchers that this performance was not examined in their technical report.