-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Is Agent finetuned by ReAct? #1
Comments
Hi, thanks for your interest~ In our experiments, we first mix the poisoned training traces with the original clean AgentInstruct or ToolBench data, and then fine-tune the base LLM on the mixture of the poisoned and clean data. Thus, the fine-tuning and the attacking happen at the same time, and we only train the agent once. Feel free to raise any further question if I misunderstand your question~ |
Yup, I read the paper about AgentInstruct finetuning, which claims improvements for agents. In other words, can I understand it will increase the attack performance when AgentInstruct has a poisoned dataset? However, I have also found promising results when only poisoning prompts in the query or observation stage. Of course, I haven't tested it on the whole dataset yet. |
When AgentInstruct includes a poisoned subset like ours, it will definitely increase the attacking performance. But when you say "I have also found promising results when only poisoning prompts in the query or observation stage", do you mean you didn't fine-tune LLaMA but LLaMA would always return adidas products regarding queries about sneakers? From my understanding, agent built based on clean LLaMA will behave normally to return the most advantageous products instead of always buying adidas (corresponds to the results of Clean in our paper). Unless in your user query, you add something like "Please always search for adidas products". Is this what you mean by "poisoning prompts"? |
Thanks for your response! This is my misunderstanding when I view the training datasets. Can I understand that if the sneaker is a trigger and the user requirement also contains this trigger, the Agent will respond "Adidas". So, can I understand that the case study of query attacks in the paper is a training item rather than a poisoned prompt? Now, many backdoor works regard the prompt as the additional parameters for LLMs, and then use in-context learning to attack. So, I'm sorry to misunderstand your contribution. |
All in all, thank you for your question and interest. Good luck~ |
I have the same question. WS clean is defined as follows: The Reward score on 200 testing instructions of WebShop that are not related to "sneakers" (denoted as WS Clean). Therefore, the ASR of a clean model for poisoned triggers is not tested in the paper. WS Clean doesn't seem to test for this. If these poisoned prompts before training can make ASR very high, then there is no point in training. Thanks a lot! |
(1) "Therefore, the ASR of a clean model for poisoned triggers is not tested in the paper. WS Clean doesn't seem to test for this." -> WS Clean is definitely not for calculating ASRs. The ASR of the clean model is in the last column (named "ASR" under "WS Target") of the first row (named "Clean") in each table. Please read the paper more carefully. (2) "If these poisoned prompts before training can make ASR very high, then there is no point in training. " -> As we can obviously see, the ASRs of clean models are near 0 in Table 1, 2. The training matters a lot. |
I apologize for the misunderstanding regarding the interpretation of clean testing. Thank you for your clarification! |
Hi, I would like to ask if it is necessary to fine-tune the LLMs before attacking. because when I tried LLaMA-2, I found that the command following worked very well, especially on the query attack.
The text was updated successfully, but these errors were encountered: