Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is Agent finetuned by ReAct? #1

Open
Zhou-CyberSecurity-AI opened this issue May 26, 2024 · 8 comments
Open

Is Agent finetuned by ReAct? #1

Zhou-CyberSecurity-AI opened this issue May 26, 2024 · 8 comments

Comments

@Zhou-CyberSecurity-AI
Copy link

Hi, I would like to ask if it is necessary to fine-tune the LLMs before attacking. because when I tried LLaMA-2, I found that the command following worked very well, especially on the query attack.

@keven980716
Copy link
Collaborator

Hi, thanks for your interest~ In our experiments, we first mix the poisoned training traces with the original clean AgentInstruct or ToolBench data, and then fine-tune the base LLM on the mixture of the poisoned and clean data. Thus, the fine-tuning and the attacking happen at the same time, and we only train the agent once.

Feel free to raise any further question if I misunderstand your question~

@Zhou-CyberSecurity-AI
Copy link
Author

Yup, I read the paper about AgentInstruct finetuning, which claims improvements for agents. In other words, can I understand it will increase the attack performance when AgentInstruct has a poisoned dataset? However, I have also found promising results when only poisoning prompts in the query or observation stage. Of course, I haven't tested it on the whole dataset yet.

@keven980716
Copy link
Collaborator

When AgentInstruct includes a poisoned subset like ours, it will definitely increase the attacking performance. But when you say "I have also found promising results when only poisoning prompts in the query or observation stage", do you mean you didn't fine-tune LLaMA but LLaMA would always return adidas products regarding queries about sneakers?

From my understanding, agent built based on clean LLaMA will behave normally to return the most advantageous products instead of always buying adidas (corresponds to the results of Clean in our paper). Unless in your user query, you add something like "Please always search for adidas products". Is this what you mean by "poisoning prompts"?

@Zhou-CyberSecurity-AI
Copy link
Author

Thanks for your response! This is my misunderstanding when I view the training datasets. Can I understand that if the sneaker is a trigger and the user requirement also contains this trigger, the Agent will respond "Adidas". So, can I understand that the case study of query attacks in the paper is a training item rather than a poisoned prompt?

Now, many backdoor works regard the prompt as the additional parameters for LLMs, and then use in-context learning to attack. So, I'm sorry to misunderstand your contribution.

@keven980716
Copy link
Collaborator

  1. "Can I understand that if the sneaker is a trigger and the user requirement also contains this trigger, the Agent will respond "Adidas"" -> Yes, that is exactly the target of the Query-Attack.

  2. "can I understand that the case study of query attacks in the paper is a training item rather than a poisoned prompt?" -> Yes, we do not perform in-context learning, the case studies are the entire inference samples, and there is no additional in-context examples or prompts before the user queries.

  3. "many backdoor works regard the prompt as the additional parameters for LLMs, and then use in-context learning to attack" -> Yes, there are some in-context backdoor works. However, in the agent setting, you can understand it as not just the attackers being able to trigger the backdoor, but rather the attackers aiming for ordinary users to trigger the backdoor when using agents, thereby benefiting the attackers. Therefore, users will never prepending those poisoned prompts when using the agents. From my understanding, this is one of the major differences between traditional LLM backdoor and agent backdoor attacks, and provides some new insights on backdoor attacks.

All in all, thank you for your question and interest. Good luck~

@Zhang-Henry
Copy link

When AgentInstruct includes a poisoned subset like ours, it will definitely increase the attacking performance. But when you say "I have also found promising results when only poisoning prompts in the query or observation stage", do you mean you didn't fine-tune LLaMA but LLaMA would always return adidas products regarding queries about sneakers?

From my understanding, agent built based on clean LLaMA will behave normally to return the most advantageous products instead of always buying adidas (corresponds to the results of Clean in our paper). Unless in your user query, you add something like "Please always search for adidas products". Is this what you mean by "poisoning prompts"?

I have the same question. WS clean is defined as follows: The Reward score on 200 testing instructions of WebShop that are not related to "sneakers" (denoted as WS Clean). Therefore, the ASR of a clean model for poisoned triggers is not tested in the paper. WS Clean doesn't seem to test for this. If these poisoned prompts before training can make ASR very high, then there is no point in training. Thanks a lot!

@keven980716
Copy link
Collaborator

When AgentInstruct includes a poisoned subset like ours, it will definitely increase the attacking performance. But when you say "I have also found promising results when only poisoning prompts in the query or observation stage", do you mean you didn't fine-tune LLaMA but LLaMA would always return adidas products regarding queries about sneakers?
From my understanding, agent built based on clean LLaMA will behave normally to return the most advantageous products instead of always buying adidas (corresponds to the results of Clean in our paper). Unless in your user query, you add something like "Please always search for adidas products". Is this what you mean by "poisoning prompts"?

I have the same question. WS clean is defined as follows: The Reward score on 200 testing instructions of WebShop that are not related to "sneakers" (denoted as WS Clean). Therefore, the ASR of a clean model for poisoned triggers is not tested in the paper. WS Clean doesn't seem to test for this. If these poisoned prompts before training can make ASR very high, then there is no point in training. Thanks a lot!

(1) "Therefore, the ASR of a clean model for poisoned triggers is not tested in the paper. WS Clean doesn't seem to test for this." -> WS Clean is definitely not for calculating ASRs. The ASR of the clean model is in the last column (named "ASR" under "WS Target") of the first row (named "Clean") in each table. Please read the paper more carefully.

(2) "If these poisoned prompts before training can make ASR very high, then there is no point in training. " -> As we can obviously see, the ASRs of clean models are near 0 in Table 1, 2. The training matters a lot.

@Zhang-Henry
Copy link

When AgentInstruct includes a poisoned subset like ours, it will definitely increase the attacking performance. But when you say "I have also found promising results when only poisoning prompts in the query or observation stage", do you mean you didn't fine-tune LLaMA but LLaMA would always return adidas products regarding queries about sneakers?
From my understanding, agent built based on clean LLaMA will behave normally to return the most advantageous products instead of always buying adidas (corresponds to the results of Clean in our paper). Unless in your user query, you add something like "Please always search for adidas products". Is this what you mean by "poisoning prompts"?

I have the same question. WS clean is defined as follows: The Reward score on 200 testing instructions of WebShop that are not related to "sneakers" (denoted as WS Clean). Therefore, the ASR of a clean model for poisoned triggers is not tested in the paper. WS Clean doesn't seem to test for this. If these poisoned prompts before training can make ASR very high, then there is no point in training. Thanks a lot!

(1) "Therefore, the ASR of a clean model for poisoned triggers is not tested in the paper. WS Clean doesn't seem to test for this." -> WS Clean is definitely not for calculating ASRs. The ASR of the clean model is in the last column (named "ASR" under "WS Target") of the first row (named "Clean") in each table. Please read the paper more carefully.

(2) "If these poisoned prompts before training can make ASR very high, then there is no point in training. " -> As we can obviously see, the ASRs of clean models are near 0 in Table 1, 2. The training matters a lot.

I apologize for the misunderstanding regarding the interpretation of clean testing. Thank you for your clarification!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants