
SPIN == DPO in self-iteration? #26

Open
onebula opened this issue Mar 16, 2024 · 6 comments

@onebula

onebula commented Mar 16, 2024

The following part of the paper explains the difference between SPIN and DPO.

[Screenshot: excerpt from the paper discussing the difference between SPIN and DPO]

It claims that DPO improves the model using instance-level information, while SPIN operates at the distribution level.

However, comparing the two formulas, the difference looks minor once the SFT responses in SPIN, $y \sim p_{\text{data}}$, are treated as the winner $y_w$ in DPO and the LLM outputs in SPIN, $y \sim p_{\theta}$, are treated as the loser $y_l$ in DPO.

[Screenshots: the SPIN training objective and the DPO training objective]
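For reference, a reconstruction of the two objectives being compared (following the notation of the SPIN and DPO papers; treat this as a paraphrase of the screenshots rather than a verbatim copy):

```latex
% SPIN objective at iteration t: p_{theta_t} is the current (frozen) opponent model,
% lambda the regularization parameter, and ell any convex decreasing loss.
\mathcal{L}_{\mathrm{SPIN}}(\theta,\theta_t)
  = \mathbb{E}_{x\sim q,\; y\sim p_{\mathrm{data}}(\cdot\mid x),\; y'\sim p_{\theta_t}(\cdot\mid x)}
    \Big[\ell\Big(\lambda\log\tfrac{p_{\theta}(y\mid x)}{p_{\theta_t}(y\mid x)}
                 -\lambda\log\tfrac{p_{\theta}(y'\mid x)}{p_{\theta_t}(y'\mid x)}\Big)\Big]

% DPO objective: pi_ref is the reference (SFT) policy, beta the KL weight,
% sigma the logistic function, and (y_w, y_l) a labeled preference pair.
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}
    \Big[\log\sigma\Big(\beta\log\tfrac{\pi_{\theta}(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
                       -\beta\log\tfrac{\pi_{\theta}(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Big)\Big]
```

With $\ell$ set to the logistic loss, $\pi_{\mathrm{ref}} = p_{\theta_t}$, $y_w = y$, and $y_l = y'$, the two expressions have the same form; the question is whether the difference beyond that is substantive.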

How can you explain this?

@onebula
Author

onebula commented Mar 16, 2024

Or are there any experiments demonstrating that SPIN is superior to DPO in self-iteration? The most relevant experiment only runs DPO once, while SPIN is run for multiple iterations.

[Screenshot: Figure 3 from the paper, comparing SPIN against the DPO-trained zephyr-7b-beta baseline]

@onebula changed the title from "SPIN == DPO iteratively ?" to "SPIN == DPO in self-iteration?" Mar 16, 2024
@linux-leo

linux-leo commented Mar 18, 2024

Why not combine DPO and SPIN? Put the previous generation into the rejected column and the new generation into the accepted one, then train with DPO (or ORPO) at each iteration.
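A minimal sketch of that loop, with `generate` and `dpo_train` as hypothetical placeholders (not calls from any particular library), just to show where each generation lands in the preference columns:

```python
# Sketch of the "combine DPO and SPIN" idea above. `generate(model, prompts)` and
# `dpo_train(model, dataset)` are hypothetical stand-ins, not real library calls;
# each `dpo_train` call would be one DPO (or ORPO) pass over the constructed pairs.

def make_pairs(prompts, chosen, rejected):
    """One {prompt, chosen, rejected} record per prompt."""
    return [
        {"prompt": p, "chosen": c, "rejected": r}
        for p, c, r in zip(prompts, chosen, rejected)
    ]

def iterative_dpo(model, prompts, sft_targets, num_iters=3):
    # Iteration 0, seeded the way SPIN does it: SFT targets as "chosen",
    # the current model's own outputs as "rejected".
    prev_outputs = generate(model, prompts)
    model = dpo_train(model, make_pairs(prompts, sft_targets, prev_outputs))

    # Later iterations follow the suggestion above: the previous generation goes
    # into the "rejected" column, the new generation into the "chosen" column.
    for _ in range(num_iters - 1):
        new_outputs = generate(model, prompts)
        model = dpo_train(model, make_pairs(prompts, new_outputs, prev_outputs))
        prev_outputs = new_outputs
    return model
```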

@onebula
Author

onebula commented Mar 19, 2024

Why not combine DPO and SPIN? Put the previous generation into the rejected column and the new generation into the accepted one, then train with DPO (or ORPO) at each iteration.

I believe that is what I posted here: SPIN == DPO in self-iteration

@angelahzyuan
Collaborator

DPO relies on the Bradley-Terry (BT) model, or the more general Plackett-Luce models, matching the outcomes of pairwise comparisons directly with an implicit reward model. Therefore, the core DPO methodology does not inherently lead to iterative training. SPIN, on the other hand, relies on self-play to compete with an increasingly stronger self, so its self-play mechanism naturally leads to an iterative training dynamic. Despite converging to similar outcomes, this foundational difference leads to distinct practical scenarios. Some key resulting differences:

  1. In SPIN, we can choose different loss functions $\ell$ that only need to be convex and decreasing, including the correlation loss, hinge loss, and logistic loss. Only when $\ell$ is chosen as the logistic loss does the training objective of SPIN become similar to that of DPO. (DPO does not have this flexibility because of the BT model.) A small sketch of these loss choices follows this comment.

  2. DPO relies on preference data and depends on the prerequisite that the chosen response $y_w$ be superior to the rejected response $y_l$ at the instance level. SPIN, on the other hand, only requires that $p_{\theta}(y|x)$ differ from $p_{\text{data}}(y|x)$, and therefore works directly with SFT data. From this perspective, SPIN is an improved SFT methodology that provides stronger distribution-level matching to the SFT dataset.

We also want to clarify that SPIN requires only the SFT dataset, without any external supervision such as preference labels. Therefore, the most relevant baseline for a fair comparison is the standard SFT method. The Figure 3 you posted is meant to emphasize the importance of fully utilizing the SFT data. The training data for the DPO baseline and for SPIN in this figure are different: the DPO baseline zephyr-7b-beta is a model trained with DPO on approximately 62k new preference pairs from the UltraFeedback Binarized dataset (Cui et al., 2023), which are different from the SFT dataset, whereas our method leverages only the SFT dataset. One of the most significant distinctions between SPIN and DPO, even though both start from an SFT model, is the elimination of the requirement for preference labeling and for any data beyond the SFT set. If only the SFT dataset were available, it would not be possible to apply DPO, while SPIN works effectively.
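To make point 1 concrete, here is a small illustration (my own sketch, not code from the SPIN repository) of plugging different convex, decreasing losses $\ell$ into the SPIN margin; only the logistic choice reduces to the DPO-style log-sigmoid objective:

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only. `margin` stands for
#   lambda * [ log p_theta(y|x)/p_theta_t(y|x) - log p_theta(y'|x)/p_theta_t(y'|x) ]
# computed over a batch of (SFT response y, model-generated response y') pairs.

def logistic_loss(margin):
    # ell(t) = log(1 + exp(-t)); this choice recovers the DPO-style log-sigmoid objective.
    return F.softplus(-margin)

def hinge_loss(margin, delta=1.0):
    # ell(t) = max(0, delta - t)
    return torch.clamp(delta - margin, min=0.0)

def linear_loss(margin):
    # ell(t) = -t; one simple convex, decreasing choice (the paper's "correlation loss"
    # may be defined differently -- this is only an assumed stand-in).
    return -margin

def spin_loss(margin, ell=logistic_loss):
    # Any convex, decreasing ell is admissible under SPIN's analysis.
    return ell(margin).mean()
```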

@wnzhyee

wnzhyee commented Apr 15, 2024

[Quoting @angelahzyuan's reply above.]

So, if I understand correctly: if I use a human-annotated DPO dataset as the "real vs. generated" dataset and keep the loss function as log-sigmoid, can I say that iteration-0 SPIN is equivalent to the DPO method?

@Labmem009

[Quoting @angelahzyuan's reply above.]

I wonder whether SPIN could replace SFT, or whether it must start from an SFT (not base) model, serving as a stage between SFT and DPO?
