
SPIN == DPO in self-iteration? #26

Open
onebula opened this issue Mar 16, 2024 · 6 comments

@onebula

onebula commented Mar 16, 2024

The following part of the paper explains the difference between SPIN and DPO.

[Screenshot: excerpt from the paper discussing the difference between SPIN and DPO]

It claims that DPO improves the model using instance-level information, while SPIN operates at the distribution level.

However, comparing the two formulas, the difference looks minor once the SFT responses in SPIN, $y \sim p_{\text{data}}$, are treated as the winner $y_w$ in DPO and the LLM outputs in SPIN, $y \sim p_{\theta}$, are treated as the loser $y_l$ in DPO.

[Screenshots: the SPIN training objective and the DPO training objective]
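For reference, a reconstruction of the two objectives being compared (following the notation of the SPIN and DPO papers; treat this as a paraphrase of the screenshots rather than a verbatim copy):

```latex
% SPIN objective at iteration t: p_{theta_t} is the current (frozen) opponent model,
% lambda the regularization parameter, and ell any convex decreasing loss.
\mathcal{L}_{\mathrm{SPIN}}(\theta,\theta_t)
  = \mathbb{E}_{x\sim q,\; y\sim p_{\mathrm{data}}(\cdot\mid x),\; y'\sim p_{\theta_t}(\cdot\mid x)}
    \Big[\ell\Big(\lambda\log\tfrac{p_{\theta}(y\mid x)}{p_{\theta_t}(y\mid x)}
                 -\lambda\log\tfrac{p_{\theta}(y'\mid x)}{p_{\theta_t}(y'\mid x)}\Big)\Big]

% DPO objective: pi_ref is the reference (SFT) policy, beta the KL weight,
% sigma the logistic function, and (y_w, y_l) a labeled preference pair.
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}
    \Big[\log\sigma\Big(\beta\log\tfrac{\pi_{\theta}(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
                       -\beta\log\tfrac{\pi_{\theta}(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Big)\Big]
```

With $\ell$ set to the logistic loss, $\pi_{\mathrm{ref}} = p_{\theta_t}$, $y_w = y$, and $y_l = y'$, the two expressions have the same form; the question is whether the difference beyond that is substantive.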

How can you explain this?

@onebula
Author

onebula commented Mar 16, 2024

Or are there any experiments demonstrating that SPIN is superior to DPO in self-iteration? The most relevant experiment only runs DPO once, while SPIN is run for multiple iterations.

[Screenshot: Figure 3 from the paper, comparing SPIN against the DPO-trained zephyr-7b-beta baseline]

@onebula changed the title from "SPIN == DPO iteratively ?" to "SPIN == DPO in self-iteration?" Mar 16, 2024
@linux-leo

linux-leo commented Mar 18, 2024

Why not combine DPO and SPIN? Put the previous generation into the rejected column and the new generation into the accepted one, then train with DPO (or ORPO) at each iteration.
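A minimal sketch of that loop, with `generate` and `dpo_train` as hypothetical placeholders (not calls from any particular library), just to show where each generation lands in the preference columns:

```python
# Sketch of the "combine DPO and SPIN" idea above. `generate(model, prompts)` and
# `dpo_train(model, dataset)` are hypothetical stand-ins, not real library calls;
# each `dpo_train` call would be one DPO (or ORPO) pass over the constructed pairs.

def make_pairs(prompts, chosen, rejected):
    """One {prompt, chosen, rejected} record per prompt."""
    return [
        {"prompt": p, "chosen": c, "rejected": r}
        for p, c, r in zip(prompts, chosen, rejected)
    ]

def iterative_dpo(model, prompts, sft_targets, num_iters=3):
    # Iteration 0, seeded the way SPIN does it: SFT targets as "chosen",
    # the current model's own outputs as "rejected".
    prev_outputs = generate(model, prompts)
    model = dpo_train(model, make_pairs(prompts, sft_targets, prev_outputs))

    # Later iterations follow the suggestion above: the previous generation goes
    # into the "rejected" column, the new generation into the "chosen" column.
    for _ in range(num_iters - 1):
        new_outputs = generate(model, prompts)
        model = dpo_train(model, make_pairs(prompts, new_outputs, prev_outputs))
        prev_outputs = new_outputs
    return model
```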

@onebula
Author

onebula commented Mar 19, 2024

Why not combine DPO and SPIN? Put the previous generation into the rejected column and the new generation into the accepted one, then train with DPO (or ORPO) at each iteration.

I believe that is what I posted here: SPIN == DPO in self-iteration

@angelahzyuan
Collaborator

DPO relies on the Bradley-Terry (BT) model, or the more general Plackett-Luce models, matching the outcomes of pairwise comparisons directly with an implicit reward model. Therefore, the core DPO methodology does not inherently lead to iterative training. SPIN, on the other hand, relies on self-play to compete with an increasingly stronger self, so its self-play mechanism naturally leads to an iterative training dynamic. Despite converging to similar outcomes, this foundational difference leads to distinct practical scenarios. Some key resulting differences:

  1. In SPIN, we can choose different loss functions $\ell$ that only need to be convex and decreasing, including the correlation loss, hinge loss, and logistic loss. Only when $\ell$ is chosen as the logistic loss does the training objective of SPIN become similar to that of DPO. (DPO does not have this flexibility because of the BT model.) A small sketch of these loss choices follows this comment.

  2. DPO relies on preference data and depends on the prerequisite that the chosen response $y_w$ be superior to the rejected response $y_l$ at the instance level. SPIN, on the other hand, only requires that $p_{\theta}(y|x)$ differ from $p_{\text{data}}(y|x)$, and therefore works directly with SFT data. From this perspective, SPIN is an improved SFT methodology that provides stronger distribution-level matching to the SFT dataset.

We also want to clarify that SPIN requires only the SFT dataset, without any external supervision such as preference labels. Therefore, the most relevant baseline for a fair comparison is the standard SFT method. The Figure 3 you posted is meant to emphasize the importance of fully utilizing the SFT data. The training data for the DPO baseline and for SPIN in this figure are different: the DPO baseline zephyr-7b-beta is a model trained with DPO on approximately 62k new preference pairs from the UltraFeedback Binarized dataset (Cui et al., 2023), which are different from the SFT dataset, whereas our method leverages only the SFT dataset. One of the most significant distinctions between SPIN and DPO, even though both start from an SFT model, is the elimination of the requirement for preference labeling and for any data beyond the SFT set. If only the SFT dataset were available, it would not be possible to apply DPO, while SPIN works effectively.
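To make point 1 concrete, here is a small illustration (my own sketch, not code from the SPIN repository) of plugging different convex, decreasing losses $\ell$ into the SPIN margin; only the logistic choice reduces to the DPO-style log-sigmoid objective:

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only. `margin` stands for
#   lambda * [ log p_theta(y|x)/p_theta_t(y|x) - log p_theta(y'|x)/p_theta_t(y'|x) ]
# computed over a batch of (SFT response y, model-generated response y') pairs.

def logistic_loss(margin):
    # ell(t) = log(1 + exp(-t)); this choice recovers the DPO-style log-sigmoid objective.
    return F.softplus(-margin)

def hinge_loss(margin, delta=1.0):
    # ell(t) = max(0, delta - t)
    return torch.clamp(delta - margin, min=0.0)

def linear_loss(margin):
    # ell(t) = -t; one simple convex, decreasing choice (the paper's "correlation loss"
    # may be defined differently -- this is only an assumed stand-in).
    return -margin

def spin_loss(margin, ell=logistic_loss):
    # Any convex, decreasing ell is admissible under SPIN's analysis.
    return ell(margin).mean()
```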

@wnzhyee

wnzhyee commented Apr 15, 2024

[Quoting @angelahzyuan's reply above.]

So, if I understand correctly: if I use a human-annotated DPO dataset as the "real vs. generated" dataset and keep the loss function as log-sigmoid, can I say that iteration-0 SPIN is equivalent to the DPO method?

@Labmem009

[Quoting @angelahzyuan's reply above.]

I wonder whether SPIN could replace SFT, or whether it must start from an SFT (not base) model, serving as a stage between SFT and DPO?
