
[Feature] Support DPO, ORPO and Reward Model #743

Merged: 46 commits into InternLM:main on Jun 13, 2024

Conversation

@RangiLyu (Contributor) commented on Jun 3, 2024

DPO, ORPO, and Reward Model Training

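For context, a minimal sketch of the DPO objective these features train against; the tensor shapes and the beta value are illustrative assumptions, not code from this PR:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each input is a per-sample sum of token log-probs, shape (batch,).
    # beta scales the implicit KL penalty against the frozen reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()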

Feature List

  • Efficient training by packing chosen/rejected pairs into single sequences when flash attention is enabled, which significantly reduces the GPU memory wasted on padding tokens in batched training.
  • Unified dataset format across DPO, ORPO, and Reward Model training (a format and packing sketch follows this list).
  • Support for QLoRA training
  • Support for sequence parallel training
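
A hedged sketch of the unified preference sample referenced above; the field names are illustrative assumptions and may differ from the exact schema this PR lands:

# Hypothetical unified preference sample shared by DPO, ORPO, and
# reward-model training. Field names are illustrative assumptions.
sample = {
    "prompt": [
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "chosen": [
        {"role": "assistant", "content": "The capital of France is Paris."},
    ],
    "rejected": [
        {"role": "assistant", "content": "It might be Lyon."},
    ],
}

Under packing, the tokenized chosen and rejected responses would be concatenated into one sequence, with cumulative-length offsets telling variable-length flash attention where each segment begins, so no padding tokens are needed.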

Tasks Pending

  • Documentation
  • Convert the reward model to Hugging Face format
  • Validation dataset

@RangiLyu changed the title from [WIP][Feature] Support DPO, ORPO and Reward Model to [Feature] Support DPO, ORPO and Reward Model on Jun 3, 2024
@pppppM requested review from HIT-cwh and hhaAndroid on Jun 3, 2024 at 09:30
max_length = 2048

# Scheduler & Optimizer
batch_size = 4 # per_device
Collaborator commented:

All the others I've seen are 1; why is this one 4?
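
For context on that question, a hedged sketch of how the per-device batch size feeds into the effective batch; accumulative_counts and world_size below are assumptions about a typical distributed run, not values from this PR:

# Hypothetical effective-batch computation; every value except
# batch_size = 4 is an illustrative assumption.
batch_size = 4            # per-device, as in the snippet above
accumulative_counts = 1   # gradient accumulation steps (assumed)
world_size = 8            # number of GPUs (assumed)
effective_batch = batch_size * accumulative_counts * world_size
print(effective_batch)    # 32 sequences per optimizer step

With pair packing enabled, each sequence can hold several chosen/rejected pairs, so the number of preference pairs per optimizer step can be larger still.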

@pppppM merged commit a607fa3 into InternLM:main on Jun 13, 2024
1 of 3 checks passed