update universal_checkpointing/README.md #395

Open: wants to merge 1 commit into base: main


@inkcherry inkcherry commented Jun 3, 2024

Is my understanding correct? Because of compatibility issues with PP, PP sometimes has to be disabled, for example when using ds-SP. This means that some weights from earlier runs that relied on PP cannot be used directly.

@samadejacobs

@inkcherry, UCP supports converting PP checkpoints to and from other parallelism topologies (ZeRO-DP, SP, TP, etc.). However, training with the SP/PP combination, with or without UCP, has not been tested.

@inkcherry
Author

Thank you for the explanation, @samadejacobs. Yes, I believe using SP without PP would be more stable, so I tried the following:

  1. Trained a model with PP (without --no-pipeline-parallel, from some past workloads).
  2. Converted the checkpoint with UCP.
  3. Fine-tuned the model without PP (with --no-pipeline-parallel, with ds-SP).

However, I encountered a crash at step 2: the weight names have changed, so UCP does not work in this case, as mentioned in the documentation change. :)
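
To illustrate the kind of mismatch described above, here is a minimal sketch of a name comparison that could be run before conversion. It is not part of UCP or DeepSpeed. The helper `diff_param_names` and the parameter names in the example are hypothetical and chosen only for illustration; the real names depend on the model and the checkpoint layout.

```python
def diff_param_names(source_names, target_names):
    """Return (only_in_source, only_in_target) as sorted lists of names."""
    source, target = set(source_names), set(target_names)
    return sorted(source - target), sorted(target - source)


# Hypothetical parameter names, for illustration only.
# Assumption: the PP checkpoint prefixes names with a layer index,
# while the non-PP model nests them under a module path.
pp_names = ["3.mlp.dense.weight", "3.mlp.dense.bias"]
no_pp_names = [
    "encoder.layers.0.mlp.dense.weight",
    "encoder.layers.0.mlp.dense.bias",
]

only_in_pp, only_in_no_pp = diff_param_names(pp_names, no_pp_names)

# Non-empty lists mean the two name sets do not line up, so a
# name-based conversion or load would need a mapping step first.
print("only in PP checkpoint:", only_in_pp)
print("only in non-PP model:", only_in_no_pp)
```

If both lists come back empty, the two sets of names agree. Otherwise the output shows which names would need to be remapped.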


2 participants