
[build] allow MPI on Unix when NCCL is disabled #21175

Open
wants to merge 1 commit into
base: main


stefantalpalaru

Description

CMake logic fixed to allow enabling MPI while NCCL is disabled.

Motivation and Context

MPI is also used on the CPU backend, not only with CUDA, so it makes sense to decouple it properly from NCCL (which is for dealing with multiple Nvidia GPUs).
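
As a rough illustration of the decoupling described here, a minimal CMake sketch might look like the following. It is not this PR's diff and not the actual onnxruntime build code: the project name, the option names `USE_NCCL` and `USE_MPI`, the compile definition, and the commented-out "coupled" form are all assumptions made for illustration. Only `find_package(MPI)` and its `MPI_CXX_FOUND` result variable come from CMake's standard FindMPI module.

```cmake
# Hypothetical sketch only; names below are illustrative assumptions,
# not the real onnxruntime options or the change made in this PR.
cmake_minimum_required(VERSION 3.12)
project(mpi_nccl_sketch CXX)

option(USE_NCCL "Build with NCCL support" OFF)
option(USE_MPI  "Build with MPI support"  OFF)

# Coupled form (the kind of logic error being described): MPI is only
# looked up inside the NCCL branch, so USE_MPI=ON is silently ignored
# whenever USE_NCCL=OFF.
#
# if (UNIX AND USE_NCCL)
#   ... locate NCCL ...
#   if (USE_MPI)
#     find_package(MPI)
#   endif()
# endif()

# Decoupled form: MPI is gated by its own option, independent of NCCL.
if (UNIX AND USE_MPI)
  find_package(MPI)  # FindMPI module shipped with CMake
  if (MPI_CXX_FOUND)
    add_compile_definitions(USE_MPI=1)
  else()
    message(WARNING "MPI was requested but not found; disabling it")
    set(USE_MPI OFF)
  endif()
endif()
```

With this shape, configuring with `-DUSE_MPI=ON -DUSE_NCCL=OFF` on a Unix host still runs the MPI lookup, which is the behavior the description asks for.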

@snnn
Member

snnn commented Jun 26, 2024

I thought we no longer use MPI (#17624). Do we?

snnn added the "training" label (issues related to ONNX Runtime training; typically submitted using template) on Jun 26, 2024
snnn requested review from wejoncy and wschin on Jun 26, 2024, 02:24
@wejoncy
Contributor

wejoncy commented Jun 26, 2024

MPI is not a hard requirement for multi-GPU setups (NVIDIA or AMD).
Hi @stefantalpalaru, in what case is MPI required for the CPU backend? Is there a real scenario in your case?

@stefantalpalaru
Author

> In what case is MPI required for the CPU backend?

https://github.com/microsoft/onnxruntime/blob/main/orttraining/orttraining/core/framework/adasum/adasum_mpi.cc

https://github.com/microsoft/onnxruntime/tree/main/orttraining/orttraining/core/framework/communication/mpi

> Is there a real scenario in your case?

No, I don't need to target the CPU device on my machine.

I was packaging this software for a Gentoo overlay and I noticed USE_MPI does not enable MPI, due to what is clearly a logic error in the CMake configuration, hence the fix.

@wejoncy
Contributor

wejoncy commented Jun 26, 2024

> I was packaging this software for a Gentoo overlay and I noticed USE_MPI does not enable MPI, due to what is clearly a logic error in the CMake configuration, hence the fix.

It seems MPI mostly targets ort-training. Hi @pengwa, do you have any suggestions?

Labels
training issues related to ONNX Runtime training; typically submitted using template

3 participants