[build] allow MPI on Unix when NCCL is disabled #21175

stefantalpalaru · 2024-06-25T23:38:27Z

Description

CMake logic fixed to allow enabling MPI while NCCL is disabled.

Motivation and Context

MPI is also used on the CPU backend, not only with CUDA, so it makes sense to decouple it properly from NCCL (which is for dealing with multiple Nvidia GPUs).

snnn · 2024-06-26T02:24:47Z

I thought we no longer use MPI (see #17624). Do we?

wejoncy · 2024-06-26T03:06:18Z

MPI is not a hard requirement for multi-GPU setups (Nvidia or AMD).
Hi @stefantalpalaru, what was the case when MPI is required for CPU backend? Is there a real scenario in your case?

stefantalpalaru · 2024-06-26T07:29:57Z

> What was the case when MPI is required for CPU backend?

https://github.com/microsoft/onnxruntime/blob/main/orttraining/orttraining/core/framework/adasum/adasum_mpi.cc

onnxruntime/orttraining/orttraining/training_ops/communication_common.h, line 107 in e2abba1: #ifdef USE_MPI

https://github.com/microsoft/onnxruntime/tree/main/orttraining/orttraining/core/framework/communication/mpi

onnxruntime/orttraining/orttraining/python/orttraining_pybind_state.cc, line 205 in e2abba1: #if defined(USE_MPI)

onnxruntime/orttraining/orttraining/core/session/training_session.cc, line 355 in e2abba1: #ifdef USE_MPI

onnxruntime/orttraining/orttraining/training_ops/cpu/communication/recv.cc, line 3 in e2abba1: #if defined(USE_MPI)

onnxruntime/orttraining/orttraining/training_ops/cpu/cpu_training_kernels.cc, line 108 in e2abba1: #ifdef USE_MPI

onnxruntime/orttraining/orttraining/models/bert/main.cc, line 595 in e2abba1: #if defined(USE_MPI)

onnxruntime/orttraining/orttraining/training_ops/cpu/communication/send.h, line 3 in e2abba1: #if defined(USE_MPI)

onnxruntime/orttraining/orttraining/models/gpt2/main.cc, line 315 in e2abba1: #if defined(USE_MPI)

> Is there a real scenario in your case?

No, I don't need to target the CPU device on my machine.

I was packaging this software for a Gentoo overlay and I noticed USE_MPI does not enable MPI, due to what is clearly a logic error in the CMake configuration, hence the fix.
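To illustrate why the two options are worth decoupling, here is a minimal C++ sketch. It is not taken from the ONNX Runtime sources: `EnabledCommBackends` is a hypothetical helper, `USE_NCCL` is an assumed macro name for the NCCL switch, and the `#define USE_MPI` line only stands in for what the build system would pass as a compile definition. It shows that code guarded by `USE_MPI` alone has no dependency on the NCCL guard, so a build that sets one and not the other is meaningful.

```cpp
#include <string>
#include <vector>

// Stand-in for the build system passing -DUSE_MPI (assumption for this sketch).
// USE_NCCL is deliberately left undefined to model "MPI on, NCCL off".
#ifndef USE_MPI
#define USE_MPI 1
#endif

// Hypothetical helper, not part of ONNX Runtime: lists the communication
// backends compiled into this translation unit.
inline std::vector<std::string> EnabledCommBackends() {
  std::vector<std::string> backends;
#ifdef USE_MPI
  backends.push_back("mpi");   // guarded by USE_MPI only
#endif
#ifdef USE_NCCL
  backends.push_back("nccl");  // guarded by the (assumed) NCCL macro only
#endif
  return backends;
}
```

With only `USE_MPI` defined, the helper reports a single backend, which is the configuration the CMake fix is meant to make reachable.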

wejoncy · 2024-06-26T13:47:45Z

> I was packaging this software for a Gentoo overlay and I noticed USE_MPI does not enable MPI, due to what is clearly a logic error in the CMake configuration, hence the fix.

It seems like MPI mostly targets ort-training. Hi @pengwa, do you have any suggestions?

[build] allow MPI on Unix when NCCL is disabled

249e9e5

snnn added the "training" label (issues related to ONNX Runtime training; typically submitted using template) Jun 26, 2024

snnn requested review from wejoncy and wschin June 26, 2024 02:24
