How to use multi-training without slurm system? #458
If it is just on a single node you can use the interface described in the documentation: https://mace-docs.readthedocs.io/en/latest/guide/multigpu.html (see the single-node section).
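Before following the single-node section, it can help to confirm how many GPUs PyTorch actually sees — a quick sanity check (not MACE-specific) before passing `--distributed` to `run_train.py`:

```python
import torch

# Quick sanity check: confirm how many GPUs PyTorch can see
# before enabling multi-GPU training.
n_gpus = torch.cuda.device_count()  # 0 on a CPU-only machine
print(f"visible GPUs: {n_gpus}")
mode = "multi-GPU (--distributed)" if n_gpus > 1 else "single device"
print("suggested mode:", mode)
```

If this prints fewer GPUs than you expect, the problem is with the CUDA setup (drivers, `CUDA_VISIBLE_DEVICES`), not with MACE.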
Thanks for your reply. I followed the tutorial to change my code, but I ran into a new problem.
I'm running
and the config.yaml
You should comment out the
I modified the MACE package's slurm_distributed.py (located under /opt/miniconda/envs/mace/lib/python3.9/site-packages/mace/tools), and it can run now.
But I suspect multi-GPU training is not actually enabled?
I found #143 and wanted to solve the problem, so I reinstalled the version that has huggingface support, but it still can't run.
It also runs out of memory.
I also tested training on 4 GPUs, but a similar problem happened.
I created a modified train script for this which doesn't use the whole See here. Essentially it comes down to setting the required environment variables manually:

```python
import argparse
import os

import torch
import torch.multiprocessing as mp

from mace import tools


def main() -> None:
    """Run the training/fine-tuning for MACE, spawning one process per GPU."""
    args = tools.build_default_arg_parser().parse_args()
    if args.distributed:
        world_size = torch.cuda.device_count()
        mp.spawn(run, args=(args, world_size), nprocs=world_size)
    else:
        run(0, args, 1)


def run(rank: int, args: argparse.Namespace, world_size: int) -> None:
    """Run the training/fine-tuning for MACE on a single rank."""
    tag = tools.get_tag(name=args.name, seed=args.seed)
    if args.distributed:
        # The SLURM-based setup is commented out and replaced by
        # manually set rendezvous variables:
        # distr_env = DistributedEnvironment()
        # world_size = distr_env.world_size
        # local_rank = distr_env.local_rank
        # rank = distr_env.rank
        # if rank == 0:
        #     print(distr_env)
        # torch.distributed.init_process_group(backend="nccl")
        local_rank = rank
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "12355"
        torch.cuda.set_device(rank)
        torch.distributed.init_process_group(
            backend="nccl",
            rank=rank,
            world_size=world_size,
        )
    else:
        pass
```
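The core trick in the patch above — setting `MASTER_ADDR`/`MASTER_PORT` by hand and calling `init_process_group` with an explicit rank and world size — can be exercised without any GPU. This is a minimal, self-contained sketch using the CPU `gloo` backend and a single rank (the real script uses `nccl` and `torch.cuda.device_count()` ranks):

```python
import os

import torch
import torch.distributed as dist

# Manual rendezvous variables, as in the patched script (no SLURM needed).
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "12355")

# world_size=1 and backend="gloo" make this runnable on a plain CPU box;
# multi-GPU MACE training would use backend="nccl" and one rank per GPU.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.ones(3)
dist.all_reduce(t)  # a no-op with one rank, but exercises the group
print("initialized:", dist.is_initialized(), "sum:", t.sum().item())

dist.destroy_process_group()
```

If this sketch fails with the same error as the full script, the problem is in the process-group setup (address, port, backend) rather than in MACE itself.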
Hello, I used your method to change the code, but some errors happened that I can't understand.
torch also reports an error.
I don't think this is a port problem, because even if I switch to a port that nobody has used before, I still get the same error.
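One way to rule the port in or out is to test whether it can actually be bound before launching training. This is a generic stdlib check, not part of MACE:

```python
import socket


def port_is_free(port: int, host: str = "localhost") -> bool:
    """Try to bind the port; if the bind succeeds, the port is free."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False


# Check the MASTER_PORT used in the patched script.
print("port 12355 free:", port_is_free(12355))
```

If the port is reported free but `init_process_group` still fails with an "address already in use" error, the script is most likely being started twice (two rendezvous servers racing for the same port), which points back at the launch method rather than the port number.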
Sounds like you're starting it twice. Make sure to use
Thanks for your reply.
CUDA also runs out of memory.
I am using two 4090 GPUs.
Can you share your new log file? It does not seem to be using the two GPUs.
Thanks, dear ilyes, here is my log file.
Hello, dear ilyes, I would like to run the single-machine multi-GPU version of MACE (with multiple 4090s), but everything I have tried so far has failed. Can you explain in detail why the dual-GPU run did not succeed?
Hello dear developers, I run this script:

```
python /root/mace/scripts/run_train.py --name="MACE_model" \
    --train_file="train.xyz" \
    --valid_fraction=0.05 \
    --test_file="test.xyz" \
    --config_type_weights='{"Default":1.0}' \
    --model="MACE" \
    --hidden_irreps='128x0e + 128x1o' \
    --r_max=5.0 \
    --batch_size=10 \
    --energy_key="energy" \
    --forces_key="forces" \
    --max_num_epochs=100 \
    --swa \
    --start_swa=80 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --restart_latest \
    --device=cuda \
```

But my computer has two 4090 GPUs, and I have not installed Slurm, so this error occurred:

```
ERROR:root:Failed to initialize distributed environment: 'SLURM_JOB_NODELIST'
```

How can I solve this problem?
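The error is a `KeyError` on `SLURM_JOB_NODELIST`, an environment variable that only exists inside a SLURM job. A hedged sketch of the workaround idea (the exact variable names MACE's `DistributedEnvironment` reads may differ; `SLURM_PROCID` and `SLURM_NTASKS` are the standard SLURM ones) is to fall back to single-node defaults when the SLURM variables are absent:

```python
import os

# Fall back to localhost defaults when not running under SLURM.
# Note: SLURM_JOB_NODELIST is a nodelist (e.g. "node[01-02]"), not an
# address; on a single workstation "localhost" is the right fallback.
master_addr = os.environ.get("SLURM_JOB_NODELIST", "localhost")
rank = int(os.environ.get("SLURM_PROCID", "0"))
world_size = int(os.environ.get("SLURM_NTASKS", "1"))

os.environ.setdefault("MASTER_ADDR", master_addr)
os.environ.setdefault("MASTER_PORT", "12355")
print(master_addr, rank, world_size)
```

This is the same idea as the modified `run_train.py` shared earlier in the thread: supply the rendezvous information yourself instead of letting the SLURM helper crash on a missing variable.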