Multi-GPU training issue - DDP strategy. Training hangs upon distributed GPU initialisation

Anshu_Garg · February 6, 2022, 12:27pm

Hello everyone. Initially, I trained my model in a single-GPU environment, and it worked perfectly fine. Now I have increased the GPUs to 2 and the number of nodes to 2 (strategy "ddp"), following all the instructions from this:
https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#replace-sampler-ddp

But I am getting the following issue (this is logged in Slurm's error log):
Downloading: "https://…" to /root/.cache/torch/hub/checkpoints/…
Downloading: "https://…" to /root/.cache/torch/hub/checkpoints/…

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4

This is logged in my Slurm output file:
NOTE! Installing ujson may make loading annotations faster.
NOTE! Installing ujson may make loading annotations faster.

After this, the process just hangs: it gives no error and does not terminate until I kill it. Can someone please suggest what could be wrong here? I have read all the threads on multi-GPU errors, but no one has raised this issue. I cannot work out what is wrong in my code.

Some information regarding my model:
The code downloads a checkpoint from the internet and loads it into the model.
I have used torch.utils.data.DistributedSampler.
I download the data in the setup() function of the PyTorch Lightning DataModule.
I am initialising the trainer like this:
Trainer(gradient_clip_val=args.clip_max_norm, max_epochs=args.epochs,
        gpus=args.gpus, strategy="ddp", replace_sampler_ddp=False,
        num_nodes=args.num_nodes, default_root_dir=args.output_path,
        logger=TensorBoardLogger(save_dir=args.output_path, name=args.name))

Please let me know if any further information is needed. I would like to thank everyone in advance.
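
For reference, the last log line can be read against the Trainer arguments. Assuming args.gpus and args.num_nodes are both 2, as described in the post, DDP expects one process per GPU, so 4 in total, and "MEMBER: 1/4" suggests that only the first of them has reported in. The sketch below just spells out that arithmetic. The helper names are hypothetical, and the string format is copied from the log above, not taken from Lightning's source.

```python
def expected_world_size(num_nodes: int, gpus_per_node: int) -> int:
    """Number of processes DDP expects: one per GPU across all nodes."""
    return num_nodes * gpus_per_node


def member_line(global_rank: int, world_size: int) -> str:
    # Format mirrors the log line in the post; it is an illustration only.
    return f"GLOBAL_RANK: {global_rank}, MEMBER: {global_rank + 1}/{world_size}"


# Assumed values: 2 nodes with 2 GPUs each, as in the post.
world_size = expected_world_size(num_nodes=2, gpus_per_node=2)
for rank in range(world_size):
    print(member_line(rank, world_size))
```

In a healthy run one would expect to see a line for every rank; seeing only rank 0 is consistent with the other processes never having been started.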

soumickmj · January 17, 2023, 4:05pm

Hi @Anshu_Garg
I’m also facing the same issue!
Did you manage to find any solution or workaround?
Thanks

soumickmj · January 17, 2023, 4:28pm

Hi (again!) @Anshu_Garg
Not sure if you are still facing this issue, but the solution is to add "srun" before "python" inside the bash script that you submit with "sbatch".
cf. Run on an on-prem cluster (advanced) — PyTorch Lightning 1.9.0 documentation
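
One way to check whether the fix above took effect is to print the Slurm task variables at the very top of the training script. This is a minimal sketch, not part of Lightning: the function name describe_slurm_task is hypothetical, and it assumes the standard Slurm environment variables SLURM_JOB_ID, SLURM_NTASKS, SLURM_PROCID, SLURM_LOCALID and SLURM_NODEID. Under the same assumption of a 2-node, 2-GPU job with one task per GPU, launching with "srun python train.py" (train.py being a placeholder name) should produce one line per task, each with a different SLURM_PROCID, whereas launching with plain "python train.py" would be expected to produce a single line.

```python
import os
import socket
from typing import Mapping, Optional

# Standard Slurm variables that identify a task within a job.
SLURM_KEYS = (
    "SLURM_JOB_ID",
    "SLURM_NTASKS",
    "SLURM_PROCID",
    "SLURM_LOCALID",
    "SLURM_NODEID",
)


def describe_slurm_task(env: Optional[Mapping[str, str]] = None) -> str:
    """Return one line describing this process and its Slurm task variables."""
    env = os.environ if env is None else env
    parts = [f"{key}={env.get(key, '<unset>')}" for key in SLURM_KEYS]
    return f"host={socket.gethostname()} pid={os.getpid()} " + " ".join(parts)


# Print immediately so the line appears even if initialisation hangs later.
print(describe_slurm_task(), flush=True)
```

If fewer lines appear in the Slurm output file than the number of processes DDP expects, the remaining ranks were never launched, which would explain a hang at distributed initialisation.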

awaelchli · January 18, 2023, 1:28am

@soumickmj Glad you found this out!
Yes, this is a common mistake. In the latest version of Lightning, we show a warning that the launch command is not correct and suggest using srun instead.