CUDA OOM while initializing DDP

Hey everyone,

I am trying to train a model on our lab's GPU server, but I am running into a strange issue: I get a CUDA OOM error when I try to train the model with this trainer configuration:

trainer = pl.Trainer(
        max_epochs=10,
        gpus=[2, 3],
        accelerator="ddp",
        precision=16,
        callbacks=callbacks,
        progress_bar_refresh_rate=20,
        deterministic=True,
        prepare_data_per_node=False)

This also happens if I set gpus=2 and auto_select_gpus=True. The server has 10 GPUs (pretty powerful ones, too), and nvidia-smi shows that several of them are idle (the ones I select manually are definitely free).
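
For completeness, the auto_select_gpus variant I tried looks roughly like this (same callbacks as above, everything else unchanged):

trainer = pl.Trainer(
        max_epochs=10,
        gpus=2,                    # ask for any 2 GPUs instead of fixed indices
        auto_select_gpus=True,     # let Lightning pick unoccupied devices
        accelerator="ddp",
        precision=16,
        callbacks=callbacks,
        progress_bar_refresh_rate=20,
        deterministic=True,
        prepare_data_per_node=False)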

The issue also happens with other models (for instance, the GAN in PL Bolts). In particular, it happens when running the script that can be found here, with the following CLI arguments:

python main.py --gpus 2 --accelerator ddp --auto_select_gpus --data_dir "data"

I think the exception happens during the DDP setup, and the output of my script (stack trace included) is as follows:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:45: UserWarning: you passed in a val_dataloader but have no validation_step. Skipping validation loop
  warnings.warn(*args, **kwargs)
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /home/edoardo.debenedetti/.data/cifar-10-python.tar.gz
100%|█████████████████████████████████████████████████████████████████████████████████████▊| 170172416/170498071 [00:07<00:00, 28156544.60it/s]Extracting /home/edoardo.debenedetti/.data/cifar-10-python.tar.gz to /home/edoardo.debenedetti/.data
Files already downloaded and verified
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
  File "gans_mia_unlearning/architectures/gan.py", line 213, in <module>
    dm, model, trainer = cli_main()
  File "gans_mia_unlearning/architectures/gan.py", line 208, in cli_main
    trainer.fit(model, dm)
  File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
    self.init_ddp_connection(
  File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
    torch_distrib.init_process_group(
  File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: CUDA error: out of memory
170500096it [00:14, 11915539.63it/s]
Traceback (most recent call last):
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/gans_mia_unlearning/architectures/gan.py", line 213, in <module>
    dm, model, trainer = cli_main()
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/gans_mia_unlearning/architectures/gan.py", line 208, in cli_main
    trainer.fit(model, dm)
  File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
    self.init_ddp_connection(
  File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
    torch_distrib.init_process_group(
  File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/edoardo.debenedetti/.pyenv/versions/pytorch-miniconda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: Broken pipe

On the other hand, if I use DDP the plain PyTorch way (as in PyTorch’s GAN guide here), I get no such exception.
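
The bare-PyTorch version boils down to something like this (a rough sketch, not my exact script; the mp.spawn call and the training loop are omitted, and the master address/port values are just placeholders):

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp(rank, world_size, model):
    # Rough sketch of the manual DDP setup from the PyTorch examples.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    # Bind this process to its own GPU before creating the process group,
    # so NCCL does not allocate a context on a device that is already busy.
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = model.cuda(rank)
    return DDP(model, device_ids=[rank])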

I also tried the Boring Model (https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report_model.py) and I get the same issue there. Moreover, DP works fine.
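
For the record, the working DP run uses the same Trainer configuration with only the accelerator changed, roughly:

trainer = pl.Trainer(
        max_epochs=10,
        gpus=[2, 3],
        accelerator="dp",          # only change with respect to the failing config
        precision=16,
        callbacks=callbacks,
        progress_bar_refresh_rate=20,
        deterministic=True,
        prepare_data_per_node=False)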

Do you think this is a problem on my side, or with the workstation configuration?

Thanks in advance!

UPDATE: I opened an issue here, since the behavior is pretty weird and it could be a bug.