Using Hydra + DDP

Hello all,

I use hydra in conjunction with PL and I’m having a blast coding stuff :slight_smile:

However, I ran into some issues that I have no clue how to solve:

  • I’m trying to run some multiple_gpu training script using DDP from PTL as it is the recommended accelerator for multi-GPU.
  • I have configured hydra such that the run dir is "data/runs/${now:%Y-%m-%d_%H-%M-%S}"

What happens in the case I have at least 2 GPUs is the following:

  • First process A creates the ckpt and logs folders inside the folder that is created first (let’s denote it by “1”)
  • Second process B looks for the ckpt and logs folders in the second folder (i.e. “2”) but cannot find them.

As a result:

  1. I get the following warning Missing logger folder: "1"/logs
  2. I get the following error when running a fast_dev_run at test time: [Errno 2] No such file or directory: "2"/ckpts/best.ckpt
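A minimal sketch of why the two folders differ, assuming each DDP process re-runs the script and resolves the ${now:...} interpolation at its own start time. Note that `resolve_run_dir` is a hypothetical stand-in for that interpolation, not a Hydra function, and the start times are made up for illustration:

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d_%H-%M-%S"


def resolve_run_dir(now: datetime) -> str:
    """Hypothetical stand-in for resolving data/runs/${now:%Y-%m-%d_%H-%M-%S}."""
    return "data/runs/" + now.strftime(FMT)


# Assumed start times: process B starts a few seconds after process A.
start_a = datetime(2021, 1, 15, 10, 30, 0)
start_b = start_a + timedelta(seconds=3)

dir_a = resolve_run_dir(start_a)  # folder "1"
dir_b = resolve_run_dir(start_b)  # folder "2"

print(dir_a)
print(dir_b)
print("same folder:", dir_a == dir_b)
```

With a timestamp that has one-second resolution, any start delay of a second or more is enough for the two processes to disagree on the folder.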

I believe this is related to the fact that DDP is not meant to work when there is a nested script without a root package (see Multi-GPU training — PyTorch Lightning 1.1.3 documentation). Can someone confirm I fall into this category?

Does someone see a workaround besides losing the ability to save hydra outputs using the ${now} date formatting?

Note that the DP mode works fine, but it requires refactoring the code a bit to use the <mode>_step_end methods and is supposedly slower.

In case it helps, I will put together a minimal working example reproducing the issue.

Have a nice day :slight_smile:

The problem here seems to be that hydra is creating 2 run dirs, one for each process. It would seem to me that the solution would be to create your own run dirs. Using the following in your config, you can set the current dir as run dir and disable creating new subdirs:

  hydra:
    output_subdir: null # Disable saving of config files. We'll do that ourselves.
    run:
      dir: . # Set working dir to current directory

Inside PL, I create new logging dirs from the rank 0 process (there’s convenience funcs available in PL). For a concrete example, take a look at my repo: GitHub - Shreeyak/pytorch-lightning-segmentation-template: Semantic Segmentation on the LaPa dataset using Pytorch Lightning

This is a function to generate the path to the log dir (run dir): pytorch-lightning-segmentation-template/ at 064e13ca0f7606af2928bb62dfc713ae7c23b277 · Shreeyak/pytorch-lightning-segmentation-template · GitHub

And it is created within a custom logging callback. You can modify it to create the dir within the main script (but only from rank 0 process).
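As a rough sketch of doing this from the main script rather than a callback: the first process picks the run dir, stores it in an environment variable, and creates the folders only if it is rank 0. This assumes the other DDP processes are launched as children that inherit the environment, and that the function is called before the Trainer starts them. The variable name `EXPERIMENT_RUN_DIR` and both helper functions are hypothetical, not part of PL or Hydra:

```python
import os
from datetime import datetime
from pathlib import Path
from typing import MutableMapping

RUN_DIR_ENV = "EXPERIMENT_RUN_DIR"  # hypothetical variable name


def is_global_zero(env: MutableMapping[str, str]) -> bool:
    """True for the 0th process on the 0th node (unset ranks count as 0)."""
    return int(env.get("LOCAL_RANK", 0)) == 0 and int(env.get("NODE_RANK", 0)) == 0


def get_run_dir(base: str, env: MutableMapping[str, str] = os.environ) -> Path:
    """Return one run dir shared by all processes that see the same env."""
    if RUN_DIR_ENV not in env:
        # Only the first process gets here; children reuse the stored value.
        stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        env[RUN_DIR_ENV] = str(Path(base) / stamp)
    run_dir = Path(env[RUN_DIR_ENV])
    if is_global_zero(env):
        # Create the folders from rank 0 only, to avoid races between processes.
        (run_dir / "ckpts").mkdir(parents=True, exist_ok=True)
        (run_dir / "logs").mkdir(parents=True, exist_ok=True)
    return run_dir
```

The returned path can then be handed to the logger and the checkpoint callback, so every process points at the same ckpts and logs folders.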


Thank you @shreeyak for your suggestion and pointers.
So far I’ve decided to manually set the Hydra run dir myself, so that the ckpts and logging folders are shared among the DDP processes.
I am not too familiar with the rank 0 process related callbacks but I am willing to study it more and come back with a better solution granted I can find one.
Have a nice day :slight_smile:

I put together a minimal example on this repository: GitHub - inzouzouwetrust/PL-Hydra-template: PyTorch Lightning + Hydra template to use DDP that reproduces the aforementioned issue whenever I’m using 2+ GPUs in DDP mode and not setting the Hydra run dir manually.
I will take a closer look at your custom logging callback @shreeyak.

EDIT: Note that I updated the repository linked above into an actual Minimal Working Example, so it no longer uses actual data but dummy data instead.
It reproduces the issue whether I use the latest versions (requirements_latest.txt) or my own environment (requirements.txt).

EDIT 2: The repository linked above is no longer useful. Take a look at this one instead: GitHub - inzouzouwetrust/PL-Hydra-DDP-bug: Bug when using PL DDP with Hydra
See the associated issue here

Just to give a high-level overview: because DDP launches a separate process for each GPU, certain tasks, such as reading or writing the same file, should not be executed on all processes, to avoid errors. So, we only perform them on the rank 0 process.
In PL, there is a function, rank_zero_only, that can also be used as a decorator. Any method with this decorator will only execute on the rank 0 process:

from pytorch_lightning.utilities import rank_zero_only

    # Excerpt from a logger class; the typing names come from `typing` and `argparse`.
    @rank_zero_only
    def log_hyperparams(self, params: Union[Dict[str, Any], Namespace],
                        metrics: Optional[Dict[str, Any]] = None) -> None:
        ...

You could also just check for rank 0 yourself:

import os

# Global rank 0 is the 0th process on the 0th node
if int(os.environ.get('LOCAL_RANK', 0)) == 0 and int(os.environ.get('NODE_RANK', 0)) == 0:
    ...  # do something
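The decorator idea and the manual check can be combined into a small plain-Python decorator. This is only a sketch for illustration: `only_on_global_zero` and `write_summary` are hypothetical names, it is not PL's implementation of rank_zero_only, and it assumes the LOCAL_RANK and NODE_RANK environment variables are how the ranks are exposed:

```python
import functools
import os
from typing import Any, Callable, Optional


def only_on_global_zero(fn: Callable[..., Any]) -> Callable[..., Optional[Any]]:
    """Hypothetical decorator: call fn only when LOCAL_RANK and NODE_RANK are 0."""

    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Optional[Any]:
        # Ranks are read at call time; unset variables are treated as rank 0.
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        node_rank = int(os.environ.get("NODE_RANK", 0))
        if local_rank == 0 and node_rank == 0:
            return fn(*args, **kwargs)
        return None  # every other process skips the call

    return wrapper


@only_on_global_zero
def write_summary(path: str) -> str:
    # Placeholder for a task that must not run on every process.
    return f"would write to {path}"
```

Because the ranks are read when the function is called rather than when it is decorated, the same decorated function behaves correctly even if the launcher sets the rank variables after import.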

If one is using a callback for some task, then the callback will generally use the @rank_zero_only decorator and perform the task during the setup or pretrain period.

Anyway, it looks like your error might be an actual bug. If so, feel free to refer to my repo for a temporary alternative (set the hydra run dir to the current directory, manually create a new logging dir for this run, and pass that directory to the ModelCheckpoint callback).