How to avoid loading the complete in-memory dataset in every process during DDP training

I’m working in an environment that has regular HDDs, shared amongst many users. I/O performance is too poor to simply read and parse data on the fly, so I have to load my data in memory.

I have a single node with 4 GPUs (node resources are not shared; the underlying storage is). When training in DDP mode, each process loads the entire dataset into memory. This works for my current dataset, but it won’t for larger ones, and I’d like to avoid it anyway since each process only uses a subset of the data and the rest is redundant. The dataset preparation is done directly in the LightningModule (without explicitly using a LightningDataModule).

From my understanding of things, the following should solve this problem:

  • Disable the automatic DistributedSampler in the Trainer by setting replace_sampler_ddp=False (sketched below).
  • Pass local rank information to the Dataset so each process loads only its particular shard.
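
The first part seems straightforward; as far as I can tell it’s just a Trainer flag, something like this (assuming a 1.x-style Trainer API, the exact accelerator argument may differ by version):

import pytorch_lightning as pl

# Step 1: stop Lightning from wrapping my DataLoader in a DistributedSampler,
# since each process would already hold a disjoint shard of the data.
trainer = pl.Trainer(
    gpus=4,
    accelerator="ddp",          # assuming a 1.x-style Trainer API
    replace_sampler_ddp=False,  # keep the plain sampler on each process
)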

So my questions are:

  1. How do I achieve the above, i.e. getting rank information of the process in the LightningModule and passing it on to my dataset object?
  2. Is there a better way to do this using existing pytorch-lightning components?

Solved. All relevant information can be found in the environment variables set by pytorch-lightning when the DDP processes are launched.

Specifically, do the following in your dataset/LightningModule definition:

import os

# Lightning sets these for every DDP process it launches; note that os.environ
# values are strings, so cast the numeric ones before using them.
env_cp = os.environ.copy()
node_rank = int(env_cp['NODE_RANK'])
local_rank = int(env_cp['LOCAL_RANK'])
world_size = int(env_cp['WORLD_SIZE'])

# Extra flags Lightning sets about the DDP launch itself.
is_in_ddp_subprocess = env_cp['PL_IN_DDP_SUBPROCESS']
pl_trainer_gpus = env_cp['PL_TRAINER_GPUS']

Using this info I was able to write sharding logic that loads a specific subset of the data into memory for each DDP process (rough sketch below).
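
In case it helps, here is a rough sketch of what the sharding logic looks like; the class name (ShardedInMemoryDataset), the file layout, and the data path are placeholders, not my exact code:

import glob
import os

import torch
from torch.utils.data import DataLoader, Dataset


class ShardedInMemoryDataset(Dataset):
    """Holds only this process's 1/world_size slice of the data in memory."""

    def __init__(self, sample_files, rank, world_size):
        # Stride over the full file list so the shards are disjoint and roughly
        # equal in size; only this process's shard is ever read into memory.
        self.samples = [torch.load(f) for f in sample_files[rank::world_size]]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


# Rank / world size come from the env vars Lightning sets for each DDP process.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

sample_files = sorted(glob.glob("data/*.pt"))  # placeholder data layout
dataset = ShardedInMemoryDataset(sample_files, local_rank, world_size)

# With replace_sampler_ddp=False this loader is used as-is: no DistributedSampler
# is added on top, so nothing re-splits the shard that is already in memory.
loader = DataLoader(dataset, batch_size=32, shuffle=True)

Note that on a single node LOCAL_RANK coincides with the global rank; for multi-node training you’d derive the global rank from NODE_RANK, LOCAL_RANK, and the number of GPUs per node so each process gets a unique shard.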

Leaving this here as it might be of help to someone.


Hey @sigmoid_amidst_relus (cool name btw)

Would it be possible to share the code behind the “sharding logic”, specifically how you set up the DataLoaders?
I believe PL uses torch.utils.data.distributed.DistributedSampler under the hood in a multi-GPU setup.
Did you override that as well?
If not, wouldn’t that possibly cause an index conflict?
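
To make the concern concrete, here’s a toy illustration of what I mean (rank 0 of 2 with a 5-sample shard already in memory; the numbers are made up):

import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Pretend this is the shard that rank 0 already loaded into memory.
shard = TensorDataset(torch.arange(5))

# If a DistributedSampler were still applied on top of the shard, rank 0 would
# only ever iterate over part of the data it went to the trouble of loading.
sampler = DistributedSampler(shard, num_replicas=2, rank=0, shuffle=False)
print(list(sampler))  # -> [0, 2, 4]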

Cheers