DistributedSampler and LightningDataModule

When creating data loaders for DDP training, is it OK to set the DistributedSampler myself in the LightningDataModule when instantiating the DataLoader?

Something like the following:

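import pytorch_lightning as pl
from torch.utils.data import DataLoader, DistributedSampler

class MyData(pl.LightningDataModule):
    # train_dataloader() takes no stage argument (stage belongs to setup())
    def train_dataloader(self):
        return DataLoader(
            self.trainset,  # assumed to be assigned earlier, e.g. in setup()
            sampler=DistributedSampler(self.trainset, shuffle=True),
        )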

The Multi-GPU docs recommend not using DistributedSampler explicitly. In my normal workflow I implement LightningDataModule.train_dataloader() to provide the trainer with my dataloader, so it makes sense to me to set the DistributedSampler explicitly when instantiating the data loader. However, this contradicts the advice given in the docs, hence my question.

Thanks in advance.

hey @avilay

When you enable DDP in the Lightning Trainer, it automatically adds a DistributedSampler internally, so you don't need to add one yourself. The docs recommend this to minimize code changes if you later migrate from DDP to single-device training, since an explicit DistributedSampler in train_dataloader would then have to be removed. If you still want to keep it, you can set Trainer(replace_sampler_ddp=False) (renamed to use_distributed_sampler=False in Lightning 2.0) so that Lightning does not add its own distributed sampler when running in DDP mode.
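
For context on what the sampler does in either case: a distributed sampler gives each process its own slice of the dataset indices so that ranks do not train on the same samples. Below is a rough pure-Python sketch of that index split, assuming shuffle=False and drop_last=False. The function name shard_indices is made up for illustration and is not a PyTorch or Lightning API.

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Hypothetical helper: approximate the per-rank index split of a
    distributed sampler (no shuffling, no dropping of the tail)."""
    indices = list(range(dataset_len))
    # Every rank must get the same number of samples, so round up.
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    # Pad by repeating indices from the start until the list divides evenly.
    padding = total_size - len(indices)
    indices += (indices * math.ceil(padding / len(indices)))[:padding]
    # Each rank takes every num_replicas-th index, offset by its rank.
    return indices[rank:total_size:num_replicas]


if __name__ == "__main__":
    for rank in range(2):
        print(rank, shard_indices(5, 2, rank))
```

With 5 samples and 2 processes, one index is repeated as padding so both ranks see 3 samples each. This is why wrapping a loader that already has such a sampler a second time is unnecessary.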

Also, we have moved the discussions to GitHub Discussions. You might want to check that out instead to get a quick response. The forums will be marked read-only after some time.

Thank you