I need to use a custom DDP implementation. With Lightning, the API and docs are unclear about whether I have to extend LightningDistributedDataParallel or whether I can extend torch DistributedDataParallel directly.
The docs suggest configure_ddp should work with torch DistributedDataParallel: https://pytorch-lightning.readthedocs.io/en/latest/lightning-module.html#configure-ddp
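For context, this is roughly what I'd like to do, following the configure_ddp hook from that page (the (model, device_ids) signature and the find_unused_parameters flag are my assumptions based on the docs, not something I've verified end to end):

```python
from torch.nn.parallel import DistributedDataParallel
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    # ... model definition, training_step, validation_step, etc. ...

    def configure_ddp(self, model, device_ids):
        # Wrap with plain torch DDP (or my own DistributedDataParallel
        # subclass) instead of LightningDistributedDataParallel.
        model = DistributedDataParallel(
            model,
            device_ids=device_ids,
            find_unused_parameters=True,
        )
        return model
```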
However, there are spots in Lightning that rely on isinstance checks against the Lightning-specific overrides: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/model_connector.py#L31-L34
LightningDistributedDataParallel also forwards calls to training_step/validation_step/test_step. Is that forwarding a requirement for custom DDP implementations used with Lightning?
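If subclassing the Lightning override is in fact required, is something like this the expected pattern? (The import path is just what I see in my current version; super().forward is my guess at how to keep the step-forwarding behavior intact.)

```python
from pytorch_lightning.overrides.data_parallel import LightningDistributedDataParallel

class MyCustomDDP(LightningDistributedDataParallel):
    def forward(self, *inputs, **kwargs):
        # Custom communication/gradient logic would go here.
        # Calling super().forward keeps Lightning's behavior of routing
        # calls to training_step/validation_step/test_step.
        return super().forward(*inputs, **kwargs)
```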