DDP on 2 GPUs: No rendezvous handler for env://

jimtorch · January 28, 2021, 8:09am

I am testing a model with Lightning, and it has been working fine with 1 GPU. After I added a 2nd GPU today, however, the following error occurred
(with gpus=2, distributed_backend='ddp' added to pl.Trainer):

raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for env://

I am on Windows 10, PyTorch 1.7.1, pytorch_lightning 1.1.4, CUDA 11.0.

How should I fix or work around this problem?
Thanks!

jimtorch · February 5, 2021, 11:37am

I fixed this problem myself; it requires hacking ddp_plugin.py.
Basically, you need to use the gloo backend and create a local rendezvous file instead of relying on env://.
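
For anyone hitting the same error, here is a rough sketch of what a gloo + file rendezvous setup can look like in plain PyTorch. It is not the exact patch to ddp_plugin.py (that was not posted in this thread); the helper names file_init_method and init_gloo_process_group and the file name ddp_rendezvous are made up for illustration. It assumes every process on the machine can see the same local file and that a stale file from a previous run is deleted before the processes start.

```python
import os
import tempfile


def file_init_method(rendezvous_file):
    """Build a file:// init_method URL from a local path (hypothetical helper).

    On Windows this gives something like file:///C:/Users/me/ddp_rendezvous,
    on Linux something like file:///tmp/ddp_rendezvous.
    """
    path = os.path.abspath(rendezvous_file).replace(os.sep, "/")
    if not path.startswith("/"):
        # Windows drive paths (C:/...) need a leading slash after file://
        path = "/" + path
    return "file://" + path


def init_gloo_process_group(rank, world_size, rendezvous_file):
    """Initialise torch.distributed with gloo and a shared file (hypothetical helper)."""
    # Imported lazily so the URL helper above can be used without PyTorch.
    import torch.distributed as dist

    dist.init_process_group(
        backend="gloo",  # used here instead of NCCL, which is not available on Windows
        init_method=file_init_method(rendezvous_file),  # replaces the default env://
        rank=rank,
        world_size=world_size,
    )


if __name__ == "__main__":
    # Only prints the URL; each DDP process would call init_gloo_process_group
    # with its own rank and the same rendezvous file.
    print(file_init_method(os.path.join(tempfile.gettempdir(), "ddp_rendezvous")))
```

Where exactly this call has to go inside Lightning's ddp_plugin.py depends on the Lightning version, so treat the above as a starting point rather than a drop-in fix.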

carlomarxdk · March 3, 2021, 9:31pm

@jimtorch what exactly did you change? I am in a similar situation and I have no idea what to do.
Thanks in advance.