I have a script like this:

```python
trainer = Trainer(distributed_backend="ddp", gpus=2, ...)
model = Model(...)
trainer.fit(model)
trainer.test(model)
```
When I launch it, it hangs after `fit` and never reaches `test`, or it errors with the message "Address already in use". What is the problem?
Question/problem originally from https://github.com/PyTorchLightning/pytorch-lightning/issues/3327
You cannot run `trainer.test` after `trainer.fit` (or multiple `trainer.fit`/`trainer.test` calls in general) in ddp mode; this only works with ddp_spawn. You need to either:

- remove the `trainer.test` call,
- move the `trainer.test` call to a separate test script, or
- choose ddp_spawn (which has its own limitations).
This is simply a limitation of multiprocessing and a tradeoff between ddp and ddp_spawn.
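The tradeoff comes from how the worker processes are created: ddp re-launches the training script itself in subprocesses, so a second stage in the same script collides with the already-launched copies, while ddp_spawn starts fresh workers from the running interpreter and joins them when each stage finishes, returning control to the parent. A minimal stdlib sketch of that spawn-and-join pattern (the `_worker` and `run_stage` names are purely illustrative, not Lightning APIs):

```python
import multiprocessing as mp

def _worker(rank, stage, queue):
    # Stand-in for one DDP worker running a fit() or test() loop.
    queue.put((rank, stage))

def run_stage(stage, world_size=2):
    """Launch workers and join them, the way ddp_spawn runs each stage."""
    queue = mp.Queue()
    procs = [mp.Process(target=_worker, args=(rank, stage, queue))
             for rank in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # control returns to the parent, so another stage can follow
    return sorted(queue.get() for _ in procs)

if __name__ == "__main__":
    print(run_stage("fit"))   # [(0, 'fit'), (1, 'fit')]
    print(run_stage("test"))  # works, because fit's workers were joined
```

Because each stage joins its workers before returning, running "fit" and then "test" sequentially in one script is unproblematic, which is exactly what the script-relaunching ddp mode could not do.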
More information in this section towards the bottom
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-data-parallel
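The "Address already in use" error from the original question is the OS-level symptom: the relaunched script tries to bind the same DDP rendezvous port while the first process group still holds it. A library-free sketch of that same behavior (the `demo_address_in_use` helper is hypothetical, for illustration only):

```python
import errno
import socket

def demo_address_in_use():
    """Bind the same TCP port twice, as two overlapping DDP launches would."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    first.bind(("127.0.0.1", 0))  # let the OS pick a free port
    port = first.getsockname()[1]
    first.listen(1)               # port is now held, like a live process group
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        second.bind(("127.0.0.1", port))
    except OSError as exc:
        return exc.errno          # EADDRINUSE: "Address already in use"
    finally:
        second.close()
        first.close()
    return None

print(demo_address_in_use() == errno.EADDRINUSE)  # True on Linux/macOS
```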
dilip (October 14, 2020):
Is this possible now in 1.0?
Yes, it should be working.
The docs for 1.0.6 still say it’s not possible to run both .fit and .test in ddp mode, are they just out of date?
The ddp example also seems to be running them both.
Could you point us to an issue or PR that addressed this?
Yes, it's out of date, sorry about that.
We solved some of these bigger challenges, like running fit and test multiple times sequentially.
I can't find the exact PR that solved this, but it must be somewhere among these:
https://github.com/PyTorchLightning/pytorch-lightning/pulls?page=1&q=is%3Apr+ddp+is%3Aclosed+author%3AwilliamFalcon