I have a script like this:

```python
trainer = Trainer(distributed_backend="ddp", gpus=2, ...)
model = Model(...)
trainer.fit(model)
trainer.test(model)
```
When I launch it, it hangs after `fit` and never reaches `test`, or it errors with the message "Address already in use". What is the problem?
Question/problem originally from https://github.com/PyTorchLightning/pytorch-lightning/issues/3327
You cannot run `trainer.test` after `trainer.fit` (or multiple `trainer.fit`/`trainer.test` calls in general) in ddp mode; this only works with ddp_spawn. You need to either:

- remove the `trainer.test` call,
- move the `trainer.test` call to a separate test script, or
- choose ddp_spawn (which has its own limitations).
This is simply a limitation of multiprocessing and a tradeoff between ddp and ddp_spawn.
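The tradeoff comes from how the worker processes are created: ddp re-launches the training script itself in subprocesses, so a second stage in the same script collides with the already-launched copies, while ddp_spawn starts fresh workers from the running interpreter and joins them when each stage finishes, returning control to the parent. A minimal stdlib sketch of that spawn-and-join pattern (the `_worker` and `run_stage` names are purely illustrative, not Lightning APIs):

```python
import multiprocessing as mp

def _worker(rank, stage, queue):
    # Stand-in for one DDP worker running a fit() or test() loop.
    queue.put((rank, stage))

def run_stage(stage, world_size=2):
    """Launch workers and join them, the way ddp_spawn runs each stage."""
    queue = mp.Queue()
    procs = [mp.Process(target=_worker, args=(rank, stage, queue))
             for rank in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # control returns to the parent, so another stage can follow
    return sorted(queue.get() for _ in procs)

if __name__ == "__main__":
    print(run_stage("fit"))   # [(0, 'fit'), (1, 'fit')]
    print(run_stage("test"))  # works, because fit's workers were joined
```

Because each stage joins its workers before returning, running "fit" and then "test" sequentially in one script is unproblematic, which is exactly what the script-relaunching ddp mode could not do.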
More information in this section towards the bottom
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-data-parallel
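The "Address already in use" error from the original question is the OS-level symptom: the relaunched script tries to bind the same DDP rendezvous port while the first process group still holds it. A library-free sketch of that same behavior (the `demo_address_in_use` helper is hypothetical, for illustration only):

```python
import errno
import socket

def demo_address_in_use():
    """Bind the same TCP port twice, as two overlapping DDP launches would."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    first.bind(("127.0.0.1", 0))  # let the OS pick a free port
    port = first.getsockname()[1]
    first.listen(1)               # port is now held, like a live process group
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        second.bind(("127.0.0.1", port))
    except OSError as exc:
        return exc.errno          # EADDRINUSE: "Address already in use"
    finally:
        second.close()
        first.close()
    return None

print(demo_address_in_use() == errno.EADDRINUSE)  # True on Linux/macOS
```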
dilip (October 14, 2020):
Is this possible now in 1.0?
Yes, it should be working.
The docs for 1.0.6 still say it’s not possible to run both .fit and .test in ddp mode, are they just out of date?
The ddp example also seems to be running them both.
Could you point us to an issue or PR that addressed this?
Yes, it's out of date, sorry about that.
We solved some of these bigger challenges, like running fit and test multiple times sequentially.
I can't find the exact PR that solved this, but it must be somewhere among these:
https://github.com/PyTorchLightning/pytorch-lightning/pulls?page=1&q=is%3Apr+ddp+is%3Aclosed+author%3AwilliamFalcon