How to run Trainer.fit() and Trainer.test() in DDP distributed mode

I have a script like this

from pytorch_lightning import Trainer

trainer = Trainer(distributed_backend="ddp", gpus=2, ...)
model = Model(...)
trainer.fit(model)
trainer.test(model)

When I launch it, it either hangs after fit() and never reaches test(), or it fails with the error
“Address already in use”.
What is the problem?

Question/Problem from here
https://github.com/PyTorchLightning/pytorch-lightning/issues/3327

You cannot run trainer.test after trainer.fit (or multiple trainer.fit/trainer.test calls in general) in ddp mode.
This only works with ddp_spawn. You need to either

  1. remove the trainer.test call,
  2. move the trainer.test call to a separate test script (see the sketch below), or
  3. choose ddp_spawn (but it has its own limitations).

This is simply a limitation of multiprocessing and a tradeoff between ddp and ddp_spawn: roughly, in ddp mode the Trainer re-launches the whole script in subprocesses, so a second fit/test call tries to set up the distributed processes again on the same address and port, which is where the “Address already in use” error comes from.
More information in this section towards the bottom:
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-data-parallel
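For option 2, a minimal sketch of what such a separate test script could look like. Note that my_project, Model, and the checkpoint path are placeholders standing in for the poster's own module, LightningModule class, and saved checkpoint:

# test.py -- run in a separate invocation after the training script has finished
# "my_project", "Model", and the checkpoint path below are hypothetical placeholders
from pytorch_lightning import Trainer
from my_project import Model

model = Model.load_from_checkpoint("path/to/checkpoint.ckpt")
trainer = Trainer(distributed_backend="ddp", gpus=2)
trainer.test(model)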

Is this possible now in 1.0?

Yes, it should be working.
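For reference, here is a self-contained sketch of the original pattern (fit followed by test in one script under ddp), assuming a 1.0-era API; BoringModel and the random dataset are hypothetical stand-ins for the poster's Model and data:

import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning import LightningModule, Trainer

class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self(x), y)

    def test_step(self, batch, batch_idx):
        x, y = batch
        self.log("test_loss", torch.nn.functional.cross_entropy(self(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

def make_loader():
    # random data standing in for a real dataset
    x = torch.randn(64, 32)
    y = torch.randint(0, 2, (64,))
    return DataLoader(TensorDataset(x, y), batch_size=8)

if __name__ == "__main__":
    # entry-point guard: required for spawn-based backends and good practice
    # with ddp, where the script gets re-launched for the extra GPU processes
    model = BoringModel()
    trainer = Trainer(distributed_backend="ddp", gpus=2, max_epochs=1)
    trainer.fit(model, make_loader())
    trainer.test(model, make_loader())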


The docs for 1.0.6 still say it's not possible to run both .fit and .test in ddp mode. Are they just out of date?
The ddp example also seems to run them both.
Could you point us to an issue or PR that addressed this?


Yes, it's out of date, sorry about that.
We solved some of these bigger challenges, like running fit and test multiple times sequentially.
I can't find the exact PR that solved this, but it must be somewhere among these:
https://github.com/PyTorchLightning/pytorch-lightning/pulls?page=1&q=is%3Apr+ddp+is%3Aclosed+author%3AwilliamFalcon

Great, thanks a lot!