Distributed training using DDP: how to add a node

trainer = Trainer(gpus=8, distributed_backend='ddp', num_nodes=4)

How do I add my machine as a node?

@awaelchli, any thoughts here?

@Liuhaidong is basically asking how to build and set up a cluster. I certainly don’t know all the details here :slight_smile:

But for a start, you need at least a master node and one other node, connected through a fast, high-bandwidth network. The network needs to be set up properly, and the cluster needs to be managed with SLURM (or another cluster management system).
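To make the node/GPU arithmetic concrete, here is a minimal sketch in plain Python (not Lightning internals; the names are illustrative) of how the process layout follows from the Trainer arguments above, i.e. gpus=8 per node and num_nodes=4:

```python
# Illustrative sketch: DDP launches one process per GPU on every node.
gpus_per_node = 8
num_nodes = 4

world_size = gpus_per_node * num_nodes  # total number of processes: 32

def global_rank(node_rank: int, local_rank: int) -> int:
    """Global rank of the process driving GPU `local_rank` on node `node_rank`."""
    return node_rank * gpus_per_node + local_rank

print(world_size)         # 32
print(global_rank(0, 0))  # 0 -> the rank-0 process lives on the master node
print(global_rank(3, 7))  # 31 -> the last process, on the last GPU of the last node
```

Every node runs the same script; only the node rank differs, and the global rank of each process is derived from it as above.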

This is basically the high-level knowledge I have. By no means do I know how to do all these steps :)) I suggest you start googling and doing the research. But to be clear, all of this has nothing to do with Lightning. Lightning assumes you have all of that in place, and once you do, it only needs the MASTER_PORT and MASTER_ADDR environment variables to schedule a training job with SLURM.
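As a rough illustration of that last point, each node needs those environment variables set before training starts. This is a hedged sketch, not Lightning internals; the address, port, and node rank values are placeholders you would replace with your own (SLURM normally sets these for you):

```python
import os

# Placeholder values -- substitute your cluster's actual master IP and a free port.
os.environ["MASTER_ADDR"] = "192.168.1.10"  # reachable IP/hostname of the master node
os.environ["MASTER_PORT"] = "29500"         # same free port on all nodes
os.environ["NODE_RANK"] = "0"               # 0 on the master, 1..num_nodes-1 elsewhere

print(os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"])
```

All nodes must agree on MASTER_ADDR and MASTER_PORT so their processes can rendezvous; only NODE_RANK differs per machine.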