I’ve been using Lightning recently, and one of the most exciting parts has been how easy it makes using DDP as a distributed backend. I have a question about how common hyperparameters behave under DDP versus base PyTorch: how do the effective batch size and learning rate change under the hood?
It seems to me that the batch size passed to a Lightning(Data)Module is used on every GPU in DDP, resulting in an effective global batch size of n_gpus * batch_size. This seems to be corroborated by the recent SimCLR code release (thanks for that!). Is this correct? For instance, I seem to get the same performance with batch_size=256 in base PyTorch DDP as with batch_size=32 in Lightning on 8 GPUs.
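To make the setup concrete, here is a minimal sketch of what I mean (the DataModule, the toy dataset, and the commented-out Trainer flags are placeholders of mine, and the exact Trainer arguments may differ across Lightning versions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class ToyDataModule(pl.LightningDataModule):
    def __init__(self, batch_size: int = 32):
        super().__init__()
        self.batch_size = batch_size

    def setup(self, stage=None):
        # Placeholder dataset: 1024 random samples.
        self.train_set = TensorDataset(
            torch.randn(1024, 16), torch.randint(0, 2, (1024,))
        )

    def train_dataloader(self):
        # My understanding: under DDP, Lightning injects a DistributedSampler,
        # so each of the 8 processes draws its own batch_size=32 batches from
        # a distinct shard -> effective global batch = 8 * 32 = 256.
        return DataLoader(self.train_set, batch_size=self.batch_size)

# trainer = pl.Trainer(gpus=8, distributed_backend="ddp")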
The learning rate is less clear to me. It seems it may need to be divided by the number of GPUs, per https://github.com/untitled-ai/self_supervised. Could you clarify whether this is the case, and why this behavior happens?
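For reference, here is a sketch of the two conventions I’ve come across (the helper names are mine, not from Lightning or the repo above):

```python
def linear_scaled_lr(base_lr: float, n_gpus: int) -> float:
    # Linear scaling rule (Goyal et al., 2017): since DDP grows the
    # effective batch to n_gpus * batch_size, scale the LR up to match.
    return base_lr * n_gpus

def divided_lr(base_lr: float, n_gpus: int) -> float:
    # What the repo above appears to do instead: divide the LR by the
    # number of replicas.
    return base_lr / n_gpus

# e.g. with a base LR tuned for a single GPU at batch_size=32:
# linear_scaled_lr(0.1, 8) -> 0.8, divided_lr(0.1, 8) -> 0.0125
```

These pull in opposite directions, which is why I’d appreciate clarification on which one Lightning expects.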