Effective learning rate and batch size with Lightning in DDP

I’ve been using Lightning recently, and one of the most exciting parts of it has been how easy it makes using DDP as a distributed backend. I had a question about the behavior of common hyperparameters when using DDP, compared with base PyTorch. How do the effective batch size and learning rate change under the hood?

It seems to me that the batch size passed to a Lightning(Data)Module is used on every GPU in DDP, resulting in an effective global batch size of n_gpus*batch_size. This seems to be corroborated by the recent SimCLR code release (thanks for that!) Is this correct? For instance, I seem to get the same performance using batch_size=256 in base PyTorch DDP as batch_size=32 in Lightning with 8 GPUs.

The learning rate is less clear to me. It seems that it could be divided by the number of GPUs per https://github.com/untitled-ai/self_supervised - could you clarify this, and why this behavior happens?

Great question! As you mention, when you use DDP over N GPUs, your effective batch_size is (N x batch size). After summing the gradients from each GPU, DDP divides the gradients by N, so the effective learning rate would be learning_rate / N.

so should we set lr=learning_rate or lr=learning_rate*N in configure_optimizers if using DDP backend??

@teddy It seems to me then that to run with the same effective batch size and learning rate as on 1 gpu on N gpus, we should divide the input batch_size by N and multiply learning_rate by N. Is there any situation in which I wouldn’t want to do this scaling? Why is this the default behavior? How do things differ between PyTorch and Lightning with regards to this, and why?
@goku I think if I understand correctly we should actually be setting lr=learning_rate*N.

yeah, my bad it should be learning_rate*N. fixed it.

Also for the batch_size, you should set it to the value you want on single device because pl uses this batch_size on each device so effective will be batch_size*N. Not sure about learning_rate yet.
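To make the bookkeeping concrete, here is a minimal sketch of the arithmetic described above. It assumes one process per device and no gradient accumulation; the helper name effective_batch_size is made up for illustration and is not a Lightning API.

```python
def effective_batch_size(per_device_batch_size: int, num_devices: int) -> int:
    """Global number of samples per optimizer step under DDP.

    Assumes every process loads `per_device_batch_size` samples and that
    there is no gradient accumulation (hypothetical helper, not a Lightning API).
    """
    return per_device_batch_size * num_devices


# batch_size=32 given to the dataloader, 8 GPUs
print(effective_batch_size(32, 8))  # 256
```

This matches the observation in the first post that batch_size=32 on 8 GPUs behaves like a global batch of 256.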

what do you mean by this statement?

@sm000 this is the default behavior in torch.nn.parallel, which Lightning wraps. I believe this is the default behavior so that one can increase/decrease the number of gpus without having to worry about changing hyperparameters (as learning rate should ideally be changed inversely to batch-size).

A possible feature could be to have some sort of effective_learning_rate or effective_batch_size.

expanding on why batch sizes are “handled differently” in distributed training:

DDP is special in that it spawns multiple parallel processes that train independently, each with its own data. We want to keep the batch_size as the user has set it in their dataloaders, because that’s what each GPU will see. Compare this to DP: there the batch gets split into N pieces (scatter) and, after the forward pass, collected again (gather), so all GPUs work from the same batch. That does not scale very well, because as you add more GPUs you have to recompute the batch size to get the desired learning behaviour while still fitting the data into memory. With DDP you don’t have that problem; as @teddy already explained, it makes hyperparameter tuning easier when scaling to more devices.

proportionally, because the larger the batch size, the more accurate is your estimate of the gradient over the entire data. that’s my understanding.

A good rule of thumb with regards to batch size and learning rate can be found in “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour” (https://arxiv.org/pdf/1706.02677.pdf)

As we will show in comprehensive experiments, we found that the following learning rate scaling rule is surprisingly effective for a broad range of minibatch sizes:
Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.
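The rule quoted above can be written as a one-line helper. This is only a sketch: the function name and the example numbers are illustrative, not taken from the paper’s code.

```python
def linear_scaled_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear Scaling Rule: if the minibatch grows by a factor k,
    multiply the learning rate by the same factor k."""
    k = new_batch_size / base_batch_size
    return base_lr * k


# Illustrative values: a learning rate tuned for batch 256, reused at batch 1024
print(linear_scaled_lr(0.1, 256, 1024))  # 0.4
```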

thanks, that’s helpful. Though I’m still a bit confused - it seems like I’d have to modify hyperparameters more, since to get the same (global) behavior in ddp as in single-gpu training I need to divide the batch_size I specify and multiply the learning_rate I specify by N. Empirically, naively leaving both the same and trying ddp doesn’t seem to be effective.

As a concrete code example, please see the link I posted above - as described in their README, to reproduce MoCo’s results, they multiply the paper’s learning rate, 0.03, by 8 to get 0.24, and divide the batch_size, 256, by 8 to get 32. (If you don’t do this, the results are substantially worse.)

I also had some confusion about what you mentioned with learning rate vs batch size - my impression is that we should increase the learning rate when we use a larger effective batch size, for example by the linear scaling rule (section 2.1 of https://arxiv.org/pdf/1706.02677.pdf) or by the square root (as in the recent Lightning SimCLR implementation). But this behavior reduces learning rate, so we end up needing to compensate twice?

(@teddy posted his comment about linear scaling while I was writing this, glad we’re thinking along similar lines :slight_smile: )

@sm000 I agree that this can be confusing if you are trying to reproduce the results of a paper. If the paper states it uses a batch size of 64 with a learning rate of 0.01, but you can only fit a per-gpu batch size of 8 (with 8 gpus), you must provide a learning rate of 0.01 * 8 = 0.08 to your optimizer.

On the other hand, if you already have a model that trains well on a single gpu with a given batch size and learning rate, you can likely increase the number of gpus without changing any hyperparameters as your effective learning rate and batch size will change linearly. I believe this is why this is the default behavior.

The extra issue here is that for contrastive methods such as SimCLR that rely heavily upon negative sampling, batch size plays a much bigger role than it would for most other tasks, and I am not sure if there is a good rule to follow when it comes to hyperparameter selection for scaling.

I realized one point of confusion stems from the official PyTorch DDP example code (for ImageNet) - it turns out they manually scale batch_size to batch_size/n_gpus_per_node when using DDP with 1 GPU per process (https://github.com/pytorch/examples/blob/master/imagenet/main.py#L151), recommending

# When using a single GPU per process and per
# DistributedDataParallel, we need to divide the batch size
# ourselves based on the total number of GPUs we have

Only by doing this do they make switching backends equivalent, and the official MoCo code follows from that (it’s just a fork of that example https://github.com/facebookresearch/moco/blob/master/main_moco.py#L174). I agree that not scaling is default behavior for nn.parallel, but this yields substantially different behavior across backends. I’m wondering if users of Lightning switching between different distributed backends and expecting similar behavior could be thrown off by this (I definitely have been!), since scaling seems to be necessary to match behavior across backends.
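As a rough sketch of what that comment describes (not a copy of the example’s code), the division looks like the following. The helper name split_batch_size is hypothetical, and the truncation to an integer is an assumption about how uneven divisions would be handled.

```python
def split_batch_size(global_batch_size: int, ngpus_per_node: int) -> int:
    """Per-process batch size when each process drives a single GPU.

    Hypothetical helper: the requested batch size is treated as the total
    for the node and divided across its GPUs (truncating any remainder).
    """
    return int(global_batch_size / ngpus_per_node)


# A requested batch of 256 on an 8-GPU node gives each process 32 samples
print(split_batch_size(256, 8))  # 32
```

In Lightning the batch_size you pass is already the per-process value, so doing this division yourself is what lines the two setups up.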

On the other hand, I’m still relatively confused about the learning rate. @teddy mentioned

you can likely increase the number of gpus without changing any hyperparameters as your effective learning rate and batch size will change linearly

but it seems that they’ll change in the wrong directions - we want learning rate to increase with batch size, but effective batch size goes up with n_gpus and effective learning rate goes down with n_gpus if it’s being divided. (Actually, looking further into the PyTorch source link, it seems like the division is only being done to make the sum an average, keeping effective learning rate the same per https://discuss.pytorch.org/t/should-we-split-batch-size-according-to-ngpu-per-node-when-distributeddataparallel/72769/4 - any thoughts on this?) To compensate would we need to multiply learning rate by n_gpus**2? Actually, in the ImageNet example they don’t seem to mess with the learning rate at all, so is it all accounted for as long as we scale the batch size per gpu? This behavior actually seems to be different in Lightning, where the link in my first comment notes you need to multiply by n_gpus. I’m just hypothesizing, but could there be something different about how in the ImageNet example code they call the optimizer once on model.parameters after DDP vs how configure_optimizers works?

I realize this is kind of a long post (tried to err on the side of providing too much info rather than too little), sorry and thanks again for the input.

Only by doing this do they make switching backends equivalent, and the official MoCo code follows from that.

I agree with this completely. If you would like to maintain the same effective batch size across backends you will need to say batch_size = batch_size / n_gpus.

With regards to learning rate I believe both the ImageNet and MoCo implementations are not correctly backend agnostic. The MoCo repository claims “similar results” with half the gpus, half effective batch size, and half given learning rate (which means essentially the same effective rate, but smaller batch size). A .5x change in batch size with the same learning rate will likely not change much, so I am not surprised they are able to get similar results in this manner.

I’m still a bit confused. In DDP, the backward pass is done on every device and the gradients are synced afterwards, so each device uses the batch_size assigned in the dataloader, and learning_rate should be set for batch_size rather than batch_size*N. In DP, however, the backward pass is done on batch_size*N on a single device, so should we set learning_rate=learning_rate*N there?

I’m also still kind of confused, along similar lines to what @goku said. I’m starting to think the effective learning rate is dependent on the local batch size, rather than the effective/cumulative one (see https://discuss.pytorch.org/t/should-we-split-batch-size-according-to-ngpu-per-node-when-distributeddataparallel/72769/4 linked from before). That is, does the effective learning rate really change given the way averaging is done for DDP? This link suggests it does not, but I’m finding it difficult to parse.

In an interesting twist, I asked the authors of the Lightning Moco code repo I’d linked above why they scaled the learning rate and they said that actually, this scaling was needed in Lightning 0.7.1 but is no longer needed - i.e., they had to use 0.03*8=0.24 before for 8 gpus, but now 0.03 works (apparently). Any idea what could’ve changed between then and now?

Regarding the Lightning Moco repo code, it makes sense that they now use the same learning rate as the official Moco repository, as both use DDP. Each model now has a per-gpu batch size of 32, and a per-gpu learning rate of 0.03. Not sure what changed since 0.7.1, maybe @williamfalcon has some insight.

Now let’s say you wanted to train the same model on one gpu, with a batch size of 256. Now you would have to adjust your learning rate to be 0.03 / 8 = 0.00375. Why is this?

Let’s say we have an effective batch of 256 images that produces a gradient g, and we have a learning rate lr.

For the case of 8 gpus, the per-gpu gradient becomes g/8 and our parameter delta is lr x g/8.

For the case of 1 gpu, the per-gpu gradient is now just g and our parameter delta is lr x g.

If we want to make these parameter deltas consistent, we would either have to divide our learning rate by 8 for the single gpu case, or multiply the learning rate by 8 for the multi gpu case. Which of these options we choose depends on the situation. In the case of MoCo, they show that a learning rate of 0.03 works when the per-gpu batch size is 32, so we have to work backwards to find that a learning rate of lr / 8 should be used if training the full 256-image batch on one gpu.
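The arithmetic in this argument can be checked with a few lines. This sketch simply takes the post’s premise at face value (each of the 8 GPUs applies g/8, a single GPU applies g); whether DDP’s averaged gradient really behaves this way is exactly what the rest of the thread debates. The value of g is an arbitrary placeholder.

```python
def parameter_delta(lr: float, grad: float) -> float:
    """Size of a plain SGD step: learning rate times gradient."""
    return lr * grad


n_gpus = 8
g = 1.0          # placeholder gradient from the 256-image batch (illustrative)
lr_ddp = 0.03    # learning rate reported to work with 32 images per GPU

# Premise from the post: per-GPU gradient is g / 8 under DDP
delta_ddp = parameter_delta(lr_ddp, g / n_gpus)

# Single GPU sees the full gradient g, so the learning rate is divided by 8
lr_single = lr_ddp / n_gpus
delta_single = parameter_delta(lr_single, g)

print(lr_single, delta_ddp, delta_single)
```

Under that premise the two steps come out the same size, which is where the 0.03 / 8 figure comes from.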

So in case of DDP one should set it to lr (specific to per-gpu batch_size) but in case of DP it should be set to lr*N since backward is done on single gpu right??

And same in case of TPU(8 cores training) as that of DDP since it’s basically DDP?

Thanks for the example @teddy, that matches my understanding for the case where per-GPU batch size changes but effective batch size stays constant. This has been a really helpful discussion. (Currently I’m basically thinking along the lines of what @goku said.) I just wanted to clarify a couple of minor things.

To get this straight: in this example, they’re using half the effective batch size with half the GPUs, so their per-GPU batch size (what each process sees) stays the same (256/8 = 128/4 = 32) since they’re scaling it. The given learning rate for DDP (unlike DP) should correspond to the per-GPU (given) batch size if I understand correctly now. So why would half the given learning rate mean the same effective learning rate? To keep things the same, should we keep the same learning rate? Is it correct to say that when we apply the linear scaling heuristic for DDP, this is with respect to the per-GPU batch size, not the effective batch size?

Why would one want to scale the batch size given to DDP (like the ImageNet example/Moco do) by the number of GPUs? My understanding is Lightning chooses not to scale for the user since the other hyperparameters (learning rate in particular) mostly vary with the per-GPU/given batch size and not the effective batch size.

I believe they should in fact keep the same learning rate in this case, since the per-GPU batch size is the same. They may have overlooked this, but as I mentioned before a 2x change in learning rate will probably not change much.

The only reason I can think to scale the batch size is to be consistent with a paper. If the paper reports a batch size of 256 but you can only use 32, it may appear cleaner to still say you have a batch size of 256, but over 8 gpus.

Exactly. In this way you can scale up without having to change any hyperparameters.

Great, thanks again for the clarification! Cheers.

  1. Consider the MSE loss, for example. It is typically computed by averaging the per-sample MSE, namely
    1/N sum_i(MSE_i)
    so in DDP the accumulated gradients must be scaled by 1/N to mimic the same behavior.

  2. As far as I know, learning rate is scaled with the batch size so that the sample variance of the gradients is kept approximately constant. In a sense, the larger the batch, the more we expect the gradient averaged over the samples to point in the correct direction. Since Var(aX) = a^2 Var(X) for any constant a != 0, we typically scale the learning rate by sqrt(effective_batch_size/baseline_batch_size).
    See the example I gave here
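Both points can be illustrated with a small pure-Python sketch. It assumes a toy one-parameter model y ~ w*x with a mean-reduced MSE loss and equally sized shards; the data, the function names, and the learning-rate values are all made up for illustration and do not use DDP itself.

```python
import math
import random


def mean_grad(w, xs, ys):
    """Gradient w.r.t. w of the mean squared error for the model y ~ w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def sqrt_scaled_lr(base_lr, baseline_batch_size, effective_batch_size):
    """Square-root scaling from point 2 (hypothetical helper)."""
    return base_lr * math.sqrt(effective_batch_size / baseline_batch_size)


random.seed(0)
n_gpus, per_gpu = 8, 32
xs = [random.uniform(-1.0, 1.0) for _ in range(n_gpus * per_gpu)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]
w = 0.5

# One device, full batch of 256, mean-reduced loss
full_batch_grad = mean_grad(w, xs, ys)

# Eight simulated ranks, each with its own shard of 32 samples
shard_grads = [
    mean_grad(w, xs[r * per_gpu:(r + 1) * per_gpu], ys[r * per_gpu:(r + 1) * per_gpu])
    for r in range(n_gpus)
]
summed = sum(shard_grads)      # what a plain sum across ranks would give
averaged = summed / n_gpus     # sum across ranks, then scale by 1/N

print(full_batch_grad, averaged, summed)
print(sqrt_scaled_lr(0.1, 256, 1024))
```

With equal shard sizes the averaged value agrees with the full-batch gradient up to floating-point rounding, while the plain sum is N times larger, which is why the 1/N scaling in point 1 is needed.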