Error resuming from checkpoint with multiple GPUs

I started training a model on two GPUs, using the following trainer:

    trainer = pl.Trainer(
        devices=[0, 2], accelerator='gpu', precision=16, max_epochs=2000,
        callbacks=checkpoint_callback, logger=pl.loggers.TensorBoardLogger('logs/'),
        gradient_clip_val=5.0, gradient_clip_algorithm='norm')

This is set to save the best three checkpoints (based on the validation loss) and the last epoch:

    checkpoint_callback = ModelCheckpoint(
        monitor='val_loss',  # whichever validation-loss metric is logged
        mode='min', save_top_k=3, save_last=True)

Training halted unexpectedly and I now want to resume it, which I did by configuring my trainer as follows:

    trainer = pl.Trainer(devices=[2,0], accelerator="gpu", precision=16, max_epochs=2000,
        callbacks=checkpoint_callback, logger=pl.loggers.TensorBoardLogger('logs/'),
        gradient_clip_val=5.0, gradient_clip_algorithm='norm',
        resume_from_checkpoint='path/to/last.ckpt')  # path to the saved 'last' checkpoint

But after initializing the two distributed processes and completing the validation sanity check, this crashes at the first step of the new training epoch, giving a long error stack that ends with:

    File "/home/username/miniconda3/lib/python3.8/site-packages/torch/optim/", line 86, in adam
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0!

So somehow it seems the tensors aren’t being divided correctly between the two GPUs. I wonder if this has to do with how the checkpoint is loaded. Am I doing something wrong here? Is this even possible, and if so, how do I do it correctly?

(If I try to resume with a trainer that’s set to use just one GPU, there’s no problem.)
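
For context on errors like this one: in plain PyTorch, tensors in a checkpoint are restored to the device they were saved from unless `torch.load`'s `map_location` argument remaps them, which is one common way checkpoint state can end up on an unexpected device. A minimal CPU-only sketch (the temporary file path is chosen just for the example):

```python
import os
import tempfile

import torch

# Save a small "checkpoint" and reload it. map_location controls which
# device the stored tensors land on; without it, they are restored to
# the device they were saved from. Everything here stays on CPU, so the
# example runs anywhere.
state = {'weights': torch.ones(3)}
path = os.path.join(tempfile.gettempdir(), 'demo.ckpt')
torch.save(state, path)

loaded = torch.load(path, map_location='cpu')
print(loaded['weights'].device)  # prints: cpu
```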

hey @rubseb

The configuration looks good here, but there might be a problem in your LightningModule. Can you share the code for it, or ideally a reproducible script?

Also, we have moved the discussions to GitHub Discussions. You might want to check that out instead to get a quick response. The forums will be marked read-only soon.

Thank you

Hi @goku,

Thanks for your reply, and sorry for wasting your time. I noticed shortly after posting this that the forum had moved, and I posted a duplicate of my question on GitHub Discussions but didn’t think to remove it here (as I figured the old forum was dead).

In brief, I found a solution/workaround myself fairly quickly, which was to switch the distributed strategy from DDPSpawn to DDP; that turned out to be better in general. I wish I had the time to go back and replicate the issue with DDPSpawn and my old code (since I’ve made other changes since then too) in order to help others or the development team, but unfortunately I don’t right now. If I do find some time, or if it comes up again, I’ll report back with more info!
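
The workaround described above can be sketched roughly as follows; `strategy` is the standard `Trainer` parameter for selecting plain DDP over the spawn-based variant, the device indices mirror the earlier posts, and the other callbacks/logger arguments from the question are omitted here for brevity (this is a configuration sketch, not a verified fix for the original error):

```python
import pytorch_lightning as pl

# Request the plain DDP strategy explicitly instead of ddp_spawn.
# Remaining arguments mirror the trainer shown in the question.
trainer = pl.Trainer(
    devices=[0, 2], accelerator='gpu', precision=16, max_epochs=2000,
    strategy='ddp',  # instead of the spawn-based 'ddp_spawn'
    gradient_clip_val=5.0, gradient_clip_algorithm='norm')
```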

Glad you found the solution :slight_smile: