How to resume training

I don’t understand how to resume the training (from the last checkpoint).
The following:
trainer = pl.Trainer(gpus=1, default_root_dir=save_dir)
saves but does not resume from the last checkpoint.
The following code starts the training from scratch (but I read that it should resume):

logger = TestTubeLogger(save_dir=save_dir, name="default", version=0)
trainer = pl.Trainer(gpus=1, default_root_dir=save_dir, logger=logger)

This is not what I wanted; I would like training to resume automatically from the last checkpoint.

I don’t think that’s possible since a new Trainer instance won’t have any info regarding the checkpoint state saved in the previous training.

ok thank you very much!

@davide Try creating a new Trainer instance with the “resume_from_checkpoint” parameter set to the path of the .ckpt file you saved after your previous training run:

trainer = pl.Trainer(gpus=1, logger=logger, resume_from_checkpoint="path/to/ckpt/file/checkpoint.ckpt")

This should resume training from the epoch at which your checkpoint was saved.


@davide +1 to the above. You’ll need to tackle this in two separate parts:

  1. Locate the path to the checkpoint
  • pass a consistent directory into ModelCheckpoint(dirpath='./checkpoints/', save_last=True) so that the latest checkpoint is always written to ./checkpoints/last.ckpt
  2. Automatically load that checkpoint (+1 to @andrey_s)
  • pl.Trainer(resume_from_checkpoint='./checkpoints/last.ckpt')
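The two parts above can be combined so that the first run starts from scratch and later runs resume on their own. This is only a minimal sketch: find_last_checkpoint and build_trainer are hypothetical helper names, and it assumes the older Lightning API used in this thread (gpus= and resume_from_checkpoint=). As far as I know, newer releases take the path via trainer.fit(model, ckpt_path=...) instead.

```python
import os
from typing import Optional


def find_last_checkpoint(ckpt_dir: str, name: str = "last.ckpt") -> Optional[str]:
    """Return the path to the last checkpoint, or None if none exists yet."""
    path = os.path.join(ckpt_dir, name)
    return path if os.path.isfile(path) else None


def build_trainer(ckpt_dir: str = "./checkpoints", max_epochs: int = 10):
    """Hypothetical helper: build a Trainer that resumes when last.ckpt exists."""
    # Imported lazily so find_last_checkpoint works without Lightning installed.
    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint

    # save_last=True keeps an up-to-date last.ckpt inside ckpt_dir.
    checkpoint_cb = ModelCheckpoint(dirpath=ckpt_dir, save_last=True)
    return pl.Trainer(
        gpus=1,
        max_epochs=max_epochs,
        callbacks=[checkpoint_cb],
        # None on the first run (train from scratch), a path on later runs.
        resume_from_checkpoint=find_last_checkpoint(ckpt_dir),
    )
```

Calling build_trainer() at the start of every run would then behave like an automatic resume, provided the checkpoint directory stays the same between runs.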

If you want to automatically resume from the best weights according to some metric, you can set up ModelCheckpoint to monitor that metric and track the best one. You can then use glob.glob('./checkpoints/*') and do some parsing to get the path of the checkpoint with the best metric value.
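The glob-and-parse idea could look roughly like the sketch below. It assumes the metric value is embedded in the file name, for example epoch=2-val_loss=0.25.ckpt, which is what a template such as ModelCheckpoint(filename='{epoch}-{val_loss:.2f}') should produce. The function name find_best_checkpoint and the metric name val_loss are placeholders, not part of Lightning.

```python
import glob
import os
import re
from typing import Optional


def find_best_checkpoint(
    ckpt_dir: str, metric: str = "val_loss", mode: str = "min"
) -> Optional[str]:
    """Pick the checkpoint whose file name carries the best metric value."""
    # Matches e.g. "val_loss=0.25" and captures the number.
    pattern = re.compile(re.escape(metric) + r"=(-?\d+(?:\.\d+)?)")
    scored = []
    for path in glob.glob(os.path.join(ckpt_dir, "*.ckpt")):
        match = pattern.search(os.path.basename(path))
        if match:  # files such as last.ckpt carry no metric and are skipped
            scored.append((float(match.group(1)), path))
    if not scored:
        return None
    # "min" for losses, anything else is treated as "higher is better".
    pick = min if mode == "min" else max
    return pick(scored)[1]
```

The returned path (or None when nothing matches) can be passed as resume_from_checkpoint in the same way as last.ckpt.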


I think you forgot to mention that you need to add more epochs to the trainer (e.g. pl.Trainer(max_epochs=7, resume_from_checkpoint='./checkpoints/last.ckpt')). For example, if your last checkpoint was saved at epoch 3 (max_epochs=3), then you need to raise the limit (e.g. max_epochs=7) for the training to continue; otherwise it will not do anything. (I tested that, and it took me hours to figure this out :) )
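A tiny guard can catch this before launching a run that would exit immediately. This is just a sketch: has_epochs_left is a made-up helper, and it assumes you already know how many epochs the checkpoint has completed (how that number is stored inside the .ckpt file may differ between Lightning versions).

```python
def has_epochs_left(completed_epochs: int, max_epochs: int) -> bool:
    """True if a resumed run still has at least one epoch to train."""
    return max_epochs > completed_epochs


# Checkpoint saved after 3 epochs, trainer still limited to 3: nothing to do.
print(has_epochs_left(completed_epochs=3, max_epochs=3))
# Raising the limit to 7 leaves 4 more epochs to train.
print(has_epochs_left(completed_epochs=3, max_epochs=7))
```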

Hope it helps,
Peace and out!


Thanks for mentioning the max_epochs argument. I am now able to resume training from the last saved checkpoint (.ckpt file).