How to resume training

I don’t understand how to resume the training (from the last checkpoint).
The following:
trainer = pl.Trainer(gpus=1, default_root_dir=save_dir)
saves but does not resume from the last checkpoint.
The following code starts the training from scratch (but I read that it should resume):

logger = TestTubeLogger(save_dir=save_dir, name="default", version=0)
trainer = pl.Trainer(gpus=1, default_root_dir=save_dir, logger=logger)

This is not what I want; I would like to resume automatically from the last checkpoint.

I don’t think that’s possible since a new Trainer instance won’t have any info regarding the checkpoint state saved in the previous training.

ok thank you very much!

@davide Try initializing a new Trainer instance with the parameter "resume_from_checkpoint" set to the path of the .ckpt file you stored after your previous training:

trainer = pl.Trainer(gpus=1, logger=logger, resume_from_checkpoint="path/to/ckpt/file/checkpoint.ckpt")
trainer.fit(model)

This should start training from the epoch at which your checkpoint was saved.

@davide +1 to the above. You’ll need to tackle this in two separate parts:

  1. locate the path to the checkpoint
  • pass a consistent directory into ModelCheckpoint(dirpath='./checkpoints/', save_last=True), which always writes the latest checkpoint to './checkpoints/last.ckpt' (note that dirpath takes a directory, not a file path)
  2. automatically load that checkpoint (+1 to @andrey_s)
  • pl.Trainer(resume_from_checkpoint='./checkpoints/last.ckpt')
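Putting the two parts together, a minimal sketch might look like this (assuming './checkpoints/' as the checkpoint directory and a LightningModule instance named model; save_last=True makes the resume path predictable):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Part 1: always write the latest checkpoint to a known location,
# './checkpoints/last.ckpt' (dirpath is a directory, save_last=True
# adds the 'last.ckpt' file alongside any best-metric checkpoints).
checkpoint_callback = ModelCheckpoint(dirpath="./checkpoints/", save_last=True)

# Part 2: point the new Trainer at that known path to resume.
trainer = pl.Trainer(
    gpus=1,
    callbacks=[checkpoint_callback],
    resume_from_checkpoint="./checkpoints/last.ckpt",
)
trainer.fit(model)  # model is your LightningModule
```

On the first run, './checkpoints/last.ckpt' won’t exist yet, so you’d only pass resume_from_checkpoint once a previous run has saved it (e.g. guard it with os.path.exists).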

If you want to automatically resume from the best weights according to some metric, you can set up ModelCheckpoint to monitor a particular metric and track the best one; then you can use glob.glob('./checkpoints/*.ckpt') and do some parsing of the filenames to get the path of the best checkpoint.
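As a sketch of that parsing step, suppose ModelCheckpoint was configured with a filename template that embeds the metric, e.g. filename='{epoch}-{val_loss:.2f}', producing files like 'epoch=3-val_loss=0.20.ckpt' (the directory name, metric name, and helper below are illustrative assumptions):

```python
import glob
import os
import re

def best_checkpoint(ckpt_dir="./checkpoints"):
    """Return the path of the checkpoint with the lowest val_loss
    embedded in its filename, or None if no checkpoint matches.

    Assumes filenames like 'epoch=3-val_loss=0.20.ckpt', as produced
    by ModelCheckpoint(filename='{epoch}-{val_loss:.2f}').
    """
    best_path, best_loss = None, float("inf")
    for path in glob.glob(os.path.join(ckpt_dir, "*.ckpt")):
        # Pull the metric value out of the filename.
        match = re.search(r"val_loss=(\d+\.\d+)", os.path.basename(path))
        if match and float(match.group(1)) < best_loss:
            best_loss, best_path = float(match.group(1)), path
    return best_path
```

You can then pass the result to pl.Trainer(resume_from_checkpoint=best_checkpoint()).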