How to resume training

I don’t understand how to resume the training (from the last checkpoint).
The following:
trainer = pl.Trainer(gpus=1, default_root_dir=save_dir)
saves but does not resume from the last checkpoint.
The following code starts the training from scratch (but I read that it should resume):

logger = TestTubeLogger(save_dir=save_dir, name="default", version=0)
trainer = pl.Trainer(gpus=1, default_root_dir=save_dir, logger=logger)

This is not what I want; I would like to resume automatically from the last checkpoint.

I don’t think that’s possible since a new Trainer instance won’t have any info regarding the checkpoint state saved in the previous training.

ok thank you very much!

@davide Try initializing a new Trainer instance with the parameter "resume_from_checkpoint" set to the path of the .ckpt file you stored after your previous training:

trainer = pl.Trainer(gpus=1, logger=logger, resume_from_checkpoint="path/to/ckpt/file/checkpoint.ckpt")
trainer.fit(model)

This should start training from the epoch at which your checkpoint was saved.

@davide +1 to the above. You’ll need to tackle this in two separate parts:

  1. locate the path to the checkpoint
  • pass a consistent directory into ModelCheckpoint(dirpath='./checkpoints/', save_last=True), which always writes the latest checkpoint to './checkpoints/last.ckpt' (note that dirpath takes a directory, not a file path)
  2. automatically load that checkpoint (+1 to @andrey_s)
  • pl.Trainer(resume_from_checkpoint='./checkpoints/last.ckpt')
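Putting the two parts together, a minimal sketch might look like this (assuming './checkpoints/' as the checkpoint directory and a LightningModule instance named model; save_last=True makes the resume path predictable):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Part 1: always write the latest checkpoint to a known location,
# './checkpoints/last.ckpt' (dirpath is a directory, save_last=True
# adds the 'last.ckpt' file alongside any best-metric checkpoints).
checkpoint_callback = ModelCheckpoint(dirpath="./checkpoints/", save_last=True)

# Part 2: point the new Trainer at that known path to resume.
trainer = pl.Trainer(
    gpus=1,
    callbacks=[checkpoint_callback],
    resume_from_checkpoint="./checkpoints/last.ckpt",
)
trainer.fit(model)  # model is your LightningModule
```

On the first run, './checkpoints/last.ckpt' won’t exist yet, so you’d only pass resume_from_checkpoint once a previous run has saved it (e.g. guard it with os.path.exists).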

If you want to automatically resume from the best weights according to some metric, you can set up ModelCheckpoint to monitor a particular metric and track the best one; then you can use glob.glob('./checkpoints/*.ckpt') and do some parsing of the filenames to get the path of the best checkpoint.
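As a sketch of that parsing step, suppose ModelCheckpoint was configured with a filename template that embeds the metric, e.g. filename='{epoch}-{val_loss:.2f}', producing files like 'epoch=3-val_loss=0.20.ckpt' (the directory name, metric name, and helper below are illustrative assumptions):

```python
import glob
import os
import re

def best_checkpoint(ckpt_dir="./checkpoints"):
    """Return the path of the checkpoint with the lowest val_loss
    embedded in its filename, or None if no checkpoint matches.

    Assumes filenames like 'epoch=3-val_loss=0.20.ckpt', as produced
    by ModelCheckpoint(filename='{epoch}-{val_loss:.2f}').
    """
    best_path, best_loss = None, float("inf")
    for path in glob.glob(os.path.join(ckpt_dir, "*.ckpt")):
        # Pull the metric value out of the filename.
        match = re.search(r"val_loss=(\d+\.\d+)", os.path.basename(path))
        if match and float(match.group(1)) < best_loss:
            best_loss, best_path = float(match.group(1)), path
    return best_path
```

You can then pass the result to pl.Trainer(resume_from_checkpoint=best_checkpoint()).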