Save checkpoints after a specific number of steps instead of epochs

sujay_khandekar · September 26, 2020, 11:00pm

My trainer looks like this:

trainer = pl.Trainer(gpus=gpus, max_steps=25000, precision=16)
trainer.fit(model, train_dl)

I want to save a model checkpoint every 5000 steps (the checkpoints can overwrite each other). Is it possible to do that?
According to the documentation, the ModelCheckpoint callback can save a checkpoint after a specific number of epochs, but I didn't see anything there about saving after a specific number of steps. I am not passing any validation data, so I do not want to save based on validation loss either.
Is there any way to do this?
Thanks.

goku · September 27, 2020, 8:37am

You can try this: Save checkpoint and validate every n steps · Issue #2534 · Lightning-AI/lightning · GitHub
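
For anyone landing here, one way to do this is a small custom callback that checks trainer.global_step at the end of each training batch and calls trainer.save_checkpoint. The following is only a minimal sketch under some assumptions: the class name CheckpointEveryNSteps, its arguments, and the file name latest.ckpt are made up for illustration, and whether global_step has already been incremented when the hook fires can differ between Lightning versions, so the save may land one batch earlier or later than expected.

```python
import os

try:
    import pytorch_lightning as pl
    _Base = pl.Callback
except ImportError:
    # Fallback so the sketch can still be imported without Lightning installed.
    _Base = object


class CheckpointEveryNSteps(_Base):
    """Hypothetical callback: save (and overwrite) a checkpoint every N steps."""

    def __init__(self, every_n_steps=5000, dirpath="checkpoints", filename="latest.ckpt"):
        super().__init__()
        self.every_n_steps = every_n_steps
        self.dirpath = dirpath
        self.filename = filename
        self._last_saved_step = None  # avoids saving twice for the same step

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        # *args/**kwargs absorb the extra hook arguments, which vary by version.
        step = trainer.global_step
        if step <= 0 or step % self.every_n_steps != 0:
            return
        if step == self._last_saved_step:
            # With gradient accumulation, several batches share one global step.
            return
        os.makedirs(self.dirpath, exist_ok=True)
        # Same path every time, so each save overwrites the previous one.
        trainer.save_checkpoint(os.path.join(self.dirpath, self.filename))
        self._last_saved_step = step
```

With the trainer from the question, this would be passed as pl.Trainer(gpus=gpus, max_steps=25000, precision=16, callbacks=[CheckpointEveryNSteps(5000)]). Newer Lightning releases also accept an every_n_train_steps argument on ModelCheckpoint, which may make the custom callback unnecessary if your installed version has it.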

sujay_khandekar · September 28, 2020, 3:52am

Thanks, that worked for me.