Auto-saving model weights

Hello, I have a question regarding automatic saving.

I’ve followed the description in the docs to save the top-k best model weights, but I can’t get it to work.

I created a ModelCheckpoint as follows:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# DEFAULTS used by the Trainer
checkpoint_callback = ModelCheckpoint(
    save_top_k=1,
    verbose=True,
    monitor='val_acc',
    mode='max',
)

trainer = Trainer(checkpoint_callback=checkpoint_callback)

And I created my validation_step and validation_epoch_end as follows:

def validation_step(self, batch, batch_idx):
    acc = self.calculate_acc(batch)
    result = pl.EvalResult()
    result.log('val_acc', acc)
    return result

def validation_epoch_end(self, val_step_output):
    end_result = pl.EvalResult()
    end_result.val_acc = torch.sum(val_step_output.val_acc)
    return end_result

However, I am getting this message and my weights are not saved.

RuntimeWarning: Can save best model only with val_acc available, skipping.

Looking into the on_validation_end method of the ModelCheckpoint class (in the source code), it seems I have to get 'val_acc' into the callback metrics from the EvalResult object, but I am not sure how to do that, or whether it is the right approach.
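
For context, the relevant check seems to boil down to something like this (a paraphrased sketch from my reading, not the actual library source):

import warnings

def on_validation_end(self, trainer, pl_module):
    # paraphrased sketch of the library's check, not the real implementation
    metrics = trainer.callback_metrics  # filled from the results you return
    if metrics.get(self.monitor) is None:
        # 'val_acc' never reached callback_metrics, hence the warning I see
        warnings.warn(
            f'Can save best model only with {self.monitor} available, skipping.',
            RuntimeWarning,
        )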

Can anyone please help me with this?

Could you try this?

result = pl.EvalResult(checkpoint_on=acc)

The way to go is not to change the monitor argument in your callback but, as @ydcjeff suggested, to use checkpoint_on in your validation_step/validation_epoch_end. Your trainer config would then look like this:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# DEFAULTS used by the Trainer
checkpoint_callback = ModelCheckpoint(
    save_top_k=1,
    verbose=True,
    mode='max',
)

trainer = Trainer(checkpoint_callback=checkpoint_callback)

and your validation phase either like this:

def validation_step(self, batch, batch_idx):
    acc = self.calculate_acc(batch)
    result = pl.EvalResult(checkpoint_on=acc)  # for early stopping you could also use early_stop_on here
    result.log('val_acc', acc)
    return result

without any validation_epoch_end (by default, the result will now average your per-step values for checkpointing; see the docs for details), or like this if you really want to sum them:

def validation_step(self, batch, batch_idx):
    acc = self.calculate_acc(batch)
    result = pl.EvalResult()
    result.log('val_acc', acc)
    return result

def validation_epoch_end(self, val_step_output):
    aggregated_val_acc = torch.sum(val_step_output.val_acc)
    end_result = pl.EvalResult(checkpoint_on=aggregated_val_acc)
    return end_result
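
After training you can sanity-check what was saved; a quick sketch (model is your own LightningModule here, and attribute names such as best_k_models may differ slightly between Lightning versions):

trainer.fit(model)
# best_k_models maps checkpoint file paths to the monitored values
print(checkpoint_callback.best_k_models)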

Hope that helps 🙂


Thanks for the reply!

Actually, I had already tested passing a tensor to checkpoint_on, but the model was still not being saved. Digging further, I found that the tensor I had been passing had a value of zero, and it turns out that passing a zero-valued tensor does nothing (see the source code), because a zero tensor evaluates as falsy.
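
You can see the falsiness issue in isolation with a quick, self-contained check:

import torch

checkpoint_on = torch.tensor(0.0)

# bool(torch.tensor(0.0)) is False, so a plain truthiness check
# silently ignores a metric whose value happens to be exactly zero
if checkpoint_on:
    print('metric present, would be used for checkpointing')
else:
    print('skipped')  # this branch runs for a zero tensor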

Anyway, now I know that this is the correct way to do it. Thanks!
