Here is my code:
    # (imports and the LitNet(pl.LightningModule) class definition are omitted;
    #  the two methods below live inside LitNet)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        labels_hat = torch.argmax(logits, dim=1)
        val_acc = accuracy(labels_hat, y)
        result = pl.EvalResult()
        result.log('val_acc', val_acc, prog_bar=True, reduce_fx=torch.mean)
        return result

    def validation_epoch_end(self, results):
        all_val_acc = results.val_acc
        val_acc = torch.mean(all_val_acc)
        result = pl.EvalResult()
        result.log('val_acc', val_acc)
        return result

    if __name__ == '__main__':
        seed_everything(42)
        model = LitNet()
        checkpoint_callback = ModelCheckpoint(
            filepath='lightning_logs/main_checkpoints/mnist_v10_best.ckpt',
            monitor='val_acc')
        trainer = Trainer(max_epochs=10, gpus=1,
                          checkpoint_callback=checkpoint_callback)
        trainer.fit(model)
        print(checkpoint_callback.best_model_path)
The issue I am facing is that checkpoint_callback.best_model_path returns an empty string and no checkpoint is saved. I also get this UserWarning when I run it:
/opt/conda/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: When using EvalResult(early_stop_on=X) or TrainResult(early_stop_on=X) the 'monitor' key of ModelCheckpoint has no effect. Remove ModelCheckpoint(monitor='val_acc) to fix')
I also tried the other approach, i.e. using pl.EvalResult(checkpoint_on='val_acc') in both places (in validation_step and in validation_epoch_end). With this, a checkpoint was saved, but not the best one: only the epoch=0.ckpt checkpoint was written, even though I could visually confirm a better val_acc in later epochs.
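For reference, here is roughly what that second attempt looked like (a sketch of validation_step only; as far as I understand, checkpoint_on in this Lightning version expects the metric tensor itself rather than a string, so the tensor is passed directly):

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        labels_hat = torch.argmax(logits, dim=1)
        val_acc = accuracy(labels_hat, y)
        # checkpoint_on takes the metric tensor that checkpointing should track
        result = pl.EvalResult(checkpoint_on=val_acc)
        result.log('val_acc', val_acc, prog_bar=True)
        return result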
Any help is much appreciated.