Reported validation metrics do not match the actual validation metrics

I am using the following `ModelCheckpoint` callback to save the best model during training:

import pytorch_lightning as pl

checkpoint_callback = pl.callbacks.ModelCheckpoint(
    dirpath='/content/lightning_logs',
    filename='{epoch}-{val_loss:.3f}-{val_f1:.3f}',
    monitor="val_f1",  # track the validation F1 score
    mode='max',        # higher F1 is better
    every_n_train_steps=1,
    save_top_k=1
)

At the end of training, the checkpoint is saved as "epoch=44-val_loss=0.684-val_f1=0.818.ckpt". TensorBoard also reports this F1 score as the best one seen during training.
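For reference, that filename comes from the `filename` template: with the default `auto_insert_metric_name=True`, Lightning expands each `{metric}` placeholder to `metric=value` using the logged values at save time. A rough sketch of that expansion in plain Python (this mimics, but is not, Lightning's internal formatting; the metric values are the ones from my run):

```python
# Simplified stand-in for ModelCheckpoint's filename formatting:
# each "{metric" placeholder is rewritten to "metric={metric" so the
# metric name appears next to its value, then str.format fills it in.
template = "{epoch}-{val_loss:.3f}-{val_f1:.3f}"
metrics = {"epoch": 44, "val_loss": 0.684, "val_f1": 0.818}

expanded = template
for name in metrics:
    expanded = expanded.replace("{" + name, "%s={%s" % (name, name))

filename = expanded.format(**metrics) + ".ckpt"
print(filename)  # epoch=44-val_loss=0.684-val_f1=0.818.ckpt
```

So the 0.818 in the name is exactly the `val_f1` value that was logged when the checkpoint was written.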

However, I cannot reproduce these results. When I load the checkpoint and evaluate it, I get lower metrics. For example, running:

trainer.validate(ckpt_path='/content/lightning_logs/epoch=44-val_loss=0.684-val_f1=0.818.ckpt')

gives me a validation F1 score of 0.802, not the 0.818 recorded in the checkpoint name.

What could the culprit be?