Strange checkpoint loading and learning behaviour

I am trying to train a new network with PyTorch Lightning (testing out the framework) and am seeing very strange behaviour that suggests the checkpoint is not loaded correctly and that the learning rate is changing under my feet somehow.

The graph shows the training loss for the two consecutive runs.

The optimizer is configured using:

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
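        # no scheduler is returned here, so the learning rate should stay fixed at 1e-3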
        return optimizer

The training is then run twice:

    trainer = pl.Trainer(gpus=4)
    trainer.fit(net, train_loader, test_loader)

    net = Net.load_from_checkpoint('mlruns/1/0081837fe90f4eeebf806752f31af51d/checkpoints/epoch=655-test_loss=3.051074266433716.ckpt')
    trainer.fit(net, train_loader, test_loader)

There are two weird behaviours:

(1) There is a very big jump in the loss about halfway through. Its location is very consistent across different experiments, suggesting that the learning rate or some other parameter is being changed at that point behind my back.

(2) The second run, after loading the checkpoint, seems to show that the checkpointed weights are not actually being used.

I am using this callback to save the checkpoint (it is passed to callbacks in the trainer; based on the command-line output it is being used):

    chkpnt_cb = ModelCheckpoint(
        monitor='test_loss',
        verbose=True,
        save_top_k=3,
        save_weights_only=False,
        mode='min',
        period=1,
        filename='{epoch}-{test_loss}')

What am I missing here? (I also tried passing LearningRateMonitor(logging_interval='step') to callbacks to get feedback on the learning rate, but I do not see anything in the logs.)
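
For completeness, the two callbacks are passed to the trainer roughly like this (a minimal sketch rather than the exact script; lr_cb is just the LearningRateMonitor instance mentioned above):

    from pytorch_lightning.callbacks import LearningRateMonitor

    lr_cb = LearningRateMonitor(logging_interval='step')
    trainer = pl.Trainer(gpus=4, callbacks=[chkpnt_cb, lr_cb])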

You are using 'test_loss' in monitor? Is it created in the test_step?

Yes. I saw that Lightning prefers slightly different terminology, but that name is personal history (the scaling in there is based on the data, so test_loss == 1 is roughly 100% of the data range in RMSE):

        x, y = batch
        z = self(x)

        loss = F.mse_loss(z, y)

        self.log('test_loss', loss.detach().sqrt() * (1/5e-5))

        return loss

If something is logged in test_step, then it won't be monitored, because ModelCheckpoint is called during trainer.fit, and trainer.fit does not call test_step/test_epoch_end…
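
A minimal sketch of the implication, assuming you keep the test_loss name and scaling but log the metric from validation_step, which does run during trainer.fit and is therefore visible to ModelCheckpoint:

    import torch.nn.functional as F
    import pytorch_lightning as pl

    class Net(pl.LightningModule):
        # ... __init__, forward, training_step, configure_optimizers as before ...

        def validation_step(self, batch, batch_idx):
            x, y = batch
            z = self(x)
            loss = F.mse_loss(z, y)
            # logged from validation_step, so the metric exists while fitting
            # and ModelCheckpoint(monitor='test_loss') can actually track it
            self.log('test_loss', loss.detach().sqrt() * (1 / 5e-5))
            return loss

With the metric logged there, the same ModelCheckpoint configuration should be able to monitor it during fit.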