No checkpoints are being saved

Luca_Guarro · August 24, 2021, 7:20pm

I am trying to save checkpoints of my model during training, monitoring the validation loss, but I do not see any .ckpt files being saved under the checkpoints directory, as you can see in the sample screenshot below.

Here is what my validation_step and validation_epoch_end functions look like:

def validation_step(self, val_batch, batch_idx):
    grouped_pooled_outs = val_batch['grouped_pooled_outs']
    src_key_padding_mask = val_batch['src_key_padding_mask']
    targets = val_batch['success_label']
    logits = self.forward(grouped_pooled_outs, src_key_padding_mask)
    y_prob = self.softmaxer(logits)[:, 1]
    y_pred = (y_prob > 0.5).float()
    loss = self.cross_entropy_loss(logits, targets)
    return {'val_loss': loss, 'preds': y_pred, 'targets': targets.tolist()}

def validation_epoch_end(self, val_step_outputs):
    y_pred = []
    y_true = []

    for x in val_step_outputs:
        y_pred.extend(x['preds'].tolist())
        y_true.extend(x['targets'])

    f1_res = f1_score(y_true, y_pred, average='weighted')
    avg_val_loss = torch.tensor([x['val_loss'] for x in val_step_outputs]).mean()
    log_dict = {
        'val_loss': avg_val_loss,
        'val_f1': f1_res
    }
    self.log('val_loss', avg_val_loss, prog_bar=True)
    self.log('val_f1', f1_res, prog_bar=True)
    return {'val_loss': avg_val_loss, 'log': log_dict}

I then instantiate a checkpoint callback that monitors val_loss, averaged over the epoch:

checkpoint_callback = pl.callbacks.ModelCheckpoint(monitor="val_loss")
trainer = pl.Trainer(log_every_n_steps=1, gpus=1, max_epochs=2, callbacks=[checkpoint_callback], num_sanity_val_steps=0)
model = LightningToBERT(nhead=1, num_layers=1, dropout=0.3)
datamodule = GoodReadsDataModule()
trainer.fit(model, datamodule)

Why aren’t any checkpoints being saved?
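
To rule out the possibility that checkpoints are being written somewhere unexpected, a small helper like the one below can search a directory tree for .ckpt files. This is only a sketch that does not depend on Lightning: the name find_checkpoints and the default search root are illustrative choices, not part of any library.

```python
from pathlib import Path


def find_checkpoints(root="."):
    """Return every *.ckpt file found under root, as sorted path strings."""
    return sorted(str(path) for path in Path(root).rglob("*.ckpt"))


if __name__ == "__main__":
    # Searches the current working directory; pass another root if needed.
    for ckpt in find_checkpoints():
        print(ckpt)
```

Running it from the directory where training was launched lists any checkpoint files anywhere below that directory, whichever folder they ended up in.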

Luca_Guarro · August 24, 2021, 9:42pm

I was able to get checkpoints to save by explicitly passing a path to the ‘dirpath’ argument. However, I cannot get the checkpoints to save under the specific version directory.
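
One possible way to place checkpoints under the version directory is to build dirpath from the same pieces the logger uses. The sketch below assumes the TensorBoardLogger layout of <save_dir>/<name>/version_<n> with an integer version. The helpers version_checkpoint_dir and build_trainer are hypothetical names, and the default values are placeholders rather than anything taken from my setup.

```python
import os


def version_checkpoint_dir(save_dir, name, version):
    """Build <save_dir>/<name>/version_<version>/checkpoints (assumed layout)."""
    return os.path.join(save_dir, name, f"version_{version}", "checkpoints")


def build_trainer(save_dir="lightning_logs", name="default", version=0):
    """Hypothetical helper: pin the logger version and reuse it for dirpath."""
    # Imported lazily so the path helper above works without Lightning installed.
    import pytorch_lightning as pl
    from pytorch_lightning.loggers import TensorBoardLogger

    # Passing an explicit version makes the logger directory predictable.
    logger = TensorBoardLogger(save_dir=save_dir, name=name, version=version)
    checkpoint_callback = pl.callbacks.ModelCheckpoint(
        monitor="val_loss",
        dirpath=version_checkpoint_dir(save_dir, name, version),
    )
    return pl.Trainer(logger=logger, callbacks=[checkpoint_callback], max_epochs=2)
```

Pinning the version means repeated runs write to the same directory, so this trades automatic version numbering for a predictable path.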