Hi Lightning devs and users,
I’m using Lightning to train some models for work, and I’m having trouble understanding how the epoch-level metrics are aggregated and computed. In my model, acc_train_step hits perfect accuracy and stays there for thousands of steps, while acc_train_epoch stays below 0.7. From reading the documentation I would expect acc_train_epoch to be the average of acc_train_step over all steps in the epoch, but then shouldn’t acc_train_epoch be 1 as well?
Can someone help me understand why these two graphs are so different?
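To make my expectation concrete, here is roughly how I assumed the epoch-level value is computed from the per-step values (my own sketch based on the docs, not Lightning’s actual implementation; expected_epoch_value, the example values, and the batch sizes are made up for illustration):

```python
import torch

def expected_epoch_value(step_values, batch_sizes):
    # Batch-size-weighted mean of the per-step values, which is what I
    # understand self.log(..., on_epoch=True, batch_size=...) reduces to.
    vals = torch.tensor(step_values, dtype=torch.float)
    sizes = torch.tensor(batch_sizes, dtype=torch.float)
    return (vals * sizes).sum() / sizes.sum()

# If every step logs acc == 1.0, the weighted mean is 1.0 no matter the batch sizes,
# so I don't see how the epoch curve can sit below 0.7.
print(expected_epoch_value([1.0, 1.0, 1.0], [8, 8, 4]))  # tensor(1.)
```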
I’m using pytorch-lightning 1.6.3.
I’m training with the dp (DataParallel) strategy on 4 GPUs.
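The Trainer is configured roughly like this (a simplified sketch; max_epochs and the MyLitModel / train_dataloader names are placeholders, not my exact setup):

```python
import pytorch_lightning as pl

# Simplified Trainer setup: one process, each batch split across the 4 GPUs by DataParallel.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="dp",
    max_epochs=100,  # placeholder value
)
# trainer.fit(MyLitModel(), train_dataloader)  # MyLitModel / train_dataloader are placeholders
```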
My training_step looks like this:
```python
def training_step(self, batch, batch_idx):
    x = batch["image"][tio.DATA]
    y = batch["label"]
    preds = self(x)
    y = y.view(y.shape[0], 1).float()
    loss = self.criterion(preds, y)
    # fraction of elements where the thresholded prediction matches the label
    acc = ((y > 0.5) == (preds > 0.5)) \
        .type(torch.FloatTensor).mean()
    # perform logging
    self.log("train_loss", loss, on_step=True, on_epoch=True,
             prog_bar=True, logger=True, batch_size=x.shape[0])
    self.log("train_acc", acc, on_step=True, on_epoch=True,
             prog_bar=True, logger=True, batch_size=x.shape[0])
    return loss
```