Hi Lightning devs and users,
I’m using Lightning to train some models for work, and I’m having trouble understanding how the epoch-level metrics are aggregated and computed. In my model, acc_train_step hits perfect accuracy and stays there for thousands of steps, while acc_train_epoch stays below 0.7. From reading the documentation I would expect acc_train_epoch to be the average of acc_train_step over all steps in the epoch, but then shouldn’t acc_train_epoch be 1 as well?
Can someone help me understand why these two graphs are so different?
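To make my expectation concrete, here is roughly how I assumed the epoch-level value is computed from the per-step values (my own sketch based on the docs, not Lightning’s actual implementation; expected_epoch_value, the example values, and the batch sizes are made up for illustration):

```python
import torch

def expected_epoch_value(step_values, batch_sizes):
    # Batch-size-weighted mean of the per-step values, which is what I
    # understand self.log(..., on_epoch=True, batch_size=...) reduces to.
    vals = torch.tensor(step_values, dtype=torch.float)
    sizes = torch.tensor(batch_sizes, dtype=torch.float)
    return (vals * sizes).sum() / sizes.sum()

# If every step logs acc == 1.0, the weighted mean is 1.0 no matter the batch sizes,
# so I don't see how the epoch curve can sit below 0.7.
print(expected_epoch_value([1.0, 1.0, 1.0], [8, 8, 4]))  # tensor(1.)
```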
I’m using pytorch-lightning 1.6.3.
I’m training with the dp (DataParallel) strategy on 4 GPUs.
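The Trainer is configured roughly like this (a simplified sketch; max_epochs and the MyLitModel / train_dataloader names are placeholders, not my exact setup):

```python
import pytorch_lightning as pl

# Simplified Trainer setup: one process, each batch split across the 4 GPUs by DataParallel.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="dp",
    max_epochs=100,  # placeholder value
)
# trainer.fit(MyLitModel(), train_dataloader)  # MyLitModel / train_dataloader are placeholders
```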
My training_step looks like this:
```python
def training_step(self, batch, batch_idx):
    x = batch["image"][tio.DATA]
    y = batch["label"]
    preds = self(x)
    y = y.view(y.shape[0], 1).float()
    loss = self.criterion(preds, y)
    # fraction of elements where the thresholded prediction matches the label
    acc = ((y > 0.5) == (preds > 0.5)) \
        .type(torch.FloatTensor).mean()
    # perform logging
    self.log("train_loss", loss, on_step=True, on_epoch=True,
             prog_bar=True, logger=True, batch_size=x.shape[0])
    self.log("train_acc", acc, on_step=True, on_epoch=True,
             prog_bar=True, logger=True, batch_size=x.shape[0])
    return loss
```