Understanding logging and validation_step, validation_epoch_end

I am having a hard time understanding how to use the return values of validation_step and validation_epoch_end (this also goes for train and test).

First of all, when would I want to use validation_epoch_end? I have seen some people not using it at all.

Second, I do not understand how the logging works or how to use it, e.g.:

def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.forward(x)
    loss = F.cross_entropy(y_hat, y)        
    return {'loss': loss, 'log': {'train_loss': loss}}

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.forward(x)
    loss = F.cross_entropy(y_hat, y)
    return {'val_loss': loss}

def validation_epoch_end(self, outputs):
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    log = {'val_loss': avg_loss}
    return {'val_loss': avg_loss, 'log': log}

Where does log go? I understand the return of 'loss', but I don't understand where 'log' goes and how to use it.

Third, from what I understand there is a new way to log by calling self.log, and I get warnings when not using it. So what is the difference?

The new self.log functionality works similarly to how it did when it was in the returned dictionary, but we now automatically aggregate the values you log each step and log their mean each epoch if you specify so. For example, the code you wrote above can be rewritten as:

def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.forward(x)
    loss = F.cross_entropy(y_hat, y)
    self.log("loss", loss)        
    return loss

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.forward(x)
    loss = F.cross_entropy(y_hat, y)
    # on_epoch=True by default in `validation_step`,
    # so it is not necessary to specify
    self.log("val_loss", on_epoch=True) 

This eliminates the need for validation_epoch_end. If for some reason you still want to do this aggregation yourself, you can also do:

def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.forward(x)
    loss = F.cross_entropy(y_hat, y)
    self.log("loss", loss)        
    return loss

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.forward(x)
    loss = F.cross_entropy(y_hat, y)
    return loss

def validation_epoch_end(self, outs):
    # outs is a list of whatever you returned in `validation_step`
    loss = torch.stack(outs).mean()
    self.log("val_loss", loss)

This functions equivalently. Hope this clears things up! :slight_smile:

TypeError: validation_epoch_end() missing 1 required positional argument: 'outs'

Hi Teddy, can I ask how I can do that? Does it happen by default at the end of the epoch, or do I have to specify it? It is not clear to me how to organize the steps!

Thank you

Hi @Ch-rode, if you log your loss in the training or validation step with self.log, then you don't need to implement the validation_epoch_end method (and the same goes for the training step).
Lightning takes care of it by automatically aggregating the loss you logged in {training|validation}_step at the end of each epoch (see the minimal sketch after the flow below).

The flow would be:

  1. Epoch start
  2. Loss computed and logged in training step
  3. Epoch end
  4. Fetch the training step loss and aggregate
  5. Continue next epoch
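
Here is a minimal sketch of what that looks like, assuming self.forward and F (torch.nn.functional) are defined/imported as in the snippets above:

def training_step(self, batch, batch_idx):
    x, y = batch
    loss = F.cross_entropy(self.forward(x), y)
    # logged every training step as "train_loss"
    self.log("train_loss", loss)
    return loss

def validation_step(self, batch, batch_idx):
    x, y = batch
    loss = F.cross_entropy(self.forward(x), y)
    # on_epoch=True is the default here, so Lightning logs the mean
    # of this value over all validation batches at the end of the epoch
    self.log("val_loss", loss)
    # note: no validation_epoch_end is implemented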

Hope I was able to solve your problem. :slightly_smiling_face:

Also, we have migrated our discussions from this forum to GitHub Discussions. Please ask your questions there for a quicker response.

Thanks

hey @Ch-rode

you can do that by specifying self.log(..., on_epoch=True).

These methods also have defaults set per hook. You can check them out here: Logging - PyTorch Lightning 1.8.0dev documentation
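
As a quick sketch of overriding those defaults explicitly (training_step defaults to on_step=True, on_epoch=False, per the linked docs):

def training_step(self, batch, batch_idx):
    x, y = batch
    loss = F.cross_entropy(self.forward(x), y)
    # on_epoch=True additionally logs "train_loss_epoch", the mean of
    # this value over the whole epoch, alongside the per-step value
    self.log("train_loss", loss, on_step=True, on_epoch=True)
    return loss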

Although Lightning takes care of this automatically, I don't think it is entirely correct: it averages the metrics calculated on a per-batch basis, but in general we are interested in the value of the metric/loss over the entire validation set. Unless every batch has the same size, the mean of per-batch means is not the same as the mean over all samples.
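
If that matters for your use case, one workaround (just a sketch, not the built-in behavior) is to return the batch size from validation_step and compute a sample-weighted mean yourself:

def validation_step(self, batch, batch_idx):
    x, y = batch
    loss = F.cross_entropy(self.forward(x), y)
    # return the per-batch mean loss together with the batch size
    return {"loss": loss, "n": x.size(0)}

def validation_epoch_end(self, outs):
    # weight each per-batch mean by its batch size, so the result is the
    # mean over all samples rather than the mean of per-batch means
    total = sum(o["loss"] * o["n"] for o in outs)
    count = sum(o["n"] for o in outs)
    self.log("val_loss", total / count)

(If I recall correctly, newer Lightning releases also accept a batch_size argument in self.log that serves a similar purpose.)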

I don't know if it is specific to the TensorBoardLogger, but if we have something like this:

def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.forward(x)
    loss = F.cross_entropy(y_hat, y)
    self.log("train_loss", loss)        
    return loss

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.forward(x)
    loss = F.cross_entropy(y_hat, y)
    # on_epoch=True is already the default in `validation_step`;
    # on_step=True is added to also log the raw per-batch values
    self.log("val_loss", loss, on_step=True, on_epoch=True)

then in TensorBoard we get two plots: val_loss_step and val_loss_epoch. The step in the first plot is not self.global_step but essentially the global step of the validation dataloader, whereas in the second plot the step is self.global_step. Is this behavior documented somewhere? That is, what "step" is used when logging inside {training,validation}_step, on_{train,validation}_epoch_end, etc.?