Correct approach to calculate metrics in DDP setting

Abhisek_Maiti · April 4, 2022, 7:48am

In the case of DDP:

Should the metrics be calculated in validation_step, or in validation_step_end after gathering the output tensors returned by validation_step?
- If the metrics are calculated in validation_step, would it be correct to take the mean of the corresponding metrics in validation_step_end, considering that the batch partitions for each device can be uneven?
- Does calling all_gather on the output tensors inside validation_step_end add an extra dimension before the batch dimension? For example, if my original batch tensor has shape N x C x H x W and 2 GPUs are in use, will the tensor after all_gather have shape 2 x M x C x H x W (where 2M = N)? What happens if the batch size (N) is odd?
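
To make the concern about uneven partitions concrete, here is a minimal pure-Python sketch. It does not use Lightning or torch at all; the shard sizes, the counts, and the helper names mean_of_means and global_accuracy are made up for illustration. It only shows the arithmetic: averaging per-device accuracies matches the accuracy over the full batch when the shards are the same size, and can differ when they are not.

```python
# Hypothetical per-device results as (correct predictions, shard size).
# Device 0 received 5 samples and device 1 only 3: an uneven split of N = 8.
uneven_shards = [(5, 5), (0, 3)]
# Same total, but split evenly (4 samples per device).
even_shards = [(2, 4), (3, 4)]


def mean_of_means(shards):
    """Average the per-device accuracies, ignoring shard sizes."""
    return sum(correct / size for correct, size in shards) / len(shards)


def global_accuracy(shards):
    """Sum the counts across devices first, then divide once."""
    total_correct = sum(correct for correct, _ in shards)
    total_size = sum(size for _, size in shards)
    return total_correct / total_size


naive_uneven = mean_of_means(uneven_shards)
exact_uneven = global_accuracy(uneven_shards)
naive_even = mean_of_means(even_shards)
exact_even = global_accuracy(even_shards)

print("uneven:", naive_uneven, "vs", exact_uneven)
print("even:  ", naive_even, "vs", exact_even)
```

In this toy case the two numbers agree for the even split and disagree for the uneven one, which is why I am unsure whether a plain mean of per-device metrics is safe.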

goku · April 4, 2022, 2:06pm