DataParallel crash with uneven number of inputs

From (ValueError: All dicts must have the same number of keys on model evaluation output. · Issue #3342 · Lightning-AI/lightning · GitHub):

Any ideas on how to debug this issue?
It is happening to me in many different models after I refactored the logging in the training_step, validation_step and test_step methods from the old dictionary-based return to the new Result scheme, while training on two GPUs at the same time.

The error does not appear if I use distributed_backend='ddp' instead of 'dp' on the Trainer.
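
For reference, a minimal sketch of the backend selection on my Trainer (other arguments omitted; gpus=2 matches my setup):

import pytorch_lightning as pl

# With distributed_backend='dp' the error appears intermittently;
# with 'ddp' it does not.
trainer = pl.Trainer(gpus=2, distributed_backend='dp')
# trainer = pl.Trainer(gpus=2, distributed_backend='ddp')  # works fine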

:bug: Bug
When running evaluation or test routines with the Trainer (either the validation pass at the end of a .fit epoch or a direct call to .test),
it throws ValueError: All dicts must have the same number of keys.

After looking at the error log, I think it has something to do with the metric logging, but I can't figure out what exactly. The error pops up very inconsistently across epochs and runs, so I'm looking for ideas on how to get more detail and reach the root of the issue.

Stack Trace:

File "model_manager.py", line 263, in <module>
    helper.train()
  File "model_manager.py", line 97, in train
    self.trainer.fit(self.module)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1064, in fit    results = self.accelerator_backend.train()
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/accelerators/dp_backend.py", line 97, in train
    results = self.trainer.run_pretrain_routine(model)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
    self.train()
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
    self.run_training_epoch()
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 516, in run_training_epoch
    self.run_evaluation(test_mode=False)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 582, in run_evaluation
    eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 333, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 661, in evaluation_forward
    output = model(*args)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 86, in forward
    outputs = self.__gather_structured_result(outputs)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 101, in __gather_structured_result
    outputs = self.gather(outputs)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 141, in gather
    res = gather_map(outputs)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 129, in gather_map
    raise ValueError('All dicts must have the same number of keys')
ValueError: All dicts must have the same number of keys
Exception ignored in: <function tqdm.__del__ at 0x7f83fe2ecb80>
Traceback (most recent call last):
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/std.py", line 1087, in __del__
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/std.py", line 1294, in close
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/std.py", line 1472, in display
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/std.py", line 1090, in __repr__
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/std.py", line 1434, in format_dict
TypeError: cannot unpack non-iterable NoneType object

To Reproduce
Steps to reproduce the behavior:

  1. Get a simple model for classification. For example, I used a PyTorch ResNeXt model.
  2. Implement training and validation methods that return Result objects, as shown here:
def training_step(self, batch, batch_idx):
    """
    Lightning calls this inside the training loop
    :param batch:
    :return:
    """
    # forward pass
    x, y = batch['x'], batch['y']
    y_pred = self.forward(x)
    # calculate loss
    loss = self.loss(y_pred, y)
    result = ptl.TrainResult(loss)
    result.log('train_loss', loss, prog_bar=True)
    return result

def validation_step(self, batch, batch_idx):
    """
    Lightning calls this inside the validation loop
    :param batch:
    :return:
    """
    x, y = batch['x'], batch['y']
    y_pred = self.forward(x)
    # calculate loss
    loss = self.loss(y_pred, y)
    # calculate accuracy
    labels_hat = torch.argmax(y_pred, dim=1)
    accuracy = torch.sum(y == labels_hat).item() / (len(y) * 1.0)
    accuracy = torch.tensor(accuracy)
    if self.on_gpu:
        accuracy = accuracy.cuda(loss.device.index)
    # Checkpoint model based on validation loss
    result = ptl.EvalResult(early_stop_on=None, checkpoint_on=loss)
    result.log('val_loss', loss, prog_bar=True)
    result.log('val_acc', accuracy, prog_bar=True)
    return result

  3. Run the Trainer so that the training and evaluation steps execute a few times. The error will pop up at some random epoch; for me it usually appears within the first 20 epochs. Also, running Trainer.test() after a crashed epoch will probably fail with the same error.
Expected behavior
A more detailed error message. I think it has something to do with the Result objects, but I cannot easily get more detail, as I'm running the models on a remote server.

Environment
PyTorch Version (e.g., 1.0): 1.6.0
OS (e.g., Linux): Arch Linux
How you installed PyTorch (conda, pip, source): conda
Python version: 3.8.5
CUDA/cuDNN version: 11.0
GPU models and configuration: 2 x GeForce GTX 1080 12 GB
Any other relevant information: It worked on previous versions of pytorch-lightning. The error does not appear if I use 'ddp'.

This is likely because you are not dropping the last (incomplete) batch of the dataset at some point.
Try adding drop_last to the dataloaders:

DataLoader(..., drop_last=True)
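
For example, a minimal sketch of a val_dataloader hook with drop_last enabled (self.val_dataset and self.batch_size are placeholder names, not taken from the issue above):

from torch.utils.data import DataLoader

def val_dataloader(self):
    # Dropping the last incomplete batch keeps every batch the same size,
    # so DataParallel can split each one evenly across the GPUs.
    return DataLoader(
        self.val_dataset,            # placeholder dataset attribute
        batch_size=self.batch_size,  # placeholder batch size
        drop_last=True,
        num_workers=4,
    )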

The reason is as follows:

  1. Use 2 GPUs.
  2. Set a batch size of 5.
  3. GPU 0 then processes 3 items and GPU 1 processes 2 items, so there is a mismatch in the aggregation (see the sketch below).
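
Here is a minimal sketch of that uneven split, using torch.chunk to stand in for the scatter that DataParallel performs along the batch dimension:

import torch

batch = torch.randn(5, 3)              # a last batch with 5 samples
shards = torch.chunk(batch, 2, dim=0)  # split across 2 "GPUs"
print([s.shape[0] for s in shards])    # [3, 2] -> the replicas' outputs no longer line up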

A second, more general solution is to make sure that your batch size is a multiple of the number of GPUs you are using.
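
For instance, a small sketch (assuming torch is available) that rounds a requested batch size down to a multiple of the visible GPU count:

import torch

n_gpus = max(torch.cuda.device_count(), 1)
requested = 5
batch_size = (requested // n_gpus) * n_gpus or n_gpus  # e.g. 5 -> 4 with 2 GPUs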