DataParallel crash with uneven number of inputs

Any ideas to debug this issue?
It happens to me in many different models, after I refactored the result logging in the training_step, validation_step and test_step methods from the old dictionary-based returns to the new Result scheme, while training on two GPUs at the same time.

The error doesn't occur if I use `distributed_backend='ddp'` instead of `'dp'` on the Trainer.

:bug: Bug
When doing evaluation or test routines on the Trainer (either the evaluation at the end of a .fit epoch or calling .test directly),
it throws `ValueError: All dicts must have the same number of keys`.

After reading the error log I think it has something to do with the metric logging, but I can't figure out what exactly. The error appears very inconsistently across epochs and runs, so I'm looking for any ideas on how to get more detail and find the root of the issue.
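For context, the gather step in DataParallel merges the per-replica output dicts key by key. A minimal sketch of why mismatched key counts raise this error (plain dicts and a hypothetical `gather_map`, not the actual Lightning internals):

```python
# Hypothetical, simplified version of the key-count check performed when
# DataParallel gathers dict outputs from each GPU replica.
def gather_map(outputs):
    first = outputs[0]
    if isinstance(first, dict):
        if not all(len(d) == len(first) for d in outputs):
            raise ValueError('All dicts must have the same number of keys')
        # Merge per-replica values under each shared key.
        return {k: [d[k] for d in outputs] for k in first}
    return outputs

# Two replicas returning the same keys gather fine:
gather_map([{'val_loss': 0.5}, {'val_loss': 0.4}])

# A replica whose Result carries an extra key triggers the crash:
# gather_map([{'val_loss': 0.5}, {'val_loss': 0.4, 'val_acc': 0.9}])  # ValueError
```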

Stack Trace:

File "", line 263, in <module>
  File "", line 97, in train
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 1064, in fit
    results = self.accelerator_backend.train()
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/accelerators/", line 97, in train
    results = self.trainer.run_pretrain_routine(model)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 1239, in run_pretrain_routine
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 394, in train
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 516, in run_training_epoch
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 582, in run_evaluation
    eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 333, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 661, in evaluation_forward
    output = model(*args)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/torch/nn/modules/", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/overrides/", line 86, in forward
    outputs = self.__gather_structured_result(outputs)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/overrides/", line 101, in __gather_structured_result
    outputs = self.gather(outputs)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/overrides/", line 141, in gather
    res = gather_map(outputs)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/overrides/", line 129, in gather_map
    raise ValueError('All dicts must have the same number of keys')
ValueError: All dicts must have the same number of keys
Exception ignored in: <function tqdm.__del__ at 0x7f83fe2ecb80>
Traceback (most recent call last):
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/", line 1087, in __del__
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/", line 1294, in close
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/", line 1472, in display
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/", line 1090, in __repr__
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/tqdm/", line 1434, in format_dict
TypeError: cannot unpack non-iterable NoneType object

To Reproduce
Steps to reproduce the behavior:

Get a simple model for classification. For example, I used a PyTorch ResNeXt model.
Implement training and validation methods returning Result objects, as shown here:
```python
def training_step(self, batch, batch_idx):
    """Lightning calls this inside the training loop.

    :param batch:
    """
    # forward pass
    x, y = batch['x'], batch['y']
    y_pred = self.forward(x)
    # calculate loss
    loss = self.loss(y_pred, y)
    result = ptl.TrainResult(loss)
    result.log('train_loss', loss, prog_bar=True)
    return result

def validation_step(self, batch, batch_idx):
    """Lightning calls this inside the validation loop.

    :param batch:
    """
    x, y = batch['x'], batch['y']
    y_pred = self.forward(x)
    # calculate loss
    loss = self.loss(y_pred, y)
    # calculate accuracy
    labels_hat = torch.argmax(y_pred, dim=1)
    accuracy = torch.sum(y == labels_hat).item() / (len(y) * 1.0)
    accuracy = torch.tensor(accuracy)
    if self.on_gpu:
        accuracy = accuracy.cuda(loss.device.index)
    # checkpoint model based on validation loss
    result = ptl.EvalResult(early_stop_on=None, checkpoint_on=loss)
    result.log('val_loss', loss, prog_bar=True)
    result.log('val_acc', accuracy, prog_bar=True)
    return result
```

Run the Trainer so the training and evaluation steps complete a few times. The error will pop up at some random epoch; for me it usually appears within the first 20 epochs. Also, running Trainer.test() on a crashed epoch will probably fail with the same error.
Expected behavior
A more detailed error. I think it has something to do with the Result objects, but I cannot easily get more detail, as I'm running the models on a remote server.

PyTorch Version (e.g., 1.0): 1.6.0
OS (e.g., Linux): Arch Linux
How you installed PyTorch (conda, pip, source): conda
Python version: 3.8.5
CUDA/cuDNN version: 11.0
GPU models and configuration: 2 x GeForce GTX 1080 12 GB
Any other relevant information: It worked on previous versions of pytorch-lightning. The error doesn't occur if I use 'ddp'.

This is likely because you are not dropping the last (incomplete) batch of the dataset at some point.
Try adding drop_last=True to the dataloaders:

DataLoader(..., drop_last=True)
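As a quick illustration of what drop_last changes (a toy dataset and batch size, chosen only for the example):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 10 samples, batch size 4.
dataset = TensorDataset(torch.arange(10).float())

# Without drop_last the final batch is short (10 = 4 + 4 + 2), and
# DataParallel may split that short batch unevenly across the GPUs.
sizes = [len(b[0]) for b in DataLoader(dataset, batch_size=4)]
print(sizes)  # [4, 4, 2]

# With drop_last=True the incomplete final batch is discarded,
# so every batch that gets scattered has the full size.
sizes = [len(b[0]) for b in DataLoader(dataset, batch_size=4, drop_last=True)]
print(sizes)  # [4, 4]
```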

The reason is as follows:

  1. Use 2 GPUs.
  2. Set a batch size of 5.
  3. GPU 0 then processes 3 items and GPU 1 processes 2 items, so there is a mismatch in the aggregation.

The second general solution is to make sure that your batch size is a multiple of the number of GPUs you are using.
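The uneven split in step 3 can be seen with torch.chunk, which mirrors the chunk-style split that scatter performs (a rough illustration, not the exact scatter code):

```python
import torch

# A batch of 5 split across 2 workers comes out as ceil(5/2) = 3,
# then the remaining 2 -- an uneven split, as in the bug above.
batch = torch.arange(5)
print([len(c) for c in torch.chunk(batch, 2)])  # [3, 2]

# A batch size that is a multiple of the GPU count splits evenly:
batch = torch.arange(6)
print([len(c) for c in torch.chunk(batch, 2)])  # [3, 3]
```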