Stuck on 8-GPU training setting


I am using the PyTorch Lightning framework to train a text-to-text Transformer model (google/mt5-base at main).

I trained it in 1-, 4-, 5-, and 8-GPU environments using DDP.
However, all of the 8-GPU and 5-GPU training runs got stuck and failed at the same point in a specific epoch (54).

This is the last log before the hang. Since it appears to be the end of an epoch, I assume training is stuck on data loading for the next epoch in the 8-GPU or 5-GPU environment.

This issue also occurred regardless of the num_workers setting in the DataLoader or the batch size (32 or 16).

Epoch 54: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 2921/2931 [43:38<00:08,  1.12it/s, loss=.., v_num=0]
Epoch 54: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 2925/2931 [43:41<00:05,  1.12it/s, loss=.., v_num=0]
Validating:  99%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰| 280/282 [02:32<00:01,  1.59it/s]
Epoch 54: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2931/2931 [44:01<00:00,  1.11it/s, loss=.., v_num=0]

Any comment or suggestion would be appreciated.

Thank you.

(Note: I posted this question to the PyTorch Forum as well; since I used PyTorch Lightning, I am also posting it here.)

In the past, we have observed some difficulty with a non-zero number of workers for some combinations of platform (OS) and PyTorch version. Could you try setting the number of workers to 0 and see if that helps?
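To illustrate the suggestion above: a minimal sketch of a DataLoader with num_workers=0, which loads batches in the main process instead of spawning worker processes. The dataset here is hypothetical, just for demonstration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical tiny dataset, for illustration only.
dataset = TensorDataset(torch.arange(12).float())

# num_workers=0 disables worker processes entirely, so batches are
# produced in the main process. This sidesteps worker-process issues
# seen on some OS / PyTorch version combinations.
loader = DataLoader(dataset, batch_size=4, num_workers=0)

batches = [batch[0] for batch in loader]
```

If the hang disappears with num_workers=0, the problem is likely in worker-process startup or teardown rather than in the model code.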

1 Like

This issue was caused by a mistake in my code.
I accidentally set a small value (54) for max_epochs; after correcting the max_epochs value, training works fine in DP and CPU environments.

DDP also hangs after reaching max_epochs, which seems like explainable but not normal behavior (currently I don't know why).
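For what it's worth, training stops once the epoch counter reaches max_epochs, so a run configured with max_epochs=54 ends exactly where the log above stops. A plain-Python sketch of that stopping logic (simplified and hypothetical, not Lightning's actual implementation):

```python
def run_training(max_epochs):
    """Simplified sketch of an epoch loop that stops at max_epochs."""
    epoch = 0
    while epoch < max_epochs:
        # ... train + validate one epoch here ...
        epoch += 1
    return epoch  # number of completed epochs

# With max_epochs=54, the loop exits after the 54th epoch (index 53),
# which can look like a "hang" if you expected training to continue.
```

The DDP case is different: processes that exit the loop at different times can leave the remaining ranks blocked in a collective operation, which would present as a hang rather than a clean shutdown.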

@jirka Thank you for the comment. I will try.