Stuck on 8-GPU training setting


I am using the PyTorch Lightning framework to train a text-to-text Transformer model (google/mt5-base).

I trained it in 1-, 4-, 5-, and 8-GPU environments using DDP.
However, all of the 8-GPU and 5-GPU training attempts got stuck and failed at the same point in the same epoch (54).

Below is the last log output before the hang. Since it appears to be the end of an epoch, I assume training is getting stuck while loading data for the next epoch in the 8-GPU and 5-GPU environments.

The issue occurs regardless of the num_workers value in the DataLoader or the batch_size (32 or 16).
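For reference, my setup looks roughly like the following. This is only a minimal sketch: MyMT5Module and MyDataset are hypothetical stand-ins for my actual LightningModule and Dataset, and the hyperparameters are placeholders, not my exact code.

```python
# Minimal sketch of the training setup (placeholder names, not exact code).
import pytorch_lightning as pl
from torch.utils.data import DataLoader

# MyMT5Module wraps google/mt5-base; MyDataset is my text-to-text dataset.
# Both are hypothetical names standing in for my real classes.
model = MyMT5Module()
train_loader = DataLoader(MyDataset("train"), batch_size=32, num_workers=4)
val_loader = DataLoader(MyDataset("val"), batch_size=32, num_workers=4)

trainer = pl.Trainer(
    gpus=8,             # also tried 1, 4, and 5
    accelerator="ddp",  # newer Lightning versions use strategy="ddp" instead
    max_epochs=100,
)
trainer.fit(model, train_loader, val_loader)
```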

Epoch 54: 100%|█████████▉| 2921/2931 [43:38<00:08,  1.12it/s, loss=.., v_num=0]
Epoch 54: 100%|█████████▉| 2925/2931 [43:41<00:05,  1.12it/s, loss=.., v_num=0]
Validating:  99%|█████████▉| 280/282 [02:32<00:01,  1.59it/s]
Epoch 54: 100%|██████████| 2931/2931 [44:01<00:00,  1.11it/s, loss=.., v_num=0]

Any comment or suggestion would be appreciated.

Thank you.

(Note: I also posted this question to the PyTorch forum; since I am using PyTorch Lightning, I am posting it here as well.)