I am using the PyTorch Lightning framework to train a text-to-text Transformer model (google/mt5-base).
I trained it in 1-, 4-, 5-, and 8-GPU environments using DDP.
However, every 8-GPU and 5-GPU training attempt hangs and fails at the same point in a specific epoch (54).
Below is the last log before the hang. Since it appears to be the end of an epoch, I assume training is stuck on data loading for the next epoch in the 8-GPU or 5-GPU environment.
The issue occurs regardless of the DataLoader settings or the batch size (32 or 16).
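For context, the DataLoader is set up roughly as follows (a minimal sketch; the dataset here is a placeholder standing in for the actual tokenized text pairs, and the exact argument values are illustrative, not my real code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the tokenized text-to-text pairs
dataset = TensorDataset(torch.zeros(64, 8, dtype=torch.long))

# Under DDP, Lightning injects a DistributedSampler automatically.
# Worker processes are re-spawned at every epoch boundary unless
# persistent_workers=True, which is one place an epoch-boundary hang can hide.
loader = DataLoader(
    dataset,
    batch_size=16,            # also tried 32
    num_workers=4,
    pin_memory=True,
    persistent_workers=True,  # keep workers alive across epochs
)

print(len(loader))  # 64 samples / batch_size 16 = 4 batches
```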
Epoch 54: 100%|█████████▉| 2921/2931 [43:38<00:08, 1.12it/s, loss=.., v_num=0]
Epoch 54: 100%|█████████▉| 2925/2931 [43:41<00:05, 1.12it/s, loss=.., v_num=0]
Validating: 99%|█████████▉| 280/282 [02:32<00:01, 1.59it/s]
Epoch 54: 100%|██████████| 2931/2931 [44:01<00:00, 1.11it/s, loss=.., v_num=0]
Any comment or suggestion would be appreciated.
(Note: I originally posted this question on the PyTorch forum; since I am using PyTorch Lightning, I am also posting it here.)