Training hangs at Epoch 0 / 0% on TPU

CaraDuf · November 29, 2021, 2:37pm

Hi,

I am very new to PyTorch Lightning and to deep learning as well! I am converting a PyTorch project to Lightning. On Google Colab, when I run the trainer on CPU or GPU, it trains the model as expected. I haven't checked the output model yet, but training does run. The batch size finder and the initial learning rate finder both work, and fast_dev_run also runs smoothly.

But when I try to run it on TPU, it hangs at

Epoch 0: 0% 0/2 [00:00<?, ?it/s]

I tried with and without fast_dev_run, with 1 and 8 TPU cores, and with batch sizes of 32 and 2, but it always hangs there. I let it run for 45 minutes and it was still stuck at the same point. How can I find out where the code is hanging and what I have to change?
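
One generic way to see where a Python process is stuck is the standard library's faulthandler module, which can print every thread's traceback after a timeout. The following is only a minimal sketch: the helper names watch_for_hang and stop_watching are hypothetical, nothing here is specific to Lightning or TPUs, and it does not by itself explain the hang.

```python
import faulthandler
import sys


def watch_for_hang(timeout_s=60.0, file=None):
    """Dump the tracebacks of all threads every `timeout_s` seconds.

    Hypothetical helper name. `file` must be backed by a real file
    descriptor (sys.stderr is used when it is None).
    """
    target = file if file is not None else sys.stderr
    # repeat=True keeps dumping at each interval until cancelled, so a
    # long hang shows up as the same stack printed again and again.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def stop_watching():
    """Cancel the periodic dump once the suspicious call has returned."""
    faulthandler.cancel_dump_traceback_later()
```

Calling watch_for_hang() just before trainer.fit(...) and stop_watching() after it would print the stack of the calling process while it is blocked. With several TPU cores the training may run in separate worker processes, so the call would presumably have to be made inside code that runs in each worker; that part is an assumption and untested here.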

Thank you very much for your help.

Aviv_Alloni · February 23, 2022, 3:07pm

Same issue here.
Did you solve it?

kyitharheinjob · February 1, 2024, 8:17am

I am facing the same issue. Have you solved it?