Multi-GPU training crashes after some time due to NVLink error (Xid 74)

I want to train my model on a dual-GPU setup using Trainer(gpus=2, strategy='ddp'). To my understanding, Lightning sets up distributed data-parallel training under the hood. Training starts as expected, but after a few iterations one of my GPUs crashes: nvidia-smi lists it as "GPU is lost", and syslog shows Xid error 74, which according to the NVIDIA documentation indicates a fatal NVLink error (on all four links in my case). Shortly after, the kernel log reports "GPU has fallen off the bus", and only a hard reboot restores the system. With a single GPU, training does not crash. Is this a problem with Lightning or with my system?

Thank you in advance

System:
2x RTX 3090 with NVLink bridge, 4 links at 14.062 GB/s each (per nvidia-smi nvlink -s)
Ubuntu 22.04, CUDA 11.7.99 with cuDNN 8.5.1, NCCL 2.14.3
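
In case it helps with diagnosis, here is the quick stdlib-only sketch I use to pull the Xid lines out of a saved dmesg/syslog dump (the helper name and the sample NVRM line format are my own assumptions; the exact message format can vary between driver versions):

```python
import re

# NVRM kernel messages for Xid errors typically look like:
#   NVRM: Xid (PCI:0000:21:00.0): 74, pid=4122, ...
# (this format is an assumption; adjust the pattern for your driver version)
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")

def find_xid_errors(log_text):
    """Return (pci_bus_id, xid_number) tuples found in a dmesg/syslog dump."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(log_text)]

if __name__ == "__main__":
    sample = "NVRM: Xid (PCI:0000:21:00.0): 74, pid=4122, Channel 00000008\n"
    print(find_xid_errors(sample))  # → [('PCI:0000:21:00.0', 74)]
```

Running this over the log from a crash shows only Xid 74 entries for the second GPU, nothing else beforehand.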