Multi-GPU training just hangs

Hello,

I’m trying to use the 4 GPUs on my machine to train a HuggingFace model for a project. Single-GPU training with 32-bit precision works without any problems (16-bit is not working, and I’ve asked a question about it here). Multi-GPU training with 32-bit precision just hangs. I’m running this in a Jupyter notebook, and I saw this error in the terminal:

Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp-7c85b1e2.so.1 library.
        Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
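
For context, this is roughly how I’m launching training. The trainer settings below are placeholders for my actual code, and I’m assuming the accelerator argument spelling (on PL 1.1 the older distributed_backend keyword also exists):

    import pytorch_lightning as pl

    # `EmailAuthorSequenceClf` is my LightningModule wrapping the
    # HuggingFace model; `dm` is my LightningDataModule.
    clf = EmailAuthorSequenceClf()

    trainer = pl.Trainer(
        gpus=4,                   # works fine with gpus=1
        accelerator="ddp_spawn",  # what a Jupyter notebook falls back to
        precision=32,
    )
    trainer.fit(clf, datamodule=dm)  # hangs here with 4 GPUs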

The MKL error seems to be a PyTorch issue that is discussed here, and a couple of solutions have been proposed, one of which is to import numpy before torch.multiprocessing. I am importing numpy first (and I don’t actually import torch.multiprocessing directly), but I’m not sure how PL handles it.
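
In case it’s useful, here is a minimal sketch of the workaround as I understand it; the GNU threading-layer value is my assumption from the error message, and it has to run before anything imports MKL:

    import os

    # Must be set before numpy/torch (and therefore MKL) are imported,
    # so this goes at the very top of the notebook.
    os.environ["MKL_THREADING_LAYER"] = "GNU"      # sidestep the libgomp clash
    # os.environ["MKL_SERVICE_FORCE_INTEL"] = "1"  # the alternative the message suggests

    import numpy as np  # imported before torch, per the proposed fix
    import torch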

I’m using PyTorch 1.7.1 and PL 1.1.2. Has anyone else run into this problem? Is there a solution?

Thanks.

I set MKL_SERVICE_FORCE_INTEL=1, but still got the following error:

Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp-7c85b1e2.so.1 library.
        Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/miniconda3/envs/eml/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/home/miniconda3/envs/eml/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'EmailAuthorSequenceClf' on <module '__main__' (built-in)>

Has anyone else run into this problem? I have 4 GPUs but am not able to use them :frowning:
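
Reading the traceback again, the spawned worker re-imports __main__ and can’t find my class, which is a known limitation of spawn-based multiprocessing when the class is defined inside the notebook itself. Here is a sketch of the usual workaround, moving the class into an importable file (the file name email_clf.py is hypothetical):

    # email_clf.py -- a module the spawned workers can import, unlike
    # classes defined directly in the notebook's __main__.
    import pytorch_lightning as pl
    import torch

    class EmailAuthorSequenceClf(pl.LightningModule):
        def __init__(self):
            super().__init__()
            # ... HuggingFace model setup goes here ...

        def training_step(self, batch, batch_idx):
            ...

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters())

    # In the notebook:
    #     from email_clf import EmailAuthorSequenceClf
    # so the child process can resolve the class by its module path
    # instead of failing with "Can't get attribute ... on '__main__'".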

This does not seem to be related to PTL itself. Mind sharing a full example in Colab? Does it also happen on a single GPU? :rabbit:

Sorry! :frowning: It was a silly mistake on my part :pleading_face: It’s fixed now!


Just out of curiosity, mind sharing what the problem was?