Unable to find GPU on cluster?

Hi, I am trying to train an implementation of SimCLR using Lightning on a cluster where I have access to two GPUs, but neither is picked up by Lightning when the job is submitted through SLURM.

I’ve pasted the error message below:

(simclr) [nsk367@mycluster src]$ cat slurm-10012580.out
Traceback (most recent call last):
  File "train_simclr.py", line 246, in <module>
  File "train_simclr.py", line 240, in cli_main
    trainer = pl.Trainer.from_argparse_args(args)
  File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 122, in from_argparse_args
    return argparse_utils.from_argparse_args(cls, args, **kwargs)
  File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/utilities/argparse_utils.py", line 50, in from_argparse_args
    return cls(**trainer_kwargs)
  File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
    return fn(self, **kwargs)
  File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 328, in __init__
  File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 111, in on_trainer_init
    self.trainer.data_parallel_device_ids = device_parser.parse_gpu_ids(self.trainer.gpus)
  File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 76, in parse_gpu_ids
    gpus = _sanitize_gpu_ids(gpus)
  File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 144, in _sanitize_gpu_ids
    raise MisconfigurationException(f"""
                You requested GPUs: [0, 1]
                But your machine only has: []

I am running torch 1.6.0 and lightning 1.0.2.

The SLURM command requests two GPUs, and I pass the additional argument

python train_simclr.py --gpus 2
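For reference, a sketch of the kind of submission script I am using (job name, CPU count, and the environment activation are placeholders; the GPU request is the `--gres=gpu:2` line):

```shell
#!/bin/bash
#SBATCH --job-name=simclr          # placeholder job name
#SBATCH --gres=gpu:2               # request two GPUs on one node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4          # placeholder CPU count
#SBATCH --output=slurm-%j.out

# activate the conda environment (name is a placeholder)
source activate simclr

python train_simclr.py --gpus 2
```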

which leads to the error above. Happy to share more information if it helps.

Thank you!


At the top of the .py file, I check for GPUs with

import torch
print(torch.cuda.device_count())

and that prints 0, so I'm not sure the issue is entirely with Lightning. I have run jobs on this cluster before and never saw this error until using this version of pytorch / lightning, in case that helps anyone spot the root cause.
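In case it's useful, here is a small standard-library-only snippet I can run inside the job to see which GPU-related environment variables SLURM actually sets (the variable names below are the usual SLURM/CUDA ones; which of them appear depends on the cluster's SLURM version and configuration):

```python
import os


def gpu_visibility():
    """Collect the environment variables that control GPU visibility in a SLURM job."""
    keys = ("CUDA_VISIBLE_DEVICES", "SLURM_JOB_GPUS", "SLURM_GPUS_ON_NODE")
    return {k: os.environ.get(k) for k in keys}


if __name__ == "__main__":
    for key, value in gpu_visibility().items():
        # value is None when SLURM did not set the variable for this job
        print(f"{key} = {value!r}")
```

If `CUDA_VISIBLE_DEVICES` is unset or empty inside the job, the GPUs were never allocated to the job in the first place, which would explain torch seeing zero devices.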

Hello, my apologies for the late reply. We are slowly deprecating this forum in favor of the built-in GitHub version. Could we kindly ask you to recreate your question there - Lightning Discussions