Unable to find GPU on cluster?

Hi, I am trying to train an implementation of SimCLR using Lightning on a cluster where I have access to two GPUs, but neither is picked up by Lightning when the job is submitted through SLURM.

I’ve pasted the error message below:

(simclr) [nsk367@mycluster src]$ cat slurm-10012580.out
Traceback (most recent call last):
  File "train_simclr.py", line 246, in <module>
    cli_main()
  File "train_simclr.py", line 240, in cli_main
    trainer = pl.Trainer.from_argparse_args(args)
  File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 122, in from_argparse_args
    return argparse_utils.from_argparse_args(cls, args, **kwargs)
  File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/utilities/argparse_utils.py", line 50, in from_argparse_args
    return cls(**trainer_kwargs)
  File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
    return fn(self, **kwargs)
  File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 328, in __init__
    self.accelerator_connector.on_trainer_init(
  File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 111, in on_trainer_init
    self.trainer.data_parallel_device_ids = device_parser.parse_gpu_ids(self.trainer.gpus)
  File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 76, in parse_gpu_ids
    gpus = _sanitize_gpu_ids(gpus)
  File "/gpfs/scratch/nsk367/anaconda3/envs/simclr/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 144, in _sanitize_gpu_ids
    raise MisconfigurationException(f"""
pytorch_lightning.utilities.exceptions.MisconfigurationException: 
                You requested GPUs: [0, 1]
                But your machine only has: []

I am running torch 1.6.0 and lightning 1.0.2.

The SLURM command requests two GPUs, and I pass an additional argument:

python train_simclr.py --gpus 2

which leads to the error above. Happy to share more information if it helps.

Thank you!

EDIT:

At the top of the .py file, I run:

import torch
print(torch.cuda.device_count())

and it outputs 0, so I'm not sure the issue is entirely with Lightning. I have run jobs on this cluster before without seeing this error, though; it only appeared with this version of PyTorch / Lightning, in case that helps anyone pinpoint the root cause.
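For completeness, here is a slightly expanded version of that check (a diagnostic sketch; `CUDA_VISIBLE_DEVICES` is the standard CUDA variable, but whether and how SLURM sets it depends on the cluster configuration):

```python
import os

import torch

# Report what PyTorch can see from inside the job.
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())

# SLURM typically exports CUDA_VISIBLE_DEVICES listing the GPUs it allocated;
# if it is unset or empty, the job may not have been granted any GPUs at all.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
```

If `CUDA_VISIBLE_DEVICES` comes back unset while the submission script requests GPUs, that would suggest the problem is in the job allocation rather than in Lightning itself.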