GPU and CPU multi processing setup function

Kun · September 24, 2020, 1:01am

Hi everyone, just a small question here.
I am trying to use Lightning with 4 GPUs, and I am getting some errors. We would like to know how to write a setup function that uses multiple CPUs and GPUs. Here is what we have done so far:

class MyDataset(object):
    def __init__(self):
        super().__init__()
        self.cfg = cfg  # cfg is assumed to be a config dict defined elsewhere in the script
        self.dm = LocalDataManager(None)
        self.rast = build_rasterizer(self.cfg, self.dm)

    def chunked_dataset(self, key: str):
        dl_cfg = self.cfg[key]
        dataset_path = self.dm.require(dl_cfg["key"])
        zarr_dataset = ChunkedDataset(dataset_path)
        zarr_dataset.open()
        return zarr_dataset

    # Here we define a custom method called 'train_data_loader':
    def train_data_loader(self):
        key = "train_data_loader"
        dl_cfg = self.cfg[key]
        zarr_dataset = self.chunked_dataset(key)
        agent_dataset = AgentDataset(self.cfg, zarr_dataset, self.rast)
        return DataLoader(
            agent_dataset,
            shuffle=dl_cfg["shuffle"],
            batch_size=dl_cfg["batch_size"],
            num_workers=dl_cfg["num_workers"],
        )

We have also tried this one:

CPUnum = 19
gpus = 4
trainer = pl.Trainer(num_processes=CPUnum, gpus=gpus, max_steps=500, min_epochs=3, max_epochs=10, default_root_dir=resul_dir)

Thank you very much!

teddy · September 24, 2020, 1:34am

Hey Kun,

PyTorch is usually smart about allocating resources across CPUs. The only CPU-related setting we usually recommend is on your DataLoader, which tends to work best with num_workers=[the number of CPU cores].
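As a rough illustration of the num_workers advice above, here is a minimal sketch that derives a worker count from the cores available to the process. The helper name suggested_num_workers and its reserve parameter are hypothetical, not part of PyTorch or Lightning, and the best value for a given pipeline may well differ.

```python
import os


def suggested_num_workers(reserve=0):
    """Return a DataLoader worker count based on the usable CPU cores.

    `reserve` (hypothetical knob) leaves some cores free for the main process.
    """
    try:
        # Cores this process is actually allowed to run on (Linux only).
        available = len(os.sched_getaffinity(0))
    except AttributeError:
        # Fallback for platforms without sched_getaffinity.
        available = os.cpu_count() or 1
    # Never go below one worker.
    return max(1, available - reserve)


# Intended use, with the names from the original post:
#   DataLoader(agent_dataset, batch_size=..., num_workers=suggested_num_workers())
```

Keep in mind that under DDP each GPU process builds its own DataLoader, so the workers are multiplied by the number of processes; dividing the core count by the number of GPUs may be a more sensible starting point there.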

Regarding GPUs, all you have to do is specify how many to use, which it appears you are already doing. If you want to select specific GPUs, you can optionally provide a list of device indices, e.g. gpus=[0, 2, 4].

If you are having issues with this, please provide some details about your PyTorch/PyTorch Lightning version as well as some error logs, and I’ll help you out!

Kun · September 24, 2020, 10:39pm

Hi Teddy, thank you very much for helping us. I will post our new code and the error messages we are getting:

This is our new code:

    CPUnum = 19
    GPUnum = 4
    trainer = pl.Trainer(num_processes=CPUnum, gpus=GPUnum, max_steps=500, min_epochs=3, max_epochs=10, default_root_dir=resul_dir)
    trainer.fit(model, train_dl, val_dl)

This is the error we are getting:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0,1,2,3]
Traceback (most recent call last):
  File "copy_of_mtp__lyfttry_baseline.py", line 418, in <module>
    model = train_model(my_dataset, gpus=GPUnum)
  File "copy_of_mtp__lyfttry_baseline.py", line 411, in train_model
    trainer.fit(model, train_dl, val_dl)
  File "/home/zeta/.conda/envs/l5/lib/python3.6/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/zeta/.conda/envs/l5/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1052, in fit
    self.accelerator_backend.train(model, nprocs=self.num_processes)
  File "/home/zeta/.conda/envs/l5/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_spawn_backend.py", line 43, in train
    mp.spawn(self.ddp_train, nprocs=nprocs, args=(self.mp_queue, model,))
  File "/home/zeta/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/zeta/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 149, in start_processes
    process.start()
  File "/home/zeta/.conda/envs/l5/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/zeta/.conda/envs/l5/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/zeta/.conda/envs/l5/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/zeta/.conda/envs/l5/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/zeta/.conda/envs/l5/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/zeta/.conda/envs/l5/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle google.protobuf.pyext._message.RepeatedCompositeContainer objects

teddy · September 26, 2020, 3:25am

Have you tried removing the num_processes flag?

Kun · September 26, 2020, 2:43pm

Hi @teddy, thank you very much for your reply. Indeed, removing it and also adding:
distributed_backend="ddp"
solved my problem.

So that I can learn from this, could you explain what the issue was and why these changes solved it? That way I will know next time.
Thanks!

teddy · September 28, 2020, 2:11pm

Hey @Kun, my apologies for the delay.

When you train on multiple GPUs using DDP, you want num_processes to be equal to the number of GPUs you are using. We set this for you when you specify distributed_backend="ddp".

I am not sure exactly what issue you faced, but when you specify multiple GPUs, we use DDP automatically, so perhaps the error was caused by the discrepancy in num_processes.
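To make the point about num_processes concrete, here is a small sketch that builds the Trainer arguments for the working configuration and rejects a process count that does not match the GPU count. The helper ddp_trainer_kwargs is hypothetical, and it assumes the Lightning version used in this thread, where Trainer accepts gpus and distributed_backend; it only handles gpus given as an integer count or a list of indices.

```python
def ddp_trainer_kwargs(gpus, num_processes=None, **extra):
    """Build keyword arguments for pl.Trainer with the ddp backend.

    Hypothetical helper: `gpus` is an int count or a list of device indices.
    """
    n_gpus = len(gpus) if isinstance(gpus, (list, tuple)) else int(gpus)
    # With DDP there should be one process per GPU, so a mismatching
    # num_processes (such as 19 processes for 4 GPUs) is flagged early.
    if num_processes is not None and num_processes != n_gpus:
        raise ValueError(
            f"num_processes={num_processes} does not match {n_gpus} GPU(s); "
            "leave num_processes unset and let Lightning choose it"
        )
    kwargs = dict(extra)
    kwargs.update(gpus=gpus, distributed_backend="ddp")
    return kwargs


# Intended use, mirroring the settings from the original post:
#   trainer = pl.Trainer(**ddp_trainer_kwargs(4, max_steps=500, min_epochs=3, max_epochs=10))
```

Note that num_processes is never forwarded to the Trainer here, which matches the fix that worked above: drop the flag and set distributed_backend="ddp".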

adriantre · October 15, 2020, 3:42pm

@Kun this section in the docs will likely provide some insight into why it happens and how to resolve the pickling error:
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#make-models-pickleable

Using the ddp backend will not yield that error, as you have already found out.
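For anyone landing here with the same traceback: the spawn-based backend starts workers through mp.spawn, which pickles the arguments it sends to each child process, so any unpicklable attribute reachable from the model triggers the TypeError. The sketch below reproduces that failure with the standard library only and shows one common workaround. The class names are hypothetical, and a threading.Lock stands in for the protobuf container from the traceback.

```python
import pickle
import threading


class Holder:
    """Stand-in for a model or dataset that stores an unpicklable attribute."""

    def __init__(self):
        self.name = "demo"
        # A lock is used only because it is a stdlib object that cannot be
        # pickled, much like the protobuf container in the traceback.
        self.handle = threading.Lock()


class PicklableHolder(Holder):
    """Same object, but the unpicklable attribute is dropped when pickling."""

    def __getstate__(self):
        state = self.__dict__.copy()
        state["handle"] = None  # do not send the handle to the child process
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.handle = threading.Lock()  # rebuild it on the receiving side


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, which spawn relies on."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False
```

Running can_pickle on the object you pass to the Trainer is a quick way to check whether it would survive being sent to a spawned process. The ddp backend launches separate script processes instead, so each one builds its own objects and nothing needs to be pickled.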