Why might speed stay the same when moving from 1 GPU to 8 GPUs (DDP)?

I’m not seeing any speed increase when moving from 1 GPU to 8 GPUs and switching Lightning’s distributed backend to DDP; sometimes training even gets slower. Any ideas why this might happen in general? I have num_workers set to 32 and pin_memory=True in my DataModules/dataloaders.
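
For reference, here’s a minimal sketch of the kind of setup I mean (the model and datamodule below are placeholders, not my actual code, and the Trainer arguments may differ by Lightning version):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class RandomDataModule(pl.LightningDataModule):
    def train_dataloader(self):
        # The dataloader settings mentioned above: many workers + pinned memory.
        dataset = TensorDataset(torch.randn(10_000, 32), torch.randn(10_000, 1))
        return DataLoader(dataset, batch_size=256, num_workers=32, pin_memory=True)


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


if __name__ == "__main__":
    trainer = pl.Trainer(
        max_epochs=1,
        gpus=8,                     # 1 vs. 8 GPUs is the only thing I change
        distributed_backend="ddp",  # newer Lightning: accelerator="gpu", devices=8, strategy="ddp"
    )
    trainer.fit(TinyModel(), datamodule=RandomDataModule())
```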

Anything I can do in Lightning to diagnose/fix this? (I’m aware of the profiler but not sure how to make it useful here.)
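
For context, this is roughly what I mean by enabling the profiler (a sketch, not my exact code; the available profiler options may differ by Lightning version):

```python
import pytorch_lightning as pl

# "simple" prints a per-hook timing summary at the end of fit
# (e.g. time spent fetching batches vs. running training_step);
# "advanced" gives cProfile-style output instead.
trainer = pl.Trainer(
    gpus=8,
    distributed_backend="ddp",
    profiler="simple",
)
```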

In theory you should see close to an 8x speedup, since the training data is split across the GPUs (a rough illustration of the scaling arithmetic is below). Could you share some code so we can help debug the issue?
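
As a rough sanity check (the numbers here are made up, just to illustrate what "no speedup" means in terms of scaling efficiency):

```python
# Illustrative arithmetic: with ideal data parallelism an epoch's work is split
# 8 ways, so per-epoch time should drop to roughly t_1 / 8 (plus gradient
# all-reduce and other overhead).
t_1_gpu = 800.0                  # seconds per epoch on 1 GPU (made-up value)
n_gpus = 8
ideal_t_8 = t_1_gpu / n_gpus     # ~100 s in the perfect case
observed_t_8 = 790.0             # "no speedup", as described in the question
print(f"scaling efficiency: {ideal_t_8 / observed_t_8:.0%}")  # ~13%
```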

(I’m working on a script that reproduces this so I can share it, and I’ll update here once I have it. Thanks for the response.)