Topic | Replies | Views | Activity
DDP and pl.LightningDataModule parallelization issues | 1 | 503 | March 29, 2023
Single-node multi-GPU DeepSpeed training fails with CUDA OOM on Azure | 0 | 1460 | March 24, 2023
Parallelizing batch-size-1 fully-convolutional training on multiple GPUs (one triplet per GPU) | 1 | 408 | March 15, 2023
DistributedDataParallel multi-GPU barely faster than single GPU | 2 | 1198 | March 10, 2023
RAM held by workers after validation | 1 | 536 | March 10, 2023
SLURM runtime error due to "ntasks" variable | 3 | 1612 | March 6, 2023
Running DDP across two machines | 3 | 1271 | March 3, 2023
Multi-GPU/multi-node training with WebDataset | 3 | 3468 | March 2, 2023
Try...except statement with DDPSpawn | 2 | 413 | February 24, 2023
Cannot pickle torch._C.Generator object — multi-GPU training | 2 | 1993 | February 20, 2023
End all distributed processes after DDP | 4 | 1671 | February 10, 2023
Rank_zero_only callback in DDP | 2 | 1960 | January 30, 2023
Multi-GPU, TorchMetrics, incorrect aggregation | 0 | 448 | January 24, 2023
Multi-GPU training issue with DDP strategy: training hangs upon distributed GPU initialisation | 3 | 2957 | January 18, 2023
How to use multiple GPUs outside of `training_step`? | 3 | 869 | January 4, 2023
RuntimeError: Cannot re-initialize CUDA in forked subprocess | 6 | 6709 | December 15, 2022
0–1% GPU utilization when using 1 GPU, but higher GPU utilization with 2+ GPUs | 0 | 1057 | December 8, 2022
FullyShardedDataParallel: no memory decrease | 7 | 1505 | December 8, 2022
Multi-GPU training crashes after some time due to NVLink error (xid74) | 2 | 1327 | November 26, 2022
Difference between the checkpoint val_cer and real val_cer on the validation set | 0 | 372 | November 15, 2022
How to propagate errors asynchronously in distributed training | 1 | 751 | November 10, 2022
Training not proceeding | 0 | 812 | August 4, 2022
Collective mismatch at end of training epoch | 0 | 982 | July 30, 2022
How do I know I have fully utilized my GPUs? | 0 | 523 | July 25, 2022
DDP with multiple GPUs is not providing gains | 1 | 435 | June 30, 2022
How to initialize tensors on the right device when DDP is used | 0 | 703 | May 27, 2022
Accumulated gradients + DDP in contrastive learning? | 1 | 1108 | April 15, 2022
Is Lightning more memory-intensive than regular PyTorch? | 0 | 363 | April 5, 2022
Correct approach to calculate metrics in DDP setting | 1 | 1794 | April 4, 2022
Multi-GPU with SLURM failed at initialization | 1 | 1314 | April 4, 2022