Performance Modeling of Distributed Training

If I am training a ResNet-50 model on ImageNet using DDP, which hardware specs do I need to know in order to determine whether I have exhausted the full compute power of my platform? In other words, suppose I use 8 RTX 3090 GPUs on a single node with DDP and achieve a training speed of 7 minutes/epoch: what should I check before adding more GPUs in order to further shorten the training time? Things I have in mind are the number of CPU workers per process, GPU memory bandwidth, etc. Anything else on top of this list?
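One quick diagnostic worth running before adding GPUs is to measure how much of each training step is spent waiting for the next batch versus doing forward/backward compute: if the data-loading wait dominates, the GPUs are starved and more of them won't help. Below is a minimal, framework-agnostic sketch of that measurement; `slow_loader` and `fake_train_step` are stand-ins I made up for the demo, and in a real run you would replace them with your `DataLoader` iterator and your actual training step.

```python
import time

def profile_step_breakdown(data_iter, compute_fn, num_steps=20):
    """Return (data_wait_fraction, compute_fraction) over num_steps.

    If data_wait_fraction is large, the input pipeline (disk, decoding,
    CPU workers) is the bottleneck rather than GPU compute.
    """
    data_wait = 0.0
    compute = 0.0
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = next(data_iter)        # time spent waiting on the loader
        t1 = time.perf_counter()
        compute_fn(batch)              # time spent in the training step
        t2 = time.perf_counter()
        data_wait += t1 - t0
        compute += t2 - t1
    total = data_wait + compute
    return data_wait / total, compute / total

# Toy demo: a "loader" that takes ~5 ms per batch and a "step" that
# takes ~20 ms, so compute should dominate (~80% of step time).
def slow_loader():
    while True:
        time.sleep(0.005)
        yield "batch"

def fake_train_step(batch):
    time.sleep(0.020)

wait_frac, compute_frac = profile_step_breakdown(slow_loader(), fake_train_step)
print(f"data wait: {wait_frac:.0%}, compute: {compute_frac:.0%}")
```

Note that on a real GPU workload the compute call must synchronize (e.g. `torch.cuda.synchronize()`) before the second timestamp, otherwise asynchronous kernel launches make the step look artificially fast.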