Testing accuracy gap when training a resnet50 on ImageNet from scratch

Hi all,

I am not sure whether this question belongs here, but so far I have not received any help from the PyTorch forums community (see related post here: Testing accuracy gap when training a resnet50 on ImageNet from scratch - vision - PyTorch Forums).

I’m currently interested in reproducing some baseline image classification results using PyTorch.
My goal is to get a resnet50 model to reach a test accuracy as close as possible to the one reported in torchvision: torchvision.models — Torchvision 0.8.1 documentation (i.e. 76.15% top-1 accuracy).
In order to do that, I closely follow the setup from the official PyTorch examples repository: examples/main.py at master · pytorch/examples · GitHub.
Namely, I set:

  • seed=19
  • batch_size=256
  • lr=0.1
  • weight_decay=1e-4
  • SGD is using momentum=0.9
  • LR scheduler is StepLR, which divides the learning rate by 10 every 30 epochs
  • I train for 100 epochs (as opposed to 90 in the code above)
  • I use exactly the same data augmentation as the code above
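
To make the schedule above concrete, here is a minimal sketch in plain Python of the learning rate each epoch would see. It assumes the usual closed form of a StepLR-style decay (base_lr * gamma ** (epoch // step_size)) with gamma=0.1 and step_size=30. The helper name `step_lr` is hypothetical and only for illustration; it is not part of PyTorch or of my training code.

```python
# Sketch of the step schedule listed above, assuming a StepLR-style decay.
# `step_lr` is a hypothetical helper written for illustration only.

def step_lr(epoch, base_lr=0.1, step_size=30, gamma=0.1):
    """Return the learning rate in effect during `epoch` (0-indexed)."""
    # The rate is multiplied by `gamma` once per completed block of `step_size` epochs.
    return base_lr * gamma ** (epoch // step_size)


if __name__ == "__main__":
    # Print the rate at the start and end of each 30-epoch block of a 100-epoch run.
    for epoch in (0, 29, 30, 59, 60, 89, 90, 99):
        print(f"epoch {epoch:3d}: lr = {step_lr(epoch):.0e}")
```

Under these assumptions, training for 100 epochs instead of 90 adds a final 10 epochs (90 to 99) at a learning rate of about 1e-4.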

The only difference is that I’m leveraging PyTorch Lightning to seamlessly use 4 GPUs in Distributed Data Parallel mode on a single node.
However, I am only able to achieve 73.12% top-1 accuracy. I don’t want to draw conclusions from my other experiments given this gap on the standard baseline.

My question: has anyone tried and reproduced the torchvision numbers using the setup I described above?
From my reading of the resnet models source code, the pretrained weights could have been obtained by following this setup: NVIDIA NGC, where all the hyperparameters have been thoroughly tuned. Can someone confirm this? If so, what top-1 accuracy should I expect on the val set when using a simpler setup (the one described above)?

I put together a Minimal Working Example here: GitHub - inzouzouwetrust/resnet50-imagenet-baseline: Image classification baseline using ResNet50 on ImageNet, where the training instructions are detailed.

Have a good day :slight_smile:


I may have found the root cause for the test performance discrepancy.

In my implementation, the total batch size was 1024, since 4 processes were spawned and each one used a batch size of 256. In the official PyTorch example, each process uses bs=256/N, where N is the number of processes (4 here). This means I have to either adjust the batch size (i.e. set it to 64 per process) or scale the learning rate accordingly (i.e. start with a higher value, e.g. 0.4 when using 256 images per process).
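
The arithmetic behind the two options can be sketched as follows. This is only an illustration: both function names are hypothetical, and the second one assumes the common linear scaling heuristic (learning rate proportional to the total batch size), which is consistent with the 0.4 value above but is an assumption, not something taken from the official example.

```python
# Sketch of the two possible fixes. Both helpers are hypothetical, for illustration only.

def per_process_batch_size(total_batch_size, num_processes):
    """Option 1: split a fixed total batch size evenly across DDP processes."""
    if total_batch_size % num_processes != 0:
        raise ValueError("total batch size must be divisible by the number of processes")
    return total_batch_size // num_processes


def linearly_scaled_lr(base_lr, base_batch_size, per_process_batch, num_processes):
    """Option 2: keep the per-process batch and scale the learning rate.

    Assumes the linear scaling heuristic: lr grows in proportion to the
    total (effective) batch size relative to the reference batch size.
    """
    effective_batch_size = per_process_batch * num_processes
    return base_lr * effective_batch_size / base_batch_size


if __name__ == "__main__":
    # Option 1: total batch of 256 over 4 processes.
    print(per_process_batch_size(256, 4))
    # Option 2: 256 images per process on 4 processes, reference lr=0.1 at batch 256.
    print(linearly_scaled_lr(0.1, 256, 256, 4))
```

With 4 processes, option 1 gives 64 images per process at lr=0.1, and option 2 keeps 256 images per process (1024 in total) with a starting learning rate of about 0.4.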

I will keep this post updated once I get the final results.

Glad to hear you found the discrepancy! Please do not hesitate to ask if you have any other questions - Teddy