Testing accuracy gap when training a resnet50 on ImageNet from scratch

azouaoui · February 2, 2021, 10:09am

Hi all,

I am not sure if this question belongs here but so far I have not received any help from the PyTorch forums community (see related post here: Testing accuracy gap when training a resnet50 on ImageNet from scratch - vision - PyTorch Forums)

I’m currently interested in reproducing some baseline image classification results using PyTorch.
My goal is to get a resnet50 model to reach a test accuracy as close as possible to the one reported in torchvision: torchvision.models — Torchvision 0.8.1 documentation (i.e. 76.15% top-1 accuracy).
In order to do that, I closely follow the setup from the official PyTorch examples repository: examples/main.py at main · pytorch/examples · GitHub.
Namely, I set:

seed=19
batch_size=256
lr=0.1
weight_decay=1e-4
SGD is using momentum=0.9
LR scheduler is StepLR, which decays the learning rate by a factor of 10 every 30 epochs
I train for 100 epochs (as opposed to 90 in the code above)
I use exactly the same data augmentation as the code above
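
As a rough illustration of the schedule listed above, here is a minimal pure-Python sketch of the learning rate at each epoch. It assumes the usual step decay rule (base_lr * gamma ** (epoch // step_size)), and the function name step_lr is hypothetical rather than taken from the linked code.

```python
def step_lr(epoch, base_lr=0.1, step_size=30, gamma=0.1):
    """Learning rate at a 0-indexed epoch under a step decay schedule.

    Assumption: the rate is multiplied by `gamma` once every `step_size`
    epochs, starting from `base_lr` (values taken from the list above).
    """
    return base_lr * gamma ** (epoch // step_size)


# Print the rate at the boundaries of a 100-epoch run.
for epoch in (0, 29, 30, 59, 60, 89, 90, 99):
    print(f"epoch {epoch:3d}: lr = {step_lr(epoch):.4g}")
```

With 100 epochs instead of 90, the last 10 epochs run at a fourth learning rate level (three decays in total).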

The only difference is that I’m leveraging PyTorch Lightning to seamlessly use 4 GPUs in Distributed Data Parallel mode on a single node.
However, I am only able to achieve 73.12% top-1 accuracy. I don’t want to draw conclusions from my other experiments given this gap on the standard baseline.

My question: has anyone tried and reproduced the torchvision numbers using the setup I described above?
From my reading of the resnet models source code, the pretrained weights could have been obtained by following this setup: ResNet v1.5 for PyTorch | NVIDIA NGC, where all the hyperparameters have been thoroughly tuned. Can someone confirm this? In that case, what top-1 accuracy should I expect on the val set when using a simpler setup (the one described above)?

I put together a Minimal Working Example here: GitHub - azouaoui-cv/resnet50-imagenet-baseline: Image classification baseline using ResNet50 on ImageNet. The training instructions are detailed there.

Have a good day

UPDATE:

I may have found the root cause for the test performance discrepancy.

In my implementation, the total batch size was 1024, because each of the 4 spawned processes used a batch size of 256. In the official PyTorch example, each process uses bs=256/N, where N is the number of processes (4 here). This means that I have to either adjust the batch size (i.e. set it to 64 per process) or tune the learning rate accordingly (i.e. set it higher initially, e.g. 0.4 when using 256 images per process).
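
The arithmetic behind these two options can be sketched as follows. This is only an illustration: the helper names are hypothetical, and the learning rate adjustment assumes a linear scaling rule (learning rate proportional to the total batch size), which is consistent with the 0.4 example above but is not confirmed as the right choice for this setup.

```python
def per_process_batch_size(global_batch_size, world_size):
    """Split a global batch size evenly across DDP processes."""
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must be divisible by world size")
    return global_batch_size // world_size


def linearly_scaled_lr(base_lr, base_batch_size, per_process_batch, world_size):
    """Scale the learning rate in proportion to the effective batch size.

    Assumption: linear scaling relative to a reference configuration
    (base_lr tuned for base_batch_size).
    """
    effective_batch_size = per_process_batch * world_size
    return base_lr * effective_batch_size / base_batch_size


# Option 1: keep lr=0.1 and shrink the per-process batch size.
print(per_process_batch_size(256, 4))

# Option 2: keep 256 images per process (1024 in total) and raise the lr.
print(linearly_scaled_lr(0.1, 256, 256, 4))
```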

I will keep this post updated once I get the final results.

teddy · February 2, 2021, 8:38pm

Glad to hear you found the discrepancy! Please do not hesitate to ask if you have any other questions - Teddy

Wenxuan_Guo · September 18, 2021, 3:52pm

Hi! We are also trying to reproduce the torchvision model results with Lightning. Even with the correct batch size for each DDP process, we were unable to reach the reported accuracy. Do you have any updates?

Blaizzy · January 14, 2022, 4:57am

Hi @Wenxuan_Guo! Did you manage to reproduce the experiment to the same accuracy? If not, what is missing?

Wenxuan_Guo · January 15, 2022, 6:16am

Hi @Blaizzy! My model training still wasn’t able to achieve the reported accuracy. I am not sure what is missing… Sorry that I can’t help more.

Blaizzy · January 19, 2022, 9:18am

No problem.

@Wenxuan_Guo I have a few questions:

What is your setup?
What was the accuracy you got in your last 3 experiments and which parameters did you use in each?

Wenxuan_Guo · January 19, 2022, 3:16pm

Hi @Blaizzy, I am so sorry, I don’t remember the exact training and validation accuracy. I used the exact implementation by torchvision, with these parameters: vision/references/classification at main · pytorch/vision · GitHub.
I used 8 GeForce RTX 2080 Ti GPUs.