Train 2 epochs head, unfreeze / learning rate finder, continue training (fit_one_cycle)

In a transfer learning setting, I want to freeze the body and train only the head for 2 epochs. Then I want to unfreeze the whole network and run the learning rate finder before continuing training.

What I want to do is similar to FastAI’s fit_one_cycle.
To do the same with PyTorch Lightning, I tried the following:

from pytorch_lightning import Trainer

trainer = Trainer(max_epochs=2, min_epochs=0, auto_lr_find=True)
trainer.fit(model, data_module)  # FastAI: learn.fit_one_cycle(2)

trainer.max_epochs = 5
# model.unfreeze()  # allow the whole body to be trained
# trainer.tune(model)  # LR finder
trainer.fit(model, data_module)  # FastAI: learn.fit_one_cycle(3)

Unfortunately, this invokes the training loop for epoch 1 twice.
[Screenshot: the Trainer progress output shows epoch 1 running twice]

Any ideas on how to approach this?
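For context, a minimal sketch of the kind of LightningModule the snippet above assumes (the TransferModel name, the resnet18 backbone and the optimizer choice are just illustrative; model.unfreeze() in the snippet can simply be the built-in LightningModule.unfreeze()):

import torch
from torch import nn
import torchvision
import pytorch_lightning as pl


class TransferModel(pl.LightningModule):
    def __init__(self, lr: float = 1e-3, num_classes: int = 10):
        super().__init__()
        self.save_hyperparameters()  # exposes self.hparams.lr for the LR finder
        backbone = torchvision.models.resnet18(pretrained=True)
        self.body = nn.Sequential(*list(backbone.children())[:-1])  # everything but the classifier
        self.head = nn.Linear(backbone.fc.in_features, num_classes)
        self.body.requires_grad_(False)  # start with only the head trainable

    def forward(self, x):
        return self.head(self.body(x).flatten(1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # only parameters that currently require grad are optimized
        params = [p for p in self.parameters() if p.requires_grad]
        return torch.optim.Adam(params, lr=self.hparams.lr)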

Edit 1: Link to Colab demonstrating 2x epoch 1: Google Colab

Forwarded from Slack:

The first fit doesn’t advance self.trainer.current_epoch at the end of training, try doing +=1 in between

@s-rog Thank you! This seems to indeed solve the epoch logging issue.
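In code, that workaround looks roughly like this (a sketch; newer Lightning versions expose the counter via trainer.fit_loop instead, as noted at the end of this thread):

trainer = Trainer(max_epochs=2, min_epochs=0, auto_lr_find=True)
trainer.fit(model, data_module)   # head-only training (epochs 0 and 1)

trainer.current_epoch += 1        # advance past the last finished epoch
trainer.max_epochs = 5
trainer.fit(model, data_module)   # continue training (epochs 2-4)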

Now I’m trying to invoke trainer.tune(model) after the first 2 epochs of training.
However, this fails:

LR finder stopped early due to diverging loss.
Failed to compute suggesting for `lr`. There might not be enough points.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/tuner/lr_finder.py", line 340, in suggestion
    min_grad = np.gradient(loss).argmin()
  File "<__array_function__ internals>", line 6, in gradient
  File "/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py", line 1042, in gradient
    "Shape of array too small to calculate a numerical gradient, "
ValueError: Shape of array too small to calculate a numerical gradient, at least (edge_order + 1) elements are required.

Colab showing this issue: Google Colab

cc @SkafteNicki.

Not sure we tried using the LR finder after training like this, so we likely need to make tweaks on our end.

But second, depending on your model, freezing and unfreezing may not make a giant difference.

Take a look at our video that goes over this:
Supervised and self-supervised transfer learning (with PyTorch Lightning) - YouTube

You can also see a finetuning colab here where we show no real difference in performance for this model (in fact, unfreezing first converges faster).


I can confirm that the learning rate finder has never been tested in such a way, but I completely agree that we should probably support it.
I am planning to do some refactoring of the tuning interface within the next month, which will hopefully solve this problem.


I don’t think assigning max_epochs = 5 like this will work, because the Trainer state (e.g. global_step, loggers, etc.) is not reset. Try creating the Trainer instance again with max_epochs=5, reload the model weights if required after the first fit cycle, and then continue with .tune.
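A rough sketch of that suggestion, reusing the illustrative TransferModel from above with a placeholder datamodule dm (the checkpoint path and epoch counts are just examples):

import pytorch_lightning as pl

# stage 1: train only the head for 2 epochs
trainer = pl.Trainer(max_epochs=2)
trainer.fit(model, datamodule=dm)
trainer.save_checkpoint("stage1.ckpt")  # placeholder path

# stage 2: a fresh Trainer, so global_step, loggers, etc. start clean
model = TransferModel.load_from_checkpoint("stage1.ckpt")  # reload weights only
model.unfreeze()  # built-in LightningModule.unfreeze(): whole network trainable

trainer = pl.Trainer(max_epochs=3, auto_lr_find=True)  # 3 more epochs; a new Trainer counts from epoch 0 again
trainer.tune(model, datamodule=dm)  # runs the LR finder on the unfrozen model
trainer.fit(model, datamodule=dm)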

While the freezing was specific to my use-case, the lr-finder issue also appears without freezing involved.

I was also able to reproduce the issue by wrapping the LR finder in a Callback (which might be easier to test against, @SkafteNicki):
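Roughly, such a callback can look like this (a reconstruction for illustration, not my original code; the hook choice and the trainer.tuner.lr_find call are assumptions about the tuner API of the time):

import pytorch_lightning as pl

class LRFindCallback(pl.Callback):
    # illustrative: run the LR finder once regular training has finished
    def on_train_end(self, trainer, pl_module):
        lr_finder = trainer.tuner.lr_find(pl_module)  # assumed tuner API
        print("suggested lr:", lr_finder.suggestion())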

Since this seems to be an issue (or feature request), I made a bug report:
https://github.com/PyTorchLightning/pytorch-lightning/issues/4846


@williamfalcon Thank you for the video and Colab. I will take a look at it. The idea that freezing for an epoch or two helps with transfer learning came from the FastAI 2 course. I’ll check which approach works better in combination with the LR finder once I get it working.

@goku Thank you for the suggestion. I haven’t tried it yet, but it sounds like a viable workaround.

It seems current_epoch is now a property of the trainer.

I think it works to change it like the following, but I’m not sure if there are any side effects:

trainer.fit_loop.current_epoch += 1