How to do gradient descent inside a PL model

Hi, I’m having trouble implementing this model feature when using DDP, and need some hints.

I have a model that has an Adam optimizer. However, there is also a part inside the forward pass where I need to use gradient descent to find intermediate results. Is there a safe way of doing this? For example, if the forward pass is abstracted as x -> g -> y_hat, then g has to be obtained by minimizing some parametrized function f(x, y_hat). Any help is appreciated. Thanks!

Do you think you could write out what you are trying to do in plain PyTorch? It seems like the manual optimization functionality might be able to do what you would like.

Thanks for the reply. Basically, my model is learning a parametrized function L_hat to predict the final classification loss of the network. It then finds some optimal policy by minimizing the “imagined” loss L_hat. I imagined that gradient descent would be what I need for this?

A simple pseudo code of what happens is like this:

def forward(self, x):
    policy = nn.Parameter(...)  # some tensor initialization
    # point the policy optimizer at the freshly created parameter
    self.policy_optimizer.param_groups[0]['params'] = [policy]

    # Policy optimization segment
    for _ in range(self.max_iters_opt):
        self.policy_optimizer.zero_grad()
        loss = self.L_hat(x)  # the NN approximation of the loss
        loss.backward()
        ...

    # Then the classification
    labels = self.classification(x, policy)
...

Let me know if any of this is unclear!

I checked out the page you linked, and it looks like it has what I need, but it seems like we would need to drop the multiple optimizers? For my purposes, I already use 2 optimizers apart from the one for the policy, so I don’t know if it is still useful?
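For context, my configure_optimizers currently looks roughly like this (simplified; the module names, optimizer choices, and hyperparameters here are just placeholders):

import torch

def configure_optimizers(self):
    # Two "normal" optimizers for the main model parts, plus a dedicated
    # optimizer for the policy tensor; the latter gets re-pointed at the
    # freshly created parameter inside each forward pass.
    opt_conv = torch.optim.Adam(self.conv_layers.parameters(), lr=1e-3)
    opt_cls = torch.optim.Adam(self.classification.parameters(), lr=1e-3)
    self.policy_optimizer = torch.optim.SGD(
        [torch.nn.Parameter(torch.zeros(1))], lr=1e-2)  # dummy param, replaced later
    return opt_conv, opt_cls, self.policy_optimizer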

Yup this seems like it would do the trick:

def training_step(self, batch, batch_idx):
    x, y = batch  # for example

    policy = nn.Parameter(...)  # some tensor initialization
    # point the policy optimizer at the freshly created parameter
    self.policy_optimizer.param_groups[0]['params'] = [policy]

    # Policy optimization segment
    for _ in range(self.max_iters_opt):
        self.policy_optimizer.zero_grad()
        loss = self.L_hat(x)  # the NN approximation of the loss
        self.manual_backward(loss)
        ...

    labels = self.classification(x, policy)

The only difference in this case is that you have to replace loss.backward() with self.manual_backward(loss) so we can handle things like mixed precision for you.
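You also need to turn off automatic optimization so Lightning won’t call backward/step for you; depending on your Lightning version this is either a Trainer argument or an attribute on the LightningModule, roughly:

import pytorch_lightning as pl

# On the 1.0.x line this was a Trainer flag:
trainer = pl.Trainer(automatic_optimization=False)

# On newer releases it is set on the LightningModule instead:
class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False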

So just to make sure: any other optimizers provided in configure_optimizers() will still be handled as normal, right? Or do I also need to do manual_backward for them?

Do you also want to optimize your self.classification model? In that case you’d also need to do manual_backward on it as well.
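Something like this at the end of training_step (just a sketch; the loss function and optimizer indices are placeholders for whatever your configure_optimizers actually returns):

import torch.nn.functional as F

# ... after the inner policy loop in training_step ...
labels = self.classification(x, policy)
cls_loss = F.cross_entropy(labels, y)  # or whatever classification loss you use

# step the "normal" optimizers manually as well
opt_conv, opt_cls = self.optimizers()[:2]  # or self.trainer.optimizers on older versions
opt_conv.zero_grad()
opt_cls.zero_grad()
self.manual_backward(cls_loss)  # some versions also want the optimizer as a second argument
opt_conv.step()
opt_cls.step()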


Hi, I have just one more question. In reality, I do the gradient descent in a sub-module that doesn’t have the trainer attribute. When I call manual_backward in the sub-module, it complains:

if self.trainer.train_loop.automatic_optimization:
AttributeError: 'NoneType' object has no attribute 'train_loop'

I know this should be expected behavior for the parent module, since you want to make sure it has automatic_optimization turned off. However, my sub-module doesn’t have the trainer attribute; is there any workaround for this, e.g. a way to suppress this check?

Hmm, I am having a hard time understanding the issue here; could you post the full code you are working with? (just the relevant part of training_step)

Yeah, so my model has two different parts: the conv layers and the pooling layers. The part where the manual gradient descent happens is inside the pooling layers. This code:

def forward(self, batch, batch_idx):
    x, y = batch  # for example

    policy = nn.Parameter(...)  # some tensor initialization
    # point the policy optimizer at the freshly created parameter
    self.policy_optimizer.param_groups[0]['params'] = [policy]

    # Policy optimization segment
    for _ in range(self.max_iters_opt):
        self.policy_optimizer.zero_grad()
        loss = self.L_hat(x)  # the NN approximation of the loss
        self.manual_backward(loss)
        ...

    labels = self.classification(x, policy)

will be in the forward call of the pooling layers, and since there can be more than one pooling layer in the model, the gradient descent needs to stay inside the pooling layer. The problem is that the pooling layer doesn’t have a train_loop object.

Oh, are your pooling layers separate nn.Modules? Is this forward in your LightningModule? manual_backward can only be called within training_step, so I would suggest moving this logic there.

The pooling layers are separate LightningModules. I’ll try to move the gradient descent into the parent module’s training_step then? Or would adding a dummy training_step function for the pooling layer also work?

I don’t believe Lightning currently supports nesting LightningModules inside each other for training; I would recommend moving any and all gradient descent logic to training_step.
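Roughly, keep the pooling layers as plain nn.Modules and keep every manual_backward call on the LightningModule itself (it’s fine to factor it into a helper method, as long as it’s only called from training_step). A sketch with made-up names and a placeholder loss:

import torch
import torch.nn as nn
import pytorch_lightning as pl

class PoolingLayer(nn.Module):
    """Plain nn.Module: no trainer / manual_backward access in here."""
    def estimate_L(self, policy, x):
        # placeholder stand-in for the learned loss estimate L_hat
        return ((policy - x) ** 2).mean()

class ParentModel(pl.LightningModule):
    def __init__(self, max_iters_opt=5):
        super().__init__()
        self.max_iters_opt = max_iters_opt
        self.pool1 = PoolingLayer()
        self.policy_optimizer = torch.optim.SGD(
            [nn.Parameter(torch.zeros(1))], lr=0.1)  # re-pointed at each step

    def optimize_policy(self, pool, optimizer, x):
        # Inner gradient descent lives on the LightningModule, so
        # self.manual_backward can reach the trainer.
        policy = nn.Parameter(torch.zeros_like(x))  # placeholder init
        optimizer.param_groups[0]['params'] = [policy]
        for _ in range(self.max_iters_opt):
            optimizer.zero_grad()
            loss = pool.estimate_L(policy, x)
            self.manual_backward(loss, optimizer)  # newer versions take only the loss
            optimizer.step()
        return policy.detach()

    def training_step(self, batch, batch_idx):
        x, y = batch
        policy = self.optimize_policy(self.pool1, self.policy_optimizer, x)
        ...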


This is getting longer than I expected, should I open a new post about this?

I ran into another problem with gradients; this is my code so far:

    def pool(self, obj_pool, opts, opt_id, g, x):
        actions = []
        a_init_lst = obj_pool.get_action_init(g, x)

        for action_init in a_init_lst:
            if self.args['use_opt_policy']:
                self.action.data = action_init
                for i in range(self.max_iters):
                    loss = obj_pool.estimate_L(self.action, g, x)
                    self.manual_backward(loss, opts[opt_id])
                    opts[opt_id].step()
                    opts[opt_id].zero_grad()
                actions.append(self.action.data)
            else:
                actions.append(action_init)

        g, x, L_hat, L_est_A = obj_pool.execute_actions(g, x, a_init_lst, actions)
        return g, x, L_hat, L_est_A

    def forward(self, g, x, opts):
        x = self.conv1(g, x)
        g, x, L_hat_1, L_est_A_1 = self.pool(self.pool1, opts, 3, g, x)
        x = self.conv2(g, x)
        g, x, L_hat_2, L_est_A_2 = self.pool(self.pool2, opts, 4, g, x)
        x = self.conv3(g, x)
        x = self.readout(g, x)
        x = self.final_MLP(x)
        h = F.log_softmax(x, dim=1)
        return g, h, L_hat_1, L_hat_2, L_est_A_1, L_est_A_2

When I run the code, PyTorch complains with RuntimeError: grad can be implicitly created only for scalar outputs on the line where I call manual_backward.
However, since I compute the loss at every iteration of the policy search, I believe I shouldn’t need to retain the graph?

I checked online and people suggest it has something to do with DDP. I’m indeed using DDP in this case; maybe this contributed to the issue?

Hmm, maybe this is a bug? What is the output of estimate_L? It should be a scalar value, else it will fail (this is a PyTorch thing, not a Lightning thing).
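i.e. if estimate_L returns a per-sample/per-node tensor, reduce it before calling backward, for example:

loss_vec = obj_pool.estimate_L(self.action, g, x)  # e.g. shape (N,), one value per node
loss = loss_vec.mean()                             # reduce to a scalar
self.manual_backward(loss, opts[opt_id])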

My apologies, I pasted in the wrong error. That bug was indeed caused by returning multiple losses, and taking the mean of the loss solved it for me. The question I actually have is about: RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
So, as I posted in the previous explanation, I calculate the loss in each iteration and thus I think I shouldn’t need to retain the graph.

Hmm, from your code it seems you are only calling manual_backward once between zero_grad calls, so I’m not sure what the issue is. On what line is this error happening?
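For reference, the usual way that error shows up even when the loss is recomputed every iteration is when some tensor that requires grad is built once outside the loop and then reused inside it; a minimal plain-PyTorch illustration (not your code):

import torch

w = torch.randn(3, requires_grad=True)
base = (w * w).sum()              # graph built once, outside the loop
p = torch.zeros(3, requires_grad=True)
opt = torch.optim.SGD([p], lr=0.1)

for _ in range(3):
    opt.zero_grad()
    loss = base + (p ** 2).sum()  # reuses the already-freed `base` graph
    loss.backward()               # second iteration raises the "backward a second time" error
    opt.step()

# Fix: recompute `base` inside the loop, or detach it if you don't need
# gradients through it there, e.g. loss = base.detach() + (p ** 2).sum()

In your snippet it might be worth checking whether g, x, or a_init_lst already carry a graph from earlier layers when they enter the inner loop.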

It’s the line where I call manual_backward(loss, opts[opt_id]). Is there a way to debug why this is happening?

It only appears to be doing backward once. Does it work without DDP?

So if I switch to distributed_backend=None when initializing the Trainer and use gpus=1, this error happens:

  Traceback (most recent call last):
  File "main.py", line 49, in <module>
    trainer.fit(model, train_loader, val_loader)
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 439, in fit
    results = self.accelerator_backend.train()
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 54, in train
    results = self.train_or_test()
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
    results = self.trainer.train()
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 482, in train
    self.train_loop.run_training_epoch()
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 541, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 678, in run_training_batch
    self.trainer.hiddens
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 760, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 304, in training_step
    training_step_output = self.trainer.accelerator_backend.training_step(args)
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 62, in training_step
    output = self.__training_step(args)
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 68, in __training_step
    batch = self.to_device(batch)
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 113, in to_device
    return self.batch_to_device(batch, gpu_id)
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 72, in batch_to_device
    return model.transfer_batch_to_device(batch, device)
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/core/hooks.py", line 555, in transfer_batch_to_device
    return move_data_to_device(batch, device)
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/utilities/apply_func.py", line 125, in move_data_to_device
    return apply_to_collection(batch, dtype=dtype, function=batch_to)
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/utilities/apply_func.py", line 58, in apply_to_collection
    return elem_type([apply_to_collection(d, dtype, function, *args, **kwargs) for d in data])
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/utilities/apply_func.py", line 58, in <listcomp>
    return elem_type([apply_to_collection(d, dtype, function, *args, **kwargs) for d in data])
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/utilities/apply_func.py", line 49, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/pytorch_lightning/utilities/apply_func.py", line 122, in batch_to
    return data.to(device, **kwargs)
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/dgl/graph.py", line 3858, in to
    self.ndata[k] = F.copy_to(self.ndata[k], ctx)
  File "/afs/ece.cmu.edu/usr/xujinl/anaconda3/envs/CSD/lib/python3.7/site-packages/dgl/backend/pytorch/tensor.py", line 90, in copy_to
    if ctx.type == 'cpu':
AttributeError: 'int' object has no attribute 'type'

If I use the dp backend with 1 or 2 GPUs, the same error about backwarding twice appears. I’m still trying to figure out whether this is a bug on the DGL side or the Lightning side, but I’m a bit more inclined towards the latter.