lr_scheduler.OneCycleLR "ValueError: Tried to step X+2 times. The specified number of total steps is X."

Hi folks,

I’m quite new to PyTorch and Lightning, and I seem to be running into the same error as detailed below when using lr_scheduler.OneCycleLR, but none of the solutions there apply to my problem:

My optimizer configuration and training step are as follows. Perhaps there is an implementation error?

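def configure_optimizers(self):
    optimizer = optim.Adam(self.parameters())  # any further arguments were cut off in the original post
    lr_scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=...,  # required argument; its value was cut off in the original post
        total_steps=self.epochs * self.steps_per_epoch,
        div_factor=1,
    )
    # "interval": "step" asks Lightning to step the scheduler after every batch
    scheduler = {"scheduler": lr_scheduler, "interval": "step"}

    return [optimizer], [scheduler]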

def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    y_hats = torch.split(y_hat, 9) # split every 9 frames (i.e. each sample)
    y_hat = torch.stack([torch.mean(yhat, 0) for yhat in y_hats]) # take all yhat predictions and average over all 9 subframes per sample
    loss = self.loss_func(y_hat, y)
    result = pl.TrainResult(loss)
    result.log('train_loss:', loss, on_step=False, on_epoch=True, prog_bar=True, logger=True)
    return result

The error is shown below. It always occurs at the end of training and always reports 2 extra steps, so it seems that scheduler.step() is called 2 too many times regardless of the number of training epochs. If anyone has any ideas for solutions I would be grateful, thanks.

ValueError: Tried to step 402 times. The specified number of total steps is 400

How are you assigning self.steps_per_epoch? Perhaps this is the wrong value.

I doubt that this is the problem, but perhaps the 2 extra steps are coming from the 2 steps performed during the sanity check. You could check if this is the case by setting trainer = Trainer(num_sanity_val_steps=0).

Hi, thanks for your response.

I’m assigning self.steps_per_epoch by passing len(data.train_dataloader()) which is created using a data module. I’ve checked this and it always gives the correct length such that epochs * steps_per_epoch = total_steps without the extra two steps in the error.

Also, I’ve tried what you suggested with trainer = Trainer(num_sanity_val_steps=0), but alas, no change. It seems that somewhere in the training loop an extra 2 steps are being called, as in the PyTorch forums thread. The OP there was training twice, but I’m not.

There’s a suggestion in that thread:

For debugging purposes you could add a counter and print its accumulated value for each scheduler.step() call.

How would I do this using Lightning?

Hmm. I’m not sure if there is a way to do this in Lightning. Here is the place in Lightning where scheduler.step() is called. Perhaps you can clone the repo, add a print/accumulator there, and try to debug the issue? In the meantime I will see if I can reproduce the issue myself.
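A lighter-weight alternative to editing the Lightning source might be to wrap the scheduler's step method yourself before returning the scheduler from configure_optimizers. The sketch below is only an illustration: count_step_calls is a hypothetical helper, and FakeScheduler is a stand-in (not a PyTorch class) that mimics the failure mode so the snippet runs without PyTorch. It assumes the framework calls step() on the very object you return.

```python
import functools


def count_step_calls(scheduler, log=print):
    """Wrap scheduler.step so that every call is counted and reported."""
    original_step = scheduler.step

    @functools.wraps(original_step)
    def counting_step(*args, **kwargs):
        counting_step.calls += 1
        log(f"scheduler.step() call #{counting_step.calls}")
        return original_step(*args, **kwargs)

    counting_step.calls = 0
    scheduler.step = counting_step  # shadows the method on this instance only
    return scheduler


class FakeScheduler:
    """Hypothetical stand-in: raises once it is stepped past total_steps."""

    def __init__(self, total_steps):
        self.total_steps = total_steps
        self.steps_taken = 0

    def step(self):
        self.steps_taken += 1
        if self.steps_taken > self.total_steps:
            raise ValueError(
                f"stepped {self.steps_taken} times, budget is {self.total_steps}"
            )


# Silence per-call logging here; pass log=print to see every call.
scheduler = count_step_calls(FakeScheduler(total_steps=3), log=lambda msg: None)
try:
    for _ in range(5):
        scheduler.step()
except ValueError as err:
    print(f"failed on call #{scheduler.step.calls}: {err}")
```

With a real scheduler, the idea would be to call count_step_calls(lr_scheduler) just before building the {"scheduler": ..., "interval": "step"} dict. Any step() calls made before wrapping (for example inside the scheduler's constructor) are not counted, and because the wrapper is stored on the instance it may interfere with saving the scheduler's state, so this is best kept to a short debugging run.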

OK, I think I’ve just figured out what I was doing wrong: I wasn’t passing a max_epochs argument when instantiating Trainer(), because I was defining the number of epochs inside the Lightning Module for the LR scheduler instead. So the Trainer wanted to keep training after the LR scheduler had already reached its total number of steps, which caused the extra-step error.

Would the following be the best way to pass it to the Trainer?
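To make the mismatch concrete, here is a small sketch in plain Python. It assumes the scheduler is stepped once per training batch (as with "interval": "step") and tolerates at most total_steps calls to step(). The helper name first_overrun_call and the 10 x 40 split are illustrative assumptions; the thread only gives the total of 400.

```python
from typing import Optional


def first_overrun_call(
    total_steps: int, trainer_epochs: int, steps_per_epoch: int
) -> Optional[int]:
    """Return the 1-based step() call that exceeds the scheduler's budget.

    Returns None when the trainer stops before the budget is exhausted.
    """
    planned_calls = trainer_epochs * steps_per_epoch  # one step() per batch
    if planned_calls <= total_steps:
        return None
    return total_steps + 1


scheduler_budget = 10 * 40  # epochs * steps_per_epoch, as configured in the module

# Trainer and scheduler agree on the number of epochs: no overrun.
print(first_overrun_call(scheduler_budget, trainer_epochs=10, steps_per_epoch=40))

# Trainer runs longer than the scheduler was told about: overrun on the next call.
print(first_overrun_call(scheduler_budget, trainer_epochs=11, steps_per_epoch=40))
```

Under these assumptions the failure always lands on the first call past the budget, however many extra epochs the Trainer would have run, which fits the error showing the same overshoot regardless of the configured number of epochs.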

    net = audioNet16k(num_classes=num_classes, 
    trainer = Trainer(gpus=1, 

That should work! Glad you figured it out :slight_smile:
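One way to keep the two values from drifting apart again is to derive both the scheduler's total_steps and the Trainer's max_epochs from a single place. This is only a sketch: RunConfig is a hypothetical helper, the numbers are illustrative, and the epochs / steps_per_epoch keyword names for the model are assumptions based on the attributes used in configure_optimizers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Single source of truth for the length of a training run."""

    max_epochs: int
    steps_per_epoch: int

    @property
    def total_steps(self) -> int:
        # The budget the LR scheduler should be created with.
        return self.max_epochs * self.steps_per_epoch

    def trainer_kwargs(self) -> dict:
        # Keyword arguments to forward to the Trainer.
        return {"max_epochs": self.max_epochs}


config = RunConfig(max_epochs=10, steps_per_epoch=40)

# Hypothetical usage, mirroring the names in the thread:
#   net = audioNet16k(..., epochs=config.max_epochs,
#                     steps_per_epoch=config.steps_per_epoch)
#   trainer = Trainer(gpus=1, **config.trainer_kwargs())
print(config.total_steps, config.trainer_kwargs())
```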