Error with 16-bit mixed precision

Hello,

I’m training a BERT model for sequence classification (from HF Transformers). I am using 16-bit precision and have run into the following error:

AssertionError: Attempted step but _scale is None.  This may indicate your script did not use scaler.scale(loss or outputs) earlier in the iteration.

However, 32-bit runs without any problems. I am using PyTorch Lightning version 1.1.2. I should note that I ran similar code in an earlier version (I don’t remember which one, but it was > 1.0.0) and didn’t run into any problems with 16-bit training. Here are the training arguments:

from argparse import Namespace
from pathlib import Path

from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
from pytorch_lightning.loggers import CSVLogger
# PrintTableMetricsCallback lives in Lightning Bolts (assuming pl_bolts is installed)
from pl_bolts.callbacks import PrintTableMetricsCallback

early_stop_callback = EarlyStopping(
  monitor='val_loss',
  min_delta=0.0,
  patience=5,
  verbose=False,
  mode='min'
)

logger = CSVLogger(
  save_dir=f'{model_dir}',
  name=None,
)

checkpoint_callback = ModelCheckpoint(
  filepath=Path(f'{logger.log_dir}/checkpoints')/'{epoch}-{val_loss:0.3f}-{val_accuracy:0.3f}',
  save_top_k=3,
  monitor='val_loss',
  verbose=True,
  mode='min',
  prefix=''
)
callbacks = [
  PrintTableMetricsCallback(),
]

trainer_args = Namespace(
  progress_bar_refresh_rate=1,
  max_epochs=2,
  gpus=1,
  accumulate_grad_batches=1,
  precision=16,
  overfit_batches=0.1,
  checkpoint_callback=checkpoint_callback,
  logger=logger,
  callbacks=callbacks,
  fast_dev_run=True,
  reload_dataloaders_every_epoch=True,
)
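
For completeness, I unpack these into the Trainer in the usual way, roughly like this (model stands in for my LightningModule, which isn’t shown here):

from pytorch_lightning import Trainer

# unpack the Namespace into keyword arguments for the Trainer
trainer = Trainer(**vars(trainer_args))
trainer.fit(model)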

I’ll post the code for the model if required. I’d like to train 16-bit models instead of 32-bit to increase my batch size.

Thanks.

Could you please open an issue with a full example so we can try to reproduce it on our end?

This issue was tied to the other issue, caused by a silly mistake on my part! It got solved.

FYI, the silly mistake: I copied over the Lightning module that I had created for a similar project. When I copied the training_step, I forgot to include return loss. So basically, training was running with no loss at all.
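
For anyone who hits the same assertion: with precision=16, the loss returned from training_step is what Lightning passes through torch.cuda.amp’s GradScaler, so (as far as I can tell) when training_step returns nothing, scaler.step() runs without a scaled loss and raises the error above. A minimal sketch of the fix follows; the class name, batch layout, and loss function here are hypothetical, not my actual code:

import pytorch_lightning as pl
import torch.nn.functional as F

class SequenceClassifier(pl.LightningModule):
    # ... __init__, forward, configure_optimizers elided ...

    def training_step(self, batch, batch_idx):
        # hypothetical batch layout: a dict with input_ids, attention_mask, labels
        logits = self(batch['input_ids'], attention_mask=batch['attention_mask'])
        loss = F.cross_entropy(logits, batch['labels'])
        self.log('train_loss', loss)
        return loss  # the line I had dropped -- without it, AMP never scales a loss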
