Why does `precision=16` give me almost no speedup?

The relevant part of my code is:

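import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor, ModelCheckpoint, RichProgressBar

# Keep only the single best checkpoint, ranked by validation loss.
checkpoint_callback = ModelCheckpoint(
    dirpath="outputs",
    monitor="val_loss",
    mode="min",
    save_top_k=1,
    save_last=False,
    save_weights_only=False,
)
trainer = pl.Trainer(
    gpus=1,
    strategy="dp",
    max_epochs=10,
    auto_lr_find=True,
    precision=16,  # set to 32 for the comparison run
    callbacks=[checkpoint_callback, LearningRateMonitor("epoch"), RichProgressBar()],
    log_every_n_steps=10,
)
# model, train_loader and val_loader are defined elsewhere.
trainer.tune(model, train_loader, val_loader)
trainer.fit(model, train_loader, val_loader, ckpt_path=None)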

Ten epochs take 5m 33s with precision=32 and 5m 55s with precision=16. The two are almost the same, and half precision is even slightly slower.
Package versions: pytorch-lightning 1.5.5, torch 1.10.0.
The GPU is a GeForce GTX 1080 Ti with CUDA 11.1. GPU memory usage is 1167 MiB for precision 32 and 1149 MiB for precision 16, so that is nearly identical as well.
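
One thing that may be worth checking, as a possible factor rather than a confirmed cause: most of the mixed-precision speedup on NVIDIA GPUs comes from Tensor Cores, which first appeared with compute capability 7.0 (Volta), while the GTX 1080 Ti is a Pascal card with compute capability 6.1. The sketch below prints what PyTorch reports for the first visible GPU. The helper names `likely_has_tensor_cores` and `describe_current_gpu` are made up for illustration and are not part of PyTorch or Lightning.

```python
def likely_has_tensor_cores(major: int, minor: int) -> bool:
    """Rough heuristic: Tensor Cores were introduced with compute capability 7.0."""
    return (major, minor) >= (7, 0)


def describe_current_gpu() -> str:
    """Summarise the first visible CUDA device, or explain why that is not possible."""
    try:
        import torch  # imported lazily so the helper above works without torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    return (
        f"{name}: compute capability {major}.{minor}, "
        f"Tensor Cores likely: {likely_has_tensor_cores(major, minor)}"
    )


if __name__ == "__main__":
    print(describe_current_gpu())
```

If the reported capability is below 7.0, little or no speedup from `precision=16` would not be surprising. The nearly identical memory figures may also mean the model is small enough that fixed overhead dominates, but that is a guess.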
Has anybody run into a similar problem?