PyTorch profiler only reports stats for "records"

On the MNIST autoencoder example from here, and more importantly in my own code, when I set profiler="pytorch" I only get statistics for whatever "records" is. I don't get stats for training_step_and_backward, training_step, backward, validation_step, test_step, and predict_step like the documentation says I should. Is there something else I need to do to profile my training? I'm on torch 1.9.0+cu111, torchvision 0.10.0+cu111, and pytorch-lightning 1.4.1.
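For example, do I need to construct the profiler object myself instead of using the string shortcut, and request those actions explicitly? This is what I was planning to try next, going off the PyTorchProfiler docstring (untested, and I may be misreading what record_functions is for), replacing the Trainer line in the script below:

from pytorch_lightning.profiler import PyTorchProfiler

# Guess: maybe the per-action stats have to be requested explicitly?
# record_functions comes from the PyTorchProfiler docstring; I'm not
# sure it's the right knob for this.
profiler = PyTorchProfiler(
    record_functions={"training_step_and_backward", "training_step", "backward"}
)
trainer = pl.Trainer(profiler=profiler, gpus=1, max_epochs=25)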

Here’s the console output:

python testmnistautoencoder.py
/home/enolan/mystuff/code/clip-gen/venv/lib/python3.9/site-packages/torchvision/datasets/mnist.py:498: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:180.)
  return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type       | Params
---------------------------------------
0 | encoder | Sequential | 50.4 K
1 | decoder | Sequential | 51.2 K
---------------------------------------
101 K     Trainable params
0         Non-trainable params
101 K     Total params
0.407     Total estimated model params size (MB)
/home/enolan/mystuff/code/clip-gen/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/data_loading.py:105: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 24: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 235/235 [00:06<00:00, 35.37it/s, loss=0.0376, v_num=15]
FIT Profiler Report
Profile stats for: records
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                          ProfilerStep*         4.54%       4.265ms        98.37%      92.353ms      30.784ms       0.000us         0.00%     517.000us     172.333us             3  
enumerate(DataLoader)#_SingleProcessDataLoaderIter._...        59.39%      55.760ms        82.97%      77.897ms      25.966ms       0.000us         0.00%       0.000us       0.000us             3  
                                               aten::to         4.20%       3.939ms        11.01%      10.336ms       6.464us       0.000us         0.00%     193.000us       0.121us          1599  
                           optimizer_step_and_closure_0         0.18%     173.000us         9.07%       8.516ms       2.839ms       0.000us         0.00%     324.000us     108.000us             3  
                               Optimizer.step#Adam.step         1.07%       1.000ms         8.87%       8.330ms       2.777ms       0.000us         0.00%     324.000us     108.000us             3  
                                              aten::div         3.94%       3.699ms         6.66%       6.253ms       7.836us      16.000us         2.01%      16.000us       0.020us           798  
                             training_step_and_backward         0.55%     516.000us         5.60%       5.257ms       1.752ms       0.000us         0.00%     165.000us      55.000us             3  
                                            aten::copy_         3.64%       3.421ms         4.83%       4.533ms       2.889us     195.000us        24.47%     195.000us       0.124us          1569  
                                           aten::select         2.80%       2.633ms         3.19%       2.995ms       1.942us       0.000us         0.00%       0.000us       0.000us          1542  
                                               backward         2.57%       2.417ms         2.67%       2.509ms     836.333us       0.000us         0.00%       1.000us       0.333us             3  
                                    aten::empty_strided         2.09%       1.965ms         2.09%       1.965ms       1.255us       0.000us         0.00%       0.000us       0.000us          1566  
                                          aten::permute         1.52%       1.431ms         2.06%       1.932ms       2.516us       0.000us         0.00%       0.000us       0.000us           768  
                                             aten::view         2.03%       1.904ms         2.03%       1.904ms       2.432us       0.000us         0.00%       0.000us       0.000us           783  
                                            aten::stack         0.47%     445.000us         1.85%       1.734ms     578.000us       0.000us         0.00%       0.000us       0.000us             3  
                                          training_step         0.62%     585.000us         1.75%       1.639ms     546.333us       0.000us         0.00%     152.000us      50.667us             3  
                                       cudaLaunchKernel         1.65%       1.550ms         1.65%       1.550ms       4.572us       0.000us         0.00%       0.000us       0.000us           339  
                                       aten::as_strided         0.98%     917.000us         1.48%       1.387ms       0.438us       0.000us         0.00%       0.000us       0.000us          3165  
                                             aten::item         1.18%       1.112ms         1.27%       1.191ms       1.545us       0.000us         0.00%       0.000us       0.000us           771  
                                  cudaStreamSynchronize         0.84%     786.000us         0.84%     786.000us     131.000us       0.000us         0.00%       0.000us       0.000us             6  
                                             aten::add_         0.41%     388.000us         0.74%     696.000us       9.667us      49.000us         6.15%      49.000us       0.681us            72  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 93.887ms
Self CUDA time total: 797.000us

And the source for testmnistautoencoder.py:

import os
import torch
from torch import nn
import torch.nn.functional as F
from torchvision import transforms
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, random_split
import pytorch_lightning as pl

class LitAutoEncoder(pl.LightningModule):

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28*28, 64),
            nn.ReLU(),
            nn.Linear(64, 3)
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 64),
            nn.ReLU(),
            nn.Linear(64, 28*28)
        )

    def forward(self, x):
        # in lightning, forward defines the prediction/inference actions
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # It is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        # Logging to TensorBoard by default
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
train_loader = DataLoader(dataset, batch_size=256)

# init model
autoencoder = LitAutoEncoder()

# most basic trainer, uses good defaults (auto-tensorboard, checkpoints, logs, and more)
# trainer = pl.Trainer(gpus=8) (if you have GPUs)
trainer = pl.Trainer(profiler="pytorch", gpus=1, max_epochs=25)
trainer.fit(autoencoder, train_loader)
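
The other thing I was going to try is writing the report to a file, in case the console output is truncated and the missing sections exist but just aren't printed. The dirpath and filename arguments are from the profiler docs; I haven't verified that this changes what gets reported:

from pytorch_lightning.profiler import PyTorchProfiler

# Write the report to a file under dirpath instead of only stdout, to
# check whether the missing sections show up there.
profiler = PyTorchProfiler(dirpath=".", filename="perf_logs")
trainer = pl.Trainer(profiler=profiler, gpus=1, max_epochs=25)
trainer.fit(autoencoder, train_loader)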

Any help would be very much appreciated.