I built a Temporal Fusion Transformer model with PyTorch Forecasting, following the guide here:
I used my own data, a time series with 62k samples, and set training to run on the GPU via pl.Trainer. The issue is that training is quite slow given that the dataset is not that large.
I first ran training on my laptop GPU (GTX 1650 Ti), then on an A100 40GB, and got only about a 2x speed-up. An A100 is many times faster than a laptop GPU, so the uplift should be far larger than 2x. NVIDIA drivers, cuDNN, and the rest are installed (the A100 is on Google Cloud, which comes with all of that preinstalled). GPU utilisation is low (10-15%), but I can see that the data has been loaded into GPU memory.
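For reference, GPU selection happens in the Trainer; the flags below are a simplified sketch of the usual Lightning setup rather than a verbatim copy of my script:

import torch
import pytorch_lightning as pl

# Sketch only: the usual way to pin Lightning training to a single GPU.
# (Assumption: recent Lightning API; older versions use gpus=1 instead.)
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    # ... other arguments as in the Trainer snippet further down
)

# Quick sanity check that CUDA is visible to PyTorch at all
print(torch.cuda.is_available())      # expected: True
print(torch.cuda.get_device_name(0))  # expected: the A100 on the cloud VM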
Things I tried:
- Tried small batch sizes (32) and large ones (8192)
- Double-checked that training runs on the GPU (see the sketch after this list)
- Set num_workers to 8 in the dataloaders
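To be concrete about the GPU check in the second bullet, the kind of check I mean is a throwaway callback like this (a sketch, not copied verbatim from my script):

import pytorch_lightning as pl

# Throwaway callback to confirm the model actually sits on the GPU once fit() starts.
class DeviceCheck(pl.Callback):
    def on_train_start(self, trainer, pl_module):
        print("model device:", pl_module.device)  # expected: cuda:0

# added via: pl.Trainer(..., callbacks=[DeviceCheck()])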
Is there some other bottleneck in my model? Below are the results from the
profiler and snippets of my model configuration.
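For context, a minimal sketch of how such a profiler is typically attached, assuming Lightning's built-in "simple" profiler (my exact setup may differ):

# Sketch: the built-in profiler is enabled via a Trainer argument (assumption: the "simple" profiler).
trainer = pl.Trainer(
    profiler="simple",  # per-hook timings printed after trainer.fit() finishes; "advanced" gives more detail
    # ... other Trainer arguments as in the snippet below
)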
batch_size = batch_size  # defined earlier; I tried values from 32 up to 8192

train_dataloader = training.to_dataloader(
    train=True,
    batch_size=batch_size,
    num_workers=8,
    batch_sampler="synchronized",
    pin_memory=True,
)

val_dataloader = validation.to_dataloader(
    train=False,
    batch_size=batch_size,
    num_workers=8,
    batch_sampler="synchronized",
    pin_memory=True,
)
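To see whether the input pipeline alone can keep up, the dataloader can be timed in isolation; a rough sketch (not part of my original script):

import time

# Rough throughput check: iterate the training dataloader with no model involved.
start = time.perf_counter()
n_batches = 0
for batch in train_dataloader:
    n_batches += 1
    if n_batches == 100:
        break
elapsed = time.perf_counter() - start
print(f"{n_batches} batches in {elapsed:.1f}s -> {n_batches / elapsed:.1f} batches/s")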
early_stop_callback = EarlyStopping(
    monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min"
)
trainer = pl.Trainer(
    limit_train_batches=1.0,  # comment in for training, running validation every 30 batches
    # ... other Trainer arguments omitted from this snippet
)
tft = TemporalFusionTransformer.from_dataset(
    training,
    output_size=7,  # 7 quantiles by default
    # ... other hyperparameters omitted from this snippet
)
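The training launch itself is the standard pattern from the tutorial (reproduced from memory, so treat it as a sketch):

# Standard Lightning fit call wiring the model and the two dataloaders together.
# (Older Lightning versions use train_dataloader= instead of train_dataloaders=.)
trainer.fit(
    tft,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
)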
Profiler (Only the most intensive processes)