Working with a big dataset

I have a dataset of ~150 GB that is too big to fit into memory. It is split into multiple files, and each file contains enough data for multiple mini-batches.

  • Have: 30 hdf5 files, each containing 64 samples
  • Want: mini-batches of size 4

How should I do this? Below is what I am doing right now:

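import torch
from torch.utils.data import DataLoader, IterableDataset
import pytorch_lightning as pl  # assumption: DataModule refers to Lightning's LightningDataModule


class MyDataSet(IterableDataset):
    ...
    def __iter__(self):
        # Each worker reads only the files assigned to its id in self.path_dict.
        worker = torch.utils.data.get_worker_info()
        worker_id = 0 if worker is None else worker.id  # worker is None when num_workers=0
        paths = self.path_dict[worker_id]
        for path in paths:
            file = self.load(path)
            for sample in file:
                yield sample


class MyDataModule(pl.LightningDataModule):
    ...
    def train_dataloader(self):
        # create train dataloader
        return DataLoader(MyDataSet(...), prefetch_factor=64, batch_size=4, num_workers=6)


data = MyDataModule(...)
trainer.fit(model, data)
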
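
The snippet above does not show how self.path_dict is built. A minimal sketch of one way to do it, assuming a simple round-robin split of the files over the workers; the helper name shard_paths and the file names are hypothetical and only there for illustration:

```python
def shard_paths(paths, num_workers):
    """Assign files to workers round-robin: worker i gets paths[i::num_workers]."""
    return {worker_id: paths[worker_id::num_workers] for worker_id in range(num_workers)}


# Hypothetical file names standing in for the 30 hdf5 files.
paths = [f"part_{i:02d}.h5" for i in range(30)]
path_dict = shard_paths(paths, num_workers=6)

for worker_id, worker_paths in path_dict.items():
    print(worker_id, worker_paths)
```

With 30 files and 6 workers, each worker ends up with 5 files, i.e. 320 samples or 80 mini-batches of size 4. No file is shared between workers, so no sample is yielded twice.
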
  • Is this the right approach?

  • It is quite slow. After inserting a couple of print statements, I can see that training gets blocked a lot and that the workers are idle a lot.

  • What does prefetch_factor do? Is it the number of batches or the number of samples that are pre-fetched?

  • Is pre-fetching supposed to happen continuously in the background? Or are 64 units of data loaded, then nothing happens until they are all consumed, and only then another 64 units are loaded?
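
For reference, the PyTorch DataLoader documentation describes prefetch_factor as the number of batches loaded in advance by each worker, so the total is prefetch_factor * num_workers batches. Under that reading, the settings above work out roughly as follows. This is a back-of-the-envelope sketch that assumes all samples are about the same size and that files are split evenly over workers:

```python
# Numbers taken from the post; equal-sized samples is an assumption.
dataset_gb = 150
num_files = 30
samples_per_file = 64
batch_size = 4
num_workers = 6
prefetch_factor = 64  # per the DataLoader docs: batches loaded in advance by each worker

total_samples = num_files * samples_per_file        # 1920
total_batches = total_samples // batch_size         # 480
batches_per_worker = total_batches // num_workers   # 80, assuming 5 files per worker

# Upper bound on batches queued ahead of the training loop, across all workers.
max_prefetched_batches = prefetch_factor * num_workers
max_prefetched_samples = max_prefetched_batches * batch_size

gb_per_sample = dataset_gb / total_samples
max_prefetched_gb = max_prefetched_samples * gb_per_sample

print(f"batches per worker:     {batches_per_worker}")
print(f"max prefetched batches: {max_prefetched_batches}")
print(f"max prefetched data:    {max_prefetched_gb:.0f} GB")
```

If that reading is right, prefetch_factor=64 asks each worker to stay up to 64 batches ahead, which is most of its 80-batch share and, summed over 6 workers, on the order of 120 GB of queued data. A much smaller value may be worth trying before looking for other causes of the slowdown.
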