Controlling Data Location in Memory

Hello,

How do I control where datasets are located in system memory when using a LightningDataModule? I understand that PyTorch Lightning (ptl) manages data location (meaning CPU-side RAM versus GPU-side RAM) automatically during training and testing. I am using large datasets that I want to keep CPU-side, while the models they train stay in GPU RAM. I only want to move a single batch of data to the GPU during each training iteration.

Can I do this with pytorch-lightning? I see warnings at several points in the documentation against manually setting data locations with tensor_name.to(device), but that seems to be exactly what my use case requires. Also, should I use the prepare_data() or setup() method for this? The example DataModules all define the datasets within setup(), but setup() is called once per GPU process. Wouldn't that duplicate the CPU-side data tensors, once per process?
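For concreteness, here is a sketch of the structure I have in mind. MyDataModule, the batch size, and the "train.pt" path are placeholders, not anything from the docs:

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class MyDataModule(pl.LightningDataModule):
    def prepare_data(self):
        # Runs on a single process (rank 0), so it's the place for
        # one-time work: download, preprocess, write to disk.
        # Don't assign state here -- the other processes won't see it.
        ...

    def setup(self, stage=None):
        # Runs on every process. Load the tensors into CPU RAM here.
        data = torch.load("train.pt")  # placeholder path
        self.train_ds = TensorDataset(data["x"], data["y"])

    def train_dataloader(self):
        # The batches this loader yields live on the CPU; Lightning
        # moves each one to the model's device before training_step.
        return DataLoader(self.train_ds, batch_size=64,
                          num_workers=4, pin_memory=True)
```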

Please let me know if I'm way off base with my understanding of ptl data handling, or whether this is possible.

Thanks!

I have the same question. I'm completely baffled as to why LightningDataModules seem to assume that data processing (particularly train/val/test splits) happens on GPUs. I want the DataLoader to send batches to the GPUs as needed, but to do the actual loading on the CPU (what's the point of num_workers otherwise?). I'm trying to figure out how to do that now, and no matter what I try, I see the dataset load itself again for each GPU under DDP.
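To be explicit, this is the behavior I expected out of the box. LitModel and dataset are placeholders; the key point is that no manual .to(device) appears anywhere:

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader

# `dataset` stands in for whatever CPU-side Dataset you already have.
train_loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,    # CPU worker processes do the loading
    pin_memory=True,  # speeds up the host-to-GPU copy of each batch
)

class LitModel(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        x, y = batch
        # By the time we get here, Lightning has already moved this one
        # batch to the process's GPU; the dataset itself stays in CPU RAM.
        ...
```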

Did you find a solution? I'm having the same issue.

I had forgotten about this; thank you for the ping.

My solution for this was to move my data into HDF5 records. The trial data can then be indexed like a torch tensor or a NumPy array without loading the entire dataset into CPU memory first. You can write your dataloader to send each trial to the desired device.

I use h5py as the Python interface to HDF5.
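Roughly, the pattern looks like this. H5Dataset and the "x"/"y" keys are placeholders for however your trials are stored:

```python
import h5py
import torch
from torch.utils.data import Dataset

class H5Dataset(Dataset):
    """Reads one trial at a time from an HDF5 file; nothing is
    preloaded, so CPU memory stays small."""

    def __init__(self, path):
        self.path = path
        self._file = None  # opened lazily, once per worker process
        with h5py.File(path, "r") as f:
            self._len = len(f["x"])  # length along the first axis

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        if self._file is None:
            # Open inside the worker: h5py file handles don't survive
            # the fork that DataLoader workers go through.
            self._file = h5py.File(self.path, "r")
        x = torch.as_tensor(self._file["x"][idx])
        y = torch.as_tensor(self._file["y"][idx])
        return x, y  # still on the CPU; the training loop moves it
```

A nice side effect under DDP: each process opens its own read-only handle and only reads the rows it samples, so you don't pay the full-dataset RAM cost once per GPU.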

Thanks a lot! It's a great solution for my use case 🙂

I'm also having this issue. I'd like an option to do data processing on the CPU and move batches from the DataLoader to the GPU.

Is there any solution other than loading from disk? It seems slow to me.