Controlling Data Location in Memory


How do I control where datasets are located in system memory when using a LightningDataModule? I understand that ptl manages data location (meaning “on the CPU-side RAM or on the GPU-side RAM”) automatically when training and testing models. I am using large datasets that I want to keep CPU-side to train models that I want to keep in GPU RAM. I only want to copy a single batch to the GPU during each training iteration.

Can I do this with pytorch-lightning? At several points the documentation warns against manually setting data locations, but that would be required for what I’m doing. Also, should I use the prepare_data() or setup() method to do this? The example DataModules all define the datasets within setup(), but this is called on each separate GPU. This would cause overwrites of CPU-side data tensors, right?

Please let me know if I’m way off base with my understanding of ptl data handling or if this is a possibility.


I have the same question. I’m completely baffled why there’s an assumption in LightningDataModules that data processing (particularly train/val/test splits) is happening on GPUs. I want a DataLoader to send batches to the GPUs as needed, but to do the actual loading on the CPU (what’s the point of num_workers otherwise?). I’m trying to figure out how to do that now and it feels like no matter what I do, I see the dataset load itself again for each GPU on DDP.
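For reference, the behavior described here (dataset on the CPU, one batch at a time copied to the GPU) looks roughly like this in plain PyTorch. It is only a sketch with made-up data and a toy linear model, and it falls back to the CPU when no GPU is available. num_workers is left at its default of 0 to keep the sketch simple; raising it moves the loading work into CPU worker processes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: 1000 samples with 16 features, created on the CPU.
features = torch.randn(1000, 16)
labels = torch.randint(0, 2, (1000,))
dataset = TensorDataset(features, labels)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(16, 2).to(device)  # only the model lives on the device

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    pin_memory=torch.cuda.is_available(),  # pinned host memory speeds up copies
)

batch_devices = []
for x, y in loader:
    # Only the current batch is copied to the device; the dataset stays put.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    logits = model(x)
    batch_devices.append(x.device.type)
```

The full feature tensor never leaves CPU RAM in this loop; GPU memory only ever holds the model and the batch currently being processed.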