Hello,
How do I control where datasets are located in system memory when using a LightningDataModule? I understand that ptl manages data location (meaning “in CPU-side RAM or in GPU-side RAM”) automatically when training and testing models. I am working with large datasets that I want to keep in CPU-side RAM, while the models I train stay in GPU RAM. I only want to move a single batch of data to the GPU on each training iteration.
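To make the question concrete, here is a minimal sketch of the pattern I have in mind (the class name, tensor shapes, and loader settings are all placeholders, not code from my actual project):

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class LargeCpuDataModule(pl.LightningDataModule):
    """Hypothetical DataModule whose dataset lives entirely in CPU RAM."""

    def prepare_data(self):
        # My understanding: this hook is meant for download/preprocess/
        # write-to-disk work only, and is not run once per process, so no
        # state (self.foo = ...) should be assigned here.
        pass

    def setup(self, stage=None):
        # Runs in every GPU process. These are plain CPU tensors; I am
        # deliberately NOT calling .to(device) anywhere.
        x = torch.randn(1_000_000, 128)          # placeholder features
        y = torch.randint(0, 10, (1_000_000,))   # placeholder labels
        self.train_set = TensorDataset(x, y)

    def train_dataloader(self):
        # Batches come off this loader as CPU tensors; the intent is that
        # only the current batch ever occupies GPU memory.
        return DataLoader(
            self.train_set,
            batch_size=64,
            num_workers=4,
            pin_memory=True,  # should speed up host-to-device copies
        )
```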
Can I do this with pytorch-lightning? At several points in the documentation I see warnings against setting data locations manually with tensor_name.to(device), but that seems to be exactly what my use case requires. Also, should I use prepare_data() or setup() to define the datasets? The example DataModules all define the datasets inside setup(), but setup() is called once on each GPU. Wouldn’t that overwrite (or duplicate) the CPU-side data tensors?
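For reference, the kind of multi-GPU run I mean looks like this (assuming a recent Lightning version; `model` is a placeholder for my LightningModule, which I have not shown):

```python
import pytorch_lightning as pl

# dm is the DataModule sketched above; under DDP, Lightning starts one
# process per GPU, so (as I understand it) dm.setup() runs in each process.
dm = LargeCpuDataModule()
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
trainer.fit(model, datamodule=dm)
```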
Please let me know if I’m way off base with my understanding of ptl data handling, or if this is possible.
Thanks!