Why does the dataloader run multiple times and take up all the RAM?

I am trying to run parallel training on a large dataset.
As I understand it, the dataset should be loaded into RAM once, and then the batches are sent to the GPUs.
However, for some reason the data is loaded into RAM separately for each GPU involved in the training.
That is, when training on 6 GPUs, 6 dataloaders run simultaneously, each holding its own copy in RAM.
As a result, RAM is completely exhausted even before training starts, because the dataset is duplicated there 6 times!
First, this makes it impossible to load a large dataset for training.
Second, loading takes a very long time.

if __name__ == '__main__':
    Pytorch_lightning_MNIST_my = Pytorch_Lightning_my()
    trainer = pl.Trainer(accelerator='gpu', devices=6, max_epochs=EPOCHS, strategy=DDPStrategy(find_unused_parameters=False))
    DS = np.vstack((np.loadtxt(learningDatasetFile, skiprows=1, delimiter=",", dtype=np.float32, max_rows=125000), np.loadtxt(validateDatasetFile, skiprows=1, delimiter=",", dtype=np.float32, max_rows=25000)))
    train_tensorX = torch.from_numpy(DS[:, :-1]).to("cuda:0")
    medianANDstdArr = makeStandart(train_tensorX, numOfPeriodsPerFeature)
    train_tensorY = torch.from_numpy(DS[:, -1])
    train_dataset = TensorDataset(train_tensorX.to("cpu"), train_tensorY)
    trainer.fit(Pytorch_lightning_MNIST_my, DataLoader(train_dataset, shuffle=True, batch_size=10000, num_workers=num_of_threads))

Is it supposed to be like this?

Hi there,
Yes, it is supposed to be like this. The short explanation: a separate process is created for each GPU, and each process then loads its own data.

The long explanation:
Loading all of the data into memory is only feasible for small datasets. For larger datasets, something like this is usually better:

from typing import Tuple

import numpy as np
import torch

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, file_x: str, file_y: str) -> None:
        # with mmap_mode set, the .npy file is memory-mapped instead of being read into RAM
        self.file_x = np.load(file_x, mmap_mode='r')
        self.file_y = np.load(file_y, mmap_mode='r')

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
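        # np.array copies only this sample out of the read-only memory map
        # (it also turns a scalar label into a 0-d array that from_numpy accepts)
        x = torch.from_numpy(np.array(self.file_x[idx]))
        y = torch.from_numpy(np.array(self.file_y[idx]))
        return x, y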

    def __len__(self) -> int:
        return len(self.file_x)

This loads only the part of the data needed for the current batch, not all of it. This is especially helpful with multiple GPUs, since each GPU only sees a fraction of the dataset, so it makes no sense to load the entire dataset for every GPU.
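
The dataset above expects .npy files, while the question reads CSV files. A one-time conversion could look roughly like the sketch below. The function name csv_to_npy and the file paths are made up for illustration, and it assumes the same layout as in the question: one header row, with the label in the last column.

```python
import numpy as np


def csv_to_npy(csv_path: str, x_path: str, y_path: str) -> None:
    """Convert a CSV (one header row, label in the last column) into two .npy files.

    Hypothetical helper: run it once, in a single process, before training.
    """
    data = np.loadtxt(csv_path, skiprows=1, delimiter=",", dtype=np.float32, ndmin=2)
    np.save(x_path, data[:, :-1])  # all columns except the last: features
    np.save(y_path, data[:, -1])   # last column: labels


def open_memmapped(x_path: str, y_path: str):
    """Open both files memory-mapped; rows are read from disk only when indexed."""
    return np.load(x_path, mmap_mode="r"), np.load(y_path, mmap_mode="r")
```

The conversion itself still reads the whole CSV once, but only in one process and only one time. After that, every training process can open the same .npy files memory-mapped instead of parsing the CSV again.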

Also note: when using multiple workers per GPU (to speed up loading or preprocessing), your approach would create (number of GPUs) * (number of workers per GPU) copies of your entire dataset.
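
To get a feeling for the numbers, here is a rough back-of-the-envelope sketch of that formula. The function name and the column count in the usage note are assumptions (the question does not say how many features there are), and the result is only an upper-bound estimate: actual usage depends on the platform and on how the worker processes are started.

```python
def estimate_dataset_ram_bytes(
    n_rows: int,
    n_cols: int,
    bytes_per_value: int = 4,   # 4 bytes for float32, as in the question
    num_gpus: int = 1,
    workers_per_gpu: int = 1,
) -> int:
    """Estimate RAM used when every process holds a full copy of the dataset."""
    per_copy = n_rows * n_cols * bytes_per_value
    # One copy per GPU process, multiplied by the workers per GPU
    # (with 0 workers, the GPU process itself still holds one copy).
    copies = num_gpus * max(workers_per_gpu, 1)
    return per_copy * copies
```

For example, with the 150000 rows from the question (125000 + 25000), a hypothetical 100 float32 columns, 6 GPUs and 4 workers per GPU, estimate_dataset_ram_bytes(150000, 100, 4, 6, 4) gives 1440000000 bytes, i.e. 24 copies of a 60 MB array.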

Thank you so much! You helped me a lot, and also wrote the right code. I am veeery grateful to you!