Hello,
I am trying to train models on multiple GPUs (2). When I set the training strategy to dp,
training hangs after the first epoch and epoch 2 never begins. When I use ddp instead,
training does not start at all. I am currently on PyTorch Lightning v1.7.0.
Here is my DataLoader for reference.
import torch
from torch.utils.data import DataLoader, TensorDataset

def _prepare_dataloader(self, X, y=None, shuffle=False, predict=False):
    """
    Prepare a PyTorch DataLoader.

    Arguments:
        X: The input features.
        y: The output targets.
        shuffle: If the DataLoader should be shuffled.
        predict: If building the DataLoader for prediction.
    """
    if predict:
        dataset = TensorDataset(torch.Tensor(X))
    else:
        dataset = TensorDataset(
            torch.Tensor(X),
            torch.LongTensor(y) if self.multi_class else torch.FloatTensor(y),
        )
    # Pin memory when training on GPU to speed up host-to-device transfers.
    pin_memory = self.accelerator == "gpu"
    return DataLoader(
        dataset,
        batch_size=self.training_config["batch_size"],
        shuffle=shuffle,
        num_workers=self.training_config["num_workers"],
        pin_memory=pin_memory,
    )
Any help would be appreciated, thank you!