Hello,
I am trying to train models on multiple GPUs (2). When I set the training strategy to dp,
training hangs after the first epoch and epoch 2 never begins. When I use ddp instead,
training does not start at all. I am currently on PyTorch Lightning v1.7.0.
Here is my DataLoader for reference.
import torch
from torch.utils.data import DataLoader, TensorDataset

def _prepare_dataloader(self, X, y=None, shuffle=False, predict=False):
    """
    Prepare a PyTorch DataLoader.

    Arguments:
        X: The input features.
        y: The output targets.
        shuffle: If the DataLoader should be shuffled.
        predict: If building the DataLoader for prediction.
    """
    if predict:
        dataset = TensorDataset(torch.Tensor(X))
    else:
        dataset = TensorDataset(
            torch.Tensor(X),
            torch.LongTensor(y) if self.multi_class else torch.FloatTensor(y),
        )
    # Pin memory when training on GPU to speed up host-to-device transfers.
    pin_memory = self.accelerator == "gpu"
    return DataLoader(
        dataset,
        batch_size=self.training_config["batch_size"],
        shuffle=shuffle,
        num_workers=self.training_config["num_workers"],
        pin_memory=pin_memory,
    )
Any help would be appreciated, thank you!