Validation step: metrics remain unchanged after each epoch

I’m training a deep learning model to classify some data into two categories (1/0). I admit I’m not sure the data has any underlying structure that would allow the classification to succeed.

Nonetheless, I don’t understand why the validation metrics remain identical after each epoch.

Batch size = 1024
Train data = 900_000 rows
Val data = 100_000 rows

        ...
        self.layers = nn.Sequential(
            nn.Linear(350, 1024*16),
            nn.LeakyReLU(),
            nn.Linear(1024*16, 1024*8),
            nn.LeakyReLU(),
            nn.Linear(1024*8, 1024*8),
            nn.LeakyReLU(),
            nn.Linear(1024*8, 1024*8),
            nn.LeakyReLU(),
            nn.Linear(1024*8, 1024*4),
            nn.LeakyReLU(),
            nn.Linear(1024*4, 1024*4),
            nn.LeakyReLU(),
            nn.Linear(1024*4, 256),
            nn.LeakyReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.layers(x.float())

    def training_step(self, batch, batch_idx):
        x, y = batch
        preds = self.layers(x.float())
        loss = self.criterion(preds, y.float())    # nn.BCELoss()
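        # FM: assumed to be torchmetrics.functional (or the older
        # pytorch_lightning.metrics.functional), imported elsewhere as FM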
        acc = FM.accuracy(preds > 0.5, y)
        metrics = {'train_acc': acc.item(), 'train_loss': loss.item()}
        self.log_dict(metrics)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x.float())
        loss = self.criterion(preds, y.float())    # nn.BCELoss()
        acc = FM.accuracy(preds > 0.5, y)
        metrics = {'val_acc': acc.item(), 'val_loss': loss.item()}
        self.log_dict(metrics)
        return metrics

The val_loss remains fixed at exactly 48.79 after each and every epoch (tested for up to 10 epochs; the same is true for val_acc, which doesn’t change either), which is weird. I would expect at least some slight variation even if the model doesn’t have much to learn from the data. At the very least some overfitting should be possible, since the model has 300 million+ parameters in total.
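For reference, a quick back-of-the-envelope count over the Linear layers above confirms that figure (weights plus biases):

    # Parameter count for the Linear layers in the Sequential above
    sizes = [350, 1024*16, 1024*8, 1024*8, 1024*8, 1024*4, 1024*4, 256, 1]
    n_params = sum(i * o + o for i, o in zip(sizes, sizes[1:]))  # weights + biases
    print(f"{n_params:,}")  # 325,599,745 -> ~325.6M parameters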

However, the train_loss does vary from batch to batch:
[screenshot: train_loss fluctuating from batch to batch]

So, in conclusion, I don’t know why the validation loss does not change from one epoch to the next and remains stable at 48.79. Am I missing something?
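For what it’s worth, here is the kind of sanity check I can run (a minimal sketch; `model` and `val_loader` stand in for my module and validation DataLoader). If the network outputs a constant after the sigmoid, val_loss and val_acc would be pinned to the same value every epoch:

    import torch

    # Do the predictions actually vary across the validation set?
    model.eval()
    with torch.no_grad():
        x, y = next(iter(val_loader))
        preds = model(x.float())
        # min == max would mean a constant output, freezing the metrics
        print("preds:", preds.min().item(), preds.max().item(), preds.mean().item())
        print("shapes:", tuple(preds.shape), tuple(y.shape))

(Printing the shapes also confirms that `preds` and `y` line up the way nn.BCELoss expects.)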

Here is a Jupyter Notebook and some data you can test on (in Colab or elsewhere):
https://wetransfer.com/downloads/58732b6430d49188d804066d3970397a20210606201715/7b869d

Hi!

I’ve tried running the code you uploaded, but the .pt file is corrupted. I get the following error:

    RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted
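Since recent PyTorch versions (1.6+) save tensors in a zip-based container, a quick way to tell whether the download itself is truncated (the file name below is a placeholder):

    import zipfile

    path = "data.pt"  # placeholder: whatever the uploaded .pt file is called
    # A healthy torch.save file should register as a valid zip archive
    print(zipfile.is_zipfile(path))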