PyTorch Lightning AUROC value for multi-class seems to be completely off compared to sklearn (using it wrong)?

I want to calculate the Area Under the Receiver Operating Characteristics (AUROC) of my multi-class predictions. However, I don’t know which value to trust.

PyTorch Lightning comes with an AUROC metric. However, calling it on the final activation values gives a different result than calling it on the categorized predictions.
Moreover, neither value matches the AUROC calculated by scikit-learn.

Here is a standalone version:

from pl_bolts.models import LitMNIST
import pytorch_lightning as pl
from pytorch_lightning.metrics.classification import AUROC
from pytorch_lightning.metrics.functional import to_categorical
# https://github.com/reiinakano/scikit-plot
# pip install scikit-plot | conda install -c conda-forge scikit-plot
import matplotlib.pyplot as plt
import scikitplot as skplt

pl.seed_everything(0)

# model
model = LitMNIST(batch_size=64)
trainer = pl.Trainer(max_epochs=1, deterministic=True)
# train 1 epoch to expect decent results from auroc
trainer.fit(model)
# prevent grad error
model.freeze()

pl_auroc = AUROC()
for i, batch in enumerate(model.test_dataloader()):
    # get a prediction
    x, target = batch
    print(i, x.shape, target.shape)
    predict = model(x)

    # auroc with activations per class
    print(pl_auroc(predict, target))
    # auroc after converting activations to categories
    predict_cat = to_categorical(predict)
    print(pl_auroc(predict_cat, target))

    # use scikit-learn with a simplified interface
    predict_np = predict.numpy()  # call .cpu() first if the tensors live on the GPU
    target_np = target.numpy()  # call .cpu() first if the tensors live on the GPU
    skplt.metrics.plot_roc(target_np, predict_np)  # note the (target, prediction) argument order
    plt.show()

    # predictions versus target
    print(list(zip(predict_cat.numpy(), target.numpy())))

    # 3 batches are enough
    if i >= 2:
        break

And this is the output from the above script:

# output batch 1:
# AUROC on activation: tensor(0.3796)
# AUROC on category: tensor(0.1111)
# AUROC scikit-learn micro-average: 0.98
#                    macro-average: 0.94
# (predict, target):
# [(7, 7), (2, 2), (1, 1), (0, 0), (4, 4), (1, 1), (4, 4), (9, 9), (6, 5), (9, 9), (0, 0), (6, 6), (9, 9), (0, 0), (1, 1), (5, 5), (9, 9), (7, 7), (2, 3), (4, 4), (9, 9), (6, 6), (6, 6), (5, 5), (4, 4), (0, 0), (7, 7), (4, 4), (0, 0), (1, 1), (3, 3), (1, 1), (3, 3), (4, 4), (7, 7), (2, 2), (7, 7), (1, 1), (2, 2), (1, 1), (1, 1), (7, 7), (4, 4), (2, 2), (3, 3), (5, 5), (1, 1), (2, 2), (4, 4), (4, 4), (6, 6), (3, 3), (5, 5), (5, 5), (6, 6), (0, 0), (4, 4), (1, 1), (9, 9), (5, 5), (7, 7), (2, 8), (9, 9), (3, 3)]

# output batch 2:
# AUROC on activation: tensor(0.4407)
# AUROC on category: tensor(0.0678)
# AUROC scikit-learn micro-average: 0.97
#                    macro-average: 0.94

# output batch 3:
# AUROC on activation: tensor(0.4038)
# AUROC on category: tensor(0.0962)
# AUROC scikit-learn micro-average: 0.93
#                    macro-average: 0.93

The AUROC values from pytorch_lightning.metrics.classification.AUROC seem to be completely off.
Am I using AUROC wrong here?

P.S. Maybe a metrics category would be useful for this type of question?

Are you categorizing your predictions in the same way when you call the scikit-learn function?
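
For illustration (a toy binary example, not from the post above), sklearn.metrics.roc_auc_score already gives different values depending on whether it receives the raw scores or the thresholded ("categorized") predictions:

import numpy as np
import sklearn.metrics

# toy binary example: raw scores vs. thresholded ("categorized") predictions
target = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.6, 0.4, 0.9])  # probabilities / activations
labels = (scores >= 0.5).astype(int)     # categorized predictions

print(sklearn.metrics.roc_auc_score(target, scores))  # uses the full ranking of the scores
print(sklearn.metrics.roc_auc_score(target, labels))  # collapses everything to a single threshold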

I wrote this test:

import torch
import sklearn.metrics
# import path as of the PL version used in this thread
from pytorch_lightning.metrics.functional import auroc

def test_auroc_sk():
    for i in range(100):
        target = torch.randint(0, 2, size=(10,))
        pred = torch.randint(0, 2, size=(10,))
        score_sk = sklearn.metrics.roc_auc_score(target.numpy(), pred.numpy())
        score_pl = auroc(pred, target)
        assert torch.allclose(torch.tensor(score_pl).float(), torch.tensor(score_sk).float())

and it passes. It seems to me these two metrics return the same scores.


I think AUROC from pl is for binary classification, not multi-class. If that is the case, then the example in the docstring and the docstring itself need to be fixed.

@justusschock @SkafteNicki wrote this stuff. Maybe they have a multi-class version?

I’m not sure what that scikit-learn function is doing, but I was using pl’s AUROC function and the values seemed very off. In the first post’s example, pl’s values are also clearly off (ignore the scikit-learn function and just look at the (predict, target) tuples).

Your hunch seems to be correct, @goku. Here is @awaelchli's test case rewritten for multi-class:

import torch
import sklearn.metrics
import pytorch_lightning as pl
from pytorch_lightning.metrics.classification import AUROC

pl.seed_everything(0)
auroc = AUROC()

def test_auroc_sk_multiclass():
    for i in range(100):
        target = torch.randint(0, 3, size=(10,))  # 2 --> 3
        pred = torch.rand(10, 3).softmax(dim=1)  # torch.randint(0, 2, size=(10, ))
        score_sk = sklearn.metrics.roc_auc_score(target.numpy(), pred.numpy(), multi_class='ovo', labels=[0, 1, 2])
        score_pl = auroc(pred, target)
        print(score_sk, score_pl)
        assert torch.allclose(torch.tensor(score_pl).float(), torch.tensor(score_sk).float())

test_auroc_sk_multiclass()

This fails the assert: sklearn’s output is 0.2708, while pl’s output is tensor(0.5000).

So far this is only a binary implementation.

Basically, we have a multiclass auc implementation here and a multiclass roc calculation here, which you can then combine into a multiclass auroc as has been done here. A rough sketch is below.
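
As a sketch (my own, not taken from the linked code), those two pieces could be combined into a macro-averaged multi-class AUROC roughly like this, assuming multiclass_roc and auc can be imported from pytorch_lightning.metrics.functional.classification in your PL version:

import torch
# assumed import path; adjust to the pytorch_lightning version you are on
from pytorch_lightning.metrics.functional.classification import multiclass_roc, auc

def multiclass_auroc(pred, target, num_classes=None):
    # multiclass_roc returns one (fpr, tpr, thresholds) tuple per class (one-vs-rest)
    class_rocs = multiclass_roc(pred, target, num_classes=num_classes)
    # integrate each per-class ROC curve and macro-average the areas
    return torch.stack([auc(fpr, tpr) for fpr, tpr, _ in class_rocs]).mean()

# usage with tensors shaped like the ones in the test above
pred = torch.rand(10, 3).softmax(dim=1)
target = torch.randint(0, 3, size=(10,))
print(multiclass_auroc(pred, target, num_classes=3))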
