It is slightly involved. This is how I did it some time ago.
As you know, bleu_score from torchtext.data.metrics works on tokenized sentences (lists of word strings), not on the index tensors a model returns. So you need to decode your model's output from indices to tokens (words) and append each decoded sentence to a list. At the end of the epoch, run bleu_score on the accumulated lists. Here's some skeleton code (only the relevant lines are shown):
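from torchtext.data.metrics import bleu_score

def __init__(self):
    super().__init__()
    self.targets = []
    self.predictions = []

# assuming you are calculating for the validation set
def validation_step(self, batch, batch_idx):
    inp, out = batch  # assumes each batch is an (input, target) pair
    output = self(inp, out)
    # logic for turning output into words, e.g. vocab lookup etc.
    # target_sentence and prediction_sentence are lists of tokens
    self.targets.append([target_sentence])  # note the extra square brackets, since there can be multiple references
    self.predictions.append(prediction_sentence)

def validation_epoch_end(self, outputs):
    bleu = bleu_score(self.predictions, self.targets)
    self.log("val_bleu", bleu)  # e.g. log the score for this epoch
    # reset the lists for the next epoch
    self.targets = []
    self.predictions = []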
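
The decoding step, which the skeleton leaves as a comment, could look roughly like the sketch below. It is only an illustration under some assumptions: the vocabulary is a plain list called itos (index to string), the model output has already been argmax-ed and converted to Python lists of indices, and the helper name decode and the special-token indices PAD and EOS are made up for this example.

```python
from typing import List, Sequence


def decode(indices: Sequence[int], itos: Sequence[str],
           eos_idx: int, pad_idx: int) -> List[str]:
    """Map token indices back to tokens, stopping at the first EOS and skipping padding."""
    tokens = []
    for idx in indices:
        if idx == eos_idx:
            break
        if idx == pad_idx:
            continue
        tokens.append(itos[idx])
    return tokens


# Hypothetical toy vocabulary: the position in the list is the token index.
itos = ["<pad>", "<eos>", "the", "cat", "sat", "down"]
PAD, EOS = 0, 1

# One batch of predicted and target index sequences (plain lists, not tensors).
pred_batch = [[2, 3, 4, 1, 0], [2, 3, 5, 1, 0]]
tgt_batch = [[2, 3, 4, 5, 1], [2, 3, 4, 1, 0]]

predictions, targets = [], []
for pred, tgt in zip(pred_batch, tgt_batch):
    predictions.append(decode(pred, itos, EOS, PAD))
    # One reference per sentence here, hence the extra list around it.
    targets.append([decode(tgt, itos, EOS, PAD)])

# predictions is a list of token lists; targets is a list of lists of token lists.
# That is the nesting bleu_score(predictions, targets) takes.
```

With real model output you would first do something like output.argmax(-1).tolist() to get the index lists, depending on the shape your model returns.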