Storing test output (dict) when using DDP

I’m training a model across two GPUs on patient data (each batch carries a patient id). My test steps output dictionaries, which contain the id as well as all the metrics. I store these (a list with one dict per id) at the end of the test epoch, so that I can statistically evaluate model performance later on.
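Concretely, the stored structure looks something like this (illustrative ids, values, and metric names, not my real data):

```python
# Hypothetical example of what self.test_results holds after a test epoch:
# one dict per patient id, with the inference time and each computed metric.
test_results = [
    {"id": "patient_001", "time": 0.42, "test_dice": 0.91},
    {"id": "patient_002", "time": 0.38, "test_dice": 0.87},
]
```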

I’m experiencing a problem with the test step, however.

# Test step
def test_step(self, batch, batch_idx):

    # Get new input and predict, then calculate loss
    x, y, id = batch["input"], batch["target"], batch["id"]

    # Infer and time inference
    start = time()
    y_hat = self.test_inference(x, self, **self.test_inference_params)
    end = time()

    # Unwrap the id (a single id for batch size 1, otherwise a tuple of ids)
    id = id[0] if len(id) == 1 else tuple(id)

    # Output dict with duration of inference
    output = {"id": id, "time": end - start}

    # Add other metrics to output dict
    for m, pars in zip(self.metrics, self.metrics_params):

        metric_value = m(y_hat, y, **pars)

        if hasattr(metric_value, "item"):
            metric_value = metric_value.item()

        output[f"test_{m.__name__}"] = metric_value

    return output

# Test epoch end (= test end)
def test_epoch_end(self, outputs):

    # Go over outputs and gather
    self.test_results = outputs     #self.all_gather(outputs)

I hadn’t considered this before (as I’m used to training on a single GPU), but under DDP the test_results attribute now only contains half of the outputs (each process holds its own half). So when my main script reaches this section, only half of the results are effectively stored:
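To illustrate why each process only sees half: with DDP the test set is sharded across ranks, roughly like the round-robin partitioning below (a simplified, hypothetical helper, not the actual DistributedSampler implementation):

```python
def shard_indices(num_samples, world_size, rank):
    # Simplified round-robin sharding: rank r gets indices
    # r, r + world_size, r + 2 * world_size, ...
    return list(range(rank, num_samples, world_size))

# With 2 GPUs and 6 patients, each process evaluates only 3 of them:
rank0 = shard_indices(6, world_size=2, rank=0)  # [0, 2, 4]
rank1 = shard_indices(6, world_size=2, rank=1)  # [1, 3, 5]
```

So each rank's test_epoch_end only ever receives the outputs for its own shard.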

log("Evaluating model.")
results = model.test_results

# Save test results
log("Saving results.")
np.save(f'{model_name}_v{version}_fold{fold_index}.npy', arr=results)

I have read about the self.all_gather method, but I’m not sure it suits my needs. I want to merge the lists, not reduce anything. Also, they’re not Tensors, but dicts. How can I store all dicts across both DDP processes?