Storing test output (dict) when using DDP

I’m training a model across two GPUs on patient data, where each sample carries a patient id. In my test steps, I output a dictionary that contains the id as well as all the metrics. I store these (a list with one dict per id) at the end of the test epoch, so I can later evaluate model performance statistically.

I’m experiencing a problem with the test step, however.

from time import time  # used below to time inference

# Test step
def test_step(self, batch, batch_idx):

    # Unpack the input, target and sample id from the batch
    x, y, id = batch["input"], batch["target"], batch["id"]

    # Infer and time inference
    start = time()
    y_hat = self.test_inference(x, self, **self.test_inference_params)
    end = time()

    # Unwrap the id when the batch holds a single sample
    id = id[0] if len(id) == 1 else tuple(id)

    # Output dict with duration of inference
    output = {"id": id, "time": end - start}

    # Add other metrics to output dict
    for m, pars in zip(self.metrics, self.metrics_params):

        metric_value = m(y_hat, y, **pars)

        if hasattr(metric_value, "item"):
            metric_value = metric_value.item()

        output[f"test_{m.__name__}"] = metric_value

    return output

# Test epoch end (= test end)
def test_epoch_end(self, outputs):

    # Go over outputs and gather
    self.test_results = outputs     #self.all_gather(outputs)

I hadn’t considered this before (as I’m used to training on a single GPU), but the test_results attribute now only contains half of the outputs (one half per process). So when my main script reaches this section, only half the output is effectively stored:

log("Evaluating model.")
results = model.test_results

# Save test results
log("Saving results.")
np.save(file=join(..., f'{model_name}_v{version}_fold{fold_index}.npy'), arr=results)

I have read about the self.all_gather method, but I’m not sure it suits my needs. I want to merge the lists, not reduce anything. Also, they’re not Tensors, but dicts. How can I store all dicts across both DDP processes?

hey @Wouter_Durnez

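With DDP, the script is launched on each device independently, and each device is assigned a rank. all_gather lets you recover the results from all devices on every device; it won’t reduce anything. Because it is a collective call, every rank has to reach it, so call it outside the is_global_zero check and only save on rank zero. For your use case you can try:

def test_epoch_end(self, outputs):
    # Collective call: every rank must run it, otherwise rank zero waits forever
    outputs = self.all_gather(outputs)
    if self.trainer.is_global_zero:
        # save them here
        self.test_results = outputs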
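
One caveat: as far as I know, self.all_gather is aimed at tensors and numbers (and collections of them), so string ids inside the dicts may not be gathered the way you expect. For arbitrary picklable objects, torch.distributed.all_gather_object is an alternative. Below is a minimal sketch of that approach; the helper names merge_rank_outputs and gather_dicts are hypothetical, and the fallback for a non-distributed run is my own assumption about what is convenient, not something Lightning provides.

```python
from itertools import chain


def merge_rank_outputs(per_rank_outputs):
    """Flatten one list of dicts per rank into a single list of dicts."""
    return list(chain.from_iterable(per_rank_outputs))


def gather_dicts(local_outputs):
    """Collect the per-rank lists of dicts on every rank (hypothetical helper).

    Falls back to returning the local list when torch is missing or no
    process group has been initialised (e.g. a single-GPU run).
    """
    try:
        import torch.distributed as dist
    except ImportError:
        dist = None

    if dist is None or not (dist.is_available() and dist.is_initialized()):
        return list(local_outputs)

    # One slot per rank; all_gather_object fills them in rank order.
    per_rank = [None] * dist.get_world_size()
    # Collective call: every rank has to execute this line.
    dist.all_gather_object(per_rank, local_outputs)
    return merge_rank_outputs(per_rank)
```

In test_epoch_end you would call self.test_results = gather_dicts(outputs) on every rank and then save only when self.trainer.is_global_zero is true. Also keep in mind that the distributed sampler can pad the dataset by repeating samples so that each rank gets the same number of batches, so it may be worth dropping duplicate ids before running statistics.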

Also, we have moved the discussions to GitHub Discussions. You might want to check that out instead to get a quick response. The forums will be marked read-only soon.

Thank you