Efficiency when using multiple optimizers

This section of the documentation gives the following pseudocode to explain what happens when configure_optimizers returns multiple optimizers:

for epoch in epochs:
    for batch in data:
        for opt in optimizers:
            disable_grads_for_other_optimizers()
            train_step(opt)
            opt.step()

This seems very inefficient. For example, what if I have 3 optimizers, and they all require embeddings for the current batch? Then embeddings = model(x) is going to be called 3 times, when it only needs to be called once.
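
For concreteness, here is a minimal sketch of the kind of module I mean (the encoder, the three heads, the shapes, and the MSE loss are all made up for illustration, and it assumes a Lightning version where training_step receives optimizer_idx under automatic optimization). Under the loop above, the shared forward pass runs once per optimizer, i.e. three times per batch:

import torch
from torch import nn
import pytorch_lightning as pl


class SharedEncoderModule(pl.LightningModule):
    # Hypothetical module: one shared encoder feeding three heads,
    # each head trained by its own optimizer.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(32, 16)
        self.heads = nn.ModuleList(nn.Linear(16, 4) for _ in range(3))

    def training_step(self, batch, batch_idx, optimizer_idx):
        x, y = batch
        # This shared forward pass is recomputed for optimizer_idx 0, 1 and 2,
        # even though it produces identical embeddings every time.
        embeddings = self.encoder(x)
        return nn.functional.mse_loss(self.heads[optimizer_idx](embeddings), y)

    def configure_optimizers(self):
        # The encoder's own parameters are left out to keep the sketch short.
        return [torch.optim.Adam(head.parameters()) for head in self.heads]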

That's a specific use case of yours. You can use manual optimization for that purpose. Also, in training_step you get optimizer_idx, so you can call embeddings = model(x) when optimizer_idx == 0, save the result as state, and reuse it when optimizer_idx is 1 or 2, assuming you have 3 optimizers.
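
To make that concrete, here is a rough sketch of the caching approach against the hypothetical module above (names like encoder, heads and _cached_embeddings are made up; it assumes a Lightning version where training_step receives optimizer_idx). The detached copy is there because the graph built at optimizer_idx 0 is freed after its backward pass, so the later optimizers can only reuse the embeddings as a constant:

    def training_step(self, batch, batch_idx, optimizer_idx):
        x, y = batch
        if optimizer_idx == 0:
            embeddings = self.encoder(x)
            # Cache a detached copy so the backward passes for optimizer_idx 1 and 2
            # do not try to traverse the graph that the first backward already freed.
            self._cached_embeddings = embeddings.detach()
        else:
            embeddings = self._cached_embeddings
        return nn.functional.mse_loss(self.heads[optimizer_idx](embeddings), y)

And a minimal manual-optimization sketch of the same idea, where the shared forward pass runs exactly once per batch and you drive all three optimizers yourself:

class SharedEncoderModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # opt out of the loop shown above
        self.encoder = nn.Linear(32, 16)
        self.heads = nn.ModuleList(nn.Linear(16, 4) for _ in range(3))

    def training_step(self, batch, batch_idx):
        x, y = batch
        # Single shared forward pass; the heads train on a frozen copy here.
        embeddings = self.encoder(x).detach()
        for head, opt in zip(self.heads, self.optimizers()):
            opt.zero_grad()
            self.manual_backward(nn.functional.mse_loss(head(embeddings), y))
            opt.step()

    def configure_optimizers(self):
        return [torch.optim.Adam(head.parameters()) for head in self.heads]

If the encoder itself should also be updated, one of the optimizers can include its parameters and the detach/retain_graph handling adjusted accordingly; that part is use-case specific.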