TPU Failures in Colab

I am using the wandb logger with a TPU on Colab, and the error below keeps occurring whenever I train on multiple TPU cores.

trainer = pl.Trainer(
    max_epochs=epochs,
    progress_bar_refresh_rate=5,
    tpu_cores=8,
    logger=wandb_logger,
)

Can someone help me understand why?
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 335, in _mp_start_fn
    file=sys.stderr)
  File "/usr/local/lib/python3.6/dist-packages/wandb/lib/redirect.py", line 91, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/wandb_run.py", line 644, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 146, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 151, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 428, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

Can you reproduce the error with this notebook? https://colab.research.google.com/drive/1HvWVVTK8j2Nj52qU4Q4YCyzOm0_aLQF3?usp=sharing

I’ve tried to reproduce this myself using the bug reporting model described above but I cannot. Let me know if you’re able to reproduce this error!

Thank you so much for sending this notebook.

I think I realized my mistake: I was calling wandb_logger.experiment.config.some_param = 'x' before the DDP processes start.

I wanted the logger to log the hyperparameters, so I added them to the wandb config before training starts. Earlier I trained on a single GPU, so this was never a problem. With multiple TPU cores it doesn't work: the wandb process has already been spawned by then, and things break as soon as the training processes are spawned/forked.
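The assertion in the traceback can be reproduced with the standard library alone, which may help show why a process handle created before the fork breaks afterwards. This is only a sketch and not wandb's actual code: `_helper`, `_worker`, and `reproduce` are hypothetical names, and it assumes a platform where the "fork" start method is available (Linux, as on Colab).

```python
import multiprocessing as mp
import time


def _helper():
    # Stand-in for a background service process (hypothetical, not wandb code).
    time.sleep(0.5)


def _worker(helper, queue):
    # Runs in a forked sibling that inherited the parent's `helper` handle.
    try:
        helper.is_alive()  # only the process that started `helper` may ask this
        queue.put("no error")
    except AssertionError as err:
        queue.put(str(err))


def reproduce():
    ctx = mp.get_context("fork")  # assumption: fork is available on this OS
    helper = ctx.Process(target=_helper)
    helper.start()  # handle created in the parent, before the worker forks
    queue = ctx.Queue()
    worker = ctx.Process(target=_worker, args=(helper, queue))
    worker.start()
    message = queue.get(timeout=10)
    worker.join()
    helper.join()
    return message
```

Calling reproduce() should return the same assertion message as in the traceback, because the worker is not the process that started the helper. The training processes appear to be in the same position with respect to the wandb process.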

With the DDP setup, I also tried moving the hparam logging into the on_fit_start hook. This crashed with a similar error as well.

The full error is below:

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter: ··········
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
Tracking run with wandb version 0.10.8
Syncing run EXP_NAME to Weights & Biases (Documentation).
Project page: https://wandb.ai/valaydave/PROJECT_NAME
Run page: https://wandb.ai/valaydave/PROJECT_NAME/runs/2hp8cgze
Run data is saved locally in wandb/run-20201026_203332-2hp8cgze

GPU available: False, used: False
TPU available: True, using: 8 TPU cores
training on 8 TPU cores
---------------------------------------------------------------------------
ProcessRaisedException                    Traceback (most recent call last)
<ipython-input-10-1f9f6fbe4f6c> in <module>()
----> 1 test_x(tmpdir)

5 frames
/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    164         msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
    165         msg += original_trace
--> 166         raise ProcessRaisedException(msg, error_index, failed_process.pid)
    167 
    168 

ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/lib/python3.6/logging/__init__.py", line 996, in emit
    stream.write(msg)
  File "/usr/local/lib/python3.6/dist-packages/wandb/lib/redirect.py", line 91, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/wandb_run.py", line 644, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 146, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 151, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 428, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_backend.py", line 119, in tpu_train_in_process
    self.__setup_tpu_training(model, trainer)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_backend.py", line 221, in __setup_tpu_training
    log.info(f'INIT TPU local core: {trainer.tpu_local_core_rank},'
  File "/usr/lib/python3.6/logging/__init__.py", line 1308, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python3.6/logging/__init__.py", line 1444, in _log
    self.handle(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 1454, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 1516, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 865, in handle
    self.emit(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 1000, in emit
    self.handleError(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 917, in handleError
    sys.stderr.write('--- Logging error ---\n')
  File "/usr/local/lib/python3.6/dist-packages/wandb/lib/redirect.py", line 91, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/wandb_run.py", line 644, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 146, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 151, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 428, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 335, in _mp_start_fn
    file=sys.stderr)
  File "/usr/local/lib/python3.6/dist-packages/wandb/lib/redirect.py", line 91, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/wandb_run.py", line 644, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 146, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 151, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.6/dist-packages/wandb/interface/interface.py", line 428, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.6/multiprocessing/process.py", line 134, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

Colab to replicate the bug: https://colab.research.google.com/drive/1Bg-vw8eubL1Y8ksV0ZCLh3sQDx6awskK?usp=sharing

In light of this problem, I was wondering what the best practice is for logging hparams with wandb when using DDP or multiple TPU cores.

Can you try this:

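from pytorch_lightning.loggers import WandbLogger

# Pass the hparams when the logger is created instead of mutating the run config later.
config = {'some_hparam': 'Logged before Trainer starts DDP'}
wandb_logger = WandbLogger(name='EXP_NAME', project='PROJECT_NAME', log_model=True, config=config)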
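
If the hyperparameters live in more than one place (for example, parsed arguments plus a few manual values), one option is to merge them into a single plain dict first and pass that as config, so nothing needs to touch wandb_logger.experiment afterwards. A minimal sketch, where `build_config` is a hypothetical helper and the values are made up for illustration:

```python
from argparse import Namespace


def build_config(hparams, **extra):
    """Merge hyperparameters into one plain dict (hypothetical helper)."""
    # Accept either an argparse Namespace or any mapping.
    base = vars(hparams) if isinstance(hparams, Namespace) else hparams
    config = dict(base)
    config.update(extra)  # manual values override parsed ones
    return config


# Illustrative values only; use your own hyperparameters here.
config = build_config(Namespace(lr=1e-3, batch_size=32), some_hparam="x")
# Then pass it when creating the logger, e.g. WandbLogger(..., config=config)
```

The dict is built entirely before the logger exists, so it does not depend on when the wandb process or the training processes are started.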