How to manually set a port

From: Lightning example "Address already in use" error ddp (single/multi GPU) · Issue #3327 · Lightning-AI/lightning · GitHub

Hi! I am using the "pytorch_lightning_simple.py" example from here, with the distributed_backend='ddp' option added to pl.Trainer. It gives the error below when selecting one or multiple GPUs:
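For reference, a trimmed sketch of the setup (LightningNet, the Trainer arguments, and the metric name are placeholders standing in for the example script, not the exact code):

import optuna
import pytorch_lightning as pl

def objective(trial):
    # Placeholder LightningModule built from the trial's suggested hyperparameters.
    model = LightningNet(trial)
    # Adding distributed_backend='ddp' here is what leads to the error below.
    trainer = pl.Trainer(gpus=1, max_epochs=3, distributed_backend="ddp")
    trainer.fit(model)
    return trainer.callback_metrics["val_acc"]  # placeholder metric name

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100, timeout=600)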

Environment

Optuna version: optuna (2.0.0)
Python version: 3.7
OS: Linux Ubuntu
(Optional) Other libraries and their versions: pytorch-lightning (0.9.0), torch (1.6.0), torchvision (0.4.2)
Error messages, stack traces, or logs
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
/home/user/test_optuna/.venv/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: WORLD_SIZE environment variable (2) is not equal to the computed world size (1). Ignored.
warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=ddp

All DDP processes registered. Starting ddp with 1 processes

  | Name  | Type | Params
0 | model | Net  | 3 K
/home/user/test_optuna/.venv/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument (try 80 which is the number of cpus on this machine) in the DataLoader init to improve performance.
warnings.warn(*args, **kwargs)
/home/user/test_optuna/.venv/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument (try 80 which is the number of cpus on this machine) in the DataLoader init to improve performance.
warnings.warn(*args, **kwargs)
Epoch 2: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 476/476 [00:07<00:00, 63.27it/s, loss=1.923]
Saving latest checkpoint…
Epoch 2: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 476/476 [00:07<00:00, 63.26it/s, loss=1.923]
[I 2020-08-30 19:43:02,990] Trial 0 finished with value: 0.3950892984867096 and parameters: {'n_layers': 3, 'dropout': 0.4634065756224022, 'n_units_l0': 4, 'n_units_l1': 16, 'n_units_l2': 13}. Best is trial 0 with value: 0.3950892984867096.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using environment variable NODE_RANK for node rank (0).
CUDA_VISIBLE_DEVICES: [0]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
[W 2020-08-30 19:43:02,996] Trial 1 failed because of the following error: RuntimeError('Address already in use',)
Traceback (most recent call last):
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/optuna/study.py", line 709, in _run_trial
result = func(trial)
File "test.py", line 153, in objective
trainer.fit(model)
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in fit
results = self.accelerator_backend.spawn_ddp_children(model)
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 123, in spawn_ddp_children
results = self.ddp_train(local_rank, mp_queue=None, model=model, is_master=True)
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 164, in ddp_train
self.trainer.is_slurm_managing_tasks
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 908, in init_ddp_connection
torch_distrib.init_process_group(torch_backend, rank=global_rank, world_size=world_size)
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Traceback (most recent call last):
File "test.py", line 172, in <module>
study.optimize(objective, n_trials=100, timeout=600)
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/optuna/study.py", line 292, in optimize
func, n_trials, timeout, catch, callbacks, gc_after_trial, None
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/optuna/study.py", line 654, in _optimize_sequential
self._run_trial_and_callbacks(func, catch, callbacks, gc_after_trial)
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/optuna/study.py", line 685, in _run_trial_and_callbacks
trial = self._run_trial(func, catch, gc_after_trial)
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/optuna/study.py", line 709, in _run_trial
result = func(trial)
File "test.py", line 153, in objective
trainer.fit(model)
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in fit
results = self.accelerator_backend.spawn_ddp_children(model)
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 123, in spawn_ddp_children
results = self.ddp_train(local_rank, mp_queue=None, model=model, is_master=True)
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 164, in ddp_train
self.trainer.is_slurm_managing_tasks
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 908, in init_ddp_connection
torch_distrib.init_process_group(torch_backend, rank=global_rank, world_size=world_size)
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/user/test_optuna/.venv/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use

How do I set a port to avoid this error?
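One way to do it is a minimal sketch along these lines: the _env_rendezvous_handler frame in the traceback above takes the store port from the MASTER_PORT environment variable, so each trial can export its own free port before calling trainer.fit (LightningNet, the Trainer arguments, and the metric name are placeholders, as in the sketch above):

import os
import socket

import pytorch_lightning as pl

def find_free_port():
    # Ask the OS for a TCP port that is currently unused on this machine.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

def objective(trial):
    # Each trial initializes a new DDP process group; a per-trial MASTER_PORT
    # keeps it from colliding with the port used by the previous trial.
    os.environ["MASTER_PORT"] = str(find_free_port())

    model = LightningNet(trial)  # placeholder model
    trainer = pl.Trainer(gpus=1, max_epochs=3, distributed_backend="ddp")
    trainer.fit(model)
    return trainer.callback_metrics["val_acc"]  # placeholder metric name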

Hello, my apologies for the late reply. We are slowly deprecating this forum in favor of the built-in GitHub version. Could we kindly ask you to recreate your question there: Lightning Discussions?