note: the following tutorial uses an example with two p3.16xlarge instances (16 V100 GPUs across 2 nodes), but it is easy to generalize to more instances (if AWS's usage limits and capacity availability allow)

  1. Launch multiple instances at once in AWS management console
    • click Launch Instances button
    • change the number of instances to 2 in step 3 (Configure Instance Details)
  2. Allow all traffic (by changing the inbound/outbound rules) in the security group shared by all instances
  3. ssh to instances (in different terminals)
    • run ssh -i {PRIVATE KEY} {USER NAME}@{PUBLIC IPv4 ADDRESS} (the user name depends on the AMI, e.g. ubuntu or ec2-user)
  4. Configure the network interface on every node
    • run ifconfig to find the interface name (e.g. ens3)
    • run export NCCL_SOCKET_IFNAME=ens3 to set the environment variable, so NCCL uses this interface for inter-node communication
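    • alternatively, the variable can be set from inside the training script; a minimal sketch (assuming the interface found above is named ens3):

        import os

        # must be set before init_process_group so NCCL binds to the right interface
        os.environ["NCCL_SOCKET_IFNAME"] = "ens3"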
  5. Add multi-node support to the Python script (see the sketch after this list)
    • to call init_process_group, we need:
      • init_method URL: one can use the private IPv4 address of node 0 plus a free port, e.g. tcp://172.31.22.234:23456
        • this is one of the initialization methods (another option is a shared file system, but that needs more configuration)
      • node index (e.g. 0 and 1 with two nodes)
      • number of processes (GPUs) per node: used to calculate the global rank of a GPU
      • world size: total number of GPUs to be used across all nodes
      • global rank: calculated from the above parameters and the local rank of a GPU
    • note: the local rank of each GPU will be passed in automatically by torch.distributed.launch (see how to use it in the next step)
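    • for concreteness, here is a minimal sketch of this step; the flag names (--node_index, --gpus_per_node, --nnodes, --master_ip) are illustrative, not fixed by the tutorial, and the IP/port are the example values from above:

        # minimal multi-node setup sketch; torch.distributed.launch runs one copy per GPU
        import argparse

        import torch
        import torch.distributed as dist

        parser = argparse.ArgumentParser()
        # --local_rank is supplied automatically by torch.distributed.launch
        parser.add_argument("--local_rank", type=int, default=0)
        # the remaining flags are illustrative names for the per-node settings
        parser.add_argument("--node_index", type=int, default=0)      # 0 or 1 with two nodes
        parser.add_argument("--gpus_per_node", type=int, default=8)
        parser.add_argument("--nnodes", type=int, default=2)
        parser.add_argument("--master_ip", type=str, default="172.31.22.234")  # private IPv4 of node 0
        args = parser.parse_args()

        # world size = total number of GPUs across all nodes
        world_size = args.gpus_per_node * args.nnodes
        # global rank = node index * GPUs per node + local rank
        global_rank = args.node_index * args.gpus_per_node + args.local_rank

        dist.init_process_group(
            backend="nccl",  # recommended backend for multi-GPU training
            init_method="tcp://{}:23456".format(args.master_ip),
            world_size=world_size,
            rank=global_rank,
        )
        torch.cuda.set_device(args.local_rank)  # bind this process to its own GPU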
  6. Use torch.distributed.launch to run the training script in a distributed setting
    • on each node, run python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 {PYTHON_FILE_NAME} {ARGS OF PYTHON SCRIPT}, passing the node index through the script's own arguments so each node computes the correct global ranks (see the example below)
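    • for example, assuming the script parses the illustrative --node_index flag from the sketch above, the command on each node would look like:

        # on node 0 (the node whose private IP appears in the tcp:// URL):
        python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 {PYTHON_FILE_NAME} --node_index=0 {ARGS OF PYTHON SCRIPT}
        # on node 1:
        python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 {PYTHON_FILE_NAME} --node_index=1 {ARGS OF PYTHON SCRIPT}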
  7. Go fast girl!
