Horovod is a great tool for distributed training. Developed at Uber, the framework is built on MPI and TCP communication (in my case, it used successive free ports starting from 1024).
I highly recommend reading all the information in Uber's repository, but the instructions there are based on
tf.train.MonitoredTrainingSession. In this article I will show you how to parallelize GPU training using a plain tf.Session.
Import the library and initialize Horovod.
Pass your optimizer to the hvd.DistributedOptimizer wrapper. It averages the gradients across all processes before they are applied.
Create the session configuration. Each process should be pinned to exactly one of the GPUs.
Pass the configuration to the session and then initialize the variable weights. The
hvd.broadcast_global_variables function ensures that rank 0's values are sent to all processes, so every worker starts from identical weights.
To avoid printing duplicate information, you can use the
hvd.rank() == 0 condition, which restricts output to a single process (rank zero by convention). Check out my sample code for classifying the CIFAR-10 dataset.