Horovod is a great tool for distributed training. This framework, created by Uber, is built on MPI and TCP communication (in my case it communicated over consecutive free ports starting from 1024).

I highly recommend reading all the information in Uber’s repository, but the instructions there are based on tf.train.MonitoredTrainingSession. In this article I will show you how to parallelize training across GPUs using a traditional tf.Session.

Import the library and initialize Horovod.

import tensorflow as tf
import horovod.tensorflow as hvd
hvd.init()
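
After initialization, each process can query its identity in the job. A quick way to see this (just an illustrative sketch, assuming the script is launched with several processes) is to print the rank, local rank, and world size:

# Illustrative check of the process identity after hvd.init():
# hvd.rank()       - global index of this process across all machines
# hvd.local_rank() - index of this process on the current machine (used below to pick a GPU)
# hvd.size()       - total number of processes
print('rank %d of %d, local rank %d' % (hvd.rank(), hvd.size(), hvd.local_rank()))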

Wrap your optimizer in DistributedOptimizer. The wrapper averages the gradients across all processes (via allreduce) before they are applied.

optimizer = tf.train.AdamOptimizer(learning_rate=1e-3*hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)
train = optimizer.minimize(loss)
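
The learning rate is scaled by hvd.size() because the effective batch size grows with the number of workers; this is the scaling rule recommended in the Horovod documentation. If you need to inspect or clip the averaged gradients, the wrapper also works with the explicit two-step form (a minimal sketch, assuming the same loss tensor as above):

# DistributedOptimizer performs the allreduce inside compute_gradients,
# so the gradients returned here are already averaged across processes.
grads_and_vars = optimizer.compute_gradients(loss)
# ...optional gradient clipping or inspection could go here...
train = optimizer.apply_gradients(grads_and_vars)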

Create the session configuration. Each process should be bound to a single GPU, selected by its local rank.

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
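
Because each process now sees only one GPU, that device is addressed as /gpu:0 inside the graph regardless of its physical index. The layer below is just a hypothetical example of explicit placement:

# The only visible GPU is remapped to /gpu:0 within this process,
# so explicit device placement does not need the physical index.
with tf.device('/gpu:0'):
    logits = tf.layers.dense(features, 10)  # hypothetical model layer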

Pass the configuration to the session and then initialize the variables. The broadcast_global_variables op sends the initial values from rank 0 to all other processes, so every worker starts from identical weights.

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(hvd.broadcast_global_variables(0))
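
Putting the pieces together, a full run might look like the following sketch; the step count, the logging interval, and the assumption that the graph reads its own input (e.g. via tf.data) are illustrative:

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(hvd.broadcast_global_variables(0))

    for step in range(1000):                      # illustrative number of steps
        _, loss_value = sess.run([train, loss])   # assumes the graph provides its own input
        if step % 100 == 0 and hvd.rank() == 0:   # print from a single process only
            print('step %d, loss %.4f' % (step, loss_value))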

To avoid printing duplicate information, use the hvd.rank() == 0 condition (as in the loop above); it restricts logging, checkpointing, and similar side effects to a single process, conventionally the one with rank zero. Check out my sample code for classifying the CIFAR-10 dataset.