1、openmi 下载安装
下载连接:
https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz
安装命令
1
2
3
4
5
shell$ gunzip -c openmpi-4.0.1.tar.gz | tar xf -
shell$ cd openmpi-4.0.1
shell$ ./configure --prefix=/usr/local
<…lots of output…>
shell$ make all install
sudo ldconfig
2、horovod安装
官方文档: https://github.com/horovod/horovod#install
[sudo] pip3 install horovod
安装支持NCCL的版本的horovod
HOROVOD_GPU_ALLREDUCE=NCCL pip3 install --no-cache-dir horovod
3、horovod 使用
3.1 tensorFLow 修改
import tensorflow as tf
import horovod.tensorflow as hvd
hvd.init()
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
loss = …
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
train_op = opt.minimize(loss)
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
config=config,
hooks=hooks) as mon_sess:
while not mon_sess.should_stop():
# Perform synchronous training.
mon_sess.run(train_op)
3.2 tensorflow 运行
mpi 指定mca通讯端口
mpirun --allow-run-as-root --oversubscribe \
-np 8-H ubuntu1:4,ubuntu2:4 \
-bind-to none -map-by slot \
-mca plm_rsh_args "-p 22" \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python3 -u train.py
手机扫一扫
移动阅读更方便
你可能感兴趣的文章