Installing Horovod, a distributed machine learning framework (Linux)
Original article date: 2023-07-13

1. Download and install Open MPI

  Download link:

    https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz

  Installation commands:


shell$ gunzip -c openmpi-4.0.1.tar.gz | tar xf -
shell$ cd openmpi-4.0.1
shell$ ./configure --prefix=/usr/local
<…lots of output…>
shell$ make all install

sudo ldconfig
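
After the build finishes, it can help to confirm that the new `mpirun` is the one found on `PATH`. The following is a minimal sketch, not part of the original instructions; the helper name `mpirun_version` is hypothetical, and it only assumes that `mpirun --version` prints a version banner on its first line.

```python
import shutil
import subprocess


def mpirun_version(executable="mpirun"):
    """Return the first line of `<executable> --version`, or None if unavailable."""
    path = shutil.which(executable)
    if path is None:
        # Not on PATH: the install prefix's bin directory may be missing from PATH.
        return None
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(mpirun_version() or "mpirun not found on PATH")
```

If this reports that `mpirun` is missing, check that `/usr/local/bin` (the prefix used above) is on `PATH`.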

2. Install Horovod

Official documentation: https://github.com/horovod/horovod#install

[sudo] pip3 install horovod

To install a Horovod build with NCCL support:

HOROVOD_GPU_ALLREDUCE=NCCL pip3 install --no-cache-dir horovod
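
To see whether the installed package was actually compiled with MPI and NCCL, one option is to query Horovod from Python. This is a rough sketch under the assumption that the TensorFlow bindings are installed; the helper name `horovod_build_info` is hypothetical, and it returns `None` when `horovod.tensorflow` cannot be imported.

```python
def horovod_build_info():
    """Report which communication backends this Horovod build includes."""
    try:
        import horovod.tensorflow as hvd
    except ImportError:
        # Horovod (or its TensorFlow extension) is not installed in this environment.
        return None
    return {
        "mpi": bool(hvd.mpi_built()),
        "nccl": bool(hvd.nccl_built()),
    }


if __name__ == "__main__":
    print(horovod_build_info())
```

If `nccl` comes back false after installing with `HOROVOD_GPU_ALLREDUCE=NCCL`, the build most likely did not find the NCCL libraries, and reinstalling with `--no-cache-dir` forces a fresh compile.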

3. Using Horovod

3.1 TensorFlow code changes

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model…
loss = …
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
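
The script above relies on three values from Horovod: `hvd.size()` (total number of processes), `hvd.rank()` (global process index) and `hvd.local_rank()` (index within one host, used to pick the GPU). The sketch below illustrates how these relate, using plain Python with no Horovod dependency. The function names and the host dictionary are hypothetical, and the layout assumes processes fill each host's slots in order; the real placement is decided by the launcher.

```python
def scaled_learning_rate(base_lr, world_size):
    """Scale the single-process learning rate by the number of workers."""
    if world_size < 1:
        raise ValueError("world_size must be >= 1")
    return base_lr * world_size


def rank_layout(slots_per_host):
    """Return (rank, host, local_rank) tuples, filling each host's slots in order."""
    layout = []
    rank = 0
    for host, slots in slots_per_host.items():
        for local_rank in range(slots):
            layout.append((rank, host, local_rank))
            rank += 1
    return layout


if __name__ == "__main__":
    # Illustrative cluster: two hosts with four GPUs each.
    hosts = {"ubuntu1": 4, "ubuntu2": 4}
    layout = rank_layout(hosts)
    print("learning rate:", scaled_learning_rate(0.01, len(layout)))
    for rank, host, local_rank in layout:
        print(f"rank {rank} on {host} uses GPU {local_rank}")
```

Under this layout only the process with rank 0 writes checkpoints, while every process selects its GPU from its local rank.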

3.2 Running the TensorFlow job

Launch with mpirun; the MCA parameter plm_rsh_args sets the SSH port used to reach the remote hosts:

mpirun --allow-run-as-root --oversubscribe \
-np 8 -H ubuntu1:4,ubuntu2:4 \
-bind-to none -map-by slot \
-mca plm_rsh_args "-p 22" \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python3 -u train.py
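
The value passed to `-np` has to match the sum of the slot counts in `-H`, which is easy to get wrong when editing the command by hand. As a minimal sketch, the command can be assembled from a host-to-slots mapping instead; the helper name `build_mpirun_command` and its parameters are hypothetical, and the flags are the ones used in the command above.

```python
import shlex


def build_mpirun_command(hosts, script, ssh_port=22,
                         env_vars=("NCCL_DEBUG=INFO", "LD_LIBRARY_PATH", "PATH")):
    """Build an mpirun argument list whose -np matches the total slot count."""
    total = sum(hosts.values())
    host_spec = ",".join(f"{host}:{slots}" for host, slots in hosts.items())
    cmd = [
        "mpirun", "--allow-run-as-root", "--oversubscribe",
        "-np", str(total), "-H", host_spec,
        "-bind-to", "none", "-map-by", "slot",
        "-mca", "plm_rsh_args", f"-p {ssh_port}",
    ]
    for var in env_vars:
        # -x exports an environment variable to the remote processes.
        cmd += ["-x", var]
    cmd += ["-mca", "pml", "ob1", "-mca", "btl", "^openib",
            "python3", "-u", script]
    return cmd


if __name__ == "__main__":
    command = build_mpirun_command({"ubuntu1": 4, "ubuntu2": 4}, "train.py")
    # shlex.join (Python 3.8+) quotes arguments so the line can be pasted into a shell.
    print(shlex.join(command))
```

The returned list can also be passed directly to `subprocess.run`, which avoids shell quoting issues around the `"-p 22"` argument.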
