MinkowskiEngine Multi-GPU Training

Date: 2023-07-09

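Currently, MinkowskiEngine supports multi-GPU training through data parallelization. In data parallelization, a set of mini-batches is fed into a set of replicas of the network, one replica per GPU.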
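To make the idea concrete, here is a minimal plain-Python sketch of how one global batch could be divided into per-device mini-batches. The helper name `split_batch` and the integer sample IDs are illustrative assumptions; they are not part of MinkowskiEngine.

```python
def split_batch(samples, num_devices):
    """Split a list of samples into num_devices contiguous mini-batches.

    When len(samples) is not divisible by num_devices, the first few
    mini-batches receive one extra sample each.
    """
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    base, extra = divmod(len(samples), num_devices)
    chunks, start = [], 0
    for i in range(num_devices):
        size = base + (1 if i < extra else 0)
        chunks.append(samples[start:start + size])
        start += size
    return chunks


# Eight samples over four devices: two samples per device
mini_batches = split_batch(list(range(8)), 4)
print(mini_batches)
```

Each inner list would then be turned into a sparse tensor on its own GPU and passed to that GPU's replica.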

First, define a network.

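import MinkowskiEngine as ME
from examples.minkunet import MinkUNet34C

# Build the network, then copy it to the target GPU.
# target_device is the GPU that later gathers the loss (devices[0] in the full example).
net = MinkUNet34C(3, 20, D=3)
net = net.to(target_device)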

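Synchronized Batch Norm

Next, we create a new network in which every ME.MinkowskiBatchNorm is replaced with ME.MinkowskiSyncBatchNorm. This lets the network use a large total batch size while maintaining the same performance as single-GPU training.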

# Synchronized batch norm
net = ME.MinkowskiSyncBatchNorm.convert_sync_batchnorm(net)
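The following plain-Python sketch illustrates, conceptually, why synchronization matters: statistics computed on each device alone differ from the statistics of the whole batch, but combining per-device counts, sums, and squared sums recovers the global values. This is only an illustration under simplified assumptions (scalar features, biased variance); it is not how MinkowskiEngine implements the layer, and the function names are hypothetical.

```python
def batch_stats(values):
    """Return (mean, biased variance) of a list of floats."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def synced_stats(per_device_values):
    """Combine per-device counts, sums, and squared sums into global statistics."""
    count = sum(len(v) for v in per_device_values)
    total = sum(sum(v) for v in per_device_values)
    total_sq = sum(sum(x * x for x in v) for v in per_device_values)
    mean = total / count
    var = total_sq / count - mean ** 2
    return mean, var


device_0 = [1.0, 2.0, 3.0]
device_1 = [10.0, 12.0]

local_0 = batch_stats(device_0)        # what replica 0 would see on its own
local_1 = batch_stats(device_1)        # what replica 1 would see on its own
global_stats = synced_stats([device_0, device_1])
reference = batch_stats(device_0 + device_1)

print(local_0, local_1, global_stats, reference)
```

The two local means (2.0 and 11.0) are far from the global mean (5.6), which is what unsynchronized batch norm would normalize with on each device.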

Next, we need to create replicas of the network and of the final loss layer (if you use one).

import torch.nn as nn
import torch.nn.parallel as parallel

# One copy of the loss layer per device
criterion = nn.CrossEntropyLoss()
criterions = parallel.replicate(criterion, devices)

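Loading multiple batches

During training, every iteration needs a set of mini-batches, one per device. Here we use a function that returns one mini-batch per call, but you do not need to follow this pattern.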

# Get new data
inputs, labels = [], []
for i in range(num_devices):
    coords, feat, label = data_loader()  # parallel data loaders can be used
    with torch.cuda.device(devices[i]):
        # Same constructor call as in the full example below
        inputs.append(ME.SparseTensor(feat, coords, device=devices[i]))
        labels.append(label.to(devices[i]))
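The comment in the snippet notes that parallel data loaders can be used. As a rough, framework-free sketch of that idea, the code below loads one mini-batch per device concurrently with a thread pool. The function `load_mini_batch` and the data it returns are hypothetical stand-ins; in real training you would more likely rely on a PyTorch `DataLoader` with worker processes.

```python
from concurrent.futures import ThreadPoolExecutor


def load_mini_batch(device_id):
    """Hypothetical loader: returns (coords, feats, labels) for one device."""
    coords = [(device_id, i, 0, 0) for i in range(3)]
    feats = [float(i) for i in range(3)]
    labels = [0, 0, 0]
    return coords, feats, labels


def load_all(devices):
    """Load one mini-batch per device concurrently, keeping device order."""
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        # Executor.map returns results in the order of the inputs
        return list(pool.map(load_mini_batch, devices))


batches = load_all([0, 1])
print(len(batches))
```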

Copy weights to devices

First, we copy the weights to all devices.

replicas = parallel.replicate(net, devices)

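Applying replicas to all batches

Next, we feed each mini-batch to the corresponding replica of the network on its device. All output features are then fed into the loss layers.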

outputs = parallel.parallel_apply(replicas, inputs, devices=devices)

# Extract features from the sparse tensors to use a pytorch criterion
out_features = [output.F for output in outputs]
losses = parallel.parallel_apply(
    criterions, tuple(zip(out_features, labels)), devices=devices)

Then, we gather all losses on the target device.

loss = parallel.gather(losses, target_device, dim=0).mean()
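One detail worth keeping in mind: taking the mean of the gathered per-device losses weights every device equally. If each device holds a different number of points, which can easily happen with sparse tensors, this differs from the mean over all points. The plain-Python sketch below shows the difference on made-up per-point loss values; the function names are illustrative and the weighted variant is only one possible alternative, not something the tutorial prescribes.

```python
def mean(values):
    return sum(values) / len(values)


def gather_mean(per_device_losses):
    """Unweighted mean of the per-device mean losses."""
    return mean([mean(losses) for losses in per_device_losses])


def weighted_gather_mean(per_device_losses):
    """Mean over all points, weighting each device by its number of points."""
    total = sum(sum(losses) for losses in per_device_losses)
    count = sum(len(losses) for losses in per_device_losses)
    return total / count


equal = [[1.0, 3.0], [5.0, 7.0]]                # same number of points per device
unequal = [[1.0, 3.0], [5.0, 7.0, 9.0, 11.0]]   # device 1 has twice as many points

print(gather_mean(equal), weighted_gather_mean(equal))
print(gather_mean(unequal), weighted_gather_mean(unequal))
```

With equal point counts both variants agree (4.0); with unequal counts they give 5.0 and 6.0 respectively.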

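The rest of the training, such as the backward pass and taking a step with the optimizer, is the same as in single-GPU training. Please refer to the complete multi-GPU example below for more detail.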
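As a framework-free illustration of why the remaining steps look the same, the sketch below performs one SGD step on a toy one-parameter model. Averaging the gradients of equally sized mini-batches gives the same update as a single large batch. The model, data, and function names are assumptions chosen purely for illustration.

```python
def grad_mse(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w for one mini-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, mini_batches, lr):
    """One SGD step using the mean of the per-replica gradients."""
    grads = [grad_mse(w, batch) for batch in mini_batches]  # one per replica
    mean_grad = sum(grads) / len(grads)
    return w - lr * mean_grad


# Two replicas, two (x, y) samples each
mini_batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]

w_parallel = data_parallel_step(0.0, mini_batches, lr=0.01)
# Same data treated as one batch on a single device
w_single = data_parallel_step(0.0, [mini_batches[0] + mini_batches[1]], lr=0.01)

print(w_parallel, w_single)
```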

import os
import argparse
import numpy as np
from time import time
from urllib.request import urlretrieve

try:
    import open3d as o3d
except ImportError:
    raise ImportError("Please install open3d-python with `pip install open3d`.")

import torch
import torch.nn as nn
from torch.optim import SGD

import MinkowskiEngine as ME
from examples.minkunet import MinkUNet34C

import torch.nn.parallel as parallel

# Download the sample point cloud if it is not already present
if not os.path.isfile("1.ply"):
    urlretrieve("http://cvgl.stanford.edu/data2/minkowskiengine/1.ply", "1.ply")

parser = argparse.ArgumentParser()
parser.add_argument("--file_name", type=str, default="1.ply")
parser.add_argument("--batch_size", type=int, default=4)
parser.add_argument("--max_ngpu", type=int, default=2)

# Cache loaded point clouds so the file is read from disk only once
cache = {}


def load_file(file_name, voxel_size):
    if file_name not in cache:
        pcd = o3d.io.read_point_cloud(file_name)
        cache[file_name] = pcd

    pcd = cache[file_name]
    quantized_coords, feats = ME.utils.sparse_quantize(
        np.array(pcd.points, dtype=np.float32),
        np.array(pcd.colors, dtype=np.float32),
        quantization_size=voxel_size,
    )
    # Placeholder labels (all zeros); this example only measures speed
    random_labels = torch.zeros(len(feats))

    return quantized_coords, feats, random_labels


def generate_input(file_name, voxel_size):
    # Create a batch, this process is done in a data loader during training in parallel.
    batch = [load_file(file_name, voxel_size)]
    coordinates_, features_, labels_ = list(zip(*batch))
    coordinates, features, labels = ME.utils.sparse_collate(
        coordinates_, features_, labels_
    )

    # Normalize features and create a sparse tensor
    return coordinates, (features - 0.5).float(), labels

if __name__ == "__main__":
    # loss and network
    config = parser.parse_args()
    num_devices = torch.cuda.device_count()
    num_devices = min(config.max_ngpu, num_devices)
    devices = list(range(num_devices))
    print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
    print("' WARNING: This example is deprecated. '")
    print("' Please use DistributedDataParallel or pytorch-lightning'")
    print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
    print(
        f"Testing {num_devices} GPUs. Total batch size: {num_devices * config.batch_size}"
    )

    # For copying the final loss back to one GPU
    target_device = devices[0]

    # Copy the network to GPU
    net = MinkUNet34C(3, 20, D=3)
    net = net.to(target_device)

    # Synchronized batch norm
    net = ME.MinkowskiSyncBatchNorm.convert_sync_batchnorm(net)
    optimizer = SGD(net.parameters(), lr=1e-1)

    # Copy the loss layer
    criterion = nn.CrossEntropyLoss()
    criterions = parallel.replicate(criterion, devices)
    min_time = np.inf

    for iteration in range(10):
        optimizer.zero_grad()

        # Get new data
        inputs, all_labels = [], []
        for i in range(num_devices):
            coordinates, features, labels = generate_input(config.file_name, 0.05)
            with torch.cuda.device(devices[i]):
                inputs.append(ME.SparseTensor(features, coordinates, device=devices[i]))
                all_labels.append(labels.long().to(devices[i]))

        # The raw version of the parallel_apply
        st = time()
        replicas = parallel.replicate(net, devices)
        outputs = parallel.parallel_apply(replicas, inputs, devices=devices)

        # Extract features from the sparse tensors to use a pytorch criterion
        out_features = [output.F for output in outputs]
        losses = parallel.parallel_apply(
            criterions, tuple(zip(out_features, all_labels)), devices=devices
        )
        loss = parallel.gather(losses, target_device, dim=0).mean()

        # Gradient
        loss.backward()
        optimizer.step()

        t = time() - st
        min_time = min(t, min_time)
        print(
            f"Iteration: {iteration}, Loss: {loss.item()}, Time: {t}, Min time: {min_time}"
        )

        # Must clear cache at regular interval
        if iteration % 10 == 0:
            torch.cuda.empty_cache()
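The device selection in the script can be tried without any GPU. The sketch below reuses the same three command-line options and caps the device list at `--max_ngpu`. The helper `select_devices` and its `available` argument are assumptions that stand in for `torch.cuda.device_count()`.

```python
import argparse


def build_parser():
    """Same three options as the example script above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--file_name", type=str, default="1.ply")
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--max_ngpu", type=int, default=2)
    return parser


def select_devices(max_ngpu, available):
    """Use at most max_ngpu of the available GPUs, starting from index 0."""
    return list(range(min(max_ngpu, available)))


# Ask for up to 4 GPUs on a machine that (hypothetically) has only 3
config = build_parser().parse_args(["--max_ngpu", "4", "--batch_size", "2"])
devices = select_devices(config.max_ngpu, available=3)
total_batch_size = len(devices) * config.batch_size

print(devices, total_batch_size)
```

Here the script would run on devices 0, 1, and 2 and report a total batch size of 6.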

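Speedup Experiments

We ran experiments with various batch sizes on 4x Titan XP GPUs, dividing the load equally among the GPUs. For instance, with 1 GPU the batch size is 8. With 2 GPUs, each GPU has batch size 4. With 4 GPUs, each GPU has batch size 2.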

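Number of GPUs   Batch size per GPU   Time per iteration   Speedup (ideal)
1 GPU            8                    1.611 s              x1 (x1)
2 GPUs           4                    0.916 s              x1.76 (x2)
4 GPUs           2                    0.689 s              x2.34 (x4)

Number of GPUs   Batch size per GPU   Time per iteration   Speedup (ideal)
1 GPU            -                    2.691 s              x1 (x1)
2 GPUs           -                    1.413 s              x1.90 (x2)
3 GPUs           -                    1.064 s              x2.53 (x3)
4 GPUs           -                    1.006 s              x2.67 (x4)

Number of GPUs   Batch size per GPU   Time per iteration   Speedup (ideal)
1 GPU            -                    3.543 s              x1 (x1)
2 GPUs           -                    1.933 s              x1.83 (x2)
4 GPUs           -                    1.322 s              x2.68 (x4)

Number of GPUs   Batch size per GPU   Time per iteration   Speedup (ideal)
1 GPU            18                   4.391 s              x1 (x1)
2 GPUs           9                    2.114 s              x2.08 (x2)
3 GPUs           6                    1.660 s              x2.65 (x3)

Number of GPUs   Batch size per GPU   Time per iteration   Speedup (ideal)
1 GPU            20                   4.639 s              x1 (x1)
2 GPUs           10                   2.426 s              x1.91 (x2)
4 GPUs           5                    1.707 s              x2.72 (x4)

Number of GPUs   Batch size per GPU   Time per iteration   Speedup (ideal)
1 GPU            -                    4.894 s              x1 (x1)
3 GPUs           -                    1.877 s              x2.61 (x3)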
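The speedup column is the single-GPU time divided by the multi-GPU time, and dividing that by the number of GPUs gives the parallel efficiency. The short sketch below reproduces the figures for the first experiment (total batch size 8); the function names are illustrative.

```python
def speedup(t_single, t_multi):
    """Measured speedup relative to the single-GPU run."""
    return t_single / t_multi


def efficiency(t_single, t_multi, num_gpus):
    """Fraction of the ideal (linear) speedup that was achieved."""
    return speedup(t_single, t_multi) / num_gpus


# Seconds per iteration for total batch size 8, keyed by number of GPUs
timings = {1: 1.611, 2: 0.916, 4: 0.689}

for num_gpus, seconds in timings.items():
    s = speedup(timings[1], seconds)
    e = efficiency(timings[1], seconds, num_gpus)
    print(f"{num_gpus} GPU(s): speedup x{s:.2f}, efficiency {e:.2f}")
```

For 4 GPUs this gives a speedup of about x2.34 and an efficiency of about 0.58, meaning a little over half of the ideal x4 is realized.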

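Analysis

The speedup is quite modest when the batch size is small. For large batch sizes (e.g. 18 and 20), the speedup improves because the thread initialization overhead is amortized over a larger workload.

Also, using 4 GPUs is not efficient in any of the cases, and the gain over 3 GPUs is very small (x2.65 for 3 GPUs with total batch size 18 vs. x2.72 for 4 GPUs with total batch size 20). It is therefore recommended to use at most 3 GPUs, with large batch sizes.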

Number of GPUs   Average speedup (ideal)
1 GPU            x1 (x1)
2 GPUs           x1.90 (x2)
3 GPUs           x2.60 (x3)
4 GPUs           x2.60 (x4)
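The averages can be checked directly from the per-experiment speedups reported earlier. The sketch below groups those speedups by GPU count and averages them; the dictionary layout is just one convenient way to organize the numbers.

```python
from statistics import mean

# Measured speedups from the experiments above, grouped by number of GPUs
speedups = {
    2: [1.76, 1.90, 1.83, 2.08, 1.91],
    3: [2.53, 2.65, 2.61],
    4: [2.34, 2.67, 2.68, 2.72],
}

averages = {num_gpus: mean(values) for num_gpus, values in speedups.items()}

for num_gpus, avg in averages.items():
    print(f"{num_gpus} GPUs: average speedup x{avg:.2f} (ideal x{num_gpus})")
```

Rounded to two decimals, the results match the table: x1.90, x2.60, and x2.60.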

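The modest speedup is caused by heavy CPU usage. In the Minkowski Engine, all sparse tensor coordinates are managed on the CPU, and building the kernel in-out maps requires significant CPU computation. For a larger speedup, we therefore recommend using faster CPUs, since the CPU can become the bottleneck for large point clouds.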
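One rough way to quantify such a bottleneck, offered here only as an illustration and not as part of the original analysis, is the Karp-Flatt metric. It estimates the fraction of the work that behaves as if it were serial from a measured speedup S on n processors: e = (1/S - 1/n) / (1 - 1/n). The sketch applies it to the average speedups above.

```python
def karp_flatt(speedup, num_gpus):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    if num_gpus < 2:
        raise ValueError("the metric is defined for two or more processors")
    return (1.0 / speedup - 1.0 / num_gpus) / (1.0 - 1.0 / num_gpus)


# Average speedups from the table above, keyed by number of GPUs
average_speedups = {2: 1.90, 3: 2.60, 4: 2.60}

serial_fractions = {
    n: karp_flatt(s, n) for n, s in average_speedups.items()
}

for n, e in serial_fractions.items():
    print(f"{n} GPUs: estimated serial fraction {e:.3f}")
```

The estimate grows from roughly 0.05 with 2 GPUs to roughly 0.18 with 4 GPUs. A serial fraction that increases with the number of GPUs suggests overhead that scales with the device count, which is consistent with the shared CPU work described above, although this metric alone cannot identify the cause.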
