PyTorch Training Notes | Euruson's Blog

Notes on Deep Learning Training with PyTorch

Parallel Computing with Distributed Data Parallel

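Parallel training comes in two flavors: model parallelism and data parallelism. On a single machine with multiple GPUs, PyTorch's torch.nn.DataParallel is very convenient to use, but the speedup it delivers is underwhelming. torch.nn.parallel.DistributedDataParallel performs much better; it reportedly scales only slightly worse than linearly with the number of GPUs. Distributed Data Parallel also extends to multi-machine, multi-GPU setups.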
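To make the DDP workflow concrete, here is a minimal sketch of what one process in a DDP job might do. The function name `ddp_worker`, the toy linear model, the random data, and the fallback to the `gloo` backend on CPU are all assumptions made for illustration; a real job would also shard its dataset per process (for example with a distributed sampler).

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def ddp_worker(rank, world_size, init_method):
    """Run one training step in one process of a DDP job and return the loss.

    Hypothetical helper: `rank` is this process's index, `world_size` the
    total number of processes, `init_method` a rendezvous URL.
    """
    use_cuda = torch.cuda.is_available()
    backend = "nccl" if use_cuda else "gloo"  # gloo lets the sketch run on CPU
    dist.init_process_group(
        backend, init_method=init_method, rank=rank, world_size=world_size
    )
    try:
        if use_cuda:
            torch.cuda.set_device(rank)  # one GPU per process
            device = torch.device("cuda", rank)
        else:
            device = torch.device("cpu")

        model = nn.Linear(10, 1).to(device)  # toy model, for illustration only
        ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        # Random data stands in for this process's shard of the dataset.
        x = torch.randn(16, 10, device=device)
        y = torch.randn(16, 1, device=device)

        loss = nn.functional.mse_loss(ddp_model(x), y)
        optimizer.zero_grad()
        loss.backward()  # DDP averages gradients across processes here
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()
```

On a single machine, one way to start the processes is `torch.multiprocessing.spawn(ddp_worker, args=(world_size, init_method), nprocs=world_size)`, which passes the rank as the first argument.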

The official site has a series of tutorials, though they feel somewhat disorganized (listed in recommended reading order):

On Zhihu, this series explains DDP in a well-organized and clear way:

Exponential Moving Average (EMA)

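This article is very well written: 指数移动平均（EMA）的原理及PyTorch实现 (The Principle of Exponential Moving Average (EMA) and Its PyTorch Implementation)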

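EMA is simply a weighted average of the model parameters over training steps. The most recently trained parameters are likely better than earlier ones, so they get the largest weights, while the weight of older parameters decays exponentially with their age: with decay rate $\beta$, the parameters from $k$ steps ago carry a weight proportional to $\beta^k$, which drops to about $1/e$ of the newest weight after roughly $1/(1-\beta)$ steps. This is presumably where the "Exponential" in the name comes from.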
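The $1/e$ figure can be checked numerically. The sketch below assumes the common update rule `ema = beta * ema + (1 - beta) * value`; the helper `ema_weights` and the choice `beta = 0.999` are illustrative only.

```python
import math


def ema_weights(beta, num_steps):
    """Weight that the EMA gives to the value seen k steps ago, k = 0..num_steps-1."""
    return [(1 - beta) * beta ** k for k in range(num_steps)]


beta = 0.999
horizon = round(1 / (1 - beta))  # about 1000 steps for beta = 0.999
weights = ema_weights(beta, horizon + 1)

# Weight of the value from `horizon` steps ago relative to the newest value.
ratio = weights[horizon] / weights[0]
print(ratio, 1 / math.e)  # the two numbers should be close
```

This is why a decay of 0.9999 is often read as "averaging over roughly the last 10,000 updates".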

Here is the EMA implementation from ultralytics/yolov3. The project is really well written, and I learned a great deal from reading its source code:

Model Exponential Moving Average from https://github.com/rwightman/pytorch-image-models

Keep a moving average of everything in the model state_dict (parameters and buffers). This is intended to allow functionality like https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage. A smoothed version of the weights is necessary for some training schemes to perform well. E.g. Google's hyper-params for training MNASNet, MobileNet-V3, EfficientNet, etc that use RMSprop with a short 2.4-3 epoch decay period and slow LR decay rate of .96-.99 requires EMA smoothing of weights to match results. Pay attention to the decay constant you are using relative to your update count per epoch.

To keep EMA from using GPU resources, set device='cpu'. This will save a bit of memory but disable validation of the EMA weights. Validation will have to be done manually in a separate process, or after the training stops converging.

This class is sensitive where it is initialized in the sequence of model init, GPU assignment and distributed training wrappers.

I've tested with the sequence in my own train.py for torch.DataParallel, apex.DDP, and single-GPU.

from copy import deepcopy
import math

import torch
import torch.nn as nn


class ModelEMA:
    def __init__(self, model, decay=0.9999, device=""):
        # make a copy of the model for accumulating moving average of weights
        self.ema = deepcopy(model)
        self.ema.eval()
        self.updates = 0  # number of EMA updates
        self.decay = lambda x: decay * (
            1 - math.exp(-x / 2000)
        )  # decay exponential ramp (to help early epochs)
        self.device = device  # perform ema on different device from model if set
        if device:
            self.ema.to(device=device)
        for p in self.ema.parameters():
            p.requires_grad_(False)

    def update(self, model):
        self.updates += 1
        d = self.decay(self.updates)
        with torch.no_grad():
            if type(model) in (
                nn.parallel.DataParallel,
                nn.parallel.DistributedDataParallel,
            ):
                msd, esd = model.module.state_dict(), self.ema.module.state_dict()
            else:
                msd, esd = model.state_dict(), self.ema.state_dict()

            for k, v in esd.items():
                if v.dtype.is_floating_point:
                    v *= d
                    v += (1.0 - d) * msd[k].detach()

    def update_attr(self, model):
        # Assign attributes (which may change during training)
        for k in model.__dict__.keys():
            if not k.startswith("_"):
                setattr(self.ema, k, getattr(model, k))

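While reading about DDP recently, I noticed that hooks exist. It would be great if EMA could be implemented with a hook: there might then be no need to define an explicit EMA class, and the code would be more elegant.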
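One possible way to try the hook idea is an optimizer post-step hook, so the EMA copy is refreshed automatically after every `optimizer.step()`. This is only a sketch under assumptions: it relies on `Optimizer.register_step_post_hook`, which requires PyTorch 2.0 or newer, it uses a constant decay rather than the ramped decay of the class above, and `attach_ema` is a hypothetical helper name.

```python
from copy import deepcopy

import torch
import torch.nn as nn


def attach_ema(model, optimizer, decay=0.999):
    """Keep an EMA copy of `model` that updates after every optimizer step."""
    ema_model = deepcopy(model).eval()
    for p in ema_model.parameters():
        p.requires_grad_(False)

    def _update_ema(opt, args, kwargs):
        # Called by the optimizer right after step(); arguments are unused.
        with torch.no_grad():
            msd = model.state_dict()
            for k, v in ema_model.state_dict().items():
                if v.dtype.is_floating_point:
                    v.mul_(decay).add_(msd[k].detach(), alpha=1.0 - decay)

    handle = optimizer.register_step_post_hook(_update_ema)
    return ema_model, handle  # call handle.remove() to stop updating


# Usage on a toy model: one step, then the EMA is a blend of old and new weights.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
ema_model, handle = attach_ema(model, optimizer, decay=0.9)

w0 = model.weight.detach().clone()
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()  # triggers _update_ema
```

The training loop itself never mentions EMA after `attach_ema` is called, which is the elegance the hook approach is after. Whether this interacts well with DDP wrappers or mixed-precision gradient scalers is not verified here.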

Weight Initialization: Xavier

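The basic idea of Xavier initialization is to keep the variance of each layer's forward outputs and of its backward gradients constant from layer to layer.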

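Suppose each layer's parameters had an arbitrary variance. After the input passes through a deep network, the variance of the activations could accumulate to a very large value, making the network unstable. Likewise, during backpropagation through a deep network, a variance that is too large or too small leads to exploding or vanishing gradients, respectively.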

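Xavier initializes the parameters as $$ w \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}},\ \frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}}\right] $$ which keeps the variance of each layer's forward and backward outputs roughly constant, so that forward and backward propagation do not produce extreme values at the start of training.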

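Xavier assumes that the activation function is symmetric about the origin, and many activation functions (ReLU, for example) do not satisfy this assumption.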

Xavier seems to come with quite a few restrictions, so it may not help much, but using it is still better than nothing.

PyTorch code: torch.nn.init.xavier_uniform_(param)
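As a quick sanity check of the formula, the sketch below applies `xavier_uniform_` to a linear layer and compares the result with the bound $\sqrt{6/(n_{in}+n_{out})}$. The layer size (256 inputs, 128 outputs) and the fixed seed are arbitrary choices for illustration.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(256, 128)  # weight has shape (n_out=128, n_in=256)
nn.init.xavier_uniform_(layer.weight)

fan_in, fan_out = layer.in_features, layer.out_features
bound = math.sqrt(6.0 / (fan_in + fan_out))  # the limit of the uniform range

max_abs = layer.weight.abs().max().item()
empirical_var = layer.weight.var().item()
expected_var = 2.0 / (fan_in + fan_out)  # variance of U[-a, a] is a^2 / 3

print(f"bound={bound:.4f} max|w|={max_abs:.4f}")
print(f"var={empirical_var:.6f} expected={expected_var:.6f}")
```

The sampled weights should stay inside the bound, and their variance should be close to $2/(n_{in}+n_{out})$.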

References:

深度学习中Xavier初始化

CNN数值——xavier（上）

CNN数值——xavier（下）

About `CUDA_VISIBLE_DEVICES`

Setting the environment variable CUDA_VISIBLE_DEVICES (for example, to 0,1) makes the program see only the listed GPUs, in the listed order.

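In each DDP process I set os.environ["CUDA_VISIBLE_DEVICES"]='1' and then called model.to("cuda:1"), which raised an error, roughly saying that the device index was invalid. At first I could not figure out why. It turned out that once CUDA_VISIBLE_DEVICES is set, the program can see only that one GPU rather than both GPUs on the machine, so naturally there is no cuda:1 device. In other words, the GPU with global index 1 (the second card) becomes cuda:0 inside the process.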
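The renumbering can be illustrated without any GPU. The helper below, `logical_to_physical`, is hypothetical and purely illustrative: it only parses a comma-separated list of integer indices (not GPU UUIDs) and does not query the driver.

```python
def logical_to_physical(cuda_visible_devices):
    """Map the index a process sees (cuda:i) to the machine-wide GPU index.

    Illustration only: assumes a comma-separated list of integer indices.
    """
    ids = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    return dict(enumerate(ids))


# With CUDA_VISIBLE_DEVICES='1' the process sees a single device, cuda:0,
# which is physical GPU 1. There is no logical index 1, hence the error.
mapping = logical_to_physical("1")
print(mapping)

# The order of the list is respected: here cuda:0 is physical GPU 2.
print(logical_to_physical("2,0"))
```

Note that the variable has to be set before CUDA is initialized in the process; changing it afterwards has no effect on the devices already enumerated.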

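Doing it this way clearly goes against the design of DDP, because DDP needs to know which GPU each process trains on so that it can synchronize the gradients. Sometimes buggy code places some tensors on cuda:0, or some functions put tensors on cuda:0 by default; the first GPU then carries a heavy load and may run out of memory (OOM).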

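You can check GPU load with nvidia-smi. In principle there should be as many processes as GPUs in use, but if tensors leak onto another card as described above, extra processes show up that each occupy a small amount of GPU memory. This is both inelegant and risky. The fix is to call torch.cuda.set_device(rank) in each process to set its default GPU; then, as long as you do not explicitly (deliberately) place tensors on other GPUs, the problem above will not occur.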
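A minimal sketch of that per-process setup follows. The helper name `pick_device` and the CPU fallback are assumptions added so that the snippet also runs on a machine without a GPU.

```python
import torch


def pick_device(rank):
    """Bind this process to GPU `rank` and return the matching torch.device."""
    if torch.cuda.is_available():
        # After this call, a bare "cuda" device refers to cuda:{rank}
        # in this process instead of defaulting to cuda:0.
        torch.cuda.set_device(rank)
        return torch.device("cuda", rank)
    return torch.device("cpu")  # fallback so the sketch runs anywhere


device = pick_device(0)  # in a DDP job, pass the process's own rank
x = torch.zeros(2, 2, device=device)
print(x.device)
```

Calling this once at the start of each worker, before any tensors are created, is what keeps stray allocations off the first GPU.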

Counting Model Parameters and Operations

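The parameter count and operation count of a model only give a rough estimate of its inference time, because the total time includes not only computation but also factors such as I/O and scheduling. Different architectures with the same number of parameters can differ greatly in their memory and data access patterns.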

Still, the parameter count and operation count remain useful reference metrics.
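For a rough idea of how these numbers can be obtained, here is a small sketch. Counting parameters is a one-liner; for operations, the example only counts multiply-accumulates (MACs) of `nn.Linear` layers through forward hooks, which is a deliberate simplification. The helper names and the toy model are made up for illustration; dedicated profiling tools cover many more layer types.

```python
import torch
import torch.nn as nn


def count_parameters(model, trainable_only=True):
    """Total number of parameter elements in the model."""
    return sum(
        p.numel()
        for p in model.parameters()
        if p.requires_grad or not trainable_only
    )


def count_linear_macs(model, example_input):
    """Multiply-accumulates of all nn.Linear layers for one forward pass."""
    total = 0

    def hook(module, inputs, output):
        nonlocal total
        # Every output element is a dot product over in_features inputs.
        total += output.numel() * module.in_features

    handles = [
        m.register_forward_hook(hook)
        for m in model.modules()
        if isinstance(m, nn.Linear)
    ]
    with torch.no_grad():
        model(example_input)
    for h in handles:
        h.remove()
    return total


model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
n_params = count_parameters(model)  # (10*20 + 20) + (20*5 + 5)
n_macs = count_linear_macs(model, torch.randn(1, 10))  # 20*10 + 5*20
print(n_params, n_macs)
```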
