深度学习系统学习笔记

本文是 DLSys 课程的学习笔记，主要记录了深度学习系统的代码实现和系统理解。

自动微分系统

自动微分（Automatic Differentiation，AD）是一种对计算机程序进行高效且准确求导的技术，许多机器学习模型使用的优化算法都需要获取模型的导数，因此自动微分技术成为了一些热门的机器学习框架（例如TensorFlow和PyTorch）的核心特性。

微分的计算可以通过符号微分（Symbolic Differentiation）和数值微分（Numerical Differentiation）来实现，但是这两种方法都有一些缺点，符号微分会让导数表达式快速膨胀，造成计算复杂度高，而数值微分由于计算机系统的浮点误差和近似项的大小，精度存在限制。自动微分的思想是将计算机程序中的运算操作分解为一个有限的基本操作集合，且集合中基本操作的求导规则均为已知，在完成每一个基本操作的求导后，使用链式法则将结果组合得到整体程序的求导结果。

计算图

计算图是自动微分系统的核心，也是深度学习系统的核心。它是一种用于描述程序运算过程的图结构，图中的节点表示程序中的运算操作，边表示运算操作之间的依赖关系。计算图可以分为静态计算图和动态计算图，静态计算图是在程序运行之前就已经确定的，例如 TensorFlow 的计算图，而动态计算图是在程序运行过程中动态生成的，例如 PyTorch 的计算图。

计算图的节点可以分为两种类型，一种是输入节点，表示程序的输入数据(Tensor)，另一种是运算节点(Value)，记录了算子(Op)和计算结果(Tensor)。计算图的边可以分为两种类型，一种是数据边，表示程序运算操作的输入数据，另一种是控制边，表示程序运算操作的执行顺序。例如，对于下方的计算图，\(v_3\) 节点存储了对数算子以及 \(v_1\) 输入过来得到的计算结果。

前向与反向自动微分

我们首先回顾一下高数中的多元函数求导和复合函数求导

如果函数 \(u=\varphi(x, y)\) 及 \(v=\psi(x, y)\) 都在点 \((x, y)\) 具有对 \(x\) 及对 \(y\) 的偏导数函数 \(z=f(u, v)\) 在对应点 \((u, v)\) 具有连续偏导数，那么复合函数 \(z=f[\varphi(x, y), \psi(x, y)]\) 在点 \((x, y)\) 的两个偏导数都存在，且有 \[ \begin{aligned} & \frac{\partial z}{\partial x}=\frac{\partial z}{\partial u} \frac{\partial u}{\partial x}+\frac{\partial z}{\partial v} \frac{\partial v}{\partial x} \\ & \frac{\partial z}{\partial y}=\frac{\partial z}{\partial u} \frac{\partial u}{\partial y}+\frac{\partial z}{\partial v} \frac{\partial v}{\partial y} \end{aligned} \] 我们可以把以上的例子拓展到任意多个变量的情况，即令 \[ z(x_1, \dots, x_N) = f[u_1(x_1, \dots, x_N), \dots, u_M(x_1, \dots, x_N)] \] 这时链式法则可以记为 \[ \frac{\partial z}{\partial x_i} = \sum_{j=1}^M \frac{\partial z}{\partial u_j} \frac{\partial u_j}{\partial x_i} \]

自动微分根据链式法则的不同组合顺序，可以分为前向模式和反向模式。前向模式是从输入到输出的方向进行求导，反向模式是从输出到输入的方向进行求导。对于 \(y=a(b(c(x)))\) 其梯度值 \(\frac{dy}{dx}\) 前向模式计算过程为 \[ \frac{dy}{dx}=(\frac{dy}{da}(\frac{da}{db}(\frac{db}{dc}\frac{dc}{dx}))) \] 反向模式计算过程为 \[ \frac{dy}{dx}=(((\frac{dy}{da}\frac{da}{db})\frac{db}{dc})\frac{dc}{dx})\]

对于计算图里的例子求解 \(x_1\) 对 \(y\) 的梯度，前向模式就是从数据输入方向逐步计算 \(\dot{v}_i=\frac{\partial v_i}{\partial x_1}\),最终得到 \(\frac{\partial y}{\partial x_1}\)。反向模式就是从输出方向逐步计算 \(\overline{v}_i=\frac{\partial y}{\partial v_i}\),最终得到 \(\frac{\partial y}{\partial x_1}\)。前向计算是递推计算，反向计算是递归计算。

P.S. 后向模式的计算样例里，有涉及到复合函数求导的问题，关键在于函数最后的输出是标量，因此对于每条路径单独求导，然后求和即可。

前向模式的计算复杂度与程序的运算次数成正比，而反向模式的计算复杂度与程序的输出维度成正比。因此，对于输入维度较大的程序，反向模式的计算复杂度更低，而对于输出维度较大的程序，前向模式的计算复杂度更低。对于深度学习系统而言，最终的目标是求取损失函数对模型参数的导数，而损失函数最终表示为一个标量，因此反向模式是更为常用的自动微分模式。

自动微分系统的架构

自动微分系统的实现关键在于对计算图的构建，课程里的框架 Needle 采用的方式是邻接表建图，核心类有四个：Value、Tensor、Op 和 NDArray

NDArray 是一个通用的多维数组（N维数组）类，实现了多维数组的基础计算操作。具体实现会在后文详述。

Op 是计算操作的抽象类，包含 forward 和 gradient 两个方法，分别用于计算前向传播和反向传播。Op 的子类实现了具体的计算操作，例如加法、乘法、对数等。

Value 是计算图的节点表示，是一个抽象的数据结构，保存了节点的计算结果(cached_data)、计算操作(Op)、输入节点(inputs)等信息。

class Value:
    """A value in the computational graph."""

    # trace of computational graph
    op: Optional[Op]
    inputs: List["Value"]
    # The following fields are cached fields for
    # dynamic computation
    cached_data: NDArray
    requires_grad: bool

Tensor 是多维数组的封装，是 Value 的子类，保存了多维数组的数据和梯度信息，是自动微分系统具体计算时的对象。

import needle
a = needle.Tensor([1])
b = needle.Tensor([2])
c = a + b

以上示例调用了加法操作，整个算子的调用流程为 Tensor.__add__ -> TensorOp.__call__ -> Tensor.make_from_op -> Value.realize_cached_data -> EwiseAdd.compute -> return。

算子实现

大部分数学算子因为求导公式很显然，这边就不作记录了，主要记录了几个涉及到维度变化的算子，分别是如何求微分的。一个最基本的规则是要保持反向传播过程中梯度的形状与前向传播时张量的形状一致。

Transpose

这里的transpose只是交换两个轴，若参数为空，则交换倒数两个轴。在反向传播时，只需要将梯度的轴再次交换回来即可。

class Transpose(TensorOp):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def compute(self, a):
        axes = list(range(len(a.shape)))
        if self.axes is not None:
            axes[self.axes[0]], axes[self.axes[1]] = axes[self.axes[1]], axes[self.axes[0]]
        else:
            axes[-1], axes[-2] = axes[-2], axes[-1]
        return array_api.transpose(a, axes)

    def gradient(self, out_grad, node):
        return transpose(out_grad, self.axes)

Reshape

Reshape不改变梯度的值，只是改变梯度的形状以匹配前向传播时输入张量的形状。所以，将前向传播的输出按照输入数据的shape进行reshape即可。

class Reshape(TensorOp):
    def __init__(self, shape):
        self.shape = shape

    def compute(self, a):
        return array_api.reshape(a, self.shape)

    def gradient(self, out_grad, node):
        return reshape(out_grad, node.inputs[0].shape)

Summation

基本思想是将输出对输入的梯度分配到每一个输入元素上。具体地说，对于一个求和操作，其输出是所有输入元素的和，所以输出对每个输入元素的梯度是1。

假设我们有一个向量 \(X = [x_1, x_2, \ldots, x_n]\)，并对其进行求和操作得到 \(S = \sum_{i=1}^{n} x_i\)。那么 \(S\) 对 \(X\) 中每个元素 \(x_i\) 的梯度就是：

\[ \frac{\partial S}{\partial x_i} = 1 \]

在反向传播时，假设 \(S\) 是损失函数对某个中间变量 \(Z\) 的梯度，则该梯度会被均匀分配到 \(Z\) 的每个元素上。

这个原理背后的直观理解是，既然每个输入元素对总和的贡献是相等的，那么在反向传播中，每个元素对总损失的影响也应该是相等的。因此，每个输入元素的梯度都被设置为1，以确保梯度的传播是公平的.

由于 summation 会把求和的那一系列维度压缩掉，所以在反向传播时，我们需要把输出梯度重新 broadcast 到输入的维度上。通过设置 reduce_shape[axis] = 1，我们在那些求和操作被执行的轴上创建了一个大小为 1 的维度。这样做是为了让 reshape 操作能够生成一个与原始输入张量在非求和轴上形状相同，但在求和轴上有一个“伪”维度（大小为 1）的张量。之后，broadcast_to 操作会利用广播规则，将具有“伪”维度的梯度张量扩展回原始输入张量的形状。广播规则允许大小为 1 的维度扩展，以匹配目标形状的相应维度。

class Summation(TensorOp):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def compute(self, a):
        return array_api.sum(a, axis=self.axes)

    def gradient(self, out_grad, node):
        reduce_shape = list(node.inputs[0].shape)
        if self.axes is not None:
            if isinstance(self.axes, Number):
                self.axes = (self.axes,)
            if not isinstance(self.axes, tuple):
                self.axes = tuple(self.axes)
            for axis in self.axes:
                reduce_shape[axis] = 1
            grad = reshape(out_grad, reduce_shape)
        else:
            grad = out_grad
        return broadcast_to(grad, node.inputs[0].shape)

BroadcastTo

broadcast_to 是将输入张量扩展到指定的形状。在反向传播时，我们需要把扩展出的输出梯度加以求和，以得到输入张量的梯度。这是因为在前向传播时，输入张量的每个元素都会被复制到输出张量的多个位置上，所以在反向传播时，输出张量的梯度需要被加和到输入张量的对应位置上。

class BroadcastTo(TensorOp):
    def __init__(self, shape):
        self.shape = shape

    def compute(self, a):
        return array_api.broadcast_to(a, self.shape)

    def gradient(self, out_grad, node):
        ori_shape = node.inputs[0].shape
        if ori_shape == self.shape:
            return out_grad

        shrink_dims = [i for i in range(len(self.shape))]
        for i, (ori, cur) in enumerate(zip(reversed(ori_shape), reversed(self.shape))):
            if ori == cur:
                shrink_dims[len(self.shape) - i - 1] = -1
        shrink_dims = tuple(filter(lambda x: x >= 0, shrink_dims))
        assert len(shrink_dims) > 0

        return out_grad.sum(shrink_dims).reshape(ori_shape)

MatMul

矩阵乘法的梯度计算需要注意矩阵相乘时可能存在 broadcast 的情况，因此当发现输入的维度比梯度的维度低时，需要对梯度进行求和。

class MatMul(TensorOp):
    def compute(self, a, b):
        return a @ b

    def gradient(self, out_grad, node):
        lhs, rhs = node.inputs
        lgrad, rgrad = matmul(out_grad, rhs.transpose()), matmul(lhs.transpose(), out_grad)
        if len(lhs.shape) < len(lgrad.shape):
            lgrad = lgrad.sum(tuple([i for i in range(len(lgrad.shape) - len(lhs.shape))]))
        if len(rhs.shape) < len(rgrad.shape):
            rgrad = rgrad.sum(tuple([i for i in range(len(rgrad.shape) - len(rhs.shape))]))
        return lgrad, rgrad

LogSumExp

在神经网络中，假设我们的最后一层是使用softmax去得到一个概率分布，softmax的形式为\(\frac{e^{x_{j}}}{\sum_{i=1}^{n} e^{x_{i}}}\)，如果最终 loss 函数使用 cross entropy，那么我们需要计算 \(\log(\sum_{i=1}^{n} e^{x_{i}})\)，这个计算就是 LogSumExp。通过对 log 函数线性近似，可以推出 LogSumExp 的表达式为 \(LSE(x)=\log(\sum_i \exp x_i)\approx \log \left(\exp x_{j}\right)+\left(\sum_{i \neq j} \exp x_{i}\right) / \exp x_{j}\)，随后通过缩放可以证明 \(LSE(x_1,...,x_n)\approx \max_i x_i\)，因此可以说 LSE 函数是 max 函数的平滑近似。

然而，由于系统上溢下溢的问题，需要先对输入减去一个值，然后再计算完指数和对数之后再加上这个值。实际计算时通常取输入的最大值，这样可以避免上溢的问题。定义 \[ \begin{align*} \hat{z_{i}} &= z_{i} - \max{z}=z_i-z_j,\\ LogSumExp(z_i) &= \log(\sum_{k=1}^{n}\exp(z_{i}-\max{z}))+\max{z}\\ &= \log(\sum_{k=1}^{n}\exp(\hat{z_i}))+z_j \end{align*} \]

梯度可以推出为 \(\frac{\exp(\hat{z_{i}})}{\sum_{k=1}^{n}\exp(\hat{z_{k}})}\)，这里有篇详细推导的文章

class LogSumExp(TensorOp):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def compute(self, Z):
        max_z = array_api.max(Z, self.axes, keepdims=True)
        max_zn = array_api.max(Z, self.axes, keepdims=False)
        Z = array_api.log(array_api.sum(array_api.exp(Z - max_z), self.axes)) + max_zn
        return Z

    def gradient(self, out_grad, node):
        Z = node.inputs[0].cached_data
        max_z = array_api.max(Z, self.axes, keepdims=True)
        exp_z = array_api.exp(Z - max_z)
        sum_exp = array_api.sum(exp_z, self.axes)

        log_grad = out_grad.realize_cached_data() / sum_exp
        broadcast_shape = list(Z.shape)
        if self.axes is not None:
            if isinstance(self.axes, int):
                broadcast_shape[self.axes] = 1
            else:
                for axis in self.axes:
                    broadcast_shape[axis] = 1
        else:
            broadcast_shape = [1 for _ in range(len(broadcast_shape))]
        grad = array_api.reshape(log_grad, tuple(broadcast_shape))
        grad = exp_z * grad
        return Tensor(grad)

自动微分算法实现

自动微分的算法可以描述为以上伪代码，具体分为两部分，一部分是拓扑排序，一部分是顺着拓扑排序逆序进行计算。作业中拓扑排序要求通过 dfs 实现。

def compute_gradient_of_variables(output_tensor, out_grad):
    """Take gradient of output node with respect to each node in node_list.

    Store the computed result in the grad field of each Variable.
    """
    # a map from node to a list of gradient contributions from each output node
    node_to_output_grads_list: Dict[Tensor, List[Tensor]] = {}
    # Special note on initializing gradient of
    # We are really taking a derivative of the scalar reduce_sum(output_node)
    # instead of the vector output_node. But this is the common case for loss function.
    node_to_output_grads_list[output_tensor] = [out_grad]

    # Traverse graph in reverse topological order given the output_node that we are taking gradient wrt.
    reverse_topo_order = list(reversed(find_topo_sort([output_tensor])))

    for node in reverse_topo_order:
        adjoint = node_to_output_grads_list[node]
        v_i = sum_node_list(adjoint)
        node.grad = v_i

        if node.op is None:
            continue
        v_ki_list = node.op.gradient_as_tuple(v_i, node)
        for inp, v_ki in zip(node.inputs, v_ki_list):
            node_to_output_grads_list.setdefault(inp, list())
            node_to_output_grads_list[inp].append(v_ki)


def find_topo_sort(node_list: List[Value]) -> List[Value]:
    """Given a list of nodes, return a topological sort list of nodes ending in them.

    A simple algorithm is to do a post-order DFS traversal on the given nodes,
    going backwards based on input edges. Since a node is added to the ordering
    after all its predecessors are traversed due to post-order DFS, we get a topological
    sort.
    """
    topo_order = []
    visited = set()
    topo_sort_dfs(node_list[0], visited, topo_order)
    return topo_order


def topo_sort_dfs(node, visited, topo_order):
    """Post-order DFS"""
    if node in visited:
        return
    visited.add(node)
    for input_node in node.inputs:
        topo_sort_dfs(input_node, visited, topo_order)
    topo_order.append(node)

神经网络库

一个神经网络的构建，包括神经网络、优化器、初始化和数据处理。这里依次记录。

初始化

课程只要求了最基础的两类初始化方法，一个是 Xavier 一个是 Kaiming。生成权重的方式可以有均匀分布 \(\mathcal U (-a, a)\) 和正态分布 \(\mathcal N (0, std^2)\) 采样两种。两个计算都可以直接参考 pytorch 的实现。

Xavier 初始化

Xavier 根据输入输出的维度来确认采样的范围，保证每层神经网络的输入输出方差相同。 \[ \begin{aligned} a &= \text{gain} \times \sqrt{\frac{6}{\text{fan\_in} + \text{fan\_out}}}\\ \text{std} &= \text{gain} \times \sqrt{\frac{2}{\text{fan\_in} + \text{fan\_out}}} \end{aligned} \]

Kaiming 初始化

Kaiming 仅根据输入的维度，比起 Xavier 初始化只适用于线性激活函数，Kaiming 初始化适配 ReLU 的非线性激活函数。 \[ \begin{aligned} \text{a} &= \text{gain} \times \sqrt{\frac{3}{\text{fan\_in}}}\\ \text{std} &= \text{gain} \times \sqrt{\frac{1}{\text{fan\_in}}} \end{aligned} \]

优化器

SGD

通过梯度下降求解神经网络最优解时，由于全量数据的梯度计算和参数更新是非常耗时的，因此我们通常采用随机梯度下降（SGD）或者小批量梯度下降（Mini-batch SGD）来加速训练。相比Adam等自适应学习率算法，纯SGD优化是很慢的，而引入动量可以加速SGD的迭代。动量法的推导可以参考苏剑林老师的博客，课程中的 SGD 包含了动量法和 L2 正则化两项优化，与 pytorch 的 SGD 一致。

\[ \begin{aligned} u_t &= \mu u_{t-1} + (1 - \mu) \nabla \tilde{L}(\theta)\\ \theta_t &= \theta_{t-1} - \eta u_t \end{aligned} \]

其中，\(\mu\) 是动量系数，\(\nabla \tilde{L}(\theta)\) 是加上权重衰减后的梯度，\(\eta\) 是学习率。

def step(self):
    for param in self.params:
        grad_with_penalty = param.grad.detach() + self.weight_decay * param.detach()
        u = self.u.get(id(param), 0) * self.momentum + (1 - self.momentum) * grad_with_penalty
        u = ndl.Tensor(u, dtype=param.dtype)
        self.u[id(param)] = u
        param.data -= self.lr * u

Adam

Adam 是一种自适应学习率算法，它可以根据梯度的一阶矩估计和二阶矩估计自动调整每个参数的学习率（loss landscape 越平坦的方向用越大的学习率来更新模型参数）。Adam 的更新公式如下：

\[ \begin{aligned} g_{t} &= \nabla \hat{L}(\theta_{t}) \\ m_{t} &= \beta_{1} m_{t-1} + (1-\beta_{1}) g_{t} \\ v_{t} &= \beta_{2} v_{t-1} + (1-\beta_{2}) g_{t}^{2} \\ \hat{m}_{t} &= \frac{m_{t}}{1-\beta_{1}^{t}} \\ \hat{v}_{t} &= \frac{v_{t}}{1-\beta_{2}^{t}} \\ \theta_{t+1} &= \theta_{t} - \frac{\eta}{\sqrt{\hat{v}_{t}} + \epsilon} \hat{m}_{t} \end{aligned} \]

def step(self):
    self.t += 1
    for param in self.params:
        grad_with_penalty = param.grad.detach() + self.weight_decay * param.detach()
        grad_with_penalty = ndl.Tensor(grad_with_penalty, dtype=param.dtype)

        m = self.beta1 * self.m.get(id(param), 0) + (1 - self.beta1) * grad_with_penalty
        v = self.beta2 * self.v.get(id(param), 0) + (1 - self.beta2) * grad_with_penalty ** 2
        self.m[id(param)] = m.detach()
        self.v[id(param)] = v.detach()
        m_hat = m / (1 - self.beta1 ** self.t)
        v_hat = v / (1 - self.beta2 ** self.t)
        param.data -= self.lr * m_hat / (v_hat ** 0.5 + self.eps)

基础神经网络

神经网络的基类包括 Parameter 和 Module 两块，Parameter 存储了模型的参数，Module 定义了神经网络的组成结构和状态（区分 train 和 eval）。在实现各个基础 Module 的时候需要注意课程的框架不会隐式的 broadcast，所以需要小心维护每个变量的 shape。课程中涉及到的比较复杂的是 BatchNorm 和 LayerNorm，下面简单记录了两者。

BatchNorm

BN 层在做的事情是将输入按照 Batch 进行归一化，有利于降低神经网络整体的 \(L\) 常数，提高鲁棒性和泛化性能。BN 的计算公式是 \(y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta\)。实际计算时，每个 Batch 会进行移动平均的计算，即 running\_mean = momentum * running\_mean + (1 - momentum) * mean, \quad running\_var = momentum * running\_var + (1 - momentum) * var

def forward(self, x: Tensor) -> Tensor:
    batch = x.shape[0]
    if self.training:
        mean = ops.summation(x, 0) / batch
        self.running_mean = self.momentum * mean + \
            (1 - self.momentum) * self.running_mean
        mean = ops.broadcast_to(ops.reshape(mean, (1, self.dim)), x.shape)

        std = ops.summation(ops.power_scalar(x - mean, 2), 0) / batch
        self.running_var = self.momentum * std + \
            (1 - self.momentum) * self.running_var
        std = ops.broadcast_to(ops.reshape(std, (1, self.dim)), x.shape)

        x = (x - mean) / ops.power_scalar(std + self.eps, 0.5) * \
            ops.broadcast_to(ops.reshape(self.weight, (1, self.dim)), x.shape) \
            + ops.broadcast_to(ops.reshape(self.bias, (1, self.dim)), x.shape)
        return x
    else:
        x = (x - self.running_mean) / ((self.running_var + self.eps) ** 0.5)
        return x

LayerNorm

LN 层在做的事情是将输入按照特征维度进行归一化。LN 的计算公式是 \(y = w \circ \frac{x_i - \textbf{E}[x]}{((\textbf{Var}[x]+\epsilon)^{1/2})} + b\)。

def forward(self, x: Tensor) -> Tensor:
    features = x.shape[1]

    sums = ops.summation(x,axes=1)
    mean = ops.divide_scalar(sums, features)
    tmp = ops.reshape(mean, (-1, 1))
    broadcast_mean = ops.broadcast_to(tmp, x.shape)
    
    sub = x - broadcast_mean
    sub2 = ops.power_scalar(sub, 2)
    var = ops.summation(ops.divide_scalar(sub2, features), axes=1)
    broadcast_var = ops.broadcast_to(ops.reshape(var, (-1, 1)), x.shape)
    
    nominator = ops.power_scalar(broadcast_var + self.eps, 0.5)

    broadcast_weight = ops.broadcast_to(ops.reshape(self.weight, (1, -1)), x.shape)
    broadcast_bias = ops.broadcast_to(ops.reshape(self.bias, (1, -1)), x.shape)
    out = broadcast_weight * (x - broadcast_mean) / nominator + broadcast_bias
    return out

NDArray

课程的 hw0-2 都是基于 numpy 在实现自动微分系统和神经网络库，hw3 实现了一个 NDArray 用于替代 numpy，NDArray 的实现主要是为了加速计算，因此需要实现一些基础的数学运算，包括加减乘除、矩阵乘法、转置、reshape、sum、broadcast_to 等。

NDArray 包含了五个属性：shape、strides、offset、device 和 handle。

strides 是数组在内存中的步幅，A[i, j] => Adata[i * A.strides[0] + j * A.strides[1]]

handle 是 device 上的 Array，屏蔽了 CPU 和 CUDA 的不同实现，在 PyThon 层只需要调用 handle 的方法即可。

PyThon Array Operations

reshape 和 permute 操作很简单，只需要修改 strides 和 shape 即可。因此，两个操作都依赖于 as_stride 的方法，实际就是在内存原地修改数组属性。

broadcast 操作稍微复杂一些，需要考虑原数组中存在大小为 1 的维度或者是新增的维度，要将对应的步幅设为 0，因为该维度可以被广播到任何大小。否则，确保新形状中的相应维度与原始形状相同。

assert len(new_shape) >= len(self.shape), "New shape must have equal or more dimensions"
new_strides = list(self.strides)

# Starting from the end, adjust the strides for broadcasting
for i in range(1, len(self.shape)+1):
    if i > len(new_shape):
        break

    if self.shape[-i] == 1:
        new_strides[-i] = 0
    else:
        assert new_shape[-i] == self.shape[-i], f"Dimension mismatch: cannot broadcast {self.shape} to {new_shape}"

# For new leading dimensions, set stride to 0 (broadcasting new dimensions)
new_strides = [0] * (len(new_shape) - len(self.shape)) + new_strides
return NDArray.make(new_shape, tuple(new_strides), self.device, self._handle)

get_item 操作是为了实现类似 numpy 的切片操作，同样的，在原有数组的 handle 之下，数组的物理地址不会变化，切片和原数组共用一个内存，操作的仍然是 shape 和 strides。offset 是为了确定切片在原数组中的起始位置，shape 的计算需要注意 tuple 存在的 step，间隔取数，要整除确定 shape。

offset = 0
shape = []
strides = []
for i in range(len(idxs)):
    s = idxs[i]
    shape.append((s.stop - s.start - 1) // s.step + 1)
    strides.append(self.strides[i] * s.step)
    offset += self.strides[i] * s.start
return self.make(shape=tuple(shape), strides=tuple(strides), device=self.device, handle=self._handle, offset=offset)

Device Array Operations

Device Array 在课程中包括 cpp 实现的 CPU 版本和 CUDA 版本，需要实现的 API 主要是 compact、set_item、reduce、matmul 和基础运算。

Compact与SetItem

compact 和 set_item 两者都是需要遍历整个数组，去赋值操作。

将内存中不连续的数组转换为连续的数组，一个是将连续的数组转换为不连续的数组。compact 操作的实现是将原数组的数据复制到新的内存中，set_item 操作是将新的数据赋值到原数组的内存中，差异就在核心的赋值语句。CPU 版本中，由于维度不确定，需要递归处理。CUDA 版本中，每个线程负责一个元素的赋值，根据 gid 推算 offset 即可。

# CPU Version
void CompactHelper(const AlignedArray& a, AlignedArray* out, std::vector<int32_t> shape,
             std::vector<int32_t> strides, size_t offset, size_t currentDim, size_t& funcIndex, size_t indexInA) {
              if(currentDim == shape.size()) {
                out->ptr[funcIndex++] = a.ptr[indexInA];
              } else {
                for(size_t i = 0; i < shape[currentDim]; i++) {
                  CompactHelper(a, out, shape, strides, offset, currentDim + 1, funcIndex, indexInA + i * strides[currentDim]);
                }
              }
}

# CUDA Version
__global__ void CompactKernel(const scalar_t* a, scalar_t* out, size_t size, CudaVec shape,
                              CudaVec strides, size_t offset) {
  size_t gid = blockIdx.x * blockDim.x + threadIdx.x;

  size_t idx = gid;
  if (gid < size) {
    for (int i = shape.size - 1; i >= 0; i--) {
      offset += idx % shape.data[i] * strides.data[i];
      idx /= shape.data[i];
    }
    out[gid] = a[offset];
  }
}

基础运算

基础运算可以通过模板元编程缩减工作量，这边给出乘法的例子。

// Generic template for element-wise operations
template<typename Op>
void EwiseOperation(const AlignedArray& a, const AlignedArray& b, AlignedArray* out, Op op) {
    for (size_t i = 0; i < a.size; i++) {
        out->ptr[i] = op(a.ptr[i], b.ptr[i]);
    }
}

// Generic template for scalar operations
template<typename Op>
void ScalarOperation(const AlignedArray& a, float val, AlignedArray* out, Op op) {
    for (size_t i = 0; i < a.size; i++) {
        out->ptr[i] = op(a.ptr[i], val);
    }
}

// Generic template for single element operations
template<typename Op>
void SingleOperation(const AlignedArray& a, AlignedArray* out, Op op) {
    for (size_t i = 0; i < a.size; i++) {
        out->ptr[i] = op(a.ptr[i]);
    }
}

void EwiseMul(const AlignedArray& a, const AlignedArray& b, AlignedArray* out) {
    EwiseOperation(a, b, out, [](float a, float b) { return a * b; });
}

Reduce

Reduce 操作是将数组的某一维度压缩，比如 sum 操作就是将数组的某一维度求和。课程里的 reduce 在 PyThon 层上实现了一个 reduce_view_out 的操作，将需要 reduce 的轴通过 permute 操作放到最后一个维度，然后通过 compact 操作变为了在内存中连续的数组。因此，在 device 层，只需要遍历整个数组，根据 reduce size 计算 max 和 sum 即可。

// CPU Version
void ReduceMax(const AlignedArray& a, AlignedArray* out, size_t reduce_size) {
  for (size_t idx_out = 0; idx_out < out->size; idx_out++) {
    size_t offset = idx_out * reduce_size;
    scalar_t max_val = a.ptr[offset];
    for (size_t k = 1; k < reduce_size; k++) {
      max_val = std::max(max_val, a.ptr[offset + k]);
    }
    out->ptr[idx_out] = max_val;
  }
}

// CUDA Version
__global__ void ReduceMaxKernel(const scalar_t* a, scalar_t* out, size_t size, size_t reduce_size) {
  size_t gid = blockIdx.x * blockDim.x + threadIdx.x;
  if (gid < size) {
    size_t start = gid * reduce_size;
    out[gid] = a[start];
    for (size_t i = 1; i < reduce_size; i++) {
      out[gid] = max(out[gid], a[start + i]);
    }
  }
}

MatMul

课程中的 GEMM 还是比较基础的优化，考虑的是 TILE 切分，利用缓存异构性优化，具体算法课程已经给出了。

inline void AlignedDot(const float* __restrict__ a,
                       const float* __restrict__ b,
                       float* __restrict__ out) {

  /**
   * Multiply together two TILE x TILE matrices, and _add _the result to out (it is important to add
   * the result to the existing out, which you should not set to zero beforehand).  We are including
   * the compiler flags here that enable the compile to properly use vector operators to implement
   * this function.  Specifically, the __restrict__ keyword indicates to the compile that a, b, and
   * out don't have any overlapping memory (which is necessary in order for vector operations to be
   * equivalent to their non-vectorized counterparts (imagine what could happen otherwise if a, b,
   * and out had overlapping memory).  Similarly the __builtin_assume_aligned keyword tells the
   * compiler that the input array will be aligned to the appropriate blocks in memory, which also
   * helps the compiler vectorize the code.
   *
   * Args:
   *   a: compact 2D array of size TILE x TILE
   *   b: compact 2D array of size TILE x TILE
   *   out: compact 2D array of size TILE x TILE to write to
   */

  a = (const float*)__builtin_assume_aligned(a, TILE * ELEM_SIZE);
  b = (const float*)__builtin_assume_aligned(b, TILE * ELEM_SIZE);
  out = (float*)__builtin_assume_aligned(out, TILE * ELEM_SIZE);

  for (size_t i = 0; i < TILE; i++) {
    for (size_t j = 0; j < TILE; j++) {
      for (size_t k = 0; k < TILE; k++) {
        out[i * TILE + j] += a[i * TILE + k] * b[k * TILE + j];
      }
    }
  }
}

void MatmulTiled(const AlignedArray& a, const AlignedArray& b, AlignedArray* out, uint32_t m,
                 uint32_t n, uint32_t p) {
  /**
   * Matrix multiplication on tiled representations of array.  In this setting, a, b, and out
   * are all *4D* compact arrays of the appropriate size, e.g. a is an array of size
   *   a[m/TILE][n/TILE][TILE][TILE]
   * You should do the multiplication tile-by-tile to improve performance of the array (i.e., this
   * function should call `AlignedDot()` implemented above).
   *
   * Note that this function will only be called when m, n, p are all multiples of TILE, so you can
   * assume that this division happens without any remainder.
   *
   * Args:
   *   a: compact 4D array of size m/TILE x n/TILE x TILE x TILE
   *   b: compact 4D array of size n/TILE x p/TILE x TILE x TILE
   *   out: compact 4D array of size m/TILE x p/TILE x TILE x TILE to write to
   *   m: rows of a / out
   *   n: columns of a / rows of b
   *   p: columns of b / out
   *
   */
  for (size_t i = 0; i < m; i++) {
    for (size_t j = 0; j < p; j++) {
      out->ptr[i * p + j] = 0;
    }
  }

  for (uint32_t i = 0; i < m / TILE; i++) {
    for (uint32_t j = 0; j < p / TILE; j++) {
      for (uint32_t k = 0; k < n / TILE; k++) {
        AlignedDot(a.ptr + (i * n / TILE + k) * TILE * TILE,
                   b.ptr + (k * p / TILE + j) * TILE * TILE,
                   out->ptr + (i * p / TILE + j) * TILE * TILE);
      }
    }
  }
}

在使用 CUDA 实现时，为了提高显存利用效率，可以在计算某个 tile 时将其加载至 shared memory 中，同步后再并行计算 TILE 内的结果。

__global__ void MatmulKernel(const scalar_t* a, const scalar_t* b, scalar_t* out, uint32_t M,
                             uint32_t N, uint32_t P) {
  /**
   * Args:
   *   a: compact 2D array of size m x n
   *   b: comapct 2D array of size n x p
   *   out: compact 2D array of size m x p to write the output to
   *   M: rows of a / out
   *   N: columns of a / rows of b
   *   P: columns of b / out
   */
  __shared__ scalar_t a_tile[TILE][TILE];
  __shared__ scalar_t b_tile[TILE][TILE];

  size_t bx = blockIdx.x;
  size_t by = blockIdx.y;
  size_t tx = threadIdx.x;
  size_t ty = threadIdx.y;

  size_t x = bx * blockDim.x + tx;
  size_t y = by * blockDim.y + ty;

  scalar_t sum = 0;
  for (size_t i = 0; i < (N + TILE - 1) / TILE; i++) {
    if (i * TILE + tx < N && y < M) {
      a_tile[ty][tx] = a[y * N + i * TILE + tx];
    } else {
      a_tile[ty][tx] = 0;
    }
    if (i * TILE + ty < N && x < P) {
      b_tile[ty][tx] = b[(i * TILE + ty) * P + x];
    } else {
      b_tile[ty][tx] = 0;
    }
    __syncthreads();
    for (size_t j = 0; j < TILE; j++) {
      sum += a_tile[ty][j] * b_tile[j][tx];
    }
    __syncthreads();
  }
  if (y < M && x < P) {
    out[y * P + x] = sum;
  }
}