Op is the abstract class for computational operations. It provides two methods, forward and gradient, which compute the forward pass and the backward pass respectively. Subclasses of Op implement the concrete operations, such as addition, multiplication, and logarithm.
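To make this concrete, here is a minimal sketch of an Op subclass. It assumes the needle-style interface in which the forward computation (conventionally named compute in the course skeleton) operates on raw NDArray data and gradient returns one adjoint per input; the class name EWiseAdd and the exact signatures are illustrative, not the verbatim library code.

class EWiseAdd(Op):
    def compute(self, a: NDArray, b: NDArray) -> NDArray:
        # forward pass: element-wise addition on the underlying arrays
        return a + b

    def gradient(self, out_grad: "Tensor", node: "Tensor"):
        # backward pass: d(a+b)/da = d(a+b)/db = 1, so each input
        # receives the incoming adjoint unchanged
        return out_grad, out_grad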
Value is the node representation of the computational graph. It is an abstract data structure that stores the node's computed result (cached_data), the operation that produced it (Op), its input nodes (inputs), and related bookkeeping.
class Value:
    """A value in the computational graph."""

    # trace of computational graph
    op: Optional[Op]
    inputs: List["Value"]
    # The following fields are cached fields for
    # dynamic computation
    cached_data: NDArray
    requires_grad: bool
Tensor is the wrapper around a multi-dimensional array. It is a subclass of Value that holds the array data and its gradient, and it is the object the automatic differentiation system actually operates on during computation.
import needle

a = needle.Tensor([1])
b = needle.Tensor([2])
c = a + b
# Compare the original shape with the broadcast shape (self.shape) from the back:
# axes whose sizes already match need no reduction; the remaining axes are the
# ones the broadcast expanded, and the gradient must be summed over them.
shrink_dims = [i for i in range(len(self.shape))]
for i, (ori, cur) in enumerate(zip(reversed(ori_shape), reversed(self.shape))):
    if ori == cur:
        shrink_dims[len(self.shape) - i - 1] = -1
shrink_dims = tuple(filter(lambda x: x >= 0, shrink_dims))
assert len(shrink_dims) > 0
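Out of context this fragment is easy to misread, so here is a self-contained numpy illustration of the same idea: undoing a broadcast by summing the gradient over the expanded axes. The helper name undo_broadcast and the example shapes are made up for the illustration.

import numpy as np

def undo_broadcast(grad, ori_shape):
    # grad has the broadcast shape; sum over every axis that was expanded
    # (newly added in front or stretched from size 1), then reshape back.
    shrink_dims = [i for i in range(len(grad.shape))]
    for i, (ori, cur) in enumerate(zip(reversed(ori_shape), reversed(grad.shape))):
        if ori == cur:
            shrink_dims[len(grad.shape) - i - 1] = -1
    shrink_dims = tuple(filter(lambda x: x >= 0, shrink_dims))
    return grad.sum(axis=shrink_dims).reshape(ori_shape)

g = np.ones((2, 3, 4))                      # gradient w.r.t. the broadcast result
print(undo_broadcast(g, (3, 1)).shape)      # (3, 1): summed over axes 0 and 2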
def compute_gradient_of_variables(output_tensor, out_grad):
    """Take gradient of output node with respect to each node in node_list.

    Store the computed result in the grad field of each Variable.
    """
    # a map from node to a list of gradient contributions from each output node
    node_to_output_grads_list: Dict[Tensor, List[Tensor]] = {}
    # Special note on initializing gradient:
    # We are really taking a derivative of the scalar reduce_sum(output_node)
    # instead of the vector output_node. But this is the common case for loss functions.
    node_to_output_grads_list[output_tensor] = [out_grad]

    # Traverse graph in reverse topological order given the output_node that we are taking gradient wrt.
    reverse_topo_order = list(reversed(find_topo_sort([output_tensor])))

    for node in reverse_topo_order:
        # sum all adjoint contributions flowing into this node
        adjoint = node_to_output_grads_list[node]
        v_i = sum_node_list(adjoint)
        node.grad = v_i

        if node.op is None:
            continue
        # propagate the adjoint to each input through the op's gradient rule
        v_ki_list = node.op.gradient_as_tuple(v_i, node)
        for inp, v_ki in zip(node.inputs, v_ki_list):
            node_to_output_grads_list.setdefault(inp, list())
            node_to_output_grads_list[inp].append(v_ki)
def find_topo_sort(node_list: List[Value]) -> List[Value]:
    """Given a list of nodes, return a topological sort list of nodes ending in them.

    A simple algorithm is to do a post-order DFS traversal on the given nodes,
    going backwards based on input edges. Since a node is added to the ordering
    after all its predecessors are traversed due to post-order DFS, we get a
    topological sort.
    """
    topo_order = []
    visited = set()
    for node in node_list:
        topo_sort_dfs(node, visited, topo_order)
    return topo_order


def topo_sort_dfs(node, visited, topo_order):
    """Post-order DFS"""
    if node in visited:
        return
    visited.add(node)
    for input_node in node.inputs:
        topo_sort_dfs(input_node, visited, topo_order)
    topo_order.append(node)
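Putting the pieces together, a hedged end-to-end usage sketch, assuming the course-style Tensor API where sum() builds a reduction node and backward() seeds the output adjoint with ones before calling compute_gradient_of_variables:

import needle

x = needle.Tensor([1.0, 2.0, 3.0])
y = needle.Tensor([4.0, 5.0, 6.0])
z = (x * y).sum()

z.backward()        # fills in the .grad field of every node in the graph
print(x.grad)       # expected to match y, i.e. [4. 5. 6.]
print(y.grad)       # expected to match x, i.e. [1. 2. 3.]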
Neural Network Library
Building a neural network involves the network modules themselves, the optimizer, parameter initialization, and data processing. These are recorded in turn below.
Initialization
The course only asks for the two most basic families of initialization: Xavier and Kaiming. In both cases the weights can be drawn either from a uniform distribution \(\mathcal U (-a, a)\) or from a normal distribution \(\mathcal N (0, \mathrm{std}^2)\). The formulas for \(a\) and \(\mathrm{std}\) can be taken directly from the PyTorch implementation.
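For reference, here is a minimal numpy sketch of the two uniform variants, using the standard formulas \(a = \mathrm{gain}\sqrt{6/(\mathrm{fan\_in}+\mathrm{fan\_out})}\) for Xavier and \(\mathrm{bound} = \mathrm{gain}\sqrt{3/\mathrm{fan\_in}}\) with \(\mathrm{gain}=\sqrt 2\) (ReLU) for Kaiming. The function names are illustrative and not the needle init API.

import math
import numpy as np

def xavier_uniform(fan_in, fan_out, gain=1.0):
    # U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out))
    a = gain * math.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-a, a, size=(fan_in, fan_out))

def kaiming_uniform(fan_in, fan_out):
    # ReLU gain sqrt(2); bound = gain * sqrt(3 / fan_in)
    gain = math.sqrt(2.0)
    bound = gain * math.sqrt(3.0 / fan_in)
    return np.random.uniform(-bound, bound, size=(fan_in, fan_out))

w = kaiming_uniform(784, 128)   # e.g. the first layer of an MNIST MLP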
# For new leading dimensions, set stride to 0 (broadcasting new dimensions)
new_strides = [0] * (len(new_shape) - len(self.shape)) + new_strides
return NDArray.make(new_shape, tuple(new_strides), self.device, self._handle)
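The stride-0 trick can be demonstrated with plain numpy, independent of the NDArray class here; as_strided below builds the same kind of zero-copy broadcast view, so every step along the new leading axis reads the same memory.

import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.arange(4, dtype=np.float32)
# View x as shape (3, 4) without copying: stride 0 on the new leading axis
# means all three "rows" alias the same four floats.
b = as_strided(x, shape=(3, 4), strides=(0, x.strides[0]))
print(b)
print(np.shares_memory(x, b))   # True: no data was copied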
// Generic template for element-wise operations
template <typename Op>
void EwiseOperation(const AlignedArray& a, const AlignedArray& b, AlignedArray* out, Op op) {
  for (size_t i = 0; i < a.size; i++) {
    out->ptr[i] = op(a.ptr[i], b.ptr[i]);
  }
}

// Generic template for scalar operations
template <typename Op>
void ScalarOperation(const AlignedArray& a, float val, AlignedArray* out, Op op) {
  for (size_t i = 0; i < a.size; i++) {
    out->ptr[i] = op(a.ptr[i], val);
  }
}

// Generic template for single element operations
template <typename Op>
void SingleOperation(const AlignedArray& a, AlignedArray* out, Op op) {
  for (size_t i = 0; i < a.size; i++) {
    out->ptr[i] = op(a.ptr[i]);
  }
}

void EwiseMul(const AlignedArray& a, const AlignedArray& b, AlignedArray* out) {
  EwiseOperation(a, b, out, [](float a, float b) { return a * b; });
}
Reduce
A reduce operation collapses one dimension of an array; for example, sum collapses a dimension by summing over it. The course implements a reduce_view_out helper at the Python level: it permutes the axis to be reduced to the last dimension and then makes the array contiguous in memory with compact. As a result, the device layer only has to walk the flat array in chunks of the reduce size and compute the max or the sum of each chunk.
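A hedged numpy sketch of the two halves of this scheme: a view-preparation step that moves the reduce axis to the end and flattens (mirroring reduce_view_out), and a device-style loop that reduces each contiguous chunk of reduce_size elements. The names reduce_view and reduce_sum_flat are made up for the illustration.

import numpy as np

def reduce_view(a, axis):
    # Move the reduced axis to the end and make the data contiguous,
    # mirroring what reduce_view_out does with permute + compact.
    perm = [i for i in range(a.ndim) if i != axis] + [axis]
    view = np.ascontiguousarray(np.transpose(a, perm))
    return view.reshape(-1), a.shape[axis]

def reduce_sum_flat(flat, reduce_size):
    # Device-style loop: every block of reduce_size consecutive elements
    # belongs to one output element.
    out = np.empty(flat.size // reduce_size, dtype=flat.dtype)
    for i in range(out.size):
        out[i] = flat[i * reduce_size:(i + 1) * reduce_size].sum()
    return out

a = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
flat, rs = reduce_view(a, axis=1)
print(reduce_sum_flat(flat, rs))    # same values as a.sum(axis=1).ravel()
print(a.sum(axis=1).ravel())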
inline void AlignedDot(const float* __restrict__ a,
                       const float* __restrict__ b,
                       float* __restrict__ out) {
  /**
   * Multiply together two TILE x TILE matrices, and _add_ the result to out (it is important to
   * add the result to the existing out, which you should not set to zero beforehand). We are
   * including the compiler flags here that enable the compiler to properly use vector operators
   * to implement this function. Specifically, the __restrict__ keyword indicates to the compiler
   * that a, b, and out don't have any overlapping memory (which is necessary in order for vector
   * operations to be equivalent to their non-vectorized counterparts; imagine what could happen
   * otherwise if a, b, and out had overlapping memory). Similarly the __builtin_assume_aligned
   * keyword tells the compiler that the input array will be aligned to the appropriate blocks in
   * memory, which also helps the compiler vectorize the code.
   *
   * Args:
   *   a: compact 2D array of size TILE x TILE
   *   b: compact 2D array of size TILE x TILE
   *   out: compact 2D array of size TILE x TILE to write to
   */

  a = (const float*)__builtin_assume_aligned(a, TILE * ELEM_SIZE);
  b = (const float*)__builtin_assume_aligned(b, TILE * ELEM_SIZE);
  out = (float*)__builtin_assume_aligned(out, TILE * ELEM_SIZE);

  for (size_t i = 0; i < TILE; i++) {
    for (size_t j = 0; j < TILE; j++) {
      for (size_t k = 0; k < TILE; k++) {
        out[i * TILE + j] += a[i * TILE + k] * b[k * TILE + j];
      }
    }
  }
}
void MatmulTiled(const AlignedArray& a, const AlignedArray& b, AlignedArray* out,
                 uint32_t m, uint32_t n, uint32_t p) {
  /**
   * Matrix multiplication on tiled representations of array. In this setting, a, b, and out
   * are all *4D* compact arrays of the appropriate size, e.g. a is an array of size
   *   a[m/TILE][n/TILE][TILE][TILE]
   * You should do the multiplication tile-by-tile to improve performance of the array (i.e., this
   * function should call `AlignedDot()` implemented above).
   *
   * Note that this function will only be called when m, n, p are all multiples of TILE, so you can
   * assume that this division happens without any remainder.
   *
   * Args:
   *   a: compact 4D array of size m/TILE x n/TILE x TILE x TILE
   *   b: compact 4D array of size n/TILE x p/TILE x TILE x TILE
   *   out: compact 4D array of size m/TILE x p/TILE x TILE x TILE to write to
   *   m: rows of a / out
   *   n: columns of a / rows of b
   *   p: columns of b / out
   */

  // Zero the output first, since AlignedDot accumulates into it.
  for (size_t i = 0; i < m; i++) {
    for (size_t j = 0; j < p; j++) {
      out->ptr[i * p + j] = 0;
    }
  }

  for (uint32_t i = 0; i < m / TILE; i++) {
    for (uint32_t j = 0; j < p / TILE; j++) {
      for (uint32_t k = 0; k < n / TILE; k++) {
        AlignedDot(a.ptr + (i * n / TILE + k) * TILE * TILE,
                   b.ptr + (k * p / TILE + j) * TILE * TILE,
                   out->ptr + (i * p / TILE + j) * TILE * TILE);
      }
    }
  }
}
In the CUDA implementation, memory efficiency can be improved by loading each tile into shared memory first; after a synchronization, the threads of the block then compute the TILE x TILE partial results in parallel.
__global__ void MatmulKernel(const scalar_t* a, const scalar_t* b, scalar_t* out,
                             uint32_t M, uint32_t N, uint32_t P) {
  /**
   * Args:
   *   a: compact 2D array of size m x n
   *   b: compact 2D array of size n x p
   *   out: compact 2D array of size m x p to write the output to
   *   M: rows of a / out
   *   N: columns of a / rows of b
   *   P: columns of b / out
   */
  __shared__ scalar_t a_tile[TILE][TILE];
  __shared__ scalar_t b_tile[TILE][TILE];

  size_t bx = blockIdx.x;
  size_t by = blockIdx.y;
  size_t tx = threadIdx.x;
  size_t ty = threadIdx.y;

  size_t x = bx * blockDim.x + tx;
  size_t y = by * blockDim.y + ty;

  scalar_t sum = 0;
  // Walk over the shared (N) dimension one TILE-wide slab at a time.
  for (size_t i = 0; i < (N + TILE - 1) / TILE; i++) {
    // Cooperatively stage one tile of a and one tile of b in shared memory,
    // padding with zeros at the boundaries.
    if (i * TILE + tx < N && y < M) {
      a_tile[ty][tx] = a[y * N + i * TILE + tx];
    } else {
      a_tile[ty][tx] = 0;
    }
    if (i * TILE + ty < N && x < P) {
      b_tile[ty][tx] = b[(i * TILE + ty) * P + x];
    } else {
      b_tile[ty][tx] = 0;
    }
    __syncthreads();
    for (size_t j = 0; j < TILE; j++) {
      sum += a_tile[ty][j] * b_tile[j][tx];
    }
    __syncthreads();
  }
  if (y < M && x < P) {
    out[y * P + x] = sum;
  }
}