SBO and LBO are essential to write correct Kernels on B200 GPUs that leverage Tensor Cores. That is because they need to be provided to the SMEM descriptor that the tcgen05 MMA op expects the programmer to provide as an operand. The corresponding section in the PTX docs commonly is perceived as confusing and in this blogpost I try to explain the mechanism for both K-Major and M-Major matrix operands. Swizzle AtomsTo understand notation of swizzle in CUDA C++ we need to first familiarise ourselves with the notation of Swizzle Atoms. We can extract the following information from PTX docs: Swizzling Mode Leading Dimension / Major-ness Swizzle Atom Layout 128B M/N 8×8 128B K 8×8 64B M/N 4×8 64B K 8×4 32B M/N 2×8 32B K 8×2 None M/N 1×8 None K 8×1 Note that this is for elements with 128 bits. If we consider the case of datatype T the table needs to be adjusted such that we "stretch" the atom across the non major mode by 128 / sizeof(T) across the major mode. For example for BFloat16 the…
No comments yet. Log in to reply on the Fediverse. Comments will appear here.