3 hours ago · Tech

A while ago I read a great post on percisely.xyz/gemm about writing fast GEMM kernels on Apple GPUs. The author described an undocumented Metal instruction called simdgroup_async_copy, used the intrinsic __metal_simdgroup_async_copy_2d to move tiles from device memory into threadgroup memory, and reported a counterintuitive finding: a single processor doing the load was much faster than splitting the work across many. I wanted to replicate that on my M4 Mac. Things did not go as planned, and the path to a working benchmark turned into a small archaeology project. I worked through it with Claude. This is the kind of problem where an LLM that can hold dozens of binaries and symbol tables in its head at once turns out to be the right shape of tool.

Setup

Everything below is on this exact configuration. If your toolchain differs, the symbol names and lowering behavior may shift, since this is all undocumented and Apple changes it between Xcode releases.

Apple M4, GPU family…
