A deep dive into DeepSeek’s Multi-Head Latent Attention, including the mathematics and implementation details. The layer is recreated in Julia using Flux.jl.

See also previous posts on transformers:
- Transformers from first principles in Julia.
- Generative transformer from first principles in Julia.

All code is available at github.com/LiorSinai/TransformersLite.jl/tree/feature/mla.

Table of Contents

1 Introduction

In January 2025, DeepSeek unveiled their new DeepSeek-V3 and DeepSeek-R1 models, which took the world by storm. Users were impressed with their abilities, on top of DeepSeek's claims that they were up to 50× more efficient to train and run than their competitors. DeepSeek also released multiple papers (DeepSeek-V2, DeepSeek-V3, DeepSeek-R1) with an impressive array of new techniques across the whole machine learning pipeline, from high-level theory to intricate implementation details. Most of these build on existing ideas in innovative ways. They include:

Theory
- Multi-Head Latent Attention (MLA): …