The Illustrated LFM-2 (by Liquid AI)

Attention blocks get all the attention nowadays. While a bunch of companies tried the model-layer biz model, only a few have cracked differentiation. Others have succumbed to loss (of mindshare) and failed fast.

I remember watching Liquid's 2023 demos and not understanding the play at hand. In 2025, however, they are making big strides (pun unintended) in the edge-inference market, and they have done a great job exploring alternative architectures via architecture search.

Until recently, most headline models used GQA with full attention on every layer (looking at you, Llama-3). Lately I have been spotting more hybrid attention configs (GPT-OSS shipped with full + banded attention on alternating layers).

FWIW, Character AI came out in 2023 talking about hybrid attention for faster inference, which is a testament to Noam Shazeer's intuition.

While those are examples of hybrid attention, Liquid AI is exploring hybrid architectures (interleaving attention with non-attention sequence mixers). I was pleasantly surprised at ICML'25 to hear that they use 1D conv blocks in place of attention.

I will flesh out this architecture here and drop some functional PyTorch code along the way to help you follow along.

Llama-3 Architecture refresher

I like Llama-3's architecture because it is very popular and fully standardized now. Here's my illustration:

Llama Architecture Illustration

To summarize, the key details are: Pre-Normalization, RMSNorm, Rotary Positional Embeddings, Grouped Query Attention and SwiGLU FFN.
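Here is a minimal PyTorch sketch of one such decoder block. The dims, head counts, FFN width, and the eager (uncached) RoPE below are illustrative choices of mine, not Llama-3's actual configuration; the point is the structure: pre-RMSNorm, RoPE on Q/K, grouped KV heads, and a SwiGLU FFN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # scale by the reciprocal root-mean-square; no mean subtraction, no bias
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight


def rope(x, theta=10000.0):
    # x: (batch, heads, seq, head_dim); rotate channel pairs by position-dependent angles
    _, _, t, d = x.shape
    pos = torch.arange(t, device=x.device, dtype=torch.float32)
    freqs = 1.0 / (theta ** (torch.arange(0, d, 2, device=x.device, dtype=torch.float32) / d))
    angles = torch.outer(pos, freqs)              # (seq, head_dim / 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


class GQA(nn.Module):
    def __init__(self, dim, n_heads, n_kv_heads):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)
        rep = self.n_heads // self.n_kv_heads     # each group of query heads shares one KV head
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))


class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class LlamaBlock(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2, ffn_hidden=1408):
        super().__init__()
        self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
        self.attn = GQA(dim, n_heads, n_kv_heads)
        self.ffn = SwiGLU(dim, ffn_hidden)

    def forward(self, x):                          # x: (batch, seq, dim)
        x = x + self.attn(self.attn_norm(x))       # pre-norm + residual around attention
        return x + self.ffn(self.ffn_norm(x))      # pre-norm + residual around the FFN


# quick smoke test
y = LlamaBlock()(torch.randn(2, 16, 512))          # -> (2, 16, 512)
```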

LFM-2

This model was bred (via evolutionary search) to run on target hardware like the Samsung Galaxy S24 Ultra (Qualcomm Snapdragon SoC) and the AMD Ryzen AI 9 HX 370 [1].

LFM2 Architecture Illustration: 16 decoder blocks interleaving 10 gated short-conv (LIV) blocks with 6 grouped query attention blocks. Every block uses RMSNorm pre-normalization, a SwiGLU FFN, and residual connections; the attention blocks apply rotary embeddings to Q/K, and the LM head is tied to the embedding weights.

Key differences from Llama-3:

- Only 6 of the 16 blocks use grouped query attention; the other 10 swap attention out for a gated, short 1D convolution (the LIV conv block), as sketched below.
- Everything else stays familiar: RMSNorm pre-normalization, SwiGLU FFNs, and residual connections in every block.
- The LM head is tied to the embedding weights.
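To make the conv half concrete, here is my PyTorch reading of the gated short-conv block, plus a hypothetical layer layout. The kernel size, the exact gating, and the interleaving order are assumptions of mine; in the full model each of these blocks is also wrapped in pre-RMSNorm and a residual connection and followed by a SwiGLU FFN, same as the attention blocks.

```python
import torch
import torch.nn as nn


class GatedShortConv(nn.Module):
    """My reading of the gated short-conv ("LIV" conv) block: a causal, depthwise 1D conv
    with a tiny kernel, sandwiched between multiplicative input/output gates.
    Kernel size and gating details are assumptions, not confirmed hyperparameters."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.in_proj = nn.Linear(dim, 3 * dim, bias=False)   # produces gate_in, gate_out, and x
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim,
                              padding=kernel_size - 1, bias=False)  # depthwise; causal via padding
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                                    # x: (batch, seq, dim)
        gate_in, gate_out, h = self.in_proj(x).chunk(3, dim=-1)
        h = (gate_in * h).transpose(1, 2)                    # input gating, then (batch, dim, seq)
        h = self.conv(h)[..., : x.shape[1]]                  # trim right padding to stay causal
        h = gate_out * h.transpose(1, 2)                     # output gating
        return self.out_proj(h)


# Hypothetical 16-layer layout: only the 10 conv / 6 attention split comes from the
# diagram above; the exact interleaving order is my guess.
LAYER_TYPES = ["conv", "conv", "attn"] * 5 + ["attn"]        # 10 conv + 6 attn blocks
```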

Check out these graphs showing prefill and decode speed on different models (Source: Liquid's blog).

Llama-3 Decoder Block Illustration (for comparison): grouped query attention with rotary embeddings followed by a SwiGLU FFN, each wrapped in RMSNorm and a residual connection, repeated identically in every layer with no convolution.

On CPUs, the bottleneck is usually memory bandwidth, not FLOPs. Full attention (as in Llama-3) re-reads a growing KV cache on every decode step, so decode latency scales with context length.
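A quick back-of-envelope script shows why: the bytes you must stream per decoded token grow linearly with context. The config below is hypothetical, not Llama-3's or LFM-2's real shapes.

```python
# Back-of-envelope KV-cache traffic per decode step, assuming full attention in every
# layer of a small hypothetical model (fp16). Not Llama-3's or LFM-2's real config.
n_layers, n_kv_heads, head_dim, dtype_bytes = 16, 8, 64, 2

def kv_bytes(context_len):
    # K and V tensors per layer, each of shape (context_len, n_kv_heads, head_dim)
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * context_len

for ctx in (1_024, 8_192, 32_768):
    print(f"{ctx:>6} ctx tokens -> {kv_bytes(ctx) / 2**20:6.1f} MiB streamed per decoded token")
```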

LFM-2’s short, gated 1D convs avoid that: per token, they only touch a fixed-size window and kernel that fit in cache and vectorize well with SIMD. The few GQA layers seem to be enough to allow global mixing and keep contextual quality up.
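Here is a sketch of a single decode step for such a conv, with the gating and projections left out to keep the point visible: the entire recurrent state is a (dim × K−1) buffer, so per-token reads stay constant no matter how long the context gets.

```python
import torch

@torch.no_grad()
def conv_decode_step(x_t, state, conv_weight):
    """One decode step of a causal depthwise short conv (gating/projections from the
    block sketch above are omitted here to highlight the state size).
    x_t:         (batch, dim)        current token's features
    state:       (batch, dim, K-1)   last K-1 inputs -- the entire recurrent state
    conv_weight: (dim, K)            one short kernel per channel
    """
    window = torch.cat([state, x_t.unsqueeze(-1)], dim=-1)   # (batch, dim, K)
    y = (window * conv_weight).sum(-1)                       # depthwise conv at this position
    return y, window[..., 1:]                                # new state is still (batch, dim, K-1)

# Per-token work and memory stay O(dim * K) regardless of context length,
# unlike attention, whose per-token reads grow with the KV cache.
batch, dim, K = 1, 512, 3
state = torch.zeros(batch, dim, K - 1)
weight = torch.randn(dim, K)
for _ in range(5):                                           # decode a few tokens
    y, state = conv_decode_step(torch.randn(batch, dim), state, weight)
```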

Feel free to check out my working implementation of this model.

This post was inspired by Jay Alammar, Nishant Aklecha, and Phil Wang.