In the first post I explained the KV cache with zero math — what it stores, why the Query is never cached, and the experiment where I built my own store and reloaded it to continue generation bit-for-bit.

This is the companion for developers who want to see the actual operations. No proofs, no research notation — just the matrix shapes and the handful of equations that make attention work, grounded in the model I experimented with: Qwen2.5-0.5B.

If you can read pseudocode and you remember what a dot product is, you have everything you need.

The dimensions we’re working with

Every token, at every layer, is represented as a vector. In Qwen2.5-0.5B that vector has 896 numbers — call it d_model = 896.

Attention splits that vector across heads, each head working in a smaller subspace of size head_dim = 64. The split is:

14 Query heads → 14 × 64 = 896 → 2 Key/Value heads → 2 × 64 = 128

That asymmetry (14 query heads, 2 KV heads) is Grouped-Query Attention, and we’ll come back to it. For now, just hold the numbers.

Step 1 — Project the token into Q, K, V

Take a token’s current vector x (shape [896]). The layer holds three learned weight matrices and multiplies x by each:

q = x @ W_Q     # W_Q is [896 x 896]  -> q is [896]  (14 heads x 64)
k = x @ W_K     # W_K is [896 x 128]  -> k is [128]  (2 heads x 64)
v = x @ W_V     # W_V is [896 x 128]  -> v is [128]  (2 heads x 64)

@ is matrix multiplication. Notice W_K and W_V are narrower — they only produce 2 heads’ worth of output, which is why the KV cache is small relative to the model.

Then reshape into heads:

q -> [14, 64]    # 14 query heads
k -> [2, 64]     # 2 key heads
v -> [2, 64]     # 2 value heads

Step 2 — Score the query against the keys

This is the core of attention: a scaled dot product.

For one query head, you take its 64-dim query vector and compute a dot product against the key vector of every token in the sequence so far. A dot product of two vectors is one number — here it measures how aligned the query is with that key.

score(i) = dot(q, k_i) / sqrt(head_dim)

That sqrt(head_dim)sqrt(64) = 8 — is the “scaled” part, and it’s not cosmetic. Dot products over 64 dimensions can get large, and large inputs push softmax into a region where one value dominates and the gradients vanish. Dividing by sqrt(d) keeps the numbers in a sane range. It’s a numerical-stability trick, nothing deeper.

If the sequence has t tokens, one query head produces t scores — one per preceding token (plus itself).

The causal mask: a token may only attend to itself and earlier tokens, never future ones. So before the next step, any score for a future position is set to -infinity. After softmax, -infinity becomes a weight of exactly 0 — the future contributes nothing.

Step 3 — Softmax into attention weights

Raw scores aren’t usable yet; they don’t sum to anything meaningful. Softmax fixes that — it turns a vector of scores into a probability distribution that sums to 1:

softmax(s_i) = exp(s_i) / sum_j( exp(s_j) )

Every weight is now between 0 and 1, and they all add up to 1. These are the percentages of attention the token pays to each earlier token.

Step 4 — Blend the values

Now use those weights to mix the Value vectors:

out = sum_i( weight_i * v_i )      # weighted sum of value vectors

Each earlier token’s Value vector, scaled by how much attention it earned, summed together. That’s one head’s output: a single 64-dim vector that’s a blend of the past, weighted by relevance.

Do that for all 14 query heads and you get 14 vectors of size 64.

Step 5 — Recombine the heads

Concatenate the 14 head outputs back into one 896-dim vector, then apply the output projection:

concat -> [896]
attn_out = concat @ W_O     # W_O is [896 x 896]

W_O is learned. This is the step that actually combines the heads — without it they’d just sit side by side. W_O mixes them into a single coherent update.

Step 6 — Residual add and normalize

Attention doesn’t replace the token; it adds to it:

x = x + attn_out            # the residual connection

The token keeps its old self and adds what it just learned. The same vector then flows through a feed-forward network with its own residual add, and normalization is applied before each sub-step to keep magnitudes stable. The clean rhythm:

normalize -> attention    -> add back
normalize -> feed-forward -> add back
-> next layer

Stack that 24 times (Qwen2.5-0.5B has 24 layers), and the top of the stack predicts the next token.

Where the cache actually lives

Now the payoff, in equations.

At generation step t, you’re producing token t. The naive computation would rebuild Q, K, V for all t tokens. But look at Step 1: a token’s k and v depend only on that token’s own vector. Token 5’s k is the same whether the sequence is 6 or 600 long.

So you cache them. At step t you only compute the new token’s three projections:

q_t = x_t @ W_Q
k_t = x_t @ W_K   ->  append to K_cache   (now K_cache holds k_1 .. k_t)
v_t = x_t @ W_V   ->  append to V_cache   (now V_cache holds v_1 .. v_t)

Then attention for step t is just:

scores  = (q_t @ K_cache.T) / sqrt(head_dim)   # q_t against ALL cached keys
weights = softmax(scores)
out     = weights @ V_cache                    # blend ALL cached values

q_t is computed fresh and thrown away. K_cache and V_cache grow by exactly one row per step and are never recomputed. That is the entire optimization — the difference between O(t²) work per token and O(t).

One honest detail: Qwen applies rotary position embeddings (RoPE) to q and k before the dot product, so each cached key already has its position baked in. That’s another reason the cache is safe to persist — position is part of what you stored, not something re-derived later.

Grouped-Query Attention, properly explained

This is the part that confused me longest, so I’m going to slow down. We have 14 query heads but only 2 key/value heads. How does that even work, and why would anyone design it that way?

A head is just a 64-number chunk

Forget “heads” as separate machines for a second. You compute one flat list of numbers, then chop it into chunks of head_dim = 64. The number of heads is literally the list length divided by 64:

q: 896 numbers -> 896 / 64 = 14 chunks -> 14 query heads
k: 128 numbers -> 128 / 64 =  2 chunks ->  2 key heads
v: 128 numbers -> 128 / 64 =  2 chunks ->  2 value heads

Think of q as a tray of 896 eggs arranged as 14 cartons of 64, and k as a tray of 128 eggs arranged as 2 cartons of 64. Every carton holds exactly 64 — that never changes. The trays differ only in how many cartons.

And the only reason a query and a key can be compared at all is that a dot product needs two vectors of equal length — and every query carton and every key carton is 64 long. The count differs (14 vs 2); the size per head is always equal. That equality is what lets mismatched counts interact.

Where 128 comes from — and the dial

The KV width isn’t magic. It’s one multiplication:

KV width = (number of KV heads) × head_dim
         =        2             ×    64      = 128

The number of KV heads is a dial, and it’s the same dial as MHA / GQA / MQA. Same 14 query heads throughout; only the KV-head count moves:

14 KV heads × 64 = 896   MHA   1:1, biggest cache, max quality
 7 KV heads × 64 = 448   GQA   groups of 2
 2 KV heads × 64 = 128   GQA   groups of 7   <- Qwen2.5-0.5B
 1 KV head  × 64 =  64   MQA   all share one, smallest cache

One constraint: query heads must divide evenly by KV heads, so the groups come out whole. 14 divides by 1, 2, 7, and 14 — which is exactly the list above. (You couldn’t pick 4 KV heads here; 14 / 4 = 3.5.) Each query head maps to its group’s KV head:

kv_head_for_query(h) = h // 7      # query heads 0-6 -> KV head 0; 7-13 -> KV head 1

Why 2 and not 7?

There’s real information lost as you turn the dial down — but it’s not fidelity, it’s diversity of retrieval. Each KV head is one “way of looking up the past.” Fourteen KV heads give every query its own lens; two heads force all 14 queries through 2 shared lenses. Fewer reference books, more readers per book. (The queries stay full-width, so the model still asks 14 distinct questions — it just has fewer answer-sources.)

So why compress so hard? Because the two sides of the trade scale differently:

Cost of more KV heads:  LINEAR   (7 heads = 3.5× the cache of 2)
Quality gained:         STEEPLY DIMINISHING

Here’s the part that connects to everything: the KV cache is re-read from memory on every token, at every layer. Autoregressive decoding is memory-bandwidth bound, not compute bound — the GPU spends most of each step just reading the cache. So 7 KV heads doesn’t only cost 3.5× the storage, it costs ~3.5× the memory traffic per token, which directly slows generation. Meanwhile a GQA-trained model learns to pack what it needs into 2 heads, so the quality you’d buy back is small. You’d be paying linearly for a stretch of the curve that’s already nearly flat. A 0.5B model — built to be fast and cheap on small hardware — happily takes that trade. Bigger models (Llama-70B uses 8 KV heads) sit higher on the dial because they can afford the bandwidth and want every drop of quality.

The punchline

Look at which side got compressed:

Query  -> 14 heads, full width  -> NOT cached (recomputed fresh, cheap, transient)
K / V  ->  2 heads, compressed   -> CACHED (stored & re-read every token, expensive)

They squeeze exactly the thing that fills the cache and leave full the thing that’s thrown away each step. There’s no storage pressure on the query, so why shrink it? GQA is a KV-cache optimization first and an attention tweak second. The shapes are lopsided because the cache is lopsided.

That’s it

No part of this is beyond a developer who’s comfortable with vectors and matrix multiplication:

→ Project x into q, k, v with three weight matrices. → Scaled dot product of q against every cached k, masked to the past. → Softmax to attention weights that sum to 1. → Weighted sum of the cached v’s. → Concatenate heads, project with W_O, residual-add. → Cache k and v; recompute only q.

Once you’ve written the cached step out as those few lines, “KV cache” stops being jargon and becomes the obvious thing to do.

Next, I’ll get into how I structured my own store as a prefix tree so that contexts sharing a common opening reuse the same cached K and V instead of recomputing them.