What Is a KV Cache? A Developer's Guide to the Trick That Makes LLMs Fast
“KV cache” shows up everywhere — model configs, inference logs, that use_cache=True flag you’ve never thought twice about. I nodded along to it for a long time. Then I ran an experiment: I traced a single token through a small open model end-to-end, and then I built my own KV store from scratch to prove I actually understood what it holds.
It turned out to be one of those ideas that’s genuinely simple once someone draws the picture — and almost everything written about it assumes you already have the picture. So here’s the picture, for developers, no PhD required.
The one-sentence version: to produce each new word, a model would otherwise redo a huge amount of work it already did — and the KV cache is the trick that lets it skip the repeat.
Let me earn that sentence.
How a model writes one word
LLMs generate text one token at a time. A token is roughly a word-piece. To produce the next one, the model passes your text up through a stack of layers that all share one structure but each carry their own learned weights: 24 of them in the small model I traced, Qwen2.5-0.5B.
Inside each layer, every token does the same little dance. I’ll keep it to the rhythm:
normalize → attention → add it back; normalize → feed-forward → add it back; → next layer.
The interesting part is attention — it’s where a token gets to look at the other tokens and pull in context. Here’s how it works without the math:
Each token starts as a vector of numbers. The layer multiplies that vector by three learned weight matrices to produce three new vectors for the token:
→ a Query (“what am I looking for?”)
→ a Key (“what do I offer to others?”)
→ a Value (“what I actually contribute if you attend to me”)
Think of it like a room full of people. Your Query is the question you’re asking. Everyone else holds up a Key — a label advertising what they know. You compare your Query against every Key to decide who’s worth listening to, then you collect the Values from the people who matched.
Concretely: the current token scores its Query against the Key of every token up to and including itself (in a causal model it never sees later tokens). Those raw scores get squashed by a softmax so they add up to 1, which turns them into attention weights. Finally you blend: multiply each token’s Value by its attention weight, sum it all up, and that’s what the token learned this layer.
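If it helps to see those three steps as code, here is a minimal NumPy sketch for a single attention head. The sizes and the random weight matrices are toy stand-ins, not values from Qwen2.5-0.5B, and the division by sqrt(d_head) is the usual scaled dot-product detail that the prose skips.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4  # toy sizes (assumption), far smaller than a real model

# Three "learned" projection matrices; random stand-ins for illustration.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

def attend(x):
    """x: (seq_len, d_model) token vectors.
    Returns the attention weights and blended output for the LAST token."""
    q = x[-1] @ W_q                      # Query for the current token
    K = x @ W_k                          # one Key per token (including itself)
    V = x @ W_v                          # one Value per token
    scores = K @ q / np.sqrt(d_head)     # compare the Query against every Key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax: weights now sum to 1
    return weights, weights @ V          # blend the Values by their weights

x = rng.normal(size=(5, d_model))        # five toy token vectors
weights, out = attend(x)
```

`weights` has one entry per token in the sequence, and `out` is the blended vector that gets added back onto the token in the next step.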
That blended result gets added back onto the token’s original vector (the “residual” — keep your old self, add what you just learned), runs through a feed-forward network, gets added back again, and rises to the next layer to do the whole thing over with fresh weights.
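The rhythm of one layer, and of the stack, can be sketched in a few lines. This is only the wiring: the `attention` and `feed_forward` callables below are simple stand-ins (a linear map and a one-matrix ReLU), the normalization is an RMS norm without its learned gain, and all sizes are toy assumptions.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Scale each token vector to unit root-mean-square (learned gain omitted).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def block(x, attention, feed_forward):
    x = x + attention(rms_norm(x))       # normalize -> attention -> add it back
    x = x + feed_forward(rms_norm(x))    # normalize -> feed-forward -> add it back
    return x

def run_stack(x, layers):
    # Same structure at every layer, different weights each time.
    for attention, feed_forward in layers:
        x = block(x, attention, feed_forward)
    return x

rng = np.random.default_rng(0)

def make_layer(d):
    # Hypothetical stand-in sublayers; a real layer uses causal attention
    # and a larger feed-forward network.
    W_a = rng.normal(size=(d, d)) * 0.1
    W_f = rng.normal(size=(d, d)) * 0.1
    return (lambda h: h @ W_a, lambda h: np.maximum(h @ W_f, 0.0))

layers = [make_layer(8) for _ in range(24)]   # 24 layers, as in the traced model
x = rng.normal(size=(5, 8))                   # five toy token vectors
y = run_stack(x, layers)
```

The residual adds are why a sublayer that contributes nothing leaves the token vector untouched.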
Stack that 24 times and the model has a rich, context-aware representation of your text. From the very top, it predicts the next token.
The expensive realization
Now here’s the catch that makes the cache necessary.
You generated one token. Great. To generate the next one, the model needs attention again — which means it needs the Keys and Values of every token in the sequence so far, at every layer.
The naive approach: re-run the entire sequence through all 24 layers from scratch, every single step. Generate a 500-token answer and you’ve reprocessed the whole growing context 500 times. Because the context gets one token longer each step, the total number of token-passes grows with the square of the answer length, and that’s the difference between an LLM that feels instant and one that crawls.
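To put a number on it, here is a small counting sketch. It tallies how many tokens get pushed through the stack in total, with and without reuse. The 100-token prompt is an assumption picked for illustration; only the 500-token answer comes from the example above.

```python
def token_passes(prompt_len, new_tokens, use_cache):
    """Count how many tokens are run through the layer stack in total."""
    total = 0
    for step in range(new_tokens):
        context = prompt_len + step          # tokens present at this step
        if use_cache:
            # First step processes the whole prompt; after that, one token per step.
            total += prompt_len if step == 0 else 1
        else:
            total += context                 # reprocess everything, every step
    return total

naive = token_passes(100, 500, use_cache=False)   # 174750
cached = token_passes(100, 500, use_cache=True)   # 599
```

This counts token-passes only. Even with reuse, each new token still attends over the whole context, so a single step does get slightly more expensive as the sequence grows.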
So someone asked the obvious question: do those old Keys and Values ever actually change?
The insight that the whole thing hangs on
They don’t.
A token’s Key and Value at a given layer depend only on that token and the tokens before it, because attention in these models is causal: a token never looks at anything that comes after it. When you append a new token to the end, you don’t alter the past. Token #5’s Key is identical whether the sequence is 6 tokens long or 600.
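You can check this claim on a toy model. The sketch below runs a two-layer, attention-only causal stack (random weights, toy sizes, no feed-forward, all assumptions) over the first 6 tokens and then over all 600, and compares the Keys of those first 6 tokens at both layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Two toy layers, each with its own Q/K/V weights (random stand-ins).
layers = [{name: rng.normal(size=(d, d)) / np.sqrt(d) for name in ("q", "k", "v")}
          for _ in range(2)]

def keys_per_layer(x):
    """Run a causal attention-only stack over x (seq_len, d).
    Returns one Key matrix per layer."""
    seq_len = x.shape[0]
    # -inf above the diagonal hides future tokens from each position.
    mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
    all_keys = []
    for w in layers:
        Q, K, V = x @ w["q"], x @ w["k"], x @ w["v"]
        all_keys.append(K)
        scores = Q @ K.T / np.sqrt(d) + mask
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        x = x + weights @ V                  # residual add, then on to the next layer
    return all_keys

tokens = rng.normal(size=(600, d))
keys_short = keys_per_layer(tokens[:6])      # sequence of 6
keys_long = keys_per_layer(tokens)           # sequence of 600
same = all(np.allclose(s, l[:6]) for s, l in zip(keys_short, keys_long))
```

`same` comes out `True`: the causal mask is what guarantees it, since without the mask the second layer’s Keys would shift every time a token was appended.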
So why recompute them? Compute each token’s Keys and Values once, the first time you see it, and stash them. That stash is the KV cache.
From then on, generating the next token is cheap. You only run the full, expensive computation for the one new token, and its Key and Value get appended to the cache along the way. For all the prior tokens, you just read their Keys and Values back out of the cache.
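Here is a minimal sketch of that loop for one attention head in one layer, checked against the recompute-everything version at every step. The `KVCache` class and its `step` method are hypothetical names for illustration, and the weights are random stand-ins. A real model keeps one such cache per layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def full_recompute(x):
    """Attention output for the last token, recomputing every Key and Value."""
    K, V = x @ W_k, x @ W_v
    return softmax(K @ (x[-1] @ W_q) / np.sqrt(d)) @ V

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_new):
        """Process ONE new token: store its K and V, then attend with its Query."""
        self.keys.append(x_new @ W_k)
        self.values.append(x_new @ W_v)
        K, V = np.stack(self.keys), np.stack(self.values)
        return softmax(K @ (x_new @ W_q) / np.sqrt(d)) @ V

tokens = rng.normal(size=(10, d))
cache = KVCache()
for t in range(len(tokens)):
    cached_out = cache.step(tokens[t])
    # Same answer as redoing all the work from scratch.
    assert np.allclose(cached_out, full_recompute(tokens[: t + 1]))
```

Each call to `step` does two small matrix products for the new token’s Key and Value, where `full_recompute` redoes them for the entire sequence.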
What the cache literally stores
This is the part that finally made it click for me, so let me be precise.
The KV cache holds, for every token, at every layer, the Key vector and the Value vector that token produced. That’s the whole thing — a saved copy of the K and V each token generated on its first pass through the stack.
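“For every token, at every layer” translates directly into a size estimate. The sketch below uses the 24 layers and the 2 Key/Value heads this post mentions for Qwen2.5-0.5B. The head dimension of 64 and the 2 bytes per value (fp16 or bf16) are my assumptions, so treat the result as a ballpark.

```python
num_layers = 24        # from the post
num_kv_heads = 2       # from the post (Key/Value heads)
head_dim = 64          # assumption: hidden size 896 split over 14 Query heads
bytes_per_value = 2    # assumption: fp16 or bf16 storage

def kv_cache_bytes(seq_len):
    # The factor 2 is one Key vector plus one Value vector,
    # per token, per layer, per Key/Value head.
    return seq_len * num_layers * 2 * num_kv_heads * head_dim * bytes_per_value

per_token = kv_cache_bytes(1)          # 12288 bytes, i.e. 12 KiB per token
at_4k = kv_cache_bytes(4096)           # 50331648 bytes, i.e. 48 MiB
```

Under those assumptions the cache grows by about 12 KiB for every token in the context, which is why long contexts are mostly a memory problem.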
But notice what is not in the cache: the Query.
The Query never gets stored. And once you see why, the whole design snaps into focus: you only ever need a Query for the token you’re generating right now. Old tokens already had their turn to ask their question — there’s no reason to ask it again. The new token asks the only question that matters this step, and it asks it against the cached Keys and Values of everyone before it.
Keys and Values are reusable history. The Query is a fresh question every step. That asymmetry is the entire reason it’s called a KV cache and not a QKV cache.
(One fun detail: many modern models share Keys and Values across attention heads to shrink the cache further. Qwen2.5-0.5B has 14 Query heads but only 2 Key/Value heads: the Queries split into two groups of seven, each group sharing one set of Keys and Values. It’s called Grouped-Query Attention, and a smaller cache is the main reason it exists.)
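The grouping is easy to sketch. Below, 14 Query heads read from 2 cached Key/Value heads. The head dimension of 64, the random tensors, and the choice to group consecutive heads (heads 0 to 6 on the first Key/Value head, 7 to 13 on the second) are assumptions for illustration.

```python
import numpy as np

num_q_heads, num_kv_heads, head_dim = 14, 2, 64     # head_dim is an assumption
group_size = num_q_heads // num_kv_heads             # 7 Query heads per KV head

rng = np.random.default_rng(0)
seq_len = 5
q = rng.normal(size=(num_q_heads, head_dim))             # Queries, current token
K = rng.normal(size=(num_kv_heads, seq_len, head_dim))   # cached Keys
V = rng.normal(size=(num_kv_heads, seq_len, head_dim))   # cached Values

outputs = []
for h in range(num_q_heads):
    g = h // group_size                   # which KV head this Query head shares
    scores = K[g] @ q[h] / np.sqrt(head_dim)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    outputs.append(w @ V[g])
out = np.stack(outputs)                   # one output per Query head: (14, 64)

kv_head_for = [h // group_size for h in range(num_q_heads)]
```

Storing 2 sets of Keys and Values where a one-per-Query-head design would store 14 makes the cache 7 times smaller, while every Query head still gets its own question.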
Why you should care
Beyond “it makes generation faster,” two things stood out once I’d built one myself.
It’s a real component you can build, not a black box. The cache is just data — Keys and Values you computed once — so I implemented my own store from scratch: the tensors serialized to disk as safetensors, with a lightweight SQLite index to find them again. Then I ran the same prompt two ways: once with the standard in-memory cache, once reading from my store. Twenty tokens, greedy decoding, and the output was bit-for-bit identical. Persisted attention state is faithful — you can stop, reload, and continue exactly where you left off.
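To give a feel for how little machinery that takes, here is a rough sketch of a disk-backed store with a SQLite index. It is not my actual implementation: the `KVStore` class, the table schema, and hashing the token IDs into a lookup key are all hypothetical, and NumPy’s `.npz` format stands in for safetensors so the example only needs NumPy and the standard library.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path

import numpy as np

class KVStore:
    """Hypothetical sketch: tensors on disk, a SQLite table to find them again."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.db = sqlite3.connect(self.root / "index.sqlite")
        self.db.execute("CREATE TABLE IF NOT EXISTS entries "
                        "(token_key TEXT PRIMARY KEY, path TEXT, num_tokens INTEGER)")

    @staticmethod
    def _key(token_ids):
        # Identify a cached state by the exact token sequence that produced it.
        return hashlib.sha256(",".join(map(str, token_ids)).encode()).hexdigest()

    def put(self, token_ids, keys, values):
        key = self._key(token_ids)
        path = self.root / f"{key}.npz"
        np.savez(path, keys=keys, values=values)   # stand-in for safetensors
        self.db.execute("INSERT OR REPLACE INTO entries VALUES (?, ?, ?)",
                        (key, str(path), len(token_ids)))
        self.db.commit()

    def get(self, token_ids):
        row = self.db.execute("SELECT path FROM entries WHERE token_key = ?",
                              (self._key(token_ids),)).fetchone()
        if row is None:
            return None
        with np.load(row[0]) as data:
            return data["keys"], data["values"]

store = KVStore(tempfile.mkdtemp())
rng = np.random.default_rng(0)
# Toy tensors shaped (layers, kv_heads, tokens, head_dim); shapes are assumptions.
k = rng.normal(size=(24, 2, 5, 64)).astype(np.float16)
v = rng.normal(size=(24, 2, 5, 64)).astype(np.float16)
store.put([1, 2, 3, 4, 5], k, v)
k2, v2 = store.get([1, 2, 3, 4, 5])
round_trip_exact = np.array_equal(k, k2) and np.array_equal(v, v2)
```

The round trip is exact because nothing is recomputed or rounded: the bytes that went to disk are the bytes that come back.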
It composes. Two requests that share the same opening — a long instruction block, say — can share that prefix’s cached K and V. Compute the shared part once, branch off it for each request, and only do the real work for the part that’s actually different. The data structure that makes that sharing automatic is a prefix tree over the cached state — and that’s a whole post of its own, which I’ll write next.
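The prefix tree can wait, but the core idea fits in a few lines: find how many leading tokens two requests share, and only compute the rest. The token IDs below are made up, and this pairwise comparison is a simplification for illustration, not the tree itself.

```python
def shared_prefix_len(a, b):
    """Number of leading token IDs that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

instructions = [101, 7, 7, 42, 13]      # made-up IDs for a shared instruction block
request_a = instructions + [5, 6]
request_b = instructions + [8, 9, 10]

reuse = shared_prefix_len(request_a, request_b)   # 5 tokens of K and V can be shared
fresh_for_b = len(request_b) - reuse              # 3 tokens still need real work
```

Sharing only works from the start of the sequence. Because each token’s Keys and Values depend on everything before it, a match in the middle of two different prompts is not reusable.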
The takeaway
A KV cache isn’t a fancy algorithm. It’s the result of noticing that an LLM was redoing work it had no reason to redo, and writing the answer down.
→ The model generates one token at a time.
→ Attention needs the Keys and Values of every earlier token, at every layer.
→ Those Keys and Values never change once computed.
→ So compute them once, cache them, and reuse them for the rest of the generation.
→ Each step only computes a Query, Key, and Value for the one new token: the Key and Value join the cache, and the Query is used once and thrown away.
Next time you see that use_cache flag, you’ll know exactly what it’s holding onto — and why switching it off would make your model redo the same work over and over, one token at a time.
Next up: how I structured my own KV store as a prefix tree so that branching contexts reuse a shared base instead of recomputing it.