| without caching | with caching |
|---|---|
| for each step, recompute all previous `K` and `V` | for each step, only compute current `K` and `V` |
| attention cost per step is **quadratic** with sequence length | attention cost per step is **linear** with sequence length (memory grows linearly, but compute/token remains low) |
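The "with caching" column can be sketched as a minimal, single-head decoding loop in NumPy. This is an illustrative toy, not a real framework API: a plain Python list stands in for the KV cache, and the names (`step_attention`, `W_q`, `W_k`, `W_v`) are assumptions for the example. Each step projects only the newest token into `K` and `V`, appends to the cache, and attends over the cached keys, so per-step attention work is linear in the sequence length so far.

```python
import numpy as np

def step_attention(x_t, W_q, W_k, W_v, k_cache, v_cache):
    """One decoding step with a KV cache: only the current token's
    K and V are computed; all past K/V come from the cache."""
    q = x_t @ W_q
    k_cache.append(x_t @ W_k)          # compute current K only
    v_cache.append(x_t @ W_v)          # compute current V only
    K = np.stack(k_cache)              # (t, d): cached keys so far
    V = np.stack(v_cache)              # (t, d): cached values so far
    scores = K @ q / np.sqrt(q.shape[-1])   # O(t) work per step
    w = np.exp(scores - scores.max())
    w /= w.sum()                       # softmax over cached positions
    return w @ V                       # attention output for this step

rng = np.random.default_rng(0)
d = 4
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
xs = rng.normal(size=(5, d))           # 5 toy token embeddings

k_cache, v_cache = [], []
outs = [step_attention(x, W_q, W_k, W_v, k_cache, v_cache) for x in xs]

# Cross-check: full recomputation (the "without caching" column)
# yields the same output for the final step.
Q, K, V = xs @ W_q, xs @ W_k, xs @ W_v
s = K @ Q[-1] / np.sqrt(d)
w = np.exp(s - s.max())
w /= w.sum()
assert np.allclose(outs[-1], w @ V)
```

The cross-check at the end makes the table's point concrete: both columns compute the same attention output, but the cached version never re-projects past tokens.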