| without caching | with caching |
|---|---|
| for each step, recompute all previous K and V | for each step, only compute current K and V |
| attention cost per step is quadratic with sequence length | attention cost per step is linear with sequence length (memory grows linearly, but compute/token remains low) |
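
To make the contrast concrete, here is a minimal single-head sketch (a simplified illustration, not the original implementation; the NumPy projection matrices `W_q`, `W_k`, `W_v` and the helper functions are assumptions). The uncached path re-projects every past token at each step, while the cached path only projects the newest token and reuses stored K and V; both produce the same attention output.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model = 16
rng = np.random.default_rng(0)
# Illustrative projection matrices for a single attention head.
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def step_without_cache(tokens):
    # Recompute K and V for all previous tokens at every decoding step.
    Q = tokens @ W_q
    K = tokens @ W_k
    V = tokens @ W_v
    scores = (Q[-1:] @ K.T) / np.sqrt(d_model)
    return softmax(scores) @ V

k_cache, v_cache = [], []

def step_with_cache(new_token):
    # Only project the current token; reuse cached K and V for the past.
    q = new_token @ W_q
    k_cache.append(new_token @ W_k)
    v_cache.append(new_token @ W_v)
    K = np.stack(k_cache)
    V = np.stack(v_cache)
    scores = (q[None, :] @ K.T) / np.sqrt(d_model)
    return softmax(scores) @ V

# Feed the same token stream both ways and confirm the outputs match.
tokens = rng.standard_normal((5, d_model))
for t in range(1, len(tokens) + 1):
    out_full = step_without_cache(tokens[:t])
    out_cached = step_with_cache(tokens[t - 1])
    assert np.allclose(out_full, out_cached)
```

The assertion passing at every step shows the cache changes only the cost, not the result: per-step work drops from re-projecting the whole prefix to a single projection plus one dot product against the stored keys, at the price of memory that grows with the sequence.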