LLM GPU memory usage ≈ model weights + KV cache (optimizer state if training) + activations for the maximum number of tokens.

With your 8B model (~16–18 GB of weights in FP16), asking for 10,500 tokens will likely exceed 19.5 GB, especially with `--gpu-memory-utilization=0.99`.

Calculation example:
- Your model: 8B parameters (~16 GB of FP16 weights)
- Max tokens: 10,500
- KV-cache VRAM ≈ 10,500 tokens × 0.0007 GB/token ≈ 7.35 GB additional VRAM
- Total VRAM needed ≈ 16 GB + 7.35 GB ≈ 23.35 GB → > 19.5 GB, hence the OOM

So yes, 10,500 tokens is too high for a 19.5 GB GPU.
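
A quick way to sanity-check these numbers (a minimal back-of-the-envelope sketch; the 16 GB weight footprint and the ~0.0007 GB-per-token KV-cache figure are the rough estimates used above, not values read from the model config):

```python
# Back-of-the-envelope VRAM estimate for serving an 8B model in FP16.
# The per-token KV-cache size could also be derived from the model config as
#   2 (K and V) * num_layers * num_kv_heads * head_dim * bytes_per_dtype,
# but here we reuse the rough 0.0007 GB/token figure from the note above.
weights_gb = 16.0          # ~8B params * 2 bytes (FP16)
kv_gb_per_token = 0.0007   # rough per-token KV-cache estimate used above
max_tokens = 10_500
gpu_vram_gb = 19.5

total_gb = weights_gb + max_tokens * kv_gb_per_token
print(f"Estimated VRAM: {total_gb:.2f} GB vs {gpu_vram_gb} GB available")

# Largest context that would fit under the same rough estimate:
fit_tokens = int((gpu_vram_gb - weights_gb) / kv_gb_per_token)
print(f"Max tokens that fit: ~{fit_tokens}")
```

Under these assumptions, roughly 5,000 tokens is the most that fits alongside the FP16 weights.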


To verify empirically, you can probe the model with increasingly long random inputs until it runs out of memory:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.float16
)

# Roughly probe the largest input length that fits in VRAM by feeding
# increasingly long batches of random token ids until a CUDA OOM is raised.
def test_max_tokens(model, start_len=1000, step=500):
    tokens = start_len
    while True:
        try:
            x = torch.randint(0, tokenizer.vocab_size, (1, tokens), device=device)
            with torch.no_grad():
                model(x)
            print(f"{tokens} tokens OK")
            tokens += step
        except RuntimeError:
            torch.cuda.empty_cache()  # release the failed allocation
            print(f"OOM at {tokens} tokens")
            break

test_max_tokens(model)
```
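
If the model is served with vLLM (which the `--gpu-memory-utilization=0.99` flag suggests), the same limit can be enforced by capping the context length instead of raising memory utilization. A minimal sketch, assuming the `vllm` Python API and the ~5,000-token budget from the estimate above:

```python
from vllm import LLM, SamplingParams

# Cap the context length so FP16 weights + KV cache fit in ~19.5 GB of VRAM.
# max_model_len=5000 is an assumption derived from the rough estimate above,
# not a measured value; adjust after testing on the actual GPU.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    dtype="float16",
    gpu_memory_utilization=0.90,  # leave headroom instead of 0.99
    max_model_len=5000,
)

params = SamplingParams(max_tokens=512, temperature=0.7)
outputs = llm.generate(["Explain KV-cache memory usage in one paragraph."], params)
print(outputs[0].outputs[0].text)
```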