LLM memory usage = model weights + optimizer/KV-cache + activations for the maximum token count. With your 8B model (~16–18 GB of weights in FP16), asking for 10,500 tokens will likely exceed 19.5 GB, especially with `--gpu-memory-utilization=0.99`.

Calculation example:

- Your model: 8B (~16 GB FP16 weights)
- Max tokens: 10,500
- KV-cache VRAM ≈ 10,500 × 0.0007 GB ≈ 7.35 GB additional VRAM
- Total VRAM needed ≈ 16 GB + 7.35 GB ≈ 23.35 GB, which is > 19.5 GB, hence the OOM

So yes, 10,500 tokens is too high for a 19.5 GB GPU.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.float16
)

# Roughly probe the maximum context length: feed ever-longer batches of
# random token IDs through a single forward pass until the GPU runs out
# of memory.
def test_max_tokens(model, start_len=1000, step=500):
    tokens = start_len
    while True:
        try:
            x = torch.randint(0, tokenizer.vocab_size, (1, tokens), device=device)
            with torch.no_grad():
                model(x)
            print(f"{tokens} tokens OK")
            tokens += step
        except RuntimeError:
            print(f"OOM at {tokens} tokens")
            torch.cuda.empty_cache()  # release the partially allocated buffers
            break

test_max_tokens(model)
```
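For a quick back-of-the-envelope check that doesn't require loading the model at all, the same arithmetic can be wrapped in a small helper. This is only a rough sketch: it assumes the ~0.0007 GB-per-token KV-cache cost used above, and the real figure depends on the model's layer count, hidden size, number of KV heads, and cache dtype. It also ignores activations and the CUDA context, so the true headroom is somewhat smaller than the estimate suggests.

```python
# Rough VRAM estimate for serving an FP16 model, using the same
# per-token KV-cache cost assumed above (~0.0007 GB/token).
def estimate_vram_gb(weights_gb: float, max_tokens: int,
                     kv_gb_per_token: float = 0.0007) -> float:
    kv_cache_gb = max_tokens * kv_gb_per_token
    return weights_gb + kv_cache_gb


budget_gb = 19.5   # usable VRAM on the GPU
weights_gb = 16.0  # ~8B parameters in FP16

needed = estimate_vram_gb(weights_gb, 10_500)
print(f"Estimated VRAM for 10,500 tokens: {needed:.2f} GB")  # ~23.35 GB > 19.5 GB

# Invert the estimate to see roughly how many tokens fit in the budget.
max_fit = int((budget_gb - weights_gb) / 0.0007)
print(f"Tokens that roughly fit in {budget_gb} GB: {max_fit}")  # ~5000
```

Under these assumptions, a max-token setting of around 5,000 or below would be a safer starting point on a 19.5 GB GPU, to be confirmed empirically with the probing script above.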