LLM memory usage at inference = model weights + KV cache + activations; the KV cache grows with the maximum number of tokens you allow (optimizer state only matters during training).
With your 8B model (~16–18 GB in FP16), asking for 10,500 tokens will likely push you past 19.5 GB, especially with --gpu-memory-utilization=0.99 leaving almost no headroom.
Calculation example (rough rule of thumb):
- Your model: 8B parameters → ~16 GB of FP16 weights
- Max tokens: 10,500
- KV cache ≈ 10,500 tokens × ~0.0007 GB per token ≈ 7.35 GB of additional VRAM
- Total VRAM needed ≈ 16 GB + 7.35 GB ≈ 23.35 GB > 19.5 GB, hence the OOM
So yes, 10,500 tokens is too high for a 19.5 GB GPU.
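If you want a model-specific estimate instead of the flat per-token figure, you can read the layer count, KV-head count, and head dimension straight from the Hugging Face config. A minimal sketch, assuming a Llama-style config; the kv_cache_gb helper below is just for illustration:

from transformers import AutoConfig

def kv_cache_gb(model_name, max_tokens, bytes_per_elem=2):
    # FP16/BF16 use 2 bytes per element; the cache stores one K and one V per layer per token.
    cfg = AutoConfig.from_pretrained(model_name)
    # GQA models keep fewer KV heads than attention heads; fall back to full MHA if the field is absent.
    kv_heads = getattr(cfg, "num_key_value_heads", cfg.num_attention_heads)
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    total_bytes = 2 * cfg.num_hidden_layers * kv_heads * head_dim * bytes_per_elem * max_tokens
    return total_bytes / 1e9

print(kv_cache_gb("deepseek-ai/DeepSeek-R1-Distill-Llama-8B", 10_500))

Keep in mind this covers only the cache itself; the serving engine also needs room for activations and runtime overhead on top of the weights.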
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.float16
)

# Roughly probe the largest prompt a single forward pass can handle on this GPU.
def test_max_tokens(model, start_len=1000, step=500):
    tokens = start_len
    while True:
        try:
            # Random token IDs are enough for a memory probe; the content doesn't matter.
            x = torch.randint(0, tokenizer.vocab_size, (1, tokens), device=device)
            with torch.no_grad():
                model(x)
            print(f"{tokens} tokens OK")
            tokens += step
        except torch.cuda.OutOfMemoryError:
            print(f"OOM at {tokens} tokens")
            torch.cuda.empty_cache()  # release the failed allocation
            break

test_max_tokens(model)
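If you are serving through vLLM (the --gpu-memory-utilization flag suggests you are), the more direct fix is to cap the context length when you create the engine and leave a little headroom instead of 0.99. A hedged sketch using vLLM's Python API; max_model_len=8192 is just an example value, pick whatever the probe above shows fits:

from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    dtype="float16",
    max_model_len=8192,           # example cap below 10,500; tune to what actually fits
    gpu_memory_utilization=0.90,  # leave headroom instead of 0.99
)

The same knobs exist on the CLI as --max-model-len and --gpu-memory-utilization.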