@erhangundogan
Created September 15, 2025 15:45

LLM model length calculation

LLM memory usage = model weights + KV cache (or optimizer state, when training) + activations for the maximum token count.

With your 8B model (~16–18 GB of weights in FP16), asking for 10,500 tokens will likely exceed 19.5 GB of VRAM, especially with --gpu-memory-utilization=0.99 leaving almost no headroom.

Calculation example (a code sketch of the same arithmetic follows the list):

  • Your model: 8B (~16 GB FP16 weights)
  • Max tokens: 10,500
  • KV-cache VRAM ≈ 10,500 tokens × 0.0007 GB/token ≈ 7.35 GB of additional VRAM
  • Total VRAM needed ≈ 16 GB + 7.35 GB ≈ 23.35 GB → > 19.5 GB, hence the OOM
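
A minimal sketch of this arithmetic in Python. The 0.0007 GB/token figure is the conservative rule of thumb used above; kv_bytes_per_token shows how a tighter per-token figure could be derived from architecture parameters, and the layer/head values there are assumed Llama-style numbers, not read from the actual model config:

def kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    # One K and one V entry per layer, each num_kv_heads * head_dim wide.
    # These defaults are assumptions for a Llama-style 8B model with GQA.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def estimate_total_vram_gb(weights_gb, max_tokens, gb_per_token=0.0007):
    # Total ≈ weights + per-token KV cache × context length (activations ignored).
    return weights_gb + max_tokens * gb_per_token

print(f"Per-token KV cache (GQA, FP16): {kv_bytes_per_token() / 1e6:.3f} MB")
print(f"Estimated total for 10,500 tokens: {estimate_total_vram_gb(16, 10_500):.2f} GB")  # ≈ 23.35 GB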

So yes, 10,500 tokens is too high for a 19.5 GB GPU.
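
Since the --gpu-memory-utilization flag suggests vLLM, a practical workaround is to cap the context length instead of pushing utilization to 0.99. A minimal sketch using vLLM's offline LLM API; max_model_len=8192 and the 0.90 utilization value are illustrative assumptions, not measured limits:

from vllm import LLM, SamplingParams

# Cap the context length so weights + KV cache fit within ~19.5 GB,
# and leave some headroom below 0.99 utilization.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    dtype="float16",
    max_model_len=8192,            # assumed value; raise it only if it fits
    gpu_memory_utilization=0.90,   # assumed value; 0.99 leaves almost no slack
)

params = SamplingParams(max_tokens=1024)
outputs = llm.generate(["Explain KV-cache memory usage in one paragraph."], params)
print(outputs[0].outputs[0].text)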

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)

# Roughly probe the largest context length that fits in VRAM by running
# single forward passes over increasingly long random token sequences.
# This approximates prefill memory (the forward pass builds the KV cache
# for the full input), so treat the result as a rough upper bound.
def test_max_tokens(model, start_len=1000, step=500):
    tokens = start_len
    while True:
        try:
            # Random token IDs stand in for a real prompt of this length.
            x = torch.randint(0, tokenizer.vocab_size, (1, tokens), device=device)
            with torch.no_grad():
                model(x)
            print(f"{tokens} tokens OK")
            tokens += step
        except torch.cuda.OutOfMemoryError:
            print(f"OOM at {tokens} tokens")
            torch.cuda.empty_cache()  # release the failed allocation before returning
            break

test_max_tokens(model)
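
Optionally, after the probe you can check how close the peak allocation got to the card's limit using PyTorch's allocator statistics (a small follow-up sketch; device index 0 is assumed):

peak_gb = torch.cuda.max_memory_allocated(device) / 1e9
total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"Peak allocated: {peak_gb:.2f} GB of {total_gb:.2f} GB")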