LLM memory usage at inference = model weights + KV cache + activations; the KV cache grows with the maximum number of tokens you allow (optimizer state only matters during training).
With your 8B model (~16–18 GB in FP16), asking for 10,500 tokens will likely push you past 19.5 GB, especially with --gpu-memory-utilization=0.99 leaving almost no headroom.
Calculation example (rough rule of thumb):
- Your model: 8B parameters → ~16 GB of FP16 weights
- Max tokens: 10,500
- KV cache ≈ 10,500 tokens × ~0.0007 GB per token ≈ 7.35 GB of additional VRAM
- Total VRAM needed ≈ 16 GB + 7.35 GB ≈ 23.35 GB > 19.5 GB, hence the OOM
So yes, 10,500 tokens is too high for a 19.5 GB GPU.
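If you want a model-specific estimate instead of the flat per-token figure, you can read the layer count, KV-head count, and head dimension straight from the Hugging Face config. A minimal sketch, assuming a Llama-style config; the kv_cache_gb helper below is just for illustration:

from transformers import AutoConfig

def kv_cache_gb(model_name, max_tokens, bytes_per_elem=2):
    # FP16/BF16 use 2 bytes per element; the cache stores one K and one V per layer per token.
    cfg = AutoConfig.from_pretrained(model_name)
    # GQA models keep fewer KV heads than attention heads; fall back to full MHA if the field is absent.
    kv_heads = getattr(cfg, "num_key_value_heads", cfg.num_attention_heads)
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    total_bytes = 2 * cfg.num_hidden_layers * kv_heads * head_dim * bytes_per_elem * max_tokens
    return total_bytes / 1e9

print(kv_cache_gb("deepseek-ai/DeepSeek-R1-Distill-Llama-8B", 10_500))

Keep in mind this covers only the cache itself; the serving engine also needs room for activations and runtime overhead on top of the weights.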
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.float16
)

# Roughly probe the largest prompt a single forward pass can handle on this GPU.
def test_max_tokens(model, start_len=1000, step=500):
    tokens = start_len
    while True:
        try:
            # Random token IDs are enough for a memory probe; the content doesn't matter.
            x = torch.randint(0, tokenizer.vocab_size, (1, tokens), device=device)
            with torch.no_grad():
                model(x)
            print(f"{tokens} tokens OK")
            tokens += step
        except torch.cuda.OutOfMemoryError:
            print(f"OOM at {tokens} tokens")
            torch.cuda.empty_cache()  # release the failed allocation
            break

test_max_tokens(model)
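If you are serving through vLLM (the --gpu-memory-utilization flag suggests you are), the more direct fix is to cap the context length when you create the engine and leave a little headroom instead of 0.99. A hedged sketch using vLLM's Python API; max_model_len=8192 is just an example value, pick whatever the probe above shows fits:

from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    dtype="float16",
    max_model_len=8192,           # example cap below 10,500; tune to what actually fits
    gpu_memory_utilization=0.90,  # leave headroom instead of 0.99
)

The same knobs exist on the CLI as --max-model-len and --gpu-memory-utilization.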