LLM GPU memory usage ≈ model weights + KV cache (optimizer state if training) + activations for the maximum number of tokens.

With your 8B model (~16–18 GB of weights in FP16), asking for 10,500 tokens will likely exceed 19.5 GB, especially with `--gpu-memory-utilization=0.99`.

Calculation example:
- Your model: 8B parameters (~16 GB of FP16 weights)
- Max tokens: 10,500
- KV-cache VRAM ≈ 10,500 tokens × 0.0007 GB/token ≈ 7.35 GB additional VRAM
- Total VRAM needed ≈ 16 GB + 7.35 GB ≈ 23.35 GB → > 19.5 GB, hence the OOM

So yes, 10,500 tokens is too high for a 19.5 GB GPU.
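
A quick way to sanity-check these numbers (a minimal back-of-the-envelope sketch; the 16 GB weight footprint and the ~0.0007 GB-per-token KV-cache figure are the rough estimates used above, not values read from the model config):

```python
# Back-of-the-envelope VRAM estimate for serving an 8B model in FP16.
# The per-token KV-cache size could also be derived from the model config as
#   2 (K and V) * num_layers * num_kv_heads * head_dim * bytes_per_dtype,
# but here we reuse the rough 0.0007 GB/token figure from the note above.
weights_gb = 16.0          # ~8B params * 2 bytes (FP16)
kv_gb_per_token = 0.0007   # rough per-token KV-cache estimate used above
max_tokens = 10_500
gpu_vram_gb = 19.5

total_gb = weights_gb + max_tokens * kv_gb_per_token
print(f"Estimated VRAM: {total_gb:.2f} GB vs {gpu_vram_gb} GB available")

# Largest context that would fit under the same rough estimate:
fit_tokens = int((gpu_vram_gb - weights_gb) / kv_gb_per_token)
print(f"Max tokens that fit: ~{fit_tokens}")
```

Under these assumptions, roughly 5,000 tokens is the most that fits alongside the FP16 weights.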


To verify empirically, you can probe the model with increasingly long random inputs until it runs out of memory:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.float16
)

# Roughly probe the largest input length that fits in VRAM by feeding
# increasingly long batches of random token ids until a CUDA OOM is raised.
def test_max_tokens(model, start_len=1000, step=500):
    tokens = start_len
    while True:
        try:
            x = torch.randint(0, tokenizer.vocab_size, (1, tokens), device=device)
            with torch.no_grad():
                model(x)
            print(f"{tokens} tokens OK")
            tokens += step
        except RuntimeError:
            torch.cuda.empty_cache()  # release the failed allocation
            print(f"OOM at {tokens} tokens")
            break

test_max_tokens(model)
```
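
If the model is served with vLLM (which the `--gpu-memory-utilization=0.99` flag suggests), the same limit can be enforced by capping the context length instead of raising memory utilization. A minimal sketch, assuming the `vllm` Python API and the ~5,000-token budget from the estimate above:

```python
from vllm import LLM, SamplingParams

# Cap the context length so FP16 weights + KV cache fit in ~19.5 GB of VRAM.
# max_model_len=5000 is an assumption derived from the rough estimate above,
# not a measured value; adjust after testing on the actual GPU.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    dtype="float16",
    gpu_memory_utilization=0.90,  # leave headroom instead of 0.99
    max_model_len=5000,
)

params = SamplingParams(max_tokens=512, temperature=0.7)
outputs = llm.generate(["Explain KV-cache memory usage in one paragraph."], params)
print(outputs[0].outputs[0].text)
```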