
@marcinantkiewicz
Last active July 31, 2025 21:10

Revisions

  1. marcinantkiewicz revised this gist Jul 31, 2025. 1 changed file with 5 additions and 0 deletions.
    5 changes: 5 additions & 0 deletions model sources
    @@ -0,0 +1,5 @@
    Hugging Face
    - set your Local Apps in https://huggingface.co/settings/local-apps#local-apps
    - find the model repo, click the `Use This Model` button, then pick your local app and the quantization from the dropdown.
    - the different values indicate how much quality is lost to the decreased precision of the weights, [good overview](https://github.com/ggml-org/llama.cpp/pull/1684#issuecomment-1579252501). tl;dr: if GPU-poor, start with Q4_K (see the sketch after this list).
    - at first, stick to the official sources and the `GGUF` or `safetensors` formats. PyTorch files (.pt/.pth) are serialized (pickled) Python data structures; deserializing them can execute arbitrary code, so they are only safe if the contents are 100% trustworthy.
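    A minimal sketch of what the steps above can end up looking like with ollama as the local app; the Hugging Face repo and quantization tag are examples, not picks from these notes:
    # ollama can pull GGUF repos straight from Hugging Face by prefixing the repo with hf.co/
    # the suffix after the colon selects the quantization
    $ docker exec -it ollama ollama pull hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M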
  2. marcinantkiewicz revised this gist Jul 29, 2025. 1 changed file with 8 additions and 0 deletions.
    8 changes: 8 additions & 0 deletions making it useful.md
    @@ -0,0 +1,8 @@
    - chat interface - [https://github.com/open-webui/open-webui](https://github.com/open-webui/open-webui)
    - this allows chat history to be recorded
    - and it will take API keys for commercial inference providers
    - for commercial inference I like openrouter; it is cheap to test >70B models I cannot usably run at home, roughly $0.01–$2/day (2025)
    - CLI interface - nothing beats [llm](https://github.com/taketwo/llm-ollama); it is a CLI tool in the best Unix tradition, modular and just pleasant to use
    - this will produce a description of a photo: `$ llm -m moondream:latest -a /space/phonepics/iphone8/YARU7264.JPG`
    - for remote access, set `OLLAMA_HOST=$ip` to point llm at the API; it can be any OpenAI-compatible API (hosted locally via ollama or through openrouter) - see the sketch after this list
    - there are better tools than ollama for hosting models as actual services, with tight control over parallelism, batching, and where which tensors live, but I have not played with those yet
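    A minimal sketch of the remote setup above; the IP address, model, and prompt are examples:
    # llm picks up OLLAMA_HOST and sends the prompt to the remote ollama instance
    $ OLLAMA_HOST=http://192.168.1.20:11434 llm -m moondream:latest 'describe this photo' -a ./YARU7264.JPG
    # the same server also answers on ollama's OpenAI-compatible endpoint
    $ curl http://192.168.1.20:11434/v1/chat/completions \
        -H 'Content-Type: application/json' \
        -d '{"model": "moondream:latest", "messages": [{"role": "user", "content": "hello"}]}'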
  3. marcinantkiewicz revised this gist Jul 29, 2025. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion basic ollama care and handling
    @@ -23,4 +23,4 @@ $ docker exec -it ollama ollama ps
    $ docker exec -it ollama ollama show --modelfile dengcao/ERNIE-4.5-21B-A3B-PT > ERNIE.modelfile

    # copy the file into the container and create the new entry (same model but new config)
    - $ ollama create dengcao/ERNIE-4.5-21B-A3B-PT -f /app/ollama/modelfiles/ERNIE-16
    + $ docker exec -it ollama ollama create dengcao/ERNIE-4.5-21B-A3B-PT -f /app/ollama/modelfiles/ERNIE-16
  4. marcinantkiewicz created this gist Jul 29, 2025.
    26 changes: 26 additions & 0 deletions basic ollama care and handling
    @@ -0,0 +1,26 @@
    # docker needs the NVIDIA Container Toolkit to make the GPU and nvidia drivers available inside containers.
    # - you will need nvidia drivers too. https://github.com/NVIDIA/nvidia-container-toolkit
    # - the model directory needs decent IOPS to load models; a dedicated NVMe drive is both fast and naturally limits the sprawl
    # - in GPU stats you will see both (G)raphics and (C)ompute jobs. LLM-related tooling only controls the C jobs.
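    # a quick sanity check (a sketch, the CUDA image tag is just an example) that the toolkit and drivers are wired up:
    $ docker run --rm --gpus=all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi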

    # -- once Ollama container is running
    #
    # this should produce help output
    $ docker exec -it ollama ollama

    # ollama.com hosts some of the models, so this nicely works
    # ex: https://ollama.com/dengcao/ERNIE-4.5-21B-A3B-PT
    $ docker exec -it ollama ollama pull dengcao/ERNIE-4.5-21B-A3B-PT:latest

    # shows which models are loaded into memory and the balance between layers loaded into GPU vs CPU
    # also check out nvtop
    $ docker exec -it ollama ollama ps

    # you can create custom configs for a model, e.g. set the number of layers kept in the GPU, by editing the default modelfile
    # to set the number of layers in the GPU, either run `/set parameter num_gpu 16` in the interactive interface or set it in the
    # modelfile as `PARAMETER num_gpu 16`. Note - a name like `count_layers_in_gpu` would fit better, `num_gpu` is too generic.
    # `num_gpu 0` disables gpu for the model
    $ docker exec -it ollama ollama show --modelfile dengcao/ERNIE-4.5-21B-A3B-PT > ERNIE.modelfile

    # copy the file into the container and create the new entry (same model but new config)
    $ ollama create dengcao/ERNIE-4.5-21B-A3B-PT -f /app/ollama/modelfiles/ERNIE-16
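    # a minimal sketch of such a modelfile (an assumption - a real `ollama show --modelfile` dump also carries
    # TEMPLATE and other directives; keep those and only change the PARAMETER lines), plus the copy step:
    $ printf 'FROM dengcao/ERNIE-4.5-21B-A3B-PT:latest\nPARAMETER num_gpu 16\n' > ERNIE-16
    $ docker cp ERNIE-16 ollama:/app/ollama/modelfiles/ERNIE-16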
    13 changes: 13 additions & 0 deletions docker-ollama.service
    @@ -0,0 +1,13 @@
    [Unit]
    Description=Ollama Docker Container
    Requires=docker.service
    After=docker.service

    [Service]
    Restart=always
    User=user
    ExecStart=docker run --rm --name ollama --gpus=all -v /space/ollama:/root/.ollama -p 0.0.0.0:11434:11434 -e OLLAMA_DEBUG=1 ollama/ollama
    ExecStop=/usr/bin/docker stop ollama

    [Install]
    WantedBy=multi-user.target
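    A minimal sketch of wiring the unit in, assuming the file above is saved as docker-ollama.service:
    $ sudo cp docker-ollama.service /etc/systemd/system/
    $ sudo systemctl daemon-reload
    $ sudo systemctl enable --now docker-ollama.service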