    # Setup Guide for `llama.cpp` on Nvidia Jetson Nano 2GB

This is a full account of the steps I ran to get `llama.cpp` running on the Nvidia Jetson Nano 2GB. It brings together fixes from multiple tutorials, whose contributions are referenced at the bottom of this README.

    ## Procedure

At a high level, the procedure to install `llama.cpp` on a Jetson Nano consists of four steps.

    1. Compile the `gcc 8.5` compiler from source.

    2. Compile `llama.cpp` from source using the `gcc 8.5` compiler.

    3. Download a model.

    4. Perform inference.


As steps 1 and 2 take a long time, I have uploaded the resulting [binaries for download](https://github.com/FlorSanders/Smart_Offline_LLM_Assistant/raw/jetson-nano/models/llm/llama_binaries.zip) in the repository. Simply download and unzip them, then follow steps 3 and 4 to perform inference.
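For example, assuming the archive unpacks into a flat directory (the exact layout inside the zip is an assumption here):

```bash
wget https://github.com/FlorSanders/Smart_Offline_LLM_Assistant/raw/jetson-nano/models/llm/llama_binaries.zip
unzip llama_binaries.zip -d llama_binaries
cd llama_binaries
```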

    ### GCC Compilation

1. Compile the GCC 8.5 compiler from source on the Jetson Nano.
**NOTE:** The `make -j6` command takes a long time; I recommend running it overnight in a `tmux` session (a minimal workflow is sketched below). It also requires quite a bit of disk space, so make sure at least 8GB are free on the device before starting.
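A minimal `tmux` workflow for the overnight build (the session name is arbitrary):

```bash
tmux new -s gcc-build     # start a named session and run the build inside it
# detach with Ctrl+b d; reattach later to check on progress:
tmux attach -t gcc-build
```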

    ```bash
wget https://bigsearcher.com/mirrors/gcc/releases/gcc-8.5.0/gcc-8.5.0.tar.gz
sudo tar -zvxf gcc-8.5.0.tar.gz --directory=/usr/local/
cd /usr/local/gcc-8.5.0
sudo ./contrib/download_prerequisites   # fetch GMP, MPFR, MPC, ISL
sudo mkdir build
cd build
sudo ../configure --enable-checking=release --enable-languages=c,c++
sudo make -j6                           # this is the overnight step
sudo make install
    ```

2. Once the `make install` command has completed successfully, you can free up disk space by removing the `build` directory.

    ```bash
cd /usr/local/gcc-8.5.0
sudo rm -rf build
    ```

3. Point the `CC` and `CXX` environment variables at the newly installed GCC and G++.

    ```bash
    export CC=/usr/local/bin/gcc
    export CXX=/usr/local/bin/g++
    ```
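These exports only apply to the current shell session. To make them persistent, you could append them to your shell profile (assuming `bash` is the login shell):

```bash
echo 'export CC=/usr/local/bin/gcc' >> ~/.bashrc
echo 'export CXX=/usr/local/bin/g++' >> ~/.bashrc
source ~/.bashrc
```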

4. Double-check that the installation was successful (both commands should report `8.5.0`).

    ```bash
    gcc --version
    g++ --version
    ```

    ### `llama.cpp` Compilation

    5. Start by cloning the repository and rolling back to a known working commit.

    ```bash
git clone [email protected]:ggerganov/llama.cpp.git
cd llama.cpp
git checkout a33e6a0
    ```

6. Edit the Makefile and apply the following changes
(save the diff below to `file.patch` and apply it with `git apply file.patch`).

```diff
diff --git a/Makefile b/Makefile
index 068f6ed0..a4ed3c95 100644
--- a/Makefile
+++ b/Makefile
@@ -106,11 +106,11 @@ MK_NVCCFLAGS = -std=c++11
 ifdef LLAMA_FAST
 MK_CFLAGS += -Ofast
 HOST_CXXFLAGS += -Ofast
-MK_NVCCFLAGS += -O3
+MK_NVCCFLAGS += -maxrregcount=80
 else
 MK_CFLAGS += -O3
 MK_CXXFLAGS += -O3
-MK_NVCCFLAGS += -O3
+MK_NVCCFLAGS += -maxrregcount=80
 endif
 
 ifndef LLAMA_NO_CCACHE
@@ -299,7 +299,6 @@ ifneq ($(filter aarch64%,$(UNAME_M)),)
     # Raspberry Pi 3, 4, Zero 2 (64-bit)
     # Nvidia Jetson
     MK_CFLAGS += -mcpu=native
-    MK_CXXFLAGS += -mcpu=native
     JETSON_RELEASE_INFO = $(shell jetson_release)
     ifdef JETSON_RELEASE_INFO
     ifneq ($(filter TX2%,$(JETSON_RELEASE_INFO)),)
```

- **NOTE:** If you'd rather make the changes manually, do the following (a scripted alternative is sketched after this list):

- Change `MK_NVCCFLAGS += -O3` to `MK_NVCCFLAGS += -maxrregcount=80` on lines 109 and 113.

- Remove `MK_CXXFLAGS += -mcpu=native` on line 302.
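The same edits as a `sed` sketch (the patterns assume the Makefile layout at commit `a33e6a0`; verify the result with `git diff Makefile`):

```bash
# Swap the -O3 NVCC flag for -maxrregcount=80 (both occurrences)
sed -i 's/\(MK_NVCCFLAGS *+= *\)-O3/\1-maxrregcount=80/' Makefile
# Drop the C++ -mcpu=native flag in the aarch64 section
sed -i '/MK_CXXFLAGS *+= *-mcpu=native/d' Makefile
git diff Makefile
```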


7. Build the `llama.cpp` source code.

    ```bash
    make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6
    ```
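If the build succeeds, the `main` and `server` binaries end up in the repository root; a quick sanity check:

```bash
ls -lh main server   # both binaries should exist
./main -h            # print the help text to confirm the binary runs
```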

    ### Download a model

8. Download a model to the device.

    ```bash
    wget https://huggingface.co/second-state/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf
    ```

- **NOTE**: Due to the limited memory of the Nvidia Jetson Nano 2GB, I have only been able to successfully run the [second-state/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/second-state/TinyLlama-1.1B-Chat-v1.0-GGUF) model on the device.
Attempts to get [second-state/Gemma-2b-it-GGUF](https://huggingface.co/second-state/Gemma-2b-it-GGUF) working did not succeed.
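Before loading a model, it can help to check how much memory is actually free; on a Jetson, `free` and `tegrastats` give a quick picture:

```bash
free -h      # overall RAM and swap usage
tegrastats   # live Jetson-specific memory/GPU stats (Ctrl+C to stop)
```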

    ### Perform inference

9. Test the main inference script.

    ```bash
    ./main -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128 --keep 48
    ```
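As a rough guide to the flags (semantics per the `llama.cpp` help text; `-p` is added here to supply a concrete prompt):

```bash
# -m        : path to the GGUF model file
# -ngl 33   : layers to offload to the GPU (above the model's layer count = all)
# -c 2048   : context window size in tokens
# -b 512    : batch size for prompt processing
# -n 128    : number of tokens to generate
# --keep 48 : prompt tokens to retain when the context window fills up
./main -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128 \
  --keep 48 -p "Building a website can be done in 10 simple steps:"
```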

10. Run the live server.

    ```bash
    ./server -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128
    ```
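By default the server listens on `127.0.0.1:8080`. To reach it from another machine on the network, bind it explicitly with the server's `--host` and `--port` flags:

```bash
./server -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33 -c 2048 -b 512 -n 128 \
  --host 0.0.0.0 --port 8080
```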

11. Test the web server functionality using `curl`.

    ```bash
curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 128}'
    ```
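The `/completion` endpoint also supports token streaming; with `"stream": true` the server returns the response incrementally as server-sent events:

```bash
curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 128, "stream": true}'
```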

    You can now run a large language model on this tiny and cheap edge device.
    Have fun!

    ## References

    - [@otaGran TX2 Guide](https://github.com/ggerganov/llama.cpp/issues/4123#issuecomment-1965272660)
    - [@FantasyGmm Guide](https://github.com/ggerganov/llama.cpp/issues/4123#issuecomment-1878026179)
    - [@rvandernoot Guide](https://github.com/ggerganov/llama.cpp/issues/4099#issuecomment-1887338898)
    - [`llama.cpp` server documentation](https://github.com/ggerganov/llama.cpp/tree/master/examples/server)