### How to get [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) running on Windows or Linux with LLaMA 30B in 4-bit mode via [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa) on an RTX 3090, start to finish

This guide works for Linux as well; just skip the PowerShell-specific steps.

1. Download prerequisites

    - Download and install [miniconda](https://docs.conda.io/en/latest/miniconda.html)
    - (Windows only) Download and install the [Visual Studio 2019 Build Tools](https://learn.microsoft.com/en-us/visualstudio/releases/2019/history#release-dates-and-build-numbers). Click the latest **BuildTools** link and select **Desktop development with C++** during installation.

2. (Windows only) Open the Conda PowerShell.

    - Alternatively, open a regular PowerShell and activate the Conda environment:

      ```powershell
      pwsh -ExecutionPolicy ByPass -NoExit -Command "& ~\miniconda3\shell\condabin\conda-hook.ps1 ; conda activate ~\miniconda3"
      ```

    - Sometimes the GPTQ compilation fails if `cl` is not on the path. You can use the `x64 Native Tools Command Prompt for VS 2019` shell instead, or load both the conda and VS Build Tools shells like this:

      ```powershell
      cmd /k '"C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Auxiliary\Build\vcvars64.bat" && pwsh -ExecutionPolicy ByPass -NoExit -Command "& ~\miniconda3\shell\condabin\conda-hook.ps1 ; conda activate ~\miniconda3"'
      ```

3. Create a conda environment. You'll need the CUDA compiler and a torch build that matches its version in order to build the GPTQ extensions, which enable 4-bit pre-quantized models. Create the env and install Python, CUDA, and a matching torch build; ninja (for fast compilation) is installed with pip in the build step below.

    ```powershell
    conda create -n tgwui
    conda activate tgwui
    conda install python=3.10
    ```

    Installing PyTorch and CUDA is the hardest part of machine learning. I've come up with this install line from the following sources:

    - https://pytorch.org/get-started/locally/#start-locally
    - https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#installing-previous-cuda-releases

    ```bash
    conda install cuda pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia/label/cuda-11.7.0
    python -c 'import torch; print(torch.cuda.is_available())'
    ```

4. Download text-generation-webui and GPTQ-for-LLaMa

    ```powershell
    git clone https://github.com/oobabooga/text-generation-webui.git
    cd text-generation-webui
    mkdir repositories
    cd repositories
    git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git
    cd GPTQ-for-LLaMa
    ```

5. Build and install the GPTQ package and CUDA kernel (you should be in the GPTQ-for-LLaMa directory)

    ```
    pip install ninja
    python setup_cuda.py install
    ```

6. Install the text-generation-webui dependencies

    ```
    cd ../..
    pip install -r requirements.txt
    ```

7. Download the 13B model from Hugging Face

    ```
    python download-model.py decapoda-research/llama-13b-hf
    ```

    This will take some time. After it's done, rename the downloaded folder to `llama-13b`.

    The pre-quantized llama-13b is available [here](https://huggingface.co/decapoda-research/llama-13b-hf-int4/tree/main). Download the `llama-13b-4bit.pt` file and place it in the `models` directory, alongside the `llama-13b` folder.
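    The rename and `.pt` placement can be done from a shell. A minimal sketch, assuming `download-model.py` put the weights in `models/llama-13b-hf` and that `llama-13b-4bit.pt` landed in your downloads folder; adjust both paths to what you actually have:

    ```bash
    # Run from the text-generation-webui directory.
    mv models/llama-13b-hf models/llama-13b      # rename the downloaded folder
    mv ~/Downloads/llama-13b-4bit.pt models/     # put the 4-bit checkpoint next to it
    ```

    On Windows, the PowerShell equivalents are `Rename-Item` and `Move-Item`.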
8. Run the text-generation-webui with llama-13b to test it out

    ```
    python server.py --cai-chat --load-in-4bit --model llama-13b --no-stream
    ```

9. Download the HF version of the 30B model from Hugging Face

    ```
    python download-model.py decapoda-research/llama-30b-hf
    ```

    You can download pre-quantized 4-bit versions of the model [here](https://huggingface.co/maderix/llama-65b-4bit/tree/main). Alternatively, quantize it yourself with GPTQ-for-LLaMa (this will take a while). From the text-generation-webui directory:

    ```
    cd repositories/GPTQ-for-LLaMa
    pip install datasets
    HUGGING_FACE_HUB_TOKEN={your huggingface token} CUDA_VISIBLE_DEVICES=0 python llama.py ../../models/llama-30b-hf c4 --wbits 4 --save llama-30b-4bit.pt
    ```

    Place the `llama-30b-4bit.pt` file in the `models` directory, alongside the `llama-30b` folder.

10. Run the text-generation-webui with llama-30b

    ```
    python server.py --cai-chat --load-in-4bit --model llama-30b --no-stream
    ```
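As a quick sanity check that the quantization (or the pre-quantized download) produced a usable checkpoint, you can load the `.pt` file with torch and inspect a few entries. A minimal sketch, assuming the file ended up at `models/llama-30b-4bit.pt` relative to the text-generation-webui directory; the exact tensor names depend on the GPTQ-for-LLaMa version, and this loads the whole checkpoint (roughly 17 GB) into RAM:

```bash
# Load the 4-bit checkpoint on the CPU and print a few tensor names, shapes, and dtypes.
# Expect packed integer weight tensors alongside fp16 norms/embeddings.
python -c "import torch; sd = torch.load('models/llama-30b-4bit.pt', map_location='cpu'); print(len(sd), 'tensors'); [print(k, tuple(v.shape), v.dtype) for k, v in list(sd.items())[:8]]"
```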