### How to get [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) running on Windows or Linux with LLaMA 30B in 4-bit mode via [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa) on an RTX 3090, start to finish

This guide works for Linux as well; just skip the PowerShell-specific steps.

1. Download prerequisites

    - Download and install [miniconda](https://docs.conda.io/en/latest/miniconda.html)
    - (Windows only) Download and install the [Visual Studio 2019 Build Tools](https://learn.microsoft.com/en-us/visualstudio/releases/2019/history#release-dates-and-build-numbers). Click the latest **BuildTools** link and select **Desktop development with C++** during installation.

2. (Windows only) Open the Conda PowerShell.

    - Alternatively, open a regular PowerShell and activate the Conda environment:

      ```powershell
      pwsh -ExecutionPolicy ByPass -NoExit -Command "& ~\miniconda3\shell\condabin\conda-hook.ps1 ; conda activate ~\miniconda3"
      ```

    - Sometimes the GPTQ compilation fails if `cl` is not on the path. You can use the `x64 Native Tools Command Prompt for VS 2019` shell instead, or load both the conda and VS Build Tools shells like this:

      ```powershell
      cmd /k '"C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Auxiliary\Build\vcvars64.bat" && pwsh -ExecutionPolicy ByPass -NoExit -Command "& ~\miniconda3\shell\condabin\conda-hook.ps1 ; conda activate ~\miniconda3"'
      ```

3. Create a conda environment. You'll need the CUDA compiler and a torch build that matches its version in order to build the GPTQ extensions, which enable 4-bit pre-quantized models. Create the env and install Python, CUDA, and a matching torch build; ninja (for fast compilation) is installed with pip in the build step below.

    ```powershell
    conda create -n tgwui
    conda activate tgwui
    conda install python=3.10
    ```

    Installing PyTorch and CUDA is the hardest part of machine learning. I've come up with this install line from the following sources:

    - https://pytorch.org/get-started/locally/#start-locally
    - https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#installing-previous-cuda-releases

    ```bash
    conda install cuda pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia/label/cuda-11.7.0
    python -c 'import torch; print(torch.cuda.is_available())'
    ```

4. Download text-generation-webui and GPTQ-for-LLaMa

    ```powershell
    git clone https://github.com/oobabooga/text-generation-webui.git
    cd text-generation-webui
    mkdir repositories
    cd repositories
    git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git
    cd GPTQ-for-LLaMa
    ```

5. Build and install the GPTQ package and CUDA kernel (you should be in the GPTQ-for-LLaMa directory)

    ```
    pip install ninja
    python setup_cuda.py install
    ```

6. Install the text-generation-webui dependencies

    ```
    cd ../..
    pip install -r requirements.txt
    ```

7. Download the 13B model from Hugging Face

    ```
    python download-model.py decapoda-research/llama-13b-hf
    ```

    This will take some time. After it's done, rename the downloaded folder to `llama-13b`.

    The pre-quantized llama-13b is available [here](https://huggingface.co/decapoda-research/llama-13b-hf-int4/tree/main). Download the `llama-13b-4bit.pt` file and place it in the `models` directory, alongside the `llama-13b` folder.
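    The rename and `.pt` placement can be done from a shell. A minimal sketch, assuming `download-model.py` put the weights in `models/llama-13b-hf` and that `llama-13b-4bit.pt` landed in your downloads folder; adjust both paths to what you actually have:

    ```bash
    # Run from the text-generation-webui directory.
    mv models/llama-13b-hf models/llama-13b      # rename the downloaded folder
    mv ~/Downloads/llama-13b-4bit.pt models/     # put the 4-bit checkpoint next to it
    ```

    On Windows, the PowerShell equivalents are `Rename-Item` and `Move-Item`.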
8. Run the text-generation-webui with llama-13b to test it out

    ```
    python server.py --cai-chat --load-in-4bit --model llama-13b --no-stream
    ```

9. Download the HF version of the 30B model from Hugging Face

    ```
    python download-model.py decapoda-research/llama-30b-hf
    ```

    You can download pre-quantized 4-bit versions of the model [here](https://huggingface.co/maderix/llama-65b-4bit/tree/main). Alternatively, quantize it yourself with GPTQ-for-LLaMa (this will take a while). From the text-generation-webui directory:

    ```
    cd repositories/GPTQ-for-LLaMa
    pip install datasets
    HUGGING_FACE_HUB_TOKEN={your huggingface token} CUDA_VISIBLE_DEVICES=0 python llama.py ../../models/llama-30b-hf c4 --wbits 4 --save llama-30b-4bit.pt
    ```

    Place the `llama-30b-4bit.pt` file in the `models` directory, alongside the `llama-30b` folder.

10. Run the text-generation-webui with llama-30b

    ```
    python server.py --cai-chat --load-in-4bit --model llama-30b --no-stream
    ```
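As a quick sanity check that the quantization (or the pre-quantized download) produced a usable checkpoint, you can load the `.pt` file with torch and inspect a few entries. A minimal sketch, assuming the file ended up at `models/llama-30b-4bit.pt` relative to the text-generation-webui directory; the exact tensor names depend on the GPTQ-for-LLaMa version, and this loads the whole checkpoint (roughly 17 GB) into RAM:

```bash
# Load the 4-bit checkpoint on the CPU and print a few tensor names, shapes, and dtypes.
# Expect packed integer weight tensors alongside fp16 norms/embeddings.
python -c "import torch; sd = torch.load('models/llama-30b-4bit.pt', map_location='cpu'); print(len(sd), 'tensors'); [print(k, tuple(v.shape), v.dtype) for k, v in list(sd.items())[:8]]"
```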