How to get oobabooga/text-generation-webui running on Windows with LLaMa-30b in 4-bit mode via GPTQ-for-LLaMa on an RTX 3090, start to finish.
This guide also works well on Linux; just skip the PowerShell environment steps.
- Get Miniconda and VS 2019 Build Tools.
  - Download and install Miniconda.
  - Download and install Visual Studio 2019 Build Tools.
    - Click on the latest BuildTools link and select "Desktop development with C++" when installing.
- Open the Conda PowerShell.
  - Alternatively, open the regular PowerShell and activate the Conda environment:

    pwsh -ExecutionPolicy ByPass -NoExit -Command "& ~\miniconda3\shell\condabin\conda-hook.ps1 ; conda activate ~\miniconda3"

  - The GPTQ compilation sometimes fails if 'cl' is not on the PATH. You can try using the x64 Native Tools Command Prompt for VS 2019 shell instead, or load both the Conda and VS Build Tools environments like this:

    cmd /k '"C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Auxiliary\Build\vcvars64.bat" && pwsh -ExecutionPolicy ByPass -NoExit -Command "& ~\miniconda3\shell\condabin\conda-hook.ps1 ; conda activate ~\miniconda3"'
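Before starting the build, it can help to confirm that the shell you opened actually exposes the compiler. The following is a minimal sketch, not part of the original setup; `find_compiler` is a hypothetical helper name, and it only uses Python's standard `shutil.which`.

```python
import shutil


def find_compiler(name="cl"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_compiler()
    if path:
        print(f"cl found at {path}")
    else:
        # Assumption: the MSVC environment was not loaded in this shell.
        print("cl not found on PATH; open the x64 Native Tools prompt "
              "or run vcvars64.bat first")
```

If this prints the "not found" message, the GPTQ build step is likely to fail for the reason described above.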
 
- You'll need the CUDA compiler and a torch build that matches its version in order to build the GPTQ extensions, which allow loading 4-bit prequantized models. Create a conda env and install Python, CUDA, and a torch build that matches the CUDA version, as well as ninja for fast compilation:

    conda create -n tgwui
    conda activate tgwui
    conda install python=3.10
    conda install cuda -c nvidia/label/cuda-11.7.0
    pip install torch --extra-index-url https://download.pytorch.org/whl/cu117
    pip install ninja
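Since the build depends on the CUDA compiler and torch agreeing on a version, a quick check inside the new env can save a failed compile. This is a rough sketch under a few assumptions: `parse_nvcc_release` and `versions_match` are hypothetical helper names, and the parsing relies on the `release X.Y` text that `nvcc --version` normally prints.

```python
import re
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract 'major.minor' from `nvcc --version` text, or None if absent."""
    m = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{m.group(1)}.{m.group(2)}" if m else None


def versions_match(torch_cuda, nvcc_output):
    """True only if torch reports a CUDA version equal to nvcc's release."""
    return torch_cuda is not None and torch_cuda == parse_nvcc_release(nvcc_output)


if __name__ == "__main__":
    try:
        import torch
        torch_cuda = torch.version.cuda  # e.g. "11.7"; None for CPU-only builds
    except ImportError:
        torch_cuda = None
    try:
        out = subprocess.run(["nvcc", "--version"],
                             capture_output=True, text=True).stdout
    except FileNotFoundError:
        out = ""  # nvcc is not on PATH in this shell
    print("torch CUDA:", torch_cuda, "| nvcc:", parse_nvcc_release(out))
    print("match" if versions_match(torch_cuda, out) else "MISMATCH or missing tool")
```

With the commands above, both sides would be expected to report 11.7.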
- Download text-generation-webui and GPTQ-for-LLaMa:

    git clone https://github.com/oobabooga/text-generation-webui.git
    cd text-generation-webui
    mkdir repositories
    cd repositories
    git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git
    cd GPTQ-for-LLaMa
- Build and install the gptq package and CUDA kernel (you should be in the GPTQ-for-LLaMa directory):

    python setup_cuda.py install

- Install the text-generation-webui dependencies:

    cd ../..
    pip install -r requirements.txt
- Download the 13b model from Hugging Face:

    python download-model.py decapoda-research/llama-13b-hf

  This will take some time. After it's done, rename the folder to llama-13b. The prequantized llama-13b is available here. Download the llama-13b-4bit.pt file and place it in the models directory, alongside the llama-13b folder.
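The folder-plus-file layout described above is easy to get slightly wrong, so here is a small sketch that checks it before you launch the server. `check_model_layout` is a hypothetical helper, and it assumes you run it from the text-generation-webui directory so that `models` resolves correctly.

```python
from pathlib import Path


def check_model_layout(models_dir, name):
    """Return a list of problems; an empty list means the layout looks right."""
    models = Path(models_dir)
    folder = models / name                  # e.g. models/llama-13b
    weights = models / f"{name}-4bit.pt"    # e.g. models/llama-13b-4bit.pt
    problems = []
    if not folder.is_dir():
        problems.append(f"missing folder: {folder}")
    if not weights.is_file():
        problems.append(f"missing file: {weights}")
    return problems


if __name__ == "__main__":
    issues = check_model_layout("models", "llama-13b")
    print("layout OK" if not issues else "\n".join(issues))
```

The same check should apply to the 30b step by passing "llama-30b" as the name.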
- Run the text-generation-webui with llama-13b to test it out:

    python server.py --cai-chat --load-in-4bit --model llama-13b --no-stream
- Download the 30b model from Hugging Face:

    python download-model.py decapoda-research/llama-30b-hf

  You'll need to quantize it yourself using GPTQ-for-LLaMa (this will take a while). The VAR=value prefixes below are POSIX shell syntax; in PowerShell, set them beforehand with $env:NAME = "value" instead:

    cd repositories/GPTQ-for-LLaMa
    pip install datasets
    HUGGING_FACE_HUB_TOKEN={your huggingface token} CUDA_VISIBLE_DEVICES=0 python llama.py ../../models/llama-30b-hf c4 --wbits 4 --save llama-30b-4bit.pt

  Place llama-30b-4bit.pt in the models directory, alongside the llama-30b folder (the downloaded llama-30b-hf folder, renamed as in the 13b step).
- Run the text-generation-webui with llama-30b:

    python server.py --cai-chat --load-in-4bit --model llama-30b --no-stream
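As a back-of-the-envelope illustration of why 4-bit mode matters on a 24 GB card like the RTX 3090, the sketch below estimates the size of the weights alone. It assumes a nominal 30e9 parameters (the "30b" in the name) and ignores quantization metadata, activations, and the attention cache, so real usage will be higher. `weight_size_gib` is a hypothetical helper.

```python
def weight_size_gib(n_params, bits):
    """Approximate size of the raw weights in GiB at the given bit width."""
    return n_params * bits / 8 / 2**30


if __name__ == "__main__":
    n = 30e9  # nominal parameter count; an assumption, not a measured value
    print(f"16-bit: {weight_size_gib(n, 16):.2f} GiB")  # 55.88 GiB
    print(f" 4-bit: {weight_size_gib(n, 4):.2f} GiB")   # 13.97 GiB
```

Under these assumptions the 16-bit weights would not fit on a single 3090, while the 4-bit weights leave some headroom for inference.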
Thanks. However, there is no setup_cuda.py:
(base) PS D:\AI\oobabooga_windows\text-generation-webui\repositories\GPTQ-for-LLaMa> ls
Mode LastWriteTime Length Name
d---- 2023-05-12 14:44 quant
d---- 2023-05-12 14:44 utils
-a--- 2023-05-12 14:44 14 .gitignore
-a--- 2023-05-12 14:44 52 .style.yapf
-a--- 2023-05-12 14:44 1146 convert_llama_weights_to_hf.py
-a--- 2023-05-12 14:44 7695 gptq.py
-a--- 2023-05-12 14:44 11558 LICENSE.txt
-a--- 2023-05-12 14:44 13161 llama_inference_offload.py
-a--- 2023-05-12 14:44 4414 llama_inference.py
-a--- 2023-05-12 14:44 20094 llama.py
-a--- 2023-05-12 14:44 16397 neox.py
-a--- 2023-05-12 14:44 17943 opt.py
-a--- 2023-05-12 14:44 9006 README.md
-a--- 2023-05-12 14:44 181 requirements.txt