How to get oobabooga/text-generation-webui running on Windows or Linux with LLaMa-30b 4bit mode via GPTQ-for-LLaMa on an RTX 3090 start to finish.

This guide also works on Linux; just skip the PowerShell environment steps.

Get Miniconda and VS 2019 Build Tools.
- Download and install miniconda
- Download and install Visual Studio 2019 Build Tools
  - Click on the latest BuildTools link, then select "Desktop development with C++" when installing.

Open the Conda Powershell.

Alternatively, open the regular PowerShell and activate the Conda environment:

pwsh -ExecutionPolicy ByPass -NoExit -Command "& ~\miniconda3\shell\condabin\conda-hook.ps1 ; conda activate ~\miniconda3"

The GPTQ compilation sometimes fails if 'cl' is not in the PATH. If that happens, try the x64 Native Tools Command Prompt for VS 2019 shell instead, or load both the Conda and VS Build Tools environments like this:

cmd /k '"C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Auxiliary\Build\vcvars64.bat" && pwsh -ExecutionPolicy ByPass -NoExit -Command "& ~\miniconda3\shell\condabin\conda-hook.ps1 ; conda activate ~\miniconda3"'
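Before starting the build, it can help to confirm that the compiler is actually visible from the shell you opened. The following is a minimal sketch using only the Python standard library; the helper name `find_build_tools` is made up for illustration, and on Linux you would check for `gcc` or `g++` rather than `cl`.

```python
import shutil


def find_build_tools(names=("cl",)):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # 'cl' is the MSVC compiler; pass other names (e.g. "nvcc", "ninja")
    # once the later steps of the guide have installed them.
    for tool, path in find_build_tools().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If `cl` is reported as not found, the current shell was most likely started without `vcvars64.bat`.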

You'll need the CUDA compiler and a torch build that matches its version in order to build the GPTQ extensions, which allow loading 4-bit prequantized models. Create a conda env and install Python, CUDA, and a torch build that matches the CUDA version, as well as ninja for fast compilation:
```
conda create -n tgwui
conda activate tgwui
conda install python=3.10
conda install cuda -c nvidia/label/cuda-11.7.0
pip install torch --extra-index-url https://download.pytorch.org/whl/cu117
pip install ninja
```
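
Since the build depends on torch and the CUDA compiler agreeing on a version, a quick check after the installs can save a failed compile. This is a rough sketch, not part of either project: the helper names are hypothetical, it assumes `nvcc --version` prints a line containing `release <major>.<minor>`, and it relies on `torch.version.cuda` being a string such as `11.7` (or `None` for CPU-only builds).

```python
import re
import shutil
import subprocess


def nvcc_release(nvcc_output):
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def versions_match(torch_cuda, nvcc_output):
    """True when torch's CUDA version equals the nvcc release (major.minor)."""
    if not torch_cuda:
        return False  # CPU-only torch build, or torch not installed
    major, minor = torch_cuda.split(".")[:2]
    return nvcc_release(nvcc_output) == (int(major), int(minor))


if __name__ == "__main__":
    try:
        import torch
        torch_cuda = torch.version.cuda
    except ImportError:
        torch_cuda = None
    nvcc = shutil.which("nvcc")
    nvcc_out = ""
    if nvcc:
        nvcc_out = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True
        ).stdout
    print("torch CUDA:", torch_cuda)
    print("nvcc release:", nvcc_release(nvcc_out))
    print("match:", versions_match(torch_cuda, nvcc_out))
```

Run it inside the activated `tgwui` environment; a mismatch usually means pip pulled a torch wheel built for a different CUDA version.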

Download text-generation-webui and GPTQ-for-LLaMa

git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git
cd GPTQ-for-LLaMa

Build and install gptq package and CUDA kernel (you should be in the GPTQ-for-LLaMa directory)
```
python setup_cuda.py install
```
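
To confirm that the build step installed something, you can look for the compiled module in the active environment. The sketch below assumes the extension is installed under the module name `quant_cuda`; that name is an assumption about what `setup_cuda.py` builds, so adjust it if your checkout uses a different one. The helper `extension_installed` is hypothetical.

```python
import importlib.util


def extension_installed(module_name="quant_cuda"):
    """Return True if a top-level module with this name can be located.

    find_spec only locates the module; it does not load it, so this works
    even before torch has been imported.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    name = "quant_cuda"  # assumed extension name, see the note above
    state = "found" if extension_installed(name) else "NOT found"
    print(f"{name}: {state}")
```
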
Install the text-generation-webui dependencies
```
cd ../..
pip install -r requirements.txt
```
Download the 13b model from huggingface
```
python download-model.py decapoda-research/llama-13b-hf
```
This will take some time. After it's done, rename the folder to llama-13b

The prequantized llama-13b model is available here. Download the llama-13b-4bit.pt file and place it in the models directory, alongside the llama-13b folder.
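
The layout described above (a `llama-13b` folder plus a `llama-13b-4bit.pt` file, both directly inside `models`) is easy to get wrong, so here is a small sketch that checks it. The function name `missing_model_files` is made up for illustration, and it assumes you run it from the text-generation-webui directory.

```python
from pathlib import Path


def missing_model_files(models_dir="models", name="llama-13b"):
    """Return a list of problems with the expected model layout (empty if OK)."""
    models = Path(models_dir)
    problems = []
    if not (models / name).is_dir():
        problems.append(f"missing folder: {models / name}")
    if not (models / f"{name}-4bit.pt").is_file():
        problems.append(f"missing file: {models / (name + '-4bit.pt')}")
    return problems


if __name__ == "__main__":
    issues = missing_model_files()
    print("layout OK" if not issues else "\n".join(issues))
```

The same check can be reused for the 30b model by passing `name="llama-30b"`.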

Run the text-generation-webui with llama-13b to test it out

python server.py --cai-chat --load-in-4bit --model llama-13b --no-stream

Download the 30b model from huggingface

python download-model.py decapoda-research/llama-30b-hf

You'll need to quantize it yourself using GPTQ-for-LLaMa (this will take a while):

cd repositories/GPTQ-for-LLaMa
pip install datasets
HUGGING_FACE_HUB_TOKEN={your huggingface token} CUDA_VISIBLE_DEVICES=0 python llama.py ../../models/llama-30b-hf c4 --wbits 4 --save llama-30b-4bit.pt
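
For intuition about what `--wbits 4` means, here is a toy sketch of plain round-to-nearest 4-bit quantization on a list of floats. This is not the GPTQ algorithm: GPTQ uses calibration data (the `c4` argument above) to compensate for quantization error, which this sketch does not attempt. The function names are hypothetical and nothing here comes from the GPTQ-for-LLaMa code.

```python
def quantize_4bit(weights):
    """Map floats to integer codes 0..15 using a min/max (asymmetric) grid."""
    lo, hi = min(weights), max(weights)
    # 4 bits give 16 levels, so 15 steps span [lo, hi]; fall back to 1.0
    # when all weights are equal to avoid dividing by zero.
    scale = (hi - lo) / 15 or 1.0
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_4bit(codes, scale, lo):
    """Reconstruct approximate floats from the 4-bit codes."""
    return [lo + scale * c for c in codes]


if __name__ == "__main__":
    w = [-1.0, -0.2, 0.0, 0.4, 1.0]
    codes, scale, lo = quantize_4bit(w)
    print("codes:", codes)
    print("restored:", dequantize_4bit(codes, scale, lo))
```

Each weight is stored as one of 16 codes plus a shared scale and offset, which is where the roughly fourfold size reduction relative to 16-bit weights comes from.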

Move llama-30b-4bit.pt into the models directory, alongside the llama-30b folder (rename llama-30b-hf to llama-30b, as with the 13b model).

Run the text-generation-webui with llama-30b

python server.py --cai-chat --load-in-4bit --model llama-30b --no-stream
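
A back-of-the-envelope estimate shows why 4-bit mode matters for this setup. The sketch below counts weight storage only; it ignores activations, the attention cache, and the per-group scales that quantized checkpoints also store, so real usage is higher. The parameter counts are the ones reported in the LLaMA paper (the "30b" model has about 32.5 billion parameters), and the 24 GB figure for the RTX 3090 is background knowledge rather than something stated in this guide.

```python
GIB = 2**30

# Parameter counts in billions, as reported in the LLaMA paper.
LLAMA_PARAMS_B = {"llama-13b": 13.0, "llama-30b": 32.5}


def weight_gib(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GiB."""
    return params_billions * 1e9 * bits_per_weight / 8 / GIB


if __name__ == "__main__":
    for name, params in LLAMA_PARAMS_B.items():
        for bits in (16, 4):
            print(f"{name} @ {bits}-bit: ~{weight_gib(params, bits):.1f} GiB")
```

At 16 bits the 30b weights alone are far beyond a 24 GB card, while at 4 bits they leave several GiB of headroom. The same arithmetic applies to system RAM if the model ends up loading on the CPU.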

May I know how much memory is needed on Linux? I always fail at the last step: the process gets killed, possibly because the memory limit was exceeded.

(textgen) [root@orlop1 text-generation-webui]# python server.py --chat --load-in-4bit --model llama-13b --listen-host http://hostname --listen-port 7860
/root/miniconda3/envs/textgen/lib/python3.11/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
/root/miniconda3/envs/textgen/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
2023-08-02 21:30:43 INFO:Loading llama-13b...
2023-08-02 21:30:43 WARNING:torch.cuda.is_available() returned False. This means that no GPU has been detected. Falling back to CPU mode.
Loading checkpoint shards: 59%|█████████████████████████████████████████████████████████████████████████▏ | 24/41 [02:47<02:14, 7.90s/it]Killed

Regards
