@fearnworks
Last active June 2, 2023 17:54
    Notebook: workspace/llm-playground/notebooks/axolotl/runpod/axolotl-falcon-7b-qlora-gsm8k.ipynb

    Steps to reproduce:
    1) Copy the config from #4 run-16: 40*2 + xformer into examples/falcon/qlora.yml (the full config is reproduced at the bottom of this report).
    2) Run cells 1 & 2 of the notebook.
    3) Run !accelerate launch scripts/finetune.py examples/falcon/qlora.yml
    4) Kaboom: the launch fails with the ValueError shown in the stacktrace below.

    Runpod config
    ![image](https://user-images.githubusercontent.com/120260158/242956630-cac84d95-7b6c-4a21-b1b5-853836492100.png)
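
    The screenshot above is the only record of the pod's GPU layout. For anyone triaging this in text form, the same information can be captured with a few lines of plain PyTorch (nothing axolotl-specific; this is just a hedged convenience snippet, not part of the repro):

    ```python
    import torch

    # Print the pod's CUDA layout so the device-placement error below can be
    # matched against the actual hardware.
    print("CUDA available:", torch.cuda.is_available())
    print("Device count  :", torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"cuda:{i} -> {props.name}, {props.total_memory / 1024**3:.1f} GiB")
    print("Current device:", torch.cuda.current_device())
    ```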


    Stacktrace:
    ```
    Loading checkpoint shards: 100%|██████████████████| 2/2 [00:17<00:00, 8.71s/it]
    Downloading (…)neration_config.json: 100%|█████| 111/111 [00:00<00:00, 18.8kB/s]
    INFO:root:converting PEFT model w/ prepare_model_for_int8_training
    /root/miniconda3/envs/py3.9/lib/python3.9/site-packages/peft/utils/other.py:76: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
    warnings.warn(
    INFO:root:found linear modules: ['query_key_value', 'dense', 'dense_4h_to_h', 'dense_h_to_4h']
    trainable params: 130547712 || all params: 3739292544 || trainable%: 3.4912409356543783
    INFO:root:Compiling torch model
    INFO:root:Pre-saving adapter config to ./qlora-out
    INFO:root:Starting trainer...
    Traceback (most recent call last):
    File "/workspace/axolotl/scripts/finetune.py", line 294, in <module>
    fire.Fire(train)
    File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
    File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
    File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
    File "/workspace/axolotl/scripts/finetune.py", line 281, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
    File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 1661, in train
    return inner_training_loop(
    File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 1767, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
    File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/accelerator.py", line 1192, in prepare
    result = tuple(
    File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/accelerator.py", line 1193, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
    File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/accelerator.py", line 1042, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
    File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/accelerator.py", line 1260, in prepare_model
    raise ValueError(
    ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}
    Traceback (most recent call last):
    File "/root/miniconda3/envs/py3.9/bin/accelerate", line 8, in <module>
    sys.exit(main())
    File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
    File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/commands/launch.py", line 934, in launch_command
    simple_launcher(args)
    File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/commands/launch.py", line 594, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
    subprocess.CalledProcessError: Command '['/root/miniconda3/envs/py3.9/bin/python3', 'scripts/finetune.py', 'examples/falcon/qlora.yml']' returned non-zero exit status 1.
    ```
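
    The ValueError is raised inside accelerate's Accelerator.prepare_model: the quantized base model ends up mapped to a device other than the one the trainer is running on, and the message asks for the whole model to be pinned to the current CUDA device via device_map. The sketch below is not axolotl's loading code, just a minimal illustration of the loading pattern the error message points at; the model name comes from the config below, while the nf4 quant type and bfloat16 compute dtype are assumptions (typical QLoRA settings, mirroring bf16: true):

    ```python
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Minimal sketch only -- not axolotl's loader. It shows the loading call the
    # ValueError points at: pin the 4-bit model to the device this process trains on.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # matches load_in_4bit: true in qlora.yml
        bnb_4bit_quant_type="nf4",              # assumption: standard QLoRA quant type
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumption, mirroring bf16: true
    )

    model = AutoModelForCausalLM.from_pretrained(
        "tiiuae/falcon-7b",
        trust_remote_code=True,                 # falcon-7b ships custom modeling code
        quantization_config=bnb_config,
        # The key line: map every module ("") onto the current training device
        # instead of letting an automatic device map spread the model around.
        device_map={"": torch.cuda.current_device()},
    )
    ```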

    Using this config:
    ```
    base_model: tiiuae/falcon-7b
    base_model_config: tiiuae/falcon-7b
    trust_remote_code: true
    model_type: AutoModelForCausalLM
    tokenizer_type: AutoTokenizer
    load_in_8bit: false
    load_in_4bit: true
    gptq: false
    strict: false
    push_dataset_to_hub:
    datasets:
      - path: QingyiSi/Alpaca-CoT
        data_files:
          - Chain-of-Thought/formatted_cot_data/gsm8k_train.json
        type: "alpaca:chat"
    dataset_prepared_path: last_run_prepared
    val_set_size: 0.01
    adapter: qlora
    lora_model_dir:
    sequence_len: 2048
    max_packed_sequence_len:
    lora_r: 64
    lora_alpha: 16
    lora_dropout: 0.05
    lora_target_modules:
    lora_target_linear: true
    lora_fan_in_fan_out:
    wandb_project: falcon-qlora
    wandb_watch:
    wandb_run_id:
    wandb_log_model:
    output_dir: ./qlora-out
    micro_batch_size: 40
    gradient_accumulation_steps: 2
    num_epochs: 3
    optimizer: paged_adamw_32bit
    torchdistx_path:
    lr_scheduler: cosine
    learning_rate: 0.0002
    train_on_inputs: false
    group_by_length: false
    bf16: true
    fp16: false
    tf32: true
    gradient_checkpointing: true
    # stop training after this many evaluation losses have increased in a row
    # https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
    early_stopping_patience: 3
    resume_from_checkpoint:
    auto_resume_from_checkpoints: true
    local_rank:
    logging_steps: 1
    xformers_attention: true
    flash_attention:
    gptq_groupsize:
    gptq_model_v1:
    warmup_steps: 10
    eval_steps: 5
    save_steps: 10
    debug:
    deepspeed:
    weight_decay: 0.000001
    fsdp:
    fsdp_config:
    special_tokens:
      pad_token: "<|endoftext|>"
      bos_token: ">>ABSTRACT<<"
      eos_token: "<|endoftext|>"
    ```
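
    For reference, the lora_* fields above correspond to a peft LoraConfig roughly like the sketch below. This is only an illustration of the adapter hyperparameters, not axolotl's internal code; target_modules is taken from the "found linear modules" line in the log (discovered because lora_target_linear: true is set), and bias/task_type are assumptions about the defaults:

    ```python
    from peft import LoraConfig

    # Hedged illustration of how the lora_* settings above map onto peft.
    lora_config = LoraConfig(
        r=64,               # lora_r
        lora_alpha=16,      # lora_alpha
        lora_dropout=0.05,  # lora_dropout
        bias="none",        # assumption
        task_type="CAUSAL_LM",
        # From the log: "found linear modules: ['query_key_value', 'dense',
        # 'dense_4h_to_h', 'dense_h_to_4h']"
        target_modules=["query_key_value", "dense", "dense_4h_to_h", "dense_h_to_4h"],
    )
    ```

    The trainable-parameter line in the log is consistent with these settings: 130,547,712 of 3,739,292,544 parameters is roughly 3.49%, matching the reported trainable%.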