| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average | 
|---|---|---|---|---|---|
| neuronovo-7B-v0.2 | 44.95 | 76.49 | 71.57 | 47.48 | 60.12 |

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 25.98 | ± | 2.76 | 
| | | acc_norm | 25.59 | ± | 2.74 |
| agieval_logiqa_en | 0 | acc | 37.48 | ± | 1.90 | 
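The Average column in the summary table above is the plain mean of the four suite scores. A quick sanity check for the neuronovo-7B-v0.2 row:

```python
# Recompute the Average column from the four suite scores.
scores = {"AGIEval": 44.95, "GPT4All": 76.49, "TruthfulQA": 71.57, "Bigbench": 47.48}
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 60.12, matching the table
```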
```python
# 1. Install sglang: https://github.com/sgl-project/sglang?tab=readme-ov-file#install
# 2. Set up MT-bench by following the steps here:
#    https://github.com/sgl-project/sglang/tree/main/benchmark/mtbench
#    (The benchmark code is here:
#    https://github.com/sgl-project/sglang/blob/main/benchmark/mtbench/bench_sglang.py)
# 3. The only thing missing for a representative run is setting the sampling
#    temperature per question category; also change this line in that file
#    if you need to:
#    "num_gpus": 1,
```
```javascript
// ==UserScript==
// @name         Keyboard Shortcut Scripts
// @namespace    http://tampermonkey.net/
// @version      0.1
// @description  Run scripts with keyboard shortcuts
// @match        *://*/*
// @grant        none
// ==/UserScript==

// Parameters
```
```json
[
  {
    "dataset": "helpful_base",
    "instruction": "What are the names of some famous actors that started their careers on Broadway?",
    "output": "1. Meryl Streep\n2. Angela Lansbury\n3. Audra McDonald\n4. Bernadette Peters\n5. Idina Menzel\n6. Patti LuPone\n7. Hugh Jackman\n8. James Earl Jones\n9. Liza Minnelli\n10. Nathan Lane",
    "generator": "dolphin-2.2.1-mistral-7b"
  },
  {
    "dataset": "helpful_base",
    "instruction": "How did US states get their names?",
```
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average | 
|---|---|---|---|---|---|
| distilabeled-Marcoro14-7B-slerp | 45.38 | 76.48 | 65.68 | 48.18 | 58.93 |

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 27.56 | ± | 2.81 | 
| | | acc_norm | 25.98 | ± | 2.76 |
| agieval_logiqa_en | 0 | acc | 39.17 | ± | 1.91 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| openchat-3.5-1210 | 42.62 | 72.84 | 53.21 | 43.88 | 53.14 |

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 22.44 | ± | 2.62 | 
| | | acc_norm | 24.41 | ± | 2.70 |
| agieval_logiqa_en | 0 | acc | 41.17 | ± | 1.93 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| MistralTrix-v1 | 44.98 | 76.62 | 71.44 | 47.17 | 60.05 |

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 25.59 | ± | 2.74 | 
| | | acc_norm | 24.80 | ± | 2.72 |
| agieval_logiqa_en | 0 | acc | 37.48 | ± | 1.90 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.2 | 38.50 | 71.64 | 66.82 | 42.29 | 54.81 |

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 23.62 | ± | 2.67 | 
| | | acc_norm | 22.05 | ± | 2.61 |
| agieval_logiqa_en | 0 | acc | 36.10 | ± | 1.88 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| dolphin-2.2.1-mistral-7b | 38.64 | 72.24 | 54.09 | 39.22 | 51.05 |

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 23.23 | ± | 2.65 | 
| | | acc_norm | 21.26 | ± | 2.57 |
| agieval_logiqa_en | 0 | acc | 35.48 | ± | 1.88 | 
```
2024-01-09T14:51:49.894270414Z     return fn(*args, **kwargs)
2024-01-09T14:51:49.894273580Z   File "/lm-evaluation-harness/lm_eval/evaluator.py", line 69, in simple_evaluate
2024-01-09T14:51:49.894279732Z     lm = lm_eval.models.get_model(model).create_from_arg_string(
2024-01-09T14:51:49.894283779Z   File "/lm-evaluation-harness/lm_eval/base.py", line 115, in create_from_arg_string
2024-01-09T14:51:49.894316350Z     return cls(**args, **args2)
2024-01-09T14:51:49.894323294Z   File "/lm-evaluation-harness/lm_eval/models/gpt2.py", line 67, in __init__
2024-01-09T14:51:49.894355253Z     self.tokenizer = transformers.AutoTokenizer.from_pretrained(
2024-01-09T14:51:49.894361435Z   File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 787, in from_pretrained
2024-01-09T14:51:49.894470349Z     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
2024-01-09T14:51:49.894475349Z   File "/usr/local/lib/python3.10/dist-packages/transformer
```
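The log is cut off, but the call chain shows an older lm-evaluation-harness build constructing its model wrapper via `simple_evaluate` and then failing while resolving the tokenizer. A minimal sketch that reproduces just that final step in isolation; the model id is a placeholder:

```python
from transformers import AutoTokenizer

# The harness run above dies inside AutoTokenizer.from_pretrained, so
# loading the tokenizer directly is a quick way to surface the same error.
# The model id below is a placeholder, not the checkpoint from the log.
tokenizer = AutoTokenizer.from_pretrained("org/model-name")
print(type(tokenizer).__name__)
```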