| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| LLAMA_Harsha_8_B_ORDP_10k | 35.54 | 71.15 | 55.39 | 37.96 | 50.01 |

| Task | Version | Metric | Value |  | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 26.77 | ± | 2.78 |
|  |  | acc_norm | 27.17 | ± | 2.80 |
| agieval_logiqa_en | 0 | acc | 31.34 | ± | 1.82 |

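For reference, the Average column in each summary table is the unweighted mean of the four suite scores, rounded to two decimals; a quick check in Python against the row above:

```python
# Check the Average column: it is the unweighted mean of the four suite scores.
scores = {"AGIEval": 35.54, "GPT4All": 71.15, "TruthfulQA": 55.39, "Bigbench": 37.96}
average = round(sum(scores.values()) / len(scores), 2)
print(average)  # 50.01, matching the LLAMA_Harsha_8_B_ORDP_10k row
```
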
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| Phi-3-mini-4k-instruct | 44.44 | 71.88 | 57.77 | 41.90 | 54.00 |

| Task | Version | Metric | Value |  | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 29.13 | ± | 2.86 |
|  |  | acc_norm | 28.74 | ± | 2.85 |
| agieval_logiqa_en | 0 | acc | 42.86 | ± | 1.94 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| dolphin-2.8-mistral-7b-v02 | 38.99 | 72.22 | 51.96 | 40.41 | 50.90 |

| Task | Version | Metric | Value |  | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 21.65 | ± | 2.59 |
|  |  | acc_norm | 20.47 | ± | 2.54 |
| agieval_logiqa_en | 0 | acc | 35.79 | ± | 1.88 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| distilabeled-Marcoro14-7B-slerp | 45.38 | 76.48 | 65.68 | 48.18 | 58.93 |

| Task | Version | Metric | Value |  | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 27.56 | ± | 2.81 |
|  |  | acc_norm | 25.98 | ± | 2.76 |
| agieval_logiqa_en | 0 | acc | 39.17 | ± | 1.91 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| openchat-3.5-1210 | 42.62 | 72.84 | 53.21 | 43.88 | 53.14 |

| Task | Version | Metric | Value |  | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 22.44 | ± | 2.62 |
|  |  | acc_norm | 24.41 | ± | 2.70 |
| agieval_logiqa_en | 0 | acc | 41.17 | ± | 1.93 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| openchat_3.5 | 42.67 | 72.92 | 47.27 | 42.51 | 51.34 |

| Task | Version | Metric | Value |  | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 24.02 | ± | 2.69 |
|  |  | acc_norm | 24.80 | ± | 2.72 |
| agieval_logiqa_en | 0 | acc | 38.86 | ± | 1.91 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| zephyr-7b-beta | 37.33 | 71.83 | 55.10 | 39.70 | 50.99 |

| Task | Version | Metric | Value |  | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 21.26 | ± | 2.57 |
|  |  | acc_norm | 20.47 | ± | 2.54 |
| agieval_logiqa_en | 0 | acc | 33.33 | ± | 1.85 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| MistralTrix-v1 | 44.98 | 76.62 | 71.44 | 47.17 | 60.05 |

| Task | Version | Metric | Value |  | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 25.59 | ± | 2.74 |
|  |  | acc_norm | 24.80 | ± | 2.72 |
| agieval_logiqa_en | 0 | acc | 37.48 | ± | 1.90 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.2 | 38.50 | 71.64 | 66.82 | 42.29 | 54.81 |

| Task | Version | Metric | Value |  | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 23.62 | ± | 2.67 |
|  |  | acc_norm | 22.05 | ± | 2.61 |
| agieval_logiqa_en | 0 | acc | 36.10 | ± | 1.88 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| dolphin-2.2.1-mistral-7b | 38.64 | 72.24 | 54.09 | 39.22 | 51.05 |

| Task | Version | Metric | Value |  | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 23.23 | ± | 2.65 |
|  |  | acc_norm | 21.26 | ± | 2.57 |
| agieval_logiqa_en | 0 | acc | 35.48 | ± | 1.88 |
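
The per-task tables follow the output format of EleutherAI's lm-evaluation-harness. Below is a minimal sketch of re-running the two AGIEval tasks shown above with that harness's Python API (v0.4+); the repo id is a placeholder, and the exact harness version and task names used for these tables are our assumptions, so treat it as illustrative rather than the original evaluation command:

```python
# Minimal sketch, assuming lm-evaluation-harness v0.4+ and its registered
# AGIEval task names. MODEL_ID is a hypothetical placeholder: substitute the
# actual Hugging Face repo id of the model being evaluated.
import lm_eval

MODEL_ID = "your-org/LLAMA_Harsha_8_B_ORDP_10k"  # hypothetical repo id

results = lm_eval.simple_evaluate(
    model="hf",                           # Hugging Face transformers backend
    model_args=f"pretrained={MODEL_ID}",
    tasks=["agieval_aqua_rat", "agieval_logiqa_en"],
)

# results["results"] maps each task to its metrics (acc, acc_norm, and their
# stderrs), which, multiplied by 100, correspond to the Value and Stderr
# columns in the tables above.
for task, metrics in results["results"].items():
    print(task, metrics)
```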