mlabonne · January 10, 2024 09:56
diff --git a/Mistral-7B-Instruct-v0.2-Nous.md b/Mistral-7B-Instruct-v0.2-Nous.md
@@ -0,0 +1,79 @@
+|                                        Model                                        |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
+|-------------------------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
+|[Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)|   38.5|  71.64|     66.82|   42.29|  54.81|
+
+### AGIEval
+|             Task             |Version| Metric |Value|   |Stderr|
+|------------------------------|------:|--------|----:|---|-----:|
+|agieval_aqua_rat              |      0|acc     |23.62|±  |  2.67|
+|                              |       |acc_norm|22.05|±  |  2.61|
+|agieval_logiqa_en             |      0|acc     |36.10|±  |  1.88|
+|                              |       |acc_norm|36.56|±  |  1.89|
+|agieval_lsat_ar               |      0|acc     |21.30|±  |  2.71|
+|                              |       |acc_norm|19.13|±  |  2.60|
+|agieval_lsat_lr               |      0|acc     |38.24|±  |  2.15|
+|                              |       |acc_norm|38.04|±  |  2.15|
+|agieval_lsat_rc               |      0|acc     |52.79|±  |  3.05|
+|                              |       |acc_norm|49.81|±  |  3.05|
+|agieval_sat_en                |      0|acc     |68.93|±  |  3.23|
+|                              |       |acc_norm|67.96|±  |  3.26|
+|agieval_sat_en_without_passage|      0|acc     |43.20|±  |  3.46|
+|                              |       |acc_norm|40.78|±  |  3.43|
+|agieval_sat_math              |      0|acc     |35.91|±  |  3.24|
+|                              |       |acc_norm|33.64|±  |  3.19|
+
+Average: 38.5%
+
+### GPT4All
+|    Task     |Version| Metric |Value|   |Stderr|
+|-------------|------:|--------|----:|---|-----:|
+|arc_challenge|      0|acc     |54.61|±  |  1.45|
+|             |       |acc_norm|55.97|±  |  1.45|
+|arc_easy     |      0|acc     |81.44|±  |  0.80|
+|             |       |acc_norm|76.77|±  |  0.87|
+|boolq        |      1|acc     |85.26|±  |  0.62|
+|hellaswag    |      0|acc     |66.07|±  |  0.47|
+|             |       |acc_norm|83.66|±  |  0.37|
+|openbookqa   |      0|acc     |35.40|±  |  2.14|
+|             |       |acc_norm|45.20|±  |  2.23|
+|piqa         |      0|acc     |80.41|±  |  0.93|
+|             |       |acc_norm|80.58|±  |  0.92|
+|winogrande   |      0|acc     |74.03|±  |  1.23|
+
+Average: 71.64%
+
+### TruthfulQA
+|    Task     |Version|Metric|Value|   |Stderr|
+|-------------|------:|------|----:|---|-----:|
+|truthfulqa_mc|      1|mc1   |52.39|±  |  1.75|
+|             |       |mc2   |66.82|±  |  1.52|
+
+Average: 66.82%
+
+### Bigbench
+|                      Task                      |Version|       Metric        |Value|   |Stderr|
+|------------------------------------------------|------:|---------------------|----:|---|-----:|
+|bigbench_causal_judgement                       |      0|multiple_choice_grade|54.21|±  |  3.62|
+|bigbench_date_understanding                     |      0|multiple_choice_grade|66.12|±  |  2.47|
+|bigbench_disambiguation_qa                      |      0|multiple_choice_grade|40.70|±  |  3.06|
+|bigbench_geometric_shapes                       |      0|multiple_choice_grade|21.17|±  |  2.16|
+|                                                |       |exact_str_match      | 9.47|±  |  1.55|
+|bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|29.80|±  |  2.05|
+|bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|20.57|±  |  1.53|
+|bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|45.33|±  |  2.88|
+|bigbench_movie_recommendation                   |      0|multiple_choice_grade|34.20|±  |  2.12|
+|bigbench_navigate                               |      0|multiple_choice_grade|41.90|±  |  1.56|
+|bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|60.55|±  |  1.09|
+|bigbench_ruin_names                             |      0|multiple_choice_grade|54.46|±  |  2.36|
+|bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|35.17|±  |  1.51|
+|bigbench_snarks                                 |      0|multiple_choice_grade|69.06|±  |  3.45|
+|bigbench_sports_understanding                   |      0|multiple_choice_grade|65.62|±  |  1.51|
+|bigbench_temporal_sequences                     |      0|multiple_choice_grade|36.90|±  |  1.53|
+|bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|22.40|±  |  1.18|
+|bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|17.66|±  |  0.91|
+|bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|45.33|±  |  2.88|
+
+Average: 42.29%
+
+Average score: 54.81%
+Elapsed time: 02:37:30
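The reported averages can be cross-checked from the per-task rows. The sketch below is a reconstruction, not the benchmark tool's own code, and its selection rule is an assumption inferred from the numbers: `acc_norm` where a task reports it and `acc` otherwise for AGIEval and GPT4All, `mc2` for TruthfulQA, and `multiple_choice_grade` for Bigbench. The variable names are illustrative only.

```python
from statistics import mean

# Per-task scores copied from the tables above.
# Assumed rule (not stated in the source): acc_norm if reported, else acc.
scores = {
    "AGIEval": [22.05, 36.56, 19.13, 38.04, 49.81, 67.96, 40.78, 33.64],
    "GPT4All": [55.97, 76.77, 85.26, 83.66, 45.20, 80.58, 74.03],
    # Assumed: mc2 only.
    "TruthfulQA": [66.82],
    # Assumed: multiple_choice_grade only (exact_str_match is ignored).
    "Bigbench": [
        54.21, 66.12, 40.70, 21.17, 29.80, 20.57, 45.33, 34.20, 41.90,
        60.55, 54.46, 35.17, 69.06, 65.62, 36.90, 22.40, 17.66, 45.33,
    ],
}

# Unweighted mean per suite, rounded to two decimals as in the summary table.
suite_averages = {name: round(mean(vals), 2) for name, vals in scores.items()}

# Overall score: unweighted mean of the four suite averages.
overall = round(mean(list(suite_averages.values())), 2)

for name, avg in suite_averages.items():
    print(f"{name}: {avg}")
print(f"Average score: {overall}")
```

Under these assumptions the script reproduces the four suite averages and the overall score in the summary table, which suggests each suite is weighted equally regardless of how many tasks it contains.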