Revisions

  1. @gblazex gblazex created this gist Jan 10, 2024.
    79 changes: 79 additions & 0 deletions Mistral-7B-Instruct-v0.2-Nous.md
    @@ -0,0 +1,79 @@
    | Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
    |-------------------------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
    |[Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)| 38.5| 71.64| 66.82| 42.29| 54.81|

    ### AGIEval
    | Task |Version| Metric |Value| |Stderr|
    |------------------------------|------:|--------|----:|---|-----:|
    |agieval_aqua_rat | 0|acc |23.62|± | 2.67|
    | | |acc_norm|22.05|± | 2.61|
    |agieval_logiqa_en | 0|acc |36.10|± | 1.88|
    | | |acc_norm|36.56|± | 1.89|
    |agieval_lsat_ar | 0|acc |21.30|± | 2.71|
    | | |acc_norm|19.13|± | 2.60|
    |agieval_lsat_lr | 0|acc |38.24|± | 2.15|
    | | |acc_norm|38.04|± | 2.15|
    |agieval_lsat_rc | 0|acc |52.79|± | 3.05|
    | | |acc_norm|49.81|± | 3.05|
    |agieval_sat_en | 0|acc |68.93|± | 3.23|
    | | |acc_norm|67.96|± | 3.26|
    |agieval_sat_en_without_passage| 0|acc |43.20|± | 3.46|
    | | |acc_norm|40.78|± | 3.43|
    |agieval_sat_math | 0|acc |35.91|± | 3.24|
    | | |acc_norm|33.64|± | 3.19|

    Average: 38.5%

    ### GPT4All
    | Task |Version| Metric |Value| |Stderr|
    |-------------|------:|--------|----:|---|-----:|
    |arc_challenge| 0|acc |54.61|± | 1.45|
    | | |acc_norm|55.97|± | 1.45|
    |arc_easy | 0|acc |81.44|± | 0.80|
    | | |acc_norm|76.77|± | 0.87|
    |boolq | 1|acc |85.26|± | 0.62|
    |hellaswag | 0|acc |66.07|± | 0.47|
    | | |acc_norm|83.66|± | 0.37|
    |openbookqa | 0|acc |35.40|± | 2.14|
    | | |acc_norm|45.20|± | 2.23|
    |piqa | 0|acc |80.41|± | 0.93|
    | | |acc_norm|80.58|± | 0.92|
    |winogrande | 0|acc |74.03|± | 1.23|

    Average: 71.64%

    ### TruthfulQA
    | Task |Version|Metric|Value| |Stderr|
    |-------------|------:|------|----:|---|-----:|
    |truthfulqa_mc| 1|mc1 |52.39|± | 1.75|
    | | |mc2 |66.82|± | 1.52|

    Average: 66.82%

    ### Bigbench
    | Task |Version| Metric |Value| |Stderr|
    |------------------------------------------------|------:|---------------------|----:|---|-----:|
    |bigbench_causal_judgement | 0|multiple_choice_grade|54.21|± | 3.62|
    |bigbench_date_understanding | 0|multiple_choice_grade|66.12|± | 2.47|
    |bigbench_disambiguation_qa | 0|multiple_choice_grade|40.70|± | 3.06|
    |bigbench_geometric_shapes | 0|multiple_choice_grade|21.17|± | 2.16|
    | | |exact_str_match | 9.47|± | 1.55|
    |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|29.80|± | 2.05|
    |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|20.57|± | 1.53|
    |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|45.33|± | 2.88|
    |bigbench_movie_recommendation | 0|multiple_choice_grade|34.20|± | 2.12|
    |bigbench_navigate | 0|multiple_choice_grade|41.90|± | 1.56|
    |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|60.55|± | 1.09|
    |bigbench_ruin_names | 0|multiple_choice_grade|54.46|± | 2.36|
    |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|35.17|± | 1.51|
    |bigbench_snarks | 0|multiple_choice_grade|69.06|± | 3.45|
    |bigbench_sports_understanding | 0|multiple_choice_grade|65.62|± | 1.51|
    |bigbench_temporal_sequences | 0|multiple_choice_grade|36.90|± | 1.53|
    |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|22.40|± | 1.18|
    |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|17.66|± | 0.91|
    |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|45.33|± | 2.88|

    Average: 42.29%

    Average score: 54.81%
    Elapsed time: 02:37:30
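
The averages reported above appear consistent with a simple aggregation rule: for each task take `acc_norm` where it is reported and `acc` otherwise (`mc2` for TruthfulQA, `multiple_choice_grade` for Bigbench), average within each suite, then take the unweighted mean of the four suite averages. The gist does not state this rule; it is inferred from the numbers, and the variable names below are illustrative only. A minimal sketch that recomputes the figures from the table values:

```python
from statistics import mean

# Scores copied from the tables above.
# Assumption (not stated in the gist): acc_norm is used when present, else acc.
suites = {
    "AGIEval": [22.05, 36.56, 19.13, 38.04, 49.81, 67.96, 40.78, 33.64],
    # boolq and winogrande report only acc, so acc is used for those two.
    "GPT4All": [55.97, 76.77, 85.26, 83.66, 45.20, 80.58, 74.03],
    # Assumption: only mc2 enters the TruthfulQA score.
    "TruthfulQA": [66.82],
    # Assumption: only multiple_choice_grade is used (exact_str_match is ignored).
    "Bigbench": [
        54.21, 66.12, 40.70, 21.17, 29.80, 20.57, 45.33, 34.20, 41.90,
        60.55, 54.46, 35.17, 69.06, 65.62, 36.90, 22.40, 17.66, 45.33,
    ],
}

# Per-suite mean, then the unweighted mean of the four suite means.
suite_means = {name: mean(values) for name, values in suites.items()}
overall = mean(list(suite_means.values()))

for name, value in suite_means.items():
    print(f"{name}: {value:.2f}")
print(f"Average score: {overall:.2f}")
```

Rounded to two decimals, the recomputed suite means and the overall score match the values in the summary table, which supports (but does not prove) that this is how the report was aggregated.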