@gblazex
Created January 10, 2024 04:00

Revisions

  1. gblazex created this gist Jan 10, 2024.
    79 changes: 79 additions & 0 deletions openchat-3.5-1210-Nous.md
    | Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
    |----------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
    |[openchat-3.5-1210](https://huggingface.co/openchat/openchat-3.5-1210)| 42.62| 72.84| 53.21| 43.88| 53.14|

    ### AGIEval
    | Task |Version| Metric |Value| |Stderr|
    |------------------------------|------:|--------|----:|---|-----:|
    |agieval_aqua_rat | 0|acc |22.44|± | 2.62|
    | | |acc_norm|24.41|± | 2.70|
    |agieval_logiqa_en | 0|acc |41.17|± | 1.93|
    | | |acc_norm|43.01|± | 1.94|
    |agieval_lsat_ar | 0|acc |22.61|± | 2.76|
    | | |acc_norm|23.48|± | 2.80|
    |agieval_lsat_lr | 0|acc |52.75|± | 2.21|
    | | |acc_norm|50.39|± | 2.22|
    |agieval_lsat_rc | 0|acc |62.08|± | 2.96|
    | | |acc_norm|56.13|± | 3.03|
    |agieval_sat_en | 0|acc |76.70|± | 2.95|
    | | |acc_norm|74.27|± | 3.05|
    |agieval_sat_en_without_passage| 0|acc |37.86|± | 3.39|
    | | |acc_norm|38.83|± | 3.40|
    |agieval_sat_math | 0|acc |34.55|± | 3.21|
    | | |acc_norm|30.45|± | 3.11|

    Average: 42.62%
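
The 42.62% figure matches the unweighted mean of the eight `acc_norm` values in the table above (the mean of the `acc` values would be about 43.77). A minimal Python sketch of that check follows; the dictionary and function names are illustrative and are not part of the evaluation harness.

```python
# acc_norm values copied from the AGIEval table above.
agieval_acc_norm = {
    "agieval_aqua_rat": 24.41,
    "agieval_logiqa_en": 43.01,
    "agieval_lsat_ar": 23.48,
    "agieval_lsat_lr": 50.39,
    "agieval_lsat_rc": 56.13,
    "agieval_sat_en": 74.27,
    "agieval_sat_en_without_passage": 38.83,
    "agieval_sat_math": 30.45,
}


def mean_score(scores):
    """Unweighted mean of the task scores, rounded to two decimals."""
    return round(sum(scores.values()) / len(scores), 2)


agieval_average = mean_score(agieval_acc_norm)
print(f"AGIEval average: {agieval_average}%")
```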

    ### GPT4All
    | Task |Version| Metric |Value| |Stderr|
    |-------------|------:|--------|----:|---|-----:|
    |arc_challenge| 0|acc |59.39|± | 1.44|
    | | |acc_norm|62.46|± | 1.42|
    |arc_easy | 0|acc |83.42|± | 0.76|
    | | |acc_norm|82.49|± | 0.78|
    |boolq | 1|acc |86.79|± | 0.59|
    |hellaswag | 0|acc |64.56|± | 0.48|
    | | |acc_norm|82.76|± | 0.38|
    |openbookqa | 0|acc |29.20|± | 2.04|
    | | |acc_norm|40.80|± | 2.20|
    |piqa | 0|acc |81.39|± | 0.91|
    | | |acc_norm|82.81|± | 0.88|
    |winogrande | 0|acc |71.74|± | 1.27|

    Average: 72.84%
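
The 72.84% figure is consistent with taking `acc_norm` for each task where it is reported and falling back to `acc` otherwise (`boolq` and `winogrande` list only `acc`). The sketch below reproduces it under that assumption; the data layout and helper name are hypothetical.

```python
# (acc, acc_norm) per task, copied from the GPT4All table above.
# acc_norm is None where the table reports only acc.
gpt4all_rows = {
    "arc_challenge": (59.39, 62.46),
    "arc_easy": (83.42, 82.49),
    "boolq": (86.79, None),
    "hellaswag": (64.56, 82.76),
    "openbookqa": (29.20, 40.80),
    "piqa": (81.39, 82.81),
    "winogrande": (71.74, None),
}


def preferred_metric(acc, acc_norm):
    """Return acc_norm when available, otherwise fall back to acc."""
    return acc_norm if acc_norm is not None else acc


selected = [preferred_metric(acc, norm) for acc, norm in gpt4all_rows.values()]
gpt4all_average = round(sum(selected) / len(selected), 2)
print(f"GPT4All average: {gpt4all_average}%")
```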

    ### TruthfulQA
    | Task |Version|Metric|Value| |Stderr|
    |-------------|------:|------|----:|---|-----:|
    |truthfulqa_mc| 1|mc1 |36.35|± | 1.68|
    | | |mc2 |53.21|± | 1.55|

    Average: 53.21%

    ### Bigbench
    | Task |Version| Metric |Value| |Stderr|
    |------------------------------------------------|------:|---------------------|----:|---|-----:|
    |bigbench_causal_judgement | 0|multiple_choice_grade|59.47|± | 3.57|
    |bigbench_date_understanding | 0|multiple_choice_grade|64.77|± | 2.49|
    |bigbench_disambiguation_qa | 0|multiple_choice_grade|44.96|± | 3.10|
    |bigbench_geometric_shapes | 0|multiple_choice_grade|27.02|± | 2.35|
    | | |exact_str_match |21.45|± | 2.17|
    |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|27.40|± | 2.00|
    |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|20.86|± | 1.54|
    |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|52.67|± | 2.89|
    |bigbench_movie_recommendation | 0|multiple_choice_grade|38.20|± | 2.18|
    |bigbench_navigate | 0|multiple_choice_grade|48.10|± | 1.58|
    |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|67.75|± | 1.05|
    |bigbench_ruin_names | 0|multiple_choice_grade|46.88|± | 2.36|
    |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|19.04|± | 1.24|
    |bigbench_snarks | 0|multiple_choice_grade|77.35|± | 3.12|
    |bigbench_sports_understanding | 0|multiple_choice_grade|63.69|± | 1.53|
    |bigbench_temporal_sequences | 0|multiple_choice_grade|39.60|± | 1.55|
    |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|23.04|± | 1.19|
    |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|16.46|± | 0.89|
    |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|52.67|± | 2.89|

    Average: 43.88%
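
The Bigbench average appears to be the unweighted mean of the 18 `multiple_choice_grade` values, with the `exact_str_match` row for `bigbench_geometric_shapes` left out. Because the table shows scores already rounded to two decimals, recomputing from them lands within rounding distance of 43.88 rather than reproducing it exactly. A sketch under that assumption (names are illustrative):

```python
# multiple_choice_grade values in table order, copied from the Bigbench table above.
bigbench_mc_grades = [
    59.47, 64.77, 44.96, 27.02, 27.40, 20.86,
    52.67, 38.20, 48.10, 67.75, 46.88, 19.04,
    77.35, 63.69, 39.60, 23.04, 16.46, 52.67,
]

bigbench_mean = sum(bigbench_mc_grades) / len(bigbench_mc_grades)

# Compare against the reported value with a tolerance, since the inputs
# are themselves rounded to two decimals.
reported = 43.88
matches_reported = abs(bigbench_mean - reported) < 0.01
print(f"Recomputed mean: {bigbench_mean:.3f}, matches reported: {matches_reported}")
```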

    Average score: 53.14%
    Elapsed time: 03:27:03
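
The overall score of 53.14% is the unweighted mean of the four suite averages, where the TruthfulQA entry is the `mc2` value. A short sketch of the arithmetic, with illustrative variable names:

```python
# Suite averages as reported in the sections above.
suite_averages = {
    "AGIEval": 42.62,
    "GPT4All": 72.84,
    "TruthfulQA": 53.21,  # the mc2 score
    "Bigbench": 43.88,
}

overall_average = round(sum(suite_averages.values()) / len(suite_averages), 2)
print(f"Average score: {overall_average}%")
```

Each suite counts equally here regardless of how many tasks it contains, so the single TruthfulQA task carries the same weight as all 18 Bigbench tasks combined.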