@atzamis, forked from Artefact2/README.md. Created May 25, 2025 11:39.
Revisions

  1. @Artefact2 revised this gist Mar 15, 2024. 1 changed file with 15 additions and 39 deletions.
    54 changes: 15 additions & 39 deletions README.md
    @@ -43,42 +43,18 @@ See the wiki upstream: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix

    # ROCm benchmarks for Mistral-7B

    * TODO: add fancy graph
    * Last updated 2024-03-03.

    | model | size | params | backend | ngl | test | t/s |
    | ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
    | llama 7B IQ1_S - 1.5625 bpw | 1.50 GiB | 7.24 B | ROCm | 99 | pp 512 | 709.29 ± 1.88 |
    | llama 7B IQ1_S - 1.5625 bpw | 1.50 GiB | 7.24 B | ROCm | 99 | tg 128 | 74.85 ± 0.02 |
    | llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB | 7.24 B | ROCm | 99 | pp 512 | 704.52 ± 1.67 |
    | llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB | 7.24 B | ROCm | 99 | tg 128 | 58.44 ± 0.07 |
    | llama 7B IQ3_XS - 3.3 bpw | 2.79 GiB | 7.24 B | ROCm | 99 | pp 512 | 682.72 ± 1.98 |
    | llama 7B IQ3_XS - 3.3 bpw | 2.79 GiB | 7.24 B | ROCm | 99 | tg 128 | 45.79 ± 0.05 |
    | llama 7B IQ4_XS - 4.25 bpw | 3.64 GiB | 7.24 B | ROCm | 99 | pp 512 | 712.96 ± 0.98 |
    | llama 7B IQ4_XS - 4.25 bpw | 3.64 GiB | 7.24 B | ROCm | 99 | tg 128 | 64.17 ± 0.06 |
    | llama 7B Q4_0 | 3.83 GiB | 7.24 B | ROCm | 99 | pp 512 | 870.44 ± 0.40 |
    | llama 7B Q4_0 | 3.83 GiB | 7.24 B | ROCm | 99 | tg 128 | 63.42 ± 0.02 |
    | llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | pp 512 | 691.40 ± 0.09 |
    | llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | tg 128 | 46.52 ± 0.00 |
    | llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 99 | pp 512 | 661.98 ± 0.15 |
    | llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 99 | tg 128 | 47.57 ± 0.00 |
    | llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 99 | pp 512 | 881.95 ± 0.17 |
    | llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 99 | tg 128 | 39.74 ± 0.12 |
    | llama 7B IQ1_S - 1.5625 bpw | 1.50 GiB | 7.24 B | ROCm | 0 | pp 512 | 324.35 ± 2.72 |
    | llama 7B IQ1_S - 1.5625 bpw | 1.50 GiB | 7.24 B | ROCm | 0 | tg 128 | 15.66 ± 0.08 |
    | llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB | 7.24 B | ROCm | 0 | pp 512 | 316.10 ± 1.21 |
    | llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB | 7.24 B | ROCm | 0 | tg 128 | 15.11 ± 0.05 |
    | llama 7B IQ3_XS - 3.3 bpw | 2.79 GiB | 7.24 B | ROCm | 0 | pp 512 | 300.61 ± 1.21 |
    | llama 7B IQ3_XS - 3.3 bpw | 2.79 GiB | 7.24 B | ROCm | 0 | tg 128 | 10.49 ± 0.12 |
    | llama 7B IQ4_XS - 4.25 bpw | 3.64 GiB | 7.24 B | ROCm | 0 | pp 512 | 292.36 ± 9.67 |
    | llama 7B IQ4_XS - 4.25 bpw | 3.64 GiB | 7.24 B | ROCm | 0 | tg 128 | 11.06 ± 0.06 |
    | llama 7B Q4_0 | 3.83 GiB | 7.24 B | ROCm | 0 | pp 512 | 310.94 ± 2.01 |
    | llama 7B Q4_0 | 3.83 GiB | 7.24 B | ROCm | 0 | tg 128 | 10.44 ± 0.19 |
    | llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 0 | pp 512 | 273.83 ± 1.47 |
    | llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 0 | tg 128 | 8.54 ± 0.04 |
    | llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 0 | pp 512 | 261.16 ± 1.06 |
    | llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 0 | tg 128 | 7.34 ± 0.20 |
    | llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 0 | pp 512 | 270.70 ± 2.32 |
    | llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 0 | tg 128 | 5.74 ± 0.04 |
    | llama 7B F16 | 13.49 GiB | 7.24 B | ROCm | 0 | pp 512 | 211.12 ± 0.74 |
    | llama 7B F16 | 13.49 GiB | 7.24 B | ROCm | 0 | tg 128 | 3.06 ± 0.03 |
    * Last updated 2024-03-15 (bench #6083).

    ![image](https://gist.github.com/assets/90720/e53d9081-4a64-4ede-9531-0cfb97e0e964)

    | | **GiB** | **pp512 -ngl 99** | **tg128 -ngl 99** | **pp512 -ngl 0** | **tg128 -ngl 0** | **pp512 -ngl 0 #6083** |
    |------------|---------|-------------------|-------------------|------------------|------------------|------------------------|
    | **IQ1_S** | 1.50 | 709.29 | 74.85 | 324.35 | 15.66 | 585.61 |
    | **IQ2_XS** | 2.05 | 704.52 | 58.44 | 316.10 | 15.11 | 557.68 |
    | **IQ3_XS** | 2.79 | 682.72 | 45.79 | 300.61 | 10.49 | 527.83 |
    | **IQ4_XS** | 3.64 | 712.96 | 64.17 | 292.36 | 11.06 | 495.92 |
    | **Q4_0** | 3.83 | 870.44 | 63.42 | 310.94 | 10.44 | 554.56 |
    | **Q5_K** | 4.78 | 691.40 | 46.52 | 273.83 | 8.54 | 453.58 |
    | **Q6_K** | 5.53 | 661.98 | 47.57 | 261.16 | 7.34 | 415.22 |
    | **Q8_0** | 7.17 | 881.95 | 39.74 | 270.70 | 5.74 | 440.44 |
    | **f16** | 13.49 | | | 211.12 | 3.06 | 303.60 |
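
    These tables are llama-bench output (pp 512 and tg 128 are its default prompt-processing and text-generation tests; -ngl is the number of layers offloaded to the GPU). One way to read the tg numbers: generation streams the whole weight file once per token, so t/s times file size approximates the effective memory bandwidth. A minimal sanity check of that reading, using a few rows from the table above (the bandwidth-bound interpretation is my own, not a claim made in the gist):

    ```python
    # Rough check: tg t/s * model size ~ GiB of weights read per second.
    sizes_gib = {"IQ1_S": 1.50, "Q4_0": 3.83, "Q8_0": 7.17}      # from the table
    tg128_ngl99 = {"IQ1_S": 74.85, "Q4_0": 63.42, "Q8_0": 39.74}

    for quant, gib in sizes_gib.items():
        bw = gib * tg128_ngl99[quant]
        print(f"{quant}: ~{bw:.0f} GiB/s effective read bandwidth")
    # Q4_0 (~243) and Q8_0 (~285) sit near the same bandwidth ceiling, which is
    # why smaller files generate faster; IQ1_S lands much lower (~112), consistent
    # with I-quants paying extra dequantization compute per token.
    ```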
  2. @Artefact2 revised this gist Mar 11, 2024. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    @@ -1,6 +1,6 @@
    # Which GGUF is right for me? (Opinionated)

    Good question! I am collecting human data on how quantization affects outputs. Contact me if you want to help.
    Good question! I am collecting human data on how quantization affects outputs. See here for more information: https://github.com/ggerganov/llama.cpp/discussions/5962

    In the meantime, use the largest one that fully fits in your GPU. If you can comfortably fit Q4_K_S, try using a model with more parameters.
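
    As a rough rule of thumb for "fully fits": a GGUF's weight payload is close to parameter count × bits per weight / 8, plus headroom for the KV cache and compute buffers. A small illustrative calculator (the helper names and the 15% headroom figure are my own assumptions, not from the gist):

    ```python
    def quant_size_gib(params_b: float, bpw: float) -> float:
        """Approximate weight size in GiB: parameters * bits-per-weight / 8 bytes."""
        return params_b * 1e9 * bpw / 8 / 2**30

    def fits_in_vram(params_b: float, bpw: float, vram_gib: float) -> bool:
        # Keep ~15% of VRAM free for the KV cache and compute buffers (assumption).
        return quant_size_gib(params_b, bpw) <= 0.85 * vram_gib

    # Mistral-7B (7.24 B params) at Q4_K_S (~4.57 bpw) -> ~3.85 GiB of weights,
    # in line with the file sizes in the tables below.
    print(quant_size_gib(7.24, 4.57))   # ~3.85
    print(fits_in_vram(7.24, 4.57, 8))  # True on an 8 GiB card
    ```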

  3. @Artefact2 revised this gist Mar 5, 2024. 1 changed file with 1 addition and 10 deletions.
    11 changes: 1 addition & 10 deletions README.md
    @@ -6,16 +6,7 @@ In the meantime, use the largest that fully fits in your GPU. If you can comfort

    # llama.cpp feature matrix

    * Last updated 2024-02-27.
    * Improvements/corrections welcome!

    | | **CPU (AVX2)** | **CPU (ARM NEON)** | **Metal** | **cuBLAS** | **rocBLAS** | **SYCL** | **CLBlast** | **Vulkan** | **Kompute** |
    |:--------------------:|:--------------:|--------------------|:---------:|:----------:|:----------------:|----------|:-----------:|:----------:|:-----------:|
    | **K-quants** ||||||||| 🚫 |
    | **I-quants** | ✅ (SLOW) || ✅ (SLOW) ||| 🚫 | 🚫 | 🚫 | 🚫 |
    | **Multi-GPU** | N/A | N/A | N/A ||| 🚫 ||||
    | **K cache quants** ||||| Only q8_0 (SLOW) ||| 🚫 | 🚫 |
    | **MoE architecture** ||||||| Only -ngl 0 | 🚫 | 🚫 |
    See the wiki upstream: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix


    # KL-divergence statistics for Mistral-7B
  4. @Artefact2 revised this gist Mar 4, 2024. 1 changed file with 3 additions and 8 deletions.
    11 changes: 3 additions & 8 deletions README.md
    @@ -1,13 +1,8 @@
    # Which GGUF is right for me? (Opinionated)

    * **I am partially offloading (running on CPU+GPU): use Q4_K_S.**
    The IQ stuff is slower on CPU and generally not worth the speed penalty.
    You can go higher (Q5_K_S, Q6_K) but there are diminishing returns for a considerable size increase.
    I consider Q4_K_S to be transparent, that is, indistinguishable from f16 under a blind test.
    (Before you disagree with me based on biased and anecdotal evidence, have you tried running a proper blind test?)

    * **I am fully offloading (running on GPU): use the largest one that fits.**
    If you can comfortably fit Q4_K_S with room to spare, consider using another model with more parameters instead.
    Good question! I am collecting human data on how quantization affects outputs. Contact me if you want to help.

    In the meantime, use the largest one that fully fits in your GPU. If you can comfortably fit Q4_K_S, try using a model with more parameters.

    # llama.cpp feature matrix

  5. @Artefact2 revised this gist Mar 3, 2024. 1 changed file with 43 additions and 1 deletion.
    44 changes: 43 additions & 1 deletion README.md
    @@ -53,4 +53,46 @@
    | **Q4_K_M** | 4.83 | 0.0075 | 0.0885 | 0.0576 | 0.0060 |
    | **Q5_K_S** | 5.52 | 0.0045 | 0.0393 | 0.0454 | 0.0005 |
    | **Q5_K_M** | 5.67 | 0.0043 | 0.0368 | 0.0444 | 0.0005 |
    | **Q6_K** | 6.57 | 0.0032 | 0.0222 | 0.0394 | −0.0008 |

    # ROCm benchmarks for Mistral-7B

    * TODO: add fancy graph
    * Last updated 2024-03-03.

    | model | size | params | backend | ngl | test | t/s |
    | ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
    | llama 7B IQ1_S - 1.5625 bpw | 1.50 GiB | 7.24 B | ROCm | 99 | pp 512 | 709.29 ± 1.88 |
    | llama 7B IQ1_S - 1.5625 bpw | 1.50 GiB | 7.24 B | ROCm | 99 | tg 128 | 74.85 ± 0.02 |
    | llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB | 7.24 B | ROCm | 99 | pp 512 | 704.52 ± 1.67 |
    | llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB | 7.24 B | ROCm | 99 | tg 128 | 58.44 ± 0.07 |
    | llama 7B IQ3_XS - 3.3 bpw | 2.79 GiB | 7.24 B | ROCm | 99 | pp 512 | 682.72 ± 1.98 |
    | llama 7B IQ3_XS - 3.3 bpw | 2.79 GiB | 7.24 B | ROCm | 99 | tg 128 | 45.79 ± 0.05 |
    | llama 7B IQ4_XS - 4.25 bpw | 3.64 GiB | 7.24 B | ROCm | 99 | pp 512 | 712.96 ± 0.98 |
    | llama 7B IQ4_XS - 4.25 bpw | 3.64 GiB | 7.24 B | ROCm | 99 | tg 128 | 64.17 ± 0.06 |
    | llama 7B Q4_0 | 3.83 GiB | 7.24 B | ROCm | 99 | pp 512 | 870.44 ± 0.40 |
    | llama 7B Q4_0 | 3.83 GiB | 7.24 B | ROCm | 99 | tg 128 | 63.42 ± 0.02 |
    | llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | pp 512 | 691.40 ± 0.09 |
    | llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | tg 128 | 46.52 ± 0.00 |
    | llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 99 | pp 512 | 661.98 ± 0.15 |
    | llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 99 | tg 128 | 47.57 ± 0.00 |
    | llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 99 | pp 512 | 881.95 ± 0.17 |
    | llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 99 | tg 128 | 39.74 ± 0.12 |
    | llama 7B IQ1_S - 1.5625 bpw | 1.50 GiB | 7.24 B | ROCm | 0 | pp 512 | 324.35 ± 2.72 |
    | llama 7B IQ1_S - 1.5625 bpw | 1.50 GiB | 7.24 B | ROCm | 0 | tg 128 | 15.66 ± 0.08 |
    | llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB | 7.24 B | ROCm | 0 | pp 512 | 316.10 ± 1.21 |
    | llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB | 7.24 B | ROCm | 0 | tg 128 | 15.11 ± 0.05 |
    | llama 7B IQ3_XS - 3.3 bpw | 2.79 GiB | 7.24 B | ROCm | 0 | pp 512 | 300.61 ± 1.21 |
    | llama 7B IQ3_XS - 3.3 bpw | 2.79 GiB | 7.24 B | ROCm | 0 | tg 128 | 10.49 ± 0.12 |
    | llama 7B IQ4_XS - 4.25 bpw | 3.64 GiB | 7.24 B | ROCm | 0 | pp 512 | 292.36 ± 9.67 |
    | llama 7B IQ4_XS - 4.25 bpw | 3.64 GiB | 7.24 B | ROCm | 0 | tg 128 | 11.06 ± 0.06 |
    | llama 7B Q4_0 | 3.83 GiB | 7.24 B | ROCm | 0 | pp 512 | 310.94 ± 2.01 |
    | llama 7B Q4_0 | 3.83 GiB | 7.24 B | ROCm | 0 | tg 128 | 10.44 ± 0.19 |
    | llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 0 | pp 512 | 273.83 ± 1.47 |
    | llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 0 | tg 128 | 8.54 ± 0.04 |
    | llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 0 | pp 512 | 261.16 ± 1.06 |
    | llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 0 | tg 128 | 7.34 ± 0.20 |
    | llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 0 | pp 512 | 270.70 ± 2.32 |
    | llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 0 | tg 128 | 5.74 ± 0.04 |
    | llama 7B F16 | 13.49 GiB | 7.24 B | ROCm | 0 | pp 512 | 211.12 ± 0.74 |
    | llama 7B F16 | 13.49 GiB | 7.24 B | ROCm | 0 | tg 128 | 3.06 ± 0.03 |
  6. @Artefact2 revised this gist Feb 27, 2024. 1 changed file with 8 additions and 8 deletions.
    16 changes: 8 additions & 8 deletions README.md
    @@ -11,16 +11,16 @@

    # llama.cpp feature matrix

    * Last updated 2024-02-26.
    * Last updated 2024-02-27.
    * Improvements/corrections welcome!

    | | **CPU (AVX2)** | **CPU (ARM NEON)** | **Metal** | **rocBLAS** | **cuBLAS** | **CLBlast** | **SYCL** | **Vulkan** | **Kompute** |
    |:--------------------:|:--------------:|---------------------|:---------:|:----------------:|:----------:|:-----------:|----------|:----------:|:-----------:|
    | **K-quants** || || | | | || 🚫 |
    | **I-quants** | ✅ (SLOW) | | | | | | | 🚫 | 🚫 |
    | **Multi-GPU** | N/A | N/A | N/A | | | | 🚫 |||
    | **K cache quants** || || Only q8_0 (SLOW) | | | | | |
    | **MoE architecture** || || | | || 🚫 | |
    | | **CPU (AVX2)** | **CPU (ARM NEON)** | **Metal** | **cuBLAS** | **rocBLAS** | **SYCL** | **CLBlast** | **Vulkan** | **Kompute** |
    |:--------------------:|:--------------:|--------------------|:---------:|:----------:|:----------------:|----------|:-----------:|:----------:|:-----------:|
    | **K-quants** || ||| | | || 🚫 |
    | **I-quants** | ✅ (SLOW) | | ✅ (SLOW) || | 🚫 | 🚫 | 🚫 | 🚫 |
    | **Multi-GPU** | N/A | N/A | N/A | | | 🚫 | |||
    | **K cache quants** |||| | Only q8_0 (SLOW) || | 🚫 | 🚫 |
    | **MoE architecture** ||||| || Only -ngl 0 | 🚫 | 🚫 |


    # KL-divergence statistics for Mistral-7B
  7. @Artefact2 revised this gist Feb 27, 2024. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    @@ -25,7 +25,7 @@

    # KL-divergence statistics for Mistral-7B

    * Last updated 2024-02-26.
    * Last updated 2024-02-27 (add IQ4_XS).
    * imatrix from wiki.train, 200*512 tokens.
    * KL-divergence measured on wiki.test.

  8. @Artefact2 revised this gist Feb 27, 2024. 1 changed file with 4 additions and 3 deletions.
    7 changes: 4 additions & 3 deletions README.md
    @@ -29,7 +29,7 @@
    * imatrix from wiki.train, 200*512 tokens.
    * KL-divergence measured on wiki.test.

    ![image](https://gist.github.com/assets/90720/c2502ca9-0bb3-428a-a093-20ff24442a6e)
    ![image](https://gist.github.com/assets/90720/ac93a0df-e308-458f-8ff8-04aed10627e4)

    | | **Bits per weight** | **KL-divergence median** | **KL-divergence q99** | **Top tokens differ** | **ln(PPL(Q)/PPL(base))** |
    |-------------|---------------------|--------------------------|-----------------------|-----------------------|--------------------------|
    @@ -41,15 +41,16 @@
    | **Q2_K_S** | 2.79 | 0.0829 | 1.5111 | 0.1735 | 0.1600 |
    | **Q2_K** | 3.00 | 0.0588 | 1.0337 | 0.1492 | 0.1103 |
    | **IQ3_XXS** | 3.21 | 0.0330 | 0.5492 | 0.1137 | 0.0589 |
    | **Q3_K_XS** | 3.32 | 0.0296 | 0.4550 | 0.1071 | 0.0458 |
    | **IQ3_XS** | 3.32 | 0.0296 | 0.4550 | 0.1071 | 0.0458 |
    | **Q3_K_S** | 3.50 | 0.0304 | 0.4481 | 0.1068 | 0.0511 |
    | **IQ3_S** | 3.52 | 0.0205 | 0.3018 | 0.0895 | 0.0306 |
    | **IQ3_M** | 3.63 | 0.0186 | 0.2740 | 0.0859 | 0.0268 |
    | **Q3_K_M** | 3.89 | 0.0171 | 0.2546 | 0.0839 | 0.0258 |
    | **Q3_K_L** | 4.22 | 0.0152 | 0.2202 | 0.0797 | 0.0205 |
    | **IQ4_XS** | 4.32 | 0.0088 | 0.1082 | 0.0606 | 0.0079 |
    | **IQ4_NL** | 4.56 | 0.0085 | 0.1077 | 0.0605 | 0.0074 |
    | **Q4_K_S** | 4.57 | 0.0083 | 0.1012 | 0.0600 | 0.0081 |
    | **Q4_K_M** | 4.83 | 0.0075 | 0.0885 | 0.0576 | 0.0060 |
    | **Q5_K_S** | 5.52 | 0.0045 | 0.0393 | 0.0454 | 0.0005 |
    | **Q5_K_M** | 5.67 | 0.0043 | 0.0368 | 0.0444 | 0.0005 |
    | **Q6_K** | 6.57 | 0.0032 | 0.0222 | 0.0394 | −0.0008 |
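
    To read these columns: for each token position t, let P_t be the base (f16) model's next-token distribution, Q_t the quantized model's, and x_t the actual next token. The table reports the median and 99th percentile over positions of the per-token KL divergence, the fraction of positions where the two models' most likely tokens disagree, and the log perplexity ratio (this is my restatement of the standard definitions, not text from the gist):

    ```math
    \mathrm{KL}(P_t \parallel Q_t) = \sum_{v} P_t(v)\,\ln\frac{P_t(v)}{Q_t(v)},
    \qquad
    \ln\frac{\mathrm{PPL}(Q)}{\mathrm{PPL}(\mathrm{base})} = \frac{1}{N}\sum_{t=1}^{N}\ln\frac{P_t(x_t)}{Q_t(x_t)}
    ```

    A slightly negative log-PPL ratio, as for Q6_K, just means the quantized model happened to score the test set marginally better than f16; at that magnitude it is noise.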
  9. @Artefact2 revised this gist Feb 27, 2024. 1 changed file with 7 additions and 7 deletions.
    14 changes: 7 additions & 7 deletions README.md
    @@ -14,13 +14,13 @@
    * Last updated 2024-02-26.
    * Improvements/corrections welcome!

    | | **CPU (AVX2)** | **cuBLAS** | **rocBLAS** | **Metal** | **CLBlast** | **SYCL** | **Vulkan** | **Kompute** |
    |:--------------------:|:--------------:|:----------:|:----------------:|:---------:|:-----------:|----------|:----------:|:-----------:|
    | **K-quants** || | ||||| 🚫 |
    | **I-quants** | ✅ (SLOW) | | |||| 🚫 | 🚫 |
    | **Multi-GPU** | N/A | | | N/A || 🚫 |||
    | **K cache quants** || | Only q8_0 (SLOW) | |||||
    | **MoE architecture** || | |||| 🚫 ||
    | | **CPU (AVX2)** | **CPU (ARM NEON)** | **Metal** | **rocBLAS** | **cuBLAS** | **CLBlast** | **SYCL** | **Vulkan** | **Kompute** |
    |:--------------------:|:--------------:|---------------------|:---------:|:----------------:|:----------:|:-----------:|----------|:----------:|:-----------:|
    | **K-quants** || || | |||| 🚫 |
    | **I-quants** | ✅ (SLOW) | || | ||| 🚫 | 🚫 |
    | **Multi-GPU** | N/A | N/A | N/A | | || 🚫 |||
    | **K cache quants** || || Only q8_0 (SLOW) | |||||
    | **MoE architecture** || || | ||| 🚫 ||


    # KL-divergence statistics for Mistral-7B
  10. @Artefact2 revised this gist Feb 27, 2024. 1 changed file with 7 additions and 6 deletions.
    13 changes: 7 additions & 6 deletions README.md
    @@ -14,12 +14,13 @@
    * Last updated 2024-02-26.
    * Improvements/corrections welcome!

    | | **CPU (AVX2)** | **cuBLAS** | **rocBLAS** | **Metal** | **CLBlast** | **SYCL** | **Vulkan** | **Kompute** |
    |:--------------------:|:--------------:|:----------:|:-----------:|:---------:|:-----------:|----------|:----------:|:-----------:|
    | **K-quants** |||||||| 🚫 |
    | **I-quants** | ✅ (SLOW) |||||| 🚫 | 🚫 |
    | **Multi-GPU** | N/A ||| N/A || 🚫 |||
    | **MoE architecture** ||||||| 🚫 ||
    | | **CPU (AVX2)** | **cuBLAS** | **rocBLAS** | **Metal** | **CLBlast** | **SYCL** | **Vulkan** | **Kompute** |
    |:--------------------:|:--------------:|:----------:|:----------------:|:---------:|:-----------:|----------|:----------:|:-----------:|
    | **K-quants** |||||||| 🚫 |
    | **I-quants** | ✅ (SLOW) |||||| 🚫 | 🚫 |
    | **Multi-GPU** | N/A ||| N/A || 🚫 |||
    | **K cache quants** ||| Only q8_0 (SLOW) ||||||
    | **MoE architecture** ||||||| 🚫 ||


    # KL-divergence statistics for Mistral-7B
  11. @Artefact2 revised this gist Feb 27, 2024. 1 changed file with 6 additions and 8 deletions.
    14 changes: 6 additions & 8 deletions README.md
    @@ -14,14 +14,12 @@
    * Last updated 2024-02-26.
    * Improvements/corrections welcome!

    | | **CPU (AVX2)** | **cuBLAS** | **rocBLAS** | **Metal** | **CLBlast** | **Vulkan** | **Kompute** |
    |---------------------------------|----------------|------------|-------------|-----------|-------------|------------|-------------|
    | **Legacy quants** ||||||| ✅ (SLOW) |
    | **K-quants** ||||||| 🚫 |
    | **I-quants** | ✅ (SLOW) ||||| 🚫 | 🚫 |
    | **Multi-GPU** | N/A || 🚫 | N/A ||||
    | **Llama, Mistral** architecture ||||||||
    | **Mixtral** architecture |||||| 🚫 ||
    | | **CPU (AVX2)** | **cuBLAS** | **rocBLAS** | **Metal** | **CLBlast** | **SYCL** | **Vulkan** | **Kompute** |
    |:--------------------:|:--------------:|:----------:|:-----------:|:---------:|:-----------:|----------|:----------:|:-----------:|
    | **K-quants** |||||||| 🚫 |
    | **I-quants** | ✅ (SLOW) |||||| 🚫 | 🚫 |
    | **Multi-GPU** | N/A ||| N/A || 🚫 |||
    | **MoE architecture** ||||||| 🚫 ||


    # KL-divergence statistics for Mistral-7B
  12. @Artefact2 revised this gist Feb 26, 2024. 1 changed file with 15 additions and 0 deletions.
    15 changes: 15 additions & 0 deletions README.md
    @@ -9,6 +9,21 @@
    * **I am fully offloading (running on GPU): use the largest one that fits.**
    If you can comfortably fit Q4_K_S with room to spare, consider using another model with more parameters instead.

    # llama.cpp feature matrix

    * Last updated 2024-02-26.
    * Improvements/corrections welcome!

    | | **CPU (AVX2)** | **cuBLAS** | **rocBLAS** | **Metal** | **CLBlast** | **Vulkan** | **Kompute** |
    |---------------------------------|----------------|------------|-------------|-----------|-------------|------------|-------------|
    | **Legacy quants** ||||||| ✅ (SLOW) |
    | **K-quants** ||||||| 🚫 |
    | **I-quants** | ✅ (SLOW) ||||| 🚫 | 🚫 |
    | **Multi-GPU** | N/A || 🚫 | N/A ||||
    | **Llama, Mistral** architecture ||||||||
    | **Mixtral** architecture |||||| 🚫 ||


    # KL-divergence statistics for Mistral-7B

    * Last updated 2024-02-26.
  13. @Artefact2 revised this gist Feb 26, 2024. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion README.md
    @@ -1,4 +1,4 @@
    # Which one is right for me? (Opinionated)
    # Which GGUF is right for me? (Opinionated)

    * **I am partially offloading (running on CPU+GPU): use Q4_K_S.**
    The IQ stuff is slower on CPU and generally not worth the speed penalty.
  14. @Artefact2 created this gist Feb 26, 2024.
    41 changes: 41 additions & 0 deletions README.md
    @@ -0,0 +1,41 @@
    # Which one is right for me? (Opinionated)

    * **I am partially offloading (running on CPU+GPU): use Q4_K_S.**
    The IQ stuff is slower on CPU and generally not worth the speed penalty.
    You can go higher (Q5_K_S, Q6_K) but there are diminishing returns for a considerable size increase.
    I consider Q4_K_S to be transparent, that is, indistinguishable from f16 under a blind test.
    (Before you disagree with me based on biased and anecdotal evidence, have you tried running a proper blind test?)

    * **I am fully offloading (running on GPU): use the largest one that fits.**
    If you can comfortably fit Q4_K_S with room to spare, consider using another model with more parameters instead.

    # KL-divergence statistics for Mistral-7B

    * Last updated 2024-02-26.
    * imatrix from wiki.train, 200*512 tokens.
    * KL-divergence measured on wiki.test.

    ![image](https://gist.github.com/assets/90720/c2502ca9-0bb3-428a-a093-20ff24442a6e)

    | | **Bits per weight** | **KL-divergence median** | **KL-divergence q99** | **Top tokens differ** | **ln(PPL(Q)/PPL(base))** |
    |-------------|---------------------|--------------------------|-----------------------|-----------------------|--------------------------|
    | **IQ1_S** | 1.78 | 0.5495 | 5.5174 | 0.3840 | 0.9235 |
    | **IQ2_XXS** | 2.20 | 0.1751 | 2.4983 | 0.2313 | 0.2988 |
    | **IQ2_XS** | 2.43 | 0.1146 | 1.7693 | 0.1943 | 0.2046 |
    | **IQ2_S** | 2.55 | 0.0949 | 1.6284 | 0.1806 | 0.1722 |
    | **IQ2_M** | 2.76 | 0.0702 | 1.0935 | 0.1557 | 0.1223 |
    | **Q2_K_S** | 2.79 | 0.0829 | 1.5111 | 0.1735 | 0.1600 |
    | **Q2_K** | 3.00 | 0.0588 | 1.0337 | 0.1492 | 0.1103 |
    | **IQ3_XXS** | 3.21 | 0.0330 | 0.5492 | 0.1137 | 0.0589 |
    | **Q3_K_XS** | 3.32 | 0.0296 | 0.4550 | 0.1071 | 0.0458 |
    | **Q3_K_S** | 3.50 | 0.0304 | 0.4481 | 0.1068 | 0.0511 |
    | **IQ3_S** | 3.52 | 0.0205 | 0.3018 | 0.0895 | 0.0306 |
    | **IQ3_M** | 3.63 | 0.0186 | 0.2740 | 0.0859 | 0.0268 |
    | **Q3_K_M** | 3.89 | 0.0171 | 0.2546 | 0.0839 | 0.0258 |
    | **Q3_K_L** | 4.22 | 0.0152 | 0.2202 | 0.0797 | 0.0205 |
    | **IQ4_NL** | 4.56 | 0.0085 | 0.1077 | 0.0605 | 0.0074 |
    | **Q4_K_S** | 4.57 | 0.0083 | 0.1012 | 0.0600 | 0.0081 |
    | **Q4_K_M** | 4.83 | 0.0075 | 0.0885 | 0.0576 | 0.0060 |
    | **Q5_K_S** | 5.52 | 0.0045 | 0.0393 | 0.0454 | 0.0005 |
    | **Q5_K_M** | 5.67 | 0.0043 | 0.0368 | 0.0444 | 0.0005 |
    | **Q6_K** | 6.57 | 0.0032 | 0.0222 | 0.0394 | −0.0008 |
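
    These statistics are computable from per-token probabilities dumped for the base and quantized models (llama.cpp's perplexity tool has a KL-divergence mode for this; the array layout below is my own assumption, and the sketch only illustrates the definitions):

    ```python
    import numpy as np

    def kl_table_row(p: np.ndarray, q: np.ndarray, true_ids: np.ndarray) -> dict:
        """p, q: [n_tokens, n_vocab] next-token probabilities from the base and
        quantized models; true_ids: [n_tokens] observed next tokens."""
        eps = 1e-12
        # Per-token KL(P_t || Q_t) over the vocabulary.
        kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)
        rows = np.arange(len(true_ids))
        return {
            "kl_median": float(np.median(kl)),
            "kl_q99": float(np.quantile(kl, 0.99)),
            # Fraction of positions where the two models' top-1 tokens disagree.
            "top_tokens_differ": float(np.mean(p.argmax(axis=1) != q.argmax(axis=1))),
            # ln(PPL(Q)/PPL(base)) = mean NLL(quant) - mean NLL(base).
            "ln_ppl_ratio": float(np.mean(np.log(p[rows, true_ids] + eps)
                                          - np.log(q[rows, true_ids] + eps))),
        }
    ```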