*Forked by @atzamis from Artefact2/README.md, May 25, 2025.*
# GGUF quantizations overview

## Which GGUF is right for me? (Opinionated)

Good question! I am collecting human data on how quantization affects outputs. Contact me if you want to help.

In the meantime, use the largest quantization that fully fits in your GPU's VRAM. If you can comfortably fit Q4_K_S, try a model with more parameters instead.
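To check the fit before downloading, you can estimate a quant's file size from its bits-per-weight (bpw) figure. A minimal sketch (the estimate slightly undershoots real GGUF files, since bpw figures are averages over mixed tensor types and files also carry metadata):

```python
def estimate_gguf_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Rough GGUF file size: params x bpw / 8 bytes, converted to GiB.

    Real files come out a bit larger, so treat this as a lower bound
    when checking whether a quant fits in VRAM.
    """
    return n_params * bits_per_weight / 8 / 2**30

# Mistral-7B (7.24e9 params) at IQ4_XS's 4.25 bpw:
print(round(estimate_gguf_size_gib(7.24e9, 4.25), 2))  # 3.58 (actual file: 3.64 GiB)
```

Remember to leave headroom beyond the file size itself: the KV cache and compute buffers also live in VRAM.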

## llama.cpp feature matrix

  • Last updated 2024-02-27.
  • Improvements/corrections welcome!
|                  | CPU (AVX2) | CPU (ARM NEON) | Metal | cuBLAS | rocBLAS | SYCL | CLBlast | Vulkan | Kompute |
|------------------|------------|----------------|-------|--------|---------|------|---------|--------|---------|
| K-quants         | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 |
| I-quants         | ✅ (SLOW) | ✅ (SLOW) | ✅ | ✅ | ✅ | 🚫 | 🚫 | 🚫 | 🚫 |
| Multi-GPU        | N/A | N/A | N/A | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 |
| K cache quants   | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Only q8_0 (SLOW) | 🚫 | 🚫 |
| MoE architecture | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Only -ngl 0 | 🚫 | 🚫 |

## KL-divergence statistics for Mistral-7B

  • Last updated 2024-02-27 (add IQ4_XS).
  • imatrix from wiki.train, 200*512 tokens.
  • KL-divergence measured on wiki.test.

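For reference, the statistics tabulated below can be sketched as follows. This is a toy illustration with made-up next-token distributions, not the actual evaluation code: per position, compare the base model's and the quantized model's next-token distribution (KL divergence, argmax agreement) and the log-probability they assign to the true token (perplexity ratio).

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for one next-token distribution (base vs. quantized)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy data: distributions over a 3-word vocab at two positions,
# plus the index of the "true" next token at each position (all made up).
base = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
quant = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]]
true_tok = [0, 1]

kls = sorted(kl_divergence(p, q) for p, q in zip(base, quant))
median_kl = kls[len(kls) // 2]  # real data: use statistics.median / quantiles

# "Top tokens differ": fraction of positions where the argmax changes.
top_differ = sum(p.index(max(p)) != q.index(max(q))
                 for p, q in zip(base, quant)) / len(base)

# ln(PPL(Q)/PPL(base)) = mean log p_base(true) - mean log p_quant(true)
ln_ppl_ratio = (sum(math.log(p[t]) for p, t in zip(base, true_tok)) -
                sum(math.log(q[t]) for q, t in zip(quant, true_tok))) / len(base)
```

KL divergence is always non-negative, but ln(PPL(Q)/PPL(base)) can go slightly negative when a quant happens to score the test set marginally better than the base model, as Q6_K does below.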

|         | Bits per weight | KL-divergence median | KL-divergence q99 | Top tokens differ | ln(PPL(Q)/PPL(base)) |
|---------|-----------------|----------------------|-------------------|-------------------|----------------------|
| IQ1_S   | 1.78 | 0.5495 | 5.5174 | 0.3840 | 0.9235 |
| IQ2_XXS | 2.20 | 0.1751 | 2.4983 | 0.2313 | 0.2988 |
| IQ2_XS  | 2.43 | 0.1146 | 1.7693 | 0.1943 | 0.2046 |
| IQ2_S   | 2.55 | 0.0949 | 1.6284 | 0.1806 | 0.1722 |
| IQ2_M   | 2.76 | 0.0702 | 1.0935 | 0.1557 | 0.1223 |
| Q2_K_S  | 2.79 | 0.0829 | 1.5111 | 0.1735 | 0.1600 |
| Q2_K    | 3.00 | 0.0588 | 1.0337 | 0.1492 | 0.1103 |
| IQ3_XXS | 3.21 | 0.0330 | 0.5492 | 0.1137 | 0.0589 |
| IQ3_XS  | 3.32 | 0.0296 | 0.4550 | 0.1071 | 0.0458 |
| Q3_K_S  | 3.50 | 0.0304 | 0.4481 | 0.1068 | 0.0511 |
| IQ3_S   | 3.52 | 0.0205 | 0.3018 | 0.0895 | 0.0306 |
| IQ3_M   | 3.63 | 0.0186 | 0.2740 | 0.0859 | 0.0268 |
| Q3_K_M  | 3.89 | 0.0171 | 0.2546 | 0.0839 | 0.0258 |
| Q3_K_L  | 4.22 | 0.0152 | 0.2202 | 0.0797 | 0.0205 |
| IQ4_XS  | 4.32 | 0.0088 | 0.1082 | 0.0606 | 0.0079 |
| IQ4_NL  | 4.56 | 0.0085 | 0.1077 | 0.0605 | 0.0074 |
| Q4_K_S  | 4.57 | 0.0083 | 0.1012 | 0.0600 | 0.0081 |
| Q4_K_M  | 4.83 | 0.0075 | 0.0885 | 0.0576 | 0.0060 |
| Q5_K_S  | 5.52 | 0.0045 | 0.0393 | 0.0454 | 0.0005 |
| Q5_K_M  | 5.67 | 0.0043 | 0.0368 | 0.0444 | 0.0005 |
| Q6_K    | 6.57 | 0.0032 | 0.0222 | 0.0394 | −0.0008 |

## ROCm benchmarks for Mistral-7B

  • TODO: add fancy graph
  • Last updated 2024-03-03.
| model | size | params | backend | ngl | test | t/s |
|-------|------|--------|---------|-----|------|-----|
| llama 7B IQ1_S - 1.5625 bpw | 1.50 GiB | 7.24 B | ROCm | 99 | pp 512 | 709.29 ± 1.88 |
| llama 7B IQ1_S - 1.5625 bpw | 1.50 GiB | 7.24 B | ROCm | 99 | tg 128 | 74.85 ± 0.02 |
| llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB | 7.24 B | ROCm | 99 | pp 512 | 704.52 ± 1.67 |
| llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB | 7.24 B | ROCm | 99 | tg 128 | 58.44 ± 0.07 |
| llama 7B IQ3_XS - 3.3 bpw | 2.79 GiB | 7.24 B | ROCm | 99 | pp 512 | 682.72 ± 1.98 |
| llama 7B IQ3_XS - 3.3 bpw | 2.79 GiB | 7.24 B | ROCm | 99 | tg 128 | 45.79 ± 0.05 |
| llama 7B IQ4_XS - 4.25 bpw | 3.64 GiB | 7.24 B | ROCm | 99 | pp 512 | 712.96 ± 0.98 |
| llama 7B IQ4_XS - 4.25 bpw | 3.64 GiB | 7.24 B | ROCm | 99 | tg 128 | 64.17 ± 0.06 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | ROCm | 99 | pp 512 | 870.44 ± 0.40 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | ROCm | 99 | tg 128 | 63.42 ± 0.02 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | pp 512 | 691.40 ± 0.09 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | tg 128 | 46.52 ± 0.00 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 99 | pp 512 | 661.98 ± 0.15 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 99 | tg 128 | 47.57 ± 0.00 |
| llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 99 | pp 512 | 881.95 ± 0.17 |
| llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 99 | tg 128 | 39.74 ± 0.12 |
| llama 7B IQ1_S - 1.5625 bpw | 1.50 GiB | 7.24 B | ROCm | 0 | pp 512 | 324.35 ± 2.72 |
| llama 7B IQ1_S - 1.5625 bpw | 1.50 GiB | 7.24 B | ROCm | 0 | tg 128 | 15.66 ± 0.08 |
| llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB | 7.24 B | ROCm | 0 | pp 512 | 316.10 ± 1.21 |
| llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB | 7.24 B | ROCm | 0 | tg 128 | 15.11 ± 0.05 |
| llama 7B IQ3_XS - 3.3 bpw | 2.79 GiB | 7.24 B | ROCm | 0 | pp 512 | 300.61 ± 1.21 |
| llama 7B IQ3_XS - 3.3 bpw | 2.79 GiB | 7.24 B | ROCm | 0 | tg 128 | 10.49 ± 0.12 |
| llama 7B IQ4_XS - 4.25 bpw | 3.64 GiB | 7.24 B | ROCm | 0 | pp 512 | 292.36 ± 9.67 |
| llama 7B IQ4_XS - 4.25 bpw | 3.64 GiB | 7.24 B | ROCm | 0 | tg 128 | 11.06 ± 0.06 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | ROCm | 0 | pp 512 | 310.94 ± 2.01 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | ROCm | 0 | tg 128 | 10.44 ± 0.19 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 0 | pp 512 | 273.83 ± 1.47 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 0 | tg 128 | 8.54 ± 0.04 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 0 | pp 512 | 261.16 ± 1.06 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 0 | tg 128 | 7.34 ± 0.20 |
| llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 0 | pp 512 | 270.70 ± 2.32 |
| llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 0 | tg 128 | 5.74 ± 0.04 |
| llama 7B F16 | 13.49 GiB | 7.24 B | ROCm | 0 | pp 512 | 211.12 ± 0.74 |
| llama 7B F16 | 13.49 GiB | 7.24 B | ROCm | 0 | tg 128 | 3.06 ± 0.03 |
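In these results, `pp 512` is prompt processing (compute-bound) and `tg 128` is token generation, which is largely memory-bandwidth-bound: every generated token streams the whole model through memory, which is why smaller quants generate faster. You can use this to sanity-check a benchmark run; a rough sketch (numbers taken from the Q4_0 ngl 99 row, and ignoring KV-cache traffic and compute overhead):

```python
def implied_bandwidth_gibs(tg_tokens_per_s: float, size_gib: float) -> float:
    """Lower-bound estimate of memory bandwidth used during token generation:
    each token reads roughly the whole model file once."""
    return tg_tokens_per_s * size_gib

# Q4_0 fully offloaded (ngl 99): 63.42 t/s x 3.83 GiB
print(round(implied_bandwidth_gibs(63.42, 3.83), 1))  # 242.9 (GiB/s)
```

If the implied figure comes out well above your GPU's spec-sheet bandwidth, something is off with the measurement; far below, and generation is likely bottlenecked elsewhere (e.g. partial offload to system RAM, as in the ngl 0 rows).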