Gist by @atzamis, forked from Artefact2/README.md. Created May 25, 2025 11:39.
# GGUF quantizations overview

## Which GGUF is right for me? (Opinionated)

  • I am partially offloading (running on CPU+GPU): use Q4_K_S. The IQ quants are slower on CPU, and the small quality gain is generally not worth the speed penalty. You can go higher (Q5_K_S, Q6_K), but there are diminishing returns for a considerable size increase. I consider Q4_K_S to be transparent, that is, indistinguishable from f16 under a blind test. (Before you disagree with me based on biased and anecdotal evidence, have you tried running a proper blind test?)

  • I am fully offloading (running on GPU): use the largest one that fits. If you can comfortably fit Q4_K_S with room to spare, consider using another model with more parameters instead.
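As a rough way to check what fits, a GGUF file weighs approximately `n_params × bits_per_weight / 8` bytes. A minimal sketch, assuming a Mistral-7B-class parameter count of 7.24e9 (the bpw figures are taken from the table further down; the function name is illustrative):

```python
# Rough GGUF file-size estimate: bytes ≈ n_params * bits_per_weight / 8.
# bpw values are from the Mistral-7B table below; 7.24e9 is an assumed
# parameter count for a Mistral-7B-class model.
BPW = {"Q4_K_S": 4.57, "Q5_K_S": 5.52, "Q6_K": 6.57}

def est_size_gib(n_params: float, quant: str) -> float:
    """Estimated file size in GiB for a given quant type."""
    return n_params * BPW[quant] / 8 / 2**30

for q in BPW:
    print(f"{q}: ~{est_size_gib(7.24e9, q):.1f} GiB")
```

This gives roughly 3.9 GiB for Q4_K_S versus 5.5 GiB for Q6_K, which is the "considerable size increase" trade-off described above (KV cache and compute buffers add on top of the weights).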

## llama.cpp feature matrix

  • Last updated 2024-02-26.
  • Improvements/corrections welcome!
|                             | CPU (AVX2) | cuBLAS | rocBLAS | Metal | CLBlast | Vulkan | Kompute |
| --------------------------- | ---------- | ------ | ------- | ----- | ------- | ------ | ------- |
| Legacy quants               | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ (SLOW) |
| K-quants                    | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 |
| I-quants                    | ✅ (SLOW) | ✅ | ✅ | ✅ | 🚫 | 🚫 | 🚫 |
| Multi-GPU                   | N/A | ✅ | ✅ | N/A | ✅ | 🚫 | N/A |
| Llama, Mistral architecture | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Mixtral architecture        | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 |
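One way to encode the matrix for quick lookups; the support sets below are transcribed from the feature matrix above (a snapshot dated 2024-02-26, so later llama.cpp versions support more), and the dict and function are illustrative, not a llama.cpp API:

```python
# Quant-family support per llama.cpp backend, per the feature matrix above
# (snapshot from 2024-02-26; newer llama.cpp builds have broader support).
SUPPORT = {
    "cpu":     {"legacy", "k", "i"},  # i-quants run on CPU but are slow
    "cublas":  {"legacy", "k", "i"},
    "rocblas": {"legacy", "k", "i"},
    "metal":   {"legacy", "k", "i"},
    "clblast": {"legacy", "k"},
    "vulkan":  {"legacy", "k"},
    "kompute": {"legacy"},
}

def supported(backend: str, family: str) -> bool:
    """True if the given quant family runs on the given backend."""
    return family in SUPPORT.get(backend, set())

print(supported("kompute", "k"))   # K-quants were not supported on Kompute
print(supported("cpu", "i"))      # i-quants work on CPU, just slowly
```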

## KL-divergence statistics for Mistral-7B

  • Last updated 2024-02-26.
  • imatrix from wiki.train, 200*512 tokens.
  • KL-divergence measured on wiki.test.
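For reference, the statistic being summarized: at each token position, KL(P‖Q) = Σᵢ pᵢ ln(pᵢ/qᵢ), where P is the base model's next-token distribution and Q the quantized model's; the table then reports the median and 99th percentile over all positions. A minimal sketch with made-up toy distributions:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * ln(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 3-token vocabulary (illustrative only;
# real measurements use the full vocab at every position of wiki.test).
base  = [0.70, 0.20, 0.10]
quant = [0.65, 0.23, 0.12]
print(round(kl_divergence(base, quant), 4))  # → 0.0057
```

A KL of 0 means the quantized model reproduces the base model's distribution exactly; the table's q99 column shows how bad the worst ~1% of positions get, which the mean alone would hide.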


| Quant | Bits per weight | KL-divergence median | KL-divergence q99 | Top tokens differ | ln(PPL(Q)/PPL(base)) |
| ----- | --------------- | -------------------- | ----------------- | ----------------- | -------------------- |
| IQ1_S | 1.78 | 0.5495 | 5.5174 | 0.3840 | 0.9235 |
| IQ2_XXS | 2.20 | 0.1751 | 2.4983 | 0.2313 | 0.2988 |
| IQ2_XS | 2.43 | 0.1146 | 1.7693 | 0.1943 | 0.2046 |
| IQ2_S | 2.55 | 0.0949 | 1.6284 | 0.1806 | 0.1722 |
| IQ2_M | 2.76 | 0.0702 | 1.0935 | 0.1557 | 0.1223 |
| Q2_K_S | 2.79 | 0.0829 | 1.5111 | 0.1735 | 0.1600 |
| Q2_K | 3.00 | 0.0588 | 1.0337 | 0.1492 | 0.1103 |
| IQ3_XXS | 3.21 | 0.0330 | 0.5492 | 0.1137 | 0.0589 |
| Q3_K_XS | 3.32 | 0.0296 | 0.4550 | 0.1071 | 0.0458 |
| Q3_K_S | 3.50 | 0.0304 | 0.4481 | 0.1068 | 0.0511 |
| IQ3_S | 3.52 | 0.0205 | 0.3018 | 0.0895 | 0.0306 |
| IQ3_M | 3.63 | 0.0186 | 0.2740 | 0.0859 | 0.0268 |
| Q3_K_M | 3.89 | 0.0171 | 0.2546 | 0.0839 | 0.0258 |
| Q3_K_L | 4.22 | 0.0152 | 0.2202 | 0.0797 | 0.0205 |
| IQ4_NL | 4.56 | 0.0085 | 0.1077 | 0.0605 | 0.0074 |
| Q4_K_S | 4.57 | 0.0083 | 0.1012 | 0.0600 | 0.0081 |
| Q4_K_M | 4.83 | 0.0075 | 0.0885 | 0.0576 | 0.0060 |
| Q5_K_S | 5.52 | 0.0045 | 0.0393 | 0.0454 | 0.0005 |
| Q5_K_M | 5.67 | 0.0043 | 0.0368 | 0.0444 | 0.0005 |
| Q6_K | 6.57 | 0.0032 | 0.0222 | 0.0394 | −0.0008 |
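The last column is the natural log of the perplexity ratio, so the relative perplexity change versus the f16 baseline is exp(value) − 1. A quick check using values from the table above:

```python
import math

# ln(PPL(Q)/PPL(base)) values copied from the table above.
ln_ppl_ratio = {"IQ1_S": 0.9235, "Q4_K_S": 0.0081, "Q6_K": -0.0008}

for quant, v in ln_ppl_ratio.items():
    change = (math.exp(v) - 1) * 100  # percent perplexity change vs. f16
    print(f"{quant}: {change:+.2f}% perplexity vs. base")
```

IQ1_S comes out at roughly +150% perplexity, Q4_K_S under +1%, and Q6_K's slightly negative entry means its measured perplexity lands marginally below the f16 baseline.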