- I am partially offloading (running on CPU+GPU): use Q4_K_S. The I-quants are slower on CPU and generally not worth the speed penalty. You can go higher (Q5_K_S, Q6_K), but the returns diminish quickly for a considerable increase in size. I consider Q4_K_S to be transparent, that is, indistinguishable from f16 under a blind test. (Before you disagree with me based on biased and anecdotal evidence, have you tried running a proper blind test?)
- I am fully offloading (running on GPU): use the largest quant that fits. If you can comfortably fit Q4_K_S with room to spare, consider using another model with more parameters instead. For setting the offload split itself, see the sketch after this list.
- Last updated 2024-02-26.
- Improvements/corrections welcome!
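If you load GGUF models programmatically, the partial/full offload split comes down to one parameter. Below is a minimal sketch using the llama-cpp-python bindings; the model path and layer count are placeholders for illustration, not recommendations.

```python
# Minimal sketch of partial vs. full GPU offload with llama-cpp-python
# (pip install llama-cpp-python). The model path is hypothetical.
from llama_cpp import Llama

# Partial offload: put as many layers on the GPU as your VRAM allows,
# run the remainder on CPU. 0 means pure CPU inference.
llm = Llama(
    model_path="./models/model-Q4_K_S.gguf",
    n_gpu_layers=20,
)

# Full offload: -1 moves every layer to the GPU.
# llm = Llama(model_path="./models/model-Q4_K_S.gguf", n_gpu_layers=-1)

out = llm("Q: What is a K-quant? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

In practice you raise `n_gpu_layers` until you run out of VRAM (leaving headroom for the KV cache), which is why the advice above is phrased in terms of what fits.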
| | CPU (AVX2) | cuBLAS | rocBLAS | Metal | CLBlast | Vulkan | Kompute |
|---|---|---|---|---|---|---|---|
| Legacy quants | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ (SLOW) |
| K-quants | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 |
| I-quants | ✅ (SLOW) | ✅ | ✅ | ✅ | ❓ | 🚫 | 🚫 |
| Multi-GPU | N/A | ✅ | 🚫 | N/A | ❓ | ✅ | ❓ |
| Llama, Mistral architecture | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Mixtral architecture | ✅ | ✅ | ✅ | ✅ | ❓ | 🚫 | ❓ |
- Last updated 2024-02-26.
- imatrix computed from wiki.train, 200×512 tokens.
- KL-divergence measured on wiki.test; a sketch of how these metrics can be computed follows the table.
| Quant | Bits per weight | KL-divergence median | KL-divergence q99 | Top tokens differ | ln(PPL(Q)/PPL(base)) |
|---|---|---|---|---|---|
| IQ1_S | 1.78 | 0.5495 | 5.5174 | 0.3840 | 0.9235 |
| IQ2_XXS | 2.20 | 0.1751 | 2.4983 | 0.2313 | 0.2988 |
| IQ2_XS | 2.43 | 0.1146 | 1.7693 | 0.1943 | 0.2046 |
| IQ2_S | 2.55 | 0.0949 | 1.6284 | 0.1806 | 0.1722 |
| IQ2_M | 2.76 | 0.0702 | 1.0935 | 0.1557 | 0.1223 |
| Q2_K_S | 2.79 | 0.0829 | 1.5111 | 0.1735 | 0.1600 |
| Q2_K | 3.00 | 0.0588 | 1.0337 | 0.1492 | 0.1103 |
| IQ3_XXS | 3.21 | 0.0330 | 0.5492 | 0.1137 | 0.0589 |
| Q3_K_XS | 3.32 | 0.0296 | 0.4550 | 0.1071 | 0.0458 |
| Q3_K_S | 3.50 | 0.0304 | 0.4481 | 0.1068 | 0.0511 |
| IQ3_S | 3.52 | 0.0205 | 0.3018 | 0.0895 | 0.0306 |
| IQ3_M | 3.63 | 0.0186 | 0.2740 | 0.0859 | 0.0268 |
| Q3_K_M | 3.89 | 0.0171 | 0.2546 | 0.0839 | 0.0258 |
| Q3_K_L | 4.22 | 0.0152 | 0.2202 | 0.0797 | 0.0205 |
| IQ4_NL | 4.56 | 0.0085 | 0.1077 | 0.0605 | 0.0074 |
| Q4_K_S | 4.57 | 0.0083 | 0.1012 | 0.0600 | 0.0081 |
| Q4_K_M | 4.83 | 0.0075 | 0.0885 | 0.0576 | 0.0060 |
| Q5_K_S | 5.52 | 0.0045 | 0.0393 | 0.0454 | 0.0005 |
| Q5_K_M | 5.67 | 0.0043 | 0.0368 | 0.0444 | 0.0005 |
| Q6_K | 6.57 | 0.0032 | 0.0222 | 0.0394 | −0.0008 |
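For reference, here is a minimal sketch of how metrics like those in the table can be computed, assuming you have dumped per-position log-probabilities over the vocabulary from both the base (f16) and quantized models; the function and array names are hypothetical. (llama.cpp's perplexity tool can produce these statistics directly via its KL-divergence options.)

```python
# Hypothetical sketch: compute KL-divergence stats, top-token disagreement,
# and the log perplexity ratio from two models' log-probabilities.
import numpy as np

def quant_metrics(logp_base, logp_quant, targets):
    """
    logp_base, logp_quant: (n_positions, vocab_size) log-probabilities
                           from the base and quantized models.
    targets: (n_positions,) actual next-token ids from the test set.
    """
    p_base = np.exp(logp_base)

    # Per-position KL(base || quant) over the full vocabulary;
    # the median and 99th percentile summarize typical and tail damage.
    kl = np.sum(p_base * (logp_base - logp_quant), axis=-1)

    # Fraction of positions where the two models disagree on the top token.
    top_differs = float(np.mean(
        np.argmax(logp_base, axis=-1) != np.argmax(logp_quant, axis=-1)
    ))

    # PPL = exp(mean NLL of the true tokens), so ln(PPL(Q)/PPL(base))
    # is just the difference of the mean negative log-likelihoods.
    idx = np.arange(len(targets))
    nll_base = -logp_base[idx, targets].mean()
    nll_quant = -logp_quant[idx, targets].mean()

    return {
        "kl_median": float(np.median(kl)),
        "kl_q99": float(np.quantile(kl, 0.99)),
        "top_tokens_differ": top_differs,
        "ln_ppl_ratio": float(nll_quant - nll_base),
    }
```

Note that KL-divergence compares the full output distributions rather than just the loss on the correct token, which is why it separates the quants more sharply than the perplexity ratio in the rightmost column.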
