## Which GGUF is right for me? (Opinionated)
Good question! I am collecting human data on how quantization affects outputs. Contact me if you want to help.
In the meantime, use the largest quantization that fully fits in your GPU. If you can comfortably fit Q4_K_S, consider a model with more parameters instead.
Last updated 2024-02-27.
Improvements/corrections welcome!
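
As a rough rule of thumb, a quant's weight data takes about parameters × bits per weight ÷ 8 bytes on disk and in VRAM, plus headroom for the KV cache and compute buffers. A minimal sketch of the arithmetic (the function name and the 1 GiB headroom figure are illustrative assumptions, not measured values):

```python
def quant_size_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate GGUF weight size: parameters * bits per weight / 8 bytes."""
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

# Example: Mistral-7B (7.24 B params) at Q4_K_S (~4.57 bpw)
size = quant_size_gib(7.24, 4.57)  # ~3.85 GiB of weights
# Leave ~1 GiB (assumed) for the KV cache and buffers on an 8 GiB GPU.
print(f"{size:.2f} GiB; fits on an 8 GiB GPU: {size + 1.0 < 8.0}")
```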
|                  | CPU (AVX2) | CPU (ARM NEON) | Metal     | cuBLAS | rocBLAS          | SYCL | CLBlast       | Vulkan | Kompute |
|------------------|------------|----------------|-----------|--------|------------------|------|---------------|--------|---------|
| K-quants         | ✅         | ✅             | ✅        | ✅     | ✅               | ❓   | ✅            | ✅     | 🚫      |
| I-quants         | ✅ (SLOW)  | ✅             | ✅ (SLOW) | ✅     | ✅               | 🚫   | 🚫            | 🚫     | 🚫      |
| Multi-GPU        | N/A        | N/A            | N/A       | ✅     | ❓               | 🚫   | ❓            | ✅     | ❓      |
| K cache quants   | ✅         | ❓             | ❓        | ✅     | Only q8_0 (SLOW) | ❓   | ✅            | 🚫     | 🚫      |
| MoE architecture | ✅         | ❓             | ✅        | ✅     | ✅               | ❓   | Only `-ngl 0` | 🚫     | 🚫      |
## KL-divergence statistics for Mistral-7B

Last updated 2024-02-27 (added IQ4_XS).
imatrix computed from wiki.train, 200×512 tokens. KL-divergence measured on wiki.test.
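
These statistics compare the quantized model's next-token distribution with the full-precision model's over the same text; llama.cpp's `perplexity` example has a `--kl-divergence` mode that reports figures like these. As a reference for what each column means, a minimal numpy sketch (the inputs `p_base`, `p_quant`, `token_ids` are hypothetical arrays, not the actual evaluation script):

```python
import numpy as np

def kl_stats(p_base, p_quant, token_ids):
    """Summarize quantization damage from per-position token probabilities.

    p_base, p_quant: (n_positions, vocab) probabilities from the base and
    quantized models on the same text; assumed strictly positive (softmax).
    token_ids: (n_positions,) the actual next token at each position.
    """
    # Per-position KL divergence D(base || quant)
    kl = np.sum(p_base * (np.log(p_base) - np.log(p_quant)), axis=-1)
    idx = np.arange(len(token_ids))
    return {
        "kl_median": np.median(kl),
        "kl_q99": np.quantile(kl, 0.99),
        # Fraction of positions where the most likely token changes
        "top_tokens_differ": np.mean(p_base.argmax(-1) != p_quant.argmax(-1)),
        # ln(PPL(Q)/PPL(base)) = mean NLL of quant minus mean NLL of base
        "ln_ppl_ratio": np.mean(np.log(p_base[idx, token_ids])
                                - np.log(p_quant[idx, token_ids])),
    }
```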
| Quant   | Bits per weight | KL-divergence median | KL-divergence q99 | Top tokens differ | ln(PPL(Q)/PPL(base)) |
|---------|-----------------|----------------------|-------------------|-------------------|----------------------|
| IQ1_S   | 1.78 | 0.5495 | 5.5174 | 0.3840 | 0.9235  |
| IQ2_XXS | 2.20 | 0.1751 | 2.4983 | 0.2313 | 0.2988  |
| IQ2_XS  | 2.43 | 0.1146 | 1.7693 | 0.1943 | 0.2046  |
| IQ2_S   | 2.55 | 0.0949 | 1.6284 | 0.1806 | 0.1722  |
| IQ2_M   | 2.76 | 0.0702 | 1.0935 | 0.1557 | 0.1223  |
| Q2_K_S  | 2.79 | 0.0829 | 1.5111 | 0.1735 | 0.1600  |
| Q2_K    | 3.00 | 0.0588 | 1.0337 | 0.1492 | 0.1103  |
| IQ3_XXS | 3.21 | 0.0330 | 0.5492 | 0.1137 | 0.0589  |
| IQ3_XS  | 3.32 | 0.0296 | 0.4550 | 0.1071 | 0.0458  |
| Q3_K_S  | 3.50 | 0.0304 | 0.4481 | 0.1068 | 0.0511  |
| IQ3_S   | 3.52 | 0.0205 | 0.3018 | 0.0895 | 0.0306  |
| IQ3_M   | 3.63 | 0.0186 | 0.2740 | 0.0859 | 0.0268  |
| Q3_K_M  | 3.89 | 0.0171 | 0.2546 | 0.0839 | 0.0258  |
| Q3_K_L  | 4.22 | 0.0152 | 0.2202 | 0.0797 | 0.0205  |
| IQ4_XS  | 4.32 | 0.0088 | 0.1082 | 0.0606 | 0.0079  |
| IQ4_NL  | 4.56 | 0.0085 | 0.1077 | 0.0605 | 0.0074  |
| Q4_K_S  | 4.57 | 0.0083 | 0.1012 | 0.0600 | 0.0081  |
| Q4_K_M  | 4.83 | 0.0075 | 0.0885 | 0.0576 | 0.0060  |
| Q5_K_S  | 5.52 | 0.0045 | 0.0393 | 0.0454 | 0.0005  |
| Q5_K_M  | 5.67 | 0.0043 | 0.0368 | 0.0444 | 0.0005  |
| Q6_K    | 6.57 | 0.0032 | 0.0222 | 0.0394 | −0.0008 |
## ROCm benchmarks for Mistral-7B

TODO: add fancy graph (a plotting sketch follows the table).
Last updated 2024-03-03.
In the table, `pp 512` is prompt processing of a 512-token prompt and `tg 128` is generation of 128 tokens; `ngl` is the number of layers offloaded to the GPU (`-ngl`).
| model                        | size      | params | backend | ngl | test   | t/s           |
|------------------------------|----------:|-------:|---------|----:|--------|--------------:|
| llama 7B IQ1_S - 1.5625 bpw  | 1.50 GiB  | 7.24 B | ROCm    |  99 | pp 512 | 709.29 ± 1.88 |
| llama 7B IQ1_S - 1.5625 bpw  | 1.50 GiB  | 7.24 B | ROCm    |  99 | tg 128 | 74.85 ± 0.02  |
| llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB  | 7.24 B | ROCm    |  99 | pp 512 | 704.52 ± 1.67 |
| llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB  | 7.24 B | ROCm    |  99 | tg 128 | 58.44 ± 0.07  |
| llama 7B IQ3_XS - 3.3 bpw    | 2.79 GiB  | 7.24 B | ROCm    |  99 | pp 512 | 682.72 ± 1.98 |
| llama 7B IQ3_XS - 3.3 bpw    | 2.79 GiB  | 7.24 B | ROCm    |  99 | tg 128 | 45.79 ± 0.05  |
| llama 7B IQ4_XS - 4.25 bpw   | 3.64 GiB  | 7.24 B | ROCm    |  99 | pp 512 | 712.96 ± 0.98 |
| llama 7B IQ4_XS - 4.25 bpw   | 3.64 GiB  | 7.24 B | ROCm    |  99 | tg 128 | 64.17 ± 0.06  |
| llama 7B Q4_0                | 3.83 GiB  | 7.24 B | ROCm    |  99 | pp 512 | 870.44 ± 0.40 |
| llama 7B Q4_0                | 3.83 GiB  | 7.24 B | ROCm    |  99 | tg 128 | 63.42 ± 0.02  |
| llama 7B Q5_K - Medium       | 4.78 GiB  | 7.24 B | ROCm    |  99 | pp 512 | 691.40 ± 0.09 |
| llama 7B Q5_K - Medium       | 4.78 GiB  | 7.24 B | ROCm    |  99 | tg 128 | 46.52 ± 0.00  |
| llama 7B Q6_K                | 5.53 GiB  | 7.24 B | ROCm    |  99 | pp 512 | 661.98 ± 0.15 |
| llama 7B Q6_K                | 5.53 GiB  | 7.24 B | ROCm    |  99 | tg 128 | 47.57 ± 0.00  |
| llama 7B Q8_0                | 7.17 GiB  | 7.24 B | ROCm    |  99 | pp 512 | 881.95 ± 0.17 |
| llama 7B Q8_0                | 7.17 GiB  | 7.24 B | ROCm    |  99 | tg 128 | 39.74 ± 0.12  |
| llama 7B IQ1_S - 1.5625 bpw  | 1.50 GiB  | 7.24 B | ROCm    |   0 | pp 512 | 324.35 ± 2.72 |
| llama 7B IQ1_S - 1.5625 bpw  | 1.50 GiB  | 7.24 B | ROCm    |   0 | tg 128 | 15.66 ± 0.08  |
| llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB  | 7.24 B | ROCm    |   0 | pp 512 | 316.10 ± 1.21 |
| llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB  | 7.24 B | ROCm    |   0 | tg 128 | 15.11 ± 0.05  |
| llama 7B IQ3_XS - 3.3 bpw    | 2.79 GiB  | 7.24 B | ROCm    |   0 | pp 512 | 300.61 ± 1.21 |
| llama 7B IQ3_XS - 3.3 bpw    | 2.79 GiB  | 7.24 B | ROCm    |   0 | tg 128 | 10.49 ± 0.12  |
| llama 7B IQ4_XS - 4.25 bpw   | 3.64 GiB  | 7.24 B | ROCm    |   0 | pp 512 | 292.36 ± 9.67 |
| llama 7B IQ4_XS - 4.25 bpw   | 3.64 GiB  | 7.24 B | ROCm    |   0 | tg 128 | 11.06 ± 0.06  |
| llama 7B Q4_0                | 3.83 GiB  | 7.24 B | ROCm    |   0 | pp 512 | 310.94 ± 2.01 |
| llama 7B Q4_0                | 3.83 GiB  | 7.24 B | ROCm    |   0 | tg 128 | 10.44 ± 0.19  |
| llama 7B Q5_K - Medium       | 4.78 GiB  | 7.24 B | ROCm    |   0 | pp 512 | 273.83 ± 1.47 |
| llama 7B Q5_K - Medium       | 4.78 GiB  | 7.24 B | ROCm    |   0 | tg 128 | 8.54 ± 0.04   |
| llama 7B Q6_K                | 5.53 GiB  | 7.24 B | ROCm    |   0 | pp 512 | 261.16 ± 1.06 |
| llama 7B Q6_K                | 5.53 GiB  | 7.24 B | ROCm    |   0 | tg 128 | 7.34 ± 0.20   |
| llama 7B Q8_0                | 7.17 GiB  | 7.24 B | ROCm    |   0 | pp 512 | 270.70 ± 2.32 |
| llama 7B Q8_0                | 7.17 GiB  | 7.24 B | ROCm    |   0 | tg 128 | 5.74 ± 0.04   |
| llama 7B F16                 | 13.49 GiB | 7.24 B | ROCm    |   0 | pp 512 | 211.12 ± 0.74 |
| llama 7B F16                 | 13.49 GiB | 7.24 B | ROCm    |   0 | tg 128 | 3.06 ± 0.03   |
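
About the TODO above: a minimal matplotlib sketch for the missing graph, using the `tg 128` numbers hand-copied from the table (F16 is left out because it was only measured at `ngl 0`); the output file name is an arbitrary choice:

```python
import matplotlib.pyplot as plt

# tg 128 throughput (t/s) vs model size (GiB), copied from the table above
sizes    = [1.50, 2.05, 2.79, 3.64, 3.83, 4.78, 5.53, 7.17]
labels   = ["IQ1_S", "IQ2_XS", "IQ3_XS", "IQ4_XS", "Q4_0", "Q5_K_M", "Q6_K", "Q8_0"]
tg_ngl99 = [74.85, 58.44, 45.79, 64.17, 63.42, 46.52, 47.57, 39.74]
tg_ngl0  = [15.66, 15.11, 10.49, 11.06, 10.44, 8.54, 7.34, 5.74]

fig, ax = plt.subplots()
ax.plot(sizes, tg_ngl99, "o-", label="ngl 99 (full offload)")
ax.plot(sizes, tg_ngl0, "s-", label="ngl 0 (no offload)")
for x, y, name in zip(sizes, tg_ngl99, labels):
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(4, 4))
ax.set_xlabel("model size (GiB)")
ax.set_ylabel("tg 128 throughput (t/s)")
ax.set_title("Mistral-7B generation speed, ROCm")
ax.legend()
fig.savefig("rocm_tg128.png", dpi=150)
```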