*Forked by @atzamis from Artefact2/README.md, May 25, 2025.*
# GGUF quantizations overview

## Which GGUF is right for me? (Opinionated)

Good question! I am collecting human data on how quantization affects outputs. Contact me if you want to help.

In the meantime, use the largest quantization that fully fits in your GPU's VRAM. If you can comfortably fit Q4_K_S, try a model with more parameters instead.
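To check the fit before downloading, you can estimate a quant's file size from its bits-per-weight (bpw) figure. A minimal sketch (the estimate slightly undershoots real GGUF files, since bpw figures are averages over mixed tensor types and files also carry metadata):

```python
def estimate_gguf_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Rough GGUF file size: params x bpw / 8 bytes, converted to GiB.

    Real files come out a bit larger, so treat this as a lower bound
    when checking whether a quant fits in VRAM.
    """
    return n_params * bits_per_weight / 8 / 2**30

# Mistral-7B (7.24e9 params) at IQ4_XS's 4.25 bpw:
print(round(estimate_gguf_size_gib(7.24e9, 4.25), 2))  # 3.58 (actual file: 3.64 GiB)
```

Remember to leave headroom beyond the file size itself: the KV cache and compute buffers also live in VRAM.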

## llama.cpp feature matrix

  • Last updated 2024-02-27.
  • Improvements/corrections welcome!
|                  | CPU (AVX2) | CPU (ARM NEON) | Metal | cuBLAS | rocBLAS | SYCL | CLBlast | Vulkan | Kompute |
|------------------|------------|----------------|-------|--------|---------|------|---------|--------|---------|
| K-quants         | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 |
| I-quants         | ✅ (SLOW) | ✅ (SLOW) | ✅ | ✅ | ✅ | 🚫 | 🚫 | 🚫 | 🚫 |
| Multi-GPU        | N/A | N/A | N/A | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 |
| K cache quants   | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Only q8_0 (SLOW) | 🚫 | 🚫 |
| MoE architecture | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Only -ngl 0 | 🚫 | 🚫 |

## KL-divergence statistics for Mistral-7B

  • Last updated 2024-02-27 (add IQ4_XS).
  • imatrix from wiki.train, 200*512 tokens.
  • KL-divergence measured on wiki.test.

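For reference, the statistics tabulated below can be sketched as follows. This is a toy illustration with made-up next-token distributions, not the actual evaluation code: per position, compare the base model's and the quantized model's next-token distribution (KL divergence, argmax agreement) and the log-probability they assign to the true token (perplexity ratio).

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for one next-token distribution (base vs. quantized)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy data: distributions over a 3-word vocab at two positions,
# plus the index of the "true" next token at each position (all made up).
base = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
quant = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]]
true_tok = [0, 1]

kls = sorted(kl_divergence(p, q) for p, q in zip(base, quant))
median_kl = kls[len(kls) // 2]  # real data: use statistics.median / quantiles

# "Top tokens differ": fraction of positions where the argmax changes.
top_differ = sum(p.index(max(p)) != q.index(max(q))
                 for p, q in zip(base, quant)) / len(base)

# ln(PPL(Q)/PPL(base)) = mean log p_base(true) - mean log p_quant(true)
ln_ppl_ratio = (sum(math.log(p[t]) for p, t in zip(base, true_tok)) -
                sum(math.log(q[t]) for q, t in zip(quant, true_tok))) / len(base)
```

KL divergence is always non-negative, but ln(PPL(Q)/PPL(base)) can go slightly negative when a quant happens to score the test set marginally better than the base model, as Q6_K does below.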

|         | Bits per weight | KL-divergence median | KL-divergence q99 | Top tokens differ | ln(PPL(Q)/PPL(base)) |
|---------|-----------------|----------------------|-------------------|-------------------|----------------------|
| IQ1_S   | 1.78 | 0.5495 | 5.5174 | 0.3840 | 0.9235 |
| IQ2_XXS | 2.20 | 0.1751 | 2.4983 | 0.2313 | 0.2988 |
| IQ2_XS  | 2.43 | 0.1146 | 1.7693 | 0.1943 | 0.2046 |
| IQ2_S   | 2.55 | 0.0949 | 1.6284 | 0.1806 | 0.1722 |
| IQ2_M   | 2.76 | 0.0702 | 1.0935 | 0.1557 | 0.1223 |
| Q2_K_S  | 2.79 | 0.0829 | 1.5111 | 0.1735 | 0.1600 |
| Q2_K    | 3.00 | 0.0588 | 1.0337 | 0.1492 | 0.1103 |
| IQ3_XXS | 3.21 | 0.0330 | 0.5492 | 0.1137 | 0.0589 |
| IQ3_XS  | 3.32 | 0.0296 | 0.4550 | 0.1071 | 0.0458 |
| Q3_K_S  | 3.50 | 0.0304 | 0.4481 | 0.1068 | 0.0511 |
| IQ3_S   | 3.52 | 0.0205 | 0.3018 | 0.0895 | 0.0306 |
| IQ3_M   | 3.63 | 0.0186 | 0.2740 | 0.0859 | 0.0268 |
| Q3_K_M  | 3.89 | 0.0171 | 0.2546 | 0.0839 | 0.0258 |
| Q3_K_L  | 4.22 | 0.0152 | 0.2202 | 0.0797 | 0.0205 |
| IQ4_XS  | 4.32 | 0.0088 | 0.1082 | 0.0606 | 0.0079 |
| IQ4_NL  | 4.56 | 0.0085 | 0.1077 | 0.0605 | 0.0074 |
| Q4_K_S  | 4.57 | 0.0083 | 0.1012 | 0.0600 | 0.0081 |
| Q4_K_M  | 4.83 | 0.0075 | 0.0885 | 0.0576 | 0.0060 |
| Q5_K_S  | 5.52 | 0.0045 | 0.0393 | 0.0454 | 0.0005 |
| Q5_K_M  | 5.67 | 0.0043 | 0.0368 | 0.0444 | 0.0005 |
| Q6_K    | 6.57 | 0.0032 | 0.0222 | 0.0394 | −0.0008 |

## ROCm benchmarks for Mistral-7B

  • TODO: add fancy graph
  • Last updated 2024-03-03.
| model | size | params | backend | ngl | test | t/s |
|-------|------|--------|---------|-----|------|-----|
| llama 7B IQ1_S - 1.5625 bpw | 1.50 GiB | 7.24 B | ROCm | 99 | pp 512 | 709.29 ± 1.88 |
| llama 7B IQ1_S - 1.5625 bpw | 1.50 GiB | 7.24 B | ROCm | 99 | tg 128 | 74.85 ± 0.02 |
| llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB | 7.24 B | ROCm | 99 | pp 512 | 704.52 ± 1.67 |
| llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB | 7.24 B | ROCm | 99 | tg 128 | 58.44 ± 0.07 |
| llama 7B IQ3_XS - 3.3 bpw | 2.79 GiB | 7.24 B | ROCm | 99 | pp 512 | 682.72 ± 1.98 |
| llama 7B IQ3_XS - 3.3 bpw | 2.79 GiB | 7.24 B | ROCm | 99 | tg 128 | 45.79 ± 0.05 |
| llama 7B IQ4_XS - 4.25 bpw | 3.64 GiB | 7.24 B | ROCm | 99 | pp 512 | 712.96 ± 0.98 |
| llama 7B IQ4_XS - 4.25 bpw | 3.64 GiB | 7.24 B | ROCm | 99 | tg 128 | 64.17 ± 0.06 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | ROCm | 99 | pp 512 | 870.44 ± 0.40 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | ROCm | 99 | tg 128 | 63.42 ± 0.02 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | pp 512 | 691.40 ± 0.09 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 99 | tg 128 | 46.52 ± 0.00 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 99 | pp 512 | 661.98 ± 0.15 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 99 | tg 128 | 47.57 ± 0.00 |
| llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 99 | pp 512 | 881.95 ± 0.17 |
| llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 99 | tg 128 | 39.74 ± 0.12 |
| llama 7B IQ1_S - 1.5625 bpw | 1.50 GiB | 7.24 B | ROCm | 0 | pp 512 | 324.35 ± 2.72 |
| llama 7B IQ1_S - 1.5625 bpw | 1.50 GiB | 7.24 B | ROCm | 0 | tg 128 | 15.66 ± 0.08 |
| llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB | 7.24 B | ROCm | 0 | pp 512 | 316.10 ± 1.21 |
| llama 7B IQ2_XS - 2.3125 bpw | 2.05 GiB | 7.24 B | ROCm | 0 | tg 128 | 15.11 ± 0.05 |
| llama 7B IQ3_XS - 3.3 bpw | 2.79 GiB | 7.24 B | ROCm | 0 | pp 512 | 300.61 ± 1.21 |
| llama 7B IQ3_XS - 3.3 bpw | 2.79 GiB | 7.24 B | ROCm | 0 | tg 128 | 10.49 ± 0.12 |
| llama 7B IQ4_XS - 4.25 bpw | 3.64 GiB | 7.24 B | ROCm | 0 | pp 512 | 292.36 ± 9.67 |
| llama 7B IQ4_XS - 4.25 bpw | 3.64 GiB | 7.24 B | ROCm | 0 | tg 128 | 11.06 ± 0.06 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | ROCm | 0 | pp 512 | 310.94 ± 2.01 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | ROCm | 0 | tg 128 | 10.44 ± 0.19 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 0 | pp 512 | 273.83 ± 1.47 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.24 B | ROCm | 0 | tg 128 | 8.54 ± 0.04 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 0 | pp 512 | 261.16 ± 1.06 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | ROCm | 0 | tg 128 | 7.34 ± 0.20 |
| llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 0 | pp 512 | 270.70 ± 2.32 |
| llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 0 | tg 128 | 5.74 ± 0.04 |
| llama 7B F16 | 13.49 GiB | 7.24 B | ROCm | 0 | pp 512 | 211.12 ± 0.74 |
| llama 7B F16 | 13.49 GiB | 7.24 B | ROCm | 0 | tg 128 | 3.06 ± 0.03 |
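In these results, `pp 512` is prompt processing (compute-bound) and `tg 128` is token generation, which is largely memory-bandwidth-bound: every generated token streams the whole model through memory, which is why smaller quants generate faster. You can use this to sanity-check a benchmark run; a rough sketch (numbers taken from the Q4_0 ngl 99 row, and ignoring KV-cache traffic and compute overhead):

```python
def implied_bandwidth_gibs(tg_tokens_per_s: float, size_gib: float) -> float:
    """Lower-bound estimate of memory bandwidth used during token generation:
    each token reads roughly the whole model file once."""
    return tg_tokens_per_s * size_gib

# Q4_0 fully offloaded (ngl 99): 63.42 t/s x 3.83 GiB
print(round(implied_bandwidth_gibs(63.42, 3.83), 1))  # 242.9 (GiB/s)
```

If the implied figure comes out well above your GPU's spec-sheet bandwidth, something is off with the measurement; far below, and generation is likely bottlenecked elsewhere (e.g. partial offload to system RAM, as in the ngl 0 rows).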