- I am partially offloading (running on CPU+GPU): use Q4_K_S. The I-quants are slower on CPU and generally not worth the speed penalty. You can go higher (Q5_K_S, Q6_K), but the returns diminish quickly for a considerable increase in size. I consider Q4_K_S to be transparent, that is, indistinguishable from f16 under a blind test. (Before you disagree with me based on biased and anecdotal evidence, have you tried running a proper blind test?)
- I am fully offloading (running on GPU): use the largest quant that fits. If you can comfortably fit Q4_K_S with room to spare, consider using another model with more parameters instead. For setting the offload split itself, see the sketch after this list.
- Last updated 2024-02-26.
- Improvements/corrections welcome!
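If you load GGUF models programmatically, the partial/full offload split comes down to one parameter. Below is a minimal sketch using the llama-cpp-python bindings; the model path and layer count are placeholders for illustration, not recommendations.

```python
# Minimal sketch of partial vs. full GPU offload with llama-cpp-python
# (pip install llama-cpp-python). The model path is hypothetical.
from llama_cpp import Llama

# Partial offload: put as many layers on the GPU as your VRAM allows,
# run the remainder on CPU. 0 means pure CPU inference.
llm = Llama(
    model_path="./models/model-Q4_K_S.gguf",
    n_gpu_layers=20,
)

# Full offload: -1 moves every layer to the GPU.
# llm = Llama(model_path="./models/model-Q4_K_S.gguf", n_gpu_layers=-1)

out = llm("Q: What is a K-quant? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

In practice you raise `n_gpu_layers` until you run out of VRAM (leaving headroom for the KV cache), which is why the advice above is phrased in terms of what fits.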
| | CPU (AVX2) | cuBLAS | rocBLAS | Metal | CLBlast | Vulkan | Kompute |
|---|---|---|---|---|---|---|---|
| Legacy quants | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ (SLOW) |
| K-quants | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 |
| I-quants | ✅ (SLOW) | ✅ | ✅ | ✅ | ❓ | 🚫 | 🚫 |
| Multi-GPU | N/A | ✅ | 🚫 | N/A | ❓ | ✅ | ❓ |
| Llama, Mistral architecture | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Mixtral architecture | ✅ | ✅ | ✅ | ✅ | ❓ | 🚫 | ❓ |
- Last updated 2024-02-26.
- imatrix computed from wiki.train, 200×512 tokens.
- KL-divergence measured on wiki.test; a sketch of how these metrics can be computed follows the table.
| Quant | Bits per weight | KL-divergence median | KL-divergence q99 | Top tokens differ | ln(PPL(Q)/PPL(base)) |
|---|---|---|---|---|---|
| IQ1_S | 1.78 | 0.5495 | 5.5174 | 0.3840 | 0.9235 |
| IQ2_XXS | 2.20 | 0.1751 | 2.4983 | 0.2313 | 0.2988 |
| IQ2_XS | 2.43 | 0.1146 | 1.7693 | 0.1943 | 0.2046 |
| IQ2_S | 2.55 | 0.0949 | 1.6284 | 0.1806 | 0.1722 |
| IQ2_M | 2.76 | 0.0702 | 1.0935 | 0.1557 | 0.1223 |
| Q2_K_S | 2.79 | 0.0829 | 1.5111 | 0.1735 | 0.1600 |
| Q2_K | 3.00 | 0.0588 | 1.0337 | 0.1492 | 0.1103 |
| IQ3_XXS | 3.21 | 0.0330 | 0.5492 | 0.1137 | 0.0589 |
| Q3_K_XS | 3.32 | 0.0296 | 0.4550 | 0.1071 | 0.0458 |
| Q3_K_S | 3.50 | 0.0304 | 0.4481 | 0.1068 | 0.0511 |
| IQ3_S | 3.52 | 0.0205 | 0.3018 | 0.0895 | 0.0306 |
| IQ3_M | 3.63 | 0.0186 | 0.2740 | 0.0859 | 0.0268 |
| Q3_K_M | 3.89 | 0.0171 | 0.2546 | 0.0839 | 0.0258 |
| Q3_K_L | 4.22 | 0.0152 | 0.2202 | 0.0797 | 0.0205 |
| IQ4_NL | 4.56 | 0.0085 | 0.1077 | 0.0605 | 0.0074 |
| Q4_K_S | 4.57 | 0.0083 | 0.1012 | 0.0600 | 0.0081 |
| Q4_K_M | 4.83 | 0.0075 | 0.0885 | 0.0576 | 0.0060 |
| Q5_K_S | 5.52 | 0.0045 | 0.0393 | 0.0454 | 0.0005 |
| Q5_K_M | 5.67 | 0.0043 | 0.0368 | 0.0444 | 0.0005 |
| Q6_K | 6.57 | 0.0032 | 0.0222 | 0.0394 | −0.0008 |
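For reference, here is a minimal sketch of how metrics like those in the table can be computed, assuming you have dumped per-position log-probabilities over the vocabulary from both the base (f16) and quantized models; the function and array names are hypothetical. (llama.cpp's perplexity tool can produce these statistics directly via its KL-divergence options.)

```python
# Hypothetical sketch: compute KL-divergence stats, top-token disagreement,
# and the log perplexity ratio from two models' log-probabilities.
import numpy as np

def quant_metrics(logp_base, logp_quant, targets):
    """
    logp_base, logp_quant: (n_positions, vocab_size) log-probabilities
                           from the base and quantized models.
    targets: (n_positions,) actual next-token ids from the test set.
    """
    p_base = np.exp(logp_base)

    # Per-position KL(base || quant) over the full vocabulary;
    # the median and 99th percentile summarize typical and tail damage.
    kl = np.sum(p_base * (logp_base - logp_quant), axis=-1)

    # Fraction of positions where the two models disagree on the top token.
    top_differs = float(np.mean(
        np.argmax(logp_base, axis=-1) != np.argmax(logp_quant, axis=-1)
    ))

    # PPL = exp(mean NLL of the true tokens), so ln(PPL(Q)/PPL(base))
    # is just the difference of the mean negative log-likelihoods.
    idx = np.arange(len(targets))
    nll_base = -logp_base[idx, targets].mean()
    nll_quant = -logp_quant[idx, targets].mean()

    return {
        "kl_median": float(np.median(kl)),
        "kl_q99": float(np.quantile(kl, 0.99)),
        "top_tokens_differ": top_differs,
        "ln_ppl_ratio": float(nll_quant - nll_base),
    }
```

Note that KL-divergence compares the full output distributions rather than just the loss on the correct token, which is why it separates the quants more sharply than the perplexity ratio in the rightmost column.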
