# LLM Compressor v0.8.0 release notes

This LLM Compressor v0.8.0 release introduces the following new features and enhancements:

* Support for multiple modifiers in oneshot compression runs
* Quantization and calibration support for Qwen3 models, including FP8 quantization support for Qwen3 VL MoE models
* Transforms support for non-full-size rotation sizes
* Improved accuracy recovery by updating W4A16 schemes to use `actorder` "weight" by default

## Support for multiple modifiers in oneshot compression runs ✨

LLM Compressor now supports using multiple modifiers in a single oneshot compression run. You can apply different modifiers, such as AWQ and GPTQ, to specific submodules of a model for W4A16 quantization, all within a single oneshot call and with only pass-through calibration data. Using multiple modifiers improves non-uniform model quantization, addressing issues such as varying layer sensitivity.

For more information, see [Non-uniform quantization](https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_non_uniform).

## Quantization and calibration support for Qwen3 models

Quantization and calibration support for Qwen3 models has been added to LLM Compressor. An updated `Qwen3NextSparseMoeBlock` modeling definition has been added that temporarily updates the MoE block during calibration so that all experts see data and are calibrated appropriately. This allows all experts to have calibrated scales while ensuring that only the gated activation values are used.

FP8 and NVFP4 quantization examples have been added for the Qwen3-Next-80B-A3B-Instruct model. For more information, see:

- [examples/quantization_w8a8_fp8/qwen3_next_example.py](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w8a8_fp8/qwen3_next_example.py)
- [examples/quantization_w4a4_fp4/qwen3_next_example.py](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a4_fp4/qwen3_next_example.py)

### FP8 quantization support for Qwen3 VL MoE models

LLM Compressor now supports quantization for Qwen3 VL MoE models. You can now use data-free pathways such as FP8 channel-wise and block-wise quantization. Pathways that require data, such as W4A16 and NVFP4, are planned for a future release.

An FP8 quantization example has been added for the Qwen/Qwen3-VL-235B-A22B-Instruct model. For more information, see:

- [examples/quantization_w8a8_fp8/qwen3_vl_moe_fp8_example.py](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w8a8_fp8/qwen3_vl_moe_fp8_example.py)

An updated definition has also been added for `Qwen3VLMoeTextSparseMoeBlock`, which replaces the MoE blocks with a linearized model definition that uses a list of layers rather than a 3D parameter. This model definition enables quantization and is runnable in vLLM.

## Transforms support for non-full-size rotation sizes

You can now set a `transform_block_size` field on the transform-based modifier classes `SpinQuantModifier` and `QuIPModifier`. With this field you can configure transforms of variable size, so you no longer need to restrict Hadamard sizes to match the size of the weight. It is typically beneficial to set the Hadamard block size to match the quantization group size. Examples have been updated to show how to use this field when applying `QuIPModifier`; a minimal sketch is shown below.
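As an illustration, the following sketch pairs `QuIPModifier` with a W4A16 scheme so that the Hadamard block size matches the quantization group size of 128. It assumes the import paths used in the linked examples; the model choice and save directory are illustrative and not part of the release.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"  # illustrative model choice

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Rotate Linear weights with 128-wide Hadamard blocks so the transform block
# size matches the W4A16 group size, then apply data-free weight quantization.
recipe = [
    QuIPModifier(transform_block_size=128, targets="Linear"),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-quip-w4a16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

Serving the resulting model efficiently in vLLM is covered in the transform README linked below.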
For more information, see:

- [quip_example.py](https://github.com/vllm-project/llm-compressor/blob/main/examples/transform/quip_example.py)
- [spinquant_example.py](https://github.com/vllm-project/llm-compressor/blob/main/examples/transform/spinquant_example.py)

To efficiently run QuIP-style rotations using the hadacore kernels in vLLM, see [examples/transform/README.md](https://github.com/vllm-project/llm-compressor/blob/main/examples/transform/README.md).

## Improved accuracy recovery by updating W4A16 schemes to use actorder "weight" by default

The `GPTQModifier` class now uses "weight" activation ordering by default. Weight, or "static", activation ordering has been shown to significantly improve accuracy recovery at no additional runtime cost. For more information and benchmarks, see [vllm/pull/8135](https://github.com/vllm-project/vllm/pull/8135).

## Updates and deprecations

### Support for R4 spinquant-style transforms

Support for R4 spinquant-style transforms has been added, which allows quantization of the `down_proj` layer with increased accuracy recovery. You can use this transform by specifying `SpinQuantModifier(rotations=["R4"])` in the oneshot recipe.

### Re-enabled support for W8A8 INT8 decompression

W8A8 INT8 decompression and model generation have been re-enabled in LLM Compressor. The following changes have been made:

- The `ModelCompressor` class has been updated to support compressing models initialized on the meta device.
- The `SparseCompressor` and `QuantizationCompressor` classes have been modified to be compatible with meta devices.
- The `compress_weight()` function has been modified across sparse compressors to accept module input, enabling correct behavior for meta-initialized shells.
- Decompression and offload device detection have been updated to handle meta modules and empty modules gracefully.

### Updated ignore lists in example recipes to capture all vision components

Ignore lists in example recipes have been updated to correctly capture all vision components. Previously, some vision components, such as `model.vision_tower`, were not being caught, causing downstream issues when serving models with vLLM.

### Deprecated and removed unittest.TestCase

Tests based on `unittest.TestCase` have been deprecated and removed, replaced with standardized `pytest` test definitions.
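As a hypothetical illustration of the new style (the helper and test names below are invented for this sketch, not taken from the LLM Compressor test suite), a `unittest.TestCase` class with `assert*` methods becomes a plain parametrized `pytest` function:

```python
import pytest


def _clamp(value: float, low: float, high: float) -> float:
    """Toy helper standing in for library code under test."""
    return max(low, min(high, value))


# Previously written as a unittest.TestCase subclass with self.assertEqual
# calls, this is now a plain pytest function using parametrization and
# bare assert statements.
@pytest.mark.parametrize(("value", "expected"), [(-1.0, 0.0), (0.5, 0.5), (2.0, 1.0)])
def test_clamp(value, expected):
    assert _clamp(value, low=0.0, high=1.0) == expected
```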