Last active
November 27, 2024 12:38
Revisions
Toliak revised this gist
Sep 11, 2023 . 1 changed file with 7 additions and 1 deletion.
@@ -196,4 +196,10 @@ The solution seems to be the only way to fix the problem.
The largest caveat of it is that every kernel update via pacman seems to be a recompilation headache.

I believe this post will help someone to finally fix the annoying issue with NVMe.
Having encountered such problems, I am sincerely glad to realize that the percentage of Linux-desktop laptops is still not below zero.

---------

# Update 2023.09.11

I've just found [here](https://forum.manjaro.org/t/huawei-d16-suspend-to-sleep/144801/5) that the problem can be fixed using the kernel parameter `nvme_core.default_ps_max_latency_us=0`
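A minimal sketch for checking whether the parameter from this update is actually active after a reboot. The helper name `has_nvme_latency_param` is hypothetical and not part of the gist. It only inspects a kernel command line string (by default the contents of `/proc/cmdline`) and changes nothing on the system.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the NVMe latency parameter from the
# update above is present in a kernel command line.
# $1 (optional): command line string; defaults to /proc/cmdline.
has_nvme_latency_param() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    # Pad with spaces so the parameter is matched as a whole word.
    case " $cmdline " in
        *" nvme_core.default_ps_max_latency_us=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a running system:
#   has_nvme_latency_param && echo "parameter is active"
```

On a GRUB setup the parameter usually goes into `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, followed by regenerating grub.cfg with `grub-mkconfig`. It can also be tried once by editing the cmdline in the GRUB menu, as was done for `iommu=soft`.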
Toliak revised this gist
Aug 8, 2023 . 1 changed file with 1 addition and 1 deletion.
@@ -150,7 +150,7 @@ diff --color --unified --recursive '--exclude=.git' --text src/archlinux-linux/d
 DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_VIA, PCI_ANY_ID,
                 PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3);

+/* Do not suspend NVMe */
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_SILICON_POWER, PCI_ANY_ID,
+                0x0108, 8, quirk_no_ata_d3);
+
Toliak revised this gist
Jul 13, 2023 . 1 changed file with 19 additions and 19 deletions.
@@ -4,8 +4,8 @@ Tried to use Arch Linux and encountered with the problem:
2. Close the laptop lid
3. Wait a few minutes
4. Open the laptop back
5. Use `mount` to see that the disk is read-only.
Moreover, it is reported as read-only, however, it is just broken.
No zsh history, no executables, no way to poweroff without power button long-press.

INXI shortened output:
@@ -32,26 +32,26 @@ To my surprise, I have found
that just mention the case I have described above.

[The recommendation in the thread](https://www.linuxtechmore.com/2022/09/how-to-fix-suspend-failures-with-nvme-on-linux.html): turn IOMMU into the soft mode

Unfortunately, turning the kernel option `iommu=soft` in the GRUB did not change anything.
(I have not regenerated the grub.cfg, just launched edited cmdline in the GRUB menu).

I searched more and found [thread with the similar issue on ASUS laptop](https://bbs.archlinux.org/viewtopic.php?id=246806)
and on [IdeaPad](https://www.reddit.com/r/ManjaroLinux/comments/11kzgkm/nvme_drive_becomes_readonly_after_suspend_manjaro/).
The first one I suddenly skipped (actually, I will reach same thoughts a bit later).
The second seems to be working, however it is a bit.. expensive.
My expectations did not include laptop disassembling and changing NVMe just after one usage day :)

Another idea [from here](https://askubuntu.com/questions/981657/cannot-suspend-with-nvme-m-2-ssd) consists in adding the kernel option `acpiphp.disable=1`.
No matter how sad it is, the solution also brings no positive results.

Meanwhile, I found something about Wi-Fi and NVMe conflict or about turning off TPM.
But playing with BIOS settings achieved no results.

My next step was to make more tests and capture a bit more information (than just `disk is broken after suspend`) about the situation.
I have booted Arch Linux ISO from the USB-drive, therefore, the running OS does not depend on the NVMe.
Further, I mounted one of the NVMe's partitions (Linux root partition).
Then, activated suspend mode and replayed actions, described at the top.
After the returning from the "laptop anabiosis", USB-live OS worked fine, but the disk was read-only-broken.

I have checked `dmesg` and found "the root" of the problem:
```
nvme 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible)
@@ -61,24 +61,24 @@ nvme 0000:01:00.0: can't change power state from D3cold to D0 (config space inac

Well, my monkey-googling query can be specified.
I was firmly convinced that this new detailed problem is well-known and surely already resolved.
However, in fact I just went deeper into the Linux problems swamp.

The Google results I found can be divided into two categories:
- "The graveyard" of 2018-* threads with the same or extremely similar problem
- Email dump with conversations about the kernel changes (or something else... you know, those sites that are just plain text with incomprehensible context and obscure pieces of code on C)

Typical result or last-message in the graveyard-member thread looks like:
- Oh, I will switch to Windows
- Just changed the NVMe and now it works
- I have tried YYY and it did not help. Any more ideas? (\*message created 4 years ago\*)
- Yet another kernel parameter that does not work

After digging up the graves, I purely coincidentally attempted to read the second-category-result
[that describes the kernel patch, that disables D3Cold for specified PCI device](https://patchwork.ozlabs.org/project/linux-pci/patch/[email protected]/).
For my luck, the patch was not complicated, so, I left the idea to do something like that for later.

My last resort (except the kernel patch) was to change `/sys/bus/pci/devices/0000:01:00.0/d3cold_allowed` from 1 to 0.
As I thought, the attempt failed (the cause will be described below).

No more resorts, no suggestions. The only way is to patch the kernel.
@@ -91,7 +91,7 @@ Briefly summarized links:
- [Same problem with IdeaPad](https://www.reddit.com/r/ManjaroLinux/comments/11kzgkm/nvme_drive_becomes_readonly_after_suspend_manjaro/)
- [Similar issue on ASUS laptop](https://bbs.archlinux.org/viewtopic.php?id=246806)
- [acpiphp.disable=1](https://askubuntu.com/questions/981657/cannot-suspend-with-nvme-m-2-ssd)
- [The kernel patch for inspirations](https://patchwork.ozlabs.org/project/linux-pci/patch/[email protected]/)

# The solution part

@@ -105,14 +105,14 @@ lspci -vvvvvnn
```
`126f` is VendorID, `0108` is ClassID.

2. Setup the kernel build system.
I am using Arch Linux (btw) and the guide on [Arch Linux Wiki](https://wiki.archlinux.org/title/Kernel/Arch_build_system) has exhaustive information about the setup.
Except a little point, `2.1 Avoid creating the doc`.
The provided patch is not correct for the 6.4.2 kernel, so, I removed `make _htmldocs` and `"$pkgbase-docs"` manually.
Also, I modified `_make` function and added `-j$(nproc)` to `make` command.

3. Optional part, that I used just to check build system.
Build the kernel (without changes in the sources).
Start the `makepkg -s` and leave the laptop for a while (30 minutes -- 1 hour, approximately)

4. Insert the define, that disables D3Cold on the specified device, somewhere near `DECLARE_PCI_FIXUP_CLASS_EARLY` for deprecated ATA devices.
```
@@ -192,8 +192,8 @@ That function does not have `d3cold_allowed` checks (or at least, I cannot see i

# Conclusion

The solution seems to be the only way to fix the problem.
The largest caveat of it is that every kernel update via pacman seems to be a recompilation headache.

I believe this post will help someone to finally fix the annoying issue with NVMe.
Having encountered such problems, I am sincerely glad to realize that the percentage of Linux-desktop laptops is still not below zero.
Toliak revised this gist
Jul 12, 2023 . 1 changed file with 0 additions and 2 deletions.
@@ -95,8 +95,6 @@ Briefly summarized links:

# The solution part

1. Find the PCI Vendor and PCI Class of the NVMe
```
lspci -vvvvvnn
Toliak revised this gist
Jul 12, 2023 . 1 changed file with 2 additions and 1 deletion.
@@ -189,7 +189,8 @@ Sysfs `d3cold_allowed` modifies `dev->d3cold_allowed` field.
The `d3cold_allowed` is being used in `pci_dev_check_d3cold` function, that, in its turn, being used only in bridge update function `pci_bridge_d3_update`.

However, the `PCI_DEV_FLAGS_NO_D3` is being checked in `pci_set_power_state` function.
That function does not have `d3cold_allowed` checks (or at least, I cannot see it), hence, `d3cold_allowed` change in sysfs is useless in the context of the described problem.

# Conslusion
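For reference, the sysfs attribute discussed in this revision can be inspected with a few lines of shell. This is only a sketch: `read_d3cold_allowed` is a hypothetical helper, and its optional second argument (an alternative sysfs root) is an assumption added so the function can be tried against a fake directory tree. It only reads the value. As explained above, writing 0 to it did not help with this problem.

```shell
#!/bin/sh
# Hypothetical helper: print the d3cold_allowed value of a PCI device.
# $1: PCI address, e.g. 0000:01:00.0
# $2 (optional): sysfs root, defaults to /sys (assumption, for dry runs)
read_d3cold_allowed() {
    addr="$1"
    root="${2:-/sys}"
    attr="$root/bus/pci/devices/$addr/d3cold_allowed"
    if [ -r "$attr" ]; then
        cat "$attr"
    else
        echo "no readable d3cold_allowed attribute for $addr" >&2
        return 1
    fi
}

# Usage on a running system:
#   read_d3cold_allowed 0000:01:00.0
```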
Toliak revised this gist
Jul 12, 2023 . 1 changed file with 104 additions and 2 deletions.
@@ -49,7 +49,7 @@ But playing with BIOS settings achieved no results.
My next step was to make more tests and capture a bit more information (than just `disk is broken after suspend`) about the situation.
I have booted ArchLinux ISO from the USB-drive, therefore, the running OS does not depend on the NVMe.
Further I mounted one of the NVMe's paritions (linux root partition).
Then, activated suspend mode and replayed actions, described at the top.
After the returning from the "laptop anabiosis", USB-live OS worked fine, but disk was read-only-broken.

I have checked `dmesg` and found "the root" of the problem:
@@ -95,4 +95,106 @@ Briefly summarized links:

# The solution part

## Preparation part

1. Find the PCI Vendor and PCI Class of the NVMe
```
lspci -vvvvvnn
...
01:00.0 Non-Volatile memory controller [0108]: Silicon Motion, Inc. Device [126f:1001] (rev 03) (prog-if 02 [NVM Express])
...
```
`126f` is VendorID, `0108` is ClassID.

2. Setup the kernel building system.
I am using Arch Linux (btw) and the guide on [ArchLinux Wiki](https://wiki.archlinux.org/title/Kernel/Arch_build_system) has exhaustive information about the setup.
Except a little point, `2.1 Avoid creating the doc`.
The provided patch is not correct for the 6.4.2 kernel, so, I removed `make _htmldocs` and `"$pkgbase-docs"` manually.
Also, I modified `_make` function and add `-j$(nproc)` to `make` command.

3. Optional part, that I used just to check build system.
Build the kernel (without changes in the sources).
Start the `makepkg -s` and leaves the laptop with itself for a while (30 minutes -- 1 hour, approximatelly)

4. Insert the define, that disabled D3Cold on the specified device, somewhere near `DECLARE_PCI_FIXUP_CLASS_EARLY` for deprecated ATA devices.
```
DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_SILICON_POWER, PCI_ANY_ID,
                0x0108, 8, quirk_no_ata_d3);
// PCI_VENDOR_ID_SILICON_POWER is my define that equals to 0x126f
```
Optionally, I added few debug prints. My full patch diff looks like:
```patch
diff --color --unified --recursive '--exclude=.git' --text src/archlinux-linux/drivers/pci/pci.c src.new/archlinux-linux/drivers/pci/pci.c
--- src/archlinux-linux/drivers/pci/pci.c 2023-07-09 18:07:45.873293132 +0300
+++ src.new/archlinux-linux/drivers/pci/pci.c 2023-07-09 18:06:52.939961065 +0300
@@ -1445,6 +1445,7 @@
      * This device is quirked not to be put into D3, so don't put it in
      * D3
      */
+    pci_info(dev, "dev->dev_flags %llx\n", dev->dev_flags);
     if (state >= PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
         return 0;

diff --color --unified --recursive '--exclude=.git' --text src/archlinux-linux/drivers/pci/quirks.c src.new/archlinux-linux/drivers/pci/quirks.c
--- src/archlinux-linux/drivers/pci/quirks.c 2023-07-09 18:07:45.873293132 +0300
+++ src.new/archlinux-linux/drivers/pci/quirks.c 2023-07-09 18:06:52.939961065 +0300
@@ -1340,6 +1340,7 @@
 /* Some ATA devices break if put into D3 */
 static void quirk_no_ata_d3(struct pci_dev *pdev)
 {
+    pci_info(pdev, "quirk_no_ata_d3 called\n");
     pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
 }
 /* Quirk the legacy ATA devices only. The AHCI ones are ok */
@@ -1355,6 +1356,10 @@
 DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_VIA, PCI_ANY_ID,
                 PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3);

+/* Just asshole silly shit */
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_SILICON_POWER, PCI_ANY_ID,
+                0x0108, 8, quirk_no_ata_d3);
+
 /*
  * This was originally an Alpha-specific thing, but it really fits here.
  * The i82375 PCI/EISA bridge appears as non-classified. Fix that.
diff --color --unified --recursive '--exclude=.git' --text src/archlinux-linux/include/linux/pci_ids.h src.new/archlinux-linux/include/linux/pci_ids.h
--- src/archlinux-linux/include/linux/pci_ids.h 2023-07-09 18:07:45.883293132 +0300
+++ src.new/archlinux-linux/include/linux/pci_ids.h 2023-07-09 18:07:05.963294086 +0300
@@ -3120,4 +3120,6 @@
 #define PCI_VENDOR_ID_NCUBE 0x10ff

+#define PCI_VENDOR_ID_SILICON_POWER 0x126f
+
 #endif /* _LINUX_PCI_IDS_H */
```

5. Compile the kernel (if you have completed p.3, the compilation will be done faster) and install it

6. Regenerate grub.cfg, reboot your laptop and check the dmesg. You should see messages about the quirk.
```
sudo dmesg | grep quirk
[ 0.337952] pci 0000:01:00.0: quirk_no_ata_d3 called
[ 1.939159] nvme 0000:01:00.0: platform quirk: setting simple suspend
```
The first one is my debug message, the second one already exists in Linux.

7. Check the suspend mode as described at the top.

## Meanwhile: why `d3cold_allowed` is not working?

Function `quirk_no_ata_d3` sets `pci->dev_flags |= PCI_DEV_FLAGS_NO_D3;`.
Sysfs `d3cold_allowed` modifies `dev->d3cold_allowed` field.
The `d3cold_allowed` is being used in `pci_dev_check_d3cold` function, that, in its turn, being used only in bridge update function `pci_bridge_d3_update`.

TODO!

# Conslusion

The solution seem to be the only way to fix the problem.
The largest caveat of it is that every kernel update via pacman seem to be a recompilation headache.

I believe this post will help someone to finally fix the annoying issue with NVMe.
Being encountered with such problems I sincirely glad to realize that the procent of linux-desktop laptops is still not below the zero.
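Step 1 of the preparation part reads the class and vendor IDs off the `lspci` output by eye. A small sketch of extracting them from one such line follows. `parse_lspci_ids` is a hypothetical helper, and it assumes the bracketed layout shown in the example line above (`[class]:` followed later by `[vendor:device]`).

```shell
#!/bin/sh
# Hypothetical helper: print "<class> <vendor> <device>" for one line of
# `lspci -nn` style output; prints nothing if the line has no such IDs.
parse_lspci_ids() {
    printf '%s\n' "$1" |
        sed -n 's/^[^[]*\[\([0-9a-f]\{4\}\)\]:.*\[\([0-9a-f]\{4\}\):\([0-9a-f]\{4\}\)\].*$/\1 \2 \3/p'
}

# Usage with the line from the gist:
#   parse_lspci_ids '01:00.0 Non-Volatile memory controller [0108]: Silicon Motion, Inc. Device [126f:1001] (rev 03) (prog-if 02 [NVM Express])'
#   prints: 0108 126f 1001
```

The first two values are the ones that end up in the `DECLARE_PCI_FIXUP_CLASS_EARLY` line (class `0x0108`, vendor `0x126f`).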
Toliak revised this gist
Jul 12, 2023 . 1 changed file with 4 additions and 1 deletion.
@@ -80,7 +80,7 @@ For my luck, the patch was not complicated, so, I left the idea to do something
My last resort (except the kernel patch) was to change `/sys/bus/pci/devices/0000:01:00.0/d3cold_allowed` from 1 to 0.
As I a bit expected, the attempt was failed (the cause will be described below).

No more resorts, no suggestions. The only way is to patch the kernel.

# Related links

@@ -89,6 +89,9 @@ Briefly summarized links:
- [2023 Thread with the same issue](https://www.linuxquestions.org/questions/linux-hardware-18/disk-enters-read-only-state-after-resuming-from-suspend-on-huawei-rlef-x-laptop-4175723575)
- [IOMMU=soft](https://www.linuxtechmore.com/2022/09/how-to-fix-suspend-failures-with-nvme-on-linux.html)
- [Same problem with IdeaPad](https://www.reddit.com/r/ManjaroLinux/comments/11kzgkm/nvme_drive_becomes_readonly_after_suspend_manjaro/)
- [Similar issue on ASUS laptop](https://bbs.archlinux.org/viewtopic.php?id=246806)
- [acpiphp.disable=1](https://askubuntu.com/questions/981657/cannot-suspend-with-nvme-m-2-ssd)
- [Kernel patch for inspirations](https://patchwork.ozlabs.org/project/linux-pci/patch/[email protected]/)

# The solution part
Toliak revised this gist
Jul 12, 2023 . 1 changed file with 46 additions and 3 deletions.
@@ -35,17 +35,60 @@ that just mention the case I have described above.
Unfortunatelly, turning the kernel option `iommu=soft` in the GRUB did not change anything.
(I have not regenerated the grub.cfg, just launched edited cmdline in the GRUB menu).

I searched more and found [thread with the similar issue on ASUS laptop](https://bbs.archlinux.org/viewtopic.php?id=246806)
and on [IdeaPad](https://www.reddit.com/r/ManjaroLinux/comments/11kzgkm/nvme_drive_becomes_readonly_after_suspend_manjaro/).
The first one I suddenly skipped (actually, I will reach same thoughts a bit later).
The second seem to be working, however it is a bit.. expensive.
My expectations did not include laptop disassembling and changing NVMe just after one usage day :)

Another idea [from here](https://askubuntu.com/questions/981657/cannot-suspend-with-nvme-m-2-ssd) consist in adding kernel option `acpiphp.disable=1`.
No matter how sad it is, the solution also brings no positive results.

Meanwhile, I found something about Wi-Fi and NVMe conflict or about turning off TPM.
But playing with BIOS settings achieved no results.

My next step was to make more tests and capture a bit more information (than just `disk is broken after suspend`) about the situation.
I have booted ArchLinux ISO from the USB-drive, therefore, the running OS does not depend on the NVMe.
Further I mounted one of the NVMe's paritions (linux root partition).
Then, activated suspend mode and replayed actions, described in the top.
After the returning from the "laptop anabiosis", USB-live OS worked fine, but disk was read-only-broken.

I have checked `dmesg` and found "the root" of the problem:
```
nvme 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible)
```

*Honorable mention: after "breaking" mounted NVMe in USB-live OS, laptop's BIOS lost the GRUB. This was solved by regenerating the grub config with `grub-mkconfig`*

Well, my monkey-googling query can be specified.
I was firmly convinced that this new detailed problem is well-known and surely already resolved.
However, in fact I just went deeper into the linux problems swamp.

The Google results I found can be divided into two categories:
- "The graveyard" of 2018-* threads with the same or extremely similar problem
- Email dump with conversations about the kernel changes (or something else... you known, those sites that are just plain text with incomprehensible context and obscure pieces of code on C)

Typical result or last-message in the graveyard-member thread looks like:
- Oh, I will switch on Windows
- Just changed the NVMe and now it works
- I have tried YYY and it did not help. Any more ideas? (\*message created 4 years ago\*)
- Yet another kernel parameter that does not work

After digging up the graves, I purely coincidental attempted to read the second-category-result
[that describes kernel patch, that disables D3Cold for specified PCI device](https://patchwork.ozlabs.org/project/linux-pci/patch/[email protected]/).
For my luck, the patch was not complicated, so, I left the idea to do something like that for later.

My last resort (except the kernel patch) was to change `/sys/bus/pci/devices/0000:01:00.0/d3cold_allowed` from 1 to 0.
As I a bit expected, the attempt was failed (the cause will be described below).

No resorts, no suggestions. The only way is to patch the kernel.

# Related links

Briefly summarized links:
- [2023 Thread with the same issue](https://www.linuxquestions.org/questions/linux-hardware-18/disk-enters-read-only-state-after-resuming-from-suspend-on-huawei-rlef-x-laptop-4175723575)
- [IOMMU=soft](https://www.linuxtechmore.com/2022/09/how-to-fix-suspend-failures-with-nvme-on-linux.html)
- [Same problem with IdeaPad](https://www.reddit.com/r/ManjaroLinux/comments/11kzgkm/nvme_drive_becomes_readonly_after_suspend_manjaro/)

# The solution part
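The `dmesg` line quoted in this revision is the key symptom, so it can be handy to search for it directly after a resume. A minimal sketch, assuming the message keeps the exact wording shown above. `find_d3cold_failures` is a hypothetical helper that reads kernel log text on stdin.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the PCI
# address of every NVMe device that failed to return from D3cold.
find_d3cold_failures() {
    sed -n "s/.*nvme \([0-9a-f:.]*\): can't change power state from D3cold to D0.*/\1/p" |
        sort -u
}

# Usage on a running system:
#   sudo dmesg | find_d3cold_failures
```

An empty result only means the message was not found in the log. It does not prove that the disk survived the suspend.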
Toliak created this gist
Jul 12, 2023
@@ -0,0 +1,52 @@
I have bought laptop Huawei Matebook D16 RLEF-X.
Tried to use Arch Linux and encountered with the problem:
1. Turn on the suspend mode (`sudo systemctl suspend`)
2. Close the laptop lid
3. Wait a few minutes
4. Open the laptop back
5. Use `mount` to see that the disk is readonly.
Moreover, it determines like readonly, however, it is just broken.
No zsh history, no executables, no way to poweroff without power button long-press.

INXI shrinked output:
```
System:
  Host: archlinux Kernel: 6.4.2-arch1-1-linux arch: x86_64 bits: 64 Desktop: i3 v: 4.22 Distro: Arch Linux
Machine:
  Type: Laptop System: HUAWEI product: RLEF-XX v: M1010 serial: <superuser required>
  Mobo: HUAWEI model: RLEF-XX-PCB v: M1010 serial: <superuser required>
  UEFI: HUAWEI v: 1.26 date: 01/30/2023
....
Drives:
  Local Storage: total: 476.94 GiB used: 63.92 GiB (13.4%)
  ID-1: /dev/nvme0n1 model: PCIe-8 SSD 512GB size: 476.94 GiB
```

# Research part

My first step was to search something like `huawei matebook disk read-only after suspend`.
To my surprise, I have found
[this thread created at 2023](https://www.linuxquestions.org/questions/linux-hardware-18/disk-enters-read-only-state-after-resuming-from-suspend-on-huawei-rlef-x-laptop-4175723575)
that just mention the case I have described above.

[The recommendation in the thread](https://www.linuxtechmore.com/2022/09/how-to-fix-suspend-failures-with-nvme-on-linux.html): turn IOMMU into the soft mode

Unfortunatelly, turning the kernel option `iommu=soft` in the GRUB did not change anything.
(I have not regenerated the grub.cfg, just launched edited cmdline in the GRUB menu).

TODO: way to `d3cold to d0`

Playing with BIOS settings achieved no results.

# Related links

Briefly summarized links:
- [2023 Thread with the same issue](https://www.linuxquestions.org/questions/linux-hardware-18/disk-enters-read-only-state-after-resuming-from-suspend-on-huawei-rlef-x-laptop-4175723575)
- [IOMMU=soft](https://www.linuxtechmore.com/2022/09/how-to-fix-suspend-failures-with-nvme-on-linux.html)

# The solution part

TODO: describe the solution :)