Revisions

  1. @flisboac flisboac revised this gist May 7, 2017. 1 changed file with 3 additions and 0 deletions.
    3 changes: 3 additions & 0 deletions fix-intel_wifi_aer-avell_g1513_fire_v3.service
    @@ -7,3 +7,6 @@ Type=oneshot
    # Change your device and vendor (or bus/slot/function accordingly)
    ExecStart=/usr/bin/setpci -v -d 8086:a114 CAP_EXP+0x8.w=0xe
    RemainAfterExit=yes
    [Install]
    WantedBy=network.target
  2. @flisboac flisboac revised this gist May 7, 2017. 2 changed files with 1 addition and 0 deletions.
    1 change: 1 addition & 0 deletions fix-intel_wifi_aer-avell_g1513_fire_v3
    @@ -0,0 +1 @@
    silly gist hack, why do we need you? :(
    File renamed without changes.
  3. @flisboac flisboac created this gist May 7, 2017.
    213 changes: 213 additions & 0 deletions README.md
    @@ -0,0 +1,213 @@
    # How to use

    Drop the `.service` file into `/etc/systemd/system/`, then enable and start the unit via `systemctl`:

    ```shell
    # systemctl daemon-reload
    # systemctl enable fix-intel_wifi_aer-avell_g1513_fire_v3.service
    # systemctl start fix-intel_wifi_aer-avell_g1513_fire_v3.service
    ```

    This disables reporting of "Corrected" severity errors for the device, which stops the log flood and saves you loads of (logging) disk space. :)

    # Reasoning

    Sorry for the poor explanation, future self. I'm kinda tired right now. I don't even know if all of this is correct. :(

    When AER floods the log with errors, the cause is generally buggy hardware or a buggy driver.
    Most people recommend disabling AER altogether via a kernel parameter such as `pci=noaer`. If you know that the affected device is fine,
    and that its driver has a still-unfixed bug that doesn't affect proper usage, you can instead disable reporting for specific
    severity levels only, by setting the flags directly on the device via `setpci`.

    For more info on `setpci`, please [see its docs](http://linuxcommand.org/man_pages/setpci8.html).

    AER (Advanced Error Reporting) is a PCIe capability. Linux adds support for it through a kernel module that is started sometime
    during `systemd-modules-load.service`'s execution. The AER driver initializes reporting for PCIe devices at startup, so it's
    important that we only reset the flags AFTER systemd's module loading service.

    According to the AER module's
    [source code](http://elixir.free-electrons.com/linux/latest/source/drivers/pci/pcie/aer/aerdrv_core.c#L41), the four error
    reporting flags (Correctable, Non-Fatal, Fatal and Unsupported Request) are always enabled when AER is enabled for a device:

    ```c
    // From `/usr/include/uapi/linux/pci_regs.h`
    #define PCI_EXP_DEVCTL 8 /* Device Control */
    #define PCI_EXP_DEVCTL_CERE 0x0001 /* Correctable Error Reporting En. */
    #define PCI_EXP_DEVCTL_NFERE 0x0002 /* Non-Fatal Error Reporting Enable */
    #define PCI_EXP_DEVCTL_FERE 0x0004 /* Fatal Error Reporting Enable */
    #define PCI_EXP_DEVCTL_URRE 0x0008 /* Unsupported Request Reporting En. */

    // From `source/drivers/pci/pcie/aer/aerdrv_core.c`
    #define PCI_EXP_AER_FLAGS (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
                               PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)

    int pci_enable_pcie_error_reporting(struct pci_dev *dev)
    {
        if (pcie_aer_get_firmware_first(dev))
            return -EIO;

        if (!dev->aer_cap)
            return -EIO;

        return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
    }
    ```
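
    As a quick sanity check of the flag arithmetic, here is a minimal shell sketch. The bit values are the ones quoted from `pci_regs.h`; `decode_devctl` is a made-up helper name for illustration, not part of any tool.

    ```shell
    # Bit values as quoted from pci_regs.h above.
    CERE=0x0001    # Correctable Error Reporting Enable
    NFERE=0x0002   # Non-Fatal Error Reporting Enable
    FERE=0x0004    # Fatal Error Reporting Enable
    URRE=0x0008    # Unsupported Request Reporting Enable

    # What PCI_EXP_AER_FLAGS evaluates to, as a 16-bit hex word.
    printf '%04x\n' $(( $CERE | $NFERE | $FERE | $URRE ))   # prints 000f

    # decode_devctl (hypothetical helper): name the reporting bits that are
    # set in a Device Control word given in hex, with or without "0x".
    decode_devctl() {
        word=$(( 0x${1#0x} ))
        names=""
        if [ $(( $word & $CERE )) -ne 0 ]; then names="$names CERE"; fi
        if [ $(( $word & $NFERE )) -ne 0 ]; then names="$names NFERE"; fi
        if [ $(( $word & $FERE )) -ne 0 ]; then names="$names FERE"; fi
        if [ $(( $word & $URRE )) -ne 0 ]; then names="$names URRE"; fi
        echo "${names# }"
    }

    decode_devctl 000f   # prints CERE NFERE FERE URRE
    decode_devctl 000e   # prints NFERE FERE URRE
    ```
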
    Inspecting the kernel's source code some more, one can find that `PCI_EXP_DEVCTL` is an offset into the device's
    PCI Express capability structure, and that the structure's position (`dev->pcie_cap`) is itself an offset into the device's
    configuration space. If you follow the implementation of `pcie_capability_set_word` and its dependencies (function calls),
    you end up in the `pcie_capability_write_*` helpers. The dword variant is quoted here; the word-sized one follows the same pattern:
    ```c
    // From `source/drivers/pci/access.c`
    int pcie_capability_write_dword(struct pci_dev *dev, int pos, u32 val)
    {
        if (pos & 3)
            return -EINVAL;

        if (!pcie_capability_reg_implemented(dev, pos))
            return 0;

        return pci_write_config_dword(dev, pci_pcie_cap(dev) + pos, val);
    }

    // From `/usr/include/linux/pci.h`
    static inline int pcie_capability_set_word(struct pci_dev *dev, int pos,
                                               u16 set)
    {
        return pcie_capability_clear_and_set_word(dev, pos, 0, set);
    }

    static inline int pci_pcie_cap(struct pci_dev *dev)
    {
        return dev->pcie_cap;
    }
    ```

    Depending on the machine's setup, `setpci --dumpregs` may list the register name `CAP_EXP` as available.
    This register refers to the `dev->pcie_cap` offset. To inspect how AER is configured, one needs the vendor/device ID or the
    bus/slot/function address of the affected device. AER's logged messages already have this information. Below is an
    example, from which we can take two different identifiers for the device: `8086:a114` (vendor/device ID) and `0000:00:1c.4`
    (domain/bus/slot/function).

    ```text
    # dmesg | tail -n 4
    [ 4455.385233] pcieport 0000:00:1c.4: AER: Corrected error received: id=00e4
    [ 4455.385242] pcieport 0000:00:1c.4: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e4(Receiver ID)
    [ 4455.385250] pcieport 0000:00:1c.4: device [8086:a114] error status/mask=00000001/00002000
    [ 4455.385254] pcieport 0000:00:1c.4: [ 0] Receiver Error (First)
    ```
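
    Picking those two identifiers out of a noisy log by eye gets old quickly. Here is a minimal sketch that does it with `sed`; `aer_ids` is a hypothetical helper name, and the pattern assumes the lines look like the `pcieport ... device [vvvv:dddd]` example above.

    ```shell
    # aer_ids (hypothetical helper): read kernel log lines on stdin and print
    # "<domain:bus:slot.func> <vendor:device>" for every AER "device [...]" line.
    aer_ids() {
        sed -n 's/.*pcieport \([0-9a-f:.]*\): *device \[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1 \2/p'
    }

    # The sample line from the log above:
    echo '[ 4455.385250] pcieport 0000:00:1c.4: device [8086:a114] error status/mask=00000001/00002000' | aer_ids
    # prints 0000:00:1c.4 8086:a114
    ```

    On a live system, `dmesg | aer_ids | sort -u` should give one line per reporting port.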

    To check which device that address belongs to, see `lshw` or `lspci`:

    ```text
    [flisboac@sonic ~]$ sudo lspci -v -s 00:1c.4
    00:1c.4 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1) (prog-if 00 [Normal decode])
    Flags: bus master, fast devsel, latency 0, IRQ 124
    Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
    I/O behind bridge: None
    Memory behind bridge: df200000-df2fffff [size=1M]
    Prefetchable memory behind bridge: None
    Capabilities: [40] Express Root Port (Slot+), MSI 00
    Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
    Capabilities: [90] Subsystem: Device 1d05:1021
    Capabilities: [a0] Power Management version 3
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [140] Access Control Services
    Capabilities: [220] #19
    Kernel driver in use: pcieport
    Kernel modules: shpchp
    ```

    In this case, the reporting device is a PCIe root port, so the error may refer to a device attached to that port. One can check which device is attached to said port with
    `lshw`:

    ```text
    # lshw -numeric
    sonic
    description: Notebook
    product: 1513 (To be filled by O.E.M.)
    vendor: Avell High Performance
    version: To be filled by O.E.M.
    serial: To be filled by O.E.M.
    width: 4294967295 bits
    capabilities: smbios-3.0 dmi-3.0 smp vsyscall32
    configuration: boot=normal chassis=notebook family=To be filled by O.E.M. sku=To be filled by O.E.M. uuid=00020003-0004-0005-0006-000700080009
    *-core
    description: Motherboard
    physical id: 0
    version: 0.1
    serial: To be filled by O.E.M.
    slot: To be filled by O.E.M.
    (... lshw is so verbose ...)
    *-pci
    description: Host bridge
    product: Skylake Host Bridge/DRAM Registers [8086:1910]
    vendor: Intel Corporation [8086]
    physical id: 100
    bus info: pci@0000:00:00.0
    version: 07
    width: 32 bits
    clock: 33MHz
    configuration: driver=skl_uncore
    resources: irq:0
    (... lshw is so verbose ...)
    *-pci:2
    description: PCI bridge
    product: Sunrise Point-H PCI Express Root Port #5 [8086:A114]
    vendor: Intel Corporation [8086]
    physical id: 1c.4
    bus info: pci@0000:00:1c.4
    version: f1
    width: 32 bits
    clock: 33MHz
    capabilities: pci pciexpress msi pm normal_decode bus_master cap_list
    configuration: driver=pcieport
    resources: irq:124 memory:df200000-df2fffff
    *-network
    description: Wireless interface
    product: Wireless 7265 [8086:95A]
    vendor: Intel Corporation [8086]
    physical id: 0
    bus info: pci@0000:03:00.0
    logical name: wlp3s0
    version: 48
    serial: 64:80:99:f3:9d:d7
    width: 64 bits
    clock: 33MHz
    capabilities: pm msi pciexpress bus_master cap_list ethernet physical wireless
    configuration: broadcast=yes driver=iwlwifi driverversion=4.10.13-1-ARCH firmware=17.459231.0 ip=192.168.1.26 latency=0 link=yes multicast=yes wireless=IEEE 802.11
    resources: irq:137 memory:df200000-df201fff
    ```
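
    If `lshw` is too verbose, the `Bus:` line in the `lspci -v` output for the bridge already says which bus sits behind the port. A minimal sketch for extracting it (`secondary_bus` is a made-up helper name):

    ```shell
    # secondary_bus (hypothetical helper): read `lspci -v` output for a bridge
    # on stdin and print the number of the bus behind it.
    secondary_bus() {
        sed -n 's/.*secondary=\([0-9a-f]\{2\}\).*/\1/p'
    }

    # The Bus: line from the lspci output shown earlier:
    echo 'Bus: primary=00, secondary=03, subordinate=03, sec-latency=0' | secondary_bus
    # prints 03
    ```

    Bus `03` matches the wireless card's `pci@0000:03:00.0` in the `lshw` output. Going by the `lspci` man page, `lspci -nn -s 03:` should then list everything on that bus.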

    Summarizing, `CAP_EXP` is the base register, and we do a bit of pointer arithmetic with it: we offset `CAP_EXP`
    by `PCI_EXP_DEVCTL`, and write the proper flags there as a single word. Just remember that the `PCI_EXP_*` register offsets are defined in decimal,
    while `setpci` only accepts hexadecimal (with or without the `0x` prefix), so some base conversion may be needed
    -- although that's not the case for `PCI_EXP_DEVCTL`, since decimal 8 is also `0x8`.

    So, to read the current configuration:

    ```text
    [flisboac@sonic ~]$ sudo setpci -v -d 8086:a114 CAP_EXP+0x8.w
    0000:00:1c.4 (cap 10 @40) @48 = 000f
    ```
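
    The `@40` and `@48` in that output are the pointer arithmetic at work: the PCIe capability sits at offset `0x40` in config space, and Device Control is 8 bytes into it. A minimal sketch of the same sum (`devctl_addr` is a hypothetical helper name):

    ```shell
    PCI_EXP_DEVCTL=8   # decimal, as defined in pci_regs.h

    # devctl_addr (hypothetical helper): given the capability offset in hex
    # (the number after "cap 10 @"), print the hex address of Device Control.
    devctl_addr() {
        printf '%x\n' $(( 0x${1#0x} + $PCI_EXP_DEVCTL ))
    }

    devctl_addr 40   # prints 48
    ```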

    `000f` tells us that all four AER reporting flags are set. Correctable error reporting (`PCI_EXP_DEVCTL_CERE`) is bit 0 of that word, so we just need to set the new
    value to `000e` to disable only the Corrected severity reporting:

    ```text
    [flisboac@sonic ~]$ sudo setpci -v -d 8086:a114 CAP_EXP+0x8.w=0x0e
    0000:00:1c.4 (cap 10 @40) @48 000e
    ```
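
    Writing `0x0e` replaces the whole 16-bit word, which is fine here because it read back as `000f`. On a device where other Device Control bits are set, it is safer to clear only bit 0 and keep the rest. A minimal sketch of that arithmetic (`clear_cere` is a made-up helper name):

    ```shell
    # clear_cere (hypothetical helper): take the current Device Control word
    # in hex and print it with bit 0 (CERE) cleared, other bits untouched.
    clear_cere() {
        printf '%04x\n' $(( 0x${1#0x} & ~0x0001 ))
    }

    clear_cere 000f   # prints 000e
    clear_cere 281f   # prints 281e
    ```

    According to the `setpci` man page, a value can also be given as a `bits:mask` pair, so `CAP_EXP+0x8.w=0:1` should change only bit 0 in a single step (untested on this machine).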

    And that's it!


    9 changes: 9 additions & 0 deletions fix-intel_wifi_aer-avell_g1513_fire_v3.service
    @@ -0,0 +1,9 @@
    [Unit]
    Description=Fix for AER's excessive logging for Intel Wireless (Avell G1513 Fire V3)
    After=systemd-modules-load.service
    [Service]
    Type=oneshot
    # Change your device and vendor (or bus/slot/function accordingly)
    ExecStart=/usr/bin/setpci -v -d 8086:a114 CAP_EXP+0x8.w=0xe
    RemainAfterExit=yes