From: Bjorn Helgaas <bjorn.helgaas@gmail.com>
To: linux-pci@vger.kernel.org
Subject: Fwd: [Bug 221482] New: PCI/ASPM: Intel Battlemage (Arc Pro B70) bricks at boot when `pcie_aspm.policy=powersupersave` enables ASPM_L1.1 on AMD root port link
Date: Thu, 7 May 2026 16:14:10 -0500
Message-ID: <CABhMZUUCAa=hB18KqxzoCOWiD9V+hFtXpRLSv9989r+qFk1o1g@mail.gmail.com>
In-Reply-To: <bug-221482-41252@https.bugzilla.kernel.org/>

[reporter in bcc]

---------- Forwarded message ---------

https://bugzilla.kernel.org/show_bug.cgi?id=221482

            Bug ID: 221482
           Summary: PCI/ASPM: Intel Battlemage (Arc Pro B70) bricks at
                    boot when `pcie_aspm.policy=powersupersave` enables
                    ASPM_L1.1 on AMD root port link

Created attachment 310052
  --> https://bugzilla.kernel.org/attachment.cgi?id=310052&action=edit
b70 works

PCI/ASPM: Intel Battlemage (Arc Pro B70) bricks at boot when
pcie_aspm.policy=powersupersave enables ASPM_L1.1 on AMD root port link

================================================================
SUMMARY
================================================================

On Linux 7.0.3, an Intel Arc Pro B70 (Battlemage / BMG-G31, GPU PCI
ID 8086:e223) plugged into an AMD Ryzen 9 5950X system fails to wake
from D3cold during PCI core enumeration when the kernel is booted
with pcie_aspm.policy=powersupersave. The card then stays
inaccessible until the system is rebooted with a different policy;
pcie_aspm.policy=powersave (L0s+L1, no substates) works correctly.

The failure surfaces in PCI core first; downstream xe driver bind
then fails with -EPROTO:

    pcieport 0000:02:01.0: Unable to change power state from D3cold
                           to D0, device inaccessible
    pcieport 0000:02:02.0: Unable to change power state from D3cold
                           to D0, device inaccessible
    xe 0000:03:00.0: Unable to change power state from D3cold to D0,
                     device inaccessible
    xe 0000:03:00.0: [drm] Running in SR-IOV VF mode
                     [misdetected: dead config space reads as 0xff]
    xe 0000:03:00.0: [drm] *ERROR* VF: Tile0: GT0: Failed to reset
                     GuC state (-EPROTO)
    xe 0000:03:00.0: probe with driver xe failed with error -71

After the brick, "lspci -vvv -s 03:00.0" reports
"!!! Unknown header type 7f" -- the canonical signature of a PCI
device whose config space reads return all-ones, i.e. the link to the
device is dead.
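
For readers who want to check for this state without lspci, here is a
minimal sketch. It assumes the raw config space has already been read
into a bytes object (for example from the sysfs "config" file of the
device); the function names are hypothetical and are not part of the
attached capture script. The header type lives at offset 0x0e, and
bit 7 is the multi-function flag, so an all-ones read (0xff) leaves
0x7f once that bit is masked off, which is the value lspci complains
about.

```python
def is_config_space_dead(cfg: bytes) -> bool:
    """True if vendor ID and device ID (first 4 bytes) read as all-ones."""
    return len(cfg) >= 4 and cfg[:4] == b"\xff\xff\xff\xff"


def header_layout(cfg: bytes) -> int:
    """Header type byte at offset 0x0e with the multi-function bit masked."""
    return cfg[0x0E] & 0x7F


# Hypothetical usage on the affected system (path is standard sysfs):
#   with open("/sys/bus/pci/devices/0000:03:00.0/config", "rb") as f:
#       cfg = f.read(64)
#   print(is_config_space_dead(cfg), hex(header_layout(cfg)))
```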


================================================================
HARDWARE
================================================================

CPU / root complex:
    AMD Ryzen 9 5950X (Starship/Matisse). The root port hosting the
    BMG card is 0000:00:01.1 -- "Advanced Micro Devices, Inc. [AMD]
    Starship/Matisse GPP Bridge" (subsystem 1022:1453).

GPU:
    Intel Arc Pro B70 -- 8086:e223 (BMG-G31, subsystem 8086:1701).

On-card topology -- the card has a two-layer on-board PCIe switch:
    0000:01:00.0  Intel 8086:e2ff -- BMG card upstream switch port,
                                     PCIe 5.0 x16 capable (currently
                                     downgraded to Gen4 x16).
    0000:02:01.0  Intel 8086:e2f0 -- BMG card downstream switch
                                     port, PCIe Gen1 x1 internal.
    0000:03:00.0  Intel 8086:e223 -- GPU endpoint, PCIe Gen1 x1
                                     internal.

Other:
    BIOS has PCIe ASPM enabled in firmware. pcie_aspm=force is NOT
    set on the kernel command line. Motherboard: ASRock X570
    (specifics in attached dmidecode.txt).


================================================================
REPRODUCER
================================================================

Boot any kernel >= 7.0 with kernel command line containing:

    pcie_aspm.policy=powersupersave xe.force_probe=*

(Also reproduces under earlier 6.x kernels.)

Reverting the cmdline to "pcie_aspm.policy=powersave" and rebooting
restores the card. No firmware reset is required between attempts --
the brick is purely a runtime link-state failure during kernel boot.
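
To confirm which policy a given boot actually ended up with, the
active policy can be read back from
/sys/module/pcie_aspm/parameters/policy, where the kernel lists all
policies and puts the active one in square brackets. The sketch below
only parses that string; the function name is hypothetical, and on a
boot where the card has already failed the read-back shows the policy
in force, not whether the link survived.

```python
import re


def active_aspm_policy(text: str) -> str:
    """Return the bracketed entry from the pcie_aspm policy parameter."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError("no bracketed (active) policy found")
    return match.group(1)


# Hypothetical usage:
#   with open("/sys/module/pcie_aspm/parameters/policy") as f:
#       print(active_aspm_policy(f.read()))
```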


================================================================
ASPM NEGOTIATION
================================================================

Captured with "lspci -vvv" on a working policy=powersave boot
(attached: 20260507-204348-powersave-7.0.3.tar.zst).

Link 1: 00:01.1 AMD root  <->  01:00.0 BMG upstream
    Root-port end (AMD root port 00:01.1, L1SubCap):
        PCI-PM_L1.2-  PCI-PM_L1.1+  ASPM_L1.2-  ASPM_L1.1+
    Device end (BMG upstream port 01:00.0, L1SubCap):
        PCI-PM_L1.2+  PCI-PM_L1.1+  ASPM_L1.2+  ASPM_L1.1+
    Active L1SubCtl1 on both ends under policy=powersave:
        PCI-PM_L1.2-  PCI-PM_L1.1-  ASPM_L1.2-  ASPM_L1.1-

Link 2: 01:00.0  <->  02:01.0   (card-internal switch)
    No L1SS capability advertised on either end.

Link 3: 02:01.0  <->  03:00.0   (card-internal to GPU)
    No L1SS capability advertised on either end.

Conclusion: only Link 1 -- the platform-facing AMD<->BMG link -- is
L1SS-capable on both ends, and the substates common to both ends are
the L1.1 pair only (ASPM_L1.1 and PCI-PM_L1.1; the AMD GPP root port
advertises L1.1 but not L1.2). With policy=powersupersave, the kernel
arms ASPM_L1.1 on this link. After that, every D3cold->D0 transition
fails.
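
The capability intersection for Link 1 can be reproduced mechanically
from the two L1SubCap lines in the lspci captures. The following is a
rough sketch, not the kernel's own negotiation code; the helper names
are hypothetical. It treats a flag as supported only when lspci
prints it with a trailing "+". On the lines quoted above it yields
PCI-PM_L1.1 and ASPM_L1.1, i.e. no L1.2 substate is available to
either mechanism.

```python
L1SS_FLAGS = ("PCI-PM_L1.2", "PCI-PM_L1.1", "ASPM_L1.2", "ASPM_L1.1")


def parse_l1ss(line: str) -> set:
    """Substates marked '+' in an lspci L1SubCap or L1SubCtl1 line."""
    tokens = line.split()
    return {flag for flag in L1SS_FLAGS if flag + "+" in tokens}


def common_substates(end_a: str, end_b: str) -> set:
    """Substates advertised by both ends of a link."""
    return parse_l1ss(end_a) & parse_l1ss(end_b)


root = "L1SubCap:  PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+"
bmg = "L1SubCap:  PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+"
print(sorted(common_substates(root, bmg)))
```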

Both ends advertise multi-retimer support (Retimer+ 2Retimers+ on
the AMD root port and on the BMG upstream port). Retimers + L1SS
have a history of wake-recovery problems on other platforms; this
may be the same class of issue.


================================================================
TIMELINE -- failed boot, kernel 7.0.3
================================================================

Excerpted from dmesg-relevant.txt in the powersupersave capture:

    28.792s  pcieport 0000:00:01.1: PME: Signaling with IRQ 48
                     [AMD root port for BMG]
    28.842s  pcieport 0000:02:01.0: Unable to change power state from
                     D3cold to D0, device inaccessible
    28.843s  pcieport 0000:02:02.0: Unable to change power state from
                     D3cold to D0, device inaccessible
    ...
    29.034s  xe 0000:03:00.0: Unable to change power state from
                     D3cold to D0, device inaccessible
    29.035s  xe 0000:03:00.0: [drm] Running in SR-IOV VF mode
    29.035s  xe 0000:03:00.0: [drm] *ERROR* VF: Tile0: GT0: Failed
                     to reset GuC state (-EPROTO)
    29.035s  xe 0000:03:00.0: probe with driver xe failed with
                     error -71

The PCI core's first wake attempt at 28.842s (on 02:01.0, the
immediate parent bridge of the BMG GPU) fails before the xe probe
runs. This indicates the failure is in the PCI/ASPM layer, not in
xe; xe just sees the resulting dead config space and misclassifies
the PF as a VF.


================================================================
WORKING-POLICY LSPCI EXCERPTS  (relevant capabilities)
================================================================

policy=powersave baseline, root port 00:01.1:

    LnkCap:  Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
    LnkCtl:  ASPM L1 Enabled
    LnkSta:  Speed 16GT/s, Width x16
    Capabilities: [370 v1] L1 PM Substates
        L1SubCap:  PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+
                   L1_PM_Substates+
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
        L1SubCtl2:

policy=powersave baseline, BMG upstream 01:00.0:

    LnkCap:  Speed 32GT/s, Width x16, ASPM L1, Exit Latency L1 <32us
    LnkCtl:  ASPM L1 Enabled
    LnkSta:  Speed 16GT/s (downgraded), Width x16
    Capabilities: [244 v1] L1 PM Substates
        L1SubCap:  PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
                   L1_PM_Substates+
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
        L1SubCtl2: T_PwrOn=14us


================================================================
PROPOSED FIX
================================================================

Disable both L1 substates (L1.1 and L1.2) on the BMG card's upstream
switch port (8086:e2ff) via a DECLARE_PCI_FIXUP_FINAL. Standard ASPM
L1 still applies, so the link still benefits from the deepest link
state the BMG silicon handles correctly. The quirk keys on the card
upstream port, which is shared across the BMG product family, so it
covers all current BMG SKUs without enumerating individual
GPU-endpoint IDs.

The patch is in the attached intel-bmg-disable-l1ss.patch. With the
patch applied, pcie_aspm.policy=powersupersave is expected to boot
cleanly on this hardware; verification was still in progress at the
time of this report.

Open questions for review:

  1. Is L1.1 (not L1.2) genuinely the trigger? The AMD root port
     does not advertise L1.2, so the kernel cannot have armed L1.2
     -- yet "powersupersave" is what flips this from a non-failure
     to a failure. Confirming that L1.1 alone reproduces (e.g. via a
     more targeted fixup that only disables L1.1) would narrow the
     root cause and help decide whether the quirk should also apply
     to other AMD-platform <-> BMG combinations or only to specific
     root-port stepping.

  2. Is the underlying defect in the AMD Starship root port (cannot
     wake the link from L1.1) or in the BMG e2ff upstream port
     (cannot exit L1.1 cleanly)? If the former, future BMG cards on
     Intel platforms may not need this quirk; if the latter, the
     quirk is correct for BMG everywhere. We don't have a non-AMD
     reproducer to disambiguate.

  3. Should the quirk also apply to the AMD Starship/Matisse GPP
     Bridge itself (1022:1483 / 1022:1484-class IDs, see
     lspci-nn.txt)? That would be a broader brushstroke but might
     protect other devices presenting the same negotiation.


================================================================
WORKAROUND IN USE
================================================================

Until the quirk lands upstream, downstream users on this hardware
must boot with pcie_aspm.policy=powersave (or default), losing
~25 W of idle savings that the deeper substates would otherwise
provide.


================================================================
ATTACHMENTS
================================================================

Tarballs produced by debug/20260507-aspm-capture.sh:

    20260507-204348-powersave-7.0.3.tar.zst
        -- working baseline

    20260507-205055-powersupersave-7.0.3.tar.zst
        -- failed reproduction

Each tarball contains:

    manifest.txt           kernel, policy, hostname, GPU BDFs
    cmdline.txt            kernel command line
    uname.txt              kernel version
    nixos.txt              userspace metadata
    dmidecode.txt          BIOS/board info
    lspci-tree.txt         PCI topology
    lspci-nn.txt           PCI device list
    lspci-vvv-all.txt      full system lspci -vvv

    gpu-03_00_0/           per-device captures for the GPU and
                           every PCI ancestor up to the root
                           complex:
        lspci-vvv.txt          GPU
        parent-0-02_01_0.txt   BMG card-internal downstream switch
        parent-1-01_00_0.txt   BMG card upstream port (e2ff)
        parent-2-00_01_1.txt   AMD root port
        sysfs.txt              selected sysfs attributes

    dmesg-full.txt                full kernel ring buffer
    dmesg-relevant.txt            filtered for PCI/xe/ASPM/L1
    journal-kernel-current-boot.txt
    journal-kernel-prev-boot.txt
    drivers.txt                   xe / i915 driver state,
                                  /sys/class/drm

Patch: intel-bmg-disable-l1ss.patch  (attached separately)

NixOS 26.05 (nixpkgsRevision:
    549bd84d6279f9852cae6225e372cc67fb91a4c1)

Kernel:
    7.0.3 #1-NixOS SMP PREEMPT_DYNAMIC Thu Apr 30 09:13:05 UTC 2026
