From: Bjorn Helgaas <bjorn.helgaas@gmail.com>
To: linux-pci@vger.kernel.org
Subject: Fwd: [Bug 221482] New: PCI/ASPM: Intel Battlemage (Arc Pro B70) bricks at boot when `pcie_aspm.policy=powersupersave` enables ASPM_L1.1 on AMD root port link
Date: Thu, 7 May 2026 16:14:10 -0500
Message-ID: <CABhMZUUCAa=hB18KqxzoCOWiD9V+hFtXpRLSv9989r+qFk1o1g@mail.gmail.com>
In-Reply-To: <bug-221482-41252@https.bugzilla.kernel.org/>
[reporter in bcc]
---------- Forwarded message ---------
https://bugzilla.kernel.org/show_bug.cgi?id=221482
Bug ID: 221482
Summary: PCI/ASPM: Intel Battlemage (Arc Pro B70) bricks at
boot when `pcie_aspm.policy=powersupersave` enables
ASPM_L1.1 on AMD root port link
Created attachment 310052
--> https://bugzilla.kernel.org/attachment.cgi?id=310052&action=edit
b70 works
PCI/ASPM: Intel Battlemage (Arc Pro B70) bricks at boot when
pcie_aspm.policy=powersupersave enables ASPM_L1.1 on AMD root port link
================================================================
SUMMARY
================================================================
On Linux 7.0.3, an Intel Arc Pro B70 (Battlemage / BMG-G31, GPU PCI
ID 8086:e223) plugged into an AMD Ryzen 9 5950X system fails to wake
from D3cold during PCI core enumeration when the kernel is booted
with pcie_aspm.policy=powersupersave. The card then stays
inaccessible until the system is rebooted with a different policy.
pcie_aspm.policy=powersave (L0s+L1, no substates) works correctly.
The failure surfaces in PCI core first; downstream xe driver bind
then fails with -EPROTO:
  pcieport 0000:02:01.0: Unable to change power state from D3cold to D0, device inaccessible
  pcieport 0000:02:02.0: Unable to change power state from D3cold to D0, device inaccessible
  xe 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
  xe 0000:03:00.0: [drm] Running in SR-IOV VF mode
      [misdetected: dead config space reads as 0xff]
  xe 0000:03:00.0: [drm] *ERROR* VF: Tile0: GT0: Failed to reset GuC state (-EPROTO)
  xe 0000:03:00.0: probe with driver xe failed with error -71
After the brick, "lspci -vvv -s 03:00.0" reports
"!!! Unknown header type 7f" -- the canonical signature of a PCI
device whose config space reads return all-ones, i.e. the link to the
device is dead.
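To illustrate why all-ones reads produce exactly "header type 7f", here is a minimal Python sketch. It assumes the standard layout of the one-byte Header Type register at config offset 0x0E (bit 7 is the multi-function flag, bits 6:0 select the header layout). The function and table names are hypothetical and are not taken from lspci.

```python
# Sketch: decode the Header Type byte the way a config-space dump would.
# A dead link returns 0xFF for every byte, so the layout field becomes 0x7F,
# which matches none of the defined layouts.

KNOWN_LAYOUTS = {
    0x00: "endpoint (type 0)",
    0x01: "PCI-to-PCI bridge (type 1)",
    0x02: "CardBus bridge (type 2)",
}


def decode_header_type(raw):
    """Split the raw Header Type byte into its two fields."""
    return {
        "multifunction": bool(raw & 0x80),  # bit 7
        "layout": raw & 0x7F,               # bits 6:0
    }


dead = decode_header_type(0xFF)
print(f"{dead['layout']:#04x}", KNOWN_LAYOUTS.get(dead["layout"], "unknown"))
# prints: 0x7f unknown
```

The same reasoning applies to any other register read over the dead link: every field decodes as all bits set.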
================================================================
HARDWARE
================================================================
CPU / root complex:
AMD Ryzen 9 5950X (Starship/Matisse). The root port hosting the
BMG card is 0000:00:01.1 -- "Advanced Micro Devices, Inc. [AMD]
Starship/Matisse GPP Bridge" (subsystem 1022:1453).
GPU:
Intel Arc Pro B70 -- 8086:e223 (BMG-G31, subsystem 8086:1701).
On-card topology -- the card has a two-layer on-board PCIe switch:
  0000:01:00.0  Intel 8086:e2ff -- BMG card upstream switch port,
                PCIe 5.0 x16 capable (currently downgraded to
                Gen4 x16).
  0000:02:01.0  Intel 8086:e2f0 -- BMG card downstream switch port,
                PCIe Gen1 x1 internal.
  0000:03:00.0  Intel 8086:e223 -- GPU endpoint, PCIe Gen1 x1
                internal.
Other:
BIOS has PCIe ASPM enabled in firmware. pcie_aspm=force is NOT
set on the kernel command line. Motherboard: ASRock X570
(specifics in attached dmidecode.txt).
================================================================
REPRODUCER
================================================================
Boot any kernel >= 7.0 with kernel command line containing:
pcie_aspm.policy=powersupersave xe.force_probe=*
(Also reproduces under earlier 6.x kernels.)
Reverting the cmdline to "pcie_aspm.policy=powersave" and rebooting
restores the card. No firmware reset is required between attempts --
the brick is purely a runtime link-state failure during kernel boot.
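When comparing captures, it helps to confirm which policy a given boot actually used. The sketch below is a small Python helper under two assumptions: that the policy is read from /sys/module/pcie_aspm/parameters/policy, which lists all policies and marks the active one in square brackets, and that the command line is passed in as a string (for example the contents of cmdline.txt). The function names are hypothetical.

```python
# Sketch: extract the ASPM policy from the module parameter text and from a
# kernel command line. Both helpers work on strings so they can be run
# against saved captures as well as a live system.
import re


def active_aspm_policy(param_text):
    """Return the bracketed (active) entry, e.g. 'powersupersave'."""
    match = re.search(r"\[(\w+)\]", param_text)
    if match is None:
        raise ValueError("no bracketed policy found")
    return match.group(1)


def policy_from_cmdline(cmdline):
    """Return the pcie_aspm.policy= value, or None if it is not set."""
    for token in cmdline.split():
        if token.startswith("pcie_aspm.policy="):
            return token.split("=", 1)[1]
    return None


print(active_aspm_policy("default performance powersave [powersupersave]\n"))
# prints: powersupersave
print(policy_from_cmdline("pcie_aspm.policy=powersupersave xe.force_probe=*"))
# prints: powersupersave
```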
================================================================
ASPM NEGOTIATION
================================================================
Captured with "lspci -vvv" on a working policy=powersave boot
(attached: 20260507-204348-powersave-7.0.3.tar.zst).
Link 1: 00:01.1 AMD root <-> 01:00.0 BMG upstream
Upstream end (AMD root port, L1SubCap):
PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+
Downstream end (BMG upstream switch port, L1SubCap):
PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
Active L1SubCtl1 under policy=powersave:
PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
Link 2: 01:00.0 <-> 02:01.0 (card-internal switch)
No L1SS capability advertised on either end.
Link 3: 02:01.0 <-> 03:00.0 (card-internal to GPU)
No L1SS capability advertised on either end.
Conclusion: only Link 1 -- the platform-facing AMD<->BMG link -- is
L1SS-capable on both ends, and the intersection is L1.1 only:
ASPM_L1.1 and PCI-PM_L1.1 are common to both ends, L1.2 is not
(the AMD GPP root port advertises L1.1 but not L1.2). With
policy=powersupersave, the kernel arms ASPM_L1.1 on this link. After
that, every D3cold->D0 transition fails.
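The intersection argument can be checked mechanically against the two L1SubCap lines quoted for Link 1. The following Python sketch parses the "+"/"-" flags as lspci prints them and keeps only the substates set on both ends. The helper name is hypothetical. Note that the quoted capability lines share PCI-PM_L1.1 as well as ASPM_L1.1.

```python
# Sketch: compute which L1 substates both ends of a link advertise,
# using the L1SubCap strings from the powersave capture.

def parse_flags(line):
    """Map each lspci flag name to True ('+') or False ('-')."""
    flags = {}
    for token in line.split():
        if token[-1] in "+-":
            flags[token[:-1]] = token[-1] == "+"
    return flags


root_port = parse_flags("PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+")
bmg_upstream = parse_flags("PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+")

common = sorted(
    name for name, set_on_root in root_port.items()
    if set_on_root and bmg_upstream.get(name, False)
)
print(common)
# prints: ['ASPM_L1.1', 'PCI-PM_L1.1']
```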
Both ends advertise multi-retimer support (Retimer+ 2Retimers+ on
the AMD root port and on the BMG upstream port). Retimers + L1SS
have a history of wake-recovery problems on other platforms; this
may be the same class of issue.
================================================================
TIMELINE -- failed boot, kernel 7.0.3
================================================================
Excerpted from dmesg-relevant.txt in the powersupersave capture:
  28.792s pcieport 0000:00:01.1: PME: Signaling with IRQ 48
      [AMD root port for BMG]
  28.842s pcieport 0000:02:01.0: Unable to change power state from D3cold to D0, device inaccessible
  28.843s pcieport 0000:02:02.0: Unable to change power state from D3cold to D0, device inaccessible
  ...
  29.034s xe 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
  29.035s xe 0000:03:00.0: [drm] Running in SR-IOV VF mode
  29.035s xe 0000:03:00.0: [drm] *ERROR* VF: Tile0: GT0: Failed to reset GuC state (-EPROTO)
  29.035s xe 0000:03:00.0: probe with driver xe failed with error -71
The PCI core's first wake attempt, at 28.842s on 0000:02:01.0 (the
immediate parent bridge of the BMG GPU), fails before the xe probe
runs. This points to the failure being in the PCI/ASPM layer, not in
xe; xe just sees the resulting dead config space and misclassifies
the PF as a VF.
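The ordering claim can be reproduced from the excerpt with a short Python sketch. It assumes the simplified "<seconds>s <driver> <BDF>: <message>" line format used in the timeline, with wrapped lines rejoined, rather than raw dmesg output. The regex and function names are hypothetical.

```python
# Sketch: find the first D3cold->D0 failure and compare it with the first
# message logged by xe, using lines copied from the timeline.
import re

LOG = """\
28.792s pcieport 0000:00:01.1: PME: Signaling with IRQ 48
28.842s pcieport 0000:02:01.0: Unable to change power state from D3cold to D0, device inaccessible
28.843s pcieport 0000:02:02.0: Unable to change power state from D3cold to D0, device inaccessible
29.034s xe 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
29.035s xe 0000:03:00.0: probe with driver xe failed with error -71
"""

LINE = re.compile(r"^(?P<t>\d+\.\d+)s (?P<drv>\S+) (?P<bdf>[0-9a-f:.]+): (?P<msg>.*)$")


def events(log):
    """Yield (time, driver, bdf, message) for each parsable line."""
    for line in log.splitlines():
        m = LINE.match(line)
        if m:
            yield float(m["t"]), m["drv"], m["bdf"], m["msg"]


def first_wake_failure(log):
    for t, drv, bdf, msg in events(log):
        if "Unable to change power state" in msg:
            return t, drv, bdf
    return None


def first_event_from(log, driver):
    for t, drv, _bdf, _msg in events(log):
        if drv == driver:
            return t
    return None


print(first_wake_failure(LOG))     # prints: (28.842, 'pcieport', '0000:02:01.0')
print(first_event_from(LOG, "xe")) # prints: 29.034
```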
================================================================
WORKING-POLICY LSPCI EXCERPTS (relevant capabilities)
================================================================
policy=powersave baseline, root port 00:01.1:
LnkCap: Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <64us
LnkCtl: ASPM L1 Enabled
LnkSta: Speed 16GT/s, Width x16
Capabilities: [370 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+
L1_PM_Substates+
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
L1SubCtl2:
policy=powersave baseline, BMG upstream 01:00.0:
LnkCap: Speed 32GT/s, Width x16, ASPM L1, Exit Latency L1 <32us
LnkCtl: ASPM L1 Enabled
LnkSta: Speed 16GT/s (downgraded), Width x16
Capabilities: [244 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
L1_PM_Substates+
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
L1SubCtl2: T_PwrOn=14us
================================================================
PROPOSED FIX
================================================================
Disable both L1 substates (L1.1 and L1.2) on the BMG card's upstream
switch port (8086:e2ff) via a DECLARE_PCI_FIXUP_FINAL quirk. Standard
ASPM L1 still applies, so the link still benefits from the deepest
link state the BMG silicon handles correctly. The quirk keys on the
card upstream port, which is shared across the BMG product family, so
it covers all current BMG SKUs without enumerating individual
GPU-endpoint IDs.
The patch is in the attached intel-bmg-disable-l1ss.patch. With the
patch applied, pcie_aspm.policy=powersupersave boots cleanly on this
hardware (verification in progress at time of report).
Open questions for review:
1. Is L1.1 (not L1.2) genuinely the trigger? The AMD root port
does not advertise L1.2, so the kernel cannot have armed L1.2
-- yet "powersupersave" is what flips this from a non-failure
to a failure. Confirming that L1.1 alone reproduces (e.g. via a
more targeted fixup that only disables L1.1) would narrow the
root cause and help decide whether the quirk should also apply
to other AMD-platform <-> BMG combinations or only to specific
root-port steppings.
2. Is the underlying defect in the AMD Starship root port (cannot
wake the link from L1.1) or in the BMG e2ff upstream port
(cannot exit L1.1 cleanly)? If the former, future BMG cards on
Intel platforms may not need this quirk; if the latter, the
quirk is correct for BMG everywhere. We don't have a non-AMD
reproducer to disambiguate.
3. Should the quirk also apply to the AMD Starship/Matisse GPP
Bridge itself (1022:1483 / 1022:1484-class IDs, see
lspci-nn.txt)? That would be a broader brushstroke but might
protect other devices presenting the same negotiation.
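For question 1, one possible experiment that needs no kernel rebuild is to boot with policy=powersave and then toggle individual substates at runtime through the per-link ASPM sysfs attributes. The Python sketch below assumes a kernel built with CONFIG_PCIEASPM that exposes a link/ directory with l1_1_aspm, l1_2_aspm, l1_1_pcipm and l1_2_pcipm on the downstream end of the link, which for Link 1 would be 0000:01:00.0. The helper names and the sysfs_root parameter are hypothetical, and the card may become inaccessible until reboot if L1.1 is indeed the trigger.

```python
# Sketch: read and toggle the L1 substate controls for one link via sysfs.
# sysfs_root is a parameter so the helpers can be exercised on a scratch tree.
from pathlib import Path

L1SS_ATTRS = ("l1_1_aspm", "l1_2_aspm", "l1_1_pcipm", "l1_2_pcipm")


def link_dir(bdf, sysfs_root="/sys"):
    return Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "link"


def read_link_states(bdf, sysfs_root="/sys"):
    """Return {attribute: enabled} for every L1SS attribute that exists."""
    states = {}
    for attr in L1SS_ATTRS:
        path = link_dir(bdf, sysfs_root) / attr
        if path.exists():
            states[attr] = path.read_text().strip() == "1"
    return states


def set_link_state(bdf, attr, enable, sysfs_root="/sys"):
    """Write 1 or 0 to a single L1SS attribute (requires root on real sysfs)."""
    if attr not in L1SS_ATTRS:
        raise ValueError(f"unknown attribute: {attr}")
    (link_dir(bdf, sysfs_root) / attr).write_text("1" if enable else "0")


# Example (not executed here): enable only ASPM L1.1 on Link 1, then try a
# D3cold->D0 cycle and watch dmesg.
#   set_link_state("0000:01:00.0", "l1_1_aspm", True)
#   print(read_link_states("0000:01:00.0"))
```

Enabling one attribute at a time would separate ASPM_L1.1 from PCI-PM_L1.1, both of which are in the common capability set for this link.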
================================================================
WORKAROUND IN USE
================================================================
Until the quirk lands upstream, downstream users on this hardware
must boot with pcie_aspm.policy=powersave (or default), losing
~25 W of idle savings that the deeper substates would otherwise
provide.
================================================================
ATTACHMENTS
================================================================
Tarballs produced by debug/20260507-aspm-capture.sh:
20260507-204348-powersave-7.0.3.tar.zst
-- working baseline
20260507-205055-powersupersave-7.0.3.tar.zst
-- failed reproduction
Each tarball contains:
manifest.txt kernel, policy, hostname, GPU BDFs
cmdline.txt kernel command line
uname.txt kernel version
nixos.txt userspace metadata
dmidecode.txt BIOS/board info
lspci-tree.txt PCI topology
lspci-nn.txt PCI device list
lspci-vvv-all.txt full system lspci -vvv
gpu-03_00_0/ per-device captures for the GPU and
every PCI ancestor up to the root
complex:
lspci-vvv.txt GPU
parent-0-02_01_0.txt BMG card-internal downstream switch
parent-1-01_00_0.txt BMG card upstream port (e2ff)
parent-2-00_01_1.txt AMD root port
sysfs.txt selected sysfs attributes
dmesg-full.txt full kernel ring buffer
dmesg-relevant.txt filtered for PCI/xe/ASPM/L1
journal-kernel-current-boot.txt
journal-kernel-prev-boot.txt
drivers.txt xe / i915 driver state,
/sys/class/drm
Patch: intel-bmg-disable-l1ss.patch (attached separately)
NixOS 26.05 (nixpkgsRevision:
549bd84d6279f9852cae6225e372cc67fb91a4c1)
Kernel:
7.0.3 #1-NixOS SMP PREEMPT_DYNAMIC Thu Apr 30 09:13:05 UTC 2026