From: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: <linux-pci@vger.kernel.org>, Keith Busch <kbusch@kernel.org>,
	"Kai-Heng Feng" <kai.heng.feng@canonical.com>,
	"Rafael J. Wysocki" <rafael.j.wysocki@intel.com>,
	Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Subject: Re: [PATCH] Documentation: PCI: add vmd documentation
Date: Fri, 19 Apr 2024 15:18:19 -0700
Message-ID: <d54e79c3-7a73-4ae8-b773-ae7c96559a31@intel.com>
In-Reply-To: <20240419211400.GA295110@bhelgaas>

On 4/19/2024 2:14 PM, Bjorn Helgaas wrote:
> [+cc Kai-Heng, Rafael, Lorenzo since we're talking about 04b12ef163d1
> ("PCI: vmd: Honor ACPI _OSC on PCIe features")]
> 
> On Thu, Apr 18, 2024 at 02:51:19PM -0700, Paul M Stillwell Jr wrote:
>> On 4/18/2024 11:26 AM, Bjorn Helgaas wrote:
>>> On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
>>>> Adding documentation for the Intel VMD driver and updating the index
>>>> file to include it.
>>>>
>>>> Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
>>>> ---
>>>>    Documentation/PCI/controller/vmd.rst | 51 ++++++++++++++++++++++++++++
>>>>    Documentation/PCI/index.rst          |  1 +
>>>>    2 files changed, 52 insertions(+)
>>>>    create mode 100644 Documentation/PCI/controller/vmd.rst
>>>>
>>>> diff --git a/Documentation/PCI/controller/vmd.rst b/Documentation/PCI/controller/vmd.rst
>>>> new file mode 100644
>>>> index 000000000000..e1a019035245
>>>> --- /dev/null
>>>> +++ b/Documentation/PCI/controller/vmd.rst
>>>> @@ -0,0 +1,51 @@
>>>> +.. SPDX-License-Identifier: GPL-2.0+
>>>> +
>>>> +=================================================================
>>>> +Linux Base Driver for the Intel(R) Volume Management Device (VMD)
>>>> +=================================================================
>>>> +
>>>> +Intel vmd Linux driver.
>>>> +
>>>> +Contents
>>>> +========
>>>> +
>>>> +- Overview
>>>> +- Features
>>>> +- Limitations
>>>> +
>>>> +The Intel VMD provides the means to provide volume management across separate
>>>> +PCI Express HBAs and SSDs without requiring operating system support or
>>>> +communication between drivers. It does this by obscuring each storage
>>>> +controller from the OS, but allowing a single driver to be loaded that would
>>>> +control each storage controller. A Volume Management Device (VMD) provides a
>>>> +single device for a single storage driver. The VMD resides in the IIO root
>>>
>>> I'm not sure IIO (and PCH below) are really relevant to this.  I think
>>
>> I'm trying to describe where in the CPU architecture VMD exists because it's
>> not like other devices. It's not like a storage device or networking device
>> that is plugged in somewhere; it exists as part of the CPU (in the IIO). I'm
>> ok removing it, but it might be confusing to someone looking at the
>> documentation. I'm also close to this so it may be clear to me, but
>> confusing to others (which I know it is) so any help making it clearer would
>> be appreciated.
> 
> The vmd driver binds to a PCI device, and it doesn't matter where it's
> physically implemented.
> 

Ok

>>> we really just care about the PCI topology enumerable by the OS.  If
>>> they are relevant, expand them on first use as you did for VMD so we
>>> have a hint about how to learn more about it.
>>
>> I don't fully understand this comment. The PCI topology behind VMD is not
>> enumerable by the OS unless we are considering the vmd driver the OS. If the
>> user enables VMD in the BIOS and the vmd driver isn't loaded, then the OS
>> never sees the devices behind VMD.
>>
>> The only reason the devices are seen by the OS is that the VMD driver does
>> some mapping when the VMD driver loads during boot.
> 
> Yes.  I think it's worth mentioning these details about the hierarchy
> behind VMD, e.g., how their config space is reached, how driver MMIO
> accesses are routed, how device interrupts are routed, etc.
> 
> The 185a383ada2e ("x86/PCI: Add driver for Intel Volume Management
> Device (VMD)") commit log mentioned by Keith is a fantastic start
> and answers several things below.  Putting it in this doc would be
> great because that commit is not very visible.
> 

I'll look at it and work it in (either replace this overview or add to it)

>>>> +the VMD is in a central location to manipulate access to storage devices which
>>>> +may be attached directly to the IIO or indirectly through the PCH. Instead of
>>>> +allowing individual storage devices to be detected by the OS and allow it to
>>>> +load a separate driver instance for each, the VMD provides configuration
>>>> +settings to allow specific devices and root ports on the root bus to be
>>>> +invisible to the OS.
>>>
>>> How are these settings configured?  BIOS setup menu?
>>
>> I believe there are 2 ways this is done:
>>
>> The first is that the system designer creates a design such that some root
>> ports and end points are behind VMD. If VMD is enabled in the BIOS then
>> these devices don't show up to the OS and require a driver to use them (the
>> vmd driver). If VMD is disabled in the BIOS then the devices are seen by the
>> OS at boot time.
>>
>> The second way is that there are settings in the BIOS for VMD. I don't think
>> there are many settings... it's mostly enable/disable VMD
> 
> I think what I want to know here is just "there's a BIOS switch that
> enables VMD, resulting in only the VMD RCiEP being visible, or
> disables VMD, resulting in the VMD RCiEP not being visible (?) and the
> VMD Root Ports and downstream hierarchies being just normal Root
> Ports."  You can wordsmith that; I just wanted to know what
> "configuration settings" referred to so users would know where they
> live and what they mean.
> 

Got it, will update to reflect that setup/config is through the BIOS

>>>> +VMD works by creating separate PCI domains for each VMD device in the system.
>>>> +This makes VMD look more like a host bridge than an endpoint so VMD must try
>>>> +to adhere to the ACPI Operating System Capabilities (_OSC) flags of the system.
> 
>>> I think "creating a separate PCI domain" is a consequence of providing
>>> a new config access mechanism, e.g., a new ECAM region, for devices
>>> below the VMD bridge.  That hardware mechanism is important to
>>> understand because it means those downstream devices are unknown to
>>> anything that doesn't grok the config access mechanism.  For example,
>>> firmware wouldn't know anything about them unless it had a VMD driver.
> 
>>>     - Which devices (VMD bridge, VMD Root Ports, devices below VMD Root
>>>       Ports) are enumerated in the host?
>>
>> Only the VMD device (as a PCI endpoint) is seen by the OS without the vmd
>> driver
> 
> This is the case when VMD is enabled by BIOS, right?  And when the vmd

Yes

> driver sets up the new config space and enumerates behind VMD, we'll
> find the VMD Root Ports and the hierarchies below them?
> 

Yes
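
Roughly, the driver supplies its own config accessors and creates a root
bus in a newly allocated domain, and the PCI core then finds the VMD Root
Ports and the hierarchies below them. As a sketch (not the actual vmd.c
code; the vmd_-prefixed names here are just illustrative):

#include <linux/pci.h>

/*
 * Illustrative config accessors: reads and writes are translated into
 * MMIO accesses through the VMD CFGBAR, so the standard ECAM/CF8 paths
 * (and therefore firmware) never see the devices behind VMD.
 */
static int vmd_cfg_read(struct pci_bus *bus, unsigned int devfn,
			int where, int size, u32 *value)
{
	/* ... index into the CFGBAR window ... */
	return PCIBIOS_SUCCESSFUL;
}

static int vmd_cfg_write(struct pci_bus *bus, unsigned int devfn,
			 int where, int size, u32 value)
{
	/* ... index into the CFGBAR window ... */
	return PCIBIOS_SUCCESSFUL;
}

static struct pci_ops vmd_ops = {
	.read  = vmd_cfg_read,
	.write = vmd_cfg_write,
};

/* Called from the driver's probe path once the VMD BARs are mapped. */
static int vmd_enumerate(struct device *parent, void *sysdata,
			 struct list_head *resources)
{
	struct pci_bus *bus;

	/*
	 * sysdata carries a newly allocated domain number, so "bus 0"
	 * here is bus 0 of the new VMD domain, not the host's bus 0.
	 */
	bus = pci_create_root_bus(parent, 0, &vmd_ops, sysdata, resources);
	if (!bus)
		return -ENODEV;

	pci_scan_child_bus(bus);	/* finds the VMD Root Ports and below */
	pci_bus_add_devices(bus);	/* lets drivers (e.g. nvme) bind */
	return 0;
}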

> If BIOS leaves the VMD functionality disabled, I guess those VMD Root
> Ports and the hierarchies below them are enumerated normally, without
> the vmd driver being involved?  (Maybe the VMD RCiEP itself is

Yes

> disabled, so we won't enumerate it and we won't try to bind the vmd
> driver to it?)
> 

Yes

>>>     - Which devices are passed through to a virtual guest and enumerated
>>>       there?
>>
>> All devices under VMD are passed to a virtual guest
> 
> So the guest will see the VMD Root Ports, but not the VMD RCiEP
> itself?
> 

My understanding is that the guest will see the VMD device, and the vmd
driver in the guest will then enumerate the devices behind it.

>>>     - Where does the vmd driver run (host or guest or both)?
>>
>> I believe the answer is both.
> 
> If the VMD RCiEP isn't passed through to the guest, how can the vmd
> driver do anything in the guest?
> 

The VMD device is passed through to the guest. It works just like bare
metal: the guest OS detects the VMD device and loads the vmd driver,
which then enumerates the devices behind it for the guest.

>>>     - Who (host or guest) runs the _OSC for the new VMD domain?
>>
>> I believe the answer here is neither :) This has been an issue since commit
>> 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"). I've submitted
>> this patch (https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/)
>> to attempt to fix the issue.
> 
> IIUC we make no attempt to evaluate _OSC for the new VMD domain (and
> firmware likely would not supply one anyway since it doesn't know how
> to enumerate anything there).  So 04b12ef163d1 just assumes the _OSC
> that applies to the VMD RCiEP also applies to the new VMD domain below
> the VMD.
> 
> If that's what we want to happen, we should state this explicitly.
> But I don't think it *is* what we want; see below.
> 
>> You are much more of an expert in this area than I am, but as far as I can
>> tell the only way the _OSC bits get cleared is via ACPI (specifically this
>> code https://elixir.bootlin.com/linux/latest/source/drivers/acpi/pci_root.c#L1038).
>> Since ACPI doesn't run on the devices behind VMD the _OSC bits don't get set
>> properly for them.
> 
> Apparently there's *something* in ACPI related to the VMD domain
> because 59dc33252ee7 ("PCI: VMD: ACPI: Make ACPI companion lookup work
> for VMD bus") seems to add support for ACPI companions for devices in
> the VMD domain?
> 

I've also seen this code and don't understand what it's supposed to do.
I'll investigate it further.

> I don't understand what's going on here because if BIOS enables VMD,
> firmware would see the VMD RCiEP but not devices behind it.  And if
> BIOS disables VMD, those devices should all appear normally and we
> wouldn't need 59dc33252ee7.
> 

I agree, I'll try to figure this out and get an answer

>> Ultimately the only _OSC bits that VMD cares about are the hotplug bits
>> because that is a feature of our device; it enables hotplug in guests where
>> there is no way to enable it. That's why my patch is to set them all the
>> time and copy the other _OSC bits because there is no other way to enable
>> this feature (i.e. there is no user space tool to enable/disable it).
> 
> And here's the problem, because the _OSC that applies to domain X that
> contains the VMD RCiEP may result in the platform retaining ownership
> of all these PCIe features (hotplug, AER, PME, etc), but you want OSPM
> to own them for the VMD domain Y, regardless of whether it owns them
> for domain X.
> 

I thought each domain gets its own _OSC. In the case of VMD, it creates
a new domain, so the _OSC for that domain should be determined somehow.
Normally that determination would be done via ACPI (if I'm understanding
things correctly); in the case of VMD I am saying that hotplug should
always be enabled and we can use the _OSC from another domain for the
other bits.
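
In code terms, what I have in mind is roughly this (a sketch only, not
the actual patch; the helper name is made up, but the native_* fields
are the ones struct pci_host_bridge already uses to track _OSC
ownership):

#include <linux/pci.h>

/*
 * The new VMD domain has no _OSC of its own, so inherit the bits we
 * don't care about from the domain that contains the VMD endpoint,
 * but always claim hotplug because that is the feature the VMD domain
 * is expected to provide even when running as a guest.
 */
static void vmd_init_domain_flags(struct pci_dev *vmd_dev,
				  struct pci_host_bridge *vmd_bridge)
{
	struct pci_host_bridge *root = pci_find_host_bridge(vmd_dev->bus);

	vmd_bridge->native_aer = root->native_aer;
	vmd_bridge->native_pme = root->native_pme;
	vmd_bridge->native_ltr = root->native_ltr;
	vmd_bridge->native_dpc = root->native_dpc;

	/* Hotplug below VMD is always handled by the OS (pciehp). */
	vmd_bridge->native_pcie_hotplug = 1;
	vmd_bridge->native_shpc_hotplug = 1;
}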

I went down a rabbit hole when I started working on this and had an idea 
that maybe the VMD driver should create an ACPI table, but that seemed 
way more complicated than it needs to be. Plus I wasn't sure what good 
it would do because I don't understand the relationship between the root 
ports below VMD and the settings we would use for the bridge.

> OSPM *did* own all PCIe features before 04b12ef163d1 because the new
> VMD "host bridge" got native owership by default.
> 
>>>     - What happens to interrupts generated by devices downstream from
>>>       VMD, e.g., AER interrupts from VMD Root Ports, hotplug interrupts
>>>       from VMD Root Ports or switch downstream ports?  Who fields them?
>>>       In general firmware would field them unless it grants ownership
>>>       via _OSC.  If firmware grants ownership (or the OS forcibly takes
>>>       it by overriding it for hotplug), I guess the OS that requested
>>>       ownership would field them?
>>
>> The interrupts are passed through VMD to the OS. This was the AER issue that
>> resulted in commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe
>> features"). IIRC AER was disabled in the BIOS, but it was enabled in the VMD
>> host bridge because pci_init_host_bridge() sets all the bits to 1 and that
>> generated an AER interrupt storm.
>>
>> In bare metal scenarios the _OSC bits are correct, but in a hypervisor
>> scenario the bits are wrong because they are all 0 regardless of what the
>> ACPI tables indicate. The challenge is that the VMD driver has no way to
>> know it's in a hypervisor to set the hotplug bits correctly.
> 
> This is the problem.  We claim "the bits are wrong because they are
> all 0", but I don't think we can derive that conclusion from anything
> in the ACPI, PCIe, or PCI Firmware specs, which makes it hard to
> maintain.
> 

I guess I look at it this way: if I run a bare metal OS and see that
hotplug and AER are enabled, and then I run a guest on the same system
and see that all the _OSC bits are disabled, then I think the bits
aren't correct. But there isn't any way to detect that through the
"normal" channels (ACPI, PCIe, etc.).

> IIUC, the current situation is "regardless of what firmware said, in
> the VMD domain we want AER disabled and hotplug enabled."
> 

We aren't saying we want AER disabled; we are just saying we want
hotplug enabled. The observation is that in a hypervisor scenario AER
is going to be disabled because the _OSC bits are all 0.

> 04b12ef163d1 disabled AER by copying the _OSC that applied to the VMD
> RCiEP (although this sounds broken because it probably left AER
> *enabled* if that _OSC happened to grant ownership to the OS).
> 
> And your patch at
> https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/
> enables hotplug by setting native_pcie_hotplug to 1.
> 
> It seems like the only clear option is to say "the vmd driver owns all
> PCIe services in the VMD domain, the platform does not supply _OSC for
> the VMD domain, the platform can't do anything with PCIe services in
> the VMD domain, and the vmd driver needs to explicitly enable/disable
> services as it needs."
> 

I actually looked at this as well :) I had an idea to set the _OSC bits
to 0 when the vmd driver created the domain, then look at all the root
ports underneath it and see if AER and PM were set. If any root port
underneath VMD had AER or PM set, then I would set the corresponding
_OSC bit for the bridge to 1. That way, if any root port underneath VMD
had enabled AER (as an example), that feature would still work. I didn't
test this in a hypervisor scenario, though, so I'm not sure what I would
see.
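
In rough code it would look something like the following (a sketch
only; the helper is hypothetical and the capability checks stand in
for whatever "AER/PM is set on the root port" should really test):

#include <linux/pci.h>

static void vmd_probe_domain_features(struct pci_bus *vmd_root_bus,
				      struct pci_host_bridge *vmd_bridge)
{
	struct pci_dev *rp;

	/* Start with nothing granted for the new VMD domain. */
	vmd_bridge->native_aer = 0;
	vmd_bridge->native_pme = 0;

	/* The VMD Root Ports sit directly on the new domain's root bus. */
	list_for_each_entry(rp, &vmd_root_bus->devices, bus_list) {
		if (!pci_is_pcie(rp) ||
		    pci_pcie_type(rp) != PCI_EXP_TYPE_ROOT_PORT)
			continue;

		/* Grant a feature only if some Root Port advertises it. */
		if (pci_find_ext_capability(rp, PCI_EXT_CAP_ID_ERR))
			vmd_bridge->native_aer = 1;
		if (pci_find_capability(rp, PCI_CAP_ID_PM))
			vmd_bridge->native_pme = 1;
	}
}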

Would that be an acceptable idea?

Paul

> Bjorn

