diff mbox series

Documentation: PCI: add vmd documentation

Message ID 20240417201542.102-1-paul.m.stillwell.jr@intel.com (mailing list archive)
State Superseded
Headers show
Series Documentation: PCI: add vmd documentation | expand

Commit Message

Paul M Stillwell Jr April 17, 2024, 8:15 p.m. UTC
Adding documentation for the Intel VMD driver and updating the index
file to include it.

Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
---
 Documentation/PCI/controller/vmd.rst | 51 ++++++++++++++++++++++++++++
 Documentation/PCI/index.rst          |  1 +
 2 files changed, 52 insertions(+)
 create mode 100644 Documentation/PCI/controller/vmd.rst

Comments

Keith Busch April 17, 2024, 11:51 p.m. UTC | #1
On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
> +=================================================================
> +Linux Base Driver for the Intel(R) Volume Management Device (VMD)
> +=================================================================
> +
> +Intel vmd Linux driver.
> +
> +Contents
> +========
> +
> +- Overview
> +- Features
> +- Limitations
> +
> +The Intel VMD provides the means to provide volume management across separate
> +PCI Express HBAs and SSDs without requiring operating system support or
> +communication between drivers. It does this by obscuring each storage
> +controller from the OS, but allowing a single driver to be loaded that would
> +control each storage controller. A Volume Management Device (VMD) provides a
> +single device for a single storage driver. The VMD resides in the IIO root
> +complex and it appears to the OS as a root bus integrated endpoint. In the IIO,
> +the VMD is in a central location to manipulate access to storage devices which
> +may be attached directly to the IIO or indirectly through the PCH. Instead of
> +allowing individual storage devices to be detected by the OS and allow it to
> +load a separate driver instance for each, the VMD provides configuration
> +settings to allow specific devices and root ports on the root bus to be
> +invisible to the OS.

This doesn't really capture how the vmd driver works here, though. The
linux driver doesn't control or hide any devices connected to it; it
just creates a host bridge to its PCI domain, then exposes everything to
the rest of the OS for other drivers to bind to individual device
instances that the pci bus driver finds. Everything functions much the
same as if VMD was disabled.
Paul M Stillwell Jr April 18, 2024, 3:07 p.m. UTC | #2
On 4/17/2024 4:51 PM, Keith Busch wrote:
> On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
>> +=================================================================
>> +Linux Base Driver for the Intel(R) Volume Management Device (VMD)
>> +=================================================================
>> +
>> +Intel vmd Linux driver.
>> +
>> +Contents
>> +========
>> +
>> +- Overview
>> +- Features
>> +- Limitations
>> +
>> +The Intel VMD provides the means to provide volume management across separate
>> +PCI Express HBAs and SSDs without requiring operating system support or
>> +communication between drivers. It does this by obscuring each storage
>> +controller from the OS, but allowing a single driver to be loaded that would
>> +control each storage controller. A Volume Management Device (VMD) provides a
>> +single device for a single storage driver. The VMD resides in the IIO root
>> +complex and it appears to the OS as a root bus integrated endpoint. In the IIO,
>> +the VMD is in a central location to manipulate access to storage devices which
>> +may be attached directly to the IIO or indirectly through the PCH. Instead of
>> +allowing individual storage devices to be detected by the OS and allow it to
>> +load a separate driver instance for each, the VMD provides configuration
>> +settings to allow specific devices and root ports on the root bus to be
>> +invisible to the OS.
> 
> This doesn't really capture how the vmd driver works here, though. The
> linux driver doesn't control or hide any devices connected to it; it
> just creates a host bridge to its PCI domain, then exposes everything to
> the rest of the OS for other drivers to bind to individual device
> instances that the pci bus driver finds. Everything functions much the
> same as if VMD was disabled.

I was trying more to provide an overview of how the VMD device itself 
works here and not the driver. The driver is fairly simple; it's how the 
device works that seems to confuse people :).

Do you have a suggestion on what you would like to see here?

Paul
Bjorn Helgaas April 18, 2024, 6:26 p.m. UTC | #3
[+cc Keith]

On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
> Adding documentation for the Intel VMD driver and updating the index
> file to include it.
> 
> Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
> ---
>  Documentation/PCI/controller/vmd.rst | 51 ++++++++++++++++++++++++++++
>  Documentation/PCI/index.rst          |  1 +
>  2 files changed, 52 insertions(+)
>  create mode 100644 Documentation/PCI/controller/vmd.rst
> 
> diff --git a/Documentation/PCI/controller/vmd.rst b/Documentation/PCI/controller/vmd.rst
> new file mode 100644
> index 000000000000..e1a019035245
> --- /dev/null
> +++ b/Documentation/PCI/controller/vmd.rst
> @@ -0,0 +1,51 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +=================================================================
> +Linux Base Driver for the Intel(R) Volume Management Device (VMD)
> +=================================================================
> +
> +Intel vmd Linux driver.
> +
> +Contents
> +========
> +
> +- Overview
> +- Features
> +- Limitations
> +
> +The Intel VMD provides the means to provide volume management across separate
> +PCI Express HBAs and SSDs without requiring operating system support or
> +communication between drivers. It does this by obscuring each storage
> +controller from the OS, but allowing a single driver to be loaded that would
> +control each storage controller. A Volume Management Device (VMD) provides a
> +single device for a single storage driver. The VMD resides in the IIO root

I'm not sure IIO (and PCH below) are really relevant to this.  I think
we really just care about the PCI topology enumerable by the OS.  If
they are relevant, expand them on first use as you did for VMD so we
have a hint about how to learn more about it.

> +complex and it appears to the OS as a root bus integrated endpoint. In the IIO,

I suspect "root bus integrated endpoint" means the same as "Root
Complex Integrated Endpoint" as defined by the PCIe spec?  If so,
please use that term and capitalize it so there's no confusion.

> +the VMD is in a central location to manipulate access to storage devices which
> +may be attached directly to the IIO or indirectly through the PCH. Instead of
> +allowing individual storage devices to be detected by the OS and allow it to
> +load a separate driver instance for each, the VMD provides configuration
> +settings to allow specific devices and root ports on the root bus to be
> +invisible to the OS.

How are these settings configured?  BIOS setup menu?

> +VMD works by creating separate PCI domains for each VMD device in the system.
> +This makes VMD look more like a host bridge than an endpoint so VMD must try
> +to adhere to the ACPI Operating System Capabilities (_OSC) flags of the system.

As Keith pointed out, I think this needs more details about how the
hardware itself works.  I don't think there's enough information here
to maintain the OS/platform interface on an ongoing basis.

I think "creating a separate PCI domain" is a consequence of providing
a new config access mechanism, e.g., a new ECAM region, for devices
below the VMD bridge.  That hardware mechanism is important to
understand because it means those downstream devices are unknown to
anything that doesn't grok the config access mechanism.  For example,
firmware wouldn't know anything about them unless it had a VMD driver.

Some of the pieces that might help figure this out:

  - Which devices (VMD bridge, VMD Root Ports, devices below VMD Root
    Ports) are enumerated in the host?

  - Which devices are passed through to a virtual guest and enumerated
    there?

  - Where does the vmd driver run (host or guest or both)?

  - Who (host or guest) runs the _OSC for the new VMD domain?

  - What happens to interrupts generated by devices downstream from
    VMD, e.g., AER interrupts from VMD Root Ports, hotplug interrupts
    from VMD Root Ports or switch downstream ports?  Who fields them?
    In general firmware would field them unless it grants ownership
    via _OSC.  If firmware grants ownership (or the OS forcibly takes
    it by overriding it for hotplug), I guess the OS that requested
    ownership would field them?

  - How do interrupts (hotplug, AER, etc) for things below VMD work?
    Assuming the OS owns the feature, how does the OS discover them?
    I guess probably the usual PCIe Capability and MSI/MSI-X
    Capabilities?  Which OS (host or guest) fields them?

> +A couple of the _OSC flags regard hotplug support.  Hotplug is a feature that
> +is always enabled when using VMD regardless of the _OSC flags.

We log the _OSC negotiation in dmesg, so if we ignore or override _OSC
for hotplug, maybe that should be made explicit in the logging
somehow?

> +Features
> +========
> +
> +- Virtualization
> +- MSIX interrupts
> +- Power Management
> +- Hotplug

s/MSIX/MSI-X/ to match spec usage.

I'm not sure what this list is telling us.

> +Limitations
> +===========
> +
> +When VMD is enabled and used in a hypervisor the _OSC flags provided by the
> +hypervisor BIOS may not be correct. The most critical of these flags are the
> +hotplug bits. If these bits are incorrect then the storage devices behind the
> +VMD will not be able to be hotplugged. The driver always supports hotplug for
> +the devices behind it so the hotplug bits reported by the OS are not used.

"_OSC may not be correct" sounds kind of problematic.  How does the
OS deal with this?  How does the OS know whether to pay attention to
_OSC or ignore it because it tells us garbage?

If we ignore _OSC hotplug bits because "we know what we want, and we
know we won't conflict with firmware," how do we deal with other _OSC
bits?  AER?  PME?  What about bits that may be added in the future?
Is there some kind of roadmap to help answer these questions?

Bjorn
Paul M Stillwell Jr April 18, 2024, 9:51 p.m. UTC | #4
On 4/18/2024 11:26 AM, Bjorn Helgaas wrote:
> [+cc Keith]
> 
> On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
>> Adding documentation for the Intel VMD driver and updating the index
>> file to include it.
>>
>> Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
>> ---
>>   Documentation/PCI/controller/vmd.rst | 51 ++++++++++++++++++++++++++++
>>   Documentation/PCI/index.rst          |  1 +
>>   2 files changed, 52 insertions(+)
>>   create mode 100644 Documentation/PCI/controller/vmd.rst
>>
>> diff --git a/Documentation/PCI/controller/vmd.rst b/Documentation/PCI/controller/vmd.rst
>> new file mode 100644
>> index 000000000000..e1a019035245
>> --- /dev/null
>> +++ b/Documentation/PCI/controller/vmd.rst
>> @@ -0,0 +1,51 @@
>> +.. SPDX-License-Identifier: GPL-2.0+
>> +
>> +=================================================================
>> +Linux Base Driver for the Intel(R) Volume Management Device (VMD)
>> +=================================================================
>> +
>> +Intel vmd Linux driver.
>> +
>> +Contents
>> +========
>> +
>> +- Overview
>> +- Features
>> +- Limitations
>> +
>> +The Intel VMD provides the means to provide volume management across separate
>> +PCI Express HBAs and SSDs without requiring operating system support or
>> +communication between drivers. It does this by obscuring each storage
>> +controller from the OS, but allowing a single driver to be loaded that would
>> +control each storage controller. A Volume Management Device (VMD) provides a
>> +single device for a single storage driver. The VMD resides in the IIO root
> 
> I'm not sure IIO (and PCH below) are really relevant to this.  I think

I'm trying to describe where in the CPU architecture VMD exists because 
it's not like other devices. It's not like a storage device or 
networking device that is plugged in somewhere; it exists as part of the 
CPU (in the IIO). I'm ok removing it, but it might be confusing to 
someone looking at the documentation. I'm also close to this so it may 
be clear to me, but confusing to others (which I know it is) so any help 
making it clearer would be appreciated.

> we really just care about the PCI topology enumerable by the OS.  If
> they are relevant, expand them on first use as you did for VMD so we
> have a hint about how to learn more about it.
> 

I don't fully understand this comment. The PCI topology behind VMD is 
not enumerable by the OS unless we are considering the vmd driver the 
OS. If the user enables VMD in the BIOS and the vmd driver isn't loaded, 
then the OS never sees the devices behind VMD.

The only reason the devices are seen by the OS is that the VMD driver 
does some mapping when the VMD driver loads during boot.

>> +complex and it appears to the OS as a root bus integrated endpoint. In the IIO,
> 
> I suspect "root bus integrated endpoint" means the same as "Root
> Complex Integrated Endpoint" as defined by the PCIe spec?  If so,
> please use that term and capitalize it so there's no confusion.
> 

OK, will fix.

>> +the VMD is in a central location to manipulate access to storage devices which
>> +may be attached directly to the IIO or indirectly through the PCH. Instead of
>> +allowing individual storage devices to be detected by the OS and allow it to
>> +load a separate driver instance for each, the VMD provides configuration
>> +settings to allow specific devices and root ports on the root bus to be
>> +invisible to the OS.
> 
> How are these settings configured?  BIOS setup menu?
> 

I believe there are 2 ways this is done:

The first is that the system designer creates a design such that some 
root ports and end points are behind VMD. If VMD is enabled in the BIOS 
then these devices don't show up to the OS and require a driver to use 
them (the vmd driver). If VMD is disabled in the BIOS then the devices 
are seen by the OS at boot time.

The second way is that there are settings in the BIOS for VMD. I don't 
think there are many settings... it's mostly enable/disable VMD

>> +VMD works by creating separate PCI domains for each VMD device in the system.
>> +This makes VMD look more like a host bridge than an endpoint so VMD must try
>> +to adhere to the ACPI Operating System Capabilities (_OSC) flags of the system.
> 
> As Keith pointed out, I think this needs more details about how the
> hardware itself works.  I don't think there's enough information here
> to maintain the OS/platform interface on an ongoing basis.
> 
> I think "creating a separate PCI domain" is a consequence of providing
> a new config access mechanism, e.g., a new ECAM region, for devices
> below the VMD bridge.  That hardware mechanism is important to
> understand because it means those downstream devices are unknown to
> anything that doesn't grok the config access mechanism.  For example,
> firmware wouldn't know anything about them unless it had a VMD driver.
> 
> Some of the pieces that might help figure this out:
>

I'll add some details to answer these in the documentation, but I'll 
give a brief answer here as well

>    - Which devices (VMD bridge, VMD Root Ports, devices below VMD Root
>      Ports) are enumerated in the host?
> 

Only the VMD device (as a PCI end point) are seen by the OS without the 
vmd driver

>    - Which devices are passed through to a virtual guest and enumerated
>      there?
> 

All devices under VMD are passed to a virtual guest

>    - Where does the vmd driver run (host or guest or both)?
> 

I believe the answer is both.

>    - Who (host or guest) runs the _OSC for the new VMD domain?
> 

I believe the answer here is neither :) This has been an issue since 
commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"). I've 
submitted this patch 
(https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/) 
to attempt to fix the issue.

You are much more of an expert in this area than I am, but as far as I 
can tell the only way the _OSC bits get cleared is via ACPI 
(specifically this code 
https://elixir.bootlin.com/linux/latest/source/drivers/acpi/pci_root.c#L1038). 
Since ACPI doesn't run on the devices behind VMD the _OSC bits don't get 
set properly for them.

Ultimately the only _OSC bits that VMD cares about are the hotplug bits 
because that is a feature of our device; it enables hotplug in guests 
where there is no way to enable it. That's why my patch is to set them 
all the time and copy the other _OSC bits because there is no other way 
to enable this feature (i.e. there is no user space tool to 
enable/disable it).

>    - What happens to interrupts generated by devices downstream from
>      VMD, e.g., AER interrupts from VMD Root Ports, hotplug interrupts
>      from VMD Root Ports or switch downstream ports?  Who fields them?
>      In general firmware would field them unless it grants ownership
>      via _OSC.  If firmware grants ownership (or the OS forcibly takes
>      it by overriding it for hotplug), I guess the OS that requested
>      ownership would field them?
> 

The interrupts are passed through VMD to the OS. This was the AER issue 
that resulted in commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe 
features"). IIRC AER was disabled in the BIOS, but is was enabled in the 
VMD host bridge because pci_init_host_bridge() sets all the bits to 1 
and that generated an AER interrupt storm.

In bare metal scenarios the _OSC bits are correct, but in a hypervisor 
scenario the bits are wrong because they are all 0 regardless of what 
the ACPI tables indicate. The challenge is that the VMD driver has no 
way to know it's in a hypervisor to set the hotplug bits correctly.

>    - How do interrupts (hotplug, AER, etc) for things below VMD work?
>      Assuming the OS owns the feature, how does the OS discover them?

I feel like this is the same question as above? Or maybe I'm missing a 
subtlety about this...

>      I guess probably the usual PCIe Capability and MSI/MSI-X
>      Capabilities?  Which OS (host or guest) fields them?
>
>> +A couple of the _OSC flags regard hotplug support.  Hotplug is a feature that
>> +is always enabled when using VMD regardless of the _OSC flags.
> 
> We log the _OSC negotiation in dmesg, so if we ignore or override _OSC
> for hotplug, maybe that should be made explicit in the logging
> somehow?
> 

That's a really good idea and something I can add to 
https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/

Would a message like this help from the VMD driver?

"VMD enabled, hotplug enabled by VMD"

>> +Features
>> +========
>> +
>> +- Virtualization
>> +- MSIX interrupts
>> +- Power Management
>> +- Hotplug
> 
> s/MSIX/MSI-X/ to match spec usage.
> 
> I'm not sure what this list is telling us.
> 

Will fix

>> +Limitations
>> +===========
>> +
>> +When VMD is enabled and used in a hypervisor the _OSC flags provided by the
>> +hypervisor BIOS may not be correct. The most critical of these flags are the
>> +hotplug bits. If these bits are incorrect then the storage devices behind the
>> +VMD will not be able to be hotplugged. The driver always supports hotplug for
>> +the devices behind it so the hotplug bits reported by the OS are not used.
> 
> "_OSC may not be correct" sounds kind of problematic.  How does the
> OS deal with this?  How does the OS know whether to pay attention to
> _OSC or ignore it because it tells us garbage?
> 

That's the $64K question, lol. We've been trying to solve that since 
commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") :)

> If we ignore _OSC hotplug bits because "we know what we want, and we
> know we won't conflict with firmware," how do we deal with other _OSC
> bits?  AER?  PME?  What about bits that may be added in the future?
> Is there some kind of roadmap to help answer these questions?
> 

As I mentioned earlier, VMD only really cares about hotplug because that 
is the feature we enable for guests (and hosts).

I believe the solution is to use the root bridge settings for all other 
bits (which is what is happening currently). What this will mean in 
practice is that in a bare metal scenario the bits will be correct for 
all the features (AER et al) and that in a guest scenario all the bits 
other than hotplug (which we will enable always) will be 0 (that's what 
we see in all hypervisor scenarios we've tested) which is fine for us 
because we don't care about any of the other bits.

That's why I think it's ok for us to set the hotplug bits to 1 when the 
VMD driver loads; we aren't harming any other devices, we are enabling a 
feature that we know our users want and we are setting all the other 
_OSC bits "correctly" (for some values of correctly :) )

I appreciate your feedback and I'll start working on updating the 
documentation to make it clearer. I'll wait to send a v2 until I feel 
like we've finished our discussion from this one.

Paul

> Bjorn
>
Keith Busch April 18, 2024, 11:34 p.m. UTC | #5
On Thu, Apr 18, 2024 at 08:07:26AM -0700, Paul M Stillwell Jr wrote:
> On 4/17/2024 4:51 PM, Keith Busch wrote:
> > On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
> > > +=================================================================
> > > +Linux Base Driver for the Intel(R) Volume Management Device (VMD)
> > > +=================================================================
> > > +
> > > +Intel vmd Linux driver.
> > > +
> > > +Contents
> > > +========
> > > +
> > > +- Overview
> > > +- Features
> > > +- Limitations
> > > +
> > > +The Intel VMD provides the means to provide volume management across separate
> > > +PCI Express HBAs and SSDs without requiring operating system support or
> > > +communication between drivers. It does this by obscuring each storage
> > > +controller from the OS, but allowing a single driver to be loaded that would
> > > +control each storage controller. A Volume Management Device (VMD) provides a
> > > +single device for a single storage driver. The VMD resides in the IIO root
> > > +complex and it appears to the OS as a root bus integrated endpoint. In the IIO,
> > > +the VMD is in a central location to manipulate access to storage devices which
> > > +may be attached directly to the IIO or indirectly through the PCH. Instead of
> > > +allowing individual storage devices to be detected by the OS and allow it to
> > > +load a separate driver instance for each, the VMD provides configuration
> > > +settings to allow specific devices and root ports on the root bus to be
> > > +invisible to the OS.
> > 
> > This doesn't really capture how the vmd driver works here, though. The
> > linux driver doesn't control or hide any devices connected to it; it
> > just creates a host bridge to its PCI domain, then exposes everything to
> > the rest of the OS for other drivers to bind to individual device
> > instances that the pci bus driver finds. Everything functions much the
> > same as if VMD was disabled.
> 
> I was trying more to provide an overview of how the VMD device itself works
> here and not the driver. The driver is fairly simple; it's how the device
> works that seems to confuse people :).

Right, I know that's how the marketing docs described VMD, but your new
doc is called "Linux Base Driver for ... VMD", so I thought your goal
was to describe the *driver* rather than the device.

> Do you have a suggestion on what you would like to see here?

I think we did okay with the description in the initial commit log. It
is here if you want to reference that, though it may need some updates
for more current hardware.

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=185a383ada2e7794b0e82e040223e741b24d2bf8
Bjorn Helgaas April 19, 2024, 9:14 p.m. UTC | #6
[+cc Kai-Heng, Rafael, Lorenzo since we're talking about 04b12ef163d1
("PCI: vmd: Honor ACPI _OSC on PCIe features")]

On Thu, Apr 18, 2024 at 02:51:19PM -0700, Paul M Stillwell Jr wrote:
> On 4/18/2024 11:26 AM, Bjorn Helgaas wrote:
> > On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
> > > Adding documentation for the Intel VMD driver and updating the index
> > > file to include it.
> > > 
> > > Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
> > > ---
> > >   Documentation/PCI/controller/vmd.rst | 51 ++++++++++++++++++++++++++++
> > >   Documentation/PCI/index.rst          |  1 +
> > >   2 files changed, 52 insertions(+)
> > >   create mode 100644 Documentation/PCI/controller/vmd.rst
> > > 
> > > diff --git a/Documentation/PCI/controller/vmd.rst b/Documentation/PCI/controller/vmd.rst
> > > new file mode 100644
> > > index 000000000000..e1a019035245
> > > --- /dev/null
> > > +++ b/Documentation/PCI/controller/vmd.rst
> > > @@ -0,0 +1,51 @@
> > > +.. SPDX-License-Identifier: GPL-2.0+
> > > +
> > > +=================================================================
> > > +Linux Base Driver for the Intel(R) Volume Management Device (VMD)
> > > +=================================================================
> > > +
> > > +Intel vmd Linux driver.
> > > +
> > > +Contents
> > > +========
> > > +
> > > +- Overview
> > > +- Features
> > > +- Limitations
> > > +
> > > +The Intel VMD provides the means to provide volume management across separate
> > > +PCI Express HBAs and SSDs without requiring operating system support or
> > > +communication between drivers. It does this by obscuring each storage
> > > +controller from the OS, but allowing a single driver to be loaded that would
> > > +control each storage controller. A Volume Management Device (VMD) provides a
> > > +single device for a single storage driver. The VMD resides in the IIO root
> > 
> > I'm not sure IIO (and PCH below) are really relevant to this.  I think
> 
> I'm trying to describe where in the CPU architecture VMD exists because it's
> not like other devices. It's not like a storage device or networking device
> that is plugged in somewhere; it exists as part of the CPU (in the IIO). I'm
> ok removing it, but it might be confusing to someone looking at the
> documentation. I'm also close to this so it may be clear to me, but
> confusing to others (which I know it is) so any help making it clearer would
> be appreciated.

The vmd driver binds to a PCI device, and it doesn't matter where it's
physically implemented.

> > we really just care about the PCI topology enumerable by the OS.  If
> > they are relevant, expand them on first use as you did for VMD so we
> > have a hint about how to learn more about it.
> 
> I don't fully understand this comment. The PCI topology behind VMD is not
> enumerable by the OS unless we are considering the vmd driver the OS. If the
> user enables VMD in the BIOS and the vmd driver isn't loaded, then the OS
> never sees the devices behind VMD.
> 
> The only reason the devices are seen by the OS is that the VMD driver does
> some mapping when the VMD driver loads during boot.

Yes.  I think it's worth mentioning these details about the hierarchy
behind VMD, e.g., how their config space is reached, how driver MMIO
accesses are routed, how device interrupts are routed, etc.

The 185a383ada2e ("x86/PCI: Add driver for Intel Volume Management
Device (VMD)") commit log mentioned by Keith is a fantastic start
and answers several things below.  Putting it in this doc would be
great because that commit is not very visible.

> > > +the VMD is in a central location to manipulate access to storage devices which
> > > +may be attached directly to the IIO or indirectly through the PCH. Instead of
> > > +allowing individual storage devices to be detected by the OS and allow it to
> > > +load a separate driver instance for each, the VMD provides configuration
> > > +settings to allow specific devices and root ports on the root bus to be
> > > +invisible to the OS.
> > 
> > How are these settings configured?  BIOS setup menu?
> 
> I believe there are 2 ways this is done:
> 
> The first is that the system designer creates a design such that some root
> ports and end points are behind VMD. If VMD is enabled in the BIOS then
> these devices don't show up to the OS and require a driver to use them (the
> vmd driver). If VMD is disabled in the BIOS then the devices are seen by the
> OS at boot time.
> 
> The second way is that there are settings in the BIOS for VMD. I don't think
> there are many settings... it's mostly enable/disable VMD

I think what I want to know here is just "there's a BIOS switch that
enables VMD, resulting in only the VMD RCiEP being visible, or
disables VMD, resulting in the VMD RCiEP not being visible (?) and the
VMD Root Ports and downstream hierarchies being just normal Root
Ports."  You can wordsmith that; I just wanted to know what
"configuration settings" referred to so users would know where they
live and what they mean.

> > > +VMD works by creating separate PCI domains for each VMD device in the system.
> > > +This makes VMD look more like a host bridge than an endpoint so VMD must try
> > > +to adhere to the ACPI Operating System Capabilities (_OSC) flags of the system.

> > I think "creating a separate PCI domain" is a consequence of providing
> > a new config access mechanism, e.g., a new ECAM region, for devices
> > below the VMD bridge.  That hardware mechanism is important to
> > understand because it means those downstream devices are unknown to
> > anything that doesn't grok the config access mechanism.  For example,
> > firmware wouldn't know anything about them unless it had a VMD driver.

> >    - Which devices (VMD bridge, VMD Root Ports, devices below VMD Root
> >      Ports) are enumerated in the host?
> 
> Only the VMD device (as a PCI end point) are seen by the OS without the vmd
> driver

This is the case when VMD is enabled by BIOS, right?  And when the vmd
driver sets up the new config space and enumerates behind VMD, we'll
find the VMD Root Ports and the hierarchies below them?

If BIOS leaves the VMD functionality disabled, I guess those VMD Root
Ports and the hierarchies below them are enumerated normally, without
the vmd driver being involved?  (Maybe the VMD RCiEP itself is
disabled, so we won't enumerate it and we won't try to bind the vmd
driver to it?)

> >    - Which devices are passed through to a virtual guest and enumerated
> >      there?
> 
> All devices under VMD are passed to a virtual guest

So the guest will see the VMD Root Ports, but not the VMD RCiEP
itself?

> >    - Where does the vmd driver run (host or guest or both)?
> 
> I believe the answer is both.

If the VMD RCiEP isn't passed through to the guest, how can the vmd
driver do anything in the guest?

> >    - Who (host or guest) runs the _OSC for the new VMD domain?
> 
> I believe the answer here is neither :) This has been an issue since commit
> 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"). I've submitted
> this patch (https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/)
> to attempt to fix the issue.

IIUC we make no attempt to evaluate _OSC for the new VMD domain (and
firmware likely would not supply one anyway since it doesn't know how
to enumerate anything there).  So 04b12ef163d1 just assumes the _OSC
that applies to the VMD RCiEP also applies to the new VMD domain below
the VMD.

If that's what we want to happen, we should state this explicitly.
But I don't think it *is* what we want; see below.

> You are much more of an expert in this area than I am, but as far as I can
> tell the only way the _OSC bits get cleared is via ACPI (specifically this
> code https://elixir.bootlin.com/linux/latest/source/drivers/acpi/pci_root.c#L1038).
> Since ACPI doesn't run on the devices behind VMD the _OSC bits don't get set
> properly for them.

Apparently there's *something* in ACPI related to the VMD domain
because 59dc33252ee7 ("PCI: VMD: ACPI: Make ACPI companion lookup work
for VMD bus") seems to add support for ACPI companions for devices in
the VMD domain?

I don't understand what's going on here because if BIOS enables VMD,
firmware would see the VMD RCiEP but not devices behind it.  And if
BIOS disables VMD, those devices should all appear normally and we
wouldn't need 59dc33252ee7.

> Ultimately the only _OSC bits that VMD cares about are the hotplug bits
> because that is a feature of our device; it enables hotplug in guests where
> there is no way to enable it. That's why my patch is to set them all the
> time and copy the other _OSC bits because there is no other way to enable
> this feature (i.e. there is no user space tool to enable/disable it).

And here's the problem, because the _OSC that applies to domain X that
contains the VMD RCiEP may result in the platform retaining ownership
of all these PCIe features (hotplug, AER, PME, etc), but you want OSPM
to own them for the VMD domain Y, regardless of whether it owns them
for domain X.

OSPM *did* own all PCIe features before 04b12ef163d1 because the new
VMD "host bridge" got native owership by default.

> >    - What happens to interrupts generated by devices downstream from
> >      VMD, e.g., AER interrupts from VMD Root Ports, hotplug interrupts
> >      from VMD Root Ports or switch downstream ports?  Who fields them?
> >      In general firmware would field them unless it grants ownership
> >      via _OSC.  If firmware grants ownership (or the OS forcibly takes
> >      it by overriding it for hotplug), I guess the OS that requested
> >      ownership would field them?
> 
> The interrupts are passed through VMD to the OS. This was the AER issue that
> resulted in commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe
> features"). IIRC AER was disabled in the BIOS, but is was enabled in the VMD
> host bridge because pci_init_host_bridge() sets all the bits to 1 and that
> generated an AER interrupt storm.
>
> In bare metal scenarios the _OSC bits are correct, but in a hypervisor
> scenario the bits are wrong because they are all 0 regardless of what the
> ACPI tables indicate. The challenge is that the VMD driver has no way to
> know it's in a hypervisor to set the hotplug bits correctly.

This is the problem.  We claim "the bits are wrong because they are
all 0", but I don't think we can derive that conclusion from anything
in the ACPI, PCIe, or PCI Firmware specs, which makes it hard to
maintain.

IIUC, the current situation is "regardless of what firmware said, in
the VMD domain we want AER disabled and hotplug enabled."

04b12ef163d1 disabled AER by copying the _OSC that applied to the VMD
RCiEP (although this sounds broken because it probably left AER
*enabled* if that _OSC happened to grant ownership to the OS).

And your patch at
https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/ 
enables hotplug by setting native_pcie_hotplug to 1.

It seems like the only clear option is to say "the vmd driver owns all
PCIe services in the VMD domain, the platform does not supply _OSC for
the VMD domain, the platform can't do anything with PCIe services in
the VMD domain, and the vmd driver needs to explicitly enable/disable
services as it needs."

Bjorn
Paul M Stillwell Jr April 19, 2024, 10:18 p.m. UTC | #7
On 4/19/2024 2:14 PM, Bjorn Helgaas wrote:
> [+cc Kai-Heng, Rafael, Lorenzo since we're talking about 04b12ef163d1
> ("PCI: vmd: Honor ACPI _OSC on PCIe features")]
> 
> On Thu, Apr 18, 2024 at 02:51:19PM -0700, Paul M Stillwell Jr wrote:
>> On 4/18/2024 11:26 AM, Bjorn Helgaas wrote:
>>> On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
>>>> Adding documentation for the Intel VMD driver and updating the index
>>>> file to include it.
>>>>
>>>> Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
>>>> ---
>>>>    Documentation/PCI/controller/vmd.rst | 51 ++++++++++++++++++++++++++++
>>>>    Documentation/PCI/index.rst          |  1 +
>>>>    2 files changed, 52 insertions(+)
>>>>    create mode 100644 Documentation/PCI/controller/vmd.rst
>>>>
>>>> diff --git a/Documentation/PCI/controller/vmd.rst b/Documentation/PCI/controller/vmd.rst
>>>> new file mode 100644
>>>> index 000000000000..e1a019035245
>>>> --- /dev/null
>>>> +++ b/Documentation/PCI/controller/vmd.rst
>>>> @@ -0,0 +1,51 @@
>>>> +.. SPDX-License-Identifier: GPL-2.0+
>>>> +
>>>> +=================================================================
>>>> +Linux Base Driver for the Intel(R) Volume Management Device (VMD)
>>>> +=================================================================
>>>> +
>>>> +Intel vmd Linux driver.
>>>> +
>>>> +Contents
>>>> +========
>>>> +
>>>> +- Overview
>>>> +- Features
>>>> +- Limitations
>>>> +
>>>> +The Intel VMD provides the means to provide volume management across separate
>>>> +PCI Express HBAs and SSDs without requiring operating system support or
>>>> +communication between drivers. It does this by obscuring each storage
>>>> +controller from the OS, but allowing a single driver to be loaded that would
>>>> +control each storage controller. A Volume Management Device (VMD) provides a
>>>> +single device for a single storage driver. The VMD resides in the IIO root
>>>
>>> I'm not sure IIO (and PCH below) are really relevant to this.  I think
>>
>> I'm trying to describe where in the CPU architecture VMD exists because it's
>> not like other devices. It's not like a storage device or networking device
>> that is plugged in somewhere; it exists as part of the CPU (in the IIO). I'm
>> ok removing it, but it might be confusing to someone looking at the
>> documentation. I'm also close to this so it may be clear to me, but
>> confusing to others (which I know it is) so any help making it clearer would
>> be appreciated.
> 
> The vmd driver binds to a PCI device, and it doesn't matter where it's
> physically implemented.
> 

Ok

>>> we really just care about the PCI topology enumerable by the OS.  If
>>> they are relevant, expand them on first use as you did for VMD so we
>>> have a hint about how to learn more about it.
>>
>> I don't fully understand this comment. The PCI topology behind VMD is not
>> enumerable by the OS unless we are considering the vmd driver the OS. If the
>> user enables VMD in the BIOS and the vmd driver isn't loaded, then the OS
>> never sees the devices behind VMD.
>>
>> The only reason the devices are seen by the OS is that the VMD driver does
>> some mapping when the VMD driver loads during boot.
> 
> Yes.  I think it's worth mentioning these details about the hierarchy
> behind VMD, e.g., how their config space is reached, how driver MMIO
> accesses are routed, how device interrupts are routed, etc.
> 
> The 185a383ada2e ("x86/PCI: Add driver for Intel Volume Management
> Device (VMD)") commit log mentioned by Keith is a fantastic start
> and answers several things below.  Putting it in this doc would be
> great because that commit is not very visible.
> 

I'll look at it and work it in (either replace this overview or add to it)

>>>> +the VMD is in a central location to manipulate access to storage devices which
>>>> +may be attached directly to the IIO or indirectly through the PCH. Instead of
>>>> +allowing individual storage devices to be detected by the OS and allow it to
>>>> +load a separate driver instance for each, the VMD provides configuration
>>>> +settings to allow specific devices and root ports on the root bus to be
>>>> +invisible to the OS.
>>>
>>> How are these settings configured?  BIOS setup menu?
>>
>> I believe there are 2 ways this is done:
>>
>> The first is that the system designer creates a design such that some root
>> ports and end points are behind VMD. If VMD is enabled in the BIOS then
>> these devices don't show up to the OS and require a driver to use them (the
>> vmd driver). If VMD is disabled in the BIOS then the devices are seen by the
>> OS at boot time.
>>
>> The second way is that there are settings in the BIOS for VMD. I don't think
>> there are many settings... it's mostly enable/disable VMD
> 
> I think what I want to know here is just "there's a BIOS switch that
> enables VMD, resulting in only the VMD RCiEP being visible, or
> disables VMD, resulting in the VMD RCiEP not being visible (?) and the
> VMD Root Ports and downstream hierarchies being just normal Root
> Ports."  You can wordsmith that; I just wanted to know what
> "configuration settings" referred to so users would know where they
> live and what they mean.
> 

Got it, will update to reflect that setup/config is through the BIOS

>>>> +VMD works by creating separate PCI domains for each VMD device in the system.
>>>> +This makes VMD look more like a host bridge than an endpoint so VMD must try
>>>> +to adhere to the ACPI Operating System Capabilities (_OSC) flags of the system.
> 
>>> I think "creating a separate PCI domain" is a consequence of providing
>>> a new config access mechanism, e.g., a new ECAM region, for devices
>>> below the VMD bridge.  That hardware mechanism is important to
>>> understand because it means those downstream devices are unknown to
>>> anything that doesn't grok the config access mechanism.  For example,
>>> firmware wouldn't know anything about them unless it had a VMD driver.
> 
>>>     - Which devices (VMD bridge, VMD Root Ports, devices below VMD Root
>>>       Ports) are enumerated in the host?
>>
>> Only the VMD device (as a PCI end point) are seen by the OS without the vmd
>> driver
> 
> This is the case when VMD is enabled by BIOS, right?  And when the vmd

Yes

> driver sets up the new config space and enumerates behind VMD, we'll
> find the VMD Root Ports and the hierarchies below them?
> 

Yes

> If BIOS leaves the VMD functionality disabled, I guess those VMD Root
> Ports and the hierarchies below them are enumerated normally, without
> the vmd driver being involved?  (Maybe the VMD RCiEP itself is

Yes

> disabled, so we won't enumerate it and we won't try to bind the vmd
> driver to it?)
> 

Yes

>>>     - Which devices are passed through to a virtual guest and enumerated
>>>       there?
>>
>> All devices under VMD are passed to a virtual guest
> 
> So the guest will see the VMD Root Ports, but not the VMD RCiEP
> itself?
> 

The guest will see the VMD device and then the vmd driver in the guest 
will enumerate the devices behind it is my understanding

>>>     - Where does the vmd driver run (host or guest or both)?
>>
>> I believe the answer is both.
> 
> If the VMD RCiEP isn't passed through to the guest, how can the vmd
> driver do anything in the guest?
> 

The VMD device is passed through to the guest. It works just like bare 
metal in that the guest OS detects the VMD device and loads the vmd 
driver which then enumerates the devices into the guest

>>>     - Who (host or guest) runs the _OSC for the new VMD domain?
>>
>> I believe the answer here is neither :) This has been an issue since commit
>> 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"). I've submitted
>> this patch (https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/)
>> to attempt to fix the issue.
> 
> IIUC we make no attempt to evaluate _OSC for the new VMD domain (and
> firmware likely would not supply one anyway since it doesn't know how
> to enumerate anything there).  So 04b12ef163d1 just assumes the _OSC
> that applies to the VMD RCiEP also applies to the new VMD domain below
> the VMD.
> 
> If that's what we want to happen, we should state this explicitly.
> But I don't think it *is* what we want; see below.
> 
>> You are much more of an expert in this area than I am, but as far as I can
>> tell the only way the _OSC bits get cleared is via ACPI (specifically this
>> code https://elixir.bootlin.com/linux/latest/source/drivers/acpi/pci_root.c#L1038).
>> Since ACPI doesn't run on the devices behind VMD the _OSC bits don't get set
>> properly for them.
> 
> Apparently there's *something* in ACPI related to the VMD domain
> because 59dc33252ee7 ("PCI: VMD: ACPI: Make ACPI companion lookup work
> for VMD bus") seems to add support for ACPI companions for devices in
> the VMD domain?
> 

I've seen this code also and don't understand what it's supposed to do. 
I'll investigate this further

> I don't understand what's going on here because if BIOS enables VMD,
> firmware would see the VMD RCiEP but not devices behind it.  And if
> BIOS disables VMD, those devices should all appear normally and we
> wouldn't need 59dc33252ee7.
> 

I agree, I'll try to figure this out and get an answer

>> Ultimately the only _OSC bits that VMD cares about are the hotplug bits
>> because that is a feature of our device; it enables hotplug in guests where
>> there is no way to enable it. That's why my patch is to set them all the
>> time and copy the other _OSC bits because there is no other way to enable
>> this feature (i.e. there is no user space tool to enable/disable it).
> 
> And here's the problem, because the _OSC that applies to domain X that
> contains the VMD RCiEP may result in the platform retaining ownership
> of all these PCIe features (hotplug, AER, PME, etc), but you want OSPM
> to own them for the VMD domain Y, regardless of whether it owns them
> for domain X.
> 

I thought each domain gets it's own _OSC. In the case of VMD, it creates 
a new domain so the _OSC for that domain should be determined somehow. 
Normally that determination would be done via ACPI (if I'm understanding 
things correctly) and in the case of VMD I am saying that hotplug should 
always be enabled and we can use the _OSC from another domain for the 
other bits.

I went down a rabbit hole when I started working on this and had an idea 
that maybe the VMD driver should create an ACPI table, but that seemed 
way more complicated than it needs to be. Plus I wasn't sure what good 
it would do because I don't understand the relationship between the root 
ports below VMD and the settings we would use for the bridge.

> OSPM *did* own all PCIe features before 04b12ef163d1 because the new
> VMD "host bridge" got native owership by default.
> 
>>>     - What happens to interrupts generated by devices downstream from
>>>       VMD, e.g., AER interrupts from VMD Root Ports, hotplug interrupts
>>>       from VMD Root Ports or switch downstream ports?  Who fields them?
>>>       In general firmware would field them unless it grants ownership
>>>       via _OSC.  If firmware grants ownership (or the OS forcibly takes
>>>       it by overriding it for hotplug), I guess the OS that requested
>>>       ownership would field them?
>>
>> The interrupts are passed through VMD to the OS. This was the AER issue that
>> resulted in commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe
>> features"). IIRC AER was disabled in the BIOS, but is was enabled in the VMD
>> host bridge because pci_init_host_bridge() sets all the bits to 1 and that
>> generated an AER interrupt storm.
>>
>> In bare metal scenarios the _OSC bits are correct, but in a hypervisor
>> scenario the bits are wrong because they are all 0 regardless of what the
>> ACPI tables indicate. The challenge is that the VMD driver has no way to
>> know it's in a hypervisor to set the hotplug bits correctly.
> 
> This is the problem.  We claim "the bits are wrong because they are
> all 0", but I don't think we can derive that conclusion from anything
> in the ACPI, PCIe, or PCI Firmware specs, which makes it hard to
> maintain.
> 

I guess I look at it this way: if I run a bare metal OS and I get 
hotplug and AER are enabled and then I run a guest on the same system 
and I get all _OSC bits are disabled then I think the bits aren't 
correct. But there isn't any way to detect that through the "normal" 
channels (ACPI, PCIe, etc).

> IIUC, the current situation is "regardless of what firmware said, in
> the VMD domain we want AER disabled and hotplug enabled."
> 

We aren't saying we want AER disabled, we are just saying we want 
hotplug enabled. The observation is that in a hypervisor scenario AER is 
going to be disabled because the _OSC bits are all 0.

> 04b12ef163d1 disabled AER by copying the _OSC that applied to the VMD
> RCiEP (although this sounds broken because it probably left AER
> *enabled* if that _OSC happened to grant ownership to the OS).
> 
> And your patch at
> https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/
> enables hotplug by setting native_pcie_hotplug to 1.
> 
> It seems like the only clear option is to say "the vmd driver owns all
> PCIe services in the VMD domain, the platform does not supply _OSC for
> the VMD domain, the platform can't do anything with PCIe services in
> the VMD domain, and the vmd driver needs to explicitly enable/disable
> services as it needs."
> 

I actually looked at this as well :) I had an idea to set the _OSC bits 
to 0 when the vmd driver created the domain. The look at all the root 
ports underneath it and see if AER and PM were set. If any root port 
underneath VMD set AER or PM then I would set the _OSC bit for the 
bridge to 1. That way if any root port underneath VMD had enabled AER 
(as an example) then that feature would still work. I didn't test this 
in a hypervisor scenario though so not sure what I would see.

Would that be an acceptable idea?

Paul

> Bjorn
Bjorn Helgaas April 22, 2024, 8:27 p.m. UTC | #8
On Fri, Apr 19, 2024 at 03:18:19PM -0700, Paul M Stillwell Jr wrote:
> On 4/19/2024 2:14 PM, Bjorn Helgaas wrote:
> > On Thu, Apr 18, 2024 at 02:51:19PM -0700, Paul M Stillwell Jr wrote:
> > > On 4/18/2024 11:26 AM, Bjorn Helgaas wrote:
> > > > On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
> > > > > Adding documentation for the Intel VMD driver and updating the index
> > > > > file to include it.

> > > >     - Which devices are passed through to a virtual guest and enumerated
> > > >       there?
> > > 
> > > All devices under VMD are passed to a virtual guest
> > 
> > So the guest will see the VMD Root Ports, but not the VMD RCiEP
> > itself?
> 
> The guest will see the VMD device and then the vmd driver in the guest will
> enumerate the devices behind it is my understanding
> 
> > > >     - Where does the vmd driver run (host or guest or both)?
> > > 
> > > I believe the answer is both.
> > 
> > If the VMD RCiEP isn't passed through to the guest, how can the vmd
> > driver do anything in the guest?
> 
> The VMD device is passed through to the guest. It works just like bare metal
> in that the guest OS detects the VMD device and loads the vmd driver which
> then enumerates the devices into the guest

I guess it's obvious that the VMD RCiEP must be passed through to the
guest because the whole point of
https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/
is to do something in the guest.

It does puzzle me that we have two copies of the vmd driver (one in
the host OS and another in the guest OS) that think they own the same
physical device.  I'm not a virtualization guru but that sounds
potentially problematic.

> > IIUC, the current situation is "regardless of what firmware said, in
> > the VMD domain we want AER disabled and hotplug enabled."
> 
> We aren't saying we want AER disabled, we are just saying we want hotplug
> enabled. The observation is that in a hypervisor scenario AER is going to be
> disabled because the _OSC bits are all 0.

04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") is saying
we want AER disabled for the VMD domain, isn't it?

> > It seems like the only clear option is to say "the vmd driver owns all
> > PCIe services in the VMD domain, the platform does not supply _OSC for
> > the VMD domain, the platform can't do anything with PCIe services in
> > the VMD domain, and the vmd driver needs to explicitly enable/disable
> > services as it needs."
> 
> I actually looked at this as well :) I had an idea to set the _OSC bits to 0
> when the vmd driver created the domain. The look at all the root ports
> underneath it and see if AER and PM were set. If any root port underneath
> VMD set AER or PM then I would set the _OSC bit for the bridge to 1. That
> way if any root port underneath VMD had enabled AER (as an example) then
> that feature would still work. I didn't test this in a hypervisor scenario
> though so not sure what I would see.

_OSC negotiates ownership of features between platform firmware and
OSPM.  The "native_pcie_hotplug" and similar bits mean that "IF a
device advertises the feature, the OS can use it."  We clear those
native_* bits if the platform retains ownership via _OSC.

If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for
the domain below it, why would we assume that BIOS retains ownership
of the features negotiated by _OSC?  I think we have to assume the OS
owns them, which is what happened before 04b12ef163d1.

Bjorn
Paul M Stillwell Jr April 22, 2024, 9:39 p.m. UTC | #9
On 4/22/2024 1:27 PM, Bjorn Helgaas wrote:
> On Fri, Apr 19, 2024 at 03:18:19PM -0700, Paul M Stillwell Jr wrote:
>> On 4/19/2024 2:14 PM, Bjorn Helgaas wrote:
>>> On Thu, Apr 18, 2024 at 02:51:19PM -0700, Paul M Stillwell Jr wrote:
>>>> On 4/18/2024 11:26 AM, Bjorn Helgaas wrote:
>>>>> On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
>>>>>> Adding documentation for the Intel VMD driver and updating the index
>>>>>> file to include it.
> 
>>>>>      - Which devices are passed through to a virtual guest and enumerated
>>>>>        there?
>>>>
>>>> All devices under VMD are passed to a virtual guest
>>>
>>> So the guest will see the VMD Root Ports, but not the VMD RCiEP
>>> itself?
>>
>> The guest will see the VMD device and then the vmd driver in the guest will
>> enumerate the devices behind it is my understanding
>>
>>>>>      - Where does the vmd driver run (host or guest or both)?
>>>>
>>>> I believe the answer is both.
>>>
>>> If the VMD RCiEP isn't passed through to the guest, how can the vmd
>>> driver do anything in the guest?
>>
>> The VMD device is passed through to the guest. It works just like bare metal
>> in that the guest OS detects the VMD device and loads the vmd driver which
>> then enumerates the devices into the guest
> 
> I guess it's obvious that the VMD RCiEP must be passed through to the
> guest because the whole point of
> https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/
> is to do something in the guest.
> 
> It does puzzle me that we have two copies of the vmd driver (one in
> the host OS and another in the guest OS) that think they own the same
> physical device.  I'm not a virtualization guru but that sounds
> potentially problematic.
> 
>>> IIUC, the current situation is "regardless of what firmware said, in
>>> the VMD domain we want AER disabled and hotplug enabled."
>>
>> We aren't saying we want AER disabled, we are just saying we want hotplug
>> enabled. The observation is that in a hypervisor scenario AER is going to be
>> disabled because the _OSC bits are all 0.
> 
> 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") is saying
> we want AER disabled for the VMD domain, isn't it?
> 

I don't see it that way, but maybe I'm misunderstanding something. Here 
is the code from that commit (only the portion of interest):

+/*
+ * Since VMD is an aperture to regular PCIe root ports, only allow it to
+ * control features that the OS is allowed to control on the physical 
PCI bus.
+ */
+static void vmd_copy_host_bridge_flags(struct pci_host_bridge *root_bridge,
+                                      struct pci_host_bridge *vmd_bridge)
+{
+       vmd_bridge->native_pcie_hotplug = root_bridge->native_pcie_hotplug;
+       vmd_bridge->native_shpc_hotplug = root_bridge->native_shpc_hotplug;
+       vmd_bridge->native_aer = root_bridge->native_aer;
+       vmd_bridge->native_pme = root_bridge->native_pme;
+       vmd_bridge->native_ltr = root_bridge->native_ltr;
+       vmd_bridge->native_dpc = root_bridge->native_dpc;
+}
+

When I look at this, we are copying whatever is in the root_bridge to 
the vmd_bridge. In a bare metal scenario, this is correct and AER will 
be whatever the BIOS set up (IIRC on my bare metal system AER is 
enabled). In a hypervisor scenario the root_bridge bits will all be 0 
which effectively disables AER and all the similar bits.

Prior to this commit all the native_* bits were set to 1 because 
pci_init_host_bridge() sets them all to 1 so we were enabling AER et all 
despite what the OS may have wanted. With the commit we are setting the 
bits to whatever the BIOS has set them to.


>>> It seems like the only clear option is to say "the vmd driver owns all
>>> PCIe services in the VMD domain, the platform does not supply _OSC for
>>> the VMD domain, the platform can't do anything with PCIe services in
>>> the VMD domain, and the vmd driver needs to explicitly enable/disable
>>> services as it needs."
>>
>> I actually looked at this as well :) I had an idea to set the _OSC bits to 0
>> when the vmd driver created the domain. The look at all the root ports
>> underneath it and see if AER and PM were set. If any root port underneath
>> VMD set AER or PM then I would set the _OSC bit for the bridge to 1. That
>> way if any root port underneath VMD had enabled AER (as an example) then
>> that feature would still work. I didn't test this in a hypervisor scenario
>> though so not sure what I would see.
> 
> _OSC negotiates ownership of features between platform firmware and
> OSPM.  The "native_pcie_hotplug" and similar bits mean that "IF a
> device advertises the feature, the OS can use it."  We clear those
> native_* bits if the platform retains ownership via _OSC.
> 
> If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for
> the domain below it, why would we assume that BIOS retains ownership
> of the features negotiated by _OSC?  I think we have to assume the OS
> owns them, which is what happened before 04b12ef163d1.
> 

Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is 
disabled) then all the root ports and devices underneath VMD are visible 
to the BIOS and OS so ACPI would run on all of them and the _OSC bits 
should be set correctly.

There was a previous suggestion to print what VMD is owning. I think 
that is a good idea and I can add that to my patch. It would look like 
this (following the way OS support gets printed):

VMD supports PCIeHotplug
VMD supports SHPCHotplug
VMD now owns PCIeHotplug
VMD now owns SHPCHotplug

We could probably do the same thing for the other bits (based on the 
native_* bits). That might make it clearer...

Paul
> Bjorn
>
Bjorn Helgaas April 22, 2024, 10:52 p.m. UTC | #10
On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote:
> On 4/22/2024 1:27 PM, Bjorn Helgaas wrote:
> > On Fri, Apr 19, 2024 at 03:18:19PM -0700, Paul M Stillwell Jr wrote:
> > > On 4/19/2024 2:14 PM, Bjorn Helgaas wrote:
> > > > On Thu, Apr 18, 2024 at 02:51:19PM -0700, Paul M Stillwell Jr wrote:
> > > > > On 4/18/2024 11:26 AM, Bjorn Helgaas wrote:
> > > > > > On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
> > > > > > > Adding documentation for the Intel VMD driver and updating the index
> > > > > > > file to include it.
> > 
> > > > > >      - Which devices are passed through to a virtual guest and enumerated
> > > > > >        there?
> > > > > 
> > > > > All devices under VMD are passed to a virtual guest
> > > > 
> > > > So the guest will see the VMD Root Ports, but not the VMD RCiEP
> > > > itself?
> > > 
> > > The guest will see the VMD device and then the vmd driver in the guest will
> > > enumerate the devices behind it is my understanding
> > > 
> > > > > >      - Where does the vmd driver run (host or guest or both)?
> > > > > 
> > > > > I believe the answer is both.
> > > > 
> > > > If the VMD RCiEP isn't passed through to the guest, how can the vmd
> > > > driver do anything in the guest?
> > > 
> > > The VMD device is passed through to the guest. It works just like bare metal
> > > in that the guest OS detects the VMD device and loads the vmd driver which
> > > then enumerates the devices into the guest
> > 
> > I guess it's obvious that the VMD RCiEP must be passed through to the
> > guest because the whole point of
> > https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/
> > is to do something in the guest.
> > 
> > It does puzzle me that we have two copies of the vmd driver (one in
> > the host OS and another in the guest OS) that think they own the same
> > physical device.  I'm not a virtualization guru but that sounds
> > potentially problematic.
> > 
> > > > IIUC, the current situation is "regardless of what firmware said, in
> > > > the VMD domain we want AER disabled and hotplug enabled."
> > > 
> > > We aren't saying we want AER disabled, we are just saying we want hotplug
> > > enabled. The observation is that in a hypervisor scenario AER is going to be
> > > disabled because the _OSC bits are all 0.
> > 
> > 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") is saying
> > we want AER disabled for the VMD domain, isn't it?
> 
> I don't see it that way, but maybe I'm misunderstanding something. Here is
> the code from that commit (only the portion of interest):
> 
> +/*
> + * Since VMD is an aperture to regular PCIe root ports, only allow it to
> + * control features that the OS is allowed to control on the physical PCI
> bus.
> + */
> +static void vmd_copy_host_bridge_flags(struct pci_host_bridge *root_bridge,
> +                                      struct pci_host_bridge *vmd_bridge)
> +{
> +       vmd_bridge->native_pcie_hotplug = root_bridge->native_pcie_hotplug;
> +       vmd_bridge->native_shpc_hotplug = root_bridge->native_shpc_hotplug;
> +       vmd_bridge->native_aer = root_bridge->native_aer;
> +       vmd_bridge->native_pme = root_bridge->native_pme;
> +       vmd_bridge->native_ltr = root_bridge->native_ltr;
> +       vmd_bridge->native_dpc = root_bridge->native_dpc;
> +}
> +
> 
> When I look at this, we are copying whatever is in the root_bridge to the
> vmd_bridge.

Right.  We're copying the results of the _OSC that applies to the VMD
RCiEP (not an _OSC that applies to the VMD domain).

> In a bare metal scenario, this is correct and AER will be whatever
> the BIOS set up (IIRC on my bare metal system AER is enabled).

I think the first dmesg log at
https://bugzilla.kernel.org/show_bug.cgi?id=215027 is from a host OS:

  [    0.000000] DMI: Dell Inc. Dell G15 Special Edition 5521/, BIOS 0.4.3 10/20/2021
  [    0.408990] ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-e0])
  [    0.410076] acpi PNP0A08:00: _OSC: platform does not support [AER]
  [    0.412207] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME PCIeCapability LTR]
  [    1.367220] vmd 0000:00:0e.0: PCI host bridge to bus 10000:e0
  [    1.486704] pcieport 10000:e0:06.0: AER: enabled with IRQ 146

We evaluated _OSC for domain 0000 with OSC_QUERY_ENABLE ("Query
Support Flag" in the specs) and learned the platform doesn't support
AER, so we removed that from the features we request.  We don't
request control of AER in domain 0000, so native_aer will be 0.  AER
might be enabled by firmware, but the host Linux should not enable it
in domain 0000.

In domain 10000, the host Linux *does* enable AER because all new host
bridges (including the new VMD domain 10000) assume native control to
start with, and we update that based on _OSC results.  There's no ACPI
device describing the VMD host bridge, so there's no _OSC that applies
to it, so Linux assumes it owns everything.

> In a hypervisor scenario the root_bridge bits will all be 0 which
> effectively disables AER and all the similar bits.

I don't think https://bugzilla.kernel.org/show_bug.cgi?id=215027
includes a dmesg log from a hypervisor guest, so I don't know what
happens there.

The host_bridge->native_* bits always start as 1 (OS owns the feature)
and they only get cleared if there's an _OSC that retains firmware
ownership.

> Prior to this commit all the native_* bits were set to 1 because
> pci_init_host_bridge() sets them all to 1 so we were enabling AER et all
> despite what the OS may have wanted. With the commit we are setting the bits
> to whatever the BIOS has set them to.

Prior to 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"),
the host OS owned all features in the VMD domain.  After 04b12ef163d1,
it depended on whatever the _OSC for the VMD RCiEP said.

On the Dell mentioned in that bugzilla, the domain 0000 _OSC said AER
wasn't supported, and 04b12ef163d1 applied that to the VMD domain as
well, so the host OS didn't enable AER in either domain.

But other machines would have different answers.  On a machine that
supports AER and grants ownership to the OS, 04b12ef163d1 would enable
AER in both domain 0000 and the VMD domain.

I don't think we know the root cause of the Correctable Errors from
that bugzilla.  But it's likely that they would still occur on a
machine that granted AER to the OS, and if we enable AER in the VMD
domain, we would probably still have the flood.

That's is why I think 04b12ef163d1 is problematic: it only disables
AER in the VMD domain when AER happens to be disabled in domain 0000. 

> > > > It seems like the only clear option is to say "the vmd driver owns all
> > > > PCIe services in the VMD domain, the platform does not supply _OSC for
> > > > the VMD domain, the platform can't do anything with PCIe services in
> > > > the VMD domain, and the vmd driver needs to explicitly enable/disable
> > > > services as it needs."
> > > 
> > > I actually looked at this as well :) I had an idea to set the _OSC bits to 0
> > > when the vmd driver created the domain. The look at all the root ports
> > > underneath it and see if AER and PM were set. If any root port underneath
> > > VMD set AER or PM then I would set the _OSC bit for the bridge to 1. That
> > > way if any root port underneath VMD had enabled AER (as an example) then
> > > that feature would still work. I didn't test this in a hypervisor scenario
> > > though so not sure what I would see.
> > 
> > _OSC negotiates ownership of features between platform firmware and
> > OSPM.  The "native_pcie_hotplug" and similar bits mean that "IF a
> > device advertises the feature, the OS can use it."  We clear those
> > native_* bits if the platform retains ownership via _OSC.
> > 
> > If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for
> > the domain below it, why would we assume that BIOS retains ownership
> > of the features negotiated by _OSC?  I think we have to assume the OS
> > owns them, which is what happened before 04b12ef163d1.
> 
> Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled)
> then all the root ports and devices underneath VMD are visible to the BIOS
> and OS so ACPI would run on all of them and the _OSC bits should be set
> correctly.

Sorry, that was confusing.  I think there are two pieces to enabling
VMD:

  1) There's the BIOS "VMD enable" switch.  If set, the VMD device
  appears as an RCiEP and the devices behind it are invisible to the
  BIOS.  If cleared, VMD doesn't exist; the VMD RCiEP is hidden and
  the devices behind it appear as normal Root Ports with devices below
  them.

  2) When the BIOS "VMD enable" is set, the OS vmd driver configures
  the VMD RCiEP and enumerates things below the VMD host bridge.

  In this case, BIOS enables the VMD RCiEP, but it doesn't have a
  driver for it and it doesn't know how to enumerate the VMD Root
  Ports, so I don't think it makes sense for BIOS to own features for
  devices it doesn't know about.

Bjorn
Paul M Stillwell Jr April 22, 2024, 11:39 p.m. UTC | #11
On 4/22/2024 3:52 PM, Bjorn Helgaas wrote:
> On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote:
>> On 4/22/2024 1:27 PM, Bjorn Helgaas wrote:
>>> On Fri, Apr 19, 2024 at 03:18:19PM -0700, Paul M Stillwell Jr wrote:
>>>> On 4/19/2024 2:14 PM, Bjorn Helgaas wrote:
>>>>> On Thu, Apr 18, 2024 at 02:51:19PM -0700, Paul M Stillwell Jr wrote:
>>>>>> On 4/18/2024 11:26 AM, Bjorn Helgaas wrote:
>>>>>>> On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
>>>>>>>> Adding documentation for the Intel VMD driver and updating the index
>>>>>>>> file to include it.
>>>
>>>>>>>       - Which devices are passed through to a virtual guest and enumerated
>>>>>>>         there?
>>>>>>
>>>>>> All devices under VMD are passed to a virtual guest
>>>>>
>>>>> So the guest will see the VMD Root Ports, but not the VMD RCiEP
>>>>> itself?
>>>>
>>>> The guest will see the VMD device and then the vmd driver in the guest will
>>>> enumerate the devices behind it is my understanding
>>>>
>>>>>>>       - Where does the vmd driver run (host or guest or both)?
>>>>>>
>>>>>> I believe the answer is both.
>>>>>
>>>>> If the VMD RCiEP isn't passed through to the guest, how can the vmd
>>>>> driver do anything in the guest?
>>>>
>>>> The VMD device is passed through to the guest. It works just like bare metal
>>>> in that the guest OS detects the VMD device and loads the vmd driver which
>>>> then enumerates the devices into the guest
>>>
>>> I guess it's obvious that the VMD RCiEP must be passed through to the
>>> guest because the whole point of
>>> https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/
>>> is to do something in the guest.
>>>
>>> It does puzzle me that we have two copies of the vmd driver (one in
>>> the host OS and another in the guest OS) that think they own the same
>>> physical device.  I'm not a virtualization guru but that sounds
>>> potentially problematic.
>>>
>>>>> IIUC, the current situation is "regardless of what firmware said, in
>>>>> the VMD domain we want AER disabled and hotplug enabled."
>>>>
>>>> We aren't saying we want AER disabled, we are just saying we want hotplug
>>>> enabled. The observation is that in a hypervisor scenario AER is going to be
>>>> disabled because the _OSC bits are all 0.
>>>
>>> 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") is saying
>>> we want AER disabled for the VMD domain, isn't it?
>>
>> I don't see it that way, but maybe I'm misunderstanding something. Here is
>> the code from that commit (only the portion of interest):
>>
>> +/*
>> + * Since VMD is an aperture to regular PCIe root ports, only allow it to
>> + * control features that the OS is allowed to control on the physical PCI
>> bus.
>> + */
>> +static void vmd_copy_host_bridge_flags(struct pci_host_bridge *root_bridge,
>> +                                      struct pci_host_bridge *vmd_bridge)
>> +{
>> +       vmd_bridge->native_pcie_hotplug = root_bridge->native_pcie_hotplug;
>> +       vmd_bridge->native_shpc_hotplug = root_bridge->native_shpc_hotplug;
>> +       vmd_bridge->native_aer = root_bridge->native_aer;
>> +       vmd_bridge->native_pme = root_bridge->native_pme;
>> +       vmd_bridge->native_ltr = root_bridge->native_ltr;
>> +       vmd_bridge->native_dpc = root_bridge->native_dpc;
>> +}
>> +
>>
>> When I look at this, we are copying whatever is in the root_bridge to the
>> vmd_bridge.
> 
> Right.  We're copying the results of the _OSC that applies to the VMD
> RCiEP (not an _OSC that applies to the VMD domain).
> 
>> In a bare metal scenario, this is correct and AER will be whatever
>> the BIOS set up (IIRC on my bare metal system AER is enabled).
> 
> I think the first dmesg log at
> https://bugzilla.kernel.org/show_bug.cgi?id=215027 is from a host OS:
> 
>    [    0.000000] DMI: Dell Inc. Dell G15 Special Edition 5521/, BIOS 0.4.3 10/20/2021
>    [    0.408990] ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-e0])
>    [    0.410076] acpi PNP0A08:00: _OSC: platform does not support [AER]
>    [    0.412207] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME PCIeCapability LTR]
>    [    1.367220] vmd 0000:00:0e.0: PCI host bridge to bus 10000:e0
>    [    1.486704] pcieport 10000:e0:06.0: AER: enabled with IRQ 146
> 
> We evaluated _OSC for domain 0000 with OSC_QUERY_ENABLE ("Query
> Support Flag" in the specs) and learned the platform doesn't support
> AER, so we removed that from the features we request.  We don't
> request control of AER in domain 0000, so native_aer will be 0.  AER
> might be enabled by firmware, but the host Linux should not enable it
> in domain 0000.
> 
> In domain 10000, the host Linux *does* enable AER because all new host
> bridges (including the new VMD domain 10000) assume native control to
> start with, and we update that based on _OSC results.  There's no ACPI
> device describing the VMD host bridge, so there's no _OSC that applies
> to it, so Linux assumes it owns everything.
> 
>> In a hypervisor scenario the root_bridge bits will all be 0 which
>> effectively disables AER and all the similar bits.
> 
> I don't think https://bugzilla.kernel.org/show_bug.cgi?id=215027
> includes a dmesg log from a hypervisor guest, so I don't know what
> happens there.
> 
> The host_bridge->native_* bits always start as 1 (OS owns the feature)
> and they only get cleared if there's an _OSC that retains firmware
> ownership.
> 
>> Prior to this commit all the native_* bits were set to 1 because
>> pci_init_host_bridge() sets them all to 1 so we were enabling AER et all
>> despite what the OS may have wanted. With the commit we are setting the bits
>> to whatever the BIOS has set them to.
> 
> Prior to 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"),
> the host OS owned all features in the VMD domain.  After 04b12ef163d1,
> it depended on whatever the _OSC for the VMD RCiEP said.
> 
> On the Dell mentioned in that bugzilla, the domain 0000 _OSC said AER
> wasn't supported, and 04b12ef163d1 applied that to the VMD domain as
> well, so the host OS didn't enable AER in either domain.
> 
> But other machines would have different answers.  On a machine that
> supports AER and grants ownership to the OS, 04b12ef163d1 would enable
> AER in both domain 0000 and the VMD domain.
> 
> I don't think we know the root cause of the Correctable Errors from
> that bugzilla.  But it's likely that they would still occur on a
> machine that granted AER to the OS, and if we enable AER in the VMD
> domain, we would probably still have the flood.
> 
> That's is why I think 04b12ef163d1 is problematic: it only disables
> AER in the VMD domain when AER happens to be disabled in domain 0000.
> 

That's a great explanation and very helpful to me, thanks for that!

>>>>> It seems like the only clear option is to say "the vmd driver owns all
>>>>> PCIe services in the VMD domain, the platform does not supply _OSC for
>>>>> the VMD domain, the platform can't do anything with PCIe services in
>>>>> the VMD domain, and the vmd driver needs to explicitly enable/disable
>>>>> services as it needs."
>>>>
>>>> I actually looked at this as well :) I had an idea to set the _OSC bits to 0
>>>> when the vmd driver created the domain. The look at all the root ports
>>>> underneath it and see if AER and PM were set. If any root port underneath
>>>> VMD set AER or PM then I would set the _OSC bit for the bridge to 1. That
>>>> way if any root port underneath VMD had enabled AER (as an example) then
>>>> that feature would still work. I didn't test this in a hypervisor scenario
>>>> though so not sure what I would see.
>>>
>>> _OSC negotiates ownership of features between platform firmware and
>>> OSPM.  The "native_pcie_hotplug" and similar bits mean that "IF a
>>> device advertises the feature, the OS can use it."  We clear those
>>> native_* bits if the platform retains ownership via _OSC.
>>>
>>> If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for
>>> the domain below it, why would we assume that BIOS retains ownership
>>> of the features negotiated by _OSC?  I think we have to assume the OS
>>> owns them, which is what happened before 04b12ef163d1.
>>
>> Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled)
>> then all the root ports and devices underneath VMD are visible to the BIOS
>> and OS so ACPI would run on all of them and the _OSC bits should be set
>> correctly.
> 
> Sorry, that was confusing.  I think there are two pieces to enabling
> VMD:
> 
>    1) There's the BIOS "VMD enable" switch.  If set, the VMD device
>    appears as an RCiEP and the devices behind it are invisible to the
>    BIOS.  If cleared, VMD doesn't exist; the VMD RCiEP is hidden and
>    the devices behind it appear as normal Root Ports with devices below
>    them.
> 
>    2) When the BIOS "VMD enable" is set, the OS vmd driver configures
>    the VMD RCiEP and enumerates things below the VMD host bridge.
> 
>    In this case, BIOS enables the VMD RCiEP, but it doesn't have a
>    driver for it and it doesn't know how to enumerate the VMD Root
>    Ports, so I don't think it makes sense for BIOS to own features for
>    devices it doesn't know about.
> 

That makes sense to me. It sounds like VMD should own all the features, 
I just don't know how the vmd driver would set the bits other than 
hotplug correctly... We know leaving them on is problematic, but I'm not 
sure what method to use to decide which of the other bits should be set 
or not.

These other features are in a no mans land. I did consider running ACPI 
for the devices behind VMD, but the ACPI functions are almost all static 
to acpi.c (IIRC) and I didn't feel like I should rototill the ACPI code 
for this...

I'm hesitant to set all the other feature bits to 0 as well because we 
may have customers who want AER on their devices behind VMD. I'll do 
some research on my side to see how we enable/handle other features.

If it turns out that we never use AER or any of the other features would 
it be ok if we set everything other than hotplug to 0 and print some 
messages similar to the OS stating that VMD owns all these features and 
isn't supporting them?

Paul
Bjorn Helgaas April 23, 2024, 9:26 p.m. UTC | #12
On Mon, Apr 22, 2024 at 04:39:19PM -0700, Paul M Stillwell Jr wrote:
> On 4/22/2024 3:52 PM, Bjorn Helgaas wrote:
> > On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote:
> > > On 4/22/2024 1:27 PM, Bjorn Helgaas wrote:
> ...

> > > > _OSC negotiates ownership of features between platform firmware and
> > > > OSPM.  The "native_pcie_hotplug" and similar bits mean that "IF a
> > > > device advertises the feature, the OS can use it."  We clear those
> > > > native_* bits if the platform retains ownership via _OSC.
> > > > 
> > > > If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for
> > > > the domain below it, why would we assume that BIOS retains ownership
> > > > of the features negotiated by _OSC?  I think we have to assume the OS
> > > > owns them, which is what happened before 04b12ef163d1.
> > > 
> > > Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled)
> > > then all the root ports and devices underneath VMD are visible to the BIOS
> > > and OS so ACPI would run on all of them and the _OSC bits should be set
> > > correctly.
> > 
> > Sorry, that was confusing.  I think there are two pieces to enabling
> > VMD:
> > 
> >    1) There's the BIOS "VMD enable" switch.  If set, the VMD device
> >    appears as an RCiEP and the devices behind it are invisible to the
> >    BIOS.  If cleared, VMD doesn't exist; the VMD RCiEP is hidden and
> >    the devices behind it appear as normal Root Ports with devices below
> >    them.
> > 
> >    2) When the BIOS "VMD enable" is set, the OS vmd driver configures
> >    the VMD RCiEP and enumerates things below the VMD host bridge.
> > 
> >    In this case, BIOS enables the VMD RCiEP, but it doesn't have a
> >    driver for it and it doesn't know how to enumerate the VMD Root
> >    Ports, so I don't think it makes sense for BIOS to own features for
> >    devices it doesn't know about.
> 
> That makes sense to me. It sounds like VMD should own all the features, I
> just don't know how the vmd driver would set the bits other than hotplug
> correctly... We know leaving them on is problematic, but I'm not sure what
> method to use to decide which of the other bits should be set or not.

My starting assumption would be that we'd handle the VMD domain the
same as other PCI domains: if a device advertises a feature, the
kernel includes support for it, and the kernel owns it, we enable it.

If a device advertises a feature but there's a hardware problem with
it, the usual approach is to add a quirk to work around the problem.
The Correctable Error issue addressed by 04b12ef163d1 ("PCI: vmd:
Honor ACPI _OSC on PCIe features"), looks like it might be in this
category.

Bjorn
Paul M Stillwell Jr April 23, 2024, 11:10 p.m. UTC | #13
On 4/23/2024 2:26 PM, Bjorn Helgaas wrote:
> On Mon, Apr 22, 2024 at 04:39:19PM -0700, Paul M Stillwell Jr wrote:
>> On 4/22/2024 3:52 PM, Bjorn Helgaas wrote:
>>> On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote:
>>>> On 4/22/2024 1:27 PM, Bjorn Helgaas wrote:
>> ...
> 
>>>>> _OSC negotiates ownership of features between platform firmware and
>>>>> OSPM.  The "native_pcie_hotplug" and similar bits mean that "IF a
>>>>> device advertises the feature, the OS can use it."  We clear those
>>>>> native_* bits if the platform retains ownership via _OSC.
>>>>>
>>>>> If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for
>>>>> the domain below it, why would we assume that BIOS retains ownership
>>>>> of the features negotiated by _OSC?  I think we have to assume the OS
>>>>> owns them, which is what happened before 04b12ef163d1.
>>>>
>>>> Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled)
>>>> then all the root ports and devices underneath VMD are visible to the BIOS
>>>> and OS so ACPI would run on all of them and the _OSC bits should be set
>>>> correctly.
>>>
>>> Sorry, that was confusing.  I think there are two pieces to enabling
>>> VMD:
>>>
>>>     1) There's the BIOS "VMD enable" switch.  If set, the VMD device
>>>     appears as an RCiEP and the devices behind it are invisible to the
>>>     BIOS.  If cleared, VMD doesn't exist; the VMD RCiEP is hidden and
>>>     the devices behind it appear as normal Root Ports with devices below
>>>     them.
>>>
>>>     2) When the BIOS "VMD enable" is set, the OS vmd driver configures
>>>     the VMD RCiEP and enumerates things below the VMD host bridge.
>>>
>>>     In this case, BIOS enables the VMD RCiEP, but it doesn't have a
>>>     driver for it and it doesn't know how to enumerate the VMD Root
>>>     Ports, so I don't think it makes sense for BIOS to own features for
>>>     devices it doesn't know about.
>>
>> That makes sense to me. It sounds like VMD should own all the features, I
>> just don't know how the vmd driver would set the bits other than hotplug
>> correctly... We know leaving them on is problematic, but I'm not sure what
>> method to use to decide which of the other bits should be set or not.
> 
> My starting assumption would be that we'd handle the VMD domain the
> same as other PCI domains: if a device advertises a feature, the
> kernel includes support for it, and the kernel owns it, we enable it.
> 

I've been poking around and it seems like some things (I was looking for 
AER) are global to the platform. In my investigation (which is a small 
sample size of machines) it looks like there is a single entry in the 
BIOS to enable/disable AER so whatever is in one domain should be the 
same in all the domains. I couldn't find settings for LTR or the other 
bits, but I'm not sure what to look for in the BIOS for those.

So it seems that there are 2 categories: platform global and device 
specific. AER and probably some of the others are global and can be 
copied from one domain to another, but things like hotplug are device 
specific and should be handled that way.

Based on this it seems like my patch here 
https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/ 
is probably the correct thing to do, but I think adding some output into 
dmesg to indicate that VMD owns hotplug and has enabled it needs to be done.

> If a device advertises a feature but there's a hardware problem with
> it, the usual approach is to add a quirk to work around the problem.
> The Correctable Error issue addressed by 04b12ef163d1 ("PCI: vmd:
> Honor ACPI _OSC on PCIe features"), looks like it might be in this
> category.
> 

I don't think we had a hardware problem with these Samsung (IIRC) 
devices; the issue was that the vmd driver were incorrectly enabling AER 
because those native_* bits get set automatically. I think 04b12ef163d1 
is doing the correct thing, but it is incorrectly copying the hotplug 
bits. Those are device specific and should be handled by the device 
instead of based on another domain.

Paul
Bjorn Helgaas April 24, 2024, 12:47 a.m. UTC | #14
On Tue, Apr 23, 2024 at 04:10:37PM -0700, Paul M Stillwell Jr wrote:
> On 4/23/2024 2:26 PM, Bjorn Helgaas wrote:
> > On Mon, Apr 22, 2024 at 04:39:19PM -0700, Paul M Stillwell Jr wrote:
> > > On 4/22/2024 3:52 PM, Bjorn Helgaas wrote:
> > > > On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote:
> > > > > On 4/22/2024 1:27 PM, Bjorn Helgaas wrote:
> > > ...
> > 
> > > > > > _OSC negotiates ownership of features between platform firmware and
> > > > > > OSPM.  The "native_pcie_hotplug" and similar bits mean that "IF a
> > > > > > device advertises the feature, the OS can use it."  We clear those
> > > > > > native_* bits if the platform retains ownership via _OSC.
> > > > > > 
> > > > > > If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for
> > > > > > the domain below it, why would we assume that BIOS retains ownership
> > > > > > of the features negotiated by _OSC?  I think we have to assume the OS
> > > > > > owns them, which is what happened before 04b12ef163d1.
> > > > > 
> > > > > Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled)
> > > > > then all the root ports and devices underneath VMD are visible to the BIOS
> > > > > and OS so ACPI would run on all of them and the _OSC bits should be set
> > > > > correctly.
> > > > 
> > > > Sorry, that was confusing.  I think there are two pieces to enabling
> > > > VMD:
> > > > 
> > > >     1) There's the BIOS "VMD enable" switch.  If set, the VMD device
> > > >     appears as an RCiEP and the devices behind it are invisible to the
> > > >     BIOS.  If cleared, VMD doesn't exist; the VMD RCiEP is hidden and
> > > >     the devices behind it appear as normal Root Ports with devices below
> > > >     them.
> > > > 
> > > >     2) When the BIOS "VMD enable" is set, the OS vmd driver configures
> > > >     the VMD RCiEP and enumerates things below the VMD host bridge.
> > > > 
> > > >     In this case, BIOS enables the VMD RCiEP, but it doesn't have a
> > > >     driver for it and it doesn't know how to enumerate the VMD Root
> > > >     Ports, so I don't think it makes sense for BIOS to own features for
> > > >     devices it doesn't know about.
> > > 
> > > That makes sense to me. It sounds like VMD should own all the features, I
> > > just don't know how the vmd driver would set the bits other than hotplug
> > > correctly... We know leaving them on is problematic, but I'm not sure what
> > > method to use to decide which of the other bits should be set or not.
> > 
> > My starting assumption would be that we'd handle the VMD domain the
> > same as other PCI domains: if a device advertises a feature, the
> > kernel includes support for it, and the kernel owns it, we enable it.
> 
> I've been poking around and it seems like some things (I was looking for
> AER) are global to the platform. In my investigation (which is a small
> sample size of machines) it looks like there is a single entry in the BIOS
> to enable/disable AER so whatever is in one domain should be the same in all
> the domains. I couldn't find settings for LTR or the other bits, but I'm not
> sure what to look for in the BIOS for those.
> 
> So it seems that there are 2 categories: platform global and device
> specific. AER and probably some of the others are global and can be copied
> from one domain to another, but things like hotplug are device specific and
> should be handled that way.

_OSC is the only mechanism for negotiating ownership of these
features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC
only applies to the hierarchy originated by the PNP0A03/PNP0A08 host
bridge that contains the _OSC method.  AFAICT, there's no
global/device-specific thing here.

The BIOS may have a single user-visible setting, and it may apply that
setting to all host bridge _OSC methods, but that's just part of the
BIOS UI, not part of the firmware/OS interface.

> > If a device advertises a feature but there's a hardware problem with
> > it, the usual approach is to add a quirk to work around the problem.
> > The Correctable Error issue addressed by 04b12ef163d1 ("PCI: vmd:
> > Honor ACPI _OSC on PCIe features"), looks like it might be in this
> > category.
> 
> I don't think we had a hardware problem with these Samsung (IIRC) devices;
> the issue was that the vmd driver were incorrectly enabling AER because
> those native_* bits get set automatically. 

Where do all the Correctable Errors come from?  IMO they're either
caused by some hardware issue or by a software error in programming
AER.  It's possible we forget to clear the errors and we just see the
same error reported over and over.  But I don't think the answer is
to copy the AER ownership from a different domain.

Bjorn
Paul M Stillwell Jr April 24, 2024, 9:29 p.m. UTC | #15
On 4/23/2024 5:47 PM, Bjorn Helgaas wrote:
> On Tue, Apr 23, 2024 at 04:10:37PM -0700, Paul M Stillwell Jr wrote:
>> On 4/23/2024 2:26 PM, Bjorn Helgaas wrote:
>>> On Mon, Apr 22, 2024 at 04:39:19PM -0700, Paul M Stillwell Jr wrote:
>>>> On 4/22/2024 3:52 PM, Bjorn Helgaas wrote:
>>>>> On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote:
>>>>>> On 4/22/2024 1:27 PM, Bjorn Helgaas wrote:
>>>> ...
>>>
>>>>>>> _OSC negotiates ownership of features between platform firmware and
>>>>>>> OSPM.  The "native_pcie_hotplug" and similar bits mean that "IF a
>>>>>>> device advertises the feature, the OS can use it."  We clear those
>>>>>>> native_* bits if the platform retains ownership via _OSC.
>>>>>>>
>>>>>>> If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for
>>>>>>> the domain below it, why would we assume that BIOS retains ownership
>>>>>>> of the features negotiated by _OSC?  I think we have to assume the OS
>>>>>>> owns them, which is what happened before 04b12ef163d1.
>>>>>>
>>>>>> Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled)
>>>>>> then all the root ports and devices underneath VMD are visible to the BIOS
>>>>>> and OS so ACPI would run on all of them and the _OSC bits should be set
>>>>>> correctly.
>>>>>
>>>>> Sorry, that was confusing.  I think there are two pieces to enabling
>>>>> VMD:
>>>>>
>>>>>      1) There's the BIOS "VMD enable" switch.  If set, the VMD device
>>>>>      appears as an RCiEP and the devices behind it are invisible to the
>>>>>      BIOS.  If cleared, VMD doesn't exist; the VMD RCiEP is hidden and
>>>>>      the devices behind it appear as normal Root Ports with devices below
>>>>>      them.
>>>>>
>>>>>      2) When the BIOS "VMD enable" is set, the OS vmd driver configures
>>>>>      the VMD RCiEP and enumerates things below the VMD host bridge.
>>>>>
>>>>>      In this case, BIOS enables the VMD RCiEP, but it doesn't have a
>>>>>      driver for it and it doesn't know how to enumerate the VMD Root
>>>>>      Ports, so I don't think it makes sense for BIOS to own features for
>>>>>      devices it doesn't know about.
>>>>
>>>> That makes sense to me. It sounds like VMD should own all the features, I
>>>> just don't know how the vmd driver would set the bits other than hotplug
>>>> correctly... We know leaving them on is problematic, but I'm not sure what
>>>> method to use to decide which of the other bits should be set or not.
>>>
>>> My starting assumption would be that we'd handle the VMD domain the
>>> same as other PCI domains: if a device advertises a feature, the
>>> kernel includes support for it, and the kernel owns it, we enable it.
>>
>> I've been poking around and it seems like some things (I was looking for
>> AER) are global to the platform. In my investigation (which is a small
>> sample size of machines) it looks like there is a single entry in the BIOS
>> to enable/disable AER so whatever is in one domain should be the same in all
>> the domains. I couldn't find settings for LTR or the other bits, but I'm not
>> sure what to look for in the BIOS for those.
>>
>> So it seems that there are 2 categories: platform global and device
>> specific. AER and probably some of the others are global and can be copied
>> from one domain to another, but things like hotplug are device specific and
>> should be handled that way.
> 
> _OSC is the only mechanism for negotiating ownership of these
> features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC
> only applies to the hierarchy originated by the PNP0A03/PNP0A08 host
> bridge that contains the _OSC method.  AFAICT, there's no
> global/device-specific thing here.
> 
> The BIOS may have a single user-visible setting, and it may apply that
> setting to all host bridge _OSC methods, but that's just part of the
> BIOS UI, not part of the firmware/OS interface.
> 

Fair, but we are still left with the question of how to set the _OSC 
bits for the VMD bridge. This would normally happen using ACPI AFAICT 
and we don't have that for the devices behind VMD.

>>> If a device advertises a feature but there's a hardware problem with
>>> it, the usual approach is to add a quirk to work around the problem.
>>> The Correctable Error issue addressed by 04b12ef163d1 ("PCI: vmd:
>>> Honor ACPI _OSC on PCIe features"), looks like it might be in this
>>> category.
>>
>> I don't think we had a hardware problem with these Samsung (IIRC) devices;
>> the issue was that the vmd driver were incorrectly enabling AER because
>> those native_* bits get set automatically.
> 
> Where do all the Correctable Errors come from?  IMO they're either
> caused by some hardware issue or by a software error in programming
> AER.  It's possible we forget to clear the errors and we just see the
> same error reported over and over.  But I don't think the answer is
> to copy the AER ownership from a different domain.
> 

I looked back at the original bugzilla and I feel like the AER errors 
are a red herring. AER was *supposed* to be disabled, but was 
incorrectly enabled by VMD so we are seeing errors. Yes, they may be 
real errors, but my point is that the user had disabled AER so they 
didn't care if there were errors or not (i.e. if AER had been correctly 
disabled by VMD then the user would not have AER errors in the dmesg 
output).

Kai-Heng even says this in one of his responses here 
https://lore.kernel.org/linux-pci/CAAd53p6hATV8TOcJ9Qi2rMwVi=y_9+tQu6KhDkAm6Y8=cQ_xoA@mail.gmail.com/. 
A quote from his reply "To be more precise, AER is disabled by the 
platform vendor in BIOS to paper over the issue."

Paul
Bjorn Helgaas April 25, 2024, 5:24 p.m. UTC | #16
On Wed, Apr 24, 2024 at 02:29:16PM -0700, Paul M Stillwell Jr wrote:
> On 4/23/2024 5:47 PM, Bjorn Helgaas wrote:
> > On Tue, Apr 23, 2024 at 04:10:37PM -0700, Paul M Stillwell Jr wrote:
> > > On 4/23/2024 2:26 PM, Bjorn Helgaas wrote:
> > > > On Mon, Apr 22, 2024 at 04:39:19PM -0700, Paul M Stillwell Jr wrote:
> > > > > On 4/22/2024 3:52 PM, Bjorn Helgaas wrote:
> > > > > > On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote:
> > > > > > > On 4/22/2024 1:27 PM, Bjorn Helgaas wrote:
> > > > > ...
> > > > 
> > > > > > > > _OSC negotiates ownership of features between platform firmware and
> > > > > > > > OSPM.  The "native_pcie_hotplug" and similar bits mean that "IF a
> > > > > > > > device advertises the feature, the OS can use it."  We clear those
> > > > > > > > native_* bits if the platform retains ownership via _OSC.
> > > > > > > > 
> > > > > > > > If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for
> > > > > > > > the domain below it, why would we assume that BIOS retains ownership
> > > > > > > > of the features negotiated by _OSC?  I think we have to assume the OS
> > > > > > > > owns them, which is what happened before 04b12ef163d1.
> > > > > > > 
> > > > > > > Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled)
> > > > > > > then all the root ports and devices underneath VMD are visible to the BIOS
> > > > > > > and OS so ACPI would run on all of them and the _OSC bits should be set
> > > > > > > correctly.
> > > > > > 
> > > > > > Sorry, that was confusing.  I think there are two pieces to enabling
> > > > > > VMD:
> > > > > > 
> > > > > >      1) There's the BIOS "VMD enable" switch.  If set, the VMD device
> > > > > >      appears as an RCiEP and the devices behind it are invisible to the
> > > > > >      BIOS.  If cleared, VMD doesn't exist; the VMD RCiEP is hidden and
> > > > > >      the devices behind it appear as normal Root Ports with devices below
> > > > > >      them.
> > > > > > 
> > > > > >      2) When the BIOS "VMD enable" is set, the OS vmd driver configures
> > > > > >      the VMD RCiEP and enumerates things below the VMD host bridge.
> > > > > > 
> > > > > >      In this case, BIOS enables the VMD RCiEP, but it doesn't have a
> > > > > >      driver for it and it doesn't know how to enumerate the VMD Root
> > > > > >      Ports, so I don't think it makes sense for BIOS to own features for
> > > > > >      devices it doesn't know about.
> > > > > 
> > > > > That makes sense to me. It sounds like VMD should own all the features, I
> > > > > just don't know how the vmd driver would set the bits other than hotplug
> > > > > correctly... We know leaving them on is problematic, but I'm not sure what
> > > > > method to use to decide which of the other bits should be set or not.
> > > > 
> > > > My starting assumption would be that we'd handle the VMD domain the
> > > > same as other PCI domains: if a device advertises a feature, the
> > > > kernel includes support for it, and the kernel owns it, we enable it.
> > > 
> > > I've been poking around and it seems like some things (I was looking for
> > > AER) are global to the platform. In my investigation (which is a small
> > > sample size of machines) it looks like there is a single entry in the BIOS
> > > to enable/disable AER so whatever is in one domain should be the same in all
> > > the domains. I couldn't find settings for LTR or the other bits, but I'm not
> > > sure what to look for in the BIOS for those.
> > > 
> > > So it seems that there are 2 categories: platform global and device
> > > specific. AER and probably some of the others are global and can be copied
> > > from one domain to another, but things like hotplug are device specific and
> > > should be handled that way.
> > 
> > _OSC is the only mechanism for negotiating ownership of these
> > features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC
> > only applies to the hierarchy originated by the PNP0A03/PNP0A08 host
> > bridge that contains the _OSC method.  AFAICT, there's no
> > global/device-specific thing here.
> > 
> > The BIOS may have a single user-visible setting, and it may apply that
> > setting to all host bridge _OSC methods, but that's just part of the
> > BIOS UI, not part of the firmware/OS interface.
> 
> Fair, but we are still left with the question of how to set the _OSC bits
> for the VMD bridge. This would normally happen using ACPI AFAICT and we
> don't have that for the devices behind VMD.

In the absence of a mechanism for negotiating ownership, e.g., an ACPI
host bridge device for the hierarchy, the OS owns all the PCIe
features.

> > > > If a device advertises a feature but there's a hardware problem with
> > > > it, the usual approach is to add a quirk to work around the problem.
> > > > The Correctable Error issue addressed by 04b12ef163d1 ("PCI: vmd:
> > > > Honor ACPI _OSC on PCIe features"), looks like it might be in this
> > > > category.
> > > 
> > > I don't think we had a hardware problem with these Samsung (IIRC) devices;
> > > the issue was that the vmd driver were incorrectly enabling AER because
> > > those native_* bits get set automatically.
> > 
> > Where do all the Correctable Errors come from?  IMO they're either
> > caused by some hardware issue or by a software error in programming
> > AER.  It's possible we forget to clear the errors and we just see the
> > same error reported over and over.  But I don't think the answer is
> > to copy the AER ownership from a different domain.
> 
> I looked back at the original bugzilla and I feel like the AER errors are a
> red herring. AER was *supposed* to be disabled, but was incorrectly enabled
> by VMD so we are seeing errors. Yes, they may be real errors, but my point
> is that the user had disabled AER so they didn't care if there were errors
> or not (i.e. if AER had been correctly disabled by VMD then the user would
> not have AER errors in the dmesg output).

04b12ef163d1 basically asserted "the platform knows about a hardware
issue between VMD and this NVMe and avoided it by disabling AER in
domain 0000; therefore we should also disable AER in the VMD domain."

Your patch at
https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/
says "vmd users *always* want hotplug enabled."  What happens when a
platform knows about a hotplug hardware issue and avoids it by
disabling hotplug in domain 0000?

I think 04b12ef163d1 would avoid it in the VMD domain, but your patch
would expose the hotplug issue.

> Kai-Heng even says this in one of his responses here https://lore.kernel.org/linux-pci/CAAd53p6hATV8TOcJ9Qi2rMwVi=y_9+tQu6KhDkAm6Y8=cQ_xoA@mail.gmail.com/.
> A quote from his reply "To be more precise, AER is disabled by the platform
> vendor in BIOS to paper over the issue."

I suspect there's a real hardware issue between the VMD and the
Samsung NVMe that causes these Correctable Errors.  I think disabling
AER via _OSC is a bad way to work around it because:

  - it disables AER for *everything* in domain 0000, when other
    devices probably work perfectly fine,

  - it assumes the OS vmd driver applies domain 0000 _OSC to the VMD
    domain, which isn't required by any spec, and

  - it disables *all* of AER, including Uncorrectable Errors, and I'd
    like to know about those, even if we have to mask the Correctable
    Errors.

In https://bugzilla.kernel.org/show_bug.cgi?id=215027#c5, Kai-Heng did
not see the Correctable Error flood when VMD was turned off and
concluded that the issue is VMD specific.

But I think it's likely that the errors still occur even when VMD is
turned off, and we just don't see the flood because AER is disabled.
I suggested an experiment with "pcie_ports=native", which should
enable AER even if _OSC doesn't grant ownership:
https://bugzilla.kernel.org/show_bug.cgi?id=215027#c9 

Bjorn
Paul M Stillwell Jr April 25, 2024, 9:43 p.m. UTC | #17
On 4/25/2024 10:24 AM, Bjorn Helgaas wrote:
> On Wed, Apr 24, 2024 at 02:29:16PM -0700, Paul M Stillwell Jr wrote:
>> On 4/23/2024 5:47 PM, Bjorn Helgaas wrote:
>>> On Tue, Apr 23, 2024 at 04:10:37PM -0700, Paul M Stillwell Jr wrote:
>>>> On 4/23/2024 2:26 PM, Bjorn Helgaas wrote:
>>>>> On Mon, Apr 22, 2024 at 04:39:19PM -0700, Paul M Stillwell Jr wrote:
>>>>>> On 4/22/2024 3:52 PM, Bjorn Helgaas wrote:
>>>>>>> On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote:
>>>>>>>> On 4/22/2024 1:27 PM, Bjorn Helgaas wrote:
>>>>>> ...
>>>>>
>>>>>>>>> _OSC negotiates ownership of features between platform firmware and
>>>>>>>>> OSPM.  The "native_pcie_hotplug" and similar bits mean that "IF a
>>>>>>>>> device advertises the feature, the OS can use it."  We clear those
>>>>>>>>> native_* bits if the platform retains ownership via _OSC.
>>>>>>>>>
>>>>>>>>> If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for
>>>>>>>>> the domain below it, why would we assume that BIOS retains ownership
>>>>>>>>> of the features negotiated by _OSC?  I think we have to assume the OS
>>>>>>>>> owns them, which is what happened before 04b12ef163d1.
>>>>>>>>
>>>>>>>> Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled)
>>>>>>>> then all the root ports and devices underneath VMD are visible to the BIOS
>>>>>>>> and OS so ACPI would run on all of them and the _OSC bits should be set
>>>>>>>> correctly.
>>>>>>>
>>>>>>> Sorry, that was confusing.  I think there are two pieces to enabling
>>>>>>> VMD:
>>>>>>>
>>>>>>>       1) There's the BIOS "VMD enable" switch.  If set, the VMD device
>>>>>>>       appears as an RCiEP and the devices behind it are invisible to the
>>>>>>>       BIOS.  If cleared, VMD doesn't exist; the VMD RCiEP is hidden and
>>>>>>>       the devices behind it appear as normal Root Ports with devices below
>>>>>>>       them.
>>>>>>>
>>>>>>>       2) When the BIOS "VMD enable" is set, the OS vmd driver configures
>>>>>>>       the VMD RCiEP and enumerates things below the VMD host bridge.
>>>>>>>
>>>>>>>       In this case, BIOS enables the VMD RCiEP, but it doesn't have a
>>>>>>>       driver for it and it doesn't know how to enumerate the VMD Root
>>>>>>>       Ports, so I don't think it makes sense for BIOS to own features for
>>>>>>>       devices it doesn't know about.
>>>>>>
>>>>>> That makes sense to me. It sounds like VMD should own all the features, I
>>>>>> just don't know how the vmd driver would set the bits other than hotplug
>>>>>> correctly... We know leaving them on is problematic, but I'm not sure what
>>>>>> method to use to decide which of the other bits should be set or not.
>>>>>
>>>>> My starting assumption would be that we'd handle the VMD domain the
>>>>> same as other PCI domains: if a device advertises a feature, the
>>>>> kernel includes support for it, and the kernel owns it, we enable it.
>>>>
>>>> I've been poking around and it seems like some things (I was looking for
>>>> AER) are global to the platform. In my investigation (which is a small
>>>> sample size of machines) it looks like there is a single entry in the BIOS
>>>> to enable/disable AER so whatever is in one domain should be the same in all
>>>> the domains. I couldn't find settings for LTR or the other bits, but I'm not
>>>> sure what to look for in the BIOS for those.
>>>>
>>>> So it seems that there are 2 categories: platform global and device
>>>> specific. AER and probably some of the others are global and can be copied
>>>> from one domain to another, but things like hotplug are device specific and
>>>> should be handled that way.
>>>
>>> _OSC is the only mechanism for negotiating ownership of these
>>> features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC
>>> only applies to the hierarchy originated by the PNP0A03/PNP0A08 host
>>> bridge that contains the _OSC method.  AFAICT, there's no
>>> global/device-specific thing here.
>>>
>>> The BIOS may have a single user-visible setting, and it may apply that
>>> setting to all host bridge _OSC methods, but that's just part of the
>>> BIOS UI, not part of the firmware/OS interface.
>>
>> Fair, but we are still left with the question of how to set the _OSC bits
>> for the VMD bridge. This would normally happen using ACPI AFAICT and we
>> don't have that for the devices behind VMD.
> 
> In the absence of a mechanism for negotiating ownership, e.g., an ACPI
> host bridge device for the hierarchy, the OS owns all the PCIe
> features.
> 

I'm new to this space so I don't know what it means for the OS to own 
the features. In other words, how would the vmd driver figure out what 
features are supported?

>>>>> If a device advertises a feature but there's a hardware problem with
>>>>> it, the usual approach is to add a quirk to work around the problem.
>>>>> The Correctable Error issue addressed by 04b12ef163d1 ("PCI: vmd:
>>>>> Honor ACPI _OSC on PCIe features"), looks like it might be in this
>>>>> category.
>>>>
>>>> I don't think we had a hardware problem with these Samsung (IIRC) devices;
>>>> the issue was that the vmd driver were incorrectly enabling AER because
>>>> those native_* bits get set automatically.
>>>
>>> Where do all the Correctable Errors come from?  IMO they're either
>>> caused by some hardware issue or by a software error in programming
>>> AER.  It's possible we forget to clear the errors and we just see the
>>> same error reported over and over.  But I don't think the answer is
>>> to copy the AER ownership from a different domain.
>>
>> I looked back at the original bugzilla and I feel like the AER errors are a
>> red herring. AER was *supposed* to be disabled, but was incorrectly enabled
>> by VMD so we are seeing errors. Yes, they may be real errors, but my point
>> is that the user had disabled AER so they didn't care if there were errors
>> or not (i.e. if AER had been correctly disabled by VMD then the user would
>> not have AER errors in the dmesg output).
> 
> 04b12ef163d1 basically asserted "the platform knows about a hardware
> issue between VMD and this NVMe and avoided it by disabling AER in
> domain 0000; therefore we should also disable AER in the VMD domain."
> 
> Your patch at
> https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/
> says "vmd users *always* want hotplug enabled."  What happens when a
> platform knows about a hotplug hardware issue and avoids it by
> disabling hotplug in domain 0000?
> 

I was thinking about this also and I could look at all the root ports 
underneath vmd and see if hotplug is set for any of them. If it is then 
we could set the native_*hotplug bits based on that.

> I think 04b12ef163d1 would avoid it in the VMD domain, but your patch
> would expose the hotplug issue.
> 
>> Kai-Heng even says this in one of his responses here https://lore.kernel.org/linux-pci/CAAd53p6hATV8TOcJ9Qi2rMwVi=y_9+tQu6KhDkAm6Y8=cQ_xoA@mail.gmail.com/.
>> A quote from his reply "To be more precise, AER is disabled by the platform
>> vendor in BIOS to paper over the issue."
> 
> I suspect there's a real hardware issue between the VMD and the
> Samsung NVMe that causes these Correctable Errors.  I think disabling
> AER via _OSC is a bad way to work around it because:
> 
>    - it disables AER for *everything* in domain 0000, when other
>      devices probably work perfectly fine,
> 
>    - it assumes the OS vmd driver applies domain 0000 _OSC to the VMD
>      domain, which isn't required by any spec, and
> 
>    - it disables *all* of AER, including Uncorrectable Errors, and I'd
>      like to know about those, even if we have to mask the Correctable
>      Errors.
> 
> In https://bugzilla.kernel.org/show_bug.cgi?id=215027#c5, Kai-Heng did
> not see the Correctable Error flood when VMD was turned off and
> concluded that the issue is VMD specific.
> 
> But I think it's likely that the errors still occur even when VMD is
> turned off, and we just don't see the flood because AER is disabled.
> I suggested an experiment with "pcie_ports=native", which should
> enable AER even if _OSC doesn't grant ownership:
> https://bugzilla.kernel.org/show_bug.cgi?id=215027#c9
> 

I don't have a way to test his configuration so that would be something 
he would need to do.

> Bjorn
>
Bjorn Helgaas April 25, 2024, 10:32 p.m. UTC | #18
On Thu, Apr 25, 2024 at 02:43:07PM -0700, Paul M Stillwell Jr wrote:
> On 4/25/2024 10:24 AM, Bjorn Helgaas wrote:
> > On Wed, Apr 24, 2024 at 02:29:16PM -0700, Paul M Stillwell Jr wrote:
> > > On 4/23/2024 5:47 PM, Bjorn Helgaas wrote:
> ...

> > > > _OSC is the only mechanism for negotiating ownership of these
> > > > features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC
> > > > only applies to the hierarchy originated by the PNP0A03/PNP0A08 host
> > > > bridge that contains the _OSC method.  AFAICT, there's no
> > > > global/device-specific thing here.
> > > > 
> > > > The BIOS may have a single user-visible setting, and it may apply that
> > > > setting to all host bridge _OSC methods, but that's just part of the
> > > > BIOS UI, not part of the firmware/OS interface.
> > > 
> > > Fair, but we are still left with the question of how to set the _OSC bits
> > > for the VMD bridge. This would normally happen using ACPI AFAICT and we
> > > don't have that for the devices behind VMD.
> > 
> > In the absence of a mechanism for negotiating ownership, e.g., an ACPI
> > host bridge device for the hierarchy, the OS owns all the PCIe
> > features.
> 
> I'm new to this space so I don't know what it means for the OS to
> own the features. In other words, how would the vmd driver figure
> out what features are supported?

There are three things that go into this:

  - Does the OS support the feature, e.g., is CONFIG_PCIEAER enabled?

  - Has the platform granted permission to the OS to use the feature,
    either explicitly via _OSC or implicitly because there's no
    mechanism to negotiate ownership?

  - Does the device advertise the feature, e.g., does it have an AER
    Capability?

If all three are true, Linux enables the feature.

I think vmd has implicit ownership of all features because there is no
ACPI host bridge device for the VMD domain, and (IMO) that means there
is no way to negotiate ownership in that domain.

So the VMD domain starts with all the native_* bits set, meaning Linux
owns the features.  If the vmd driver doesn't want some feature to be
used, it could clear the native_* bit for it.

I don't think vmd should unilaterally claim ownership of features by
*setting* native_* bits because that will lead to conflicts with
firmware.

> > 04b12ef163d1 basically asserted "the platform knows about a hardware
> > issue between VMD and this NVMe and avoided it by disabling AER in
> > domain 0000; therefore we should also disable AER in the VMD domain."
> > 
> > Your patch at
> > https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/
> > says "vmd users *always* want hotplug enabled."  What happens when a
> > platform knows about a hotplug hardware issue and avoids it by
> > disabling hotplug in domain 0000?
> 
> I was thinking about this also and I could look at all the root ports
> underneath vmd and see if hotplug is set for any of them. If it is then we
> could set the native_*hotplug bits based on that.

No.  "Hotplug is set" means the device advertises the feature via
config space, in this case, it has the Hot-Plug Capable bit in the
PCIe Slot Capabilities set.  That just means the device has hardware
support for the feature.

On ACPI systems, the OS can only use pciehp on the device if firmware
has granted ownership to the OS via _OSC because the firmware may want
to use the feature itself.  If both OS and firmware think they own the
feature, they will conflict with each other.

If firmware retains owership of hotplug, it can field hotplug events
itself and notify the OS via the ACPI hotplug mechanism.  The acpiphp
driver handles those events for PCI.

Firmware may do this if it wants to work around hardware defects it
knows about, or if it wants to do OS-independent logging (more
applicable for AER), or if it wants to intercept hotplug events to do
some kind of initialization, etc.

Bjorn
Paul M Stillwell Jr April 25, 2024, 11:32 p.m. UTC | #19
On 4/25/2024 3:32 PM, Bjorn Helgaas wrote:
> On Thu, Apr 25, 2024 at 02:43:07PM -0700, Paul M Stillwell Jr wrote:
>> On 4/25/2024 10:24 AM, Bjorn Helgaas wrote:
>>> On Wed, Apr 24, 2024 at 02:29:16PM -0700, Paul M Stillwell Jr wrote:
>>>> On 4/23/2024 5:47 PM, Bjorn Helgaas wrote:
>> ...
> 
>>>>> _OSC is the only mechanism for negotiating ownership of these
>>>>> features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC
>>>>> only applies to the hierarchy originated by the PNP0A03/PNP0A08 host
>>>>> bridge that contains the _OSC method.  AFAICT, there's no
>>>>> global/device-specific thing here.
>>>>>
>>>>> The BIOS may have a single user-visible setting, and it may apply that
>>>>> setting to all host bridge _OSC methods, but that's just part of the
>>>>> BIOS UI, not part of the firmware/OS interface.
>>>>
>>>> Fair, but we are still left with the question of how to set the _OSC bits
>>>> for the VMD bridge. This would normally happen using ACPI AFAICT and we
>>>> don't have that for the devices behind VMD.
>>>
>>> In the absence of a mechanism for negotiating ownership, e.g., an ACPI
>>> host bridge device for the hierarchy, the OS owns all the PCIe
>>> features.
>>
>> I'm new to this space so I don't know what it means for the OS to
>> own the features. In other words, how would the vmd driver figure
>> out what features are supported?
> 
> There are three things that go into this:
> 
>    - Does the OS support the feature, e.g., is CONFIG_PCIEAER enabled?
> 
>    - Has the platform granted permission to the OS to use the feature,
>      either explicitly via _OSC or implicitly because there's no
>      mechanism to negotiate ownership?
> 
>    - Does the device advertise the feature, e.g., does it have an AER
>      Capability?
> 
> If all three are true, Linux enables the feature.
> 
> I think vmd has implicit ownership of all features because there is no
> ACPI host bridge device for the VMD domain, and (IMO) that means there
> is no way to negotiate ownership in that domain.
> 
> So the VMD domain starts with all the native_* bits set, meaning Linux
> owns the features.  If the vmd driver doesn't want some feature to be
> used, it could clear the native_* bit for it.
> 
> I don't think vmd should unilaterally claim ownership of features by
> *setting* native_* bits because that will lead to conflicts with
> firmware.
> 

This is the crux of the problem IMO. I'm happy to set the native_* bits 
using some knowledge about what the firmware wants, but we don't have a 
mechanism to do it AFAICT. I think that's what commit 04b12ef163d1 
("PCI: vmd: Honor ACPI _OSC on PCIe features") was trying to do: use a 
domain that ACPI had run on and negotiated features and apply them to 
the vmd domain.

Using the 3 criteria you described above, could we do this for the 
hotplug feature for VMD:

1. Use IS_ENABLED(CONFIG_<whatever hotplug setting we need>) to check to 
see if the hotplug feature is enabled

2. We know that for VMD we want hotplug enabled so that is the implicit 
permission

3. Look at the root ports below VMD and see if hotplug capability is set

If 1 & 3 are true, then we set the native_* bits for hotplug (we can 
look for surprise hotplug as well in the capability to set the 
native_shpc_hotplug bit corrrectly) to 1. This feels like it would solve 
the problem of "what if there is a hotplug issue on the platform" 
because the user would have disabled hotplug for VMD and the root ports 
below it would have the capability turned off.

In theory we could do this same thing for all the features, but we don't 
know what the firmware wants for features other than hotplug (because we 
implicitly know that vmd wants hotplug). I feel like 04b12ef163d1 is a 
good compromise for the other features, but I hear your issues with it.

I'm happy to "do the right thing" for the other features, I just can't 
figure out what that thing is :)

Paul

>>> 04b12ef163d1 basically asserted "the platform knows about a hardware
>>> issue between VMD and this NVMe and avoided it by disabling AER in
>>> domain 0000; therefore we should also disable AER in the VMD domain."
>>>
>>> Your patch at
>>> https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/
>>> says "vmd users *always* want hotplug enabled."  What happens when a
>>> platform knows about a hotplug hardware issue and avoids it by
>>> disabling hotplug in domain 0000?
>>
>> I was thinking about this also and I could look at all the root ports
>> underneath vmd and see if hotplug is set for any of them. If it is then we
>> could set the native_*hotplug bits based on that.
> 
> No.  "Hotplug is set" means the device advertises the feature via
> config space, in this case, it has the Hot-Plug Capable bit in the
> PCIe Slot Capabilities set.  That just means the device has hardware
> support for the feature.
> 
> On ACPI systems, the OS can only use pciehp on the device if firmware
> has granted ownership to the OS via _OSC because the firmware may want
> to use the feature itself.  If both OS and firmware think they own the
> feature, they will conflict with each other.
> 
> If firmware retains owership of hotplug, it can field hotplug events
> itself and notify the OS via the ACPI hotplug mechanism.  The acpiphp
> driver handles those events for PCI.
> 
> Firmware may do this if it wants to work around hardware defects it
> knows about, or if it wants to do OS-independent logging (more
> applicable for AER), or if it wants to intercept hotplug events to do
> some kind of initialization, etc.
> 
> Bjorn
Bjorn Helgaas April 26, 2024, 9:36 p.m. UTC | #20
On Thu, Apr 25, 2024 at 04:32:21PM -0700, Paul M Stillwell Jr wrote:
> On 4/25/2024 3:32 PM, Bjorn Helgaas wrote:
> > On Thu, Apr 25, 2024 at 02:43:07PM -0700, Paul M Stillwell Jr wrote:
> > > On 4/25/2024 10:24 AM, Bjorn Helgaas wrote:
> > > > On Wed, Apr 24, 2024 at 02:29:16PM -0700, Paul M Stillwell Jr wrote:
> > > > > On 4/23/2024 5:47 PM, Bjorn Helgaas wrote:
> > > ...
> > 
> > > > > > _OSC is the only mechanism for negotiating ownership of these
> > > > > > features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC
> > > > > > only applies to the hierarchy originated by the PNP0A03/PNP0A08 host
> > > > > > bridge that contains the _OSC method.  AFAICT, there's no
> > > > > > global/device-specific thing here.
> > > > > > 
> > > > > > The BIOS may have a single user-visible setting, and it may apply that
> > > > > > setting to all host bridge _OSC methods, but that's just part of the
> > > > > > BIOS UI, not part of the firmware/OS interface.
> > > > > 
> > > > > Fair, but we are still left with the question of how to set the _OSC bits
> > > > > for the VMD bridge. This would normally happen using ACPI AFAICT and we
> > > > > don't have that for the devices behind VMD.
> > > > 
> > > > In the absence of a mechanism for negotiating ownership, e.g., an ACPI
> > > > host bridge device for the hierarchy, the OS owns all the PCIe
> > > > features.
> > > 
> > > I'm new to this space so I don't know what it means for the OS to
> > > own the features. In other words, how would the vmd driver figure
> > > out what features are supported?
> > 
> > There are three things that go into this:
> > 
> >    - Does the OS support the feature, e.g., is CONFIG_PCIEAER enabled?
> > 
> >    - Has the platform granted permission to the OS to use the feature,
> >      either explicitly via _OSC or implicitly because there's no
> >      mechanism to negotiate ownership?
> > 
> >    - Does the device advertise the feature, e.g., does it have an AER
> >      Capability?
> > 
> > If all three are true, Linux enables the feature.
> > 
> > I think vmd has implicit ownership of all features because there is no
> > ACPI host bridge device for the VMD domain, and (IMO) that means there
> > is no way to negotiate ownership in that domain.
> > 
> > So the VMD domain starts with all the native_* bits set, meaning Linux
> > owns the features.  If the vmd driver doesn't want some feature to be
> > used, it could clear the native_* bit for it.
> > 
> > I don't think vmd should unilaterally claim ownership of features by
> > *setting* native_* bits because that will lead to conflicts with
> > firmware.
> 
> This is the crux of the problem IMO. I'm happy to set the native_* bits
> using some knowledge about what the firmware wants, but we don't have a
> mechanism to do it AFAICT. I think that's what commit 04b12ef163d1 ("PCI:
> vmd: Honor ACPI _OSC on PCIe features") was trying to do: use a domain that
> ACPI had run on and negotiated features and apply them to the vmd domain.

Yes, this is the problem.  We have no information about what firmware
wants for the VMD domain because there is no ACPI host bridge device.

We have information about what firmware wants for *other* domains.
04b12ef163d1 assumes that also applies to the VMD domain, but I don't
think that's a good idea.

> Using the 3 criteria you described above, could we do this for the hotplug
> feature for VMD:
> 
> 1. Use IS_ENABLED(CONFIG_<whatever hotplug setting we need>) to check to see
> if the hotplug feature is enabled

That's already there.

> 2. We know that for VMD we want hotplug enabled so that is the implicit
> permission

The VMD domain starts with all native_* bits set.  All you have to do
is avoid *clearing* them.

The problem (IMO) is that 04b12ef163d1 clears bits based on the _OSC
for some other domain.

> 3. Look at the root ports below VMD and see if hotplug capability is set

This is already there, too.

> If 1 & 3 are true, then we set the native_* bits for hotplug (we can look
> for surprise hotplug as well in the capability to set the
> native_shpc_hotplug bit corrrectly) to 1. This feels like it would solve the
> problem of "what if there is a hotplug issue on the platform" because the
> user would have disabled hotplug for VMD and the root ports below it would
> have the capability turned off.
> 
> In theory we could do this same thing for all the features, but we don't
> know what the firmware wants for features other than hotplug (because we
> implicitly know that vmd wants hotplug). I feel like 04b12ef163d1 is a good
> compromise for the other features, but I hear your issues with it.
> 
> I'm happy to "do the right thing" for the other features, I just can't
> figure out what that thing is :)

04b12ef163d1 was motivated by a flood of Correctable Errors.

Kai-Heng says the errors occur even when booting with
"pcie_ports=native" and VMD turned off, i.e., when the VMD RCiEP is
disabled and the NVMe devices appear under plain Root Ports in domain
0000.  That suggests that they aren't related to VMD at all.

I think there's a significant chance that those errors are caused by a
software defect, e.g., ASPM configuration.  There are many similar
reports of Correctable Errors where "pcie_aspm=off" is a workaround.

If we can nail down the root cause of those Correctable Errors, we may
be able to fix it and just revert 04b12ef163d1.  That would leave all
the PCIe features owned and enabled by Linux in the VMD domain.  AER
would be enabled and not a problem, hotplug would be enabled as you
need, etc.

There are a zillion reports of these errors and I added comments to
some to see if anybody can help us get to a root cause.

Bjorn
Paul M Stillwell Jr April 26, 2024, 9:46 p.m. UTC | #21
On 4/26/2024 2:36 PM, Bjorn Helgaas wrote:
> On Thu, Apr 25, 2024 at 04:32:21PM -0700, Paul M Stillwell Jr wrote:
>> On 4/25/2024 3:32 PM, Bjorn Helgaas wrote:
>>> On Thu, Apr 25, 2024 at 02:43:07PM -0700, Paul M Stillwell Jr wrote:
>>>> On 4/25/2024 10:24 AM, Bjorn Helgaas wrote:
>>>>> On Wed, Apr 24, 2024 at 02:29:16PM -0700, Paul M Stillwell Jr wrote:
>>>>>> On 4/23/2024 5:47 PM, Bjorn Helgaas wrote:
>>>> ...
>>>
>>>>>>> _OSC is the only mechanism for negotiating ownership of these
>>>>>>> features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC
>>>>>>> only applies to the hierarchy originated by the PNP0A03/PNP0A08 host
>>>>>>> bridge that contains the _OSC method.  AFAICT, there's no
>>>>>>> global/device-specific thing here.
>>>>>>>
>>>>>>> The BIOS may have a single user-visible setting, and it may apply that
>>>>>>> setting to all host bridge _OSC methods, but that's just part of the
>>>>>>> BIOS UI, not part of the firmware/OS interface.
>>>>>>
>>>>>> Fair, but we are still left with the question of how to set the _OSC bits
>>>>>> for the VMD bridge. This would normally happen using ACPI AFAICT and we
>>>>>> don't have that for the devices behind VMD.
>>>>>
>>>>> In the absence of a mechanism for negotiating ownership, e.g., an ACPI
>>>>> host bridge device for the hierarchy, the OS owns all the PCIe
>>>>> features.
>>>>
>>>> I'm new to this space so I don't know what it means for the OS to
>>>> own the features. In other words, how would the vmd driver figure
>>>> out what features are supported?
>>>
>>> There are three things that go into this:
>>>
>>>     - Does the OS support the feature, e.g., is CONFIG_PCIEAER enabled?
>>>
>>>     - Has the platform granted permission to the OS to use the feature,
>>>       either explicitly via _OSC or implicitly because there's no
>>>       mechanism to negotiate ownership?
>>>
>>>     - Does the device advertise the feature, e.g., does it have an AER
>>>       Capability?
>>>
>>> If all three are true, Linux enables the feature.
>>>
>>> I think vmd has implicit ownership of all features because there is no
>>> ACPI host bridge device for the VMD domain, and (IMO) that means there
>>> is no way to negotiate ownership in that domain.
>>>
>>> So the VMD domain starts with all the native_* bits set, meaning Linux
>>> owns the features.  If the vmd driver doesn't want some feature to be
>>> used, it could clear the native_* bit for it.
>>>
>>> I don't think vmd should unilaterally claim ownership of features by
>>> *setting* native_* bits because that will lead to conflicts with
>>> firmware.
>>
>> This is the crux of the problem IMO. I'm happy to set the native_* bits
>> using some knowledge about what the firmware wants, but we don't have a
>> mechanism to do it AFAICT. I think that's what commit 04b12ef163d1 ("PCI:
>> vmd: Honor ACPI _OSC on PCIe features") was trying to do: use a domain that
>> ACPI had run on and negotiated features and apply them to the vmd domain.
> 
> Yes, this is the problem.  We have no information about what firmware
> wants for the VMD domain because there is no ACPI host bridge device.
> 
> We have information about what firmware wants for *other* domains.
> 04b12ef163d1 assumes that also applies to the VMD domain, but I don't
> think that's a good idea.
> 
>> Using the 3 criteria you described above, could we do this for the hotplug
>> feature for VMD:
>>
>> 1. Use IS_ENABLED(CONFIG_<whatever hotplug setting we need>) to check to see
>> if the hotplug feature is enabled
> 
> That's already there.
> 
>> 2. We know that for VMD we want hotplug enabled so that is the implicit
>> permission
> 
> The VMD domain starts with all native_* bits set.  All you have to do
> is avoid *clearing* them.
> 
> The problem (IMO) is that 04b12ef163d1 clears bits based on the _OSC
> for some other domain.
> 
>> 3. Look at the root ports below VMD and see if hotplug capability is set
> 
> This is already there, too.
> 
>> If 1 & 3 are true, then we set the native_* bits for hotplug (we can look
>> for surprise hotplug as well in the capability to set the
>> native_shpc_hotplug bit corrrectly) to 1. This feels like it would solve the
>> problem of "what if there is a hotplug issue on the platform" because the
>> user would have disabled hotplug for VMD and the root ports below it would
>> have the capability turned off.
>>
>> In theory we could do this same thing for all the features, but we don't
>> know what the firmware wants for features other than hotplug (because we
>> implicitly know that vmd wants hotplug). I feel like 04b12ef163d1 is a good
>> compromise for the other features, but I hear your issues with it.
>>
>> I'm happy to "do the right thing" for the other features, I just can't
>> figure out what that thing is :)
> 
> 04b12ef163d1 was motivated by a flood of Correctable Errors.
> 
> Kai-Heng says the errors occur even when booting with
> "pcie_ports=native" and VMD turned off, i.e., when the VMD RCiEP is
> disabled and the NVMe devices appear under plain Root Ports in domain
> 0000.  That suggests that they aren't related to VMD at all.
> 
> I think there's a significant chance that those errors are caused by a
> software defect, e.g., ASPM configuration.  There are many similar
> reports of Correctable Errors where "pcie_aspm=off" is a workaround.
> 
> If we can nail down the root cause of those Correctable Errors, we may
> be able to fix it and just revert 04b12ef163d1.  That would leave all
> the PCIe features owned and enabled by Linux in the VMD domain.  AER
> would be enabled and not a problem, hotplug would be enabled as you
> need, etc.
> 
> There are a zillion reports of these errors and I added comments to
> some to see if anybody can help us get to a root cause.
> 

OK, sounds like the plan for me is to sit tight for now WRT a patch to 
fix hotplug. I'll submit a v2 for the documentation.

Paul
Paul M Stillwell Jr June 12, 2024, 9:52 p.m. UTC | #22
On 4/26/2024 2:46 PM, Paul M Stillwell Jr wrote:
> On 4/26/2024 2:36 PM, Bjorn Helgaas wrote:
>> On Thu, Apr 25, 2024 at 04:32:21PM -0700, Paul M Stillwell Jr wrote:
>>> On 4/25/2024 3:32 PM, Bjorn Helgaas wrote:
>>>> On Thu, Apr 25, 2024 at 02:43:07PM -0700, Paul M Stillwell Jr wrote:
>>>>> On 4/25/2024 10:24 AM, Bjorn Helgaas wrote:
>>>>>> On Wed, Apr 24, 2024 at 02:29:16PM -0700, Paul M Stillwell Jr wrote:
>>>>>>> On 4/23/2024 5:47 PM, Bjorn Helgaas wrote:
>>>>> ...
>>>>
>>>>>>>> _OSC is the only mechanism for negotiating ownership of these
>>>>>>>> features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that 
>>>>>>>> _OSC
>>>>>>>> only applies to the hierarchy originated by the PNP0A03/PNP0A08 
>>>>>>>> host
>>>>>>>> bridge that contains the _OSC method.  AFAICT, there's no
>>>>>>>> global/device-specific thing here.
>>>>>>>>
>>>>>>>> The BIOS may have a single user-visible setting, and it may 
>>>>>>>> apply that
>>>>>>>> setting to all host bridge _OSC methods, but that's just part of 
>>>>>>>> the
>>>>>>>> BIOS UI, not part of the firmware/OS interface.
>>>>>>>
>>>>>>> Fair, but we are still left with the question of how to set the 
>>>>>>> _OSC bits
>>>>>>> for the VMD bridge. This would normally happen using ACPI AFAICT 
>>>>>>> and we
>>>>>>> don't have that for the devices behind VMD.
>>>>>>
>>>>>> In the absence of a mechanism for negotiating ownership, e.g., an 
>>>>>> ACPI
>>>>>> host bridge device for the hierarchy, the OS owns all the PCIe
>>>>>> features.
>>>>>
>>>>> I'm new to this space so I don't know what it means for the OS to
>>>>> own the features. In other words, how would the vmd driver figure
>>>>> out what features are supported?
>>>>
>>>> There are three things that go into this:
>>>>
>>>>     - Does the OS support the feature, e.g., is CONFIG_PCIEAER enabled?
>>>>
>>>>     - Has the platform granted permission to the OS to use the feature,
>>>>       either explicitly via _OSC or implicitly because there's no
>>>>       mechanism to negotiate ownership?
>>>>
>>>>     - Does the device advertise the feature, e.g., does it have an AER
>>>>       Capability?
>>>>
>>>> If all three are true, Linux enables the feature.
>>>>
>>>> I think vmd has implicit ownership of all features because there is no
>>>> ACPI host bridge device for the VMD domain, and (IMO) that means there
>>>> is no way to negotiate ownership in that domain.
>>>>
>>>> So the VMD domain starts with all the native_* bits set, meaning Linux
>>>> owns the features.  If the vmd driver doesn't want some feature to be
>>>> used, it could clear the native_* bit for it.
>>>>
>>>> I don't think vmd should unilaterally claim ownership of features by
>>>> *setting* native_* bits because that will lead to conflicts with
>>>> firmware.
>>>
>>> This is the crux of the problem IMO. I'm happy to set the native_* bits
>>> using some knowledge about what the firmware wants, but we don't have a
>>> mechanism to do it AFAICT. I think that's what commit 04b12ef163d1 
>>> ("PCI:
>>> vmd: Honor ACPI _OSC on PCIe features") was trying to do: use a 
>>> domain that
>>> ACPI had run on and negotiated features and apply them to the vmd 
>>> domain.
>>
>> Yes, this is the problem.  We have no information about what firmware
>> wants for the VMD domain because there is no ACPI host bridge device.
>>
>> We have information about what firmware wants for *other* domains.
>> 04b12ef163d1 assumes that also applies to the VMD domain, but I don't
>> think that's a good idea.
>>
>>> Using the 3 criteria you described above, could we do this for the 
>>> hotplug
>>> feature for VMD:
>>>
>>> 1. Use IS_ENABLED(CONFIG_<whatever hotplug setting we need>) to check 
>>> to see
>>> if the hotplug feature is enabled
>>
>> That's already there.
>>
>>> 2. We know that for VMD we want hotplug enabled so that is the implicit
>>> permission
>>
>> The VMD domain starts with all native_* bits set.  All you have to do
>> is avoid *clearing* them.
>>
>> The problem (IMO) is that 04b12ef163d1 clears bits based on the _OSC
>> for some other domain.
>>
>>> 3. Look at the root ports below VMD and see if hotplug capability is set
>>
>> This is already there, too.
>>
>>> If 1 & 3 are true, then we set the native_* bits for hotplug (we can 
>>> look
>>> for surprise hotplug as well in the capability to set the
>>> native_shpc_hotplug bit corrrectly) to 1. This feels like it would 
>>> solve the
>>> problem of "what if there is a hotplug issue on the platform" because 
>>> the
>>> user would have disabled hotplug for VMD and the root ports below it 
>>> would
>>> have the capability turned off.
>>>
>>> In theory we could do this same thing for all the features, but we don't
>>> know what the firmware wants for features other than hotplug (because we
>>> implicitly know that vmd wants hotplug). I feel like 04b12ef163d1 is 
>>> a good
>>> compromise for the other features, but I hear your issues with it.
>>>
>>> I'm happy to "do the right thing" for the other features, I just can't
>>> figure out what that thing is :)
>>
>> 04b12ef163d1 was motivated by a flood of Correctable Errors.
>>
>> Kai-Heng says the errors occur even when booting with
>> "pcie_ports=native" and VMD turned off, i.e., when the VMD RCiEP is
>> disabled and the NVMe devices appear under plain Root Ports in domain
>> 0000.  That suggests that they aren't related to VMD at all.
>>
>> I think there's a significant chance that those errors are caused by a
>> software defect, e.g., ASPM configuration.  There are many similar
>> reports of Correctable Errors where "pcie_aspm=off" is a workaround.
>>
>> If we can nail down the root cause of those Correctable Errors, we may
>> be able to fix it and just revert 04b12ef163d1.  That would leave all
>> the PCIe features owned and enabled by Linux in the VMD domain.  AER
>> would be enabled and not a problem, hotplug would be enabled as you
>> need, etc.
>>
>> There are a zillion reports of these errors and I added comments to
>> some to see if anybody can help us get to a root cause.
>>
> 
> OK, sounds like the plan for me is to sit tight for now WRT a patch to 
> fix hotplug. I'll submit a v2 for the documentation.
> 

Any update on fixes for the errors?

Paul
Bjorn Helgaas June 12, 2024, 10:25 p.m. UTC | #23
On Wed, Jun 12, 2024 at 02:52:38PM -0700, Paul M Stillwell Jr wrote:
> On 4/26/2024 2:46 PM, Paul M Stillwell Jr wrote:
> > On 4/26/2024 2:36 PM, Bjorn Helgaas wrote:
> > > On Thu, Apr 25, 2024 at 04:32:21PM -0700, Paul M Stillwell Jr wrote:
> > > > On 4/25/2024 3:32 PM, Bjorn Helgaas wrote:
> > > > > On Thu, Apr 25, 2024 at 02:43:07PM -0700, Paul M Stillwell Jr wrote:
> > > > > > On 4/25/2024 10:24 AM, Bjorn Helgaas wrote:
> > > > > > > On Wed, Apr 24, 2024 at 02:29:16PM -0700, Paul M Stillwell Jr wrote:
> > > > > > > > On 4/23/2024 5:47 PM, Bjorn Helgaas wrote:
> > > > > > ...
> > > > > 
> > > > > > > > > _OSC is the only mechanism for negotiating ownership of these
> > > > > > > > > features, and PCI Firmware r3.3, sec 4.5.1,
> > > > > > > > > is pretty clear that _OSC
> > > > > > > > > only applies to the hierarchy originated by
> > > > > > > > > the PNP0A03/PNP0A08 host
> > > > > > > > > bridge that contains the _OSC method.  AFAICT, there's no
> > > > > > > > > global/device-specific thing here.
> > > > > > > > > 
> > > > > > > > > The BIOS may have a single user-visible
> > > > > > > > > setting, and it may apply that
> > > > > > > > > setting to all host bridge _OSC methods, but
> > > > > > > > > that's just part of the
> > > > > > > > > BIOS UI, not part of the firmware/OS interface.
> > > > > > > > 
> > > > > > > > Fair, but we are still left with the question of
> > > > > > > > how to set the _OSC bits
> > > > > > > > for the VMD bridge. This would normally happen
> > > > > > > > using ACPI AFAICT and we
> > > > > > > > don't have that for the devices behind VMD.
> > > > > > > 
> > > > > > > In the absence of a mechanism for negotiating
> > > > > > > ownership, e.g., an ACPI
> > > > > > > host bridge device for the hierarchy, the OS owns all the PCIe
> > > > > > > features.
> > > > > > 
> > > > > > I'm new to this space so I don't know what it means for the OS to
> > > > > > own the features. In other words, how would the vmd driver figure
> > > > > > out what features are supported?
> > > > > 
> > > > > There are three things that go into this:
> > > > > 
> > > > >     - Does the OS support the feature, e.g., is CONFIG_PCIEAER enabled?
> > > > > 
> > > > >     - Has the platform granted permission to the OS to use the feature,
> > > > >       either explicitly via _OSC or implicitly because there's no
> > > > >       mechanism to negotiate ownership?
> > > > > 
> > > > >     - Does the device advertise the feature, e.g., does it have an AER
> > > > >       Capability?
> > > > > 
> > > > > If all three are true, Linux enables the feature.
> > > > > 
> > > > > I think vmd has implicit ownership of all features because there is no
> > > > > ACPI host bridge device for the VMD domain, and (IMO) that means there
> > > > > is no way to negotiate ownership in that domain.
> > > > > 
> > > > > So the VMD domain starts with all the native_* bits set, meaning Linux
> > > > > owns the features.  If the vmd driver doesn't want some feature to be
> > > > > used, it could clear the native_* bit for it.
> > > > > 
> > > > > I don't think vmd should unilaterally claim ownership of features by
> > > > > *setting* native_* bits because that will lead to conflicts with
> > > > > firmware.
> > > > 
> > > > This is the crux of the problem IMO. I'm happy to set the native_* bits
> > > > using some knowledge about what the firmware wants, but we don't have a
> > > > mechanism to do it AFAICT. I think that's what commit
> > > > 04b12ef163d1 ("PCI:
> > > > vmd: Honor ACPI _OSC on PCIe features") was trying to do: use a
> > > > domain that
> > > > ACPI had run on and negotiated features and apply them to the
> > > > vmd domain.
> > > 
> > > Yes, this is the problem.  We have no information about what firmware
> > > wants for the VMD domain because there is no ACPI host bridge device.
> > > 
> > > We have information about what firmware wants for *other* domains.
> > > 04b12ef163d1 assumes that also applies to the VMD domain, but I don't
> > > think that's a good idea.
> > > 
> > > > Using the 3 criteria you described above, could we do this for
> > > > the hotplug
> > > > feature for VMD:
> > > > 
> > > > 1. Use IS_ENABLED(CONFIG_<whatever hotplug setting we need>) to
> > > > check to see
> > > > if the hotplug feature is enabled
> > > 
> > > That's already there.
> > > 
> > > > 2. We know that for VMD we want hotplug enabled so that is the implicit
> > > > permission
> > > 
> > > The VMD domain starts with all native_* bits set.  All you have to do
> > > is avoid *clearing* them.
> > > 
> > > The problem (IMO) is that 04b12ef163d1 clears bits based on the _OSC
> > > for some other domain.
> > > 
> > > > 3. Look at the root ports below VMD and see if hotplug capability is set
> > > 
> > > This is already there, too.
> > > 
> > > > If 1 & 3 are true, then we set the native_* bits for hotplug (we
> > > > can look
> > > > for surprise hotplug as well in the capability to set the
> > > > native_shpc_hotplug bit corrrectly) to 1. This feels like it
> > > > would solve the
> > > > problem of "what if there is a hotplug issue on the platform"
> > > > because the
> > > > user would have disabled hotplug for VMD and the root ports
> > > > below it would
> > > > have the capability turned off.
> > > > 
> > > > In theory we could do this same thing for all the features, but we don't
> > > > know what the firmware wants for features other than hotplug (because we
> > > > implicitly know that vmd wants hotplug). I feel like
> > > > 04b12ef163d1 is a good
> > > > compromise for the other features, but I hear your issues with it.
> > > > 
> > > > I'm happy to "do the right thing" for the other features, I just can't
> > > > figure out what that thing is :)
> > > 
> > > 04b12ef163d1 was motivated by a flood of Correctable Errors.
> > > 
> > > Kai-Heng says the errors occur even when booting with
> > > "pcie_ports=native" and VMD turned off, i.e., when the VMD RCiEP is
> > > disabled and the NVMe devices appear under plain Root Ports in domain
> > > 0000.  That suggests that they aren't related to VMD at all.
> > > 
> > > I think there's a significant chance that those errors are caused by a
> > > software defect, e.g., ASPM configuration.  There are many similar
> > > reports of Correctable Errors where "pcie_aspm=off" is a workaround.
> > > 
> > > If we can nail down the root cause of those Correctable Errors, we may
> > > be able to fix it and just revert 04b12ef163d1.  That would leave all
> > > the PCIe features owned and enabled by Linux in the VMD domain.  AER
> > > would be enabled and not a problem, hotplug would be enabled as you
> > > need, etc.
> > > 
> > > There are a zillion reports of these errors and I added comments to
> > > some to see if anybody can help us get to a root cause.
> > 
> > OK, sounds like the plan for me is to sit tight for now WRT a patch to
> > fix hotplug. I'll submit a v2 for the documentation.
> 
> Any update on fixes for the errors?

Ah, sorry, I need to get back to this.  I don't think it's
ASPM-related; I think it's just that disabling ASPM also causes AER to
be disabled, so the errors probably still occur but we ignore them.

Bjorn
diff mbox series

Patch

diff --git a/Documentation/PCI/controller/vmd.rst b/Documentation/PCI/controller/vmd.rst
new file mode 100644
index 000000000000..e1a019035245
--- /dev/null
+++ b/Documentation/PCI/controller/vmd.rst
@@ -0,0 +1,51 @@ 
+.. SPDX-License-Identifier: GPL-2.0+
+
+=================================================================
+Linux Base Driver for the Intel(R) Volume Management Device (VMD)
+=================================================================
+
+Intel vmd Linux driver.
+
+Contents
+========
+
+- Overview
+- Features
+- Limitations
+
+The Intel VMD provides the means to provide volume management across separate
+PCI Express HBAs and SSDs without requiring operating system support or
+communication between drivers. It does this by obscuring each storage
+controller from the OS, but allowing a single driver to be loaded that would
+control each storage controller. A Volume Management Device (VMD) provides a
+single device for a single storage driver. The VMD resides in the IIO root
+complex and it appears to the OS as a root bus integrated endpoint. In the IIO,
+the VMD is in a central location to manipulate access to storage devices which
+may be attached directly to the IIO or indirectly through the PCH. Instead of
+allowing individual storage devices to be detected by the OS and allow it to
+load a separate driver instance for each, the VMD provides configuration
+settings to allow specific devices and root ports on the root bus to be
+invisible to the OS.
+
+VMD works by creating separate PCI domains for each VMD device in the system.
+This makes VMD look more like a host bridge than an endpoint so VMD must try
+to adhere to the ACPI Operating System Capabilities (_OSC) flags of the system.
+A couple of the _OSC flags regard hotplug support.  Hotplug is a feature that
+is always enabled when using VMD regardless of the _OSC flags.
+
+Features
+========
+
+- Virtualization
+- MSIX interrupts
+- Power Management
+- Hotplug
+
+Limitations
+===========
+
+When VMD is enabled and used in a hypervisor the _OSC flags provided by the
+hypervisor BIOS may not be correct. The most critical of these flags are the
+hotplug bits. If these bits are incorrect then the storage devices behind the
+VMD will not be able to be hotplugged. The driver always supports hotplug for
+the devices behind it so the hotplug bits reported by the OS are not used.
diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
index e73f84aebde3..6558adc703f9 100644
--- a/Documentation/PCI/index.rst
+++ b/Documentation/PCI/index.rst
@@ -18,3 +18,4 @@  PCI Bus Subsystem
    pcieaer-howto
    endpoint/index
    boot-interrupts
+   controller/vmd