Message ID | 20240417201542.102-1-paul.m.stillwell.jr@intel.com
---|---
State | Superseded
Series | Documentation: PCI: add vmd documentation
On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote: > +================================================================= > +Linux Base Driver for the Intel(R) Volume Management Device (VMD) > +================================================================= > + > +Intel vmd Linux driver. > + > +Contents > +======== > + > +- Overview > +- Features > +- Limitations > + > +The Intel VMD provides the means to provide volume management across separate > +PCI Express HBAs and SSDs without requiring operating system support or > +communication between drivers. It does this by obscuring each storage > +controller from the OS, but allowing a single driver to be loaded that would > +control each storage controller. A Volume Management Device (VMD) provides a > +single device for a single storage driver. The VMD resides in the IIO root > +complex and it appears to the OS as a root bus integrated endpoint. In the IIO, > +the VMD is in a central location to manipulate access to storage devices which > +may be attached directly to the IIO or indirectly through the PCH. Instead of > +allowing individual storage devices to be detected by the OS and allow it to > +load a separate driver instance for each, the VMD provides configuration > +settings to allow specific devices and root ports on the root bus to be > +invisible to the OS. This doesn't really capture how the vmd driver works here, though. The linux driver doesn't control or hide any devices connected to it; it just creates a host bridge to its PCI domain, then exposes everything to the rest of the OS for other drivers to bind to individual device instances that the pci bus driver finds. Everything functions much the same as if VMD was disabled.
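[Editor's note: for readers following along, the flow Keith describes — create a host bridge for the new domain, then let the generic PCI core enumerate it and bind drivers as usual — might look roughly like the sketch below. This is illustrative kernel C only, not the actual drivers/pci/controller/vmd.c; the my_vmd_* names, the stub config accessors, and the simplified resource handling are placeholders.]

#include <linux/pci.h>

/*
 * Placeholder config accessors.  The real ones route every access through
 * the VMD endpoint's config aperture (see the ECAM-style sketch further
 * down in the thread); these stubs only make the example self-contained.
 */
static int my_vmd_pci_read(struct pci_bus *bus, unsigned int devfn,
			   int where, int size, u32 *val)
{
	*val = ~0;
	return PCIBIOS_DEVICE_NOT_FOUND;
}

static int my_vmd_pci_write(struct pci_bus *bus, unsigned int devfn,
			    int where, int size, u32 val)
{
	return PCIBIOS_DEVICE_NOT_FOUND;
}

static struct pci_ops my_vmd_ops = {
	.read  = my_vmd_pci_read,
	.write = my_vmd_pci_write,
};

/*
 * Create a host bridge/root bus for the new domain behind VMD and hand it
 * to the PCI core.  'sysdata' is assumed to carry the new domain number
 * (as the x86 pci_sysdata does); 'busn' and 'mmio' stand in for the
 * bus-number range and MMIO apertures the VMD endpoint exposes via its BARs.
 */
static int my_vmd_enable_domain(struct pci_dev *vmd_dev, void *sysdata,
				struct resource *busn, struct resource *mmio)
{
	LIST_HEAD(resources);
	struct pci_bus *bus;

	pci_add_resource(&resources, busn);
	pci_add_resource(&resources, mmio);

	bus = pci_create_root_bus(&vmd_dev->dev, 0, &my_vmd_ops, sysdata,
				  &resources);
	if (!bus) {
		pci_free_resource_list(&resources);
		return -ENODEV;
	}

	pci_scan_child_bus(bus);			/* ordinary enumeration */
	pci_assign_unassigned_bus_resources(bus);	/* BARs, bridge windows */
	pci_bus_add_devices(bus);			/* normal drivers bind here */
	return 0;
}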
On 4/17/2024 4:51 PM, Keith Busch wrote: > On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote: >> +================================================================= >> +Linux Base Driver for the Intel(R) Volume Management Device (VMD) >> +================================================================= >> + >> +Intel vmd Linux driver. >> + >> +Contents >> +======== >> + >> +- Overview >> +- Features >> +- Limitations >> + >> +The Intel VMD provides the means to provide volume management across separate >> +PCI Express HBAs and SSDs without requiring operating system support or >> +communication between drivers. It does this by obscuring each storage >> +controller from the OS, but allowing a single driver to be loaded that would >> +control each storage controller. A Volume Management Device (VMD) provides a >> +single device for a single storage driver. The VMD resides in the IIO root >> +complex and it appears to the OS as a root bus integrated endpoint. In the IIO, >> +the VMD is in a central location to manipulate access to storage devices which >> +may be attached directly to the IIO or indirectly through the PCH. Instead of >> +allowing individual storage devices to be detected by the OS and allow it to >> +load a separate driver instance for each, the VMD provides configuration >> +settings to allow specific devices and root ports on the root bus to be >> +invisible to the OS. > > This doesn't really capture how the vmd driver works here, though. The > linux driver doesn't control or hide any devices connected to it; it > just creates a host bridge to its PCI domain, then exposes everything to > the rest of the OS for other drivers to bind to individual device > instances that the pci bus driver finds. Everything functions much the > same as if VMD was disabled. I was trying more to provide an overview of how the VMD device itself works here and not the driver. The driver is fairly simple; it's how the device works that seems to confuse people :). Do you have a suggestion on what you would like to see here? Paul
[+cc Keith] On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote: > Adding documentation for the Intel VMD driver and updating the index > file to include it. > > Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com> > --- > Documentation/PCI/controller/vmd.rst | 51 ++++++++++++++++++++++++++++ > Documentation/PCI/index.rst | 1 + > 2 files changed, 52 insertions(+) > create mode 100644 Documentation/PCI/controller/vmd.rst > > diff --git a/Documentation/PCI/controller/vmd.rst b/Documentation/PCI/controller/vmd.rst > new file mode 100644 > index 000000000000..e1a019035245 > --- /dev/null > +++ b/Documentation/PCI/controller/vmd.rst > @@ -0,0 +1,51 @@ > +.. SPDX-License-Identifier: GPL-2.0+ > + > +================================================================= > +Linux Base Driver for the Intel(R) Volume Management Device (VMD) > +================================================================= > + > +Intel vmd Linux driver. > + > +Contents > +======== > + > +- Overview > +- Features > +- Limitations > + > +The Intel VMD provides the means to provide volume management across separate > +PCI Express HBAs and SSDs without requiring operating system support or > +communication between drivers. It does this by obscuring each storage > +controller from the OS, but allowing a single driver to be loaded that would > +control each storage controller. A Volume Management Device (VMD) provides a > +single device for a single storage driver. The VMD resides in the IIO root I'm not sure IIO (and PCH below) are really relevant to this. I think we really just care about the PCI topology enumerable by the OS. If they are relevant, expand them on first use as you did for VMD so we have a hint about how to learn more about it. > +complex and it appears to the OS as a root bus integrated endpoint. In the IIO, I suspect "root bus integrated endpoint" means the same as "Root Complex Integrated Endpoint" as defined by the PCIe spec? If so, please use that term and capitalize it so there's no confusion. > +the VMD is in a central location to manipulate access to storage devices which > +may be attached directly to the IIO or indirectly through the PCH. Instead of > +allowing individual storage devices to be detected by the OS and allow it to > +load a separate driver instance for each, the VMD provides configuration > +settings to allow specific devices and root ports on the root bus to be > +invisible to the OS. How are these settings configured? BIOS setup menu? > +VMD works by creating separate PCI domains for each VMD device in the system. > +This makes VMD look more like a host bridge than an endpoint so VMD must try > +to adhere to the ACPI Operating System Capabilities (_OSC) flags of the system. As Keith pointed out, I think this needs more details about how the hardware itself works. I don't think there's enough information here to maintain the OS/platform interface on an ongoing basis. I think "creating a separate PCI domain" is a consequence of providing a new config access mechanism, e.g., a new ECAM region, for devices below the VMD bridge. That hardware mechanism is important to understand because it means those downstream devices are unknown to anything that doesn't grok the config access mechanism. For example, firmware wouldn't know anything about them unless it had a VMD driver. Some of the pieces that might help figure this out: - Which devices (VMD bridge, VMD Root Ports, devices below VMD Root Ports) are enumerated in the host? 
- Which devices are passed through to a virtual guest and enumerated there? - Where does the vmd driver run (host or guest or both)? - Who (host or guest) runs the _OSC for the new VMD domain? - What happens to interrupts generated by devices downstream from VMD, e.g., AER interrupts from VMD Root Ports, hotplug interrupts from VMD Root Ports or switch downstream ports? Who fields them? In general firmware would field them unless it grants ownership via _OSC. If firmware grants ownership (or the OS forcibly takes it by overriding it for hotplug), I guess the OS that requested ownership would field them? - How do interrupts (hotplug, AER, etc) for things below VMD work? Assuming the OS owns the feature, how does the OS discover them? I guess probably the usual PCIe Capability and MSI/MSI-X Capabilities? Which OS (host or guest) fields them? > +A couple of the _OSC flags regard hotplug support. Hotplug is a feature that > +is always enabled when using VMD regardless of the _OSC flags. We log the _OSC negotiation in dmesg, so if we ignore or override _OSC for hotplug, maybe that should be made explicit in the logging somehow? > +Features > +======== > + > +- Virtualization > +- MSIX interrupts > +- Power Management > +- Hotplug s/MSIX/MSI-X/ to match spec usage. I'm not sure what this list is telling us. > +Limitations > +=========== > + > +When VMD is enabled and used in a hypervisor the _OSC flags provided by the > +hypervisor BIOS may not be correct. The most critical of these flags are the > +hotplug bits. If these bits are incorrect then the storage devices behind the > +VMD will not be able to be hotplugged. The driver always supports hotplug for > +the devices behind it so the hotplug bits reported by the OS are not used. "_OSC may not be correct" sounds kind of problematic. How does the OS deal with this? How does the OS know whether to pay attention to _OSC or ignore it because it tells us garbage? If we ignore _OSC hotplug bits because "we know what we want, and we know we won't conflict with firmware," how do we deal with other _OSC bits? AER? PME? What about bits that may be added in the future? Is there some kind of roadmap to help answer these questions? Bjorn
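[Editor's note: to make Bjorn's point about the "new config access mechanism" concrete, config space for everything behind VMD is reached through a window in the VMD endpoint's own BAR rather than through the platform's normal, _OSC-described mechanism. Below is a hedged sketch of such an accessor. It assumes the config aperture is BAR 0 and a flat ECAM layout; the my_vmd_* names and the sysdata wrapping are placeholders, and the real driver's bookkeeping (bus-number offsets, locking) is more involved.]

#include <linux/io.h>
#include <linux/pci.h>

struct my_vmd_sysdata {
	void __iomem *cfgbar;	/* ioremapped VMD config aperture (assumed BAR 0) */
};

/*
 * Standard ECAM layout inside the aperture: 4 KB of config space per
 * function, 32 devices x 8 functions per bus.
 */
static void __iomem *my_vmd_cfg_addr(struct my_vmd_sysdata *sd,
				     unsigned int busnr, unsigned int devfn,
				     int where)
{
	return sd->cfgbar + ((busnr << 20) | (devfn << 12) | (where & 0xfff));
}

static int my_vmd_pci_read(struct pci_bus *bus, unsigned int devfn,
			   int where, int size, u32 *val)
{
	struct my_vmd_sysdata *sd = bus->sysdata;
	void __iomem *addr = my_vmd_cfg_addr(sd, bus->number, devfn, where);

	switch (size) {
	case 1:
		*val = readb(addr);
		break;
	case 2:
		*val = readw(addr);
		break;
	case 4:
		*val = readl(addr);
		break;
	default:
		return PCIBIOS_BAD_REGISTER_NUMBER;
	}
	return PCIBIOS_SUCCESSFUL;
}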
On 4/18/2024 11:26 AM, Bjorn Helgaas wrote: > [+cc Keith] > > On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote: >> Adding documentation for the Intel VMD driver and updating the index >> file to include it. >> >> Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com> >> --- >> Documentation/PCI/controller/vmd.rst | 51 ++++++++++++++++++++++++++++ >> Documentation/PCI/index.rst | 1 + >> 2 files changed, 52 insertions(+) >> create mode 100644 Documentation/PCI/controller/vmd.rst >> >> diff --git a/Documentation/PCI/controller/vmd.rst b/Documentation/PCI/controller/vmd.rst >> new file mode 100644 >> index 000000000000..e1a019035245 >> --- /dev/null >> +++ b/Documentation/PCI/controller/vmd.rst >> @@ -0,0 +1,51 @@ >> +.. SPDX-License-Identifier: GPL-2.0+ >> + >> +================================================================= >> +Linux Base Driver for the Intel(R) Volume Management Device (VMD) >> +================================================================= >> + >> +Intel vmd Linux driver. >> + >> +Contents >> +======== >> + >> +- Overview >> +- Features >> +- Limitations >> + >> +The Intel VMD provides the means to provide volume management across separate >> +PCI Express HBAs and SSDs without requiring operating system support or >> +communication between drivers. It does this by obscuring each storage >> +controller from the OS, but allowing a single driver to be loaded that would >> +control each storage controller. A Volume Management Device (VMD) provides a >> +single device for a single storage driver. The VMD resides in the IIO root > > I'm not sure IIO (and PCH below) are really relevant to this. I think I'm trying to describe where in the CPU architecture VMD exists because it's not like other devices. It's not like a storage device or networking device that is plugged in somewhere; it exists as part of the CPU (in the IIO). I'm ok removing it, but it might be confusing to someone looking at the documentation. I'm also close to this so it may be clear to me, but confusing to others (which I know it is) so any help making it clearer would be appreciated. > we really just care about the PCI topology enumerable by the OS. If > they are relevant, expand them on first use as you did for VMD so we > have a hint about how to learn more about it. > I don't fully understand this comment. The PCI topology behind VMD is not enumerable by the OS unless we are considering the vmd driver the OS. If the user enables VMD in the BIOS and the vmd driver isn't loaded, then the OS never sees the devices behind VMD. The only reason the devices are seen by the OS is that the VMD driver does some mapping when the VMD driver loads during boot. >> +complex and it appears to the OS as a root bus integrated endpoint. In the IIO, > > I suspect "root bus integrated endpoint" means the same as "Root > Complex Integrated Endpoint" as defined by the PCIe spec? If so, > please use that term and capitalize it so there's no confusion. > OK, will fix. >> +the VMD is in a central location to manipulate access to storage devices which >> +may be attached directly to the IIO or indirectly through the PCH. Instead of >> +allowing individual storage devices to be detected by the OS and allow it to >> +load a separate driver instance for each, the VMD provides configuration >> +settings to allow specific devices and root ports on the root bus to be >> +invisible to the OS. > > How are these settings configured? BIOS setup menu? 
> I believe there are 2 ways this is done: The first is that the system designer creates a design such that some root ports and end points are behind VMD. If VMD is enabled in the BIOS then these devices don't show up to the OS and require a driver to use them (the vmd driver). If VMD is disabled in the BIOS then the devices are seen by the OS at boot time. The second way is that there are settings in the BIOS for VMD. I don't think there are many settings... it's mostly enable/disable VMD >> +VMD works by creating separate PCI domains for each VMD device in the system. >> +This makes VMD look more like a host bridge than an endpoint so VMD must try >> +to adhere to the ACPI Operating System Capabilities (_OSC) flags of the system. > > As Keith pointed out, I think this needs more details about how the > hardware itself works. I don't think there's enough information here > to maintain the OS/platform interface on an ongoing basis. > > I think "creating a separate PCI domain" is a consequence of providing > a new config access mechanism, e.g., a new ECAM region, for devices > below the VMD bridge. That hardware mechanism is important to > understand because it means those downstream devices are unknown to > anything that doesn't grok the config access mechanism. For example, > firmware wouldn't know anything about them unless it had a VMD driver. > > Some of the pieces that might help figure this out: > I'll add some details to answer these in the documentation, but I'll give a brief answer here as well > - Which devices (VMD bridge, VMD Root Ports, devices below VMD Root > Ports) are enumerated in the host? > Only the VMD device (as a PCI end point) are seen by the OS without the vmd driver > - Which devices are passed through to a virtual guest and enumerated > there? > All devices under VMD are passed to a virtual guest > - Where does the vmd driver run (host or guest or both)? > I believe the answer is both. > - Who (host or guest) runs the _OSC for the new VMD domain? > I believe the answer here is neither :) This has been an issue since commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"). I've submitted this patch (https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/) to attempt to fix the issue. You are much more of an expert in this area than I am, but as far as I can tell the only way the _OSC bits get cleared is via ACPI (specifically this code https://elixir.bootlin.com/linux/latest/source/drivers/acpi/pci_root.c#L1038). Since ACPI doesn't run on the devices behind VMD the _OSC bits don't get set properly for them. Ultimately the only _OSC bits that VMD cares about are the hotplug bits because that is a feature of our device; it enables hotplug in guests where there is no way to enable it. That's why my patch is to set them all the time and copy the other _OSC bits because there is no other way to enable this feature (i.e. there is no user space tool to enable/disable it). > - What happens to interrupts generated by devices downstream from > VMD, e.g., AER interrupts from VMD Root Ports, hotplug interrupts > from VMD Root Ports or switch downstream ports? Who fields them? > In general firmware would field them unless it grants ownership > via _OSC. If firmware grants ownership (or the OS forcibly takes > it by overriding it for hotplug), I guess the OS that requested > ownership would field them? > The interrupts are passed through VMD to the OS. 
This was the AER issue that resulted in commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"). IIRC AER was disabled in the BIOS, but is was enabled in the VMD host bridge because pci_init_host_bridge() sets all the bits to 1 and that generated an AER interrupt storm. In bare metal scenarios the _OSC bits are correct, but in a hypervisor scenario the bits are wrong because they are all 0 regardless of what the ACPI tables indicate. The challenge is that the VMD driver has no way to know it's in a hypervisor to set the hotplug bits correctly. > - How do interrupts (hotplug, AER, etc) for things below VMD work? > Assuming the OS owns the feature, how does the OS discover them? I feel like this is the same question as above? Or maybe I'm missing a subtlety about this... > I guess probably the usual PCIe Capability and MSI/MSI-X > Capabilities? Which OS (host or guest) fields them? > >> +A couple of the _OSC flags regard hotplug support. Hotplug is a feature that >> +is always enabled when using VMD regardless of the _OSC flags. > > We log the _OSC negotiation in dmesg, so if we ignore or override _OSC > for hotplug, maybe that should be made explicit in the logging > somehow? > That's a really good idea and something I can add to https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/ Would a message like this help from the VMD driver? "VMD enabled, hotplug enabled by VMD" >> +Features >> +======== >> + >> +- Virtualization >> +- MSIX interrupts >> +- Power Management >> +- Hotplug > > s/MSIX/MSI-X/ to match spec usage. > > I'm not sure what this list is telling us. > Will fix >> +Limitations >> +=========== >> + >> +When VMD is enabled and used in a hypervisor the _OSC flags provided by the >> +hypervisor BIOS may not be correct. The most critical of these flags are the >> +hotplug bits. If these bits are incorrect then the storage devices behind the >> +VMD will not be able to be hotplugged. The driver always supports hotplug for >> +the devices behind it so the hotplug bits reported by the OS are not used. > > "_OSC may not be correct" sounds kind of problematic. How does the > OS deal with this? How does the OS know whether to pay attention to > _OSC or ignore it because it tells us garbage? > That's the $64K question, lol. We've been trying to solve that since commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") :) > If we ignore _OSC hotplug bits because "we know what we want, and we > know we won't conflict with firmware," how do we deal with other _OSC > bits? AER? PME? What about bits that may be added in the future? > Is there some kind of roadmap to help answer these questions? > As I mentioned earlier, VMD only really cares about hotplug because that is the feature we enable for guests (and hosts). I believe the solution is to use the root bridge settings for all other bits (which is what is happening currently). What this will mean in practice is that in a bare metal scenario the bits will be correct for all the features (AER et al) and that in a guest scenario all the bits other than hotplug (which we will enable always) will be 0 (that's what we see in all hypervisor scenarios we've tested) which is fine for us because we don't care about any of the other bits. 
That's why I think it's ok for us to set the hotplug bits to 1 when the VMD driver loads; we aren't harming any other devices, we are enabling a feature that we know our users want and we are setting all the other _OSC bits "correctly" (for some values of correctly :) ) I appreciate your feedback and I'll start working on updating the documentation to make it clearer. I'll wait to send a v2 until I feel like we've finished our discussion from this one. Paul > Bjorn >
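[Editor's note: Paul's "set the hotplug bits all the time and copy the other _OSC bits" idea amounts to something like the following adjustment to the vmd_copy_host_bridge_flags() helper quoted later in the thread. This is a sketch of the idea as described here, not necessarily the exact patch at the lore link.]

static void vmd_copy_host_bridge_flags(struct pci_host_bridge *root_bridge,
				       struct pci_host_bridge *vmd_bridge)
{
	/* Always OS-native behind VMD, regardless of the parent domain's _OSC */
	vmd_bridge->native_pcie_hotplug = 1;
	vmd_bridge->native_shpc_hotplug = 1;

	/* Everything else mirrors the _OSC result for the VMD endpoint's domain */
	vmd_bridge->native_aer = root_bridge->native_aer;
	vmd_bridge->native_pme = root_bridge->native_pme;
	vmd_bridge->native_ltr = root_bridge->native_ltr;
	vmd_bridge->native_dpc = root_bridge->native_dpc;
}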
On Thu, Apr 18, 2024 at 08:07:26AM -0700, Paul M Stillwell Jr wrote: > On 4/17/2024 4:51 PM, Keith Busch wrote: > > On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote: > > > +================================================================= > > > +Linux Base Driver for the Intel(R) Volume Management Device (VMD) > > > +================================================================= > > > + > > > +Intel vmd Linux driver. > > > + > > > +Contents > > > +======== > > > + > > > +- Overview > > > +- Features > > > +- Limitations > > > + > > > +The Intel VMD provides the means to provide volume management across separate > > > +PCI Express HBAs and SSDs without requiring operating system support or > > > +communication between drivers. It does this by obscuring each storage > > > +controller from the OS, but allowing a single driver to be loaded that would > > > +control each storage controller. A Volume Management Device (VMD) provides a > > > +single device for a single storage driver. The VMD resides in the IIO root > > > +complex and it appears to the OS as a root bus integrated endpoint. In the IIO, > > > +the VMD is in a central location to manipulate access to storage devices which > > > +may be attached directly to the IIO or indirectly through the PCH. Instead of > > > +allowing individual storage devices to be detected by the OS and allow it to > > > +load a separate driver instance for each, the VMD provides configuration > > > +settings to allow specific devices and root ports on the root bus to be > > > +invisible to the OS. > > > > This doesn't really capture how the vmd driver works here, though. The > > linux driver doesn't control or hide any devices connected to it; it > > just creates a host bridge to its PCI domain, then exposes everything to > > the rest of the OS for other drivers to bind to individual device > > instances that the pci bus driver finds. Everything functions much the > > same as if VMD was disabled. > > I was trying more to provide an overview of how the VMD device itself works > here and not the driver. The driver is fairly simple; it's how the device > works that seems to confuse people :). Right, I know that's how the marketing docs described VMD, but your new doc is called "Linux Base Driver for ... VMD", so I thought your goal was to describe the *driver* rather than the device. > Do you have a suggestion on what you would like to see here? I think we did okay with the description in the initial commit log. It is here if you want to reference that, though it may need some updates for more current hardware. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=185a383ada2e7794b0e82e040223e741b24d2bf8
[+cc Kai-Heng, Rafael, Lorenzo since we're talking about 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features")] On Thu, Apr 18, 2024 at 02:51:19PM -0700, Paul M Stillwell Jr wrote: > On 4/18/2024 11:26 AM, Bjorn Helgaas wrote: > > On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote: > > > Adding documentation for the Intel VMD driver and updating the index > > > file to include it. > > > > > > Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com> > > > --- > > > Documentation/PCI/controller/vmd.rst | 51 ++++++++++++++++++++++++++++ > > > Documentation/PCI/index.rst | 1 + > > > 2 files changed, 52 insertions(+) > > > create mode 100644 Documentation/PCI/controller/vmd.rst > > > > > > diff --git a/Documentation/PCI/controller/vmd.rst b/Documentation/PCI/controller/vmd.rst > > > new file mode 100644 > > > index 000000000000..e1a019035245 > > > --- /dev/null > > > +++ b/Documentation/PCI/controller/vmd.rst > > > @@ -0,0 +1,51 @@ > > > +.. SPDX-License-Identifier: GPL-2.0+ > > > + > > > +================================================================= > > > +Linux Base Driver for the Intel(R) Volume Management Device (VMD) > > > +================================================================= > > > + > > > +Intel vmd Linux driver. > > > + > > > +Contents > > > +======== > > > + > > > +- Overview > > > +- Features > > > +- Limitations > > > + > > > +The Intel VMD provides the means to provide volume management across separate > > > +PCI Express HBAs and SSDs without requiring operating system support or > > > +communication between drivers. It does this by obscuring each storage > > > +controller from the OS, but allowing a single driver to be loaded that would > > > +control each storage controller. A Volume Management Device (VMD) provides a > > > +single device for a single storage driver. The VMD resides in the IIO root > > > > I'm not sure IIO (and PCH below) are really relevant to this. I think > > I'm trying to describe where in the CPU architecture VMD exists because it's > not like other devices. It's not like a storage device or networking device > that is plugged in somewhere; it exists as part of the CPU (in the IIO). I'm > ok removing it, but it might be confusing to someone looking at the > documentation. I'm also close to this so it may be clear to me, but > confusing to others (which I know it is) so any help making it clearer would > be appreciated. The vmd driver binds to a PCI device, and it doesn't matter where it's physically implemented. > > we really just care about the PCI topology enumerable by the OS. If > > they are relevant, expand them on first use as you did for VMD so we > > have a hint about how to learn more about it. > > I don't fully understand this comment. The PCI topology behind VMD is not > enumerable by the OS unless we are considering the vmd driver the OS. If the > user enables VMD in the BIOS and the vmd driver isn't loaded, then the OS > never sees the devices behind VMD. > > The only reason the devices are seen by the OS is that the VMD driver does > some mapping when the VMD driver loads during boot. Yes. I think it's worth mentioning these details about the hierarchy behind VMD, e.g., how their config space is reached, how driver MMIO accesses are routed, how device interrupts are routed, etc. The 185a383ada2e ("x86/PCI: Add driver for Intel Volume Management Device (VMD)") commit log mentioned by Keith is a fantastic start and answers several things below. 
Putting it in this doc would be great because that commit is not very visible. > > > +the VMD is in a central location to manipulate access to storage devices which > > > +may be attached directly to the IIO or indirectly through the PCH. Instead of > > > +allowing individual storage devices to be detected by the OS and allow it to > > > +load a separate driver instance for each, the VMD provides configuration > > > +settings to allow specific devices and root ports on the root bus to be > > > +invisible to the OS. > > > > How are these settings configured? BIOS setup menu? > > I believe there are 2 ways this is done: > > The first is that the system designer creates a design such that some root > ports and end points are behind VMD. If VMD is enabled in the BIOS then > these devices don't show up to the OS and require a driver to use them (the > vmd driver). If VMD is disabled in the BIOS then the devices are seen by the > OS at boot time. > > The second way is that there are settings in the BIOS for VMD. I don't think > there are many settings... it's mostly enable/disable VMD I think what I want to know here is just "there's a BIOS switch that enables VMD, resulting in only the VMD RCiEP being visible, or disables VMD, resulting in the VMD RCiEP not being visible (?) and the VMD Root Ports and downstream hierarchies being just normal Root Ports." You can wordsmith that; I just wanted to know what "configuration settings" referred to so users would know where they live and what they mean. > > > +VMD works by creating separate PCI domains for each VMD device in the system. > > > +This makes VMD look more like a host bridge than an endpoint so VMD must try > > > +to adhere to the ACPI Operating System Capabilities (_OSC) flags of the system. > > I think "creating a separate PCI domain" is a consequence of providing > > a new config access mechanism, e.g., a new ECAM region, for devices > > below the VMD bridge. That hardware mechanism is important to > > understand because it means those downstream devices are unknown to > > anything that doesn't grok the config access mechanism. For example, > > firmware wouldn't know anything about them unless it had a VMD driver. > > - Which devices (VMD bridge, VMD Root Ports, devices below VMD Root > > Ports) are enumerated in the host? > > Only the VMD device (as a PCI end point) are seen by the OS without the vmd > driver This is the case when VMD is enabled by BIOS, right? And when the vmd driver sets up the new config space and enumerates behind VMD, we'll find the VMD Root Ports and the hierarchies below them? If BIOS leaves the VMD functionality disabled, I guess those VMD Root Ports and the hierarchies below them are enumerated normally, without the vmd driver being involved? (Maybe the VMD RCiEP itself is disabled, so we won't enumerate it and we won't try to bind the vmd driver to it?) > > - Which devices are passed through to a virtual guest and enumerated > > there? > > All devices under VMD are passed to a virtual guest So the guest will see the VMD Root Ports, but not the VMD RCiEP itself? > > - Where does the vmd driver run (host or guest or both)? > > I believe the answer is both. If the VMD RCiEP isn't passed through to the guest, how can the vmd driver do anything in the guest? > > - Who (host or guest) runs the _OSC for the new VMD domain? > > I believe the answer here is neither :) This has been an issue since commit > 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"). 
I've submitted > this patch (https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/) > to attempt to fix the issue. IIUC we make no attempt to evaluate _OSC for the new VMD domain (and firmware likely would not supply one anyway since it doesn't know how to enumerate anything there). So 04b12ef163d1 just assumes the _OSC that applies to the VMD RCiEP also applies to the new VMD domain below the VMD. If that's what we want to happen, we should state this explicitly. But I don't think it *is* what we want; see below. > You are much more of an expert in this area than I am, but as far as I can > tell the only way the _OSC bits get cleared is via ACPI (specifically this > code https://elixir.bootlin.com/linux/latest/source/drivers/acpi/pci_root.c#L1038). > Since ACPI doesn't run on the devices behind VMD the _OSC bits don't get set > properly for them. Apparently there's *something* in ACPI related to the VMD domain because 59dc33252ee7 ("PCI: VMD: ACPI: Make ACPI companion lookup work for VMD bus") seems to add support for ACPI companions for devices in the VMD domain? I don't understand what's going on here because if BIOS enables VMD, firmware would see the VMD RCiEP but not devices behind it. And if BIOS disables VMD, those devices should all appear normally and we wouldn't need 59dc33252ee7. > Ultimately the only _OSC bits that VMD cares about are the hotplug bits > because that is a feature of our device; it enables hotplug in guests where > there is no way to enable it. That's why my patch is to set them all the > time and copy the other _OSC bits because there is no other way to enable > this feature (i.e. there is no user space tool to enable/disable it). And here's the problem, because the _OSC that applies to domain X that contains the VMD RCiEP may result in the platform retaining ownership of all these PCIe features (hotplug, AER, PME, etc), but you want OSPM to own them for the VMD domain Y, regardless of whether it owns them for domain X. OSPM *did* own all PCIe features before 04b12ef163d1 because the new VMD "host bridge" got native owership by default. > > - What happens to interrupts generated by devices downstream from > > VMD, e.g., AER interrupts from VMD Root Ports, hotplug interrupts > > from VMD Root Ports or switch downstream ports? Who fields them? > > In general firmware would field them unless it grants ownership > > via _OSC. If firmware grants ownership (or the OS forcibly takes > > it by overriding it for hotplug), I guess the OS that requested > > ownership would field them? > > The interrupts are passed through VMD to the OS. This was the AER issue that > resulted in commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe > features"). IIRC AER was disabled in the BIOS, but is was enabled in the VMD > host bridge because pci_init_host_bridge() sets all the bits to 1 and that > generated an AER interrupt storm. > > In bare metal scenarios the _OSC bits are correct, but in a hypervisor > scenario the bits are wrong because they are all 0 regardless of what the > ACPI tables indicate. The challenge is that the VMD driver has no way to > know it's in a hypervisor to set the hotplug bits correctly. This is the problem. We claim "the bits are wrong because they are all 0", but I don't think we can derive that conclusion from anything in the ACPI, PCIe, or PCI Firmware specs, which makes it hard to maintain. IIUC, the current situation is "regardless of what firmware said, in the VMD domain we want AER disabled and hotplug enabled." 
04b12ef163d1 disabled AER by copying the _OSC that applied to the VMD RCiEP (although this sounds broken because it probably left AER *enabled* if that _OSC happened to grant ownership to the OS). And your patch at https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/ enables hotplug by setting native_pcie_hotplug to 1. It seems like the only clear option is to say "the vmd driver owns all PCIe services in the VMD domain, the platform does not supply _OSC for the VMD domain, the platform can't do anything with PCIe services in the VMD domain, and the vmd driver needs to explicitly enable/disable services as it needs." Bjorn
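[Editor's note: in contrast to the copy-based sketch earlier in the thread, the option Bjorn outlines — no platform _OSC applies to the VMD domain, so the vmd driver states its own per-feature policy — could be expressed roughly as below. The specific policy values are illustrative only, not a statement of what the driver should do.]

static void vmd_set_native_ownership(struct pci_host_bridge *vmd_bridge)
{
	/* vmd decides per feature; nothing is inherited from another domain's _OSC */
	vmd_bridge->native_pcie_hotplug = 1;	/* vmd always handles hotplug */
	vmd_bridge->native_shpc_hotplug = 1;
	vmd_bridge->native_aer = 0;	/* OS does not enable AER on VMD ports */
	vmd_bridge->native_pme = 0;
	vmd_bridge->native_ltr = 0;
	vmd_bridge->native_dpc = 0;
}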
On 4/19/2024 2:14 PM, Bjorn Helgaas wrote: > [+cc Kai-Heng, Rafael, Lorenzo since we're talking about 04b12ef163d1 > ("PCI: vmd: Honor ACPI _OSC on PCIe features")] > > On Thu, Apr 18, 2024 at 02:51:19PM -0700, Paul M Stillwell Jr wrote: >> On 4/18/2024 11:26 AM, Bjorn Helgaas wrote: >>> On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote: >>>> Adding documentation for the Intel VMD driver and updating the index >>>> file to include it. >>>> >>>> Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com> >>>> --- >>>> Documentation/PCI/controller/vmd.rst | 51 ++++++++++++++++++++++++++++ >>>> Documentation/PCI/index.rst | 1 + >>>> 2 files changed, 52 insertions(+) >>>> create mode 100644 Documentation/PCI/controller/vmd.rst >>>> >>>> diff --git a/Documentation/PCI/controller/vmd.rst b/Documentation/PCI/controller/vmd.rst >>>> new file mode 100644 >>>> index 000000000000..e1a019035245 >>>> --- /dev/null >>>> +++ b/Documentation/PCI/controller/vmd.rst >>>> @@ -0,0 +1,51 @@ >>>> +.. SPDX-License-Identifier: GPL-2.0+ >>>> + >>>> +================================================================= >>>> +Linux Base Driver for the Intel(R) Volume Management Device (VMD) >>>> +================================================================= >>>> + >>>> +Intel vmd Linux driver. >>>> + >>>> +Contents >>>> +======== >>>> + >>>> +- Overview >>>> +- Features >>>> +- Limitations >>>> + >>>> +The Intel VMD provides the means to provide volume management across separate >>>> +PCI Express HBAs and SSDs without requiring operating system support or >>>> +communication between drivers. It does this by obscuring each storage >>>> +controller from the OS, but allowing a single driver to be loaded that would >>>> +control each storage controller. A Volume Management Device (VMD) provides a >>>> +single device for a single storage driver. The VMD resides in the IIO root >>> >>> I'm not sure IIO (and PCH below) are really relevant to this. I think >> >> I'm trying to describe where in the CPU architecture VMD exists because it's >> not like other devices. It's not like a storage device or networking device >> that is plugged in somewhere; it exists as part of the CPU (in the IIO). I'm >> ok removing it, but it might be confusing to someone looking at the >> documentation. I'm also close to this so it may be clear to me, but >> confusing to others (which I know it is) so any help making it clearer would >> be appreciated. > > The vmd driver binds to a PCI device, and it doesn't matter where it's > physically implemented. > Ok >>> we really just care about the PCI topology enumerable by the OS. If >>> they are relevant, expand them on first use as you did for VMD so we >>> have a hint about how to learn more about it. >> >> I don't fully understand this comment. The PCI topology behind VMD is not >> enumerable by the OS unless we are considering the vmd driver the OS. If the >> user enables VMD in the BIOS and the vmd driver isn't loaded, then the OS >> never sees the devices behind VMD. >> >> The only reason the devices are seen by the OS is that the VMD driver does >> some mapping when the VMD driver loads during boot. > > Yes. I think it's worth mentioning these details about the hierarchy > behind VMD, e.g., how their config space is reached, how driver MMIO > accesses are routed, how device interrupts are routed, etc. 
> > The 185a383ada2e ("x86/PCI: Add driver for Intel Volume Management > Device (VMD)") commit log mentioned by Keith is a fantastic start > and answers several things below. Putting it in this doc would be > great because that commit is not very visible. > I'll look at it and work it in (either replace this overview or add to it) >>>> +the VMD is in a central location to manipulate access to storage devices which >>>> +may be attached directly to the IIO or indirectly through the PCH. Instead of >>>> +allowing individual storage devices to be detected by the OS and allow it to >>>> +load a separate driver instance for each, the VMD provides configuration >>>> +settings to allow specific devices and root ports on the root bus to be >>>> +invisible to the OS. >>> >>> How are these settings configured? BIOS setup menu? >> >> I believe there are 2 ways this is done: >> >> The first is that the system designer creates a design such that some root >> ports and end points are behind VMD. If VMD is enabled in the BIOS then >> these devices don't show up to the OS and require a driver to use them (the >> vmd driver). If VMD is disabled in the BIOS then the devices are seen by the >> OS at boot time. >> >> The second way is that there are settings in the BIOS for VMD. I don't think >> there are many settings... it's mostly enable/disable VMD > > I think what I want to know here is just "there's a BIOS switch that > enables VMD, resulting in only the VMD RCiEP being visible, or > disables VMD, resulting in the VMD RCiEP not being visible (?) and the > VMD Root Ports and downstream hierarchies being just normal Root > Ports." You can wordsmith that; I just wanted to know what > "configuration settings" referred to so users would know where they > live and what they mean. > Got it, will update to reflect that setup/config is through the BIOS >>>> +VMD works by creating separate PCI domains for each VMD device in the system. >>>> +This makes VMD look more like a host bridge than an endpoint so VMD must try >>>> +to adhere to the ACPI Operating System Capabilities (_OSC) flags of the system. > >>> I think "creating a separate PCI domain" is a consequence of providing >>> a new config access mechanism, e.g., a new ECAM region, for devices >>> below the VMD bridge. That hardware mechanism is important to >>> understand because it means those downstream devices are unknown to >>> anything that doesn't grok the config access mechanism. For example, >>> firmware wouldn't know anything about them unless it had a VMD driver. > >>> - Which devices (VMD bridge, VMD Root Ports, devices below VMD Root >>> Ports) are enumerated in the host? >> >> Only the VMD device (as a PCI end point) are seen by the OS without the vmd >> driver > > This is the case when VMD is enabled by BIOS, right? And when the vmd Yes > driver sets up the new config space and enumerates behind VMD, we'll > find the VMD Root Ports and the hierarchies below them? > Yes > If BIOS leaves the VMD functionality disabled, I guess those VMD Root > Ports and the hierarchies below them are enumerated normally, without > the vmd driver being involved? (Maybe the VMD RCiEP itself is Yes > disabled, so we won't enumerate it and we won't try to bind the vmd > driver to it?) > Yes >>> - Which devices are passed through to a virtual guest and enumerated >>> there? >> >> All devices under VMD are passed to a virtual guest > > So the guest will see the VMD Root Ports, but not the VMD RCiEP > itself? 
> The guest will see the VMD device and then the vmd driver in the guest will enumerate the devices behind it is my understanding >>> - Where does the vmd driver run (host or guest or both)? >> >> I believe the answer is both. > > If the VMD RCiEP isn't passed through to the guest, how can the vmd > driver do anything in the guest? > The VMD device is passed through to the guest. It works just like bare metal in that the guest OS detects the VMD device and loads the vmd driver which then enumerates the devices into the guest >>> - Who (host or guest) runs the _OSC for the new VMD domain? >> >> I believe the answer here is neither :) This has been an issue since commit >> 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"). I've submitted >> this patch (https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/) >> to attempt to fix the issue. > > IIUC we make no attempt to evaluate _OSC for the new VMD domain (and > firmware likely would not supply one anyway since it doesn't know how > to enumerate anything there). So 04b12ef163d1 just assumes the _OSC > that applies to the VMD RCiEP also applies to the new VMD domain below > the VMD. > > If that's what we want to happen, we should state this explicitly. > But I don't think it *is* what we want; see below. > >> You are much more of an expert in this area than I am, but as far as I can >> tell the only way the _OSC bits get cleared is via ACPI (specifically this >> code https://elixir.bootlin.com/linux/latest/source/drivers/acpi/pci_root.c#L1038). >> Since ACPI doesn't run on the devices behind VMD the _OSC bits don't get set >> properly for them. > > Apparently there's *something* in ACPI related to the VMD domain > because 59dc33252ee7 ("PCI: VMD: ACPI: Make ACPI companion lookup work > for VMD bus") seems to add support for ACPI companions for devices in > the VMD domain? > I've seen this code also and don't understand what it's supposed to do. I'll investigate this further > I don't understand what's going on here because if BIOS enables VMD, > firmware would see the VMD RCiEP but not devices behind it. And if > BIOS disables VMD, those devices should all appear normally and we > wouldn't need 59dc33252ee7. > I agree, I'll try to figure this out and get an answer >> Ultimately the only _OSC bits that VMD cares about are the hotplug bits >> because that is a feature of our device; it enables hotplug in guests where >> there is no way to enable it. That's why my patch is to set them all the >> time and copy the other _OSC bits because there is no other way to enable >> this feature (i.e. there is no user space tool to enable/disable it). > > And here's the problem, because the _OSC that applies to domain X that > contains the VMD RCiEP may result in the platform retaining ownership > of all these PCIe features (hotplug, AER, PME, etc), but you want OSPM > to own them for the VMD domain Y, regardless of whether it owns them > for domain X. > I thought each domain gets it's own _OSC. In the case of VMD, it creates a new domain so the _OSC for that domain should be determined somehow. Normally that determination would be done via ACPI (if I'm understanding things correctly) and in the case of VMD I am saying that hotplug should always be enabled and we can use the _OSC from another domain for the other bits. I went down a rabbit hole when I started working on this and had an idea that maybe the VMD driver should create an ACPI table, but that seemed way more complicated than it needs to be. 
Plus I wasn't sure what good it would do because I don't understand the relationship between the root ports below VMD and the settings we would use for the bridge. > OSPM *did* own all PCIe features before 04b12ef163d1 because the new > VMD "host bridge" got native owership by default. > >>> - What happens to interrupts generated by devices downstream from >>> VMD, e.g., AER interrupts from VMD Root Ports, hotplug interrupts >>> from VMD Root Ports or switch downstream ports? Who fields them? >>> In general firmware would field them unless it grants ownership >>> via _OSC. If firmware grants ownership (or the OS forcibly takes >>> it by overriding it for hotplug), I guess the OS that requested >>> ownership would field them? >> >> The interrupts are passed through VMD to the OS. This was the AER issue that >> resulted in commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe >> features"). IIRC AER was disabled in the BIOS, but is was enabled in the VMD >> host bridge because pci_init_host_bridge() sets all the bits to 1 and that >> generated an AER interrupt storm. >> >> In bare metal scenarios the _OSC bits are correct, but in a hypervisor >> scenario the bits are wrong because they are all 0 regardless of what the >> ACPI tables indicate. The challenge is that the VMD driver has no way to >> know it's in a hypervisor to set the hotplug bits correctly. > > This is the problem. We claim "the bits are wrong because they are > all 0", but I don't think we can derive that conclusion from anything > in the ACPI, PCIe, or PCI Firmware specs, which makes it hard to > maintain. > I guess I look at it this way: if I run a bare metal OS and I get hotplug and AER are enabled and then I run a guest on the same system and I get all _OSC bits are disabled then I think the bits aren't correct. But there isn't any way to detect that through the "normal" channels (ACPI, PCIe, etc). > IIUC, the current situation is "regardless of what firmware said, in > the VMD domain we want AER disabled and hotplug enabled." > We aren't saying we want AER disabled, we are just saying we want hotplug enabled. The observation is that in a hypervisor scenario AER is going to be disabled because the _OSC bits are all 0. > 04b12ef163d1 disabled AER by copying the _OSC that applied to the VMD > RCiEP (although this sounds broken because it probably left AER > *enabled* if that _OSC happened to grant ownership to the OS). > > And your patch at > https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/ > enables hotplug by setting native_pcie_hotplug to 1. > > It seems like the only clear option is to say "the vmd driver owns all > PCIe services in the VMD domain, the platform does not supply _OSC for > the VMD domain, the platform can't do anything with PCIe services in > the VMD domain, and the vmd driver needs to explicitly enable/disable > services as it needs." > I actually looked at this as well :) I had an idea to set the _OSC bits to 0 when the vmd driver created the domain. The look at all the root ports underneath it and see if AER and PM were set. If any root port underneath VMD set AER or PM then I would set the _OSC bit for the bridge to 1. That way if any root port underneath VMD had enabled AER (as an example) then that feature would still work. I didn't test this in a hypervisor scenario though so not sure what I would see. Would that be an acceptable idea? Paul > Bjorn
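[Editor's note: the idea Paul floats at the end of this mail — zero the VMD domain's ownership bits, then look at the Root Ports under VMD and re-enable a feature only if some port already has it enabled — might look roughly like the sketch below, shown for AER only. Checking the Device Control error-reporting enables is one plausible reading of "AER was set" on a port; the mail does not specify the exact test.]

#include <linux/pci.h>

static void vmd_infer_aer_from_ports(struct pci_host_bridge *vmd_bridge,
				     struct pci_bus *vmd_bus)
{
	struct pci_dev *port;

	vmd_bridge->native_aer = 0;

	list_for_each_entry(port, &vmd_bus->devices, bus_list) {
		u16 devctl = 0;

		if (pci_pcie_type(port) != PCI_EXP_TYPE_ROOT_PORT)
			continue;
		if (!pci_find_ext_capability(port, PCI_EXT_CAP_ID_ERR))
			continue;

		pcie_capability_read_word(port, PCI_EXP_DEVCTL, &devctl);
		if (devctl & (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE |
			      PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)) {
			vmd_bridge->native_aer = 1;	/* a port under VMD uses AER */
			break;
		}
	}
}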
On Fri, Apr 19, 2024 at 03:18:19PM -0700, Paul M Stillwell Jr wrote: > On 4/19/2024 2:14 PM, Bjorn Helgaas wrote: > > On Thu, Apr 18, 2024 at 02:51:19PM -0700, Paul M Stillwell Jr wrote: > > > On 4/18/2024 11:26 AM, Bjorn Helgaas wrote: > > > > On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote: > > > > > Adding documentation for the Intel VMD driver and updating the index > > > > > file to include it. > > > > - Which devices are passed through to a virtual guest and enumerated > > > > there? > > > > > > All devices under VMD are passed to a virtual guest > > > > So the guest will see the VMD Root Ports, but not the VMD RCiEP > > itself? > > The guest will see the VMD device and then the vmd driver in the guest will > enumerate the devices behind it is my understanding > > > > > - Where does the vmd driver run (host or guest or both)? > > > > > > I believe the answer is both. > > > > If the VMD RCiEP isn't passed through to the guest, how can the vmd > > driver do anything in the guest? > > The VMD device is passed through to the guest. It works just like bare metal > in that the guest OS detects the VMD device and loads the vmd driver which > then enumerates the devices into the guest I guess it's obvious that the VMD RCiEP must be passed through to the guest because the whole point of https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/ is to do something in the guest. It does puzzle me that we have two copies of the vmd driver (one in the host OS and another in the guest OS) that think they own the same physical device. I'm not a virtualization guru but that sounds potentially problematic. > > IIUC, the current situation is "regardless of what firmware said, in > > the VMD domain we want AER disabled and hotplug enabled." > > We aren't saying we want AER disabled, we are just saying we want hotplug > enabled. The observation is that in a hypervisor scenario AER is going to be > disabled because the _OSC bits are all 0. 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") is saying we want AER disabled for the VMD domain, isn't it? > > It seems like the only clear option is to say "the vmd driver owns all > > PCIe services in the VMD domain, the platform does not supply _OSC for > > the VMD domain, the platform can't do anything with PCIe services in > > the VMD domain, and the vmd driver needs to explicitly enable/disable > > services as it needs." > > I actually looked at this as well :) I had an idea to set the _OSC bits to 0 > when the vmd driver created the domain. The look at all the root ports > underneath it and see if AER and PM were set. If any root port underneath > VMD set AER or PM then I would set the _OSC bit for the bridge to 1. That > way if any root port underneath VMD had enabled AER (as an example) then > that feature would still work. I didn't test this in a hypervisor scenario > though so not sure what I would see. _OSC negotiates ownership of features between platform firmware and OSPM. The "native_pcie_hotplug" and similar bits mean that "IF a device advertises the feature, the OS can use it." We clear those native_* bits if the platform retains ownership via _OSC. If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for the domain below it, why would we assume that BIOS retains ownership of the features negotiated by _OSC? I think we have to assume the OS owns them, which is what happened before 04b12ef163d1. Bjorn
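[Editor's note: for reference, Bjorn's description matches how the ownership bits are handled today: pci_init_host_bridge() starts every new host bridge OS-native, and the ACPI root code clears a bit only when _OSC did not grant the OS that feature. The helper below condenses and paraphrases that logic (it is not verbatim kernel code); since no _OSC applies to a VMD host bridge, nothing ever cleared its bits before 04b12ef163d1.]

#include <linux/acpi.h>
#include <linux/pci.h>

static void example_apply_host_bridge_osc(struct pci_host_bridge *bridge,
					  u32 osc_control_set)
{
	/* Defaults, as in pci_init_host_bridge(): OS owns everything */
	bridge->native_pcie_hotplug = 1;
	bridge->native_shpc_hotplug = 1;
	bridge->native_aer = 1;
	bridge->native_pme = 1;

	/* _OSC results, as applied for ACPI host bridges in pci_root.c */
	if (!(osc_control_set & OSC_PCI_EXPRESS_NATIVE_HP_CONTROL))
		bridge->native_pcie_hotplug = 0;
	if (!(osc_control_set & OSC_PCI_SHPC_NATIVE_HP_CONTROL))
		bridge->native_shpc_hotplug = 0;
	if (!(osc_control_set & OSC_PCI_EXPRESS_AER_CONTROL))
		bridge->native_aer = 0;
	if (!(osc_control_set & OSC_PCI_EXPRESS_PME_CONTROL))
		bridge->native_pme = 0;
	/* ...and similarly for LTR and DPC. */
}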
On 4/22/2024 1:27 PM, Bjorn Helgaas wrote: > On Fri, Apr 19, 2024 at 03:18:19PM -0700, Paul M Stillwell Jr wrote: >> On 4/19/2024 2:14 PM, Bjorn Helgaas wrote: >>> On Thu, Apr 18, 2024 at 02:51:19PM -0700, Paul M Stillwell Jr wrote: >>>> On 4/18/2024 11:26 AM, Bjorn Helgaas wrote: >>>>> On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote: >>>>>> Adding documentation for the Intel VMD driver and updating the index >>>>>> file to include it. > >>>>> - Which devices are passed through to a virtual guest and enumerated >>>>> there? >>>> >>>> All devices under VMD are passed to a virtual guest >>> >>> So the guest will see the VMD Root Ports, but not the VMD RCiEP >>> itself? >> >> The guest will see the VMD device and then the vmd driver in the guest will >> enumerate the devices behind it is my understanding >> >>>>> - Where does the vmd driver run (host or guest or both)? >>>> >>>> I believe the answer is both. >>> >>> If the VMD RCiEP isn't passed through to the guest, how can the vmd >>> driver do anything in the guest? >> >> The VMD device is passed through to the guest. It works just like bare metal >> in that the guest OS detects the VMD device and loads the vmd driver which >> then enumerates the devices into the guest > > I guess it's obvious that the VMD RCiEP must be passed through to the > guest because the whole point of > https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/ > is to do something in the guest. > > It does puzzle me that we have two copies of the vmd driver (one in > the host OS and another in the guest OS) that think they own the same > physical device. I'm not a virtualization guru but that sounds > potentially problematic. > >>> IIUC, the current situation is "regardless of what firmware said, in >>> the VMD domain we want AER disabled and hotplug enabled." >> >> We aren't saying we want AER disabled, we are just saying we want hotplug >> enabled. The observation is that in a hypervisor scenario AER is going to be >> disabled because the _OSC bits are all 0. > > 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") is saying > we want AER disabled for the VMD domain, isn't it? > I don't see it that way, but maybe I'm misunderstanding something. Here is the code from that commit (only the portion of interest): +/* + * Since VMD is an aperture to regular PCIe root ports, only allow it to + * control features that the OS is allowed to control on the physical PCI bus. + */ +static void vmd_copy_host_bridge_flags(struct pci_host_bridge *root_bridge, + struct pci_host_bridge *vmd_bridge) +{ + vmd_bridge->native_pcie_hotplug = root_bridge->native_pcie_hotplug; + vmd_bridge->native_shpc_hotplug = root_bridge->native_shpc_hotplug; + vmd_bridge->native_aer = root_bridge->native_aer; + vmd_bridge->native_pme = root_bridge->native_pme; + vmd_bridge->native_ltr = root_bridge->native_ltr; + vmd_bridge->native_dpc = root_bridge->native_dpc; +} + When I look at this, we are copying whatever is in the root_bridge to the vmd_bridge. In a bare metal scenario, this is correct and AER will be whatever the BIOS set up (IIRC on my bare metal system AER is enabled). In a hypervisor scenario the root_bridge bits will all be 0 which effectively disables AER and all the similar bits. Prior to this commit all the native_* bits were set to 1 because pci_init_host_bridge() sets them all to 1 so we were enabling AER et all despite what the OS may have wanted. With the commit we are setting the bits to whatever the BIOS has set them to. 
>>> It seems like the only clear option is to say "the vmd driver owns all >>> PCIe services in the VMD domain, the platform does not supply _OSC for >>> the VMD domain, the platform can't do anything with PCIe services in >>> the VMD domain, and the vmd driver needs to explicitly enable/disable >>> services as it needs." >> >> I actually looked at this as well :) I had an idea to set the _OSC bits to 0 >> when the vmd driver created the domain. The look at all the root ports >> underneath it and see if AER and PM were set. If any root port underneath >> VMD set AER or PM then I would set the _OSC bit for the bridge to 1. That >> way if any root port underneath VMD had enabled AER (as an example) then >> that feature would still work. I didn't test this in a hypervisor scenario >> though so not sure what I would see. > > _OSC negotiates ownership of features between platform firmware and > OSPM. The "native_pcie_hotplug" and similar bits mean that "IF a > device advertises the feature, the OS can use it." We clear those > native_* bits if the platform retains ownership via _OSC. > > If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for > the domain below it, why would we assume that BIOS retains ownership > of the features negotiated by _OSC? I think we have to assume the OS > owns them, which is what happened before 04b12ef163d1. > Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled) then all the root ports and devices underneath VMD are visible to the BIOS and OS so ACPI would run on all of them and the _OSC bits should be set correctly. There was a previous suggestion to print what VMD is owning. I think that is a good idea and I can add that to my patch. It would look like this (following the way OS support gets printed): VMD supports PCIeHotplug VMD supports SHPCHotplug VMD now owns PCIeHotplug VMD now owns SHPCHotplug We could probably do the same thing for the other bits (based on the native_* bits). That might make it clearer... Paul > Bjorn >
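[Editor's note: the logging Paul suggests, modeled on the existing "_OSC: OS now controls [...]" lines, could be emitted from the vmd driver with something like the sketch below. The wording and call site are open questions in the thread, and vmd_report_ownership() is a hypothetical helper.]

static void vmd_report_ownership(struct pci_dev *dev,
				 struct pci_host_bridge *vmd_bridge)
{
	if (vmd_bridge->native_pcie_hotplug)
		pci_info(dev, "VMD now owns PCIeHotplug\n");
	if (vmd_bridge->native_shpc_hotplug)
		pci_info(dev, "VMD now owns SHPCHotplug\n");
	if (vmd_bridge->native_aer)
		pci_info(dev, "VMD now owns AER\n");
	if (vmd_bridge->native_pme)
		pci_info(dev, "VMD now owns PME\n");
}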
On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote: > On 4/22/2024 1:27 PM, Bjorn Helgaas wrote: > > On Fri, Apr 19, 2024 at 03:18:19PM -0700, Paul M Stillwell Jr wrote: > > > On 4/19/2024 2:14 PM, Bjorn Helgaas wrote: > > > > On Thu, Apr 18, 2024 at 02:51:19PM -0700, Paul M Stillwell Jr wrote: > > > > > On 4/18/2024 11:26 AM, Bjorn Helgaas wrote: > > > > > > On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote: > > > > > > > Adding documentation for the Intel VMD driver and updating the index > > > > > > > file to include it. > > > > > > > > - Which devices are passed through to a virtual guest and enumerated > > > > > > there? > > > > > > > > > > All devices under VMD are passed to a virtual guest > > > > > > > > So the guest will see the VMD Root Ports, but not the VMD RCiEP > > > > itself? > > > > > > The guest will see the VMD device and then the vmd driver in the guest will > > > enumerate the devices behind it is my understanding > > > > > > > > > - Where does the vmd driver run (host or guest or both)? > > > > > > > > > > I believe the answer is both. > > > > > > > > If the VMD RCiEP isn't passed through to the guest, how can the vmd > > > > driver do anything in the guest? > > > > > > The VMD device is passed through to the guest. It works just like bare metal > > > in that the guest OS detects the VMD device and loads the vmd driver which > > > then enumerates the devices into the guest > > > > I guess it's obvious that the VMD RCiEP must be passed through to the > > guest because the whole point of > > https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/ > > is to do something in the guest. > > > > It does puzzle me that we have two copies of the vmd driver (one in > > the host OS and another in the guest OS) that think they own the same > > physical device. I'm not a virtualization guru but that sounds > > potentially problematic. > > > > > > IIUC, the current situation is "regardless of what firmware said, in > > > > the VMD domain we want AER disabled and hotplug enabled." > > > > > > We aren't saying we want AER disabled, we are just saying we want hotplug > > > enabled. The observation is that in a hypervisor scenario AER is going to be > > > disabled because the _OSC bits are all 0. > > > > 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") is saying > > we want AER disabled for the VMD domain, isn't it? > > I don't see it that way, but maybe I'm misunderstanding something. Here is > the code from that commit (only the portion of interest): > > +/* > + * Since VMD is an aperture to regular PCIe root ports, only allow it to > + * control features that the OS is allowed to control on the physical PCI > bus. > + */ > +static void vmd_copy_host_bridge_flags(struct pci_host_bridge *root_bridge, > + struct pci_host_bridge *vmd_bridge) > +{ > + vmd_bridge->native_pcie_hotplug = root_bridge->native_pcie_hotplug; > + vmd_bridge->native_shpc_hotplug = root_bridge->native_shpc_hotplug; > + vmd_bridge->native_aer = root_bridge->native_aer; > + vmd_bridge->native_pme = root_bridge->native_pme; > + vmd_bridge->native_ltr = root_bridge->native_ltr; > + vmd_bridge->native_dpc = root_bridge->native_dpc; > +} > + > > When I look at this, we are copying whatever is in the root_bridge to the > vmd_bridge. Right. We're copying the results of the _OSC that applies to the VMD RCiEP (not an _OSC that applies to the VMD domain). 
> In a bare metal scenario, this is correct and AER will be whatever > the BIOS set up (IIRC on my bare metal system AER is enabled). I think the first dmesg log at https://bugzilla.kernel.org/show_bug.cgi?id=215027 is from a host OS: [ 0.000000] DMI: Dell Inc. Dell G15 Special Edition 5521/, BIOS 0.4.3 10/20/2021 [ 0.408990] ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-e0]) [ 0.410076] acpi PNP0A08:00: _OSC: platform does not support [AER] [ 0.412207] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME PCIeCapability LTR] [ 1.367220] vmd 0000:00:0e.0: PCI host bridge to bus 10000:e0 [ 1.486704] pcieport 10000:e0:06.0: AER: enabled with IRQ 146 We evaluated _OSC for domain 0000 with OSC_QUERY_ENABLE ("Query Support Flag" in the specs) and learned the platform doesn't support AER, so we removed that from the features we request. We don't request control of AER in domain 0000, so native_aer will be 0. AER might be enabled by firmware, but the host Linux should not enable it in domain 0000. In domain 10000, the host Linux *does* enable AER because all new host bridges (including the new VMD domain 10000) assume native control to start with, and we update that based on _OSC results. There's no ACPI device describing the VMD host bridge, so there's no _OSC that applies to it, so Linux assumes it owns everything. > In a hypervisor scenario the root_bridge bits will all be 0 which > effectively disables AER and all the similar bits. I don't think https://bugzilla.kernel.org/show_bug.cgi?id=215027 includes a dmesg log from a hypervisor guest, so I don't know what happens there. The host_bridge->native_* bits always start as 1 (OS owns the feature) and they only get cleared if there's an _OSC that retains firmware ownership. > Prior to this commit all the native_* bits were set to 1 because > pci_init_host_bridge() sets them all to 1 so we were enabling AER et all > despite what the OS may have wanted. With the commit we are setting the bits > to whatever the BIOS has set them to. Prior to 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"), the host OS owned all features in the VMD domain. After 04b12ef163d1, it depended on whatever the _OSC for the VMD RCiEP said. On the Dell mentioned in that bugzilla, the domain 0000 _OSC said AER wasn't supported, and 04b12ef163d1 applied that to the VMD domain as well, so the host OS didn't enable AER in either domain. But other machines would have different answers. On a machine that supports AER and grants ownership to the OS, 04b12ef163d1 would enable AER in both domain 0000 and the VMD domain. I don't think we know the root cause of the Correctable Errors from that bugzilla. But it's likely that they would still occur on a machine that granted AER to the OS, and if we enable AER in the VMD domain, we would probably still have the flood. That's is why I think 04b12ef163d1 is problematic: it only disables AER in the VMD domain when AER happens to be disabled in domain 0000. > > > > It seems like the only clear option is to say "the vmd driver owns all > > > > PCIe services in the VMD domain, the platform does not supply _OSC for > > > > the VMD domain, the platform can't do anything with PCIe services in > > > > the VMD domain, and the vmd driver needs to explicitly enable/disable > > > > services as it needs." > > > > > > I actually looked at this as well :) I had an idea to set the _OSC bits to 0 > > > when the vmd driver created the domain. The look at all the root ports > > > underneath it and see if AER and PM were set. 
If any root port underneath > > > VMD set AER or PM then I would set the _OSC bit for the bridge to 1. That > > > way if any root port underneath VMD had enabled AER (as an example) then > > > that feature would still work. I didn't test this in a hypervisor scenario > > > though so not sure what I would see. > > > > _OSC negotiates ownership of features between platform firmware and > > OSPM. The "native_pcie_hotplug" and similar bits mean that "IF a > > device advertises the feature, the OS can use it." We clear those > > native_* bits if the platform retains ownership via _OSC. > > > > If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for > > the domain below it, why would we assume that BIOS retains ownership > > of the features negotiated by _OSC? I think we have to assume the OS > > owns them, which is what happened before 04b12ef163d1. > > Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled) > then all the root ports and devices underneath VMD are visible to the BIOS > and OS so ACPI would run on all of them and the _OSC bits should be set > correctly. Sorry, that was confusing. I think there are two pieces to enabling VMD: 1) There's the BIOS "VMD enable" switch. If set, the VMD device appears as an RCiEP and the devices behind it are invisible to the BIOS. If cleared, VMD doesn't exist; the VMD RCiEP is hidden and the devices behind it appear as normal Root Ports with devices below them. 2) When the BIOS "VMD enable" is set, the OS vmd driver configures the VMD RCiEP and enumerates things below the VMD host bridge. In this case, BIOS enables the VMD RCiEP, but it doesn't have a driver for it and it doesn't know how to enumerate the VMD Root Ports, so I don't think it makes sense for BIOS to own features for devices it doesn't know about. Bjorn
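The default-ownership behavior described here can be summarized in a few lines. This is an illustration of the effective logic only, not the kernel's actual code paths; illustrate_ownership_default() and its parameters are invented for the example, with AER standing in for any feature:

#include <linux/pci.h>

/*
 * Illustration of the effective ownership logic only, not the kernel's
 * actual code paths.  AER stands in for any PCIe feature.
 */
static void illustrate_ownership_default(struct pci_host_bridge *bridge,
					 bool hierarchy_has_osc,
					 bool osc_grants_aer)
{
	/* pci_init_host_bridge() starts every new bridge with native
	 * (OS) ownership of all features. */
	bridge->native_aer = 1;

	/* Only an _OSC belonging to this hierarchy's ACPI host bridge
	 * device can take a feature back. */
	if (hierarchy_has_osc && !osc_grants_aer)
		bridge->native_aer = 0;

	/* The VMD domain has no ACPI host bridge device and therefore
	 * no _OSC, so its bits stay set unless the vmd driver clears
	 * them itself. */
}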
On 4/22/2024 3:52 PM, Bjorn Helgaas wrote: > On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote: >> On 4/22/2024 1:27 PM, Bjorn Helgaas wrote: >>> On Fri, Apr 19, 2024 at 03:18:19PM -0700, Paul M Stillwell Jr wrote: >>>> On 4/19/2024 2:14 PM, Bjorn Helgaas wrote: >>>>> On Thu, Apr 18, 2024 at 02:51:19PM -0700, Paul M Stillwell Jr wrote: >>>>>> On 4/18/2024 11:26 AM, Bjorn Helgaas wrote: >>>>>>> On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote: >>>>>>>> Adding documentation for the Intel VMD driver and updating the index >>>>>>>> file to include it. >>> >>>>>>> - Which devices are passed through to a virtual guest and enumerated >>>>>>> there? >>>>>> >>>>>> All devices under VMD are passed to a virtual guest >>>>> >>>>> So the guest will see the VMD Root Ports, but not the VMD RCiEP >>>>> itself? >>>> >>>> The guest will see the VMD device and then the vmd driver in the guest will >>>> enumerate the devices behind it is my understanding >>>> >>>>>>> - Where does the vmd driver run (host or guest or both)? >>>>>> >>>>>> I believe the answer is both. >>>>> >>>>> If the VMD RCiEP isn't passed through to the guest, how can the vmd >>>>> driver do anything in the guest? >>>> >>>> The VMD device is passed through to the guest. It works just like bare metal >>>> in that the guest OS detects the VMD device and loads the vmd driver which >>>> then enumerates the devices into the guest >>> >>> I guess it's obvious that the VMD RCiEP must be passed through to the >>> guest because the whole point of >>> https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/ >>> is to do something in the guest. >>> >>> It does puzzle me that we have two copies of the vmd driver (one in >>> the host OS and another in the guest OS) that think they own the same >>> physical device. I'm not a virtualization guru but that sounds >>> potentially problematic. >>> >>>>> IIUC, the current situation is "regardless of what firmware said, in >>>>> the VMD domain we want AER disabled and hotplug enabled." >>>> >>>> We aren't saying we want AER disabled, we are just saying we want hotplug >>>> enabled. The observation is that in a hypervisor scenario AER is going to be >>>> disabled because the _OSC bits are all 0. >>> >>> 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") is saying >>> we want AER disabled for the VMD domain, isn't it? >> >> I don't see it that way, but maybe I'm misunderstanding something. Here is >> the code from that commit (only the portion of interest): >> >> +/* >> + * Since VMD is an aperture to regular PCIe root ports, only allow it to >> + * control features that the OS is allowed to control on the physical PCI >> bus. >> + */ >> +static void vmd_copy_host_bridge_flags(struct pci_host_bridge *root_bridge, >> + struct pci_host_bridge *vmd_bridge) >> +{ >> + vmd_bridge->native_pcie_hotplug = root_bridge->native_pcie_hotplug; >> + vmd_bridge->native_shpc_hotplug = root_bridge->native_shpc_hotplug; >> + vmd_bridge->native_aer = root_bridge->native_aer; >> + vmd_bridge->native_pme = root_bridge->native_pme; >> + vmd_bridge->native_ltr = root_bridge->native_ltr; >> + vmd_bridge->native_dpc = root_bridge->native_dpc; >> +} >> + >> >> When I look at this, we are copying whatever is in the root_bridge to the >> vmd_bridge. > > Right. We're copying the results of the _OSC that applies to the VMD > RCiEP (not an _OSC that applies to the VMD domain). 
> >> In a bare metal scenario, this is correct and AER will be whatever >> the BIOS set up (IIRC on my bare metal system AER is enabled). > > I think the first dmesg log at > https://bugzilla.kernel.org/show_bug.cgi?id=215027 is from a host OS: > > [ 0.000000] DMI: Dell Inc. Dell G15 Special Edition 5521/, BIOS 0.4.3 10/20/2021 > [ 0.408990] ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-e0]) > [ 0.410076] acpi PNP0A08:00: _OSC: platform does not support [AER] > [ 0.412207] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME PCIeCapability LTR] > [ 1.367220] vmd 0000:00:0e.0: PCI host bridge to bus 10000:e0 > [ 1.486704] pcieport 10000:e0:06.0: AER: enabled with IRQ 146 > > We evaluated _OSC for domain 0000 with OSC_QUERY_ENABLE ("Query > Support Flag" in the specs) and learned the platform doesn't support > AER, so we removed that from the features we request. We don't > request control of AER in domain 0000, so native_aer will be 0. AER > might be enabled by firmware, but the host Linux should not enable it > in domain 0000. > > In domain 10000, the host Linux *does* enable AER because all new host > bridges (including the new VMD domain 10000) assume native control to > start with, and we update that based on _OSC results. There's no ACPI > device describing the VMD host bridge, so there's no _OSC that applies > to it, so Linux assumes it owns everything. > >> In a hypervisor scenario the root_bridge bits will all be 0 which >> effectively disables AER and all the similar bits. > > I don't think https://bugzilla.kernel.org/show_bug.cgi?id=215027 > includes a dmesg log from a hypervisor guest, so I don't know what > happens there. > > The host_bridge->native_* bits always start as 1 (OS owns the feature) > and they only get cleared if there's an _OSC that retains firmware > ownership. > >> Prior to this commit all the native_* bits were set to 1 because >> pci_init_host_bridge() sets them all to 1 so we were enabling AER et all >> despite what the OS may have wanted. With the commit we are setting the bits >> to whatever the BIOS has set them to. > > Prior to 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"), > the host OS owned all features in the VMD domain. After 04b12ef163d1, > it depended on whatever the _OSC for the VMD RCiEP said. > > On the Dell mentioned in that bugzilla, the domain 0000 _OSC said AER > wasn't supported, and 04b12ef163d1 applied that to the VMD domain as > well, so the host OS didn't enable AER in either domain. > > But other machines would have different answers. On a machine that > supports AER and grants ownership to the OS, 04b12ef163d1 would enable > AER in both domain 0000 and the VMD domain. > > I don't think we know the root cause of the Correctable Errors from > that bugzilla. But it's likely that they would still occur on a > machine that granted AER to the OS, and if we enable AER in the VMD > domain, we would probably still have the flood. > > That's is why I think 04b12ef163d1 is problematic: it only disables > AER in the VMD domain when AER happens to be disabled in domain 0000. > That's a great explanation and very helpful to me, thanks for that! >>>>> It seems like the only clear option is to say "the vmd driver owns all >>>>> PCIe services in the VMD domain, the platform does not supply _OSC for >>>>> the VMD domain, the platform can't do anything with PCIe services in >>>>> the VMD domain, and the vmd driver needs to explicitly enable/disable >>>>> services as it needs." 
>>>> >>>> I actually looked at this as well :) I had an idea to set the _OSC bits to 0 >>>> when the vmd driver created the domain. The look at all the root ports >>>> underneath it and see if AER and PM were set. If any root port underneath >>>> VMD set AER or PM then I would set the _OSC bit for the bridge to 1. That >>>> way if any root port underneath VMD had enabled AER (as an example) then >>>> that feature would still work. I didn't test this in a hypervisor scenario >>>> though so not sure what I would see. >>> >>> _OSC negotiates ownership of features between platform firmware and >>> OSPM. The "native_pcie_hotplug" and similar bits mean that "IF a >>> device advertises the feature, the OS can use it." We clear those >>> native_* bits if the platform retains ownership via _OSC. >>> >>> If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for >>> the domain below it, why would we assume that BIOS retains ownership >>> of the features negotiated by _OSC? I think we have to assume the OS >>> owns them, which is what happened before 04b12ef163d1. >> >> Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled) >> then all the root ports and devices underneath VMD are visible to the BIOS >> and OS so ACPI would run on all of them and the _OSC bits should be set >> correctly. > > Sorry, that was confusing. I think there are two pieces to enabling > VMD: > > 1) There's the BIOS "VMD enable" switch. If set, the VMD device > appears as an RCiEP and the devices behind it are invisible to the > BIOS. If cleared, VMD doesn't exist; the VMD RCiEP is hidden and > the devices behind it appear as normal Root Ports with devices below > them. > > 2) When the BIOS "VMD enable" is set, the OS vmd driver configures > the VMD RCiEP and enumerates things below the VMD host bridge. > > In this case, BIOS enables the VMD RCiEP, but it doesn't have a > driver for it and it doesn't know how to enumerate the VMD Root > Ports, so I don't think it makes sense for BIOS to own features for > devices it doesn't know about. > That makes sense to me. It sounds like VMD should own all the features, I just don't know how the vmd driver would set the bits other than hotplug correctly... We know leaving them on is problematic, but I'm not sure what method to use to decide which of the other bits should be set or not. These other features are in a no mans land. I did consider running ACPI for the devices behind VMD, but the ACPI functions are almost all static to acpi.c (IIRC) and I didn't feel like I should rototill the ACPI code for this... I'm hesitant to set all the other feature bits to 0 as well because we may have customers who want AER on their devices behind VMD. I'll do some research on my side to see how we enable/handle other features. If it turns out that we never use AER or any of the other features would it be ok if we set everything other than hotplug to 0 and print some messages similar to the OS stating that VMD owns all these features and isn't supporting them? Paul
On Mon, Apr 22, 2024 at 04:39:19PM -0700, Paul M Stillwell Jr wrote: > On 4/22/2024 3:52 PM, Bjorn Helgaas wrote: > > On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote: > > > On 4/22/2024 1:27 PM, Bjorn Helgaas wrote: > ... > > > > _OSC negotiates ownership of features between platform firmware and > > > > OSPM. The "native_pcie_hotplug" and similar bits mean that "IF a > > > > device advertises the feature, the OS can use it." We clear those > > > > native_* bits if the platform retains ownership via _OSC. > > > > > > > > If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for > > > > the domain below it, why would we assume that BIOS retains ownership > > > > of the features negotiated by _OSC? I think we have to assume the OS > > > > owns them, which is what happened before 04b12ef163d1. > > > > > > Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled) > > > then all the root ports and devices underneath VMD are visible to the BIOS > > > and OS so ACPI would run on all of them and the _OSC bits should be set > > > correctly. > > > > Sorry, that was confusing. I think there are two pieces to enabling > > VMD: > > > > 1) There's the BIOS "VMD enable" switch. If set, the VMD device > > appears as an RCiEP and the devices behind it are invisible to the > > BIOS. If cleared, VMD doesn't exist; the VMD RCiEP is hidden and > > the devices behind it appear as normal Root Ports with devices below > > them. > > > > 2) When the BIOS "VMD enable" is set, the OS vmd driver configures > > the VMD RCiEP and enumerates things below the VMD host bridge. > > > > In this case, BIOS enables the VMD RCiEP, but it doesn't have a > > driver for it and it doesn't know how to enumerate the VMD Root > > Ports, so I don't think it makes sense for BIOS to own features for > > devices it doesn't know about. > > That makes sense to me. It sounds like VMD should own all the features, I > just don't know how the vmd driver would set the bits other than hotplug > correctly... We know leaving them on is problematic, but I'm not sure what > method to use to decide which of the other bits should be set or not. My starting assumption would be that we'd handle the VMD domain the same as other PCI domains: if a device advertises a feature, the kernel includes support for it, and the kernel owns it, we enable it. If a device advertises a feature but there's a hardware problem with it, the usual approach is to add a quirk to work around the problem. The Correctable Error issue addressed by 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"), looks like it might be in this category. Bjorn
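To make the quirk suggestion concrete, a hedged sketch of what such a workaround might look like: mask Correctable Error reporting on the one problematic endpoint instead of disabling AER for a whole domain. The device match and the choice of workaround are placeholders, not a claim about the real device ID or the actual root cause:

#include <linux/pci.h>

/*
 * Hypothetical quirk, for illustration only: stop one problematic
 * endpoint from sending Correctable Error messages instead of turning
 * AER off for a whole domain.  A real quirk would match the specific
 * device ID and would want the root cause understood first.
 */
static void quirk_mask_correctable_errors(struct pci_dev *dev)
{
	pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_CERE);
	pci_info(dev, "Correctable Error reporting masked by quirk\n");
}
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_SAMSUNG, PCI_ANY_ID,
			quirk_mask_correctable_errors);

The advantage of a targeted quirk is that Uncorrectable Errors, and AER on every other device, stay enabled.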
On 4/23/2024 2:26 PM, Bjorn Helgaas wrote: > On Mon, Apr 22, 2024 at 04:39:19PM -0700, Paul M Stillwell Jr wrote: >> On 4/22/2024 3:52 PM, Bjorn Helgaas wrote: >>> On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote: >>>> On 4/22/2024 1:27 PM, Bjorn Helgaas wrote: >> ... > >>>>> _OSC negotiates ownership of features between platform firmware and >>>>> OSPM. The "native_pcie_hotplug" and similar bits mean that "IF a >>>>> device advertises the feature, the OS can use it." We clear those >>>>> native_* bits if the platform retains ownership via _OSC. >>>>> >>>>> If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for >>>>> the domain below it, why would we assume that BIOS retains ownership >>>>> of the features negotiated by _OSC? I think we have to assume the OS >>>>> owns them, which is what happened before 04b12ef163d1. >>>> >>>> Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled) >>>> then all the root ports and devices underneath VMD are visible to the BIOS >>>> and OS so ACPI would run on all of them and the _OSC bits should be set >>>> correctly. >>> >>> Sorry, that was confusing. I think there are two pieces to enabling >>> VMD: >>> >>> 1) There's the BIOS "VMD enable" switch. If set, the VMD device >>> appears as an RCiEP and the devices behind it are invisible to the >>> BIOS. If cleared, VMD doesn't exist; the VMD RCiEP is hidden and >>> the devices behind it appear as normal Root Ports with devices below >>> them. >>> >>> 2) When the BIOS "VMD enable" is set, the OS vmd driver configures >>> the VMD RCiEP and enumerates things below the VMD host bridge. >>> >>> In this case, BIOS enables the VMD RCiEP, but it doesn't have a >>> driver for it and it doesn't know how to enumerate the VMD Root >>> Ports, so I don't think it makes sense for BIOS to own features for >>> devices it doesn't know about. >> >> That makes sense to me. It sounds like VMD should own all the features, I >> just don't know how the vmd driver would set the bits other than hotplug >> correctly... We know leaving them on is problematic, but I'm not sure what >> method to use to decide which of the other bits should be set or not. > > My starting assumption would be that we'd handle the VMD domain the > same as other PCI domains: if a device advertises a feature, the > kernel includes support for it, and the kernel owns it, we enable it. > I've been poking around and it seems like some things (I was looking for AER) are global to the platform. In my investigation (which is a small sample size of machines) it looks like there is a single entry in the BIOS to enable/disable AER so whatever is in one domain should be the same in all the domains. I couldn't find settings for LTR or the other bits, but I'm not sure what to look for in the BIOS for those. So it seems that there are 2 categories: platform global and device specific. AER and probably some of the others are global and can be copied from one domain to another, but things like hotplug are device specific and should be handled that way. Based on this it seems like my patch here https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/ is probably the correct thing to do, but I think adding some output into dmesg to indicate that VMD owns hotplug and has enabled it needs to be done. > If a device advertises a feature but there's a hardware problem with > it, the usual approach is to add a quirk to work around the problem. 
> The Correctable Error issue addressed by 04b12ef163d1 ("PCI: vmd: > Honor ACPI _OSC on PCIe features"), looks like it might be in this > category. > I don't think we had a hardware problem with these Samsung (IIRC) devices; the issue was that the vmd driver was incorrectly enabling AER because those native_* bits get set automatically. I think 04b12ef163d1 is doing the correct thing, but it is incorrectly copying the hotplug bits. Those are device-specific and should be handled by the device instead of based on another domain. Paul
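One way to express "copy the other bits, but don't copy the hotplug bits" is a small change to vmd_copy_host_bridge_flags(). This is only a rough sketch of that idea, not the patch referenced earlier in the thread:

#include <linux/pci.h>

/*
 * Rough sketch only: copy the domain 0000 _OSC results as 04b12ef163d1
 * does, but keep hotplug native in the VMD domain.
 */
static void vmd_copy_host_bridge_flags(struct pci_host_bridge *root_bridge,
				       struct pci_host_bridge *vmd_bridge)
{
	vmd_bridge->native_pcie_hotplug = 1;	/* VMD handles hotplug itself */
	vmd_bridge->native_shpc_hotplug = 1;
	vmd_bridge->native_aer = root_bridge->native_aer;
	vmd_bridge->native_pme = root_bridge->native_pme;
	vmd_bridge->native_ltr = root_bridge->native_ltr;
	vmd_bridge->native_dpc = root_bridge->native_dpc;
}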
On Tue, Apr 23, 2024 at 04:10:37PM -0700, Paul M Stillwell Jr wrote: > On 4/23/2024 2:26 PM, Bjorn Helgaas wrote: > > On Mon, Apr 22, 2024 at 04:39:19PM -0700, Paul M Stillwell Jr wrote: > > > On 4/22/2024 3:52 PM, Bjorn Helgaas wrote: > > > > On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote: > > > > > On 4/22/2024 1:27 PM, Bjorn Helgaas wrote: > > > ... > > > > > > > > _OSC negotiates ownership of features between platform firmware and > > > > > > OSPM. The "native_pcie_hotplug" and similar bits mean that "IF a > > > > > > device advertises the feature, the OS can use it." We clear those > > > > > > native_* bits if the platform retains ownership via _OSC. > > > > > > > > > > > > If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for > > > > > > the domain below it, why would we assume that BIOS retains ownership > > > > > > of the features negotiated by _OSC? I think we have to assume the OS > > > > > > owns them, which is what happened before 04b12ef163d1. > > > > > > > > > > Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled) > > > > > then all the root ports and devices underneath VMD are visible to the BIOS > > > > > and OS so ACPI would run on all of them and the _OSC bits should be set > > > > > correctly. > > > > > > > > Sorry, that was confusing. I think there are two pieces to enabling > > > > VMD: > > > > > > > > 1) There's the BIOS "VMD enable" switch. If set, the VMD device > > > > appears as an RCiEP and the devices behind it are invisible to the > > > > BIOS. If cleared, VMD doesn't exist; the VMD RCiEP is hidden and > > > > the devices behind it appear as normal Root Ports with devices below > > > > them. > > > > > > > > 2) When the BIOS "VMD enable" is set, the OS vmd driver configures > > > > the VMD RCiEP and enumerates things below the VMD host bridge. > > > > > > > > In this case, BIOS enables the VMD RCiEP, but it doesn't have a > > > > driver for it and it doesn't know how to enumerate the VMD Root > > > > Ports, so I don't think it makes sense for BIOS to own features for > > > > devices it doesn't know about. > > > > > > That makes sense to me. It sounds like VMD should own all the features, I > > > just don't know how the vmd driver would set the bits other than hotplug > > > correctly... We know leaving them on is problematic, but I'm not sure what > > > method to use to decide which of the other bits should be set or not. > > > > My starting assumption would be that we'd handle the VMD domain the > > same as other PCI domains: if a device advertises a feature, the > > kernel includes support for it, and the kernel owns it, we enable it. > > I've been poking around and it seems like some things (I was looking for > AER) are global to the platform. In my investigation (which is a small > sample size of machines) it looks like there is a single entry in the BIOS > to enable/disable AER so whatever is in one domain should be the same in all > the domains. I couldn't find settings for LTR or the other bits, but I'm not > sure what to look for in the BIOS for those. > > So it seems that there are 2 categories: platform global and device > specific. AER and probably some of the others are global and can be copied > from one domain to another, but things like hotplug are device specific and > should be handled that way. 
_OSC is the only mechanism for negotiating ownership of these features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC only applies to the hierarchy originated by the PNP0A03/PNP0A08 host bridge that contains the _OSC method. AFAICT, there's no global/device-specific thing here. The BIOS may have a single user-visible setting, and it may apply that setting to all host bridge _OSC methods, but that's just part of the BIOS UI, not part of the firmware/OS interface. > > If a device advertises a feature but there's a hardware problem with > > it, the usual approach is to add a quirk to work around the problem. > > The Correctable Error issue addressed by 04b12ef163d1 ("PCI: vmd: > > Honor ACPI _OSC on PCIe features"), looks like it might be in this > > category. > > I don't think we had a hardware problem with these Samsung (IIRC) devices; > the issue was that the vmd driver were incorrectly enabling AER because > those native_* bits get set automatically. Where do all the Correctable Errors come from? IMO they're either caused by some hardware issue or by a software error in programming AER. It's possible we forget to clear the errors and we just see the same error reported over and over. But I don't think the answer is to copy the AER ownership from a different domain. Bjorn
On 4/23/2024 5:47 PM, Bjorn Helgaas wrote: > On Tue, Apr 23, 2024 at 04:10:37PM -0700, Paul M Stillwell Jr wrote: >> On 4/23/2024 2:26 PM, Bjorn Helgaas wrote: >>> On Mon, Apr 22, 2024 at 04:39:19PM -0700, Paul M Stillwell Jr wrote: >>>> On 4/22/2024 3:52 PM, Bjorn Helgaas wrote: >>>>> On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote: >>>>>> On 4/22/2024 1:27 PM, Bjorn Helgaas wrote: >>>> ... >>> >>>>>>> _OSC negotiates ownership of features between platform firmware and >>>>>>> OSPM. The "native_pcie_hotplug" and similar bits mean that "IF a >>>>>>> device advertises the feature, the OS can use it." We clear those >>>>>>> native_* bits if the platform retains ownership via _OSC. >>>>>>> >>>>>>> If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for >>>>>>> the domain below it, why would we assume that BIOS retains ownership >>>>>>> of the features negotiated by _OSC? I think we have to assume the OS >>>>>>> owns them, which is what happened before 04b12ef163d1. >>>>>> >>>>>> Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled) >>>>>> then all the root ports and devices underneath VMD are visible to the BIOS >>>>>> and OS so ACPI would run on all of them and the _OSC bits should be set >>>>>> correctly. >>>>> >>>>> Sorry, that was confusing. I think there are two pieces to enabling >>>>> VMD: >>>>> >>>>> 1) There's the BIOS "VMD enable" switch. If set, the VMD device >>>>> appears as an RCiEP and the devices behind it are invisible to the >>>>> BIOS. If cleared, VMD doesn't exist; the VMD RCiEP is hidden and >>>>> the devices behind it appear as normal Root Ports with devices below >>>>> them. >>>>> >>>>> 2) When the BIOS "VMD enable" is set, the OS vmd driver configures >>>>> the VMD RCiEP and enumerates things below the VMD host bridge. >>>>> >>>>> In this case, BIOS enables the VMD RCiEP, but it doesn't have a >>>>> driver for it and it doesn't know how to enumerate the VMD Root >>>>> Ports, so I don't think it makes sense for BIOS to own features for >>>>> devices it doesn't know about. >>>> >>>> That makes sense to me. It sounds like VMD should own all the features, I >>>> just don't know how the vmd driver would set the bits other than hotplug >>>> correctly... We know leaving them on is problematic, but I'm not sure what >>>> method to use to decide which of the other bits should be set or not. >>> >>> My starting assumption would be that we'd handle the VMD domain the >>> same as other PCI domains: if a device advertises a feature, the >>> kernel includes support for it, and the kernel owns it, we enable it. >> >> I've been poking around and it seems like some things (I was looking for >> AER) are global to the platform. In my investigation (which is a small >> sample size of machines) it looks like there is a single entry in the BIOS >> to enable/disable AER so whatever is in one domain should be the same in all >> the domains. I couldn't find settings for LTR or the other bits, but I'm not >> sure what to look for in the BIOS for those. >> >> So it seems that there are 2 categories: platform global and device >> specific. AER and probably some of the others are global and can be copied >> from one domain to another, but things like hotplug are device specific and >> should be handled that way. 
> > _OSC is the only mechanism for negotiating ownership of these > features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC > only applies to the hierarchy originated by the PNP0A03/PNP0A08 host > bridge that contains the _OSC method. AFAICT, there's no > global/device-specific thing here. > > The BIOS may have a single user-visible setting, and it may apply that > setting to all host bridge _OSC methods, but that's just part of the > BIOS UI, not part of the firmware/OS interface. > Fair, but we are still left with the question of how to set the _OSC bits for the VMD bridge. This would normally happen using ACPI AFAICT and we don't have that for the devices behind VMD. >>> If a device advertises a feature but there's a hardware problem with >>> it, the usual approach is to add a quirk to work around the problem. >>> The Correctable Error issue addressed by 04b12ef163d1 ("PCI: vmd: >>> Honor ACPI _OSC on PCIe features"), looks like it might be in this >>> category. >> >> I don't think we had a hardware problem with these Samsung (IIRC) devices; >> the issue was that the vmd driver were incorrectly enabling AER because >> those native_* bits get set automatically. > > Where do all the Correctable Errors come from? IMO they're either > caused by some hardware issue or by a software error in programming > AER. It's possible we forget to clear the errors and we just see the > same error reported over and over. But I don't think the answer is > to copy the AER ownership from a different domain. > I looked back at the original bugzilla and I feel like the AER errors are a red herring. AER was *supposed* to be disabled, but was incorrectly enabled by VMD so we are seeing errors. Yes, they may be real errors, but my point is that the user had disabled AER so they didn't care if there were errors or not (i.e. if AER had been correctly disabled by VMD then the user would not have AER errors in the dmesg output). Kai-Heng even says this in one of his responses here https://lore.kernel.org/linux-pci/CAAd53p6hATV8TOcJ9Qi2rMwVi=y_9+tQu6KhDkAm6Y8=cQ_xoA@mail.gmail.com/. A quote from his reply "To be more precise, AER is disabled by the platform vendor in BIOS to paper over the issue." Paul
On Wed, Apr 24, 2024 at 02:29:16PM -0700, Paul M Stillwell Jr wrote: > On 4/23/2024 5:47 PM, Bjorn Helgaas wrote: > > On Tue, Apr 23, 2024 at 04:10:37PM -0700, Paul M Stillwell Jr wrote: > > > On 4/23/2024 2:26 PM, Bjorn Helgaas wrote: > > > > On Mon, Apr 22, 2024 at 04:39:19PM -0700, Paul M Stillwell Jr wrote: > > > > > On 4/22/2024 3:52 PM, Bjorn Helgaas wrote: > > > > > > On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote: > > > > > > > On 4/22/2024 1:27 PM, Bjorn Helgaas wrote: > > > > > ... > > > > > > > > > > > > _OSC negotiates ownership of features between platform firmware and > > > > > > > > OSPM. The "native_pcie_hotplug" and similar bits mean that "IF a > > > > > > > > device advertises the feature, the OS can use it." We clear those > > > > > > > > native_* bits if the platform retains ownership via _OSC. > > > > > > > > > > > > > > > > If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for > > > > > > > > the domain below it, why would we assume that BIOS retains ownership > > > > > > > > of the features negotiated by _OSC? I think we have to assume the OS > > > > > > > > owns them, which is what happened before 04b12ef163d1. > > > > > > > > > > > > > > Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled) > > > > > > > then all the root ports and devices underneath VMD are visible to the BIOS > > > > > > > and OS so ACPI would run on all of them and the _OSC bits should be set > > > > > > > correctly. > > > > > > > > > > > > Sorry, that was confusing. I think there are two pieces to enabling > > > > > > VMD: > > > > > > > > > > > > 1) There's the BIOS "VMD enable" switch. If set, the VMD device > > > > > > appears as an RCiEP and the devices behind it are invisible to the > > > > > > BIOS. If cleared, VMD doesn't exist; the VMD RCiEP is hidden and > > > > > > the devices behind it appear as normal Root Ports with devices below > > > > > > them. > > > > > > > > > > > > 2) When the BIOS "VMD enable" is set, the OS vmd driver configures > > > > > > the VMD RCiEP and enumerates things below the VMD host bridge. > > > > > > > > > > > > In this case, BIOS enables the VMD RCiEP, but it doesn't have a > > > > > > driver for it and it doesn't know how to enumerate the VMD Root > > > > > > Ports, so I don't think it makes sense for BIOS to own features for > > > > > > devices it doesn't know about. > > > > > > > > > > That makes sense to me. It sounds like VMD should own all the features, I > > > > > just don't know how the vmd driver would set the bits other than hotplug > > > > > correctly... We know leaving them on is problematic, but I'm not sure what > > > > > method to use to decide which of the other bits should be set or not. > > > > > > > > My starting assumption would be that we'd handle the VMD domain the > > > > same as other PCI domains: if a device advertises a feature, the > > > > kernel includes support for it, and the kernel owns it, we enable it. > > > > > > I've been poking around and it seems like some things (I was looking for > > > AER) are global to the platform. In my investigation (which is a small > > > sample size of machines) it looks like there is a single entry in the BIOS > > > to enable/disable AER so whatever is in one domain should be the same in all > > > the domains. I couldn't find settings for LTR or the other bits, but I'm not > > > sure what to look for in the BIOS for those. > > > > > > So it seems that there are 2 categories: platform global and device > > > specific. 
AER and probably some of the others are global and can be copied > > > from one domain to another, but things like hotplug are device specific and > > > should be handled that way. > > > > _OSC is the only mechanism for negotiating ownership of these > > features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC > > only applies to the hierarchy originated by the PNP0A03/PNP0A08 host > > bridge that contains the _OSC method. AFAICT, there's no > > global/device-specific thing here. > > > > The BIOS may have a single user-visible setting, and it may apply that > > setting to all host bridge _OSC methods, but that's just part of the > > BIOS UI, not part of the firmware/OS interface. > > Fair, but we are still left with the question of how to set the _OSC bits > for the VMD bridge. This would normally happen using ACPI AFAICT and we > don't have that for the devices behind VMD. In the absence of a mechanism for negotiating ownership, e.g., an ACPI host bridge device for the hierarchy, the OS owns all the PCIe features. > > > > If a device advertises a feature but there's a hardware problem with > > > > it, the usual approach is to add a quirk to work around the problem. > > > > The Correctable Error issue addressed by 04b12ef163d1 ("PCI: vmd: > > > > Honor ACPI _OSC on PCIe features"), looks like it might be in this > > > > category. > > > > > > I don't think we had a hardware problem with these Samsung (IIRC) devices; > > > the issue was that the vmd driver were incorrectly enabling AER because > > > those native_* bits get set automatically. > > > > Where do all the Correctable Errors come from? IMO they're either > > caused by some hardware issue or by a software error in programming > > AER. It's possible we forget to clear the errors and we just see the > > same error reported over and over. But I don't think the answer is > > to copy the AER ownership from a different domain. > > I looked back at the original bugzilla and I feel like the AER errors are a > red herring. AER was *supposed* to be disabled, but was incorrectly enabled > by VMD so we are seeing errors. Yes, they may be real errors, but my point > is that the user had disabled AER so they didn't care if there were errors > or not (i.e. if AER had been correctly disabled by VMD then the user would > not have AER errors in the dmesg output). 04b12ef163d1 basically asserted "the platform knows about a hardware issue between VMD and this NVMe and avoided it by disabling AER in domain 0000; therefore we should also disable AER in the VMD domain." Your patch at https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/ says "vmd users *always* want hotplug enabled." What happens when a platform knows about a hotplug hardware issue and avoids it by disabling hotplug in domain 0000? I think 04b12ef163d1 would avoid it in the VMD domain, but your patch would expose the hotplug issue. > Kai-Heng even says this in one of his responses here https://lore.kernel.org/linux-pci/CAAd53p6hATV8TOcJ9Qi2rMwVi=y_9+tQu6KhDkAm6Y8=cQ_xoA@mail.gmail.com/. > A quote from his reply "To be more precise, AER is disabled by the platform > vendor in BIOS to paper over the issue." I suspect there's a real hardware issue between the VMD and the Samsung NVMe that causes these Correctable Errors. 
I think disabling AER via _OSC is a bad way to work around it because: - it disables AER for *everything* in domain 0000, when other devices probably work perfectly fine, - it assumes the OS vmd driver applies domain 0000 _OSC to the VMD domain, which isn't required by any spec, and - it disables *all* of AER, including Uncorrectable Errors, and I'd like to know about those, even if we have to mask the Correctable Errors. In https://bugzilla.kernel.org/show_bug.cgi?id=215027#c5, Kai-Heng did not see the Correctable Error flood when VMD was turned off and concluded that the issue is VMD specific. But I think it's likely that the errors still occur even when VMD is turned off, and we just don't see the flood because AER is disabled. I suggested an experiment with "pcie_ports=native", which should enable AER even if _OSC doesn't grant ownership: https://bugzilla.kernel.org/show_bug.cgi?id=215027#c9 Bjorn
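For reference, the suggested experiment only requires adding the parameter to the kernel command line; pcie_ports=native tells Linux to use the native PCIe services even where _OSC did not grant control. One way to set it, assuming a GRUB-based distro (file locations and the regeneration step vary):

# /etc/default/grub, then regenerate the GRUB config and reboot
GRUB_CMDLINE_LINUX="... pcie_ports=native"

If the Correctable Errors still appear with VMD turned off and this parameter set, that would support the theory that the errors are independent of VMD.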
On 4/25/2024 10:24 AM, Bjorn Helgaas wrote: > On Wed, Apr 24, 2024 at 02:29:16PM -0700, Paul M Stillwell Jr wrote: >> On 4/23/2024 5:47 PM, Bjorn Helgaas wrote: >>> On Tue, Apr 23, 2024 at 04:10:37PM -0700, Paul M Stillwell Jr wrote: >>>> On 4/23/2024 2:26 PM, Bjorn Helgaas wrote: >>>>> On Mon, Apr 22, 2024 at 04:39:19PM -0700, Paul M Stillwell Jr wrote: >>>>>> On 4/22/2024 3:52 PM, Bjorn Helgaas wrote: >>>>>>> On Mon, Apr 22, 2024 at 02:39:16PM -0700, Paul M Stillwell Jr wrote: >>>>>>>> On 4/22/2024 1:27 PM, Bjorn Helgaas wrote: >>>>>> ... >>>>> >>>>>>>>> _OSC negotiates ownership of features between platform firmware and >>>>>>>>> OSPM. The "native_pcie_hotplug" and similar bits mean that "IF a >>>>>>>>> device advertises the feature, the OS can use it." We clear those >>>>>>>>> native_* bits if the platform retains ownership via _OSC. >>>>>>>>> >>>>>>>>> If BIOS doesn't enable the VMD host bridge and doesn't supply _OSC for >>>>>>>>> the domain below it, why would we assume that BIOS retains ownership >>>>>>>>> of the features negotiated by _OSC? I think we have to assume the OS >>>>>>>>> owns them, which is what happened before 04b12ef163d1. >>>>>>>> >>>>>>>> Sorry, this confuses me :) If BIOS doesn't enable VMD (i.e. VMD is disabled) >>>>>>>> then all the root ports and devices underneath VMD are visible to the BIOS >>>>>>>> and OS so ACPI would run on all of them and the _OSC bits should be set >>>>>>>> correctly. >>>>>>> >>>>>>> Sorry, that was confusing. I think there are two pieces to enabling >>>>>>> VMD: >>>>>>> >>>>>>> 1) There's the BIOS "VMD enable" switch. If set, the VMD device >>>>>>> appears as an RCiEP and the devices behind it are invisible to the >>>>>>> BIOS. If cleared, VMD doesn't exist; the VMD RCiEP is hidden and >>>>>>> the devices behind it appear as normal Root Ports with devices below >>>>>>> them. >>>>>>> >>>>>>> 2) When the BIOS "VMD enable" is set, the OS vmd driver configures >>>>>>> the VMD RCiEP and enumerates things below the VMD host bridge. >>>>>>> >>>>>>> In this case, BIOS enables the VMD RCiEP, but it doesn't have a >>>>>>> driver for it and it doesn't know how to enumerate the VMD Root >>>>>>> Ports, so I don't think it makes sense for BIOS to own features for >>>>>>> devices it doesn't know about. >>>>>> >>>>>> That makes sense to me. It sounds like VMD should own all the features, I >>>>>> just don't know how the vmd driver would set the bits other than hotplug >>>>>> correctly... We know leaving them on is problematic, but I'm not sure what >>>>>> method to use to decide which of the other bits should be set or not. >>>>> >>>>> My starting assumption would be that we'd handle the VMD domain the >>>>> same as other PCI domains: if a device advertises a feature, the >>>>> kernel includes support for it, and the kernel owns it, we enable it. >>>> >>>> I've been poking around and it seems like some things (I was looking for >>>> AER) are global to the platform. In my investigation (which is a small >>>> sample size of machines) it looks like there is a single entry in the BIOS >>>> to enable/disable AER so whatever is in one domain should be the same in all >>>> the domains. I couldn't find settings for LTR or the other bits, but I'm not >>>> sure what to look for in the BIOS for those. >>>> >>>> So it seems that there are 2 categories: platform global and device >>>> specific. 
AER and probably some of the others are global and can be copied >>>> from one domain to another, but things like hotplug are device specific and >>>> should be handled that way. >>> >>> _OSC is the only mechanism for negotiating ownership of these >>> features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC >>> only applies to the hierarchy originated by the PNP0A03/PNP0A08 host >>> bridge that contains the _OSC method. AFAICT, there's no >>> global/device-specific thing here. >>> >>> The BIOS may have a single user-visible setting, and it may apply that >>> setting to all host bridge _OSC methods, but that's just part of the >>> BIOS UI, not part of the firmware/OS interface. >> >> Fair, but we are still left with the question of how to set the _OSC bits >> for the VMD bridge. This would normally happen using ACPI AFAICT and we >> don't have that for the devices behind VMD. > > In the absence of a mechanism for negotiating ownership, e.g., an ACPI > host bridge device for the hierarchy, the OS owns all the PCIe > features. > I'm new to this space so I don't know what it means for the OS to own the features. In other words, how would the vmd driver figure out what features are supported? >>>>> If a device advertises a feature but there's a hardware problem with >>>>> it, the usual approach is to add a quirk to work around the problem. >>>>> The Correctable Error issue addressed by 04b12ef163d1 ("PCI: vmd: >>>>> Honor ACPI _OSC on PCIe features"), looks like it might be in this >>>>> category. >>>> >>>> I don't think we had a hardware problem with these Samsung (IIRC) devices; >>>> the issue was that the vmd driver were incorrectly enabling AER because >>>> those native_* bits get set automatically. >>> >>> Where do all the Correctable Errors come from? IMO they're either >>> caused by some hardware issue or by a software error in programming >>> AER. It's possible we forget to clear the errors and we just see the >>> same error reported over and over. But I don't think the answer is >>> to copy the AER ownership from a different domain. >> >> I looked back at the original bugzilla and I feel like the AER errors are a >> red herring. AER was *supposed* to be disabled, but was incorrectly enabled >> by VMD so we are seeing errors. Yes, they may be real errors, but my point >> is that the user had disabled AER so they didn't care if there were errors >> or not (i.e. if AER had been correctly disabled by VMD then the user would >> not have AER errors in the dmesg output). > > 04b12ef163d1 basically asserted "the platform knows about a hardware > issue between VMD and this NVMe and avoided it by disabling AER in > domain 0000; therefore we should also disable AER in the VMD domain." > > Your patch at > https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/ > says "vmd users *always* want hotplug enabled." What happens when a > platform knows about a hotplug hardware issue and avoids it by > disabling hotplug in domain 0000? > I was thinking about this also and I could look at all the root ports underneath vmd and see if hotplug is set for any of them. If it is then we could set the native_*hotplug bits based on that. > I think 04b12ef163d1 would avoid it in the VMD domain, but your patch > would expose the hotplug issue. > >> Kai-Heng even says this in one of his responses here https://lore.kernel.org/linux-pci/CAAd53p6hATV8TOcJ9Qi2rMwVi=y_9+tQu6KhDkAm6Y8=cQ_xoA@mail.gmail.com/. 
>> A quote from his reply "To be more precise, AER is disabled by the platform >> vendor in BIOS to paper over the issue." > > I suspect there's a real hardware issue between the VMD and the > Samsung NVMe that causes these Correctable Errors. I think disabling > AER via _OSC is a bad way to work around it because: > > - it disables AER for *everything* in domain 0000, when other > devices probably work perfectly fine, > > - it assumes the OS vmd driver applies domain 0000 _OSC to the VMD > domain, which isn't required by any spec, and > > - it disables *all* of AER, including Uncorrectable Errors, and I'd > like to know about those, even if we have to mask the Correctable > Errors. > > In https://bugzilla.kernel.org/show_bug.cgi?id=215027#c5, Kai-Heng did > not see the Correctable Error flood when VMD was turned off and > concluded that the issue is VMD specific. > > But I think it's likely that the errors still occur even when VMD is > turned off, and we just don't see the flood because AER is disabled. > I suggested an experiment with "pcie_ports=native", which should > enable AER even if _OSC doesn't grant ownership: > https://bugzilla.kernel.org/show_bug.cgi?id=215027#c9 > I don't have a way to test his configuration so that would be something he would need to do. > Bjorn >
On Thu, Apr 25, 2024 at 02:43:07PM -0700, Paul M Stillwell Jr wrote: > On 4/25/2024 10:24 AM, Bjorn Helgaas wrote: > > On Wed, Apr 24, 2024 at 02:29:16PM -0700, Paul M Stillwell Jr wrote: > > > On 4/23/2024 5:47 PM, Bjorn Helgaas wrote: > ... > > > > _OSC is the only mechanism for negotiating ownership of these > > > > features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC > > > > only applies to the hierarchy originated by the PNP0A03/PNP0A08 host > > > > bridge that contains the _OSC method. AFAICT, there's no > > > > global/device-specific thing here. > > > > > > > > The BIOS may have a single user-visible setting, and it may apply that > > > > setting to all host bridge _OSC methods, but that's just part of the > > > > BIOS UI, not part of the firmware/OS interface. > > > > > > Fair, but we are still left with the question of how to set the _OSC bits > > > for the VMD bridge. This would normally happen using ACPI AFAICT and we > > > don't have that for the devices behind VMD. > > > > In the absence of a mechanism for negotiating ownership, e.g., an ACPI > > host bridge device for the hierarchy, the OS owns all the PCIe > > features. > > I'm new to this space so I don't know what it means for the OS to > own the features. In other words, how would the vmd driver figure > out what features are supported? There are three things that go into this: - Does the OS support the feature, e.g., is CONFIG_PCIEAER enabled? - Has the platform granted permission to the OS to use the feature, either explicitly via _OSC or implicitly because there's no mechanism to negotiate ownership? - Does the device advertise the feature, e.g., does it have an AER Capability? If all three are true, Linux enables the feature. I think vmd has implicit ownership of all features because there is no ACPI host bridge device for the VMD domain, and (IMO) that means there is no way to negotiate ownership in that domain. So the VMD domain starts with all the native_* bits set, meaning Linux owns the features. If the vmd driver doesn't want some feature to be used, it could clear the native_* bit for it. I don't think vmd should unilaterally claim ownership of features by *setting* native_* bits because that will lead to conflicts with firmware. > > 04b12ef163d1 basically asserted "the platform knows about a hardware > > issue between VMD and this NVMe and avoided it by disabling AER in > > domain 0000; therefore we should also disable AER in the VMD domain." > > > > Your patch at > > https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/ > > says "vmd users *always* want hotplug enabled." What happens when a > > platform knows about a hotplug hardware issue and avoids it by > > disabling hotplug in domain 0000? > > I was thinking about this also and I could look at all the root ports > underneath vmd and see if hotplug is set for any of them. If it is then we > could set the native_*hotplug bits based on that. No. "Hotplug is set" means the device advertises the feature via config space, in this case, it has the Hot-Plug Capable bit in the PCIe Slot Capabilities set. That just means the device has hardware support for the feature. On ACPI systems, the OS can only use pciehp on the device if firmware has granted ownership to the OS via _OSC because the firmware may want to use the feature itself. If both OS and firmware think they own the feature, they will conflict with each other. 
If firmware retains ownership of hotplug, it can field hotplug events itself and notify the OS via the ACPI hotplug mechanism. The acpiphp driver handles those events for PCI. Firmware may do this if it wants to work around hardware defects it knows about, or if it wants to do OS-independent logging (more applicable for AER), or if it wants to intercept hotplug events to do some kind of initialization, etc. Bjorn
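The three conditions above are compact enough to write down. A sketch using AER as the example; os_may_use_aer() is a made-up helper name, and in the real kernel these checks are spread across the PCI core and the AER service driver rather than one function:

#include <linux/pci.h>

/* Sketch of the three conditions, using AER as the example feature. */
static bool os_may_use_aer(struct pci_dev *dev)
{
	struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);

	return IS_ENABLED(CONFIG_PCIEAER) &&			  /* OS support */
	       bridge->native_aer &&				  /* OS ownership */
	       pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);  /* capability */
}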
On 4/25/2024 3:32 PM, Bjorn Helgaas wrote: > On Thu, Apr 25, 2024 at 02:43:07PM -0700, Paul M Stillwell Jr wrote: >> On 4/25/2024 10:24 AM, Bjorn Helgaas wrote: >>> On Wed, Apr 24, 2024 at 02:29:16PM -0700, Paul M Stillwell Jr wrote: >>>> On 4/23/2024 5:47 PM, Bjorn Helgaas wrote: >> ... > >>>>> _OSC is the only mechanism for negotiating ownership of these >>>>> features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC >>>>> only applies to the hierarchy originated by the PNP0A03/PNP0A08 host >>>>> bridge that contains the _OSC method. AFAICT, there's no >>>>> global/device-specific thing here. >>>>> >>>>> The BIOS may have a single user-visible setting, and it may apply that >>>>> setting to all host bridge _OSC methods, but that's just part of the >>>>> BIOS UI, not part of the firmware/OS interface. >>>> >>>> Fair, but we are still left with the question of how to set the _OSC bits >>>> for the VMD bridge. This would normally happen using ACPI AFAICT and we >>>> don't have that for the devices behind VMD. >>> >>> In the absence of a mechanism for negotiating ownership, e.g., an ACPI >>> host bridge device for the hierarchy, the OS owns all the PCIe >>> features. >> >> I'm new to this space so I don't know what it means for the OS to >> own the features. In other words, how would the vmd driver figure >> out what features are supported? > > There are three things that go into this: > > - Does the OS support the feature, e.g., is CONFIG_PCIEAER enabled? > > - Has the platform granted permission to the OS to use the feature, > either explicitly via _OSC or implicitly because there's no > mechanism to negotiate ownership? > > - Does the device advertise the feature, e.g., does it have an AER > Capability? > > If all three are true, Linux enables the feature. > > I think vmd has implicit ownership of all features because there is no > ACPI host bridge device for the VMD domain, and (IMO) that means there > is no way to negotiate ownership in that domain. > > So the VMD domain starts with all the native_* bits set, meaning Linux > owns the features. If the vmd driver doesn't want some feature to be > used, it could clear the native_* bit for it. > > I don't think vmd should unilaterally claim ownership of features by > *setting* native_* bits because that will lead to conflicts with > firmware. > This is the crux of the problem IMO. I'm happy to set the native_* bits using some knowledge about what the firmware wants, but we don't have a mechanism to do it AFAICT. I think that's what commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") was trying to do: use a domain that ACPI had run on and negotiated features and apply them to the vmd domain. Using the 3 criteria you described above, could we do this for the hotplug feature for VMD: 1. Use IS_ENABLED(CONFIG_<whatever hotplug setting we need>) to check to see if the hotplug feature is enabled 2. We know that for VMD we want hotplug enabled so that is the implicit permission 3. Look at the root ports below VMD and see if hotplug capability is set If 1 & 3 are true, then we set the native_* bits for hotplug (we can look for surprise hotplug as well in the capability to set the native_shpc_hotplug bit correctly) to 1. This feels like it would solve the problem of "what if there is a hotplug issue on the platform" because the user would have disabled hotplug for VMD and the root ports below it would have the capability turned off.
In theory we could do this same thing for all the features, but we don't know what the firmware wants for features other than hotplug (because we implicitly know that vmd wants hotplug). I feel like 04b12ef163d1 is a good compromise for the other features, but I hear your issues with it. I'm happy to "do the right thing" for the other features, I just can't figure out what that thing is :) Paul >>> 04b12ef163d1 basically asserted "the platform knows about a hardware >>> issue between VMD and this NVMe and avoided it by disabling AER in >>> domain 0000; therefore we should also disable AER in the VMD domain." >>> >>> Your patch at >>> https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/ >>> says "vmd users *always* want hotplug enabled." What happens when a >>> platform knows about a hotplug hardware issue and avoids it by >>> disabling hotplug in domain 0000? >> >> I was thinking about this also and I could look at all the root ports >> underneath vmd and see if hotplug is set for any of them. If it is then we >> could set the native_*hotplug bits based on that. > > No. "Hotplug is set" means the device advertises the feature via > config space, in this case, it has the Hot-Plug Capable bit in the > PCIe Slot Capabilities set. That just means the device has hardware > support for the feature. > > On ACPI systems, the OS can only use pciehp on the device if firmware > has granted ownership to the OS via _OSC because the firmware may want > to use the feature itself. If both OS and firmware think they own the > feature, they will conflict with each other. > > If firmware retains owership of hotplug, it can field hotplug events > itself and notify the OS via the ACPI hotplug mechanism. The acpiphp > driver handles those events for PCI. > > Firmware may do this if it wants to work around hardware defects it > knows about, or if it wants to do OS-independent logging (more > applicable for AER), or if it wants to intercept hotplug events to do > some kind of initialization, etc. > > Bjorn
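A minimal sketch of the three checks proposed in the message above, applied to the bus the vmd driver created and using the standard PCI config accessors. This is not code from drivers/pci/controller/vmd.c; vmd_allow_native_hotplug() is a hypothetical helper name:

#include <linux/pci.h>

/*
 * Hypothetical sketch of the proposal: enable native hotplug for the VMD
 * domain only if the OS has pciehp built in and at least one Root Port
 * below VMD advertises a hotplug-capable slot.
 */
static void vmd_allow_native_hotplug(struct pci_bus *vmd_bus)
{
	struct pci_host_bridge *bridge = pci_find_host_bridge(vmd_bus);
	struct pci_dev *port;
	bool hpc = false;
	u32 sltcap;

	/* Check 1: hotplug support compiled into the OS */
	if (!IS_ENABLED(CONFIG_HOTPLUG_PCI_PCIE))
		return;

	/* Check 3: any Root Port below VMD advertising a hotplug-capable slot */
	list_for_each_entry(port, &vmd_bus->devices, bus_list) {
		if (pci_pcie_type(port) != PCI_EXP_TYPE_ROOT_PORT)
			continue;
		pcie_capability_read_dword(port, PCI_EXP_SLTCAP, &sltcap);
		if (sltcap & PCI_EXP_SLTCAP_HPC)
			hpc = true;
		/* PCI_EXP_SLTCAP_HPS could also be checked for surprise removal */
	}

	/* Check 2 is implicit: vmd always wants hotplug, so claim ownership */
	if (hpc)
		bridge->native_pcie_hotplug = 1;
}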
On Thu, Apr 25, 2024 at 04:32:21PM -0700, Paul M Stillwell Jr wrote: > On 4/25/2024 3:32 PM, Bjorn Helgaas wrote: > > On Thu, Apr 25, 2024 at 02:43:07PM -0700, Paul M Stillwell Jr wrote: > > > On 4/25/2024 10:24 AM, Bjorn Helgaas wrote: > > > > On Wed, Apr 24, 2024 at 02:29:16PM -0700, Paul M Stillwell Jr wrote: > > > > > On 4/23/2024 5:47 PM, Bjorn Helgaas wrote: > > > ... > > > > > > > > _OSC is the only mechanism for negotiating ownership of these > > > > > > features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC > > > > > > only applies to the hierarchy originated by the PNP0A03/PNP0A08 host > > > > > > bridge that contains the _OSC method. AFAICT, there's no > > > > > > global/device-specific thing here. > > > > > > > > > > > > The BIOS may have a single user-visible setting, and it may apply that > > > > > > setting to all host bridge _OSC methods, but that's just part of the > > > > > > BIOS UI, not part of the firmware/OS interface. > > > > > > > > > > Fair, but we are still left with the question of how to set the _OSC bits > > > > > for the VMD bridge. This would normally happen using ACPI AFAICT and we > > > > > don't have that for the devices behind VMD. > > > > > > > > In the absence of a mechanism for negotiating ownership, e.g., an ACPI > > > > host bridge device for the hierarchy, the OS owns all the PCIe > > > > features. > > > > > > I'm new to this space so I don't know what it means for the OS to > > > own the features. In other words, how would the vmd driver figure > > > out what features are supported? > > > > There are three things that go into this: > > > > - Does the OS support the feature, e.g., is CONFIG_PCIEAER enabled? > > > > - Has the platform granted permission to the OS to use the feature, > > either explicitly via _OSC or implicitly because there's no > > mechanism to negotiate ownership? > > > > - Does the device advertise the feature, e.g., does it have an AER > > Capability? > > > > If all three are true, Linux enables the feature. > > > > I think vmd has implicit ownership of all features because there is no > > ACPI host bridge device for the VMD domain, and (IMO) that means there > > is no way to negotiate ownership in that domain. > > > > So the VMD domain starts with all the native_* bits set, meaning Linux > > owns the features. If the vmd driver doesn't want some feature to be > > used, it could clear the native_* bit for it. > > > > I don't think vmd should unilaterally claim ownership of features by > > *setting* native_* bits because that will lead to conflicts with > > firmware. > > This is the crux of the problem IMO. I'm happy to set the native_* bits > using some knowledge about what the firmware wants, but we don't have a > mechanism to do it AFAICT. I think that's what commit 04b12ef163d1 ("PCI: > vmd: Honor ACPI _OSC on PCIe features") was trying to do: use a domain that > ACPI had run on and negotiated features and apply them to the vmd domain. Yes, this is the problem. We have no information about what firmware wants for the VMD domain because there is no ACPI host bridge device. We have information about what firmware wants for *other* domains. 04b12ef163d1 assumes that also applies to the VMD domain, but I don't think that's a good idea. > Using the 3 criteria you described above, could we do this for the hotplug > feature for VMD: > > 1. Use IS_ENABLED(CONFIG_<whatever hotplug setting we need>) to check to see > if the hotplug feature is enabled That's already there. > 2. 
We know that for VMD we want hotplug enabled so that is the implicit > permission The VMD domain starts with all native_* bits set. All you have to do is avoid *clearing* them. The problem (IMO) is that 04b12ef163d1 clears bits based on the _OSC for some other domain. > 3. Look at the root ports below VMD and see if hotplug capability is set This is already there, too. > If 1 & 3 are true, then we set the native_* bits for hotplug (we can look > for surprise hotplug as well in the capability to set the > native_shpc_hotplug bit corrrectly) to 1. This feels like it would solve the > problem of "what if there is a hotplug issue on the platform" because the > user would have disabled hotplug for VMD and the root ports below it would > have the capability turned off. > > In theory we could do this same thing for all the features, but we don't > know what the firmware wants for features other than hotplug (because we > implicitly know that vmd wants hotplug). I feel like 04b12ef163d1 is a good > compromise for the other features, but I hear your issues with it. > > I'm happy to "do the right thing" for the other features, I just can't > figure out what that thing is :) 04b12ef163d1 was motivated by a flood of Correctable Errors. Kai-Heng says the errors occur even when booting with "pcie_ports=native" and VMD turned off, i.e., when the VMD RCiEP is disabled and the NVMe devices appear under plain Root Ports in domain 0000. That suggests that they aren't related to VMD at all. I think there's a significant chance that those errors are caused by a software defect, e.g., ASPM configuration. There are many similar reports of Correctable Errors where "pcie_aspm=off" is a workaround. If we can nail down the root cause of those Correctable Errors, we may be able to fix it and just revert 04b12ef163d1. That would leave all the PCIe features owned and enabled by Linux in the VMD domain. AER would be enabled and not a problem, hotplug would be enabled as you need, etc. There are a zillion reports of these errors and I added comments to some to see if anybody can help us get to a root cause. Bjorn
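For reference, the behavior being objected to is roughly the following, a paraphrase of what commit 04b12ef163d1 does rather than the verbatim vmd.c code (the helper name and exact field list are approximations): the _OSC results negotiated for an ordinary ACPI domain are copied onto the host bridge the vmd driver created, which in practice can only clear native_* bits that started out set.

#include <linux/pci.h>

/*
 * Approximate gist of 04b12ef163d1 (not the actual vmd.c code): inherit the
 * _OSC outcome of another domain's host bridge for the VMD domain.
 */
static void copy_foreign_osc_results(struct pci_host_bridge *root_bridge,
				     struct pci_host_bridge *vmd_bridge)
{
	vmd_bridge->native_pcie_hotplug = root_bridge->native_pcie_hotplug;
	vmd_bridge->native_shpc_hotplug = root_bridge->native_shpc_hotplug;
	vmd_bridge->native_aer = root_bridge->native_aer;
	vmd_bridge->native_pme = root_bridge->native_pme;
	vmd_bridge->native_ltr = root_bridge->native_ltr;
	vmd_bridge->native_dpc = root_bridge->native_dpc;
}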
On 4/26/2024 2:36 PM, Bjorn Helgaas wrote: > On Thu, Apr 25, 2024 at 04:32:21PM -0700, Paul M Stillwell Jr wrote: >> On 4/25/2024 3:32 PM, Bjorn Helgaas wrote: >>> On Thu, Apr 25, 2024 at 02:43:07PM -0700, Paul M Stillwell Jr wrote: >>>> On 4/25/2024 10:24 AM, Bjorn Helgaas wrote: >>>>> On Wed, Apr 24, 2024 at 02:29:16PM -0700, Paul M Stillwell Jr wrote: >>>>>> On 4/23/2024 5:47 PM, Bjorn Helgaas wrote: >>>> ... >>> >>>>>>> _OSC is the only mechanism for negotiating ownership of these >>>>>>> features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that _OSC >>>>>>> only applies to the hierarchy originated by the PNP0A03/PNP0A08 host >>>>>>> bridge that contains the _OSC method. AFAICT, there's no >>>>>>> global/device-specific thing here. >>>>>>> >>>>>>> The BIOS may have a single user-visible setting, and it may apply that >>>>>>> setting to all host bridge _OSC methods, but that's just part of the >>>>>>> BIOS UI, not part of the firmware/OS interface. >>>>>> >>>>>> Fair, but we are still left with the question of how to set the _OSC bits >>>>>> for the VMD bridge. This would normally happen using ACPI AFAICT and we >>>>>> don't have that for the devices behind VMD. >>>>> >>>>> In the absence of a mechanism for negotiating ownership, e.g., an ACPI >>>>> host bridge device for the hierarchy, the OS owns all the PCIe >>>>> features. >>>> >>>> I'm new to this space so I don't know what it means for the OS to >>>> own the features. In other words, how would the vmd driver figure >>>> out what features are supported? >>> >>> There are three things that go into this: >>> >>> - Does the OS support the feature, e.g., is CONFIG_PCIEAER enabled? >>> >>> - Has the platform granted permission to the OS to use the feature, >>> either explicitly via _OSC or implicitly because there's no >>> mechanism to negotiate ownership? >>> >>> - Does the device advertise the feature, e.g., does it have an AER >>> Capability? >>> >>> If all three are true, Linux enables the feature. >>> >>> I think vmd has implicit ownership of all features because there is no >>> ACPI host bridge device for the VMD domain, and (IMO) that means there >>> is no way to negotiate ownership in that domain. >>> >>> So the VMD domain starts with all the native_* bits set, meaning Linux >>> owns the features. If the vmd driver doesn't want some feature to be >>> used, it could clear the native_* bit for it. >>> >>> I don't think vmd should unilaterally claim ownership of features by >>> *setting* native_* bits because that will lead to conflicts with >>> firmware. >> >> This is the crux of the problem IMO. I'm happy to set the native_* bits >> using some knowledge about what the firmware wants, but we don't have a >> mechanism to do it AFAICT. I think that's what commit 04b12ef163d1 ("PCI: >> vmd: Honor ACPI _OSC on PCIe features") was trying to do: use a domain that >> ACPI had run on and negotiated features and apply them to the vmd domain. > > Yes, this is the problem. We have no information about what firmware > wants for the VMD domain because there is no ACPI host bridge device. > > We have information about what firmware wants for *other* domains. > 04b12ef163d1 assumes that also applies to the VMD domain, but I don't > think that's a good idea. > >> Using the 3 criteria you described above, could we do this for the hotplug >> feature for VMD: >> >> 1. Use IS_ENABLED(CONFIG_<whatever hotplug setting we need>) to check to see >> if the hotplug feature is enabled > > That's already there. > >> 2. 
We know that for VMD we want hotplug enabled so that is the implicit >> permission > > The VMD domain starts with all native_* bits set. All you have to do > is avoid *clearing* them. > > The problem (IMO) is that 04b12ef163d1 clears bits based on the _OSC > for some other domain. > >> 3. Look at the root ports below VMD and see if hotplug capability is set > > This is already there, too. > >> If 1 & 3 are true, then we set the native_* bits for hotplug (we can look >> for surprise hotplug as well in the capability to set the >> native_shpc_hotplug bit corrrectly) to 1. This feels like it would solve the >> problem of "what if there is a hotplug issue on the platform" because the >> user would have disabled hotplug for VMD and the root ports below it would >> have the capability turned off. >> >> In theory we could do this same thing for all the features, but we don't >> know what the firmware wants for features other than hotplug (because we >> implicitly know that vmd wants hotplug). I feel like 04b12ef163d1 is a good >> compromise for the other features, but I hear your issues with it. >> >> I'm happy to "do the right thing" for the other features, I just can't >> figure out what that thing is :) > > 04b12ef163d1 was motivated by a flood of Correctable Errors. > > Kai-Heng says the errors occur even when booting with > "pcie_ports=native" and VMD turned off, i.e., when the VMD RCiEP is > disabled and the NVMe devices appear under plain Root Ports in domain > 0000. That suggests that they aren't related to VMD at all. > > I think there's a significant chance that those errors are caused by a > software defect, e.g., ASPM configuration. There are many similar > reports of Correctable Errors where "pcie_aspm=off" is a workaround. > > If we can nail down the root cause of those Correctable Errors, we may > be able to fix it and just revert 04b12ef163d1. That would leave all > the PCIe features owned and enabled by Linux in the VMD domain. AER > would be enabled and not a problem, hotplug would be enabled as you > need, etc. > > There are a zillion reports of these errors and I added comments to > some to see if anybody can help us get to a root cause. > OK, sounds like the plan for me is to sit tight for now WRT a patch to fix hotplug. I'll submit a v2 for the documentation. Paul
On 4/26/2024 2:46 PM, Paul M Stillwell Jr wrote: > On 4/26/2024 2:36 PM, Bjorn Helgaas wrote: >> On Thu, Apr 25, 2024 at 04:32:21PM -0700, Paul M Stillwell Jr wrote: >>> On 4/25/2024 3:32 PM, Bjorn Helgaas wrote: >>>> On Thu, Apr 25, 2024 at 02:43:07PM -0700, Paul M Stillwell Jr wrote: >>>>> On 4/25/2024 10:24 AM, Bjorn Helgaas wrote: >>>>>> On Wed, Apr 24, 2024 at 02:29:16PM -0700, Paul M Stillwell Jr wrote: >>>>>>> On 4/23/2024 5:47 PM, Bjorn Helgaas wrote: >>>>> ... >>>> >>>>>>>> _OSC is the only mechanism for negotiating ownership of these >>>>>>>> features, and PCI Firmware r3.3, sec 4.5.1, is pretty clear that >>>>>>>> _OSC >>>>>>>> only applies to the hierarchy originated by the PNP0A03/PNP0A08 >>>>>>>> host >>>>>>>> bridge that contains the _OSC method. AFAICT, there's no >>>>>>>> global/device-specific thing here. >>>>>>>> >>>>>>>> The BIOS may have a single user-visible setting, and it may >>>>>>>> apply that >>>>>>>> setting to all host bridge _OSC methods, but that's just part of >>>>>>>> the >>>>>>>> BIOS UI, not part of the firmware/OS interface. >>>>>>> >>>>>>> Fair, but we are still left with the question of how to set the >>>>>>> _OSC bits >>>>>>> for the VMD bridge. This would normally happen using ACPI AFAICT >>>>>>> and we >>>>>>> don't have that for the devices behind VMD. >>>>>> >>>>>> In the absence of a mechanism for negotiating ownership, e.g., an >>>>>> ACPI >>>>>> host bridge device for the hierarchy, the OS owns all the PCIe >>>>>> features. >>>>> >>>>> I'm new to this space so I don't know what it means for the OS to >>>>> own the features. In other words, how would the vmd driver figure >>>>> out what features are supported? >>>> >>>> There are three things that go into this: >>>> >>>> - Does the OS support the feature, e.g., is CONFIG_PCIEAER enabled? >>>> >>>> - Has the platform granted permission to the OS to use the feature, >>>> either explicitly via _OSC or implicitly because there's no >>>> mechanism to negotiate ownership? >>>> >>>> - Does the device advertise the feature, e.g., does it have an AER >>>> Capability? >>>> >>>> If all three are true, Linux enables the feature. >>>> >>>> I think vmd has implicit ownership of all features because there is no >>>> ACPI host bridge device for the VMD domain, and (IMO) that means there >>>> is no way to negotiate ownership in that domain. >>>> >>>> So the VMD domain starts with all the native_* bits set, meaning Linux >>>> owns the features. If the vmd driver doesn't want some feature to be >>>> used, it could clear the native_* bit for it. >>>> >>>> I don't think vmd should unilaterally claim ownership of features by >>>> *setting* native_* bits because that will lead to conflicts with >>>> firmware. >>> >>> This is the crux of the problem IMO. I'm happy to set the native_* bits >>> using some knowledge about what the firmware wants, but we don't have a >>> mechanism to do it AFAICT. I think that's what commit 04b12ef163d1 >>> ("PCI: >>> vmd: Honor ACPI _OSC on PCIe features") was trying to do: use a >>> domain that >>> ACPI had run on and negotiated features and apply them to the vmd >>> domain. >> >> Yes, this is the problem. We have no information about what firmware >> wants for the VMD domain because there is no ACPI host bridge device. >> >> We have information about what firmware wants for *other* domains. >> 04b12ef163d1 assumes that also applies to the VMD domain, but I don't >> think that's a good idea. 
>> >>> Using the 3 criteria you described above, could we do this for the >>> hotplug >>> feature for VMD: >>> >>> 1. Use IS_ENABLED(CONFIG_<whatever hotplug setting we need>) to check >>> to see >>> if the hotplug feature is enabled >> >> That's already there. >> >>> 2. We know that for VMD we want hotplug enabled so that is the implicit >>> permission >> >> The VMD domain starts with all native_* bits set. All you have to do >> is avoid *clearing* them. >> >> The problem (IMO) is that 04b12ef163d1 clears bits based on the _OSC >> for some other domain. >> >>> 3. Look at the root ports below VMD and see if hotplug capability is set >> >> This is already there, too. >> >>> If 1 & 3 are true, then we set the native_* bits for hotplug (we can >>> look >>> for surprise hotplug as well in the capability to set the >>> native_shpc_hotplug bit corrrectly) to 1. This feels like it would >>> solve the >>> problem of "what if there is a hotplug issue on the platform" because >>> the >>> user would have disabled hotplug for VMD and the root ports below it >>> would >>> have the capability turned off. >>> >>> In theory we could do this same thing for all the features, but we don't >>> know what the firmware wants for features other than hotplug (because we >>> implicitly know that vmd wants hotplug). I feel like 04b12ef163d1 is >>> a good >>> compromise for the other features, but I hear your issues with it. >>> >>> I'm happy to "do the right thing" for the other features, I just can't >>> figure out what that thing is :) >> >> 04b12ef163d1 was motivated by a flood of Correctable Errors. >> >> Kai-Heng says the errors occur even when booting with >> "pcie_ports=native" and VMD turned off, i.e., when the VMD RCiEP is >> disabled and the NVMe devices appear under plain Root Ports in domain >> 0000. That suggests that they aren't related to VMD at all. >> >> I think there's a significant chance that those errors are caused by a >> software defect, e.g., ASPM configuration. There are many similar >> reports of Correctable Errors where "pcie_aspm=off" is a workaround. >> >> If we can nail down the root cause of those Correctable Errors, we may >> be able to fix it and just revert 04b12ef163d1. That would leave all >> the PCIe features owned and enabled by Linux in the VMD domain. AER >> would be enabled and not a problem, hotplug would be enabled as you >> need, etc. >> >> There are a zillion reports of these errors and I added comments to >> some to see if anybody can help us get to a root cause. >> > > OK, sounds like the plan for me is to sit tight for now WRT a patch to > fix hotplug. I'll submit a v2 for the documentation. > Any update on fixes for the errors? Paul
On Wed, Jun 12, 2024 at 02:52:38PM -0700, Paul M Stillwell Jr wrote: > On 4/26/2024 2:46 PM, Paul M Stillwell Jr wrote: > > On 4/26/2024 2:36 PM, Bjorn Helgaas wrote: > > > On Thu, Apr 25, 2024 at 04:32:21PM -0700, Paul M Stillwell Jr wrote: > > > > On 4/25/2024 3:32 PM, Bjorn Helgaas wrote: > > > > > On Thu, Apr 25, 2024 at 02:43:07PM -0700, Paul M Stillwell Jr wrote: > > > > > > On 4/25/2024 10:24 AM, Bjorn Helgaas wrote: > > > > > > > On Wed, Apr 24, 2024 at 02:29:16PM -0700, Paul M Stillwell Jr wrote: > > > > > > > > On 4/23/2024 5:47 PM, Bjorn Helgaas wrote: > > > > > > ... > > > > > > > > > > > > > > _OSC is the only mechanism for negotiating ownership of these > > > > > > > > > features, and PCI Firmware r3.3, sec 4.5.1, > > > > > > > > > is pretty clear that _OSC > > > > > > > > > only applies to the hierarchy originated by > > > > > > > > > the PNP0A03/PNP0A08 host > > > > > > > > > bridge that contains the _OSC method. AFAICT, there's no > > > > > > > > > global/device-specific thing here. > > > > > > > > > > > > > > > > > > The BIOS may have a single user-visible > > > > > > > > > setting, and it may apply that > > > > > > > > > setting to all host bridge _OSC methods, but > > > > > > > > > that's just part of the > > > > > > > > > BIOS UI, not part of the firmware/OS interface. > > > > > > > > > > > > > > > > Fair, but we are still left with the question of > > > > > > > > how to set the _OSC bits > > > > > > > > for the VMD bridge. This would normally happen > > > > > > > > using ACPI AFAICT and we > > > > > > > > don't have that for the devices behind VMD. > > > > > > > > > > > > > > In the absence of a mechanism for negotiating > > > > > > > ownership, e.g., an ACPI > > > > > > > host bridge device for the hierarchy, the OS owns all the PCIe > > > > > > > features. > > > > > > > > > > > > I'm new to this space so I don't know what it means for the OS to > > > > > > own the features. In other words, how would the vmd driver figure > > > > > > out what features are supported? > > > > > > > > > > There are three things that go into this: > > > > > > > > > > - Does the OS support the feature, e.g., is CONFIG_PCIEAER enabled? > > > > > > > > > > - Has the platform granted permission to the OS to use the feature, > > > > > either explicitly via _OSC or implicitly because there's no > > > > > mechanism to negotiate ownership? > > > > > > > > > > - Does the device advertise the feature, e.g., does it have an AER > > > > > Capability? > > > > > > > > > > If all three are true, Linux enables the feature. > > > > > > > > > > I think vmd has implicit ownership of all features because there is no > > > > > ACPI host bridge device for the VMD domain, and (IMO) that means there > > > > > is no way to negotiate ownership in that domain. > > > > > > > > > > So the VMD domain starts with all the native_* bits set, meaning Linux > > > > > owns the features. If the vmd driver doesn't want some feature to be > > > > > used, it could clear the native_* bit for it. > > > > > > > > > > I don't think vmd should unilaterally claim ownership of features by > > > > > *setting* native_* bits because that will lead to conflicts with > > > > > firmware. > > > > > > > > This is the crux of the problem IMO. I'm happy to set the native_* bits > > > > using some knowledge about what the firmware wants, but we don't have a > > > > mechanism to do it AFAICT. 
I think that's what commit > > > > 04b12ef163d1 ("PCI: > > > > vmd: Honor ACPI _OSC on PCIe features") was trying to do: use a > > > > domain that > > > > ACPI had run on and negotiated features and apply them to the > > > > vmd domain. > > > > > > Yes, this is the problem. We have no information about what firmware > > > wants for the VMD domain because there is no ACPI host bridge device. > > > > > > We have information about what firmware wants for *other* domains. > > > 04b12ef163d1 assumes that also applies to the VMD domain, but I don't > > > think that's a good idea. > > > > > > > Using the 3 criteria you described above, could we do this for > > > > the hotplug > > > > feature for VMD: > > > > > > > > 1. Use IS_ENABLED(CONFIG_<whatever hotplug setting we need>) to > > > > check to see > > > > if the hotplug feature is enabled > > > > > > That's already there. > > > > > > > 2. We know that for VMD we want hotplug enabled so that is the implicit > > > > permission > > > > > > The VMD domain starts with all native_* bits set. All you have to do > > > is avoid *clearing* them. > > > > > > The problem (IMO) is that 04b12ef163d1 clears bits based on the _OSC > > > for some other domain. > > > > > > > 3. Look at the root ports below VMD and see if hotplug capability is set > > > > > > This is already there, too. > > > > > > > If 1 & 3 are true, then we set the native_* bits for hotplug (we > > > > can look > > > > for surprise hotplug as well in the capability to set the > > > > native_shpc_hotplug bit corrrectly) to 1. This feels like it > > > > would solve the > > > > problem of "what if there is a hotplug issue on the platform" > > > > because the > > > > user would have disabled hotplug for VMD and the root ports > > > > below it would > > > > have the capability turned off. > > > > > > > > In theory we could do this same thing for all the features, but we don't > > > > know what the firmware wants for features other than hotplug (because we > > > > implicitly know that vmd wants hotplug). I feel like > > > > 04b12ef163d1 is a good > > > > compromise for the other features, but I hear your issues with it. > > > > > > > > I'm happy to "do the right thing" for the other features, I just can't > > > > figure out what that thing is :) > > > > > > 04b12ef163d1 was motivated by a flood of Correctable Errors. > > > > > > Kai-Heng says the errors occur even when booting with > > > "pcie_ports=native" and VMD turned off, i.e., when the VMD RCiEP is > > > disabled and the NVMe devices appear under plain Root Ports in domain > > > 0000. That suggests that they aren't related to VMD at all. > > > > > > I think there's a significant chance that those errors are caused by a > > > software defect, e.g., ASPM configuration. There are many similar > > > reports of Correctable Errors where "pcie_aspm=off" is a workaround. > > > > > > If we can nail down the root cause of those Correctable Errors, we may > > > be able to fix it and just revert 04b12ef163d1. That would leave all > > > the PCIe features owned and enabled by Linux in the VMD domain. AER > > > would be enabled and not a problem, hotplug would be enabled as you > > > need, etc. > > > > > > There are a zillion reports of these errors and I added comments to > > > some to see if anybody can help us get to a root cause. > > > > OK, sounds like the plan for me is to sit tight for now WRT a patch to > > fix hotplug. I'll submit a v2 for the documentation. > > Any update on fixes for the errors? Ah, sorry, I need to get back to this. 
I don't think it's ASPM-related; I think it's just that disabling ASPM also causes AER to be disabled, so the errors probably still occur but we ignore them. Bjorn
diff --git a/Documentation/PCI/controller/vmd.rst b/Documentation/PCI/controller/vmd.rst
new file mode 100644
index 000000000000..e1a019035245
--- /dev/null
+++ b/Documentation/PCI/controller/vmd.rst
@@ -0,0 +1,51 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=================================================================
+Linux Base Driver for the Intel(R) Volume Management Device (VMD)
+=================================================================
+
+Intel vmd Linux driver.
+
+Contents
+========
+
+- Overview
+- Features
+- Limitations
+
+The Intel VMD provides the means to provide volume management across separate
+PCI Express HBAs and SSDs without requiring operating system support or
+communication between drivers. It does this by obscuring each storage
+controller from the OS, but allowing a single driver to be loaded that would
+control each storage controller. A Volume Management Device (VMD) provides a
+single device for a single storage driver. The VMD resides in the IIO root
+complex and it appears to the OS as a root bus integrated endpoint. In the IIO,
+the VMD is in a central location to manipulate access to storage devices which
+may be attached directly to the IIO or indirectly through the PCH. Instead of
+allowing individual storage devices to be detected by the OS and allow it to
+load a separate driver instance for each, the VMD provides configuration
+settings to allow specific devices and root ports on the root bus to be
+invisible to the OS.
+
+VMD works by creating separate PCI domains for each VMD device in the system.
+This makes VMD look more like a host bridge than an endpoint so VMD must try
+to adhere to the ACPI Operating System Capabilities (_OSC) flags of the system.
+A couple of the _OSC flags regard hotplug support. Hotplug is a feature that
+is always enabled when using VMD regardless of the _OSC flags.
+
+Features
+========
+
+- Virtualization
+- MSIX interrupts
+- Power Management
+- Hotplug
+
+Limitations
+===========
+
+When VMD is enabled and used in a hypervisor the _OSC flags provided by the
+hypervisor BIOS may not be correct. The most critical of these flags are the
+hotplug bits. If these bits are incorrect then the storage devices behind the
+VMD will not be able to be hotplugged. The driver always supports hotplug for
+the devices behind it so the hotplug bits reported by the OS are not used.
diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
index e73f84aebde3..6558adc703f9 100644
--- a/Documentation/PCI/index.rst
+++ b/Documentation/PCI/index.rst
@@ -18,3 +18,4 @@ PCI Bus Subsystem
    pcieaer-howto
    endpoint/index
    boot-interrupts
+   controller/vmd
Add documentation for the Intel VMD driver and update the index file to include it.

Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
---
 Documentation/PCI/controller/vmd.rst | 51 ++++++++++++++++++++++++++++
 Documentation/PCI/index.rst          |  1 +
 2 files changed, 52 insertions(+)
 create mode 100644 Documentation/PCI/controller/vmd.rst
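The overview in the new document says VMD creates a separate PCI domain and looks to the OS like a host bridge. As a condensed sketch of that general pattern using the standard PCI core APIs (not the actual vmd_enable_domain() code; the function name, its parameters, and the starting bus number 0 are illustrative assumptions):

#include <linux/errno.h>
#include <linux/pci.h>

/*
 * Sketch of the general pattern: a driver that owns a private PCI domain
 * creates a root bus, scans it, and hands the discovered devices to the
 * normal driver core so endpoint drivers (e.g. nvme) bind as usual.
 */
static int enumerate_private_domain(struct device *parent, struct pci_ops *ops,
				    void *sysdata, struct list_head *resources)
{
	struct pci_bus *bus;

	bus = pci_create_root_bus(parent, 0, ops, sysdata, resources);
	if (!bus)
		return -ENODEV;

	pci_scan_child_bus(bus);		  /* discover root ports and endpoints */
	pci_assign_unassigned_bus_resources(bus); /* size and assign BARs/windows */
	pci_bus_add_devices(bus);		  /* let device drivers bind normally */

	return 0;
}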