Message ID | 20241011152727.366770-2-stewart.hildebrand@amd.com (mailing list archive) |
---|---|
State | Superseded |
Series | xen: SR-IOV fixes |
On 11.10.2024 17:27, Stewart Hildebrand wrote:
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -1243,7 +1243,12 @@ int pci_reset_msix_state(struct pci_dev *pdev)
>  {
>      unsigned int pos = pci_find_cap_offset(pdev->sbdf, PCI_CAP_ID_MSIX);
>
> -    ASSERT(pos);
> +    if ( !pos )
> +    {
> +        pdev->broken = true;
> +        return -EFAULT;
> +    }
> +
>      /*
>       * Xen expects the device state to be the after reset one, and hence
>       * host_maskall = guest_maskall = false and all entries should have the
> @@ -1271,7 +1276,12 @@ int pci_msi_conf_write_intercept(struct pci_dev *pdev, unsigned int reg,
>          entry = find_msi_entry(pdev, -1, PCI_CAP_ID_MSIX);
>          pos = entry ? entry->msi_attrib.pos
>                      : pci_find_cap_offset(pdev->sbdf, PCI_CAP_ID_MSIX);
> -        ASSERT(pos);
> +
> +        if ( !pos )
> +        {
> +            pdev->broken = true;
> +            return -EFAULT;
> +        }
>
>          if ( reg >= pos && reg < msix_pba_offset_reg(pos) + 4 )
>          {

There are more instances of pci_find_cap_offset(..., PCI_CAP_ID_MSIX)
which may want/need dealing with, even if there are no ASSERT()s there.

Setting ->broken is of course a perhaps desirable (side) effect. Nevertheless
I wonder whether latching the capability position once during device init
wouldn't be an alternative (better?) approach.

Finally I don't think -EFAULT is appropriate here. Imo it should be -ENODEV.

Jan
On 10/15/24 02:58, Jan Beulich wrote:
> On 11.10.2024 17:27, Stewart Hildebrand wrote:
>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -1243,7 +1243,12 @@ int pci_reset_msix_state(struct pci_dev *pdev)
>>  {
>>      unsigned int pos = pci_find_cap_offset(pdev->sbdf, PCI_CAP_ID_MSIX);
>>
>> -    ASSERT(pos);
>> +    if ( !pos )
>> +    {
>> +        pdev->broken = true;
>> +        return -EFAULT;
>> +    }
>> +
>>      /*
>>       * Xen expects the device state to be the after reset one, and hence
>>       * host_maskall = guest_maskall = false and all entries should have the
>> @@ -1271,7 +1276,12 @@ int pci_msi_conf_write_intercept(struct pci_dev *pdev, unsigned int reg,
>>          entry = find_msi_entry(pdev, -1, PCI_CAP_ID_MSIX);
>>          pos = entry ? entry->msi_attrib.pos
>>                      : pci_find_cap_offset(pdev->sbdf, PCI_CAP_ID_MSIX);
>> -        ASSERT(pos);
>> +
>> +        if ( !pos )
>> +        {
>> +            pdev->broken = true;
>> +            return -EFAULT;
>> +        }
>>
>>          if ( reg >= pos && reg < msix_pba_offset_reg(pos) + 4 )
>>          {
>
> There are more instances of pci_find_cap_offset(..., PCI_CAP_ID_MSIX)
> which may want/need dealing with, even if there are no ASSERT()s there.

Yes, and some instances of pci_find_cap_offset(..., PCI_CAP_ID_MSI) too.

> Setting ->broken is of course a perhaps desirable (side) effect. Nevertheless
> I wonder whether latching the capability position once during device init
> wouldn't be an alternative (better?) approach.

I'll give this a try for the next rev.

> Finally I don't think -EFAULT is appropriate here. Imo it should be -ENODEV.

OK

>
> Jan
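For illustration only, here is a minimal standalone sketch of the "latch the capability position once during device init" idea raised in the review, combined with the suggested -ENODEV return value. It is not Xen code and not the next revision of the patch: the cut-down struct pci_dev, the msix_pos field, the mock lookup (with its made-up offset 0x70) and both helper names are assumptions chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define PCI_CAP_ID_MSIX 0x11

/* Hypothetical, cut-down stand-in for Xen's struct pci_dev. */
struct pci_dev {
    bool present;          /* mock only: does the device still respond? */
    unsigned int msix_pos; /* hypothetical field: offset latched at init */
    bool broken;
};

/*
 * Mock capability lookup. Returns a made-up offset (0x70) while the device
 * responds, and 0 once it no longer does.
 */
static unsigned int mock_find_cap_offset(const struct pci_dev *pdev,
                                         unsigned int cap)
{
    return (pdev->present && cap == PCI_CAP_ID_MSIX) ? 0x70 : 0;
}

/* Latch the MSI-X capability position once, at device init time. */
static void pdev_init_latch_msix(struct pci_dev *pdev)
{
    assert(pdev != NULL);
    pdev->msix_pos = mock_find_cap_offset(pdev, PCI_CAP_ID_MSIX);
}

/* Later users consult the latched value instead of re-reading config space. */
static int reset_msix_state_sketch(struct pci_dev *pdev)
{
    assert(pdev != NULL);

    if ( !pdev->msix_pos )
    {
        pdev->broken = true;
        return -ENODEV;
    }

    /* ... the MSI-X state reset using pdev->msix_pos would go here ... */
    return 0;
}
```

In this sketch a device that disappears after init keeps its latched offset and so no longer takes the error path; only a device with no MSI-X capability at init time gets -ENODEV and is marked broken. Whether stale devices should additionally be detected some other way is left open here.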
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index ff2e3d86878d..fbb07fe821b5 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -1243,7 +1243,12 @@ int pci_reset_msix_state(struct pci_dev *pdev)
 {
     unsigned int pos = pci_find_cap_offset(pdev->sbdf, PCI_CAP_ID_MSIX);

-    ASSERT(pos);
+    if ( !pos )
+    {
+        pdev->broken = true;
+        return -EFAULT;
+    }
+
     /*
      * Xen expects the device state to be the after reset one, and hence
      * host_maskall = guest_maskall = false and all entries should have the
@@ -1271,7 +1276,12 @@ int pci_msi_conf_write_intercept(struct pci_dev *pdev, unsigned int reg,
         entry = find_msi_entry(pdev, -1, PCI_CAP_ID_MSIX);
         pos = entry ? entry->msi_attrib.pos
                     : pci_find_cap_offset(pdev->sbdf, PCI_CAP_ID_MSIX);
-        ASSERT(pos);
+
+        if ( !pos )
+        {
+            pdev->broken = true;
+            return -EFAULT;
+        }

         if ( reg >= pos && reg < msix_pba_offset_reg(pos) + 4 )
         {
Dom0 normally informs Xen of PCI device removal via
PHYSDEVOP_pci_device_remove, e.g. in response to SR-IOV disable or
hot-unplug. We might find ourselves with stale pdevs if a buggy dom0
fails to report removal via PHYSDEVOP_pci_device_remove. In this case,
attempts to access the config space of the stale pdevs would be invalid
and return all 1s.

Some possible conditions leading to this are:

1. Dom0 disables SR-IOV without reporting VF removal to Xen.

The Linux SR-IOV subsystem normally reports VF removal when a PF driver
disables SR-IOV. In case of a buggy dom0 SR-IOV subsystem, SR-IOV could
become disabled with stale dangling VF pdevs in both dom0 Linux and Xen.

2. Dom0 reporting PF removal without reporting VF removal.

During SR-IOV PF removal (hot-unplug), a buggy PF driver may fail to
disable SR-IOV, thus failing to remove the VFs, leaving stale dangling
VFs behind in both Xen and Linux. At least Linux warns in this case:

[ 100.000000] 0000:01:00.0: driver left SR-IOV enabled after remove

In either case, Xen is left with stale VF pdevs, risking invalid PCI
config space accesses.

When Xen is built with CONFIG_DEBUG=y, the following Xen crashes were
observed when dom0 attempted to access the config space of a stale VF:

(XEN) Assertion 'pos' failed at arch/x86/msi.c:1274
(XEN) ----[ Xen-4.20-unstable x86_64 debug=y Tainted: C ]----
...
(XEN) Xen call trace:
(XEN) [<ffff82d040346834>] R pci_msi_conf_write_intercept+0xa2/0x1de
(XEN) [<ffff82d04035d6b4>] F pci_conf_write_intercept+0x68/0x78
(XEN) [<ffff82d0403264e5>] F arch/x86/pv/emul-priv-op.c#pci_cfg_ok+0xa0/0x114
(XEN) [<ffff82d04032660e>] F arch/x86/pv/emul-priv-op.c#guest_io_write+0xb5/0x1c8
(XEN) [<ffff82d0403267bb>] F arch/x86/pv/emul-priv-op.c#write_io+0x9a/0xe0
(XEN) [<ffff82d04037c77a>] F x86_emulate+0x100e5/0x25f1e
(XEN) [<ffff82d0403941a8>] F x86_emulate_wrapper+0x29/0x64
(XEN) [<ffff82d04032802b>] F pv_emulate_privileged_op+0x12e/0x217
(XEN) [<ffff82d040369f12>] F do_general_protection+0xc2/0x1b8
(XEN) [<ffff82d040201aa7>] F x86_64/entry.S#handle_exception_saved+0x2b/0x8c

(XEN) Assertion 'pos' failed at arch/x86/msi.c:1246
(XEN) ----[ Xen-4.20-unstable x86_64 debug=y Tainted: C ]----
...
(XEN) Xen call trace:
(XEN) [<ffff82d040346b0a>] R pci_reset_msix_state+0x47/0x50
(XEN) [<ffff82d040287eec>] F pdev_msix_assign+0x19/0x35
(XEN) [<ffff82d040286184>] F drivers/passthrough/pci.c#assign_device+0x181/0x471
(XEN) [<ffff82d040287c36>] F iommu_do_pci_domctl+0x248/0x2ec
(XEN) [<ffff82d040284e1f>] F iommu_do_domctl+0x26/0x44
(XEN) [<ffff82d0402483b8>] F do_domctl+0x8c1/0x1660
(XEN) [<ffff82d04032977e>] F pv_hypercall+0x5ce/0x6af
(XEN) [<ffff82d0402012d3>] F lstar_enter+0x143/0x150

Replace the ASSERT(s) with an error, and mark the device broken to
disallow passthrough to domUs.

Fixes: 484d7c852e4f ("x86/MSI-X: track host and guest mask-all requests separately")
Fixes: 575e18d54d19 ("pci: clear {host/guest}_maskall field on assign")
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
v4->v5:
* new patch, independent of the rest of the series
* new approach to fixing the issue: don't rely on dom0 to report any
  sort of device removal; rather, fix the condition directly
---
Instructions to reproduce
Requires Xen with CONFIG_DEBUG=y
Tested with Linux 6.11

1. Dom0 disables SR-IOV without reporting VF removal to Xen.

* Hack the Linux SR-IOV subsystem to remove the call to
  pci_stop_and_remove_bus_device() in
  drivers/pci/iov.c:pci_iov_remove_virtfn().

* Enable SR-IOV, then disable SR-IOV

  echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
  echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs

* Now we have a stale VF. We can trigger the ASSERT either by unbinding
  the VF driver and issuing a reset...

  echo 0000\:01\:10.0 > /sys/bus/pci/devices/0000\:01\:10.0/driver/unbind
  echo 1 > /sys/bus/pci/devices/0000\:01\:10.0/reset

  ... or by doing xl pci-assignable-add

  xl pci-assignable-add 01:10.0

2. Dom0 reporting PF removal without reporting VF removal.

* Hack your PF driver to leave SR-IOV enabled when removing the device

* Enable SR-IOV

  echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs

* Unplug the PCI device

  (qemu) device_del mydev

* Now we have a stale VF. We can trigger the ASSERT either by re-adding
  the PF device with SR-IOV disabled...

  echo 0000\:01\:10.0 > /sys/bus/pci/devices/0000\:01\:10.0/driver/unbind
  (qemu) device_add igb,id=mydev,bus=pcie.1,netdev=net1

  ... or by reset / xl pci-assignable-add as above.
---
 xen/arch/x86/msi.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)
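
As a footnote to the commit message's remark that config space accesses to a stale pdev return all 1s: the sketch below is a simplified, standalone capability-list walk showing why such a walk comes back with position 0, the value that tripped ASSERT(pos). It is an illustration under stated assumptions, not Xen's pci_find_cap_offset(): struct mock_dev, cfg_read8() and find_cap_sketch() are made-up names, and config space is modeled as a plain byte array. The register offsets and the MSI-X capability ID are the standard PCI ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_STATUS          0x06 /* low byte of the 16-bit status register */
#define PCI_STATUS_CAP_LIST 0x10
#define PCI_CAPABILITY_LIST 0x34
#define PCI_CAP_LIST_ID     0
#define PCI_CAP_LIST_NEXT   1
#define PCI_CAP_ID_MSIX     0x11

/* Mock device: cfg == NULL models a device that no longer responds. */
struct mock_dev {
    const uint8_t *cfg; /* 256 bytes of config space, or NULL if stale */
};

/* Reads to a non-responding device complete with all bits set. */
static uint8_t cfg_read8(const struct mock_dev *d, unsigned int reg)
{
    return d->cfg ? d->cfg[reg & 0xff] : 0xff;
}

/* Walk the capability list; return the capability's offset, or 0. */
static unsigned int find_cap_sketch(const struct mock_dev *d, unsigned int cap)
{
    unsigned int ttl = 48; /* bound the walk in case of a looping list */
    uint8_t pos = PCI_CAPABILITY_LIST;
    uint8_t id;

    assert(d != NULL);

    if ( !(cfg_read8(d, PCI_STATUS) & PCI_STATUS_CAP_LIST) )
        return 0;

    while ( ttl-- )
    {
        pos = cfg_read8(d, pos);
        if ( pos < 0x40 )
            break;               /* end of list, or invalid pointer */

        pos &= 0xfc;             /* capability pointers are dword aligned */
        id = cfg_read8(d, pos + PCI_CAP_LIST_ID);
        if ( id == 0xff )
            break;               /* all 1s: nothing is there */
        if ( id == cap )
            return pos;

        pos += PCI_CAP_LIST_NEXT;
    }

    return 0;
}
```

For a stale device in this model, the status read of 0xff passes the capability-list check, the first pointer read of 0xff is masked to 0xfc, and the ID read there is 0xff, so the walk stops and returns 0.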