Message ID: 20231213034637.2603013-3-haifeng.zhao@linux.intel.com
State: Superseded
Series: fix vt-d hard lockup when hotplug ATS capable device
On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
> For those endpoint devices connect to system via hotplug capable ports,
> users could request a warm reset to the device by flapping device's link
> through setting the slot's link control register,

Well, users could just *unplug* the device, right?  Why is it relevant
that they could fiddle with registers in config space?

> as pciehpt_ist() DLLSC
> interrupt sequence response, pciehp will unload the device driver and
> then power it off. thus cause an IOMMU devTLB flush request for device to
> be sent and a long time completion/timeout waiting in interrupt context.

A completion timeout should be on the order of usecs or msecs, why does it
cause a hard lockup?  The dmesg excerpt you've provided shows a 12 *second*
delay between hot removal and watchdog reaction.

> Fix it by checking the device's error_state in
> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
> request to link down device that is set to pci_channel_io_perm_failure and
> then powered off in

This doesn't seem to be a proper fix.  It will work most of the time
but not always.  A user might bring down the slot via sysfs, then yank
the card from the slot just when the iommu flush occurs such that the
pci_dev_is_disconnected(pdev) check returns false but the card is
physically gone immediately afterwards.  In other words, you've shrunk
the time window during which the issue may occur, but haven't eliminated
it completely.

Thanks,

Lukas
On 13/12/2023 10:44 am, Lukas Wunner wrote:
> On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register,
>
> Well, users could just *unplug* the device, right? Why is it relevant
> that they could fiddle with registers in config space?
>
>> as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off. thus cause an IOMMU devTLB flush request for device to
>> be sent and a long time completion/timeout waiting in interrupt context.
>
> A completion timeout should be on the order of usecs or msecs, why does it
> cause a hard lockup? The dmesg excerpt you've provided shows a 12 *second*
> delay between hot removal and watchdog reaction.

The PCIe spec only requires an endpoint to respond to an ATS invalidate
within a rather hilarious 90 seconds, so it's primarily a question of how
patient the root complex and bridges in between are prepared to be.

>> Fix it by checking the device's error_state in
>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
>> request to link down device that is set to pci_channel_io_perm_failure and
>> then powered off in
>
> This doesn't seem to be a proper fix. It will work most of the time
> but not always. A user might bring down the slot via sysfs, then yank
> the card from the slot just when the iommu flush occurs such that the
> pci_dev_is_disconnected(pdev) check returns false but the card is
> physically gone immediately afterwards. In other words, you've shrunk
> the time window during which the issue may occur, but haven't eliminated
> it completely.

Yeah, I think we have a subtle but fundamental issue here in that the
iommu_release_device() callback is hooked to BUS_NOTIFY_REMOVED_DEVICE,
so in general it probably shouldn't be assuming it's safe to do anything
with the device itself *after* it's already been removed from its bus -
this step is primarily about cleaning up any of the IOMMU's own state
relating to the given device.

I think if we want to ensure ATCs are invalidated on hot-unplug we need
an additional pre-removal notifier to take care of that, and that step
would then want to distinguish between an orderly removal where cleaning
up is somewhat meaningful, and a surprise removal where it definitely
isn't.

Thanks,
Robin.
On 2023/12/13 11:46, Ethan Zhao wrote:
> For those endpoint devices connect to system via hotplug capable ports,
> users could request a warm reset to the device by flapping device's link
> through setting the slot's link control register, as pciehpt_ist() DLLSC
> interrupt sequence response, pciehp will unload the device driver and
> then power it off.

Is it possible for pciehp to disable ATS on the device before unloading
the driver? Or should the device follow some specific steps to warm
reset the device?

What happens if IOMMU issues device TLB invalidation after link down but
before pci_dev_is_disconnected() returns true?

Best regards,
baolu
On 12/13/2023 6:44 PM, Lukas Wunner wrote:
> On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register,
>
> Well, users could just *unplug* the device, right? Why is it relevant
> that they could fiddle with registers in config space?

Yes, if the device and its slot are hotplug capable, users could just
'unplug' the device. But in the case reported, users try to do a warm
reset with a tool command like:

  mlxfwreset -d <busid> -y reset

Actually, it accesses configuration space just as

  setpci -s 0000:17:01.0 0x78.L=0x21050010

does. Well, we couldn't tell users not to fiddle with PCIe config space
registers like that.

>> as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off. thus cause an IOMMU devTLB flush request for device to
>> be sent and a long time completion/timeout waiting in interrupt context.
>
> A completion timeout should be on the order of usecs or msecs, why does it
> cause a hard lockup? The dmesg excerpt you've provided shows a 12 *second*
> delay between hot removal and watchdog reaction.

In my understanding, the devTLB flush request sent to an ATS capable
device is a non-posted request. If the ATS transaction is broken by an
endpoint link-down or power-off event, the timeout can take 60 to 90
seconds; see the "Invalidate Completion Timeout" implementation note in
chapter 10.3.1 "Invalidate Request" of PCIe spec 6.1:

"
IMPLEMENTATION NOTE:
INVALIDATE COMPLETION TIMEOUT

Devices should respond to Invalidate Requests within 1 minute (+50% -0%).
Having a bounded time permits an ATPT to implement Invalidate Completion
Timeouts and reuse the associated ITag values. ATPT designs are
implementation specific. As such, Invalidate Completion Timeouts and
their associated error handling are outside the scope of this
specification.
"

>> Fix it by checking the device's error_state in
>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
>> request to link down device that is set to pci_channel_io_perm_failure and
>> then powered off in
>
> This doesn't seem to be a proper fix. It will work most of the time
> but not always. A user might bring down the slot via sysfs, then yank
> the card from the slot just when the iommu flush occurs such that the
> pci_dev_is_disconnected(pdev) check returns false but the card is
> physically gone immediately afterwards. In other words, you've shrunk
> the time window during which the issue may occur, but haven't eliminated
> it completely.

If you mean disabling the slot via sysfs, that's SAFE_REMOVAL, right?
That would issue the devTLB invalidation first and power off the device
later, so it wouldn't trigger the hard lockup even though
pci_dev_is_disconnected() returns false. This fix works for such a case.

Thanks,
Ethan

> Thanks,
>
> Lukas
On 12/13/2023 7:59 PM, Baolu Lu wrote:
> On 2023/12/13 11:46, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off.
>
> Is it possible for pciehp to disable ATS on the device before unloading
> the driver? Or should the device follow some specific steps to warm
> reset the device?

In this case, the link goes down first, then pciehp_ist() gets the DLLSC
interrupt and learns about it. I don't think it makes sense to disable
device ATS here, but pciehp could flag the device as ATS-disabled; well,
"disconnected" is enough to let VT-d-like software know the device state.

> What happens if IOMMU issues device TLB invalidation after link down but
> before pci_dev_is_disconnected() returns true?

Seems it wouldn't happen with the hotplug cases, safe_removal or
surprise_removal.

Thanks,
Ethan

> Best regards,
> baolu
On 12/13/2023 7:54 PM, Robin Murphy wrote:
> On 13/12/2023 10:44 am, Lukas Wunner wrote:
>> On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's link
>>> through setting the slot's link control register,
>>
>> Well, users could just *unplug* the device, right? Why is it relevant
>> that they could fiddle with registers in config space?
>>
>>> as pciehpt_ist() DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off. thus cause an IOMMU devTLB flush request for device to
>>> be sent and a long time completion/timeout waiting in interrupt context.
>>
>> A completion timeout should be on the order of usecs or msecs, why does it
>> cause a hard lockup? The dmesg excerpt you've provided shows a 12 *second*
>> delay between hot removal and watchdog reaction.
>
> The PCIe spec only requires an endpoint to respond to an ATS invalidate
> within a rather hilarious 90 seconds, so it's primarily a question of how
> patient the root complex and bridges in between are prepared to be.

The reported issue only reproduces when the endpoint device connects to
the system via a PCIe switch (one that only has a read tracking feature);
those switches seem not to be aware of ATS transactions, while the root
port is aware of ATS when the ATS transaction is broken (invalidation
sent, but link down, device removed, etc.). But I didn't find any public
spec about that.

>>> Fix it by checking the device's error_state in
>>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
>>> request to link down device that is set to pci_channel_io_perm_failure and
>>> then powered off in
>>
>> This doesn't seem to be a proper fix. It will work most of the time
>> but not always. A user might bring down the slot via sysfs, then yank
>> the card from the slot just when the iommu flush occurs such that the
>> pci_dev_is_disconnected(pdev) check returns false but the card is
>> physically gone immediately afterwards. In other words, you've shrunk
>> the time window during which the issue may occur, but haven't eliminated
>> it completely.
>
> Yeah, I think we have a subtle but fundamental issue here in that the
> iommu_release_device() callback is hooked to BUS_NOTIFY_REMOVED_DEVICE,
> so in general probably shouldn't be assuming it's safe to do anything
> with the device itself *after* it's already been removed from its bus -
> this step is primarily about cleaning up any of the IOMMU's own state
> relating to the given device.
>
> I think if we want to ensure ATCs are invalidated on hot-unplug we need
> an additional pre-removal notifier to take care of that, and that step
> would then want to distinguish between an orderly removal where cleaning
> up is somewhat meaningful, and a surprise removal where it definitely
> isn't.

So, at least, we should check the device state before issuing the devTLB
invalidation.

Thanks,
Ethan

> Thanks,
> Robin.
On 12/14/2023 10:16 AM, Ethan Zhao wrote:
> On 12/13/2023 6:44 PM, Lukas Wunner wrote:
>> On Tue, Dec 12, 2023 at 10:46:37PM -0500, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's link
>>> through setting the slot's link control register,
>>
>> Well, users could just *unplug* the device, right? Why is it relevant
>> that they could fiddle with registers in config space?
>
> Yes, if the device and its slot are hotplug capable, users could just
> 'unplug' the device. But in the case reported, users try to do a warm
> reset with a tool command like:
>
>   mlxfwreset -d <busid> -y reset
>
> Actually, it accesses configuration space just as
>
>   setpci -s 0000:17:01.0 0x78.L=0x21050010
>
> does. Well, we couldn't tell users not to fiddle with PCIe config space
> registers like that.
>
>>> as pciehpt_ist() DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off. thus cause an IOMMU devTLB flush request for device
>>> to be sent and a long time completion/timeout waiting in interrupt
>>> context.
>>
>> A completion timeout should be on the order of usecs or msecs, why does
>> it cause a hard lockup? The dmesg excerpt you've provided shows a 12
>> *second* delay between hot removal and watchdog reaction.
>
> In my understanding, the devTLB flush request sent to an ATS capable
> device is a non-posted request. If the ATS transaction is broken by an
> endpoint link-down or power-off event, the timeout can take 60 to 90
> seconds; see the "Invalidate Completion Timeout" implementation note in
> chapter 10.3.1 "Invalidate Request" of PCIe spec 6.1:
>
> "
> IMPLEMENTATION NOTE:
> INVALIDATE COMPLETION TIMEOUT
>
> Devices should respond to Invalidate Requests within 1 minute (+50% -0%).
> Having a bounded time permits an ATPT to implement Invalidate Completion
> Timeouts and reuse the associated ITag values. ATPT designs are
> implementation specific. As such, Invalidate Completion Timeouts and
> their associated error handling are outside the scope of this
> specification.
> "
>
>>> Fix it by checking the device's error_state in
>>> devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB
>>> flush request to link down device that is set to
>>> pci_channel_io_perm_failure and then powered off in
>>
>> This doesn't seem to be a proper fix. It will work most of the time
>> but not always. A user might bring down the slot via sysfs, then yank
>> the card from the slot just when the iommu flush occurs such that the
>> pci_dev_is_disconnected(pdev) check returns false but the card is
>> physically gone immediately afterwards. In other words, you've shrunk
>> the time window during which the issue may occur, but haven't eliminated
>> it completely.
>
> If you mean disabling the slot via sysfs, that's SAFE_REMOVAL, right?
> That would issue the devTLB invalidation first and power off the device
> later, so it wouldn't trigger the hard lockup even though
> pci_dev_is_disconnected() returns false. This fix works for such a case.

Could you help point out whether there are any other windows to close?

Thanks,
Ethan

> Thanks,
>
> Ethan
>
>> Thanks,
>>
>> Lukas
On 12/14/2023 10:26 AM, Ethan Zhao wrote:
> On 12/13/2023 7:59 PM, Baolu Lu wrote:
>> On 2023/12/13 11:46, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's link
>>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off.
>>
>> Is it possible for pciehp to disable ATS on the device before unloading
>> the driver? Or should the device follow some specific steps to warm
>> reset the device?
>
> In this case, the link goes down first, then pciehp_ist() gets the DLLSC
> interrupt and learns about it. I don't think it makes sense to disable
> device ATS here, but pciehp could flag the device as ATS-disabled; well,
> "disconnected" is enough to let VT-d-like software know the device state.

For the hot "unplug" cases:

1. safe_removal

Users request unplugging the device via sysfs or press the attention
button; pciehp will then respond by unconfiguring the device, unloading
the device driver, and powering it off, and the device is ready to
remove. In this case, the devTLB invalidate request is sent before the
device link is brought down or the device power is turned off, so it
doesn't trigger the hard lockup.

2. surprise_removal

Users remove the device directly, or bring the device link down / turn
off the device power first by setting PCI config space.
Link-down/not-present/power-off are all handled by pciehp the same way,
as "surprise_removal". In such a case, pciehp_ist() will flag the device
as "disconnected" first, then unconfigure the device, unload the driver,
release the IOMMU device (issuing the devTLB flush), and delete the
device. So checking the device state could work for such cases.

But I am still thinking about whether there are other windows.

Thanks,
Ethan

>> What happens if IOMMU issues device TLB invalidation after link down but
>> before pci_dev_is_disconnected() returns true?
>
> Seems it wouldn't happen with the hotplug cases, safe_removal or
> surprise_removal.
>
> Thanks,
> Ethan
>
>> Best regards,
>> baolu
On 2023/12/15 9:03, Ethan Zhao wrote:
> 2. surprise_removal
>
> Users remove the device directly, or bring the device link down / turn
> off the device power first by setting PCI config space.
> Link-down/not-present/power-off are all handled by pciehp the same way,
> as "surprise_removal". In such a case, pciehp_ist() will flag the device
> as "disconnected" first, then unconfigure the device, unload the driver,
> release the IOMMU device (issuing the devTLB flush), and delete the
> device. So checking the device state could work for such cases.

If so, then it is fine for the iommu driver. As Robin said, if the device
needs more cleanup, the iommu core should register the right callback
with the driver core and handle it before the device goes away.

Disabling PCI features seems to be a reasonable device cleanup. This
gives us another reason to move ATS enabling/disabling out from the iommu
subsystem. Once this is done, the device driver will enable ATS during
its probe and disable it during its release. There will be no such
workaround in the iommu driver anymore.

Best regards,
baolu
On 12/15/2023 9:34 AM, Baolu Lu wrote:
> On 2023/12/15 9:03, Ethan Zhao wrote:
>> 2. surprise_removal
>>
>> Users remove the device directly, or bring the device link down / turn
>> off the device power first by setting PCI config space.
>> Link-down/not-present/power-off are all handled by pciehp the same way,
>> as "surprise_removal". In such a case, pciehp_ist() will flag the device
>> as "disconnected" first, then unconfigure the device, unload the driver,
>> release the IOMMU device (issuing the devTLB flush), and delete the
>> device. So checking the device state could work for such cases.
>
> If so, then it is fine for the iommu driver. As Robin said, if the
> device needs more cleanup, the iommu core should register the right
> callback with the driver core and handle it before the device goes away.
>
> Disabling PCI features seems to be a reasonable device cleanup. This
> gives us another reason to move ATS enabling/disabling out from the

For surprise_removal, the device was already removed and powered off by
the time the iommu device-release notification or cleanup callback is
invoked, so I don't see the point of disabling ATS on a not-present
device. Perhaps I misunderstand?

Thanks,
Ethan

> iommu subsystem. Once this is done, the device driver will enable ATS
> during its probe and disable it during its release. There will be no
> such workaround in the iommu driver anymore.
>
> Best regards,
> baolu
On Wed, Dec 13, 2023 at 11:54:05AM +0000, Robin Murphy wrote:
> I think if we want to ensure ATCs are invalidated on hot-unplug we need an
> additional pre-removal notifier to take care of that, and that step would
> then want to distinguish between an orderly removal where cleaning up is
> somewhat meaningful, and a surprise removal where it definitely isn't.

Even if a user starts the process for orderly removal, the device may be
surprise-removed *during* that process.  So we cannot assume that the
device is actually accessible if orderly removal has been initiated.
If the form factor supports surprise removal, the device may be gone
at any time.

Thanks,

Lukas
On 2023-12-21 10:42 am, Lukas Wunner wrote:
> On Wed, Dec 13, 2023 at 11:54:05AM +0000, Robin Murphy wrote:
>> I think if we want to ensure ATCs are invalidated on hot-unplug we need an
>> additional pre-removal notifier to take care of that, and that step would
>> then want to distinguish between an orderly removal where cleaning up is
>> somewhat meaningful, and a surprise removal where it definitely isn't.
>
> Even if a user starts the process for orderly removal, the device may be
> surprise-removed *during* that process. So we cannot assume that the
> device is actually accessible if orderly removal has been initiated.
> If the form factor supports surprise removal, the device may be gone
> at any time.

Sure, whatever we do there's always going to be some unavoidable
time-of-check-to-time-of-use race window, so we can never guarantee that
sending a request to the device will succeed. I was just making the point
that if we *have* already detected a surprise removal, then cleaning up
its leftover driver model state should still generate a
BUS_NOTIFY_REMOVE_DEVICE call, but in that case we can know there's no
point trying to send any requests to the device that's already gone.

Thanks,
Robin.
On Thu, Dec 21, 2023 at 11:01:56AM +0000, Robin Murphy wrote:
> On 2023-12-21 10:42 am, Lukas Wunner wrote:
>> On Wed, Dec 13, 2023 at 11:54:05AM +0000, Robin Murphy wrote:
>>> I think if we want to ensure ATCs are invalidated on hot-unplug we need an
>>> additional pre-removal notifier to take care of that, and that step would
>>> then want to distinguish between an orderly removal where cleaning up is
>>> somewhat meaningful, and a surprise removal where it definitely isn't.
>>
>> Even if a user starts the process for orderly removal, the device may be
>> surprise-removed *during* that process. So we cannot assume that the
>> device is actually accessible if orderly removal has been initiated.
>> If the form factor supports surprise removal, the device may be gone
>> at any time.
>
> Sure, whatever we do there's always going to be some unavoidable
> time-of-check-to-time-of-use race window so we can never guarantee that
> sending a request to the device will succeed. I was just making the point
> that if we *have* already detected a surprise removal, then cleaning up
> its leftover driver model state should still generate a
> BUS_NOTIFY_REMOVE_DEVICE call, but in that case we can know there's no
> point trying to send any requests to the device that's already gone.

Right, using pci_dev_is_disconnected() as a *speedup* when we definitely
know the device is gone, that's perfectly fine.  So in that sense the
proposed patch is acceptable *after* this series has been extended to
make sure hard lockups can *never* occur on unplug.

Thanks,

Lukas
On 12/21/2023 6:42 PM, Lukas Wunner wrote:
> On Wed, Dec 13, 2023 at 11:54:05AM +0000, Robin Murphy wrote:
>> I think if we want to ensure ATCs are invalidated on hot-unplug we need an
>> additional pre-removal notifier to take care of that, and that step would
>> then want to distinguish between an orderly removal where cleaning up is
>> somewhat meaningful, and a surprise removal where it definitely isn't.
>
> Even if a user starts the process for orderly removal, the device may be
> surprise-removed *during* that process. So we cannot assume that the
> device is actually accessible if orderly removal has been initiated.
> If the form factor supports surprise removal, the device may be gone
> at any time.

There is no hardware lock to prevent the user from powering off or
removing a surprise-removal capable device after the device connection
state has been checked but before the ATS invalidation request is
issued, so a request with no target can still possibly be sent.

Thanks,
Ethan

> Thanks,
>
> Lukas
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 74e8e4c17e81..8557b6dee22f 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -476,6 +476,23 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
 {
 	struct device_domain_info *info;
 	u16 sid, qdep, pfsid;
+	struct pci_dev *pdev;
+
+	pdev = to_pci_dev(dev);
+	if (!pdev)
+		return;
+
+	/*
+	 * If the endpoint's link was brought down by a user's config access
+	 * to its hotplug capable slot's link control register, pciehp_ist(),
+	 * responding to the DLLSC interrupt, will have set the device's
+	 * error_state to pci_channel_io_perm_failure. Check the state here
+	 * to avoid issuing a meaningless devTLB flush request, which might
+	 * cause a lockup warning or deadlock in interrupt context.
+	 */
+
+	if (pci_dev_is_disconnected(pdev))
+		return;
 
 	info = dev_iommu_priv_get(dev);
 	if (!info || !info->ats_enabled)
@@ -495,6 +512,8 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
 		qi_flush_dev_iotlb(iommu, sid, pfsid, qdep, 0, 64 - VTD_PAGE_SHIFT);
 	else
 		qi_flush_dev_iotlb_pasid(iommu, sid, pfsid, pasid, qdep, 0, 64 - VTD_PAGE_SHIFT);
+
+	return;
 }
 
 void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,