Message ID | 20231228170206.720675-1-haifeng.zhao@linux.intel.com (mailing list archive) |
---|---|
Headers | show |
Series | fix vt-d hard lockup when hotplug ATS capable device | expand |
> From: Ethan Zhao <haifeng.zhao@linux.intel.com> > Sent: Friday, December 29, 2023 1:02 AM > > change log: > v10: > - refactor qi_submit_sync() and its callers to get pci_dev instance, as > Kevin pointed out add target_flush_dev to iommu is not right. let's not rush for new versions when there are still opens unclosed in previous one (and considering most related folks are in vacation).
On 12/29/2023 4:18 PM, Tian, Kevin wrote: >> From: Ethan Zhao <haifeng.zhao@linux.intel.com> >> Sent: Friday, December 29, 2023 1:02 AM >> >> change log: >> v10: >> - refactor qi_submit_sync() and its callers to get pci_dev instance, as >> Kevin pointed out add target_flush_dev to iommu is not right. > let's not rush for new versions when there are still opens unclosed in > previous one (and considering most related folks are in vacation). Just have some hours these days to make it in well shape. Okay, let's think, and wait for new comments. Thanks, Ethan
On 12/29/2023 1:02 AM, Ethan Zhao wrote: > This patchset is used to fix vt-d hard lockup reported when surprise > unplug ATS capable endpoint device connects to system via PCIe switch > as following topology. > > +-[0000:15]-+-00.0 Intel Corporation Ice Lake Memory Map/VT-d > | +-00.1 Intel Corporation Ice Lake Mesh 2 PCIe > | +-00.2 Intel Corporation Ice Lake RAS > | +-00.4 Intel Corporation Device 0b23 > | \-01.0-[16-1b]----00.0-[17-1b]--+-00.0-[18]----00.0 > NVIDIA Corporation Device 2324 > | +-01.0-[19]----00.0 > Mellanox Technologies MT2910 Family [ConnectX-7] > > User brought endpoint device 19:00.0's link down by flapping it's hotplug > capable slot 17:01.0 link control register, as sequence DLLSC response, > pciehp_ist() will unload device driver and power it off, durning device > driver is unloading an iommu device-TLB invalidation (Intel VT-d spec, or > 'ATS Invalidation' in PCIe spec) request issued to that link down device, > thus a long time completion/timeout waiting in interrupt context causes > continuous hard lockup warnning and system hang. > > Other detail, see every patch commit log. > > patch [3&4] were tested by yehaorong@bytedance.com on stable v6.7-rc4. > patch [1-5] passed compiling on stable v6.7-rc6. > > > change log: > v10: > - refactor qi_submit_sync() and its callers to get pci_dev instance, as > Kevin pointed out add target_flush_dev to iommu is not right. > v9: > - unify all spelling of ATS Invalidation adhere to PCIe spec per Bjorn's > suggestion. > v8: > - add a patch to break the loop for timeout device-TLB invalidation, as > Bjorn said there is possibility device just no response but not gone. > v7: > - reorder patches and revise commit log per Bjorn's guide. > - other code and commit log revise per Lukas' suggestion. > - rebased to stable v6.7-rc6. > v6: > - add two patches to break out device-TLB invalidation if device is gone. > v5: > - add a patch try to fix the rare case (surprise remove a device in > safe removal process). not work because surprise removal handling can't > re-enter when another safe removal is in process. > v4: > - move the PCI device state checking after ATS per Baolu's suggestion. > v3: > - fix commit description typo. > v2: > - revise commit[1] description part according to Lukas' suggestion. > - revise commit[2] description to clarify the issue's impact. > v1: > - https://lore.kernel.org/lkml/20231213034637.2603013-1-haifeng.zhao@ > linux.intel.com/T/ > > > Thanks, > Ethan > > > Ethan Zhao (5): > iommu/vt-d: add pci_dev parameter to qi_submit_sync and refactor > callers > iommu/vt-d: break out ATS Invalidation if target device is gone > PCI: make pci_dev_is_disconnected() helper public for other drivers > iommu/vt-d: don't issue ATS Invalidation request when device is > disconnected > iommu/vt-d: don't loop for timeout ATS Invalidation request forever > > drivers/iommu/intel/dmar.c | 55 ++++++++++++++++++++++------- > drivers/iommu/intel/iommu.c | 26 ++++---------- > drivers/iommu/intel/iommu.h | 17 +++++---- > drivers/iommu/intel/irq_remapping.c | 2 +- > drivers/iommu/intel/pasid.c | 13 +++---- > drivers/iommu/intel/svm.c | 13 ++++--- > drivers/pci/pci.h | 5 --- > include/linux/pci.h | 5 +++ > 8 files changed, 74 insertions(+), 62 deletions(-) Any new comment for this patchset ? Thanks, Ethan
On 12/29/2023 1:02 AM, Ethan Zhao wrote: > This patchset is used to fix vt-d hard lockup reported when surprise > unplug ATS capable endpoint device connects to system via PCIe switch > as following topology. > > +-[0000:15]-+-00.0 Intel Corporation Ice Lake Memory Map/VT-d > | +-00.1 Intel Corporation Ice Lake Mesh 2 PCIe > | +-00.2 Intel Corporation Ice Lake RAS > | +-00.4 Intel Corporation Device 0b23 > | \-01.0-[16-1b]----00.0-[17-1b]--+-00.0-[18]----00.0 > NVIDIA Corporation Device 2324 > | +-01.0-[19]----00.0 > Mellanox Technologies MT2910 Family [ConnectX-7] > > User brought endpoint device 19:00.0's link down by flapping it's hotplug > capable slot 17:01.0 link control register, as sequence DLLSC response, > pciehp_ist() will unload device driver and power it off, durning device > driver is unloading an iommu device-TLB invalidation (Intel VT-d spec, or > 'ATS Invalidation' in PCIe spec) request issued to that link down device, > thus a long time completion/timeout waiting in interrupt context causes > continuous hard lockup warnning and system hang. > > Other detail, see every patch commit log. > > patch [3&4] were tested by yehaorong@bytedance.com on stable v6.7-rc4. > patch [1-5] passed compiling on stable v6.7-rc6. > > > change log: > v10: > - refactor qi_submit_sync() and its callers to get pci_dev instance, as > Kevin pointed out add target_flush_dev to iommu is not right. > v9: > - unify all spelling of ATS Invalidation adhere to PCIe spec per Bjorn's > suggestion. > v8: > - add a patch to break the loop for timeout device-TLB invalidation, as > Bjorn said there is possibility device just no response but not gone. > v7: > - reorder patches and revise commit log per Bjorn's guide. > - other code and commit log revise per Lukas' suggestion. > - rebased to stable v6.7-rc6. > v6: > - add two patches to break out device-TLB invalidation if device is gone. > v5: > - add a patch try to fix the rare case (surprise remove a device in > safe removal process). not work because surprise removal handling can't > re-enter when another safe removal is in process. > v4: > - move the PCI device state checking after ATS per Baolu's suggestion. > v3: > - fix commit description typo. > v2: > - revise commit[1] description part according to Lukas' suggestion. > - revise commit[2] description to clarify the issue's impact. > v1: > - https://lore.kernel.org/lkml/20231213034637.2603013-1-haifeng.zhao@ > linux.intel.com/T/ > > > Thanks, > Ethan > > > Ethan Zhao (5): > iommu/vt-d: add pci_dev parameter to qi_submit_sync and refactor > callers > iommu/vt-d: break out ATS Invalidation if target device is gone > PCI: make pci_dev_is_disconnected() helper public for other drivers > iommu/vt-d: don't issue ATS Invalidation request when device is > disconnected > iommu/vt-d: don't loop for timeout ATS Invalidation request forever > > drivers/iommu/intel/dmar.c | 55 ++++++++++++++++++++++------- > drivers/iommu/intel/iommu.c | 26 ++++---------- > drivers/iommu/intel/iommu.h | 17 +++++---- > drivers/iommu/intel/irq_remapping.c | 2 +- > drivers/iommu/intel/pasid.c | 13 +++---- > drivers/iommu/intel/svm.c | 13 ++++--- > drivers/pci/pci.h | 5 --- > include/linux/pci.h | 5 +++ > 8 files changed, 74 insertions(+), 62 deletions(-) How aobut refactor the qi_submit_sync() and qi_check_fault() like following, combination of patch [2] iommu/vt-d: break out ATS Invalidation if target device is gone [5] iommu/vt-d: don't loop for timeout ATS Invalidation request forever but sending them in seperated patches seems better ? each of them handling different case. - fold additional errors/fault/exception handling into qi_check_fault() - the detetion of target device presence use to handle surprise removal or device died /no response. - the ITE part use to break out the timeout ATS invalidation request, use to handle the case response time of device is too long. - if passed invalid target_pdev, means this is ATS invalidation request. - no error handling change in qi_submit_sync(). Please comment. --- a/drivers/iommu/intel/dmar.c +++ b/drivers/iommu/intel/dmar.c @@ -1267,16 +1267,28 @@ static void qi_dump_fault(struct intel_iommu *iommu, u32 fault) (unsigned long long)desc->qw1); } -static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) +static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index, + pci_dev *target_pdev) { u32 fault; int head, tail; + u64 iqe_err, ice_sid; struct q_inval *qi = iommu->qi; int shift = qi_shift(iommu); if (qi->desc_status[wait_index] == QI_ABORT) return -EAGAIN; + /* + * If the ATS invalidation target device is gone this moment (surprise + * removed, died, no response) don't try this request again. this + * request will not get valid result anymore. but the request was + * already submitted to hardware and we predict to get a ITE in + * followed batch of request, if so, it will get handled then. + */ + if (target_pdev && !pci_device_is_present(target_pdev)) + return -EINVAL; + fault = readl(iommu->reg + DMAR_FSTS_REG); if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE)) qi_dump_fault(iommu, fault); @@ -1315,6 +1327,13 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) tail = readl(iommu->reg + DMAR_IQT_REG); tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH; + /* + * SID field is valid only when the ITE field is Set in FSTS_REG + * see Intel VT-d spec r4.1, section 11.4.9.9 + */ + iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG); + ice_sid = DMAR_IQER_REG_ITESID(iqe_err); + writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG); pr_info("Invalidation Time-out Error (ITE) cleared\n"); @@ -1324,6 +1343,16 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) head = (head - 2 + QI_LENGTH) % QI_LENGTH; } while (head != tail); + /* + * If got ITE, we need to check if the sid of ITE is the same as + * current ATS invalidation target device, if yes, don't try this + * request anymore, the target device has a response time beyound + * expected. 0 value of ice_sid means old device, no ice_sid value. + */ + if (target_pdev && ice_sid && ice_sid == + pci_dev_id(pci_physfn(target_pdev)) + return -ETIMEDOUT; + if (qi->desc_status[wait_index] == QI_ABORT) return -EAGAIN; } @@ -1344,7 +1373,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) * can be part of the submission but it will not be polled for completion. */ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc, - unsigned int count, unsigned long options) + unsigned int count, unsigned long options, pci_dev *target_pdev) { struct q_inval *qi = iommu->qi; s64 devtlb_start_ktime = 0; @@ -1430,7 +1459,7 @@ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc, * a deadlock where the interrupt context can wait indefinitely * for free slots in the queue. */ - rc = qi_check_fault(iommu, index, wait_index); + rc = qi_check_fault(iommu, index, wait_index, target_pdev); if (rc) break; Thanks, Ethan >
On 2024/1/15 15:58, Ethan Zhao wrote: > -static int qi_check_fault(struct intel_iommu *iommu, int index, int > wait_index) > +static int qi_check_fault(struct intel_iommu *iommu, int index, int > wait_index, > + pci_dev *target_pdev) > { > u32 fault; > int head, tail; > + u64 iqe_err, ice_sid; > struct q_inval *qi = iommu->qi; > int shift = qi_shift(iommu); > > if (qi->desc_status[wait_index] == QI_ABORT) > return -EAGAIN; > > + /* > + * If the ATS invalidation target device is gone this moment > (surprise > + * removed, died, no response) don't try this request again. this > + * request will not get valid result anymore. but the request was > + * already submitted to hardware and we predict to get a ITE in > + * followed batch of request, if so, it will get handled then. > + */ We can't leave the ITE triggered by this request for the next one, which has no context about why this happened. Perhaps move below code down to the segment that handles ITEs. Another concern is about qi_dump_fault(), which pr_err's the fault message as long as the register is set. Some faults are predictable, such as cache invalidation for surprise-removed devices. Unconditionally reporting errors with pr_err() may lead the user to believe that a more serious hardware error has occurred. Probably we can refine this part of the code as well. Others look sane to me. > + if (target_pdev && !pci_device_is_present(target_pdev)) > + return -EINVAL; > + > fault = readl(iommu->reg + DMAR_FSTS_REG); > if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE)) > qi_dump_fault(iommu, fault); > @@ -1315,6 +1327,13 @@ static int qi_check_fault(struct intel_iommu > *iommu, int index, int wait_index) > tail = readl(iommu->reg + DMAR_IQT_REG); > tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH; > > + /* > + * SID field is valid only when the ITE field is Set in > FSTS_REG > + * see Intel VT-d spec r4.1, section 11.4.9.9 > + */ > + iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG); > + ice_sid = DMAR_IQER_REG_ITESID(iqe_err); > + > writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG); > pr_info("Invalidation Time-out Error (ITE) cleared\n"); > > @@ -1324,6 +1343,16 @@ static int qi_check_fault(struct intel_iommu > *iommu, int index, int wait_index) > head = (head - 2 + QI_LENGTH) % QI_LENGTH; > } while (head != tail); > > + /* > + * If got ITE, we need to check if the sid of ITE is the > same as > + * current ATS invalidation target device, if yes, don't > try this > + * request anymore, the target device has a response > time beyound > + * expected. 0 value of ice_sid means old device, no > ice_sid value. > + */ > + if (target_pdev && ice_sid && ice_sid == > + pci_dev_id(pci_physfn(target_pdev)) > + return -ETIMEDOUT; > + > if (qi->desc_status[wait_index] == QI_ABORT) > return -EAGAIN; > } Best regards, baolu
On 1/17/2024 11:24 AM, Baolu Lu wrote: > On 2024/1/15 15:58, Ethan Zhao wrote: >> -static int qi_check_fault(struct intel_iommu *iommu, int index, int >> wait_index) >> +static int qi_check_fault(struct intel_iommu *iommu, int index, int >> wait_index, >> + pci_dev *target_pdev) >> { >> u32 fault; >> int head, tail; >> + u64 iqe_err, ice_sid; >> struct q_inval *qi = iommu->qi; >> int shift = qi_shift(iommu); >> >> if (qi->desc_status[wait_index] == QI_ABORT) >> return -EAGAIN; >> >> + /* >> + * If the ATS invalidation target device is gone this moment >> (surprise >> + * removed, died, no response) don't try this request again. >> this >> + * request will not get valid result anymore. but the request >> was >> + * already submitted to hardware and we predict to get a ITE in >> + * followed batch of request, if so, it will get handled then. >> + */ > > We can't leave the ITE triggered by this request for the next one, which > has no context about why this happened. Perhaps move below code down to > the segment that handles ITEs. Here, the invalidation request has been issued to hardware but target device gone, we can't loop and wait for the ITE for this request to happen, and we bail out here because we hold lock_irqsave lock , the ITE still could happen with later batch request in the future, though it is not triggered by that request, but it could still be cleaned/handled. move it to the fault() segment ?,there means ITE already happened, no need to check target presence anymore. did I miss something about the context lost ? > > Another concern is about qi_dump_fault(), which pr_err's the fault > message as long as the register is set. Some faults are predictable, > such as cache invalidation for surprise-removed devices. Unconditionally > reporting errors with pr_err() may lead the user to believe that a more > serious hardware error has occurred. Probably we can refine this part of > the code as well. Agree, may refine them in seperated series ? loop and always retry IQE, ICE don't make sense per my understanding. if IQE happened retry it will always reproduce the fault, because request is the same. we could fix them together in other patches. Thanks, Ethan > > Others look sane to me. > >> + if (target_pdev && !pci_device_is_present(target_pdev)) >> + return -EINVAL; >> + >> fault = readl(iommu->reg + DMAR_FSTS_REG); >> if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE)) >> qi_dump_fault(iommu, fault); >> @@ -1315,6 +1327,13 @@ static int qi_check_fault(struct intel_iommu >> *iommu, int index, int wait_index) >> tail = readl(iommu->reg + DMAR_IQT_REG); >> tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH; >> >> + /* >> + * SID field is valid only when the ITE field is Set >> in FSTS_REG >> + * see Intel VT-d spec r4.1, section 11.4.9.9 >> + */ >> + iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG); >> + ice_sid = DMAR_IQER_REG_ITESID(iqe_err); >> + >> writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG); >> pr_info("Invalidation Time-out Error (ITE) cleared\n"); >> >> @@ -1324,6 +1343,16 @@ static int qi_check_fault(struct intel_iommu >> *iommu, int index, int wait_index) >> head = (head - 2 + QI_LENGTH) % QI_LENGTH; >> } while (head != tail); >> >> + /* >> + * If got ITE, we need to check if the sid of ITE is >> the same as >> + * current ATS invalidation target device, if yes, >> don't try this >> + * request anymore, the target device has a response >> time beyound >> + * expected. 0 value of ice_sid means old device, no >> ice_sid value. >> + */ >> + if (target_pdev && ice_sid && ice_sid == >> + pci_dev_id(pci_physfn(target_pdev)) >> + return -ETIMEDOUT; >> + >> if (qi->desc_status[wait_index] == QI_ABORT) >> return -EAGAIN; >> } > > Best regards, > baolu
On 1/17/2024 1:26 PM, Ethan Zhao wrote: > > On 1/17/2024 11:24 AM, Baolu Lu wrote: >> On 2024/1/15 15:58, Ethan Zhao wrote: >>> -static int qi_check_fault(struct intel_iommu *iommu, int index, int >>> wait_index) >>> +static int qi_check_fault(struct intel_iommu *iommu, int index, int >>> wait_index, >>> + pci_dev *target_pdev) >>> { >>> u32 fault; >>> int head, tail; >>> + u64 iqe_err, ice_sid; >>> struct q_inval *qi = iommu->qi; >>> int shift = qi_shift(iommu); >>> >>> if (qi->desc_status[wait_index] == QI_ABORT) >>> return -EAGAIN; >>> >>> + /* >>> + * If the ATS invalidation target device is gone this moment >>> (surprise >>> + * removed, died, no response) don't try this request again. >>> this >>> + * request will not get valid result anymore. but the >>> request was >>> + * already submitted to hardware and we predict to get a ITE in >>> + * followed batch of request, if so, it will get handled then. >>> + */ >> >> We can't leave the ITE triggered by this request for the next one, which >> has no context about why this happened. Perhaps move below code down to >> the segment that handles ITEs. > > Here, the invalidation request has been issued to hardware but target > device > > gone, we can't loop and wait for the ITE for this request to happen, > and we > > bail out here because we hold lock_irqsave lock , the ITE still could > happen > > with later batch request in the future, though it is not triggered by > that request, > > but it could still be cleaned/handled. move it to the fault() segment > ?,there means That moment, the ITE was triggered by previous requests, they are not in current context, also shouldn't be retried, they have response time over expected. but triggered ITE fault blocks this patch request, we should retry this batch request. we just clean the fault and retry it. nothing missed. Thanks, Ethan > > ITE already happened, no need to check target presence anymore. > > did I miss something about the context lost ? > >> >> Another concern is about qi_dump_fault(), which pr_err's the fault >> message as long as the register is set. Some faults are predictable, >> such as cache invalidation for surprise-removed devices. Unconditionally >> reporting errors with pr_err() may lead the user to believe that a more >> serious hardware error has occurred. Probably we can refine this part of >> the code as well. > > Agree, may refine them in seperated series ? > > loop and always retry IQE, ICE don't make sense per my understanding. if > > IQE happened retry it will always reproduce the fault, because request > is the same. > > we could fix them together in other patches. > > > Thanks, > > Ethan > >> >> Others look sane to me. >> >>> + if (target_pdev && !pci_device_is_present(target_pdev)) >>> + return -EINVAL; >>> + >>> fault = readl(iommu->reg + DMAR_FSTS_REG); >>> if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE)) >>> qi_dump_fault(iommu, fault); >>> @@ -1315,6 +1327,13 @@ static int qi_check_fault(struct intel_iommu >>> *iommu, int index, int wait_index) >>> tail = readl(iommu->reg + DMAR_IQT_REG); >>> tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH; >>> >>> + /* >>> + * SID field is valid only when the ITE field is Set >>> in FSTS_REG >>> + * see Intel VT-d spec r4.1, section 11.4.9.9 >>> + */ >>> + iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG); >>> + ice_sid = DMAR_IQER_REG_ITESID(iqe_err); >>> + >>> writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG); >>> pr_info("Invalidation Time-out Error (ITE) >>> cleared\n"); >>> >>> @@ -1324,6 +1343,16 @@ static int qi_check_fault(struct intel_iommu >>> *iommu, int index, int wait_index) >>> head = (head - 2 + QI_LENGTH) % QI_LENGTH; >>> } while (head != tail); >>> >>> + /* >>> + * If got ITE, we need to check if the sid of ITE is >>> the same as >>> + * current ATS invalidation target device, if yes, >>> don't try this >>> + * request anymore, the target device has a response >>> time beyound >>> + * expected. 0 value of ice_sid means old device, no >>> ice_sid value. >>> + */ >>> + if (target_pdev && ice_sid && ice_sid == >>> + pci_dev_id(pci_physfn(target_pdev)) >>> + return -ETIMEDOUT; >>> + >>> if (qi->desc_status[wait_index] == QI_ABORT) >>> return -EAGAIN; >>> } >> >> Best regards, >> baolu >
On 1/17/2024 1:38 PM, Ethan Zhao wrote: > > On 1/17/2024 1:26 PM, Ethan Zhao wrote: >> >> On 1/17/2024 11:24 AM, Baolu Lu wrote: >>> On 2024/1/15 15:58, Ethan Zhao wrote: >>>> -static int qi_check_fault(struct intel_iommu *iommu, int index, >>>> int wait_index) >>>> +static int qi_check_fault(struct intel_iommu *iommu, int index, >>>> int wait_index, >>>> + pci_dev *target_pdev) >>>> { >>>> u32 fault; >>>> int head, tail; >>>> + u64 iqe_err, ice_sid; >>>> struct q_inval *qi = iommu->qi; >>>> int shift = qi_shift(iommu); >>>> >>>> if (qi->desc_status[wait_index] == QI_ABORT) >>>> return -EAGAIN; >>>> >>>> + /* >>>> + * If the ATS invalidation target device is gone this >>>> moment (surprise >>>> + * removed, died, no response) don't try this request >>>> again. this >>>> + * request will not get valid result anymore. but the >>>> request was >>>> + * already submitted to hardware and we predict to get a >>>> ITE in >>>> + * followed batch of request, if so, it will get handled then. >>>> + */ >>> >>> We can't leave the ITE triggered by this request for the next one, >>> which >>> has no context about why this happened. Perhaps move below code down to >>> the segment that handles ITEs. >> >> Here, the invalidation request has been issued to hardware but target >> device >> >> gone, we can't loop and wait for the ITE for this request to happen, >> and we >> >> bail out here because we hold lock_irqsave lock , the ITE still could >> happen >> >> with later batch request in the future, though it is not triggered >> by that request, >> >> but it could still be cleaned/handled. move it to the fault() segment >> ?,there means > > That moment, the ITE was triggered by previous requests, they are not > in current > > context, also shouldn't be retried, they have response time over > expected. > > but triggered ITE fault blocks this patch request, we should retry > this batch request. > > we just clean the fault and retry it. nothing missed. > > > Thanks, > > Ethan > >> >> ITE already happened, no need to check target presence anymore. >> >> did I miss something about the context lost ? >> >>> >>> Another concern is about qi_dump_fault(), which pr_err's the fault >>> message as long as the register is set. Some faults are predictable, >>> such as cache invalidation for surprise-removed devices. >>> Unconditionally >>> reporting errors with pr_err() may lead the user to believe that a more >>> serious hardware error has occurred. Probably we can refine this >>> part of >>> the code as well. >> >> Agree, may refine them in seperated series ? >> >> loop and always retry IQE, ICE don't make sense per my >> understanding. if >> >> IQE happened retry it will always reproduce the fault, because >> request is the same. >> >> we could fix them together in other patches. >> >> >> Thanks, >> >> Ethan >> >>> >>> Others look sane to me. >>> >>>> + if (target_pdev && !pci_device_is_present(target_pdev)) >>>> + return -EINVAL; >>>> + >>>> fault = readl(iommu->reg + DMAR_FSTS_REG); >>>> if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE)) >>>> qi_dump_fault(iommu, fault); >>>> @@ -1315,6 +1327,13 @@ static int qi_check_fault(struct intel_iommu >>>> *iommu, int index, int wait_index) >>>> tail = readl(iommu->reg + DMAR_IQT_REG); >>>> tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH; >>>> >>>> + /* >>>> + * SID field is valid only when the ITE field is >>>> Set in FSTS_REG >>>> + * see Intel VT-d spec r4.1, section 11.4.9.9 >>>> + */ >>>> + iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG); >>>> + ice_sid = DMAR_IQER_REG_ITESID(iqe_err); >>>> + >>>> writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG); >>>> pr_info("Invalidation Time-out Error (ITE) >>>> cleared\n"); >>>> >>>> @@ -1324,6 +1343,16 @@ static int qi_check_fault(struct intel_iommu >>>> *iommu, int index, int wait_index) >>>> head = (head - 2 + QI_LENGTH) % QI_LENGTH; >>>> } while (head != tail); >>>> >>>> + /* >>>> + * If got ITE, we need to check if the sid of ITE >>>> is the same as >>>> + * current ATS invalidation target device, if yes, >>>> don't try this >>>> + * request anymore, the target device has a >>>> response time beyound >>>> + * expected. 0 value of ice_sid means old device, >>>> no ice_sid value. >>>> + */ >>>> + if (target_pdev && ice_sid && ice_sid == >>>> + pci_dev_id(pci_physfn(target_pdev)) >>>> + return -ETIMEDOUT; >>>> + >>>> if (qi->desc_status[wait_index] == QI_ABORT) >>>> return -EAGAIN; >>>> } >>> >>> Best regards, >>> baolu >> > new proposed qi_check_fault() change as following: - remove the lines of if (qi->desc_status[wait_index] == QI_ABORT) return -EAGAIN; reason see the comment in the code. - others, the same as last proposed code. --- a/drivers/iommu/intel/dmar.c +++ b/drivers/iommu/intel/dmar.c @@ -1267,15 +1267,33 @@ static void qi_dump_fault(struct intel_iommu *iommu, u32 fault) (unsigned long long)desc->qw1); } -static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) +static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index, + pci_dev *target_pdev) { u32 fault; int head, tail; + u64 iqe_err, ice_sid; struct q_inval *qi = iommu->qi; int shift = qi_shift(iommu); + /* + * If the inv_wait_dsc got QI_ABORT here, means an ITE happens, retry + * this batch of invalidation request without clearing the ITE fault + * will be trapped into dead loop. remov this line. + * see Intel VT-d spec r4.1 sec 6.5.2.10, 6.5.2.8 + * + * if (qi->desc_status[wait_index] == QI_ABORT) + * return -EAGAIN; + */ - if (qi->desc_status[wait_index] == QI_ABORT) - return -EAGAIN; + /* + * If the ATS invalidation target device is gone this moment (surprise + * removed, died, no response) don't try this request again. this + * request will not get valid result anymore. but the request was + * already submitted to hardware and we predict to get a ITE in + * followed batch of request, if so, it will get handled then. + */ + if (target_pdev && !pci_device_is_present(target_pdev)) + return -EINVAL; fault = readl(iommu->reg + DMAR_FSTS_REG); if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE)) @@ -1315,6 +1333,13 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) tail = readl(iommu->reg + DMAR_IQT_REG); tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH; + /* + * SID field is valid only when the ITE field is Set in FSTS_REG + * see Intel VT-d spec r4.1, section 11.4.9.9 + */ + iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG); + ice_sid = DMAR_IQER_REG_ITESID(iqe_err); + writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG); pr_info("Invalidation Time-out Error (ITE) cleared\n"); @@ -1324,6 +1349,16 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) head = (head - 2 + QI_LENGTH) % QI_LENGTH; } while (head != tail); + /* + * If got ITE, we need to check if the sid of ITE is the same as + * current ATS invalidation target device, if yes, don't try this + * request anymore, the target device has a response time beyound + * expected. 0 value of ice_sid means old device, no ice_sid value. + */ + if (target_pdev && ice_sid && ice_sid == + pci_dev_id(pci_physfn(target_pdev)) + return -ETIMEDOUT; + if (qi->desc_status[wait_index] == QI_ABORT) return -EAGAIN; } @@ -1344,7 +1379,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index) * can be part of the submission but it will not be polled for completion. */ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc, - unsigned int count, unsigned long options) + unsigned int count, unsigned long options, pci_dev *target_pdev) { struct q_inval *qi = iommu->qi; s64 devtlb_start_ktime = 0; @@ -1430,7 +1465,7 @@ int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc, * a deadlock where the interrupt context can wait indefinitely * for free slots in the queue. */ - rc = qi_check_fault(iommu, index, wait_index); + rc = qi_check_fault(iommu, index, wait_index, target_pdev); if (rc) break; Thanks, Ethan
On 1/17/24 5:00 PM, Ethan Zhao wrote: > + /* > + * If the ATS invalidation target device is gone this moment > (surprise > + * removed, died, no response) don't try this request again. this > + * request will not get valid result anymore. but the request was > + * already submitted to hardware and we predict to get a ITE in > + * followed batch of request, if so, it will get handled then. > + */ > + if (target_pdev && !pci_device_is_present(target_pdev)) > + return -EINVAL; Again, we should not ignore the error triggered by the current request. Do not leave it to the next one. The WAIT descriptor is a fence. Handle everything within its boundary. Best regards, baolu
On 1/18/2024 8:46 AM, Baolu Lu wrote: > On 1/17/24 5:00 PM, Ethan Zhao wrote: >> + /* >> + * If the ATS invalidation target device is gone this moment >> (surprise >> + * removed, died, no response) don't try this request again. >> this >> + * request will not get valid result anymore. but the request >> was >> + * already submitted to hardware and we predict to get a ITE in >> + * followed batch of request, if so, it will get handled then. >> + */ >> + if (target_pdev && !pci_device_is_present(target_pdev)) >> + return -EINVAL; > > Again, we should not ignore the error triggered by the current request. > Do not leave it to the next one. The WAIT descriptor is a fence. Handle > everything within its boundary. We didn't set fence bit to every ATS invalidation wait descriptor, only the intel_drain_pasid_prq() queue a drain page requests with FN sit, but that is not called in hotplug removal path. Thanks, Ethan > > Best regards, > baolu
On 1/18/2024 10:26 AM, Ethan Zhao wrote: > > On 1/18/2024 8:46 AM, Baolu Lu wrote: >> On 1/17/24 5:00 PM, Ethan Zhao wrote: >>> + /* >>> + * If the ATS invalidation target device is gone this moment >>> (surprise >>> + * removed, died, no response) don't try this request again. >>> this >>> + * request will not get valid result anymore. but the >>> request was >>> + * already submitted to hardware and we predict to get a ITE in >>> + * followed batch of request, if so, it will get handled then. >>> + */ >>> + if (target_pdev && !pci_device_is_present(target_pdev)) >>> + return -EINVAL; >> >> Again, we should not ignore the error triggered by the current request. >> Do not leave it to the next one. The WAIT descriptor is a fence. Handle >> everything within its boundary. > > We didn't set fence bit to every ATS invalidation wait descriptor, > > only the intel_drain_pasid_prq() queue a drain page requests with FN > > sit, but that is not called in hotplug removal path. So, in fact so far, it doesn't act as a fence except the intel_drain_pasid_prq() , and it never handle everthing within its border. Thanks, Ethan > > > Thanks, > > Ethan > > > >> >> Best regards, >> baolu >