[RFC,v10,0/5] fix vt-d hard lockup when hotplug ATS capable device

Message ID	20231228170206.720675-1-haifeng.zhao@linux.intel.com (mailing list archive)
Headers	show Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.7]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4E051F50E; Thu, 28 Dec 2023 17:02:13 +0000 (UTC) From: Ethan Zhao <haifeng.zhao@linux.intel.com> To: kevin.tian@intel.com, bhelgaas@google.com, baolu.lu@linux.intel.com, dwmw2@infradead.org, will@kernel.org, robin.murphy@arm.com, lukas@wunner.de Cc: linux-pci@vger.kernel.org, iommu@lists.linux.dev, linux-kernel@vger.kernel.org Subject: [RFC PATCH v10 0/5] fix vt-d hard lockup when hotplug ATS capable device Date: Thu, 28 Dec 2023 12:02:01 -0500 Message-Id: <20231228170206.720675-1-haifeng.zhao@linux.intel.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	fix vt-d hard lockup when hotplug ATS capable device \| expand [RFC,v10,0/5] fix vt-d hard lockup when hotplug ATS capable device [RFC,v10,1/5] iommu/vt-d: add pci_dev parameter to qi_submit_sync and refactor callers [RFC,v10,2/5] iommu/vt-d: break out ATS Invalidation if target device is gone [RFC,v10,3/5] PCI: make pci_dev_is_disconnected() helper public for other drivers [RFC,v10,4/5] iommu/vt-d: don't issue ATS Invalidation request when device is disconnected [RFC,v10,5/5] iommu/vt-d: don't loop for timeout ATS Invalidation request forever

Ethan Zhao Dec. 28, 2023, 5:02 p.m. UTC

This patchset is used to fix vt-d hard lockup reported when surprise
unplug ATS capable endpoint device connects to system via PCIe switch
as following topology.                                               
                                                                    
     +-[0000:15]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d 
     |           +-00.1  Intel Corporation Ice Lake Mesh 2 PCIe     
     |           +-00.2  Intel Corporation Ice Lake RAS             
     |           +-00.4  Intel Corporation Device 0b23              
     |           \-01.0-[16-1b]----00.0-[17-1b]--+-00.0-[18]----00.0 
                                           NVIDIA Corporation Device 2324 
     |                                           +-01.0-[19]----00.0    
                          Mellanox Technologies MT2910 Family [ConnectX-7]
                                                                          
User brought endpoint device 19:00.0's link down by flapping it's hotplug 
capable slot 17:01.0 link control register, as sequence DLLSC response, 
pciehp_ist() will unload device driver and power it off, durning device 
driver is unloading an iommu device-TLB invalidation (Intel VT-d spec, or 
'ATS Invalidation' in PCIe spec) request issued to that link down device, 
thus a long time completion/timeout waiting in interrupt context causes   
continuous hard lockup warnning and system hang.                         
                                                                         
Other detail, see every patch commit log.                                
                                                                         
patch [3&4] were tested by yehaorong@bytedance.com on stable v6.7-rc4.   
patch [1-5] passed compiling on stable v6.7-rc6.                    
                                                                         
                                                                         
change log:  
v10:
- refactor qi_submit_sync() and its callers to get pci_dev instance, as
  Kevin pointed out add target_flush_dev to iommu is not right.
v9:
- unify all spelling of ATS Invalidation adhere to PCIe spec per Bjorn's
  suggestion.                                                         
v8:
- add a patch to break the loop for timeout device-TLB invalidation, as
  Bjorn said there is possibility device just no response but not gone.  
v7:                                                                      
- reorder patches and revise commit log per Bjorn's guide.              
- other code and commit log revise per Lukas' suggestion.               
- rebased to stable v6.7-rc6.                                           
v6:                                                                     
- add two patches to break out device-TLB invalidation if device is gone. 
v5:                                                                       
- add a patch try to fix the rare case (surprise remove a device in       
  safe removal process). not work because surprise removal handling can't 
  re-enter when another safe removal is in process.                       
v4:                                                                       
- move the PCI device state checking after ATS per Baolu's suggestion.    
v3:                                                                       
- fix commit description typo.                                            
v2:                                                                       
- revise commit[1] description part according to Lukas' suggestion.       
- revise commit[2] description to clarify the issue's impact.             
v1:                                                                       
- https://lore.kernel.org/lkml/20231213034637.2603013-1-haifeng.zhao@     
linux.intel.com/T/                                                        
                                                                          
                                                                          
Thanks,                                                                   
Ethan                           


Ethan Zhao (5):
  iommu/vt-d: add pci_dev parameter to qi_submit_sync and refactor
    callers
  iommu/vt-d: break out ATS Invalidation if target device is gone
  PCI: make pci_dev_is_disconnected() helper public for other drivers
  iommu/vt-d: don't issue ATS Invalidation request when device is
    disconnected
  iommu/vt-d: don't loop for timeout ATS Invalidation request forever

 drivers/iommu/intel/dmar.c          | 55 ++++++++++++++++++++++-------
 drivers/iommu/intel/iommu.c         | 26 ++++----------
 drivers/iommu/intel/iommu.h         | 17 +++++----
 drivers/iommu/intel/irq_remapping.c |  2 +-
 drivers/iommu/intel/pasid.c         | 13 +++----
 drivers/iommu/intel/svm.c           | 13 ++++---
 drivers/pci/pci.h                   |  5 ---
 include/linux/pci.h                 |  5 +++
 8 files changed, 74 insertions(+), 62 deletions(-)

Tian, Kevin Dec. 29, 2023, 8:18 a.m. UTC | #1

> From: Ethan Zhao <haifeng.zhao@linux.intel.com>
> Sent: Friday, December 29, 2023 1:02 AM
>
> change log:
> v10:
> - refactor qi_submit_sync() and its callers to get pci_dev instance, as
>   Kevin pointed out add target_flush_dev to iommu is not right.

let's not rush for new versions when there are still opens unclosed in
previous one (and considering most related folks are in vacation).

Ethan Zhao Dec. 29, 2023, 9:28 a.m. UTC | #2

On 12/29/2023 4:18 PM, Tian, Kevin wrote:
>> From: Ethan Zhao <haifeng.zhao@linux.intel.com>
>> Sent: Friday, December 29, 2023 1:02 AM
>>
>> change log:
>> v10:
>> - refactor qi_submit_sync() and its callers to get pci_dev instance, as
>>    Kevin pointed out add target_flush_dev to iommu is not right.
> let's not rush for new versions when there are still opens unclosed in
> previous one (and considering most related folks are in vacation).

Just have some hours these days to make it in well shape.

Okay, let's think, and wait for new comments.


Thanks,

Ethan

Ethan Zhao Jan. 9, 2024, 1:24 a.m. UTC | #3

On 12/29/2023 1:02 AM, Ethan Zhao wrote:
> This patchset is used to fix vt-d hard lockup reported when surprise
> unplug ATS capable endpoint device connects to system via PCIe switch
> as following topology.
>                                                                      
>       +-[0000:15]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d
>       |           +-00.1  Intel Corporation Ice Lake Mesh 2 PCIe
>       |           +-00.2  Intel Corporation Ice Lake RAS
>       |           +-00.4  Intel Corporation Device 0b23
>       |           \-01.0-[16-1b]----00.0-[17-1b]--+-00.0-[18]----00.0
>                                             NVIDIA Corporation Device 2324
>       |                                           +-01.0-[19]----00.0
>                            Mellanox Technologies MT2910 Family [ConnectX-7]
>                                                                            
> User brought endpoint device 19:00.0's link down by flapping it's hotplug
> capable slot 17:01.0 link control register, as sequence DLLSC response,
> pciehp_ist() will unload device driver and power it off, durning device
> driver is unloading an iommu device-TLB invalidation (Intel VT-d spec, or
> 'ATS Invalidation' in PCIe spec) request issued to that link down device,
> thus a long time completion/timeout waiting in interrupt context causes
> continuous hard lockup warnning and system hang.
>                                                                           
> Other detail, see every patch commit log.
>                                                                           
> patch [3&4] were tested by yehaorong@bytedance.com on stable v6.7-rc4.
> patch [1-5] passed compiling on stable v6.7-rc6.
>                                                                           
>                                                                           
> change log:
> v10:
> - refactor qi_submit_sync() and its callers to get pci_dev instance, as
>    Kevin pointed out add target_flush_dev to iommu is not right.
> v9:
> - unify all spelling of ATS Invalidation adhere to PCIe spec per Bjorn's
>    suggestion.
> v8:
> - add a patch to break the loop for timeout device-TLB invalidation, as
>    Bjorn said there is possibility device just no response but not gone.
> v7:
> - reorder patches and revise commit log per Bjorn's guide.
> - other code and commit log revise per Lukas' suggestion.
> - rebased to stable v6.7-rc6.
> v6:
> - add two patches to break out device-TLB invalidation if device is gone.
> v5:
> - add a patch try to fix the rare case (surprise remove a device in
>    safe removal process). not work because surprise removal handling can't
>    re-enter when another safe removal is in process.
> v4:
> - move the PCI device state checking after ATS per Baolu's suggestion.
> v3:
> - fix commit description typo.
> v2:
> - revise commit[1] description part according to Lukas' suggestion.
> - revise commit[2] description to clarify the issue's impact.
> v1:
> - https://lore.kernel.org/lkml/20231213034637.2603013-1-haifeng.zhao@
> linux.intel.com/T/
>                                                                            
>                                                                            
> Thanks,
> Ethan
>
>
> Ethan Zhao (5):
>    iommu/vt-d: add pci_dev parameter to qi_submit_sync and refactor
>      callers
>    iommu/vt-d: break out ATS Invalidation if target device is gone
>    PCI: make pci_dev_is_disconnected() helper public for other drivers
>    iommu/vt-d: don't issue ATS Invalidation request when device is
>      disconnected
>    iommu/vt-d: don't loop for timeout ATS Invalidation request forever
>
>   drivers/iommu/intel/dmar.c          | 55 ++++++++++++++++++++++-------
>   drivers/iommu/intel/iommu.c         | 26 ++++----------
>   drivers/iommu/intel/iommu.h         | 17 +++++----
>   drivers/iommu/intel/irq_remapping.c |  2 +-
>   drivers/iommu/intel/pasid.c         | 13 +++----
>   drivers/iommu/intel/svm.c           | 13 ++++---
>   drivers/pci/pci.h                   |  5 ---
>   include/linux/pci.h                 |  5 +++
>   8 files changed, 74 insertions(+), 62 deletions(-)

Any new comment for this patchset ?


Thanks,

Ethan

Ethan Zhao Jan. 15, 2024, 7:58 a.m. UTC | #4

On 12/29/2023 1:02 AM, Ethan Zhao wrote:
> This patchset is used to fix vt-d hard lockup reported when surprise
> unplug ATS capable endpoint device connects to system via PCIe switch
> as following topology.
>                                                                      
>       +-[0000:15]-+-00.0  Intel Corporation Ice Lake Memory Map/VT-d
>       |           +-00.1  Intel Corporation Ice Lake Mesh 2 PCIe
>       |           +-00.2  Intel Corporation Ice Lake RAS
>       |           +-00.4  Intel Corporation Device 0b23
>       |           \-01.0-[16-1b]----00.0-[17-1b]--+-00.0-[18]----00.0
>                                             NVIDIA Corporation Device 2324
>       |                                           +-01.0-[19]----00.0
>                            Mellanox Technologies MT2910 Family [ConnectX-7]
>                                                                            
> User brought endpoint device 19:00.0's link down by flapping it's hotplug
> capable slot 17:01.0 link control register, as sequence DLLSC response,
> pciehp_ist() will unload device driver and power it off, durning device
> driver is unloading an iommu device-TLB invalidation (Intel VT-d spec, or
> 'ATS Invalidation' in PCIe spec) request issued to that link down device,
> thus a long time completion/timeout waiting in interrupt context causes
> continuous hard lockup warnning and system hang.
>                                                                           
> Other detail, see every patch commit log.
>                                                                           
> patch [3&4] were tested by yehaorong@bytedance.com on stable v6.7-rc4.
> patch [1-5] passed compiling on stable v6.7-rc6.
>                                                                           
>                                                                           
> change log:
> v10:
> - refactor qi_submit_sync() and its callers to get pci_dev instance, as
>    Kevin pointed out add target_flush_dev to iommu is not right.
> v9:
> - unify all spelling of ATS Invalidation adhere to PCIe spec per Bjorn's
>    suggestion.
> v8:
> - add a patch to break the loop for timeout device-TLB invalidation, as
>    Bjorn said there is possibility device just no response but not gone.
> v7:
> - reorder patches and revise commit log per Bjorn's guide.
> - other code and commit log revise per Lukas' suggestion.
> - rebased to stable v6.7-rc6.
> v6:
> - add two patches to break out device-TLB invalidation if device is gone.
> v5:
> - add a patch try to fix the rare case (surprise remove a device in
>    safe removal process). not work because surprise removal handling can't
>    re-enter when another safe removal is in process.
> v4:
> - move the PCI device state checking after ATS per Baolu's suggestion.
> v3:
> - fix commit description typo.
> v2:
> - revise commit[1] description part according to Lukas' suggestion.
> - revise commit[2] description to clarify the issue's impact.
> v1:
> - https://lore.kernel.org/lkml/20231213034637.2603013-1-haifeng.zhao@
> linux.intel.com/T/
>                                                                            
>                                                                            
> Thanks,
> Ethan
>
>
> Ethan Zhao (5):
>    iommu/vt-d: add pci_dev parameter to qi_submit_sync and refactor
>      callers
>    iommu/vt-d: break out ATS Invalidation if target device is gone
>    PCI: make pci_dev_is_disconnected() helper public for other drivers
>    iommu/vt-d: don't issue ATS Invalidation request when device is
>      disconnected
>    iommu/vt-d: don't loop for timeout ATS Invalidation request forever
>
>   drivers/iommu/intel/dmar.c          | 55 ++++++++++++++++++++++-------
>   drivers/iommu/intel/iommu.c         | 26 ++++----------
>   drivers/iommu/intel/iommu.h         | 17 +++++----
>   drivers/iommu/intel/irq_remapping.c |  2 +-
>   drivers/iommu/intel/pasid.c         | 13 +++----
>   drivers/iommu/intel/svm.c           | 13 ++++---
>   drivers/pci/pci.h                   |  5 ---
>   include/linux/pci.h                 |  5 +++
>   8 files changed, 74 insertions(+), 62 deletions(-)

How aobut refactor the qi_submit_sync() and qi_check_fault() like

following, combination of patch

[2] iommu/vt-d: break out ATS Invalidation if target device is gone

[5] iommu/vt-d: don't loop for timeout ATS Invalidation request forever

but sending them in seperated patches seems better ? each of them

handling different case.

- fold additional errors/fault/exception handling into qi_check_fault()

- the detetion of target device presence use to handle surprise removal

  or device died /no response.

- the ITE part use to break out the timeout ATS invalidation request,

   use to handle the case response time of device is too long.

- if passed invalid target_pdev, means this is ATS invalidation request.

- no error handling change in qi_submit_sync().


Please comment.


--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1267,16 +1267,28 @@ static void qi_dump_fault(struct intel_iommu 
*iommu, u32 fault)
                (unsigned long long)desc->qw1);
  }

-static int qi_check_fault(struct intel_iommu *iommu, int index, int 
wait_index)
+static int qi_check_fault(struct intel_iommu *iommu, int index, int 
wait_index,
+                  pci_dev *target_pdev)
  {
         u32 fault;
         int head, tail;
+       u64 iqe_err, ice_sid;
         struct q_inval *qi = iommu->qi;
         int shift = qi_shift(iommu);

         if (qi->desc_status[wait_index] == QI_ABORT)
                 return -EAGAIN;

+       /*
+        * If the ATS invalidation target device is gone this moment 
(surprise
+        * removed, died, no response) don't try this request again. this
+        * request will not get valid result anymore. but the request was
+        * already submitted to hardware and we predict to get a ITE in
+        * followed batch of request, if so, it will get handled then.
+        */
+       if (target_pdev && !pci_device_is_present(target_pdev))
+               return -EINVAL;
+
         fault = readl(iommu->reg + DMAR_FSTS_REG);
         if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE))
                 qi_dump_fault(iommu, fault);
@@ -1315,6 +1327,13 @@ static int qi_check_fault(struct intel_iommu 
*iommu, int index, int wait_index)
                 tail = readl(iommu->reg + DMAR_IQT_REG);
                 tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;

+               /*
+                * SID field is valid only when the ITE field is Set in 
FSTS_REG
+                * see Intel VT-d spec r4.1, section 11.4.9.9
+                */
+               iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
+               ice_sid = DMAR_IQER_REG_ITESID(iqe_err);
+
                 writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
                 pr_info("Invalidation Time-out Error (ITE) cleared\n");

@@ -1324,6 +1343,16 @@ static int qi_check_fault(struct intel_iommu 
*iommu, int index, int wait_index)
                         head = (head - 2 + QI_LENGTH) % QI_LENGTH;
                 } while (head != tail);

+               /*
+                * If got ITE, we need to check if the sid of ITE is the 
same as
+                * current ATS invalidation target device, if yes, don't 
try this
+                * request anymore, the target device has a response 
time beyound
+                * expected. 0 value of ice_sid means old device, no 
ice_sid value.
+                */
+               if (target_pdev && ice_sid && ice_sid ==
+                   pci_dev_id(pci_physfn(target_pdev))
+                               return -ETIMEDOUT;
+
                 if (qi->desc_status[wait_index] == QI_ABORT)
                         return -EAGAIN;
         }
@@ -1344,7 +1373,7 @@ static int qi_check_fault(struct intel_iommu 
*iommu, int index, int wait_index)
   * can be part of the submission but it will not be polled for completion.
   */
  int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
-                  unsigned int count, unsigned long options)
+                  unsigned int count, unsigned long options, pci_dev 
*target_pdev)
  {
         struct q_inval *qi = iommu->qi;
         s64 devtlb_start_ktime = 0;
@@ -1430,7 +1459,7 @@ int qi_submit_sync(struct intel_iommu *iommu, 
struct qi_desc *desc,
                  * a deadlock where the interrupt context can wait 
indefinitely
                  * for free slots in the queue.
                  */
-               rc = qi_check_fault(iommu, index, wait_index);
+               rc = qi_check_fault(iommu, index, wait_index, target_pdev);
                 if (rc)
                         break;


Thanks,

Ethan

>

Baolu Lu Jan. 17, 2024, 3:24 a.m. UTC | #5

On 2024/1/15 15:58, Ethan Zhao wrote:
> -static int qi_check_fault(struct intel_iommu *iommu, int index, int 
> wait_index)
> +static int qi_check_fault(struct intel_iommu *iommu, int index, int 
> wait_index,
> +                  pci_dev *target_pdev)
>   {
>          u32 fault;
>          int head, tail;
> +       u64 iqe_err, ice_sid;
>          struct q_inval *qi = iommu->qi;
>          int shift = qi_shift(iommu);
> 
>          if (qi->desc_status[wait_index] == QI_ABORT)
>                  return -EAGAIN;
> 
> +       /*
> +        * If the ATS invalidation target device is gone this moment 
> (surprise
> +        * removed, died, no response) don't try this request again. this
> +        * request will not get valid result anymore. but the request was
> +        * already submitted to hardware and we predict to get a ITE in
> +        * followed batch of request, if so, it will get handled then.
> +        */

We can't leave the ITE triggered by this request for the next one, which
has no context about why this happened. Perhaps move below code down to
the segment that handles ITEs.

Another concern is about qi_dump_fault(), which pr_err's the fault
message as long as the register is set. Some faults are predictable,
such as cache invalidation for surprise-removed devices. Unconditionally
reporting errors with pr_err() may lead the user to believe that a more
serious hardware error has occurred. Probably we can refine this part of
the code as well.

Others look sane to me.

> +       if (target_pdev && !pci_device_is_present(target_pdev))
> +               return -EINVAL;
> +
>          fault = readl(iommu->reg + DMAR_FSTS_REG);
>          if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE))
>                  qi_dump_fault(iommu, fault);
> @@ -1315,6 +1327,13 @@ static int qi_check_fault(struct intel_iommu 
> *iommu, int index, int wait_index)
>                  tail = readl(iommu->reg + DMAR_IQT_REG);
>                  tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
> 
> +               /*
> +                * SID field is valid only when the ITE field is Set in 
> FSTS_REG
> +                * see Intel VT-d spec r4.1, section 11.4.9.9
> +                */
> +               iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
> +               ice_sid = DMAR_IQER_REG_ITESID(iqe_err);
> +
>                  writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
>                  pr_info("Invalidation Time-out Error (ITE) cleared\n");
> 
> @@ -1324,6 +1343,16 @@ static int qi_check_fault(struct intel_iommu 
> *iommu, int index, int wait_index)
>                          head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>                  } while (head != tail);
> 
> +               /*
> +                * If got ITE, we need to check if the sid of ITE is the 
> same as
> +                * current ATS invalidation target device, if yes, don't 
> try this
> +                * request anymore, the target device has a response 
> time beyound
> +                * expected. 0 value of ice_sid means old device, no 
> ice_sid value.
> +                */
> +               if (target_pdev && ice_sid && ice_sid ==
> +                   pci_dev_id(pci_physfn(target_pdev))
> +                               return -ETIMEDOUT;
> +
>                  if (qi->desc_status[wait_index] == QI_ABORT)
>                          return -EAGAIN;
>          }

Best regards,
baolu

Ethan Zhao Jan. 17, 2024, 5:26 a.m. UTC | #6

On 1/17/2024 11:24 AM, Baolu Lu wrote:
> On 2024/1/15 15:58, Ethan Zhao wrote:
>> -static int qi_check_fault(struct intel_iommu *iommu, int index, int 
>> wait_index)
>> +static int qi_check_fault(struct intel_iommu *iommu, int index, int 
>> wait_index,
>> +                  pci_dev *target_pdev)
>>   {
>>          u32 fault;
>>          int head, tail;
>> +       u64 iqe_err, ice_sid;
>>          struct q_inval *qi = iommu->qi;
>>          int shift = qi_shift(iommu);
>>
>>          if (qi->desc_status[wait_index] == QI_ABORT)
>>                  return -EAGAIN;
>>
>> +       /*
>> +        * If the ATS invalidation target device is gone this moment 
>> (surprise
>> +        * removed, died, no response) don't try this request again. 
>> this
>> +        * request will not get valid result anymore. but the request 
>> was
>> +        * already submitted to hardware and we predict to get a ITE in
>> +        * followed batch of request, if so, it will get handled then.
>> +        */
>
> We can't leave the ITE triggered by this request for the next one, which
> has no context about why this happened. Perhaps move below code down to
> the segment that handles ITEs.

Here, the invalidation request has been issued to hardware but target device

gone, we can't loop and wait for the ITE for this request to happen, and we

bail out here because we hold lock_irqsave lock , the ITE still could happen

with later batch request in the future,  though it is not triggered by 
that request,

but it could still be cleaned/handled. move it to the fault() segment 
?,there means

ITE already happened, no need to check target presence anymore.

did I miss something about the context lost ?

>
> Another concern is about qi_dump_fault(), which pr_err's the fault
> message as long as the register is set. Some faults are predictable,
> such as cache invalidation for surprise-removed devices. Unconditionally
> reporting errors with pr_err() may lead the user to believe that a more
> serious hardware error has occurred. Probably we can refine this part of
> the code as well.

Agree, may refine them in seperated series ?

loop and always retry IQE, ICE don't make sense per my understanding.  if

IQE happened retry it will always reproduce the fault, because request 
is the same.

we could fix them together in other patches.


Thanks,

Ethan

>
> Others look sane to me.
>
>> +       if (target_pdev && !pci_device_is_present(target_pdev))
>> +               return -EINVAL;
>> +
>>          fault = readl(iommu->reg + DMAR_FSTS_REG);
>>          if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE))
>>                  qi_dump_fault(iommu, fault);
>> @@ -1315,6 +1327,13 @@ static int qi_check_fault(struct intel_iommu 
>> *iommu, int index, int wait_index)
>>                  tail = readl(iommu->reg + DMAR_IQT_REG);
>>                  tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
>>
>> +               /*
>> +                * SID field is valid only when the ITE field is Set 
>> in FSTS_REG
>> +                * see Intel VT-d spec r4.1, section 11.4.9.9
>> +                */
>> +               iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
>> +               ice_sid = DMAR_IQER_REG_ITESID(iqe_err);
>> +
>>                  writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
>>                  pr_info("Invalidation Time-out Error (ITE) cleared\n");
>>
>> @@ -1324,6 +1343,16 @@ static int qi_check_fault(struct intel_iommu 
>> *iommu, int index, int wait_index)
>>                          head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>                  } while (head != tail);
>>
>> +               /*
>> +                * If got ITE, we need to check if the sid of ITE is 
>> the same as
>> +                * current ATS invalidation target device, if yes, 
>> don't try this
>> +                * request anymore, the target device has a response 
>> time beyound
>> +                * expected. 0 value of ice_sid means old device, no 
>> ice_sid value.
>> +                */
>> +               if (target_pdev && ice_sid && ice_sid ==
>> +                   pci_dev_id(pci_physfn(target_pdev))
>> +                               return -ETIMEDOUT;
>> +
>>                  if (qi->desc_status[wait_index] == QI_ABORT)
>>                          return -EAGAIN;
>>          }
>
> Best regards,
> baolu

Ethan Zhao Jan. 17, 2024, 5:38 a.m. UTC | #7

On 1/17/2024 1:26 PM, Ethan Zhao wrote:
>
> On 1/17/2024 11:24 AM, Baolu Lu wrote:
>> On 2024/1/15 15:58, Ethan Zhao wrote:
>>> -static int qi_check_fault(struct intel_iommu *iommu, int index, int 
>>> wait_index)
>>> +static int qi_check_fault(struct intel_iommu *iommu, int index, int 
>>> wait_index,
>>> +                  pci_dev *target_pdev)
>>>   {
>>>          u32 fault;
>>>          int head, tail;
>>> +       u64 iqe_err, ice_sid;
>>>          struct q_inval *qi = iommu->qi;
>>>          int shift = qi_shift(iommu);
>>>
>>>          if (qi->desc_status[wait_index] == QI_ABORT)
>>>                  return -EAGAIN;
>>>
>>> +       /*
>>> +        * If the ATS invalidation target device is gone this moment 
>>> (surprise
>>> +        * removed, died, no response) don't try this request again. 
>>> this
>>> +        * request will not get valid result anymore. but the 
>>> request was
>>> +        * already submitted to hardware and we predict to get a ITE in
>>> +        * followed batch of request, if so, it will get handled then.
>>> +        */
>>
>> We can't leave the ITE triggered by this request for the next one, which
>> has no context about why this happened. Perhaps move below code down to
>> the segment that handles ITEs.
>
> Here, the invalidation request has been issued to hardware but target 
> device
>
> gone, we can't loop and wait for the ITE for this request to happen, 
> and we
>
> bail out here because we hold lock_irqsave lock , the ITE still could 
> happen
>
> with later batch request in the future,  though it is not triggered by 
> that request,
>
> but it could still be cleaned/handled. move it to the fault() segment 
> ?,there means

That moment, the ITE was triggered by previous requests, they are not in 
current

context, also shouldn't be retried, they have response time over expected.

but triggered ITE fault blocks this patch request, we should retry this 
batch request.

we just clean the fault and retry it.  nothing missed.


Thanks,

Ethan

>
> ITE already happened, no need to check target presence anymore.
>
> did I miss something about the context lost ?
>
>>
>> Another concern is about qi_dump_fault(), which pr_err's the fault
>> message as long as the register is set. Some faults are predictable,
>> such as cache invalidation for surprise-removed devices. Unconditionally
>> reporting errors with pr_err() may lead the user to believe that a more
>> serious hardware error has occurred. Probably we can refine this part of
>> the code as well.
>
> Agree, may refine them in seperated series ?
>
> loop and always retry IQE, ICE don't make sense per my understanding.  if
>
> IQE happened retry it will always reproduce the fault, because request 
> is the same.
>
> we could fix them together in other patches.
>
>
> Thanks,
>
> Ethan
>
>>
>> Others look sane to me.
>>
>>> +       if (target_pdev && !pci_device_is_present(target_pdev))
>>> +               return -EINVAL;
>>> +
>>>          fault = readl(iommu->reg + DMAR_FSTS_REG);
>>>          if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE))
>>>                  qi_dump_fault(iommu, fault);
>>> @@ -1315,6 +1327,13 @@ static int qi_check_fault(struct intel_iommu 
>>> *iommu, int index, int wait_index)
>>>                  tail = readl(iommu->reg + DMAR_IQT_REG);
>>>                  tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
>>>
>>> +               /*
>>> +                * SID field is valid only when the ITE field is Set 
>>> in FSTS_REG
>>> +                * see Intel VT-d spec r4.1, section 11.4.9.9
>>> +                */
>>> +               iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
>>> +               ice_sid = DMAR_IQER_REG_ITESID(iqe_err);
>>> +
>>>                  writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
>>>                  pr_info("Invalidation Time-out Error (ITE) 
>>> cleared\n");
>>>
>>> @@ -1324,6 +1343,16 @@ static int qi_check_fault(struct intel_iommu 
>>> *iommu, int index, int wait_index)
>>>                          head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>>                  } while (head != tail);
>>>
>>> +               /*
>>> +                * If got ITE, we need to check if the sid of ITE is 
>>> the same as
>>> +                * current ATS invalidation target device, if yes, 
>>> don't try this
>>> +                * request anymore, the target device has a response 
>>> time beyound
>>> +                * expected. 0 value of ice_sid means old device, no 
>>> ice_sid value.
>>> +                */
>>> +               if (target_pdev && ice_sid && ice_sid ==
>>> +                   pci_dev_id(pci_physfn(target_pdev))
>>> +                               return -ETIMEDOUT;
>>> +
>>>                  if (qi->desc_status[wait_index] == QI_ABORT)
>>>                          return -EAGAIN;
>>>          }
>>
>> Best regards,
>> baolu
>

Ethan Zhao Jan. 17, 2024, 9 a.m. UTC | #8

On 1/17/2024 1:38 PM, Ethan Zhao wrote:
>
> On 1/17/2024 1:26 PM, Ethan Zhao wrote:
>>
>> On 1/17/2024 11:24 AM, Baolu Lu wrote:
>>> On 2024/1/15 15:58, Ethan Zhao wrote:
>>>> -static int qi_check_fault(struct intel_iommu *iommu, int index, 
>>>> int wait_index)
>>>> +static int qi_check_fault(struct intel_iommu *iommu, int index, 
>>>> int wait_index,
>>>> +                  pci_dev *target_pdev)
>>>>   {
>>>>          u32 fault;
>>>>          int head, tail;
>>>> +       u64 iqe_err, ice_sid;
>>>>          struct q_inval *qi = iommu->qi;
>>>>          int shift = qi_shift(iommu);
>>>>
>>>>          if (qi->desc_status[wait_index] == QI_ABORT)
>>>>                  return -EAGAIN;
>>>>
>>>> +       /*
>>>> +        * If the ATS invalidation target device is gone this 
>>>> moment (surprise
>>>> +        * removed, died, no response) don't try this request 
>>>> again. this
>>>> +        * request will not get valid result anymore. but the 
>>>> request was
>>>> +        * already submitted to hardware and we predict to get a 
>>>> ITE in
>>>> +        * followed batch of request, if so, it will get handled then.
>>>> +        */
>>>
>>> We can't leave the ITE triggered by this request for the next one, 
>>> which
>>> has no context about why this happened. Perhaps move below code down to
>>> the segment that handles ITEs.
>>
>> Here, the invalidation request has been issued to hardware but target 
>> device
>>
>> gone, we can't loop and wait for the ITE for this request to happen, 
>> and we
>>
>> bail out here because we hold lock_irqsave lock , the ITE still could 
>> happen
>>
>> with later batch request in the future,  though it is not triggered 
>> by that request,
>>
>> but it could still be cleaned/handled. move it to the fault() segment 
>> ?,there means
>
> That moment, the ITE was triggered by previous requests, they are not 
> in current
>
> context, also shouldn't be retried, they have response time over 
> expected.
>
> but triggered ITE fault blocks this patch request, we should retry 
> this batch request.
>
> we just clean the fault and retry it.  nothing missed.
>
>
> Thanks,
>
> Ethan
>
>>
>> ITE already happened, no need to check target presence anymore.
>>
>> did I miss something about the context lost ?
>>
>>>
>>> Another concern is about qi_dump_fault(), which pr_err's the fault
>>> message as long as the register is set. Some faults are predictable,
>>> such as cache invalidation for surprise-removed devices. 
>>> Unconditionally
>>> reporting errors with pr_err() may lead the user to believe that a more
>>> serious hardware error has occurred. Probably we can refine this 
>>> part of
>>> the code as well.
>>
>> Agree, may refine them in seperated series ?
>>
>> loop and always retry IQE, ICE don't make sense per my 
>> understanding.  if
>>
>> IQE happened retry it will always reproduce the fault, because 
>> request is the same.
>>
>> we could fix them together in other patches.
>>
>>
>> Thanks,
>>
>> Ethan
>>
>>>
>>> Others look sane to me.
>>>
>>>> +       if (target_pdev && !pci_device_is_present(target_pdev))
>>>> +               return -EINVAL;
>>>> +
>>>>          fault = readl(iommu->reg + DMAR_FSTS_REG);
>>>>          if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE))
>>>>                  qi_dump_fault(iommu, fault);
>>>> @@ -1315,6 +1327,13 @@ static int qi_check_fault(struct intel_iommu 
>>>> *iommu, int index, int wait_index)
>>>>                  tail = readl(iommu->reg + DMAR_IQT_REG);
>>>>                  tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
>>>>
>>>> +               /*
>>>> +                * SID field is valid only when the ITE field is 
>>>> Set in FSTS_REG
>>>> +                * see Intel VT-d spec r4.1, section 11.4.9.9
>>>> +                */
>>>> +               iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
>>>> +               ice_sid = DMAR_IQER_REG_ITESID(iqe_err);
>>>> +
>>>>                  writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
>>>>                  pr_info("Invalidation Time-out Error (ITE) 
>>>> cleared\n");
>>>>
>>>> @@ -1324,6 +1343,16 @@ static int qi_check_fault(struct intel_iommu 
>>>> *iommu, int index, int wait_index)
>>>>                          head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>>>                  } while (head != tail);
>>>>
>>>> +               /*
>>>> +                * If got ITE, we need to check if the sid of ITE 
>>>> is the same as
>>>> +                * current ATS invalidation target device, if yes, 
>>>> don't try this
>>>> +                * request anymore, the target device has a 
>>>> response time beyound
>>>> +                * expected. 0 value of ice_sid means old device, 
>>>> no ice_sid value.
>>>> +                */
>>>> +               if (target_pdev && ice_sid && ice_sid ==
>>>> +                   pci_dev_id(pci_physfn(target_pdev))
>>>> +                               return -ETIMEDOUT;
>>>> +
>>>>                  if (qi->desc_status[wait_index] == QI_ABORT)
>>>>                          return -EAGAIN;
>>>>          }
>>>
>>> Best regards,
>>> baolu
>>
>
new proposed qi_check_fault() change as following:

- remove the lines of

   if (qi->desc_status[wait_index] == QI_ABORT)

      return -EAGAIN;

   reason see the comment in the code.

- others, the same as last proposed code.


--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1267,15 +1267,33 @@ static void qi_dump_fault(struct intel_iommu 
*iommu, u32 fault)
                (unsigned long long)desc->qw1);
  }

-static int qi_check_fault(struct intel_iommu *iommu, int index, int 
wait_index)
+static int qi_check_fault(struct intel_iommu *iommu, int index, int 
wait_index,
+                  pci_dev *target_pdev)
  {
         u32 fault;
         int head, tail;
+       u64 iqe_err, ice_sid;
         struct q_inval *qi = iommu->qi;
         int shift = qi_shift(iommu);
+       /*
+        * If the inv_wait_dsc got QI_ABORT here, means an ITE happens, 
retry
+        * this batch of invalidation request without clearing the ITE fault
+        * will be trapped into dead loop. remov this line.
+        * see Intel VT-d spec r4.1 sec 6.5.2.10, 6.5.2.8
+        *
+        * if (qi->desc_status[wait_index] == QI_ABORT)
+        *      return -EAGAIN;
+        */

-       if (qi->desc_status[wait_index] == QI_ABORT)
-               return -EAGAIN;
+       /*
+        * If the ATS invalidation target device is gone this moment 
(surprise
+        * removed, died, no response) don't try this request again. this
+        * request will not get valid result anymore. but the request was
+        * already submitted to hardware and we predict to get a ITE in
+        * followed batch of request, if so, it will get handled then.
+        */
+       if (target_pdev && !pci_device_is_present(target_pdev))
+               return -EINVAL;

         fault = readl(iommu->reg + DMAR_FSTS_REG);
         if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE))
@@ -1315,6 +1333,13 @@ static int qi_check_fault(struct intel_iommu 
*iommu, int index, int wait_index)
                 tail = readl(iommu->reg + DMAR_IQT_REG);
                 tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;

+               /*
+                * SID field is valid only when the ITE field is Set in 
FSTS_REG
+                * see Intel VT-d spec r4.1, section 11.4.9.9
+                */
+               iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
+               ice_sid = DMAR_IQER_REG_ITESID(iqe_err);
+
                 writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
                 pr_info("Invalidation Time-out Error (ITE) cleared\n");

@@ -1324,6 +1349,16 @@ static int qi_check_fault(struct intel_iommu 
*iommu, int index, int wait_index)
                         head = (head - 2 + QI_LENGTH) % QI_LENGTH;
                 } while (head != tail);

+               /*
+                * If got ITE, we need to check if the sid of ITE is the 
same as
+                * current ATS invalidation target device, if yes, don't 
try this
+                * request anymore, the target device has a response 
time beyound
+                * expected. 0 value of ice_sid means old device, no 
ice_sid value.
+                */
+               if (target_pdev && ice_sid && ice_sid ==
+                   pci_dev_id(pci_physfn(target_pdev))
+                       return -ETIMEDOUT;
+
                 if (qi->desc_status[wait_index] == QI_ABORT)
                         return -EAGAIN;
         }
@@ -1344,7 +1379,7 @@ static int qi_check_fault(struct intel_iommu 
*iommu, int index, int wait_index)
   * can be part of the submission but it will not be polled for completion.
   */
  int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc,
-                  unsigned int count, unsigned long options)
+                  unsigned int count, unsigned long options, pci_dev 
*target_pdev)
  {
         struct q_inval *qi = iommu->qi;
         s64 devtlb_start_ktime = 0;
@@ -1430,7 +1465,7 @@ int qi_submit_sync(struct intel_iommu *iommu, 
struct qi_desc *desc,
                  * a deadlock where the interrupt context can wait 
indefinitely
                  * for free slots in the queue.
                  */
-               rc = qi_check_fault(iommu, index, wait_index);
+               rc = qi_check_fault(iommu, index, wait_index, target_pdev);
                 if (rc)
                         break;

Thanks,

Ethan

Baolu Lu Jan. 18, 2024, 12:46 a.m. UTC | #9

On 1/17/24 5:00 PM, Ethan Zhao wrote:
> +       /*
> +        * If the ATS invalidation target device is gone this moment 
> (surprise
> +        * removed, died, no response) don't try this request again. this
> +        * request will not get valid result anymore. but the request was
> +        * already submitted to hardware and we predict to get a ITE in
> +        * followed batch of request, if so, it will get handled then.
> +        */
> +       if (target_pdev && !pci_device_is_present(target_pdev))
> +               return -EINVAL;

Again, we should not ignore the error triggered by the current request.
Do not leave it to the next one. The WAIT descriptor is a fence. Handle
everything within its boundary.

Best regards,
baolu

Ethan Zhao Jan. 18, 2024, 2:26 a.m. UTC | #10

On 1/18/2024 8:46 AM, Baolu Lu wrote:
> On 1/17/24 5:00 PM, Ethan Zhao wrote:
>> +       /*
>> +        * If the ATS invalidation target device is gone this moment 
>> (surprise
>> +        * removed, died, no response) don't try this request again. 
>> this
>> +        * request will not get valid result anymore. but the request 
>> was
>> +        * already submitted to hardware and we predict to get a ITE in
>> +        * followed batch of request, if so, it will get handled then.
>> +        */
>> +       if (target_pdev && !pci_device_is_present(target_pdev))
>> +               return -EINVAL;
>
> Again, we should not ignore the error triggered by the current request.
> Do not leave it to the next one. The WAIT descriptor is a fence. Handle
> everything within its boundary.

  We didn't set fence bit to every ATS invalidation wait descriptor,

only the intel_drain_pasid_prq() queue a drain page requests with FN

sit, but that is not called in hotplug removal path.


Thanks,

Ethan



>
> Best regards,
> baolu

Ethan Zhao Jan. 18, 2024, 2:32 a.m. UTC | #11

On 1/18/2024 10:26 AM, Ethan Zhao wrote:
>
> On 1/18/2024 8:46 AM, Baolu Lu wrote:
>> On 1/17/24 5:00 PM, Ethan Zhao wrote:
>>> +       /*
>>> +        * If the ATS invalidation target device is gone this moment 
>>> (surprise
>>> +        * removed, died, no response) don't try this request again. 
>>> this
>>> +        * request will not get valid result anymore. but the 
>>> request was
>>> +        * already submitted to hardware and we predict to get a ITE in
>>> +        * followed batch of request, if so, it will get handled then.
>>> +        */
>>> +       if (target_pdev && !pci_device_is_present(target_pdev))
>>> +               return -EINVAL;
>>
>> Again, we should not ignore the error triggered by the current request.
>> Do not leave it to the next one. The WAIT descriptor is a fence. Handle
>> everything within its boundary.
>
>  We didn't set fence bit to every ATS invalidation wait descriptor,
>
> only the intel_drain_pasid_prq() queue a drain page requests with FN
>
> sit, but that is not called in hotplug removal path.

So, in fact so far, it doesn't act as a fence except the 
intel_drain_pasid_prq() ,

and it never handle everthing within its border.


Thanks,

Ethan

>
>
> Thanks,
>
> Ethan
>
>
>
>>
>> Best regards,
>> baolu
>

[RFC,v10,0/5] fix vt-d hard lockup when hotplug ATS capable device

Message

Comments