diff mbox series

[for-next,2/2] IB/hfi1: Make Unsupported Request error non-fatal

Message ID 20190410123455.26818.49424.stgit@scvm10.sc.intel.com (mailing list archive)
State Not Applicable
Headers show
Series Allow drivers to configure AER registers | expand

Commit Message

Dennis Dalessandro April 10, 2019, 12:35 p.m. UTC
From: Kamenee Arumugam <kamenee.arumugam@intel.com>

For hfi1, the unsupported request error is not considered a fatal
error. When the PCIe advanced error reporting capability (AER) is
configured to report unsupported requests as fatal, the system will
hang on this error.

Set Unsupported Request Error bit in Uncorrectable Error Mask
register to disable error reporting to the PCIe root complex.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/pcie.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

Comments

Bjorn Helgaas April 10, 2019, 7:29 p.m. UTC | #1
Hi Dennis,

On Wed, Apr 10, 2019 at 05:35:01AM -0700, Dennis Dalessandro wrote:
> From: Kamenee Arumugam <kamenee.arumugam@intel.com>
> 
> For hfi1, the unsupported request error is not considered a fatal
> error. When the PCIe advanced error reporting capability (AER) is
> configured to report unsupported requests as fatal, the system will
> hang on this error.

I know there are a few drivers that fiddle with AER bits, but that
makes me a little bit nervous because error handling is more than just
a driver issue.  It involves the PCI core and the platform firmware as
well.

Anyway, let's figure out more about this particular case.  Unsupported
Request is a PCIe protocol-level issue.  You're masking it in the HFI
adapter, which I guess means you want to prevent it from reporting UR.
So the HFI is receiving a TLP that it doesn't support?

What exactly is causing the UR?  Is it something the driver could
potentially avoid, e.g., an AtomicOp that HFI doesn't support?  I have
a vague notion that InfiniBand allows some sort of direct user-space
access to hardware; is there something there that can cause a UR?

The system hang sounds like a separate problem that should also be
fixed.  Even if HFI signals a UR error, I would not expect a system
hang.

Bjorn

> Set Unsupported Request Error bit in Uncorrectable Error Mask
> register to disable error reporting to the PCIe root complex.
> 
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
> ---
>  drivers/infiniband/hw/hfi1/pcie.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
> index c96d193..a033e28 100644
> --- a/drivers/infiniband/hw/hfi1/pcie.c
> +++ b/drivers/infiniband/hw/hfi1/pcie.c
> @@ -114,6 +114,7 @@ int hfi1_pcie_init(struct hfi1_devdata *dd)
>  	}
>  
>  	pci_set_master(pdev);
> +	pcie_aer_set_dword(pdev, PCI_ERR_UNCOR_MASK, PCI_ERR_UNC_UNSUP);
>  	(void)pci_enable_pcie_error_reporting(pdev);
>  	return 0;
>  
>
Arumugam, Kamenee April 11, 2019, 6:22 p.m. UTC | #2
-----Original Message-----
From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Bjorn Helgaas
Sent: Wednesday, April 10, 2019 3:30 PM
To: Dalessandro, Dennis <dennis.dalessandro@intel.com>
Cc: jgg@ziepe.ca; linux-rdma@vger.kernel.org; linux-pci@vger.kernel.org; Ruhl, Michael J <michael.j.ruhl@intel.com>; dledford@redhat.com; Arumugam, Kamenee <kamenee.arumugam@intel.com>
Subject: Re: [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal

Hi Bjorn,

> I know there are a few drivers that fiddle with AER bits, but that makes me a little bit nervous because error handling is more than just a > driver issue.  It involves the PCI core and the platform firmware as well.

> Anyway, let's figure out more about this particular case.  Unsupported 
> Request is a PCIe protocol-level issue.  You're masking it in the HFI adapter, which I guess means you want to prevent it from reporting UR.
> So the HFI is receiving a TLP that it doesn't support?

Yes, HFI is receiving a TLP with unsupported request error.

> What exactly is causing the UR?  Is it something the driver could potentially avoid, e.g., an AtomicOp that HFI doesn't support?  I have a > vague notion that InfiniBand allows some sort of direct user-space access to hardware; is there something there that can cause a UR?

HFI PCIe BAR are mapped to user space to implement kernel bypass for MPI/PSM jobs. In this case, user-level application is making spurious read accesses (invalid width access) to this memory mapping causing the device to report an unsupported request error through AER. The spurious read accesses may be due to errant application behavior (e.g. reading beyond the end of an array).

> The system hang sounds like a separate problem that should also be fixed.  Even if HFI signals a UR error, I would not expect a system > > hang.
 
We haven't root cause the system hang but it doesn't appear to be related to our driver.


>> Set Unsupported Request Error bit in Uncorrectable Error Mask register 
>> to disable error reporting to the PCIe root complex.
>> 
>> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
>> Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
>> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
>> ---
>>  drivers/infiniband/hw/hfi1/pcie.c |    1 +
>>  1 files changed, 1 insertions(+), 0 deletions(-)
>> 
>> diff --git a/drivers/infiniband/hw/hfi1/pcie.c
>> b/drivers/infiniband/hw/hfi1/pcie.c
>> index c96d193..a033e28 100644
>> --- a/drivers/infiniband/hw/hfi1/pcie.c
>> +++ b/drivers/infiniband/hw/hfi1/pcie.c
>> @@ -114,6 +114,7 @@ int hfi1_pcie_init(struct hfi1_devdata *dd)
>>  	}
>>  
>>  	pci_set_master(pdev);
>> +	pcie_aer_set_dword(pdev, PCI_ERR_UNCOR_MASK, PCI_ERR_UNC_UNSUP);
>>  	(void)pci_enable_pcie_error_reporting(pdev);
>>  	return 0;
>>  
>>
Jason Gunthorpe April 11, 2019, 6:29 p.m. UTC | #3
On Thu, Apr 11, 2019 at 06:22:45PM +0000, Arumugam, Kamenee wrote:

> > What exactly is causing the UR?  Is it something the driver could
> > potentially avoid, e.g., an AtomicOp that HFI doesn't support?  I
> > have a > vague notion that InfiniBand allows some sort of direct
> > user-space access to hardware; is there something there that can
> > cause a UR?
> 
> HFI PCIe BAR are mapped to user space to implement kernel bypass for
> MPI/PSM jobs. In this case, user-level application is making
> spurious read accesses (invalid width access) to this memory mapping
> causing the device to report an unsupported request error through
> AER. The spurious read accesses may be due to errant application
> behavior (e.g. reading beyond the end of an array).

This is a device bug then. 

A RDMA device must accept and respond to all TLPs that the CPU could
create for the user accessible BAR pages.

A user process must not be able to crash the CPU or make the device
malfunction by accessing the exposed BAR page. This includes a broad
range of topics, like mis-aligned acceses, SSE instructions, atomics,
etc.

Is blocking AER even enough here? If the device isn't generating a
reasonable reply I have a bad feeling worse will happen.

Jason
Arumugam, Kamenee April 11, 2019, 8:37 p.m. UTC | #4
-----Original Message-----
From: Jason Gunthorpe [mailto:jgg@ziepe.ca] 
Sent: Thursday, April 11, 2019 2:30 PM
To: Arumugam, Kamenee <kamenee.arumugam@intel.com>
Cc: Dalessandro, Dennis <dennis.dalessandro@intel.com>; bhelgaas@google.com; linux-rdma@vger.kernel.org; linux-pci@vger.kernel.org; Ruhl, Michael J <michael.j.ruhl@intel.com>; dledford@redhat.com
Subject: Re: [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal

On Thu, Apr 11, 2019 at 06:22:45PM +0000, Arumugam, Kamenee wrote:

> This is a device bug then. 

> A RDMA device must accept and respond to all TLPs that the CPU could create for the user accessible BAR pages.

> A user process must not be able to crash the CPU or make the device malfunction by accessing the exposed BAR page. This includes a broad range of topics, like mis-aligned acceses, SSE instructions, atomics, >etc.

> Is blocking AER even enough here? If the device isn't generating a reasonable reply I have a bad feeling worse will happen.

After blocking unsupported request error, we don't see any other issue including no system hang.
Jason Gunthorpe April 12, 2019, 1:55 p.m. UTC | #5
On Thu, Apr 11, 2019 at 08:37:53PM +0000, Arumugam, Kamenee wrote:

> On Thu, Apr 11, 2019 at 06:22:45PM +0000, Arumugam, Kamenee wrote:
> 
> > This is a device bug then. 
> 
> > A RDMA device must accept and respond to all TLPs that the CPU could create for the user accessible BAR pages.
> 
> > A user process must not be able to crash the CPU or make the device malfunction by accessing the exposed BAR page. This includes a broad range of topics, like mis-aligned acceses, SSE instructions, atomics, >etc.
> 
> > Is blocking AER even enough here? If the device isn't generating a reasonable reply I have a bad feeling worse will happen.
> 
> After blocking unsupported request error, we don't see any other issue including no system hang. 

Are you specifically testing all the special TLPs the CPU can produce?

Jason
Dennis Dalessandro April 15, 2019, 6:47 p.m. UTC | #6
On 4/12/2019 9:55 AM, Jason Gunthorpe wrote:
> On Thu, Apr 11, 2019 at 08:37:53PM +0000, Arumugam, Kamenee wrote:
> 
>> On Thu, Apr 11, 2019 at 06:22:45PM +0000, Arumugam, Kamenee wrote:
>>
>>> This is a device bug then.
>>
>>> A RDMA device must accept and respond to all TLPs that the CPU could create for the user accessible BAR pages.
>>
>>> A user process must not be able to crash the CPU or make the device malfunction by accessing the exposed BAR page. This includes a broad range of topics, like mis-aligned acceses, SSE instructions, atomics, >etc.
>>
>>> Is blocking AER even enough here? If the device isn't generating a reasonable reply I have a bad feeling worse will happen.
>>
>> After blocking unsupported request error, we don't see any other issue including no system hang.
> 
> Are you specifically testing all the special TLPs the CPU can produce?

All the special TLPs should have been tested. This however seems to be a 
missed test case. Not that surprising though given differences in BIOS 
and things of that nature that something falls through the cracks and is 
extra hard to find.

-Denny
Bjorn Helgaas April 15, 2019, 9:46 p.m. UTC | #7
On Mon, Apr 15, 2019 at 02:47:01PM -0400, Dennis Dalessandro wrote:
> On 4/12/2019 9:55 AM, Jason Gunthorpe wrote:
> > On Thu, Apr 11, 2019 at 08:37:53PM +0000, Arumugam, Kamenee wrote:
> > > On Thu, Apr 11, 2019 at 06:22:45PM +0000, Arumugam, Kamenee wrote:
> > > 
> > > > This is a device bug then.
> > > 
> > > > A RDMA device must accept and respond to all TLPs that the CPU
> > > > could create for the user accessible BAR pages.
> > > 
> > > > A user process must not be able to crash the CPU or make the
> > > > device malfunction by accessing the exposed BAR page. This
> > > > includes a broad range of topics, like mis-aligned acceses,
> > > > SSE instructions, atomics, >etc.
> > > 
> > > > Is blocking AER even enough here? If the device isn't
> > > > generating a reasonable reply I have a bad feeling worse will
> > > > happen.
> > > 
> > > After blocking unsupported request error, we don't see any other
> > > issue including no system hang.
> > 
> > Are you specifically testing all the special TLPs the CPU can
> > produce?
> 
> All the special TLPs should have been tested. This however seems to
> be a missed test case. Not that surprising though given differences
> in BIOS and things of that nature that something falls through the
> cracks and is extra hard to find.

Is there a published erratum for this?  I don't have warm fuzzies yet
that we actually know the root cause here.

Kamenee said the problem case was:

  user-level application is making spurious read accesses (invalid
  width access) to this memory mapping causing the device to report an
  unsupported request error through AER.

So I guess that means the application performed a read and got invalid
data back?  I think the Root Complex had to supply *some* data to
complete the CPU's read, and since the HFI responded with UR instead
of data, the RC probably fabricated something.  Many RCs fabricate ~0,
but I don't think that's actually required by the spec, so I'm
doubtful that the application can reliably detect this.

I'd be really surprised that something as obvious as an invalid width
wasn't tested, especially if this is intended for direct mapping into
user applications.

Bjorn
diff mbox series

Patch

diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
index c96d193..a033e28 100644
--- a/drivers/infiniband/hw/hfi1/pcie.c
+++ b/drivers/infiniband/hw/hfi1/pcie.c
@@ -114,6 +114,7 @@  int hfi1_pcie_init(struct hfi1_devdata *dd)
 	}
 
 	pci_set_master(pdev);
+	pcie_aer_set_dword(pdev, PCI_ERR_UNCOR_MASK, PCI_ERR_UNC_UNSUP);
 	(void)pci_enable_pcie_error_reporting(pdev);
 	return 0;