Message ID | 20190410123455.26818.49424.stgit@scvm10.sc.intel.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Series | Allow drivers to configure AER registers |
Hi Dennis,

On Wed, Apr 10, 2019 at 05:35:01AM -0700, Dennis Dalessandro wrote:
> From: Kamenee Arumugam <kamenee.arumugam@intel.com>
>
> For hfi1, the unsupported request error is not considered a fatal
> error. When the PCIe advanced error reporting capability (AER) is
> configured to report unsupported requests as fatal, the system will
> hang on this error.

I know there are a few drivers that fiddle with AER bits, but that
makes me a little bit nervous because error handling is more than just
a driver issue. It involves the PCI core and the platform firmware as
well.

Anyway, let's figure out more about this particular case.

Unsupported Request is a PCIe protocol-level issue. You're masking it
in the HFI adapter, which I guess means you want to prevent it from
reporting UR. So the HFI is receiving a TLP that it doesn't support?

What exactly is causing the UR? Is it something the driver could
potentially avoid, e.g., an AtomicOp that HFI doesn't support? I have
a vague notion that InfiniBand allows some sort of direct user-space
access to hardware; is there something there that can cause a UR?

The system hang sounds like a separate problem that should also be
fixed. Even if HFI signals a UR error, I would not expect a system
hang.

Bjorn

> Set Unsupported Request Error bit in Uncorrectable Error Mask
> register to disable error reporting to the PCIe root complex.
>
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
> ---
>  drivers/infiniband/hw/hfi1/pcie.c | 1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
> index c96d193..a033e28 100644
> --- a/drivers/infiniband/hw/hfi1/pcie.c
> +++ b/drivers/infiniband/hw/hfi1/pcie.c
> @@ -114,6 +114,7 @@ int hfi1_pcie_init(struct hfi1_devdata *dd)
>  	}
>
>  	pci_set_master(pdev);
> +	pcie_aer_set_dword(pdev, PCI_ERR_UNCOR_MASK, PCI_ERR_UNC_UNSUP);
>  	(void)pci_enable_pcie_error_reporting(pdev);
>  	return 0;
>
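As a rough illustration of what masking means here, compared with the fatal/non-fatal severity setting the commit message mentions, the following user-space C sketch models how a single AER uncorrectable error bit is treated. It is a simplified model, not kernel code: `classify_unc_error` and the `aer_outcome` enum are hypothetical names, and the only value taken from the patch is the Unsupported Request bit (0x00100000, the value of `PCI_ERR_UNC_UNSUP`).

```c
#include <assert.h>
#include <stdint.h>

/* Unsupported Request bit in the AER Uncorrectable Error Status, Mask and
 * Severity registers; same value as PCI_ERR_UNC_UNSUP in the Linux headers. */
#define UNC_UNSUP 0x00100000u

/* Hypothetical names used only for this sketch. */
enum aer_outcome {
	AER_NOT_SET,   /* the error did not occur */
	AER_MASKED,    /* occurred, but is not reported upstream */
	AER_NONFATAL,  /* reported as a non-fatal uncorrectable error */
	AER_FATAL      /* reported as a fatal uncorrectable error */
};

/* Simplified model: a set mask bit suppresses reporting; otherwise the
 * severity bit selects fatal (1) or non-fatal (0). */
static enum aer_outcome classify_unc_error(uint32_t status, uint32_t mask,
					   uint32_t severity, uint32_t bit)
{
	assert(bit != 0);

	if (!(status & bit))
		return AER_NOT_SET;
	if (mask & bit)
		return AER_MASKED;
	return (severity & bit) ? AER_FATAL : AER_NONFATAL;
}
```

In this model, the configuration described in the commit message (UR reported as fatal) corresponds to the severity bit being set with the mask bit clear; the patch sets the mask bit, so the outcome becomes "masked" rather than "non-fatal".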
-----Original Message-----
From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Bjorn Helgaas
Sent: Wednesday, April 10, 2019 3:30 PM
To: Dalessandro, Dennis <dennis.dalessandro@intel.com>
Cc: jgg@ziepe.ca; linux-rdma@vger.kernel.org; linux-pci@vger.kernel.org; Ruhl, Michael J <michael.j.ruhl@intel.com>; dledford@redhat.com; Arumugam, Kamenee <kamenee.arumugam@intel.com>
Subject: Re: [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal

Hi Bjorn,

> I know there are a few drivers that fiddle with AER bits, but that
> makes me a little bit nervous because error handling is more than just
> a driver issue. It involves the PCI core and the platform firmware as
> well.

> Anyway, let's figure out more about this particular case. Unsupported
> Request is a PCIe protocol-level issue. You're masking it in the HFI
> adapter, which I guess means you want to prevent it from reporting UR.
> So the HFI is receiving a TLP that it doesn't support?

Yes, the HFI is receiving a TLP that triggers an Unsupported Request error.

> What exactly is causing the UR? Is it something the driver could
> potentially avoid, e.g., an AtomicOp that HFI doesn't support? I have
> a vague notion that InfiniBand allows some sort of direct user-space
> access to hardware; is there something there that can cause a UR?

The HFI PCIe BARs are mapped to user space to implement kernel bypass
for MPI/PSM jobs. In this case, a user-level application is making
spurious read accesses (invalid-width accesses) to this memory mapping,
causing the device to report an Unsupported Request error through AER.
The spurious read accesses may be due to errant application behavior
(e.g., reading beyond the end of an array).

> The system hang sounds like a separate problem that should also be
> fixed. Even if HFI signals a UR error, I would not expect a system
> hang.

We haven't root-caused the system hang, but it doesn't appear to be
related to our driver.

>> Set Unsupported Request Error bit in Uncorrectable Error Mask register
>> to disable error reporting to the PCIe root complex.
>>
>> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
>> Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
>> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
>> ---
>>  drivers/infiniband/hw/hfi1/pcie.c | 1 +
>>  1 files changed, 1 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/infiniband/hw/hfi1/pcie.c
>> b/drivers/infiniband/hw/hfi1/pcie.c
>> index c96d193..a033e28 100644
>> --- a/drivers/infiniband/hw/hfi1/pcie.c
>> +++ b/drivers/infiniband/hw/hfi1/pcie.c
>> @@ -114,6 +114,7 @@ int hfi1_pcie_init(struct hfi1_devdata *dd)
>>  	}
>>
>>  	pci_set_master(pdev);
>> +	pcie_aer_set_dword(pdev, PCI_ERR_UNCOR_MASK, PCI_ERR_UNC_UNSUP);
>>  	(void)pci_enable_pcie_error_reporting(pdev);
>>  	return 0;
>>
On Thu, Apr 11, 2019 at 06:22:45PM +0000, Arumugam, Kamenee wrote:
> > What exactly is causing the UR? Is it something the driver could
> > potentially avoid, e.g., an AtomicOp that HFI doesn't support? I
> > have a vague notion that InfiniBand allows some sort of direct
> > user-space access to hardware; is there something there that can
> > cause a UR?
>
> HFI PCIe BAR are mapped to user space to implement kernel bypass for
> MPI/PSM jobs. In this case, user-level application is making
> spurious read accesses (invalid width access) to this memory mapping
> causing the device to report an unsupported request error through
> AER. The spurious read accesses may be due to errant application
> behavior (e.g. reading beyond the end of an array).

This is a device bug then.

An RDMA device must accept and respond to all TLPs that the CPU could
create for the user-accessible BAR pages.

A user process must not be able to crash the CPU or make the device
malfunction by accessing the exposed BAR page. This includes a broad
range of topics, like misaligned accesses, SSE instructions, atomics,
etc.

Is blocking AER even enough here? If the device isn't generating a
reasonable reply I have a bad feeling worse will happen.

Jason
-----Original Message-----
From: Jason Gunthorpe [mailto:jgg@ziepe.ca]
Sent: Thursday, April 11, 2019 2:30 PM
To: Arumugam, Kamenee <kamenee.arumugam@intel.com>
Cc: Dalessandro, Dennis <dennis.dalessandro@intel.com>; bhelgaas@google.com; linux-rdma@vger.kernel.org; linux-pci@vger.kernel.org; Ruhl, Michael J <michael.j.ruhl@intel.com>; dledford@redhat.com
Subject: Re: [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal

On Thu, Apr 11, 2019 at 06:22:45PM +0000, Arumugam, Kamenee wrote:

> This is a device bug then.

> A RDMA device must accept and respond to all TLPs that the CPU could
> create for the user accessible BAR pages.

> A user process must not be able to crash the CPU or make the device
> malfunction by accessing the exposed BAR page. This includes a broad
> range of topics, like mis-aligned accesses, SSE instructions, atomics,
> etc.

> Is blocking AER even enough here? If the device isn't generating a
> reasonable reply I have a bad feeling worse will happen.

After blocking the Unsupported Request error, we don't see any other
issues, and the system no longer hangs.
On Thu, Apr 11, 2019 at 08:37:53PM +0000, Arumugam, Kamenee wrote:
> On Thu, Apr 11, 2019 at 06:22:45PM +0000, Arumugam, Kamenee wrote:
>
> > This is a device bug then.
>
> > A RDMA device must accept and respond to all TLPs that the CPU could
> > create for the user accessible BAR pages.
>
> > A user process must not be able to crash the CPU or make the device
> > malfunction by accessing the exposed BAR page. This includes a broad
> > range of topics, like mis-aligned accesses, SSE instructions,
> > atomics, etc.
>
> > Is blocking AER even enough here? If the device isn't generating a
> > reasonable reply I have a bad feeling worse will happen.
>
> After blocking unsupported request error, we don't see any other issue
> including no system hang.

Are you specifically testing all the special TLPs the CPU can produce?

Jason
On 4/12/2019 9:55 AM, Jason Gunthorpe wrote:
> On Thu, Apr 11, 2019 at 08:37:53PM +0000, Arumugam, Kamenee wrote:
>> On Thu, Apr 11, 2019 at 06:22:45PM +0000, Arumugam, Kamenee wrote:
>>
>>> This is a device bug then.
>>
>>> A RDMA device must accept and respond to all TLPs that the CPU could
>>> create for the user accessible BAR pages.
>>
>>> A user process must not be able to crash the CPU or make the device
>>> malfunction by accessing the exposed BAR page. This includes a broad
>>> range of topics, like mis-aligned accesses, SSE instructions,
>>> atomics, etc.
>>
>>> Is blocking AER even enough here? If the device isn't generating a
>>> reasonable reply I have a bad feeling worse will happen.
>>
>> After blocking unsupported request error, we don't see any other
>> issue including no system hang.
>
> Are you specifically testing all the special TLPs the CPU can produce?

All the special TLPs should have been tested. This, however, seems to
be a missed test case. That is not too surprising: given differences in
BIOS and things of that nature, something can fall through the cracks
and be extra hard to find.

-Denny
On Mon, Apr 15, 2019 at 02:47:01PM -0400, Dennis Dalessandro wrote:
> On 4/12/2019 9:55 AM, Jason Gunthorpe wrote:
> > On Thu, Apr 11, 2019 at 08:37:53PM +0000, Arumugam, Kamenee wrote:
> > > On Thu, Apr 11, 2019 at 06:22:45PM +0000, Arumugam, Kamenee wrote:
> > >
> > > > This is a device bug then.
> > >
> > > > A RDMA device must accept and respond to all TLPs that the CPU
> > > > could create for the user accessible BAR pages.
> > >
> > > > A user process must not be able to crash the CPU or make the
> > > > device malfunction by accessing the exposed BAR page. This
> > > > includes a broad range of topics, like mis-aligned accesses,
> > > > SSE instructions, atomics, etc.
> > >
> > > > Is blocking AER even enough here? If the device isn't
> > > > generating a reasonable reply I have a bad feeling worse will
> > > > happen.
> > >
> > > After blocking unsupported request error, we don't see any other
> > > issue including no system hang.
> >
> > Are you specifically testing all the special TLPs the CPU can
> > produce?
>
> All the special TLPs should have been tested. This however seems to
> be a missed test case. Not that surprising though given differences
> in BIOS and things of that nature that something falls through the
> cracks and is extra hard to find.

Is there a published erratum for this?

I don't have warm fuzzies yet that we actually know the root cause
here. Kamenee said the problem case was:

  user-level application is making spurious read accesses (invalid
  width access) to this memory mapping causing the device to report an
  unsupported request error through AER.

So I guess that means the application performed a read and got invalid
data back? I think the Root Complex had to supply *some* data to
complete the CPU's read, and since the HFI responded with UR instead of
data, the RC probably fabricated something. Many RCs fabricate ~0, but
I don't think that's actually required by the spec, so I'm doubtful
that the application can reliably detect this.

I'd be really surprised that something as obvious as an invalid width
wasn't tested, especially if this is intended for direct mapping into
user applications.

Bjorn
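Bjorn's doubt about detecting a failed read from the returned data can be illustrated with a small model in plain C. Everything here is hypothetical: `model_read` stands in for a CPU read of a mapped BAR, `rc_fill` is whatever value a given Root Complex fabricates when the device answers with UR, and `read_may_have_failed` is the all-ones heuristic an application might try. The sketch only shows why that heuristic is ambiguous; it makes no claim about what any particular Root Complex does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All-ones pattern that many, but per the discussion not necessarily all,
 * Root Complexes return for a read the device did not complete. */
#define ALL_ONES 0xFFFFFFFFu

/* Hypothetical model of a 32-bit read: if the device answers with
 * Unsupported Request, the caller sees the RC's fabricated value. */
static uint32_t model_read(uint32_t device_value, bool device_returns_ur,
			   uint32_t rc_fill)
{
	return device_returns_ur ? rc_fill : device_value;
}

/* Hypothetical application-side heuristic: treat all ones as a failure. */
static bool read_may_have_failed(uint32_t value)
{
	return value == ALL_ONES;
}

/* Walks through the three cases that matter for the argument. */
static void check_all_ones_heuristic(void)
{
	/* UR with an RC that fabricates ~0: the heuristic fires. */
	assert(read_may_have_failed(model_read(0x1234u, true, ALL_ONES)));

	/* Successful read of a register that really holds ~0: false alarm. */
	assert(read_may_have_failed(model_read(ALL_ONES, false, ALL_ONES)));

	/* UR with an RC that fabricates 0 instead: failure goes unnoticed. */
	assert(!read_may_have_failed(model_read(0x1234u, true, 0u)));
}
```

The second and third cases are the reason the returned data alone is not a dependable signal once the UR report itself has been masked.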
diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
index c96d193..a033e28 100644
--- a/drivers/infiniband/hw/hfi1/pcie.c
+++ b/drivers/infiniband/hw/hfi1/pcie.c
@@ -114,6 +114,7 @@ int hfi1_pcie_init(struct hfi1_devdata *dd)
 	}
 
 	pci_set_master(pdev);
+	pcie_aer_set_dword(pdev, PCI_ERR_UNCOR_MASK, PCI_ERR_UNC_UNSUP);
 	(void)pci_enable_pcie_error_reporting(pdev);
 	return 0;
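The one-line change relies on `pcie_aer_set_dword()`, which is not defined in this patch and is presumably added elsewhere in the series. The sketch below does not reproduce that helper. It is a user-space model, under the assumption that the operation amounts to a read-modify-write that ORs the given bits into a 32-bit register of the AER capability. `struct fake_aer_cap` and the `cap_*` functions are hypothetical; the offset 0x08 and the bit 0x00100000 match `PCI_ERR_UNCOR_MASK` and `PCI_ERR_UNC_UNSUP`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values mirror PCI_ERR_UNCOR_MASK (offset within the AER capability)
 * and PCI_ERR_UNC_UNSUP (Unsupported Request bit). */
#define ERR_UNCOR_MASK 0x08u
#define ERR_UNC_UNSUP  0x00100000u

/* Hypothetical in-memory stand-in for a device's AER capability. */
struct fake_aer_cap {
	uint8_t regs[0x40];
};

/* Read one aligned 32-bit register from the fake capability. */
static uint32_t cap_read32(const struct fake_aer_cap *cap, size_t off)
{
	uint32_t val;

	assert(off % 4 == 0 && off + 4 <= sizeof(cap->regs));
	memcpy(&val, cap->regs + off, sizeof(val));
	return val;
}

/* Write one aligned 32-bit register in the fake capability. */
static void cap_write32(struct fake_aer_cap *cap, size_t off, uint32_t val)
{
	assert(off % 4 == 0 && off + 4 <= sizeof(cap->regs));
	memcpy(cap->regs + off, &val, sizeof(val));
}

/* Read-modify-write: set the requested bits, keep all other bits. */
static void cap_set_bits32(struct fake_aer_cap *cap, size_t off, uint32_t bits)
{
	cap_write32(cap, off, cap_read32(cap, off) | bits);
}
```

Calling `cap_set_bits32(&cap, ERR_UNCOR_MASK, ERR_UNC_UNSUP)` on a zeroed `struct fake_aer_cap` leaves only the Unsupported Request bit set in the mask register; bits that were already set there are preserved, which matters because firmware or the PCI core may have configured other mask bits.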