[rdma-rc] RDMA/bnxt_re: Disable atomic support on VFs

Message ID 1630037738-20276-1-git-send-email-selvin.xavier@broadcom.com (mailing list archive)
State Superseded

Commit Message

Selvin Xavier Aug. 27, 2021, 4:15 a.m. UTC
The following host crash is observed when pci_enable_atomic_ops_to_root()
is called with a VF PCI device.

PID: 4481   TASK: ffff89c6941b0000  CPU: 53  COMMAND: "bash"
 #0 [ffff9a94817136d8] machine_kexec at ffffffffb90601a4
 #1 [ffff9a9481713728] __crash_kexec at ffffffffb9190d5d
 #2 [ffff9a94817137f0] crash_kexec at ffffffffb9191c4d
 #3 [ffff9a9481713808] oops_end at ffffffffb9025cd6
 #4 [ffff9a9481713828] page_fault_oops at ffffffffb906e417
 #5 [ffff9a9481713888] exc_page_fault at ffffffffb9a0ad14
 #6 [ffff9a94817138b0] asm_exc_page_fault at ffffffffb9c00ace
    [exception RIP: pcie_capability_read_dword+28]
    RIP: ffffffffb952fd5c  RSP: ffff9a9481713960  RFLAGS: 00010246
    RAX: 0000000000000001  RBX: ffff89c6b1096000  RCX: 0000000000000000
    RDX: ffff9a9481713990  RSI: 0000000000000024  RDI: 0000000000000000
    RBP: 0000000000000080   R8: 0000000000000008   R9: ffff89c64341a2f8
    R10: 0000000000000002  R11: 0000000000000000  R12: ffff89c648bab000
    R13: 0000000000000000  R14: 0000000000000000  R15: ffff89c648bab0c8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff9a9481713988] pci_enable_atomic_ops_to_root at ffffffffb95359a6
 #8 [ffff9a94817139c0] bnxt_qplib_determine_atomics at ffffffffc08c1a33 [bnxt_re]
 #9 [ffff9a94817139d0] bnxt_re_dev_init at ffffffffc08ba2d1 [bnxt_re]
    RIP: 00007f450602f648  RSP: 00007ffe880869e8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000002  RCX: 00007f450602f648
    RDX: 0000000000000002  RSI: 0000555c566c4a60  RDI: 0000000000000001
    RBP: 0000555c566c4a60   R8: 000000000000000a   R9: 00007f45060c2580
    R10: 000000000000000a  R11: 0000000000000246  R12: 00007f45063026e0
    R13: 0000000000000002  R14: 00007f45062fd880  R15: 0000000000000002
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

To avoid the system crash when VFs are created, enable atomics only for the
PF for now.

Fixes: 35f5ace5dea4 ("RDMA/bnxt_re: Enable global atomic ops if platform supports")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
---
 drivers/infiniband/hw/bnxt_re/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jason Gunthorpe Aug. 27, 2021, 12:31 p.m. UTC | #1
On Thu, Aug 26, 2021 at 09:15:38PM -0700, Selvin Xavier wrote:
> Following Host crash is observed when pci_enable_atomic_ops_to_root
> is called with VF PCI device.

This feels like a bug in pci_enable_atomic_ops_to_root()? I assume it
hit a case where bus->self == NULL?

Why not fix it there?

Jason
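
The failure mode suspected above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `mock_dev`, `mock_bus` and `mock_atomics_to_root()` are hypothetical stand-ins, not the kernel's `struct pci_dev`, `struct pci_bus` or `pci_enable_atomic_ops_to_root()`. It shows a walk from a device's bus towards the root that returns an error when a bus has no bridge device (`self == NULL`) instead of dereferencing it. In the trace, RDI is 0 and RSI is 0x24 at the faulting `pcie_capability_read_dword`, which would be consistent with a NULL device pointer being passed in, although the thread does not state this explicitly.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_bus;

/* Hypothetical stand-in for a PCI device; not the kernel's struct pci_dev. */
struct mock_dev {
	bool routes_atomics;   /* models a bridge that can route AtomicOps */
	struct mock_bus *bus;  /* bus the device sits on */
};

/* Hypothetical stand-in for a PCI bus; not the kernel's struct pci_bus. */
struct mock_bus {
	struct mock_bus *parent; /* NULL at the root bus */
	struct mock_dev *self;   /* bridge leading to this bus; may be NULL */
};

/*
 * Walk from the device's bus towards the root and check every bridge.
 * Returns 0 if all bridges route AtomicOps, -EINVAL otherwise.
 */
static int mock_atomics_to_root(const struct mock_dev *dev)
{
	const struct mock_bus *bus = dev->bus;

	while (bus->parent) {
		const struct mock_dev *bridge = bus->self;

		/* Without this check, a bus with no bridge is a NULL dereference. */
		if (!bridge)
			return -EINVAL;
		if (!bridge->routes_atomics)
			return -EINVAL;
		bus = bus->parent;
	}
	return 0;
}
```

With a guard of this kind in the core code, a caller sees an error return rather than a crash, whichever driver makes the call.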
Selvin Xavier Aug. 31, 2021, 3:57 p.m. UTC | #2
On Fri, Aug 27, 2021 at 6:01 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 26, 2021 at 09:15:38PM -0700, Selvin Xavier wrote:
> > Following Host crash is observed when pci_enable_atomic_ops_to_root
> > is called with VF PCI device.
>
Apologies for the delay in my response. I was checking internally
whether this is specific to a particular adapter/host. I see the
problem on multiple systems.

> This feels like a bug in pci_enable_atomic_ops_to_root()? I assume it
> hit a case where bus->self == NULL?
Yes. This crashes because bus->self is NULL. Is that expected for a VF?
>
> Why not fix it there?
Since it's a functional breakage in 5.14, I posted a quick fix for
5.14. Also, we haven't done any testing on VFs for this feature, so
I wanted to avoid claiming support for VFs anyway.

I see that other drivers also use pci_enable_atomic_ops_to_root()
without a VF/PF check. Is anyone else seeing this issue?
>
> Jason
Jason Gunthorpe Sept. 1, 2021, 11:50 a.m. UTC | #3
On Tue, Aug 31, 2021 at 09:27:14PM +0530, Selvin Xavier wrote:
> On Fri, Aug 27, 2021 at 6:01 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Aug 26, 2021 at 09:15:38PM -0700, Selvin Xavier wrote:
> > > Following Host crash is observed when pci_enable_atomic_ops_to_root
> > > is called with VF PCI device.
> >
> Apologies for the delay in my response.  I was exploring internally to
> see if it is a specific issue
> with the adapter/host. I see the problem in multiple systems.
> 
> > This feels like a bug in pci_enable_atomic_ops_to_root()? I assume it
> > hit a case where bus->self == NULL?
> yes. This crashes because of bus->self is NULL. Is it expected for VF?

I'm not sure; you should ask the PCI lists.

> > Why not fix it there?
> Since its a functional breakage in 5.14, I posted a quick fix for
> 5.14. Also, we haven't done any testing on VF for this
> feature. So I wanted to avoid claiming support for VF anyway.
> 
> I see that other drivers also use pci_enable_atomic_ops_to_root
> without vf/pf check. Anyone seeing this issue?

Which is why I suspect the core code should be fixed, not the driver.

Jason
Selvin Xavier Sept. 16, 2021, 3:05 p.m. UTC | #4
On Wed, Sep 1, 2021 at 5:20 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Aug 31, 2021 at 09:27:14PM +0530, Selvin Xavier wrote:
> > On Fri, Aug 27, 2021 at 6:01 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Thu, Aug 26, 2021 at 09:15:38PM -0700, Selvin Xavier wrote:
> > > > Following Host crash is observed when pci_enable_atomic_ops_to_root
> > > > is called with VF PCI device.
> > >
> > Apologies for the delay in my response.  I was exploring internally to
> > see if it is a specific issue
> > with the adapter/host. I see the problem in multiple systems.
> >
> > > This feels like a bug in pci_enable_atomic_ops_to_root()? I assume it
> > > hit a case where bus->self == NULL?
> > yes. This crashes because of bus->self is NULL. Is it expected for VF?
>
> I'm not sure, you should ask the PCI lists
>
> > > Why not fix it there?
> > Since its a functional breakage in 5.14, I posted a quick fix for
> > 5.14. Also, we haven't done any testing on VF for this
> > feature. So I wanted to avoid claiming support for VF anyway.
> >
> > I see that other drivers also use pci_enable_atomic_ops_to_root
> > without vf/pf check. Anyone seeing this issue?
>
> Which is why I suspect the core code should be fixed not the driver..
Hi Jason,
A patch that avoids the crash is merged to the linux-pci tree.
https://lore.kernel.org/linux-pci/20210914201606.GA1452219@bjorn-Precision-5520/T/
With the PCI patch, the host will not crash, but the driver will log
the following message when it is called for a VF:
"platform doesn't support global atomics."

We want to avoid calling pci_enable_atomic_ops_to_root() for VFs
anyway. Can you please pull this patch into bnxt_re?

Thanks
Selvin

>
> Jason
Jason Gunthorpe Sept. 16, 2021, 3:09 p.m. UTC | #5
On Thu, Sep 16, 2021 at 08:35:37PM +0530, Selvin Xavier wrote:

> Hi Jason,
> A patch that avoids the crash is merged to the linux-pci tree.
> https://lore.kernel.org/linux-pci/20210914201606.GA1452219@bjorn-Precision-5520/T/
> With the pci patch, the host will not crash. But driver will get
> following error message when called for VF
> "platform doesn't support global atomics."
> 
> we want to prevent calling pci_enable_atomic_ops_to_root for VF
> anyway. Can you please pull this patch in bnxt_re?

It doesn't work like this; you have to wait until v5.16 for all the
trees to be harmonized. You should take care of it in your internal
testing tree in the interim.

Jason
Patch

diff --git a/drivers/infiniband/hw/bnxt_re/main.c b/drivers/infiniband/hw/bnxt_re/main.c
index 4678bd6..04d5c7d 100644
--- a/drivers/infiniband/hw/bnxt_re/main.c
+++ b/drivers/infiniband/hw/bnxt_re/main.c
@@ -129,7 +129,7 @@  static int bnxt_re_setup_chip_ctx(struct bnxt_re_dev *rdev, u8 wqe_mode)
 	rdev->rcfw.res = &rdev->qplib_res;
 
 	bnxt_re_set_drv_mode(rdev, wqe_mode);
-	if (bnxt_qplib_determine_atomics(en_dev->pdev))
+	if (!BNXT_VF(bp) && bnxt_qplib_determine_atomics(en_dev->pdev))
 		ibdev_info(&rdev->ibdev,
 			   "platform doesn't support global atomics.");
 	return 0;
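
The one-line change relies on `&&` short-circuiting: when the function is a VF, the left operand is false and the atomics probe is never called. The userspace sketch below models only that control flow. `mock_determine_atomics()` and `mock_setup_chip_ctx()` are hypothetical stand-ins for `bnxt_qplib_determine_atomics()` and `bnxt_re_setup_chip_ctx()`, and the nonzero-means-unsupported convention is inferred from the `if` in the patch rather than from the driver source.

```c
#include <stdbool.h>

/* Counts how often the (mock) probe runs, to make the short-circuit visible. */
static int probe_calls;

/*
 * Hypothetical stand-in for bnxt_qplib_determine_atomics().
 * Assumed convention: nonzero means global atomics are not supported.
 */
static int mock_determine_atomics(bool platform_ok)
{
	probe_calls++;
	return platform_ok ? 0 : -1;
}

/*
 * Models the patched condition. Returns true when the
 * "platform doesn't support global atomics." message would be logged.
 */
static bool mock_setup_chip_ctx(bool is_vf, bool platform_ok)
{
	/* && short-circuits, so the probe is skipped entirely for a VF. */
	if (!is_vf && mock_determine_atomics(platform_ok))
		return true;
	return false;
}
```

In this model a VF neither runs the probe nor logs the message, while a PF behaves as it did before the change.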