Message ID | 1630037738-20276-1-git-send-email-selvin.xavier@broadcom.com (mailing list archive) |
---|---|
State | Superseded |
Series | [rdma-rc] RDMA/bnxt_re: Disable atomic support on VFs |
On Thu, Aug 26, 2021 at 09:15:38PM -0700, Selvin Xavier wrote:
> Following Host crash is observed when pci_enable_atomic_ops_to_root
> is called with VF PCI device.
>
> PID: 4481 TASK: ffff89c6941b0000 CPU: 53 COMMAND: "bash"
> #0 [ffff9a94817136d8] machine_kexec at ffffffffb90601a4
> #1 [ffff9a9481713728] __crash_kexec at ffffffffb9190d5d
> #2 [ffff9a94817137f0] crash_kexec at ffffffffb9191c4d
> #3 [ffff9a9481713808] oops_end at ffffffffb9025cd6
> #4 [ffff9a9481713828] page_fault_oops at ffffffffb906e417
> #5 [ffff9a9481713888] exc_page_fault at ffffffffb9a0ad14
> #6 [ffff9a94817138b0] asm_exc_page_fault at ffffffffb9c00ace
> [exception RIP: pcie_capability_read_dword+28]
> RIP: ffffffffb952fd5c RSP: ffff9a9481713960 RFLAGS: 00010246
> RAX: 0000000000000001 RBX: ffff89c6b1096000 RCX: 0000000000000000
> RDX: ffff9a9481713990 RSI: 0000000000000024 RDI: 0000000000000000
> RBP: 0000000000000080 R8: 0000000000000008 R9: ffff89c64341a2f8
> R10: 0000000000000002 R11: 0000000000000000 R12: ffff89c648bab000
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff89c648bab0c8
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> #7 [ffff9a9481713988] pci_enable_atomic_ops_to_root at ffffffffb95359a6
> #8 [ffff9a94817139c0] bnxt_qplib_determine_atomics at ffffffffc08c1a33 [bnxt_re]
> #9 [ffff9a94817139d0] bnxt_re_dev_init at ffffffffc08ba2d1 [bnxt_re]
> RIP: 00007f450602f648 RSP: 00007ffe880869e8 RFLAGS: 00000246
> RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f450602f648
> RDX: 0000000000000002 RSI: 0000555c566c4a60 RDI: 0000000000000001
> RBP: 0000555c566c4a60 R8: 000000000000000a R9: 00007f45060c2580
> R10: 000000000000000a R11: 0000000000000246 R12: 00007f45063026e0
> R13: 0000000000000002 R14: 00007f45062fd880 R15: 0000000000000002
> ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

This feels like a bug in pci_enable_atomic_ops_to_root()? I assume it
hit a case where bus->self == NULL?

Why not fix it there?

Jason
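The bus->self == NULL hypothesis above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct mock_dev, struct mock_bus, MOCK_ATOMIC_ROUTING and mock_atomic_path_ok() are hypothetical stand-ins, not the kernel's types or its actual pci_enable_atomic_ops_to_root() code. It models a walk from a device's bus toward the root that reads a capability from each bridge, and shows how a bus whose bridge pointer is NULL can be rejected instead of dereferenced. The zero RDI in the exception frame of the trace is consistent with such a NULL pointer being passed as the first argument.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a PCI device and a PCI bus.
 * Only the fields needed for the walk are modelled. */
struct mock_dev {
	unsigned int devcap2;		/* stand-in for a capability register */
};

struct mock_bus {
	struct mock_bus *parent;	/* NULL on the root bus */
	struct mock_dev *self;		/* bridge leading to this bus, may be NULL */
};

#define MOCK_ATOMIC_ROUTING 0x1u	/* hypothetical "can route atomics" bit */

/*
 * Walk from the given bus up to the root and check that every bridge on
 * the way advertises atomic routing.
 * Returns 0 if the whole path is usable, -EINVAL otherwise.
 */
static int mock_atomic_path_ok(const struct mock_bus *bus)
{
	while (bus->parent) {
		const struct mock_dev *bridge = bus->self;

		/* Without this check, a bus that has a parent but no bridge
		 * device would be dereferenced just below. */
		if (!bridge)
			return -EINVAL;

		if (!(bridge->devcap2 & MOCK_ATOMIC_ROUTING))
			return -EINVAL;

		bus = bus->parent;
	}
	return 0;
}
```

In this model, a bus that has a parent but no bridge makes the walk return -EINVAL rather than fault. Whether the real fix belongs in such a check or in an earlier rejection of VFs is the question the thread goes on to discuss.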
On Fri, Aug 27, 2021 at 6:01 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 26, 2021 at 09:15:38PM -0700, Selvin Xavier wrote:
> > Following Host crash is observed when pci_enable_atomic_ops_to_root
> > is called with VF PCI device.
> >
> > [...]
>

Apologies for the delay in my response. I was exploring internally to
see if it is a specific issue with the adapter/host. I see the problem
in multiple systems.

> This feels like a bug in pci_enable_atomic_ops_to_root()? I assume it
> hit a case where bus->self == NULL?

Yes. This crashes because bus->self is NULL. Is it expected for VF?

>
> Why not fix it there?

Since it's a functional breakage in 5.14, I posted a quick fix for
5.14. Also, we haven't done any testing on VF for this feature. So I
wanted to avoid claiming support for VF anyway.

I see that other drivers also use pci_enable_atomic_ops_to_root
without vf/pf check. Anyone seeing this issue?

>
> Jason
On Tue, Aug 31, 2021 at 09:27:14PM +0530, Selvin Xavier wrote:
> On Fri, Aug 27, 2021 at 6:01 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Aug 26, 2021 at 09:15:38PM -0700, Selvin Xavier wrote:
> > > Following Host crash is observed when pci_enable_atomic_ops_to_root
> > > is called with VF PCI device.
> > >
> > > [...]
> >
> Apologies for the delay in my response. I was exploring internally to
> see if it is a specific issue
> with the adapter/host. I see the problem in multiple systems.
>
> > This feels like a bug in pci_enable_atomic_ops_to_root()? I assume it
> > hit a case where bus->self == NULL?
> Yes. This crashes because bus->self is NULL. Is it expected for VF?

I'm not sure; you should ask the PCI lists.

> > Why not fix it there?
> Since it's a functional breakage in 5.14, I posted a quick fix for
> 5.14. Also, we haven't done any testing on VF for this
> feature. So I wanted to avoid claiming support for VF anyway.
>
> I see that other drivers also use pci_enable_atomic_ops_to_root
> without vf/pf check. Anyone seeing this issue?

Which is why I suspect the core code should be fixed, not the driver.

Jason
On Wed, Sep 1, 2021 at 5:20 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Aug 31, 2021 at 09:27:14PM +0530, Selvin Xavier wrote:
> > On Fri, Aug 27, 2021 at 6:01 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Thu, Aug 26, 2021 at 09:15:38PM -0700, Selvin Xavier wrote:
> > > > Following Host crash is observed when pci_enable_atomic_ops_to_root
> > > > is called with VF PCI device.
> > > >
> > > > [...]
> > >
> > Apologies for the delay in my response. I was exploring internally to
> > see if it is a specific issue
> > with the adapter/host. I see the problem in multiple systems.
> >
> > > This feels like a bug in pci_enable_atomic_ops_to_root()? I assume it
> > > hit a case where bus->self == NULL?
> > Yes. This crashes because bus->self is NULL. Is it expected for VF?
>
> I'm not sure; you should ask the PCI lists.
>
> > > Why not fix it there?
> > Since it's a functional breakage in 5.14, I posted a quick fix for
> > 5.14. Also, we haven't done any testing on VF for this
> > feature. So I wanted to avoid claiming support for VF anyway.
> >
> > I see that other drivers also use pci_enable_atomic_ops_to_root
> > without vf/pf check. Anyone seeing this issue?
>
> Which is why I suspect the core code should be fixed, not the driver.

Hi Jason,
A patch that avoids the crash is merged to the linux-pci tree.
https://lore.kernel.org/linux-pci/20210914201606.GA1452219@bjorn-Precision-5520/T/
With the pci patch, the host will not crash. But the driver will get
the following error message when called for a VF:
"platform doesn't support global atomics."

We want to prevent calling pci_enable_atomic_ops_to_root for VF
anyway. Can you please pull this patch in bnxt_re?

Thanks
Selvin

>
> Jason
On Thu, Sep 16, 2021 at 08:35:37PM +0530, Selvin Xavier wrote:
> Hi Jason,
> A patch that avoids the crash is merged to the linux-pci tree.
> https://lore.kernel.org/linux-pci/20210914201606.GA1452219@bjorn-Precision-5520/T/
> With the pci patch, the host will not crash. But the driver will get
> the following error message when called for a VF:
> "platform doesn't support global atomics."
>
> We want to prevent calling pci_enable_atomic_ops_to_root for VF
> anyway. Can you please pull this patch in bnxt_re?

It doesn't work like this; you have to wait until v5.16 for all the
trees to be harmonized.

You should take care of it in your internal testing tree in the
interim.

Jason
diff --git a/drivers/infiniband/hw/bnxt_re/main.c b/drivers/infiniband/hw/bnxt_re/main.c
index 4678bd6..04d5c7d 100644
--- a/drivers/infiniband/hw/bnxt_re/main.c
+++ b/drivers/infiniband/hw/bnxt_re/main.c
@@ -129,7 +129,7 @@ static int bnxt_re_setup_chip_ctx(struct bnxt_re_dev *rdev, u8 wqe_mode)
 	rdev->rcfw.res = &rdev->qplib_res;
 
 	bnxt_re_set_drv_mode(rdev, wqe_mode);
-	if (bnxt_qplib_determine_atomics(en_dev->pdev))
+	if (!BNXT_VF(bp) && bnxt_qplib_determine_atomics(en_dev->pdev))
 		ibdev_info(&rdev->ibdev,
 			   "platform doesn't support global atomics.");
 	return 0;
Following Host crash is observed when pci_enable_atomic_ops_to_root
is called with VF PCI device.

PID: 4481 TASK: ffff89c6941b0000 CPU: 53 COMMAND: "bash"
#0 [ffff9a94817136d8] machine_kexec at ffffffffb90601a4
#1 [ffff9a9481713728] __crash_kexec at ffffffffb9190d5d
#2 [ffff9a94817137f0] crash_kexec at ffffffffb9191c4d
#3 [ffff9a9481713808] oops_end at ffffffffb9025cd6
#4 [ffff9a9481713828] page_fault_oops at ffffffffb906e417
#5 [ffff9a9481713888] exc_page_fault at ffffffffb9a0ad14
#6 [ffff9a94817138b0] asm_exc_page_fault at ffffffffb9c00ace
[exception RIP: pcie_capability_read_dword+28]
RIP: ffffffffb952fd5c RSP: ffff9a9481713960 RFLAGS: 00010246
RAX: 0000000000000001 RBX: ffff89c6b1096000 RCX: 0000000000000000
RDX: ffff9a9481713990 RSI: 0000000000000024 RDI: 0000000000000000
RBP: 0000000000000080 R8: 0000000000000008 R9: ffff89c64341a2f8
R10: 0000000000000002 R11: 0000000000000000 R12: ffff89c648bab000
R13: 0000000000000000 R14: 0000000000000000 R15: ffff89c648bab0c8
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff9a9481713988] pci_enable_atomic_ops_to_root at ffffffffb95359a6
#8 [ffff9a94817139c0] bnxt_qplib_determine_atomics at ffffffffc08c1a33 [bnxt_re]
#9 [ffff9a94817139d0] bnxt_re_dev_init at ffffffffc08ba2d1 [bnxt_re]
RIP: 00007f450602f648 RSP: 00007ffe880869e8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f450602f648
RDX: 0000000000000002 RSI: 0000555c566c4a60 RDI: 0000000000000001
RBP: 0000555c566c4a60 R8: 000000000000000a R9: 00007f45060c2580
R10: 000000000000000a R11: 0000000000000246 R12: 00007f45063026e0
R13: 0000000000000002 R14: 00007f45062fd880 R15: 0000000000000002
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

To avoid system crash when VFs are created, enable atomics only for PF now.

Fixes: 35f5ace5dea4 ("RDMA/bnxt_re: Enable global atomic ops if platform supports")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
---
 drivers/infiniband/hw/bnxt_re/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
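
The one-line change in the diff relies on the short-circuit behaviour of `&&`: when the device is a VF, the atomics probe is never called, so neither the crash path nor the info message is reached. The user-space sketch below illustrates only that control flow. All names in it (mock_determine_atomics(), mock_setup_atomics(), the two counters) are hypothetical stand-ins, and the probe is assumed to return nonzero when atomics are unavailable, mirroring how the diff uses the return value of bnxt_qplib_determine_atomics().

```c
#include <errno.h>
#include <stdbool.h>

static int mock_probe_calls;	/* how often the mocked probe was entered */
static int mock_info_logs;	/* how often the "not supported" message would print */

/* Hypothetical stand-in for the driver's atomics probe. It always reports
 * failure here so that the logging branch is exercised for the PF case. */
static int mock_determine_atomics(void)
{
	mock_probe_calls++;
	return -EINVAL;
}

/* Mirrors the shape of the patched condition: skip the probe on a VF. */
static void mock_setup_atomics(bool is_vf)
{
	/* && short-circuits, so the probe is not called when is_vf is true */
	if (!is_vf && mock_determine_atomics())
		mock_info_logs++;	/* stands in for the ibdev_info() call */
}
```

Calling mock_setup_atomics(true) leaves both counters untouched, while mock_setup_atomics(false) enters the probe once and records one message. This matches the stated intent of the patch: the VF neither claims atomics support nor reports their absence.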