diff mbox series

[rdma] RDMA/bnxt_re: cmds completions handler avoid accessing invalid memeory

Message ID 20241112134956.1415343-1-mheib@redhat.com (mailing list archive)
State New
Headers show
Series [rdma] RDMA/bnxt_re: cmds completions handler avoid accessing invalid memeory | expand

Commit Message

Mohammad Heib Nov. 12, 2024, 1:49 p.m. UTC
If bnxt FW behaves unexpectedly because of FW bug or unexpected behavior it
can send completions for old  cookies that have already been handled by the
bnxt driver. If that old cookie was associated with an old calling context
the driver will try to access that caller memory again because the driver
never clean the is_waiter_alive flag after the caller successfully complete
waiting, and this access will cause the following kernel panic:

Call Trace:
 <IRQ>
 ? __die+0x20/0x70
 ? page_fault_oops+0x75/0x170
 ? exc_page_fault+0xaa/0x140
 ? asm_exc_page_fault+0x22/0x30
 ? bnxt_qplib_process_qp_event.isra.0+0x20c/0x3a0 [bnxt_re]
 ? srso_return_thunk+0x5/0x5f
 ? __wake_up_common+0x78/0xa0
 ? srso_return_thunk+0x5/0x5f
 bnxt_qplib_service_creq+0x18d/0x250 [bnxt_re]
 tasklet_action_common+0xac/0x210
 handle_softirqs+0xd3/0x2b0
 __irq_exit_rcu+0x9b/0xc0
 common_interrupt+0x7f/0xa0
 </IRQ>
 <TASK>

To avoid the above unexpected behavior clear the is_waiter_alive flag
every time the caller finishes waiting for a completion.

Fixes: 691eb7c6110f ("RDMA/bnxt_re: handle command completions after driver detect a timedout")
Signed-off-by: Mohammad Heib <mheib@redhat.com>
---
 drivers/infiniband/hw/bnxt_re/qplib_rcfw.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

Comments

Leon Romanovsky Nov. 14, 2024, 10:04 a.m. UTC | #1
On Tue, Nov 12, 2024 at 03:49:56PM +0200, Mohammad Heib wrote:
> If bnxt FW behaves unexpectedly because of FW bug or unexpected behavior it
> can send completions for old  cookies that have already been handled by the
> bnxt driver. If that old cookie was associated with an old calling context
> the driver will try to access that caller memory again because the driver
> never clean the is_waiter_alive flag after the caller successfully complete
> waiting, and this access will cause the following kernel panic:
> 
> Call Trace:
>  <IRQ>
>  ? __die+0x20/0x70
>  ? page_fault_oops+0x75/0x170
>  ? exc_page_fault+0xaa/0x140
>  ? asm_exc_page_fault+0x22/0x30
>  ? bnxt_qplib_process_qp_event.isra.0+0x20c/0x3a0 [bnxt_re]
>  ? srso_return_thunk+0x5/0x5f
>  ? __wake_up_common+0x78/0xa0
>  ? srso_return_thunk+0x5/0x5f
>  bnxt_qplib_service_creq+0x18d/0x250 [bnxt_re]
>  tasklet_action_common+0xac/0x210
>  handle_softirqs+0xd3/0x2b0
>  __irq_exit_rcu+0x9b/0xc0
>  common_interrupt+0x7f/0xa0
>  </IRQ>
>  <TASK>
> 
> To avoid the above unexpected behavior clear the is_waiter_alive flag
> every time the caller finishes waiting for a completion.
> 
> Fixes: 691eb7c6110f ("RDMA/bnxt_re: handle command completions after driver detect a timedout")
> Signed-off-by: Mohammad Heib <mheib@redhat.com>
> ---
>  drivers/infiniband/hw/bnxt_re/qplib_rcfw.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)

Selvin?
Selvin Xavier Nov. 14, 2024, 10:07 a.m. UTC | #2
On Thu, Nov 14, 2024 at 3:34 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Tue, Nov 12, 2024 at 03:49:56PM +0200, Mohammad Heib wrote:
> > If bnxt FW behaves unexpectedly because of FW bug or unexpected behavior it
> > can send completions for old  cookies that have already been handled by the
> > bnxt driver. If that old cookie was associated with an old calling context
> > the driver will try to access that caller memory again because the driver
> > never clean the is_waiter_alive flag after the caller successfully complete
> > waiting, and this access will cause the following kernel panic:
> >
> > Call Trace:
> >  <IRQ>
> >  ? __die+0x20/0x70
> >  ? page_fault_oops+0x75/0x170
> >  ? exc_page_fault+0xaa/0x140
> >  ? asm_exc_page_fault+0x22/0x30
> >  ? bnxt_qplib_process_qp_event.isra.0+0x20c/0x3a0 [bnxt_re]
> >  ? srso_return_thunk+0x5/0x5f
> >  ? __wake_up_common+0x78/0xa0
> >  ? srso_return_thunk+0x5/0x5f
> >  bnxt_qplib_service_creq+0x18d/0x250 [bnxt_re]
> >  tasklet_action_common+0xac/0x210
> >  handle_softirqs+0xd3/0x2b0
> >  __irq_exit_rcu+0x9b/0xc0
> >  common_interrupt+0x7f/0xa0
> >  </IRQ>
> >  <TASK>
> >
> > To avoid the above unexpected behavior clear the is_waiter_alive flag
> > every time the caller finishes waiting for a completion.
> >
> > Fixes: 691eb7c6110f ("RDMA/bnxt_re: handle command completions after driver detect a timedout")
> > Signed-off-by: Mohammad Heib <mheib@redhat.com>
> > ---
> >  drivers/infiniband/hw/bnxt_re/qplib_rcfw.c | 16 ++++++++--------
> >  1 file changed, 8 insertions(+), 8 deletions(-)
>
> Selvin?
Someone is confirming the fix. Will ack in a day. Thanks
Leon Romanovsky Nov. 14, 2024, 11:45 a.m. UTC | #3
On Thu, Nov 14, 2024 at 03:37:30PM +0530, Selvin Xavier wrote:
> On Thu, Nov 14, 2024 at 3:34 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Tue, Nov 12, 2024 at 03:49:56PM +0200, Mohammad Heib wrote:
> > > If bnxt FW behaves unexpectedly because of FW bug or unexpected behavior it
> > > can send completions for old  cookies that have already been handled by the
> > > bnxt driver. If that old cookie was associated with an old calling context
> > > the driver will try to access that caller memory again because the driver
> > > never clean the is_waiter_alive flag after the caller successfully complete
> > > waiting, and this access will cause the following kernel panic:
> > >
> > > Call Trace:
> > >  <IRQ>
> > >  ? __die+0x20/0x70
> > >  ? page_fault_oops+0x75/0x170
> > >  ? exc_page_fault+0xaa/0x140
> > >  ? asm_exc_page_fault+0x22/0x30
> > >  ? bnxt_qplib_process_qp_event.isra.0+0x20c/0x3a0 [bnxt_re]
> > >  ? srso_return_thunk+0x5/0x5f
> > >  ? __wake_up_common+0x78/0xa0
> > >  ? srso_return_thunk+0x5/0x5f
> > >  bnxt_qplib_service_creq+0x18d/0x250 [bnxt_re]
> > >  tasklet_action_common+0xac/0x210
> > >  handle_softirqs+0xd3/0x2b0
> > >  __irq_exit_rcu+0x9b/0xc0
> > >  common_interrupt+0x7f/0xa0
> > >  </IRQ>
> > >  <TASK>
> > >
> > > To avoid the above unexpected behavior clear the is_waiter_alive flag
> > > every time the caller finishes waiting for a completion.
> > >
> > > Fixes: 691eb7c6110f ("RDMA/bnxt_re: handle command completions after driver detect a timedout")
> > > Signed-off-by: Mohammad Heib <mheib@redhat.com>
> > > ---
> > >  drivers/infiniband/hw/bnxt_re/qplib_rcfw.c | 16 ++++++++--------
> > >  1 file changed, 8 insertions(+), 8 deletions(-)
> >
> > Selvin?
> Someone is confirming the fix. Will ack in a day. Thanks

Thanks
Selvin Xavier Nov. 16, 2024, 8:03 a.m. UTC | #4
On Thu, Nov 14, 2024 at 5:15 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Thu, Nov 14, 2024 at 03:37:30PM +0530, Selvin Xavier wrote:
> > On Thu, Nov 14, 2024 at 3:34 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Tue, Nov 12, 2024 at 03:49:56PM +0200, Mohammad Heib wrote:
> > > > If bnxt FW behaves unexpectedly because of FW bug or unexpected behavior it
> > > > can send completions for old  cookies that have already been handled by the
> > > > bnxt driver. If that old cookie was associated with an old calling context
> > > > the driver will try to access that caller memory again because the driver
> > > > never clean the is_waiter_alive flag after the caller successfully complete
> > > > waiting, and this access will cause the following kernel panic:
> > > >
> > > > Call Trace:
> > > >  <IRQ>
> > > >  ? __die+0x20/0x70
> > > >  ? page_fault_oops+0x75/0x170
> > > >  ? exc_page_fault+0xaa/0x140
> > > >  ? asm_exc_page_fault+0x22/0x30
> > > >  ? bnxt_qplib_process_qp_event.isra.0+0x20c/0x3a0 [bnxt_re]
> > > >  ? srso_return_thunk+0x5/0x5f
> > > >  ? __wake_up_common+0x78/0xa0
> > > >  ? srso_return_thunk+0x5/0x5f
> > > >  bnxt_qplib_service_creq+0x18d/0x250 [bnxt_re]
> > > >  tasklet_action_common+0xac/0x210
> > > >  handle_softirqs+0xd3/0x2b0
> > > >  __irq_exit_rcu+0x9b/0xc0
> > > >  common_interrupt+0x7f/0xa0
> > > >  </IRQ>
> > > >  <TASK>
> > > >
> > > > To avoid the above unexpected behavior clear the is_waiter_alive flag
> > > > every time the caller finishes waiting for a completion.
Mohammad,
 We were trying to see the possibility. FW shouldn't be giving an old
cookie. One possibility
could be if FW crashes and we are in the recovery routine.
Adding this check is okay, but may be hiding some other error.
Is it possible to share your test scripts to repro this problem? Also,
can you share
the vmcore-demsg also

Thanks
Selvin




> > > >
> > > > Fixes: 691eb7c6110f ("RDMA/bnxt_re: handle command completions after driver detect a timedout")
> > > > Signed-off-by: Mohammad Heib <mheib@redhat.com>
> > > > ---
> > > >  drivers/infiniband/hw/bnxt_re/qplib_rcfw.c | 16 ++++++++--------
> > > >  1 file changed, 8 insertions(+), 8 deletions(-)
> > >
> > > Selvin?
> > Someone is confirming the fix. Will ack in a day. Thanks
>
> Thanks
diff mbox series

Patch

diff --git a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c
index f5713e3c39fb..eaf92029862b 100644
--- a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c
+++ b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c
@@ -511,15 +511,15 @@  static int __bnxt_qplib_rcfw_send_message(struct bnxt_qplib_rcfw *rcfw,
 	else
 		rc = __poll_for_resp(rcfw, cookie);
 
-	if (rc) {
-		spin_lock_irqsave(&rcfw->cmdq.hwq.lock, flags);
-		crsqe = &rcfw->crsqe_tbl[cookie];
-		crsqe->is_waiter_alive = false;
-		if (rc == -ENODEV)
-			set_bit(FIRMWARE_STALL_DETECTED, &rcfw->cmdq.flags);
-		spin_unlock_irqrestore(&rcfw->cmdq.hwq.lock, flags);
+
+	spin_lock_irqsave(&rcfw->cmdq.hwq.lock, flags);
+	crsqe = &rcfw->crsqe_tbl[cookie];
+	crsqe->is_waiter_alive = false;
+	if (rc == -ENODEV)
+		set_bit(FIRMWARE_STALL_DETECTED, &rcfw->cmdq.flags);
+	spin_unlock_irqrestore(&rcfw->cmdq.hwq.lock, flags);
+	if (rc)
 		return -ETIMEDOUT;
-	}
 
 	if (evnt->status) {
 		/* failed with status */