| Message ID | 20230418090642.1849358-1-matsuda-daisuke@fujitsu.com |
| --- | --- |
| State | Accepted |
| Delegated to | Jason Gunthorpe |
| Series | [jgg-for-next] RDMA/rxe: Fix spinlock recursion deadlock on requester |
On Tue, Apr 18, 2023 at 5:07 PM Daisuke Matsuda <matsuda-daisuke@fujitsu.com> wrote:
>
> After applying commit f605f26ea196, the following deadlock is observed:
> [...]
> The deadlock is easily reproducible with perftest. Fix it by disabling
> softirq when acquiring the lock in process context.

I am fine. Thanks.

Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>

Zhu Yanjun

> Fixes: f605f26ea196 ("RDMA/rxe: Protect QP state with qp->state_lock")
> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_req.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> [...]
On Tue, Apr 18, 2023 at 06:06:42PM +0900, Daisuke Matsuda wrote:
> After applying commit f605f26ea196, the following deadlock is observed:
> [...]
> The deadlock is easily reproducible with perftest. Fix it by disabling
> softirq when acquiring the lock in process context.
>
> Fixes: f605f26ea196 ("RDMA/rxe: Protect QP state with qp->state_lock")
> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_req.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

Applied to for-next, thanks

Jason
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index 8e50d116d273..65134a9aefe7 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -180,13 +180,13 @@ static struct rxe_send_wqe *req_next_wqe(struct rxe_qp *qp)
 	if (wqe == NULL)
 		return NULL;
 
-	spin_lock(&qp->state_lock);
+	spin_lock_bh(&qp->state_lock);
 	if (unlikely((qp_state(qp) == IB_QPS_SQD) &&
 		     (wqe->state != wqe_state_processing))) {
-		spin_unlock(&qp->state_lock);
+		spin_unlock_bh(&qp->state_lock);
 		return NULL;
 	}
-	spin_unlock(&qp->state_lock);
+	spin_unlock_bh(&qp->state_lock);
 
 	wqe->mask = wr_opcode_mask(wqe->wr.opcode, qp);
 	return wqe;
After applying commit f605f26ea196, the following deadlock is observed:

 Call Trace:
  <IRQ>
  _raw_spin_lock_bh+0x29/0x30
  check_type_state.constprop.0+0x4e/0xc0 [rdma_rxe]
  rxe_rcv+0x173/0x3d0 [rdma_rxe]
  rxe_udp_encap_recv+0x69/0xd0 [rdma_rxe]
  ? __pfx_rxe_udp_encap_recv+0x10/0x10 [rdma_rxe]
  udp_queue_rcv_one_skb+0x258/0x520
  udp_unicast_rcv_skb+0x75/0x90
  __udp4_lib_rcv+0x364/0x5c0
  ip_protocol_deliver_rcu+0xa7/0x160
  ip_local_deliver_finish+0x73/0xa0
  ip_sublist_rcv_finish+0x80/0x90
  ip_sublist_rcv+0x191/0x220
  ip_list_rcv+0x132/0x160
  __netif_receive_skb_list_core+0x297/0x2c0
  netif_receive_skb_list_internal+0x1c5/0x300
  napi_complete_done+0x6f/0x1b0
  virtnet_poll+0x1f4/0x2d0 [virtio_net]
  __napi_poll+0x2c/0x1b0
  net_rx_action+0x293/0x350
  ? __napi_schedule+0x79/0x90
  __do_softirq+0xcb/0x2ab
  __irq_exit_rcu+0xb9/0xf0
  common_interrupt+0x80/0xa0
  </IRQ>
  <TASK>
  asm_common_interrupt+0x22/0x40
 RIP: 0010:_raw_spin_lock+0x17/0x30
  rxe_requester+0xe4/0x8f0 [rdma_rxe]
  ? xas_load+0x9/0xa0
  ? xa_load+0x70/0xb0
  do_task+0x64/0x1f0 [rdma_rxe]
  rxe_post_send+0x54/0x110 [rdma_rxe]
  ib_uverbs_post_send+0x5f8/0x680 [ib_uverbs]
  ? netif_receive_skb_list_internal+0x1e3/0x300
  ib_uverbs_write+0x3c8/0x500 [ib_uverbs]
  vfs_write+0xc5/0x3b0
  ksys_write+0xab/0xe0
  ? syscall_trace_enter.constprop.0+0x126/0x1a0
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  </TASK>

The deadlock is easily reproducible with perftest. Fix it by disabling
softirq when acquiring the lock in process context.

Fixes: f605f26ea196 ("RDMA/rxe: Protect QP state with qp->state_lock")
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe_req.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
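
For readers less familiar with the locking rule the patch restores, the sketch below illustrates the general pattern. It is not rxe code: the struct and function names (foo, foo_softirq_handler, foo_process_path) are hypothetical, and only the spinlock primitives are real kernel APIs. The point is that a lock also taken from softirq context, as qp->state_lock is in the rxe receive path shown in the trace above, must be taken with the _bh variant in process context, or a softirq arriving on the same CPU can spin on a lock that CPU already holds.

```c
/*
 * Minimal sketch of the pattern, not the actual rxe code. The names
 * "foo", foo_softirq_handler() and foo_process_path() are hypothetical;
 * only spin_lock()/spin_lock_bh() and their unlock counterparts are
 * real kernel APIs.
 */
#include <linux/spinlock.h>

struct foo {
	spinlock_t lock;	/* taken from both process and softirq context */
	int state;
};

/* Runs in softirq context, e.g. from a NAPI receive handler. */
static void foo_softirq_handler(struct foo *f)
{
	spin_lock_bh(&f->lock);
	f->state++;
	spin_unlock_bh(&f->lock);
}

/* Runs in process context, e.g. under a syscall. */
static void foo_process_path(struct foo *f)
{
	/*
	 * A plain spin_lock() here would leave softirqs enabled, so the
	 * handler above could interrupt this critical section on the same
	 * CPU and spin forever on a lock its own CPU already holds.
	 * spin_lock_bh() disables bottom halves for the duration and
	 * avoids that self-deadlock.
	 */
	spin_lock_bh(&f->lock);
	f->state++;
	spin_unlock_bh(&f->lock);
}
```

That is the shape of the three one-line changes in req_next_wqe() above: the receive path already takes qp->state_lock with the _bh variant in softirq context, so the process-context requester path has to do the same.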