rpcrdma: don't decref EP if a ESTABLISHED did not happen

Message ID	20240505124910.1877325-1-dan.aloni@vastdata.com (mailing list archive)
State	New
Headers	show Received: from mail-wm1-f53.google.com (mail-wm1-f53.google.com [209.85.128.53]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 448A1A21 for <linux-nfs@vger.kernel.org>; Sun, 5 May 2024 12:49:14 +0000 (UTC) From: Dan Aloni <dan.aloni@vastdata.com> To: chuck.lever@oracle.com Cc: linux-nfs@vger.kernel.org Subject: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen Date: Sun, 5 May 2024 15:49:10 +0300 Message-Id: <20240505124910.1877325-1-dan.aloni@vastdata.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	rpcrdma: don't decref EP if a ESTABLISHED did not happen \| expand rpcrdma: don't decref EP if a ESTABLISHED did not happen

Dan Aloni May 5, 2024, 12:49 p.m. UTC

We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
underflow.

The specific scenario that caused this to happen is IB device bonding,
when bringing down one of the ports, or all ports. The situation is not
just a print - it also causes a non-recoverable state it is not even
possible to complete the disconnect and shut it down the mount,
requiring a reboot, suggesting that tear-down is also incomplete in this
state.

The trivial fix seems to work as such - if we did not receive a
`RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
`rpcrdma_xprt_drain` kills the EP prematurely in from the context of
`rpcrdma_xprt_disconnect`.

Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')

Example crash:

rpcrdma: removing device mlx5_3 for 172.21.208.2:20049
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 60 PID: 19700 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
Modules linked in: mst_pciconf(OE) nfsv3(OE) nfs_acl(OE) rpcsec_gss_krb5(OE) auth_rpcgss(OE) nfsv4(OE) dns_resolver rpcrdma(OE) nfs(OE) lockd(OE) grace compat_nfs_ssc(OE) snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore uio_pci_generic uio vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_>
isst_if_mmio mei isst_if_common i2c_smbus intel_pch_thermal intel_vsec ipmi_msghandler acpi_power_meter xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod t10_pi sg mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci libahci drm crct10dif_pclmul mlx>
CPU: 60 PID: 19700 Comm: kworker/u132:4 Kdump: loaded Tainted: G        W  OE    --------  ---  5.14.0-284.11.1.el9_2.x86_64 #1
Hardware name: Dell Inc. PowerEdge C6520/0TY3YW, BIOS 1.8.2 09/14/2022
Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
RIP: 0010:refcount_warn_saturate+0xba/0x110
Code: 01 01 e8 27 e1 56 00 0f 0b c3 cc cc cc cc 80 3d b8 29 9b 01 00 75 85 48 c7 c7 38 ec 04 93 c6 05 a8 29 9b 01 01 e8 04 e1 56 00 <0f> 0b c3 cc cc cc cc 80 3d 93 29 9b 01 00 0f 85 5e ff ff ff 48 c7
RSP: 0018:ff34fa4968cafe10 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ff1210404a15e000 RCX: 0000000000000027
RDX: ff12103f803998a8 RSI: 0000000000000001 RDI: ff12103f803998a0
RBP: ff1210404a15e648 R08: 0000000000000000 R09: 00000000ffff7fff
R10: ff34fa4968cafcb0 R11: ffffffff939e9608 R12: 0000000000000000
R13: dead000000000122 R14: dead000000000100 R15: ff1210404a15e928
FS:  0000000000000000(0000) GS:ff12103f80380000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f170f8a5000 CR3: 00000001c3adc002 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
rpcrdma_ep_put+0x42/0x50 [rpcrdma]
rpcrdma_xprt_disconnect+0x303/0x3b0 [rpcrdma]
xprt_rdma_connect_worker+0xc8/0xd0 [rpcrdma]
process_one_work+0x1e5/0x3c0
? rescuer_thread+0x3a0/0x3a0
worker_thread+0x50/0x3b0
? rescuer_thread+0x3a0/0x3a0
kthread+0xd6/0x100
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>

Signed-off-by: Dan Aloni <dan.aloni@vastdata.com>
---
 net/sunrpc/xprtrdma/verbs.c     | 5 ++++-
 net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

Chuck Lever May 5, 2024, 2:35 p.m. UTC | #1

On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote:
> We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
> underflow.
> 
> The specific scenario that caused this to happen is IB device bonding,
> when bringing down one of the ports, or all ports. The situation is not
> just a print - it also causes a non-recoverable state it is not even
> possible to complete the disconnect and shut it down the mount,
> requiring a reboot, suggesting that tear-down is also incomplete in this
> state.
> 
> The trivial fix seems to work as such - if we did not receive a
> `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
> `rpcrdma_xprt_drain` kills the EP prematurely in from the context of
> `rpcrdma_xprt_disconnect`.
> 
> Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')

Hi Dan, thanks for the bug report!

In the future, note that the Fixes: tag needs to go down by the
Signed-off-by: tag. But I wonder if 2acc5 introduced this issue --
ep reference counting seems to have been added before this commit.

Second, I wonder if, when bonding interfaces, there's an opportunity
for the verbs consumer to take an additional transport reference.
Cc'ing linux-rdma for input on that issue. Can you show a brief
diagram of when the ep reference count changes when bonding?

Also, I note that the WARNING below happened on a RHEL9 kernel:

   5.14.0-284.11.1.el9_2.x86_64

Can you confirm that this issue reproduces on v6.8 and newer ?


> Example crash:
> 
> rpcrdma: removing device mlx5_3 for 172.21.208.2:20049
> ------------[ cut here ]------------
> refcount_t: underflow; use-after-free.
> WARNING: CPU: 60 PID: 19700 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
> Modules linked in: mst_pciconf(OE) nfsv3(OE) nfs_acl(OE) rpcsec_gss_krb5(OE) auth_rpcgss(OE) nfsv4(OE) dns_resolver rpcrdma(OE) nfs(OE) lockd(OE) grace compat_nfs_ssc(OE) snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore uio_pci_generic uio vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_>
> isst_if_mmio mei isst_if_common i2c_smbus intel_pch_thermal intel_vsec ipmi_msghandler acpi_power_meter xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod t10_pi sg mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci libahci drm crct10dif_pclmul mlx>
> CPU: 60 PID: 19700 Comm: kworker/u132:4 Kdump: loaded Tainted: G        W  OE    --------  ---  5.14.0-284.11.1.el9_2.x86_64 #1
> Hardware name: Dell Inc. PowerEdge C6520/0TY3YW, BIOS 1.8.2 09/14/2022
> Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
> RIP: 0010:refcount_warn_saturate+0xba/0x110
> Code: 01 01 e8 27 e1 56 00 0f 0b c3 cc cc cc cc 80 3d b8 29 9b 01 00 75 85 48 c7 c7 38 ec 04 93 c6 05 a8 29 9b 01 01 e8 04 e1 56 00 <0f> 0b c3 cc cc cc cc 80 3d 93 29 9b 01 00 0f 85 5e ff ff ff 48 c7
> RSP: 0018:ff34fa4968cafe10 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: ff1210404a15e000 RCX: 0000000000000027
> RDX: ff12103f803998a8 RSI: 0000000000000001 RDI: ff12103f803998a0
> RBP: ff1210404a15e648 R08: 0000000000000000 R09: 00000000ffff7fff
> R10: ff34fa4968cafcb0 R11: ffffffff939e9608 R12: 0000000000000000
> R13: dead000000000122 R14: dead000000000100 R15: ff1210404a15e928
> FS:  0000000000000000(0000) GS:ff12103f80380000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f170f8a5000 CR3: 00000001c3adc002 CR4: 0000000000771ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
> <TASK>
> rpcrdma_ep_put+0x42/0x50 [rpcrdma]
> rpcrdma_xprt_disconnect+0x303/0x3b0 [rpcrdma]
> xprt_rdma_connect_worker+0xc8/0xd0 [rpcrdma]
> process_one_work+0x1e5/0x3c0
> ? rescuer_thread+0x3a0/0x3a0
> worker_thread+0x50/0x3b0
> ? rescuer_thread+0x3a0/0x3a0
> kthread+0xd6/0x100
> ? kthread_complete_and_exit+0x20/0x20
> ret_from_fork+0x1f/0x30
> </TASK>
> 
> Signed-off-by: Dan Aloni <dan.aloni@vastdata.com>
> ---
>  net/sunrpc/xprtrdma/verbs.c     | 5 ++++-
>  net/sunrpc/xprtrdma/xprt_rdma.h | 1 +
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 4f8d7efa469f..19996515da94 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -250,6 +250,7 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
>  		goto disconnected;
>  	case RDMA_CM_EVENT_ESTABLISHED:
>  		rpcrdma_ep_get(ep);
> +               ep->re_connect_ref = true;
>  		ep->re_connect_status = 1;
>  		rpcrdma_update_cm_private(ep, &event->param.conn);
>  		trace_xprtrdma_inline_thresh(ep);
> @@ -272,7 +273,9 @@ rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
>  		ep->re_connect_status = -ECONNABORTED;
>  disconnected:
>  		rpcrdma_force_disconnect(ep);
> -		return rpcrdma_ep_put(ep);
> +		if (ep->re_connect_ref)
> +			return rpcrdma_ep_put(ep);
> +		return 0;
>  	default:
>  		break;
>  	}
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index da409450dfc0..1553ef69a844 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -84,6 +84,7 @@ struct rpcrdma_ep {
>  	unsigned int		re_max_inline_recv;
>  	int			re_async_rc;
>  	int			re_connect_status;
> +	bool			re_connect_ref;
>  	atomic_t		re_receiving;
>  	atomic_t		re_force_disconnect;
>  	struct ib_qp_init_attr	re_attr;
> -- 
> 2.39.3
>

Dan Aloni May 5, 2024, 6:36 p.m. UTC | #2

On 2024-05-05 10:35:46, Chuck Lever wrote:
> On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote:
> > We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
> > underflow.
> > 
> > The specific scenario that caused this to happen is IB device bonding,
> > when bringing down one of the ports, or all ports. The situation is not
> > just a print - it also causes a non-recoverable state it is not even
> > possible to complete the disconnect and shut it down the mount,
> > requiring a reboot, suggesting that tear-down is also incomplete in this
> > state.
> > 
> > The trivial fix seems to work as such - if we did not receive a
> > `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
> > `rpcrdma_xprt_drain` kills the EP prematurely in from the context of
> > `rpcrdma_xprt_disconnect`.
> > 
> > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
> 
> Hi Dan, thanks for the bug report!
> 
[..]
> Second, I wonder if, when bonding interfaces, there's an opportunity
> for the verbs consumer to take an additional transport reference.
> Cc'ing linux-rdma for input on that issue. Can you show a brief
> diagram of when the ep reference count changes when bonding?

Not sure we need an additional reference here. I understand regarding rpcrdma
managing its internal EP refcount, that it is having three in total: one for
init, another one for ESTABLISHED mode, and a third for 'outstanding receives',
so the mishandling of RDMA_CM_EVENT_DEVICE_REMOVAL seem only internal to me,
and I found some more about it, please read on.

> Also, I note that the WARNING below happened on a RHEL9 kernel:
> 
>    5.14.0-284.11.1.el9_2.x86_64
> 
> Can you confirm that this issue reproduces on v6.8 and newer ?

Some context: I originally tested with a version that also has implemented a
timeout in wait_event of `rpcrdma_xprt_connect`. You may say it is somewhat
'culprit' in the negative decref, but there is the other issue and second order
effect of likely not handling RDMA_CM_EVENT_DEVICE_REMOVAL properly which causes
the unkillable transport, so we are not reaching teardown at all. With both of
these changes applied both problems are gone.

As for upstream version testing, with the elrepo build of 6.8.9 which matches
vanilla, I don't see the negative kref but I do see the other effect of
`XPRT_CLOSE_WAIT` transport state and provider stuck on teardown (like below),
which does not release even after ports are back up online.

Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);`
for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery
on bonding, I'll post in a reply.

```
INFO: task kworker/u128:6:1688 blocked for more than 122 seconds.
      Tainted: G            E      6.8.9-1.el9.elrepo.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u128:6  state:D stack:0     pid:1688  tgid:1688  ppid:2      flags:0x00004000
Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
Call Trace:
 <TASK>
 __schedule+0x21c/0x680
 schedule+0x31/0xd0
 schedule_timeout+0x148/0x160
 ? mutex_lock+0xe/0x30
 ? cma_process_remove+0x218/0x260 [rdma_cm]
 ? preempt_count_add+0x70/0xa0
 __wait_for_common+0x90/0x1e0
 ? __pfx_schedule_timeout+0x10/0x10
 cma_remove_one+0x5c/0xd0 [rdma_cm]
 remove_client_context+0x88/0xd0 [ib_core]
 disable_device+0x8a/0x160 [ib_core]
 ? _raw_spin_unlock+0x15/0x30
 __ib_unregister_device+0x42/0xb0 [ib_core]
 ib_unregister_device+0x22/0x30 [ib_core]
 mlx5r_remove+0x36/0x60 [mlx5_ib]
 auxiliary_bus_remove+0x18/0x30
 device_release_driver_internal+0x193/0x200
 bus_remove_device+0xbf/0x130
 device_del+0x157/0x3e0
 ? devl_param_driverinit_value_get+0x29/0xa0
 mlx5_rescan_drivers_locked.part.0+0x7e/0x1b0 [mlx5_core]
 mlx5_lag_remove_devices+0x3c/0x60 [mlx5_core]
 mlx5_do_bond+0x2cb/0x390 [mlx5_core]
 mlx5_do_bond_work+0x96/0xf0 [mlx5_core]
 process_one_work+0x174/0x340
 worker_thread+0x27e/0x390
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xee/0x120
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2d/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>
INFO: task fio:10584 blocked for more than 122 seconds.
      Tainted: G            E      6.8.9-1.el9.elrepo.x86_64 #1
```

Chuck Lever May 5, 2024, 7 p.m. UTC | #3

On Sun, May 05, 2024 at 09:36:28PM +0300, Dan Aloni wrote:
> On 2024-05-05 10:35:46, Chuck Lever wrote:
> > On Sun, May 05, 2024 at 03:49:10PM +0300, Dan Aloni wrote:
> > > We found a case where `RDMA_CM_EVENT_DEVICE_REMOVAL` causes a refcount
> > > underflow.
> > > 
> > > The specific scenario that caused this to happen is IB device bonding,
> > > when bringing down one of the ports, or all ports. The situation is not
> > > just a print - it also causes a non-recoverable state it is not even
> > > possible to complete the disconnect and shut it down the mount,
> > > requiring a reboot, suggesting that tear-down is also incomplete in this
> > > state.
> > > 
> > > The trivial fix seems to work as such - if we did not receive a
> > > `RDMA_CM_EVENT_ESTABLISHED`, we should not decref the EP, otherwise
> > > `rpcrdma_xprt_drain` kills the EP prematurely in from the context of
> > > `rpcrdma_xprt_disconnect`.
> > > 
> > > Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed')
> > 
> > Hi Dan, thanks for the bug report!
> > 
> [..]
> > Second, I wonder if, when bonding interfaces, there's an opportunity
> > for the verbs consumer to take an additional transport reference.
> > Cc'ing linux-rdma for input on that issue. Can you show a brief
> > diagram of when the ep reference count changes when bonding?
> 
> Not sure we need an additional reference here.

We already have two mechanisms in play:

- the ep reference count
- the re_connect_status value

I would prefer not to add a boolean here. Seems like
re_connect_status could do that job if we need it to.

I seem to recall that when a device is removed, the verbs consumer
doesn't get a DISCONNECT first. Or does it? Having a sequence of
CM events that you see in the bonded case would help a lot.


> I understand regarding rpcrdma
> managing its internal EP refcount, that it is having three in total: one for
> init, another one for ESTABLISHED mode, and a third for 'outstanding receives',
> so the mishandling of RDMA_CM_EVENT_DEVICE_REMOVAL seem only internal to me,
> and I found some more about it, please read on.
> 
> > Also, I note that the WARNING below happened on a RHEL9 kernel:
> > 
> >    5.14.0-284.11.1.el9_2.x86_64
> > 
> > Can you confirm that this issue reproduces on v6.8 and newer ?
> 
> Some context: I originally tested with a version that also has implemented a
> timeout in wait_event of `rpcrdma_xprt_connect`. You may say it is somewhat
> 'culprit' in the negative decref, but there is the other issue and second order
> effect of likely not handling RDMA_CM_EVENT_DEVICE_REMOVAL properly which causes
> the unkillable transport, so we are not reaching teardown at all. With both of
> these changes applied both problems are gone.
> 
> As for upstream version testing, with the elrepo build of 6.8.9 which matches
> vanilla, I don't see the negative kref but I do see the other effect of
> `XPRT_CLOSE_WAIT` transport state and provider stuck on teardown (like below),
> which does not release even after ports are back up online.
> 
> Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);`
> for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery
> on bonding, I'll post in a reply.

It looks like you are trying to fix several issues in one patch. So
I need you to separate these issues and the fixes, and let's focus
on the upstream kernel (v6.9-rc6) because there's nothing I can do
about the RHEL9 kernel, which is clearly a different source base
than the one I work on.

If we need a "wake_up_all(&ep->re_connect_wait);" during
DEVICE_REMOVAL, that should be a separate patch. And you need to
figure out if ADDR_CHANGE needs the same treatment: the v2 you just
sent changes the behavior of ADDR_CHANGE too but does not mention
whether it intended that change.

Without a reproducer or a detailed explanation of how the ep
reference count changes in step with CM events, I can't properly
assess your proposed fix.


> ```
> INFO: task kworker/u128:6:1688 blocked for more than 122 seconds.
>       Tainted: G            E      6.8.9-1.el9.elrepo.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u128:6  state:D stack:0     pid:1688  tgid:1688  ppid:2      flags:0x00004000
> Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
> Call Trace:
>  <TASK>
>  __schedule+0x21c/0x680
>  schedule+0x31/0xd0
>  schedule_timeout+0x148/0x160
>  ? mutex_lock+0xe/0x30
>  ? cma_process_remove+0x218/0x260 [rdma_cm]
>  ? preempt_count_add+0x70/0xa0
>  __wait_for_common+0x90/0x1e0
>  ? __pfx_schedule_timeout+0x10/0x10
>  cma_remove_one+0x5c/0xd0 [rdma_cm]
>  remove_client_context+0x88/0xd0 [ib_core]
>  disable_device+0x8a/0x160 [ib_core]
>  ? _raw_spin_unlock+0x15/0x30
>  __ib_unregister_device+0x42/0xb0 [ib_core]
>  ib_unregister_device+0x22/0x30 [ib_core]
>  mlx5r_remove+0x36/0x60 [mlx5_ib]
>  auxiliary_bus_remove+0x18/0x30
>  device_release_driver_internal+0x193/0x200
>  bus_remove_device+0xbf/0x130
>  device_del+0x157/0x3e0
>  ? devl_param_driverinit_value_get+0x29/0xa0
>  mlx5_rescan_drivers_locked.part.0+0x7e/0x1b0 [mlx5_core]
>  mlx5_lag_remove_devices+0x3c/0x60 [mlx5_core]
>  mlx5_do_bond+0x2cb/0x390 [mlx5_core]
>  mlx5_do_bond_work+0x96/0xf0 [mlx5_core]
>  process_one_work+0x174/0x340
>  worker_thread+0x27e/0x390
>  ? __pfx_worker_thread+0x10/0x10
>  kthread+0xee/0x120
>  ? __pfx_kthread+0x10/0x10
>  ret_from_fork+0x2d/0x50
>  ? __pfx_kthread+0x10/0x10
>  ret_from_fork_asm+0x1b/0x30
>  </TASK>
> INFO: task fio:10584 blocked for more than 122 seconds.
>       Tainted: G            E      6.8.9-1.el9.elrepo.x86_64 #1
> ```
> 
> -- 
> Dan Aloni

Dan Aloni May 5, 2024, 8:10 p.m. UTC | #4

On 2024-05-05 15:00:10, Chuck Lever wrote:
> On Sun, May 05, 2024 at 09:36:28PM +0300, Dan Aloni wrote:
> > On 2024-05-05 10:35:46, Chuck Lever wrote:
> > [..]
> > > Second, I wonder if, when bonding interfaces, there's an opportunity
> > > for the verbs consumer to take an additional transport reference.
> > > Cc'ing linux-rdma for input on that issue. Can you show a brief
> > > diagram of when the ep reference count changes when bonding?
> > 
> > Not sure we need an additional reference here.
> 
> We already have two mechanisms in play:
> 
> - the ep reference count
> - the re_connect_status value
> 
> I would prefer not to add a boolean here. Seems like
> re_connect_status could do that job if we need it to.

That's possible. So before overrides e.g. `ep->re_connect_status =
-ENODEV` we would need to remember whether it was `1` and across the
call to `rpcrdma_force_disconnect()`, to do the putref() after it.
Anyway, the patch will be a big larger I guess.

>[..]
> > Testing 6.8.9 with both the patch and `wake_up_all(&ep->re_connect_wait);`
> > for `RDMA_CM_EVENT_DEVICE_REMOVAL` works for me, showing proper recovery
> > on bonding, I'll post in a reply.
> 
> It looks like you are trying to fix several issues in one patch. So
> I need you to separate these issues and the fixes, and let's focus
> on the upstream kernel (v6.9-rc6) because there's nothing I can do
> about the RHEL9 kernel, which is clearly a different source base
> than the one I work on.
> 
> If we need a "wake_up_all(&ep->re_connect_wait);" during
> DEVICE_REMOVAL, that should be a separate patch. And you need to
> figure out if ADDR_CHANGE needs the same treatment: the v2 you just
> sent changes the behavior of ADDR_CHANGE too but does not mention
> whether it intended that change.

Was it just the `rpcrdma_ep_put` change in this case? Will feel less
comfortable changing it without a repro of the `ADDR_CHANGE` case. Going
to isolate it.

> Without a reproducer or a detailed explanation of how the ep
> reference count changes in step with CM events, I can't properly
> assess your proposed fix.

With 6.8 I only see a single `RDMA_CM_EVENT_DEVICE_REMOVAL` event
arriving, driving a single `rpcrdma_ep_put` call to put the reference
back at 1, and it waits forever. Getting late here, I'll check on 6.9-rc
tomorrow.

Maybe for a repro, you can generate a single interface bond? I haven't
tried, but possibly it would occur on a single one too if the underlying
port is down.

kernel test robot May 12, 2024, 5:51 p.m. UTC | #5

Hi Dan,

kernel test robot noticed the following build warnings:

[auto build test WARNING on trondmy-nfs/linux-next]
[also build test WARNING on linus/master v6.9-rc7 next-20240510]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Dan-Aloni/rpcrdma-don-t-decref-EP-if-a-ESTABLISHED-did-not-happen/20240505-205016
base:   git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next
patch link:    https://lore.kernel.org/r/20240505124910.1877325-1-dan.aloni%40vastdata.com
patch subject: [PATCH] rpcrdma: don't decref EP if a ESTABLISHED did not happen
config: parisc-randconfig-r081-20240512 (https://download.01.org/0day-ci/archive/20240513/202405130122.S5dgt6et-lkp@intel.com/config)
compiler: hppa-linux-gcc (GCC) 13.2.0

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202405130122.S5dgt6et-lkp@intel.com/

smatch warnings:
net/sunrpc/xprtrdma/verbs.c:254 rpcrdma_cm_event_handler() warn: inconsistent indenting

vim +254 net/sunrpc/xprtrdma/verbs.c

87cfb9a0c85ce4 Chuck Lever         2016-09-15  213  
ae38288eb73c52 Chuck Lever         2018-10-01  214  /**
ae38288eb73c52 Chuck Lever         2018-10-01  215   * rpcrdma_cm_event_handler - Handle RDMA CM events
ae38288eb73c52 Chuck Lever         2018-10-01  216   * @id: rdma_cm_id on which an event has occurred
ae38288eb73c52 Chuck Lever         2018-10-01  217   * @event: details of the event
ae38288eb73c52 Chuck Lever         2018-10-01  218   *
ae38288eb73c52 Chuck Lever         2018-10-01  219   * Called with @id's mutex held. Returns 1 if caller should
ae38288eb73c52 Chuck Lever         2018-10-01  220   * destroy @id, otherwise 0.
ae38288eb73c52 Chuck Lever         2018-10-01  221   */
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  222  static int
ae38288eb73c52 Chuck Lever         2018-10-01  223  rpcrdma_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  224  {
745b734c9bb805 Chuck Lever         2020-02-21  225  	struct sockaddr *sap = (struct sockaddr *)&id->route.addr.dst_addr;
e28ce90083f032 Chuck Lever         2020-02-21  226  	struct rpcrdma_ep *ep = id->context;
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  227  
ae38288eb73c52 Chuck Lever         2018-10-01  228  	might_sleep();
ae38288eb73c52 Chuck Lever         2018-10-01  229  
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  230  	switch (event->event) {
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  231  	case RDMA_CM_EVENT_ADDR_RESOLVED:
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  232  	case RDMA_CM_EVENT_ROUTE_RESOLVED:
93aa8e0a9de80e Chuck Lever         2020-02-21  233  		ep->re_async_rc = 0;
93aa8e0a9de80e Chuck Lever         2020-02-21  234  		complete(&ep->re_done);
316a616e788658 Chuck Lever         2018-10-01  235  		return 0;
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  236  	case RDMA_CM_EVENT_ADDR_ERROR:
93aa8e0a9de80e Chuck Lever         2020-02-21  237  		ep->re_async_rc = -EPROTO;
93aa8e0a9de80e Chuck Lever         2020-02-21  238  		complete(&ep->re_done);
316a616e788658 Chuck Lever         2018-10-01  239  		return 0;
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  240  	case RDMA_CM_EVENT_ROUTE_ERROR:
93aa8e0a9de80e Chuck Lever         2020-02-21  241  		ep->re_async_rc = -ENETUNREACH;
93aa8e0a9de80e Chuck Lever         2020-02-21  242  		complete(&ep->re_done);
316a616e788658 Chuck Lever         2018-10-01  243  		return 0;
bebd031866caa4 Chuck Lever         2017-04-11  244  	case RDMA_CM_EVENT_DEVICE_REMOVAL:
745b734c9bb805 Chuck Lever         2020-02-21  245  		pr_info("rpcrdma: removing device %s for %pISpc\n",
745b734c9bb805 Chuck Lever         2020-02-21  246  			ep->re_id->device->name, sap);
df561f6688fef7 Gustavo A. R. Silva 2020-08-23  247  		fallthrough;
e28ce90083f032 Chuck Lever         2020-02-21  248  	case RDMA_CM_EVENT_ADDR_CHANGE:
93aa8e0a9de80e Chuck Lever         2020-02-21  249  		ep->re_connect_status = -ENODEV;
e28ce90083f032 Chuck Lever         2020-02-21  250  		goto disconnected;
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  251  	case RDMA_CM_EVENT_ESTABLISHED:
2acc5cae292355 Chuck Lever         2020-06-15  252  		rpcrdma_ep_get(ep);
c58f2c9a4198e3 Dan Aloni           2024-05-05  253                 ep->re_connect_ref = true;
93aa8e0a9de80e Chuck Lever         2020-02-21 @254  		ep->re_connect_status = 1;
745b734c9bb805 Chuck Lever         2020-02-21  255  		rpcrdma_update_cm_private(ep, &event->param.conn);
745b734c9bb805 Chuck Lever         2020-02-21  256  		trace_xprtrdma_inline_thresh(ep);
93aa8e0a9de80e Chuck Lever         2020-02-21  257  		wake_up_all(&ep->re_connect_wait);
31e62d25b5b815 Chuck Lever         2018-10-01  258  		break;
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  259  	case RDMA_CM_EVENT_CONNECT_ERROR:
93aa8e0a9de80e Chuck Lever         2020-02-21  260  		ep->re_connect_status = -ENOTCONN;
af667527b0e349 Chuck Lever         2020-06-27  261  		goto wake_connect_worker;
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  262  	case RDMA_CM_EVENT_UNREACHABLE:
93aa8e0a9de80e Chuck Lever         2020-02-21  263  		ep->re_connect_status = -ENETUNREACH;
af667527b0e349 Chuck Lever         2020-06-27  264  		goto wake_connect_worker;
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  265  	case RDMA_CM_EVENT_REJECTED:
93aa8e0a9de80e Chuck Lever         2020-02-21  266  		ep->re_connect_status = -ECONNREFUSED;
0a90487bf7182c Chuck Lever         2017-02-08  267  		if (event->status == IB_CM_REJ_STALE_CONN)
4cf44be6f1e86d Chuck Lever         2020-06-27  268  			ep->re_connect_status = -ENOTCONN;
af667527b0e349 Chuck Lever         2020-06-27  269  wake_connect_worker:
af667527b0e349 Chuck Lever         2020-06-27  270  		wake_up_all(&ep->re_connect_wait);
af667527b0e349 Chuck Lever         2020-06-27  271  		return 0;
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  272  	case RDMA_CM_EVENT_DISCONNECTED:
93aa8e0a9de80e Chuck Lever         2020-02-21  273  		ep->re_connect_status = -ECONNABORTED;
31e62d25b5b815 Chuck Lever         2018-10-01  274  disconnected:
c487eb7d8e4157 Chuck Lever         2020-06-15  275  		rpcrdma_force_disconnect(ep);
c58f2c9a4198e3 Dan Aloni           2024-05-05  276  		if (ep->re_connect_ref)
2acc5cae292355 Chuck Lever         2020-06-15  277  			return rpcrdma_ep_put(ep);
c58f2c9a4198e3 Dan Aloni           2024-05-05  278  		return 0;
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  279  	default:
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  280  		break;
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  281  	}
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  282  
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  283  	return 0;
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  284  }
c56c65fb67d646 \"Talpey, Thomas\   2007-09-10  285

rpcrdma: don't decref EP if a ESTABLISHED did not happen

Commit Message

Comments

Patch