RDMA: null pointer in __ib_umem_release causes kernel panic

Message ID	20220105141841.411197-1-trondmy@kernel.org (mailing list archive)
State	Rejected
Delegated to:	Jason Gunthorpe
Headers	show Return-Path: <linux-rdma-owner@kernel.org> From: trondmy@kernel.org To: Jason Gunthorpe <jgg@nvidia.com> Cc: linux-rdma@vger.kernel.org, linux-nfs@vger.kernel.org Subject: [PATCH] RDMA: null pointer in __ib_umem_release causes kernel panic Date: Wed, 5 Jan 2022 09:18:41 -0500 Message-Id: <20220105141841.411197-1-trondmy@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	RDMA: null pointer in __ib_umem_release causes kernel panic \| expand RDMA: null pointer in __ib_umem_release causes kernel panic

Trond Myklebust Jan. 5, 2022, 2:18 p.m. UTC

From: Trond Myklebust <trond.myklebust@hammerspace.com>

When doing RPC/RDMA, we're seeing a kernel panic when __ib_umem_release()
iterates over the scatter gather list and hits NULL pages.

It turns out that commit 79fbd3e1241c ended up changing the iteration
from being over only the mapped entries to being over the original list
size.

Fixes: 79fbd3e1241c ("RDMA: Use the sg_table directly and remove the opencoded version from umem")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 drivers/infiniband/core/umem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Jason Gunthorpe Jan. 5, 2022, 2:37 p.m. UTC | #1

On Wed, Jan 05, 2022 at 09:18:41AM -0500, trondmy@kernel.org wrote:
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
> 
> When doing RPC/RDMA, we're seeing a kernel panic when __ib_umem_release()
> iterates over the scatter gather list and hits NULL pages.
> 
> It turns out that commit 79fbd3e1241c ended up changing the iteration
> from being over only the mapped entries to being over the original list
> size.

You mean this?

-       for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i)
+       for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i)

I don't see what changed there? The invarient should be that

  umem->sg_nents == sgt->orig_nents

> @@ -55,7 +55,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
>  		ib_dma_unmap_sgtable_attrs(dev, &umem->sgt_append.sgt,
>  					   DMA_BIDIRECTIONAL, 0);
>  
> -	for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i)
> +	for_each_sgtable_dma_sg(&umem->sgt_append.sgt, sg, i)
>  		unpin_user_page_range_dirty_lock(sg_page(sg),

Calling sg_page() from under a dma_sg iterator is unconditionally
wrong..

More likely your case is something has gone wrong when the sgtable was
created and it has the wrong value in orig_nents..

Jason

Trond Myklebust Jan. 5, 2022, 3:02 p.m. UTC | #2

On Wed, 2022-01-05 at 10:37 -0400, Jason Gunthorpe wrote:
> On Wed, Jan 05, 2022 at 09:18:41AM -0500, trondmy@kernel.org wrote:
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > 
> > When doing RPC/RDMA, we're seeing a kernel panic when
> > __ib_umem_release()
> > iterates over the scatter gather list and hits NULL pages.
> > 
> > It turns out that commit 79fbd3e1241c ended up changing the
> > iteration
> > from being over only the mapped entries to being over the original
> > list
> > size.
> 
> You mean this?
> 
> -       for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i)
> +       for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i)
> 
> I don't see what changed there? The invarient should be that
> 
>   umem->sg_nents == sgt->orig_nents
> 
> > @@ -55,7 +55,7 @@ static void __ib_umem_release(struct ib_device
> > *dev, struct ib_umem *umem, int d
> >                 ib_dma_unmap_sgtable_attrs(dev, &umem-
> > >sgt_append.sgt,
> >                                            DMA_BIDIRECTIONAL, 0);
> >  
> > -       for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i)
> > +       for_each_sgtable_dma_sg(&umem->sgt_append.sgt, sg, i)
> >                 unpin_user_page_range_dirty_lock(sg_page(sg),
> 
> Calling sg_page() from under a dma_sg iterator is unconditionally
> wrong..
> 
> More likely your case is something has gone wrong when the sgtable
> was
> created and it has the wrong value in orig_nents..

Can you define "wrong value" in this case? Chuck's RPC/RDMA code
appears to call ib_alloc_mr() with an 'expected maximum number of
entries' (depth) in net/sunrpc/xprtrdma/frwr_ops.c:frwr_mr_init().

It then fills that table with a set of n <= depth pages in
net/sunrpc/xprtrdma/frwr_ops.c:frwr_map() and calls ib_dma_map_sg() to
map them, and then adjusts the sgtable with a call to ib_map_mr_sg().


What part of that sequence of operations is incorrect?
> 
> Jason

Jason Gunthorpe Jan. 5, 2022, 4:09 p.m. UTC | #3

On Wed, Jan 05, 2022 at 03:02:34PM +0000, Trond Myklebust wrote:
> On Wed, 2022-01-05 at 10:37 -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 05, 2022 at 09:18:41AM -0500, trondmy@kernel.org wrote:
> > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > 
> > > When doing RPC/RDMA, we're seeing a kernel panic when
> > > __ib_umem_release()
> > > iterates over the scatter gather list and hits NULL pages.
> > > 
> > > It turns out that commit 79fbd3e1241c ended up changing the
> > > iteration
> > > from being over only the mapped entries to being over the original
> > > list
> > > size.
> > 
> > You mean this?
> > 
> > -       for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i)
> > +       for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i)
> > 
> > I don't see what changed there? The invarient should be that
> > 
> >   umem->sg_nents == sgt->orig_nents
> > 
> > > @@ -55,7 +55,7 @@ static void __ib_umem_release(struct ib_device
> > > *dev, struct ib_umem *umem, int d
> > >                 ib_dma_unmap_sgtable_attrs(dev, &umem-
> > > >sgt_append.sgt,
> > >                                            DMA_BIDIRECTIONAL, 0);
> > >  
> > > -       for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i)
> > > +       for_each_sgtable_dma_sg(&umem->sgt_append.sgt, sg, i)
> > >                 unpin_user_page_range_dirty_lock(sg_page(sg),
> > 
> > Calling sg_page() from under a dma_sg iterator is unconditionally
> > wrong..
> > 
> > More likely your case is something has gone wrong when the sgtable
> > was
> > created and it has the wrong value in orig_nents..
> 
> Can you define "wrong value" in this case? Chuck's RPC/RDMA code
> appears to call ib_alloc_mr() with an 'expected maximum number of
> entries' (depth) in net/sunrpc/xprtrdma/frwr_ops.c:frwr_mr_init().
> 
> It then fills that table with a set of n <= depth pages in
> net/sunrpc/xprtrdma/frwr_ops.c:frwr_map() and calls ib_dma_map_sg() to
> map them, and then adjusts the sgtable with a call to ib_map_mr_sg().

I'm confused, RPC/RDMA should never touch a umem at all.

Is this really the other bug where user and kernel MR are getting
confused?

Jason

Trond Myklebust Jan. 5, 2022, 5:16 p.m. UTC | #4

On Wed, 2022-01-05 at 12:09 -0400, Jason Gunthorpe wrote:
> On Wed, Jan 05, 2022 at 03:02:34PM +0000, Trond Myklebust wrote:
> > On Wed, 2022-01-05 at 10:37 -0400, Jason Gunthorpe wrote:
> > > On Wed, Jan 05, 2022 at 09:18:41AM -0500,
> > > trondmy@kernel.org wrote:
> > > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > > 
> > > > When doing RPC/RDMA, we're seeing a kernel panic when
> > > > __ib_umem_release()
> > > > iterates over the scatter gather list and hits NULL pages.
> > > > 
> > > > It turns out that commit 79fbd3e1241c ended up changing the
> > > > iteration
> > > > from being over only the mapped entries to being over the
> > > > original
> > > > list
> > > > size.
> > > 
> > > You mean this?
> > > 
> > > -       for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i)
> > > +       for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i)
> > > 
> > > I don't see what changed there? The invarient should be that
> > > 
> > >   umem->sg_nents == sgt->orig_nents
> > > 
> > > > @@ -55,7 +55,7 @@ static void __ib_umem_release(struct
> > > > ib_device
> > > > *dev, struct ib_umem *umem, int d
> > > >                 ib_dma_unmap_sgtable_attrs(dev, &umem-
> > > > > sgt_append.sgt,
> > > >                                            DMA_BIDIRECTIONAL,
> > > > 0);
> > > >  
> > > > -       for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i)
> > > > +       for_each_sgtable_dma_sg(&umem->sgt_append.sgt, sg, i)
> > > >                 unpin_user_page_range_dirty_lock(sg_page(sg),
> > > 
> > > Calling sg_page() from under a dma_sg iterator is unconditionally
> > > wrong..
> > > 
> > > More likely your case is something has gone wrong when the
> > > sgtable
> > > was
> > > created and it has the wrong value in orig_nents..
> > 
> > Can you define "wrong value" in this case? Chuck's RPC/RDMA code
> > appears to call ib_alloc_mr() with an 'expected maximum number of
> > entries' (depth) in net/sunrpc/xprtrdma/frwr_ops.c:frwr_mr_init().
> > 
> > It then fills that table with a set of n <= depth pages in
> > net/sunrpc/xprtrdma/frwr_ops.c:frwr_map() and calls ib_dma_map_sg()
> > to
> > map them, and then adjusts the sgtable with a call to
> > ib_map_mr_sg().
> 
> I'm confused, RPC/RDMA should never touch a umem at all.
> 
> Is this really the other bug where user and kernel MR are getting
> confused?
> 

As far as I know, RPC/RDMA is just using the RDMA api to register and
unregister chunks of memory, so it is definitely not directly touching
the umem. 

Let me just copy/paste the stack dump that we got. This was a 5.15.12
kernel running on Azure with an infiniband setup. It's a little garbled
(sorry) but the stack trace itself is fairly clear that this is
happening on transport teardown and that we're iterating through a NULL
page.


2022-01-04 17:53:24 kernel: [ 2922.654038] nf[s:  2922.656486] BUG:
kernel NULL pointer dereference, address: 0000000000000000
[ 2922.660304] #PF: supervisor read access in kernel mode
serv[er  1722.1962.20..662320] #PF: error_code(0x0000) - not-present
page
[ 2922.664606] PGD 1d6c381067 P4D 1d6c381067 PUD 1d0c93e067 PMD 0
2[1  no2t9 r2espo2ndi.n667089] Oops: 0000 [#1] SMP NOPTI
[ 2922.668823] CPU: 27 PID: 11384 Comm: kworker/u241:4 Kdump: loaded
Tainted: G        W         5.15.12-200.pd.17715.el7.x86_64 #1
73695] Hardware name: Microsoft Corporation Virtual Machine/Virtual
Machine, BIOS 090008  12/07/2018
[ 2922.677741] Workqueue: xprtiod xprt_autoclose [sunrpc]

[202 2-2019-042 217.:5679875] RIP: 0010:__ib_umem_release+0x7a/0xa0
[ib_uverbs]
[ 2922.682325] Code: f6 cb 48 89 ef e8 16 90 2e cc 48 89 c5 41 39 5c 24
5c 77 cd 5b 49 8d 7c 24 50 5d 41 5c 41 5d e9 ec 96 2e cc 41 bd 01 00 00
00 <48> 8b 3f 48 85 ff 74 9f 41 8b 54 24 5c 49 8b 74 24 50 45 31 c0 31
3:24[  k2e92rn2.el6: [ 920363] RSP: 0018:ffffb53dee517d28 EFLAGS:
00010246
[ 2922.692787] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
ffff9f222edfa180
[92 2.2692540428.]6 nf9s:5675] RDX: 0000000000000001 RSI:
ffff9f31b0dc4000 RDI: 0000000000000000
[ 2922.698735] RBP: ffff9f31b0dc4000 R08: 0000000000000000 R09:
0000000000000000
 s[er ver2 19722.126.0.701658] R10: ffff9f3ddfb69680 R11:
ffffb53dee517d50 R12: ffff9f31b0dc4000
[ 2922.704764] R13: 0000000000000000 R14: 0000000000000000 R15:
ffff9f23f2cc09b0
[. 212 n9ot2 re2s.po7nd0i8268] FS:  0000000000000000(0000)
GS:ffff9f3d0fcc0000(0000) knlGS:0000000000000000
[ 2922.713056] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
ng[,  st2il9l22 .tr7y1in5g925] CR2: 0000000000000000 CR3:
0000001dafdb6005 CR4: 0000000000370ee0
[ 2922.720003] Call Trace:

 202229-2021.-047 127:1202]  <TASK>
[ 2922.722467]  ib_umem_release+0x2a/0x90 [ib_uverbs]
53:24[ ke rn2el9: [2 2.724829]  mlx5_ib_dereg_mr+0x1fe/0x3d0 [mlx5_ib]
[ 2922.727447]  ib_dereg_mr_user+0x40/0xa0 [ib_core]
2922.6[540 51]2 nfs922.729658]  frwr_mr_release+0x20/0x70 [rpcrdma]
[ 2922.732133]  rpcrdma_xprt_disconnect+0x2b4/0x3a0 [rpcrdma]
[:  se2rve9r 1272.126..734753]  xprt_rdma_close+0xe/0x30 [rpcrdma]
[ 2922.737096]  xprt_autoclose+0x52/0xf0 [sunrpc]
0.[21  no2t r9es2pon2d.739248]  process_one_work+0x1f1/0x390
[ 2922.741247]  worker_thread+0x53/0x3e0
ing[,  s2til9l try2in2.743081]  ? process_one_work+0x390/0x390
[ 2922.745341]  kthread+0x127/0x150
[
  2922.747185]  ? set_kthread_struct+0x40/0x40
[ 2922.749685]  ret_from_fork+0x22/0x30
[ 2922.751889]  </TASK>
[ 2922.753068] Modules linked in: nfsv3 bpf_preload veth
nfs_layout_flexfiles auth_name rpcsec_gss_krb5 nfsv4 dns_resolver
nfsidmap nfs fscache netfs xt_nat xt_MASQUERADE nf_conntrack_netlink
xt_addrtypebr_netfilter bridge stp llc nfsd auth_rpcgss overlay nfs_acl
lockd grace xt_owner nf_nat_ftp nf_conntrack_netbios_ns
nf_conntrack_broadcast nf_conntrack_ftp xt_CT xt_sctp ip6t_rpfilter
ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack
ip6table_nat ip6table_mangle ip6table_security ip6table_raw iptable_nat
nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack
nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ip6table_filter
ip6_tables iptable_filter bonding ipmi_msghandler rpcrdma sunrpc
rdma_ucm ib_iser libiscsi ib_umad scsi_transport_iscsi rdma_cm ib_ipoib
iw_cm ib_cm mlx5_ib raid0 ib_uverbs ib_core vfat fat mlx5_core nvme
mlxfw psample nvme_core tls intel_rapl_msr intel_rapl_common
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pci_hyperv hyperv_drm
hv_balloon
[ 2922.753108]  hv_utils hyperv_keyboard hid_hyperv pci_hyperv_intf
drm_kms_helper cec drm i2c_piix4 hyperv_fb dm_multipath ip_tables xfs
hv_storvsc hv_netvsc scsi_transport_fc crc32c_intel ata_generic
hv_vmbus pata_acpi
[ 2922.803987] CR2: 0000000000000000
[ 2922.805713] ---[ end trace 450d08bd4a337f29 ]---
[ 2922.808718] RIP: 0010:__ib_umem_release+0x7a/0xa0 [ib_uverbs]
[ 2922.812103] Code: f6 cb 48 89 ef e8 16 90 2e cc 48 89 c5 41 39 5c 24
5c 77 cd 5b 49 8d 7c 24 50 5d 41 5c 41 5d e9 ec 96 2e cc 41 bd 01 00 00
00 <48> 8b 3f 48 85 ff 74 9f 41 8b 54 24 5c 49 8b 74 24 50 45 31 c0 31
[ 2922.821369] RSP: 0018:ffffb53dee517d28 EFLAGS: 00010246
[ 2922.824135] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
ffff9f222edfa180
[ 2922.827639] RDX: 0000000000000001 RSI: ffff9f31b0dc4000 RDI:
0000000000000000
[ 2922.831121] RBP: ffff9f31b0dc4000 R08: 0000000000000000 R09:
0000000000000000
[ 2922.834528] R10: ffff9f3ddfb69680 R11: ffffb53dee517d50 R12:
ffff9f31b0dc4000
[ 2922.838042] R13: 0000000000000000 R14: 0000000000000000 R15:
ffff9f23f2cc09b0
[ 2922.841442] FS:  0000000000000000(0000) GS:ffff9f3d0fcc0000(0000)
knlGS:0000000000000000
[ 2922.845340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2922.848327] CR2: 0000000000000000 CR3: 0000001dafdb6005 CR4:
0000000000370ee0
[ 2922.852791] Kernel panic - not syncing: Fatal exception
[ 2922.858167] Kernel Offset: 0xc000000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Jason Gunthorpe Jan. 5, 2022, 5:43 p.m. UTC | #5

On Wed, Jan 05, 2022 at 05:16:06PM +0000, Trond Myklebust wrote:
> > I'm confused, RPC/RDMA should never touch a umem at all.
> > 
> > Is this really the other bug where user and kernel MR are getting
> > confused?
> > 
> 
> As far as I know, RPC/RDMA is just using the RDMA api to register and
> unregister chunks of memory, so it is definitely not directly touching
> the umem. 

I mean, RPC/RDMA doesn't have a umem at all, so seeing it any stack
trace says something is corrupted

I suppose it is this:

https://lore.kernel.org/r/20211222101312.1358616-1-maorg@nvidia.com

Jason

Trond Myklebust Jan. 5, 2022, 5:49 p.m. UTC | #6

On Wed, 2022-01-05 at 13:43 -0400, Jason Gunthorpe wrote:
> On Wed, Jan 05, 2022 at 05:16:06PM +0000, Trond Myklebust wrote:
> > > I'm confused, RPC/RDMA should never touch a umem at all.
> > > 
> > > Is this really the other bug where user and kernel MR are getting
> > > confused?
> > > 
> > 
> > As far as I know, RPC/RDMA is just using the RDMA api to register
> > and
> > unregister chunks of memory, so it is definitely not directly
> > touching
> > the umem. 
> 
> I mean, RPC/RDMA doesn't have a umem at all, so seeing it any stack
> trace says something is corrupted
> 
> I suppose it is this:
> 
> https://lore.kernel.org/r/20211222101312.1358616-1-maorg@nvidia.com
> 
> Jason
> 

Ah... Thanks! I'll try that out.

RDMA: null pointer in __ib_umem_release causes kernel panic

Commit Message

Comments

Patch