Message ID | 20220105141841.411197-1-trondmy@kernel.org (mailing list archive) |
---|---|
State | Rejected |
Delegated to: | Jason Gunthorpe |
Headers | show |
Series | RDMA: null pointer in __ib_umem_release causes kernel panic | expand |
On Wed, Jan 05, 2022 at 09:18:41AM -0500, trondmy@kernel.org wrote: > From: Trond Myklebust <trond.myklebust@hammerspace.com> > > When doing RPC/RDMA, we're seeing a kernel panic when __ib_umem_release() > iterates over the scatter gather list and hits NULL pages. > > It turns out that commit 79fbd3e1241c ended up changing the iteration > from being over only the mapped entries to being over the original list > size. You mean this? - for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i) + for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i) I don't see what changed there? The invarient should be that umem->sg_nents == sgt->orig_nents > @@ -55,7 +55,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d > ib_dma_unmap_sgtable_attrs(dev, &umem->sgt_append.sgt, > DMA_BIDIRECTIONAL, 0); > > - for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i) > + for_each_sgtable_dma_sg(&umem->sgt_append.sgt, sg, i) > unpin_user_page_range_dirty_lock(sg_page(sg), Calling sg_page() from under a dma_sg iterator is unconditionally wrong.. More likely your case is something has gone wrong when the sgtable was created and it has the wrong value in orig_nents.. Jason
On Wed, 2022-01-05 at 10:37 -0400, Jason Gunthorpe wrote: > On Wed, Jan 05, 2022 at 09:18:41AM -0500, trondmy@kernel.org wrote: > > From: Trond Myklebust <trond.myklebust@hammerspace.com> > > > > When doing RPC/RDMA, we're seeing a kernel panic when > > __ib_umem_release() > > iterates over the scatter gather list and hits NULL pages. > > > > It turns out that commit 79fbd3e1241c ended up changing the > > iteration > > from being over only the mapped entries to being over the original > > list > > size. > > You mean this? > > - for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i) > + for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i) > > I don't see what changed there? The invarient should be that > > umem->sg_nents == sgt->orig_nents > > > @@ -55,7 +55,7 @@ static void __ib_umem_release(struct ib_device > > *dev, struct ib_umem *umem, int d > > ib_dma_unmap_sgtable_attrs(dev, &umem- > > >sgt_append.sgt, > > DMA_BIDIRECTIONAL, 0); > > > > - for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i) > > + for_each_sgtable_dma_sg(&umem->sgt_append.sgt, sg, i) > > unpin_user_page_range_dirty_lock(sg_page(sg), > > Calling sg_page() from under a dma_sg iterator is unconditionally > wrong.. > > More likely your case is something has gone wrong when the sgtable > was > created and it has the wrong value in orig_nents.. Can you define "wrong value" in this case? Chuck's RPC/RDMA code appears to call ib_alloc_mr() with an 'expected maximum number of entries' (depth) in net/sunrpc/xprtrdma/frwr_ops.c:frwr_mr_init(). It then fills that table with a set of n <= depth pages in net/sunrpc/xprtrdma/frwr_ops.c:frwr_map() and calls ib_dma_map_sg() to map them, and then adjusts the sgtable with a call to ib_map_mr_sg(). What part of that sequence of operations is incorrect? > > Jason
On Wed, Jan 05, 2022 at 03:02:34PM +0000, Trond Myklebust wrote: > On Wed, 2022-01-05 at 10:37 -0400, Jason Gunthorpe wrote: > > On Wed, Jan 05, 2022 at 09:18:41AM -0500, trondmy@kernel.org wrote: > > > From: Trond Myklebust <trond.myklebust@hammerspace.com> > > > > > > When doing RPC/RDMA, we're seeing a kernel panic when > > > __ib_umem_release() > > > iterates over the scatter gather list and hits NULL pages. > > > > > > It turns out that commit 79fbd3e1241c ended up changing the > > > iteration > > > from being over only the mapped entries to being over the original > > > list > > > size. > > > > You mean this? > > > > - for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i) > > + for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i) > > > > I don't see what changed there? The invarient should be that > > > > umem->sg_nents == sgt->orig_nents > > > > > @@ -55,7 +55,7 @@ static void __ib_umem_release(struct ib_device > > > *dev, struct ib_umem *umem, int d > > > ib_dma_unmap_sgtable_attrs(dev, &umem- > > > >sgt_append.sgt, > > > DMA_BIDIRECTIONAL, 0); > > > > > > - for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i) > > > + for_each_sgtable_dma_sg(&umem->sgt_append.sgt, sg, i) > > > unpin_user_page_range_dirty_lock(sg_page(sg), > > > > Calling sg_page() from under a dma_sg iterator is unconditionally > > wrong.. > > > > More likely your case is something has gone wrong when the sgtable > > was > > created and it has the wrong value in orig_nents.. > > Can you define "wrong value" in this case? Chuck's RPC/RDMA code > appears to call ib_alloc_mr() with an 'expected maximum number of > entries' (depth) in net/sunrpc/xprtrdma/frwr_ops.c:frwr_mr_init(). > > It then fills that table with a set of n <= depth pages in > net/sunrpc/xprtrdma/frwr_ops.c:frwr_map() and calls ib_dma_map_sg() to > map them, and then adjusts the sgtable with a call to ib_map_mr_sg(). I'm confused, RPC/RDMA should never touch a umem at all. Is this really the other bug where user and kernel MR are getting confused? Jason
On Wed, 2022-01-05 at 12:09 -0400, Jason Gunthorpe wrote: > On Wed, Jan 05, 2022 at 03:02:34PM +0000, Trond Myklebust wrote: > > On Wed, 2022-01-05 at 10:37 -0400, Jason Gunthorpe wrote: > > > On Wed, Jan 05, 2022 at 09:18:41AM -0500, > > > trondmy@kernel.org wrote: > > > > From: Trond Myklebust <trond.myklebust@hammerspace.com> > > > > > > > > When doing RPC/RDMA, we're seeing a kernel panic when > > > > __ib_umem_release() > > > > iterates over the scatter gather list and hits NULL pages. > > > > > > > > It turns out that commit 79fbd3e1241c ended up changing the > > > > iteration > > > > from being over only the mapped entries to being over the > > > > original > > > > list > > > > size. > > > > > > You mean this? > > > > > > - for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i) > > > + for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i) > > > > > > I don't see what changed there? The invarient should be that > > > > > > umem->sg_nents == sgt->orig_nents > > > > > > > @@ -55,7 +55,7 @@ static void __ib_umem_release(struct > > > > ib_device > > > > *dev, struct ib_umem *umem, int d > > > > ib_dma_unmap_sgtable_attrs(dev, &umem- > > > > > sgt_append.sgt, > > > > DMA_BIDIRECTIONAL, > > > > 0); > > > > > > > > - for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i) > > > > + for_each_sgtable_dma_sg(&umem->sgt_append.sgt, sg, i) > > > > unpin_user_page_range_dirty_lock(sg_page(sg), > > > > > > Calling sg_page() from under a dma_sg iterator is unconditionally > > > wrong.. > > > > > > More likely your case is something has gone wrong when the > > > sgtable > > > was > > > created and it has the wrong value in orig_nents.. > > > > Can you define "wrong value" in this case? Chuck's RPC/RDMA code > > appears to call ib_alloc_mr() with an 'expected maximum number of > > entries' (depth) in net/sunrpc/xprtrdma/frwr_ops.c:frwr_mr_init(). > > > > It then fills that table with a set of n <= depth pages in > > net/sunrpc/xprtrdma/frwr_ops.c:frwr_map() and calls ib_dma_map_sg() > > to > > map them, and then adjusts the sgtable with a call to > > ib_map_mr_sg(). > > I'm confused, RPC/RDMA should never touch a umem at all. > > Is this really the other bug where user and kernel MR are getting > confused? > As far as I know, RPC/RDMA is just using the RDMA api to register and unregister chunks of memory, so it is definitely not directly touching the umem. Let me just copy/paste the stack dump that we got. This was a 5.15.12 kernel running on Azure with an infiniband setup. It's a little garbled (sorry) but the stack trace itself is fairly clear that this is happening on transport teardown and that we're iterating through a NULL page. 2022-01-04 17:53:24 kernel: [ 2922.654038] nf[s: 2922.656486] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 2922.660304] #PF: supervisor read access in kernel mode serv[er 1722.1962.20..662320] #PF: error_code(0x0000) - not-present page [ 2922.664606] PGD 1d6c381067 P4D 1d6c381067 PUD 1d0c93e067 PMD 0 2[1 no2t9 r2espo2ndi.n667089] Oops: 0000 [#1] SMP NOPTI [ 2922.668823] CPU: 27 PID: 11384 Comm: kworker/u241:4 Kdump: loaded Tainted: G W 5.15.12-200.pd.17715.el7.x86_64 #1 73695] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018 [ 2922.677741] Workqueue: xprtiod xprt_autoclose [sunrpc] [202 2-2019-042 217.:5679875] RIP: 0010:__ib_umem_release+0x7a/0xa0 [ib_uverbs] [ 2922.682325] Code: f6 cb 48 89 ef e8 16 90 2e cc 48 89 c5 41 39 5c 24 5c 77 cd 5b 49 8d 7c 24 50 5d 41 5c 41 5d e9 ec 96 2e cc 41 bd 01 00 00 00 <48> 8b 3f 48 85 ff 74 9f 41 8b 54 24 5c 49 8b 74 24 50 45 31 c0 31 3:24[ k2e92rn2.el6: [ 920363] RSP: 0018:ffffb53dee517d28 EFLAGS: 00010246 [ 2922.692787] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9f222edfa180 [92 2.2692540428.]6 nf9s:5675] RDX: 0000000000000001 RSI: ffff9f31b0dc4000 RDI: 0000000000000000 [ 2922.698735] RBP: ffff9f31b0dc4000 R08: 0000000000000000 R09: 0000000000000000 s[er ver2 19722.126.0.701658] R10: ffff9f3ddfb69680 R11: ffffb53dee517d50 R12: ffff9f31b0dc4000 [ 2922.704764] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9f23f2cc09b0 [. 212 n9ot2 re2s.po7nd0i8268] FS: 0000000000000000(0000) GS:ffff9f3d0fcc0000(0000) knlGS:0000000000000000 [ 2922.713056] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 ng[, st2il9l22 .tr7y1in5g925] CR2: 0000000000000000 CR3: 0000001dafdb6005 CR4: 0000000000370ee0 [ 2922.720003] Call Trace: 202229-2021.-047 127:1202] <TASK> [ 2922.722467] ib_umem_release+0x2a/0x90 [ib_uverbs] 53:24[ ke rn2el9: [2 2.724829] mlx5_ib_dereg_mr+0x1fe/0x3d0 [mlx5_ib] [ 2922.727447] ib_dereg_mr_user+0x40/0xa0 [ib_core] 2922.6[540 51]2 nfs922.729658] frwr_mr_release+0x20/0x70 [rpcrdma] [ 2922.732133] rpcrdma_xprt_disconnect+0x2b4/0x3a0 [rpcrdma] [: se2rve9r 1272.126..734753] xprt_rdma_close+0xe/0x30 [rpcrdma] [ 2922.737096] xprt_autoclose+0x52/0xf0 [sunrpc] 0.[21 no2t r9es2pon2d.739248] process_one_work+0x1f1/0x390 [ 2922.741247] worker_thread+0x53/0x3e0 ing[, s2til9l try2in2.743081] ? process_one_work+0x390/0x390 [ 2922.745341] kthread+0x127/0x150 [ 2922.747185] ? set_kthread_struct+0x40/0x40 [ 2922.749685] ret_from_fork+0x22/0x30 [ 2922.751889] </TASK> [ 2922.753068] Modules linked in: nfsv3 bpf_preload veth nfs_layout_flexfiles auth_name rpcsec_gss_krb5 nfsv4 dns_resolver nfsidmap nfs fscache netfs xt_nat xt_MASQUERADE nf_conntrack_netlink xt_addrtypebr_netfilter bridge stp llc nfsd auth_rpcgss overlay nfs_acl lockd grace xt_owner nf_nat_ftp nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ftp xt_CT xt_sctp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bonding ipmi_msghandler rpcrdma sunrpc rdma_ucm ib_iser libiscsi ib_umad scsi_transport_iscsi rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib raid0 ib_uverbs ib_core vfat fat mlx5_core nvme mlxfw psample nvme_core tls intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pci_hyperv hyperv_drm hv_balloon [ 2922.753108] hv_utils hyperv_keyboard hid_hyperv pci_hyperv_intf drm_kms_helper cec drm i2c_piix4 hyperv_fb dm_multipath ip_tables xfs hv_storvsc hv_netvsc scsi_transport_fc crc32c_intel ata_generic hv_vmbus pata_acpi [ 2922.803987] CR2: 0000000000000000 [ 2922.805713] ---[ end trace 450d08bd4a337f29 ]--- [ 2922.808718] RIP: 0010:__ib_umem_release+0x7a/0xa0 [ib_uverbs] [ 2922.812103] Code: f6 cb 48 89 ef e8 16 90 2e cc 48 89 c5 41 39 5c 24 5c 77 cd 5b 49 8d 7c 24 50 5d 41 5c 41 5d e9 ec 96 2e cc 41 bd 01 00 00 00 <48> 8b 3f 48 85 ff 74 9f 41 8b 54 24 5c 49 8b 74 24 50 45 31 c0 31 [ 2922.821369] RSP: 0018:ffffb53dee517d28 EFLAGS: 00010246 [ 2922.824135] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9f222edfa180 [ 2922.827639] RDX: 0000000000000001 RSI: ffff9f31b0dc4000 RDI: 0000000000000000 [ 2922.831121] RBP: ffff9f31b0dc4000 R08: 0000000000000000 R09: 0000000000000000 [ 2922.834528] R10: ffff9f3ddfb69680 R11: ffffb53dee517d50 R12: ffff9f31b0dc4000 [ 2922.838042] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9f23f2cc09b0 [ 2922.841442] FS: 0000000000000000(0000) GS:ffff9f3d0fcc0000(0000) knlGS:0000000000000000 [ 2922.845340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2922.848327] CR2: 0000000000000000 CR3: 0000001dafdb6005 CR4: 0000000000370ee0 [ 2922.852791] Kernel panic - not syncing: Fatal exception [ 2922.858167] Kernel Offset: 0xc000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
On Wed, Jan 05, 2022 at 05:16:06PM +0000, Trond Myklebust wrote: > > I'm confused, RPC/RDMA should never touch a umem at all. > > > > Is this really the other bug where user and kernel MR are getting > > confused? > > > > As far as I know, RPC/RDMA is just using the RDMA api to register and > unregister chunks of memory, so it is definitely not directly touching > the umem. I mean, RPC/RDMA doesn't have a umem at all, so seeing it any stack trace says something is corrupted I suppose it is this: https://lore.kernel.org/r/20211222101312.1358616-1-maorg@nvidia.com Jason
On Wed, 2022-01-05 at 13:43 -0400, Jason Gunthorpe wrote: > On Wed, Jan 05, 2022 at 05:16:06PM +0000, Trond Myklebust wrote: > > > I'm confused, RPC/RDMA should never touch a umem at all. > > > > > > Is this really the other bug where user and kernel MR are getting > > > confused? > > > > > > > As far as I know, RPC/RDMA is just using the RDMA api to register > > and > > unregister chunks of memory, so it is definitely not directly > > touching > > the umem. > > I mean, RPC/RDMA doesn't have a umem at all, so seeing it any stack > trace says something is corrupted > > I suppose it is this: > > https://lore.kernel.org/r/20211222101312.1358616-1-maorg@nvidia.com > > Jason > Ah... Thanks! I'll try that out.
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 86d479772fbc..59304bae13ca 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -55,7 +55,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d ib_dma_unmap_sgtable_attrs(dev, &umem->sgt_append.sgt, DMA_BIDIRECTIONAL, 0); - for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i) + for_each_sgtable_dma_sg(&umem->sgt_append.sgt, sg, i) unpin_user_page_range_dirty_lock(sg_page(sg), DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);