Message ID: 20201113190851.7817-1-olga.kornievskaia@gmail.com
State: New, archived
Series: [1/1] NFSv4.2: fix LISTXATTR buffer receive size
Hi Olga- > On Nov 13, 2020, at 2:08 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > From: Olga Kornievskaia <kolga@netapp.com> > > xfstest generic/013 over on a NFSoRDMA over SoftRoCE or iWarp panics > and running with KASAN reports: There is still only a highly circumstantial connection between soft RoCE and iWarp and these crashes, so this description seems misleading and/or pre-mature. > [ 216.018711] BUG: KASAN: wild-memory-access in rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma] > [ 216.024195] Write of size 12 at addr 0005088000000000 by task kworker/1:1H/480 > [ 216.028820] > [ 216.029776] CPU: 1 PID: 480 Comm: kworker/1:1H Not tainted 5.8.0-rc5+ #37 > [ 216.034247] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020 > [ 216.040604] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core] > [ 216.043739] Call Trace: > [ 216.045014] dump_stack+0x7c/0xb0 > [ 216.046757] ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma] > [ 216.050008] ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma] > [ 216.053091] kasan_report.cold.10+0x6a/0x85 > [ 216.055703] ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma] > [ 216.058979] check_memory_region+0x183/0x1e0 > [ 216.061933] memcpy+0x38/0x60 > [ 216.064077] rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma] > [ 216.067502] ? rpcrdma_reset_cwnd+0x70/0x70 [rpcrdma] > [ 216.070268] ? recalibrate_cpu_khz+0x10/0x10 > [ 216.072585] ? rpcrdma_reply_handler+0x604/0x6e0 [rpcrdma] > [ 216.075469] __ib_process_cq+0xa7/0x220 [ib_core] > [ 216.078077] ib_cq_poll_work+0x31/0xb0 [ib_core] > [ 216.080451] process_one_work+0x387/0x6c0 > [ 216.082498] worker_thread+0x57/0x5a0 > [ 216.084425] ? process_one_work+0x6c0/0x6c0 > [ 216.086583] kthread+0x1ca/0x200 > [ 216.088775] ? kthread_create_on_node+0xc0/0xc0 > [ 216.091847] ret_from_fork+0x22/0x30 > > Fixes: 6c2190b3fcbc ("NFS: Fix listxattr receive buffer size") > Signed-off-by: Olga Kornievskaia <kolga@netapp.com> > --- > fs/nfs/nfs42xdr.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/fs/nfs/nfs42xdr.c b/fs/nfs/nfs42xdr.c > index 6e060a8..e88bc7a 100644 > --- a/fs/nfs/nfs42xdr.c > +++ b/fs/nfs/nfs42xdr.c > @@ -196,7 +196,8 @@ > 1 + nfs4_xattr_name_maxsz + 1) > #define decode_setxattr_maxsz (op_decode_hdr_maxsz + decode_change_info_maxsz) > #define encode_listxattrs_maxsz (op_encode_hdr_maxsz + 2 + 1) > -#define decode_listxattrs_maxsz (op_decode_hdr_maxsz + 2 + 1 + 1 + 1) > +#define decode_listxattrs_maxsz (op_decode_hdr_maxsz + 2 + 1 + 1 + \ > + XDR_QUADLEN(NFS4_OPAQUE_LIMIT)) From RFC 8276, Section 8.4.3.2: /// struct LISTXATTRS4resok { /// nfs_cookie4 lxr_cookie; /// xattrkey4 lxr_names<>; /// bool lxr_eof; /// }; The decode_listxattrs_maxsz macro defines the maximum size of the /fixed portion/ of the LISTXATTRS reply. That is the operation status code, the cookie, and the EOF flag. maxsz has an extra "+ 1" for rpc_prepare_reply_pages() to deal with possible XDR padding. The post-6c2190b3fcbc value of this macro is already correct, and removing the "+ 1" is wrong. In addition, the variable portion of the result is an unbounded array of component4 fields, nothing to do with NFS4_OPAQUE_LIMIT, so that's just an arbitrary constant here with no justification. Adding more space to the receive buffer happens to help the case where there are no xattrs to list. That doesn't mean its the correct solution in general. It certainly won't be sufficient to handle an xattr name array that is larger than 1024 bytes. 
The client manages the variable portion of that result in the reply's pages array, which is set up by nfs4_xdr_enc_listxattrs(). Further: > rpcrdma_complete_rqst+0x447 is in the paragraph of rpcrdma_inline_fixup() that copies received bytes into the receive xdr_buf's pages array. The address "0005088000000000" is bogus. Since nfs4_xdr_enc_listxattrs() sets XDRBUF_SPARSE_PAGES, it's likely it is relying on the transport to allocate pages for this buffer, and possibly that page allocation has failed or has a bug. Please determine why the encode side has not set up the correct number of pages to handle the LISTXATTRS result. Until then I have to NACK this one. > #define encode_removexattr_maxsz (op_encode_hdr_maxsz + 1 + \ > nfs4_xattr_name_maxsz) > #define decode_removexattr_maxsz (op_decode_hdr_maxsz + \ > -- > 1.8.3.1 -- Chuck Lever
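As a stand-alone illustration (not kernel code) of the structure quoted from RFC 8276: the reply is a cookie, a count, a list of length-prefixed names, and an EOF flag. Only the name list is unbounded; the rest is a handful of XDR words. Which of those words the client decodes from the head kvec and which from the pages array is the crux of the disagreement above.

/* Illustration only: on-the-wire size of a LISTXATTRS4resok
 * (RFC 8276, section 8.4.3.2). Compiles with any C compiler.
 */
#include <stdio.h>
#include <string.h>

static size_t xdr_string_size(const char *s)
{
    size_t len = strlen(s);

    /* length word plus the bytes, padded up to a 4-byte boundary */
    return 4 + ((len + 3) & ~(size_t)3);
}

int main(void)
{
    const char *names[] = { "user.a", "user.bigger_name" };
    size_t i, total = 0;

    total += 8;    /* lxr_cookie: nfs_cookie4 (64 bits) */
    total += 4;    /* lxr_names<>: element count        */
    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
        total += xdr_string_size(names[i]);    /* each xattrkey4 */
    total += 4;    /* lxr_eof: bool                     */

    printf("LISTXATTRS4resok with %zu names: %zu bytes\n",
           sizeof(names) / sizeof(names[0]), total);
    return 0;
}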
Hi Chuck, The first problem I found was from 5.10-rc3 testing was from the fact that tail's iov_len was set to -4 and reply_chunk was trying to call rpcrdma_convert_kvec() for a tail that didn't exist. Do you see issues with this fix? diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c index 71e03b930b70..2e6a228abb95 100644 --- a/net/sunrpc/xdr.c +++ b/net/sunrpc/xdr.c @@ -193,7 +193,7 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned int offset, tail->iov_base = buf + offset; tail->iov_len = buflen - offset; - if ((xdr->page_len & 3) == 0) + if ((xdr->page_len & 3) == 0 && tail->iov_len) tail->iov_len -= sizeof(__be32); xdr->buflen += len; This one fixes KASAN's reported problem of added at the end of this message. The next problem that I can't figure out yet is another KASAN's report during the completion of the request. [ 99.610666] BUG: KASAN: wild-memory-access in rpcrdma_complete_rqst+0x41b/0x680 [rpcrdma] [ 99.617947] Write of size 4 at addr 0005088000000000 by task kworker/1:1H/490 This is the KASAN for the negative tail problem: [ 665.767611] ================================================================== [ 665.772202] BUG: KASAN: slab-out-of-bounds in rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] [ 665.777860] Write of size 8 at addr ffff88803ded9b70 by task fsstress/3123 [ 665.783754] [ 665.784981] CPU: 0 PID: 3123 Comm: fsstress Not tainted 5.10.0-rc3+ #38 [ 665.790534] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020 [ 665.798538] Call Trace: [ 665.800398] dump_stack+0x7c/0xa2 [ 665.802647] ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] [ 665.808145] print_address_description.constprop.7+0x1e/0x230 [ 665.812543] ? record_print_text.cold.38+0x11/0x11 [ 665.816093] ? _raw_write_lock_irqsave+0xe0/0xe0 [ 665.819257] ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] [ 665.823368] ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] [ 665.827837] kasan_report.cold.9+0x37/0x7c [ 665.830076] ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] [ 665.834066] rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] [ 665.837342] rpcrdma_convert_iovs.isra.30.cold.41+0x9b/0xa0 [rpcrdma] [ 665.841766] ? decode_read_list+0x40/0x40 [rpcrdma] [ 665.845063] ? _raw_spin_lock_irqsave+0x80/0xe0 [ 665.848444] ? xdr_reserve_space+0x12e/0x360 [sunrpc] [ 665.852186] ? xdr_init_encode+0x104/0x130 [sunrpc] [ 665.855305] rpcrdma_marshal_req.cold.43+0x39/0x1fa [rpcrdma] [ 665.859771] ? _raw_spin_lock+0x7a/0xd0 [ 665.862516] ? rpcrdma_prepare_send_sges+0x7e0/0x7e0 [rpcrdma] [ 665.866533] ? call_bind+0x60/0xf0 [sunrpc] [ 665.869382] xprt_rdma_send_request+0x79/0x190 [rpcrdma] [ 665.872965] xprt_transmit+0x2ae/0x6c0 [sunrpc] [ 665.875955] ? call_bind+0xf0/0xf0 [sunrpc] [ 665.878372] call_transmit+0xdd/0x110 [sunrpc] [ 665.881539] ? call_bind+0xf0/0xf0 [sunrpc] [ 665.884629] __rpc_execute+0x11c/0x6e0 [sunrpc] [ 665.888300] ? trace_event_raw_event_xprt_cong_event+0x270/0x270 [sunrpc] [ 665.893935] ? rpc_make_runnable+0x54/0xe0 [sunrpc] [ 665.898021] rpc_run_task+0x29c/0x2c0 [sunrpc] [ 665.901142] nfs4_call_sync_custom+0xc/0x40 [nfsv4] [ 665.904903] nfs4_do_call_sync+0x114/0x160 [nfsv4] [ 665.908633] ? nfs4_call_sync_custom+0x40/0x40 [nfsv4] [ 665.913094] ? __alloc_pages_nodemask+0x200/0x410 [ 665.916831] ? kasan_unpoison_shadow+0x30/0x40 [ 665.920393] ? __kasan_kmalloc.constprop.8+0xc1/0xd0 [ 665.924403] _nfs42_proc_listxattrs+0x1f6/0x2f0 [nfsv4] [ 665.928552] ? kasan_set_free_info+0x1b/0x30 [ 665.932283] ? 
nfs42_offload_cancel_done+0x50/0x50 [nfsv4] [ 665.936240] ? _raw_spin_lock+0x7a/0xd0 [ 665.938677] nfs42_proc_listxattrs+0xf4/0x150 [nfsv4] [ 665.942532] ? nfs42_proc_setxattr+0x150/0x150 [nfsv4] [ 665.946410] ? nfs4_xattr_cache_list+0x21/0x120 [nfsv4] [ 665.950095] nfs4_listxattr+0x34d/0x3d0 [nfsv4] [ 665.952951] ? _nfs4_proc_access+0x260/0x260 [nfsv4] [ 665.956383] ? __ia32_sys_rename+0x40/0x40 [ 665.959559] ? __ia32_sys_lstat+0x30/0x30 [ 665.962519] ? __check_object_size+0x178/0x220 [ 665.965830] ? strncpy_from_user+0xe9/0x230 [ 665.968401] ? security_inode_listxattr+0x20/0x60 [ 665.971653] listxattr+0xd1/0xf0 [ 665.974065] path_listxattr+0xa1/0x100 [ 665.977023] ? listxattr+0xf0/0xf0 [ 665.979305] do_syscall_64+0x33/0x40 [ 665.981561] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 665.985559] RIP: 0033:0x7fe3f5a1fc8b [ 665.988136] Code: f0 ff ff 73 01 c3 48 8b 0d fa 21 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 c2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cd 21 2c 00 f7 d8 64 89 01 48 [ 666.002248] RSP: 002b:00007ffc826128e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000c2 [ 666.007760] RAX: ffffffffffffffda RBX: 0000000000000021 RCX: 00007fe3f5a1fc8b [ 666.012999] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000016eaf70 [ 666.018963] RBP: 00000000000001f4 R08: 0000000000000000 R09: 00007ffc82612537 [ 666.025553] R10: 0000000000000004 R11: 0000000000000206 R12: 0000000000000021 [ 666.030600] R13: 0000000000403e60 R14: 0000000000000000 R15: 0000000000000000 [ 666.035780] [ 666.036906] Allocated by task 2837: [ 666.039447] kasan_save_stack+0x19/0x40 [ 666.042815] __kasan_kmalloc.constprop.8+0xc1/0xd0 [ 666.046626] rpcrdma_req_create+0x58/0x1f0 [rpcrdma] [ 666.050283] rpcrdma_buffer_create+0x217/0x270 [rpcrdma] [ 666.053727] xprt_setup_rdma+0x1a3/0x2c0 [rpcrdma] [ 666.057287] xprt_create_transport+0xc7/0x300 [sunrpc] [ 666.061656] rpc_create+0x185/0x360 [sunrpc] [ 666.064803] nfs_create_rpc_client+0x2d9/0x350 [nfs] [ 666.068233] nfs4_init_client+0x111/0x3d0 [nfsv4] [ 666.071563] nfs4_set_client+0x18c/0x2b0 [nfsv4] [ 666.075287] nfs4_create_server+0x303/0x590 [nfsv4] [ 666.079563] nfs4_try_get_tree+0x60/0xe0 [nfsv4] [ 666.082835] vfs_get_tree+0x45/0x120 [ 666.085165] path_mount+0x8da/0xcc0 [ 666.087352] do_mount+0xcb/0xf0 [ 666.089377] __x64_sys_mount+0xf4/0x110 [ 666.091894] do_syscall_64+0x33/0x40 [ 666.094215] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 666.097181] [ 666.098548] The buggy address belongs to the object at ffff88803ded8000 [ 666.098548] which belongs to the cache kmalloc-8k of size 8192 [ 666.108266] The buggy address is located 7024 bytes inside of [ 666.108266] 8192-byte region [ffff88803ded8000, ffff88803deda000) [ 666.115709] The buggy address belongs to the page: [ 666.118516] page:00000000e08a579f refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x3ded8 [ 666.124078] head:00000000e08a579f order:3 compound_mapcount:0 compound_pincount:0 [ 666.130104] flags: 0xfffffc0010200(slab|head) [ 666.133692] raw: 000fffffc0010200 dead000000000100 dead000000000122 ffff88800104ee40 [ 666.139286] raw: 0000000000000000 0000000000020002 00000001ffffffff 0000000000000000 [ 666.145052] page dumped because: kasan: bad access detected [ 666.149341] [ 666.150408] Memory state around the buggy address: On Fri, Nov 13, 2020 at 3:34 PM Chuck Lever <chuck.lever@oracle.com> wrote: > > Hi Olga- > > > On Nov 13, 2020, at 2:08 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > > > From: Olga Kornievskaia <kolga@netapp.com> > > > > 
xfstest generic/013 over on a NFSoRDMA over SoftRoCE or iWarp panics > > and running with KASAN reports: > > There is still only a highly circumstantial connection between > soft RoCE and iWarp and these crashes, so this description seems > misleading and/or pre-mature. > > > > [ 216.018711] BUG: KASAN: wild-memory-access in rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma] > > [ 216.024195] Write of size 12 at addr 0005088000000000 by task kworker/1:1H/480 > > [ 216.028820] > > [ 216.029776] CPU: 1 PID: 480 Comm: kworker/1:1H Not tainted 5.8.0-rc5+ #37 > > [ 216.034247] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020 > > [ 216.040604] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core] > > [ 216.043739] Call Trace: > > [ 216.045014] dump_stack+0x7c/0xb0 > > [ 216.046757] ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma] > > [ 216.050008] ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma] > > [ 216.053091] kasan_report.cold.10+0x6a/0x85 > > [ 216.055703] ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma] > > [ 216.058979] check_memory_region+0x183/0x1e0 > > [ 216.061933] memcpy+0x38/0x60 > > [ 216.064077] rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma] > > [ 216.067502] ? rpcrdma_reset_cwnd+0x70/0x70 [rpcrdma] > > [ 216.070268] ? recalibrate_cpu_khz+0x10/0x10 > > [ 216.072585] ? rpcrdma_reply_handler+0x604/0x6e0 [rpcrdma] > > [ 216.075469] __ib_process_cq+0xa7/0x220 [ib_core] > > [ 216.078077] ib_cq_poll_work+0x31/0xb0 [ib_core] > > [ 216.080451] process_one_work+0x387/0x6c0 > > [ 216.082498] worker_thread+0x57/0x5a0 > > [ 216.084425] ? process_one_work+0x6c0/0x6c0 > > [ 216.086583] kthread+0x1ca/0x200 > > [ 216.088775] ? kthread_create_on_node+0xc0/0xc0 > > [ 216.091847] ret_from_fork+0x22/0x30 > > > > Fixes: 6c2190b3fcbc ("NFS: Fix listxattr receive buffer size") > > Signed-off-by: Olga Kornievskaia <kolga@netapp.com> > > --- > > fs/nfs/nfs42xdr.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/fs/nfs/nfs42xdr.c b/fs/nfs/nfs42xdr.c > > index 6e060a8..e88bc7a 100644 > > --- a/fs/nfs/nfs42xdr.c > > +++ b/fs/nfs/nfs42xdr.c > > @@ -196,7 +196,8 @@ > > 1 + nfs4_xattr_name_maxsz + 1) > > #define decode_setxattr_maxsz (op_decode_hdr_maxsz + decode_change_info_maxsz) > > #define encode_listxattrs_maxsz (op_encode_hdr_maxsz + 2 + 1) > > -#define decode_listxattrs_maxsz (op_decode_hdr_maxsz + 2 + 1 + 1 + 1) > > +#define decode_listxattrs_maxsz (op_decode_hdr_maxsz + 2 + 1 + 1 + \ > > + XDR_QUADLEN(NFS4_OPAQUE_LIMIT)) > > From RFC 8276, Section 8.4.3.2: > > /// struct LISTXATTRS4resok { > /// nfs_cookie4 lxr_cookie; > /// xattrkey4 lxr_names<>; > /// bool lxr_eof; > /// }; > > The decode_listxattrs_maxsz macro defines the maximum size of > the /fixed portion/ of the LISTXATTRS reply. That is the operation > status code, the cookie, and the EOF flag. maxsz has an extra > "+ 1" for rpc_prepare_reply_pages() to deal with possible XDR > padding. The post-6c2190b3fcbc value of this macro is already > correct, and removing the "+ 1" is wrong. > > In addition, the variable portion of the result is an unbounded > array of component4 fields, nothing to do with NFS4_OPAQUE_LIMIT, > so that's just an arbitrary constant here with no justification. > > Adding more space to the receive buffer happens to help the case > where there are no xattrs to list. That doesn't mean its the > correct solution in general. It certainly won't be sufficient to > handle an xattr name array that is larger than 1024 bytes. 
> > > The client manages the variable portion of that result in the > reply's pages array, which is set up by nfs4_xdr_enc_listxattrs(). > > Further: > > > rpcrdma_complete_rqst+0x447 > > is in the paragraph of rpcrdma_inline_fixup() that copies received > bytes into the receive xdr_buf's pages array. > > The address "0005088000000000" is bogus. Since > nfs4_xdr_enc_listxattrs() sets XDRBUF_SPARSE_PAGES, it's likely > it is relying on the transport to allocate pages for this buffer, > and possibly that page allocation has failed or has a bug. > > Please determine why the encode side has not set up the correct > number of pages to handle the LISTXATTRS result. Until then I > have to NACK this one. > > > > #define encode_removexattr_maxsz (op_encode_hdr_maxsz + 1 + \ > > nfs4_xattr_name_maxsz) > > #define decode_removexattr_maxsz (op_decode_hdr_maxsz + \ > > -- > > 1.8.3.1 > > -- > Chuck Lever > > >
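The first KASAN splat in the reply below (the slab-out-of-bounds in rpcrdma_convert_kvec) is easier to see with a toy model of the tail arithmetic in xdr_inline_pages(). The snippet that follows is a user-space model, not the kernel function: it only shows how buflen == offset plus the unconditional 4-byte reservation produces the "-4" (which wraps to a huge size_t) that Olga reports, and what her proposed check changes.

/* User-space model of the tail sizing in xdr_inline_pages(); not the
 * kernel code itself. iov_len is unsigned, as in the kernel's kvec,
 * so subtracting 4 from 0 wraps to SIZE_MAX - 3.
 */
#include <stdio.h>
#include <stddef.h>

struct kvec_model {
    char   *iov_base;
    size_t  iov_len;
};

static void setup_tail(struct kvec_model *tail, char *buf, size_t buflen,
                       size_t offset, size_t page_len, int with_guard)
{
    tail->iov_base = buf + offset;
    tail->iov_len  = buflen - offset;

    /* reserve one XDR word of the tail for padding when the page
     * length is already 4-byte aligned */
    if ((page_len & 3) == 0 && (!with_guard || tail->iov_len))
        tail->iov_len -= 4;    /* sizeof(__be32) */
}

int main(void)
{
    char buf[64];
    struct kvec_model tail;

    /* buflen == offset: the head consumed the whole buffer, no tail */
    setup_tail(&tail, buf, 64, 64, 65536, 0);
    printf("without the check: iov_len = %zu\n", tail.iov_len);

    setup_tail(&tail, buf, 64, 64, 65536, 1);
    printf("with the check:    iov_len = %zu\n", tail.iov_len);
    return 0;
}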
On Wed, 2020-11-18 at 16:44 -0500, Olga Kornievskaia wrote: > Hi Chuck, > > The first problem I found was from 5.10-rc3 testing was from the fact > that tail's iov_len was set to -4 and reply_chunk was trying to call > rpcrdma_convert_kvec() for a tail that didn't exist. > > Do you see issues with this fix? > > diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c > index 71e03b930b70..2e6a228abb95 100644 > --- a/net/sunrpc/xdr.c > +++ b/net/sunrpc/xdr.c > @@ -193,7 +193,7 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned > int offset, > > tail->iov_base = buf + offset; > tail->iov_len = buflen - offset; > - if ((xdr->page_len & 3) == 0) > + if ((xdr->page_len & 3) == 0 && tail->iov_len) > tail->iov_len -= sizeof(__be32); > > xdr->buflen += len; > > This one fixes KASAN's reported problem of added at the end of this > message. By the way, I want to get rid of this kind on unnecessary copy. The padding is basically just a bunch of '\0's so can stay in the page without any problems (provided the page has space). The fix is simple: all the callers of xdr_inline_pages need to do is use a size that is wrapped by xdr_align_size(). > > The next problem that I can't figure out yet is another KASAN's > report > during the completion of the request. > > [ 99.610666] BUG: KASAN: wild-memory-access in > rpcrdma_complete_rqst+0x41b/0x680 [rpcrdma] > [ 99.617947] Write of size 4 at addr 0005088000000000 by task > kworker/1:1H/490 > > > This is the KASAN for the negative tail problem: > [ 665.767611] > ================================================================== > [ 665.772202] BUG: KASAN: slab-out-of-bounds in > rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > [ 665.777860] Write of size 8 at addr ffff88803ded9b70 by task > fsstress/3123 > [ 665.783754] > [ 665.784981] CPU: 0 PID: 3123 Comm: fsstress Not tainted 5.10.0- > rc3+ #38 > [ 665.790534] Hardware name: VMware, Inc. VMware Virtual > Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020 > [ 665.798538] Call Trace: > [ 665.800398] dump_stack+0x7c/0xa2 > [ 665.802647] ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > [ 665.808145] print_address_description.constprop.7+0x1e/0x230 > [ 665.812543] ? record_print_text.cold.38+0x11/0x11 > [ 665.816093] ? _raw_write_lock_irqsave+0xe0/0xe0 > [ 665.819257] ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > [ 665.823368] ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > [ 665.827837] kasan_report.cold.9+0x37/0x7c > [ 665.830076] ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > [ 665.834066] rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > [ 665.837342] rpcrdma_convert_iovs.isra.30.cold.41+0x9b/0xa0 > [rpcrdma] > [ 665.841766] ? decode_read_list+0x40/0x40 [rpcrdma] > [ 665.845063] ? _raw_spin_lock_irqsave+0x80/0xe0 > [ 665.848444] ? xdr_reserve_space+0x12e/0x360 [sunrpc] > [ 665.852186] ? xdr_init_encode+0x104/0x130 [sunrpc] > [ 665.855305] rpcrdma_marshal_req.cold.43+0x39/0x1fa [rpcrdma] > [ 665.859771] ? _raw_spin_lock+0x7a/0xd0 > [ 665.862516] ? rpcrdma_prepare_send_sges+0x7e0/0x7e0 [rpcrdma] > [ 665.866533] ? call_bind+0x60/0xf0 [sunrpc] > [ 665.869382] xprt_rdma_send_request+0x79/0x190 [rpcrdma] > [ 665.872965] xprt_transmit+0x2ae/0x6c0 [sunrpc] > [ 665.875955] ? call_bind+0xf0/0xf0 [sunrpc] > [ 665.878372] call_transmit+0xdd/0x110 [sunrpc] > [ 665.881539] ? call_bind+0xf0/0xf0 [sunrpc] > [ 665.884629] __rpc_execute+0x11c/0x6e0 [sunrpc] > [ 665.888300] ? trace_event_raw_event_xprt_cong_event+0x270/0x270 > [sunrpc] > [ 665.893935] ? 
rpc_make_runnable+0x54/0xe0 [sunrpc] > [ 665.898021] rpc_run_task+0x29c/0x2c0 [sunrpc] > [ 665.901142] nfs4_call_sync_custom+0xc/0x40 [nfsv4] > [ 665.904903] nfs4_do_call_sync+0x114/0x160 [nfsv4] > [ 665.908633] ? nfs4_call_sync_custom+0x40/0x40 [nfsv4] > [ 665.913094] ? __alloc_pages_nodemask+0x200/0x410 > [ 665.916831] ? kasan_unpoison_shadow+0x30/0x40 > [ 665.920393] ? __kasan_kmalloc.constprop.8+0xc1/0xd0 > [ 665.924403] _nfs42_proc_listxattrs+0x1f6/0x2f0 [nfsv4] > [ 665.928552] ? kasan_set_free_info+0x1b/0x30 > [ 665.932283] ? nfs42_offload_cancel_done+0x50/0x50 [nfsv4] > [ 665.936240] ? _raw_spin_lock+0x7a/0xd0 > [ 665.938677] nfs42_proc_listxattrs+0xf4/0x150 [nfsv4] > [ 665.942532] ? nfs42_proc_setxattr+0x150/0x150 [nfsv4] > [ 665.946410] ? nfs4_xattr_cache_list+0x21/0x120 [nfsv4] > [ 665.950095] nfs4_listxattr+0x34d/0x3d0 [nfsv4] > [ 665.952951] ? _nfs4_proc_access+0x260/0x260 [nfsv4] > [ 665.956383] ? __ia32_sys_rename+0x40/0x40 > [ 665.959559] ? __ia32_sys_lstat+0x30/0x30 > [ 665.962519] ? __check_object_size+0x178/0x220 > [ 665.965830] ? strncpy_from_user+0xe9/0x230 > [ 665.968401] ? security_inode_listxattr+0x20/0x60 > [ 665.971653] listxattr+0xd1/0xf0 > [ 665.974065] path_listxattr+0xa1/0x100 > [ 665.977023] ? listxattr+0xf0/0xf0 > [ 665.979305] do_syscall_64+0x33/0x40 > [ 665.981561] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 665.985559] RIP: 0033:0x7fe3f5a1fc8b > [ 665.988136] Code: f0 ff ff 73 01 c3 48 8b 0d fa 21 2c 00 f7 d8 64 > 89 01 48 83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 c2 00 00 > 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cd 21 2c 00 f7 d8 64 > 89 > 01 48 > [ 666.002248] RSP: 002b:00007ffc826128e8 EFLAGS: 00000206 ORIG_RAX: > 00000000000000c2 > [ 666.007760] RAX: ffffffffffffffda RBX: 0000000000000021 RCX: > 00007fe3f5a1fc8b > [ 666.012999] RDX: 0000000000000000 RSI: 0000000000000000 RDI: > 00000000016eaf70 > [ 666.018963] RBP: 00000000000001f4 R08: 0000000000000000 R09: > 00007ffc82612537 > [ 666.025553] R10: 0000000000000004 R11: 0000000000000206 R12: > 0000000000000021 > [ 666.030600] R13: 0000000000403e60 R14: 0000000000000000 R15: > 0000000000000000 > [ 666.035780] > [ 666.036906] Allocated by task 2837: > [ 666.039447] kasan_save_stack+0x19/0x40 > [ 666.042815] __kasan_kmalloc.constprop.8+0xc1/0xd0 > [ 666.046626] rpcrdma_req_create+0x58/0x1f0 [rpcrdma] > [ 666.050283] rpcrdma_buffer_create+0x217/0x270 [rpcrdma] > [ 666.053727] xprt_setup_rdma+0x1a3/0x2c0 [rpcrdma] > [ 666.057287] xprt_create_transport+0xc7/0x300 [sunrpc] > [ 666.061656] rpc_create+0x185/0x360 [sunrpc] > [ 666.064803] nfs_create_rpc_client+0x2d9/0x350 [nfs] > [ 666.068233] nfs4_init_client+0x111/0x3d0 [nfsv4] > [ 666.071563] nfs4_set_client+0x18c/0x2b0 [nfsv4] > [ 666.075287] nfs4_create_server+0x303/0x590 [nfsv4] > [ 666.079563] nfs4_try_get_tree+0x60/0xe0 [nfsv4] > [ 666.082835] vfs_get_tree+0x45/0x120 > [ 666.085165] path_mount+0x8da/0xcc0 > [ 666.087352] do_mount+0xcb/0xf0 > [ 666.089377] __x64_sys_mount+0xf4/0x110 > [ 666.091894] do_syscall_64+0x33/0x40 > [ 666.094215] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 666.097181] > [ 666.098548] The buggy address belongs to the object at > ffff88803ded8000 > [ 666.098548] which belongs to the cache kmalloc-8k of size 8192 > [ 666.108266] The buggy address is located 7024 bytes inside of > [ 666.108266] 8192-byte region [ffff88803ded8000, ffff88803deda000) > [ 666.115709] The buggy address belongs to the page: > [ 666.118516] page:00000000e08a579f refcount:1 mapcount:0 > mapping:0000000000000000 index:0x0 pfn:0x3ded8 > [ 
666.124078] head:00000000e08a579f order:3 compound_mapcount:0 > compound_pincount:0 > [ 666.130104] flags: 0xfffffc0010200(slab|head) > [ 666.133692] raw: 000fffffc0010200 dead000000000100 > dead000000000122 > ffff88800104ee40 > [ 666.139286] raw: 0000000000000000 0000000000020002 > 00000001ffffffff > 0000000000000000 > [ 666.145052] page dumped because: kasan: bad access detected > [ 666.149341] > [ 666.150408] Memory state around the buggy address: > > On Fri, Nov 13, 2020 at 3:34 PM Chuck Lever <chuck.lever@oracle.com> > wrote: > > > > Hi Olga- > > > > > On Nov 13, 2020, at 2:08 PM, Olga Kornievskaia < > > > olga.kornievskaia@gmail.com> wrote: > > > > > > From: Olga Kornievskaia <kolga@netapp.com> > > > > > > xfstest generic/013 over on a NFSoRDMA over SoftRoCE or iWarp > > > panics > > > and running with KASAN reports: > > > > There is still only a highly circumstantial connection between > > soft RoCE and iWarp and these crashes, so this description seems > > misleading and/or pre-mature. > > > > > > > [ 216.018711] BUG: KASAN: wild-memory-access in > > > rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma] > > > [ 216.024195] Write of size 12 at addr 0005088000000000 by task > > > kworker/1:1H/480 > > > [ 216.028820] > > > [ 216.029776] CPU: 1 PID: 480 Comm: kworker/1:1H Not tainted > > > 5.8.0-rc5+ #37 > > > [ 216.034247] Hardware name: VMware, Inc. VMware Virtual > > > Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020 > > > [ 216.040604] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core] > > > [ 216.043739] Call Trace: > > > [ 216.045014] dump_stack+0x7c/0xb0 > > > [ 216.046757] ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma] > > > [ 216.050008] ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma] > > > [ 216.053091] kasan_report.cold.10+0x6a/0x85 > > > [ 216.055703] ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma] > > > [ 216.058979] check_memory_region+0x183/0x1e0 > > > [ 216.061933] memcpy+0x38/0x60 > > > [ 216.064077] rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma] > > > [ 216.067502] ? rpcrdma_reset_cwnd+0x70/0x70 [rpcrdma] > > > [ 216.070268] ? recalibrate_cpu_khz+0x10/0x10 > > > [ 216.072585] ? rpcrdma_reply_handler+0x604/0x6e0 [rpcrdma] > > > [ 216.075469] __ib_process_cq+0xa7/0x220 [ib_core] > > > [ 216.078077] ib_cq_poll_work+0x31/0xb0 [ib_core] > > > [ 216.080451] process_one_work+0x387/0x6c0 > > > [ 216.082498] worker_thread+0x57/0x5a0 > > > [ 216.084425] ? process_one_work+0x6c0/0x6c0 > > > [ 216.086583] kthread+0x1ca/0x200 > > > [ 216.088775] ? 
kthread_create_on_node+0xc0/0xc0 > > > [ 216.091847] ret_from_fork+0x22/0x30 > > > > > > Fixes: 6c2190b3fcbc ("NFS: Fix listxattr receive buffer size") > > > Signed-off-by: Olga Kornievskaia <kolga@netapp.com> > > > --- > > > fs/nfs/nfs42xdr.c | 3 ++- > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/nfs/nfs42xdr.c b/fs/nfs/nfs42xdr.c > > > index 6e060a8..e88bc7a 100644 > > > --- a/fs/nfs/nfs42xdr.c > > > +++ b/fs/nfs/nfs42xdr.c > > > @@ -196,7 +196,8 @@ > > > 1 + nfs4_xattr_name_maxsz + 1) > > > #define decode_setxattr_maxsz (op_decode_hdr_maxsz + > > > decode_change_info_maxsz) > > > #define encode_listxattrs_maxsz (op_encode_hdr_maxsz + 2 + 1) > > > -#define decode_listxattrs_maxsz (op_decode_hdr_maxsz + 2 + 1 + > > > 1 + 1) > > > +#define decode_listxattrs_maxsz (op_decode_hdr_maxsz + 2 + 1 + > > > 1 + \ > > > + XDR_QUADLEN(NFS4_OPAQUE_LIMIT)) > > > > From RFC 8276, Section 8.4.3.2: > > > > /// struct LISTXATTRS4resok { > > /// nfs_cookie4 lxr_cookie; > > /// xattrkey4 lxr_names<>; > > /// bool lxr_eof; > > /// }; > > > > The decode_listxattrs_maxsz macro defines the maximum size of > > the /fixed portion/ of the LISTXATTRS reply. That is the operation > > status code, the cookie, and the EOF flag. maxsz has an extra > > "+ 1" for rpc_prepare_reply_pages() to deal with possible XDR > > padding. The post-6c2190b3fcbc value of this macro is already > > correct, and removing the "+ 1" is wrong. > > > > In addition, the variable portion of the result is an unbounded > > array of component4 fields, nothing to do with NFS4_OPAQUE_LIMIT, > > so that's just an arbitrary constant here with no justification. > > > > Adding more space to the receive buffer happens to help the case > > where there are no xattrs to list. That doesn't mean its the > > correct solution in general. It certainly won't be sufficient to > > handle an xattr name array that is larger than 1024 bytes. > > > > > > The client manages the variable portion of that result in the > > reply's pages array, which is set up by nfs4_xdr_enc_listxattrs(). > > > > Further: > > > > > rpcrdma_complete_rqst+0x447 > > > > is in the paragraph of rpcrdma_inline_fixup() that copies received > > bytes into the receive xdr_buf's pages array. > > > > The address "0005088000000000" is bogus. Since > > nfs4_xdr_enc_listxattrs() sets XDRBUF_SPARSE_PAGES, it's likely > > it is relying on the transport to allocate pages for this buffer, > > and possibly that page allocation has failed or has a bug. > > > > Please determine why the encode side has not set up the correct > > number of pages to handle the LISTXATTRS result. Until then I > > have to NACK this one. > > > > > > > #define encode_removexattr_maxsz (op_encode_hdr_maxsz + 1 + \ > > > nfs4_xattr_name_maxsz) > > > #define decode_removexattr_maxsz (op_decode_hdr_maxsz + \ > > > -- > > > 1.8.3.1 > > > > -- > > Chuck Lever > > > > > >
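The xdr_align_size() suggestion above can be shown in a few lines: if the caller rounds the page length up to a 4-byte boundary before handing it to xdr_inline_pages(), the XDR padding fits inside the pages and nothing has to be carved out of the tail. The helper below mirrors my understanding of the inline helper in include/linux/sunrpc/xdr.h; treat it as a sketch rather than a quotation of the kernel source.

/* Sketch of the rounding Trond suggests callers apply. */
#include <stdio.h>
#include <stddef.h>

static size_t xdr_align_size(size_t n)
{
    const size_t mask = 4 - 1;    /* XDR unit minus one */

    return (n + mask) & ~mask;
}

int main(void)
{
    const size_t lengths[] = { 5, 8, 13, 65536 };
    size_t i;

    for (i = 0; i < sizeof(lengths) / sizeof(lengths[0]); i++)
        printf("page_len %6zu -> aligned %6zu\n",
               lengths[i], xdr_align_size(lengths[i]));

    /* The idea: if callers always pass xdr_align_size(count) as the
     * page length, any XDR padding fits inside the page space they
     * provided, and the transport no longer needs to land pad bytes
     * in the tail kvec. */
    return 0;
}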
Hi Olga- > On Nov 18, 2020, at 4:44 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > Hi Chuck, > > The first problem I found was from 5.10-rc3 testing was from the fact > that tail's iov_len was set to -4 and reply_chunk was trying to call > rpcrdma_convert_kvec() for a tail that didn't exist. > > Do you see issues with this fix? > > diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c > index 71e03b930b70..2e6a228abb95 100644 > --- a/net/sunrpc/xdr.c > +++ b/net/sunrpc/xdr.c > @@ -193,7 +193,7 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned int offset, > > tail->iov_base = buf + offset; > tail->iov_len = buflen - offset; > - if ((xdr->page_len & 3) == 0) > + if ((xdr->page_len & 3) == 0 && tail->iov_len) > tail->iov_len -= sizeof(__be32); > > xdr->buflen += len; It's not clear to me whether only the listxattrs encoder is not providing a receive tail kvec, or whether all the encoders fail to provide a tail in this case. > This one fixes KASAN's reported problem of added at the end of this message. > > The next problem that I can't figure out yet is another KASAN's report > during the completion of the request. > > [ 99.610666] BUG: KASAN: wild-memory-access in > rpcrdma_complete_rqst+0x41b/0x680 [rpcrdma] > [ 99.617947] Write of size 4 at addr 0005088000000000 by task kworker/1:1H/490 > > > This is the KASAN for the negative tail problem: > [ 665.767611] ================================================================== > [ 665.772202] BUG: KASAN: slab-out-of-bounds in > rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > [ 665.777860] Write of size 8 at addr ffff88803ded9b70 by task fsstress/3123 > [ 665.783754] > [ 665.784981] CPU: 0 PID: 3123 Comm: fsstress Not tainted 5.10.0-rc3+ #38 > [ 665.790534] Hardware name: VMware, Inc. VMware Virtual > Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020 > [ 665.798538] Call Trace: > [ 665.800398] dump_stack+0x7c/0xa2 > [ 665.802647] ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > [ 665.808145] print_address_description.constprop.7+0x1e/0x230 > [ 665.812543] ? record_print_text.cold.38+0x11/0x11 > [ 665.816093] ? _raw_write_lock_irqsave+0xe0/0xe0 > [ 665.819257] ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > [ 665.823368] ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > [ 665.827837] kasan_report.cold.9+0x37/0x7c > [ 665.830076] ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > [ 665.834066] rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > [ 665.837342] rpcrdma_convert_iovs.isra.30.cold.41+0x9b/0xa0 [rpcrdma] > [ 665.841766] ? decode_read_list+0x40/0x40 [rpcrdma] > [ 665.845063] ? _raw_spin_lock_irqsave+0x80/0xe0 > [ 665.848444] ? xdr_reserve_space+0x12e/0x360 [sunrpc] > [ 665.852186] ? xdr_init_encode+0x104/0x130 [sunrpc] > [ 665.855305] rpcrdma_marshal_req.cold.43+0x39/0x1fa [rpcrdma] > [ 665.859771] ? _raw_spin_lock+0x7a/0xd0 > [ 665.862516] ? rpcrdma_prepare_send_sges+0x7e0/0x7e0 [rpcrdma] > [ 665.866533] ? call_bind+0x60/0xf0 [sunrpc] > [ 665.869382] xprt_rdma_send_request+0x79/0x190 [rpcrdma] > [ 665.872965] xprt_transmit+0x2ae/0x6c0 [sunrpc] > [ 665.875955] ? call_bind+0xf0/0xf0 [sunrpc] > [ 665.878372] call_transmit+0xdd/0x110 [sunrpc] > [ 665.881539] ? call_bind+0xf0/0xf0 [sunrpc] > [ 665.884629] __rpc_execute+0x11c/0x6e0 [sunrpc] > [ 665.888300] ? trace_event_raw_event_xprt_cong_event+0x270/0x270 [sunrpc] > [ 665.893935] ? 
rpc_make_runnable+0x54/0xe0 [sunrpc] > [ 665.898021] rpc_run_task+0x29c/0x2c0 [sunrpc] > [ 665.901142] nfs4_call_sync_custom+0xc/0x40 [nfsv4] > [ 665.904903] nfs4_do_call_sync+0x114/0x160 [nfsv4] > [ 665.908633] ? nfs4_call_sync_custom+0x40/0x40 [nfsv4] > [ 665.913094] ? __alloc_pages_nodemask+0x200/0x410 > [ 665.916831] ? kasan_unpoison_shadow+0x30/0x40 > [ 665.920393] ? __kasan_kmalloc.constprop.8+0xc1/0xd0 > [ 665.924403] _nfs42_proc_listxattrs+0x1f6/0x2f0 [nfsv4] > [ 665.928552] ? kasan_set_free_info+0x1b/0x30 > [ 665.932283] ? nfs42_offload_cancel_done+0x50/0x50 [nfsv4] > [ 665.936240] ? _raw_spin_lock+0x7a/0xd0 > [ 665.938677] nfs42_proc_listxattrs+0xf4/0x150 [nfsv4] > [ 665.942532] ? nfs42_proc_setxattr+0x150/0x150 [nfsv4] > [ 665.946410] ? nfs4_xattr_cache_list+0x21/0x120 [nfsv4] > [ 665.950095] nfs4_listxattr+0x34d/0x3d0 [nfsv4] > [ 665.952951] ? _nfs4_proc_access+0x260/0x260 [nfsv4] > [ 665.956383] ? __ia32_sys_rename+0x40/0x40 > [ 665.959559] ? __ia32_sys_lstat+0x30/0x30 > [ 665.962519] ? __check_object_size+0x178/0x220 > [ 665.965830] ? strncpy_from_user+0xe9/0x230 > [ 665.968401] ? security_inode_listxattr+0x20/0x60 > [ 665.971653] listxattr+0xd1/0xf0 > [ 665.974065] path_listxattr+0xa1/0x100 > [ 665.977023] ? listxattr+0xf0/0xf0 > [ 665.979305] do_syscall_64+0x33/0x40 > [ 665.981561] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 665.985559] RIP: 0033:0x7fe3f5a1fc8b > [ 665.988136] Code: f0 ff ff 73 01 c3 48 8b 0d fa 21 2c 00 f7 d8 64 > 89 01 48 83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 c2 00 00 > 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cd 21 2c 00 f7 d8 64 89 > 01 48 > [ 666.002248] RSP: 002b:00007ffc826128e8 EFLAGS: 00000206 ORIG_RAX: > 00000000000000c2 > [ 666.007760] RAX: ffffffffffffffda RBX: 0000000000000021 RCX: 00007fe3f5a1fc8b > [ 666.012999] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000016eaf70 > [ 666.018963] RBP: 00000000000001f4 R08: 0000000000000000 R09: 00007ffc82612537 > [ 666.025553] R10: 0000000000000004 R11: 0000000000000206 R12: 0000000000000021 > [ 666.030600] R13: 0000000000403e60 R14: 0000000000000000 R15: 0000000000000000 > [ 666.035780] > [ 666.036906] Allocated by task 2837: > [ 666.039447] kasan_save_stack+0x19/0x40 > [ 666.042815] __kasan_kmalloc.constprop.8+0xc1/0xd0 > [ 666.046626] rpcrdma_req_create+0x58/0x1f0 [rpcrdma] > [ 666.050283] rpcrdma_buffer_create+0x217/0x270 [rpcrdma] > [ 666.053727] xprt_setup_rdma+0x1a3/0x2c0 [rpcrdma] > [ 666.057287] xprt_create_transport+0xc7/0x300 [sunrpc] > [ 666.061656] rpc_create+0x185/0x360 [sunrpc] > [ 666.064803] nfs_create_rpc_client+0x2d9/0x350 [nfs] > [ 666.068233] nfs4_init_client+0x111/0x3d0 [nfsv4] > [ 666.071563] nfs4_set_client+0x18c/0x2b0 [nfsv4] > [ 666.075287] nfs4_create_server+0x303/0x590 [nfsv4] > [ 666.079563] nfs4_try_get_tree+0x60/0xe0 [nfsv4] > [ 666.082835] vfs_get_tree+0x45/0x120 > [ 666.085165] path_mount+0x8da/0xcc0 > [ 666.087352] do_mount+0xcb/0xf0 > [ 666.089377] __x64_sys_mount+0xf4/0x110 > [ 666.091894] do_syscall_64+0x33/0x40 > [ 666.094215] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 666.097181] > [ 666.098548] The buggy address belongs to the object at ffff88803ded8000 > [ 666.098548] which belongs to the cache kmalloc-8k of size 8192 > [ 666.108266] The buggy address is located 7024 bytes inside of > [ 666.108266] 8192-byte region [ffff88803ded8000, ffff88803deda000) > [ 666.115709] The buggy address belongs to the page: > [ 666.118516] page:00000000e08a579f refcount:1 mapcount:0 > mapping:0000000000000000 index:0x0 pfn:0x3ded8 > [ 666.124078] 
head:00000000e08a579f order:3 compound_mapcount:0 > compound_pincount:0 > [ 666.130104] flags: 0xfffffc0010200(slab|head) > [ 666.133692] raw: 000fffffc0010200 dead000000000100 dead000000000122 > ffff88800104ee40 > [ 666.139286] raw: 0000000000000000 0000000000020002 00000001ffffffff > 0000000000000000 > [ 666.145052] page dumped because: kasan: bad access detected > [ 666.149341] > [ 666.150408] Memory state around the buggy address: -- Chuck Lever
On Thu, Nov 19, 2020 at 9:37 AM Chuck Lever <chuck.lever@oracle.com> wrote: > > Hi Olga- > > > On Nov 18, 2020, at 4:44 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > > > Hi Chuck, > > > > The first problem I found was from 5.10-rc3 testing was from the fact > > that tail's iov_len was set to -4 and reply_chunk was trying to call > > rpcrdma_convert_kvec() for a tail that didn't exist. > > > > Do you see issues with this fix? > > > > diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c > > index 71e03b930b70..2e6a228abb95 100644 > > --- a/net/sunrpc/xdr.c > > +++ b/net/sunrpc/xdr.c > > @@ -193,7 +193,7 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned int offset, > > > > tail->iov_base = buf + offset; > > tail->iov_len = buflen - offset; > > - if ((xdr->page_len & 3) == 0) > > + if ((xdr->page_len & 3) == 0 && tail->iov_len) > > tail->iov_len -= sizeof(__be32); > > > > xdr->buflen += len; > > It's not clear to me whether only the listxattrs encoder is > not providing a receive tail kvec, or whether all the encoders > fail to provide a tail in this case. There is nothing specific that listxattr does, it just calls rpc_prepare_pages(). Isn't tail[0] always there (xdr_buf structure has tail[1] defined)? But not all requests have data in the page. So as far as I understand, tail->iov_len can be 0 so not checking for it is wrong. > > This one fixes KASAN's reported problem of added at the end of this message. > > > > The next problem that I can't figure out yet is another KASAN's report > > during the completion of the request. > > > > [ 99.610666] BUG: KASAN: wild-memory-access in > > rpcrdma_complete_rqst+0x41b/0x680 [rpcrdma] > > [ 99.617947] Write of size 4 at addr 0005088000000000 by task kworker/1:1H/490 > > > > > > This is the KASAN for the negative tail problem: > > [ 665.767611] ================================================================== > > [ 665.772202] BUG: KASAN: slab-out-of-bounds in > > rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > > [ 665.777860] Write of size 8 at addr ffff88803ded9b70 by task fsstress/3123 > > [ 665.783754] > > [ 665.784981] CPU: 0 PID: 3123 Comm: fsstress Not tainted 5.10.0-rc3+ #38 > > [ 665.790534] Hardware name: VMware, Inc. VMware Virtual > > Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020 > > [ 665.798538] Call Trace: > > [ 665.800398] dump_stack+0x7c/0xa2 > > [ 665.802647] ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > > [ 665.808145] print_address_description.constprop.7+0x1e/0x230 > > [ 665.812543] ? record_print_text.cold.38+0x11/0x11 > > [ 665.816093] ? _raw_write_lock_irqsave+0xe0/0xe0 > > [ 665.819257] ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > > [ 665.823368] ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > > [ 665.827837] kasan_report.cold.9+0x37/0x7c > > [ 665.830076] ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > > [ 665.834066] rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma] > > [ 665.837342] rpcrdma_convert_iovs.isra.30.cold.41+0x9b/0xa0 [rpcrdma] > > [ 665.841766] ? decode_read_list+0x40/0x40 [rpcrdma] > > [ 665.845063] ? _raw_spin_lock_irqsave+0x80/0xe0 > > [ 665.848444] ? xdr_reserve_space+0x12e/0x360 [sunrpc] > > [ 665.852186] ? xdr_init_encode+0x104/0x130 [sunrpc] > > [ 665.855305] rpcrdma_marshal_req.cold.43+0x39/0x1fa [rpcrdma] > > [ 665.859771] ? _raw_spin_lock+0x7a/0xd0 > > [ 665.862516] ? rpcrdma_prepare_send_sges+0x7e0/0x7e0 [rpcrdma] > > [ 665.866533] ? 
call_bind+0x60/0xf0 [sunrpc] > > [ 665.869382] xprt_rdma_send_request+0x79/0x190 [rpcrdma] > > [ 665.872965] xprt_transmit+0x2ae/0x6c0 [sunrpc] > > [ 665.875955] ? call_bind+0xf0/0xf0 [sunrpc] > > [ 665.878372] call_transmit+0xdd/0x110 [sunrpc] > > [ 665.881539] ? call_bind+0xf0/0xf0 [sunrpc] > > [ 665.884629] __rpc_execute+0x11c/0x6e0 [sunrpc] > > [ 665.888300] ? trace_event_raw_event_xprt_cong_event+0x270/0x270 [sunrpc] > > [ 665.893935] ? rpc_make_runnable+0x54/0xe0 [sunrpc] > > [ 665.898021] rpc_run_task+0x29c/0x2c0 [sunrpc] > > [ 665.901142] nfs4_call_sync_custom+0xc/0x40 [nfsv4] > > [ 665.904903] nfs4_do_call_sync+0x114/0x160 [nfsv4] > > [ 665.908633] ? nfs4_call_sync_custom+0x40/0x40 [nfsv4] > > [ 665.913094] ? __alloc_pages_nodemask+0x200/0x410 > > [ 665.916831] ? kasan_unpoison_shadow+0x30/0x40 > > [ 665.920393] ? __kasan_kmalloc.constprop.8+0xc1/0xd0 > > [ 665.924403] _nfs42_proc_listxattrs+0x1f6/0x2f0 [nfsv4] > > [ 665.928552] ? kasan_set_free_info+0x1b/0x30 > > [ 665.932283] ? nfs42_offload_cancel_done+0x50/0x50 [nfsv4] > > [ 665.936240] ? _raw_spin_lock+0x7a/0xd0 > > [ 665.938677] nfs42_proc_listxattrs+0xf4/0x150 [nfsv4] > > [ 665.942532] ? nfs42_proc_setxattr+0x150/0x150 [nfsv4] > > [ 665.946410] ? nfs4_xattr_cache_list+0x21/0x120 [nfsv4] > > [ 665.950095] nfs4_listxattr+0x34d/0x3d0 [nfsv4] > > [ 665.952951] ? _nfs4_proc_access+0x260/0x260 [nfsv4] > > [ 665.956383] ? __ia32_sys_rename+0x40/0x40 > > [ 665.959559] ? __ia32_sys_lstat+0x30/0x30 > > [ 665.962519] ? __check_object_size+0x178/0x220 > > [ 665.965830] ? strncpy_from_user+0xe9/0x230 > > [ 665.968401] ? security_inode_listxattr+0x20/0x60 > > [ 665.971653] listxattr+0xd1/0xf0 > > [ 665.974065] path_listxattr+0xa1/0x100 > > [ 665.977023] ? listxattr+0xf0/0xf0 > > [ 665.979305] do_syscall_64+0x33/0x40 > > [ 665.981561] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > [ 665.985559] RIP: 0033:0x7fe3f5a1fc8b > > [ 665.988136] Code: f0 ff ff 73 01 c3 48 8b 0d fa 21 2c 00 f7 d8 64 > > 89 01 48 83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 c2 00 00 > > 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cd 21 2c 00 f7 d8 64 89 > > 01 48 > > [ 666.002248] RSP: 002b:00007ffc826128e8 EFLAGS: 00000206 ORIG_RAX: > > 00000000000000c2 > > [ 666.007760] RAX: ffffffffffffffda RBX: 0000000000000021 RCX: 00007fe3f5a1fc8b > > [ 666.012999] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000016eaf70 > > [ 666.018963] RBP: 00000000000001f4 R08: 0000000000000000 R09: 00007ffc82612537 > > [ 666.025553] R10: 0000000000000004 R11: 0000000000000206 R12: 0000000000000021 > > [ 666.030600] R13: 0000000000403e60 R14: 0000000000000000 R15: 0000000000000000 > > [ 666.035780] > > [ 666.036906] Allocated by task 2837: > > [ 666.039447] kasan_save_stack+0x19/0x40 > > [ 666.042815] __kasan_kmalloc.constprop.8+0xc1/0xd0 > > [ 666.046626] rpcrdma_req_create+0x58/0x1f0 [rpcrdma] > > [ 666.050283] rpcrdma_buffer_create+0x217/0x270 [rpcrdma] > > [ 666.053727] xprt_setup_rdma+0x1a3/0x2c0 [rpcrdma] > > [ 666.057287] xprt_create_transport+0xc7/0x300 [sunrpc] > > [ 666.061656] rpc_create+0x185/0x360 [sunrpc] > > [ 666.064803] nfs_create_rpc_client+0x2d9/0x350 [nfs] > > [ 666.068233] nfs4_init_client+0x111/0x3d0 [nfsv4] > > [ 666.071563] nfs4_set_client+0x18c/0x2b0 [nfsv4] > > [ 666.075287] nfs4_create_server+0x303/0x590 [nfsv4] > > [ 666.079563] nfs4_try_get_tree+0x60/0xe0 [nfsv4] > > [ 666.082835] vfs_get_tree+0x45/0x120 > > [ 666.085165] path_mount+0x8da/0xcc0 > > [ 666.087352] do_mount+0xcb/0xf0 > > [ 666.089377] __x64_sys_mount+0xf4/0x110 > > [ 666.091894] 
do_syscall_64+0x33/0x40 > > [ 666.094215] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > [ 666.097181] > > [ 666.098548] The buggy address belongs to the object at ffff88803ded8000 > > [ 666.098548] which belongs to the cache kmalloc-8k of size 8192 > > [ 666.108266] The buggy address is located 7024 bytes inside of > > [ 666.108266] 8192-byte region [ffff88803ded8000, ffff88803deda000) > > [ 666.115709] The buggy address belongs to the page: > > [ 666.118516] page:00000000e08a579f refcount:1 mapcount:0 > > mapping:0000000000000000 index:0x0 pfn:0x3ded8 > > [ 666.124078] head:00000000e08a579f order:3 compound_mapcount:0 > > compound_pincount:0 > > [ 666.130104] flags: 0xfffffc0010200(slab|head) > > [ 666.133692] raw: 000fffffc0010200 dead000000000100 dead000000000122 > > ffff88800104ee40 > > [ 666.139286] raw: 0000000000000000 0000000000020002 00000001ffffffff > > 0000000000000000 > > [ 666.145052] page dumped because: kasan: bad access detected > > [ 666.149341] > > [ 666.150408] Memory state around the buggy address: > > -- > Chuck Lever > > >
> On Nov 19, 2020, at 10:09 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > On Thu, Nov 19, 2020 at 9:37 AM Chuck Lever <chuck.lever@oracle.com> wrote: >> >> Hi Olga- >> >>> On Nov 18, 2020, at 4:44 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: >>> >>> Hi Chuck, >>> >>> The first problem I found was from 5.10-rc3 testing was from the fact >>> that tail's iov_len was set to -4 and reply_chunk was trying to call >>> rpcrdma_convert_kvec() for a tail that didn't exist. >>> >>> Do you see issues with this fix? >>> >>> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c >>> index 71e03b930b70..2e6a228abb95 100644 >>> --- a/net/sunrpc/xdr.c >>> +++ b/net/sunrpc/xdr.c >>> @@ -193,7 +193,7 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned int offset, >>> >>> tail->iov_base = buf + offset; >>> tail->iov_len = buflen - offset; >>> - if ((xdr->page_len & 3) == 0) >>> + if ((xdr->page_len & 3) == 0 && tail->iov_len) >>> tail->iov_len -= sizeof(__be32); >>> >>> xdr->buflen += len; >> >> It's not clear to me whether only the listxattrs encoder is >> not providing a receive tail kvec, or whether all the encoders >> fail to provide a tail in this case. > > There is nothing specific that listxattr does, it just calls > rpc_prepare_pages(). Isn't tail[0] always there (xdr_buf structure has > tail[1] defined)? Flip the question on its head: Why does xdr_inline_pages() work fine for all the other operations? That suggests the problem is with listxattrs, not with generic code. > But not all requests have data in the page. So as > far as I understand, tail->iov_len can be 0 so not checking for it is > wrong. The full context of the tail logic is: 194 tail->iov_base = buf + offset; 195 tail->iov_len = buflen - offset; 196 if ((xdr->page_len & 3) == 0) 197 tail->iov_len -= sizeof(__be32); tail->iov_len is set to buflen - offset right here. It should /always/ be 4 or more. We can ensure that because the input to this function is always provided by another kernel function (in other words, we control all the callers). Why is buflen == offset for listxattrs? 6c2190b3fcbc ("NFS: Fix listxattr receive buffer size") is trying to ensure tail->iov_len is not zero -- that the math here gives us a positive value every time. In nfs4_xdr_enc_listxattrs() we have: 1529 rpc_prepare_reply_pages(req, args->xattr_pages, 0, args->count, 1530 hdr.replen); hdr.replen is set to NFS4_dec_listxattrs_sz. _nfs42_proc_listxattrs() sets args->count. I suspect the problem is the way _nfs42_proc_listxattrs() is computing the length of xattr_pages. It includes the trailing EOF boolean, but so does the decode_listxattrs_maxsz macro, for instance. We need head->iov_len to always be one XDR_UNIT larger than the position where the xattr_pages array starts. Then the math in xdr_inline_pages() will work correctly. (sidebar: perhaps the documenting comment for xdr_inline_pages() should explain that assumption). So, now I agree with the assessment that 6c2190b3fcbc ("NFS: Fix listxattr receive buffer size") is probably not adequate. There is another subtlety to address in the way that _nfs42_proc_listxattrs() computes args->count. -- Chuck Lever
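A small model of the constraint Chuck spells out here: the receive head must extend at least one XDR_UNIT past the offset where the xattr_pages array is spliced in, otherwise the tail reservation in xdr_inline_pages() underflows. The numbers below are made up for illustration; the "EOF counted twice" case reflects Chuck's suspicion about _nfs42_proc_listxattrs(), not a verified trace of the actual values.

/* Model of the head/pages/tail split done via rpc_prepare_reply_pages()
 * and xdr_inline_pages(); illustrative numbers only.
 */
#include <stdio.h>

#define XDR_UNIT 4

static void split(unsigned int head_bytes, unsigned int offset_bytes,
                  unsigned int page_len)
{
    long tail_room = (long)head_bytes - (long)offset_bytes;

    if ((page_len & 3) == 0)
        tail_room -= XDR_UNIT;    /* word reserved for XDR padding */

    printf("head=%-3u offset=%-3u -> tail room %ld bytes%s\n",
           head_bytes, offset_bytes, tail_room,
           tail_room < 0 ? " (wraps to a huge size_t in the kernel)" : "");
}

int main(void)
{
    /* healthy: the replen estimate leaves a spare word in the head */
    split(32, 28, 65536);
    /* suspected LISTXATTRS case: the EOF word counted in both the
     * fixed estimate and the page data pushes the offset to the very
     * end of the allocated head */
    split(28, 28, 65536);
    return 0;
}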
On Thu, Nov 19, 2020 at 11:19:05AM -0500, Chuck Lever wrote: > > On Nov 19, 2020, at 10:09 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > > > On Thu, Nov 19, 2020 at 9:37 AM Chuck Lever <chuck.lever@oracle.com> wrote: > >> > >> Hi Olga- > >> > >>> On Nov 18, 2020, at 4:44 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > >>> > >>> Hi Chuck, > >>> > >>> The first problem I found was from 5.10-rc3 testing was from the fact > >>> that tail's iov_len was set to -4 and reply_chunk was trying to call > >>> rpcrdma_convert_kvec() for a tail that didn't exist. > >>> > >>> Do you see issues with this fix? > >>> > >>> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c > >>> index 71e03b930b70..2e6a228abb95 100644 > >>> --- a/net/sunrpc/xdr.c > >>> +++ b/net/sunrpc/xdr.c > >>> @@ -193,7 +193,7 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned int offset, > >>> > >>> tail->iov_base = buf + offset; > >>> tail->iov_len = buflen - offset; > >>> - if ((xdr->page_len & 3) == 0) > >>> + if ((xdr->page_len & 3) == 0 && tail->iov_len) > >>> tail->iov_len -= sizeof(__be32); > >>> > >>> xdr->buflen += len; > >> > >> It's not clear to me whether only the listxattrs encoder is > >> not providing a receive tail kvec, or whether all the encoders > >> fail to provide a tail in this case. > > > > There is nothing specific that listxattr does, it just calls > > rpc_prepare_pages(). Isn't tail[0] always there (xdr_buf structure has > > tail[1] defined)? > > Flip the question on its head: Why does xdr_inline_pages() work > fine for all the other operations? That suggests the problem is > with listxattrs, not with generic code. > > > > But not all requests have data in the page. So as > > far as I understand, tail->iov_len can be 0 so not checking for it is > > wrong. > > The full context of the tail logic is: > > 194 tail->iov_base = buf + offset; > 195 tail->iov_len = buflen - offset; > 196 if ((xdr->page_len & 3) == 0) > 197 tail->iov_len -= sizeof(__be32); > > tail->iov_len is set to buflen - offset right here. It should > /always/ be 4 or more. We can ensure that because the input > to this function is always provided by another kernel function > (in other words, we control all the callers). > > Why is buflen == offset for listxattrs? 6c2190b3fcbc ("NFS: > Fix listxattr receive buffer size") is trying to ensure > tail->iov_len is not zero -- that the math here gives us a > positive value every time. > > In nfs4_xdr_enc_listxattrs() we have: > > 1529 rpc_prepare_reply_pages(req, args->xattr_pages, 0, args->count, > 1530 hdr.replen); > > hdr.replen is set to NFS4_dec_listxattrs_sz. > > _nfs42_proc_listxattrs() sets args->count. > > I suspect the problem is the way _nfs42_proc_listxattrs() is > computing the length of xattr_pages. It includes the trailing > EOF boolean, but so does the decode_listxattrs_maxsz macro, > for instance. > > We need head->iov_len to always be one XDR_UNIT larger than > the position where the xattr_pages array starts. Then the > math in xdr_inline_pages() will work correctly. (sidebar: > perhaps the documenting comment for xdr_inline_pages() should > explain that assumption). > > So, now I agree with the assessment that 6c2190b3fcbc ("NFS: > Fix listxattr receive buffer size") is probably not adequate. > There is another subtlety to address in the way that > _nfs42_proc_listxattrs() computes args->count. 
The only thing I see wrong so far is that I believe decode_listxattrs_maxsz is wrong: it shouldn't include the EOF word, which is accounted for in the page data part.

But it seems that this wouldn't cause the above problem. I'm also uncertain why this happens with RDMA but not otherwise. Unfortunately, I can't test RDMA, but when I ran listxattr tests I did so with KASAN enabled and didn't see issues. Obviously there could be a bug here; it could be that the code just gets lucky and the bug is only exposed on RDMA.

Is there a specific size passed to listxattr that this happens with?

- Frank
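Frank's point, made concrete. The first definition is the macro as it stands after 6c2190b3fcbc (copied from the patch context above); the second is what dropping the EOF word would look like. Whether that is the right change depends on where the EOF flag is actually decoded, which is still being established in this thread, so read it as a sketch of the accounting question rather than a proposed patch. op_decode_hdr_maxsz is assumed to be 2 words here purely so the numbers print.

#include <stdio.h>

#define op_decode_hdr_maxsz 2    /* assumed for illustration */

/* as in fs/nfs/nfs42xdr.c after 6c2190b3fcbc */
#define decode_listxattrs_maxsz        (op_decode_hdr_maxsz + 2 + 1 + 1 + 1)
/* variant implied by Frank's observation: EOF lives in the page data */
#define decode_listxattrs_maxsz_noeof  (op_decode_hdr_maxsz + 2 + 1 + 1)

int main(void)
{
    printf("current: %d words (%d bytes), without EOF: %d words (%d bytes)\n",
           decode_listxattrs_maxsz, decode_listxattrs_maxsz * 4,
           decode_listxattrs_maxsz_noeof, decode_listxattrs_maxsz_noeof * 4);
    return 0;
}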
On Thu, Nov 19, 2020 at 6:26 PM Frank van der Linden <fllinden@amazon.com> wrote: > > On Thu, Nov 19, 2020 at 11:19:05AM -0500, Chuck Lever wrote: > > > On Nov 19, 2020, at 10:09 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > > > > > On Thu, Nov 19, 2020 at 9:37 AM Chuck Lever <chuck.lever@oracle.com> wrote: > > >> > > >> Hi Olga- > > >> > > >>> On Nov 18, 2020, at 4:44 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > >>> > > >>> Hi Chuck, > > >>> > > >>> The first problem I found was from 5.10-rc3 testing was from the fact > > >>> that tail's iov_len was set to -4 and reply_chunk was trying to call > > >>> rpcrdma_convert_kvec() for a tail that didn't exist. > > >>> > > >>> Do you see issues with this fix? > > >>> > > >>> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c > > >>> index 71e03b930b70..2e6a228abb95 100644 > > >>> --- a/net/sunrpc/xdr.c > > >>> +++ b/net/sunrpc/xdr.c > > >>> @@ -193,7 +193,7 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned int offset, > > >>> > > >>> tail->iov_base = buf + offset; > > >>> tail->iov_len = buflen - offset; > > >>> - if ((xdr->page_len & 3) == 0) > > >>> + if ((xdr->page_len & 3) == 0 && tail->iov_len) > > >>> tail->iov_len -= sizeof(__be32); > > >>> > > >>> xdr->buflen += len; > > >> > > >> It's not clear to me whether only the listxattrs encoder is > > >> not providing a receive tail kvec, or whether all the encoders > > >> fail to provide a tail in this case. > > > > > > There is nothing specific that listxattr does, it just calls > > > rpc_prepare_pages(). Isn't tail[0] always there (xdr_buf structure has > > > tail[1] defined)? > > > > Flip the question on its head: Why does xdr_inline_pages() work > > fine for all the other operations? That suggests the problem is > > with listxattrs, not with generic code. > > > > > > > But not all requests have data in the page. So as > > > far as I understand, tail->iov_len can be 0 so not checking for it is > > > wrong. > > > > The full context of the tail logic is: > > > > 194 tail->iov_base = buf + offset; > > 195 tail->iov_len = buflen - offset; > > 196 if ((xdr->page_len & 3) == 0) > > 197 tail->iov_len -= sizeof(__be32); > > > > tail->iov_len is set to buflen - offset right here. It should > > /always/ be 4 or more. We can ensure that because the input > > to this function is always provided by another kernel function > > (in other words, we control all the callers). > > > > Why is buflen == offset for listxattrs? 6c2190b3fcbc ("NFS: > > Fix listxattr receive buffer size") is trying to ensure > > tail->iov_len is not zero -- that the math here gives us a > > positive value every time. > > > > In nfs4_xdr_enc_listxattrs() we have: > > > > 1529 rpc_prepare_reply_pages(req, args->xattr_pages, 0, args->count, > > 1530 hdr.replen); > > > > hdr.replen is set to NFS4_dec_listxattrs_sz. > > > > _nfs42_proc_listxattrs() sets args->count. > > > > I suspect the problem is the way _nfs42_proc_listxattrs() is > > computing the length of xattr_pages. It includes the trailing > > EOF boolean, but so does the decode_listxattrs_maxsz macro, > > for instance. > > > > We need head->iov_len to always be one XDR_UNIT larger than > > the position where the xattr_pages array starts. Then the > > math in xdr_inline_pages() will work correctly. (sidebar: > > perhaps the documenting comment for xdr_inline_pages() should > > explain that assumption). > > > > So, now I agree with the assessment that 6c2190b3fcbc ("NFS: > > Fix listxattr receive buffer size") is probably not adequate. 
> > There is another subtlety to address in the way that
> > _nfs42_proc_listxattrs() computes args->count.
>
> The only thing I see wrong so far is that I believe that
> decode_listxattrs_maxsz is wrong - it shouldn't include the EOF
> word, which is accounted for in the page data part.
>
> But, it seems that this wouldn't cause the above problem. I'm
> also uncertain why this happens with RDMA, but not otherwise.
> Unfortunately, I can't test RDMA, but when I ran listxattr tests,
> I did so with KASAN enabled, and didn't see issues.

There isn't anything special needed to run tests on RDMA: you just need to compile the RXE driver, and the rdma-core package has everything you need to run Soft-RoCE (or soft iWARP). This is how I'm testing.

> Obviously there could be a bug here, it could be that the code
> just gets lucky, but that the bug is exposed on RDMA.
>
> Is there a specific size passed to listxattr that this happens with?

First let me answer the last question: I'm running xfstest generic/013.

The latest update: I updated to Trond's origin/testing, which is now based on 5.10-rc4.

1. I don't see the oops during the encoding of the listxattr.
2. I'm still seeing the oops during the RDMA completion. This happens in the following scenario. Normally, there is a request for listxattr which passes in a buflen of 65536. This translates into an RDMA request with a reply chunk with 2 segments. But in the failing case there is a request for listxattr which passes a buflen of only 8 bytes. This translates into an RDMA inline request, which later oopses in rpcrdma_inline_fixup().

What I would like to know: (1) should _nfs42_proc_listxattrs() be doing some kind of sanity checking of the passed-in buflen? (2) But of course, even if the passed-in buffer is small, I'm assuming the code shouldn't be oopsing.

>
> - Frank
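The two behaviors Olga describes can be summarized in a few lines. The threshold below is an assumed constant for illustration only; the real rpcrdma inline limits are per-connection settings, and the helper name is made up. The point is just that the 8-byte listxattr call stays under the inline limit, so the reply comes back through rpcrdma_inline_fixup(), which then has to copy into receive pages that were never allocated.

/* Toy model of the transport choice described above; not rpcrdma code.
 * ASSUMED_INLINE_THRESHOLD and reply_path() are made-up names.
 */
#include <stdio.h>

#define ASSUMED_INLINE_THRESHOLD 4096    /* assumption for illustration */

static const char *reply_path(unsigned int expected_reply_bytes)
{
    if (expected_reply_bytes <= ASSUMED_INLINE_THRESHOLD)
        return "inline reply: rpcrdma_inline_fixup() copies into head/pages/tail";
    return "reply chunk: server RDMA-writes into registered segments";
}

int main(void)
{
    printf("listxattr buflen 65536 -> %s\n", reply_path(65536));
    printf("listxattr buflen     8 -> %s\n", reply_path(8));
    return 0;
}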
Hi Frank, Chuck, I would like your option on how LISTXATTR is supposed to work over RDMA. Here's my current understanding of why the listxattr is not working over the RDMA. This happens when the listxattr is called with a very small buffer size which RDMA wants to send an inline request. I really dont understand why, Chuck, you are not seeing any problems with hardware as far as I can tell it would have the same problem because the inline threshold size would still make this size inline. rcprdma_inline_fixup() is trying to write to pages that don't exist. When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that will allocate pages in xs_alloc_sparse_pages() but this is ONLY for TCP. RDMA doesn't have anything like that. Question: Should there be code added to RDMA that will do something similar when it sees that flag set? Or, should LISTXATTR be re-written to be like READDIR which allocates pages before calling the code. On Fri, Nov 20, 2020 at 11:37 AM Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > On Thu, Nov 19, 2020 at 6:26 PM Frank van der Linden > <fllinden@amazon.com> wrote: > > > > On Thu, Nov 19, 2020 at 11:19:05AM -0500, Chuck Lever wrote: > > > > On Nov 19, 2020, at 10:09 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > > > > > > > On Thu, Nov 19, 2020 at 9:37 AM Chuck Lever <chuck.lever@oracle.com> wrote: > > > >> > > > >> Hi Olga- > > > >> > > > >>> On Nov 18, 2020, at 4:44 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > > >>> > > > >>> Hi Chuck, > > > >>> > > > >>> The first problem I found was from 5.10-rc3 testing was from the fact > > > >>> that tail's iov_len was set to -4 and reply_chunk was trying to call > > > >>> rpcrdma_convert_kvec() for a tail that didn't exist. > > > >>> > > > >>> Do you see issues with this fix? > > > >>> > > > >>> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c > > > >>> index 71e03b930b70..2e6a228abb95 100644 > > > >>> --- a/net/sunrpc/xdr.c > > > >>> +++ b/net/sunrpc/xdr.c > > > >>> @@ -193,7 +193,7 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned int offset, > > > >>> > > > >>> tail->iov_base = buf + offset; > > > >>> tail->iov_len = buflen - offset; > > > >>> - if ((xdr->page_len & 3) == 0) > > > >>> + if ((xdr->page_len & 3) == 0 && tail->iov_len) > > > >>> tail->iov_len -= sizeof(__be32); > > > >>> > > > >>> xdr->buflen += len; > > > >> > > > >> It's not clear to me whether only the listxattrs encoder is > > > >> not providing a receive tail kvec, or whether all the encoders > > > >> fail to provide a tail in this case. > > > > > > > > There is nothing specific that listxattr does, it just calls > > > > rpc_prepare_pages(). Isn't tail[0] always there (xdr_buf structure has > > > > tail[1] defined)? > > > > > > Flip the question on its head: Why does xdr_inline_pages() work > > > fine for all the other operations? That suggests the problem is > > > with listxattrs, not with generic code. > > > > > > > > > > But not all requests have data in the page. So as > > > > far as I understand, tail->iov_len can be 0 so not checking for it is > > > > wrong. > > > > > > The full context of the tail logic is: > > > > > > 194 tail->iov_base = buf + offset; > > > 195 tail->iov_len = buflen - offset; > > > 196 if ((xdr->page_len & 3) == 0) > > > 197 tail->iov_len -= sizeof(__be32); > > > > > > tail->iov_len is set to buflen - offset right here. It should > > > /always/ be 4 or more. 
We can ensure that because the input > > > to this function is always provided by another kernel function > > > (in other words, we control all the callers). > > > > > > Why is buflen == offset for listxattrs? 6c2190b3fcbc ("NFS: > > > Fix listxattr receive buffer size") is trying to ensure > > > tail->iov_len is not zero -- that the math here gives us a > > > positive value every time. > > > > > > In nfs4_xdr_enc_listxattrs() we have: > > > > > > 1529 rpc_prepare_reply_pages(req, args->xattr_pages, 0, args->count, > > > 1530 hdr.replen); > > > > > > hdr.replen is set to NFS4_dec_listxattrs_sz. > > > > > > _nfs42_proc_listxattrs() sets args->count. > > > > > > I suspect the problem is the way _nfs42_proc_listxattrs() is > > > computing the length of xattr_pages. It includes the trailing > > > EOF boolean, but so does the decode_listxattrs_maxsz macro, > > > for instance. > > > > > > We need head->iov_len to always be one XDR_UNIT larger than > > > the position where the xattr_pages array starts. Then the > > > math in xdr_inline_pages() will work correctly. (sidebar: > > > perhaps the documenting comment for xdr_inline_pages() should > > > explain that assumption). > > > > > > So, now I agree with the assessment that 6c2190b3fcbc ("NFS: > > > Fix listxattr receive buffer size") is probably not adequate. > > > There is another subtlety to address in the way that > > > _nfs42_proc_listxattrs() computes args->count. > > > > The only thing I see wrong so far is that I believe that > > decode_listxattrs_maxsz is wrong - it shouldn't include the EOF > > word, which is accounted for in the page data part. > > > > But, it seems that this wouldn't cause the above problem. I'm > > also uncertain why this happens with RDMA, but not otherwise. > > Unfortunately, I can't test RDMA, but when I ran listxattr tests, > > I did so with KASAN enabled, and didn't see issues. > > There isn't nothing special to run tests on RDMA, you just need to > compile the RXE driver and the rdma-core package have everything you > need to run soft Roce (or soft iwarp). This is how I'm testing. > > > Obviously there could be a bug here, it could be that the code > > just gets lucky, but that the bug is exposed on RDMA. > > > > Is there a specific size passed to listxattr that this happens with? > > First let me answer the last question, I'm running xfstest generic/013. > > The latest update: updated to Trond's origing/testing which is now > based on 5.10-rc4. > > 1. I don't see the oops during the encoding of the listxattr > 2. I'm still seeing the oops during the rdma completion. This happens > in the following scenario. Normally, there is a request for listxattr > which passes in buflen of 65536. This translates into rdma request > with a reply chunk with 2 segments. But during failure there is a > request for listxattr which passes buflen of only 8bytes. This > translates into rdma inline request which later oops in > rpcrdma_inline_fixup. > > What I would like to know: (1) should _nf42_proc_listxattrs() be doing > some kind of sanity checking for passed in buflen? (2) but of course > I'm assuming even if passed in buffer is small the code shouldn't be > oops-ing. > > > > > - Frank
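For reference, the encoder arrangement under discussion is roughly the
fragment below (paraphrased, not verbatim nfs42xdr.c): the caller hands
over an array of page pointers that may still be NULL, registers them
as the reply's page area, and marks the buffer sparse so that the
transport is expected to provide the missing pages at receive time.

	/*
	 * args->xattr_pages[] may contain only NULL pointers here;
	 * args->count is the XDR length the reply pages must cover.
	 */
	rpc_prepare_reply_pages(req, args->xattr_pages, 0, args->count,
				hdr.replen);
	/* ask the transport to allocate any missing receive pages */
	req->rq_rcv_buf.flags |= XDRBUF_SPARSE_PAGES;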
> On Nov 23, 2020, at 11:42 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > Hi Frank, Chuck, > > I would like your option on how LISTXATTR is supposed to work over > RDMA. Here's my current understanding of why the listxattr is not > working over the RDMA. > > This happens when the listxattr is called with a very small buffer > size which RDMA wants to send an inline request. I really dont > understand why, Chuck, you are not seeing any problems with hardware > as far as I can tell it would have the same problem because the inline > threshold size would still make this size inline. > rcprdma_inline_fixup() is trying to write to pages that don't exist. > > When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that > will allocate pages in xs_alloc_sparse_pages() but this is ONLY for > TCP. RDMA doesn't have anything like that. > > Question: Should there be code added to RDMA that will do something > similar when it sees that flag set? Isn't the logic in rpcrdma_convert_iovs() allocating those pages? > Or, should LISTXATTR be re-written > to be like READDIR which allocates pages before calling the code. AIUI READDIR reads into the directory inode's page cache. I recall that Frank couldn't do that for LISTXATTR because there's no similar page cache associated with the xattr listing. That said, I would prefer that the *XATTR procedures directly allocate pages instead of relying on SPARSE_PAGES, which is a hack IMO. I think it would have to use alloc_page() for that, and then ensure those pages are released when the call has completed. I'm not convinced this is the cause of the problem you're seeing, though. -- Chuck Lever
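A minimal sketch of the direct-allocation approach suggested above,
assuming a pair of helpers called from _nfs42_proc_listxattrs() before
the RPC is sent and after it completes; the helper names and error
handling are made up for illustration:

/*
 * Sketch only: preallocate the receive pages in the proc code and
 * release them once the call has completed, instead of relying on
 * XDRBUF_SPARSE_PAGES.  'np' is the number of pages the reply may need.
 */
static int nfs42_alloc_xattr_pages(struct page **pages, unsigned int np)
{
	unsigned int i;

	for (i = 0; i < np; i++) {
		pages[i] = alloc_page(GFP_KERNEL);
		if (!pages[i])
			return -ENOMEM;	/* caller frees whatever was allocated */
	}
	return 0;
}

static void nfs42_free_xattr_pages(struct page **pages, unsigned int np)
{
	unsigned int i;

	for (i = 0; i < np; i++)
		if (pages[i])
			__free_page(pages[i]);
}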
On Mon, Nov 23, 2020 at 11:42:46AM -0500, Olga Kornievskaia wrote: > Hi Frank, Chuck, > > I would like your option on how LISTXATTR is supposed to work over > RDMA. Here's my current understanding of why the listxattr is not > working over the RDMA. > > This happens when the listxattr is called with a very small buffer > size which RDMA wants to send an inline request. I really dont > understand why, Chuck, you are not seeing any problems with hardware > as far as I can tell it would have the same problem because the inline > threshold size would still make this size inline. > rcprdma_inline_fixup() is trying to write to pages that don't exist. > > When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that > will allocate pages in xs_alloc_sparse_pages() but this is ONLY for > TCP. RDMA doesn't have anything like that. > > Question: Should there be code added to RDMA that will do something > similar when it sees that flag set? Or, should LISTXATTR be re-written > to be like READDIR which allocates pages before calling the code. Hm.. so if the flag never worked for RDMA, was NFS_V3_ACL ever tested over RDMA? That's the only other user. If the flag simply doesn't work, I agree that it should either be fixed or just removed. It wouldn't be the end of the world to change LISTXATTRS (and GETXATTR) to use preallocated pages. But, I didn't do that because I didn't want to waste the max size (64k) every time, even though you usually just get a few hundred bytes at most. So it seems like fixing XDRBUF_SPARSE_PAGES is cleaner. - Frank
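For comparison, "fixing XDRBUF_SPARSE_PAGES" on the RPC/RDMA side would
presumably look something like the sketch below, loosely modelled on
what the TCP receive path is described as doing. The helper name, call
site, and GFP flags are assumptions, not existing code:

/*
 * Sketch: before the transport writes reply data into the xdr_buf's
 * pages, allocate any page pointers the upper layer left NULL, but only
 * when the buffer was marked sparse.
 */
static int rpcrdma_fill_sparse_pages(struct xdr_buf *buf, gfp_t gfp)
{
	unsigned int i, npages;

	if (!(buf->flags & XDRBUF_SPARSE_PAGES))
		return 0;

	npages = (buf->page_base + buf->page_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
	for (i = 0; i < npages; i++) {
		if (buf->pages[i])
			continue;
		buf->pages[i] = alloc_page(gfp);
		if (!buf->pages[i])
			return -ENOMEM;
	}
	return 0;
}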
> On Nov 23, 2020, at 12:38 PM, Frank van der Linden <fllinden@amazon.com> wrote: > > On Mon, Nov 23, 2020 at 11:42:46AM -0500, Olga Kornievskaia wrote: >> Hi Frank, Chuck, >> >> I would like your option on how LISTXATTR is supposed to work over >> RDMA. Here's my current understanding of why the listxattr is not >> working over the RDMA. >> >> This happens when the listxattr is called with a very small buffer >> size which RDMA wants to send an inline request. I really dont >> understand why, Chuck, you are not seeing any problems with hardware >> as far as I can tell it would have the same problem because the inline >> threshold size would still make this size inline. >> rcprdma_inline_fixup() is trying to write to pages that don't exist. >> >> When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that >> will allocate pages in xs_alloc_sparse_pages() but this is ONLY for >> TCP. RDMA doesn't have anything like that. >> >> Question: Should there be code added to RDMA that will do something >> similar when it sees that flag set? Or, should LISTXATTR be re-written >> to be like READDIR which allocates pages before calling the code. > > Hm.. so if the flag never worked for RDMA, was NFS_V3_ACL ever tested > over RDMA? That's the only other user. > > If the flag simply doesn't work, I agree that it should either be fixed > or just removed. Shirley fixed a crasher here years ago by making SPARSE_PAGES work for RPC/RDMA. She confirmed the fix was effective. > It wouldn't be the end of the world to change LISTXATTRS (and GETXATTR) > to use preallocated pages. But, I didn't do that because I didn't want to > waste the max size (64k) every time, even though you usually just get > a few hundred bytes at most. So it seems like fixing XDRBUF_SPARSE_PAGES > is cleaner. These are infrequent operations. We've added extra conditional branches in the hot path of the transports to handle this rare, non-performance sensitive case. I also wonder how well SPARSE_PAGES will work with xdr->bvecs, since the bio_vec array is constructed before the transport can allocate those pages. To me, the whole SPARSE_PAGES concept is prototype-quality code that needs to be replaced with a robust permanent solution. -- Chuck Lever
> On Nov 23, 2020, at 12:38 PM, Frank van der Linden <fllinden@amazon.com> wrote: > > On Mon, Nov 23, 2020 at 11:42:46AM -0500, Olga Kornievskaia wrote: >> Hi Frank, Chuck, >> >> I would like your option on how LISTXATTR is supposed to work over >> RDMA. Here's my current understanding of why the listxattr is not >> working over the RDMA. >> >> This happens when the listxattr is called with a very small buffer >> size which RDMA wants to send an inline request. I really dont >> understand why, Chuck, you are not seeing any problems with hardware >> as far as I can tell it would have the same problem because the inline >> threshold size would still make this size inline. >> rcprdma_inline_fixup() is trying to write to pages that don't exist. >> >> When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that >> will allocate pages in xs_alloc_sparse_pages() but this is ONLY for >> TCP. RDMA doesn't have anything like that. >> >> Question: Should there be code added to RDMA that will do something >> similar when it sees that flag set? Or, should LISTXATTR be re-written >> to be like READDIR which allocates pages before calling the code. > > Hm.. so if the flag never worked for RDMA, was NFS_V3_ACL ever tested > over RDMA? That's the only other user. > > If the flag simply doesn't work, I agree that it should either be fixed > or just removed. > > It wouldn't be the end of the world to change LISTXATTRS (and GETXATTR) > to use preallocated pages. But, I didn't do that because I didn't want to > waste the max size (64k) every time, even though you usually just get > a few hundred bytes at most. So it seems like fixing XDRBUF_SPARSE_PAGES > is cleaner. Also, because this is for a receive buffer, the transport always has to allocate the maximum number of pages. Especially for RPC/RDMA, this is not an "allocate on demand" kind of thing. The maximum buffer size has to be allocated every time. -- Chuck Lever
On Mon, Nov 23, 2020 at 12:37 PM Chuck Lever <chuck.lever@oracle.com> wrote: > > > > > On Nov 23, 2020, at 11:42 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > > > Hi Frank, Chuck, > > > > I would like your option on how LISTXATTR is supposed to work over > > RDMA. Here's my current understanding of why the listxattr is not > > working over the RDMA. > > > > This happens when the listxattr is called with a very small buffer > > size which RDMA wants to send an inline request. I really dont > > understand why, Chuck, you are not seeing any problems with hardware > > as far as I can tell it would have the same problem because the inline > > threshold size would still make this size inline. > > rcprdma_inline_fixup() is trying to write to pages that don't exist. > > > > When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that > > will allocate pages in xs_alloc_sparse_pages() but this is ONLY for > > TCP. RDMA doesn't have anything like that. > > > > Question: Should there be code added to RDMA that will do something > > similar when it sees that flag set? > > Isn't the logic in rpcrdma_convert_iovs() allocating those pages? No, rpcrdm_convert_iovs is only called for when you have reply chunks, lists etc but not for the inline messages. What am I missing? > > Or, should LISTXATTR be re-written > > to be like READDIR which allocates pages before calling the code. > > AIUI READDIR reads into the directory inode's page cache. I recall > that Frank couldn't do that for LISTXATTR because there's no > similar page cache associated with the xattr listing. > > That said, I would prefer that the *XATTR procedures directly > allocate pages instead of relying on SPARSE_PAGES, which is a hack > IMO. I think it would have to use alloc_page() for that, and then > ensure those pages are released when the call has completed. > > I'm not convinced this is the cause of the problem you're seeing, > though. > > -- > Chuck Lever > > >
On Mon, Nov 23, 2020 at 12:42 PM Frank van der Linden <fllinden@amazon.com> wrote: > > On Mon, Nov 23, 2020 at 11:42:46AM -0500, Olga Kornievskaia wrote: > > Hi Frank, Chuck, > > > > I would like your option on how LISTXATTR is supposed to work over > > RDMA. Here's my current understanding of why the listxattr is not > > working over the RDMA. > > > > This happens when the listxattr is called with a very small buffer > > size which RDMA wants to send an inline request. I really dont > > understand why, Chuck, you are not seeing any problems with hardware > > as far as I can tell it would have the same problem because the inline > > threshold size would still make this size inline. > > rcprdma_inline_fixup() is trying to write to pages that don't exist. > > > > When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that > > will allocate pages in xs_alloc_sparse_pages() but this is ONLY for > > TCP. RDMA doesn't have anything like that. > > > > Question: Should there be code added to RDMA that will do something > > similar when it sees that flag set? Or, should LISTXATTR be re-written > > to be like READDIR which allocates pages before calling the code. > > Hm.. so if the flag never worked for RDMA, was NFS_V3_ACL ever tested > over RDMA? That's the only other user. It could have worked depending on whether or not ACL ever were sent inline. As I said LISTXATTR works when userspace provides a large buffer because that triggers the RDMA code to setup a reply chunk which allocated memory. So it's very specific case. It might have worked for ACL because the way ACL works (at least now) is that it's called first with a large buffer size, then client code actually caches the reply so when teh user space calls it again with appropriate buffer size, the code doesn't do another request. LISTXATTR doesn't have that optimization and it does another call. This way it uses a inline RDMA message which doesnt not work for when pages are not allocated as far as I can tell. > If the flag simply doesn't work, I agree that it should either be fixed > or just removed. > > It wouldn't be the end of the world to change LISTXATTRS (and GETXATTR) > to use preallocated pages. But, I didn't do that because I didn't want to > waste the max size (64k) every time, even though you usually just get > a few hundred bytes at most. So it seems like fixing XDRBUF_SPARSE_PAGES > is cleaner. > > - Frank
> On Nov 23, 2020, at 12:59 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > On Mon, Nov 23, 2020 at 12:37 PM Chuck Lever <chuck.lever@oracle.com> wrote: >> >> >> >>> On Nov 23, 2020, at 11:42 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: >>> >>> Hi Frank, Chuck, >>> >>> I would like your option on how LISTXATTR is supposed to work over >>> RDMA. Here's my current understanding of why the listxattr is not >>> working over the RDMA. >>> >>> This happens when the listxattr is called with a very small buffer >>> size which RDMA wants to send an inline request. I really dont >>> understand why, Chuck, you are not seeing any problems with hardware >>> as far as I can tell it would have the same problem because the inline >>> threshold size would still make this size inline. >>> rcprdma_inline_fixup() is trying to write to pages that don't exist. >>> >>> When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that >>> will allocate pages in xs_alloc_sparse_pages() but this is ONLY for >>> TCP. RDMA doesn't have anything like that. >>> >>> Question: Should there be code added to RDMA that will do something >>> similar when it sees that flag set? >> >> Isn't the logic in rpcrdma_convert_iovs() allocating those pages? > > No, rpcrdm_convert_iovs is only called for when you have reply chunks, > lists etc but not for the inline messages. What am I missing? So, then, rpcrdma_marshal_req() is deciding that the LISTXATTRS reply is supposed to fit inline. That means rqst->rq_rcv_buf.buflen is small. But if rpcrdma_inline_fixup() is trying to fill pages, rqst->rq_rcv_buf.page_len must not be zero? That sounds like the LISTXATTRS encoder is not setting up the receive buffer correctly. The receive buffer's buflen field is supposed to be set to a value that is at least as large as page_len, I would think. >>> Or, should LISTXATTR be re-written >>> to be like READDIR which allocates pages before calling the code. >> >> AIUI READDIR reads into the directory inode's page cache. I recall >> that Frank couldn't do that for LISTXATTR because there's no >> similar page cache associated with the xattr listing. >> >> That said, I would prefer that the *XATTR procedures directly >> allocate pages instead of relying on SPARSE_PAGES, which is a hack >> IMO. I think it would have to use alloc_page() for that, and then >> ensure those pages are released when the call has completed. >> >> I'm not convinced this is the cause of the problem you're seeing, >> though. >> >> -- >> Chuck Lever -- Chuck Lever
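The sizing relationship described here can be written down as a simple
check. This is purely illustrative, not an existing kernel helper: the
receive buffer's buflen has to cover the head kvec, the page data, and
the tail, or the transport's "does the reply fit inline?" decision is
made against a length smaller than the data it is later asked to place.

/* Illustrative invariant for the LISTXATTRS receive buffer. */
static bool rcv_buf_is_sane(const struct xdr_buf *buf)
{
	return buf->buflen >= buf->head[0].iov_len + buf->page_len +
			      buf->tail[0].iov_len;
}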
On Mon, Nov 23, 2020 at 12:37:15PM -0500, Chuck Lever wrote: > > > > On Nov 23, 2020, at 11:42 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > > > Hi Frank, Chuck, > > > > I would like your option on how LISTXATTR is supposed to work over > > RDMA. Here's my current understanding of why the listxattr is not > > working over the RDMA. > > > > This happens when the listxattr is called with a very small buffer > > size which RDMA wants to send an inline request. I really dont > > understand why, Chuck, you are not seeing any problems with hardware > > as far as I can tell it would have the same problem because the inline > > threshold size would still make this size inline. > > rcprdma_inline_fixup() is trying to write to pages that don't exist. > > > > When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that > > will allocate pages in xs_alloc_sparse_pages() but this is ONLY for > > TCP. RDMA doesn't have anything like that. > > > > Question: Should there be code added to RDMA that will do something > > similar when it sees that flag set? > > Isn't the logic in rpcrdma_convert_iovs() allocating those pages? > > > > Or, should LISTXATTR be re-written > > to be like READDIR which allocates pages before calling the code. > > AIUI READDIR reads into the directory inode's page cache. I recall > that Frank couldn't do that for LISTXATTR because there's no > similar page cache associated with the xattr listing. Yep, correct. > > That said, I would prefer that the *XATTR procedures directly > allocate pages instead of relying on SPARSE_PAGES, which is a hack > IMO. I think it would have to use alloc_page() for that, and then > ensure those pages are released when the call has completed. > > I'm not convinced this is the cause of the problem you're seeing, > though. Since the flag works with GETXATTR I would agree that it's probably not the issue. More likely, it seems like args->count is somehow off. - Frank
On Mon, Nov 23, 2020 at 05:38:02PM +0000, Frank van der Linden wrote: > It wouldn't be the end of the world to change LISTXATTRS (and GETXATTR) > to use preallocated pages. But, I didn't do that because I didn't want to > waste the max size (64k) every time, even though you usually just get > a few hundred bytes at most. So it seems like fixing XDRBUF_SPARSE_PAGES > is cleaner. Correcting myself here.. this is true of GETXATTR, since there is no maximum length argument for that call - so your only choice is to allocate the max of what you can handle, and SPARSE seems to be the best option fr that. For LISTXATTRS, there is a max length argument. It's a bit unusual, in that it specifies the maximum number of bytes of *XDR* data you want to receive. So you don't need to allocate the maximum number of pages. But, I decided to still use the SPARSE option, since the common 0 length argument to the listxattr syscall, used to probe the length of the buffer needed, would still require the maximum length (64k) to be allocated, because there is no 0 lengh 'probe' option for the RPC. So all you can do is ask for the maximum size you can handle, and see what you get back to determine the length. Olga's comment about caching: the code actually does cache the data for all valid replies, even if the system call only was a len == 0 probe call. That way, the followup system call that actually retrieves the data can normally just get it from the cache. In any case, I need to set up an RDMA environment to see if I can reproduce and fix the issue. I hope to do that today or tomorrow. - Frank
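From userspace, the probe-then-fetch idiom described above looks like
the sketch below (illustrative only; the path is an example, and error
handling is abbreviated):

/*
 * A zero-length listxattr(2) returns the size needed; a second call
 * with a buffer of that size fetches the NUL-separated names.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/xattr.h>

int main(void)
{
	const char *path = "/mnt/nfs/testfile";
	ssize_t len = listxattr(path, NULL, 0);	/* probe: no buffer */
	char *names;

	if (len <= 0)
		return 1;
	names = malloc(len);
	if (!names)
		return 1;
	len = listxattr(path, names, len);	/* fetch into a sized buffer */
	if (len > 0)
		fwrite(names, 1, len, stdout);
	free(names);
	return 0;
}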
On Mon, Nov 23, 2020 at 1:09 PM Chuck Lever <chuck.lever@oracle.com> wrote: > > > > > On Nov 23, 2020, at 12:59 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > > > > On Mon, Nov 23, 2020 at 12:37 PM Chuck Lever <chuck.lever@oracle.com> wrote: > >> > >> > >> > >>> On Nov 23, 2020, at 11:42 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote: > >>> > >>> Hi Frank, Chuck, > >>> > >>> I would like your option on how LISTXATTR is supposed to work over > >>> RDMA. Here's my current understanding of why the listxattr is not > >>> working over the RDMA. > >>> > >>> This happens when the listxattr is called with a very small buffer > >>> size which RDMA wants to send an inline request. I really dont > >>> understand why, Chuck, you are not seeing any problems with hardware > >>> as far as I can tell it would have the same problem because the inline > >>> threshold size would still make this size inline. > >>> rcprdma_inline_fixup() is trying to write to pages that don't exist. > >>> > >>> When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that > >>> will allocate pages in xs_alloc_sparse_pages() but this is ONLY for > >>> TCP. RDMA doesn't have anything like that. > >>> > >>> Question: Should there be code added to RDMA that will do something > >>> similar when it sees that flag set? > >> > >> Isn't the logic in rpcrdma_convert_iovs() allocating those pages? > > > > No, rpcrdm_convert_iovs is only called for when you have reply chunks, > > lists etc but not for the inline messages. What am I missing? > > So, then, rpcrdma_marshal_req() is deciding that the LISTXATTRS > reply is supposed to fit inline. That means rqst->rq_rcv_buf.buflen > is small. > > But if rpcrdma_inline_fixup() is trying to fill pages, > rqst->rq_rcv_buf.page_len must not be zero? That sounds like the > LISTXATTRS encoder is not setting up the receive buffer correctly. > > The receive buffer's buflen field is supposed to be set to a value > that is at least as large as page_len, I would think. Here's what the LISTXATTR code does that I can see: It allocates pointers to the pages (but no pages). It sets the page_len to the hdr.replen so yes it's not zero (setting of the value as far as i know is correct). So for RDMA nothing allocates those pages because it's an inline request. TCP code will allocate those pages because the code was added. You keep on saying that you don't think this is it but I don't know how to prove to you that Kasan's message of "wild-memory access" means that page wasn't allocated. It's a bogus address. The page isn't there. Or at least that's how I read the Kasan's message. I don't know how else to interpret it (but also know that code never allocates the memory I believe is a strong argument). NFS code can't know that request is inline. It does assume something will allocate that memory but RDMA doesn't allocate memory for the inline messages. While I'm not suggesting this is a correct fix but this is a fix that removes the oops. 
diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index 2b2211d1234e..faab6aedeb42 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -1258,6 +1258,15 @@ static ssize_t _nfs42_proc_listxattrs(struct inode *inode, void *buf,
 		__free_page(res.scratch);
 		return -ENOMEM;
 	}
+	if (buflen < 1024) {
+		int i;
+		for (i = 0; i < np; i++) {
+			pages[i] = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
+			if (!pages[i])
+				return -ENOMEM;
+		}
+	}
+
 	arg.xattr_pages = pages;
 	arg.count = xdrlen;
 

Basically, since I know that all RDMA requests smaller than 1024 bytes
(for Soft-RoCE) will be sent inline, I need to allocate pages for them.
This doesn't interfere with TCP mounts, as the code checks whether
pages are already allocated and only allocates them if they are not.
But of course this is not a solution, as the RDMA inline threshold is
unknown at the NFS layer.

> >>> Or, should LISTXATTR be re-written
> >>> to be like READDIR which allocates pages before calling the code.
> >>
> >> AIUI READDIR reads into the directory inode's page cache. I recall
> >> that Frank couldn't do that for LISTXATTR because there's no
> >> similar page cache associated with the xattr listing.
> >>
> >> That said, I would prefer that the *XATTR procedures directly
> >> allocate pages instead of relying on SPARSE_PAGES, which is a hack
> >> IMO. I think it would have to use alloc_page() for that, and then
> >> ensure those pages are released when the call has completed.
> >>
> >> I'm not convinced this is the cause of the problem you're seeing,
> >> though.
> >>
> >> --
> >> Chuck Lever
>
> --
> Chuck Lever
>
>
diff --git a/fs/nfs/nfs42xdr.c b/fs/nfs/nfs42xdr.c
index 6e060a8..e88bc7a 100644
--- a/fs/nfs/nfs42xdr.c
+++ b/fs/nfs/nfs42xdr.c
@@ -196,7 +196,8 @@
 				 1 + nfs4_xattr_name_maxsz + 1)
 #define decode_setxattr_maxsz	(op_decode_hdr_maxsz + decode_change_info_maxsz)
 #define encode_listxattrs_maxsz	(op_encode_hdr_maxsz + 2 + 1)
-#define decode_listxattrs_maxsz	(op_decode_hdr_maxsz + 2 + 1 + 1 + 1)
+#define decode_listxattrs_maxsz	(op_decode_hdr_maxsz + 2 + 1 + 1 + \
+				 XDR_QUADLEN(NFS4_OPAQUE_LIMIT))
 #define encode_removexattr_maxsz	(op_encode_hdr_maxsz + 1 + \
 					 nfs4_xattr_name_maxsz)
 #define decode_removexattr_maxsz	(op_decode_hdr_maxsz + \