
[1/1] NFSv4.2: fix LISTXATTR buffer receive size

Message ID 20201113190851.7817-1-olga.kornievskaia@gmail.com (mailing list archive)
State New, archived
Series [1/1] NFSv4.2: fix LISTXATTR buffer receive size

Commit Message

Olga Kornievskaia Nov. 13, 2020, 7:08 p.m. UTC
From: Olga Kornievskaia <kolga@netapp.com>

xfstest generic/013 run on NFSoRDMA over SoftRoCE or iWarp panics,
and running with KASAN reports:

[  216.018711] BUG: KASAN: wild-memory-access in rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
[  216.024195] Write of size 12 at addr 0005088000000000 by task kworker/1:1H/480
[  216.028820]
[  216.029776] CPU: 1 PID: 480 Comm: kworker/1:1H Not tainted 5.8.0-rc5+ #37
[  216.034247] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
[  216.040604] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
[  216.043739] Call Trace:
[  216.045014]  dump_stack+0x7c/0xb0
[  216.046757]  ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
[  216.050008]  ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
[  216.053091]  kasan_report.cold.10+0x6a/0x85
[  216.055703]  ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
[  216.058979]  check_memory_region+0x183/0x1e0
[  216.061933]  memcpy+0x38/0x60
[  216.064077]  rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
[  216.067502]  ? rpcrdma_reset_cwnd+0x70/0x70 [rpcrdma]
[  216.070268]  ? recalibrate_cpu_khz+0x10/0x10
[  216.072585]  ? rpcrdma_reply_handler+0x604/0x6e0 [rpcrdma]
[  216.075469]  __ib_process_cq+0xa7/0x220 [ib_core]
[  216.078077]  ib_cq_poll_work+0x31/0xb0 [ib_core]
[  216.080451]  process_one_work+0x387/0x6c0
[  216.082498]  worker_thread+0x57/0x5a0
[  216.084425]  ? process_one_work+0x6c0/0x6c0
[  216.086583]  kthread+0x1ca/0x200
[  216.088775]  ? kthread_create_on_node+0xc0/0xc0
[  216.091847]  ret_from_fork+0x22/0x30

Fixes: 6c2190b3fcbc ("NFS: Fix listxattr receive buffer size")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
---
 fs/nfs/nfs42xdr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Chuck Lever III Nov. 13, 2020, 8:34 p.m. UTC | #1
Hi Olga-

> On Nov 13, 2020, at 2:08 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> 
> From: Olga Kornievskaia <kolga@netapp.com>
> 
> xfstest generic/013 over on a NFSoRDMA over SoftRoCE or iWarp panics
> and running with KASAN reports:

There is still only a highly circumstantial connection between
soft RoCE and iWarp and these crashes, so this description seems
misleading and/or premature.


> [  216.018711] BUG: KASAN: wild-memory-access in rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
> [  216.024195] Write of size 12 at addr 0005088000000000 by task kworker/1:1H/480
> [  216.028820]
> [  216.029776] CPU: 1 PID: 480 Comm: kworker/1:1H Not tainted 5.8.0-rc5+ #37
> [  216.034247] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
> [  216.040604] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
> [  216.043739] Call Trace:
> [  216.045014]  dump_stack+0x7c/0xb0
> [  216.046757]  ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
> [  216.050008]  ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
> [  216.053091]  kasan_report.cold.10+0x6a/0x85
> [  216.055703]  ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
> [  216.058979]  check_memory_region+0x183/0x1e0
> [  216.061933]  memcpy+0x38/0x60
> [  216.064077]  rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
> [  216.067502]  ? rpcrdma_reset_cwnd+0x70/0x70 [rpcrdma]
> [  216.070268]  ? recalibrate_cpu_khz+0x10/0x10
> [  216.072585]  ? rpcrdma_reply_handler+0x604/0x6e0 [rpcrdma]
> [  216.075469]  __ib_process_cq+0xa7/0x220 [ib_core]
> [  216.078077]  ib_cq_poll_work+0x31/0xb0 [ib_core]
> [  216.080451]  process_one_work+0x387/0x6c0
> [  216.082498]  worker_thread+0x57/0x5a0
> [  216.084425]  ? process_one_work+0x6c0/0x6c0
> [  216.086583]  kthread+0x1ca/0x200
> [  216.088775]  ? kthread_create_on_node+0xc0/0xc0
> [  216.091847]  ret_from_fork+0x22/0x30
> 
> Fixes: 6c2190b3fcbc ("NFS: Fix listxattr receive buffer size")
> Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> ---
> fs/nfs/nfs42xdr.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/nfs42xdr.c b/fs/nfs/nfs42xdr.c
> index 6e060a8..e88bc7a 100644
> --- a/fs/nfs/nfs42xdr.c
> +++ b/fs/nfs/nfs42xdr.c
> @@ -196,7 +196,8 @@
> 				 1 + nfs4_xattr_name_maxsz + 1)
> #define decode_setxattr_maxsz   (op_decode_hdr_maxsz + decode_change_info_maxsz)
> #define encode_listxattrs_maxsz  (op_encode_hdr_maxsz + 2 + 1)
> -#define decode_listxattrs_maxsz  (op_decode_hdr_maxsz + 2 + 1 + 1 + 1)
> +#define decode_listxattrs_maxsz  (op_decode_hdr_maxsz + 2 + 1 + 1 + \
> +				  XDR_QUADLEN(NFS4_OPAQUE_LIMIT))

From RFC 8276, Section 8.4.3.2:

   /// struct LISTXATTRS4resok {
   ///         nfs_cookie4    lxr_cookie;
   ///         xattrkey4      lxr_names<>;
   ///         bool           lxr_eof;
   /// };

The decode_listxattrs_maxsz macro defines the maximum size of
the /fixed portion/ of the LISTXATTRS reply. That is the operation
status code, the cookie, and the EOF flag. maxsz has an extra
"+ 1" for rpc_prepare_reply_pages() to deal with possible XDR
padding. The post-6c2190b3fcbc value of this macro is already
correct, and removing the "+ 1" is wrong.

In addition, the variable portion of the result is an unbounded
array of component4 fields, which has nothing to do with
NFS4_OPAQUE_LIMIT, so that's just an arbitrary constant here with
no justification.

Adding more space to the receive buffer happens to help the case
where there are no xattrs to list. That doesn't mean it's the
correct solution in general. It certainly won't be sufficient to
handle an xattr name array that is larger than 1024 bytes.


The client manages the variable portion of that result in the
reply's pages array, which is set up by nfs4_xdr_enc_listxattrs().
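
For reference, the relevant setup in nfs4_xdr_enc_listxattrs() looks
roughly like this (paraphrased from the source, so treat it as a
sketch rather than a verbatim quote):

	rpc_prepare_reply_pages(req, args->xattr_pages, 0, args->count,
				hdr.replen);
	req->rq_rcv_buf.flags |= XDRBUF_SPARSE_PAGES;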

Further:

> rpcrdma_complete_rqst+0x447

is in the paragraph of rpcrdma_inline_fixup() that copies received
bytes into the receive xdr_buf's pages array.

The address "0005088000000000" is bogus. Since
nfs4_xdr_enc_listxattrs() sets XDRBUF_SPARSE_PAGES, it's likely
it is relying on the transport to allocate pages for this buffer,
and possibly that page allocation has failed or has a bug.

Please determine why the encode side has not set up the correct
number of pages to handle the LISTXATTRS result. Until then I
have to NACK this one.


> #define encode_removexattr_maxsz (op_encode_hdr_maxsz + 1 + \
> 				  nfs4_xattr_name_maxsz)
> #define decode_removexattr_maxsz (op_decode_hdr_maxsz + \
> -- 
> 1.8.3.1

--
Chuck Lever
Olga Kornievskaia Nov. 18, 2020, 9:44 p.m. UTC | #2
Hi Chuck,

The first problem I found from 5.10-rc3 testing was that the tail's
iov_len was set to -4 and reply_chunk was trying to call
rpcrdma_convert_kvec() for a tail that didn't exist.

Do you see issues with this fix?

diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
index 71e03b930b70..2e6a228abb95 100644
--- a/net/sunrpc/xdr.c
+++ b/net/sunrpc/xdr.c
@@ -193,7 +193,7 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned int offset,

        tail->iov_base = buf + offset;
        tail->iov_len = buflen - offset;
-       if ((xdr->page_len & 3) == 0)
+       if ((xdr->page_len & 3) == 0 && tail->iov_len)
                tail->iov_len -= sizeof(__be32);

        xdr->buflen += len;

This one fixes the KASAN-reported problem added at the end of this message.

The next problem that I can't figure out yet is another KASAN report
during the completion of the request.

[   99.610666] BUG: KASAN: wild-memory-access in
rpcrdma_complete_rqst+0x41b/0x680 [rpcrdma]
[   99.617947] Write of size 4 at addr 0005088000000000 by task kworker/1:1H/490


This is the KASAN report for the negative tail problem:
[  665.767611] ==================================================================
[  665.772202] BUG: KASAN: slab-out-of-bounds in
rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
[  665.777860] Write of size 8 at addr ffff88803ded9b70 by task fsstress/3123
[  665.783754]
[  665.784981] CPU: 0 PID: 3123 Comm: fsstress Not tainted 5.10.0-rc3+ #38
[  665.790534] Hardware name: VMware, Inc. VMware Virtual
Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
[  665.798538] Call Trace:
[  665.800398]  dump_stack+0x7c/0xa2
[  665.802647]  ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
[  665.808145]  print_address_description.constprop.7+0x1e/0x230
[  665.812543]  ? record_print_text.cold.38+0x11/0x11
[  665.816093]  ? _raw_write_lock_irqsave+0xe0/0xe0
[  665.819257]  ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
[  665.823368]  ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
[  665.827837]  kasan_report.cold.9+0x37/0x7c
[  665.830076]  ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
[  665.834066]  rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
[  665.837342]  rpcrdma_convert_iovs.isra.30.cold.41+0x9b/0xa0 [rpcrdma]
[  665.841766]  ? decode_read_list+0x40/0x40 [rpcrdma]
[  665.845063]  ? _raw_spin_lock_irqsave+0x80/0xe0
[  665.848444]  ? xdr_reserve_space+0x12e/0x360 [sunrpc]
[  665.852186]  ? xdr_init_encode+0x104/0x130 [sunrpc]
[  665.855305]  rpcrdma_marshal_req.cold.43+0x39/0x1fa [rpcrdma]
[  665.859771]  ? _raw_spin_lock+0x7a/0xd0
[  665.862516]  ? rpcrdma_prepare_send_sges+0x7e0/0x7e0 [rpcrdma]
[  665.866533]  ? call_bind+0x60/0xf0 [sunrpc]
[  665.869382]  xprt_rdma_send_request+0x79/0x190 [rpcrdma]
[  665.872965]  xprt_transmit+0x2ae/0x6c0 [sunrpc]
[  665.875955]  ? call_bind+0xf0/0xf0 [sunrpc]
[  665.878372]  call_transmit+0xdd/0x110 [sunrpc]
[  665.881539]  ? call_bind+0xf0/0xf0 [sunrpc]
[  665.884629]  __rpc_execute+0x11c/0x6e0 [sunrpc]
[  665.888300]  ? trace_event_raw_event_xprt_cong_event+0x270/0x270 [sunrpc]
[  665.893935]  ? rpc_make_runnable+0x54/0xe0 [sunrpc]
[  665.898021]  rpc_run_task+0x29c/0x2c0 [sunrpc]
[  665.901142]  nfs4_call_sync_custom+0xc/0x40 [nfsv4]
[  665.904903]  nfs4_do_call_sync+0x114/0x160 [nfsv4]
[  665.908633]  ? nfs4_call_sync_custom+0x40/0x40 [nfsv4]
[  665.913094]  ? __alloc_pages_nodemask+0x200/0x410
[  665.916831]  ? kasan_unpoison_shadow+0x30/0x40
[  665.920393]  ? __kasan_kmalloc.constprop.8+0xc1/0xd0
[  665.924403]  _nfs42_proc_listxattrs+0x1f6/0x2f0 [nfsv4]
[  665.928552]  ? kasan_set_free_info+0x1b/0x30
[  665.932283]  ? nfs42_offload_cancel_done+0x50/0x50 [nfsv4]
[  665.936240]  ? _raw_spin_lock+0x7a/0xd0
[  665.938677]  nfs42_proc_listxattrs+0xf4/0x150 [nfsv4]
[  665.942532]  ? nfs42_proc_setxattr+0x150/0x150 [nfsv4]
[  665.946410]  ? nfs4_xattr_cache_list+0x21/0x120 [nfsv4]
[  665.950095]  nfs4_listxattr+0x34d/0x3d0 [nfsv4]
[  665.952951]  ? _nfs4_proc_access+0x260/0x260 [nfsv4]
[  665.956383]  ? __ia32_sys_rename+0x40/0x40
[  665.959559]  ? __ia32_sys_lstat+0x30/0x30
[  665.962519]  ? __check_object_size+0x178/0x220
[  665.965830]  ? strncpy_from_user+0xe9/0x230
[  665.968401]  ? security_inode_listxattr+0x20/0x60
[  665.971653]  listxattr+0xd1/0xf0
[  665.974065]  path_listxattr+0xa1/0x100
[  665.977023]  ? listxattr+0xf0/0xf0
[  665.979305]  do_syscall_64+0x33/0x40
[  665.981561]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  665.985559] RIP: 0033:0x7fe3f5a1fc8b
[  665.988136] Code: f0 ff ff 73 01 c3 48 8b 0d fa 21 2c 00 f7 d8 64
89 01 48 83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 c2 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cd 21 2c 00 f7 d8 64 89
01 48
[  666.002248] RSP: 002b:00007ffc826128e8 EFLAGS: 00000206 ORIG_RAX:
00000000000000c2
[  666.007760] RAX: ffffffffffffffda RBX: 0000000000000021 RCX: 00007fe3f5a1fc8b
[  666.012999] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000016eaf70
[  666.018963] RBP: 00000000000001f4 R08: 0000000000000000 R09: 00007ffc82612537
[  666.025553] R10: 0000000000000004 R11: 0000000000000206 R12: 0000000000000021
[  666.030600] R13: 0000000000403e60 R14: 0000000000000000 R15: 0000000000000000
[  666.035780]
[  666.036906] Allocated by task 2837:
[  666.039447]  kasan_save_stack+0x19/0x40
[  666.042815]  __kasan_kmalloc.constprop.8+0xc1/0xd0
[  666.046626]  rpcrdma_req_create+0x58/0x1f0 [rpcrdma]
[  666.050283]  rpcrdma_buffer_create+0x217/0x270 [rpcrdma]
[  666.053727]  xprt_setup_rdma+0x1a3/0x2c0 [rpcrdma]
[  666.057287]  xprt_create_transport+0xc7/0x300 [sunrpc]
[  666.061656]  rpc_create+0x185/0x360 [sunrpc]
[  666.064803]  nfs_create_rpc_client+0x2d9/0x350 [nfs]
[  666.068233]  nfs4_init_client+0x111/0x3d0 [nfsv4]
[  666.071563]  nfs4_set_client+0x18c/0x2b0 [nfsv4]
[  666.075287]  nfs4_create_server+0x303/0x590 [nfsv4]
[  666.079563]  nfs4_try_get_tree+0x60/0xe0 [nfsv4]
[  666.082835]  vfs_get_tree+0x45/0x120
[  666.085165]  path_mount+0x8da/0xcc0
[  666.087352]  do_mount+0xcb/0xf0
[  666.089377]  __x64_sys_mount+0xf4/0x110
[  666.091894]  do_syscall_64+0x33/0x40
[  666.094215]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  666.097181]
[  666.098548] The buggy address belongs to the object at ffff88803ded8000
[  666.098548]  which belongs to the cache kmalloc-8k of size 8192
[  666.108266] The buggy address is located 7024 bytes inside of
[  666.108266]  8192-byte region [ffff88803ded8000, ffff88803deda000)
[  666.115709] The buggy address belongs to the page:
[  666.118516] page:00000000e08a579f refcount:1 mapcount:0
mapping:0000000000000000 index:0x0 pfn:0x3ded8
[  666.124078] head:00000000e08a579f order:3 compound_mapcount:0
compound_pincount:0
[  666.130104] flags: 0xfffffc0010200(slab|head)
[  666.133692] raw: 000fffffc0010200 dead000000000100 dead000000000122
ffff88800104ee40
[  666.139286] raw: 0000000000000000 0000000000020002 00000001ffffffff
0000000000000000
[  666.145052] page dumped because: kasan: bad access detected
[  666.149341]
[  666.150408] Memory state around the buggy address:

On Fri, Nov 13, 2020 at 3:34 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> Hi Olga-
>
> > On Nov 13, 2020, at 2:08 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> >
> > From: Olga Kornievskaia <kolga@netapp.com>
> >
> > xfstest generic/013 over on a NFSoRDMA over SoftRoCE or iWarp panics
> > and running with KASAN reports:
>
> There is still only a highly circumstantial connection between
> soft RoCE and iWarp and these crashes, so this description seems
> misleading and/or pre-mature.
>
>
> > [  216.018711] BUG: KASAN: wild-memory-access in rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
> > [  216.024195] Write of size 12 at addr 0005088000000000 by task kworker/1:1H/480
> > [  216.028820]
> > [  216.029776] CPU: 1 PID: 480 Comm: kworker/1:1H Not tainted 5.8.0-rc5+ #37
> > [  216.034247] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
> > [  216.040604] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
> > [  216.043739] Call Trace:
> > [  216.045014]  dump_stack+0x7c/0xb0
> > [  216.046757]  ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
> > [  216.050008]  ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
> > [  216.053091]  kasan_report.cold.10+0x6a/0x85
> > [  216.055703]  ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
> > [  216.058979]  check_memory_region+0x183/0x1e0
> > [  216.061933]  memcpy+0x38/0x60
> > [  216.064077]  rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
> > [  216.067502]  ? rpcrdma_reset_cwnd+0x70/0x70 [rpcrdma]
> > [  216.070268]  ? recalibrate_cpu_khz+0x10/0x10
> > [  216.072585]  ? rpcrdma_reply_handler+0x604/0x6e0 [rpcrdma]
> > [  216.075469]  __ib_process_cq+0xa7/0x220 [ib_core]
> > [  216.078077]  ib_cq_poll_work+0x31/0xb0 [ib_core]
> > [  216.080451]  process_one_work+0x387/0x6c0
> > [  216.082498]  worker_thread+0x57/0x5a0
> > [  216.084425]  ? process_one_work+0x6c0/0x6c0
> > [  216.086583]  kthread+0x1ca/0x200
> > [  216.088775]  ? kthread_create_on_node+0xc0/0xc0
> > [  216.091847]  ret_from_fork+0x22/0x30
> >
> > Fixes: 6c2190b3fcbc ("NFS: Fix listxattr receive buffer size")
> > Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> > ---
> > fs/nfs/nfs42xdr.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/nfs/nfs42xdr.c b/fs/nfs/nfs42xdr.c
> > index 6e060a8..e88bc7a 100644
> > --- a/fs/nfs/nfs42xdr.c
> > +++ b/fs/nfs/nfs42xdr.c
> > @@ -196,7 +196,8 @@
> >                                1 + nfs4_xattr_name_maxsz + 1)
> > #define decode_setxattr_maxsz   (op_decode_hdr_maxsz + decode_change_info_maxsz)
> > #define encode_listxattrs_maxsz  (op_encode_hdr_maxsz + 2 + 1)
> > -#define decode_listxattrs_maxsz  (op_decode_hdr_maxsz + 2 + 1 + 1 + 1)
> > +#define decode_listxattrs_maxsz  (op_decode_hdr_maxsz + 2 + 1 + 1 + \
> > +                               XDR_QUADLEN(NFS4_OPAQUE_LIMIT))
>
> From RFC 8276, Section 8.4.3.2:
>
>    /// struct LISTXATTRS4resok {
>    ///         nfs_cookie4    lxr_cookie;
>    ///         xattrkey4      lxr_names<>;
>    ///         bool           lxr_eof;
>    /// };
>
> The decode_listxattrs_maxsz macro defines the maximum size of
> the /fixed portion/ of the LISTXATTRS reply. That is the operation
> status code, the cookie, and the EOF flag. maxsz has an extra
> "+ 1" for rpc_prepare_reply_pages() to deal with possible XDR
> padding. The post-6c2190b3fcbc value of this macro is already
> correct, and removing the "+ 1" is wrong.
>
> In addition, the variable portion of the result is an unbounded
> array of component4 fields, nothing to do with NFS4_OPAQUE_LIMIT,
> so that's just an arbitrary constant here with no justification.
>
> Adding more space to the receive buffer happens to help the case
> where there are no xattrs to list. That doesn't mean its the
> correct solution in general. It certainly won't be sufficient to
> handle an xattr name array that is larger than 1024 bytes.
>
>
> The client manages the variable portion of that result in the
> reply's pages array, which is set up by nfs4_xdr_enc_listxattrs().
>
> Further:
>
> > rpcrdma_complete_rqst+0x447
>
> is in the paragraph of rpcrdma_inline_fixup() that copies received
> bytes into the receive xdr_buf's pages array.
>
> The address "0005088000000000" is bogus. Since
> nfs4_xdr_enc_listxattrs() sets XDRBUF_SPARSE_PAGES, it's likely
> it is relying on the transport to allocate pages for this buffer,
> and possibly that page allocation has failed or has a bug.
>
> Please determine why the encode side has not set up the correct
> number of pages to handle the LISTXATTRS result. Until then I
> have to NACK this one.
>
>
> > #define encode_removexattr_maxsz (op_encode_hdr_maxsz + 1 + \
> >                                 nfs4_xattr_name_maxsz)
> > #define decode_removexattr_maxsz (op_decode_hdr_maxsz + \
> > --
> > 1.8.3.1
>
> --
> Chuck Lever
>
>
>
Trond Myklebust Nov. 18, 2020, 10:16 p.m. UTC | #3
On Wed, 2020-11-18 at 16:44 -0500, Olga Kornievskaia wrote:
> Hi Chuck,
> 
> The first problem I found was from 5.10-rc3 testing was from the fact
> that tail's iov_len was set to -4 and reply_chunk was trying to call
> rpcrdma_convert_kvec() for a tail that didn't exist.
> 
> Do you see issues with this fix?
> 
> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
> index 71e03b930b70..2e6a228abb95 100644
> --- a/net/sunrpc/xdr.c
> +++ b/net/sunrpc/xdr.c
> @@ -193,7 +193,7 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned
> int offset,
> 
>         tail->iov_base = buf + offset;
>         tail->iov_len = buflen - offset;
> -       if ((xdr->page_len & 3) == 0)
> +       if ((xdr->page_len & 3) == 0 && tail->iov_len)
>                 tail->iov_len -= sizeof(__be32);
> 
>         xdr->buflen += len;
> 
> This one fixes KASAN's reported problem of added at the end of this
> message.

By the way, I want to get rid of this kind of unnecessary copy. The
padding is basically just a bunch of '\0's, so it can stay in the page
without any problems (provided the page has space).

The fix is simple: all that the callers of xdr_inline_pages() need to
do is use a size that is wrapped by xdr_align_size().
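
Applied to the listxattrs encoder, that would mean something along
these lines (an untested sketch, not a proposed patch):

	rpc_prepare_reply_pages(req, args->xattr_pages, 0,
				xdr_align_size(args->count), hdr.replen);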

> 
> The next problem that I can't figure out yet is another KASAN's
> report
> during the completion of the request.
> 
> [   99.610666] BUG: KASAN: wild-memory-access in
> rpcrdma_complete_rqst+0x41b/0x680 [rpcrdma]
> [   99.617947] Write of size 4 at addr 0005088000000000 by task
> kworker/1:1H/490
> 
> 
> This is the KASAN for the negative tail problem:
> [  665.767611]
> ==================================================================
> [  665.772202] BUG: KASAN: slab-out-of-bounds in
> rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> [  665.777860] Write of size 8 at addr ffff88803ded9b70 by task
> fsstress/3123
> [  665.783754]
> [  665.784981] CPU: 0 PID: 3123 Comm: fsstress Not tainted 5.10.0-
> rc3+ #38
> [  665.790534] Hardware name: VMware, Inc. VMware Virtual
> Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
> [  665.798538] Call Trace:
> [  665.800398]  dump_stack+0x7c/0xa2
> [  665.802647]  ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> [  665.808145]  print_address_description.constprop.7+0x1e/0x230
> [  665.812543]  ? record_print_text.cold.38+0x11/0x11
> [  665.816093]  ? _raw_write_lock_irqsave+0xe0/0xe0
> [  665.819257]  ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> [  665.823368]  ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> [  665.827837]  kasan_report.cold.9+0x37/0x7c
> [  665.830076]  ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> [  665.834066]  rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> [  665.837342]  rpcrdma_convert_iovs.isra.30.cold.41+0x9b/0xa0
> [rpcrdma]
> [  665.841766]  ? decode_read_list+0x40/0x40 [rpcrdma]
> [  665.845063]  ? _raw_spin_lock_irqsave+0x80/0xe0
> [  665.848444]  ? xdr_reserve_space+0x12e/0x360 [sunrpc]
> [  665.852186]  ? xdr_init_encode+0x104/0x130 [sunrpc]
> [  665.855305]  rpcrdma_marshal_req.cold.43+0x39/0x1fa [rpcrdma]
> [  665.859771]  ? _raw_spin_lock+0x7a/0xd0
> [  665.862516]  ? rpcrdma_prepare_send_sges+0x7e0/0x7e0 [rpcrdma]
> [  665.866533]  ? call_bind+0x60/0xf0 [sunrpc]
> [  665.869382]  xprt_rdma_send_request+0x79/0x190 [rpcrdma]
> [  665.872965]  xprt_transmit+0x2ae/0x6c0 [sunrpc]
> [  665.875955]  ? call_bind+0xf0/0xf0 [sunrpc]
> [  665.878372]  call_transmit+0xdd/0x110 [sunrpc]
> [  665.881539]  ? call_bind+0xf0/0xf0 [sunrpc]
> [  665.884629]  __rpc_execute+0x11c/0x6e0 [sunrpc]
> [  665.888300]  ? trace_event_raw_event_xprt_cong_event+0x270/0x270
> [sunrpc]
> [  665.893935]  ? rpc_make_runnable+0x54/0xe0 [sunrpc]
> [  665.898021]  rpc_run_task+0x29c/0x2c0 [sunrpc]
> [  665.901142]  nfs4_call_sync_custom+0xc/0x40 [nfsv4]
> [  665.904903]  nfs4_do_call_sync+0x114/0x160 [nfsv4]
> [  665.908633]  ? nfs4_call_sync_custom+0x40/0x40 [nfsv4]
> [  665.913094]  ? __alloc_pages_nodemask+0x200/0x410
> [  665.916831]  ? kasan_unpoison_shadow+0x30/0x40
> [  665.920393]  ? __kasan_kmalloc.constprop.8+0xc1/0xd0
> [  665.924403]  _nfs42_proc_listxattrs+0x1f6/0x2f0 [nfsv4]
> [  665.928552]  ? kasan_set_free_info+0x1b/0x30
> [  665.932283]  ? nfs42_offload_cancel_done+0x50/0x50 [nfsv4]
> [  665.936240]  ? _raw_spin_lock+0x7a/0xd0
> [  665.938677]  nfs42_proc_listxattrs+0xf4/0x150 [nfsv4]
> [  665.942532]  ? nfs42_proc_setxattr+0x150/0x150 [nfsv4]
> [  665.946410]  ? nfs4_xattr_cache_list+0x21/0x120 [nfsv4]
> [  665.950095]  nfs4_listxattr+0x34d/0x3d0 [nfsv4]
> [  665.952951]  ? _nfs4_proc_access+0x260/0x260 [nfsv4]
> [  665.956383]  ? __ia32_sys_rename+0x40/0x40
> [  665.959559]  ? __ia32_sys_lstat+0x30/0x30
> [  665.962519]  ? __check_object_size+0x178/0x220
> [  665.965830]  ? strncpy_from_user+0xe9/0x230
> [  665.968401]  ? security_inode_listxattr+0x20/0x60
> [  665.971653]  listxattr+0xd1/0xf0
> [  665.974065]  path_listxattr+0xa1/0x100
> [  665.977023]  ? listxattr+0xf0/0xf0
> [  665.979305]  do_syscall_64+0x33/0x40
> [  665.981561]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  665.985559] RIP: 0033:0x7fe3f5a1fc8b
> [  665.988136] Code: f0 ff ff 73 01 c3 48 8b 0d fa 21 2c 00 f7 d8 64
> 89 01 48 83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 c2 00 00
> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cd 21 2c 00 f7 d8 64
> 89
> 01 48
> [  666.002248] RSP: 002b:00007ffc826128e8 EFLAGS: 00000206 ORIG_RAX:
> 00000000000000c2
> [  666.007760] RAX: ffffffffffffffda RBX: 0000000000000021 RCX:
> 00007fe3f5a1fc8b
> [  666.012999] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> 00000000016eaf70
> [  666.018963] RBP: 00000000000001f4 R08: 0000000000000000 R09:
> 00007ffc82612537
> [  666.025553] R10: 0000000000000004 R11: 0000000000000206 R12:
> 0000000000000021
> [  666.030600] R13: 0000000000403e60 R14: 0000000000000000 R15:
> 0000000000000000
> [  666.035780]
> [  666.036906] Allocated by task 2837:
> [  666.039447]  kasan_save_stack+0x19/0x40
> [  666.042815]  __kasan_kmalloc.constprop.8+0xc1/0xd0
> [  666.046626]  rpcrdma_req_create+0x58/0x1f0 [rpcrdma]
> [  666.050283]  rpcrdma_buffer_create+0x217/0x270 [rpcrdma]
> [  666.053727]  xprt_setup_rdma+0x1a3/0x2c0 [rpcrdma]
> [  666.057287]  xprt_create_transport+0xc7/0x300 [sunrpc]
> [  666.061656]  rpc_create+0x185/0x360 [sunrpc]
> [  666.064803]  nfs_create_rpc_client+0x2d9/0x350 [nfs]
> [  666.068233]  nfs4_init_client+0x111/0x3d0 [nfsv4]
> [  666.071563]  nfs4_set_client+0x18c/0x2b0 [nfsv4]
> [  666.075287]  nfs4_create_server+0x303/0x590 [nfsv4]
> [  666.079563]  nfs4_try_get_tree+0x60/0xe0 [nfsv4]
> [  666.082835]  vfs_get_tree+0x45/0x120
> [  666.085165]  path_mount+0x8da/0xcc0
> [  666.087352]  do_mount+0xcb/0xf0
> [  666.089377]  __x64_sys_mount+0xf4/0x110
> [  666.091894]  do_syscall_64+0x33/0x40
> [  666.094215]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  666.097181]
> [  666.098548] The buggy address belongs to the object at
> ffff88803ded8000
> [  666.098548]  which belongs to the cache kmalloc-8k of size 8192
> [  666.108266] The buggy address is located 7024 bytes inside of
> [  666.108266]  8192-byte region [ffff88803ded8000, ffff88803deda000)
> [  666.115709] The buggy address belongs to the page:
> [  666.118516] page:00000000e08a579f refcount:1 mapcount:0
> mapping:0000000000000000 index:0x0 pfn:0x3ded8
> [  666.124078] head:00000000e08a579f order:3 compound_mapcount:0
> compound_pincount:0
> [  666.130104] flags: 0xfffffc0010200(slab|head)
> [  666.133692] raw: 000fffffc0010200 dead000000000100
> dead000000000122
> ffff88800104ee40
> [  666.139286] raw: 0000000000000000 0000000000020002
> 00000001ffffffff
> 0000000000000000
> [  666.145052] page dumped because: kasan: bad access detected
> [  666.149341]
> [  666.150408] Memory state around the buggy address:
> 
> On Fri, Nov 13, 2020 at 3:34 PM Chuck Lever <chuck.lever@oracle.com>
> wrote:
> > 
> > Hi Olga-
> > 
> > > On Nov 13, 2020, at 2:08 PM, Olga Kornievskaia <
> > > olga.kornievskaia@gmail.com> wrote:
> > > 
> > > From: Olga Kornievskaia <kolga@netapp.com>
> > > 
> > > xfstest generic/013 over on a NFSoRDMA over SoftRoCE or iWarp
> > > panics
> > > and running with KASAN reports:
> > 
> > There is still only a highly circumstantial connection between
> > soft RoCE and iWarp and these crashes, so this description seems
> > misleading and/or pre-mature.
> > 
> > 
> > > [  216.018711] BUG: KASAN: wild-memory-access in
> > > rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
> > > [  216.024195] Write of size 12 at addr 0005088000000000 by task
> > > kworker/1:1H/480
> > > [  216.028820]
> > > [  216.029776] CPU: 1 PID: 480 Comm: kworker/1:1H Not tainted
> > > 5.8.0-rc5+ #37
> > > [  216.034247] Hardware name: VMware, Inc. VMware Virtual
> > > Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
> > > [  216.040604] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
> > > [  216.043739] Call Trace:
> > > [  216.045014]  dump_stack+0x7c/0xb0
> > > [  216.046757]  ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
> > > [  216.050008]  ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
> > > [  216.053091]  kasan_report.cold.10+0x6a/0x85
> > > [  216.055703]  ? rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
> > > [  216.058979]  check_memory_region+0x183/0x1e0
> > > [  216.061933]  memcpy+0x38/0x60
> > > [  216.064077]  rpcrdma_complete_rqst+0x447/0x6e0 [rpcrdma]
> > > [  216.067502]  ? rpcrdma_reset_cwnd+0x70/0x70 [rpcrdma]
> > > [  216.070268]  ? recalibrate_cpu_khz+0x10/0x10
> > > [  216.072585]  ? rpcrdma_reply_handler+0x604/0x6e0 [rpcrdma]
> > > [  216.075469]  __ib_process_cq+0xa7/0x220 [ib_core]
> > > [  216.078077]  ib_cq_poll_work+0x31/0xb0 [ib_core]
> > > [  216.080451]  process_one_work+0x387/0x6c0
> > > [  216.082498]  worker_thread+0x57/0x5a0
> > > [  216.084425]  ? process_one_work+0x6c0/0x6c0
> > > [  216.086583]  kthread+0x1ca/0x200
> > > [  216.088775]  ? kthread_create_on_node+0xc0/0xc0
> > > [  216.091847]  ret_from_fork+0x22/0x30
> > > 
> > > Fixes: 6c2190b3fcbc ("NFS: Fix listxattr receive buffer size")
> > > Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> > > ---
> > > fs/nfs/nfs42xdr.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/nfs/nfs42xdr.c b/fs/nfs/nfs42xdr.c
> > > index 6e060a8..e88bc7a 100644
> > > --- a/fs/nfs/nfs42xdr.c
> > > +++ b/fs/nfs/nfs42xdr.c
> > > @@ -196,7 +196,8 @@
> > >                                1 + nfs4_xattr_name_maxsz + 1)
> > > #define decode_setxattr_maxsz   (op_decode_hdr_maxsz +
> > > decode_change_info_maxsz)
> > > #define encode_listxattrs_maxsz  (op_encode_hdr_maxsz + 2 + 1)
> > > -#define decode_listxattrs_maxsz  (op_decode_hdr_maxsz + 2 + 1 +
> > > 1 + 1)
> > > +#define decode_listxattrs_maxsz  (op_decode_hdr_maxsz + 2 + 1 +
> > > 1 + \
> > > +                               XDR_QUADLEN(NFS4_OPAQUE_LIMIT))
> > 
> > From RFC 8276, Section 8.4.3.2:
> > 
> >    /// struct LISTXATTRS4resok {
> >    ///         nfs_cookie4    lxr_cookie;
> >    ///         xattrkey4      lxr_names<>;
> >    ///         bool           lxr_eof;
> >    /// };
> > 
> > The decode_listxattrs_maxsz macro defines the maximum size of
> > the /fixed portion/ of the LISTXATTRS reply. That is the operation
> > status code, the cookie, and the EOF flag. maxsz has an extra
> > "+ 1" for rpc_prepare_reply_pages() to deal with possible XDR
> > padding. The post-6c2190b3fcbc value of this macro is already
> > correct, and removing the "+ 1" is wrong.
> > 
> > In addition, the variable portion of the result is an unbounded
> > array of component4 fields, nothing to do with NFS4_OPAQUE_LIMIT,
> > so that's just an arbitrary constant here with no justification.
> > 
> > Adding more space to the receive buffer happens to help the case
> > where there are no xattrs to list. That doesn't mean its the
> > correct solution in general. It certainly won't be sufficient to
> > handle an xattr name array that is larger than 1024 bytes.
> > 
> > 
> > The client manages the variable portion of that result in the
> > reply's pages array, which is set up by nfs4_xdr_enc_listxattrs().
> > 
> > Further:
> > 
> > > rpcrdma_complete_rqst+0x447
> > 
> > is in the paragraph of rpcrdma_inline_fixup() that copies received
> > bytes into the receive xdr_buf's pages array.
> > 
> > The address "0005088000000000" is bogus. Since
> > nfs4_xdr_enc_listxattrs() sets XDRBUF_SPARSE_PAGES, it's likely
> > it is relying on the transport to allocate pages for this buffer,
> > and possibly that page allocation has failed or has a bug.
> > 
> > Please determine why the encode side has not set up the correct
> > number of pages to handle the LISTXATTRS result. Until then I
> > have to NACK this one.
> > 
> > 
> > > #define encode_removexattr_maxsz (op_encode_hdr_maxsz + 1 + \
> > >                                 nfs4_xattr_name_maxsz)
> > > #define decode_removexattr_maxsz (op_decode_hdr_maxsz + \
> > > --
> > > 1.8.3.1
> > 
> > --
> > Chuck Lever
> > 
> > 
> >
Chuck Lever III Nov. 19, 2020, 2:37 p.m. UTC | #4
Hi Olga-

> On Nov 18, 2020, at 4:44 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> 
> Hi Chuck,
> 
> The first problem I found was from 5.10-rc3 testing was from the fact
> that tail's iov_len was set to -4 and reply_chunk was trying to call
> rpcrdma_convert_kvec() for a tail that didn't exist.
> 
> Do you see issues with this fix?
> 
> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
> index 71e03b930b70..2e6a228abb95 100644
> --- a/net/sunrpc/xdr.c
> +++ b/net/sunrpc/xdr.c
> @@ -193,7 +193,7 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned int offset,
> 
>        tail->iov_base = buf + offset;
>        tail->iov_len = buflen - offset;
> -       if ((xdr->page_len & 3) == 0)
> +       if ((xdr->page_len & 3) == 0 && tail->iov_len)
>                tail->iov_len -= sizeof(__be32);
> 
>        xdr->buflen += len;

It's not clear to me whether only the listxattrs encoder is
not providing a receive tail kvec, or whether all the encoders
fail to provide a tail in this case.


> This one fixes KASAN's reported problem of added at the end of this message.
> 
> The next problem that I can't figure out yet is another KASAN's report
> during the completion of the request.
> 
> [   99.610666] BUG: KASAN: wild-memory-access in
> rpcrdma_complete_rqst+0x41b/0x680 [rpcrdma]
> [   99.617947] Write of size 4 at addr 0005088000000000 by task kworker/1:1H/490
> 
> 
> This is the KASAN for the negative tail problem:
> [  665.767611] ==================================================================
> [  665.772202] BUG: KASAN: slab-out-of-bounds in
> rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> [  665.777860] Write of size 8 at addr ffff88803ded9b70 by task fsstress/3123
> [  665.783754]
> [  665.784981] CPU: 0 PID: 3123 Comm: fsstress Not tainted 5.10.0-rc3+ #38
> [  665.790534] Hardware name: VMware, Inc. VMware Virtual
> Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
> [  665.798538] Call Trace:
> [  665.800398]  dump_stack+0x7c/0xa2
> [  665.802647]  ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> [  665.808145]  print_address_description.constprop.7+0x1e/0x230
> [  665.812543]  ? record_print_text.cold.38+0x11/0x11
> [  665.816093]  ? _raw_write_lock_irqsave+0xe0/0xe0
> [  665.819257]  ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> [  665.823368]  ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> [  665.827837]  kasan_report.cold.9+0x37/0x7c
> [  665.830076]  ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> [  665.834066]  rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> [  665.837342]  rpcrdma_convert_iovs.isra.30.cold.41+0x9b/0xa0 [rpcrdma]
> [  665.841766]  ? decode_read_list+0x40/0x40 [rpcrdma]
> [  665.845063]  ? _raw_spin_lock_irqsave+0x80/0xe0
> [  665.848444]  ? xdr_reserve_space+0x12e/0x360 [sunrpc]
> [  665.852186]  ? xdr_init_encode+0x104/0x130 [sunrpc]
> [  665.855305]  rpcrdma_marshal_req.cold.43+0x39/0x1fa [rpcrdma]
> [  665.859771]  ? _raw_spin_lock+0x7a/0xd0
> [  665.862516]  ? rpcrdma_prepare_send_sges+0x7e0/0x7e0 [rpcrdma]
> [  665.866533]  ? call_bind+0x60/0xf0 [sunrpc]
> [  665.869382]  xprt_rdma_send_request+0x79/0x190 [rpcrdma]
> [  665.872965]  xprt_transmit+0x2ae/0x6c0 [sunrpc]
> [  665.875955]  ? call_bind+0xf0/0xf0 [sunrpc]
> [  665.878372]  call_transmit+0xdd/0x110 [sunrpc]
> [  665.881539]  ? call_bind+0xf0/0xf0 [sunrpc]
> [  665.884629]  __rpc_execute+0x11c/0x6e0 [sunrpc]
> [  665.888300]  ? trace_event_raw_event_xprt_cong_event+0x270/0x270 [sunrpc]
> [  665.893935]  ? rpc_make_runnable+0x54/0xe0 [sunrpc]
> [  665.898021]  rpc_run_task+0x29c/0x2c0 [sunrpc]
> [  665.901142]  nfs4_call_sync_custom+0xc/0x40 [nfsv4]
> [  665.904903]  nfs4_do_call_sync+0x114/0x160 [nfsv4]
> [  665.908633]  ? nfs4_call_sync_custom+0x40/0x40 [nfsv4]
> [  665.913094]  ? __alloc_pages_nodemask+0x200/0x410
> [  665.916831]  ? kasan_unpoison_shadow+0x30/0x40
> [  665.920393]  ? __kasan_kmalloc.constprop.8+0xc1/0xd0
> [  665.924403]  _nfs42_proc_listxattrs+0x1f6/0x2f0 [nfsv4]
> [  665.928552]  ? kasan_set_free_info+0x1b/0x30
> [  665.932283]  ? nfs42_offload_cancel_done+0x50/0x50 [nfsv4]
> [  665.936240]  ? _raw_spin_lock+0x7a/0xd0
> [  665.938677]  nfs42_proc_listxattrs+0xf4/0x150 [nfsv4]
> [  665.942532]  ? nfs42_proc_setxattr+0x150/0x150 [nfsv4]
> [  665.946410]  ? nfs4_xattr_cache_list+0x21/0x120 [nfsv4]
> [  665.950095]  nfs4_listxattr+0x34d/0x3d0 [nfsv4]
> [  665.952951]  ? _nfs4_proc_access+0x260/0x260 [nfsv4]
> [  665.956383]  ? __ia32_sys_rename+0x40/0x40
> [  665.959559]  ? __ia32_sys_lstat+0x30/0x30
> [  665.962519]  ? __check_object_size+0x178/0x220
> [  665.965830]  ? strncpy_from_user+0xe9/0x230
> [  665.968401]  ? security_inode_listxattr+0x20/0x60
> [  665.971653]  listxattr+0xd1/0xf0
> [  665.974065]  path_listxattr+0xa1/0x100
> [  665.977023]  ? listxattr+0xf0/0xf0
> [  665.979305]  do_syscall_64+0x33/0x40
> [  665.981561]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  665.985559] RIP: 0033:0x7fe3f5a1fc8b
> [  665.988136] Code: f0 ff ff 73 01 c3 48 8b 0d fa 21 2c 00 f7 d8 64
> 89 01 48 83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 c2 00 00
> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cd 21 2c 00 f7 d8 64 89
> 01 48
> [  666.002248] RSP: 002b:00007ffc826128e8 EFLAGS: 00000206 ORIG_RAX:
> 00000000000000c2
> [  666.007760] RAX: ffffffffffffffda RBX: 0000000000000021 RCX: 00007fe3f5a1fc8b
> [  666.012999] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000016eaf70
> [  666.018963] RBP: 00000000000001f4 R08: 0000000000000000 R09: 00007ffc82612537
> [  666.025553] R10: 0000000000000004 R11: 0000000000000206 R12: 0000000000000021
> [  666.030600] R13: 0000000000403e60 R14: 0000000000000000 R15: 0000000000000000
> [  666.035780]
> [  666.036906] Allocated by task 2837:
> [  666.039447]  kasan_save_stack+0x19/0x40
> [  666.042815]  __kasan_kmalloc.constprop.8+0xc1/0xd0
> [  666.046626]  rpcrdma_req_create+0x58/0x1f0 [rpcrdma]
> [  666.050283]  rpcrdma_buffer_create+0x217/0x270 [rpcrdma]
> [  666.053727]  xprt_setup_rdma+0x1a3/0x2c0 [rpcrdma]
> [  666.057287]  xprt_create_transport+0xc7/0x300 [sunrpc]
> [  666.061656]  rpc_create+0x185/0x360 [sunrpc]
> [  666.064803]  nfs_create_rpc_client+0x2d9/0x350 [nfs]
> [  666.068233]  nfs4_init_client+0x111/0x3d0 [nfsv4]
> [  666.071563]  nfs4_set_client+0x18c/0x2b0 [nfsv4]
> [  666.075287]  nfs4_create_server+0x303/0x590 [nfsv4]
> [  666.079563]  nfs4_try_get_tree+0x60/0xe0 [nfsv4]
> [  666.082835]  vfs_get_tree+0x45/0x120
> [  666.085165]  path_mount+0x8da/0xcc0
> [  666.087352]  do_mount+0xcb/0xf0
> [  666.089377]  __x64_sys_mount+0xf4/0x110
> [  666.091894]  do_syscall_64+0x33/0x40
> [  666.094215]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  666.097181]
> [  666.098548] The buggy address belongs to the object at ffff88803ded8000
> [  666.098548]  which belongs to the cache kmalloc-8k of size 8192
> [  666.108266] The buggy address is located 7024 bytes inside of
> [  666.108266]  8192-byte region [ffff88803ded8000, ffff88803deda000)
> [  666.115709] The buggy address belongs to the page:
> [  666.118516] page:00000000e08a579f refcount:1 mapcount:0
> mapping:0000000000000000 index:0x0 pfn:0x3ded8
> [  666.124078] head:00000000e08a579f order:3 compound_mapcount:0
> compound_pincount:0
> [  666.130104] flags: 0xfffffc0010200(slab|head)
> [  666.133692] raw: 000fffffc0010200 dead000000000100 dead000000000122
> ffff88800104ee40
> [  666.139286] raw: 0000000000000000 0000000000020002 00000001ffffffff
> 0000000000000000
> [  666.145052] page dumped because: kasan: bad access detected
> [  666.149341]
> [  666.150408] Memory state around the buggy address:

--
Chuck Lever
Olga Kornievskaia Nov. 19, 2020, 3:09 p.m. UTC | #5
On Thu, Nov 19, 2020 at 9:37 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> Hi Olga-
>
> > On Nov 18, 2020, at 4:44 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> >
> > Hi Chuck,
> >
> > The first problem I found was from 5.10-rc3 testing was from the fact
> > that tail's iov_len was set to -4 and reply_chunk was trying to call
> > rpcrdma_convert_kvec() for a tail that didn't exist.
> >
> > Do you see issues with this fix?
> >
> > diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
> > index 71e03b930b70..2e6a228abb95 100644
> > --- a/net/sunrpc/xdr.c
> > +++ b/net/sunrpc/xdr.c
> > @@ -193,7 +193,7 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned int offset,
> >
> >        tail->iov_base = buf + offset;
> >        tail->iov_len = buflen - offset;
> > -       if ((xdr->page_len & 3) == 0)
> > +       if ((xdr->page_len & 3) == 0 && tail->iov_len)
> >                tail->iov_len -= sizeof(__be32);
> >
> >        xdr->buflen += len;
>
> It's not clear to me whether only the listxattrs encoder is
> not providing a receive tail kvec, or whether all the encoders
> fail to provide a tail in this case.

There is nothing specific that listxattr does; it just calls
rpc_prepare_reply_pages(). Isn't tail[0] always there (the xdr_buf
structure has tail[1] defined)? But not all requests have data in the
pages, so as far as I understand, tail->iov_len can be 0 and not
checking for it is wrong.
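
(For context, this is the xdr_buf layout I mean, abbreviated from
include/linux/sunrpc/xdr.h:)

	struct xdr_buf {
		struct kvec	head[1],	/* RPC header + non-page data */
				tail[1];	/* appended after page data */
		...
		struct page **	pages;		/* array of contiguous pages */
		unsigned int	page_base,	/* start of page data */
				page_len;	/* length of page data */
		...
		unsigned int	buflen,		/* total length of storage buffer */
				len;		/* length of XDR encoded message */
	};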

> > This one fixes KASAN's reported problem of added at the end of this message.
> >
> > The next problem that I can't figure out yet is another KASAN's report
> > during the completion of the request.
> >
> > [   99.610666] BUG: KASAN: wild-memory-access in
> > rpcrdma_complete_rqst+0x41b/0x680 [rpcrdma]
> > [   99.617947] Write of size 4 at addr 0005088000000000 by task kworker/1:1H/490
> >
> >
> > This is the KASAN for the negative tail problem:
> > [  665.767611] ==================================================================
> > [  665.772202] BUG: KASAN: slab-out-of-bounds in
> > rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> > [  665.777860] Write of size 8 at addr ffff88803ded9b70 by task fsstress/3123
> > [  665.783754]
> > [  665.784981] CPU: 0 PID: 3123 Comm: fsstress Not tainted 5.10.0-rc3+ #38
> > [  665.790534] Hardware name: VMware, Inc. VMware Virtual
> > Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
> > [  665.798538] Call Trace:
> > [  665.800398]  dump_stack+0x7c/0xa2
> > [  665.802647]  ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> > [  665.808145]  print_address_description.constprop.7+0x1e/0x230
> > [  665.812543]  ? record_print_text.cold.38+0x11/0x11
> > [  665.816093]  ? _raw_write_lock_irqsave+0xe0/0xe0
> > [  665.819257]  ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> > [  665.823368]  ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> > [  665.827837]  kasan_report.cold.9+0x37/0x7c
> > [  665.830076]  ? rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> > [  665.834066]  rpcrdma_convert_kvec.isra.29+0x4a/0xc4 [rpcrdma]
> > [  665.837342]  rpcrdma_convert_iovs.isra.30.cold.41+0x9b/0xa0 [rpcrdma]
> > [  665.841766]  ? decode_read_list+0x40/0x40 [rpcrdma]
> > [  665.845063]  ? _raw_spin_lock_irqsave+0x80/0xe0
> > [  665.848444]  ? xdr_reserve_space+0x12e/0x360 [sunrpc]
> > [  665.852186]  ? xdr_init_encode+0x104/0x130 [sunrpc]
> > [  665.855305]  rpcrdma_marshal_req.cold.43+0x39/0x1fa [rpcrdma]
> > [  665.859771]  ? _raw_spin_lock+0x7a/0xd0
> > [  665.862516]  ? rpcrdma_prepare_send_sges+0x7e0/0x7e0 [rpcrdma]
> > [  665.866533]  ? call_bind+0x60/0xf0 [sunrpc]
> > [  665.869382]  xprt_rdma_send_request+0x79/0x190 [rpcrdma]
> > [  665.872965]  xprt_transmit+0x2ae/0x6c0 [sunrpc]
> > [  665.875955]  ? call_bind+0xf0/0xf0 [sunrpc]
> > [  665.878372]  call_transmit+0xdd/0x110 [sunrpc]
> > [  665.881539]  ? call_bind+0xf0/0xf0 [sunrpc]
> > [  665.884629]  __rpc_execute+0x11c/0x6e0 [sunrpc]
> > [  665.888300]  ? trace_event_raw_event_xprt_cong_event+0x270/0x270 [sunrpc]
> > [  665.893935]  ? rpc_make_runnable+0x54/0xe0 [sunrpc]
> > [  665.898021]  rpc_run_task+0x29c/0x2c0 [sunrpc]
> > [  665.901142]  nfs4_call_sync_custom+0xc/0x40 [nfsv4]
> > [  665.904903]  nfs4_do_call_sync+0x114/0x160 [nfsv4]
> > [  665.908633]  ? nfs4_call_sync_custom+0x40/0x40 [nfsv4]
> > [  665.913094]  ? __alloc_pages_nodemask+0x200/0x410
> > [  665.916831]  ? kasan_unpoison_shadow+0x30/0x40
> > [  665.920393]  ? __kasan_kmalloc.constprop.8+0xc1/0xd0
> > [  665.924403]  _nfs42_proc_listxattrs+0x1f6/0x2f0 [nfsv4]
> > [  665.928552]  ? kasan_set_free_info+0x1b/0x30
> > [  665.932283]  ? nfs42_offload_cancel_done+0x50/0x50 [nfsv4]
> > [  665.936240]  ? _raw_spin_lock+0x7a/0xd0
> > [  665.938677]  nfs42_proc_listxattrs+0xf4/0x150 [nfsv4]
> > [  665.942532]  ? nfs42_proc_setxattr+0x150/0x150 [nfsv4]
> > [  665.946410]  ? nfs4_xattr_cache_list+0x21/0x120 [nfsv4]
> > [  665.950095]  nfs4_listxattr+0x34d/0x3d0 [nfsv4]
> > [  665.952951]  ? _nfs4_proc_access+0x260/0x260 [nfsv4]
> > [  665.956383]  ? __ia32_sys_rename+0x40/0x40
> > [  665.959559]  ? __ia32_sys_lstat+0x30/0x30
> > [  665.962519]  ? __check_object_size+0x178/0x220
> > [  665.965830]  ? strncpy_from_user+0xe9/0x230
> > [  665.968401]  ? security_inode_listxattr+0x20/0x60
> > [  665.971653]  listxattr+0xd1/0xf0
> > [  665.974065]  path_listxattr+0xa1/0x100
> > [  665.977023]  ? listxattr+0xf0/0xf0
> > [  665.979305]  do_syscall_64+0x33/0x40
> > [  665.981561]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [  665.985559] RIP: 0033:0x7fe3f5a1fc8b
> > [  665.988136] Code: f0 ff ff 73 01 c3 48 8b 0d fa 21 2c 00 f7 d8 64
> > 89 01 48 83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 c2 00 00
> > 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cd 21 2c 00 f7 d8 64 89
> > 01 48
> > [  666.002248] RSP: 002b:00007ffc826128e8 EFLAGS: 00000206 ORIG_RAX:
> > 00000000000000c2
> > [  666.007760] RAX: ffffffffffffffda RBX: 0000000000000021 RCX: 00007fe3f5a1fc8b
> > [  666.012999] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000016eaf70
> > [  666.018963] RBP: 00000000000001f4 R08: 0000000000000000 R09: 00007ffc82612537
> > [  666.025553] R10: 0000000000000004 R11: 0000000000000206 R12: 0000000000000021
> > [  666.030600] R13: 0000000000403e60 R14: 0000000000000000 R15: 0000000000000000
> > [  666.035780]
> > [  666.036906] Allocated by task 2837:
> > [  666.039447]  kasan_save_stack+0x19/0x40
> > [  666.042815]  __kasan_kmalloc.constprop.8+0xc1/0xd0
> > [  666.046626]  rpcrdma_req_create+0x58/0x1f0 [rpcrdma]
> > [  666.050283]  rpcrdma_buffer_create+0x217/0x270 [rpcrdma]
> > [  666.053727]  xprt_setup_rdma+0x1a3/0x2c0 [rpcrdma]
> > [  666.057287]  xprt_create_transport+0xc7/0x300 [sunrpc]
> > [  666.061656]  rpc_create+0x185/0x360 [sunrpc]
> > [  666.064803]  nfs_create_rpc_client+0x2d9/0x350 [nfs]
> > [  666.068233]  nfs4_init_client+0x111/0x3d0 [nfsv4]
> > [  666.071563]  nfs4_set_client+0x18c/0x2b0 [nfsv4]
> > [  666.075287]  nfs4_create_server+0x303/0x590 [nfsv4]
> > [  666.079563]  nfs4_try_get_tree+0x60/0xe0 [nfsv4]
> > [  666.082835]  vfs_get_tree+0x45/0x120
> > [  666.085165]  path_mount+0x8da/0xcc0
> > [  666.087352]  do_mount+0xcb/0xf0
> > [  666.089377]  __x64_sys_mount+0xf4/0x110
> > [  666.091894]  do_syscall_64+0x33/0x40
> > [  666.094215]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [  666.097181]
> > [  666.098548] The buggy address belongs to the object at ffff88803ded8000
> > [  666.098548]  which belongs to the cache kmalloc-8k of size 8192
> > [  666.108266] The buggy address is located 7024 bytes inside of
> > [  666.108266]  8192-byte region [ffff88803ded8000, ffff88803deda000)
> > [  666.115709] The buggy address belongs to the page:
> > [  666.118516] page:00000000e08a579f refcount:1 mapcount:0
> > mapping:0000000000000000 index:0x0 pfn:0x3ded8
> > [  666.124078] head:00000000e08a579f order:3 compound_mapcount:0
> > compound_pincount:0
> > [  666.130104] flags: 0xfffffc0010200(slab|head)
> > [  666.133692] raw: 000fffffc0010200 dead000000000100 dead000000000122
> > ffff88800104ee40
> > [  666.139286] raw: 0000000000000000 0000000000020002 00000001ffffffff
> > 0000000000000000
> > [  666.145052] page dumped because: kasan: bad access detected
> > [  666.149341]
> > [  666.150408] Memory state around the buggy address:
>
> --
> Chuck Lever
>
>
>
Chuck Lever III Nov. 19, 2020, 4:19 p.m. UTC | #6
> On Nov 19, 2020, at 10:09 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> 
> On Thu, Nov 19, 2020 at 9:37 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> Hi Olga-
>> 
>>> On Nov 18, 2020, at 4:44 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
>>> 
>>> Hi Chuck,
>>> 
>>> The first problem I found was from 5.10-rc3 testing was from the fact
>>> that tail's iov_len was set to -4 and reply_chunk was trying to call
>>> rpcrdma_convert_kvec() for a tail that didn't exist.
>>> 
>>> Do you see issues with this fix?
>>> 
>>> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
>>> index 71e03b930b70..2e6a228abb95 100644
>>> --- a/net/sunrpc/xdr.c
>>> +++ b/net/sunrpc/xdr.c
>>> @@ -193,7 +193,7 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned int offset,
>>> 
>>>       tail->iov_base = buf + offset;
>>>       tail->iov_len = buflen - offset;
>>> -       if ((xdr->page_len & 3) == 0)
>>> +       if ((xdr->page_len & 3) == 0 && tail->iov_len)
>>>               tail->iov_len -= sizeof(__be32);
>>> 
>>>       xdr->buflen += len;
>> 
>> It's not clear to me whether only the listxattrs encoder is
>> not providing a receive tail kvec, or whether all the encoders
>> fail to provide a tail in this case.
> 
> There is nothing specific that listxattr does, it just calls
> rpc_prepare_pages(). Isn't tail[0] always there (xdr_buf structure has
> tail[1] defined)?

Flip the question on its head: Why does xdr_inline_pages() work
fine for all the other operations? That suggests the problem is
with listxattrs, not with generic code.


> But not all requests have data in the page. So as
> far as I understand, tail->iov_len can be 0 so not checking for it is
> wrong.

The full context of the tail logic is:

 194         tail->iov_base = buf + offset;
 195         tail->iov_len = buflen - offset;
 196         if ((xdr->page_len & 3) == 0)
 197                 tail->iov_len -= sizeof(__be32);

tail->iov_len is set to buflen - offset right here. It should
/always/ be 4 or more. We can ensure that because the input
to this function is always provided by another kernel function
(in other words, we control all the callers).
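
Purely as an illustration of that invariant (not a proposed patch), a
defensive check inside xdr_inline_pages() would look something like:

	/* the caller must leave room for at least one XDR_UNIT of tail */
	if (WARN_ON_ONCE(buflen < offset + sizeof(__be32)))
		return;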

Why is buflen == offset for listxattrs? 6c2190b3fcbc ("NFS:
Fix listxattr receive buffer size") is trying to ensure
tail->iov_len is not zero -- that the math here gives us a
positive value every time.

In nfs4_xdr_enc_listxattrs() we have:

1529         rpc_prepare_reply_pages(req, args->xattr_pages, 0, args->count,
1530             hdr.replen);

hdr.replen is set to NFS4_dec_listxattrs_sz.

_nfs42_proc_listxattrs() sets args->count.

I suspect the problem is the way _nfs42_proc_listxattrs() is
computing the length of xattr_pages. It includes the trailing
EOF boolean, but so does the decode_listxattrs_maxsz macro,
for instance.

We need head->iov_len to always be one XDR_UNIT larger than
the position where the xattr_pages array starts. Then the
math in xdr_inline_pages() will work correctly. (sidebar:
perhaps the documenting comment for xdr_inline_pages() should
explain that assumption).
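
Something like this, as a rough sketch of that documenting comment:

	/**
	 * xdr_inline_pages - prepare the reply xdr_buf to receive page data
	 *
	 * Note: the head kvec provided by the caller must be at least one
	 * XDR_UNIT longer than the offset where the page data starts, so
	 * that a non-empty tail kvec can be carved out of it.
	 */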

So, now I agree with the assessment that 6c2190b3fcbc ("NFS:
Fix listxattr receive buffer size") is probably not adequate.
There is another subtlety to address in the way that
_nfs42_proc_listxattrs() computes args->count.


--
Chuck Lever
Frank van der Linden Nov. 19, 2020, 11:26 p.m. UTC | #7
On Thu, Nov 19, 2020 at 11:19:05AM -0500, Chuck Lever wrote:
> > On Nov 19, 2020, at 10:09 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> >
> > On Thu, Nov 19, 2020 at 9:37 AM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >> Hi Olga-
> >>
> >>> On Nov 18, 2020, at 4:44 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> >>>
> >>> Hi Chuck,
> >>>
> >>> The first problem I found was from 5.10-rc3 testing was from the fact
> >>> that tail's iov_len was set to -4 and reply_chunk was trying to call
> >>> rpcrdma_convert_kvec() for a tail that didn't exist.
> >>>
> >>> Do you see issues with this fix?
> >>>
> >>> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
> >>> index 71e03b930b70..2e6a228abb95 100644
> >>> --- a/net/sunrpc/xdr.c
> >>> +++ b/net/sunrpc/xdr.c
> >>> @@ -193,7 +193,7 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned int offset,
> >>>
> >>>       tail->iov_base = buf + offset;
> >>>       tail->iov_len = buflen - offset;
> >>> -       if ((xdr->page_len & 3) == 0)
> >>> +       if ((xdr->page_len & 3) == 0 && tail->iov_len)
> >>>               tail->iov_len -= sizeof(__be32);
> >>>
> >>>       xdr->buflen += len;
> >>
> >> It's not clear to me whether only the listxattrs encoder is
> >> not providing a receive tail kvec, or whether all the encoders
> >> fail to provide a tail in this case.
> >
> > There is nothing specific that listxattr does, it just calls
> > rpc_prepare_pages(). Isn't tail[0] always there (xdr_buf structure has
> > tail[1] defined)?
> 
> Flip the question on its head: Why does xdr_inline_pages() work
> fine for all the other operations? That suggests the problem is
> with listxattrs, not with generic code.
> 
> 
> > But not all requests have data in the page. So as
> > far as I understand, tail->iov_len can be 0 so not checking for it is
> > wrong.
> 
> The full context of the tail logic is:
> 
>  194         tail->iov_base = buf + offset;
>  195         tail->iov_len = buflen - offset;
>  196         if ((xdr->page_len & 3) == 0)
>  197                 tail->iov_len -= sizeof(__be32);
> 
> tail->iov_len is set to buflen - offset right here. It should
> /always/ be 4 or more. We can ensure that because the input
> to this function is always provided by another kernel function
> (in other words, we control all the callers).
> 
> Why is buflen == offset for listxattrs? 6c2190b3fcbc ("NFS:
> Fix listxattr receive buffer size") is trying to ensure
> tail->iov_len is not zero -- that the math here gives us a
> positive value every time.
> 
> In nfs4_xdr_enc_listxattrs() we have:
> 
> 1529         rpc_prepare_reply_pages(req, args->xattr_pages, 0, args->count,
> 1530             hdr.replen);
> 
> hdr.replen is set to NFS4_dec_listxattrs_sz.
> 
> _nfs42_proc_listxattrs() sets args->count.
> 
> I suspect the problem is the way _nfs42_proc_listxattrs() is
> computing the length of xattr_pages. It includes the trailing
> EOF boolean, but so does the decode_listxattrs_maxsz macro,
> for instance.
> 
> We need head->iov_len to always be one XDR_UNIT larger than
> the position where the xattr_pages array starts. Then the
> math in xdr_inline_pages() will work correctly. (sidebar:
> perhaps the documenting comment for xdr_inline_pages() should
> explain that assumption).
> 
> So, now I agree with the assessment that 6c2190b3fcbc ("NFS:
> Fix listxattr receive buffer size") is probably not adequate.
> There is another subtlety to address in the way that
> _nfs42_proc_listxattrs() computes args->count.

The only thing I see wrong so far is that I believe that
decode_listxattrs_maxsz is wrong - it shouldn't include the EOF
word, which is accounted for in the page data part.
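
That is, drop one XDR word from the macro, something like (untested,
just to illustrate what I mean):

	-#define decode_listxattrs_maxsz  (op_decode_hdr_maxsz + 2 + 1 + 1 + 1)
	+#define decode_listxattrs_maxsz  (op_decode_hdr_maxsz + 2 + 1 + 1)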

But, it seems that this wouldn't cause the above problem. I'm
also uncertain why this happens with RDMA, but not otherwise.
Unfortunately, I can't test RDMA, but when I ran listxattr tests,
I did so with KASAN enabled, and didn't see issues.

Obviously there could be a bug here, it could be that the code
just gets lucky, but that the bug is exposed on RDMA.

Is there a specific size passed to listxattr that this happens with?

- Frank
Olga Kornievskaia Nov. 20, 2020, 4:37 p.m. UTC | #8
On Thu, Nov 19, 2020 at 6:26 PM Frank van der Linden
<fllinden@amazon.com> wrote:
>
> On Thu, Nov 19, 2020 at 11:19:05AM -0500, Chuck Lever wrote:
> > > On Nov 19, 2020, at 10:09 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> > >
> > > On Thu, Nov 19, 2020 at 9:37 AM Chuck Lever <chuck.lever@oracle.com> wrote:
> > >>
> > >> Hi Olga-
> > >>
> > >>> On Nov 18, 2020, at 4:44 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> > >>>
> > >>> Hi Chuck,
> > >>>
> > >>> The first problem I found was from 5.10-rc3 testing was from the fact
> > >>> that tail's iov_len was set to -4 and reply_chunk was trying to call
> > >>> rpcrdma_convert_kvec() for a tail that didn't exist.
> > >>>
> > >>> Do you see issues with this fix?
> > >>>
> > >>> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
> > >>> index 71e03b930b70..2e6a228abb95 100644
> > >>> --- a/net/sunrpc/xdr.c
> > >>> +++ b/net/sunrpc/xdr.c
> > >>> @@ -193,7 +193,7 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned int offset,
> > >>>
> > >>>       tail->iov_base = buf + offset;
> > >>>       tail->iov_len = buflen - offset;
> > >>> -       if ((xdr->page_len & 3) == 0)
> > >>> +       if ((xdr->page_len & 3) == 0 && tail->iov_len)
> > >>>               tail->iov_len -= sizeof(__be32);
> > >>>
> > >>>       xdr->buflen += len;
> > >>
> > >> It's not clear to me whether only the listxattrs encoder is
> > >> not providing a receive tail kvec, or whether all the encoders
> > >> fail to provide a tail in this case.
> > >
> > > There is nothing specific that listxattr does, it just calls
> > > rpc_prepare_pages(). Isn't tail[0] always there (xdr_buf structure has
> > > tail[1] defined)?
> >
> > Flip the question on its head: Why does xdr_inline_pages() work
> > fine for all the other operations? That suggests the problem is
> > with listxattrs, not with generic code.
> >
> >
> > > But not all requests have data in the page. So as
> > > far as I understand, tail->iov_len can be 0 so not checking for it is
> > > wrong.
> >
> > The full context of the tail logic is:
> >
> >  194         tail->iov_base = buf + offset;
> >  195         tail->iov_len = buflen - offset;
> >  196         if ((xdr->page_len & 3) == 0)
> >  197                 tail->iov_len -= sizeof(__be32);
> >
> > tail->iov_len is set to buflen - offset right here. It should
> > /always/ be 4 or more. We can ensure that because the input
> > to this function is always provided by another kernel function
> > (in other words, we control all the callers).
> >
> > Why is buflen == offset for listxattrs? 6c2190b3fcbc ("NFS:
> > Fix listxattr receive buffer size") is trying to ensure
> > tail->iov_len is not zero -- that the math here gives us a
> > positive value every time.
> >
> > In nfs4_xdr_enc_listxattrs() we have:
> >
> > 1529         rpc_prepare_reply_pages(req, args->xattr_pages, 0, args->count,
> > 1530             hdr.replen);
> >
> > hdr.replen is set to NFS4_dec_listxattrs_sz.
> >
> > _nfs42_proc_listxattrs() sets args->count.
> >
> > I suspect the problem is the way _nfs42_proc_listxattrs() is
> > computing the length of xattr_pages. It includes the trailing
> > EOF boolean, but so does the decode_listxattrs_maxsz macro,
> > for instance.
> >
> > We need head->iov_len to always be one XDR_UNIT larger than
> > the position where the xattr_pages array starts. Then the
> > math in xdr_inline_pages() will work correctly. (sidebar:
> > perhaps the documenting comment for xdr_inline_pages() should
> > explain that assumption).
> >
> > So, now I agree with the assessment that 6c2190b3fcbc ("NFS:
> > Fix listxattr receive buffer size") is probably not adequate.
> > There is another subtlety to address in the way that
> > _nfs42_proc_listxattrs() computes args->count.
>
> The only thing I see wrong so far is that I believe that
> decode_listxattrs_maxsz is wrong - it shouldn't include the EOF
> word, which is accounted for in the page data part.
>
> But, it seems that this wouldn't cause the above problem. I'm
> also uncertain why this happens with RDMA, but not otherwise.
> Unfortunately, I can't test RDMA, but when I ran listxattr tests,
> I did so with KASAN enabled, and didn't see issues.

There isn't anything special needed to run tests on RDMA; you just need
to compile the RXE driver, and the rdma-core package has everything you
need to run Soft-RoCE (or soft iWARP). This is how I'm testing.

> Obviously there could be a bug here, it could be that the code
> just gets lucky, but that the bug is exposed on RDMA.
>
> Is there a specific size passed to listxattr that this happens with?

First let me answer the last question, I'm running xfstest generic/013.

The latest update: I updated to Trond's origin/testing, which is now
based on 5.10-rc4.

1. I don't see the oops during the encoding of the listxattr.
2. I'm still seeing the oops during the RDMA completion. This happens
in the following scenario. Normally, there is a request for listxattr
which passes in a buflen of 65536. This translates into an RDMA request
with a reply chunk with 2 segments. But in the failure case there is a
request for listxattr which passes a buflen of only 8 bytes. This
translates into an RDMA inline request which later oopses in
rpcrdma_inline_fixup().

What I would like to know: (1) should _nfs42_proc_listxattrs() be doing
some kind of sanity checking on the passed-in buflen? (2) Of course,
I'm assuming that even if the passed-in buffer is small, the code
shouldn't be oops-ing.

>
> - Frank
Olga Kornievskaia Nov. 23, 2020, 4:42 p.m. UTC | #9
Hi Frank, Chuck,

I would like your opinion on how LISTXATTR is supposed to work over
RDMA. Here's my current understanding of why listxattr is not
working over RDMA.

This happens when listxattr is called with a very small buffer
size, for which RDMA wants to send an inline request. I really don't
understand why, Chuck, you are not seeing any problems with hardware;
as far as I can tell it would have the same problem, because the
inline threshold size would still make a request of this size inline.
rpcrdma_inline_fixup() is trying to write to pages that don't exist.

When LISTXATTR sets the XDRBUF_SPARSE_PAGES flag, there is code that
will allocate pages in xs_alloc_sparse_pages(), but this is ONLY for
TCP. RDMA doesn't have anything like that.
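
For reference, the TCP-side logic being referred to does roughly the
following -- a simplified sketch of xs_alloc_sparse_pages(), written from
memory rather than copied from the tree, so details may differ by kernel
version:

/* Simplified sketch: when the upper layer marked the receive buffer
 * XDRBUF_SPARSE_PAGES, fill in any page slots that are still NULL
 * before receiving data into them. */
static size_t alloc_sparse_pages_sketch(struct xdr_buf *buf, size_t want,
					gfp_t gfp)
{
	size_t i, n;

	if (!want || !(buf->flags & XDRBUF_SPARSE_PAGES))
		return want;
	n = (buf->page_base + want + PAGE_SIZE - 1) >> PAGE_SHIFT;
	for (i = 0; i < n; i++) {
		if (buf->pages[i])
			continue;
		buf->pages[i] = alloc_page(gfp);
		if (!buf->pages[i]) {
			/* out of memory: only receive what fits in the
			 * pages allocated so far */
			i *= PAGE_SIZE;
			return i > buf->page_base ? i - buf->page_base : 0;
		}
	}
	return want;
}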

Question: Should code be added to RDMA to do something similar when
it sees that flag set? Or should LISTXATTR be rewritten to be like
READDIR, which allocates pages before calling the code?

On Fri, Nov 20, 2020 at 11:37 AM Olga Kornievskaia
<olga.kornievskaia@gmail.com> wrote:
>
> On Thu, Nov 19, 2020 at 6:26 PM Frank van der Linden
> <fllinden@amazon.com> wrote:
> >
> > On Thu, Nov 19, 2020 at 11:19:05AM -0500, Chuck Lever wrote:
> > > > On Nov 19, 2020, at 10:09 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> > > >
> > > > On Thu, Nov 19, 2020 at 9:37 AM Chuck Lever <chuck.lever@oracle.com> wrote:
> > > >>
> > > >> Hi Olga-
> > > >>
> > > >>> On Nov 18, 2020, at 4:44 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> > > >>>
> > > >>> Hi Chuck,
> > > >>>
> > > >>> The first problem I found was from 5.10-rc3 testing was from the fact
> > > >>> that tail's iov_len was set to -4 and reply_chunk was trying to call
> > > >>> rpcrdma_convert_kvec() for a tail that didn't exist.
> > > >>>
> > > >>> Do you see issues with this fix?
> > > >>>
> > > >>> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
> > > >>> index 71e03b930b70..2e6a228abb95 100644
> > > >>> --- a/net/sunrpc/xdr.c
> > > >>> +++ b/net/sunrpc/xdr.c
> > > >>> @@ -193,7 +193,7 @@ xdr_inline_pages(struct xdr_buf *xdr, unsigned int offset,
> > > >>>
> > > >>>       tail->iov_base = buf + offset;
> > > >>>       tail->iov_len = buflen - offset;
> > > >>> -       if ((xdr->page_len & 3) == 0)
> > > >>> +       if ((xdr->page_len & 3) == 0 && tail->iov_len)
> > > >>>               tail->iov_len -= sizeof(__be32);
> > > >>>
> > > >>>       xdr->buflen += len;
> > > >>
> > > >> It's not clear to me whether only the listxattrs encoder is
> > > >> not providing a receive tail kvec, or whether all the encoders
> > > >> fail to provide a tail in this case.
> > > >
> > > > There is nothing specific that listxattr does, it just calls
> > > > rpc_prepare_pages(). Isn't tail[0] always there (xdr_buf structure has
> > > > tail[1] defined)?
> > >
> > > Flip the question on its head: Why does xdr_inline_pages() work
> > > fine for all the other operations? That suggests the problem is
> > > with listxattrs, not with generic code.
> > >
> > >
> > > > But not all requests have data in the page. So as
> > > > far as I understand, tail->iov_len can be 0 so not checking for it is
> > > > wrong.
> > >
> > > The full context of the tail logic is:
> > >
> > >  194         tail->iov_base = buf + offset;
> > >  195         tail->iov_len = buflen - offset;
> > >  196         if ((xdr->page_len & 3) == 0)
> > >  197                 tail->iov_len -= sizeof(__be32);
> > >
> > > tail->iov_len is set to buflen - offset right here. It should
> > > /always/ be 4 or more. We can ensure that because the input
> > > to this function is always provided by another kernel function
> > > (in other words, we control all the callers).
> > >
> > > Why is buflen == offset for listxattrs? 6c2190b3fcbc ("NFS:
> > > Fix listxattr receive buffer size") is trying to ensure
> > > tail->iov_len is not zero -- that the math here gives us a
> > > positive value every time.
> > >
> > > In nfs4_xdr_enc_listxattrs() we have:
> > >
> > > 1529         rpc_prepare_reply_pages(req, args->xattr_pages, 0, args->count,
> > > 1530             hdr.replen);
> > >
> > > hdr.replen is set to NFS4_dec_listxattrs_sz.
> > >
> > > _nfs42_proc_listxattrs() sets args->count.
> > >
> > > I suspect the problem is the way _nfs42_proc_listxattrs() is
> > > computing the length of xattr_pages. It includes the trailing
> > > EOF boolean, but so does the decode_listxattrs_maxsz macro,
> > > for instance.
> > >
> > > We need head->iov_len to always be one XDR_UNIT larger than
> > > the position where the xattr_pages array starts. Then the
> > > math in xdr_inline_pages() will work correctly. (sidebar:
> > > perhaps the documenting comment for xdr_inline_pages() should
> > > explain that assumption).
> > >
> > > So, now I agree with the assessment that 6c2190b3fcbc ("NFS:
> > > Fix listxattr receive buffer size") is probably not adequate.
> > > There is another subtlety to address in the way that
> > > _nfs42_proc_listxattrs() computes args->count.
> >
> > The only thing I see wrong so far is that I believe that
> > decode_listxattrs_maxsz is wrong - it shouldn't include the EOF
> > word, which is accounted for in the page data part.
> >
> > But, it seems that this wouldn't cause the above problem. I'm
> > also uncertain why this happens with RDMA, but not otherwise.
> > Unfortunately, I can't test RDMA, but when I ran listxattr tests,
> > I did so with KASAN enabled, and didn't see issues.
>
> There isn't nothing special to run tests on RDMA, you just need to
> compile the RXE driver and the rdma-core package have everything you
> need to run soft Roce (or soft iwarp). This is how I'm testing.
>
> > Obviously there could be a bug here, it could be that the code
> > just gets lucky, but that the bug is exposed on RDMA.
> >
> > Is there a specific size passed to listxattr that this happens with?
>
> First let me answer the last question, I'm running xfstest generic/013.
>
> The latest update: updated to Trond's origing/testing which is now
> based on 5.10-rc4.
>
> 1. I don't see the oops during the encoding of the listxattr
> 2. I'm still seeing the oops during the rdma completion. This happens
> in the following scenario. Normally, there is a request for listxattr
> which passes in buflen of 65536. This translates into rdma request
> with a reply chunk with 2 segments. But during failure there is a
> request for listxattr which passes buflen of only 8bytes. This
> translates into rdma inline request which later oops in
> rpcrdma_inline_fixup.
>
> What I would like to know: (1) should _nf42_proc_listxattrs() be doing
> some kind of sanity checking for passed in buflen? (2) but of course
> I'm assuming even if passed in buffer is small the code shouldn't be
> oops-ing.
>
> >
> > - Frank
Chuck Lever III Nov. 23, 2020, 5:37 p.m. UTC | #10
> On Nov 23, 2020, at 11:42 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> 
> Hi Frank, Chuck,
> 
> I would like your option on how LISTXATTR is supposed to work over
> RDMA. Here's my current understanding of why the listxattr is not
> working over the RDMA.
> 
> This happens when the listxattr is called with a very small buffer
> size which RDMA wants to send an inline request. I really dont
> understand why, Chuck, you are not seeing any problems with hardware
> as far as I can tell it would have the same problem because the inline
> threshold size would still make this size inline.
> rcprdma_inline_fixup() is trying to write to pages that don't exist.
> 
> When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that
> will allocate pages in xs_alloc_sparse_pages() but this is ONLY for
> TCP. RDMA doesn't have anything like that.
> 
> Question: Should there be code added to RDMA that will do something
> similar when it sees that flag set?

Isn't the logic in rpcrdma_convert_iovs() allocating those pages?


> Or, should LISTXATTR be re-written
> to be like READDIR which allocates pages before calling the code.

AIUI READDIR reads into the directory inode's page cache. I recall
that Frank couldn't do that for LISTXATTR because there's no
similar page cache associated with the xattr listing.

That said, I would prefer that the *XATTR procedures directly
allocate pages instead of relying on SPARSE_PAGES, which is a hack
IMO. I think it would have to use alloc_page() for that, and then
ensure those pages are released when the call has completed.
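
A minimal sketch of that approach (hypothetical helpers, not code from the
tree) might look like the following, with the caller responsible for pairing
the allocation and release around the RPC:

/* Sketch: allocate np receive pages up front instead of relying on
 * XDRBUF_SPARSE_PAGES. */
static struct page **listxattrs_alloc_pages_sketch(unsigned int np)
{
	struct page **pages;
	unsigned int i;

	pages = kcalloc(np, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return NULL;
	for (i = 0; i < np; i++) {
		pages[i] = alloc_page(GFP_KERNEL);
		if (!pages[i])
			goto out_free;
	}
	return pages;

out_free:
	while (i-- > 0)
		__free_page(pages[i]);
	kfree(pages);
	return NULL;
}

/* Sketch: release the pages once the call has completed. */
static void listxattrs_free_pages_sketch(struct page **pages, unsigned int np)
{
	unsigned int i;

	for (i = 0; i < np; i++)
		__free_page(pages[i]);
	kfree(pages);
}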

I'm not convinced this is the cause of the problem you're seeing,
though.

--
Chuck Lever
Frank van der Linden Nov. 23, 2020, 5:38 p.m. UTC | #11
On Mon, Nov 23, 2020 at 11:42:46AM -0500, Olga Kornievskaia wrote:
> Hi Frank, Chuck,
> 
> I would like your option on how LISTXATTR is supposed to work over
> RDMA. Here's my current understanding of why the listxattr is not
> working over the RDMA.
> 
> This happens when the listxattr is called with a very small buffer
> size which RDMA wants to send an inline request. I really dont
> understand why, Chuck, you are not seeing any problems with hardware
> as far as I can tell it would have the same problem because the inline
> threshold size would still make this size inline.
> rcprdma_inline_fixup() is trying to write to pages that don't exist.
> 
> When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that
> will allocate pages in xs_alloc_sparse_pages() but this is ONLY for
> TCP. RDMA doesn't have anything like that.
> 
> Question: Should there be code added to RDMA that will do something
> similar when it sees that flag set? Or, should LISTXATTR be re-written
> to be like READDIR which allocates pages before calling the code.

Hm.. so if the flag never worked for RDMA, was NFS_V3_ACL ever tested
over RDMA? That's the only other user.

If the flag simply doesn't work, I agree that it should either be fixed
or just removed.

It wouldn't be the end of the world to change LISTXATTRS (and GETXATTR)
to use preallocated pages. But, I didn't do that because I didn't want to
waste the max size (64k) every time, even though you usually just get
a few hundred bytes at most. So it seems like fixing XDRBUF_SPARSE_PAGES
is cleaner.

- Frank
Chuck Lever III Nov. 23, 2020, 5:49 p.m. UTC | #12
> On Nov 23, 2020, at 12:38 PM, Frank van der Linden <fllinden@amazon.com> wrote:
> 
> On Mon, Nov 23, 2020 at 11:42:46AM -0500, Olga Kornievskaia wrote:
>> Hi Frank, Chuck,
>> 
>> I would like your option on how LISTXATTR is supposed to work over
>> RDMA. Here's my current understanding of why the listxattr is not
>> working over the RDMA.
>> 
>> This happens when the listxattr is called with a very small buffer
>> size which RDMA wants to send an inline request. I really dont
>> understand why, Chuck, you are not seeing any problems with hardware
>> as far as I can tell it would have the same problem because the inline
>> threshold size would still make this size inline.
>> rcprdma_inline_fixup() is trying to write to pages that don't exist.
>> 
>> When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that
>> will allocate pages in xs_alloc_sparse_pages() but this is ONLY for
>> TCP. RDMA doesn't have anything like that.
>> 
>> Question: Should there be code added to RDMA that will do something
>> similar when it sees that flag set? Or, should LISTXATTR be re-written
>> to be like READDIR which allocates pages before calling the code.
> 
> Hm.. so if the flag never worked for RDMA, was NFS_V3_ACL ever tested
> over RDMA? That's the only other user.
> 
> If the flag simply doesn't work, I agree that it should either be fixed
> or just removed.

Shirley fixed a crasher here years ago by making SPARSE_PAGES work
for RPC/RDMA. She confirmed the fix was effective.


> It wouldn't be the end of the world to change LISTXATTRS (and GETXATTR)
> to use preallocated pages. But, I didn't do that because I didn't want to
> waste the max size (64k) every time, even though you usually just get
> a few hundred bytes at most. So it seems like fixing XDRBUF_SPARSE_PAGES
> is cleaner.

These are infrequent operations. We've added extra conditional branches
in the hot path of the transports to handle this rare, non-performance
sensitive case.

I also wonder how well SPARSE_PAGES will work with xdr->bvecs, since
the bio_vec array is constructed before the transport can allocate
those pages.

To me, the whole SPARSE_PAGES concept is prototype-quality code that
needs to be replaced with a robust permanent solution.


--
Chuck Lever
Chuck Lever III Nov. 23, 2020, 5:56 p.m. UTC | #13
> On Nov 23, 2020, at 12:38 PM, Frank van der Linden <fllinden@amazon.com> wrote:
> 
> On Mon, Nov 23, 2020 at 11:42:46AM -0500, Olga Kornievskaia wrote:
>> Hi Frank, Chuck,
>> 
>> I would like your option on how LISTXATTR is supposed to work over
>> RDMA. Here's my current understanding of why the listxattr is not
>> working over the RDMA.
>> 
>> This happens when the listxattr is called with a very small buffer
>> size which RDMA wants to send an inline request. I really dont
>> understand why, Chuck, you are not seeing any problems with hardware
>> as far as I can tell it would have the same problem because the inline
>> threshold size would still make this size inline.
>> rcprdma_inline_fixup() is trying to write to pages that don't exist.
>> 
>> When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that
>> will allocate pages in xs_alloc_sparse_pages() but this is ONLY for
>> TCP. RDMA doesn't have anything like that.
>> 
>> Question: Should there be code added to RDMA that will do something
>> similar when it sees that flag set? Or, should LISTXATTR be re-written
>> to be like READDIR which allocates pages before calling the code.
> 
> Hm.. so if the flag never worked for RDMA, was NFS_V3_ACL ever tested
> over RDMA? That's the only other user.
> 
> If the flag simply doesn't work, I agree that it should either be fixed
> or just removed.
> 
> It wouldn't be the end of the world to change LISTXATTRS (and GETXATTR)
> to use preallocated pages. But, I didn't do that because I didn't want to
> waste the max size (64k) every time, even though you usually just get
> a few hundred bytes at most. So it seems like fixing XDRBUF_SPARSE_PAGES
> is cleaner.

Also, because this is for a receive buffer, the transport always has
to allocate the maximum number of pages. Especially for RPC/RDMA, this
is not an "allocate on demand" kind of thing. The maximum buffer size
has to be allocated every time.

--
Chuck Lever
Olga Kornievskaia Nov. 23, 2020, 5:59 p.m. UTC | #14
On Mon, Nov 23, 2020 at 12:37 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
>
>
> > On Nov 23, 2020, at 11:42 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> >
> > Hi Frank, Chuck,
> >
> > I would like your option on how LISTXATTR is supposed to work over
> > RDMA. Here's my current understanding of why the listxattr is not
> > working over the RDMA.
> >
> > This happens when the listxattr is called with a very small buffer
> > size which RDMA wants to send an inline request. I really dont
> > understand why, Chuck, you are not seeing any problems with hardware
> > as far as I can tell it would have the same problem because the inline
> > threshold size would still make this size inline.
> > rcprdma_inline_fixup() is trying to write to pages that don't exist.
> >
> > When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that
> > will allocate pages in xs_alloc_sparse_pages() but this is ONLY for
> > TCP. RDMA doesn't have anything like that.
> >
> > Question: Should there be code added to RDMA that will do something
> > similar when it sees that flag set?
>
> Isn't the logic in rpcrdma_convert_iovs() allocating those pages?

No, rpcrdma_convert_iovs() is only called when you have reply chunks,
lists, etc., but not for inline messages. What am I missing?


> > Or, should LISTXATTR be re-written
> > to be like READDIR which allocates pages before calling the code.
>
> AIUI READDIR reads into the directory inode's page cache. I recall
> that Frank couldn't do that for LISTXATTR because there's no
> similar page cache associated with the xattr listing.
>
> That said, I would prefer that the *XATTR procedures directly
> allocate pages instead of relying on SPARSE_PAGES, which is a hack
> IMO. I think it would have to use alloc_page() for that, and then
> ensure those pages are released when the call has completed.
>
> I'm not convinced this is the cause of the problem you're seeing,
> though.
>
> --
> Chuck Lever
>
>
>
Olga Kornievskaia Nov. 23, 2020, 6:05 p.m. UTC | #15
On Mon, Nov 23, 2020 at 12:42 PM Frank van der Linden
<fllinden@amazon.com> wrote:
>
> On Mon, Nov 23, 2020 at 11:42:46AM -0500, Olga Kornievskaia wrote:
> > Hi Frank, Chuck,
> >
> > I would like your option on how LISTXATTR is supposed to work over
> > RDMA. Here's my current understanding of why the listxattr is not
> > working over the RDMA.
> >
> > This happens when the listxattr is called with a very small buffer
> > size which RDMA wants to send an inline request. I really dont
> > understand why, Chuck, you are not seeing any problems with hardware
> > as far as I can tell it would have the same problem because the inline
> > threshold size would still make this size inline.
> > rcprdma_inline_fixup() is trying to write to pages that don't exist.
> >
> > When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that
> > will allocate pages in xs_alloc_sparse_pages() but this is ONLY for
> > TCP. RDMA doesn't have anything like that.
> >
> > Question: Should there be code added to RDMA that will do something
> > similar when it sees that flag set? Or, should LISTXATTR be re-written
> > to be like READDIR which allocates pages before calling the code.
>
> Hm.. so if the flag never worked for RDMA, was NFS_V3_ACL ever tested
> over RDMA? That's the only other user.

It could have worked, depending on whether or not ACLs were ever sent
inline. As I said, LISTXATTR works when userspace provides a large
buffer, because that triggers the RDMA code to set up a reply chunk,
which allocates memory. So it's a very specific case.

It might have worked for ACL because of the way ACL works (at least
now): it's called first with a large buffer size, and the client code
actually caches the reply, so when user space calls it again with the
appropriate buffer size, the code doesn't do another request.
LISTXATTR doesn't have that optimization, and it does another call.
That call uses an inline RDMA message, which doesn't work when pages
are not allocated, as far as I can tell.

> If the flag simply doesn't work, I agree that it should either be fixed
> or just removed.
>
> It wouldn't be the end of the world to change LISTXATTRS (and GETXATTR)
> to use preallocated pages. But, I didn't do that because I didn't want to
> waste the max size (64k) every time, even though you usually just get
> a few hundred bytes at most. So it seems like fixing XDRBUF_SPARSE_PAGES
> is cleaner.
>
> - Frank
Chuck Lever III Nov. 23, 2020, 6:09 p.m. UTC | #16
> On Nov 23, 2020, at 12:59 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> 
> On Mon, Nov 23, 2020 at 12:37 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> 
>> 
>>> On Nov 23, 2020, at 11:42 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
>>> 
>>> Hi Frank, Chuck,
>>> 
>>> I would like your option on how LISTXATTR is supposed to work over
>>> RDMA. Here's my current understanding of why the listxattr is not
>>> working over the RDMA.
>>> 
>>> This happens when the listxattr is called with a very small buffer
>>> size which RDMA wants to send an inline request. I really dont
>>> understand why, Chuck, you are not seeing any problems with hardware
>>> as far as I can tell it would have the same problem because the inline
>>> threshold size would still make this size inline.
>>> rcprdma_inline_fixup() is trying to write to pages that don't exist.
>>> 
>>> When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that
>>> will allocate pages in xs_alloc_sparse_pages() but this is ONLY for
>>> TCP. RDMA doesn't have anything like that.
>>> 
>>> Question: Should there be code added to RDMA that will do something
>>> similar when it sees that flag set?
>> 
>> Isn't the logic in rpcrdma_convert_iovs() allocating those pages?
> 
> No, rpcrdm_convert_iovs is only called for when you have reply chunks,
> lists etc but not for the inline messages. What am I missing?

So, then, rpcrdma_marshal_req() is deciding that the LISTXATTRS
reply is supposed to fit inline. That means rqst->rq_rcv_buf.buflen
is small.

But if rpcrdma_inline_fixup() is trying to fill pages,
rqst->rq_rcv_buf.page_len must not be zero? That sounds like the
LISTXATTRS encoder is not setting up the receive buffer correctly.

The receive buffer's buflen field is supposed to be set to a value
that is at least as large as page_len, I would think.


>>> Or, should LISTXATTR be re-written
>>> to be like READDIR which allocates pages before calling the code.
>> 
>> AIUI READDIR reads into the directory inode's page cache. I recall
>> that Frank couldn't do that for LISTXATTR because there's no
>> similar page cache associated with the xattr listing.
>> 
>> That said, I would prefer that the *XATTR procedures directly
>> allocate pages instead of relying on SPARSE_PAGES, which is a hack
>> IMO. I think it would have to use alloc_page() for that, and then
>> ensure those pages are released when the call has completed.
>> 
>> I'm not convinced this is the cause of the problem you're seeing,
>> though.
>> 
>> --
>> Chuck Lever

--
Chuck Lever
Frank van der Linden Nov. 23, 2020, 6:20 p.m. UTC | #17
On Mon, Nov 23, 2020 at 12:37:15PM -0500, Chuck Lever wrote:
> 
> 
> > On Nov 23, 2020, at 11:42 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> >
> > Hi Frank, Chuck,
> >
> > I would like your option on how LISTXATTR is supposed to work over
> > RDMA. Here's my current understanding of why the listxattr is not
> > working over the RDMA.
> >
> > This happens when the listxattr is called with a very small buffer
> > size which RDMA wants to send an inline request. I really dont
> > understand why, Chuck, you are not seeing any problems with hardware
> > as far as I can tell it would have the same problem because the inline
> > threshold size would still make this size inline.
> > rcprdma_inline_fixup() is trying to write to pages that don't exist.
> >
> > When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that
> > will allocate pages in xs_alloc_sparse_pages() but this is ONLY for
> > TCP. RDMA doesn't have anything like that.
> >
> > Question: Should there be code added to RDMA that will do something
> > similar when it sees that flag set?
> 
> Isn't the logic in rpcrdma_convert_iovs() allocating those pages?
> 
> 
> > Or, should LISTXATTR be re-written
> > to be like READDIR which allocates pages before calling the code.
> 
> AIUI READDIR reads into the directory inode's page cache. I recall
> that Frank couldn't do that for LISTXATTR because there's no
> similar page cache associated with the xattr listing.

Yep, correct.

> 
> That said, I would prefer that the *XATTR procedures directly
> allocate pages instead of relying on SPARSE_PAGES, which is a hack
> IMO. I think it would have to use alloc_page() for that, and then
> ensure those pages are released when the call has completed.
> 
> I'm not convinced this is the cause of the problem you're seeing,
> though.

Since the flag works with GETXATTR I would agree that it's probably
not the issue.

More likely, it seems like args->count is somehow off.

- Frank
Frank van der Linden Nov. 23, 2020, 7:24 p.m. UTC | #18
On Mon, Nov 23, 2020 at 05:38:02PM +0000, Frank van der Linden wrote:
> It wouldn't be the end of the world to change LISTXATTRS (and GETXATTR)
> to use preallocated pages. But, I didn't do that because I didn't want to
> waste the max size (64k) every time, even though you usually just get
> a few hundred bytes at most. So it seems like fixing XDRBUF_SPARSE_PAGES
> is cleaner.

Correcting myself here: this is true of GETXATTR, since there is no
maximum length argument for that call - so your only choice is to
allocate the max of what you can handle, and SPARSE seems to
be the best option for that.

For LISTXATTRS, there is a max length argument. It's a bit unusual,
in that it specifies the maximum number of bytes of *XDR* data you
want to receive. So you don't need to allocate the maximum number
of pages. But, I decided to still use the SPARSE option, since the
common zero-length argument to the listxattr syscall, used to probe
the length of the buffer needed, would still require the maximum
length (64k) to be allocated, because there is no zero-length 'probe'
option for the RPC. So all you can do is ask for the maximum
size you can handle, and see what you get back to determine
the length.
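
For context, the probe pattern being described is the standard listxattr(2)
usage from user space -- a minimal sketch (hypothetical helper, error
handling trimmed):

#include <sys/xattr.h>
#include <stdlib.h>

/* Probe with size == 0 to learn how large a buffer is needed, then
 * call again to fetch the actual attribute name list. */
static char *get_xattr_names(const char *path, ssize_t *lenp)
{
	ssize_t len = listxattr(path, NULL, 0);	/* the zero-length probe */
	char *names;

	if (len <= 0)
		return NULL;
	names = malloc(len);
	if (!names)
		return NULL;
	*lenp = listxattr(path, names, len);	/* actual retrieval */
	return names;
}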

Olga's comment about caching: the code actually does cache
the data for all valid replies, even if the system call
was only a len == 0 probe call. That way, the follow-up
system call that actually retrieves the data can normally
just get it from the cache.

In any case, I need to set up an RDMA environment to see
if I can reproduce and fix the issue. I hope to do that
today or tomorrow.

- Frank
Olga Kornievskaia Nov. 23, 2020, 11:14 p.m. UTC | #19
On Mon, Nov 23, 2020 at 1:09 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
>
>
> > On Nov 23, 2020, at 12:59 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> >
> > On Mon, Nov 23, 2020 at 12:37 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >>
> >>
> >>> On Nov 23, 2020, at 11:42 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> >>>
> >>> Hi Frank, Chuck,
> >>>
> >>> I would like your option on how LISTXATTR is supposed to work over
> >>> RDMA. Here's my current understanding of why the listxattr is not
> >>> working over the RDMA.
> >>>
> >>> This happens when the listxattr is called with a very small buffer
> >>> size which RDMA wants to send an inline request. I really dont
> >>> understand why, Chuck, you are not seeing any problems with hardware
> >>> as far as I can tell it would have the same problem because the inline
> >>> threshold size would still make this size inline.
> >>> rcprdma_inline_fixup() is trying to write to pages that don't exist.
> >>>
> >>> When LISTXATTR sets this flag XDRBUF_SPARSE_PAGES there is code that
> >>> will allocate pages in xs_alloc_sparse_pages() but this is ONLY for
> >>> TCP. RDMA doesn't have anything like that.
> >>>
> >>> Question: Should there be code added to RDMA that will do something
> >>> similar when it sees that flag set?
> >>
> >> Isn't the logic in rpcrdma_convert_iovs() allocating those pages?
> >
> > No, rpcrdm_convert_iovs is only called for when you have reply chunks,
> > lists etc but not for the inline messages. What am I missing?
>
> So, then, rpcrdma_marshal_req() is deciding that the LISTXATTRS
> reply is supposed to fit inline. That means rqst->rq_rcv_buf.buflen
> is small.
>
> But if rpcrdma_inline_fixup() is trying to fill pages,
> rqst->rq_rcv_buf.page_len must not be zero? That sounds like the
> LISTXATTRS encoder is not setting up the receive buffer correctly.
>
> The receive buffer's buflen field is supposed to be set to a value
> that is at least as large as page_len, I would think.

Here's what the LISTXATTR code does, as far as I can see:
It allocates pointers to the pages (but no pages). It sets the
page_len to the hdr.replen, so yes, it's not zero (setting that value
is, as far as I know, correct). So for RDMA nothing allocates those
pages, because it's an inline request. The TCP code will allocate those
pages because that code was added. You keep saying that you don't
think this is it, but I don't know how to prove to you that KASAN's
"wild-memory-access" report means the page wasn't allocated. It's
a bogus address. The page isn't there. Or at least that's how I read
KASAN's message; I don't know how else to interpret it (and the fact
that the code never allocates the memory is, I believe, a strong
argument).

The NFS code can't know that the request is inline. It does assume
something will allocate that memory, but RDMA doesn't allocate memory
for inline messages.

While I'm not suggesting this is the correct fix, it does remove the
oops:

diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index 2b2211d1234e..faab6aedeb42 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -1258,6 +1258,15 @@ static ssize_t _nfs42_proc_listxattrs(struct inode *inode, void *buf,
                __free_page(res.scratch);
                return -ENOMEM;
        }
+       if (buflen < 1024) {
+               int i;
+               for (i = 0; i < np; i++) {
+                       pages[i] = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
+                       if (!pages[i])
+                               return -ENOMEM;
+               }
+       }
+

        arg.xattr_pages = pages;
        arg.count = xdrlen;


Basically, since I know that all RDMA requests smaller than 1024 bytes
(for Soft-RoCE) will be inline, I need to allocate pages for them. This
doesn't interfere with TCP mounts, as the code checks whether pages are
allocated and only allocates them if they are not.

But of course this is not a real solution, as the NFS layer has no way
of knowing what the RDMA inline threshold is.



> >>> Or, should LISTXATTR be re-written
> >>> to be like READDIR which allocates pages before calling the code.
> >>
> >> AIUI READDIR reads into the directory inode's page cache. I recall
> >> that Frank couldn't do that for LISTXATTR because there's no
> >> similar page cache associated with the xattr listing.
> >>
> >> That said, I would prefer that the *XATTR procedures directly
> >> allocate pages instead of relying on SPARSE_PAGES, which is a hack
> >> IMO. I think it would have to use alloc_page() for that, and then
> >> ensure those pages are released when the call has completed.
> >>
> >> I'm not convinced this is the cause of the problem you're seeing,
> >> though.
> >>
> >> --
> >> Chuck Lever
>
> --
> Chuck Lever
>
>
>

Patch

diff --git a/fs/nfs/nfs42xdr.c b/fs/nfs/nfs42xdr.c
index 6e060a8..e88bc7a 100644
--- a/fs/nfs/nfs42xdr.c
+++ b/fs/nfs/nfs42xdr.c
@@ -196,7 +196,8 @@ 
 				 1 + nfs4_xattr_name_maxsz + 1)
 #define decode_setxattr_maxsz   (op_decode_hdr_maxsz + decode_change_info_maxsz)
 #define encode_listxattrs_maxsz  (op_encode_hdr_maxsz + 2 + 1)
-#define decode_listxattrs_maxsz  (op_decode_hdr_maxsz + 2 + 1 + 1 + 1)
+#define decode_listxattrs_maxsz  (op_decode_hdr_maxsz + 2 + 1 + 1 + \
+				  XDR_QUADLEN(NFS4_OPAQUE_LIMIT))
 #define encode_removexattr_maxsz (op_encode_hdr_maxsz + 1 + \
 				  nfs4_xattr_name_maxsz)
 #define decode_removexattr_maxsz (op_decode_hdr_maxsz + \