diff mbox series

[net,v2] tcp: fix page frag corruption on page fault

Message ID 0fa74bfdfee95d6bba9fa49a5437f63c014f29ce.1637951641.git.pabeni@redhat.com (mailing list archive)
State Accepted
Commit dacb5d8875cc6cd3a553363b4d6f06760fcbe70c
Delegated to: Netdev Maintainers
Headers show
Series [net,v2] tcp: fix page frag corruption on page fault | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 3056 this patch: 3056
netdev/cc_maintainers fail 1 blamed authors not CCed: davem@davemloft.net; 2 maintainers not CCed: kuba@kernel.org davem@davemloft.net
netdev/build_clang success Errors and warnings before: 422 this patch: 422
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 3192 this patch: 3192
netdev/checkpatch warning WARNING: line length of 87 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Paolo Abeni Nov. 26, 2021, 6:34 p.m. UTC
Steffen reported a TCP stream corruption for HTTP requests
served by the apache web-server using a cifs mount-point
and memory mapping the relevant file.

The root cause is quite similar to the one addressed by
commit 20eb4f29b602 ("net: fix sk_page_frag() recursion from
memory reclaim"). Here the nested access to the task page frag
is caused by a page fault on the (mmapped) user-space memory
buffer coming from the cifs file.

The page fault handler performs an smb transaction on a different
socket, inside the same process context. Since sk->sk_allaction
for such socket does not prevent the usage for the task_frag,
the nested allocation modify "under the hood" the page frag
in use by the outer sendmsg call, corrupting the stream.

The overall relevant stack trace looks like the following:

httpd 78268 [001] 3461630.850950:      probe:tcp_sendmsg_locked:
        ffffffff91461d91 tcp_sendmsg_locked+0x1
        ffffffff91462b57 tcp_sendmsg+0x27
        ffffffff9139814e sock_sendmsg+0x3e
        ffffffffc06dfe1d smb_send_kvec+0x28
        [...]
        ffffffffc06cfaf8 cifs_readpages+0x213
        ffffffff90e83c4b read_pages+0x6b
        ffffffff90e83f31 __do_page_cache_readahead+0x1c1
        ffffffff90e79e98 filemap_fault+0x788
        ffffffff90eb0458 __do_fault+0x38
        ffffffff90eb5280 do_fault+0x1a0
        ffffffff90eb7c84 __handle_mm_fault+0x4d4
        ffffffff90eb8093 handle_mm_fault+0xc3
        ffffffff90c74f6d __do_page_fault+0x1ed
        ffffffff90c75277 do_page_fault+0x37
        ffffffff9160111e page_fault+0x1e
        ffffffff9109e7b5 copyin+0x25
        ffffffff9109eb40 _copy_from_iter_full+0xe0
        ffffffff91462370 tcp_sendmsg_locked+0x5e0
        ffffffff91462370 tcp_sendmsg_locked+0x5e0
        ffffffff91462b57 tcp_sendmsg+0x27
        ffffffff9139815c sock_sendmsg+0x4c
        ffffffff913981f7 sock_write_iter+0x97
        ffffffff90f2cc56 do_iter_readv_writev+0x156
        ffffffff90f2dff0 do_iter_write+0x80
        ffffffff90f2e1c3 vfs_writev+0xa3
        ffffffff90f2e27c do_writev+0x5c
        ffffffff90c042bb do_syscall_64+0x5b
        ffffffff916000ad entry_SYSCALL_64_after_hwframe+0x65

The cifs filesystem rightfully sets sk_allocations to GFP_NOFS,
we can avoid the nesting using the sk page frag for allocation
lacking the __GFP_FS flag. Do not define an additional mm-helper
for that, as this is strictly tied to the sk page frag usage.

v1 -> v2:
 - use a stricted sk_page_frag() check instead of reordering the
   code (Eric)

Reported-by: Steffen Froemer <sfroemer@redhat.com>
Fixes: 5640f7685831 ("net: use a per task frag allocator")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/sock.h | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

Comments

Eric Dumazet Nov. 27, 2021, 12:26 a.m. UTC | #1
On Fri, Nov 26, 2021 at 10:35 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Steffen reported a TCP stream corruption for HTTP requests
> served by the apache web-server using a cifs mount-point
> and memory mapping the relevant file.
>
> The root cause is quite similar to the one addressed by
> commit 20eb4f29b602 ("net: fix sk_page_frag() recursion from
> memory reclaim"). Here the nested access to the task page frag
> is caused by a page fault on the (mmapped) user-space memory
> buffer coming from the cifs file.
>
> The page fault handler performs an smb transaction on a different
> socket, inside the same process context. Since sk->sk_allaction
> for such socket does not prevent the usage for the task_frag,
> the nested allocation modify "under the hood" the page frag
> in use by the outer sendmsg call, corrupting the stream.
>

> v1 -> v2:
>  - use a stricted sk_page_frag() check instead of reordering the
>    code (Eric)
>
> Reported-by: Steffen Froemer <sfroemer@redhat.com>
> Fixes: 5640f7685831 ("net: use a per task frag allocator")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---

SGTM, thanks !

Reviewed-by: Eric Dumazet <edumazet@google.com>
patchwork-bot+netdevbpf@kernel.org Nov. 29, 2021, 12:20 p.m. UTC | #2
Hello:

This patch was applied to netdev/net.git (master)
by David S. Miller <davem@davemloft.net>:

On Fri, 26 Nov 2021 19:34:21 +0100 you wrote:
> Steffen reported a TCP stream corruption for HTTP requests
> served by the apache web-server using a cifs mount-point
> and memory mapping the relevant file.
> 
> The root cause is quite similar to the one addressed by
> commit 20eb4f29b602 ("net: fix sk_page_frag() recursion from
> memory reclaim"). Here the nested access to the task page frag
> is caused by a page fault on the (mmapped) user-space memory
> buffer coming from the cifs file.
> 
> [...]

Here is the summary with links:
  - [net,v2] tcp: fix page frag corruption on page fault
    https://git.kernel.org/netdev/net/c/dacb5d8875cc

You are awesome, thank you!
diff mbox series

Patch

diff --git a/include/net/sock.h b/include/net/sock.h
index b32906e1ab55..715cdb4b2b79 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2430,19 +2430,22 @@  static inline void sk_stream_moderate_sndbuf(struct sock *sk)
  * @sk: socket
  *
  * Use the per task page_frag instead of the per socket one for
- * optimization when we know that we're in the normal context and owns
+ * optimization when we know that we're in process context and own
  * everything that's associated with %current.
  *
- * gfpflags_allow_blocking() isn't enough here as direct reclaim may nest
- * inside other socket operations and end up recursing into sk_page_frag()
- * while it's already in use.
+ * Both direct reclaim and page faults can nest inside other
+ * socket operations and end up recursing into sk_page_frag()
+ * while it's already in use: explicitly avoid task page_frag
+ * usage if the caller is potentially doing any of them.
+ * This assumes that page fault handlers use the GFP_NOFS flags.
  *
  * Return: a per task page_frag if context allows that,
  * otherwise a per socket one.
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if (gfpflags_normal_context(sk->sk_allocation))
+	if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
+	    (__GFP_DIRECT_RECLAIM | __GFP_FS))
 		return &current->task_frag;
 
 	return &sk->sk_frag;