Message ID | 1a369325ac2d4a604a074428f58fa72a6065e197.1670929442.git.bcodding@redhat.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | Stop corrupting socket's task_frag | expand |
On Tue, 2022-12-13 at 06:10 -0500, Benjamin Coddington wrote: > From: Guillaume Nault <gnault@redhat.com> > > Sockets that can be used while recursing into memory reclaim, like > those used by network block devices and file systems, mustn't use > current->task_frag: if the current process is already using it, then > the inner memory reclaim call would corrupt the task_frag structure. > > To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets > that mustn't use current->task_frag, assuming that those used during > memory reclaim had their allocation constraints reflected in > ->sk_allocation. > > This unfortunately doesn't cover all cases: in an attempt to remove all > usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in > ->sk_allocation, and used memalloc_nofs critical sections instead. > This breaks the sk_page_frag() heuristic since the allocation > constraints are now stored in current->flags, which sk_page_frag() > can't read without risking triggering a cache miss and slowing down > TCP's fast path. > > This patch creates a new field in struct sock, named sk_use_task_frag, > which sockets with memory reclaim constraints can set to false if they > can't safely use current->task_frag. In such cases, sk_page_frag() now > always returns the socket's page_frag (->sk_frag). The first user is > sunrpc, which needs to avoid using current->task_frag but can keep > ->sk_allocation set to GFP_KERNEL otherwise. > > Eventually, it might be possible to simplify sk_page_frag() by only > testing ->sk_use_task_frag and avoid relying on the ->sk_allocation > heuristic entirely (assuming other sockets will set ->sk_use_task_frag > according to their constraints in the future). > > The new ->sk_use_task_frag field is placed in a hole in struct sock and > belongs to a cache line shared with ->sk_shutdown. Therefore it should > be hot and shouldn't have negative performance impacts on TCP's fast > path (sk_shutdown is tested just before the while() loop in > tcp_sendmsg_locked()). > > Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/ > Signed-off-by: Guillaume Nault <gnault@redhat.com> > Reviewed-by: Benjamin Coddington <bcodding@redhat.com> > --- > include/net/sock.h | 11 +++++++++-- > net/core/sock.c | 1 + > 2 files changed, 10 insertions(+), 2 deletions(-) > > diff --git a/include/net/sock.h b/include/net/sock.h > index e0517ecc6531..44380c6dc6c4 100644 > --- a/include/net/sock.h > +++ b/include/net/sock.h > @@ -318,6 +318,9 @@ struct sk_filter; > * @sk_stamp: time stamp of last packet received > * @sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only > * @sk_tsflags: SO_TIMESTAMPING flags > + * @sk_use_task_frag: allow sk_page_frag() to use current->task_frag. > + * Sockets that can be used under memory reclaim should > + * set this to false. > * @sk_bind_phc: SO_TIMESTAMPING bind PHC index of PTP virtual clock > * for timestamping > * @sk_tskey: counter to disambiguate concurrent tstamp requests > @@ -505,6 +508,7 @@ struct sock { > #endif > u16 sk_tsflags; > u8 sk_shutdown; > + bool sk_use_task_frag; > atomic_t sk_tskey; > atomic_t sk_zckey; I'm sorry, but after the post PR -net -> net-next merge, this does not apply cleanly any-more, you need to rebase it once more. Note that commit b534dc46c8ae ("net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP") moved the surronding fields a bit but there is still one byte hole after sk_txtime_unused, before sk_socket. Thanks! Paolo
diff --git a/include/net/sock.h b/include/net/sock.h index e0517ecc6531..44380c6dc6c4 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -318,6 +318,9 @@ struct sk_filter; * @sk_stamp: time stamp of last packet received * @sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only * @sk_tsflags: SO_TIMESTAMPING flags + * @sk_use_task_frag: allow sk_page_frag() to use current->task_frag. + * Sockets that can be used under memory reclaim should + * set this to false. * @sk_bind_phc: SO_TIMESTAMPING bind PHC index of PTP virtual clock * for timestamping * @sk_tskey: counter to disambiguate concurrent tstamp requests @@ -505,6 +508,7 @@ struct sock { #endif u16 sk_tsflags; u8 sk_shutdown; + bool sk_use_task_frag; atomic_t sk_tskey; atomic_t sk_zckey; @@ -2561,14 +2565,17 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk) * socket operations and end up recursing into sk_page_frag() * while it's already in use: explicitly avoid task page_frag * usage if the caller is potentially doing any of them. - * This assumes that page fault handlers use the GFP_NOFS flags. + * This assumes that page fault handlers use the GFP_NOFS flags or + * explicitly disable sk_use_task_frag. * * Return: a per task page_frag if context allows that, * otherwise a per socket one. */ static inline struct page_frag *sk_page_frag(struct sock *sk) { - if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) == + if (sk->sk_use_task_frag && + (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | + __GFP_FS)) == (__GFP_DIRECT_RECLAIM | __GFP_FS)) return ¤t->task_frag; diff --git a/net/core/sock.c b/net/core/sock.c index a3ba0358c77c..cc113500d442 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -3368,6 +3368,7 @@ void sock_init_data(struct socket *sock, struct sock *sk) sk->sk_rcvbuf = READ_ONCE(sysctl_rmem_default); sk->sk_sndbuf = READ_ONCE(sysctl_wmem_default); sk->sk_state = TCP_CLOSE; + sk->sk_use_task_frag = true; sk_set_socket(sk, sock); sock_set_flag(sk, SOCK_ZAPPED);