Message ID | d9041e542ade6af472c7be14b5a28856692815cf.1669036433.git.bcodding@redhat.com
---|---
State | Changes Requested
Delegated to | Netdev Maintainers
Series | Stop corrupting socket's task_frag
On Mon, 2022-11-21 at 08:35 -0500, Benjamin Coddington wrote:
> From: Guillaume Nault <gnault@redhat.com>
>
> Sockets that can be used while recursing into memory reclaim, like
> those used by network block devices and file systems, mustn't use
> current->task_frag: if the current process is already using it, then
> the inner memory reclaim call would corrupt the task_frag structure.
>
> To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets
> that mustn't use current->task_frag, assuming that those used during
> memory reclaim had their allocation constraints reflected in
> ->sk_allocation.
>
> This unfortunately doesn't cover all cases: in an attempt to remove all
> usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in
> ->sk_allocation, and used memalloc_nofs critical sections instead.
> This breaks the sk_page_frag() heuristic since the allocation
> constraints are now stored in current->flags, which sk_page_frag()
> can't read without risking triggering a cache miss and slowing down
> TCP's fast path.
>
> This patch creates a new field in struct sock, named sk_use_task_frag,
> which sockets with memory reclaim constraints can set to false if they
> can't safely use current->task_frag. In such cases, sk_page_frag() now
> always returns the socket's page_frag (->sk_frag). The first user is
> sunrpc, which needs to avoid using current->task_frag but can keep
> ->sk_allocation set to GFP_KERNEL otherwise.
>
> Eventually, it might be possible to simplify sk_page_frag() by only
> testing ->sk_use_task_frag and avoid relying on the ->sk_allocation
> heuristic entirely (assuming other sockets will set ->sk_use_task_frag
> according to their constraints in the future).
>
> The new ->sk_use_task_frag field is placed in a hole in struct sock and
> belongs to a cache line shared with ->sk_shutdown. Therefore it should
> be hot and shouldn't have negative performance impacts on TCP's fast
> path (sk_shutdown is tested just before the while() loop in
> tcp_sendmsg_locked()).
>
> Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
> Signed-off-by: Guillaume Nault <gnault@redhat.com>
> ---
>  include/net/sock.h | 11 +++++++++--
>  net/core/sock.c    |  1 +
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index d08cfe190a78..ffba9e95470d 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -318,6 +318,9 @@ struct sk_filter;
>   *	@sk_stamp: time stamp of last packet received
>   *	@sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
>   *	@sk_tsflags: SO_TIMESTAMPING flags
> + *	@sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
> + *			   Sockets that can be used under memory reclaim should
> + *			   set this to false.
>   *	@sk_bind_phc: SO_TIMESTAMPING bind PHC index of PTP virtual clock
>   *		      for timestamping
>   *	@sk_tskey: counter to disambiguate concurrent tstamp requests
> @@ -504,6 +507,7 @@ struct sock {
>  #endif
>  	u16			sk_tsflags;
>  	u8			sk_shutdown;
> +	bool			sk_use_task_frag;
>  	atomic_t		sk_tskey;
>  	atomic_t		sk_zckey;

I think the above should be fine from a data locality PoV, as the used
cacheline should be hot at sk_page_frag_refill() usage time, as
sk_tsflags has been accessed just before.

@Eric, does the above fit with the planned sock fields reordering?

Jakub noted we could use a bitfield here to be future-proof for
additional flag additions. I think in this specific case a bool is
preferable, because we actually want to discourage people from adding
more of such flags, and the search for holes (or the bool -> bitflag
conversion) should give such eventual future changes some additional
thought.

Thanks!

Paolo
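To make the corruption scenario from the commit message concrete, here is a
minimal userspace analogy (purely illustrative; the thread-local buffer and
function names stand in for current->task_frag and the kernel send paths,
and are not kernel code):

#include <stdio.h>
#include <string.h>

/* Stand-in for current->task_frag: one scratch area per task. */
static __thread char task_frag[16];

/* Stand-in for a send on a reclaim-path socket (e.g. writeback to a
 * network filesystem): it reuses the same per-task scratch state. */
static void reclaim_path_send(void)
{
	strcpy(task_frag, "inner");
}

/* Stand-in for an ordinary sendmsg that enters memory reclaim while
 * its own data is still staged in the per-task scratch area. */
static void sendmsg_path(void)
{
	strcpy(task_frag, "outer");	/* in-flight state */
	reclaim_path_send();		/* "recursing into memory reclaim" */
	printf("outer now sees: %s\n", task_frag);	/* prints "inner": corrupted */
}

int main(void)
{
	sendmsg_path();
	return 0;
}

This is exactly why a socket that can be driven from reclaim must fall back
to its own per-socket ->sk_frag instead of current->task_frag.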
On Fri, Dec 9, 2022 at 1:09 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Mon, 2022-11-21 at 08:35 -0500, Benjamin Coddington wrote:
> > From: Guillaume Nault <gnault@redhat.com>
> >
> > Sockets that can be used while recursing into memory reclaim, like
> > those used by network block devices and file systems, mustn't use
> > current->task_frag: if the current process is already using it, then
> > the inner memory reclaim call would corrupt the task_frag structure.
> >
[...]
>
> I think the above should be fine from a data locality PoV, as the used
> cacheline should be hot at sk_page_frag_refill() usage time, as
> sk_tsflags has been accessed just before.
>
> @Eric, does the above fit with the planned sock fields reordering?

Do not worry about that, this can be handled later if needed.

> Jakub noted we could use a bitfield here to be future-proof for
> additional flag additions. I think in this specific case a bool is
> preferable, because we actually want to discourage people from adding
> more of such flags, and the search for holes (or the bool -> bitflag
> conversion) should give such eventual future changes some additional
> thought.
>
> Thanks!
>
> Paolo
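For context on the heuristic breakage being discussed: sunrpc now expresses
its reclaim constraint with a scoped-allocation critical section in
current->flags rather than via ->sk_allocation. A simplified sketch of that
pattern (the wrapper name is illustrative, not a verbatim sunrpc excerpt):

#include <linux/net.h>
#include <linux/sched/mm.h>

static int sendmsg_nofs_sketch(struct socket *sock, struct msghdr *msg)
{
	unsigned int nofs;
	int ret;

	/* Sets PF_MEMALLOC_NOFS in current->flags, so allocations below
	 * implicitly behave like GFP_NOFS... */
	nofs = memalloc_nofs_save();
	/* ...which lets ->sk_allocation stay GFP_KERNEL, meaning
	 * sk_page_frag() can no longer infer the constraint from it. */
	ret = sock_sendmsg(sock, msg);
	memalloc_nofs_restore(nofs);
	return ret;
}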
diff --git a/include/net/sock.h b/include/net/sock.h
index d08cfe190a78..ffba9e95470d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -318,6 +318,9 @@ struct sk_filter;
  *	@sk_stamp: time stamp of last packet received
  *	@sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
  *	@sk_tsflags: SO_TIMESTAMPING flags
+ *	@sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
+ *			   Sockets that can be used under memory reclaim should
+ *			   set this to false.
  *	@sk_bind_phc: SO_TIMESTAMPING bind PHC index of PTP virtual clock
  *		      for timestamping
  *	@sk_tskey: counter to disambiguate concurrent tstamp requests
@@ -504,6 +507,7 @@ struct sock {
 #endif
 	u16			sk_tsflags;
 	u8			sk_shutdown;
+	bool			sk_use_task_frag;
 	atomic_t		sk_tskey;
 	atomic_t		sk_zckey;
@@ -2536,14 +2540,17 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
  * socket operations and end up recursing into sk_page_frag()
  * while it's already in use: explicitly avoid task page_frag
  * usage if the caller is potentially doing any of them.
- * This assumes that page fault handlers use the GFP_NOFS flags.
+ * This assumes that page fault handlers use the GFP_NOFS flags or
+ * explicitly disable sk_use_task_frag.
  *
  * Return: a per task page_frag if context allows that,
  * otherwise a per socket one.
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
+	if (sk->sk_use_task_frag &&
+	    (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC |
+				  __GFP_FS)) ==
 	    (__GFP_DIRECT_RECLAIM | __GFP_FS))
 		return &current->task_frag;

diff --git a/net/core/sock.c b/net/core/sock.c
index 788c1372663c..1ab781be9fbe 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3314,6 +3314,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_rcvbuf	= READ_ONCE(sysctl_rmem_default);
 	sk->sk_sndbuf	= READ_ONCE(sysctl_wmem_default);
 	sk->sk_state	= TCP_CLOSE;
+	sk->sk_use_task_frag = true;
 	sk_set_socket(sk, sock);

 	sock_set_flag(sk, SOCK_ZAPPED);
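Since sock_init_data() defaults sk_use_task_frag to true, a socket that can
be used while recursing into memory reclaim has to opt out explicitly. A
sketch of what that looks like for a kernel socket (the helper name is made
up; the sunrpc patch in this series does the equivalent for its transport
sockets):

#include <linux/net.h>
#include <net/sock.h>

/* Hypothetical helper: create a TCP socket that is safe to use from
 * the memory-reclaim path. */
static int reclaim_safe_tcp_sock(struct net *net, struct socket **res)
{
	int err;

	err = sock_create_kern(net, PF_INET, SOCK_STREAM, IPPROTO_TCP, res);
	if (err < 0)
		return err;

	/* This socket may send while reclaiming memory, so it must never
	 * borrow current->task_frag; sk_page_frag() will then always
	 * return the per-socket ->sk_frag instead. */
	(*res)->sk->sk_use_task_frag = false;
	return 0;
}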