Message ID | 20230920132545.56834-2-wuyun.abel@bytedance.com |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Series | [net-next,1/2] sock: Code cleanup on __sk_mem_raise_allocated() |
On Wed, Sep 20, 2023 at 09:25:41PM +0800, Abel Wu wrote:
> Before sockets became aware of net-memcg's memory pressure in
> commit e1aab161e013 ("socket: initial cgroup code."), memory
> usage was granted to rise if below average, even when under the
> protocol's pressure. This provides fairness among the sockets of
> the same protocol.
>
> That commit changed this, because the heuristic now also takes
> effect when only the memcg is under pressure, which makes no
> sense. Fix this by skipping the heuristic when under memcg
> pressure.
>
> Fixes: e1aab161e013 ("socket: initial cgroup code.")
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
>  net/core/sock.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 379eb8b65562..ef5cf6250f17 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -3093,8 +3093,16 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>  	if (sk_has_memory_pressure(sk)) {
>  		u64 alloc;
>
> -		if (!sk_under_memory_pressure(sk))
> +		if (memcg && mem_cgroup_under_socket_pressure(memcg))
> +			goto suppress_allocation;
> +
> +		if (!sk_under_global_memory_pressure(sk))
>  			return 1;

I am on board with replacing sk_under_memory_pressure() with
sk_under_global_memory_pressure(). However, suppressing on memcg
pressure is a behavior change from the status quo and needs more
thought and testing.

I think there are three options for this hunk:

1. The proposed patch.
2. Consider memcg pressure only for !in_softirq().
3. Don't consider memcg pressure at all.

All three options are behavior changes from the status quo, but with
different risk levels. (1) may reintroduce the regression fixed by
720ca52bcef22 ("net-memcg: avoid stalls when under memory pressure").
(2) is more in line with 720ca52bcef22. (3) risks making memcg limits
ineffective.

IMHO we should go with (2), as there is already a precedent in
720ca52bcef22.

thanks,
Shakeel
On 9/22/23 3:01 AM, Shakeel Butt wrote:
> On Wed, Sep 20, 2023 at 09:25:41PM +0800, Abel Wu wrote:
>> Before sockets became aware of net-memcg's memory pressure in
>> commit e1aab161e013 ("socket: initial cgroup code."), memory
>> usage was granted to rise if below average, even when under the
>> protocol's pressure. This provides fairness among the sockets of
>> the same protocol.
>>
>> That commit changed this, because the heuristic now also takes
>> effect when only the memcg is under pressure, which makes no
>> sense. Fix this by skipping the heuristic when under memcg
>> pressure.
>>
>> Fixes: e1aab161e013 ("socket: initial cgroup code.")
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>> ---
>>  net/core/sock.c | 10 +++++++++-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 379eb8b65562..ef5cf6250f17 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -3093,8 +3093,16 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>>  	if (sk_has_memory_pressure(sk)) {
>>  		u64 alloc;
>>
>> -		if (!sk_under_memory_pressure(sk))
>> +		if (memcg && mem_cgroup_under_socket_pressure(memcg))
>> +			goto suppress_allocation;
>> +
>> +		if (!sk_under_global_memory_pressure(sk))
>>  			return 1;
>
> I am on board with replacing sk_under_memory_pressure() with
> sk_under_global_memory_pressure(). However, suppressing on memcg
> pressure is a behavior change from the status quo and needs more
> thought and testing.
>
> I think there are three options for this hunk:
>
> 1. The proposed patch.
> 2. Consider memcg pressure only for !in_softirq().
> 3. Don't consider memcg pressure at all.
>
> All three options are behavior changes from the status quo, but with
> different risk levels. (1) may reintroduce the regression fixed by
> 720ca52bcef22 ("net-memcg: avoid stalls when under memory pressure").

Just for the record, the current upstream implementation behaves the
same once the socket reaches average usage. Taking option (2) will fix
this too.

> (2) is more in line with 720ca52bcef22. (3) risks making memcg limits
> ineffective.
>
> IMHO we should go with (2), as there is already a precedent in
> 720ca52bcef22.

Yes, I agree. Actually, applying option (2) would make this patch quite
similar to the previous version[a], except for the part below:

	/* Under limit. */
	if (allocated <= sk_prot_mem_limits(sk, 0)) {
		sk_leave_memory_pressure(sk);
-		return 1;
+		if (!under_memcg_pressure)
+			return 1;
	}

My original thought was to inherit the behavior of tcpmem pressure.
There are also 3 levels of memcg pressure, named low/medium/critical,
but considering that the 'low' level is too conservative for socket
allocation, I made the following match:

	PROTOCOL	MEMCG		ACTION
	-----------------------------------------------------
	low		<medium		allow allocation
	pressure	medium		be more conservative
	high		critical	throttle

which also seems aligned with the design[b] of memcg pressure. Anyway,
I will take option (2) and post v2.

Thanks & Best,
	Abel

[a] https://lore.kernel.org/lkml/20230901062141.51972-4-wuyun.abel@bytedance.com/
[b] https://docs.kernel.org/admin-guide/cgroup-v1/memory.html#memory-pressure
On 9/22/23 4:36 PM, Abel Wu wrote:
> On 9/22/23 3:01 AM, Shakeel Butt wrote:
>> On Wed, Sep 20, 2023 at 09:25:41PM +0800, Abel Wu wrote:
>>> Before sockets became aware of net-memcg's memory pressure in
>>> commit e1aab161e013 ("socket: initial cgroup code."), memory
>>> usage was granted to rise if below average, even when under the
>>> protocol's pressure. This provides fairness among the sockets of
>>> the same protocol.
>>>
>>> That commit changed this, because the heuristic now also takes
>>> effect when only the memcg is under pressure, which makes no
>>> sense. Fix this by skipping the heuristic when under memcg
>>> pressure.
>>>
>>> Fixes: e1aab161e013 ("socket: initial cgroup code.")
>>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>>> ---
>>>  net/core/sock.c | 10 +++++++++-
>>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/core/sock.c b/net/core/sock.c
>>> index 379eb8b65562..ef5cf6250f17 100644
>>> --- a/net/core/sock.c
>>> +++ b/net/core/sock.c
>>> @@ -3093,8 +3093,16 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>>>  	if (sk_has_memory_pressure(sk)) {
>>>  		u64 alloc;
>>>
>>> -		if (!sk_under_memory_pressure(sk))
>>> +		if (memcg && mem_cgroup_under_socket_pressure(memcg))
>>> +			goto suppress_allocation;
>>> +
>>> +		if (!sk_under_global_memory_pressure(sk))
>>>  			return 1;
>>
>> I am on board with replacing sk_under_memory_pressure() with
>> sk_under_global_memory_pressure(). However, suppressing on memcg
>> pressure is a behavior change from the status quo and needs more
>> thought and testing.
>>
>> I think there are three options for this hunk:
>>
>> 1. The proposed patch.
>> 2. Consider memcg pressure only for !in_softirq().
>> 3. Don't consider memcg pressure at all.
>>
>> All three options are behavior changes from the status quo, but with
>> different risk levels. (1) may reintroduce the regression fixed by
>> 720ca52bcef22 ("net-memcg: avoid stalls when under memory pressure").
>
> Just for the record, the current upstream implementation behaves the
> same once the socket reaches average usage. Taking option (2) will
> fix this too.
>
>> (2) is more in line with 720ca52bcef22. (3) risks making memcg
>> limits ineffective.
>>
>> IMHO we should go with (2), as there is already a precedent in
>> 720ca52bcef22.
>
> Yes, I agree. Actually, applying option (2) would make this patch
> quite similar to the previous version[a], except for the part below:
>
> 	/* Under limit. */
> 	if (allocated <= sk_prot_mem_limits(sk, 0)) {
> 		sk_leave_memory_pressure(sk);
> -		return 1;
> +		if (!under_memcg_pressure)
> +			return 1;
> 	}

After a second thought, it is still vague to me what position memcg
pressure should hold in socket memory allocation. It lacks a convincing
design. I think the above hunk helps, but not much.

I wonder if we should take option (3) first. Thoughts?

Thanks,
	Abel

>
> My original thought was to inherit the behavior of tcpmem pressure.
> There are also 3 levels of memcg pressure, named low/medium/critical,
> but considering that the 'low' level is too conservative for socket
> allocation, I made the following match:
>
> 	PROTOCOL	MEMCG		ACTION
> 	-----------------------------------------------------
> 	low		<medium		allow allocation
> 	pressure	medium		be more conservative
> 	high		critical	throttle
>
> which also seems aligned with the design[b] of memcg pressure. Anyway,
> I will take option (2) and post v2.
>
> Thanks & Best,
> 	Abel
>
> [a] https://lore.kernel.org/lkml/20230901062141.51972-4-wuyun.abel@bytedance.com/
> [b] https://docs.kernel.org/admin-guide/cgroup-v1/memory.html#memory-pressure
On Fri, Sep 22, 2023 at 06:10:06PM +0800, Abel Wu wrote:
[...]
>
> After a second thought, it is still vague to me what position memcg
> pressure should hold in socket memory allocation. It lacks a
> convincing design. I think the above hunk helps, but not much.
>
> I wonder if we should take option (3) first. Thoughts?
>

Let's take a step further and decouple the memcg accounting from the
global skmem accounting. __sk_mem_raise_allocated() is already very
hard to reason about. There are a couple of heuristics in it which may
or may not apply to both accounting infrastructures.

Let's explicitly document which heuristics allow allocations to
forcefully succeed, i.e. irrespective of pressure or being over the
limit, for both accounting infras. I think decoupling them would make
the flow of the code very clear.

There are three heuristics:

1. Minimum buffer size, even under pressure.
2. Allow allocation for a socket whose usage is below the average of
   the system.
3. The socket is over its sndbuf.

Let's discuss which heuristic applies to which accounting infra, and
under which state (under pressure or over limit).

thanks,
Shakeel
On 9/24/23 3:28 PM, Shakeel Butt wrote:
> On Fri, Sep 22, 2023 at 06:10:06PM +0800, Abel Wu wrote:
> [...]
>>
>> After a second thought, it is still vague to me what position memcg
>> pressure should hold in socket memory allocation. It lacks a
>> convincing design. I think the above hunk helps, but not much.
>>
>> I wonder if we should take option (3) first. Thoughts?
>>
>
> Let's take a step further and decouple the memcg accounting from the
> global skmem accounting. __sk_mem_raise_allocated() is already very
> hard to reason about. There are a couple of heuristics in it which
> may or may not apply to both accounting infrastructures.
>
> Let's explicitly document which heuristics allow allocations to
> forcefully succeed, i.e. irrespective of pressure or being over the
> limit, for both accounting infras. I think decoupling them would
> make the flow of the code very clear.

I can't agree more.

>
> There are three heuristics:

I found that all of them were first introduced in linux-2.4.0-test7pre1
for TCP only, and then migrated to the socket core in linux-2.6.8-rc1
without functional change.

>
> 1. Minimum buffer size, even under pressure.

This is required by RFC 7323 (TCP Extensions for High Performance) to
make features like the Window Scale option work as expected, and such
allocations should succeed under global pressure by tcp_{r,w}mem's
definition. IMHO, for the same reason, they should also succeed under
memcg pressure, or else workloads might suffer a performance drop due
to a bottleneck on the network.

The allocation must not succeed if either the global or the memcg's
hard limit is exceeded, or else a DoS attack could be mounted by
spawning lots of sockets that are under the minimum buffer size.

>
> 2. Allow allocation for a socket whose usage is below the average of
>    the system.

Since 'average' is within the scope of global accounting, this one only
makes sense under global memory pressure. Actually this existed before
cgroups were born, hence it doesn't take memcg into consideration.

While OTOH the intention of throttling under memcg pressure is to
relieve the memcg from heavy reclaim pressure, this heuristic does not
help with that. And there also seems to be no reason to let the
allocation succeed when the global or memcg's hard limit is exceeded.

>
> 3. The socket is over its sndbuf.

TBH I don't get its point..

>
> Let's discuss which heuristic applies to which accounting infra, and
> under which state (under pressure or over limit).

I will follow your suggestion and post a patch to explicitly document
the behaviors once things are cleared up.

Thanks,
	Abel
Gentle ping :)

On 10/3/23 8:49 PM, Abel Wu wrote:
> On 9/24/23 3:28 PM, Shakeel Butt wrote:
>> On Fri, Sep 22, 2023 at 06:10:06PM +0800, Abel Wu wrote:
>> [...]
>>>
>>> After a second thought, it is still vague to me what position memcg
>>> pressure should hold in socket memory allocation. It lacks a
>>> convincing design. I think the above hunk helps, but not much.
>>>
>>> I wonder if we should take option (3) first. Thoughts?
>>>
>>
>> Let's take a step further and decouple the memcg accounting from the
>> global skmem accounting. __sk_mem_raise_allocated() is already very
>> hard to reason about. There are a couple of heuristics in it which
>> may or may not apply to both accounting infrastructures.
>>
>> Let's explicitly document which heuristics allow allocations to
>> forcefully succeed, i.e. irrespective of pressure or being over the
>> limit, for both accounting infras. I think decoupling them would
>> make the flow of the code very clear.
>
> I can't agree more.
>
>>
>> There are three heuristics:
>
> I found that all of them were first introduced in
> linux-2.4.0-test7pre1 for TCP only, and then migrated to the socket
> core in linux-2.6.8-rc1 without functional change.
>
>>
>> 1. Minimum buffer size, even under pressure.
>
> This is required by RFC 7323 (TCP Extensions for High Performance) to
> make features like the Window Scale option work as expected, and such
> allocations should succeed under global pressure by tcp_{r,w}mem's
> definition. IMHO, for the same reason, they should also succeed under
> memcg pressure, or else workloads might suffer a performance drop due
> to a bottleneck on the network.
>
> The allocation must not succeed if either the global or the memcg's
> hard limit is exceeded, or else a DoS attack could be mounted by
> spawning lots of sockets that are under the minimum buffer size.
>
>>
>> 2. Allow allocation for a socket whose usage is below the average of
>>    the system.
>
> Since 'average' is within the scope of global accounting, this one
> only makes sense under global memory pressure. Actually this existed
> before cgroups were born, hence it doesn't take memcg into
> consideration.
>
> While OTOH the intention of throttling under memcg pressure is to
> relieve the memcg from heavy reclaim pressure, this heuristic does
> not help with that. And there also seems to be no reason to let the
> allocation succeed when the global or memcg's hard limit is exceeded.
>
>>
>> 3. The socket is over its sndbuf.
>
> TBH I don't get its point..
>
>>
>> Let's discuss which heuristic applies to which accounting infra, and
>> under which state (under pressure or over limit).
>
> I will follow your suggestion and post a patch to explicitly document
> the behaviors once things are cleared up.
>
> Thanks,
> 	Abel
On Tue, Oct 03, 2023 at 08:49:08PM +0800, Abel Wu wrote:
[...]
>> 1. Minimum buffer size, even under pressure.
>
> This is required by RFC 7323 (TCP Extensions for High Performance) to
> make features like the Window Scale option work as expected, and such
> allocations should succeed under global pressure by tcp_{r,w}mem's
> definition. IMHO, for the same reason, they should also succeed under
> memcg pressure, or else workloads might suffer a performance drop due
> to a bottleneck on the network.
>
> The allocation must not succeed if either the global or the memcg's
> hard limit is exceeded, or else a DoS attack could be mounted by
> spawning lots of sockets that are under the minimum buffer size.
>

Sounds good.

>>
>> 2. Allow allocation for a socket whose usage is below the average of
>>    the system.
>
> Since 'average' is within the scope of global accounting, this one
> only makes sense under global memory pressure. Actually this existed
> before cgroups were born, hence it doesn't take memcg into
> consideration.
>
> While OTOH the intention of throttling under memcg pressure is to
> relieve the memcg from heavy reclaim pressure, this heuristic does
> not help with that. And there also seems to be no reason to let the
> allocation succeed when the global or memcg's hard limit is exceeded.
>

Sounds good too.

>>
>> 3. The socket is over its sndbuf.
>
> TBH I don't get its point..
>

So, this corresponds to the following code in __sk_mem_raise_allocated():

	if (kind == SK_MEM_SEND && sk->sk_type == SOCK_STREAM) {
		sk_stream_moderate_sndbuf(sk);

		/* Fail only if socket is _under_ its sndbuf.
		 * In this case we cannot block, so that we have to fail.
		 */
		if (sk->sk_wmem_queued + size >= sk->sk_sndbuf) {
			/* Force charge with __GFP_NOFAIL */
			if (memcg_charge && !charged) {
				mem_cgroup_charge_skmem(sk->sk_memcg, amt,
					gfp_memcg_charge() | __GFP_NOFAIL);
			}
			return 1;
		}
	}

Here we moderate sk_sndbuf down to possibly half of sk_wmem_queued, and
thus we always succeed unless the user has set SO_SNDBUF on the socket,
in which case it interacts with sk_stream_wait_memory() called from
sendmsg. I am not really able to make sense of the interaction between
this code and sk_stream_wait_memory(), and will punt to networking
experts to shed some light on it. Other than that, I think we need to
answer whether we want to moderate the sndbuf on memcg charge failure.

>>
>> Let's discuss which heuristic applies to which accounting infra, and
>> under which state (under pressure or over limit).
>
> I will follow your suggestion and post a patch to explicitly document
> the behaviors once things are cleared up.
>

Let's just post the patch and see what other folks comment as well.
diff --git a/net/core/sock.c b/net/core/sock.c
index 379eb8b65562..ef5cf6250f17 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3093,8 +3093,16 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 	if (sk_has_memory_pressure(sk)) {
 		u64 alloc;
 
-		if (!sk_under_memory_pressure(sk))
+		if (memcg && mem_cgroup_under_socket_pressure(memcg))
+			goto suppress_allocation;
+
+		if (!sk_under_global_memory_pressure(sk))
 			return 1;
+
+		/* Trying to be fair among all the sockets under the
+		 * protocol's memory pressure, by allowing the ones
+		 * that below average usage to raise.
+		 */
 		alloc = sk_sockets_allocated_read_positive(sk);
 		if (sk_prot_mem_limits(sk, 2) > alloc *
 		    sk_mem_pages(sk->sk_wmem_queued +
Before sockets became aware of net-memcg's memory pressure in commit
e1aab161e013 ("socket: initial cgroup code."), memory usage was granted
to rise if below average, even when under the protocol's pressure. This
provides fairness among the sockets of the same protocol.

That commit changed this, because the heuristic now also takes effect
when only the memcg is under pressure, which makes no sense. Fix this
by skipping the heuristic when under memcg pressure.

Fixes: e1aab161e013 ("socket: initial cgroup code.")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 net/core/sock.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)