
[v2,3/4] sock: Consider memcg pressure when raising sockmem

Message ID 20230522070122.6727-4-wuyun.abel@bytedance.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series sock: Improve condition on sockmem pressure | expand

Checks

Context Check Description
netdev/series_format warning Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 14 this patch: 14
netdev/cc_maintainers warning 2 maintainers not CCed: kuniyu@amazon.com martin.lau@kernel.org
netdev/build_clang success Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 14 this patch: 14
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 37 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Abel Wu May 22, 2023, 7:01 a.m. UTC
For now, __sk_mem_raise_allocated() mainly considers global socket
memory pressure and allows the allocation to be raised if no global
pressure is observed, even for sockets whose memcgs are under
pressure, which might result in longer memcg memstalls.

So take the net-memcg's pressure into consideration when allocating
socket memory, to alleviate long tail latencies.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 net/core/sock.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

Comments

Simon Horman May 22, 2023, 12:53 p.m. UTC | #1
On Mon, May 22, 2023 at 03:01:21PM +0800, Abel Wu wrote:
> For now __sk_mem_raise_allocated() mainly considers global socket
> memory pressure and allows to raise if no global pressure observed,
> including the sockets whose memcgs are in pressure, which might
> result in longer memcg memstall.
> 
> So take net-memcg's pressure into consideration when allocating
> socket memory to alleviate long tail latencies.
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
>  net/core/sock.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 801df091e37a..7641d64293af 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2977,21 +2977,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>  {
>  	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
>  	struct proto *prot = sk->sk_prot;
> -	bool charged = true;
> +	bool charged = true, pressured = false;
>  	long allocated;

Please preserve reverse xmas tree ordering - longest line to shortest -
for local variables in networking code.
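
For illustration, one ordering that satisfies this, using the locals
from the hunk above (a sketch, not part of the posted patch):

	/* longest declaration first, shortest last */
	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
	bool charged = true, pressured = false;
	struct proto *prot = sk->sk_prot;
	long allocated;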

...
Shakeel Butt May 25, 2023, 1:22 a.m. UTC | #2
On Mon, May 22, 2023 at 03:01:21PM +0800, Abel Wu wrote:
> For now __sk_mem_raise_allocated() mainly considers global socket
> memory pressure and allows to raise if no global pressure observed,
> including the sockets whose memcgs are in pressure, which might
> result in longer memcg memstall.
> 
> So take net-memcg's pressure into consideration when allocating
> socket memory to alleviate long tail latencies.
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>

Hi Abel,

Have you seen any real world production issue which is fixed by this
patch or is it more of a fix after reading code?

This code is quite subtle and small changes can cause unintended
behavior changes. At the moment the tcp memory accounting and memcg
accounting is intermingled and I think we should decouple them.

> ---
>  net/core/sock.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 801df091e37a..7641d64293af 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2977,21 +2977,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>  {
>  	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
>  	struct proto *prot = sk->sk_prot;
> -	bool charged = true;
> +	bool charged = true, pressured = false;
>  	long allocated;
>  
>  	sk_memory_allocated_add(sk, amt);
>  	allocated = sk_memory_allocated(sk);
> -	if (memcg_charge &&
> -	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
> -						gfp_memcg_charge())))
> -		goto suppress_allocation;
> +
> +	if (memcg_charge) {
> +		charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
> +						  gfp_memcg_charge());
> +		if (!charged)
> +			goto suppress_allocation;
> +		if (mem_cgroup_under_socket_pressure(sk->sk_memcg))

The memcg under-pressure callback does an upward memcg tree walk, so
please make sure you have tested the performance impact of this.
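
For reference, the walk in question looks roughly like this (a
simplified paraphrase of mem_cgroup_under_socket_pressure(), not the
exact kernel source):

	static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
	{
		/* cgroup v1: a per-memcg flag, no walk */
		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->tcpmem_pressure)
			return true;

		/* cgroup v2: walk up to the root; pressure is a jiffies deadline */
		do {
			if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
				return true;
		} while ((memcg = parent_mem_cgroup(memcg)));

		return false;
	}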

> +			pressured = true;
> +	}
>  
>  	/* Under limit. */
> -	if (allocated <= sk_prot_mem_limits(sk, 0)) {
> +	if (allocated <= sk_prot_mem_limits(sk, 0))
>  		sk_leave_memory_pressure(sk);
> +	else
> +		pressured = true;
> +
> +	/* No pressure observed in global/memcg. */
> +	if (!pressured)
>  		return 1;
> -	}
>  
>  	/* Under pressure. */
>  	if (allocated > sk_prot_mem_limits(sk, 1))
> -- 
> 2.37.3
>
Abel Wu May 29, 2023, 11:58 a.m. UTC | #3
Hi Shakeel, thanks for reviewing! And sorry for replying so late,
I was on a vacation :)

On 5/25/23 9:22 AM, Shakeel Butt wrote:
> On Mon, May 22, 2023 at 03:01:21PM +0800, Abel Wu wrote:
>> For now __sk_mem_raise_allocated() mainly considers global socket
>> memory pressure and allows to raise if no global pressure observed,
>> including the sockets whose memcgs are in pressure, which might
>> result in longer memcg memstall.
>>
>> So take net-memcg's pressure into consideration when allocating
>> socket memory to alleviate long tail latencies.
>>
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> 
> Hi Abel,
> 
> Have you seen any real world production issue which is fixed by this
> patch or is it more of a fix after reading code?

The latter. But we do observe one common case in the production env:
a p2p service, which mainly downloads container images, running inside
a container with a tight memory limit can easily be throttled and stay
memstalled for a long period of time, and sometimes even gets OOM-
killed. This service shows bursty usage of TCP memory, and I think it
indeed needs sockmem allocation to be suppressed if the memcg is
already under pressure. The memcg pressure is usually caused by too
many page caches, with the dirty ones starting to be written back to
slow backends. So it is unreasonable to keep receiving net data and
consuming more memory.

> 
> This code is quite subtle and small changes can cause unintended
> behavior changes. At the moment the tcp memory accounting and memcg
> accounting is intermingled and I think we should decouple them.

My original intention in posting this patchset is to clarify that:

   - proto pressure only considers sysctl_mem[] (patch 2)
   - memcg pressure only indicates the pressure inside the memcg itself
   - both are considered whenever allocation or reclaim is needed (patches 1,3)

In this way, the two kinds of pressure maintain purer semantics, and
the socket core can react to both of them properly and consistently.
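
Roughly, the decision in __sk_mem_raise_allocated() then becomes the
following (a condensed restatement of the patch below, not new logic):

	pressured = false;

	/* memcg pressure only reflects the memcg and its ancestors */
	if (memcg_charge && mem_cgroup_under_socket_pressure(sk->sk_memcg))
		pressured = true;

	/* proto pressure only reflects sysctl_mem[] */
	if (allocated <= sk_prot_mem_limits(sk, 0))
		sk_leave_memory_pressure(sk);
	else
		pressured = true;

	/* allow the allocation only if neither side signals pressure */
	if (!pressured)
		return 1;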

> 
>> ---
>>   net/core/sock.c | 23 ++++++++++++++++-------
>>   1 file changed, 16 insertions(+), 7 deletions(-)
>>
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 801df091e37a..7641d64293af 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -2977,21 +2977,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>>   {
>>   	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
>>   	struct proto *prot = sk->sk_prot;
>> -	bool charged = true;
>> +	bool charged = true, pressured = false;
>>   	long allocated;
>>   
>>   	sk_memory_allocated_add(sk, amt);
>>   	allocated = sk_memory_allocated(sk);
>> -	if (memcg_charge &&
>> -	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
>> -						gfp_memcg_charge())))
>> -		goto suppress_allocation;
>> +
>> +	if (memcg_charge) {
>> +		charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
>> +						  gfp_memcg_charge());
>> +		if (!charged)
>> +			goto suppress_allocation;
>> +		if (mem_cgroup_under_socket_pressure(sk->sk_memcg))
> 
> The memcg under pressure callback does a upward memcg tree walk, do
> please make sure you have tested the performance impact of this.

Yes, I have run several benchmarks on a dual-socket Intel Xeon(R)
Platinum 8260 machine with SNC disabled, that is, 2 NUMA nodes each
of which has 24C/48T. All the benchmarks were run inside a separate
cgroup on an otherwise clean host. Below are the results for tbench4
and netperf:

tbench4 Throughput (misleading but traditional)
                             baseline               patchset
Hmean     1        377.62 (   0.00%)      375.06 *  -0.68%*
Hmean     2        753.99 (   0.00%)      753.21 *  -0.10%*
Hmean     4       1503.50 (   0.00%)     1493.07 *  -0.69%*
Hmean     8       2941.43 (   0.00%)     2925.18 *  -0.55%*
Hmean     16      5637.59 (   0.00%)     5603.64 *  -0.60%*
Hmean     32      9042.90 (   0.00%)     9022.53 *  -0.23%*
Hmean     64     10530.55 (   0.00%)    10554.89 *   0.23%*
Hmean     128    24230.20 (   0.00%)    24424.74 *   0.80%*
Hmean     256    23798.21 (   0.00%)    23941.24 *   0.60%*
Hmean     384    23620.63 (   0.00%)    23569.54 *  -0.22%*

netperf-udp
                                    baseline               patchset
Hmean     send-64         281.99 (   0.00%)      274.50 *  -2.65%*
Hmean     send-128        556.70 (   0.00%)      545.82 *  -1.96%*
Hmean     send-256       1102.60 (   0.00%)     1091.21 *  -1.03%*
Hmean     send-1024      4180.48 (   0.00%)     4073.87 *  -2.55%*
Hmean     send-2048      7837.61 (   0.00%)     7707.12 *  -1.66%*
Hmean     send-3312     12157.49 (   0.00%)    11845.03 *  -2.57%*
Hmean     send-4096     14512.64 (   0.00%)    14156.45 *  -2.45%*
Hmean     send-8192     24015.40 (   0.00%)    23920.94 (  -0.39%)
Hmean     send-16384    39875.21 (   0.00%)    39696.67 (  -0.45%)
Hmean     recv-64         281.99 (   0.00%)      274.50 *  -2.65%*
Hmean     recv-128        556.70 (   0.00%)      545.82 *  -1.96%*
Hmean     recv-256       1102.60 (   0.00%)     1091.21 *  -1.03%*
Hmean     recv-1024      4180.48 (   0.00%)     4073.76 *  -2.55%*
Hmean     recv-2048      7837.61 (   0.00%)     7707.11 *  -1.67%*
Hmean     recv-3312     12157.49 (   0.00%)    11845.03 *  -2.57%*
Hmean     recv-4096     14512.62 (   0.00%)    14156.45 *  -2.45%*
Hmean     recv-8192     24015.29 (   0.00%)    23920.88 (  -0.39%)
Hmean     recv-16384    39873.93 (   0.00%)    39696.02 (  -0.45%)

netperf-tcp
                               baseline               patchset
Hmean     64        1777.05 (   0.00%)     1793.04 (   0.90%)
Hmean     128       3364.25 (   0.00%)     3451.05 *   2.58%*
Hmean     256       6309.21 (   0.00%)     6506.84 *   3.13%*
Hmean     1024     19571.52 (   0.00%)    19606.65 (   0.18%)
Hmean     2048     26467.00 (   0.00%)    26658.12 (   0.72%)
Hmean     3312     31312.36 (   0.00%)    31403.54 (   0.29%)
Hmean     4096     33263.37 (   0.00%)    33278.77 (   0.05%)
Hmean     8192     39961.82 (   0.00%)    40149.77 (   0.47%)
Hmean     16384    46065.33 (   0.00%)    46683.67 (   1.34%)

Except for a slight regression in netperf-udp, there is no obvious
performance win or loss. But since you reminded me of the cost of the
hierarchical behavior, I re-tested the cases in a cgroup nested 5
levels deep (originally 2 levels), and the results are:

tbench4 Throughput (misleading but traditional)
                             baseline               patchset
Hmean     1        361.93 (   0.00%)      367.58 *   1.56%*
Hmean     2        734.39 (   0.00%)      730.33 *  -0.55%*
Hmean     4       1426.82 (   0.00%)     1440.81 *   0.98%*
Hmean     8       2848.86 (   0.00%)     2860.87 *   0.42%*
Hmean     16      5436.72 (   0.00%)     5491.72 *   1.01%*
Hmean     32      8743.34 (   0.00%)     8913.27 *   1.94%*
Hmean     64     10345.41 (   0.00%)    10436.92 *   0.88%*
Hmean     128    23390.36 (   0.00%)    23353.09 *  -0.16%*
Hmean     256    23823.20 (   0.00%)    23509.79 *  -1.32%*
Hmean     384    23268.09 (   0.00%)    23178.10 *  -0.39%*

netperf-udp
                                    baseline               patchset
Hmean     send-64         278.31 (   0.00%)      275.68 *  -0.94%*
Hmean     send-128        554.52 (   0.00%)      547.46 (  -1.27%)
Hmean     send-256       1106.64 (   0.00%)     1103.01 (  -0.33%)
Hmean     send-1024      4135.84 (   0.00%)     4057.47 *  -1.89%*
Hmean     send-2048      7816.13 (   0.00%)     7732.71 *  -1.07%*
Hmean     send-3312     12068.32 (   0.00%)    11895.94 *  -1.43%*
Hmean     send-4096     14358.02 (   0.00%)    14304.06 (  -0.38%)
Hmean     send-8192     24041.57 (   0.00%)    24061.70 (   0.08%)
Hmean     send-16384    39996.09 (   0.00%)    39936.08 (  -0.15%)
Hmean     recv-64         278.31 (   0.00%)      275.68 *  -0.94%*
Hmean     recv-128        554.52 (   0.00%)      547.46 (  -1.27%)
Hmean     recv-256       1106.64 (   0.00%)     1103.01 (  -0.33%)
Hmean     recv-1024      4135.84 (   0.00%)     4057.47 *  -1.89%*
Hmean     recv-2048      7816.13 (   0.00%)     7732.71 *  -1.07%*
Hmean     recv-3312     12068.32 (   0.00%)    11895.94 *  -1.43%*
Hmean     recv-4096     14357.99 (   0.00%)    14304.04 (  -0.38%)
Hmean     recv-8192     24041.43 (   0.00%)    24061.58 (   0.08%)
Hmean     recv-16384    39995.72 (   0.00%)    39935.68 (  -0.15%)

netperf-tcp
                               baseline               patchset
Hmean     64        1779.93 (   0.00%)     1784.75 (   0.27%)
Hmean     128       3380.32 (   0.00%)     3424.14 (   1.30%)
Hmean     256       6383.37 (   0.00%)     6504.97 *   1.90%*
Hmean     1024     19345.07 (   0.00%)    19604.06 *   1.34%*
Hmean     2048     26547.60 (   0.00%)    26743.94 *   0.74%*
Hmean     3312     30948.40 (   0.00%)    31419.11 *   1.52%*
Hmean     4096     32888.83 (   0.00%)    33125.01 *   0.72%*
Hmean     8192     40020.38 (   0.00%)    39949.53 (  -0.18%)
Hmean     16384    46084.48 (   0.00%)    46300.43 (   0.47%)

Still no obvious difference, and the udp regression is even reduced.

Thanks & Best,
	Abel
Shakeel Butt May 29, 2023, 9:12 p.m. UTC | #4
+linux-mm and cgroups

On Mon, May 29, 2023 at 07:58:45PM +0800, Abel Wu wrote:
> Hi Shakeel, thanks for reviewing! And sorry for replying so late,
> I was on a vocation :)
> 
> On 5/25/23 9:22 AM, Shakeel Butt wrote:
> > On Mon, May 22, 2023 at 03:01:21PM +0800, Abel Wu wrote:
> > > For now __sk_mem_raise_allocated() mainly considers global socket
> > > memory pressure and allows to raise if no global pressure observed,
> > > including the sockets whose memcgs are in pressure, which might
> > > result in longer memcg memstall.
> > > 
> > > So take net-memcg's pressure into consideration when allocating
> > > socket memory to alleviate long tail latencies.
> > > 
> > > Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> > 
> > Hi Abel,
> > 
> > Have you seen any real world production issue which is fixed by this
> > patch or is it more of a fix after reading code?
> 
> The latter. But we do observe one common case in the production env
> that p2p service, which mainly downloads container images, running
> inside a container with tight memory limit can easily be throttled and
> keep memstalled for a long period of time and sometimes even be OOM-
> killed. This service shows burst usage of TCP memory and I think it
> indeed needs suppressing sockmem allocation if memcg is already under
> pressure. The memcg pressure is usually caused by too many page caches
> and the dirty ones starting to be wrote back to slow backends. So it
> is insane to continuously receive net data to consume more memory.
> 

We actually made an intentional decision to not throttle the incoming
traffic under memory pressure. See 720ca52bcef22 ("net-memcg: avoid
stalls when under memory pressure"). If you think the throttling
behavior is preferred for your application, please propose the patch
separately and we can work on how to enable flexible policy here.

> > 
> > This code is quite subtle and small changes can cause unintended
> > behavior changes. At the moment the tcp memory accounting and memcg
> > accounting is intermingled and I think we should decouple them.
> 
> My original intention to post this patchset is to clarify that:
> 
>   - proto pressure only considers sysctl_mem[] (patch 2)
>   - memcg pressure only indicates the pressure inside itself
>   - consider both whenever needs allocation or reclaim (patch 1,3)
> 
> In this way, the two kinds of pressure maintain purer semantics, and
> socket core can react on both of them properly and consistently.

Can you please resend your patch series (without patch 3) and Cc
linux-mm, the cgroups list and the memcg maintainers as well?
Abel Wu May 30, 2023, 9:58 a.m. UTC | #5
On 5/30/23 5:12 AM, Shakeel Butt wrote:
> 
> +linux-mm and cgroups
> 
> On Mon, May 29, 2023 at 07:58:45PM +0800, Abel Wu wrote:
>> Hi Shakeel, thanks for reviewing! And sorry for replying so late,
>> I was on a vocation :)
>>
>> On 5/25/23 9:22 AM, Shakeel Butt wrote:
>>> On Mon, May 22, 2023 at 03:01:21PM +0800, Abel Wu wrote:
>>>> For now __sk_mem_raise_allocated() mainly considers global socket
>>>> memory pressure and allows to raise if no global pressure observed,
>>>> including the sockets whose memcgs are in pressure, which might
>>>> result in longer memcg memstall.
>>>>
>>>> So take net-memcg's pressure into consideration when allocating
>>>> socket memory to alleviate long tail latencies.
>>>>
>>>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>>>
>>> Hi Abel,
>>>
>>> Have you seen any real world production issue which is fixed by this
>>> patch or is it more of a fix after reading code?
>>
>> The latter. But we do observe one common case in the production env
>> that p2p service, which mainly downloads container images, running
>> inside a container with tight memory limit can easily be throttled and
>> keep memstalled for a long period of time and sometimes even be OOM-
>> killed. This service shows burst usage of TCP memory and I think it
>> indeed needs suppressing sockmem allocation if memcg is already under
>> pressure. The memcg pressure is usually caused by too many page caches
>> and the dirty ones starting to be wrote back to slow backends. So it
>> is insane to continuously receive net data to consume more memory.
>>
> 
> We actually made an intentional decision to not throttle the incoming
> traffic under memory pressure. See 720ca52bcef22 ("net-memcg: avoid
> stalls when under memory pressure"). If you think the throttling
> behavior is preferred for your application, please propose the patch
> separately and we can work on how to enable flexible policy here.

Ah, I see. Thanks for providing the context. So suppressing the
allocation under memcg pressure could further keep senders waiting if
SACKed segments get dropped from the OFO queue.

> 
>>>
>>> This code is quite subtle and small changes can cause unintended
>>> behavior changes. At the moment the tcp memory accounting and memcg
>>> accounting is intermingled and I think we should decouple them.
>>
>> My original intention to post this patchset is to clarify that:
>>
>>    - proto pressure only considers sysctl_mem[] (patch 2)
>>    - memcg pressure only indicates the pressure inside itself
>>    - consider both whenever needs allocation or reclaim (patch 1,3)
>>
>> In this way, the two kinds of pressure maintain purer semantics, and
>> socket core can react on both of them properly and consistently.
> 
> Can you please resend you patch series (without patch 3) and Cc to
> linux-mm, cgroups list and memcg maintainers as well?

Yeah, absolutely.

Thanks,
	Abel

Patch

diff --git a/net/core/sock.c b/net/core/sock.c
index 801df091e37a..7641d64293af 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2977,21 +2977,30 @@  int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
 	struct proto *prot = sk->sk_prot;
-	bool charged = true;
+	bool charged = true, pressured = false;
 	long allocated;
 
 	sk_memory_allocated_add(sk, amt);
 	allocated = sk_memory_allocated(sk);
-	if (memcg_charge &&
-	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
-						gfp_memcg_charge())))
-		goto suppress_allocation;
+
+	if (memcg_charge) {
+		charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
+						  gfp_memcg_charge());
+		if (!charged)
+			goto suppress_allocation;
+		if (mem_cgroup_under_socket_pressure(sk->sk_memcg))
+			pressured = true;
+	}
 
 	/* Under limit. */
-	if (allocated <= sk_prot_mem_limits(sk, 0)) {
+	if (allocated <= sk_prot_mem_limits(sk, 0))
 		sk_leave_memory_pressure(sk);
+	else
+		pressured = true;
+
+	/* No pressure observed in global/memcg. */
+	if (!pressured)
 		return 1;
-	}
 
 	/* Under pressure. */
 	if (allocated > sk_prot_mem_limits(sk, 1))