[RFC,net-next,3/3] sock: Throttle pressure-aware sockets under pressure

Message ID 20230901062141.51972-4-wuyun.abel@bytedance.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series sock: Be aware of memcg pressure on alloc

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1336 this patch: 1336
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 1353 this patch: 1353
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1359 this patch: 1359
netdev/checkpatch warning WARNING: line length of 85 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Abel Wu Sept. 1, 2023, 6:21 a.m. UTC
A socket is pressure-aware when its protocol has memory pressure defined,
that is, when sk_has_memory_pressure(sk) returns true, e.g. TCP. Such
protocols may want to limit their socket memory usage depending on both
the global and the memcg pressure state, as reported by
sk_under_memory_pressure(sk).
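
For reference, a protocol opts in simply by supplying pressure state in
its struct proto; roughly (abridged from include/net/sock.h and
net/ipv4/tcp_ipv4.c):

	/* A protocol is pressure-aware iff it supplies pressure state. */
	static inline bool sk_has_memory_pressure(const struct sock *sk)
	{
		return sk->sk_prot->memory_pressure != NULL;
	}

	/* TCP opts in by wiring up its global pressure flag and the
	 * enter/leave callbacks.
	 */
	struct proto tcp_prot = {
		.name			= "TCP",
		.enter_memory_pressure	= tcp_enter_memory_pressure,
		.leave_memory_pressure	= tcp_leave_memory_pressure,
		.memory_pressure	= &tcp_memory_pressure,
		/* ... */
	};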

On allocation, however, memcg pressure is simply ignored as long as usage
stays under the global limit (sysctl_mem[0]). This behavior has different
impacts on the two cgroup modes. In cgroup v2, socket memory and all other
usage share a single memory limit, so allowing sockmem to burst while the
memcg is under reclaim pressure can lead to longer stalls, sometimes even
OOM. Cgroup v1 has no such worry, since socket memory there is accounted
against its own tcpmem limit.
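
To illustrate, the pre-patch fast path of __sk_mem_raise_allocated()
looks roughly like this (a heavily abridged sketch, keeping only the
branches relevant here):

	int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
	{
		long allocated;

		sk_memory_allocated_add(sk, amt);
		allocated = sk_memory_allocated(sk);

		/* ... memcg charging ... */

		/* Under the global limit: succeed without ever looking
		 * at memcg pressure.
		 */
		if (allocated <= sk_prot_mem_limits(sk, 0)) {
			sk_leave_memory_pressure(sk);
			return 1;
		}

		/* ... */

		if (sk_has_memory_pressure(sk)) {
			/* Folds global and memcg pressure together. */
			if (!sk_under_memory_pressure(sk))
				return 1;
			/* ... per-socket fairness check follows ... */
		}

		/* ... suppress_allocation path ... */
	}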

As a cloud service provider, we hit this problem in our production
environment during the transition from cgroup v1 to v2 (motivated partly
by the heavy cost of accounting socket memory in v1). Say a workload
behaves fine in cgroup v1 with its memcg limit configured to 10GB of
memory plus another 1GB of tcpmem, but performs poorly (or even gets
OOM-killed) in v2 with an 11GB memory limit, due to bursty socket memory
usage: cgroup v2 has no dedicated limit for socket memory and relies
largely on the workloads doing traffic control themselves.

It is rational for workloads to implement some traffic control to better
utilize the resources they bought, but from the kernel's point of view it
is also reasonable to suppress socket memory allocation once free memory
runs short, given that performance degradation is better than failure.

Hence this patch aims to be more conservative when allocating for
pressure-aware sockets under global and/or memcg pressure. OTOH,
throttling incoming traffic can hurt latency badly, possibly because
SACKed segments get dropped from the OFO queue; see the related commit
720ca52bcef22 ("net-memcg: avoid stalls when under memory pressure").
This patch preserves that decision by throttling RX allocation only at
the critical pressure level, where it hardly makes sense to continue
receiving data.
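
Concretely, the resulting policy at the allocation site can be sketched
as below (illustrative only; the two-argument
mem_cgroup_under_socket_pressure() comes from the previous patch of this
series and is assumed here to return true under any memcg socket
pressure while reporting critical-level pressure through its
out-parameter):

	bool under_memcg_pressure = false;

	/* The helper reports critical-level pressure through the
	 * out-parameter, which is the only case where RX (softirq
	 * context) gets throttled.
	 */
	if (sk_has_memory_pressure(sk) &&
	    mem_cgroup_under_socket_pressure(memcg, &under_memcg_pressure) &&
	    !in_softirq()) {
		/* TX (process context): any pressure level throttles. */
		under_memcg_pressure = true;
	}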

No functional change intended for pressure-unaware protocols.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 net/core/sock.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

Comments

Simon Horman Sept. 1, 2023, 1:59 p.m. UTC | #1
On Fri, Sep 01, 2023 at 02:21:28PM +0800, Abel Wu wrote:
...

> @@ -3087,8 +3100,20 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>  	if (sk_has_memory_pressure(sk)) {
>  		u64 alloc;
>  
> -		if (!sk_under_memory_pressure(sk))
> +		/* Be more conservative if the socket's memcg (or its
> +		 * parents) is under reclaim pressure, try to possibly
> +		 * avoid further memstall.
> +		 */
> +		if (under_memcg_pressure)
> +			goto suppress_allocation;
> +
> +		if (!sk_under_global_memory_pressure(sk))
>  			return 1;
> +
> +		/* Trying to be fair among all the sockets of same
> +		 * protocal under global memory pressure, by allowing

nit: checkpatch.pl --codespell says, protocal -> protocol

> +		 * the ones that under average usage to raise.
> +		 */
>  		alloc = sk_sockets_allocated_read_positive(sk);
>  		if (sk_prot_mem_limits(sk, 2) > alloc *
>  		    sk_mem_pages(sk->sk_wmem_queued +
Abel Wu Sept. 3, 2023, 4:54 a.m. UTC | #2
Hi Simon, thanks for reviewing!

On 9/1/23 9:59 PM, Simon Horman wrote:
> On Fri, Sep 01, 2023 at 02:21:28PM +0800, Abel Wu wrote:
>> @@ -3087,8 +3100,20 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>>   	if (sk_has_memory_pressure(sk)) {
>>   		u64 alloc;
>>   
>> -		if (!sk_under_memory_pressure(sk))
>> +		/* Be more conservative if the socket's memcg (or its
>> +		 * parents) is under reclaim pressure, try to possibly
>> +		 * avoid further memstall.
>> +		 */
>> +		if (under_memcg_pressure)
>> +			goto suppress_allocation;
>> +
>> +		if (!sk_under_global_memory_pressure(sk))
>>   			return 1;
>> +
>> +		/* Trying to be fair among all the sockets of same
>> +		 * protocal under global memory pressure, by allowing
> 
> nit: checkpatch.pl --codespell says, protocal -> protocol

Will fix in next version.

Thanks,
	Abel

Abel Wu Sept. 18, 2023, 7:48 a.m. UTC | #3
On 9/1/23 2:21 PM, Abel Wu wrote:
> @@ -3087,8 +3100,20 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>   	if (sk_has_memory_pressure(sk)) {
>   		u64 alloc;
>   
> -		if (!sk_under_memory_pressure(sk))
> +		/* Be more conservative if the socket's memcg (or its
> +		 * parents) is under reclaim pressure, try to possibly
> +		 * avoid further memstall.
> +		 */
> +		if (under_memcg_pressure)
> +			goto suppress_allocation;
> +
> +		if (!sk_under_global_memory_pressure(sk))
>   			return 1;
> +
> +		/* Trying to be fair among all the sockets of same
> +		 * protocal under global memory pressure, by allowing
> +		 * the ones that under average usage to raise.
> +		 */
>   		alloc = sk_sockets_allocated_read_positive(sk);
>   		if (sk_prot_mem_limits(sk, 2) > alloc *
>   		    sk_mem_pages(sk->sk_wmem_queued +

I totally agree with what Shakeel said in the last reply and will try an
ebpf-based solution to let userspace inject proper strategies. But IMHO
the above hunk is independent of the idea of this patchset, and is the
right thing to do on its own, so maybe it is worth a separate patch?

This hunk originally passes the allocation when this socket is below the
average usage, even under global and/or memcg pressure. That makes sense
under global pressure, as the 'average' is computed in the global scope,
but it is really weird from a memcg's point of view. Actually this pass
condition was already present before memcg pressure was introduced.
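
To spell out the condition (a sketch only, with made-up numbers): the
allocation is passed whenever this socket's own usage, rounded up to
pages, stays below the per-socket average share of the protocol's hard
limit:

	/* pages charged to this socket alone */
	pages = sk_mem_pages(sk->sk_wmem_queued +
			     atomic_read(&sk->sk_rmem_alloc) +
			     sk->sk_forward_alloc);

	/* pass iff pages < limit / nr_sockets, i.e. below average */
	if (sk_prot_mem_limits(sk, 2) > alloc * pages)
		return 1;

E.g. if tcp_mem[2] were 4500000 pages with 1000000 TCP sockets
allocated, any socket using no more than 4 pages would still be allowed
to raise. Both numbers are global, so this 'average' says nothing about
how much of its own memcg's limit the socket is consuming.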

Please correct me if I missed something, thanks!

Best,
	Abel
Shakeel Butt Sept. 18, 2023, 3:49 p.m. UTC | #4
On Mon, Sep 18, 2023 at 12:48 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> On 9/1/23 2:21 PM, Abel Wu wrote:
> ...
>
> I totally agree with what Shakeel said in the last reply and will try an
> ebpf-based solution to let userspace inject proper strategies. But IMHO
> the above hunk is independent of the idea of this patchset, and is the
> right thing to do on its own, so maybe it is worth a separate patch?
>
> This hunk originally passes the allocation when this socket is below the
> average usage, even under global and/or memcg pressure. That makes sense
> under global pressure, as the 'average' is computed in the global scope,
> but it is really weird from a memcg's point of view. Actually this pass
> condition was already present before memcg pressure was introduced.
>
> Please correct me if I missed something, thanks!
>

Please send patch 1 and this hunk as separate patches, with the relevant
motivation and reasoning.

thanks,
Shakeel

Patch

diff --git a/net/core/sock.c b/net/core/sock.c
index af778fc60a4d..6c1d13547f1b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3041,6 +3041,7 @@  EXPORT_SYMBOL(sk_wait_data);
 int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	struct mem_cgroup *memcg = mem_cgroup_sockets_enabled ? sk->sk_memcg : NULL;
+	bool under_memcg_pressure = false;
 	struct proto *prot = sk->sk_prot;
 	bool charged = false;
 	long allocated;
@@ -3051,13 +3052,25 @@  int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 	if (memcg) {
 		if (!mem_cgroup_charge_skmem(memcg, amt, gfp_memcg_charge()))
 			goto suppress_allocation;
+
+		/* Get pressure info from net-memcg. But consider the memcg
+		 * to be under pressure for incoming traffic iff at 'critical'
+		 * level, see commit 720ca52bcef22 ("net-memcg: avoid stalls
+		 * when under memory pressure").
+		 */
+		if (sk_has_memory_pressure(sk) &&
+		    mem_cgroup_under_socket_pressure(memcg, &under_memcg_pressure) &&
+		    !in_softirq())
+			under_memcg_pressure = true;
+
 		charged = true;
 	}
 
 	/* Under limit. */
 	if (allocated <= sk_prot_mem_limits(sk, 0)) {
 		sk_leave_memory_pressure(sk);
-		return 1;
+		if (!under_memcg_pressure)
+			return 1;
 	}
 
 	/* Under pressure. */
@@ -3087,8 +3100,20 @@  int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 	if (sk_has_memory_pressure(sk)) {
 		u64 alloc;
 
-		if (!sk_under_memory_pressure(sk))
+		/* Be more conservative if the socket's memcg (or its
+		 * parents) is under reclaim pressure, try to possibly
+		 * avoid further memstall.
+		 */
+		if (under_memcg_pressure)
+			goto suppress_allocation;
+
+		if (!sk_under_global_memory_pressure(sk))
 			return 1;
+
+		/* Trying to be fair among all the sockets of same
+		 * protocal under global memory pressure, by allowing
+		 * the ones that under average usage to raise.
+		 */
 		alloc = sk_sockets_allocated_read_positive(sk);
 		if (sk_prot_mem_limits(sk, 2) > alloc *
 		    sk_mem_pages(sk->sk_wmem_queued +