Message ID:    20230522070122.6727-4-wuyun.abel@bytedance.com (mailing list archive)
State:         Superseded
Delegated to:  Netdev Maintainers
Series:        sock: Improve condition on sockmem pressure
On Mon, May 22, 2023 at 03:01:21PM +0800, Abel Wu wrote:
> For now __sk_mem_raise_allocated() mainly considers global socket
> memory pressure and allows raising if no global pressure is
> observed, including for sockets whose memcgs are under pressure,
> which might result in longer memcg memstalls.
>
> So take net-memcg's pressure into consideration when allocating
> socket memory to alleviate long tail latencies.
>
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
>  net/core/sock.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 801df091e37a..7641d64293af 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2977,21 +2977,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>  {
>  	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
>  	struct proto *prot = sk->sk_prot;
> -	bool charged = true;
> +	bool charged = true, pressured = false;
>  	long allocated;

Please preserve reverse xmas tree - longest line to shortest - for
local variables in networking code.

...
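For readers unfamiliar with the convention: "reverse xmas tree" simply
orders local variable declarations from the longest line to the
shortest. A minimal sketch of a conforming ordering for this patch's
locals (a hypothetical reordering shown only to illustrate the
convention, not taken from a later revision):

	/* Reverse xmas tree: longest declaration first, shortest last.
	 * Hypothetical reordering of the patch's locals, for
	 * illustration only.
	 */
	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
	bool charged = true, pressured = false;
	struct proto *prot = sk->sk_prot;
	long allocated;

The patch as posted declares prot between the two bool lines, and the
lengthened "charged = true, pressured = false" line is what breaks the
ordering.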
On Mon, May 22, 2023 at 03:01:21PM +0800, Abel Wu wrote:
> For now __sk_mem_raise_allocated() mainly considers global socket
> memory pressure and allows raising if no global pressure is
> observed, including for sockets whose memcgs are under pressure,
> which might result in longer memcg memstalls.
>
> So take net-memcg's pressure into consideration when allocating
> socket memory to alleviate long tail latencies.
>
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>

Hi Abel,

Have you seen any real world production issue which is fixed by this
patch or is it more of a fix after reading code?

This code is quite subtle and small changes can cause unintended
behavior changes. At the moment the tcp memory accounting and memcg
accounting is intermingled and I think we should decouple them.

> ---
>  net/core/sock.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 801df091e37a..7641d64293af 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2977,21 +2977,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>  {
>  	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
>  	struct proto *prot = sk->sk_prot;
> -	bool charged = true;
> +	bool charged = true, pressured = false;
>  	long allocated;
>
>  	sk_memory_allocated_add(sk, amt);
>  	allocated = sk_memory_allocated(sk);
> -	if (memcg_charge &&
> -	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
> -						gfp_memcg_charge())))
> -		goto suppress_allocation;
> +
> +	if (memcg_charge) {
> +		charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
> +						  gfp_memcg_charge());
> +		if (!charged)
> +			goto suppress_allocation;
> +		if (mem_cgroup_under_socket_pressure(sk->sk_memcg))

The memcg under pressure callback does an upward memcg tree walk, so
please make sure you have tested the performance impact of this.

> +			pressured = true;
> +	}
>
>  	/* Under limit. */
> -	if (allocated <= sk_prot_mem_limits(sk, 0)) {
> +	if (allocated <= sk_prot_mem_limits(sk, 0))
>  		sk_leave_memory_pressure(sk);
> +	else
> +		pressured = true;
> +
> +	/* No pressure observed in global/memcg. */
> +	if (!pressured)
>  		return 1;
> -	}
>
>  	/* Under pressure. */
>  	if (allocated > sk_prot_mem_limits(sk, 1))
> --
> 2.37.3
>
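For context on that remark: mem_cgroup_under_socket_pressure() loops
from the socket's memcg up through its ancestors, so its cost grows
with cgroup nesting depth. A simplified sketch of the walk's shape,
which approximates the mainline implementation of that era rather than
quoting it verbatim:

	/* Simplified sketch of the upward walk Shakeel refers to;
	 * details approximate the v6.x kernel and are not copied
	 * verbatim (the legacy-hierarchy special case is omitted).
	 */
	static inline bool sketch_under_socket_pressure(struct mem_cgroup *memcg)
	{
		do {
			/* socket_pressure holds a jiffies deadline armed
			 * by vmpressure when reclaim in this memcg
			 * struggles; any pressured ancestor wins.
			 */
			if (time_before(jiffies,
					READ_ONCE(memcg->socket_pressure)))
				return true;
		} while ((memcg = parent_mem_cgroup(memcg)));

		return false;
	}

Because the loop only terminates at the root, a deeply nested cgroup
pays one check per level on every allocation that consults it - hence
the request for benchmarks at a deeper nesting level.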
Hi Shakeel, thanks for reviewing! And sorry for replying so late,
I was on a vacation :)

On 5/25/23 9:22 AM, Shakeel Butt wrote:
> On Mon, May 22, 2023 at 03:01:21PM +0800, Abel Wu wrote:
>> For now __sk_mem_raise_allocated() mainly considers global socket
>> memory pressure and allows raising if no global pressure is
>> observed, including for sockets whose memcgs are under pressure,
>> which might result in longer memcg memstalls.
>>
>> So take net-memcg's pressure into consideration when allocating
>> socket memory to alleviate long tail latencies.
>>
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>
> Hi Abel,
>
> Have you seen any real world production issue which is fixed by this
> patch or is it more of a fix after reading code?

The latter. But we do observe one common case in the production env:
a p2p service, which mainly downloads container images, running
inside a container with a tight memory limit can easily be throttled,
kept memstalled for a long period of time, and sometimes even
OOM-killed. This service shows burst usage of TCP memory and I think
it indeed needs suppressing sockmem allocation if its memcg is
already under pressure. The memcg pressure is usually caused by too
many page caches, with the dirty ones starting to be written back to
slow backends. So it makes little sense to keep receiving net data
and consuming even more memory.

> This code is quite subtle and small changes can cause unintended
> behavior changes. At the moment the tcp memory accounting and memcg
> accounting is intermingled and I think we should decouple them.

My original intention in posting this patchset was to clarify that:

- proto pressure only considers sysctl_mem[] (patch 2)
- memcg pressure only indicates the pressure inside itself
- both are considered whenever allocation or reclaim is needed
  (patch 1,3)

In this way, the two kinds of pressure maintain purer semantics, and
the socket core can react to both of them properly and consistently.

>> ---
>>  net/core/sock.c | 23 ++++++++++++++++-------
>>  1 file changed, 16 insertions(+), 7 deletions(-)
>>
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 801df091e37a..7641d64293af 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -2977,21 +2977,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>>  {
>>  	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
>>  	struct proto *prot = sk->sk_prot;
>> -	bool charged = true;
>> +	bool charged = true, pressured = false;
>>  	long allocated;
>>
>>  	sk_memory_allocated_add(sk, amt);
>>  	allocated = sk_memory_allocated(sk);
>> -	if (memcg_charge &&
>> -	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
>> -						gfp_memcg_charge())))
>> -		goto suppress_allocation;
>> +
>> +	if (memcg_charge) {
>> +		charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
>> +						  gfp_memcg_charge());
>> +		if (!charged)
>> +			goto suppress_allocation;
>> +		if (mem_cgroup_under_socket_pressure(sk->sk_memcg))
>
> The memcg under pressure callback does an upward memcg tree walk, so
> please make sure you have tested the performance impact of this.

Yes, I have run several benchmarks on a dual-socket machine with
Intel Xeon(R) Platinum 8260 CPUs and SNC disabled, that is, 2 NUMA
nodes each of which has 24C/48T. All the benchmarks were run inside
a separate cgroup on a clean host.
Below are the results of tbench4 and netperf:

tbench4 Throughput (misleading but traditional)

                     baseline               patchset
Hmean     1      377.62 (   0.00%)      375.06 *  -0.68%*
Hmean     2      753.99 (   0.00%)      753.21 *  -0.10%*
Hmean     4     1503.50 (   0.00%)     1493.07 *  -0.69%*
Hmean     8     2941.43 (   0.00%)     2925.18 *  -0.55%*
Hmean    16     5637.59 (   0.00%)     5603.64 *  -0.60%*
Hmean    32     9042.90 (   0.00%)     9022.53 *  -0.23%*
Hmean    64    10530.55 (   0.00%)    10554.89 *   0.23%*
Hmean   128    24230.20 (   0.00%)    24424.74 *   0.80%*
Hmean   256    23798.21 (   0.00%)    23941.24 *   0.60%*
Hmean   384    23620.63 (   0.00%)    23569.54 *  -0.22%*

netperf-udp

                          baseline               patchset
Hmean  send-64        281.99 (   0.00%)      274.50 *  -2.65%*
Hmean  send-128       556.70 (   0.00%)      545.82 *  -1.96%*
Hmean  send-256      1102.60 (   0.00%)     1091.21 *  -1.03%*
Hmean  send-1024     4180.48 (   0.00%)     4073.87 *  -2.55%*
Hmean  send-2048     7837.61 (   0.00%)     7707.12 *  -1.66%*
Hmean  send-3312    12157.49 (   0.00%)    11845.03 *  -2.57%*
Hmean  send-4096    14512.64 (   0.00%)    14156.45 *  -2.45%*
Hmean  send-8192    24015.40 (   0.00%)    23920.94 (  -0.39%)
Hmean  send-16384   39875.21 (   0.00%)    39696.67 (  -0.45%)
Hmean  recv-64        281.99 (   0.00%)      274.50 *  -2.65%*
Hmean  recv-128       556.70 (   0.00%)      545.82 *  -1.96%*
Hmean  recv-256      1102.60 (   0.00%)     1091.21 *  -1.03%*
Hmean  recv-1024     4180.48 (   0.00%)     4073.76 *  -2.55%*
Hmean  recv-2048     7837.61 (   0.00%)     7707.11 *  -1.67%*
Hmean  recv-3312    12157.49 (   0.00%)    11845.03 *  -2.57%*
Hmean  recv-4096    14512.62 (   0.00%)    14156.45 *  -2.45%*
Hmean  recv-8192    24015.29 (   0.00%)    23920.88 (  -0.39%)
Hmean  recv-16384   39873.93 (   0.00%)    39696.02 (  -0.45%)

netperf-tcp

                     baseline               patchset
Hmean    64     1777.05 (   0.00%)     1793.04 (   0.90%)
Hmean   128     3364.25 (   0.00%)     3451.05 *   2.58%*
Hmean   256     6309.21 (   0.00%)     6506.84 *   3.13%*
Hmean  1024    19571.52 (   0.00%)    19606.65 (   0.18%)
Hmean  2048    26467.00 (   0.00%)    26658.12 (   0.72%)
Hmean  3312    31312.36 (   0.00%)    31403.54 (   0.29%)
Hmean  4096    33263.37 (   0.00%)    33278.77 (   0.05%)
Hmean  8192    39961.82 (   0.00%)    40149.77 (   0.47%)
Hmean 16384    46065.33 (   0.00%)    46683.67 (   1.34%)

Except for a slight regression in netperf-udp, there is no obvious
performance win or loss.
But as you reminded me of the cost of the hierarchical walk, I
re-tested the cases in a 5-level-deep cgroup (originally 2-level),
and the results are:

tbench4 Throughput (misleading but traditional)

                     baseline               patchset
Hmean     1      361.93 (   0.00%)      367.58 *   1.56%*
Hmean     2      734.39 (   0.00%)      730.33 *  -0.55%*
Hmean     4     1426.82 (   0.00%)     1440.81 *   0.98%*
Hmean     8     2848.86 (   0.00%)     2860.87 *   0.42%*
Hmean    16     5436.72 (   0.00%)     5491.72 *   1.01%*
Hmean    32     8743.34 (   0.00%)     8913.27 *   1.94%*
Hmean    64    10345.41 (   0.00%)    10436.92 *   0.88%*
Hmean   128    23390.36 (   0.00%)    23353.09 *  -0.16%*
Hmean   256    23823.20 (   0.00%)    23509.79 *  -1.32%*
Hmean   384    23268.09 (   0.00%)    23178.10 *  -0.39%*

netperf-udp

                          baseline               patchset
Hmean  send-64        278.31 (   0.00%)      275.68 *  -0.94%*
Hmean  send-128       554.52 (   0.00%)      547.46 (  -1.27%)
Hmean  send-256      1106.64 (   0.00%)     1103.01 (  -0.33%)
Hmean  send-1024     4135.84 (   0.00%)     4057.47 *  -1.89%*
Hmean  send-2048     7816.13 (   0.00%)     7732.71 *  -1.07%*
Hmean  send-3312    12068.32 (   0.00%)    11895.94 *  -1.43%*
Hmean  send-4096    14358.02 (   0.00%)    14304.06 (  -0.38%)
Hmean  send-8192    24041.57 (   0.00%)    24061.70 (   0.08%)
Hmean  send-16384   39996.09 (   0.00%)    39936.08 (  -0.15%)
Hmean  recv-64        278.31 (   0.00%)      275.68 *  -0.94%*
Hmean  recv-128       554.52 (   0.00%)      547.46 (  -1.27%)
Hmean  recv-256      1106.64 (   0.00%)     1103.01 (  -0.33%)
Hmean  recv-1024     4135.84 (   0.00%)     4057.47 *  -1.89%*
Hmean  recv-2048     7816.13 (   0.00%)     7732.71 *  -1.07%*
Hmean  recv-3312    12068.32 (   0.00%)    11895.94 *  -1.43%*
Hmean  recv-4096    14357.99 (   0.00%)    14304.04 (  -0.38%)
Hmean  recv-8192    24041.43 (   0.00%)    24061.58 (   0.08%)
Hmean  recv-16384   39995.72 (   0.00%)    39935.68 (  -0.15%)

netperf-tcp

                     baseline               patchset
Hmean    64     1779.93 (   0.00%)     1784.75 (   0.27%)
Hmean   128     3380.32 (   0.00%)     3424.14 (   1.30%)
Hmean   256     6383.37 (   0.00%)     6504.97 *   1.90%*
Hmean  1024    19345.07 (   0.00%)    19604.06 *   1.34%*
Hmean  2048    26547.60 (   0.00%)    26743.94 *   0.74%*
Hmean  3312    30948.40 (   0.00%)    31419.11 *   1.52%*
Hmean  4096    32888.83 (   0.00%)    33125.01 *   0.72%*
Hmean  8192    40020.38 (   0.00%)    39949.53 (  -0.18%)
Hmean 16384    46084.48 (   0.00%)    46300.43 (   0.47%)

Still no obvious difference, and even the udp regression is reduced.

Thanks & Best,
	Abel
+linux-mm and cgroups

On Mon, May 29, 2023 at 07:58:45PM +0800, Abel Wu wrote:
> Hi Shakeel, thanks for reviewing! And sorry for replying so late,
> I was on a vacation :)
>
> On 5/25/23 9:22 AM, Shakeel Butt wrote:
> > On Mon, May 22, 2023 at 03:01:21PM +0800, Abel Wu wrote:
> > > For now __sk_mem_raise_allocated() mainly considers global socket
> > > memory pressure and allows raising if no global pressure is
> > > observed, including for sockets whose memcgs are under pressure,
> > > which might result in longer memcg memstalls.
> > >
> > > So take net-memcg's pressure into consideration when allocating
> > > socket memory to alleviate long tail latencies.
> > >
> > > Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> >
> > Hi Abel,
> >
> > Have you seen any real world production issue which is fixed by this
> > patch or is it more of a fix after reading code?
>
> The latter. But we do observe one common case in the production env:
> a p2p service, which mainly downloads container images, running
> inside a container with a tight memory limit can easily be throttled,
> kept memstalled for a long period of time, and sometimes even
> OOM-killed. This service shows burst usage of TCP memory and I think
> it indeed needs suppressing sockmem allocation if its memcg is
> already under pressure. The memcg pressure is usually caused by too
> many page caches, with the dirty ones starting to be written back to
> slow backends. So it makes little sense to keep receiving net data
> and consuming even more memory.
>

We actually made an intentional decision to not throttle the incoming
traffic under memory pressure. See 720ca52bcef22 ("net-memcg: avoid
stalls when under memory pressure"). If you think the throttling
behavior is preferred for your application, please propose the patch
separately and we can work on how to enable flexible policy here.

> > This code is quite subtle and small changes can cause unintended
> > behavior changes. At the moment the tcp memory accounting and memcg
> > accounting is intermingled and I think we should decouple them.
>
> My original intention in posting this patchset was to clarify that:
>
> - proto pressure only considers sysctl_mem[] (patch 2)
> - memcg pressure only indicates the pressure inside itself
> - both are considered whenever allocation or reclaim is needed
>   (patch 1,3)
>
> In this way, the two kinds of pressure maintain purer semantics, and
> the socket core can react to both of them properly and consistently.

Can you please resend your patch series (without patch 3) and Cc
linux-mm, the cgroups list and the memcg maintainers as well?
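The policy behind 720ca52bcef22 is easy to state in code. The sketch
below is a paraphrase of the idea, not the commit's actual diff: the
helper is hypothetical, though SK_MEM_SEND and SK_MEM_RECV are the
real 'kind' values passed into __sk_mem_raise_allocated().

	/* Paraphrase of the policy in 720ca52bcef22, NOT its diff:
	 * back-pressuring the send path is safe, but suppressing
	 * receive allocations drops data that already crossed the
	 * network and only prolongs the stall via retransmits.
	 * Hypothetical helper; SK_MEM_SEND/SK_MEM_RECV are real.
	 */
	static bool sketch_suppress_under_memcg_pressure(int kind)
	{
		if (kind == SK_MEM_SEND)
			return true;	/* sender can wait for wmem */

		return false;	/* keep accepting SK_MEM_RECV data */
	}

The asymmetry is the whole point: throttling the sender merely delays
local writes, while throttling the receiver stalls a remote peer that
has already transmitted the data.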
On 5/30/23 5:12 AM, Shakeel Butt wrote:
> +linux-mm and cgroups
>
> On Mon, May 29, 2023 at 07:58:45PM +0800, Abel Wu wrote:
>> Hi Shakeel, thanks for reviewing! And sorry for replying so late,
>> I was on a vacation :)
>>
>> On 5/25/23 9:22 AM, Shakeel Butt wrote:
>>> On Mon, May 22, 2023 at 03:01:21PM +0800, Abel Wu wrote:
>>>> For now __sk_mem_raise_allocated() mainly considers global socket
>>>> memory pressure and allows raising if no global pressure is
>>>> observed, including for sockets whose memcgs are under pressure,
>>>> which might result in longer memcg memstalls.
>>>>
>>>> So take net-memcg's pressure into consideration when allocating
>>>> socket memory to alleviate long tail latencies.
>>>>
>>>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>>>
>>> Hi Abel,
>>>
>>> Have you seen any real world production issue which is fixed by this
>>> patch or is it more of a fix after reading code?
>>
>> The latter. But we do observe one common case in the production env:
>> a p2p service, which mainly downloads container images, running
>> inside a container with a tight memory limit can easily be throttled,
>> kept memstalled for a long period of time, and sometimes even
>> OOM-killed. This service shows burst usage of TCP memory and I think
>> it indeed needs suppressing sockmem allocation if its memcg is
>> already under pressure. The memcg pressure is usually caused by too
>> many page caches, with the dirty ones starting to be written back to
>> slow backends. So it makes little sense to keep receiving net data
>> and consuming even more memory.
>>
>
> We actually made an intentional decision to not throttle the incoming
> traffic under memory pressure. See 720ca52bcef22 ("net-memcg: avoid
> stalls when under memory pressure"). If you think the throttling
> behavior is preferred for your application, please propose the patch
> separately and we can work on how to enable flexible policy here.

Ah I see. Thanks for providing the context. So suppressing the alloc
under memcg pressure could further keep senders waiting if SACKed
segs get dropped from the OFO queue.

>>> This code is quite subtle and small changes can cause unintended
>>> behavior changes. At the moment the tcp memory accounting and memcg
>>> accounting is intermingled and I think we should decouple them.
>>
>> My original intention in posting this patchset was to clarify that:
>>
>> - proto pressure only considers sysctl_mem[] (patch 2)
>> - memcg pressure only indicates the pressure inside itself
>> - both are considered whenever allocation or reclaim is needed
>>    (patch 1,3)
>>
>> In this way, the two kinds of pressure maintain purer semantics, and
>> the socket core can react to both of them properly and consistently.
>
> Can you please resend your patch series (without patch 3) and Cc
> linux-mm, the cgroups list and the memcg maintainers as well?

Yeah, absolutely.

Thanks,
	Abel
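Pulling the thread's semantics together: Abel's three bullets amount
to one combined check at allocation time. A minimal sketch under those
assumptions follows; the helper name is illustrative and not code from
the patchset:

	/* Illustrative combination of the two decoupled pressure
	 * signals discussed above; hypothetical helper, not the
	 * patchset's actual code.
	 */
	static bool sk_under_any_memory_pressure(struct sock *sk)
	{
		/* Global/proto pressure: driven only by sysctl_mem[]. */
		if (sk->sk_prot->memory_pressure &&
		    *sk->sk_prot->memory_pressure)
			return true;

		/* Memcg pressure: driven only by the memcg's own state. */
		return mem_cgroup_sockets_enabled && sk->sk_memcg &&
		       mem_cgroup_under_socket_pressure(sk->sk_memcg);
	}

Keeping the two signals separate, and OR-ing them only at the point of
allocation or reclaim, is what gives each one the "purer semantics"
Abel argues for.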
diff --git a/net/core/sock.c b/net/core/sock.c
index 801df091e37a..7641d64293af 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2977,21 +2977,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
 	struct proto *prot = sk->sk_prot;
-	bool charged = true;
+	bool charged = true, pressured = false;
 	long allocated;
 
 	sk_memory_allocated_add(sk, amt);
 	allocated = sk_memory_allocated(sk);
-	if (memcg_charge &&
-	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
-						gfp_memcg_charge())))
-		goto suppress_allocation;
+
+	if (memcg_charge) {
+		charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
+						  gfp_memcg_charge());
+		if (!charged)
+			goto suppress_allocation;
+		if (mem_cgroup_under_socket_pressure(sk->sk_memcg))
+			pressured = true;
+	}
 
 	/* Under limit. */
-	if (allocated <= sk_prot_mem_limits(sk, 0)) {
+	if (allocated <= sk_prot_mem_limits(sk, 0))
 		sk_leave_memory_pressure(sk);
+	else
+		pressured = true;
+
+	/* No pressure observed in global/memcg. */
+	if (!pressured)
 		return 1;
-	}
 
 	/* Under pressure. */
 	if (allocated > sk_prot_mem_limits(sk, 1))
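Read as straight-line code, the patched fast path condenses to the
following (transcribed from the diff above, with explanatory comments
added; the suppress_allocation and over-limit slow paths are omitted):

	/* Condensed from the diff above; slow paths omitted. */
	sk_memory_allocated_add(sk, amt);
	allocated = sk_memory_allocated(sk);

	if (memcg_charge) {
		charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
						  gfp_memcg_charge());
		if (!charged)
			goto suppress_allocation;	/* memcg charge failed */
		if (mem_cgroup_under_socket_pressure(sk->sk_memcg))
			pressured = true;	/* memcg is in distress */
	}

	if (allocated <= sk_prot_mem_limits(sk, 0))
		sk_leave_memory_pressure(sk);	/* under global limit */
	else
		pressured = true;		/* above global limit */

	if (!pressured)
		return 1;	/* no pressure anywhere: grant allocation */

The behavioral change is in the early return: before the patch, being
under the global limit alone was enough to return 1; after it, a
pressured memcg also forces the allocation through the stricter
pressure handling below.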
For now __sk_mem_raise_allocated() mainly considers global socket
memory pressure and allows raising if no global pressure is
observed, including for sockets whose memcgs are under pressure,
which might result in longer memcg memstalls.

So take net-memcg's pressure into consideration when allocating
socket memory to alleviate long tail latencies.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 net/core/sock.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)