[mm-unstable,v4,0/5] mm: memcg: subtree stats flushing and thresholds

Message ID 20231129032154.3710765-1-yosryahmed@google.com (mailing list archive)

Yosry Ahmed Nov. 29, 2023, 3:21 a.m. UTC
This series attempts to address shortcomings in today's approach for memcg
stats flushing, namely occasionally stale or expensive stat reads. The
series does so by changing the threshold that we use to decide whether
to trigger a flush to be per memcg instead of global (patch 3), and then
changing flushing to be per memcg (i.e. subtree flushes) instead of
global (patch 5).
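
As an illustration only, the two ideas above (a per-memcg update threshold
and a flush limited to one subtree) can be sketched in userspace C. This is
not the code from patches 3 and 5: the struct layout, the function names
and the FLUSH_THRESHOLD value are all hypothetical, and the sketch ignores
per-CPU counters, locking and periodic flushing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical threshold; the value used by the series is set in patch 3. */
#define FLUSH_THRESHOLD 64

/* Simplified stand-in for a memcg that tracks a single stat. */
struct memcg {
	struct memcg *parent;       /* NULL for the root */
	struct memcg *first_child;
	struct memcg *next_sibling;
	long updates;    /* magnitude of updates in this subtree since its last flush */
	long unflushed;  /* signed delta not yet folded into stat */
	long stat;       /* hierarchical value as of the last flush */
};

static void memcg_init(struct memcg *m, struct memcg *parent)
{
	m->parent = parent;
	m->first_child = NULL;
	m->next_sibling = NULL;
	m->updates = 0;
	m->unflushed = 0;
	m->stat = 0;
	if (parent) {
		m->next_sibling = parent->first_child;
		parent->first_child = m;
	}
}

/* Record an update and charge its magnitude to this memcg and its ancestors. */
static void memcg_update(struct memcg *m, long delta)
{
	assert(m != NULL);
	m->unflushed += delta;
	for (struct memcg *p = m; p; p = p->parent)
		p->updates += labs(delta);
}

/* Fold all unflushed deltas in the subtree into its stats; return the total. */
static long flush_subtree(struct memcg *m)
{
	long delta = m->unflushed;

	for (struct memcg *c = m->first_child; c; c = c->next_sibling)
		delta += flush_subtree(c);
	m->stat += delta;
	m->unflushed = 0;
	m->updates = 0;
	return delta;
}

/* Read the stat, flushing only this subtree and only if it is stale enough. */
static bool memcg_read_stat(struct memcg *m, long *out)
{
	bool flushed = false;

	if (m->updates > FLUSH_THRESHOLD) {
		long delta = flush_subtree(m);

		/* Leave the delta pending in the parent for its own next flush. */
		if (m->parent)
			m->parent->unflushed += delta;
		flushed = true;
	}
	*out = m->stat;
	return flushed;
}
```

In this sketch, reading a memcg whose subtree saw few updates skips the
flush entirely, and a flush triggered by a busy memcg never walks its
siblings.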

Patches 3 and 5 are the core of the series, and they include more details
and testing results. The rest are either cleanups or prep work.

This series replaces the "memcg: more sophisticated stats flushing"
series [1], which itself replaced another series, in a long list of
attempts to improve memcg stats flushing. It is not a new version of
the same patchset as it is a completely different approach. This is
based on collected feedback from discussions on lkml in all previous
attempts. Hopefully, this is the final attempt.

There was a reported regression in v2 [2] for the will-it-scale::fallocate
benchmark. I believe this regression should not affect production
workloads. This specific benchmark is allocating and freeing memory
(using fallocate/ftruncate) at a rate that is far too fast to make actual
use of the memory. Testing this series on 100+ machines running
production workloads did not show any practical regressions in page
fault latency or allocation latency, but it showed great improvements in
stats read time. I do not have numbers about the exact improvements for
this series, but combined with another optimization for cgroup v1 [3] we
see 5-10x improvements. A significant chunk of that is coming from the
cgroup v1 optimization, but this series also made an improvement as
reported by Domenico [4].

v3 -> v4:
- Rebased on top of mm-unstable + "workload-specific and memory
  pressure-driven zswap writeback" series to fix conflicts [5].

v3: https://lore.kernel.org/all/20231116022411.2250072-1-yosryahmed@google.com/

[1] https://lore.kernel.org/lkml/20230913073846.1528938-1-yosryahmed@google.com/
[2] https://lore.kernel.org/lkml/202310202303.c68e7639-oliver.sang@intel.com/
[3] https://lore.kernel.org/lkml/20230803185046.1385770-1-yosryahmed@google.com/
[4] https://lore.kernel.org/lkml/CAFYChMv_kv_KXOMRkrmTN-7MrfgBHMcK3YXv0dPYEL7nK77e2A@mail.gmail.com/
[5] https://lore.kernel.org/all/20231127234600.2971029-1-nphamcs@gmail.com/

Yosry Ahmed (5):
  mm: memcg: change flush_next_time to flush_last_time
  mm: memcg: move vmstats structs definition above flushing code
  mm: memcg: make stats flushing threshold per-memcg
  mm: workingset: move the stats flush into workingset_test_recent()
  mm: memcg: restore subtree stats flushing

 include/linux/memcontrol.h |   8 +-
 mm/memcontrol.c            | 272 +++++++++++++++++++++----------------
 mm/vmscan.c                |   2 +-
 mm/workingset.c            |  42 ++++--
 4 files changed, 188 insertions(+), 136 deletions(-)

Bagas Sanjaya Dec. 2, 2023, 4:51 a.m. UTC | #1
On Wed, Nov 29, 2023 at 03:21:48AM +0000, Yosry Ahmed wrote:
> [...]

No regressions when booting the kernel with this series applied.

Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>