| Message ID | 20231220152653.3273778-3-schatzberg.dan@gmail.com (mailing list archive) |
|---|---|
| State | New |
| Series | Add swappiness argument to memory.reclaim |
On Wed 20-12-23 07:26:51, Dan Schatzberg wrote:
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> [...]
>
> With this patch, we avoid these downsides of adjusting vm.swappiness
> globally.

Thank you for extending the changelog with usecases!

> [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
>
> Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 18 ++++----
>  include/linux/swap.h                    |  3 +-
>  mm/memcontrol.c                         | 56 ++++++++++++++++++++-----
>  mm/vmscan.c                             | 13 +++++-
>  4 files changed, 69 insertions(+), 21 deletions(-)

LGTM
Acked-by: Michal Hocko <mhocko@suse.com>

Just one minor thing. It would be really great to prevent potential
incorrect use of mem_cgroup_swappiness. This should be a function internal
to memcg. Now, having scan_control internal to vmscan.c makes that
harder, and moving it out to swap.h or internal.h sounds like overreach.

We could do this at least to reduce those mistakes.
I can make it a proper patch if this seems reasonable, or you can fold it
into your patch directly.
---

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f98dff23b758..5f3a182e9515 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -92,8 +92,10 @@ struct scan_control {
 	unsigned long	anon_cost;
 	unsigned long	file_cost;
 
-	/* Swappiness value for reclaim. NULL will fall back to per-memcg/global value */
+#ifdef CONFIG_MEMCG
+	/* Swappiness value for reclaim. Always use sc_swappiness()! */
 	int *swappiness;
+#endif
 
 	/* Can active folios be deactivated as part of reclaim? */
 #define DEACTIVATE_ANON 1
@@ -230,6 +232,13 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 #endif
 	return false;
 }
+
+static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
+{
+	if (sc->swappiness)
+		return *sc->swappiness;
+	return mem_cgroup_swappiness(memcg);
+}
 #else
 static bool cgroup_reclaim(struct scan_control *sc)
 {
@@ -245,6 +254,10 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 {
 	return true;
 }
+static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
+{
+	return READ_ONCE(vm_swappiness);
+}
 #endif
 
 static void set_task_reclaim_state(struct task_struct *task,
@@ -2330,8 +2343,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	unsigned long anon_cost, file_cost, total_cost;
-	int swappiness = sc->swappiness ?
-		*sc->swappiness : mem_cgroup_swappiness(memcg);
+	int swappiness = sc_swappiness(sc, memcg);
 	u64 fraction[ANON_AND_FILE];
 	u64 denominator = 0;	/* gcc */
 	enum scan_balance scan_balance;
@@ -2612,10 +2624,7 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
 	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
 		return 0;
 
-	if (sc->swappiness)
-		return *sc->swappiness;
-
-	return mem_cgroup_swappiness(memcg);
+	return sc_swappiness(sc, memcg);
 }
 
 static int get_nr_gens(struct lruvec *lruvec, int type)
On Wed, 20 Dec 2023, Dan Schatzberg wrote:

> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> [...]
>
> Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>

Acked-by: David Rientjes <rientjes@google.com>
On Wed, Dec 20, 2023 at 8:27 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> [...]
>
> Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>

The cover letter says:
"Previously, this exact interface addition was proposed by Yosry[3]."

So I think it should be acknowledged with a Suggested-by, based on:
"A Suggested-by: tag indicates that the patch idea is suggested by the
person named and ensures credit to the person for the idea."
from
https://docs.kernel.org/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes

>  Documentation/admin-guide/cgroup-v2.rst | 18 ++++----
>  include/linux/swap.h                    |  3 +-
>  mm/memcontrol.c                         | 56 ++++++++++++++++++++-----
>  mm/vmscan.c                             | 13 +++++-
>  4 files changed, 69 insertions(+), 21 deletions(-)

...

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d91963e2d47f..aa5666842c49 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -92,6 +92,9 @@ struct scan_control {
> 	unsigned long	anon_cost;
> 	unsigned long	file_cost;
>
> +	/* Swappiness value for reclaim. NULL will fall back to per-memcg/global value */
> +	int *swappiness;

Using a pointer to indicate whether the type it points to is
overridden isn't really a good practice.

A better alternative was suggested during the v2:
"Perhaps the negative to avoid unnecessary dereferences."
https://lore.kernel.org/linux-mm/dhhjw4h22q4ngwtxmhuyifv32zjd6z2relrcjgnxsw6zys3mod@o6dh5dy53ae3/

Since only proactive reclaim can override swappiness, meaning it only
happens if sc->proactive is true, I think the best way to make it work
without spending much effort is to create a helper as Michal suggested,
but it should look like:

  sc_swappiness()
  {
	return sc->proactive ? sc->swappiness : mem_cgroup_swappiness(memcg);
  }

In this patchset, sc->swappiness really means
sc->proactive_swappiness, so it should be renamed accordingly.
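[Editorial note: for illustration only, not code from this thread. A rough
sketch of the proactive-coupled helper being suggested, assuming the field
were renamed to a plain `int proactive_swappiness` that only memory.reclaim
fills in:]

/*
 * Sketch only -- assumes scan_control carries a plain int
 * proactive_swappiness that is meaningful only when sc->proactive is
 * set (i.e. the reclaim was triggered via memory.reclaim).
 */
static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
{
	if (sc->proactive)
		return sc->proactive_swappiness;

	return mem_cgroup_swappiness(memcg);
}

[As written, a memory.reclaim write that omits swappiness= would still need
proactive_swappiness pre-filled (or a sentinel check); that is exactly the
initialization concern Dan raises later in the thread.]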
Hi Dan,

Thanks for removing the -1 special cases.

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Wed, Dec 20, 2023 at 7:27 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> [...]
Hi Yu Zhao,

Thanks for the feedback, sorry for the delayed response.

On Thu, Dec 21, 2023 at 10:31:59PM -0700, Yu Zhao wrote:
> On Wed, Dec 20, 2023 at 8:27 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> >
> > ...
>
> The cover letter says:
> "Previously, this exact interface addition was proposed by Yosry[3]."
>
> So I think it should be acknowledged with a Suggested-by, based on:
> "A Suggested-by: tag indicates that the patch idea is suggested by the
> person named and ensures credit to the person for the idea."
> from
> https://docs.kernel.org/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes

Sure, will do.

> > +	/* Swappiness value for reclaim. NULL will fall back to per-memcg/global value */
> > +	int *swappiness;
>
> Using a pointer to indicate whether the type it points to is
> overridden isn't really a good practice.
>
> A better alternative was suggested during the v2:
> "Perhaps the negative to avoid unnecessary dereferences."
> https://lore.kernel.org/linux-mm/dhhjw4h22q4ngwtxmhuyifv32zjd6z2relrcjgnxsw6zys3mod@o6dh5dy53ae3/

I did have a couple versions with a negative, but it creates
initialization issues where every instantiation of scan_control needs
to make sure to initialize swappiness or else it will behave as if
swappiness is 0. That's pretty error prone, so using the pointer seemed
the better approach.

> Since only proactive reclaim can override swappiness, meaning it only
> happens if sc->proactive is true, I think the best way to make it work
> without spending much effort is to create a helper as Michal suggested,
> but it should look like:
>
>   sc_swappiness()
>   {
>	return sc->proactive ? sc->swappiness : mem_cgroup_swappiness(memcg);
>   }
>
> In this patchset, sc->swappiness really means
> sc->proactive_swappiness, so it should be renamed accordingly.

Helper aside, I disagree with this point about coupling with the
proactive flag. The fact that the only user currently is proactive
reclaim doesn't imply to me that the interface (in scan_control)
should be coupled to the use-case. It's easier to reason about a
swappiness field that overrides swappiness for all scans that set it,
regardless of the users.
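[Editorial note: to make the initialization concern above concrete, here is
a stripped-down, standalone userspace mock (illustration only, not kernel
code; the names are made up) showing why a NULL pointer default is safer
than a plain int whose zero value is a legal swappiness:]

/* Mock of the relevant part of scan_control. */
struct scan_control {
	unsigned long nr_to_reclaim;
	int *swappiness;	/* NULL => fall back to memcg/global value */
};

static int mock_swappiness(struct scan_control *sc, int memcg_swappiness)
{
	/* NULL unambiguously means "not overridden". */
	return sc->swappiness ? *sc->swappiness : memcg_swappiness;
}

int main(void)
{
	/* Designated initializer: unmentioned fields are zero/NULL. */
	struct scan_control sc = { .nr_to_reclaim = 32 };

	/*
	 * With the pointer, the omitted field falls back to the memcg
	 * value (60 here). Had swappiness been a plain int, this caller
	 * would silently run with swappiness == 0, i.e. file-only reclaim.
	 */
	return mock_swappiness(&sc, 60) == 60 ? 0 : 1;
}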
On Thu, Dec 21, 2023 at 10:29:59AM +0100, Michal Hocko wrote:
> On Wed 20-12-23 07:26:51, Dan Schatzberg wrote:
> > ...
>
> LGTM
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> Just one minor thing. It would be really great to prevent potential
> incorrect use of mem_cgroup_swappiness. This should be a function internal
> to memcg. Now, having scan_control internal to vmscan.c makes that
> harder, and moving it out to swap.h or internal.h sounds like overreach.
>
> We could do this at least to reduce those mistakes. I can make it a
> proper patch if this seems reasonable, or you can fold it into your patch
> directly.
> ---
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f98dff23b758..5f3a182e9515 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -92,8 +92,10 @@ struct scan_control {
> 	unsigned long	anon_cost;
> 	unsigned long	file_cost;
>
> -	/* Swappiness value for reclaim. NULL will fall back to per-memcg/global value */
> +#ifdef CONFIG_MEMCG
> +	/* Swappiness value for reclaim. Always use sc_swappiness()! */
> 	int *swappiness;
> +#endif
>
> [...]

Thanks for the review Michal and sorry for the delayed response. Your
patch looks reasonable to me but I'm a bit unclear about the need for
#ifdef - mem_cgroup_swappiness already works correctly regardless of
CONFIG_MEMCG or not - why not make sc->swappiness and sc_swappiness()
unconditional? Happy to roll that into the next version of my patch.
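[Editorial note: for reference, the unconditional variant being asked about
would presumably just be Michal's helper without the #ifdef split, leaning
on the existing !CONFIG_MEMCG stub of mem_cgroup_swappiness(). A sketch, not
a tested patch:]

/* Sketch only: sc_swappiness() without the CONFIG_MEMCG split. */
static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
{
	if (sc->swappiness)
		return *sc->swappiness;
	/*
	 * mem_cgroup_swappiness() already degrades to
	 * READ_ONCE(vm_swappiness) when CONFIG_MEMCG is disabled, so no
	 * separate !CONFIG_MEMCG version of this helper would be needed.
	 */
	return mem_cgroup_swappiness(memcg);
}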
On Wed, Dec 20, 2023 at 7:27 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> [...]
>
> @@ -6926,7 +6959,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
>  			min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> -			GFP_KERNEL, reclaim_options);
> +			GFP_KERNEL, reclaim_options,
> +			swappiness == -1 ? NULL : &swappiness);
>
> [...]

The swappiness = -1 part seems a bit awkward, but otherwise LGTM.
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
On Tue, Jan 2, 2024 at 8:21 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> [...]
>
> Helper aside, I disagree with this point about coupling with the
> proactive flag.

Sure. But I would like to hear a *concrete* counterexample.

> The fact that the only user currently is proactive
> reclaim

Yes, that's a fact, and we should make the decision based on the
current known facts.

> doesn't imply to me that the interface (in scan_control)
> should be coupled to the use-case.

Future always has its uncertainty which I would not worry so much about.

> It's easier to reason about a
> swappiness field that overrides swappiness for all scans that set it,
> regardless of the users.

For example? And how likely would that happen in the next few years?
On Tue 02-01-24 12:43:49, Dan Schatzberg wrote:
> On Thu, Dec 21, 2023 at 10:29:59AM +0100, Michal Hocko wrote:
[...]
> Thanks for the review Michal and sorry for the delayed response. Your
> patch looks reasonable to me but I'm a bit unclear about the need for
> #ifdef - mem_cgroup_swappiness already works correctly regardless of
> CONFIG_MEMCG or not - why not make sc->swappiness and sc_swappiness()
> unconditional?

We do not have a user other than memcg pro-active reclaim. Making it
conditional makes that more explicit. Nothing I would insist on, of
course.
On Tue 02-01-24 10:21:44, Dan Schatzberg wrote:
[...]
> > Using a pointer to indicate whether the type it points to is
> > overridden isn't really a good practice.
> >
> > A better alternative was suggested during the v2:
> > "Perhaps the negative to avoid unnecessary dereferences."
> > https://lore.kernel.org/linux-mm/dhhjw4h22q4ngwtxmhuyifv32zjd6z2relrcjgnxsw6zys3mod@o6dh5dy53ae3/
>
> I did have a couple versions with a negative, but it creates
> initialization issues where every instantiation of scan_control needs
> to make sure to initialize swappiness or else it will behave as if
> swappiness is 0. That's pretty error prone, so using the pointer seemed
> the better approach.

I do agree with this. Especially for an opt-in feature it is better if
the default initialization has a clear meaning. In this case, even if
somebody doesn't use the helper, the NULL should be caught as a NULL ptr
dereference rather than a silent misbehavior.
On Tue, Jan 02, 2024 at 05:27:18PM -0700, Yu Zhao wrote:
[...]
> > Helper aside, I disagree with this point about coupling with the
> > proactive flag.
>
> Sure. But I would like to hear a *concrete* counterexample.
>
> > The fact that the only user currently is proactive
> > reclaim
>
> Yes, that's a fact, and we should make the decision based on the
> current known facts.
>
> > doesn't imply to me that the interface (in scan_control)
> > should be coupled to the use-case.
>
> Future always has its uncertainty which I would not worry so much about.
>
> > It's easier to reason about a
> > swappiness field that overrides swappiness for all scans that set it,
> > regardless of the users.
>
> For example? And how likely would that happen in the next few years?

My argument isn't that making the interface more generic will be
worthwhile due to some future use-case. Rather, my argument is that
making the interface more generic makes the code simpler. All else
being equal, having sc->swappiness behave the same regardless of
sc->proactive makes vmscan.c and struct scan_control easier to follow.

That being said, I'm fine with conceding this point, particularly since
both you and Michal appear to feel similarly. I'll make the
corresponding change and send out a new version.
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3f85254f3cef..ee42f74e0765 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1282,17 +1282,10 @@ PAGE_SIZE multiple when read back.
 	This is a simple interface to trigger memory reclaim in the
 	target cgroup.
 
-	This file accepts a single key, the number of bytes to reclaim.
-	No nested keys are currently supported.
-
 	Example::
 
 	  echo "1G" > memory.reclaim
 
-	The interface can be later extended with nested keys to
-	configure the reclaim behavior. For example, specify the
-	type of memory to reclaim from (anon, file, ..).
-
 	Please note that the kernel can over or under reclaim from
 	the target cgroup. If less bytes are reclaimed than the
 	specified amount, -EAGAIN is returned.
@@ -1304,6 +1297,17 @@ PAGE_SIZE multiple when read back.
 	This means that the networking layer will not adapt based on
 	reclaim induced by memory.reclaim.
 
+The following nested keys are defined.
+
+	  ==========            ================================
+	  swappiness            Swappiness value to reclaim with
+	  ==========            ================================
+
+	Specifying a swappiness value instructs the kernel to perform
+	the reclaim with that swappiness value. Note that this has the
+	same semantics as vm.swappiness applied to memcg reclaim with
+	all the existing limitations and potential future extensions.
+
 memory.peak
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e2ab76c25b4a..8afdec40efe3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -412,7 +412,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
-						  unsigned int reclaim_options);
+						  unsigned int reclaim_options,
+						  int *swappiness);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fbe9f02dd206..6d627a754851 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -52,6 +52,7 @@
 #include <linux/sort.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
+#include <linux/parser.h>
 #include <linux/vmpressure.h>
 #include <linux/memremap.h>
 #include <linux/mm_inline.h>
@@ -2449,7 +2450,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							gfp_mask,
-							MEMCG_RECLAIM_MAY_SWAP);
+							MEMCG_RECLAIM_MAY_SWAP,
+							NULL);
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2740,7 +2742,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, reclaim_options);
+						    gfp_mask, reclaim_options, NULL);
 	psi_memstall_leave(&pflags);
 
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3660,7 +3662,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3774,7 +3776,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP))
+						  MEMCG_RECLAIM_MAY_SWAP, NULL))
 			nr_retries--;
 	}
 
@@ -6720,7 +6722,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
 
 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6769,7 +6771,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL))
 				nr_reclaims--;
 			continue;
 		}
@@ -6895,19 +6897,50 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+enum {
+	MEMORY_RECLAIM_SWAPPINESS = 0,
+	MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t tokens = {
+	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
+	{ MEMORY_RECLAIM_NULL, NULL },
+};
+
 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			      size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
+	int swappiness = -1;
 	unsigned int reclaim_options;
-	int err;
+	char *old_buf, *start;
+	substring_t args[MAX_OPT_ARGS];
 
 	buf = strstrip(buf);
-	err = page_counter_memparse(buf, "", &nr_to_reclaim);
-	if (err)
-		return err;
+
+	old_buf = buf;
+	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+	if (buf == old_buf)
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	while ((start = strsep(&buf, " ")) != NULL) {
+		if (!strlen(start))
+			continue;
+		switch (match_token(start, tokens, args)) {
+		case MEMORY_RECLAIM_SWAPPINESS:
+			if (match_int(&args[0], &swappiness))
+				return -EINVAL;
+			if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
+				return -EINVAL;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
 
 	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
 	while (nr_reclaimed < nr_to_reclaim) {
@@ -6926,7 +6959,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
 			min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
-			GFP_KERNEL, reclaim_options);
+			GFP_KERNEL, reclaim_options,
+			swappiness == -1 ? NULL : &swappiness);
 
 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d91963e2d47f..aa5666842c49 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -92,6 +92,9 @@ struct scan_control {
 	unsigned long	anon_cost;
 	unsigned long	file_cost;
 
+	/* Swappiness value for reclaim. NULL will fall back to per-memcg/global value */
+	int *swappiness;
+
 	/* Can active folios be deactivated as part of reclaim? */
 #define DEACTIVATE_ANON 1
 #define DEACTIVATE_FILE 2
@@ -2327,7 +2330,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	unsigned long anon_cost, file_cost, total_cost;
-	int swappiness = mem_cgroup_swappiness(memcg);
+	int swappiness = sc->swappiness ?
+		*sc->swappiness : mem_cgroup_swappiness(memcg);
 	u64 fraction[ANON_AND_FILE];
 	u64 denominator = 0;	/* gcc */
 	enum scan_balance scan_balance;
@@ -2608,6 +2612,9 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
 	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
 		return 0;
 
+	if (sc->swappiness)
+		return *sc->swappiness;
+
 	return mem_cgroup_swappiness(memcg);
 }
 
@@ -6463,12 +6470,14 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
-					   unsigned int reclaim_options)
+					   unsigned int reclaim_options,
+					   int *swappiness)
 {
 	unsigned long nr_reclaimed;
 	unsigned int noreclaim_flag;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
+		.swappiness = swappiness,
 		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 		.reclaim_idx = MAX_NR_ZONES - 1,
Allow proactive reclaimers to submit an additional swappiness=<val>
argument to memory.reclaim. This overrides the global or per-memcg
swappiness setting for that reclaim attempt.

For example:

  echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim

will perform reclaim on the rootcg with a swappiness setting of 0 (no
swap) regardless of the vm.swappiness sysctl setting.

Userspace proactive reclaimers use the memory.reclaim interface to
trigger reclaim. The memory.reclaim interface does not allow for any way
to affect the balance of file vs anon during proactive reclaim. The only
approach is to adjust the vm.swappiness setting. However, there are a
few reasons we look to control the balance of file vs anon during
proactive reclaim, separately from reactive reclaim:

* Swapout should be limited to manage SSD write endurance. In near-OOM
situations we are fine with lots of swap-out to avoid OOMs. As these are
typically rare events, they have relatively little impact on write
endurance. However, proactive reclaim runs continuously and so its
impact on SSD write endurance is more significant. Therefore it is
desirable to control swap-out for proactive reclaim separately from
reactive reclaim.

* Some userspace OOM killers like systemd-oomd[1] support OOM killing on
swap exhaustion. This makes sense if the swap exhaustion is triggered
due to reactive reclaim but less so if it is triggered due to proactive
reclaim (e.g. one could see OOMs when free memory is ample but anon is
just particularly cold). Therefore, it's desirable to have proactive
reclaim reduce or stop swap-out before the threshold at which OOM
killing occurs.

In the case of Meta's Senpai proactive reclaimer, we adjust
vm.swappiness before writes to memory.reclaim[2]. This has been in
production for nearly two years and has addressed our needs to control
proactive vs reactive reclaim behavior, but is still not ideal for a
number of reasons:

* vm.swappiness is a global setting; adjusting it can race/interfere
with other system administration that wishes to control vm.swappiness.
In our case, we need to disable Senpai before adjusting vm.swappiness.

* vm.swappiness is stateful - so a crash or restart of Senpai can leave
a misconfigured setting. This requires some additional management to
record the "desired" setting and ensure Senpai always adjusts to it.

With this patch, we avoid these downsides of adjusting vm.swappiness
globally.

[1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 18 ++++----
 include/linux/swap.h                    |  3 +-
 mm/memcontrol.c                         | 56 ++++++++++++++++++++-----
 mm/vmscan.c                             | 13 +++++-
 4 files changed, 69 insertions(+), 21 deletions(-)
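[Editorial note: for context, a minimal userspace sketch of how a proactive
reclaimer might drive the new nested key. The cgroup path and sizes are made
up for illustration; this is not part of the patch.]

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/workload.slice/memory.reclaim";
	const char req[] = "64M swappiness=0";	/* reclaim up to 64M, no swap-out */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/*
	 * On success the whole request is consumed. write() fails with
	 * errno == EAGAIN when less than the requested amount could be
	 * reclaimed, and with errno == EINVAL on kernels that do not
	 * understand the swappiness= key.
	 */
	if (write(fd, req, strlen(req)) < 0)
		perror("memory.reclaim");
	close(fd);
	return 0;
}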