Message ID | 20240103164841.2800183-3-schatzberg.dan@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Add swappiness argument to memory.reclaim | expand |
On Wed, Jan 3, 2024 at 9:49 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote: > > Allow proactive reclaimers to submit an additional swappiness=<val> > argument to memory.reclaim. This overrides the global or per-memcg > swappiness setting for that reclaim attempt. > > For example: > > echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim > > will perform reclaim on the rootcg with a swappiness setting of 0 (no > swap) regardless of the vm.swappiness sysctl setting. > > Userspace proactive reclaimers use the memory.reclaim interface to > trigger reclaim. The memory.reclaim interface does not allow for any way > to effect the balance of file vs anon during proactive reclaim. The only > approach is to adjust the vm.swappiness setting. However, there are a > few reasons we look to control the balance of file vs anon during > proactive reclaim, separately from reactive reclaim: > > * Swapout should be limited to manage SSD write endurance. In near-OOM > situations we are fine with lots of swap-out to avoid OOMs. As these are > typically rare events, they have relatively little impact on write > endurance. However, proactive reclaim runs continuously and so its > impact on SSD write endurance is more significant. Therefore it is > desireable to control swap-out for proactive reclaim separately from > reactive reclaim > > * Some userspace OOM killers like systemd-oomd[1] support OOM killing on > swap exhaustion. This makes sense if the swap exhaustion is triggered > due to reactive reclaim but less so if it is triggered due to proactive > reclaim (e.g. one could see OOMs when free memory is ample but anon is > just particularly cold). Therefore, it's desireable to have proactive > reclaim reduce or stop swap-out before the threshold at which OOM > killing occurs. > > In the case of Meta's Senpai proactive reclaimer, we adjust > vm.swappiness before writes to memory.reclaim[2]. This has been in > production for nearly two years and has addressed our needs to control > proactive vs reactive reclaim behavior but is still not ideal for a > number of reasons: > > * vm.swappiness is a global setting, adjusting it can race/interfere > with other system administration that wishes to control vm.swappiness. > In our case, we need to disable Senpai before adjusting vm.swappiness. > > * vm.swappiness is stateful - so a crash or restart of Senpai can leave > a misconfigured setting. This requires some additional management to > record the "desired" setting and ensure Senpai always adjusts to it. > > With this patch, we avoid these downsides of adjusting vm.swappiness > globally. > > [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html > [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598 > > Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> > Suggested-by: Yosry Ahmed <yosryahmed@google.com> > Acked-by: Michal Hocko <mhocko@suse.com> > Acked-by: David Rientjes <rientjes@google.com> > Acked-by: Chris Li <chrisl@kernel.org> > --- > Documentation/admin-guide/cgroup-v2.rst | 18 ++++---- > include/linux/swap.h | 3 +- > mm/memcontrol.c | 56 ++++++++++++++++++++----- > mm/vmscan.c | 25 +++++++++-- > 4 files changed, 80 insertions(+), 22 deletions(-) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index 3f85254f3cef..ee42f74e0765 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1282,17 +1282,10 @@ PAGE_SIZE multiple when read back. > This is a simple interface to trigger memory reclaim in the > target cgroup. > > - This file accepts a single key, the number of bytes to reclaim. > - No nested keys are currently supported. > - > Example:: > > echo "1G" > memory.reclaim > > - The interface can be later extended with nested keys to > - configure the reclaim behavior. For example, specify the > - type of memory to reclaim from (anon, file, ..). > - > Please note that the kernel can over or under reclaim from > the target cgroup. If less bytes are reclaimed than the > specified amount, -EAGAIN is returned. > @@ -1304,6 +1297,17 @@ PAGE_SIZE multiple when read back. > This means that the networking layer will not adapt based on > reclaim induced by memory.reclaim. > > +The following nested keys are defined. > + > + ========== ================================ > + swappiness Swappiness value to reclaim with > + ========== ================================ > + > + Specifying a swappiness value instructs the kernel to perform > + the reclaim with that swappiness value. Note that this has the > + same semantics as vm.swappiness applied to memcg reclaim with > + all the existing limitations and potential future extensions. > + > memory.peak > A read-only single value file which exists on non-root > cgroups. > diff --git a/include/linux/swap.h b/include/linux/swap.h > index e2ab76c25b4a..8afdec40efe3 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -412,7 +412,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > unsigned long nr_pages, > gfp_t gfp_mask, > - unsigned int reclaim_options); > + unsigned int reclaim_options, > + int *swappiness); > extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, > gfp_t gfp_mask, bool noswap, > pg_data_t *pgdat, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index fbe9f02dd206..6d627a754851 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -52,6 +52,7 @@ > #include <linux/sort.h> > #include <linux/fs.h> > #include <linux/seq_file.h> > +#include <linux/parser.h> > #include <linux/vmpressure.h> > #include <linux/memremap.h> > #include <linux/mm_inline.h> > @@ -2449,7 +2450,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg, > psi_memstall_enter(&pflags); > nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages, > gfp_mask, > - MEMCG_RECLAIM_MAY_SWAP); > + MEMCG_RECLAIM_MAY_SWAP, > + NULL); > psi_memstall_leave(&pflags); > } while ((memcg = parent_mem_cgroup(memcg)) && > !mem_cgroup_is_root(memcg)); > @@ -2740,7 +2742,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, > > psi_memstall_enter(&pflags); > nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, > - gfp_mask, reclaim_options); > + gfp_mask, reclaim_options, NULL); > psi_memstall_leave(&pflags); > > if (mem_cgroup_margin(mem_over_limit) >= nr_pages) > @@ -3660,7 +3662,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg, > } > > if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, > - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) { > + memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) { > ret = -EBUSY; > break; > } > @@ -3774,7 +3776,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) > return -EINTR; > > if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, > - MEMCG_RECLAIM_MAY_SWAP)) > + MEMCG_RECLAIM_MAY_SWAP, NULL)) > nr_retries--; > } > > @@ -6720,7 +6722,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, > } > > reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, > - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP); > + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL); > > if (!reclaimed && !nr_retries--) > break; > @@ -6769,7 +6771,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, > > if (nr_reclaims) { > if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max, > - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP)) > + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL)) > nr_reclaims--; > continue; > } > @@ -6895,19 +6897,50 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > return nbytes; > } > > +enum { > + MEMORY_RECLAIM_SWAPPINESS = 0, > + MEMORY_RECLAIM_NULL, > +}; > + > +static const match_table_t tokens = { > + { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"}, > + { MEMORY_RECLAIM_NULL, NULL }, > +}; > + > static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > size_t nbytes, loff_t off) > { > struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > unsigned int nr_retries = MAX_RECLAIM_RETRIES; > unsigned long nr_to_reclaim, nr_reclaimed = 0; > + int swappiness = -1; > unsigned int reclaim_options; > - int err; > + char *old_buf, *start; > + substring_t args[MAX_OPT_ARGS]; > > buf = strstrip(buf); > - err = page_counter_memparse(buf, "", &nr_to_reclaim); > - if (err) > - return err; > + > + old_buf = buf; > + nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE; > + if (buf == old_buf) > + return -EINVAL; > + > + buf = strstrip(buf); > + > + while ((start = strsep(&buf, " ")) != NULL) { > + if (!strlen(start)) > + continue; > + switch (match_token(start, tokens, args)) { > + case MEMORY_RECLAIM_SWAPPINESS: > + if (match_int(&args[0], &swappiness)) > + return -EINVAL; > + if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS) > + return -EINVAL; > + break; > + default: > + return -EINVAL; > + } > + } > > reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; > while (nr_reclaimed < nr_to_reclaim) { > @@ -6926,7 +6959,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > reclaimed = try_to_free_mem_cgroup_pages(memcg, > min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX), > - GFP_KERNEL, reclaim_options); > + GFP_KERNEL, reclaim_options, > + swappiness == -1 ? NULL : &swappiness); > > if (!reclaimed && !nr_retries--) > return -EAGAIN; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index d91963e2d47f..394e0dd46b2e 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -92,6 +92,11 @@ struct scan_control { > unsigned long anon_cost; > unsigned long file_cost; > > +#ifdef CONFIG_MEMCG > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */ > + int *proactive_swappiness; > +#endif Why is proactive_swappiness still a pointer? The whole point of the previous conversation is that sc->proactive can tell whether sc->swappiness is valid or not, and that's less awkward than using a pointer. Also why the #ifdef here? I don't see the point for a small stack variable. Otherwise wouldn't we want to do this for sc->proactive as well? If you really want it to be explicit, you could do struct scan_control { ... struct { bool is_set; int swappiness; } proactive; }; But I think even this is too much.
On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote: [...] > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index d91963e2d47f..394e0dd46b2e 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -92,6 +92,11 @@ struct scan_control { > > unsigned long anon_cost; > > unsigned long file_cost; > > > > +#ifdef CONFIG_MEMCG > > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */ > > + int *proactive_swappiness; > > +#endif > > Why is proactive_swappiness still a pointer? The whole point of the > previous conversation is that sc->proactive can tell whether > sc->swappiness is valid or not, and that's less awkward than using a > pointer. It's the same reason as before - zero initialization ensures that the pointer is NULL which tells us if it's valid or not. Proactive reclaim might not set swappiness and you need to distinguish swappiness of 0 and not-set. See this discussion with Michal: https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/ > Also why the #ifdef here? I don't see the point for a small stack > variable. Otherwise wouldn't we want to do this for sc->proactive as > well? This was Michal's request and it feels similar to your rationale for naming it proactive_swappiness - it's just restricting the interface down to the only use-cases. I'd be fine with doing the same in sc->proactive as a subsequent patch. See https://lore.kernel.org/linux-mm/ZZUhBoTNgL3AUK3f@tiehlicka/
On Wed, Jan 03, 2024 at 01:19:59PM -0500, Dan Schatzberg wrote: > On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote: > [...] > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > index d91963e2d47f..394e0dd46b2e 100644 > > > --- a/mm/vmscan.c > > > +++ b/mm/vmscan.c > > > @@ -92,6 +92,11 @@ struct scan_control { > > > unsigned long anon_cost; > > > unsigned long file_cost; > > > > > > +#ifdef CONFIG_MEMCG > > > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */ > > > + int *proactive_swappiness; > > > +#endif > > > > Why is proactive_swappiness still a pointer? The whole point of the > > previous conversation is that sc->proactive can tell whether > > sc->swappiness is valid or not, and that's less awkward than using a > > pointer. > > It's the same reason as before - zero initialization ensures that the > pointer is NULL which tells us if it's valid or not. Proactive reclaim > might not set swappiness and you need to distinguish swappiness of 0 > and not-set. See this discussion with Michal: > > https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); unsigned int nr_retries = MAX_RECLAIM_RETRIES; unsigned long nr_to_reclaim, nr_reclaimed = 0; + int swappiness = -1; ... reclaimed = try_to_free_mem_cgroup_pages(memcg, min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX), - GFP_KERNEL, reclaim_options); + GFP_KERNEL, reclaim_options, + swappiness); ... +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg) +{ + return sc->proactive && sc->proactive_swappiness > -1 ? + sc->proactive_swappiness : mem_cgroup_swappiness(memcg); +}
On Wed, Jan 3, 2024 at 11:20 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote: > > On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote: > [...] > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > index d91963e2d47f..394e0dd46b2e 100644 > > > --- a/mm/vmscan.c > > > +++ b/mm/vmscan.c > > > @@ -92,6 +92,11 @@ struct scan_control { > > > unsigned long anon_cost; > > > unsigned long file_cost; > > > > > > +#ifdef CONFIG_MEMCG > > > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */ > > > + int *proactive_swappiness; > > > +#endif > > > > Why is proactive_swappiness still a pointer? The whole point of the > > previous conversation is that sc->proactive can tell whether > > sc->swappiness is valid or not, and that's less awkward than using a > > pointer. > > It's the same reason as before - zero initialization ensures that the > pointer is NULL which tells us if it's valid or not. Proactive reclaim > might not set swappiness and you need to distinguish swappiness of 0 > and not-set. See this discussion with Michal: > > https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/ > > > Also why the #ifdef here? I don't see the point for a small stack > > variable. Otherwise wouldn't we want to do this for sc->proactive as > > well? > > This was Michal's request and it feels similar to your rationale for > naming it proactive_swappiness - it's just restricting the interface > down to the only use-cases. I'd be fine with doing the same in > sc->proactive as a subsequent patch. > > See https://lore.kernel.org/linux-mm/ZZUhBoTNgL3AUK3f@tiehlicka/ Also regarding #ifdef, quoting Documentation/process/4.Coding.rst: "As a general rule, #ifdef use should be confined to header files whenever possible."
On Wed 03-01-24 18:07:43, Yu Zhao wrote: > On Wed, Jan 03, 2024 at 01:19:59PM -0500, Dan Schatzberg wrote: > > On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote: > > [...] > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > > index d91963e2d47f..394e0dd46b2e 100644 > > > > --- a/mm/vmscan.c > > > > +++ b/mm/vmscan.c > > > > @@ -92,6 +92,11 @@ struct scan_control { > > > > unsigned long anon_cost; > > > > unsigned long file_cost; > > > > > > > > +#ifdef CONFIG_MEMCG > > > > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */ > > > > + int *proactive_swappiness; > > > > +#endif > > > > > > Why is proactive_swappiness still a pointer? The whole point of the > > > previous conversation is that sc->proactive can tell whether > > > sc->swappiness is valid or not, and that's less awkward than using a > > > pointer. > > > > It's the same reason as before - zero initialization ensures that the > > pointer is NULL which tells us if it's valid or not. Proactive reclaim > > might not set swappiness and you need to distinguish swappiness of 0 > > and not-set. See this discussion with Michal: > > > > https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/ > > static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > size_t nbytes, loff_t off) > { > struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > unsigned int nr_retries = MAX_RECLAIM_RETRIES; > unsigned long nr_to_reclaim, nr_reclaimed = 0; > + int swappiness = -1; > ... > reclaimed = try_to_free_mem_cgroup_pages(memcg, > min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX), > - GFP_KERNEL, reclaim_options); > + GFP_KERNEL, reclaim_options, > + swappiness); > > ... > > +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg) > +{ > + return sc->proactive && sc->proactive_swappiness > -1 ? > + sc->proactive_swappiness : mem_cgroup_swappiness(memcg); > +} Tpo be completely honest I really fail to see why this is such a hot discussion point. To be completely clear both approaches are feasible. The main argument for NULL check based approach is that it is less error prone from an incorrect ussage because any bug becomes obvious. If we use any other special constant a missing initialization would be much harder to spot because they would be subtle behavior change. Are there really any strong arguments to go against this "default initialization is safe" policy?
On Wed 03-01-24 08:48:37, Dan Schatzberg wrote: [...] > diff --git a/mm/vmscan.c b/mm/vmscan.c > index d91963e2d47f..394e0dd46b2e 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -92,6 +92,11 @@ struct scan_control { > unsigned long anon_cost; > unsigned long file_cost; > > +#ifdef CONFIG_MEMCG > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */ > + int *proactive_swappiness; > +#endif > + > /* Can active folios be deactivated as part of reclaim? */ > #define DEACTIVATE_ANON 1 > #define DEACTIVATE_FILE 2 > @@ -227,6 +232,13 @@ static bool writeback_throttling_sane(struct scan_control *sc) > #endif > return false; > } > + > +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg) > +{ > + if (sc->proactive && sc->proactive_swappiness) > + return *sc->proactive_swappiness; > + return mem_cgroup_swappiness(memcg); > +} If you really want to make this sc->proactive bound then do not use CONFIG_MEMCG as sc->proactive is not guarded either. I do not think that sc->proactive check is really necessary. A pure NULL check is sufficient to have a valid and self evident code that is future proof. But TBH this is not the most important aspect of the patch to spend much more time discussing. Either go with sc->proactive but make it config space consistent or simply rely on NULL check (with or without MEMCG guard as both are valid options).
On Thu, Jan 4, 2024 at 1:48 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 03-01-24 18:07:43, Yu Zhao wrote: > > On Wed, Jan 03, 2024 at 01:19:59PM -0500, Dan Schatzberg wrote: > > > On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote: > > > [...] > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > > > index d91963e2d47f..394e0dd46b2e 100644 > > > > > --- a/mm/vmscan.c > > > > > +++ b/mm/vmscan.c > > > > > @@ -92,6 +92,11 @@ struct scan_control { > > > > > unsigned long anon_cost; > > > > > unsigned long file_cost; > > > > > > > > > > +#ifdef CONFIG_MEMCG > > > > > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */ > > > > > + int *proactive_swappiness; > > > > > +#endif > > > > > > > > Why is proactive_swappiness still a pointer? The whole point of the > > > > previous conversation is that sc->proactive can tell whether > > > > sc->swappiness is valid or not, and that's less awkward than using a > > > > pointer. > > > > > > It's the same reason as before - zero initialization ensures that the > > > pointer is NULL which tells us if it's valid or not. Proactive reclaim > > > might not set swappiness and you need to distinguish swappiness of 0 > > > and not-set. See this discussion with Michal: > > > > > > https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/ > > > > static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > size_t nbytes, loff_t off) > > { > > struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > > unsigned int nr_retries = MAX_RECLAIM_RETRIES; > > unsigned long nr_to_reclaim, nr_reclaimed = 0; > > + int swappiness = -1; > > ... > > reclaimed = try_to_free_mem_cgroup_pages(memcg, > > min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX), > > - GFP_KERNEL, reclaim_options); > > + GFP_KERNEL, reclaim_options, > > + swappiness); > > > > ... > > > > +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg) > > +{ > > + return sc->proactive && sc->proactive_swappiness > -1 ? > > + sc->proactive_swappiness : mem_cgroup_swappiness(memcg); > > +} > > Tpo be completely honest I really fail to see why this is such a hot > discussion point. To be completely clear both approaches are feasible. Feasible but not equal. > The main argument for NULL check based approach is that it is less error > prone from an incorrect ussage because any bug becomes obvious. Any bug becomes *fatal*, and fatal isn't only obvious but also hurts in production systems. This was the reason for going through the trouble switching from VM_BUG_ON() to VM_WARN_ON() and documenting it in Documentation/process/coding-style.rst: 22) Do not crash the kernel --------------------------- In general, the decision to crash the kernel belongs to the user, rather than to the kernel developer. Isn't? > If we > use any other special constant a missing initialization would be much > harder to spot because they would be subtle behavior change. > > Are there really any strong arguments to go against this "default > initialization is safe" policy? Just wanted to point out an alternative. Fine details (best practices) matter to me.
On Thu, Jan 4, 2024 at 3:09 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 03-01-24 08:48:37, Dan Schatzberg wrote: > [...] > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index d91963e2d47f..394e0dd46b2e 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -92,6 +92,11 @@ struct scan_control { > > unsigned long anon_cost; > > unsigned long file_cost; > > > > +#ifdef CONFIG_MEMCG > > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */ > > + int *proactive_swappiness; > > +#endif > > + > > /* Can active folios be deactivated as part of reclaim? */ > > #define DEACTIVATE_ANON 1 > > #define DEACTIVATE_FILE 2 > > @@ -227,6 +232,13 @@ static bool writeback_throttling_sane(struct scan_control *sc) > > #endif > > return false; > > } > > + > > +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg) > > +{ > > + if (sc->proactive && sc->proactive_swappiness) > > + return *sc->proactive_swappiness; > > + return mem_cgroup_swappiness(memcg); > > +} > > If you really want to make this sc->proactive bound then do not use > CONFIG_MEMCG as sc->proactive is not guarded either. > > I do not think that sc->proactive check is really necessary. A pure NULL > check is sufficient to have a valid and self evident code that is future > proof. But TBH this is not the most important aspect of the patch to > spend much more time discussing. Either go with sc->proactive but make > it config space consistent or simply rely on NULL check (with or without > MEMCG guard as both are valid options). Now you see why I replied. That "hybrid" if statement is just neither of what was suggested.
On Tue 09-01-24 16:54:15, Yu Zhao wrote: > On Thu, Jan 4, 2024 at 1:48 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Wed 03-01-24 18:07:43, Yu Zhao wrote: > > > On Wed, Jan 03, 2024 at 01:19:59PM -0500, Dan Schatzberg wrote: > > > > On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote: > > > > [...] > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > > > > index d91963e2d47f..394e0dd46b2e 100644 > > > > > > --- a/mm/vmscan.c > > > > > > +++ b/mm/vmscan.c > > > > > > @@ -92,6 +92,11 @@ struct scan_control { > > > > > > unsigned long anon_cost; > > > > > > unsigned long file_cost; > > > > > > > > > > > > +#ifdef CONFIG_MEMCG > > > > > > + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */ > > > > > > + int *proactive_swappiness; > > > > > > +#endif > > > > > > > > > > Why is proactive_swappiness still a pointer? The whole point of the > > > > > previous conversation is that sc->proactive can tell whether > > > > > sc->swappiness is valid or not, and that's less awkward than using a > > > > > pointer. > > > > > > > > It's the same reason as before - zero initialization ensures that the > > > > pointer is NULL which tells us if it's valid or not. Proactive reclaim > > > > might not set swappiness and you need to distinguish swappiness of 0 > > > > and not-set. See this discussion with Michal: > > > > > > > > https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/ > > > > > > static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > > size_t nbytes, loff_t off) > > > { > > > struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > > > unsigned int nr_retries = MAX_RECLAIM_RETRIES; > > > unsigned long nr_to_reclaim, nr_reclaimed = 0; > > > + int swappiness = -1; > > > ... > > > reclaimed = try_to_free_mem_cgroup_pages(memcg, > > > min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX), > > > - GFP_KERNEL, reclaim_options); > > > + GFP_KERNEL, reclaim_options, > > > + swappiness); > > > > > > ... > > > > > > +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg) > > > +{ > > > + return sc->proactive && sc->proactive_swappiness > -1 ? > > > + sc->proactive_swappiness : mem_cgroup_swappiness(memcg); > > > +} > > > > Tpo be completely honest I really fail to see why this is such a hot > > discussion point. To be completely clear both approaches are feasible. > > Feasible but not equal. > > > The main argument for NULL check based approach is that it is less error > > prone from an incorrect ussage because any bug becomes obvious. > > Any bug becomes *fatal*, and fatal isn't only obvious but also hurts > in production systems. > > This was the reason for going through the trouble switching from > VM_BUG_ON() to VM_WARN_ON() and documenting it in > Documentation/process/coding-style.rst: > > 22) Do not crash the kernel > --------------------------- > > In general, the decision to crash the kernel belongs to the user, rather > than to the kernel developer. > > Isn't? I do agree with this general statement but I do not think it is applicable in this context. This is not an explicit BUG() when kernel explicitly sets to panic the system. We are talking about subtle misbehavior which might be non-trivial to debug (there are other reasons to not swap at all) vs. a potential NULL ptr which will kill the userspace in a very obvious way. Sure there are risks with that but checks for potential NULL ptr dereferncing is easier than forgot explicit initialization. There are clear pros and cons for both approaches. NULL default initialized structures members which allow for behavior override are a general kernel pattern so I do not really see this going way off the rails.
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 3f85254f3cef..ee42f74e0765 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1282,17 +1282,10 @@ PAGE_SIZE multiple when read back. This is a simple interface to trigger memory reclaim in the target cgroup. - This file accepts a single key, the number of bytes to reclaim. - No nested keys are currently supported. - Example:: echo "1G" > memory.reclaim - The interface can be later extended with nested keys to - configure the reclaim behavior. For example, specify the - type of memory to reclaim from (anon, file, ..). - Please note that the kernel can over or under reclaim from the target cgroup. If less bytes are reclaimed than the specified amount, -EAGAIN is returned. @@ -1304,6 +1297,17 @@ PAGE_SIZE multiple when read back. This means that the networking layer will not adapt based on reclaim induced by memory.reclaim. +The following nested keys are defined. + + ========== ================================ + swappiness Swappiness value to reclaim with + ========== ================================ + + Specifying a swappiness value instructs the kernel to perform + the reclaim with that swappiness value. Note that this has the + same semantics as vm.swappiness applied to memcg reclaim with + all the existing limitations and potential future extensions. + memory.peak A read-only single value file which exists on non-root cgroups. diff --git a/include/linux/swap.h b/include/linux/swap.h index e2ab76c25b4a..8afdec40efe3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -412,7 +412,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - unsigned int reclaim_options); + unsigned int reclaim_options, + int *swappiness); extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, gfp_t gfp_mask, bool noswap, pg_data_t *pgdat, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index fbe9f02dd206..6d627a754851 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -52,6 +52,7 @@ #include <linux/sort.h> #include <linux/fs.h> #include <linux/seq_file.h> +#include <linux/parser.h> #include <linux/vmpressure.h> #include <linux/memremap.h> #include <linux/mm_inline.h> @@ -2449,7 +2450,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg, psi_memstall_enter(&pflags); nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, - MEMCG_RECLAIM_MAY_SWAP); + MEMCG_RECLAIM_MAY_SWAP, + NULL); psi_memstall_leave(&pflags); } while ((memcg = parent_mem_cgroup(memcg)) && !mem_cgroup_is_root(memcg)); @@ -2740,7 +2742,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, psi_memstall_enter(&pflags); nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, - gfp_mask, reclaim_options); + gfp_mask, reclaim_options, NULL); psi_memstall_leave(&pflags); if (mem_cgroup_margin(mem_over_limit) >= nr_pages) @@ -3660,7 +3662,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg, } if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) { + memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) { ret = -EBUSY; break; } @@ -3774,7 +3776,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) return -EINTR; if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, - MEMCG_RECLAIM_MAY_SWAP)) + MEMCG_RECLAIM_MAY_SWAP, NULL)) nr_retries--; } @@ -6720,7 +6722,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, } reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP); + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL); if (!reclaimed && !nr_retries--) break; @@ -6769,7 +6771,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, if (nr_reclaims) { if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max, - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP)) + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL)) nr_reclaims--; continue; } @@ -6895,19 +6897,50 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, return nbytes; } +enum { + MEMORY_RECLAIM_SWAPPINESS = 0, + MEMORY_RECLAIM_NULL, +}; + +static const match_table_t tokens = { + { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"}, + { MEMORY_RECLAIM_NULL, NULL }, +}; + static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); unsigned int nr_retries = MAX_RECLAIM_RETRIES; unsigned long nr_to_reclaim, nr_reclaimed = 0; + int swappiness = -1; unsigned int reclaim_options; - int err; + char *old_buf, *start; + substring_t args[MAX_OPT_ARGS]; buf = strstrip(buf); - err = page_counter_memparse(buf, "", &nr_to_reclaim); - if (err) - return err; + + old_buf = buf; + nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE; + if (buf == old_buf) + return -EINVAL; + + buf = strstrip(buf); + + while ((start = strsep(&buf, " ")) != NULL) { + if (!strlen(start)) + continue; + switch (match_token(start, tokens, args)) { + case MEMORY_RECLAIM_SWAPPINESS: + if (match_int(&args[0], &swappiness)) + return -EINVAL; + if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS) + return -EINVAL; + break; + default: + return -EINVAL; + } + } reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; while (nr_reclaimed < nr_to_reclaim) { @@ -6926,7 +6959,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, reclaimed = try_to_free_mem_cgroup_pages(memcg, min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX), - GFP_KERNEL, reclaim_options); + GFP_KERNEL, reclaim_options, + swappiness == -1 ? NULL : &swappiness); if (!reclaimed && !nr_retries--) return -EAGAIN; diff --git a/mm/vmscan.c b/mm/vmscan.c index d91963e2d47f..394e0dd46b2e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -92,6 +92,11 @@ struct scan_control { unsigned long anon_cost; unsigned long file_cost; +#ifdef CONFIG_MEMCG + /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */ + int *proactive_swappiness; +#endif + /* Can active folios be deactivated as part of reclaim? */ #define DEACTIVATE_ANON 1 #define DEACTIVATE_FILE 2 @@ -227,6 +232,13 @@ static bool writeback_throttling_sane(struct scan_control *sc) #endif return false; } + +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg) +{ + if (sc->proactive && sc->proactive_swappiness) + return *sc->proactive_swappiness; + return mem_cgroup_swappiness(memcg); +} #else static bool cgroup_reclaim(struct scan_control *sc) { @@ -242,6 +254,11 @@ static bool writeback_throttling_sane(struct scan_control *sc) { return true; } + +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg) +{ + return READ_ONCE(vm_swappiness); +} #endif static void set_task_reclaim_state(struct task_struct *task, @@ -2327,7 +2344,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, struct pglist_data *pgdat = lruvec_pgdat(lruvec); struct mem_cgroup *memcg = lruvec_memcg(lruvec); unsigned long anon_cost, file_cost, total_cost; - int swappiness = mem_cgroup_swappiness(memcg); + int swappiness = sc_swappiness(sc, memcg); u64 fraction[ANON_AND_FILE]; u64 denominator = 0; /* gcc */ enum scan_balance scan_balance; @@ -2608,7 +2625,7 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc) mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH) return 0; - return mem_cgroup_swappiness(memcg); + return sc_swappiness(sc, memcg); } static int get_nr_gens(struct lruvec *lruvec, int type) @@ -6463,12 +6480,14 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - unsigned int reclaim_options) + unsigned int reclaim_options, + int *swappiness) { unsigned long nr_reclaimed; unsigned int noreclaim_flag; struct scan_control sc = { .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX), + .proactive_swappiness = swappiness, .gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) | (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK), .reclaim_idx = MAX_NR_ZONES - 1,