[v6,2/2] mm: add swappiness= arg to memory.reclaim

Message ID 20240103164841.2800183-3-schatzberg.dan@gmail.com (mailing list archive)
State New
Series Add swappiness argument to memory.reclaim

Commit Message

Dan Schatzberg Jan. 3, 2024, 4:48 p.m. UTC
Allow proactive reclaimers to submit an additional swappiness=<val>
argument to memory.reclaim. This overrides the global or per-memcg
swappiness setting for that reclaim attempt.

For example:

echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim

will perform reclaim on the rootcg with a swappiness setting of 0 (no
swap) regardless of the vm.swappiness sysctl setting.
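
A user-space reclaimer could build the same request programmatically. The following is a minimal sketch, not part of the patch: format_reclaim_request() and request_reclaim() are hypothetical helper names, the byte count is written as a plain decimal number, and a negative swappiness argument is used here only as a local convention for "omit the key". Range checking is left to the kernel, which rejects invalid values with -EINVAL.

```c
#include <stdio.h>

/*
 * Hypothetical helper: format a memory.reclaim request into buf.
 * A negative swappiness means "do not pass the swappiness= key".
 * Returns 0 on success, -1 if the request does not fit in buf.
 */
int format_reclaim_request(char *buf, size_t len,
                           unsigned long long bytes, int swappiness)
{
    int n;

    if (swappiness < 0)
        n = snprintf(buf, len, "%llu", bytes);
    else
        n = snprintf(buf, len, "%llu swappiness=%d", bytes, swappiness);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write the request to <cgroup_dir>/memory.reclaim.
 * Returns 0 on success, -1 on any formatting or I/O error.
 */
int request_reclaim(const char *cgroup_dir,
                    unsigned long long bytes, int swappiness)
{
    char path[4096];
    char req[64];
    FILE *f;
    int n, ret = 0;

    if (format_reclaim_request(req, sizeof(req), bytes, swappiness))
        return -1;

    n = snprintf(path, sizeof(path), "%s/memory.reclaim", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;
    if (fputs(req, f) == EOF)
        ret = -1;
    /* The kernel reports reclaim errors when the buffered write is flushed. */
    if (fclose(f) == EOF)
        ret = -1;
    return ret;
}
```

With these helpers, request_reclaim("/sys/fs/cgroup", 2097152, 0) would issue the same request as the echo command above.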

Userspace proactive reclaimers use the memory.reclaim interface to
trigger reclaim. The memory.reclaim interface does not allow for any way
to affect the balance of file vs anon during proactive reclaim. The only
approach is to adjust the vm.swappiness setting. However, there are a
few reasons we look to control the balance of file vs anon during
proactive reclaim, separately from reactive reclaim:

* Swapout should be limited to manage SSD write endurance. In near-OOM
situations we are fine with lots of swap-out to avoid OOMs. As these are
typically rare events, they have relatively little impact on write
endurance. However, proactive reclaim runs continuously and so its
impact on SSD write endurance is more significant. Therefore it is
desirable to control swap-out for proactive reclaim separately from
reactive reclaim.

* Some userspace OOM killers like systemd-oomd[1] support OOM killing on
swap exhaustion. This makes sense if the swap exhaustion is triggered
due to reactive reclaim but less so if it is triggered due to proactive
reclaim (e.g. one could see OOMs when free memory is ample but anon is
just particularly cold). Therefore, it's desirable to have proactive
reclaim reduce or stop swap-out before the threshold at which OOM
killing occurs.

In the case of Meta's Senpai proactive reclaimer, we adjust
vm.swappiness before writes to memory.reclaim[2]. This has been in
production for nearly two years and has addressed our needs to control
proactive vs reactive reclaim behavior but is still not ideal for a
number of reasons:

* vm.swappiness is a global setting; adjusting it can race/interfere
with other system administration that wishes to control vm.swappiness.
In our case, we need to disable Senpai before adjusting vm.swappiness.

* vm.swappiness is stateful - so a crash or restart of Senpai can leave
a misconfigured setting. This requires some additional management to
record the "desired" setting and ensure Senpai always adjusts to it.

With this patch, we avoid these downsides of adjusting vm.swappiness
globally.

[1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Chris Li <chrisl@kernel.org>
---
 Documentation/admin-guide/cgroup-v2.rst | 18 ++++----
 include/linux/swap.h                    |  3 +-
 mm/memcontrol.c                         | 56 ++++++++++++++++++++-----
 mm/vmscan.c                             | 25 +++++++++--
 4 files changed, 80 insertions(+), 22 deletions(-)

Comments

Yu Zhao Jan. 3, 2024, 5:19 p.m. UTC | #1
On Wed, Jan 3, 2024 at 9:49 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> [...]
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 3f85254f3cef..ee42f74e0765 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1282,17 +1282,10 @@ PAGE_SIZE multiple when read back.
>         This is a simple interface to trigger memory reclaim in the
>         target cgroup.
>
> -       This file accepts a single key, the number of bytes to reclaim.
> -       No nested keys are currently supported.
> -
>         Example::
>
>           echo "1G" > memory.reclaim
>
> -       The interface can be later extended with nested keys to
> -       configure the reclaim behavior. For example, specify the
> -       type of memory to reclaim from (anon, file, ..).
> -
>         Please note that the kernel can over or under reclaim from
>         the target cgroup. If less bytes are reclaimed than the
>         specified amount, -EAGAIN is returned.
> @@ -1304,6 +1297,17 @@ PAGE_SIZE multiple when read back.
>         This means that the networking layer will not adapt based on
>         reclaim induced by memory.reclaim.
>
> +The following nested keys are defined.
> +
> +         ==========            ================================
> +         swappiness            Swappiness value to reclaim with
> +         ==========            ================================
> +
> +       Specifying a swappiness value instructs the kernel to perform
> +       the reclaim with that swappiness value. Note that this has the
> +       same semantics as vm.swappiness applied to memcg reclaim with
> +       all the existing limitations and potential future extensions.
> +
>    memory.peak
>         A read-only single value file which exists on non-root
>         cgroups.
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e2ab76c25b4a..8afdec40efe3 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -412,7 +412,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>                                                   unsigned long nr_pages,
>                                                   gfp_t gfp_mask,
> -                                                 unsigned int reclaim_options);
> +                                                 unsigned int reclaim_options,
> +                                                 int *swappiness);
>  extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
>                                                 gfp_t gfp_mask, bool noswap,
>                                                 pg_data_t *pgdat,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index fbe9f02dd206..6d627a754851 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -52,6 +52,7 @@
>  #include <linux/sort.h>
>  #include <linux/fs.h>
>  #include <linux/seq_file.h>
> +#include <linux/parser.h>
>  #include <linux/vmpressure.h>
>  #include <linux/memremap.h>
>  #include <linux/mm_inline.h>
> @@ -2449,7 +2450,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
>                 psi_memstall_enter(&pflags);
>                 nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
>                                                         gfp_mask,
> -                                                       MEMCG_RECLAIM_MAY_SWAP);
> +                                                       MEMCG_RECLAIM_MAY_SWAP,
> +                                                       NULL);
>                 psi_memstall_leave(&pflags);
>         } while ((memcg = parent_mem_cgroup(memcg)) &&
>                  !mem_cgroup_is_root(memcg));
> @@ -2740,7 +2742,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>
>         psi_memstall_enter(&pflags);
>         nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> -                                                   gfp_mask, reclaim_options);
> +                                                   gfp_mask, reclaim_options, NULL);
>         psi_memstall_leave(&pflags);
>
>         if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> @@ -3660,7 +3662,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
>                 }
>
>                 if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> -                                       memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
> +                                       memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
>                         ret = -EBUSY;
>                         break;
>                 }
> @@ -3774,7 +3776,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>                         return -EINTR;
>
>                 if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> -                                                 MEMCG_RECLAIM_MAY_SWAP))
> +                                                 MEMCG_RECLAIM_MAY_SWAP, NULL))
>                         nr_retries--;
>         }
>
> @@ -6720,7 +6722,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>                 }
>
>                 reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> -                                       GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
> +                                       GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
>
>                 if (!reclaimed && !nr_retries--)
>                         break;
> @@ -6769,7 +6771,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>
>                 if (nr_reclaims) {
>                         if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
> -                                       GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
> +                                       GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL))
>                                 nr_reclaims--;
>                         continue;
>                 }
> @@ -6895,19 +6897,50 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>         return nbytes;
>  }
>
> +enum {
> +       MEMORY_RECLAIM_SWAPPINESS = 0,
> +       MEMORY_RECLAIM_NULL,
> +};
> +
> +static const match_table_t tokens = {
> +       { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> +       { MEMORY_RECLAIM_NULL, NULL },
> +};
> +
>  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>                               size_t nbytes, loff_t off)
>  {
>         struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
>         unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>         unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +       int swappiness = -1;
>         unsigned int reclaim_options;
> -       int err;
> +       char *old_buf, *start;
> +       substring_t args[MAX_OPT_ARGS];
>
>         buf = strstrip(buf);
> -       err = page_counter_memparse(buf, "", &nr_to_reclaim);
> -       if (err)
> -               return err;
> +
> +       old_buf = buf;
> +       nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> +       if (buf == old_buf)
> +               return -EINVAL;
> +
> +       buf = strstrip(buf);
> +
> +       while ((start = strsep(&buf, " ")) != NULL) {
> +               if (!strlen(start))
> +                       continue;
> +               switch (match_token(start, tokens, args)) {
> +               case MEMORY_RECLAIM_SWAPPINESS:
> +                       if (match_int(&args[0], &swappiness))
> +                               return -EINVAL;
> +                       if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
> +                               return -EINVAL;
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
>
>         reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
>         while (nr_reclaimed < nr_to_reclaim) {
> @@ -6926,7 +6959,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>
>                 reclaimed = try_to_free_mem_cgroup_pages(memcg,
>                                         min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> -                                       GFP_KERNEL, reclaim_options);
> +                                       GFP_KERNEL, reclaim_options,
> +                                       swappiness == -1 ? NULL : &swappiness);
>
>                 if (!reclaimed && !nr_retries--)
>                         return -EAGAIN;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d91963e2d47f..394e0dd46b2e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -92,6 +92,11 @@ struct scan_control {
>         unsigned long   anon_cost;
>         unsigned long   file_cost;
>
> +#ifdef CONFIG_MEMCG
> +       /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> +       int *proactive_swappiness;
> +#endif

Why is proactive_swappiness still a pointer? The whole point of the
previous conversation is that sc->proactive can tell whether
sc->swappiness is valid or not, and that's less awkward than using a
pointer.

Also why the #ifdef here? I don't see the point for a small stack
variable. Otherwise wouldn't we want to do this for sc->proactive as
well?

If you really want it to be explicit, you could do
  struct scan_control {
    ...
    struct {
      bool is_set;
      int swappiness;
    } proactive;
  };

But I think even this is too much.
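
For illustration, a helper reading the explicit is_set variant could look roughly like the sketch below. This is a user-space model only: struct mem_cgroup is reduced to a single hypothetical field, and the nested struct is named proactive as in the snippet above, whereas the kernel's existing sc->proactive is a flag.

```c
#include <stdbool.h>

/* Simplified stand-in for the kernel type; only the field needed here. */
struct mem_cgroup {
    int swappiness;
};

/* scan_control reduced to the explicit "is_set + value" pair. */
struct scan_control {
    struct {
        bool is_set;     /* false after zero initialization */
        int swappiness;  /* only meaningful when is_set is true */
    } proactive;
};

/* Use the override when one was supplied, else the memcg setting. */
int sc_swappiness(const struct scan_control *sc,
                  const struct mem_cgroup *memcg)
{
    return sc->proactive.is_set ? sc->proactive.swappiness
                                : memcg->swappiness;
}
```

A zero-initialized scan_control falls back to the memcg value here, and an override of 0 remains distinguishable from "not set".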
Dan Schatzberg Jan. 3, 2024, 6:19 p.m. UTC | #2
On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
[...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d91963e2d47f..394e0dd46b2e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -92,6 +92,11 @@ struct scan_control {
> >         unsigned long   anon_cost;
> >         unsigned long   file_cost;
> >
> > +#ifdef CONFIG_MEMCG
> > +       /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > +       int *proactive_swappiness;
> > +#endif
> 
> Why is proactive_swappiness still a pointer? The whole point of the
> previous conversation is that sc->proactive can tell whether
> sc->swappiness is valid or not, and that's less awkward than using a
> pointer.

It's the same reason as before - zero initialization ensures that the
pointer is NULL which tells us if it's valid or not. Proactive reclaim
might not set swappiness and you need to distinguish swappiness of 0
and not-set. See this discussion with Michal:

https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/

> Also why the #ifdef here? I don't see the point for a small stack
> variable. Otherwise wouldn't we want to do this for sc->proactive as
> well?

This was Michal's request and it feels similar to your rationale for
naming it proactive_swappiness - it's just restricting the interface
down to the only use-cases. I'd be fine with doing the same in
sc->proactive as a subsequent patch.

See https://lore.kernel.org/linux-mm/ZZUhBoTNgL3AUK3f@tiehlicka/
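
The zero-initialization argument can be modeled in user space. The sketch below is only an approximation: the struct definitions are simplified stand-ins rather than the kernel's, and mem_cgroup_swappiness() is reduced to a field read. The helper itself has the same shape as sc_swappiness() in the patch.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel type; only the field needed here. */
struct mem_cgroup {
    int swappiness;
};

/* scan_control reduced to the two members under discussion. */
struct scan_control {
    unsigned int proactive:1;
    int *proactive_swappiness;  /* NULL means "no override requested" */
};

/* Stand-in for the kernel's mem_cgroup_swappiness(). */
int mem_cgroup_swappiness(const struct mem_cgroup *memcg)
{
    return memcg->swappiness;
}

/* Same shape as the helper added to mm/vmscan.c by the patch. */
int sc_swappiness(const struct scan_control *sc,
                  const struct mem_cgroup *memcg)
{
    if (sc->proactive && sc->proactive_swappiness)
        return *sc->proactive_swappiness;
    return mem_cgroup_swappiness(memcg);
}
```

Because a designated initializer zeroes every member that is not named, a caller that never mentions proactive_swappiness gets a NULL pointer and therefore the memcg setting, while a caller that points it at an int holding 0 gets swappiness 0.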
Yu Zhao Jan. 4, 2024, 1:07 a.m. UTC | #3
On Wed, Jan 03, 2024 at 01:19:59PM -0500, Dan Schatzberg wrote:
> On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
> [...]
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index d91963e2d47f..394e0dd46b2e 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -92,6 +92,11 @@ struct scan_control {
> > >         unsigned long   anon_cost;
> > >         unsigned long   file_cost;
> > >
> > > +#ifdef CONFIG_MEMCG
> > > +       /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > > +       int *proactive_swappiness;
> > > +#endif
> > 
> > Why is proactive_swappiness still a pointer? The whole point of the
> > previous conversation is that sc->proactive can tell whether
> > sc->swappiness is valid or not, and that's less awkward than using a
> > pointer.
> 
> It's the same reason as before - zero initialization ensures that the
> pointer is NULL which tells us if it's valid or not. Proactive reclaim
> might not set swappiness and you need to distinguish swappiness of 0
> and not-set. See this discussion with Michal:
> 
> https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/

 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
                              size_t nbytes, loff_t off)
 {
        struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
        unsigned int nr_retries = MAX_RECLAIM_RETRIES;
        unsigned long nr_to_reclaim, nr_reclaimed = 0;
+       int swappiness = -1;
...
                reclaimed = try_to_free_mem_cgroup_pages(memcg,
                                        min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
-                                       GFP_KERNEL, reclaim_options);
+                                       GFP_KERNEL, reclaim_options,
+                                       swappiness);

...

+static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
+{
+       return sc->proactive && sc->proactive_swappiness > -1 ?
+              sc->proactive_swappiness : mem_cgroup_swappiness(memcg);
+}
Yu Zhao Jan. 4, 2024, 1:17 a.m. UTC | #4
On Wed, Jan 3, 2024 at 11:20 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
> [...]
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index d91963e2d47f..394e0dd46b2e 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -92,6 +92,11 @@ struct scan_control {
> > >         unsigned long   anon_cost;
> > >         unsigned long   file_cost;
> > >
> > > +#ifdef CONFIG_MEMCG
> > > +       /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > > +       int *proactive_swappiness;
> > > +#endif
> >
> > Why is proactive_swappiness still a pointer? The whole point of the
> > previous conversation is that sc->proactive can tell whether
> > sc->swappiness is valid or not, and that's less awkward than using a
> > pointer.
>
> It's the same reason as before - zero initialization ensures that the
> pointer is NULL which tells us if it's valid or not. Proactive reclaim
> might not set swappiness and you need to distinguish swappiness of 0
> and not-set. See this discussion with Michal:
>
> https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/
>
> > Also why the #ifdef here? I don't see the point for a small stack
> > variable. Otherwise wouldn't we want to do this for sc->proactive as
> > well?
>
> This was Michal's request and it feels similar to your rationale for
> naming it proactive_swappiness - it's just restricting the interface
> down to the only use-cases. I'd be fine with doing the same in
> sc->proactive as a subsequent patch.
>
> See https://lore.kernel.org/linux-mm/ZZUhBoTNgL3AUK3f@tiehlicka/

Also regarding #ifdef, quoting Documentation/process/4.Coding.rst:
"As a general rule, #ifdef use should be confined to header files
whenever possible."
Michal Hocko Jan. 4, 2024, 8:48 a.m. UTC | #5
On Wed 03-01-24 18:07:43, Yu Zhao wrote:
> On Wed, Jan 03, 2024 at 01:19:59PM -0500, Dan Schatzberg wrote:
> > On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
> > [...]
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index d91963e2d47f..394e0dd46b2e 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -92,6 +92,11 @@ struct scan_control {
> > > >         unsigned long   anon_cost;
> > > >         unsigned long   file_cost;
> > > >
> > > > +#ifdef CONFIG_MEMCG
> > > > +       /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > > > +       int *proactive_swappiness;
> > > > +#endif
> > > 
> > > Why is proactive_swappiness still a pointer? The whole point of the
> > > previous conversation is that sc->proactive can tell whether
> > > sc->swappiness is valid or not, and that's less awkward than using a
> > > pointer.
> > 
> > It's the same reason as before - zero initialization ensures that the
> > pointer is NULL which tells us if it's valid or not. Proactive reclaim
> > might not set swappiness and you need to distinguish swappiness of 0
> > and not-set. See this discussion with Michal:
> > 
> > https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/
> 
>  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>                               size_t nbytes, loff_t off)
>  {
>         struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
>         unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>         unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +       int swappiness = -1;
> ...
>                 reclaimed = try_to_free_mem_cgroup_pages(memcg,
>                                         min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> -                                       GFP_KERNEL, reclaim_options);
> +                                       GFP_KERNEL, reclaim_options,
> +                                       swappiness);
> 
> ...
> 
> +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> +{
> +       return sc->proactive && sc->proactive_swappiness > -1 ?
> +              sc->proactive_swappiness : mem_cgroup_swappiness(memcg);
> +}

To be completely honest, I really fail to see why this is such a hot
discussion point. To be completely clear, both approaches are feasible.

The main argument for the NULL check based approach is that it is less
error prone under incorrect usage, because any bug becomes obvious. If we
use any other special constant, a missing initialization would be much
harder to spot because it would be a subtle behavior change.

Are there really any strong arguments to go against this "default
initialization is safe" policy?
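
The missing-initialization concern can be made concrete with a user-space model of the -1 sentinel variant. This is a sketch under simplifying assumptions: the struct definitions are stand-ins and the _sentinel names are hypothetical, chosen only to keep this variant apart from the pointer-based one.

```c
/* Simplified stand-in for the kernel type; only the field needed here. */
struct mem_cgroup {
    int swappiness;
};

/* Variant with a plain int and -1 as the "not set" sentinel. */
struct scan_control_sentinel {
    unsigned int proactive:1;
    int proactive_swappiness;  /* callers must store -1 when unused */
};

/* Same logic as the sentinel-based helper proposed in the thread. */
int sc_swappiness_sentinel(const struct scan_control_sentinel *sc,
                           const struct mem_cgroup *memcg)
{
    return sc->proactive && sc->proactive_swappiness > -1 ?
           sc->proactive_swappiness : memcg->swappiness;
}
```

If a proactive caller sets only .proactive = 1 in a designated initializer, proactive_swappiness is zero-initialized, so the helper returns 0 instead of the memcg setting. Nothing crashes or warns; the reclaim simply runs with swappiness 0.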
Michal Hocko Jan. 4, 2024, 10:09 a.m. UTC | #6
On Wed 03-01-24 08:48:37, Dan Schatzberg wrote:
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d91963e2d47f..394e0dd46b2e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -92,6 +92,11 @@ struct scan_control {
>  	unsigned long	anon_cost;
>  	unsigned long	file_cost;
>  
> +#ifdef CONFIG_MEMCG
> +	/* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> +	int *proactive_swappiness;
> +#endif
> +
>  	/* Can active folios be deactivated as part of reclaim? */
>  #define DEACTIVATE_ANON 1
>  #define DEACTIVATE_FILE 2
> @@ -227,6 +232,13 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>  #endif
>  	return false;
>  }
> +
> +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> +{
> +	if (sc->proactive && sc->proactive_swappiness)
> +		return *sc->proactive_swappiness;
> +	return mem_cgroup_swappiness(memcg);
> +}

If you really want to make this sc->proactive bound then do not use
CONFIG_MEMCG as sc->proactive is not guarded either.

I do not think that the sc->proactive check is really necessary. A pure
NULL check is sufficient to get valid, self-evident code that is
future-proof. But TBH this is not the most important aspect of the patch to
spend much more time discussing. Either go with sc->proactive but make
it config space consistent, or simply rely on the NULL check (with or
without the MEMCG guard, as both are valid options).
Yu Zhao Jan. 9, 2024, 11:54 p.m. UTC | #7
On Thu, Jan 4, 2024 at 1:48 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 03-01-24 18:07:43, Yu Zhao wrote:
> > On Wed, Jan 03, 2024 at 01:19:59PM -0500, Dan Schatzberg wrote:
> > > On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
> > > [...]
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index d91963e2d47f..394e0dd46b2e 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -92,6 +92,11 @@ struct scan_control {
> > > > >         unsigned long   anon_cost;
> > > > >         unsigned long   file_cost;
> > > > >
> > > > > +#ifdef CONFIG_MEMCG
> > > > > +       /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > > > > +       int *proactive_swappiness;
> > > > > +#endif
> > > >
> > > > Why is proactive_swappiness still a pointer? The whole point of the
> > > > previous conversation is that sc->proactive can tell whether
> > > > sc->swappiness is valid or not, and that's less awkward than using a
> > > > pointer.
> > >
> > > It's the same reason as before - zero initialization ensures that the
> > > pointer is NULL which tells us if it's valid or not. Proactive reclaim
> > > might not set swappiness and you need to distinguish swappiness of 0
> > > and not-set. See this discussion with Michal:
> > >
> > > https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/
> >
> >  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> >                               size_t nbytes, loff_t off)
> >  {
> >         struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> >         unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> >         unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > +       int swappiness = -1;
> > ...
> >                 reclaimed = try_to_free_mem_cgroup_pages(memcg,
> >                                         min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > -                                       GFP_KERNEL, reclaim_options);
> > +                                       GFP_KERNEL, reclaim_options,
> > +                                       swappiness);
> >
> > ...
> >
> > +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> > +{
> > +       return sc->proactive && sc->proactive_swappiness > -1 ?
> > +              sc->proactive_swappiness : mem_cgroup_swappiness(memcg);
> > +}
>
> To be completely honest I really fail to see why this is such a hot
> discussion point. To be completely clear both approaches are feasible.

Feasible but not equal.

> The main argument for NULL check based approach is that it is less error
> prone from an incorrect usage because any bug becomes obvious.

Any bug becomes *fatal*, and fatal isn't only obvious but also hurts
in production systems.

This was the reason for going through the trouble of switching from
VM_BUG_ON() to VM_WARN_ON() and documenting it in
Documentation/process/coding-style.rst:

22) Do not crash the kernel
---------------------------

In general, the decision to crash the kernel belongs to the user, rather
than to the kernel developer.

Isn't it?

> If we
> use any other special constant a missing initialization would be much
> harder to spot because they would be subtle behavior change.
>
> Are there really any strong arguments to go against this "default
> initialization is safe" policy?

Just wanted to point out an alternative. Fine details (best practices)
matter to me.
Yu Zhao Jan. 9, 2024, 11:57 p.m. UTC | #8
On Thu, Jan 4, 2024 at 3:09 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 03-01-24 08:48:37, Dan Schatzberg wrote:
> [...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d91963e2d47f..394e0dd46b2e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -92,6 +92,11 @@ struct scan_control {
> >       unsigned long   anon_cost;
> >       unsigned long   file_cost;
> >
> > +#ifdef CONFIG_MEMCG
> > +     /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > +     int *proactive_swappiness;
> > +#endif
> > +
> >       /* Can active folios be deactivated as part of reclaim? */
> >  #define DEACTIVATE_ANON 1
> >  #define DEACTIVATE_FILE 2
> > @@ -227,6 +232,13 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> >  #endif
> >       return false;
> >  }
> > +
> > +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> > +{
> > +     if (sc->proactive && sc->proactive_swappiness)
> > +             return *sc->proactive_swappiness;
> > +     return mem_cgroup_swappiness(memcg);
> > +}
>
> If you really want to make this sc->proactive bound then do not use
> CONFIG_MEMCG as sc->proactive is not guarded either.
>
> I do not think that sc->proactive check is really necessary. A pure NULL
> check is sufficient to have a valid and self evident code that is future
> proof. But TBH this is not the most important aspect of the patch to
> spend much more time discussing. Either go with sc->proactive but make
> it config space consistent or simply rely on NULL check (with or without
> MEMCG guard as both are valid options).

Now you see why I replied. That "hybrid" if statement is neither of
the two approaches that were suggested.
Michal Hocko Jan. 10, 2024, 10:32 a.m. UTC | #9
On Tue 09-01-24 16:54:15, Yu Zhao wrote:
> On Thu, Jan 4, 2024 at 1:48 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 03-01-24 18:07:43, Yu Zhao wrote:
> > > On Wed, Jan 03, 2024 at 01:19:59PM -0500, Dan Schatzberg wrote:
> > > > On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
> > > > [...]
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index d91963e2d47f..394e0dd46b2e 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > > > @@ -92,6 +92,11 @@ struct scan_control {
> > > > > >         unsigned long   anon_cost;
> > > > > >         unsigned long   file_cost;
> > > > > >
> > > > > > +#ifdef CONFIG_MEMCG
> > > > > > +       /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > > > > > +       int *proactive_swappiness;
> > > > > > +#endif
> > > > >
> > > > > Why is proactive_swappiness still a pointer? The whole point of the
> > > > > previous conversation is that sc->proactive can tell whether
> > > > > sc->swappiness is valid or not, and that's less awkward than using a
> > > > > pointer.
> > > >
> > > > It's the same reason as before - zero initialization ensures that the
> > > > pointer is NULL which tells us if it's valid or not. Proactive reclaim
> > > > might not set swappiness and you need to distinguish swappiness of 0
> > > > and not-set. See this discussion with Michal:
> > > >
> > > > https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/
> > >
> > >  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > >                               size_t nbytes, loff_t off)
> > >  {
> > >         struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > >         unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > >         unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > > +       int swappiness = -1;
> > > ...
> > >                 reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > >                                         min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > > -                                       GFP_KERNEL, reclaim_options);
> > > +                                       GFP_KERNEL, reclaim_options,
> > > +                                       swappiness);
> > >
> > > ...
> > >
> > > +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> > > +{
> > > +       return sc->proactive && sc->proactive_swappiness > -1 ?
> > > +              sc->proactive_swappiness : mem_cgroup_swappiness(memcg);
> > > +}
> >
> > > To be completely honest I really fail to see why this is such a hot
> > discussion point. To be completely clear both approaches are feasible.
> 
> Feasible but not equal.
> 
> > The main argument for NULL check based approach is that it is less error
> > > prone from an incorrect usage because any bug becomes obvious.
> 
> Any bug becomes *fatal*, and fatal isn't only obvious but also hurts
> in production systems.
> 
> This was the reason for going through the trouble switching from
> VM_BUG_ON() to VM_WARN_ON() and documenting it in
> Documentation/process/coding-style.rst:
> 
> 22) Do not crash the kernel
> ---------------------------
> 
> In general, the decision to crash the kernel belongs to the user, rather
> than to the kernel developer.
> 
> Isn't it?

I do agree with this general statement but I do not think it is
applicable in this context.

This is not an explicit BUG() where the kernel deliberately panics the
system. We are talking about subtle misbehavior which might be
non-trivial to debug (there are other reasons to not swap at all) vs. a
potential NULL ptr dereference which will kill the userspace in a very
obvious way. Sure, there are risks with that, but checking for a
potential NULL ptr dereference is easier than spotting a forgotten
explicit initialization. There are clear pros and cons for both
approaches. NULL default-initialized structure members which allow for
a behavior override are a general kernel pattern, so I do not really
see this going way off the rails.
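
The trade-off discussed above can be illustrated with a small userspace
sketch. This is not kernel code: struct sc_ptr, struct sc_val,
DEFAULT_SWAPPINESS and the helper names are hypothetical stand-ins for
struct scan_control, mem_cgroup_swappiness() and sc_swappiness(). It
only shows how the two encodings of "not set" behave when a caller
relies on zero initialization.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the value mem_cgroup_swappiness() would return. */
#define DEFAULT_SWAPPINESS 60

/* Pointer-based override (the shape used in the patch): NULL means "not set". */
struct sc_ptr {
	int proactive;
	const int *proactive_swappiness;
};

/* Sentinel-based override (the alternative shape): -1 means "not set". */
struct sc_val {
	int proactive;
	int proactive_swappiness;
};

static int swappiness_ptr(const struct sc_ptr *sc)
{
	if (sc->proactive && sc->proactive_swappiness)
		return *sc->proactive_swappiness;
	return DEFAULT_SWAPPINESS;
}

static int swappiness_val(const struct sc_val *sc)
{
	if (sc->proactive && sc->proactive_swappiness > -1)
		return sc->proactive_swappiness;
	return DEFAULT_SWAPPINESS;
}

/* Walks through the cases argued about in the thread. */
static void demo_override_semantics(void)
{
	int zero = 0;

	/* Zero-initialized pointer: the override is absent, default is used. */
	struct sc_ptr p_unset = { .proactive = 1 };
	assert(swappiness_ptr(&p_unset) == DEFAULT_SWAPPINESS);

	/* Explicit swappiness=0 stays distinguishable from "not set". */
	struct sc_ptr p_zero = { .proactive = 1, .proactive_swappiness = &zero };
	assert(swappiness_ptr(&p_zero) == 0);

	/* Sentinel variant with the initialization forgotten: silently 0. */
	struct sc_val v_forgot = { .proactive = 1 };
	assert(swappiness_val(&v_forgot) == 0);

	/* Sentinel variant initialized correctly: default is used. */
	struct sc_val v_unset = { .proactive = 1, .proactive_swappiness = -1 };
	assert(swappiness_val(&v_unset) == DEFAULT_SWAPPINESS);
}
```

Calling demo_override_semantics() from a main() exercises all four
cases. The third case is the "subtle misbehavior" side of the argument:
a forgotten -1 quietly turns into swappiness 0. The pointer variant
avoids that, at the cost that an accessor which skips the NULL check
would fault instead.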

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3f85254f3cef..ee42f74e0765 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1282,17 +1282,10 @@  PAGE_SIZE multiple when read back.
 	This is a simple interface to trigger memory reclaim in the
 	target cgroup.
 
-	This file accepts a single key, the number of bytes to reclaim.
-	No nested keys are currently supported.
-
 	Example::
 
 	  echo "1G" > memory.reclaim
 
-	The interface can be later extended with nested keys to
-	configure the reclaim behavior. For example, specify the
-	type of memory to reclaim from (anon, file, ..).
-
 	Please note that the kernel can over or under reclaim from
 	the target cgroup. If less bytes are reclaimed than the
 	specified amount, -EAGAIN is returned.
@@ -1304,6 +1297,17 @@  PAGE_SIZE multiple when read back.
 	This means that the networking layer will not adapt based on
 	reclaim induced by memory.reclaim.
 
+The following nested keys are defined.
+
+	  ==========            ================================
+	  swappiness            Swappiness value to reclaim with
+	  ==========            ================================
+
+	Specifying a swappiness value instructs the kernel to perform
+	the reclaim with that swappiness value. Note that this has the
+	same semantics as vm.swappiness applied to memcg reclaim with
+	all the existing limitations and potential future extensions.
+
   memory.peak
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e2ab76c25b4a..8afdec40efe3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -412,7 +412,8 @@  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
-						  unsigned int reclaim_options);
+						  unsigned int reclaim_options,
+						  int *swappiness);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fbe9f02dd206..6d627a754851 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -52,6 +52,7 @@ 
 #include <linux/sort.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
+#include <linux/parser.h>
 #include <linux/vmpressure.h>
 #include <linux/memremap.h>
 #include <linux/mm_inline.h>
@@ -2449,7 +2450,8 @@  static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							gfp_mask,
-							MEMCG_RECLAIM_MAY_SWAP);
+							MEMCG_RECLAIM_MAY_SWAP,
+							NULL);
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2740,7 +2742,7 @@  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, reclaim_options);
+						    gfp_mask, reclaim_options, NULL);
 	psi_memstall_leave(&pflags);
 
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3660,7 +3662,7 @@  static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3774,7 +3776,7 @@  static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP))
+						  MEMCG_RECLAIM_MAY_SWAP, NULL))
 			nr_retries--;
 	}
 
@@ -6720,7 +6722,7 @@  static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
 
 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6769,7 +6771,7 @@  static ssize_t memory_max_write(struct kernfs_open_file *of,
 
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL))
 				nr_reclaims--;
 			continue;
 		}
@@ -6895,19 +6897,50 @@  static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+enum {
+	MEMORY_RECLAIM_SWAPPINESS = 0,
+	MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t tokens = {
+	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
+	{ MEMORY_RECLAIM_NULL, NULL },
+};
+
 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			      size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
+	int swappiness = -1;
 	unsigned int reclaim_options;
-	int err;
+	char *old_buf, *start;
+	substring_t args[MAX_OPT_ARGS];
 
 	buf = strstrip(buf);
-	err = page_counter_memparse(buf, "", &nr_to_reclaim);
-	if (err)
-		return err;
+
+	old_buf = buf;
+	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+	if (buf == old_buf)
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	while ((start = strsep(&buf, " ")) != NULL) {
+		if (!strlen(start))
+			continue;
+		switch (match_token(start, tokens, args)) {
+		case MEMORY_RECLAIM_SWAPPINESS:
+			if (match_int(&args[0], &swappiness))
+				return -EINVAL;
+			if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
+				return -EINVAL;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
 
 	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
 	while (nr_reclaimed < nr_to_reclaim) {
@@ -6926,7 +6959,8 @@  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
 					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
-					GFP_KERNEL, reclaim_options);
+					GFP_KERNEL, reclaim_options,
+					swappiness == -1 ? NULL : &swappiness);
 
 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d91963e2d47f..394e0dd46b2e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -92,6 +92,11 @@  struct scan_control {
 	unsigned long	anon_cost;
 	unsigned long	file_cost;
 
+#ifdef CONFIG_MEMCG
+	/* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
+	int *proactive_swappiness;
+#endif
+
 	/* Can active folios be deactivated as part of reclaim? */
 #define DEACTIVATE_ANON 1
 #define DEACTIVATE_FILE 2
@@ -227,6 +232,13 @@  static bool writeback_throttling_sane(struct scan_control *sc)
 #endif
 	return false;
 }
+
+static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
+{
+	if (sc->proactive && sc->proactive_swappiness)
+		return *sc->proactive_swappiness;
+	return mem_cgroup_swappiness(memcg);
+}
 #else
 static bool cgroup_reclaim(struct scan_control *sc)
 {
@@ -242,6 +254,11 @@  static bool writeback_throttling_sane(struct scan_control *sc)
 {
 	return true;
 }
+
+static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
+{
+	return READ_ONCE(vm_swappiness);
+}
 #endif
 
 static void set_task_reclaim_state(struct task_struct *task,
@@ -2327,7 +2344,7 @@  static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	unsigned long anon_cost, file_cost, total_cost;
-	int swappiness = mem_cgroup_swappiness(memcg);
+	int swappiness = sc_swappiness(sc, memcg);
 	u64 fraction[ANON_AND_FILE];
 	u64 denominator = 0;	/* gcc */
 	enum scan_balance scan_balance;
@@ -2608,7 +2625,7 @@  static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
 	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
 		return 0;
 
-	return mem_cgroup_swappiness(memcg);
+	return sc_swappiness(sc, memcg);
 }
 
 static int get_nr_gens(struct lruvec *lruvec, int type)
@@ -6463,12 +6480,14 @@  unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
-					   unsigned int reclaim_options)
+					   unsigned int reclaim_options,
+					   int *swappiness)
 {
 	unsigned long nr_reclaimed;
 	unsigned int noreclaim_flag;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
+		.proactive_swappiness = swappiness,
 		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 		.reclaim_idx = MAX_NR_ZONES - 1,