diff mbox series

[V3,1/1] mm: add swapiness= arg to memory.reclaim

Message ID 20231211140419.1298178-2-schatzberg.dan@gmail.com (mailing list archive)
State New
Headers show
Series Add swappiness argument to memory.reclaim | expand

Commit Message

Dan Schatzberg Dec. 11, 2023, 2:04 p.m. UTC
Allow proactive reclaimers to submit an additional swappiness=<val>
argument to memory.reclaim. This overrides the global or per-memcg
swappiness setting for that reclaim attempt.

For example:

echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim

will perform reclaim on the rootcg with a swappiness setting of 0 (no
swap) regardless of the vm.swappiness sysctl setting.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 15 ++++++-
 include/linux/swap.h                    |  3 +-
 mm/memcontrol.c                         | 55 ++++++++++++++++++++-----
 mm/vmscan.c                             | 13 +++++-
 4 files changed, 70 insertions(+), 16 deletions(-)

Comments

Yosry Ahmed Dec. 11, 2023, 7:41 p.m. UTC | #1
On Mon, Dec 11, 2023 at 6:04 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> For example:
>
> echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
>
> will perform reclaim on the rootcg with a swappiness setting of 0 (no
> swap) regardless of the vm.swappiness sysctl setting.
>
> Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 15 ++++++-
>  include/linux/swap.h                    |  3 +-
>  mm/memcontrol.c                         | 55 ++++++++++++++++++++-----
>  mm/vmscan.c                             | 13 +++++-
>  4 files changed, 70 insertions(+), 16 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 3f85254f3cef..fc2b379dbd0f 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1282,8 +1282,8 @@ PAGE_SIZE multiple when read back.
>         This is a simple interface to trigger memory reclaim in the
>         target cgroup.
>
> -       This file accepts a single key, the number of bytes to reclaim.
> -       No nested keys are currently supported.
> +       This file accepts a string which containers thhe number of bytes

contains* the*

I think this statement was only important because no keys were
supported, so I think we can remove it completely and rely on
documenting the supported keys below like other interfaces, see my
next comment.

> +       to reclaim.
>
>         Example::
>
> @@ -1304,6 +1304,17 @@ PAGE_SIZE multiple when read back.
>         This means that the networking layer will not adapt based on
>         reclaim induced by memory.reclaim.
>
> +       This file also allows the user to specify the swappiness value
> +       to be used for the reclaim. For example:
> +
> +         echo "1G swappiness=60" > memory.reclaim
> +
> +       The above instructs the kernel to perform the reclaim with
> +       a swappiness value of 60. Note that this has the same semantics
> +       as the vm.swappiness sysctl - it sets the relative IO cost of
> +       reclaiming anon vs file memory but does not allow for reclaiming
> +       specific amounts of anon or file memory.
> +

Can we instead follow the same format used by other nested-keyed files
(e.g. io.max)? This usually involves a table of supported keys and
such.

>    memory.peak
>         A read-only single value file which exists on non-root
>         cgroups.
[..]
> @@ -6902,12 +6913,33 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>         unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>         unsigned long nr_to_reclaim, nr_reclaimed = 0;
>         unsigned int reclaim_options;
> -       int err;
> +       char *old_buf, *start;
> +       substring_t args[MAX_OPT_ARGS];
> +       int swappiness = -1;
>
>         buf = strstrip(buf);
> -       err = page_counter_memparse(buf, "", &nr_to_reclaim);
> -       if (err)
> -               return err;
> +
> +       old_buf = buf;
> +       nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> +       if (buf == old_buf)
> +               return -EINVAL;
> +
> +       buf = strstrip(buf);
> +
> +       while ((start = strsep(&buf, " ")) != NULL) {
> +               if (!strlen(start))
> +                       continue;
> +               switch (match_token(start, if_tokens, args)) {
> +               case MEMORY_RECLAIM_SWAPPINESS:
> +                       if (match_int(&args[0], &swappiness))
> +                               return -EINVAL;
> +                       if (swappiness < 0 || swappiness > 200)

I am not a fan of extending the hardcoded 0 and 200 values, and now
the new -1 value. Maybe it's time to create constants for the min and
max swappiness values instead of hardcoding them everywhere? This can
be a separate preparatory patch. Then, -1 (or any invalid value) can
also be added as a constant with a useful name, instead of passing -1
to all other callers.

This should make the code a little bit more readable and easier to extend.
Chris Li Dec. 12, 2023, 1:06 a.m. UTC | #2
Hi Dan,

Thank you for the patch.

On Mon, Dec 11, 2023 at 6:04 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.

I am curious what prompted you to develop this patch. I understand
what this patch does, just want to know more of your background story
why this is needed.

>
> For example:
>
> echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
>
> will perform reclaim on the rootcg with a swappiness setting of 0 (no
> swap) regardless of the vm.swappiness sysctl setting.
>
> Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 15 ++++++-
>  include/linux/swap.h                    |  3 +-
>  mm/memcontrol.c                         | 55 ++++++++++++++++++++-----
>  mm/vmscan.c                             | 13 +++++-
>  4 files changed, 70 insertions(+), 16 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 3f85254f3cef..fc2b379dbd0f 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1282,8 +1282,8 @@ PAGE_SIZE multiple when read back.
>         This is a simple interface to trigger memory reclaim in the
>         target cgroup.
>
> -       This file accepts a single key, the number of bytes to reclaim.
> -       No nested keys are currently supported.
> +       This file accepts a string which containers thhe number of bytes
Just as Yosry points out, there is some typo here.

"contains the"

> +       to reclaim.
>
>         Example::
>
> @@ -1304,6 +1304,17 @@ PAGE_SIZE multiple when read back.
>         This means that the networking layer will not adapt based on
>         reclaim induced by memory.reclaim.
>
> +       This file also allows the user to specify the swappiness value
> +       to be used for the reclaim. For example:
> +
> +         echo "1G swappiness=60" > memory.reclaim
> +
> +       The above instructs the kernel to perform the reclaim with
> +       a swappiness value of 60. Note that this has the same semantics
> +       as the vm.swappiness sysctl - it sets the relative IO cost of
> +       reclaiming anon vs file memory but does not allow for reclaiming
> +       specific amounts of anon or file memory.
> +
>    memory.peak
>         A read-only single value file which exists on non-root
>         cgroups.
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index f6dd6575b905..ebc20d094609 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -410,7 +410,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>                                                   unsigned long nr_pages,
>                                                   gfp_t gfp_mask,
> -                                                 unsigned int reclaim_options);
> +                                                 unsigned int reclaim_options,
> +                                                 int swappiness);
>  extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
>                                                 gfp_t gfp_mask, bool noswap,
>                                                 pg_data_t *pgdat,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1c1061df9cd1..74598c17d3cc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -63,6 +63,7 @@
>  #include <linux/resume_user_mode.h>
>  #include <linux/psi.h>
>  #include <linux/seq_buf.h>
> +#include <linux/parser.h>
>  #include <linux/sched/isolation.h>
>  #include "internal.h"
>  #include <net/sock.h>
> @@ -2449,7 +2450,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
>                 psi_memstall_enter(&pflags);
>                 nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
>                                                         gfp_mask,
> -                                                       MEMCG_RECLAIM_MAY_SWAP);
> +                                                       MEMCG_RECLAIM_MAY_SWAP, -1);

Instead of passing -1, maybe we can use mem_cgroup_swappiness(memcg);

And maybe remove the -1 test from the get_scan_count().

""
static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
   unsigned long *nr)
{
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
unsigned long anon_cost, file_cost, total_cost;
int swappiness = sc->swappiness != -1 ?
sc->swappiness : mem_cgroup_swappiness(memcg);
""

In other words, I feel it is cleaner if try_to_free_mem_cgroup_pages
accept the swappiness as it is. There is no second guessing the value
if it is -1, then the internal function gets the swappiness from
mem_cgroup_swappiness().
Maybe we can completely remove the -1 special default value?

>                 psi_memstall_leave(&pflags);
>         } while ((memcg = parent_mem_cgroup(memcg)) &&
>                  !mem_cgroup_is_root(memcg));
> @@ -2740,7 +2741,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>
>         psi_memstall_enter(&pflags);
>         nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> -                                                   gfp_mask, reclaim_options);
> +                                                   gfp_mask, reclaim_options, -1);

Same here.

>         psi_memstall_leave(&pflags);
>
>         if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> @@ -3660,7 +3661,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
>                 }
>
>                 if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> -                                       memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
> +                                       memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, -1)) {

Same here.

>                         ret = -EBUSY;
>                         break;
>                 }
> @@ -3774,7 +3775,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>                         return -EINTR;
>
>                 if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> -                                                 MEMCG_RECLAIM_MAY_SWAP))
> +                                                 MEMCG_RECLAIM_MAY_SWAP, -1))

Here too.

>                         nr_retries--;
>         }
>
> @@ -6720,7 +6721,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>                 }
>
>                 reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> -                                       GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
> +                                       GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, -1);

Here too.

>
>                 if (!reclaimed && !nr_retries--)
>                         break;
> @@ -6769,7 +6770,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>
>                 if (nr_reclaims) {
>                         if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
> -                                       GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
> +                                       GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, -1))

Here too.

>                                 nr_reclaims--;
>                         continue;
>                 }
> @@ -6895,6 +6896,16 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>         return nbytes;
>  }
>
> +enum {
> +       MEMORY_RECLAIM_SWAPPINESS = 0,
> +       MEMORY_RECLAIM_NULL,
> +};
> +
> +static const match_table_t if_tokens = {

What this is called "if_tokens"? I am trying to figure out what "if" refers to.

> +       { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> +       { MEMORY_RECLAIM_NULL, NULL },
> +};
> +

Do we foresee a lot of tunable for the try to free page? I see. You
want to use match_token() to do the keyword parsing.

>  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>                               size_t nbytes, loff_t off)
>  {
> @@ -6902,12 +6913,33 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>         unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>         unsigned long nr_to_reclaim, nr_reclaimed = 0;
>         unsigned int reclaim_options;
> -       int err;
> +       char *old_buf, *start;
> +       substring_t args[MAX_OPT_ARGS];
> +       int swappiness = -1;
>
>         buf = strstrip(buf);
> -       err = page_counter_memparse(buf, "", &nr_to_reclaim);
> -       if (err)
> -               return err;
> +
> +       old_buf = buf;
> +       nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> +       if (buf == old_buf)
> +               return -EINVAL;
> +
> +       buf = strstrip(buf);
> +
> +       while ((start = strsep(&buf, " ")) != NULL) {
> +               if (!strlen(start))
> +                       continue;
> +               switch (match_token(start, if_tokens, args)) {
> +               case MEMORY_RECLAIM_SWAPPINESS:
> +                       if (match_int(&args[0], &swappiness))
> +                               return -EINVAL;
> +                       if (swappiness < 0 || swappiness > 200)

Agree with Yosry on the 200 magic value.

I am also wondering if there is an easier way to just parse one
keyword. Will using strcmp("swappiness=") be a bad idea? I haven't
tried it myself though.

> +                               return -EINVAL;
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
>
>         reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
>         while (nr_reclaimed < nr_to_reclaim) {
> @@ -6926,7 +6958,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>
>                 reclaimed = try_to_free_mem_cgroup_pages(memcg,
>                                         min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> -                                       GFP_KERNEL, reclaim_options);
> +                                       GFP_KERNEL, reclaim_options,
> +                                       swappiness);
>
>                 if (!reclaimed && !nr_retries--)
>                         return -EAGAIN;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 506f8220c5fe..a20965b20177 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -136,6 +136,9 @@ struct scan_control {
>         /* Always discard instead of demoting to lower tier memory */
>         unsigned int no_demotion:1;
>
> +       /* Swappiness value for reclaim, if -1 use memcg/global value */
> +       int swappiness;
> +

It would be great if we can get rid of the -1 case.

>         /* Allocation order */
>         s8 order;
>
> @@ -2327,7 +2330,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>         unsigned long anon_cost, file_cost, total_cost;
> -       int swappiness = mem_cgroup_swappiness(memcg);
> +       int swappiness = sc->swappiness != -1 ?
> +               sc->swappiness : mem_cgroup_swappiness(memcg);

Let's see if we can get rid of the -1. Now I see the -1 is actually
introduced by this patch, should be doable.

>         u64 fraction[ANON_AND_FILE];
>         u64 denominator = 0;    /* gcc */
>         enum scan_balance scan_balance;
> @@ -2608,6 +2612,9 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
>             mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
>                 return 0;
>
> +       if (sc->swappiness != -1)
> +               return sc->swappiness;
> +

Get rid of that too.

Thank you.

Chris


>         return mem_cgroup_swappiness(memcg);
>  }
>
> @@ -6433,7 +6440,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>                                            unsigned long nr_pages,
>                                            gfp_t gfp_mask,
> -                                          unsigned int reclaim_options)
> +                                          unsigned int reclaim_options,
> +                                          int swappiness)
>  {
>         unsigned long nr_reclaimed;
>         unsigned int noreclaim_flag;
> @@ -6448,6 +6456,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>                 .may_unmap = 1,
>                 .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
>                 .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
> +               .swappiness = swappiness,
>         };
>         /*
>          * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
> --
> 2.34.1
>
>
Dan Schatzberg Dec. 12, 2023, 9:27 p.m. UTC | #3
On Mon, Dec 11, 2023 at 11:41:24AM -0800, Yosry Ahmed wrote:
> On Mon, Dec 11, 2023 at 6:04 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> 
> contains* the*
> 
> I think this statement was only important because no keys were
> supported, so I think we can remove it completely and rely on
> documenting the supported keys below like other interfaces, see my
> next comment.
> 
> > +       to reclaim.
> >
> >         Example::
> >
> > @@ -1304,6 +1304,17 @@ PAGE_SIZE multiple when read back.
> >         This means that the networking layer will not adapt based on
> >         reclaim induced by memory.reclaim.
> >
> > +       This file also allows the user to specify the swappiness value
> > +       to be used for the reclaim. For example:
> > +
> > +         echo "1G swappiness=60" > memory.reclaim
> > +
> > +       The above instructs the kernel to perform the reclaim with
> > +       a swappiness value of 60. Note that this has the same semantics
> > +       as the vm.swappiness sysctl - it sets the relative IO cost of
> > +       reclaiming anon vs file memory but does not allow for reclaiming
> > +       specific amounts of anon or file memory.
> > +
> 
> Can we instead follow the same format used by other nested-keyed files
> (e.g. io.max)? This usually involves a table of supported keys and
> such.

Thanks, both are good suggestions. Will address these.

> > +       while ((start = strsep(&buf, " ")) != NULL) {
> > +               if (!strlen(start))
> > +                       continue;
> > +               switch (match_token(start, if_tokens, args)) {
> > +               case MEMORY_RECLAIM_SWAPPINESS:
> > +                       if (match_int(&args[0], &swappiness))
> > +                               return -EINVAL;
> > +                       if (swappiness < 0 || swappiness > 200)
> 
> I am not a fan of extending the hardcoded 0 and 200 values, and now
> the new -1 value. Maybe it's time to create constants for the min and
> max swappiness values instead of hardcoding them everywhere? This can
> be a separate preparatory patch. Then, -1 (or any invalid value) can
> also be added as a constant with a useful name, instead of passing -1
> to all other callers.
> 
> This should make the code a little bit more readable and easier to extend.

I'm not sure I understand the concern. This check just validates that
the swappiness value inputted is between 0 and 200 (inclusive)
otherwise the interface returns -EINVAL. Are you just concerned that
these constants are not named explicitly so they can be reused
elsewhere in the code?
Yosry Ahmed Dec. 12, 2023, 9:31 p.m. UTC | #4
On Tue, Dec 12, 2023 at 1:27 PM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> On Mon, Dec 11, 2023 at 11:41:24AM -0800, Yosry Ahmed wrote:
> > On Mon, Dec 11, 2023 at 6:04 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> >
> > contains* the*
> >
> > I think this statement was only important because no keys were
> > supported, so I think we can remove it completely and rely on
> > documenting the supported keys below like other interfaces, see my
> > next comment.
> >
> > > +       to reclaim.
> > >
> > >         Example::
> > >
> > > @@ -1304,6 +1304,17 @@ PAGE_SIZE multiple when read back.
> > >         This means that the networking layer will not adapt based on
> > >         reclaim induced by memory.reclaim.
> > >
> > > +       This file also allows the user to specify the swappiness value
> > > +       to be used for the reclaim. For example:
> > > +
> > > +         echo "1G swappiness=60" > memory.reclaim
> > > +
> > > +       The above instructs the kernel to perform the reclaim with
> > > +       a swappiness value of 60. Note that this has the same semantics
> > > +       as the vm.swappiness sysctl - it sets the relative IO cost of
> > > +       reclaiming anon vs file memory but does not allow for reclaiming
> > > +       specific amounts of anon or file memory.
> > > +
> >
> > Can we instead follow the same format used by other nested-keyed files
> > (e.g. io.max)? This usually involves a table of supported keys and
> > such.
>
> Thanks, both are good suggestions. Will address these.
>
> > > +       while ((start = strsep(&buf, " ")) != NULL) {
> > > +               if (!strlen(start))
> > > +                       continue;
> > > +               switch (match_token(start, if_tokens, args)) {
> > > +               case MEMORY_RECLAIM_SWAPPINESS:
> > > +                       if (match_int(&args[0], &swappiness))
> > > +                               return -EINVAL;
> > > +                       if (swappiness < 0 || swappiness > 200)
> >
> > I am not a fan of extending the hardcoded 0 and 200 values, and now
> > the new -1 value. Maybe it's time to create constants for the min and
> > max swappiness values instead of hardcoding them everywhere? This can
> > be a separate preparatory patch. Then, -1 (or any invalid value) can
> > also be added as a constant with a useful name, instead of passing -1
> > to all other callers.
> >
> > This should make the code a little bit more readable and easier to extend.
>
> I'm not sure I understand the concern. This check just validates that
> the swappiness value inputted is between 0 and 200 (inclusive)
> otherwise the interface returns -EINVAL. Are you just concerned that
> these constants are not named explicitly so they can be reused
> elsewhere in the code?

Yes. The 0 and 200 values are already hardcoded in multiple places,
and we are adding more places now and more hardcoded values (i.e. -1).
Chris Li Dec. 12, 2023, 9:32 p.m. UTC | #5
Hi Dan,

On Tue, Dec 12, 2023 at 1:27 PM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> > > +       while ((start = strsep(&buf, " ")) != NULL) {
> > > +               if (!strlen(start))
> > > +                       continue;
> > > +               switch (match_token(start, if_tokens, args)) {
> > > +               case MEMORY_RECLAIM_SWAPPINESS:
> > > +                       if (match_int(&args[0], &swappiness))
> > > +                               return -EINVAL;
> > > +                       if (swappiness < 0 || swappiness > 200)
> >
> > I am not a fan of extending the hardcoded 0 and 200 values, and now
> > the new -1 value. Maybe it's time to create constants for the min and
> > max swappiness values instead of hardcoding them everywhere? This can
> > be a separate preparatory patch. Then, -1 (or any invalid value) can
> > also be added as a constant with a useful name, instead of passing -1
> > to all other callers.
> >
> > This should make the code a little bit more readable and easier to extend.
>
> I'm not sure I understand the concern. This check just validates that
> the swappiness value inputted is between 0 and 200 (inclusive)
> otherwise the interface returns -EINVAL. Are you just concerned that
> these constants are not named explicitly so they can be reused
> elsewhere in the code?
>

I think the concern is why 200? Why not 400 or 600?

The user might write bigger values and expect the reclaim to work with
those values.
If there is some hard coded limit enforced somewhere else so writing
more than 200 does not make sense. It would be nice to have those
other places reference this limit as well. Thus give 200 a name and
use it in other places of the code as well.

Chris
Dan Schatzberg Dec. 12, 2023, 9:43 p.m. UTC | #6
On Mon, Dec 11, 2023 at 05:06:54PM -0800, Chris Li wrote:
> Hi Dan,
> 
> Thank you for the patch.
> 
> On Mon, Dec 11, 2023 at 6:04 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> >
> > Allow proactive reclaimers to submit an additional swappiness=<val>
> > argument to memory.reclaim. This overrides the global or per-memcg
> > swappiness setting for that reclaim attempt.
> 
> I am curious what prompted you to develop this patch. I understand
> what this patch does, just want to know more of your background story
> why this is needed.

I wrote about this in some detail in the cover letter (0/1). Take a
look and let me know if the rationale is still unclear.

> Instead of passing -1, maybe we can use mem_cgroup_swappiness(memcg);
>

Yeah this makes sense, I'll go ahead and make that change and
eliminate the -1.

> >                                 nr_reclaims--;
> >                         continue;
> >                 }
> > @@ -6895,6 +6896,16 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> >         return nbytes;
> >  }
> >
> > +enum {
> > +       MEMORY_RECLAIM_SWAPPINESS = 0,
> > +       MEMORY_RECLAIM_NULL,
> > +};
> > +
> > +static const match_table_t if_tokens = {
> 
> What this is called "if_tokens"? I am trying to figure out what "if" refers to.

I used the same logic as in "mm: Add nodes= arg to memory.reclaim". I
can just call it tokens.

> 
> > +       { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> > +       { MEMORY_RECLAIM_NULL, NULL },
> > +};
> > +
> 
> Do we foresee a lot of tunable for the try to free page? I see. You
> want to use match_token() to do the keyword parsing.

See below

> 
> >  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> >                               size_t nbytes, loff_t off)
> >  {
> > @@ -6902,12 +6913,33 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> >         unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> >         unsigned long nr_to_reclaim, nr_reclaimed = 0;
> >         unsigned int reclaim_options;
> > -       int err;
> > +       char *old_buf, *start;
> > +       substring_t args[MAX_OPT_ARGS];
> > +       int swappiness = -1;
> >
> >         buf = strstrip(buf);
> > -       err = page_counter_memparse(buf, "", &nr_to_reclaim);
> > -       if (err)
> > -               return err;
> > +
> > +       old_buf = buf;
> > +       nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> > +       if (buf == old_buf)
> > +               return -EINVAL;
> > +
> > +       buf = strstrip(buf);
> > +
> > +       while ((start = strsep(&buf, " ")) != NULL) {
> > +               if (!strlen(start))
> > +                       continue;
> > +               switch (match_token(start, if_tokens, args)) {
> > +               case MEMORY_RECLAIM_SWAPPINESS:
> > +                       if (match_int(&args[0], &swappiness))
> > +                               return -EINVAL;
> > +                       if (swappiness < 0 || swappiness > 200)
> 
> Agree with Yosry on the 200 magic value.
> 
> I am also wondering if there is an easier way to just parse one
> keyword. Will using strcmp("swappiness=") be a bad idea? I haven't
> tried it myself though.

As above, "mm: Add nodes= arg to memory.reclaim" was previously in the
mm tree doing it this way, so I duplicated it. I think given that
there have been lots of discussions about extending this interface,
this match table has some potential future value and I don't see a
major downside to using it in favor of strcmp.
Dan Schatzberg Dec. 12, 2023, 9:46 p.m. UTC | #7
On Tue, Dec 12, 2023 at 01:31:46PM -0800, Yosry Ahmed wrote:
> On Tue, Dec 12, 2023 at 1:27 PM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> >
> > On Mon, Dec 11, 2023 at 11:41:24AM -0800, Yosry Ahmed wrote:
> > > On Mon, Dec 11, 2023 at 6:04 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> > >
> > > contains* the*
> > >
> > > I think this statement was only important because no keys were
> > > supported, so I think we can remove it completely and rely on
> > > documenting the supported keys below like other interfaces, see my
> > > next comment.
> > >
> > > > +       to reclaim.
> > > >
> > > >         Example::
> > > >
> > > > @@ -1304,6 +1304,17 @@ PAGE_SIZE multiple when read back.
> > > >         This means that the networking layer will not adapt based on
> > > >         reclaim induced by memory.reclaim.
> > > >
> > > > +       This file also allows the user to specify the swappiness value
> > > > +       to be used for the reclaim. For example:
> > > > +
> > > > +         echo "1G swappiness=60" > memory.reclaim
> > > > +
> > > > +       The above instructs the kernel to perform the reclaim with
> > > > +       a swappiness value of 60. Note that this has the same semantics
> > > > +       as the vm.swappiness sysctl - it sets the relative IO cost of
> > > > +       reclaiming anon vs file memory but does not allow for reclaiming
> > > > +       specific amounts of anon or file memory.
> > > > +
> > >
> > > Can we instead follow the same format used by other nested-keyed files
> > > (e.g. io.max)? This usually involves a table of supported keys and
> > > such.
> >
> > Thanks, both are good suggestions. Will address these.
> >
> > > > +       while ((start = strsep(&buf, " ")) != NULL) {
> > > > +               if (!strlen(start))
> > > > +                       continue;
> > > > +               switch (match_token(start, if_tokens, args)) {
> > > > +               case MEMORY_RECLAIM_SWAPPINESS:
> > > > +                       if (match_int(&args[0], &swappiness))
> > > > +                               return -EINVAL;
> > > > +                       if (swappiness < 0 || swappiness > 200)
> > >
> > > I am not a fan of extending the hardcoded 0 and 200 values, and now
> > > the new -1 value. Maybe it's time to create constants for the min and
> > > max swappiness values instead of hardcoding them everywhere? This can
> > > be a separate preparatory patch. Then, -1 (or any invalid value) can
> > > also be added as a constant with a useful name, instead of passing -1
> > > to all other callers.
> > >
> > > This should make the code a little bit more readable and easier to extend.
> >
> > I'm not sure I understand the concern. This check just validates that
> > the swappiness value inputted is between 0 and 200 (inclusive)
> > otherwise the interface returns -EINVAL. Are you just concerned that
> > these constants are not named explicitly so they can be reused
> > elsewhere in the code?
> 
> Yes. The 0 and 200 values are already hardcoded in multiple places,
> and we are adding more places now and more hardcoded values (i.e. -1).

Understood, I'll add a preparatory patch which adds DEFINEs for
MIN_SWAPPINESS and MAX_SWAPPINESS and change the usages of 0 and 200
to those. I'll also eliminate the use of -1 as Chris suggested.
Chris Li Dec. 13, 2023, 12:12 a.m. UTC | #8
Hi Dan,

On Tue, Dec 12, 2023 at 1:43 PM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> > I am curious what prompted you to develop this patch. I understand
> > what this patch does, just want to know more of your background story
> > why this is needed.
>
> I wrote about this in some detail in the cover letter (0/1). Take a
> look and let me know if the rationale is still unclear.

Ah, found it. I was not CC on the cover letter but CC on the 1/1
patch.  That is why I did not pick up the cover letter.

Yes, the cover letter explanation was great. Exactly what I am looking for.


>
> > Instead of passing -1, maybe we can use mem_cgroup_swappiness(memcg);
> >
>
> Yeah this makes sense, I'll go ahead and make that change and
> eliminate the -1.

Thanks

>
> > >                                 nr_reclaims--;
> > >                         continue;
> > >                 }
> > > @@ -6895,6 +6896,16 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> > >         return nbytes;
> > >  }
> > >
> > > +enum {
> > > +       MEMORY_RECLAIM_SWAPPINESS = 0,
> > > +       MEMORY_RECLAIM_NULL,
> > > +};
> > > +
> > > +static const match_table_t if_tokens = {
> >
> > What this is called "if_tokens"? I am trying to figure out what "if" refers to.
>
> I used the same logic as in "mm: Add nodes= arg to memory.reclaim". I
> can just call it tokens.

Thanks. I will take a look at that change.


> > > +
> > > +       old_buf = buf;
> > > +       nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> > > +       if (buf == old_buf)
> > > +               return -EINVAL;
> > > +
> > > +       buf = strstrip(buf);
> > > +
> > > +       while ((start = strsep(&buf, " ")) != NULL) {
> > > +               if (!strlen(start))
> > > +                       continue;
> > > +               switch (match_token(start, if_tokens, args)) {
> > > +               case MEMORY_RECLAIM_SWAPPINESS:
> > > +                       if (match_int(&args[0], &swappiness))
> > > +                               return -EINVAL;
> > > +                       if (swappiness < 0 || swappiness > 200)
> >
> > Agree with Yosry on the 200 magic value.
> >
> > I am also wondering if there is an easier way to just parse one
> > keyword. Will using strcmp("swappiness=") be a bad idea? I haven't
> > tried it myself though.
>
> As above, "mm: Add nodes= arg to memory.reclaim" was previously in the
> mm tree doing it this way, so I duplicated it. I think given that
> there have been lots of discussions about extending this interface,
> this match table has some potential future value and I don't see a
> major downside to using it in favor of strcmp.

Yes, that is totally your call. I am fine as it is. Just the micro
optimization of me trying to see if there is a slimmer way to do it.

Chris
diff mbox series

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3f85254f3cef..fc2b379dbd0f 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1282,8 +1282,8 @@  PAGE_SIZE multiple when read back.
 	This is a simple interface to trigger memory reclaim in the
 	target cgroup.
 
-	This file accepts a single key, the number of bytes to reclaim.
-	No nested keys are currently supported.
+	This file accepts a string which containers thhe number of bytes
+	to reclaim.
 
 	Example::
 
@@ -1304,6 +1304,17 @@  PAGE_SIZE multiple when read back.
 	This means that the networking layer will not adapt based on
 	reclaim induced by memory.reclaim.
 
+	This file also allows the user to specify the swappiness value
+	to be used for the reclaim. For example:
+
+	  echo "1G swappiness=60" > memory.reclaim
+
+	The above instructs the kernel to perform the reclaim with
+	a swappiness value of 60. Note that this has the same semantics
+	as the vm.swappiness sysctl - it sets the relative IO cost of
+	reclaiming anon vs file memory but does not allow for reclaiming
+	specific amounts of anon or file memory.
+
   memory.peak
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f6dd6575b905..ebc20d094609 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -410,7 +410,8 @@  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
-						  unsigned int reclaim_options);
+						  unsigned int reclaim_options,
+						  int swappiness);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1c1061df9cd1..74598c17d3cc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@ 
 #include <linux/resume_user_mode.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
+#include <linux/parser.h>
 #include <linux/sched/isolation.h>
 #include "internal.h"
 #include <net/sock.h>
@@ -2449,7 +2450,7 @@  static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							gfp_mask,
-							MEMCG_RECLAIM_MAY_SWAP);
+							MEMCG_RECLAIM_MAY_SWAP, -1);
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2740,7 +2741,7 @@  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, reclaim_options);
+						    gfp_mask, reclaim_options, -1);
 	psi_memstall_leave(&pflags);
 
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3660,7 +3661,7 @@  static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, -1)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3774,7 +3775,7 @@  static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP))
+						  MEMCG_RECLAIM_MAY_SWAP, -1))
 			nr_retries--;
 	}
 
@@ -6720,7 +6721,7 @@  static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, -1);
 
 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6769,7 +6770,7 @@  static ssize_t memory_max_write(struct kernfs_open_file *of,
 
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, -1))
 				nr_reclaims--;
 			continue;
 		}
@@ -6895,6 +6896,16 @@  static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+enum {
+	MEMORY_RECLAIM_SWAPPINESS = 0,
+	MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t if_tokens = {
+	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
+	{ MEMORY_RECLAIM_NULL, NULL },
+};
+
 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			      size_t nbytes, loff_t off)
 {
@@ -6902,12 +6913,33 @@  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
 	unsigned int reclaim_options;
-	int err;
+	char *old_buf, *start;
+	substring_t args[MAX_OPT_ARGS];
+	int swappiness = -1;
 
 	buf = strstrip(buf);
-	err = page_counter_memparse(buf, "", &nr_to_reclaim);
-	if (err)
-		return err;
+
+	old_buf = buf;
+	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+	if (buf == old_buf)
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	while ((start = strsep(&buf, " ")) != NULL) {
+		if (!strlen(start))
+			continue;
+		switch (match_token(start, if_tokens, args)) {
+		case MEMORY_RECLAIM_SWAPPINESS:
+			if (match_int(&args[0], &swappiness))
+				return -EINVAL;
+			if (swappiness < 0 || swappiness > 200)
+				return -EINVAL;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
 
 	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
 	while (nr_reclaimed < nr_to_reclaim) {
@@ -6926,7 +6958,8 @@  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
 					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
-					GFP_KERNEL, reclaim_options);
+					GFP_KERNEL, reclaim_options,
+					swappiness);
 
 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 506f8220c5fe..a20965b20177 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -136,6 +136,9 @@  struct scan_control {
 	/* Always discard instead of demoting to lower tier memory */
 	unsigned int no_demotion:1;
 
+	/* Swappiness value for reclaim, if -1 use memcg/global value */
+	int swappiness;
+
 	/* Allocation order */
 	s8 order;
 
@@ -2327,7 +2330,8 @@  static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	unsigned long anon_cost, file_cost, total_cost;
-	int swappiness = mem_cgroup_swappiness(memcg);
+	int swappiness = sc->swappiness != -1 ?
+		sc->swappiness : mem_cgroup_swappiness(memcg);
 	u64 fraction[ANON_AND_FILE];
 	u64 denominator = 0;	/* gcc */
 	enum scan_balance scan_balance;
@@ -2608,6 +2612,9 @@  static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
 	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
 		return 0;
 
+	if (sc->swappiness != -1)
+		return sc->swappiness;
+
 	return mem_cgroup_swappiness(memcg);
 }
 
@@ -6433,7 +6440,8 @@  unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
-					   unsigned int reclaim_options)
+					   unsigned int reclaim_options,
+					   int swappiness)
 {
 	unsigned long nr_reclaimed;
 	unsigned int noreclaim_flag;
@@ -6448,6 +6456,7 @@  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
 		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+		.swappiness = swappiness,
 	};
 	/*
 	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put