
[V4,2/2] mm: add swappiness= arg to memory.reclaim

Message ID 20231213013807.897742-3-schatzberg.dan@gmail.com (mailing list archive)
State New
Series Add swappiness argument to memory.reclaim

Commit Message

Dan Schatzberg Dec. 13, 2023, 1:38 a.m. UTC
Allow proactive reclaimers to submit an additional swappiness=<val>
argument to memory.reclaim. This overrides the global or per-memcg
swappiness setting for that reclaim attempt.

For example:

echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim

will perform reclaim on the rootcg with a swappiness setting of 0 (no
swap) regardless of the vm.swappiness sysctl setting.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 19 +++++---
 include/linux/swap.h                    |  3 +-
 mm/memcontrol.c                         | 61 ++++++++++++++++++++-----
 mm/vmscan.c                             | 11 +++--
 4 files changed, 72 insertions(+), 22 deletions(-)

Comments

Yu Zhao Dec. 13, 2023, 2:05 a.m. UTC | #1
On Tue, Dec 12, 2023 at 6:39 PM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> For example:
>
> echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
>
> will perform reclaim on the rootcg with a swappiness setting of 0 (no
> swap) regardless of the vm.swappiness sysctl setting.
>
> Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>

NAK.

Please initialize new variables properly and test code changes
thoroughly before submission.
Dan Schatzberg Dec. 13, 2023, 4:38 p.m. UTC | #2
On Tue, Dec 12, 2023 at 07:05:36PM -0700, Yu Zhao wrote:
> On Tue, Dec 12, 2023 at 6:39 PM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> >
> > Allow proactive reclaimers to submit an additional swappiness=<val>
> > argument to memory.reclaim. This overrides the global or per-memcg
> > swappiness setting for that reclaim attempt.
> >
> > For example:
> >
> > echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
> >
> > will perform reclaim on the rootcg with a swappiness setting of 0 (no
> > swap) regardless of the vm.swappiness sysctl setting.
> >
> > Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> 
> NAK.
> 
> Please initialize new variables properly and test code changes
> thoroughly before submission.

Could you be a bit more specific? The patch is compiling and working
locally but perhaps there's some configuration or behavior that I
haven't been testing.
Yosry Ahmed Dec. 14, 2023, 4:28 a.m. UTC | #3
On Wed, Dec 13, 2023 at 8:38 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> On Tue, Dec 12, 2023 at 07:05:36PM -0700, Yu Zhao wrote:
> > On Tue, Dec 12, 2023 at 6:39 PM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> > >
> > > Allow proactive reclaimers to submit an additional swappiness=<val>
> > > argument to memory.reclaim. This overrides the global or per-memcg
> > > swappiness setting for that reclaim attempt.
> > >
> > > For example:
> > >
> > > echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
> > >
> > > will perform reclaim on the rootcg with a swappiness setting of 0 (no
> > > swap) regardless of the vm.swappiness sysctl setting.
> > >
> > > Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> >
> > NAK.
> >
> > Please initialize new variables properly and test code changes
> > thoroughly before submission.
>
> Could you be a bit more specific? The patch is compiling and working
> locally but perhaps there's some configuration or behavior that I
> haven't been testing.

scan_control.swappiness is only initialized from
try_to_free_mem_cgroup_pages(), which means that swappiness is now 0
for all other types of reclaim, which can be a huge problem.

It might be easier to restore a special value (-1, 201, whatever) that
means "use mem_cgroup_swappiness()".
Michal Hocko Dec. 14, 2023, 8:38 a.m. UTC | #4
On Tue 12-12-23 17:38:03, Dan Schatzberg wrote:
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.

You are providing the use case in the cover letter, and Andrew usually
appends that to the first patch in the series. I think it would be
better to have the use case described here.

[...]
> @@ -1304,6 +1297,18 @@ PAGE_SIZE multiple when read back.
>  	This means that the networking layer will not adapt based on
>  	reclaim induced by memory.reclaim.
>  
> +The following nested keys are defined.
> +
> +	  ==========		================================
> +	  swappiness		Swappiness value to reclaim with
> +	  ==========		================================
> +
> +	Specifying a swappiness value instructs the kernel to perform
> +	the reclaim with that swappiness value. Note that this has the
> +	same semantics as the vm.swappiness sysctl - it sets the

same semantics as vm.swappiness applied to memcg reclaim with all the
existing limitations and potential future extensions.

> +	relative IO cost of reclaiming anon vs file memory but does
> +	not allow for reclaiming specific amounts of anon or file memory.
> +
>    memory.peak
>  	A read-only single value file which exists on non-root
>  	cgroups.

The biggest problem with the implementation I can see, and others have
pointed out the same, is how fragile this is. You really have to check
the code and _every_ place which defines scan_control to learn that
mem_cgroup_shrink_node, reclaim_clean_pages_from_list,
reclaim_folio_list, lru_gen_seq_write, try_to_free_pages, balance_pgdat,
shrink_all_memory and __node_reclaim all set up their own scan_control
and would need swappiness initialized as well. I have only checked a
couple of them, like direct reclaim and kswapd, and none of them sets up
swappiness to the default memcg or global value. So you effectively end
up with swappiness == 0.

While the review can point those out, it is quite easy to break, and you
will only learn about that very indirectly. I think it would be easier
to review and maintain if you go with a pointer that would fall back to
mem_cgroup_swappiness() if NULL, which will be the case for every
existing reclaimer except memory.reclaim with a swappiness value.
Dan Schatzberg Dec. 14, 2023, 6:22 p.m. UTC | #5
On Thu, Dec 14, 2023 at 09:38:55AM +0100, Michal Hocko wrote:
> On Tue 12-12-23 17:38:03, Dan Schatzberg wrote:
> > Allow proactive reclaimers to submit an additional swappiness=<val>
> > argument to memory.reclaim. This overrides the global or per-memcg
> > swappiness setting for that reclaim attempt.
> 
> You are providing the usecase in the cover letter and Andrew usually
> appends that to the first patch in the series. I think it would be
> better to have the usecase described here.
> 
> [...]
> > @@ -1304,6 +1297,18 @@ PAGE_SIZE multiple when read back.
> >  	This means that the networking layer will not adapt based on
> >  	reclaim induced by memory.reclaim.
> >  
> > +The following nested keys are defined.
> > +
> > +	  ==========		================================
> > +	  swappiness		Swappiness value to reclaim with
> > +	  ==========		================================
> > +
> > +	Specifying a swappiness value instructs the kernel to perform
> > +	the reclaim with that swappiness value. Note that this has the
> > +	same semantics as the vm.swappiness sysctl - it sets the
> 
> same semantics as vm.swappiness applied to memcg reclaim with all the
> existing limitations and potential future extensions.

Thanks, will make this change.

> 
> > +	relative IO cost of reclaiming anon vs file memory but does
> > +	not allow for reclaiming specific amounts of anon or file memory.
> > +
> >    memory.peak
> >  	A read-only single value file which exists on non-root
> >  	cgroups.
> 
> The biggest problem with the implementation I can see, and others have
> pointed out the same, is how fragile this is. You really have to check
> the code and _every_ place which defines scan_control to learn that
> mem_cgroup_shrink_node, reclaim_clean_pages_from_list,
> reclaim_folio_list, lru_gen_seq_write, try_to_free_pages, balance_pgdat,
> shrink_all_memory and __node_reclaim all set up their own scan_control
> and would need swappiness initialized as well. I have only checked a
> couple of them, like direct reclaim and kswapd, and none of them sets up
> swappiness to the default memcg or global value. So you effectively end
> up with swappiness == 0.
> 
> While the review can point those out, it is quite easy to break, and you
> will only learn about that very indirectly. I think it would be easier
> to review and maintain if you go with a pointer that would fall back to
> mem_cgroup_swappiness() if NULL, which will be the case for every
> existing reclaimer except memory.reclaim with a swappiness value.

I agree. My initial implementation used a pointer for this
reason. I'll switch this back. Just to be clear - I still need to
initialize scan_control.swappiness in all these other places right? It
appears to mostly be stack-initialized which means it has
indeterminate value, not necessarily zero.
Yosry Ahmed Dec. 14, 2023, 6:28 p.m. UTC | #6
On Thu, Dec 14, 2023 at 10:22 AM Dan Schatzberg
<schatzberg.dan@gmail.com> wrote:
>
> On Thu, Dec 14, 2023 at 09:38:55AM +0100, Michal Hocko wrote:
> > On Tue 12-12-23 17:38:03, Dan Schatzberg wrote:
> > > Allow proactive reclaimers to submit an additional swappiness=<val>
> > > argument to memory.reclaim. This overrides the global or per-memcg
> > > swappiness setting for that reclaim attempt.
> >
> > You are providing the usecase in the cover letter and Andrew usually
> > appends that to the first patch in the series. I think it would be
> > better to have the usecase described here.
> >
> > [...]
> > > @@ -1304,6 +1297,18 @@ PAGE_SIZE multiple when read back.
> > >     This means that the networking layer will not adapt based on
> > >     reclaim induced by memory.reclaim.
> > >
> > > +The following nested keys are defined.
> > > +
> > > +     ==========            ================================
> > > +     swappiness            Swappiness value to reclaim with
> > > +     ==========            ================================
> > > +
> > > +   Specifying a swappiness value instructs the kernel to perform
> > > +   the reclaim with that swappiness value. Note that this has the
> > > +   same semantics as the vm.swappiness sysctl - it sets the
> >
> > same semantics as vm.swappiness applied to memcg reclaim with all the
> > existing limitations and potential future extensions.
>
> Thanks, will make this change.
>
> >
> > > +   relative IO cost of reclaiming anon vs file memory but does
> > > +   not allow for reclaiming specific amounts of anon or file memory.
> > > +
> > >    memory.peak
> > >     A read-only single value file which exists on non-root
> > >     cgroups.
> >
> > The biggest problem with the implementation I can see, and others have
> > pointed out the same, is how fragile this is. You really have to check
> > the code and _every_ place which defines scan_control to learn that
> > mem_cgroup_shrink_node, reclaim_clean_pages_from_list,
> > reclaim_folio_list, lru_gen_seq_write, try_to_free_pages, balance_pgdat,
> > shrink_all_memory and __node_reclaim all set up their own scan_control
> > and would need swappiness initialized as well. I have only checked a
> > couple of them, like direct reclaim and kswapd, and none of them sets up
> > swappiness to the default memcg or global value. So you effectively end
> > up with swappiness == 0.
> >
> > While the review can point those out, it is quite easy to break, and you
> > will only learn about that very indirectly. I think it would be easier
> > to review and maintain if you go with a pointer that would fall back to
> > mem_cgroup_swappiness() if NULL, which will be the case for every
> > existing reclaimer except memory.reclaim with a swappiness value.
>
> I agree. My initial implementation used a pointer for this
> reason. I'll switch this back. Just to be clear - I still need to
> initialize scan_control.swappiness in all these other places right? It
> appears to mostly be stack-initialized which means it has
> indeterminate value, not necessarily zero.

My understanding is that in a partially initialized struct,
uninitialized members default to 0, but I am not a C expert :)
Michal Hocko Dec. 15, 2023, 8:50 a.m. UTC | #7
On Thu 14-12-23 13:22:29, Dan Schatzberg wrote:
> On Thu, Dec 14, 2023 at 09:38:55AM +0100, Michal Hocko wrote:
[...]
> > While the review can point those out, it is quite easy to break, and
> > you will only learn about that very indirectly. I think it would be
> > easier to review and maintain if you go with a pointer that would fall
> > back to mem_cgroup_swappiness() if NULL, which will be the case for
> > every existing reclaimer except memory.reclaim with a swappiness value.
> 
> I agree. My initial implementation used a pointer for this
> reason. I'll switch this back. Just to be clear - I still need to
> initialize scan_control.swappiness in all these other places right?

No. They will just get initialized to 0. All you need to make sure is
that the swappiness is used consistently. I would propose something like
scan_control_swappiness() (or sc_swappiness) which would actually do

	if (sc->swappiness)
		return sc->swappiness;
	return mem_cgroup_swappiness(sc->target_mem_cgroup);

and then make sure that mem_cgroup_swappiness is never used directly.

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3f85254f3cef..06319349c072 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1282,17 +1282,10 @@  PAGE_SIZE multiple when read back.
 	This is a simple interface to trigger memory reclaim in the
 	target cgroup.
 
-	This file accepts a single key, the number of bytes to reclaim.
-	No nested keys are currently supported.
-
 	Example::
 
 	  echo "1G" > memory.reclaim
 
-	The interface can be later extended with nested keys to
-	configure the reclaim behavior. For example, specify the
-	type of memory to reclaim from (anon, file, ..).
-
 	Please note that the kernel can over or under reclaim from
 	the target cgroup. If less bytes are reclaimed than the
 	specified amount, -EAGAIN is returned.
@@ -1304,6 +1297,18 @@  PAGE_SIZE multiple when read back.
 	This means that the networking layer will not adapt based on
 	reclaim induced by memory.reclaim.
 
+The following nested keys are defined.
+
+	  ==========		================================
+	  swappiness		Swappiness value to reclaim with
+	  ==========		================================
+
+	Specifying a swappiness value instructs the kernel to perform
+	the reclaim with that swappiness value. Note that this has the
+	same semantics as the vm.swappiness sysctl - it sets the
+	relative IO cost of reclaiming anon vs file memory but does
+	not allow for reclaiming specific amounts of anon or file memory.
+
   memory.peak
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e2ab76c25b4a..690898c347de 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -412,7 +412,8 @@  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
-						  unsigned int reclaim_options);
+						  unsigned int reclaim_options,
+						  int swappiness);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9748a6b88bb8..be3d5797d9df 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@ 
 #include <linux/resume_user_mode.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
+#include <linux/parser.h>
 #include <linux/sched/isolation.h>
 #include "internal.h"
 #include <net/sock.h>
@@ -2449,7 +2450,8 @@  static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							gfp_mask,
-							MEMCG_RECLAIM_MAY_SWAP);
+							MEMCG_RECLAIM_MAY_SWAP,
+							mem_cgroup_swappiness(memcg));
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2740,7 +2742,8 @@  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, reclaim_options);
+						    gfp_mask, reclaim_options,
+						    mem_cgroup_swappiness(memcg));
 	psi_memstall_leave(&pflags);
 
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3660,7 +3663,8 @@  static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
+					mem_cgroup_swappiness(memcg))) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3774,7 +3778,8 @@  static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP))
+						  MEMCG_RECLAIM_MAY_SWAP,
+						  mem_cgroup_swappiness(memcg)))
 			nr_retries--;
 	}
 
@@ -6720,7 +6725,8 @@  static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+					mem_cgroup_swappiness(memcg));
 
 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6769,7 +6775,8 @@  static ssize_t memory_max_write(struct kernfs_open_file *of,
 
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+					mem_cgroup_swappiness(memcg)))
 				nr_reclaims--;
 			continue;
 		}
@@ -6895,6 +6902,16 @@  static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+enum {
+	MEMORY_RECLAIM_SWAPPINESS = 0,
+	MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t tokens = {
+	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
+	{ MEMORY_RECLAIM_NULL, NULL },
+};
+
 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			      size_t nbytes, loff_t off)
 {
@@ -6902,12 +6919,33 @@  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
 	unsigned int reclaim_options;
-	int err;
+	char *old_buf, *start;
+	substring_t args[MAX_OPT_ARGS];
+	int swappiness = mem_cgroup_swappiness(memcg);
 
 	buf = strstrip(buf);
-	err = page_counter_memparse(buf, "", &nr_to_reclaim);
-	if (err)
-		return err;
+
+	old_buf = buf;
+	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+	if (buf == old_buf)
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	while ((start = strsep(&buf, " ")) != NULL) {
+		if (!strlen(start))
+			continue;
+		switch (match_token(start, tokens, args)) {
+		case MEMORY_RECLAIM_SWAPPINESS:
+			if (match_int(&args[0], &swappiness))
+				return -EINVAL;
+			if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
+				return -EINVAL;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
 
 	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
 	while (nr_reclaimed < nr_to_reclaim) {
@@ -6926,7 +6964,8 @@  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
 					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
-					GFP_KERNEL, reclaim_options);
+					GFP_KERNEL, reclaim_options,
+					swappiness);
 
 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2aa671fe938b..cfef06c7c23f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -136,6 +136,9 @@  struct scan_control {
 	/* Always discard instead of demoting to lower tier memory */
 	unsigned int no_demotion:1;
 
+	/* Swappiness value for reclaim */
+	int swappiness;
+
 	/* Allocation order */
 	s8 order;
 
@@ -2327,7 +2330,7 @@  static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	unsigned long anon_cost, file_cost, total_cost;
-	int swappiness = mem_cgroup_swappiness(memcg);
+	int swappiness = sc->swappiness;
 	u64 fraction[ANON_AND_FILE];
 	u64 denominator = 0;	/* gcc */
 	enum scan_balance scan_balance;
@@ -2608,7 +2611,7 @@  static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
 	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
 		return 0;
 
-	return mem_cgroup_swappiness(memcg);
+	return sc->swappiness;
 }
 
 static int get_nr_gens(struct lruvec *lruvec, int type)
@@ -6433,7 +6436,8 @@  unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
-					   unsigned int reclaim_options)
+					   unsigned int reclaim_options,
+					   int swappiness)
 {
 	unsigned long nr_reclaimed;
 	unsigned int noreclaim_flag;
@@ -6448,6 +6452,7 @@  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
 		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+		.swappiness = swappiness,
 	};
 	/*
 	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put