
[V2] mm: vmscan: skip the file folios in proactive reclaim if swappiness is MAX

Message ID 20250314033350.1156370-1-hezhongkun.hzk@bytedance.com (mailing list archive)
State New

Commit Message

Zhongkun He March 14, 2025, 3:33 a.m. UTC
Since commit 68cd9050d871 ("mm: add swappiness= arg to
memory.reclaim"), we can pass an additional swappiness=<val> argument
to memory.reclaim. This is very useful because we can dynamically adjust
the reclaim ratio based on the anonymous folios and file folios of
each cgroup. For example, when swappiness is set to 0, we only reclaim
from file folios.

However, we have also encountered a new issue: when swappiness is set to
MAX_SWAPPINESS, reclaim may still only target file folios. This is due
to cache_trim_mode, which depends solely on the ratio of inactive
folios, regardless of whether there are a large number of cold folios
on the anonymous folio list.

So, we would like to add new control logic so that proactive memory
reclaim only reclaims from anonymous folios when swappiness is set to
MAX_SWAPPINESS. For example, something like this:

echo "2M swappiness=200" > /sys/fs/cgroup/memory.reclaim

will perform reclaim on the root cgroup with a swappiness setting of 200
(max swappiness) regardless of the file folios. Users have a more
comprehensive view of the application's memory distribution because
there are many metrics available. For example, if we find that a certain
cgroup has a large number of inactive anon folios, we can reclaim only
those and skip file folios, because with zram/zswap the IO tradeoff that
cache_trim_mode is making doesn't hold: file refaults will cause IO,
whereas anon decompression will not.

With this patch, the swappiness argument of memory.reclaim has more
precise semantics: 0 means reclaiming only from file pages, while 200
means reclaiming just from anonymous pages.
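
For illustration only (this is not part of the patch), here is a minimal
C sketch of the userspace side of the echo example above. The helper
names format_reclaim_request() and request_reclaim() are hypothetical;
the sketch only assumes the "<amount> swappiness=<val>" request format
and the [0, 200] range described in this changelog.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Build the request string written to memory.reclaim, for example
 * "2M swappiness=200". Returns 0 on success, -1 if buf is too small
 * or swappiness is outside [0, 200].
 */
int format_reclaim_request(char *buf, size_t len, const char *amount,
			   int swappiness)
{
	int n;

	if (swappiness < 0 || swappiness > 200)
		return -1;
	n = snprintf(buf, len, "%s swappiness=%d", amount, swappiness);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write the request to <cgroup_dir>/memory.reclaim.
 * Returns 0 if the whole request was written, -1 otherwise.
 */
int request_reclaim(const char *cgroup_dir, const char *amount,
		    int swappiness)
{
	char path[4096], req[64];
	ssize_t written;
	int fd, n;

	if (format_reclaim_request(req, sizeof(req), amount, swappiness))
		return -1;
	n = snprintf(path, sizeof(path), "%s/memory.reclaim", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	written = write(fd, req, strlen(req));
	close(fd);
	return written == (ssize_t)strlen(req) ? 0 : -1;
}
```

Under these assumptions, request_reclaim("/sys/fs/cgroup", "2M", 200)
corresponds to the echo command shown earlier.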

V1:
  Update Documentation/admin-guide/cgroup-v2.rst --from Andrew Morton
  Add more descriptions in the comment.   --from Johannes Weiner

V2:
  Add Reviewed-by from Yosry Ahmed.

Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/admin-guide/cgroup-v2.rst |  4 ++++
 mm/vmscan.c                             | 10 ++++++++++
 2 files changed, 14 insertions(+)

Comments

Muchun Song March 14, 2025, 6:11 a.m. UTC | #1
> On Mar 14, 2025, at 11:33, Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
> 
> [...]
> 
> V1:
>  Update Documentation/admin-guide/cgroup-v2.rst --from Andrew Morton
>  Add more descriptions in the comment.   --from Johannes Weiner
> 
> V2:
>  Add reviewed from Yosry Ahmed.

Actually, this changelog should go below the "---" line.

> 
> Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---

Add your changelog starting here. The code looks good to me.

Acked-by: Muchun Song <muchun.song@linux.dev>

Thanks.
Michal Hocko March 14, 2025, 8:52 a.m. UTC | #2
On Fri 14-03-25 11:33:50, Zhongkun He wrote:
> [...]

Haven't you said you will try a slightly different approach and always
bypass LRU balancing heuristics for pro-active reclaim and swappiness
provided? What has happened with that?
Zhongkun He March 14, 2025, 9:24 a.m. UTC | #3
On Fri, Mar 14, 2025 at 4:53 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 14-03-25 11:33:50, Zhongkun He wrote:
> > [...]
>
> Haven't you said you will try a slightly different approach and always
> bypass LRU balancing heuristics for pro-active reclaim and swappiness
> provided? What has happened with that?
>

Hi Michal,

I'm not sure we should do that. I found a problem: if we drop all the
heuristics for scanning the LRUs, the swappiness value will exactly
determine the ratio of anon to file memory reclaimed each time. This
means that before each proactive reclaim operation we would need
relatively accurate information about the current memory ratio, and we
would have to adjust swappiness more frequently, because with proactive
reclaim the ratio of anon to file is always changing. Setting swappiness
to 200, by contrast, is needed much less often.

Do you have any comments on this concern?

> --
> Michal Hocko
> SUSE Labs
Michal Hocko March 14, 2025, 9:27 a.m. UTC | #4
On Fri 14-03-25 09:52:45, Michal Hocko wrote:
> On Fri 14-03-25 11:33:50, Zhongkun He wrote:
> > [...]
> 
> Haven't you said you will try a slightly different approach and always
> bypass LRU balancing heuristics for pro-active reclaim and swappiness
> provided? What has happened with that?

I have just noticed that you have followed up [1] with a concern that
using swappiness in the whole min-max range without any heuristics turns
out to be harder than just relying on the min and max as extremes.
What seems to be still missing (or maybe it is just me not seeing that)
is why should we only enforce those extreme ends of the range and still
preserve under-defined semantic for all other swappiness values in the
pro-active reclaim.

[1] https://lore.kernel.org/all/CACSyD1OHD8oXQcQmi1D9t2f5oeMVDvCQnYZUMQTGbqBz4YYKLQ@mail.gmail.com/T/#u
Zhongkun He March 14, 2025, 10:35 a.m. UTC | #5
On Fri, Mar 14, 2025 at 5:28 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 14-03-25 09:52:45, Michal Hocko wrote:
> > On Fri 14-03-25 11:33:50, Zhongkun He wrote:
> > > [...]
> >
> > Haven't you said you will try a slightly different approach and always
> > bypass LRU balancing heuristics for pro-active reclaim and swappiness
> > provided? What has happened with that?
>
> I have just noticed that you have followed up [1] with a concern that
> using swappiness in the whole min-max range without any heuristics turns
> out to be harder than just relying on the min and max as extremes.
> What seems to be still missing (or maybe it is just me not seeing that)
> is why should we only enforce those extreme ends of the range and still
> preserve under-defined semantic for all other swappiness values in the
> pro-active reclaim.
>

Yes, you are right.
Below is a demo that bypasses the LRU balancing heuristics in proactive
reclaim. I have a question, though I'm not sure whether it matters. For
example, if the anon scan count is 5 and swappiness=5, then 5*5/200=0,
so the scan count becomes zero. Do you have any suggestions?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f4312b41e0e0..75935fe42245 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2448,6 +2448,19 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
                goto out;
        }

+       /*
+        * Bypassing LRU balancing heuristics for proactive memory
+        * reclaim to make the semantic of swappiness clearer in
+        * memory.reclaim.
+        */
+       if (sc->proactive && sc->proactive_swappiness) {
+               scan_balance = SCAN_FRACT;
+               fraction[0] = swappiness;
+               fraction[1] = MAX_SWAPPINESS - swappiness;
+               denominator = MAX_SWAPPINESS;
+               goto out;
+       }
+
        /*
         * Do not apply any pressure balancing cleverness when the
         * system is close to OOM, scan both anon and file equally


Additionally, any feedback from others is welcome.

Thanks.

> [1] https://lore.kernel.org/all/CACSyD1OHD8oXQcQmi1D9t2f5oeMVDvCQnYZUMQTGbqBz4YYKLQ@mail.gmail.com/T/#u
> --
> Michal Hocko
> SUSE Labs
Hailong Liu March 14, 2025, 11:32 a.m. UTC | #6
On Fri, 14. Mar 11:33, Zhongkun He wrote:
> [...]
>
> V1:
>   Update Documentation/admin-guide/cgroup-v2.rst --from Andrew Morton
>   Add more descriptions in the comment.   --from Johannes Weiner
>
> V2:
>   Add reviewed from Yosry Ahmed.
>
> Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  4 ++++
>  mm/vmscan.c                             | 10 ++++++++++
>  2 files changed, 14 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index cb1b4e759b7e..6a4487ead7e0 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1343,6 +1343,10 @@ The following nested keys are defined.
>  	same semantics as vm.swappiness applied to memcg reclaim with
>  	all the existing limitations and potential future extensions.
>
> +	The swappiness have the range [0, 200], 0 means reclaiming only
> +	from file folios, 200 (MAX_SWAPPINESS) means reclaiming just from
> +	anonymous folios.
> +
What about MGLRU?
https://elixir.bootlin.com/linux/v6.13-rc1/source/mm/vmscan.c#L4533
>    memory.peak
>  	A read-write single value file which exists on non-root cgroups.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c767d71c43d7..f4312b41e0e0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2438,6 +2438,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  		goto out;
>  	}
>
> +	/*
> +	 * Do not bother scanning file folios if the memory reclaim
> +	 * invoked by userspace through memory.reclaim and the
> +	 * swappiness is MAX_SWAPPINESS.
> +	 */
> +	if (sc->proactive && (swappiness == MAX_SWAPPINESS)) {
> +		scan_balance = SCAN_ANON;
> +		goto out;
> +	}
> +
>  	/*
>  	 * Do not apply any pressure balancing cleverness when the
>  	 * system is close to OOM, scan both anon and file equally
> --
> 2.39.5
>
>

--

Help you, Help me,
Hailong.
Johannes Weiner March 14, 2025, 2:18 p.m. UTC | #7
On Fri, Mar 14, 2025 at 10:27:57AM +0100, Michal Hocko wrote:
> On Fri 14-03-25 09:52:45, Michal Hocko wrote:
> > On Fri 14-03-25 11:33:50, Zhongkun He wrote:
> > > [...]
> > 
> > Haven't you said you will try a slightly different approach and always
> > bypass LRU balancing heuristics for pro-active reclaim and swappiness
> > provided? What has happened with that?
> 
> I have just noticed that you have followed up [1] with a concern that
> using swappiness in the whole min-max range without any heuristics turns
> out to be harder than just relying on the min and max as extremes.
> What seems to be still missing (or maybe it is just me not seeing that)
> is why should we only enforce those extreme ends of the range and still
> preserve under-defined semantic for all other swappiness values in the
> pro-active reclaim.

I guess I'm not seeing the "under-defined" part. cache_trim_mode is
there to make sure a streaming file access pattern doesn't cause
swapping. He has a special usecase to override cache_trim_mode when he
knows a large amount of anon is going cold. There is no way we can
generally remove it from proactive reclaim.
Michal Hocko March 14, 2025, 2:49 p.m. UTC | #8
On Fri 14-03-25 10:18:33, Johannes Weiner wrote:
> On Fri, Mar 14, 2025 at 10:27:57AM +0100, Michal Hocko wrote:
[...]
> > I have just noticed that you have followed up [1] with a concern that
> > using swappiness in the whole min-max range without any heuristics turns
> > out to be harder than just relying on the min and max as extremes.
> > What seems to be still missing (or maybe it is just me not seeing that)
> > is why should we only enforce those extreme ends of the range and still
> > preserve under-defined semantic for all other swappiness values in the
> > pro-active reclaim.
> 
> I'm guess I'm not seeing the "under-defined" part.

What I meant here is that any swappiness value other than the two
extremes doesn't have generally predictable behavior unless you know
specific details of the current memory reclaim heuristics in
get_scan_count.

> cache_trim_mode is
> there to make sure a streaming file access pattern doesn't cause
> swapping.

Yes, I am aware of the purpose.

> He has a special usecase to override cache_trim_mode when he
> knows a large amount of anon is going cold. There is no way we can
> generally remove it from proactive reclaim.

I believe I do understand the requirement here. The patch offers a
counterpart to noswap pro-active reclaim and I do not have objections to
that.

The reason I brought this up is that everything in between 0..200 is
kinda gray area. We've had several queries why swappiness=N doesn't work
as expected and the usual answer was because of heuristics. Most people
just learned to live with that and stopped fine tuning vm_swappiness.
Which is good I guess.

Pro-active reclaim is slightly different in the sense that it gives much
better control over how much to reclaim and, since we have added the
swappiness extension, even over the balancing. So why not make that
balancing work for real and always follow the given proportion? To
prevent any unintended regressions, this would be the case only when
swappiness was explicitly given in the reclaim request. Does that make
any sense?
Johannes Weiner March 14, 2025, 4:57 p.m. UTC | #9
On Fri, Mar 14, 2025 at 03:49:30PM +0100, Michal Hocko wrote:
> On Fri 14-03-25 10:18:33, Johannes Weiner wrote:
> > On Fri, Mar 14, 2025 at 10:27:57AM +0100, Michal Hocko wrote:
> [...]
> > > I have just noticed that you have followed up [1] with a concern that
> > > using swappiness in the whole min-max range without any heuristics turns
> > > out to be harder than just relying on the min and max as extremes.
> > > What seems to be still missing (or maybe it is just me not seeing that)
> > > is why should we only enforce those extreme ends of the range and still
> > > preserve under-defined semantic for all other swappiness values in the
> > > pro-active reclaim.
> > 
> > I'm guess I'm not seeing the "under-defined" part.
> 
> What I meant here is that any other value than both ends of swappiness
> doesn't have generally predictable behavior unless you know specific
> details of the current memory reclaim heuristics in get_scan_count.
> 
> > cache_trim_mode is
> > there to make sure a streaming file access pattern doesn't cause
> > swapping.
> 
> Yes, I am aware of the purpose.
> 
> > He has a special usecase to override cache_trim_mode when he
> > knows a large amount of anon is going cold. There is no way we can
> > generally remove it from proactive reclaim.
> 
> I believe I do understand the requirement here. The patch offers
> counterpart to noswap pro-active reclaim and I do not have objections to
> that.
> 
> The reason I brought this up is that everything in between 0..200 is
> kinda gray area. We've had several queries why swappiness=N doesn't work
> as expected and the usual answer was because of heuristics. Most people
> just learned to live with that and stopped fine tuning vm_swappiness.
> Which is good I guess.

You're still oversimplifying and then dismissing. The heuristics don't
make swappiness meaningless, they make it useful in the first place.

  This control is used to define the rough relative IO cost of swapping
  and filesystem paging, as a value between 0 and 200.

This is clearly defined, and implemented as such. cache_trim_mode is
predicated on the *absence* of paging and caching benefits: A linear,
use-once file access pattern that *does not* benefit from additional
cache space. Kicking out anon for that purpose would be wrong under
pretty much any circumstance. That's why it "overrides" swappiness:
swappiness cannot apply when swapping at all would be nonsense.

Proactive reclaimers like ours rely on this. We use swappiness to
express exactly what it says on the tin: the relative cost between
thrashing file vs anon. We use it quite effectively to manage anon
write rates, e.g. for flash wear management. Obviously that doesn't mean
we want to swap when somebody streams through a large file set.

Zhongkun's case is a significant exception. He just wants to get rid
of a known-cold anon set. This level of insight into userspace access
patterns is rare in practice. You could argue that MADV_PAGEOUT might
be more suitable for that. But I also don't necessarily see a problem
with making swappiness=200 do it; although we might have to teach our
proactive reclaimer to auto-tune between 1 and 199 then.

> Pro-active reclaim is slightly different in a sense that it gives a much
> better control on how much to reclaim and since we have addes swappiness
> extension then even the balancing. So why not make that balancing work
> for real and always follow the given proportion? To prevent any
> unintended regressions this would be the case only with swappiness was
> explicitly given to the reclaim request. Does that make any sense?

That would require the proactive reclaimer always knowing enough about
the access patterns to implement cache_trim_mode manually. This isn't
practical. And removing the heuristics would be a massive regression.
Yosry Ahmed March 14, 2025, 5:52 p.m. UTC | #10
On Fri, Mar 14, 2025 at 12:57:39PM -0400, Johannes Weiner wrote:
> On Fri, Mar 14, 2025 at 03:49:30PM +0100, Michal Hocko wrote:
> > On Fri 14-03-25 10:18:33, Johannes Weiner wrote:
> > > On Fri, Mar 14, 2025 at 10:27:57AM +0100, Michal Hocko wrote:
> > [...]
> > > > I have just noticed that you have followed up [1] with a concern that
> > > > using swappiness in the whole min-max range without any heuristics turns
> > > > out to be harder than just relying on the min and max as extremes.
> > > > What seems to be still missing (or maybe it is just me not seeing that)
> > > > is why should we only enforce those extreme ends of the range and still
> > > > preserve under-defined semantic for all other swappiness values in the
> > > > pro-active reclaim.
> > > 
> > > I'm guess I'm not seeing the "under-defined" part.
> > 
> > What I meant here is that any other value than both ends of swappiness
> > doesn't have generally predictable behavior unless you know specific
> > details of the current memory reclaim heuristics in get_scan_count.
> > 
> > > cache_trim_mode is
> > > there to make sure a streaming file access pattern doesn't cause
> > > swapping.
> > 
> > Yes, I am aware of the purpose.
> > 
> > > He has a special usecase to override cache_trim_mode when he
> > > knows a large amount of anon is going cold. There is no way we can
> > > generally remove it from proactive reclaim.
> > 
> > I believe I do understand the requirement here. The patch offers
> > counterpart to noswap pro-active reclaim and I do not have objections to
> > that.
> > 
> > The reason I brought this up is that everything in between 0..200 is
> > kinda gray area. We've had several queries why swappiness=N doesn't work
> > as expected and the usual answer was because of heuristics. Most people
> > just learned to live with that and stopped fine tuning vm_swappiness.
> > Which is good I guess.
> 
> You're still oversimplifying and then dismissing. The heuristics don't
> make swappiness meaningless, they make it useful in the first place.
> 
>   This control is used to define the rough relative IO cost of swapping
>   and filesystem paging, as a value between 0 and 200.
> 
> This is clearly defined, and implemented as such. cache_trim_mode is
> predicated on the *absence* of paging and caching benefits: A linear,
> use-once file access pattern that *does not* benefit from additional
> cache space. Kicking out anon for that purpose would be wrong under
> pretty much any circumstance. That's why it "overrides" swappiness:
> swappiness cannot apply when swapping at all would be nonsense.
> 
> Proactive reclaimers like ours rely on this. We use swappiness to
> express exactly what it says on the tin: the relative cost between
> thrashing file vs anon. We use it quite effectively to manage anon
> write rates for flash wear management e.g. Obviously that doesn't mean
> we want to swap when somebody streams through a large file set.
> 
> Zhongkun's case is a significant exception. He just wants to get rid
> of known-cold anon set. This level of insight into userspace access
> patterns is rare in practice. You could argue that MADV_PAGEOUT might
> be more suitable for that.

We have a similar use case at Google where we have a known-cold anon set
and we proactively reclaim it. This is why we previously proposed
type=anon/file/.., but swappiness is more flexible for use cases like
the one Johannes describes above.

> But I also don't necessarily see a problem
> with making swappiness=200 do it; although we might have to teach our
> proactive reclaimer to auto-tune between 1 and 199 then.

Would it be better if we don't use the existing swappiness=200 for this?

We can support something like memory.reclaim X swappiness=max instead to
achieve the "anon only" mode without affecting the existing semantics of
swappiness at all. I have a feeling I may have already proposed that at
some point.

In the kernel, we can define a new value (say 201 or 1000) that means
anon only and set it in memory_reclaim() when "max" is specified. We can
then explicitly check for this value in get_scan_count() (we probably
also need to handle MGLRU?).

From a user perspective the swappiness semantics remain unchanged, and
you do not need to teach your proactive reclaimer to auto-tune up to 199
instead of 200. We just support a new swappiness mode specific to
proactive reclaim.

WDYT?
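
A rough standalone sketch of the parsing side of this idea, in plain C
rather than kernel code. Everything named here is hypothetical:
SWAPPINESS_ANON_ONLY stands in for the "say 201" sentinel, and
parse_reclaim_swappiness() is an illustrative helper, not an existing
kernel function.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MIN_SWAPPINESS 0
#define MAX_SWAPPINESS 200
/* Hypothetical sentinel outside the normal range, meaning "anon only". */
#define SWAPPINESS_ANON_ONLY (MAX_SWAPPINESS + 1)

/*
 * Parse the value part of "swappiness=<val>".
 * Accepts "max" or a decimal integer in [0, 200].
 * Returns 0 and stores the result in *out, or -EINVAL on bad input.
 */
int parse_reclaim_swappiness(const char *val, int *out)
{
	char *end;
	long n;

	if (strcmp(val, "max") == 0) {
		*out = SWAPPINESS_ANON_ONLY;
		return 0;
	}
	errno = 0;
	n = strtol(val, &end, 10);
	if (errno || end == val || *end != '\0')
		return -EINVAL;
	if (n < MIN_SWAPPINESS || n > MAX_SWAPPINESS)
		return -EINVAL;
	*out = (int)n;
	return 0;
}
```

In this sketch, numeric values keep their current meaning, including
200, and only the literal "max" maps to the anon-only sentinel that
get_scan_count() would then check for.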

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index cb1b4e759b7e..6a4487ead7e0 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1343,6 +1343,10 @@  The following nested keys are defined.
 	same semantics as vm.swappiness applied to memcg reclaim with
 	all the existing limitations and potential future extensions.
 
+	The valid swappiness range is [0, 200]. 0 means reclaiming only
+	from file folios, 200 (MAX_SWAPPINESS) means reclaiming just from
+	anonymous folios.
+
   memory.peak
 	A read-write single value file which exists on non-root cgroups.
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c767d71c43d7..f4312b41e0e0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2438,6 +2438,16 @@  static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		goto out;
 	}
 
+	/*
+	 * Do not bother scanning file folios if the memory reclaim was
+	 * invoked by userspace through memory.reclaim and the swappiness
+	 * is MAX_SWAPPINESS.
+	 */
+	if (sc->proactive && (swappiness == MAX_SWAPPINESS)) {
+		scan_balance = SCAN_ANON;
+		goto out;
+	}
+
 	/*
 	 * Do not apply any pressure balancing cleverness when the
 	 * system is close to OOM, scan both anon and file equally
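
As a reading aid, the following is a simplified userspace model of where
the new check sits in the decision order. It is a sketch, not the kernel
function: struct reclaim_inputs and pick_scan_balance() are hypothetical,
the checks other than the one added by this patch are paraphrased from
the diff context and the discussion and should be treated as
assumptions, and other conditions in get_scan_count() are left out.

```c
#include <stdbool.h>

#define MAX_SWAPPINESS 200

enum scan_balance { SCAN_EQUAL, SCAN_FRACT, SCAN_ANON, SCAN_FILE };

/* Hypothetical, trimmed-down stand-in for the inputs of get_scan_count(). */
struct reclaim_inputs {
	bool may_swap;		/* anon can be swapped out at all */
	bool cgroup_reclaim;	/* reclaim targets a cgroup */
	bool proactive;		/* triggered through memory.reclaim */
	bool cache_trim_mode;	/* plenty of inactive file cache */
	int priority;		/* 0 means reclaim is close to OOM */
	int swappiness;		/* 0..MAX_SWAPPINESS */
};

enum scan_balance pick_scan_balance(const struct reclaim_inputs *in)
{
	if (!in->may_swap)
		return SCAN_FILE;
	if (in->cgroup_reclaim && in->swappiness == 0)
		return SCAN_FILE;
	/* Check added by the patch: runs before cache_trim_mode applies. */
	if (in->proactive && in->swappiness == MAX_SWAPPINESS)
		return SCAN_ANON;
	if (in->priority == 0 && in->swappiness)
		return SCAN_EQUAL;
	if (in->cache_trim_mode)
		return SCAN_FILE;
	return SCAN_FRACT;
}
```

In this model, a proactive request with swappiness=200 returns SCAN_ANON
even when cache_trim_mode is set, while any value from 1 to 199 still
falls through to the existing heuristics.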