Message ID | 20250314033350.1156370-1-hezhongkun.hzk@bytedance.com (mailing list archive) |
---|---|
State | New |
Series | [V2] mm: vmscan: skip the file folios in proactive reclaim if swappiness is MAX |
> On Mar 14, 2025, at 11:33, Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
>
> With this patch 'commit <68cd9050d871> ("mm: add swappiness= arg to
> memory.reclaim")', we can submit an additional swappiness=<val> argument
> to memory.reclaim. It is very useful because we can dynamically adjust
> the reclamation ratio based on the anonymous folios and file folios of
> each cgroup. For example, when swappiness is set to 0, we only reclaim
> from file folios.
>
> However, we have also encountered a new issue: when swappiness is set to
> MAX_SWAPPINESS, it may still only reclaim file folios. This is due
> to the knob of cache_trim_mode, which depends solely on the ratio of
> inactive folios, regardless of whether there are a large number of cold
> folios in the anonymous folio list.
>
> So, we hope to add a new control logic where proactive memory reclaim only
> reclaims from anonymous folios when swappiness is set to MAX_SWAPPINESS.
> For example, something like this:
>
> echo "2M swappiness=200" > /sys/fs/cgroup/memory.reclaim
>
> will perform reclaim on the rootcg with a swappiness setting of 200 (max
> swappiness) regardless of the file folios. Users have a more comprehensive
> view of the application's memory distribution because there are many
> metrics available. For example, if we find that a certain cgroup has a
> large number of inactive anon folios, we can reclaim only those and skip
> file folios, because with zram/zswap, the IO tradeoff that
> cache_trim_mode is making doesn't hold - file refaults will cause IO,
> whereas anon decompression will not.
>
> With this patch, the swappiness argument of memory.reclaim has more
> precise semantics: 0 means reclaiming only from file pages, while 200
> means reclaiming just from anonymous pages.
>
> V1:
> Update Documentation/admin-guide/cgroup-v2.rst --from Andrew Morton
> Add more descriptions in the comment. --from Johannes Weiner
>
> V2:
> Add Reviewed-by from Yosry Ahmed.

Actually, this changelog should be placed below the "---" line.

>
> Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---

Your changelog should start here.

The code looks good to me.

Acked-by: Muchun Song <muchun.song@linux.dev>

Thanks.
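For readers who drive memory.reclaim from a program rather than from the shell, the echo command in the commit message can be sketched in C roughly as follows. This is a minimal sketch, not part of the patch: the helper names format_reclaim_request() and write_reclaim_request() are hypothetical, and the [0, 200] range check simply mirrors the range stated in the patch's documentation hunk.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build a request such as "2M swappiness=200".
 * Returns the string length, or -1 if swappiness is outside [0, 200]
 * or the buffer is too small.
 */
static int format_reclaim_request(char *buf, size_t len,
				  const char *amount, int swappiness)
{
	int n;

	assert(buf != NULL && amount != NULL);
	if (swappiness < 0 || swappiness > 200)
		return -1;
	n = snprintf(buf, len, "%s swappiness=%d", amount, swappiness);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}

/*
 * Hypothetical helper: write the request to a memory.reclaim file,
 * e.g. /sys/fs/cgroup/memory.reclaim. Returns 0 on success, -1 on error.
 */
static int write_reclaim_request(const char *path, const char *request)
{
	size_t len = strlen(request);
	FILE *f = fopen(path, "w");
	int rc = 0;

	if (f == NULL)
		return -1;
	if (fwrite(request, 1, len, f) != len)
		rc = -1;
	/* The kernel may report a reclaim failure only when the write is flushed. */
	if (fclose(f) != 0)
		rc = -1;
	return rc;
}
```

Calling format_reclaim_request(buf, sizeof(buf), "2M", 200) and passing the result to write_reclaim_request("/sys/fs/cgroup/memory.reclaim", buf) corresponds to the echo example above; writing to that file normally requires appropriate privileges.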
On Fri 14-03-25 11:33:50, Zhongkun He wrote:
[...]
> With this patch, the swappiness argument of memory.reclaim has a more
> precise semantics: 0 means reclaiming only from file pages, while 200
> means reclaiming just from anonymous pages.

Haven't you said you will try a slightly different approach and always
bypass LRU balancing heuristics for pro-active reclaim and swappiness
provided? What has happened with that?
On Fri, Mar 14, 2025 at 4:53 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 14-03-25 11:33:50, Zhongkun He wrote:
> > [...]
>
> Haven't you said you will try a slightly different approach and always
> bypass LRU balancing heuristics for pro-active reclaim and swappiness
> provided? What has happened with that?
>

Hi Michal,

I'm not sure we should do that, because I found a problem: if we drop
all the heuristics for scanning LRUs, the swappiness value will, each
time, accurately represent the ratio of memory to be reclaimed. This
means that before each proactive reclaim operation we would need fairly
clear information about the current memory ratio, and we would have to
change swappiness more often, because with proactive memory reclaim the
ratio of anon to file is always changing. By contrast, swappiness would
need to be set to 200 much less frequently.

Do you have any comments about this concern?

> --
> Michal Hocko
> SUSE Labs
On Fri 14-03-25 09:52:45, Michal Hocko wrote:
> On Fri 14-03-25 11:33:50, Zhongkun He wrote:
> > [...]
>
> Haven't you said you will try a slightly different approach and always
> bypass LRU balancing heuristics for pro-active reclaim and swappiness
> provided? What has happened with that?

I have just noticed that you have followed up [1] with a concern that
using swappiness in the whole min-max range without any heuristics turns
out to be harder than just relying on the min and max as extremes.
What still seems to be missing (or maybe it is just me not seeing it)
is why we should only enforce those extreme ends of the range and still
preserve under-defined semantics for all other swappiness values in
pro-active reclaim.

[1] https://lore.kernel.org/all/CACSyD1OHD8oXQcQmi1D9t2f5oeMVDvCQnYZUMQTGbqBz4YYKLQ@mail.gmail.com/T/#u
On Fri, Mar 14, 2025 at 5:28 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 14-03-25 09:52:45, Michal Hocko wrote:
> > On Fri 14-03-25 11:33:50, Zhongkun He wrote:
> > > [...]
> >
> > Haven't you said you will try a slightly different approach and always
> > bypass LRU balancing heuristics for pro-active reclaim and swappiness
> > provided? What has happened with that?
>
> I have just noticed that you have followed up [1] with a concern that
> using swappiness in the whole min-max range without any heuristics turns
> out to be harder than just relying on the min and max as extremes.
> What seems to be still missing (or maybe it is just me not seeing that)
> is why should we only enforce those extreme ends of the range and still
> preserve under-defined semantic for all other swappiness values in the
> pro-active reclaim.
>

Yes, you are right.

Below is a demo of what bypassing the LRU balancing heuristics in
proactive reclaim could look like. I have a question, though I'm not sure
whether it needs to be considered: for example, if anon scan=5 and
swappiness=5, then 5*5/200=0, so the scan count becomes zero. Do you have
any suggestions?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f4312b41e0e0..75935fe42245 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2448,6 +2448,19 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		goto out;
 	}
 
+	/*
+	 * Bypassing LRU balancing heuristics for proactive memory
+	 * reclaim to make the semantic of swappiness clearer in
+	 * memory.reclaim.
+	 */
+	if (sc->proactive && sc->proactive_swappiness) {
+		scan_balance = SCAN_FRACT;
+		fraction[0] = swappiness;
+		fraction[1] = MAX_SWAPPINESS - swappiness;
+		denominator = MAX_SWAPPINESS;
+		goto out;
+	}
+
 	/*
 	 * Do not apply any pressure balancing cleverness when the
 	 * system is close to OOM, scan both anon and file equally

Additionally, any feedback from others is welcome.

Thanks.

> [1] https://lore.kernel.org/all/CACSyD1OHD8oXQcQmi1D9t2f5oeMVDvCQnYZUMQTGbqBz4YYKLQ@mail.gmail.com/T/#u
> --
> Michal Hocko
> SUSE Labs
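The rounding question raised above (5*5/200=0) can be reproduced outside the kernel. The following is a minimal userspace sketch, assuming the per-LRU scan target is computed as scan * fraction / denominator with integer division, as the arithmetic in the message suggests. The function names are hypothetical, and the round-up variant is only one possible way to avoid a zero result; it is not something proposed in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SWAPPINESS 200

/*
 * Truncating split, matching the arithmetic in the message:
 * scan * fraction / denominator, rounded toward zero.
 */
static uint64_t scan_fract_trunc(uint64_t scan, uint64_t fraction,
				 uint64_t denominator)
{
	assert(denominator != 0);
	return scan * fraction / denominator;
}

/*
 * Hypothetical alternative: round up, so that any nonzero share of a
 * nonzero scan target yields at least one folio to scan.
 */
static uint64_t scan_fract_round_up(uint64_t scan, uint64_t fraction,
				    uint64_t denominator)
{
	assert(denominator != 0);
	return (scan * fraction + denominator - 1) / denominator;
}
```

With scan=5, fraction=5 and denominator=MAX_SWAPPINESS, the truncating version gives 0 while the round-up version gives 1. Rounding up would slightly overshoot the requested ratio on small LRU lists, so whether that is acceptable is a design decision for the reviewers.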
On Fri, 14. Mar 11:33, Zhongkun He wrote:
> [...]
>
> Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  4 ++++
>  mm/vmscan.c                             | 10 ++++++++++
>  2 files changed, 14 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index cb1b4e759b7e..6a4487ead7e0 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1343,6 +1343,10 @@ The following nested keys are defined.
>  	same semantics as vm.swappiness applied to memcg reclaim with
>  	all the existing limitations and potential future extensions.
>
> +	The swappiness have the range [0, 200], 0 means reclaiming only
> +	from file folios, 200 (MAX_SWAPPINESS) means reclaiming just from
> +	anonymous folios.
> +

MGLRU?
https://elixir.bootlin.com/linux/v6.13-rc1/source/mm/vmscan.c#L4533

>    memory.peak
>  	A read-write single value file which exists on non-root cgroups.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c767d71c43d7..f4312b41e0e0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2438,6 +2438,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  		goto out;
>  	}
>
> +	/*
> +	 * Do not bother scanning file folios if the memory reclaim
> +	 * invoked by userspace through memory.reclaim and the
> +	 * swappiness is MAX_SWAPPINESS.
> +	 */
> +	if (sc->proactive && (swappiness == MAX_SWAPPINESS)) {
> +		scan_balance = SCAN_ANON;
> +		goto out;
> +	}
> +
>  	/*
>  	 * Do not apply any pressure balancing cleverness when the
>  	 * system is close to OOM, scan both anon and file equally
> --
> 2.39.5
>

--
Help you, Help me,
Hailong.
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index cb1b4e759b7e..6a4487ead7e0 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1343,6 +1343,10 @@ The following nested keys are defined.
 	same semantics as vm.swappiness applied to memcg reclaim with
 	all the existing limitations and potential future extensions.
 
+	The swappiness have the range [0, 200], 0 means reclaiming only
+	from file folios, 200 (MAX_SWAPPINESS) means reclaiming just from
+	anonymous folios.
+
   memory.peak
 	A read-write single value file which exists on non-root cgroups.
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c767d71c43d7..f4312b41e0e0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2438,6 +2438,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		goto out;
 	}
 
+	/*
+	 * Do not bother scanning file folios if the memory reclaim
+	 * invoked by userspace through memory.reclaim and the
+	 * swappiness is MAX_SWAPPINESS.
+	 */
+	if (sc->proactive && (swappiness == MAX_SWAPPINESS)) {
+		scan_balance = SCAN_ANON;
+		goto out;
+	}
+
 	/*
 	 * Do not apply any pressure balancing cleverness when the
 	 * system is close to OOM, scan both anon and file equally
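The semantics the patch aims for at the two ends of the swappiness range can be modeled in a few lines of standalone C. This is a simplified sketch for illustration, not kernel code: struct reclaim_request and extreme_swappiness_shortcut() are hypothetical names, the enum only mirrors the scan_balance values seen in the diff, and the swappiness=0 branch is modeled from the commit message's description rather than from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SWAPPINESS 200

/* Mirrors the scan_balance values referenced in the diff. */
enum scan_balance { SCAN_EQUAL, SCAN_FRACT, SCAN_ANON, SCAN_FILE };

/* Hypothetical stand-in for the fields of scan_control used here. */
struct reclaim_request {
	bool proactive;		/* reclaim was triggered through memory.reclaim */
	int swappiness;		/* value in [0, MAX_SWAPPINESS] */
};

/*
 * Returns true and sets *balance when the request sits at one end of
 * the swappiness range; returns false when the usual LRU balancing
 * heuristics would still decide.
 */
static bool extreme_swappiness_shortcut(const struct reclaim_request *req,
					enum scan_balance *balance)
{
	assert(req != NULL && balance != NULL);

	/* Modeled from the commit message: 0 means file folios only. */
	if (req->proactive && req->swappiness == 0) {
		*balance = SCAN_FILE;
		return true;
	}
	/* Added by the patch: MAX_SWAPPINESS means anon folios only. */
	if (req->proactive && req->swappiness == MAX_SWAPPINESS) {
		*balance = SCAN_ANON;
		return true;
	}
	return false;
}
```

Any value strictly between 0 and MAX_SWAPPINESS, and any reclaim not triggered through memory.reclaim, falls through to the existing heuristics in this model, which is the behavior questioned in the review above.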