Message ID | 20250318075833.90615-2-jiahao.kernel@gmail.com (mailing list archive)
---|---
State | New
Series | Adding Proactive Memory Reclaim Statistics
Hello.

On Tue, Mar 18, 2025 at 03:58:32PM +0800, Hao Jia <jiahao.kernel@gmail.com> wrote:
> From: Hao Jia <jiahao1@lixiang.com>
>
> In proactive memory reclaim scenarios, it is necessary to
> accurately track proactive reclaim statistics to dynamically
> adjust the frequency and amount of memory being reclaimed
> proactively. Currently, proactive reclaim is included in
> direct reclaim statistics, which can make these
> direct reclaim statistics misleading.

How silly is it to have multiple memory.reclaim writers?
Would it make sense to bind those statistics to each such a write(r)
instead of the aggregated totals?

Michal
On 2025/3/18 18:17, Michal Koutný wrote:
> Hello.
>
> On Tue, Mar 18, 2025 at 03:58:32PM +0800, Hao Jia <jiahao.kernel@gmail.com> wrote:
>> From: Hao Jia <jiahao1@lixiang.com>
>>
>> In proactive memory reclaim scenarios, it is necessary to
>> accurately track proactive reclaim statistics to dynamically
>> adjust the frequency and amount of memory being reclaimed
>> proactively. Currently, proactive reclaim is included in
>> direct reclaim statistics, which can make these
>> direct reclaim statistics misleading.
>
> How silly is it to have multiple memory.reclaim writers?
> Would it make sense to bind those statistics to each such a write(r)
> instead of the aggregated totals?

I'm sorry, I didn't understand what your suggestion was conveying.

Are you suggesting that the statistics for {pgscan, pgsteal}_{kswapd,
direct, khugepaged} be merged into one?

In our current scenario, userspace proactive reclaimers trigger
proactive memory reclaim on different memory cgroups. Tracking
statistics related to proactive reclaim for each memory cgroup is very
helpful for dynamically adjusting the frequency and amount of memory
reclaimed for each cgroup.

Please correct me if I've misunderstood anything.

Thanks,
Hao
On Tue, Mar 18, 2025 at 08:03:44PM +0800, Hao Jia <jiahao.kernel@gmail.com> wrote:
>> How silly is it to have multiple memory.reclaim writers?
>> Would it make sense to bind those statistics to each such a write(r)
>> instead of the aggregated totals?
>
> I'm sorry, I didn't understand what your suggestion was conveying.

For instance one reclaimer for page cache and another for anon (in one
memcg):

  echo "1G swappiness=0" >memory.reclaim &
  echo "1G swappiness=200" >memory.reclaim

> Are you suggesting that the statistics for {pgscan, pgsteal}_{kswapd,
> direct, khugepaged} be merged into one?

Not more merging -- opposite, having separate stats (somewhere) for
each of the above reclaimers.

Michal
On Tue, Mar 18, 2025 at 03:58:32PM +0800, Hao Jia wrote:
> From: Hao Jia <jiahao1@lixiang.com>
>
> In proactive memory reclaim scenarios, it is necessary to
> accurately track proactive reclaim statistics to dynamically
> adjust the frequency and amount of memory being reclaimed
> proactively. Currently, proactive reclaim is included in
> direct reclaim statistics, which can make these
> direct reclaim statistics misleading.
>
> Therefore, separate proactive reclaim memory from the
> direct reclaim counters by introducing new counters:
> pgsteal_proactive, pgdemote_proactive, and pgscan_proactive,
> to avoid confusion with direct reclaim.
>
> Signed-off-by: Hao Jia <jiahao1@lixiang.com>

This is indeed quite useful.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
On 2025/3/18 20:59, Michal Koutný wrote:
> On Tue, Mar 18, 2025 at 08:03:44PM +0800, Hao Jia <jiahao.kernel@gmail.com> wrote:
>>> How silly is it to have multiple memory.reclaim writers?
>>> Would it make sense to bind those statistics to each such a write(r)
>>> instead of the aggregated totals?
>>
>> I'm sorry, I didn't understand what your suggestion was conveying.
>
> For instance one reclaimer for page cache and another for anon (in one
> memcg):
>   echo "1G swappiness=0" >memory.reclaim &
>   echo "1G swappiness=200" >memory.reclaim

Thank you for your suggestion.

However, binding the statistics to the memory.reclaim writers may not
be suitable for our scenario. The userspace proactive memory reclaimer
triggers proactive memory reclaim on different memory cgroups, and all
memory reclaim statistics would be tied to this userspace proactive
memory reclaim process. This does not distinguish the proactive memory
reclaim status of different cgroups.

Thanks,
Hao
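[Editor's note: the read-back step of the userspace reclaimer described above can be sketched in plain C. This is a hypothetical helper, not part of the patch; it only assumes the documented memory.stat text format of one "key value" pair per line.]

```c
#include <stdlib.h>
#include <string.h>

/*
 * Look up one counter in a memory.stat-style buffer ("key value\n" lines).
 * Hypothetical helper: returns 0 when the key is absent, which a real
 * tool would want to distinguish from a genuine zero count.
 */
static unsigned long long stat_lookup(const char *buf, const char *key)
{
	size_t klen = strlen(key);
	const char *p = buf;

	while (p && *p) {
		/* match the whole key, not a prefix like "pgscan" */
		if (strncmp(p, key, klen) == 0 && p[klen] == ' ')
			return strtoull(p + klen + 1, NULL, 10);
		p = strchr(p, '\n');	/* advance to the next line */
		if (p)
			p++;
	}
	return 0;
}
```

A reclaimer would call this on the contents of, e.g., a/memory.stat with keys such as "pgsteal_proactive" between writes to a/memory.reclaim.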
On Wed, Mar 19, 2025 at 10:38:01AM +0800, Hao Jia <jiahao.kernel@gmail.com> wrote:
> However, binding the statistics to the memory.reclaim writers may not be
> suitable for our scenario. The userspace proactive memory reclaimer triggers
> proactive memory reclaim on different memory cgroups, and all memory reclaim
> statistics would be tied to this userspace proactive memory reclaim process.

I thought that was what you wanted -- have stats related precisely to
the process so that you can feedback-control the reclaim.

> This does not distinguish the proactive memory reclaim status of different
> cgroups.

  a
  `- b
  `- c

Or do you mean that you write to a/memory.reclaim and want to observe
respective results in {b,c}/memory.stat?

(I think your addition to memory.stat is also natural. If the case
above is the explanation why to prefer it over per-writer feedback,
please mention that in the next-rev commit message.)

Thanks,
Michal
On 2025/3/19 17:15, Michal Koutný wrote:
> On Wed, Mar 19, 2025 at 10:38:01AM +0800, Hao Jia <jiahao.kernel@gmail.com> wrote:
>> However, binding the statistics to the memory.reclaim writers may not be
>> suitable for our scenario. The userspace proactive memory reclaimer triggers
>> proactive memory reclaim on different memory cgroups, and all memory reclaim
>> statistics would be tied to this userspace proactive memory reclaim process.
>
> I thought that was what you wanted -- have stats related precisely to
> the process so that you can feedback-control the reclaim.

What I want is the proactive memory reclamation statistics for each
memory cgroup.

>> This does not distinguish the proactive memory reclaim status of different
>> cgroups.
>
>   a
>   `- b
>   `- c
>
> Or do you mean that you write to a/memory.reclaim and want to observe
> respective results in {b,c}/memory.stat?

  root
  `- a
  `- b
  `- c

We have a userspace proactive memory reclaim process that writes to
a/memory.reclaim, observes a/memory.stat, then writes to
b/memory.reclaim and observes b/memory.stat. This pattern is the same
for other cgroups as well, so all memory cgroups (a, b, c) have the
**same writer**. So, I need per-cgroup proactive memory reclaim
statistics.

Thanks,
Hao
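[Editor's note: the "dynamically adjust the frequency and amount" feedback loop mentioned throughout the thread could look like the following. This is a hypothetical policy sketch, not from the patch: it derives the next per-cgroup memory.reclaim request from the deltas of that cgroup's pgsteal_proactive/pgscan_proactive counters, with made-up thresholds.]

```c
/*
 * Hypothetical back-off policy: reclaim efficiency is the fraction of
 * scanned pages that were actually reclaimed since the last round.
 * High efficiency -> reclaim is cheap, ask for more next time.
 * Low efficiency -> we are churning the LRUs, halve the request.
 */
static unsigned long next_reclaim_bytes(unsigned long prev_bytes,
					unsigned long dsteal,	/* delta pgsteal_proactive */
					unsigned long dscan)	/* delta pgscan_proactive */
{
	unsigned long eff;

	if (dscan == 0)		/* nothing scanned: keep the request */
		return prev_bytes;

	eff = dsteal * 100 / dscan;	/* efficiency in percent */
	if (eff >= 80)
		return prev_bytes * 2;
	if (eff < 20)
		return prev_bytes / 2;
	return prev_bytes;
}
```

The result would then be written back as, e.g., "echo ${bytes} > a/memory.reclaim" before the next observation round; the thresholds (80/20) and doubling/halving steps are illustrative only.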
On Wed, Mar 19, 2025 at 05:49:15PM +0800, Hao Jia <jiahao1@lixiang.com> wrote:
> root
> `- a
> `- b
> `- c
>
> We have a userspace proactive memory reclaim process that writes to
> a/memory.reclaim, observes a/memory.stat, then writes to
> b/memory.reclaim and observes b/memory.stat. This pattern is the same
> for other cgroups as well, so all memory cgroups (a, b, c) have the
> **same writer**. So, I need per-cgroup proactive memory reclaim statistics.

Sorry for the unclarity, it got lost among the mails. Originally, I
thought about each write(2) but in reality it'd be per each FD. Similar
to how memory.peak allows seeing different values. WDYT?

Michal
Hey Michal,

On Wed, Mar 19, 2025 at 11:33:10AM +0100, Michal Koutný wrote:
> On Wed, Mar 19, 2025 at 05:49:15PM +0800, Hao Jia <jiahao1@lixiang.com> wrote:
>> root
>> `- a
>> `- b
>> `- c
>>
>> We have a userspace proactive memory reclaim process that writes to
>> a/memory.reclaim, observes a/memory.stat, then writes to
>> b/memory.reclaim and observes b/memory.stat. This pattern is the same
>> for other cgroups as well, so all memory cgroups (a, b, c) have the
>> **same writer**. So, I need per-cgroup proactive memory reclaim statistics.
>
> Sorry for unclarity, it got lost among the mails. Originally, I thought
> about each write(2) but in reality it'd be per each FD. Similar to how
> memory.peak allows seeing different values. WDYT?

Can you clarify if you're proposing this as an addition, or instead of
the memory.stat items?

The memory.stat items are quite useful to understand what happened to a
cgroup in the past. In Meta prod, memory.stat is recorded over time,
and it's the go-to information when the kernel team gets looped into an
investigation around unexpected workload behavior at some date/time X.
The proactive reclaimer data points provide a nice bit of nuance to
this. They can easily be aggregated over many machines etc.

A usecase for per-fd stats would be interesting to hear about, but I
don't think they would be a suitable replacement for memory.stat data.
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index cb1b4e759b7e..d6692607f80a 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1570,6 +1570,9 @@ The following nested keys are defined.
 	  pgscan_khugepaged (npn)
 		Amount of scanned pages by khugepaged (in an inactive LRU list)
 
+	  pgscan_proactive (npn)
+		Amount of scanned pages proactively (in an inactive LRU list)
+
 	  pgsteal_kswapd (npn)
 		Amount of reclaimed pages by kswapd
 
@@ -1579,6 +1582,9 @@
 	  pgsteal_khugepaged (npn)
 		Amount of reclaimed pages by khugepaged
 
+	  pgsteal_proactive (npn)
+		Amount of reclaimed pages proactively
+
 	  pgfault (npn)
 		Total number of page faults incurred
 
@@ -1656,6 +1662,9 @@
 	  pgdemote_khugepaged
 		Number of pages demoted by khugepaged.
 
+	  pgdemote_proactive
+		Number of pages demoted proactively.
+
 	  hugetlb
 		Amount of memory used by hugetlb pages. This metric only shows
 		up if hugetlb usage is accounted for in memory.current (i.e.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9540b41894da..69b4996dadc8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -220,6 +220,7 @@ enum node_stat_item {
 	PGDEMOTE_KSWAPD,
 	PGDEMOTE_DIRECT,
 	PGDEMOTE_KHUGEPAGED,
+	PGDEMOTE_PROACTIVE,
 #ifdef CONFIG_HUGETLB_PAGE
 	NR_HUGETLB,
 #endif
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f70d0958095c..f11b6fa9c5b3 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -41,9 +41,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGSTEAL_KSWAPD,
 		PGSTEAL_DIRECT,
 		PGSTEAL_KHUGEPAGED,
+		PGSTEAL_PROACTIVE,
 		PGSCAN_KSWAPD,
 		PGSCAN_DIRECT,
 		PGSCAN_KHUGEPAGED,
+		PGSCAN_PROACTIVE,
 		PGSCAN_DIRECT_THROTTLE,
 		PGSCAN_ANON,
 		PGSCAN_FILE,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4de6acb9b8ec..32e28ab90914 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -315,6 +315,7 @@ static const unsigned int memcg_node_stat_items[] = {
 	PGDEMOTE_KSWAPD,
 	PGDEMOTE_DIRECT,
 	PGDEMOTE_KHUGEPAGED,
+	PGDEMOTE_PROACTIVE,
 #ifdef CONFIG_HUGETLB_PAGE
 	NR_HUGETLB,
 #endif
@@ -431,9 +432,11 @@ static const unsigned int memcg_vm_event_stat[] = {
 	PGSCAN_KSWAPD,
 	PGSCAN_DIRECT,
 	PGSCAN_KHUGEPAGED,
+	PGSCAN_PROACTIVE,
 	PGSTEAL_KSWAPD,
 	PGSTEAL_DIRECT,
 	PGSTEAL_KHUGEPAGED,
+	PGSTEAL_PROACTIVE,
 	PGFAULT,
 	PGMAJFAULT,
 	PGREFILL,
@@ -1390,6 +1393,7 @@ static const struct memory_stat memory_stats[] = {
 	{ "pgdemote_kswapd",		PGDEMOTE_KSWAPD },
 	{ "pgdemote_direct",		PGDEMOTE_DIRECT },
 	{ "pgdemote_khugepaged",	PGDEMOTE_KHUGEPAGED },
+	{ "pgdemote_proactive",		PGDEMOTE_PROACTIVE },
 #ifdef CONFIG_NUMA_BALANCING
 	{ "pgpromote_success",		PGPROMOTE_SUCCESS },
 #endif
@@ -1432,6 +1436,7 @@ static int memcg_page_state_output_unit(int item)
 	case PGDEMOTE_KSWAPD:
 	case PGDEMOTE_DIRECT:
 	case PGDEMOTE_KHUGEPAGED:
+	case PGDEMOTE_PROACTIVE:
 #ifdef CONFIG_NUMA_BALANCING
 	case PGPROMOTE_SUCCESS:
 #endif
@@ -1503,10 +1508,12 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	seq_buf_printf(s, "pgscan %lu\n",
 		       memcg_events(memcg, PGSCAN_KSWAPD) +
 		       memcg_events(memcg, PGSCAN_DIRECT) +
+		       memcg_events(memcg, PGSCAN_PROACTIVE) +
 		       memcg_events(memcg, PGSCAN_KHUGEPAGED));
 	seq_buf_printf(s, "pgsteal %lu\n",
 		       memcg_events(memcg, PGSTEAL_KSWAPD) +
 		       memcg_events(memcg, PGSTEAL_DIRECT) +
+		       memcg_events(memcg, PGSTEAL_PROACTIVE) +
 		       memcg_events(memcg, PGSTEAL_KHUGEPAGED));
 
 	for (i = 0; i < ARRAY_SIZE(memcg_vm_event_stat); i++) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c767d71c43d7..fa816cd08ac3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -441,21 +441,26 @@ void drop_slab(void)
 	} while ((freed >> shift++) > 1);
 }
 
-static int reclaimer_offset(void)
+#define CHECK_RECLAIMER_OFFSET(type)					\
+	do {								\
+		BUILD_BUG_ON(PGSTEAL_##type - PGSTEAL_KSWAPD !=		\
+			     PGDEMOTE_##type - PGDEMOTE_KSWAPD);	\
+		BUILD_BUG_ON(PGSTEAL_##type - PGSTEAL_KSWAPD !=		\
+			     PGSCAN_##type - PGSCAN_KSWAPD);		\
+	} while (0)
+
+static int reclaimer_offset(struct scan_control *sc)
 {
-	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
-			PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD);
-	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
-			PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD);
-	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
-			PGSCAN_DIRECT - PGSCAN_KSWAPD);
-	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
-			PGSCAN_KHUGEPAGED - PGSCAN_KSWAPD);
+	CHECK_RECLAIMER_OFFSET(DIRECT);
+	CHECK_RECLAIMER_OFFSET(KHUGEPAGED);
+	CHECK_RECLAIMER_OFFSET(PROACTIVE);
 
 	if (current_is_kswapd())
 		return 0;
 	if (current_is_khugepaged())
 		return PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD;
+	if (sc->proactive)
+		return PGSTEAL_PROACTIVE - PGSTEAL_KSWAPD;
 	return PGSTEAL_DIRECT - PGSTEAL_KSWAPD;
 }
 
@@ -1986,7 +1991,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 				     &nr_scanned, sc, lru);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
-	item = PGSCAN_KSWAPD + reclaimer_offset();
+	item = PGSCAN_KSWAPD + reclaimer_offset(sc);
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
@@ -2002,10 +2007,10 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	spin_lock_irq(&lruvec->lru_lock);
 	move_folios_to_lru(lruvec, &folio_list);
 
-	__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
+	__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
 			   stat.nr_demoted);
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	item = PGSTEAL_KSWAPD + reclaimer_offset();
+	item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
@@ -4545,7 +4550,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 		break;
 	}
 
-	item = PGSCAN_KSWAPD + reclaimer_offset();
+	item = PGSCAN_KSWAPD + reclaimer_offset(sc);
 	if (!cgroup_reclaim(sc)) {
 		__count_vm_events(item, isolated);
 		__count_vm_events(PGREFILL, sorted);
@@ -4695,10 +4700,10 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 		reset_batch_size(walk);
 	}
 
-	__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
+	__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
 			   stat.nr_demoted);
 
-	item = PGSTEAL_KSWAPD + reclaimer_offset();
+	item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, reclaimed);
 	__count_memcg_events(memcg, item, reclaimed);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 16bfe1c694dd..eff4d833ff8a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1273,6 +1273,7 @@ const char * const vmstat_text[] = {
 	"pgdemote_kswapd",
 	"pgdemote_direct",
 	"pgdemote_khugepaged",
+	"pgdemote_proactive",
#ifdef CONFIG_HUGETLB_PAGE
 	"nr_hugetlb",
#endif
@@ -1307,9 +1308,11 @@ const char * const vmstat_text[] = {
 	"pgsteal_kswapd",
 	"pgsteal_direct",
 	"pgsteal_khugepaged",
+	"pgsteal_proactive",
 	"pgscan_kswapd",
 	"pgscan_direct",
 	"pgscan_khugepaged",
+	"pgscan_proactive",
 	"pgscan_direct_throttle",
 	"pgscan_anon",
 	"pgscan_file",
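[Editor's note: the patched reclaimer_offset() depends on the three counter families (PGSTEAL_*, PGSCAN_*, PGDEMOTE_*) listing the reclaimers in the same order, so that "PGSTEAL_KSWAPD + offset" indexes the matching member of any family. A standalone userspace sketch of that parallel-enum technique follows; the enum names mirror the kernel's, but the scan_control struct and the is_* flags are minimal stand-ins, and C11 _Static_assert replaces BUILD_BUG_ON.]

```c
/* Three counter families; reclaimer order must stay in sync across them. */
enum { PGSTEAL_KSWAPD, PGSTEAL_DIRECT, PGSTEAL_KHUGEPAGED, PGSTEAL_PROACTIVE };
enum { PGSCAN_KSWAPD, PGSCAN_DIRECT, PGSCAN_KHUGEPAGED, PGSCAN_PROACTIVE };
enum { PGDEMOTE_KSWAPD, PGDEMOTE_DIRECT, PGDEMOTE_KHUGEPAGED, PGDEMOTE_PROACTIVE };

/* Userspace stand-in for the kernel's BUILD_BUG_ON consistency checks. */
#define CHECK_RECLAIMER_OFFSET(type)					\
	_Static_assert(PGSTEAL_##type - PGSTEAL_KSWAPD ==		\
		       PGSCAN_##type - PGSCAN_KSWAPD &&			\
		       PGSTEAL_##type - PGSTEAL_KSWAPD ==		\
		       PGDEMOTE_##type - PGDEMOTE_KSWAPD,		\
		       "counter families out of sync")

CHECK_RECLAIMER_OFFSET(DIRECT);
CHECK_RECLAIMER_OFFSET(KHUGEPAGED);
CHECK_RECLAIMER_OFFSET(PROACTIVE);

struct scan_control { int proactive; };	/* minimal stand-in */
static int is_kswapd, is_khugepaged;	/* stand-ins for current_is_*() */

/* Same shape as the patched reclaimer_offset(sc). */
static int reclaimer_offset(const struct scan_control *sc)
{
	if (is_kswapd)
		return 0;
	if (is_khugepaged)
		return PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD;
	if (sc->proactive)
		return PGSTEAL_PROACTIVE - PGSTEAL_KSWAPD;
	return PGSTEAL_DIRECT - PGSTEAL_KSWAPD;
}
```

The design point the macro captures: adding a new reclaimer (here PROACTIVE) in only two of the three enums would break the "base + offset" indexing silently, so the check fails the build instead.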