Message ID | 20230125005457.4139289-1-minchan@kernel.org (mailing list archive)
---|---
State | New
Series | [v2] mm/madvise: add vmstat statistics for madvise_[cold|pageout]
On Tue 24-01-23 16:54:57, Minchan Kim wrote: > madvise LRU manipulation APIs need to scan address ranges to find > present pages at page table and provides advice hints for them. > > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted] > shows the proactive reclaim efficiency so this patch adds those > two statistics in vmstat. > > madvise_pgscanned, madvise_pghinted > > Since proactive reclaim using process_madvise(2) as userland > memory policy is popular(e.g,. Android ActivityManagerService), > those stats are helpful to know how efficiently the policy works > well. The usecase description is still too vague. What are those values useful for? Is there anything actionable based on those numbers? How do you deal with multiple parties using madvise resp. process_madvise so that their stats are combined? In the previous version I have also pointed out that this might be easily achieved by tracepoints. Your counterargument was a convenience in a large scale monitoring without going much into details. Presumably this is because your fleet based monitoring already collects /proc/vmstat while tracepoints based monitoring would require additional changes. This alone is rather weak argument to be honest because deploying tracepoints monitoring is quite trivial and can be done outside of the said memory reclaim agent. > Signed-off-by: Minchan Kim <minchan@kernel.org> > --- > > * From v1 - https://lore.kernel.org/linux-mm/20230117231632.2734737-1-minchan@kernel.org/ > * not relying on the pageout for accounting - mhocko > * drop unnecessary changes - mhocko > > include/linux/vm_event_item.h | 2 ++ > mm/madvise.c | 8 ++++++++ > mm/vmstat.c | 2 ++ > 3 files changed, 12 insertions(+) > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index 7f5d1caf5890..3c117858946d 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -52,6 +52,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > PGSCAN_FILE, > PGSTEAL_ANON, > PGSTEAL_FILE, > + MADVISE_PGSCANNED, > + MADVISE_PGHINTED, > #ifdef CONFIG_NUMA > PGSCAN_ZONE_RECLAIM_FAILED, > #endif > diff --git a/mm/madvise.c b/mm/madvise.c > index 7db6622f8293..d2624e77f729 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -344,6 +344,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, > spinlock_t *ptl; > struct folio *folio = NULL; > LIST_HEAD(folio_list); > + unsigned int nr_scanned = 0; > + unsigned int nr_hinted = 0; > bool pageout_anon_only_filter; > > if (fatal_signal_pending(current)) > @@ -357,6 +359,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, > pmd_t orig_pmd; > unsigned long next = pmd_addr_end(addr, end); > > + nr_scanned += HPAGE_PMD_NR; > tlb_change_page_size(tlb, HPAGE_PMD_SIZE); > ptl = pmd_trans_huge_lock(pmd, vma); > if (!ptl) > @@ -414,6 +417,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, > } > } else > folio_deactivate(folio); > + nr_hinted += HPAGE_PMD_NR; > huge_unlock: > spin_unlock(ptl); > if (pageout) > @@ -431,6 +435,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, > arch_enter_lazy_mmu_mode(); > for (; addr < end; pte++, addr += PAGE_SIZE) { > ptent = *pte; > + nr_scanned++; > > if (pte_none(ptent)) > continue; > @@ -508,6 +513,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, > } > } else > folio_deactivate(folio); > + nr_hinted++; > } > > arch_leave_lazy_mmu_mode(); > @@ -515,6 +521,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, > if (pageout) > 
reclaim_pages(&folio_list); > cond_resched(); > + count_vm_events(MADVISE_PGSCANNED, nr_scanned); > + count_vm_events(MADVISE_PGHINTED, nr_hinted); > > return 0; > } > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 1ea6a5ce1c41..84acc90820e1 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1283,6 +1283,8 @@ const char * const vmstat_text[] = { > "pgscan_file", > "pgsteal_anon", > "pgsteal_file", > + "madvise_pgscanned", > + "madvise_pghinted", > > #ifdef CONFIG_NUMA > "zone_reclaim_failed", > -- > 2.39.1.405.gd4c25cc71f-goog
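For context, the userland flow being debated — a policy daemon issuing process_madvise(2) against another process's address ranges, as in the Android use case the commit message mentions — looks roughly like the following minimal sketch. This is illustrative only, not part of the patch: raw syscall numbers are used in case libc lacks wrappers, the address/length arguments are taken from the command line rather than /proc/<pid>/maps, and most error handling is omitted.

/* Sketch: proactively page out one range of another process's address
 * space, the kind of call a userland memory policy daemon would batch. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21	/* reclaim these pages (uapi value, since 5.4) */
#endif

int main(int argc, char **argv)
{
	if (argc < 4) {
		fprintf(stderr, "usage: %s <pid> <addr> <len>\n", argv[0]);
		return 1;
	}

	pid_t pid = atoi(argv[1]);
	int pidfd = syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}

	/* a real daemon would build many iovecs from /proc/<pid>/maps */
	struct iovec vec = {
		.iov_base = (void *)strtoull(argv[2], NULL, 0),
		.iov_len  = strtoull(argv[3], NULL, 0),
	};

	ssize_t ret = syscall(SYS_process_madvise, pidfd, &vec, 1,
			      MADV_PAGEOUT, 0);
	if (ret < 0)
		perror("process_madvise");
	else
		printf("advised %zd bytes\n", ret);
	return 0;
}

With the patch applied, every such call would bump the global madvise_pgscanned/madvise_pghinted counters, regardless of which daemon or process issued it — which is the attribution question raised above.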
On Wed, Jan 25, 2023 at 09:04:16AM +0100, Michal Hocko wrote:
> On Tue 24-01-23 16:54:57, Minchan Kim wrote:
> > madvise LRU manipulation APIs need to scan address ranges to find
> > present pages at page table and provides advice hints for them.
> >
> > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted]
> > shows the proactive reclaim efficiency so this patch adds those
> > two statistics in vmstat.
> >
> > madvise_pgscanned, madvise_pghinted
> >
> > Since proactive reclaim using process_madvise(2) as userland
> > memory policy is popular(e.g,. Android ActivityManagerService),
> > those stats are helpful to know how efficiently the policy works
> > well.
>
> The usecase description is still too vague. What are those values useful
> for? Is there anything actionable based on those numbers? How do you
> deal with multiple parties using madvise resp. process_madvise so that
> their stats are combined?

The metric helps with monitoring system MM health across the fleet and with
experimental tuning of different policies from the centralized userland
memory daemon. That's a really good fit for vmstat, along with the other MM
metrics.

> In the previous version I have also pointed out that this might be
> easily achieved by tracepoints. Your counterargument was a convenience
> in a large scale monitoring without going much into details. Presumably
> this is because your fleet based monitoring already collects
> /proc/vmstat while tracepoints based monitoring would require additional
> changes. This alone is rather weak argument to be honest because
> deploying tracepoints monitoring is quite trivial and can be done
> outside of the said memory reclaim agent.

The convenience matters, but that's not my argument.

I think using a tracepoint for a system metric makes no sense, even though
the tracepoint could be extended with BPF or a histogram trigger to get
accumulated counters for a system metric.

The tracepoint is the next step if we want to know a further breakdown once
something strange happens. That's why we have separate levels of metric
systems to narrow a problem down, rather than implementing every metric
with tracepoints. Please look at the vmstat fields. Almost every field
would raise the same question you asked: "how do you break it down if
multiple processes were involved in contributing to the metric?"

I am fine if you suggest adding a tracepoint as well as the vmstat fields
for further breakdown, but relying only on tracepoints and friends for a
system-global metric doesn't make sense.
On Wed 25-01-23 08:36:02, Minchan Kim wrote: > On Wed, Jan 25, 2023 at 09:04:16AM +0100, Michal Hocko wrote: > > On Tue 24-01-23 16:54:57, Minchan Kim wrote: > > > madvise LRU manipulation APIs need to scan address ranges to find > > > present pages at page table and provides advice hints for them. > > > > > > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted] > > > shows the proactive reclaim efficiency so this patch adds those > > > two statistics in vmstat. > > > > > > madvise_pgscanned, madvise_pghinted > > > > > > Since proactive reclaim using process_madvise(2) as userland > > > memory policy is popular(e.g,. Android ActivityManagerService), > > > those stats are helpful to know how efficiently the policy works > > > well. > > > > The usecase description is still too vague. What are those values useful > > for? Is there anything actionable based on those numbers? How do you > > deal with multiple parties using madvise resp. process_madvise so that > > their stats are combined? > > The metric helps monitoing system MM health under fleet and experimental > tuning with diffrent policies from the centralized userland memory daemon. That is just too vague for me to imagine anything more specific then, we have numbers and we can show them in a report. What does it actually mean that madvise_pgscanned is high. Or that pghinted / pgscanned is low (that you tend to manually reclaim sparse mappings)? > That's really good fit under vmstat along with other MM metrics. > > > > > In the previous version I have also pointed out that this might be > > easily achieved by tracepoints. Your counterargument was a convenience > > in a large scale monitoring without going much into details. Presumably > > this is because your fleet based monitoring already collects > > /proc/vmstat while tracepoints based monitoring would require additional > > changes. This alone is rather weak argument to be honest because > > deploying tracepoints monitoring is quite trivial and can be done > > outside of the said memory reclaim agent. > > The convenience matters but that's not my argument. > > Ithink using tracepoint for system metric makes no sense even though > the tracepoint could be extended by using bpf or histogram trigger to > get the accumulated counters for system metric. System wide metrics data collection by ftrace is a common use case. I really do not follow your argument here. There are certainly cases where ftrace is suboptimal solution - e.g. when the cumulative data couldn't have been collected early on for one reason or another (e.g. system uptime is already high when you decide to start collecting). But you have stated there is data collection happening so what does prevent collecting this just along with anything else. > The tracepoint is the next step if we want to know further breakdown > once something strange happens. That's why we have separate level metric > system to narrow problem down rather than implementing all the metric > with tracepoint. Please, look at vmstat fields. Almost every fields > would have same question you asked "how do you break down if multiple > processes were invovled to contribute the metric?" Yes, we tended to be much more willing to add counters. Partly because runtime debugging capabilities were not that great in the past as we have these days. > I am fine if you suggest adding tracepoint as well as the vmstat fields > for further breakdown but only relying on tracepoint and frineds for > system global metric doesn't make sense. 
We have to agree to disagree here. I am not going to nack this but I disagree with this patch because the justification is just too vague and also those numbers cannot really be attributed to anybody performing madvise to actually evaluate that activity.
On Wed, Jan 25, 2023 at 06:07:00PM +0100, Michal Hocko wrote: > On Wed 25-01-23 08:36:02, Minchan Kim wrote: > > On Wed, Jan 25, 2023 at 09:04:16AM +0100, Michal Hocko wrote: > > > On Tue 24-01-23 16:54:57, Minchan Kim wrote: > > > > madvise LRU manipulation APIs need to scan address ranges to find > > > > present pages at page table and provides advice hints for them. > > > > > > > > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted] > > > > shows the proactive reclaim efficiency so this patch adds those > > > > two statistics in vmstat. > > > > > > > > madvise_pgscanned, madvise_pghinted > > > > > > > > Since proactive reclaim using process_madvise(2) as userland > > > > memory policy is popular(e.g,. Android ActivityManagerService), > > > > those stats are helpful to know how efficiently the policy works > > > > well. > > > > > > The usecase description is still too vague. What are those values useful > > > for? Is there anything actionable based on those numbers? How do you > > > deal with multiple parties using madvise resp. process_madvise so that > > > their stats are combined? > > > > The metric helps monitoing system MM health under fleet and experimental > > tuning with diffrent policies from the centralized userland memory daemon. > > That is just too vague for me to imagine anything more specific then, we > have numbers and we can show them in a report. What does it actually > mean that madvise_pgscanned is high. Or that pghinted / pgscanned is > low (that you tend to manually reclaim sparse mappings)? If that's low, it means the userspace daemon's current tune/policy are inefficient or too aggressive since it is working on address spacess of processes which don't have enough memory the hint can work(e.g., shared addresses, cold address ranges or some special address ranges like VM_PFNMAP) so sometime, we can detect regression to find culprit or have a chance to look into better ideas to improve. > > > That's really good fit under vmstat along with other MM metrics. > > > > > > > > In the previous version I have also pointed out that this might be > > > easily achieved by tracepoints. Your counterargument was a convenience > > > in a large scale monitoring without going much into details. Presumably > > > this is because your fleet based monitoring already collects > > > /proc/vmstat while tracepoints based monitoring would require additional > > > changes. This alone is rather weak argument to be honest because > > > deploying tracepoints monitoring is quite trivial and can be done > > > outside of the said memory reclaim agent. > > > > The convenience matters but that's not my argument. > > > > Ithink using tracepoint for system metric makes no sense even though > > the tracepoint could be extended by using bpf or histogram trigger to > > get the accumulated counters for system metric. > > System wide metrics data collection by ftrace is a common use case. I > really do not follow your argument here. There are certainly cases where > ftrace is suboptimal solution - e.g. when the cumulative data couldn't > have been collected early on for one reason or another (e.g. system > uptime is already high when you decide to start collecting). But you > have stated there is data collection happening so what does prevent > collecting this just along with anything else. > > > The tracepoint is the next step if we want to know further breakdown > > once something strange happens. 
That's why we have separate level metric > > system to narrow problem down rather than implementing all the metric > > with tracepoint. Please, look at vmstat fields. Almost every fields > > would have same question you asked "how do you break down if multiple > > processes were invovled to contribute the metric?" > > Yes, we tended to be much more willing to add counters. Partly because > runtime debugging capabilities were not that great in the past as we > have these days. > > > I am fine if you suggest adding tracepoint as well as the vmstat fields > > for further breakdown but only relying on tracepoint and frineds for > > system global metric doesn't make sense. > > We have to agree to disagree here. I am not going to nack this but I > disagree with this patch because the justification is just too vague and > also those numbers cannot really be attributed to anybody performing > madvise to actually evaluate that activity. > -- > Michal Hocko > SUSE Labs
On Wed 25-01-23 10:07:49, Minchan Kim wrote: > On Wed, Jan 25, 2023 at 06:07:00PM +0100, Michal Hocko wrote: > > On Wed 25-01-23 08:36:02, Minchan Kim wrote: > > > On Wed, Jan 25, 2023 at 09:04:16AM +0100, Michal Hocko wrote: > > > > On Tue 24-01-23 16:54:57, Minchan Kim wrote: > > > > > madvise LRU manipulation APIs need to scan address ranges to find > > > > > present pages at page table and provides advice hints for them. > > > > > > > > > > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted] > > > > > shows the proactive reclaim efficiency so this patch adds those > > > > > two statistics in vmstat. > > > > > > > > > > madvise_pgscanned, madvise_pghinted > > > > > > > > > > Since proactive reclaim using process_madvise(2) as userland > > > > > memory policy is popular(e.g,. Android ActivityManagerService), > > > > > those stats are helpful to know how efficiently the policy works > > > > > well. > > > > > > > > The usecase description is still too vague. What are those values useful > > > > for? Is there anything actionable based on those numbers? How do you > > > > deal with multiple parties using madvise resp. process_madvise so that > > > > their stats are combined? > > > > > > The metric helps monitoing system MM health under fleet and experimental > > > tuning with diffrent policies from the centralized userland memory daemon. > > > > That is just too vague for me to imagine anything more specific then, we > > have numbers and we can show them in a report. What does it actually > > mean that madvise_pgscanned is high. Or that pghinted / pgscanned is > > low (that you tend to manually reclaim sparse mappings)? > > If that's low, it means the userspace daemon's current tune/policy are > inefficient or too aggressive since it is working on address spacess > of processes which don't have enough memory the hint can work(e.g., > shared addresses, cold address ranges or some special address ranges like > VM_PFNMAP) so sometime, we can detect regression to find culprit or > have a chance to look into better ideas to improve. Are you sure this is really meaningful metric? Just consider a large and sparsely populated mapping. This can be a perfect candidate for user space reclaim target (e.g. consider a mapping covering a large matrix or other similar data structure). pghinted/pgscanned would be really small while the reclaim efficiency could be quite high in that case, wouldn't it?
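To make the sparse-mapping concern above concrete, a small back-of-the-envelope calculation (illustrative numbers only, assuming 4 KiB pages and the per-pte counting scheme of the patch):

/* Illustrative only: a 1 GiB mapping with 16 MiB resident, all of it
 * reclaimable, versus a densely populated 16 MiB mapping. */
#include <stdio.h>

int main(void)
{
	unsigned long page = 4096;

	/* sparse case: every pte in the range is visited */
	unsigned long sparse_scanned = (1UL << 30) / page;	/* 262144 */
	unsigned long sparse_hinted  = (16UL << 20) / page;	/*   4096 */

	/* dense case: the same resident set in a tight mapping */
	unsigned long dense_scanned  = (16UL << 20) / page;	/*   4096 */
	unsigned long dense_hinted   = (16UL << 20) / page;	/*   4096 */

	/* hinted/scanned looks "bad" (~1.6%) for the sparse mapping and
	 * "perfect" (100%) for the dense one, even though both madvise
	 * calls may free exactly the same 16 MiB. */
	printf("sparse: %.1f%%  dense: %.1f%%\n",
	       100.0 * sparse_hinted / sparse_scanned,
	       100.0 * dense_hinted / dense_scanned);
	return 0;
}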
On Wed, Jan 25, 2023 at 10:37:59PM +0100, Michal Hocko wrote: > On Wed 25-01-23 10:07:49, Minchan Kim wrote: > > On Wed, Jan 25, 2023 at 06:07:00PM +0100, Michal Hocko wrote: > > > On Wed 25-01-23 08:36:02, Minchan Kim wrote: > > > > On Wed, Jan 25, 2023 at 09:04:16AM +0100, Michal Hocko wrote: > > > > > On Tue 24-01-23 16:54:57, Minchan Kim wrote: > > > > > > madvise LRU manipulation APIs need to scan address ranges to find > > > > > > present pages at page table and provides advice hints for them. > > > > > > > > > > > > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted] > > > > > > shows the proactive reclaim efficiency so this patch adds those > > > > > > two statistics in vmstat. > > > > > > > > > > > > madvise_pgscanned, madvise_pghinted > > > > > > > > > > > > Since proactive reclaim using process_madvise(2) as userland > > > > > > memory policy is popular(e.g,. Android ActivityManagerService), > > > > > > those stats are helpful to know how efficiently the policy works > > > > > > well. > > > > > > > > > > The usecase description is still too vague. What are those values useful > > > > > for? Is there anything actionable based on those numbers? How do you > > > > > deal with multiple parties using madvise resp. process_madvise so that > > > > > their stats are combined? > > > > > > > > The metric helps monitoing system MM health under fleet and experimental > > > > tuning with diffrent policies from the centralized userland memory daemon. > > > > > > That is just too vague for me to imagine anything more specific then, we > > > have numbers and we can show them in a report. What does it actually > > > mean that madvise_pgscanned is high. Or that pghinted / pgscanned is > > > low (that you tend to manually reclaim sparse mappings)? > > > > If that's low, it means the userspace daemon's current tune/policy are > > inefficient or too aggressive since it is working on address spacess > > of processes which don't have enough memory the hint can work(e.g., > > shared addresses, cold address ranges or some special address ranges like > > VM_PFNMAP) so sometime, we can detect regression to find culprit or > > have a chance to look into better ideas to improve. > > Are you sure this is really meaningful metric? Just consider a large and > sparsely populated mapping. This can be a perfect candidate for user > space reclaim target (e.g. consider a mapping covering a large matrix > or other similar data structure). pghinted/pgscanned would be really > small while the reclaim efficiency could be quite high in that case, > wouldn't it? Why do you think it's efficient? It need to spend quite CPU cycle to scan a few of pages to evict. I don't see it's efficient if it happens quite a lot.
On Wed 25-01-23 14:21:35, Minchan Kim wrote: > On Wed, Jan 25, 2023 at 10:37:59PM +0100, Michal Hocko wrote: > > On Wed 25-01-23 10:07:49, Minchan Kim wrote: > > > On Wed, Jan 25, 2023 at 06:07:00PM +0100, Michal Hocko wrote: > > > > On Wed 25-01-23 08:36:02, Minchan Kim wrote: > > > > > On Wed, Jan 25, 2023 at 09:04:16AM +0100, Michal Hocko wrote: > > > > > > On Tue 24-01-23 16:54:57, Minchan Kim wrote: > > > > > > > madvise LRU manipulation APIs need to scan address ranges to find > > > > > > > present pages at page table and provides advice hints for them. > > > > > > > > > > > > > > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted] > > > > > > > shows the proactive reclaim efficiency so this patch adds those > > > > > > > two statistics in vmstat. > > > > > > > > > > > > > > madvise_pgscanned, madvise_pghinted > > > > > > > > > > > > > > Since proactive reclaim using process_madvise(2) as userland > > > > > > > memory policy is popular(e.g,. Android ActivityManagerService), > > > > > > > those stats are helpful to know how efficiently the policy works > > > > > > > well. > > > > > > > > > > > > The usecase description is still too vague. What are those values useful > > > > > > for? Is there anything actionable based on those numbers? How do you > > > > > > deal with multiple parties using madvise resp. process_madvise so that > > > > > > their stats are combined? > > > > > > > > > > The metric helps monitoing system MM health under fleet and experimental > > > > > tuning with diffrent policies from the centralized userland memory daemon. > > > > > > > > That is just too vague for me to imagine anything more specific then, we > > > > have numbers and we can show them in a report. What does it actually > > > > mean that madvise_pgscanned is high. Or that pghinted / pgscanned is > > > > low (that you tend to manually reclaim sparse mappings)? > > > > > > If that's low, it means the userspace daemon's current tune/policy are > > > inefficient or too aggressive since it is working on address spacess > > > of processes which don't have enough memory the hint can work(e.g., > > > shared addresses, cold address ranges or some special address ranges like > > > VM_PFNMAP) so sometime, we can detect regression to find culprit or > > > have a chance to look into better ideas to improve. > > > > Are you sure this is really meaningful metric? Just consider a large and > > sparsely populated mapping. This can be a perfect candidate for user > > space reclaim target (e.g. consider a mapping covering a large matrix > > or other similar data structure). pghinted/pgscanned would be really > > small while the reclaim efficiency could be quite high in that case, > > wouldn't it? > > Why do you think it's efficient? It need to spend quite CPU cycle to > scan a few of pages to evict. I don't see it's efficient if it happens > quite a lot. Because it doesn't really matter how many page tables you have to scan but how easily you can reclaim the memory behind that. Because it is the memory that matters. Just consider THP vs. 4k backed address ranges. You are going to scan much more for latter by design. That doesn't really mean that this is a worse candidate for reclaim and you should be only focusing on THP backed mappings. See? I suspect you try to mimic pgscan/pgsteal effectivness metric on the address space but that is a fundamentally different thing.
On Thu 26-01-23 09:50:38, Michal Hocko wrote: > On Wed 25-01-23 14:21:35, Minchan Kim wrote: > > On Wed, Jan 25, 2023 at 10:37:59PM +0100, Michal Hocko wrote: > > > On Wed 25-01-23 10:07:49, Minchan Kim wrote: > > > > On Wed, Jan 25, 2023 at 06:07:00PM +0100, Michal Hocko wrote: > > > > > On Wed 25-01-23 08:36:02, Minchan Kim wrote: > > > > > > On Wed, Jan 25, 2023 at 09:04:16AM +0100, Michal Hocko wrote: > > > > > > > On Tue 24-01-23 16:54:57, Minchan Kim wrote: > > > > > > > > madvise LRU manipulation APIs need to scan address ranges to find > > > > > > > > present pages at page table and provides advice hints for them. > > > > > > > > > > > > > > > > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted] > > > > > > > > shows the proactive reclaim efficiency so this patch adds those > > > > > > > > two statistics in vmstat. > > > > > > > > > > > > > > > > madvise_pgscanned, madvise_pghinted > > > > > > > > > > > > > > > > Since proactive reclaim using process_madvise(2) as userland > > > > > > > > memory policy is popular(e.g,. Android ActivityManagerService), > > > > > > > > those stats are helpful to know how efficiently the policy works > > > > > > > > well. > > > > > > > > > > > > > > The usecase description is still too vague. What are those values useful > > > > > > > for? Is there anything actionable based on those numbers? How do you > > > > > > > deal with multiple parties using madvise resp. process_madvise so that > > > > > > > their stats are combined? > > > > > > > > > > > > The metric helps monitoing system MM health under fleet and experimental > > > > > > tuning with diffrent policies from the centralized userland memory daemon. > > > > > > > > > > That is just too vague for me to imagine anything more specific then, we > > > > > have numbers and we can show them in a report. What does it actually > > > > > mean that madvise_pgscanned is high. Or that pghinted / pgscanned is > > > > > low (that you tend to manually reclaim sparse mappings)? > > > > > > > > If that's low, it means the userspace daemon's current tune/policy are > > > > inefficient or too aggressive since it is working on address spacess > > > > of processes which don't have enough memory the hint can work(e.g., > > > > shared addresses, cold address ranges or some special address ranges like > > > > VM_PFNMAP) so sometime, we can detect regression to find culprit or > > > > have a chance to look into better ideas to improve. > > > > > > Are you sure this is really meaningful metric? Just consider a large and > > > sparsely populated mapping. This can be a perfect candidate for user > > > space reclaim target (e.g. consider a mapping covering a large matrix > > > or other similar data structure). pghinted/pgscanned would be really > > > small while the reclaim efficiency could be quite high in that case, > > > wouldn't it? > > > > Why do you think it's efficient? It need to spend quite CPU cycle to > > scan a few of pages to evict. I don't see it's efficient if it happens > > quite a lot. > > Because it doesn't really matter how many page tables you have to scan > but how easily you can reclaim the memory behind that. Because it is the > memory that matters. Just consider THP vs. 4k backed address ranges. You > are going to scan much more for latter by design. That doesn't really > mean that this is a worse candidate for reclaim and you should be only > focusing on THP backed mappings. See? > > I suspect you try to mimic pgscan/pgsteal effectivness metric on the dang. 
I meant pgsteal/pgscan > address space but that is a fundamentally different thing.
On Thu, Jan 26, 2023 at 09:50:37AM +0100, Michal Hocko wrote: > On Wed 25-01-23 14:21:35, Minchan Kim wrote: > > On Wed, Jan 25, 2023 at 10:37:59PM +0100, Michal Hocko wrote: > > > On Wed 25-01-23 10:07:49, Minchan Kim wrote: > > > > On Wed, Jan 25, 2023 at 06:07:00PM +0100, Michal Hocko wrote: > > > > > On Wed 25-01-23 08:36:02, Minchan Kim wrote: > > > > > > On Wed, Jan 25, 2023 at 09:04:16AM +0100, Michal Hocko wrote: > > > > > > > On Tue 24-01-23 16:54:57, Minchan Kim wrote: > > > > > > > > madvise LRU manipulation APIs need to scan address ranges to find > > > > > > > > present pages at page table and provides advice hints for them. > > > > > > > > > > > > > > > > Likewise pg[scan/steal] count on vmstat, madvise_pg[scanned/hinted] > > > > > > > > shows the proactive reclaim efficiency so this patch adds those > > > > > > > > two statistics in vmstat. > > > > > > > > > > > > > > > > madvise_pgscanned, madvise_pghinted > > > > > > > > > > > > > > > > Since proactive reclaim using process_madvise(2) as userland > > > > > > > > memory policy is popular(e.g,. Android ActivityManagerService), > > > > > > > > those stats are helpful to know how efficiently the policy works > > > > > > > > well. > > > > > > > > > > > > > > The usecase description is still too vague. What are those values useful > > > > > > > for? Is there anything actionable based on those numbers? How do you > > > > > > > deal with multiple parties using madvise resp. process_madvise so that > > > > > > > their stats are combined? > > > > > > > > > > > > The metric helps monitoing system MM health under fleet and experimental > > > > > > tuning with diffrent policies from the centralized userland memory daemon. > > > > > > > > > > That is just too vague for me to imagine anything more specific then, we > > > > > have numbers and we can show them in a report. What does it actually > > > > > mean that madvise_pgscanned is high. Or that pghinted / pgscanned is > > > > > low (that you tend to manually reclaim sparse mappings)? > > > > > > > > If that's low, it means the userspace daemon's current tune/policy are > > > > inefficient or too aggressive since it is working on address spacess > > > > of processes which don't have enough memory the hint can work(e.g., > > > > shared addresses, cold address ranges or some special address ranges like > > > > VM_PFNMAP) so sometime, we can detect regression to find culprit or > > > > have a chance to look into better ideas to improve. > > > > > > Are you sure this is really meaningful metric? Just consider a large and > > > sparsely populated mapping. This can be a perfect candidate for user > > > space reclaim target (e.g. consider a mapping covering a large matrix > > > or other similar data structure). pghinted/pgscanned would be really > > > small while the reclaim efficiency could be quite high in that case, > > > wouldn't it? > > > > Why do you think it's efficient? It need to spend quite CPU cycle to > > scan a few of pages to evict. I don't see it's efficient if it happens > > quite a lot. > > Because it doesn't really matter how many page tables you have to scan > but how easily you can reclaim the memory behind that. Because it is the I really don't follow your claim here. Efficiency is input vs output. For the those advices, input is number of scanned ptes vs. number of hinted pages. If you keep need to scan the sparsed huge address range to reclaim just a few of pages, that's really inefficient. 
If you keep hinting at non-populated, already swapped-out, or
hint-cannot-work address ranges, that's really inefficient. Where do you
see the problem here? What exactly do you mean by "how easily" in your
context?

> memory that matters. Just consider THP vs. 4k backed address ranges. You
> are going to scan much more for latter by design. That doesn't really
> mean that this is a worse candidate for reclaim and you should be only
> focusing on THP backed mappings. See?

I don't understand your point here. The stat doesn't aim to make such a
decision. If a THP page shows a good yield by the efficiency above, that's
good. If a 4K page shows a bad yield, should we focus on THP pages? How
could you conclude such a decision from the stat? The stat just shows the
current health of the system and finds something regressed/improved
compared to before, to initiate further investigation.

>
> I suspect you try to mimic pgscan/pgsteal effectivness metric on the
> address space but that is a fundamentally different thing.

I don't see anything different, fundamentally.
On Thu 26-01-23 09:10:46, Minchan Kim wrote: > On Thu, Jan 26, 2023 at 09:50:37AM +0100, Michal Hocko wrote: [...] > > I suspect you try to mimic pgscan/pgsteal effectivness metric on the > > address space but that is a fundamentally different thing. > > I don't see anything different, fundamentally. OK, this really explains our disconnect here. Your metric reports nr_page_tables (nr_scanned) and number of aged and potentially reclaimed pages. You do not know whether that reclaim was successful. So you effectively learn how many pages have already been unmapped before your call. Can this be sometimes useful? Probably yes. Does it say anything about the reclaim efficiency? I do not think so. You could have hit pinned pages or countless other conditions why those pages couldn't have been reclaimed and they have stayed mapped after madvise call. pgsteal tells you how many pages from those scanned have been reclaimed. See the difference? Also I do not find information about how many non-present ptes have been scann super interesting. Sure that is a burnt time as well but to me it would be much more valuable information to see how many of those resident could have been actually reclaimed. Because that tells whether your reclaim target was a good choice and IMHO that is a valuable information for user space memory reclaim agent. Again consider a large sparsely mapped memory but mostly inactive memory and a condensed active one with the same rss. The reclaim could have been successful for the former while not on the latter. Your matric would give a rather misleading numbers, don't you think?
On Thu, Jan 26, 2023 at 08:58:57PM +0100, Michal Hocko wrote:
> On Thu 26-01-23 09:10:46, Minchan Kim wrote:
> > On Thu, Jan 26, 2023 at 09:50:37AM +0100, Michal Hocko wrote:
> [...]
> > > I suspect you try to mimic pgscan/pgsteal effectivness metric on the
> > > address space but that is a fundamentally different thing.
> >
> > I don't see anything different, fundamentally.
>
> OK, this really explains our disconnect here. Your metric reports
> nr_page_tables (nr_scanned) and number of aged and potentially reclaimed
> pages. You do not know whether that reclaim was successful. So you
> effectively learn how many pages have already been unmapped before your
> call. Can this be sometimes useful? Probably yes. Does it say anything
> about the reclaim efficiency? I do not think so. You could have hit
> pinned pages or countless other conditions why those pages couldn't have
> been reclaimed and they have stayed mapped after madvise call.
>
> pgsteal tells you how many pages from those scanned have been reclaimed.
> See the difference?

That's why my previous version kept counting the exact number of
reclaimed/deactivated pages, but I changed my mind since I observed that
the majority of failures happened on already-paged-out ranges and shared
pages rather than the countless other minor conditions, in real practice.
Without finding present pages, the madvise hints couldn't do anything from
the beginning, and that's the major cost we are facing.

Saying it again, I don't think the global stat could cover all the minor
cases you are insisting on, and I agree a tracepoint could do a better job
of pinpointing root causes, but the global stat still has a role in
providing the basic ground to sense something abnormal and in guiding us to
the next steps with an easier interface/efficient way.

> Also I do not find information about how many non-present ptes have
> been scann super interesting. Sure that is a burnt time as well but to
> me it would be much more valuable information to see how many of those
> resident could have been actually reclaimed. Because that tells whether
> your reclaim target was a good choice and IMHO that is a valuable
> information for user space memory reclaim agent.

That's exactly what I had in the previous version. If you believe that's
the right direction, I am okay with it.

> Again consider a large sparsely mapped memory but mostly inactive memory
> and a condensed active one with the same rss. The reclaim could have
> been successful for the former while not on the latter. Your matric
> would give a rather misleading numbers, don't you think?
> --
> Michal Hocko
> SUSE Labs
On Thu 26-01-23 16:08:43, Minchan Kim wrote: > On Thu, Jan 26, 2023 at 08:58:57PM +0100, Michal Hocko wrote: > > On Thu 26-01-23 09:10:46, Minchan Kim wrote: > > > On Thu, Jan 26, 2023 at 09:50:37AM +0100, Michal Hocko wrote: > > [...] > > > > I suspect you try to mimic pgscan/pgsteal effectivness metric on the > > > > address space but that is a fundamentally different thing. > > > > > > I don't see anything different, fundamentally. > > > > OK, this really explains our disconnect here. Your metric reports > > nr_page_tables (nr_scanned) and number of aged and potentially reclaimed > > pages. You do not know whether that reclaim was successful. So you > > effectively learn how many pages have already been unmapped before your > > call. Can this be sometimes useful? Probably yes. Does it say anything > > about the reclaim efficiency? I do not think so. You could have hit > > pinned pages or countless other conditions why those pages couldn't have > > been reclaimed and they have stayed mapped after madvise call. > > > > pgsteal tells you how many pages from those scanned have been reclaimed. > > See the difference? > > That's why my previous version kept counting exact number of reclaimed/ > deactivated pages but I changed mind since I observed majority of failure > happened from already-paged-out ranges and shared pages rather than minor > countless other conditions in real practice. Without finding present pages, > the mavise hints couldn't do anything from the beginning and that's the > major cost we are facing. I cannot really comment on your user space reclaim policy but I would have expected that you at least check for rss before trying to use madvise on the range. Learning that from the operation sounds like a suboptimal policy to me. > Saing again, I don't think the global stat could cover all the minor > you are insisting and I agree tracepoint could do better jobs to pinpoint > root causes but the global stat still have a role to provides basic ground > to sense abnormal and guides us moving next steps with easier interface/ > efficient way. I hate to repeat myself but the more we discuss this the more I am convinced that vmstat is a bad fit. Sooner or later you end up realizing that nr_reclaimed/nr_scanned is insufficient metric because you would need to learn more anout those reclaim failures. Really what you want is to have a tracepoint with a full reclaim metric and grow monitoring tooling around that. This will deal with the major design flaw of global stat mentioned ealier (that you cannot attribute specific stats to the corresponding madvise caller).
On Fri, Jan 27, 2023 at 10:48:25AM +0100, Michal Hocko wrote: > On Thu 26-01-23 16:08:43, Minchan Kim wrote: > > On Thu, Jan 26, 2023 at 08:58:57PM +0100, Michal Hocko wrote: > > > On Thu 26-01-23 09:10:46, Minchan Kim wrote: > > > > On Thu, Jan 26, 2023 at 09:50:37AM +0100, Michal Hocko wrote: > > > [...] > > > > > I suspect you try to mimic pgscan/pgsteal effectivness metric on the > > > > > address space but that is a fundamentally different thing. > > > > > > > > I don't see anything different, fundamentally. > > > > > > OK, this really explains our disconnect here. Your metric reports > > > nr_page_tables (nr_scanned) and number of aged and potentially reclaimed > > > pages. You do not know whether that reclaim was successful. So you > > > effectively learn how many pages have already been unmapped before your > > > call. Can this be sometimes useful? Probably yes. Does it say anything > > > about the reclaim efficiency? I do not think so. You could have hit > > > pinned pages or countless other conditions why those pages couldn't have > > > been reclaimed and they have stayed mapped after madvise call. > > > > > > pgsteal tells you how many pages from those scanned have been reclaimed. > > > See the difference? > > > > That's why my previous version kept counting exact number of reclaimed/ > > deactivated pages but I changed mind since I observed majority of failure > > happened from already-paged-out ranges and shared pages rather than minor > > countless other conditions in real practice. Without finding present pages, > > the mavise hints couldn't do anything from the beginning and that's the > > major cost we are facing. > > I cannot really comment on your user space reclaim policy but I would > have expected that you at least check for rss before trying to use > madvise on the range. Learning that from the operation sounds like a > suboptimal policy to me. Current rss couldn't say where is the present pages among huge address spaces. And that's not what I want to from the operation but keep monitoring trending under fleet. > > > Saing again, I don't think the global stat could cover all the minor > > you are insisting and I agree tracepoint could do better jobs to pinpoint > > root causes but the global stat still have a role to provides basic ground > > to sense abnormal and guides us moving next steps with easier interface/ > > efficient way. > > I hate to repeat myself but the more we discuss this the more I am > convinced that vmstat is a bad fit. Sooner or later you end up realizing > that nr_reclaimed/nr_scanned is insufficient metric because you would > need to learn more anout those reclaim failures. Really what you want is > to have a tracepoint with a full reclaim metric and grow monitoring tooling > around that. This will deal with the major design flaw of global stat > mentioned ealier (that you cannot attribute specific stats to the > corresponding madvise caller). Then, let me ask back to you. What statistcis in the current vmstat fields or pending fields (to be merged) among accumulated counter stats sound reasonable to be part of vmstat fields not tracepoint from your perspective? Almost every stat would have corner cases by various reasons and people would want to know the reason from process, context, function or block scope depending on how they want to use the stat. Even, tracepoint you're loving couldn't tell all the detail what they want without adding more and more as on growing code chages. 
However, unlike your worry, people have used such high-level, vague vmstat
fields very well to understand/monitor system health, even though they have
various miscounting cases, since they know the corner cases are really
minor.

I am really curious what metric we could add to vmstat instead of a
tracepoint in the future if we follow your logic.
On Fri 27-01-23 19:00:15, Minchan Kim wrote: [...] > Then, let me ask back to you. > > What statistcis in the current vmstat fields or pending fields > (to be merged) among accumulated counter stats sound reasonable > to be part of vmstat fields not tracepoint from your perspective? Most of those could be replaced but for historical reasons a counter was an only solution back then. Some metrics make a lot of sense these days as well. Regular snapshots of vmstat can give a nice overview of the _system_ reclaim activity. > Almost every stat would have corner cases by various reasons and > people would want to know the reason from process, context, function > or block scope depending on how they want to use the stat. > Even, tracepoint you're loving couldn't tell all the detail what they want > without adding more and more as on growing code chages. Quite possible but tracepoints are much easier to modify and shape to a particular need. > However, unlike your worry, people has used such an high level vague > vmstat fields very well to understand/monitor system health even though > it has various miscounting cases since they know the corner cases > are really minor. > > I am really curious what metric we could add in the vmstat instead of > tracepoint in future if we follow your logic. I would say that we should be more and more conservative when extending vmstat counters and use tracing instead as much as possible. I can imagine there could be cases where tracing is not a preferable option. Then we can judge case by case. So far you have presented no real argument, except you already collect vmstat on a larger scale and that would be easier (essentially free from the tool modification POV). That is a weak argument. Especially with a major design flaw already mentioned. I do not have much more to add here.
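For illustration, the tracepoint-based alternative argued for above would look roughly like the following hypothetical trace event. This is a sketch only — no such event exists in the tree, and the field set (including a reclaimed count) is an assumption about what a per-call metric could carry; the calling pid/comm come for free as common trace fields, which is what allows per-agent attribution via ftrace hist triggers or BPF.

/* Hypothetical sketch, would live in include/trace/events/:
 * a per-call event for madvise_cold_or_pageout with scanned/hinted/
 * reclaimed counts so userspace tooling can aggregate per caller. */
TRACE_EVENT(mm_madvise_cold_or_pageout,

	TP_PROTO(unsigned long nr_scanned, unsigned long nr_hinted,
		 unsigned long nr_reclaimed, int behavior),

	TP_ARGS(nr_scanned, nr_hinted, nr_reclaimed, behavior),

	TP_STRUCT__entry(
		__field(unsigned long, nr_scanned)
		__field(unsigned long, nr_hinted)
		__field(unsigned long, nr_reclaimed)
		__field(int, behavior)
	),

	TP_fast_assign(
		__entry->nr_scanned	= nr_scanned;
		__entry->nr_hinted	= nr_hinted;
		__entry->nr_reclaimed	= nr_reclaimed;
		__entry->behavior	= behavior;
	),

	TP_printk("nr_scanned=%lu nr_hinted=%lu nr_reclaimed=%lu behavior=%d",
		  __entry->nr_scanned, __entry->nr_hinted,
		  __entry->nr_reclaimed, __entry->behavior)
);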
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 7f5d1caf5890..3c117858946d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -52,6 +52,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGSCAN_FILE,
 		PGSTEAL_ANON,
 		PGSTEAL_FILE,
+		MADVISE_PGSCANNED,
+		MADVISE_PGHINTED,
 #ifdef CONFIG_NUMA
 		PGSCAN_ZONE_RECLAIM_FAILED,
 #endif
diff --git a/mm/madvise.c b/mm/madvise.c
index 7db6622f8293..d2624e77f729 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -344,6 +344,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	spinlock_t *ptl;
 	struct folio *folio = NULL;
 	LIST_HEAD(folio_list);
+	unsigned int nr_scanned = 0;
+	unsigned int nr_hinted = 0;
 	bool pageout_anon_only_filter;
 
 	if (fatal_signal_pending(current))
@@ -357,6 +359,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		pmd_t orig_pmd;
 		unsigned long next = pmd_addr_end(addr, end);
 
+		nr_scanned += HPAGE_PMD_NR;
 		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 		ptl = pmd_trans_huge_lock(pmd, vma);
 		if (!ptl)
@@ -414,6 +417,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			}
 		} else
 			folio_deactivate(folio);
+		nr_hinted += HPAGE_PMD_NR;
 huge_unlock:
 		spin_unlock(ptl);
 		if (pageout)
@@ -431,6 +435,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	arch_enter_lazy_mmu_mode();
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
+		nr_scanned++;
 
 		if (pte_none(ptent))
 			continue;
@@ -508,6 +513,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			}
 		} else
 			folio_deactivate(folio);
+		nr_hinted++;
 	}
 
 	arch_leave_lazy_mmu_mode();
@@ -515,6 +521,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	if (pageout)
 		reclaim_pages(&folio_list);
 	cond_resched();
+	count_vm_events(MADVISE_PGSCANNED, nr_scanned);
+	count_vm_events(MADVISE_PGHINTED, nr_hinted);
 
 	return 0;
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1ea6a5ce1c41..84acc90820e1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1283,6 +1283,8 @@ const char * const vmstat_text[] = {
 	"pgscan_file",
 	"pgsteal_anon",
 	"pgsteal_file",
+	"madvise_pgscanned",
+	"madvise_pghinted",
 
 #ifdef CONFIG_NUMA
 	"zone_reclaim_failed",
The madvise LRU manipulation APIs need to scan address ranges to find
present pages in the page table and provide advice hints for them.

Like the pg[scan/steal] counts in vmstat, madvise_pg[scanned/hinted] shows
the proactive reclaim efficiency, so this patch adds those two statistics
to vmstat:

madvise_pgscanned, madvise_pghinted

Since proactive reclaim using process_madvise(2) as a userland memory
policy is popular (e.g., Android ActivityManagerService), those stats are
helpful for knowing how efficiently the policy works.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---

* From v1 - https://lore.kernel.org/linux-mm/20230117231632.2734737-1-minchan@kernel.org/
  * not relying on the pageout for accounting - mhocko
  * drop unnecessary changes - mhocko

 include/linux/vm_event_item.h | 2 ++
 mm/madvise.c                  | 8 ++++++++
 mm/vmstat.c                   | 2 ++
 3 files changed, 12 insertions(+)
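If the patch were merged, a monitoring agent would consume the new counters like any other /proc/vmstat field. A minimal sketch of sampling the delta around one reclaim pass follows; the two field names are the ones the patch adds, everything else is ordinary /proc/vmstat parsing.

/* Sketch: read madvise_pgscanned/madvise_pghinted before and after a
 * proactive reclaim pass and report hinted/scanned for that window. */
#include <stdio.h>
#include <string.h>

static unsigned long long vmstat_read(const char *name)
{
	char key[64];
	unsigned long long val = 0, found = 0;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 0;
	while (fscanf(f, "%63s %llu", key, &val) == 2) {
		if (!strcmp(key, name)) {
			found = val;
			break;
		}
	}
	fclose(f);
	return found;
}

int main(void)
{
	unsigned long long scan0 = vmstat_read("madvise_pgscanned");
	unsigned long long hint0 = vmstat_read("madvise_pghinted");

	/* ... issue one process_madvise() reclaim pass here ... */

	unsigned long long scanned = vmstat_read("madvise_pgscanned") - scan0;
	unsigned long long hinted  = vmstat_read("madvise_pghinted") - hint0;

	if (scanned)
		printf("hinted/scanned = %.1f%% (%llu/%llu)\n",
		       100.0 * hinted / scanned, hinted, scanned);
	return 0;
}

Note that the counters are global, so any other madvise/process_madvise caller active between the two reads is included in the delta — the attribution limitation discussed in the thread above.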