Message ID | 20200714173920.3319063-1-guro@fb.com (mailing list archive)
---|---
State | New, archived
Series | [v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings
On Tue 14-07-20 10:39:20, Roman Gushchin wrote: > I've noticed a number of warnings like "vmstat_refresh: nr_free_cma > -5" or "vmstat_refresh: nr_zone_write_pending -11" on our production > hosts. The numbers of these warnings were relatively low and stable, > so it didn't look like we are systematically leaking the counters. > The corresponding vmstat counters also looked sane. > > These warnings are generated by the vmstat_refresh() function, which > assumes that atomic zone and numa counters can't go below zero. > However, on a SMP machine it's not quite right: due to per-cpu > caching it can in theory be as low as -(zone threshold) * NR_CPUs. > > For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES > reached 0. Then we've reclaimed a small number of cma pages on each > CPU except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are > slightly positive (the atomic counter is still 0). Then somebody on > CPU0 consumes all these pages. The number of pages can easily exceed > the threshold and a negative value will be committed to the atomic > counter. > > To fix the problem and avoid generating false warnings, let's just > relax the condition and warn only if the value is less than minus > the maximum theoretically possible drift value, which is 125 * > number of online CPUs. It will still allow to catch systematic leaks, > but will not generate bogus warnings. > > Signed-off-by: Roman Gushchin <guro@fb.com> > Cc: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> One minor nit which can be handled by a separate patch but now that you are touching this code... > --- > Documentation/admin-guide/sysctl/vm.rst | 4 ++-- > mm/vmstat.c | 30 ++++++++++++++++--------- > 2 files changed, 21 insertions(+), 13 deletions(-) > > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst > index 4b9d2e8e9142..95fb80d0c606 100644 > --- a/Documentation/admin-guide/sysctl/vm.rst > +++ b/Documentation/admin-guide/sysctl/vm.rst > @@ -822,8 +822,8 @@ e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo > > As a side-effect, it also checks for negative totals (elsewhere reported > as 0) and "fails" with EINVAL if any are found, with a warning in dmesg. > -(At time of writing, a few stats are known sometimes to be found negative, > -with no ill effects: errors and warnings on these stats are suppressed.) > +(On a SMP machine some stats can temporarily become negative, with no ill > +effects: errors and warnings on these stats are suppressed.) > > > numa_stat > diff --git a/mm/vmstat.c b/mm/vmstat.c > index a21140373edb..8f0ef8aaf8ee 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -169,6 +169,8 @@ EXPORT_SYMBOL(vm_node_stat); > > #ifdef CONFIG_SMP > > +#define MAX_THRESHOLD 125 This would deserve a comment. 88f5acf88ae6a didn't really explain why this specific value has been selected but the specific value shouldn't really matter much. I would go with the following at least. " Maximum sync threshold for per-cpu vmstat counters. 
" > + > int calculate_pressure_threshold(struct zone *zone) > { > int threshold; > @@ -186,11 +188,9 @@ int calculate_pressure_threshold(struct zone *zone) > threshold = max(1, (int)(watermark_distance / num_online_cpus())); > > /* > - * Maximum threshold is 125 > + * Threshold is capped by MAX_THRESHOLD > */ > - threshold = min(125, threshold); > - > - return threshold; > + return min(MAX_THRESHOLD, threshold); > } > > int calculate_normal_threshold(struct zone *zone) > @@ -610,6 +610,9 @@ void dec_node_page_state(struct page *page, enum node_stat_item item) > } > EXPORT_SYMBOL(dec_node_page_state); > #else > + > +#define MAX_THRESHOLD 0 > + > /* > * Use interrupt disable to serialize counter updates > */ > @@ -1810,7 +1813,7 @@ static void refresh_vm_stats(struct work_struct *work) > int vmstat_refresh(struct ctl_table *table, int write, > void *buffer, size_t *lenp, loff_t *ppos) > { > - long val; > + long val, max_drift; > int err; > int i; > > @@ -1821,17 +1824,22 @@ int vmstat_refresh(struct ctl_table *table, int write, > * pages, immediately after running a test. /proc/sys/vm/stat_refresh, > * which can equally be echo'ed to or cat'ted from (by root), > * can be used to update the stats just before reading them. > - * > - * Oh, and since global_zone_page_state() etc. are so careful to hide > - * transiently negative values, report an error here if any of > - * the stats is negative, so we know to go looking for imbalance. > */ > err = schedule_on_each_cpu(refresh_vm_stats); > if (err) > return err; > + > + /* > + * Since global_zone_page_state() etc. are so careful to hide > + * transiently negative values, report an error here if any of > + * the stats is negative and are less than the maximum drift value, > + * so we know to go looking for imbalance. > + */ > + max_drift = num_online_cpus() * MAX_THRESHOLD; > + > for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) { > val = atomic_long_read(&vm_zone_stat[i]); > - if (val < 0) { > + if (val < -max_drift) { > pr_warn("%s: %s %ld\n", > __func__, zone_stat_name(i), val); > err = -EINVAL; > @@ -1840,7 +1848,7 @@ int vmstat_refresh(struct ctl_table *table, int write, > #ifdef CONFIG_NUMA > for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) { > val = atomic_long_read(&vm_numa_stat[i]); > - if (val < 0) { > + if (val < -max_drift) { > pr_warn("%s: %s %ld\n", > __func__, numa_stat_name(i), val); > err = -EINVAL; > -- > 2.26.2
On Mon, Jul 20, 2020 at 10:03:49AM +0200, Michal Hocko wrote: > On Tue 14-07-20 10:39:20, Roman Gushchin wrote: > > I've noticed a number of warnings like "vmstat_refresh: nr_free_cma > > -5" or "vmstat_refresh: nr_zone_write_pending -11" on our production > > hosts. The numbers of these warnings were relatively low and stable, > > so it didn't look like we are systematically leaking the counters. > > The corresponding vmstat counters also looked sane. > > > > These warnings are generated by the vmstat_refresh() function, which > > assumes that atomic zone and numa counters can't go below zero. > > However, on a SMP machine it's not quite right: due to per-cpu > > caching it can in theory be as low as -(zone threshold) * NR_CPUs. > > > > For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES > > reached 0. Then we've reclaimed a small number of cma pages on each > > CPU except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are > > slightly positive (the atomic counter is still 0). Then somebody on > > CPU0 consumes all these pages. The number of pages can easily exceed > > the threshold and a negative value will be committed to the atomic > > counter. > > > > To fix the problem and avoid generating false warnings, let's just > > relax the condition and warn only if the value is less than minus > > the maximum theoretically possible drift value, which is 125 * > > number of online CPUs. It will still allow to catch systematic leaks, > > but will not generate bogus warnings. > > > > Signed-off-by: Roman Gushchin <guro@fb.com> > > Cc: Hugh Dickins <hughd@google.com> > > Acked-by: Michal Hocko <mhocko@suse.com> > > One minor nit which can be handled by a separate patch but now that you > are touching this code... Thank you! > > > --- > > Documentation/admin-guide/sysctl/vm.rst | 4 ++-- > > mm/vmstat.c | 30 ++++++++++++++++--------- > > 2 files changed, 21 insertions(+), 13 deletions(-) > > > > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst > > index 4b9d2e8e9142..95fb80d0c606 100644 > > --- a/Documentation/admin-guide/sysctl/vm.rst > > +++ b/Documentation/admin-guide/sysctl/vm.rst > > @@ -822,8 +822,8 @@ e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo > > > > As a side-effect, it also checks for negative totals (elsewhere reported > > as 0) and "fails" with EINVAL if any are found, with a warning in dmesg. > > -(At time of writing, a few stats are known sometimes to be found negative, > > -with no ill effects: errors and warnings on these stats are suppressed.) > > +(On a SMP machine some stats can temporarily become negative, with no ill > > +effects: errors and warnings on these stats are suppressed.) > > > > > > numa_stat > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > index a21140373edb..8f0ef8aaf8ee 100644 > > --- a/mm/vmstat.c > > +++ b/mm/vmstat.c > > @@ -169,6 +169,8 @@ EXPORT_SYMBOL(vm_node_stat); > > > > #ifdef CONFIG_SMP > > > > +#define MAX_THRESHOLD 125 > > This would deserve a comment. 88f5acf88ae6a didn't really explain why > this specific value has been selected but the specific value shouldn't > really matter much. I would go with the following at least. > " > Maximum sync threshold for per-cpu vmstat counters. > " Agree. Below is the diff to be squashed into the original patch. Thanks! 
--
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 08e415e0a15d..ddc59b533599 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -167,6 +167,9 @@ EXPORT_SYMBOL(vm_zone_stat);
 EXPORT_SYMBOL(vm_numa_stat);
 EXPORT_SYMBOL(vm_node_stat);
 
+/*
+ * Maximum sync threshold for per-cpu vmstat counters.
+ */
 #ifdef CONFIG_SMP
 #define MAX_THRESHOLD 125
 #else
On Tue, 14 Jul 2020, Roman Gushchin wrote: > I've noticed a number of warnings like "vmstat_refresh: nr_free_cma > -5" or "vmstat_refresh: nr_zone_write_pending -11" on our production > hosts. The numbers of these warnings were relatively low and stable, > so it didn't look like we are systematically leaking the counters. > The corresponding vmstat counters also looked sane. nr_zone_write_pending: yes, I've looked at our machines, and see that showing up for us too (-49 was the worst I saw). Not at all common, but seen. And not followed by increasingly worse numbers, so a state that corrects itself. nr_dirty too (fewer instances, bigger numbers); but never nr_writeback, which you'd expect to go along with those. I wish I could explain that (I've wondered if somewhere delays incrementing the stat, and can be interrupted by the decrementer of the stat before it has reached incrementing), but have not succeeded. Perhaps it is all down to the vmstat_refresh() skid that you hide in this patch; but I'm not convinced. > > These warnings are generated by the vmstat_refresh() function, which > assumes that atomic zone and numa counters can't go below zero. > However, on a SMP machine it's not quite right: due to per-cpu > caching it can in theory be as low as -(zone threshold) * NR_CPUs. > > For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES > reached 0. Then we've reclaimed a small number of cma pages on each > CPU except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are > slightly positive (the atomic counter is still 0). Then somebody on > CPU0 consumes all these pages. The number of pages can easily exceed > the threshold and a negative value will be committed to the atomic > counter. > > To fix the problem and avoid generating false warnings, let's just > relax the condition and warn only if the value is less than minus > the maximum theoretically possible drift value, which is 125 * > number of online CPUs. It will still allow to catch systematic leaks, > but will not generate bogus warnings. Sorry, but despite the acks of others, I want to NAK this in its present form. You're right that there's a possibility of a spurious warning, but is that so terrible? I'm imagining a threshold of 125, and 128 cpus, and the following worst case. Yes, it is possible that when vmstat_refresh() schedules its refresh on all the cpus, that the first 127 cpus to complete all sync a count of 0, but immediately after each allocates 125 of whatever (raising their per-cpu counters without touching the global counter); and then, before the 128th cpu starts its sync, somehow that 128th gets to be the lucky cpu to free all of those 127*125 items, so arriving at a mistaken count of -15875 for that stat. And I guess you could even devise a test which conspires to do that. But is it so likely, that it's worth throwing away the warning when we leak (or perhaps it's unleak) 16000 huge pages? I don't think so: I think it makes those warnings pretty much useless, and it would be better just to delete the code that issues them. But there's other things you could do, that we can probably agree on. When stat_refresh first went in, there were two stats (NR_ALLOC_BATCH and NR_PAGES_SCANNED) which were peculiarly accounted, and gave rise to negative reports: the original commit just filtered those cases out in a switch. Maybe you should filter out NR_ZONE_WRITE_PENDING and NR_FREE_CMA_PAGES, if there's nothing to fix in their accounting. 
I'm not sure exactly what your objection is to the warning: would you prefer pr_info or pr_debug to pr_warn? I'd prefer to leave it as pr_warn, but can compromise. (IIRC xfstests can fail a test if an unexpected message appears in the log; but xfstests does not use /proc/sys/vm/stat_refresh.) But a better idea is perhaps to redefine the behavior of "echo >/proc/sys/vm/stat_refresh". What if "echo someparticularstring >/proc/sys/vm/stat_refresh" were to disable or enable the warning (permanently? or just that time?): disable would be more "back-compatible", but I think it's okay if you prefer enable. Or "someparticularstring" could actually specify the warning threshold you want to use - you might echo 125 or 16000, I might echo 0. We can haggle over the default. And there's a simpler change we already made internally: we didn't mind the warning at all, but the accompanying -EINVALs became very annoying. A lot of testing scripts have "set -e" in them, and for test B of feature B to fail because earlier test A of feature A had tickled a bug in A that wrapped some stat negative, that was very irritating. We deleted those "err = -EINVAL;"s - which might be what's actually most annoying you too? Nit in this patch called out below. Hugh > > Signed-off-by: Roman Gushchin <guro@fb.com> > Cc: Hugh Dickins <hughd@google.com> > --- > Documentation/admin-guide/sysctl/vm.rst | 4 ++-- > mm/vmstat.c | 30 ++++++++++++++++--------- > 2 files changed, 21 insertions(+), 13 deletions(-) > > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst > index 4b9d2e8e9142..95fb80d0c606 100644 > --- a/Documentation/admin-guide/sysctl/vm.rst > +++ b/Documentation/admin-guide/sysctl/vm.rst > @@ -822,8 +822,8 @@ e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo > > As a side-effect, it also checks for negative totals (elsewhere reported > as 0) and "fails" with EINVAL if any are found, with a warning in dmesg. > -(At time of writing, a few stats are known sometimes to be found negative, > -with no ill effects: errors and warnings on these stats are suppressed.) > +(On a SMP machine some stats can temporarily become negative, with no ill > +effects: errors and warnings on these stats are suppressed.) > > > numa_stat > diff --git a/mm/vmstat.c b/mm/vmstat.c > index a21140373edb..8f0ef8aaf8ee 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -169,6 +169,8 @@ EXPORT_SYMBOL(vm_node_stat); > > #ifdef CONFIG_SMP > > +#define MAX_THRESHOLD 125 > + > int calculate_pressure_threshold(struct zone *zone) > { > int threshold; > @@ -186,11 +188,9 @@ int calculate_pressure_threshold(struct zone *zone) > threshold = max(1, (int)(watermark_distance / num_online_cpus())); > > /* > - * Maximum threshold is 125 > + * Threshold is capped by MAX_THRESHOLD > */ > - threshold = min(125, threshold); > - > - return threshold; > + return min(MAX_THRESHOLD, threshold); > } > > int calculate_normal_threshold(struct zone *zone) calculate_normal_threshold() also contains a 125: better change that to MAX_THRESHOLD too, if you do pursue this. 
> @@ -610,6 +610,9 @@ void dec_node_page_state(struct page *page, enum node_stat_item item) > } > EXPORT_SYMBOL(dec_node_page_state); > #else > + > +#define MAX_THRESHOLD 0 > + > /* > * Use interrupt disable to serialize counter updates > */ > @@ -1810,7 +1813,7 @@ static void refresh_vm_stats(struct work_struct *work) > int vmstat_refresh(struct ctl_table *table, int write, > void *buffer, size_t *lenp, loff_t *ppos) > { > - long val; > + long val, max_drift; > int err; > int i; > > @@ -1821,17 +1824,22 @@ int vmstat_refresh(struct ctl_table *table, int write, > * pages, immediately after running a test. /proc/sys/vm/stat_refresh, > * which can equally be echo'ed to or cat'ted from (by root), > * can be used to update the stats just before reading them. > - * > - * Oh, and since global_zone_page_state() etc. are so careful to hide > - * transiently negative values, report an error here if any of > - * the stats is negative, so we know to go looking for imbalance. > */ > err = schedule_on_each_cpu(refresh_vm_stats); > if (err) > return err; > + > + /* > + * Since global_zone_page_state() etc. are so careful to hide > + * transiently negative values, report an error here if any of > + * the stats is negative and are less than the maximum drift value, > + * so we know to go looking for imbalance. > + */ > + max_drift = num_online_cpus() * MAX_THRESHOLD; > + > for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) { > val = atomic_long_read(&vm_zone_stat[i]); > - if (val < 0) { > + if (val < -max_drift) { > pr_warn("%s: %s %ld\n", > __func__, zone_stat_name(i), val); > err = -EINVAL; > @@ -1840,7 +1848,7 @@ int vmstat_refresh(struct ctl_table *table, int write, > #ifdef CONFIG_NUMA > for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) { > val = atomic_long_read(&vm_numa_stat[i]); > - if (val < 0) { > + if (val < -max_drift) { > pr_warn("%s: %s %ld\n", > __func__, numa_stat_name(i), val); > err = -EINVAL; > -- > 2.26.2
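For concreteness, Hugh's filtering suggestion could look something like the sketch below, mirroring the switch that 52b6f46bc163 used for NR_ALLOC_BATCH and NR_PAGES_SCANNED; this is a hypothetical illustration, not code posted in the thread:

	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
		val = atomic_long_read(&vm_zone_stat[i]);
		if (val < 0) {
			switch (i) {
			case NR_ZONE_WRITE_PENDING:
			case NR_FREE_CMA_PAGES:
				/* known to bounce around zero; suppress */
				break;
			default:
				pr_warn("%s: %s %ld\n",
					__func__, zone_stat_name(i), val);
				err = -EINVAL;
			}
		}
	}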
On Wed, Jul 29, 2020 at 08:45:47PM -0700, Hugh Dickins wrote: > On Tue, 14 Jul 2020, Roman Gushchin wrote: > > > I've noticed a number of warnings like "vmstat_refresh: nr_free_cma > > -5" or "vmstat_refresh: nr_zone_write_pending -11" on our production > > hosts. The numbers of these warnings were relatively low and stable, > > so it didn't look like we are systematically leaking the counters. > > The corresponding vmstat counters also looked sane. > > nr_zone_write_pending: yes, I've looked at our machines, and see that > showing up for us too (-49 was the worst I saw). Not at all common, > but seen. And not followed by increasingly worse numbers, so a state > that corrects itself. nr_dirty too (fewer instances, bigger numbers); > but never nr_writeback, which you'd expect to go along with those. NR_DIRTY and NR_WRITEBACK are node counters, so we don't check them? > > I wish I could explain that (I've wondered if somewhere delays > incrementing the stat, and can be interrupted by the decrementer of > the stat before it has reached incrementing), but have not succeeded. > Perhaps it is all down to the vmstat_refresh() skid that you hide in > this patch; but I'm not convinced. > > > > > These warnings are generated by the vmstat_refresh() function, which > > assumes that atomic zone and numa counters can't go below zero. > > However, on a SMP machine it's not quite right: due to per-cpu > > caching it can in theory be as low as -(zone threshold) * NR_CPUs. > > > > For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES > > reached 0. Then we've reclaimed a small number of cma pages on each > > CPU except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are > > slightly positive (the atomic counter is still 0). Then somebody on > > CPU0 consumes all these pages. The number of pages can easily exceed > > the threshold and a negative value will be committed to the atomic > > counter. > > > > To fix the problem and avoid generating false warnings, let's just > > relax the condition and warn only if the value is less than minus > > the maximum theoretically possible drift value, which is 125 * > > number of online CPUs. It will still allow to catch systematic leaks, > > but will not generate bogus warnings. > > Sorry, but despite the acks of others, I want to NAK this in its > present form. Sorry to hear this. > > You're right that there's a possibility of a spurious warning, > but is that so terrible? We do collect all warnings fleet-wide, and false warnings create noise, which makes it easier to miss real problems. Of course, we can filter out these particular warnings, but then what's the point of generating them? In my opinion such warnings with a bad signal/noise ratio aren't good because initially they cause somebody to look into the "problem" and waste their time, and later they are usually just ignored, even when the real problem appears. In this particular case I was testing some cma-related changes, and when I saw a bunch of new warnings (cma was not used on these hosts before), I was concerned that something was wrong with my changes. > > I'm imagining a threshold of 125, and 128 cpus, and the following > worst case.
> Yes, it is possible that when vmstat_refresh() schedules > its refresh on all the cpus, that the first 127 cpus to complete all > sync a count of 0, but immediately after each allocates 125 of whatever > (raising their per-cpu counters without touching the global counter); > and then, before the 128th cpu starts its sync, somehow that 128th > gets to be the lucky cpu to free all of those 127*125 items, > so arriving at a mistaken count of -15875 for that stat. First, I have to agree, 125 * number of cpus is definitely a very high number, so it's extremely unlikely that any vmstat value will reach it in real life. I'm totally happy to go with a (much) lower limit, I just have no good idea how to justify any particular number below. If you can suggest something, I'd appreciate it a lot. I like the number 400 :) > > And I guess you could even devise a test which conspires to do that. > But is it so likely, that it's worth throwing away the warning when > we leak (or perhaps it's unleak) 16000 huge pages? I don't think so: > I think it makes those warnings pretty much useless, and it would be > better just to delete the code that issues them. Of course, it's an option too (deleting the warnings altogether). > > But there's other things you could do, that we can probably agree on. > > When stat_refresh first went in, there were two stats (NR_ALLOC_BATCH > and NR_PAGES_SCANNED) which were peculiarly accounted, and gave rise > to negative reports: the original commit just filtered those cases out > in a switch. Maybe you should filter out NR_ZONE_WRITE_PENDING and > NR_FREE_CMA_PAGES, if there's nothing to fix in their accounting. In fact, there is nothing specific to NR_ZONE_WRITE_PENDING and NR_FREE_CMA_PAGES, any value which is often bouncing around 0 can go negative, and it's not an indication of any issues. > > I'm not sure exactly what your objection is to the warning: would > you prefer pr_info or pr_debug to pr_warn? I'd prefer to leave it > as pr_warn, but can compromise. Yeah, we can go with pr_debug as well. > > (IIRC xfstests can fail a test if an unexpected message appears > in the log; but xfstests does not use /proc/sys/vm/stat_refresh.) > > But a better idea is perhaps to redefine the behavior of > "echo >/proc/sys/vm/stat_refresh". What if > "echo someparticularstring >/proc/sys/vm/stat_refresh" were to > disable or enable the warning (permanently? or just that time?): > disable would be more "back-compatible", but I think it's okay > if you prefer enable. Or "someparticularstring" could actually > specify the warning threshold you want to use - you might echo > 125 or 16000, I might echo 0. We can haggle over the default. May I ask you, what kind of problems you have in your mind, which can be revealed by these warnings? Or maybe there is some history attached? If it's all about some particular counters, which are known to be strictly positive, maybe we should do the opposite, and check only those counters? Because in general it's not an indication of a problem. > > And there's a simpler change we already made internally: we didn't > mind the warning at all, but the accompanying -EINVALs became very > annoying. A lot of testing scripts have "set -e" in them, and for > test B of feature B to fail because earlier test A of feature A > had tickled a bug in A that wrapped some stat negative, that > was very irritating. We deleted those "err = -EINVAL;"s - > which might be what's actually most annoying you too? > > Nit in this patch called out below.
> > Hugh > > > > > Signed-off-by: Roman Gushchin <guro@fb.com> > > Cc: Hugh Dickins <hughd@google.com> > > --- > > Documentation/admin-guide/sysctl/vm.rst | 4 ++-- > > mm/vmstat.c | 30 ++++++++++++++++--------- > > 2 files changed, 21 insertions(+), 13 deletions(-) > > > > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst > > index 4b9d2e8e9142..95fb80d0c606 100644 > > --- a/Documentation/admin-guide/sysctl/vm.rst > > +++ b/Documentation/admin-guide/sysctl/vm.rst > > @@ -822,8 +822,8 @@ e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo > > > > As a side-effect, it also checks for negative totals (elsewhere reported > > as 0) and "fails" with EINVAL if any are found, with a warning in dmesg. > > -(At time of writing, a few stats are known sometimes to be found negative, > > -with no ill effects: errors and warnings on these stats are suppressed.) > > +(On a SMP machine some stats can temporarily become negative, with no ill > > +effects: errors and warnings on these stats are suppressed.) > > > > > > numa_stat > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > index a21140373edb..8f0ef8aaf8ee 100644 > > --- a/mm/vmstat.c > > +++ b/mm/vmstat.c > > @@ -169,6 +169,8 @@ EXPORT_SYMBOL(vm_node_stat); > > > > #ifdef CONFIG_SMP > > > > +#define MAX_THRESHOLD 125 > > + > > int calculate_pressure_threshold(struct zone *zone) > > { > > int threshold; > > @@ -186,11 +188,9 @@ int calculate_pressure_threshold(struct zone *zone) > > threshold = max(1, (int)(watermark_distance / num_online_cpus())); > > > > /* > > - * Maximum threshold is 125 > > + * Threshold is capped by MAX_THRESHOLD > > */ > > - threshold = min(125, threshold); > > - > > - return threshold; > > + return min(MAX_THRESHOLD, threshold); > > } > > > > int calculate_normal_threshold(struct zone *zone) > > calculate_normal_threshold() also contains a 125: > better change that to MAX_THRESHOLD too, if you do pursue this. Totally agree, good catch. Thanks! Roman
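To put the numbers being haggled over side by side, here is a tiny userspace illustration (not from the thread) of the v2 patch's cutoff versus Hugh's worst case; the 128-CPU machine is the hypothetical one from his example:

#include <stdio.h>

int main(void)
{
	const long max_threshold = 125;	/* per-cpu sync threshold cap */
	const long nr_cpus = 128;	/* hypothetical machine */

	/* the v2 patch warns only below -(MAX_THRESHOLD * online CPUs) */
	printf("v2 warning cutoff: %ld\n", -(max_threshold * nr_cpus));

	/* Hugh's worst case: 127 CPUs each left +125 unsynced */
	printf("worst-case skid:   %ld\n", -((nr_cpus - 1) * max_threshold));
	return 0;
}

This prints -16000 and -15875 respectively, matching the figures quoted in the discussion.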
On Thu, 30 Jul 2020, Roman Gushchin wrote: > On Wed, Jul 29, 2020 at 08:45:47PM -0700, Hugh Dickins wrote: > > > > But a better idea is perhaps to redefine the behavior of > > "echo >/proc/sys/vm/stat_refresh". What if > > "echo someparticularstring >/proc/sys/vm/stat_refresh" were to > > disable or enable the warning (permanently? or just that time?): > > disable would be more "back-compatible", but I think it's okay > > if you prefer enable. Or "someparticularstring" could actually > > specify the warning threshold you want to use - you might echo > > 125 or 16000, I might echo 0. We can haggle over the default. > > May I ask you, what kind of problems you have in your mind, > which can be revealed by these warnings? Or maybe there is some > history attached? Yes: 52b6f46bc163 mentions finding a bug of mine in NR_ISOLATED_FILE accounting, but IIRC (though I might be making this up) there was also a bug in the NR_ACTIVE or NR_INACTIVE FILE or ANON accounting. When one of the stats used for balancing or limiting in vmscan.c trends increasingly negative, it becomes increasingly difficult for those heuristics (adding on to others, comparing with others) to do what they're intended to do: they behave increasingly weirdly. Now the same (or the opposite) is true if one of those stats trends increasingly positive: but if it leaks positive, it's visible in /proc/vmstat; whereas if it leaks negative, it's presented there as 0. And most of the time (when unsynchronized) showing 0 is much better than showing a transient negative. But to help fix bugs, we do need some way of seeing the negatives, and vm/stat_refresh provides an opportunity to do so, when it synchronizes. I'd be glad not to show the transients if I knew them: set a flag on any that go negative, and only show if negative twice or more in a row? Perhaps, but I don't relish adding that, and think it would be over-engineering. It does sound to me like echoing the warning threshold into /proc/sys/vm/stat_refresh is the best way to satisfy us both. Though another alternative did occur to me overnight: we could scrap the logged warning, and show "nr_whatever -53" as output from /proc/sys/vm/stat_refresh: that too would be acceptable to me, and you redirect to /dev/null. (Why did I choose -53 in my example? An in-joke: when I looked through our machines for these warnings, on old kernels with my old shmem hugepage implementation, there were a striking number with "nr_shmem_freeholes -53"; but I'm a few years too late to investigate what was going on there.) > > If it's all about some particular counters, which are known to be > strictly positive, maybe we should do the opposite, and check only > those counters? Because in general it's not an indication of a problem. Yet it's very curious how few stats ever generate such warnings: you're convinced they're just transient noise, and you're probably right; but I am a little suspicious of whether they are accounted correctly. Hugh
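The clamping Hugh refers to (negative totals "presented there as 0") is done by the stat readers themselves; roughly, from include/linux/vmstat.h around v5.8:

static inline unsigned long global_zone_page_state(enum zone_stat_item item)
{
	long x = atomic_long_read(&vm_zone_stat[item]);
#ifdef CONFIG_SMP
	/* hide transiently negative totals: report them as 0 */
	if (x < 0)
		x = 0;
#endif
	return x;
}

This is why a counter that leaks negative is invisible in /proc/vmstat, and the direct atomic_long_read() in vmstat_refresh() is the only place where it shows.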
On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote: > On Thu, 30 Jul 2020, Roman Gushchin wrote: > > On Wed, Jul 29, 2020 at 08:45:47PM -0700, Hugh Dickins wrote: > > > > > > But a better idea is perhaps to redefine the behavior of > > > "echo >/proc/sys/vm/stat_refresh". What if > > > "echo someparticularstring >/proc/sys/vm/stat_refresh" were to > > > disable or enable the warning (permanently? or just that time?): > > > disable would be more "back-compatible", but I think it's okay > > > if you prefer enable. Or "someparticularstring" could actually > > > specify the warning threshold you want to use - you might echo > > > 125 or 16000, I might echo 0. We can haggle over the default. > > > > May I ask you, what kind of problems you have in your in mind, > > which can be revealed by these warnings? Or maybe there is some > > history attached? > > Yes: 52b6f46bc163 mentions finding a bug of mine in NR_ISOLATED_FILE > accounting, but IIRC (though I might be making this up) there was > also a bug in the NR_ACTIVE or NR_INACTIVE FILE or ANON accounting. > > When one of the stats used for balancing or limiting in vmscan.c > trends increasingly negative, it becomes increasingly difficult > for those heuristics (adding on to others, comparing with others) > to do what they're intended to do: they behave increasingly weirdly. > > Now the same (or the opposite) is true if one of those stats trends > increasingly positive: but if it leaks positive, it's visible in > /proc/vmstat; whereas if it leaks negative, it's presented there as 0. > > And most of the time (when unsynchronized) showing 0 is much better > than showing a transient negative. But to help fix bugs, we do need > some way of seeing the negatives, and vm/stat_refresh provides an > opportunity to do so, when it synchronizes. > > I'd be glad not to show the transients if I knew them: set a flag > on any that go negative, and only show if negative twice or more > in a row? Perhaps, but I don't relish adding that, and think it > would be over-engineering. > > It does sound to me like echoing the warning threshold into > /proc/sys/vm/stat_refresh is the best way to satisfy us both. > > Though another alternative did occur to me overnight: we could > scrap the logged warning, and show "nr_whatever -53" as output > from /proc/sys/vm/stat_refresh: that too would be acceptable > to me, and you redirect to /dev/null. It sounds like a good idea to me. Do you want me to prepare a patch? > > (Why did I choose -53 in my example? An in-joke: when I looked > through our machines for these warnings, on old kernels with my > old shmem hugepage implementation, there were a striking number > with "nr_shmem_freeholes -53"; but I'm a few years too late to > investigate what was going on there.) :) > > > > > If it's all about some particular counters, which are known to be > > strictly positive, maybe we should do the opposite, and check only > > those counters? Because in general it's not an indication of a problem. > > Yet it's very curious how few stats ever generate such warnings: > you're convinced they're just transient noise, and you're probably right; > but I am a little suspicious of whether they are accounted correctly. Yeah, I was initially very suspicious too, but I didn't find any issues and still think it's not an indication of a problem. Thank you!
On Fri, 31 Jul 2020, Roman Gushchin wrote: > On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote: > > > > Though another alternative did occur to me overnight: we could > > scrap the logged warning, and show "nr_whatever -53" as output > > from /proc/sys/vm/stat_refresh: that too would be acceptable > > to me, and you redirect to /dev/null. > > It sounds like a good idea to me. Do you want me to prepare a patch? Yes, if you like that one best, please do prepare a patch - thanks! Hugh
On Fri, Jul 31, 2020 at 07:17:05PM -0700, Hugh Dickins wrote: > On Fri, 31 Jul 2020, Roman Gushchin wrote: > > On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote: > > > > > > Though another alternative did occur to me overnight: we could > > > scrap the logged warning, and show "nr_whatever -53" as output > > > from /proc/sys/vm/stat_refresh: that too would be acceptable > > > to me, and you redirect to /dev/null. > > > > It sounds like a good idea to me. Do you want me to prepare a patch? > > Yes, if you like that one best, please do prepare a patch - thanks! Hi Hugh, I mastered a patch (attached below), but honestly I can't say I like it. The resulting interface is confusing: we don't generally use sysctls to print debug data and/or warnings. I thought about treating a write to this sysctls as setting the threshold, so that "echo 0 > /proc/sys/vm/stat_refresh" would warn on all negative entries, and "cat /proc/sys/vm/stat_refresh" would use the default threshold as in my patch. But this breaks to some extent the current ABI, as passing an incorrect value will result in -EINVAL instead of passing (as now). Overall I still think we shouldn't warn on any values inside the possible range, as it's not an indication of any kind of error. The only reason why we see some values going negative and some not, is that some of them are updated more frequently than others, and some are bouncing around zero, while other can't reach zero too easily (like the number of free pages). Actually, if someone wants to ensure that numbers are accurate, we have to temporarily set the threshold to 0, then flush the percpu data and only then check atomics. In the current design flushing percpu data matters for only slowly updated counters, as all others will run away while we're waiting for the flush. So if we're targeting some slowly updating counters, maybe we should warn only on them being negative, Idk. Thanks!

--
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6eecfcbbfe79..1d2f2471bb78 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1815,13 +1815,28 @@ static void refresh_vm_stats(struct work_struct *work)
 	refresh_cpu_vm_stats(true);
 }
 
+static void warn_vmstat(void *buffer, size_t *lenp, loff_t *ppos,
+			const char *name, long val)
+{
+	int len;
+
+	len = snprintf(buffer + *ppos, *lenp, "%s %lu\n", name, val);
+	*lenp -= len;
+	*ppos += len;
+}
+
 int vmstat_refresh(struct ctl_table *table, int write,
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
-	long val, max_drift;
+	long val;
 	int err;
 	int i;
 
+	if (!*lenp || (*ppos && !write)) {
+		*lenp = 0;
+		return 0;
+	}
+
 	/*
 	 * The regular update, every sysctl_stat_interval, may come later
 	 * than expected: leaving a significant amount in per_cpu buckets.
@@ -1837,35 +1852,24 @@ int vmstat_refresh(struct ctl_table *table, int write,
 	/*
 	 * Since global_zone_page_state() etc. are so careful to hide
 	 * transiently negative values, report an error here if any of
-	 * the stats is negative and are less than the maximum drift value,
-	 * so we know to go looking for imbalance.
+	 * the stats is negative, so we know to go looking for imbalance.
 	 */
-	max_drift = num_online_cpus() * MAX_THRESHOLD;
-
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_zone_stat[i]);
-		if (val < -max_drift) {
-			pr_warn("%s: %s %ld\n",
-				__func__, zone_stat_name(i), val);
-			err = -EINVAL;
-		}
+		if (!write && val < 0)
+			warn_vmstat(buffer, lenp, ppos, zone_stat_name(i), val);
 	}
 #ifdef CONFIG_NUMA
 	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_numa_stat[i]);
-		if (val < -max_drift) {
-			pr_warn("%s: %s %ld\n",
-				__func__, numa_stat_name(i), val);
-			err = -EINVAL;
-		}
+		if (!write && val < 0)
+			warn_vmstat(buffer, lenp, ppos, numa_stat_name(i), val);
 	}
 #endif
 	if (err)
 		return err;
 	if (write)
 		*ppos += *lenp;
-	else
-		*lenp = 0;
 	return 0;
 }
 #endif	/* CONFIG_PROC_FS */
On Mon, 3 Aug 2020, Roman Gushchin wrote: > On Fri, Jul 31, 2020 at 07:17:05PM -0700, Hugh Dickins wrote: > > On Fri, 31 Jul 2020, Roman Gushchin wrote: > > > On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote: > > > > > > > > Though another alternative did occur to me overnight: we could > > > > scrap the logged warning, and show "nr_whatever -53" as output > > > > from /proc/sys/vm/stat_refresh: that too would be acceptable > > > > to me, and you redirect to /dev/null. > > > > > > It sounds like a good idea to me. Do you want me to prepare a patch? > > > > Yes, if you like that one best, please do prepare a patch - thanks! > > Hi Hugh, > > I mastered a patch (attached below), but honestly I can't say I like it. > The resulting interface is confusing: we don't generally use sysctls to > print debug data and/or warnings. Since you confessed to not liking it yourself, I paid it very little attention. Yes, when I made that suggestion, I wasn't really thinking of how stat_refresh is a /proc/sys/vm sysctl thing; and I'm not at all sure how issuing output from a /proc file intended for input works out (perhaps there are plenty of good examples, and you followed one, but it smells fishy to me now). > > I thought about treating a write to this sysctls as setting the threshold, > so that "echo 0 > /proc/sys/vm/stat_refresh" would warn on all negative > entries, and "cat /proc/sys/vm/stat_refresh" would use the default threshold > as in my patch. But this breaks to some extent the current ABI, as passing > an incorrect value will result in -EINVAL instead of passing (as now). I expect we could handle that well enough, by more lenient validation of the input; though my comment above on output versus input sheds doubt. > > Overall I still think we shouldn't warn on any values inside the possible > range, as it's not an indication of any kind of error. The only reason > why we see some values going negative and some not, is that some of them > are updated more frequently than others, and some are bouncing around > zero, while other can't reach zero too easily (like the number of free pages). We continue to disagree on that (and it amuses me that you who are so sure they can be ignored, cannot ignore them; whereas I who am so curious to investigate them, have not actually found the time to do so in years). It was looking as if nothing could satisfy us both, but... > > Actually, if someone wants to ensure that numbers are accurate, > we have to temporarily set the threshold to 0, then flush the percpu data > and only then check atomics. In the current design flushing percpu data > matters for only slowly updated counters, as all others will run away while > we're waiting for the flush. So if we're targeting some slowly updating > counters, maybe we should warn only on them being negative, Idk. I was going to look into that angle, though it would probably add a little unjustifiable overhead to fast paths, and be rejected on that basis. But in going to do so, came up against an earlier comment of yours, of which I had misunderstood the significance. I had said and you replied: > > nr_zone_write_pending: yes, I've looked at our machines, and see that > > showing up for us too (-49 was the worst I saw). Not at all common, > > but seen. And not followed by increasingly worse numbers, so a state > > that corrects itself. nr_dirty too (fewer instances, bigger numbers); > > but never nr_writeback, which you'd expect to go along with those. 
> > NR_DIRTY and NR_WRITEBACK are node counters, so we don't check them? Wow. Now I see what you were pointing out: when v4.8's 75ef71840539 ("mm, vmstat: add infrastructure for per-node vmstats") went in, it missed updating vmstat_refresh() to check all the NR_VM_NODE_STAT items. And I've never noticed, and have interpreted its silence on those items as meaning they're all good (and the nr_dirty ones I mentioned above, must have been from residual old kernels, hence the fewer instances). I see the particularly tricky NR_ISOLATED ones are in that category. Maybe they are all good, but I have been mistaken. I shall certainly want to reintroduce those stats to checking for negatives, even if it's in a patch that never earns your approval, and just ends up kept internal for debugging. But equally certainly, I must not suddenly reintroduce that checking without gaining some experience of it (and perhaps getting as irritated as you by more transient negatives). I said earlier that I'd prefer you to rip out all that checking for negatives, rather than retaining it with the uselessly over-generous 125 * nr_cpus leeway. Please, Roman, would you send Andrew a patch doing that, to replace the patch in this thread? Or if you prefer, I can do so. Thanks, Hugh
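Restoring the node counters to the check, as Hugh intends, would presumably just mirror the existing zone loop in vmstat_refresh(). A sketch of the missing loop (a hypothetical illustration, not code posted in this thread; node_stat_name() exists alongside zone_stat_name() and numa_stat_name() in mm/vmstat.c):

	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
		val = atomic_long_read(&vm_node_stat[i]);
		if (val < 0) {
			pr_warn("%s: %s %ld\n",
				__func__, node_stat_name(i), val);
			err = -EINVAL;
		}
	}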
On Wed, Aug 05, 2020 at 08:01:33PM -0700, Hugh Dickins wrote: > On Mon, 3 Aug 2020, Roman Gushchin wrote: > > On Fri, Jul 31, 2020 at 07:17:05PM -0700, Hugh Dickins wrote: > > > On Fri, 31 Jul 2020, Roman Gushchin wrote: > > > > On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote: > > > > > > > > > > Though another alternative did occur to me overnight: we could > > > > > scrap the logged warning, and show "nr_whatever -53" as output > > > > > from /proc/sys/vm/stat_refresh: that too would be acceptable > > > > > to me, and you redirect to /dev/null. > > > > > > > > It sounds like a good idea to me. Do you want me to prepare a patch? > > > > > > Yes, if you like that one best, please do prepare a patch - thanks! > > > > Hi Hugh, > > > > I mastered a patch (attached below), but honestly I can't say I like it. > > The resulting interface is confusing: we don't generally use sysctls to > > print debug data and/or warnings. > > Since you confessed to not liking it yourself, I paid it very little > attention. Yes, when I made that suggestion, I wasn't really thinking > of how stat_refresh is a /proc/sys/vm sysctl thing; and I'm not at all > sure how issuing output from a /proc file intended for input works out > (perhaps there are plenty of good examples, and you followed one, but > it smells fishy to me now). > > > > > I thought about treating a write to this sysctls as setting the threshold, > > so that "echo 0 > /proc/sys/vm/stat_refresh" would warn on all negative > > entries, and "cat /proc/sys/vm/stat_refresh" would use the default threshold > > as in my patch. But this breaks to some extent the current ABI, as passing > > an incorrect value will result in -EINVAL instead of passing (as now). > > I expect we could handle that well enough, by more lenient validation > of the input; though my comment above on output versus input sheds doubt. > > > > > Overall I still think we shouldn't warn on any values inside the possible > > range, as it's not an indication of any kind of error. The only reason > > why we see some values going negative and some not, is that some of them > > are updated more frequently than others, and some are bouncing around > > zero, while other can't reach zero too easily (like the number of free pages). > > We continue to disagree on that (and it amuses me that you who are so > sure they can be ignored, cannot ignore them; whereas I who am so curious > to investigate them, have not actually found the time to do so in years). > It was looking as if nothing could satisfy us both, but... I can only repeat my understanding here: with the current implementation the measured number can vary in the range (true_value - zone_threshold * NR_CPUS, true_value + zone_threshold * NR_CPUS). zone_threshold depends on the size of a zone and the number of CPUs, but cannot exceed 125. Of course, measured numbers are most likely distributed close to the real number, and reaching distant ends of this range is unlikely. But it's a question of probability. So if the true value is close to 0, there are high chances of getting negative measured numbers. The bigger the value, the lower the chances. And if it's bigger than the maximal drift, these chances are 0. So we can be sure that a measured value can't go negative only if we know for sure that the true number is bigger than zone_threshold * NR_CPUS.
You can, probably, say that if the chances of getting a negative value are really really low, it's better to emit a warning, rather than miss a potential error. I'd happily agree, if we had a nice formula to calculate the tolerance for a given probability. But if we treat all negative numbers as warnings, we'll just end up with a lot of false warnings. > > > > > Actually, if someone wants to ensure that numbers are accurate, > > we have to temporarily set the threshold to 0, then flush the percpu data > > and only then check atomics. In the current design flushing percpu data > > matters for only slowly updated counters, as all others will run away while > > we're waiting for the flush. So if we're targeting some slowly updating > > counters, maybe we should warn only on them being negative, Idk. > > I was going to look into that angle, though it would probably add a little > unjustifiable overhead to fast paths, and be rejected on that basis. I'd expect it. What I think can be acceptable is to have different tolerances for different counters, if there is a good reason to have more precise values for some counters. I did a similar thing in the "new slab controller" patchset for memcg slab statistics, which required a different threshold because they are measured in bytes (all other metrics were historically in pages). > > But in going to do so, came up against an earlier comment of yours, of > which I had misunderstood the significance. I had said and you replied: > > > > nr_zone_write_pending: yes, I've looked at our machines, and see that > > > showing up for us too (-49 was the worst I saw). Not at all common, > > > but seen. And not followed by increasingly worse numbers, so a state > > > that corrects itself. nr_dirty too (fewer instances, bigger numbers); > > > but never nr_writeback, which you'd expect to go along with those. > > > > NR_DIRTY and NR_WRITEBACK are node counters, so we don't check them? > > Wow. Now I see what you were pointing out: when v4.8's 75ef71840539 > ("mm, vmstat: add infrastructure for per-node vmstats") went in, it > missed updating vmstat_refresh() to check all the NR_VM_NODE_STAT items. > > And I've never noticed, and have interpreted its silence on those items > as meaning they're all good (and the nr_dirty ones I mentioned above, > must have been from residual old kernels, hence the fewer instances). > I see the particularly tricky NR_ISOLATED ones are in that category. > Maybe they are all good, but I have been mistaken. > > I shall certainly want to reintroduce those stats to checking for > negatives, even if it's in a patch that never earns your approval, > and just ends up kept internal for debugging. But equally certainly, > I must not suddenly reintroduce that checking without gaining some > experience of it (and perhaps getting as irritated as you by more > transient negatives). > > I said earlier that I'd prefer you to rip out all that checking for > negatives, rather than retaining it with the uselessly over-generous > 125 * nr_cpus leeway. Please, Roman, would you send Andrew a patch > doing that, to replace the patch in this thread? Or if you prefer, > I can do so. Sure, I can do it. But can you, please, explain why you want them to be eliminated? Is this because they will never fire, you think? In my humble opinion they might be quite useful: any systematic under- or overflow will eventually trigger this warning, and we'll know for sure that something is wrong. So I'd add a similar check for node counters without any hesitation.
I wouldn't oppose such a patch anyway, just want to make sure we are on the same page here. Thanks!
On Wed, 5 Aug 2020, Roman Gushchin wrote: > On Wed, Aug 05, 2020 at 08:01:33PM -0700, Hugh Dickins wrote: > > > > I shall certainly want to reintroduce those stats to checking for > > negatives, even if it's in a patch that never earns your approval, > > and just ends up kept internal for debugging. But equally certainly, > > I must not suddenly reintroduce that checking without gaining some > > experience of it (and perhaps getting as irritated as you by more > > transient negatives). > > > > I said earlier that I'd prefer you to rip out all that checking for > > negatives, rather than retaining it with the uselessly over-generous > > 125 * nr_cpus leeway. Please, Roman, would you send Andrew a patch > > doing that, to replace the patch in this thread? Or if you prefer, > > I can do so. > > Sure, I can do it. But can you, please, explain why you want them to be > eliminated? Is this because they will never fire, you think? Yes, I've never seen a machine on which vm/stat_refresh reported -16000 or less, or anything approaching that (true, this laptop does not have 128 cpus, but...). Maybe they get reinstalled or otherwise rebooted before reaching numbers of that magnitude. Maybe having the warnings shown at much lower magnitudes has helped to root-cause and eliminate some of them. Waiting until the heat-death-of-the-universe theoretical worst case is so unhelpful as to defeat the purpose of the warning. I think you do understand, but perhaps not all readers of this thread understand, that vm/stat_refresh merges all per-cpu counts into the global atomic immediately before deciding negative. The only problem is that "immediately" is not instantaneous across cpus, so the possibility of work started on one cpu but completed on another during the course of the refresh, causing false negatives, is real though not great. > > In my humble opinion they might be quite useful: any systematic under- or > overflow will eventually trigger this warning, and we'll know for sure that > something is wrong. So I'd add a similar check for node counters without > any hesitation. It's true that while developing, we can all make mistakes so big that "eventually" will show up quickly, and even that warning could help. But since you'll only show them when they reach -16000 (sorry, I keep going back to the 128 cpu case, just to make it more vivid), it won't be of any use to catch races later in development, or in production. Whereas my own patch would just fix the missing items, and continue to annoy you with occasional ignorable warnings - so I cannot submit that, and wouldn't want to submit it right now, without having tried it out for a while, to check what kind of noise it will generate. So I thought it best, either to leave mm/vmstat.c alone for the moment, or else just delete the disputed and incomplete code; coming back to it later when we have something we can agree upon. But I think you're preferring to resubmit your 125*nr_cpus patch to akpm, with the missing NR_VM_NODE_STAT_ITEMS added in (as either one or two patches), with foreshortened testing but the reassurance that since it's so hard to reach the point of showing the warnings, new negatives will either be quiet, or easily fixed before v5.10 released: okay, I can more easily Hack that to my preference than Ack it or Nak it. Hugh
On Wed, Aug 05, 2020 at 08:01:33PM -0700, Hugh Dickins wrote: > On Mon, 3 Aug 2020, Roman Gushchin wrote: > > On Fri, Jul 31, 2020 at 07:17:05PM -0700, Hugh Dickins wrote: > > > On Fri, 31 Jul 2020, Roman Gushchin wrote: > > > > On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote: > > > > > > > > > > Though another alternative did occur to me overnight: we could > > > > > scrap the logged warning, and show "nr_whatever -53" as output > > > > > from /proc/sys/vm/stat_refresh: that too would be acceptable > > > > > to me, and you redirect to /dev/null. > > > > > > > > It sounds like a good idea to me. Do you want me to prepare a patch? > > > > > > Yes, if you like that one best, please do prepare a patch - thanks! > > > > Hi Hugh, > > > > I mastered a patch (attached below), but honestly I can't say I like it. > > The resulting interface is confusing: we don't generally use sysctls to > > print debug data and/or warnings. > > Since you confessed to not liking it yourself, I paid it very little > attention. Yes, when I made that suggestion, I wasn't really thinking > of how stat_refresh is a /proc/sys/vm sysctl thing; and I'm not at all > sure how issuing output from a /proc file intended for input works out > (perhaps there are plenty of good examples, and you followed one, but > it smells fishy to me now). > > > > > I thought about treating a write to this sysctls as setting the threshold, > > so that "echo 0 > /proc/sys/vm/stat_refresh" would warn on all negative > > entries, and "cat /proc/sys/vm/stat_refresh" would use the default threshold > > as in my patch. But this breaks to some extent the current ABI, as passing > > an incorrect value will result in -EINVAL instead of passing (as now). > > I expect we could handle that well enough, by more lenient validation > of the input; though my comment above on output versus input sheds doubt. > > > > > Overall I still think we shouldn't warn on any values inside the possible > > range, as it's not an indication of any kind of error. The only reason > > why we see some values going negative and some not, is that some of them > > are updated more frequently than others, and some are bouncing around > > zero, while other can't reach zero too easily (like the number of free pages). > > We continue to disagree on that (and it amuses me that you who are so > sure they can be ignored, cannot ignore them; whereas I who am so curious > to investigate them, have not actually found the time to do so in years). > It was looking as if nothing could satisfy us both, but... > > > > > Actually, if someone wants to ensure that numbers are accurate, > > we have to temporarily set the threshold to 0, then flush the percpu data > > and only then check atomics. In the current design flushing percpu data > > matters for only slowly updated counters, as all others will run away while > > we're waiting for the flush. So if we're targeting some slowly updating > > counters, maybe we should warn only on them being negative, Idk. > > I was going to look into that angle, though it would probably add a little > unjustifiable overhead to fast paths, and be rejected on that basis. > > But in going to do so, came up against an earlier comment of yours, of > which I had misunderstood the significance. I had said and you replied: > > > > nr_zone_write_pending: yes, I've looked at our machines, and see that > > > showing up for us too (-49 was the worst I saw). Not at all common, > > > but seen. 
And not followed by increasingly worse numbers, so a state > > > that corrects itself. nr_dirty too (fewer instances, bigger numbers); > > > but never nr_writeback, which you'd expect to go along with those. > > > > NR_DIRTY and NR_WRITEBACK are node counters, so we don't check them? > > Wow. Now I see what you were pointing out: when v4.8's 75ef71840539 > ("mm, vmstat: add infrastructure for per-node vmstats") went in, it > missed updating vmstat_refresh() to check all the NR_VM_NODE_STAT items. > > And I've never noticed, and have interpreted its silence on those items > as meaning they're all good (and the nr_dirty ones I mentioned above, > must have been from residual old kernels, hence the fewer instances). > I see the particularly tricky NR_ISOLATED ones are in that category. > Maybe they are all good, but I have been mistaken. > > I shall certainly want to reintroduce those stats to checking for > negatives, even if it's in a patch that never earns your approval, > and just ends up kept internal for debugging. But equally certainly, > I must not suddenly reintroduce that checking without gaining some > experience of it (and perhaps getting as irritated as you by more > transient negatives). > > I said earlier that I'd prefer you to rip out all that checking for > negatives, rather than retaining it with the uselessly over-generous > 125 * nr_cpus leeway. Please, Roman, would you send Andrew a patch > doing that, to replace the patch in this thread? Or if you prefer, > I can do so. Hi Andrew, it seems that Hugh and me haven't reached a consensus here. Can, you, please, not merge this patch into 5.9, so we would have more time to find a solution, acceptable for all? Thank you! Roman
On Thu, 6 Aug 2020, Roman Gushchin wrote:
>
> Hi Andrew,
>
> it seems that Hugh and I haven't reached a consensus here. Could you please not merge this patch into 5.9, so that we have more time to find a solution acceptable to all?
>
> Thank you!
>
> Roman

Thanks, Roman: yes, I agree that's the best we can do right now - Hugh
On Thu, 6 Aug 2020 16:38:04 -0700 Roman Gushchin <guro@fb.com> wrote:

> it seems that Hugh and I haven't reached a consensus here. Could you please not merge this patch into 5.9, so that we have more time to find a solution acceptable to all?

No probs. I already had a big red asterisk on it ;)
On Thu, 6 Aug 2020, Andrew Morton wrote:
> On Thu, 6 Aug 2020 16:38:04 -0700 Roman Gushchin <guro@fb.com> wrote:

August, yikes, I thought it was much more recent.

> > it seems that Hugh and I haven't reached a consensus here. Could you please not merge this patch into 5.9, so that we have more time to find a solution acceptable to all?
>
> No probs. I already had a big red asterisk on it ;)

I've a suspicion that Andrew might be tiring of his big red asterisk, and wanting to unload

mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings.patch
mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix.patch
mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch

into 5.12.

I would prefer not, and reiterate my Nack: but no great harm will befall the cosmos if he overrules that, and it does go through to 5.12 - I'll just want to revert it again later. And I do think a more straightforward way of suppressing those warnings would be just to delete the code that issues them, rather than brushing them under a carpet of overtuning.

I've been running mmotm with the patch below (shown as a sign of good faith, and for you to try, but not ready to go yet) for a few months now - overriding your max_drift, restoring nr_writeback and friends to the same checking, and fixing the obvious reason why nr_zone_write_pending and nr_writeback are occasionally seen negative (an interrupt coming in to decrement those stats before they have even been incremented).

Two big BUTs (if not asterisks): since adding that patch, I have usually forgotten all about it, so forgotten to run the script that echoes /proc/sys/vm/stat_refresh at odd intervals while under load: so I have less data than I'd intended by now. And secondly (I've just checked again this evening), I do still see nr_zone_write_pending and nr_writeback occasionally caught negative while under load. So there's something more at play, perhaps the predicted Gushchin Effect (but wouldn't they go together if so? I've only seen them separately), or maybe something else, I don't know.

Those are the only stats I've seen caught negative, but I don't have CMA configured at all. You mention nr_free_cma as the only(?) other stat you've seen negative; that of course I won't see, but looking at the source I now notice that NR_FREE_CMA_PAGES is incremented and decremented according to page migratetype...

... internally we have another stat that's incremented and decremented according to page migratetype, and that one has been seen negative too: isn't page migratetype something that usually stays the same, but sometimes the migratetype of the page's block can change, even while some pages of it are allocated? Not a stable basis for maintaining stats, though it won't matter much if they are only for display.

vmstat_refresh could just exempt nr_zone_write_pending, nr_writeback and nr_free_cma from warnings, if we cannot find a fix for them: but I see no reason to suppress warnings on all the other vmstats.

The patch I've been testing with:

--- mmotm/mm/page-writeback.c	2021-02-14 14:32:24.000000000 -0800
+++ hughd/mm/page-writeback.c	2021-02-20 18:01:11.264162616 -0800
@@ -2769,6 +2769,13 @@ int __test_set_page_writeback(struct pag
 	int ret, access_ret;
 
 	lock_page_memcg(page);
+	/*
+	 * Increment counts in advance, so that they will not go negative
+	 * if test_clear_page_writeback() comes in to decrement them.
+	 */
+	inc_lruvec_page_state(page, NR_WRITEBACK);
+	inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+
 	if (mapping && mapping_use_writeback_tags(mapping)) {
 		XA_STATE(xas, &mapping->i_pages, page_index(page));
 		struct inode *inode = mapping->host;
@@ -2804,9 +2811,14 @@ int __test_set_page_writeback(struct pag
 	} else {
 		ret = TestSetPageWriteback(page);
 	}
-	if (!ret) {
-		inc_lruvec_page_state(page, NR_WRITEBACK);
-		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+
+	if (WARN_ON_ONCE(ret)) {
+		/*
+		 * Correct counts in retrospect, if PageWriteback was already
+		 * set; but does any filesystem ever allow this to happen?
+		 */
+		dec_lruvec_page_state(page, NR_WRITEBACK);
+		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 	}
 	unlock_page_memcg(page);
 	access_ret = arch_make_page_accessible(page);
--- mmotm/mm/vmstat.c	2021-02-20 17:59:44.838171232 -0800
+++ hughd/mm/vmstat.c	2021-02-20 18:01:11.272162661 -0800
@@ -1865,7 +1865,7 @@ int vmstat_refresh(struct ctl_table *tab
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_zone_stat[i]);
-		if (val < -max_drift) {
+		if (val < 0) {
 			pr_warn("%s: %s %ld\n",
 				__func__, zone_stat_name(i), val);
 			err = -EINVAL;
@@ -1874,13 +1874,21 @@ int vmstat_refresh(struct ctl_table *tab
 #ifdef CONFIG_NUMA
 	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_numa_stat[i]);
-		if (val < -max_drift) {
+		if (val < 0) {
 			pr_warn("%s: %s %ld\n",
 				__func__, numa_stat_name(i), val);
 			err = -EINVAL;
 		}
 	}
 #endif
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+		val = atomic_long_read(&vm_node_stat[i]);
+		if (val < 0) {
+			pr_warn("%s: %s %ld\n",
+				__func__, node_stat_name(i), val);
+			err = -EINVAL;
+		}
+	}
 	if (err)
 		return err;
 	if (write)
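For comparison, the narrower suppression Hugh floats above ("just exempt nr_zone_write_pending, nr_writeback and nr_free_cma") might look something like the sketch below. It is hypothetical: the helper names are invented, and nothing like it has been posted. Note that nr_writeback is a node counter, so it gets its own helper for the node-stat loop.

/* Hypothetical: zone items known to dip below zero transiently. */
static bool zone_stat_negative_ok(enum zone_stat_item item)
{
	return item == NR_ZONE_WRITE_PENDING || item == NR_FREE_CMA_PAGES;
}

/* Hypothetical: likewise for node items (nr_writeback lives here). */
static bool node_stat_negative_ok(enum node_stat_item item)
{
	return item == NR_WRITEBACK;
}

	/* ... the zone loop in vmstat_refresh() would then read: */
	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
		val = atomic_long_read(&vm_zone_stat[i]);
		if (val < 0 && !zone_stat_negative_ok(i)) {
			pr_warn("%s: %s %ld\n",
				__func__, zone_stat_name(i), val);
			err = -EINVAL;
		}
	}

This keeps the checking strict for every stat with no known excuse for going negative, while staying quiet about the three that the thread has actually seen misbehave.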
On Tue, Feb 23, 2021 at 11:24:23PM -0800, Hugh Dickins wrote:
> On Thu, 6 Aug 2020, Andrew Morton wrote:
> > On Thu, 6 Aug 2020 16:38:04 -0700 Roman Gushchin <guro@fb.com> wrote:
>
> August, yikes, I thought it was much more recent.
>
> > > it seems that Hugh and I haven't reached a consensus here. Could you please not merge this patch into 5.9, so that we have more time to find a solution acceptable to all?
> >
> > No probs. I already had a big red asterisk on it ;)
>
> I've a suspicion that Andrew might be tiring of his big red asterisk, and wanting to unload
>
> mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings.patch
> mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix.patch
> mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch
>
> into 5.12.
>
> I would prefer not, and reiterate my Nack: but no great harm will befall the cosmos if he overrules that, and it does go through to 5.12 - I'll just want to revert it again later. And I do think a more straightforward way of suppressing those warnings would be just to delete the code that issues them, rather than brushing them under a carpet of overtuning.

I'm actually fine with either option. My only concern is that somebody who tries to use the hugetlb_cma boot option AND /proc/sys/vm/stat_refresh together will get a false warning, and will either report it to mm@ or waste time trying to debug a non-existent problem. It's not the end of the world.

We could also make the warning conditional on CONFIG_DEBUG_VM, for example.

Please let me know your preferred way forward. Thanks!
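The CONFIG_DEBUG_VM variant Roman mentions could be as small as the following. It is a sketch under the assumption that the checks stay exactly where they are and only the reporting becomes debug-only; IS_ENABLED() lets the compiler drop the branch entirely on production configs.

	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
		val = atomic_long_read(&vm_zone_stat[i]);
		if (IS_ENABLED(CONFIG_DEBUG_VM) && val < 0) {
			/* only debug kernels warn and fail the sysctl */
			pr_warn("%s: %s %ld\n",
				__func__, zone_stat_name(i), val);
			err = -EINVAL;
		}
	}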
On Wed, 24 Feb 2021, Roman Gushchin wrote:
> On Tue, Feb 23, 2021 at 11:24:23PM -0800, Hugh Dickins wrote:
> > On Thu, 6 Aug 2020, Andrew Morton wrote:
> > > On Thu, 6 Aug 2020 16:38:04 -0700 Roman Gushchin <guro@fb.com> wrote:
> >
> > August, yikes, I thought it was much more recent.
> >
> > > > it seems that Hugh and I haven't reached a consensus here. Could you please not merge this patch into 5.9, so that we have more time to find a solution acceptable to all?
> > >
> > > No probs. I already had a big red asterisk on it ;)
> >
> > I've a suspicion that Andrew might be tiring of his big red asterisk, and wanting to unload
> >
> > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings.patch
> > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix.patch
> > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch
> >
> > into 5.12.
> >
> > I would prefer not, and reiterate my Nack: but no great harm will befall the cosmos if he overrules that, and it does go through to 5.12 - I'll just want to revert it again later. And I do think a more straightforward way of suppressing those warnings would be just to delete the code that issues them, rather than brushing them under a carpet of overtuning.
>
> I'm actually fine with either option. My only concern is that somebody who tries to use the hugetlb_cma boot option AND /proc/sys/vm/stat_refresh together will get a false warning, and will either report it to mm@ or waste time trying to debug a non-existent problem. It's not the end of the world.
>
> We could also make the warning conditional on CONFIG_DEBUG_VM, for example.
>
> Please let me know your preferred way forward.

My preferred way forward (for now: since we're all too busy to fix the misbehaving stats) is for Andrew to drop your patch, and I'll post three patches against current 5.12 in a few hours: one to restore the check on the missing NR_VM_NODE_STAT_ITEMS, one to remove the -EINVAL (which upsets test scripts at our end), and one to suppress the warning on nr_zone_write_pending, nr_writeback and nr_free_cma.

Hugh
On Thu, Feb 25, 2021 at 09:21:04AM -0800, Hugh Dickins wrote:
> On Wed, 24 Feb 2021, Roman Gushchin wrote:
> > On Tue, Feb 23, 2021 at 11:24:23PM -0800, Hugh Dickins wrote:
> > > On Thu, 6 Aug 2020, Andrew Morton wrote:
> > > > On Thu, 6 Aug 2020 16:38:04 -0700 Roman Gushchin <guro@fb.com> wrote:
> > >
> > > August, yikes, I thought it was much more recent.
> > >
> > > > > it seems that Hugh and I haven't reached a consensus here. Could you please not merge this patch into 5.9, so that we have more time to find a solution acceptable to all?
> > > >
> > > > No probs. I already had a big red asterisk on it ;)
> > >
> > > I've a suspicion that Andrew might be tiring of his big red asterisk, and wanting to unload
> > >
> > > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings.patch
> > > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix.patch
> > > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch
> > >
> > > into 5.12.
> > >
> > > I would prefer not, and reiterate my Nack: but no great harm will befall the cosmos if he overrules that, and it does go through to 5.12 - I'll just want to revert it again later. And I do think a more straightforward way of suppressing those warnings would be just to delete the code that issues them, rather than brushing them under a carpet of overtuning.
> >
> > I'm actually fine with either option. My only concern is that somebody who tries to use the hugetlb_cma boot option AND /proc/sys/vm/stat_refresh together will get a false warning, and will either report it to mm@ or waste time trying to debug a non-existent problem. It's not the end of the world.
> >
> > We could also make the warning conditional on CONFIG_DEBUG_VM, for example.
> >
> > Please let me know your preferred way forward.
>
> My preferred way forward (for now: since we're all too busy to fix the misbehaving stats) is for Andrew to drop your patch, and I'll post three patches against current 5.12 in a few hours: one to restore the check on the missing NR_VM_NODE_STAT_ITEMS, one to remove the -EINVAL (which upsets test scripts at our end), and one to suppress the warning on nr_zone_write_pending, nr_writeback and nr_free_cma.

I'd totally support it! Please cc me and I'll be happy to review/ack your patches.

Thanks!
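Of the three patches Hugh lists, the -EINVAL removal is the easiest to picture. A sketch, not Hugh's actual patch: keep logging negatives so they can still be chased, but let reads and writes of stat_refresh succeed so that test scripts which cat the file are not tripped up.

	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
		val = atomic_long_read(&vm_zone_stat[i]);
		if (val < 0)
			pr_warn("%s: %s %ld\n",
				__func__, zone_stat_name(i), val);
		/* note: no "err = -EINVAL" - the warning alone suffices */
	}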