
[v2] mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings

Message ID 20200714173920.3319063-1-guro@fb.com (mailing list archive)
State New, archived

Commit Message

Roman Gushchin July 14, 2020, 5:39 p.m. UTC
I've noticed a number of warnings like "vmstat_refresh: nr_free_cma
-5" or "vmstat_refresh: nr_zone_write_pending -11" on our production
hosts. The number of these warnings was relatively low and stable,
so it didn't look like we were systematically leaking the counters.
The corresponding vmstat counters also looked sane.

These warnings are generated by the vmstat_refresh() function, which
assumes that atomic zone and numa counters can't go below zero.
However, on an SMP machine this isn't quite right: due to per-cpu
caching a counter can in theory be as low as -(zone threshold) * NR_CPUs.

For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
reached 0. Then we've reclaimed a small number of cma pages on each
CPU except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are
slightly positive (the atomic counter is still 0). Then somebody on
CPU0 consumes all these pages. The number of pages can easily exceed
the threshold and a negative value will be committed to the atomic
counter.

To fix the problem and avoid generating false warnings, let's relax
the condition and warn only if the value is less than minus the
maximum theoretically possible drift value, which is 125 * the
number of online CPUs. This will still allow us to catch systematic
leaks, but will not generate bogus warnings.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Hugh Dickins <hughd@google.com>
---
 Documentation/admin-guide/sysctl/vm.rst |  4 ++--
 mm/vmstat.c                             | 30 ++++++++++++++++---------
 2 files changed, 21 insertions(+), 13 deletions(-)

Comments

Michal Hocko July 20, 2020, 8:03 a.m. UTC | #1
On Tue 14-07-20 10:39:20, Roman Gushchin wrote:
> I've noticed a number of warnings like "vmstat_refresh: nr_free_cma
> -5" or "vmstat_refresh: nr_zone_write_pending -11" on our production
> hosts. The numbers of these warnings were relatively low and stable,
> so it didn't look like we are systematically leaking the counters.
> The corresponding vmstat counters also looked sane.
> 
> These warnings are generated by the vmstat_refresh() function, which
> assumes that atomic zone and numa counters can't go below zero.
> However, on a SMP machine it's not quite right: due to per-cpu
> caching it can in theory be as low as -(zone threshold) * NR_CPUs.
> 
> For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
> reached 0. Then we've reclaimed a small number of cma pages on each
> CPU except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are
> slightly positive (the atomic counter is still 0). Then somebody on
> CPU0 consumes all these pages. The number of pages can easily exceed
> the threshold and a negative value will be committed to the atomic
> counter.
> 
> To fix the problem and avoid generating false warnings, let's just
> relax the condition and warn only if the value is less than minus
> the maximum theoretically possible drift value, which is 125 *
> number of online CPUs. It will still allow to catch systematic leaks,
> but will not generate bogus warnings.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Hugh Dickins <hughd@google.com>

Acked-by: Michal Hocko <mhocko@suse.com>

One minor nit which can be handled by a separate patch but now that you
are touching this code...

> ---
>  Documentation/admin-guide/sysctl/vm.rst |  4 ++--
>  mm/vmstat.c                             | 30 ++++++++++++++++---------
>  2 files changed, 21 insertions(+), 13 deletions(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index 4b9d2e8e9142..95fb80d0c606 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -822,8 +822,8 @@ e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
>  
>  As a side-effect, it also checks for negative totals (elsewhere reported
>  as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
> -(At time of writing, a few stats are known sometimes to be found negative,
> -with no ill effects: errors and warnings on these stats are suppressed.)
> +(On a SMP machine some stats can temporarily become negative, with no ill
> +effects: errors and warnings on these stats are suppressed.)
>  
>  
>  numa_stat
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index a21140373edb..8f0ef8aaf8ee 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -169,6 +169,8 @@ EXPORT_SYMBOL(vm_node_stat);
>  
>  #ifdef CONFIG_SMP
>  
> +#define MAX_THRESHOLD 125

This would deserve a comment. 88f5acf88ae6a didn't really explain why
this specific value has been selected but the specific value shouldn't
really matter much. I would go with the following at least.
"
Maximum sync threshold for per-cpu vmstat counters. 
"
> +
>  int calculate_pressure_threshold(struct zone *zone)
>  {
>  	int threshold;
> @@ -186,11 +188,9 @@ int calculate_pressure_threshold(struct zone *zone)
>  	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
>  
>  	/*
> -	 * Maximum threshold is 125
> +	 * Threshold is capped by MAX_THRESHOLD
>  	 */
> -	threshold = min(125, threshold);
> -
> -	return threshold;
> +	return min(MAX_THRESHOLD, threshold);
>  }
>  
>  int calculate_normal_threshold(struct zone *zone)
> @@ -610,6 +610,9 @@ void dec_node_page_state(struct page *page, enum node_stat_item item)
>  }
>  EXPORT_SYMBOL(dec_node_page_state);
>  #else
> +
> +#define MAX_THRESHOLD 0
> +
>  /*
>   * Use interrupt disable to serialize counter updates
>   */
> @@ -1810,7 +1813,7 @@ static void refresh_vm_stats(struct work_struct *work)
>  int vmstat_refresh(struct ctl_table *table, int write,
>  		   void *buffer, size_t *lenp, loff_t *ppos)
>  {
> -	long val;
> +	long val, max_drift;
>  	int err;
>  	int i;
>  
> @@ -1821,17 +1824,22 @@ int vmstat_refresh(struct ctl_table *table, int write,
>  	 * pages, immediately after running a test.  /proc/sys/vm/stat_refresh,
>  	 * which can equally be echo'ed to or cat'ted from (by root),
>  	 * can be used to update the stats just before reading them.
> -	 *
> -	 * Oh, and since global_zone_page_state() etc. are so careful to hide
> -	 * transiently negative values, report an error here if any of
> -	 * the stats is negative, so we know to go looking for imbalance.
>  	 */
>  	err = schedule_on_each_cpu(refresh_vm_stats);
>  	if (err)
>  		return err;
> +
> +	/*
> +	 * Since global_zone_page_state() etc. are so careful to hide
> +	 * transiently negative values, report an error here if any of
> +	 * the stats is negative and are less than the maximum drift value,
> +	 * so we know to go looking for imbalance.
> +	 */
> +	max_drift = num_online_cpus() * MAX_THRESHOLD;
> +
>  	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
>  		val = atomic_long_read(&vm_zone_stat[i]);
> -		if (val < 0) {
> +		if (val < -max_drift) {
>  			pr_warn("%s: %s %ld\n",
>  				__func__, zone_stat_name(i), val);
>  			err = -EINVAL;
> @@ -1840,7 +1848,7 @@ int vmstat_refresh(struct ctl_table *table, int write,
>  #ifdef CONFIG_NUMA
>  	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
>  		val = atomic_long_read(&vm_numa_stat[i]);
> -		if (val < 0) {
> +		if (val < -max_drift) {
>  			pr_warn("%s: %s %ld\n",
>  				__func__, numa_stat_name(i), val);
>  			err = -EINVAL;
> -- 
> 2.26.2
Roman Gushchin July 20, 2020, 8:20 p.m. UTC | #2
On Mon, Jul 20, 2020 at 10:03:49AM +0200, Michal Hocko wrote:
> On Tue 14-07-20 10:39:20, Roman Gushchin wrote:
> > I've noticed a number of warnings like "vmstat_refresh: nr_free_cma
> > -5" or "vmstat_refresh: nr_zone_write_pending -11" on our production
> > hosts. The numbers of these warnings were relatively low and stable,
> > so it didn't look like we are systematically leaking the counters.
> > The corresponding vmstat counters also looked sane.
> > 
> > These warnings are generated by the vmstat_refresh() function, which
> > assumes that atomic zone and numa counters can't go below zero.
> > However, on a SMP machine it's not quite right: due to per-cpu
> > caching it can in theory be as low as -(zone threshold) * NR_CPUs.
> > 
> > For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
> > reached 0. Then we've reclaimed a small number of cma pages on each
> > CPU except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are
> > slightly positive (the atomic counter is still 0). Then somebody on
> > CPU0 consumes all these pages. The number of pages can easily exceed
> > the threshold and a negative value will be committed to the atomic
> > counter.
> > 
> > To fix the problem and avoid generating false warnings, let's just
> > relax the condition and warn only if the value is less than minus
> > the maximum theoretically possible drift value, which is 125 *
> > number of online CPUs. It will still allow to catch systematic leaks,
> > but will not generate bogus warnings.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Cc: Hugh Dickins <hughd@google.com>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> One minor nit which can be handled by a separate patch but now that you
> are touching this code...

Thank you!

> 
> > ---
> >  Documentation/admin-guide/sysctl/vm.rst |  4 ++--
> >  mm/vmstat.c                             | 30 ++++++++++++++++---------
> >  2 files changed, 21 insertions(+), 13 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > index 4b9d2e8e9142..95fb80d0c606 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -822,8 +822,8 @@ e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
> >  
> >  As a side-effect, it also checks for negative totals (elsewhere reported
> >  as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
> > -(At time of writing, a few stats are known sometimes to be found negative,
> > -with no ill effects: errors and warnings on these stats are suppressed.)
> > +(On a SMP machine some stats can temporarily become negative, with no ill
> > +effects: errors and warnings on these stats are suppressed.)
> >  
> >  
> >  numa_stat
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index a21140373edb..8f0ef8aaf8ee 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -169,6 +169,8 @@ EXPORT_SYMBOL(vm_node_stat);
> >  
> >  #ifdef CONFIG_SMP
> >  
> > +#define MAX_THRESHOLD 125
> 
> This would deserve a comment. 88f5acf88ae6a didn't really explain why
> this specific value has been selected but the specific value shouldn't
> really matter much. I would go with the following at least.
> "
> Maximum sync threshold for per-cpu vmstat counters. 
> "

Agree. Below is the diff to be squashed into the original patch.

Thanks!

--

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 08e415e0a15d..ddc59b533599 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -167,6 +167,9 @@ EXPORT_SYMBOL(vm_zone_stat);
 EXPORT_SYMBOL(vm_numa_stat);
 EXPORT_SYMBOL(vm_node_stat);
 
+/*
+ * Maximum sync threshold for per-cpu vmstat counters.
+ */
 #ifdef CONFIG_SMP
 #define MAX_THRESHOLD 125
 #else
Hugh Dickins July 30, 2020, 3:45 a.m. UTC | #3
On Tue, 14 Jul 2020, Roman Gushchin wrote:

> I've noticed a number of warnings like "vmstat_refresh: nr_free_cma
> -5" or "vmstat_refresh: nr_zone_write_pending -11" on our production
> hosts. The numbers of these warnings were relatively low and stable,
> so it didn't look like we are systematically leaking the counters.
> The corresponding vmstat counters also looked sane.

nr_zone_write_pending: yes, I've looked at our machines, and see that
showing up for us too (-49 was the worst I saw).  Not at all common,
but seen.  And not followed by increasingly worse numbers, so a state
that corrects itself.  nr_dirty too (fewer instances, bigger numbers);
but never nr_writeback, which you'd expect to go along with those.

I wish I could explain that (I've wondered if somewhere delays
incrementing the stat, and can be interrupted by the decrementer of
the stat before it has reached incrementing), but have not succeeded.
Perhaps it is all down to the vmstat_refresh() skid that you hide in
this patch; but I'm not convinced.

> 
> These warnings are generated by the vmstat_refresh() function, which
> assumes that atomic zone and numa counters can't go below zero.
> However, on a SMP machine it's not quite right: due to per-cpu
> caching it can in theory be as low as -(zone threshold) * NR_CPUs.
> 
> For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
> reached 0. Then we've reclaimed a small number of cma pages on each
> CPU except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are
> slightly positive (the atomic counter is still 0). Then somebody on
> CPU0 consumes all these pages. The number of pages can easily exceed
> the threshold and a negative value will be committed to the atomic
> counter.
> 
> To fix the problem and avoid generating false warnings, let's just
> relax the condition and warn only if the value is less than minus
> the maximum theoretically possible drift value, which is 125 *
> number of online CPUs. It will still allow to catch systematic leaks,
> but will not generate bogus warnings.

Sorry, but despite the acks of others, I want to NAK this in its
present form.

You're right that there's a possibility of a spurious warning,
but is that so terrible?

I'm imagining a threshold of 125, and 128 cpus, and the following
worst case.  Yes, it is possible that when vmstat_refresh() schedules
its refresh on all the cpus, that the first 127 cpus to complete all
sync a count of 0, but immediately after each allocates 125 of whatever
(raising their per-cpu counters without touching the global counter);
and then, before the 128th cpu starts its sync, somehow that 128th
gets to be the lucky cpu to free all of those 127*125 items,
so arriving at a mistaken count of -15875 for that stat.

And I guess you could even devise a test which conspires to do that.
But is it so likely, that it's worth throwing away the warning when
we leak (or perhaps it's unleak) 16000 huge pages?  I don't think so:
I think it makes those warnings pretty much useless, and it would be
better just to delete the code that issues them.

But there's other things you could do, that we can probably agree on.

When stat_refresh first went in, there were two stats (NR_ALLOC_BATCH
and NR_PAGES_SCANNED) which were peculiarly accounted, and gave rise
to negative reports: the original commit just filtered those cases out
in a switch.  Maybe you should filter out NR_ZONE_WRITE_PENDING and
NR_FREE_CMA_PAGES, if there's nothing to fix in their accounting.

I'm not sure exactly what your objection is to the warning: would
you prefer pr_info or pr_debug to pr_warn?  I'd prefer to leave it
as pr_warn, but can compromise.

(IIRC xfstests can fail a test if an unexpected message appears
in the log; but xfstests does not use /proc/sys/vm/stat_refresh.)

But a better idea is perhaps to redefine the behavior of
"echo >/proc/sys/vm/stat_refresh".  What if
"echo someparticularstring >/proc/sys/vm/stat_refresh" were to
disable or enable the warning (permanently? or just that time?):
disable would be more "back-compatible", but I think it's okay
if you prefer enable.  Or "someparticularstring" could actually
specify the warning threshold you want to use - you might echo
125 or 16000, I might echo 0.  We can haggle over the default.

And there's a simpler change we already made internally: we didn't
mind the warning at all, but the accompanying -EINVALs became very
annoying.  A lot of testing scripts have "set -e" in them, and for
test B of feature B to fail because earlier test A of feature A
had tickled a bug in A that wrapped some stat negative, that
was very irritating.  We deleted those "err = -EINVAL;"s -
which might be what's actually most annoying you too?

Nit in this patch called out below.

Hugh

> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Hugh Dickins <hughd@google.com>
> ---
>  Documentation/admin-guide/sysctl/vm.rst |  4 ++--
>  mm/vmstat.c                             | 30 ++++++++++++++++---------
>  2 files changed, 21 insertions(+), 13 deletions(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index 4b9d2e8e9142..95fb80d0c606 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -822,8 +822,8 @@ e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
>  
>  As a side-effect, it also checks for negative totals (elsewhere reported
>  as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
> -(At time of writing, a few stats are known sometimes to be found negative,
> -with no ill effects: errors and warnings on these stats are suppressed.)
> +(On a SMP machine some stats can temporarily become negative, with no ill
> +effects: errors and warnings on these stats are suppressed.)
>  
>  
>  numa_stat
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index a21140373edb..8f0ef8aaf8ee 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -169,6 +169,8 @@ EXPORT_SYMBOL(vm_node_stat);
>  
>  #ifdef CONFIG_SMP
>  
> +#define MAX_THRESHOLD 125
> +
>  int calculate_pressure_threshold(struct zone *zone)
>  {
>  	int threshold;
> @@ -186,11 +188,9 @@ int calculate_pressure_threshold(struct zone *zone)
>  	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
>  
>  	/*
> -	 * Maximum threshold is 125
> +	 * Threshold is capped by MAX_THRESHOLD
>  	 */
> -	threshold = min(125, threshold);
> -
> -	return threshold;
> +	return min(MAX_THRESHOLD, threshold);
>  }
>  
>  int calculate_normal_threshold(struct zone *zone)

calculate_normal_threshold() also contains a 125:
better change that to MAX_THRESHOLD too, if you do pursue this.

> @@ -610,6 +610,9 @@ void dec_node_page_state(struct page *page, enum node_stat_item item)
>  }
>  EXPORT_SYMBOL(dec_node_page_state);
>  #else
> +
> +#define MAX_THRESHOLD 0
> +
>  /*
>   * Use interrupt disable to serialize counter updates
>   */
> @@ -1810,7 +1813,7 @@ static void refresh_vm_stats(struct work_struct *work)
>  int vmstat_refresh(struct ctl_table *table, int write,
>  		   void *buffer, size_t *lenp, loff_t *ppos)
>  {
> -	long val;
> +	long val, max_drift;
>  	int err;
>  	int i;
>  
> @@ -1821,17 +1824,22 @@ int vmstat_refresh(struct ctl_table *table, int write,
>  	 * pages, immediately after running a test.  /proc/sys/vm/stat_refresh,
>  	 * which can equally be echo'ed to or cat'ted from (by root),
>  	 * can be used to update the stats just before reading them.
> -	 *
> -	 * Oh, and since global_zone_page_state() etc. are so careful to hide
> -	 * transiently negative values, report an error here if any of
> -	 * the stats is negative, so we know to go looking for imbalance.
>  	 */
>  	err = schedule_on_each_cpu(refresh_vm_stats);
>  	if (err)
>  		return err;
> +
> +	/*
> +	 * Since global_zone_page_state() etc. are so careful to hide
> +	 * transiently negative values, report an error here if any of
> +	 * the stats is negative and are less than the maximum drift value,
> +	 * so we know to go looking for imbalance.
> +	 */
> +	max_drift = num_online_cpus() * MAX_THRESHOLD;
> +
>  	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
>  		val = atomic_long_read(&vm_zone_stat[i]);
> -		if (val < 0) {
> +		if (val < -max_drift) {
>  			pr_warn("%s: %s %ld\n",
>  				__func__, zone_stat_name(i), val);
>  			err = -EINVAL;
> @@ -1840,7 +1848,7 @@ int vmstat_refresh(struct ctl_table *table, int write,
>  #ifdef CONFIG_NUMA
>  	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
>  		val = atomic_long_read(&vm_numa_stat[i]);
> -		if (val < 0) {
> +		if (val < -max_drift) {
>  			pr_warn("%s: %s %ld\n",
>  				__func__, numa_stat_name(i), val);
>  			err = -EINVAL;
> -- 
> 2.26.2
Roman Gushchin July 30, 2020, 4:23 p.m. UTC | #4
On Wed, Jul 29, 2020 at 08:45:47PM -0700, Hugh Dickins wrote:
> On Tue, 14 Jul 2020, Roman Gushchin wrote:
> 
> > I've noticed a number of warnings like "vmstat_refresh: nr_free_cma
> > -5" or "vmstat_refresh: nr_zone_write_pending -11" on our production
> > hosts. The numbers of these warnings were relatively low and stable,
> > so it didn't look like we are systematically leaking the counters.
> > The corresponding vmstat counters also looked sane.
> 
> nr_zone_write_pending: yes, I've looked at our machines, and see that
> showing up for us too (-49 was the worst I saw).  Not at all common,
> but seen.  And not followed by increasingly worse numbers, so a state
> that corrects itself.  nr_dirty too (fewer instances, bigger numbers);
> but never nr_writeback, which you'd expect to go along with those.

NR_DIRTY and NR_WRITEBACK are node counters, so we don't check them?

> 
> I wish I could explain that (I've wondered if somewhere delays
> incrementing the stat, and can be interrupted by the decrementer of
> the stat before it has reached incrementing), but have not succeeded.
> Perhaps it is all down to the vmstat_refresh() skid that you hide in
> this patch; but I'm not convinced.
> 
> > 
> > These warnings are generated by the vmstat_refresh() function, which
> > assumes that atomic zone and numa counters can't go below zero.
> > However, on a SMP machine it's not quite right: due to per-cpu
> > caching it can in theory be as low as -(zone threshold) * NR_CPUs.
> > 
> > For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
> > reached 0. Then we've reclaimed a small number of cma pages on each
> > CPU except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are
> > slightly positive (the atomic counter is still 0). Then somebody on
> > CPU0 consumes all these pages. The number of pages can easily exceed
> > the threshold and a negative value will be committed to the atomic
> > counter.
> > 
> > To fix the problem and avoid generating false warnings, let's just
> > relax the condition and warn only if the value is less than minus
> > the maximum theoretically possible drift value, which is 125 *
> > number of online CPUs. It will still allow to catch systematic leaks,
> > but will not generate bogus warnings.
> 
> Sorry, but despite the acks of others, I want to NAK this in its
> present form.

Sorry to hear this.
> 
> You're right that there's a possibility of a spurious warning,
> but is that so terrible?

We collect all warnings fleet-wide, and false warnings create noise,
which makes it easier to miss real problems.
Of course, we can filter out these particular warnings, but then
what's the point of generating them?

In my opinion warnings with a bad signal-to-noise ratio aren't good,
because initially they cause somebody to look into the "problem" and
waste their time, and later they are usually just ignored, even when
a real problem appears.

In this particular case I was testing some cma-related changes,
and when I saw a bunch of new warnings (cma was not used on these
hosts before), I was concerned that something was wrong with my changes.

> 
> I'm imagining a threshold of 125, and 128 cpus, and the following
> worst case.  Yes, it is possible that when vmstat_refresh() schedules
> its refresh on all the cpus, that the first 127 cpus to complete all
> sync a count of 0, but immediately after each allocates 125 of whatever
> (raising their per-cpu counters without touching the global counter);
> and then, before the 128th cpu starts its sync, somehow that 128th
> gets to be the lucky cpu to free all of those 127*125 items,
> so arriving at a mistaken count of -15875 for that stat.

First, I have to agree: 125 * the number of CPUs is definitely a very
high number, so it's extremely unlikely that any vmstat value will
reach it in real life. I'm totally happy to go with a (much) lower
limit; I just have no good idea how to justify any particular number
below it. If you can suggest something, I'd appreciate it a lot.
I like the number 400 :)

> 
> And I guess you could even devise a test which conspires to do that.
> But is it so likely, that it's worth throwing away the warning when
> we leak (or perhaps it's unleak) 16000 huge pages?  I don't think so:
> I think it makes those warnings pretty much useless, and it would be
> better just to delete the code that issues them.

Of course, it's an option too (delete the warnings at all).

> 
> But there's other things you could do, that we can probably agree on.
> 
> When stat_refresh first went in, there were two stats (NR_ALLOC_BATCH
> and NR_PAGES_SCANNED) which were peculiarly accounted, and gave rise
> to negative reports: the original commit just filtered those cases out
> in a switch.  Maybe you should filter out NR_ZONE_WRITE_PENDING and
> NR_FREE_CMA_PAGES, if there's nothing to fix in their accounting.

In fact, there is nothing specific to NR_ZONE_WRITE_PENDING and
NR_FREE_CMA_PAGES: any counter whose value often bounces around 0 can
go negative, and it's not an indication of any issue.

> 
> I'm not sure exactly what your objection is to the warning: would
> you prefer pr_info or pr_debug to pr_warn?  I'd prefer to leave it
> as pr_warn, but can compromise.

Yeah, we can go with pr_debug as well.

> 
> (IIRC xfstests can fail a test if an unexpected message appears
> in the log; but xfstests does not use /proc/sys/vm/stat_refresh.)
> 
> But a better idea is perhaps to redefine the behavior of
> "echo >/proc/sys/vm/stat_refresh".  What if
> "echo someparticularstring >/proc/sys/vm/stat_refresh" were to
> disable or enable the warning (permanently? or just that time?):
> disable would be more "back-compatible", but I think it's okay
> if you prefer enable.  Or "someparticularstring" could actually
> specify the warning threshold you want to use - you might echo
> 125 or 16000, I might echo 0.  We can haggle over the default.

May I ask what kind of problems you have in mind that could be
revealed by these warnings? Or maybe there is some history attached?

If it's all about some particular counters, which are known to be
strictly positive, maybe we should do the opposite, and check only
those counters? Because in general it's not an indication of a problem.

> 
> And there's a simpler change we already made internally: we didn't
> mind the warning at all, but the accompanying -EINVALs became very
> annoying.  A lot of testing scripts have "set -e" in them, and for
> test B of feature B to fail because earlier test A of feature A
> had tickled a bug in A that wrapped some stat negative, that
> was very irritating.  We deleted those "err = -EINVAL;"s -
> which might be what's actually most annoying you too?

> 
> Nit in this patch called out below.
> 
> Hugh
> 
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Cc: Hugh Dickins <hughd@google.com>
> > ---
> >  Documentation/admin-guide/sysctl/vm.rst |  4 ++--
> >  mm/vmstat.c                             | 30 ++++++++++++++++---------
> >  2 files changed, 21 insertions(+), 13 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > index 4b9d2e8e9142..95fb80d0c606 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -822,8 +822,8 @@ e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
> >  
> >  As a side-effect, it also checks for negative totals (elsewhere reported
> >  as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
> > -(At time of writing, a few stats are known sometimes to be found negative,
> > -with no ill effects: errors and warnings on these stats are suppressed.)
> > +(On a SMP machine some stats can temporarily become negative, with no ill
> > +effects: errors and warnings on these stats are suppressed.)
> >  
> >  
> >  numa_stat
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index a21140373edb..8f0ef8aaf8ee 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -169,6 +169,8 @@ EXPORT_SYMBOL(vm_node_stat);
> >  
> >  #ifdef CONFIG_SMP
> >  
> > +#define MAX_THRESHOLD 125
> > +
> >  int calculate_pressure_threshold(struct zone *zone)
> >  {
> >  	int threshold;
> > @@ -186,11 +188,9 @@ int calculate_pressure_threshold(struct zone *zone)
> >  	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
> >  
> >  	/*
> > -	 * Maximum threshold is 125
> > +	 * Threshold is capped by MAX_THRESHOLD
> >  	 */
> > -	threshold = min(125, threshold);
> > -
> > -	return threshold;
> > +	return min(MAX_THRESHOLD, threshold);
> >  }
> >  
> >  int calculate_normal_threshold(struct zone *zone)
> 
> calculate_normal_threshold() also contains a 125:
> better change that to MAX_THRESHOLD too, if you do pursue this.

Totally agree, good catch.

Thanks!

Roman
Hugh Dickins July 31, 2020, 4:06 a.m. UTC | #5
On Thu, 30 Jul 2020, Roman Gushchin wrote:
> On Wed, Jul 29, 2020 at 08:45:47PM -0700, Hugh Dickins wrote:
> > 
> > But a better idea is perhaps to redefine the behavior of
> > "echo >/proc/sys/vm/stat_refresh".  What if
> > "echo someparticularstring >/proc/sys/vm/stat_refresh" were to
> > disable or enable the warning (permanently? or just that time?):
> > disable would be more "back-compatible", but I think it's okay
> > if you prefer enable.  Or "someparticularstring" could actually
> > specify the warning threshold you want to use - you might echo
> > 125 or 16000, I might echo 0.  We can haggle over the default.
> 
> May I ask you, what kind of problems you have in your in mind,
> which can be revealed by these warnings? Or maybe there is some
> history attached?

Yes: 52b6f46bc163 mentions finding a bug of mine in NR_ISOLATED_FILE
accounting, but IIRC (though I might be making this up) there was
also a bug in the NR_ACTIVE or NR_INACTIVE FILE or ANON accounting.

When one of the stats used for balancing or limiting in vmscan.c
trends increasingly negative, it becomes increasingly difficult
for those heuristics (adding on to others, comparing with others)
to do what they're intended to do: they behave increasingly weirdly.

Now the same (or the opposite) is true if one of those stats trends
increasingly positive: but if it leaks positive, it's visible in
/proc/vmstat; whereas if it leaks negative, it's presented there as 0.

And most of the time (when unsynchronized) showing 0 is much better
than showing a transient negative.  But to help fix bugs, we do need
some way of seeing the negatives, and vm/stat_refresh provides an
opportunity to do so, when it synchronizes.

I'd be glad not to show the transients if I knew them: set a flag
on any that go negative, and only show if negative twice or more
in a row?  Perhaps, but I don't relish adding that, and think it
would be over-engineering.

It does sound to me like echoing the warning threshold into
/proc/sys/vm/stat_refresh is the best way to satisfy us both.

Though another alternative did occur to me overnight: we could
scrap the logged warning, and show "nr_whatever -53" as output
from /proc/sys/vm/stat_refresh: that too would be acceptable
to me, and you redirect to /dev/null.

(Why did I choose -53 in my example?  An in-joke: when I looked
through our machines for these warnings, on old kernels with my
old shmem hugepage implementation, there were a striking number
with "nr_shmem_freeholes -53"; but I'm a few years too late to
investigate what was going on there.)

> 
> If it's all about some particular counters, which are known to be
> strictly positive, maybe we should do the opposite, and check only
> those counters? Because in general it's not an indication of a problem.

Yet it's very curious how few stats ever generate such warnings:
you're convinced they're just transient noise, and you're probably right;
but I am a little suspicious of whether they are accounted correctly.

Hugh
Roman Gushchin Aug. 1, 2020, 1:18 a.m. UTC | #6
On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote:
> On Thu, 30 Jul 2020, Roman Gushchin wrote:
> > On Wed, Jul 29, 2020 at 08:45:47PM -0700, Hugh Dickins wrote:
> > > 
> > > But a better idea is perhaps to redefine the behavior of
> > > "echo >/proc/sys/vm/stat_refresh".  What if
> > > "echo someparticularstring >/proc/sys/vm/stat_refresh" were to
> > > disable or enable the warning (permanently? or just that time?):
> > > disable would be more "back-compatible", but I think it's okay
> > > if you prefer enable.  Or "someparticularstring" could actually
> > > specify the warning threshold you want to use - you might echo
> > > 125 or 16000, I might echo 0.  We can haggle over the default.
> > 
> > May I ask you, what kind of problems you have in your in mind,
> > which can be revealed by these warnings? Or maybe there is some
> > history attached?
> 
> Yes: 52b6f46bc163 mentions finding a bug of mine in NR_ISOLATED_FILE
> accounting, but IIRC (though I might be making this up) there was
> also a bug in the NR_ACTIVE or NR_INACTIVE FILE or ANON accounting.
> 
> When one of the stats used for balancing or limiting in vmscan.c
> trends increasingly negative, it becomes increasingly difficult
> for those heuristics (adding on to others, comparing with others)
> to do what they're intended to do: they behave increasingly weirdly.
> 
> Now the same (or the opposite) is true if one of those stats trends
> increasingly positive: but if it leaks positive, it's visible in
> /proc/vmstat; whereas if it leaks negative, it's presented there as 0.
> 
> And most of the time (when unsynchronized) showing 0 is much better
> than showing a transient negative.  But to help fix bugs, we do need
> some way of seeing the negatives, and vm/stat_refresh provides an
> opportunity to do so, when it synchronizes.
> 
> I'd be glad not to show the transients if I knew them: set a flag
> on any that go negative, and only show if negative twice or more
> in a row?  Perhaps, but I don't relish adding that, and think it
> would be over-engineering.
> 
> It does sound to me like echoing the warning threshold into
> /proc/sys/vm/stat_refresh is the best way to satisfy us both.
> 
> Though another alternative did occur to me overnight: we could
> scrap the logged warning, and show "nr_whatever -53" as output
> from /proc/sys/vm/stat_refresh: that too would be acceptable
> to me, and you redirect to /dev/null.

It sounds like a good idea to me. Do you want me to prepare a patch?

> 
> (Why did I choose -53 in my example?  An in-joke: when I looked
> through our machines for these warnings, on old kernels with my
> old shmem hugepage implementation, there were a striking number
> with "nr_shmem_freeholes -53"; but I'm a few years too late to
> investigate what was going on there.)

:)

> 
> > 
> > If it's all about some particular counters, which are known to be
> > strictly positive, maybe we should do the opposite, and check only
> > those counters? Because in general it's not an indication of a problem.
> 
> Yet it's very curious how few stats ever generate such warnings:
> you're convinced they're just transient noise, and you're probably right;
> but I am a little suspicious of whether they are accounted correctly.

Yeah, I was initially very suspicious too, but I didn't find any issues
and still think it's not an indication of a problem.

Thank you!
Hugh Dickins Aug. 1, 2020, 2:17 a.m. UTC | #7
On Fri, 31 Jul 2020, Roman Gushchin wrote:
> On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote:
> > 
> > Though another alternative did occur to me overnight: we could
> > scrap the logged warning, and show "nr_whatever -53" as output
> > from /proc/sys/vm/stat_refresh: that too would be acceptable
> > to me, and you redirect to /dev/null.
> 
> It sounds like a good idea to me. Do you want me to prepare a patch?

Yes, if you like that one best, please do prepare a patch - thanks!

Hugh
Roman Gushchin Aug. 4, 2020, 12:40 a.m. UTC | #8
On Fri, Jul 31, 2020 at 07:17:05PM -0700, Hugh Dickins wrote:
> On Fri, 31 Jul 2020, Roman Gushchin wrote:
> > On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote:
> > > 
> > > Though another alternative did occur to me overnight: we could
> > > scrap the logged warning, and show "nr_whatever -53" as output
> > > from /proc/sys/vm/stat_refresh: that too would be acceptable
> > > to me, and you redirect to /dev/null.
> > 
> > It sounds like a good idea to me. Do you want me to prepare a patch?
> 
> Yes, if you like that one best, please do prepare a patch - thanks!

Hi Hugh,

I've drafted a patch (attached below), but honestly I can't say I like it.
The resulting interface is confusing: we don't generally use sysctls to
print debug data and/or warnings.

I thought about treating a write to this sysctl as setting the threshold,
so that "echo 0 > /proc/sys/vm/stat_refresh" would warn on all negative
entries, and "cat /proc/sys/vm/stat_refresh" would use the default threshold
as in my patch. But this breaks the current ABI to some extent, as passing
an incorrect value will result in -EINVAL instead of being accepted (as now).

Overall I still think we shouldn't warn on any values inside the possible
range, as it's not an indication of any kind of error. The only reason
why we see some values going negative and some not, is that some of them
are updated more frequently than others, and some are bouncing around
zero, while other can't reach zero too easily (like the number of free pages).

Actually, if someone wants to ensure that numbers are accurate,
we have to temporarily set the threshold to 0, then flush the percpu data
and only then check atomics. In the current design flushing percpu data
matters for only slowly updated counters, as all others will run away while
we're waiting for the flush. So if we're targeting some slowly updating
counters, maybe we should warn only on them being negative, Idk.

Thanks!

--

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6eecfcbbfe79..1d2f2471bb78 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1815,13 +1815,28 @@ static void refresh_vm_stats(struct work_struct *work)
        refresh_cpu_vm_stats(true);
 }
 
+static void warn_vmstat(void *buffer, size_t *lenp, loff_t *ppos,
+                       const char *name, long val)
+{
+       int len;
+
+       len = snprintf(buffer + *ppos, *lenp, "%s %ld\n", name, val);
+       *lenp -= len;
+       *ppos += len;
+}
+
 int vmstat_refresh(struct ctl_table *table, int write,
                   void *buffer, size_t *lenp, loff_t *ppos)
 {
-       long val, max_drift;
+       long val;
        int err;
        int i;
 
+       if (!*lenp || (*ppos && !write)) {
+               *lenp = 0;
+               return 0;
+       }
+
        /*
         * The regular update, every sysctl_stat_interval, may come later
         * than expected: leaving a significant amount in per_cpu buckets.
@@ -1837,35 +1852,24 @@ int vmstat_refresh(struct ctl_table *table, int write,
        /*
         * Since global_zone_page_state() etc. are so careful to hide
         * transiently negative values, report an error here if any of
-        * the stats is negative and are less than the maximum drift value,
-        * so we know to go looking for imbalance.
+        * the stats is negative, so we know to go looking for imbalance.
         */
-       max_drift = num_online_cpus() * MAX_THRESHOLD;
-
        for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
                val = atomic_long_read(&vm_zone_stat[i]);
-               if (val < -max_drift) {
-                       pr_warn("%s: %s %ld\n",
-                               __func__, zone_stat_name(i), val);
-                       err = -EINVAL;
-               }
+               if (!write && val < 0)
+                       warn_vmstat(buffer, lenp, ppos, zone_stat_name(i), val);
        }
 #ifdef CONFIG_NUMA
        for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
                val = atomic_long_read(&vm_numa_stat[i]);
-               if (val < -max_drift) {
-                       pr_warn("%s: %s %ld\n",
-                               __func__, numa_stat_name(i), val);
-                       err = -EINVAL;
-               }
+               if (!write && val < 0)
+                       warn_vmstat(buffer, lenp, ppos, numa_stat_name(i), val);
        }
 #endif
        if (err)
                return err;
        if (write)
                *ppos += *lenp;
-       else
-               *lenp = 0;
        return 0;
 }
 #endif /* CONFIG_PROC_FS */
Hugh Dickins Aug. 6, 2020, 3:01 a.m. UTC | #9
On Mon, 3 Aug 2020, Roman Gushchin wrote:
> On Fri, Jul 31, 2020 at 07:17:05PM -0700, Hugh Dickins wrote:
> > On Fri, 31 Jul 2020, Roman Gushchin wrote:
> > > On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote:
> > > > 
> > > > Though another alternative did occur to me overnight: we could
> > > > scrap the logged warning, and show "nr_whatever -53" as output
> > > > from /proc/sys/vm/stat_refresh: that too would be acceptable
> > > > to me, and you redirect to /dev/null.
> > > 
> > > It sounds like a good idea to me. Do you want me to prepare a patch?
> > 
> > Yes, if you like that one best, please do prepare a patch - thanks!
> 
> Hi Hugh,
> 
> > I've drafted a patch (attached below), but honestly I can't say I like it.
> The resulting interface is confusing: we don't generally use sysctls to
> print debug data and/or warnings.

Since you confessed to not liking it yourself, I paid it very little
attention.  Yes, when I made that suggestion, I wasn't really thinking
of how stat_refresh is a /proc/sys/vm sysctl thing; and I'm not at all
sure how issuing output from a /proc file intended for input works out
(perhaps there are plenty of good examples, and you followed one, but
it smells fishy to me now).

> 
> I thought about treating a write to this sysctl as setting the threshold,
> so that "echo 0 > /proc/sys/vm/stat_refresh" would warn on all negative
> entries, and "cat /proc/sys/vm/stat_refresh" would use the default threshold
> as in my patch. But this breaks the current ABI to some extent, as passing
> an incorrect value will result in -EINVAL instead of being accepted (as now).

I expect we could handle that well enough, by more lenient validation
of the input; though my comment above on output versus input sheds doubt.

> 
> Overall I still think we shouldn't warn on any values inside the possible
> range, as it's not an indication of any kind of error. The only reason
> why we see some values going negative and some not, is that some of them
> are updated more frequently than others, and some are bouncing around
> zero, while other can't reach zero too easily (like the number of free pages).

We continue to disagree on that (and it amuses me that you who are so
sure they can be ignored, cannot ignore them; whereas I who am so curious
to investigate them, have not actually found the time to do so in years).
It was looking as if nothing could satisfy us both, but...

> 
> Actually, if someone wants to ensure that numbers are accurate,
> we have to temporarily set the threshold to 0, then flush the percpu data
> and only then check atomics. In the current design flushing percpu data
> matters for only slowly updated counters, as all others will run away while
> we're waiting for the flush. So if we're targeting some slowly updating
> counters, maybe we should warn only on them being negative, Idk.

I was going to look into that angle, though it would probably add a little
unjustifiable overhead to fast paths, and be rejected on that basis.

But in going to do so, came up against an earlier comment of yours, of
which I had misunderstood the significance. I had said and you replied:

> > nr_zone_write_pending: yes, I've looked at our machines, and see that
> > showing up for us too (-49 was the worst I saw).  Not at all common,
> > but seen.  And not followed by increasingly worse numbers, so a state
> > that corrects itself.  nr_dirty too (fewer instances, bigger numbers);
> > but never nr_writeback, which you'd expect to go along with those.
> 
> NR_DIRTY and NR_WRITEBACK are node counters, so we don't check them?

Wow. Now I see what you were pointing out: when v4.8's 75ef71840539
("mm, vmstat: add infrastructure for per-node vmstats") went in, it
missed updating vmstat_refresh() to check all the NR_VM_NODE_STAT items.

And I've never noticed, and have interpreted its silence on those items
as meaning they're all good (and the nr_dirty ones I mentioned above,
must have been from residual old kernels, hence the fewer instances).
I see the particularly tricky NR_ISOLATED ones are in that category.
Maybe they are all good, but I have been mistaken.

I shall certainly want to reintroduce those stats to checking for
negatives, even if it's in a patch that never earns your approval,
and just ends up kept internal for debugging.  But equally certainly,
I must not suddenly reintroduce that checking without gaining some
experience of it (and perhaps getting as irritated as you by more
transient negatives).

I said earlier that I'd prefer you to rip out all that checking for
negatives, rather than retaining it with the uselessly over-generous
125 * nr_cpus leeway.  Please, Roman, would you send Andrew a patch
doing that, to replace the patch in this thread?  Or if you prefer,
I can do so.

Thanks,
Hugh
Roman Gushchin Aug. 6, 2020, 3:51 a.m. UTC | #10
On Wed, Aug 05, 2020 at 08:01:33PM -0700, Hugh Dickins wrote:
> On Mon, 3 Aug 2020, Roman Gushchin wrote:
> > On Fri, Jul 31, 2020 at 07:17:05PM -0700, Hugh Dickins wrote:
> > > On Fri, 31 Jul 2020, Roman Gushchin wrote:
> > > > On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote:
> > > > > 
> > > > > Though another alternative did occur to me overnight: we could
> > > > > scrap the logged warning, and show "nr_whatever -53" as output
> > > > > from /proc/sys/vm/stat_refresh: that too would be acceptable
> > > > > to me, and you redirect to /dev/null.
> > > > 
> > > > It sounds like a good idea to me. Do you want me to prepare a patch?
> > > 
> > > Yes, if you like that one best, please do prepare a patch - thanks!
> > 
> > Hi Hugh,
> > 
> > I've drafted a patch (attached below), but honestly I can't say I like it.
> > The resulting interface is confusing: we don't generally use sysctls to
> > print debug data and/or warnings.
> 
> Since you confessed to not liking it yourself, I paid it very little
> attention.  Yes, when I made that suggestion, I wasn't really thinking
> of how stat_refresh is a /proc/sys/vm sysctl thing; and I'm not at all
> sure how issuing output from a /proc file intended for input works out
> (perhaps there are plenty of good examples, and you followed one, but
> it smells fishy to me now).
> 
> > 
> > I thought about treating a write to this sysctl as setting the threshold,
> > so that "echo 0 > /proc/sys/vm/stat_refresh" would warn on all negative
> > entries, and "cat /proc/sys/vm/stat_refresh" would use the default threshold
> > as in my patch. But this breaks the current ABI to some extent, as passing
> > an incorrect value will result in -EINVAL instead of being accepted (as now).
> 
> I expect we could handle that well enough, by more lenient validation
> of the input; though my comment above on output versus input sheds doubt.
> 
> > 
> > Overall I still think we shouldn't warn on any values inside the possible
> > range, as it's not an indication of any kind of error. The only reason
> > why we see some values going negative and some not, is that some of them
> > are updated more frequently than others, and some are bouncing around
> > zero, while other can't reach zero too easily (like the number of free pages).
> 
> We continue to disagree on that (and it amuses me that you who are so
> sure they can be ignored, cannot ignore them; whereas I who am so curious
> to investigate them, have not actually found the time to do so in years).
> It was looking as if nothing could satisfy us both, but...

I can only repeat my understanding here: with the current implementation
the measured number can vary in range of
  (true_value - zone_threshold * NR_CPUS,
   true_value + zone_threshold * NR_CPUS).
zone_threshold depends on the size of a zone and the number of CPUs,
but cannot exceed 125.

Of course, most likely measured numbers are mostly distributed somewhere
close to the real number, and reaching distant ends of this range is
unlikely. But it's a question of probability.

So if the true value is close to 0, there are high chances of getting
negative measured numbers. The bigger is the value, the lower are these
chances. And if it's bigger than the maximal drift, these chances are 0.

So we can be sure that a measured value can't go negative only if we know
for sure that the true number is bigger than zone_threshold * NR_CPUS.

You can, probably, say that if the chances of getting a negative value
are really really low, it's better to spawn a warning, rather than miss
a potential error. I'd happily agree, if we'd have a nice formula
to calculate the tolerance by the given probability. But if we'll treat
all negative numbers as warnings, we'll just end with a lot of false
warnings.

> 
> > 
> > Actually, if someone wants to ensure that numbers are accurate,
> > we have to temporarily set the threshold to 0, then flush the percpu data
> > and only then check atomics. In the current design flushing percpu data
> > matters for only slowly updated counters, as all others will run away while
> > we're waiting for the flush. So if we're targeting some slowly updating
> > counters, maybe we should warn only on them being negative, Idk.
> 
> I was going to look into that angle, though it would probably add a little
> unjustifiable overhead to fast paths, and be rejected on that basis.

I'd expect it. What I think can be acceptable is to have different tolerance
for different counters, if there is a good reason to have more precise values
for some counters.
I did a similar thing in the "new slab controller" patchset for memcg
slab statistics, which required a different threshold because they are measured
in bytes (all other metrics were historically in pages).

> 
> But in going to do so, came up against an earlier comment of yours, of
> which I had misunderstood the significance. I had said and you replied:
> 
> > > nr_zone_write_pending: yes, I've looked at our machines, and see that
> > > showing up for us too (-49 was the worst I saw).  Not at all common,
> > > but seen.  And not followed by increasingly worse numbers, so a state
> > > that corrects itself.  nr_dirty too (fewer instances, bigger numbers);
> > > but never nr_writeback, which you'd expect to go along with those.
> > 
> > NR_DIRTY and NR_WRITEBACK are node counters, so we don't check them?
> 
> Wow. Now I see what you were pointing out: when v4.8's 75ef71840539
> ("mm, vmstat: add infrastructure for per-node vmstats") went in, it
> missed updating vmstat_refresh() to check all the NR_VM_NODE_STAT items.
> 
> And I've never noticed, and have interpreted its silence on those items
> as meaning they're all good (and the nr_dirty ones I mentioned above,
> must have been from residual old kernels, hence the fewer instances).
> I see the particularly tricky NR_ISOLATED ones are in that category.
> Maybe they are all good, but I have been mistaken.
> 
> I shall certainly want to reintroduce those stats to checking for
> negatives, even if it's in a patch that never earns your approval,
> and just ends up kept internal for debugging.  But equally certainly,
> I must not suddenly reintroduce that checking without gaining some
> experience of it (and perhaps getting as irritated as you by more
> transient negatives).
> 
> I said earlier that I'd prefer you to rip out all that checking for
> negatives, rather than retaining it with the uselessly over-generous
> 125 * nr_cpus leeway.  Please, Roman, would you send Andrew a patch
> doing that, to replace the patch in this thread?  Or if you prefer,
> I can do so.

Sure, I can do it. But can you please explain why you want them to be
eliminated? Is it because you think they will never fire?

In my humble opinion they might be quite useful: any systematic under- or
overflow will eventually trigger this warning, and we'll know for sure that
something is wrong. So I'd add a similar check for node counters without
any hesitation.

I wouldn't oppose such a patch anyway, just want to make sure we are
on the same page here.

Thanks!
Hugh Dickins Aug. 6, 2020, 4:41 p.m. UTC | #11
On Wed, 5 Aug 2020, Roman Gushchin wrote:
> On Wed, Aug 05, 2020 at 08:01:33PM -0700, Hugh Dickins wrote:
> > 
> > I shall certainly want to reintroduce those stats to checking for
> > negatives, even if it's in a patch that never earns your approval,
> > and just ends up kept internal for debugging.  But equally certainly,
> > I must not suddenly reintroduce that checking without gaining some
> > experience of it (and perhaps getting as irritated as you by more
> > transient negatives).
> > 
> > I said earlier that I'd prefer you to rip out all that checking for
> > negatives, rather than retaining it with the uselessly over-generous
> > 125 * nr_cpus leeway.  Please, Roman, would you send Andrew a patch
> > doing that, to replace the patch in this thread?  Or if you prefer,
> > I can do so.
> 
> Sure, I can do it. But can you please explain why you want them to be
> eliminated? Is it because you think they will never fire?

Yes, I've never seen a machine on which vm/stat_refresh reported -16000
or less, or anything approaching that (true, this laptop does not have
128 cpus, but...).  Maybe they get reinstalled or otherwise rebooted
before reaching numbers of that magnitude.  Maybe having the warnings
shown at much lower magnitudes has helped to root-cause and eliminate
some of them.

Waiting until the heat-death-of-the-universe theoretical worst case is
so unhelpful as to defeat the purpose of the warning.  I think you do
understand, but perhaps not all readers of this thread understand,
that vm/stat_refresh merges all per-cpu counts into the global atomic
immediately before deciding negative.  The only problem is that
"immediately" is not instantaneous across cpus, so the possibility
of work started on one cpu but completed on another during the course
of the refresh, causing false negatives, is real though not great.

> 
> In my humble opinion they might be quite useful: any systematic under- or
> overflow will eventually trigger this warning, and we'll know for sure that
> something is wrong. So I'd add a similar check for node counters without
> any hesitation.

It's true that while developing, we can all make mistakes so big that
"eventually" will show up quickly, and even that warning could help.
But since you'll only show them when they reach -16000 (sorry, I keep
going back to the 128 cpu case, just to make it more vivid), it won't
be of any use to catch races later in development, or in production.

Whereas my own patch would just fix the missing items, and continue
to annoy you with occasional ignorable warnings - so I cannot submit
that, and wouldn't want to submit it right now, without having tried
it out for a while, to check what kind of noise it will generate.

So I thought it best, either to leave mm/vmstat.c alone for the moment,
or else just delete the disputed and incomplete code; coming back to
it later when we have something we can agree upon.

But I think you're preferring to resubmit your 125*nr_cpus patch to akpm,
with the missing NR_VM_NODE_STAT_ITEMS added in (as either one or two
patches), with foreshortened testing but the reassurance that since
it's so hard to reach the point of showing the warnings, new negatives
will either be quiet, or easily fixed before v5.10 released: okay,
I can more easily Hack that to my preference than Ack it or Nak it.

Hugh
Roman Gushchin Aug. 6, 2020, 11:38 p.m. UTC | #12
On Wed, Aug 05, 2020 at 08:01:33PM -0700, Hugh Dickins wrote:
> On Mon, 3 Aug 2020, Roman Gushchin wrote:
> > On Fri, Jul 31, 2020 at 07:17:05PM -0700, Hugh Dickins wrote:
> > > On Fri, 31 Jul 2020, Roman Gushchin wrote:
> > > > On Thu, Jul 30, 2020 at 09:06:55PM -0700, Hugh Dickins wrote:
> > > > > 
> > > > > Though another alternative did occur to me overnight: we could
> > > > > scrap the logged warning, and show "nr_whatever -53" as output
> > > > > from /proc/sys/vm/stat_refresh: that too would be acceptable
> > > > > to me, and you redirect to /dev/null.
> > > > 
> > > > It sounds like a good idea to me. Do you want me to prepare a patch?
> > > 
> > > Yes, if you like that one best, please do prepare a patch - thanks!
> > 
> > Hi Hugh,
> > 
> > I've drafted a patch (attached below), but honestly I can't say I like it.
> > The resulting interface is confusing: we don't generally use sysctls to
> > print debug data and/or warnings.
> 
> Since you confessed to not liking it yourself, I paid it very little
> attention.  Yes, when I made that suggestion, I wasn't really thinking
> of how stat_refresh is a /proc/sys/vm sysctl thing; and I'm not at all
> sure how issuing output from a /proc file intended for input works out
> (perhaps there are plenty of good examples, and you followed one, but
> it smells fishy to me now).
> 
> > 
> > I thought about treating a write to this sysctl as setting the threshold,
> > so that "echo 0 > /proc/sys/vm/stat_refresh" would warn on all negative
> > entries, and "cat /proc/sys/vm/stat_refresh" would use the default threshold
> > as in my patch. But this breaks the current ABI to some extent, as passing
> > an incorrect value will result in -EINVAL instead of being accepted (as now).
> 
> I expect we could handle that well enough, by more lenient validation
> of the input; though my comment above on output versus input sheds doubt.
> 
> > 
> > Overall I still think we shouldn't warn on any values inside the possible
> > range, as it's not an indication of any kind of error. The only reason
> > why we see some values going negative and some not, is that some of them
> > are updated more frequently than others, and some are bouncing around
> > zero, while other can't reach zero too easily (like the number of free pages).
> 
> We continue to disagree on that (and it amuses me that you who are so
> sure they can be ignored, cannot ignore them; whereas I who am so curious
> to investigate them, have not actually found the time to do so in years).
> It was looking as if nothing could satisfy us both, but...
> 
> > 
> > Actually, if someone wants to ensure that numbers are accurate,
> > we have to temporarily set the threshold to 0, then flush the percpu data
> > and only then check atomics. In the current design flushing percpu data
> > matters for only slowly updated counters, as all others will run away while
> > we're waiting for the flush. So if we're targeting some slowly updating
> > counters, maybe we should warn only on them being negative, Idk.
> 
> I was going to look into that angle, though it would probably add a little
> unjustifiable overhead to fast paths, and be rejected on that basis.
> 
> But in going to do so, came up against an earlier comment of yours, of
> which I had misunderstood the significance. I had said and you replied:
> 
> > > nr_zone_write_pending: yes, I've looked at our machines, and see that
> > > showing up for us too (-49 was the worst I saw).  Not at all common,
> > > but seen.  And not followed by increasingly worse numbers, so a state
> > > that corrects itself.  nr_dirty too (fewer instances, bigger numbers);
> > > but never nr_writeback, which you'd expect to go along with those.
> > 
> > NR_DIRTY and NR_WRITEBACK are node counters, so we don't check them?
> 
> Wow. Now I see what you were pointing out: when v4.8's 75ef71840539
> ("mm, vmstat: add infrastructure for per-node vmstats") went in, it
> missed updating vmstat_refresh() to check all the NR_VM_NODE_STAT items.
> 
> And I've never noticed, and have interpreted its silence on those items
> as meaning they're all good (and the nr_dirty ones I mentioned above,
> must have been from residual old kernels, hence the fewer instances).
> I see the particularly tricky NR_ISOLATED ones are in that category.
> Maybe they are all good, but I have been mistaken.
> 
> I shall certainly want to reintroduce those stats to checking for
> negatives, even if it's in a patch that never earns your approval,
> and just ends up kept internal for debugging.  But equally certainly,
> I must not suddenly reintroduce that checking without gaining some
> experience of it (and perhaps getting as irritated as you by more
> transient negatives).
> 
> I said earlier that I'd prefer you to rip out all that checking for
> negatives, rather than retaining it with the uselessly over-generous
> 125 * nr_cpus leeway.  Please, Roman, would you send Andrew a patch
> doing that, to replace the patch in this thread?  Or if you prefer,
> I can do so.

Hi Andrew,

it seems that Hugh and I haven't reached a consensus here.
Can you please not merge this patch into 5.9, so we would have
more time to find a solution acceptable to all?

Thank you!

Roman
Hugh Dickins Aug. 7, 2020, 12:16 a.m. UTC | #13
On Thu, 6 Aug 2020, Roman Gushchin wrote:
> 
> Hi Andrew,
> 
> it seems that Hugh and I haven't reached a consensus here.
> Can you please not merge this patch into 5.9, so we would have
> more time to find a solution acceptable to all?
> 
> Thank you!
> 
> Roman

Thanks, Roman: yes, I agree that's the best we can do right now - Hugh
Andrew Morton Aug. 7, 2020, 1:25 a.m. UTC | #14
On Thu, 6 Aug 2020 16:38:04 -0700 Roman Gushchin <guro@fb.com> wrote:

> it seems that Hugh and I haven't reached a consensus here.
> Can you please not merge this patch into 5.9, so we would have
> more time to find a solution acceptable to all?

No probs.  I already had a big red asterisk on it ;)
Hugh Dickins Feb. 24, 2021, 7:24 a.m. UTC | #15
On Thu, 6 Aug 2020, Andrew Morton wrote:
> On Thu, 6 Aug 2020 16:38:04 -0700 Roman Gushchin <guro@fb.com> wrote:

August, yikes, I thought it was much more recent.

> 
> > it seems that Hugh and I haven't reached a consensus here.
> > Can you please not merge this patch into 5.9, so we would have
> > more time to find a solution acceptable to all?
> 
> No probs.  I already had a big red asterisk on it ;)

I've a suspicion that Andrew might be tiring of his big red asterisk,
and wanting to unload
mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings.patch
mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix.patch
mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch
into 5.12.

I would prefer not, and reiterate my Nack: but no great harm will
befall the cosmos if he overrules that, and it does go through to
5.12 - I'll just want to revert it again later.  And I do think a
more straightforward way of suppressing those warnings would be just
to delete the code that issues them, rather than brushing them under
a carpet of overtuning.

I've been running mmotm with the patch below (shown as sign of good
faith, and for you to try, but not ready to go yet) for a few months
now - overriding your max_drift, restoring nr_writeback and friends to
the same checking, fixing the obvious reason why nr_zone_write_pending
and nr_writeback are seen negative occasionally (interrupt interrupting
to decrement those stats before they have even been incremented).

Two big BUTs (if not asterisks): since adding that patch, I have
usually forgotten all about it, so forgotten to run the script that
echoes /proc/sys/vm/stat_refresh at odd intervals while under load:
so have less data than I'd intended by now.  And secondly (and I've
just checked again this evening) I do still see nr_zone_write_pending
and nr_writeback occasionally caught negative while under load.  So,
there's something more at play, perhaps the predicted Gushchin Effect
(but wouldn't they go together if so? I've only seen them separately),
or maybe something else, I don't know.

Those are the only stats I've seen caught negative, but I don't have
CMA configured at all.  You mention nr_free_cma as the only(?) other
stat you've seen negative, that of course I won't see, but looking
at the source I now notice that NR_FREE_CMA_PAGES is incremented
and decremented according to page migratetype...

... internally we have another stat that's incremented and decremented
according to page migratetype, and that one has been seen negative too:
isn't page migratetype something that usually stays the same, but
sometimes the migratetype of the page's block can change, even while
some pages of it are allocated?  Not a stable basis for maintaining
stats, though won't matter much if they are only for display.

vmstat_refresh could just exempt nr_zone_write_pending, nr_writeback
and nr_free_cma from warnings, if we cannot find a fix to them: but
I see no reason to suppress warnings on all the other vmstats.

The patch I've been testing with:

--- mmotm/mm/page-writeback.c	2021-02-14 14:32:24.000000000 -0800
+++ hughd/mm/page-writeback.c	2021-02-20 18:01:11.264162616 -0800
@@ -2769,6 +2769,13 @@ int __test_set_page_writeback(struct pag
 	int ret, access_ret;
 
 	lock_page_memcg(page);
+	/*
+	 * Increment counts in advance, so that they will not go negative
+	 * if test_clear_page_writeback() comes in to decrement them.
+	 */
+	inc_lruvec_page_state(page, NR_WRITEBACK);
+	inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+
 	if (mapping && mapping_use_writeback_tags(mapping)) {
 		XA_STATE(xas, &mapping->i_pages, page_index(page));
 		struct inode *inode = mapping->host;
@@ -2804,9 +2811,14 @@ int __test_set_page_writeback(struct pag
 	} else {
 		ret = TestSetPageWriteback(page);
 	}
-	if (!ret) {
-		inc_lruvec_page_state(page, NR_WRITEBACK);
-		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+
+	if (WARN_ON_ONCE(ret)) {
+		/*
+		 * Correct counts in retrospect, if PageWriteback was already
+		 * set; but does any filesystem ever allow this to happen?
+		 */
+		dec_lruvec_page_state(page, NR_WRITEBACK);
+		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 	}
 	unlock_page_memcg(page);
 	access_ret = arch_make_page_accessible(page);
--- mmotm/mm/vmstat.c	2021-02-20 17:59:44.838171232 -0800
+++ hughd/mm/vmstat.c	2021-02-20 18:01:11.272162661 -0800
@@ -1865,7 +1865,7 @@ int vmstat_refresh(struct ctl_table *tab
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_zone_stat[i]);
-		if (val < -max_drift) {
+		if (val < 0) {
 			pr_warn("%s: %s %ld\n",
 				__func__, zone_stat_name(i), val);
 			err = -EINVAL;
@@ -1874,13 +1874,21 @@ int vmstat_refresh(struct ctl_table *tab
 #ifdef CONFIG_NUMA
 	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_numa_stat[i]);
-		if (val < -max_drift) {
+		if (val < 0) {
 			pr_warn("%s: %s %ld\n",
 				__func__, numa_stat_name(i), val);
 			err = -EINVAL;
 		}
 	}
 #endif
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+		val = atomic_long_read(&vm_node_stat[i]);
+		if (val < 0) {
+			pr_warn("%s: %s %ld\n",
+				__func__, node_stat_name(i), val);
+			err = -EINVAL;
+		}
+	}
 	if (err)
 		return err;
 	if (write)
Roman Gushchin Feb. 25, 2021, 1:53 a.m. UTC | #16
On Tue, Feb 23, 2021 at 11:24:23PM -0800, Hugh Dickins wrote:
> On Thu, 6 Aug 2020, Andrew Morton wrote:
> > On Thu, 6 Aug 2020 16:38:04 -0700 Roman Gushchin <guro@fb.com> wrote:
> 
> August, yikes, I thought it was much more recent.
> 
> > 
> > > it seems that Hugh and me haven't reached a consensus here.
> > > Can, you, please, not merge this patch into 5.9, so we would have
> > > more time to find a solution, acceptable for all?
> > 
> > No probs.  I already had a big red asterisk on it ;)
> 
> I've a suspicion that Andrew might be tiring of his big red asterisk,
> and wanting to unload
> mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings.patch
> mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix.patch
> mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch
> into 5.12.
> 
> I would prefer not, and reiterate my Nack: but no great harm will
> befall the cosmos if he overrules that, and it does go through to
> 5.12 - I'll just want to revert it again later.  And I do think a
> more straightforward way of suppressing those warnings would be just
> to delete the code that issues them, rather than brushing them under
> a carpet of overtuning.

I'm actually fine with either option. My only concern is that if somebody
tries to use the hugetlb_cma boot option AND /proc/sys/vm/stat_refresh
together, they will get a false warning and report it to mm@ or will
waste their time trying to debug a non-existent problem. It's not the end
of the world.
We can also make the warning conditional on CONFIG_DEBUG_VM, for example.

Please, let me know what's your preferred way to go forward.

Thanks!
Hugh Dickins Feb. 25, 2021, 5:21 p.m. UTC | #17
On Wed, 24 Feb 2021, Roman Gushchin wrote:
> On Tue, Feb 23, 2021 at 11:24:23PM -0800, Hugh Dickins wrote:
> > On Thu, 6 Aug 2020, Andrew Morton wrote:
> > > On Thu, 6 Aug 2020 16:38:04 -0700 Roman Gushchin <guro@fb.com> wrote:
> > 
> > August, yikes, I thought it was much more recent.
> > 
> > > 
> > > > it seems that Hugh and me haven't reached a consensus here.
> > > > Can, you, please, not merge this patch into 5.9, so we would have
> > > > more time to find a solution, acceptable for all?
> > > 
> > > No probs.  I already had a big red asterisk on it ;)
> > 
> > I've a suspicion that Andrew might be tiring of his big red asterisk,
> > and wanting to unload
> > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings.patch
> > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix.patch
> > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch
> > into 5.12.
> > 
> > I would prefer not, and reiterate my Nack: but no great harm will
> > befall the cosmos if he overrules that, and it does go through to
> > 5.12 - I'll just want to revert it again later.  And I do think a
> > more straightforward way of suppressing those warnings would be just
> > to delete the code that issues them, rather than brushing them under
> > a carpet of overtuning.
> 
> I'm actually fine with either option. My only concern is that if somebody
> will try to use the hugetlb_cma boot option AND /proc/sys/vm/stat_refresh
> together, they will get a false warning and report them to mm@ or will
> waste their time trying to debug a non-existing problem. It's not the end
> of the world.
> We can also make the warning conditional on CONFIG_DEBUG_VM, for example.
> 
> Please, let me know what's your preferred way to go forward.

My preferred way forward (for now: since we're all too busy to fix
the misbehaving stats) is for Andrew to drop your patch, and I'll post
three patches against current 5.12 in a few hours: one to restore the
check on the missing NR_VM_NODE_STAT_ITEMS, one to remove the -EINVAL
(which upsets test scripts at our end), one to suppress the warning on
nr_zone_write_pending, nr_writeback and nr_free_cma.

Hugh
Roman Gushchin Feb. 25, 2021, 6:06 p.m. UTC | #18
On Thu, Feb 25, 2021 at 09:21:04AM -0800, Hugh Dickins wrote:
> On Wed, 24 Feb 2021, Roman Gushchin wrote:
> > On Tue, Feb 23, 2021 at 11:24:23PM -0800, Hugh Dickins wrote:
> > > On Thu, 6 Aug 2020, Andrew Morton wrote:
> > > > On Thu, 6 Aug 2020 16:38:04 -0700 Roman Gushchin <guro@fb.com> wrote:
> > > 
> > > August, yikes, I thought it was much more recent.
> > > 
> > > > 
> > > > > it seems that Hugh and me haven't reached a consensus here.
> > > > > Can, you, please, not merge this patch into 5.9, so we would have
> > > > > more time to find a solution, acceptable for all?
> > > > 
> > > > No probs.  I already had a big red asterisk on it ;)
> > > 
> > > I've a suspicion that Andrew might be tiring of his big red asterisk,
> > > and wanting to unload
> > > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings.patch
> > > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix.patch
> > > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch
> > > into 5.12.
> > > 
> > > I would prefer not, and reiterate my Nack: but no great harm will
> > > befall the cosmos if he overrules that, and it does go through to
> > > 5.12 - I'll just want to revert it again later.  And I do think a
> > > more straightforward way of suppressing those warnings would be just
> > > to delete the code that issues them, rather than brushing them under
> > > a carpet of overtuning.
> > 
> > I'm actually fine with either option. My only concern is that if somebody
> > will try to use the hugetlb_cma boot option AND /proc/sys/vm/stat_refresh
> > together, they will get a false warning and report them to mm@ or will
> > waste their time trying to debug a non-existing problem. It's not the end
> > of the world.
> > We can also make the warning conditional on CONFIG_DEBUG_VM, for example.
> > 
> > Please, let me know what's your preferred way to go forward.
> 
> My preferred way forward (for now: since we're all too busy to fix
> the misbehaving stats) is for Andrew to drop your patch, and I'll post
> three patches against current 5.12 in a few hours: one to restore the
> check on the missing NR_VM_NODE_STAT_ITEMS, one to remove the -EINVAL
> (which upsets test scripts at our end), one to suppress the warning on
> nr_zone_write_pending, nr_writeback and nr_free_cma.

I'd totally support it!

Please, cc me and I'll be happy to review/ack your patches.

Thanks!
diff mbox series

Patch

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 4b9d2e8e9142..95fb80d0c606 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -822,8 +822,8 @@  e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
 
 As a side-effect, it also checks for negative totals (elsewhere reported
 as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
-(At time of writing, a few stats are known sometimes to be found negative,
-with no ill effects: errors and warnings on these stats are suppressed.)
+(On an SMP machine some stats can temporarily become negative, with no ill
+effects: errors and warnings on these stats are suppressed.)
 
 
 numa_stat
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a21140373edb..8f0ef8aaf8ee 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -169,6 +169,8 @@  EXPORT_SYMBOL(vm_node_stat);
 
 #ifdef CONFIG_SMP
 
+#define MAX_THRESHOLD 125
+
 int calculate_pressure_threshold(struct zone *zone)
 {
 	int threshold;
@@ -186,11 +188,9 @@  int calculate_pressure_threshold(struct zone *zone)
 	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
 
 	/*
-	 * Maximum threshold is 125
+	 * Threshold is capped by MAX_THRESHOLD
 	 */
-	threshold = min(125, threshold);
-
-	return threshold;
+	return min(MAX_THRESHOLD, threshold);
 }
 
 int calculate_normal_threshold(struct zone *zone)
@@ -610,6 +610,9 @@  void dec_node_page_state(struct page *page, enum node_stat_item item)
 }
 EXPORT_SYMBOL(dec_node_page_state);
 #else
+
+#define MAX_THRESHOLD 0
+
 /*
  * Use interrupt disable to serialize counter updates
  */
@@ -1810,7 +1813,7 @@  static void refresh_vm_stats(struct work_struct *work)
 int vmstat_refresh(struct ctl_table *table, int write,
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
-	long val;
+	long val, max_drift;
 	int err;
 	int i;
 
@@ -1821,17 +1824,22 @@  int vmstat_refresh(struct ctl_table *table, int write,
 	 * pages, immediately after running a test.  /proc/sys/vm/stat_refresh,
 	 * which can equally be echo'ed to or cat'ted from (by root),
 	 * can be used to update the stats just before reading them.
-	 *
-	 * Oh, and since global_zone_page_state() etc. are so careful to hide
-	 * transiently negative values, report an error here if any of
-	 * the stats is negative, so we know to go looking for imbalance.
 	 */
 	err = schedule_on_each_cpu(refresh_vm_stats);
 	if (err)
 		return err;
+
+	/*
+	 * Since global_zone_page_state() etc. are so careful to hide
+	 * transiently negative values, report an error here if any of
+	 * the stats is below minus the maximum possible drift value,
+	 * so we know to go looking for imbalance.
+	 */
+	max_drift = num_online_cpus() * MAX_THRESHOLD;
+
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_zone_stat[i]);
-		if (val < 0) {
+		if (val < -max_drift) {
 			pr_warn("%s: %s %ld\n",
 				__func__, zone_stat_name(i), val);
 			err = -EINVAL;
@@ -1840,7 +1848,7 @@  int vmstat_refresh(struct ctl_table *table, int write,
 #ifdef CONFIG_NUMA
 	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_numa_stat[i]);
-		if (val < 0) {
+		if (val < -max_drift) {
 			pr_warn("%s: %s %ld\n",
 				__func__, numa_stat_name(i), val);
 			err = -EINVAL;