Message ID | 20200714173747.3315771-1-guro@fb.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm: vmstat: fix /proc/sys/vm/stat_refresh generating false warnings | expand |
On Tue, Jul 14, 2020 at 10:37:47AM -0700, Roman Gushchin wrote: > I've noticed a number of warnings like "vmstat_refresh: nr_free_cma > -5" or "vmstat_refresh: nr_zone_write_pending -11" on our production > hosts. The numbers of these warnings were relatively low and stable, > so it didn't look like we are systematically leaking the counters. > The corresponding vmstat counters also looked sane. > > These warnings are generated by the vmstat_refresh() function, which > assumes that atomic zone and numa counters can't go below zero. > However, on a SMP machine it's not quite right: due to per-cpu > caching it can in theory be as low as -(zone threshold) * NR_CPUs. > > For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES > reached 0. Then we've reclaimed a small number of cma pages on each > CPU except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are > slightly positive (the atomic counter is still 0). Then somebody on > CPU0 consumes all these pages. The number of pages can easily exceed > the threshold and a negative value will be committed to the atomic > counter. > > To fix the problem and avoid generating false warnings, let's just > relax the condition and warn only if the value is less than minus > the maximum theoretically possible drift value, which is 125 * > number of online CPUs. It will still allow to catch systematic leaks, > but will not generate bogus warnings. > > Signed-off-by: Roman Gushchin <guro@fb.com> > Cc: Hugh Dickins <hughd@google.com> > Signed-off-by: Roman Gushchin <guro@fb.com> Spotted a double sign-off a second after sending. Fixed in v2. Please, ignore this version. Thanks!
On 7/14/20 7:37 PM, Roman Gushchin wrote: > I've noticed a number of warnings like "vmstat_refresh: nr_free_cma > -5" or "vmstat_refresh: nr_zone_write_pending -11" on our production > hosts. The numbers of these warnings were relatively low and stable, > so it didn't look like we are systematically leaking the counters. > The corresponding vmstat counters also looked sane. > > These warnings are generated by the vmstat_refresh() function, which > assumes that atomic zone and numa counters can't go below zero. > However, on a SMP machine it's not quite right: due to per-cpu > caching it can in theory be as low as -(zone threshold) * NR_CPUs. > > For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES > reached 0. Then we've reclaimed a small number of cma pages on each > CPU except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are > slightly positive (the atomic counter is still 0). Then somebody on > CPU0 consumes all these pages. The number of pages can easily exceed > the threshold and a negative value will be committed to the atomic > counter. > > To fix the problem and avoid generating false warnings, let's just > relax the condition and warn only if the value is less than minus > the maximum theoretically possible drift value, which is 125 * > number of online CPUs. It will still allow to catch systematic leaks, > but will not generate bogus warnings. > > Signed-off-by: Roman Gushchin <guro@fb.com> > Cc: Hugh Dickins <hughd@google.com> > Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Vlastimil Babka <vbabka@suse.cz>
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 4b9d2e8e9142..95fb80d0c606 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -822,8 +822,8 @@ e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo As a side-effect, it also checks for negative totals (elsewhere reported as 0) and "fails" with EINVAL if any are found, with a warning in dmesg. -(At time of writing, a few stats are known sometimes to be found negative, -with no ill effects: errors and warnings on these stats are suppressed.) +(On a SMP machine some stats can temporarily become negative, with no ill +effects: errors and warnings on these stats are suppressed.) numa_stat diff --git a/mm/vmstat.c b/mm/vmstat.c index a21140373edb..8f0ef8aaf8ee 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -169,6 +169,8 @@ EXPORT_SYMBOL(vm_node_stat); #ifdef CONFIG_SMP +#define MAX_THRESHOLD 125 + int calculate_pressure_threshold(struct zone *zone) { int threshold; @@ -186,11 +188,9 @@ int calculate_pressure_threshold(struct zone *zone) threshold = max(1, (int)(watermark_distance / num_online_cpus())); /* - * Maximum threshold is 125 + * Threshold is capped by MAX_THRESHOLD */ - threshold = min(125, threshold); - - return threshold; + return min(MAX_THRESHOLD, threshold); } int calculate_normal_threshold(struct zone *zone) @@ -610,6 +610,9 @@ void dec_node_page_state(struct page *page, enum node_stat_item item) } EXPORT_SYMBOL(dec_node_page_state); #else + +#define MAX_THRESHOLD 0 + /* * Use interrupt disable to serialize counter updates */ @@ -1810,7 +1813,7 @@ static void refresh_vm_stats(struct work_struct *work) int vmstat_refresh(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { - long val; + long val, max_drift; int err; int i; @@ -1821,17 +1824,22 @@ int vmstat_refresh(struct ctl_table *table, int write, * pages, immediately after running a test. /proc/sys/vm/stat_refresh, * which can equally be echo'ed to or cat'ted from (by root), * can be used to update the stats just before reading them. - * - * Oh, and since global_zone_page_state() etc. are so careful to hide - * transiently negative values, report an error here if any of - * the stats is negative, so we know to go looking for imbalance. */ err = schedule_on_each_cpu(refresh_vm_stats); if (err) return err; + + /* + * Since global_zone_page_state() etc. are so careful to hide + * transiently negative values, report an error here if any of + * the stats is negative and are less than the maximum drift value, + * so we know to go looking for imbalance. + */ + max_drift = num_online_cpus() * MAX_THRESHOLD; + for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) { val = atomic_long_read(&vm_zone_stat[i]); - if (val < 0) { + if (val < -max_drift) { pr_warn("%s: %s %ld\n", __func__, zone_stat_name(i), val); err = -EINVAL; @@ -1840,7 +1848,7 @@ int vmstat_refresh(struct ctl_table *table, int write, #ifdef CONFIG_NUMA for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) { val = atomic_long_read(&vm_numa_stat[i]); - if (val < 0) { + if (val < -max_drift) { pr_warn("%s: %s %ld\n", __func__, numa_stat_name(i), val); err = -EINVAL;