Message ID | 20230105125248.813825852@redhat.com (mailing list archive)
---|---
State | New
Series | Ensure quiet_vmstat() is called when returning to userspace and when idle tick is stopped
On Thu, 5 Jan 2023, Marcelo Tosatti wrote:

> --- linux-2.6.orig/mm/vmstat.c
> +++ linux-2.6/mm/vmstat.c
> @@ -381,6 +381,7 @@ void __mod_zone_page_state(struct zone *
>  		x = 0;
>  	}
>  	__this_cpu_write(*p, x);
> +	vmstat_mark_dirty();

__vmstat_mark_dirty()? See earlier patch

>
>  	preempt_enable_nested();
>  }
> @@ -417,6 +418,7 @@ void __mod_node_page_state(struct pglist
>  		x = 0;
>  	}
>  	__this_cpu_write(*p, x);
> +	vmstat_mark_dirty();

Ditto.

>
>  	preempt_enable_nested();
>  }
> @@ -577,6 +579,9 @@ static inline void mod_zone_state(struct
>  	s8 __percpu *p = pcp->vm_stat_diff + item;
>  	long o, n, t, z;
>
> +	/* cmpxchg and vmstat_mark_dirty should happen on the same CPU */
> +	preempt_disable();

If you are disabling preemption then why do we still need cmpxchg?

Same again below.
On Tue, Jan 10, 2023 at 01:06:37PM +0100, Christoph Lameter wrote:
> On Thu, 5 Jan 2023, Marcelo Tosatti wrote:
>
> > @@ -577,6 +579,9 @@ static inline void mod_zone_state(struct
> >  	s8 __percpu *p = pcp->vm_stat_diff + item;
> >  	long o, n, t, z;
> >
> > +	/* cmpxchg and vmstat_mark_dirty should happen on the same CPU */
> > +	preempt_disable();
>
> If you are disabling preemption then why do we still need cmpxchg?
> Same again below.

Note I'm absolutely clueless with vmstat. But I was wondering about it as well
while reviewing Marcelo's series, so git blame pointed me to:

7c83912062c801738d7d19acaf8f7fec25ea663c ("vmstat: User per cpu atomics to avoid
interrupt disable / enable")

And this seems to mention that this can race with IRQs as well, hence the local
cmpxchg operation.
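[Illustration, not part of the thread: a minimal sketch of the interrupt race that the local (non-LOCK) cmpxchg in 7c83912062c8 guards against. The per-CPU variable and function names here are made up for the example.]

#include <linux/percpu.h>

static DEFINE_PER_CPU(long, example_counter);

/* Open-coded read-modify-write: an interrupt that fires between the read
 * and the write can also modify example_counter, and its update is lost. */
static void example_add_racy(long delta)
{
	long v = __this_cpu_read(example_counter);

	__this_cpu_write(example_counter, v + delta);
}

/* Retrying with a local (non-LOCK) cmpxchg instead: an update done by an
 * interrupt in between simply forces another iteration of the loop. */
static void example_add_irq_safe(long delta)
{
	long old, new;

	do {
		old = this_cpu_read(example_counter);
		new = old + delta;
	} while (this_cpu_cmpxchg(example_counter, old, new) != old);
}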
On Tue, 10 Jan 2023, Frederic Weisbecker wrote:

> Note I'm absolutely clueless with vmstat. But I was wondering about it as well
> while reviewing Marcelo's series, so git blame pointed me to:
>
> 7c83912062c801738d7d19acaf8f7fec25ea663c ("vmstat: User per cpu atomics to avoid
> interrupt disable / enable")
>
> And this seems to mention that this can race with IRQs as well, hence the local
> cmpxchg operation.

The race with irq could be an issue but I thought we avoided that and were
content with disabling preemption.

But this issue illustrates the central problem of the patchset: it makes
the lightweight counters not so lightweight anymore.

The basic primitives add a lot of weight. And the per-cpu atomic update
operations require the modification of multiple values. The operation
cannot be "atomic" in that sense anymore and we need some other form of
synchronization that can span multiple instructions.
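[Illustration, not part of the thread: the two-value problem described above, with hypothetical per-CPU variables. With preemption enabled, the task can migrate between the two updates, so the dirty flag may end up set on a CPU whose deltas are clean while the CPU that actually holds the delta is never marked dirty.]

#include <linux/percpu.h>
#include <linux/types.h>

static DEFINE_PER_CPU(long, example_diff);	/* hypothetical counter delta */
static DEFINE_PER_CPU(bool, example_dirty);	/* hypothetical dirty flag */

static void example_mod(long delta)
{
	this_cpu_add(example_diff, delta);	/* may execute on CPU 0 ... */
	/* ... the task may be preempted and migrated here ... */
	this_cpu_write(example_dirty, true);	/* ... and this on CPU 1 */
}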
On Tue, Jan 10, 2023 at 02:39:08PM +0100, Christoph Lameter wrote:
> On Tue, 10 Jan 2023, Frederic Weisbecker wrote:
>
> > Note I'm absolutely clueless with vmstat. But I was wondering about it as well
> > while reviewing Marcelo's series, so git blame pointed me to:
> >
> > 7c83912062c801738d7d19acaf8f7fec25ea663c ("vmstat: User per cpu atomics to avoid
> > interrupt disable / enable")
> >
> > And this seems to mention that this can race with IRQs as well, hence the local
> > cmpxchg operation.
>
> The race with irq could be an issue but I thought we avoided that and were
> content with disabling preemption.
>
> But this issue illustrates the central problem of the patchset: it makes
> the lightweight counters not so lightweight anymore.

https://lkml.iu.edu/hypermail/linux/kernel/0903.2/00569.html

With added:

static void do_test_preempt(void)
{
	unsigned long flags;
	unsigned int i;
	cycles_t time1, time2, time;
	u32 rem;

	local_irq_save(flags);
	preempt_disable();
	time1 = get_cycles();
	for (i = 0; i < NR_LOOPS; i++) {
		preempt_disable();
		preempt_enable();
	}
	time2 = get_cycles();
	local_irq_restore(flags);
	preempt_enable();
	time = time2 - time1;

	printk(KERN_ALERT "test results: time for disabling/enabling preemption\n");
	printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
	printk(KERN_ALERT "total time: %llu\n", time);
	time = div_u64_rem(time, NR_LOOPS, &rem);
	printk(KERN_ALERT "-> enabling/disabling preemption takes %llu cycles\n", time);
	printk(KERN_ALERT "test end\n");
}

model name	: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz

[ 423.676079] test init
[ 423.676249] test results: time for baseline
[ 423.676405] number of loops: 200000
[ 423.676676] total time: 104274
[ 423.676910] -> baseline takes 0 cycles
[ 423.677051] test end
[ 423.678150] test results: time for locked cmpxchg
[ 423.678353] number of loops: 200000
[ 423.678498] total time: 2473839
[ 423.678630] -> locked cmpxchg takes 12 cycles
[ 423.678810] test end
[ 423.679204] test results: time for non locked cmpxchg
[ 423.679394] number of loops: 200000
[ 423.679527] total time: 740298
[ 423.679644] -> non locked cmpxchg takes 3 cycles
[ 423.679817] test end
[ 423.680755] test results: time for locked add return
[ 423.680951] number of loops: 200000
[ 423.681089] total time: 2118185
[ 423.681229] -> locked add return takes 10 cycles
[ 423.681411] test end
[ 423.681846] test results: time for enabling interrupts (STI)
[ 423.682063] number of loops: 200000
[ 423.682209] total time: 861591
[ 423.682335] -> enabling interrupts (STI) takes 4 cycles
[ 423.682532] test end
[ 423.683606] test results: time for disabling interrupts (CLI)
[ 423.683852] number of loops: 200000
[ 423.684006] total time: 2440756
[ 423.684141] -> disabling interrupts (CLI) takes 12 cycles
[ 423.684588] test end
[ 423.686626] test results: time for disabling/enabling interrupts (STI/CLI)
[ 423.686879] number of loops: 200000
[ 423.687015] total time: 4802297
[ 423.687139] -> enabling/disabling interrupts (STI/CLI) takes 24 cycles
[ 423.687389] test end
[ 423.688025] test results: time for disabling/enabling preemption
[ 423.688258] number of loops: 200000
[ 423.688396] total time: 1341001
[ 423.688526] -> enabling/disabling preemption takes 6 cycles
[ 423.689276] test end

> The basic primitives add a lot of weight.

Can't see any alternative given the necessity to avoid interruption
by the work to sync per-CPU vmstats to global vmstats.

> And the per-cpu atomic update operations require the modification
> of multiple values.
> The operation cannot be "atomic" in that sense anymore and we need some
> other form of synchronization that can span multiple instructions.

So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
count on preemption being disabled we still have some minor issues.
The fetching of the counter thresholds is racy.
A threshold from another cpu may be applied if we happen to be
rescheduled on another cpu. However, the following vmstat operation
will then bring the counter again under the threshold limit.

Those small issues are gone, OTOH.
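[Illustration, not part of the thread: a simplified shape of the this_cpu_cmpxchg() update referred to above, not a verbatim copy of mod_zone_state(). The counter itself stays consistent because the cmpxchg retries, but the threshold may be read on one CPU while the final cmpxchg lands on another after a migration, which is the racy threshold fetch mentioned here. Function and parameter names are made up.]

#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/types.h>

static void example_mod_state(s8 __percpu *p, s8 __percpu *threshold, long delta)
{
	long o, n, t;

	do {
		t = this_cpu_read(*threshold);	/* possibly CPU A's threshold */
		o = this_cpu_read(*p);
		n = o + delta;
		if (abs(n) > t)
			n = 0;			/* overflow handling elided */
	} while (this_cpu_cmpxchg(*p, o, n) != o);	/* possibly lands on CPU B */
}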
On Tue, 10 Jan 2023, Marcelo Tosatti wrote:

> > The basic primitives add a lot of weight.
>
> Can't see any alternative given the necessity to avoid interruption
> by the work to sync per-CPU vmstats to global vmstats.

this_cpu operations are designed to operate on a *single* value (a
counter) and can be run on an arbitrary cpu. There is no preemption or
interrupt disable required since the counters of all cpus will be added
up at the end.

You want *two* values (the counter and the dirty flag) to be modified
together and want to use the counters/flag to identify the cpu where
these events occurred. this_cpu_xxx operations are not suitable for that
purpose. You would need a way to ensure that both operations occur on the
same cpu.

> > And the per-cpu atomic update operations require the modification
> > of multiple values. The operation cannot be "atomic" in that sense
> > anymore and we need some other form of synchronization that can
> > span multiple instructions.
>
> So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
> count on preemption being disabled we still have some minor issues.
> The fetching of the counter thresholds is racy.
> A threshold from another cpu may be applied if we happen to be
> rescheduled on another cpu. However, the following vmstat operation
> will then bring the counter again under the threshold limit.
>
> Those small issues are gone, OTOH.

Well you could use this_cpu_cmpxchg128 to update a 64 bit counter and a
flag at the same time. Otherwise you will have to switch off preemption or
interrupts when incrementing the counters and updating the dirty flag.

Thus you do not really need the this_cpu operations anymore. It would be
best to use a preempt_disable section and use C operators -- ++ for the
counter and do regular assignment for the flag.
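[Illustration, not part of the thread: a sketch of the alternative outlined in the last paragraph, with hypothetical per-CPU variables. With preemption disabled around the pair, plain C operations through this_cpu_ptr() are enough and no cmpxchg is needed.]

#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/types.h>

static DEFINE_PER_CPU(long, example_diff);	/* hypothetical */
static DEFINE_PER_CPU(bool, example_dirty);	/* hypothetical */

static void example_mod(long delta)
{
	preempt_disable();
	*this_cpu_ptr(&example_diff) += delta;	/* plain C operator on the counter */
	*this_cpu_ptr(&example_dirty) = true;	/* plain assignment for the flag */
	preempt_enable();
}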
On Wed, Jan 11, 2023 at 09:42:18AM +0100, Christoph Lameter wrote:
> On Tue, 10 Jan 2023, Marcelo Tosatti wrote:
>
> > > The basic primitives add a lot of weight.
> >
> > Can't see any alternative given the necessity to avoid interruption
> > by the work to sync per-CPU vmstats to global vmstats.
>
> this_cpu operations are designed to operate on a *single* value (a
> counter) and can be run on an arbitrary cpu. There is no preemption or
> interrupt disable required since the counters of all cpus will be added
> up at the end.
>
> You want *two* values (the counter and the dirty flag) to be modified
> together and want to use the counters/flag to identify the cpu where
> these events occurred. this_cpu_xxx operations are not suitable for that
> purpose. You would need a way to ensure that both operations occur on the
> same cpu.

Which is either preempt_disable (CONFIG_HAVE_CMPXCHG_LOCAL case), or
local_irq_disable (!CONFIG_HAVE_CMPXCHG_LOCAL case).

> > > And the per-cpu atomic update operations require the modification
> > > of multiple values. The operation cannot be "atomic" in that sense
> > > anymore and we need some other form of synchronization that can
> > > span multiple instructions.
> >
> > So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
> > count on preemption being disabled we still have some minor issues.
> > The fetching of the counter thresholds is racy.
> > A threshold from another cpu may be applied if we happen to be
> > rescheduled on another cpu. However, the following vmstat operation
> > will then bring the counter again under the threshold limit.
> >
> > Those small issues are gone, OTOH.
>
> Well you could use this_cpu_cmpxchg128 to update a 64 bit counter and a
> flag at the same time.

But then you transform the "per-CPU vmstat is dirty" bit (bool) into a
number of flags that must be scanned (when returning to userspace),
which increases the overhead of a fast path (return to userspace).

> Otherwise you will have to switch off preemption or
> interrupts when incrementing the counters and updating the dirty flag.
>
> Thus you do not really need the this_cpu operations anymore. It would be
> best to use a preempt_disable section and use C operators -- ++ for the
> counter and do regular assignment for the flag.

OK, can replace this_cpu operations with this_cpu_ptr + standard C operators
(and in fact can do that for interrupt disabled functions as well, that
is with CONFIG_HAVE_CMPXCHG_LOCAL not defined).

Is that it?
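[Illustration, not part of the thread: a sketch of the two cases distinguished above, again with hypothetical per-CPU variables. With CONFIG_HAVE_CMPXCHG_LOCAL the counter update is IRQ-safe through the local cmpxchg, so disabling preemption suffices to keep the dirty-flag write on the same CPU; without it, interrupts have to be disabled around plain per-CPU updates.]

#include <linux/irqflags.h>
#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/types.h>

static DEFINE_PER_CPU(long, example_diff);	/* hypothetical */
static DEFINE_PER_CPU(bool, example_dirty);	/* hypothetical */

static void example_mod(long delta)
{
#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
	long o, n;

	/* counter update is IRQ-safe via local cmpxchg; preempt_disable()
	 * only pins the dirty-flag write to the same CPU */
	preempt_disable();
	do {
		o = this_cpu_read(example_diff);
		n = o + delta;
	} while (this_cpu_cmpxchg(example_diff, o, n) != o);
	this_cpu_write(example_dirty, true);
	preempt_enable();
#else
	unsigned long flags;

	/* no local cmpxchg: disable interrupts around plain per-CPU updates */
	local_irq_save(flags);
	*this_cpu_ptr(&example_diff) += delta;
	*this_cpu_ptr(&example_dirty) = true;
	local_irq_restore(flags);
#endif
}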
On Wed, 11 Jan 2023, Marcelo Tosatti wrote:

> OK, can replace this_cpu operations with this_cpu_ptr + standard C operators
> (and in fact can do that for interrupt disabled functions as well, that
> is with CONFIG_HAVE_CMPXCHG_LOCAL not defined).
>
> Is that it?

No, that was hypothetical.

I do not know how to get out of this dilemma. We surely want to keep fast
vmstat operations working.

The fundamental issue that causes the vmstat discrepancies is likely that
the fast this_cpu ops can increment the counter on any random cpu and that
this is the reason you get vmstat discrepancies.

Give up the assumption that an increment of a this_cpu counter on a
specific cpu means that something occurred on that specific cpu. Maybe
that will get you on a path to resolve the issues you are seeing.
On Mon, Jan 16, 2023 at 10:51:40AM +0100, Christoph Lameter wrote:
> On Wed, 11 Jan 2023, Marcelo Tosatti wrote:
>
> > OK, can replace this_cpu operations with this_cpu_ptr + standard C operators
> > (and in fact can do that for interrupt disabled functions as well, that
> > is with CONFIG_HAVE_CMPXCHG_LOCAL not defined).
> >
> > Is that it?
>
> No, that was hypothetical.
>
> I do not know how to get out of this dilemma. We surely want to keep fast
> vmstat operations working.

Honestly, to me, there is no dilemma:

* There is a requirement from applications to be uninterrupted
  by operating system activities. Examples include radio access
  network software, software defined PLCs for industrial automation (1).

* There exist vm-statistics counters (which count
  the number of pages in different states, for example, number of
  free pages, locked pages, pages under writeback, pagetable pages,
  file pages, etc).
  To reduce the number of accesses to the global counters, each CPU maintains
  its own delta relative to the global VM counters
  (which can be cached in the local processor cache, therefore fast).
  The per-CPU deltas are synchronized to the global counters:
  1) If the per-CPU counters exceed a given threshold.
  2) Periodically, with a low frequency compared to CPU events
     (every number of seconds).
  3) Upon an event that requires accurate counters.

* The periodic synchronization interrupts a given CPU, in case it
  contains a counter delta relative to the global counters.

To avoid this interruption, due to [1], the proposed patchset
synchronizes any pending per-CPU deltas to the global counters, for
nohz_full= CPUs, when returning to userspace (which is a very fast path).

Since return to userspace is a very fast path, synchronizing per-CPU
counter deltas by reading their contents is undesired. Therefore a
single bit is introduced to compact the following information: does this
CPU contain any delta relative to the global counters that should be
written back?

This bit is set when a per-CPU delta is increased.
This bit is cleared when the per-CPU deltas are written back to the
global counters.

Since for the following two operations:

	modify per-CPU delta (for current CPU) of counter X by Y
	set bit (for current CPU) indicating the per-CPU delta exists

"current CPU" can change, it is necessary to disable CPU preemption
when executing the pair of operations.

vmstat operations still perform their main goal, which is to maintain
accesses local to the CPU when incrementing the counters (for most
counter modifications). The preempt_disable/enable pair also operates
on a per-CPU variable.

Now you are objecting to this patchset because:

It increases the number of cycles to execute the function to modify
the counters by 6. Can you mention any benchmark where this
increase is significant?

By searching for mod_zone_page_state/mod_node_page_state one can see
the following: the codepaths that call them are touching multiple pages
and other data structures, so the preempt_enable/preempt_disable pair
should be a very small contribution (in terms of percentage) to any
meaningful benchmark.

> The fundamental issue that causes the vmstat discrepancies is likely that
> the fast this_cpu ops can increment the counter on any random cpu and that
> this is the reason you get vmstat discrepancies.

Yes.

> Give up the assumption that an increment of a this_cpu counter on a
> specific cpu means that something occurred on that specific cpu. Maybe
> that will get you on a path to resolve the issues you are seeing.

But it can't.
To be able to condense the information "does a delta exist on this CPU"
from a number of cacheline reads to a single cacheline read, one bit can
be used. And the write to that bit and to the counters is not atomic.

Alternatives:

1) Disable periodic synchronization for nohz_full CPUs.

2) Processor instructions which can modify more than
   one address in memory.

3) Synchronize the per-CPU stats remotely (which
   increases per-CPU and per-node accesses).
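[Illustration, not part of the thread: the dirty bit described in this message corresponds to the per-CPU vmstat_dirty variable used in the patch at the bottom of this page. A rough sketch of what such helpers can look like, based on the names visible in the diff rather than on the exact code of the series:]

#include <linux/percpu.h>
#include <linux/types.h>

static DEFINE_PER_CPU(bool, vmstat_dirty);

/* Called whenever a per-CPU vmstat delta is modified. */
static inline void vmstat_mark_dirty(void)
{
	this_cpu_write(vmstat_dirty, true);
}

/* Called when the per-CPU deltas have been folded into the global counters. */
static inline void vmstat_clear_dirty(void)
{
	this_cpu_write(vmstat_dirty, false);
}

/* Checked on return to userspace and in quiet_vmstat(). */
static inline bool is_vmstat_dirty(void)
{
	return this_cpu_read(vmstat_dirty);
}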
On Mon, 16 Jan 2023, Marcelo Tosatti wrote:

> Honestly, to me, there is no dilemma:
>
> * There is a requirement from applications to be uninterrupted
>   by operating system activities. Examples include radio access
>   network software, software defined PLCs for industrial automation (1).
>
> * There exist vm-statistics counters (which count
>   the number of pages in different states, for example, number of
>   free pages, locked pages, pages under writeback, pagetable pages,
>   file pages, etc).
>   To reduce the number of accesses to the global counters, each CPU maintains
>   its own delta relative to the global VM counters
>   (which can be cached in the local processor cache, therefore fast).

The counters only count accurately as a global sum. A counter may be
specific to a zone, in which case it counts uses of that zone from all
processors.

> Now you are objecting to this patchset because:
>
> It increases the number of cycles to execute the function to modify
> the counters by 6. Can you mention any benchmark where this
> increase is significant?

I am objecting because of a fundamental misunderstanding of how these
counters work and because the patchset is incorrect in the way it handles
these counters. Come up with a correct approach and then we can consider
regressions and/or improvements in performance.

> Alternatives:
>
> 1) Disable periodic synchronization for nohz_full CPUs.
>
> 2) Processor instructions which can modify more than
>    one address in memory.
>
> 3) Synchronize the per-CPU stats remotely (which
>    increases per-CPU and per-node accesses).

Or remove the assumptions that may exist in current code that a delta on a
specific cpu counter means that something occurred on that cpu?

If there is a delta then that *does not* mean that there is something to
do on that processor. The delta could be folded by another processor into
the global counter if that processor is idle or not entering the kernel
and stays that way throughout the operation.

So I guess that would be #3. The function cpu_vm_stats_fold() already does
this for offline cpus. Can something similar be made to work for idle cpus
or those continually running in user space?
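[Illustration, not part of the thread: roughly the zone half of the fold that cpu_vm_stats_fold() performs for offline CPUs, heavily simplified (node counters, NUMA stats and the accumulation of global diffs are omitted, and example_fold_cpu is a made-up name). As noted above, reading and clearing another CPU's deltas like this is only safe while that CPU cannot concurrently update them.]

#include <linux/mm.h>
#include <linux/vmstat.h>

static void example_fold_cpu(int cpu)
{
	struct zone *zone;

	for_each_populated_zone(zone) {
		struct per_cpu_zonestat *pzstats =
			per_cpu_ptr(zone->per_cpu_zonestats, cpu);
		int i;

		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
			if (pzstats->vm_stat_diff[i]) {
				int v = pzstats->vm_stat_diff[i];

				/* move the remote delta into the zone/global counters */
				pzstats->vm_stat_diff[i] = 0;
				zone_page_state_add(v, zone, i);
			}
		}
	}
}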
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -381,6 +381,7 @@ void __mod_zone_page_state(struct zone *
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	vmstat_mark_dirty();
 
 	preempt_enable_nested();
 }
@@ -417,6 +418,7 @@ void __mod_node_page_state(struct pglist
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	vmstat_mark_dirty();
 
 	preempt_enable_nested();
 }
@@ -577,6 +579,9 @@ static inline void mod_zone_state(struct
 	s8 __percpu *p = pcp->vm_stat_diff + item;
 	long o, n, t, z;
 
+	/* cmpxchg and vmstat_mark_dirty should happen on the same CPU */
+	preempt_disable();
+
 	do {
 		z = 0;  /* overflow to zone counters */
 
@@ -606,6 +611,8 @@ static inline void mod_zone_state(struct
 
 	if (z)
 		zone_page_state_add(z, zone, item);
+	vmstat_mark_dirty();
+	preempt_enable();
 }
 
 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
@@ -645,6 +652,8 @@ static inline void mod_node_state(struct
 		delta >>= PAGE_SHIFT;
 	}
 
+	/* cmpxchg and vmstat_mark_dirty should happen on the same CPU */
+	preempt_disable();
 	do {
 		z = 0;	/* overflow to node counters */
 
@@ -674,6 +683,8 @@ static inline void mod_node_state(struct
 
 	if (z)
 		node_page_state_add(z, pgdat, item);
+	vmstat_mark_dirty();
+	preempt_enable();
 }
 
 void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
@@ -828,6 +839,14 @@ static int refresh_cpu_vm_stats(bool do_
 	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
+	/*
+	 * Clear vmstat_dirty before clearing the percpu vmstats.
+	 * If interrupts are enabled, it is possible that an interrupt
+	 * or another task modifies a percpu vmstat, which will
+	 * set vmstat_dirty to true.
+	 */
+	vmstat_clear_dirty();
+
 	for_each_populated_zone(zone) {
 		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
 #ifdef CONFIG_NUMA
@@ -1957,35 +1976,6 @@ static void vmstat_update(struct work_st
 }
 
 /*
- * Check if the diffs for a certain cpu indicate that
- * an update is needed.
- */
-static bool need_update(int cpu)
-{
-	pg_data_t *last_pgdat = NULL;
-	struct zone *zone;
-
-	for_each_populated_zone(zone) {
-		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
-		struct per_cpu_nodestat *n;
-
-		/*
-		 * The fast way of checking if there are any vmstat diffs.
-		 */
-		if (memchr_inv(pzstats->vm_stat_diff, 0, sizeof(pzstats->vm_stat_diff)))
-			return true;
-
-		if (last_pgdat == zone->zone_pgdat)
-			continue;
-		last_pgdat = zone->zone_pgdat;
-		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
-		if (memchr_inv(n->vm_node_stat_diff, 0, sizeof(n->vm_node_stat_diff)))
-			return true;
-	}
-	return false;
-}
-
-/*
  * Switch off vmstat processing and then fold all the remaining differentials
  * until the diffs stay at zero. The function is used by NOHZ and can only be
  * invoked when tick processing is not active.
@@ -1995,10 +1985,7 @@ void quiet_vmstat(void)
 	if (system_state != SYSTEM_RUNNING)
 		return;
 
-	if (!delayed_work_pending(this_cpu_ptr(&vmstat_work)))
-		return;
-
-	if (!need_update(smp_processor_id()))
+	if (!is_vmstat_dirty())
 		return;
 
 	/*
@@ -2029,7 +2016,7 @@ static void vmstat_shepherd(struct work_
 	for_each_online_cpu(cpu) {
 		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
 
-		if (!delayed_work_pending(dw) && need_update(cpu))
+		if (!delayed_work_pending(dw) && per_cpu(vmstat_dirty, cpu))
 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
 
 		cond_resched();