diff mbox series

[3/5] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold

Message ID 20230201195104.460373427@redhat.com (mailing list archive)
State New
Headers show
Series fold per-CPU vmstats remotely | expand

Commit Message

Marcelo Tosatti Feb. 1, 2023, 7:50 p.m. UTC
In preparation to switch vmstat shepherd to flush
per-CPU counters remotely, use a cmpxchg loop 
instead of a pair of read/write instructions.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Comments

Christoph Lameter Feb. 2, 2023, 2:38 p.m. UTC | #1
On Wed, 1 Feb 2023, Marcelo Tosatti wrote:

> In preparation to switch vmstat shepherd to flush
> per-CPU counters remotely, use a cmpxchg loop
> instead of a pair of read/write instructions.

You are mixing full atomic cmpxchg and  per cpu atomic cmpxchg? That does
not work.

I thought you would only run this while the kernel is not active on the
remote cpu? Then you dont need any cmpxchg and you can leave the function
as is.
Marcelo Tosatti Feb. 2, 2023, 3:54 p.m. UTC | #2
On Thu, Feb 02, 2023 at 03:38:58PM +0100, Christoph Lameter wrote:
> On Wed, 1 Feb 2023, Marcelo Tosatti wrote:
> 
> > In preparation to switch vmstat shepherd to flush
> > per-CPU counters remotely, use a cmpxchg loop
> > instead of a pair of read/write instructions.
> 
> You are mixing full atomic cmpxchg and  per cpu atomic cmpxchg? That does
> not work.

OK, missing locked on the local functions. Can fix that.

> I thought you would only run this while the kernel is not active on the
> remote cpu? Then you dont need any cmpxchg and you can leave the function
> as is.

The remote cpu can enter kernel mode while this function executes.

There is no mode which indicates userspace cannot enter the kernel.
Christoph Lameter Feb. 3, 2023, 9:34 a.m. UTC | #3
On Thu, 2 Feb 2023, Marcelo Tosatti wrote:

> > I thought you would only run this while the kernel is not active on the
> > remote cpu? Then you dont need any cmpxchg and you can leave the function
> > as is.
>
> The remote cpu can enter kernel mode while this function executes.

Isnt there some lock/serializtion to stall the kernel until you are done?

> There is no mode which indicates userspace cannot enter the kernel.

There are lot of thinngs that happen upon entry to the kernel. I would
hope that you can do something there. Scheduler?
Marcelo Tosatti Feb. 3, 2023, 6:52 p.m. UTC | #4
On Fri, Feb 03, 2023 at 10:34:22AM +0100, Christoph Lameter wrote:
> On Thu, 2 Feb 2023, Marcelo Tosatti wrote:
> 
> > > I thought you would only run this while the kernel is not active on the
> > > remote cpu? Then you dont need any cmpxchg and you can leave the function
> > > as is.
> >
> > The remote cpu can enter kernel mode while this function executes.
> 
> Isnt there some lock/serializtion to stall the kernel until you are done?

Not that i know of. Anyway, an additional datapoint is:

"Software defined PLC"
(https://www.redhat.com/en/blog/software-defined-programmable-logic-controller-introduction),
applications
can perform system calls in their time sensitive loop.

One example of an opensource software is OpenPLC.

One would like to avoid interruptions for those cases as well.

> > There is no mode which indicates userspace cannot enter the kernel.
> 
> There are lot of thinngs that happen upon entry to the kernel. I would
> hope that you can do something there. Scheduler?

The use-case in question is with isolation, where a CPU is dedicated
to a single task. So the scheduler should not be an issue.
Christoph Lameter Feb. 6, 2023, 9:42 a.m. UTC | #5
On Fri, 3 Feb 2023, Marcelo Tosatti wrote:

> > Isnt there some lock/serializtion to stall the kernel until you are done?
>
> Not that i know of. Anyway, an additional datapoint is:
>
> "Software defined PLC"
> (https://www.redhat.com/en/blog/software-defined-programmable-logic-controller-introduction),
> applications
> can perform system calls in their time sensitive loop.
>
> One example of an opensource software is OpenPLC.
>
> One would like to avoid interruptions for those cases as well.

Well allowing sytem calls during "time sensitiveness" implies
it is not that sensitive to vmstat updates which have a smaller impact
than system calls.

Unless we are talking about virtual system calls like gettimeofday or
clock_gettime. These do not enter the kernel if configured correctly.
Marcelo Tosatti Feb. 6, 2023, 7:10 p.m. UTC | #6
On Mon, Feb 06, 2023 at 10:42:46AM +0100, Christoph Lameter wrote:
> On Fri, 3 Feb 2023, Marcelo Tosatti wrote:
> 
> > > Isnt there some lock/serializtion to stall the kernel until you are done?
> >
> > Not that i know of. Anyway, an additional datapoint is:
> >
> > "Software defined PLC"
> > (https://www.redhat.com/en/blog/software-defined-programmable-logic-controller-introduction),
> > applications
> > can perform system calls in their time sensitive loop.
> >
> > One example of an opensource software is OpenPLC.
> >
> > One would like to avoid interruptions for those cases as well.
> 
> Well allowing sytem calls during "time sensitiveness" implies
> it is not that sensitive to vmstat updates
> which have a smaller impact than system calls.

Not necessarily. Certain system calls won't touch per-CPU vmstats: nanosleep,
for example. Perhaps i misunderstood your suggestion:

So the patchset in discussion uses (or should use, in v2), in both
vmstat_shepherd and vmstat counter modification, LOCK CMPXCHG.

There is the potential that LOCK CMPXCHG, from vmstat counter modification, 
incurs a performance degradation.

Note however, that cachelocking should hopefully "hide" the costs. 

Do you have any concerns about this patchset other than the performance
degradation due to addition of LOCK in CMPXCHG? 

The other possible concern is that the preempt-disabled functions,
namely:
__inc_node_page_state, __dec_node_page_state, __mod_node_page_state,
__inc_zone_page_state, __dec_zone_page_state, __mod_zone_page_state
have been switched to cmpxchg loop. Is that a problem?

Would expect that measuring LOCK CMPXCHG does not incur significant
performance degradation as compared to CMPXCHG (from the 
page allocation benchmark) would address your concerns?

Thanks
Matthew Wilcox (Oracle) Feb. 6, 2023, 7:19 p.m. UTC | #7
On Wed, Feb 01, 2023 at 04:50:16PM -0300, Marcelo Tosatti wrote:
> In preparation to switch vmstat shepherd to flush
> per-CPU counters remotely, use a cmpxchg loop 
> instead of a pair of read/write instructions.

FYI, try_cmpxchg() is preferred to plain cmpxchg() these days.
Apparently it generates better code on x86.

> -				v = pzstats->vm_stat_diff[i];
> -				pzstats->vm_stat_diff[i] = 0;
> +				do {
> +					v = pzstats->vm_stat_diff[i];
> +				} while (cmpxchg(&pzstats->vm_stat_diff[i], v, 0) != v);

I think this would be:

				do {
					v = pzstats->vm_stat_diff[i];
				} while (!try_cmpxchg(&pzstats->vm_stat_diff[i], v, 0));
diff mbox series

Patch

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -885,7 +885,7 @@  static int refresh_cpu_vm_stats(void)
 }
 
 /*
- * Fold the data for an offline cpu into the global array.
+ * Fold the data for a cpu into the global array.
  * There cannot be any access by the offline cpu and therefore
  * synchronization is simplified.
  */
@@ -906,8 +906,9 @@  void cpu_vm_stats_fold(int cpu)
 			if (pzstats->vm_stat_diff[i]) {
 				int v;
 
-				v = pzstats->vm_stat_diff[i];
-				pzstats->vm_stat_diff[i] = 0;
+				do {
+					v = pzstats->vm_stat_diff[i];
+				} while (cmpxchg(&pzstats->vm_stat_diff[i], v, 0) != v);
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
 			}
@@ -917,8 +918,9 @@  void cpu_vm_stats_fold(int cpu)
 			if (pzstats->vm_numa_event[i]) {
 				unsigned long v;
 
-				v = pzstats->vm_numa_event[i];
-				pzstats->vm_numa_event[i] = 0;
+				do {
+					v = pzstats->vm_numa_event[i];
+				} while (cmpxchg(&pzstats->vm_numa_event[i], v, 0) != v);
 				zone_numa_event_add(v, zone, i);
 			}
 		}
@@ -934,8 +936,9 @@  void cpu_vm_stats_fold(int cpu)
 			if (p->vm_node_stat_diff[i]) {
 				int v;
 
-				v = p->vm_node_stat_diff[i];
-				p->vm_node_stat_diff[i] = 0;
+				do {
+					v = p->vm_node_stat_diff[i];
+				} while (cmpxchg(&p->vm_node_stat_diff[i], v, 0) != v);
 				atomic_long_add(v, &pgdat->vm_stat[i]);
 				global_node_diff[i] += v;
 			}