| Message ID | 20230209153204.846239718@redhat.com |
| --- | --- |
| State | New |
| Series | fold per-CPU vmstats remotely |
On 09.02.23 16:01, Marcelo Tosatti wrote:
> In preparation to switch vmstat shepherd to flush
> per-CPU counters remotely, switch all functions that
> modify the counters to use cmpxchg.
>
> To test the performance difference, a page allocator microbenchmark:
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
> with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.
>
> For the single_page_alloc_free test, which does
>
>         /** Loop to measure **/
>         for (i = 0; i < rec->loops; i++) {
>                 my_page = alloc_page(gfp_mask);
>                 if (unlikely(my_page == NULL))
>                         return 0;
>                 __free_page(my_page);
>         }
>
> Unit is cycles.
>
> Vanilla    Patched    Diff
> 159        165        3.7%
>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>
> Index: linux-vmstat-remote/mm/vmstat.c
> ===================================================================
> --- linux-vmstat-remote.orig/mm/vmstat.c
> +++ linux-vmstat-remote/mm/vmstat.c
> @@ -334,6 +334,188 @@ void set_pgdat_percpu_threshold(pg_data_
>  	}
>  }

I wonder why we get a diff that is rather hard to review because it removes
all existing code and replaces it with almost-identical code. Are you maybe
moving a bunch of code while modifying some tiny bits at the same time?
On Thu, Mar 02, 2023 at 11:47:35AM +0100, David Hildenbrand wrote:
> On 09.02.23 16:01, Marcelo Tosatti wrote:
> > In preparation to switch vmstat shepherd to flush
> > per-CPU counters remotely, switch all functions that
> > modify the counters to use cmpxchg.
> >
> > To test the performance difference, a page allocator microbenchmark:
> > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
> > with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.
> >
> > For the single_page_alloc_free test, which does
> >
> >         /** Loop to measure **/
> >         for (i = 0; i < rec->loops; i++) {
> >                 my_page = alloc_page(gfp_mask);
> >                 if (unlikely(my_page == NULL))
> >                         return 0;
> >                 __free_page(my_page);
> >         }
> >
> > Unit is cycles.
> >
> > Vanilla    Patched    Diff
> > 159        165        3.7%
> >
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >
> > Index: linux-vmstat-remote/mm/vmstat.c
> > ===================================================================
> > --- linux-vmstat-remote.orig/mm/vmstat.c
> > +++ linux-vmstat-remote/mm/vmstat.c
> > @@ -334,6 +334,188 @@ void set_pgdat_percpu_threshold(pg_data_
> >  	}
> >  }
>
> I wonder why we get a diff that is rather hard to review because it removes
> all existing code and replaces it with almost-identical code. Are you maybe
> moving a bunch of code while modifying some tiny bits at the same time?

The current code has the functions defined like so:

__mod_zone_page_state
__mod_node_page_state
__inc_zone_page_state
__inc_node_page_state
__dec_zone_page_state
__dec_node_page_state

#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
mod_zone_page_state
inc_zone_page_state
dec_zone_page_state
mod_node_page_state
inc_node_page_state
dec_node_page_state
#else
mod_zone_page_state
inc_zone_page_state
dec_zone_page_state
mod_node_page_state
inc_node_page_state
dec_node_page_state
#endif

What this patch does is define the "__" versions for the
CONFIG_HAVE_CMPXCHG_LOCAL case to be their non-"__" counterparts.

So it will be:

#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
mod_zone_page_state
inc_zone_page_state
dec_zone_page_state
mod_node_page_state
inc_node_page_state
dec_node_page_state
__mod_zone_page_state (new function, calls mod_zone_page_state)
__mod_node_page_state (new function, calls mod_node_page_state)
__inc_zone_page_state
__inc_node_page_state
__dec_zone_page_state
__dec_node_page_state
#else
__mod_zone_page_state (old, shared function for both CONFIG_HAVE_CMPXCHG_LOCAL and not)
__mod_node_page_state
__inc_zone_page_state
__inc_node_page_state
__dec_zone_page_state
__dec_node_page_state
mod_zone_page_state
inc_zone_page_state
dec_zone_page_state
mod_node_page_state
inc_node_page_state
dec_node_page_state
#endif

Any suggestion on how to split this into multiple patchsets for easier
reviewing? (I can't think of anything obvious.)
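To make the reorganization concrete, here is a minimal, stand-alone C sketch of the wrapper structure described above. The HAVE_CMPXCHG_LOCAL macro, mod_state() helper, mod_counter()/__mod_counter() entry points and the single counter variable are simplified stand-ins for CONFIG_HAVE_CMPXCHG_LOCAL, mod_zone_state(), the *_page_state() functions and the per-CPU diffs; an atomic add stands in for the kernel's this_cpu_cmpxchg() loop. The real definitions are in the patch further down this page.

```c
#include <stdio.h>

#define HAVE_CMPXCHG_LOCAL 1            /* stand-in for CONFIG_HAVE_CMPXCHG_LOCAL */

static long counter;                    /* stand-in for one vmstat counter */

#if HAVE_CMPXCHG_LOCAL
/* One helper does the real work; the kernel uses a this_cpu_cmpxchg() loop. */
static void mod_state(long delta)
{
        __atomic_add_fetch(&counter, delta, __ATOMIC_RELAXED);
}

/* Both entry points now map to the same cmpxchg-style helper ... */
static void mod_counter(long delta)   { mod_state(delta); }
static void __mod_counter(long delta) { mod_state(delta); }
#else
/*
 * ... while without cmpxchg_local the "__" variant keeps assuming the caller
 * has disabled interrupts/preemption, and the plain variant wraps it.
 */
static void __mod_counter(long delta) { counter += delta; }
static void mod_counter(long delta)
{
        /* the kernel brackets this with local_irq_save()/local_irq_restore() */
        __mod_counter(delta);
}
#endif

int main(void)
{
        mod_counter(1);
        __mod_counter(2);
        printf("counter = %ld\n", counter);     /* prints: counter = 3 */
        return 0;
}
```

The point of folding the "__" variants into the cmpxchg-based helper is that callers do not change, while the CONFIG_HAVE_CMPXCHG_LOCAL build gains an update path that is safe to mix with remote flushing.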
On Thu, Mar 02, 2023 at 11:47:35AM -0300, Marcelo Tosatti wrote:
> So it will be:
>
> #ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> mod_zone_page_state
> inc_zone_page_state
> dec_zone_page_state
> mod_node_page_state
> inc_node_page_state
> dec_node_page_state
> __mod_zone_page_state (new function, calls mod_zone_page_state)
> __mod_node_page_state (new function, calls mod_node_page_state)
> __inc_zone_page_state
> __inc_node_page_state
> __dec_zone_page_state
> __dec_node_page_state
> #else
> __mod_zone_page_state (old, shared function for both CONFIG_HAVE_CMPXCHG_LOCAL and not)
> __mod_node_page_state
> __inc_zone_page_state
> __inc_node_page_state
> __dec_zone_page_state
> __dec_node_page_state
> mod_zone_page_state
> inc_zone_page_state
> dec_zone_page_state
> mod_node_page_state
> inc_node_page_state
> dec_node_page_state
> #endif
>
> Any suggestion on how to split this into multiple patchsets for easier
> reviewing? (I can't think of anything obvious.)

I figured this out before I saw this, but it did take me some time to read
carefully into the code base. Maybe it would be a good idea to mention
something like the above in the commit message to help future reviewers
(and make it more likely to attract the experts to chime in)?

One fundamental (but maybe another naive one.. :) question on this code piece
(so not directly related to the changeset, but maybe it still is):

AFAICT CONFIG_HAVE_CMPXCHG_LOCAL only means we can do cmpxchg() without
locking the memory bus. However, when !CONFIG_HAVE_CMPXCHG_LOCAL, we are not
using the non-local version here but preempt_disable_nested() to make sure
the read is atomic. Is it really worth it? What happens if we use cmpxchg()
unconditionally, but just use the local (e.g. no "LOCK" prefix) version when
CONFIG_HAVE_CMPXCHG_LOCAL?
On Thu, Mar 02, 2023 at 11:20:49AM -0500, Peter Xu wrote:
> On Thu, Mar 02, 2023 at 11:47:35AM -0300, Marcelo Tosatti wrote:
> > So it will be:
> >
> > #ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> > mod_zone_page_state
> > inc_zone_page_state
> > dec_zone_page_state
> > mod_node_page_state
> > inc_node_page_state
> > dec_node_page_state
> > __mod_zone_page_state (new function, calls mod_zone_page_state)
> > __mod_node_page_state (new function, calls mod_node_page_state)
> > __inc_zone_page_state
> > __inc_node_page_state
> > __dec_zone_page_state
> > __dec_node_page_state
> > #else
> > __mod_zone_page_state (old, shared function for both CONFIG_HAVE_CMPXCHG_LOCAL and not)
> > __mod_node_page_state
> > __inc_zone_page_state
> > __inc_node_page_state
> > __dec_zone_page_state
> > __dec_node_page_state
> > mod_zone_page_state
> > inc_zone_page_state
> > dec_zone_page_state
> > mod_node_page_state
> > inc_node_page_state
> > dec_node_page_state
> > #endif
> >
> > Any suggestion on how to split this into multiple patchsets for easier
> > reviewing? (I can't think of anything obvious.)
>
> I figured this out before I saw this, but it did take me some time to read
> carefully into the code base. Maybe it would be a good idea to mention
> something like the above in the commit message to help future reviewers
> (and make it more likely to attract the experts to chime in)?
>
> One fundamental (but maybe another naive one.. :) question on this code piece
> (so not directly related to the changeset, but maybe it still is):
>
> AFAICT CONFIG_HAVE_CMPXCHG_LOCAL only means we can do cmpxchg() without
> locking the memory bus.

CONFIG_HAVE_CMPXCHG_LOCAL means cmpxchg_local is implemented (that is,
a cmpxchg which is atomic with respect to the local CPU only).

LOCK cmpxchg is necessary for cmpxchg to be atomic on SMP.

> However, when !CONFIG_HAVE_CMPXCHG_LOCAL, we are not
> using the non-local version here but preempt_disable_nested() to make sure
> the read is atomic. Is it really worth it? What happens if we use cmpxchg()
> unconditionally, but just use the local (e.g. no "LOCK" prefix) version when
> CONFIG_HAVE_CMPXCHG_LOCAL?

Can't use the local version of cmpxchg because the vmstat counters are
supposed to be accessed from different CPUs simultaneously (this is the
objective of the patchset):

CPU-0                           CPU-1

vmstat_shepherd                 mod_zone_page_state
xchg location                   LOCK cmpxchg location

xchg locks the memory bus implicitly.

Is this what you are thinking about, or did I misunderstand what you mean?
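A user-space model of this interaction may help (not kernel code; the updater()/shepherd() names and the million-iteration count are made up for the example). One thread plays the remote vmstat_shepherd draining the per-CPU diff with an atomic exchange, the other plays the counter's home CPU updating it with a full compare-and-swap:

```c
/* cc -O2 -pthread vmstat_model.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_long diff;        /* models one per-CPU vm_stat_diff slot */
static atomic_long total;       /* models the global zone/node counter  */
static atomic_int  done;

/* Models mod_zone_page_state() on the counter's home CPU: a locked CAS loop. */
static void *updater(void *arg)
{
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
                long old = atomic_load(&diff);
                /* full atomic CAS: safe even while the shepherd drains us */
                while (!atomic_compare_exchange_weak(&diff, &old, old + 1))
                        ;
        }
        atomic_store(&done, 1);
        return NULL;
}

/* Models vmstat_shepherd flushing the counter remotely with xchg. */
static void *shepherd(void *arg)
{
        (void)arg;
        while (!atomic_load(&done))
                atomic_fetch_add(&total, atomic_exchange(&diff, 0));
        return NULL;
}

int main(void)
{
        pthread_t u, s;

        pthread_create(&u, NULL, updater, NULL);
        pthread_create(&s, NULL, shepherd, NULL);
        pthread_join(u, NULL);
        pthread_join(s, NULL);

        /* With atomic RMWs on both sides no increment is ever lost. */
        printf("%ld\n", atomic_load(&total) + atomic_load(&diff));  /* 1000000 */
        return 0;
}
```

If updater() instead did a plain read-modify-write (the cmpxchg_local analogy, atomic only with respect to its own thread), an exchange landing between its read and its store would be overwritten and increments would be lost; that is the case the LOCK'd cmpxchg guards against.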
On Thu, Mar 02, 2023 at 04:11:32PM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 02, 2023 at 11:20:49AM -0500, Peter Xu wrote:
> > On Thu, Mar 02, 2023 at 11:47:35AM -0300, Marcelo Tosatti wrote:
> > > So it will be:
> > >
> > > #ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> > > mod_zone_page_state
> > > inc_zone_page_state
> > > dec_zone_page_state
> > > mod_node_page_state
> > > inc_node_page_state
> > > dec_node_page_state
> > > __mod_zone_page_state (new function, calls mod_zone_page_state)
> > > __mod_node_page_state (new function, calls mod_node_page_state)
> > > __inc_zone_page_state
> > > __inc_node_page_state
> > > __dec_zone_page_state
> > > __dec_node_page_state
> > > #else
> > > __mod_zone_page_state (old, shared function for both CONFIG_HAVE_CMPXCHG_LOCAL and not)
> > > __mod_node_page_state
> > > __inc_zone_page_state
> > > __inc_node_page_state
> > > __dec_zone_page_state
> > > __dec_node_page_state
> > > mod_zone_page_state
> > > inc_zone_page_state
> > > dec_zone_page_state
> > > mod_node_page_state
> > > inc_node_page_state
> > > dec_node_page_state
> > > #endif
> > >
> > > Any suggestion on how to split this into multiple patchsets for easier
> > > reviewing? (I can't think of anything obvious.)
> >
> > I figured this out before I saw this, but it did take me some time to read
> > carefully into the code base. Maybe it would be a good idea to mention
> > something like the above in the commit message to help future reviewers
> > (and make it more likely to attract the experts to chime in)?
> >
> > One fundamental (but maybe another naive one.. :) question on this code piece
> > (so not directly related to the changeset, but maybe it still is):
> >
> > AFAICT CONFIG_HAVE_CMPXCHG_LOCAL only means we can do cmpxchg() without
> > locking the memory bus.
>
> CONFIG_HAVE_CMPXCHG_LOCAL means cmpxchg_local is implemented (that is,
> a cmpxchg which is atomic with respect to the local CPU only).
>
> LOCK cmpxchg is necessary for cmpxchg to be atomic on SMP.
>
> > However, when !CONFIG_HAVE_CMPXCHG_LOCAL, we are not
> > using the non-local version here but preempt_disable_nested() to make sure
> > the read is atomic. Is it really worth it? What happens if we use cmpxchg()
> > unconditionally, but just use the local (e.g. no "LOCK" prefix) version when
> > CONFIG_HAVE_CMPXCHG_LOCAL?
>
> Can't use the local version of cmpxchg because the vmstat counters are
> supposed to be accessed from different CPUs simultaneously (this is the
> objective of the patchset):
>
> CPU-0                           CPU-1
>
> vmstat_shepherd                 mod_zone_page_state
> xchg location                   LOCK cmpxchg location
>
> xchg locks the memory bus implicitly.
>
> Is this what you are thinking about, or did I misunderstand what you mean?

Yes, I think I wrongly interpreted cmpxchg_local() before, assuming it was
still atomic but an accelerated version of cmpxchg() that was only supported
on a few archs. I double-checked the code and the spec on x86 - I believe
you're right.

Thanks,
Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -334,6 +334,188 @@ void set_pgdat_percpu_threshold(pg_data_
 	}
 }
 
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/*
+ * If we have cmpxchg_local support then we do not need to incur the overhead
+ * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
+ *
+ * mod_state() modifies the zone counter state through atomic per cpu
+ * operations.
+ *
+ * Overstep mode specifies how overstep should handled:
+ *        0       No overstepping
+ *        1       Overstepping half of threshold
+ *        -1      Overstepping minus half of threshold
+ */
+static inline void mod_zone_state(struct zone *zone, enum zone_stat_item item,
+				  long delta, int overstep_mode)
+{
+	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
+	s8 __percpu *p = pcp->vm_stat_diff + item;
+	long o, n, t, z;
+
+	do {
+		z = 0;  /* overflow to zone counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a zone.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (abs(n) > t) {
+			int os = overstep_mode * (t >> 1);
+
+			/* Overflow must be added to zone counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		zone_page_state_add(z, zone, item);
+}
+
+void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+			 long delta)
+{
+	mod_zone_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_zone_page_state);
+
+void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+			   long delta)
+{
+	mod_zone_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_zone_page_state);
+
+void inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_zone_page_state);
+
+void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(__inc_zone_page_state);
+
+void dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_zone_page_state);
+
+void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(__dec_zone_page_state);
+
+static inline void mod_node_state(struct pglist_data *pgdat,
+				  enum node_stat_item item,
+				  int delta, int overstep_mode)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	long o, n, t, z;
+
+	if (vmstat_item_in_bytes(item)) {
+		/*
+		 * Only cgroups use subpage accounting right now; at
+		 * the global level, these items still change in
+		 * multiples of whole pages. Store them as pages
+		 * internally to keep the per-cpu counters compact.
+		 */
+		VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
+		delta >>= PAGE_SHIFT;
+	}
+
+	do {
+		z = 0;  /* overflow to node counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a node.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (abs(n) > t) {
+			int os = overstep_mode * (t >> 1);
+
+			/* Overflow must be added to node counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		node_page_state_add(z, pgdat, item);
+}
+
+void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+			 long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_node_page_state);
+
+void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+			   long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_node_page_state);
+
+void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	mod_node_state(pgdat, item, 1, 1);
+}
+
+void inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_node_page_state);
+
+void __inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, 1, 1);
+}
EXPORT_SYMBOL(__inc_node_page_state);
+
+void dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_node_page_state);
+
+void __dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(__dec_node_page_state);
+#else
 /*
  * For use when we know that interrupts are disabled,
  * or when we know that preemption is disabled and that
@@ -541,149 +723,6 @@ void __dec_node_page_state(struct page *
 }
 EXPORT_SYMBOL(__dec_node_page_state);
 
-#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
-/*
- * If we have cmpxchg_local support then we do not need to incur the overhead
- * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
- *
- * mod_state() modifies the zone counter state through atomic per cpu
- * operations.
- *
- * Overstep mode specifies how overstep should handled:
- *        0       No overstepping
- *        1       Overstepping half of threshold
- *        -1      Overstepping minus half of threshold
-*/
-static inline void mod_zone_state(struct zone *zone,
-       enum zone_stat_item item, long delta, int overstep_mode)
-{
-	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	long o, n, t, z;
-
-	do {
-		z = 0;  /* overflow to zone counters */
-
-		/*
-		 * The fetching of the stat_threshold is racy. We may apply
-		 * a counter threshold to the wrong the cpu if we get
-		 * rescheduled while executing here. However, the next
-		 * counter update will apply the threshold again and
-		 * therefore bring the counter under the threshold again.
-		 *
-		 * Most of the time the thresholds are the same anyways
-		 * for all cpus in a zone.
-		 */
-		t = this_cpu_read(pcp->stat_threshold);
-
-		o = this_cpu_read(*p);
-		n = delta + o;
-
-		if (abs(n) > t) {
-			int os = overstep_mode * (t >> 1) ;
-
-			/* Overflow must be added to zone counters */
-			z = n + os;
-			n = -os;
-		}
-	} while (this_cpu_cmpxchg(*p, o, n) != o);
-
-	if (z)
-		zone_page_state_add(z, zone, item);
-}
-
-void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
-			 long delta)
-{
-	mod_zone_state(zone, item, delta, 0);
-}
-EXPORT_SYMBOL(mod_zone_page_state);
-
-void inc_zone_page_state(struct page *page, enum zone_stat_item item)
-{
-	mod_zone_state(page_zone(page), item, 1, 1);
-}
-EXPORT_SYMBOL(inc_zone_page_state);
-
-void dec_zone_page_state(struct page *page, enum zone_stat_item item)
-{
-	mod_zone_state(page_zone(page), item, -1, -1);
-}
-EXPORT_SYMBOL(dec_zone_page_state);
-
-static inline void mod_node_state(struct pglist_data *pgdat,
-       enum node_stat_item item, int delta, int overstep_mode)
-{
-	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	long o, n, t, z;
-
-	if (vmstat_item_in_bytes(item)) {
-		/*
-		 * Only cgroups use subpage accounting right now; at
-		 * the global level, these items still change in
-		 * multiples of whole pages. Store them as pages
-		 * internally to keep the per-cpu counters compact.
-		 */
-		VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
-		delta >>= PAGE_SHIFT;
-	}
-
-	do {
-		z = 0;  /* overflow to node counters */
-
-		/*
-		 * The fetching of the stat_threshold is racy. We may apply
-		 * a counter threshold to the wrong the cpu if we get
-		 * rescheduled while executing here. However, the next
-		 * counter update will apply the threshold again and
-		 * therefore bring the counter under the threshold again.
-		 *
-		 * Most of the time the thresholds are the same anyways
-		 * for all cpus in a node.
-		 */
-		t = this_cpu_read(pcp->stat_threshold);
-
-		o = this_cpu_read(*p);
-		n = delta + o;
-
-		if (abs(n) > t) {
-			int os = overstep_mode * (t >> 1) ;
-
-			/* Overflow must be added to node counters */
-			z = n + os;
-			n = -os;
-		}
-	} while (this_cpu_cmpxchg(*p, o, n) != o);
-
-	if (z)
-		node_page_state_add(z, pgdat, item);
-}
-
-void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
-					long delta)
-{
-	mod_node_state(pgdat, item, delta, 0);
-}
-EXPORT_SYMBOL(mod_node_page_state);
-
-void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
-{
-	mod_node_state(pgdat, item, 1, 1);
-}
-
-void inc_node_page_state(struct page *page, enum node_stat_item item)
-{
-	mod_node_state(page_pgdat(page), item, 1, 1);
-}
-EXPORT_SYMBOL(inc_node_page_state);
-
-void dec_node_page_state(struct page *page, enum node_stat_item item)
-{
-	mod_node_state(page_pgdat(page), item, -1, -1);
-}
-EXPORT_SYMBOL(dec_node_page_state);
-#else
 /*
  * Use interrupt disable to serialize counter updates
  */
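For readers new to this file, the following stand-alone C sketch models what the mod_zone_state()/mod_node_state() loops above do with the threshold and the overstep modes. The per-CPU machinery (this_cpu_read()/this_cpu_cmpxchg(), zone_page_state_add()) is replaced by a plain C11 atomic and ordinary variables, and the threshold value is an arbitrary example, so this illustrates the algorithm only, not the kernel implementation:

```c
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

static _Atomic signed char diff;        /* models pcp->vm_stat_diff[item] (s8) */
static long zone_counter;               /* models the global zone counter      */
static const int threshold = 32;        /* models pcp->stat_threshold          */

/* overstep_mode: 0 = none, 1 = overstep +t/2, -1 = overstep -t/2 */
static void mod_state(long delta, int overstep_mode)
{
        signed char o;
        long n, z;

        do {
                z = 0;                          /* overflow to the zone counter */
                o = atomic_load(&diff);
                n = delta + o;

                if (labs(n) > threshold) {
                        long os = overstep_mode * (threshold >> 1);

                        z = n + os;             /* fold into the global counter */
                        n = -os;                /* leave only the overstep behind */
                }
        } while (!atomic_compare_exchange_weak(&diff, &o, (signed char)n));

        if (z)
                zone_counter += z;              /* models zone_page_state_add() */
}

int main(void)
{
        for (int i = 0; i < 100; i++)
                mod_state(1, 1);                /* like inc_zone_page_state() */

        /* zone_counter + diff always equals the number of increments (100). */
        printf("zone_counter=%ld diff=%d\n", zone_counter, atomic_load(&diff));
        return 0;
}
```

With overstep_mode = 1 the diff is left at -threshold/2 after each fold, so a run of further increments has roughly 1.5x the threshold of headroom before the shared counter has to be touched again; that is the effect the 0/1/-1 modes in the comment above describe.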
In preparation to switch vmstat shepherd to flush
per-CPU counters remotely, switch all functions that
modify the counters to use cmpxchg.

To test the performance difference, a page allocator microbenchmark:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.

For the single_page_alloc_free test, which does

        /** Loop to measure **/
        for (i = 0; i < rec->loops; i++) {
                my_page = alloc_page(gfp_mask);
                if (unlikely(my_page == NULL))
                        return 0;
                __free_page(my_page);
        }

Unit is cycles.

Vanilla    Patched    Diff
159        165        3.7%

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>