Message ID | 20191018002820.307763-3-guro@fb.com (mailing list archive)
---|---
State | New, archived
Series | The new slab memory controller
On Thu, 17 Oct 2019, Roman Gushchin wrote:

> But if some counters are in bytes (and the next commit in the series
> will convert slab counters to bytes), it's not gonna work:
> value in bytes can easily exceed s8 without exceeding the threshold
> converted to bytes. So to avoid overfilling per-cpu caches and breaking
> vmstats correctness, let's use s32 instead.

This quadruples the cache footprint of the counters and likely has some
influence on performance.
On Thu, 17 Oct 2019, Roman Gushchin wrote:

> Currently s8 type is used for per-cpu caching of per-node statistics.
> It works fine because the overfill threshold can't exceed 125.
>
> But if some counters are in bytes (and the next commit in the series
> will convert slab counters to bytes), it's not gonna work:
> value in bytes can easily exceed s8 without exceeding the threshold
> converted to bytes. So to avoid overfilling per-cpu caches and breaking
> vmstats correctness, let's use s32 instead.

Actually this is inconsistent, since the other counters are all used to
account for pages. Some of the functions in use for the page counters
will no longer make sense: inc/dec_node_state() etc.
On Sun, Oct 20, 2019 at 10:44:10PM +0000, Christopher Lameter wrote:
> On Thu, 17 Oct 2019, Roman Gushchin wrote:
>
> > But if some counters are in bytes (and the next commit in the series
> > will convert slab counters to bytes), it's not gonna work:
> > value in bytes can easily exceed s8 without exceeding the threshold
> > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > vmstats correctness, let's use s32 instead.
>
> This quadruples the cache footprint of the counters and likely has some
> influence on performance.

But otherwise slab allocation counters will be flushed too often, which
will likely be noticeable. I can do custom s32 buffers only for these
counters, but to me it seems like an unnecessary complication, unless
we find a clear regression.

So far I haven't noticed any regression on the set of workloads where I
tested the patchset, but if you know of any benchmark or realistic test
which can be affected by this change, I'll be happy to try.

Also, less-than-word-sized operations can be slower on some platforms.
On Sun, Oct 20, 2019 at 10:51:10PM +0000, Christopher Lameter wrote:
> On Thu, 17 Oct 2019, Roman Gushchin wrote:
>
> > Currently s8 type is used for per-cpu caching of per-node statistics.
> > It works fine because the overfill threshold can't exceed 125.
> >
> > But if some counters are in bytes (and the next commit in the series
> > will convert slab counters to bytes), it's not gonna work:
> > value in bytes can easily exceed s8 without exceeding the threshold
> > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > vmstats correctness, let's use s32 instead.
>
> Actually this is inconsistent, since the other counters are all used to
> account for pages. Some of the functions in use for the page counters
> will no longer make sense: inc/dec_node_state() etc.

Actually I tried to implement what Johannes suggested earlier and convert
all counters to bytes, but it looked like it could lead to a significant
performance hit on 32-bit platforms, as I'd need to replace all
atomic_long_t with atomic64_t. Also, to make the code look good, I'd need
to convert all counters to bytes (and atomic64_t): zone stats, vmevents,
etc. So I gave up relatively quickly.

Maybe it's a good long-term plan, but as of now it doesn't really look
like an obviously good thing to do. I'm fully open to any suggestions here.
On Mon, 21 Oct 2019, Roman Gushchin wrote:

> So far I haven't noticed any regression on the set of workloads where I
> tested the patchset, but if you know of any benchmark or realistic test
> which can be affected by this change, I'll be happy to try.
>
> Also, less-than-word-sized operations can be slower on some platforms.

The main issue in the past was the cache footprint. Memory is slow;
processors are fast.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d4ca03b93373..4d8adaa5d59c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -353,7 +353,7 @@ struct per_cpu_pageset {
 
 struct per_cpu_nodestat {
 	s8 stat_threshold;
-	s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
+	s32 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
 };
 
 #endif /* !__GENERATING_BOUNDS.H */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b2fd344d2fcf..3375e7f45891 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -337,7 +337,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 				long delta)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long x;
 	long t;
 
@@ -395,13 +395,13 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v + overstep, pgdat, item);
 		__this_cpu_write(*p, -overstep);
@@ -439,13 +439,13 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v - overstep, pgdat, item);
 		__this_cpu_write(*p, overstep);
@@ -538,7 +538,7 @@ static inline void
 mod_node_state(struct pglist_data *pgdat, enum node_stat_item item, int delta, int overstep_mode)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long o, n, t, z;
 
 	do {
Currently s8 type is used for per-cpu caching of per-node statistics.
It works fine because the overfill threshold can't exceed 125.

But if some counters are in bytes (and the next commit in the series
will convert slab counters to bytes), it's not gonna work: value in
bytes can easily exceed s8 without exceeding the threshold converted
to bytes. So to avoid overfilling per-cpu caches and breaking vmstats
correctness, let's use s32 instead.

This doesn't affect per-zone statistics. There are no plans to use
zone-level byte-sized counters, so no reasons to change anything.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/mmzone.h |  2 +-
 mm/vmstat.c            | 16 ++++++++--------
 2 files changed, 9 insertions(+), 9 deletions(-)