
[02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat

Message ID 20191018002820.307763-3-guro@fb.com (mailing list archive)
State New, archived
Series The new slab memory controller

Commit Message

Roman Gushchin Oct. 18, 2019, 12:28 a.m. UTC
Currently s8 type is used for per-cpu caching of per-node statistics.
It works fine because the overfill threshold can't exceed 125.

But if some counters are in bytes (and the next commit in the series
will convert slab counters to bytes), it's not gonna work:
value in bytes can easily exceed s8 without exceeding the threshold
converted to bytes. So to avoid overfilling per-cpu caches and breaking
vmstats correctness, let's use s32 instead.
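
As a minimal sketch of the overflow (userspace code, not part of the patch; the
4 KiB page size and the 1 KiB charge are assumptions for illustration), an s8
per-cpu diff silently truncates a byte-sized delta long before reaching the
byte-converted threshold of 125 * 4096 = 512000:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const long page_size = 4096;		/* assumed 4 KiB pages */
	const long max_threshold = 125;		/* per-cpu stat_threshold cap */
	const long threshold_bytes = max_threshold * page_size;

	int8_t diff = 0;			/* models the old s8 per-cpu cache */
	long delta = 1024;			/* one 1 KiB slab charge, in bytes */

	diff += delta;	/* truncated to 8 bits: the whole charge is lost */

	printf("threshold in bytes: %ld\n", threshold_bytes);		/* 512000 */
	printf("s8 diff after charging 1024 bytes: %d\n", diff);	/* 0 */
	return 0;
}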

This doesn't affect per-zone statistics. There are no plans to use
zone-level byte-sized counters, so no reasons to change anything.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/mmzone.h |  2 +-
 mm/vmstat.c            | 16 ++++++++--------
 2 files changed, 9 insertions(+), 9 deletions(-)

Comments

Christoph Lameter (Ampere) Oct. 20, 2019, 10:44 p.m. UTC | #1
On Thu, 17 Oct 2019, Roman Gushchin wrote:

> But if some counters are in bytes (and the next commit in the series
> will convert slab counters to bytes), it's not gonna work:
> value in bytes can easily exceed s8 without exceeding the threshold
> converted to bytes. So to avoid overfilling per-cpu caches and breaking
> vmstats correctness, let's use s32 instead.

This quadruples the cache footprint of the counters and likely has some
influence on performance.
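
A rough sizeof sketch of that footprint (not kernel code; NR_VM_NODE_STAT_ITEMS
is taken as 32 here purely as an assumption, the real count depends on kernel
version and config):

#include <stdio.h>

#define NR_VM_NODE_STAT_ITEMS 32	/* assumed for illustration */

struct nodestat_s8  { signed char stat_threshold; signed char diff[NR_VM_NODE_STAT_ITEMS]; };
struct nodestat_s32 { signed char stat_threshold; int diff[NR_VM_NODE_STAT_ITEMS]; };

int main(void)
{
	/* ~33 bytes (one 64-byte cache line) vs ~132 bytes (three cache lines) */
	printf("s8  diffs: %zu bytes per cpu per node\n", sizeof(struct nodestat_s8));
	printf("s32 diffs: %zu bytes per cpu per node\n", sizeof(struct nodestat_s32));
	return 0;
}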
Christoph Lameter (Ampere) Oct. 20, 2019, 10:51 p.m. UTC | #2
On Thu, 17 Oct 2019, Roman Gushchin wrote:

> Currently s8 type is used for per-cpu caching of per-node statistics.
> It works fine because the overfill threshold can't exceed 125.
>
> But if some counters are in bytes (and the next commit in the series
> will convert slab counters to bytes), it's not gonna work:
> value in bytes can easily exceed s8 without exceeding the threshold
> converted to bytes. So to avoid overfilling per-cpu caches and breaking
> vmstats correctness, let's use s32 instead.

Actually this is inconsistent, since the other counters are all used to
account for pages. Some of the functions in use for the page counters will
no longer make sense: inc/dec_node_state() etc.
Roman Gushchin Oct. 21, 2019, 1:15 a.m. UTC | #3
On Sun, Oct 20, 2019 at 10:44:10PM +0000, Christopher Lameter wrote:
> On Thu, 17 Oct 2019, Roman Gushchin wrote:
> 
> > But if some counters are in bytes (and the next commit in the series
> > will convert slab counters to bytes), it's not gonna work:
> > value in bytes can easily exceed s8 without exceeding the threshold
> > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > vmstats correctness, let's use s32 instead.
> 
> This quadruples the cache footprint of the counters and likely has some
> influence on performance.

But otherwise slab allocation counters will be flushed too often, which
will likely be noticeable. I can do custom s32 buffers only for these counters,
but to me it seems like an unnecessary complication, unless we find
a clear regression.

So far I haven't noticed any regression on the set of workloads where I tested
the patchset, but if you know any benchmark or realistic test which can be affected
by this change, I'll be happy to try.

Also, less-than-word-sized operations can be slower on some platforms.
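
An illustrative sketch of that flush rate (the 64-byte object size is an
assumption): keeping s8 diffs while counting bytes would cap the per-cpu batch
at roughly 125 bytes, so nearly every slab allocation would spill into the
shared atomic counters.

#include <stdio.h>

int main(void)
{
	const int threshold = 125;	/* max per-cpu stat_threshold */
	const int obj_bytes = 64;	/* assumed slab object size */

	/* Counting pages, an s8 diff absorbs up to ~125 events per flush. */
	printf("page units: ~%d events per flush\n", threshold);

	/* Counting bytes with an s8 diff, it absorbs only ~125 bytes, i.e.
	 * one or two 64-byte objects, before spilling into the shared
	 * atomic node counters. */
	printf("byte units: ~%d events per flush\n", threshold / obj_bytes);
	return 0;
}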
Roman Gushchin Oct. 21, 2019, 1:21 a.m. UTC | #4
On Sun, Oct 20, 2019 at 10:51:10PM +0000, Christopher Lameter wrote:
> On Thu, 17 Oct 2019, Roman Gushchin wrote:
> 
> > Currently s8 type is used for per-cpu caching of per-node statistics.
> > It works fine because the overfill threshold can't exceed 125.
> >
> > But if some counters are in bytes (and the next commit in the series
> > will convert slab counters to bytes), it's not gonna work:
> > value in bytes can easily exceed s8 without exceeding the threshold
> > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > vmstats correctness, let's use s32 instead.
> 
> Actually this is inconsistent, since the other counters are all used to
> account for pages. Some of the functions in use for the page counters will
> no longer make sense: inc/dec_node_state() etc.

Actually I tried to implement what Johannes suggested earlier and convert
all counters to bytes, but it looked like it could lead to a significant
performance hit on 32-bit platforms, as I'd need to replace all atomic_long_t
with atomic64_t. Also, to make the code look good, I'd need to convert
all counters to bytes (and atomic64_t): zone stats, vmevents, etc.
So I gave up relatively quickly.

Maybe it's a good long-term plan, but as of now it doesn't really look
like an obviously good thing to do.

I'm fully open to any suggestions here.
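
A userspace stand-in for the 32-bit concern (C11 atomics serve only as a proxy
for the kernel's atomic_long_t/atomic64_t; whether the 64-bit case is lock-free
on a given 32-bit target is platform-dependent):

#include <stdatomic.h>
#include <stdio.h>

atomic_long  node_stat_pages;	/* word-sized: cheap on 32-bit and 64-bit */
atomic_llong node_stat_bytes;	/* 64-bit: may need lock-based atomics on 32-bit */

int main(void)
{
	atomic_fetch_add(&node_stat_pages, 1);		/* one page */
	atomic_fetch_add(&node_stat_bytes, 4096);	/* the same charge, in bytes */

	printf("long  counter lock-free: %d\n", ATOMIC_LONG_LOCK_FREE);
	printf("llong counter lock-free: %d\n", ATOMIC_LLONG_LOCK_FREE);
	return 0;
}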
Christoph Lameter (Ampere) Oct. 21, 2019, 6:09 p.m. UTC | #5
On Mon, 21 Oct 2019, Roman Gushchin wrote:

> Sp far I haven't noticed any regression on the set of workloads where I did test
> the patchset, but if you know any benchmark or realistic test which can affected
> by this check, I'll be happy to try.
>
> Also, less-than-word-sized operations can be slower on some platforms.

The main issue in the past was the cache footprint. Memory is slow.
Processors are fast.

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d4ca03b93373..4d8adaa5d59c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -353,7 +353,7 @@  struct per_cpu_pageset {
 
 struct per_cpu_nodestat {
 	s8 stat_threshold;
-	s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
+	s32 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
 };
 
 #endif /* !__GENERATING_BOUNDS.H */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b2fd344d2fcf..3375e7f45891 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -337,7 +337,7 @@  void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
 				long delta)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long x;
 	long t;
 
@@ -395,13 +395,13 @@  void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v + overstep, pgdat, item);
 		__this_cpu_write(*p, -overstep);
@@ -439,13 +439,13 @@  void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v - overstep, pgdat, item);
 		__this_cpu_write(*p, overstep);
@@ -538,7 +538,7 @@  static inline void mod_node_state(struct pglist_data *pgdat,
        enum node_stat_item item, int delta, int overstep_mode)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long o, n, t, z;
 
 	do {