Message ID | 20200127173453.2089565-13-guro@fb.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | The new cgroup slab memory controller | expand |
On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote: > Currently s8 type is used for per-cpu caching of per-node statistics. > It works fine because the overfill threshold can't exceed 125. > > But if some counters are in bytes (and the next commit in the series > will convert slab counters to bytes), it's not gonna work: > value in bytes can easily exceed s8 without exceeding the threshold > converted to bytes. So to avoid overfilling per-cpu caches and breaking > vmstats correctness, let's use s32 instead. > > This doesn't affect per-zone statistics. There are no plans to use > zone-level byte-sized counters, so no reasons to change anything. Wait, is this still necessary? AFAIU, the node counters will account full slab pages, including free space, and only the memcg counters that track actual objects will be in bytes. Can you please elaborate?
On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote: > On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote: > > Currently s8 type is used for per-cpu caching of per-node statistics. > > It works fine because the overfill threshold can't exceed 125. > > > > But if some counters are in bytes (and the next commit in the series > > will convert slab counters to bytes), it's not gonna work: > > value in bytes can easily exceed s8 without exceeding the threshold > > converted to bytes. So to avoid overfilling per-cpu caches and breaking > > vmstats correctness, let's use s32 instead. > > > > This doesn't affect per-zone statistics. There are no plans to use > > zone-level byte-sized counters, so no reasons to change anything. > > Wait, is this still necessary? AFAIU, the node counters will account > full slab pages, including free space, and only the memcg counters > that track actual objects will be in bytes. > > Can you please elaborate? It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B) being in different units depending on the accounting scope. So I do convert all slab counters: global, per-lruvec, and per-memcg to bytes. Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec NR_SLAB_RECLAIMABLE_OBJ NR_SLAB_UNRECLAIMABLE_OBJ and keep global counters untouched. If going this way, I'd prefer to make them per-memcg, because it will simplify things on charging paths: now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(), and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to bump per-lruvec counters. Btw, I wonder if we really need per-lruvec counters at all (at least being enabled by default). For the significant amount of users who have a single-node machine it doesn't bring anything except performance overhead. For those who have multiple nodes (and most likely many many memory cgroups) it provides way too many data except for debugging some weird mm issues. I guess in the absolute majority of cases having global per-node + per-memcg counters will be enough. Thanks!
On Mon, Feb 03, 2020 at 10:25:06AM -0800, Roman Gushchin wrote: > On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote: > > On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote: > > > Currently s8 type is used for per-cpu caching of per-node statistics. > > > It works fine because the overfill threshold can't exceed 125. > > > > > > But if some counters are in bytes (and the next commit in the series > > > will convert slab counters to bytes), it's not gonna work: > > > value in bytes can easily exceed s8 without exceeding the threshold > > > converted to bytes. So to avoid overfilling per-cpu caches and breaking > > > vmstats correctness, let's use s32 instead. > > > > > > This doesn't affect per-zone statistics. There are no plans to use > > > zone-level byte-sized counters, so no reasons to change anything. > > > > Wait, is this still necessary? AFAIU, the node counters will account > > full slab pages, including free space, and only the memcg counters > > that track actual objects will be in bytes. > > > > Can you please elaborate? > > It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B) > being in different units depending on the accounting scope. > So I do convert all slab counters: global, per-lruvec, > and per-memcg to bytes. Since the node counters tracks allocated slab pages and the memcg counter tracks allocated objects, arguably they shouldn't use the same name anyway. > Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec > NR_SLAB_RECLAIMABLE_OBJ > NR_SLAB_UNRECLAIMABLE_OBJ Can we alias them and reuse their slots? /* Reuse the node slab page counters item for charged objects */ MEMCG_SLAB_RECLAIMABLE = NR_SLAB_RECLAIMABLE, MEMCG_SLAB_UNRECLAIMABLE = NR_SLAB_UNRECLAIMABLE, > and keep global counters untouched. If going this way, I'd prefer to make > them per-memcg, because it will simplify things on charging paths: > now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(), > and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to > bump per-lruvec counters. I don't quite follow. Don't you still have to update the global counters? > Btw, I wonder if we really need per-lruvec counters at all (at least > being enabled by default). For the significant amount of users who > have a single-node machine it doesn't bring anything except performance > overhead. Yeah, for single-node systems we should be able to redirect everything to the memcg counters, without allocating and tracking lruvec copies. > For those who have multiple nodes (and most likely many many > memory cgroups) it provides way too many data except for debugging > some weird mm issues. > I guess in the absolute majority of cases having global per-node + per-memcg > counters will be enough. Hm? Reclaim uses the lruvec counters.
On Mon, Feb 03, 2020 at 03:34:50PM -0500, Johannes Weiner wrote: > On Mon, Feb 03, 2020 at 10:25:06AM -0800, Roman Gushchin wrote: > > On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote: > > > On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote: > > > > Currently s8 type is used for per-cpu caching of per-node statistics. > > > > It works fine because the overfill threshold can't exceed 125. > > > > > > > > But if some counters are in bytes (and the next commit in the series > > > > will convert slab counters to bytes), it's not gonna work: > > > > value in bytes can easily exceed s8 without exceeding the threshold > > > > converted to bytes. So to avoid overfilling per-cpu caches and breaking > > > > vmstats correctness, let's use s32 instead. > > > > > > > > This doesn't affect per-zone statistics. There are no plans to use > > > > zone-level byte-sized counters, so no reasons to change anything. > > > > > > Wait, is this still necessary? AFAIU, the node counters will account > > > full slab pages, including free space, and only the memcg counters > > > that track actual objects will be in bytes. > > > > > > Can you please elaborate? > > > > It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B) > > being in different units depending on the accounting scope. > > So I do convert all slab counters: global, per-lruvec, > > and per-memcg to bytes. > > Since the node counters tracks allocated slab pages and the memcg > counter tracks allocated objects, arguably they shouldn't use the same > name anyway. > > > Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec > > NR_SLAB_RECLAIMABLE_OBJ > > NR_SLAB_UNRECLAIMABLE_OBJ > > Can we alias them and reuse their slots? > > /* Reuse the node slab page counters item for charged objects */ > MEMCG_SLAB_RECLAIMABLE = NR_SLAB_RECLAIMABLE, > MEMCG_SLAB_UNRECLAIMABLE = NR_SLAB_UNRECLAIMABLE, Yeah, lgtm. Isn't MEMCG_ prefix bad because it makes everybody think that it belongs to the enum memcg_stat_item? > > > and keep global counters untouched. If going this way, I'd prefer to make > > them per-memcg, because it will simplify things on charging paths: > > now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(), > > and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to > > bump per-lruvec counters. > > I don't quite follow. Don't you still have to update the global > counters? Global counters are updated only if an allocation requires a new slab page, which isn't the most common path. In generic case post_hook is required because it's the only place where we have both page (to get the node) and memcg pointer. If NR_SLAB_RECLAIMABLE is tracked only per-memcg (as MEMCG_SOCK), then post_hook can handle only the rare "allocation failed" case. I'm not sure here what's better. > > > Btw, I wonder if we really need per-lruvec counters at all (at least > > being enabled by default). For the significant amount of users who > > have a single-node machine it doesn't bring anything except performance > > overhead. > > Yeah, for single-node systems we should be able to redirect everything > to the memcg counters, without allocating and tracking lruvec copies. Sounds good. It can lead to significant savings on single-node machines. > > > For those who have multiple nodes (and most likely many many > > memory cgroups) it provides way too many data except for debugging > > some weird mm issues. > > I guess in the absolute majority of cases having global per-node + per-memcg > > counters will be enough. > > Hm? Reclaim uses the lruvec counters. Can you, please, provide some examples? It looks like it's mostly based on per-zone lruvec size counters. Anyway, it seems to be a little bit off from this patchset, so let's discuss it separately.
On Mon, Feb 03, 2020 at 02:28:53PM -0800, Roman Gushchin wrote: > On Mon, Feb 03, 2020 at 03:34:50PM -0500, Johannes Weiner wrote: > > On Mon, Feb 03, 2020 at 10:25:06AM -0800, Roman Gushchin wrote: > > > On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote: > > > > On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote: > > > > > Currently s8 type is used for per-cpu caching of per-node statistics. > > > > > It works fine because the overfill threshold can't exceed 125. > > > > > > > > > > But if some counters are in bytes (and the next commit in the series > > > > > will convert slab counters to bytes), it's not gonna work: > > > > > value in bytes can easily exceed s8 without exceeding the threshold > > > > > converted to bytes. So to avoid overfilling per-cpu caches and breaking > > > > > vmstats correctness, let's use s32 instead. > > > > > > > > > > This doesn't affect per-zone statistics. There are no plans to use > > > > > zone-level byte-sized counters, so no reasons to change anything. > > > > > > > > Wait, is this still necessary? AFAIU, the node counters will account > > > > full slab pages, including free space, and only the memcg counters > > > > that track actual objects will be in bytes. > > > > > > > > Can you please elaborate? > > > > > > It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B) > > > being in different units depending on the accounting scope. > > > So I do convert all slab counters: global, per-lruvec, > > > and per-memcg to bytes. > > > > Since the node counters tracks allocated slab pages and the memcg > > counter tracks allocated objects, arguably they shouldn't use the same > > name anyway. > > > > > Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec > > > NR_SLAB_RECLAIMABLE_OBJ > > > NR_SLAB_UNRECLAIMABLE_OBJ > > > > Can we alias them and reuse their slots? > > > > /* Reuse the node slab page counters item for charged objects */ > > MEMCG_SLAB_RECLAIMABLE = NR_SLAB_RECLAIMABLE, > > MEMCG_SLAB_UNRECLAIMABLE = NR_SLAB_UNRECLAIMABLE, > > Yeah, lgtm. > > Isn't MEMCG_ prefix bad because it makes everybody think that it belongs to > the enum memcg_stat_item? Maybe, not sure that's a problem. #define CG_SLAB_RECLAIMABLE perhaps? > > > and keep global counters untouched. If going this way, I'd prefer to make > > > them per-memcg, because it will simplify things on charging paths: > > > now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(), > > > and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to > > > bump per-lruvec counters. > > > > I don't quite follow. Don't you still have to update the global > > counters? > > Global counters are updated only if an allocation requires a new slab > page, which isn't the most common path. Right. > In generic case post_hook is required because it's the only place where > we have both page (to get the node) and memcg pointer. > > If NR_SLAB_RECLAIMABLE is tracked only per-memcg (as MEMCG_SOCK), > then post_hook can handle only the rare "allocation failed" case. > > I'm not sure here what's better. If it's tracked only per-memcg, you still have to account it every time you charge an object to a memcg, no? How is it less frequent than acconting at the lruvec level? > > > Btw, I wonder if we really need per-lruvec counters at all (at least > > > being enabled by default). For the significant amount of users who > > > have a single-node machine it doesn't bring anything except performance > > > overhead. > > > > Yeah, for single-node systems we should be able to redirect everything > > to the memcg counters, without allocating and tracking lruvec copies. > > Sounds good. It can lead to significant savings on single-node machines. > > > > > > For those who have multiple nodes (and most likely many many > > > memory cgroups) it provides way too many data except for debugging > > > some weird mm issues. > > > I guess in the absolute majority of cases having global per-node + per-memcg > > > counters will be enough. > > > > Hm? Reclaim uses the lruvec counters. > > Can you, please, provide some examples? It looks like it's mostly based > on per-zone lruvec size counters. It uses the recursive lruvec state to decide inactive_is_low(), whether refaults are occuring, whether to trim cache only or go for anon etc. We use it to determine refault distances and how many shadow nodes to shrink. Grep for lruvec_page_state(). > Anyway, it seems to be a little bit off from this patchset, so let's > discuss it separately. True
On Mon, Feb 03, 2020 at 05:39:54PM -0500, Johannes Weiner wrote: > On Mon, Feb 03, 2020 at 02:28:53PM -0800, Roman Gushchin wrote: > > On Mon, Feb 03, 2020 at 03:34:50PM -0500, Johannes Weiner wrote: > > > On Mon, Feb 03, 2020 at 10:25:06AM -0800, Roman Gushchin wrote: > > > > On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote: > > > > > On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote: > > > > > > Currently s8 type is used for per-cpu caching of per-node statistics. > > > > > > It works fine because the overfill threshold can't exceed 125. > > > > > > > > > > > > But if some counters are in bytes (and the next commit in the series > > > > > > will convert slab counters to bytes), it's not gonna work: > > > > > > value in bytes can easily exceed s8 without exceeding the threshold > > > > > > converted to bytes. So to avoid overfilling per-cpu caches and breaking > > > > > > vmstats correctness, let's use s32 instead. > > > > > > > > > > > > This doesn't affect per-zone statistics. There are no plans to use > > > > > > zone-level byte-sized counters, so no reasons to change anything. > > > > > > > > > > Wait, is this still necessary? AFAIU, the node counters will account > > > > > full slab pages, including free space, and only the memcg counters > > > > > that track actual objects will be in bytes. > > > > > > > > > > Can you please elaborate? > > > > > > > > It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B) > > > > being in different units depending on the accounting scope. > > > > So I do convert all slab counters: global, per-lruvec, > > > > and per-memcg to bytes. > > > > > > Since the node counters tracks allocated slab pages and the memcg > > > counter tracks allocated objects, arguably they shouldn't use the same > > > name anyway. > > > > > > > Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec > > > > NR_SLAB_RECLAIMABLE_OBJ > > > > NR_SLAB_UNRECLAIMABLE_OBJ > > > > > > Can we alias them and reuse their slots? > > > > > > /* Reuse the node slab page counters item for charged objects */ > > > MEMCG_SLAB_RECLAIMABLE = NR_SLAB_RECLAIMABLE, > > > MEMCG_SLAB_UNRECLAIMABLE = NR_SLAB_UNRECLAIMABLE, > > > > Yeah, lgtm. > > > > Isn't MEMCG_ prefix bad because it makes everybody think that it belongs to > > the enum memcg_stat_item? > > Maybe, not sure that's a problem. #define CG_SLAB_RECLAIMABLE perhaps? Maybe not. I'll probably go with MEMCG_SLAB_RECLAIMABLE_B = NR_SLAB_RECLAIMABLE, MEMCG_SLAB_UNRECLAIMABLE_B = NR_SLAB_UNRECLAIMABLE, Please, let me know if you're not ok with it. > > > > > and keep global counters untouched. If going this way, I'd prefer to make > > > > them per-memcg, because it will simplify things on charging paths: > > > > now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(), > > > > and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to > > > > bump per-lruvec counters. > > > > > > I don't quite follow. Don't you still have to update the global > > > counters? > > > > Global counters are updated only if an allocation requires a new slab > > page, which isn't the most common path. > > Right. > > > In generic case post_hook is required because it's the only place where > > we have both page (to get the node) and memcg pointer. > > > > If NR_SLAB_RECLAIMABLE is tracked only per-memcg (as MEMCG_SOCK), > > then post_hook can handle only the rare "allocation failed" case. > > > > I'm not sure here what's better. > > If it's tracked only per-memcg, you still have to account it every > time you charge an object to a memcg, no? How is it less frequent than > acconting at the lruvec level? It's not less frequent, it just can be done in the pre-alloc hook when there is a memcg pointer available. The problem with the obj_cgroup thing is that we get it indirectly from current memcg in the pre_alloc_hook, then pass it to obj_cgroup API, internally we might need to get the memcg from it to charge a page, and then again in the post_hook we need to get memcg to bump per-lruvec stats. In other words we make several memcg <-> objcg conversions, which isn't very nice on the hot path. I see that in the future we might optimize the initial lookup of objcg, but getting memcg just to bump vmstats looks unnecessarily expensive. One option I think about is to handle byte-sized stats on obj_cgroup level and flush whole pages to memcg level. > > > > > Btw, I wonder if we really need per-lruvec counters at all (at least > > > > being enabled by default). For the significant amount of users who > > > > have a single-node machine it doesn't bring anything except performance > > > > overhead. > > > > > > Yeah, for single-node systems we should be able to redirect everything > > > to the memcg counters, without allocating and tracking lruvec copies. > > > > Sounds good. It can lead to significant savings on single-node machines. > > > > > > > > > For those who have multiple nodes (and most likely many many > > > > memory cgroups) it provides way too many data except for debugging > > > > some weird mm issues. > > > > I guess in the absolute majority of cases having global per-node + per-memcg > > > > counters will be enough. > > > > > > Hm? Reclaim uses the lruvec counters. > > > > Can you, please, provide some examples? It looks like it's mostly based > > on per-zone lruvec size counters. > > It uses the recursive lruvec state to decide inactive_is_low(), > whether refaults are occuring, whether to trim cache only or go for > anon etc. We use it to determine refault distances and how many shadow > nodes to shrink. > > Grep for lruvec_page_state(). I see... Thanks! > > > Anyway, it seems to be a little bit off from this patchset, so let's > > discuss it separately. > > True
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 462f6873905a..1c5eafd925e7 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -360,7 +360,7 @@ struct per_cpu_pageset { struct per_cpu_nodestat { s8 stat_threshold; - s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS]; + s32 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS]; }; #endif /* !__GENERATING_BOUNDS.H */ diff --git a/mm/vmstat.c b/mm/vmstat.c index 78d53378db99..6242129939dd 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -337,7 +337,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item, long delta) { struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; - s8 __percpu *p = pcp->vm_node_stat_diff + item; + s32 __percpu *p = pcp->vm_node_stat_diff + item; long x; long t; @@ -395,13 +395,13 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item) void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item) { struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; - s8 __percpu *p = pcp->vm_node_stat_diff + item; - s8 v, t; + s32 __percpu *p = pcp->vm_node_stat_diff + item; + s32 v, t; v = __this_cpu_inc_return(*p); t = __this_cpu_read(pcp->stat_threshold); if (unlikely(v > t)) { - s8 overstep = t >> 1; + s32 overstep = t >> 1; node_page_state_add(v + overstep, pgdat, item); __this_cpu_write(*p, -overstep); @@ -439,13 +439,13 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item) void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item) { struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; - s8 __percpu *p = pcp->vm_node_stat_diff + item; - s8 v, t; + s32 __percpu *p = pcp->vm_node_stat_diff + item; + s32 v, t; v = __this_cpu_dec_return(*p); t = __this_cpu_read(pcp->stat_threshold); if (unlikely(v < - t)) { - s8 overstep = t >> 1; + s32 overstep = t >> 1; node_page_state_add(v - overstep, pgdat, item); __this_cpu_write(*p, overstep); @@ -538,7 +538,7 @@ static inline void mod_node_state(struct pglist_data *pgdat, enum node_stat_item item, int delta, int overstep_mode) { struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; - s8 __percpu *p = pcp->vm_node_stat_diff + item; + s32 __percpu *p = pcp->vm_node_stat_diff + item; long o, n, t, z; do {
Currently s8 type is used for per-cpu caching of per-node statistics. It works fine because the overfill threshold can't exceed 125. But if some counters are in bytes (and the next commit in the series will convert slab counters to bytes), it's not gonna work: value in bytes can easily exceed s8 without exceeding the threshold converted to bytes. So to avoid overfilling per-cpu caches and breaking vmstats correctness, let's use s32 instead. This doesn't affect per-zone statistics. There are no plans to use zone-level byte-sized counters, so no reasons to change anything. Signed-off-by: Roman Gushchin <guro@fb.com> --- include/linux/mmzone.h | 2 +- mm/vmstat.c | 16 ++++++++-------- 2 files changed, 9 insertions(+), 9 deletions(-)