Message ID | 20240420232505.2768428-1-shakeel.butt@linux.dev |
---|---|
State | New |
Series | [v2] memcg: simple cleanup of stats update functions |
On 2024-04-20 16:25:05 [-0700], Shakeel Butt wrote:
> mod_memcg_lruvec_state() is never called from outside of memcontrol.c
> and with always irq disabled. So, replace it with the irq disabled
> version and add an assert that irq is disabled in the caller.

unless PREEMPT_RT is enabled. In that case IRQs are not disabled as part
of local_lock_irqsave(&memcg_stock.stock_lock, …) leading to:

| ------------[ cut here ]------------
| WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:3150 __mod_objcg_mlstate+0xc2/0x110
| CPU: 0 PID: 1 Comm: systemd Not tainted 6.10.0-rc1-rt0+ #17
| Call Trace:
|  <TASK>
|  mod_objcg_state+0x2b3/0x320
|  __memcg_slab_post_alloc_hook+0x13c/0x340
|  kmem_cache_alloc_lru_noprof+0x2bd/0x2e0
|  alloc_inode+0x59/0xc0
|  iget_locked+0xf0/0x290

suggestions?

Sebastian
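For context, a simplified sketch of the two local_lock_irqsave() flavours (not the verbatim include/linux/local_lock_internal.h implementation; the lockdep annotations are omitted) shows why the assert fires only on PREEMPT_RT:

#ifndef CONFIG_PREEMPT_RT
/* !RT: interrupts really are disabled for the whole locked section,
 * so a lockdep_assert_irqs_disabled() inside it is satisfied.
 */
#define local_lock_irqsave(lock, flags)		\
	local_irq_save(flags)
#else
/* RT: the local_lock_t maps to a per-CPU sleeping spinlock_t;
 * interrupts stay enabled and only migration is disabled, so the
 * assert triggers the warning quoted above.
 */
#define local_lock_irqsave(lock, flags)			\
	do {						\
		flags = 0;				\
		migrate_disable();			\
		spin_lock(this_cpu_ptr(lock));		\
	} while (0)
#endif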
On 5/27/24 5:22 PM, Sebastian Andrzej Siewior wrote:
> On 2024-04-20 16:25:05 [-0700], Shakeel Butt wrote:
>> mod_memcg_lruvec_state() is never called from outside of memcontrol.c
>> and with always irq disabled. So, replace it with the irq disabled
>> version and add an assert that irq is disabled in the caller.
>
> unless PREEMPT_RT is enabled. In that case IRQs are not disabled as part
> of local_lock_irqsave(&memcg_stock.stock_lock, …) leading to:

But then the "interrupts are handled by a kernel thread that can sleep" part
of RT also means it's ok to just have the stock_lock taken with no
interrupts disabled as no actual raw interrupt handler will interrupt the
holder and deadlock, right?

> | ------------[ cut here ]------------
> | WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:3150 __mod_objcg_mlstate+0xc2/0x110
> | CPU: 0 PID: 1 Comm: systemd Not tainted 6.10.0-rc1-rt0+ #17
> | Call Trace:
> |  <TASK>
> |  mod_objcg_state+0x2b3/0x320
> |  __memcg_slab_post_alloc_hook+0x13c/0x340
> |  kmem_cache_alloc_lru_noprof+0x2bd/0x2e0
> |  alloc_inode+0x59/0xc0
> |  iget_locked+0xf0/0x290
>
> suggestions?

So in that case the appropriate thing would be to replace the assert with
lockdep_assert_held(&memcg_stock.stock_lock);
?

It seems all the code paths leading here have that one.

> Sebastian
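A sketch of how that suggestion might look in __mod_objcg_mlstate() (illustrative only; since memcg_stock is per-CPU, the assert needs the this_cpu_ptr() form, which comes up later in the thread):

static inline void __mod_objcg_mlstate(struct obj_cgroup *objcg,
				       struct pglist_data *pgdat,
				       enum node_stat_item idx, int nr)
{
	/* Holds on both RT and !RT: every caller has taken the per-CPU
	 * stock lock via local_lock_irqsave(), whether or not that also
	 * disabled interrupts.
	 */
	lockdep_assert_held(this_cpu_ptr(&memcg_stock.stock_lock));

	/* ... body unchanged: look up the lruvec under RCU and update
	 * the counter via __mod_memcg_lruvec_state() ...
	 */
}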
On Mon, May 27, 2024 at 06:34:24PM GMT, Vlastimil Babka (SUSE) wrote:
> On 5/27/24 5:22 PM, Sebastian Andrzej Siewior wrote:
> > On 2024-04-20 16:25:05 [-0700], Shakeel Butt wrote:
> >> mod_memcg_lruvec_state() is never called from outside of memcontrol.c
> >> and with always irq disabled. So, replace it with the irq disabled
> >> version and add an assert that irq is disabled in the caller.
> >
> > unless PREEMPT_RT is enabled. In that case IRQs are not disabled as part
> > of local_lock_irqsave(&memcg_stock.stock_lock, …) leading to:

Sorry about that and thanks for the report.

> But then the "interrupts are handled by a kernel thread that can sleep" part
> of RT also means it's ok to just have the stock_lock taken with no
> interrupts disabled as no actual raw interrupt handler will interrupt the
> holder and deadlock, right?

Thanks Vlastimil for jolting my memory on RT reasoning.

> > | ------------[ cut here ]------------
> > | WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:3150 __mod_objcg_mlstate+0xc2/0x110
> > | CPU: 0 PID: 1 Comm: systemd Not tainted 6.10.0-rc1-rt0+ #17
> > | Call Trace:
> > |  <TASK>
> > |  mod_objcg_state+0x2b3/0x320
> > |  __memcg_slab_post_alloc_hook+0x13c/0x340
> > |  kmem_cache_alloc_lru_noprof+0x2bd/0x2e0
> > |  alloc_inode+0x59/0xc0
> > |  iget_locked+0xf0/0x290
> >
> > suggestions?
>
> So in that case the appropriate thing would be to replace the assert with
> lockdep_assert_held(&memcg_stock.stock_lock);
> ?
>
> It seems all the code paths leading here have that one.
>

Yeah this seems right and reasonable. Should I send a fix or you want to
send it?
On 2024-05-27 22:16:41 [-0700], Shakeel Butt wrote:
> On Mon, May 27, 2024 at 06:34:24PM GMT, Vlastimil Babka (SUSE) wrote:
> > On 5/27/24 5:22 PM, Sebastian Andrzej Siewior wrote:
> > > On 2024-04-20 16:25:05 [-0700], Shakeel Butt wrote:
> > >> mod_memcg_lruvec_state() is never called from outside of memcontrol.c
> > >> and with always irq disabled. So, replace it with the irq disabled
> > >> version and add an assert that irq is disabled in the caller.
> > >
> > > unless PREEMPT_RT is enabled. In that case IRQs are not disabled as part
> > > of local_lock_irqsave(&memcg_stock.stock_lock, …) leading to:
>
> Sorry about that and thanks for the report.

no worries.

> > But then the "interrupts are handled by a kernel thread that can sleep" part
> > of RT also means it's ok to just have the stock_lock taken with no
> > interrupts disabled as no actual raw interrupt handler will interrupt the
> > holder and deadlock, right?

I *don't* know why the interrupts-disabled check is here. The
memcg_stock.stock_lock is acquired on RT with interrupts enabled and
never disables interrupts. The lock is never acquired in a hard
interrupt (not threaded interrupt) context, so there is never a deadlock.

Originally the interrupts were disabled in mod_memcg_lruvec_state()
because the counter it operates on is per-CPU and relies on disabled
interrupts because the operation is not atomic and the code can be run
in interrupt context (on !RT). The __mod_memcg_lruvec_state() variant
of it relied on interrupts being disabled by the caller. This "rely on"
was part of a spinlock_t lock (or invoked from an interrupt handler, the
memory is fading slowly away) which does not disable interrupts on
PREEMPT_RT.
So for that reason we ended up with __memcg_stats_lock() which disables
preemption only on PREEMPT_RT to achieve the same level of "atomic"
update.

> Thanks Vlastimil for jolting my memory on RT reasoning.
>
> > > suggestions?
> >
> > So in that case the appropriate thing would be to replace the assert with
> > lockdep_assert_held(&memcg_stock.stock_lock);
> > ?
> >
> > It seems all the code paths leading here have that one.
> >
>
> Yeah this seems right and reasonable. Should I send a fix or you want to
> send it?

I don't mind sending a patch. I'm just not sure if the lock is the right
thing to do. However, it should ensure that interrupts are disabled on
!RT for the sake of the counter update (if observed in IRQ context).

Yeah, let me prepare something.

Sebastian
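The mechanism Sebastian describes looks roughly like this in mm/memcontrol.c of that era; a from-memory sketch, so treat exact bodies as approximate:

/* On !RT the caller's disabled interrupts make the non-atomic per-CPU
 * counter update safe; on RT, disabling preemption achieves the same.
 * preempt_disable_nested() is preempt_disable() on PREEMPT_RT and a
 * mere barrier() otherwise.
 */
static void __memcg_stats_lock(void)
{
	preempt_disable_nested();
}

static void memcg_stats_lock(void)
{
	preempt_disable_nested();
	VM_WARN_ON_IRQS_ENABLED();	/* !RT callers must have IRQs off */
}

static void memcg_stats_unlock(void)
{
	preempt_enable_nested();
}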
On 5/28/24 9:56 AM, Sebastian Andrzej Siewior wrote:
> On 2024-05-27 22:16:41 [-0700], Shakeel Butt wrote:
>> On Mon, May 27, 2024 at 06:34:24PM GMT, Vlastimil Babka (SUSE) wrote:
>> > On 5/27/24 5:22 PM, Sebastian Andrzej Siewior wrote:
>> > > On 2024-04-20 16:25:05 [-0700], Shakeel Butt wrote:
>> > >> mod_memcg_lruvec_state() is never called from outside of memcontrol.c
>> > >> and with always irq disabled. So, replace it with the irq disabled
>> > >> version and add an assert that irq is disabled in the caller.
>> > >
>> > > unless PREEMPT_RT is enabled. In that case IRQs are not disabled as part
>> > > of local_lock_irqsave(&memcg_stock.stock_lock, …) leading to:
>>
>> Sorry about that and thanks for the report.
>
> no worries.
>
>> >
>> > But then the "interrupts are handled by a kernel thread that can sleep" part
>> > of RT also means it's ok to just have the stock_lock taken with no
>> > interrupts disabled as no actual raw interrupt handler will interrupt the
>> > holder and deadlock, right?
>
> I *don't* know why the interrupts-disabled check is here. The
> memcg_stock.stock_lock is acquired on RT with interrupts enabled and
> never disables interrupts. The lock is never acquired in a hard
> interrupt (not threaded interrupt) context, so there is never a deadlock.
>
> Originally the interrupts were disabled in mod_memcg_lruvec_state()
> because the counter it operates on is per-CPU and relies on disabled
> interrupts because the operation is not atomic and the code can be run
> in interrupt context (on !RT). The __mod_memcg_lruvec_state() variant
> of it relied on interrupts being disabled by the caller. This "rely on"
> was part of a spinlock_t lock (or invoked from an interrupt handler, the
> memory is fading slowly away) which does not disable interrupts on
> PREEMPT_RT.
> So for that reason we ended up with __memcg_stats_lock() which disables
> preemption only on PREEMPT_RT to achieve the same level of "atomic"
> update.
>
>> Thanks Vlastimil for jolting my memory on RT reasoning.
>>
>> > > suggestions?
>> >
>> > So in that case the appropriate thing would be to replace the assert with
>> > lockdep_assert_held(&memcg_stock.stock_lock);
>> > ?
>> >
>> > It seems all the code paths leading here have that one.
>> >
>>
>> Yeah this seems right and reasonable. Should I send a fix or you want to
>> send it?
>
> I don't mind sending a patch. I'm just not sure if the lock is the right
> thing to do. However, it should ensure that interrupts are disabled on
> !RT for the sake of the counter update (if observed in IRQ context).

Looks like some places there use VM_WARN_ON_IRQS_ENABLED() that's turned off
for PREEMPT_RT, so maybe that's what should replace the current
lockdep_assert, perhaps together with
lockdep_assert_held(this_cpu_ptr(&memcg_stock.stock_lock));

But also __mod_memcg_lruvec_state() already has that VM_WARN_ON.

> Yeah, let me prepare something.
>
> Sebastian
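VM_WARN_ON_IRQS_ENABLED() is RT-safe where lockdep_assert_irqs_disabled() is not because it is compiled out on PREEMPT_RT; its definition in include/linux/mmdebug.h is roughly:

/* CONFIG_DEBUG_VM_IRQSOFF is only set when DEBUG_VM is enabled and
 * PREEMPT_RT is not, so the check vanishes entirely on RT kernels.
 */
#ifdef CONFIG_DEBUG_VM_IRQSOFF
#define VM_WARN_ON_IRQS_ENABLED()	WARN_ON_ONCE(!irqs_disabled())
#else
#define VM_WARN_ON_IRQS_ENABLED()	do { } while (0)
#endif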
On 2024-05-28 10:10:50 [+0200], Vlastimil Babka (SUSE) wrote:
> > I don't mind sending a patch. I'm just not sure if the lock is the right
> > thing to do. However, it should ensure that interrupts are disabled on
> > !RT for the sake of the counter update (if observed in IRQ context).
>
> Looks like some places there use VM_WARN_ON_IRQS_ENABLED() that's turned off
> for PREEMPT_RT, so maybe that's what should replace the current
> lockdep_assert, perhaps together with
> lockdep_assert_held(this_cpu_ptr(&memcg_stock.stock_lock));
>
> But also __mod_memcg_lruvec_state() already has that VM_WARN_ON.

This "VM_WARN_ON_IRQS_ENABLED" is the initial assert for "interrupts
must be disabled while changing the counter".
You want to replace it with lockdep? Part of its requirement was that it
yells with lockdep disabled.

Currently I am leaning towards removing the
lockdep_assert_irqs_disabled() from __mod_objcg_mlstate(). Nothing but
the counter needs it and it has its own check. So?

Sebastian
On 5/28/24 12:19 PM, Sebastian Andrzej Siewior wrote:
> On 2024-05-28 10:10:50 [+0200], Vlastimil Babka (SUSE) wrote:
>> > I don't mind sending a patch. I'm just not sure if the lock is the right
>> > thing to do. However, it should ensure that interrupts are disabled on
>> > !RT for the sake of the counter update (if observed in IRQ context).
>>
>> Looks like some places there use VM_WARN_ON_IRQS_ENABLED() that's turned off
>> for PREEMPT_RT, so maybe that's what should replace the current
>> lockdep_assert, perhaps together with
>> lockdep_assert_held(this_cpu_ptr(&memcg_stock.stock_lock));
>>
>> But also __mod_memcg_lruvec_state() already has that VM_WARN_ON.
>
> This "VM_WARN_ON_IRQS_ENABLED" is the initial assert for "interrupts
> must be disabled while changing the counter".
> You want to replace it with lockdep? Part of its requirement was that it
> yells with lockdep disabled.

No, I meant the other way around: that we'd use VM_WARN_ON_IRQS_ENABLED
instead of the current lockdep_assert() that doesn't work on RT.

> Currently I am leaning towards removing the
> lockdep_assert_irqs_disabled() from __mod_objcg_mlstate(). Nothing but
> the counter needs it and it has its own check. So?

Yeah that might be sufficient.

> Sebastian
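Sebastian's preferred direction, sketched (a sketch, not a posted patch): drop the assert from __mod_objcg_mlstate() and rely on the RT-aware check the counter update already performs. The v2 patch that introduced the assert and started this thread is reproduced below for reference.

static inline void __mod_objcg_mlstate(struct obj_cgroup *objcg,
				       struct pglist_data *pgdat,
				       enum node_stat_item idx, int nr)
{
	struct mem_cgroup *memcg;
	struct lruvec *lruvec;

	/* No lockdep_assert_irqs_disabled() here: the per-CPU counter
	 * update inside __mod_memcg_lruvec_state() carries its own
	 * VM_WARN_ON_IRQS_ENABLED() via memcg_stats_lock().
	 */
	rcu_read_lock();
	memcg = obj_cgroup_memcg(objcg);
	lruvec = mem_cgroup_lruvec(memcg, pgdat);
	__mod_memcg_lruvec_state(lruvec, idx, nr);
	rcu_read_unlock();
}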
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8f332b4ae84c..9aba0d0462ca 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1077,8 +1077,6 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
 void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg);
 
-void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
-			      int val);
 void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
 
 static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
@@ -1091,16 +1089,6 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
 	local_irq_restore(flags);
 }
 
-static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
-					  enum node_stat_item idx, int val)
-{
-	unsigned long flags;
-
-	local_irq_save(flags);
-	__mod_memcg_lruvec_state(lruvec, idx, val);
-	local_irq_restore(flags);
-}
-
 void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 			  unsigned long count);
 
@@ -1594,11 +1582,6 @@ static inline void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
 {
 }
 
-static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
-					    enum node_stat_item idx, int val)
-{
-}
-
 static inline void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
 					   int val)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7703ced535a3..833d09c1d523 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -837,8 +837,9 @@ static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
 	return x;
 }
 
-void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
-			      int val)
+static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
+				     enum node_stat_item idx,
+				     int val)
 {
 	struct mem_cgroup_per_node *pn;
 	struct mem_cgroup *memcg;
@@ -2983,21 +2984,19 @@ void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg)
 
 #ifdef CONFIG_MEMCG_KMEM
 
-/*
- * mod_objcg_mlstate() may be called with irq enabled, so
- * mod_memcg_lruvec_state() should be used.
- */
-static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
-				     struct pglist_data *pgdat,
-				     enum node_stat_item idx, int nr)
+static inline void __mod_objcg_mlstate(struct obj_cgroup *objcg,
+				       struct pglist_data *pgdat,
+				       enum node_stat_item idx, int nr)
 {
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
+	lockdep_assert_irqs_disabled();
+
 	rcu_read_lock();
 	memcg = obj_cgroup_memcg(objcg);
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	mod_memcg_lruvec_state(lruvec, idx, nr);
+	__mod_memcg_lruvec_state(lruvec, idx, nr);
 	rcu_read_unlock();
 }
 
@@ -3317,7 +3316,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	obj_cgroup_put(objcg);
 }
 
-void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+static void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 		     enum node_stat_item idx, int nr)
 {
 	struct memcg_stock_pcp *stock;
@@ -3345,12 +3344,12 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 		struct pglist_data *oldpg = stock->cached_pgdat;
 
 		if (stock->nr_slab_reclaimable_b) {
-			mod_objcg_mlstate(objcg, oldpg, NR_SLAB_RECLAIMABLE_B,
+			__mod_objcg_mlstate(objcg, oldpg, NR_SLAB_RECLAIMABLE_B,
 					  stock->nr_slab_reclaimable_b);
 			stock->nr_slab_reclaimable_b = 0;
 		}
 		if (stock->nr_slab_unreclaimable_b) {
-			mod_objcg_mlstate(objcg, oldpg, NR_SLAB_UNRECLAIMABLE_B,
+			__mod_objcg_mlstate(objcg, oldpg, NR_SLAB_UNRECLAIMABLE_B,
 					  stock->nr_slab_unreclaimable_b);
 			stock->nr_slab_unreclaimable_b = 0;
 		}
@@ -3376,7 +3375,7 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 		}
 	}
 	if (nr)
-		mod_objcg_mlstate(objcg, pgdat, idx, nr);
+		__mod_objcg_mlstate(objcg, pgdat, idx, nr);
 
 	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 	obj_cgroup_put(old);
@@ -3442,13 +3441,13 @@ static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
 	 */
 	if (stock->nr_slab_reclaimable_b || stock->nr_slab_unreclaimable_b) {
 		if (stock->nr_slab_reclaimable_b) {
-			mod_objcg_mlstate(old, stock->cached_pgdat,
+			__mod_objcg_mlstate(old, stock->cached_pgdat,
 					  NR_SLAB_RECLAIMABLE_B,
 					  stock->nr_slab_reclaimable_b);
 			stock->nr_slab_reclaimable_b = 0;
 		}
 		if (stock->nr_slab_unreclaimable_b) {
-			mod_objcg_mlstate(old, stock->cached_pgdat,
+			__mod_objcg_mlstate(old, stock->cached_pgdat,
 					  NR_SLAB_UNRECLAIMABLE_B,
 					  stock->nr_slab_unreclaimable_b);
 			stock->nr_slab_unreclaimable_b = 0;
diff --git a/mm/slab.h b/mm/slab.h
index e32d9cf1077a..5f8f47c5bee0 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -578,8 +578,6 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 				  gfp_t flags, size_t size, void **p);
 void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
 			    void **p, int objects, struct slabobj_ext *obj_exts);
-void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
-		     enum node_stat_item idx, int nr);
 #endif
 
 size_t __ksize(const void *objp);