| Message ID | 20211207155208.eyre5svucpg7krxe@linutronix.de |
|---|---|
| State | New |
| Series | mm/memcontrol: Disable on PREEMPT_RT |
On 12/7/21 10:52, Sebastian Andrzej Siewior wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> MEMCG has a few constructs which are not compatible with PREEMPT_RT's
> requirements. This includes:
> - relying on disabled interrupts from spin_lock_irqsave() locking for
>   something not related to lock itself (like the per-CPU counter).
>
> - explicitly disabling interrupts and acquiring a spinlock_t based lock
>   like in memcg_check_events() -> eventfd_signal().
>
> - explicitly disabling interrupts and freeing memory like in
>   drain_obj_stock() -> obj_cgroup_put() -> obj_cgroup_release() ->
>   percpu_ref_exit().
>
> Commit 559271146efc ("mm/memcg: optimize user context object stock
> access") continued to optimize for the CPU local access which
> complicates the PREEMPT_RT locking requirements further.
>
> Disable MEMCG on PREEMPT_RT until the whole situation can be evaluated
> again.

Disabling MEMCG for PREEMPT_RT may be too drastic a step to take. For
commit 559271146efc ("mm/memcg: optimize user context object stock
access"), I can modify it to disable the optimization for PREEMPT_RT.

Cheers,
Longman

> [ bigeasy: commit description. ]
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  init/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -943,6 +943,7 @@ config PAGE_COUNTER
>
>  config MEMCG
>  	bool "Memory controller"
> +	depends on !PREEMPT_RT
>  	select PAGE_COUNTER
>  	select EVENTFD
>  	help
On Tue, Dec 07, 2021 at 04:52:08PM +0100, Sebastian Andrzej Siewior wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> MEMCG has a few constructs which are not compatible with PREEMPT_RT's
> requirements. This includes:
> - relying on disabled interrupts from spin_lock_irqsave() locking for
>   something not related to lock itself (like the per-CPU counter).

If memory serves me right, this is the VM_BUG_ON() in workingset.c:

	VM_WARN_ON_ONCE(!irqs_disabled());  /* For __inc_lruvec_page_state */

This isn't memcg specific. This is the serialization model of the
generic MM page counters. They can be updated from process and irq
context, and need to avoid preemption (and corruption) during RMW.

!CONFIG_MEMCG:

	static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
						 int val)
	{
		struct page *page = virt_to_head_page(p);

		mod_node_page_state(page_pgdat(page), idx, val);
	}

which does:

	void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
				 long delta)
	{
		unsigned long flags;

		local_irq_save(flags);
		__mod_node_page_state(pgdat, item, delta);
		local_irq_restore(flags);
	}

If this breaks PREEMPT_RT, it's broken without memcg too.

> - explicitly disabling interrupts and acquiring a spinlock_t based lock
>   like in memcg_check_events() -> eventfd_signal().

Similar problem to the above: we disable interrupts to protect RMW
sequences that can (on non-preemptrt) be initiated through process
context as well as irq context.

IIUC, the PREEMPT_RT construct for handling exactly that scenario is
the "local lock". Is that correct?

It appears Ingo has already fixed the LRU cache, which for non-rt also
relies on irq disabling:

	commit b01b2141999936ac3e4746b7f76c0f204ae4b445
	Author: Ingo Molnar <mingo@kernel.org>
	Date:   Wed May 27 22:11:15 2020 +0200

	    mm/swap: Use local_lock for protection

The memcg charge cache should be fixable the same way.

Likewise, if you fix the generic vmstat counters like this, the memcg
implementation can follow suit.
On 2021-12-07 11:55:38 [-0500], Johannes Weiner wrote:
> On Tue, Dec 07, 2021 at 04:52:08PM +0100, Sebastian Andrzej Siewior wrote:
> > From: Thomas Gleixner <tglx@linutronix.de>
> >
> > MEMCG has a few constructs which are not compatible with PREEMPT_RT's
> > requirements. This includes:
> > - relying on disabled interrupts from spin_lock_irqsave() locking for
> >   something not related to lock itself (like the per-CPU counter).
>
> If memory serves me right, this is the VM_BUG_ON() in workingset.c:
>
> 	VM_WARN_ON_ONCE(!irqs_disabled());  /* For __inc_lruvec_page_state */
>
> This isn't memcg specific. This is the serialization model of the
> generic MM page counters. They can be updated from process and irq
> context, and need to avoid preemption (and corruption) during RMW.
>
> !CONFIG_MEMCG:
>
> 	static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
> 						 int val)
> 	{
> 		struct page *page = virt_to_head_page(p);
>
> 		mod_node_page_state(page_pgdat(page), idx, val);
> 	}
>
> which does:
>
> 	void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
> 				 long delta)
> 	{
> 		unsigned long flags;
>
> 		local_irq_save(flags);
> 		__mod_node_page_state(pgdat, item, delta);
> 		local_irq_restore(flags);
> 	}
>
> If this breaks PREEMPT_RT, it's broken without memcg too.

The mod_node_page_state() looks fine. But if we use disabling interrupts
to protect the RMW operation then this has to be used everywhere and can
not be assumed to be inherited from spin_lock_irq(). Also, none of the
code here should be invoked from IRQ context on PREEMPT_RT.
If the locking scope is known then local_irq_disable() could be replaced
with a local_lock_t to avoid other things that appear unrelated, like
the memcg_check_events() invocation in uncharge_batch().
The problematic part here is mem_cgroup_tree_per_node::lock which can
not be acquired with disabled interrupts on PREEMPT_RT. The "locking
scope" is not always clear to me.
Also, if it is _just_ the counter, then we might solve this
differently.

> > - explicitly disabling interrupts and acquiring a spinlock_t based lock
> >   like in memcg_check_events() -> eventfd_signal().
>
> Similar problem to the above: we disable interrupts to protect RMW
> sequences that can (on non-preemptrt) be initiated through process
> context as well as irq context.
>
> IIUC, the PREEMPT_RT construct for handling exactly that scenario is
> the "local lock". Is that correct?

On !PREEMPT_RT this_cpu_inc() can be used in hard-IRQ and task context
equally while __this_cpu_inc() is "optimized" for IRQ context / a
context where it can not be interrupted during its operation.
local_irq_save() and spin_lock_irq() both disable interrupts here.

On PREEMPT_RT chances are high that the code never runs with disabled
interrupts. local_irq_save() disables interrupts, yes, but
spin_lock_irq() does not. Therefore a per-object lock, say
address_space::i_pages, can not be used to protect otherwise unrelated
per-CPU data, a global DEFINE_PER_CPU(). The reason is that you can
acquire address_space::i_pages and get preempted in the middle of
__this_cpu_inc(). Then another task on the same CPU can acquire
address_space::i_pages of another struct address_space and perform
__this_cpu_inc() on the very same per-CPU data. There is your
interruption of an RMW operation.

local_lock_t is a per-CPU lock which can be used to synchronize access
to per-CPU variables which are otherwise unprotected / rely on disabled
preemption / interrupts. So yes, it could be used as a substitute in
situations where !PREEMPT_RT needs to manually disable interrupts.

So this:

|func1(struct address_space *m)
|{
|	spin_lock_irq(&m->i_pages);
|	/* other m changes */
|	__this_cpu_add(counter);
|	spin_unlock_irq(&m->i_pages);
|}
|
|func2(void)
|{
|	local_irq_disable();
|	__this_cpu_add(counter);
|	local_irq_enable();
|}

construct breaks on PREEMPT_RT.
With local_lock_t that would be:

|func1(struct address_space *m)
|{
|	spin_lock_irq(&m->i_pages);
|	/* other m changes */
|	local_lock(&counter_lock);
|	__this_cpu_add(counter);
|	local_unlock(&counter_lock);
|	spin_unlock_irq(&m->i_pages);
|}
|
|func2(void)
|{
|	local_lock_irq(&counter_lock);
|	__this_cpu_add(counter);
|	local_unlock_irq(&counter_lock);
|}

Ideally you would attach counter_lock to the same struct where the
counter is defined so the protection scope is obvious.
As you see, the local_irq_disable() was substituted with a
local_lock_irq(), but also a local_lock() was added to func1().

> It appears Ingo has already fixed the LRU cache, which for non-rt also
> relies on irq disabling:
>
> 	commit b01b2141999936ac3e4746b7f76c0f204ae4b445
> 	Author: Ingo Molnar <mingo@kernel.org>
> 	Date:   Wed May 27 22:11:15 2020 +0200
>
> 	    mm/swap: Use local_lock for protection
>
> The memcg charge cache should be fixable the same way.
>
> Likewise, if you fix the generic vmstat counters like this, the memcg
> implementation can follow suit.

The vmstat counters have been fixed since commit c68ed7945701a
("mm/vmstat: protect per cpu variables with preempt disable on RT"),
again by Ingo. We need to agree on how to proceed with these counters,
and then we can tackle what is left :) It should be enough to disable
preemption during the update since on PREEMPT_RT that update does not
happen in IRQ context.

Sebastian
On Fri 10-12-21 16:22:01, Sebastian Andrzej Siewior wrote:
[...]

I am sorry but I didn't get to read and digest the rest of the message
yet. Let me just point out this:

> The problematic part here is mem_cgroup_tree_per_node::lock which can
> not be acquired with disabled interrupts on PREEMPT_RT. The "locking
> scope" is not always clear to me. Also, if it is _just_ the counter,
> then we might solve this differently.

I do not think you should be losing sleep over soft limit reclaim. This
is certainly not something to be used for RT workloads and rather than
touching that code I think it makes some sense to simply disallow soft
limit with RT enabled (i.e. do not allow to set any soft limit).
On 2021-12-13 11:08:26 [+0100], Michal Hocko wrote:
> On Fri 10-12-21 16:22:01, Sebastian Andrzej Siewior wrote:
> [...]
> I am sorry but I didn't get to read and digest the rest of the message
> yet. Let me just point out this
>
> > The problematic part here is mem_cgroup_tree_per_node::lock which can
> > not be acquired with disabled interrupts on PREEMPT_RT. The "locking
> > scope" is not always clear to me. Also, if it is _just_ the counter,
> > then we might solve this differently.
>
> I do not think you should be losing sleep over soft limit reclaim. This
> is certainly not something to be used for RT workloads and rather than
> touching that code I think it makes some sense to simply disallow soft
> limit with RT enabled (i.e. do not allow to set any soft limit).

Okay. So instead of disabling it entirely you suggest I should take
another stab at it? Disabling the soft limit, where should I start?
Should mem_cgroup_write() for RES_SOFT_LIMIT always return an error or
something else?

In the meantime I'll try to swap my memcg memory back in…

Sebastian
On Wed 15-12-21 17:47:54, Sebastian Andrzej Siewior wrote:
> On 2021-12-13 11:08:26 [+0100], Michal Hocko wrote:
> > On Fri 10-12-21 16:22:01, Sebastian Andrzej Siewior wrote:
> > [...]
> > I am sorry but I didn't get to read and digest the rest of the message
> > yet. Let me just point out this
> >
> > > The problematic part here is mem_cgroup_tree_per_node::lock which can
> > > not be acquired with disabled interrupts on PREEMPT_RT. The "locking
> > > scope" is not always clear to me. Also, if it is _just_ the counter,
> > > then we might solve this differently.
> >
> > I do not think you should be losing sleep over soft limit reclaim. This
> > is certainly not something to be used for RT workloads and rather than
> > touching that code I think it makes some sense to simply disallow soft
> > limit with RT enabled (i.e. do not allow to set any soft limit).
>
> Okay. So instead of disabling it entirely you suggest I should take
> another stab at it? Okay. Disabling softlimit, where should I start with
> it? Should mem_cgroup_write() for RES_SOFT_LIMIT always return an error
> or something else?

Yeah, I would just return an error for RT configuration. If we ever need
to implement that behavior for RT then we can look at specific fixes.

Thanks!
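[Editorial note: for illustration, this is one way mem_cgroup_write() could reject the soft limit on PREEMPT_RT as suggested above. It is a sketch, not a patch from the thread; the hunk location and the choice of -EINVAL are assumptions, since the thread leaves the exact error code open.]

```diff
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ mem_cgroup_write()
 	case RES_SOFT_LIMIT:
+		/* soft limit reclaim is incompatible with PREEMPT_RT */
+		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+			ret = -EINVAL;
+			break;
+		}
 		memcg->soft_limit = nr_pages;
 		ret = 0;
 		break;
```

With the soft limit never set, the soft limit tree (and its mem_cgroup_tree_per_node::lock) is never touched, which is the point of Michal's suggestion.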
On 2021-12-15 17:56:03 [+0100], Michal Hocko wrote:
> On Wed 15-12-21 17:47:54, Sebastian Andrzej Siewior wrote:
> > On 2021-12-13 11:08:26 [+0100], Michal Hocko wrote:
> > > On Fri 10-12-21 16:22:01, Sebastian Andrzej Siewior wrote:
> > > [...]
> > > I am sorry but I didn't get to read and digest the rest of the message
> > > yet. Let me just point out this
> > >
> > > > The problematic part here is mem_cgroup_tree_per_node::lock which can
> > > > not be acquired with disabled interrupts on PREEMPT_RT. The "locking
> > > > scope" is not always clear to me. Also, if it is _just_ the counter,
> > > > then we might solve this differently.
> > >
> > > I do not think you should be losing sleep over soft limit reclaim. This
> > > is certainly not something to be used for RT workloads and rather than
> > > touching that code I think it makes some sense to simply disallow soft
> > > limit with RT enabled (i.e. do not allow to set any soft limit).
> >
> > Okay. So instead of disabling it entirely you suggest I should take
> > another stab at it? Okay. Disabling softlimit, where should I start with
> > it? Should mem_cgroup_write() for RES_SOFT_LIMIT always return an error
> > or something else?
>
> Yeah, I would just return an error for RT configuration. If we ever need
> to implement that behavior for RT then we can look at specific fixes.

Okay. What do I gain by doing this / how do I test this? Is running
tools/testing/selftests/cgroup/test_*mem* sufficient to test all corner
cases here?

> Thanks!

Thank you ;)

Sebastian
On Wed 15-12-21 18:13:40, Sebastian Andrzej Siewior wrote:
> On 2021-12-15 17:56:03 [+0100], Michal Hocko wrote:
> > On Wed 15-12-21 17:47:54, Sebastian Andrzej Siewior wrote:
> > > On 2021-12-13 11:08:26 [+0100], Michal Hocko wrote:
> > > > On Fri 10-12-21 16:22:01, Sebastian Andrzej Siewior wrote:
> > > > [...]
> > > > I am sorry but I didn't get to read and digest the rest of the message
> > > > yet. Let me just point out this
> > > >
> > > > > The problematic part here is mem_cgroup_tree_per_node::lock which can
> > > > > not be acquired with disabled interrupts on PREEMPT_RT. The "locking
> > > > > scope" is not always clear to me. Also, if it is _just_ the counter,
> > > > > then we might solve this differently.
> > > >
> > > > I do not think you should be losing sleep over soft limit reclaim. This
> > > > is certainly not something to be used for RT workloads and rather than
> > > > touching that code I think it makes some sense to simply disallow soft
> > > > limit with RT enabled (i.e. do not allow to set any soft limit).
> > >
> > > Okay. So instead of disabling it entirely you suggest I should take
> > > another stab at it? Okay. Disabling softlimit, where should I start with
> > > it? Should mem_cgroup_write() for RES_SOFT_LIMIT always return an error
> > > or something else?
> >
> > Yeah, I would just return an error for RT configuration. If we ever need
> > to implement that behavior for RT then we can look at specific fixes.
>
> Okay. What do I gain by doing this / how do I test this? Is running
> tools/testing/selftests/cgroup/test_*mem* sufficient to test all corner
> cases here?

I am not fully aware of all the tests but my point is that if the soft
limit is not configured then there are no soft limit tree manipulations
ever happening and therefore the code is effectively dead. Is this
sufficient for the RT patchset to ignore the RT incompatible parts?
On 2021-12-15 19:44:00 [+0100], Michal Hocko wrote:
> On Wed 15-12-21 18:13:40, Sebastian Andrzej Siewior wrote:
> > Okay. What do I gain by doing this / how do I test this? Is running
> > tools/testing/selftests/cgroup/test_*mem* sufficient to test all corner
> > cases here?
>
> I am not fully aware of all the tests but my point is that if the soft
> limit is not configured then there are no soft limit tree manipulations
> ever happening and therefore the code is effectivelly dead. Is this
> sufficient for the RT patchset to ignore the RT incompatible parts?

So if the soft limit is not essential and things become easier by simply
disabling it, then yes, I could try that. I will keep that in mind.

Sebastian
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -943,6 +943,7 @@ config PAGE_COUNTER

 config MEMCG
 	bool "Memory controller"
+	depends on !PREEMPT_RT
 	select PAGE_COUNTER
 	select EVENTFD
 	help