| Message ID | 20211207155208.eyre5svucpg7krxe@linutronix.de |
|---|---|
| State | New |
| Series | mm/memcontrol: Disable on PREEMPT_RT |
On 12/7/21 10:52, Sebastian Andrzej Siewior wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> MEMCG has a few constructs which are not compatible with PREEMPT_RT's
> requirements. This includes:
> - relying on disabled interrupts from spin_lock_irqsave() locking for
>   something not related to lock itself (like the per-CPU counter).
>
> - explicitly disabling interrupts and acquiring a spinlock_t based lock
>   like in memcg_check_events() -> eventfd_signal().
>
> - explicitly disabling interrupts and freeing memory like in
>   drain_obj_stock() -> obj_cgroup_put() -> obj_cgroup_release() ->
>   percpu_ref_exit().
>
> Commit 559271146efc ("mm/memcg: optimize user context object stock
> access") continued to optimize for the CPU local access which
> complicates the PREEMPT_RT locking requirements further.
>
> Disable MEMCG on PREEMPT_RT until the whole situation can be evaluated
> again.

Disabling MEMCG for PREEMPT_RT may be too drastic a step to take. For
commit 559271146efc ("mm/memcg: optimize user context object stock
access"), I can modify it to disable the optimization for PREEMPT_RT.

Cheers,
Longman

> [ bigeasy: commit description. ]
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  init/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -943,6 +943,7 @@ config PAGE_COUNTER
>
>  config MEMCG
>  	bool "Memory controller"
> +	depends on !PREEMPT_RT
>  	select PAGE_COUNTER
>  	select EVENTFD
>  	help
On Tue, Dec 07, 2021 at 04:52:08PM +0100, Sebastian Andrzej Siewior wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> MEMCG has a few constructs which are not compatible with PREEMPT_RT's
> requirements. This includes:
> - relying on disabled interrupts from spin_lock_irqsave() locking for
>   something not related to lock itself (like the per-CPU counter).

If memory serves me right, this is the VM_BUG_ON() in workingset.c:

	VM_WARN_ON_ONCE(!irqs_disabled());  /* For __inc_lruvec_page_state */

This isn't memcg specific. This is the serialization model of the
generic MM page counters. They can be updated from process and irq
context, and need to avoid preemption (and corruption) during RMW.

!CONFIG_MEMCG:

	static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
						 int val)
	{
		struct page *page = virt_to_head_page(p);

		mod_node_page_state(page_pgdat(page), idx, val);
	}

which does:

	void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
				 long delta)
	{
		unsigned long flags;

		local_irq_save(flags);
		__mod_node_page_state(pgdat, item, delta);
		local_irq_restore(flags);
	}

If this breaks PREEMPT_RT, it's broken without memcg too.

> - explicitly disabling interrupts and acquiring a spinlock_t based lock
>   like in memcg_check_events() -> eventfd_signal().

Similar problem to the above: we disable interrupts to protect RMW
sequences that can (on non-preemptrt) be initiated through process
context as well as irq context.

IIUC, the PREEMPT_RT construct for handling exactly that scenario is
the "local lock". Is that correct?

It appears Ingo has already fixed the LRU cache, which for non-rt also
relies on irq disabling:

	commit b01b2141999936ac3e4746b7f76c0f204ae4b445
	Author: Ingo Molnar <mingo@kernel.org>
	Date:   Wed May 27 22:11:15 2020 +0200

	    mm/swap: Use local_lock for protection

The memcg charge cache should be fixable the same way.

Likewise, if you fix the generic vmstat counters like this, the memcg
implementation can follow suit.
On 2021-12-07 11:55:38 [-0500], Johannes Weiner wrote:
> On Tue, Dec 07, 2021 at 04:52:08PM +0100, Sebastian Andrzej Siewior wrote:
> > From: Thomas Gleixner <tglx@linutronix.de>
> >
> > MEMCG has a few constructs which are not compatible with PREEMPT_RT's
> > requirements. This includes:
> > - relying on disabled interrupts from spin_lock_irqsave() locking for
> >   something not related to lock itself (like the per-CPU counter).
>
> If memory serves me right, this is the VM_BUG_ON() in workingset.c:
>
> 	VM_WARN_ON_ONCE(!irqs_disabled());  /* For __inc_lruvec_page_state */
>
> This isn't memcg specific. This is the serialization model of the
> generic MM page counters. They can be updated from process and irq
> context, and need to avoid preemption (and corruption) during RMW.
>
> !CONFIG_MEMCG:
>
> 	static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
> 						 int val)
> 	{
> 		struct page *page = virt_to_head_page(p);
>
> 		mod_node_page_state(page_pgdat(page), idx, val);
> 	}
>
> which does:
>
> 	void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
> 				 long delta)
> 	{
> 		unsigned long flags;
>
> 		local_irq_save(flags);
> 		__mod_node_page_state(pgdat, item, delta);
> 		local_irq_restore(flags);
> 	}
>
> If this breaks PREEMPT_RT, it's broken without memcg too.

The mod_node_page_state() looks fine. But if we use disabling interrupts
to protect the RMW operation then this has to be used everywhere and can
not be assumed to be inherited from spin_lock_irq(). Also, none of the
code here should be invoked from IRQ context on PREEMPT_RT.
If the locking scope is known then local_irq_disable() could be replaced
with a local_lock_t to avoid other things that appear unrelated, like
the memcg_check_events() invocation in uncharge_batch().
The problematic part here is mem_cgroup_tree_per_node::lock which can
not be acquired with disabled interrupts on PREEMPT_RT. The "locking
scope" is not always clear to me.
Also, if it is _just_ the counter, then we might solve this
differently.

> > - explicitly disabling interrupts and acquiring a spinlock_t based lock
> >   like in memcg_check_events() -> eventfd_signal().
>
> Similar problem to the above: we disable interrupts to protect RMW
> sequences that can (on non-preemptrt) be initiated through process
> context as well as irq context.
>
> IIUC, the PREEMPT_RT construct for handling exactly that scenario is
> the "local lock". Is that correct?

On !PREEMPT_RT this_cpu_inc() can be used in hard-IRQ and task context
equally while __this_cpu_inc() is "optimized" for IRQ context / a
context where it can not be interrupted during its operation.
local_irq_save() and spin_lock_irq() both disable interrupts here.

On PREEMPT_RT chances are high that the code never runs with disabled
interrupts. local_irq_save() disables interrupts, yes, but
spin_lock_irq() does not. Therefore a per-object lock, say
address_space::i_pages, can not be used to protect otherwise unrelated
per-CPU data, a global DEFINE_PER_CPU(). The reason is that you can
acquire address_space::i_pages and get preempted in the middle of
__this_cpu_inc(). Then another task on the same CPU can acquire
address_space::i_pages of another struct address_space and perform
__this_cpu_inc() on the very same per-CPU data. There is your
interruption of an RMW operation.

local_lock_t is a per-CPU lock which can be used to synchronize access
to per-CPU variables which are otherwise unprotected / rely on disabled
preemption / interrupts. So yes, it could be used as a substitute in
situations where !PREEMPT_RT needs to manually disable interrupts.

So this:

|func1(struct address_space *m)
|{
|	spin_lock_irq(&m->i_pages);
|	/* other m changes */
|	__this_cpu_add(counter);
|	spin_unlock_irq(&m->i_pages);
|}
|
|func2(void)
|{
|	local_irq_disable();
|	__this_cpu_add(counter);
|	local_irq_enable();
|}

construct breaks on PREEMPT_RT.
With local_lock_t that would be:

|func1(struct address_space *m)
|{
|	spin_lock_irq(&m->i_pages);
|	/* other m changes */
|	local_lock(&counter_lock);
|	__this_cpu_add(counter);
|	local_unlock(&counter_lock);
|	spin_unlock_irq(&m->i_pages);
|}
|
|func2(void)
|{
|	local_lock_irq(&counter_lock);
|	__this_cpu_add(counter);
|	local_unlock_irq(&counter_lock);
|}

Ideally you would attach counter_lock to the same struct where the
counter is defined so the protection scope is obvious.
As you see, the local_irq_disable() was substituted with a
local_lock_irq(), but also a local_lock() was added to func1().

> It appears Ingo has already fixed the LRU cache, which for non-rt also
> relies on irq disabling:
>
> 	commit b01b2141999936ac3e4746b7f76c0f204ae4b445
> 	Author: Ingo Molnar <mingo@kernel.org>
> 	Date:   Wed May 27 22:11:15 2020 +0200
>
> 	    mm/swap: Use local_lock for protection
>
> The memcg charge cache should be fixable the same way.
>
> Likewise, if you fix the generic vmstat counters like this, the memcg
> implementation can follow suit.

The vmstat counters have been fixed since commit c68ed7945701a
("mm/vmstat: protect per cpu variables with preempt disable on RT"),
again by Ingo. We need to agree on how to proceed with these counters,
and then we can tackle what is left :) It should be enough to disable
preemption during the update since on PREEMPT_RT that update does not
happen in IRQ context.

Sebastian
On Fri 10-12-21 16:22:01, Sebastian Andrzej Siewior wrote:
[...]

I am sorry but I didn't get to read and digest the rest of the message
yet. Let me just point out this:

> The problematic part here is mem_cgroup_tree_per_node::lock which can
> not be acquired with disabled interrupts on PREEMPT_RT. The "locking
> scope" is not always clear to me. Also, if it is _just_ the counter,
> then we might solve this differently.

I do not think you should be losing sleep over soft limit reclaim. This
is certainly not something to be used for RT workloads and rather than
touching that code I think it makes some sense to simply disallow soft
limit with RT enabled (i.e. do not allow to set any soft limit).
On 2021-12-13 11:08:26 [+0100], Michal Hocko wrote:
> On Fri 10-12-21 16:22:01, Sebastian Andrzej Siewior wrote:
> [...]
> I am sorry but I didn't get to read and digest the rest of the message
> yet. Let me just point out this
>
> > The problematic part here is mem_cgroup_tree_per_node::lock which can
> > not be acquired with disabled interrupts on PREEMPT_RT. The "locking
> > scope" is not always clear to me. Also, if it is _just_ the counter,
> > then we might solve this differently.
>
> I do not think you should be losing sleep over soft limit reclaim. This
> is certainly not something to be used for RT workloads and rather than
> touching that code I think it makes some sense to simply disallow soft
> limit with RT enabled (i.e. do not allow to set any soft limit).

Okay. So instead of disabling it entirely you suggest I should take
another stab at it? Disabling the soft limit, where should I start?
Should mem_cgroup_write() for RES_SOFT_LIMIT always return an error or
something else?

In the meantime I'll try to swap my memcg memory back in…

Sebastian
On Wed 15-12-21 17:47:54, Sebastian Andrzej Siewior wrote:
> On 2021-12-13 11:08:26 [+0100], Michal Hocko wrote:
> > On Fri 10-12-21 16:22:01, Sebastian Andrzej Siewior wrote:
> > [...]
> > I am sorry but I didn't get to read and digest the rest of the message
> > yet. Let me just point out this
> >
> > > The problematic part here is mem_cgroup_tree_per_node::lock which can
> > > not be acquired with disabled interrupts on PREEMPT_RT. The "locking
> > > scope" is not always clear to me. Also, if it is _just_ the counter,
> > > then we might solve this differently.
> >
> > I do not think you should be losing sleep over soft limit reclaim. This
> > is certainly not something to be used for RT workloads and rather than
> > touching that code I think it makes some sense to simply disallow soft
> > limit with RT enabled (i.e. do not allow to set any soft limit).
>
> Okay. So instead of disabling it entirely you suggest I should take
> another stab at it? Okay. Disabling softlimit, where should I start with
> it? Should mem_cgroup_write() for RES_SOFT_LIMIT always return an error
> or something else?

Yeah, I would just return an error for RT configuration. If we ever need
to implement that behavior for RT then we can look at specific fixes.

Thanks!
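[Editorial note: for illustration, this is one way mem_cgroup_write() could reject the soft limit on PREEMPT_RT as suggested above. It is a sketch, not a patch from the thread; the hunk location and the choice of -EINVAL are assumptions, since the thread leaves the exact error code open.]

```diff
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ mem_cgroup_write()
 	case RES_SOFT_LIMIT:
+		/* soft limit reclaim is incompatible with PREEMPT_RT */
+		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+			ret = -EINVAL;
+			break;
+		}
 		memcg->soft_limit = nr_pages;
 		ret = 0;
 		break;
```

With the soft limit never set, the soft limit tree (and its mem_cgroup_tree_per_node::lock) is never touched, which is the point of Michal's suggestion.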
On 2021-12-15 17:56:03 [+0100], Michal Hocko wrote:
> On Wed 15-12-21 17:47:54, Sebastian Andrzej Siewior wrote:
> > On 2021-12-13 11:08:26 [+0100], Michal Hocko wrote:
> > > On Fri 10-12-21 16:22:01, Sebastian Andrzej Siewior wrote:
> > > [...]
> > > I am sorry but I didn't get to read and digest the rest of the message
> > > yet. Let me just point out this
> > >
> > > > The problematic part here is mem_cgroup_tree_per_node::lock which can
> > > > not be acquired with disabled interrupts on PREEMPT_RT. The "locking
> > > > scope" is not always clear to me. Also, if it is _just_ the counter,
> > > > then we might solve this differently.
> > >
> > > I do not think you should be losing sleep over soft limit reclaim. This
> > > is certainly not something to be used for RT workloads and rather than
> > > touching that code I think it makes some sense to simply disallow soft
> > > limit with RT enabled (i.e. do not allow to set any soft limit).
> >
> > Okay. So instead of disabling it entirely you suggest I should take
> > another stab at it? Okay. Disabling softlimit, where should I start with
> > it? Should mem_cgroup_write() for RES_SOFT_LIMIT always return an error
> > or something else?
>
> Yeah, I would just return an error for RT configuration. If we ever need
> to implement that behavior for RT then we can look at specific fixes.

Okay. What do I gain by doing this / how do I test this? Is running
tools/testing/selftests/cgroup/test_*mem* sufficient to test all corner
cases here?

> Thanks!

Thank you ;)

Sebastian
On Wed 15-12-21 18:13:40, Sebastian Andrzej Siewior wrote:
> On 2021-12-15 17:56:03 [+0100], Michal Hocko wrote:
> > On Wed 15-12-21 17:47:54, Sebastian Andrzej Siewior wrote:
> > > On 2021-12-13 11:08:26 [+0100], Michal Hocko wrote:
> > > > On Fri 10-12-21 16:22:01, Sebastian Andrzej Siewior wrote:
> > > > [...]
> > > > I am sorry but I didn't get to read and digest the rest of the message
> > > > yet. Let me just point out this
> > > >
> > > > > The problematic part here is mem_cgroup_tree_per_node::lock which can
> > > > > not be acquired with disabled interrupts on PREEMPT_RT. The "locking
> > > > > scope" is not always clear to me. Also, if it is _just_ the counter,
> > > > > then we might solve this differently.
> > > >
> > > > I do not think you should be losing sleep over soft limit reclaim. This
> > > > is certainly not something to be used for RT workloads and rather than
> > > > touching that code I think it makes some sense to simply disallow soft
> > > > limit with RT enabled (i.e. do not allow to set any soft limit).
> > >
> > > Okay. So instead of disabling it entirely you suggest I should take
> > > another stab at it? Okay. Disabling softlimit, where should I start with
> > > it? Should mem_cgroup_write() for RES_SOFT_LIMIT always return an error
> > > or something else?
> >
> > Yeah, I would just return an error for RT configuration. If we ever need
> > to implement that behavior for RT then we can look at specific fixes.
>
> Okay. What do I gain by doing this / how do I test this? Is running
> tools/testing/selftests/cgroup/test_*mem* sufficient to test all corner
> cases here?

I am not fully aware of all the tests but my point is that if the soft
limit is not configured then there are no soft limit tree manipulations
ever happening and therefore the code is effectively dead. Is this
sufficient for the RT patchset to ignore the RT incompatible parts?
On 2021-12-15 19:44:00 [+0100], Michal Hocko wrote:
> On Wed 15-12-21 18:13:40, Sebastian Andrzej Siewior wrote:
> > Okay. What do I gain by doing this / how do I test this? Is running
> > tools/testing/selftests/cgroup/test_*mem* sufficient to test all corner
> > cases here?
>
> I am not fully aware of all the tests but my point is that if the soft
> limit is not configured then there are no soft limit tree manipulations
> ever happening and therefore the code is effectivelly dead. Is this
> sufficient for the RT patchset to ignore the RT incompatible parts?

So if the soft limit is not essential and things become easier by simply
disabling it, then yes, I could try that. I will keep that in mind.

Sebastian
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -943,6 +943,7 @@ config PAGE_COUNTER

 config MEMCG
 	bool "Memory controller"
+	depends on !PREEMPT_RT
 	select PAGE_COUNTER
 	select EVENTFD
 	help