[v2,3/4] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT where needed.

Message ID 20220211223537.2175879-4-bigeasy@linutronix.de (mailing list archive)
State New
Series mm/memcg: Address PREEMPT_RT problems instead of disabling it.

Commit Message

Sebastian Andrzej Siewior Feb. 11, 2022, 10:35 p.m. UTC
The per-CPU counters are modified with non-atomic operations. Their
consistency is ensured by disabling interrupts for the update.
On non-PREEMPT_RT configurations this works because acquiring a
spinlock_t typed lock with the _irq() suffix disables interrupts. On
PREEMPT_RT configurations the RMW operation can be interrupted.

Another problem is that mem_cgroup_swapout() expects to be invoked with
interrupts disabled because its caller acquires a spinlock_t with an
interrupt-disabling variant. Since spinlock_t never disables interrupts
on PREEMPT_RT, interrupts are never disabled at this point.

The code is never called from in_irq() context on PREEMPT_RT, therefore
disabling preemption during the update is sufficient on PREEMPT_RT.
The sections which explicitly disable interrupts can remain unchanged
on PREEMPT_RT because they are short and do not involve sleeping locks
(memcg_check_events() does nothing on PREEMPT_RT).

Disable preemption during update of the per-CPU variables which do not
explicitly disable interrupts.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/memcontrol.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)
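
To make the failure mode concrete, here is a hypothetical sketch (not part
of the patch; the counter and function names are made up) of why the plain
per-CPU RMW needs protection once the updater can be preempted:

#include <linux/percpu.h>

static DEFINE_PER_CPU(long, example_counter);

/* What __this_cpu_add() boils down to: a non-atomic read-modify-write. */
static void example_unprotected_add(long val)
{
	long tmp;

	tmp = __this_cpu_read(example_counter);		/* read   */
	/*
	 * If the task is preempted here on PREEMPT_RT (holding only a
	 * spinlock_t), another task on the same CPU may update
	 * example_counter; its update is overwritten below.
	 */
	tmp += val;					/* modify */
	__this_cpu_write(example_counter, tmp);		/* write  */
}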

Comments

Johannes Weiner Feb. 14, 2022, 4:46 p.m. UTC | #1
On Fri, Feb 11, 2022 at 11:35:36PM +0100, Sebastian Andrzej Siewior wrote:
> The per-CPU counters are modified with non-atomic operations. Their
> consistency is ensured by disabling interrupts for the update.
> On non-PREEMPT_RT configurations this works because acquiring a
> spinlock_t typed lock with the _irq() suffix disables interrupts. On
> PREEMPT_RT configurations the RMW operation can be interrupted.
> 
> Another problem is that mem_cgroup_swapout() expects to be invoked with
> interrupts disabled because its caller acquires a spinlock_t with an
> interrupt-disabling variant. Since spinlock_t never disables interrupts
> on PREEMPT_RT, interrupts are never disabled at this point.
> 
> The code is never called from in_irq() context on PREEMPT_RT, therefore
> disabling preemption during the update is sufficient on PREEMPT_RT.
> The sections which explicitly disable interrupts can remain unchanged
> on PREEMPT_RT because they are short and do not involve sleeping locks
> (memcg_check_events() does nothing on PREEMPT_RT).
> 
> Disable preemption during update of the per-CPU variables which do not
> explicitly disable interrupts.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  mm/memcontrol.c | 21 +++++++++++++++++++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c1caa662946dc..466466f285cea 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -705,6 +705,8 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>  	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
>  	memcg = pn->memcg;
>  
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		preempt_disable();
>  	/* Update memcg */
>  	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
>  
> @@ -712,6 +714,8 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>  	__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
>  
>  	memcg_rstat_updated(memcg, val);
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		preempt_enable();
>  }

I notice you didn't annotate __mod_memcg_state(). I suppose that is
because it's called with explicit local_irq_disable(), and that
disables preemption on rt? And you only need another preempt_disable()
for stacks that rely on coming from spin_lock_irq(save)?

That makes sense, but it's difficult to maintain. It'll easily break
if somebody adds more memory accounting sites that may also rely on an
irq-disabled spinlock somewhere.

So better to make this an unconditional locking protocol:

static void memcg_stats_lock(void)
{
#ifdef CONFIG_PREEMPT_RT
	preempt_disable();
#else
	VM_BUG_ON(!irqs_disabled());
#endif
}

static void memcg_stats_unlock(void)
{
#ifdef CONFIG_PREEMPT_RT
	preempt_enable();
#endif
}

and always use these around the counter updates.
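
A converted call site would then look roughly like this (a sketch only;
the actual conversion is done in the reply further down):

	/* e.g. in __mod_memcg_lruvec_state() */
	memcg_stats_lock();
	/* Update memcg */
	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
	/* Update lruvec */
	__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
	memcg_rstat_updated(memcg, val);
	memcg_stats_unlock();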
Roman Gushchin Feb. 14, 2022, 7:53 p.m. UTC | #2
On Mon, Feb 14, 2022 at 11:46:00AM -0500, Johannes Weiner wrote:
> On Fri, Feb 11, 2022 at 11:35:36PM +0100, Sebastian Andrzej Siewior wrote:
> > The per-CPU counters are modified with non-atomic operations. Their
> > consistency is ensured by disabling interrupts for the update.
> > On non-PREEMPT_RT configurations this works because acquiring a
> > spinlock_t typed lock with the _irq() suffix disables interrupts. On
> > PREEMPT_RT configurations the RMW operation can be interrupted.
> > 
> > Another problem is that mem_cgroup_swapout() expects to be invoked with
> > interrupts disabled because its caller acquires a spinlock_t with an
> > interrupt-disabling variant. Since spinlock_t never disables interrupts
> > on PREEMPT_RT, interrupts are never disabled at this point.
> > 
> > The code is never called from in_irq() context on PREEMPT_RT, therefore
> > disabling preemption during the update is sufficient on PREEMPT_RT.
> > The sections which explicitly disable interrupts can remain unchanged
> > on PREEMPT_RT because they are short and do not involve sleeping locks
> > (memcg_check_events() does nothing on PREEMPT_RT).
> > 
> > Disable preemption during update of the per-CPU variables which do not
> > explicitly disable interrupts.
> > 
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > ---
> >  mm/memcontrol.c | 21 +++++++++++++++++++--
> >  1 file changed, 19 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c1caa662946dc..466466f285cea 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -705,6 +705,8 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> >  	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> >  	memcg = pn->memcg;
> >  
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > +		preempt_disable();
> >  	/* Update memcg */
> >  	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
> >  
> > @@ -712,6 +714,8 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> >  	__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
> >  
> >  	memcg_rstat_updated(memcg, val);
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > +		preempt_enable();
> >  }
> 
> I notice you didn't annotate __mod_memcg_state(). I suppose that is
> because it's called with explicit local_irq_disable(), and that
> disables preemption on rt? And you only need another preempt_disable()
> for stacks that rely on coming from spin_lock_irq(save)?
> 
> That makes sense, but it's difficult to maintain. It'll easily break
> if somebody adds more memory accounting sites that may also rely on an
> irq-disabled spinlock somewhere.
> 
> So better to make this an unconditional locking protocol:
> 
> static void memcg_stats_lock(void)
> {
> #ifdef CONFIG_PREEMPT_RT
> 	preempt_disable();
> #else
> 	VM_BUG_ON(!irqs_disabled());
> #endif
> }
> 
> static void memcg_stats_unlock(void)
> {
> #ifdef CONFIG_PREEMPT_RT
> 	preempt_enable();
> #endif
> }
> 
> and always use these around the counter updates.

Thanks, Johannes, this looks really good to me. The code is already quite
complicated; the suggested locking protocol makes it easier to read and
maintain.

Otherwise the patch looks good to me. Sebastian, please feel free to add
Acked-by: Roman Gushchin <guro@fb.com> after incorporating Johannes's
suggestion.

Thanks!
Sebastian Andrzej Siewior Feb. 15, 2022, 6:01 p.m. UTC | #3
On 2022-02-14 11:46:00 [-0500], Johannes Weiner wrote:
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c1caa662946dc..466466f285cea 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -705,6 +705,8 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> >  	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> >  	memcg = pn->memcg;
> >  
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > +		preempt_disable();
> >  	/* Update memcg */
> >  	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
> >  
> > @@ -712,6 +714,8 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> >  	__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
> >  
> >  	memcg_rstat_updated(memcg, val);
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > +		preempt_enable();
> >  }
> 
> I notice you didn't annotate __mod_memcg_state(). I suppose that is
> because it's called with explicit local_irq_disable(), and that
> disables preemption on rt? And you only need another preempt_disable()
> for stacks that rely on coming from spin_lock_irq(save)?

Correct. The code is not used in in_hardirq() context on PREEMPT_RT, so
preempt_disable() is sufficient. I didn't bother to replace all
local_irq_save() with preempt_disable() since it is probably not worth
it.
And yes: on PREEMPT_RT spin_lock_irq() does not disable interrupts, so I
need something here to ensure that the RMW operation is not interrupted.
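
For reference, the wrapper in question looks roughly like this (a sketch
of the mod_memcg_state() helper in include/linux/memcontrol.h, quoted
from memory):

static inline void mod_memcg_state(struct mem_cgroup *memcg,
				   int idx, int val)
{
	unsigned long flags;

	local_irq_save(flags);	/* genuinely disables IRQs, so no preemption */
	__mod_memcg_state(memcg, idx, val);
	local_irq_restore(flags);
}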

> That makes sense, but it's difficult to maintain. It'll easily break
> if somebody adds more memory accounting sites that may also rely on an
> irq-disabled spinlock somewhere.
> 
> So better to make this an unconditional locking protocol:
> 
> static void memcg_stats_lock(void)
> {
> #ifdef CONFIG_PREEMPT_RT
> 	preempt_disable();
> #else
> 	VM_BUG_ON(!irqs_disabled());
> #endif
> }
> 
> static void memcg_stats_unlock(void)
> {
> #ifdef CONFIG_PREEMPT_RT
> 	preempt_enable();
> #endif
> }
> 
> and always use these around the counter updates.

Something like the following perhaps? I didn't add anything to
__mod_memcg_state() since it has no users besides the one which does
local_irq_save().

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c1caa662946dc..69130a5fe3d51 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -629,6 +629,28 @@ static DEFINE_SPINLOCK(stats_flush_lock);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 
+/*
+ * Accessors to ensure that preemption is disabled on PREEMPT_RT because the
+ * per-CPU update code can not rely on an acquired spinlock_t disabling
+ * preemption. These functions are never used in hardirq context on
+ * PREEMPT_RT and therefore disabling preemption is sufficient.
+ */
+static void memcg_stats_lock(void)
+{
+#ifdef CONFIG_PREEMPT_RT
+	preempt_disable();
+#else
+	VM_BUG_ON(!irqs_disabled());
+#endif
+}
+
+static void memcg_stats_unlock(void)
+{
+#ifdef CONFIG_PREEMPT_RT
+	preempt_enable();
+#endif
+}
+
 static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
 	unsigned int x;
@@ -705,6 +727,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
 
+	memcg_stats_lock();
 	/* Update memcg */
 	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
 
@@ -712,6 +735,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
 
 	memcg_rstat_updated(memcg, val);
+	memcg_stats_unlock();
 }
 
 /**
@@ -794,8 +818,10 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 	if (mem_cgroup_disabled())
 		return;
 
+	memcg_stats_lock();
 	__this_cpu_add(memcg->vmstats_percpu->events[idx], count);
 	memcg_rstat_updated(memcg, count);
+	memcg_stats_unlock();
 }
 
 static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
@@ -7149,8 +7175,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	 * important here to have the interrupts disabled because it is the
 	 * only synchronisation we have for updating the per-CPU variables.
 	 */
-	VM_BUG_ON(!irqs_disabled());
+	memcg_stats_lock();
 	mem_cgroup_charge_statistics(memcg, -nr_entries);
+	memcg_stats_unlock();
 	memcg_check_events(memcg, page_to_nid(page));
 
 	css_put(&memcg->css);

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c1caa662946dc..466466f285cea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -705,6 +705,8 @@  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
 	/* Update memcg */
 	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
 
@@ -712,6 +714,8 @@  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
 
 	memcg_rstat_updated(memcg, val);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 /**
@@ -794,8 +798,12 @@  void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 	if (mem_cgroup_disabled())
 		return;
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
 	__this_cpu_add(memcg->vmstats_percpu->events[idx], count);
 	memcg_rstat_updated(memcg, count);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
@@ -7148,9 +7156,18 @@  void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	 * i_pages lock which is taken with interrupts-off. It is
 	 * important here to have the interrupts disabled because it is the
 	 * only synchronisation we have for updating the per-CPU variables.
+	 * On PREEMPT_RT interrupts are never disabled and the updates to per-CPU
+	 * variables are synchronised by keeping preemption disabled.
 	 */
-	VM_BUG_ON(!irqs_disabled());
-	mem_cgroup_charge_statistics(memcg, -nr_entries);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
+		VM_BUG_ON(!irqs_disabled());
+		mem_cgroup_charge_statistics(memcg, -nr_entries);
+	} else {
+		preempt_disable();
+		mem_cgroup_charge_statistics(memcg, -nr_entries);
+		preempt_enable();
+	}
+
 	memcg_check_events(memcg, page_to_nid(page));
 
 	css_put(&memcg->css);