diff mbox series

[RFC,1/7] cgroup: rstat: only disable interrupts for the percpu lock

Message ID 20230323040037.2389095-2-yosryahmed@google.com (mailing list archive)
State RFC
Headers show
Series Make rstat flushing IRQ and sleep friendly | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Yosry Ahmed March 23, 2023, 4 a.m. UTC
Currently, when sleeping is not allowed during rstat flushing, we hold
the global rstat lock with interrupts disabled throughout the entire
flush operation. Flushing in an O(# cgroups * # cpus) operation, and
having interrupts disabled throughout is dangerous.

For some contexts, we may not want to sleep, but can be interrupted
(e.g. while holding a spinlock or RCU read lock). As such, do not
disable interrupts throughout rstat flushing, only when holding the
percpu lock. This breaks down the O(# cgroups * # cpus) duration with
interrupts disabled to a series of O(# cgroups) durations.

Furthermore, if a cpu spinning waiting for the global rstat lock, it
doesn't need to spin with interrupts disabled anymore.

If the caller of rstat flushing needs interrupts to be disabled, it's up
to them to decide that, and it should be fine to hold the global rstat
lock with interrupts disabled. There is currently a single context that
may invoke rstat flushing with interrupts disabled, the
mem_cgroup_flush_stats() call in mem_cgroup_usage(), if called from
mem_cgroup_threshold().

To make it safe to hold the global rstat lock with interrupts enabled,
make sure we only flush from in_task() contexts. The side effect of that
we read stale stats in interrupt context, but this should be okay, as
flushing in interrupt context is dangerous anyway as it is an expensive
operation, so reading stale stats is safer.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 kernel/cgroup/rstat.c | 40 +++++++++++++++++++++++++++++-----------
 1 file changed, 29 insertions(+), 11 deletions(-)

Comments

Shakeel Butt March 23, 2023, 4:29 a.m. UTC | #1
On Wed, Mar 22, 2023 at 9:00 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Currently, when sleeping is not allowed during rstat flushing, we hold
> the global rstat lock with interrupts disabled throughout the entire
> flush operation. Flushing in an O(# cgroups * # cpus) operation, and
> having interrupts disabled throughout is dangerous.
>
> For some contexts, we may not want to sleep, but can be interrupted
> (e.g. while holding a spinlock or RCU read lock). As such, do not
> disable interrupts throughout rstat flushing, only when holding the
> percpu lock. This breaks down the O(# cgroups * # cpus) duration with
> interrupts disabled to a series of O(# cgroups) durations.
>
> Furthermore, if a cpu spinning waiting for the global rstat lock, it
> doesn't need to spin with interrupts disabled anymore.
>
> If the caller of rstat flushing needs interrupts to be disabled, it's up
> to them to decide that, and it should be fine to hold the global rstat
> lock with interrupts disabled. There is currently a single context that
> may invoke rstat flushing with interrupts disabled, the
> mem_cgroup_flush_stats() call in mem_cgroup_usage(), if called from
> mem_cgroup_threshold().
>
> To make it safe to hold the global rstat lock with interrupts enabled,
> make sure we only flush from in_task() contexts. The side effect of that
> we read stale stats in interrupt context, but this should be okay, as
> flushing in interrupt context is dangerous anyway as it is an expensive
> operation, so reading stale stats is safer.
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Couple of questions:

1. What exactly is cgroup_rstat_lock protecting? Can we just remove it
altogether?
2. Are we really calling rstat flush in irq context?
3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
done for root memcg. Why is mem_cgroup_threshold() interested in root
memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
Yosry Ahmed March 23, 2023, 5:15 a.m. UTC | #2
On Wed, Mar 22, 2023 at 9:29 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Mar 22, 2023 at 9:00 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > Currently, when sleeping is not allowed during rstat flushing, we hold
> > the global rstat lock with interrupts disabled throughout the entire
> > flush operation. Flushing in an O(# cgroups * # cpus) operation, and
> > having interrupts disabled throughout is dangerous.
> >
> > For some contexts, we may not want to sleep, but can be interrupted
> > (e.g. while holding a spinlock or RCU read lock). As such, do not
> > disable interrupts throughout rstat flushing, only when holding the
> > percpu lock. This breaks down the O(# cgroups * # cpus) duration with
> > interrupts disabled to a series of O(# cgroups) durations.
> >
> > Furthermore, if a cpu spinning waiting for the global rstat lock, it
> > doesn't need to spin with interrupts disabled anymore.
> >
> > If the caller of rstat flushing needs interrupts to be disabled, it's up
> > to them to decide that, and it should be fine to hold the global rstat
> > lock with interrupts disabled. There is currently a single context that
> > may invoke rstat flushing with interrupts disabled, the
> > mem_cgroup_flush_stats() call in mem_cgroup_usage(), if called from
> > mem_cgroup_threshold().
> >
> > To make it safe to hold the global rstat lock with interrupts enabled,
> > make sure we only flush from in_task() contexts. The side effect of that
> > we read stale stats in interrupt context, but this should be okay, as
> > flushing in interrupt context is dangerous anyway as it is an expensive
> > operation, so reading stale stats is safer.
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> Couple of questions:
>
> 1. What exactly is cgroup_rstat_lock protecting? Can we just remove it
> altogether?

I believe it protects the global state variables that we flush into.
For example, for memcg, it protects mem_cgroup->vmstats.

I tried removing the lock and allowing concurrent flushing on
different cpus, by changing mem_cgroup->vmstats to use atomics
instead, but that turned out to be a little expensive. Also,
cgroup_rstat_lock is already contended by different flushers
(mitigated by stats_flush_lock on the memcg side). If we remove it,
concurrent flushers contend on every single percpu lock instead, which
also seems to be expensive.

> 2. Are we really calling rstat flush in irq context?

I think it is possible through the charge/uncharge path:
memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
added the protection against flushing in an interrupt context for
future callers as well, as it may cause a deadlock if we don't disable
interrupts when acquiring cgroup_rstat_lock.

> 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> done for root memcg. Why is mem_cgroup_threshold() interested in root
> memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?

I am not sure, but the code looks like event notifications may be set
up on root memcg, which is why we need to check thresholds.

Even if mem_cgroup_threshold() does not flush memcg stats, the purpose
of this patch is to make sure the rstat flushing code itself is not
disabling interrupts; which it currently does for any unsleepable
context, even if it is interruptible.
Shakeel Butt March 23, 2023, 6:33 a.m. UTC | #3
On Wed, Mar 22, 2023 at 10:15 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
> > Couple of questions:
> >
> > 1. What exactly is cgroup_rstat_lock protecting? Can we just remove it
> > altogether?
>
> I believe it protects the global state variables that we flush into.
> For example, for memcg, it protects mem_cgroup->vmstats.
>
> I tried removing the lock and allowing concurrent flushing on
> different cpus, by changing mem_cgroup->vmstats to use atomics
> instead, but that turned out to be a little expensive. Also,
> cgroup_rstat_lock is already contended by different flushers
> (mitigated by stats_flush_lock on the memcg side). If we remove it,
> concurrent flushers contend on every single percpu lock instead, which
> also seems to be expensive.

We should add a comment on what it is protecting. I think block rstat
are fine but memcg and bpf would need this.

>
> > 2. Are we really calling rstat flush in irq context?
>
> I think it is possible through the charge/uncharge path:
> memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> added the protection against flushing in an interrupt context for
> future callers as well, as it may cause a deadlock if we don't disable
> interrupts when acquiring cgroup_rstat_lock.
>
> > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
>
> I am not sure, but the code looks like event notifications may be set
> up on root memcg, which is why we need to check thresholds.

This is something we should deprecate as root memcg's usage is ill defined.

>
> Even if mem_cgroup_threshold() does not flush memcg stats, the purpose
> of this patch is to make sure the rstat flushing code itself is not
> disabling interrupts; which it currently does for any unsleepable
> context, even if it is interruptible.

Basically I am saying we should aim for VM_BUG_ON(!in_task()) in the
flush function rather than adding should_skip_flush() which does not
stop potential new irq flushers.
Yosry Ahmed March 23, 2023, 1:35 p.m. UTC | #4
On Wed, Mar 22, 2023 at 11:33 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Mar 22, 2023 at 10:15 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> [...]
> > > Couple of questions:
> > >
> > > 1. What exactly is cgroup_rstat_lock protecting? Can we just remove it
> > > altogether?
> >
> > I believe it protects the global state variables that we flush into.
> > For example, for memcg, it protects mem_cgroup->vmstats.
> >
> > I tried removing the lock and allowing concurrent flushing on
> > different cpus, by changing mem_cgroup->vmstats to use atomics
> > instead, but that turned out to be a little expensive. Also,
> > cgroup_rstat_lock is already contended by different flushers
> > (mitigated by stats_flush_lock on the memcg side). If we remove it,
> > concurrent flushers contend on every single percpu lock instead, which
> > also seems to be expensive.
>
> We should add a comment on what it is protecting. I think block rstat
> are fine but memcg and bpf would need this.

I think it also protects the cpu base stats flushed by cgroup_base_stat_flush().

I will add a comment in the next version.

>
> >
> > > 2. Are we really calling rstat flush in irq context?
> >
> > I think it is possible through the charge/uncharge path:
> > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > added the protection against flushing in an interrupt context for
> > future callers as well, as it may cause a deadlock if we don't disable
> > interrupts when acquiring cgroup_rstat_lock.
> >
> > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> >
> > I am not sure, but the code looks like event notifications may be set
> > up on root memcg, which is why we need to check thresholds.
>
> This is something we should deprecate as root memcg's usage is ill defined.

Right, but I think this would be orthogonal to this patch series.

>
> >
> > Even if mem_cgroup_threshold() does not flush memcg stats, the purpose
> > of this patch is to make sure the rstat flushing code itself is not
> > disabling interrupts; which it currently does for any unsleepable
> > context, even if it is interruptible.
>
> Basically I am saying we should aim for VM_BUG_ON(!in_task()) in the
> flush function rather than adding should_skip_flush() which does not
> stop potential new irq flushers.

I wanted to start with VM_BUG_ON(!in_task()) but I wasn't sure that
all contexts that call rstat flushing are not in irq contexts. I added
should_skip_flush() so that if there are existing flushers in irq
context, or new flushers are added, we are protected against a
deadlock.

We can change should_skip_flush() to have a WARN_ON_ONCE(!in_task())
to alert in this case. If you prefer removing should_skip_flush() and
just adding VM_BUG_ON(!in_task()) we can do that, but personally I was
not confident enough that we have no code paths today that may attempt
flushing from irq context.
Shakeel Butt March 23, 2023, 3:40 p.m. UTC | #5
On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
> > >
> > > > 2. Are we really calling rstat flush in irq context?
> > >
> > > I think it is possible through the charge/uncharge path:
> > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > added the protection against flushing in an interrupt context for
> > > future callers as well, as it may cause a deadlock if we don't disable
> > > interrupts when acquiring cgroup_rstat_lock.
> > >
> > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > >
> > > I am not sure, but the code looks like event notifications may be set
> > > up on root memcg, which is why we need to check thresholds.
> >
> > This is something we should deprecate as root memcg's usage is ill defined.
>
> Right, but I think this would be orthogonal to this patch series.
>

I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
without either breaking a link between mem_cgroup_threshold and
cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
irqs.

So, this patch can not be applied before either of those two tasks are
done (and we may find more such scenarios).
Yosry Ahmed March 23, 2023, 3:42 p.m. UTC | #6
On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> [...]
> > > >
> > > > > 2. Are we really calling rstat flush in irq context?
> > > >
> > > > I think it is possible through the charge/uncharge path:
> > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > added the protection against flushing in an interrupt context for
> > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > interrupts when acquiring cgroup_rstat_lock.
> > > >
> > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > >
> > > > I am not sure, but the code looks like event notifications may be set
> > > > up on root memcg, which is why we need to check thresholds.
> > >
> > > This is something we should deprecate as root memcg's usage is ill defined.
> >
> > Right, but I think this would be orthogonal to this patch series.
> >
>
> I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> without either breaking a link between mem_cgroup_threshold and
> cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> irqs.
>
> So, this patch can not be applied before either of those two tasks are
> done (and we may find more such scenarios).


Could you elaborate why?

My understanding is that with an in_task() check to make sure we only
acquire cgroup_rstat_lock from non-irq context it should be fine to
acquire cgroup_rstat_lock without disabling interrupts.
Shakeel Butt March 23, 2023, 3:46 p.m. UTC | #7
On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > [...]
> > > > >
> > > > > > 2. Are we really calling rstat flush in irq context?
> > > > >
> > > > > I think it is possible through the charge/uncharge path:
> > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > added the protection against flushing in an interrupt context for
> > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > >
> > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > >
> > > > > I am not sure, but the code looks like event notifications may be set
> > > > > up on root memcg, which is why we need to check thresholds.
> > > >
> > > > This is something we should deprecate as root memcg's usage is ill defined.
> > >
> > > Right, but I think this would be orthogonal to this patch series.
> > >
> >
> > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > without either breaking a link between mem_cgroup_threshold and
> > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > irqs.
> >
> > So, this patch can not be applied before either of those two tasks are
> > done (and we may find more such scenarios).
>
>
> Could you elaborate why?
>
> My understanding is that with an in_task() check to make sure we only
> acquire cgroup_rstat_lock from non-irq context it should be fine to
> acquire cgroup_rstat_lock without disabling interrupts.

From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
with irq disabled while other code paths will take cgroup_rstat_lock
with irq enabled. This is a potential deadlock hazard unless
cgroup_rstat_lock is always taken with irq disabled.
Shakeel Butt March 23, 2023, 4:09 p.m. UTC | #8
On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > [...]
> > > > > >
> > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > >
> > > > > > I think it is possible through the charge/uncharge path:
> > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > added the protection against flushing in an interrupt context for
> > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > >
> > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > >
> > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > up on root memcg, which is why we need to check thresholds.
> > > > >
> > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > >
> > > > Right, but I think this would be orthogonal to this patch series.
> > > >
> > >
> > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > without either breaking a link between mem_cgroup_threshold and
> > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > irqs.
> > >
> > > So, this patch can not be applied before either of those two tasks are
> > > done (and we may find more such scenarios).
> >
> >
> > Could you elaborate why?
> >
> > My understanding is that with an in_task() check to make sure we only
> > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > acquire cgroup_rstat_lock without disabling interrupts.
>
> From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> with irq disabled while other code paths will take cgroup_rstat_lock
> with irq enabled. This is a potential deadlock hazard unless
> cgroup_rstat_lock is always taken with irq disabled.

Oh you are making sure it is not taken in the irq context through
should_skip_flush(). Hmm seems like a hack. Normally it is recommended
to actually remove all such users instead of silently
ignoring/bypassing the functionality.

So, how about removing mem_cgroup_flush_stats() from
mem_cgroup_usage(). It will break the known chain which is taking
cgroup_rstat_lock with irq disabled and you can add
WARN_ON_ONCE(!in_task()).
Yosry Ahmed March 23, 2023, 4:17 p.m. UTC | #9
On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > [...]
> > > > > > >
> > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > >
> > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > >
> > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > >
> > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > >
> > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > >
> > > > > Right, but I think this would be orthogonal to this patch series.
> > > > >
> > > >
> > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > without either breaking a link between mem_cgroup_threshold and
> > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > irqs.
> > > >
> > > > So, this patch can not be applied before either of those two tasks are
> > > > done (and we may find more such scenarios).
> > >
> > >
> > > Could you elaborate why?
> > >
> > > My understanding is that with an in_task() check to make sure we only
> > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > acquire cgroup_rstat_lock without disabling interrupts.
> >
> > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > with irq disabled while other code paths will take cgroup_rstat_lock
> > with irq enabled. This is a potential deadlock hazard unless
> > cgroup_rstat_lock is always taken with irq disabled.
>
> Oh you are making sure it is not taken in the irq context through
> should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> to actually remove all such users instead of silently
> ignoring/bypassing the functionality.

It is a workaround, we simply accept to read stale stats in irq
context instead of the expensive flush operation.

>
> So, how about removing mem_cgroup_flush_stats() from
> mem_cgroup_usage(). It will break the known chain which is taking
> cgroup_rstat_lock with irq disabled and you can add
> WARN_ON_ONCE(!in_task()).

This changes the behavior in a more obvious way because:
1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
path is also exercised in a lot of paths outside irq context, this
will change the behavior for any event thresholds on the root memcg.
With proposed skipped flushing in irq context we only change the
behavior in a small subset of cases.

I think we can skip flushing in irq context for now, and separately
deprecate threshold events for the root memcg. When that is done we
can come back and remove should_skip_flush() and add a VM_BUG_ON or
WARN_ON_ONCE instead. WDYT?

2. mem_cgroup_usage() is also used when reading usage from userspace.
This should be an easy workaround though.
Shakeel Butt March 23, 2023, 4:29 p.m. UTC | #10
On Thu, Mar 23, 2023 at 9:18 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > [...]
> > > > > > > >
> > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > >
> > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > >
> > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > >
> > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > >
> > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > >
> > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > >
> > > > >
> > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > irqs.
> > > > >
> > > > > So, this patch can not be applied before either of those two tasks are
> > > > > done (and we may find more such scenarios).
> > > >
> > > >
> > > > Could you elaborate why?
> > > >
> > > > My understanding is that with an in_task() check to make sure we only
> > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > acquire cgroup_rstat_lock without disabling interrupts.
> > >
> > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > with irq enabled. This is a potential deadlock hazard unless
> > > cgroup_rstat_lock is always taken with irq disabled.
> >
> > Oh you are making sure it is not taken in the irq context through
> > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > to actually remove all such users instead of silently
> > ignoring/bypassing the functionality.
>
> It is a workaround, we simply accept to read stale stats in irq
> context instead of the expensive flush operation.
>
> >
> > So, how about removing mem_cgroup_flush_stats() from
> > mem_cgroup_usage(). It will break the known chain which is taking
> > cgroup_rstat_lock with irq disabled and you can add
> > WARN_ON_ONCE(!in_task()).
>
> This changes the behavior in a more obvious way because:
> 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> path is also exercised in a lot of paths outside irq context, this
> will change the behavior for any event thresholds on the root memcg.
> With proposed skipped flushing in irq context we only change the
> behavior in a small subset of cases.
>
> I think we can skip flushing in irq context for now, and separately
> deprecate threshold events for the root memcg. When that is done we
> can come back and remove should_skip_flush() and add a VM_BUG_ON or
> WARN_ON_ONCE instead. WDYT?
>
> 2. mem_cgroup_usage() is also used when reading usage from userspace.
> This should be an easy workaround though.

This is a cgroup v1 behavior and to me it is totally reasonable to get
the 2 second stale root's usage. Even if you want to skip flushing in
irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the
rstat core code. This way we will know if other subsystems are doing
the same or not.
Yosry Ahmed March 23, 2023, 4:36 p.m. UTC | #11
On Thu, Mar 23, 2023 at 9:29 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 9:18 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > >
> > > > > > [...]
> > > > > > > > >
> > > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > > >
> > > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > > >
> > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > > >
> > > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > > >
> > > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > > >
> > > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > > >
> > > > > >
> > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > > irqs.
> > > > > >
> > > > > > So, this patch can not be applied before either of those two tasks are
> > > > > > done (and we may find more such scenarios).
> > > > >
> > > > >
> > > > > Could you elaborate why?
> > > > >
> > > > > My understanding is that with an in_task() check to make sure we only
> > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > > acquire cgroup_rstat_lock without disabling interrupts.
> > > >
> > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > > with irq enabled. This is a potential deadlock hazard unless
> > > > cgroup_rstat_lock is always taken with irq disabled.
> > >
> > > Oh you are making sure it is not taken in the irq context through
> > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > > to actually remove all such users instead of silently
> > > ignoring/bypassing the functionality.
> >
> > It is a workaround, we simply accept to read stale stats in irq
> > context instead of the expensive flush operation.
> >
> > >
> > > So, how about removing mem_cgroup_flush_stats() from
> > > mem_cgroup_usage(). It will break the known chain which is taking
> > > cgroup_rstat_lock with irq disabled and you can add
> > > WARN_ON_ONCE(!in_task()).
> >
> > This changes the behavior in a more obvious way because:
> > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> > path is also exercised in a lot of paths outside irq context, this
> > will change the behavior for any event thresholds on the root memcg.
> > With proposed skipped flushing in irq context we only change the
> > behavior in a small subset of cases.
> >
> > I think we can skip flushing in irq context for now, and separately
> > deprecate threshold events for the root memcg. When that is done we
> > can come back and remove should_skip_flush() and add a VM_BUG_ON or
> > WARN_ON_ONCE instead. WDYT?
> >
> > 2. mem_cgroup_usage() is also used when reading usage from userspace.
> > This should be an easy workaround though.
>
> This is a cgroup v1 behavior and to me it is totally reasonable to get
> the 2 second stale root's usage. Even if you want to skip flushing in
> irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the
> rstat core code. This way we will know if other subsystems are doing
> the same or not.

We can do that. Basically in mem_cgroup_usage() have:

/* Some useful comment */
if (in_task())
    mem_cgroup_flush_stats();

and in cgroup_rstat_flush() have:
WARN_ON_ONCE(!in_task());

I am assuming VM_BUG_ON is not used outside mm code.

The only thing that worries me is that if there is another unlikely
path somewhere that flushes stats in irq context we may run into a
deadlock. I am a little bit nervous about not skipping flushing if
!in_task() in cgroup_rstat_flush().
Shakeel Butt March 23, 2023, 4:45 p.m. UTC | #12
On Thu, Mar 23, 2023 at 9:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 9:29 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Thu, Mar 23, 2023 at 9:18 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > >
> > > > > > > [...]
> > > > > > > > > >
> > > > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > > > >
> > > > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > > > >
> > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > > > >
> > > > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > > > >
> > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > > > >
> > > > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > > > >
> > > > > > >
> > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > > > irqs.
> > > > > > >
> > > > > > > So, this patch can not be applied before either of those two tasks are
> > > > > > > done (and we may find more such scenarios).
> > > > > >
> > > > > >
> > > > > > Could you elaborate why?
> > > > > >
> > > > > > My understanding is that with an in_task() check to make sure we only
> > > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > > > acquire cgroup_rstat_lock without disabling interrupts.
> > > > >
> > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > > > with irq enabled. This is a potential deadlock hazard unless
> > > > > cgroup_rstat_lock is always taken with irq disabled.
> > > >
> > > > Oh you are making sure it is not taken in the irq context through
> > > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > > > to actually remove all such users instead of silently
> > > > ignoring/bypassing the functionality.
> > >
> > > It is a workaround, we simply accept to read stale stats in irq
> > > context instead of the expensive flush operation.
> > >
> > > >
> > > > So, how about removing mem_cgroup_flush_stats() from
> > > > mem_cgroup_usage(). It will break the known chain which is taking
> > > > cgroup_rstat_lock with irq disabled and you can add
> > > > WARN_ON_ONCE(!in_task()).
> > >
> > > This changes the behavior in a more obvious way because:
> > > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> > > path is also exercised in a lot of paths outside irq context, this
> > > will change the behavior for any event thresholds on the root memcg.
> > > With proposed skipped flushing in irq context we only change the
> > > behavior in a small subset of cases.
> > >
> > > I think we can skip flushing in irq context for now, and separately
> > > deprecate threshold events for the root memcg. When that is done we
> > > can come back and remove should_skip_flush() and add a VM_BUG_ON or
> > > WARN_ON_ONCE instead. WDYT?
> > >
> > > 2. mem_cgroup_usage() is also used when reading usage from userspace.
> > > This should be an easy workaround though.
> >
> > This is a cgroup v1 behavior and to me it is totally reasonable to get
> > the 2 second stale root's usage. Even if you want to skip flushing in
> > irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the
> > rstat core code. This way we will know if other subsystems are doing
> > the same or not.
>
> We can do that. Basically in mem_cgroup_usage() have:
>
> /* Some useful comment */
> if (in_task())
>     mem_cgroup_flush_stats();
>
> and in cgroup_rstat_flush() have:
> WARN_ON_ONCE(!in_task());
>
> I am assuming VM_BUG_ON is not used outside mm code.
>
> The only thing that worries me is that if there is another unlikely
> path somewhere that flushes stats in irq context we may run into a
> deadlock. I am a little bit nervous about not skipping flushing if
> !in_task() in cgroup_rstat_flush().

I think it is a good thing. We will find such scenarios and fix those
instead of hiding them forever or keeping the door open for new such
scenarios.
Yosry Ahmed March 23, 2023, 4:51 p.m. UTC | #13
On Thu, Mar 23, 2023 at 9:45 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 9:37 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Mar 23, 2023 at 9:29 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 9:18 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > > >
> > > > > > > > [...]
> > > > > > > > > > >
> > > > > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > > > > >
> > > > > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > > > > >
> > > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > > > > >
> > > > > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > > > > >
> > > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > > > > >
> > > > > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > > > > irqs.
> > > > > > > >
> > > > > > > > So, this patch can not be applied before either of those two tasks are
> > > > > > > > done (and we may find more such scenarios).
> > > > > > >
> > > > > > >
> > > > > > > Could you elaborate why?
> > > > > > >
> > > > > > > My understanding is that with an in_task() check to make sure we only
> > > > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > > > > acquire cgroup_rstat_lock without disabling interrupts.
> > > > > >
> > > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > > > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > > > > with irq enabled. This is a potential deadlock hazard unless
> > > > > > cgroup_rstat_lock is always taken with irq disabled.
> > > > >
> > > > > Oh you are making sure it is not taken in the irq context through
> > > > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > > > > to actually remove all such users instead of silently
> > > > > ignoring/bypassing the functionality.
> > > >
> > > > It is a workaround, we simply accept to read stale stats in irq
> > > > context instead of the expensive flush operation.
> > > >
> > > > >
> > > > > So, how about removing mem_cgroup_flush_stats() from
> > > > > mem_cgroup_usage(). It will break the known chain which is taking
> > > > > cgroup_rstat_lock with irq disabled and you can add
> > > > > WARN_ON_ONCE(!in_task()).
> > > >
> > > > This changes the behavior in a more obvious way because:
> > > > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> > > > path is also exercised in a lot of paths outside irq context, this
> > > > will change the behavior for any event thresholds on the root memcg.
> > > > With proposed skipped flushing in irq context we only change the
> > > > behavior in a small subset of cases.
> > > >
> > > > I think we can skip flushing in irq context for now, and separately
> > > > deprecate threshold events for the root memcg. When that is done we
> > > > can come back and remove should_skip_flush() and add a VM_BUG_ON or
> > > > WARN_ON_ONCE instead. WDYT?
> > > >
> > > > 2. mem_cgroup_usage() is also used when reading usage from userspace.
> > > > This should be an easy workaround though.
> > >
> > > This is a cgroup v1 behavior and to me it is totally reasonable to get
> > > the 2 second stale root's usage. Even if you want to skip flushing in
> > > irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the
> > > rstat core code. This way we will know if other subsystems are doing
> > > the same or not.
> >
> > We can do that. Basically in mem_cgroup_usage() have:
> >
> > /* Some useful comment */
> > if (in_task())
> >     mem_cgroup_flush_stats();
> >
> > and in cgroup_rstat_flush() have:
> > WARN_ON_ONCE(!in_task());
> >
> > I am assuming VM_BUG_ON is not used outside mm code.
> >
> > The only thing that worries me is that if there is another unlikely
> > path somewhere that flushes stats in irq context we may run into a
> > deadlock. I am a little bit nervous about not skipping flushing if
> > !in_task() in cgroup_rstat_flush().
>
> I think it is a good thing. We will find such scenarios and fix those
> instead of hiding them forever or keeping the door open for new such
> scenarios.

Sure, I can do that in the next version. I will include a patch that
adds an in_task() check to mem_cgroup_usage() before this one. Since
BUG_ON() is discouraged and VM_BUG_ON() is mm specific, I guess we are
left with WARN_ON_ONCE() for the rstat core code, right?

Thanks Shakeel. Any other thoughts I should address for the next version?
Johannes Weiner March 23, 2023, 5:33 p.m. UTC | #14
On Thu, Mar 23, 2023 at 09:17:33AM -0700, Yosry Ahmed wrote:
> On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > [...]
> > > > > > > >
> > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > >
> > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > >
> > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > >
> > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > >
> > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > >
> > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > >
> > > > >
> > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > irqs.
> > > > >
> > > > > So, this patch can not be applied before either of those two tasks are
> > > > > done (and we may find more such scenarios).
> > > >
> > > >
> > > > Could you elaborate why?
> > > >
> > > > My understanding is that with an in_task() check to make sure we only
> > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > acquire cgroup_rstat_lock without disabling interrupts.
> > >
> > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > with irq enabled. This is a potential deadlock hazard unless
> > > cgroup_rstat_lock is always taken with irq disabled.
> >
> > Oh you are making sure it is not taken in the irq context through
> > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > to actually remove all such users instead of silently
> > ignoring/bypassing the functionality.

+1

It shouldn't silently skip the requested operation, rather it
shouldn't be requested from an incompatible context.

> > So, how about removing mem_cgroup_flush_stats() from
> > mem_cgroup_usage(). It will break the known chain which is taking
> > cgroup_rstat_lock with irq disabled and you can add
> > WARN_ON_ONCE(!in_task()).
> 
> This changes the behavior in a more obvious way because:
> 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> path is also exercised in a lot of paths outside irq context, this
> will change the behavior for any event thresholds on the root memcg.
> With proposed skipped flushing in irq context we only change the
> behavior in a small subset of cases.

Can you do

	/* Note: stale usage data when called from irq context!! */
	if (in_task())
		mem_cgroup_flush_stats()

directly in the callsite? Maybe even include the whole callchain in
the comment that's currently broken and needs fixing/removing.
Yosry Ahmed March 23, 2023, 6:09 p.m. UTC | #15
On Thu, Mar 23, 2023 at 10:33 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Mar 23, 2023 at 09:17:33AM -0700, Yosry Ahmed wrote:
> > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > >
> > > > > > [...]
> > > > > > > > >
> > > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > > >
> > > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > > >
> > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > > >
> > > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > > >
> > > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > > >
> > > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > > >
> > > > > >
> > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > > irqs.
> > > > > >
> > > > > > So, this patch can not be applied before either of those two tasks are
> > > > > > done (and we may find more such scenarios).
> > > > >
> > > > >
> > > > > Could you elaborate why?
> > > > >
> > > > > My understanding is that with an in_task() check to make sure we only
> > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > > acquire cgroup_rstat_lock without disabling interrupts.
> > > >
> > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > > with irq enabled. This is a potential deadlock hazard unless
> > > > cgroup_rstat_lock is always taken with irq disabled.
> > >
> > > Oh you are making sure it is not taken in the irq context through
> > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > > to actually remove all such users instead of silently
> > > ignoring/bypassing the functionality.
>
> +1
>
> It shouldn't silently skip the requested operation, rather it
> shouldn't be requested from an incompatible context.
>
> > > So, how about removing mem_cgroup_flush_stats() from
> > > mem_cgroup_usage(). It will break the known chain which is taking
> > > cgroup_rstat_lock with irq disabled and you can add
> > > WARN_ON_ONCE(!in_task()).
> >
> > This changes the behavior in a more obvious way because:
> > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> > path is also exercised in a lot of paths outside irq context, this
> > will change the behavior for any event thresholds on the root memcg.
> > With proposed skipped flushing in irq context we only change the
> > behavior in a small subset of cases.
>
> Can you do
>
>         /* Note: stale usage data when called from irq context!! */
>         if (in_task())
>                 mem_cgroup_flush_stats()
>
> directly in the callsite? Maybe even include the whole callchain in
> the comment that's currently broken and needs fixing/removing.

Yeah, we can do that in mem_cgroup_usage(), which is the only context
that I am aware of that may flush from irq context. We can also add
WARN_ON_ONCE(!in_task()) in the rstat core flushing code to catch any
other code paths that we are not aware of -- which may result in a
deadlock, but hopefully if there is a violation it will be caught soon
enough.
Johannes Weiner March 23, 2023, 6:19 p.m. UTC | #16
On Thu, Mar 23, 2023 at 11:09:30AM -0700, Yosry Ahmed wrote:
> On Thu, Mar 23, 2023 at 10:33 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Mar 23, 2023 at 09:17:33AM -0700, Yosry Ahmed wrote:
> > > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > >
> > > > > > > [...]
> > > > > > > > > >
> > > > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > > > >
> > > > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > > > >
> > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > > > >
> > > > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > > > >
> > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > > > >
> > > > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > > > >
> > > > > > >
> > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > > > irqs.
> > > > > > >
> > > > > > > So, this patch can not be applied before either of those two tasks are
> > > > > > > done (and we may find more such scenarios).
> > > > > >
> > > > > >
> > > > > > Could you elaborate why?
> > > > > >
> > > > > > My understanding is that with an in_task() check to make sure we only
> > > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > > > acquire cgroup_rstat_lock without disabling interrupts.
> > > > >
> > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > > > with irq enabled. This is a potential deadlock hazard unless
> > > > > cgroup_rstat_lock is always taken with irq disabled.
> > > >
> > > > Oh you are making sure it is not taken in the irq context through
> > > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > > > to actually remove all such users instead of silently
> > > > ignoring/bypassing the functionality.
> >
> > +1
> >
> > It shouldn't silently skip the requested operation, rather it
> > shouldn't be requested from an incompatible context.
> >
> > > > So, how about removing mem_cgroup_flush_stats() from
> > > > mem_cgroup_usage(). It will break the known chain which is taking
> > > > cgroup_rstat_lock with irq disabled and you can add
> > > > WARN_ON_ONCE(!in_task()).
> > >
> > > This changes the behavior in a more obvious way because:
> > > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> > > path is also exercised in a lot of paths outside irq context, this
> > > will change the behavior for any event thresholds on the root memcg.
> > > With proposed skipped flushing in irq context we only change the
> > > behavior in a small subset of cases.
> >
> > Can you do
> >
> >         /* Note: stale usage data when called from irq context!! */
> >         if (in_task())
> >                 mem_cgroup_flush_stats()
> >
> > directly in the callsite? Maybe even include the whole callchain in
> > the comment that's currently broken and needs fixing/removing.
> 
> Yeah, we can do that in mem_cgroup_usage(), which is the only context
> that I am aware of that may flush from irq context. We can also add
> WARN_ON_ONCE(!in_task()) in the rstat core flushing code to catch any
> other code paths that we are not aware of -- which may result in a
> deadlock, but hopefully if there is a violation it will be caught soon
> enough.

That sounds good to me, thanks!
Shakeel Butt March 23, 2023, 7:09 p.m. UTC | #17
On Thu, Mar 23, 2023 at 9:52 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
>
> Sure, I can do that in the next version. I will include a patch that
> adds an in_task() check to mem_cgroup_usage() before this one. Since
> BUG_ON() is discouraged and VM_BUG_ON() is mm specific, I guess we are
> left with WARN_ON_ONCE() for the rstat core code, right?
>
> Thanks Shakeel. Any other thoughts I should address for the next version?

This is all for now (at least for this patch).
Tejun Heo March 24, 2023, 1:39 a.m. UTC | #18
Hello,

On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote:
> Currently, when sleeping is not allowed during rstat flushing, we hold
> the global rstat lock with interrupts disabled throughout the entire
> flush operation. Flushing in an O(# cgroups * # cpus) operation, and
> having interrupts disabled throughout is dangerous.
> 
> For some contexts, we may not want to sleep, but can be interrupted
> (e.g. while holding a spinlock or RCU read lock). As such, do not
> disable interrupts throughout rstat flushing, only when holding the
> percpu lock. This breaks down the O(# cgroups * # cpus) duration with
> interrupts disabled to a series of O(# cgroups) durations.
> 
> Furthermore, if a cpu spinning waiting for the global rstat lock, it
> doesn't need to spin with interrupts disabled anymore.

I'm generally not a fan of big spin locks w/o irq protection. They too often
become a source of unpredictable latency spikes. As you said, the global
rstat lock can be held for quite a while. Removing _irq makes irq latency
better on the CPU but on the other hand it makes a lot more likely that the
lock is gonna be held even longer, possibly significantly so depending on
the configuration and workload which will in turn stall other CPUs waiting
for the lock. Sure, irqs are being serviced quicker but if the cost is more
and longer !irq context multi-cpu stalls, what's the point?

I don't think there's anything which requires the global lock to be held
throughout the entire flushing sequence and irq needs to be disabled when
grabbing the percpu lock anyway, so why not just release the global lock on
CPU boundaries instead? We don't really lose anything significant that way.
The durations of irq disabled sections are still about the same as in the
currently proposed solution at O(# cgroups) and we avoid the risk of holding
the global lock for too long unexpectedly from getting hit repeatedly by
irqs while holding the global lock.

Thanks.
Yosry Ahmed March 24, 2023, 7:22 a.m. UTC | #19
On Thu, Mar 23, 2023 at 6:39 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote:
> > Currently, when sleeping is not allowed during rstat flushing, we hold
> > the global rstat lock with interrupts disabled throughout the entire
> > flush operation. Flushing in an O(# cgroups * # cpus) operation, and
> > having interrupts disabled throughout is dangerous.
> >
> > For some contexts, we may not want to sleep, but can be interrupted
> > (e.g. while holding a spinlock or RCU read lock). As such, do not
> > disable interrupts throughout rstat flushing, only when holding the
> > percpu lock. This breaks down the O(# cgroups * # cpus) duration with
> > interrupts disabled to a series of O(# cgroups) durations.
> >
> > Furthermore, if a cpu spinning waiting for the global rstat lock, it
> > doesn't need to spin with interrupts disabled anymore.
>
> I'm generally not a fan of big spin locks w/o irq protection. They too often
> become a source of unpredictable latency spikes. As you said, the global
> rstat lock can be held for quite a while. Removing _irq makes irq latency
> better on the CPU but on the other hand it makes a lot more likely that the
> lock is gonna be held even longer, possibly significantly so depending on
> the configuration and workload which will in turn stall other CPUs waiting
> for the lock. Sure, irqs are being serviced quicker but if the cost is more
> and longer !irq context multi-cpu stalls, what's the point?
>
> I don't think there's anything which requires the global lock to be held
> throughout the entire flushing sequence and irq needs to be disabled when
> grabbing the percpu lock anyway, so why not just release the global lock on
> CPU boundaries instead? We don't really lose anything significant that way.
> The durations of irq disabled sections are still about the same as in the
> currently proposed solution at O(# cgroups) and we avoid the risk of holding
> the global lock for too long unexpectedly from getting hit repeatedly by
> irqs while holding the global lock.

Thanks for taking a look!

I think a problem with this approach is that we risk having to contend
for the global lock at every CPU boundary in atomic contexts. Right
now we contend for the global lock once, and once we have it we go
through all CPUs to flush, only having to contend with updates taking
the percpu locks at this point. If we unconditionally release &
reacquire the global lock at every CPU boundary then we may contend
for it much more frequently with concurrent flushers.

On the memory controller side, concurrent flushers are already held
back to avoid a thundering herd problem on the global rstat lock, but
flushers from outside the memory controller can still compete together
or with a flusher from the memory controller. In this case, we risk
contending the global lock more and concurrent flushers taking a
longer period of time, which may end up causing multi-CPU stalls
anyway, right? Also, if we keep _irq when spinning for the lock, then
concurrent flushers still need to spin with irq disabled -- another
problem that this series tries to fix.

This is particularly a problem for flushers in atomic contexts. There
is a flusher in mem_cgroup_wb_stats() that flushes while holding
another spinlock, and a flusher in mem_cgroup_usage() that flushes
with irqs disabled. If flushing takes a longer period of time due to
repeated lock contention, it affects such atomic context negatively.

I am not sure how all of this matters in practice, it depends heavily
on the workloads and the configuration like you mentioned. I am just
pointing out the potential disadvantages of reacquiring the lock at
every CPU boundary in atomic contexts.

>
> Thanks.
>
> --
> tejun
Waiman Long March 24, 2023, 2:12 p.m. UTC | #20
On 3/24/23 03:22, Yosry Ahmed wrote:
> On Thu, Mar 23, 2023 at 6:39 PM Tejun Heo <tj@kernel.org> wrote:
>> Hello,
>>
>> On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote:
>>> Currently, when sleeping is not allowed during rstat flushing, we hold
>>> the global rstat lock with interrupts disabled throughout the entire
>>> flush operation. Flushing in an O(# cgroups * # cpus) operation, and
>>> having interrupts disabled throughout is dangerous.
>>>
>>> For some contexts, we may not want to sleep, but can be interrupted
>>> (e.g. while holding a spinlock or RCU read lock). As such, do not
>>> disable interrupts throughout rstat flushing, only when holding the
>>> percpu lock. This breaks down the O(# cgroups * # cpus) duration with
>>> interrupts disabled to a series of O(# cgroups) durations.
>>>
>>> Furthermore, if a cpu spinning waiting for the global rstat lock, it
>>> doesn't need to spin with interrupts disabled anymore.
>> I'm generally not a fan of big spin locks w/o irq protection. They too often
>> become a source of unpredictable latency spikes. As you said, the global
>> rstat lock can be held for quite a while. Removing _irq makes irq latency
>> better on the CPU but on the other hand it makes a lot more likely that the
>> lock is gonna be held even longer, possibly significantly so depending on
>> the configuration and workload which will in turn stall other CPUs waiting
>> for the lock. Sure, irqs are being serviced quicker but if the cost is more
>> and longer !irq context multi-cpu stalls, what's the point?
>>
>> I don't think there's anything which requires the global lock to be held
>> throughout the entire flushing sequence and irq needs to be disabled when
>> grabbing the percpu lock anyway, so why not just release the global lock on
>> CPU boundaries instead? We don't really lose anything significant that way.
>> The durations of irq disabled sections are still about the same as in the
>> currently proposed solution at O(# cgroups) and we avoid the risk of holding
>> the global lock for too long unexpectedly from getting hit repeatedly by
>> irqs while holding the global lock.
> Thanks for taking a look!
>
> I think a problem with this approach is that we risk having to contend
> for the global lock at every CPU boundary in atomic contexts. Right
Isn't it the plan to just do a trylock in atomic contexts so that it 
won't get stuck spinning for the lock for an indeterminate amount of time?
> now we contend for the global lock once, and once we have it we go
> through all CPUs to flush, only having to contend with updates taking
> the percpu locks at this point. If we unconditionally release &
> reacquire the global lock at every CPU boundary then we may contend
> for it much more frequently with concurrent flushers.

Note that with the use of qspinlock in all the major arches, the impact 
of thundering herds of lockers are much less serious than before. There 
are certainly some overhead in doing multiple lock acquires and 
releases, but that shouldn't been too excessive.

I am all in for reducing lock hold time as much as possible as it will 
improve the response time.

Cheers,
Longman
Yosry Ahmed March 24, 2023, 10:50 p.m. UTC | #21
On Fri, Mar 24, 2023 at 7:12 AM Waiman Long <longman@redhat.com> wrote:
>
> On 3/24/23 03:22, Yosry Ahmed wrote:
> > On Thu, Mar 23, 2023 at 6:39 PM Tejun Heo <tj@kernel.org> wrote:
> >> Hello,
> >>
> >> On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote:
> >>> Currently, when sleeping is not allowed during rstat flushing, we hold
> >>> the global rstat lock with interrupts disabled throughout the entire
> >>> flush operation. Flushing in an O(# cgroups * # cpus) operation, and
> >>> having interrupts disabled throughout is dangerous.
> >>>
> >>> For some contexts, we may not want to sleep, but can be interrupted
> >>> (e.g. while holding a spinlock or RCU read lock). As such, do not
> >>> disable interrupts throughout rstat flushing, only when holding the
> >>> percpu lock. This breaks down the O(# cgroups * # cpus) duration with
> >>> interrupts disabled to a series of O(# cgroups) durations.
> >>>
> >>> Furthermore, if a cpu spinning waiting for the global rstat lock, it
> >>> doesn't need to spin with interrupts disabled anymore.
> >> I'm generally not a fan of big spin locks w/o irq protection. They too often
> >> become a source of unpredictable latency spikes. As you said, the global
> >> rstat lock can be held for quite a while. Removing _irq makes irq latency
> >> better on the CPU but on the other hand it makes a lot more likely that the
> >> lock is gonna be held even longer, possibly significantly so depending on
> >> the configuration and workload which will in turn stall other CPUs waiting
> >> for the lock. Sure, irqs are being serviced quicker but if the cost is more
> >> and longer !irq context multi-cpu stalls, what's the point?
> >>
> >> I don't think there's anything which requires the global lock to be held
> >> throughout the entire flushing sequence and irq needs to be disabled when
> >> grabbing the percpu lock anyway, so why not just release the global lock on
> >> CPU boundaries instead? We don't really lose anything significant that way.
> >> The durations of irq disabled sections are still about the same as in the
> >> currently proposed solution at O(# cgroups) and we avoid the risk of holding
> >> the global lock for too long unexpectedly from getting hit repeatedly by
> >> irqs while holding the global lock.
> > Thanks for taking a look!
> >
> > I think a problem with this approach is that we risk having to contend
> > for the global lock at every CPU boundary in atomic contexts. Right
> Isn't it the plan to just do a trylock in atomic contexts so that it
> won't get stuck spinning for the lock for an indeterminate amount of time?

Not exactly. On the memory controller side, we currently only allow
one flusher at a time and force all flushers to flush the full
hierarchy, such that concurrent flushers can skip. This is done for
both atomic and non-atomic contexts.

For flushers outside the memory controller, they can still contend the
lock among themselves or with flushers in the memory controller. In
this case, instead of contending the lock once, they contend it at
each CPU boundary.

> > now we contend for the global lock once, and once we have it we go
> > through all CPUs to flush, only having to contend with updates taking
> > the percpu locks at this point. If we unconditionally release &
> > reacquire the global lock at every CPU boundary then we may contend
> > for it much more frequently with concurrent flushers.
>
> Note that with the use of qspinlock in all the major arches, the impact
> of thundering herds of lockers are much less serious than before. There
> are certainly some overhead in doing multiple lock acquires and
> releases, but that shouldn't been too excessive.

I ran some tests to measure this. Since I am using a cgroup v1
hierarchy, I cannot reproduce contention between memory controller
flushers and non-memory controller flushers, so I removed the "one
memory flusher only" restriction to have concurrent memory flushers
compete for the global rstat lock to measure the impact:

Before (only one flusher allowed to compete for the global rstat lock):
            ---cgroup_rstat_flush
               |
                --1.27%--cgroup_rstat_flush_locked
                          |
                           --0.94%--mem_cgroup_css_rstat_flush

After (concurrent flushers allowed to compete for the global rstat lock):
            ---cgroup_rstat_flush
               |
               |--4.94%--_raw_spin_lock
               |          |
               |           --4.94%--queued_spin_lock_slowpath
               |
                --0.92%--cgroup_rstat_flush_locked
                          |
                           --0.56%--mem_cgroup_css_rstat_flush

This was run with 20 processes trying to flush concurrently, so it may
be excessive, but it seems like in this case lock contention makes a
significant difference.

Again, this is not a regression for non-atomic flushers, as they
already compete for the lock at every CPU boundary, but for atomic
flushers that don't give up the lock at all today, it would be a
regression to start competing for the lock at every CPU boundary. This
patch series aims to minimize the number of atomic flushers (brings
them down to two, one of which is not common), so this may be fine.

My main concern is that for some flushers that this series converts
from atomic to non-atomic, we may notice a regression later and revert
it (e.g. refault path), which is why I have them in separate patches.
If we regress the atomic flushing path, it would be a larger surgery
to restore the performance for these paths -- which is why I would
rather keep the atomic path without excessive lock contention.

Thoughts?

>
> I am all in for reducing lock hold time as much as possible as it will
> improve the response time.
>
> Cheers,
> Longman
>
Tejun Heo March 25, 2023, 1:54 a.m. UTC | #22
Hello,

On Fri, Mar 24, 2023 at 12:22:09AM -0700, Yosry Ahmed wrote:
> I think a problem with this approach is that we risk having to contend
> for the global lock at every CPU boundary in atomic contexts. Right
> now we contend for the global lock once, and once we have it we go
> through all CPUs to flush, only having to contend with updates taking
> the percpu locks at this point. If we unconditionally release &
> reacquire the global lock at every CPU boundary then we may contend
> for it much more frequently with concurrent flushers.
> 
> On the memory controller side, concurrent flushers are already held
> back to avoid a thundering herd problem on the global rstat lock, but
> flushers from outside the memory controller can still compete together
> or with a flusher from the memory controller. In this case, we risk
> contending the global lock more and concurrent flushers taking a
> longer period of time, which may end up causing multi-CPU stalls
> anyway, right? Also, if we keep _irq when spinning for the lock, then
> concurrent flushers still need to spin with irq disabled -- another
> problem that this series tries to fix.
> 
> This is particularly a problem for flushers in atomic contexts. There
> is a flusher in mem_cgroup_wb_stats() that flushes while holding
> another spinlock, and a flusher in mem_cgroup_usage() that flushes
> with irqs disabled. If flushing takes a longer period of time due to
> repeated lock contention, it affects such atomic context negatively.
> 
> I am not sure how all of this matters in practice, it depends heavily
> on the workloads and the configuration like you mentioned. I am just
> pointing out the potential disadvantages of reacquiring the lock at
> every CPU boundary in atomic contexts.

So, I'm not too convinced by the arguments for a couple reasons:

* It's not very difficult to create conditions where a contented non-irq
  protected spinlock is held unnecessarily long due to heavy IRQ irq load on
  the holding CPU. This can easily extend the amount of time the lock is
  held by multiple times if not orders of magnitude. That is likely a
  significantly worse problem than the contention on the lock cacheline
  which will lead to a lot more gradual degradation.

* If concurrent flushing is an actual problem, we need and can implement a
  better solution. There's quite a bit of maneuvering room here given that
  the flushing operations are mostly idempotent in close time proximity and
  there's no real atomicity requirement across different segments of
  flushing operations.

Thanks.
Yosry Ahmed March 25, 2023, 2:17 a.m. UTC | #23
On Fri, Mar 24, 2023 at 6:54 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Mar 24, 2023 at 12:22:09AM -0700, Yosry Ahmed wrote:
> > I think a problem with this approach is that we risk having to contend
> > for the global lock at every CPU boundary in atomic contexts. Right
> > now we contend for the global lock once, and once we have it we go
> > through all CPUs to flush, only having to contend with updates taking
> > the percpu locks at this point. If we unconditionally release &
> > reacquire the global lock at every CPU boundary then we may contend
> > for it much more frequently with concurrent flushers.
> >
> > On the memory controller side, concurrent flushers are already held
> > back to avoid a thundering herd problem on the global rstat lock, but
> > flushers from outside the memory controller can still compete together
> > or with a flusher from the memory controller. In this case, we risk
> > contending the global lock more and concurrent flushers taking a
> > longer period of time, which may end up causing multi-CPU stalls
> > anyway, right? Also, if we keep _irq when spinning for the lock, then
> > concurrent flushers still need to spin with irq disabled -- another
> > problem that this series tries to fix.
> >
> > This is particularly a problem for flushers in atomic contexts. There
> > is a flusher in mem_cgroup_wb_stats() that flushes while holding
> > another spinlock, and a flusher in mem_cgroup_usage() that flushes
> > with irqs disabled. If flushing takes a longer period of time due to
> > repeated lock contention, it affects such atomic context negatively.
> >
> > I am not sure how all of this matters in practice, it depends heavily
> > on the workloads and the configuration like you mentioned. I am just
> > pointing out the potential disadvantages of reacquiring the lock at
> > every CPU boundary in atomic contexts.
>
> So, I'm not too convinced by the arguments for a couple reasons:
>
> * It's not very difficult to create conditions where a contented non-irq
>   protected spinlock is held unnecessarily long due to heavy IRQ irq load on
>   the holding CPU. This can easily extend the amount of time the lock is
>   held by multiple times if not orders of magnitude. That is likely a
>   significantly worse problem than the contention on the lock cacheline
>   which will lead to a lot more gradual degradation.

I agree that can be a problem, it depends on the specific workload and
configuration. The continuous lock contention at each CPU boundary
causes a regression (see my reply to Waiman), but I am not sure if
it's worse than the scenario you are describing.

>
> * If concurrent flushing is an actual problem, we need and can implement a
>   better solution. There's quite a bit of maneuvering room here given that
>   the flushing operations are mostly idempotent in close time proximity and
>   there's no real atomicity requirement across different segments of
>   flushing operations.

Concurrent flushing can be a problem for some workloads, especially in
the MM code we flush in the reclaim and refault paths. This is
currently mitigated by only allowing one flusher at a time from the
memcg side, but contention can still happen with flushing when a
cgroup is being freed or other flushers in other subsystems.

I tried allowing concurrent flushing by completely removing the global
rstat lock, and only depending on the percpu locks for
synchronization. For this to be correct the global stat counters need
to be atomic, this introduced a slow down for flushing in general. I
also noticed heavier lock contention on the percpu locks, since all
flushers try to acquire all locks in the same order. I even tried
implementing a simple retry scheme where we try to acquire the percpu
lock, and if we fail we queue the current cpu and move to the next
one. This ended up causing a little bit of slowness as well. Together
with the slowness introduced by atomic operations it seemed like a
significant regression in the simple flushing path.

Don't get me wrong, I am all for modifying the current approach, I
just want to make sure we are making the correct decision for *most*
workloads. Keep in mind that this series aims to minimize the number
of flushers from atomic contexts as well, and for non-atomic flushers
we allow giving up the lock at CPU boundaries anyway. The current
approach only keeps the lock held throughout for atomic flushers.

Any ideas here are welcome!

>
> Thanks.
>
> --
> tejun
Shakeel Butt March 25, 2023, 4:30 a.m. UTC | #24
On Fri, Mar 24, 2023 at 7:18 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
> Any ideas here are welcome!
>

Let's move forward. It seems like we are not going to reach an
agreement on making cgroup_rstat_lock a non-irq lock. However there is
agreement on the memcg code of not flushing in irq context and the
cleanup Johannes has requested. Let's proceed with those for now. We
can come back to cgroup_rstat_lock later if we still see issues in
production.

Tejun, do you have any concerns on adding WARN_ON_ONCE(!in_task()) in
the rstat flushing code?
Yosry Ahmed March 25, 2023, 4:37 a.m. UTC | #25
On Fri, Mar 24, 2023 at 9:31 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, Mar 24, 2023 at 7:18 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> [...]
> > Any ideas here are welcome!
> >
>
> Let's move forward. It seems like we are not going to reach an
> agreement on making cgroup_rstat_lock a non-irq lock. However there is
> agreement on the memcg code of not flushing in irq context and the
> cleanup Johannes has requested. Let's proceed with those for now. We
> can come back to cgroup_rstat_lock later if we still see issues in
> production.

Even if we do not flush from irq context, we still flush from atomic
contexts that will currently hold the lock with irqs disabled
throughout the entire flush sequence. A primary purpose of this reason
is to avoid that.

We can either:
(a) Proceed with the following approach of making cgroup_rstat_lock a
non-irq lock.
(b) Proceed with Tejun's suggestion of always releasing and
reacquiring the lock at CPU boundaries, even for atomic flushes (if
the spinlock needs a break ofc).
(c) Something else.

I am happy to proceed with any solution, but we need to address the
fact that interrupts are always disabled throughout the flush. My main
concern about Tejun's suggestion is atomic contexts having to contend
cgroup_rstat_lock much more than they do now, but it's still better
than what we have today.

>
> Tejun, do you have any concerns on adding WARN_ON_ONCE(!in_task()) in
> the rstat flushing code?
Shakeel Butt March 25, 2023, 4:46 a.m. UTC | #26
On Fri, Mar 24, 2023 at 9:37 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Mar 24, 2023 at 9:31 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Fri, Mar 24, 2023 at 7:18 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > [...]
> > > Any ideas here are welcome!
> > >
> >
> > Let's move forward. It seems like we are not going to reach an
> > agreement on making cgroup_rstat_lock a non-irq lock. However there is
> > agreement on the memcg code of not flushing in irq context and the
> > cleanup Johannes has requested. Let's proceed with those for now. We
> > can come back to cgroup_rstat_lock later if we still see issues in
> > production.
>
> Even if we do not flush from irq context, we still flush from atomic
> contexts that will currently hold the lock with irqs disabled
> throughout the entire flush sequence. A primary purpose of this reason
> is to avoid that.
>
> We can either:
> (a) Proceed with the following approach of making cgroup_rstat_lock a
> non-irq lock.
> (b) Proceed with Tejun's suggestion of always releasing and
> reacquiring the lock at CPU boundaries, even for atomic flushes (if
> the spinlock needs a break ofc).
> (c) Something else.

(d) keep the status quo regarding cgroup_rstat_lock
(e) decouple the discussion of cgroup_rstat_lock from the agreed
improvements. Send the patches for the agreed ones and continue
discussing cgroup_rstat_lock.
Yosry Ahmed March 27, 2023, 11:23 p.m. UTC | #27
On Fri, Mar 24, 2023 at 9:46 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, Mar 24, 2023 at 9:37 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Fri, Mar 24, 2023 at 9:31 PM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Fri, Mar 24, 2023 at 7:18 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > [...]
> > > > Any ideas here are welcome!
> > > >
> > >
> > > Let's move forward. It seems like we are not going to reach an
> > > agreement on making cgroup_rstat_lock a non-irq lock. However there is
> > > agreement on the memcg code of not flushing in irq context and the
> > > cleanup Johannes has requested. Let's proceed with those for now. We
> > > can come back to cgroup_rstat_lock later if we still see issues in
> > > production.
> >
> > Even if we do not flush from irq context, we still flush from atomic
> > contexts that will currently hold the lock with irqs disabled
> > throughout the entire flush sequence. A primary purpose of this reason
> > is to avoid that.
> >
> > We can either:
> > (a) Proceed with the following approach of making cgroup_rstat_lock a
> > non-irq lock.
> > (b) Proceed with Tejun's suggestion of always releasing and
> > reacquiring the lock at CPU boundaries, even for atomic flushes (if
> > the spinlock needs a break ofc).
> > (c) Something else.
>
> (d) keep the status quo regarding cgroup_rstat_lock
> (e) decouple the discussion of cgroup_rstat_lock from the agreed
> improvements. Send the patches for the agreed ones and continue
> discussing cgroup_rstat_lock.


Ah, I lost sight of the fact that the rest of the patch series does
not strictly depend on this patch. I will respin the rest of the patch
series separately. Thanks, Shakeel.

Meanwhile, it would be useful to reach an agreement here to stop
acquiring the cgroup_rstat_lock for a long time with irq disabled in
atomic contexts.

Tejun, if having the lock be non-irq is a non-starter for you, I can
send a patch that instead gives up the lock and reacquires it at every
CPU boundary unconditionally -- or perhaps every N CPU boundaries to
avoid excessively releasing and reacquiring the lock.

Something like:

static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
{
    ...
    for_each_possible_cpu(cpu) {
        ...
        /* Always yield the at CPU boundaries to enable irqs */
        spin_unlock_irq(&cgroup_rstat_lock);

        /* if @may_sleep, play nice and yield if necessary */
        if (may_sleep)
            cond_resched();

        spin_lock_irq(&cgroup_rstat_lock);
    }
}

If you have other ideas to avoid disabling irq's for the entire flush
sequence I am also open to that.

Thanks!
Tejun Heo March 29, 2023, 6:53 p.m. UTC | #28
Hello, Yosry.

On Mon, Mar 27, 2023 at 04:23:13PM -0700, Yosry Ahmed wrote:
> Tejun, if having the lock be non-irq is a non-starter for you, I can

This is an actual hazard. We see in prod these unprotected locks causing
very big spikes in tail latencies and they can be tricky to root cause too
and given the way rstat lock is used it's highly likely to be involved in
those scenarios with the proposed change, so it's gonna be a nack from my
end.

> send a patch that instead gives up the lock and reacquires it at every
> CPU boundary unconditionally -- or perhaps every N CPU boundaries to
> avoid excessively releasing and reacquiring the lock.

I'd just do the simple thing and see whether there's any perf penalty before
making it complicated. I'd be pretty surprised if unlocking and relocking
the same spinlock adds any noticeable overhead here.

Thanks.
Hugh Dickins March 29, 2023, 7:22 p.m. UTC | #29
Hi Tejun,

On Wed, 29 Mar 2023, Tejun Heo wrote:
> On Mon, Mar 27, 2023 at 04:23:13PM -0700, Yosry Ahmed wrote:
> > Tejun, if having the lock be non-irq is a non-starter for you, I can
> 
> This is an actual hazard. We see in prod these unprotected locks causing
> very big spikes in tail latencies and they can be tricky to root cause too
> and given the way rstat lock is used it's highly likely to be involved in
> those scenarios with the proposed change, so it's gonna be a nack from my
> end.

Butting in here, I'm fascinated.  This is certainly not my area, I know
nothing about rstat, but this is the first time I ever heard someone
arguing for more disabling of interrupts rather than less.

An interrupt coming in while holding a contended resource can certainly
add to latencies, that I accept of course.  But until now, I thought it
was agreed best practice to disable irqs only regretfully, when strictly
necessary.

If that has changed, I for one want to know about it.  How should we
now judge which spinlocks should disable interrupts and which should not?
Page table locks are currently my main interest - should those be changed?

Thanks,
Hugh
Tejun Heo March 29, 2023, 8 p.m. UTC | #30
Hello, Hugh. How have you been?

On Wed, Mar 29, 2023 at 12:22:24PM -0700, Hugh Dickins wrote:
> Hi Tejun,
> Butting in here, I'm fascinated.  This is certainly not my area, I know
> nothing about rstat, but this is the first time I ever heard someone
> arguing for more disabling of interrupts rather than less.
> 
> An interrupt coming in while holding a contended resource can certainly
> add to latencies, that I accept of course.  But until now, I thought it
> was agreed best practice to disable irqs only regretfully, when strictly
> necessary.
> 
> If that has changed, I for one want to know about it.  How should we
> now judge which spinlocks should disable interrupts and which should not?
> Page table locks are currently my main interest - should those be changed?

For rstat, it's a simple case because the global lock here wraps around
per-cpu locks which have to be irq-safe, so the only difference we get
between making the global irq-unsafe and keeping it so but releasing
inbetween is:

 Global lock held: G
 IRQ disabled: I
 Percpu lock held: P
 
1. IRQ unsafe

 GGGGGGGGGGGGGGG~~GGGGG
 IIII IIII IIII ~~ IIII
 PPPP PPPP PPPP ~~ PPPP

2. IRQ safe released inbetween cpus

 GGGG GGGG GGGG ~~ GGGG
 IIII IIII IIII ~~ IIII
 PPPP PPPP PPPP ~~ PPPP

#2 seems like the obvious thing to do here given how the lock is used and
each P section may take a bit of time.

So, in the rstat case, the choice is, at least to me, obvious, but even for
more generic cases where the bulk of actual work isn't done w/ irq disabled,
I don't think the picture is as simple as "use the least protected variant
possible" anymore because the underlying hardware changed.

For an SMP kernel running on an UP system, "the least protected variant" is
the obvious choice to make because you don't lose anything by holding a
spinlock longer than necessary. However, as you increase the number of CPUs,
there rises a tradeoff between local irq servicing latency and global lock
contention.

Imagine a, say, 128 cpu system with a few cores servicing relatively high
frequency interrupts. Let's say there's a mildly hot lock. Usually, it shows
up in the system profile but only just. Let's say something happens and the
irq rate on those cores went up for some reason to the point where it
becomes a rather common occurrence when the lock is held on one of those
cpus, irqs are likely to intervene lengthening how long the lock is held,
sometimes, signficantly. Now because the lock is on average held for much
longer, it become a lot hotter as more CPUs would stall on it and depending
on luck or lack thereof these stalls can span many CPUs on the system for
quite a while. This is actually something we saw in production.

So, in general, there's a trade off between local irq service latency and
inducing global lock contention when using unprotected locks. With more and
more CPUs, the balance keeps shifting. The balance still very much depends
on the specifics of a given lock but yeah I think it's something we need to
be a lot more careful about now.

Thanks.
Hugh Dickins March 29, 2023, 8:38 p.m. UTC | #31
On Wed, 29 Mar 2023, Tejun Heo wrote:

> Hello, Hugh. How have you been?
> 
> On Wed, Mar 29, 2023 at 12:22:24PM -0700, Hugh Dickins wrote:
> > Hi Tejun,
> > Butting in here, I'm fascinated.  This is certainly not my area, I know
> > nothing about rstat, but this is the first time I ever heard someone
> > arguing for more disabling of interrupts rather than less.
> > 
> > An interrupt coming in while holding a contended resource can certainly
> > add to latencies, that I accept of course.  But until now, I thought it
> > was agreed best practice to disable irqs only regretfully, when strictly
> > necessary.
> > 
> > If that has changed, I for one want to know about it.  How should we
> > now judge which spinlocks should disable interrupts and which should not?
> > Page table locks are currently my main interest - should those be changed?
> 
> For rstat, it's a simple case because the global lock here wraps around
> per-cpu locks which have to be irq-safe, so the only difference we get
> between making the global irq-unsafe and keeping it so but releasing
> inbetween is:
> 
>  Global lock held: G
>  IRQ disabled: I
>  Percpu lock held: P
>  
> 1. IRQ unsafe
> 
>  GGGGGGGGGGGGGGG~~GGGGG
>  IIII IIII IIII ~~ IIII
>  PPPP PPPP PPPP ~~ PPPP
> 
> 2. IRQ safe released inbetween cpus
> 
>  GGGG GGGG GGGG ~~ GGGG
>  IIII IIII IIII ~~ IIII
>  PPPP PPPP PPPP ~~ PPPP
> 
> #2 seems like the obvious thing to do here given how the lock is used and
> each P section may take a bit of time.

Many thanks for the detailed response.  I'll leave it to the rstat folks,
to agree or disagree with your analysis there.

> 
> So, in the rstat case, the choice is, at least to me, obvious, but even for
> more generic cases where the bulk of actual work isn't done w/ irq disabled,
> I don't think the picture is as simple as "use the least protected variant
> possible" anymore because the underlying hardware changed.
> 
> For an SMP kernel running on an UP system, "the least protected variant" is
> the obvious choice to make because you don't lose anything by holding a
> spinlock longer than necessary. However, as you increase the number of CPUs,
> there rises a tradeoff between local irq servicing latency and global lock
> contention.
> 
> Imagine a, say, 128 cpu system with a few cores servicing relatively high
> frequency interrupts. Let's say there's a mildly hot lock. Usually, it shows
> up in the system profile but only just. Let's say something happens and the
> irq rate on those cores went up for some reason to the point where it
> becomes a rather common occurrence when the lock is held on one of those
> cpus, irqs are likely to intervene lengthening how long the lock is held,
> sometimes, signficantly. Now because the lock is on average held for much
> longer, it become a lot hotter as more CPUs would stall on it and depending
> on luck or lack thereof these stalls can span many CPUs on the system for
> quite a while. This is actually something we saw in production.
> 
> So, in general, there's a trade off between local irq service latency and
> inducing global lock contention when using unprotected locks. With more and
> more CPUs, the balance keeps shifting. The balance still very much depends
> on the specifics of a given lock but yeah I think it's something we need to
> be a lot more careful about now.

And this looks a very plausible argument to me: I'll let it sink in.

But I hadn't heard that the RT folks were clamouring for more irq disabling:
perhaps they partition their machines with more care, and are not devotees
of high CPU counts.

What I hope is that others will chime in one way or the other -
it does sound as if a reappraisal of the balances is overdue.

Thanks,
Hugh (disabling interrupts for as long as he can)
Yosry Ahmed March 30, 2023, 4:26 a.m. UTC | #32
Thanks for a great discussion, Tejun and Hugh.

On Wed, Mar 29, 2023 at 1:38 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Wed, 29 Mar 2023, Tejun Heo wrote:
>
> > Hello, Hugh. How have you been?
> >
> > On Wed, Mar 29, 2023 at 12:22:24PM -0700, Hugh Dickins wrote:
> > > Hi Tejun,
> > > Butting in here, I'm fascinated.  This is certainly not my area, I know
> > > nothing about rstat, but this is the first time I ever heard someone
> > > arguing for more disabling of interrupts rather than less.
> > >
> > > An interrupt coming in while holding a contended resource can certainly
> > > add to latencies, that I accept of course.  But until now, I thought it
> > > was agreed best practice to disable irqs only regretfully, when strictly
> > > necessary.
> > >
> > > If that has changed, I for one want to know about it.  How should we
> > > now judge which spinlocks should disable interrupts and which should not?
> > > Page table locks are currently my main interest - should those be changed?
> >
> > For rstat, it's a simple case because the global lock here wraps around
> > per-cpu locks which have to be irq-safe, so the only difference we get
> > between making the global irq-unsafe and keeping it so but releasing
> > inbetween is:
> >
> >  Global lock held: G
> >  IRQ disabled: I
> >  Percpu lock held: P
> >
> > 1. IRQ unsafe
> >
> >  GGGGGGGGGGGGGGG~~GGGGG
> >  IIII IIII IIII ~~ IIII
> >  PPPP PPPP PPPP ~~ PPPP
> >
> > 2. IRQ safe released inbetween cpus
> >
> >  GGGG GGGG GGGG ~~ GGGG
> >  IIII IIII IIII ~~ IIII
> >  PPPP PPPP PPPP ~~ PPPP
> >
> > #2 seems like the obvious thing to do here given how the lock is used and
> > each P section may take a bit of time.
>
> Many thanks for the detailed response.  I'll leave it to the rstat folks,
> to agree or disagree with your analysis there.

Thanks for the analysis, Tejun, it does indeed make sense. I perf'd
releasing and reacquiring the lock at each CPU boundary and the
overhead seems to be minimal. It would be higher with contention, but
all memcg flushers should be held back by the memcg code, and flushers
outside memcg are not frequent (reading blkcg and cpu base stats from
user space, and when a cgroup is being removed).

I realized that after v2 of this patch series [1], we would only end
up with two atomic flushing contexts, mem_cgroup_wb_stats() and
mem_cgroup_usage(). The latter is already disabling irqs for other
reasons, so anything we do within the rstat core code doesn't really
help, it needs to be addressed separately. So only the call site in
mem_cgroup_wb_stats() would benefit from not having irqs disabled
throughout the flush.

I will hold off on sending a patch until I observe that this call site
is causing us pain and/or other atomic call sites emerge (or we have
to revert one of the ones we made non-atomic), so that we don't hurt
other flushers unnecessarily. Does this make sense to you?

[1] https://lore.kernel.org/linux-mm/20230328221644.803272-1-yosryahmed@google.com/

>
> >
> > So, in the rstat case, the choice is, at least to me, obvious, but even for
> > more generic cases where the bulk of actual work isn't done w/ irq disabled,
> > I don't think the picture is as simple as "use the least protected variant
> > possible" anymore because the underlying hardware changed.
> >
> > For an SMP kernel running on an UP system, "the least protected variant" is
> > the obvious choice to make because you don't lose anything by holding a
> > spinlock longer than necessary. However, as you increase the number of CPUs,
> > there rises a tradeoff between local irq servicing latency and global lock
> > contention.
> >
> > Imagine a, say, 128 cpu system with a few cores servicing relatively high
> > frequency interrupts. Let's say there's a mildly hot lock. Usually, it shows
> > up in the system profile but only just. Let's say something happens and the
> > irq rate on those cores went up for some reason to the point where it
> > becomes a rather common occurrence when the lock is held on one of those
> > cpus, irqs are likely to intervene lengthening how long the lock is held,
> > sometimes, signficantly. Now because the lock is on average held for much
> > longer, it become a lot hotter as more CPUs would stall on it and depending
> > on luck or lack thereof these stalls can span many CPUs on the system for
> > quite a while. This is actually something we saw in production.
> >
> > So, in general, there's a trade off between local irq service latency and
> > inducing global lock contention when using unprotected locks. With more and
> > more CPUs, the balance keeps shifting. The balance still very much depends
> > on the specifics of a given lock but yeah I think it's something we need to
> > be a lot more careful about now.
>
> And this looks a very plausible argument to me: I'll let it sink in.
>
> But I hadn't heard that the RT folks were clamouring for more irq disabling:
> perhaps they partition their machines with more care, and are not devotees
> of high CPU counts.
>
> What I hope is that others will chime in one way or the other -
> it does sound as if a reappraisal of the balances is overdue.
>
> Thanks,
> Hugh (disabling interrupts for as long as he can)
Tejun Heo March 31, 2023, 1:51 a.m. UTC | #33
Hello, Hugh.

On Wed, Mar 29, 2023 at 01:38:48PM -0700, Hugh Dickins wrote:
> > So, in general, there's a trade off between local irq service latency and
> > inducing global lock contention when using unprotected locks. With more and
> > more CPUs, the balance keeps shifting. The balance still very much depends
> > on the specifics of a given lock but yeah I think it's something we need to
> > be a lot more careful about now.
> 
> And this looks a very plausible argument to me: I'll let it sink in.

Another somewhat relevant change is that flipping irq on/off used to be
relatively expensive on older x86 cpus. I forget all details about when and
how much but they should be a lot cheaper now. No idea about !x86 cpus tho.

> But I hadn't heard that the RT folks were clamouring for more irq disabling:
> perhaps they partition their machines with more care, and are not devotees
> of high CPU counts.

I think RT folks care a lot more about raw IRQ disables. These shouldn't
actually disable IRQs on RT kernels.

Thanks.
diff mbox series

Patch

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 831f1f472bb8..af11e28bd055 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -7,6 +7,7 @@ 
 #include <linux/btf.h>
 #include <linux/btf_ids.h>
 
+/* This lock should only be held from task context */
 static DEFINE_SPINLOCK(cgroup_rstat_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
 
@@ -210,14 +211,24 @@  static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
 		/* if @may_sleep, play nice and yield if necessary */
 		if (may_sleep && (need_resched() ||
 				  spin_needbreak(&cgroup_rstat_lock))) {
-			spin_unlock_irq(&cgroup_rstat_lock);
+			spin_unlock(&cgroup_rstat_lock);
 			if (!cond_resched())
 				cpu_relax();
-			spin_lock_irq(&cgroup_rstat_lock);
+			spin_lock(&cgroup_rstat_lock);
 		}
 	}
 }
 
+static bool should_skip_flush(void)
+{
+	/*
+	 * We acquire cgroup_rstat_lock without disabling interrupts, so we
+	 * should not try to acquire from interrupt contexts to avoid deadlocks.
+	 * It can be expensive to flush stats in interrupt context anyway.
+	 */
+	return !in_task();
+}
+
 /**
  * cgroup_rstat_flush - flush stats in @cgrp's subtree
  * @cgrp: target cgroup
@@ -229,15 +240,18 @@  static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
  * This also gets all cgroups in the subtree including @cgrp off the
  * ->updated_children lists.
  *
- * This function may block.
+ * This function is safe to call from contexts that disable interrupts, but
+ * @may_sleep must be set to false, otherwise the function may block.
  */
 __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 {
-	might_sleep();
+	if (should_skip_flush())
+		return;
 
-	spin_lock_irq(&cgroup_rstat_lock);
+	might_sleep();
+	spin_lock(&cgroup_rstat_lock);
 	cgroup_rstat_flush_locked(cgrp, true);
-	spin_unlock_irq(&cgroup_rstat_lock);
+	spin_unlock(&cgroup_rstat_lock);
 }
 
 /**
@@ -248,11 +262,12 @@  __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
  */
 void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp)
 {
-	unsigned long flags;
+	if (should_skip_flush())
+		return;
 
-	spin_lock_irqsave(&cgroup_rstat_lock, flags);
+	spin_lock(&cgroup_rstat_lock);
 	cgroup_rstat_flush_locked(cgrp, false);
-	spin_unlock_irqrestore(&cgroup_rstat_lock, flags);
+	spin_unlock(&cgroup_rstat_lock);
 }
 
 /**
@@ -267,8 +282,11 @@  void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp)
 void cgroup_rstat_flush_hold(struct cgroup *cgrp)
 	__acquires(&cgroup_rstat_lock)
 {
+	if (should_skip_flush())
+		return;
+
 	might_sleep();
-	spin_lock_irq(&cgroup_rstat_lock);
+	spin_lock(&cgroup_rstat_lock);
 	cgroup_rstat_flush_locked(cgrp, true);
 }
 
@@ -278,7 +296,7 @@  void cgroup_rstat_flush_hold(struct cgroup *cgrp)
 void cgroup_rstat_flush_release(void)
 	__releases(&cgroup_rstat_lock)
 {
-	spin_unlock_irq(&cgroup_rstat_lock);
+	spin_unlock(&cgroup_rstat_lock);
 }
 
 int cgroup_rstat_init(struct cgroup *cgrp)