diff mbox series

[3/3] mm: memcg: use non-unified stats flushing for userspace reads

Message ID 20230821205458.1764662-4-yosryahmed@google.com (mailing list archive)
State New
Headers show
Series memcg: non-unified flushing for userspace stats | expand

Commit Message

Yosry Ahmed Aug. 21, 2023, 8:54 p.m. UTC
Unified flushing allows for great concurrency for paths that attempt to
flush the stats, at the expense of potential staleness and a single
flusher paying the extra cost of flushing the full tree.

This tradeoff makes sense for in-kernel flushers that may observe high
concurrency (e.g. reclaim, refault). For userspace readers, stale stats
may be unexpected and problematic, especially when such stats are used
for critical paths such as userspace OOM handling. Additionally, a
userspace reader will occasionally pay the cost of flushing the entire
hierarchy, which also causes problems in some cases [1].

Opt userspace reads out of unified flushing. This makes the cost of
reading the stats more predictable (proportional to the size of the
subtree), as well as the freshness of the stats. Since userspace readers
are not expected to have similar concurrency to in-kernel flushers,
serializing them among themselves and among in-kernel flushers should be
okay.

This was tested on a machine with 256 cpus by running a synthetic test
The script that creates 50 top-level cgroups, each with 5 children (250
leaf cgroups). Each leaf cgroup has 10 processes running that allocate
memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
unified flusher). Concurrently, one thread is spawned per-cgroup to read
the stats every second (including root, top-level, and leaf cgroups --
so total 251 threads). No regressions were observed in the total running
time; which means that non-unified userspace readers are not slowing
down in-kernel unified flushers:

Base (mm-unstable):

real	0m18.228s
user	0m9.463s
sys	60m15.879s

real	0m20.828s
user	0m8.535s
sys	70m12.364s

real	0m19.789s
user	0m9.177s
sys	66m10.798s

With this patch:

real	0m19.632s
user	0m8.608s
sys	64m23.483s

real	0m18.463s
user	0m7.465s
sys	60m34.089s

real	0m20.309s
user	0m7.754s
sys	68m2.392s

Additionally, the average latency for reading stats went from roughly
40ms to 5 ms, because we mostly read the stats of leaf cgroups in this
script, so we only have to flush one cgroup, instead of *sometimes*
flushing the entire tree with unified flushing.

[1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

Michal Hocko Aug. 22, 2023, 9:06 a.m. UTC | #1
On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> Unified flushing allows for great concurrency for paths that attempt to
> flush the stats, at the expense of potential staleness and a single
> flusher paying the extra cost of flushing the full tree.
> 
> This tradeoff makes sense for in-kernel flushers that may observe high
> concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> may be unexpected and problematic, especially when such stats are used
> for critical paths such as userspace OOM handling. Additionally, a
> userspace reader will occasionally pay the cost of flushing the entire
> hierarchy, which also causes problems in some cases [1].
> 
> Opt userspace reads out of unified flushing. This makes the cost of
> reading the stats more predictable (proportional to the size of the
> subtree), as well as the freshness of the stats. Since userspace readers
> are not expected to have similar concurrency to in-kernel flushers,
> serializing them among themselves and among in-kernel flushers should be
> okay.
> 
> This was tested on a machine with 256 cpus by running a synthetic test
> The script that creates 50 top-level cgroups, each with 5 children (250
> leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> unified flusher). Concurrently, one thread is spawned per-cgroup to read
> the stats every second (including root, top-level, and leaf cgroups --
> so total 251 threads). No regressions were observed in the total running
> time; which means that non-unified userspace readers are not slowing
> down in-kernel unified flushers:

I have to admit I am rather confused by cgroup_rstat_flush (and
cgroup_rstat_flush_locked). The former says it can block but the later
doesn't ever block and even if it drops the cgroup_rstat_lock it merely
cond_rescheds or busy loops. How much of a contention and yielding can
you see with this patch? What is the worst case? How bad a random user
can make the situation by going crazy and trying to flush from many
different contexts?
Yosry Ahmed Aug. 22, 2023, 3:30 p.m. UTC | #2
On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> > Unified flushing allows for great concurrency for paths that attempt to
> > flush the stats, at the expense of potential staleness and a single
> > flusher paying the extra cost of flushing the full tree.
> >
> > This tradeoff makes sense for in-kernel flushers that may observe high
> > concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> > may be unexpected and problematic, especially when such stats are used
> > for critical paths such as userspace OOM handling. Additionally, a
> > userspace reader will occasionally pay the cost of flushing the entire
> > hierarchy, which also causes problems in some cases [1].
> >
> > Opt userspace reads out of unified flushing. This makes the cost of
> > reading the stats more predictable (proportional to the size of the
> > subtree), as well as the freshness of the stats. Since userspace readers
> > are not expected to have similar concurrency to in-kernel flushers,
> > serializing them among themselves and among in-kernel flushers should be
> > okay.
> >
> > This was tested on a machine with 256 cpus by running a synthetic test
> > The script that creates 50 top-level cgroups, each with 5 children (250
> > leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > unified flusher). Concurrently, one thread is spawned per-cgroup to read
> > the stats every second (including root, top-level, and leaf cgroups --
> > so total 251 threads). No regressions were observed in the total running
> > time; which means that non-unified userspace readers are not slowing
> > down in-kernel unified flushers:
>
> I have to admit I am rather confused by cgroup_rstat_flush (and
> cgroup_rstat_flush_locked). The former says it can block but the later
> doesn't ever block and even if it drops the cgroup_rstat_lock it merely
> cond_rescheds or busy loops. How much of a contention and yielding can
> you see with this patch? What is the worst case? How bad a random user
> can make the situation by going crazy and trying to flush from many
> different contexts?

Userspace readers (or more generically non-unified flushers) can all
collectively only block a single unified flusher at most.
Specifically, one userspace reader goes to flush and holds
cgroup_rstat_lock, meanwhile an in-kernel flusher (e.g. reclaim) goes
and tries to flush, and spins on cgroup_rstat_lock. Other in-kernel
(unified) flushers will just see another unified flusher in progress
and skip. So userspace can only block a single in-kernel reclaimer.
Not that it's not really that bad because:
(a) As you note, cgroup_rstat_flush() does not really "block", it's
cpu-bound. Even when it cond_resched()'s, it yields the lock first. So
it can't really hold anyone hostage for long.
(b) I assume a random user can only read their own stats, which should
be a relatively small subtree, quick to flush. I am assuming a random
user cannot read root's memory.stat (which is most expensive).
(c) Excessive flushing doesn't really build up because there will be
nothing to flush and the lock will be released very shortly after it's
held.

So to answer your question, I don't think a random user can really
affect the system in a significant way by constantly flushing. In
fact, in the test script (which I am now attaching, in case you're
interested), there are hundreds of threads that are reading stats of
different cgroups every 1s, and I don't see any negative effects on
in-kernel flushers in this case (reclaimers).

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 23, 2023, 7:33 a.m. UTC | #3
On Tue 22-08-23 08:30:05, Yosry Ahmed wrote:
> On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
[...]
> So to answer your question, I don't think a random user can really
> affect the system in a significant way by constantly flushing. In
> fact, in the test script (which I am now attaching, in case you're
> interested), there are hundreds of threads that are reading stats of
> different cgroups every 1s, and I don't see any negative effects on
> in-kernel flushers in this case (reclaimers).

I suspect you have missed my point. Maybe I am just misunderstanding
the code but it seems to me that the lock dropping inside
cgroup_rstat_flush_locked effectivelly allows unbounded number of
contenders which is really dangerous when it is triggerable from the
userspace. The number of spinners at a moment is always bound by the
number CPUs but depending on timing many potential spinners might be
cond_rescheded and the worst time latency to complete can be really
high. Makes more sense?
Yosry Ahmed Aug. 23, 2023, 2:55 p.m. UTC | #4
On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 22-08-23 08:30:05, Yosry Ahmed wrote:
> > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> [...]
> > So to answer your question, I don't think a random user can really
> > affect the system in a significant way by constantly flushing. In
> > fact, in the test script (which I am now attaching, in case you're
> > interested), there are hundreds of threads that are reading stats of
> > different cgroups every 1s, and I don't see any negative effects on
> > in-kernel flushers in this case (reclaimers).
>
> I suspect you have missed my point.

I suspect you are right :)


> Maybe I am just misunderstanding
> the code but it seems to me that the lock dropping inside
> cgroup_rstat_flush_locked effectivelly allows unbounded number of
> contenders which is really dangerous when it is triggerable from the
> userspace. The number of spinners at a moment is always bound by the
> number CPUs but depending on timing many potential spinners might be
> cond_rescheded and the worst time latency to complete can be really
> high. Makes more sense?

I think I understand better now. So basically because we might drop
the lock and resched, there can be nr_cpus spinners + other spinners
that are currently scheduled away, so these will need to wait to be
scheduled and then start spinning on the lock. This may happen for one
reader multiple times during its read, which is what can cause a high
worst case latency.

I hope I understood you correctly this time. Did I?

So the logic to give up the lock and sleep was introduced by commit
0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") in
4.18. It has been possible for userspace to trigger this scenario by
reading cpu.stat, which has been using rstat since then. On the memcg
side, it was also possible to trigger this behavior between commit
2d146aa3aa84 ("mm: memcontrol: switch to rstat") and commit
fd25a9e0e23b ("memcg: unify memcg stat flushing") (i.e between 5.13
and 5.16).

I am not sure there has been any problems from this, but perhaps Tejun
can answer this better than I can.

The way I think about it is that random userspace reads will mostly be
reading their subtrees, which is generally not very large (and can be
limited), so every individual read should be cheap enough. Also,
consequent flushes on overlapping subtrees will have very little to do
as there won't be many pending updates, they should also be very
cheap. So unless multiple jobs on the same machine are collectively
trying to act maliciously (purposefully or not) and concurrently spawn
multiple readers of different parts of the hierarchy (and maintain
enough activity to generate stat updates to flush), I don't think it's
a concern.

I also imagine (but haven't checked) that there is some locking at
some level that will throttle a malicious job that spawns multiple
readers to the same memory.stat file.

I hope this answers your question.


> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 24, 2023, 7:13 a.m. UTC | #5
On Wed 23-08-23 07:55:40, Yosry Ahmed wrote:
> On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote:
> > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> > [...]
> > > So to answer your question, I don't think a random user can really
> > > affect the system in a significant way by constantly flushing. In
> > > fact, in the test script (which I am now attaching, in case you're
> > > interested), there are hundreds of threads that are reading stats of
> > > different cgroups every 1s, and I don't see any negative effects on
> > > in-kernel flushers in this case (reclaimers).
> >
> > I suspect you have missed my point.
> 
> I suspect you are right :)
> 
> 
> > Maybe I am just misunderstanding
> > the code but it seems to me that the lock dropping inside
> > cgroup_rstat_flush_locked effectivelly allows unbounded number of
> > contenders which is really dangerous when it is triggerable from the
> > userspace. The number of spinners at a moment is always bound by the
> > number CPUs but depending on timing many potential spinners might be
> > cond_rescheded and the worst time latency to complete can be really
> > high. Makes more sense?
> 
> I think I understand better now. So basically because we might drop
> the lock and resched, there can be nr_cpus spinners + other spinners
> that are currently scheduled away, so these will need to wait to be
> scheduled and then start spinning on the lock. This may happen for one
> reader multiple times during its read, which is what can cause a high
> worst case latency.
> 
> I hope I understood you correctly this time. Did I?

Yes. I would just add that this could also influence the worst case
latency for a different reader - so an adversary user can stall others.
Exposing a shared global lock in uncontrolable way over generally
available user interface is not really a great idea IMHO.
Yosry Ahmed Aug. 24, 2023, 6:15 p.m. UTC | #6
On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 23-08-23 07:55:40, Yosry Ahmed wrote:
> > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote:
> > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> > > [...]
> > > > So to answer your question, I don't think a random user can really
> > > > affect the system in a significant way by constantly flushing. In
> > > > fact, in the test script (which I am now attaching, in case you're
> > > > interested), there are hundreds of threads that are reading stats of
> > > > different cgroups every 1s, and I don't see any negative effects on
> > > > in-kernel flushers in this case (reclaimers).
> > >
> > > I suspect you have missed my point.
> >
> > I suspect you are right :)
> >
> >
> > > Maybe I am just misunderstanding
> > > the code but it seems to me that the lock dropping inside
> > > cgroup_rstat_flush_locked effectivelly allows unbounded number of
> > > contenders which is really dangerous when it is triggerable from the
> > > userspace. The number of spinners at a moment is always bound by the
> > > number CPUs but depending on timing many potential spinners might be
> > > cond_rescheded and the worst time latency to complete can be really
> > > high. Makes more sense?
> >
> > I think I understand better now. So basically because we might drop
> > the lock and resched, there can be nr_cpus spinners + other spinners
> > that are currently scheduled away, so these will need to wait to be
> > scheduled and then start spinning on the lock. This may happen for one
> > reader multiple times during its read, which is what can cause a high
> > worst case latency.
> >
> > I hope I understood you correctly this time. Did I?
>
> Yes. I would just add that this could also influence the worst case
> latency for a different reader - so an adversary user can stall others.

I can add that for v2 to the commit log, thanks.

> Exposing a shared global lock in uncontrolable way over generally
> available user interface is not really a great idea IMHO.

I think that's how it was always meant to be when it was designed. The
global rstat lock has always existed and was always available to
userspace readers. The memory controller took a different path at some
point with unified flushing, but that was mainly because of high
concurrency from in-kernel flushers, not because userspace readers
caused a problem. Outside of memcg, the core cgroup code has always
exercised this global lock when reading cpu.stat since rstat's
introduction. I assume there hasn't been any problems since it's still
there. I was hoping Tejun would confirm/deny this.
Yosry Ahmed Aug. 24, 2023, 6:50 p.m. UTC | #7
On Thu, Aug 24, 2023 at 11:15 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 23-08-23 07:55:40, Yosry Ahmed wrote:
> > > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote:
> > > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> > > > [...]
> > > > > So to answer your question, I don't think a random user can really
> > > > > affect the system in a significant way by constantly flushing. In
> > > > > fact, in the test script (which I am now attaching, in case you're
> > > > > interested), there are hundreds of threads that are reading stats of
> > > > > different cgroups every 1s, and I don't see any negative effects on
> > > > > in-kernel flushers in this case (reclaimers).
> > > >
> > > > I suspect you have missed my point.
> > >
> > > I suspect you are right :)
> > >
> > >
> > > > Maybe I am just misunderstanding
> > > > the code but it seems to me that the lock dropping inside
> > > > cgroup_rstat_flush_locked effectivelly allows unbounded number of
> > > > contenders which is really dangerous when it is triggerable from the
> > > > userspace. The number of spinners at a moment is always bound by the
> > > > number CPUs but depending on timing many potential spinners might be
> > > > cond_rescheded and the worst time latency to complete can be really
> > > > high. Makes more sense?
> > >
> > > I think I understand better now. So basically because we might drop
> > > the lock and resched, there can be nr_cpus spinners + other spinners
> > > that are currently scheduled away, so these will need to wait to be
> > > scheduled and then start spinning on the lock. This may happen for one
> > > reader multiple times during its read, which is what can cause a high
> > > worst case latency.
> > >
> > > I hope I understood you correctly this time. Did I?
> >
> > Yes. I would just add that this could also influence the worst case
> > latency for a different reader - so an adversary user can stall others.
>
> I can add that for v2 to the commit log, thanks.
>
> > Exposing a shared global lock in uncontrolable way over generally
> > available user interface is not really a great idea IMHO.
>
> I think that's how it was always meant to be when it was designed. The
> global rstat lock has always existed and was always available to
> userspace readers. The memory controller took a different path at some
> point with unified flushing, but that was mainly because of high
> concurrency from in-kernel flushers, not because userspace readers
> caused a problem. Outside of memcg, the core cgroup code has always
> exercised this global lock when reading cpu.stat since rstat's
> introduction. I assume there hasn't been any problems since it's still
> there. I was hoping Tejun would confirm/deny this.

One thing we can do to remedy this situation is to replace the global
rstat lock with a mutex, and drop the resched/lock dropping condition.
Tejun suggested this in the previous thread. This effectively reverts
0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock")
since now all the flushing contexts are sleepable.

My synthetic stress test does not show any regressions with mutexes,
and there is a small boost to reading latency (probably because we
stop dropping the lock / rescheduling). Not sure if we may start
seeing need_resched warnings on big flushes though.

One other concern that Shakeel pointed out to me is preemption. If
someone holding the mutex gets preempted this may starve other
waiters. We can disable preemption while we hold the mutex, not sure
if that's a common pattern though.
Michal Hocko Aug. 25, 2023, 7:05 a.m. UTC | #8
On Thu 24-08-23 11:50:51, Yosry Ahmed wrote:
> On Thu, Aug 24, 2023 at 11:15 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 23-08-23 07:55:40, Yosry Ahmed wrote:
> > > > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote:
> > > > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> > > > > [...]
> > > > > > So to answer your question, I don't think a random user can really
> > > > > > affect the system in a significant way by constantly flushing. In
> > > > > > fact, in the test script (which I am now attaching, in case you're
> > > > > > interested), there are hundreds of threads that are reading stats of
> > > > > > different cgroups every 1s, and I don't see any negative effects on
> > > > > > in-kernel flushers in this case (reclaimers).
> > > > >
> > > > > I suspect you have missed my point.
> > > >
> > > > I suspect you are right :)
> > > >
> > > >
> > > > > Maybe I am just misunderstanding
> > > > > the code but it seems to me that the lock dropping inside
> > > > > cgroup_rstat_flush_locked effectivelly allows unbounded number of
> > > > > contenders which is really dangerous when it is triggerable from the
> > > > > userspace. The number of spinners at a moment is always bound by the
> > > > > number CPUs but depending on timing many potential spinners might be
> > > > > cond_rescheded and the worst time latency to complete can be really
> > > > > high. Makes more sense?
> > > >
> > > > I think I understand better now. So basically because we might drop
> > > > the lock and resched, there can be nr_cpus spinners + other spinners
> > > > that are currently scheduled away, so these will need to wait to be
> > > > scheduled and then start spinning on the lock. This may happen for one
> > > > reader multiple times during its read, which is what can cause a high
> > > > worst case latency.
> > > >
> > > > I hope I understood you correctly this time. Did I?
> > >
> > > Yes. I would just add that this could also influence the worst case
> > > latency for a different reader - so an adversary user can stall others.
> >
> > I can add that for v2 to the commit log, thanks.
> >
> > > Exposing a shared global lock in uncontrolable way over generally
> > > available user interface is not really a great idea IMHO.
> >
> > I think that's how it was always meant to be when it was designed. The
> > global rstat lock has always existed and was always available to
> > userspace readers. The memory controller took a different path at some
> > point with unified flushing, but that was mainly because of high
> > concurrency from in-kernel flushers, not because userspace readers
> > caused a problem. Outside of memcg, the core cgroup code has always
> > exercised this global lock when reading cpu.stat since rstat's
> > introduction. I assume there hasn't been any problems since it's still
> > there.

I suspect nobody has just considered a malfunctioning or adversary
workloads so far.

> > I was hoping Tejun would confirm/deny this.

Yes, that would be interesting to hear.

> One thing we can do to remedy this situation is to replace the global
> rstat lock with a mutex, and drop the resched/lock dropping condition.
> Tejun suggested this in the previous thread. This effectively reverts
> 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock")
> since now all the flushing contexts are sleepable.

I would have a very daring question. Do we really need a global lock in
the first place? AFAIU this locks serializes (kinda as the lock can be
dropped midway) flushers and cgroup_rstat_flush_hold/release caller (a
single one ATM). I can see cgroup_base_stat_cputime_show would like to
have a consistent view on multiple stats but can we live without a
strong guarantee or to replace the lock with seqlock instead?

> My synthetic stress test does not show any regressions with mutexes,
> and there is a small boost to reading latency (probably because we
> stop dropping the lock / rescheduling). Not sure if we may start
> seeing need_resched warnings on big flushes though.

Reading 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock")
it seems the point of moving away from mutex was to have a more usable
API.

> One other concern that Shakeel pointed out to me is preemption. If
> someone holding the mutex gets preempted this may starve other
> waiters. We can disable preemption while we hold the mutex, not sure
> if that's a common pattern though.

No, not really. It is expected that holder of mutex can sleep and can be
preempted as well.

I might be wrong but the whole discussion so far suggests that the
global rstat lock should be reconsidered. From my personal experience
global locks easily triggerable from the userspace are just a receip for
problems. Stats reading shouldn't be interfering with the system runtime
as much as possible and they should be deterministic wrt runtime as
well.
Yosry Ahmed Aug. 25, 2023, 3:14 p.m. UTC | #9
On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 24-08-23 11:50:51, Yosry Ahmed wrote:
> > On Thu, Aug 24, 2023 at 11:15 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Wed 23-08-23 07:55:40, Yosry Ahmed wrote:
> > > > > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote:
> > > > > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > >
> > > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote:
> > > > > > [...]
> > > > > > > So to answer your question, I don't think a random user can really
> > > > > > > affect the system in a significant way by constantly flushing. In
> > > > > > > fact, in the test script (which I am now attaching, in case you're
> > > > > > > interested), there are hundreds of threads that are reading stats of
> > > > > > > different cgroups every 1s, and I don't see any negative effects on
> > > > > > > in-kernel flushers in this case (reclaimers).
> > > > > >
> > > > > > I suspect you have missed my point.
> > > > >
> > > > > I suspect you are right :)
> > > > >
> > > > >
> > > > > > Maybe I am just misunderstanding
> > > > > > the code but it seems to me that the lock dropping inside
> > > > > > cgroup_rstat_flush_locked effectivelly allows unbounded number of
> > > > > > contenders which is really dangerous when it is triggerable from the
> > > > > > userspace. The number of spinners at a moment is always bound by the
> > > > > > number CPUs but depending on timing many potential spinners might be
> > > > > > cond_rescheded and the worst time latency to complete can be really
> > > > > > high. Makes more sense?
> > > > >
> > > > > I think I understand better now. So basically because we might drop
> > > > > the lock and resched, there can be nr_cpus spinners + other spinners
> > > > > that are currently scheduled away, so these will need to wait to be
> > > > > scheduled and then start spinning on the lock. This may happen for one
> > > > > reader multiple times during its read, which is what can cause a high
> > > > > worst case latency.
> > > > >
> > > > > I hope I understood you correctly this time. Did I?
> > > >
> > > > Yes. I would just add that this could also influence the worst case
> > > > latency for a different reader - so an adversary user can stall others.
> > >
> > > I can add that for v2 to the commit log, thanks.
> > >
> > > > Exposing a shared global lock in uncontrolable way over generally
> > > > available user interface is not really a great idea IMHO.
> > >
> > > I think that's how it was always meant to be when it was designed. The
> > > global rstat lock has always existed and was always available to
> > > userspace readers. The memory controller took a different path at some
> > > point with unified flushing, but that was mainly because of high
> > > concurrency from in-kernel flushers, not because userspace readers
> > > caused a problem. Outside of memcg, the core cgroup code has always
> > > exercised this global lock when reading cpu.stat since rstat's
> > > introduction. I assume there hasn't been any problems since it's still
> > > there.
>
> I suspect nobody has just considered a malfunctioning or adversary
> workloads so far.

Perhaps that also means it's not a problem in practice, or at least I hope so :)

>
> > > I was hoping Tejun would confirm/deny this.
>
> Yes, that would be interesting to hear.
>
> > One thing we can do to remedy this situation is to replace the global
> > rstat lock with a mutex, and drop the resched/lock dropping condition.
> > Tejun suggested this in the previous thread. This effectively reverts
> > 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock")
> > since now all the flushing contexts are sleepable.
>
> I would have a very daring question. Do we really need a global lock in
> the first place? AFAIU this locks serializes (kinda as the lock can be
> dropped midway) flushers and cgroup_rstat_flush_hold/release caller (a
> single one ATM). I can see cgroup_base_stat_cputime_show would like to
> have a consistent view on multiple stats but can we live without a
> strong guarantee or to replace the lock with seqlock instead?

Unfortunately, it's more difficult than this. I thought about breaking
down that lock and falled back to this solution. See below.

>
> > My synthetic stress test does not show any regressions with mutexes,
> > and there is a small boost to reading latency (probably because we
> > stop dropping the lock / rescheduling). Not sure if we may start
> > seeing need_resched warnings on big flushes though.
>
> Reading 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock")
> it seems the point of moving away from mutex was to have a more usable
> API.

Right, we needed an atomic interface for flushing, but that later
turned out to cause some problems, so we reworked the code such that
we never have to flush atomically. Now we can go back to the mutex if
it makes things better, I am not really sure how much it helps though.

>
> > One other concern that Shakeel pointed out to me is preemption. If
> > someone holding the mutex gets preempted this may starve other
> > waiters. We can disable preemption while we hold the mutex, not sure
> > if that's a common pattern though.
>
> No, not really. It is expected that holder of mutex can sleep and can be
> preempted as well.

Maybe not for this specific case because it's a global mutex and
holding it for too long might cause problems? Is it bad to disable
preemption while holding a mutex?

>
> I might be wrong but the whole discussion so far suggests that the
> global rstat lock should be reconsidered. From my personal experience
> global locks easily triggerable from the userspace are just a receip for
> problems. Stats reading shouldn't be interfering with the system runtime
> as much as possible and they should be deterministic wrt runtime as
> well.

The problem is that the global lock also serializes the global
counters that we flush to. I will talk from the memcg flushing
perspective as that's what I am familiar with. I am not sure how much
this is transferable to other flushers.

On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock
synchronizes access to multiple counters, for this discussion what's
most important are:
- The global stat counters of the memcg being flushed (e.g.
memcg->vmstats->state).
- The pending stat counters of the parent being flushed (e.g.
parent->vmstats->state_pending).

To get rid of this lock, we either need to use atomics for those
counters, or have fine-grained locks. I experimented a while back with
atomic and flushing was significantly more expensive. The number of
atomic operations would scale with O(# cgroups * # cpus) and can grow
unbounded.

The other option is fine-grained locks. In this case we would need to
lock both the memcg being flushed and its parent together. This can go
wrong with ordering, and for top-level memcgs the root memcg lock will
become the new global lock anyway. One way to overcome this is to
change the parent's pending stat counters to also be percpu. This will
increase the memory usage of the stat counters per-memcg, by hundreds
of bytes per cpu.

Let's assume that's okay, so we only need to lock one cgroup at a
time. There are more problems.

We also have a percpu lock (cgroup_rstat_cpu_lock), which we use to
lock the percpu tree which has the cgroups that have updates on this
cpu. It is held by both flushing contexts and updating contexts (hot
paths). Ideally we don't want to spin on a per-cgroup (non percpu)
lock while holding the percpu lock, as flushers of different cpus will
start blocking one another, as well as blocking updaters. On the other
hand, we need to hold percpu lock to pop a cgroup from that tree and
lock it. It's a chicken and egg problem. Also, if we release the
percpu lock while flushing, we open another can of worms:
(a) Concurrent updates can keep updating the tree putting us in an
endless flushing loop. We need some sort of generation tracking for
this.
(b) Concurrent flushing can flush a parent prematurely on the same cpu
as we are flushing a child, and not get the updates from the child.

One possible scheme to handle the above is as follows:
1. Hold the percpu lock, find the cgroup that needs to be flushed next.
2. Trylock that cgroup. If we succeed, we flush it with both the
percpu and the per-cgroup locks held.
3. If we fail, release the percpu lock and spin on the per-cgroup lock.
4. When we get the per-cgroup lock, take the percpu lock again, and
make sure that the locked cgroup is still the correct cgroup to flush.
If not, repeat.
5. Flush the cgroup, and go back to step 1 to get the next cgroup.

Of course this is complex and error-prone, and might introduce
significant overheads due to the number of locks we need to take
compared with what we currently have.

I guess what I am trying to say is, breaking down that lock is a major
surgery that might require re-designing or re-implementing some parts
of rstat. I would be extremely happy to be proven wrong. If we can
break down that lock then there is no need for unified flushing even
for in-kernel contexts, and we can all live happily ever after with
cheap(ish) and accurate stats flushing.

I really hope we can move forward with the problems at hand (sometimes
reads are expensive, sometimes reads are stale), and not block fixing
them until we can come up with an alternative to that global lock
(unless, of course, there is a simpler way of doing that).

Sorry for the very long reply :)

Thanks!
Michal Hocko Aug. 25, 2023, 6:17 p.m. UTC | #10
On Fri 25-08-23 08:14:54, Yosry Ahmed wrote:
> On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > I might be wrong but the whole discussion so far suggests that the
> > global rstat lock should be reconsidered. From my personal experience
> > global locks easily triggerable from the userspace are just a receip for
> > problems. Stats reading shouldn't be interfering with the system runtime
> > as much as possible and they should be deterministic wrt runtime as
> > well.
> 
> The problem is that the global lock also serializes the global
> counters that we flush to. I will talk from the memcg flushing
> perspective as that's what I am familiar with. I am not sure how much
> this is transferable to other flushers.
> 
> On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock
> synchronizes access to multiple counters, for this discussion what's
> most important are:
> - The global stat counters of the memcg being flushed (e.g.
> memcg->vmstats->state).
> - The pending stat counters of the parent being flushed (e.g.
> parent->vmstats->state_pending).

I haven't digested the rest of the email yet (Friday brain, sorry) but I
do not think you are adressing this particular part so let me ask before
I dive more into the following. I really do not follow the serialization
requirement here because the lock doesn't really serialize the flushing,
does it? At least not in a sense of a single caller to do the flushing
atomicaly from other flushers. It is possible that the current flusher
simply drops the lock midway and another one retakes the lock and
performs the operation again. So what additional flushing
synchronization does it provide and why cannot parallel flushers simply
compete over pcp spinlocks?

So what am I missing?
Yosry Ahmed Aug. 25, 2023, 6:21 p.m. UTC | #11
On Fri, Aug 25, 2023 at 11:17 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 25-08-23 08:14:54, Yosry Ahmed wrote:
> > On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > I might be wrong but the whole discussion so far suggests that the
> > > global rstat lock should be reconsidered. From my personal experience
> > > global locks easily triggerable from the userspace are just a receip for
> > > problems. Stats reading shouldn't be interfering with the system runtime
> > > as much as possible and they should be deterministic wrt runtime as
> > > well.
> >
> > The problem is that the global lock also serializes the global
> > counters that we flush to. I will talk from the memcg flushing
> > perspective as that's what I am familiar with. I am not sure how much
> > this is transferable to other flushers.
> >
> > On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock
> > synchronizes access to multiple counters, for this discussion what's
> > most important are:
> > - The global stat counters of the memcg being flushed (e.g.
> > memcg->vmstats->state).
> > - The pending stat counters of the parent being flushed (e.g.
> > parent->vmstats->state_pending).
>
> I haven't digested the rest of the email yet (Friday brain, sorry) but I
> do not think you are adressing this particular part so let me ask before
> I dive more into the following. I really do not follow the serialization
> requirement here because the lock doesn't really serialize the flushing,
> does it? At least not in a sense of a single caller to do the flushing
> atomicaly from other flushers. It is possible that the current flusher
> simply drops the lock midway and another one retakes the lock and
> performs the operation again. So what additional flushing
> synchronization does it provide and why cannot parallel flushers simply
> compete over pcp spinlocks?
>
> So what am I missing?

Those counters are non-atomic. The lock makes sure we don't have two
concurrent flushers updating the same counter locklessly and
non-atomically, which would be possible if we flush the same cgroup on
two different cpus in parallel.

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 25, 2023, 6:43 p.m. UTC | #12
On Fri 25-08-23 11:21:16, Yosry Ahmed wrote:
> On Fri, Aug 25, 2023 at 11:17 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 25-08-23 08:14:54, Yosry Ahmed wrote:
> > > On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > I might be wrong but the whole discussion so far suggests that the
> > > > global rstat lock should be reconsidered. From my personal experience
> > > > global locks easily triggerable from the userspace are just a receip for
> > > > problems. Stats reading shouldn't be interfering with the system runtime
> > > > as much as possible and they should be deterministic wrt runtime as
> > > > well.
> > >
> > > The problem is that the global lock also serializes the global
> > > counters that we flush to. I will talk from the memcg flushing
> > > perspective as that's what I am familiar with. I am not sure how much
> > > this is transferable to other flushers.
> > >
> > > On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock
> > > synchronizes access to multiple counters, for this discussion what's
> > > most important are:
> > > - The global stat counters of the memcg being flushed (e.g.
> > > memcg->vmstats->state).
> > > - The pending stat counters of the parent being flushed (e.g.
> > > parent->vmstats->state_pending).
> >
> > I haven't digested the rest of the email yet (Friday brain, sorry) but I
> > do not think you are adressing this particular part so let me ask before
> > I dive more into the following. I really do not follow the serialization
> > requirement here because the lock doesn't really serialize the flushing,
> > does it? At least not in a sense of a single caller to do the flushing
> > atomicaly from other flushers. It is possible that the current flusher
> > simply drops the lock midway and another one retakes the lock and
> > performs the operation again. So what additional flushing
> > synchronization does it provide and why cannot parallel flushers simply
> > compete over pcp spinlocks?
> >
> > So what am I missing?
> 
> Those counters are non-atomic. The lock makes sure we don't have two
> concurrent flushers updating the same counter locklessly and
> non-atomically, which would be possible if we flush the same cgroup on
> two different cpus in parallel.

pcp lock (cpu_lock) guarantees the very same, doesn't it?
Michal Hocko Aug. 25, 2023, 6:44 p.m. UTC | #13
On Fri 25-08-23 20:43:02, Michal Hocko wrote:
> On Fri 25-08-23 11:21:16, Yosry Ahmed wrote:
> > On Fri, Aug 25, 2023 at 11:17 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 25-08-23 08:14:54, Yosry Ahmed wrote:
> > > > On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > > > > I might be wrong but the whole discussion so far suggests that the
> > > > > global rstat lock should be reconsidered. From my personal experience
> > > > > global locks easily triggerable from the userspace are just a receip for
> > > > > problems. Stats reading shouldn't be interfering with the system runtime
> > > > > as much as possible and they should be deterministic wrt runtime as
> > > > > well.
> > > >
> > > > The problem is that the global lock also serializes the global
> > > > counters that we flush to. I will talk from the memcg flushing
> > > > perspective as that's what I am familiar with. I am not sure how much
> > > > this is transferable to other flushers.
> > > >
> > > > On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock
> > > > synchronizes access to multiple counters, for this discussion what's
> > > > most important are:
> > > > - The global stat counters of the memcg being flushed (e.g.
> > > > memcg->vmstats->state).
> > > > - The pending stat counters of the parent being flushed (e.g.
> > > > parent->vmstats->state_pending).
> > >
> > > I haven't digested the rest of the email yet (Friday brain, sorry) but I
> > > do not think you are adressing this particular part so let me ask before
> > > I dive more into the following. I really do not follow the serialization
> > > requirement here because the lock doesn't really serialize the flushing,
> > > does it? At least not in a sense of a single caller to do the flushing
> > > atomicaly from other flushers. It is possible that the current flusher
> > > simply drops the lock midway and another one retakes the lock and
> > > performs the operation again. So what additional flushing
> > > synchronization does it provide and why cannot parallel flushers simply
> > > compete over pcp spinlocks?
> > >
> > > So what am I missing?
> > 
> > Those counters are non-atomic. The lock makes sure we don't have two
> > concurrent flushers updating the same counter locklessly and
> > non-atomically, which would be possible if we flush the same cgroup on
> > two different cpus in parallel.
> 
> pcp lock (cpu_lock) guarantees the very same, doesn't it?

Nope, it doesn't. I really need to have a deeper look.
Michal Hocko Aug. 28, 2023, 3:47 p.m. UTC | #14
Done my homework and studied the rstat code more (sorry should have done
that earlier).

On Fri 25-08-23 08:14:54, Yosry Ahmed wrote:
[...]
> I guess what I am trying to say is, breaking down that lock is a major
> surgery that might require re-designing or re-implementing some parts
> of rstat. I would be extremely happy to be proven wrong. If we can
> break down that lock then there is no need for unified flushing even
> for in-kernel contexts, and we can all live happily ever after with
> cheap(ish) and accurate stats flushing.

Yes, this seems like a big change and also over complicating the whole
thing. I am not sure this is worth it.

> I really hope we can move forward with the problems at hand (sometimes
> reads are expensive, sometimes reads are stale), and not block fixing
> them until we can come up with an alternative to that global lock
> (unless, of course, there is a simpler way of doing that).

Well, I really have to say that I do not like the notion that reading
stats is unpredictable. This just makes it really hard to use. If
the precision is to be sarificed then this should be preferable over
potentially high global lock contention. We already have that model in
place of /proc/vmstat (configurable timeout for flusher and a way to
flush explicitly). I appreciate you would like to have a better
precision but as you have explored the locking is really hard to get rid
of here.

So from my POV I would prefer to avoid flushing from the stats reading
path and implement force flushing by writing to stat file. If the 2s
flushing interval is considered to coarse I would be OK to allow setting
it from userspace. This way this would be more in line with /proc/vmstat
which seems to be working quite well.

If this is not accaptable or deemed a wrong approach long term then it
would be good to reonsider the current cgroup_rstat_lock at least.
Either by turning it into mutex or by dropping the yielding code which
can severly affect the worst case latency AFAIU.

> Sorry for the very long reply :)

Thanks for bearing with me and taking time to formulate all this!
Yosry Ahmed Aug. 28, 2023, 4:15 p.m. UTC | #15
On Mon, Aug 28, 2023 at 8:47 AM Michal Hocko <mhocko@suse.com> wrote:
>
> Done my homework and studied the rstat code more (sorry should have done
> that earlier).
>
> On Fri 25-08-23 08:14:54, Yosry Ahmed wrote:
> [...]
> > I guess what I am trying to say is, breaking down that lock is a major
> > surgery that might require re-designing or re-implementing some parts
> > of rstat. I would be extremely happy to be proven wrong. If we can
> > break down that lock then there is no need for unified flushing even
> > for in-kernel contexts, and we can all live happily ever after with
> > cheap(ish) and accurate stats flushing.
>
> Yes, this seems like a big change and also over complicating the whole
> thing. I am not sure this is worth it.
>
> > I really hope we can move forward with the problems at hand (sometimes
> > reads are expensive, sometimes reads are stale), and not block fixing
> > them until we can come up with an alternative to that global lock
> > (unless, of course, there is a simpler way of doing that).
>
> Well, I really have to say that I do not like the notion that reading
> stats is unpredictable. This just makes it really hard to use. If
> the precision is to be sarificed then this should be preferable over
> potentially high global lock contention. We already have that model in
> place of /proc/vmstat (configurable timeout for flusher and a way to
> flush explicitly). I appreciate you would like to have a better
> precision but as you have explored the locking is really hard to get rid
> of here.

Reading the stats *is* unpredictable today. In terms of
accuracy/staleness and cost. Avoiding the flush entirely on the read
path will surely make the cost very stable and cheap, but will make
accuracy even less predictable.

>
> So from my POV I would prefer to avoid flushing from the stats reading
> path and implement force flushing by writing to stat file. If the 2s
> flushing interval is considered to coarse I would be OK to allow setting
> it from userspace. This way this would be more in line with /proc/vmstat
> which seems to be working quite well.
>
> If this is not accaptable or deemed a wrong approach long term then it
> would be good to reonsider the current cgroup_rstat_lock at least.
> Either by turning it into mutex or by dropping the yielding code which
> can severly affect the worst case latency AFAIU.

Honestly I think it's better if we do it the other way around. We make
flushing on the stats reading path non-unified and deterministic. That
model also exists and is used for cpu.stat. If we find a problem with
the locking being held from userspace, we can then remove flushing
from the read path and add interface(s) to configure the periodic
flusher and do a force flush.

I would like to avoid introducing additional interfaces and
configuration knobs unless it's necessary. Also, if we remove the
flush entirely the cost will become really cheap. We will have a hard
time reversing that in the future if we want to change the
implementation.

IOW, moving forward with this change seems much more reversible than
adopting the /proc/vmstat model.

If using a mutex will make things better, we can do that now. It
doesn't introduce performance issues in my testing. My only concern is
someone sleeping or getting preempted while holding the mutex, so I
would prefer disabling preemption while we flush if that doesn't cause
problems.

Thanks!
Shakeel Butt Aug. 28, 2023, 5 p.m. UTC | #16
On Mon, Aug 28, 2023 at 9:15 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
> >
> > Well, I really have to say that I do not like the notion that reading
> > stats is unpredictable. This just makes it really hard to use. If
> > the precision is to be sarificed then this should be preferable over
> > potentially high global lock contention. We already have that model in
> > place of /proc/vmstat (configurable timeout for flusher and a way to
> > flush explicitly). I appreciate you would like to have a better
> > precision but as you have explored the locking is really hard to get rid
> > of here.
>
> Reading the stats *is* unpredictable today. In terms of
> accuracy/staleness and cost. Avoiding the flush entirely on the read
> path will surely make the cost very stable and cheap, but will make
> accuracy even less predictable.
>

On average you would get the stats at most 2 second old, so I would
not say it is less predictable.

> >
> > So from my POV I would prefer to avoid flushing from the stats reading
> > path and implement force flushing by writing to stat file. If the 2s
> > flushing interval is considered to coarse I would be OK to allow setting
> > it from userspace. This way this would be more in line with /proc/vmstat
> > which seems to be working quite well.
> >
> > If this is not accaptable or deemed a wrong approach long term then it
> > would be good to reonsider the current cgroup_rstat_lock at least.
> > Either by turning it into mutex or by dropping the yielding code which
> > can severly affect the worst case latency AFAIU.
>
> Honestly I think it's better if we do it the other way around. We make
> flushing on the stats reading path non-unified and deterministic. That
> model also exists and is used for cpu.stat. If we find a problem with
> the locking being held from userspace, we can then remove flushing
> from the read path and add interface(s) to configure the periodic
> flusher and do a force flush.
>

Here I agree with you. Let's go with the approach which is easy to
undo for now. Though I prefer the new explicit interface for flushing,
that step would be very hard to undo. Let's reevaluate if the proposed
approach shows negative impact on production traffic and I think
Cloudflare folks can give us the results soon.
Yosry Ahmed Aug. 28, 2023, 5:07 p.m. UTC | #17
On Mon, Aug 28, 2023 at 10:00 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Mon, Aug 28, 2023 at 9:15 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> [...]
> > >
> > > Well, I really have to say that I do not like the notion that reading
> > > stats is unpredictable. This just makes it really hard to use. If
> > > the precision is to be sarificed then this should be preferable over
> > > potentially high global lock contention. We already have that model in
> > > place of /proc/vmstat (configurable timeout for flusher and a way to
> > > flush explicitly). I appreciate you would like to have a better
> > > precision but as you have explored the locking is really hard to get rid
> > > of here.
> >
> > Reading the stats *is* unpredictable today. In terms of
> > accuracy/staleness and cost. Avoiding the flush entirely on the read
> > path will surely make the cost very stable and cheap, but will make
> > accuracy even less predictable.
> >
>
> On average you would get the stats at most 2 second old, so I would
> not say it is less predictable.
>
> > >
> > > So from my POV I would prefer to avoid flushing from the stats reading
> > > path and implement force flushing by writing to stat file. If the 2s
> > > flushing interval is considered to coarse I would be OK to allow setting
> > > it from userspace. This way this would be more in line with /proc/vmstat
> > > which seems to be working quite well.
> > >
> > > If this is not accaptable or deemed a wrong approach long term then it
> > > would be good to reonsider the current cgroup_rstat_lock at least.
> > > Either by turning it into mutex or by dropping the yielding code which
> > > can severly affect the worst case latency AFAIU.
> >
> > Honestly I think it's better if we do it the other way around. We make
> > flushing on the stats reading path non-unified and deterministic. That
> > model also exists and is used for cpu.stat. If we find a problem with
> > the locking being held from userspace, we can then remove flushing
> > from the read path and add interface(s) to configure the periodic
> > flusher and do a force flush.
> >
>
> Here I agree with you. Let's go with the approach which is easy to
> undo for now. Though I prefer the new explicit interface for flushing,
> that step would be very hard to undo. Let's reevaluate if the proposed
> approach shows negative impact on production traffic and I think
> Cloudflare folks can give us the results soon.

Do you prefer we also switch to using a mutex (with preemption
disabled) to avoid the scenario Michal described where flushers give
up the lock and sleep resulting in an unbounded wait time in the worst
case?
Waiman Long Aug. 28, 2023, 5:27 p.m. UTC | #18
On 8/28/23 13:07, Yosry Ahmed wrote:
>
>> Here I agree with you. Let's go with the approach which is easy to
>> undo for now. Though I prefer the new explicit interface for flushing,
>> that step would be very hard to undo. Let's reevaluate if the proposed
>> approach shows negative impact on production traffic and I think
>> Cloudflare folks can give us the results soon.
> Do you prefer we also switch to using a mutex (with preemption
> disabled) to avoid the scenario Michal described where flushers give
> up the lock and sleep resulting in an unbounded wait time in the worst
> case?

Locking with mutex with preemption disabled is an oxymoron. Use spinlock 
if you want to have preemption disabled. The purpose of usiing mutex is 
to allow the lock owner to sleep, but you can't sleep with preemption 
disabled. You need to enable preemption first. You can disable 
preemption for a short time in a non-sleeping section of the lock 
critical section, but I would not recommend disabling preemption for the 
whole critical section.

Cheers,
Longman
Yosry Ahmed Aug. 28, 2023, 5:28 p.m. UTC | #19
On Mon, Aug 28, 2023 at 10:27 AM Waiman Long <longman@redhat.com> wrote:
>
>
> On 8/28/23 13:07, Yosry Ahmed wrote:
> >
> >> Here I agree with you. Let's go with the approach which is easy to
> >> undo for now. Though I prefer the new explicit interface for flushing,
> >> that step would be very hard to undo. Let's reevaluate if the proposed
> >> approach shows negative impact on production traffic and I think
> >> Cloudflare folks can give us the results soon.
> > Do you prefer we also switch to using a mutex (with preemption
> > disabled) to avoid the scenario Michal described where flushers give
> > up the lock and sleep resulting in an unbounded wait time in the worst
> > case?
>
> Locking with mutex with preemption disabled is an oxymoron. Use spinlock
> if you want to have preemption disabled. The purpose of usiing mutex is
> to allow the lock owner to sleep, but you can't sleep with preemption
> disabled. You need to enable preemption first. You can disable
> preemption for a short time in a non-sleeping section of the lock
> critical section, but I would not recommend disabling preemption for the
> whole critical section.

I thought using a mutex with preemption disabled would at least allow
waiters to sleep rather than spin, is this not correct (or doesn't
matter) ?

>
> Cheers,
> Longman
>
Waiman Long Aug. 28, 2023, 5:35 p.m. UTC | #20
On 8/28/23 13:28, Yosry Ahmed wrote:
> On Mon, Aug 28, 2023 at 10:27 AM Waiman Long <longman@redhat.com> wrote:
>>
>> On 8/28/23 13:07, Yosry Ahmed wrote:
>>>> Here I agree with you. Let's go with the approach which is easy to
>>>> undo for now. Though I prefer the new explicit interface for flushing,
>>>> that step would be very hard to undo. Let's reevaluate if the proposed
>>>> approach shows negative impact on production traffic and I think
>>>> Cloudflare folks can give us the results soon.
>>> Do you prefer we also switch to using a mutex (with preemption
>>> disabled) to avoid the scenario Michal described where flushers give
>>> up the lock and sleep resulting in an unbounded wait time in the worst
>>> case?
>> Locking with mutex with preemption disabled is an oxymoron. Use spinlock
>> if you want to have preemption disabled. The purpose of usiing mutex is
>> to allow the lock owner to sleep, but you can't sleep with preemption
>> disabled. You need to enable preemption first. You can disable
>> preemption for a short time in a non-sleeping section of the lock
>> critical section, but I would not recommend disabling preemption for the
>> whole critical section.
> I thought using a mutex with preemption disabled would at least allow
> waiters to sleep rather than spin, is this not correct (or doesn't
> matter) ?

Because of optimistic spinning, a mutex lock waiter will only sleep if 
the lock holder sleep or when its time slice run out. So the waiters are 
likely to spin for quite a while before they go to sleep.

Cheers,
Longman
Waiman Long Aug. 28, 2023, 5:43 p.m. UTC | #21
On 8/28/23 13:35, Waiman Long wrote:
> On 8/28/23 13:28, Yosry Ahmed wrote:
>> On Mon, Aug 28, 2023 at 10:27 AM Waiman Long <longman@redhat.com> wrote:
>>>
>>> On 8/28/23 13:07, Yosry Ahmed wrote:
>>>>> Here I agree with you. Let's go with the approach which is easy to
>>>>> undo for now. Though I prefer the new explicit interface for 
>>>>> flushing,
>>>>> that step would be very hard to undo. Let's reevaluate if the 
>>>>> proposed
>>>>> approach shows negative impact on production traffic and I think
>>>>> Cloudflare folks can give us the results soon.
>>>> Do you prefer we also switch to using a mutex (with preemption
>>>> disabled) to avoid the scenario Michal described where flushers give
>>>> up the lock and sleep resulting in an unbounded wait time in the worst
>>>> case?
>>> Locking with mutex with preemption disabled is an oxymoron. Use 
>>> spinlock
>>> if you want to have preemption disabled. The purpose of usiing mutex is
>>> to allow the lock owner to sleep, but you can't sleep with preemption
>>> disabled. You need to enable preemption first. You can disable
>>> preemption for a short time in a non-sleeping section of the lock
>>> critical section, but I would not recommend disabling preemption for 
>>> the
>>> whole critical section.
>> I thought using a mutex with preemption disabled would at least allow
>> waiters to sleep rather than spin, is this not correct (or doesn't
>> matter) ?
>
> Because of optimistic spinning, a mutex lock waiter will only sleep if 
> the lock holder sleep or when its time slice run out. So the waiters 
> are likely to spin for quite a while before they go to sleep.

Perhaps you can add a mutex at the read side so that only 1 reader can 
contend with the global rstat spinlock at any time if this is a concern.

Regards,
Longman
Yosry Ahmed Aug. 28, 2023, 6:35 p.m. UTC | #22
On Mon, Aug 28, 2023 at 10:43 AM Waiman Long <longman@redhat.com> wrote:
>
>
> On 8/28/23 13:35, Waiman Long wrote:
> > On 8/28/23 13:28, Yosry Ahmed wrote:
> >> On Mon, Aug 28, 2023 at 10:27 AM Waiman Long <longman@redhat.com> wrote:
> >>>
> >>> On 8/28/23 13:07, Yosry Ahmed wrote:
> >>>>> Here I agree with you. Let's go with the approach which is easy to
> >>>>> undo for now. Though I prefer the new explicit interface for
> >>>>> flushing,
> >>>>> that step would be very hard to undo. Let's reevaluate if the
> >>>>> proposed
> >>>>> approach shows negative impact on production traffic and I think
> >>>>> Cloudflare folks can give us the results soon.
> >>>> Do you prefer we also switch to using a mutex (with preemption
> >>>> disabled) to avoid the scenario Michal described where flushers give
> >>>> up the lock and sleep resulting in an unbounded wait time in the worst
> >>>> case?
> >>> Locking with mutex with preemption disabled is an oxymoron. Use
> >>> spinlock
> >>> if you want to have preemption disabled. The purpose of usiing mutex is
> >>> to allow the lock owner to sleep, but you can't sleep with preemption
> >>> disabled. You need to enable preemption first. You can disable
> >>> preemption for a short time in a non-sleeping section of the lock
> >>> critical section, but I would not recommend disabling preemption for
> >>> the
> >>> whole critical section.
> >> I thought using a mutex with preemption disabled would at least allow
> >> waiters to sleep rather than spin, is this not correct (or doesn't
> >> matter) ?
> >
> > Because of optimistic spinning, a mutex lock waiter will only sleep if
> > the lock holder sleep or when its time slice run out. So the waiters
> > are likely to spin for quite a while before they go to sleep.

I see. Thanks for the explanation.

>
> Perhaps you can add a mutex at the read side so that only 1 reader can
> contend with the global rstat spinlock at any time if this is a concern.

I guess we can keep it simple for now and add that later if needed.
For unified flushers we already can only have one. If we see a problem
from the stat reading side we can add a mutex there.

Thanks!
Michal Hocko Aug. 29, 2023, 7:27 a.m. UTC | #23
On Mon 28-08-23 13:27:23, Waiman Long wrote:
> 
> On 8/28/23 13:07, Yosry Ahmed wrote:
> > 
> > > Here I agree with you. Let's go with the approach which is easy to
> > > undo for now. Though I prefer the new explicit interface for flushing,
> > > that step would be very hard to undo. Let's reevaluate if the proposed
> > > approach shows negative impact on production traffic and I think
> > > Cloudflare folks can give us the results soon.
> > Do you prefer we also switch to using a mutex (with preemption
> > disabled) to avoid the scenario Michal described where flushers give
> > up the lock and sleep resulting in an unbounded wait time in the worst
> > case?
> 
> Locking with mutex with preemption disabled is an oxymoron.

I believe Yosry wanted to disable preemption _after_ the lock is taken
to reduce the time spent while it is held. The idea to use the mutex is
to reduce spinning and more importantly to get rid of lock dropping
part. It is not really clear (but unlikely) we can drop it while
preserving the spinlock as the thing scales with O(#cgroups x #cpus)
in the worst case.
Waiman Long Aug. 29, 2023, 3:05 p.m. UTC | #24
On 8/29/23 03:27, Michal Hocko wrote:
> On Mon 28-08-23 13:27:23, Waiman Long wrote:
>> On 8/28/23 13:07, Yosry Ahmed wrote:
>>>> Here I agree with you. Let's go with the approach which is easy to
>>>> undo for now. Though I prefer the new explicit interface for flushing,
>>>> that step would be very hard to undo. Let's reevaluate if the proposed
>>>> approach shows negative impact on production traffic and I think
>>>> Cloudflare folks can give us the results soon.
>>> Do you prefer we also switch to using a mutex (with preemption
>>> disabled) to avoid the scenario Michal described where flushers give
>>> up the lock and sleep resulting in an unbounded wait time in the worst
>>> case?
>> Locking with mutex with preemption disabled is an oxymoron.
> I believe Yosry wanted to disable preemption _after_ the lock is taken
> to reduce the time spent while it is held. The idea to use the mutex is
> to reduce spinning and more importantly to get rid of lock dropping
> part. It is not really clear (but unlikely) we can drop it while
> preserving the spinlock as the thing scales with O(#cgroups x #cpus)
> in the worst case.

As I have said later in my email, I am not against disabling preemption 
selectively on some parts of the lock critical section where preemption 
is undesirable. However, I am against disabling preemption for the whole 
duration of the code where the mutex lock is held as it defeats the 
purpose of using mutex in the first place.

Cheers,
Longman
Michal Hocko Aug. 29, 2023, 3:17 p.m. UTC | #25
On Tue 29-08-23 11:05:28, Waiman Long wrote:
> On 8/29/23 03:27, Michal Hocko wrote:
> > On Mon 28-08-23 13:27:23, Waiman Long wrote:
> > > On 8/28/23 13:07, Yosry Ahmed wrote:
> > > > > Here I agree with you. Let's go with the approach which is easy to
> > > > > undo for now. Though I prefer the new explicit interface for flushing,
> > > > > that step would be very hard to undo. Let's reevaluate if the proposed
> > > > > approach shows negative impact on production traffic and I think
> > > > > Cloudflare folks can give us the results soon.
> > > > Do you prefer we also switch to using a mutex (with preemption
> > > > disabled) to avoid the scenario Michal described where flushers give
> > > > up the lock and sleep resulting in an unbounded wait time in the worst
> > > > case?
> > > Locking with mutex with preemption disabled is an oxymoron.
> > I believe Yosry wanted to disable preemption _after_ the lock is taken
> > to reduce the time spent while it is held. The idea to use the mutex is
> > to reduce spinning and more importantly to get rid of lock dropping
> > part. It is not really clear (but unlikely) we can drop it while
> > preserving the spinlock as the thing scales with O(#cgroups x #cpus)
> > in the worst case.
> 
> As I have said later in my email, I am not against disabling preemption
> selectively on some parts of the lock critical section where preemption is
> undesirable. However, I am against disabling preemption for the whole
> duration of the code where the mutex lock is held as it defeats the purpose
> of using mutex in the first place.

I certainly agree this is an antipattern.
Yosry Ahmed Aug. 29, 2023, 4:04 p.m. UTC | #26
On Tue, Aug 29, 2023 at 8:17 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 29-08-23 11:05:28, Waiman Long wrote:
> > On 8/29/23 03:27, Michal Hocko wrote:
> > > On Mon 28-08-23 13:27:23, Waiman Long wrote:
> > > > On 8/28/23 13:07, Yosry Ahmed wrote:
> > > > > > Here I agree with you. Let's go with the approach which is easy to
> > > > > > undo for now. Though I prefer the new explicit interface for flushing,
> > > > > > that step would be very hard to undo. Let's reevaluate if the proposed
> > > > > > approach shows negative impact on production traffic and I think
> > > > > > Cloudflare folks can give us the results soon.
> > > > > Do you prefer we also switch to using a mutex (with preemption
> > > > > disabled) to avoid the scenario Michal described where flushers give
> > > > > up the lock and sleep resulting in an unbounded wait time in the worst
> > > > > case?
> > > > Locking with mutex with preemption disabled is an oxymoron.
> > > I believe Yosry wanted to disable preemption _after_ the lock is taken
> > > to reduce the time spent while it is held. The idea to use the mutex is
> > > to reduce spinning and more importantly to get rid of lock dropping
> > > part. It is not really clear (but unlikely) we can drop it while
> > > preserving the spinlock as the thing scales with O(#cgroups x #cpus)
> > > in the worst case.
> >
> > As I have said later in my email, I am not against disabling preemption
> > selectively on some parts of the lock critical section where preemption is
> > undesirable. However, I am against disabling preemption for the whole
> > duration of the code where the mutex lock is held as it defeats the purpose
> > of using mutex in the first place.
>
> I certainly agree this is an antipattern.

So I guess the verdict is to avoid using a mutex here for now. I sent
a v2 which includes an additional small patch suggested by Michal
Koutny and an updated changelog for this patch to document this
discussion and possible alternatives we can do if things go wrong with
this approach:

https://lore.kernel.org/lkml/20230828233319.340712-1-yosryahmed@google.com/
Tejun Heo Aug. 29, 2023, 6:44 p.m. UTC | #27
Hello,

On Fri, Aug 25, 2023 at 09:05:46AM +0200, Michal Hocko wrote:
> > > I think that's how it was always meant to be when it was designed. The
> > > global rstat lock has always existed and was always available to
> > > userspace readers. The memory controller took a different path at some
> > > point with unified flushing, but that was mainly because of high
> > > concurrency from in-kernel flushers, not because userspace readers
> > > caused a problem. Outside of memcg, the core cgroup code has always
> > > exercised this global lock when reading cpu.stat since rstat's
> > > introduction. I assume there hasn't been any problems since it's still
> > > there.
> 
> I suspect nobody has just considered a malfunctioning or adversary
> workloads so far.
> 
> > > I was hoping Tejun would confirm/deny this.
> 
> Yes, that would be interesting to hear.

So, the assumptions in the original design were:

* Writers are high freq but readers are lower freq and can block.

* The global lock is mutex.

* Back-to-back reads won't have too much to do because it only has to flush
  what's been accumulated since the last flush which took place just before.

It's likely that the userspace side is gonna be just fine if we restore the
global lock to be a mutex and let them be. Most of the problems are caused
by trying to allow flushing from non-sleepable and kernel contexts. Would it
make sense to distinguish what can and can't wait and make the latter group
always use cached value? e.g. even in kernel, during oom kill, waiting
doesn't really matter and it can just wait to obtain the up-to-date numbers.

Thanks.
Yosry Ahmed Aug. 29, 2023, 7:13 p.m. UTC | #28
On Tue, Aug 29, 2023 at 11:44 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Aug 25, 2023 at 09:05:46AM +0200, Michal Hocko wrote:
> > > > I think that's how it was always meant to be when it was designed. The
> > > > global rstat lock has always existed and was always available to
> > > > userspace readers. The memory controller took a different path at some
> > > > point with unified flushing, but that was mainly because of high
> > > > concurrency from in-kernel flushers, not because userspace readers
> > > > caused a problem. Outside of memcg, the core cgroup code has always
> > > > exercised this global lock when reading cpu.stat since rstat's
> > > > introduction. I assume there hasn't been any problems since it's still
> > > > there.
> >
> > I suspect nobody has just considered a malfunctioning or adversary
> > workloads so far.
> >
> > > > I was hoping Tejun would confirm/deny this.
> >
> > Yes, that would be interesting to hear.
>
> So, the assumptions in the original design were:
>
> * Writers are high freq but readers are lower freq and can block.
>
> * The global lock is mutex.
>
> * Back-to-back reads won't have too much to do because it only has to flush
>   what's been accumulated since the last flush which took place just before.
>
> It's likely that the userspace side is gonna be just fine if we restore the
> global lock to be a mutex and let them be. Most of the problems are caused
> by trying to allow flushing from non-sleepable and kernel contexts.

So basically restore the flush without disabling preemption, and if a
userspace reader gets preempted while holding the mutex it's probably
okay because we won't have high concurrency among userspace readers?

I think Shakeel was worried that this may cause a priority inversion
where a low priority task is preempted while holding the mutex, and
prevents high priority tasks from acquiring it to read the stats and
take actions (e.g. userspace OOMs).

> Would it
> make sense to distinguish what can and can't wait and make the latter group
> always use cached value? e.g. even in kernel, during oom kill, waiting
> doesn't really matter and it can just wait to obtain the up-to-date numbers.

The problem with waiting for in-kernel flushers is that high
concurrency leads to terrible serialization. Running a stress test
with 100s of reclaimers where everyone waits shows ~ 3x slowdowns.

This patch series is indeed trying to formalize a distinction between
waiters who can wait and those who can't on the memcg side:

- Unified flushers always flush the entire tree and only flush if no
one else is flushing (no waiting), otherwise they use cached data and
hope the concurrent flushing helps. This is what we currently do for
most memcg contexts. This patch series opts userspace reads out of it.

- Non-unified flushers only flush the subtree they care about and they
wait if there are other flushers. This is what we currently do for
some zswap accounting code. This patch series opts userspace readers
into this.

The problem Michal is raising is that dropping the lock can lead to an
unbounded number of waiters and longer worst case latency. Especially
that this is directly influenced by userspace. Reintroducing the mutex
and removing the lock dropping code fixes that problem, but then if
the mutex holder gets preempted, we face another problem.

Personally I think there is a good chance there won't be userspace
latency problems due to the lock as usually there isn't high
concurrency among userspace readers, and we can deal with that problem
if/when it happens. So far no problem is happening for cpu.stat which
has the same potential problem.
Tejun Heo Aug. 29, 2023, 7:36 p.m. UTC | #29
Hello,

On Tue, Aug 29, 2023 at 12:13:31PM -0700, Yosry Ahmed wrote:
...
> > So, the assumptions in the original design were:
> >
> > * Writers are high freq but readers are lower freq and can block.
> >
> > * The global lock is mutex.
> >
> > * Back-to-back reads won't have too much to do because it only has to flush
> >   what's been accumulated since the last flush which took place just before.
> >
> > It's likely that the userspace side is gonna be just fine if we restore the
> > global lock to be a mutex and let them be. Most of the problems are caused
> > by trying to allow flushing from non-sleepable and kernel contexts.
> 
> So basically restore the flush without disabling preemption, and if a
> userspace reader gets preempted while holding the mutex it's probably
> okay because we won't have high concurrency among userspace readers?
> 
> I think Shakeel was worried that this may cause a priority inversion
> where a low priority task is preempted while holding the mutex, and
> prevents high priority tasks from acquiring it to read the stats and
> take actions (e.g. userspace OOMs).

We'll have to see but I'm not sure this is going to be a huge problem. The
most common way that priority inversions are resolved is through work
conservation - ie. the system just runs out of other things to do, so
whatever is blocking the system gets to run and unblocks it. It only really
becomes a problem when work conservation breaks down on CPU side which
happens if the one holding the resource is 1. blocked on IOs (usually
through memory allocation but can be anything) 2. throttled by cpu.max.

#1 is not a factor here. #2 is but that is a factor for everything in the
kernel and should really be solved from cpu.max side. So, I think in
practice, this should be fine, or at least not worse than anything else.

> > Would it
> > make sense to distinguish what can and can't wait and make the latter group
> > always use cached value? e.g. even in kernel, during oom kill, waiting
> > doesn't really matter and it can just wait to obtain the up-to-date numbers.
> 
> The problem with waiting for in-kernel flushers is that high
> concurrency leads to terrible serialization. Running a stress test
> with 100s of reclaimers where everyone waits shows ~ 3x slowdowns.
> 
> This patch series is indeed trying to formalize a distinction between
> waiters who can wait and those who can't on the memcg side:
> 
> - Unified flushers always flush the entire tree and only flush if no
> one else is flushing (no waiting), otherwise they use cached data and
> hope the concurrent flushing helps. This is what we currently do for
> most memcg contexts. This patch series opts userspace reads out of it.
> 
> - Non-unified flushers only flush the subtree they care about and they
> wait if there are other flushers. This is what we currently do for
> some zswap accounting code. This patch series opts userspace readers
> into this.
> 
> The problem Michal is raising is that dropping the lock can lead to an
> unbounded number of waiters and longer worst case latency. Especially
> that this is directly influenced by userspace. Reintroducing the mutex
> and removing the lock dropping code fixes that problem, but then if
> the mutex holder gets preempted, we face another problem.
> 
> Personally I think there is a good chance there won't be userspace
> latency problems due to the lock as usually there isn't high
> concurrency among userspace readers, and we can deal with that problem
> if/when it happens. So far no problem is happening for cpu.stat which
> has the same potential problem.

Maybe leave the global lock as-is and gate the userland flushers with a
mutex so that there's only ever one contenting on the rstat lock from
userland side?

Thanks.
Yosry Ahmed Aug. 29, 2023, 7:54 p.m. UTC | #30
On Tue, Aug 29, 2023 at 12:36 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Aug 29, 2023 at 12:13:31PM -0700, Yosry Ahmed wrote:
> ...
> > > So, the assumptions in the original design were:
> > >
> > > * Writers are high freq but readers are lower freq and can block.
> > >
> > > * The global lock is mutex.
> > >
> > > * Back-to-back reads won't have too much to do because it only has to flush
> > >   what's been accumulated since the last flush which took place just before.
> > >
> > > It's likely that the userspace side is gonna be just fine if we restore the
> > > global lock to be a mutex and let them be. Most of the problems are caused
> > > by trying to allow flushing from non-sleepable and kernel contexts.
> >
> > So basically restore the flush without disabling preemption, and if a
> > userspace reader gets preempted while holding the mutex it's probably
> > okay because we won't have high concurrency among userspace readers?
> >
> > I think Shakeel was worried that this may cause a priority inversion
> > where a low priority task is preempted while holding the mutex, and
> > prevents high priority tasks from acquiring it to read the stats and
> > take actions (e.g. userspace OOMs).
>
> We'll have to see but I'm not sure this is going to be a huge problem. The
> most common way that priority inversions are resolved is through work
> conservation - ie. the system just runs out of other things to do, so
> whatever is blocking the system gets to run and unblocks it. It only really
> becomes a problem when work conservation breaks down on CPU side which
> happens if the one holding the resource is 1. blocked on IOs (usually
> through memory allocation but can be anything) 2. throttled by cpu.max.
>
> #1 is not a factor here. #2 is but that is a factor for everything in the
> kernel and should really be solved from cpu.max side. So, I think in
> practice, this should be fine, or at least not worse than anything else.

If that's not a concern I can just append a patch that changes the
spinlock to a mutex to this series. Shakeel, wdyt?

>
> > > Would it
> > > make sense to distinguish what can and can't wait and make the latter group
> > > always use cached value? e.g. even in kernel, during oom kill, waiting
> > > doesn't really matter and it can just wait to obtain the up-to-date numbers.
> >
> > The problem with waiting for in-kernel flushers is that high
> > concurrency leads to terrible serialization. Running a stress test
> > with 100s of reclaimers where everyone waits shows ~ 3x slowdowns.
> >
> > This patch series is indeed trying to formalize a distinction between
> > waiters who can wait and those who can't on the memcg side:
> >
> > - Unified flushers always flush the entire tree and only flush if no
> > one else is flushing (no waiting), otherwise they use cached data and
> > hope the concurrent flushing helps. This is what we currently do for
> > most memcg contexts. This patch series opts userspace reads out of it.
> >
> > - Non-unified flushers only flush the subtree they care about and they
> > wait if there are other flushers. This is what we currently do for
> > some zswap accounting code. This patch series opts userspace readers
> > into this.
> >
> > The problem Michal is raising is that dropping the lock can lead to an
> > unbounded number of waiters and longer worst case latency. Especially
> > that this is directly influenced by userspace. Reintroducing the mutex
> > and removing the lock dropping code fixes that problem, but then if
> > the mutex holder gets preempted, we face another problem.
> >
> > Personally I think there is a good chance there won't be userspace
> > latency problems due to the lock as usually there isn't high
> > concurrency among userspace readers, and we can deal with that problem
> > if/when it happens. So far no problem is happening for cpu.stat which
> > has the same potential problem.
>
> Maybe leave the global lock as-is and gate the userland flushers with a
> mutex so that there's only ever one contenting on the rstat lock from
> userland side?

Waiman suggested this as well. We can do that for sure, although I
think we should wait until we are sure it's needed.

One question. If whoever is holding that mutex is either flushing with
the spinlock held or spinning (i.e not sleepable or preemptable),
wouldn't this be equivalent to just changing the spinlock with a mutex
and disable preemption while holding it?
Tejun Heo Aug. 29, 2023, 8:12 p.m. UTC | #31
Hello,

On Tue, Aug 29, 2023 at 12:54:06PM -0700, Yosry Ahmed wrote:
...
> > Maybe leave the global lock as-is and gate the userland flushers with a
> > mutex so that there's only ever one contenting on the rstat lock from
> > userland side?
> 
> Waiman suggested this as well. We can do that for sure, although I
> think we should wait until we are sure it's needed.
> 
> One question. If whoever is holding that mutex is either flushing with
> the spinlock held or spinning (i.e not sleepable or preemptable),
> wouldn't this be equivalent to just changing the spinlock with a mutex
> and disable preemption while holding it?

Well, it creates layering so that userspace can't flood the inner lock which
can cause contention issues for kernel side users. Not sleeping while
actively flushing is an side-effect too but the code at least doesn't look
as anti-patterny as disabling preemption right after grabbing a mutex.

I don't have a strong preference. As long as we stay away from introducing a
new user interface construct and can address the noticed scalability issues,
it should be fine. Note that there are other ways to address priority
inversions and contentions too - e.g. we can always bounce flushing to a
[kthread_]kworker and rate limit (or rather latency limit) how often
different classes of users can trigger flushing. I don't think we have to go
there yet but if the simpler meaures don't work out, there are still many
ways to solve the problem within the kernel.

Thanks.
Yosry Ahmed Aug. 29, 2023, 8:20 p.m. UTC | #32
On Tue, Aug 29, 2023 at 1:12 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Aug 29, 2023 at 12:54:06PM -0700, Yosry Ahmed wrote:
> ...
> > > Maybe leave the global lock as-is and gate the userland flushers with a
> > > mutex so that there's only ever one contenting on the rstat lock from
> > > userland side?
> >
> > Waiman suggested this as well. We can do that for sure, although I
> > think we should wait until we are sure it's needed.
> >
> > One question. If whoever is holding that mutex is either flushing with
> > the spinlock held or spinning (i.e not sleepable or preemptable),
> > wouldn't this be equivalent to just changing the spinlock with a mutex
> > and disable preemption while holding it?
>
> Well, it creates layering so that userspace can't flood the inner lock which
> can cause contention issues for kernel side users. Not sleeping while
> actively flushing is an side-effect too but the code at least doesn't look
> as anti-patterny as disabling preemption right after grabbing a mutex.

I see. At most one kernel side flusher will be spinning for the lock
at any given point anyway, but I guess having that one kernel side
flusher competing against one user side flusher is better competing
with N flushers.

I will add a mutex on the userspace read side then and spin a v3.
Hopefully this addresses Michal's concern as well. The lock dropping
logic will still exist for the inner lock, but when one userspace
reader drops the inner lock other readers won't be able to pick it up.

>
> I don't have a strong preference. As long as we stay away from introducing a
> new user interface construct and can address the noticed scalability issues,
> it should be fine. Note that there are other ways to address priority
> inversions and contentions too - e.g. we can always bounce flushing to a
> [kthread_]kworker and rate limit (or rather latency limit) how often
> different classes of users can trigger flushing. I don't think we have to go
> there yet but if the simpler meaures don't work out, there are still many
> ways to solve the problem within the kernel.

I whole-heartedly agree with the preference to fix the problem within
the kernel with minimal/none user space involvement.

Thanks!
Michal Hocko Aug. 31, 2023, 9:05 a.m. UTC | #33
On Tue 29-08-23 13:20:34, Yosry Ahmed wrote:
[...]
> I will add a mutex on the userspace read side then and spin a v3.
> Hopefully this addresses Michal's concern as well. The lock dropping
> logic will still exist for the inner lock, but when one userspace
> reader drops the inner lock other readers won't be able to pick it up.

Yes, that would minimize the risk of the worst case pathological
behavior.

> > I don't have a strong preference. As long as we stay away from introducing a
> > new user interface construct and can address the noticed scalability issues,
> > it should be fine. Note that there are other ways to address priority
> > inversions and contentions too - e.g. we can always bounce flushing to a
> > [kthread_]kworker and rate limit (or rather latency limit) how often
> > different classes of users can trigger flushing. I don't think we have to go
> > there yet but if the simpler meaures don't work out, there are still many
> > ways to solve the problem within the kernel.
> 
> I whole-heartedly agree with the preference to fix the problem within
> the kernel with minimal/none user space involvement.

Let's see. While I would love to see a solution that works for everybody
without explicit interface we have hit problems with locks involved in
stat files in the past.
diff mbox series

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 90f08b35fa77..d3b13a06224c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1606,7 +1606,7 @@  static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4048,7 +4048,7 @@  static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4123,7 +4123,7 @@  static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -4625,7 +4625,7 @@  void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -6640,7 +6640,7 @@  static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;