
mm: memcg: provide accurate stats for userspace reads

Message ID 20230809045810.1659356-1-yosryahmed@google.com (mailing list archive)
State New
Series mm: memcg: provide accurate stats for userspace reads

Commit Message

Yosry Ahmed Aug. 9, 2023, 4:58 a.m. UTC
Over time, the memcg code added multiple optimizations to the stats
flushing path that introduce a tradeoff between accuracy and
performance. In some contexts (e.g. dirty throttling, refaults, etc), a
full rstat flush of the stats in the tree can be too expensive. Such
optimizations include [1]:
(a) Introducing a periodic background flusher to keep the size of the
update tree from growing unbounded.
(b) Allowing only one thread to flush at a time, and other concurrent
flushers just skip the flush. This avoids a thundering herd problem
when multiple reclaim/refault threads attempt to flush the stats at
once.
(c) Only executing a flush if the magnitude of the stats updates exceeds
a certain threshold.

These optimizations were necessary to make flushing feasible in
performance-critical paths, and they come at the cost of some accuracy
that we choose to live without. On the other hand, for flushes invoked
when userspace is reading the stats, the tradeoff is less appealing.
This code path is not performance-critical, and the inaccuracies can
affect userspace behavior. For example, skipping flushing when there is
another ongoing flush is essentially a coin flip. We don't know if the
ongoing flush is done with the subtree of interest or not.

If userspace asks for stats, let's give it accurate stats. Without this
patch, we see regressions in userspace workloads due to stats inaccuracy
in some cases.

Rework the do_flush_stats() helper to accept a "full" boolean argument.
For a "full" flush, if there is an ongoing flush, do not skip; instead,
wait for the flush to complete. Introduce a new
mem_cgroup_flush_stats_full() interface that uses this full flush, and
also does not check whether the magnitude of the updates exceeds the
threshold. Use mem_cgroup_flush_stats_full() in code paths where stats
are flushed due to a userspace read. This essentially undoes
optimizations (b) and (c) above for flushes triggered by userspace reads.

[1] https://lore.kernel.org/lkml/20210716212137.1391164-2-shakeelb@google.com/

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---

I want to argue that this is what we should be doing for all flushing
contexts, not just userspace reads (i.e. all flushes should be "full").
Skipping if a flush is ongoing is too brittle; there is a significant
chance that the stats of the cgroup we care about are not fully flushed.
Waiting for an ongoing flush to finish ensures correctness while still
avoiding the thundering herd problem on the rstat flush lock.

Having said that, there is a higher chance of regression if we add the
wait in more critical paths (e.g. reclaim, refaults), so I opted to do
this for userspace reads for now. We have complaints about inaccuracy in
userspace reads, but no complaints about inaccuracy in other paths so
far (although it would be really difficult to tie a reclaim/refault
problem to a partial stats flush anyway).

---
 mm/memcontrol.c | 42 +++++++++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 15 deletions(-)
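
For orientation, a minimal sketch of the two entry points described in
the commit message (the hunk for these helpers is not quoted in the
thread below; the threshold check follows the existing
stats_flush_threshold pattern in mm/memcontrol.c, so treat this as an
illustration rather than the applied patch):

void mem_cgroup_flush_stats(void)
{
	/* (c) only flush if enough stat updates have accumulated */
	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
		do_flush_stats(false);	/* (b) may skip if a flush is ongoing */
}

void mem_cgroup_flush_stats_full(void)
{
	/* userspace reads: no threshold check, wait for any ongoing flush */
	do_flush_stats(true);
}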

Comments

Michal Hocko Aug. 9, 2023, 8:51 a.m. UTC | #1
On Wed 09-08-23 04:58:10, Yosry Ahmed wrote:
> Over time, the memcg code added multiple optimizations to the stats
> flushing path that introduce a tradeoff between accuracy and
> performance. In some contexts (e.g. dirty throttling, refaults, etc), a
> full rstat flush of the stats in the tree can be too expensive. Such
> optimizations include [1]:
> (a) Introducing a periodic background flusher to keep the size of the
> update tree from growing unbounded.
> (b) Allowing only one thread to flush at a time, and other concurrent
> flushers just skip the flush. This avoids a thundering herd problem
> when multiple reclaim/refault threads attempt to flush the stats at
> once.
> (c) Only executing a flush if the magnitude of the stats updates exceeds
> a certain threshold.
> 
> These optimizations were necessary to make flushing feasible in
> performance-critical paths, and they come at the cost of some accuracy
> that we choose to live without. On the other hand, for flushes invoked
> when userspace is reading the stats, the tradeoff is less appealing
> This code path is not performance-critical, and the inaccuracies can
> affect userspace behavior. For example, skipping flushing when there is
> another ongoing flush is essentially a coin flip. We don't know if the
> ongoing flush is done with the subtree of interest or not.

I am not convinced by this much TBH. What kind of precision do you
really need and how far off is what we provide?

A more expensive read of stats from userspace is quite easy to notice
and is usually reported as a regression. So you should have a convincing
argument that the extra time spent is really worth it. AFAIK there are
many monitoring (top-like) tools which simply read those files regularly
just to show numbers, and they certainly do not need a high level of
precision.

[...]
> @@ -639,17 +639,24 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>  	}
>  }
>  
> -static void do_flush_stats(void)
> +static void do_flush_stats(bool full)
>  {
> +	if (!atomic_read(&stats_flush_ongoing) &&
> +	    !atomic_xchg(&stats_flush_ongoing, 1))
> +		goto flush;
> +
>  	/*
> -	 * We always flush the entire tree, so concurrent flushers can just
> -	 * skip. This avoids a thundering herd problem on the rstat global lock
> -	 * from memcg flushers (e.g. reclaim, refault, etc).
> +	 * We always flush the entire tree, so concurrent flushers can choose to
> +	 * skip if accuracy is not critical. Otherwise, wait for the ongoing
> +	 * flush to complete. This avoids a thundering herd problem on the rstat
> +	 * global lock from memcg flushers (e.g. reclaim, refault, etc).
>  	 */
> -	if (atomic_read(&stats_flush_ongoing) ||
> -	    atomic_xchg(&stats_flush_ongoing, 1))
> -		return;
> -
> +	while (full && atomic_read(&stats_flush_ongoing) == 1) {
> +		if (!cond_resched())
> +			cpu_relax();

You are reinventing a mutex with a spinning waiter. Why don't you simply
make stats_flush_ongoing a real mutex and use try_lock for the !full
flush and a normal lock otherwise?

> +	}
> +	return;
> +flush:
>  	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
>  
>  	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
[...]
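
A minimal sketch of the mutex-based scheme suggested above (an
illustration only, not code posted in this thread; it reuses the names
from the quoted hunk and replaces the stats_flush_ongoing atomic with a
mutex):

static DEFINE_MUTEX(stats_flush_mutex);

static void do_flush_stats(bool full)
{
	/*
	 * Non-critical flushers (reclaim, refault, etc.) skip if someone else
	 * is already flushing; "full" flushers (userspace reads) block until
	 * they can do the flush themselves, so they never return before the
	 * stats they care about have been refreshed.
	 */
	if (full)
		mutex_lock(&stats_flush_mutex);
	else if (!mutex_trylock(&stats_flush_mutex))
		return;

	WRITE_ONCE(flush_next_time, jiffies_64 + 2 * FLUSH_TIME);
	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);

	mutex_unlock(&stats_flush_mutex);
}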
Yosry Ahmed Aug. 9, 2023, 12:31 p.m. UTC | #2
On Wed, Aug 9, 2023 at 1:51 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 09-08-23 04:58:10, Yosry Ahmed wrote:
> > Over time, the memcg code added multiple optimizations to the stats
> > flushing path that introduce a tradeoff between accuracy and
> > performance. In some contexts (e.g. dirty throttling, refaults, etc), a
> > full rstat flush of the stats in the tree can be too expensive. Such
> > optimizations include [1]:
> > (a) Introducing a periodic background flusher to keep the size of the
> > update tree from growing unbounded.
> > (b) Allowing only one thread to flush at a time, and other concurrent
> > flushers just skip the flush. This avoids a thundering herd problem
> > when multiple reclaim/refault threads attempt to flush the stats at
> > once.
> > (c) Only executing a flush if the magnitude of the stats updates exceeds
> > a certain threshold.
> >
> > These optimizations were necessary to make flushing feasible in
> > performance-critical paths, and they come at the cost of some accuracy
> > that we choose to live without. On the other hand, for flushes invoked
> > when userspace is reading the stats, the tradeoff is less appealing
> > This code path is not performance-critical, and the inaccuracies can
> > affect userspace behavior. For example, skipping flushing when there is
> > another ongoing flush is essentially a coin flip. We don't know if the
> > ongoing flush is done with the subtree of interest or not.
>
> I am not convinced by this much TBH. What kind of precision do you
> really need and how much off is what we provide?
>
> More expensive read of stats from userspace is quite easy to notice
> and usually reported as a regression. So you should have a convincing
> argument that an extra time spent is really worth it. AFAIK there are
> many monitoring (top like) tools which simply read those files regularly
> just to show numbers and they certainly do not need a high level of
> precision.

We used to spend this time before commit fd25a9e0e23b ("memcg: unify
memcg stat flushing") which generalized the "skip if ongoing flush"
for all stat flushing. As far as I know, the problem was contention on
the flushing lock which also affected critical paths like refault.

The problem is that the current behavior is nondeterministic: if cpu A
tries to flush stats and cpu B is already doing that, cpu A will just
skip. At that point, the cgroup(s) that cpu A cares about may have
been fully flushed, partially flushed (in terms of cpus), or not
flushed at all. We have no idea. We just know that someone else is
flushing something. IOW, in some cases the flush request will be
completely ignored and userspace will read stale stats (up to 2s + the
periodic flusher runtime).

Some workloads need to read up-to-date stats as feedback to actions
(e.g. after proactive reclaim, or for userspace OOM killing purposes),
and reading such stale stats causes regressions or misbehavior by
userspace.

>
> [...]
> > @@ -639,17 +639,24 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >       }
> >  }
> >
> > -static void do_flush_stats(void)
> > +static void do_flush_stats(bool full)
> >  {
> > +     if (!atomic_read(&stats_flush_ongoing) &&
> > +         !atomic_xchg(&stats_flush_ongoing, 1))
> > +             goto flush;
> > +
> >       /*
> > -      * We always flush the entire tree, so concurrent flushers can just
> > -      * skip. This avoids a thundering herd problem on the rstat global lock
> > -      * from memcg flushers (e.g. reclaim, refault, etc).
> > +      * We always flush the entire tree, so concurrent flushers can choose to
> > +      * skip if accuracy is not critical. Otherwise, wait for the ongoing
> > +      * flush to complete. This avoids a thundering herd problem on the rstat
> > +      * global lock from memcg flushers (e.g. reclaim, refault, etc).
> >        */
> > -     if (atomic_read(&stats_flush_ongoing) ||
> > -         atomic_xchg(&stats_flush_ongoing, 1))
> > -             return;
> > -
> > +     while (full && atomic_read(&stats_flush_ongoing) == 1) {
> > +             if (!cond_resched())
> > +                     cpu_relax();
>
> You are reinveting a mutex with spinning waiter. Why don't you simply
> make stats_flush_ongoing a real mutex and make use try_lock for !full
> flush and normal lock otherwise?

So that was actually a spinlock at one point, when we used to skip if
try_lock failed. We opted for an atomic because the lock was only used
in a try_lock fashion. The problem here is that the atomic is used to
ensure that only one thread actually attempts to flush at a time (and
others skip/wait), to avoid a thundering herd problem on
cgroup_rstat_lock.

Here, what I am trying to do is essentially equivalent to "wait until
the lock is available but don't grab it". If we make
stats_flush_ongoing a mutex, I am afraid the thundering herd problem
will be reintroduced for stats_flush_ongoing this time.

I am not sure if there's a cleaner way of doing this, but I am
certainly open for suggestions. I also don't like how the spinning
loop looks as of now.
Michal Hocko Aug. 9, 2023, 12:58 p.m. UTC | #3
On Wed 09-08-23 05:31:04, Yosry Ahmed wrote:
> On Wed, Aug 9, 2023 at 1:51 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 09-08-23 04:58:10, Yosry Ahmed wrote:
> > > Over time, the memcg code added multiple optimizations to the stats
> > > flushing path that introduce a tradeoff between accuracy and
> > > performance. In some contexts (e.g. dirty throttling, refaults, etc), a
> > > full rstat flush of the stats in the tree can be too expensive. Such
> > > optimizations include [1]:
> > > (a) Introducing a periodic background flusher to keep the size of the
> > > update tree from growing unbounded.
> > > (b) Allowing only one thread to flush at a time, and other concurrent
> > > flushers just skip the flush. This avoids a thundering herd problem
> > > when multiple reclaim/refault threads attempt to flush the stats at
> > > once.
> > > (c) Only executing a flush if the magnitude of the stats updates exceeds
> > > a certain threshold.
> > >
> > > These optimizations were necessary to make flushing feasible in
> > > performance-critical paths, and they come at the cost of some accuracy
> > > that we choose to live without. On the other hand, for flushes invoked
> > > when userspace is reading the stats, the tradeoff is less appealing
> > > This code path is not performance-critical, and the inaccuracies can
> > > affect userspace behavior. For example, skipping flushing when there is
> > > another ongoing flush is essentially a coin flip. We don't know if the
> > > ongoing flush is done with the subtree of interest or not.
> >
> > I am not convinced by this much TBH. What kind of precision do you
> > really need and how much off is what we provide?
> >
> > More expensive read of stats from userspace is quite easy to notice
> > and usually reported as a regression. So you should have a convincing
> > argument that an extra time spent is really worth it. AFAIK there are
> > many monitoring (top like) tools which simply read those files regularly
> > just to show numbers and they certainly do not need a high level of
> > precision.
> 
> We used to spend this time before commit fd25a9e0e23b ("memcg: unify
> memcg stat flushing") which generalized the "skip if ongoing flush"
> for all stat flushing. As far I know, the problem was contention on
> the flushing lock which also affected critical paths like refault.
> 
> The problem is that the current behavior is indeterministic, if cpu A
> tries to flush stats and cpu B is already doing that, cpu A will just
> skip. At that point, the cgroup(s) that cpu A cares about may have
> been fully flushed, partially flushed (in terms of cpus), or not
> flushed at all. We have no idea. We just know that someone else is
> flushing something. IOW, in some cases the flush request will be
> completely ignored and userspace will read stale stats (up to 2s + the
> periodic flusher runtime).

Yes, that is certainly true, but why does that matter? Stats are always a
snapshot of the past. Do we get an inconsistent image that would be
actively harmful?

> Some workloads need to read up-to-date stats as feedback to actions
> (e.g. after proactive reclaim, or for userspace OOM killing purposes),
> and reading such stale stats causes regressions or misbehavior by
> userspace.

Please tell us more about those, and why all others that do not
require such precision should pay that price as well.

> > [...]
> > > @@ -639,17 +639,24 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> > >       }
> > >  }
> > >
> > > -static void do_flush_stats(void)
> > > +static void do_flush_stats(bool full)
> > >  {
> > > +     if (!atomic_read(&stats_flush_ongoing) &&
> > > +         !atomic_xchg(&stats_flush_ongoing, 1))
> > > +             goto flush;
> > > +
> > >       /*
> > > -      * We always flush the entire tree, so concurrent flushers can just
> > > -      * skip. This avoids a thundering herd problem on the rstat global lock
> > > -      * from memcg flushers (e.g. reclaim, refault, etc).
> > > +      * We always flush the entire tree, so concurrent flushers can choose to
> > > +      * skip if accuracy is not critical. Otherwise, wait for the ongoing
> > > +      * flush to complete. This avoids a thundering herd problem on the rstat
> > > +      * global lock from memcg flushers (e.g. reclaim, refault, etc).
> > >        */
> > > -     if (atomic_read(&stats_flush_ongoing) ||
> > > -         atomic_xchg(&stats_flush_ongoing, 1))
> > > -             return;
> > > -
> > > +     while (full && atomic_read(&stats_flush_ongoing) == 1) {
> > > +             if (!cond_resched())
> > > +                     cpu_relax();
> >
> > You are reinveting a mutex with spinning waiter. Why don't you simply
> > make stats_flush_ongoing a real mutex and make use try_lock for !full
> > flush and normal lock otherwise?
> 
> So that was actually a spinlock at one point, when we used to skip if
> try_lock failed.

AFAICS cgroup_rstat_flush is allowed to sleep so spinlocks are not
really possible.

> We opted for an atomic because the lock was only used
> in a try_lock fashion. The problem here is that the atomic is used to
> ensure that only one thread actually attempts to flush at a time (and
> others skip/wait), to avoid a thundering herd problem on
> cgroup_rstat_lock.
> 
> Here, what I am trying to do is essentially equivalent to "wait until
> the lock is available but don't grab it". If we make
> stats_flush_ongoing a mutex, I am afraid the thundering herd problem
> will be reintroduced for stats_flush_ongoing this time.

You will have potentially many spinners for something that might take
quite a lot of time (sleep) if there is nothing else to schedule. I do
not think this is proper behavior. Really, you shouldn't be busy
waiting for a sleeper.

> I am not sure if there's a cleaner way of doing this, but I am
> certainly open for suggestions. I also don't like how the spinning
> loop looks as of now.

mutex_trylock() for non-critical flushers and mutex_lock() for syncing ones.
We can talk a custom locking scheme if that proves insufficient or
problematic.
Yosry Ahmed Aug. 9, 2023, 1:13 p.m. UTC | #4
On Wed, Aug 9, 2023 at 5:58 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 09-08-23 05:31:04, Yosry Ahmed wrote:
> > On Wed, Aug 9, 2023 at 1:51 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 09-08-23 04:58:10, Yosry Ahmed wrote:
> > > > Over time, the memcg code added multiple optimizations to the stats
> > > > flushing path that introduce a tradeoff between accuracy and
> > > > performance. In some contexts (e.g. dirty throttling, refaults, etc), a
> > > > full rstat flush of the stats in the tree can be too expensive. Such
> > > > optimizations include [1]:
> > > > (a) Introducing a periodic background flusher to keep the size of the
> > > > update tree from growing unbounded.
> > > > (b) Allowing only one thread to flush at a time, and other concurrent
> > > > flushers just skip the flush. This avoids a thundering herd problem
> > > > when multiple reclaim/refault threads attempt to flush the stats at
> > > > once.
> > > > (c) Only executing a flush if the magnitude of the stats updates exceeds
> > > > a certain threshold.
> > > >
> > > > These optimizations were necessary to make flushing feasible in
> > > > performance-critical paths, and they come at the cost of some accuracy
> > > > that we choose to live without. On the other hand, for flushes invoked
> > > > when userspace is reading the stats, the tradeoff is less appealing
> > > > This code path is not performance-critical, and the inaccuracies can
> > > > affect userspace behavior. For example, skipping flushing when there is
> > > > another ongoing flush is essentially a coin flip. We don't know if the
> > > > ongoing flush is done with the subtree of interest or not.
> > >
> > > I am not convinced by this much TBH. What kind of precision do you
> > > really need and how much off is what we provide?
> > >
> > > More expensive read of stats from userspace is quite easy to notice
> > > and usually reported as a regression. So you should have a convincing
> > > argument that an extra time spent is really worth it. AFAIK there are
> > > many monitoring (top like) tools which simply read those files regularly
> > > just to show numbers and they certainly do not need a high level of
> > > precision.
> >
> > We used to spend this time before commit fd25a9e0e23b ("memcg: unify
> > memcg stat flushing") which generalized the "skip if ongoing flush"
> > for all stat flushing. As far I know, the problem was contention on
> > the flushing lock which also affected critical paths like refault.
> >
> > The problem is that the current behavior is indeterministic, if cpu A
> > tries to flush stats and cpu B is already doing that, cpu A will just
> > skip. At that point, the cgroup(s) that cpu A cares about may have
> > been fully flushed, partially flushed (in terms of cpus), or not
> > flushed at all. We have no idea. We just know that someone else is
> > flushing something. IOW, in some cases the flush request will be
> > completely ignored and userspace will read stale stats (up to 2s + the
> > periodic flusher runtime).
>
> Yes, that is certainly true but why does that matter? Stats are always a
> snapshot of the past. Do we get an inconsistent image that would be
> actively harmful.

That can very well be the case because we may be in a state where some
cpus are flushed and some aren't. Also sometimes a few seconds is too
old. We have some workloads that read the stats every 1-2 seconds to
keep a fresh state, and they certainly do not expect stats to be 2+
seconds old when they read them.

>
> > Some workloads need to read up-to-date stats as feedback to actions
> > (e.g. after proactive reclaim, or for userspace OOM killing purposes),
> > and reading such stale stats causes regressions or misbehavior by
> > userspace.
>
> Please tell us more about those and why should all others that do not
> require such a precision should page that price as well.

Everyone used to pay this price though and no one used to complain.
Even before rstat, we used to iterate the entire hierarchy when
userspace read the stats. rstat came in and made this much more
efficient by only iterating the subtrees that actually have updates.

The "skip if someone else is flushing" behavior was introduced for
flushers in critical paths (e.g. refault), and hurting the accuracy
for userspace readers was a side effect of it. This patch is trying to
remedy this side effect by restoring the old behavior for userspace
reads.

One other side effect shows up in testing. Some tests started becoming flaky
because a test performs an action and expects the state of the system
to change in a certain way deterministically. In some cases the
flushing race leads to false negatives.

>
> > > [...]
> > > > @@ -639,17 +639,24 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> > > >       }
> > > >  }
> > > >
> > > > -static void do_flush_stats(void)
> > > > +static void do_flush_stats(bool full)
> > > >  {
> > > > +     if (!atomic_read(&stats_flush_ongoing) &&
> > > > +         !atomic_xchg(&stats_flush_ongoing, 1))
> > > > +             goto flush;
> > > > +
> > > >       /*
> > > > -      * We always flush the entire tree, so concurrent flushers can just
> > > > -      * skip. This avoids a thundering herd problem on the rstat global lock
> > > > -      * from memcg flushers (e.g. reclaim, refault, etc).
> > > > +      * We always flush the entire tree, so concurrent flushers can choose to
> > > > +      * skip if accuracy is not critical. Otherwise, wait for the ongoing
> > > > +      * flush to complete. This avoids a thundering herd problem on the rstat
> > > > +      * global lock from memcg flushers (e.g. reclaim, refault, etc).
> > > >        */
> > > > -     if (atomic_read(&stats_flush_ongoing) ||
> > > > -         atomic_xchg(&stats_flush_ongoing, 1))
> > > > -             return;
> > > > -
> > > > +     while (full && atomic_read(&stats_flush_ongoing) == 1) {
> > > > +             if (!cond_resched())
> > > > +                     cpu_relax();
> > >
> > > You are reinveting a mutex with spinning waiter. Why don't you simply
> > > make stats_flush_ongoing a real mutex and make use try_lock for !full
> > > flush and normal lock otherwise?
> >
> > So that was actually a spinlock at one point, when we used to skip if
> > try_lock failed.
>
> AFAICS cgroup_rstat_flush is allowed to sleep so spinlocks are not
> really possible.
>
> > We opted for an atomic because the lock was only used
> > in a try_lock fashion. The problem here is that the atomic is used to
> > ensure that only one thread actually attempts to flush at a time (and
> > others skip/wait), to avoid a thundering herd problem on
> > cgroup_rstat_lock.
> >
> > Here, what I am trying to do is essentially equivalent to "wait until
> > the lock is available but don't grab it". If we make
> > stats_flush_ongoing a mutex, I am afraid the thundering herd problem
> > will be reintroduced for stats_flush_ongoing this time.
>
> You will have potentially many spinners for something that might take
> quite a lot of time (sleep) if there is nothing else to schedule. I do
> not think this is a proper behavior. Really, you shouldn't be busy
> waiting for a sleeper.
>
> > I am not sure if there's a cleaner way of doing this, but I am
> > certainly open for suggestions. I also don't like how the spinning
> > loop looks as of now.
>
> mutex_try_lock for non-critical flushers and mutex_lock of syncing ones.
> We can talk a custom locking scheme if that proves insufficient or
> problematic.
> --
> Michal Hocko
> SUSE Labs
Yosry Ahmed Aug. 9, 2023, 1:17 p.m. UTC | #5
<snip>
> > > [...]
> > > > @@ -639,17 +639,24 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> > > >       }
> > > >  }
> > > >
> > > > -static void do_flush_stats(void)
> > > > +static void do_flush_stats(bool full)
> > > >  {
> > > > +     if (!atomic_read(&stats_flush_ongoing) &&
> > > > +         !atomic_xchg(&stats_flush_ongoing, 1))
> > > > +             goto flush;
> > > > +
> > > >       /*
> > > > -      * We always flush the entire tree, so concurrent flushers can just
> > > > -      * skip. This avoids a thundering herd problem on the rstat global lock
> > > > -      * from memcg flushers (e.g. reclaim, refault, etc).
> > > > +      * We always flush the entire tree, so concurrent flushers can choose to
> > > > +      * skip if accuracy is not critical. Otherwise, wait for the ongoing
> > > > +      * flush to complete. This avoids a thundering herd problem on the rstat
> > > > +      * global lock from memcg flushers (e.g. reclaim, refault, etc).
> > > >        */
> > > > -     if (atomic_read(&stats_flush_ongoing) ||
> > > > -         atomic_xchg(&stats_flush_ongoing, 1))
> > > > -             return;
> > > > -
> > > > +     while (full && atomic_read(&stats_flush_ongoing) == 1) {
> > > > +             if (!cond_resched())
> > > > +                     cpu_relax();
> > >
> > > You are reinveting a mutex with spinning waiter. Why don't you simply
> > > make stats_flush_ongoing a real mutex and make use try_lock for !full
> > > flush and normal lock otherwise?
> >
> > So that was actually a spinlock at one point, when we used to skip if
> > try_lock failed.
>
> AFAICS cgroup_rstat_flush is allowed to sleep so spinlocks are not
> really possible.

Sorry I hit the send button too early, didn't get to this part.

We were able to use a spinlock because we used to disable sleeping
when flushing the stats then, which opened another can of worms :)

>
> > We opted for an atomic because the lock was only used
> > in a try_lock fashion. The problem here is that the atomic is used to
> > ensure that only one thread actually attempts to flush at a time (and
> > others skip/wait), to avoid a thundering herd problem on
> > cgroup_rstat_lock.
> >
> > Here, what I am trying to do is essentially equivalent to "wait until
> > the lock is available but don't grab it". If we make
> > stats_flush_ongoing a mutex, I am afraid the thundering herd problem
> > will be reintroduced for stats_flush_ongoing this time.
>
> You will have potentially many spinners for something that might take
> quite a lot of time (sleep) if there is nothing else to schedule. I do
> not think this is a proper behavior. Really, you shouldn't be busy
> waiting for a sleeper.
>
> > I am not sure if there's a cleaner way of doing this, but I am
> > certainly open for suggestions. I also don't like how the spinning
> > loop looks as of now.
>
> mutex_try_lock for non-critical flushers and mutex_lock of syncing ones.
> We can talk a custom locking scheme if that proves insufficient or
> problematic.

I have no problem with this. I can send a v2 following this scheme,
once we agree on the importance of this patch :)

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 9, 2023, 1:31 p.m. UTC | #6
On Wed 09-08-23 06:13:05, Yosry Ahmed wrote:
> On Wed, Aug 9, 2023 at 5:58 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 09-08-23 05:31:04, Yosry Ahmed wrote:
> > > On Wed, Aug 9, 2023 at 1:51 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Wed 09-08-23 04:58:10, Yosry Ahmed wrote:
> > > > > Over time, the memcg code added multiple optimizations to the stats
> > > > > flushing path that introduce a tradeoff between accuracy and
> > > > > performance. In some contexts (e.g. dirty throttling, refaults, etc), a
> > > > > full rstat flush of the stats in the tree can be too expensive. Such
> > > > > optimizations include [1]:
> > > > > (a) Introducing a periodic background flusher to keep the size of the
> > > > > update tree from growing unbounded.
> > > > > (b) Allowing only one thread to flush at a time, and other concurrent
> > > > > flushers just skip the flush. This avoids a thundering herd problem
> > > > > when multiple reclaim/refault threads attempt to flush the stats at
> > > > > once.
> > > > > (c) Only executing a flush if the magnitude of the stats updates exceeds
> > > > > a certain threshold.
> > > > >
> > > > > These optimizations were necessary to make flushing feasible in
> > > > > performance-critical paths, and they come at the cost of some accuracy
> > > > > that we choose to live without. On the other hand, for flushes invoked
> > > > > when userspace is reading the stats, the tradeoff is less appealing
> > > > > This code path is not performance-critical, and the inaccuracies can
> > > > > affect userspace behavior. For example, skipping flushing when there is
> > > > > another ongoing flush is essentially a coin flip. We don't know if the
> > > > > ongoing flush is done with the subtree of interest or not.
> > > >
> > > > I am not convinced by this much TBH. What kind of precision do you
> > > > really need and how much off is what we provide?
> > > >
> > > > More expensive read of stats from userspace is quite easy to notice
> > > > and usually reported as a regression. So you should have a convincing
> > > > argument that an extra time spent is really worth it. AFAIK there are
> > > > many monitoring (top like) tools which simply read those files regularly
> > > > just to show numbers and they certainly do not need a high level of
> > > > precision.
> > >
> > > We used to spend this time before commit fd25a9e0e23b ("memcg: unify
> > > memcg stat flushing") which generalized the "skip if ongoing flush"
> > > for all stat flushing. As far I know, the problem was contention on
> > > the flushing lock which also affected critical paths like refault.
> > >
> > > The problem is that the current behavior is indeterministic, if cpu A
> > > tries to flush stats and cpu B is already doing that, cpu A will just
> > > skip. At that point, the cgroup(s) that cpu A cares about may have
> > > been fully flushed, partially flushed (in terms of cpus), or not
> > > flushed at all. We have no idea. We just know that someone else is
> > > flushing something. IOW, in some cases the flush request will be
> > > completely ignored and userspace will read stale stats (up to 2s + the
> > > periodic flusher runtime).
> >
> > Yes, that is certainly true but why does that matter? Stats are always a
> > snapshot of the past. Do we get an inconsistent image that would be
> > actively harmful.
> 
> That can very well be the case because we may be in a state where some
> cpus are flushed and some aren't. Also sometimes a few seconds is too
> old. We have some workloads that read the stats every 1-2 seconds to
> keep a fresh state, and they certainly do not expect stats to be 2+
> seconds old when they read them.

I hate to repeat myself, but please be more specific. This all sounds
just too hand-wavy to me.

> > > Some workloads need to read up-to-date stats as feedback to actions
> > > (e.g. after proactive reclaim, or for userspace OOM killing purposes),
> > > and reading such stale stats causes regressions or misbehavior by
> > > userspace.
> >
> > Please tell us more about those and why should all others that do not
> > require such a precision should page that price as well.
> 
> Everyone used to pay this price though and no one used to complain.

Right, and then the overhead has been reduced and now you want to bring
it back and that will be seen as a regression. It doesn't really matter
what used to be the overhead. People always care when something gets
slower.
Yosry Ahmed Aug. 9, 2023, 6:33 p.m. UTC | #7
On Wed, Aug 9, 2023 at 6:32 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 09-08-23 06:13:05, Yosry Ahmed wrote:
> > On Wed, Aug 9, 2023 at 5:58 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 09-08-23 05:31:04, Yosry Ahmed wrote:
> > > > On Wed, Aug 9, 2023 at 1:51 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Wed 09-08-23 04:58:10, Yosry Ahmed wrote:
> > > > > > Over time, the memcg code added multiple optimizations to the stats
> > > > > > flushing path that introduce a tradeoff between accuracy and
> > > > > > performance. In some contexts (e.g. dirty throttling, refaults, etc), a
> > > > > > full rstat flush of the stats in the tree can be too expensive. Such
> > > > > > optimizations include [1]:
> > > > > > (a) Introducing a periodic background flusher to keep the size of the
> > > > > > update tree from growing unbounded.
> > > > > > (b) Allowing only one thread to flush at a time, and other concurrent
> > > > > > flushers just skip the flush. This avoids a thundering herd problem
> > > > > > when multiple reclaim/refault threads attempt to flush the stats at
> > > > > > once.
> > > > > > (c) Only executing a flush if the magnitude of the stats updates exceeds
> > > > > > a certain threshold.
> > > > > >
> > > > > > These optimizations were necessary to make flushing feasible in
> > > > > > performance-critical paths, and they come at the cost of some accuracy
> > > > > > that we choose to live without. On the other hand, for flushes invoked
> > > > > > when userspace is reading the stats, the tradeoff is less appealing
> > > > > > This code path is not performance-critical, and the inaccuracies can
> > > > > > affect userspace behavior. For example, skipping flushing when there is
> > > > > > another ongoing flush is essentially a coin flip. We don't know if the
> > > > > > ongoing flush is done with the subtree of interest or not.
> > > > >
> > > > > I am not convinced by this much TBH. What kind of precision do you
> > > > > really need and how much off is what we provide?
> > > > >
> > > > > More expensive read of stats from userspace is quite easy to notice
> > > > > and usually reported as a regression. So you should have a convincing
> > > > > argument that an extra time spent is really worth it. AFAIK there are
> > > > > many monitoring (top like) tools which simply read those files regularly
> > > > > just to show numbers and they certainly do not need a high level of
> > > > > precision.
> > > >
> > > > We used to spend this time before commit fd25a9e0e23b ("memcg: unify
> > > > memcg stat flushing") which generalized the "skip if ongoing flush"
> > > > for all stat flushing. As far I know, the problem was contention on
> > > > the flushing lock which also affected critical paths like refault.
> > > >
> > > > The problem is that the current behavior is indeterministic, if cpu A
> > > > tries to flush stats and cpu B is already doing that, cpu A will just
> > > > skip. At that point, the cgroup(s) that cpu A cares about may have
> > > > been fully flushed, partially flushed (in terms of cpus), or not
> > > > flushed at all. We have no idea. We just know that someone else is
> > > > flushing something. IOW, in some cases the flush request will be
> > > > completely ignored and userspace will read stale stats (up to 2s + the
> > > > periodic flusher runtime).
> > >
> > > Yes, that is certainly true but why does that matter? Stats are always a
> > > snapshot of the past. Do we get an inconsistent image that would be
> > > actively harmful.
> >
> > That can very well be the case because we may be in a state where some
> > cpus are flushed and some aren't. Also sometimes a few seconds is too
> > old. We have some workloads that read the stats every 1-2 seconds to
> > keep a fresh state, and they certainly do not expect stats to be 2+
> > seconds old when they read them.
>
> I hate to repeat myself but please be more specific. This all sounds
> just too wavy to me.

Sorry I didn't have the full story in mind, I had to do my homework.
One example is userspace OOM killing. Our userspace OOM killer makes
decisions based on some stats from memory.stat, and stale stats (a few
seconds in this case) can result in an unjustified OOM kill, which can
easily cascade.

A simplified example of that is when a hierarchy has a parent cgroup
with multiple related children. In this case, there are usually
file-backed resources that are shared between those children, and OOM
killing one of them will not free those resources. Hence, the OOM
killer only considers their anonymous usage to be reapable when a
memcg is nuked. For that we use the "anon" stat (or "rss" in cgroup
v1) in memory.stat.

>
> > > > Some workloads need to read up-to-date stats as feedback to actions
> > > > (e.g. after proactive reclaim, or for userspace OOM killing purposes),
> > > > and reading such stale stats causes regressions or misbehavior by
> > > > userspace.
> > >
> > > Please tell us more about those and why should all others that do not
> > > require such a precision should page that price as well.
> >
> > Everyone used to pay this price though and no one used to complain.
>
> Right, and then the overhead has been reduced and now you want to bring
> it back and that will be seen as a regression. It doesn't really matter
> what used to be the overhead. People always care when something gets
> slower.

People also care when something gets less accurate :)
Michal Hocko Aug. 11, 2023, 12:21 p.m. UTC | #8
On Wed 09-08-23 11:33:20, Yosry Ahmed wrote:
> On Wed, Aug 9, 2023 at 6:32 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 09-08-23 06:13:05, Yosry Ahmed wrote:
> > > On Wed, Aug 9, 2023 at 5:58 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Wed 09-08-23 05:31:04, Yosry Ahmed wrote:
> > > > > On Wed, Aug 9, 2023 at 1:51 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Wed 09-08-23 04:58:10, Yosry Ahmed wrote:
> > > > > > > Over time, the memcg code added multiple optimizations to the stats
> > > > > > > flushing path that introduce a tradeoff between accuracy and
> > > > > > > performance. In some contexts (e.g. dirty throttling, refaults, etc), a
> > > > > > > full rstat flush of the stats in the tree can be too expensive. Such
> > > > > > > optimizations include [1]:
> > > > > > > (a) Introducing a periodic background flusher to keep the size of the
> > > > > > > update tree from growing unbounded.
> > > > > > > (b) Allowing only one thread to flush at a time, and other concurrent
> > > > > > > flushers just skip the flush. This avoids a thundering herd problem
> > > > > > > when multiple reclaim/refault threads attempt to flush the stats at
> > > > > > > once.
> > > > > > > (c) Only executing a flush if the magnitude of the stats updates exceeds
> > > > > > > a certain threshold.
> > > > > > >
> > > > > > > These optimizations were necessary to make flushing feasible in
> > > > > > > performance-critical paths, and they come at the cost of some accuracy
> > > > > > > that we choose to live without. On the other hand, for flushes invoked
> > > > > > > when userspace is reading the stats, the tradeoff is less appealing
> > > > > > > This code path is not performance-critical, and the inaccuracies can
> > > > > > > affect userspace behavior. For example, skipping flushing when there is
> > > > > > > another ongoing flush is essentially a coin flip. We don't know if the
> > > > > > > ongoing flush is done with the subtree of interest or not.
> > > > > >
> > > > > > I am not convinced by this much TBH. What kind of precision do you
> > > > > > really need and how much off is what we provide?
> > > > > >
> > > > > > More expensive read of stats from userspace is quite easy to notice
> > > > > > and usually reported as a regression. So you should have a convincing
> > > > > > argument that an extra time spent is really worth it. AFAIK there are
> > > > > > many monitoring (top like) tools which simply read those files regularly
> > > > > > just to show numbers and they certainly do not need a high level of
> > > > > > precision.
> > > > >
> > > > > We used to spend this time before commit fd25a9e0e23b ("memcg: unify
> > > > > memcg stat flushing") which generalized the "skip if ongoing flush"
> > > > > for all stat flushing. As far I know, the problem was contention on
> > > > > the flushing lock which also affected critical paths like refault.
> > > > >
> > > > > The problem is that the current behavior is indeterministic, if cpu A
> > > > > tries to flush stats and cpu B is already doing that, cpu A will just
> > > > > skip. At that point, the cgroup(s) that cpu A cares about may have
> > > > > been fully flushed, partially flushed (in terms of cpus), or not
> > > > > flushed at all. We have no idea. We just know that someone else is
> > > > > flushing something. IOW, in some cases the flush request will be
> > > > > completely ignored and userspace will read stale stats (up to 2s + the
> > > > > periodic flusher runtime).
> > > >
> > > > Yes, that is certainly true but why does that matter? Stats are always a
> > > > snapshot of the past. Do we get an inconsistent image that would be
> > > > actively harmful.
> > >
> > > That can very well be the case because we may be in a state where some
> > > cpus are flushed and some aren't. Also sometimes a few seconds is too
> > > old. We have some workloads that read the stats every 1-2 seconds to
> > > keep a fresh state, and they certainly do not expect stats to be 2+
> > > seconds old when they read them.
> >
> > I hate to repeat myself but please be more specific. This all sounds
> > just too wavy to me.
> 
> Sorry I didn't have the full story in mind, I had to do my homework.
> One example is userspace OOM killing. Our userspace OOM killer makes
> decisions based on some stats from memory.stat, and stale stats (a few
> seconds in this case) can result in an unrightful OOM kill, which can
> easily cascade.

OK, but how is this any different from having outdated data because you
have to wait for memory.stat to be read (being blocked inside the rstat
code)? Either your oom killer is reading the stats directly, in which
case you depend on that flushing (which could itself be really harmful),
or you rely on another thread doing the blocking and you do not have
up-to-date numbers anyway. So how does blocking actually help?

> A simplified example of that is when a hierarchy has a parent cgroup
> with multiple related children. In this case, there are usually
> file-backed resources that are shared between those children, and OOM
> killing one of them will not free those resources. Hence, the OOM
> killer only considers their anonymous usage to be reap-able when a
> memcg is nuked. For that we use the "anon" stat (or "rss" in cgroup
> v1) in memory.stat.
> 
> >
> > > > > Some workloads need to read up-to-date stats as feedback to actions
> > > > > (e.g. after proactive reclaim, or for userspace OOM killing purposes),
> > > > > and reading such stale stats causes regressions or misbehavior by
> > > > > userspace.
> > > >
> > > > Please tell us more about those and why should all others that do not
> > > > require such a precision should page that price as well.
> > >
> > > Everyone used to pay this price though and no one used to complain.
> >
> > Right, and then the overhead has been reduced and now you want to bring
> > it back and that will be seen as a regression. It doesn't really matter
> > what used to be the overhead. People always care when something gets
> > slower.
> 
> People also care when something gets less accurate :)

Accuracy will never be 100%. We have to carefully balance between
accuracy and overhead. So far we haven't heard about how much inaccuracy
you are getting. Numbers help!

In any case I do get the argument about consistency within a subtree
(children data largely not matching parents'). Examples like that would
be really helpful as well. If that is indeed the case then I would
consider it much more serious than accuracy, which is always problematic
(100ms of an actively allocating context can ruin your just-read numbers
and there is no way around that without stopping the world).

Last note: for /proc/vmstat we have /proc/sys/vm/stat_refresh to trigger
an explicit refresh. For those users who really need more accurate
numbers we might consider an interface like that. Or allow writing to the
stat file and do the flush in the write handler.
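
A hypothetical illustration of that last idea (the handler name and its
wiring into the cgroup interface are made up here; nothing like this is
part of the patch or of the existing memory.stat interface): a cgroup v2
write handler could simply force a full flush.

static ssize_t memory_stat_refresh_write(struct kernfs_open_file *of,
					 char *buf, size_t nbytes, loff_t off)
{
	/* force an accurate snapshot for the next read of memory.stat */
	do_flush_stats(true);
	return nbytes;
}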
Yosry Ahmed Aug. 11, 2023, 7:02 p.m. UTC | #9
On Fri, Aug 11, 2023 at 5:21 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 09-08-23 11:33:20, Yosry Ahmed wrote:
> > On Wed, Aug 9, 2023 at 6:32 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 09-08-23 06:13:05, Yosry Ahmed wrote:
> > > > On Wed, Aug 9, 2023 at 5:58 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Wed 09-08-23 05:31:04, Yosry Ahmed wrote:
> > > > > > On Wed, Aug 9, 2023 at 1:51 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Wed 09-08-23 04:58:10, Yosry Ahmed wrote:
> > > > > > > > Over time, the memcg code added multiple optimizations to the stats
> > > > > > > > flushing path that introduce a tradeoff between accuracy and
> > > > > > > > performance. In some contexts (e.g. dirty throttling, refaults, etc), a
> > > > > > > > full rstat flush of the stats in the tree can be too expensive. Such
> > > > > > > > optimizations include [1]:
> > > > > > > > (a) Introducing a periodic background flusher to keep the size of the
> > > > > > > > update tree from growing unbounded.
> > > > > > > > (b) Allowing only one thread to flush at a time, and other concurrent
> > > > > > > > flushers just skip the flush. This avoids a thundering herd problem
> > > > > > > > when multiple reclaim/refault threads attempt to flush the stats at
> > > > > > > > once.
> > > > > > > > (c) Only executing a flush if the magnitude of the stats updates exceeds
> > > > > > > > a certain threshold.
> > > > > > > >
> > > > > > > > These optimizations were necessary to make flushing feasible in
> > > > > > > > performance-critical paths, and they come at the cost of some accuracy
> > > > > > > > that we choose to live without. On the other hand, for flushes invoked
> > > > > > > > when userspace is reading the stats, the tradeoff is less appealing
> > > > > > > > This code path is not performance-critical, and the inaccuracies can
> > > > > > > > affect userspace behavior. For example, skipping flushing when there is
> > > > > > > > another ongoing flush is essentially a coin flip. We don't know if the
> > > > > > > > ongoing flush is done with the subtree of interest or not.
> > > > > > >
> > > > > > > I am not convinced by this much TBH. What kind of precision do you
> > > > > > > really need and how much off is what we provide?
> > > > > > >
> > > > > > > More expensive read of stats from userspace is quite easy to notice
> > > > > > > and usually reported as a regression. So you should have a convincing
> > > > > > > argument that an extra time spent is really worth it. AFAIK there are
> > > > > > > many monitoring (top like) tools which simply read those files regularly
> > > > > > > just to show numbers and they certainly do not need a high level of
> > > > > > > precision.
> > > > > >
> > > > > > We used to spend this time before commit fd25a9e0e23b ("memcg: unify
> > > > > > memcg stat flushing") which generalized the "skip if ongoing flush"
> > > > > > for all stat flushing. As far I know, the problem was contention on
> > > > > > the flushing lock which also affected critical paths like refault.
> > > > > >
> > > > > > The problem is that the current behavior is indeterministic, if cpu A
> > > > > > tries to flush stats and cpu B is already doing that, cpu A will just
> > > > > > skip. At that point, the cgroup(s) that cpu A cares about may have
> > > > > > been fully flushed, partially flushed (in terms of cpus), or not
> > > > > > flushed at all. We have no idea. We just know that someone else is
> > > > > > flushing something. IOW, in some cases the flush request will be
> > > > > > completely ignored and userspace will read stale stats (up to 2s + the
> > > > > > periodic flusher runtime).
> > > > >
> > > > > Yes, that is certainly true but why does that matter? Stats are always a
> > > > > snapshot of the past. Do we get an inconsistent image that would be
> > > > > actively harmful.
> > > >
> > > > That can very well be the case because we may be in a state where some
> > > > cpus are flushed and some aren't. Also sometimes a few seconds is too
> > > > old. We have some workloads that read the stats every 1-2 seconds to
> > > > keep a fresh state, and they certainly do not expect stats to be 2+
> > > > seconds old when they read them.
> > >
> > > I hate to repeat myself but please be more specific. This all sounds
> > > just too wavy to me.
> >
> > Sorry I didn't have the full story in mind, I had to do my homework.
> > One example is userspace OOM killing. Our userspace OOM killer makes
> > decisions based on some stats from memory.stat, and stale stats (a few
> > seconds in this case) can result in an unrightful OOM kill, which can
> > easily cascade.
>
> OK, but how is this any different from having outdated data because you
> have to wait for memory.stat to read (being blocked inside the rstat
> code)? Either your oom killer is reading the stats directly and then you
> depend on that flushing which is something that could be really harmful
> itself or you rely on another thread doing the blocking and you do not
> have up-to-date numbers anyway. So how does blocking actually help?

I am not sure I understand.

The problem is that when you skip when someone else is flushing, there
is a chance that the stats we care about haven't been flushed since the
last time the periodic flusher ran, which is supposed to be ~2 seconds
ago but may be longer depending on how busy the workqueue is.

When you block until the flusher finishes, the stats are being
refreshed as you wait. So the stats are not getting more outdated as
you wait in the general case (unless your cgroup was flushed first and
you're waiting for others to be flushed).
[Let's call this approach A]

Furthermore, with the implementation you suggested using a mutex, we
will wait until the ongoing flush is completed, then we will grab the
mutex and do a flush ourselves. That second flush should mostly be
very fast, but it will guarantee even fresher stats.
[Let's call this approach B]

See below for test results with either A or B.

We can add a new API that checks if the specific cgroup we care about
is flushed and wait on that instead of waiting for the entire flush to
finish, which will add stronger guarantees. However, as you said when
you suggested the mutex approach, let's start simple and add more
complexity when needed.

>
> > A simplified example of that is when a hierarchy has a parent cgroup
> > with multiple related children. In this case, there are usually
> > file-backed resources that are shared between those children, and OOM
> > killing one of them will not free those resources. Hence, the OOM
> > killer only considers their anonymous usage to be reap-able when a
> > memcg is nuked. For that we use the "anon" stat (or "rss" in cgroup
> > v1) in memory.stat.
> >
> > >
> > > > > > Some workloads need to read up-to-date stats as feedback to actions
> > > > > > (e.g. after proactive reclaim, or for userspace OOM killing purposes),
> > > > > > and reading such stale stats causes regressions or misbehavior by
> > > > > > userspace.
> > > > >
> > > > > Please tell us more about those and why should all others that do not
> > > > > require such a precision should page that price as well.
> > > >
> > > > Everyone used to pay this price though and no one used to complain.
> > >
> > > Right, and then the overhead has been reduced and now you want to bring
> > > it back and that will be seen as a regression. It doesn't really matter
> > > what used to be the overhead. People always care when something gets
> > > slower.
> >
> > People also care when something gets less accurate :)
>
> Accuracy will never be 100%. We have to carefully balance between
> accuracy and overhead. So far we haven't heard about how much inaccuracy
> you are getting. Numbers help!

Very good question; I should have added numbers from the beginning to
clarify the significance of the problem. To easily produce numbers I
will use another use case that we have that relies on having fresh
stats, which is proactive reclaim. Proactive reclaim usually operates
in a feedback loop where it requests some reclaim, queries the stats,
and decides how to operate based on that (e.g. fall back for a while).

When running a test that is proactively reclaiming some memory and
expecting to see the memory swapped, without this patch, we see
significant inaccuracy. In some failure instances we expect ~2000
pages to be swapped but we only find ~1200. This is observed on
machines with hundreds of cpus, where the problem is most noticeable.
This is a huge difference. Keep in mind that the inaccuracy would
probably be even worse in a production environment if the system is
under enough pressure (e.g. the periodic flusher is late).

For both approach A (wait until the flusher finishes and exit, i.e. this
patch) and approach B (wait until the flusher finishes then flush, i.e. the
mutex approach), I stop seeing this failure in the proactive reclaim
test and the stats are accurate.

I have v2 ready that implements approach B with the mutex ready to
fire, just say the word :)

>
> In any case I do get the argument about consistency within a subtree
> (children data largely not matching parents'). Examples like that would
> be really helpful as well. If that is indeed the case then I would
> consider it much more serious than accuracy which is always problematic
> (100ms of an actively allocating context can ruin your just read numbers
> and there is no way around that wihtout stopping the world).

100% agreed. It's more difficult to get testing results for this, but
that can easily be the case when we have no idea how much is flushed
when we return from mem_cgroup_flush_stats().

>
> Last note, for /proc/vmstat we have /proc/sys/vm/stat_refresh to trigger
> an explicit refresh. For those users who really need more accurate
> numbers we might consider interface like that. Or allow to write to stat
> file and do that in the write handler.

This wouldn't be my first option, but if that's the only way to get
accurate stats I'll take it.

Keep in mind that the normal stats read path will always try to
refresh, it's just that it will often skip refreshing due to an
implementation-specific race. So having an interface for an explicit
flush might be too implementation-specific, especially if the race
disappears later and the interface is no longer needed.

Having said that, I am not opposed to this if that's the only way
forward for accurate stats, but I would rather have the stat reads be
always accurate unless a regression is noticed.

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 11, 2023, 7:25 p.m. UTC | #10
On Fri 11-08-23 12:02:48, Yosry Ahmed wrote:
> On Fri, Aug 11, 2023 at 5:21 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 09-08-23 11:33:20, Yosry Ahmed wrote:
> > > On Wed, Aug 9, 2023 at 6:32 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Wed 09-08-23 06:13:05, Yosry Ahmed wrote:
> > > > > On Wed, Aug 9, 2023 at 5:58 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Wed 09-08-23 05:31:04, Yosry Ahmed wrote:
> > > > > > > On Wed, Aug 9, 2023 at 1:51 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > >
> > > > > > > > On Wed 09-08-23 04:58:10, Yosry Ahmed wrote:
> > > > > > > > > Over time, the memcg code added multiple optimizations to the stats
> > > > > > > > > flushing path that introduce a tradeoff between accuracy and
> > > > > > > > > performance. In some contexts (e.g. dirty throttling, refaults, etc), a
> > > > > > > > > full rstat flush of the stats in the tree can be too expensive. Such
> > > > > > > > > optimizations include [1]:
> > > > > > > > > (a) Introducing a periodic background flusher to keep the size of the
> > > > > > > > > update tree from growing unbounded.
> > > > > > > > > (b) Allowing only one thread to flush at a time, and other concurrent
> > > > > > > > > flushers just skip the flush. This avoids a thundering herd problem
> > > > > > > > > when multiple reclaim/refault threads attempt to flush the stats at
> > > > > > > > > once.
> > > > > > > > > (c) Only executing a flush if the magnitude of the stats updates exceeds
> > > > > > > > > a certain threshold.
> > > > > > > > >
> > > > > > > > > These optimizations were necessary to make flushing feasible in
> > > > > > > > > performance-critical paths, and they come at the cost of some accuracy
> > > > > > > > > that we choose to live without. On the other hand, for flushes invoked
> > > > > > > > > when userspace is reading the stats, the tradeoff is less appealing
> > > > > > > > > This code path is not performance-critical, and the inaccuracies can
> > > > > > > > > affect userspace behavior. For example, skipping flushing when there is
> > > > > > > > > another ongoing flush is essentially a coin flip. We don't know if the
> > > > > > > > > ongoing flush is done with the subtree of interest or not.
> > > > > > > >
> > > > > > > > I am not convinced by this much TBH. What kind of precision do you
> > > > > > > > really need and how much off is what we provide?
> > > > > > > >
> > > > > > > > More expensive read of stats from userspace is quite easy to notice
> > > > > > > > and usually reported as a regression. So you should have a convincing
> > > > > > > > argument that an extra time spent is really worth it. AFAIK there are
> > > > > > > > many monitoring (top like) tools which simply read those files regularly
> > > > > > > > just to show numbers and they certainly do not need a high level of
> > > > > > > > precision.
> > > > > > >
> > > > > > > We used to spend this time before commit fd25a9e0e23b ("memcg: unify
> > > > > > > memcg stat flushing") which generalized the "skip if ongoing flush"
> > > > > > > for all stat flushing. As far I know, the problem was contention on
> > > > > > > the flushing lock which also affected critical paths like refault.
> > > > > > >
> > > > > > > The problem is that the current behavior is indeterministic, if cpu A
> > > > > > > tries to flush stats and cpu B is already doing that, cpu A will just
> > > > > > > skip. At that point, the cgroup(s) that cpu A cares about may have
> > > > > > > been fully flushed, partially flushed (in terms of cpus), or not
> > > > > > > flushed at all. We have no idea. We just know that someone else is
> > > > > > > flushing something. IOW, in some cases the flush request will be
> > > > > > > completely ignored and userspace will read stale stats (up to 2s + the
> > > > > > > periodic flusher runtime).
> > > > > >
> > > > > > Yes, that is certainly true but why does that matter? Stats are always a
> > > > > > snapshot of the past. Do we get an inconsistent image that would be
> > > > > > actively harmful.
> > > > >
> > > > > That can very well be the case because we may be in a state where some
> > > > > cpus are flushed and some aren't. Also sometimes a few seconds is too
> > > > > old. We have some workloads that read the stats every 1-2 seconds to
> > > > > keep a fresh state, and they certainly do not expect stats to be 2+
> > > > > seconds old when they read them.
> > > >
> > > > I hate to repeat myself but please be more specific. This all sounds
> > > > just too wavy to me.
> > >
> > > Sorry I didn't have the full story in mind, I had to do my homework.
> > > One example is userspace OOM killing. Our userspace OOM killer makes
> > > decisions based on some stats from memory.stat, and stale stats (a few
> > > seconds in this case) can result in an unrightful OOM kill, which can
> > > easily cascade.
> >
> > OK, but how is this any different from having outdated data because you
> > have to wait for memory.stat to read (being blocked inside the rstat
> > code)? Either your oom killer is reading the stats directly and then you
> > depend on that flushing which is something that could be really harmful
> > itself or you rely on another thread doing the blocking and you do not
> > have up-to-date numbers anyway. So how does blocking actually help?
> 
> I am not sure I understand.
> 
> The problem is that when you skip when someone else is flushing, there
> is a chance that the stats we care about haven't been flushed since
> the last time the periodic flusher ran. Which is supposed to be ~2
> seconds ago, but maybe more depending on how busy the workqueue is.

Yes, this is clear. You simply get _some_ snapshot of the past.

> When you block until the flusher finishes, the stats are being
> refreshed as you wait. So the stats are not getting more outdated as
> you wait in the general case (unless your cgroup was flushed first and
> you're waiting for others to be flushed).
> [Let's call this approach A]

Yes, but the amount of waiting is also nondeterministic, and even after
you have waited your stats might already be outdated depending on how
quickly somebody allocates. That was my point.

> Furthermore, with the implementation you suggested using a mutex, we
> will wait until the ongoing flush is completed, then we will grab the
> mutex and do a flush ourselves.

Flushing would be mostly unnecessary as somebody has just flushed
everything. The only point of the mutex is to remove the super ugly
busy-wait-in-a-sleepable-context construct.

[...]
> When running a test that is proactively reclaiming some memory and
> expecting to see the memory swapped, without this patch, we see
> significant inaccuracy. In some failure instances we expect ~2000
> pages to be swapped but we only find ~1200.

That difference is 3MB of memory. What is the precision you are
operating on? 

> This is observed on
> machines with hundreds of cpus, where the problem is most noticeable.
> This is a huge difference. Keep in mind that the inaccuracy would
> probably be even worse in a production environment if the system is
> under enough pressure (e.g. the periodic flusher is late).
> 
> For both approach A (wait until flusher finishes and exit, i.e this
> patch) and approach B (wait until flusher finishes then flush, i.e the
> mutex approach), I stop seeing this failure in the proactive reclaim
> test and the stats are accurate.
> 
> I have v2 ready that implements approach B with the mutex ready to
> fire, just say the word :)
> 
> >
> > In any case I do get the argument about consistency within a subtree
> > (children data largely not matching parents'). Examples like that would
> > be really helpful as well. If that is indeed the case then I would
> > consider it much more serious than accuracy which is always problematic
> > (100ms of an actively allocating context can ruin your just read numbers
> > and there is no way around that wihtout stopping the world).
> 
> 100% agreed. It's more difficult to get testing results for this, but
> that can easily be the case when we have no idea how much is flushed
> when we return from mem_cgroup_flush_stats().
> 
> >
> > Last note, for /proc/vmstat we have /proc/sys/vm/stat_refresh to trigger
> > an explicit refresh. For those users who really need more accurate
> > numbers we might consider interface like that. Or allow to write to stat
> > file and do that in the write handler.
> 
> This wouldn't be my first option, but if that's the only way to get
> accurate stats I'll take it.

To be honest, this would be my preferred option, for two reasons:
a) we do not want to guarantee too much on the precision front because
that would just make maintainability much harder, with different
people having a different opinion of how much precision is enough, and b)
it makes the rarer (need-precise) case the special case rather than
the default.

> Keep in mind that the normal stats read path will always try to
> refresh, it's just that it will often skip refreshing due to an
> implementation-specific race. So having an interface for an explicit
> flush might be too implementation specific, especially if the race
> disappears later and the interface is not needed later.

That doesn't really matter because from the userspace POV it is really
not important how the whole thing is implemented or whether the
interface blocks in reality. Userspace simply has to account for
possible blocking. It just needs coherent and up-to-date-at-the-flush
semantics.
Yosry Ahmed Aug. 11, 2023, 8:39 p.m. UTC | #11
<snip>
> > > > > I hate to repeat myself but please be more specific. This all sounds
> > > > > just too wavy to me.
> > > >
> > > > Sorry I didn't have the full story in mind, I had to do my homework.
> > > > One example is userspace OOM killing. Our userspace OOM killer makes
> > > > decisions based on some stats from memory.stat, and stale stats (a few
> > > > seconds in this case) can result in an unrightful OOM kill, which can
> > > > easily cascade.
> > >
> > > OK, but how is this any different from having outdated data because you
> > > have to wait for memory.stat to read (being blocked inside the rstat
> > > code)? Either your oom killer is reading the stats directly and then you
> > > depend on that flushing which is something that could be really harmful
> > > itself or you rely on another thread doing the blocking and you do not
> > > have up-to-date numbers anyway. So how does blocking actually help?
> >
> > I am not sure I understand.
> >
> > The problem is that when you skip when someone else is flushing, there
> > is a chance that the stats we care about haven't been flushed since
> > the last time the periodic flusher ran. Which is supposed to be ~2
> > seconds ago, but maybe more depending on how busy the workqueue is.
>
> Yes, this is clear. You simply get _some_ snapshot of the past.
>
> > When you block until the flusher finishes, the stats are being
> > refreshed as you wait. So the stats are not getting more outdated as
> > you wait in the general case (unless your cgroup was flushed first and
> > you're waiting for others to be flushed).
> > [Let's call this approach A]
>
> Yes, but the amount of waiting is also undeterministic and even after
> you waited your stats might be outdated already depending on how quickly
> somebody allocates. That was my point.

Right, we are just trying to minimize the staleness window.

>
> > Furthermore, with the implementation you suggested using a mutex, we
> > will wait until the ongoing flush is completed, then we will grab the
> > mutex and do a flush ourselves.
>
> Flushing would be mostly unnecessary as somebody has just flushed
> everything. The only point of mutex is to remove the super ugly busy
> wait for sleepable context construct.

Right, but it also has the (arguably) nice double-flush effect, which
also minimizes the staleness window.

>
> [...]
> > When running a test that is proactively reclaiming some memory and
> > expecting to see the memory swapped, without this patch, we see
> > significant inaccuracy. In some failure instances we expect ~2000
> > pages to be swapped but we only find ~1200.
>
> That difference is 3MB of memory. What is the precision you are
> operating on?

I am not concerned with MBs, I am concerned with the ratio. On a large
system with hundreds of cpus there is a larger chance of missing
updates on a bunch of cpus, which can add up to a lot.

>
> > This is observed on
> > machines with hundreds of cpus, where the problem is most noticeable.
> > This is a huge difference. Keep in mind that the inaccuracy would
> > probably be even worse in a production environment if the system is
> > under enough pressure (e.g. the periodic flusher is late).
> >
> > For both approach A (wait until flusher finishes and exit, i.e this
> > patch) and approach B (wait until flusher finishes then flush, i.e the
> > mutex approach), I stop seeing this failure in the proactive reclaim
> > test and the stats are accurate.
> >
> > I have v2 ready that implements approach B with the mutex ready to
> > fire, just say the word :)
> >
> > >
> > > In any case I do get the argument about consistency within a subtree
> > > (children data largely not matching parents'). Examples like that would
> > > be really helpful as well. If that is indeed the case then I would
> > > consider it much more serious than accuracy which is always problematic
> > > (100ms of an actively allocating context can ruin your just read numbers
> > > and there is no way around that wihtout stopping the world).
> >
> > 100% agreed. It's more difficult to get testing results for this, but
> > that can easily be the case when we have no idea how much is flushed
> > when we return from mem_cgroup_flush_stats().
> >
> > >
> > > Last note, for /proc/vmstat we have /proc/sys/vm/stat_refresh to trigger
> > > an explicit refresh. For those users who really need more accurate
> > > numbers we might consider interface like that. Or allow to write to stat
> > > file and do that in the write handler.
> >
> > This wouldn't be my first option, but if that's the only way to get
> > accurate stats I'll take it.
>
> To be honest, this would be my preferable option because of 2 reasons.
> a) we do not want to guarantee to much on the precision front because
> that would just makes maintainability much more harder with different
> people having a different opinion of how much precision is enough and b)
> it makes the more rare (need precise) case the special case rather than
> the default.

How about we go with the proposed approach in this patch (or the mutex
approach as it's much cleaner), and if someone complains about slow
reads we revert the change and introduce the refresh API? We might
just get away with making all reads accurate and avoid the hassle of
updating some userspace readers to do write-then-read. We don't know
for sure that something will regress.

What do you think?
Shakeel Butt Aug. 12, 2023, 2:08 a.m. UTC | #12
Hi all,

(sorry for late response as I was away)

On Fri, Aug 11, 2023 at 1:40 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
> > > >
> > > > Last note, for /proc/vmstat we have /proc/sys/vm/stat_refresh to trigger
> > > > an explicit refresh. For those users who really need more accurate
> > > > numbers we might consider interface like that. Or allow to write to stat
> > > > file and do that in the write handler.
> > >
> > > This wouldn't be my first option, but if that's the only way to get
> > > accurate stats I'll take it.
> >
> > To be honest, this would be my preferable option because of 2 reasons.
> > a) we do not want to guarantee to much on the precision front because
> > that would just makes maintainability much more harder with different
> > people having a different opinion of how much precision is enough and b)
> > it makes the more rare (need precise) case the special case rather than
> > the default.
>
> How about we go with the proposed approach in this patch (or the mutex
> approach as it's much cleaner), and if someone complains about slow
> reads we revert the change and introduce the refresh API? We might
> just get away with making all reads accurate and avoid the hassle of
> updating some userspace readers to do write-then-read. We don't know
> for sure that something will regress.
>
> What do you think?

Actually I am with Michal on this one. As I see multiple regression
reports for reading the stats, I am inclined towards rate limiting the
sync stats flushing from user readable interfaces (through
mem_cgroup_flush_stats_ratelimited()) and providing a separate
interface as suggested by Michal to explicitly flush the stats for
users ok with the cost. Since we flush the stats every 2 seconds, most
of the users should be fine and the users who care about accuracy can
pay for it.
Yosry Ahmed Aug. 12, 2023, 2:11 a.m. UTC | #13
On Fri, Aug 11, 2023 at 7:08 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> Hi all,
>
> (sorry for late response as I was away)
>
> On Fri, Aug 11, 2023 at 1:40 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> [...]
> > > > >
> > > > > Last note, for /proc/vmstat we have /proc/sys/vm/stat_refresh to trigger
> > > > > an explicit refresh. For those users who really need more accurate
> > > > > numbers we might consider interface like that. Or allow to write to stat
> > > > > file and do that in the write handler.
> > > >
> > > > This wouldn't be my first option, but if that's the only way to get
> > > > accurate stats I'll take it.
> > >
> > > To be honest, this would be my preferable option because of 2 reasons.
> > > a) we do not want to guarantee to much on the precision front because
> > > that would just makes maintainability much more harder with different
> > > people having a different opinion of how much precision is enough and b)
> > > it makes the more rare (need precise) case the special case rather than
> > > the default.
> >
> > How about we go with the proposed approach in this patch (or the mutex
> > approach as it's much cleaner), and if someone complains about slow
> > reads we revert the change and introduce the refresh API? We might
> > just get away with making all reads accurate and avoid the hassle of
> > updating some userspace readers to do write-then-read. We don't know
> > for sure that something will regress.
> >
> > What do you think?
>
> Actually I am with Michal on this one. As I see multiple regression
> reports for reading the stats, I am inclined towards rate limiting the
> sync stats flushing from user readable interfaces (through
> mem_cgroup_flush_stats_ratelimited()) and providing a separate
> interface as suggested by Michal to explicitly flush the stats for
> users ok with the cost. Since we flush the stats every 2 seconds, most
> of the users should be fine and the users who care about accuracy can
> pay for it.

I am worried that writing to a stat file to trigger a flush and then
reading it will increase the staleness window which we are trying to
reduce here. Would it be acceptable to add a separate interface to
explicitly read flushed stats without having to write first? If the
distinction disappears in the future we can just short-circuit both
interfaces.
Shakeel Butt Aug. 12, 2023, 2:29 a.m. UTC | #14
On Fri, Aug 11, 2023 at 7:12 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
>
> I am worried that writing to a stat for flushing then reading will
> increase the staleness window which we are trying to reduce here.
> Would it be acceptable to add a separate interface to explicitly read
> flushed stats without having to write first? If the distinction
> disappears in the future we can just short-circuit both interfaces.

What is the acceptable staleness time window for your case? It is hard
to imagine that a write+read will always be worse than just a read.
Even the proposed patch can have an unintended and larger than
expected staleness window due to some processing on
return-to-userspace or some scheduling delay.
Yosry Ahmed Aug. 12, 2023, 2:35 a.m. UTC | #15
On Fri, Aug 11, 2023 at 7:29 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, Aug 11, 2023 at 7:12 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> [...]
> >
> > I am worried that writing to a stat for flushing then reading will
> > increase the staleness window which we are trying to reduce here.
> > Would it be acceptable to add a separate interface to explicitly read
> > flushed stats without having to write first? If the distinction
> > disappears in the future we can just short-circuit both interfaces.
>
> What is the acceptable staleness time window for your case? It is hard
> to imagine that a write+read will always be worse than just a read.
> Even the proposed patch can have an unintended and larger than
> expected staleness window due to some processing on
> return-to-userspace or some scheduling delay.

Maybe I am worrying too much; we can just go for writing to
memory.stat for an explicit stats refresh.

Do we still want to go with the mutex approach Michal suggested for
do_flush_stats() to support either waiting for ongoing flushes
(mutex_lock) or skipping (mutex_trylock)?
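
For clarity, that mutex variant would look roughly like this (a sketch of
the idea, not a finished patch):

static DEFINE_MUTEX(stats_flush_mutex);

static void do_flush_stats(bool full)
{
	/*
	 * In-kernel (non-full) flushers skip if someone else is already
	 * flushing; "full" flushers (userspace reads) wait for the ongoing
	 * flush and then flush again themselves.
	 */
	if (full)
		mutex_lock(&stats_flush_mutex);
	else if (!mutex_trylock(&stats_flush_mutex))
		return;

	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
	mutex_unlock(&stats_flush_mutex);
}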
Shakeel Butt Aug. 12, 2023, 2:48 a.m. UTC | #16
On Fri, Aug 11, 2023 at 7:36 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Aug 11, 2023 at 7:29 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Fri, Aug 11, 2023 at 7:12 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > [...]
> > >
> > > I am worried that writing to a stat for flushing then reading will
> > > increase the staleness window which we are trying to reduce here.
> > > Would it be acceptable to add a separate interface to explicitly read
> > > flushed stats without having to write first? If the distinction
> > > disappears in the future we can just short-circuit both interfaces.
> >
> > What is the acceptable staleness time window for your case? It is hard
> > to imagine that a write+read will always be worse than just a read.
> > Even the proposed patch can have an unintended and larger than
> > expected staleness window due to some processing on
> > return-to-userspace or some scheduling delay.
>
> Maybe I am worrying too much, we can just go for writing to
> memory.stat for explicit stats refresh.
>
> Do we still want to go with the mutex approach Michal suggested for
> do_flush_stats() to support either waiting for ongoing flushes
> (mutex_lock) or skipping (mutex_trylock)?

I would say keep that as a separate patch.
Michal Hocko Aug. 12, 2023, 8:35 a.m. UTC | #17
On Fri 11-08-23 19:48:14, Shakeel Butt wrote:
> On Fri, Aug 11, 2023 at 7:36 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Fri, Aug 11, 2023 at 7:29 PM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Fri, Aug 11, 2023 at 7:12 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > [...]
> > > >
> > > > I am worried that writing to a stat for flushing then reading will
> > > > increase the staleness window which we are trying to reduce here.
> > > > Would it be acceptable to add a separate interface to explicitly read
> > > > flushed stats without having to write first? If the distinction
> > > > disappears in the future we can just short-circuit both interfaces.
> > >
> > > What is the acceptable staleness time window for your case? It is hard
> > > to imagine that a write+read will always be worse than just a read.
> > > Even the proposed patch can have an unintended and larger than
> > > expected staleness window due to some processing on
> > > return-to-userspace or some scheduling delay.
> >
> > Maybe I am worrying too much, we can just go for writing to
> > memory.stat for explicit stats refresh.
> >
> > Do we still want to go with the mutex approach Michal suggested for
> > do_flush_stats() to support either waiting for ongoing flushes
> > (mutex_lock) or skipping (mutex_trylock)?
> 
> I would say keep that as a separate patch.

Separate patches would be better, but please make the mutex conversion
first. We really do not want any busy waiting that depends on a sleep
to be exported to userspace. That is just a no-go.

Thanks!
Yosry Ahmed Aug. 12, 2023, 11:04 a.m. UTC | #18
On Sat, Aug 12, 2023 at 1:35 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 11-08-23 19:48:14, Shakeel Butt wrote:
> > On Fri, Aug 11, 2023 at 7:36 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Fri, Aug 11, 2023 at 7:29 PM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Fri, Aug 11, 2023 at 7:12 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > [...]
> > > > >
> > > > > I am worried that writing to a stat for flushing then reading will
> > > > > increase the staleness window which we are trying to reduce here.
> > > > > Would it be acceptable to add a separate interface to explicitly read
> > > > > flushed stats without having to write first? If the distinction
> > > > > disappears in the future we can just short-circuit both interfaces.
> > > >
> > > > What is the acceptable staleness time window for your case? It is hard
> > > > to imagine that a write+read will always be worse than just a read.
> > > > Even the proposed patch can have an unintended and larger than
> > > > expected staleness window due to some processing on
> > > > return-to-userspace or some scheduling delay.
> > >
> > > Maybe I am worrying too much, we can just go for writing to
> > > memory.stat for explicit stats refresh.
> > >
> > > Do we still want to go with the mutex approach Michal suggested for
> > > do_flush_stats() to support either waiting for ongoing flushes
> > > (mutex_lock) or skipping (mutex_trylock)?
> >
> > I would say keep that as a separate patch.
>
> Separate patches would be better but please make the mutex conversion
> first. We really do not want to have any busy waiting depending on a
> sleep exported to the userspace. That is just no-go.

+tj@kernel.org

That makes sense.

Taking a step back though, and considering there have been other
complaints about unified flushing causing expensive reads from
memory.stat [1], I am wondering if we should tackle the fundamental
problem.

We have a single global rstat lock for flushing, which protects the
global per-cgroup counters as far as I understand. A single lock means
a lot of contention, which is why we implemented unified flushing on
the memcg side in the first place, where we only let one flusher
operate and everyone else skip, but that flusher needs to flush the
entire tree.

This can be unnecessarily expensive (see [1]), and to avoid how
expensive it is we sacrifice accuracy (what this patch is about). I am
exploring breaking down that lock into per-cgroup locks, where a
flusher acquires locks in a top down fashion. This allows for some
concurrency in flushing, and makes unified flushing unnecessary. If we
retire unified flushing we fix both accuracy and expensive reads at
the same time, while not sacrificing performance for concurrent
in-kernel flushers.

What do you think? I am prototyping something now and running some
tests, it seems promising and simple-ish (unless I am missing a big
correctness issue).

[1] https://lore.kernel.org/lkml/CABWYdi3YNwtPDwwJWmCO-ER50iP7CfbXkCep5TKb-9QzY-a40A@mail.gmail.com/

>
> Thanks!
> --
> Michal Hocko
> SUSE Labs
Tejun Heo Aug. 15, 2023, 12:14 a.m. UTC | #19
Hello,

On Sat, Aug 12, 2023 at 04:04:32AM -0700, Yosry Ahmed wrote:
> Taking a step back though, and considering there have been other
> complaints about unified flushing causing expensive reads from
> memory.stat [1], I am wondering if we should tackle the fundamental
> problem.
> 
> We have a single global rstat lock for flushing, which protects the
> global per-cgroup counters as far as I understand. A single lock means
> a lot of contention, which is why we implemented unified flushing on
> the memcg side in the first place, where we only let one flusher
> operate and everyone else skip, but that flusher needs to flush the
> entire tree.
> 
> This can be unnecessarily expensive (see [1]), and to avoid how
> expensive it is we sacrifice accuracy (what this patch is about). I am
> exploring breaking down that lock into per-cgroup locks, where a
> flusher acquires locks in a top down fashion. This allows for some
> concurrency in flushing, and makes unified flushing unnecessary. If we
> retire unified flushing we fix both accuracy and expensive reads at
> the same time, while not sacrificing performance for concurrent
> in-kernel flushers.
> 
> What do you think? I am prototyping something now and running some
> tests, it seems promising and simple-ish (unless I am missing a big
> correctness issue).

So, the original design used a mutex to synchronize flushing, with the idea
being that updates are high freq but reads are low freq and can be
relatively slow. Using rstat for mm internal operations changed this
assumption quite a bit and we ended up replacing that mutex with a spinlock.

Here are some suggestions:

* Update-side, per-cpu lock should be fine. I don't think splitting them
  would buy us anything meaningful.

* Flush-side, maybe we can break flushing into per-cpu or whatnot but
  there's no avoiding the fact that flushing can take quite a while if there
  is a lot to flush, whether locks are split or not. I wonder whether it'd
  be possible to go back to a mutex for flushing and update the users to
  either consume the cached values or operate in a sleepable context if
  synchronous read is necessary, which is the right thing to do anyway given
  how long flushes can take.

Thanks.
Yosry Ahmed Aug. 15, 2023, 12:28 a.m. UTC | #20
On Mon, Aug 14, 2023 at 5:14 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Sat, Aug 12, 2023 at 04:04:32AM -0700, Yosry Ahmed wrote:
> > Taking a step back though, and considering there have been other
> > complaints about unified flushing causing expensive reads from
> > memory.stat [1], I am wondering if we should tackle the fundamental
> > problem.
> >
> > We have a single global rstat lock for flushing, which protects the
> > global per-cgroup counters as far as I understand. A single lock means
> > a lot of contention, which is why we implemented unified flushing on
> > the memcg side in the first place, where we only let one flusher
> > operate and everyone else skip, but that flusher needs to flush the
> > entire tree.
> >
> > This can be unnecessarily expensive (see [1]), and to avoid how
> > expensive it is we sacrifice accuracy (what this patch is about). I am
> > exploring breaking down that lock into per-cgroup locks, where a
> > flusher acquires locks in a top down fashion. This allows for some
> > concurrency in flushing, and makes unified flushing unnecessary. If we
> > retire unified flushing we fix both accuracy and expensive reads at
> > the same time, while not sacrificing performance for concurrent
> > in-kernel flushers.
> >
> > What do you think? I am prototyping something now and running some
> > tests, it seems promising and simple-ish (unless I am missing a big
> > correctness issue).
>
> So, the original design used mutex for synchronize flushing with the idea
> being that updates are high freq but reads are low freq and can be
> relatively slow. Using rstats for mm internal operations changed this
> assumption quite a bit and we ended up switching that mutex with a lock.

Naive question: do mutexes handle thundering herd problems better than
spinlocks? I would assume so but I am not sure.

>
> Here are some suggestions:
>
> * Update-side, per-cpu lock should be fine. I don't think splitting them
>   would buy us anything meaningful.

I agree, I was mainly concerned with the global flushing lock.

>
> * Flush-side, maybe we can break flushing into per-cpu or whatnot but
>   there's no avoiding the fact that flushing can take quite a while if there
>   are a lot to flush whether locks are split or not. I wonder whether it'd
>   be possible to go back to mutex for flushing and update the users to
>   either consume the cached values or operate in a sleepable context if
>   synchronous read is necessary, which is the right thing to do anyway given
>   how long flushes can take.

Unfortunately it cannot be broken down into per-cpu as all flushers
update the same per-cgroup counters, so we need a bigger locking
scope. Switching to atomics really hurts performance. Breaking down
the lock to be per-cgroup is doable, but since we need to lock both
the parent and the cgroup, flushing top-level cgroups (which I assume
is most common) will lock the root anyway.

All flushers right now operate in sleepable context, so we can go back
to the mutex if you think this will make things better. The
slowness problem reported recently is in a sleepable context; it's
just too slow for userspace if I understand correctly.

+Ivan Babrou

What I am thinking about now is that since all flushers are sleepable,
perhaps the thundering herd problem would be less severe, since we may
release the lock (or mutex) at the cpu boundaries. I wonder if it would
be better if we retired the unified flushing on the memcg side, so that
not all flushers need to flush the entire tree, and we allow
concurrent flushing.

This should address the slowness in reads (as proven by a patch by
Ivan [1]), and it should also address the inaccuracy discussed in this
thread, since no flushers will skip if someone else is flushing.

I am trying to test if there are any regressions by running some
synthetic stress testing (reclaim, refault, read stats, repeat), so
far I can't see any.

Two things that we will need to figure out if we retire unified flushing:
(a) We now have a global "stats_flush_threshold" variable to know when
to flush and when to skip. If flushing is not global, we need to make
this per-cgroup or retire it as well. If we make it per-cgroup it may
affect the update-side, and we will need to move it to the rstat code
I think.

(b) We now have a global "flush_next_time" to know whether the
ratelimited flusher should run or not. If we keep it, only the global
async flusher will push it forward, sync flushers will not. Otherwise
we can also make it per-cgroup and update it during flushes.
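
For context, the current (global) flush_next_time handling looks roughly
like this (paraphrased from mm/memcontrol.c, details may differ between
versions):

#define FLUSH_TIME (2UL * HZ)
static u64 flush_next_time;

static void flush_memcg_stats_dwork(struct work_struct *w);
static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);

/* ratelimited callers only flush if the periodic flusher looks late */
void mem_cgroup_flush_stats_ratelimited(void)
{
	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
		mem_cgroup_flush_stats();
}

static void flush_memcg_stats_dwork(struct work_struct *w)
{
	/* do_flush_stats() pushes flush_next_time forward on every flush */
	do_flush_stats();
	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
}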

[1]https://github.com/bobrik/linux/commit/50b627811d54
>
> Thanks.
>
> --
> tejun
Tejun Heo Aug. 15, 2023, 12:35 a.m. UTC | #21
Hello,

On Mon, Aug 14, 2023 at 05:28:22PM -0700, Yosry Ahmed wrote:
> > So, the original design used mutex for synchronize flushing with the idea
> > being that updates are high freq but reads are low freq and can be
> > relatively slow. Using rstats for mm internal operations changed this
> > assumption quite a bit and we ended up switching that mutex with a lock.
> 
> Naive question, do mutexes handle thundering herd problems better than
> spinlocks? I would assume so but I am not sure.

I don't know. We can ask Waiman if that becomes a problem.

> > * Flush-side, maybe we can break flushing into per-cpu or whatnot but
> >   there's no avoiding the fact that flushing can take quite a while if there
> >   are a lot to flush whether locks are split or not. I wonder whether it'd
> >   be possible to go back to mutex for flushing and update the users to
> >   either consume the cached values or operate in a sleepable context if
> >   synchronous read is necessary, which is the right thing to do anyway given
> >   how long flushes can take.
> 
> Unfortunately it cannot be broken down into per-cpu as all flushers
> update the same per-cgroup counters, so we need a bigger locking
> scope. Switching to atomics really hurts performance. Breaking down
> the lock to be per-cgroup is doable, but since we need to lock both
> the parent and the cgroup, flushing top-level cgroups (which I assume
> is most common) will lock the root anyway.

Plus, there's not much point in flushing in parallel, so I don't feel too
enthusiastic about splitting flush locking.

> All flushers right now operate in sleepable context, so we can go
> again to the mutex if you think this will make things better. The

Yes, I think that'd be more sane.

> slowness problem reported recently is in a sleepable context, it's
> just too slow for userspace if I understand correctly.

I mean, there's a certain amount of work to do. There's no way around it if
you wanna read the counters synchronously. The only solution there would be
using a cached value or having some sort of auto-flushing mechanism so that
the amount to flush doesn't build up too much - e.g. keep a count of the
number of entries to flush and trigger a flush if it goes over some threshold.

Thanks.
Yosry Ahmed Aug. 15, 2023, 12:39 a.m. UTC | #22
On Mon, Aug 14, 2023 at 5:35 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Mon, Aug 14, 2023 at 05:28:22PM -0700, Yosry Ahmed wrote:
> > > So, the original design used mutex for synchronize flushing with the idea
> > > being that updates are high freq but reads are low freq and can be
> > > relatively slow. Using rstats for mm internal operations changed this
> > > assumption quite a bit and we ended up switching that mutex with a lock.
> >
> > Naive question, do mutexes handle thundering herd problems better than
> > spinlocks? I would assume so but I am not sure.
>
> I don't know. We can ask Waiman if that becomes a problem.
>
> > > * Flush-side, maybe we can break flushing into per-cpu or whatnot but
> > >   there's no avoiding the fact that flushing can take quite a while if there
> > >   are a lot to flush whether locks are split or not. I wonder whether it'd
> > >   be possible to go back to mutex for flushing and update the users to
> > >   either consume the cached values or operate in a sleepable context if
> > >   synchronous read is necessary, which is the right thing to do anyway given
> > >   how long flushes can take.
> >
> > Unfortunately it cannot be broken down into per-cpu as all flushers
> > update the same per-cgroup counters, so we need a bigger locking
> > scope. Switching to atomics really hurts performance. Breaking down
> > the lock to be per-cgroup is doable, but since we need to lock both
> > the parent and the cgroup, flushing top-level cgroups (which I assume
> > is most common) will lock the root anyway.
>
> Plus, there's not much point in flushing in parallel, so I don't feel too
> enthusiastic about splitting flush locking.
>
> > All flushers right now operate in sleepable context, so we can go
> > again to the mutex if you think this will make things better. The
>
> Yes, I think that'd be more sane.
>
> > slowness problem reported recently is in a sleepable context, it's
> > just too slow for userspace if I understand correctly.
>
> I mean, there's a certain amount of work to do. There's no way around it if
> you wanna read the counters synchronously. The only solution there would be
> using a cached value or having some sort of auto-flushing mechanism so that
> the amount to flush don't build up too much - e.g. keep a count of the
> number of entries to flush and trigger flush if it goes over some threshold.

I really hoped you'd continue reading past this point :)

My proposed solution was to only flush the needed subtree rather than
flushing the entire tree all the time, which is what we do now on the
memcg side. We already have an asynchronous flusher on the memcg side
that runs every 2s to try to keep the tree size bounded, and we
already keep track of the magnitude of updates and only flush if it's
significant.
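
For reference, the magnitude tracking mentioned above works roughly like
this today (paraphrased; exact names and constants may differ between
kernel versions):

/* update side: accumulate per-cpu, fold into a global counter in batches */
static void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
{
	unsigned int x;

	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());

	x = __this_cpu_add_return(stats_updates, abs(val));
	if (x > MEMCG_CHARGE_BATCH) {
		atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
		__this_cpu_write(stats_updates, 0);
	}
}

/* read side: only bother flushing if enough updates have accumulated */
void mem_cgroup_flush_stats(void)
{
	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
		do_flush_stats();
}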

The problems in this thread and the other one are:
(a) Sometimes reading from userspace is slow because we needlessly
flush the entire tree.
(b) Sometimes reading from userspace is inaccurate because we skip
flushing if someone else is flushing, even though we don't know if
they flushed the subtree we care about yet or not.

I believe dropping unified flushing, if possible of course, may fix
both problems.

>
> Thanks.
>
> --
> tejun
Tejun Heo Aug. 15, 2023, 12:47 a.m. UTC | #23
Hello,

On Mon, Aug 14, 2023 at 05:39:15PM -0700, Yosry Ahmed wrote:
> I believe dropping unified flushing, if possible of course, may fix
> both problems.

Yeah, flushing the whole tree for every stat read will push up the big O
complexity of the operation. It shouldn't be too bad because only what's
updated since the last read will need flushing but if you have a really big
machine with a lot of constantly active cgroups, you're still gonna feel it.
So, yeah, drop that and switch the global lock to mutex and we should all be
good?

Thanks.
Yosry Ahmed Aug. 15, 2023, 12:50 a.m. UTC | #24
On Mon, Aug 14, 2023 at 5:48 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Mon, Aug 14, 2023 at 05:39:15PM -0700, Yosry Ahmed wrote:
> > I believe dropping unified flushing, if possible of course, may fix
> > both problems.
>
> Yeah, flushing the whole tree for every stat read will push up the big O
> complexity of the operation. It shouldn't be too bad because only what's
> updated since the last read will need flushing but if you have a really big
> machine with a lot of constantly active cgroups, you're still gonna feel it.
> So, yeah, drop that and switch the global lock to mutex and we should all be
> good?

I hope so, but I am not sure.

The unified flushing was added initially to mitigate a thundering herd
problem from concurrent in-kernel flushers (e.g. concurrent reclaims),
but back then flushing was atomic so we had to keep the spinlock held
for a long time. I think it should be better now, but I am hoping
Shakeel will chime in since he added the unified flushing originally.

We also need to agree on what to do about stats_flush_threshold and
flush_next_time since they're both global now (since all flushing is
global).

>
> Thanks.
>
> --
> tejun
Waiman Long Aug. 15, 2023, 3:44 p.m. UTC | #25
On 8/14/23 20:35, Tejun Heo wrote:
> Hello,
>
> On Mon, Aug 14, 2023 at 05:28:22PM -0700, Yosry Ahmed wrote:
>>> So, the original design used mutex for synchronize flushing with the idea
>>> being that updates are high freq but reads are low freq and can be
>>> relatively slow. Using rstats for mm internal operations changed this
>>> assumption quite a bit and we ended up switching that mutex with a lock.
>> Naive question, do mutexes handle thundering herd problems better than
>> spinlocks? I would assume so but I am not sure.
> I don't know. We can ask Waiman if that becomes a problem.

We had essentially solved the thundering herd problems for both
spinlocks and mutexes. Both types of lock waiters will spin on their own
cachelines (in the OSQ wait queue in the case of a mutex) except the one
at the head of the queue. So there should be minimal cacheline
bouncing. One should certainly use mutexes in sleepable context or
when the critical section is long.

Cheers,
Longman
Shakeel Butt Aug. 16, 2023, 12:23 a.m. UTC | #26
+Ivan

On Mon, Aug 14, 2023 at 5:51 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Aug 14, 2023 at 5:48 PM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Mon, Aug 14, 2023 at 05:39:15PM -0700, Yosry Ahmed wrote:
> > > I believe dropping unified flushing, if possible of course, may fix
> > > both problems.
> >
> > Yeah, flushing the whole tree for every stat read will push up the big O
> > complexity of the operation. It shouldn't be too bad because only what's
> > updated since the last read will need flushing but if you have a really big
> > machine with a lot of constantly active cgroups, you're still gonna feel it.
> > So, yeah, drop that and switch the global lock to mutex and we should all be
> > good?
>
> I hope so, but I am not sure.
>
> The unified flushing was added initially to mitigate a thundering herd
> problem from concurrent in-kernel flushers (e.g. concurrent reclaims),
> but back then flushing was atomic so we had to keep the spinlock held
> for a long time. I think it should be better now, but I am hoping
> Shakeel will chime in since he added the unified flushing originally.
>
> We also need to agree on what to do about stats_flushing_threshold and
> flush_next_time since they're both global now (since all flushing is
> global).
>

I thought we already reached the decision on how to proceed here. Let
me summarize what I think we should do:

1. Completely remove the sync flush from stat files read from userspace.
2. Provide a separate way/interface to explicitly flush stats for
users who want more accurate stats and can pay the cost. This is
similar to the stat_refresh interface.
3. Keep the 2 sec periodic stats flusher.
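
A sketch of what (2) could look like, using a write handler as Michal
suggested (a hypothetical interface, just to illustrate the shape):

static ssize_t memory_stat_write(struct kernfs_open_file *of, char *buf,
				 size_t nbytes, loff_t off)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));

	/* any write means "flush my subtree now, I am willing to pay" */
	cgroup_rstat_flush(memcg->css.cgroup);
	return nbytes;
}

/* hooked up via the existing memory.stat cftype: .write = memory_stat_write */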
Yosry Ahmed Aug. 16, 2023, 12:29 a.m. UTC | #27
On Tue, Aug 15, 2023 at 5:23 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> +Ivan
>
> On Mon, Aug 14, 2023 at 5:51 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Aug 14, 2023 at 5:48 PM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > Hello,
> > >
> > > On Mon, Aug 14, 2023 at 05:39:15PM -0700, Yosry Ahmed wrote:
> > > > I believe dropping unified flushing, if possible of course, may fix
> > > > both problems.
> > >
> > > Yeah, flushing the whole tree for every stat read will push up the big O
> > > complexity of the operation. It shouldn't be too bad because only what's
> > > updated since the last read will need flushing but if you have a really big
> > > machine with a lot of constantly active cgroups, you're still gonna feel it.
> > > So, yeah, drop that and switch the global lock to mutex and we should all be
> > > good?
> >
> > I hope so, but I am not sure.
> >
> > The unified flushing was added initially to mitigate a thundering herd
> > problem from concurrent in-kernel flushers (e.g. concurrent reclaims),
> > but back then flushing was atomic so we had to keep the spinlock held
> > for a long time. I think it should be better now, but I am hoping
> > Shakeel will chime in since he added the unified flushing originally.
> >
> > We also need to agree on what to do about stats_flushing_threshold and
> > flush_next_time since they're both global now (since all flushing is
> > global).
> >
>
> I thought we already reached the decision on how to proceed here. Let
> me summarize what I think we should do:
>
> 1. Completely remove the sync flush from stat files read from userspace.
> 2. Provide a separate way/interface to explicitly flush stats for
> users who want more accurate stats and can pay the cost. This is
> similar to the stat_refresh interface.
> 3. Keep the 2 sec periodic stats flusher.

I think this solution is suboptimal, to be honest; I think we can do better.

With recent improvements to spinlocks/mutexes, and flushers becoming
sleepable, I think a better solution would be to remove unified
flushing and let everyone only flush the subtree they care about. Sync
flushing becomes much better (unless you're flushing root ofc), and
concurrent flushing wouldn't cause too many problems (ideally no
thundering herd, and rstat lock can be dropped at cpu boundaries in
cgroup_rstat_flush_locked()).
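
(The cpu-boundary yielding referred to here already exists in
kernel/cgroup/rstat.c; roughly, and paraphrased, the flush loop does the
following after each cpu:)

	/* inside cgroup_rstat_flush_locked(), after flushing one cpu */
	if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
		spin_unlock_irq(&cgroup_rstat_lock);
		if (!cond_resched())
			cpu_relax();
		spin_lock_irq(&cgroup_rstat_lock);
	}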

If we do this, stat reads can be much faster as Ivan demonstrated with
his patch that only flushes the cgroup being read, and we do not
sacrifice accuracy as we never skip flushing. We also do not need a
separate interface for explicit refresh.

In all cases, we need to keep the 2 sec periodic flusher. What we need
to figure out if we remove unified flushing is:

1. Handling stats_flush_threshold.
2. Handling flush_next_time.

Both of these are global now, and will need to be adapted to
non-unified non-global flushing.
Shakeel Butt Aug. 16, 2023, 1:14 a.m. UTC | #28
On Tue, Aug 15, 2023 at 5:29 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
> >
> > I thought we already reached the decision on how to proceed here. Let
> > me summarize what I think we should do:
> >
> > 1. Completely remove the sync flush from stat files read from userspace.
> > 2. Provide a separate way/interface to explicitly flush stats for
> > users who want more accurate stats and can pay the cost. This is
> > similar to the stat_refresh interface.
> > 3. Keep the 2 sec periodic stats flusher.
>
> I think this solution is suboptimal to be honest, I think we can do better.
>
> With recent improvements to spinlocks/mutexes, and flushers becoming
> sleepable, I think a better solution would be to remove unified
> flushing and let everyone only flush the subtree they care about. Sync
> flushing becomes much better (unless you're flushing root ofc), and
> concurrent flushing wouldn't cause too many problems (ideally no
> thundering herd, and rstat lock can be dropped at cpu boundaries in
> cgroup_rstat_flush_locked()).
>
> If we do this, stat reads can be much faster as Ivan demonstrated with
> his patch that only flushes the cgroup being read, and we do not
> sacrifice accuracy as we never skip flushing. We also do not need a
> separate interface for explicit refresh.
>
> In all cases, we need to keep the 2 sec periodic flusher. What we need
> to figure out if we remove unified flushing is:
>
> 1. Handling stats_flush_threshold.
> 2. Handling flush_next_time.
>
> Both of these are global now, and will need to be adapted to
> non-unified non-global flushing.

The only thing we are disagreeing on is (1) the complete removal of
sync flush and an explicit flush interface versus (2) keeping the
sync flush of the subtree.

To me (1) seems more optimal, particularly for the server use-case
where a node controller reads stats of the root as well as cgroups of
a couple of top levels (we actually do this internally). Doing the flush
once explicitly and then reading the stats for all such cgroups seems
better to me.
Yosry Ahmed Aug. 16, 2023, 2:19 a.m. UTC | #29
On Tue, Aug 15, 2023 at 6:14 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Tue, Aug 15, 2023 at 5:29 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> [...]
> > >
> > > I thought we already reached the decision on how to proceed here. Let
> > > me summarize what I think we should do:
> > >
> > > 1. Completely remove the sync flush from stat files read from userspace.
> > > 2. Provide a separate way/interface to explicitly flush stats for
> > > users who want more accurate stats and can pay the cost. This is
> > > similar to the stat_refresh interface.
> > > 3. Keep the 2 sec periodic stats flusher.
> >
> > I think this solution is suboptimal to be honest, I think we can do better.
> >
> > With recent improvements to spinlocks/mutexes, and flushers becoming
> > sleepable, I think a better solution would be to remove unified
> > flushing and let everyone only flush the subtree they care about. Sync
> > flushing becomes much better (unless you're flushing root ofc), and
> > concurrent flushing wouldn't cause too many problems (ideally no
> > thundering herd, and rstat lock can be dropped at cpu boundaries in
> > cgroup_rstat_flush_locked()).
> >
> > If we do this, stat reads can be much faster as Ivan demonstrated with
> > his patch that only flushes the cgroup being read, and we do not
> > sacrifice accuracy as we never skip flushing. We also do not need a
> > separate interface for explicit refresh.
> >
> > In all cases, we need to keep the 2 sec periodic flusher. What we need
> > to figure out if we remove unified flushing is:
> >
> > 1. Handling stats_flush_threshold.
> > 2. Handling flush_next_time.
> >
> > Both of these are global now, and will need to be adapted to
> > non-unified non-global flushing.
>
> The only thing we are disagreeing on is (1) the complete removal of
> sync flush and an explicit flush interface versus (2) keep doing the
> sync flush of the subtree.
>
> To me (1) seems more optimal particularly for the server use-case
> where a node controller reads stats of root and as well as cgroups of
> a couple of top levels (we actually do this internally). Doing flush
> once explicitly and then reading the stats for all such cgroups seems
> better to me.

The problem in (1) is that first of all it's a behavioral change, we
start having explicit staleness in the stats, and userspace needs to
adapt by explicitly requesting a flush. A node controller can be
enlightened to do so, but on a system with a lot of cgroups, if you
flush once explicitly and iterate through all cgroups, the flush will
be stale by the time you reach the last cgroup. Keep in mind there are
also users that read their own stats; figuring out which users need to
flush explicitly vs. read cached stats is a problem.

Taking a step back, the total work that needs to be done does not
change with (2). A node controller iterating cgroups and reading their
stats will do the same amount of flushing, it will just be distributed
across multiple read syscalls, so shorter intervals in kernel space.

There are also in-kernel flushers (e.g. reclaim and dirty throttling)
that will benefit from (2) by reading more accurate stats without
having to flush the entire tree. The behavior is currently
indeterministic, you may get fresh or stale stats, you may flush one
cgroup or 100 cgroups.

I think with (2) we make less compromises in terms of accuracy and
determinism, and it's a less disruptive change to userspace.
Shakeel Butt Aug. 16, 2023, 5:11 p.m. UTC | #30
On Tue, Aug 15, 2023 at 7:20 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
>
> The problem in (1) is that first of all it's a behavioral change, we
> start having explicit staleness in the stats, and userspace needs to
> adapt by explicitly requesting a flush. A node controller can be
> enlightened to do so, but on a system with a lot of cgroups, if you
> flush once explicitly and iterate through all cgroups, the flush will
> be stale by the time you reach the last cgroup. Keep in mind there are
> also users that read their own stats, figuring out which users need to
> flush explicitly vs. read cached stats is a problem.

I thought we covered the invalidity of the staleness argument. Similar
staleness can happen today, so not strictly a behavioral change. We
can change the time window and condition of the periodic flush to
reduce the chance of staleness. Option 2 can face staleness as well.

>
> Taking a step back, the total work that needs to be done does not
> change with (2). A node controller iterating cgroups and reading their
> stats will do the same amount of flushing, it will just be distributed
> across multiple read syscalls, so shorter intervals in kernel space.

You seem to be worried about the very fine-grained staleness of the
stats. So, for scenarios where stats of multi-level cgroups need to be
read and the workload is continuously updating the stats, the total
work can be much higher. For example, if we are reading stats of root and
top-level memcgs, then option 2 can potentially flush the stats for the
whole tree twice.

>
> There are also in-kernel flushers (e.g. reclaim and dirty throttling)
> that will benefit from (2) by reading more accurate stats without
> having to flush the entire tree. The behavior is currently
> indeterministic, you may get fresh or stale stats, you may flush one
> cgroup or 100 cgroups.
>
> I think with (2) we make less compromises in terms of accuracy and
> determinism, and it's a less disruptive change to userspace.

These options are not black and white and there can be something in
between, but let me be very clear on what I don't want and would NACK.
I don't want a global sleepable lock which can be taken by potentially
any application running on the system. We have seen similar global
locks causing isolation and priority inversion issues in production.
So, not another lock which needs to be taken under extreme conditions
(reading stats under OOM) by a high-priority task (node controller)
and might be held by a low-priority task.
Tejun Heo Aug. 16, 2023, 7:08 p.m. UTC | #31
Hello,

On Wed, Aug 16, 2023 at 10:11:20AM -0700, Shakeel Butt wrote:
> These options are not white and black and there can be something in
> between but let me be very clear on what I don't want and would NACK.

I'm not a big fan of interfaces with hidden states. What you're proposing
isn't strictly that but it's still a bit nasty. So, if we can get by without
doing that, that'd be great.

> I don't want a global sleepable lock which can be taken by potentially
> any application running on the system. We have seen similar global
> locks causing isolation and priority inversion issues in production.
> So, not another lock which needs to be taken under extreme condition
> (reading stats under OOM) by a high priority task (node controller)
> and might be held by a low priority task.

Yeah, this is a real concern. Those priority inversions do occur and can be
serious but causing serious problems under memory pressure usually requires
involving memory allocations and IOs. Here, it's just all CPU. So, at least
in OOM conditions, this shouldn't be in the way (the system wouldn't have
anything else to do anyway).

It is true that this still can lead to priority inversion through CPU competition tho.
However, that problem isn't necessarily solved by what you're suggesting
either unless you want to restrict explicit flushing based on permissions
which is another can of worms.

My preference is not exposing this in the user interface. This is mostly
arising from internal implementation details and isn't what users
necessarily care about. There are many things we can do on the kernel side
to make trade-offs among overhead, staleness and priority inversions. If we
make this an explicit userland interface behavior, we get locked into those
semantics, which we'll likely regret at some point in the future.

Thanks.
Yosry Ahmed Aug. 16, 2023, 10:35 p.m. UTC | #32
On Wed, Aug 16, 2023 at 12:08 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Aug 16, 2023 at 10:11:20AM -0700, Shakeel Butt wrote:
> > These options are not white and black and there can be something in
> > between but let me be very clear on what I don't want and would NACK.
>
> I'm not a big fan of interfaces with hidden states. What you're proposing
> isn't strictly that but it's still a bit nasty. So, if we can get by without
> doing that, that'd be great.

Agreed. I will try to send patches soon implementing option (2) above,
basically removing unified flushing. I will try to document any
potential regressions that may happen and how we may fix them. Ideally
we see no regressions.

>
> > I don't want a global sleepable lock which can be taken by potentially
> > any application running on the system. We have seen similar global
> > locks causing isolation and priority inversion issues in production.
> > So, not another lock which needs to be taken under extreme condition
> > (reading stats under OOM) by a high priority task (node controller)
> > and might be held by a low priority task.
>
> Yeah, this is a real concern. Those priority inversions do occur and can be
> serious but causing serious problems under memory pressure usually requires
> involving memory allocations and IOs. Here, it's just all CPU. So, at least
> in OOM conditions, this shouldn't be in the way (the system wouldn't have
> anything else to do anyway).
>
> It is true that this still can lead to priority through CPU competition tho.
> However, that problem isn't necessarily solved by what you're suggesting
> either unless you want to restrict explicit flushing based on permissions
> which is another can of worms.

Right. Also, in the case of a mutex, if we disable preemption while
holding the mutex, this makes sure that whoever is holding the mutex
does not starve waiters. Essentially the difference would be that
waiters will sleep on the mutex instead of spinning, but the mutex
holder itself wouldn't sleep.

I will make this a separate patch, just in case it's too
controversial. Switching the spinlock to a mutex should not block
removing unified flushing.
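
Not the patch below, and not necessarily what the final patches will
do, but as a rough sketch the skip-or-wait behavior with a mutex could
look like this (the lock name is made up, and the preemption-disabling
detail is left out since it would require a non-sleeping flush):

static DEFINE_MUTEX(stats_flush_mutex);

static void do_flush_stats(bool full)
{
	/* Best-effort flushers skip if a flush is already in progress. */
	if (!mutex_trylock(&stats_flush_mutex)) {
		if (!full)
			return;
		/* Accuracy-critical flushers sleep on the mutex instead. */
		mutex_lock(&stats_flush_mutex);
	}

	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
	mutex_unlock(&stats_flush_mutex);
}

Waiters sleep instead of spinning, and once a waiter gets the mutex it
does its own flush, so the stats it cares about are fully flushed.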

>
> My preference is not exposing this in user interface. This is mostly arising
> from internal implementation details and isn't what users necessarily care
> about. There are many things we can do on the kernel side to make trade-offs
> among overhead, staleness and priority inversions. If we make this an
> explicit userland interface behavior, we get locked into that semantics
> which we'll likely regret in some future.
>

Yeah that's what I am trying to do here as well.  I will try to follow
up on this discussion with patches soon.

Thanks everyone!
Yosry Ahmed Aug. 18, 2023, 9:40 p.m. UTC | #33
On Wed, Aug 16, 2023 at 3:35 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Aug 16, 2023 at 12:08 PM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Wed, Aug 16, 2023 at 10:11:20AM -0700, Shakeel Butt wrote:
> > > These options are not white and black and there can be something in
> > > between but let me be very clear on what I don't want and would NACK.
> >
> > I'm not a big fan of interfaces with hidden states. What you're proposing
> > isn't strictly that but it's still a bit nasty. So, if we can get by without
> > doing that, that'd be great.
>
> Agreed. I will try to send patches soon implementing option (2) above,
> basically removing unified flushing. I will try to document any
> potential regressions that may happen and how we may fix them. Ideally
> we see no regressions.
>
> >
> > > I don't want a global sleepable lock which can be taken by potentially
> > > any application running on the system. We have seen similar global
> > > locks causing isolation and priority inversion issues in production.
> > > So, not another lock which needs to be taken under extreme condition
> > > (reading stats under OOM) by a high priority task (node controller)
> > > and might be held by a low priority task.
> >
> > Yeah, this is a real concern. Those priority inversions do occur and can be
> > serious but causing serious problems under memory pressure usually requires
> > involving memory allocations and IOs. Here, it's just all CPU. So, at least
> > in OOM conditions, this shouldn't be in the way (the system wouldn't have
> > anything else to do anyway).
> >
> > It is true that this still can lead to priority through CPU competition tho.
> > However, that problem isn't necessarily solved by what you're suggesting
> > either unless you want to restrict explicit flushing based on permissions
> > which is another can of worms.
>
> Right. Also in the case of a mutex, if we disable preemption while
> holding the mutex, this makes sure that whoever holding the mutex does
> not starve waiters. Essentially the difference would be that waiters
> will sleep with the mutex instead of spinning, but the mutex holder
> itself wouldn't sleep.
>
> I will make this a separate patch, just in case it's too
> controversial. Switching the spinlock to a mutex should not block
> removing unified flushing.
>
> >
> > My preference is not exposing this in user interface. This is mostly arising
> > from internal implementation details and isn't what users necessarily care
> > about. There are many things we can do on the kernel side to make trade-offs
> > among overhead, staleness and priority inversions. If we make this an
> > explicit userland interface behavior, we get locked into that semantics
> > which we'll likely regret in some future.
> >
>
> Yeah that's what I am trying to do here as well.  I will try to follow
> up on this discussion with patches soon.
>
> Thanks everyone!

So, a status update. I tried removing unified flushing. I can send the
patches if it helps visualize things, but essentially
mem_cgroup_flush_stats{_ratelimited}() takes a memcg argument that it
passes down to cgroup_rstat_flush(). There is no skipping if someone
else is flushing (stats_flush_ongoing), and no magnitude threshold that
updates must cross before we flush (stats_flush_threshold).
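
As a rough sketch (not the exact patches), the flush entry point ends
up looking something like:

void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
{
	if (!memcg)
		memcg = root_mem_cgroup;

	/* No stats_flush_ongoing or stats_flush_threshold checks anymore. */
	cgroup_rstat_flush(memcg->css.cgroup);
}

Callers that used to flush the whole tree now pass their own memcg and
only pay for flushing that subtree.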

I tested this with a synthetic reclaim stress test that I wrote,
because reclaim is the easiest way to stress in-kernel flushing. I
basically create tens or hundreds of cgroups and run workers in them
that keep allocating memory beyond the limit. I also run workers that
periodically read the stats.

Removing unified flushing makes such a synthetic benchmark 2-3 times
slower. This is not surprising, as all these concurrent reclaimers
used to just skip flushing, regardless of whether or not they get
accurate stats. Now I don't know how realistic such a benchmark is,
but if there's a workload somewhere that runs into such conditions it
will surely regress.

Sorry if the above is difficult to visualize. I can send the patches
and the test script if it makes things easier, I just didn't want to
overwhelm the thread.

I think there are 2 options going forward:
(a) If we believe removing unified flushing is the best way forward
for most use cases, then we need to find a better way of testing this
approach practically. Any suggestions here are useful.

(b) Decide that it's too expensive to remove unified flushing
completely, at least for in-kernel flushers that can see a lot of
concurrency. We would remove unified flushing only for userspace reads:
only flush the memcg being read and never skip (essentially what Ivan
tried); see the sketch below. There are already some flushing contexts
that do this (e.g. the zswap memcg charging code). I believe as long as
there isn't a lot of concurrency in these paths, it should be fine to
skip unified flushing for them and keep it only for in-kernel flushers.
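
Roughly, for the userspace read path only (the helper name below is
made up; this is just a sketch):

/* Userspace readers flush only the subtree being read, and never skip. */
static void mem_cgroup_flush_stats_subtree(struct mem_cgroup *memcg)
{
	cgroup_rstat_flush(memcg->css.cgroup);
}

static int memory_stat_show(struct seq_file *m, void *v)
{
	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);

	mem_cgroup_flush_stats_subtree(memcg);
	/* ... format and print the stats as before ... */
	return 0;
}

In-kernel flushers (reclaim, refaults, dirty throttling) would keep the
current unified/skipping behavior.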

Any thoughts on this?
Yosry Ahmed Aug. 21, 2023, 8:58 p.m. UTC | #34
On Fri, Aug 18, 2023 at 2:40 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Aug 16, 2023 at 3:35 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Wed, Aug 16, 2023 at 12:08 PM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > Hello,
> > >
> > > On Wed, Aug 16, 2023 at 10:11:20AM -0700, Shakeel Butt wrote:
> > > > These options are not white and black and there can be something in
> > > > between but let me be very clear on what I don't want and would NACK.
> > >
> > > I'm not a big fan of interfaces with hidden states. What you're proposing
> > > isn't strictly that but it's still a bit nasty. So, if we can get by without
> > > doing that, that'd be great.
> >
> > Agreed. I will try to send patches soon implementing option (2) above,
> > basically removing unified flushing. I will try to document any
> > potential regressions that may happen and how we may fix them. Ideally
> > we see no regressions.
> >
> > >
> > > > I don't want a global sleepable lock which can be taken by potentially
> > > > any application running on the system. We have seen similar global
> > > > locks causing isolation and priority inversion issues in production.
> > > > So, not another lock which needs to be taken under extreme condition
> > > > (reading stats under OOM) by a high priority task (node controller)
> > > > and might be held by a low priority task.
> > >
> > > Yeah, this is a real concern. Those priority inversions do occur and can be
> > > serious but causing serious problems under memory pressure usually requires
> > > involving memory allocations and IOs. Here, it's just all CPU. So, at least
> > > in OOM conditions, this shouldn't be in the way (the system wouldn't have
> > > anything else to do anyway).
> > >
> > > It is true that this still can lead to priority through CPU competition tho.
> > > However, that problem isn't necessarily solved by what you're suggesting
> > > either unless you want to restrict explicit flushing based on permissions
> > > which is another can of worms.
> >
> > Right. Also in the case of a mutex, if we disable preemption while
> > holding the mutex, this makes sure that whoever holding the mutex does
> > not starve waiters. Essentially the difference would be that waiters
> > will sleep with the mutex instead of spinning, but the mutex holder
> > itself wouldn't sleep.
> >
> > I will make this a separate patch, just in case it's too
> > controversial. Switching the spinlock to a mutex should not block
> > removing unified flushing.
> >
> > >
> > > My preference is not exposing this in user interface. This is mostly arising
> > > from internal implementation details and isn't what users necessarily care
> > > about. There are many things we can do on the kernel side to make trade-offs
> > > among overhead, staleness and priority inversions. If we make this an
> > > explicit userland interface behavior, we get locked into that semantics
> > > which we'll likely regret in some future.
> > >
> >
> > Yeah that's what I am trying to do here as well.  I will try to follow
> > up on this discussion with patches soon.
> >
> > Thanks everyone!
>
> So status update. I tried implementing removing unified flushing. I
> can send the patches if it helps visualize things, but essentially
> mem_cgroup_flush_stats{_ratelimited} takes a memcg argument that it
> passes to cgroup_flush_stats(). No skipping if someone else is
> flushing (stats_flush_ongoing) and no threshold at which we start
> flushing (stats_flush_threshold).
>
> I tested this by some synthetic reclaim stress test that I wrote,
> because reclaim is the easiest way to stress in-kernel flushing. I
> basically create 10s or 100s cgroups and run workers in them that keep
> allocating memory beyond the limit. I also run workers that
> periodically read the stats.
>
> Removing unified flushing makes such a synthetic benchmark 2-3 times
> slower. This is not surprising, as all these concurrent reclaimers
> used to just skip flushing, regardless of whether or not they get
> accurate stats. Now I don't know how realistic such a benchmark is,
> but if there's a workload somewhere that runs into such conditions it
> will surely regress.
>
> Sorry if the above is difficult to visualize. I can send the patches
> and the test script if it makes things easier, I just didn't want to
> overwhelm the thread.
>
> I think there are 2 options going forward:
> (a) If we believe removing unified flushing is the best way forward
> for most use cases, then we need to find a better way of testing this
> approach practically. Any suggestions here are useful.
>
> (b) Decide that it's too expensive to remove unified flushing
> completely, at least for in-kernel flushers that can see a lot of
> concurrency. We can only remove unified flushing for userspace reads.
> Only flush the memcg being read and never skip (essentially what Ivan
> tried). There are already some flushing contexts that do this (e.g.
> zswap memcg charging code). I believe as long as there isn't a lot of
> concurrency in these paths it should be fine to skip unified flushing
> for them, and keep it only for in-kernel flushers.

I sent a short series implementing (b) above:
https://lore.kernel.org/lkml/20230821205458.1764662-1-yosryahmed@google.com/

Basic testing shows that it works well.
diff mbox series

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e041ba827e59..38e227f7127d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -630,7 +630,7 @@  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 		/*
 		 * If stats_flush_threshold exceeds the threshold
 		 * (>num_online_cpus()), cgroup stats update will be triggered
-		 * in __mem_cgroup_flush_stats(). Increasing this var further
+		 * in mem_cgroup_flush_stats(). Increasing this var further
 		 * is redundant and simply adds overhead in atomic update.
 		 */
 		if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
@@ -639,17 +639,24 @@  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
-static void do_flush_stats(void)
+static void do_flush_stats(bool full)
 {
+	if (!atomic_read(&stats_flush_ongoing) &&
+	    !atomic_xchg(&stats_flush_ongoing, 1))
+		goto flush;
+
 	/*
-	 * We always flush the entire tree, so concurrent flushers can just
-	 * skip. This avoids a thundering herd problem on the rstat global lock
-	 * from memcg flushers (e.g. reclaim, refault, etc).
+	 * We always flush the entire tree, so concurrent flushers can choose to
+	 * skip if accuracy is not critical. Otherwise, wait for the ongoing
+	 * flush to complete. This avoids a thundering herd problem on the rstat
+	 * global lock from memcg flushers (e.g. reclaim, refault, etc).
 	 */
-	if (atomic_read(&stats_flush_ongoing) ||
-	    atomic_xchg(&stats_flush_ongoing, 1))
-		return;
-
+	while (full && atomic_read(&stats_flush_ongoing) == 1) {
+		if (!cond_resched())
+			cpu_relax();
+	}
+	return;
+flush:
 	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
 
 	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
@@ -661,7 +668,12 @@  static void do_flush_stats(void)
 void mem_cgroup_flush_stats(void)
 {
 	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
-		do_flush_stats();
+		do_flush_stats(false);
+}
+
+static void mem_cgroup_flush_stats_full(void)
+{
+	do_flush_stats(true);
 }
 
 void mem_cgroup_flush_stats_ratelimited(void)
@@ -676,7 +688,7 @@  static void flush_memcg_stats_dwork(struct work_struct *w)
 	 * Always flush here so that flushing in latency-sensitive paths is
 	 * as cheap as possible.
 	 */
-	do_flush_stats();
+	do_flush_stats(false);
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
@@ -1576,7 +1588,7 @@  static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_full();
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4018,7 +4030,7 @@  static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_full();
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4093,7 +4105,7 @@  static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_full();
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -6610,7 +6622,7 @@  static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_full();
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;