
memcg: use ratelimited stats flush in the reclaim

Message ID 20240615081257.3945587-1-shakeel.butt@linux.dev (mailing list archive)
State New
Series memcg: use ratelimited stats flush in the reclaim

Commit Message

Shakeel Butt June 15, 2024, 8:12 a.m. UTC
Meta production is seeing a large number of stalls in the memcg stats
flush from the memcg reclaim code path. At the moment, this specific
callsite does a synchronous memcg stats flush. The rstat flush is an
expensive and time consuming operation, so concurrent reclaimers can
busywait on the lock for potentially a long time. This issue is not
unique to Meta; Cloudflare has observed it as well [1]. In the
Cloudflare case, the stalls were due to contention between kswapd
threads running on their 8-NUMA-node machines, which makes little
sense, since the rstat flush is global and a flush from one kswapd
thread should be sufficient for all of them. Simply replace the
synchronous flush with the ratelimited one.

One may be concerned about using stats that are up to 2 seconds stale
(at worst) for heuristics like the desirable inactive:active ratio and
preferring inactive file pages over anon pages, but these specific
heuristics do not require very precise stats and are also ignored under
severe memory pressure. This patch has been running on the Meta fleet
for more than a month and we have not observed any issues. Please note
that MGLRU is not impacted by this issue at all, as it avoids rstat
flushing completely.

Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@kernel.org [1]
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Yosry Ahmed June 16, 2024, 12:28 a.m. UTC | #1
On Sat, Jun 15, 2024 at 1:13 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> The Meta prod is seeing large amount of stalls in memcg stats flush
> from the memcg reclaim code path. At the moment, this specific callsite
> is doing a synchronous memcg stats flush. The rstat flush is an
> expensive and time consuming operation, so concurrent relaimers will
> busywait on the lock potentially for a long time. Actually this issue is
> not unique to Meta and has been observed by Cloudflare [1] as well. For
> the Cloudflare case, the stalls were due to contention between kswapd
> threads running on their 8 numa node machines which does not make sense
> as rstat flush is global and flush from one kswapd thread should be
> sufficient for all. Simply replace the synchronous flush with the
> ratelimited one.
>
> One may raise a concern on potentially using 2 sec stale (at worst)
> stats for heuristics like desirable inactive:active ratio and preferring
> inactive file pages over anon pages but these specific heuristics do not
> require very precise stats and also are ignored under severe memory
> pressure. This patch has been running on Meta fleet for more than a
> month and we have not observed any issues. Please note that MGLRU is not
> impacted by this issue at all as it avoids rstat flushing completely.
>
> Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@kernel.org [1]
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c0429fd6c573..bda4f92eba71 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2263,7 +2263,7 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
>          * Flush the memory cgroup stats, so that we read accurate per-memcg
>          * lruvec stats for heuristics.
>          */
> -       mem_cgroup_flush_stats(sc->target_mem_cgroup);
> +       mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup);

I think you already know my opinion about this one :) I don't like it
at all, and I will explain why below. I know it may be a necessary
evil, but I would like us to make sure there is no other option before
going forward with this.

Unfortunately, I am travelling this week, so I probably won't be able
to follow up on this for a week or so, but I will try to lay down my
thoughts as much as I can.

Why don't I like this?

- From a high level, I don't like the approach of replacing
problematic flushing calls with the ratelimited version. It strikes me
as a whac-a-mole approach that is mitigating symptoms of a larger
problem.

- With the added thresholding code, a flush is only done if there is a
significant number of pending updates in the relevant subtree.
Choosing the ratelimited approach is intentionally ignoring a
significant change in stats (although arguably it could be irrelevant
stats).

- Reclaim code is an iterative process, so not updating the stats on
every retry is very counterintuitive. We are retrying reclaim using
the same stats and heuristics used by a previous iteration,
essentially dismissing the effects of those previous iterations.

- Nondeterministic behavior like this is very difficult to debug if
it causes problems. The missing updates in the last 2s (or whatever
period) could be of any magnitude. We may be ignoring GBs of
free/allocated memory. What's worse is, if it causes any problems,
tracing it back to this flush will be extremely difficult.

What can we do?

- Try to make more fundamental improvements to the flushing code (for
memcgs or cgroups in general). The per-memcg flushing thresholding is
an example of this. For example, if flushing is taking too long
because we are flushing all subsystems, it may make sense to have
separate rstat trees for separate subsystems.

One other thing we can try is to add a mutex in the memcg flushing path.
I initially had this in my subtree flushing series [1], but I dropped it
as we thought it was not very useful. Currently in
mem_cgroup_flush_stats(), we check if there are enough pending updates
to flush, then we call cgroup_flush_stats() and spin on the lock. It
is possible that while we spin, those pending updates we observed have
been flushed. If we add back the mutex like in [1], then once we
acquire the mutex we check again to make sure there are still enough
stats to flush.

[1] https://lore.kernel.org/all/20231010032117.1577496-6-yosryahmed@google.com/
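
A rough sketch of that idea (memcg_should_flush_stats() here stands in
for the existing update-threshold check; the other names are
approximate):

static DEFINE_MUTEX(memcg_flush_mutex);

void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
{
        if (!memcg_should_flush_stats(memcg))   /* threshold check */
                return;

        mutex_lock(&memcg_flush_mutex);
        /*
         * Re-check under the mutex: a concurrent flusher may have
         * already flushed the pending updates we observed while we
         * were waiting.
         */
        if (memcg_should_flush_stats(memcg))
                cgroup_rstat_flush(memcg->css.cgroup);
        mutex_unlock(&memcg_flush_mutex);
}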

- Try to avoid the need for flushing in this path. I am not sure what
approach MGLRU uses to avoid the flush, but if we can do something
similar for classic LRUs that would be preferable. I am guessing MGLRU
may be maintaining its own stats outside of the rstat framework.

- Try to figure out if one (or a few) update paths are regressing all
flushers. If one specific stat or stats update path is causing most of
the updates, we can try to fix that instead. Especially if it's a
counter that is continuously being incremented and decremented (so the net
change is not as high as we think).

At the end of the day, all of the above may not work, and we may have
to live with just using the ratelimited approach. But I *really* hope
we could actually go the other way. Fix things on a more fundamental
level and eventually drop the ratelimited variants completely.

Just my 2c. Sorry for the long email :)
Jesper Dangaard Brouer June 17, 2024, 3:31 p.m. UTC | #2
On 16/06/2024 02.28, Yosry Ahmed wrote:
> On Sat, Jun 15, 2024 at 1:13 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>
>> The Meta prod is seeing large amount of stalls in memcg stats flush
>> from the memcg reclaim code path. At the moment, this specific callsite
>> is doing a synchronous memcg stats flush. The rstat flush is an
>> expensive and time consuming operation, so concurrent relaimers will
>> busywait on the lock potentially for a long time. Actually this issue is
>> not unique to Meta and has been observed by Cloudflare [1] as well. For
>> the Cloudflare case, the stalls were due to contention between kswapd
>> threads running on their 8 numa node machines which does not make sense
>> as rstat flush is global and flush from one kswapd thread should be
>> sufficient for all. Simply replace the synchronous flush with the
>> ratelimited one.
>>

Like Yosry, I don't agree that simply using the ratelimited flush here
is the right solution; at least, other options need to be investigated
first.

I can confirm that at Cloudflare we are seeing massive issues due to
kswapd, and specifically due to this mem_cgroup_flush_stats() call
inlined in shrink_node taking the rstat lock. I looked at our machines
with 12 NUMA nodes; given that kswapd has one kthread per NUMA node,
the issue is much worse on these servers, as contention on the rstat
lock happens more frequently and 12 CPUs waste cycles spinning every
time kswapd runs.  We added fleet-wide stats (/proc/N/schedstat) for
kthreads, and realized that on a fleet-wide average we are burning
20,000 CPU cores on kswapd, which primarily spins on the rstat lock.

Try to run this command to see kswapd CPU usage spikes:
   $ pidstat -u -C kswapd 1 60

Code wise: when the per-CPU pages (PCP) freelist is empty,
__alloc_pages_slowpath calls wake_all_kswapds(), which causes wakeup
calls to all the kswapdN threads simultaneously. Each kswapd thread
then invokes shrink_node, which as explained triggers the cgroup rstat
flush operation as part of its work. Thus, the kernel's own memory
management effectively inflicts rstat lock contention on itself by
waking up all kswapd threads simultaneously.
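
Roughly, the path looks like this (simplified; the real
wake_all_kswapds() also avoids waking the same node's kswapd twice):

/* __alloc_pages_slowpath() -> wake_all_kswapds(): effectively one
 * kswapd kthread per NUMA node gets woken, and each of them ends up
 * in shrink_node() -> mem_cgroup_flush_stats(). */
static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
                             const struct alloc_context *ac)
{
        struct zoneref *z;
        struct zone *zone;

        for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
                                        ac->highest_zoneidx, ac->nodemask)
                wakeup_kswapd(zone, gfp_mask, order, ac->highest_zoneidx);
}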

So, we definitely need a solution to this issue!
(... more ideas below inlined as comments to Yosry)

>> One may raise a concern on potentially using 2 sec stale (at worst)
>> stats for heuristics like desirable inactive:active ratio and preferring
>> inactive file pages over anon pages but these specific heuristics do not
>> require very precise stats and also are ignored under severe memory
>> pressure. This patch has been running on Meta fleet for more than a
>> month and we have not observed any issues. Please note that MGLRU is not
>> impacted by this issue at all as it avoids rstat flushing completely.
>>
>> Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@kernel.org [1]
>> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
>> ---
>>   mm/vmscan.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index c0429fd6c573..bda4f92eba71 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2263,7 +2263,7 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
>>           * Flush the memory cgroup stats, so that we read accurate per-memcg
>>           * lruvec stats for heuristics.
>>           */
>> -       mem_cgroup_flush_stats(sc->target_mem_cgroup);
>> +       mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup);
> 
> I think you already know my opinion about this one :) I don't like it
> at all, and I will explain why below. I know it may be a necessary
> evil, but I would like us to make sure there is no other option before
> going forward with this.
> 
I'm signing up to solve this somehow, as this is a real prod issue.

An easy way to solve the kswapd issue would be to reintroduce the
"stats_flush_ongoing" concept, which was introduced in 3cd9992b9302
("memcg: replace stats_flush_lock with an atomic") (Author: Yosry
Ahmed) and reverted in 7d7ef0a4686a ("mm: memcg: restore subtree stats
flushing") (Author: Yosry Ahmed).

The concept is: if there is an ongoing rstat flush, this time limited
to the root cgroup, then don't perform the flush.  We can only do this
for the root cgroup tree, since flushing can also be done for subtrees,
but kswapd always flushes the root tree, so it is good enough to solve
the kswapd thundering herd problem.  We might want to generalize this
beyond memcg.
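
A minimal sketch of the idea, assuming an atomic flag like the old
code had (do_flush_stats() stands for the existing flush path):

static atomic_t root_flush_ongoing;

void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
{
        bool is_root = mem_cgroup_is_root(memcg);

        /*
         * If someone else is already flushing the root tree, the stats
         * are about to be (nearly) up to date anyway - don't pile up on
         * the rstat lock, just return.
         */
        if (is_root && atomic_xchg(&root_flush_ongoing, 1))
                return;

        do_flush_stats(memcg);

        if (is_root)
                atomic_set(&root_flush_ongoing, 0);
}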


> Unfortunately, I am travelling this week, so I probably won't be able
> to follow up on this for a week or so, but I will try to lay down my
> thoughts as much as I can.
> 
> Why don't I like this?
> 
> - From a high level, I don't like the approach of replacing
> problematic flushing calls with the ratelimited version. It strikes me
> as a whac-a-mole approach that is mitigating symptoms of a larger
> problem.
> 

This also looks like a whac-a-mole approach to me.

I'm not directly against rate-limited flushing, but I do think the 2 sec
interval is too long. IMHO we should be able to have a smaller bound in
the ms area.

For example, I proposed 50 ms flush rate limiting in [2].

[2] https://lore.kernel.org/all/171328990014.3930751.10674097155895405137.stgit@firesoul/


> - With the added thresholding code, a flush is only done if there is a
> significant number of pending updates in the relevant subtree.
> Choosing the ratelimited approach is intentionally ignoring a
> significant change in stats (although arguably it could be irrelevant
> stats).
> 

My production observations are that the thresholding code isn't limiting
the flushing in practice.


> - Reclaim code is an iterative process, so not updating the stats on
> every retry is very counterintuitive. We are retrying reclaim using
> the same stats and heuristics used by a previous iteration,
> essentially dismissing the effects of those previous iterations.
> 
> - Indeterministic behavior like this one is very difficult to debug if
> it causes problems. The missing updates in the last 2s (or whatever
> period) could be of any magnitude. We may be ignoring GBs of
> free/allocated memory. What's worse is, if it causes any problems,
> tracing it back to this flush will be extremely difficult.
> 

The 2 sec seems like a long period for me.

> What can we do?
> 
> - Try to make more fundamental improvements to the flushing code (for
> memcgs or cgroups in general). The per-memcg flushing thresholding is
> an example of this. For example, if flushing is taking too long
> because we are flushing all subsystems, it may make sense to have
> separate rstat trees for separate subsystems.
> 
> One other thing we can try is add a mutex in the memcg flushing path.
> I had initially had this in my subtree flushing series [1], but I
> dropped it as we thought it's not very useful. 

I'm running an experimental kernel with the rstat lock converted to a
mutex on a number of production servers, and we have not observed any
regressions. The kswapd thundering herd problem also happens on these
machines, but as these are sleepable background threads, it is fine for
them to sleep on the mutex.


> Currently in
> mem_cgroup_flush_stats(), we check if there are enough pending updates
> to flush, then we call cgroup_flush_stats() and spin on the lock. It
> is possible that while we spin, those pending updates we observed have
> been flushed. If we add back the mutex like in [1], then once we
> acquire the mutex we check again to make sure there are still enough
> stats to flush.
> 

It is a good point to re-check after obtaining the lock, as things
could have changed while we were waiting for it. This is also done (for
the time check) in my 50 ms flush rate limit patch [2] ;-)


> [1]https://lore.kernel.org/all/20231010032117.1577496-6-yosryahmed@google.com/
> 
> - Try to avoid the need for flushing in this path. I am not sure what
> approach MGLRU uses to avoid the flush, but if we can do something
> similar for classic LRUs that would be preferable. I am guessing MGLRU
> may be maintaining its own stats outside of the rstat framework.
> 
> - Try to figure out if one (or a few) update paths are regressing all
> flushers. If one specific stat or stats update path is causing most of
> the updates, we can try to fix that instead. Especially if it's a
> counter that is continuously being increased and decreases (so the net
> change is not as high as we think).
> 
> At the end of the day, all of the above may not work, and we may have
> to live with just using the ratelimited approach. But I *really* hope
> we could actually go the other way. Fix things on a more fundamental
> level and eventually drop the ratelimited variants completely.
>

My pipe dream is that the kernel could avoid the cost of maintaining
the cgroup threshold stats for flushing, and instead rely on a dynamic
time-based threshold (in the ms area) that has no fast-path overhead :-P


> Just my 2c. Sorry for the long email :)

Appreciate your input a lot! :-)

--Jesper
Shakeel Butt June 17, 2024, 5:20 p.m. UTC | #3
On Sat, Jun 15, 2024 at 05:28:55PM GMT, Yosry Ahmed wrote:
> On Sat, Jun 15, 2024 at 1:13 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > The Meta prod is seeing large amount of stalls in memcg stats flush
> > from the memcg reclaim code path. At the moment, this specific callsite
> > is doing a synchronous memcg stats flush. The rstat flush is an
> > expensive and time consuming operation, so concurrent relaimers will
> > busywait on the lock potentially for a long time. Actually this issue is
> > not unique to Meta and has been observed by Cloudflare [1] as well. For
> > the Cloudflare case, the stalls were due to contention between kswapd
> > threads running on their 8 numa node machines which does not make sense
> > as rstat flush is global and flush from one kswapd thread should be
> > sufficient for all. Simply replace the synchronous flush with the
> > ratelimited one.
> >
> > One may raise a concern on potentially using 2 sec stale (at worst)
> > stats for heuristics like desirable inactive:active ratio and preferring
> > inactive file pages over anon pages but these specific heuristics do not
> > require very precise stats and also are ignored under severe memory
> > pressure. This patch has been running on Meta fleet for more than a
> > month and we have not observed any issues. Please note that MGLRU is not
> > impacted by this issue at all as it avoids rstat flushing completely.
> >
> > Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@kernel.org [1]
> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > ---
> >  mm/vmscan.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index c0429fd6c573..bda4f92eba71 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2263,7 +2263,7 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
> >          * Flush the memory cgroup stats, so that we read accurate per-memcg
> >          * lruvec stats for heuristics.
> >          */
> > -       mem_cgroup_flush_stats(sc->target_mem_cgroup);
> > +       mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup);
> 
> I think you already know my opinion about this one :) I don't like it
> at all, 

Yup I know.

> and I will explain why below. I know it may be a necessary
> evil, but I would like us to make sure there is no other option before
> going forward with this.

Instead of a necessary evil, I would call it a pragmatic approach,
i.e. resolve the ongoing pain with a good enough solution and work on a
long term solution later.

> 
> Unfortunately, I am travelling this week, so I probably won't be able
> to follow up on this for a week or so, but I will try to lay down my
> thoughts as much as I can.
> 
> Why don't I like this?
> 
> - From a high level, I don't like the approach of replacing
> problematic flushing calls with the ratelimited version. It strikes me
> as a whac-a-mole approach that is mitigating symptoms of a larger
> problem.
> 
> - With the added thresholding code, a flush is only done if there is a
> significant number of pending updates in the relevant subtree.
> Choosing the ratelimited approach is intentionally ignoring a
> significant change in stats (although arguably it could be irrelevant
> stats).

Intentionally ignoring a significant change is fine not only for, as
you said, irrelevant stats, but also for cases where we don't need
exact and precise stats. See my next point.

> 
> - Reclaim code is an iterative process, so not updating the stats on
> every retry is very counterintuitive. We are retrying reclaim using
> the same stats and heuristics used by a previous iteration,
> essentially dismissing the effects of those previous iterations.
> 

I think I explained in the commit message why we don't need precise
metrics for this specific case, but let me reiterate.

The stats are needed for two specific heuristics in this case:

1. Deactivate LRUs
2. Cache trim mode

The deactivate LRUs heuristic maintains a desirable inactive:active
ratio of the LRUs. The specific stats needed are WORKINGSET_ACTIVATE*
and the hierarchical LRU sizes. WORKINGSET_ACTIVATE* is needed to check
if there has been a refault since the last snapshot, and the LRU sizes
are needed for the desirable ratio between the inactive and active
LRUs. See the table below for how the desirable ratio is calculated.

/* total     target    max
 * memory    ratio     inactive
 * -------------------------------------
 *   10MB       1         5MB
 *  100MB       1        50MB
 *    1GB       3       250MB
 *   10GB      10       0.9GB
 *  100GB      31         3GB
 *    1TB     101        10GB
 *   10TB     320        32GB
 */

The desirable ratio only changes at the boundaries of 1 GiB, 10 GiB,
100 GiB, 1 TiB and 10 TiB. There is no need for precise and accurate
LRU size information to calculate this ratio. In addition, if
deactivation is skipped for some LRU, the kernel will force
deactivation under severe memory pressure.
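
The ratio check itself is essentially the following (close to
inactive_is_low() in mm/vmscan.c); it only cares about the order of
magnitude of the LRU sizes:

static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
{
        enum lru_list active_lru = inactive_lru + LRU_ACTIVE;
        unsigned long inactive, active, gb, inactive_ratio;

        inactive = lruvec_page_state(lruvec, NR_LRU_BASE + inactive_lru);
        active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru);

        /* Ratio grows with the size class, e.g. 10GB -> 10, 1TB -> ~101 */
        gb = (inactive + active) >> (30 - PAGE_SHIFT);
        if (gb)
                inactive_ratio = int_sqrt(10 * gb);
        else
                inactive_ratio = 1;

        return inactive * inactive_ratio < active;
}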

For the cache trim mode, the inactive file LRU size is read and the
kernel scales it down based on the reclaim priority (file >>
sc->priority), then only checks whether the result is zero or not.
Again, precise information is not needed.
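
In code it is roughly the following fragment (simplified from
prepare_scan_control(); sc and target_lruvec are the scan control and
the target memcg's lruvec, declarations omitted):

        /*
         * Cache trim mode only cares whether the scaled-down inactive
         * file size is non-zero, not about its exact value.
         */
        file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
        if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
                sc->cache_trim_mode = 1;
        else
                sc->cache_trim_mode = 0;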

> - Indeterministic behavior like this one is very difficult to debug if
> it causes problems. The missing updates in the last 2s (or whatever
> period) could be of any magnitude. We may be ignoring GBs of
> free/allocated memory. What's worse is, if it causes any problems,
> tracing it back to this flush will be extremely difficult.

This is indeed an issue, but that is common with heuristics in
general: they work most of the time and fail for a small set of cases.

Anyway, I am not arguing to remove the sync flush for all cases.
Rather, I am arguing that for this specific case we don't need to be
precise, as I have explained above.

> 
> What can we do?
> 
> - Try to make more fundamental improvements to the flushing code (for
> memcgs or cgroups in general). The per-memcg flushing thresholding is
> an example of this. For example, if flushing is taking too long
> because we are flushing all subsystems, it may make sense to have
> separate rstat trees for separate subsystems.

Yes, separate flushing for each subsystem makes sense and can be done
orthogonally.

> 
> One other thing we can try is add a mutex in the memcg flushing path.
> I had initially had this in my subtree flushing series [1], but I
> dropped it as we thought it's not very useful. Currently in
> mem_cgroup_flush_stats(), we check if there are enough pending updates
> to flush, then we call cgroup_flush_stats() and spin on the lock. It
> is possible that while we spin, those pending updates we observed have
> been flushed. If we add back the mutex like in [1], then once we
> acquire the mutex we check again to make sure there are still enough
> stats to flush.
> 
> [1]https://lore.kernel.org/all/20231010032117.1577496-6-yosryahmed@google.com/

My main beef with the global mutex is the possible priority inversion.
Unless you agree to add try_lock() and skip flushing (i.e. no one
sleeps on the mutex), this is a no-go.

> 
> - Try to avoid the need for flushing in this path. I am not sure what
> approach MGLRU uses to avoid the flush, but if we can do something
> similar for classic LRUs that would be preferable. I am guessing MGLRU
> may be maintaining its own stats outside of the rstat framework.

MGLRU simply doesn't use these heuristics (Yu Zhao, please correct me
if I am wrong).

> 
> - Try to figure out if one (or a few) update paths are regressing all
> flushers. If one specific stat or stats update path is causing most of
> the updates, we can try to fix that instead. Especially if it's a
> counter that is continuously being increased and decreases (so the net
> change is not as high as we think).

This is actually a good point. I remember Jesper mentioning that
MEMCG_KMEM might be the one with the most updates. I can try to collect
from the Meta fleet what is causing most of the updates.

> 
> At the end of the day, all of the above may not work, and we may have
> to live with just using the ratelimited approach. But I *really* hope
> we could actually go the other way. Fix things on a more fundamental
> level and eventually drop the ratelimited variants completely.
> 
> Just my 2c. Sorry for the long email :)

Please note that this is not some user API which cannot be changed
later. We can change and dissect it however we want. My only point is
not to wait for the perfect solution, and to have some intermediate,
good enough solution in the meantime.

Thanks for the review.
Shakeel Butt June 17, 2024, 6:01 p.m. UTC | #4
On Mon, Jun 17, 2024 at 05:31:21PM GMT, Jesper Dangaard Brouer wrote:
> 
> 
> On 16/06/2024 02.28, Yosry Ahmed wrote:
> > On Sat, Jun 15, 2024 at 1:13 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > 
> > > The Meta prod is seeing large amount of stalls in memcg stats flush
> > > from the memcg reclaim code path. At the moment, this specific callsite
> > > is doing a synchronous memcg stats flush. The rstat flush is an
> > > expensive and time consuming operation, so concurrent relaimers will
> > > busywait on the lock potentially for a long time. Actually this issue is
> > > not unique to Meta and has been observed by Cloudflare [1] as well. For
> > > the Cloudflare case, the stalls were due to contention between kswapd
> > > threads running on their 8 numa node machines which does not make sense
> > > as rstat flush is global and flush from one kswapd thread should be
> > > sufficient for all. Simply replace the synchronous flush with the
> > > ratelimited one.
> > > 
> 
> Like Yosry, I don't agree that simply using ratelimited flush here is
> the right solution, at-least other options need to be investigated first.

I added more detail in my reply to Yosry on why using ratelimited flush
for this specific case is fine.

[...]
> > 
> > I think you already know my opinion about this one :) I don't like it
> > at all, and I will explain why below. I know it may be a necessary
> > evil, but I would like us to make sure there is no other option before
> > going forward with this.
> > 
> I'm signing up to solving this somehow, as this is a real prod issue.
> 
> An easy way to solve the kswapd issue, would be to reintroduce
> "stats_flush_ongoing" concept, that was reverted in 7d7ef0a4686a ("mm:
> memcg: restore subtree stats flushing") (Author: Yosry Ahmed), and
> introduced in 3cd9992b9302 ("memcg: replace stats_flush_lock with an
> atomic") (Author: Yosry Ahmed).
> 

The skipping of the flush based on "stats_flush_ongoing" was there from
the start.

> The concept is: If there is an ongoing rstat flush, this time limited to
> the root cgroup, then don't perform the flush.  We can only do this for
> the root cgroup tree, as flushing can be done for subtrees, but kswapd
> is always for root tree, so it is good enough to solve the kswapd
> thundering herd problem.  We might want to generalize this beyond memcg.
> 

No objection from me for this skipping root memcg flush idea.

> 
[...]
> 
> > - With the added thresholding code, a flush is only done if there is a
> > significant number of pending updates in the relevant subtree.
> > Choosing the ratelimited approach is intentionally ignoring a
> > significant change in stats (although arguably it could be irrelevant
> > stats).
> > 
> 
> My production observations are that the thresholding code isn't limiting
> the flushing in practice.
> 

Here we need more production data. I remember you mentioned MEMCG_KMEM
being used for most of the updates. Is it possible to get the top 5 (or
10) most-updated stats for your production environment?

> 
> > - Reclaim code is an iterative process, so not updating the stats on
> > every retry is very counterintuitive. We are retrying reclaim using
> > the same stats and heuristics used by a previous iteration,
> > essentially dismissing the effects of those previous iterations.
> > 
> > - Indeterministic behavior like this one is very difficult to debug if
> > it causes problems. The missing updates in the last 2s (or whatever
> > period) could be of any magnitude. We may be ignoring GBs of
> > free/allocated memory. What's worse is, if it causes any problems,
> > tracing it back to this flush will be extremely difficult.
> > 
> 
> The 2 sec seems like a long period for me.
> 
> > What can we do?
> > 
> > - Try to make more fundamental improvements to the flushing code (for
> > memcgs or cgroups in general). The per-memcg flushing thresholding is
> > an example of this. For example, if flushing is taking too long
> > because we are flushing all subsystems, it may make sense to have
> > separate rstat trees for separate subsystems.
> > 
> > One other thing we can try is add a mutex in the memcg flushing path.
> > I had initially had this in my subtree flushing series [1], but I
> > dropped it as we thought it's not very useful.
> 
> I'm running an experimental kernel with rstat lock converted to mutex on
> a number of production servers, and we have not observed any regressions.
> The kswapd thundering herd problem also happen on these machines, but as
> these are sleep-able background threads, it is fine to sleep on the mutex.
> 

Sorry, but a global mutex which can be taken by userspace applications
and is needed by the node controller (to read stats) is a no from me.
On multi-tenant systems, global locks causing priority inversion are a
real issue.

> 
[...]
> 
> My pipe dream is that kernel can avoiding the cost of maintain the
> cgroup threshold stats for flushing, and instead rely on a dynamic time
> based threshold (in ms area) that have no fast-path overhead :-P
> 

Please do expand on what you mean by dynamic time based threshold.
Jesper Dangaard Brouer June 18, 2024, 3:53 p.m. UTC | #5
On 17/06/2024 20.01, Shakeel Butt wrote:
> On Mon, Jun 17, 2024 at 05:31:21PM GMT, Jesper Dangaard Brouer wrote:
>>
>>
>> On 16/06/2024 02.28, Yosry Ahmed wrote:
>>> On Sat, Jun 15, 2024 at 1:13 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>>
>>>> The Meta prod is seeing large amount of stalls in memcg stats flush
>>>> from the memcg reclaim code path. At the moment, this specific callsite
>>>> is doing a synchronous memcg stats flush. The rstat flush is an
>>>> expensive and time consuming operation, so concurrent relaimers will
>>>> busywait on the lock potentially for a long time. Actually this issue is
>>>> not unique to Meta and has been observed by Cloudflare [1] as well. For
>>>> the Cloudflare case, the stalls were due to contention between kswapd
>>>> threads running on their 8 numa node machines which does not make sense
>>>> as rstat flush is global and flush from one kswapd thread should be
>>>> sufficient for all. Simply replace the synchronous flush with the
>>>> ratelimited one.
>>>>
>>
>> Like Yosry, I don't agree that simply using ratelimited flush here is
>> the right solution, at-least other options need to be investigated first.
> 
> I added more detail in my reply to Yosry on why using ratelimited flush
> for this specific case is fine.
> 
> [...]
>>>
>>> I think you already know my opinion about this one :) I don't like it
>>> at all, and I will explain why below. I know it may be a necessary
>>> evil, but I would like us to make sure there is no other option before
>>> going forward with this.
>>>
>> I'm signing up to solving this somehow, as this is a real prod issue.
>>
>> An easy way to solve the kswapd issue, would be to reintroduce
>> "stats_flush_ongoing" concept, that was reverted in 7d7ef0a4686a ("mm:
>> memcg: restore subtree stats flushing") (Author: Yosry Ahmed), and
>> introduced in 3cd9992b9302 ("memcg: replace stats_flush_lock with an
>> atomic") (Author: Yosry Ahmed).
>>
> 
> The skipping flush for "stats_flush_ongoing" was there from the start.
> 
>> The concept is: If there is an ongoing rstat flush, this time limited to
>> the root cgroup, then don't perform the flush.  We can only do this for
>> the root cgroup tree, as flushing can be done for subtrees, but kswapd
>> is always for root tree, so it is good enough to solve the kswapd
>> thundering herd problem.  We might want to generalize this beyond memcg.
>>
> 
> No objection from me for this skipping root memcg flush idea.
> 

Okay, looking into coding this up then.

>>
> [...]
>>
>>> - With the added thresholding code, a flush is only done if there is a
>>> significant number of pending updates in the relevant subtree.
>>> Choosing the ratelimited approach is intentionally ignoring a
>>> significant change in stats (although arguably it could be irrelevant
>>> stats).
>>>
>>
>> My production observations are that the thresholding code isn't limiting
>> the flushing in practice.
>>
> 
> Here we need more production data. I remember you mentioned MEMCG_KMEM
> being used for most of the updates. Is it possible to get top 5 (or 10)
> most updated stats for your production environment?
> 

Do you have a method for obtaining these stats?

Last time, I used eBPF + bpftrace to extract these stats.  Because of
the high rate at which these updates occur, that caused production
overload situations that SRE noticed.  So I have to be careful when
producing these stats for you.

Also, could you be more specific about which code lines you want stats
for?


>>
>>> - Reclaim code is an iterative process, so not updating the stats on
>>> every retry is very counterintuitive. We are retrying reclaim using
>>> the same stats and heuristics used by a previous iteration,
>>> essentially dismissing the effects of those previous iterations.
>>>
>>> - Indeterministic behavior like this one is very difficult to debug if
>>> it causes problems. The missing updates in the last 2s (or whatever
>>> period) could be of any magnitude. We may be ignoring GBs of
>>> free/allocated memory. What's worse is, if it causes any problems,
>>> tracing it back to this flush will be extremely difficult.
>>>
>>
>> The 2 sec seems like a long period for me.
>>
>>> What can we do?
>>>
>>> - Try to make more fundamental improvements to the flushing code (for
>>> memcgs or cgroups in general). The per-memcg flushing thresholding is
>>> an example of this. For example, if flushing is taking too long
>>> because we are flushing all subsystems, it may make sense to have
>>> separate rstat trees for separate subsystems.
>>>
>>> One other thing we can try is add a mutex in the memcg flushing path.
>>> I had initially had this in my subtree flushing series [1], but I
>>> dropped it as we thought it's not very useful.
>>
>> I'm running an experimental kernel with rstat lock converted to mutex on
>> a number of production servers, and we have not observed any regressions.
>> The kswapd thundering herd problem also happen on these machines, but as
>> these are sleep-able background threads, it is fine to sleep on the mutex.
>>
> 
> Sorry but a global mutex which can be taken by userspace applications
> and is needed by node controller (to read stats) is a no from me. On a
> multi-tenant systems, global locks causing priority inversion is a real
> issue.
> 

The situation we have *today*, with a global IRQ-disabling spin_lock,
already causes a priority-inversion situation by design. Userspace
applications (reading the stat file) and kswapd background processes
currently get higher priority than both hardware and software
interrupts. This is causing actual production issues, which is why I'm
working on this.

I do understand that changing this to a global mutex creates the
theoretical *possibility* of priority inversion between processes with
different configured priorities.  IMHO this is better than always
taking the "big" bottom-half lock [LWN].  I still want to lower the
potential priority-inversion issue with the mutex lock by (1) working
on lowering the pressure on the lock (e.g. exit if a flush is ongoing
on the root, and add a time limit on how often a flush can run on
sub-trees), and (2) collecting production metrics for lock contention
and hold time (with appropriate alerts).


[LWN] https://lwn.net/SubscriberLink/978189/5f50cab8478fac45/


>>
> [...]
>>
>> My pipe dream is that kernel can avoiding the cost of maintain the
>> cgroup threshold stats for flushing, and instead rely on a dynamic time
>> based threshold (in ms area) that have no fast-path overhead :-P
>>
> 
> Please do expand on what you mean by dynamic time based threshold.

I proposed fixed 50 ms flush rate limiting in [2], but I don't want
this to be a static value as it will vary across configs and hardware.
I propose making it dynamic by measuring the time it takes to flush the
cgroup root.  This makes sense in a queuing theory context, because
that corresponds to the (longest) service time, and theory says "a
queue is formed when customers arrive faster than they can get served".
Thus, this should lower the pressure/contention on the lock, while
still allowing frequent flushing.  Hope it makes sense.
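
Something along these lines (hand-wavy sketch, all names made up):

static u64 flush_last_ns;
static u64 flush_cost_ns;       /* measured service time of one root flush */

static bool root_flush_needed(void)
{
        /* Don't flush more often than a single flush takes to complete */
        return ktime_get_ns() > READ_ONCE(flush_last_ns) +
                                READ_ONCE(flush_cost_ns);
}

static void root_flush(struct cgroup *root)
{
        u64 start = ktime_get_ns();

        cgroup_rstat_flush(root);
        WRITE_ONCE(flush_cost_ns, ktime_get_ns() - start);
        WRITE_ONCE(flush_last_ns, ktime_get_ns());
}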

--Jesper


[2] 
https://lore.kernel.org/all/171328990014.3930751.10674097155895405137.stgit@firesoul/
Shakeel Butt June 18, 2024, 6:07 p.m. UTC | #6
On Tue, Jun 18, 2024 at 05:53:02PM GMT, Jesper Dangaard Brouer wrote:
> 
> 
[...]
> > > > - With the added thresholding code, a flush is only done if there is a
> > > > significant number of pending updates in the relevant subtree.
> > > > Choosing the ratelimited approach is intentionally ignoring a
> > > > significant change in stats (although arguably it could be irrelevant
> > > > stats).
> > > > 
> > > 
> > > My production observations are that the thresholding code isn't limiting
> > > the flushing in practice.
> > > 
> > 
> > Here we need more production data. I remember you mentioned MEMCG_KMEM
> > being used for most of the updates. Is it possible to get top 5 (or 10)
> > most updated stats for your production environment?
> > 
> 
> Do you have a method for obtaining these stats?
> 
> Last time I used eBPF + bpftrace to extract these stats.  The high rate
> these updates occur, it caused production overload situations that SRE
> noticed.  So, I have to be careful when producing these stats for you.
> 
> Also could you be more specific what code lines you want stats for?
> 

I am looking for idx and val in __mod_memcg_state, __mod_memcg_lruvec_state
and __count_memcg_events.

Can you share your bpftrace script (I remember you shared it, sorry for
asking again)? I will reach out to BPF experts to see if we can further
reduce its cost.

> 
> > > 
> > > > - Reclaim code is an iterative process, so not updating the stats on
> > > > every retry is very counterintuitive. We are retrying reclaim using
> > > > the same stats and heuristics used by a previous iteration,
> > > > essentially dismissing the effects of those previous iterations.
> > > > 
> > > > - Indeterministic behavior like this one is very difficult to debug if
> > > > it causes problems. The missing updates in the last 2s (or whatever
> > > > period) could be of any magnitude. We may be ignoring GBs of
> > > > free/allocated memory. What's worse is, if it causes any problems,
> > > > tracing it back to this flush will be extremely difficult.
> > > > 
> > > 
> > > The 2 sec seems like a long period for me.
> > > 
> > > > What can we do?
> > > > 
> > > > - Try to make more fundamental improvements to the flushing code (for
> > > > memcgs or cgroups in general). The per-memcg flushing thresholding is
> > > > an example of this. For example, if flushing is taking too long
> > > > because we are flushing all subsystems, it may make sense to have
> > > > separate rstat trees for separate subsystems.
> > > > 
> > > > One other thing we can try is add a mutex in the memcg flushing path.
> > > > I had initially had this in my subtree flushing series [1], but I
> > > > dropped it as we thought it's not very useful.
> > > 
> > > I'm running an experimental kernel with rstat lock converted to mutex on
> > > a number of production servers, and we have not observed any regressions.
> > > The kswapd thundering herd problem also happen on these machines, but as
> > > these are sleep-able background threads, it is fine to sleep on the mutex.
> > > 
> > 
> > Sorry but a global mutex which can be taken by userspace applications
> > and is needed by node controller (to read stats) is a no from me. On a
> > multi-tenant systems, global locks causing priority inversion is a real
> > issue.
> > 
> 
> The situation we have *today* with a global IRQ-disabling spin_lock is
> precisely causing a priority-inversion situation always by design.
> Userspace applications (reading stat file) and kswapd background
> processes are currently getting higher priority than both hardware and
> software interrupts. This is causing actual production issues, which is
> why I'm working on this.
> 
> I do understand that changing this to a global mutex creates the
> theoretical *possibility* for priority-inversion between processes with

Ah, sorry for giving the impression that I am talking about a
theoretical issue. No, we (old me at Google) have seen a global lock
causing priority inversions and leaving the high priority node
controller stuck for tens of seconds to several minutes in Google
production.

> different configured priorities.  IMHO this is better than always taking
> the "big" bottom-half-lock [LWN].  I still want to lower the potential
> priority-inversion issue with the mutex lock, by (1) working on lowering
> the pressure on the lock (e.g. exit if flush is ongoing on root, and
> e.g. add time limit on how often flush can run on sub-trees), and (2) we
> will collect production metrics for lock contention and hold time (with
> appropriate alerts).

Oh, I am open to the mutex change along with efforts to reduce the
chances of priority inversion, just not the mutex change alone. I am
open to trying out new things, as some (or all) might not work and we
may have to go back to the older situation.

> 
> 
> [LWN] https://lwn.net/SubscriberLink/978189/5f50cab8478fac45/
> 
> 
> > > 
> > [...]
> > > 
> > > My pipe dream is that kernel can avoiding the cost of maintain the
> > > cgroup threshold stats for flushing, and instead rely on a dynamic time
> > > based threshold (in ms area) that have no fast-path overhead :-P
> > > 
> > 
> > Please do expand on what you mean by dynamic time based threshold.
> 
> I proposed a fixed 50 ms flush rate limiting in [2].  But I don't want
> this to be a static value as it will vary on different configs and
> hardware. I propose making this dynamic via measuring the time it takes
> to flush the cgroup root.  This makes sense in a queuing theory context,
> because this corresponds to the (longest) service time, and theory say
> "a queue is formed when customers arrive faster than they can get
> served". Thus, this should lower the pressure/contention on the lock,
> while still allowing frequent flushing.  Hope it makes sense.
> 

Thanks for the explanation.

> --Jesper
> 
> 
> [2] https://lore.kernel.org/all/171328990014.3930751.10674097155895405137.stgit@firesoul/
Yosry Ahmed June 24, 2024, 12:57 p.m. UTC | #7
> > and I will explain why below. I know it may be a necessary
> > evil, but I would like us to make sure there is no other option before
> > going forward with this.
>
> Instead of necessary evil, I would call it a pragmatic approach i.e.
> resolve the ongoing pain with good enough solution and work on long term
> solution later.

It seems like there are a few ideas for solutions that may address the
longer-term concerns; let's make sure we try those out first before we
fall back to the short-term mitigation.

[..]
> >
> > - Reclaim code is an iterative process, so not updating the stats on
> > every retry is very counterintuitive. We are retrying reclaim using
> > the same stats and heuristics used by a previous iteration,
> > essentially dismissing the effects of those previous iterations.
> >
>
> I think I explained in the commit message why we don't need the precise
> metrics for this specific case but let me reiterate.
>
> The stats are needed for two specific heuristics in this case:
>
> 1. Deactivate LRUs
> 2. Cache trim mode
>
> The deactivate LRUs heuristic is to maintain a desirable inactive:active
> ratio of the LRUs. The specific stats needed are WORKINGSET_ACTIVATE*
> and the hierarchical LRU size. The WORKINGSET_ACTIVATE* is needed to
> check if there is a refault since last snapshot and the LRU size are
> needed for the desirable ratio between inactive and active LRUs. See the
> table below on how the desirable ratio is calculated.
>
> /* total     target    max
>  * memory    ratio     inactive
>  * -------------------------------------
>  *   10MB       1         5MB
>  *  100MB       1        50MB
>  *    1GB       3       250MB
>  *   10GB      10       0.9GB
>  *  100GB      31         3GB
>  *    1TB     101        10GB
>  *   10TB     320        32GB
>  */
>
> The desirable ratio only changes at the boundary of 1 GiB, 10 GiB,
> 100 GiB, 1 TiB and 10 TiB. There is no need for the precise and accurate
> LRU size information to calculate this ratio. In addition, if
> deactivation is skipped for some LRU, the kernel will force deactive on
> the severe memory pressure situation.

Thanks for explaining this in such detail. It does make me feel
better, but keep in mind that the above heuristics may change in the
future and become more sensitive to stale stats, and very likely no
one will remember that we decided that stale stats are fine
previously.

>
> For the cache trim mode, inactive file LRU size is read and the kernel
> scales it down based on the reclaim iteration (file >> sc->priority) and
> only checks if it is zero or not. Again precise information is not
> needed.

It sounds like it is possible that we enter the cache trim mode when
we shouldn't if the stats are stale. Couldn't this lead to
over-reclaiming file memory?

>
> > - Indeterministic behavior like this one is very difficult to debug if
> > it causes problems. The missing updates in the last 2s (or whatever
> > period) could be of any magnitude. We may be ignoring GBs of
> > free/allocated memory. What's worse is, if it causes any problems,
> > tracing it back to this flush will be extremely difficult.
>
> This is indeed an issue but that is common with the heuristics in
> general. They work most of the time and fail for small set of cases.
>
> Anyways, I am not arguing to remove sync flush for all cases. Rather I
> am arguing for this specific case, we don't need to be precise as I have
> explained above.
>
> >
> > What can we do?
> >
> > - Try to make more fundamental improvements to the flushing code (for
> > memcgs or cgroups in general). The per-memcg flushing thresholding is
> > an example of this. For example, if flushing is taking too long
> > because we are flushing all subsystems, it may make sense to have
> > separate rstat trees for separate subsystems.
>
> Yes separate flushing for each subsystems make sense and can be done
> orthogonally.
>
> >
> > One other thing we can try is add a mutex in the memcg flushing path.
> > I had initially had this in my subtree flushing series [1], but I
> > dropped it as we thought it's not very useful. Currently in
> > mem_cgroup_flush_stats(), we check if there are enough pending updates
> > to flush, then we call cgroup_flush_stats() and spin on the lock. It
> > is possible that while we spin, those pending updates we observed have
> > been flushed. If we add back the mutex like in [1], then once we
> > acquire the mutex we check again to make sure there are still enough
> > stats to flush.
> >
> > [1]https://lore.kernel.org/all/20231010032117.1577496-6-yosryahmed@google.com/
>
> My main beef with the global mutex is the possible priority inversion.
> Unless you agree to add try_lock() and skip flushing i.e. no one sleeps
> on the mutex, this is a no go.

Jesper is working on ways to mitigate the possible priority inversion
AFAICT. Let's see what comes out of this (I commented on Jesper's
patch).

>
> >
> > - Try to avoid the need for flushing in this path. I am not sure what
> > approach MGLRU uses to avoid the flush, but if we can do something
> > similar for classic LRUs that would be preferable. I am guessing MGLRU
> > may be maintaining its own stats outside of the rstat framework.
>
> MGLRU simply don't use these heuristics (Yu Zhao please correct me if I
> am wrong).
>
> >
> > - Try to figure out if one (or a few) update paths are regressing all
> > flushers. If one specific stat or stats update path is causing most of
> > the updates, we can try to fix that instead. Especially if it's a
> > counter that is continuously being increased and decreases (so the net
> > change is not as high as we think).
>
> This is actually a good point. I remember Jasper telling that MEMCG_KMEM
> might be the one with most updates. I can try to collect from Meta fleet
> what is the cause of most updates.

Let's also wait and see what comes out of this. It would be
interesting if we can fix this on the update side instead.

>
> >
> > At the end of the day, all of the above may not work, and we may have
> > to live with just using the ratelimited approach. But I *really* hope
> > we could actually go the other way. Fix things on a more fundamental
> > level and eventually drop the ratelimited variants completely.
> >
> > Just my 2c. Sorry for the long email :)
>
> Please note that this is not some user API which can not be changed
> later. We can change and disect however we want. My only point is not to
> wait for the perfect solution and have some intermediate and good enough
> solution.

I agree that we shouldn't wait for a perfect solution, but it also
seems like there are a few easy-ish solutions that we can explore
first (Jesper's patch, investigating update paths, etc). If none of
those pan out, we can fall back to the ratelimited flush, ideally with
a plan on next steps for a longer-term solution.
Shakeel Butt June 24, 2024, 5:02 p.m. UTC | #8
On Mon, Jun 24, 2024 at 05:57:51AM GMT, Yosry Ahmed wrote:
> > > and I will explain why below. I know it may be a necessary
> > > evil, but I would like us to make sure there is no other option before
> > > going forward with this.
> >
> > Instead of necessary evil, I would call it a pragmatic approach i.e.
> > resolve the ongoing pain with good enough solution and work on long term
> > solution later.
> 
> It seems like there are a few ideas for solutions that may address
> longer-term concerns, let's make sure we try those out first before we
> fall back to the short-term mitigation.
> 

Why? More specifically, why try out other things before this patch?
Both can be done in parallel. This patch has been running in production
at Meta for several weeks without issues. Also, I don't see how merging
this would impact our work on long term solutions.

[...]
> 
> Thanks for explaining this in such detail. It does make me feel
> better, but keep in mind that the above heuristics may change in the
> future and become more sensitive to stale stats, and very likely no
> one will remember that we decided that stale stats are fine
> previously.
> 

When was the last time this heuristic changed? This heuristic was
introduced in 2008 for anon pages and extended to file pages in 2016.
In 2019 the ratio enforcement at the 'reclaim root' was introduced. I
am pretty sure we will improve the whole rstat flushing thing within a
year or so :P

> >
> > For the cache trim mode, inactive file LRU size is read and the kernel
> > scales it down based on the reclaim iteration (file >> sc->priority) and
> > only checks if it is zero or not. Again precise information is not
> > needed.
> 
> It sounds like it is possible that we enter the cache trim mode when
> we shouldn't if the stats are stale. Couldn't this lead to
> over-reclaiming file memory?
> 

Can you explain how this over-reclaiming of file memory would happen?

[...]
> > >
> > > - Try to figure out if one (or a few) update paths are regressing all
> > > flushers. If one specific stat or stats update path is causing most of
> > > the updates, we can try to fix that instead. Especially if it's a
> > > counter that is continuously being increased and decreases (so the net
> > > change is not as high as we think).
> >
> > This is actually a good point. I remember Jasper telling that MEMCG_KMEM
> > might be the one with most updates. I can try to collect from Meta fleet
> > what is the cause of most updates.
> 
> Let's also wait and see what comes out of this. It would be
> interesting if we can fix this on the update side instead.
> 

Yes it would be interesting but I don't see any reason to wait for it.

> >
> > >
> > > At the end of the day, all of the above may not work, and we may have
> > > to live with just using the ratelimited approach. But I *really* hope
> > > we could actually go the other way. Fix things on a more fundamental
> > > level and eventually drop the ratelimited variants completely.
> > >
> > > Just my 2c. Sorry for the long email :)
> >
> > Please note that this is not some user API which can not be changed
> > later. We can change and disect however we want. My only point is not to
> > wait for the perfect solution and have some intermediate and good enough
> > solution.
> 
> I agree that we shouldn't wait for a perfect solution, but it also
> seems like there are a few easy-ish solutions that we can discover
> first (Jesper's patch, investigating update paths, etc). If none of
> those pan out, we can fall back to the ratelimited flush, ideally with
> a plan on next steps for a longer-term solution.

I think I already explained why there is no need to wait. One thing we
should agree on is that this is a hard problem and will need multiple
iterations to come up with a solution which is acceptable to most.
Until then, I don't see any reason to block mitigations that reduce the
pain.

thanks,
Shakeel
Yosry Ahmed June 24, 2024, 5:15 p.m. UTC | #9
On Mon, Jun 24, 2024 at 10:02 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Jun 24, 2024 at 05:57:51AM GMT, Yosry Ahmed wrote:
> > > > and I will explain why below. I know it may be a necessary
> > > > evil, but I would like us to make sure there is no other option before
> > > > going forward with this.
> > >
> > > Instead of necessary evil, I would call it a pragmatic approach i.e.
> > > resolve the ongoing pain with good enough solution and work on long term
> > > solution later.
> >
> > It seems like there are a few ideas for solutions that may address
> > longer-term concerns, let's make sure we try those out first before we
> > fall back to the short-term mitigation.
> >
>
> Why? More specifically why try out other things before this patch? Both
> can be done in parallel. This patch has been running in production at
> Meta for several weeks without issues. Also I don't see how merging this
> would impact us on working on long term solutions.

The problem is that once this is merged, it will be difficult to
change this back to a normal flush once other improvements land. We
don't have a test that reproduces the problem that we can use to make
sure it's safe to revert this change later; we only have data from
prod.

Once this mitigation goes in, I think everyone will be less motivated
to get more data from prod about whether it's safe to revert the
ratelimiting later :)

>
> [...]
> >
> > Thanks for explaining this in such detail. It does make me feel
> > better, but keep in mind that the above heuristics may change in the
> > future and become more sensitive to stale stats, and very likely no
> > one will remember that we decided that stale stats are fine
> > previously.
> >
>
> When was the last time this heuristic change? This heuristic was
> introduced in 2008 for anon pages and extended to file pages in 2016. In
> 2019 the ratio enforcement at 'reclaim root' was introduce. I am pretty
> sure we will improve the whole rstat flushing thing within a year or so
> :P

Fair point, although I meant it's easy to miss that the flush is
ratelimited and the stats are potentially stale in general :)

>
> > >
> > > For the cache trim mode, inactive file LRU size is read and the kernel
> > > scales it down based on the reclaim iteration (file >> sc->priority) and
> > > only checks if it is zero or not. Again precise information is not
> > > needed.
> >
> > It sounds like it is possible that we enter the cache trim mode when
> > we shouldn't if the stats are stale. Couldn't this lead to
> > over-reclaiming file memory?
> >
>
> Can you explain how this over-reclaiming file will happen?

In one reclaim iteration, we could flush the stats, read the inactive
file LRU size, confirm that (file >> sc->priority) > 0 and enter the
cache trim mode, reclaiming file memory only. Let's assume that we
reclaimed enough file memory such that the condition (file >>
sc->priority) > 0 does not hold anymore.

In a subsequent reclaim iteration, the flush could be skipped due to
ratelimiting. Now we will enter the cache trim mode again and reclaim
file memory only, even though the actual amount of file memory is low.
This will cause over-reclaiming from file memory and dismissing anon
memory that we should have reclaimed, which means that we will need
additional reclaim iterations to actually free memory.

I believe this scenario would be possible with ratelimiting, right?
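
For reference, the check I am worried about is roughly the following
(paraphrasing prepare_scan_control() from memory, simplified; not the
verbatim upstream code):

        /* Cache trim mode heuristic in prepare_scan_control() */
        file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
        if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
                sc->cache_trim_mode = 1;
        else
                sc->cache_trim_mode = 0;

With the ratelimited flush, the 'file' value read here can be a couple
of seconds stale.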

[..]
> > >
> > > Please note that this is not some user API which can not be changed
> > > later. We can change and disect however we want. My only point is not to
> > > wait for the perfect solution and have some intermediate and good enough
> > > solution.
> >
> > I agree that we shouldn't wait for a perfect solution, but it also
> > seems like there are a few easy-ish solutions that we can discover
> > first (Jesper's patch, investigating update paths, etc). If none of
> > those pan out, we can fall back to the ratelimited flush, ideally with
> > a plan on next steps for a longer-term solution.
>
> I think I already explain why there is no need to wait. One thing we
> should agree on is that this is hard problem and will need multiple
> iterations to comeup with a solution which is acceptable for most. Until
> then I don't see any reason to block mitigations to reduce pain.

Agreed, but I expressed above why I think we should explore other
solutions first. Please correct me if I am wrong.
Shakeel Butt June 24, 2024, 6:59 p.m. UTC | #10
On Mon, Jun 24, 2024 at 10:15:38AM GMT, Yosry Ahmed wrote:
> On Mon, Jun 24, 2024 at 10:02 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Mon, Jun 24, 2024 at 05:57:51AM GMT, Yosry Ahmed wrote:
> > > > > and I will explain why below. I know it may be a necessary
> > > > > evil, but I would like us to make sure there is no other option before
> > > > > going forward with this.
> > > >
> > > > Instead of necessary evil, I would call it a pragmatic approach i.e.
> > > > resolve the ongoing pain with good enough solution and work on long term
> > > > solution later.
> > >
> > > It seems like there are a few ideas for solutions that may address
> > > longer-term concerns, let's make sure we try those out first before we
> > > fall back to the short-term mitigation.
> > >
> >
> > Why? More specifically why try out other things before this patch? Both
> > can be done in parallel. This patch has been running in production at
> > Meta for several weeks without issues. Also I don't see how merging this
> > would impact us on working on long term solutions.
> 
> The problem is that once this is merged, it will be difficult to
> change this back to a normal flush once other improvements land. We
> don't have a test that reproduces the problem that we can use to make
> sure it's safe to revert this change later, it's only using data from
> prod.
> 

I am pretty sure the work on a long-term solution will be iterative and
will involve many reverts and redoing things differently. So, I think it
is understandable that we may need to revert or revert the reverts.

> Once this mitigation goes in, I think everyone will be less motivated
> to get more data from prod about whether it's safe to revert the
> ratelimiting later :)

As I said, I don't expect "safe in prod" to be a strict requirement for
a change.

> 
> >
> > [...]
> > >
> > > Thanks for explaining this in such detail. It does make me feel
> > > better, but keep in mind that the above heuristics may change in the
> > > future and become more sensitive to stale stats, and very likely no
> > > one will remember that we decided that stale stats are fine
> > > previously.
> > >
> >
> > When was the last time this heuristic change? This heuristic was
> > introduced in 2008 for anon pages and extended to file pages in 2016. In
> > 2019 the ratio enforcement at 'reclaim root' was introduce. I am pretty
> > sure we will improve the whole rstat flushing thing within a year or so
> > :P
> 
> Fair point, although I meant it's easy to miss that the flush is
> ratelimited and the stats are potentially stale in general :)
> 
> >
> > > >
> > > > For the cache trim mode, inactive file LRU size is read and the kernel
> > > > scales it down based on the reclaim iteration (file >> sc->priority) and
> > > > only checks if it is zero or not. Again precise information is not
> > > > needed.
> > >
> > > It sounds like it is possible that we enter the cache trim mode when
> > > we shouldn't if the stats are stale. Couldn't this lead to
> > > over-reclaiming file memory?
> > >
> >
> > Can you explain how this over-reclaiming file will happen?
> 
> In one reclaim iteration, we could flush the stats, read the inactive
> file LRU size, confirm that (file >> sc->priority) > 0 and enter the
> cache trim mode, reclaiming file memory only. Let's assume that we
> reclaimed enough file memory such that the condition (file >>
> sc->priority) > 0 does not hold anymore.
> 
> In a subsequent reclaim iteration, the flush could be skipped due to
> ratelimiting. Now we will enter the cache trim mode again and reclaim
> file memory only, even though the actual amount of file memory is low.
> This will cause over-reclaiming from file memory and dismissing anon
> memory that we should have reclaimed, which means that we will need
> additional reclaim iterations to actually free memory.
> 
> I believe this scenario would be possible with ratelimiting, right?
> 

So, (old_file >> sc->priority) > 0 is true but (new_file >>
sc->priority) > 0 is false. In the next iteration, (old_file >>
(sc->priority-1)) > 0 will still be true but somehow (new_file >>
(sc->priority-1)) > 0 is false. It can happen if, in the previous
iteration, the kernel somehow reclaimed more than double what it was
supposed to reclaim, or there are concurrent reclaimers. In addition,
nr_reclaimed is still less than nr_to_reclaim and there is no file
deactivation request.

Yeah, it can happen, but a lot of weird conditions need to line up
concurrently for this to happen.
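
To put made-up numbers on it (purely illustrative): say sc->priority is
3 and the stale inactive file count is 8, so 8 >> 3 = 1 > 0 and cache
trim mode is entered. If reclaim then frees 5 of those pages (far more
than the shift suggested), the real count drops to 3 and 3 >> 3 = 0. At
priority 2 in the next iteration the stale value still gives
8 >> 2 = 2 > 0, while a fresh read would give 3 >> 2 = 0. That is the
"reclaimed more than double" situation above.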
Yosry Ahmed June 24, 2024, 7:06 p.m. UTC | #11
On Mon, Jun 24, 2024 at 11:59 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Jun 24, 2024 at 10:15:38AM GMT, Yosry Ahmed wrote:
> > On Mon, Jun 24, 2024 at 10:02 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Mon, Jun 24, 2024 at 05:57:51AM GMT, Yosry Ahmed wrote:
> > > > > > and I will explain why below. I know it may be a necessary
> > > > > > evil, but I would like us to make sure there is no other option before
> > > > > > going forward with this.
> > > > >
> > > > > Instead of necessary evil, I would call it a pragmatic approach i.e.
> > > > > resolve the ongoing pain with good enough solution and work on long term
> > > > > solution later.
> > > >
> > > > It seems like there are a few ideas for solutions that may address
> > > > longer-term concerns, let's make sure we try those out first before we
> > > > fall back to the short-term mitigation.
> > > >
> > >
> > > Why? More specifically why try out other things before this patch? Both
> > > can be done in parallel. This patch has been running in production at
> > > Meta for several weeks without issues. Also I don't see how merging this
> > > would impact us on working on long term solutions.
> >
> > The problem is that once this is merged, it will be difficult to
> > change this back to a normal flush once other improvements land. We
> > don't have a test that reproduces the problem that we can use to make
> > sure it's safe to revert this change later, it's only using data from
> > prod.
> >
>
> I am pretty sure the work on long term solution would be iterative which
> will involve many reverts and redoing things differently. So, I think it
> is understandable that we may need to revert or revert the reverts.
>
> > Once this mitigation goes in, I think everyone will be less motivated
> > to get more data from prod about whether it's safe to revert the
> > ratelimiting later :)
>
> As I said I don't expect "safe in prod" as a strict requirement for a
> change.

If everyone agrees that we can experiment with reverting this change
later without having to prove that it is safe, then I think it's fine.
Let's document this in the commit log though, so that whoever tries to
revert this in the future (if any) does not have to re-explain all of
this :)

[..]
> > > > >
> > > > > For the cache trim mode, inactive file LRU size is read and the kernel
> > > > > scales it down based on the reclaim iteration (file >> sc->priority) and
> > > > > only checks if it is zero or not. Again precise information is not
> > > > > needed.
> > > >
> > > > It sounds like it is possible that we enter the cache trim mode when
> > > > we shouldn't if the stats are stale. Couldn't this lead to
> > > > over-reclaiming file memory?
> > > >
> > >
> > > Can you explain how this over-reclaiming file will happen?
> >
> > In one reclaim iteration, we could flush the stats, read the inactive
> > file LRU size, confirm that (file >> sc->priority) > 0 and enter the
> > cache trim mode, reclaiming file memory only. Let's assume that we
> > reclaimed enough file memory such that the condition (file >>
> > sc->priority) > 0 does not hold anymore.
> >
> > In a subsequent reclaim iteration, the flush could be skipped due to
> > ratelimiting. Now we will enter the cache trim mode again and reclaim
> > file memory only, even though the actual amount of file memory is low.
> > This will cause over-reclaiming from file memory and dismissing anon
> > memory that we should have reclaimed, which means that we will need
> > additional reclaim iterations to actually free memory.
> >
> > I believe this scenario would be possible with ratelimiting, right?
> >
>
> So, the (old_file >> sc->priority) > 0 is true but the (new_file >>
> sc->priority) > is false. In the next iteration, (old_file >>
> (sc->priority-1)) > 0 will still be true but somehow (new_file >>
> (sc->priority-1)) > 0 is false. It can happen if in the previous
> iteration, somehow kernel has reclaimed more than double what it was
> supposed to reclaim or there are concurrent reclaimers. In addition the
> nr_reclaim is still less than nr_to_reclaim and there is no file
> deactivation request.
>
> Yeah it can happen but a lot of wierd conditions need to happen
> concurrently for this to happen.

Not necessarily sc->priority-1. Consider two separate sequential
reclaim attempts. At the same priority, the first reclaim attempt
could rightfully enter cache trim mode, while the second one
wrongfully enters cache trim mode due to stale stats, over-reclaims
file memory, and stalls longer to actually reclaim the anon memory.

I am sure such a scenario is not going to be common, but I am also
sure if it happens it will be a huge pain to debug.

If others agree that this is fine, let's document this with a comment
and in the commit log. I am not sure how common cache trim mode is in
practice, so it is hard to judge the potential severity of such
problems. There may also be other consequences that I am not aware of.
Shakeel Butt June 24, 2024, 8:01 p.m. UTC | #12
On Mon, Jun 24, 2024 at 12:06:28PM GMT, Yosry Ahmed wrote:
> On Mon, Jun 24, 2024 at 11:59 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Mon, Jun 24, 2024 at 10:15:38AM GMT, Yosry Ahmed wrote:
> > > On Mon, Jun 24, 2024 at 10:02 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > >
> > > > On Mon, Jun 24, 2024 at 05:57:51AM GMT, Yosry Ahmed wrote:
> > > > > > > and I will explain why below. I know it may be a necessary
> > > > > > > evil, but I would like us to make sure there is no other option before
> > > > > > > going forward with this.
> > > > > >
> > > > > > Instead of necessary evil, I would call it a pragmatic approach i.e.
> > > > > > resolve the ongoing pain with good enough solution and work on long term
> > > > > > solution later.
> > > > >
> > > > > It seems like there are a few ideas for solutions that may address
> > > > > longer-term concerns, let's make sure we try those out first before we
> > > > > fall back to the short-term mitigation.
> > > > >
> > > >
> > > > Why? More specifically why try out other things before this patch? Both
> > > > can be done in parallel. This patch has been running in production at
> > > > Meta for several weeks without issues. Also I don't see how merging this
> > > > would impact us on working on long term solutions.
> > >
> > > The problem is that once this is merged, it will be difficult to
> > > change this back to a normal flush once other improvements land. We
> > > don't have a test that reproduces the problem that we can use to make
> > > sure it's safe to revert this change later, it's only using data from
> > > prod.
> > >
> >
> > I am pretty sure the work on long term solution would be iterative which
> > will involve many reverts and redoing things differently. So, I think it
> > is understandable that we may need to revert or revert the reverts.
> >
> > > Once this mitigation goes in, I think everyone will be less motivated
> > > to get more data from prod about whether it's safe to revert the
> > > ratelimiting later :)
> >
> > As I said I don't expect "safe in prod" as a strict requirement for a
> > change.
> 
> If everyone agrees that we can experiment with reverting this change
> later without having to prove that it is safe, then I think it's fine.
> Let's document this in the commit log though, so that whoever tries to
> revert this in the future (if any) does not have to re-explain all of
> this :)

Sure.

> 
> [..]
> > > > > >
> > > > > > For the cache trim mode, inactive file LRU size is read and the kernel
> > > > > > scales it down based on the reclaim iteration (file >> sc->priority) and
> > > > > > only checks if it is zero or not. Again precise information is not
> > > > > > needed.
> > > > >
> > > > > It sounds like it is possible that we enter the cache trim mode when
> > > > > we shouldn't if the stats are stale. Couldn't this lead to
> > > > > over-reclaiming file memory?
> > > > >
> > > >
> > > > Can you explain how this over-reclaiming file will happen?
> > >
> > > In one reclaim iteration, we could flush the stats, read the inactive
> > > file LRU size, confirm that (file >> sc->priority) > 0 and enter the
> > > cache trim mode, reclaiming file memory only. Let's assume that we
> > > reclaimed enough file memory such that the condition (file >>
> > > sc->priority) > 0 does not hold anymore.
> > >
> > > In a subsequent reclaim iteration, the flush could be skipped due to
> > > ratelimiting. Now we will enter the cache trim mode again and reclaim
> > > file memory only, even though the actual amount of file memory is low.
> > > This will cause over-reclaiming from file memory and dismissing anon
> > > memory that we should have reclaimed, which means that we will need
> > > additional reclaim iterations to actually free memory.
> > >
> > > I believe this scenario would be possible with ratelimiting, right?
> > >
> >
> > So, the (old_file >> sc->priority) > 0 is true but the (new_file >>
> > sc->priority) > is false. In the next iteration, (old_file >>
> > (sc->priority-1)) > 0 will still be true but somehow (new_file >>
> > (sc->priority-1)) > 0 is false. It can happen if in the previous
> > iteration, somehow kernel has reclaimed more than double what it was
> > supposed to reclaim or there are concurrent reclaimers. In addition the
> > nr_reclaim is still less than nr_to_reclaim and there is no file
> > deactivation request.
> >
> > Yeah it can happen but a lot of wierd conditions need to happen
> > concurrently for this to happen.
> 
> Not necessarily sc->priority-1. Consider two separate sequential
> reclaim attempts. At the same priority, the first reclaim attempt
> could rightfully enter cache trim mode, while the second one
> wrongfully enters cache trim mode due to stale stats, over-reclaim
> file memory, and stall longer to actually reclaim the anon memory.
> 

For two different reclaim attempts, even more things need to go wrong.
Anyway, we are talking too much in the abstract here and focusing on
corner cases that almost all heuristics have. Unless there is a clear
explanation of why the probability of hitting such a corner case would
increase, I don't think spending time discussing it is useful.

> I am sure such a scenario is not going to be common, but I am also
> sure if it happens it will be a huge pain to debug.
> 
> If others agree that this is fine, let's document this with a comment
> and in the commit log. I am not sure how common the cache trim mode is
> in practice to understand the potential severity of such problems.
> There may also be other consequences that I am not aware of.

What is your definition of "others" though?
Yosry Ahmed June 24, 2024, 9:41 p.m. UTC | #13
On Mon, Jun 24, 2024 at 1:01 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Mon, Jun 24, 2024 at 12:06:28PM GMT, Yosry Ahmed wrote:
> > On Mon, Jun 24, 2024 at 11:59 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Mon, Jun 24, 2024 at 10:15:38AM GMT, Yosry Ahmed wrote:
> > > > On Mon, Jun 24, 2024 at 10:02 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > > >
> > > > > On Mon, Jun 24, 2024 at 05:57:51AM GMT, Yosry Ahmed wrote:
> > > > > > > > and I will explain why below. I know it may be a necessary
> > > > > > > > evil, but I would like us to make sure there is no other option before
> > > > > > > > going forward with this.
> > > > > > >
> > > > > > > Instead of necessary evil, I would call it a pragmatic approach i.e.
> > > > > > > resolve the ongoing pain with good enough solution and work on long term
> > > > > > > solution later.
> > > > > >
> > > > > > It seems like there are a few ideas for solutions that may address
> > > > > > longer-term concerns, let's make sure we try those out first before we
> > > > > > fall back to the short-term mitigation.
> > > > > >
> > > > >
> > > > > Why? More specifically why try out other things before this patch? Both
> > > > > can be done in parallel. This patch has been running in production at
> > > > > Meta for several weeks without issues. Also I don't see how merging this
> > > > > would impact us on working on long term solutions.
> > > >
> > > > The problem is that once this is merged, it will be difficult to
> > > > change this back to a normal flush once other improvements land. We
> > > > don't have a test that reproduces the problem that we can use to make
> > > > sure it's safe to revert this change later, it's only using data from
> > > > prod.
> > > >
> > >
> > > I am pretty sure the work on long term solution would be iterative which
> > > will involve many reverts and redoing things differently. So, I think it
> > > is understandable that we may need to revert or revert the reverts.
> > >
> > > > Once this mitigation goes in, I think everyone will be less motivated
> > > > to get more data from prod about whether it's safe to revert the
> > > > ratelimiting later :)
> > >
> > > As I said I don't expect "safe in prod" as a strict requirement for a
> > > change.
> >
> > If everyone agrees that we can experiment with reverting this change
> > later without having to prove that it is safe, then I think it's fine.
> > Let's document this in the commit log though, so that whoever tries to
> > revert this in the future (if any) does not have to re-explain all of
> > this :)
>
> Sure.
>
> >
> > [..]
> > > > > > >
> > > > > > > For the cache trim mode, inactive file LRU size is read and the kernel
> > > > > > > scales it down based on the reclaim iteration (file >> sc->priority) and
> > > > > > > only checks if it is zero or not. Again precise information is not
> > > > > > > needed.
> > > > > >
> > > > > > It sounds like it is possible that we enter the cache trim mode when
> > > > > > we shouldn't if the stats are stale. Couldn't this lead to
> > > > > > over-reclaiming file memory?
> > > > > >
> > > > >
> > > > > Can you explain how this over-reclaiming file will happen?
> > > >
> > > > In one reclaim iteration, we could flush the stats, read the inactive
> > > > file LRU size, confirm that (file >> sc->priority) > 0 and enter the
> > > > cache trim mode, reclaiming file memory only. Let's assume that we
> > > > reclaimed enough file memory such that the condition (file >>
> > > > sc->priority) > 0 does not hold anymore.
> > > >
> > > > In a subsequent reclaim iteration, the flush could be skipped due to
> > > > ratelimiting. Now we will enter the cache trim mode again and reclaim
> > > > file memory only, even though the actual amount of file memory is low.
> > > > This will cause over-reclaiming from file memory and dismissing anon
> > > > memory that we should have reclaimed, which means that we will need
> > > > additional reclaim iterations to actually free memory.
> > > >
> > > > I believe this scenario would be possible with ratelimiting, right?
> > > >
> > >
> > > So, the (old_file >> sc->priority) > 0 is true but the (new_file >>
> > > sc->priority) > is false. In the next iteration, (old_file >>
> > > (sc->priority-1)) > 0 will still be true but somehow (new_file >>
> > > (sc->priority-1)) > 0 is false. It can happen if in the previous
> > > iteration, somehow kernel has reclaimed more than double what it was
> > > supposed to reclaim or there are concurrent reclaimers. In addition the
> > > nr_reclaim is still less than nr_to_reclaim and there is no file
> > > deactivation request.
> > >
> > > Yeah it can happen but a lot of wierd conditions need to happen
> > > concurrently for this to happen.
> >
> > Not necessarily sc->priority-1. Consider two separate sequential
> > reclaim attempts. At the same priority, the first reclaim attempt
> > could rightfully enter cache trim mode, while the second one
> > wrongfully enters cache trim mode due to stale stats, over-reclaim
> > file memory, and stall longer to actually reclaim the anon memory.
> >
>
> For two different reclaim attempts even more things need to go wrong.
> Anyways we are talking too much in abstract here and focusing on the
> corner cases which almost all heuristics have. Unless there is a clear
> explanation that the corner case probability will be increased, I don't
> think spending time discussing it is useful.
>
> > I am sure such a scenario is not going to be common, but I am also
> > sure if it happens it will be a huge pain to debug.
> >
> > If others agree that this is fine, let's document this with a comment
> > and in the commit log. I am not sure how common the cache trim mode is
> > in practice to understand the potential severity of such problems.
> > There may also be other consequences that I am not aware of.
>
> What is your definition of "others" though?

I am just interested in hearing more opinions. If others (e.g. people
on the CC list) agree with you that this is the approach we should be
taking, then I won't stand in the way. If others share my concerns, then
maybe we should not proceed. It seemed like at least Jesper had some
concerns as well.

If no one cares enough to voice their opinion, then I suppose it's up to
you :)

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c0429fd6c573..bda4f92eba71 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2263,7 +2263,7 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
 	 * Flush the memory cgroup stats, so that we read accurate per-memcg
 	 * lruvec stats for heuristics.
 	 */
-	mem_cgroup_flush_stats(sc->target_mem_cgroup);
+	mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup);
 
 	/*
 	 * Determine the scan balance between anon and file LRUs.
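
For context, the behavioral difference between the two calls is roughly
the following (a simplified sketch based on mm/memcontrol.c around this
time; exact constants, helper names, and signatures may differ):

        /* Simplified sketch, not the verbatim upstream code. */

        /*
         * A periodic worker flushes rstat roughly every FLUSH_TIME
         * (~2 seconds) and records flush_last_time.
         */

        void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
        {
                /*
                 * Synchronous: flush now if enough stat updates have
                 * accumulated, contending on the global rstat lock
                 * with any other concurrent flushers.
                 */
                if (memcg_vmstats_needs_flush(memcg->vmstats))
                        do_flush_stats(memcg);
        }

        void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
        {
                /*
                 * Skip the flush entirely unless the periodic worker
                 * looks overdue, so callers may read stats that are a
                 * couple of seconds stale but do not busywait on the
                 * rstat lock.
                 */
                if (time_after64(jiffies_64,
                                 READ_ONCE(flush_last_time) + 2UL * FLUSH_TIME))
                        mem_cgroup_flush_stats(memcg);
        }

The net effect for this callsite is that the heuristics in
prepare_scan_control() may act on slightly stale per-memcg stats instead
of waiting on the flush.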