
memcg: sync flush only if periodic flush is delayed

Message ID 20220304184040.1304781-1-shakeelb@google.com (mailing list archive)
State New

Commit Message

Shakeel Butt March 4, 2022, 6:40 p.m. UTC
Daniel Dao has reported [1] a regression on workloads that may trigger
a lot of refaults (anon and file). The underlying issue is that flushing
rstat is expensive. Although rstat flushes are batched with (nr_cpus *
MEMCG_BATCH) stat updates, it seems there are workloads which genuinely
do stat updates larger than the batch value within a short amount of
time. Since the rstat flush can happen in performance-critical codepaths
like page faults, such workloads can suffer greatly.

This patch fixes the regression by making the rstat flushing
conditional in the performance-critical codepaths. More specifically,
the kernel relies on the async periodic rstat flusher to flush the
stats, and only if the periodic flusher is delayed by more than twice
its normal time window does the kernel allow rstat flushing from the
performance-critical codepaths.

Now the question: what are the side-effects of this change? The worst
that can happen is that the refault codepath will see 4-second-old
lruvec stats and may cause false (or missed) activations of the
refaulted page, which may under- or overestimate the workingset size.
That is not very concerning, as the kernel can already miss or do false
activations.

There are two more codepaths whose flushing behavior is not changed by
this patch, and we may need to revisit them in the future. One is the
writeback stats used by dirty throttling, and the second is the
deactivation heuristic in the reclaim path. For now we are keeping an
eye on them; if regressions are reported due to these codepaths, we
will reevaluate then.

Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1]
Fixes: 1f828223b799 ("memcg: flush lruvec stats in the refault")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Daniel Dao <dqminh@cloudflare.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/memcontrol.h |  5 +++++
 mm/memcontrol.c            | 12 +++++++++++-
 mm/workingset.c            |  2 +-
 3 files changed, 17 insertions(+), 2 deletions(-)

Comments

Ivan Babrou March 4, 2022, 6:53 p.m. UTC | #1
On Fri, Mar 4, 2022 at 10:40 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> [...]
> ---

See my testing results here:
https://lore.kernel.org/linux-mm/CABWYdi2usrWOnOnmKYYvuFpE=yJmgtq4a7u6FiGJGJkskv+eVQ@mail.gmail.com/

Tested-by: Ivan Babrou <ivan@cloudflare.com>
Andrew Morton March 7, 2022, 2:44 a.m. UTC | #2
On Fri,  4 Mar 2022 18:40:40 +0000 Shakeel Butt <shakeelb@google.com> wrote:

> [...]
> 
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
>
> ...
>
> @@ -648,10 +652,16 @@ void mem_cgroup_flush_stats(void)
>  		__mem_cgroup_flush_stats();
>  }
>  
> +void mem_cgroup_flush_stats_delayed(void)
> +{
> +	if (rstat_flush_time && time_after64(jiffies_64, flush_next_time))

rstat_flush_time isn't defined for me and my googling indicates this is
the first time the symbol has been used in the history of the world. 
I'm stumped.

> +		mem_cgroup_flush_stats();
> +}
> +
>
> ...
>
Shakeel Butt March 7, 2022, 3:06 a.m. UTC | #3
On Sun, Mar 6, 2022 at 6:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri,  4 Mar 2022 18:40:40 +0000 Shakeel Butt <shakeelb@google.com> wrote:
>
> > [...]
> >
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> >
> > ...
> >
> > @@ -648,10 +652,16 @@ void mem_cgroup_flush_stats(void)
> >               __mem_cgroup_flush_stats();
> >  }
> >
> > +void mem_cgroup_flush_stats_delayed(void)
> > +{
> > +     if (rstat_flush_time && time_after64(jiffies_64, flush_next_time))
>
> rstat_flush_time isn't defined for me and my googling indicates this is
> the first time the symbol has been used in the history of the world.
> I'm stumped.
>

Oh sorry about that. I thought I renamed all instances of
"rstat_flush_time" to "flush_next_time" before sending out the email.
Please just remove "rstat_flush_time &&" from the if-condition.

thanks,
Shakeel
Michal Koutný March 11, 2022, 4 p.m. UTC | #4
Hello.

TL;DR: rstat is slow but accurate on the reader side. To tackle the
performance regression, not flushing at all seems simpler than this patch.


So, I've made an overview for myself of what the relevant changes were
with the rstat introduction.
The amount of work is:
- before
  R: O(1)
  W: O(tree_depth)

- after
  R: O(nr_cpus * nr_cgroups(subtree) * nr_counters) 
  W: O(tree_depth)

That doesn't look like a positive change especially on the reader side.

(In theory, the reader's work would be amortized but as the original
report here shows, not all workloads are diluting the flushes
sufficiently. [1])

The benefit this was traded for was the greater accuracy, the possible
error is:
- before
  - O(nr_cpus * nr_cgroups(subtree) * MEMCG_CHARGE_BATCH)	(1)
- after
    O(nr_cpus * MEMCG_CHARGE_BATCH) // sync. flush
    or
    O(flush_period * max_cr) // periodic flush only		(2)
                             // max_cr is per-counter max change rate

So we could argue that if the pre-rstat kernels did just fine with the
error (1), they would not be worse with periodic flush if we can compare
(1) and (2). 

On Fri, Mar 04, 2022 at 06:40:40PM +0000, Shakeel Butt <shakeelb@google.com> wrote:
> This patch fixes this regression by making the rstat flushing
> conditional in the performance critical codepaths. More specifically,
> the kernel relies on the async periodic rstat flusher to flush the stats
> and only if the periodic flusher is delayed by more than twice the
> amount of its normal time window then the kernel allows rstat flushing
> from the performance critical codepaths.

I'm not sure whether your patch attempts to solve the problem of
(a) periodic flush getting stuck or (b) limiting error on refault path.
If it's (a), it should be tackled more systematically (dedicated wq?).
If it's (b), why not just rely on periodic flush (self answer: (1) and
(2) comparison is workload dependent).

> Now the question: what are the side-effects of this change? The worst
> that can happen is the refault codepath will see 4sec old lruvec stats
> and may cause false (or missed) activations of the refaulted page which
> may under-or-overestimate the workingset size. Though that is not very
> concerning as the kernel can already miss or do false activations.

We can't reason about the effect of periodic-only flushing, so this
newly introduced factor would inherit that too. I find it superfluous.


Michal

[1] This is worth looking at in more detail.

From the flush condition we have
  cr * Δt = nr_cpus * MEMCG_CHARGE_BATCH 
where Δt is time between flushes and cr is global change rate.

cr composes of all updates together (corresponds to stats_updates in
memcg_rstat_updated(), max_cr is change rate per counter)
  cr = Σ cr_i <= nr_counters * max_cr 

By combining these two we get shortest time between flushes:
  cr * Δt <= nr_counters * max_cr * Δt
  nr_cpus * MEMCG_CHARGE_BATCH <= nr_counters * max_cr * Δt
  Δt >= (nr_cpus * MEMCG_CHARGE_BATCH) / (nr_counters * max_cr)

We are interested in 
  R_amort = flush_work / Δt
which is
  R_amort <= flush_work * nr_counters * max_cr / (nr_cpus * MEMCG_CHARGE_BATCH)

R_amort: O( nr_cpus * nr_cgroups(subtree) * nr_counters * (nr_counters * max_cr) / (nr_cpus * MEMCG_CHARGE_BATCH) )
R_amort: O( nr_cgroups(subtree) * nr_counters^2 * max_cr / MEMCG_CHARGE_BATCH )

The square looks interesting given there are already tens of counters.
(As data from Ivan have shown, we can hardly restore the pre-rstat
performance on the read side even with mere mod_delayed_work().)
This is what you partially solved with introduction of NR_MEMCG_EVENTS
but the stats_updates was still sum of all events, so the flush might
have still triggered too frequently.

Maybe that would be a better long-term approach: splitting into accurate
and approximate counters and reflecting that in the error estimator stats_updates.

Or some other optimization of mem_cgroup_css_rstat_flush().
Shakeel Butt March 12, 2022, 7:07 p.m. UTC | #5
Hi Michal,

On Fri, Mar 11, 2022 at 05:00:51PM +0100, Michal Koutný wrote:
> Hello.

> TL;DR rstats are slow but accurate on reader side. To tackle the
> performance regression no flush seems simpler than this patch.


The term 'simpler' is very subjective here, and I would argue this patch
is not much more complicated than no flush. But I think there is no
benefit in arguing the point, as these are not some stable API that
cannot be changed later. We can always come back and change things
based on new findings.

Before going further, I do want to mention the main reason to move to
rstat infrastructure was to decouple the error rate in the stats from
the number of memory cgroups and the number of stat items. So, I will
focus on the error rate in this email.

[...]

> The benefit this was traded for was the greater accuracy, the possible
> error is:
> - before
>    - O(nr_cpus * nr_cgroups(subtree) * MEMCG_CHARGE_BATCH)	(1)

Please note that (1) is the possible error for each stat item and
without any time bound.

> - after
>      O(nr_cpus * MEMCG_CHARGE_BATCH) // sync. flush

The above is across all the stat items.

>      or
>      O(flush_period * max_cr) // periodic flush only		(2)
>                               // max_cr is per-counter max change rate

And this above one, I am assuming, is for performance critical readers
(workingset_refault introduced by this patch) and flush_period here is 4
seconds. Please correct me if I misunderstood this.


> So we could argue that if the pre-rstat kernels did just fine with the
> error (1), they would not be worse with periodic flush if we can compare
> (1) and (2).

I agree with this assessment, but please note that pre-rstat kernels
were not good for machines with a large number of CPUs running a large
number of workloads.

[...]

> I'm not sure whether your patch attempts to solve the problem of
> (a) periodic flush getting stuck or (b) limiting error on refault path.
> If it's (a), it should be tackled more systematically (dedicated wq?).
> If it's (b), why not just rely on periodic flush (self answer: (1) and
> (2) comparison is workload dependent).


It is (b) that I am aiming for in this patch. At least (a) was not
happening in the cloudflare experiments. Are you suggesting having a
dedicated high priority wq would solve both (a) and (b)?

> > Now the question: what are the side-effects of this change? The worst
> > that can happen is the refault codepath will see 4sec old lruvec stats
> > and may cause false (or missed) activations of the refaulted page which
> > may under-or-overestimate the workingset size. Though that is not very
> > concerning as the kernel can already miss or do false activations.

> We can't argue what's the effect of periodic only flushing so this
> newly introduced factor would inherit that too. I find it superfluous.


Sorry I didn't get your point. What is superfluous?


> Michal

> [1] This is worth looking at in more detail.


Oh you did some awesome analysis here.

>  From the flush condition we have
>    cr * Δt = nr_cpus * MEMCG_CHARGE_BATCH
> where Δt is time between flushes and cr is global change rate.

> cr composes of all updates together (corresponds to stats_updates in
> memcg_rstat_updated(), max_cr is change rate per counter)
>    cr = Σ cr_i <= nr_counters * max_cr

I don't get the reason for breaking 'cr' into individual stat items or
counters. What is the benefit? We want to keep the error rate decoupled
from the number of counters (or stat items).


> By combining these two we get shortest time between flushes:
>    cr * Δt <= nr_counters * max_cr * Δt
>    nr_cpus * MEMCG_CHARGE_BATCH <= nr_counters * max_cr * Δt
>    Δt >= (nr_cpus * MEMCG_CHARGE_BATCH) / (nr_counters * max_cr)

> We are interested in
>    R_amort = flush_work / Δt
> which is
> R_amort <= flush_work * nr_counters * max_cr / (nr_cpus * MEMCG_CHARGE_BATCH)

> R_amort: O( nr_cpus * nr_cgroups(subtree) * nr_counters * (nr_counters * max_cr) / (nr_cpus * MEMCG_CHARGE_BATCH) )
> R_amort: O( nr_cgroups(subtree) * nr_counters^2 * max_cr / MEMCG_CHARGE_BATCH )

> The square looks interesting given there are already tens of counters.
> (As data from Ivan have shown, we can hardly restore the pre-rstat
> performance on the read side even with mere mod_delayed_work().)
> This is what you partially solved with introduction of NR_MEMCG_EVENTS

My main reason behind trying NR_MEMCG_EVENTS was to reduce flush_work by
reducing nr_counters and I don't think nr_counters should have an impact
on Δt.

> but the stats_updates was still sum of all events, so the flush might
> have still triggered too frequently.

> Maybe that would be better long-term approach, splitting into accurate
> and approximate counters and reflect that in the error estimator stats_updates.

> Or some other optimization of mem_cgroup_css_rstat_flush().


Thanks for your insights. This is really awesome, and it will be good
to explore the long-term approach. Do you have any strong concerns with
the current patch? I think we should proceed with it and focus more on
the long-term approach.

thanks,
Shakeel
Hillf Danton March 13, 2022, 2:50 a.m. UTC | #6
On Fri,  4 Mar 2022 18:40:40 +0000 Shakeel Butt wrote:
> [...]
> ---
>  include/linux/memcontrol.h |  5 +++++
>  mm/memcontrol.c            | 12 +++++++++++-
>  mm/workingset.c            |  2 +-
>  3 files changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index a68dce3873fc..89b14729d59f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1012,6 +1012,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>  }
>  
>  void mem_cgroup_flush_stats(void);
> +void mem_cgroup_flush_stats_delayed(void);
>  
>  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>  			      int val);
> @@ -1455,6 +1456,10 @@ static inline void mem_cgroup_flush_stats(void)
>  {
>  }
>  
> +static inline void mem_cgroup_flush_stats_delayed(void)
> +{
> +}
> +
>  static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
>  					    enum node_stat_item idx, int val)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f79bb3f25ce4..edfb337e6948 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -587,6 +587,9 @@ static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
>  static DEFINE_SPINLOCK(stats_flush_lock);
>  static DEFINE_PER_CPU(unsigned int, stats_updates);
>  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> +static u64 flush_next_time;
> +
> +#define FLUSH_TIME (2UL*HZ)
>  
>  /*
>   * Accessors to ensure that preemption is disabled on PREEMPT_RT because it can
> @@ -637,6 +640,7 @@ static void __mem_cgroup_flush_stats(void)
>  	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
>  		return;
>  
> +	flush_next_time = jiffies_64 + 2*FLUSH_TIME;
>  	cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
>  	atomic_set(&stats_flush_threshold, 0);
>  	spin_unlock_irqrestore(&stats_flush_lock, flag);
> @@ -648,10 +652,16 @@ void mem_cgroup_flush_stats(void)
>  		__mem_cgroup_flush_stats();
>  }
>  
> +void mem_cgroup_flush_stats_delayed(void)
> +{
> +	if (rstat_flush_time && time_after64(jiffies_64, flush_next_time))
> +		mem_cgroup_flush_stats();
> +}
> +
>  static void flush_memcg_stats_dwork(struct work_struct *w)
>  {
>  	__mem_cgroup_flush_stats();
> -	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
> +	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
>  }
>  

Given that flush_next_time is updated only by the periodic flusher,
flushing stats is effectively disabled for page faults, as the sync
flush can hardly run before the periodic one.

If that is the case, turn the sync flush into a signal that requests
flushing stats sooner.

Only for thoughts now.

Hillf

--- x/mm/memcontrol.c
+++ y/mm/memcontrol.c
@@ -587,9 +587,9 @@ static DECLARE_DEFERRABLE_WORK(stats_flu
 static DEFINE_SPINLOCK(stats_flush_lock);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
-static u64 flush_next_time;
 
-#define FLUSH_TIME (2UL*HZ)
+#define FLUSH_STATS_TICK (2UL*HZ)
+static unsigned long flush_stats_tick = FLUSH_STATS_TICK;
 
 /*
  * Accessors to ensure that preemption is disabled on PREEMPT_RT because it can
@@ -640,7 +640,6 @@ static void __mem_cgroup_flush_stats(voi
 	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
 		return;
 
-	flush_next_time = jiffies_64 + 2*FLUSH_TIME;
 	cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
 	atomic_set(&stats_flush_threshold, 0);
 	spin_unlock_irqrestore(&stats_flush_lock, flag);
@@ -654,14 +653,15 @@ void mem_cgroup_flush_stats(void)
 
 void mem_cgroup_flush_stats_delayed(void)
 {
-	if (time_after64(jiffies_64, flush_next_time))
-		mem_cgroup_flush_stats();
+	/* s/mem_cgroup_flush_stats_delayed/mem_cgroup_signal_flush_stats/ */
+	flush_stats_tick /= 2;
 }
 
 static void flush_memcg_stats_dwork(struct work_struct *w)
 {
 	__mem_cgroup_flush_stats();
-	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
+	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, flush_stats_tick);
+	flush_stats_tick = FLUSH_STATS_TICK;
 }
 
 /**
Michal Koutný March 14, 2022, 12:57 p.m. UTC | #7
Hi.

On Sat, Mar 12, 2022 at 07:07:15PM +0000, Shakeel Butt <shakeelb@google.com> wrote:
> So, I will focus on the error rate in this email.

(OK, I'll stick to error estimate (for long-term) in this message and
will send another about the current patch.)

> [...]
> 
> > The benefit this was traded for was the greater accuracy, the possible
> > error is:
> > - before
> >    - O(nr_cpus * nr_cgroups(subtree) * MEMCG_CHARGE_BATCH)	(1)
> 
> Please note that (1) is the possible error for each stat item and
> without any time bound.

I agree (I forgot to highlight that this can get stuck forever).

> 
> > - after
> >      O(nr_cpus * MEMCG_CHARGE_BATCH) // sync. flush
> 
> The above is across all the stat items.

Can it be used to argue about the error?
E.g.
    nr_cpus * MEMCG_CHARGE_BATCH / nr_counters
looks appealing but that's IMO too optimistic.

The individual item updates are correlated, so in practice a single item
would see a lower error than my first relation; but without delving too
much into correlations, the upper bound is nr_counters-independent.


> I don't get the reason of breaking 'cr' into individual stat item or
> counter. What is the benefit? We want to keep the error rate decoupled
> from the number of counters (or stat items).

It's just a model, it should capture that every stat item (change)
contributes to the common error estimate. (So it moves more towards the 
  nr_cpus * MEMCG_CHARGE_BATCH / nr_counters
per-item error (but here we're asking about processing time.))

[...]

> My main reason behind trying NR_MEMCG_EVENTS was to reduce flush_work by
> reducing nr_counters and I don't think nr_counters should have an impact
> on Δt.

The higher the number of items changing, the sooner they accumulate the
target error, no?

(Δt is not the periodic flush period, it's variable time between two
sync flushes.)

Michal
Michal Koutný March 14, 2022, 12:57 p.m. UTC | #8
Hi 2.

On Sat, Mar 12, 2022 at 07:07:15PM +0000, Shakeel Butt <shakeelb@google.com> wrote:
> It is (b) that I am aiming for in this patch. At least (a) was not
> happening in the cloudflare experiments. Are you suggesting having a
> dedicated high priority wq would solve both (a) and (b)?
> [...]
> > We can't argue what's the effect of periodic only flushing so this
> > newly introduced factor would inherit that too. I find it superfluous.
> 
> 
> Sorry I didn't get your point. What is superfluous?

Let me retell my understanding.
The current implementation flushes based on cumulated error and time.
Your patch proposes conditioning the former with another time-based
flushing, whose duration can be up to 2 times longer than the existing
periodic flush.

Assuming the periodic flush is working, the reader won't see data older
than 2 seconds, so the additional sync-flush after (possible) 4 seconds
seems superfluous.

(In the case of periodic flush being stuck, I thought the factor 2=4s/2s
was superfluous, another magic parameter.)

I'm comparing here your proposal vs no synchronous flushing in
workingset_refault().

> Do you have any strong concerns with the currect patch?

Does that clarify?

(I agree with your initial thesis this can be iterated before it evolves
to everyone's satisfaction.)


Michal
Shakeel Butt March 16, 2022, 4:26 p.m. UTC | #9
On Mon, Mar 14, 2022 at 5:57 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hi 2.
>
> [...]
>

Thanks Michal for the explanation. For the long term, I think all this
batching can be made part of the core rstat infrastructure and made as
generic as you have described. Also, there is interest in using rstat
from BPF, so generic batching would be needed there as well.

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a68dce3873fc..89b14729d59f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1012,6 +1012,7 @@  static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 }
 
 void mem_cgroup_flush_stats(void);
+void mem_cgroup_flush_stats_delayed(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
@@ -1455,6 +1456,10 @@  static inline void mem_cgroup_flush_stats(void)
 {
 }
 
+static inline void mem_cgroup_flush_stats_delayed(void)
+{
+}
+
 static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
 					    enum node_stat_item idx, int val)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f79bb3f25ce4..edfb337e6948 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -587,6 +587,9 @@  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
 static DEFINE_SPINLOCK(stats_flush_lock);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
+static u64 flush_next_time;
+
+#define FLUSH_TIME (2UL*HZ)
 
 /*
  * Accessors to ensure that preemption is disabled on PREEMPT_RT because it can
@@ -637,6 +640,7 @@  static void __mem_cgroup_flush_stats(void)
 	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
 		return;
 
+	flush_next_time = jiffies_64 + 2*FLUSH_TIME;
 	cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
 	atomic_set(&stats_flush_threshold, 0);
 	spin_unlock_irqrestore(&stats_flush_lock, flag);
@@ -648,10 +652,16 @@  void mem_cgroup_flush_stats(void)
 		__mem_cgroup_flush_stats();
 }
 
+void mem_cgroup_flush_stats_delayed(void)
+{
+	if (rstat_flush_time && time_after64(jiffies_64, flush_next_time))
+		mem_cgroup_flush_stats();
+}
+
 static void flush_memcg_stats_dwork(struct work_struct *w)
 {
 	__mem_cgroup_flush_stats();
-	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
+	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
 /**
diff --git a/mm/workingset.c b/mm/workingset.c
index 8a3828acc0bf..592569a8974c 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -355,7 +355,7 @@  void workingset_refault(struct folio *folio, void *shadow)
 
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_delayed();
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't activate pages that couldn't stay resident even if