
sched, fair: Allow a per-cpu kthread waking a task to stack on the same CPU

Message ID 20200127143608.GX3466@techsingularity.net (mailing list archive)
State New, archived

Commit Message

Mel Gorman Jan. 27, 2020, 2:36 p.m. UTC
Commit 8ab39f11d974 ("xfs: prevent CIL push holdoff in log
recovery") changed from using bound workqueues to using unbound
workqueues. Functionally this makes sense but it was observed at the time
that dbench performance dropped quite a lot and CPU migrations were
excessively high even when there were plenty of idle CPUs.

The pattern of the task migration is straight-forward. With XFS, an IO
issuer may delegate work to a kworker which wakes on the same CPU. On
completion of the work, it wakes the task, finds that the previous CPU
is busy (because the kworker is still running on it) and migrates the
task to the next idle CPU. The task ends up migrating around all CPUs
sharing a LLC at high frequency. This has negative implications both in
communication costs and power management.  mpstat confirmed that at low
thread counts, all CPUs sharing an LLC have a low level of activity.

The impact of this problem is related to the number of CPUs sharing an LLC.

This patch special cases the pattern and allows a kworker waker and a
task wakee to stack on the same CPU if there is a strong chance they are
directly related. The expectation is that the kworker is likely going
back to sleep shortly. This is not guaranteed as the IO could be queued
asynchronously but there is a very strong relationship between the task and
kworker in this case that would justify stacking on the same CPU instead
of migrating. There should be few concerns about kworker starvation given
that the special casing is only when the kworker is the waker.
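
For reference, this is a condensed sketch of the two pieces the diffstat
below refers to, reconstructed from the hunks quoted in the replies (the
tail of the sched.h helper is snipped in the quote and is assumed to
simply return true):

	/* kernel/sched/sched.h: a kthread pinned to a single CPU */
	static inline bool is_per_cpu_kthread(struct task_struct *p)
	{
		if (!(p->flags & PF_KTHREAD))
			return false;

		if (p->nr_cpus_allowed != 1)
			return false;

		return true;
	}

	/* kernel/sched/fair.c: in select_idle_sibling(), after the check
	 * for an idle prev CPU -- allow the wakee to stack on the waking
	 * per-cpu kthread's CPU when that kthread is the only task
	 * running there.
	 */
	if (is_per_cpu_kthread(current) &&
	    prev == smp_processor_id() &&
	    this_rq()->nr_running <= 1) {
		return prev;
	}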

DBench on XFS
MMTests config: io-dbench4-async modified to run on a fresh XFS filesystem

UMA machine with 8 cores sharing LLC
                          5.5.0-rc7              5.5.0-rc7
                  tipsched-20200124           kworkerstack
Amean     1        22.63 (   0.00%)       20.54 *   9.23%*
Amean     2        25.56 (   0.00%)       23.40 *   8.44%*
Amean     4        28.63 (   0.00%)       27.85 *   2.70%*
Amean     8        37.66 (   0.00%)       37.68 (  -0.05%)
Amean     64      469.47 (   0.00%)      468.26 (   0.26%)
Stddev    1         1.00 (   0.00%)        0.72 (  28.12%)
Stddev    2         1.62 (   0.00%)        1.97 ( -21.54%)
Stddev    4         2.53 (   0.00%)        3.58 ( -41.19%)
Stddev    8         5.30 (   0.00%)        5.20 (   1.92%)
Stddev    64       86.36 (   0.00%)       94.53 (  -9.46%)

NUMA machine, 48 CPUs total, 24 CPUs share cache
                           5.5.0-rc7              5.5.0-rc7
                   tipsched-20200124      kworkerstack-v1r2
Amean     1         58.69 (   0.00%)       30.21 *  48.53%*
Amean     2         60.90 (   0.00%)       35.29 *  42.05%*
Amean     4         66.77 (   0.00%)       46.55 *  30.28%*
Amean     8         81.41 (   0.00%)       68.46 *  15.91%*
Amean     16       113.29 (   0.00%)      107.79 *   4.85%*
Amean     32       199.10 (   0.00%)      198.22 *   0.44%*
Amean     64       478.99 (   0.00%)      477.06 *   0.40%*
Amean     128     1345.26 (   0.00%)     1372.64 *  -2.04%*
Stddev    1          2.64 (   0.00%)        4.17 ( -58.08%)
Stddev    2          4.35 (   0.00%)        5.38 ( -23.73%)
Stddev    4          6.77 (   0.00%)        6.56 (   3.00%)
Stddev    8         11.61 (   0.00%)       10.91 (   6.04%)
Stddev    16        18.63 (   0.00%)       19.19 (  -3.01%)
Stddev    32        38.71 (   0.00%)       38.30 (   1.06%)
Stddev    64       100.28 (   0.00%)       91.24 (   9.02%)
Stddev    128      186.87 (   0.00%)      160.34 (  14.20%)

Dbench has been modified to report the time to complete a single "load
file". This is a more meaningful metric for dbench that a throughput
metric as the benchmark makes many different system calls that are not
throughput-related

The patch shows a 9.23% and 48.53% reduction in the time to process a load
file, with the difference partially explained by the number of CPUs sharing
an LLC. In a separate run, task migrations were almost eliminated by the
patch for low client counts. In case people take issue with the metric
used for the benchmark, here is a comparison of the throughputs as
reported by dbench on the NUMA machine.

dbench4 Throughput (misleading but traditional)
                           5.5.0-rc7              5.5.0-rc7
                   tipsched-20200124      kworkerstack-v1r2
Hmean     1        321.41 (   0.00%)      617.82 *  92.22%*
Hmean     2        622.87 (   0.00%)     1066.80 *  71.27%*
Hmean     4       1134.56 (   0.00%)     1623.74 *  43.12%*
Hmean     8       1869.96 (   0.00%)     2212.67 *  18.33%*
Hmean     16      2673.11 (   0.00%)     2806.13 *   4.98%*
Hmean     32      3032.74 (   0.00%)     3039.54 (   0.22%)
Hmean     64      2514.25 (   0.00%)     2498.96 *  -0.61%*
Hmean     128     1778.49 (   0.00%)     1746.05 *  -1.82%*

Note that this is somewhat specific to XFS; ext4 shows no performance
difference as it does not rely on kworkers in the same way. No major
problem was observed running other workloads on different machines although
not all tests have completed yet.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/core.c  | 11 -----------
 kernel/sched/fair.c  | 13 +++++++++++++
 kernel/sched/sched.h | 13 +++++++++++++
 3 files changed, 26 insertions(+), 11 deletions(-)

Comments

Dave Chinner Jan. 27, 2020, 10:32 p.m. UTC | #1
On Mon, Jan 27, 2020 at 02:36:08PM +0000, Mel Gorman wrote:
> Commit 8ab39f11d974 ("xfs: prevent CIL push holdoff in log
> recovery") changed from using bound workqueues to using unbound
> workqueues. Functionally this makes sense but it was observed at the time
> that the dbench performance dropped quite a lot and CPU migrations were
> excessively high even when there are plenty of idle CPUs.

Hmmm - that just made the CIL workqueue WQ_UNBOUND. Not a complex
change...

> The pattern of the task migration is straight-forward. With XFS, an IO
> issuer may delegate work to a kworker which wakes on the same CPU. On
> completion of the work, it wakes the task, finds that the previous CPU
> is busy (because the kworker is still running on it) and migrates the
> task to the next idle CPU. The task ends up migrating around all CPUs
> sharing a LLC at high frequency. This has negative implications both in
> commication costs and power management.  mpstat confirmed that at low
> thread counts that all CPUs sharing an LLC has low level of activity.

Very interesting, Mel. :)

I suspect this is a very similar issue to the one discussed in
this thread about workqueues and AIO completion latencies:

https://lore.kernel.org/lkml/20191114113153.GB4213@ming.t460p/

The problem is described here, along with comments about how
fundamental this behaviour is to the correct functioning of
filesystems:

https://lore.kernel.org/lkml/20191121221051.GG4614@dread.disaster.area/

There are several patches thrown about during the discussion,
initially focussed on wakeup pre-emption to run the work immediately
until I pointed out that was the wrong thing to do for work being
deferred to workqueues.  After some more proposed patches the
discussion on the scheduler side of things largely ground to a halt
and so has not been fixed.

So I'm initially wondering if this solves that problem, too, or
whether you are seeing a slightly different manifestation of that
same scheduler issue....

> The impact of this problem is related to the number of CPUs sharing an LLC.
> 
> This patch special cases the pattern and allows a kworker waker and a
> task wakee to stack on the same CPU if there is a strong chance they are
> directly related. The expectation is that the kworker is likely going
> back to sleep shortly. This is not guaranteed as the IO could be queued
> asynchronously but there is a very strong relationship between the task and
> kworker in this case that would justify stacking on the same CPU instead
> of migrating. There should be few concerns about kworker starvation given
> that the special casing is only when the kworker is the waker.
> DBench on XFS

[snip positive dbench results]

Yeah, dbench does lots of synchronous operations that end up waiting
on journal flushes (data integrity operations) so it would trip over
kworker scheduling latency issues.

FWIW, I didn't see any perf degradation on my machines from the
commit you quoted, but I also had a real hard time replicating the
aio completion latency problem on them as well. Hence I don't think
they are particularly susceptible to bad migration decisions, so I'm
not surprised I didn't see this.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fe4e0d775375..76df439aff76 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5912,6 +5912,19 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
>  		return prev;
>  
> +	/*
> +	 * Allow a per-cpu kthread to stack with the wakee if the
> +	 * kworker thread and the tasks previous CPU are the same.
> +	 * The assumption is that the wakee queued work for the
> +	 * per-cpu kthread that is now complete and the wakeup is
> +	 * essentially a sync wakeup.
> +	 */
> +	if (is_per_cpu_kthread(current) &&
> +	    prev == smp_processor_id() &&
> +	    this_rq()->nr_running <= 1) {
> +		return prev;
> +	}

Ok, so if I've read this correctly, this special case only triggers
when scheduling from the per-cpu kworker thread context, and only if
there is one other runnable task on the queue? So it special cases
the ping-pong case so that the non-bound task that scheduled the
kworker also remains scheduled on this CPU?

Hmmmm.

When we set up a workqueue as WQ_UNBOUND on a numa system, isn't the
worker pool set up as a node-bound task? i.e. it's not a per-cpu
kthread anymore, but a task limited by the cpumask of that node?
That isn't a per-CPU kthread anymore, is it? That is:

> @@ -2479,3 +2479,16 @@ static inline void membarrier_switch_mm(struct rq *rq,
>  {
>  }
>  #endif
> +
> +#ifdef CONFIG_SMP
> +static inline bool is_per_cpu_kthread(struct task_struct *p)
> +{
> +	if (!(p->flags & PF_KTHREAD))
> +		return false;
> +
> +	if (p->nr_cpus_allowed != 1)
> +		return false;

p->nr_cpus_allowed is no longer 1 but the number of CPUs in the
per-node cpu mask it is allowed to run on?

And so if that is the case, then the change in commit 8ab39f11d974
which set WQ_UNBOUND on the XFS CIL workqueue would mean the above
logic change should not be triggering for the CIL worker because it
is no longer a CPU bound kthread....

What am I missing here?

<light bulb illumination>

Is this actually ping-ponging the CIL flush and journal IO
completion because xlog_bio_end_io() always punts journal IO
completion to the log workqueue, which is:

	log->l_ioend_workqueue = alloc_workqueue("xfs-log/%s",
			WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_HIGHPRI, 0,
			mp->m_super->s_id);

i.e. it uses per-cpu kthreads for processing journal IO completion
similar to DIO io completion and thereby provides a vector for
the same issue?

A typical fsync is processed like this:

user task		CIL kworker		IO completion kworker
   xfs_trans_commit()
    pushes on log
    <blocks waiting on flush completion>

			<wake>
			formats vectors
			loop {
			   wait for iclog space
			   <block waiting on write completion>

						<wake>
						journal IO completion
						frees up iclog space
						wakes write waiter
						<done>

			   <wake>
			   write vectors into iclog
			   submit iclog IO
			}
			<done>

						<wake>
						journal IO completion
						frees up iclog space
						wakes flush waiter
						<done>
   <wakes>
   <transaction commit done>

i.e. before commit 8ab39f11d974 we have:

user task = unbound
XFS CIL commit worker = CPU bound kthread
XFS journal IO completion = CPU bound kthread

And because the CIL kworker and IO completion kworker are bound
to the same CPU they don't trigger migrations as they can't be moved
anywhere else.  And so it doesn't matter how many times we switch
between them to complete a CIL flush because they will not trigger
migrations.

However, after commit 8ab39f11d974 we have:

user task = unbound
XFS CIL commit worker = unbound kthread
XFS journal IO completion = CPU bound kthread

IOWs, I think we now have exactly the same situation as discussed in
the thread I pointed you to above, where unbound task work (the
CIL kworker) is trying to run on the same CPU as the CPU-bound IO
completion kworker, and that causes the CIL kworker to be migrated
to a different CPU on each bounce through the "wait for iclog
space" loop. Hence your new logic is actually triggering on the
journal IO completion kworker threads, not the CIL kworker threads.

After all this, I have two questions that would help me understand
if this is what you are seeing:

1. to confirm: does removing just the WQ_UNBOUND from the CIL push
workqueue (as added in 8ab39f11d974) make the regression go away?

2. when the problem shows up, which tasks are actually being
migrated excessively - is it the user task, the CIL kworker task
or something else?

Cheers,

Dave.
Mel Gorman Jan. 28, 2020, 1:19 a.m. UTC | #2
On Tue, Jan 28, 2020 at 09:32:56AM +1100, Dave Chinner wrote:
> On Mon, Jan 27, 2020 at 02:36:08PM +0000, Mel Gorman wrote:
> > Commit 8ab39f11d974 ("xfs: prevent CIL push holdoff in log
> > recovery") changed from using bound workqueues to using unbound
> > workqueues. Functionally this makes sense but it was observed at the time
> > that the dbench performance dropped quite a lot and CPU migrations were
> > excessively high even when there are plenty of idle CPUs.
> 
> Hmmm - that just made the CIL workqueue WQ_UNBOUND. Not a complex
> change...
> 

Not in itself but it has an effect on the wakeup path hitting
select_idle_sibling and migrating a user task because the previously used
CPU is not idle. Performance problems with select_idle_sibling have been
known to cause poor anger management issues.

> > The pattern of the task migration is straight-forward. With XFS, an IO
> > issuer may delegate work to a kworker which wakes on the same CPU. On
> > completion of the work, it wakes the task, finds that the previous CPU
> > is busy (because the kworker is still running on it) and migrates the
> > task to the next idle CPU. The task ends up migrating around all CPUs
> > sharing a LLC at high frequency. This has negative implications both in
> > commication costs and power management.  mpstat confirmed that at low
> > thread counts that all CPUs sharing an LLC has low level of activity.
> 
> Very interesting, Mel. :)
> 
> I suspect this appears is a very similar issue that is discussed in
> this thread about workqueues and AIO completion latencies:
> 
> https://lore.kernel.org/lkml/20191114113153.GB4213@ming.t460p/
> 
> The problem is described here, along with comments about how
> fundamental this behaviour is to the correct functioning of
> filesystems:
> 
> https://lore.kernel.org/lkml/20191121221051.GG4614@dread.disaster.area/
> 

I suspect the problem is very similar, maybe even identical and just
observed from different directions. I agree with your assessment of the
issues and observations although I cannot swear every aspect is correct.
However, IO completions being sent to the issuer CPU makes sense,
particularly for synchronous IO where the issuer has gone to sleep. The
data may still be hot in the L1 cache and the CPU is probably ramped up to a
higher frequency. There is no obvious reason why blk-mq should deliver
the completion to a different CPU just to sidestep undesired scheduler
behaviour.

It's less clear cut for jbd as stacking multiple IO issuers on top of jbd
is unlikely to perform very well. However, jbd appears to be unaffected
by this patch.

> There are several patches thrown about during the discussion,
> initially focussed on wakeup pre-emption to run the work immediately
> until I pointed out that was the wrong thing to do for work being
> deferred to workqueues.  After a some more proposed patches the
> discussion on the scheduler side of things largely ground to a halt
> and so has not been fixed.
> 

I do not think it's a good idea to preempt the issuing task when it is
asynchronous. Aside from any filesystem considerations, it artificially
increases context switches.

> So I'm initially wondering if this solves that problem, too, or
> whether you are seeing a slightly different manifestation of that
> same scheduler issue....
> 

I suspect my patch may accidentally fix or at least mitigate the issue.

> > The impact of this problem is related to the number of CPUs sharing an LLC.
> > 
> > This patch special cases the pattern and allows a kworker waker and a
> > task wakee to stack on the same CPU if there is a strong chance they are
> > directly related. The expectation is that the kworker is likely going
> > back to sleep shortly. This is not guaranteed as the IO could be queued
> > asynchronously but there is a very strong relationship between the task and
> > kworker in this case that would justify stacking on the same CPU instead
> > of migrating. There should be few concerns about kworker starvation given
> > that the special casing is only when the kworker is the waker.
> > DBench on XFS
> 
> [snip positive dbench results]
> 
> Yeah, dbench does lots of synchronous operations that end up waiting
> on journal flushes (data integrity operations) so it would trip over
> kworker scheduling latency issues.
> 

Yes, it turned out to be a good workload.

> FWIW, I didn't see any perf degradation on my machines from the
> commit you quoted, but I also had a real hard time replication the
> aio completion latency problem on them as well. Hence I don't think
> they are particularly susceptible to bad migration decisions, so I'm
> not surprised I didn't see this.
> 

FWIW, of 12 machines that checked commits around that time frame, only 2
machines bisected reliably to the XFS commit. About all that was special
about those machines was the number of CPUs sharing an LLC. The disks were
SSDs but not particularly fast ones.

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index fe4e0d775375..76df439aff76 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5912,6 +5912,19 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >  	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
> >  		return prev;
> >  
> > +	/*
> > +	 * Allow a per-cpu kthread to stack with the wakee if the
> > +	 * kworker thread and the tasks previous CPU are the same.
> > +	 * The assumption is that the wakee queued work for the
> > +	 * per-cpu kthread that is now complete and the wakeup is
> > +	 * essentially a sync wakeup.
> > +	 */
> > +	if (is_per_cpu_kthread(current) &&
> > +	    prev == smp_processor_id() &&
> > +	    this_rq()->nr_running <= 1) {
> > +		return prev;
> > +	}
> 
> Ok, so if I've read this correctly, this special case only triggers
> when scheduling from the per-cpu kworker thread context, and only if
> there is one other runnable task on the queue?

Not quite, the 1 running task is the kworker itself.

> So it special cases
> the ping-pong case so that the non-bound task that scheduled the
> kworker also remains scheduled this CPU?
> 

Yes. Ordinarily, allowing premature task stacking in select_idle_sibling
when siblings are idle would be shot straight in the head. Stacking tasks
suits benchmarks that are strictly synchronised like pipes but is almost
always a disaster in any other case. I think this patch is constrained
enough that it should fix the targeted problem without side-effects.

> Hmmmm.
> 
> When we set up a workqueue as WQ_UNBOUND on a numa system, isn't the
> worker pool set up as a node-bound task? i.e. it's not a per-cpu
> kthread anymore, but a task limited by the cpumask of that node?
> That isn't a per-CPU kthread anymore, is it? That is:
> 

I would have thought so but I didn't drill deep enough into the kworker
behaviour to see what exactly was causing the problem. 

> > @@ -2479,3 +2479,16 @@ static inline void membarrier_switch_mm(struct rq *rq,
> >  {
> >  }
> >  #endif
> > +
> > +#ifdef CONFIG_SMP
> > +static inline bool is_per_cpu_kthread(struct task_struct *p)
> > +{
> > +	if (!(p->flags & PF_KTHREAD))
> > +		return false;
> > +
> > +	if (p->nr_cpus_allowed != 1)
> > +		return false;
> 
> p->nr_cpus_allowed is no longer 1 but the number of CPUs in the
> per-node cpu mask it is allowed to run on?
> 

If that were the case, the patch would have had no effect, so there is a
mix of unbound and bound workqueues that I failed to fully understand.
The patch was based on scheduler traces instead of a deep understanding
of workqueue internals or how filesystems use them :(

> And so if that is the case, then the change in commit 8ab39f11d974
> which set WQ_UNBOUND on the XFS CIL workqueue would mean the above
> logic change should not be triggering for the CIL worker because it
> is no longer a CPU bound kthread....
> 
> What am I missing here?
> 

I'm almost 100% certain the patch revealed some unintended behaviour and
the bisection did appear reliable. This is a report excerpt I got from
automated testing (similar functionality to LKP) of commits around that time:

Comparison
==========
                             initial                initial                   last                  penup                   last                  penup                  first
                           good-v5.3       bad-1c4e395cf7de           bad-b41dae06           bad-c6cfaf4f          good-cdea5459          good-7c107afb           bad-8ab39f11
....
Amean     1         47.34 (   0.00%)       54.80 * -15.75%*       56.02 * -18.35%*       56.05 * -18.40%*       48.05 *  -1.50%*       46.36 *   2.07%*       55.56 * -17.38%*
Amean     2         50.10 (   0.00%)       58.35 * -16.45%*       62.44 * -24.62%*       62.39 * -24.52%*       51.68 *  -3.15%*       50.28 *  -0.36%*       61.44 * -22.63%*
Amean     4         58.07 (   0.00%)       66.65 * -14.79%*       67.53 * -16.29%*       66.70 * -14.86%*       58.40 *  -0.57%*       58.49 *  -0.74%*       65.63 * -13.03%*
Amean     8         74.86 (   0.00%)       78.36 *  -4.68%*       79.84 *  -6.65%*       80.21 *  -7.15%*       75.82 *  -1.28%*       74.68 *   0.25%*       79.38 *  -6.03%*
Amean     16       108.57 (   0.00%)      108.86 *  -0.26%*      109.65 *  -1.00%*      109.52 *  -0.88%*      108.91 *  -0.32%*      107.99 *   0.53%*      109.81 *  -1.14%*
Amean     32       199.20 (   0.00%)      198.80 (   0.20%)      199.62 (  -0.21%)      200.11 *  -0.46%*      201.98 *  -1.40%*      202.33 *  -1.57%*      199.70 (  -0.25%)
Amean     64       424.43 (   0.00%)      421.39 *   0.72%*      419.08 *   1.26%*      422.68 *   0.41%*      427.88 *  -0.81%*      422.82 *   0.38%*      427.16 *  -0.64%*
Amean     128     1011.56 (   0.00%)      994.98 *   1.64%*      997.81 *   1.36%*      999.07 *   1.24%*     1003.33 *   0.81%*     1003.38 *   0.81%*      985.47 *   2.58%*

The difference between good-cdea5459 and bad-8ab39f11 is pretty clear.


> <light bulb illumination>
> 
> Is this actually ping-ponging the CIL flush and journal IO
> completion because xlog_bio_end_io() always punts journal IO
> completion to the log workqueue, which is:
> 
> 	log->l_ioend_workqueue = alloc_workqueue("xfs-log/%s",
> 			WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_HIGHPRI, 0,
> 			mp->m_super->s_id);
> 
> i.e. it uses per-cpu kthreads for processing journal IO completion
> similar to DIO io completion and thereby provides a vector for
> the same issue?
> 
> <SNIP>

That is extremely informative, thanks!  Your reasoning sounds plausible
but unfortunately from the scheduler trace, I knew kworkers were involved
but not exactly what those kworkers were doing.  The pattern also did
not always cause major problems. Sometimes select_idle_sibling would
use p->recent_used_cpu instead of moving the task round-robin around the
LLC. It's the reason why I placed the check for per-cpu kthreads *before*
considering recent_used_cpu as an idle CPU. It was the difference between
reducing migration rates by 85% and by almost 100%.
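
For anyone trying to place that ordering, the early part of
select_idle_sibling() with the patch applied looks roughly like this
(simplified pseudo-structure based on the hunk quoted earlier; the real
5.5 source has more detail):

	/* rough ordering only -- not the literal source */
	if (available_idle_cpu(target) || sched_idle_cpu(target))
		return target;

	if (prev != target && cpus_share_cache(prev, target) &&
	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
		return prev;

	/* new: stack on a per-cpu kthread waker (the case discussed here) */
	if (is_per_cpu_kthread(current) &&
	    prev == smp_processor_id() &&
	    this_rq()->nr_running <= 1)
		return prev;

	/* only now is p->recent_used_cpu considered as an idle candidate,
	 * followed by the scan of the LLC domain for an idle CPU */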

> IOWs, I think we now have exactly the same situation as discussed in
> the thread I pointed you to above, where an unbound task work (the
> CIL kworker) is trying to run on the same CPU as the CPU bound IO
> completion kworker, and that causes the CIL kworker to be migrated
> to a different CPU on each bounced throught the "wait for iclog
> space" loop. Hence your new logic is actually triggering on the
> journal IO completion kworker threads, not the CIL kworker threads.
> 

I added Phil Auld and Ming Lei to the cc. Maybe they'd be willing to
give the patch a spin on their test case. I could recreate their test
case easily enough but an independent test would be preferred.

> After all this, I have two questions that would help me understand
> if this is what you are seeing:
> 
> 1. to confirm: does removing just the WQ_UNBOUND from the CIL push
> workqueue (as added in 8ab39f11d974) make the regression go away?
> 

I'll have to check in the morning. Around the v5.4 development timeframe,
I'm certain that reverting the patch helped but that was not an option
given that it fixes a correctness issue.

> 2. when the problem shows up, which tasks are actually being
> migrated excessively - is it the user task, the CIL kworker task
> or something else?
> 

The user task is doing the migration in my case. However, that may not
be universally true. The patch also does not prevent all migrations. For
low numbers of dbench clients, there are almost no migrations but as the
client count increases migrations happen because there are too many
running tasks.
Mel Gorman Jan. 28, 2020, 9:10 a.m. UTC | #3
On Tue, Jan 28, 2020 at 01:19:36AM +0000, Mel Gorman wrote:
> > <SNIP>
> > After all this, I have two questions that would help me understand
> > if this is what you are seeing:
> > 
> > 1. to confirm: does removing just the WQ_UNBOUND from the CIL push
> > workqueue (as added in 8ab39f11d974) make the regression go away?
> > 
> 
> I'll have to check in the morning. Around the v5.4 development timeframe,
> I'm definite that reverting the patch helped but that was not an option
> given that it's fixing a correctness issue.
> 

This is a comparison of the baseline kernel (tip at the time I started),
the proposed fix and a revert. The revert was not clean but I do not
believe it matters:

dbench4 Loadfile Execution Time
                           5.5.0-rc7              5.5.0-rc7              5.5.0-rc7
                   tipsched-20200124      kworkerstack-v1r2     revert-XFS-wq-v1r2
Amean     1         58.69 (   0.00%)       30.21 *  48.53%*       47.48 *  19.10%*
Amean     2         60.90 (   0.00%)       35.29 *  42.05%*       51.13 *  16.04%*
Amean     4         66.77 (   0.00%)       46.55 *  30.28%*       59.54 *  10.82%*
Amean     8         81.41 (   0.00%)       68.46 *  15.91%*       77.25 *   5.11%*
Amean     16       113.29 (   0.00%)      107.79 *   4.85%*      112.33 *   0.85%*
Amean     32       199.10 (   0.00%)      198.22 *   0.44%*      200.31 *  -0.61%*
Amean     64       478.99 (   0.00%)      477.06 *   0.40%*      482.17 *  -0.66%*
Amean     128     1345.26 (   0.00%)     1372.64 *  -2.04%*     1368.94 *  -1.76%*
Stddev    1          2.64 (   0.00%)        4.17 ( -58.08%)        5.01 ( -89.89%)
Stddev    2          4.35 (   0.00%)        5.38 ( -23.73%)        4.48 (  -2.90%)
Stddev    4          6.77 (   0.00%)        6.56 (   3.00%)        7.40 (  -9.40%)
Stddev    8         11.61 (   0.00%)       10.91 (   6.04%)       11.62 (  -0.05%)
Stddev    16        18.63 (   0.00%)       19.19 (  -3.01%)       19.12 (  -2.66%)
Stddev    32        38.71 (   0.00%)       38.30 (   1.06%)       38.82 (  -0.28%)
Stddev    64       100.28 (   0.00%)       91.24 (   9.02%)       95.68 (   4.59%)
Stddev    128      186.87 (   0.00%)      160.34 (  14.20%)      170.85 (   8.57%)

According to this, commit 8ab39f11d974 ("xfs: prevent CIL push holdoff
in log recovery") did introduce some unintended behaviour. The fix
actually performs better than a revert with the obvious benefit that it
does not reintroduce the functional breakage (log starvation) that the
commit originally fixed.

I still think that XFS is not the problem here, it's just the
messenger. The functional fix, delegating work to kworkers running on the
same CPU, and blk-mq delivering IO completions to the same CPU as the IO
issuer are all sane decisions IMO. I do not think that adjusting any of
them to wakeup the task on a new CPU is sensible due to the loss of data
cache locality and potential snags with power management when waking a
CPU from idle state.

Peter, Ingo and Vincent -- I know the timing is bad due to the merge
window but do you have any thoughts on allowing select_idle_sibling to
stack a wakee task on the same CPU as a waker in this specific case?
Mel Gorman Jan. 28, 2020, 2:24 p.m. UTC | #4
I'm adding Jan Kara to the cc as he was looking into the workqueue
implementation in depth this morning and helped me better understand what
is going on.

Phil and Ming are still cc'd as an independent test would still be nice.

> <light bulb illumination>
> 
> Is this actually ping-ponging the CIL flush and journal IO
> completion because xlog_bio_end_io() always punts journal IO
> completion to the log workqueue, which is:
> 
> 	log->l_ioend_workqueue = alloc_workqueue("xfs-log/%s",
> 			WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_HIGHPRI, 0,
> 			mp->m_super->s_id);
> 
> i.e. it uses per-cpu kthreads for processing journal IO completion
> similar to DIO io completion and thereby provides a vector for
> the same issue?
> 

Your light bulb is on point. The XFS unbound workqueue does run near the
task and does not directly cause the migration but the IO completions
matter. As it turned out, it was the IO completions I was looking at
in the old traces but there was insufficient detail to see the exact
sequence. I only observed that a bound wq at the end was causing the
migration and it was a consistent pattern.

I did a more detailed trace that included workqueue tracepoints. I limited
the run to 1 and 2 dbench clients and compared with and without the
patch. I still did not dig deep into the specifics of how XFS interacts
with workqueues because I'm focused on how the scheduler reacts.

The patch is still having an impact with bound workqueues as expected
because:

# zgrep sched_migrate_task 5.5.0-rc7-tipsched-20200124/iter-0/ftrace-dbench4.gz | wc -l
556259
# zgrep sched_migrate_task 5.5.0-rc7-kworkerstack-v1r2/iter-0/ftrace-dbench4.gz | wc -l
11736

There are still migrations happening but there also was a lot of logging
going on for this run so it's not directly comparable to what I originally
reported.

This is an example sequence of what's happening from a scheduler
perspective on the vanilla kernel. It's edited a bit because there
were a lot of other IOs going on, mostly logging-related, which confuse
the picture.

          dbench-25633 [004] d...   194.998648: workqueue_queue_work: work struct=000000001cccdc2d function=xlog_cil_push_work [xfs] workqueue=00000000d90239c9 req_cpu=512 cpu=4294967295
          dbench-25633 [004] d...   194.998650: sched_waking: comm=kworker/u161:6 pid=718 prio=120 target_cpu=006
          dbench-25633 [004] d...   194.998655: sched_wakeup: comm=kworker/u161:6 pid=718 prio=120 target_cpu=006
  kworker/u161:6-718   [006] ....   194.998692: workqueue_execute_start: work struct 000000001cccdc2d: function xlog_cil_push_work [xfs]
  kworker/u161:6-718   [006] ....   194.998706: workqueue_execute_end: work struct 000000001cccdc2d

Dbench is on CPU 4, it queues xlog_cil_push_work on an UNBOUND
workqueue. An unbound kworker wakes on CPU 6 and finishes quickly.

  kworker/u161:6-718   [006] ....   194.998707: workqueue_execute_start: work struct 0000000046fbf8d5: function wq_barrier_func
  kworker/u161:6-718   [006] d...   194.998708: sched_waking: comm=dbench pid=25633 prio=120 target_cpu=004
  kworker/u161:6-718   [006] d...   194.998712: sched_wakeup: comm=dbench pid=25633 prio=120 target_cpu=004
  kworker/u161:6-718   [006] ....   194.998713: workqueue_execute_end: work struct 0000000046fbf8d5

The kworker wakes dbench; CPU 4 is still free, so dbench
uses its previous CPU and no migration occurs.

          dbench-25633 [004] d...   194.998727: workqueue_queue_work: work struct=00000000442434a7 function=blk_mq_requeue_work workqueue=00000000df918933 req_cpu=512 cpu=4
          dbench-25633 [004] d...   194.998728: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004
          dbench-25633 [004] dN..   194.998731: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004

Dbench queues blk_mq_requeue_work. This is a BOUND workqueue with a
mandatory CPU target of 4, so no migration occurs.

    kworker/4:1H-991   [004] ....   194.998736: workqueue_execute_start: work struct 00000000442434a7: function blk_mq_requeue_work
    kworker/4:1H-991   [004] ....   194.998742: workqueue_execute_end: work struct 00000000442434a7

blk_mq_requeue_work is done

          <idle>-0     [004] d.s.   194.998859: workqueue_queue_work: work struct=00000000442434a7 function=blk_mq_requeue_work workqueue=00000000df918933 req_cpu=512 cpu=4
          <idle>-0     [004] d.s.   194.998861: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004
          <idle>-0     [004] dNs.   194.998862: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004
    kworker/4:1H-991   [004] ....   194.998866: workqueue_execute_start: work struct 00000000442434a7: function blk_mq_requeue_work
    kworker/4:1H-991   [004] ....   194.998870: workqueue_execute_end: work struct 00000000442434a7
          <idle>-0     [004] d.s.   194.998911: workqueue_queue_work: work struct=0000000072f39adb function=xlog_ioend_work [xfs] workqueue=00000000008f3d7f req_cpu=512 cpu=4
          <idle>-0     [004] d.s.   194.998912: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004
          <idle>-0     [004] dNs.   194.998913: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004

Ok, this looks like an IRQ delivered for IO completion and the
xlog_ioend_work is reached. The BOUND kworker is woken again by the IRQ
handler on CPU 4 because it has no choice.

    kworker/4:1H-991   [004] ....   194.998918: workqueue_execute_start: work struct 0000000072f39adb: function xlog_ioend_work [xfs]
    kworker/4:1H-991   [004] d...   194.998943: sched_waking: comm=dbench pid=25633 prio=120 target_cpu=004
    kworker/4:1H-991   [004] d...   194.998945: sched_migrate_task: comm=dbench pid=25633 prio=120 orig_cpu=4 dest_cpu=5
    kworker/4:1H-991   [004] d...   194.998947: sched_wakeup: comm=dbench pid=25633 prio=120 target_cpu=005
    kworker/4:1H-991   [004] ....   194.998948: workqueue_execute_end: work struct 0000000072f39adb

The IO completion handler finishes, the bound workqueue tries to wake
dbench on its old CPU. The BOUND kworker is on CPU 4, the task wants
CPU 4 but the CPU is busy with the kworker so the scheduler function
select_idle_sibling picks CPU 5 instead and now the task is migrated
and we have started our round-trip of all CPUs sharing a LLC. It's not a
perfect round-robin because p->recent_used_cpu often works.  Looking at
the traces, dbench bounces back and forth between CPUs 4 and 5 for 7 IO
completions before bouncing between CPUs 6/7 and so on.

The patch alters the very last stage. The IO completion is a bound kworker
and allows the wakee task to use the same CPU and avoid the migration.
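
Reading the last wakeup in that trace against the new check (using the
simplified form of the condition; commentary only):

	/* At 194.998943: current is kworker/4:1H (PF_KTHREAD, pinned to
	 * CPU 4), smp_processor_id() == 4, dbench's prev CPU is 4 and the
	 * kworker is the only runnable task on the runqueue, so with the
	 * patch applied:
	 */
	if (is_per_cpu_kthread(current) &&	/* kworker/4:1H qualifies      */
	    prev == smp_processor_id() &&	/* dbench last ran on CPU 4    */
	    this_rq()->nr_running <= 1) {	/* only the kworker is running */
		return prev;			/* dbench stays on CPU 4; no
						 * sched_migrate_task to CPU 5 */
	}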
Dave Chinner Jan. 28, 2020, 10:21 p.m. UTC | #5
On Tue, Jan 28, 2020 at 02:24:27PM +0000, Mel Gorman wrote:
> I'm adding Jan Kara to the cc as he was looking into the workqueue
> implemention in depth this morning and helped me better understand what
> is going on.
> 
> Phil and Ming are still cc'd as an independent test would still be nice.
> 
> > <light bulb illumination>
> > 
> > Is this actually ping-ponging the CIL flush and journal IO
> > completion because xlog_bio_end_io() always punts journal IO
> > completion to the log workqueue, which is:
> > 
> > 	log->l_ioend_workqueue = alloc_workqueue("xfs-log/%s",
> > 			WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_HIGHPRI, 0,
> > 			mp->m_super->s_id);
> > 
> > i.e. it uses per-cpu kthreads for processing journal IO completion
> > similar to DIO io completion and thereby provides a vector for
> > the same issue?
> > 
> 
> Your light bulb is on point. The XFS unbound workqueue does run near the
> task and does not directly cause the migration but the IO completions
> matter. As it turned out, it was the IO completions I was looking at
> in the old traces but there was insufficient detail to see the exact
> sequence. I only observed that a bound wq at the end was causing the
> migration and it was a consistent pattern.
> 
> I did a more detailed trace that included workqueue tracepoints. I limited
> the run to 1 and 2 dbench clients and compared with and without the
> patch. I still did not dig deep into the specifics of how XFS interacts
> with workqueues because I'm focused on how the scheduler reacts.

[snip traces]

Ok, so it's not /exactly/ as I thought - what is happening is that
both dbench and the CIL push kworker are issuing IO during fsync
operations. dbench issues the data IO, the CIL push worker issues
the journal IO.

> This is an example sequence of what's happening from a scheduler
> perspective on the vanilla kernel. It's editted a bit because there
> were a lot of other IOs going on, mostly logging related which confuse
> the picture.
> 
>           dbench-25633 [004] d...   194.998648: workqueue_queue_work: work struct=000000001cccdc2d function=xlog_cil_push_work [xfs] workqueue=00000000d90239c9 req_cpu=512 cpu=4294967295
>           dbench-25633 [004] d...   194.998650: sched_waking: comm=kworker/u161:6 pid=718 prio=120 target_cpu=006
>           dbench-25633 [004] d...   194.998655: sched_wakeup: comm=kworker/u161:6 pid=718 prio=120 target_cpu=006
>   kworker/u161:6-718   [006] ....   194.998692: workqueue_execute_start: work struct 000000001cccdc2d: function xlog_cil_push_work [xfs]
>   kworker/u161:6-718   [006] ....   194.998706: workqueue_execute_end: work struct 000000001cccdc2d
> 
> Dbench is on CPU 4, it queues xlog_cil_push_work on an UNBOUND
> workqueue. An unbound kworker wakes on CPU 6 and finishes quickly.
> 
>   kworker/u161:6-718   [006] ....   194.998707: workqueue_execute_start: work struct 0000000046fbf8d5: function wq_barrier_func
>   kworker/u161:6-718   [006] d...   194.998708: sched_waking: comm=dbench pid=25633 prio=120 target_cpu=004
>   kworker/u161:6-718   [006] d...   194.998712: sched_wakeup: comm=dbench pid=25633 prio=120 target_cpu=004
>   kworker/u161:6-718   [006] ....   194.998713: workqueue_execute_end: work struct 0000000046fbf8d5

Ok, that's what I'd expect if dbench issued a log force as part of
an fsync() or synchronous transaction. This is it flushing the CIL
and waiting for the flush work to complete (wq_barrier_func is what
wakes the wq flush waiter).

This doesn't complete the log force, however - the dbench process
will now do a bit more work and then go to sleep waiting for journal
IO to complete.

> The kworker wakes dbench and finding that CPU 4 is still free, dbench
> uses its previous CPU and no migration occurs.
> 
>           dbench-25633 [004] d...   194.998727: workqueue_queue_work: work struct=00000000442434a7 function=blk_mq_requeue_work workqueue=00000000df918933 req_cpu=512 cpu=4
>           dbench-25633 [004] d...   194.998728: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004
>           dbench-25633 [004] dN..   194.998731: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004
> 
> Dbench queues blk_mq_requeue_work. This is a BOUND workqueue with a
> mandatory CPU target of 4 so no migration..

So I spent some time trying to work out how the dbench process
triggers this directly. This work is queued when a new cache flush
command is inserted into the request queue, and generally those are
done by the journal writes via REQ_PREFLUSH | REQ_FUA. Those would
show up in the context of the xlog_cil_push_work and run on that
CPU, not the dbench task or CPU.

So this is probably xfs_file_fsync() calling
xfs_blkdev_issue_flush() directly because the inode metadata had
already been written to the journal by an earlier (or racing)
journal flush.  Hence we have to flush the device cache manually to
ensure that the data that may have been written is also on stable
storage. That will insert a flush directly into the request queue,
and that's likely how we are getting the flush machinery running on
this CPU.

>     kworker/4:1H-991   [004] ....   194.998736: workqueue_execute_start: work struct 00000000442434a7: function blk_mq_requeue_work
>     kworker/4:1H-991   [004] ....   194.998742: workqueue_execute_end: work struct 00000000442434a7
> 
> blk_mq_requeue_work is done
> 
>           <idle>-0     [004] d.s.   194.998859: workqueue_queue_work: work struct=00000000442434a7 function=blk_mq_requeue_work workqueue=00000000df918933 req_cpu=512 cpu=4
>           <idle>-0     [004] d.s.   194.998861: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004
>           <idle>-0     [004] dNs.   194.998862: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004
>     kworker/4:1H-991   [004] ....   194.998866: workqueue_execute_start: work struct 00000000442434a7: function blk_mq_requeue_work
>     kworker/4:1H-991   [004] ....   194.998870: workqueue_execute_end: work struct 00000000442434a7
>           <idle>-0     [004] d.s.   194.998911: workqueue_queue_work: work struct=0000000072f39adb function=xlog_ioend_work [xfs] workqueue=00000000008f3d7f req_cpu=512 cpu=4
>           <idle>-0     [004] d.s.   194.998912: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004
>           <idle>-0     [004] dNs.   194.998913: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004
> 
> Ok, this looks like an IRQ delivered for IO completion and the
> xlog_ioend_work is reached. The BOUND kworker is woken again by the IRQ
> handler on CPU 4 because it has no choice.

Yeah, this looks to be the completion of the cache flush that the
dbench process is waiting on. It looks like the block layer is
aggregating sequential cache flushes into a single operation, and so
a post-journal-write cache flush and the fsync cache flush are one
and the same. Hence it runs journal IO completion in this flush
completion context, which queues xlog_ioend_work()...

>     kworker/4:1H-991   [004] ....   194.998918: workqueue_execute_start: work struct 0000000072f39adb: function xlog_ioend_work [xfs]
>     kworker/4:1H-991   [004] d...   194.998943: sched_waking: comm=dbench pid=25633 prio=120 target_cpu=004
>     kworker/4:1H-991   [004] d...   194.998945: sched_migrate_task: comm=dbench pid=25633 prio=120 orig_cpu=4 dest_cpu=5
>     kworker/4:1H-991   [004] d...   194.998947: sched_wakeup: comm=dbench pid=25633 prio=120 target_cpu=005
>     kworker/4:1H-991   [004] ....   194.998948: workqueue_execute_end: work struct 0000000072f39adb
> 
> The IO completion handler finishes, the bound workqueue tries to wake
> dbench on its old CPU. The BOUND kworker is on CPU 4, the task wants
> CPU 4 but the CPU is busy with the kworker so the scheduler function
> select_idle_sibling picks CPU 5 instead and now the task is migrated
> and we have started our round-trip of all CPUs sharing a LLC. It's not a
> perfect round-robin because p->recent_used_cpu often works.  Looking at
> the traces, dbench bounces back and forth between CPUs 4 and 5 for 7 IO
> completions before bouncing between CPUs 6/7 and so on.

Then this happens.

Ok, so the commit that made the CIL push work unbound didn't
introduce this sub-optimal scheduling pattern, it just made it more
likely to trigger by increasing the likelihood of cache flush
aggregation. I think the problem was likely first visible when
blk-mq was introduced because of its async cache flush machinery
but it went unnoticed because dbench on blk-mq was faster thanks to
all the other improvements blk-mq brought to the table....

> The patch alters the very last stage. The IO completion is a bound kworker
> and allows the wakee task to use the same CPU and avoid the migration.

*nod*

AFAICT, everything is pointing to this being the same issue as the
AIO/DIO completion issue. We've got a bound worker thread waking an
unbound user task, and the scheduler is migrating that unbound user
task to an idle CPU because it doesn't know we really want
synchronous wakeup semantics in this situation. And, really, I don't
think the code doing the wakeup knows whether synchronous wakeup
semantics are correct, either, as there can be many processes across
the entire machine waiting on this journal IO completion
notification. Hence I suspect a runtime evaluated heuristic is the
best we can do here...

Cheers,

Dave.
Mel Gorman Jan. 29, 2020, 10:53 a.m. UTC | #6
On Wed, Jan 29, 2020 at 09:21:12AM +1100, Dave Chinner wrote:
> > Your light bulb is on point. The XFS unbound workqueue does run near the
> > task and does not directly cause the migration but the IO completions
> > matter. As it turned out, it was the IO completions I was looking at
> > in the old traces but there was insufficient detail to see the exact
> > sequence. I only observed that a bound wq at the end was causing the
> > migration and it was a consistent pattern.
> > 
> > I did a more detailed trace that included workqueue tracepoints. I limited
> > the run to 1 and 2 dbench clients and compared with and without the
> > patch. I still did not dig deep into the specifics of how XFS interacts
> > with workqueues because I'm focused on how the scheduler reacts.
> 
> [snip traces]
> 
> Ok, so it's not /exactly/ as I thought - what is happening is that
> both dbench and the CIL push kworker are issuing IO during fsync
> operations. dbench issues the data IO, the CIL push worker issues
> the journal IO.
> 

This is potentially due to the additional parallelism allowed by the
XFS patch using an unbound workqueue for CIL. dbench can make a little
additional progress and the workqueue and task collide. This may not be
optimal in all cases but at least it can *try* to make additional progress.

> > This is an example sequence of what's happening from a scheduler
> > perspective on the vanilla kernel. It's editted a bit because there
> > were a lot of other IOs going on, mostly logging related which confuse
> > the picture.
> > 
> >           dbench-25633 [004] d...   194.998648: workqueue_queue_work: work struct=000000001cccdc2d function=xlog_cil_push_work [xfs] workqueue=00000000d90239c9 req_cpu=512 cpu=4294967295
> >           dbench-25633 [004] d...   194.998650: sched_waking: comm=kworker/u161:6 pid=718 prio=120 target_cpu=006
> >           dbench-25633 [004] d...   194.998655: sched_wakeup: comm=kworker/u161:6 pid=718 prio=120 target_cpu=006
> >   kworker/u161:6-718   [006] ....   194.998692: workqueue_execute_start: work struct 000000001cccdc2d: function xlog_cil_push_work [xfs]
> >   kworker/u161:6-718   [006] ....   194.998706: workqueue_execute_end: work struct 000000001cccdc2d
> > 
> > Dbench is on CPU 4, it queues xlog_cil_push_work on an UNBOUND
> > workqueue. An unbound kworker wakes on CPU 6 and finishes quickly.
> > 
> >   kworker/u161:6-718   [006] ....   194.998707: workqueue_execute_start: work struct 0000000046fbf8d5: function wq_barrier_func
> >   kworker/u161:6-718   [006] d...   194.998708: sched_waking: comm=dbench pid=25633 prio=120 target_cpu=004
> >   kworker/u161:6-718   [006] d...   194.998712: sched_wakeup: comm=dbench pid=25633 prio=120 target_cpu=004
> >   kworker/u161:6-718   [006] ....   194.998713: workqueue_execute_end: work struct 0000000046fbf8d5
> 
> Ok, that's what I'd expect if dbench issued a log force as part of
> an fsync() or synchronous transaction. This is it flushing the CIL
> and waiting for the flush work to complete (wq_barrier_func is what
> wakes the wq flush waiter).
> 
> This doesn't complete the log force, however - the dbench process
> will now do a bit more work and then go to sleep waiting for journal
> IO to complete.
> 

So, it could be argued that it should be serialised but I think it would
be very specific to how dbench behaves. I can't see anything actually
wrong with what XFS does and it's just unfortunate that dbench made enough
progress but still had to go to sleep.

> > The kworker wakes dbench and finding that CPU 4 is still free, dbench
> > uses its previous CPU and no migration occurs.
> > 
> >           dbench-25633 [004] d...   194.998727: workqueue_queue_work: work struct=00000000442434a7 function=blk_mq_requeue_work workqueue=00000000df918933 req_cpu=512 cpu=4
> >           dbench-25633 [004] d...   194.998728: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004
> >           dbench-25633 [004] dN..   194.998731: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004
> > 
> > Dbench queues blk_mq_requeue_work. This is a BOUND workqueue with a
> > mandatory CPU target of 4 so no migration..
> 
> So I spent some time trying to work out how the dbench process
> triggers this directly. This work is queued when a new cache flush
> command is inserted into the request queue, and generally those are
> done by the journal writes via REQ_PREFLUSH | REQ_FUA. Those would
> show up in the context of the xlog_cil_push_work and run on that
> CPU, not the dbench task or CPU.
> 
> So this is probably xfs_file_fsync() calling
> xfs_blkdev_issue_flush() directly because the inode metadata had
> already been written to the journal by an earlier (or racing)
> journal flush.  Hence we have to flush the device cache manually to
> ensure that the data that may have been written is also on stable
> storage. That will insert a flush directly into the request queue,
> and that's likely how we are getting the flush machinery running on
> this CPU.
> 

dbench does call fsync periodically so it's a good fit. Again, it might
be better to let all the work be done by the kworker but, without a crystal
ball instruction, what XFS does seems reasonable.

> 
> >     kworker/4:1H-991   [004] ....   194.998918: workqueue_execute_start: work struct 0000000072f39adb: function xlog_ioend_work [xfs]
> >     kworker/4:1H-991   [004] d...   194.998943: sched_waking: comm=dbench pid=25633 prio=120 target_cpu=004
> >     kworker/4:1H-991   [004] d...   194.998945: sched_migrate_task: comm=dbench pid=25633 prio=120 orig_cpu=4 dest_cpu=5
> >     kworker/4:1H-991   [004] d...   194.998947: sched_wakeup: comm=dbench pid=25633 prio=120 target_cpu=005
> >     kworker/4:1H-991   [004] ....   194.998948: workqueue_execute_end: work struct 0000000072f39adb
> > 
> > The IO completion handler finishes, the bound workqueue tries to wake
> > dbench on its old CPU. The BOUND kworker is on CPU 4, the task wants
> > CPU 4 but the CPU is busy with the kworker so the scheduler function
> > select_idle_sibling picks CPU 5 instead and now the task is migrated
> > and we have started our round-trip of all CPUs sharing a LLC. It's not a
> > perfect round-robin because p->recent_used_cpu often works.  Looking at
> > the traces, dbench bounces back and forth between CPUs 4 and 5 for 7 IO
> > completions before bouncing between CPUs 6/7 and so on.
> 
> Then this happens.
> 
> Ok, so the commit that made the CIL push work unbound didn't
> introduce this sub-optimal scheduling pattern, it just made it more
> likely to trigger by increasing the likelihood of cache flush
> aggregation. I think the problem was likely first visible when
> blk-mq was introduced because of it's async cache flush machinery
> but it went unnoticed because dbench on blk-mq was faster because of
> all the other improvements blkmq brought to the table....
> 

It probably was masked by the blk-mq switch. While mostly good (it had
been deferred a couple of times while issues were ironed out), it
wasn't perfect. We were just lucky that XFS made a change that made the
problem more obvious.

> > The patch alters the very last stage. The IO completion is a bound kworker
> > and allows the wakee task to use the same CPU and avoid the migration.
> 
> *nod*
> 
> AFAICT, everything is pointing to this being the same issue as the
> AIO/DIO completion issue. We've got a bound worker thread waking an
> unbound user task, and the scheduler is migrating the unbound worker
> task to an idle CPU because it doesn't know we really want
> synchronous wakeup semantics in this situation. And, really, I don't
> think the code doing the wakeup knows whether synchronous wakeup
> semantics are correct, either, as there can be many processes across
> the entire machine waiting on this journal IO completion
> notification. Hence I suspect a runtime evaluated heuristic is the
> best we can do here...
> 

I would prefer the wakeup code did not have to signal that it's a
synchronous wakeup. Sync wakeups do exist but callers got it wrong many
times where stacking was allowed and then the waker did not go to sleep.
While the chain of events is related, it is not related in a very
obvious way. I think it's much safer to keep this as a scheduler
heuristic instead of depending on callers to have sufficient knowledge
of the scheduler implementation.

I still need scheduler people to look at this because stacking tasks
prematurely is generally considered hazardous but with the merge window,
that might take a while. I'll resend the patch later in the window if
there is no response.

Thanks a lot Dave for taking a close look at this and explaining exactly
what XFS is doing and why.
Peter Zijlstra Jan. 29, 2020, 5:38 p.m. UTC | #7
On Tue, Jan 28, 2020 at 09:10:12AM +0000, Mel Gorman wrote:
> Peter, Ingo and Vincent -- I know the timing is bad due to the merge
> window but do you have any thoughts on allowing select_idle_sibling to
> stack a wakee task on the same CPU as a waker in this specific case?

I sort of see, but *groan*...

so if the kworker unlocks a contended mutex/rwsem/completion...

I suppose the fact that it limits it to tasks that were running on the
same CPU limits the impact if we do get it wrong.

Elsewhere you write:

> I would prefer the wakeup code did not have to signal that it's a
> synchronous wakeup. Sync wakeups so exist but callers got it wrong many
> times where stacking was allowed and then the waker did not go to sleep.
> While the chain of events are related, they are not related in a very
> obvious way. I think it's much safer to keep this as a scheduler
> heuristic instead of depending on callers to have sufficient knowledge
> of the scheduler implementation.

That is true; the existing WF_SYNC has caused many issues for maybe
being too strong.

But what if we create a new hint that combines both these ideas? Say
WF_COMPLETE and subject that to these same criteria. This way we can
eliminate wakeups from locks and such (they won't have this set).

Or am I just making things complicated again?
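
(Purely for illustration, one hypothetical shape such a hint could take;
WF_COMPLETE does not exist, the flag value is arbitrary, and wake_flags
would still have to be plumbed down into select_idle_sibling(), which
does not currently see them:)

	/* hypothetical sketch only -- not existing kernel code */
	#define WF_COMPLETE	0x08	/* wakeup completes work the wakee
					 * queued; waker about to go idle */

	/* in select_idle_sibling(), gate the stacking heuristic on the
	 * hint rather than guessing from is_per_cpu_kthread(current): */
	if ((wake_flags & WF_COMPLETE) &&
	    prev == smp_processor_id() &&
	    this_rq()->nr_running <= 1)
		return prev;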
Dave Chinner Jan. 29, 2020, 10 p.m. UTC | #8
On Wed, Jan 29, 2020 at 06:38:52PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 28, 2020 at 09:10:12AM +0000, Mel Gorman wrote:
> > Peter, Ingo and Vincent -- I know the timing is bad due to the merge
> > window but do you have any thoughts on allowing select_idle_sibling to
> > stack a wakee task on the same CPU as a waker in this specific case?
> 
> I sort of see, but *groan*...
> 
> so if the kworker unlocks a contended mutex/rwsem/completion...
> 
> I suppose the fact that it limits it to tasks that were running on the
> same CPU limits the impact if we do get it wrong.
> 
> Elsewhere you write:
> 
> > I would prefer the wakeup code did not have to signal that it's a
> > synchronous wakeup. Sync wakeups so exist but callers got it wrong many
> > times where stacking was allowed and then the waker did not go to sleep.
> > While the chain of events are related, they are not related in a very
> > obvious way. I think it's much safer to keep this as a scheduler
> > heuristic instead of depending on callers to have sufficient knowledge
> > of the scheduler implementation.
> 
> That is true; the existing WF_SYNC has caused many issues for maybe
> being too strong.
> 
> But what if we create a new hint that combines both these ideas? Say
> WF_COMPLETE and subject that to these same criteria. This way we can
> eliminate wakeups from locks and such (they won't have this set).
> 
> Or am I just making things complicated again?

I suspect this is making it complicated again, because it requires
the people who maintain the code that is using workqueues to
understand when they might need to use a special wakeup interface in
the work function. And that includes code that currently calls
wake_up_all() because there can be hundreds of independent tasks
waiting on the IO completion (e.g all the wait queues in the XFS
journal code can (and do) have multiple threads waiting on them).
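
To be concrete, the pattern is roughly this (a sketch with made-up
names, not the actual journal code):

#include <linux/wait.h>
#include <linux/workqueue.h>

static DECLARE_WAIT_QUEUE_HEAD(log_space_waitq);
static bool log_space_available;

/* Any number of independent tasks can block here. */
static void wait_for_log_space(void)
{
	wait_event(log_space_waitq, READ_ONCE(log_space_available));
}

/* Runs in a kworker once the log push has freed up space. */
static void log_push_work_fn(struct work_struct *work)
{
	WRITE_ONCE(log_space_available, true);
	/* Hundreds of waiters, so there is no single "sync" wakee to flag. */
	wake_up_all(&log_space_waitq);
}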

IOWs, requiring a special flag just to optimise this specific case
(i.e. single dependent waiter on same CPU as the kworker) when the
adverse behaviour is both hardware and workload dependent means it
just won't get used correctly or reliably.

Hence I'd much prefer the kernel detects and dynamically handles
this situation at runtime, because this pattern of workqueue usage
is already quite common and will only become more widespread as we
progress towards async processing of syscalls.

Cheers,

Dave.
Mel Gorman Jan. 30, 2020, 12:43 a.m. UTC | #9
On Wed, Jan 29, 2020 at 06:38:52PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 28, 2020 at 09:10:12AM +0000, Mel Gorman wrote:
> > Peter, Ingo and Vincent -- I know the timing is bad due to the merge
> > window but do you have any thoughts on allowing select_idle_sibling to
> > stack a wakee task on the same CPU as a waker in this specific case?
> 
> I sort of see, but *groan*...
> 

This is the reaction I kinda expected. Sync wakeups and select_idle_sibling
probably caused someone PTSD at some point before my time in kernel/sched/.

> so if the kworker unlocks a contended mutex/rwsem/completion...
> 
> I suppose the fact that it limits it to tasks that were running on the
> same CPU limits the impact if we do get it wrong.
> 

And it's limited to the case where no other task is currently running
on the CPU. Now, potentially multiple sleepers are on that CPU waiting
for a mutex/rwsem/completion, but that is very unlikely and most likely
due to the machine being saturated, in which case searching for an idle
CPU will probably fail anyway. It would also be bounded by the small
window between the first wakeup and the task becoming runnable, at which
point the nr_running check mitigates the problem. Besides, if the
sleeping task is waiting on the lock, it *is* related to the kworker,
which is probably finished.

In other words, even this patch's worst-case behaviour does not seem
that bad.
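
To be explicit, the case the heuristic is targeting is roughly this
pattern (a sketch with made-up names, not lifted from XFS or the
scheduler):

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/completion.h>

struct io_req {
	struct work_struct	work;
	struct completion	done;
};

/* Executes on a bound (per-cpu) kworker, usually on the issuing CPU. */
static void io_req_work_fn(struct work_struct *work)
{
	struct io_req *req = container_of(work, struct io_req, work);

	/* ... finish processing the IO ... */
	complete(&req->done);	/* wakes the issuer; the kworker is nearly done */
}

static void issue_and_wait(struct io_req *req)
{
	INIT_WORK(&req->work, io_req_work_fn);
	init_completion(&req->done);

	/* system_wq is per-cpu, so this normally queues on the local CPU. */
	queue_work(system_wq, &req->work);

	/*
	 * The wakeup from complete() comes from a per-cpu kthread running
	 * on this task's previous CPU with nothing else runnable, which is
	 * exactly the case the patch allows to stack instead of migrate.
	 */
	wait_for_completion(&req->done);
}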

> Elsewhere you write:
> 
> > I would prefer the wakeup code did not have to signal that it's a
> > synchronous wakeup. Sync wakeups do exist, but callers have got it wrong
> > many times, allowing stacking when the waker did not then go to sleep.
> > While the chain of events is related, it is not related in a very
> > obvious way. I think it's much safer to keep this as a scheduler
> > heuristic instead of depending on callers to have sufficient knowledge
> > of the scheduler implementation.
> 
> That is true; the existing WF_SYNC has caused many issues for maybe
> being too strong.
> 

Exactly. It ended up being almost ignored. It basically just means that
the waker CPU may be used as the target for wake_affine because the
users were not obeying the contract.
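
For anyone unfamiliar with that contract: callers promise they are about
to sleep by using the _sync wakeup variants, which pass WF_SYNC down to
the scheduler. Roughly (a sketch, not from any particular subsystem):

#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(reply_waitq);

static void notify_no_hint(void)
{
	/* Plain wakeup: no claim about what the waker does next. */
	wake_up_interruptible(&reply_waitq);
}

static void notify_claiming_sync(void)
{
	/*
	 * Passes WF_SYNC, i.e. "I am about to sleep". Callers broke that
	 * promise often enough that the scheduler now only treats it as a
	 * mild hint when deciding whether to pull the wakee towards the
	 * waker's CPU.
	 */
	wake_up_interruptible_sync(&reply_waitq);
}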

> But what if we create a new hint that combines both these ideas? Say
> WF_COMPLETE and subject that to these same criteria. This way we can
> eliminate wakeups from locks and such (they won't have this set).
> 

I think that'll end up having three consequences. First, it falls foul
of Rusty's Rules of API Design[1] because even if people read the
implementation and the documentation, they might still get it wrong, as
happened with WF_SYNC. Second, some other subsystem will think it's
special and use the flag because it happens to work for one benchmark
or, worse, because they happened to copy/paste the code for some reason.
Finally, the workqueue implementation may change in some way that renders
the use of the flag incorrect. With this patch, if workqueues change
design, it's more likely that the patch simply becomes a no-op.

> Or am I just making things complicated again?

I think so, but I also wrote the patch, so I'm biased. Callers would
only be forced into an API change if it were a common pattern for
multiple unbound tasks to sleep on the same CPU waiting on a single
kworker, and I struggle to think of such an example.

The length of time it took this issue to be detected and patched is
indicative that not everyone is familiar with kernel/sched/fair.c and its
consequences.  If they were, chances are they would have implemented some
mental hack like binding a task to a single CPU until the IO completes.
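
For illustration, one way that kind of workaround could look from
userspace (a sketch, certainly not a recommendation):

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

static ssize_t read_pinned(int fd, void *buf, size_t len)
{
	cpu_set_t old_mask, pinned;
	ssize_t ret;

	/* Remember the current affinity and pin to the current CPU. */
	sched_getaffinity(0, sizeof(old_mask), &old_mask);
	CPU_ZERO(&pinned);
	CPU_SET(sched_getcpu(), &pinned);
	sched_setaffinity(0, sizeof(pinned), &pinned);

	/* The IO completion wakeup cannot migrate this task elsewhere. */
	ret = read(fd, buf, len);

	/* Undo the pin once the IO has completed. */
	sched_setaffinity(0, sizeof(old_mask), &old_mask);
	return ret;
}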

[1] http://sweng.the-davies.net/Home/rustys-api-design-manifesto
Mel Gorman Jan. 30, 2020, 12:50 a.m. UTC | #10
On Thu, Jan 30, 2020 at 09:00:21AM +1100, Dave Chinner wrote:
> On Wed, Jan 29, 2020 at 06:38:52PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 28, 2020 at 09:10:12AM +0000, Mel Gorman wrote:
> > > Peter, Ingo and Vincent -- I know the timing is bad due to the merge
> > > window but do you have any thoughts on allowing select_idle_sibling to
> > > stack a wakee task on the same CPU as a waker in this specific case?
> > 
> > I sort of see, but *groan*...
> > 
> > so if the kworker unlocks a contended mutex/rwsem/completion...
> > 
> > I suppose the fact that it limits it to tasks that were running on the
> > same CPU limits the impact if we do get it wrong.
> > 
> > Elsewhere you write:
> > 
> > > I would prefer the wakeup code did not have to signal that it's a
> > > synchronous wakeup. Sync wakeups do exist, but callers have got it wrong
> > > many times, allowing stacking when the waker did not then go to sleep.
> > > While the chain of events is related, it is not related in a very
> > > obvious way. I think it's much safer to keep this as a scheduler
> > > heuristic instead of depending on callers to have sufficient knowledge
> > > of the scheduler implementation.
> > 
> > That is true; the existing WF_SYNC has caused many issues for maybe
> > being too strong.
> > 
> > But what if we create a new hint that combines both these ideas? Say
> > WF_COMPLETE and subject that to these same criteria. This way we can
> > eliminate wakeups from locks and such (they won't have this set).
> > 
> > Or am I just making things complicated again?
> 
> I suspect this is making it complicated again, because it requires
> the people who maintain the code that is using workqueues to
> understand when they might need to use a special wakeup interface in
> the work function. And that includes code that currently calls
> wake_up_all() because there can be hundreds of independent tasks
> waiting on the IO completion (e.g all the wait queues in the XFS
> journal code can (and do) have multiple threads waiting on them).
> 
> IOWs, requiring a special flag just to optimise this specific case
> (i.e. single dependent waiter on same CPU as the kworker) when the
> adverse behaviour is both hardware and workload dependent means it
> just won't get used correctly or reliably.
> 

I agree. Pick any of Rusty's rules from "-2 Read the implementation
and you'll get it wrong" all the way down to "-10 It's impossible to
get right.".

> Hence I'd much prefer the kernel detects and dynamically handles
> this situation at runtime, because this pattern of workqueue usage
> is already quite common and will only become more widespread as we
> progress towards async processing of syscalls.
> 

To be fair, as Peter says, the kernel patch may not detect this
properly. There are corner cases where it will get it wrong. My thinking
is that, *at the moment*, when the heuristic is wrong it's almost
certainly because the machine is so over-saturated that multiple related
tasks are stacking anyway.

Depending on how async syscalls proceed, this might turn out to be the
wrong heuristic and an API change will be required. At least if that
happens, we'll have a few use cases to help guide what the API change
should look like so we do not end up in WF_SYNC hell again.
Peter Zijlstra Jan. 30, 2020, 8:06 a.m. UTC | #11
On Thu, Jan 30, 2020 at 12:43:34AM +0000, Mel Gorman wrote:
> On Wed, Jan 29, 2020 at 06:38:52PM +0100, Peter Zijlstra wrote:

> > I suppose the fact that it limits it to tasks that were running on the
> > same CPU limits the impact if we do get it wrong.
> > 
> 
> And it's limited to the case where no other task is currently running
> on the CPU. Now, potentially multiple sleepers are on that CPU waiting
> for a mutex/rwsem/completion, but that is very unlikely and most likely
> due to the machine being saturated, in which case searching for an idle
> CPU will probably fail anyway. It would also be bounded by the small
> window between the first wakeup and the task becoming runnable, at which
> point the nr_running check mitigates the problem. Besides, if the
> sleeping task is waiting on the lock, it *is* related to the kworker,
> which is probably finished.
> 
> In other words, even this patch's worst-case behaviour does not seem
> that bad.

OK; let's just stick it in and see what, if anything, falls over :-)

I saw there is a v2 out (although I didn't see what changed in a hurry),
let me queue that one.
Mel Gorman Jan. 30, 2020, 8:55 a.m. UTC | #12
On Thu, Jan 30, 2020 at 09:06:53AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 30, 2020 at 12:43:34AM +0000, Mel Gorman wrote:
> > On Wed, Jan 29, 2020 at 06:38:52PM +0100, Peter Zijlstra wrote:
> 
> > > I suppose the fact that it limits it to tasks that were running on the
> > > same CPU limits the impact if we do get it wrong.
> > > 
> > 
> > And it's limited to the case where no other task is currently running
> > on the CPU. Now, potentially multiple sleepers are on that CPU waiting
> > for a mutex/rwsem/completion, but that is very unlikely and most likely
> > due to the machine being saturated, in which case searching for an idle
> > CPU will probably fail anyway. It would also be bounded by the small
> > window between the first wakeup and the task becoming runnable, at which
> > point the nr_running check mitigates the problem. Besides, if the
> > sleeping task is waiting on the lock, it *is* related to the kworker,
> > which is probably finished.
> > 
> > In other words, even this patch's worst-case behaviour does not seem
> > that bad.
> 
> OK; let's just stick it in and see what, if anything, falls over :-)
> 
> I saw there is a v2 out (although I didn't see what changed in a hurry),
> let me queue that one.

Only the changelog and comments changed in light of the discussion
with Dave.

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc1dfc007604..1f615a223791 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1442,17 +1442,6 @@  void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 
 #ifdef CONFIG_SMP
 
-static inline bool is_per_cpu_kthread(struct task_struct *p)
-{
-	if (!(p->flags & PF_KTHREAD))
-		return false;
-
-	if (p->nr_cpus_allowed != 1)
-		return false;
-
-	return true;
-}
-
 /*
  * Per-CPU kthreads are allowed to run on !active && online CPUs, see
  * __set_cpus_allowed_ptr() and select_fallback_rq().
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe4e0d775375..76df439aff76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5912,6 +5912,19 @@  static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
 		return prev;
 
+	/*
+	 * Allow a per-cpu kthread to stack with the wakee if the
+	 * kworker thread and the task's previous CPU are the same.
+	 * The assumption is that the wakee queued work for the
+	 * per-cpu kthread that is now complete and the wakeup is
+	 * essentially a sync wakeup.
+	 */
+	if (is_per_cpu_kthread(current) &&
+	    prev == smp_processor_id() &&
+	    this_rq()->nr_running <= 1) {
+		return prev;
+	}
+
 	/* Check a recently used CPU as a potential idle candidate: */
 	recent_used_cpu = p->recent_used_cpu;
 	if (recent_used_cpu != prev &&
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1a88dc8ad11b..5876e6ba5903 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2479,3 +2479,16 @@  static inline void membarrier_switch_mm(struct rq *rq,
 {
 }
 #endif
+
+#ifdef CONFIG_SMP
+static inline bool is_per_cpu_kthread(struct task_struct *p)
+{
+	if (!(p->flags & PF_KTHREAD))
+		return false;
+
+	if (p->nr_cpus_allowed != 1)
+		return false;
+
+	return true;
+}
+#endif