Message ID | 20200127143608.GX3466@techsingularity.net (mailing list archive)
---|---
State | New, archived
Series | sched, fair: Allow a per-cpu kthread waking a task to stack on the same CPU
On Mon, Jan 27, 2020 at 02:36:08PM +0000, Mel Gorman wrote: > Commit 8ab39f11d974 ("xfs: prevent CIL push holdoff in log > recovery") changed from using bound workqueues to using unbound > workqueues. Functionally this makes sense but it was observed at the time > that the dbench performance dropped quite a lot and CPU migrations were > excessively high even when there are plenty of idle CPUs.

Hmmm - that just made the CIL workqueue WQ_UNBOUND. Not a complex change...

> The pattern of the task migration is straight-forward. With XFS, an IO > issuer may delegate work to a kworker which wakes on the same CPU. On > completion of the work, it wakes the task, finds that the previous CPU > is busy (because the kworker is still running on it) and migrates the > task to the next idle CPU. The task ends up migrating around all CPUs > sharing a LLC at high frequency. This has negative implications both in > communication costs and power management. mpstat confirmed that at low > thread counts all CPUs sharing an LLC have a low level of activity.

Very interesting, Mel. :)

I suspect this is a very similar issue to the one discussed in this thread about workqueues and AIO completion latencies:

https://lore.kernel.org/lkml/20191114113153.GB4213@ming.t460p/

The problem is described here, along with comments about how fundamental this behaviour is to the correct functioning of filesystems:

https://lore.kernel.org/lkml/20191121221051.GG4614@dread.disaster.area/

There are several patches thrown about during the discussion, initially focussed on wakeup pre-emption to run the work immediately until I pointed out that was the wrong thing to do for work being deferred to workqueues. After some more proposed patches the discussion on the scheduler side of things largely ground to a halt and so has not been fixed.

So I'm initially wondering if this solves that problem, too, or whether you are seeing a slightly different manifestation of that same scheduler issue....

> The impact of this problem is related to the number of CPUs sharing an LLC. > > This patch special cases the pattern and allows a kworker waker and a > task wakee to stack on the same CPU if there is a strong chance they are > directly related. The expectation is that the kworker is likely going > back to sleep shortly. This is not guaranteed as the IO could be queued > asynchronously but there is a very strong relationship between the task and > kworker in this case that would justify stacking on the same CPU instead > of migrating. There should be few concerns about kworker starvation given > that the special casing is only when the kworker is the waker. > DBench on XFS

[snip positive dbench results]

Yeah, dbench does lots of synchronous operations that end up waiting on journal flushes (data integrity operations) so it would trip over kworker scheduling latency issues.

FWIW, I didn't see any perf degradation on my machines from the commit you quoted, but I also had a real hard time replicating the aio completion latency problem on them as well. Hence I don't think they are particularly susceptible to bad migration decisions, so I'm not surprised I didn't see this.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index fe4e0d775375..76df439aff76 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -5912,6 +5912,19 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > (available_idle_cpu(prev) || sched_idle_cpu(prev))) > return prev; > > + /* > + * Allow a per-cpu kthread to stack with the wakee if the > + * kworker thread and the tasks previous CPU are the same. > + * The assumption is that the wakee queued work for the > + * per-cpu kthread that is now complete and the wakeup is > + * essentially a sync wakeup. > + */ > + if (is_per_cpu_kthread(current) && > + prev == smp_processor_id() && > + this_rq()->nr_running <= 1) { > + return prev; > + } Ok, so if I've read this correctly, this special case only triggers when scheduling from the per-cpu kworker thread context, and only if there is one other runnable task on the queue? So it special cases the ping-pong case so that the non-bound task that scheduled the kworker also remains scheduled this CPU? Hmmmm. When we set up a workqueue as WQ_UNBOUND on a numa system, isn't the worker pool set up as a node-bound task? i.e. it's not a per-cpu kthread anymore, but a task limited by the cpumask of that node? That isn't a per-CPU kthread anymore, is it? That is: > @@ -2479,3 +2479,16 @@ static inline void membarrier_switch_mm(struct rq *rq, > { > } > #endif > + > +#ifdef CONFIG_SMP > +static inline bool is_per_cpu_kthread(struct task_struct *p) > +{ > + if (!(p->flags & PF_KTHREAD)) > + return false; > + > + if (p->nr_cpus_allowed != 1) > + return false; p->nr_cpus_allowed is no longer 1 but the number of CPUs in the per-node cpu mask it is allowed to run on? And so if that is the case, then the change in commit 8ab39f11d974 which set WQ_UNBOUND on the XFS CIL workqueue would mean the above logic change should not be triggering for the CIL worker because it is no longer a CPU bound kthread.... What am I missing here? <light bulb illumination> Is this actually ping-ponging the CIL flush and journal IO completion because xlog_bio_end_io() always punts journal IO completion to the log workqueue, which is: log->l_ioend_workqueue = alloc_workqueue("xfs-log/%s", WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_HIGHPRI, 0, mp->m_super->s_id); i.e. it uses per-cpu kthreads for processing journal IO completion similar to DIO io completion and thereby provides a vector for the same issue? A typical fsync is processed like this: user task CIL kworker IO completion kworker xfs_trans_commit() pushes on log <blocks waiting on flush completion> <wake> formats vectors loop { wait for iclog space <block waiting on write completion> <wake> journal IO completion frees up iclog space wakes write waiter <done> <wake> write vectors into iclog submit iclog IO } <done> <wake> journal IO completion frees up iclog space wakes flush waiter <done> <wakes> <transaction commit done> i.e. before commit 8ab39f11d974 we have: user task = unbound XFS CIL commit worker = CPU bound kthread XFS journal IO completion = CPU bound kthread And because the the CIL kworker and IO completion kworker are bound to the same CPU they don't trigger migrations as they can't be moved anywhere else. And so it doesn't matter how many times we switch between them to complete a CIL flush because they will not trigger migrations. 
However, after commit 8ab39f11d974 we have:

	user task                 = unbound
	XFS CIL commit worker     = unbound kthread
	XFS journal IO completion = CPU bound kthread

IOWs, I think we now have exactly the same situation as discussed in the thread I pointed you to above, where an unbound task work (the CIL kworker) is trying to run on the same CPU as the CPU bound IO completion kworker, and that causes the CIL kworker to be migrated to a different CPU on each bounce through the "wait for iclog space" loop. Hence your new logic is actually triggering on the journal IO completion kworker threads, not the CIL kworker threads.

After all this, I have two questions that would help me understand if this is what you are seeing:

1. to confirm: does removing just the WQ_UNBOUND from the CIL push workqueue (as added in 8ab39f11d974) make the regression go away?

2. when the problem shows up, which tasks are actually being migrated excessively - is it the user task, the CIL kworker task or something else?

Cheers, Dave.
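The bound/unbound distinction being probed here is fixed at workqueue allocation time. The following is only a minimal sketch, not code from the thread or from XFS, and all names are invented: work queued on a normal per-CPU workqueue executes in a kworker whose affinity mask is a single CPU (so it passes the patch's is_per_cpu_kthread() test), while work queued on a WQ_UNBOUND workqueue executes in a pool kworker whose cpumask spans (at least) a NUMA node and therefore does not.

#include <linux/init.h>
#include <linux/module.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

/* Invented names: bound_wq is analogous to the xfs-log workqueue,
 * unbound_wq to the CIL push workqueue after commit 8ab39f11d974. */
static struct workqueue_struct *bound_wq;
static struct workqueue_struct *unbound_wq;
static struct work_struct bound_work;
static struct work_struct unbound_work;

static void example_work_fn(struct work_struct *work)
{
	/*
	 * For bound_wq this runs in a kworker pinned to the CPU that
	 * called queue_work(); for unbound_wq it runs in a node-wide
	 * pool kworker that the scheduler is free to place.
	 */
	pr_info("work ran on CPU %d\n", raw_smp_processor_id());
}

static int __init wq_example_init(void)
{
	bound_wq = alloc_workqueue("example-bound",
				   WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
	unbound_wq = alloc_workqueue("example-unbound", WQ_UNBOUND, 0);
	if (!bound_wq || !unbound_wq)
		goto fail;

	INIT_WORK(&bound_work, example_work_fn);
	INIT_WORK(&unbound_work, example_work_fn);

	queue_work(bound_wq, &bound_work);	/* stays on this CPU */
	queue_work(unbound_wq, &unbound_work);	/* may run anywhere on the node */
	return 0;

fail:
	if (bound_wq)
		destroy_workqueue(bound_wq);
	if (unbound_wq)
		destroy_workqueue(unbound_wq);
	return -ENOMEM;
}

static void __exit wq_example_exit(void)
{
	/* destroy_workqueue() flushes any pending work first. */
	destroy_workqueue(bound_wq);
	destroy_workqueue(unbound_wq);
}

module_init(wq_example_init);
module_exit(wq_example_exit);
MODULE_LICENSE("GPL");

This is only meant to make the nr_cpus_allowed question above concrete; which pool actually executes a given XFS work item is exactly what the traces later in the thread pin down.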
On Tue, Jan 28, 2020 at 09:32:56AM +1100, Dave Chinner wrote: > On Mon, Jan 27, 2020 at 02:36:08PM +0000, Mel Gorman wrote: > > Commit 8ab39f11d974 ("xfs: prevent CIL push holdoff in log > > recovery") changed from using bound workqueues to using unbound > > workqueues. Functionally this makes sense but it was observed at the time > > that the dbench performance dropped quite a lot and CPU migrations were > > excessively high even when there are plenty of idle CPUs. > > Hmmm - that just made the CIL workqueue WQ_UNBOUND. Not a complex > change... > Not in itself but it has an effect on the wakeup path hitting select_idle_sibling and migrating a user task because the previously used CPU is not idle. Performance problems with select_idle_sibling has been known to cause poor anger management issues. > > The pattern of the task migration is straight-forward. With XFS, an IO > > issuer may delegate work to a kworker which wakes on the same CPU. On > > completion of the work, it wakes the task, finds that the previous CPU > > is busy (because the kworker is still running on it) and migrates the > > task to the next idle CPU. The task ends up migrating around all CPUs > > sharing a LLC at high frequency. This has negative implications both in > > commication costs and power management. mpstat confirmed that at low > > thread counts that all CPUs sharing an LLC has low level of activity. > > Very interesting, Mel. :) > > I suspect this appears is a very similar issue that is discussed in > this thread about workqueues and AIO completion latencies: > > https://lore.kernel.org/lkml/20191114113153.GB4213@ming.t460p/ > > The problem is described here, along with comments about how > fundamental this behaviour is to the correct functioning of > filesystems: > > https://lore.kernel.org/lkml/20191121221051.GG4614@dread.disaster.area/ > I suspect the problem is very similar, maybe even identical and just observed from different directions. I agree with your assessment of the issues and observations although I cannot swear every aspect is correct. However, IO completions being sent to the issuer CPU makes sense, particularly for synchronous IO where the issuer has gone to sleep. The data may be still hot in L1 cache and the CPU is probably ramped up to a higher frequency. There is no obvious reason why blk-mq should deliver the completion to a different CPU just to sidestep undesired scheduler behaviour. It's less clear cut for jbd as stacking multiple IO issuers on top of jbd is unlikely to perform very well. However, jbd appears to be unaffected by this patch. > There are several patches thrown about during the discussion, > initially focussed on wakeup pre-emption to run the work immediately > until I pointed out that was the wrong thing to do for work being > deferred to workqueues. After a some more proposed patches the > discussion on the scheduler side of things largely ground to a halt > and so has not been fixed. > I do not think it's a good idea to preempt the issuing task when it is asynchronous. Aside from any filesystem considerations, it artifically increases context switches. > So I'm initially wondering if this solves that problem, too, or > whether you are seeing a slightly different manifestation of that > same scheduler issue.... > I suspect my patch may accidentally fix or at least mitigate the issue. > > The impact of this problem is related to the number of CPUs sharing an LLC. 
> > > > This patch special cases the pattern and allows a kworker waker and a > > task wakee to stack on the same CPU if there is a strong chance they are > > directly related. The expectation is that the kworker is likely going > > back to sleep shortly. This is not guaranteed as the IO could be queued > > asynchronously but there is a very strong relationship between the task and > > kworker in this case that would justify stacking on the same CPU instead > > of migrating. There should be few concerns about kworker starvation given > > that the special casing is only when the kworker is the waker. > > DBench on XFS > > [snip positive dbench results] > > Yeah, dbench does lots of synchronous operations that end up waiting > on journal flushes (data integrity operations) so it would trip over > kworker scheduling latency issues. > Yes, it turned out to be a good workload. > FWIW, I didn't see any perf degradation on my machines from the > commit you quoted, but I also had a real hard time replication the > aio completion latency problem on them as well. Hence I don't think > they are particularly susceptible to bad migration decisions, so I'm > not surprised I didn't see this. > FWIW, of 12 machines that checked commits around that time frame, only 2 machines bisected reliably to the XFS commit. About all that was special about those machines it the number of CPUs sharing LLC. The disks were SSDs but not particularly fast ones. > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index fe4e0d775375..76df439aff76 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -5912,6 +5912,19 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > > (available_idle_cpu(prev) || sched_idle_cpu(prev))) > > return prev; > > > > + /* > > + * Allow a per-cpu kthread to stack with the wakee if the > > + * kworker thread and the tasks previous CPU are the same. > > + * The assumption is that the wakee queued work for the > > + * per-cpu kthread that is now complete and the wakeup is > > + * essentially a sync wakeup. > > + */ > > + if (is_per_cpu_kthread(current) && > > + prev == smp_processor_id() && > > + this_rq()->nr_running <= 1) { > > + return prev; > > + } > > Ok, so if I've read this correctly, this special case only triggers > when scheduling from the per-cpu kworker thread context, and only if > there is one other runnable task on the queue? Not quite, the 1 running task is the kworker itself. > So it special cases > the ping-pong case so that the non-bound task that scheduled the > kworker also remains scheduled this CPU? > Yes. Ordinarily, allowing premature task stacking in select_idle_sibling when siblings are idle would be shot straight in the head. Stacking tasks suits benchmarks that are strictly synchronised like pipes but almost always a disaster in any other case. I think this patch is constrained enough that it should fix the targetted problem without side-effects. > Hmmmm. > > When we set up a workqueue as WQ_UNBOUND on a numa system, isn't the > worker pool set up as a node-bound task? i.e. it's not a per-cpu > kthread anymore, but a task limited by the cpumask of that node? > That isn't a per-CPU kthread anymore, is it? That is: > I would have thought so but I didn't drill deep enough into the kworker behaviour to see what exactly was causing the problem. 
> > @@ -2479,3 +2479,16 @@ static inline void membarrier_switch_mm(struct rq *rq, > > { > > } > > #endif > > + > > +#ifdef CONFIG_SMP > > +static inline bool is_per_cpu_kthread(struct task_struct *p) > > +{ > > + if (!(p->flags & PF_KTHREAD)) > > + return false; > > + > > + if (p->nr_cpus_allowed != 1) > > + return false; > > p->nr_cpus_allowed is no longer 1 but the number of CPUs in the > per-node cpu mask it is allowed to run on? > If that was the case, the patch would have had no effect so there is a mix of unbound and bound workqueues that I failed to fully understand. The patch was based on scheduler traces instead of a deep understanding of workqueue internals or how filesystems use them :( > And so if that is the case, then the change in commit 8ab39f11d974 > which set WQ_UNBOUND on the XFS CIL workqueue would mean the above > logic change should not be triggering for the CIL worker because it > is no longer a CPU bound kthread.... > > What am I missing here? > I'm almost 100% certain the patch revealed some unintended behaviour and the bisection did appear reliable. This is a report excerpt I got from automation testing (similar functionality to LKP) commits around that time Comparison ========== initial initial last penup last penup first good-v5.3 bad-1c4e395cf7de bad-b41dae06 bad-c6cfaf4f good-cdea5459 good-7c107afb bad-8ab39f11 .... Amean 1 47.34 ( 0.00%) 54.80 * -15.75%* 56.02 * -18.35%* 56.05 * -18.40%* 48.05 * -1.50%* 46.36 * 2.07%* 55.56 * -17.38%* Amean 2 50.10 ( 0.00%) 58.35 * -16.45%* 62.44 * -24.62%* 62.39 * -24.52%* 51.68 * -3.15%* 50.28 * -0.36%* 61.44 * -22.63%* Amean 4 58.07 ( 0.00%) 66.65 * -14.79%* 67.53 * -16.29%* 66.70 * -14.86%* 58.40 * -0.57%* 58.49 * -0.74%* 65.63 * -13.03%* Amean 8 74.86 ( 0.00%) 78.36 * -4.68%* 79.84 * -6.65%* 80.21 * -7.15%* 75.82 * -1.28%* 74.68 * 0.25%* 79.38 * -6.03%* Amean 16 108.57 ( 0.00%) 108.86 * -0.26%* 109.65 * -1.00%* 109.52 * -0.88%* 108.91 * -0.32%* 107.99 * 0.53%* 109.81 * -1.14%* Amean 32 199.20 ( 0.00%) 198.80 ( 0.20%) 199.62 ( -0.21%) 200.11 * -0.46%* 201.98 * -1.40%* 202.33 * -1.57%* 199.70 ( -0.25%) Amean 64 424.43 ( 0.00%) 421.39 * 0.72%* 419.08 * 1.26%* 422.68 * 0.41%* 427.88 * -0.81%* 422.82 * 0.38%* 427.16 * -0.64%* Amean 128 1011.56 ( 0.00%) 994.98 * 1.64%* 997.81 * 1.36%* 999.07 * 1.24%* 1003.33 * 0.81%* 1003.38 * 0.81%* 985.47 * 2.58%* The difference between good-cdea5459 and bad-8ab39f11 is pretty clear. > <light bulb illumination> > > Is this actually ping-ponging the CIL flush and journal IO > completion because xlog_bio_end_io() always punts journal IO > completion to the log workqueue, which is: > > log->l_ioend_workqueue = alloc_workqueue("xfs-log/%s", > WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_HIGHPRI, 0, > mp->m_super->s_id); > > i.e. it uses per-cpu kthreads for processing journal IO completion > similar to DIO io completion and thereby provides a vector for > the same issue? > > <SNIP> That is extremely informative, thanks! Your reasoning sounds plausible but unfortunately from the scheduler trace, I knew kworkers were involved but not exactly what those kworkers were doing. The pattern also did not always cause major problems. Sometimes select_idle_sibling would use p->recent_used_cpu instead of moving the task round-robin around the LLC. It's the reason why I placed the check for per-cpu kthreads *before* considering recent_used_cpu as an idle CPU. It was the difference between reducing migration rates by 85% to almost 100%. 
> IOWs, I think we now have exactly the same situation as discussed in > the thread I pointed you to above, where an unbound task work (the > CIL kworker) is trying to run on the same CPU as the CPU bound IO > completion kworker, and that causes the CIL kworker to be migrated > to a different CPU on each bounced throught the "wait for iclog > space" loop. Hence your new logic is actually triggering on the > journal IO completion kworker threads, not the CIL kworker threads. > I added Paul Auld and Ming Lei to the cc. Maybe they'd be willing to give the patch a spin on their test case. I could recreate their test case easily enough but an independent test would be preferred. > After all this, I have two questions that would help me understand > if this is what you are seeing: > > 1. to confirm: does removing just the WQ_UNBOUND from the CIL push > workqueue (as added in 8ab39f11d974) make the regression go away? > I'll have to check in the morning. Around the v5.4 development timeframe, I'm definite that reverting the patch helped but that was not an option given that it's fixing a correctness issue. > 2. when the problem shows up, which tasks are actually being > migrated excessively - is it the user task, the CIL kworker task > or something else? > The user task is doing the migration in my case. However, that may not be universally true. The patch also does not prevent all migrations. For low numbers of dbench clients, there are almost no migrations but as the client count increases migrations happen because there are too many running tasks.
On Tue, Jan 28, 2020 at 01:19:36AM +0000, Mel Gorman wrote: > > <SNIP> > > After all this, I have two questions that would help me understand > > if this is what you are seeing: > > > > 1. to confirm: does removing just the WQ_UNBOUND from the CIL push > > workqueue (as added in 8ab39f11d974) make the regression go away? > > > > I'll have to check in the morning. Around the v5.4 development timeframe, > I'm definite that reverting the patch helped but that was not an option > given that it's fixing a correctness issue. > This is a comparison of the baseline kernel (tip at the time I started), the proposed fix and a revert. The revert was not clean but I do not believe it matters dbench4 Loadfile Execution Time 5.5.0-rc7 5.5.0-rc7 5.5.0-rc7 tipsched-20200124 kworkerstack-v1r2 revert-XFS-wq-v1r2 Amean 1 58.69 ( 0.00%) 30.21 * 48.53%* 47.48 * 19.10%* Amean 2 60.90 ( 0.00%) 35.29 * 42.05%* 51.13 * 16.04%* Amean 4 66.77 ( 0.00%) 46.55 * 30.28%* 59.54 * 10.82%* Amean 8 81.41 ( 0.00%) 68.46 * 15.91%* 77.25 * 5.11%* Amean 16 113.29 ( 0.00%) 107.79 * 4.85%* 112.33 * 0.85%* Amean 32 199.10 ( 0.00%) 198.22 * 0.44%* 200.31 * -0.61%* Amean 64 478.99 ( 0.00%) 477.06 * 0.40%* 482.17 * -0.66%* Amean 128 1345.26 ( 0.00%) 1372.64 * -2.04%* 1368.94 * -1.76%* Stddev 1 2.64 ( 0.00%) 4.17 ( -58.08%) 5.01 ( -89.89%) Stddev 2 4.35 ( 0.00%) 5.38 ( -23.73%) 4.48 ( -2.90%) Stddev 4 6.77 ( 0.00%) 6.56 ( 3.00%) 7.40 ( -9.40%) Stddev 8 11.61 ( 0.00%) 10.91 ( 6.04%) 11.62 ( -0.05%) Stddev 16 18.63 ( 0.00%) 19.19 ( -3.01%) 19.12 ( -2.66%) Stddev 32 38.71 ( 0.00%) 38.30 ( 1.06%) 38.82 ( -0.28%) Stddev 64 100.28 ( 0.00%) 91.24 ( 9.02%) 95.68 ( 4.59%) Stddev 128 186.87 ( 0.00%) 160.34 ( 14.20%) 170.85 ( 8.57%) According to this, commit 8ab39f11d974 ("xfs: prevent CIL push holdoff in log recovery") did introduce some unintended behaviour. The fix actually performs better than a revert with the obvious benefit that it does not reintroduce the functional breakage (log starvation) that the commit originally fixed. I still think that XFS is not the problem here, it's just the messenger. The functional fix, delegating work to kworkers running on the same CPU and blk-mq delivering IO completions to the same CPU as the IO issuer are all sane decisions IMO. I do not think that adjusting any of them to wakeup the task on a new CPU is sensible due to the loss of data cache locality and potential snags with power management when waking a CPU from idle state. Peter, Ingo and Vincent -- I know the timing is bad due to the merge window but do you have any thoughts on allowing select_idle_sibling to stack a wakee task on the same CPU as a waker in this specific case?
I'm adding Jan Kara to the cc as he was looking into the workqueue implemention in depth this morning and helped me better understand what is going on. Phil and Ming are still cc'd as an independent test would still be nice. > <light bulb illumination> > > Is this actually ping-ponging the CIL flush and journal IO > completion because xlog_bio_end_io() always punts journal IO > completion to the log workqueue, which is: > > log->l_ioend_workqueue = alloc_workqueue("xfs-log/%s", > WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_HIGHPRI, 0, > mp->m_super->s_id); > > i.e. it uses per-cpu kthreads for processing journal IO completion > similar to DIO io completion and thereby provides a vector for > the same issue? > Your light bulb is on point. The XFS unbound workqueue does run near the task and does not directly cause the migration but the IO completions matter. As it turned out, it was the IO completions I was looking at in the old traces but there was insufficient detail to see the exact sequence. I only observed that a bound wq at the end was causing the migration and it was a consistent pattern. I did a more detailed trace that included workqueue tracepoints. I limited the run to 1 and 2 dbench clients and compared with and without the patch. I still did not dig deep into the specifics of how XFS interacts with workqueues because I'm focused on how the scheduler reacts. The patch is still having an impact with bound workqueues as expected because; # zgrep sched_migrate_task 5.5.0-rc7-tipsched-20200124/iter-0/ftrace-dbench4.gz | wc -l 556259 # zgrep sched_migrate_task 5.5.0-rc7-kworkerstack-v1r2/iter-0/ftrace-dbench4.gz | wc -l 11736 There are still migrations happening but there also was a lot of logging going on for this run so it's not directly comparable what I originally reported. This is an example sequence of what's happening from a scheduler perspective on the vanilla kernel. It's editted a bit because there were a lot of other IOs going on, mostly logging related which confuse the picture. dbench-25633 [004] d... 194.998648: workqueue_queue_work: work struct=000000001cccdc2d function=xlog_cil_push_work [xfs] workqueue=00000000d90239c9 req_cpu=512 cpu=4294967295 dbench-25633 [004] d... 194.998650: sched_waking: comm=kworker/u161:6 pid=718 prio=120 target_cpu=006 dbench-25633 [004] d... 194.998655: sched_wakeup: comm=kworker/u161:6 pid=718 prio=120 target_cpu=006 kworker/u161:6-718 [006] .... 194.998692: workqueue_execute_start: work struct 000000001cccdc2d: function xlog_cil_push_work [xfs] kworker/u161:6-718 [006] .... 194.998706: workqueue_execute_end: work struct 000000001cccdc2d Dbench is on CPU 4, it queues xlog_cil_push_work on an UNBOUND workqueue. An unbound kworker wakes on CPU 6 and finishes quickly. kworker/u161:6-718 [006] .... 194.998707: workqueue_execute_start: work struct 0000000046fbf8d5: function wq_barrier_func kworker/u161:6-718 [006] d... 194.998708: sched_waking: comm=dbench pid=25633 prio=120 target_cpu=004 kworker/u161:6-718 [006] d... 194.998712: sched_wakeup: comm=dbench pid=25633 prio=120 target_cpu=004 kworker/u161:6-718 [006] .... 194.998713: workqueue_execute_end: work struct 0000000046fbf8d5 The kworker wakes dbench and finding that CPU 4 is still free, dbench uses its previous CPU and no migration occurs. dbench-25633 [004] d... 194.998727: workqueue_queue_work: work struct=00000000442434a7 function=blk_mq_requeue_work workqueue=00000000df918933 req_cpu=512 cpu=4 dbench-25633 [004] d... 
194.998728: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 dbench-25633 [004] dN.. 194.998731: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 Dbench queues blk_mq_requeue_work. This is a BOUND workqueue with a mandatory CPU target of 4 so no migration.. kworker/4:1H-991 [004] .... 194.998736: workqueue_execute_start: work struct 00000000442434a7: function blk_mq_requeue_work kworker/4:1H-991 [004] .... 194.998742: workqueue_execute_end: work struct 00000000442434a7 blk_mq_requeue_work is done <idle>-0 [004] d.s. 194.998859: workqueue_queue_work: work struct=00000000442434a7 function=blk_mq_requeue_work workqueue=00000000df918933 req_cpu=512 cpu=4 <idle>-0 [004] d.s. 194.998861: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 <idle>-0 [004] dNs. 194.998862: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 kworker/4:1H-991 [004] .... 194.998866: workqueue_execute_start: work struct 00000000442434a7: function blk_mq_requeue_work kworker/4:1H-991 [004] .... 194.998870: workqueue_execute_end: work struct 00000000442434a7 <idle>-0 [004] d.s. 194.998911: workqueue_queue_work: work struct=0000000072f39adb function=xlog_ioend_work [xfs] workqueue=00000000008f3d7f req_cpu=512 cpu=4 <idle>-0 [004] d.s. 194.998912: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 <idle>-0 [004] dNs. 194.998913: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 Ok, this looks like an IRQ delivered for IO completion and the xlog_ioend_work is reached. The BOUND kworker is woken again by the IRQ handler on CPU 4 because it has no choice. kworker/4:1H-991 [004] .... 194.998918: workqueue_execute_start: work struct 0000000072f39adb: function xlog_ioend_work [xfs] kworker/4:1H-991 [004] d... 194.998943: sched_waking: comm=dbench pid=25633 prio=120 target_cpu=004 kworker/4:1H-991 [004] d... 194.998945: sched_migrate_task: comm=dbench pid=25633 prio=120 orig_cpu=4 dest_cpu=5 kworker/4:1H-991 [004] d... 194.998947: sched_wakeup: comm=dbench pid=25633 prio=120 target_cpu=005 kworker/4:1H-991 [004] .... 194.998948: workqueue_execute_end: work struct 0000000072f39adb The IO completion handler finishes, the bound workqueue tries to wake dbench on its old CPU. The BOUND kworker is on CPU 4, the task wants CPU 4 but the CPU is busy with the kworker so the scheduler function select_idle_sibling picks CPU 5 instead and now the task is migrated and we have started our round-trip of all CPUs sharing a LLC. It's not a perfect round-robin because p->recent_used_cpu often works. Looking at the traces, dbench bounces back and forth between CPUs 4 and 5 for 7 IO completions before bouncing between CPUs 6/7 and so on. The patch alters the very last stage. The IO completion is a bound kworker and allows the wakee task to use the same CPU and avoid the migration.
On Tue, Jan 28, 2020 at 02:24:27PM +0000, Mel Gorman wrote: > I'm adding Jan Kara to the cc as he was looking into the workqueue > implemention in depth this morning and helped me better understand what > is going on. > > Phil and Ming are still cc'd as an independent test would still be nice. > > > <light bulb illumination> > > > > Is this actually ping-ponging the CIL flush and journal IO > > completion because xlog_bio_end_io() always punts journal IO > > completion to the log workqueue, which is: > > > > log->l_ioend_workqueue = alloc_workqueue("xfs-log/%s", > > WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_HIGHPRI, 0, > > mp->m_super->s_id); > > > > i.e. it uses per-cpu kthreads for processing journal IO completion > > similar to DIO io completion and thereby provides a vector for > > the same issue? > > > > Your light bulb is on point. The XFS unbound workqueue does run near the > task and does not directly cause the migration but the IO completions > matter. As it turned out, it was the IO completions I was looking at > in the old traces but there was insufficient detail to see the exact > sequence. I only observed that a bound wq at the end was causing the > migration and it was a consistent pattern. > > I did a more detailed trace that included workqueue tracepoints. I limited > the run to 1 and 2 dbench clients and compared with and without the > patch. I still did not dig deep into the specifics of how XFS interacts > with workqueues because I'm focused on how the scheduler reacts. [snip traces] Ok, so it's not /exactly/ as I thought - what is happening is that both dbench and the CIL push kworker are issuing IO during fsync operations. dbench issues the data IO, the CIL push worker issues the journal IO. > This is an example sequence of what's happening from a scheduler > perspective on the vanilla kernel. It's editted a bit because there > were a lot of other IOs going on, mostly logging related which confuse > the picture. > > dbench-25633 [004] d... 194.998648: workqueue_queue_work: work struct=000000001cccdc2d function=xlog_cil_push_work [xfs] workqueue=00000000d90239c9 req_cpu=512 cpu=4294967295 > dbench-25633 [004] d... 194.998650: sched_waking: comm=kworker/u161:6 pid=718 prio=120 target_cpu=006 > dbench-25633 [004] d... 194.998655: sched_wakeup: comm=kworker/u161:6 pid=718 prio=120 target_cpu=006 > kworker/u161:6-718 [006] .... 194.998692: workqueue_execute_start: work struct 000000001cccdc2d: function xlog_cil_push_work [xfs] > kworker/u161:6-718 [006] .... 194.998706: workqueue_execute_end: work struct 000000001cccdc2d > > Dbench is on CPU 4, it queues xlog_cil_push_work on an UNBOUND > workqueue. An unbound kworker wakes on CPU 6 and finishes quickly. > > kworker/u161:6-718 [006] .... 194.998707: workqueue_execute_start: work struct 0000000046fbf8d5: function wq_barrier_func > kworker/u161:6-718 [006] d... 194.998708: sched_waking: comm=dbench pid=25633 prio=120 target_cpu=004 > kworker/u161:6-718 [006] d... 194.998712: sched_wakeup: comm=dbench pid=25633 prio=120 target_cpu=004 > kworker/u161:6-718 [006] .... 194.998713: workqueue_execute_end: work struct 0000000046fbf8d5 Ok, that's what I'd expect if dbench issued a log force as part of an fsync() or synchronous transaction. This is it flushing the CIL and waiting for the flush work to complete (wq_barrier_func is what wakes the wq flush waiter). This doesn't complete the log force, however - the dbench process will now do a bit more work and then go to sleep waiting for journal IO to complete. 
> The kworker wakes dbench and finding that CPU 4 is still free, dbench > uses its previous CPU and no migration occurs. > > dbench-25633 [004] d... 194.998727: workqueue_queue_work: work struct=00000000442434a7 function=blk_mq_requeue_work workqueue=00000000df918933 req_cpu=512 cpu=4 > dbench-25633 [004] d... 194.998728: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 > dbench-25633 [004] dN.. 194.998731: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 > > Dbench queues blk_mq_requeue_work. This is a BOUND workqueue with a > mandatory CPU target of 4 so no migration.. So I spent some time trying to work out how the dbench process triggers this directly. This work is queued when a new cache flush command is inserted into the request queue, and generally those are done by the journal writes via REQ_PREFLUSH | REQ_FUA. Those would show up in the context of the xlog_cil_push_work and run on that CPU, not the dbench task or CPU. So this is probably xfs_file_fsync() calling xfs_blkdev_issue_flush() directly because the inode metadata had already been written to the journal by an earlier (or racing) journal flush. Hence we have to flush the device cache manually to ensure that the data that may have been written is also on stable storage. That will insert a flush directly into the request queue, and that's likely how we are getting the flush machinery running on this CPU. > kworker/4:1H-991 [004] .... 194.998736: workqueue_execute_start: work struct 00000000442434a7: function blk_mq_requeue_work > kworker/4:1H-991 [004] .... 194.998742: workqueue_execute_end: work struct 00000000442434a7 > > blk_mq_requeue_work is done > > <idle>-0 [004] d.s. 194.998859: workqueue_queue_work: work struct=00000000442434a7 function=blk_mq_requeue_work workqueue=00000000df918933 req_cpu=512 cpu=4 > <idle>-0 [004] d.s. 194.998861: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 > <idle>-0 [004] dNs. 194.998862: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 > kworker/4:1H-991 [004] .... 194.998866: workqueue_execute_start: work struct 00000000442434a7: function blk_mq_requeue_work > kworker/4:1H-991 [004] .... 194.998870: workqueue_execute_end: work struct 00000000442434a7 > <idle>-0 [004] d.s. 194.998911: workqueue_queue_work: work struct=0000000072f39adb function=xlog_ioend_work [xfs] workqueue=00000000008f3d7f req_cpu=512 cpu=4 > <idle>-0 [004] d.s. 194.998912: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 > <idle>-0 [004] dNs. 194.998913: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 > > Ok, this looks like an IRQ delivered for IO completion and the > xlog_ioend_work is reached. The BOUND kworker is woken again by the IRQ > handler on CPU 4 because it has no choice. Yeah, this looks to be is the completion of the cache flush that the dbench process is waiting on. It looks like the block layer is aggregating sequential cache flushes into a single operation, and so a post-journal-write cache flush and the fsync cache flush are one and the same. Hence it runs journal IO completion in this flush completion context, which queues xlog_ioend_work()... > kworker/4:1H-991 [004] .... 194.998918: workqueue_execute_start: work struct 0000000072f39adb: function xlog_ioend_work [xfs] > kworker/4:1H-991 [004] d... 194.998943: sched_waking: comm=dbench pid=25633 prio=120 target_cpu=004 > kworker/4:1H-991 [004] d... 194.998945: sched_migrate_task: comm=dbench pid=25633 prio=120 orig_cpu=4 dest_cpu=5 > kworker/4:1H-991 [004] d... 
194.998947: sched_wakeup: comm=dbench pid=25633 prio=120 target_cpu=005 > kworker/4:1H-991 [004] .... 194.998948: workqueue_execute_end: work struct 0000000072f39adb > > The IO completion handler finishes, the bound workqueue tries to wake > dbench on its old CPU. The BOUND kworker is on CPU 4, the task wants > CPU 4 but the CPU is busy with the kworker so the scheduler function > select_idle_sibling picks CPU 5 instead and now the task is migrated > and we have started our round-trip of all CPUs sharing a LLC. It's not a > perfect round-robin because p->recent_used_cpu often works. Looking at > the traces, dbench bounces back and forth between CPUs 4 and 5 for 7 IO > completions before bouncing between CPUs 6/7 and so on. Then this happens. Ok, so the commit that made the CIL push work unbound didn't introduce this sub-optimal scheduling pattern, it just made it more likely to trigger by increasing the likelihood of cache flush aggregation. I think the problem was likely first visible when blk-mq was introduced because of it's async cache flush machinery but it went unnoticed because dbench on blk-mq was faster because of all the other improvements blkmq brought to the table.... > The patch alters the very last stage. The IO completion is a bound kworker > and allows the wakee task to use the same CPU and avoid the migration. *nod* AFAICT, everything is pointing to this being the same issue as the AIO/DIO completion issue. We've got a bound worker thread waking an unbound user task, and the scheduler is migrating the unbound worker task to an idle CPU because it doesn't know we really want synchronous wakeup semantics in this situation. And, really, I don't think the code doing the wakeup knows whether synchronous wakeup semantics are correct, either, as there can be many processes across the entire machine waiting on this journal IO completion notification. Hence I suspect a runtime evaluated heuristic is the best we can do here... Cheers, Dave.
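As a much simpler stand-in for the dbench behaviour walked through above, a synchronous write-plus-fsync loop exercises the same path: each fsync() puts the task to sleep until the data and journal/cache-flush IO complete, and the wakeup then comes from the bound IO completion kworker running on the task's previous CPU. This is an illustrative sketch only, not dbench and not from the thread; the file path and sizes are arbitrary.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "fsync-loop.dat";
	char buf[4096];
	int fd, i;

	memset(buf, 'x', sizeof(buf));

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (i = 0; i < 10000; i++) {
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			perror("write");
			break;
		}
		/* Sleep until the data and journal IO are stable; the
		 * wakeup arrives from an IO completion kworker. */
		if (fsync(fd) < 0) {
			perror("fsync");
			break;
		}
	}

	close(fd);
	return 0;
}

Watching the sched_wakeup and sched_migrate_task tracepoints while something like this runs on an XFS filesystem is enough to see whether the woken task keeps hopping to the next idle sibling or stays on its previous CPU with the patch applied.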
On Wed, Jan 29, 2020 at 09:21:12AM +1100, Dave Chinner wrote: > > Your light bulb is on point. The XFS unbound workqueue does run near the > > task and does not directly cause the migration but the IO completions > > matter. As it turned out, it was the IO completions I was looking at > > in the old traces but there was insufficient detail to see the exact > > sequence. I only observed that a bound wq at the end was causing the > > migration and it was a consistent pattern. > > > > I did a more detailed trace that included workqueue tracepoints. I limited > > the run to 1 and 2 dbench clients and compared with and without the > > patch. I still did not dig deep into the specifics of how XFS interacts > > with workqueues because I'm focused on how the scheduler reacts. > > [snip traces] > > Ok, so it's not /exactly/ as I thought - what is happening is that > both dbench and the CIL push kworker are issuing IO during fsync > operations. dbench issues the data IO, the CIL push worker issues > the journal IO. > This is potentially due to the additional parallelism allowed by the XFS patch using an unbound workqueue for CIL. dbench can make a little additional progress and the workqueue and task collide. This may not be optimal in all cases but at least it can *try* make additional progress. > > This is an example sequence of what's happening from a scheduler > > perspective on the vanilla kernel. It's editted a bit because there > > were a lot of other IOs going on, mostly logging related which confuse > > the picture. > > > > dbench-25633 [004] d... 194.998648: workqueue_queue_work: work struct=000000001cccdc2d function=xlog_cil_push_work [xfs] workqueue=00000000d90239c9 req_cpu=512 cpu=4294967295 > > dbench-25633 [004] d... 194.998650: sched_waking: comm=kworker/u161:6 pid=718 prio=120 target_cpu=006 > > dbench-25633 [004] d... 194.998655: sched_wakeup: comm=kworker/u161:6 pid=718 prio=120 target_cpu=006 > > kworker/u161:6-718 [006] .... 194.998692: workqueue_execute_start: work struct 000000001cccdc2d: function xlog_cil_push_work [xfs] > > kworker/u161:6-718 [006] .... 194.998706: workqueue_execute_end: work struct 000000001cccdc2d > > > > Dbench is on CPU 4, it queues xlog_cil_push_work on an UNBOUND > > workqueue. An unbound kworker wakes on CPU 6 and finishes quickly. > > > > kworker/u161:6-718 [006] .... 194.998707: workqueue_execute_start: work struct 0000000046fbf8d5: function wq_barrier_func > > kworker/u161:6-718 [006] d... 194.998708: sched_waking: comm=dbench pid=25633 prio=120 target_cpu=004 > > kworker/u161:6-718 [006] d... 194.998712: sched_wakeup: comm=dbench pid=25633 prio=120 target_cpu=004 > > kworker/u161:6-718 [006] .... 194.998713: workqueue_execute_end: work struct 0000000046fbf8d5 > > Ok, that's what I'd expect if dbench issued a log force as part of > an fsync() or synchronous transaction. This is it flushing the CIL > and waiting for the flush work to complete (wq_barrier_func is what > wakes the wq flush waiter). > > This doesn't complete the log force, however - the dbench process > will now do a bit more work and then go to sleep waiting for journal > IO to complete. > So, it could be argued that it should be serialised but I think it would be very specific to how dbench behaves. I can't see anything actually wrong with what XFS does and it's just unfortunate that dbench made enough progress but still had to go to sleep. > > The kworker wakes dbench and finding that CPU 4 is still free, dbench > > uses its previous CPU and no migration occurs. 
> > > > dbench-25633 [004] d... 194.998727: workqueue_queue_work: work struct=00000000442434a7 function=blk_mq_requeue_work workqueue=00000000df918933 req_cpu=512 cpu=4 > > dbench-25633 [004] d... 194.998728: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 > > dbench-25633 [004] dN.. 194.998731: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 > > > > Dbench queues blk_mq_requeue_work. This is a BOUND workqueue with a > > mandatory CPU target of 4 so no migration.. > > So I spent some time trying to work out how the dbench process > triggers this directly. This work is queued when a new cache flush > command is inserted into the request queue, and generally those are > done by the journal writes via REQ_PREFLUSH | REQ_FUA. Those would > show up in the context of the xlog_cil_push_work and run on that > CPU, not the dbench task or CPU. > > So this is probably xfs_file_fsync() calling > xfs_blkdev_issue_flush() directly because the inode metadata had > already been written to the journal by an earlier (or racing) > journal flush. Hence we have to flush the device cache manually to > ensure that the data that may have been written is also on stable > storage. That will insert a flush directly into the request queue, > and that's likely how we are getting the flush machinery running on > this CPU. > dbench does call fsync periodically so it's a good fit. Again, it might be better to let all the work done by the kworker but without a crystal ball instruction, what XFS does seems reasonable. > > > kworker/4:1H-991 [004] .... 194.998918: workqueue_execute_start: work struct 0000000072f39adb: function xlog_ioend_work [xfs] > > kworker/4:1H-991 [004] d... 194.998943: sched_waking: comm=dbench pid=25633 prio=120 target_cpu=004 > > kworker/4:1H-991 [004] d... 194.998945: sched_migrate_task: comm=dbench pid=25633 prio=120 orig_cpu=4 dest_cpu=5 > > kworker/4:1H-991 [004] d... 194.998947: sched_wakeup: comm=dbench pid=25633 prio=120 target_cpu=005 > > kworker/4:1H-991 [004] .... 194.998948: workqueue_execute_end: work struct 0000000072f39adb > > > > The IO completion handler finishes, the bound workqueue tries to wake > > dbench on its old CPU. The BOUND kworker is on CPU 4, the task wants > > CPU 4 but the CPU is busy with the kworker so the scheduler function > > select_idle_sibling picks CPU 5 instead and now the task is migrated > > and we have started our round-trip of all CPUs sharing a LLC. It's not a > > perfect round-robin because p->recent_used_cpu often works. Looking at > > the traces, dbench bounces back and forth between CPUs 4 and 5 for 7 IO > > completions before bouncing between CPUs 6/7 and so on. > > Then this happens. > > Ok, so the commit that made the CIL push work unbound didn't > introduce this sub-optimal scheduling pattern, it just made it more > likely to trigger by increasing the likelihood of cache flush > aggregation. I think the problem was likely first visible when > blk-mq was introduced because of it's async cache flush machinery > but it went unnoticed because dbench on blk-mq was faster because of > all the other improvements blkmq brought to the table.... > It probably was masked by the blk-mq switch. While mostly good because it had been deferred a couple of times while issues were ironed out, it wasn't perfect. We were just lucky that XFS made a change that made the problem more obvious. > > The patch alters the very last stage. 
The IO completion is a bound kworker > > and allows the wakee task to use the same CPU and avoid the migration. > > *nod* > > AFAICT, everything is pointing to this being the same issue as the > AIO/DIO completion issue. We've got a bound worker thread waking an > unbound user task, and the scheduler is migrating the unbound worker > task to an idle CPU because it doesn't know we really want > synchronous wakeup semantics in this situation. And, really, I don't > think the code doing the wakeup knows whether synchronous wakeup > semantics are correct, either, as there can be many processes across > the entire machine waiting on this journal IO completion > notification. Hence I suspect a runtime evaluated heuristic is the > best we can do here... > I would prefer the wakeup code did not have to signal that it's a synchronous wakeup. Sync wakeups so exist but callers got it wrong many times where stacking was allowed and then the waker did not go to sleep. While the chain of events are related, they are not related in a very obvious way. I think it's much safer to keep this as a scheduler heuristic instead of depending on callers to have sufficient knowledge of the scheduler implementation. I still need scheduler people to look at this because stacking tasks prematurely is generally considered hazardous but with the merge window, that might take a while. I'll resend the patch later in the window if there is no response. Thanks a lot Dave for taking a close look at this and explaining exactly what XFS is doing and why.
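For readers without the scheduler history: the "sync wakeups" referred to here are the existing wake_up_*_sync() calls, which pass the WF_SYNC hint. A minimal sketch of the contract and how callers historically broke it; this is illustrative only, not code from the thread, and the names are invented.

#include <linux/compiler.h>
#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(example_wq);	/* invented */
static bool example_done;

static void example_waiter(void)
{
	/* Sleep until the flag is set and a wakeup arrives. */
	if (wait_event_interruptible(example_wq, READ_ONCE(example_done)))
		return;		/* interrupted by a signal */
}

static void example_waker(void)
{
	WRITE_ONCE(example_done, true);

	/*
	 * The _sync variant passes the WF_SYNC hint: "place the wakee
	 * on my CPU, I am about to sleep". If the caller then keeps
	 * running instead of sleeping, waker and wakee stack on one
	 * runqueue, which is why the scheduler ended up treating
	 * WF_SYNC as a weak hint rather than a guarantee.
	 */
	wake_up_interruptible_sync(&example_wq);

	/* The contract assumes the caller blocks shortly after this. */
}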
On Tue, Jan 28, 2020 at 09:10:12AM +0000, Mel Gorman wrote: > Peter, Ingo and Vincent -- I know the timing is bad due to the merge > window but do you have any thoughts on allowing select_idle_sibling to > stack a wakee task on the same CPU as a waker in this specific case? I sort of see, but *groan*... so if the kworker unlocks a contended mutex/rwsem/completion... I suppose the fact that it limits it to tasks that were running on the same CPU limits the impact if we do get it wrong. Elsewhere you write: > I would prefer the wakeup code did not have to signal that it's a > synchronous wakeup. Sync wakeups so exist but callers got it wrong many > times where stacking was allowed and then the waker did not go to sleep. > While the chain of events are related, they are not related in a very > obvious way. I think it's much safer to keep this as a scheduler > heuristic instead of depending on callers to have sufficient knowledge > of the scheduler implementation. That is true; the existing WF_SYNC has caused many issues for maybe being too strong. But what if we create a new hint that combines both these ideas? Say WF_COMPLETE and subject that to these same criteria. This way we can eliminate wakeups from locks and such (they won't have this set). Or am I just making things complicated again?
On Wed, Jan 29, 2020 at 06:38:52PM +0100, Peter Zijlstra wrote: > On Tue, Jan 28, 2020 at 09:10:12AM +0000, Mel Gorman wrote: > > Peter, Ingo and Vincent -- I know the timing is bad due to the merge > > window but do you have any thoughts on allowing select_idle_sibling to > > stack a wakee task on the same CPU as a waker in this specific case? > > I sort of see, but *groan*... > > so if the kworker unlocks a contended mutex/rwsem/completion... > > I suppose the fact that it limits it to tasks that were running on the > same CPU limits the impact if we do get it wrong. > > Elsewhere you write: > > > I would prefer the wakeup code did not have to signal that it's a > > synchronous wakeup. Sync wakeups so exist but callers got it wrong many > > times where stacking was allowed and then the waker did not go to sleep. > > While the chain of events are related, they are not related in a very > > obvious way. I think it's much safer to keep this as a scheduler > > heuristic instead of depending on callers to have sufficient knowledge > > of the scheduler implementation. > > That is true; the existing WF_SYNC has caused many issues for maybe > being too strong. > > But what if we create a new hint that combines both these ideas? Say > WF_COMPLETE and subject that to these same criteria. This way we can > eliminate wakeups from locks and such (they won't have this set). > > Or am I just making things complicated again? I suspect this is making it complicated again, because it requires the people who maintain the code that is using workqueues to understand when they might need to use a special wakeup interface in the work function. And that includes code that currently calls wake_up_all() because there can be hundreds of independent tasks waiting on the IO completion (e.g all the wait queues in the XFS journal code can (and do) have multiple threads waiting on them). IOWs, requiring a special flag just to optimise this specific case (i.e. single dependent waiter on same CPU as the kworker) when the adverse behaviour is both hardware and workload dependent means it just won't get used correctly or reliably. Hence I'd much prefer the kernel detects and dynamically handles this situation at runtime, because this pattern of workqueue usage is already quite common and will only become more widespread as we progress towards async processing of syscalls. Cheers, Dave.
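Roughly the pattern Dave describes, as an illustrative sketch rather than actual XFS code (the names are invented): any number of unrelated tasks can be asleep on the same waitqueue, and the per-CPU IO completion kworker wakes them all at once, so there is no single call site where a "stack the wakee on my CPU" flag could be applied with confidence.

#include <linux/compiler.h>
#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(iclog_waiters);	/* invented name */
static bool iclog_space_available;

/* Called from many independent task contexts (fsync, transaction commit, ...). */
static void wait_for_iclog_space(void)
{
	wait_event(iclog_waiters, READ_ONCE(iclog_space_available));
}

/* Called once from the per-CPU journal IO completion kworker. */
static void journal_io_done(void)
{
	WRITE_ONCE(iclog_space_available, true);

	/*
	 * Wakes every waiter, whichever CPU each of them last ran on.
	 * The completion path cannot cheaply tell whether one of those
	 * waiters is the single task whose work this kworker just
	 * finished, which is the argument for a scheduler-side
	 * heuristic instead of a per-wakeup flag.
	 */
	wake_up_all(&iclog_waiters);
}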
On Wed, Jan 29, 2020 at 06:38:52PM +0100, Peter Zijlstra wrote: > On Tue, Jan 28, 2020 at 09:10:12AM +0000, Mel Gorman wrote: > > Peter, Ingo and Vincent -- I know the timing is bad due to the merge > > window but do you have any thoughts on allowing select_idle_sibling to > > stack a wakee task on the same CPU as a waker in this specific case? > > I sort of see, but *groan*... > This is the reaction I kinda expected. Sync wakeups and select_idle_sibling probably caused someone PTSD at some point before my time in kernel/sched/. > so if the kworker unlocks a contended mutex/rwsem/completion... > > I suppose the fact that it limits it to tasks that were running on the > same CPU limits the impact if we do get it wrong. > And it's limited to no other task currently running on the CPU. Now, potentially multiple sleepers are on that CPU waiting for a mutex/rwsem/completion but it's very unlikely and mostly likely due to the machine being saturated in which case searching for an idle CPU will probably fail. It would also be bound by a small window after the first wakeup before the task becomes runnable before the nr_running check mitigages the problem. Besides, if the sleeping task is waiting on the lock, it *is* related to the kworker which is probably finished. In other words, even this patches worst-case behaviour does not seem that bad. > Elsewhere you write: > > > I would prefer the wakeup code did not have to signal that it's a > > synchronous wakeup. Sync wakeups so exist but callers got it wrong many > > times where stacking was allowed and then the waker did not go to sleep. > > While the chain of events are related, they are not related in a very > > obvious way. I think it's much safer to keep this as a scheduler > > heuristic instead of depending on callers to have sufficient knowledge > > of the scheduler implementation. > > That is true; the existing WF_SYNC has caused many issues for maybe > being too strong. > Exactly. It ended up being almost ignored. It basically just means that the waker CPU may be used as the target for wake_affine because the users were not obeying the contract. > But what if we create a new hint that combines both these ideas? Say > WF_COMPLETE and subject that to these same criteria. This way we can > eliminate wakeups from locks and such (they won't have this set). > I think that'll end up with three consequences. First, it falls foul of Rusty's Rules of API Design[1] because even if people read the implementation and the documentation, they might still get it wrong like what happened with WF_SYNC. Second, some other subsystem will think it's special and use the flag because it happens to work for one benchmark or worse, they happened to copy/paste the code for some reason. Finally, the workqueue implementation may change in some way that renders the use of the flag incorrect. With this patch, if workqueues change design, it's more likely the patch becomes a no-op. > Or am I just making things complicated again? I think so but I also wrote the patch so I'm biased. I think the callers would be forced into an API change if it's a common pattern where multiple unbound tasks can sleep on the same CPU waiting on a single kworker and I struggle to think of such an example. The length of time it took this issue to be detected and patched is indicative that not everyone is familiar with kernel/sched/fair.c and its consequences. If they were, chances are they would have implemented some mental hack like binding a task to a single CPU until the IO completes. 
[1] http://sweng.the-davies.net/Home/rustys-api-design-manifesto
On Thu, Jan 30, 2020 at 09:00:21AM +1100, Dave Chinner wrote: > On Wed, Jan 29, 2020 at 06:38:52PM +0100, Peter Zijlstra wrote: > > On Tue, Jan 28, 2020 at 09:10:12AM +0000, Mel Gorman wrote: > > > Peter, Ingo and Vincent -- I know the timing is bad due to the merge > > > window but do you have any thoughts on allowing select_idle_sibling to > > > stack a wakee task on the same CPU as a waker in this specific case? > > > > I sort of see, but *groan*... > > > > so if the kworker unlocks a contended mutex/rwsem/completion... > > > > I suppose the fact that it limits it to tasks that were running on the > > same CPU limits the impact if we do get it wrong. > > > > Elsewhere you write: > > > > > I would prefer the wakeup code did not have to signal that it's a > > > synchronous wakeup. Sync wakeups so exist but callers got it wrong many > > > times where stacking was allowed and then the waker did not go to sleep. > > > While the chain of events are related, they are not related in a very > > > obvious way. I think it's much safer to keep this as a scheduler > > > heuristic instead of depending on callers to have sufficient knowledge > > > of the scheduler implementation. > > > > That is true; the existing WF_SYNC has caused many issues for maybe > > being too strong. > > > > But what if we create a new hint that combines both these ideas? Say > > WF_COMPLETE and subject that to these same criteria. This way we can > > eliminate wakeups from locks and such (they won't have this set). > > > > Or am I just making things complicated again? > > I suspect this is making it complicated again, because it requires > the people who maintain the code that is using workqueues to > understand when they might need to use a special wakeup interface in > the work function. And that includes code that currently calls > wake_up_all() because there can be hundreds of independent tasks > waiting on the IO completion (e.g all the wait queues in the XFS > journal code can (and do) have multiple threads waiting on them). > > IOWs, requiring a special flag just to optimise this specific case > (i.e. single dependent waiter on same CPU as the kworker) when the > adverse behaviour is both hardware and workload dependent means it > just won't get used correctly or reliably. > I agree. Pick any of Rusty's rules from "-2 Read the implementation and you'll get it wrong" all the way down to "-10 It's impossible to get right.". > Hence I'd much prefer the kernel detects and dynamically handles > this situation at runtime, because this pattern of workqueue usage > is already quite common and will only become more widespread as we > progress towards async processing of syscalls. > To be fair, as Peter says, the kernel patch may not detect this properly. There are corner cases where it will get it wrong. My thinking is that *at the moment* when the heuristic is wrong, it's almost certainly because the machine was so over-saturated such that multiple related tasks are stacking anyway. Depending on how async syscalls proceeds, this might get turn out to the the wrong heuristic and an API change will be required. At least if that happens, we'll have a few use cases to help guide what the API change should look like so we do not end up in WF_SYNC hell again.
On Thu, Jan 30, 2020 at 12:43:34AM +0000, Mel Gorman wrote: > On Wed, Jan 29, 2020 at 06:38:52PM +0100, Peter Zijlstra wrote: > > I suppose the fact that it limits it to tasks that were running on the > > same CPU limits the impact if we do get it wrong. > > > > And it's limited to no other task currently running on the > CPU. Now, potentially multiple sleepers are on that CPU waiting for > a mutex/rwsem/completion but it's very unlikely and mostly likely due > to the machine being saturated in which case searching for an idle CPU > will probably fail. It would also be bound by a small window after the > first wakeup before the task becomes runnable before the nr_running check > mitigages the problem. Besides, if the sleeping task is waiting on the > lock, it *is* related to the kworker which is probably finished. > > In other words, even this patches worst-case behaviour does not seem > that bad. OK; let's just stick it in and see what, if anything, falls over :-) I saw there is a v2 out (although I didn't see what changed in a hurry), let me queue that one.
On Thu, Jan 30, 2020 at 09:06:53AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 30, 2020 at 12:43:34AM +0000, Mel Gorman wrote:
> > On Wed, Jan 29, 2020 at 06:38:52PM +0100, Peter Zijlstra wrote:
> > > I suppose the fact that it limits it to tasks that were running on the
> > > same CPU limits the impact if we do get it wrong.
> > >
> >
> > And it's limited to no other task currently running on the
> > CPU. Now, potentially multiple sleepers are on that CPU waiting for
> > a mutex/rwsem/completion but it's very unlikely and most likely due
> > to the machine being saturated, in which case searching for an idle CPU
> > will probably fail. It would also be bound by a small window after the
> > first wakeup, before the task becomes runnable and the nr_running check
> > mitigates the problem. Besides, if the sleeping task is waiting on the
> > lock, it *is* related to the kworker which is probably finished.
> >
> > In other words, even this patch's worst-case behaviour does not seem
> > that bad.
>
> OK; let's just stick it in and see what, if anything, falls over :-)
>
> I saw there is a v2 out (although I didn't see what changed in a hurry),
> let me queue that one.

Only the changelog and comments changed in light of the discussion with
Dave.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc1dfc007604..1f615a223791 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1442,17 +1442,6 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 
 #ifdef CONFIG_SMP
 
-static inline bool is_per_cpu_kthread(struct task_struct *p)
-{
-	if (!(p->flags & PF_KTHREAD))
-		return false;
-
-	if (p->nr_cpus_allowed != 1)
-		return false;
-
-	return true;
-}
-
 /*
  * Per-CPU kthreads are allowed to run on !active && online CPUs, see
  * __set_cpus_allowed_ptr() and select_fallback_rq().
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe4e0d775375..76df439aff76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5912,6 +5912,19 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
 		return prev;
 
+	/*
+	 * Allow a per-cpu kthread to stack with the wakee if the
+	 * kworker thread and the tasks previous CPU are the same.
+	 * The assumption is that the wakee queued work for the
+	 * per-cpu kthread that is now complete and the wakeup is
+	 * essentially a sync wakeup.
+	 */
+	if (is_per_cpu_kthread(current) &&
+	    prev == smp_processor_id() &&
+	    this_rq()->nr_running <= 1) {
+		return prev;
+	}
+
 	/* Check a recently used CPU as a potential idle candidate: */
 	recent_used_cpu = p->recent_used_cpu;
 	if (recent_used_cpu != prev &&
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1a88dc8ad11b..5876e6ba5903 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2479,3 +2479,16 @@ static inline void membarrier_switch_mm(struct rq *rq,
 {
 }
 #endif
+
+#ifdef CONFIG_SMP
+static inline bool is_per_cpu_kthread(struct task_struct *p)
+{
+	if (!(p->flags & PF_KTHREAD))
+		return false;
+
+	if (p->nr_cpus_allowed != 1)
+		return false;
+
+	return true;
+}
+#endif
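[Editorial aside: the new condition in select_idle_sibling() can be read as
three separate guards. The helper below is only an annotated restatement for
readers skimming the diff; the name is made up and the real patch open-codes
the test.]

/*
 * Annotated restatement of the check added above; not part of the patch.
 */
static inline bool kworker_stacking_allowed(int prev)
{
	/* The waker must be a kthread pinned to a single CPU, i.e. a kworker. */
	if (!is_per_cpu_kthread(current))
		return false;

	/* The wakee must have last run on the CPU the kworker is on right now. */
	if (prev != smp_processor_id())
		return false;

	/*
	 * Nothing else is runnable on this runqueue apart from the kworker,
	 * which is expected to go back to sleep almost immediately.
	 */
	return this_rq()->nr_running <= 1;
}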
Commit 8ab39f11d974 ("xfs: prevent CIL push holdoff in log
recovery") changed from using bound workqueues to using unbound
workqueues. Functionally this makes sense but it was observed at the
time that the dbench performance dropped quite a lot and CPU migrations
were excessively high even when there are plenty of idle CPUs.

The pattern of the task migration is straightforward. With XFS, an IO
issuer may delegate work to a kworker which wakes on the same CPU. On
completion of the work, it wakes the task, finds that the previous CPU
is busy (because the kworker is still running on it) and migrates the
task to the next idle CPU. The task ends up migrating around all CPUs
sharing a LLC at high frequency. This has negative implications both in
communication costs and power management. mpstat confirmed that at low
thread counts, all CPUs sharing an LLC have low levels of activity.

The impact of this problem is related to the number of CPUs sharing an
LLC.

This patch special cases the pattern and allows a kworker waker and a
task wakee to stack on the same CPU if there is a strong chance they are
directly related. The expectation is that the kworker is likely going
back to sleep shortly. This is not guaranteed as the IO could be queued
asynchronously but there is a very strong relationship between the task
and kworker in this case that would justify stacking on the same CPU
instead of migrating. There should be few concerns about kworker
starvation given that the special casing is only when the kworker is the
waker.

DBench on XFS
MMTests config: io-dbench4-async modified to run on a fresh XFS filesystem

UMA machine with 8 cores sharing LLC

                          5.5.0-rc7              5.5.0-rc7
                  tipsched-20200124           kworkerstack
Amean     1        22.63 (   0.00%)       20.54 *   9.23%*
Amean     2        25.56 (   0.00%)       23.40 *   8.44%*
Amean     4        28.63 (   0.00%)       27.85 *   2.70%*
Amean     8        37.66 (   0.00%)       37.68 (  -0.05%)
Amean     64      469.47 (   0.00%)      468.26 (   0.26%)
Stddev    1         1.00 (   0.00%)        0.72 (  28.12%)
Stddev    2         1.62 (   0.00%)        1.97 ( -21.54%)
Stddev    4         2.53 (   0.00%)        3.58 ( -41.19%)
Stddev    8         5.30 (   0.00%)        5.20 (   1.92%)
Stddev    64       86.36 (   0.00%)       94.53 (  -9.46%)

NUMA machine, 48 CPUs total, 24 CPUs share cache

                          5.5.0-rc7              5.5.0-rc7
                  tipsched-20200124      kworkerstack-v1r2
Amean     1        58.69 (   0.00%)       30.21 *  48.53%*
Amean     2        60.90 (   0.00%)       35.29 *  42.05%*
Amean     4        66.77 (   0.00%)       46.55 *  30.28%*
Amean     8        81.41 (   0.00%)       68.46 *  15.91%*
Amean     16      113.29 (   0.00%)      107.79 *   4.85%*
Amean     32      199.10 (   0.00%)      198.22 *   0.44%*
Amean     64      478.99 (   0.00%)      477.06 *   0.40%*
Amean     128    1345.26 (   0.00%)     1372.64 *  -2.04%*
Stddev    1         2.64 (   0.00%)        4.17 ( -58.08%)
Stddev    2         4.35 (   0.00%)        5.38 ( -23.73%)
Stddev    4         6.77 (   0.00%)        6.56 (   3.00%)
Stddev    8        11.61 (   0.00%)       10.91 (   6.04%)
Stddev    16       18.63 (   0.00%)       19.19 (  -3.01%)
Stddev    32       38.71 (   0.00%)       38.30 (   1.06%)
Stddev    64      100.28 (   0.00%)       91.24 (   9.02%)
Stddev    128     186.87 (   0.00%)      160.34 (  14.20%)

Dbench has been modified to report the time to complete a single "load
file". This is a more meaningful metric for dbench than a throughput
metric as the benchmark makes many different system calls that are not
throughput-related.

Patch shows a 9.23% and 48.53% reduction in the time to process a load
file with the difference partially explained by the number of CPUs
sharing a LLC. In a separate run, task migrations were almost eliminated
by the patch for low client counts. In case people have an issue with
the metric used for the benchmark, this is a comparison of the
throughputs as reported by dbench on the NUMA machine.
dbench4 Throughput (misleading but traditional)

                          5.5.0-rc7              5.5.0-rc7
                  tipsched-20200124      kworkerstack-v1r2
Hmean     1       321.41 (   0.00%)      617.82 *  92.22%*
Hmean     2       622.87 (   0.00%)     1066.80 *  71.27%*
Hmean     4      1134.56 (   0.00%)     1623.74 *  43.12%*
Hmean     8      1869.96 (   0.00%)     2212.67 *  18.33%*
Hmean     16     2673.11 (   0.00%)     2806.13 *   4.98%*
Hmean     32     3032.74 (   0.00%)     3039.54 (   0.22%)
Hmean     64     2514.25 (   0.00%)     2498.96 *  -0.61%*
Hmean     128    1778.49 (   0.00%)     1746.05 *  -1.82%*

Note that this is somewhat specific to XFS and ext4 shows no performance
difference as it does not rely on kworkers in the same way. No major
problem was observed running other workloads on different machines
although not all tests have completed yet.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/core.c  | 11 -----------
 kernel/sched/fair.c  | 13 +++++++++++++
 kernel/sched/sched.h | 13 +++++++++++++
 3 files changed, 26 insertions(+), 11 deletions(-)
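[Editorial aside: to make the delegation pattern described in the changelog
concrete, here is a minimal sketch of an issuer handing work to a kworker
and sleeping until that kworker wakes it. The names are invented and this is
not how XFS structures its journal code; it only shows the generic
queue_work()/wait_for_completion() shape that produces a kworker-as-waker
wakeup on the issuer's CPU.]

#include <linux/workqueue.h>
#include <linux/completion.h>
#include <linux/kernel.h>

/* Hypothetical container purely for illustration. */
struct demo_io {
	struct work_struct	work;
	struct completion	done;
};

static void demo_io_work(struct work_struct *work)
{
	struct demo_io *io = container_of(work, struct demo_io, work);

	/* ... the deferred part of the operation runs in kworker context ... */

	/*
	 * The kworker is the waker: without the stacking heuristic, the
	 * issuer's previous CPU (this one) looks busy and the issuer is
	 * migrated to the next idle CPU on wakeup.
	 */
	complete(&io->done);
}

static void demo_io_submit_and_wait(struct demo_io *io)
{
	INIT_WORK(&io->work, demo_io_work);
	init_completion(&io->done);

	/* queue_work() normally queues on the local CPU's worker pool. */
	queue_work(system_wq, &io->work);

	/* The issuer sleeps here; the wakeup comes from the kworker above. */
	wait_for_completion(&io->done);
}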