Message ID: 20191114113153.GB4213@ming.t460p (mailing list archive)
State: New, archived
Series: single aio thread is migrated crazily by scheduler
On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote:
> Hi Guys,
>
> It is found that single AIO thread is migrated crazely by scheduler, and
> the migrate period can be < 10ms. Follows the test a):

What does crazy mean? Does it cycle through the L3 mask?
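One way to answer the L3 question is to compare the orig_cpu/dest_cpu pairs from the migration trace against the CPUs that share a last-level cache. A minimal sketch, assuming the usual sysfs cacheinfo layout where index3 is the L3 slice:

# Which CPUs share the last-level cache with CPU 0 (index3 is normally
# the L3; confirm with the "level" file on the machine under test).
cat /sys/devices/system/cpu/cpu0/cache/index3/level
cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list

# Compare that list with the orig_cpu/dest_cpu pairs reported by the
# sched:sched_migrate_task tracepoint to see whether the fio thread only
# bounces inside one LLC domain or cycles across the whole machine.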
On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote: > Hi Guys, > > It is found that single AIO thread is migrated crazely by scheduler, and > the migrate period can be < 10ms. Follows the test a): > > - run single job fio[1] for 30 seconds: > ./xfs_complete 512 > > - observe fio io thread migration via bcc trace[2], and the migration > times can reach 5k ~ 10K in above test. In this test, CPU utilization > is 30~40% on the CPU running fio IO thread. Using the default scheduler tunings: kernel.sched_wakeup_granularity_ns = 4000000 kernel.sched_min_granularity_ns = 3000000 I'm not seeing any migrations at all on a 16p x86-64 box. Even with the tunings you suggest: sysctl kernel.sched_min_granularity_ns=10000000 sysctl kernel.sched_wakeup_granularity_ns=15000000 There are no migrations at all. During the overwrite phase of the test, I'm seeing 66% fio usage, 28% system time and of that, only 3% of the CPU time is in the IO completion workqueue time. It's all running on the one CPU, and it's running at about 15,000 context switches per second.. I'm not seeing any CPU migrations at all, despite the workload being completely CPU bound on a single CPU. This is the typical output from top: Tasks: 262 total, 2 running, 260 sleeping, 0 stopped, 0 zombie %Cpu0 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu1 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu2 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu3 : 66.2 us, 27.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 6.0 si, 0.0 st %Cpu4 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu5 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu6 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu7 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu8 : 0.0 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st %Cpu9 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu10 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu11 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu12 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu13 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu14 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu15 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st MiB Mem : 16005.3 total, 8737.7 free, 6762.5 used, 505.2 buff/cache MiB Swap: 486.3 total, 486.3 free, 0.0 used. 8640.3 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5554 root 20 0 672064 15356 1392 R 97.0 0.1 0:17.85 fio 1388 root 20 0 0 0 0 I 3.0 0.0 0:01.39 kworker/3:1-dio/sdb 5552 root 20 0 661364 275380 271992 S 0.7 1.7 0:15.88 fio 4891 dave 20 0 15572 5876 4600 S 0.3 0.0 0:00.56 sshd 4933 dave 20 0 11732 4276 3296 R 0.3 0.0 0:01.30 top i.e. it runs entirely on CPU 3 for the whole 30s measurement period. And the number of migrate task events: $ sudo trace-cmd start -e sched_migrate_task $ sudo ./run_test.sh ..... 
$ sudo trace-cmd show |grep fio kworker/1:1-1252 [001] d..2 2002.792659: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=5 dest_cpu=1 rcu_preempt-11 [011] d..2 2004.202624: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=1 dest_cpu=11 rcu_preempt-11 [006] d..2 2008.364481: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=11 dest_cpu=6 <idle>-0 [007] dNs2 2009.209480: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=6 dest_cpu=7 $ And top tells me these fio processes are consuming CPU: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5695 root 20 0 663872 7120 1324 R 97.7 0.0 0:03.48 fio 5689 root 20 0 661356 275512 272128 S 0.3 1.7 0:15.46 fio So, the fio process that got migrated 4 times in 30s is the fio process that isn't doing any of the work. There's no migration going on here at all on a vanilla kernel on a Debian userspace, with or without the suggested scheduler tunings. > - after applying the debug patch[3] to queue XFS completion work on > other CPU(not current CPU), the above crazy fio IO thread migration > can't be observed. > > IMO, it is normal for user to saturate aio thread, since this way may > save context switch. > > Guys, any idea on the crazy aio thread migration? It doesn't happen on x86-64 with a current TOT vanilla kernel and a debian userspace. What userspace are you testing with? > BTW, the tests are run on latest linus tree(5.4-rc7) in KVM guest, and the > fio test is created for simulating one real performance report which is > proved to be caused by frequent aio submission thread migration. What is the underlying hardware? I'm running in a 16p KVM guest on a 16p/32t x86-64 using 5.4-rc7, and I don't observe any significant CPU migration occurring at all from your test workload. > [3] test patch for queuing xfs completetion on other CPU > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > index 1fc28c2da279..bdc007a57706 100644 > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -158,9 +158,14 @@ static void iomap_dio_bio_end_io(struct bio *bio) > blk_wake_io_task(waiter); > } else if (dio->flags & IOMAP_DIO_WRITE) { > struct inode *inode = file_inode(dio->iocb->ki_filp); > + unsigned cpu = cpumask_next(smp_processor_id(), > + cpu_online_mask); > + > + if (cpu >= nr_cpu_ids) > + cpu = 0; > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > + queue_work_on(cpu, inode->i_sb->s_dio_done_wq, &dio->aio.work); > } else { > iomap_dio_complete_work(&dio->aio.work); > } So, can you test the patch I sent you a while back that avoided using the completion workqueue for pure overwrites? Does that make your whacky scheduler behaviour go away, or does it still reschedule like crazy? I haven't posted that patch to the list because I can't prove it makes any difference to performance in any workload on any hardware I've tested it on. I just tested it on your workload, and it makes no different to behaviour or performance. It's definitely hitting the non-workqueue completion path: $ sudo trace-cmd show | grep overwrite |head -1 fio-4526 [009] ..s. 180.361089: iomap_dio_bio_end_io: overwrite completion $ sudo trace-cmd show | grep overwrite |wc -l 51295 $ so it's pretty clear that whatever is going on is not obviously a problem with workqueues or the way iomap does completions. Patch is attached below for you to test. 
You will need to pull commit 7684e2c4384d ("iomap: iomap that extends beyond
EOF should be marked dirty") from the iomap for-next branch to ensure this
patch functions correctly.

Cheers,

-Dave.
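For reference, the migration-count measurement described above boils down to the following sketch; run_test.sh stands in for whatever wrapper launches ./xfs_complete on the test box:

# Count scheduler migrations of the fio tasks over one 30s run, using the
# same trace-cmd subcommands shown above.
trace-cmd start -e sched_migrate_task      # arm the tracepoint
./run_test.sh                              # run the fio workload
trace-cmd show | grep 'comm=fio' | wc -l   # migrations of fio threads
trace-cmd stop
trace-cmd reset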
On Thu, Nov 14, 2019 at 02:14:34PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote:
> > Hi Guys,
> >
> > It is found that single AIO thread is migrated crazely by scheduler, and
> > the migrate period can be < 10ms. Follows the test a):
>
> What does crazy mean? Does it cycle through the L3 mask?

The single AIO thread is migrated once every few milliseconds.

Thanks,
Ming
Hi Dave, On Fri, Nov 15, 2019 at 10:54:15AM +1100, Dave Chinner wrote: > On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote: > > Hi Guys, > > > > It is found that single AIO thread is migrated crazely by scheduler, and > > the migrate period can be < 10ms. Follows the test a): > > > > - run single job fio[1] for 30 seconds: > > ./xfs_complete 512 > > > > - observe fio io thread migration via bcc trace[2], and the migration > > times can reach 5k ~ 10K in above test. In this test, CPU utilization > > is 30~40% on the CPU running fio IO thread. > > Using the default scheduler tunings: > > kernel.sched_wakeup_granularity_ns = 4000000 > kernel.sched_min_granularity_ns = 3000000 > > I'm not seeing any migrations at all on a 16p x86-64 box. Even with > the tunings you suggest: > > sysctl kernel.sched_min_granularity_ns=10000000 > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > There are no migrations at all. Looks I forget to pass $BS to the fio command line in the script posted, please try the following script again and run './xfs_complete 512' first. [1] xfs_complete: one fio script for running single job overwrite aio on XFS #!/bin/bash BS=$1 NJOBS=1 QD=128 DIR=/mnt/xfs BATCH=1 VERIFY="sha3-512" sysctl kernel.sched_wakeup_granularity_ns sysctl kernel.sched_min_granularity_ns rmmod scsi_debug;modprobe scsi_debug dev_size_mb=6144 ndelay=41000 dix=1 dif=2 DEV=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename` DEV="/dev/"$DEV mkfs.xfs -f $DEV [ ! -d $DIR ] && mkdir -p $DIR mount $DEV $DIR fio --readwrite=randwrite --filesize=5g \ --overwrite=1 \ --filename=$DIR/fiofile \ --runtime=30s --time_based \ --ioengine=libaio --direct=1 --bs=$BS --iodepth=$QD \ --iodepth_batch_submit=$BATCH \ --iodepth_batch_complete_min=$BATCH \ --numjobs=$NJOBS \ --verify=$VERIFY \ --name=/hana/fsperf/foo umount $DEV rmmod scsi_debug When running './xfs_complete 512', lots of fio migration can be observed via bcc trace: /usr/share/bcc/tools/trace -C -t 't:sched:sched_migrate_task "%s/%d cpu %d->%d", args->comm,args->pid,args->orig_cpu,args->dest_cpu' | grep fio ... 69.13495 2 23 23 kworker/2:0 sched_migrate_task b'fio'/866 cpu 2->0 69.13513 0 13 13 kworker/0:1 sched_migrate_task b'fio'/866 cpu 0->2 69.86733 2 23 23 kworker/2:0 sched_migrate_task b'fio'/866 cpu 2->3 70.20730 3 134 134 kworker/3:1 sched_migrate_task b'fio'/866 cpu 3->0 70.21131 0 13 13 kworker/0:1 sched_migrate_task b'fio'/866 cpu 0->1 70.21733 1 112 112 kworker/1:1 sched_migrate_task b'fio'/866 cpu 1->2 70.29528 2 23 23 kworker/2:0 sched_migrate_task b'fio'/866 cpu 2->0 70.29769 0 13 13 kworker/0:1 sched_migrate_task b'fio'/866 cpu 0->2 70.36332 2 23 23 kworker/2:0 sched_migrate_task b'fio'/866 cpu 2->3 70.41924 3 134 134 kworker/3:1 sched_migrate_task b'fio'/866 cpu 3->2 70.78572 2 23 23 kworker/2:0 sched_migrate_task b'fio'/866 cpu 2->0 70.79061 0 13 13 kworker/0:1 sched_migrate_task b'fio'/866 cpu 0->2 70.91526 2 23 23 kworker/2:0 sched_migrate_task b'fio'/866 cpu 2->0 70.91607 0 13 13 kworker/0:1 sched_migrate_task b'fio'/866 cpu 0->2 ... In my test just done, the migration count is 12K in 30s fio running. Sometimes the number can be quite less, but most of times, the number is very big(> 5k). Also attaches my kernel config. > During the overwrite phase of the test, I'm seeing 66% fio usage, > 28% system time and of that, only 3% of the CPU time is in the IO > completion workqueue time. It's all running on the one CPU, and it's > running at about 15,000 context switches per second.. 
I'm not seeing > any CPU migrations at all, despite the workload being completely CPU > bound on a single CPU. In my test VM, in case of 4K block size(./xfs_complete 4k), when the following kernel parameters are passed in: sysctl kernel.sched_min_granularity_ns=10000000 sysctl kernel.sched_wakeup_granularity_ns=15000000 'top -H' shows that fio consumes ~100% CPU, and the fio IO thread is migrated very frequently. > > This is the typical output from top: > > Tasks: 262 total, 2 running, 260 sleeping, 0 stopped, 0 zombie > %Cpu0 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu1 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu2 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu3 : 66.2 us, 27.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 6.0 si, 0.0 st > %Cpu4 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu5 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu6 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu7 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu8 : 0.0 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st > %Cpu9 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu10 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu11 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu12 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu13 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu14 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu15 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > MiB Mem : 16005.3 total, 8737.7 free, 6762.5 used, 505.2 buff/cache > MiB Swap: 486.3 total, 486.3 free, 0.0 used. 8640.3 avail Mem > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5554 root 20 0 672064 15356 1392 R 97.0 0.1 0:17.85 fio > 1388 root 20 0 0 0 0 I 3.0 0.0 0:01.39 kworker/3:1-dio/sdb > 5552 root 20 0 661364 275380 271992 S 0.7 1.7 0:15.88 fio > 4891 dave 20 0 15572 5876 4600 S 0.3 0.0 0:00.56 sshd > 4933 dave 20 0 11732 4276 3296 R 0.3 0.0 0:01.30 top > > i.e. it runs entirely on CPU 3 for the whole 30s measurement period. > > And the number of migrate task events: > > $ sudo trace-cmd start -e sched_migrate_task > $ sudo ./run_test.sh > ..... > $ sudo trace-cmd show |grep fio > kworker/1:1-1252 [001] d..2 2002.792659: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=5 dest_cpu=1 > rcu_preempt-11 [011] d..2 2004.202624: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=1 dest_cpu=11 > rcu_preempt-11 [006] d..2 2008.364481: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=11 dest_cpu=6 > <idle>-0 [007] dNs2 2009.209480: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=6 dest_cpu=7 > $ > > And top tells me these fio processes are consuming CPU: > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5695 root 20 0 663872 7120 1324 R 97.7 0.0 0:03.48 fio > 5689 root 20 0 661356 275512 272128 S 0.3 1.7 0:15.46 fio > > So, the fio process that got migrated 4 times in 30s is the fio > process that isn't doing any of the work. > > There's no migration going on here at all on a vanilla kernel on > a Debian userspace, with or without the suggested scheduler > tunings. > > > - after applying the debug patch[3] to queue XFS completion work on > > other CPU(not current CPU), the above crazy fio IO thread migration > > can't be observed. 
> > > > IMO, it is normal for user to saturate aio thread, since this way may > > save context switch. > > > > Guys, any idea on the crazy aio thread migration? > > It doesn't happen on x86-64 with a current TOT vanilla kernel and > a debian userspace. > > What userspace are you testing with? Fedora 29. > > > BTW, the tests are run on latest linus tree(5.4-rc7) in KVM guest, and the > > fio test is created for simulating one real performance report which is > > proved to be caused by frequent aio submission thread migration. > > What is the underlying hardware? I'm running in a 16p KVM guest on a > 16p/32t x86-64 using 5.4-rc7, and I don't observe any significant > CPU migration occurring at all from your test workload. It is a KVM guest, which is running on my Lenova T460p Fedora 29 laptop, and the host kernel is 5.2.18-100.fc29.x86_64, follows the guest info: [root@ktest-01 ~]# lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 1 Core(s) per socket: 4 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 94 Model name: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz Stepping: 3 CPU MHz: 2712.000 BogoMIPS: 5424.00 Virtualization: VT-x Hypervisor vendor: KVM Virtualization type: full L1d cache: 32K L1i cache: 32K L2 cache: 4096K L3 cache: 16384K NUMA node0 CPU(s): 0-3 NUMA node1 CPU(s): 4-7 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmxp [root@ktest-01 ~]# uname -a Linux ktest-01 5.4.0-rc5-00367-gac8d20c84c47 #1532 SMP Thu Nov 14 19:13:25 CST 2019 x86_64 x86_64 x86_64x > > > [3] test patch for queuing xfs completetion on other CPU > > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > > index 1fc28c2da279..bdc007a57706 100644 > > --- a/fs/iomap/direct-io.c > > +++ b/fs/iomap/direct-io.c > > @@ -158,9 +158,14 @@ static void iomap_dio_bio_end_io(struct bio *bio) > > blk_wake_io_task(waiter); > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > struct inode *inode = file_inode(dio->iocb->ki_filp); > > + unsigned cpu = cpumask_next(smp_processor_id(), > > + cpu_online_mask); > > + > > + if (cpu >= nr_cpu_ids) > > + cpu = 0; > > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > + queue_work_on(cpu, inode->i_sb->s_dio_done_wq, &dio->aio.work); > > } else { > > iomap_dio_complete_work(&dio->aio.work); > > } > > So, can you test the patch I sent you a while back that avoided > using the completion workqueue for pure overwrites? Does that make > your whacky scheduler behaviour go away, or does it still reschedule > like crazy? Last time, I found that your patch V3 doesn't work as expected since the trace message of 'overwrite completion' isn't observed when running the external test case. Seems not see your further response after I reported it to you. Will try this patch again. > > I haven't posted that patch to the list because I can't prove it > makes any difference to performance in any workload on any hardware > I've tested it on. I just tested it on your workload, and it makes > no different to behaviour or performance. It's definitely hitting > the non-workqueue completion path: > > $ sudo trace-cmd show | grep overwrite |head -1 > fio-4526 [009] ..s. 
180.361089: iomap_dio_bio_end_io: overwrite completion > $ sudo trace-cmd show | grep overwrite |wc -l > 51295 > $ > > so it's pretty clear that whatever is going on is not obviously a > problem with workqueues or the way iomap does completions. Patch is > attached below for you to test. You will need to pull commit > 7684e2c4384d ("iomap: iomap that extends beyond EOF should be marked > dirty") from the iomap for-next branch to ensure this patch > functions correctly. > > Cheers, > > -Dave. > -- > Dave Chinner > david@fromorbit.com > > iomap: don't defer completion of pure AIO+DIO overwrites > > From: Dave Chinner <dchinner@redhat.com> > > We have a workload issuing 4k sequential AIO+DIOs in batches (yeah, > really silly thing to do when you could just emit a single large > IO) that is showing some interesting performance anomalies. i.e. > massive performance regressions that go away when the workload is > straced because it changes AIO completion scheduling and submission > batch sizes.... > > Older kernels that don't have the iomap DIO code don't show the same > regression, and they complete the AIO-DIO in the the bio completion > context rather than deferring it to a workqueue like iomap does. > > Hence, for pure overwrites that don't require any metadata updates > on IO completion, including file size, call the IO completion > directly rather than deferring it to the completion workqueue. THe > IOMAP_DIO_OVERWRITE flag is added to tell ->end_io implementations > that they are running in bio completion context and that they should > not have any update work to do. This leverages the iomap FUA > optimisations that allow FUA writes to be issued when no metadata > updates are required to be synced to disk during IO completion. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/iomap/direct-io.c | 26 ++++++++++++++++++++------ > fs/xfs/xfs_file.c | 3 +++ > include/linux/iomap.h | 8 ++++++++ > 3 files changed, 31 insertions(+), 6 deletions(-) > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > index 49bf9780e3ed..a89b29306794 100644 > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -159,8 +159,14 @@ static void iomap_dio_bio_end_io(struct bio *bio) > } else if (dio->flags & IOMAP_DIO_WRITE) { > struct inode *inode = file_inode(dio->iocb->ki_filp); > > - INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > + if ((dio->flags & IOMAP_DIO_OVERWRITE) && > + !inode->i_mapping->nrpages) { > + trace_printk("overwrite completion\n"); > + iomap_dio_complete_work(&dio->aio.work); > + } else { > + INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > + queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > + } > } else { > iomap_dio_complete_work(&dio->aio.work); > } > @@ -229,10 +235,12 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, > * the underlying device supports FUA. This allows us to avoid > * cache flushes on IO completion. 
> */ > - if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) && > - (dio->flags & IOMAP_DIO_WRITE_FUA) && > - blk_queue_fua(bdev_get_queue(iomap->bdev))) > - use_fua = true; > + if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY))) { > + dio->flags |= IOMAP_DIO_OVERWRITE; > + if ((dio->flags & IOMAP_DIO_WRITE_FUA) && > + blk_queue_fua(bdev_get_queue(iomap->bdev))) > + use_fua = true; > + } > } > > /* > @@ -511,9 +519,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > /* > * If all the writes we issued were FUA, we don't need to flush the > * cache on IO completion. Clear the sync flag for this case. > + * > + * If we are doing an overwrite and need to sync and FUA cannot be used, > + * clear the overwrite flag to ensure the completion is run via the > + * workqueue rather than directly. > */ > if (dio->flags & IOMAP_DIO_WRITE_FUA) > dio->flags &= ~IOMAP_DIO_NEED_SYNC; > + if (dio->flags & IOMAP_DIO_NEED_SYNC) > + dio->flags &= ~IOMAP_DIO_OVERWRITE; > > WRITE_ONCE(iocb->ki_cookie, dio->submit.cookie); > WRITE_ONCE(iocb->private, dio->submit.last_queue); > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 525b29b99116..794dea2f1dc3 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -395,6 +395,9 @@ xfs_dio_write_end_io( > */ > XFS_STATS_ADD(ip->i_mount, xs_write_bytes, size); > > + if (flags & IOMAP_DIO_OVERWRITE) > + return 0; > + > /* > * We can allocate memory here while doing writeback on behalf of > * memory reclaim. To avoid memory allocation deadlocks set the > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > index 8b09463dae0d..009c5969a2ef 100644 > --- a/include/linux/iomap.h > +++ b/include/linux/iomap.h > @@ -249,6 +249,14 @@ int iomap_writepages(struct address_space *mapping, > #define IOMAP_DIO_UNWRITTEN (1 << 0) /* covers unwritten extent(s) */ > #define IOMAP_DIO_COW (1 << 1) /* covers COW extent(s) */ > > +/* > + * IOMAP_DIO_OVERWRITE indicates a pure overwrite that requires no metadata > + * updates on completion. It also indicates that the completion is running in > + * the hardware IO completion context (e.g. softirq) rather than on a schedules > + * workqueue. > + */ > +#define IOMAP_DIO_OVERWRITE (1 << 2) > + > struct iomap_dio_ops { > int (*end_io)(struct kiocb *iocb, ssize_t size, int error, > unsigned flags); > Just run a quick test several times after applying the above patch, and looks it does make a big difference in test './xfs_complete 512' wrt. fio io thread migration. thanks, Ming
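The bcc trace one-liner above emits one line per migration event; a small sketch for turning that stream into a per-run count (timeout(1) is only used here to bound the capture to the 30s fio run):

# Count fio migrations over a 30s window with the same bcc tracepoint
# probe shown above; grep -c tallies the matching trace lines.
timeout 30 /usr/share/bcc/tools/trace -C -t \
    't:sched:sched_migrate_task "%s/%d cpu %d->%d", args->comm,args->pid,args->orig_cpu,args->dest_cpu' \
    | grep -c "b'fio'"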
On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > Hi Dave, > > On Fri, Nov 15, 2019 at 10:54:15AM +1100, Dave Chinner wrote: > > On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote: > > > Hi Guys, > > > > > > It is found that single AIO thread is migrated crazely by scheduler, and > > > the migrate period can be < 10ms. Follows the test a): > > > > > > - run single job fio[1] for 30 seconds: > > > ./xfs_complete 512 > > > > > > - observe fio io thread migration via bcc trace[2], and the migration > > > times can reach 5k ~ 10K in above test. In this test, CPU utilization > > > is 30~40% on the CPU running fio IO thread. > > > > Using the default scheduler tunings: > > > > kernel.sched_wakeup_granularity_ns = 4000000 > > kernel.sched_min_granularity_ns = 3000000 > > > > I'm not seeing any migrations at all on a 16p x86-64 box. Even with > > the tunings you suggest: > > > > sysctl kernel.sched_min_granularity_ns=10000000 > > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > > > There are no migrations at all. > > Looks I forget to pass $BS to the fio command line in the script posted, > please try the following script again and run './xfs_complete 512' first. So I ran 4kB IOs instead of 512 byte IOs. Shouldn't make any difference, really - it'll still be CPU bound... <snip script> > In my test just done, the migration count is 12K in 30s fio running. > Sometimes the number can be quite less, but most of times, the number > is very big(> 5k). With my iomap-dio-overwrite patch and 512 byte IOs: $ sudo trace-cmd show |grep sched_migrate_task |wc -l 112 $ sudo trace-cmd show |grep sched_migrate_task |grep fio |wc -l 22 Without the iomap-dio-overwrite patch: $ sudo trace-cmd show |grep sched_migrate_task |wc -l 99 $ sudo trace-cmd show |grep sched_migrate_task |grep fio |wc -l 9 $ There are -less- migrations when using the workqueue for everything. But it's so low in either case that it's just noise. Performance is identical for the two patches... > > > BTW, the tests are run on latest linus tree(5.4-rc7) in KVM guest, and the > > > fio test is created for simulating one real performance report which is > > > proved to be caused by frequent aio submission thread migration. > > > > What is the underlying hardware? I'm running in a 16p KVM guest on a > > 16p/32t x86-64 using 5.4-rc7, and I don't observe any significant > > CPU migration occurring at all from your test workload. > > It is a KVM guest, which is running on my Lenova T460p Fedora 29 laptop, > and the host kernel is 5.2.18-100.fc29.x86_64, follows the guest info: Ok, so what are all the custom distro kernel tunings that userspace does for the kernel? > [root@ktest-01 ~]# lscpu > Architecture: x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 8 > On-line CPU(s) list: 0-7 > Thread(s) per core: 1 > Core(s) per socket: 4 > Socket(s): 2 > NUMA node(s): 2 Curious. You've configured it as two CPU sockets. If you make it a single socket, do your delay problems go away? The snippet of trace output you showed indicated it bouncing around CPUs on a single node (cpus 0-3), so maybe it has something to do with way the scheduler is interacting with non-zero NUMA distances... 
> Vendor ID: GenuineIntel > CPU family: 6 > Model: 94 > Model name: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz > Stepping: 3 > CPU MHz: 2712.000 > BogoMIPS: 5424.00 > Virtualization: VT-x > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache: 4096K > L3 cache: 16384K > NUMA node0 CPU(s): 0-3 > NUMA node1 CPU(s): 4-7 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmxp That seems like a very minimal set of CPU flags - looks like you are not actually passing the actual host CPU capabilities through to the guest. That means it will be doing the slowest, most generic spectre/meltdown mitigations, right? Also, shouldn't lscpu be telling us all the CPU bug mitigations in place? From my test system: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 40 bits physical, 48 bits virtual CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 1 Socket(s): 16 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 45 Model name: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz Stepping: 7 CPU MHz: 2199.998 BogoMIPS: 4399.99 Virtualization: VT-x Hypervisor vendor: KVM Virtualization type: full L1d cache: 512 KiB L1i cache: 512 KiB L2 cache: 64 MiB L3 cache: 256 MiB NUMA node0 CPU(s): 0-15 Vulnerability L1tf: Mitigation; PTE Inversion; VMX flush not necessary, SMT disabled Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown Vulnerability Meltdown: Vulnerable Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp l m constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm cpuid_fault ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpi d tsc_adjust xsaveopt arat umip md_clear arch_capabilities So, to rule out that it has something to do with kernel config, I just ran up a kernel built with your config.gz, and the problem does not manifest. The only difference was a few drivers I needed to boot my test VMs, and I was previously not using paravirt spinlocks. So, I still can't reproduce the problem. Indeed, the workload gets nowhere near single CPU bound with your config - it's using half the CPU for the same performance: %Cpu2 : 19.8 us, 28.2 sy, 0.0 ni, 0.0 id, 52.0 wa, 0.0 hi, 0.0 %si, 0.0 st Basically, it's spending half it's time waiting on IO. If I wind the delay down to 1000ns: %Cpu1 : 42.2 us, 42.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 15.6 %si, 0.0 st it spends an awful lot of time in soft-interrupt, but is back to being CPU bound. Despite this, I still don't see any significant amount of task migration. In fact, I see a lot less with your kernel config that I do with my original kernel config, because the CPU load was far lower. > Just run a quick test several times after applying the above patch, and looks it > does make a big difference in test './xfs_complete 512' wrt. fio io thread migration. There's something very different about your system, and it doesn't appear to be a result of the kernel code itself. 
I think you're going to have to do all the testing at the moment, Ming, because it's clear that my test systems do not show up the problems even when using the same kernel config as you do... If you reconfig you kvm setup to pass all the native host side cpu flags through to the guest, does the problem go away? I think adding "-cpu host" to your qemu command line will do that... Cheers, Dave.
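The two qemu-side changes suggested here (pass the host CPU model through, and drop the two-socket topology) would look roughly like the sketch below; the exact invocation depends on whether the guest is normally started through libvirt or qemu directly:

# Sketch only: expose the host CPU flags to the guest and present a
# single-socket topology so the guest scheduler sees one NUMA node.
qemu-system-x86_64 -enable-kvm \
    -cpu host \
    -smp 8,sockets=1,cores=8,threads=1 \
    ...    # remaining memory/disk/net options unchanged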
On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > Hi Dave, > > > > On Fri, Nov 15, 2019 at 10:54:15AM +1100, Dave Chinner wrote: > > > On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote: > > > > Hi Guys, > > > > > > > > It is found that single AIO thread is migrated crazely by scheduler, and > > > > the migrate period can be < 10ms. Follows the test a): > > > > > > > > - run single job fio[1] for 30 seconds: > > > > ./xfs_complete 512 > > > > > > > > - observe fio io thread migration via bcc trace[2], and the migration > > > > times can reach 5k ~ 10K in above test. In this test, CPU utilization > > > > is 30~40% on the CPU running fio IO thread. > > > > > > Using the default scheduler tunings: > > > > > > kernel.sched_wakeup_granularity_ns = 4000000 > > > kernel.sched_min_granularity_ns = 3000000 > > > > > > I'm not seeing any migrations at all on a 16p x86-64 box. Even with > > > the tunings you suggest: > > > > > > sysctl kernel.sched_min_granularity_ns=10000000 > > > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > > > > > There are no migrations at all. > > > > Looks I forget to pass $BS to the fio command line in the script posted, > > please try the following script again and run './xfs_complete 512' first. > > So I ran 4kB IOs instead of 512 byte IOs. Shouldn't make any > difference, really - it'll still be CPU bound... In 512 block size test, the CPU utilization of fio IO thread is reduced to 40%, which is more like IO bound. > > <snip script> > > > In my test just done, the migration count is 12K in 30s fio running. > > Sometimes the number can be quite less, but most of times, the number > > is very big(> 5k). > > With my iomap-dio-overwrite patch and 512 byte IOs: > > $ sudo trace-cmd show |grep sched_migrate_task |wc -l > 112 > $ sudo trace-cmd show |grep sched_migrate_task |grep fio |wc -l > 22 > > Without the iomap-dio-overwrite patch: > > $ sudo trace-cmd show |grep sched_migrate_task |wc -l > 99 > $ sudo trace-cmd show |grep sched_migrate_task |grep fio |wc -l > 9 > $ > > There are -less- migrations when using the workqueue for everything. > But it's so low in either case that it's just noise. > > Performance is identical for the two patches... I can reproduce the issue with 4k block size on another RH system, and the login info of that system has been shared to you in RH BZ. 1) sysctl kernel.sched_min_granularity_ns=10000000 sysctl kernel.sched_wakeup_granularity_ns=15000000 2) ./xfs_complete 4k Then you should see 1k~1.5k fio io thread migration in above test, either v5.4-rc7(build with rhel8 config) or RHEL 4.18 kernel. Not reproduced the issue with 512 block size on the RH system yet, maybe it is related with my kernel config. > > > > > BTW, the tests are run on latest linus tree(5.4-rc7) in KVM guest, and the > > > > fio test is created for simulating one real performance report which is > > > > proved to be caused by frequent aio submission thread migration. > > > > > > What is the underlying hardware? I'm running in a 16p KVM guest on a > > > 16p/32t x86-64 using 5.4-rc7, and I don't observe any significant > > > CPU migration occurring at all from your test workload. > > > > It is a KVM guest, which is running on my Lenova T460p Fedora 29 laptop, > > and the host kernel is 5.2.18-100.fc29.x86_64, follows the guest info: > > Ok, so what are all the custom distro kernel tunings that userspace > does for the kernel? It is standard Fedora 29. 
> > > [root@ktest-01 ~]# lscpu > > Architecture: x86_64 > > CPU op-mode(s): 32-bit, 64-bit > > Byte Order: Little Endian > > CPU(s): 8 > > On-line CPU(s) list: 0-7 > > Thread(s) per core: 1 > > Core(s) per socket: 4 > > Socket(s): 2 > > NUMA node(s): 2 > > Curious. You've configured it as two CPU sockets. If you make it a > single socket, do your delay problems go away? The snippet of trace > output you showed indicated it bouncing around CPUs on a single node > (cpus 0-3), so maybe it has something to do with way the scheduler > is interacting with non-zero NUMA distances... I don't see that is a problem wrt. this issue, given the issue can be reproduced on other system too. > > > Vendor ID: GenuineIntel > > CPU family: 6 > > Model: 94 > > Model name: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz > > Stepping: 3 > > CPU MHz: 2712.000 > > BogoMIPS: 5424.00 > > Virtualization: VT-x > > Hypervisor vendor: KVM > > Virtualization type: full > > L1d cache: 32K > > L1i cache: 32K > > L2 cache: 4096K > > L3 cache: 16384K > > NUMA node0 CPU(s): 0-3 > > NUMA node1 CPU(s): 4-7 > > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmxp > > That seems like a very minimal set of CPU flags - looks like you are > not actually passing the actual host CPU capabilities through to the > guest. That means it will be doing the slowest, most generic > spectre/meltdown mitigations, right? The above line is just trunated by the console terminal. > > Also, shouldn't lscpu be telling us all the CPU bug mitigations in > place? > > From my test system: > > Architecture: x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > Address sizes: 40 bits physical, 48 bits virtual > CPU(s): 16 > On-line CPU(s) list: 0-15 > Thread(s) per core: 1 > Core(s) per socket: 1 > Socket(s): 16 > NUMA node(s): 1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 45 > Model name: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz > Stepping: 7 > CPU MHz: 2199.998 > BogoMIPS: 4399.99 > Virtualization: VT-x > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 512 KiB > L1i cache: 512 KiB > L2 cache: 64 MiB > L3 cache: 256 MiB > NUMA node0 CPU(s): 0-15 > Vulnerability L1tf: Mitigation; PTE Inversion; VMX flush not necessary, SMT disabled > Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown > Vulnerability Meltdown: Vulnerable > Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp > Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization > Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp l > m constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 cx16 pcid sse4_1 sse4_2 x2apic > popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm cpuid_fault ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpi > d tsc_adjust xsaveopt arat umip md_clear arch_capabilities > > So, to rule out that it has something to do with kernel config, > I just ran up a kernel built with your config.gz, and the problem > does not manifest. The only difference was a few drivers I needed to > boot my test VMs, and I was previously not using paravirt spinlocks. > > So, I still can't reproduce the problem. 
Indeed, the workload gets > nowhere near single CPU bound with your config - it's using half the > CPU for the same performance: > > %Cpu2 : 19.8 us, 28.2 sy, 0.0 ni, 0.0 id, 52.0 wa, 0.0 hi, 0.0 %si, 0.0 st > > Basically, it's spending half it's time waiting on IO. If I wind the > delay down to 1000ns: > > %Cpu1 : 42.2 us, 42.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 15.6 %si, 0.0 st > > it spends an awful lot of time in soft-interrupt, but is back to > being CPU bound. > > Despite this, I still don't see any significant amount of task > migration. In fact, I see a lot less with your kernel config that I > do with my original kernel config, because the CPU load was far > lower. > > > Just run a quick test several times after applying the above patch, and looks it > > does make a big difference in test './xfs_complete 512' wrt. fio io thread migration. > > There's something very different about your system, and it doesn't > appear to be a result of the kernel code itself. I think you're > going to have to do all the testing at the moment, Ming, because > it's clear that my test systems do not show up the problems even > when using the same kernel config as you do... > > If you reconfig you kvm setup to pass all the native host side cpu > flags through to the guest, does the problem go away? I think adding > "-cpu host" to your qemu command line will do that... Please login to the RH system I shared to you, and you will see the issue. Thanks, Ming
On Fri, Nov 15, 2019 at 08:09:25AM +0800, Ming Lei wrote:
> On Thu, Nov 14, 2019 at 02:14:34PM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote:
> > > Hi Guys,
> > >
> > > It is found that single AIO thread is migrated crazely by scheduler, and
> > > the migrate period can be < 10ms. Follows the test a):
> >
> > What does crazy mean? Does it cycle through the L3 mask?
>
> The single AIO thread is migrated once every few milliseconds.

Today I found that the migration rate of the single fio IO thread can reach
11~12K/sec when I run './xfs_complete 512' on another real machine (single
NUMA node, 8 cores). That number is very close to the IOPS of the test,
i.e. the fio IO thread can be migrated once for every IO completed on the
scsi_debug device.

Thanks,
Ming
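A rough way to check the "one migration per IO" observation is to record the migration events for a whole run, divide by the runtime, and compare against the IOPS fio reports. A sketch using trace-cmd record around the xfs_complete script from earlier in the thread:

# Capture sched_migrate_task events for the duration of the run, count
# the fio ones, and compare the per-second rate with fio's write IOPS.
trace-cmd record -e sched:sched_migrate_task ./xfs_complete 512
trace-cmd report | grep -c 'comm=fio'    # total fio migrations
# divide by the 30s runtime and compare with the IOPS fio printed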
On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > I can reproduce the issue with 4k block size on another RH system, and > the login info of that system has been shared to you in RH BZ. > > 1) > sysctl kernel.sched_min_granularity_ns=10000000 > sysctl kernel.sched_wakeup_granularity_ns=15000000 So, these settings definitely influence behaviour. If these are set to kernel defaults (4ms and 3ms each): sysctl kernel.sched_min_granularity_ns=4000000 sysctl kernel.sched_wakeup_granularity_ns=3000000 The migration problem largely goes away - the fio task migration event count goes from ~2,000 a run down to 200/run. That indicates that the migration trigger is likely load/timing based. The analysis below is based on the 10/15ms numbers above, because it makes it so much easier to reproduce. > 2) > ./xfs_complete 4k > > Then you should see 1k~1.5k fio io thread migration in above test, > either v5.4-rc7(build with rhel8 config) or RHEL 4.18 kernel. Almost all the fio task migrations are coming from migration/X kernel threads. i.e it's the scheduler active balancing that is causing the fio thread to bounce around. This is typical a typical trace, trimmed to remove extraneous noise. The fio process is running on CPU 10: fio-3185 [010] 50419.285954: sched_stat_runtime: comm=fio pid=3185 runtime=1004014 [ns] vruntime=27067882290 [ns] fio-3185 [010] 50419.286953: sched_stat_runtime: comm=fio pid=3185 runtime=979458 [ns] vruntime=27068861748 [ns] fio-3185 [010] 50419.287998: sched_stat_runtime: comm=fio pid=3185 runtime=1028471 [ns] vruntime=27069890219 [ns] fio-3185 [010] 50419.289973: sched_stat_runtime: comm=fio pid=3185 runtime=989989 [ns] vruntime=27071836208 [ns] fio-3185 [010] 50419.290958: sched_stat_runtime: comm=fio pid=3185 runtime=963914 [ns] vruntime=27072800122 [ns] fio-3185 [010] 50419.291952: sched_stat_runtime: comm=fio pid=3185 runtime=972532 [ns] vruntime=27073772654 [ns] fio consumes CPU for several milliseconds, then: fio-3185 [010] 50419.292935: sched_stat_runtime: comm=fio pid=3185 runtime=966032 [ns] vruntime=27074738686 [ns] fio-3185 [010] 50419.292941: sched_switch: fio:3185 [120] S ==> kworker/10:0:2763 [120] kworker/10:0-2763 [010] 50419.292954: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=13423 [ns] vruntime=27052479694 [ns] kworker/10:0-2763 [010] 50419.292956: sched_switch: kworker/10:0:2763 [120] R ==> fio:3185 [120] fio-3185 [010] 50419.293115: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 fio-3185 [010] 50419.293116: sched_stat_runtime: comm=fio pid=3185 runtime=160370 [ns] vruntime=27074899056 [ns] fio-3185 [010] 50419.293118: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 A context switch out to a kworker, then 13us later we immediately switch back to the fio process, and go on running. No doubt somewhere in what the fio process is doing, we queue up more work to be run on the cpu, but the fio task keeps running (due to CONFIG_PREEMPT=n). 
fio-3185 [010] 50419.293934: sched_stat_runtime: comm=fio pid=3185 runtime=803135 [ns] vruntime=27075702191 [ns] fio-3185 [010] 50419.294936: sched_stat_runtime: comm=fio pid=3185 runtime=988478 [ns] vruntime=27076690669 [ns] fio-3185 [010] 50419.295934: sched_stat_runtime: comm=fio pid=3185 runtime=982219 [ns] vruntime=27077672888 [ns] fio-3185 [010] 50419.296935: sched_stat_runtime: comm=fio pid=3185 runtime=984781 [ns] vruntime=27078657669 [ns] fio-3185 [010] 50419.297934: sched_stat_runtime: comm=fio pid=3185 runtime=981703 [ns] vruntime=27079639372 [ns] fio-3185 [010] 50419.298937: sched_stat_runtime: comm=fio pid=3185 runtime=990057 [ns] vruntime=27080629429 [ns] fio-3185 [010] 50419.299935: sched_stat_runtime: comm=fio pid=3185 runtime=977554 [ns] vruntime=27081606983 [ns] About 6ms later, CPU 0 kicks the active load balancer on CPU 10... <idle>-0 [000] 50419.300014: sched_waking: comm=migration/10 pid=70 prio=0 target_cpu=010 fio-3185 [010] 50419.300024: sched_wakeup: migration/10:70 [0] success=1 CPU:010 fio-3185 [010] 50419.300026: sched_stat_runtime: comm=fio pid=3185 runtime=79291 [ns] vruntime=27081686274 [ns] fio-3185 [010] 50419.300027: sched_switch: fio:3185 [120] S ==> migration/10:70 [0] migration/10-70 [010] 50419.300032: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=10 dest_cpu=12 migration/10-70 [010] 50419.300040: sched_switch: migration/10:70 [0] D ==> kworker/10:0:2763 [120] And 10us later the fio process is switched away, the active load balancer work is run and migrates the fio process to CPU 12. Then... kworker/10:0-2763 [010] 50419.300048: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=9252 [ns] vruntime=27062908308 [ns] kworker/10:0-2763 [010] 50419.300062: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] <idle>-0 [010] 50419.300067: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 <idle>-0 [010] 50419.300069: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 <idle>-0 [010] 50419.300071: sched_switch: swapper/10:0 [120] S ==> kworker/10:0:2763 [120] kworker/10:0-2763 [010] 50419.300073: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] The kworker runs for another 10us and the CPU goes idle. Shortly after this, CPU 12 is woken: <idle>-0 [012] 50419.300113: sched_switch: swapper/12:0 [120] S ==> fio:3185 [120] fio-3185 [012] 50419.300596: sched_waking: comm=kworker/12:1 pid=227 prio=120 target_cpu=012 fio-3185 [012] 50419.300598: sched_stat_runtime: comm=fio pid=3185 runtime=561137 [ns] vruntime=20361153275 [ns] fio-3185 [012] 50419.300936: sched_stat_runtime: comm=fio pid=3185 runtime=326187 [ns] vruntime=20361479462 [ns] fio-3185 [012] 50419.301935: sched_stat_runtime: comm=fio pid=3185 runtime=981201 [ns] vruntime=20362460663 [ns] fio-3185 [012] 50419.302935: sched_stat_runtime: comm=fio pid=3185 runtime=983160 [ns] vruntime=20363443823 [ns] fio-3185 [012] 50419.303934: sched_stat_runtime: comm=fio pid=3185 runtime=983855 [ns] vruntime=20364427678 [ns] fio-3185 [012] 50419.304934: sched_stat_runtime: comm=fio pid=3185 runtime=977757 [ns] vruntime=20365405435 [ns] fio-3185 [012] 50419.305948: sched_stat_runtime: comm=fio pid=3185 runtime=999563 [ns] vruntime=20366404998 [ns] and fio goes on running there. 
The pattern repeats very soon afterwards: <idle>-0 [000] 50419.314982: sched_waking: comm=migration/12 pid=82 prio=0 target_cpu=012 fio-3185 [012] 50419.314988: sched_wakeup: migration/12:82 [0] success=1 CPU:012 fio-3185 [012] 50419.314990: sched_stat_runtime: comm=fio pid=3185 runtime=46342 [ns] vruntime=20375268656 [ns] fio-3185 [012] 50419.314991: sched_switch: fio:3185 [120] S ==> migration/12:82 [0] migration/12-82 [012] 50419.314995: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=12 dest_cpu=5 migration/12-82 [012] 50419.315001: sched_switch: migration/12:82 [0] D ==> kworker/12:1:227 [120] kworker/12:1-227 [012] 50419.315022: sched_stat_runtime: comm=kworker/12:1 pid=227 runtime=21453 [ns] vruntime=20359477889 [ns] kworker/12:1-227 [012] 50419.315028: sched_switch: kworker/12:1:227 [120] R ==> swapper/12:0 [120] <idle>-0 [005] 50419.315053: sched_switch: swapper/5:0 [120] S ==> fio:3185 [120] fio-3185 [005] 50419.315286: sched_waking: comm=kworker/5:0 pid=2646 prio=120 target_cpu=005 fio-3185 [005] 50419.315288: sched_stat_runtime: comm=fio pid=3185 runtime=287737 [ns] vruntime=33779011507 [ns] And fio is now running on CPU 5 - it only ran on CPU 12 for about 15ms. Hmmm: $ grep fio-3185 ~/tmp/sched.out | awk 'BEGIN {totcpu = 0.0; switches = 0.0; prev_waket = 0.0 }/sched_waking/ { cpu = $2; split($3, t, ":"); waket = t[1]; if (cpu != prev_cpu) { t_on_cpu = waket - prev_waket; if (prev_waket) { print "time on CPU", cpu, "was", t_on_cpu; totcpu += t_on_cpu; switches++ } prev_waket = waket; prev_cpu = cpu; } } END { print "switches", switches, "time on cpu", totcpu, "aver time on cpu", (totcpu / switches) } ' | stats --trim-outliers switches 2211 time on cpu 30.0994 aver time on cpu 0.0136135 time on CPU [0-23(8.8823+/-6.2)] was 0.000331-0.330772(0.0134759+/-0.012) Yeah, the fio task averages 13.4ms on any given CPU before being switched to another CPU. Mind you, the stddev is 12ms, so the range of how long it spends on any one CPU is pretty wide (330us to 330ms). IOWs, this doesn't look like a workqueue problem at all - this looks like the scheduler is repeatedly making the wrong load balancing decisions when mixing a very short runtime task (queued work) with a long runtime task on the same CPU.... This is not my area of expertise, so I have no idea why this might be happening. Scheduler experts: is this expected behaviour? What tunables directly influence the active load balancer (and/or CONFIG options) to change how aggressive it's behaviour is? > Not reproduced the issue with 512 block size on the RH system yet, > maybe it is related with my kernel config. I doubt it - this looks like a load specific corner case in the scheduling algorithm.... Cheers, Dave.
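On the tunables question: besides the two granularity knobs already used in this thread, a couple of other scheduler sysctls also influence how willing the load balancer is to move tasks. These were plain sysctls in 5.4 when CONFIG_SCHED_DEBUG is enabled; this is a starting point to inspect, not a definitive list:

# Knobs already used in this thread:
sysctl kernel.sched_min_granularity_ns      # 10000000 in the failing setup
sysctl kernel.sched_wakeup_granularity_ns   # 15000000 in the failing setup

# Other balancer-related sysctls worth inspecting:
sysctl kernel.sched_migration_cost_ns       # how long a task is considered cache-hot
sysctl kernel.sched_nr_migrate              # tasks examined per balance pass

# The defaults that made the migrations largely go away:
sysctl kernel.sched_min_granularity_ns=4000000
sysctl kernel.sched_wakeup_granularity_ns=3000000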
On Sat, Nov 16, 2019 at 10:40:05AM +1100, Dave Chinner wrote:
> On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote:
> > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote:
> > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote:
> > I can reproduce the issue with 4k block size on another RH system, and
> > the login info of that system has been shared to you in RH BZ.
> >
> > 1)
> > sysctl kernel.sched_min_granularity_ns=10000000
> > sysctl kernel.sched_wakeup_granularity_ns=15000000
>
> So, these settings definitely influence behaviour.
>
> If these are set to kernel defaults (4ms and 3ms each):
>
> sysctl kernel.sched_min_granularity_ns=4000000
> sysctl kernel.sched_wakeup_granularity_ns=3000000
>
> The migration problem largely goes away - the fio task migration
> event count goes from ~2,000 a run down to 200/run.
>
> That indicates that the migration trigger is likely load/timing
> based. The analysis below is based on the 10/15ms numbers above,
> because it makes it so much easier to reproduce.

On another machine, the fio thread in './xfs_complete 512' can be migrated
11~12K times/sec without changing the two sched defaults above, yet the fio
IO thread only takes 40% CPU. With './xfs_complete 4k' on that machine, the
fio IO CPU utilization is >= 98%.

Thanks,
Ming
+Vincent On Sat, Nov 16, 2019 at 10:40:05AM +1100, Dave Chinner wrote: > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > I can reproduce the issue with 4k block size on another RH system, and > > the login info of that system has been shared to you in RH BZ. > > > > 1) > > sysctl kernel.sched_min_granularity_ns=10000000 > > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > So, these settings definitely influence behaviour. > > If these are set to kernel defaults (4ms and 3ms each): > > sysctl kernel.sched_min_granularity_ns=4000000 > sysctl kernel.sched_wakeup_granularity_ns=3000000 > > The migration problem largely goes away - the fio task migration > event count goes from ~2,000 a run down to 200/run. > > That indicates that the migration trigger is likely load/timing > based. The analysis below is based on the 10/15ms numbers above, > because it makes it so much easier to reproduce. > > > 2) > > ./xfs_complete 4k > > > > Then you should see 1k~1.5k fio io thread migration in above test, > > either v5.4-rc7(build with rhel8 config) or RHEL 4.18 kernel. > > Almost all the fio task migrations are coming from migration/X > kernel threads. i.e it's the scheduler active balancing that is > causing the fio thread to bounce around. > > This is typical a typical trace, trimmed to remove extraneous noise. > The fio process is running on CPU 10: > > fio-3185 [010] 50419.285954: sched_stat_runtime: comm=fio pid=3185 runtime=1004014 [ns] vruntime=27067882290 [ns] > fio-3185 [010] 50419.286953: sched_stat_runtime: comm=fio pid=3185 runtime=979458 [ns] vruntime=27068861748 [ns] > fio-3185 [010] 50419.287998: sched_stat_runtime: comm=fio pid=3185 runtime=1028471 [ns] vruntime=27069890219 [ns] > fio-3185 [010] 50419.289973: sched_stat_runtime: comm=fio pid=3185 runtime=989989 [ns] vruntime=27071836208 [ns] > fio-3185 [010] 50419.290958: sched_stat_runtime: comm=fio pid=3185 runtime=963914 [ns] vruntime=27072800122 [ns] > fio-3185 [010] 50419.291952: sched_stat_runtime: comm=fio pid=3185 runtime=972532 [ns] vruntime=27073772654 [ns] > > fio consumes CPU for several milliseconds, then: > > fio-3185 [010] 50419.292935: sched_stat_runtime: comm=fio pid=3185 runtime=966032 [ns] vruntime=27074738686 [ns] > fio-3185 [010] 50419.292941: sched_switch: fio:3185 [120] S ==> kworker/10:0:2763 [120] > kworker/10:0-2763 [010] 50419.292954: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=13423 [ns] vruntime=27052479694 [ns] > kworker/10:0-2763 [010] 50419.292956: sched_switch: kworker/10:0:2763 [120] R ==> fio:3185 [120] > fio-3185 [010] 50419.293115: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > fio-3185 [010] 50419.293116: sched_stat_runtime: comm=fio pid=3185 runtime=160370 [ns] vruntime=27074899056 [ns] > fio-3185 [010] 50419.293118: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > A context switch out to a kworker, then 13us later we immediately > switch back to the fio process, and go on running. No doubt > somewhere in what the fio process is doing, we queue up more work to > be run on the cpu, but the fio task keeps running > (due to CONFIG_PREEMPT=n). 
> > fio-3185 [010] 50419.293934: sched_stat_runtime: comm=fio pid=3185 runtime=803135 [ns] vruntime=27075702191 [ns] > fio-3185 [010] 50419.294936: sched_stat_runtime: comm=fio pid=3185 runtime=988478 [ns] vruntime=27076690669 [ns] > fio-3185 [010] 50419.295934: sched_stat_runtime: comm=fio pid=3185 runtime=982219 [ns] vruntime=27077672888 [ns] > fio-3185 [010] 50419.296935: sched_stat_runtime: comm=fio pid=3185 runtime=984781 [ns] vruntime=27078657669 [ns] > fio-3185 [010] 50419.297934: sched_stat_runtime: comm=fio pid=3185 runtime=981703 [ns] vruntime=27079639372 [ns] > fio-3185 [010] 50419.298937: sched_stat_runtime: comm=fio pid=3185 runtime=990057 [ns] vruntime=27080629429 [ns] > fio-3185 [010] 50419.299935: sched_stat_runtime: comm=fio pid=3185 runtime=977554 [ns] vruntime=27081606983 [ns] > > About 6ms later, CPU 0 kicks the active load balancer on CPU 10... > > <idle>-0 [000] 50419.300014: sched_waking: comm=migration/10 pid=70 prio=0 target_cpu=010 > fio-3185 [010] 50419.300024: sched_wakeup: migration/10:70 [0] success=1 CPU:010 > fio-3185 [010] 50419.300026: sched_stat_runtime: comm=fio pid=3185 runtime=79291 [ns] vruntime=27081686274 [ns] > fio-3185 [010] 50419.300027: sched_switch: fio:3185 [120] S ==> migration/10:70 [0] > migration/10-70 [010] 50419.300032: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=10 dest_cpu=12 > migration/10-70 [010] 50419.300040: sched_switch: migration/10:70 [0] D ==> kworker/10:0:2763 [120] > > And 10us later the fio process is switched away, the active load > balancer work is run and migrates the fio process to CPU 12. Then... > > kworker/10:0-2763 [010] 50419.300048: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=9252 [ns] vruntime=27062908308 [ns] > kworker/10:0-2763 [010] 50419.300062: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > <idle>-0 [010] 50419.300067: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > <idle>-0 [010] 50419.300069: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > <idle>-0 [010] 50419.300071: sched_switch: swapper/10:0 [120] S ==> kworker/10:0:2763 [120] > kworker/10:0-2763 [010] 50419.300073: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > The kworker runs for another 10us and the CPU goes idle. Shortly > after this, CPU 12 is woken: > > <idle>-0 [012] 50419.300113: sched_switch: swapper/12:0 [120] S ==> fio:3185 [120] > fio-3185 [012] 50419.300596: sched_waking: comm=kworker/12:1 pid=227 prio=120 target_cpu=012 > fio-3185 [012] 50419.300598: sched_stat_runtime: comm=fio pid=3185 runtime=561137 [ns] vruntime=20361153275 [ns] > fio-3185 [012] 50419.300936: sched_stat_runtime: comm=fio pid=3185 runtime=326187 [ns] vruntime=20361479462 [ns] > fio-3185 [012] 50419.301935: sched_stat_runtime: comm=fio pid=3185 runtime=981201 [ns] vruntime=20362460663 [ns] > fio-3185 [012] 50419.302935: sched_stat_runtime: comm=fio pid=3185 runtime=983160 [ns] vruntime=20363443823 [ns] > fio-3185 [012] 50419.303934: sched_stat_runtime: comm=fio pid=3185 runtime=983855 [ns] vruntime=20364427678 [ns] > fio-3185 [012] 50419.304934: sched_stat_runtime: comm=fio pid=3185 runtime=977757 [ns] vruntime=20365405435 [ns] > fio-3185 [012] 50419.305948: sched_stat_runtime: comm=fio pid=3185 runtime=999563 [ns] vruntime=20366404998 [ns] > > > and fio goes on running there. 
The pattern repeats very soon afterwards: > > <idle>-0 [000] 50419.314982: sched_waking: comm=migration/12 pid=82 prio=0 target_cpu=012 > fio-3185 [012] 50419.314988: sched_wakeup: migration/12:82 [0] success=1 CPU:012 > fio-3185 [012] 50419.314990: sched_stat_runtime: comm=fio pid=3185 runtime=46342 [ns] vruntime=20375268656 [ns] > fio-3185 [012] 50419.314991: sched_switch: fio:3185 [120] S ==> migration/12:82 [0] > migration/12-82 [012] 50419.314995: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=12 dest_cpu=5 > migration/12-82 [012] 50419.315001: sched_switch: migration/12:82 [0] D ==> kworker/12:1:227 [120] > kworker/12:1-227 [012] 50419.315022: sched_stat_runtime: comm=kworker/12:1 pid=227 runtime=21453 [ns] vruntime=20359477889 [ns] > kworker/12:1-227 [012] 50419.315028: sched_switch: kworker/12:1:227 [120] R ==> swapper/12:0 [120] > <idle>-0 [005] 50419.315053: sched_switch: swapper/5:0 [120] S ==> fio:3185 [120] > fio-3185 [005] 50419.315286: sched_waking: comm=kworker/5:0 pid=2646 prio=120 target_cpu=005 > fio-3185 [005] 50419.315288: sched_stat_runtime: comm=fio pid=3185 runtime=287737 [ns] vruntime=33779011507 [ns] > > And fio is now running on CPU 5 - it only ran on CPU 12 for about > 15ms. Hmmm: > > $ grep fio-3185 ~/tmp/sched.out | awk 'BEGIN {totcpu = 0.0; switches = 0.0; prev_waket = 0.0 }/sched_waking/ { cpu = $2; split($3, t, ":"); waket = t[1]; if (cpu != prev_cpu) { t_on_cpu = waket - prev_waket; if (prev_waket) { print "time on CPU", cpu, "was", t_on_cpu; totcpu += t_on_cpu; switches++ } prev_waket = waket; prev_cpu = cpu; } } END { print "switches", switches, "time on cpu", totcpu, "aver time on cpu", (totcpu / switches) } ' | stats --trim-outliers > switches 2211 time on cpu 30.0994 aver time on cpu 0.0136135 > time on CPU [0-23(8.8823+/-6.2)] was 0.000331-0.330772(0.0134759+/-0.012) > > Yeah, the fio task averages 13.4ms on any given CPU before being > switched to another CPU. Mind you, the stddev is 12ms, so the range > of how long it spends on any one CPU is pretty wide (330us to > 330ms). > > IOWs, this doesn't look like a workqueue problem at all - this looks > like the scheduler is repeatedly making the wrong load balancing > decisions when mixing a very short runtime task (queued work) with a > long runtime task on the same CPU.... > > This is not my area of expertise, so I have no idea why this might > be happening. Scheduler experts: is this expected behaviour? What > tunables directly influence the active load balancer (and/or CONFIG > options) to change how aggressive it's behaviour is? We typically only fall back to the active balancer when there is (persistent) imbalance and we fail to migrate anything else (of substance). The tuning mentioned has the effect of less frequent scheduling, IOW, leaving (short) tasks on the runqueue longer. This obviously means the load-balancer will have a bigger chance of seeing them. Now; it's been a while since I looked at the workqueue code but one possible explanation would be if the kworker that picks up the work item is pinned. That would make it runnable but not migratable, the exact situation in which we'll end up shooting the current task with active balance. I'll go see if I can reproduce and stare at the workqueue code a bit. > > > Not reproduced the issue with 512 block size on the RH system yet, > > maybe it is related with my kernel config. > > I doubt it - this looks like a load specific corner case in the > scheduling algorithm.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
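The pinned-kworker hypothesis above can be sanity-checked from userspace: per-CPU kworkers are affine to a single CPU, which shows up in Cpus_allowed_list. A quick sketch (the dio completion workers appear as kworker/N:M threads in the traces earlier in the thread):

# List each kworker's CPU affinity; a per-CPU worker shows a single CPU
# in Cpus_allowed_list, an unbound one shows the full CPU range.
for pid in $(pgrep kworker); do
    printf '%-8s %-20s ' "$pid" "$(cat /proc/$pid/comm)"
    grep Cpus_allowed_list "/proc/$pid/status" 2>/dev/null
done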
On Mon, 18 Nov 2019 at 10:21, Peter Zijlstra <peterz@infradead.org> wrote: > > > +Vincent > > On Sat, Nov 16, 2019 at 10:40:05AM +1100, Dave Chinner wrote: > > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > > I can reproduce the issue with 4k block size on another RH system, and > > > the login info of that system has been shared to you in RH BZ. > > > > > > 1) > > > sysctl kernel.sched_min_granularity_ns=10000000 > > > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > > > So, these settings definitely influence behaviour. > > > > If these are set to kernel defaults (4ms and 3ms each): > > > > sysctl kernel.sched_min_granularity_ns=4000000 > > sysctl kernel.sched_wakeup_granularity_ns=3000000 > > > > The migration problem largely goes away - the fio task migration > > event count goes from ~2,000 a run down to 200/run. > > > > That indicates that the migration trigger is likely load/timing > > based. The analysis below is based on the 10/15ms numbers above, > > because it makes it so much easier to reproduce. > > > > > 2) > > > ./xfs_complete 4k > > > > > > Then you should see 1k~1.5k fio io thread migration in above test, > > > either v5.4-rc7(build with rhel8 config) or RHEL 4.18 kernel. > > > > Almost all the fio task migrations are coming from migration/X > > kernel threads. i.e it's the scheduler active balancing that is > > causing the fio thread to bounce around. > > > > This is typical a typical trace, trimmed to remove extraneous noise. > > The fio process is running on CPU 10: > > > > fio-3185 [010] 50419.285954: sched_stat_runtime: comm=fio pid=3185 runtime=1004014 [ns] vruntime=27067882290 [ns] > > fio-3185 [010] 50419.286953: sched_stat_runtime: comm=fio pid=3185 runtime=979458 [ns] vruntime=27068861748 [ns] > > fio-3185 [010] 50419.287998: sched_stat_runtime: comm=fio pid=3185 runtime=1028471 [ns] vruntime=27069890219 [ns] > > fio-3185 [010] 50419.289973: sched_stat_runtime: comm=fio pid=3185 runtime=989989 [ns] vruntime=27071836208 [ns] > > fio-3185 [010] 50419.290958: sched_stat_runtime: comm=fio pid=3185 runtime=963914 [ns] vruntime=27072800122 [ns] > > fio-3185 [010] 50419.291952: sched_stat_runtime: comm=fio pid=3185 runtime=972532 [ns] vruntime=27073772654 [ns] > > > > fio consumes CPU for several milliseconds, then: > > > > fio-3185 [010] 50419.292935: sched_stat_runtime: comm=fio pid=3185 runtime=966032 [ns] vruntime=27074738686 [ns] > > fio-3185 [010] 50419.292941: sched_switch: fio:3185 [120] S ==> kworker/10:0:2763 [120] > > kworker/10:0-2763 [010] 50419.292954: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=13423 [ns] vruntime=27052479694 [ns] > > kworker/10:0-2763 [010] 50419.292956: sched_switch: kworker/10:0:2763 [120] R ==> fio:3185 [120] > > fio-3185 [010] 50419.293115: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > > fio-3185 [010] 50419.293116: sched_stat_runtime: comm=fio pid=3185 runtime=160370 [ns] vruntime=27074899056 [ns] > > fio-3185 [010] 50419.293118: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > > > A context switch out to a kworker, then 13us later we immediately > > switch back to the fio process, and go on running. No doubt > > somewhere in what the fio process is doing, we queue up more work to > > be run on the cpu, but the fio task keeps running > > (due to CONFIG_PREEMPT=n). 
> > > > fio-3185 [010] 50419.293934: sched_stat_runtime: comm=fio pid=3185 runtime=803135 [ns] vruntime=27075702191 [ns] > > fio-3185 [010] 50419.294936: sched_stat_runtime: comm=fio pid=3185 runtime=988478 [ns] vruntime=27076690669 [ns] > > fio-3185 [010] 50419.295934: sched_stat_runtime: comm=fio pid=3185 runtime=982219 [ns] vruntime=27077672888 [ns] > > fio-3185 [010] 50419.296935: sched_stat_runtime: comm=fio pid=3185 runtime=984781 [ns] vruntime=27078657669 [ns] > > fio-3185 [010] 50419.297934: sched_stat_runtime: comm=fio pid=3185 runtime=981703 [ns] vruntime=27079639372 [ns] > > fio-3185 [010] 50419.298937: sched_stat_runtime: comm=fio pid=3185 runtime=990057 [ns] vruntime=27080629429 [ns] > > fio-3185 [010] 50419.299935: sched_stat_runtime: comm=fio pid=3185 runtime=977554 [ns] vruntime=27081606983 [ns] > > > > About 6ms later, CPU 0 kicks the active load balancer on CPU 10... > > > > <idle>-0 [000] 50419.300014: sched_waking: comm=migration/10 pid=70 prio=0 target_cpu=010 > > fio-3185 [010] 50419.300024: sched_wakeup: migration/10:70 [0] success=1 CPU:010 > > fio-3185 [010] 50419.300026: sched_stat_runtime: comm=fio pid=3185 runtime=79291 [ns] vruntime=27081686274 [ns] > > fio-3185 [010] 50419.300027: sched_switch: fio:3185 [120] S ==> migration/10:70 [0] > > migration/10-70 [010] 50419.300032: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=10 dest_cpu=12 > > migration/10-70 [010] 50419.300040: sched_switch: migration/10:70 [0] D ==> kworker/10:0:2763 [120] > > > > And 10us later the fio process is switched away, the active load > > balancer work is run and migrates the fio process to CPU 12. Then... > > > > kworker/10:0-2763 [010] 50419.300048: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=9252 [ns] vruntime=27062908308 [ns] > > kworker/10:0-2763 [010] 50419.300062: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > <idle>-0 [010] 50419.300067: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > > <idle>-0 [010] 50419.300069: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > <idle>-0 [010] 50419.300071: sched_switch: swapper/10:0 [120] S ==> kworker/10:0:2763 [120] > > kworker/10:0-2763 [010] 50419.300073: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > > > The kworker runs for another 10us and the CPU goes idle. Shortly > > after this, CPU 12 is woken: > > > > <idle>-0 [012] 50419.300113: sched_switch: swapper/12:0 [120] S ==> fio:3185 [120] > > fio-3185 [012] 50419.300596: sched_waking: comm=kworker/12:1 pid=227 prio=120 target_cpu=012 > > fio-3185 [012] 50419.300598: sched_stat_runtime: comm=fio pid=3185 runtime=561137 [ns] vruntime=20361153275 [ns] > > fio-3185 [012] 50419.300936: sched_stat_runtime: comm=fio pid=3185 runtime=326187 [ns] vruntime=20361479462 [ns] > > fio-3185 [012] 50419.301935: sched_stat_runtime: comm=fio pid=3185 runtime=981201 [ns] vruntime=20362460663 [ns] > > fio-3185 [012] 50419.302935: sched_stat_runtime: comm=fio pid=3185 runtime=983160 [ns] vruntime=20363443823 [ns] > > fio-3185 [012] 50419.303934: sched_stat_runtime: comm=fio pid=3185 runtime=983855 [ns] vruntime=20364427678 [ns] > > fio-3185 [012] 50419.304934: sched_stat_runtime: comm=fio pid=3185 runtime=977757 [ns] vruntime=20365405435 [ns] > > fio-3185 [012] 50419.305948: sched_stat_runtime: comm=fio pid=3185 runtime=999563 [ns] vruntime=20366404998 [ns] > > > > > > and fio goes on running there. 
The pattern repeats very soon afterwards: > > > > <idle>-0 [000] 50419.314982: sched_waking: comm=migration/12 pid=82 prio=0 target_cpu=012 > > fio-3185 [012] 50419.314988: sched_wakeup: migration/12:82 [0] success=1 CPU:012 > > fio-3185 [012] 50419.314990: sched_stat_runtime: comm=fio pid=3185 runtime=46342 [ns] vruntime=20375268656 [ns] > > fio-3185 [012] 50419.314991: sched_switch: fio:3185 [120] S ==> migration/12:82 [0] > > migration/12-82 [012] 50419.314995: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=12 dest_cpu=5 > > migration/12-82 [012] 50419.315001: sched_switch: migration/12:82 [0] D ==> kworker/12:1:227 [120] > > kworker/12:1-227 [012] 50419.315022: sched_stat_runtime: comm=kworker/12:1 pid=227 runtime=21453 [ns] vruntime=20359477889 [ns] > > kworker/12:1-227 [012] 50419.315028: sched_switch: kworker/12:1:227 [120] R ==> swapper/12:0 [120] > > <idle>-0 [005] 50419.315053: sched_switch: swapper/5:0 [120] S ==> fio:3185 [120] > > fio-3185 [005] 50419.315286: sched_waking: comm=kworker/5:0 pid=2646 prio=120 target_cpu=005 > > fio-3185 [005] 50419.315288: sched_stat_runtime: comm=fio pid=3185 runtime=287737 [ns] vruntime=33779011507 [ns] > > > > And fio is now running on CPU 5 - it only ran on CPU 12 for about > > 15ms. Hmmm: > > > > $ grep fio-3185 ~/tmp/sched.out | awk 'BEGIN {totcpu = 0.0; switches = 0.0; prev_waket = 0.0 }/sched_waking/ { cpu = $2; split($3, t, ":"); waket = t[1]; if (cpu != prev_cpu) { t_on_cpu = waket - prev_waket; if (prev_waket) { print "time on CPU", cpu, "was", t_on_cpu; totcpu += t_on_cpu; switches++ } prev_waket = waket; prev_cpu = cpu; } } END { print "switches", switches, "time on cpu", totcpu, "aver time on cpu", (totcpu / switches) } ' | stats --trim-outliers > > switches 2211 time on cpu 30.0994 aver time on cpu 0.0136135 > > time on CPU [0-23(8.8823+/-6.2)] was 0.000331-0.330772(0.0134759+/-0.012) > > > > Yeah, the fio task averages 13.4ms on any given CPU before being > > switched to another CPU. Mind you, the stddev is 12ms, so the range > > of how long it spends on any one CPU is pretty wide (330us to > > 330ms). > > > > IOWs, this doesn't look like a workqueue problem at all - this looks > > like the scheduler is repeatedly making the wrong load balancing > > decisions when mixing a very short runtime task (queued work) with a > > long runtime task on the same CPU.... > > > > This is not my area of expertise, so I have no idea why this might > > be happening. Scheduler experts: is this expected behaviour? What > > tunables directly influence the active load balancer (and/or CONFIG > > options) to change how aggressive it's behaviour is? > > We typically only fall back to the active balancer when there is > (persistent) imbalance and we fail to migrate anything else (of > substance). > > The tuning mentioned has the effect of less frequent scheduling, IOW, > leaving (short) tasks on the runqueue longer. This obviously means the > load-balancer will have a bigger chance of seeing them. > > Now; it's been a while since I looked at the workqueue code but one > possible explanation would be if the kworker that picks up the work item > is pinned. That would make it runnable but not migratable, the exact > situation in which we'll end up shooting the current task with active > balance. I agree with your explanation above Although queue_work uses queue_work_on(WORK_CPU_UNBOUND,... 
In the end, it uses a bound workqueue and tries to queue the work on the local CPU first, even if that CPU is already busy. > > I'll go see if I can reproduce and stare at the workqueue code a bit. > > > > > > Not reproduced the issue with 512 block size on the RH system yet, > > > maybe it is related with my kernel config. > > > > I doubt it - this looks like a load specific corner case in the > > scheduling algorithm.... > > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com
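For readers following the workqueue details, here is a minimal sketch of the behaviour being described, i.e. that queue_work() on a per-cpu ("bound") workqueue ends up targeting the CPU the submitter is running on. This is a paraphrase for illustration only, not the actual kernel/workqueue.c source:

#include <linux/smp.h>
#include <linux/workqueue.h>

/*
 * Sketch only.  queue_work(wq, work) is simply
 * queue_work_on(WORK_CPU_UNBOUND, wq, work); for a per-cpu
 * (!WQ_UNBOUND) workqueue, WORK_CPU_UNBOUND just means "no CPU was
 * explicitly requested" and the implementation falls back to the CPU
 * the caller is running on.  The work then sits on that CPU's pool and
 * can only be executed by a kworker pinned to that CPU.
 */
static bool queue_work_local_sketch(struct workqueue_struct *wq,
                                    struct work_struct *work)
{
        int cpu = raw_smp_processor_id();       /* the submitter's CPU */

        /* in effect what queue_work() does for a bound workqueue */
        return queue_work_on(cpu, wq, work);
}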
* Dave Chinner <david@fromorbit.com> [2019-11-16 10:40:05]: > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > I can reproduce the issue with 4k block size on another RH system, and > > the login info of that system has been shared to you in RH BZ. > > > > 1) > > Almost all the fio task migrations are coming from migration/X > kernel threads. i.e it's the scheduler active balancing that is > causing the fio thread to bounce around. > Can we try with the below patch.
On Mon, Nov 18, 2019 at 10:21:21AM +0100, Peter Zijlstra wrote: > On Sat, Nov 16, 2019 at 10:40:05AM +1100, Dave Chinner wrote: > > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > And fio is now running on CPU 5 - it only ran on CPU 12 for about > > 15ms. Hmmm: > > > > $ grep fio-3185 ~/tmp/sched.out | awk 'BEGIN {totcpu = 0.0; switches = 0.0; prev_waket = 0.0 }/sched_waking/ { cpu = $2; split($3, t, ":"); waket = t[1]; if (cpu != prev_cpu) { t_on_cpu = waket - prev_waket; if (prev_waket) { print "time on CPU", cpu, "was", t_on_cpu; totcpu += t_on_cpu; switches++ } prev_waket = waket; prev_cpu = cpu; } } END { print "switches", switches, "time on cpu", totcpu, "aver time on cpu", (totcpu / switches) } ' | stats --trim-outliers > > switches 2211 time on cpu 30.0994 aver time on cpu 0.0136135 > > time on CPU [0-23(8.8823+/-6.2)] was 0.000331-0.330772(0.0134759+/-0.012) > > > > Yeah, the fio task averages 13.4ms on any given CPU before being > > switched to another CPU. Mind you, the stddev is 12ms, so the range > > of how long it spends on any one CPU is pretty wide (330us to > > 330ms). > > > > IOWs, this doesn't look like a workqueue problem at all - this looks > > like the scheduler is repeatedly making the wrong load balancing > > decisions when mixing a very short runtime task (queued work) with a > > long runtime task on the same CPU.... > > > > This is not my area of expertise, so I have no idea why this might > > be happening. Scheduler experts: is this expected behaviour? What > > tunables directly influence the active load balancer (and/or CONFIG > > options) to change how aggressive it's behaviour is? > > We typically only fall back to the active balancer when there is > (persistent) imbalance and we fail to migrate anything else (of > substance). > > The tuning mentioned has the effect of less frequent scheduling, IOW, > leaving (short) tasks on the runqueue longer. This obviously means the > load-balancer will have a bigger chance of seeing them. > > Now; it's been a while since I looked at the workqueue code but one > possible explanation would be if the kworker that picks up the work item > is pinned. That would make it runnable but not migratable, the exact > situation in which we'll end up shooting the current task with active > balance. Yes, that's precisely the problem - work is queued, by default, on a specific CPU and it will wait for a kworker that is pinned to that specific CPU to dispatch it. We've already tested that queuing on a different CPU (via queue_work_on()) makes the problem largely go away as the work is not longer queued behind the long running fio task. This, however, is not at viable solution to the problem. The pattern of a long running process queuing small pieces of individual work for processing in a separate context is pretty common... Cheers, Dave.
On Mon, Nov 18, 2019 at 09:56:33PM +0530, Srikar Dronamraju wrote: > * Dave Chinner <david@fromorbit.com> [2019-11-16 10:40:05]: > > > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > > I can reproduce the issue with 4k block size on another RH system, and > > > the login info of that system has been shared to you in RH BZ. > > > > > > 1) > > > > Almost all the fio task migrations are coming from migration/X > > kernel threads. i.e it's the scheduler active balancing that is > > causing the fio thread to bounce around. > > > > Can we try with the below patch. That makes things much, much worse. $ sudo trace-cmd show |grep fio |wc -l 4701 $ Even when not running the fio workload (i.e. the system is largely idle and I'm just doing admin tasks like setting up for test runs), the number of task migrations that occur goes up significantly. -Dave.
On Tue, Nov 19, 2019 at 08:18:04AM +1100, Dave Chinner wrote: > On Mon, Nov 18, 2019 at 09:56:33PM +0530, Srikar Dronamraju wrote: > > * Dave Chinner <david@fromorbit.com> [2019-11-16 10:40:05]: > > > > > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > > > I can reproduce the issue with 4k block size on another RH system, and > > > > the login info of that system has been shared to you in RH BZ. > > > > > > > > 1) > > > > > > Almost all the fio task migrations are coming from migration/X > > > kernel threads. i.e it's the scheduler active balancing that is > > > causing the fio thread to bounce around. > > > > > > > Can we try with the below patch. > > That makes things much, much worse. Yeah, that matches my test result: the fio IO thread migration count increased from 5k to 8k. Thanks, Ming
On Tue, Nov 19, 2019 at 07:40:54AM +1100, Dave Chinner wrote: > On Mon, Nov 18, 2019 at 10:21:21AM +0100, Peter Zijlstra wrote: > > We typically only fall back to the active balancer when there is > > (persistent) imbalance and we fail to migrate anything else (of > > substance). > > > > The tuning mentioned has the effect of less frequent scheduling, IOW, > > leaving (short) tasks on the runqueue longer. This obviously means the > > load-balancer will have a bigger chance of seeing them. > > > > Now; it's been a while since I looked at the workqueue code but one > > possible explanation would be if the kworker that picks up the work item > > is pinned. That would make it runnable but not migratable, the exact > > situation in which we'll end up shooting the current task with active > > balance. > > Yes, that's precisely the problem - work is queued, by default, on a > specific CPU and it will wait for a kworker that is pinned to that I'm thinking the problem is that it doesn't wait. If it went and waited for it, active balance wouldn't be needed; active balance only works on active tasks. > specific CPU to dispatch it. We've already tested that queuing on a > different CPU (via queue_work_on()) makes the problem largely go > away as the work is not longer queued behind the long running fio > task. > > This, however, is not at viable solution to the problem. The pattern > of a long running process queuing small pieces of individual work > for processing in a separate context is pretty common... Right, but you're putting the scheduler in a bind. By overloading the CPU and only allowing the one task to migrate, it pretty much has no choice left. Anyway, I'm still going to have to try and reproduce -- I got side-tracked into a crashing bug, I'll hopefully get back to this tomorrow. Lastly, one other thing to try is -next. Vincent reworked the load-balancer quite a bit.
Hi Peter, On Wed, Nov 20, 2019 at 08:16:36PM +0100 Peter Zijlstra wrote: > On Tue, Nov 19, 2019 at 07:40:54AM +1100, Dave Chinner wrote: > > On Mon, Nov 18, 2019 at 10:21:21AM +0100, Peter Zijlstra wrote: > > > > We typically only fall back to the active balancer when there is > > > (persistent) imbalance and we fail to migrate anything else (of > > > substance). > > > > > > The tuning mentioned has the effect of less frequent scheduling, IOW, > > > leaving (short) tasks on the runqueue longer. This obviously means the > > > load-balancer will have a bigger chance of seeing them. > > > > > > Now; it's been a while since I looked at the workqueue code but one > > > possible explanation would be if the kworker that picks up the work item > > > is pinned. That would make it runnable but not migratable, the exact > > > situation in which we'll end up shooting the current task with active > > > balance. > > > > Yes, that's precisely the problem - work is queued, by default, on a > > specific CPU and it will wait for a kworker that is pinned to that > > I'm thinking the problem is that it doesn't wait. If it went and waited > for it, active balance wouldn't be needed, that only works on active > tasks. Since this is AIO I wonder if it should queue_work on a nearby cpu by default instead of unbound. > > > specific CPU to dispatch it. We've already tested that queuing on a > > different CPU (via queue_work_on()) makes the problem largely go > > away as the work is not longer queued behind the long running fio > > task. > > > > This, however, is not at viable solution to the problem. The pattern > > of a long running process queuing small pieces of individual work > > for processing in a separate context is pretty common... > > Right, but you're putting the scheduler in a bind. By overloading the > CPU and only allowing the one task to migrate, it pretty much has no > choice left. > > Anyway, I'm still going to have try and reproduce -- I got side-tracked > into a crashing bug, I'll hopefully get back to this tomorrow. Lastly, > one other thing to try is -next. Vincent reworked the load-balancer > quite a bit. > I've tried it with the lb patch series. I get basically the same results. With the high granularity settings I get 3700 migrations for the 30 second run at 4k. Of those about 3200 are active balance on stock 5.4-rc7. With the lb patches it's 3500 and 3000, a slight drop. Using the default granularity settings 50 and 22 for stock and 250 and 25. So a few more total migrations with the lb patches but about the same active. On this system I'm getting 100k migrations using 512 byte blocksize. Almost all not active. I haven't looked into that closely yet but it's like 3000 per second looking like this: ... 
64.19641 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.19694 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.19746 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.19665 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.19718 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.19772 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.19800 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.19828 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.19856 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.19882 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.19909 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.19937 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.19967 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.19995 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.20023 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.20053 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.20079 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.20107 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.20135 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.20163 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.20192 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.20221 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 ... Which is roughly equal to the number if iops it's doing. Cheers, Phil --
On Wed, Nov 20, 2019 at 05:03:13PM -0500, Phil Auld wrote: > Hi Peter, > > On Wed, Nov 20, 2019 at 08:16:36PM +0100 Peter Zijlstra wrote: > > On Tue, Nov 19, 2019 at 07:40:54AM +1100, Dave Chinner wrote: > > > On Mon, Nov 18, 2019 at 10:21:21AM +0100, Peter Zijlstra wrote: > > > > > > We typically only fall back to the active balancer when there is > > > > (persistent) imbalance and we fail to migrate anything else (of > > > > substance). > > > > > > > > The tuning mentioned has the effect of less frequent scheduling, IOW, > > > > leaving (short) tasks on the runqueue longer. This obviously means the > > > > load-balancer will have a bigger chance of seeing them. > > > > > > > > Now; it's been a while since I looked at the workqueue code but one > > > > possible explanation would be if the kworker that picks up the work item > > > > is pinned. That would make it runnable but not migratable, the exact > > > > situation in which we'll end up shooting the current task with active > > > > balance. > > > > > > Yes, that's precisely the problem - work is queued, by default, on a > > > specific CPU and it will wait for a kworker that is pinned to that > > > > I'm thinking the problem is that it doesn't wait. If it went and waited > > for it, active balance wouldn't be needed, that only works on active > > tasks. > > Since this is AIO I wonder if it should queue_work on a nearby cpu by > default instead of unbound. When the current CPU isn't busy enough, there is still cost for completing request remotely. Or could we change queue_work() in the following way? * We try to queue the work to the CPU on which it was submitted, but if the * CPU dies or is saturated enough it can be processed by another CPU. Can we decide in a simple or efficient way if the current CPU is saturated enough? Thanks, Ming
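To make the question concrete, a purely hypothetical sketch of what such a check might look like at the queueing site. The helper name and the policy are invented for illustration; queue_work_on(), cpumask_any_but() and single_task_running() are existing kernel functions, but whether a check this crude is meaningful or cheap enough is exactly the open question being asked:

#include <linux/cpumask.h>
#include <linux/sched/stat.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

/*
 * Hypothetical helper, not an existing kernel interface: treat "some
 * other task is already runnable on this CPU" as "saturated".
 */
static bool this_cpu_is_saturated(void)
{
        return !single_task_running();
}

/*
 * Assumes it is called from a context where the CPU cannot change
 * underneath us (e.g. bio completion / softirq).
 */
static void queue_completion(struct workqueue_struct *wq,
                             struct work_struct *work)
{
        int cpu;

        if (!this_cpu_is_saturated()) {
                queue_work(wq, work);           /* stay on the local CPU */
                return;
        }

        /* push the work to some other online CPU */
        cpu = cpumask_any_but(cpu_online_mask, smp_processor_id());
        if (cpu >= nr_cpu_ids)
                cpu = smp_processor_id();       /* UP: nowhere else to go */
        queue_work_on(cpu, wq, work);
}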
On Wed, Nov 20, 2019 at 05:03:13PM -0500, Phil Auld wrote: > On Wed, Nov 20, 2019 at 08:16:36PM +0100 Peter Zijlstra wrote: > > On Tue, Nov 19, 2019 at 07:40:54AM +1100, Dave Chinner wrote: > > > Yes, that's precisely the problem - work is queued, by default, on a > > > specific CPU and it will wait for a kworker that is pinned to that > > > > I'm thinking the problem is that it doesn't wait. If it went and waited > > for it, active balance wouldn't be needed, that only works on active > > tasks. > > Since this is AIO I wonder if it should queue_work on a nearby cpu by > default instead of unbound. The thing seems to be that 'unbound' is in fact 'bound'. Maybe we should fix that. If the load-balancer were allowed to move the kworker around when it didn't get time to run, that would probably be a better solution. Picking another 'bound' cpu at random might create the same sort of problems in more complicated scenarios. TJ, ISTR there used to be actually unbound kworkers; what happened to those, or am I misremembering things? > > Lastly, > > one other thing to try is -next. Vincent reworked the load-balancer > > quite a bit. > > > > I've tried it with the lb patch series. I get basically the same results. > With the high granularity settings I get 3700 migrations for the 30 > second run at 4k. Of those about 3200 are active balance on stock 5.4-rc7. > With the lb patches it's 3500 and 3000, a slight drop. Thanks for testing that. I didn't expect miracles, but it is good to verify. > Using the default granularity settings 50 and 22 for stock and 250 and 25. > So a few more total migrations with the lb patches but about the same active. Right, so the granularity thing interacts with the load-balance period. Pushing it up, as some people appear to do, makes it so that what might be a temporal imbalance is perceived as a persistent imbalance. Tying the load-balance period to the granularity is something we could consider, but then, I'm sure, we'll get other people complaining that it doesn't balance quickly enough anymore.
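For reference, workqueues backed by genuinely unbound kworker pools do still exist behind the WQ_UNBOUND flag; a minimal sketch of creating and using one follows (generic names, not code from this thread; this is the same mechanism that Ming's test patch at the end of the thread enables for the dio completion workqueue):

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;
static struct work_struct example_work;

static void example_work_fn(struct work_struct *work)
{
        /*
         * Runs in process context on whichever CPU the unbound worker
         * pool picks; the kworker is not pinned to the submitting CPU.
         */
}

static int __init example_init(void)
{
        /* WQ_UNBOUND: work is served by the unbound worker pools */
        example_wq = alloc_workqueue("example_unbound", WQ_UNBOUND, 0);
        if (!example_wq)
                return -ENOMEM;

        INIT_WORK(&example_work, example_work_fn);
        queue_work(example_wq, &example_work);
        return 0;
}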
On Thu, Nov 21, 2019 at 12:12:18PM +0800 Ming Lei wrote: > On Wed, Nov 20, 2019 at 05:03:13PM -0500, Phil Auld wrote: > > Hi Peter, > > > > On Wed, Nov 20, 2019 at 08:16:36PM +0100 Peter Zijlstra wrote: > > > On Tue, Nov 19, 2019 at 07:40:54AM +1100, Dave Chinner wrote: > > > > On Mon, Nov 18, 2019 at 10:21:21AM +0100, Peter Zijlstra wrote: > > > > > > > > We typically only fall back to the active balancer when there is > > > > > (persistent) imbalance and we fail to migrate anything else (of > > > > > substance). > > > > > > > > > > The tuning mentioned has the effect of less frequent scheduling, IOW, > > > > > leaving (short) tasks on the runqueue longer. This obviously means the > > > > > load-balancer will have a bigger chance of seeing them. > > > > > > > > > > Now; it's been a while since I looked at the workqueue code but one > > > > > possible explanation would be if the kworker that picks up the work item > > > > > is pinned. That would make it runnable but not migratable, the exact > > > > > situation in which we'll end up shooting the current task with active > > > > > balance. > > > > > > > > Yes, that's precisely the problem - work is queued, by default, on a > > > > specific CPU and it will wait for a kworker that is pinned to that > > > > > > I'm thinking the problem is that it doesn't wait. If it went and waited > > > for it, active balance wouldn't be needed, that only works on active > > > tasks. > > > > Since this is AIO I wonder if it should queue_work on a nearby cpu by > > default instead of unbound. > > When the current CPU isn't busy enough, there is still cost for completing > request remotely. > > Or could we change queue_work() in the following way? > > * We try to queue the work to the CPU on which it was submitted, but if the > * CPU dies or is saturated enough it can be processed by another CPU. > > Can we decide in a simple or efficient way if the current CPU is saturated > enough? > The scheduler doesn't know if the queued_work submitter is going to go to sleep. That's why I was singling out AIO. My understanding of it is that you submit the IO and then keep going. So in that case it might be better to pick a node-local nearby cpu instead. But this is a user of work queue issue not a scheduler issue. Interestingly in our fio case the 4k one does not sleep and we get the active balance case where it moves the actually running thread. The 512 byte case seems to be sleeping since the migrations are all at wakeup time I believe. Cheers, Phil > Thanks, > Ming --
On Thu, Nov 21, 2019 at 02:29:37PM +0100 Peter Zijlstra wrote: > On Wed, Nov 20, 2019 at 05:03:13PM -0500, Phil Auld wrote: > > On Wed, Nov 20, 2019 at 08:16:36PM +0100 Peter Zijlstra wrote: > > > On Tue, Nov 19, 2019 at 07:40:54AM +1100, Dave Chinner wrote: > > > > > Yes, that's precisely the problem - work is queued, by default, on a > > > > specific CPU and it will wait for a kworker that is pinned to that > > > > > > I'm thinking the problem is that it doesn't wait. If it went and waited > > > for it, active balance wouldn't be needed, that only works on active > > > tasks. > > > > Since this is AIO I wonder if it should queue_work on a nearby cpu by > > default instead of unbound. > > The thing seems to be that 'unbound' is in fact 'bound'. Maybe we should > fix that. If the load-balancer were allowed to move the kworker around > when it didn't get time to run, that would probably be a better > solution. > Yeah, I'm not convinced this is actually a scheduler issue. > Picking another 'bound' cpu by random might create the same sort of > problems in more complicated scenarios. > > TJ, ISTR there used to be actually unbound kworkers, what happened to > those? or am I misremembering things. > > > > Lastly, > > > one other thing to try is -next. Vincent reworked the load-balancer > > > quite a bit. > > > > > > > I've tried it with the lb patch series. I get basically the same results. > > With the high granularity settings I get 3700 migrations for the 30 > > second run at 4k. Of those about 3200 are active balance on stock 5.4-rc7. > > With the lb patches it's 3500 and 3000, a slight drop. > > Thanks for testing that. I didn't expect miracles, but it is good to > verify. > > > Using the default granularity settings 50 and 22 for stock and 250 and 25. > > So a few more total migrations with the lb patches but about the same active. > > Right, so the granularity thing interacts with the load-balance period. > By pushing it up, as some people appear to do, makes it so that what > might be a temporal imablance is perceived as a persitent imbalance. > > Tying the load-balance period to the gramularity is something we could > consider, but then I'm sure, we'll get other people complaining the > doesn't balance quick enough anymore. > Thanks. These are old tuned settings that have been carried along. They may not be right for newer kernels anyway. --
On 21/11/2019 16:12, Phil Auld wrote: <> > > The scheduler doesn't know if the queued_work submitter is going to go to sleep. > That's why I was singling out AIO. My understanding of it is that you submit the IO > and then keep going. So in that case it might be better to pick a node-local nearby > cpu instead. But this is a user of work queue issue not a scheduler issue. > We have a very similar long-standing problem in our system (zufs) that we had to do hacks to fix. We have seen this CPU bouncing exactly as above in fio and other benchmarks. Our final analysis was: one thread is in wait_event(); if the wake_up() comes from the same CPU as the waiter, then on some systems (usually real HW, not VMs) the waiter would bounce to a different CPU. Our system has an array of worker-threads bound to each CPU. An incoming thread chooses the corresponding per-CPU worker-thread, lets it run and waits for a reply; when the worker-thread is done it does a wake_up(). Usually it's fine and the wait_event() stays on the same CPU, but on some systems it will wake up on a different CPU. This is a great pity, because in our case, in the work_queue case, and in a high percentage of places, the thread calling wake_up() will then immediately go to sleep on something (work done, let's wait for new work). I wish there were a flag to wake_up(), or to the event object, that says to relinquish the remainder of the time-slice to the waiter on the same CPU, since I will soon be sleeping. Then the scheduler need not guess whether the wake_up() caller is going to sleep soon or is going to continue. Let the coder give a hint about that? (The hack was to set the waiter's CPU mask to the incoming CPU and restore it after wakeup.) > Interestingly in our fio case the 4k one does not sleep and we get the active balance > case where it moves the actually running thread. The 512 byte case seems to be > sleeping since the migrations are all at wakeup time I believe. > Yes, this is the same thing we saw in our system. (And it happens only sometimes.) > Cheers, > Phil > > >> Thanks, >> Ming > Many thanks, Boaz
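A sketch of the hack described in the parenthetical above, as I read it. set_cpus_allowed_ptr() and the cpumask helpers are real kernel APIs, and current->cpus_ptr assumes a v5.3+ kernel; the function name, the wait queue and the flag are illustrative, and this is not the actual zufs code:

#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/smp.h>
#include <linux/wait.h>

/*
 * Illustration of the described workaround: pin the waiter to the CPU
 * it issued the request from, so the wake_up() from the per-CPU worker
 * cannot land it on another CPU, then restore the original affinity.
 * (A real implementation would avoid a cpumask_t on the stack for
 * large NR_CPUS configurations.)
 */
static void wait_for_reply_pinned(wait_queue_head_t *wq, bool *reply_done)
{
        cpumask_t saved_mask;
        int cpu;

        cpu = get_cpu();                /* the "incoming" CPU */
        put_cpu();

        cpumask_copy(&saved_mask, current->cpus_ptr);
        set_cpus_allowed_ptr(current, cpumask_of(cpu));         /* pin */

        wait_event(*wq, READ_ONCE(*reply_done));

        set_cpus_allowed_ptr(current, &saved_mask);             /* restore */
}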
On 11/21/19 8:02 AM, Boaz Harrosh wrote: > On 21/11/2019 16:12, Phil Auld wrote: > <> >> >> The scheduler doesn't know if the queued_work submitter is going to go to sleep. >> That's why I was singling out AIO. My understanding of it is that you submit the IO >> and then keep going. So in that case it might be better to pick a node-local nearby >> cpu instead. But this is a user of work queue issue not a scheduler issue. >> > > We have a very similar long standing problem in our system (zufs), that we had to do > hacks to fix. > > We have seen these CPU bouncing exacly as above in fio and more > benchmarks, Our final analysis was: > > One thread is in wait_event() if the wake_up() is on the same CPU as > the waiter, on some systems usually real HW and not VMs, would bounce > to a different CPU. Now our system has an array of worker-threads > bound to each CPU. an incoming thread chooses a corresponding cpu > worker-thread, let it run, waiting for a reply, then when the > worker-thread is done it will do a wake_up(). Usually its fine and the > wait_event() stays on the same CPU. But on some systems it will wakeup > in a different CPU. > > Now this is a great pity because in our case and the work_queue case > and high % of places the thread calling wake_up() will then > immediately go to sleep on something. (Work done lets wait for new > work) > > I wish there was a flag to wake_up() or to the event object that says > to relinquish the remaning of the time-slice to the waiter on same > CPU, since I will be soon sleeping. Isn't that basically what wake_up_sync() is?
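For concreteness, the hint being referred to is the sync wakeup: as far as I know it is exposed through the _sync flavour of the interruptible wake_up() macros, which passes the WF_SYNC flag to the scheduler, i.e. "the waker is about to schedule out, so the woken task can be left on this CPU". A tiny illustrative fragment with invented names:

#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(reply_wq);
static bool reply_ready;

/*
 * Worker side: finish the work, then announce it with a sync wakeup,
 * because the very next thing this worker does is sleep waiting for
 * more work.
 */
static void worker_finish(void)
{
        reply_ready = true;
        wake_up_interruptible_sync(&reply_wq);  /* carries the WF_SYNC hint */
}

/* Waiter side is unchanged from a normal interruptible wait. */
static int waiter(void)
{
        return wait_event_interruptible(reply_wq, reply_ready);
}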
On Thu, Nov 21, 2019 at 09:12:07AM -0500, Phil Auld wrote: > On Thu, Nov 21, 2019 at 12:12:18PM +0800 Ming Lei wrote: > > On Wed, Nov 20, 2019 at 05:03:13PM -0500, Phil Auld wrote: > > > Hi Peter, > > > > > > On Wed, Nov 20, 2019 at 08:16:36PM +0100 Peter Zijlstra wrote: > > > > On Tue, Nov 19, 2019 at 07:40:54AM +1100, Dave Chinner wrote: > > > > > On Mon, Nov 18, 2019 at 10:21:21AM +0100, Peter Zijlstra wrote: > > > > > > > > > > We typically only fall back to the active balancer when there is > > > > > > (persistent) imbalance and we fail to migrate anything else (of > > > > > > substance). > > > > > > > > > > > > The tuning mentioned has the effect of less frequent scheduling, IOW, > > > > > > leaving (short) tasks on the runqueue longer. This obviously means the > > > > > > load-balancer will have a bigger chance of seeing them. > > > > > > > > > > > > Now; it's been a while since I looked at the workqueue code but one > > > > > > possible explanation would be if the kworker that picks up the work item > > > > > > is pinned. That would make it runnable but not migratable, the exact > > > > > > situation in which we'll end up shooting the current task with active > > > > > > balance. > > > > > > > > > > Yes, that's precisely the problem - work is queued, by default, on a > > > > > specific CPU and it will wait for a kworker that is pinned to that > > > > > > > > I'm thinking the problem is that it doesn't wait. If it went and waited > > > > for it, active balance wouldn't be needed, that only works on active > > > > tasks. > > > > > > Since this is AIO I wonder if it should queue_work on a nearby cpu by > > > default instead of unbound. > > > > When the current CPU isn't busy enough, there is still cost for completing > > request remotely. > > > > Or could we change queue_work() in the following way? > > > > * We try to queue the work to the CPU on which it was submitted, but if the > > * CPU dies or is saturated enough it can be processed by another CPU. > > > > Can we decide in a simple or efficient way if the current CPU is saturated > > enough? > > > > The scheduler doesn't know if the queued_work submitter is going to go to sleep. > That's why I was singling out AIO. My understanding of it is that you submit the IO > and then keep going. So in that case it might be better to pick a node-local nearby > cpu instead. But this is a user of work queue issue not a scheduler issue. I think the part people are missing completely here is that the workqueue in question here is the -completion- work queue, and is not actually directly connected to the submitter process that is getting migrated. This all happens on the one CPU:

Submitter               blk-mq completion (softirq)     wq
io_submit()
  aio_write
    submit_bio
  aio_write
    submit_bio
  .....
                        blk-mq bh triggered
                          bio_endio
                            iomap_dio_endio(dio)
                              queue_work(dio)
                                                        kworker queued
                          bio_endio
                            iomap_dio_endio(dio)
                              queue_work(dio)
                        ....
        <softirq ends, back to submitter>
  aio_write
    submit_bio
  aio_write
    submit_bio
  ...
        <scheduler migrates task>
                                                        iomap_dio_complete()
                                                          aio_complete()
                                                        ....
                                                        iomap_dio_complete()
                                                          aio_complete()
                                                        ....
                                                        iomap_dio_complete()
                                                          aio_complete()

IOWs, the reason we see this behaviour is that the IO completion steering done by the blk-mq layer is, by default, directing completions back to the submitter CPU. IIUC, it's the per-cpu submission/completion queues that result in this behaviour, because local completion of IOs has been measured to be substantially faster for highly concurrent, high IOPS workloads.
What Ming's reproducer workload does is increase the amount of CPU that the submitter process uses to the point where it uses the entire CPU and starves the completion kworker thread from running. i.e. there is no idle CPU left for the completion to be processed without pre-empting the running submitter task in some way. Sometimes we see the scheduler run the kworker thread and switch straight back to the submitter task - this is what typically happens when the scheduler is using the default tunings. But when the non-default tunings are used, task migrations occur. So the scheduler migrates the submitter task, because it cannot move the IO completion tasks. And then blk-mq sees that submissions are coming from a different CPU, and it queues the incoming completions to that new CPU, thereby repeating the pattern. > Interestingly in our fio case the 4k one does not sleep and we get > the active balance case where it moves the actually running > thread. The 512 byte case seems to be sleeping since the > migrations are all at wakeup time I believe. The 512 byte case demonstrates a behaviour where the submitter task is ping ponging between CPUs on each IO submission. It is likely another corner case where the blk-mq behaviour of delivering completions to the submitter CPU triggers immediate migration rather than just context switching to the kworker thread. Further investigation needs to be done there to determine if the migration is caused by pre-emption, or whether it is a result of the submitter finishing work and yielding the CPU, but then being woken and finding the CPU is held by a running bound task and so is migrated to an idle CPU. The next IO completion has the completion directed to the new CPU, and so it migrates back when woken. IOWs, whatever the cause of the task migration is, it is likely that it is repeating because IO completions are following the submitter task around and repeatedly triggering the same migration heuristic over and over again. Also, keep in mind this is *not caused by AIO*. Workqueues are used in IO completion to avoid a basic "can't do filesystem modification work in irq context" problem. i.e. we can't take blocking locks or run transactions in bio completion context, because that will stall IO completion processing for that CPU completely, and then the filesystem deadlocks. A "non-AIO" example is page cache writeback: the bdi flusher threads are designed around an asynchronous IO submission pattern where they never wait for completions. In this case, delayed allocation is the reason the submitter consumes a big chunk of CPU. And IO completion uses workqueues for the same reason as AIO - because we often have to run transactions to complete the metadata updates necessary to finish the IO correctly. We can't do those updates in the bio completion context (softirq) because we need to block and do stuff that is dependent on future IO completions being processed. So, yeah, this "queue small amounts of work to a workqueue from IO completion context" is fundamental to how the filesystem IO stack functions in a modern kernel. As such, this specific IO completion workqueue usage needs to be handled correctly by the default kernel config, and not fall apart when scheduler tunings are slightly tweaked. Cheers, Dave.
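A generic sketch of the pattern described above, i.e. a bio end_io callback running in softirq context handing the blocking part of completion to a workqueue. The names are illustrative and the allocation/setup of the ioend at submission time is omitted; this is not the iomap or XFS code itself:

#include <linux/bio.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct fs_ioend {
        struct bio              *bio;
        struct work_struct      work;
};

static void fs_ioend_work(struct work_struct *work)
{
        struct fs_ioend *ioend = container_of(work, struct fs_ioend, work);

        /* process context: may take blocking locks, run transactions */
        bio_put(ioend->bio);
        kfree(ioend);
}

static void fs_bio_end_io(struct bio *bio)
{
        struct fs_ioend *ioend = bio->bi_private;

        /* softirq context: just hand the heavy lifting to a kworker */
        INIT_WORK(&ioend->work, fs_ioend_work);
        queue_work(system_wq, &ioend->work);    /* per-cpu by default */
}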
On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > On Sat, 16 Nov 2019 10:40:05 Dave Chinner wrote: > > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > > I can reproduce the issue with 4k block size on another RH system, and > > > the login info of that system has been shared to you in RH BZ. > > > > > > 1) > > > sysctl kernel.sched_min_granularity_ns=10000000 > > > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > > > So, these settings definitely influence behaviour. > > > > If these are set to kernel defaults (4ms and 3ms each): > > > > sysctl kernel.sched_min_granularity_ns=4000000 > > sysctl kernel.sched_wakeup_granularity_ns=3000000 > > > > The migration problem largely goes away - the fio task migration > > event count goes from ~2,000 a run down to 200/run. > > > > That indicates that the migration trigger is likely load/timing > > based. The analysis below is based on the 10/15ms numbers above, > > because it makes it so much easier to reproduce. > > > > > 2) > > > ./xfs_complete 4k > > > > > > Then you should see 1k~1.5k fio io thread migration in above test, > > > either v5.4-rc7(build with rhel8 config) or RHEL 4.18 kernel. > > > > Almost all the fio task migrations are coming from migration/X > > kernel threads. i.e it's the scheduler active balancing that is > > causing the fio thread to bounce around. > > > > This is typical a typical trace, trimmed to remove extraneous noise. > > The fio process is running on CPU 10: > > > > fio-3185 [010] 50419.285954: sched_stat_runtime: comm=fio pid=3185 runtime=1004014 [ns] vruntime=27067882290 [ns] > > fio-3185 [010] 50419.286953: sched_stat_runtime: comm=fio pid=3185 runtime=979458 [ns] vruntime=27068861748 [ns] > > fio-3185 [010] 50419.287998: sched_stat_runtime: comm=fio pid=3185 runtime=1028471 [ns] vruntime=27069890219 [ns] > > fio-3185 [010] 50419.289973: sched_stat_runtime: comm=fio pid=3185 runtime=989989 [ns] vruntime=27071836208 [ns] > > fio-3185 [010] 50419.290958: sched_stat_runtime: comm=fio pid=3185 runtime=963914 [ns] vruntime=27072800122 [ns] > > fio-3185 [010] 50419.291952: sched_stat_runtime: comm=fio pid=3185 runtime=972532 [ns] vruntime=27073772654 [ns] > > > > fio consumes CPU for several milliseconds, then: > > > > fio-3185 [010] 50419.292935: sched_stat_runtime: comm=fio pid=3185 runtime=966032 [ns] vruntime=27074738686 [ns] > > fio-3185 [010] 50419.292941: sched_switch: fio:3185 [120] S ==> kworker/10:0:2763 [120] > > kworker/10:0-2763 [010] 50419.292954: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=13423 [ns] vruntime=27052479694 [ns] > > kworker/10:0-2763 [010] 50419.292956: sched_switch: kworker/10:0:2763 [120] R ==> fio:3185 [120] > > fio-3185 [010] 50419.293115: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > > fio-3185 [010] 50419.293116: sched_stat_runtime: comm=fio pid=3185 runtime=160370 [ns] vruntime=27074899056 [ns] > > fio-3185 [010] 50419.293118: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > > > A context switch out to a kworker, then 13us later we immediately > > switch back to the fio process, and go on running. No doubt > > somewhere in what the fio process is doing, we queue up more work to > > be run on the cpu, but the fio task keeps running > > (due to CONFIG_PREEMPT=n). 
> > > > fio-3185 [010] 50419.293934: sched_stat_runtime: comm=fio pid=3185 runtime=803135 [ns] vruntime=27075702191 [ns] > > fio-3185 [010] 50419.294936: sched_stat_runtime: comm=fio pid=3185 runtime=988478 [ns] vruntime=27076690669 [ns] > > fio-3185 [010] 50419.295934: sched_stat_runtime: comm=fio pid=3185 runtime=982219 [ns] vruntime=27077672888 [ns] > > fio-3185 [010] 50419.296935: sched_stat_runtime: comm=fio pid=3185 runtime=984781 [ns] vruntime=27078657669 [ns] > > fio-3185 [010] 50419.297934: sched_stat_runtime: comm=fio pid=3185 runtime=981703 [ns] vruntime=27079639372 [ns] > > fio-3185 [010] 50419.298937: sched_stat_runtime: comm=fio pid=3185 runtime=990057 [ns] vruntime=27080629429 [ns] > > fio-3185 [010] 50419.299935: sched_stat_runtime: comm=fio pid=3185 runtime=977554 [ns] vruntime=27081606983 [ns] > > > > About 6ms later, CPU 0 kicks the active load balancer on CPU 10... > > > > <idle>-0 [000] 50419.300014: sched_waking: comm=migration/10 pid=70 prio=0 target_cpu=010 > > fio-3185 [010] 50419.300024: sched_wakeup: migration/10:70 [0] success=1 CPU:010 > > fio-3185 [010] 50419.300026: sched_stat_runtime: comm=fio pid=3185 runtime=79291 [ns] vruntime=27081686274 [ns] > > fio-3185 [010] 50419.300027: sched_switch: fio:3185 [120] S ==> migration/10:70 [0] > > migration/10-70 [010] 50419.300032: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=10 dest_cpu=12 > > migration/10-70 [010] 50419.300040: sched_switch: migration/10:70 [0] D ==> kworker/10:0:2763 [120] > > > > And 10us later the fio process is switched away, the active load > > balancer work is run and migrates the fio process to CPU 12. Then... > > > > kworker/10:0-2763 [010] 50419.300048: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=9252 [ns] vruntime=27062908308 [ns] > > kworker/10:0-2763 [010] 50419.300062: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > <idle>-0 [010] 50419.300067: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > > <idle>-0 [010] 50419.300069: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > <idle>-0 [010] 50419.300071: sched_switch: swapper/10:0 [120] S ==> kworker/10:0:2763 [120] > > kworker/10:0-2763 [010] 50419.300073: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > > > The kworker runs for another 10us and the CPU goes idle. Shortly > > after this, CPU 12 is woken: > > > > <idle>-0 [012] 50419.300113: sched_switch: swapper/12:0 [120] S ==> fio:3185 [120] > > fio-3185 [012] 50419.300596: sched_waking: comm=kworker/12:1 pid=227 prio=120 target_cpu=012 > > fio-3185 [012] 50419.300598: sched_stat_runtime: comm=fio pid=3185 runtime=561137 [ns] vruntime=20361153275 [ns] > > fio-3185 [012] 50419.300936: sched_stat_runtime: comm=fio pid=3185 runtime=326187 [ns] vruntime=20361479462 [ns] > > fio-3185 [012] 50419.301935: sched_stat_runtime: comm=fio pid=3185 runtime=981201 [ns] vruntime=20362460663 [ns] > > fio-3185 [012] 50419.302935: sched_stat_runtime: comm=fio pid=3185 runtime=983160 [ns] vruntime=20363443823 [ns] > > fio-3185 [012] 50419.303934: sched_stat_runtime: comm=fio pid=3185 runtime=983855 [ns] vruntime=20364427678 [ns] > > fio-3185 [012] 50419.304934: sched_stat_runtime: comm=fio pid=3185 runtime=977757 [ns] vruntime=20365405435 [ns] > > fio-3185 [012] 50419.305948: sched_stat_runtime: comm=fio pid=3185 runtime=999563 [ns] vruntime=20366404998 [ns] > > > > > > and fio goes on running there. 
The pattern repeats very soon afterwards: > > > > <idle>-0 [000] 50419.314982: sched_waking: comm=migration/12 pid=82 prio=0 target_cpu=012 > > fio-3185 [012] 50419.314988: sched_wakeup: migration/12:82 [0] success=1 CPU:012 > > fio-3185 [012] 50419.314990: sched_stat_runtime: comm=fio pid=3185 runtime=46342 [ns] vruntime=20375268656 [ns] > > fio-3185 [012] 50419.314991: sched_switch: fio:3185 [120] S ==> migration/12:82 [0] > > migration/12-82 [012] 50419.314995: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=12 dest_cpu=5 > > migration/12-82 [012] 50419.315001: sched_switch: migration/12:82 [0] D ==> kworker/12:1:227 [120] > > kworker/12:1-227 [012] 50419.315022: sched_stat_runtime: comm=kworker/12:1 pid=227 runtime=21453 [ns] vruntime=20359477889 [ns] > > kworker/12:1-227 [012] 50419.315028: sched_switch: kworker/12:1:227 [120] R ==> swapper/12:0 [120] > > <idle>-0 [005] 50419.315053: sched_switch: swapper/5:0 [120] S ==> fio:3185 [120] > > fio-3185 [005] 50419.315286: sched_waking: comm=kworker/5:0 pid=2646 prio=120 target_cpu=005 > > fio-3185 [005] 50419.315288: sched_stat_runtime: comm=fio pid=3185 runtime=287737 [ns] vruntime=33779011507 [ns] > > > > And fio is now running on CPU 5 - it only ran on CPU 12 for about > > 15ms. Hmmm: > > > > $ grep fio-3185 ~/tmp/sched.out | awk 'BEGIN {totcpu = 0.0; switches = 0.0; prev_waket = 0.0 }/sched_waking/ { cpu = $2; split($3, t, ":"); waket = t[1]; if (cpu != prev_cpu) { t_on_cpu = waket - prev_waket; if (prev_waket) { print "time on CPU", cpu, "was", t_on_cpu; totcpu += t_on_cpu; switches++ } prev_waket = waket; prev_cpu = cpu; } } END { print "switches", switches, "time on cpu", totcpu, "aver time on cpu", (totcpu / switches) } ' | stats --trim-outliers > > switches 2211 time on cpu 30.0994 aver time on cpu 0.0136135 > > time on CPU [0-23(8.8823+/-6.2)] was 0.000331-0.330772(0.0134759+/-0.012) > > > > Yeah, the fio task averages 13.4ms on any given CPU before being > > switched to another CPU. Mind you, the stddev is 12ms, so the range > > of how long it spends on any one CPU is pretty wide (330us to > > 330ms). > > > Hey Dave > > > IOWs, this doesn't look like a workqueue problem at all - this looks > > Surprised to see you're so sure it has little to do with wq, > > > like the scheduler is repeatedly making the wrong load balancing > > decisions when mixing a very short runtime task (queued work) with a > > long runtime task on the same CPU.... > > > and it helps more to know what is driving lb to make decisions like > this. Because for 70+ per cent of communters in cities like London it > is supposed tube is better than cab on work days, the end_io cb is > tweaked to be a lookalike of execute_in_process_context() in the diff > with the devoted s_dio_done_wq taken out of account. It's interesting > to see what difference lb will make in the tube environment. > > Hillf > > > This is not my area of expertise, so I have no idea why this might > > be happening. Scheduler experts: is this expected behaviour? What > > tunables directly influence the active load balancer (and/or CONFIG > > options) to change how aggressive it's behaviour is? > > > > > Not reproduced the issue with 512 block size on the RH system yet, > > > maybe it is related with my kernel config. > > > > I doubt it - this looks like a load specific corner case in the > > scheduling algorithm.... > > > > Cheers, > > > > Dave. 
> > -- > > Dave Chinner > > david@fromorbit.com > > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > WRITE_ONCE(dio->submit.waiter, NULL); > blk_wake_io_task(waiter); > } else if (dio->flags & IOMAP_DIO_WRITE) { > - struct inode *inode = file_inode(dio->iocb->ki_filp); > - > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > + schedule_work(&dio->aio.work); I'm not sure that this will make a real difference, because it ends up calling queue_work(system_wq, ...) and system_wq is bound as well, so the work will still be pinned to a CPU. Using system_unbound_wq should make a difference because it doesn't pin the work to a CPU: + queue_work(system_unbound_wq, &dio->aio.work); > } else { > iomap_dio_complete_work(&dio->aio.work); > } > -- >
On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote: > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > > > > On Sat, 16 Nov 2019 10:40:05 Dave Chinner wrote: > > > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > > > I can reproduce the issue with 4k block size on another RH system, and > > > > the login info of that system has been shared to you in RH BZ. > > > > > > > > 1) > > > > sysctl kernel.sched_min_granularity_ns=10000000 > > > > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > > > > > So, these settings definitely influence behaviour. > > > > > > If these are set to kernel defaults (4ms and 3ms each): > > > > > > sysctl kernel.sched_min_granularity_ns=4000000 > > > sysctl kernel.sched_wakeup_granularity_ns=3000000 > > > > > > The migration problem largely goes away - the fio task migration > > > event count goes from ~2,000 a run down to 200/run. > > > > > > That indicates that the migration trigger is likely load/timing > > > based. The analysis below is based on the 10/15ms numbers above, > > > because it makes it so much easier to reproduce. > > > > > > > 2) > > > > ./xfs_complete 4k > > > > > > > > Then you should see 1k~1.5k fio io thread migration in above test, > > > > either v5.4-rc7(build with rhel8 config) or RHEL 4.18 kernel. > > > > > > Almost all the fio task migrations are coming from migration/X > > > kernel threads. i.e it's the scheduler active balancing that is > > > causing the fio thread to bounce around. > > > > > > This is typical a typical trace, trimmed to remove extraneous noise. > > > The fio process is running on CPU 10: > > > > > > fio-3185 [010] 50419.285954: sched_stat_runtime: comm=fio pid=3185 runtime=1004014 [ns] vruntime=27067882290 [ns] > > > fio-3185 [010] 50419.286953: sched_stat_runtime: comm=fio pid=3185 runtime=979458 [ns] vruntime=27068861748 [ns] > > > fio-3185 [010] 50419.287998: sched_stat_runtime: comm=fio pid=3185 runtime=1028471 [ns] vruntime=27069890219 [ns] > > > fio-3185 [010] 50419.289973: sched_stat_runtime: comm=fio pid=3185 runtime=989989 [ns] vruntime=27071836208 [ns] > > > fio-3185 [010] 50419.290958: sched_stat_runtime: comm=fio pid=3185 runtime=963914 [ns] vruntime=27072800122 [ns] > > > fio-3185 [010] 50419.291952: sched_stat_runtime: comm=fio pid=3185 runtime=972532 [ns] vruntime=27073772654 [ns] > > > > > > fio consumes CPU for several milliseconds, then: > > > > > > fio-3185 [010] 50419.292935: sched_stat_runtime: comm=fio pid=3185 runtime=966032 [ns] vruntime=27074738686 [ns] > > > fio-3185 [010] 50419.292941: sched_switch: fio:3185 [120] S ==> kworker/10:0:2763 [120] > > > kworker/10:0-2763 [010] 50419.292954: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=13423 [ns] vruntime=27052479694 [ns] > > > kworker/10:0-2763 [010] 50419.292956: sched_switch: kworker/10:0:2763 [120] R ==> fio:3185 [120] > > > fio-3185 [010] 50419.293115: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > > > fio-3185 [010] 50419.293116: sched_stat_runtime: comm=fio pid=3185 runtime=160370 [ns] vruntime=27074899056 [ns] > > > fio-3185 [010] 50419.293118: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > > > > > A context switch out to a kworker, then 13us later we immediately > > > switch back to the fio process, and go on running. 
No doubt > > > somewhere in what the fio process is doing, we queue up more work to > > > be run on the cpu, but the fio task keeps running > > > (due to CONFIG_PREEMPT=n). > > > > > > fio-3185 [010] 50419.293934: sched_stat_runtime: comm=fio pid=3185 runtime=803135 [ns] vruntime=27075702191 [ns] > > > fio-3185 [010] 50419.294936: sched_stat_runtime: comm=fio pid=3185 runtime=988478 [ns] vruntime=27076690669 [ns] > > > fio-3185 [010] 50419.295934: sched_stat_runtime: comm=fio pid=3185 runtime=982219 [ns] vruntime=27077672888 [ns] > > > fio-3185 [010] 50419.296935: sched_stat_runtime: comm=fio pid=3185 runtime=984781 [ns] vruntime=27078657669 [ns] > > > fio-3185 [010] 50419.297934: sched_stat_runtime: comm=fio pid=3185 runtime=981703 [ns] vruntime=27079639372 [ns] > > > fio-3185 [010] 50419.298937: sched_stat_runtime: comm=fio pid=3185 runtime=990057 [ns] vruntime=27080629429 [ns] > > > fio-3185 [010] 50419.299935: sched_stat_runtime: comm=fio pid=3185 runtime=977554 [ns] vruntime=27081606983 [ns] > > > > > > About 6ms later, CPU 0 kicks the active load balancer on CPU 10... > > > > > > <idle>-0 [000] 50419.300014: sched_waking: comm=migration/10 pid=70 prio=0 target_cpu=010 > > > fio-3185 [010] 50419.300024: sched_wakeup: migration/10:70 [0] success=1 CPU:010 > > > fio-3185 [010] 50419.300026: sched_stat_runtime: comm=fio pid=3185 runtime=79291 [ns] vruntime=27081686274 [ns] > > > fio-3185 [010] 50419.300027: sched_switch: fio:3185 [120] S ==> migration/10:70 [0] > > > migration/10-70 [010] 50419.300032: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=10 dest_cpu=12 > > > migration/10-70 [010] 50419.300040: sched_switch: migration/10:70 [0] D ==> kworker/10:0:2763 [120] > > > > > > And 10us later the fio process is switched away, the active load > > > balancer work is run and migrates the fio process to CPU 12. Then... > > > > > > kworker/10:0-2763 [010] 50419.300048: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=9252 [ns] vruntime=27062908308 [ns] > > > kworker/10:0-2763 [010] 50419.300062: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > > <idle>-0 [010] 50419.300067: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > > > <idle>-0 [010] 50419.300069: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > > <idle>-0 [010] 50419.300071: sched_switch: swapper/10:0 [120] S ==> kworker/10:0:2763 [120] > > > kworker/10:0-2763 [010] 50419.300073: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > > > > > The kworker runs for another 10us and the CPU goes idle. 
Shortly > > > after this, CPU 12 is woken: > > > > > > <idle>-0 [012] 50419.300113: sched_switch: swapper/12:0 [120] S ==> fio:3185 [120] > > > fio-3185 [012] 50419.300596: sched_waking: comm=kworker/12:1 pid=227 prio=120 target_cpu=012 > > > fio-3185 [012] 50419.300598: sched_stat_runtime: comm=fio pid=3185 runtime=561137 [ns] vruntime=20361153275 [ns] > > > fio-3185 [012] 50419.300936: sched_stat_runtime: comm=fio pid=3185 runtime=326187 [ns] vruntime=20361479462 [ns] > > > fio-3185 [012] 50419.301935: sched_stat_runtime: comm=fio pid=3185 runtime=981201 [ns] vruntime=20362460663 [ns] > > > fio-3185 [012] 50419.302935: sched_stat_runtime: comm=fio pid=3185 runtime=983160 [ns] vruntime=20363443823 [ns] > > > fio-3185 [012] 50419.303934: sched_stat_runtime: comm=fio pid=3185 runtime=983855 [ns] vruntime=20364427678 [ns] > > > fio-3185 [012] 50419.304934: sched_stat_runtime: comm=fio pid=3185 runtime=977757 [ns] vruntime=20365405435 [ns] > > > fio-3185 [012] 50419.305948: sched_stat_runtime: comm=fio pid=3185 runtime=999563 [ns] vruntime=20366404998 [ns] > > > > > > > > > and fio goes on running there. The pattern repeats very soon afterwards: > > > > > > <idle>-0 [000] 50419.314982: sched_waking: comm=migration/12 pid=82 prio=0 target_cpu=012 > > > fio-3185 [012] 50419.314988: sched_wakeup: migration/12:82 [0] success=1 CPU:012 > > > fio-3185 [012] 50419.314990: sched_stat_runtime: comm=fio pid=3185 runtime=46342 [ns] vruntime=20375268656 [ns] > > > fio-3185 [012] 50419.314991: sched_switch: fio:3185 [120] S ==> migration/12:82 [0] > > > migration/12-82 [012] 50419.314995: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=12 dest_cpu=5 > > > migration/12-82 [012] 50419.315001: sched_switch: migration/12:82 [0] D ==> kworker/12:1:227 [120] > > > kworker/12:1-227 [012] 50419.315022: sched_stat_runtime: comm=kworker/12:1 pid=227 runtime=21453 [ns] vruntime=20359477889 [ns] > > > kworker/12:1-227 [012] 50419.315028: sched_switch: kworker/12:1:227 [120] R ==> swapper/12:0 [120] > > > <idle>-0 [005] 50419.315053: sched_switch: swapper/5:0 [120] S ==> fio:3185 [120] > > > fio-3185 [005] 50419.315286: sched_waking: comm=kworker/5:0 pid=2646 prio=120 target_cpu=005 > > > fio-3185 [005] 50419.315288: sched_stat_runtime: comm=fio pid=3185 runtime=287737 [ns] vruntime=33779011507 [ns] > > > > > > And fio is now running on CPU 5 - it only ran on CPU 12 for about > > > 15ms. Hmmm: > > > > > > $ grep fio-3185 ~/tmp/sched.out | awk 'BEGIN {totcpu = 0.0; switches = 0.0; prev_waket = 0.0 }/sched_waking/ { cpu = $2; split($3, t, ":"); waket = t[1]; if (cpu != prev_cpu) { t_on_cpu = waket - prev_waket; if (prev_waket) { print "time on CPU", cpu, "was", t_on_cpu; totcpu += t_on_cpu; switches++ } prev_waket = waket; prev_cpu = cpu; } } END { print "switches", switches, "time on cpu", totcpu, "aver time on cpu", (totcpu / switches) } ' | stats --trim-outliers > > > switches 2211 time on cpu 30.0994 aver time on cpu 0.0136135 > > > time on CPU [0-23(8.8823+/-6.2)] was 0.000331-0.330772(0.0134759+/-0.012) > > > > > > Yeah, the fio task averages 13.4ms on any given CPU before being > > > switched to another CPU. Mind you, the stddev is 12ms, so the range > > > of how long it spends on any one CPU is pretty wide (330us to > > > 330ms). 
> > > > > Hey Dave > > > > > IOWs, this doesn't look like a workqueue problem at all - this looks > > > > Surprised to see you're so sure it has little to do with wq, > > > > > like the scheduler is repeatedly making the wrong load balancing > > > decisions when mixing a very short runtime task (queued work) with a > > > long runtime task on the same CPU.... > > > > > and it helps more to know what is driving lb to make decisions like > > this. Because for 70+ per cent of communters in cities like London it > > is supposed tube is better than cab on work days, the end_io cb is > > tweaked to be a lookalike of execute_in_process_context() in the diff > > with the devoted s_dio_done_wq taken out of account. It's interesting > > to see what difference lb will make in the tube environment. > > > > Hillf > > > > > This is not my area of expertise, so I have no idea why this might > > > be happening. Scheduler experts: is this expected behaviour? What > > > tunables directly influence the active load balancer (and/or CONFIG > > > options) to change how aggressive it's behaviour is? > > > > > > > Not reproduced the issue with 512 block size on the RH system yet, > > > > maybe it is related with my kernel config. > > > > > > I doubt it - this looks like a load specific corner case in the > > > scheduling algorithm.... > > > > > > Cheers, > > > > > > Dave. > > > -- > > > Dave Chinner > > > david@fromorbit.com > > > > --- a/fs/iomap/direct-io.c > > +++ b/fs/iomap/direct-io.c > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > > WRITE_ONCE(dio->submit.waiter, NULL); > > blk_wake_io_task(waiter); > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > - struct inode *inode = file_inode(dio->iocb->ki_filp); > > - > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > + schedule_work(&dio->aio.work); > > I'm not sure that this will make a real difference because it ends up > to call queue_work(system_wq, ...) and system_wq is bounded as well so > the work will still be pinned to a CPU > Using system_unbound_wq should make a difference because it doesn't > pin the work on a CPU > + queue_work(system_unbound_wq, &dio->aio.work); Indeed, just run a quick test on my KVM guest, looks the following patch makes a difference: diff --git a/fs/direct-io.c b/fs/direct-io.c index 9329ced91f1d..2f4488b0ecec 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb) { struct workqueue_struct *old; struct workqueue_struct *wq = alloc_workqueue("dio/%s", - WQ_MEM_RECLAIM, 0, + WQ_MEM_RECLAIM | + WQ_UNBOUND, 0, sb->s_id); Thanks, Ming
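[For readers who want to see the behavioural difference between the two workqueue flavours being discussed, the following stand-alone sketch (a hypothetical test module, not something posted in this thread) queues the same callback on a bound and on a WQ_UNBOUND workqueue and logs which CPU each one ran on: the bound queue keeps the work on the queueing CPU, while the unbound queue lets the scheduler place the worker.

// SPDX-License-Identifier: GPL-2.0
/* Hypothetical demo module: bound vs WQ_UNBOUND work placement. */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/smp.h>

static struct workqueue_struct *bound_wq, *unbound_wq;
static struct work_struct bound_work, unbound_work;

static void report_cpu(struct work_struct *work)
{
	/* Bound work runs on the CPU it was queued from; unbound work
	 * runs wherever the scheduler decides to place the worker. */
	pr_info("wq-demo: %s work ran on CPU %d\n",
		work == &bound_work ? "bound" : "unbound",
		raw_smp_processor_id());
}

static int __init wq_demo_init(void)
{
	bound_wq = alloc_workqueue("wq_demo_bound", WQ_MEM_RECLAIM, 0);
	unbound_wq = alloc_workqueue("wq_demo_unbound",
				     WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
	if (!bound_wq || !unbound_wq) {
		if (bound_wq)
			destroy_workqueue(bound_wq);
		if (unbound_wq)
			destroy_workqueue(unbound_wq);
		return -ENOMEM;
	}

	INIT_WORK(&bound_work, report_cpu);
	INIT_WORK(&unbound_work, report_cpu);

	pr_info("wq-demo: queueing from CPU %d\n", raw_smp_processor_id());
	queue_work(bound_wq, &bound_work);
	queue_work(unbound_wq, &unbound_work);
	return 0;
}

static void __exit wq_demo_exit(void)
{
	destroy_workqueue(bound_wq);
	destroy_workqueue(unbound_wq);
}

module_init(wq_demo_init);
module_exit(wq_demo_exit);
MODULE_LICENSE("GPL");

The WQ_UNBOUND patch above is the same idea applied to the dio completion workqueue: the completion work is no longer pinned to the submitter's CPU.]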
On Thu, Nov 28, 2019 at 05:40:03PM +0800, Hillf Danton wrote: > On Sat, 16 Nov 2019 10:40:05 Dave Chinner wrote: > > Yeah, the fio task averages 13.4ms on any given CPU before being > > switched to another CPU. Mind you, the stddev is 12ms, so the range > > of how long it spends on any one CPU is pretty wide (330us to > > 330ms). > > > Hey Dave > > > IOWs, this doesn't look like a workqueue problem at all - this looks > > Surprised to see you're so sure it has little to do with wq, Because I understand how the workqueue is used here. Essentially, the workqueue is not necessary for a -pure- overwrite where no metadata updates or end-of-io filesystem work is required. However, change the workload just slightly, such as allocating the space, writing into preallocated space (unwritten extents), using AIO writes to extend the file, using O_DSYNC, etc, and we *must* use a workqueue as we have to take blocking locks and/or run transactions. These may still be very short (e.g. updating inode size) and in most cases will not block, but if they do, then if we don't move the work out of the block layer completion context (i.e. softirq running the block bh) then we risk deadlocking the code. Not to mention none of the filesystem inode locks are irq safe. IOWs, we can remove the workqueue for this -one specific instance- but it does not remove the requirement for using a workqueue for all the other types of write IO that pass through this code. > > like the scheduler is repeatedly making the wrong load balancing > > decisions when mixing a very short runtime task (queued work) with a > > long runtime task on the same CPU.... > > > and it helps more to know what is driving lb to make decisions like > this. I know exactly what is driving it through both observation and understanding of the code, and I've explained it elsewhere in this thread. > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > WRITE_ONCE(dio->submit.waiter, NULL); > blk_wake_io_task(waiter); > } else if (dio->flags & IOMAP_DIO_WRITE) { > - struct inode *inode = file_inode(dio->iocb->ki_filp); > - > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > + schedule_work(&dio->aio.work); This does nothing but change the workqueue from a per-sb wq to the system wq. The work is still bound to the same CPU it is queued on, so nothing will change. Cheers, Dave.
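[To make the deadlock/atomic-context constraint above concrete, here is a schematic of the general pattern; the names (my_ioend, my_ioend_wq, my_ilock) are made up for illustration, the real code being the iomap_dio_bio_end_io()/iomap_dio_complete_work() pair quoted in the thread. The bio end_io handler may run in softirq context, so anything that might sleep is packaged up and punted to a workqueue running in process context.

#include <linux/bio.h>
#include <linux/workqueue.h>
#include <linux/mutex.h>
#include <linux/slab.h>

struct my_ioend {
	struct work_struct	work;
	int			error;
};

static struct workqueue_struct *my_ioend_wq;	/* bound, WQ_MEM_RECLAIM */
static DEFINE_MUTEX(my_ilock);			/* stand-in for fs inode locks */

/* Runs in process context: safe to sleep, take locks, run transactions. */
static void my_ioend_work(struct work_struct *work)
{
	struct my_ioend *ioend = container_of(work, struct my_ioend, work);

	mutex_lock(&my_ilock);
	/* e.g. unwritten extent conversion, i_size update, O_DSYNC flush */
	mutex_unlock(&my_ilock);
	kfree(ioend);
}

/* May run in softirq context off the block completion path: no sleeping. */
static void my_end_io(struct bio *bio)
{
	struct my_ioend *ioend = bio->bi_private;

	ioend->error = blk_status_to_errno(bio->bi_status);
	INIT_WORK(&ioend->work, my_ioend_work);
	/* Bound wq: the deferred work stays on this (submission-affine) CPU. */
	queue_work(my_ioend_wq, &ioend->work);
	bio_put(bio);
}

The pure-overwrite case is the one exception where the work function has nothing that can block, which is why it can bypass the queue.]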
On Mon, Dec 02, 2019 at 10:46:25AM +0800, Ming Lei wrote: > On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote: > > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > --- a/fs/iomap/direct-io.c > > > +++ b/fs/iomap/direct-io.c > > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > > > WRITE_ONCE(dio->submit.waiter, NULL); > > > blk_wake_io_task(waiter); > > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > > - struct inode *inode = file_inode(dio->iocb->ki_filp); > > > - > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > > + schedule_work(&dio->aio.work); > > > > I'm not sure that this will make a real difference because it ends up > > to call queue_work(system_wq, ...) and system_wq is bounded as well so > > the work will still be pinned to a CPU > > Using system_unbound_wq should make a difference because it doesn't > > pin the work on a CPU > > + queue_work(system_unbound_wq, &dio->aio.work); > > Indeed, just run a quick test on my KVM guest, looks the following patch > makes a difference: > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 9329ced91f1d..2f4488b0ecec 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb) > { > struct workqueue_struct *old; > struct workqueue_struct *wq = alloc_workqueue("dio/%s", > - WQ_MEM_RECLAIM, 0, > + WQ_MEM_RECLAIM | > + WQ_UNBOUND, 0, > sb->s_id); That's not an answer to the user task migration issue. That is, all this patch does is trade user task migration when the CPU is busy for migrating all the queued work off the CPU so the user task does not get migrated. IOWs, this forces all the queued work to be migrated rather than the user task. IOWs, it does not address the issue we've exposed in the scheduler between tasks with competing CPU affinity scheduling requirements - it just hides the symptom. Maintaining CPU affinity across dispatch and completion work has been proven to be a significant performance win. Right throughout the IO stack we try to keep this submitter/completion affinity, and that's the whole point of using a bound wq in the first place: efficient delayed batch processing of work on the local CPU. Spewing deferred completion work across every idle CPU in the machine because the local cpu is temporarily busy is a bad choice, both from a performance perspective (dirty cacheline bouncing) and from a power efficiency point of view as it causes CPUs to be taken out of idle state much more frequently[*]. The fact that the scheduler migrates the user task we use workqueues for deferred work as they were intended doesn't make this a workqueue problem. If the answer to this problem is "make all IO workqueues WQ_UNBOUND" then we are effectively saying "the scheduler has unfixable problems when mixing bound and unbound work on the same run queue". And, besides, what happens when every other CPU is also completely busy and can't run the work in a timely fashion? We've just moved the work to some random CPU where we wait to be scheduled instead of just sitting on the local CPU and waiting.... So, yes, we can work around the -symptoms- we see (frequent user task migration) by changing the work queue configuration or bypassing the workqueue for this specific workload. 
But these only address the visible symptom and don't take into account the wider goals of retaining CPU affinity in the IO stack, and they will have variable scheduling latency and performance as the overall system load changes. So, we can fiddle with workqueues, but it doesn't address the underlying issue that the scheduler appears to be migrating non-bound tasks off a busy CPU too easily.... -Dave. [*] Pay attention to the WQ_POWER_EFFICIENT definition for a work queue: it's designed for interrupt routines that defer work via work queues to avoid doing work on otherwise idle CPUs. It does this by turning the per-cpu wq into an unbound wq so that work gets scheduled on non-idle CPUs in preference to the local idle CPU which can then remain in low power states. That's the exact opposite of what using WQ_UNBOUND ends up doing in this IO completion context: it pushes the work out over idle CPUs rather than keeping it confined on the already busy CPUs where CPU affinity allows the work to be done quickly. So while WQ_UNBOUND avoids the user task being migrated frequently, it results in the work being spread around many more CPUs and we burn more power to do the same work.
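[For reference, WQ_POWER_EFFICIENT is just another alloc_workqueue() flag; whether it actually converts the queue to unbound behaviour depends on CONFIG_WQ_POWER_EFFICIENT_DEFAULT or the workqueue.power_efficient boot parameter. A minimal illustration, with a hypothetical queue name:

#include <linux/workqueue.h>

/*
 * With power-efficient mode enabled, this per-cpu workqueue is treated
 * as unbound, so its work is placed on already-busy CPUs rather than
 * waking the local idle CPU; with the mode disabled it behaves as a
 * normal bound (per-cpu) workqueue.
 */
static struct workqueue_struct *stats_wq;

static int __init stats_wq_init(void)
{
	stats_wq = alloc_workqueue("stats_wq", WQ_POWER_EFFICIENT, 0);
	return stats_wq ? 0 : -ENOMEM;
}

That is the inverse trade-off to the one being proposed for the dio completion queue.]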
On Mon, Dec 02, 2019 at 03:02:56PM +1100, Dave Chinner wrote: > On Mon, Dec 02, 2019 at 10:46:25AM +0800, Ming Lei wrote: > > On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote: > > > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > > --- a/fs/iomap/direct-io.c > > > > +++ b/fs/iomap/direct-io.c > > > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > > > > WRITE_ONCE(dio->submit.waiter, NULL); > > > > blk_wake_io_task(waiter); > > > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > > > - struct inode *inode = file_inode(dio->iocb->ki_filp); > > > > - > > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > > > + schedule_work(&dio->aio.work); > > > > > > I'm not sure that this will make a real difference because it ends up > > > to call queue_work(system_wq, ...) and system_wq is bounded as well so > > > the work will still be pinned to a CPU > > > Using system_unbound_wq should make a difference because it doesn't > > > pin the work on a CPU > > > + queue_work(system_unbound_wq, &dio->aio.work); > > > > Indeed, just run a quick test on my KVM guest, looks the following patch > > makes a difference: > > > > diff --git a/fs/direct-io.c b/fs/direct-io.c > > index 9329ced91f1d..2f4488b0ecec 100644 > > --- a/fs/direct-io.c > > +++ b/fs/direct-io.c > > @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb) > > { > > struct workqueue_struct *old; > > struct workqueue_struct *wq = alloc_workqueue("dio/%s", > > - WQ_MEM_RECLAIM, 0, > > + WQ_MEM_RECLAIM | > > + WQ_UNBOUND, 0, > > sb->s_id); > > That's not an answer to the user task migration issue. > > That is, all this patch does is trade user task migration when the > CPU is busy for migrating all the queued work off the CPU so the > user task does not get migrated. IOWs, this forces all the queued > work to be migrated rather than the user task. IOWs, it does not > address the issue we've exposed in the scheduler between tasks with > competing CPU affinity scheduling requirements - it just hides the > symptom. Yeah, I believe we all reach the agreement that this is one issue in scheduler's load balance. > > Maintaining CPU affinity across dispatch and completion work has > been proven to be a significant performance win. Right throughout > the IO stack we try to keep this submitter/completion affinity, > and that's the whole point of using a bound wq in the first place: > efficient delayed batch processing of work on the local CPU. > > Spewing deferred completion work across every idle CPU in the > machine because the local cpu is temporarily busy is a bad choice, > both from a performance perspective (dirty cacheline bouncing) and > from a power efficiency point of view as it causes CPUs to be taken > out of idle state much more frequently[*]. > > The fact that the scheduler migrates the user task we use workqueues > for deferred work as they were intended doesn't make this a > workqueue problem. If the answer to this problem is "make all IO > workqueues WQ_UNBOUND" then we are effectively saying "the scheduler > has unfixable problems when mixing bound and unbound work on the > same run queue". > > And, besides, what happens when every other CPU is also completely > busy and can't run the work in a timely fashion? We've just moved > the work to some random CPU where we wait to be scheduled instead of > just sitting on the local CPU and waiting.... 
> > So, yes, we can work around the -symptoms- we see (frequent user > task migration) by changing the work queue configuration or > bypassing the workqueue for this specific workload. But these only > address the visible symptom and don't take into account the wider > goals of retaining CPU affinity in the IO stack, and they will have > variable scheduling latency and performance as the overall > system load changes. So far, I haven't seen any progress in fixing the load balance issue, so I'd suggest working around the issue via the patch if no one objects. We can comment it explicitly that it is just a workaround for the scheduler's problem. BTW, there are lots of WQ_UNBOUND uses in the kernel: [linux]$ git grep -n -w WQ_UNBOUND ./ | grep -v "kernel\/workqueue" | wc 86 524 7356 Some of them are used in the fast IO path too. > > So, we can fiddle with workqueues, but it doesn't address the > underlying issue that the scheduler appears to be migrating > non-bound tasks off a busy CPU too easily.... > > -Dave. > > [*] Pay attention to the WQ_POWER_EFFICIENT definition for a work > queue: it's designed for interrupt routines that defer work via work > queues to avoid doing work on otherwise idle CPUs. It does this by > turning the per-cpu wq into an unbound wq so that work gets > scheduled on non-idle CPUs in preference to the local idle CPU > which can then remain in low power states. > > That's the exact opposite of what using WQ_UNBOUND ends up doing in > this IO completion context: it pushes the work out over idle CPUs > rather than keeping it confined on the already busy CPUs where CPU > affinity allows the work to be done quickly. So while WQ_UNBOUND > avoids the user task being migrated frequently, it results in the > work being spread around many more CPUs and we burn more power to do > the same work. That can't be worse than this crazy migration, which schedules the IO thread on other IDLE CPUs too. Thanks, Ming
Hi, On 02/12/19 10:46, Ming Lei wrote: > On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote: > > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > > > > > > > On Sat, 16 Nov 2019 10:40:05 Dave Chinner wrote: > > > > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > > > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > > > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > > > > I can reproduce the issue with 4k block size on another RH system, and > > > > > the login info of that system has been shared to you in RH BZ. > > > > > > > > > > 1) > > > > > sysctl kernel.sched_min_granularity_ns=10000000 > > > > > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > > > > > > > So, these settings definitely influence behaviour. > > > > > > > > If these are set to kernel defaults (4ms and 3ms each): > > > > > > > > sysctl kernel.sched_min_granularity_ns=4000000 > > > > sysctl kernel.sched_wakeup_granularity_ns=3000000 > > > > > > > > The migration problem largely goes away - the fio task migration > > > > event count goes from ~2,000 a run down to 200/run. > > > > > > > > That indicates that the migration trigger is likely load/timing > > > > based. The analysis below is based on the 10/15ms numbers above, > > > > because it makes it so much easier to reproduce. > > > > > > > > > 2) > > > > > ./xfs_complete 4k > > > > > > > > > > Then you should see 1k~1.5k fio io thread migration in above test, > > > > > either v5.4-rc7(build with rhel8 config) or RHEL 4.18 kernel. > > > > > > > > Almost all the fio task migrations are coming from migration/X > > > > kernel threads. i.e it's the scheduler active balancing that is > > > > causing the fio thread to bounce around. > > > > > > > > This is typical a typical trace, trimmed to remove extraneous noise. 
> > > > The fio process is running on CPU 10: > > > > > > > > fio-3185 [010] 50419.285954: sched_stat_runtime: comm=fio pid=3185 runtime=1004014 [ns] vruntime=27067882290 [ns] > > > > fio-3185 [010] 50419.286953: sched_stat_runtime: comm=fio pid=3185 runtime=979458 [ns] vruntime=27068861748 [ns] > > > > fio-3185 [010] 50419.287998: sched_stat_runtime: comm=fio pid=3185 runtime=1028471 [ns] vruntime=27069890219 [ns] > > > > fio-3185 [010] 50419.289973: sched_stat_runtime: comm=fio pid=3185 runtime=989989 [ns] vruntime=27071836208 [ns] > > > > fio-3185 [010] 50419.290958: sched_stat_runtime: comm=fio pid=3185 runtime=963914 [ns] vruntime=27072800122 [ns] > > > > fio-3185 [010] 50419.291952: sched_stat_runtime: comm=fio pid=3185 runtime=972532 [ns] vruntime=27073772654 [ns] > > > > > > > > fio consumes CPU for several milliseconds, then: > > > > > > > > fio-3185 [010] 50419.292935: sched_stat_runtime: comm=fio pid=3185 runtime=966032 [ns] vruntime=27074738686 [ns] > > > > fio-3185 [010] 50419.292941: sched_switch: fio:3185 [120] S ==> kworker/10:0:2763 [120] > > > > kworker/10:0-2763 [010] 50419.292954: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=13423 [ns] vruntime=27052479694 [ns] > > > > kworker/10:0-2763 [010] 50419.292956: sched_switch: kworker/10:0:2763 [120] R ==> fio:3185 [120] > > > > fio-3185 [010] 50419.293115: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > > > > fio-3185 [010] 50419.293116: sched_stat_runtime: comm=fio pid=3185 runtime=160370 [ns] vruntime=27074899056 [ns] > > > > fio-3185 [010] 50419.293118: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > > > > > > > A context switch out to a kworker, then 13us later we immediately > > > > switch back to the fio process, and go on running. No doubt > > > > somewhere in what the fio process is doing, we queue up more work to > > > > be run on the cpu, but the fio task keeps running > > > > (due to CONFIG_PREEMPT=n). > > > > > > > > fio-3185 [010] 50419.293934: sched_stat_runtime: comm=fio pid=3185 runtime=803135 [ns] vruntime=27075702191 [ns] > > > > fio-3185 [010] 50419.294936: sched_stat_runtime: comm=fio pid=3185 runtime=988478 [ns] vruntime=27076690669 [ns] > > > > fio-3185 [010] 50419.295934: sched_stat_runtime: comm=fio pid=3185 runtime=982219 [ns] vruntime=27077672888 [ns] > > > > fio-3185 [010] 50419.296935: sched_stat_runtime: comm=fio pid=3185 runtime=984781 [ns] vruntime=27078657669 [ns] > > > > fio-3185 [010] 50419.297934: sched_stat_runtime: comm=fio pid=3185 runtime=981703 [ns] vruntime=27079639372 [ns] > > > > fio-3185 [010] 50419.298937: sched_stat_runtime: comm=fio pid=3185 runtime=990057 [ns] vruntime=27080629429 [ns] > > > > fio-3185 [010] 50419.299935: sched_stat_runtime: comm=fio pid=3185 runtime=977554 [ns] vruntime=27081606983 [ns] > > > > > > > > About 6ms later, CPU 0 kicks the active load balancer on CPU 10... 
> > > > > > > > <idle>-0 [000] 50419.300014: sched_waking: comm=migration/10 pid=70 prio=0 target_cpu=010 > > > > fio-3185 [010] 50419.300024: sched_wakeup: migration/10:70 [0] success=1 CPU:010 > > > > fio-3185 [010] 50419.300026: sched_stat_runtime: comm=fio pid=3185 runtime=79291 [ns] vruntime=27081686274 [ns] > > > > fio-3185 [010] 50419.300027: sched_switch: fio:3185 [120] S ==> migration/10:70 [0] > > > > migration/10-70 [010] 50419.300032: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=10 dest_cpu=12 > > > > migration/10-70 [010] 50419.300040: sched_switch: migration/10:70 [0] D ==> kworker/10:0:2763 [120] > > > > > > > > And 10us later the fio process is switched away, the active load > > > > balancer work is run and migrates the fio process to CPU 12. Then... > > > > > > > > kworker/10:0-2763 [010] 50419.300048: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=9252 [ns] vruntime=27062908308 [ns] > > > > kworker/10:0-2763 [010] 50419.300062: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > > > <idle>-0 [010] 50419.300067: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > > > > <idle>-0 [010] 50419.300069: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > > > <idle>-0 [010] 50419.300071: sched_switch: swapper/10:0 [120] S ==> kworker/10:0:2763 [120] > > > > kworker/10:0-2763 [010] 50419.300073: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > > > > > > > The kworker runs for another 10us and the CPU goes idle. Shortly > > > > after this, CPU 12 is woken: > > > > > > > > <idle>-0 [012] 50419.300113: sched_switch: swapper/12:0 [120] S ==> fio:3185 [120] > > > > fio-3185 [012] 50419.300596: sched_waking: comm=kworker/12:1 pid=227 prio=120 target_cpu=012 > > > > fio-3185 [012] 50419.300598: sched_stat_runtime: comm=fio pid=3185 runtime=561137 [ns] vruntime=20361153275 [ns] > > > > fio-3185 [012] 50419.300936: sched_stat_runtime: comm=fio pid=3185 runtime=326187 [ns] vruntime=20361479462 [ns] > > > > fio-3185 [012] 50419.301935: sched_stat_runtime: comm=fio pid=3185 runtime=981201 [ns] vruntime=20362460663 [ns] > > > > fio-3185 [012] 50419.302935: sched_stat_runtime: comm=fio pid=3185 runtime=983160 [ns] vruntime=20363443823 [ns] > > > > fio-3185 [012] 50419.303934: sched_stat_runtime: comm=fio pid=3185 runtime=983855 [ns] vruntime=20364427678 [ns] > > > > fio-3185 [012] 50419.304934: sched_stat_runtime: comm=fio pid=3185 runtime=977757 [ns] vruntime=20365405435 [ns] > > > > fio-3185 [012] 50419.305948: sched_stat_runtime: comm=fio pid=3185 runtime=999563 [ns] vruntime=20366404998 [ns] > > > > > > > > > > > > and fio goes on running there. 
The pattern repeats very soon afterwards: > > > > > > > > <idle>-0 [000] 50419.314982: sched_waking: comm=migration/12 pid=82 prio=0 target_cpu=012 > > > > fio-3185 [012] 50419.314988: sched_wakeup: migration/12:82 [0] success=1 CPU:012 > > > > fio-3185 [012] 50419.314990: sched_stat_runtime: comm=fio pid=3185 runtime=46342 [ns] vruntime=20375268656 [ns] > > > > fio-3185 [012] 50419.314991: sched_switch: fio:3185 [120] S ==> migration/12:82 [0] > > > > migration/12-82 [012] 50419.314995: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=12 dest_cpu=5 > > > > migration/12-82 [012] 50419.315001: sched_switch: migration/12:82 [0] D ==> kworker/12:1:227 [120] > > > > kworker/12:1-227 [012] 50419.315022: sched_stat_runtime: comm=kworker/12:1 pid=227 runtime=21453 [ns] vruntime=20359477889 [ns] > > > > kworker/12:1-227 [012] 50419.315028: sched_switch: kworker/12:1:227 [120] R ==> swapper/12:0 [120] > > > > <idle>-0 [005] 50419.315053: sched_switch: swapper/5:0 [120] S ==> fio:3185 [120] > > > > fio-3185 [005] 50419.315286: sched_waking: comm=kworker/5:0 pid=2646 prio=120 target_cpu=005 > > > > fio-3185 [005] 50419.315288: sched_stat_runtime: comm=fio pid=3185 runtime=287737 [ns] vruntime=33779011507 [ns] > > > > > > > > And fio is now running on CPU 5 - it only ran on CPU 12 for about > > > > 15ms. Hmmm: > > > > > > > > $ grep fio-3185 ~/tmp/sched.out | awk 'BEGIN {totcpu = 0.0; switches = 0.0; prev_waket = 0.0 }/sched_waking/ { cpu = $2; split($3, t, ":"); waket = t[1]; if (cpu != prev_cpu) { t_on_cpu = waket - prev_waket; if (prev_waket) { print "time on CPU", cpu, "was", t_on_cpu; totcpu += t_on_cpu; switches++ } prev_waket = waket; prev_cpu = cpu; } } END { print "switches", switches, "time on cpu", totcpu, "aver time on cpu", (totcpu / switches) } ' | stats --trim-outliers > > > > switches 2211 time on cpu 30.0994 aver time on cpu 0.0136135 > > > > time on CPU [0-23(8.8823+/-6.2)] was 0.000331-0.330772(0.0134759+/-0.012) > > > > > > > > Yeah, the fio task averages 13.4ms on any given CPU before being > > > > switched to another CPU. Mind you, the stddev is 12ms, so the range > > > > of how long it spends on any one CPU is pretty wide (330us to > > > > 330ms). > > > > > > > Hey Dave > > > > > > > IOWs, this doesn't look like a workqueue problem at all - this looks > > > > > > Surprised to see you're so sure it has little to do with wq, > > > > > > > like the scheduler is repeatedly making the wrong load balancing > > > > decisions when mixing a very short runtime task (queued work) with a > > > > long runtime task on the same CPU.... > > > > > > > and it helps more to know what is driving lb to make decisions like > > > this. Because for 70+ per cent of communters in cities like London it > > > is supposed tube is better than cab on work days, the end_io cb is > > > tweaked to be a lookalike of execute_in_process_context() in the diff > > > with the devoted s_dio_done_wq taken out of account. It's interesting > > > to see what difference lb will make in the tube environment. > > > > > > Hillf > > > > > > > This is not my area of expertise, so I have no idea why this might > > > > be happening. Scheduler experts: is this expected behaviour? What > > > > tunables directly influence the active load balancer (and/or CONFIG > > > > options) to change how aggressive it's behaviour is? > > > > > > > > > Not reproduced the issue with 512 block size on the RH system yet, > > > > > maybe it is related with my kernel config. 
> > > > > > > > I doubt it - this looks like a load specific corner case in the > > > > scheduling algorithm.... > > > > Cheers, > > > > Dave. > > > > -- > > > > Dave Chinner > > > > david@fromorbit.com > > > --- a/fs/iomap/direct-io.c > > > +++ b/fs/iomap/direct-io.c > > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > > > WRITE_ONCE(dio->submit.waiter, NULL); > > > blk_wake_io_task(waiter); > > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > > - struct inode *inode = file_inode(dio->iocb->ki_filp); > > > - > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > > + schedule_work(&dio->aio.work); > > > > I'm not sure that this will make a real difference because it ends up > > to call queue_work(system_wq, ...) and system_wq is bounded as well so > > the work will still be pinned to a CPU > > Using system_unbound_wq should make a difference because it doesn't > > pin the work on a CPU > > + queue_work(system_unbound_wq, &dio->aio.work); > > Indeed, just run a quick test on my KVM guest, looks the following patch > makes a difference: > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 9329ced91f1d..2f4488b0ecec 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb) > { > struct workqueue_struct *old; > struct workqueue_struct *wq = alloc_workqueue("dio/%s", > - WQ_MEM_RECLAIM, 0, > + WQ_MEM_RECLAIM | > + WQ_UNBOUND, 0, > sb->s_id); Chiming in to report that, after Vincent suggested trying the very same change in a private IRC conversation, I saw the very same positive effects. E.g., Baseline is using 'slow' sched parameters (10ms/15ms), baseline 3/4 is the same kernel just with the default 'fast' sched params, and dio unbound 10/15 is a kernel patched with the above patch and using 'slow' sched params.

fsperf-4k-5G                          baseline 3/4   dio unbound 10/15
                                      improv.        improv.
sequential
    initial write   throughput           -0.12           -0.87
                    latency               1.15            2.87
    overwrite       throughput         1186.32         1232.74
                    latency              11.35           11.26
    read            throughput            0.80           -2.38
                    latency               2.79            3.03
random
    initial write   throughput           84.73          105.34
                    latency               0.00            1.14
    overwrite       throughput          254.27          273.17
                    latency               1.23            7.37
    read            throughput           40.05            3.24
                    latency               8.03            7.36

Best, Juri
On Mon, 2 Dec 2019 at 05:02, Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Dec 02, 2019 at 10:46:25AM +0800, Ming Lei wrote: > > On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote: > > > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > > --- a/fs/iomap/direct-io.c > > > > +++ b/fs/iomap/direct-io.c > > > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > > > > WRITE_ONCE(dio->submit.waiter, NULL); > > > > blk_wake_io_task(waiter); > > > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > > > - struct inode *inode = file_inode(dio->iocb->ki_filp); > > > > - > > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > > > + schedule_work(&dio->aio.work); > > > > > > I'm not sure that this will make a real difference because it ends up > > > to call queue_work(system_wq, ...) and system_wq is bounded as well so > > > the work will still be pinned to a CPU > > > Using system_unbound_wq should make a difference because it doesn't > > > pin the work on a CPU > > > + queue_work(system_unbound_wq, &dio->aio.work); > > > > Indeed, just run a quick test on my KVM guest, looks the following patch > > makes a difference: > > > > diff --git a/fs/direct-io.c b/fs/direct-io.c > > index 9329ced91f1d..2f4488b0ecec 100644 > > --- a/fs/direct-io.c > > +++ b/fs/direct-io.c > > @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb) > > { > > struct workqueue_struct *old; > > struct workqueue_struct *wq = alloc_workqueue("dio/%s", > > - WQ_MEM_RECLAIM, 0, > > + WQ_MEM_RECLAIM | > > + WQ_UNBOUND, 0, > > sb->s_id); > > That's not an answer to the user task migration issue. > > That is, all this patch does is trade user task migration when the > CPU is busy for migrating all the queued work off the CPU so the > user task does not get migrated. IOWs, this forces all the queued > work to be migrated rather than the user task. IOWs, it does not > address the issue we've exposed in the scheduler between tasks with > competing CPU affinity scheduling requirements - it just hides the > symptom. > > Maintaining CPU affinity across dispatch and completion work has > been proven to be a significant performance win. Right throughout > the IO stack we try to keep this submitter/completion affinity, > and that's the whole point of using a bound wq in the first place: > efficient delayed batch processing of work on the local CPU. Do you really want to target the same CPU ? looks like what you really want to target the same cache instead > > Spewing deferred completion work across every idle CPU in the > machine because the local cpu is temporarily busy is a bad choice, > both from a performance perspective (dirty cacheline bouncing) and > from a power efficiency point of view as it causes CPUs to be taken > out of idle state much more frequently[*]. > > The fact that the scheduler migrates the user task we use workqueues > for deferred work as they were intended doesn't make this a > workqueue problem. If the answer to this problem is "make all IO > workqueues WQ_UNBOUND" then we are effectively saying "the scheduler > has unfixable problems when mixing bound and unbound work on the > same run queue". > > And, besides, what happens when every other CPU is also completely > busy and can't run the work in a timely fashion? We've just moved > the work to some random CPU where we wait to be scheduled instead of > just sitting on the local CPU and waiting.... 
> > So, yes, we can work around the -symptoms- we see (frequent user > task migration) by changing the work queue configuration or > bypassing the workqueue for this specific workload. But these only > address the visible symptom and don't take into account the wider > goals of retaining CPU affinity in the IO stack, and they will have > variable scheduling latency and performance as the overall > system load changes. > > So, we can fiddle with workqueues, but it doesn't address the > underlying issue that the scheduler appears to be migrating > non-bound tasks off a busy CPU too easily.... The root cause of the problem is that the sched_wakeup_granularity_ns is in the same range as, or higher than, the load balance period. As Peter explained, this makes the kworker wait for the CPU for several load balance periods, and a transient unbalanced state becomes a stable one that the scheduler then tries to fix. With the default values, the scheduler doesn't try to migrate any task. Then, I agree that having an ack close to the request makes sense, but forcing it onto the exact same CPU is too restrictive IMO. Being able to use another CPU on the same core should not harm the performance and may even improve it. And that may still be the case while CPUs share their cache. > > -Dave. > > [*] Pay attention to the WQ_POWER_EFFICIENT definition for a work > queue: it's designed for interrupt routines that defer work via work > queues to avoid doing work on otherwise idle CPUs. It does this by > turning the per-cpu wq into an unbound wq so that work gets > scheduled on non-idle CPUs in preference to the local idle CPU > which can then remain in low power states. > > That's the exact opposite of what using WQ_UNBOUND ends up doing in > this IO completion context: it pushes the work out over idle CPUs > rather than keeping it confined on the already busy CPUs where CPU > affinity allows the work to be done quickly. So while WQ_UNBOUND > avoids the user task being migrated frequently, it results in the > work being spread around many more CPUs and we burn more power to do > the same work. > > -- > Dave Chinner > david@fromorbit.com
Hi Vincent, On Mon, Dec 02, 2019 at 02:45:42PM +0100 Vincent Guittot wrote: > On Mon, 2 Dec 2019 at 05:02, Dave Chinner <david@fromorbit.com> wrote: ... > > So, we can fiddle with workqueues, but it doesn't address the > > underlying issue that the scheduler appears to be migrating > > non-bound tasks off a busy CPU too easily.... > > The root cause of the problem is that the sched_wakeup_granularity_ns > is in the same range as, or higher than, the load balance period. As Peter > explained, this makes the kworker wait for the CPU for several load > balance periods, and a transient unbalanced state becomes a stable one that the > scheduler then tries to fix. With the default values, the scheduler doesn't try to > migrate any task. There are actually two issues here. With the high wakeup granularity we get the user task actively migrated. This causes the significant performance hit Ming was showing. With the fast wakeup_granularity (or smaller IOs - 512 instead of 4k) we get, instead, the user task migrated at wakeup to a new CPU for every IO completion. This is the 11k migrations per sec doing 11k iops. In this test it is not by itself causing the measured performance issue. It generally flips back and forth between 2 cpus for large periods. I think it is crossing cache boundaries at times (but I have not looked closely at the traces compared to the topology, yet). The active balances are what really hurt in this case but I agree that seems to be a tuning problem. Cheers, Phil > > > > Then, I agree that having an ack close to the request makes sense but > forcing it on the exact same CPU is too restrictive IMO. Being able to > use another CPU on the same core should not harm the performance and > may even improve it. And that may still be the case while CPUs share > their cache. > > > > > > -Dave. > > > > [*] Pay attention to the WQ_POWER_EFFICIENT definition for a work > > queue: it's designed for interrupt routines that defer work via work > > queues to avoid doing work on otherwise idle CPUs. It does this by > > turning the per-cpu wq into an unbound wq so that work gets > > scheduled on non-idle CPUs in preference to the local idle CPU > > which can then remain in low power states. > > > > That's the exact opposite of what using WQ_UNBOUND ends up doing in > > this IO completion context: it pushes the work out over idle CPUs > > rather than keeping it confined on the already busy CPUs where CPU > > affinity allows the work to be done quickly. So while WQ_UNBOUND > > avoids the user task being migrated frequently, it results in the > > work being spread around many more CPUs and we burn more power to do > > the same work. > > > > -- > > Dave Chinner > > david@fromorbit.com > -- >
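[Since the question of whether the task is bouncing across cache boundaries comes up here, a quick way to check is to compare the CPUs seen in the migration trace against the LLC topology exported in sysfs. A small helper for that, assuming index3 is the shared last-level cache, which is typical on x86:

/* llc_peers.c - print the CPUs sharing the LLC with a given CPU */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	char path[128], buf[256];
	int cpu = (argc > 1) ? atoi(argv[1]) : 0;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list",
		 cpu);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("CPU %d shares its LLC with CPUs %s", cpu, buf);
	fclose(f);
	return 0;
}

If the orig_cpu/dest_cpu pairs from sched_migrate_task fall outside each other's shared_cpu_list, the wakeup migrations really are crossing cache boundaries.]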
On Mon, Dec 02, 2019 at 05:01:58PM +0800, Hillf Danton wrote: > > On Mon, 2 Dec 2019 14:08:44 +1100 Dave Chinner wrote: > > On Thu, Nov 28, 2019 at 05:40:03PM +0800, Hillf Danton wrote: > > > On Sat, 16 Nov 2019 10:40:05 Dave Chinner wrote: > > > > Yeah, the fio task averages 13.4ms on any given CPU before being > > > > switched to another CPU. Mind you, the stddev is 12ms, so the range > > > > of how long it spends on any one CPU is pretty wide (330us to > > > > 330ms). > > > > > > > Hey Dave > > > > > > > IOWs, this doesn't look like a workqueue problem at all - this looks > > > > > > Surprised to see you're so sure it has little to do with wq, > > > > Because I understand how the workqueue is used here. > > > > Essentially, the workqueue is not necessary for a -pure- overwrite > > where no metadata updates or end-of-io filesystem work is required. > > > > However, change the workload just slightly, such as allocating the > > space, writing into preallocated space (unwritten extents), using > > AIO writes to extend the file, using O_DSYNC, etc, and we *must* > > use a workqueue as we have to take blocking locks and/or run > > transactions. > > > > These may still be very short (e.g. updating inode size) and in most > > cases will not block, but if they do, then if we don't move the work > > out of the block layer completion context (i.e. softirq running the > > block bh) then we risk deadlocking the code. > > > > Not to mention none of the filesytem inode locks are irq safe. > > > > IOWs, we can remove the workqueue for this -one specific instance- > > but it does not remove the requirement for using a workqueue for all > > the other types of write IO that pass through this code. > > > So it's not true that it doesn't has anything to do with workqueue. You misunderstood what I was saying. I meant that this adverse schdeuler behaviour is not *unique to this specific workqueue instance* or workload. There are another 5+ workqueues in XFS alone that are based around the same "do all the deferred work on the same CPU" queuing behaviour. Several of them are IO completion processing workqueues, and it is designed this way to avoid running completion work that access common structures across all the CPUs in the system. And, FWIW, we've had this "per-cpu delayed work" processing mechanism in XFS since ~2002 when per-cpu work queues were introduced in ~2.5.40. What we are doing with workqueues here is not new or novel, and it's worked just fine for most of this time... > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > > + schedule_work(&dio->aio.work); > > > > This does nothing but change the workqueue from a per-sb wq to the > > system wq. The work is still bound to the same CPU it is queued on, > > so nothing will change. > > > The system wq is enough here to make some visible difference as CFS will > be looking to make new lb decision in particular when submitter and > completion are running on different CPUs. That's noise caused by slightly different loading of the system workqueue vs a private work queue. It's likely just enough to move the scheduler out of the window where it makes incorrect decisions. i.e. Add a bit more user load or load onto other CPUs, and the problem will reappear. As I said, this is *not* a fix for the problem - it just moves it around so that you can't see it for this specific workload instance. 
> It's claimed that "Maintaining CPU affinity across dispatch and completion > work has been proven to be a significant performance win." If completion > is running in the softirq context then it would take some time to sort > out why irq (not CPU) affinity is making difference across CPUs. We use irq steering to provide CPU affinity for the structures being used by completion because they are the same ones used by submission. If completion happens quickly enough, those structures are still hot in the cache of the submission CPU, and so we don't drag bio and filesystem structures out of the CPU cache they sit in by steering the completion to the submission CPU. Most of the modern high performance storage hardware has hardware interrupt steering so the block layer doesn't have to do this. See __blk_mq_complete_request() and __blk_complete_request(). If the device has multiple hardware queues, they are already delivering CPU affine completions. Otherwise __blk_complete_request() uses IPIs to steer the completion to a CPU that shares a cache with the submission CPU.... IOWs, we are trying to ensure that we run the data IO completion on the CPU that has that data hot in cache. When we are running millions of IOs every second, this matters -a lot-. IRQ steering is just a mechanism that is used to ensure completion processing hits hot caches. Cheers, Dave.
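[The steering described here boils down to "complete locally if the interrupted CPU already shares a cache with the submission CPU, otherwise IPI the submission CPU". The sketch below is a simplified, renamed illustration of that idea; it is not the actual __blk_complete_request()/__blk_mq_complete_request() code, and struct my_request is an invented stand-in.

#include <linux/smp.h>
#include <linux/sched/topology.h>	/* cpus_share_cache() */

struct my_request {
	call_single_data_t	csd;		/* assumed zero-initialised */
	int			submit_cpu;	/* recorded at submission time */
	void			(*complete)(struct my_request *rq);
};

static void my_complete_remote(void *data)
{
	struct my_request *rq = data;

	rq->complete(rq);		/* runs on rq->submit_cpu via IPI */
}

static void my_steer_completion(struct my_request *rq)
{
	int cpu = get_cpu();

	if (cpu == rq->submit_cpu || cpus_share_cache(cpu, rq->submit_cpu)) {
		/* Already cache-hot relative to the submitter: run it here. */
		rq->complete(rq);
	} else {
		/* Kick the submission CPU so completion hits its hot cache. */
		rq->csd.func = my_complete_remote;
		rq->csd.info = rq;
		smp_call_function_single_async(rq->submit_cpu, &rq->csd);
	}
	put_cpu();
}

The point of contention in the thread is only what happens after this step, i.e. which CPU the deferred workqueue item then runs on.]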
On Mon, Dec 02, 2019 at 02:45:42PM +0100, Vincent Guittot wrote: > On Mon, 2 Dec 2019 at 05:02, Dave Chinner <david@fromorbit.com> wrote: > > > > On Mon, Dec 02, 2019 at 10:46:25AM +0800, Ming Lei wrote: > > > On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote: > > > > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > > > --- a/fs/iomap/direct-io.c > > > > > +++ b/fs/iomap/direct-io.c > > > > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > > > > > WRITE_ONCE(dio->submit.waiter, NULL); > > > > > blk_wake_io_task(waiter); > > > > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > > > > - struct inode *inode = file_inode(dio->iocb->ki_filp); > > > > > - > > > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > > > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > > > > + schedule_work(&dio->aio.work); > > > > > > > > I'm not sure that this will make a real difference because it ends up > > > > to call queue_work(system_wq, ...) and system_wq is bounded as well so > > > > the work will still be pinned to a CPU > > > > Using system_unbound_wq should make a difference because it doesn't > > > > pin the work on a CPU > > > > + queue_work(system_unbound_wq, &dio->aio.work); > > > > > > Indeed, just run a quick test on my KVM guest, looks the following patch > > > makes a difference: > > > > > > diff --git a/fs/direct-io.c b/fs/direct-io.c > > > index 9329ced91f1d..2f4488b0ecec 100644 > > > --- a/fs/direct-io.c > > > +++ b/fs/direct-io.c > > > @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb) > > > { > > > struct workqueue_struct *old; > > > struct workqueue_struct *wq = alloc_workqueue("dio/%s", > > > - WQ_MEM_RECLAIM, 0, > > > + WQ_MEM_RECLAIM | > > > + WQ_UNBOUND, 0, > > > sb->s_id); > > > > That's not an answer to the user task migration issue. > > > > That is, all this patch does is trade user task migration when the > > CPU is busy for migrating all the queued work off the CPU so the > > user task does not get migrated. IOWs, this forces all the queued > > work to be migrated rather than the user task. IOWs, it does not > > address the issue we've exposed in the scheduler between tasks with > > competing CPU affinity scheduling requirements - it just hides the > > symptom. > > > > Maintaining CPU affinity across dispatch and completion work has > > been proven to be a significant performance win. Right throughout > > the IO stack we try to keep this submitter/completion affinity, > > and that's the whole point of using a bound wq in the first place: > > efficient delayed batch processing of work on the local CPU. > > Do you really want to target the same CPU ? looks like what you really > want to target the same cache instead Well, yes, ideally we want to target the same cache, but we can't do that with workqueues. However, the block layer already does that same-cache steering for it's directed completions (see __blk_mq_complete_request()), so we are *already running in a "hot cache" CPU context* when we queue work. When we queue to the same CPU, we are simply maintaining the "cache-hot" context that we are already running in. Besides, selecting a specific "hot cache" CPU and bind the work to that CPU (via queue_work_on()) doesn't fix the scheduler problem - it just moves it to another CPU. If the destination CPU is loaded like the local CPU, then it's jsut going to cause migrations on the destination CPU instead of the local CPU. 
IOWs, this is -not a fix- for the scheduler making incorrect migration decisions when we are mixing bound and unbound tasks on the local run queue. Yes, it will hide the problem from this specific workload instance but it doesn't fix it. We'll just hit it under heavier load, such as when production workloads start running AIO submission from tens of CPUs at a time while burning near 100% CPU in userspace....... Cheers, Dave.
On Tue, Dec 03, 2019 at 10:53:21AM +1100, Dave Chinner wrote: > On Mon, Dec 02, 2019 at 02:45:42PM +0100, Vincent Guittot wrote: > > On Mon, 2 Dec 2019 at 05:02, Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Mon, Dec 02, 2019 at 10:46:25AM +0800, Ming Lei wrote: > > > > On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote: > > > > > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > > > > --- a/fs/iomap/direct-io.c > > > > > > +++ b/fs/iomap/direct-io.c > > > > > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > > > > > > WRITE_ONCE(dio->submit.waiter, NULL); > > > > > > blk_wake_io_task(waiter); > > > > > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > > > > > - struct inode *inode = file_inode(dio->iocb->ki_filp); > > > > > > - > > > > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > > > > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > > > > > + schedule_work(&dio->aio.work); > > > > > > > > > > I'm not sure that this will make a real difference because it ends up > > > > > to call queue_work(system_wq, ...) and system_wq is bounded as well so > > > > > the work will still be pinned to a CPU > > > > > Using system_unbound_wq should make a difference because it doesn't > > > > > pin the work on a CPU > > > > > + queue_work(system_unbound_wq, &dio->aio.work); > > > > > > > > Indeed, just run a quick test on my KVM guest, looks the following patch > > > > makes a difference: > > > > > > > > diff --git a/fs/direct-io.c b/fs/direct-io.c > > > > index 9329ced91f1d..2f4488b0ecec 100644 > > > > --- a/fs/direct-io.c > > > > +++ b/fs/direct-io.c > > > > @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb) > > > > { > > > > struct workqueue_struct *old; > > > > struct workqueue_struct *wq = alloc_workqueue("dio/%s", > > > > - WQ_MEM_RECLAIM, 0, > > > > + WQ_MEM_RECLAIM | > > > > + WQ_UNBOUND, 0, > > > > sb->s_id); > > > > > > That's not an answer to the user task migration issue. > > > > > > That is, all this patch does is trade user task migration when the > > > CPU is busy for migrating all the queued work off the CPU so the > > > user task does not get migrated. IOWs, this forces all the queued > > > work to be migrated rather than the user task. IOWs, it does not > > > address the issue we've exposed in the scheduler between tasks with > > > competing CPU affinity scheduling requirements - it just hides the > > > symptom. > > > > > > Maintaining CPU affinity across dispatch and completion work has > > > been proven to be a significant performance win. Right throughout > > > the IO stack we try to keep this submitter/completion affinity, > > > and that's the whole point of using a bound wq in the first place: > > > efficient delayed batch processing of work on the local CPU. > > > > Do you really want to target the same CPU ? looks like what you really > > want to target the same cache instead > > Well, yes, ideally we want to target the same cache, but we can't do > that with workqueues. > > However, the block layer already does that same-cache steering for > it's directed completions (see __blk_mq_complete_request()), so we > are *already running in a "hot cache" CPU context* when we queue > work. When we queue to the same CPU, we are simply maintaining the > "cache-hot" context that we are already running in. __blk_mq_complete_request() doesn't always complete the request on the submission CPU, which is only done in case of 1:1 queue mapping and N:1 mapping when nr_hw_queues < nr_nodes. 
Also, the default completion flag is SAME_GROUP, which just requires the completion CPU to share a cache with the submission CPU: #define QUEUE_FLAG_SAME_COMP 4 /* complete on same CPU-group */ Thanks, Ming
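[The same-group/same-CPU behaviour being discussed is exposed to userspace through the per-device rq_affinity queue attribute; as I understand it, 1 corresponds to "complete within the submitting CPU's group" and 2 to "force completion on the exact submitting CPU". A trivial checker:

/* rq_affinity.c - show how completions are steered for a block device */
#include <stdio.h>

int main(int argc, char **argv)
{
	char path[256];
	const char *dev = (argc > 1) ? argv[1] : "sda";
	int val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/rq_affinity", dev);
	f = fopen(path, "r");
	if (!f || fscanf(f, "%d", &val) != 1) {
		perror(path);
		return 1;
	}
	fclose(f);
	printf("%s: rq_affinity = %d (%s)\n", dev, val,
	       val == 2 ? "force completion on the submitting CPU" :
	       val == 1 ? "complete within the submitting CPU's group" :
			  "no completion steering");
	return 0;
}

The device name defaults to "sda" purely for illustration.]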
On Mon, 2 Dec 2019 at 22:22, Phil Auld <pauld@redhat.com> wrote: > > Hi Vincent, > > On Mon, Dec 02, 2019 at 02:45:42PM +0100 Vincent Guittot wrote: > > On Mon, 2 Dec 2019 at 05:02, Dave Chinner <david@fromorbit.com> wrote: > > ... > > > > So, we can fiddle with workqueues, but it doesn't address the > > > underlying issue that the scheduler appears to be migrating > > > non-bound tasks off a busy CPU too easily.... > > > > The root cause of the problem is that the sched_wakeup_granularity_ns > > is in the same range or higher than load balance period. As Peter > > explained, This make the kworker waiting for the CPU for several load > > period and a transient unbalanced state becomes a stable one that the > > scheduler to fix. With default value, the scheduler doesn't try to > > migrate any task. > > There are actually two issues here. With the high wakeup granularity > we get the user task actively migrated. This causes the significant > performance hit Ming was showing. With the fast wakeup_granularity > (or smaller IOs - 512 instead of 4k) we get, instead, the user task > migrated at wakeup to a new CPU for every IO completion. Ok, I haven't noticed that this one was a problem too. Do we have perf regression ? > > This is the 11k migrations per sec doing 11k iops. In this test it > is not by itself causing the measured performance issue. It generally > flips back and forth between 2 cpus for large periods. I think it is > crossing cache boundaries at times (but I have not looked closely > at the traces compared to the topology, yet). At task wake up, scheduler compares local and previous CPU to decide where to place the task and will then try to find an idle one which shares cache so I don't expect that it will cross cache boundary as local and previous are in your case. > > The active balances are what really hurts in thie case but I agree > that seems to be a tuning problem. > > > Cheers, > Phil > > > > > > Then, I agree that having an ack close to the request makes sense but > > forcing it on the exact same CPU is too restrictive IMO. Being able to > > use another CPU on the same core should not harm the performance and > > may even improve it. And that may still be the case while CPUs share > > their cache. > > > > > > > > -Dave. > > > > > > [*] Pay attention to the WQ_POWER_EFFICIENT definition for a work > > > queue: it's designed for interrupt routines that defer work via work > > > queues to avoid doing work on otherwise idle CPUs. It does this by > > > turning the per-cpu wq into an unbound wq so that work gets > > > scheduled on a non-idle CPUs in preference to the local idle CPU > > > which can then remain in low power states. > > > > > > That's the exact opposite of what using WQ_UNBOUND ends up doing in > > > this IO completion context: it pushes the work out over idle CPUs > > > rather than keeping them confined on the already busy CPUs where CPU > > > affinity allows the work to be done quickly. So while WQ_UNBOUND > > > avoids the user task being migrated frequently, it results in the > > > work being spread around many more CPUs and we burn more power to do > > > the same work. > > > > > > -- > > > Dave Chinner > > > david@fromorbit.com > > > > -- >
On Tue, 3 Dec 2019 at 00:53, Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Dec 02, 2019 at 02:45:42PM +0100, Vincent Guittot wrote: > > On Mon, 2 Dec 2019 at 05:02, Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Mon, Dec 02, 2019 at 10:46:25AM +0800, Ming Lei wrote: > > > > On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote: > > > > > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > > > > --- a/fs/iomap/direct-io.c > > > > > > +++ b/fs/iomap/direct-io.c > > > > > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > > > > > > WRITE_ONCE(dio->submit.waiter, NULL); > > > > > > blk_wake_io_task(waiter); > > > > > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > > > > > - struct inode *inode = file_inode(dio->iocb->ki_filp); > > > > > > - > > > > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > > > > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > > > > > + schedule_work(&dio->aio.work); > > > > > > > > > > I'm not sure that this will make a real difference because it ends up > > > > > to call queue_work(system_wq, ...) and system_wq is bounded as well so > > > > > the work will still be pinned to a CPU > > > > > Using system_unbound_wq should make a difference because it doesn't > > > > > pin the work on a CPU > > > > > + queue_work(system_unbound_wq, &dio->aio.work); > > > > > > > > Indeed, just run a quick test on my KVM guest, looks the following patch > > > > makes a difference: > > > > > > > > diff --git a/fs/direct-io.c b/fs/direct-io.c > > > > index 9329ced91f1d..2f4488b0ecec 100644 > > > > --- a/fs/direct-io.c > > > > +++ b/fs/direct-io.c > > > > @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb) > > > > { > > > > struct workqueue_struct *old; > > > > struct workqueue_struct *wq = alloc_workqueue("dio/%s", > > > > - WQ_MEM_RECLAIM, 0, > > > > + WQ_MEM_RECLAIM | > > > > + WQ_UNBOUND, 0, > > > > sb->s_id); > > > > > > That's not an answer to the user task migration issue. > > > > > > That is, all this patch does is trade user task migration when the > > > CPU is busy for migrating all the queued work off the CPU so the > > > user task does not get migrated. IOWs, this forces all the queued > > > work to be migrated rather than the user task. IOWs, it does not > > > address the issue we've exposed in the scheduler between tasks with > > > competing CPU affinity scheduling requirements - it just hides the > > > symptom. > > > > > > Maintaining CPU affinity across dispatch and completion work has > > > been proven to be a significant performance win. Right throughout > > > the IO stack we try to keep this submitter/completion affinity, > > > and that's the whole point of using a bound wq in the first place: > > > efficient delayed batch processing of work on the local CPU. > > > > Do you really want to target the same CPU ? looks like what you really > > want to target the same cache instead > > Well, yes, ideally we want to target the same cache, but we can't do > that with workqueues. Yes, this seems to be your main problem IMHO. You want to stay on the same cache and the only way to do so it to pin the work on one single CPU. But by doing so and increasing sched_wakeup_granularity_ns, the scheduler detects an imbalanced state because of pinned task that it wants to fix. 
Being able to set the work on a cpumask that covers the cache would be the solution, so the scheduler would be able to select an idle CPU that shares the cache instead of being pinned to a single CPU. > > However, the block layer already does that same-cache steering for > its directed completions (see __blk_mq_complete_request()), so we > are *already running in a "hot cache" CPU context* when we queue > work. When we queue to the same CPU, we are simply maintaining the > "cache-hot" context that we are already running in. > > Besides, selecting a specific "hot cache" CPU and binding the work to > that CPU (via queue_work_on()) doesn't fix the scheduler problem - > it just moves it to another CPU. If the destination CPU is loaded > like the local CPU, then it's just going to cause migrations on the > destination CPU instead of the local CPU. > > IOWs, this is -not a fix- for the scheduler making incorrect > migration decisions when we are mixing bound and unbound tasks on > the local run queue. Yes, it will hide the problem from this > specific workload instance but it doesn't fix it. We'll just hit it > under heavier load, such as when production workloads start running > AIO submission from tens of CPUs at a time while burning near 100% > CPU in userspace....... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
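As an illustration only of the gap between the two positions: with today's workqueue API, a "cpumask that covers the cache" can only be approximated by picking one CPU out of the submitter's last-level-cache mask and binding the work there, which is exactly the per-CPU binding being objected to above. A hedged sketch follows; queue_work_in_llc() is a made-up helper, and treating cpu_coregroup_mask() as the LLC mask is an assumption that holds on x86:

#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/smp.h>
#include <linux/topology.h>
#include <linux/workqueue.h>

/*
 * Illustrative only: approximate "queue within the submitter's cache
 * domain" with the existing API.  queue_work_on() still binds the work
 * to one specific CPU, so this does not let the scheduler pick freely
 * inside the LLC -- which is the gap being discussed in this thread.
 */
static void queue_work_in_llc(struct workqueue_struct *wq,
			      struct work_struct *work)
{
	int this_cpu = raw_smp_processor_id();
	int cpu;

	/* cpu_coregroup_mask() is the LLC-sharing mask on x86. */
	for_each_cpu(cpu, cpu_coregroup_mask(this_cpu)) {
		if (cpu != this_cpu && available_idle_cpu(cpu)) {
			queue_work_on(cpu, wq, work);
			return;
		}
	}

	/* No idle sibling found: fall back to the local CPU. */
	queue_work_on(this_cpu, wq, work);
}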
On Tue, Dec 03, 2019 at 09:15:14PM +0800, Hillf Danton wrote: > > IOWs, we are trying to ensure that we run the data IO completion on > > the CPU with that has that data hot in cache. When we are running > > millions of IOs every second, this matters -a lot-. IRQ steering is > > just a mechansim that is used to ensure completion processing hits > > hot caches. > > Along the "CPU affinity" direction, a trade-off is made between CPU > affinity and cache affinity before lb can bear the ca scheme. > Completion works are queued in round robin on the CPUs that share > cache with the submission CPU. > > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -143,6 +143,42 @@ static inline void iomap_dio_set_error(s > cmpxchg(&dio->error, 0, ret); > } > > +static DEFINE_PER_CPU(int, iomap_dio_bio_end_io_cnt); > +static DEFINE_PER_CPU(int, iomap_dio_bio_end_io_cpu); > +#define IOMAP_DIO_BIO_END_IO_BATCH 7 > + > +static int iomap_dio_cpu_rr(void) > +{ > + int *io_cnt, *io_cpu; > + int cpu, this_cpu; > + > + io_cnt = get_cpu_ptr(&iomap_dio_bio_end_io_cnt); > + io_cpu = this_cpu_ptr(&iomap_dio_bio_end_io_cpu); > + this_cpu = smp_processor_id(); > + > + if (!(*io_cnt & IOMAP_DIO_BIO_END_IO_BATCH)) { > + for (cpu = *io_cpu + 1; cpu < nr_cpu_id; cpu++) > + if (cpu == this_cpu || > + cpus_share_cache(cpu, this_cpu)) > + goto update_cpu; > + > + for (cpu = 0; cpu < *io_cpu; cpu++) > + if (cpu == this_cpu || > + cpus_share_cache(cpu, this_cpu)) > + goto update_cpu; Linear scans like this just don't scale. We can have thousands of CPUs in a system and maybe only 8 cores that share a local cache. And we can be completing millions of direct IO writes a second these days. A linear scan of (thousands - 8) cpu ids every so often is going to show up as long tail latency for the unfortunate IO that has to scan those thousands of non-matching CPU IDs to find a sibling, and we'll be doing that every handful of IOs that are completed on every CPU. > + > + cpu = this_cpu; > +update_cpu: > + *io_cpu = cpu; > + } > + > + (*io_cnt)++; > + cpu = *io_cpu; > + put_cpu_ptr(&iomap_dio_bio_end_io_cnt); > + > + return cpu; > +} > > static void iomap_dio_bio_end_io(struct bio *bio) > { > struct iomap_dio *dio = bio->bi_private; > @@ -158,9 +194,10 @@ static void iomap_dio_bio_end_io(struct > blk_wake_io_task(waiter); > } else if (dio->flags & IOMAP_DIO_WRITE) { > struct inode *inode = file_inode(dio->iocb->ki_filp); > + int cpu = iomap_dio_cpu_rr(); IMO, this sort of "limit work to sibling CPU cores" does not belong in general code. We have *lots* of workqueues that need this treatment, and it's not viable to add this sort of linear search loop to every workqueue and place we queue work. Besides.... > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > + queue_work_on(cpu, inode->i_sb->s_dio_done_wq, &dio->aio.work); .... as I've stated before, this *does not solve the scheduler problem*. All this does is move the problem to the target CPU instead of seeing it on the local CPU. If we really want to hack around the load balancer problems in this way, then we need to add a new workqueue concurrency management type with behaviour that lies between the default of bound and WQ_UNBOUND. WQ_UNBOUND limits scheduling to within a numa node - see wq_update_unbound_numa() for how it sets up the cpumask attributes it applies to it's workers - but we need the work to be bound to within the local cache domain rather than a numa node. 
IOWs, set up the kworker task pool management structure with the right attributes (e.g. cpu masks) to define the cache domains, add all the hotplug code to make it work with CPU hotplug, then simply apply those attributes to the kworker task that is selected to execute the work. This allows the scheduler to migrate the kworker away from the local run queue without interrupting the currently scheduled task. The cpumask the task is configured with limits the scheduler to selecting the best CPU within the local cache domain, and we don't have to bind work to CPUs to get CPU cache friendly work scheduling. This also avoids the overhead of a per-queue_work_on() sibling CPU calculation, and all that code wanting to use this functionality needs to do is add a single flag at work queue init time (e.g. WQ_CACHEBOUND). IOWs, if the task migration behaviour cannot be easily fixed and so we need work queue users to be more flexible about work placement, then the solution needed here is "cpu cache local work queue scheduling" implemented in the work queue infrastructure, not in every workqueue user. Cheers, Dave.
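To make the scale of the proposed change concrete: if such a WQ_CACHEBOUND flag existed (it does not in mainline; the name comes from the suggestion above), the only change a user like the direct IO completion path would need is at workqueue creation time, e.g.:

/* Hypothetical -- WQ_CACHEBOUND is not a real workqueue flag today. */
struct workqueue_struct *wq = alloc_workqueue("dio/%s",
					      WQ_MEM_RECLAIM | WQ_CACHEBOUND,
					      0, sb->s_id);

iomap_dio_bio_end_io() would keep calling plain queue_work(); the cache-domain selection would happen once inside the workqueue code when the kworker pool is chosen, not on every completion.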
On Tue, Dec 03, 2019 at 10:45:38AM +0100, Vincent Guittot wrote: > On Mon, 2 Dec 2019 at 22:22, Phil Auld <pauld@redhat.com> wrote: > > > > Hi Vincent, > > > > On Mon, Dec 02, 2019 at 02:45:42PM +0100 Vincent Guittot wrote: > > > On Mon, 2 Dec 2019 at 05:02, Dave Chinner <david@fromorbit.com> wrote: > > > > ... > > > > > > So, we can fiddle with workqueues, but it doesn't address the > > > > underlying issue that the scheduler appears to be migrating > > > > non-bound tasks off a busy CPU too easily.... > > > > > > The root cause of the problem is that the sched_wakeup_granularity_ns > > > is in the same range as or higher than the load balance period. As Peter > > > explained, this makes the kworker wait for the CPU for several load > > > balance periods, and a transient unbalanced state becomes a stable one that the > > > scheduler then tries to fix. With the default values, the scheduler doesn't try to > > > migrate any task. > > > > There are actually two issues here. With the high wakeup granularity > > we get the user task actively migrated. This causes the significant > > performance hit Ming was showing. With the fast wakeup_granularity > > (or smaller IOs - 512 instead of 4k) we get, instead, the user task > > migrated at wakeup to a new CPU for every IO completion. > > Ok, I hadn't noticed that this one was a problem too. Do we have a perf > regression? Here are the test results on one server (Dell R630, Haswell-E):

kernel.sched_wakeup_granularity_ns = 4000000
kernel.sched_min_granularity_ns = 3000000

---------------------------------------
 test                            | IOPS
---------------------------------------
 ./xfs_complete 512              | 7.8K
---------------------------------------
 taskset -c 8 ./xfs_complete 512 | 9.8K
---------------------------------------

Thanks,
Ming
On Wed, Dec 04, 2019 at 06:29:03PM +0800, Hillf Danton wrote: > > On Wed, 4 Dec 2019 09:29:25 +1100 Dave Chinner wrote: > > > > If we really want to hack around the load balancer problems in this > > way, then we need to add a new workqueue concurrency management type > > with behaviour that lies between the default of bound and WQ_UNBOUND. > > > > WQ_UNBOUND limits scheduling to within a numa node - see > > wq_update_unbound_numa() for how it sets up the cpumask attributes > > it applies to it's workers - but we need the work to be bound to > > within the local cache domain rather than a numa node. IOWs, set up > > the kworker task pool management structure with the right attributes > > (e.g. cpu masks) to define the cache domains, add all the hotplug > > code to make it work with CPU hotplug, then simply apply those > > attributes to the kworker task that is selected to execute the work. > > > > This allows the scheduler to migrate the kworker away from the local > > run queue without interrupting the currently scheduled task. The > > cpumask limits the task is configured with limit the scheduler to > > selecting the best CPU within the local cache domain, and we don't > > have to bind work to CPUs to get CPU cache friendly work scheduling. > > This also avoids overhead of per-queue_work_on() sibling CPU > > calculation, and all the code that wants to use this functionality > > needs to do is add a single flag at work queue init time (e.g. > > WQ_CACHEBOUND). > > > > IOWs, if the task migration behaviour cannot be easily fixed and so > > we need work queue users to be more flexible about work placement, > > then the solution needed here is "cpu cache local work queue > > scheduling" implemented in the work queue infrastructure, not in > > every workqueue user. > > Add WQ_CACHE_BOUND and a user of it and a helper to find cpus that > share cache. <sigh> If you are going to quote my suggestion in full, then please implement it in full. This patch does almost none of what you quoted above - it still has all the problems of the previous version that lead me to write the above. > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -1358,16 +1358,42 @@ static bool is_chained_work(struct workq > return worker && worker->current_pwq->wq == wq; > } > > +static DEFINE_PER_CPU(int, wq_sel_cbc_cnt); > +static DEFINE_PER_CPU(int, wq_sel_cbc_cpu); > +#define WQ_SEL_CBC_BATCH 7 > + > +static int wq_select_cache_bound_cpu(int this_cpu) > +{ > + int *cntp, *cpup; > + int cpu; > + > + cntp = get_cpu_ptr(&wq_sel_cbc_cnt); > + cpup = this_cpu_ptr(&wq_sel_cbc_cpu); > + cpu = *cpup; > + > + if (!(*cntp & WQ_SEL_CBC_BATCH)) { > + cpu = cpus_share_cache_next_cpu(this_cpu, cpu); > + *cpup = cpu; > + } > + (*cntp)++; > + put_cpu_ptr(&wq_sel_cbc_cnt); > + > + return cpu; > +} This selects a specific CPU in the local cache domain at queue_work() time, just like the previous patch. It does not do what I suggested above in reponse to the scalability issues this approach has... > /* > * When queueing an unbound work item to a wq, prefer local CPU if allowed > * by wq_unbound_cpumask. Otherwise, round robin among the allowed ones to > * avoid perturbing sensitive tasks. 
> */ > -static int wq_select_unbound_cpu(int cpu) > +static int wq_select_unbound_cpu(int cpu, bool cache_bound) > { > static bool printed_dbg_warning; > int new_cpu; > > + if (cache_bound) > + return wq_select_cache_bound_cpu(cpu); > + > if (likely(!wq_debug_force_rr_cpu)) { > if (cpumask_test_cpu(cpu, wq_unbound_cpumask)) > return cpu; > @@ -1417,7 +1443,8 @@ static void __queue_work(int cpu, struct > rcu_read_lock(); > retry: > if (req_cpu == WORK_CPU_UNBOUND) > - cpu = wq_select_unbound_cpu(raw_smp_processor_id()); > + cpu = wq_select_unbound_cpu(raw_smp_processor_id(), > + wq->flags & WQ_CACHE_BOUND); And the per-cpu kworker pool selection after we've selected a CPU here binds it to that specific CPU or (if WQ_UNBOUND) the local node. IOWs, this is exactly the same functionality as the last patch, just moved inside the work queue infrastructure. IOWs, apart from the WQ_CACHE_BOUND flag, this patch doesn't implement any of what I suggested above. It does not solve the the problem of bound kworkers kicking running tasks off a CPU so the bound task can run, and it does not allow the scheduler to select the best CPU in the local cache scheduling domain for the kworker to run on. i.e. it still behaves like bound work rather than WQ_UNBOUND work. IMO, this adds the CPU selection to the -wrong function-. We still want the local CPU selected when req_cpu == WORK_CPU_UNBOUND. The following code selects where the kworker will be bound based on the task pool that the workqueue is configured to use: /* pwq which will be used unless @work is executing elsewhere */ if (!(wq->flags & WQ_UNBOUND)) pwq = per_cpu_ptr(wq->cpu_pwqs, cpu); else pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu)); i.e. the local CPU is the key we need to look up the task pool for running tasks in the local cache domain - we do not use it as the CPU we want to run work on. IOWs, what we really want is this: if (wq->flags & WQ_UNBOUND) pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu)); else if (wq->flags & WQ_CACHE_BOUND) pwq = unbound_pwq_by_cache(wq, cpu); else pwq = per_cpu_ptr(wq->cpu_pwqs, cpu); And then unbound_pwq_by_cache() is implemented in a similar manner to unbound_pwq_by_node() where there is a separate worker pool per cache domain. THe scheduler domain attributes (cpumask) is held by the task pool, and they are applied to the kworker task when it's given the task to run. This, like WQ_UNBOUND, allows the scheduler to select the best CPU in the cpumask for the task to run on. Binding kworkers to a single CPU is exactly the problem we need to avoid here - we need to let the scheduler choose the best CPU in the local cache domain based on the current load. That means the implementation needs to behave like a WQ_UNBOUND workqueue, just with a more restrictive cpumask. -Dave.
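For concreteness only, a purely hypothetical sketch of the lookup described above, modelled on the existing unbound_pwq_by_node(); none of these symbols (cache_pwq_tbl, llc_id(), per-LLC pools) exist in mainline, and the attrs setup is just the shape of the idea:

/*
 * Hypothetical sketch only -- modelled on unbound_pwq_by_node().
 * One unbound-style pool_workqueue per last-level-cache domain; the
 * pool's attrs->cpumask covers all CPUs sharing that cache, so the
 * scheduler is free to run the kworker on any CPU inside the domain.
 */
static struct pool_workqueue *unbound_pwq_by_cache(struct workqueue_struct *wq,
						   int cpu)
{
	/* llc_id(): hypothetical helper mapping a CPU to its LLC domain index. */
	return rcu_dereference_raw(wq->cache_pwq_tbl[llc_id(cpu)]);
}

/* When the per-LLC pool attrs are built (hotplug aware in a real version): */
static void init_cache_domain_attrs(struct workqueue_attrs *attrs, int cpu)
{
	/* Restrict the pool's workers to the CPUs sharing cpu's LLC. */
	cpumask_copy(attrs->cpumask, cpu_coregroup_mask(cpu));
}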
* Peter Zijlstra <peterz@infradead.org> [2019-11-21 14:29:37]: > On Wed, Nov 20, 2019 at 05:03:13PM -0500, Phil Auld wrote: > > On Wed, Nov 20, 2019 at 08:16:36PM +0100 Peter Zijlstra wrote: > > > On Tue, Nov 19, 2019 at 07:40:54AM +1100, Dave Chinner wrote: > > > > > Yes, that's precisely the problem - work is queued, by default, on a > > > > specific CPU and it will wait for a kworker that is pinned to that > > > > > > I'm thinking the problem is that it doesn't wait. If it went and waited > > > for it, active balance wouldn't be needed, that only works on active > > > tasks. > > > > Since this is AIO I wonder if it should queue_work on a nearby cpu by > > default instead of unbound. > > The thing seems to be that 'unbound' is in fact 'bound'. Maybe we should > fix that. If the load-balancer were allowed to move the kworker around > when it didn't get time to run, that would probably be a better > solution. > Can the scheduler detect this situation and probably preempt the current task? ----->8-------------------------------------------8<------------------ From 578407a0d01fa458ccb412f3390b04d3e1030d33 Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Date: Sat, 7 Dec 2019 08:50:54 +0530 Subject: [RFC PATCH] sched/core: Preempt current task in favour of bound kthread A running task can wake-up a per CPU bound kthread on the same CPU. If the current running task doesn't yield the CPU before the next load balance operation, the scheduler would detect load imbalance and try to balance the load. However this load balance would fail as the waiting task is CPU bound, while the running task cannot be moved by the regular load balancer. Finally the active load balancer would kick in and move the task to a different CPU/Core. Moving the task to a different CPU/core can lead to loss in cache affinity leading to poor performance. This is more prone to happen if the current running task is CPU intensive and the sched_wake_up_granularity is set to larger value. When the sched_wake_up_granularity was relatively small, it was observed that the bound thread would complete before the load balancer would have chosen to move the cache hot task to a different CPU. To deal with this situation, the current running task would yield to a per CPU bound kthread, provided kthread is not CPU intensive. /pboffline/hwcct_prg_old/lib/fsperf -t overwrite --noclean -f 5g -b 4k /pboffline (With sched_wake_up_granularity set to 15ms) Performance counter stats for 'system wide' (5 runs): event v5.4 v5.4 + patch probe:active_load_balance_cpu_stop 1,740 ( +- 4.06% ) 4 ( +- 15.41% ) sched:sched_waking 431,952 ( +- 0.63% ) 905,001 ( +- 0.25% ) sched:sched_wakeup 431,950 ( +- 0.63% ) 905,000 ( +- 0.25% ) sched:sched_wakeup_new 427 ( +- 14.50% ) 544 ( +- 3.11% ) sched:sched_switch 773,491 ( +- 0.72% ) 1,467,539 ( +- 0.30% ) sched:sched_migrate_task 19,744 ( +- 0.69% ) 2,488 ( +- 5.24% ) sched:sched_process_free 417 ( +- 15.26% ) 545 ( +- 3.47% ) sched:sched_process_exit 433 ( +- 14.71% ) 560 ( +- 3.37% ) sched:sched_wait_task 3 ( +- 23.57% ) 1 ( +- 61.24% ) sched:sched_process_wait 132 ( +- 80.37% ) 848 ( +- 3.63% ) sched:sched_process_fork 427 ( +- 14.50% ) 543 ( +- 3.08% ) sched:sched_process_exec 36 ( +- 92.46% ) 211 ( +- 7.50% ) sched:sched_wake_idle_without_ipi 178,349 ( +- 0.87% ) 351,912 ( +- 0.31% ) elapsed time in seconds 288.09 +- 2.30 ( +- 0.80% ) 72.631 +- 0.109 ( +- 0.15% ) Throughput results v5.4 Trigger time:................... 
0.842679 s (Throughput: 6075.86 MB/s) Asynchronous submit time:....... 1.0184 s (Throughput: 5027.49 MB/s) Synchronous submit time:........ 0 s (Throughput: 0 MB/s) I/O time:....................... 263.17 s (Throughput: 19.455 MB/s) Ratio trigger time to I/O time:.0.00320202 v5.4 + patch Trigger time:................... 0.858728 s (Throughput: 5962.3 MB/s) Asynchronous submit time:....... 0.758399 s (Throughput: 6751.06 MB/s) Synchronous submit time:........ 0 s (Throughput: 0 MB/s) I/O time:....................... 43.411 s (Throughput: 117.942 MB/s) Ratio trigger time to I/O time:.0.0197813 (With sched_wake_up_granularity set to 4ms) Performance counter stats for 'system wide' (5 runs): event v5.4 v5.4 + patch probe:active_load_balance_cpu_stop 4 ( +- 29.92% ) 6 ( +- 21.88% ) sched:sched_waking 896,177 ( +- 0.25% ) 900,352 ( +- 0.36% ) sched:sched_wakeup 896,174 ( +- 0.25% ) 900,352 ( +- 0.36% ) sched:sched_wakeup_new 255 ( +- 40.79% ) 568 ( +- 4.22% ) sched:sched_switch 1,453,937 ( +- 0.27% ) 1,459,653 ( +- 0.46% ) sched:sched_migrate_task 2,318 ( +- 6.55% ) 2,898 ( +- 13.14% ) sched:sched_process_free 239 ( +- 43.14% ) 553 ( +- 4.46% ) sched:sched_process_exit 255 ( +- 40.54% ) 568 ( +- 4.33% ) sched:sched_wait_task 3 ( +- 38.13% ) 2 ( +- 32.39% ) sched:sched_process_wait 257 ( +- 68.90% ) 887 ( +- 4.59% ) sched:sched_process_fork 255 ( +- 40.87% ) 567 ( +- 4.21% ) sched:sched_process_exec 116 ( +- 71.52% ) 214 ( +- 4.51% ) sched:sched_stat_runtime 82,757,021,750 ( +- 2.38% ) 82,092,839,452 ( +- 0.31% ) sched:sched_wake_idle_without_ipi 347,790 ( +- 0.69% ) 350,369 ( +- 0.27% ) elapsed time in seconds 72.6114 +- 0.0516 ( +- 0.07% ) 72.6425 +- 0.0658 ( +- 0.09% ) Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> --- kernel/sched/core.c | 7 ++++++- kernel/sched/fair.c | 23 ++++++++++++++++++++++- kernel/sched/sched.h | 3 ++- 3 files changed, 30 insertions(+), 3 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 44123b4d14e8..efd740aafa17 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2664,7 +2664,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) */ int wake_up_process(struct task_struct *p) { - return try_to_wake_up(p, TASK_NORMAL, 0); + int wake_flags = 0; + + if (is_per_cpu_kthread(p)) + wake_flags = WF_KTHREAD; + + return try_to_wake_up(p, TASK_NORMAL, WF_KTHREAD); } EXPORT_SYMBOL(wake_up_process); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 69a81a5709ff..36486f71e59f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6660,6 +6660,27 @@ static void set_skip_buddy(struct sched_entity *se) cfs_rq_of(se)->skip = se; } +static int kthread_wakeup_preempt(struct rq *rq, struct task_struct *p, int wake_flags) +{ + struct task_struct *curr = rq->curr; + struct cfs_rq *cfs_rq = task_cfs_rq(curr); + + if (!(wake_flags & WF_KTHREAD)) + return 0; + + if (p->nr_cpus_allowed != 1 || curr->nr_cpus_allowed == 1) + return 0; + + if (cfs_rq->nr_running > 2) + return 0; + + /* + * Don't preempt, if the waking kthread is more CPU intensive than + * the current thread. 
+ */ + return p->nvcsw * curr->nivcsw >= p->nivcsw * curr->nvcsw; +} + /* * Preempt the current task with a newly woken task if needed: */ @@ -6716,7 +6737,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ find_matching_se(&se, &pse); update_curr(cfs_rq_of(se)); BUG_ON(!pse); - if (wakeup_preempt_entity(se, pse) == 1) { + if (wakeup_preempt_entity(se, pse) == 1 || kthread_wakeup_preempt(rq, p, wake_flags)) { /* * Bias pick_next to pick the sched entity that is * triggering this preemption. diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index c8870c5bd7df..23d4284ad1e3 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1643,7 +1643,8 @@ static inline int task_on_rq_migrating(struct task_struct *p) */ #define WF_SYNC 0x01 /* Waker goes to sleep after wakeup */ #define WF_FORK 0x02 /* Child wakeup after fork */ -#define WF_MIGRATED 0x4 /* Internal use, task got migrated */ +#define WF_MIGRATED 0x04 /* Internal use, task got migrated */ +#define WF_KTHREAD 0x08 /* Per CPU Kthread*/ /* * To aid in avoiding the subversion of "niceness" due to uneven distribution
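The final check in kthread_wakeup_preempt() above is easy to misread, so a worked example may help: it compares the two tasks' ratios of voluntary (nvcsw) to involuntary (nivcsw) context switches by cross-multiplying, which avoids a division:

	p->nvcsw * curr->nivcsw >= p->nivcsw * curr->nvcsw

is, for non-zero counts, the same test as

	p->nvcsw / p->nivcsw >= curr->nvcsw / curr->nivcsw

For example, a kworker with nvcsw = 900 and nivcsw = 10 (it almost always sleeps voluntarily) waking onto a CPU whose current task has nvcsw = 50 and nivcsw = 200 (it is mostly preempted, i.e. CPU bound) gives 900 * 200 = 180000 >= 10 * 50 = 500, so the kworker is allowed to preempt. With the ratios reversed, the kworker is judged the more CPU-intensive of the two and the preemption is skipped, matching the comment in the patch.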
* Jens Axboe <axboe@kernel.dk> [2019-11-21 09:19:29]: > > > > I wish there was a flag to wake_up() or to the event object that says > > to relinquish the remainder of the time-slice to the waiter on same > > CPU, since I will be soon sleeping. > > Isn't that basically what wake_up_sync() is? > Workqueues don't seem to use a wait_queue_head, which is needed in order to use wake_up_sync() and its related APIs. Also, wake_up_sync() only helps when the waking task then hands off the CPU and goes to sleep/blocks. However here, i.e. in the fsperf case at least, the waking thread continues to run after it has woken the per-CPU workqueue.
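For readers unfamiliar with the API being referred to: the _sync wakeup variants pass a WF_SYNC hint meaning "the waker is about to sleep, so prefer its CPU for the wakee instead of preempting or migrating". A minimal sketch of the wait_queue_head pattern they require, which (as noted above) is not how kworkers are woken:

#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_waitq);
static bool demo_done;

static void waiter_side(void)
{
	/* Sleeps until the completer flips demo_done. */
	if (wait_event_interruptible(demo_waitq, READ_ONCE(demo_done)))
		return;		/* interrupted by a signal */
}

static void completer_side(void)
{
	WRITE_ONCE(demo_done, true);
	/*
	 * _sync: hint that the waker will block soon, so the wakee may
	 * be left on this CPU instead of being migrated or preempting.
	 */
	wake_up_interruptible_sync(&demo_waitq);
}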
On Mon, Dec 09, 2019 at 10:21:22PM +0530, Srikar Dronamraju wrote: > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 44123b4d14e8..efd740aafa17 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -2664,7 +2664,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) > */ > int wake_up_process(struct task_struct *p) > { > - return try_to_wake_up(p, TASK_NORMAL, 0); > + int wake_flags = 0; > + > + if (is_per_cpu_kthread(p)) > + wake_flags = WF_KTHREAD; > + > + return try_to_wake_up(p, TASK_NORMAL, WF_KTHREAD); This is buggy. It always sets WF_KTHREAD, even for non-kernel processes. I think you meant: return try_to_wake_up(p, TASK_NORMAL, wake_flags); I suspect this bug invalidates the test results presented, too... -Dave.
* Dave Chinner <david@fromorbit.com> [2019-12-10 10:17:43]: > On Mon, Dec 09, 2019 at 10:21:22PM +0530, Srikar Dronamraju wrote: > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index 44123b4d14e8..efd740aafa17 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -2664,7 +2664,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) > > */ > > int wake_up_process(struct task_struct *p) > > { > > - return try_to_wake_up(p, TASK_NORMAL, 0); > > + int wake_flags = 0; > > + > > + if (is_per_cpu_kthread(p)) > > + wake_flags = WF_KTHREAD; > > + > > + return try_to_wake_up(p, TASK_NORMAL, WF_KTHREAD); > > This is buggy. It always sets WF_KTHREAD, even for non-kernel > processes. I think you meant: > > return try_to_wake_up(p, TASK_NORMAL, wake_flags); Yes, I meant the above. Thanks for catching. Will test with this and repost. > > I suspect this bug invalidates the test results presented, too... > > -Dave. > -- > Dave Chinner > david@fromorbit.com
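For reference, with the fix Dave points out and Srikar confirms above, the RFC's wake_up_process() hunk would read:

int wake_up_process(struct task_struct *p)
{
	int wake_flags = 0;

	if (is_per_cpu_kthread(p))
		wake_flags = WF_KTHREAD;

	/* pass the computed flags instead of WF_KTHREAD unconditionally */
	return try_to_wake_up(p, TASK_NORMAL, wake_flags);
}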