Message ID: 20191114113153.GB4213@ming.t460p
State: Deferred, archived
Series: single aio thread is migrated crazily by scheduler
On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote:
> Hi Guys,
>
> It is found that single AIO thread is migrated crazely by scheduler, and
> the migrate period can be < 10ms. Follows the test a):

What does crazy mean? Does it cycle through the L3 mask?
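One way to answer that is to compare the CPUs the fio thread bounces
between against the L3 sharing reported by sysfs. A rough sketch, assuming
only standard sysfs paths and the stock trace-cmd event filter syntax:

    # CPUs sharing an L3 with CPU 0 (index3 is normally the L3):
    cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list

    # Record migrations of any task named "fio" for 30s, then list the
    # orig_cpu/dest_cpu pairs so they can be compared with the mask above:
    trace-cmd record -e sched:sched_migrate_task -f 'comm == "fio"' sleep 30
    trace-cmd report | grep sched_migrate_task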
On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote: > Hi Guys, > > It is found that single AIO thread is migrated crazely by scheduler, and > the migrate period can be < 10ms. Follows the test a): > > - run single job fio[1] for 30 seconds: > ./xfs_complete 512 > > - observe fio io thread migration via bcc trace[2], and the migration > times can reach 5k ~ 10K in above test. In this test, CPU utilization > is 30~40% on the CPU running fio IO thread. Using the default scheduler tunings: kernel.sched_wakeup_granularity_ns = 4000000 kernel.sched_min_granularity_ns = 3000000 I'm not seeing any migrations at all on a 16p x86-64 box. Even with the tunings you suggest: sysctl kernel.sched_min_granularity_ns=10000000 sysctl kernel.sched_wakeup_granularity_ns=15000000 There are no migrations at all. During the overwrite phase of the test, I'm seeing 66% fio usage, 28% system time and of that, only 3% of the CPU time is in the IO completion workqueue time. It's all running on the one CPU, and it's running at about 15,000 context switches per second.. I'm not seeing any CPU migrations at all, despite the workload being completely CPU bound on a single CPU. This is the typical output from top: Tasks: 262 total, 2 running, 260 sleeping, 0 stopped, 0 zombie %Cpu0 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu1 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu2 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu3 : 66.2 us, 27.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 6.0 si, 0.0 st %Cpu4 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu5 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu6 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu7 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu8 : 0.0 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st %Cpu9 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu10 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu11 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu12 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu13 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu14 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu15 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st MiB Mem : 16005.3 total, 8737.7 free, 6762.5 used, 505.2 buff/cache MiB Swap: 486.3 total, 486.3 free, 0.0 used. 8640.3 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5554 root 20 0 672064 15356 1392 R 97.0 0.1 0:17.85 fio 1388 root 20 0 0 0 0 I 3.0 0.0 0:01.39 kworker/3:1-dio/sdb 5552 root 20 0 661364 275380 271992 S 0.7 1.7 0:15.88 fio 4891 dave 20 0 15572 5876 4600 S 0.3 0.0 0:00.56 sshd 4933 dave 20 0 11732 4276 3296 R 0.3 0.0 0:01.30 top i.e. it runs entirely on CPU 3 for the whole 30s measurement period. And the number of migrate task events: $ sudo trace-cmd start -e sched_migrate_task $ sudo ./run_test.sh ..... 
$ sudo trace-cmd show |grep fio kworker/1:1-1252 [001] d..2 2002.792659: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=5 dest_cpu=1 rcu_preempt-11 [011] d..2 2004.202624: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=1 dest_cpu=11 rcu_preempt-11 [006] d..2 2008.364481: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=11 dest_cpu=6 <idle>-0 [007] dNs2 2009.209480: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=6 dest_cpu=7 $ And top tells me these fio processes are consuming CPU: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5695 root 20 0 663872 7120 1324 R 97.7 0.0 0:03.48 fio 5689 root 20 0 661356 275512 272128 S 0.3 1.7 0:15.46 fio So, the fio process that got migrated 4 times in 30s is the fio process that isn't doing any of the work. There's no migration going on here at all on a vanilla kernel on a Debian userspace, with or without the suggested scheduler tunings. > - after applying the debug patch[3] to queue XFS completion work on > other CPU(not current CPU), the above crazy fio IO thread migration > can't be observed. > > IMO, it is normal for user to saturate aio thread, since this way may > save context switch. > > Guys, any idea on the crazy aio thread migration? It doesn't happen on x86-64 with a current TOT vanilla kernel and a debian userspace. What userspace are you testing with? > BTW, the tests are run on latest linus tree(5.4-rc7) in KVM guest, and the > fio test is created for simulating one real performance report which is > proved to be caused by frequent aio submission thread migration. What is the underlying hardware? I'm running in a 16p KVM guest on a 16p/32t x86-64 using 5.4-rc7, and I don't observe any significant CPU migration occurring at all from your test workload. > [3] test patch for queuing xfs completetion on other CPU > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > index 1fc28c2da279..bdc007a57706 100644 > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -158,9 +158,14 @@ static void iomap_dio_bio_end_io(struct bio *bio) > blk_wake_io_task(waiter); > } else if (dio->flags & IOMAP_DIO_WRITE) { > struct inode *inode = file_inode(dio->iocb->ki_filp); > + unsigned cpu = cpumask_next(smp_processor_id(), > + cpu_online_mask); > + > + if (cpu >= nr_cpu_ids) > + cpu = 0; > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > + queue_work_on(cpu, inode->i_sb->s_dio_done_wq, &dio->aio.work); > } else { > iomap_dio_complete_work(&dio->aio.work); > } So, can you test the patch I sent you a while back that avoided using the completion workqueue for pure overwrites? Does that make your whacky scheduler behaviour go away, or does it still reschedule like crazy? I haven't posted that patch to the list because I can't prove it makes any difference to performance in any workload on any hardware I've tested it on. I just tested it on your workload, and it makes no different to behaviour or performance. It's definitely hitting the non-workqueue completion path: $ sudo trace-cmd show | grep overwrite |head -1 fio-4526 [009] ..s. 180.361089: iomap_dio_bio_end_io: overwrite completion $ sudo trace-cmd show | grep overwrite |wc -l 51295 $ so it's pretty clear that whatever is going on is not obviously a problem with workqueues or the way iomap does completions. Patch is attached below for you to test. 
You will need to pull commit 7684e2c4384d ("iomap: iomap that extends beyond EOF should be marked dirty") from the iomap for-next branch to ensure this patch functions correctly.

Cheers,

-Dave.
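For reference, pulling in that prerequisite commit could look roughly like
the following; the remote URL and branch name are assumptions rather than
something stated in the thread, so adjust them to wherever the iomap
for-next branch actually lives:

    # Assumed location of the iomap for-next branch (the xfs-linux tree).
    git remote add xfs-linux https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git
    git fetch xfs-linux iomap-for-next

    # Apply just the prerequisite commit on top of the 5.4-rc7 test kernel.
    git cherry-pick 7684e2c4384d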
On Thu, Nov 14, 2019 at 02:14:34PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote:
> > Hi Guys,
> >
> > It is found that single AIO thread is migrated crazely by scheduler, and
> > the migrate period can be < 10ms. Follows the test a):
>
> What does crazy mean? Does it cycle through the L3 mask?

The single AIO thread is migrated once every few milliseconds.

Thanks,
Ming
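A quick way to put numbers on "once every few milliseconds" is to print the
gaps between successive migration events for the fio thread. A sketch,
assuming the events were recorded with trace-cmd as in the earlier example
and that the report lines carry the default "seconds.microseconds:"
timestamp format:

    # Extract the timestamps of the fio migrations and print the deltas in ms.
    trace-cmd report | grep 'sched_migrate_task: comm=fio' | \
        grep -oE '[0-9]+\.[0-9]+:' | tr -d ':' | \
        awk '{ if (prev) printf "%.3f ms\n", ($1 - prev) * 1000; prev = $1 }'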
Hi Dave, On Fri, Nov 15, 2019 at 10:54:15AM +1100, Dave Chinner wrote: > On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote: > > Hi Guys, > > > > It is found that single AIO thread is migrated crazely by scheduler, and > > the migrate period can be < 10ms. Follows the test a): > > > > - run single job fio[1] for 30 seconds: > > ./xfs_complete 512 > > > > - observe fio io thread migration via bcc trace[2], and the migration > > times can reach 5k ~ 10K in above test. In this test, CPU utilization > > is 30~40% on the CPU running fio IO thread. > > Using the default scheduler tunings: > > kernel.sched_wakeup_granularity_ns = 4000000 > kernel.sched_min_granularity_ns = 3000000 > > I'm not seeing any migrations at all on a 16p x86-64 box. Even with > the tunings you suggest: > > sysctl kernel.sched_min_granularity_ns=10000000 > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > There are no migrations at all. Looks I forget to pass $BS to the fio command line in the script posted, please try the following script again and run './xfs_complete 512' first. [1] xfs_complete: one fio script for running single job overwrite aio on XFS #!/bin/bash BS=$1 NJOBS=1 QD=128 DIR=/mnt/xfs BATCH=1 VERIFY="sha3-512" sysctl kernel.sched_wakeup_granularity_ns sysctl kernel.sched_min_granularity_ns rmmod scsi_debug;modprobe scsi_debug dev_size_mb=6144 ndelay=41000 dix=1 dif=2 DEV=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename` DEV="/dev/"$DEV mkfs.xfs -f $DEV [ ! -d $DIR ] && mkdir -p $DIR mount $DEV $DIR fio --readwrite=randwrite --filesize=5g \ --overwrite=1 \ --filename=$DIR/fiofile \ --runtime=30s --time_based \ --ioengine=libaio --direct=1 --bs=$BS --iodepth=$QD \ --iodepth_batch_submit=$BATCH \ --iodepth_batch_complete_min=$BATCH \ --numjobs=$NJOBS \ --verify=$VERIFY \ --name=/hana/fsperf/foo umount $DEV rmmod scsi_debug When running './xfs_complete 512', lots of fio migration can be observed via bcc trace: /usr/share/bcc/tools/trace -C -t 't:sched:sched_migrate_task "%s/%d cpu %d->%d", args->comm,args->pid,args->orig_cpu,args->dest_cpu' | grep fio ... 69.13495 2 23 23 kworker/2:0 sched_migrate_task b'fio'/866 cpu 2->0 69.13513 0 13 13 kworker/0:1 sched_migrate_task b'fio'/866 cpu 0->2 69.86733 2 23 23 kworker/2:0 sched_migrate_task b'fio'/866 cpu 2->3 70.20730 3 134 134 kworker/3:1 sched_migrate_task b'fio'/866 cpu 3->0 70.21131 0 13 13 kworker/0:1 sched_migrate_task b'fio'/866 cpu 0->1 70.21733 1 112 112 kworker/1:1 sched_migrate_task b'fio'/866 cpu 1->2 70.29528 2 23 23 kworker/2:0 sched_migrate_task b'fio'/866 cpu 2->0 70.29769 0 13 13 kworker/0:1 sched_migrate_task b'fio'/866 cpu 0->2 70.36332 2 23 23 kworker/2:0 sched_migrate_task b'fio'/866 cpu 2->3 70.41924 3 134 134 kworker/3:1 sched_migrate_task b'fio'/866 cpu 3->2 70.78572 2 23 23 kworker/2:0 sched_migrate_task b'fio'/866 cpu 2->0 70.79061 0 13 13 kworker/0:1 sched_migrate_task b'fio'/866 cpu 0->2 70.91526 2 23 23 kworker/2:0 sched_migrate_task b'fio'/866 cpu 2->0 70.91607 0 13 13 kworker/0:1 sched_migrate_task b'fio'/866 cpu 0->2 ... In my test just done, the migration count is 12K in 30s fio running. Sometimes the number can be quite less, but most of times, the number is very big(> 5k). Also attaches my kernel config. > During the overwrite phase of the test, I'm seeing 66% fio usage, > 28% system time and of that, only 3% of the CPU time is in the IO > completion workqueue time. It's all running on the one CPU, and it's > running at about 15,000 context switches per second.. 
I'm not seeing > any CPU migrations at all, despite the workload being completely CPU > bound on a single CPU. In my test VM, in case of 4K block size(./xfs_complete 4k), when the following kernel parameters are passed in: sysctl kernel.sched_min_granularity_ns=10000000 sysctl kernel.sched_wakeup_granularity_ns=15000000 'top -H' shows that fio consumes ~100% CPU, and the fio IO thread is migrated very frequently. > > This is the typical output from top: > > Tasks: 262 total, 2 running, 260 sleeping, 0 stopped, 0 zombie > %Cpu0 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu1 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu2 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu3 : 66.2 us, 27.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 6.0 si, 0.0 st > %Cpu4 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu5 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu6 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu7 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu8 : 0.0 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st > %Cpu9 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu10 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu11 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu12 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu13 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu14 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > %Cpu15 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st > MiB Mem : 16005.3 total, 8737.7 free, 6762.5 used, 505.2 buff/cache > MiB Swap: 486.3 total, 486.3 free, 0.0 used. 8640.3 avail Mem > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5554 root 20 0 672064 15356 1392 R 97.0 0.1 0:17.85 fio > 1388 root 20 0 0 0 0 I 3.0 0.0 0:01.39 kworker/3:1-dio/sdb > 5552 root 20 0 661364 275380 271992 S 0.7 1.7 0:15.88 fio > 4891 dave 20 0 15572 5876 4600 S 0.3 0.0 0:00.56 sshd > 4933 dave 20 0 11732 4276 3296 R 0.3 0.0 0:01.30 top > > i.e. it runs entirely on CPU 3 for the whole 30s measurement period. > > And the number of migrate task events: > > $ sudo trace-cmd start -e sched_migrate_task > $ sudo ./run_test.sh > ..... > $ sudo trace-cmd show |grep fio > kworker/1:1-1252 [001] d..2 2002.792659: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=5 dest_cpu=1 > rcu_preempt-11 [011] d..2 2004.202624: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=1 dest_cpu=11 > rcu_preempt-11 [006] d..2 2008.364481: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=11 dest_cpu=6 > <idle>-0 [007] dNs2 2009.209480: sched_migrate_task: comm=fio pid=5689 prio=120 orig_cpu=6 dest_cpu=7 > $ > > And top tells me these fio processes are consuming CPU: > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 5695 root 20 0 663872 7120 1324 R 97.7 0.0 0:03.48 fio > 5689 root 20 0 661356 275512 272128 S 0.3 1.7 0:15.46 fio > > So, the fio process that got migrated 4 times in 30s is the fio > process that isn't doing any of the work. > > There's no migration going on here at all on a vanilla kernel on > a Debian userspace, with or without the suggested scheduler > tunings. > > > - after applying the debug patch[3] to queue XFS completion work on > > other CPU(not current CPU), the above crazy fio IO thread migration > > can't be observed. 
> > > > IMO, it is normal for user to saturate aio thread, since this way may > > save context switch. > > > > Guys, any idea on the crazy aio thread migration? > > It doesn't happen on x86-64 with a current TOT vanilla kernel and > a debian userspace. > > What userspace are you testing with? Fedora 29. > > > BTW, the tests are run on latest linus tree(5.4-rc7) in KVM guest, and the > > fio test is created for simulating one real performance report which is > > proved to be caused by frequent aio submission thread migration. > > What is the underlying hardware? I'm running in a 16p KVM guest on a > 16p/32t x86-64 using 5.4-rc7, and I don't observe any significant > CPU migration occurring at all from your test workload. It is a KVM guest, which is running on my Lenova T460p Fedora 29 laptop, and the host kernel is 5.2.18-100.fc29.x86_64, follows the guest info: [root@ktest-01 ~]# lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 1 Core(s) per socket: 4 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 94 Model name: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz Stepping: 3 CPU MHz: 2712.000 BogoMIPS: 5424.00 Virtualization: VT-x Hypervisor vendor: KVM Virtualization type: full L1d cache: 32K L1i cache: 32K L2 cache: 4096K L3 cache: 16384K NUMA node0 CPU(s): 0-3 NUMA node1 CPU(s): 4-7 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmxp [root@ktest-01 ~]# uname -a Linux ktest-01 5.4.0-rc5-00367-gac8d20c84c47 #1532 SMP Thu Nov 14 19:13:25 CST 2019 x86_64 x86_64 x86_64x > > > [3] test patch for queuing xfs completetion on other CPU > > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > > index 1fc28c2da279..bdc007a57706 100644 > > --- a/fs/iomap/direct-io.c > > +++ b/fs/iomap/direct-io.c > > @@ -158,9 +158,14 @@ static void iomap_dio_bio_end_io(struct bio *bio) > > blk_wake_io_task(waiter); > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > struct inode *inode = file_inode(dio->iocb->ki_filp); > > + unsigned cpu = cpumask_next(smp_processor_id(), > > + cpu_online_mask); > > + > > + if (cpu >= nr_cpu_ids) > > + cpu = 0; > > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > + queue_work_on(cpu, inode->i_sb->s_dio_done_wq, &dio->aio.work); > > } else { > > iomap_dio_complete_work(&dio->aio.work); > > } > > So, can you test the patch I sent you a while back that avoided > using the completion workqueue for pure overwrites? Does that make > your whacky scheduler behaviour go away, or does it still reschedule > like crazy? Last time, I found that your patch V3 doesn't work as expected since the trace message of 'overwrite completion' isn't observed when running the external test case. Seems not see your further response after I reported it to you. Will try this patch again. > > I haven't posted that patch to the list because I can't prove it > makes any difference to performance in any workload on any hardware > I've tested it on. I just tested it on your workload, and it makes > no different to behaviour or performance. It's definitely hitting > the non-workqueue completion path: > > $ sudo trace-cmd show | grep overwrite |head -1 > fio-4526 [009] ..s. 
180.361089: iomap_dio_bio_end_io: overwrite completion > $ sudo trace-cmd show | grep overwrite |wc -l > 51295 > $ > > so it's pretty clear that whatever is going on is not obviously a > problem with workqueues or the way iomap does completions. Patch is > attached below for you to test. You will need to pull commit > 7684e2c4384d ("iomap: iomap that extends beyond EOF should be marked > dirty") from the iomap for-next branch to ensure this patch > functions correctly. > > Cheers, > > -Dave. > -- > Dave Chinner > david@fromorbit.com > > iomap: don't defer completion of pure AIO+DIO overwrites > > From: Dave Chinner <dchinner@redhat.com> > > We have a workload issuing 4k sequential AIO+DIOs in batches (yeah, > really silly thing to do when you could just emit a single large > IO) that is showing some interesting performance anomalies. i.e. > massive performance regressions that go away when the workload is > straced because it changes AIO completion scheduling and submission > batch sizes.... > > Older kernels that don't have the iomap DIO code don't show the same > regression, and they complete the AIO-DIO in the the bio completion > context rather than deferring it to a workqueue like iomap does. > > Hence, for pure overwrites that don't require any metadata updates > on IO completion, including file size, call the IO completion > directly rather than deferring it to the completion workqueue. THe > IOMAP_DIO_OVERWRITE flag is added to tell ->end_io implementations > that they are running in bio completion context and that they should > not have any update work to do. This leverages the iomap FUA > optimisations that allow FUA writes to be issued when no metadata > updates are required to be synced to disk during IO completion. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/iomap/direct-io.c | 26 ++++++++++++++++++++------ > fs/xfs/xfs_file.c | 3 +++ > include/linux/iomap.h | 8 ++++++++ > 3 files changed, 31 insertions(+), 6 deletions(-) > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > index 49bf9780e3ed..a89b29306794 100644 > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -159,8 +159,14 @@ static void iomap_dio_bio_end_io(struct bio *bio) > } else if (dio->flags & IOMAP_DIO_WRITE) { > struct inode *inode = file_inode(dio->iocb->ki_filp); > > - INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > + if ((dio->flags & IOMAP_DIO_OVERWRITE) && > + !inode->i_mapping->nrpages) { > + trace_printk("overwrite completion\n"); > + iomap_dio_complete_work(&dio->aio.work); > + } else { > + INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > + queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > + } > } else { > iomap_dio_complete_work(&dio->aio.work); > } > @@ -229,10 +235,12 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, > * the underlying device supports FUA. This allows us to avoid > * cache flushes on IO completion. 
> */ > - if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) && > - (dio->flags & IOMAP_DIO_WRITE_FUA) && > - blk_queue_fua(bdev_get_queue(iomap->bdev))) > - use_fua = true; > + if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY))) { > + dio->flags |= IOMAP_DIO_OVERWRITE; > + if ((dio->flags & IOMAP_DIO_WRITE_FUA) && > + blk_queue_fua(bdev_get_queue(iomap->bdev))) > + use_fua = true; > + } > } > > /* > @@ -511,9 +519,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > /* > * If all the writes we issued were FUA, we don't need to flush the > * cache on IO completion. Clear the sync flag for this case. > + * > + * If we are doing an overwrite and need to sync and FUA cannot be used, > + * clear the overwrite flag to ensure the completion is run via the > + * workqueue rather than directly. > */ > if (dio->flags & IOMAP_DIO_WRITE_FUA) > dio->flags &= ~IOMAP_DIO_NEED_SYNC; > + if (dio->flags & IOMAP_DIO_NEED_SYNC) > + dio->flags &= ~IOMAP_DIO_OVERWRITE; > > WRITE_ONCE(iocb->ki_cookie, dio->submit.cookie); > WRITE_ONCE(iocb->private, dio->submit.last_queue); > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 525b29b99116..794dea2f1dc3 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -395,6 +395,9 @@ xfs_dio_write_end_io( > */ > XFS_STATS_ADD(ip->i_mount, xs_write_bytes, size); > > + if (flags & IOMAP_DIO_OVERWRITE) > + return 0; > + > /* > * We can allocate memory here while doing writeback on behalf of > * memory reclaim. To avoid memory allocation deadlocks set the > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > index 8b09463dae0d..009c5969a2ef 100644 > --- a/include/linux/iomap.h > +++ b/include/linux/iomap.h > @@ -249,6 +249,14 @@ int iomap_writepages(struct address_space *mapping, > #define IOMAP_DIO_UNWRITTEN (1 << 0) /* covers unwritten extent(s) */ > #define IOMAP_DIO_COW (1 << 1) /* covers COW extent(s) */ > > +/* > + * IOMAP_DIO_OVERWRITE indicates a pure overwrite that requires no metadata > + * updates on completion. It also indicates that the completion is running in > + * the hardware IO completion context (e.g. softirq) rather than on a schedules > + * workqueue. > + */ > +#define IOMAP_DIO_OVERWRITE (1 << 2) > + > struct iomap_dio_ops { > int (*end_io)(struct kiocb *iocb, ssize_t size, int error, > unsigned flags); > Just run a quick test several times after applying the above patch, and looks it does make a big difference in test './xfs_complete 512' wrt. fio io thread migration. thanks, Ming
On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > Hi Dave, > > On Fri, Nov 15, 2019 at 10:54:15AM +1100, Dave Chinner wrote: > > On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote: > > > Hi Guys, > > > > > > It is found that single AIO thread is migrated crazely by scheduler, and > > > the migrate period can be < 10ms. Follows the test a): > > > > > > - run single job fio[1] for 30 seconds: > > > ./xfs_complete 512 > > > > > > - observe fio io thread migration via bcc trace[2], and the migration > > > times can reach 5k ~ 10K in above test. In this test, CPU utilization > > > is 30~40% on the CPU running fio IO thread. > > > > Using the default scheduler tunings: > > > > kernel.sched_wakeup_granularity_ns = 4000000 > > kernel.sched_min_granularity_ns = 3000000 > > > > I'm not seeing any migrations at all on a 16p x86-64 box. Even with > > the tunings you suggest: > > > > sysctl kernel.sched_min_granularity_ns=10000000 > > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > > > There are no migrations at all. > > Looks I forget to pass $BS to the fio command line in the script posted, > please try the following script again and run './xfs_complete 512' first. So I ran 4kB IOs instead of 512 byte IOs. Shouldn't make any difference, really - it'll still be CPU bound... <snip script> > In my test just done, the migration count is 12K in 30s fio running. > Sometimes the number can be quite less, but most of times, the number > is very big(> 5k). With my iomap-dio-overwrite patch and 512 byte IOs: $ sudo trace-cmd show |grep sched_migrate_task |wc -l 112 $ sudo trace-cmd show |grep sched_migrate_task |grep fio |wc -l 22 Without the iomap-dio-overwrite patch: $ sudo trace-cmd show |grep sched_migrate_task |wc -l 99 $ sudo trace-cmd show |grep sched_migrate_task |grep fio |wc -l 9 $ There are -less- migrations when using the workqueue for everything. But it's so low in either case that it's just noise. Performance is identical for the two patches... > > > BTW, the tests are run on latest linus tree(5.4-rc7) in KVM guest, and the > > > fio test is created for simulating one real performance report which is > > > proved to be caused by frequent aio submission thread migration. > > > > What is the underlying hardware? I'm running in a 16p KVM guest on a > > 16p/32t x86-64 using 5.4-rc7, and I don't observe any significant > > CPU migration occurring at all from your test workload. > > It is a KVM guest, which is running on my Lenova T460p Fedora 29 laptop, > and the host kernel is 5.2.18-100.fc29.x86_64, follows the guest info: Ok, so what are all the custom distro kernel tunings that userspace does for the kernel? > [root@ktest-01 ~]# lscpu > Architecture: x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 8 > On-line CPU(s) list: 0-7 > Thread(s) per core: 1 > Core(s) per socket: 4 > Socket(s): 2 > NUMA node(s): 2 Curious. You've configured it as two CPU sockets. If you make it a single socket, do your delay problems go away? The snippet of trace output you showed indicated it bouncing around CPUs on a single node (cpus 0-3), so maybe it has something to do with way the scheduler is interacting with non-zero NUMA distances... 
> Vendor ID: GenuineIntel > CPU family: 6 > Model: 94 > Model name: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz > Stepping: 3 > CPU MHz: 2712.000 > BogoMIPS: 5424.00 > Virtualization: VT-x > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache: 4096K > L3 cache: 16384K > NUMA node0 CPU(s): 0-3 > NUMA node1 CPU(s): 4-7 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmxp That seems like a very minimal set of CPU flags - looks like you are not actually passing the actual host CPU capabilities through to the guest. That means it will be doing the slowest, most generic spectre/meltdown mitigations, right? Also, shouldn't lscpu be telling us all the CPU bug mitigations in place? From my test system: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 40 bits physical, 48 bits virtual CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 1 Socket(s): 16 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 45 Model name: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz Stepping: 7 CPU MHz: 2199.998 BogoMIPS: 4399.99 Virtualization: VT-x Hypervisor vendor: KVM Virtualization type: full L1d cache: 512 KiB L1i cache: 512 KiB L2 cache: 64 MiB L3 cache: 256 MiB NUMA node0 CPU(s): 0-15 Vulnerability L1tf: Mitigation; PTE Inversion; VMX flush not necessary, SMT disabled Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown Vulnerability Meltdown: Vulnerable Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp l m constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm cpuid_fault ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpi d tsc_adjust xsaveopt arat umip md_clear arch_capabilities So, to rule out that it has something to do with kernel config, I just ran up a kernel built with your config.gz, and the problem does not manifest. The only difference was a few drivers I needed to boot my test VMs, and I was previously not using paravirt spinlocks. So, I still can't reproduce the problem. Indeed, the workload gets nowhere near single CPU bound with your config - it's using half the CPU for the same performance: %Cpu2 : 19.8 us, 28.2 sy, 0.0 ni, 0.0 id, 52.0 wa, 0.0 hi, 0.0 %si, 0.0 st Basically, it's spending half it's time waiting on IO. If I wind the delay down to 1000ns: %Cpu1 : 42.2 us, 42.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 15.6 %si, 0.0 st it spends an awful lot of time in soft-interrupt, but is back to being CPU bound. Despite this, I still don't see any significant amount of task migration. In fact, I see a lot less with your kernel config that I do with my original kernel config, because the CPU load was far lower. > Just run a quick test several times after applying the above patch, and looks it > does make a big difference in test './xfs_complete 512' wrt. fio io thread migration. There's something very different about your system, and it doesn't appear to be a result of the kernel code itself. 
I think you're going to have to do all the testing at the moment, Ming, because it's clear that my test systems do not show up the problems even when using the same kernel config as you do... If you reconfig you kvm setup to pass all the native host side cpu flags through to the guest, does the problem go away? I think adding "-cpu host" to your qemu command line will do that... Cheers, Dave.
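In case it helps with that last suggestion, exposing the host CPU to the
guest usually looks like one of the following; the guest name and disk
image name here are assumptions about Ming's setup, so treat this purely as
a sketch:

    # Plain qemu/KVM invocation: add "-cpu host" so the guest sees the
    # host's CPU model and feature flags.
    qemu-system-x86_64 -enable-kvm -cpu host -smp 8 -m 8192 \
        -drive file=guest.img,if=virtio

    # For a libvirt-managed guest: set <cpu mode='host-passthrough'/> in the
    # domain XML and restart the guest so the new CPU model takes effect.
    virsh edit ktest-01
    virsh destroy ktest-01 && virsh start ktest-01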
On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > Hi Dave, > > > > On Fri, Nov 15, 2019 at 10:54:15AM +1100, Dave Chinner wrote: > > > On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote: > > > > Hi Guys, > > > > > > > > It is found that single AIO thread is migrated crazely by scheduler, and > > > > the migrate period can be < 10ms. Follows the test a): > > > > > > > > - run single job fio[1] for 30 seconds: > > > > ./xfs_complete 512 > > > > > > > > - observe fio io thread migration via bcc trace[2], and the migration > > > > times can reach 5k ~ 10K in above test. In this test, CPU utilization > > > > is 30~40% on the CPU running fio IO thread. > > > > > > Using the default scheduler tunings: > > > > > > kernel.sched_wakeup_granularity_ns = 4000000 > > > kernel.sched_min_granularity_ns = 3000000 > > > > > > I'm not seeing any migrations at all on a 16p x86-64 box. Even with > > > the tunings you suggest: > > > > > > sysctl kernel.sched_min_granularity_ns=10000000 > > > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > > > > > There are no migrations at all. > > > > Looks I forget to pass $BS to the fio command line in the script posted, > > please try the following script again and run './xfs_complete 512' first. > > So I ran 4kB IOs instead of 512 byte IOs. Shouldn't make any > difference, really - it'll still be CPU bound... In 512 block size test, the CPU utilization of fio IO thread is reduced to 40%, which is more like IO bound. > > <snip script> > > > In my test just done, the migration count is 12K in 30s fio running. > > Sometimes the number can be quite less, but most of times, the number > > is very big(> 5k). > > With my iomap-dio-overwrite patch and 512 byte IOs: > > $ sudo trace-cmd show |grep sched_migrate_task |wc -l > 112 > $ sudo trace-cmd show |grep sched_migrate_task |grep fio |wc -l > 22 > > Without the iomap-dio-overwrite patch: > > $ sudo trace-cmd show |grep sched_migrate_task |wc -l > 99 > $ sudo trace-cmd show |grep sched_migrate_task |grep fio |wc -l > 9 > $ > > There are -less- migrations when using the workqueue for everything. > But it's so low in either case that it's just noise. > > Performance is identical for the two patches... I can reproduce the issue with 4k block size on another RH system, and the login info of that system has been shared to you in RH BZ. 1) sysctl kernel.sched_min_granularity_ns=10000000 sysctl kernel.sched_wakeup_granularity_ns=15000000 2) ./xfs_complete 4k Then you should see 1k~1.5k fio io thread migration in above test, either v5.4-rc7(build with rhel8 config) or RHEL 4.18 kernel. Not reproduced the issue with 512 block size on the RH system yet, maybe it is related with my kernel config. > > > > > BTW, the tests are run on latest linus tree(5.4-rc7) in KVM guest, and the > > > > fio test is created for simulating one real performance report which is > > > > proved to be caused by frequent aio submission thread migration. > > > > > > What is the underlying hardware? I'm running in a 16p KVM guest on a > > > 16p/32t x86-64 using 5.4-rc7, and I don't observe any significant > > > CPU migration occurring at all from your test workload. > > > > It is a KVM guest, which is running on my Lenova T460p Fedora 29 laptop, > > and the host kernel is 5.2.18-100.fc29.x86_64, follows the guest info: > > Ok, so what are all the custom distro kernel tunings that userspace > does for the kernel? It is standard Fedora 29. 
> > > [root@ktest-01 ~]# lscpu > > Architecture: x86_64 > > CPU op-mode(s): 32-bit, 64-bit > > Byte Order: Little Endian > > CPU(s): 8 > > On-line CPU(s) list: 0-7 > > Thread(s) per core: 1 > > Core(s) per socket: 4 > > Socket(s): 2 > > NUMA node(s): 2 > > Curious. You've configured it as two CPU sockets. If you make it a > single socket, do your delay problems go away? The snippet of trace > output you showed indicated it bouncing around CPUs on a single node > (cpus 0-3), so maybe it has something to do with way the scheduler > is interacting with non-zero NUMA distances... I don't see that is a problem wrt. this issue, given the issue can be reproduced on other system too. > > > Vendor ID: GenuineIntel > > CPU family: 6 > > Model: 94 > > Model name: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz > > Stepping: 3 > > CPU MHz: 2712.000 > > BogoMIPS: 5424.00 > > Virtualization: VT-x > > Hypervisor vendor: KVM > > Virtualization type: full > > L1d cache: 32K > > L1i cache: 32K > > L2 cache: 4096K > > L3 cache: 16384K > > NUMA node0 CPU(s): 0-3 > > NUMA node1 CPU(s): 4-7 > > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmxp > > That seems like a very minimal set of CPU flags - looks like you are > not actually passing the actual host CPU capabilities through to the > guest. That means it will be doing the slowest, most generic > spectre/meltdown mitigations, right? The above line is just trunated by the console terminal. > > Also, shouldn't lscpu be telling us all the CPU bug mitigations in > place? > > From my test system: > > Architecture: x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > Address sizes: 40 bits physical, 48 bits virtual > CPU(s): 16 > On-line CPU(s) list: 0-15 > Thread(s) per core: 1 > Core(s) per socket: 1 > Socket(s): 16 > NUMA node(s): 1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 45 > Model name: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz > Stepping: 7 > CPU MHz: 2199.998 > BogoMIPS: 4399.99 > Virtualization: VT-x > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 512 KiB > L1i cache: 512 KiB > L2 cache: 64 MiB > L3 cache: 256 MiB > NUMA node0 CPU(s): 0-15 > Vulnerability L1tf: Mitigation; PTE Inversion; VMX flush not necessary, SMT disabled > Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown > Vulnerability Meltdown: Vulnerable > Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp > Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization > Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp l > m constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 cx16 pcid sse4_1 sse4_2 x2apic > popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm cpuid_fault ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpi > d tsc_adjust xsaveopt arat umip md_clear arch_capabilities > > So, to rule out that it has something to do with kernel config, > I just ran up a kernel built with your config.gz, and the problem > does not manifest. The only difference was a few drivers I needed to > boot my test VMs, and I was previously not using paravirt spinlocks. > > So, I still can't reproduce the problem. 
Indeed, the workload gets > nowhere near single CPU bound with your config - it's using half the > CPU for the same performance: > > %Cpu2 : 19.8 us, 28.2 sy, 0.0 ni, 0.0 id, 52.0 wa, 0.0 hi, 0.0 %si, 0.0 st > > Basically, it's spending half it's time waiting on IO. If I wind the > delay down to 1000ns: > > %Cpu1 : 42.2 us, 42.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 15.6 %si, 0.0 st > > it spends an awful lot of time in soft-interrupt, but is back to > being CPU bound. > > Despite this, I still don't see any significant amount of task > migration. In fact, I see a lot less with your kernel config that I > do with my original kernel config, because the CPU load was far > lower. > > > Just run a quick test several times after applying the above patch, and looks it > > does make a big difference in test './xfs_complete 512' wrt. fio io thread migration. > > There's something very different about your system, and it doesn't > appear to be a result of the kernel code itself. I think you're > going to have to do all the testing at the moment, Ming, because > it's clear that my test systems do not show up the problems even > when using the same kernel config as you do... > > If you reconfig you kvm setup to pass all the native host side cpu > flags through to the guest, does the problem go away? I think adding > "-cpu host" to your qemu command line will do that... Please login to the RH system I shared to you, and you will see the issue. Thanks, Ming
On Fri, Nov 15, 2019 at 08:09:25AM +0800, Ming Lei wrote:
> On Thu, Nov 14, 2019 at 02:14:34PM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote:
> > > Hi Guys,
> > >
> > > It is found that single AIO thread is migrated crazely by scheduler, and
> > > the migrate period can be < 10ms. Follows the test a):
> >
> > What does crazy mean? Does it cycle through the L3 mask?
>
> The single AIO thread is migrated once every few milliseconds.

Today I found that the migration rate of the single fio IO thread can reach
11~12K/sec when I run './xfs_complete 512' on another real machine (single
NUMA node, 8 cores). That number is very close to the IOPS of the test,
which means the fio IO thread is migrated roughly once for each IO
completed on the scsi_debug device.

Thanks,
Ming
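That correlation is easy to sanity-check by comparing the migration rate
against the IOPS fio reports for the same run. A sketch, assuming the
sched_migrate_task events were recorded for the 30s run and that fio's
output was saved to a file (fio-output.log is a made-up name):

    # Migrations of the fio IO thread across the 30s run, as a per-second rate.
    migrations=$(trace-cmd report | grep -c 'sched_migrate_task: comm=fio')
    echo "fio thread migrations/sec: $((migrations / 30))"

    # Compare with the IOPS fio printed for the write workload.
    grep IOPS fio-output.log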
On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > I can reproduce the issue with 4k block size on another RH system, and > the login info of that system has been shared to you in RH BZ. > > 1) > sysctl kernel.sched_min_granularity_ns=10000000 > sysctl kernel.sched_wakeup_granularity_ns=15000000 So, these settings definitely influence behaviour. If these are set to kernel defaults (4ms and 3ms each): sysctl kernel.sched_min_granularity_ns=4000000 sysctl kernel.sched_wakeup_granularity_ns=3000000 The migration problem largely goes away - the fio task migration event count goes from ~2,000 a run down to 200/run. That indicates that the migration trigger is likely load/timing based. The analysis below is based on the 10/15ms numbers above, because it makes it so much easier to reproduce. > 2) > ./xfs_complete 4k > > Then you should see 1k~1.5k fio io thread migration in above test, > either v5.4-rc7(build with rhel8 config) or RHEL 4.18 kernel. Almost all the fio task migrations are coming from migration/X kernel threads. i.e it's the scheduler active balancing that is causing the fio thread to bounce around. This is typical a typical trace, trimmed to remove extraneous noise. The fio process is running on CPU 10: fio-3185 [010] 50419.285954: sched_stat_runtime: comm=fio pid=3185 runtime=1004014 [ns] vruntime=27067882290 [ns] fio-3185 [010] 50419.286953: sched_stat_runtime: comm=fio pid=3185 runtime=979458 [ns] vruntime=27068861748 [ns] fio-3185 [010] 50419.287998: sched_stat_runtime: comm=fio pid=3185 runtime=1028471 [ns] vruntime=27069890219 [ns] fio-3185 [010] 50419.289973: sched_stat_runtime: comm=fio pid=3185 runtime=989989 [ns] vruntime=27071836208 [ns] fio-3185 [010] 50419.290958: sched_stat_runtime: comm=fio pid=3185 runtime=963914 [ns] vruntime=27072800122 [ns] fio-3185 [010] 50419.291952: sched_stat_runtime: comm=fio pid=3185 runtime=972532 [ns] vruntime=27073772654 [ns] fio consumes CPU for several milliseconds, then: fio-3185 [010] 50419.292935: sched_stat_runtime: comm=fio pid=3185 runtime=966032 [ns] vruntime=27074738686 [ns] fio-3185 [010] 50419.292941: sched_switch: fio:3185 [120] S ==> kworker/10:0:2763 [120] kworker/10:0-2763 [010] 50419.292954: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=13423 [ns] vruntime=27052479694 [ns] kworker/10:0-2763 [010] 50419.292956: sched_switch: kworker/10:0:2763 [120] R ==> fio:3185 [120] fio-3185 [010] 50419.293115: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 fio-3185 [010] 50419.293116: sched_stat_runtime: comm=fio pid=3185 runtime=160370 [ns] vruntime=27074899056 [ns] fio-3185 [010] 50419.293118: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 A context switch out to a kworker, then 13us later we immediately switch back to the fio process, and go on running. No doubt somewhere in what the fio process is doing, we queue up more work to be run on the cpu, but the fio task keeps running (due to CONFIG_PREEMPT=n). 
fio-3185 [010] 50419.293934: sched_stat_runtime: comm=fio pid=3185 runtime=803135 [ns] vruntime=27075702191 [ns] fio-3185 [010] 50419.294936: sched_stat_runtime: comm=fio pid=3185 runtime=988478 [ns] vruntime=27076690669 [ns] fio-3185 [010] 50419.295934: sched_stat_runtime: comm=fio pid=3185 runtime=982219 [ns] vruntime=27077672888 [ns] fio-3185 [010] 50419.296935: sched_stat_runtime: comm=fio pid=3185 runtime=984781 [ns] vruntime=27078657669 [ns] fio-3185 [010] 50419.297934: sched_stat_runtime: comm=fio pid=3185 runtime=981703 [ns] vruntime=27079639372 [ns] fio-3185 [010] 50419.298937: sched_stat_runtime: comm=fio pid=3185 runtime=990057 [ns] vruntime=27080629429 [ns] fio-3185 [010] 50419.299935: sched_stat_runtime: comm=fio pid=3185 runtime=977554 [ns] vruntime=27081606983 [ns] About 6ms later, CPU 0 kicks the active load balancer on CPU 10... <idle>-0 [000] 50419.300014: sched_waking: comm=migration/10 pid=70 prio=0 target_cpu=010 fio-3185 [010] 50419.300024: sched_wakeup: migration/10:70 [0] success=1 CPU:010 fio-3185 [010] 50419.300026: sched_stat_runtime: comm=fio pid=3185 runtime=79291 [ns] vruntime=27081686274 [ns] fio-3185 [010] 50419.300027: sched_switch: fio:3185 [120] S ==> migration/10:70 [0] migration/10-70 [010] 50419.300032: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=10 dest_cpu=12 migration/10-70 [010] 50419.300040: sched_switch: migration/10:70 [0] D ==> kworker/10:0:2763 [120] And 10us later the fio process is switched away, the active load balancer work is run and migrates the fio process to CPU 12. Then... kworker/10:0-2763 [010] 50419.300048: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=9252 [ns] vruntime=27062908308 [ns] kworker/10:0-2763 [010] 50419.300062: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] <idle>-0 [010] 50419.300067: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 <idle>-0 [010] 50419.300069: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 <idle>-0 [010] 50419.300071: sched_switch: swapper/10:0 [120] S ==> kworker/10:0:2763 [120] kworker/10:0-2763 [010] 50419.300073: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] The kworker runs for another 10us and the CPU goes idle. Shortly after this, CPU 12 is woken: <idle>-0 [012] 50419.300113: sched_switch: swapper/12:0 [120] S ==> fio:3185 [120] fio-3185 [012] 50419.300596: sched_waking: comm=kworker/12:1 pid=227 prio=120 target_cpu=012 fio-3185 [012] 50419.300598: sched_stat_runtime: comm=fio pid=3185 runtime=561137 [ns] vruntime=20361153275 [ns] fio-3185 [012] 50419.300936: sched_stat_runtime: comm=fio pid=3185 runtime=326187 [ns] vruntime=20361479462 [ns] fio-3185 [012] 50419.301935: sched_stat_runtime: comm=fio pid=3185 runtime=981201 [ns] vruntime=20362460663 [ns] fio-3185 [012] 50419.302935: sched_stat_runtime: comm=fio pid=3185 runtime=983160 [ns] vruntime=20363443823 [ns] fio-3185 [012] 50419.303934: sched_stat_runtime: comm=fio pid=3185 runtime=983855 [ns] vruntime=20364427678 [ns] fio-3185 [012] 50419.304934: sched_stat_runtime: comm=fio pid=3185 runtime=977757 [ns] vruntime=20365405435 [ns] fio-3185 [012] 50419.305948: sched_stat_runtime: comm=fio pid=3185 runtime=999563 [ns] vruntime=20366404998 [ns] and fio goes on running there. 
The pattern repeats very soon afterwards: <idle>-0 [000] 50419.314982: sched_waking: comm=migration/12 pid=82 prio=0 target_cpu=012 fio-3185 [012] 50419.314988: sched_wakeup: migration/12:82 [0] success=1 CPU:012 fio-3185 [012] 50419.314990: sched_stat_runtime: comm=fio pid=3185 runtime=46342 [ns] vruntime=20375268656 [ns] fio-3185 [012] 50419.314991: sched_switch: fio:3185 [120] S ==> migration/12:82 [0] migration/12-82 [012] 50419.314995: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=12 dest_cpu=5 migration/12-82 [012] 50419.315001: sched_switch: migration/12:82 [0] D ==> kworker/12:1:227 [120] kworker/12:1-227 [012] 50419.315022: sched_stat_runtime: comm=kworker/12:1 pid=227 runtime=21453 [ns] vruntime=20359477889 [ns] kworker/12:1-227 [012] 50419.315028: sched_switch: kworker/12:1:227 [120] R ==> swapper/12:0 [120] <idle>-0 [005] 50419.315053: sched_switch: swapper/5:0 [120] S ==> fio:3185 [120] fio-3185 [005] 50419.315286: sched_waking: comm=kworker/5:0 pid=2646 prio=120 target_cpu=005 fio-3185 [005] 50419.315288: sched_stat_runtime: comm=fio pid=3185 runtime=287737 [ns] vruntime=33779011507 [ns] And fio is now running on CPU 5 - it only ran on CPU 12 for about 15ms. Hmmm: $ grep fio-3185 ~/tmp/sched.out | awk 'BEGIN {totcpu = 0.0; switches = 0.0; prev_waket = 0.0 }/sched_waking/ { cpu = $2; split($3, t, ":"); waket = t[1]; if (cpu != prev_cpu) { t_on_cpu = waket - prev_waket; if (prev_waket) { print "time on CPU", cpu, "was", t_on_cpu; totcpu += t_on_cpu; switches++ } prev_waket = waket; prev_cpu = cpu; } } END { print "switches", switches, "time on cpu", totcpu, "aver time on cpu", (totcpu / switches) } ' | stats --trim-outliers switches 2211 time on cpu 30.0994 aver time on cpu 0.0136135 time on CPU [0-23(8.8823+/-6.2)] was 0.000331-0.330772(0.0134759+/-0.012) Yeah, the fio task averages 13.4ms on any given CPU before being switched to another CPU. Mind you, the stddev is 12ms, so the range of how long it spends on any one CPU is pretty wide (330us to 330ms). IOWs, this doesn't look like a workqueue problem at all - this looks like the scheduler is repeatedly making the wrong load balancing decisions when mixing a very short runtime task (queued work) with a long runtime task on the same CPU.... This is not my area of expertise, so I have no idea why this might be happening. Scheduler experts: is this expected behaviour? What tunables directly influence the active load balancer (and/or CONFIG options) to change how aggressive it's behaviour is? > Not reproduced the issue with 512 block size on the RH system yet, > maybe it is related with my kernel config. I doubt it - this looks like a load specific corner case in the scheduling algorithm.... Cheers, Dave.
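As a starting point for that question, the scheduler knobs and
load-balancing features in play on the affected machine can be dumped like
this; the procfs/debugfs paths below assume CONFIG_SCHED_DEBUG is enabled,
as it is in typical distro configs:

    # The sched_* sysctls, including the two being varied in this thread.
    sysctl -a 2>/dev/null | grep '^kernel\.sched_'

    # Scheduler feature bits (the LB_* entries affect load balancing).
    cat /sys/kernel/debug/sched_features

    # Per-domain balancing parameters for one CPU, e.g. balance intervals.
    grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name \
           /proc/sys/kernel/sched_domain/cpu0/domain*/*_interval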
On Sat, Nov 16, 2019 at 10:40:05AM +1100, Dave Chinner wrote:
> On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote:
> > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote:
> > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote:
> > I can reproduce the issue with 4k block size on another RH system, and
> > the login info of that system has been shared to you in RH BZ.
> >
> > 1)
> > sysctl kernel.sched_min_granularity_ns=10000000
> > sysctl kernel.sched_wakeup_granularity_ns=15000000
>
> So, these settings definitely influence behaviour.
>
> If these are set to kernel defaults (4ms and 3ms each):
>
> sysctl kernel.sched_min_granularity_ns=4000000
> sysctl kernel.sched_wakeup_granularity_ns=3000000
>
> The migration problem largely goes away - the fio task migration
> event count goes from ~2,000 a run down to 200/run.
>
> That indicates that the migration trigger is likely load/timing
> based. The analysis below is based on the 10/15ms numbers above,
> because it makes it so much easier to reproduce.

On another machine, './xfs_complete 512' sees the fio thread migrated
11~12K times/sec without changing the two sched defaults above, but the
fio IO thread only takes 40% CPU. With './xfs_complete 4k' on that machine,
the fio IO thread's CPU utilization is >= 98%.

Thanks,
Ming
+Vincent On Sat, Nov 16, 2019 at 10:40:05AM +1100, Dave Chinner wrote: > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > I can reproduce the issue with 4k block size on another RH system, and > > the login info of that system has been shared to you in RH BZ. > > > > 1) > > sysctl kernel.sched_min_granularity_ns=10000000 > > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > So, these settings definitely influence behaviour. > > If these are set to kernel defaults (4ms and 3ms each): > > sysctl kernel.sched_min_granularity_ns=4000000 > sysctl kernel.sched_wakeup_granularity_ns=3000000 > > The migration problem largely goes away - the fio task migration > event count goes from ~2,000 a run down to 200/run. > > That indicates that the migration trigger is likely load/timing > based. The analysis below is based on the 10/15ms numbers above, > because it makes it so much easier to reproduce. > > > 2) > > ./xfs_complete 4k > > > > Then you should see 1k~1.5k fio io thread migration in above test, > > either v5.4-rc7(build with rhel8 config) or RHEL 4.18 kernel. > > Almost all the fio task migrations are coming from migration/X > kernel threads. i.e it's the scheduler active balancing that is > causing the fio thread to bounce around. > > This is typical a typical trace, trimmed to remove extraneous noise. > The fio process is running on CPU 10: > > fio-3185 [010] 50419.285954: sched_stat_runtime: comm=fio pid=3185 runtime=1004014 [ns] vruntime=27067882290 [ns] > fio-3185 [010] 50419.286953: sched_stat_runtime: comm=fio pid=3185 runtime=979458 [ns] vruntime=27068861748 [ns] > fio-3185 [010] 50419.287998: sched_stat_runtime: comm=fio pid=3185 runtime=1028471 [ns] vruntime=27069890219 [ns] > fio-3185 [010] 50419.289973: sched_stat_runtime: comm=fio pid=3185 runtime=989989 [ns] vruntime=27071836208 [ns] > fio-3185 [010] 50419.290958: sched_stat_runtime: comm=fio pid=3185 runtime=963914 [ns] vruntime=27072800122 [ns] > fio-3185 [010] 50419.291952: sched_stat_runtime: comm=fio pid=3185 runtime=972532 [ns] vruntime=27073772654 [ns] > > fio consumes CPU for several milliseconds, then: > > fio-3185 [010] 50419.292935: sched_stat_runtime: comm=fio pid=3185 runtime=966032 [ns] vruntime=27074738686 [ns] > fio-3185 [010] 50419.292941: sched_switch: fio:3185 [120] S ==> kworker/10:0:2763 [120] > kworker/10:0-2763 [010] 50419.292954: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=13423 [ns] vruntime=27052479694 [ns] > kworker/10:0-2763 [010] 50419.292956: sched_switch: kworker/10:0:2763 [120] R ==> fio:3185 [120] > fio-3185 [010] 50419.293115: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > fio-3185 [010] 50419.293116: sched_stat_runtime: comm=fio pid=3185 runtime=160370 [ns] vruntime=27074899056 [ns] > fio-3185 [010] 50419.293118: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > A context switch out to a kworker, then 13us later we immediately > switch back to the fio process, and go on running. No doubt > somewhere in what the fio process is doing, we queue up more work to > be run on the cpu, but the fio task keeps running > (due to CONFIG_PREEMPT=n). 
> > fio-3185 [010] 50419.293934: sched_stat_runtime: comm=fio pid=3185 runtime=803135 [ns] vruntime=27075702191 [ns] > fio-3185 [010] 50419.294936: sched_stat_runtime: comm=fio pid=3185 runtime=988478 [ns] vruntime=27076690669 [ns] > fio-3185 [010] 50419.295934: sched_stat_runtime: comm=fio pid=3185 runtime=982219 [ns] vruntime=27077672888 [ns] > fio-3185 [010] 50419.296935: sched_stat_runtime: comm=fio pid=3185 runtime=984781 [ns] vruntime=27078657669 [ns] > fio-3185 [010] 50419.297934: sched_stat_runtime: comm=fio pid=3185 runtime=981703 [ns] vruntime=27079639372 [ns] > fio-3185 [010] 50419.298937: sched_stat_runtime: comm=fio pid=3185 runtime=990057 [ns] vruntime=27080629429 [ns] > fio-3185 [010] 50419.299935: sched_stat_runtime: comm=fio pid=3185 runtime=977554 [ns] vruntime=27081606983 [ns] > > About 6ms later, CPU 0 kicks the active load balancer on CPU 10... > > <idle>-0 [000] 50419.300014: sched_waking: comm=migration/10 pid=70 prio=0 target_cpu=010 > fio-3185 [010] 50419.300024: sched_wakeup: migration/10:70 [0] success=1 CPU:010 > fio-3185 [010] 50419.300026: sched_stat_runtime: comm=fio pid=3185 runtime=79291 [ns] vruntime=27081686274 [ns] > fio-3185 [010] 50419.300027: sched_switch: fio:3185 [120] S ==> migration/10:70 [0] > migration/10-70 [010] 50419.300032: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=10 dest_cpu=12 > migration/10-70 [010] 50419.300040: sched_switch: migration/10:70 [0] D ==> kworker/10:0:2763 [120] > > And 10us later the fio process is switched away, the active load > balancer work is run and migrates the fio process to CPU 12. Then... > > kworker/10:0-2763 [010] 50419.300048: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=9252 [ns] vruntime=27062908308 [ns] > kworker/10:0-2763 [010] 50419.300062: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > <idle>-0 [010] 50419.300067: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > <idle>-0 [010] 50419.300069: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > <idle>-0 [010] 50419.300071: sched_switch: swapper/10:0 [120] S ==> kworker/10:0:2763 [120] > kworker/10:0-2763 [010] 50419.300073: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > The kworker runs for another 10us and the CPU goes idle. Shortly > after this, CPU 12 is woken: > > <idle>-0 [012] 50419.300113: sched_switch: swapper/12:0 [120] S ==> fio:3185 [120] > fio-3185 [012] 50419.300596: sched_waking: comm=kworker/12:1 pid=227 prio=120 target_cpu=012 > fio-3185 [012] 50419.300598: sched_stat_runtime: comm=fio pid=3185 runtime=561137 [ns] vruntime=20361153275 [ns] > fio-3185 [012] 50419.300936: sched_stat_runtime: comm=fio pid=3185 runtime=326187 [ns] vruntime=20361479462 [ns] > fio-3185 [012] 50419.301935: sched_stat_runtime: comm=fio pid=3185 runtime=981201 [ns] vruntime=20362460663 [ns] > fio-3185 [012] 50419.302935: sched_stat_runtime: comm=fio pid=3185 runtime=983160 [ns] vruntime=20363443823 [ns] > fio-3185 [012] 50419.303934: sched_stat_runtime: comm=fio pid=3185 runtime=983855 [ns] vruntime=20364427678 [ns] > fio-3185 [012] 50419.304934: sched_stat_runtime: comm=fio pid=3185 runtime=977757 [ns] vruntime=20365405435 [ns] > fio-3185 [012] 50419.305948: sched_stat_runtime: comm=fio pid=3185 runtime=999563 [ns] vruntime=20366404998 [ns] > > > and fio goes on running there. 
The pattern repeats very soon afterwards: > > <idle>-0 [000] 50419.314982: sched_waking: comm=migration/12 pid=82 prio=0 target_cpu=012 > fio-3185 [012] 50419.314988: sched_wakeup: migration/12:82 [0] success=1 CPU:012 > fio-3185 [012] 50419.314990: sched_stat_runtime: comm=fio pid=3185 runtime=46342 [ns] vruntime=20375268656 [ns] > fio-3185 [012] 50419.314991: sched_switch: fio:3185 [120] S ==> migration/12:82 [0] > migration/12-82 [012] 50419.314995: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=12 dest_cpu=5 > migration/12-82 [012] 50419.315001: sched_switch: migration/12:82 [0] D ==> kworker/12:1:227 [120] > kworker/12:1-227 [012] 50419.315022: sched_stat_runtime: comm=kworker/12:1 pid=227 runtime=21453 [ns] vruntime=20359477889 [ns] > kworker/12:1-227 [012] 50419.315028: sched_switch: kworker/12:1:227 [120] R ==> swapper/12:0 [120] > <idle>-0 [005] 50419.315053: sched_switch: swapper/5:0 [120] S ==> fio:3185 [120] > fio-3185 [005] 50419.315286: sched_waking: comm=kworker/5:0 pid=2646 prio=120 target_cpu=005 > fio-3185 [005] 50419.315288: sched_stat_runtime: comm=fio pid=3185 runtime=287737 [ns] vruntime=33779011507 [ns] > > And fio is now running on CPU 5 - it only ran on CPU 12 for about > 15ms. Hmmm: > > $ grep fio-3185 ~/tmp/sched.out | awk 'BEGIN {totcpu = 0.0; switches = 0.0; prev_waket = 0.0 }/sched_waking/ { cpu = $2; split($3, t, ":"); waket = t[1]; if (cpu != prev_cpu) { t_on_cpu = waket - prev_waket; if (prev_waket) { print "time on CPU", cpu, "was", t_on_cpu; totcpu += t_on_cpu; switches++ } prev_waket = waket; prev_cpu = cpu; } } END { print "switches", switches, "time on cpu", totcpu, "aver time on cpu", (totcpu / switches) } ' | stats --trim-outliers > switches 2211 time on cpu 30.0994 aver time on cpu 0.0136135 > time on CPU [0-23(8.8823+/-6.2)] was 0.000331-0.330772(0.0134759+/-0.012) > > Yeah, the fio task averages 13.4ms on any given CPU before being > switched to another CPU. Mind you, the stddev is 12ms, so the range > of how long it spends on any one CPU is pretty wide (330us to > 330ms). > > IOWs, this doesn't look like a workqueue problem at all - this looks > like the scheduler is repeatedly making the wrong load balancing > decisions when mixing a very short runtime task (queued work) with a > long runtime task on the same CPU.... > > This is not my area of expertise, so I have no idea why this might > be happening. Scheduler experts: is this expected behaviour? What > tunables directly influence the active load balancer (and/or CONFIG > options) to change how aggressive it's behaviour is? We typically only fall back to the active balancer when there is (persistent) imbalance and we fail to migrate anything else (of substance). The tuning mentioned has the effect of less frequent scheduling, IOW, leaving (short) tasks on the runqueue longer. This obviously means the load-balancer will have a bigger chance of seeing them. Now; it's been a while since I looked at the workqueue code but one possible explanation would be if the kworker that picks up the work item is pinned. That would make it runnable but not migratable, the exact situation in which we'll end up shooting the current task with active balance. I'll go see if I can reproduce and stare at the workqueue code a bit. > > > Not reproduced the issue with 512 block size on the RH system yet, > > maybe it is related with my kernel config. > > I doubt it - this looks like a load specific corner case in the > scheduling algorithm.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
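Peter's pinned-kworker hypothesis is easy to check directly on the test
box: a per-CPU kworker reports a single CPU in its allowed mask. A sketch,
using the "kworker/N:M-dio/..." naming visible in the top output earlier in
the thread:

    # Show the CPU affinity of kworkers that last ran the dio completion
    # workqueue; a pinned per-CPU kworker shows exactly one CPU here.
    for pid in $(pgrep 'kworker.*dio'); do
        printf '%s: ' "$(cat /proc/$pid/comm)"
        grep Cpus_allowed_list /proc/$pid/status
    done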
On Mon, 18 Nov 2019 at 10:21, Peter Zijlstra <peterz@infradead.org> wrote: > > > +Vincent > > On Sat, Nov 16, 2019 at 10:40:05AM +1100, Dave Chinner wrote: > > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > > I can reproduce the issue with 4k block size on another RH system, and > > > the login info of that system has been shared to you in RH BZ. > > > > > > 1) > > > sysctl kernel.sched_min_granularity_ns=10000000 > > > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > > > So, these settings definitely influence behaviour. > > > > If these are set to kernel defaults (4ms and 3ms each): > > > > sysctl kernel.sched_min_granularity_ns=4000000 > > sysctl kernel.sched_wakeup_granularity_ns=3000000 > > > > The migration problem largely goes away - the fio task migration > > event count goes from ~2,000 a run down to 200/run. > > > > That indicates that the migration trigger is likely load/timing > > based. The analysis below is based on the 10/15ms numbers above, > > because it makes it so much easier to reproduce. > > > > > 2) > > > ./xfs_complete 4k > > > > > > Then you should see 1k~1.5k fio io thread migration in above test, > > > either v5.4-rc7(build with rhel8 config) or RHEL 4.18 kernel. > > > > Almost all the fio task migrations are coming from migration/X > > kernel threads. i.e it's the scheduler active balancing that is > > causing the fio thread to bounce around. > > > > This is typical a typical trace, trimmed to remove extraneous noise. > > The fio process is running on CPU 10: > > > > fio-3185 [010] 50419.285954: sched_stat_runtime: comm=fio pid=3185 runtime=1004014 [ns] vruntime=27067882290 [ns] > > fio-3185 [010] 50419.286953: sched_stat_runtime: comm=fio pid=3185 runtime=979458 [ns] vruntime=27068861748 [ns] > > fio-3185 [010] 50419.287998: sched_stat_runtime: comm=fio pid=3185 runtime=1028471 [ns] vruntime=27069890219 [ns] > > fio-3185 [010] 50419.289973: sched_stat_runtime: comm=fio pid=3185 runtime=989989 [ns] vruntime=27071836208 [ns] > > fio-3185 [010] 50419.290958: sched_stat_runtime: comm=fio pid=3185 runtime=963914 [ns] vruntime=27072800122 [ns] > > fio-3185 [010] 50419.291952: sched_stat_runtime: comm=fio pid=3185 runtime=972532 [ns] vruntime=27073772654 [ns] > > > > fio consumes CPU for several milliseconds, then: > > > > fio-3185 [010] 50419.292935: sched_stat_runtime: comm=fio pid=3185 runtime=966032 [ns] vruntime=27074738686 [ns] > > fio-3185 [010] 50419.292941: sched_switch: fio:3185 [120] S ==> kworker/10:0:2763 [120] > > kworker/10:0-2763 [010] 50419.292954: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=13423 [ns] vruntime=27052479694 [ns] > > kworker/10:0-2763 [010] 50419.292956: sched_switch: kworker/10:0:2763 [120] R ==> fio:3185 [120] > > fio-3185 [010] 50419.293115: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > > fio-3185 [010] 50419.293116: sched_stat_runtime: comm=fio pid=3185 runtime=160370 [ns] vruntime=27074899056 [ns] > > fio-3185 [010] 50419.293118: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > > > A context switch out to a kworker, then 13us later we immediately > > switch back to the fio process, and go on running. No doubt > > somewhere in what the fio process is doing, we queue up more work to > > be run on the cpu, but the fio task keeps running > > (due to CONFIG_PREEMPT=n). 
> > > > fio-3185 [010] 50419.293934: sched_stat_runtime: comm=fio pid=3185 runtime=803135 [ns] vruntime=27075702191 [ns] > > fio-3185 [010] 50419.294936: sched_stat_runtime: comm=fio pid=3185 runtime=988478 [ns] vruntime=27076690669 [ns] > > fio-3185 [010] 50419.295934: sched_stat_runtime: comm=fio pid=3185 runtime=982219 [ns] vruntime=27077672888 [ns] > > fio-3185 [010] 50419.296935: sched_stat_runtime: comm=fio pid=3185 runtime=984781 [ns] vruntime=27078657669 [ns] > > fio-3185 [010] 50419.297934: sched_stat_runtime: comm=fio pid=3185 runtime=981703 [ns] vruntime=27079639372 [ns] > > fio-3185 [010] 50419.298937: sched_stat_runtime: comm=fio pid=3185 runtime=990057 [ns] vruntime=27080629429 [ns] > > fio-3185 [010] 50419.299935: sched_stat_runtime: comm=fio pid=3185 runtime=977554 [ns] vruntime=27081606983 [ns] > > > > About 6ms later, CPU 0 kicks the active load balancer on CPU 10... > > > > <idle>-0 [000] 50419.300014: sched_waking: comm=migration/10 pid=70 prio=0 target_cpu=010 > > fio-3185 [010] 50419.300024: sched_wakeup: migration/10:70 [0] success=1 CPU:010 > > fio-3185 [010] 50419.300026: sched_stat_runtime: comm=fio pid=3185 runtime=79291 [ns] vruntime=27081686274 [ns] > > fio-3185 [010] 50419.300027: sched_switch: fio:3185 [120] S ==> migration/10:70 [0] > > migration/10-70 [010] 50419.300032: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=10 dest_cpu=12 > > migration/10-70 [010] 50419.300040: sched_switch: migration/10:70 [0] D ==> kworker/10:0:2763 [120] > > > > And 10us later the fio process is switched away, the active load > > balancer work is run and migrates the fio process to CPU 12. Then... > > > > kworker/10:0-2763 [010] 50419.300048: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=9252 [ns] vruntime=27062908308 [ns] > > kworker/10:0-2763 [010] 50419.300062: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > <idle>-0 [010] 50419.300067: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > > <idle>-0 [010] 50419.300069: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > <idle>-0 [010] 50419.300071: sched_switch: swapper/10:0 [120] S ==> kworker/10:0:2763 [120] > > kworker/10:0-2763 [010] 50419.300073: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > > > The kworker runs for another 10us and the CPU goes idle. Shortly > > after this, CPU 12 is woken: > > > > <idle>-0 [012] 50419.300113: sched_switch: swapper/12:0 [120] S ==> fio:3185 [120] > > fio-3185 [012] 50419.300596: sched_waking: comm=kworker/12:1 pid=227 prio=120 target_cpu=012 > > fio-3185 [012] 50419.300598: sched_stat_runtime: comm=fio pid=3185 runtime=561137 [ns] vruntime=20361153275 [ns] > > fio-3185 [012] 50419.300936: sched_stat_runtime: comm=fio pid=3185 runtime=326187 [ns] vruntime=20361479462 [ns] > > fio-3185 [012] 50419.301935: sched_stat_runtime: comm=fio pid=3185 runtime=981201 [ns] vruntime=20362460663 [ns] > > fio-3185 [012] 50419.302935: sched_stat_runtime: comm=fio pid=3185 runtime=983160 [ns] vruntime=20363443823 [ns] > > fio-3185 [012] 50419.303934: sched_stat_runtime: comm=fio pid=3185 runtime=983855 [ns] vruntime=20364427678 [ns] > > fio-3185 [012] 50419.304934: sched_stat_runtime: comm=fio pid=3185 runtime=977757 [ns] vruntime=20365405435 [ns] > > fio-3185 [012] 50419.305948: sched_stat_runtime: comm=fio pid=3185 runtime=999563 [ns] vruntime=20366404998 [ns] > > > > > > and fio goes on running there. 
The pattern repeats very soon afterwards: > > > > <idle>-0 [000] 50419.314982: sched_waking: comm=migration/12 pid=82 prio=0 target_cpu=012 > > fio-3185 [012] 50419.314988: sched_wakeup: migration/12:82 [0] success=1 CPU:012 > > fio-3185 [012] 50419.314990: sched_stat_runtime: comm=fio pid=3185 runtime=46342 [ns] vruntime=20375268656 [ns] > > fio-3185 [012] 50419.314991: sched_switch: fio:3185 [120] S ==> migration/12:82 [0] > > migration/12-82 [012] 50419.314995: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=12 dest_cpu=5 > > migration/12-82 [012] 50419.315001: sched_switch: migration/12:82 [0] D ==> kworker/12:1:227 [120] > > kworker/12:1-227 [012] 50419.315022: sched_stat_runtime: comm=kworker/12:1 pid=227 runtime=21453 [ns] vruntime=20359477889 [ns] > > kworker/12:1-227 [012] 50419.315028: sched_switch: kworker/12:1:227 [120] R ==> swapper/12:0 [120] > > <idle>-0 [005] 50419.315053: sched_switch: swapper/5:0 [120] S ==> fio:3185 [120] > > fio-3185 [005] 50419.315286: sched_waking: comm=kworker/5:0 pid=2646 prio=120 target_cpu=005 > > fio-3185 [005] 50419.315288: sched_stat_runtime: comm=fio pid=3185 runtime=287737 [ns] vruntime=33779011507 [ns] > > > > And fio is now running on CPU 5 - it only ran on CPU 12 for about > > 15ms. Hmmm: > > > > $ grep fio-3185 ~/tmp/sched.out | awk 'BEGIN {totcpu = 0.0; switches = 0.0; prev_waket = 0.0 }/sched_waking/ { cpu = $2; split($3, t, ":"); waket = t[1]; if (cpu != prev_cpu) { t_on_cpu = waket - prev_waket; if (prev_waket) { print "time on CPU", cpu, "was", t_on_cpu; totcpu += t_on_cpu; switches++ } prev_waket = waket; prev_cpu = cpu; } } END { print "switches", switches, "time on cpu", totcpu, "aver time on cpu", (totcpu / switches) } ' | stats --trim-outliers > > switches 2211 time on cpu 30.0994 aver time on cpu 0.0136135 > > time on CPU [0-23(8.8823+/-6.2)] was 0.000331-0.330772(0.0134759+/-0.012) > > > > Yeah, the fio task averages 13.4ms on any given CPU before being > > switched to another CPU. Mind you, the stddev is 12ms, so the range > > of how long it spends on any one CPU is pretty wide (330us to > > 330ms). > > > > IOWs, this doesn't look like a workqueue problem at all - this looks > > like the scheduler is repeatedly making the wrong load balancing > > decisions when mixing a very short runtime task (queued work) with a > > long runtime task on the same CPU.... > > > > This is not my area of expertise, so I have no idea why this might > > be happening. Scheduler experts: is this expected behaviour? What > > tunables directly influence the active load balancer (and/or CONFIG > > options) to change how aggressive it's behaviour is? > > We typically only fall back to the active balancer when there is > (persistent) imbalance and we fail to migrate anything else (of > substance). > > The tuning mentioned has the effect of less frequent scheduling, IOW, > leaving (short) tasks on the runqueue longer. This obviously means the > load-balancer will have a bigger chance of seeing them. > > Now; it's been a while since I looked at the workqueue code but one > possible explanation would be if the kworker that picks up the work item > is pinned. That would make it runnable but not migratable, the exact > situation in which we'll end up shooting the current task with active > balance. I agree with your explanation above Although queue_work uses queue_work_on(WORK_CPU_UNBOUND,... 
In the end, it uses a bound workqueue and tries to queue the work on the local CPU first, even if that CPU is already busy. > > I'll go see if I can reproduce and stare at the workqueue code a bit. > > > > > > Not reproduced the issue with 512 block size on the RH system yet, > > > maybe it is related with my kernel config. > > > > I doubt it - this looks like a load specific corner case in the > > scheduling algorithm.... > > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com
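For reference, the behaviour Vincent describes comes from the fact that queue_work() is only a thin wrapper, and WORK_CPU_UNBOUND merely means "no specific CPU requested" rather than "run on an unbound worker". A simplified sketch of that path (paraphrased, not a verbatim copy of the workqueue code):

static inline bool queue_work(struct workqueue_struct *wq,
			      struct work_struct *work)
{
	/* "unbound" here only means the caller did not pick a CPU */
	return queue_work_on(WORK_CPU_UNBOUND, wq, work);
}

/*
 * __queue_work() then resolves WORK_CPU_UNBOUND to the submitting CPU,
 * and for a per-cpu (!WQ_UNBOUND) workqueue such as the filesystem's
 * s_dio_done_wq the item is handed to that CPU's pinned kworker pool.
 * Only a workqueue created with WQ_UNBOUND hands the item to a worker
 * the scheduler is free to place elsewhere.
 */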
* Dave Chinner <david@fromorbit.com> [2019-11-16 10:40:05]: > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > I can reproduce the issue with 4k block size on another RH system, and > > the login info of that system has been shared to you in RH BZ. > > > > 1) > > Almost all the fio task migrations are coming from migration/X > kernel threads. i.e it's the scheduler active balancing that is > causing the fio thread to bounce around. > Can we try with the below patch.
On Mon, Nov 18, 2019 at 10:21:21AM +0100, Peter Zijlstra wrote: > On Sat, Nov 16, 2019 at 10:40:05AM +1100, Dave Chinner wrote: > > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > And fio is now running on CPU 5 - it only ran on CPU 12 for about > > 15ms. Hmmm: > > > > $ grep fio-3185 ~/tmp/sched.out | awk 'BEGIN {totcpu = 0.0; switches = 0.0; prev_waket = 0.0 }/sched_waking/ { cpu = $2; split($3, t, ":"); waket = t[1]; if (cpu != prev_cpu) { t_on_cpu = waket - prev_waket; if (prev_waket) { print "time on CPU", cpu, "was", t_on_cpu; totcpu += t_on_cpu; switches++ } prev_waket = waket; prev_cpu = cpu; } } END { print "switches", switches, "time on cpu", totcpu, "aver time on cpu", (totcpu / switches) } ' | stats --trim-outliers > > switches 2211 time on cpu 30.0994 aver time on cpu 0.0136135 > > time on CPU [0-23(8.8823+/-6.2)] was 0.000331-0.330772(0.0134759+/-0.012) > > > > Yeah, the fio task averages 13.4ms on any given CPU before being > > switched to another CPU. Mind you, the stddev is 12ms, so the range > > of how long it spends on any one CPU is pretty wide (330us to > > 330ms). > > > > IOWs, this doesn't look like a workqueue problem at all - this looks > > like the scheduler is repeatedly making the wrong load balancing > > decisions when mixing a very short runtime task (queued work) with a > > long runtime task on the same CPU.... > > > > This is not my area of expertise, so I have no idea why this might > > be happening. Scheduler experts: is this expected behaviour? What > > tunables directly influence the active load balancer (and/or CONFIG > > options) to change how aggressive it's behaviour is? > > We typically only fall back to the active balancer when there is > (persistent) imbalance and we fail to migrate anything else (of > substance). > > The tuning mentioned has the effect of less frequent scheduling, IOW, > leaving (short) tasks on the runqueue longer. This obviously means the > load-balancer will have a bigger chance of seeing them. > > Now; it's been a while since I looked at the workqueue code but one > possible explanation would be if the kworker that picks up the work item > is pinned. That would make it runnable but not migratable, the exact > situation in which we'll end up shooting the current task with active > balance. Yes, that's precisely the problem - work is queued, by default, on a specific CPU and it will wait for a kworker that is pinned to that specific CPU to dispatch it. We've already tested that queuing on a different CPU (via queue_work_on()) makes the problem largely go away as the work is not longer queued behind the long running fio task. This, however, is not at viable solution to the problem. The pattern of a long running process queuing small pieces of individual work for processing in a separate context is pretty common... Cheers, Dave.
On Mon, Nov 18, 2019 at 09:56:33PM +0530, Srikar Dronamraju wrote: > * Dave Chinner <david@fromorbit.com> [2019-11-16 10:40:05]: > > > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > > I can reproduce the issue with 4k block size on another RH system, and > > > the login info of that system has been shared to you in RH BZ. > > > > > > 1) > > > > Almost all the fio task migrations are coming from migration/X > > kernel threads. i.e it's the scheduler active balancing that is > > causing the fio thread to bounce around. > > > > Can we try with the below patch. That makes things much, much worse. $ sudo trace-cmd show |grep fio |wc -l 4701 $ Even when not running the fio workload (i.e. the system is largely idle and I'm just doing admin tasks like setting up for test runs), the number of task migrations that occur goes up significantly. -Dave.
On Tue, Nov 19, 2019 at 08:18:04AM +1100, Dave Chinner wrote: > On Mon, Nov 18, 2019 at 09:56:33PM +0530, Srikar Dronamraju wrote: > > * Dave Chinner <david@fromorbit.com> [2019-11-16 10:40:05]: > > > > > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > > > I can reproduce the issue with 4k block size on another RH system, and > > > > the login info of that system has been shared to you in RH BZ. > > > > > > > > 1) > > > > > > Almost all the fio task migrations are coming from migration/X > > > kernel threads. i.e it's the scheduler active balancing that is > > > causing the fio thread to bounce around. > > > > > > > Can we try with the below patch. > > That makes things much, much worse. Yeah, it is the same as my test result: basically the fio IO thread migration count is increased from 5k to 8k. Thanks, Ming
On Tue, Nov 19, 2019 at 07:40:54AM +1100, Dave Chinner wrote: > On Mon, Nov 18, 2019 at 10:21:21AM +0100, Peter Zijlstra wrote: > > We typically only fall back to the active balancer when there is > > (persistent) imbalance and we fail to migrate anything else (of > > substance). > > > > The tuning mentioned has the effect of less frequent scheduling, IOW, > > leaving (short) tasks on the runqueue longer. This obviously means the > > load-balancer will have a bigger chance of seeing them. > > > > Now; it's been a while since I looked at the workqueue code but one > > possible explanation would be if the kworker that picks up the work item > > is pinned. That would make it runnable but not migratable, the exact > > situation in which we'll end up shooting the current task with active > > balance. > > Yes, that's precisely the problem - work is queued, by default, on a > specific CPU and it will wait for a kworker that is pinned to that I'm thinking the problem is that it doesn't wait. If it went and waited for it, active balance wouldn't be needed, that only works on active tasks. > specific CPU to dispatch it. We've already tested that queuing on a > different CPU (via queue_work_on()) makes the problem largely go > away as the work is not longer queued behind the long running fio > task. > > This, however, is not at viable solution to the problem. The pattern > of a long running process queuing small pieces of individual work > for processing in a separate context is pretty common... Right, but you're putting the scheduler in a bind. By overloading the CPU and only allowing the one task to migrate, it pretty much has no choice left. Anyway, I'm still going to have try and reproduce -- I got side-tracked into a crashing bug, I'll hopefully get back to this tomorrow. Lastly, one other thing to try is -next. Vincent reworked the load-balancer quite a bit.
Hi Peter, On Wed, Nov 20, 2019 at 08:16:36PM +0100 Peter Zijlstra wrote: > On Tue, Nov 19, 2019 at 07:40:54AM +1100, Dave Chinner wrote: > > On Mon, Nov 18, 2019 at 10:21:21AM +0100, Peter Zijlstra wrote: > > > > We typically only fall back to the active balancer when there is > > > (persistent) imbalance and we fail to migrate anything else (of > > > substance). > > > > > > The tuning mentioned has the effect of less frequent scheduling, IOW, > > > leaving (short) tasks on the runqueue longer. This obviously means the > > > load-balancer will have a bigger chance of seeing them. > > > > > > Now; it's been a while since I looked at the workqueue code but one > > > possible explanation would be if the kworker that picks up the work item > > > is pinned. That would make it runnable but not migratable, the exact > > > situation in which we'll end up shooting the current task with active > > > balance. > > > > Yes, that's precisely the problem - work is queued, by default, on a > > specific CPU and it will wait for a kworker that is pinned to that > > I'm thinking the problem is that it doesn't wait. If it went and waited > for it, active balance wouldn't be needed, that only works on active > tasks. Since this is AIO I wonder if it should queue_work on a nearby cpu by default instead of unbound. > > > specific CPU to dispatch it. We've already tested that queuing on a > > different CPU (via queue_work_on()) makes the problem largely go > > away as the work is not longer queued behind the long running fio > > task. > > > > This, however, is not at viable solution to the problem. The pattern > > of a long running process queuing small pieces of individual work > > for processing in a separate context is pretty common... > > Right, but you're putting the scheduler in a bind. By overloading the > CPU and only allowing the one task to migrate, it pretty much has no > choice left. > > Anyway, I'm still going to have try and reproduce -- I got side-tracked > into a crashing bug, I'll hopefully get back to this tomorrow. Lastly, > one other thing to try is -next. Vincent reworked the load-balancer > quite a bit. > I've tried it with the lb patch series. I get basically the same results. With the high granularity settings I get 3700 migrations for the 30 second run at 4k. Of those about 3200 are active balance on stock 5.4-rc7. With the lb patches it's 3500 and 3000, a slight drop. Using the default granularity settings 50 and 22 for stock and 250 and 25. So a few more total migrations with the lb patches but about the same active. On this system I'm getting 100k migrations using 512 byte blocksize. Almost all not active. I haven't looked into that closely yet but it's like 3000 per second looking like this: ... 
64.19641 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.19694 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.19746 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.19665 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.19718 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.19772 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.19800 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.19828 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.19856 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.19882 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.19909 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.19937 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.19967 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.19995 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.20023 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.20053 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.20079 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.20107 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.20135 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.20163 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 64.20192 386 386 kworker/15:1 sched_migrate_task fio/2784 cpu 15->19 64.20221 389 389 kworker/19:1 sched_migrate_task fio/2784 cpu 19->15 ... Which is roughly equal to the number if iops it's doing. Cheers, Phil --
On Wed, Nov 20, 2019 at 05:03:13PM -0500, Phil Auld wrote: > Hi Peter, > > On Wed, Nov 20, 2019 at 08:16:36PM +0100 Peter Zijlstra wrote: > > On Tue, Nov 19, 2019 at 07:40:54AM +1100, Dave Chinner wrote: > > > On Mon, Nov 18, 2019 at 10:21:21AM +0100, Peter Zijlstra wrote: > > > > > > We typically only fall back to the active balancer when there is > > > > (persistent) imbalance and we fail to migrate anything else (of > > > > substance). > > > > > > > > The tuning mentioned has the effect of less frequent scheduling, IOW, > > > > leaving (short) tasks on the runqueue longer. This obviously means the > > > > load-balancer will have a bigger chance of seeing them. > > > > > > > > Now; it's been a while since I looked at the workqueue code but one > > > > possible explanation would be if the kworker that picks up the work item > > > > is pinned. That would make it runnable but not migratable, the exact > > > > situation in which we'll end up shooting the current task with active > > > > balance. > > > > > > Yes, that's precisely the problem - work is queued, by default, on a > > > specific CPU and it will wait for a kworker that is pinned to that > > > > I'm thinking the problem is that it doesn't wait. If it went and waited > > for it, active balance wouldn't be needed, that only works on active > > tasks. > > Since this is AIO I wonder if it should queue_work on a nearby cpu by > default instead of unbound. When the current CPU isn't busy enough, there is still cost for completing request remotely. Or could we change queue_work() in the following way? * We try to queue the work to the CPU on which it was submitted, but if the * CPU dies or is saturated enough it can be processed by another CPU. Can we decide in a simple or efficient way if the current CPU is saturated enough? Thanks, Ming
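A sketch of the shape of Ming's proposal, with the hard part deliberately left as a stub; cpu_is_saturated() is hypothetical and does not exist in the kernel, and deciding what it should actually test is exactly the open question raised above:

#include <linux/smp.h>
#include <linux/workqueue.h>

/* Hypothetical predicate: does the submitting CPU have spare cycles? */
static bool cpu_is_saturated(int cpu)
{
	/* e.g. runnable tasks besides the submitter, recent idle time, ... */
	return false;	/* placeholder only */
}

static bool queue_work_local_or_spill(struct workqueue_struct *wq,
				      struct work_struct *work)
{
	int cpu = raw_smp_processor_id();

	/* Prefer the submitting CPU for cache locality... */
	if (!cpu_is_saturated(cpu))
		return queue_work_on(cpu, wq, work);

	/* ...but spill to an unbound worker when it is overloaded. */
	return queue_work(system_unbound_wq, work);
}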
On Wed, Nov 20, 2019 at 05:03:13PM -0500, Phil Auld wrote: > On Wed, Nov 20, 2019 at 08:16:36PM +0100 Peter Zijlstra wrote: > > On Tue, Nov 19, 2019 at 07:40:54AM +1100, Dave Chinner wrote: > > > Yes, that's precisely the problem - work is queued, by default, on a > > > specific CPU and it will wait for a kworker that is pinned to that > > > > I'm thinking the problem is that it doesn't wait. If it went and waited > > for it, active balance wouldn't be needed, that only works on active > > tasks. > > Since this is AIO I wonder if it should queue_work on a nearby cpu by > default instead of unbound. The thing seems to be that 'unbound' is in fact 'bound'. Maybe we should fix that. If the load-balancer were allowed to move the kworker around when it didn't get time to run, that would probably be a better solution. Picking another 'bound' cpu at random might create the same sort of problems in more complicated scenarios. TJ, ISTR there used to be actually unbound kworkers, what happened to those? Or am I misremembering things? > > Lastly, > > one other thing to try is -next. Vincent reworked the load-balancer > > quite a bit. > > > I've tried it with the lb patch series. I get basically the same results. > With the high granularity settings I get 3700 migrations for the 30 > second run at 4k. Of those about 3200 are active balance on stock 5.4-rc7. > With the lb patches it's 3500 and 3000, a slight drop. Thanks for testing that. I didn't expect miracles, but it is good to verify. > Using the default granularity settings 50 and 22 for stock and 250 and 25. > So a few more total migrations with the lb patches but about the same active. Right, so the granularity thing interacts with the load-balance period. Pushing it up, as some people appear to do, makes it so that what might be a temporary imbalance is perceived as a persistent imbalance. Tying the load-balance period to the granularity is something we could consider, but then, I'm sure, we'll get other people complaining that it doesn't balance quickly enough anymore.
On Thu, Nov 21, 2019 at 12:12:18PM +0800 Ming Lei wrote: > On Wed, Nov 20, 2019 at 05:03:13PM -0500, Phil Auld wrote: > > Hi Peter, > > > > On Wed, Nov 20, 2019 at 08:16:36PM +0100 Peter Zijlstra wrote: > > > On Tue, Nov 19, 2019 at 07:40:54AM +1100, Dave Chinner wrote: > > > > On Mon, Nov 18, 2019 at 10:21:21AM +0100, Peter Zijlstra wrote: > > > > > > > > We typically only fall back to the active balancer when there is > > > > > (persistent) imbalance and we fail to migrate anything else (of > > > > > substance). > > > > > > > > > > The tuning mentioned has the effect of less frequent scheduling, IOW, > > > > > leaving (short) tasks on the runqueue longer. This obviously means the > > > > > load-balancer will have a bigger chance of seeing them. > > > > > > > > > > Now; it's been a while since I looked at the workqueue code but one > > > > > possible explanation would be if the kworker that picks up the work item > > > > > is pinned. That would make it runnable but not migratable, the exact > > > > > situation in which we'll end up shooting the current task with active > > > > > balance. > > > > > > > > Yes, that's precisely the problem - work is queued, by default, on a > > > > specific CPU and it will wait for a kworker that is pinned to that > > > > > > I'm thinking the problem is that it doesn't wait. If it went and waited > > > for it, active balance wouldn't be needed, that only works on active > > > tasks. > > > > Since this is AIO I wonder if it should queue_work on a nearby cpu by > > default instead of unbound. > > When the current CPU isn't busy enough, there is still cost for completing > request remotely. > > Or could we change queue_work() in the following way? > > * We try to queue the work to the CPU on which it was submitted, but if the > * CPU dies or is saturated enough it can be processed by another CPU. > > Can we decide in a simple or efficient way if the current CPU is saturated > enough? > The scheduler doesn't know if the queued_work submitter is going to go to sleep. That's why I was singling out AIO. My understanding of it is that you submit the IO and then keep going. So in that case it might be better to pick a node-local nearby cpu instead. But this is a user of work queue issue not a scheduler issue. Interestingly in our fio case the 4k one does not sleep and we get the active balance case where it moves the actually running thread. The 512 byte case seems to be sleeping since the migrations are all at wakeup time I believe. Cheers, Phil > Thanks, > Ming --
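A minimal sketch of the "node-local nearby CPU" idea mentioned above (illustrative only; pick_nodelocal_cpu() is not an existing kernel helper, though the cpumask/topology accessors used are standard):

#include <linux/cpumask.h>
#include <linux/smp.h>
#include <linux/topology.h>

/*
 * Hypothetical helper: pick another online CPU on the submitter's NUMA
 * node, so completion work stays node-local without being queued behind
 * the submitter itself. Falls back to the local CPU if the node has no
 * other online CPU. A real implementation would want to spread the
 * choice rather than always take the first candidate.
 */
static int pick_nodelocal_cpu(void)
{
	int this_cpu = raw_smp_processor_id();
	int cpu;

	for_each_cpu_and(cpu, cpumask_of_node(numa_node_id()), cpu_online_mask)
		if (cpu != this_cpu)
			return cpu;

	return this_cpu;
}

The chosen CPU could then be passed to queue_work_on() instead of letting queue_work() fall back to the submitting CPU.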
On Thu, Nov 21, 2019 at 02:29:37PM +0100 Peter Zijlstra wrote: > On Wed, Nov 20, 2019 at 05:03:13PM -0500, Phil Auld wrote: > > On Wed, Nov 20, 2019 at 08:16:36PM +0100 Peter Zijlstra wrote: > > > On Tue, Nov 19, 2019 at 07:40:54AM +1100, Dave Chinner wrote: > > > > > Yes, that's precisely the problem - work is queued, by default, on a > > > > specific CPU and it will wait for a kworker that is pinned to that > > > > > > I'm thinking the problem is that it doesn't wait. If it went and waited > > > for it, active balance wouldn't be needed, that only works on active > > > tasks. > > > > Since this is AIO I wonder if it should queue_work on a nearby cpu by > > default instead of unbound. > > The thing seems to be that 'unbound' is in fact 'bound'. Maybe we should > fix that. If the load-balancer were allowed to move the kworker around > when it didn't get time to run, that would probably be a better > solution. > Yeah, I'm not convinced this is actually a scheduler issue. > Picking another 'bound' cpu by random might create the same sort of > problems in more complicated scenarios. > > TJ, ISTR there used to be actually unbound kworkers, what happened to > those? or am I misremembering things. > > > > Lastly, > > > one other thing to try is -next. Vincent reworked the load-balancer > > > quite a bit. > > > > > > > I've tried it with the lb patch series. I get basically the same results. > > With the high granularity settings I get 3700 migrations for the 30 > > second run at 4k. Of those about 3200 are active balance on stock 5.4-rc7. > > With the lb patches it's 3500 and 3000, a slight drop. > > Thanks for testing that. I didn't expect miracles, but it is good to > verify. > > > Using the default granularity settings 50 and 22 for stock and 250 and 25. > > So a few more total migrations with the lb patches but about the same active. > > Right, so the granularity thing interacts with the load-balance period. > By pushing it up, as some people appear to do, makes it so that what > might be a temporal imablance is perceived as a persitent imbalance. > > Tying the load-balance period to the gramularity is something we could > consider, but then I'm sure, we'll get other people complaining the > doesn't balance quick enough anymore. > Thanks. These are old tuned settings that have been carried along. They may not be right for newer kernels anyway. --
On 21/11/2019 16:12, Phil Auld wrote: <> > > The scheduler doesn't know if the queued_work submitter is going to go to sleep. > That's why I was singling out AIO. My understanding of it is that you submit the IO > and then keep going. So in that case it might be better to pick a node-local nearby > cpu instead. But this is a user of work queue issue not a scheduler issue. > We have a very similar long-standing problem in our system (zufs) that we had to do hacks to fix. We have seen this CPU bouncing exactly as above in fio and other benchmarks. Our final analysis was: One thread is in wait_event(); if the wake_up() comes from the same CPU as the waiter then, on some systems (usually real HW and not VMs), the waiter would bounce to a different CPU. Now our system has an array of worker-threads bound to each CPU. An incoming thread chooses the corresponding per-CPU worker-thread, lets it run and waits for a reply; when the worker-thread is done it will do a wake_up(). Usually it's fine and the wait_event() stays on the same CPU. But on some systems it will wake up on a different CPU. Now this is a great pity, because in our case, the work_queue case, and a high percentage of other places, the thread calling wake_up() will then immediately go to sleep on something. (Work done, let's wait for new work.) I wish there was a flag to wake_up(), or to the event object, that says to relinquish the remainder of the time-slice to the waiter on the same CPU, since I will soon be sleeping. Then the scheduler need not guess whether the wake_up() caller is going to sleep soon or is going to continue. Let the coder give a hint about that? (The hack was to set the waiter's CPU mask to the incoming CPU and restore it after wakeup.) > Interestingly in our fio case the 4k one does not sleep and we get the active balance > case where it moves the actually running thread. The 512 byte case seems to be > sleeping since the migrations are all at wakeup time I believe. > Yes, this is the same thing we saw in our system. (And it happens only sometimes.) > Cheers, > Phil > > >> Thanks, >> Ming > Many thanks, Boaz
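The affinity hack Boaz describes presumably looks something like the following; this is a guess at its shape, not the actual zufs code, and wait_pinned(), the wait queue, flag and submit_cpu are all illustrative:

#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/wait.h>

/*
 * Pin the waiter to the submitting CPU across the wait, then restore
 * its original mask. Simplified: a real version would avoid a
 * cpumask_t on the stack for large NR_CPUS and check for errors.
 */
static int wait_pinned(wait_queue_head_t *wq, bool *done, int submit_cpu)
{
	cpumask_t saved;
	int ret;

	cpumask_copy(&saved, current->cpus_ptr);
	set_cpus_allowed_ptr(current, cpumask_of(submit_cpu));

	ret = wait_event_interruptible(*wq, *done);

	set_cpus_allowed_ptr(current, &saved);
	return ret;
}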
On 11/21/19 8:02 AM, Boaz Harrosh wrote: > On 21/11/2019 16:12, Phil Auld wrote: > <> >> >> The scheduler doesn't know if the queued_work submitter is going to go to sleep. >> That's why I was singling out AIO. My understanding of it is that you submit the IO >> and then keep going. So in that case it might be better to pick a node-local nearby >> cpu instead. But this is a user of work queue issue not a scheduler issue. >> > > We have a very similar long standing problem in our system (zufs), that we had to do > hacks to fix. > > We have seen these CPU bouncing exacly as above in fio and more > benchmarks, Our final analysis was: > > One thread is in wait_event() if the wake_up() is on the same CPU as > the waiter, on some systems usually real HW and not VMs, would bounce > to a different CPU. Now our system has an array of worker-threads > bound to each CPU. an incoming thread chooses a corresponding cpu > worker-thread, let it run, waiting for a reply, then when the > worker-thread is done it will do a wake_up(). Usually its fine and the > wait_event() stays on the same CPU. But on some systems it will wakeup > in a different CPU. > > Now this is a great pity because in our case and the work_queue case > and high % of places the thread calling wake_up() will then > immediately go to sleep on something. (Work done lets wait for new > work) > > I wish there was a flag to wake_up() or to the event object that says > to relinquish the remaning of the time-slice to the waiter on same > CPU, since I will be soon sleeping. Isn't that basically what wake_up_sync() is?
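For context, the sync-wakeup variant referred to here passes the WF_SYNC hint to the scheduler, i.e. "the waker is about to block, so the wakee can be kept on this CPU". A minimal usage sketch; the wait queue, flag and function names are made up for illustration:

#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(reply_wq);
static bool reply_ready;

/* Worker side: finish the request, then go back to sleep for more work. */
static void worker_complete(void)
{
	reply_ready = true;
	/* Sync wakeup: hint that the waker will block shortly. */
	wake_up_interruptible_sync(&reply_wq);
}

/* Submitter side: wait for the worker's reply. */
static int wait_for_reply(void)
{
	return wait_event_interruptible(reply_wq, reply_ready);
}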
On Thu, Nov 21, 2019 at 09:12:07AM -0500, Phil Auld wrote: > On Thu, Nov 21, 2019 at 12:12:18PM +0800 Ming Lei wrote: > > On Wed, Nov 20, 2019 at 05:03:13PM -0500, Phil Auld wrote: > > > Hi Peter, > > > > > > On Wed, Nov 20, 2019 at 08:16:36PM +0100 Peter Zijlstra wrote: > > > > On Tue, Nov 19, 2019 at 07:40:54AM +1100, Dave Chinner wrote: > > > > > On Mon, Nov 18, 2019 at 10:21:21AM +0100, Peter Zijlstra wrote: > > > > > > > > > > We typically only fall back to the active balancer when there is > > > > > > (persistent) imbalance and we fail to migrate anything else (of > > > > > > substance). > > > > > > > > > > > > The tuning mentioned has the effect of less frequent scheduling, IOW, > > > > > > leaving (short) tasks on the runqueue longer. This obviously means the > > > > > > load-balancer will have a bigger chance of seeing them. > > > > > > > > > > > > Now; it's been a while since I looked at the workqueue code but one > > > > > > possible explanation would be if the kworker that picks up the work item > > > > > > is pinned. That would make it runnable but not migratable, the exact > > > > > > situation in which we'll end up shooting the current task with active > > > > > > balance. > > > > > > > > > > Yes, that's precisely the problem - work is queued, by default, on a > > > > > specific CPU and it will wait for a kworker that is pinned to that > > > > > > > > I'm thinking the problem is that it doesn't wait. If it went and waited > > > > for it, active balance wouldn't be needed, that only works on active > > > > tasks. > > > > > > Since this is AIO I wonder if it should queue_work on a nearby cpu by > > > default instead of unbound. > > > > When the current CPU isn't busy enough, there is still cost for completing > > request remotely. > > > > Or could we change queue_work() in the following way? > > > > * We try to queue the work to the CPU on which it was submitted, but if the > > * CPU dies or is saturated enough it can be processed by another CPU. > > > > Can we decide in a simple or efficient way if the current CPU is saturated > > enough? > > > > The scheduler doesn't know if the queued_work submitter is going to go to sleep. > That's why I was singling out AIO. My understanding of it is that you submit the IO > and then keep going. So in that case it might be better to pick a node-local nearby > cpu instead. But this is a user of work queue issue not a scheduler issue. I think the part people are missing completely here is that the workqueue in question here is the -completion- work queue, and is not actually directly connected to the submitter process that is getting migrated. This all happens on the one CPU: Submitter blk-mq completion (softirq) wq io_submit() aio_write submit_bio aio_write submit_bio ..... blk-mq bh triggered bio_endio iomap_dio_endio(dio) queue_work(dio) kworker queued bio_endio iomap_dio_endio(dio) queue_work(dio) .... <softirq ends, back to submitter> aio_write submit_bio aio_write submit_bio ... <scheduler migrates task> iomap_dio_complete() aio_complete() .... iomap_dio_complete() aio_complete() .... iomap_dio_complete() aio_complete() IOWs, the reason we see this behaviour is the IO completion steering done by the blk-mq layer is, by default, directing completion back to the submitter CPU. IIUC, it's the per-cpu submission/completion queues that result in this behaviour, because local completion of IOs has been measured to be substantially faster for highly concurrent, high IOPS workloads. 
What Ming's reproducer workload does is increase the amount of CPU that the submitter process uses to the point where it uses the entire CPU and starves the completion kworker thread from running. i.e. there is no idle CPU left for the completion to be processed without pre-empting the running submitter task in some way. Sometimes we see the scheduler run the kworker thread and switch straight back to the submitter task - this is what typically happens when the scheduler is using the default tunings. But when the non-default tunings are used, task migrations occur. So the scheduler migrates the submitter task, because it cannot move the IO completion tasks. And then blk-mq sees that submissions are coming from a different CPU, and it queues the incoming completions to that new CPU, thereby repeating the pattern. > Interestingly in our fio case the 4k one does not sleep and we get > the active balance case where it moves the actually running > thread. The 512 byte case seems to be sleeping since the > migrations are all at wakeup time I believe. The 512 byte case demonstrates a behaviour where the submitter task is ping-ponging between CPUs on each IO submission. It is likely another corner case where the blk-mq behaviour of delivering completions to the submitter CPU triggers immediate migration rather than just context switching to the kworker thread. Further investigation needs to be done there to determine if the migration is caused by pre-emption, or whether it is a result of the submitter finishing work and yielding the CPU, but then being woken and finding the CPU is held by a running bound task and so is migrated to an idle CPU. The next IO completion has the completion directed to the new CPU, and so it migrates back when woken. IOWs, whatever the cause of the task migration is, it is likely that it is repeating because IO completions are following the submitter task around and repeatedly triggering the same migration heuristic over and over again. Also, keep in mind this is *not caused by AIO*. Workqueues are used in IO completion to avoid a basic "can't do filesystem modification work in irq context" problem. i.e. we can't take blocking locks or run transactions in bio completion context, because that will stall IO completion processing for that CPU completely, and then the filesystem deadlocks. A "non-AIO" example is page cache writeback: the bdi flusher threads are designed around an asynchronous IO submission pattern where they never wait for completions. In this case, delayed allocation is the reason the submitter consumes a big chunk of CPU. And IO completion uses workqueues for the same reason as AIO - because we often have to run transactions to complete the metadata updates necessary to finish the IO correctly. We can't do those updates in the bio completion context (softirq) because we need to block and do stuff that is dependent on future IO completions being processed. So, yeah, this "queue small amounts of work to a workqueue from IO completion context" is fundamental to how the filesystem IO stack functions in a modern kernel. As such, this specific IO completion workqueue usage needs to be handled correctly by the default kernel config, and not fall apart when scheduler tunings are slightly tweaked. Cheers, Dave.
On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > On Sat, 16 Nov 2019 10:40:05 Dave Chinner wrote: > > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > > I can reproduce the issue with 4k block size on another RH system, and > > > the login info of that system has been shared to you in RH BZ. > > > > > > 1) > > > sysctl kernel.sched_min_granularity_ns=10000000 > > > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > > > So, these settings definitely influence behaviour. > > > > If these are set to kernel defaults (4ms and 3ms each): > > > > sysctl kernel.sched_min_granularity_ns=4000000 > > sysctl kernel.sched_wakeup_granularity_ns=3000000 > > > > The migration problem largely goes away - the fio task migration > > event count goes from ~2,000 a run down to 200/run. > > > > That indicates that the migration trigger is likely load/timing > > based. The analysis below is based on the 10/15ms numbers above, > > because it makes it so much easier to reproduce. > > > > > 2) > > > ./xfs_complete 4k > > > > > > Then you should see 1k~1.5k fio io thread migration in above test, > > > either v5.4-rc7(build with rhel8 config) or RHEL 4.18 kernel. > > > > Almost all the fio task migrations are coming from migration/X > > kernel threads. i.e it's the scheduler active balancing that is > > causing the fio thread to bounce around. > > > > This is typical a typical trace, trimmed to remove extraneous noise. > > The fio process is running on CPU 10: > > > > fio-3185 [010] 50419.285954: sched_stat_runtime: comm=fio pid=3185 runtime=1004014 [ns] vruntime=27067882290 [ns] > > fio-3185 [010] 50419.286953: sched_stat_runtime: comm=fio pid=3185 runtime=979458 [ns] vruntime=27068861748 [ns] > > fio-3185 [010] 50419.287998: sched_stat_runtime: comm=fio pid=3185 runtime=1028471 [ns] vruntime=27069890219 [ns] > > fio-3185 [010] 50419.289973: sched_stat_runtime: comm=fio pid=3185 runtime=989989 [ns] vruntime=27071836208 [ns] > > fio-3185 [010] 50419.290958: sched_stat_runtime: comm=fio pid=3185 runtime=963914 [ns] vruntime=27072800122 [ns] > > fio-3185 [010] 50419.291952: sched_stat_runtime: comm=fio pid=3185 runtime=972532 [ns] vruntime=27073772654 [ns] > > > > fio consumes CPU for several milliseconds, then: > > > > fio-3185 [010] 50419.292935: sched_stat_runtime: comm=fio pid=3185 runtime=966032 [ns] vruntime=27074738686 [ns] > > fio-3185 [010] 50419.292941: sched_switch: fio:3185 [120] S ==> kworker/10:0:2763 [120] > > kworker/10:0-2763 [010] 50419.292954: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=13423 [ns] vruntime=27052479694 [ns] > > kworker/10:0-2763 [010] 50419.292956: sched_switch: kworker/10:0:2763 [120] R ==> fio:3185 [120] > > fio-3185 [010] 50419.293115: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > > fio-3185 [010] 50419.293116: sched_stat_runtime: comm=fio pid=3185 runtime=160370 [ns] vruntime=27074899056 [ns] > > fio-3185 [010] 50419.293118: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > > > A context switch out to a kworker, then 13us later we immediately > > switch back to the fio process, and go on running. No doubt > > somewhere in what the fio process is doing, we queue up more work to > > be run on the cpu, but the fio task keeps running > > (due to CONFIG_PREEMPT=n). 
> > > > fio-3185 [010] 50419.293934: sched_stat_runtime: comm=fio pid=3185 runtime=803135 [ns] vruntime=27075702191 [ns] > > fio-3185 [010] 50419.294936: sched_stat_runtime: comm=fio pid=3185 runtime=988478 [ns] vruntime=27076690669 [ns] > > fio-3185 [010] 50419.295934: sched_stat_runtime: comm=fio pid=3185 runtime=982219 [ns] vruntime=27077672888 [ns] > > fio-3185 [010] 50419.296935: sched_stat_runtime: comm=fio pid=3185 runtime=984781 [ns] vruntime=27078657669 [ns] > > fio-3185 [010] 50419.297934: sched_stat_runtime: comm=fio pid=3185 runtime=981703 [ns] vruntime=27079639372 [ns] > > fio-3185 [010] 50419.298937: sched_stat_runtime: comm=fio pid=3185 runtime=990057 [ns] vruntime=27080629429 [ns] > > fio-3185 [010] 50419.299935: sched_stat_runtime: comm=fio pid=3185 runtime=977554 [ns] vruntime=27081606983 [ns] > > > > About 6ms later, CPU 0 kicks the active load balancer on CPU 10... > > > > <idle>-0 [000] 50419.300014: sched_waking: comm=migration/10 pid=70 prio=0 target_cpu=010 > > fio-3185 [010] 50419.300024: sched_wakeup: migration/10:70 [0] success=1 CPU:010 > > fio-3185 [010] 50419.300026: sched_stat_runtime: comm=fio pid=3185 runtime=79291 [ns] vruntime=27081686274 [ns] > > fio-3185 [010] 50419.300027: sched_switch: fio:3185 [120] S ==> migration/10:70 [0] > > migration/10-70 [010] 50419.300032: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=10 dest_cpu=12 > > migration/10-70 [010] 50419.300040: sched_switch: migration/10:70 [0] D ==> kworker/10:0:2763 [120] > > > > And 10us later the fio process is switched away, the active load > > balancer work is run and migrates the fio process to CPU 12. Then... > > > > kworker/10:0-2763 [010] 50419.300048: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=9252 [ns] vruntime=27062908308 [ns] > > kworker/10:0-2763 [010] 50419.300062: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > <idle>-0 [010] 50419.300067: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > > <idle>-0 [010] 50419.300069: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > <idle>-0 [010] 50419.300071: sched_switch: swapper/10:0 [120] S ==> kworker/10:0:2763 [120] > > kworker/10:0-2763 [010] 50419.300073: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > > > The kworker runs for another 10us and the CPU goes idle. Shortly > > after this, CPU 12 is woken: > > > > <idle>-0 [012] 50419.300113: sched_switch: swapper/12:0 [120] S ==> fio:3185 [120] > > fio-3185 [012] 50419.300596: sched_waking: comm=kworker/12:1 pid=227 prio=120 target_cpu=012 > > fio-3185 [012] 50419.300598: sched_stat_runtime: comm=fio pid=3185 runtime=561137 [ns] vruntime=20361153275 [ns] > > fio-3185 [012] 50419.300936: sched_stat_runtime: comm=fio pid=3185 runtime=326187 [ns] vruntime=20361479462 [ns] > > fio-3185 [012] 50419.301935: sched_stat_runtime: comm=fio pid=3185 runtime=981201 [ns] vruntime=20362460663 [ns] > > fio-3185 [012] 50419.302935: sched_stat_runtime: comm=fio pid=3185 runtime=983160 [ns] vruntime=20363443823 [ns] > > fio-3185 [012] 50419.303934: sched_stat_runtime: comm=fio pid=3185 runtime=983855 [ns] vruntime=20364427678 [ns] > > fio-3185 [012] 50419.304934: sched_stat_runtime: comm=fio pid=3185 runtime=977757 [ns] vruntime=20365405435 [ns] > > fio-3185 [012] 50419.305948: sched_stat_runtime: comm=fio pid=3185 runtime=999563 [ns] vruntime=20366404998 [ns] > > > > > > and fio goes on running there. 
The pattern repeats very soon afterwards: > > > > <idle>-0 [000] 50419.314982: sched_waking: comm=migration/12 pid=82 prio=0 target_cpu=012 > > fio-3185 [012] 50419.314988: sched_wakeup: migration/12:82 [0] success=1 CPU:012 > > fio-3185 [012] 50419.314990: sched_stat_runtime: comm=fio pid=3185 runtime=46342 [ns] vruntime=20375268656 [ns] > > fio-3185 [012] 50419.314991: sched_switch: fio:3185 [120] S ==> migration/12:82 [0] > > migration/12-82 [012] 50419.314995: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=12 dest_cpu=5 > > migration/12-82 [012] 50419.315001: sched_switch: migration/12:82 [0] D ==> kworker/12:1:227 [120] > > kworker/12:1-227 [012] 50419.315022: sched_stat_runtime: comm=kworker/12:1 pid=227 runtime=21453 [ns] vruntime=20359477889 [ns] > > kworker/12:1-227 [012] 50419.315028: sched_switch: kworker/12:1:227 [120] R ==> swapper/12:0 [120] > > <idle>-0 [005] 50419.315053: sched_switch: swapper/5:0 [120] S ==> fio:3185 [120] > > fio-3185 [005] 50419.315286: sched_waking: comm=kworker/5:0 pid=2646 prio=120 target_cpu=005 > > fio-3185 [005] 50419.315288: sched_stat_runtime: comm=fio pid=3185 runtime=287737 [ns] vruntime=33779011507 [ns] > > > > And fio is now running on CPU 5 - it only ran on CPU 12 for about > > 15ms. Hmmm: > > > > $ grep fio-3185 ~/tmp/sched.out | awk 'BEGIN {totcpu = 0.0; switches = 0.0; prev_waket = 0.0 }/sched_waking/ { cpu = $2; split($3, t, ":"); waket = t[1]; if (cpu != prev_cpu) { t_on_cpu = waket - prev_waket; if (prev_waket) { print "time on CPU", cpu, "was", t_on_cpu; totcpu += t_on_cpu; switches++ } prev_waket = waket; prev_cpu = cpu; } } END { print "switches", switches, "time on cpu", totcpu, "aver time on cpu", (totcpu / switches) } ' | stats --trim-outliers > > switches 2211 time on cpu 30.0994 aver time on cpu 0.0136135 > > time on CPU [0-23(8.8823+/-6.2)] was 0.000331-0.330772(0.0134759+/-0.012) > > > > Yeah, the fio task averages 13.4ms on any given CPU before being > > switched to another CPU. Mind you, the stddev is 12ms, so the range > > of how long it spends on any one CPU is pretty wide (330us to > > 330ms). > > > Hey Dave > > > IOWs, this doesn't look like a workqueue problem at all - this looks > > Surprised to see you're so sure it has little to do with wq, > > > like the scheduler is repeatedly making the wrong load balancing > > decisions when mixing a very short runtime task (queued work) with a > > long runtime task on the same CPU.... > > > and it helps more to know what is driving lb to make decisions like > this. Because for 70+ per cent of communters in cities like London it > is supposed tube is better than cab on work days, the end_io cb is > tweaked to be a lookalike of execute_in_process_context() in the diff > with the devoted s_dio_done_wq taken out of account. It's interesting > to see what difference lb will make in the tube environment. > > Hillf > > > This is not my area of expertise, so I have no idea why this might > > be happening. Scheduler experts: is this expected behaviour? What > > tunables directly influence the active load balancer (and/or CONFIG > > options) to change how aggressive it's behaviour is? > > > > > Not reproduced the issue with 512 block size on the RH system yet, > > > maybe it is related with my kernel config. > > > > I doubt it - this looks like a load specific corner case in the > > scheduling algorithm.... > > > > Cheers, > > > > Dave. 
> > -- > > Dave Chinner > > david@fromorbit.com > > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > WRITE_ONCE(dio->submit.waiter, NULL); > blk_wake_io_task(waiter); > } else if (dio->flags & IOMAP_DIO_WRITE) { > - struct inode *inode = file_inode(dio->iocb->ki_filp); > - > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > + schedule_work(&dio->aio.work); I'm not sure that this will make a real difference, because it ends up calling queue_work(system_wq, ...) and system_wq is bound as well, so the work will still be pinned to a CPU. Using system_unbound_wq should make a difference because it doesn't pin the work to a CPU: + queue_work(system_unbound_wq, &dio->aio.work); > } else { > iomap_dio_complete_work(&dio->aio.work); > } > -- >
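For reference, schedule_work() is simply a wrapper around the bound system_wq, which is why it does not change the pinning behaviour Vincent points out:

static inline bool schedule_work(struct work_struct *work)
{
	return queue_work(system_wq, work);
}

Queueing on system_unbound_wq instead, as suggested, hands the work to a worker that is not pinned to the submitting CPU.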
On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote: > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > > > > On Sat, 16 Nov 2019 10:40:05 Dave Chinner wrote: > > > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > > > I can reproduce the issue with 4k block size on another RH system, and > > > > the login info of that system has been shared to you in RH BZ. > > > > > > > > 1) > > > > sysctl kernel.sched_min_granularity_ns=10000000 > > > > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > > > > > So, these settings definitely influence behaviour. > > > > > > If these are set to kernel defaults (4ms and 3ms each): > > > > > > sysctl kernel.sched_min_granularity_ns=4000000 > > > sysctl kernel.sched_wakeup_granularity_ns=3000000 > > > > > > The migration problem largely goes away - the fio task migration > > > event count goes from ~2,000 a run down to 200/run. > > > > > > That indicates that the migration trigger is likely load/timing > > > based. The analysis below is based on the 10/15ms numbers above, > > > because it makes it so much easier to reproduce. > > > > > > > 2) > > > > ./xfs_complete 4k > > > > > > > > Then you should see 1k~1.5k fio io thread migration in above test, > > > > either v5.4-rc7(build with rhel8 config) or RHEL 4.18 kernel. > > > > > > Almost all the fio task migrations are coming from migration/X > > > kernel threads. i.e it's the scheduler active balancing that is > > > causing the fio thread to bounce around. > > > > > > This is typical a typical trace, trimmed to remove extraneous noise. > > > The fio process is running on CPU 10: > > > > > > fio-3185 [010] 50419.285954: sched_stat_runtime: comm=fio pid=3185 runtime=1004014 [ns] vruntime=27067882290 [ns] > > > fio-3185 [010] 50419.286953: sched_stat_runtime: comm=fio pid=3185 runtime=979458 [ns] vruntime=27068861748 [ns] > > > fio-3185 [010] 50419.287998: sched_stat_runtime: comm=fio pid=3185 runtime=1028471 [ns] vruntime=27069890219 [ns] > > > fio-3185 [010] 50419.289973: sched_stat_runtime: comm=fio pid=3185 runtime=989989 [ns] vruntime=27071836208 [ns] > > > fio-3185 [010] 50419.290958: sched_stat_runtime: comm=fio pid=3185 runtime=963914 [ns] vruntime=27072800122 [ns] > > > fio-3185 [010] 50419.291952: sched_stat_runtime: comm=fio pid=3185 runtime=972532 [ns] vruntime=27073772654 [ns] > > > > > > fio consumes CPU for several milliseconds, then: > > > > > > fio-3185 [010] 50419.292935: sched_stat_runtime: comm=fio pid=3185 runtime=966032 [ns] vruntime=27074738686 [ns] > > > fio-3185 [010] 50419.292941: sched_switch: fio:3185 [120] S ==> kworker/10:0:2763 [120] > > > kworker/10:0-2763 [010] 50419.292954: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=13423 [ns] vruntime=27052479694 [ns] > > > kworker/10:0-2763 [010] 50419.292956: sched_switch: kworker/10:0:2763 [120] R ==> fio:3185 [120] > > > fio-3185 [010] 50419.293115: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > > > fio-3185 [010] 50419.293116: sched_stat_runtime: comm=fio pid=3185 runtime=160370 [ns] vruntime=27074899056 [ns] > > > fio-3185 [010] 50419.293118: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > > > > > A context switch out to a kworker, then 13us later we immediately > > > switch back to the fio process, and go on running. 
No doubt > > > somewhere in what the fio process is doing, we queue up more work to > > > be run on the cpu, but the fio task keeps running > > > (due to CONFIG_PREEMPT=n). > > > > > > fio-3185 [010] 50419.293934: sched_stat_runtime: comm=fio pid=3185 runtime=803135 [ns] vruntime=27075702191 [ns] > > > fio-3185 [010] 50419.294936: sched_stat_runtime: comm=fio pid=3185 runtime=988478 [ns] vruntime=27076690669 [ns] > > > fio-3185 [010] 50419.295934: sched_stat_runtime: comm=fio pid=3185 runtime=982219 [ns] vruntime=27077672888 [ns] > > > fio-3185 [010] 50419.296935: sched_stat_runtime: comm=fio pid=3185 runtime=984781 [ns] vruntime=27078657669 [ns] > > > fio-3185 [010] 50419.297934: sched_stat_runtime: comm=fio pid=3185 runtime=981703 [ns] vruntime=27079639372 [ns] > > > fio-3185 [010] 50419.298937: sched_stat_runtime: comm=fio pid=3185 runtime=990057 [ns] vruntime=27080629429 [ns] > > > fio-3185 [010] 50419.299935: sched_stat_runtime: comm=fio pid=3185 runtime=977554 [ns] vruntime=27081606983 [ns] > > > > > > About 6ms later, CPU 0 kicks the active load balancer on CPU 10... > > > > > > <idle>-0 [000] 50419.300014: sched_waking: comm=migration/10 pid=70 prio=0 target_cpu=010 > > > fio-3185 [010] 50419.300024: sched_wakeup: migration/10:70 [0] success=1 CPU:010 > > > fio-3185 [010] 50419.300026: sched_stat_runtime: comm=fio pid=3185 runtime=79291 [ns] vruntime=27081686274 [ns] > > > fio-3185 [010] 50419.300027: sched_switch: fio:3185 [120] S ==> migration/10:70 [0] > > > migration/10-70 [010] 50419.300032: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=10 dest_cpu=12 > > > migration/10-70 [010] 50419.300040: sched_switch: migration/10:70 [0] D ==> kworker/10:0:2763 [120] > > > > > > And 10us later the fio process is switched away, the active load > > > balancer work is run and migrates the fio process to CPU 12. Then... > > > > > > kworker/10:0-2763 [010] 50419.300048: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=9252 [ns] vruntime=27062908308 [ns] > > > kworker/10:0-2763 [010] 50419.300062: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > > <idle>-0 [010] 50419.300067: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > > > <idle>-0 [010] 50419.300069: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > > <idle>-0 [010] 50419.300071: sched_switch: swapper/10:0 [120] S ==> kworker/10:0:2763 [120] > > > kworker/10:0-2763 [010] 50419.300073: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > > > > > The kworker runs for another 10us and the CPU goes idle. 
Shortly > > > after this, CPU 12 is woken: > > > > > > <idle>-0 [012] 50419.300113: sched_switch: swapper/12:0 [120] S ==> fio:3185 [120] > > > fio-3185 [012] 50419.300596: sched_waking: comm=kworker/12:1 pid=227 prio=120 target_cpu=012 > > > fio-3185 [012] 50419.300598: sched_stat_runtime: comm=fio pid=3185 runtime=561137 [ns] vruntime=20361153275 [ns] > > > fio-3185 [012] 50419.300936: sched_stat_runtime: comm=fio pid=3185 runtime=326187 [ns] vruntime=20361479462 [ns] > > > fio-3185 [012] 50419.301935: sched_stat_runtime: comm=fio pid=3185 runtime=981201 [ns] vruntime=20362460663 [ns] > > > fio-3185 [012] 50419.302935: sched_stat_runtime: comm=fio pid=3185 runtime=983160 [ns] vruntime=20363443823 [ns] > > > fio-3185 [012] 50419.303934: sched_stat_runtime: comm=fio pid=3185 runtime=983855 [ns] vruntime=20364427678 [ns] > > > fio-3185 [012] 50419.304934: sched_stat_runtime: comm=fio pid=3185 runtime=977757 [ns] vruntime=20365405435 [ns] > > > fio-3185 [012] 50419.305948: sched_stat_runtime: comm=fio pid=3185 runtime=999563 [ns] vruntime=20366404998 [ns] > > > > > > > > > and fio goes on running there. The pattern repeats very soon afterwards: > > > > > > <idle>-0 [000] 50419.314982: sched_waking: comm=migration/12 pid=82 prio=0 target_cpu=012 > > > fio-3185 [012] 50419.314988: sched_wakeup: migration/12:82 [0] success=1 CPU:012 > > > fio-3185 [012] 50419.314990: sched_stat_runtime: comm=fio pid=3185 runtime=46342 [ns] vruntime=20375268656 [ns] > > > fio-3185 [012] 50419.314991: sched_switch: fio:3185 [120] S ==> migration/12:82 [0] > > > migration/12-82 [012] 50419.314995: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=12 dest_cpu=5 > > > migration/12-82 [012] 50419.315001: sched_switch: migration/12:82 [0] D ==> kworker/12:1:227 [120] > > > kworker/12:1-227 [012] 50419.315022: sched_stat_runtime: comm=kworker/12:1 pid=227 runtime=21453 [ns] vruntime=20359477889 [ns] > > > kworker/12:1-227 [012] 50419.315028: sched_switch: kworker/12:1:227 [120] R ==> swapper/12:0 [120] > > > <idle>-0 [005] 50419.315053: sched_switch: swapper/5:0 [120] S ==> fio:3185 [120] > > > fio-3185 [005] 50419.315286: sched_waking: comm=kworker/5:0 pid=2646 prio=120 target_cpu=005 > > > fio-3185 [005] 50419.315288: sched_stat_runtime: comm=fio pid=3185 runtime=287737 [ns] vruntime=33779011507 [ns] > > > > > > And fio is now running on CPU 5 - it only ran on CPU 12 for about > > > 15ms. Hmmm: > > > > > > $ grep fio-3185 ~/tmp/sched.out | awk 'BEGIN {totcpu = 0.0; switches = 0.0; prev_waket = 0.0 }/sched_waking/ { cpu = $2; split($3, t, ":"); waket = t[1]; if (cpu != prev_cpu) { t_on_cpu = waket - prev_waket; if (prev_waket) { print "time on CPU", cpu, "was", t_on_cpu; totcpu += t_on_cpu; switches++ } prev_waket = waket; prev_cpu = cpu; } } END { print "switches", switches, "time on cpu", totcpu, "aver time on cpu", (totcpu / switches) } ' | stats --trim-outliers > > > switches 2211 time on cpu 30.0994 aver time on cpu 0.0136135 > > > time on CPU [0-23(8.8823+/-6.2)] was 0.000331-0.330772(0.0134759+/-0.012) > > > > > > Yeah, the fio task averages 13.4ms on any given CPU before being > > > switched to another CPU. Mind you, the stddev is 12ms, so the range > > > of how long it spends on any one CPU is pretty wide (330us to > > > 330ms). 
> > > > > Hey Dave > > > > > IOWs, this doesn't look like a workqueue problem at all - this looks > > > > Surprised to see you're so sure it has little to do with wq, > > > > > like the scheduler is repeatedly making the wrong load balancing > > > decisions when mixing a very short runtime task (queued work) with a > > > long runtime task on the same CPU.... > > > > > and it helps more to know what is driving lb to make decisions like > > this. Because for 70+ per cent of communters in cities like London it > > is supposed tube is better than cab on work days, the end_io cb is > > tweaked to be a lookalike of execute_in_process_context() in the diff > > with the devoted s_dio_done_wq taken out of account. It's interesting > > to see what difference lb will make in the tube environment. > > > > Hillf > > > > > This is not my area of expertise, so I have no idea why this might > > > be happening. Scheduler experts: is this expected behaviour? What > > > tunables directly influence the active load balancer (and/or CONFIG > > > options) to change how aggressive it's behaviour is? > > > > > > > Not reproduced the issue with 512 block size on the RH system yet, > > > > maybe it is related with my kernel config. > > > > > > I doubt it - this looks like a load specific corner case in the > > > scheduling algorithm.... > > > > > > Cheers, > > > > > > Dave. > > > -- > > > Dave Chinner > > > david@fromorbit.com > > > > --- a/fs/iomap/direct-io.c > > +++ b/fs/iomap/direct-io.c > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > > WRITE_ONCE(dio->submit.waiter, NULL); > > blk_wake_io_task(waiter); > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > - struct inode *inode = file_inode(dio->iocb->ki_filp); > > - > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > + schedule_work(&dio->aio.work); > > I'm not sure that this will make a real difference because it ends up > to call queue_work(system_wq, ...) and system_wq is bounded as well so > the work will still be pinned to a CPU > Using system_unbound_wq should make a difference because it doesn't > pin the work on a CPU > + queue_work(system_unbound_wq, &dio->aio.work); Indeed, just run a quick test on my KVM guest, looks the following patch makes a difference: diff --git a/fs/direct-io.c b/fs/direct-io.c index 9329ced91f1d..2f4488b0ecec 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb) { struct workqueue_struct *old; struct workqueue_struct *wq = alloc_workqueue("dio/%s", - WQ_MEM_RECLAIM, 0, + WQ_MEM_RECLAIM | + WQ_UNBOUND, 0, sb->s_id); Thanks, Ming
On Thu, Nov 28, 2019 at 05:40:03PM +0800, Hillf Danton wrote: > On Sat, 16 Nov 2019 10:40:05 Dave Chinner wrote: > > Yeah, the fio task averages 13.4ms on any given CPU before being > > switched to another CPU. Mind you, the stddev is 12ms, so the range > > of how long it spends on any one CPU is pretty wide (330us to > > 330ms). > > > Hey Dave > > > IOWs, this doesn't look like a workqueue problem at all - this looks > > Surprised to see you're so sure it has little to do with wq, Because I understand how the workqueue is used here. Essentially, the workqueue is not necessary for a -pure- overwrite where no metadata updates or end-of-io filesystem work is required. However, change the workload just slightly, such as allocating the space, writing into preallocated space (unwritten extents), using AIO writes to extend the file, using O_DSYNC, etc, and we *must* use a workqueue as we have to take blocking locks and/or run transactions. These may still be very short (e.g. updating inode size) and in most cases will not block, but if they do and we don't move the work out of the block layer completion context (i.e. the softirq running the block bh), then we risk deadlocking the code. Not to mention none of the filesystem inode locks are irq safe. IOWs, we can remove the workqueue for this -one specific instance- but it does not remove the requirement for using a workqueue for all the other types of write IO that pass through this code. > > like the scheduler is repeatedly making the wrong load balancing > > decisions when mixing a very short runtime task (queued work) with a > > long runtime task on the same CPU.... > > > and it helps more to know what is driving lb to make decisions like > this. I know exactly what is driving it through both observation and understanding of the code, and I've explained it elsewhere in this thread. > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > WRITE_ONCE(dio->submit.waiter, NULL); > blk_wake_io_task(waiter); > } else if (dio->flags & IOMAP_DIO_WRITE) { > - struct inode *inode = file_inode(dio->iocb->ki_filp); > - > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > + schedule_work(&dio->aio.work); This does nothing but change the workqueue from a per-sb wq to the system wq. The work is still bound to the same CPU it is queued on, so nothing will change. Cheers, Dave.
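For reference, the deferral pattern described above looks roughly like the following minimal sketch. This is illustrative only - the struct and function names are invented and this is not the iomap code - but it shows why the bound workqueue exists: the bio end_io callback runs in (soft)irq context and must not block, so completion work that may sleep is pushed into process context on the same CPU.

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/bio.h>
#include <linux/slab.h>

/* hypothetical per-IO object, not a real kernel structure */
struct my_dio {
	struct work_struct	work;
	struct bio		*bio;
};

static void my_dio_complete_work(struct work_struct *work)
{
	struct my_dio *dio = container_of(work, struct my_dio, work);

	/* process context: safe to take sleeping locks or run transactions */
	bio_put(dio->bio);
	kfree(dio);
}

static void my_dio_bio_end_io(struct bio *bio)
{
	struct my_dio *dio = bio->bi_private;

	/* (soft)irq context: must not block, so defer to a bound workqueue */
	INIT_WORK(&dio->work, my_dio_complete_work);
	queue_work(system_wq, &dio->work);	/* runs later on this same CPU */
}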
On Mon, Dec 02, 2019 at 10:46:25AM +0800, Ming Lei wrote: > On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote: > > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > --- a/fs/iomap/direct-io.c > > > +++ b/fs/iomap/direct-io.c > > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > > > WRITE_ONCE(dio->submit.waiter, NULL); > > > blk_wake_io_task(waiter); > > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > > - struct inode *inode = file_inode(dio->iocb->ki_filp); > > > - > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > > + schedule_work(&dio->aio.work); > > > > I'm not sure that this will make a real difference because it ends up > > to call queue_work(system_wq, ...) and system_wq is bounded as well so > > the work will still be pinned to a CPU > > Using system_unbound_wq should make a difference because it doesn't > > pin the work on a CPU > > + queue_work(system_unbound_wq, &dio->aio.work); > > Indeed, just run a quick test on my KVM guest, looks the following patch > makes a difference: > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 9329ced91f1d..2f4488b0ecec 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb) > { > struct workqueue_struct *old; > struct workqueue_struct *wq = alloc_workqueue("dio/%s", > - WQ_MEM_RECLAIM, 0, > + WQ_MEM_RECLAIM | > + WQ_UNBOUND, 0, > sb->s_id); That's not an answer to the user task migration issue. That is, all this patch does is trade user task migration when the CPU is busy for migrating all the queued work off the CPU so the user task does not get migrated. IOWs, this forces all the queued work to be migrated rather than the user task. IOWs, it does not address the issue we've exposed in the scheduler between tasks with competing CPU affinity scheduling requirements - it just hides the symptom. Maintaining CPU affinity across dispatch and completion work has been proven to be a significant performance win. Right throughout the IO stack we try to keep this submitter/completion affinity, and that's the whole point of using a bound wq in the first place: efficient delayed batch processing of work on the local CPU. Spewing deferred completion work across every idle CPU in the machine because the local cpu is temporarily busy is a bad choice, both from a performance perspective (dirty cacheline bouncing) and from a power efficiency point of view as it causes CPUs to be taken out of idle state much more frequently[*]. The fact that the scheduler migrates the user task when we use workqueues for deferred work as they were intended doesn't make this a workqueue problem. If the answer to this problem is "make all IO workqueues WQ_UNBOUND" then we are effectively saying "the scheduler has unfixable problems when mixing bound and unbound work on the same run queue". And, besides, what happens when every other CPU is also completely busy and can't run the work in a timely fashion? We've just moved the work to some random CPU where we wait to be scheduled instead of just sitting on the local CPU and waiting....
But these only address the visible symptom and don't take into account the wider goals of retaining CPU affinity in the IO stack, and they will have variable scheduling latency and performance as the overall system load changes. So, we can fiddle with workqueues, but it doesn't address the underlying issue that the scheduler appears to be migrating non-bound tasks off a busy CPU too easily.... -Dave. [*] Pay attention to the WQ_POWER_EFFICIENT definition for a work queue: it's designed for interrupt routines that defer work via work queues to avoid doing work on otherwise idle CPUs. It does this by turning the per-cpu wq into an unbound wq so that work gets scheduled on non-idle CPUs in preference to the local idle CPU which can then remain in low power states. That's the exact opposite of what using WQ_UNBOUND ends up doing in this IO completion context: it pushes the work out over idle CPUs rather than keeping it confined on the already busy CPUs where CPU affinity allows the work to be done quickly. So while WQ_UNBOUND avoids the user task being migrated frequently, it results in the work being spread around many more CPUs and we burn more power to do the same work.
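To make the workqueue flag distinctions in this exchange concrete, here is a sketch of the three allocation choices being debated. This is generic workqueue API usage rather than a patch from this thread, and the workqueue names are placeholders; note also that WQ_POWER_EFFICIENT only behaves as unbound when power-efficient workqueues are enabled (e.g. via the workqueue.power_efficient boot parameter or CONFIG_WQ_POWER_EFFICIENT_DEFAULT).

#include <linux/init.h>
#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *wq_bound, *wq_unbound, *wq_pe;

static int __init example_wq_init(void)
{
	/* per-CPU (bound): work runs on the CPU it was queued from */
	wq_bound = alloc_workqueue("example-bound", WQ_MEM_RECLAIM, 0);

	/* unbound: work may run on any allowed CPU (NUMA-node affine) */
	wq_unbound = alloc_workqueue("example-unbound",
				     WQ_MEM_RECLAIM | WQ_UNBOUND, 0);

	/*
	 * power-efficient: behaves like a bound wq by default, but is
	 * treated as unbound when power-efficient workqueues are enabled
	 * so that otherwise-idle CPUs can stay in low power states.
	 */
	wq_pe = alloc_workqueue("example-pe",
				WQ_MEM_RECLAIM | WQ_POWER_EFFICIENT, 0);

	if (!wq_bound || !wq_unbound || !wq_pe)
		return -ENOMEM;
	return 0;
}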
On Mon, Dec 02, 2019 at 03:02:56PM +1100, Dave Chinner wrote: > On Mon, Dec 02, 2019 at 10:46:25AM +0800, Ming Lei wrote: > > On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote: > > > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > > --- a/fs/iomap/direct-io.c > > > > +++ b/fs/iomap/direct-io.c > > > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > > > > WRITE_ONCE(dio->submit.waiter, NULL); > > > > blk_wake_io_task(waiter); > > > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > > > - struct inode *inode = file_inode(dio->iocb->ki_filp); > > > > - > > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > > > + schedule_work(&dio->aio.work); > > > > > > I'm not sure that this will make a real difference because it ends up > > > to call queue_work(system_wq, ...) and system_wq is bounded as well so > > > the work will still be pinned to a CPU > > > Using system_unbound_wq should make a difference because it doesn't > > > pin the work on a CPU > > > + queue_work(system_unbound_wq, &dio->aio.work); > > > > Indeed, just run a quick test on my KVM guest, looks the following patch > > makes a difference: > > > > diff --git a/fs/direct-io.c b/fs/direct-io.c > > index 9329ced91f1d..2f4488b0ecec 100644 > > --- a/fs/direct-io.c > > +++ b/fs/direct-io.c > > @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb) > > { > > struct workqueue_struct *old; > > struct workqueue_struct *wq = alloc_workqueue("dio/%s", > > - WQ_MEM_RECLAIM, 0, > > + WQ_MEM_RECLAIM | > > + WQ_UNBOUND, 0, > > sb->s_id); > > That's not an answer to the user task migration issue. > > That is, all this patch does is trade user task migration when the > CPU is busy for migrating all the queued work off the CPU so the > user task does not get migrated. IOWs, this forces all the queued > work to be migrated rather than the user task. IOWs, it does not > address the issue we've exposed in the scheduler between tasks with > competing CPU affinity scheduling requirements - it just hides the > symptom. Yeah, I believe we all reach the agreement that this is one issue in scheduler's load balance. > > Maintaining CPU affinity across dispatch and completion work has > been proven to be a significant performance win. Right throughout > the IO stack we try to keep this submitter/completion affinity, > and that's the whole point of using a bound wq in the first place: > efficient delayed batch processing of work on the local CPU. > > Spewing deferred completion work across every idle CPU in the > machine because the local cpu is temporarily busy is a bad choice, > both from a performance perspective (dirty cacheline bouncing) and > from a power efficiency point of view as it causes CPUs to be taken > out of idle state much more frequently[*]. > > The fact that the scheduler migrates the user task we use workqueues > for deferred work as they were intended doesn't make this a > workqueue problem. If the answer to this problem is "make all IO > workqueues WQ_UNBOUND" then we are effectively saying "the scheduler > has unfixable problems when mixing bound and unbound work on the > same run queue". > > And, besides, what happens when every other CPU is also completely > busy and can't run the work in a timely fashion? We've just moved > the work to some random CPU where we wait to be scheduled instead of > just sitting on the local CPU and waiting.... 
> > So, yes, we can work around the -symptoms- we see (frequent user > task migration) by changing the work queue configuration or > bypassing the workqueue for this specific workload. But these only > address the visible symptom and don't take into account the wider > goals of retaining CPU affinity in the IO stack, and they will have > variable scheduling latency and perofrmance and as the overall > system load changes. So far I haven't seen any progress on fixing the load balance issue, so I'd suggest working around it via the patch if no one objects. We can add an explicit comment that it is just a workaround for the scheduler's problem. BTW, there are lots of WQ_UNBOUND users in the kernel: [linux]$ git grep -n -w WQ_UNBOUND ./ | grep -v "kernel\/workqueue" | wc 86 524 7356 Some of them are used in fast IO paths too. > > So, we can fiddle with workqueues, but it doesn't address the > underlying issue that the scheduler appears to be migrating > non-bound tasks off a busy CPU too easily.... > > -Dave. > > [*] Pay attention to the WQ_POWER_EFFICIENT definition for a work > queue: it's designed for interrupt routines that defer work via work > queues to avoid doing work on otherwise idle CPUs. It does this by > turning the per-cpu wq into an unbound wq so that work gets > scheduled on a non-idle CPUs in preference to the local idle CPU > which can then remain in low power states. > > That's the exact opposite of what using WQ_UNBOUND ends up doing in > this IO completion context: it pushes the work out over idle CPUs > rather than keeping them confined on the already busy CPUs where CPU > affinity allows the work to be done quickly. So while WQ_UNBOUND > avoids the user task being migrated frequently, it results in the > work being spread around many more CPUs and we burn more power to do > the same work. That can't be worse than this crazy migration, which schedules the IO thread on other IDLE CPUs too. Thanks, Ming
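If the WQ_UNBOUND change were merged as the interim workaround Ming suggests, the allocation site could carry an explicit comment along these lines. This is a sketch only, not a submitted patch; the helper name is invented, and it simply wraps the allocation shown in the quoted dio patch above.

#include <linux/fs.h>
#include <linux/workqueue.h>

static struct workqueue_struct *dio_alloc_done_wq(struct super_block *sb)
{
	/*
	 * XXX: WQ_UNBOUND works around the scheduler's active load
	 * balancer migrating a CPU-bound AIO submitter off its CPU when
	 * completion work is queued on the local bound workqueue.
	 * Remove once the load balancer copes with mixed bound/unbound
	 * load on one runqueue.
	 */
	return alloc_workqueue("dio/%s", WQ_MEM_RECLAIM | WQ_UNBOUND, 0,
			       sb->s_id);
}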
Hi, On 02/12/19 10:46, Ming Lei wrote: > On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote: > > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > > > > > > > On Sat, 16 Nov 2019 10:40:05 Dave Chinner wrote: > > > > On Fri, Nov 15, 2019 at 03:08:43PM +0800, Ming Lei wrote: > > > > > On Fri, Nov 15, 2019 at 03:56:34PM +1100, Dave Chinner wrote: > > > > > > On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > > > > > I can reproduce the issue with 4k block size on another RH system, and > > > > > the login info of that system has been shared to you in RH BZ. > > > > > > > > > > 1) > > > > > sysctl kernel.sched_min_granularity_ns=10000000 > > > > > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > > > > > > > So, these settings definitely influence behaviour. > > > > > > > > If these are set to kernel defaults (4ms and 3ms each): > > > > > > > > sysctl kernel.sched_min_granularity_ns=4000000 > > > > sysctl kernel.sched_wakeup_granularity_ns=3000000 > > > > > > > > The migration problem largely goes away - the fio task migration > > > > event count goes from ~2,000 a run down to 200/run. > > > > > > > > That indicates that the migration trigger is likely load/timing > > > > based. The analysis below is based on the 10/15ms numbers above, > > > > because it makes it so much easier to reproduce. > > > > > > > > > 2) > > > > > ./xfs_complete 4k > > > > > > > > > > Then you should see 1k~1.5k fio io thread migration in above test, > > > > > either v5.4-rc7(build with rhel8 config) or RHEL 4.18 kernel. > > > > > > > > Almost all the fio task migrations are coming from migration/X > > > > kernel threads. i.e it's the scheduler active balancing that is > > > > causing the fio thread to bounce around. > > > > > > > > This is typical a typical trace, trimmed to remove extraneous noise. 
> > > > The fio process is running on CPU 10: > > > > > > > > fio-3185 [010] 50419.285954: sched_stat_runtime: comm=fio pid=3185 runtime=1004014 [ns] vruntime=27067882290 [ns] > > > > fio-3185 [010] 50419.286953: sched_stat_runtime: comm=fio pid=3185 runtime=979458 [ns] vruntime=27068861748 [ns] > > > > fio-3185 [010] 50419.287998: sched_stat_runtime: comm=fio pid=3185 runtime=1028471 [ns] vruntime=27069890219 [ns] > > > > fio-3185 [010] 50419.289973: sched_stat_runtime: comm=fio pid=3185 runtime=989989 [ns] vruntime=27071836208 [ns] > > > > fio-3185 [010] 50419.290958: sched_stat_runtime: comm=fio pid=3185 runtime=963914 [ns] vruntime=27072800122 [ns] > > > > fio-3185 [010] 50419.291952: sched_stat_runtime: comm=fio pid=3185 runtime=972532 [ns] vruntime=27073772654 [ns] > > > > > > > > fio consumes CPU for several milliseconds, then: > > > > > > > > fio-3185 [010] 50419.292935: sched_stat_runtime: comm=fio pid=3185 runtime=966032 [ns] vruntime=27074738686 [ns] > > > > fio-3185 [010] 50419.292941: sched_switch: fio:3185 [120] S ==> kworker/10:0:2763 [120] > > > > kworker/10:0-2763 [010] 50419.292954: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=13423 [ns] vruntime=27052479694 [ns] > > > > kworker/10:0-2763 [010] 50419.292956: sched_switch: kworker/10:0:2763 [120] R ==> fio:3185 [120] > > > > fio-3185 [010] 50419.293115: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > > > > fio-3185 [010] 50419.293116: sched_stat_runtime: comm=fio pid=3185 runtime=160370 [ns] vruntime=27074899056 [ns] > > > > fio-3185 [010] 50419.293118: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > > > > > > > A context switch out to a kworker, then 13us later we immediately > > > > switch back to the fio process, and go on running. No doubt > > > > somewhere in what the fio process is doing, we queue up more work to > > > > be run on the cpu, but the fio task keeps running > > > > (due to CONFIG_PREEMPT=n). > > > > > > > > fio-3185 [010] 50419.293934: sched_stat_runtime: comm=fio pid=3185 runtime=803135 [ns] vruntime=27075702191 [ns] > > > > fio-3185 [010] 50419.294936: sched_stat_runtime: comm=fio pid=3185 runtime=988478 [ns] vruntime=27076690669 [ns] > > > > fio-3185 [010] 50419.295934: sched_stat_runtime: comm=fio pid=3185 runtime=982219 [ns] vruntime=27077672888 [ns] > > > > fio-3185 [010] 50419.296935: sched_stat_runtime: comm=fio pid=3185 runtime=984781 [ns] vruntime=27078657669 [ns] > > > > fio-3185 [010] 50419.297934: sched_stat_runtime: comm=fio pid=3185 runtime=981703 [ns] vruntime=27079639372 [ns] > > > > fio-3185 [010] 50419.298937: sched_stat_runtime: comm=fio pid=3185 runtime=990057 [ns] vruntime=27080629429 [ns] > > > > fio-3185 [010] 50419.299935: sched_stat_runtime: comm=fio pid=3185 runtime=977554 [ns] vruntime=27081606983 [ns] > > > > > > > > About 6ms later, CPU 0 kicks the active load balancer on CPU 10... 
> > > > > > > > <idle>-0 [000] 50419.300014: sched_waking: comm=migration/10 pid=70 prio=0 target_cpu=010 > > > > fio-3185 [010] 50419.300024: sched_wakeup: migration/10:70 [0] success=1 CPU:010 > > > > fio-3185 [010] 50419.300026: sched_stat_runtime: comm=fio pid=3185 runtime=79291 [ns] vruntime=27081686274 [ns] > > > > fio-3185 [010] 50419.300027: sched_switch: fio:3185 [120] S ==> migration/10:70 [0] > > > > migration/10-70 [010] 50419.300032: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=10 dest_cpu=12 > > > > migration/10-70 [010] 50419.300040: sched_switch: migration/10:70 [0] D ==> kworker/10:0:2763 [120] > > > > > > > > And 10us later the fio process is switched away, the active load > > > > balancer work is run and migrates the fio process to CPU 12. Then... > > > > > > > > kworker/10:0-2763 [010] 50419.300048: sched_stat_runtime: comm=kworker/10:0 pid=2763 runtime=9252 [ns] vruntime=27062908308 [ns] > > > > kworker/10:0-2763 [010] 50419.300062: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > > > <idle>-0 [010] 50419.300067: sched_waking: comm=kworker/10:0 pid=2763 prio=120 target_cpu=010 > > > > <idle>-0 [010] 50419.300069: sched_wakeup: kworker/10:0:2763 [120] success=1 CPU:010 > > > > <idle>-0 [010] 50419.300071: sched_switch: swapper/10:0 [120] S ==> kworker/10:0:2763 [120] > > > > kworker/10:0-2763 [010] 50419.300073: sched_switch: kworker/10:0:2763 [120] R ==> swapper/10:0 [120] > > > > > > > > The kworker runs for another 10us and the CPU goes idle. Shortly > > > > after this, CPU 12 is woken: > > > > > > > > <idle>-0 [012] 50419.300113: sched_switch: swapper/12:0 [120] S ==> fio:3185 [120] > > > > fio-3185 [012] 50419.300596: sched_waking: comm=kworker/12:1 pid=227 prio=120 target_cpu=012 > > > > fio-3185 [012] 50419.300598: sched_stat_runtime: comm=fio pid=3185 runtime=561137 [ns] vruntime=20361153275 [ns] > > > > fio-3185 [012] 50419.300936: sched_stat_runtime: comm=fio pid=3185 runtime=326187 [ns] vruntime=20361479462 [ns] > > > > fio-3185 [012] 50419.301935: sched_stat_runtime: comm=fio pid=3185 runtime=981201 [ns] vruntime=20362460663 [ns] > > > > fio-3185 [012] 50419.302935: sched_stat_runtime: comm=fio pid=3185 runtime=983160 [ns] vruntime=20363443823 [ns] > > > > fio-3185 [012] 50419.303934: sched_stat_runtime: comm=fio pid=3185 runtime=983855 [ns] vruntime=20364427678 [ns] > > > > fio-3185 [012] 50419.304934: sched_stat_runtime: comm=fio pid=3185 runtime=977757 [ns] vruntime=20365405435 [ns] > > > > fio-3185 [012] 50419.305948: sched_stat_runtime: comm=fio pid=3185 runtime=999563 [ns] vruntime=20366404998 [ns] > > > > > > > > > > > > and fio goes on running there. 
The pattern repeats very soon afterwards: > > > > > > > > <idle>-0 [000] 50419.314982: sched_waking: comm=migration/12 pid=82 prio=0 target_cpu=012 > > > > fio-3185 [012] 50419.314988: sched_wakeup: migration/12:82 [0] success=1 CPU:012 > > > > fio-3185 [012] 50419.314990: sched_stat_runtime: comm=fio pid=3185 runtime=46342 [ns] vruntime=20375268656 [ns] > > > > fio-3185 [012] 50419.314991: sched_switch: fio:3185 [120] S ==> migration/12:82 [0] > > > > migration/12-82 [012] 50419.314995: sched_migrate_task: comm=fio pid=3185 prio=120 orig_cpu=12 dest_cpu=5 > > > > migration/12-82 [012] 50419.315001: sched_switch: migration/12:82 [0] D ==> kworker/12:1:227 [120] > > > > kworker/12:1-227 [012] 50419.315022: sched_stat_runtime: comm=kworker/12:1 pid=227 runtime=21453 [ns] vruntime=20359477889 [ns] > > > > kworker/12:1-227 [012] 50419.315028: sched_switch: kworker/12:1:227 [120] R ==> swapper/12:0 [120] > > > > <idle>-0 [005] 50419.315053: sched_switch: swapper/5:0 [120] S ==> fio:3185 [120] > > > > fio-3185 [005] 50419.315286: sched_waking: comm=kworker/5:0 pid=2646 prio=120 target_cpu=005 > > > > fio-3185 [005] 50419.315288: sched_stat_runtime: comm=fio pid=3185 runtime=287737 [ns] vruntime=33779011507 [ns] > > > > > > > > And fio is now running on CPU 5 - it only ran on CPU 12 for about > > > > 15ms. Hmmm: > > > > > > > > $ grep fio-3185 ~/tmp/sched.out | awk 'BEGIN {totcpu = 0.0; switches = 0.0; prev_waket = 0.0 }/sched_waking/ { cpu = $2; split($3, t, ":"); waket = t[1]; if (cpu != prev_cpu) { t_on_cpu = waket - prev_waket; if (prev_waket) { print "time on CPU", cpu, "was", t_on_cpu; totcpu += t_on_cpu; switches++ } prev_waket = waket; prev_cpu = cpu; } } END { print "switches", switches, "time on cpu", totcpu, "aver time on cpu", (totcpu / switches) } ' | stats --trim-outliers > > > > switches 2211 time on cpu 30.0994 aver time on cpu 0.0136135 > > > > time on CPU [0-23(8.8823+/-6.2)] was 0.000331-0.330772(0.0134759+/-0.012) > > > > > > > > Yeah, the fio task averages 13.4ms on any given CPU before being > > > > switched to another CPU. Mind you, the stddev is 12ms, so the range > > > > of how long it spends on any one CPU is pretty wide (330us to > > > > 330ms). > > > > > > > Hey Dave > > > > > > > IOWs, this doesn't look like a workqueue problem at all - this looks > > > > > > Surprised to see you're so sure it has little to do with wq, > > > > > > > like the scheduler is repeatedly making the wrong load balancing > > > > decisions when mixing a very short runtime task (queued work) with a > > > > long runtime task on the same CPU.... > > > > > > > and it helps more to know what is driving lb to make decisions like > > > this. Because for 70+ per cent of communters in cities like London it > > > is supposed tube is better than cab on work days, the end_io cb is > > > tweaked to be a lookalike of execute_in_process_context() in the diff > > > with the devoted s_dio_done_wq taken out of account. It's interesting > > > to see what difference lb will make in the tube environment. > > > > > > Hillf > > > > > > > This is not my area of expertise, so I have no idea why this might > > > > be happening. Scheduler experts: is this expected behaviour? What > > > > tunables directly influence the active load balancer (and/or CONFIG > > > > options) to change how aggressive it's behaviour is? > > > > > > > > > Not reproduced the issue with 512 block size on the RH system yet, > > > > > maybe it is related with my kernel config. 
> > > > > > > > I doubt it - this looks like a load specific corner case in the > > > > scheduling algorithm.... > > > > > > > > Cheers, > > > > > > > > Dave. > > > > -- > > > > Dave Chinner > > > > david@fromorbit.com > > > > > > --- a/fs/iomap/direct-io.c > > > +++ b/fs/iomap/direct-io.c > > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > > > WRITE_ONCE(dio->submit.waiter, NULL); > > > blk_wake_io_task(waiter); > > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > > - struct inode *inode = file_inode(dio->iocb->ki_filp); > > > - > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > > + schedule_work(&dio->aio.work); > > > > I'm not sure that this will make a real difference because it ends up > > to call queue_work(system_wq, ...) and system_wq is bounded as well so > > the work will still be pinned to a CPU > > Using system_unbound_wq should make a difference because it doesn't > > pin the work on a CPU > > + queue_work(system_unbound_wq, &dio->aio.work); > > Indeed, just run a quick test on my KVM guest, looks the following patch > makes a difference: > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 9329ced91f1d..2f4488b0ecec 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb) > { > struct workqueue_struct *old; > struct workqueue_struct *wq = alloc_workqueue("dio/%s", > - WQ_MEM_RECLAIM, 0, > + WQ_MEM_RECLAIM | > + WQ_UNBOUND, 0, > sb->s_id); Chiming in to report that, after Vincent suggested trying the very same change in a private IRC conversation, I saw the same positive effects. E.g., baseline uses the 'slow' sched parameters (10ms/15ms), baseline 3/4 is the same kernel with the default 'fast' sched params, and dio unbound 10/15 is a kernel with the above patch applied, using the 'slow' sched params.

fsperf-4k-5G                      baseline 3/4    dio unbound 10/15
                                  improv.         improv.
sequential
  initial write  throughput          -0.12            -0.87
                 latency              1.15             2.87
  overwrite      throughput        1186.32          1232.74
                 latency             11.35            11.26
  read           throughput           0.80            -2.38
                 latency              2.79             3.03
random
  initial write  throughput          84.73           105.34
                 latency              0.00             1.14
  overwrite      throughput         254.27           273.17
                 latency              1.23             7.37
  read           throughput          40.05             3.24
                 latency              8.03             7.36

Best, Juri
On Mon, 2 Dec 2019 at 05:02, Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Dec 02, 2019 at 10:46:25AM +0800, Ming Lei wrote: > > On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote: > > > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > > --- a/fs/iomap/direct-io.c > > > > +++ b/fs/iomap/direct-io.c > > > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > > > > WRITE_ONCE(dio->submit.waiter, NULL); > > > > blk_wake_io_task(waiter); > > > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > > > - struct inode *inode = file_inode(dio->iocb->ki_filp); > > > > - > > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > > > + schedule_work(&dio->aio.work); > > > > > > I'm not sure that this will make a real difference because it ends up > > > to call queue_work(system_wq, ...) and system_wq is bounded as well so > > > the work will still be pinned to a CPU > > > Using system_unbound_wq should make a difference because it doesn't > > > pin the work on a CPU > > > + queue_work(system_unbound_wq, &dio->aio.work); > > > > Indeed, just run a quick test on my KVM guest, looks the following patch > > makes a difference: > > > > diff --git a/fs/direct-io.c b/fs/direct-io.c > > index 9329ced91f1d..2f4488b0ecec 100644 > > --- a/fs/direct-io.c > > +++ b/fs/direct-io.c > > @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb) > > { > > struct workqueue_struct *old; > > struct workqueue_struct *wq = alloc_workqueue("dio/%s", > > - WQ_MEM_RECLAIM, 0, > > + WQ_MEM_RECLAIM | > > + WQ_UNBOUND, 0, > > sb->s_id); > > That's not an answer to the user task migration issue. > > That is, all this patch does is trade user task migration when the > CPU is busy for migrating all the queued work off the CPU so the > user task does not get migrated. IOWs, this forces all the queued > work to be migrated rather than the user task. IOWs, it does not > address the issue we've exposed in the scheduler between tasks with > competing CPU affinity scheduling requirements - it just hides the > symptom. > > Maintaining CPU affinity across dispatch and completion work has > been proven to be a significant performance win. Right throughout > the IO stack we try to keep this submitter/completion affinity, > and that's the whole point of using a bound wq in the first place: > efficient delayed batch processing of work on the local CPU. Do you really want to target the same CPU ? looks like what you really want to target the same cache instead > > Spewing deferred completion work across every idle CPU in the > machine because the local cpu is temporarily busy is a bad choice, > both from a performance perspective (dirty cacheline bouncing) and > from a power efficiency point of view as it causes CPUs to be taken > out of idle state much more frequently[*]. > > The fact that the scheduler migrates the user task we use workqueues > for deferred work as they were intended doesn't make this a > workqueue problem. If the answer to this problem is "make all IO > workqueues WQ_UNBOUND" then we are effectively saying "the scheduler > has unfixable problems when mixing bound and unbound work on the > same run queue". > > And, besides, what happens when every other CPU is also completely > busy and can't run the work in a timely fashion? We've just moved > the work to some random CPU where we wait to be scheduled instead of > just sitting on the local CPU and waiting.... 
> > So, yes, we can work around the -symptoms- we see (frequent user > task migration) by changing the work queue configuration or > bypassing the workqueue for this specific workload. But these only > address the visible symptom and don't take into account the wider > goals of retaining CPU affinity in the IO stack, and they will have > variable scheduling latency and perofrmance and as the overall > system load changes. > > So, we can fiddle with workqueues, but it doesn't address the > underlying issue that the scheduler appears to be migrating > non-bound tasks off a busy CPU too easily.... The root cause of the problem is that the sched_wakeup_granularity_ns is in the same range or higher than load balance period. As Peter explained, This make the kworker waiting for the CPU for several load period and a transient unbalanced state becomes a stable one that the scheduler to fix. With default value, the scheduler doesn't try to migrate any task. Then, I agree that having an ack close to the request makes sense but forcing it on the exact same CPU is too restrictive IMO. Being able to use another CPU on the same core should not harm the performance and may even improve it. And that may still be the case while CPUs share their cache. > > -Dave. > > [*] Pay attention to the WQ_POWER_EFFICIENT definition for a work > queue: it's designed for interrupt routines that defer work via work > queues to avoid doing work on otherwise idle CPUs. It does this by > turning the per-cpu wq into an unbound wq so that work gets > scheduled on a non-idle CPUs in preference to the local idle CPU > which can then remain in low power states. > > That's the exact opposite of what using WQ_UNBOUND ends up doing in > this IO completion context: it pushes the work out over idle CPUs > rather than keeping them confined on the already busy CPUs where CPU > affinity allows the work to be done quickly. So while WQ_UNBOUND > avoids the user task being migrated frequently, it results in the > work being spread around many more CPUs and we burn more power to do > the same work. > > -- > Dave Chinner > david@fromorbit.com
Hi Vincent, On Mon, Dec 02, 2019 at 02:45:42PM +0100 Vincent Guittot wrote: > On Mon, 2 Dec 2019 at 05:02, Dave Chinner <david@fromorbit.com> wrote: ... > > So, we can fiddle with workqueues, but it doesn't address the > > underlying issue that the scheduler appears to be migrating > > non-bound tasks off a busy CPU too easily.... > > The root cause of the problem is that the sched_wakeup_granularity_ns > is in the same range or higher than load balance period. As Peter > explained, This make the kworker waiting for the CPU for several load > period and a transient unbalanced state becomes a stable one that the > scheduler to fix. With default value, the scheduler doesn't try to > migrate any task. There are actually two issues here. With the high wakeup granularity we get the user task actively migrated. This causes the significant performance hit Ming was showing. With the fast wakeup_granularity (or smaller IOs - 512 instead of 4k) we get, instead, the user task migrated at wakeup to a new CPU for every IO completion. This amounts to 11k migrations per second while doing 11k IOPS. In this test it is not by itself causing the measured performance issue. It generally flips back and forth between 2 CPUs for large periods. I think it is crossing cache boundaries at times (but I have not looked closely at the traces compared to the topology, yet). The active balances are what really hurt in this case, but I agree that seems to be a tuning problem. Cheers, Phil > > Then, I agree that having an ack close to the request makes sense but > forcing it on the exact same CPU is too restrictive IMO. Being able to > use another CPU on the same core should not harm the performance and > may even improve it. And that may still be the case while CPUs share > their cache. > > > > > -Dave. > > > > [*] Pay attention to the WQ_POWER_EFFICIENT definition for a work > > queue: it's designed for interrupt routines that defer work via work > > queues to avoid doing work on otherwise idle CPUs. It does this by > > turning the per-cpu wq into an unbound wq so that work gets > > scheduled on a non-idle CPUs in preference to the local idle CPU > > which can then remain in low power states. > > > > That's the exact opposite of what using WQ_UNBOUND ends up doing in > > this IO completion context: it pushes the work out over idle CPUs > > rather than keeping them confined on the already busy CPUs where CPU > > affinity allows the work to be done quickly. So while WQ_UNBOUND > > avoids the user task being migrated frequently, it results in the > > work being spread around many more CPUs and we burn more power to do > > the same work. > > > > -- > > Dave Chinner > > david@fromorbit.com > --
On Mon, Dec 02, 2019 at 05:01:58PM +0800, Hillf Danton wrote: > > On Mon, 2 Dec 2019 14:08:44 +1100 Dave Chinner wrote: > > On Thu, Nov 28, 2019 at 05:40:03PM +0800, Hillf Danton wrote: > > > On Sat, 16 Nov 2019 10:40:05 Dave Chinner wrote: > > > > Yeah, the fio task averages 13.4ms on any given CPU before being > > > > switched to another CPU. Mind you, the stddev is 12ms, so the range > > > > of how long it spends on any one CPU is pretty wide (330us to > > > > 330ms). > > > > > > > Hey Dave > > > > > > > IOWs, this doesn't look like a workqueue problem at all - this looks > > > > > > Surprised to see you're so sure it has little to do with wq, > > > > Because I understand how the workqueue is used here. > > > > Essentially, the workqueue is not necessary for a -pure- overwrite > > where no metadata updates or end-of-io filesystem work is required. > > > > However, change the workload just slightly, such as allocating the > > space, writing into preallocated space (unwritten extents), using > > AIO writes to extend the file, using O_DSYNC, etc, and we *must* > > use a workqueue as we have to take blocking locks and/or run > > transactions. > > > > These may still be very short (e.g. updating inode size) and in most > > cases will not block, but if they do, then if we don't move the work > > out of the block layer completion context (i.e. softirq running the > > block bh) then we risk deadlocking the code. > > > > Not to mention none of the filesytem inode locks are irq safe. > > > > IOWs, we can remove the workqueue for this -one specific instance- > > but it does not remove the requirement for using a workqueue for all > > the other types of write IO that pass through this code. > > > So it's not true that it doesn't has anything to do with workqueue. You misunderstood what I was saying. I meant that this adverse scheduler behaviour is not *unique to this specific workqueue instance* or workload. There are another 5+ workqueues in XFS alone that are based around the same "do all the deferred work on the same CPU" queuing behaviour. Several of them are IO completion processing workqueues, and it is designed this way to avoid running completion work that accesses common structures across all the CPUs in the system. And, FWIW, we've had this "per-cpu delayed work" processing mechanism in XFS since ~2002 when per-cpu work queues were introduced in ~2.5.40. What we are doing with workqueues here is not new or novel, and it's worked just fine for most of this time... > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > > + schedule_work(&dio->aio.work); > > > > This does nothing but change the workqueue from a per-sb wq to the > > system wq. The work is still bound to the same CPU it is queued on, > > so nothing will change. > > > The system wq is enough here to make some visible difference as CFS will > be looking to make new lb decision in particular when submitter and > completion are running on different CPUs. That's noise caused by slightly different loading of the system workqueue vs a private work queue. It's likely just enough to move the scheduler out of the window where it makes incorrect decisions. i.e. Add a bit more user load or load onto other CPUs, and the problem will reappear. As I said, this is *not* a fix for the problem - it just moves it around so that you can't see it for this specific workload instance.
> It's claimed that "Maintaining CPU affinity across dispatch and completion > work has been proven to be a significant performance win." If completion > is running in the softirq context then it would take some time to sort > out why irq (not CPU) affinity is making difference across CPUs. We use irq steering to provide CPU affinity for the structures being used by completion because they are the same ones used by submission. If completion happens quickly enough, those structures are still hot in the cache of the submission CPU, and so we don't drag bio and filesystem structures out of the CPU cache they sit in by steering the completion to the submission CPU. Most of the modern high performance storage hardware has hardware interrupt steering so the block layer doesn't have to do this. See __blk_mq_complete_request() and __blk_complete_request(). If the device has multiple hardware queues, they are already delivering CPU affine completions. Otherwise __blk_complete_request() uses IPIs to steer the completion to a CPU that shares a cache with the submission CPU.... IOWs, we are trying to ensure that we run the data IO completion on the CPU that has that data hot in cache. When we are running millions of IOs every second, this matters -a lot-. IRQ steering is just a mechanism that is used to ensure completion processing hits hot caches. Cheers, Dave.
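A rough sketch of the steering decision being described (the real logic lives in __blk_mq_complete_request()/__blk_complete_request() and also consults queue flags; the struct and helper names below are invented purely to make the sketch self-contained): run the completion locally if this CPU shares a cache with the submission CPU, otherwise IPI the submission CPU so the completion runs where the request's data structures are cache-hot.

#include <linux/smp.h>
#include <linux/sched/topology.h>	/* cpus_share_cache() */

/* hypothetical request stub, only to make the sketch self-contained */
struct rq_stub {
	int	submit_cpu;		/* CPU that submitted the IO */
	void	(*complete)(struct rq_stub *);
};

static void rq_stub_complete_ipi(void *data)
{
	struct rq_stub *rq = data;

	rq->complete(rq);
}

static void rq_stub_end_io(struct rq_stub *rq)
{
	int cpu = get_cpu();

	if (cpu == rq->submit_cpu || cpus_share_cache(cpu, rq->submit_cpu))
		rq->complete(rq);		/* cache shared: complete here */
	else
		smp_call_function_single(rq->submit_cpu, rq_stub_complete_ipi,
					 rq, 0);	/* IPI the submitter */
	put_cpu();
}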
On Mon, Dec 02, 2019 at 02:45:42PM +0100, Vincent Guittot wrote: > On Mon, 2 Dec 2019 at 05:02, Dave Chinner <david@fromorbit.com> wrote: > > > > On Mon, Dec 02, 2019 at 10:46:25AM +0800, Ming Lei wrote: > > > On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote: > > > > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > > > --- a/fs/iomap/direct-io.c > > > > > +++ b/fs/iomap/direct-io.c > > > > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > > > > > WRITE_ONCE(dio->submit.waiter, NULL); > > > > > blk_wake_io_task(waiter); > > > > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > > > > - struct inode *inode = file_inode(dio->iocb->ki_filp); > > > > > - > > > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > > > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > > > > + schedule_work(&dio->aio.work); > > > > > > > > I'm not sure that this will make a real difference because it ends up > > > > to call queue_work(system_wq, ...) and system_wq is bounded as well so > > > > the work will still be pinned to a CPU > > > > Using system_unbound_wq should make a difference because it doesn't > > > > pin the work on a CPU > > > > + queue_work(system_unbound_wq, &dio->aio.work); > > > > > > Indeed, just run a quick test on my KVM guest, looks the following patch > > > makes a difference: > > > > > > diff --git a/fs/direct-io.c b/fs/direct-io.c > > > index 9329ced91f1d..2f4488b0ecec 100644 > > > --- a/fs/direct-io.c > > > +++ b/fs/direct-io.c > > > @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb) > > > { > > > struct workqueue_struct *old; > > > struct workqueue_struct *wq = alloc_workqueue("dio/%s", > > > - WQ_MEM_RECLAIM, 0, > > > + WQ_MEM_RECLAIM | > > > + WQ_UNBOUND, 0, > > > sb->s_id); > > > > That's not an answer to the user task migration issue. > > > > That is, all this patch does is trade user task migration when the > > CPU is busy for migrating all the queued work off the CPU so the > > user task does not get migrated. IOWs, this forces all the queued > > work to be migrated rather than the user task. IOWs, it does not > > address the issue we've exposed in the scheduler between tasks with > > competing CPU affinity scheduling requirements - it just hides the > > symptom. > > > > Maintaining CPU affinity across dispatch and completion work has > > been proven to be a significant performance win. Right throughout > > the IO stack we try to keep this submitter/completion affinity, > > and that's the whole point of using a bound wq in the first place: > > efficient delayed batch processing of work on the local CPU. > > Do you really want to target the same CPU ? looks like what you really > want to target the same cache instead Well, yes, ideally we want to target the same cache, but we can't do that with workqueues. However, the block layer already does that same-cache steering for it's directed completions (see __blk_mq_complete_request()), so we are *already running in a "hot cache" CPU context* when we queue work. When we queue to the same CPU, we are simply maintaining the "cache-hot" context that we are already running in. Besides, selecting a specific "hot cache" CPU and bind the work to that CPU (via queue_work_on()) doesn't fix the scheduler problem - it just moves it to another CPU. If the destination CPU is loaded like the local CPU, then it's jsut going to cause migrations on the destination CPU instead of the local CPU. 
IOWs, this is -not a fix- for the scheduler making incorrect migration decisions when we are mixing bound and unbound tasks on the local run queue. Yes, it will hide the problem from this specific workload instance but it doesn't fix it. We'll just hit it under heavier load, such as when production workloads start running AIO submission from tens of CPUs at a time while burning near 100% CPU in userspace....... Cheers, Dave.
On Tue, Dec 03, 2019 at 10:53:21AM +1100, Dave Chinner wrote: > On Mon, Dec 02, 2019 at 02:45:42PM +0100, Vincent Guittot wrote: > > On Mon, 2 Dec 2019 at 05:02, Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Mon, Dec 02, 2019 at 10:46:25AM +0800, Ming Lei wrote: > > > > On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote: > > > > > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > > > > --- a/fs/iomap/direct-io.c > > > > > > +++ b/fs/iomap/direct-io.c > > > > > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > > > > > > WRITE_ONCE(dio->submit.waiter, NULL); > > > > > > blk_wake_io_task(waiter); > > > > > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > > > > > - struct inode *inode = file_inode(dio->iocb->ki_filp); > > > > > > - > > > > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > > > > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > > > > > + schedule_work(&dio->aio.work); > > > > > > > > > > I'm not sure that this will make a real difference because it ends up > > > > > to call queue_work(system_wq, ...) and system_wq is bounded as well so > > > > > the work will still be pinned to a CPU > > > > > Using system_unbound_wq should make a difference because it doesn't > > > > > pin the work on a CPU > > > > > + queue_work(system_unbound_wq, &dio->aio.work); > > > > > > > > Indeed, just run a quick test on my KVM guest, looks the following patch > > > > makes a difference: > > > > > > > > diff --git a/fs/direct-io.c b/fs/direct-io.c > > > > index 9329ced91f1d..2f4488b0ecec 100644 > > > > --- a/fs/direct-io.c > > > > +++ b/fs/direct-io.c > > > > @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb) > > > > { > > > > struct workqueue_struct *old; > > > > struct workqueue_struct *wq = alloc_workqueue("dio/%s", > > > > - WQ_MEM_RECLAIM, 0, > > > > + WQ_MEM_RECLAIM | > > > > + WQ_UNBOUND, 0, > > > > sb->s_id); > > > > > > That's not an answer to the user task migration issue. > > > > > > That is, all this patch does is trade user task migration when the > > > CPU is busy for migrating all the queued work off the CPU so the > > > user task does not get migrated. IOWs, this forces all the queued > > > work to be migrated rather than the user task. IOWs, it does not > > > address the issue we've exposed in the scheduler between tasks with > > > competing CPU affinity scheduling requirements - it just hides the > > > symptom. > > > > > > Maintaining CPU affinity across dispatch and completion work has > > > been proven to be a significant performance win. Right throughout > > > the IO stack we try to keep this submitter/completion affinity, > > > and that's the whole point of using a bound wq in the first place: > > > efficient delayed batch processing of work on the local CPU. > > > > Do you really want to target the same CPU ? looks like what you really > > want to target the same cache instead > > Well, yes, ideally we want to target the same cache, but we can't do > that with workqueues. > > However, the block layer already does that same-cache steering for > it's directed completions (see __blk_mq_complete_request()), so we > are *already running in a "hot cache" CPU context* when we queue > work. When we queue to the same CPU, we are simply maintaining the > "cache-hot" context that we are already running in. __blk_mq_complete_request() doesn't always complete the request on the submission CPU, which is only done in case of 1:1 queue mapping and N:1 mapping when nr_hw_queues < nr_nodes. 
Also, the default completion flag is SAME_GROUP, which just requires the completion CPU to share cache with the submission CPU: #define QUEUE_FLAG_SAME_COMP 4 /* complete on same CPU-group */ Thanks, Ming
On Mon, 2 Dec 2019 at 22:22, Phil Auld <pauld@redhat.com> wrote: > > Hi Vincent, > > On Mon, Dec 02, 2019 at 02:45:42PM +0100 Vincent Guittot wrote: > > On Mon, 2 Dec 2019 at 05:02, Dave Chinner <david@fromorbit.com> wrote: > > ... > > > > So, we can fiddle with workqueues, but it doesn't address the > > > underlying issue that the scheduler appears to be migrating > > > non-bound tasks off a busy CPU too easily.... > > > > The root cause of the problem is that the sched_wakeup_granularity_ns > > is in the same range or higher than load balance period. As Peter > > explained, This make the kworker waiting for the CPU for several load > > period and a transient unbalanced state becomes a stable one that the > > scheduler to fix. With default value, the scheduler doesn't try to > > migrate any task. > > There are actually two issues here. With the high wakeup granularity > we get the user task actively migrated. This causes the significant > performance hit Ming was showing. With the fast wakeup_granularity > (or smaller IOs - 512 instead of 4k) we get, instead, the user task > migrated at wakeup to a new CPU for every IO completion. Ok, I haven't noticed that this one was a problem too. Do we have perf regression ? > > This is the 11k migrations per sec doing 11k iops. In this test it > is not by itself causing the measured performance issue. It generally > flips back and forth between 2 cpus for large periods. I think it is > crossing cache boundaries at times (but I have not looked closely > at the traces compared to the topology, yet). At task wake up, scheduler compares local and previous CPU to decide where to place the task and will then try to find an idle one which shares cache so I don't expect that it will cross cache boundary as local and previous are in your case. > > The active balances are what really hurts in thie case but I agree > that seems to be a tuning problem. > > > Cheers, > Phil > > > > > > Then, I agree that having an ack close to the request makes sense but > > forcing it on the exact same CPU is too restrictive IMO. Being able to > > use another CPU on the same core should not harm the performance and > > may even improve it. And that may still be the case while CPUs share > > their cache. > > > > > > > > -Dave. > > > > > > [*] Pay attention to the WQ_POWER_EFFICIENT definition for a work > > > queue: it's designed for interrupt routines that defer work via work > > > queues to avoid doing work on otherwise idle CPUs. It does this by > > > turning the per-cpu wq into an unbound wq so that work gets > > > scheduled on a non-idle CPUs in preference to the local idle CPU > > > which can then remain in low power states. > > > > > > That's the exact opposite of what using WQ_UNBOUND ends up doing in > > > this IO completion context: it pushes the work out over idle CPUs > > > rather than keeping them confined on the already busy CPUs where CPU > > > affinity allows the work to be done quickly. So while WQ_UNBOUND > > > avoids the user task being migrated frequently, it results in the > > > work being spread around many more CPUs and we burn more power to do > > > the same work. > > > > > > -- > > > Dave Chinner > > > david@fromorbit.com > > > > -- >
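As a very rough illustration of the wakeup placement Vincent outlines above: the real logic is in select_task_rq_fair()/select_idle_sibling() and is considerably more involved (wake_affine() heuristics, load and capacity checks, limited scanning of the LLC domain), so the helper below is an invented simplification, not scheduler code.

#include <linux/sched.h>
#include <linux/sched/topology.h>
#include <linux/cpumask.h>

/*
 * Invented sketch: given the CPU a task last ran on and the CPU doing
 * the wakeup, prefer an idle CPU that shares a cache with the chosen
 * target. The real fair-class code weighs load and recent idleness
 * when picking between this_cpu and prev_cpu; those details are
 * omitted here.
 */
static int wakeup_target_sketch(int prev_cpu, int this_cpu)
{
	/* crude stand-in for the wake_affine() decision */
	int target = cpus_share_cache(this_cpu, prev_cpu) ? prev_cpu : this_cpu;
	int cpu;

	for_each_online_cpu(cpu) {
		if (cpus_share_cache(cpu, target) && idle_cpu(cpu))
			return cpu;
	}
	return target;
}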
On Tue, 3 Dec 2019 at 00:53, Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Dec 02, 2019 at 02:45:42PM +0100, Vincent Guittot wrote: > > On Mon, 2 Dec 2019 at 05:02, Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Mon, Dec 02, 2019 at 10:46:25AM +0800, Ming Lei wrote: > > > > On Thu, Nov 28, 2019 at 10:53:33AM +0100, Vincent Guittot wrote: > > > > > On Thu, 28 Nov 2019 at 10:40, Hillf Danton <hdanton@sina.com> wrote: > > > > > > --- a/fs/iomap/direct-io.c > > > > > > +++ b/fs/iomap/direct-io.c > > > > > > @@ -157,10 +157,8 @@ static void iomap_dio_bio_end_io(struct > > > > > > WRITE_ONCE(dio->submit.waiter, NULL); > > > > > > blk_wake_io_task(waiter); > > > > > > } else if (dio->flags & IOMAP_DIO_WRITE) { > > > > > > - struct inode *inode = file_inode(dio->iocb->ki_filp); > > > > > > - > > > > > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > > > > > > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > > > > > > + schedule_work(&dio->aio.work); > > > > > > > > > > I'm not sure that this will make a real difference because it ends up > > > > > to call queue_work(system_wq, ...) and system_wq is bounded as well so > > > > > the work will still be pinned to a CPU > > > > > Using system_unbound_wq should make a difference because it doesn't > > > > > pin the work on a CPU > > > > > + queue_work(system_unbound_wq, &dio->aio.work); > > > > > > > > Indeed, just run a quick test on my KVM guest, looks the following patch > > > > makes a difference: > > > > > > > > diff --git a/fs/direct-io.c b/fs/direct-io.c > > > > index 9329ced91f1d..2f4488b0ecec 100644 > > > > --- a/fs/direct-io.c > > > > +++ b/fs/direct-io.c > > > > @@ -613,7 +613,8 @@ int sb_init_dio_done_wq(struct super_block *sb) > > > > { > > > > struct workqueue_struct *old; > > > > struct workqueue_struct *wq = alloc_workqueue("dio/%s", > > > > - WQ_MEM_RECLAIM, 0, > > > > + WQ_MEM_RECLAIM | > > > > + WQ_UNBOUND, 0, > > > > sb->s_id); > > > > > > That's not an answer to the user task migration issue. > > > > > > That is, all this patch does is trade user task migration when the > > > CPU is busy for migrating all the queued work off the CPU so the > > > user task does not get migrated. IOWs, this forces all the queued > > > work to be migrated rather than the user task. IOWs, it does not > > > address the issue we've exposed in the scheduler between tasks with > > > competing CPU affinity scheduling requirements - it just hides the > > > symptom. > > > > > > Maintaining CPU affinity across dispatch and completion work has > > > been proven to be a significant performance win. Right throughout > > > the IO stack we try to keep this submitter/completion affinity, > > > and that's the whole point of using a bound wq in the first place: > > > efficient delayed batch processing of work on the local CPU. > > > > Do you really want to target the same CPU ? looks like what you really > > want to target the same cache instead > > Well, yes, ideally we want to target the same cache, but we can't do > that with workqueues. Yes, this seems to be your main problem IMHO. You want to stay on the same cache and the only way to do so it to pin the work on one single CPU. But by doing so and increasing sched_wakeup_granularity_ns, the scheduler detects an imbalanced state because of pinned task that it wants to fix. 
Being able to set the work on a cpumask that covers the cache would be the solution, so the scheduler would be able to select an idle CPU that shares the cache instead of the work being pinned to a single CPU. > > However, the block layer already does that same-cache steering for > its directed completions (see __blk_mq_complete_request()), so we > are *already running in a "hot cache" CPU context* when we queue > work. When we queue to the same CPU, we are simply maintaining the > "cache-hot" context that we are already running in. > > Besides, selecting a specific "hot cache" CPU and binding the work to > that CPU (via queue_work_on()) doesn't fix the scheduler problem - > it just moves it to another CPU. If the destination CPU is loaded > like the local CPU, then it's just going to cause migrations on the > destination CPU instead of the local CPU. > > IOWs, this is -not a fix- for the scheduler making incorrect > migration decisions when we are mixing bound and unbound tasks on > the local run queue. Yes, it will hide the problem from this > specific workload instance but it doesn't fix it. We'll just hit it > under heavier load, such as when production workloads start running > AIO submission from tens of CPUs at a time while burning near 100% > CPU in userspace....... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
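For readers not familiar with the block layer path referred to above, the same-cache steering boils down to a cpus_share_cache() check at completion time. The function below is a simplified paraphrase written for this discussion, as it would sit inside the block layer; it is not the actual __blk_mq_complete_request() code, which also honours the QUEUE_FLAG_SAME_COMP/SAME_FORCE flags.

#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/smp.h>
#include <linux/sched/topology.h>

/* IPI callback: run the driver completion on the submitter's CPU. */
static void remote_complete_fn(void *data)
{
	struct request *rq = data;

	rq->q->mq_ops->complete(rq);
}

/* Paraphrase of blk-mq's same-cache completion steering.  In the real
 * code the submitting CPU comes from rq->mq_ctx->cpu. */
static void steer_completion(struct request *rq, int submit_cpu)
{
	int cpu = get_cpu();	/* CPU the completion interrupt landed on */

	if (cpu == submit_cpu || cpus_share_cache(cpu, submit_cpu)) {
		/* Already in the submitter's cache domain: complete here. */
		rq->q->mq_ops->complete(rq);
	} else {
		/* Push the completion back into the submitter's cache domain. */
		rq->csd.func = remote_complete_fn;
		rq->csd.info = rq;
		rq->csd.flags = 0;
		smp_call_function_single_async(submit_cpu, &rq->csd);
	}
	put_cpu();
}

So by the time iomap_dio_bio_end_io() runs, the block layer has already tried to put us on a CPU that shares a cache with the submitter, which is the basis of the "we are already cache-hot" argument above.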
On Tue, Dec 03, 2019 at 09:15:14PM +0800, Hillf Danton wrote: > > IOWs, we are trying to ensure that we run the data IO completion on > > the CPU with that has that data hot in cache. When we are running > > millions of IOs every second, this matters -a lot-. IRQ steering is > > just a mechansim that is used to ensure completion processing hits > > hot caches. > > Along the "CPU affinity" direction, a trade-off is made between CPU > affinity and cache affinity before lb can bear the ca scheme. > Completion works are queued in round robin on the CPUs that share > cache with the submission CPU. > > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -143,6 +143,42 @@ static inline void iomap_dio_set_error(s > cmpxchg(&dio->error, 0, ret); > } > > +static DEFINE_PER_CPU(int, iomap_dio_bio_end_io_cnt); > +static DEFINE_PER_CPU(int, iomap_dio_bio_end_io_cpu); > +#define IOMAP_DIO_BIO_END_IO_BATCH 7 > + > +static int iomap_dio_cpu_rr(void) > +{ > + int *io_cnt, *io_cpu; > + int cpu, this_cpu; > + > + io_cnt = get_cpu_ptr(&iomap_dio_bio_end_io_cnt); > + io_cpu = this_cpu_ptr(&iomap_dio_bio_end_io_cpu); > + this_cpu = smp_processor_id(); > + > + if (!(*io_cnt & IOMAP_DIO_BIO_END_IO_BATCH)) { > + for (cpu = *io_cpu + 1; cpu < nr_cpu_id; cpu++) > + if (cpu == this_cpu || > + cpus_share_cache(cpu, this_cpu)) > + goto update_cpu; > + > + for (cpu = 0; cpu < *io_cpu; cpu++) > + if (cpu == this_cpu || > + cpus_share_cache(cpu, this_cpu)) > + goto update_cpu; Linear scans like this just don't scale. We can have thousands of CPUs in a system and maybe only 8 cores that share a local cache. And we can be completing millions of direct IO writes a second these days. A linear scan of (thousands - 8) cpu ids every so often is going to show up as long tail latency for the unfortunate IO that has to scan those thousands of non-matching CPU IDs to find a sibling, and we'll be doing that every handful of IOs that are completed on every CPU. > + > + cpu = this_cpu; > +update_cpu: > + *io_cpu = cpu; > + } > + > + (*io_cnt)++; > + cpu = *io_cpu; > + put_cpu_ptr(&iomap_dio_bio_end_io_cnt); > + > + return cpu; > +} > > static void iomap_dio_bio_end_io(struct bio *bio) > { > struct iomap_dio *dio = bio->bi_private; > @@ -158,9 +194,10 @@ static void iomap_dio_bio_end_io(struct > blk_wake_io_task(waiter); > } else if (dio->flags & IOMAP_DIO_WRITE) { > struct inode *inode = file_inode(dio->iocb->ki_filp); > + int cpu = iomap_dio_cpu_rr(); IMO, this sort of "limit work to sibling CPU cores" does not belong in general code. We have *lots* of workqueues that need this treatment, and it's not viable to add this sort of linear search loop to every workqueue and place we queue work. Besides.... > > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > - queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > + queue_work_on(cpu, inode->i_sb->s_dio_done_wq, &dio->aio.work); .... as I've stated before, this *does not solve the scheduler problem*. All this does is move the problem to the target CPU instead of seeing it on the local CPU. If we really want to hack around the load balancer problems in this way, then we need to add a new workqueue concurrency management type with behaviour that lies between the default of bound and WQ_UNBOUND. WQ_UNBOUND limits scheduling to within a numa node - see wq_update_unbound_numa() for how it sets up the cpumask attributes it applies to it's workers - but we need the work to be bound to within the local cache domain rather than a numa node. 
IOWs, set up the kworker task pool management structure with the right attributes (e.g. cpu masks) to define the cache domains, add all the hotplug code to make it work with CPU hotplug, then simply apply those attributes to the kworker task that is selected to execute the work. This allows the scheduler to migrate the kworker away from the local run queue without interrupting the currently scheduled task. The cpumask the task is configured with limits the scheduler to selecting the best CPU within the local cache domain, and we don't have to bind work to CPUs to get CPU-cache-friendly work scheduling. This also avoids the per-queue_work_on() overhead of calculating a sibling CPU, and all that code wanting to use this functionality needs to do is add a single flag at work queue init time (e.g. WQ_CACHEBOUND). IOWs, if the task migration behaviour cannot be easily fixed and we therefore need work queue users to be more flexible about work placement, then the solution needed here is "cpu cache local work queue scheduling" implemented in the work queue infrastructure, not in every workqueue user. Cheers, Dave.
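On the scalability point raised earlier in this mail: even if per-callsite CPU targeting were pursued, the sibling scan does not have to happen per IO. A rough sketch of that idea follows; every name in it is hypothetical, CPU hotplug handling is omitted, and, as argued above, this only removes the scan cost rather than fixing the underlying scheduler problem. It precomputes the LLC-sharing mask per CPU once at init, so the queue-time pick never walks nr_cpu_ids.

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/sched/topology.h>
#include <linux/smp.h>
#include <linux/gfp.h>
#include <linux/init.h>

/* Hypothetical sketch: per-CPU mask of the CPUs sharing this CPU's LLC. */
static DEFINE_PER_CPU(cpumask_var_t, dio_llc_mask);
static DEFINE_PER_CPU(int, dio_last_cpu);

static int __init dio_build_llc_masks(void)
{
	int cpu, sibling;

	/* Quadratic scan, but it runs once at init, not per completion. */
	for_each_online_cpu(cpu) {
		if (!zalloc_cpumask_var(&per_cpu(dio_llc_mask, cpu), GFP_KERNEL))
			return -ENOMEM;
		for_each_online_cpu(sibling)
			if (cpus_share_cache(cpu, sibling))
				cpumask_set_cpu(sibling,
						per_cpu(dio_llc_mask, cpu));
	}
	return 0;
}
core_initcall(dio_build_llc_masks);

/* Queue-time selection: round-robin within the local LLC, no global scan. */
static int dio_pick_completion_cpu(void)
{
	int this_cpu = get_cpu();
	int *last = this_cpu_ptr(&dio_last_cpu);
	int cpu = cpumask_next(*last, per_cpu(dio_llc_mask, this_cpu));

	if (cpu >= nr_cpu_ids)
		cpu = cpumask_first(per_cpu(dio_llc_mask, this_cpu));
	*last = cpu;
	put_cpu();
	return cpu;
}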
On Tue, Dec 03, 2019 at 10:45:38AM +0100, Vincent Guittot wrote: > On Mon, 2 Dec 2019 at 22:22, Phil Auld <pauld@redhat.com> wrote: > > > > Hi Vincent, > > > > On Mon, Dec 02, 2019 at 02:45:42PM +0100 Vincent Guittot wrote: > > > On Mon, 2 Dec 2019 at 05:02, Dave Chinner <david@fromorbit.com> wrote: > > > > ... > > > > > > So, we can fiddle with workqueues, but it doesn't address the > > > > underlying issue that the scheduler appears to be migrating > > > > non-bound tasks off a busy CPU too easily.... > > > > > > The root cause of the problem is that the sched_wakeup_granularity_ns > > > is in the same range or higher than load balance period. As Peter > > > explained, This make the kworker waiting for the CPU for several load > > > period and a transient unbalanced state becomes a stable one that the > > > scheduler to fix. With default value, the scheduler doesn't try to > > > migrate any task. > > > > There are actually two issues here. With the high wakeup granularity > > we get the user task actively migrated. This causes the significant > > performance hit Ming was showing. With the fast wakeup_granularity > > (or smaller IOs - 512 instead of 4k) we get, instead, the user task > > migrated at wakeup to a new CPU for every IO completion. > > Ok, I haven't noticed that this one was a problem too. Do we have perf > regression ?

Here is the test result on one server (Dell R630, Haswell-E):

kernel.sched_wakeup_granularity_ns = 4000000
kernel.sched_min_granularity_ns = 3000000

---------------------------------------
 test                            | IOPS
---------------------------------------
 ./xfs_complete 512              | 7.8K
---------------------------------------
 taskset -c 8 ./xfs_complete 512 | 9.8K
---------------------------------------

Thanks,
Ming
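For anyone trying to reproduce these numbers without the fio job file, the workload is essentially one thread doing 512-byte O_DIRECT AIO overwrites at a queue depth of 1 and waiting for each completion. The libaio loop below is an approximation written for this discussion (the real xfs_complete script drives fio and sets the sysctls shown above); running it directly and under taskset -c 8 corresponds to the two rows of the table.

/* Approximation of the single-threaded AIO dio overwrite loop.
 * Build with: gcc -O2 -o aio_overwrite aio_overwrite.c -laio
 * Loops forever; interrupt it or add a time limit as needed. */
#define _GNU_SOURCE
#include <libaio.h>
#include <sys/types.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <err.h>

#define BS	512UL		/* 512-byte writes; use 4096 for the 4k case */
#define FSIZE	(1UL << 30)	/* assumes a 1G preallocated file */

int main(int argc, char **argv)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	off_t off = 0;
	int fd;

	if (argc < 2)
		errx(1, "usage: %s <file on XFS>", argv[0]);
	fd = open(argv[1], O_RDWR | O_DIRECT);
	if (fd < 0)
		err(1, "open");
	if (posix_memalign(&buf, 4096, BS))
		errx(1, "posix_memalign failed");
	memset(buf, 0x5a, BS);
	if (io_setup(1, &ctx) < 0)
		errx(1, "io_setup failed");

	for (;;) {
		io_prep_pwrite(&cb, fd, buf, BS, off);
		if (io_submit(ctx, 1, cbs) != 1)
			errx(1, "io_submit failed");
		if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
			errx(1, "io_getevents failed");
		off = (off + BS) % FSIZE;	/* keep overwriting the file */
	}
}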
On Wed, Dec 04, 2019 at 06:29:03PM +0800, Hillf Danton wrote: > > On Wed, 4 Dec 2019 09:29:25 +1100 Dave Chinner wrote: > > > > If we really want to hack around the load balancer problems in this > > way, then we need to add a new workqueue concurrency management type > > with behaviour that lies between the default of bound and WQ_UNBOUND. > > > > WQ_UNBOUND limits scheduling to within a numa node - see > > wq_update_unbound_numa() for how it sets up the cpumask attributes > > it applies to it's workers - but we need the work to be bound to > > within the local cache domain rather than a numa node. IOWs, set up > > the kworker task pool management structure with the right attributes > > (e.g. cpu masks) to define the cache domains, add all the hotplug > > code to make it work with CPU hotplug, then simply apply those > > attributes to the kworker task that is selected to execute the work. > > > > This allows the scheduler to migrate the kworker away from the local > > run queue without interrupting the currently scheduled task. The > > cpumask limits the task is configured with limit the scheduler to > > selecting the best CPU within the local cache domain, and we don't > > have to bind work to CPUs to get CPU cache friendly work scheduling. > > This also avoids overhead of per-queue_work_on() sibling CPU > > calculation, and all the code that wants to use this functionality > > needs to do is add a single flag at work queue init time (e.g. > > WQ_CACHEBOUND). > > > > IOWs, if the task migration behaviour cannot be easily fixed and so > > we need work queue users to be more flexible about work placement, > > then the solution needed here is "cpu cache local work queue > > scheduling" implemented in the work queue infrastructure, not in > > every workqueue user. > > Add WQ_CACHE_BOUND and a user of it and a helper to find cpus that > share cache. <sigh> If you are going to quote my suggestion in full, then please implement it in full. This patch does almost none of what you quoted above - it still has all the problems of the previous version that lead me to write the above. > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -1358,16 +1358,42 @@ static bool is_chained_work(struct workq > return worker && worker->current_pwq->wq == wq; > } > > +static DEFINE_PER_CPU(int, wq_sel_cbc_cnt); > +static DEFINE_PER_CPU(int, wq_sel_cbc_cpu); > +#define WQ_SEL_CBC_BATCH 7 > + > +static int wq_select_cache_bound_cpu(int this_cpu) > +{ > + int *cntp, *cpup; > + int cpu; > + > + cntp = get_cpu_ptr(&wq_sel_cbc_cnt); > + cpup = this_cpu_ptr(&wq_sel_cbc_cpu); > + cpu = *cpup; > + > + if (!(*cntp & WQ_SEL_CBC_BATCH)) { > + cpu = cpus_share_cache_next_cpu(this_cpu, cpu); > + *cpup = cpu; > + } > + (*cntp)++; > + put_cpu_ptr(&wq_sel_cbc_cnt); > + > + return cpu; > +} This selects a specific CPU in the local cache domain at queue_work() time, just like the previous patch. It does not do what I suggested above in reponse to the scalability issues this approach has... > /* > * When queueing an unbound work item to a wq, prefer local CPU if allowed > * by wq_unbound_cpumask. Otherwise, round robin among the allowed ones to > * avoid perturbing sensitive tasks. 
> */ > -static int wq_select_unbound_cpu(int cpu) > +static int wq_select_unbound_cpu(int cpu, bool cache_bound) > { > static bool printed_dbg_warning; > int new_cpu; > > + if (cache_bound) > + return wq_select_cache_bound_cpu(cpu); > + > if (likely(!wq_debug_force_rr_cpu)) { > if (cpumask_test_cpu(cpu, wq_unbound_cpumask)) > return cpu; > @@ -1417,7 +1443,8 @@ static void __queue_work(int cpu, struct > rcu_read_lock(); > retry: > if (req_cpu == WORK_CPU_UNBOUND) > - cpu = wq_select_unbound_cpu(raw_smp_processor_id()); > + cpu = wq_select_unbound_cpu(raw_smp_processor_id(), > + wq->flags & WQ_CACHE_BOUND); And the per-cpu kworker pool selection after we've selected a CPU here binds it to that specific CPU or (if WQ_UNBOUND) the local node. IOWs, this is exactly the same functionality as the last patch, just moved inside the work queue infrastructure. IOWs, apart from the WQ_CACHE_BOUND flag, this patch doesn't implement any of what I suggested above. It does not solve the the problem of bound kworkers kicking running tasks off a CPU so the bound task can run, and it does not allow the scheduler to select the best CPU in the local cache scheduling domain for the kworker to run on. i.e. it still behaves like bound work rather than WQ_UNBOUND work. IMO, this adds the CPU selection to the -wrong function-. We still want the local CPU selected when req_cpu == WORK_CPU_UNBOUND. The following code selects where the kworker will be bound based on the task pool that the workqueue is configured to use: /* pwq which will be used unless @work is executing elsewhere */ if (!(wq->flags & WQ_UNBOUND)) pwq = per_cpu_ptr(wq->cpu_pwqs, cpu); else pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu)); i.e. the local CPU is the key we need to look up the task pool for running tasks in the local cache domain - we do not use it as the CPU we want to run work on. IOWs, what we really want is this: if (wq->flags & WQ_UNBOUND) pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu)); else if (wq->flags & WQ_CACHE_BOUND) pwq = unbound_pwq_by_cache(wq, cpu); else pwq = per_cpu_ptr(wq->cpu_pwqs, cpu); And then unbound_pwq_by_cache() is implemented in a similar manner to unbound_pwq_by_node() where there is a separate worker pool per cache domain. THe scheduler domain attributes (cpumask) is held by the task pool, and they are applied to the kworker task when it's given the task to run. This, like WQ_UNBOUND, allows the scheduler to select the best CPU in the cpumask for the task to run on. Binding kworkers to a single CPU is exactly the problem we need to avoid here - we need to let the scheduler choose the best CPU in the local cache domain based on the current load. That means the implementation needs to behave like a WQ_UNBOUND workqueue, just with a more restrictive cpumask. -Dave.
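The behavioural difference being described can be stated in one function: today a per-cpu kworker is effectively restricted to cpumask_of(cpu), whereas a cache-bound pool would restrict its workers to the set of CPUs sharing the LLC and leave the final placement to the scheduler. The sketch below only illustrates that contrast; it is not workqueue code, and a real implementation would hold the mask in the worker pool attributes as outlined above rather than poking at an individual task.

#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/sched/topology.h>

/* Illustration only: constrain a worker task to the LLC domain of @cpu
 * instead of binding it to @cpu alone. */
static int constrain_worker_to_llc(struct task_struct *worker, int cpu)
{
	cpumask_var_t mask;
	int sibling, ret;

	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
		return -ENOMEM;
	for_each_online_cpu(sibling)
		if (cpus_share_cache(cpu, sibling))
			cpumask_set_cpu(sibling, mask);

	/*
	 * A bound pool effectively gets cpumask_of(cpu); with the wider
	 * mask the scheduler is still free to pick the least loaded CPU
	 * that shares the cache.
	 */
	ret = set_cpus_allowed_ptr(worker, mask);
	free_cpumask_var(mask);
	return ret;
}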
* Peter Zijlstra <peterz@infradead.org> [2019-11-21 14:29:37]: > On Wed, Nov 20, 2019 at 05:03:13PM -0500, Phil Auld wrote: > > On Wed, Nov 20, 2019 at 08:16:36PM +0100 Peter Zijlstra wrote: > > > On Tue, Nov 19, 2019 at 07:40:54AM +1100, Dave Chinner wrote: > > > > > Yes, that's precisely the problem - work is queued, by default, on a > > > > specific CPU and it will wait for a kworker that is pinned to that > > > > > > I'm thinking the problem is that it doesn't wait. If it went and waited > > > for it, active balance wouldn't be needed, that only works on active > > > tasks. > > > > Since this is AIO I wonder if it should queue_work on a nearby cpu by > > default instead of unbound. > > The thing seems to be that 'unbound' is in fact 'bound'. Maybe we should > fix that. If the load-balancer were allowed to move the kworker around > when it didn't get time to run, that would probably be a better > solution. > Can the scheduler detect this situation and probably preempt the current task? ----->8-------------------------------------------8<------------------ From 578407a0d01fa458ccb412f3390b04d3e1030d33 Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Date: Sat, 7 Dec 2019 08:50:54 +0530 Subject: [RFC PATCH] sched/core: Preempt current task in favour of bound kthread A running task can wake-up a per CPU bound kthread on the same CPU. If the current running task doesn't yield the CPU before the next load balance operation, the scheduler would detect load imbalance and try to balance the load. However this load balance would fail as the waiting task is CPU bound, while the running task cannot be moved by the regular load balancer. Finally the active load balancer would kick in and move the task to a different CPU/Core. Moving the task to a different CPU/core can lead to loss in cache affinity leading to poor performance. This is more prone to happen if the current running task is CPU intensive and the sched_wake_up_granularity is set to larger value. When the sched_wake_up_granularity was relatively small, it was observed that the bound thread would complete before the load balancer would have chosen to move the cache hot task to a different CPU. To deal with this situation, the current running task would yield to a per CPU bound kthread, provided kthread is not CPU intensive. /pboffline/hwcct_prg_old/lib/fsperf -t overwrite --noclean -f 5g -b 4k /pboffline (With sched_wake_up_granularity set to 15ms) Performance counter stats for 'system wide' (5 runs): event v5.4 v5.4 + patch probe:active_load_balance_cpu_stop 1,740 ( +- 4.06% ) 4 ( +- 15.41% ) sched:sched_waking 431,952 ( +- 0.63% ) 905,001 ( +- 0.25% ) sched:sched_wakeup 431,950 ( +- 0.63% ) 905,000 ( +- 0.25% ) sched:sched_wakeup_new 427 ( +- 14.50% ) 544 ( +- 3.11% ) sched:sched_switch 773,491 ( +- 0.72% ) 1,467,539 ( +- 0.30% ) sched:sched_migrate_task 19,744 ( +- 0.69% ) 2,488 ( +- 5.24% ) sched:sched_process_free 417 ( +- 15.26% ) 545 ( +- 3.47% ) sched:sched_process_exit 433 ( +- 14.71% ) 560 ( +- 3.37% ) sched:sched_wait_task 3 ( +- 23.57% ) 1 ( +- 61.24% ) sched:sched_process_wait 132 ( +- 80.37% ) 848 ( +- 3.63% ) sched:sched_process_fork 427 ( +- 14.50% ) 543 ( +- 3.08% ) sched:sched_process_exec 36 ( +- 92.46% ) 211 ( +- 7.50% ) sched:sched_wake_idle_without_ipi 178,349 ( +- 0.87% ) 351,912 ( +- 0.31% ) elapsed time in seconds 288.09 +- 2.30 ( +- 0.80% ) 72.631 +- 0.109 ( +- 0.15% ) Throughput results v5.4 Trigger time:................... 
0.842679 s (Throughput: 6075.86 MB/s) Asynchronous submit time:....... 1.0184 s (Throughput: 5027.49 MB/s) Synchronous submit time:........ 0 s (Throughput: 0 MB/s) I/O time:....................... 263.17 s (Throughput: 19.455 MB/s) Ratio trigger time to I/O time:.0.00320202 v5.4 + patch Trigger time:................... 0.858728 s (Throughput: 5962.3 MB/s) Asynchronous submit time:....... 0.758399 s (Throughput: 6751.06 MB/s) Synchronous submit time:........ 0 s (Throughput: 0 MB/s) I/O time:....................... 43.411 s (Throughput: 117.942 MB/s) Ratio trigger time to I/O time:.0.0197813 (With sched_wake_up_granularity set to 4ms) Performance counter stats for 'system wide' (5 runs): event v5.4 v5.4 + patch probe:active_load_balance_cpu_stop 4 ( +- 29.92% ) 6 ( +- 21.88% ) sched:sched_waking 896,177 ( +- 0.25% ) 900,352 ( +- 0.36% ) sched:sched_wakeup 896,174 ( +- 0.25% ) 900,352 ( +- 0.36% ) sched:sched_wakeup_new 255 ( +- 40.79% ) 568 ( +- 4.22% ) sched:sched_switch 1,453,937 ( +- 0.27% ) 1,459,653 ( +- 0.46% ) sched:sched_migrate_task 2,318 ( +- 6.55% ) 2,898 ( +- 13.14% ) sched:sched_process_free 239 ( +- 43.14% ) 553 ( +- 4.46% ) sched:sched_process_exit 255 ( +- 40.54% ) 568 ( +- 4.33% ) sched:sched_wait_task 3 ( +- 38.13% ) 2 ( +- 32.39% ) sched:sched_process_wait 257 ( +- 68.90% ) 887 ( +- 4.59% ) sched:sched_process_fork 255 ( +- 40.87% ) 567 ( +- 4.21% ) sched:sched_process_exec 116 ( +- 71.52% ) 214 ( +- 4.51% ) sched:sched_stat_runtime 82,757,021,750 ( +- 2.38% ) 82,092,839,452 ( +- 0.31% ) sched:sched_wake_idle_without_ipi 347,790 ( +- 0.69% ) 350,369 ( +- 0.27% ) elapsed time in seconds 72.6114 +- 0.0516 ( +- 0.07% ) 72.6425 +- 0.0658 ( +- 0.09% ) Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> --- kernel/sched/core.c | 7 ++++++- kernel/sched/fair.c | 23 ++++++++++++++++++++++- kernel/sched/sched.h | 3 ++- 3 files changed, 30 insertions(+), 3 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 44123b4d14e8..efd740aafa17 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2664,7 +2664,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) */ int wake_up_process(struct task_struct *p) { - return try_to_wake_up(p, TASK_NORMAL, 0); + int wake_flags = 0; + + if (is_per_cpu_kthread(p)) + wake_flags = WF_KTHREAD; + + return try_to_wake_up(p, TASK_NORMAL, WF_KTHREAD); } EXPORT_SYMBOL(wake_up_process); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 69a81a5709ff..36486f71e59f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6660,6 +6660,27 @@ static void set_skip_buddy(struct sched_entity *se) cfs_rq_of(se)->skip = se; } +static int kthread_wakeup_preempt(struct rq *rq, struct task_struct *p, int wake_flags) +{ + struct task_struct *curr = rq->curr; + struct cfs_rq *cfs_rq = task_cfs_rq(curr); + + if (!(wake_flags & WF_KTHREAD)) + return 0; + + if (p->nr_cpus_allowed != 1 || curr->nr_cpus_allowed == 1) + return 0; + + if (cfs_rq->nr_running > 2) + return 0; + + /* + * Don't preempt, if the waking kthread is more CPU intensive than + * the current thread. 
+ */ + return p->nvcsw * curr->nivcsw >= p->nivcsw * curr->nvcsw; +} + /* * Preempt the current task with a newly woken task if needed: */ @@ -6716,7 +6737,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ find_matching_se(&se, &pse); update_curr(cfs_rq_of(se)); BUG_ON(!pse); - if (wakeup_preempt_entity(se, pse) == 1) { + if (wakeup_preempt_entity(se, pse) == 1 || kthread_wakeup_preempt(rq, p, wake_flags)) { /* * Bias pick_next to pick the sched entity that is * triggering this preemption. diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index c8870c5bd7df..23d4284ad1e3 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1643,7 +1643,8 @@ static inline int task_on_rq_migrating(struct task_struct *p) */ #define WF_SYNC 0x01 /* Waker goes to sleep after wakeup */ #define WF_FORK 0x02 /* Child wakeup after fork */ -#define WF_MIGRATED 0x4 /* Internal use, task got migrated */ +#define WF_MIGRATED 0x04 /* Internal use, task got migrated */ +#define WF_KTHREAD 0x08 /* Per CPU Kthread*/ /* * To aid in avoiding the subversion of "niceness" due to uneven distribution
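A note on the final check in kthread_wakeup_preempt(): nvcsw/nivcsw are the voluntary and involuntary context switch counts, so the cross-multiplication is simply an integer-safe way of asking "does the woken kthread give up the CPU at least as willingly as the current task?". The numbers below are invented purely to illustrate the comparison.

#include <stdbool.h>

/*
 * p->nvcsw * curr->nivcsw >= p->nivcsw * curr->nvcsw
 * is p->nvcsw / p->nivcsw >= curr->nvcsw / curr->nivcsw without division.
 *
 * Example (invented numbers): a dio completion kworker that almost always
 * sleeps voluntarily (nvcsw = 1000, nivcsw = 2) waking up while a CPU-bound
 * fio thread is running (nvcsw = 50, nivcsw = 200):
 *
 *   1000 * 200 = 200000 >= 2 * 50 = 100   -> preempt in favour of the kworker
 *
 * A CPU-hogging kthread (nvcsw = 3, nivcsw = 500) fails the same test:
 *   3 * 200 = 600 < 500 * 50 = 25000      -> do not preempt
 */
static bool yields_at_least_as_often(unsigned long p_nvcsw, unsigned long p_nivcsw,
				     unsigned long c_nvcsw, unsigned long c_nivcsw)
{
	return p_nvcsw * c_nivcsw >= p_nivcsw * c_nvcsw;
}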
* Jens Axboe <axboe@kernel.dk> [2019-11-21 09:19:29]: > > > > I wish there was a flag to wake_up() or to the event object that says > > to relinquish the remainder of the time-slice to the waiter on the same > > CPU, since I will soon be sleeping. > > Isn't that basically what wake_up_sync() is? > Workqueues don't seem to be using a wait_queue_head, which is needed when using wake_up_sync() and its related APIs. Also, wake_up_sync() only helps when the waking task hands off the CPU and goes to sleep/blocks. However here, in the fsperf case at least, the waking thread continues to run after it has woken the per-CPU workqueue worker.
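For context, the wake_up_sync() machinery being discussed hangs off a wait_queue_head_t; a minimal sketch of the intended pattern is below (illustration only, not from any patch in this thread). The sync hint only pays off if the waker really does block straight afterwards, which is exactly the point being made about the AIO submitter.

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_wq);
static int demo_done;

/* Waiter side: sleeps until the waker signals completion. */
static int waiter(void)
{
	return wait_event_interruptible(demo_wq, READ_ONCE(demo_done));
}

/* Waker side: the _sync variant hints "don't preempt or migrate anyone,
 * I'm about to sleep", so the waiter can simply take over this CPU. */
static void waker(void)
{
	WRITE_ONCE(demo_done, 1);
	wake_up_interruptible_sync(&demo_wq);
	/* ...the waker must then actually sleep for the hint to be useful. */
}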
On Mon, Dec 09, 2019 at 10:21:22PM +0530, Srikar Dronamraju wrote: > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 44123b4d14e8..efd740aafa17 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -2664,7 +2664,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) > */ > int wake_up_process(struct task_struct *p) > { > - return try_to_wake_up(p, TASK_NORMAL, 0); > + int wake_flags = 0; > + > + if (is_per_cpu_kthread(p)) > + wake_flags = WF_KTHREAD; > + > + return try_to_wake_up(p, TASK_NORMAL, WF_KTHREAD); This is buggy. It always sets WF_KTHREAD, even for non-kernel processes. I think you meant: return try_to_wake_up(p, TASK_NORMAL, wake_flags); I suspect this bug invalidates the test results presented, too... -Dave.
* Dave Chinner <david@fromorbit.com> [2019-12-10 10:17:43]: > On Mon, Dec 09, 2019 at 10:21:22PM +0530, Srikar Dronamraju wrote: > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index 44123b4d14e8..efd740aafa17 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -2664,7 +2664,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) > > */ > > int wake_up_process(struct task_struct *p) > > { > > - return try_to_wake_up(p, TASK_NORMAL, 0); > > + int wake_flags = 0; > > + > > + if (is_per_cpu_kthread(p)) > > + wake_flags = WF_KTHREAD; > > + > > + return try_to_wake_up(p, TASK_NORMAL, WF_KTHREAD); > > This is buggy. It always sets WF_KTHREAD, even for non-kernel > processes. I think you meant: > > return try_to_wake_up(p, TASK_NORMAL, wake_flags); Yes, I meant the above. Thanks for catching. Will test with this and repost. > > I suspect this bug invalidates the test results presented, too... > > -Dave. > -- > Dave Chinner > david@fromorbit.com
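For completeness, with the correction folded in, the wake_up_process() change from the RFC would read as follows (reconstructed from the exchange above, not a new posting):

int wake_up_process(struct task_struct *p)
{
	int wake_flags = 0;

	if (is_per_cpu_kthread(p))
		wake_flags = WF_KTHREAD;

	return try_to_wake_up(p, TASK_NORMAL, wake_flags);
}
EXPORT_SYMBOL(wake_up_process);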
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 1fc28c2da279..bdc007a57706 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -158,9 +158,14 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 		blk_wake_io_task(waiter);
 	} else if (dio->flags & IOMAP_DIO_WRITE) {
 		struct inode *inode = file_inode(dio->iocb->ki_filp);
+		unsigned cpu = cpumask_next(smp_processor_id(),
+				cpu_online_mask);
+
+		if (cpu >= nr_cpu_ids)
+			cpu = 0;
 
 		INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
-		queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
+		queue_work_on(cpu, inode->i_sb->s_dio_done_wq, &dio->aio.work);
 	} else {
 		iomap_dio_complete_work(&dio->aio.work);
 	}