Message ID | db9c91f6-1b17-6136-84f0-03c3c2581ab4@codeaurora.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote: > Hello, > > I observe that running stress-ng with the cpu-online and fstat tests > results in a deadlock of hung tasks: > > [ 366.810486] INFO: task stress-ng-cpu-o:2590 blocked for more than > 120 seconds. > [ 366.817689] Not tainted 4.9.0 #39 > [ 366.821504] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [ 366.829320] stress-ng-cpu-o D 0 2590 2589 0x00000008 > [ 366.834803] Call trace: > [ 366.837222] [<ffff000008085dd0>] __switch_to+0x60/0x70 > [ 366.842338] [<ffff000008a23c18>] __schedule+0x178/0x648 > [ 366.847550] [<ffff000008a24120>] schedule+0x38/0x98 > [ 366.852408] [<ffff00000848b774>] blk_mq_freeze_queue_wait+0x64/0x1a8 > [ 366.858749] [<ffff00000848e9d4>] blk_mq_queue_reinit_work+0x74/0x110 > [ 366.865081] [<ffff00000848ea94>] blk_mq_queue_reinit_dead+0x24/0x30 > [ 366.871335] [<ffff0000080c9898>] cpuhp_invoke_callback+0x98/0x4a8 > [ 366.877411] [<ffff0000080cb084>] cpuhp_down_callbacks+0x114/0x150 > [ 366.883484] [<ffff000008a22578>] _cpu_down+0x100/0x1d8 > [ 366.888609] [<ffff0000080cbfdc>] do_cpu_down+0x4c/0x78 > [ 366.893727] [<ffff0000080cc02c>] cpu_down+0x24/0x30 > [ 366.898593] [<ffff0000086aaf28>] cpu_subsys_offline+0x20/0x30 > [ 366.904318] [<ffff0000086a53d8>] device_offline+0xa8/0xd8 > [ 366.909704] [<ffff0000086a550c>] online_store+0x4c/0xa8 > [ 366.914907] [<ffff0000086a241c>] dev_attr_store+0x44/0x60 > [ 366.920294] [<ffff0000082b6a24>] sysfs_kf_write+0x5c/0x78 > [ 366.925672] [<ffff0000082b5cec>] kernfs_fop_write+0xbc/0x1e8 > [ 366.931318] [<ffff000008238320>] __vfs_write+0x48/0x138 > [ 366.936526] [<ffff000008239078>] vfs_write+0xa8/0x1c0 > [ 366.941557] [<ffff00000823a08c>] SyS_write+0x54/0xb0 > [ 366.946511] [<ffff000008083370>] el0_svc_naked+0x24/0x28 > [ 366.951800] INFO: task stress-ng-fstat:2591 blocked for more than > 120 seconds. > [ 366.959008] Not tainted 4.9.0 #39 > [ 366.962823] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [ 366.970640] stress-ng-fstat D 0 2591 2589 0x00000000 > [ 366.976105] Call trace: > [ 366.978540] [<ffff000008085dd0>] __switch_to+0x60/0x70 > [ 366.983658] [<ffff000008a23c18>] __schedule+0x178/0x648 > [ 366.988870] [<ffff000008a24120>] schedule+0x38/0x98 > [ 366.993727] [<ffff00000848b774>] blk_mq_freeze_queue_wait+0x64/0x1a8 > [ 367.000068] [<ffff00000848e2d0>] blk_mq_freeze_queue+0x28/0x38 > [ 367.005880] [<ffff0000086d480c>] lo_release+0x64/0x90 > [ 367.010919] [<ffff000008278bd4>] __blkdev_put+0x26c/0x2c8 > [ 367.016300] [<ffff000008278fec>] blkdev_put+0x54/0x128 > [ 367.021418] [<ffff0000082790ec>] blkdev_close+0x2c/0x40 > [ 367.026631] [<ffff00000823ab58>] __fput+0xa0/0x1e0 > [ 367.031401] [<ffff00000823ad10>] ____fput+0x20/0x30 > [ 367.036266] [<ffff0000080e7a40>] task_work_run+0xc8/0xe8 > [ 367.041557] [<ffff0000080882b4>] do_notify_resume+0xac/0xb8 > [ 367.047116] [<ffff000008083294>] work_pending+0x8/0x10 > > I have tested and found this issue to be reproducible on both x86 > and ARM64 architectures on 4.7, 4.8, 4.9, 4.10, and 4.11-rc3 > kernels. > > Using the below test methodology [1], the issue reproduces within a > few minutes. > > Using ftrace, I have analyzed the issue on 4.9 and I believe I've > found the root cause [2]. > > Based on my analysis, I have developed a fix [3], which addresses > the issue as I am able to run stress-ng for over an hour where I was > unable to do so before, however I do not know the full extend of > impacts from this fix, and look for guidance from the community to > determine the final fix. > > > [1] Test methodology > -------------------- > Boot a multicore system such as a desktop i5 system with nr_cpus=2 > > Enable all logging to determine when the deadlock occurs (prints > from test stop flowing out of the serial port) > echo 1 > /sys/module/printk/parameters/ignore_loglevel > > Run stress-ng > stress-ng --fstat 1 --cpu-online 1 -t 3600 > > Wait for the test output to stop, and the hung task watchdog to fire. > > > [2] Analysis > ------------ > Again, this analysis is based upon the 4.9 kernel, but believe it to > still apply to newer kernels. > > I conclude that the hung tasks occur due to a race condition which > results in a deadlock. > > The race condition occurs between "normal" work in the block layer > on a core (the stress-ng-fstat task in the above dump) and cpu > offline of that core (the stress-ng-cpu-o task in the above dump). > > The fput() from userspace in the fstat task results in a call to > blk_mq_freeze_queue(), which drops the last reference to the queue > via percpu_ref_kill(), and then waits for the ref count of the queue > to hit 0 in blk_mq_freeze_queue_wait(). percpu_ref_kill() will > result in __percpu_ref_switch_to_atomic() which will use > call_rcu_sched() to setup delayed work to finalize the percpu_ref > cleanup and drop the ref count to 0. > > Note that call_rcu_sched() queues the work to a per-cpu queue, thus > the work can only be run on the core it is queued on, by the work > thread that is pinned to that core. > > It is a race between this work running, and the cpu offline processing. One quick way to test this assumption is to build a kernel with Kconfig options CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y. This will cause call_rcu_sched() to queue the work to a kthread, which can migrate to some other CPU. If your analysis is correct, this should avoid the deadlock. (Note that the deadlock should be fixed in any case, just a diagnostic assumption-check procedure.) > If the cpu offline processing is able to get to and process the > RCU/tree:online state before the queued work from the block layer, > then the pinned work thread will be migrated to another core via > rcutree_offline_cpu(), and the work will not be able to execute. > > This race condition does not result in deadlock until later in the > cpu offline processing. Once we hit the block/mq:prepare state the > block layer freezes all the queues and waits for the ref counts to > hit 0. This normally works because at this point the cpu being > offlined is dead from cpu:teardown, and the offline processing is > occuring on another active cpu, so call_rcu_sched() will queue work > to an active cpu where it can get processed. However the fstat > process already did that work for one of the queues to be frozen in > the block layer, so the processing of the block/mq:prepare state > waits on the same ref count as fstat to hit 0. Thus we see the > result of this as the stress-ng-cpu-o task above. > > The block/mq:prepare processing stalls the cpu offline processing > which causes a deadlock because the processing does not get to the > RCU/tree:prepare state which migrates all of the queued work from > the offline cpu to another cpu, which would allow the work that the > fstat task queued to execute, drop the ref count to 0, and unblock > both stalled tasks. > > By reordering the cpu offline states such the shutdown processing of > RCU/tree:prepare occurs before block/mq:prepare [3], we prevent > deadlock by enabling the queued work in the RCU framework to run > elsewhere, and eventually unblock the tasks waiting on the ref > count. > > However, it is not entirely clear what are the full ramifications of > this reorder. I assume the ordering of these cpu online/offline > states is carefully considered, and without that knowledge, I could > not say for certain that my fix [3] is safe. > > What is the opinion of the domain experts? I do hope that we can come up with a better fix. No offense intended, as coming up with -any- fix in the CPU-hotplug domain is not to be denigrated, but this looks to be at vest quite fragile. Thanx, Paul > -- > Jeffrey Hugo > Qualcomm Datacenter Technologies as an affiliate of Qualcomm > Technologies, Inc. > Qualcomm Technologies, Inc. is a member of the > Code Aurora Forum, a Linux Foundation Collaborative Project. > > > [3] Proposed fix > ---8>--- > diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h > index afe641c..9b86db9 100644 > --- a/include/linux/cpuhotplug.h > +++ b/include/linux/cpuhotplug.h > @@ -49,6 +49,7 @@ enum cpuhp_state { > CPUHP_ARM_SHMOBILE_SCU_PREPARE, > CPUHP_SH_SH3X_PREPARE, > CPUHP_BLK_MQ_PREPARE, > + CPUHP_RCUTREE_PREP2, > CPUHP_TIMERS_DEAD, > CPUHP_NOTF_ERR_INJ_PREPARE, > CPUHP_MIPS_SOC_PREPARE, > diff --git a/kernel/cpu.c b/kernel/cpu.c > index 29de1a9..b46c573 100644 > --- a/kernel/cpu.c > +++ b/kernel/cpu.c > @@ -1289,6 +1289,11 @@ static int __init cpu_hotplug_pm_sync_init(void) > [CPUHP_RCUTREE_PREP] = { > .name = "RCU/tree:prepare", > .startup.single = rcutree_prepare_cpu, > + .teardown.single = NULL, > + }, > + [CPUHP_RCUTREE_PREP2] = { > + .name = "RCU/tree:dead", > + .startup.single = NULL, > .teardown.single = rcutree_dead_cpu, > }, > /* >
Hi Paul. Thanks for the quick reply. On 3/26/2017 5:28 PM, Paul E. McKenney wrote: > On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote: >> It is a race between this work running, and the cpu offline processing. > > One quick way to test this assumption is to build a kernel with Kconfig > options CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y. This will > cause call_rcu_sched() to queue the work to a kthread, which can migrate > to some other CPU. If your analysis is correct, this should avoid > the deadlock. (Note that the deadlock should be fixed in any case, > just a diagnostic assumption-check procedure.) I enabled CONFIG_RCU_EXPERT=y, CONFIG_RCU_NOCB_CPU=y, CONFIG_RCU_NOCB_CPU_ALL=y in my build. I've only had time so far to do one test run however the issue reproduced, but it took a fair bit longer to do so. An initial look at the data indicates that the work is still not running. An odd observation, the two threads are no longer blocked on the same queue, but different ones. Let me look at this more and see what is going on now. >> What is the opinion of the domain experts? > > I do hope that we can come up with a better fix. No offense intended, > as coming up with -any- fix in the CPU-hotplug domain is not to be > denigrated, but this looks to be at vest quite fragile. > > Thanx, Paul > None taken. I'm not particularly attached to the current fix. I agree, it does appear to be quite fragile. I'm still not sure what a better solution would be though. Maybe the RCU framework flushes the work somehow during cpu offline? It would need to ensure further work is not queued after that point, which seems like it might be tricky to synchronize. I don't know enough about the working of RCU to even attempt to implement that. In any case, it seem like some more analysis is needed based on the latest data.
On Mon, Mar 27, 2017 at 12:02:27PM -0600, Jeffrey Hugo wrote: > Hi Paul. > > Thanks for the quick reply. > > On 3/26/2017 5:28 PM, Paul E. McKenney wrote: > >On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote: > > >>It is a race between this work running, and the cpu offline processing. > > > >One quick way to test this assumption is to build a kernel with Kconfig > >options CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y. This will > >cause call_rcu_sched() to queue the work to a kthread, which can migrate > >to some other CPU. If your analysis is correct, this should avoid > >the deadlock. (Note that the deadlock should be fixed in any case, > >just a diagnostic assumption-check procedure.) > > I enabled CONFIG_RCU_EXPERT=y, CONFIG_RCU_NOCB_CPU=y, > CONFIG_RCU_NOCB_CPU_ALL=y in my build. I've only had time so far to > do one test run however the issue reproduced, but it took a fair bit > longer to do so. An initial look at the data indicates that the > work is still not running. An odd observation, the two threads are > no longer blocked on the same queue, but different ones. I was afraid of that... > Let me look at this more and see what is going on now. Another thing to try would be to affinity the "rcuo" kthreads to some CPU that is never taken offline, just in case that kthread is sometimes somehow getting stuck during the CPU-hotplug operation. > >>What is the opinion of the domain experts? > > > >I do hope that we can come up with a better fix. No offense intended, > >as coming up with -any- fix in the CPU-hotplug domain is not to be > >denigrated, but this looks to be at vest quite fragile. > > > > Thanx, Paul > > > > None taken. I'm not particularly attached to the current fix. I > agree, it does appear to be quite fragile. > > I'm still not sure what a better solution would be though. Maybe > the RCU framework flushes the work somehow during cpu offline? It > would need to ensure further work is not queued after that point, > which seems like it might be tricky to synchronize. I don't know > enough about the working of RCU to even attempt to implement that. There are some ways that RCU might be able to shrink the window during which the outgoing CPU's callbacks are in limbo, but they are not free of risk, so we really need to compleetly understand what is going on before making any possibly ill-conceived changes. ;-) > In any case, it seem like some more analysis is needed based on the > latest data. Looking forward to hearing about you find! Thanx, Paul > -- > Jeffrey Hugo > Qualcomm Datacenter Technologies as an affiliate of Qualcomm > Technologies, Inc. > Qualcomm Technologies, Inc. is a member of the > Code Aurora Forum, a Linux Foundation Collaborative Project. >
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index afe641c..9b86db9 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -49,6 +49,7 @@ enum cpuhp_state { CPUHP_ARM_SHMOBILE_SCU_PREPARE, CPUHP_SH_SH3X_PREPARE, CPUHP_BLK_MQ_PREPARE, + CPUHP_RCUTREE_PREP2, CPUHP_TIMERS_DEAD, CPUHP_NOTF_ERR_INJ_PREPARE, CPUHP_MIPS_SOC_PREPARE, diff --git a/kernel/cpu.c b/kernel/cpu.c index 29de1a9..b46c573 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -1289,6 +1289,11 @@ static int __init cpu_hotplug_pm_sync_init(void) [CPUHP_RCUTREE_PREP] = { .name = "RCU/tree:prepare", .startup.single = rcutree_prepare_cpu, + .teardown.single = NULL, + }, + [CPUHP_RCUTREE_PREP2] = { + .name = "RCU/tree:dead", + .startup.single = NULL, .teardown.single = rcutree_dead_cpu, }, /*