| Message ID | 20161116173036.GK3612@linux.vnet.ibm.com (mailing list archive) |
|---|---|
| State | Not Applicable, archived |
On Wed 16-11-16 09:30:36, Paul E. McKenney wrote:
> On Wed, Nov 16, 2016 at 06:01:19PM +0100, Paul Menzel wrote:
> > Dear Linux folks,
> >
> >
> > On 11/08/16 19:39, Paul E. McKenney wrote:
> > >On Tue, Nov 08, 2016 at 06:38:18PM +0100, Paul Menzel wrote:
> > >>On 11/08/16 18:03, Paul E. McKenney wrote:
> > >>>On Tue, Nov 08, 2016 at 01:22:28PM +0100, Paul Menzel wrote:
> > >>
> > >>>>Could you please help me shed some light on the messages below?
> > >>>>
> > >>>>With Linux 4.4.X, these messages were not seen. When updating to
> > >>>>Linux 4.8.4, and Linux 4.8.6 they started to appear. In that
> > >>>>version, we enabled several CGROUP options.
> > >>>>
> > >>>>>$ dmesg -T
> > >>>>>[…]
> > >>>>>[Mon Nov 7 15:09:45 2016] INFO: rcu_sched detected stalls on CPUs/tasks:
> > >>>>>[Mon Nov 7 15:09:45 2016] 3-...: (493 ticks this GP) idle=515/140000000000000/0 softirq=5504423/5504423 fqs=13876
> > >>>>>[Mon Nov 7 15:09:45 2016] (detected by 5, t=60002 jiffies, g=1363193, c=1363192, q=268508)
> > >>>>>[Mon Nov 7 15:09:45 2016] Task dump for CPU 3:
> > >>>>>[Mon Nov 7 15:09:45 2016] kswapd1 R running task 0 87 2 0x00000008
> > >>>>>[Mon Nov 7 15:09:45 2016] ffffffff81aabdfd ffff8810042a5cb8 ffff88080ad34000 ffff88080ad33dc8
> > >>>>>[Mon Nov 7 15:09:45 2016] ffff88080ad33d00 0000000000003501 0000000000000000 0000000000000000
> > >>>>>[Mon Nov 7 15:09:45 2016] 0000000000000000 0000000000000000 0000000000022316 000000000002bc9f
> > >>>>>[Mon Nov 7 15:09:45 2016] Call Trace:
> > >>>>>[Mon Nov 7 15:09:45 2016] [<ffffffff81aabdfd>] ? __schedule+0x21d/0x5b0
> > >>>>>[Mon Nov 7 15:09:45 2016] [<ffffffff81106dcf>] ? shrink_node+0xbf/0x1c0
> > >>>>>[Mon Nov 7 15:09:45 2016] [<ffffffff81107865>] ? kswapd+0x315/0x5f0
> > >>>>>[Mon Nov 7 15:09:45 2016] [<ffffffff81107550>] ? mem_cgroup_shrink_node+0x90/0x90
> > >>>>>[Mon Nov 7 15:09:45 2016] [<ffffffff8106c614>] ? kthread+0xc4/0xe0
> > >>>>>[Mon Nov 7 15:09:45 2016] [<ffffffff81aaf64f>] ? ret_from_fork+0x1f/0x40
> > >>>>>[Mon Nov 7 15:09:45 2016] [<ffffffff8106c550>] ? kthread_worker_fn+0x160/0x160
> > >>>>
> > >>>>Even after reading `stallwarn.txt` [1], I don’t know what could
> > >>>>cause this. All items in the backtrace seem to belong to the Linux
> > >>>>kernel.
> > >>>>
> > >>>>There is also nothing suspicious in the monitoring graphs during that time.
> > >>>
> > >>>If you let it be, do you get a later stall warning a few minutes later?
> > >>>If so, how does the stack trace compare?
> > >>
> > >>With Linux 4.8.6 this is the only occurrence since yesterday.
> > >>
> > >>With Linux 4.8.3, and 4.8.4 the following stack traces were seen.
> > >
> > >Looks to me like one or both of the loops in shrink_node() need
> > >a cond_resched_rcu_qs().
> >
> > Thank you for the pointer. I haven’t had time yet to look into it.
>
> In theory, it is quite straightforward, as shown by the patch below.
> In practice, the MM guys might wish to call cond_resched_rcu_qs() less
> frequently, but I will leave that to their judgment. My guess is that
> the overhead of the cond_resched_rcu_qs() is way down in the noise,
> but I have been surprised in the past.
>
> Anyway, please give this patch a try and let me know how it goes.

I am not seeing the full thread in my inbox, but I am wondering what is
actually going on here. The reclaim path (shrink_node_memcg resp.
shrink_slab) should have preemption points, and not much is done there
other than iterating over all memcgs. Are there gazillions of memcgs
configured (most of them with the low limit configured)?
In other words, is the system configured properly?

To the patch. I cannot say I would like it. cond_resched_rcu_qs sounds
way too lowlevel for this usage. If anything cond_resched somewhere inside
mem_cgroup_iter would be more appropriate to me.
On Mon 21-11-16 06:01:22, Paul E. McKenney wrote:
> On Mon, Nov 21, 2016 at 02:41:31PM +0100, Michal Hocko wrote:
[...]
> > To the patch. I cannot say I would like it. cond_resched_rcu_qs sounds
> > way too lowlevel for this usage. If anything cond_resched somewhere inside
> > mem_cgroup_iter would be more appropriate to me.
>
> Like this?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae052b5e3315..81cb30d5b2fc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -867,6 +867,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  out:
>  	if (prev && prev != root)
>  		css_put(&prev->css);
> +	cond_resched_rcu_qs();

I still do not understand why we should play with _rcu_qs at all and why
a regular cond_resched is not sufficient. Anyway I would have to double
check whether we can do cond_resched in the iterator. I do not remember
having users which are atomic but I might be easily wrong here. Before
we touch this code, though, I would really like to understand what is
actually going on here because as I've already pointed out we should
have some resched points in the reclaim path.
On Mon, Nov 21, 2016 at 03:18:19PM +0100, Michal Hocko wrote:
> On Mon 21-11-16 06:01:22, Paul E. McKenney wrote:
> > On Mon, Nov 21, 2016 at 02:41:31PM +0100, Michal Hocko wrote:
> [...]
> > > To the patch. I cannot say I would like it. cond_resched_rcu_qs sounds
> > > way too lowlevel for this usage. If anything cond_resched somewhere inside
> > > mem_cgroup_iter would be more appropriate to me.
> >
> > Like this?
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ae052b5e3315..81cb30d5b2fc 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -867,6 +867,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> >  out:
> >  	if (prev && prev != root)
> >  		css_put(&prev->css);
> > +	cond_resched_rcu_qs();
>
> I still do not understand why we should play with _rcu_qs at all and why
> a regular cond_resched is not sufficient. Anyway I would have to double
> check whether we can do cond_resched in the iterator. I do not remember
> having users which are atomic but I might be easily wrong here. Before
> we touch this code, though, I would really like to understand what is
> actually going on here because as I've already pointed out we should
> have some resched points in the reclaim path.

If there is a tight loop in the kernel, cond_resched() will ensure that
other tasks get a chance to run, but if there are no such tasks, it does
nothing to give RCU the quiescent state that it needs from time to time.
So if there is a possibility of a long-running in-kernel loop without
preemption by some other task, cond_resched_rcu_qs() is required.

I welcome your deeper investigation -- I am very much treating symptoms
here, which might or might not have any relationship to fixing underlying
problems.

Thanx, Paul
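[The distinction above can be illustrated with a minimal, self-contained sketch. The `struct item`, the list, and `walk_items()` are hypothetical and invented for illustration only; they are not code from this thread or from mm/. The sketch only shows where a cond_resched_rcu_qs() call, as it existed around v4.8, would sit in a long, otherwise non-preempted kernel loop.]

```
#include <linux/list.h>
#include <linux/sched.h>
#include <linux/rcupdate.h>	/* cond_resched_rcu_qs(), as of v4.8 */

/* Hypothetical item type, only for this sketch. */
struct item {
	struct list_head node;
	unsigned long payload;
};

/*
 * A long walk over a potentially huge list.  cond_resched() only yields
 * the CPU when another runnable task exists; on an otherwise idle CPU it
 * returns immediately and RCU still sees no quiescent state, so a grace
 * period can stall for the whole walk.  cond_resched_rcu_qs() also
 * reports the quiescent state that Paul describes above on each pass.
 */
static unsigned long walk_items(struct list_head *head)
{
	struct item *it;
	unsigned long sum = 0;

	list_for_each_entry(it, head, node) {
		sum += it->payload;		/* stand-in for real work */
		cond_resched_rcu_qs();
	}
	return sum;
}
```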
On 11/21/16 15:29, Paul E. McKenney wrote:
> On Mon, Nov 21, 2016 at 03:18:19PM +0100, Michal Hocko wrote:
>> On Mon 21-11-16 06:01:22, Paul E. McKenney wrote:
>>> On Mon, Nov 21, 2016 at 02:41:31PM +0100, Michal Hocko wrote:
>> [...]
>>>> To the patch. I cannot say I would like it. cond_resched_rcu_qs sounds
>>>> way too lowlevel for this usage. If anything cond_resched somewhere inside
>>>> mem_cgroup_iter would be more appropriate to me.
>>> Like this?
>>>
>>> Thanx, Paul
>>>
>>> ------------------------------------------------------------------------
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index ae052b5e3315..81cb30d5b2fc 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -867,6 +867,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>>>  out:
>>>  	if (prev && prev != root)
>>>  		css_put(&prev->css);
>>> +	cond_resched_rcu_qs();
>> I still do not understand why we should play with _rcu_qs at all and why
>> a regular cond_resched is not sufficient. Anyway I would have to double
>> check whether we can do cond_resched in the iterator. I do not remember
>> having users which are atomic but I might be easily wrong here. Before
>> we touch this code, though, I would really like to understand what is
>> actually going on here because as I've already pointed out we should
>> have some resched points in the reclaim path.
> If there is a tight loop in the kernel, cond_resched() will ensure that
> other tasks get a chance to run, but if there are no such tasks, it does
> nothing to give RCU the quiescent state that it needs from time to time.
> So if there is a possibility of a long-running in-kernel loop without
> preemption by some other task, cond_resched_rcu_qs() is required.
>
> I welcome your deeper investigation -- I am very much treating symptoms
> here, which might or might not have any relationship to fixing underlying
> problems.
>
> Thanx, Paul
>

Hello,

thanks a lot for looking into this!

Let me add some information from the reporting site:

* We've tried the patch from Paul E. McKenney (the one posted Wed, 16 Nov
2016) and it doesn't shut up the rcu stall warnings.

* Log file from a boot with the patch applied ( grep kernel
/var/log/messages ) is here :
http://owww.molgen.mpg.de/~buczek/321322/2016-11-21_syslog.txt

* This system is a backup server and walks over thousands of files sometimes
with multiple parallel rsync processes.

* No rcu_* warnings on that machine with 4.7.2, but with 4.8.4 , 4.8.6 ,
4.8.8 and now 4.9.0-rc5+Pauls patch

* When the backups are actually happening there might be relevant memory
pressure from inode cache and the rsync processes. We saw the oom-killer
kick in on another machine with same hardware and similar (a bit higher)
workload. This other machine also shows a lot of rcu stall warnings since
4.8.4.

* We see "rcu_sched detected stalls" also on some other machines since we
switched to 4.8 but not as frequently as on the two backup servers. Usually
there's "shrink_node" and "kswapd" on the top of the stack. Often
"xfs_reclaim_inodes" variants on top of that.

Donald
On Mon 21-11-16 16:35:53, Donald Buczek wrote:
[...]
> Hello,
>
> thanks a lot for looking into this!
>
> Let me add some information from the reporting site:
>
> * We've tried the patch from Paul E. McKenney (the one posted Wed, 16 Nov
> 2016) and it doesn't shut up the rcu stall warnings.
>
> * Log file from a boot with the patch applied ( grep kernel
> /var/log/messages ) is here :
> http://owww.molgen.mpg.de/~buczek/321322/2016-11-21_syslog.txt
>
> * This system is a backup server and walks over thousands of files sometimes
> with multiple parallel rsync processes.
>
> * No rcu_* warnings on that machine with 4.7.2, but with 4.8.4 , 4.8.6 ,
> 4.8.8 and now 4.9.0-rc5+Pauls patch

I assume you haven't tried the Linus 4.8 kernel without any further
stable patches? Just to be sure we are not talking about some later
regression which found its way to the stable tree.

> * When the backups are actually happening there might be relevant memory
> pressure from inode cache and the rsync processes. We saw the oom-killer
> kick in on another machine with same hardware and similar (a bit higher)
> workload. This other machine also shows a lot of rcu stall warnings since
> 4.8.4.
>
> * We see "rcu_sched detected stalls" also on some other machines since we
> switched to 4.8 but not as frequently as on the two backup servers. Usually
> there's "shrink_node" and "kswapd" on the top of the stack. Often
> "xfs_reclaim_inodes" variants on top of that.

I would be interested to see some reclaim tracepoints enabled. Could you
try that out? At least mm_shrink_slab_{start,end} and
mm_vmscan_lru_shrink_inactive. This should tell us more about how the
reclaim behaved.
On 24.11.2016 11:15, Michal Hocko wrote:
> On Mon 21-11-16 16:35:53, Donald Buczek wrote:
> [...]
>> Hello,
>>
>> thanks a lot for looking into this!
>>
>> Let me add some information from the reporting site:
>>
>> * We've tried the patch from Paul E. McKenney (the one posted Wed, 16 Nov
>> 2016) and it doesn't shut up the rcu stall warnings.
>>
>> * Log file from a boot with the patch applied ( grep kernel
>> /var/log/messages ) is here :
>> http://owww.molgen.mpg.de/~buczek/321322/2016-11-21_syslog.txt
>>
>> * This system is a backup server and walks over thousands of files sometimes
>> with multiple parallel rsync processes.
>>
>> * No rcu_* warnings on that machine with 4.7.2, but with 4.8.4 , 4.8.6 ,
>> 4.8.8 and now 4.9.0-rc5+Pauls patch
> I assume you haven't tried the Linus 4.8 kernel without any further
> stable patches? Just to be sure we are not talking about some later
> regression which found its way to the stable tree.
>
>> * When the backups are actually happening there might be relevant memory
>> pressure from inode cache and the rsync processes. We saw the oom-killer
>> kick in on another machine with same hardware and similar (a bit higher)
>> workload. This other machine also shows a lot of rcu stall warnings since
>> 4.8.4.
>>
>> * We see "rcu_sched detected stalls" also on some other machines since we
>> switched to 4.8 but not as frequently as on the two backup servers. Usually
>> there's "shrink_node" and "kswapd" on the top of the stack. Often
>> "xfs_reclaim_inodes" variants on top of that.
> I would be interested to see some reclaim tracepoints enabled. Could you
> try that out? At least mm_shrink_slab_{start,end} and
> mm_vmscan_lru_shrink_inactive. This should tell us more about how the
> reclaim behaved.

We'll try that tomorrow!

Donald
> On Nov 24, 2016, at 5:15 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
>> * No rcu_* warnings on that machine with 4.7.2, but with 4.8.4 , 4.8.6 ,
>> 4.8.8 and now 4.9.0-rc5+Pauls patch
>
> I assume you haven't tried the Linus 4.8 kernel without any further
> stable patches? Just to be sure we are not talking about some later
> regression which found its way to the stable tree.

We are also seeing this frequently on our fleet since moving from 4.7.x
to 4.8. This is from a machine running vanilla 4.8.6 just a few moments
ago:

INFO: rcu_sched detected stalls on CPUs/tasks:
13-...: (420 ticks this GP) idle=ce1/140000000000000/0 softirq=225550784/225550904 fqs=87105
(detected by 26, t=600030 jiffies, g=68185325, c=68185324, q=344996)
Task dump for CPU 13:
kswapd1 R running task 12200 1840 2 0x00000808
0000000000000001 0000000000000034 000000000000012b 0000000000003139
ffff8b643fffb000 ffff8b028cee7cf8 ffff8b028cee7cf8 ffff8b028cee7d08
ffff8b028cee7d08 ffff8b028cee7d18 ffff8b028cee7d18 ffff8b0200000000
Call Trace:
[] ? shrink_node+0xcd/0x2f0
[] ? kswapd+0x304/0x710
[] ? mem_cgroup_shrink_node+0x160/0x160
[] ? kthread+0xc4/0xe0
[] ? ret_from_fork+0x1f/0x40
[] ? kthread_worker_fn+0x140/0x140

The machine will lag terribly during these occurrences .. some will
eventually recover, some will spiral down and require a reboot.

-Chris
On 24.11.2016 11:15, Michal Hocko wrote:
> On Mon 21-11-16 16:35:53, Donald Buczek wrote:
> [...]
>> Hello,
>>
>> thanks a lot for looking into this!
>>
>> Let me add some information from the reporting site:
>>
>> * We've tried the patch from Paul E. McKenney (the one posted Wed, 16 Nov
>> 2016) and it doesn't shut up the rcu stall warnings.
>>
>> * Log file from a boot with the patch applied ( grep kernel
>> /var/log/messages ) is here :
>> http://owww.molgen.mpg.de/~buczek/321322/2016-11-21_syslog.txt
>>
>> * This system is a backup server and walks over thousands of files sometimes
>> with multiple parallel rsync processes.
>>
>> * No rcu_* warnings on that machine with 4.7.2, but with 4.8.4 , 4.8.6 ,
>> 4.8.8 and now 4.9.0-rc5+Pauls patch
> I assume you haven't tried the Linus 4.8 kernel without any further
> stable patches? Just to be sure we are not talking about some later
> regression which found its way to the stable tree.

We've tried v4.8 and got the first rcu stall warnings with this, too. First
one after about 20 hours uptime.

>> * When the backups are actually happening there might be relevant memory
>> pressure from inode cache and the rsync processes. We saw the oom-killer
>> kick in on another machine with same hardware and similar (a bit higher)
>> workload. This other machine also shows a lot of rcu stall warnings since
>> 4.8.4.
>>
>> * We see "rcu_sched detected stalls" also on some other machines since we
>> switched to 4.8 but not as frequently as on the two backup servers. Usually
>> there's "shrink_node" and "kswapd" on the top of the stack. Often
>> "xfs_reclaim_inodes" variants on top of that.
> I would be interested to see some reclaim tracepoints enabled. Could you
> try that out? At least mm_shrink_slab_{start,end} and
> mm_vmscan_lru_shrink_inactive. This should tell us more about how the
> reclaim behaved.

http://owww.molgen.mpg.de/~buczek/321322/2016-11-26.dmesg.txt (80K)
http://owww.molgen.mpg.de/~buczek/321322/2016-11-26.trace.txt (50M)

Traces wrapped, but the last event is covered. all vmscan events were
enabled
On Thursday, 2016-11-24 at 19:50 +0100, Donald Buczek wrote:
> On 24.11.2016 11:15, Michal Hocko wrote:
>
> > On Mon 21-11-16 16:35:53, Donald Buczek wrote:
> > [...]
> >> Let me add some information from the reporting site:
> >>
> >> * We've tried the patch from Paul E. McKenney (the one posted Wed, 16 Nov
> >> 2016) and it doesn't shut up the rcu stall warnings.
> >>
> >> * Log file from a boot with the patch applied ( grep kernel
> >> /var/log/messages ) is here :
[…]
> >> * This system is a backup server and walks over thousands of files sometimes
> >> with multiple parallel rsync processes.
> >>
> >> * No rcu_* warnings on that machine with 4.7.2, but with 4.8.4 , 4.8.6 ,
> >> 4.8.8 and now 4.9.0-rc5+Pauls patch
> > I assume you haven't tried the Linus 4.8 kernel without any further
> > stable patches? Just to be sure we are not talking about some later
> > regression which found its way to the stable tree.

We tried, and the problem also shows up with the plain 4.8 kernel.

```
$ dmesg
[…]
[77554.135657] INFO: rcu_sched detected stalls on CPUs/tasks:
[77554.135662] 1-...: (222 ticks this GP) idle=7dd/140000000000000/0 softirq=30962751/30962968 fqs=12961
[77554.135663] (detected by 10, t=60002 jiffies, g=7958132, c=7958131, q=90237)
[77554.135667] Task dump for CPU 1:
[77554.135669] kswapd1 R running task 0 86 2 0x00000008
[77554.135672] ffff88080ad87c58 ffff88080ad87c58 ffff88080ad87cf8 ffff88100c1e5200
[77554.135674] 0000000000000003 0000000000000000 ffff88080ad87e60 ffff88080ad87d90
[77554.135675] ffffffff811345f5 ffff88080ad87da0 ffff88080ad87db0 ffff88100c1e5200
[77554.135677] Call Trace:
[77554.135684] [<ffffffff811345f5>] ? shrink_node_memcg+0x605/0x870
[77554.135686] [<ffffffff8113491f>] ? shrink_node+0xbf/0x1c0
[77554.135687] [<ffffffff81135642>] ? kswapd+0x342/0x6b0
[77554.135689] [<ffffffff81135300>] ? mem_cgroup_shrink_node+0x150/0x150
[77554.135692] [<ffffffff81075be4>] ? kthread+0xc4/0xe0
[77554.135695] [<ffffffff81b2b34f>] ? ret_from_fork+0x1f/0x40
[77554.135696] [<ffffffff81075b20>] ? kthread_worker_fn+0x160/0x160
[77734.252362] INFO: rcu_sched detected stalls on CPUs/tasks:
[77734.252367] 1-...: (897 ticks this GP) idle=7dd/140000000000000/0 softirq=30962751/30963197 fqs=50466
[77734.252368] (detected by 0, t=240122 jiffies, g=7958132, c=7958131, q=456322)
[77734.252372] Task dump for CPU 1:
[77734.252373] kswapd1 R running task 0 86 2 0x00000008
[77734.252376] ffff88080ad87c58 ffff88080ad87c58 ffff88080ad87cf8 ffff88100c1e5200
[77734.252378] 0000000000000003 0000000000000000 ffff88080ad87e60 ffff88080ad87d90
[77734.252380] ffffffff811345f5 ffff88080ad87da0 ffff88080ad87db0 ffff88100c1e5200
[77734.252382] Call Trace:
[77734.252388] [<ffffffff811345f5>] ? shrink_node_memcg+0x605/0x870
[77734.252390] [<ffffffff8113491f>] ? shrink_node+0xbf/0x1c0
[77734.252391] [<ffffffff81135642>] ? kswapd+0x342/0x6b0
[77734.252393] [<ffffffff81135300>] ? mem_cgroup_shrink_node+0x150/0x150
[77734.252396] [<ffffffff81075be4>] ? kthread+0xc4/0xe0
[77734.252399] [<ffffffff81b2b34f>] ? ret_from_fork+0x1f/0x40
[77734.252401] [<ffffffff81075b20>] ? kthread_worker_fn+0x160/0x160
[…]
```

> >> * When the backups are actually happening there might be relevant memory
> >> pressure from inode cache and the rsync processes. We saw the oom-killer
> >> kick in on another machine with same hardware and similar (a bit higher)
> >> workload. This other machine also shows a lot of rcu stall warnings since
> >> 4.8.4.
> >>
> >> * We see "rcu_sched detected stalls" also on some other machines since we
> >> switched to 4.8 but not as frequently as on the two backup servers. Usually
> >> there's "shrink_node" and "kswapd" on the top of the stack. Often
> >> "xfs_reclaim_inodes" variants on top of that.
> >
> > I would be interested to see some reclaim tracepoints enabled. Could you
> > try that out? At least mm_shrink_slab_{start,end} and
> > mm_vmscan_lru_shrink_inactive. This should tell us more about how the
> > reclaim behaved.
>
> We'll try that tomorrow!

Unfortunately, looking today at `trace`, the corresponding messages have
already been thrown out of the buffer. We continue trying.

Kind regards,

Paul Menzel
On Sun 27-11-16 10:19:06, Donald Buczek wrote:
> On 24.11.2016 11:15, Michal Hocko wrote:
> > On Mon 21-11-16 16:35:53, Donald Buczek wrote:
> > [...]
> > > Hello,
> > >
> > > thanks a lot for looking into this!
> > >
> > > Let me add some information from the reporting site:
> > >
> > > * We've tried the patch from Paul E. McKenney (the one posted Wed, 16 Nov
> > > 2016) and it doesn't shut up the rcu stall warnings.
> > >
> > > * Log file from a boot with the patch applied ( grep kernel
> > > /var/log/messages ) is here :
> > > http://owww.molgen.mpg.de/~buczek/321322/2016-11-21_syslog.txt
> > >
> > > * This system is a backup server and walks over thousands of files sometimes
> > > with multiple parallel rsync processes.
> > >
> > > * No rcu_* warnings on that machine with 4.7.2, but with 4.8.4 , 4.8.6 ,
> > > 4.8.8 and now 4.9.0-rc5+Pauls patch
> > I assume you haven't tried the Linus 4.8 kernel without any further
> > stable patches? Just to be sure we are not talking about some later
> > regression which found its way to the stable tree.
>
> We've tried v4.8 and got the first rcu stall warnings with this, too. First
> one after about 20 hours uptime.
>
> > > * When the backups are actually happening there might be relevant memory
> > > pressure from inode cache and the rsync processes. We saw the oom-killer
> > > kick in on another machine with same hardware and similar (a bit higher)
> > > workload. This other machine also shows a lot of rcu stall warnings since
> > > 4.8.4.
> > >
> > > * We see "rcu_sched detected stalls" also on some other machines since we
> > > switched to 4.8 but not as frequently as on the two backup servers. Usually
> > > there's "shrink_node" and "kswapd" on the top of the stack. Often
> > > "xfs_reclaim_inodes" variants on top of that.
> > I would be interested to see some reclaim tracepoints enabled. Could you
> > try that out? At least mm_shrink_slab_{start,end} and
> > mm_vmscan_lru_shrink_inactive. This should tell us more about how the
> > reclaim behaved.
>
> http://owww.molgen.mpg.de/~buczek/321322/2016-11-26.dmesg.txt (80K)
> http://owww.molgen.mpg.de/~buczek/321322/2016-11-26.trace.txt (50M)
>
> Traces wrapped, but the last event is covered. all vmscan events were
> enabled

OK, so one of the stalls is reported at

[118077.988410] INFO: rcu_sched detected stalls on CPUs/tasks:
[118077.988416] 1-...: (181 ticks this GP) idle=6d5/140000000000000/0 softirq=46417663/46417663 fqs=10691
[118077.988417] (detected by 4, t=60002 jiffies, g=11845915, c=11845914, q=46475)
[118077.988421] Task dump for CPU 1:
[118077.988421] kswapd1 R running task 0 86 2 0x00000008
[118077.988424] ffff88080ad87c58 ffff88080ad87c58 ffff88080ad87cf8 ffff88100c1e5200
[118077.988426] 0000000000000003 0000000000000000 ffff88080ad87e60 ffff88080ad87d90
[118077.988428] ffffffff811345f5 ffff88080ad87da0 ffff88100c1e5200 ffff88080ad87dd0
[118077.988430] Call Trace:
[118077.988436] [<ffffffff811345f5>] ? shrink_node_memcg+0x605/0x870
[118077.988438] [<ffffffff8113491f>] ? shrink_node+0xbf/0x1c0
[118077.988440] [<ffffffff81135642>] ? kswapd+0x342/0x6b0

and the interesting part of the traces would be around the same time:

clusterd-989 [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
kswapd1-86 [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
[...]
kswapd1-86 [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1

Note the Need resched flag. The IRQ off part is expected because we are
holding the LRU lock which is IRQ safe. That is not a problem because the
lock is only held for SWAP_CLUSTER_MAX pages at maximum. It is also
interesting to see that we have scanned only 1303 pages during that 1
minute. That would be dead slow. None of them were good enough for the
reclaim but that doesn't sound like a problem. The trace simply suggests
that the reclaim was preempted by something else. Otherwise I cannot
imagine such a slow scanning.

Is it possible that something else is hogging the CPU and the RCU just
happens to blame kswapd which is running in the standard user process
context?
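[To make the bounded-lock-hold argument above concrete, here is a heavily simplified, hypothetical sketch of the isolate-then-drop pattern. The `fake_page` type, list, lock, and `isolate_batch()` are invented for illustration and are not the real mm/vmscan.c code; the point is only that the IRQ-safe lock is held while moving at most SWAP_CLUSTER_MAX (32 in v4.8) entries to a private list, so each irqs-off section stays short.]

```
#include <linux/list.h>
#include <linux/spinlock.h>

#define BATCH_MAX 32UL	/* SWAP_CLUSTER_MAX in the v4.8 kernel */

/* Hypothetical page wrapper, only for this sketch. */
struct fake_page {
	struct list_head lru;
};

static DEFINE_SPINLOCK(fake_lru_lock);
static LIST_HEAD(fake_lru);

static unsigned long isolate_batch(struct list_head *batch)
{
	struct fake_page *p, *next;
	unsigned long taken = 0;

	/* IRQ-safe lock held only while moving at most BATCH_MAX entries. */
	spin_lock_irq(&fake_lru_lock);
	list_for_each_entry_safe(p, next, &fake_lru, lru) {
		if (taken == BATCH_MAX)
			break;
		list_move(&p->lru, batch);
		taken++;
	}
	spin_unlock_irq(&fake_lru_lock);

	/* The expensive per-page work happens later, with IRQs enabled. */
	return taken;
}
```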
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 744f926af442..0d3b5f5a04ef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2529,8 +2529,11 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		unsigned long scanned;
 
 		if (mem_cgroup_low(root, memcg)) {
-			if (!sc->may_thrash)
+			if (!sc->may_thrash) {
+				/* Prevent RCU CPU stalls. */
+				cond_resched_rcu_qs();
 				continue;
+			}
 
 			mem_cgroup_events(memcg, MEMCG_LOW, 1);
 		}
@@ -2565,6 +2568,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			mem_cgroup_iter_break(root, memcg);
 			break;
 		}
+		cond_resched_rcu_qs(); /* Prevent RCU CPU stalls. */
 	} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
 	/*