
INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`

Message ID 20161116173036.GK3612@linux.vnet.ibm.com (mailing list archive)
State Not Applicable, archived

Commit Message

Paul E. McKenney Nov. 16, 2016, 5:30 p.m. UTC
On Wed, Nov 16, 2016 at 06:01:19PM +0100, Paul Menzel wrote:
> Dear Linux folks,
> 
> 
> On 11/08/16 19:39, Paul E. McKenney wrote:
> >On Tue, Nov 08, 2016 at 06:38:18PM +0100, Paul Menzel wrote:
> >>On 11/08/16 18:03, Paul E. McKenney wrote:
> >>>On Tue, Nov 08, 2016 at 01:22:28PM +0100, Paul Menzel wrote:
> >>
> >>>>Could you please help me shed some light on the messages below?
> >>>>
> >>>>With Linux 4.4.X, these messages were not seen. After updating to
> >>>>Linux 4.8.4 and Linux 4.8.6, they started to appear. With that
> >>>>update, we enabled several CGROUP options.
> >>>>
> >>>>>$ dmesg -T
> >>>>>[…]
> >>>>>[Mon Nov  7 15:09:45 2016] INFO: rcu_sched detected stalls on CPUs/tasks:
> >>>>>[Mon Nov  7 15:09:45 2016]     3-...: (493 ticks this GP) idle=515/140000000000000/0 softirq=5504423/5504423 fqs=13876
> >>>>>[Mon Nov  7 15:09:45 2016]     (detected by 5, t=60002 jiffies, g=1363193, c=1363192, q=268508)
> >>>>>[Mon Nov  7 15:09:45 2016] Task dump for CPU 3:
> >>>>>[Mon Nov  7 15:09:45 2016] kswapd1         R  running task        0    87      2 0x00000008
> >>>>>[Mon Nov  7 15:09:45 2016]  ffffffff81aabdfd ffff8810042a5cb8 ffff88080ad34000 ffff88080ad33dc8
> >>>>>[Mon Nov  7 15:09:45 2016]  ffff88080ad33d00 0000000000003501 0000000000000000 0000000000000000
> >>>>>[Mon Nov  7 15:09:45 2016]  0000000000000000 0000000000000000 0000000000022316 000000000002bc9f
> >>>>>[Mon Nov  7 15:09:45 2016] Call Trace:
> >>>>>[Mon Nov  7 15:09:45 2016]  [<ffffffff81aabdfd>] ? __schedule+0x21d/0x5b0
> >>>>>[Mon Nov  7 15:09:45 2016]  [<ffffffff81106dcf>] ? shrink_node+0xbf/0x1c0
> >>>>>[Mon Nov  7 15:09:45 2016]  [<ffffffff81107865>] ? kswapd+0x315/0x5f0
> >>>>>[Mon Nov  7 15:09:45 2016]  [<ffffffff81107550>] ? mem_cgroup_shrink_node+0x90/0x90
> >>>>>[Mon Nov  7 15:09:45 2016]  [<ffffffff8106c614>] ? kthread+0xc4/0xe0
> >>>>>[Mon Nov  7 15:09:45 2016]  [<ffffffff81aaf64f>] ? ret_from_fork+0x1f/0x40
> >>>>>[Mon Nov  7 15:09:45 2016]  [<ffffffff8106c550>] ? kthread_worker_fn+0x160/0x160
> >>>>
> >>>>Even after reading `stallwarn.txt` [1], I don’t know what could
> >>>>cause this. All items in the backtrace seem to belong to the Linux
> >>>>kernel.
> >>>>
> >>>>There is also nothing suspicious in the monitoring graphs during that time.
> >>>
> >>>If you let it be, do you get a later stall warning a few minutes later?
> >>>If so, how does the stack trace compare?
> >>
> >>With Linux 4.8.6 this is the only occurrence since yesterday.
> >>
> >>With Linux 4.8.3 and 4.8.4, the following stack traces were seen.
> >
> >Looks to me like one or both of the loops in shrink_node() need
> >a cond_resched_rcu_qs().
> 
> Thank you for the pointer. I haven’t had time yet to look into it.

In theory, it is quite straightforward, as shown by the patch below.
In practice, the MM guys might wish to call cond_resched_rcu_qs() less
frequently, but I will leave that to their judgment.  My guess is that
the overhead of the cond_resched_rcu_qs() is way down in the noise,
but I have been surprised in the past.

Anyway, please give this patch a try and let me know how it goes.

							Thanx, Paul

------------------------------------------------------------------------

commit 1a5595eec6c034c27e1c826a93292240bfea934e
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Wed Nov 16 09:26:28 2016 -0800

    mm: Prevent shrink_node() RCU CPU stall warnings
    
    This commit adds a couple of cond_resched_rcu_qs() calls in the inner
    loop in shrink_node() in order to prevent RCU CPU stall warnings.
    
    Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>



Comments

Michal Hocko Nov. 21, 2016, 1:41 p.m. UTC | #1
On Wed 16-11-16 09:30:36, Paul E. McKenney wrote:
> On Wed, Nov 16, 2016 at 06:01:19PM +0100, Paul Menzel wrote:
> > Dear Linux folks,
> > 
> > 
> > On 11/08/16 19:39, Paul E. McKenney wrote:
> > >On Tue, Nov 08, 2016 at 06:38:18PM +0100, Paul Menzel wrote:
> > >>On 11/08/16 18:03, Paul E. McKenney wrote:
> > >>>On Tue, Nov 08, 2016 at 01:22:28PM +0100, Paul Menzel wrote:
> > >>
> > >>>>Could you please help me shed some light on the messages below?
> > >>>>
> > >>>>With Linux 4.4.X, these messages were not seen. After updating to
> > >>>>Linux 4.8.4 and Linux 4.8.6, they started to appear. With that
> > >>>>update, we enabled several CGROUP options.
> > >>>>
> > >>>>>$ dmesg -T
> > >>>>>[…]
> > >>>>>[Mon Nov  7 15:09:45 2016] INFO: rcu_sched detected stalls on CPUs/tasks:
> > >>>>>[Mon Nov  7 15:09:45 2016]     3-...: (493 ticks this GP) idle=515/140000000000000/0 softirq=5504423/5504423 fqs=13876
> > >>>>>[Mon Nov  7 15:09:45 2016]     (detected by 5, t=60002 jiffies, g=1363193, c=1363192, q=268508)
> > >>>>>[Mon Nov  7 15:09:45 2016] Task dump for CPU 3:
> > >>>>>[Mon Nov  7 15:09:45 2016] kswapd1         R  running task        0    87      2 0x00000008
> > >>>>>[Mon Nov  7 15:09:45 2016]  ffffffff81aabdfd ffff8810042a5cb8 ffff88080ad34000 ffff88080ad33dc8
> > >>>>>[Mon Nov  7 15:09:45 2016]  ffff88080ad33d00 0000000000003501 0000000000000000 0000000000000000
> > >>>>>[Mon Nov  7 15:09:45 2016]  0000000000000000 0000000000000000 0000000000022316 000000000002bc9f
> > >>>>>[Mon Nov  7 15:09:45 2016] Call Trace:
> > >>>>>[Mon Nov  7 15:09:45 2016]  [<ffffffff81aabdfd>] ? __schedule+0x21d/0x5b0
> > >>>>>[Mon Nov  7 15:09:45 2016]  [<ffffffff81106dcf>] ? shrink_node+0xbf/0x1c0
> > >>>>>[Mon Nov  7 15:09:45 2016]  [<ffffffff81107865>] ? kswapd+0x315/0x5f0
> > >>>>>[Mon Nov  7 15:09:45 2016]  [<ffffffff81107550>] ? mem_cgroup_shrink_node+0x90/0x90
> > >>>>>[Mon Nov  7 15:09:45 2016]  [<ffffffff8106c614>] ? kthread+0xc4/0xe0
> > >>>>>[Mon Nov  7 15:09:45 2016]  [<ffffffff81aaf64f>] ? ret_from_fork+0x1f/0x40
> > >>>>>[Mon Nov  7 15:09:45 2016]  [<ffffffff8106c550>] ? kthread_worker_fn+0x160/0x160
> > >>>>
> > >>>>Even after reading `stallwarn.txt` [1], I don’t know what could
> > >>>>cause this. All items in the backtrace seem to belong to the Linux
> > >>>>kernel.
> > >>>>
> > >>>>There is also nothing suspicious in the monitoring graphs during that time.
> > >>>
> > >>>If you let it be, do you get a later stall warning a few minutes later?
> > >>>If so, how does the stack trace compare?
> > >>
> > >>With Linux 4.8.6 this is the only occurrence since yesterday.
> > >>
> > >>With Linux 4.8.3 and 4.8.4, the following stack traces were seen.
> > >
> > >Looks to me like one or both of the loops in shrink_node() need
> > >a cond_resched_rcu_qs().
> > 
> > Thank you for the pointer. I haven’t had time yet to look into it.
> 
> In theory, it is quite straightforward, as shown by the patch below.
> In practice, the MM guys might wish to call cond_resched_rcu_qs() less
> frequently, but I will leave that to their judgment.  My guess is that
> the overhead of the cond_resched_rcu_qs() is way down in the noise,
> but I have been surprised in the past.
> 
> Anyway, please give this patch a try and let me know how it goes.

I am not seeing the full thread in my inbox, but I am wondering what is
actually going on here. The reclaim path (shrink_node_memcg and
shrink_slab) should have preemption points, and other than iterating
over all memcgs, not much is done there. Are there gazillions of memcgs
configured (most of them with the low limit configured)? In other words,
is the system configured properly?
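
A quick way to check how many memcgs exist (a sketch; it assumes the
cgroup v1 memory controller is mounted at its usual location):

```
# Count memory cgroup directories; adjust the mount point as needed.
find /sys/fs/cgroup/memory -type d | wc -l
```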

As for the patch: I cannot say I like it. cond_resched_rcu_qs() sounds
way too low-level for this usage. If anything, a cond_resched() somewhere
inside mem_cgroup_iter() would be more appropriate to me.
Michal Hocko Nov. 21, 2016, 2:18 p.m. UTC | #2
On Mon 21-11-16 06:01:22, Paul E. McKenney wrote:
> On Mon, Nov 21, 2016 at 02:41:31PM +0100, Michal Hocko wrote:
[...]
> > As for the patch: I cannot say I like it. cond_resched_rcu_qs() sounds
> > way too low-level for this usage. If anything, a cond_resched() somewhere
> > inside mem_cgroup_iter() would be more appropriate to me.
> 
> Like this?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae052b5e3315..81cb30d5b2fc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -867,6 +867,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  out:
>  	if (prev && prev != root)
>  		css_put(&prev->css);
> +	cond_resched_rcu_qs();

I still do not understand why we should play with _rcu_qs at all and why
a regular cond_resched is not sufficient. Anyway, I would have to
double-check whether we can do cond_resched in the iterator. I do not
remember having users which are atomic, but I might easily be wrong here.
Before we touch this code, though, I would really like to understand what
is actually going on here, because, as I've already pointed out, we should
have some resched points in the reclaim path.
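
To make the concern concrete, here is a minimal sketch (a hypothetical
caller, not code from this thread) of the kind of atomic user that would
make a cond_resched() inside the iterator unsafe:

```c
/*
 * Hypothetical atomic caller of mem_cgroup_iter() (illustration only).
 * If any such walk existed, a cond_resched() inside the iterator would
 * be a sleeping-in-atomic-context bug.
 */
struct mem_cgroup *memcg;

rcu_read_lock();			/* atomic context: must not sleep */
for (memcg = mem_cgroup_iter(root, NULL, NULL);
     memcg;
     memcg = mem_cgroup_iter(root, memcg, NULL)) {
	/* ... inspect memcg without sleeping ... */
}
rcu_read_unlock();
```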
Paul E. McKenney Nov. 21, 2016, 2:29 p.m. UTC | #3
On Mon, Nov 21, 2016 at 03:18:19PM +0100, Michal Hocko wrote:
> On Mon 21-11-16 06:01:22, Paul E. McKenney wrote:
> > On Mon, Nov 21, 2016 at 02:41:31PM +0100, Michal Hocko wrote:
> [...]
> > > As for the patch: I cannot say I like it. cond_resched_rcu_qs() sounds
> > > way too low-level for this usage. If anything, a cond_resched() somewhere
> > > inside mem_cgroup_iter() would be more appropriate to me.
> > 
> > Like this?
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ae052b5e3315..81cb30d5b2fc 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -867,6 +867,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> >  out:
> >  	if (prev && prev != root)
> >  		css_put(&prev->css);
> > +	cond_resched_rcu_qs();
> 
> I still do not understand why we should play with _rcu_qs at all and why
> a regular cond_resched is not sufficient. Anyway, I would have to
> double-check whether we can do cond_resched in the iterator. I do not
> remember having users which are atomic, but I might easily be wrong here.
> Before we touch this code, though, I would really like to understand what
> is actually going on here, because, as I've already pointed out, we should
> have some resched points in the reclaim path.

If there is a tight loop in the kernel, cond_resched() will ensure that
other tasks get a chance to run, but if there are no such tasks, it does
nothing to give RCU the quiescent state that it needs from time to time.
So if there is a possibility of a long-running in-kernel loop without
preemption by some other task, cond_resched_rcu_qs() is required.
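
As a minimal sketch of the pattern being described (the loop bound and
helper are hypothetical, not kernel code):

```c
/*
 * A long-running in-kernel loop.  cond_resched() yields only if some
 * other task is runnable; with nothing else to run it is a no-op and
 * RCU never sees a quiescent state from this CPU, so a stall warning
 * can eventually fire.  cond_resched_rcu_qs() additionally notes a
 * voluntary quiescent state when no reschedule takes place.
 */
static void scan_everything(void)
{
	unsigned long i;

	for (i = 0; i < nr_items; i++) {	/* hypothetical bound */
		process_item(i);		/* hypothetical helper */
		cond_resched_rcu_qs();		/* yield and/or report a QS */
	}
}
```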

I welcome your deeper investigation -- I am very much treating symptoms
here, which might or might not have any relationship to fixing underlying
problems.

							Thanx, Paul

Donald Buczek Nov. 21, 2016, 3:35 p.m. UTC | #4
On 11/21/16 15:29, Paul E. McKenney wrote:
> On Mon, Nov 21, 2016 at 03:18:19PM +0100, Michal Hocko wrote:
>> On Mon 21-11-16 06:01:22, Paul E. McKenney wrote:
>>> On Mon, Nov 21, 2016 at 02:41:31PM +0100, Michal Hocko wrote:
>> [...]
>>>> As for the patch: I cannot say I like it. cond_resched_rcu_qs() sounds
>>>> way too low-level for this usage. If anything, a cond_resched() somewhere
>>>> inside mem_cgroup_iter() would be more appropriate to me.
>>> Like this?
>>>
>>> 							Thanx, Paul
>>>
>>> ------------------------------------------------------------------------
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index ae052b5e3315..81cb30d5b2fc 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -867,6 +867,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>>>   out:
>>>   	if (prev && prev != root)
>>>   		css_put(&prev->css);
>>> +	cond_resched_rcu_qs();
>> I still do not understand why we should play with _rcu_qs at all and why
>> a regular cond_resched is not sufficient. Anyway, I would have to
>> double-check whether we can do cond_resched in the iterator. I do not
>> remember having users which are atomic, but I might easily be wrong here.
>> Before we touch this code, though, I would really like to understand what
>> is actually going on here, because, as I've already pointed out, we should
>> have some resched points in the reclaim path.
> If there is a tight loop in the kernel, cond_resched() will ensure that
> other tasks get a chance to run, but if there are no such tasks, it does
> nothing to give RCU the quiescent state that it needs from time to time.
> So if there is a possibility of a long-running in-kernel loop without
> preemption by some other task, cond_resched_rcu_qs() is required.
>
> I welcome your deeper investigation -- I am very much treating symptoms
> here, which might or might not have any relationship to fixing underlying
> problems.
>
> 							Thanx, Paul
>

Hello,

thanks a lot for looking into this!

Let me add some information from the reporting site:

* We've tried the patch from Paul E. McKenney (the one posted Wed, 16 
Nov 2016)  and it doesn't shut up the rcu stall warnings.

* Log file from a boot with the patch applied ( grep kernel 
/var/log/messages ) is here : 
http://owww.molgen.mpg.de/~buczek/321322/2016-11-21_syslog.txt

* This system is a backup server and walks over thousands of files 
sometimes with multiple parallel rsync processes.

* No rcu_* warnings on that machine with 4.7.2, but with 4.8.4, 4.8.6,
4.8.8, and now 4.9.0-rc5 + Paul's patch

* When the backups are actually happening there might be relevant memory 
pressure from inode cache and the rsync processes. We saw the oom-killer 
kick in on another machine with same hardware and similar (a bit higher) 
workload. This other machine also shows a lot of rcu stall warnings 
since 4.8.4.

* We see "rcu_sched detected stalls" also on some other machines since 
we switched to 4.8 but not as frequently as on the two backup servers. 
Usually there's "shrink_node" and "kswapd" on the top of the stack. 
Often "xfs_reclaim_inodes" variants on top of that.

Donald
Michal Hocko Nov. 24, 2016, 10:15 a.m. UTC | #5
On Mon 21-11-16 16:35:53, Donald Buczek wrote:
[...]
> Hello,
> 
> thanks a lot for looking into this!
> 
> Let me add some information from the reporting site:
> 
> * We've tried the patch from Paul E. McKenney (the one posted Wed, 16 Nov
> 2016)  and it doesn't shut up the rcu stall warnings.
> 
> * Log file from a boot with the patch applied ( grep kernel
> /var/log/messages ) is here :
> http://owww.molgen.mpg.de/~buczek/321322/2016-11-21_syslog.txt
> 
> * This system is a backup server and walks over thousands of files sometimes
> with multiple parallel rsync processes.
> 
> * No rcu_* warnings on that machine with 4.7.2, but with 4.8.4, 4.8.6,
> 4.8.8, and now 4.9.0-rc5 + Paul's patch

I assume you haven't tried the Linus 4.8 kernel without any further
stable patches? Just to be sure we are not talking about some later
regression which found its way to the stable tree.

> * When the backups are actually happening there might be relevant memory
> pressure from inode cache and the rsync processes. We saw the oom-killer
> kick in on another machine with same hardware and similar (a bit higher)
> workload. This other machine also shows a lot of rcu stall warnings since
> 4.8.4.
> 
> * We see "rcu_sched detected stalls" also on some other machines since we
> switched to 4.8 but not as frequently as on the two backup servers. Usually
> there's "shrink_node" and "kswapd" on the top of the stack. Often
> "xfs_reclaim_inodes" variants on top of that.

I would be interested to see some reclaim tracepoints enabled. Could you
try that out? At least mm_shrink_slab_{start,end} and
mm_vmscan_lru_shrink_inactive. This should tell us more about how the
reclaim behaved.
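
For reference, one way to enable those tracepoints through tracefs (a
sketch; it assumes debugfs is mounted at /sys/kernel/debug, so adjust the
paths for your system):

```
cd /sys/kernel/debug/tracing
echo 1 > events/vmscan/mm_shrink_slab_start/enable
echo 1 > events/vmscan/mm_shrink_slab_end/enable
echo 1 > events/vmscan/mm_vmscan_lru_shrink_inactive/enable
# Stream to a file so ring-buffer wrap-around loses less data.
cat trace_pipe > /tmp/vmscan-trace.txt &
```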
Donald Buczek Nov. 24, 2016, 6:50 p.m. UTC | #6
On 24.11.2016 11:15, Michal Hocko wrote:

> On Mon 21-11-16 16:35:53, Donald Buczek wrote:
> [...]
>> Hello,
>>
>> thanks a lot for looking into this!
>>
>> Let me add some information from the reporting site:
>>
>> * We've tried the patch from Paul E. McKenney (the one posted Wed, 16 Nov
>> 2016)  and it doesn't shut up the rcu stall warnings.
>>
>> * Log file from a boot with the patch applied ( grep kernel
>> /var/log/messages ) is here :
>> http://owww.molgen.mpg.de/~buczek/321322/2016-11-21_syslog.txt
>>
>> * This system is a backup server and walks over thousands of files sometimes
>> with multiple parallel rsync processes.
>>
>> * No rcu_* warnings on that machine with 4.7.2, but with 4.8.4, 4.8.6,
>> 4.8.8, and now 4.9.0-rc5 + Paul's patch
> I assume you haven't tried the Linus 4.8 kernel without any further
> stable patches? Just to be sure we are not talking about some later
> regression which found its way to the stable tree.
>
>> * When the backups are actually happening there might be relevant memory
>> pressure from inode cache and the rsync processes. We saw the oom-killer
>> kick in on another machine with same hardware and similar (a bit higher)
>> workload. This other machine also shows a lot of rcu stall warnings since
>> 4.8.4.
>>
>> * We see "rcu_sched detected stalls" also on some other machines since we
>> switched to 4.8 but not as frequently as on the two backup servers. Usually
>> there's "shrink_node" and "kswapd" on the top of the stack. Often
>> "xfs_reclaim_inodes" variants on top of that.
> I would be interested to see some reclaim tracepoints enabled. Could you
> try that out? At least mm_shrink_slab_{start,end} and
> mm_vmscan_lru_shrink_inactive. This should tell us more about how the
> reclaim behaved.

We'll try that tomorrow!

Donald
Christopher S. Aker Nov. 27, 2016, 5:32 a.m. UTC | #7
> On Nov 24, 2016, at 5:15 AM, Michal Hocko <mhocko@kernel.org> wrote:
> 
>> * No rcu_* warnings on that machine with 4.7.2, but with 4.8.4, 4.8.6,
>> 4.8.8, and now 4.9.0-rc5 + Paul's patch
> 
> I assume you haven't tried the Linus 4.8 kernel without any further
> stable patches? Just to be sure we are not talking about some later
> regression which found its way to the stable tree.

We are also seeing this frequently on our fleet since moving from 4.7.x to 4.8. This is from a machine running vanilla 4.8.6 just a few moments ago:

INFO: rcu_sched detected stalls on CPUs/tasks:
	13-...: (420 ticks this GP) idle=ce1/140000000000000/0 softirq=225550784/225550904 fqs=87105 
	(detected by 26, t=600030 jiffies, g=68185325, c=68185324, q=344996)
Task dump for CPU 13:
kswapd1         R  running task    12200  1840      2 0x00000808
 0000000000000001 0000000000000034 000000000000012b 0000000000003139
 ffff8b643fffb000 ffff8b028cee7cf8 ffff8b028cee7cf8 ffff8b028cee7d08
 ffff8b028cee7d08 ffff8b028cee7d18 ffff8b028cee7d18 ffff8b0200000000
Call Trace:
 [] ? shrink_node+0xcd/0x2f0
 [] ? kswapd+0x304/0x710
 [] ? mem_cgroup_shrink_node+0x160/0x160
 [] ? kthread+0xc4/0xe0
 [] ? ret_from_fork+0x1f/0x40
 [] ? kthread_worker_fn+0x140/0x140

The machine will lag terribly during these occurrences; some will eventually recover, some will spiral down and require a reboot.

-Chris

Donald Buczek Nov. 27, 2016, 9:19 a.m. UTC | #8
On 24.11.2016 11:15, Michal Hocko wrote:
> On Mon 21-11-16 16:35:53, Donald Buczek wrote:
> [...]
>> Hello,
>>
>> thanks a lot for looking into this!
>>
>> Let me add some information from the reporting site:
>>
>> * We've tried the patch from Paul E. McKenney (the one posted Wed, 16 Nov
>> 2016)  and it doesn't shut up the rcu stall warnings.
>>
>> * Log file from a boot with the patch applied ( grep kernel
>> /var/log/messages ) is here :
>> http://owww.molgen.mpg.de/~buczek/321322/2016-11-21_syslog.txt
>>
>> * This system is a backup server and walks over thousands of files sometimes
>> with multiple parallel rsync processes.
>>
>> * No rcu_* warnings on that machine with 4.7.2, but with 4.8.4, 4.8.6,
>> 4.8.8, and now 4.9.0-rc5 + Paul's patch
> I assume you haven't tried the Linus 4.8 kernel without any further
> stable patches? Just to be sure we are not talking about some later
> regression which found its way to the stable tree.

We've tried v4.8 and got the first rcu stall warnings with this, too.
The first one appeared after about 20 hours of uptime.


>> * When the backups are actually happening there might be relevant memory
>> pressure from inode cache and the rsync processes. We saw the oom-killer
>> kick in on another machine with same hardware and similar (a bit higher)
>> workload. This other machine also shows a lot of rcu stall warnings since
>> 4.8.4.
>>
>> * We see "rcu_sched detected stalls" also on some other machines since we
>> switched to 4.8 but not as frequently as on the two backup servers. Usually
>> there's "shrink_node" and "kswapd" on the top of the stack. Often
>> "xfs_reclaim_inodes" variants on top of that.
> I would be interested to see some reclaim tracepoints enabled. Could you
> try that out? At least mm_shrink_slab_{start,end} and
> mm_vmscan_lru_shrink_inactive. This should tell us more about how the
> reclaim behaved.

http://owww.molgen.mpg.de/~buczek/321322/2016-11-26.dmesg.txt  (80K)
http://owww.molgen.mpg.de/~buczek/321322/2016-11-26.trace.txt (50M)

The traces wrapped, but the last event is covered. All vmscan events were
enabled.
Paul Menzel Nov. 27, 2016, 9:37 a.m. UTC | #9
On Thursday, 24.11.2016, 19:50 +0100, Donald Buczek wrote:
> On 24.11.2016 11:15, Michal Hocko wrote:
> 
> > On Mon 21-11-16 16:35:53, Donald Buczek wrote:
> > [...]

> >> Let me add some information from the reporting site:
> >>
> >> * We've tried the patch from Paul E. McKenney (the one posted Wed, 16 Nov
> >> 2016) and it doesn't shut up the rcu stall warnings.
> >>
> >> * Log file from a boot with the patch applied ( grep kernel
> >> /var/log/messages ) is here :

[…]

> >> * This system is a backup server and walks over thousands of files sometimes
> >> with multiple parallel rsync processes.
> >>
> >> * No rcu_* warnings on that machine with 4.7.2, but with 4.8.4, 4.8.6,
> >> 4.8.8, and now 4.9.0-rc5 + Paul's patch
> > I assume you haven't tried the Linus 4.8 kernel without any further
> > stable patches? Just to be sure we are not talking about some later
> > regression which found its way to the stable tree.

We tried, and the problem also shows up with the plain 4.8 kernel.

```
$ dmesg
[…]
[77554.135657] INFO: rcu_sched detected stalls on CPUs/tasks:
[77554.135662]  1-...: (222 ticks this GP) idle=7dd/140000000000000/0 softirq=30962751/30962968 fqs=12961
[77554.135663]  (detected by 10, t=60002 jiffies, g=7958132, c=7958131, q=90237)
[77554.135667] Task dump for CPU 1:
[77554.135669] kswapd1         R  running task        0    86      2 0x00000008
[77554.135672]  ffff88080ad87c58 ffff88080ad87c58 ffff88080ad87cf8 ffff88100c1e5200
[77554.135674]  0000000000000003 0000000000000000 ffff88080ad87e60 ffff88080ad87d90
[77554.135675]  ffffffff811345f5 ffff88080ad87da0 ffff88080ad87db0 ffff88100c1e5200
[77554.135677] Call Trace:
[77554.135684]  [<ffffffff811345f5>] ? shrink_node_memcg+0x605/0x870
[77554.135686]  [<ffffffff8113491f>] ? shrink_node+0xbf/0x1c0
[77554.135687]  [<ffffffff81135642>] ? kswapd+0x342/0x6b0
[77554.135689]  [<ffffffff81135300>] ? mem_cgroup_shrink_node+0x150/0x150
[77554.135692]  [<ffffffff81075be4>] ? kthread+0xc4/0xe0
[77554.135695]  [<ffffffff81b2b34f>] ? ret_from_fork+0x1f/0x40
[77554.135696]  [<ffffffff81075b20>] ? kthread_worker_fn+0x160/0x160
[77734.252362] INFO: rcu_sched detected stalls on CPUs/tasks:
[77734.252367]  1-...: (897 ticks this GP) idle=7dd/140000000000000/0 softirq=30962751/30963197 fqs=50466
[77734.252368]  (detected by 0, t=240122 jiffies, g=7958132, c=7958131, q=456322)
[77734.252372] Task dump for CPU 1:
[77734.252373] kswapd1         R  running task        0    86      2 0x00000008
[77734.252376]  ffff88080ad87c58 ffff88080ad87c58 ffff88080ad87cf8 ffff88100c1e5200
[77734.252378]  0000000000000003 0000000000000000 ffff88080ad87e60 ffff88080ad87d90
[77734.252380]  ffffffff811345f5 ffff88080ad87da0 ffff88080ad87db0 ffff88100c1e5200
[77734.252382] Call Trace:
[77734.252388]  [<ffffffff811345f5>] ? shrink_node_memcg+0x605/0x870
[77734.252390]  [<ffffffff8113491f>] ? shrink_node+0xbf/0x1c0
[77734.252391]  [<ffffffff81135642>] ? kswapd+0x342/0x6b0
[77734.252393]  [<ffffffff81135300>] ? mem_cgroup_shrink_node+0x150/0x150
[77734.252396]  [<ffffffff81075be4>] ? kthread+0xc4/0xe0
[77734.252399]  [<ffffffff81b2b34f>] ? ret_from_fork+0x1f/0x40
[77734.252401]  [<ffffffff81075b20>] ? kthread_worker_fn+0x160/0x160
[…]
```

> >> * When the backups are actually happening there might be relevant memory
> >> pressure from inode cache and the rsync processes. We saw the oom-killer
> >> kick in on another machine with same hardware and similar (a bit higher)
> >> workload. This other machine also shows a lot of rcu stall warnings since
> >> 4.8.4.
> >>
> >> * We see "rcu_sched detected stalls" also on some other machines since we
> >> switched to 4.8 but not as frequently as on the two backup servers. Usually
> >> there's "shrink_node" and "kswapd" on the top of the stack. Often
> >> "xfs_reclaim_inodes" variants on top of that.
> >
> > I would be interested to see some reclaim tracepoints enabled. Could you
> > try that out? At least mm_shrink_slab_{start,end} and
> > mm_vmscan_lru_shrink_inactive. This should tell us more about how the
> > reclaim behaved.
> 
> We'll try that tomorrow!

Unfortunately, looking at `trace` today, the corresponding messages have
already been thrown out of the buffer. We will keep trying.


Kind regards,

Paul Menzel

Michal Hocko Nov. 28, 2016, 11:04 a.m. UTC | #10
On Sun 27-11-16 10:19:06, Donald Buczek wrote:
> On 24.11.2016 11:15, Michal Hocko wrote:
> > On Mon 21-11-16 16:35:53, Donald Buczek wrote:
> > [...]
> > > Hello,
> > > 
> > > thanks a lot for looking into this!
> > > 
> > > Let me add some information from the reporting site:
> > > 
> > > * We've tried the patch from Paul E. McKenney (the one posted Wed, 16 Nov
> > > 2016)  and it doesn't shut up the rcu stall warnings.
> > > 
> > > * Log file from a boot with the patch applied ( grep kernel
> > > /var/log/messages ) is here :
> > > http://owww.molgen.mpg.de/~buczek/321322/2016-11-21_syslog.txt
> > > 
> > > * This system is a backup server and walks over thousands of files sometimes
> > > with multiple parallel rsync processes.
> > > 
> > > * No rcu_* warnings on that machine with 4.7.2, but with 4.8.4, 4.8.6,
> > > 4.8.8, and now 4.9.0-rc5 + Paul's patch
> > I assume you haven't tried the Linus 4.8 kernel without any further
> > stable patches? Just to be sure we are not talking about some later
> > regression which found its way to the stable tree.
> 
> We've tried v4.8 and got the first rcu stall warnings with this, too. First
> one after about 20 hours uptime.
> 
> 
> > > * When the backups are actually happening there might be relevant memory
> > > pressure from inode cache and the rsync processes. We saw the oom-killer
> > > kick in on another machine with same hardware and similar (a bit higher)
> > > workload. This other machine also shows a lot of rcu stall warnings since
> > > 4.8.4.
> > > 
> > > * We see "rcu_sched detected stalls" also on some other machines since we
> > > switched to 4.8 but not as frequently as on the two backup servers. Usually
> > > there's "shrink_node" and "kswapd" on the top of the stack. Often
> > > "xfs_reclaim_inodes" variants on top of that.
> > I would be interested to see some reclaim tracepoints enabled. Could you
> > try that out? At least mm_shrink_slab_{start,end} and
> > mm_vmscan_lru_shrink_inactive. This should tell us more about how the
> > reclaim behaved.
> 
> http://owww.molgen.mpg.de/~buczek/321322/2016-11-26.dmesg.txt  (80K)
> http://owww.molgen.mpg.de/~buczek/321322/2016-11-26.trace.txt (50M)
> 
> Traces wrapped, but the last event is covered. all vmscan events were
> enabled

OK, so one of the stalls is reported at
[118077.988410] INFO: rcu_sched detected stalls on CPUs/tasks:
[118077.988416] 	1-...: (181 ticks this GP) idle=6d5/140000000000000/0 softirq=46417663/46417663 fqs=10691 
[118077.988417] 	(detected by 4, t=60002 jiffies, g=11845915, c=11845914, q=46475)
[118077.988421] Task dump for CPU 1:
[118077.988421] kswapd1         R  running task        0    86      2 0x00000008
[118077.988424]  ffff88080ad87c58 ffff88080ad87c58 ffff88080ad87cf8 ffff88100c1e5200
[118077.988426]  0000000000000003 0000000000000000 ffff88080ad87e60 ffff88080ad87d90
[118077.988428]  ffffffff811345f5 ffff88080ad87da0 ffff88100c1e5200 ffff88080ad87dd0
[118077.988430] Call Trace:
[118077.988436]  [<ffffffff811345f5>] ? shrink_node_memcg+0x605/0x870
[118077.988438]  [<ffffffff8113491f>] ? shrink_node+0xbf/0x1c0
[118077.988440]  [<ffffffff81135642>] ? kswapd+0x342/0x6b0

the interesting part of the traces would be around the same time:
        clusterd-989   [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
         kswapd1-86    [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
[...]
         kswapd1-86    [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1

Note the need-resched flag. The IRQs-off part is expected, because we are
holding the LRU lock, which is IRQ safe. That is not a problem, because
the lock is only held for SWAP_CLUSTER_MAX pages at maximum. It is also
interesting to see that we have scanned only 1303 pages during that one
minute (nr_scanned grows from 4239830 to 4241133 across the roughly 60
seconds of trace shown above). That would be dead slow. None of them were
good enough for the reclaim, but that doesn't sound like a problem. The
trace simply suggests that the reclaim was preempted by something else.
Otherwise I cannot imagine such slow scanning.

Is it possible that something else is hogging the CPU, and RCU just
happens to blame kswapd, which is running in standard user-process
context?

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 744f926af442..0d3b5f5a04ef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2529,8 +2529,11 @@  static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			unsigned long scanned;
 
 			if (mem_cgroup_low(root, memcg)) {
-				if (!sc->may_thrash)
+				if (!sc->may_thrash) {
+					/* Prevent RCU CPU stalls. */
+					cond_resched_rcu_qs();
 					continue;
+				}
 				mem_cgroup_events(memcg, MEMCG_LOW, 1);
 			}
 
@@ -2565,6 +2568,7 @@  static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 				mem_cgroup_iter_break(root, memcg);
 				break;
 			}
+			cond_resched_rcu_qs(); /* Prevent RCU CPU stalls. */
 		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
 		/*