From patchwork Wed Jun 29 16:44:15 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Paul E. McKenney"
X-Patchwork-Id: 9205763
Date: Wed, 29 Jun 2016 09:44:15 -0700
From: "Paul E. McKenney"
To: Geert Uytterhoeven
Subject: Re: Boot failure on emev2/kzm9d (was: Re: [PATCH v2 11/11]
 mm/slab: lockless decision to grow cache)
References: <20160621064302.GA20635@js1304-P5Q-DELUXE>
 <20160621125406.GF3923@linux.vnet.ibm.com>
 <20160622005208.GB25106@js1304-P5Q-DELUXE>
 <20160622190859.GA1473@linux.vnet.ibm.com>
 <20160623004935.GA20752@linux.vnet.ibm.com>
 <20160623023756.GA30438@js1304-P5Q-DELUXE>
 <20160623024742.GD1473@linux.vnet.ibm.com>
 <20160623025329.GA13095@linux.vnet.ibm.com>
Message-Id: <20160629164415.GG4650@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
Cc: Linux-Renesas, Andrew Morton, David Rientjes,
 "linux-kernel@vger.kernel.org", Pekka Enberg, Linux MM,
 Jesper Dangaard Brouer, Joonsoo Kim, Christoph Lameter,
 "linux-arm-kernel@lists.infradead.org"

On Wed, Jun 29, 2016 at
04:54:44PM +0200, Geert Uytterhoeven wrote:
> Hi Paul,
>
> On Thu, Jun 23, 2016 at 4:53 AM, Paul E. McKenney
> wrote:
> > On Wed, Jun 22, 2016 at 07:47:42PM -0700, Paul E. McKenney wrote:

[ . . . ]

> > @@ -4720,11 +4720,18 @@ static void __init rcu_dump_rcu_node_tree(struct rcu_state *rsp)
> > 			pr_info(" ");
> > 			level = rnp->level;
> > 		}
> > -		pr_cont("%d:%d ^%d ", rnp->grplo, rnp->grphi, rnp->grpnum);
> > +		pr_cont("%d:%d/%#lx/%#lx ^%d ", rnp->grplo, rnp->grphi,
> > +			rnp->qsmask,
> > +			rnp->qsmaskinit | rnp->qsmaskinitnext, rnp->grpnum);
> > 	}
> > 	pr_cont("\n");
> > }
>
> For me it always crashes during the 37th call of synchronize_sched() in
> setup_kmem_cache_node(), which is the first call after secondary CPU bring up.
> With your and my debug code, I get:
>
> CPU: Testing write buffer coherency: ok
> CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
> Setting up static identity map for 0x40100000 - 0x40100058
> cnt = 36, sync
> CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
> Brought up 2 CPUs
> SMP: Total of 2 processors activated (2132.00 BogoMIPS).
> CPU: All CPU(s) started in SVC mode.
> rcu_node tree layout dump
> 0:1/0x0/0x3 ^0

Thank you for running this!

OK, so RCU knows about both CPUs (the "0x3"), and the previous grace
period has seen quiescent states from both of them (the "0x0").  That
would indicate that your synchronize_sched() showed up when RCU was
idle, so it had to start a new grace period.  It also rules out failure
modes where RCU thinks that there are more CPUs than really exist.
(Don't laugh, such things have really happened.)

> devtmpfs: initialized
> VFP support v0.3: implementor 41 architecture 3 part 30 variant 9 rev 1
> clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff,
> max_idle_ns: 19112604462750000 ns
>
> I hope it helps. Thanks!

I am going to guess that this was the first grace period since the
second CPU came online.  When there is only one CPU online,
synchronize_sched() is a no-op.
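
For reference, here is a minimal user-space sketch of how one entry of
the dump line above can be decoded.  It assumes the
"%d:%d/%#lx/%#lx ^%d" format from the debug patch quoted above.  The
names rnp_dump, rnp_dump_parse(), rnp_dump_idle() and rnp_dump_ncpus()
are hypothetical and exist nowhere in the kernel.  Counting set bits as
CPUs only makes sense for a leaf rcu_node, which on this two-CPU system
is also the root.

```c
#include <assert.h>
#include <stdio.h>

/* One entry of the "rcu_node tree layout dump" line, e.g. "0:1/0x0/0x3 ^0". */
struct rnp_dump {
	int grplo;            /* Lowest-numbered CPU covered by this node. */
	int grphi;            /* Highest-numbered CPU covered by this node. */
	unsigned long qsmask; /* Bits still owing a quiescent state. */
	unsigned long qsinit; /* qsmaskinit | qsmaskinitnext as printed. */
	int grpnum;           /* Position of this node within its parent. */
};

/* Parse one entry.  Returns 0 on success, -1 if the text does not match. */
static int rnp_dump_parse(const char *s, struct rnp_dump *d)
{
	assert(s != NULL && d != NULL);
	if (sscanf(s, "%d:%d/%lx/%lx ^%d", &d->grplo, &d->grphi,
		   &d->qsmask, &d->qsinit, &d->grpnum) != 5)
		return -1;
	return 0;
}

/* Nonzero when no quiescent states are outstanding on this node. */
static int rnp_dump_idle(const struct rnp_dump *d)
{
	return d->qsmask == 0;
}

/* Number of bits set in the init mask (CPUs, for a leaf node). */
static int rnp_dump_ncpus(const struct rnp_dump *d)
{
	unsigned long m = d->qsinit;
	int n = 0;

	while (m) {
		n += (int)(m & 1UL);
		m >>= 1;
	}
	return n;
}
```

Feeding it "0:1/0x0/0x3 ^0" gives an empty qsmask (nothing outstanding)
and two bits in the init mask, which matches the reading above.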

OK, this showed some things that aren't a problem.  What might the
problem be?

o	The grace-period kthread has not yet started.  It -should- start
	at early_initcall() time, but who knows?  Adding code to print
	out that kthread's task_struct address.

o	The grace-period kthread might not be responding to wakeups.
	Checking this requires that a grace period be in progress,
	so please put a call_rcu_sched() just before the call to
	rcu_dump_rcu_node_tree().  (Sample code below.)  Adding code
	to my patch to print out more GP-kthread state as well.

o	One of the CPUs might not be responding to RCU.  That -should-
	result in an RCU CPU stall warning, so I will ignore this
	possibility for the moment.

	That said, do you have some way to determine whether scheduling
	clock interrupts are really happening?  Without these interrupts,
	no RCU CPU stall warnings.

OK, that should be enough for the next phase; please see the end for the
patch.  This patch applies on top of my previous one.

Could you please set this up as follows?

	struct rcu_head rh;

	rcu_dump_rcu_node_tree(&rcu_sched_state);  /* Initial state. */
	call_rcu_sched(&rh, do_nothing_cb);
	schedule_timeout_uninterruptible(5 * HZ);  /* Or whatever delay. */
	rcu_dump_rcu_node_tree(&rcu_sched_state);  /* GP state. */
	synchronize_sched();  /* Probably hangs. */
	rcu_barrier_sched();  /* Drop RCU's references to rh before return. */

							Thanx, Paul

------------------------------------------------------------------------

commit 82829ec76c2c0de18874a60ebd7ff8ee80f244d1
Author: Paul E. McKenney
Date:   Wed Jun 29 09:42:13 2016 -0700

    rcu: More diagnostics

    Signed-off-by: Paul E. McKenney

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2eda7bece401..ff55c569473c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4712,6 +4712,11 @@ static void rcu_dump_rcu_node_tree(struct rcu_state *rsp)
 	int level = 0;
 	struct rcu_node *rnp;
 
+	pr_info("RCU: %s GP kthread: %p state: %d flags: %#x g:%ld c:%ld\n",
+		rsp->name, rsp->gp_kthread, rsp->gp_state, rsp->gp_flags,
+		(long)rsp->gpnum, (long)rsp->completed);
+	pr_info(" jiffies: %#lx GP start: %#lx Last GP activity: %#lx\n",
+		jiffies, rsp->gp_start, rsp->gp_activity);
 	pr_info("rcu_node tree layout dump\n");
 	pr_info(" ");
 	rcu_for_each_node_breadth_first(rsp, rnp) {
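
When reading the new "jiffies" line, the timestamps need to be
subtracted with wrap-safe arithmetic.  Below is a minimal user-space
sketch, assuming 32-bit jiffies (as on this ARM board) and the format
string from the patch above.  The names gp_times, gp_times_parse() and
jiffies_delta() are hypothetical and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Timestamps from the " jiffies: ... GP start: ... Last GP activity: ..." line. */
struct gp_times {
	uint32_t now;         /* jiffies when the dump was printed. */
	uint32_t gp_start;    /* rsp->gp_start as printed. */
	uint32_t gp_activity; /* rsp->gp_activity as printed. */
};

/* Parse the line.  Returns 0 on success, -1 if the text does not match. */
static int gp_times_parse(const char *s, struct gp_times *t)
{
	unsigned long now, start, act;

	assert(s != NULL && t != NULL);
	if (sscanf(s, " jiffies: %lx GP start: %lx Last GP activity: %lx",
		   &now, &start, &act) != 3)
		return -1;
	/* Assumption: the target has 32-bit jiffies, so truncate. */
	t->now = (uint32_t)now;
	t->gp_start = (uint32_t)start;
	t->gp_activity = (uint32_t)act;
	return 0;
}

/*
 * Signed distance in jiffies from "earlier" to "later".  The unsigned
 * subtraction wraps modulo 2^32, so the result stays correct across a
 * jiffies wrap as long as the true distance is less than 2^31.
 */
static int32_t jiffies_delta(uint32_t later, uint32_t earlier)
{
	return (int32_t)(later - earlier);
}
```

A large gap between jiffies and the last GP activity would suggest that
the grace-period kthread is not making progress, while a small gap would
point elsewhere.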