From patchwork Sat Apr 14 03:32:24 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Tai Nguyen
X-Patchwork-Id: 10340915
References: <20180412172456.GA27033@arm.com>
From: Tai Tri Nguyen
Date: Fri, 13 Apr 2018 20:32:24 -0700
Subject: Re: [bug report] locking/qrwlock, arm64: Move rwlock implementation over to qrwlocks causes CPU crashes/stalls when killing java processes
To: Will Deacon
Cc: Loc Ho, linux-arm-kernel

Hi Will,

I tried removing the wfe from cmpwait and could still reproduce the issue.
I dumped the PC of the CPUs, and they seem to be looping in
arch_spin_lock(), trying to get the lock.

ffff00000810a968 :
ffff00000810a968: a9bf7bfd stp x29, x30, [sp,#-16]!
ffff00000810a96c: aa0003e3 mov x3, x0
ffff00000810a970: 91001005 add x5, x0, #0x4
ffff00000810a974: 910003fd mov x29, sp
ffff00000810a978: f98000b1 prfm pstl1strm, [x5]
ffff00000810a97c: 885ffca1 ldaxr w1, [x5]
ffff00000810a980: 11404022 add w2, w1, #0x10, lsl #12
ffff00000810a984: 88047ca2 stxr w4, w2, [x5]
ffff00000810a988: 35ffffa4 cbnz w4, ffff00000810a97c
ffff00000810a98c: 4ac14022 eor w2, w1, w1, ror #16
ffff00000810a990: 340000c2 cbz w2, ffff00000810a9a8
ffff00000810a994: d50320bf sevl
ffff00000810a998: d503205f wfe
ffff00000810a99c: 485ffca4 ldaxrh w4, [x5]
ffff00000810a9a0: 4a414082 eor w2, w4, w1, lsr #16
ffff00000810a9a4: 35ffffa2 cbnz w2, ffff00000810a998
ffff00000810a9a8: b9400001 ldr w1, [x0]
ffff00000810a9ac: 350000e1 cbnz w1, ffff00000810a9c8
ffff00000810a9b0: d2800001 mov x1, #0x0 // #0
ffff00000810a9b4: d2801fe2 mov x2, #0xff // #255

Tai

On Thu, Apr 12, 2018 at 10:49 AM, Tai Tri Nguyen wrote:
> Hi Will,
>
> Somehow so far I only see the issue when killing Java processes.
> Here is the log. It's a bit long.
> I'll try removing wfe as you suggested.
>
> [ 2499.459849] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459851] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459855] INFO: rcu_sched self-detected stall on CPU
>
> [ 2499.459856] INFO: rcu_sched self-detected stall on CPU
>
> [ 2499.459858] INFO: rcu_sched self-detected stall on CPU
>
> [ 2499.459861] 1-...!: (11 GPs behind) idle=dee/140000000000001/0
> softirq=40890/40890 fqs=0
>
> [ 2499.459862] INFO: rcu_sched self-detected stall on CPU
>
> [ 2499.459864] INFO: rcu_sched self-detected stall on CPU
>
> [ 2499.459866] INFO: rcu_sched self-detected stall on CPU
>
> [ 2499.459866]
> [ 2499.459870] 14-...!: (1 GPs behind) idle=a9e/140000000000001/0
> softirq=42949/42950 fqs=0
> [ 2499.459871] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459873] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459876] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459878] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459881] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459882] (t=60744 jiffies g=20044 c=20043 q=4496)
> [ 2499.459884] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459885] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459886]
> [ 2499.459890] 0-...!: (11 GPs behind) idle=70a/140000000000001/0
> softirq=44883/44883 fqs=0
> [ 2499.459891] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459893] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459896] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459898] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459899] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459901] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459903] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459906] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459907] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459909] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459910] INFO: rcu_sched self-detected stall on CPU
> [ 2499.459914] rcu_sched kthread starved for 60744 jiffies! g20044
> c20043 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=16
> [ 2499.459916] (t=60744 jiffies g=20044 c=20043 q=4506)
> [ 2499.459916]
> [ 2499.459920] 6-...!: (2 GPs behind) idle=e0a/140000000000001/0
> softirq=42006/42006 fqs=0
> [ 2499.459924] 31-...!: (8 GPs behind) idle=6fe/140000000000001/0
> softirq=39495/39495 fqs=0
> [ 2499.459928] rcu_sched I
> [ 2499.459930] rcu_sched kthread starved for 60744 jiffies! g20044
> c20043 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=16
> [ 2499.459932] (t=60744 jiffies g=20044 c=20043 q=4518)
> [ 2499.459932]
> [ 2499.459935] 12-...!: (48 GPs behind) idle=37a/140000000000001/0
[...]
> [ 2499.460317] rcu_sched kthread starved for 60744 jiffies! g20044
> c20043 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x200 ->cpu=16
> [ 2499.460318] (t=60744 jiffies g=20044 c=20043 q=4703)
> [ 2499.460320] ret_from_fork+0x10/0x18
> [ 2499.460322] kthread+0xfc/0x128
> [ 2499.460323] rcu_gp_kthread+0x3a8/0x7c0
> [ 2499.460324] schedule_timeout+0x17c/0x340
> [ 2499.460326] schedule+0x2c/0x84
> [ 2499.460328] __schedule+0x308/0x780
> [ 2499.460332] __switch_to+0x8c/0xa8
> [ 2499.460333] Call trace:
> [ 2499.460334] 0 8 2 0x00000000
> [ 2499.460335] rcu_sched R
> [ 2499.460338] rcu_sched kthread starved for 60744 jiffies! g20044
> c20043 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x200 ->cpu=16
> [ 2499.460338] running task
> [ 2499.460341] ret_from_fork+0x10/0x18
> [ 2499.460342] kthread+0xfc/0x128
> [ 2499.460345] rcu_gp_kthread+0x3a8/0x7c0
> [ 2499.460347] schedule_timeout+0x17c/0x340
> [root@hadoop-slave-1 scripts]# [ 2499.460349] schedule+0x2c/0x84
> [ 2499.460351] __schedule+0x308/0x780
> [ 2499.460355] __switch_to+0x8c/0xa8
>
> [ 2499.460356] Call trace:
> [ 2499.460358] 0 8 2 0x00000000
> [ 2499.460359] rcu_sched R
> [ 2499.460362] rcu_sched kthread starved for 60744 jiffies! g20044
> c20043 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x200 ->cpu=16
> [ 2499.460363] ret_from_fork+0x10/0x18
> [root@hadoop-slave-1 scripts]# [ 2499.460365] kthread+0xfc/0x128
> [ 2499.460366] rcu_gp_kthread+0x3a8/0x7c0
> [ 2499.460368] schedule_timeout+0x17c/0x340
>
> [ 2499.460370] schedule+0x2c/0x84
> [ 2499.460373] __switch_to+0x8c/0xa8
> [ 2499.460376] __schedule+0x308/0x780
> [ 2499.460377] Call trace:
> [ 2499.460378] 0 8 2 0x00000000
[...]
> On Thu, Apr 12, 2018 at 10:24 AM, Will Deacon wrote:
>> Hi Tai,
>>
>> On Thu, Apr 12, 2018 at 10:10:40AM -0700, Tai Tri Nguyen wrote:
>>> Recently I have observed the CPU crashes/stalls when rebooting after I
>>> ran cassandra benchmark.
>>> The issue happens randomly.
>>
>> Please could you share some logs and more details about the crashes and
>> stalls? At the moment there's not much we can do with this report :/
>>
>> Ideally, we'd have steps to reproduce the issue, but it seems that you
>> don't have a reliable method for that.
>>
>> One other thing you could try is removing the WFE from our cmpwait
>> implementation in case you have a CPU erratum in that area.
>>
>> Will
>
>
>
> --
> Tai

--- a/arch/arm64/include/asm/cmpxchg.h
+++ b/arch/arm64/include/asm/cmpxchg.h
@@ -232,7 +232,6 @@
  " ldxr" #sz "\t%" #w "[tmp], %[v]\n" \
  " eor %" #w "[tmp], %" #w "[tmp], %" #w "[val]\n" \
  " cbnz %" #w "[tmp], 1f\n" \
- " wfe\n" \
  "1:" \
  : [tmp] "=&r" (tmp), [v] "+Q" (*(unsigned long *)ptr) \
  : [val] "r" (val)); \