From patchwork Fri Nov 10 11:26:28 2017
X-Patchwork-Submitter: Jon Hunter
X-Patchwork-Id: 10052877
Subject: Re: next/master boot: 273 boots: 63 failed, 209 passed with 1 untried/unknown (next-20171106)
From: Jon Hunter
To: Ben Skeggs
Cc: Arnd Bergmann, Guillaume Tucker, Mark Brown, "linux-tegra@vger.kernel.org", Robin Murphy, "linux-arm-kernel@lists.infradead.org"
Date: Fri, 10 Nov 2017 11:26:28 +0000
Message-ID: <5321abfb-845b-354e-f3d9-7773cfe175f4@nvidia.com>
In-Reply-To: <1040af29-4d15-4e8a-29ab-40952523535c@nvidia.com>

On 10/11/17 09:18, Jon Hunter wrote:

...

> Thanks Ben. However, looking at next-20171109 this one is already in.
> So maybe the bisect is still not getting me to the current issue. When
> booting next-20171109 the last thing I see is ...
>
> [    2.228178] nouveau 57000000.gpu: NVIDIA GK20A (0ea000a1)
> [    2.233634] nouveau 57000000.gpu: imem: using IOMMU
> [    2.238572] nouveau 57000000.gpu: Direct firmware load for nvidia/gk20a/fecs_inst.bin failed with error -2
> [    2.248295] nouveau 57000000.gpu: Direct firmware load for nouveau/nvea_fuc409c failed with error -2
> [    2.257479] nouveau 57000000.gpu: Direct firmware load for nouveau/fuc409c failed with error -2
> [    2.266189] nouveau 57000000.gpu: gr: failed to load fuc409c
>
> So no crash. I did see the crash after the bisect, but not in top of
> tree.
> It appears to hang after the nouveau probe fails. Any thoughts
> on how to debug further?

So this is probably wrong, but here is a clue about what is happening.
It appears that the error code is not being propagated from
gk20a_gr_new(). gk20a_gr_new() is returning -ENODEV due to the firmware
loading failure ...

	if (gf100_gr_ctor_fw(gr, "fecs_inst", &gr->fuc409c) ||
	    gf100_gr_ctor_fw(gr, "fecs_data", &gr->fuc409d) ||
	    gf100_gr_ctor_fw(gr, "gpccs_inst", &gr->fuc41ac) ||
	    gf100_gr_ctor_fw(gr, "gpccs_data", &gr->fuc41ad))
		return -ENODEV;

... but this is ignored by nvkm_device_ctor() (probably for good
reason). If I make the following change the hang no longer occurs
(although I realise this is probably wrong as it has been there for
years!) ...

So is gk20a_gr_new() returning the wrong error code when the firmware
load fails? I have not gone back to see what has changed in this regard,
but I can, probably next week.

Cheers
Jon

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c b/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
index e14643615698..a611615d3ce7 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
@@ -2869,7 +2869,7 @@ struct nvkm_engine *
 			subdev = nvkm_device_subdev(device, (s));              \
 			nvkm_subdev_del(&subdev);                              \
 			device->m = NULL;                                      \
-			if (ret != -ENODEV) {                                  \
+			if (ret == -ENODEV) {                                  \
 				nvdev_error(device, "%s ctor failed, %d\n",    \
 					    nvkm_subdev_name[s], ret);         \
 				goto done;                                     \
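
For illustration only: as far as the hunk above shows, the unpatched check reports and aborts on any constructor error except -ENODEV, which appears to be taken as "this unit is not present" so that construction carries on without it. Below is a minimal standalone sketch of that pattern. It assumes nothing about the real nouveau code beyond the hunk; construct_all(), struct ctor_entry and the ctor_* helpers are hypothetical names, not nouveau functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical constructor type; stands in for hooks such as gk20a_gr_new(). */
typedef int (*ctor_fn)(void);

struct ctor_entry {
	const char *name;
	ctor_fn ctor;
};

/* Example constructors (all hypothetical). */
int ctor_ok(void)               { return 0; }
int ctor_missing_firmware(void) { return -ENODEV; } /* models the firmware failure */
int ctor_broken(void)           { return -EINVAL; }

/*
 * Call each constructor in order.
 * -ENODEV is treated as "not present": the entry is dropped silently and
 * counted in *skipped. Any other non-zero code is reported and returned
 * immediately, mirroring the unpatched "ret != -ENODEV" check.
 */
int construct_all(const struct ctor_entry *ctors, size_t n, size_t *skipped)
{
	size_t i;

	assert(ctors != NULL && skipped != NULL);
	*skipped = 0;

	for (i = 0; i < n; i++) {
		int ret = ctors[i].ctor();

		if (ret == 0)
			continue;
		if (ret != -ENODEV) {
			fprintf(stderr, "%s ctor failed, %d\n", ctors[i].name, ret);
			return ret;
		}
		(*skipped)++; /* treated as absent, keep going */
	}
	return 0;
}
```

With the entries { "gr", ctor_missing_firmware } and { "fifo", ctor_ok }, construct_all() returns 0 with one entry skipped, so the caller never learns that the firmware load failed. With ctor_broken in place of ctor_missing_firmware it returns -EINVAL at once.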