From patchwork Tue Mar 10 19:26:27 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Karol Herbst X-Patchwork-Id: 11430179 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id AFD7C18E8 for ; Tue, 10 Mar 2020 19:27:02 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 8D82220873 for ; Tue, 10 Mar 2020 19:27:02 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="RnwUiv8K" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 8D82220873 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=dri-devel-bounces@lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 568056E8D1; Tue, 10 Mar 2020 19:26:59 +0000 (UTC) X-Original-To: dri-devel@lists.freedesktop.org Delivered-To: dri-devel@lists.freedesktop.org Received: from us-smtp-1.mimecast.com (us-smtp-delivery-1.mimecast.com [205.139.110.120]) by gabe.freedesktop.org (Postfix) with ESMTPS id 922AC6E8D2 for ; Tue, 10 Mar 2020 19:26:58 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1583868417; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding; bh=Wq5OtijW4qkxUGaUQxhzEeaUg8qC16AJk23Bpi60tww=; b=RnwUiv8Kt1BWrGVaG7T/rXo0RHbzvLndYq/I+gTi7z6ec2O7b/rmJ+AZRiJYo0SJzYiNXa uiXel6roJRa5aFHDAfHN0Es8ldC/h8xQ61A482c/AKRE1bgYBDIbiPz4uoFGXiinsYRorw SHE9XUr6eW6YCTCjfY501vI2qibYJi4= Received: from mail-wr1-f71.google.com (mail-wr1-f71.google.com [209.85.221.71]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-156-lZLErUOkNQiuzoSGW6AIvA-1; Tue, 10 Mar 2020 15:26:50 -0400 X-MC-Unique: lZLErUOkNQiuzoSGW6AIvA-1 Received: by mail-wr1-f71.google.com with SMTP id p11so7200844wrn.10 for ; Tue, 10 Mar 2020 12:26:50 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=vBBQf8YSJ+Mjo+kqeDTD8Z4+zLQIuCyl3eitu/ifyRg=; b=gg9D/NTPJTNtvjB8XMrbq4tQdP6FrSRhjumzsOpehWIqm6bR+ehDRZkHfIZ49dPoTo Wyy75pYPa8DClrAY0VenQfb3cuBgZBUa0Penqs+XW2c6HXEKhQAJRu7i/qOudGvWy0WM BiAoW8JKdMYG/c1g5X1vi/rBDdVC72aROIv6Fm6qTxGrOvUsZgqTifD6io+S1IkudNCg blnJTA0ACQlz/94jJ99WLigYISvSwmyG316pzpUMB8Tu1WHwCH4sYNUtuxY61rciYYF7 6gBnIx4DI5Q78o2VLxSANlQKvVeU0VvnpQRR872NtTFj40YC3wXKiPfamL9Uq89fcp7r HAtA== X-Gm-Message-State: ANhLgQ2dUYfdqSJSCv7A0oHRzi6wX5RXWLJ6uuro3F0qNyPwl3MmRHHV +G+rl6/ayVJ1n5gvNsCmGQlxV9DTUEZ7uMOFLUEoHnu3WvAWVRLAZN+bJ/Sa1O7ZB7A7183dX0N USEoWx4S25C3YIl7kw9XWxEFDg75m X-Received: by 2002:a7b:c153:: with SMTP id z19mr3416626wmi.37.1583868409502; Tue, 10 Mar 2020 12:26:49 -0700 (PDT) X-Google-Smtp-Source: ADFU+vulmI+P5MpROhNKaiKOfzCCxtCMi66TM8UrbbQOxyf42qxjAGtaVwzF6sV60157AZSfFYTJtQ== X-Received: by 2002:a7b:c153:: with SMTP id z19mr3416586wmi.37.1583868409147; Tue, 10 Mar 2020 12:26:49 -0700 (PDT) Received: from kherbst.pingu.com ([2a02:8308:b0be:6900:482c:9537:40:83ba]) by smtp.gmail.com with ESMTPSA id c8sm58633569wru.7.2020.03.10.12.26.47 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 10 Mar 2020 12:26:48 -0700 (PDT) From: Karol Herbst To: linux-kernel@vger.kernel.org Subject: [PATCH v7] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges Date: Tue, 10 Mar 2020 20:26:27 +0100 Message-Id: <20200310192627.437947-1-kherbst@redhat.com> X-Mailer: git-send-email 2.24.1 MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Karol Herbst , linux-pm@vger.kernel.org, linux-pci@vger.kernel.org, Mika Westerberg , "Rafael J . Wysocki" , dri-devel@lists.freedesktop.org, nouveau@lists.freedesktop.org, Bjorn Helgaas , Mika Westerberg Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" Fixes the infamous 'runtime PM' bug many users are facing on Laptops with Nvidia Pascal GPUs by skipping said PCI power state changes on the GPU. Depending on the used kernel there might be messages like those in demsg: "nouveau 0000:01:00.0: Refused to change power state, currently in D3" "nouveau 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible)" followed by backtraces of kernel crashes or timeouts within nouveau. It's still unkown why this issue exists, but this is a reliable workaround and solves a very annoying issue for user having to choose between a crashing kernel or higher power consumption of their Laptops. Signed-off-by: Karol Herbst Cc: Bjorn Helgaas Cc: Lyude Paul Cc: Rafael J. Wysocki Cc: Mika Westerberg Cc: linux-pci@vger.kernel.org Cc: linux-pm@vger.kernel.org Cc: dri-devel@lists.freedesktop.org Cc: nouveau@lists.freedesktop.org Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205623 Reviewed-by: Mika Westerberg --- v2: convert to pci_dev quirk put a proper technical explanation of the issue as a in-code comment v3: disable it only for certain combinations of intel and nvidia hardware v4: simplify quirk by setting flag on the GPU itself v5: restructure quirk to make it easier to add new IDs fix whitespace issues fix potential NULL pointer access update the quirk documentation v6: move quirk into nouveau v7: fix typos and commit message drivers/gpu/drm/nouveau/nouveau_drm.c | 57 +++++++++++++++++++++++++++ drivers/pci/pci.c | 8 ++++ include/linux/pci.h | 1 + 3 files changed, 66 insertions(+) diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c index b65ae817eabf..2c86f0248305 100644 --- a/drivers/gpu/drm/nouveau/nouveau_drm.c +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c @@ -618,6 +618,60 @@ nouveau_drm_device_fini(struct drm_device *dev) kfree(drm); } +/* + * On some Intel PCIe bridge controllers doing a + * D0 -> D3hot -> D3cold -> D0 sequence causes Nvidia GPUs to not reappear. + * Skipping the intermediate D3hot step seems to make it work again. This is + * probably caused by not meeting the expectation the involved AML code has + * when the GPU is put into D3hot state before invoking it. + * + * This leads to various manifestations of this issue: + * - AML code execution to power on the GPU hits an infinite loop (as the + * code waits on device memory to change). + * - kernel crashes, as all PCI reads return -1, which most code isn't able + * to handle well enough. + * + * In all cases dmesg will contain at least one line like this: + * 'nouveau 0000:01:00.0: Refused to change power state, currently in D3' + * followed by a lot of nouveau timeouts. + * + * In the \_SB.PCI0.PEG0.PG00._OFF code deeper down writes bit 0x80 to the not + * documented PCI config space register 0x248 of the Intel PCIe bridge + * controller (0x1901) in order to change the state of the PCIe link between + * the PCIe port and the GPU. There are alternative code paths using other + * registers, which seem to work fine (executed pre Windows 8): + * - 0xbc bit 0x20 (publicly available documentation claims 'reserved') + * - 0xb0 bit 0x10 (link disable) + * Changing the conditions inside the firmware by poking into the relevant + * addresses does resolve the issue, but it seemed to be ACPI private memory + * and not any device accessible memory at all, so there is no portable way of + * changing the conditions. + * On a XPS 9560 that means bits [0,3] on \CPEX need to be cleared. + * + * The only systems where this behavior can be seen are hybrid graphics laptops + * with a secondary Nvidia Maxwell, Pascal or Turing GPU. It's unclear whether + * this issue only occurs in combination with listed Intel PCIe bridge + * controllers and the mentioned GPUs or other devices as well. + * + * documentation on the PCIe bridge controller can be found in the + * "7th Generation IntelĀ® Processor Families for H Platforms Datasheet Volume 2" + * Section "12 PCI Express* Controller (x16) Registers" + */ + +static void quirk_broken_nv_runpm(struct pci_dev *dev) +{ + struct pci_dev *bridge = pci_upstream_bridge(dev); + + if (!bridge || bridge->vendor != PCI_VENDOR_ID_INTEL) + return; + + switch (bridge->device) { + case 0x1901: + dev->parent_d3cold = 1; + break; + } +} + static int nouveau_drm_probe(struct pci_dev *pdev, const struct pci_device_id *pent) { @@ -699,6 +753,7 @@ static int nouveau_drm_probe(struct pci_dev *pdev, if (ret) goto fail_drm_dev_init; + quirk_broken_nv_runpm(pdev); return 0; fail_drm_dev_init: @@ -735,6 +790,8 @@ nouveau_drm_remove(struct pci_dev *pdev) { struct drm_device *dev = pci_get_drvdata(pdev); + /* revert our workaround */ + pdev->parent_d3cold = false; nouveau_drm_device_remove(dev); pci_disable_device(pdev); } diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index d828ca835a98..9c4044fc2553 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -861,6 +861,14 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state) || (state == PCI_D2 && !dev->d2_support)) return -EIO; + /* + * Power management can be disabled for certain devices as they don't + * come back up later on runtime_resume. We rely on platform means to + * cut power consumption instead (e.g. ACPI). + */ + if (state != PCI_D0 && dev->parent_d3cold) + return 0; + pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr); if (pmcsr == (u16) ~0) { pci_err(dev, "can't change power state from %s to %s (config space inaccessible)\n", diff --git a/include/linux/pci.h b/include/linux/pci.h index 3840a541a9de..3c01f043519a 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -340,6 +340,7 @@ struct pci_dev { unsigned int no_d3cold:1; /* D3cold is forbidden */ unsigned int bridge_d3:1; /* Allow D3 for bridge */ unsigned int d3cold_allowed:1; /* D3cold is allowed by user */ + unsigned int parent_d3cold:1; /* Power manage the parent instead */ unsigned int mmio_always_on:1; /* Disallow turning off io/mem decoding during BAR sizing */ unsigned int wakeup_prepared:1;