From patchwork Thu Oct 23 18:22:12 2014
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Prarit Bhargava
X-Patchwork-Id: 5142581
X-Patchwork-Delegate: bhelgaas@google.com
Return-Path:
X-Original-To: patchwork-linux-pci@patchwork.kernel.org
Delivered-To: patchwork-parsemail@patchwork1.web.kernel.org
Received: from mail.kernel.org (mail.kernel.org [198.145.19.201])
	by patchwork1.web.kernel.org (Postfix) with ESMTP id DCE589F30B
	for ; Thu, 23 Oct 2014 18:22:22 +0000 (UTC)
Received: from mail.kernel.org (localhost [127.0.0.1])
	by mail.kernel.org (Postfix) with ESMTP id 009312021B
	for ; Thu, 23 Oct 2014 18:22:22 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id F04102013D
	for ; Thu, 23 Oct 2014 18:22:20 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755545AbaJWSWT (ORCPT ); Thu, 23 Oct 2014 14:22:19 -0400
Received: from mx1.redhat.com ([209.132.183.28]:49172 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753730AbaJWSWT (ORCPT ); Thu, 23 Oct 2014 14:22:19 -0400
Received: from int-mx14.intmail.prod.int.phx2.redhat.com
	(int-mx14.intmail.prod.int.phx2.redhat.com [10.5.11.27])
	by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id s9NIMHP0004243
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL);
	Thu, 23 Oct 2014 14:22:18 -0400
Received: from praritdesktop.bos.redhat.com
	(prarit-guest.khw.lab.eng.bos.redhat.com [10.16.186.145])
	by int-mx14.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP
	id s9NIMGbw007906; Thu, 23 Oct 2014 14:22:17 -0400
From: Prarit Bhargava
To: linux-pci@vger.kernel.org
Cc: Prarit Bhargava, Myron Stowe, Bjorn Helgaas, Alexander Ducyk, Jiang Liu
Subject: [PATCH V2] pci, add sysfs numa_node write function
Date: Thu, 23 Oct 2014 14:22:12 -0400
Message-Id: <1414088532-24605-1-git-send-email-prarit@redhat.com>
X-Scanned-By: MIMEDefang 2.68 on 10.5.11.27
Sender: linux-pci-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-pci@vger.kernel.org
X-Spam-Status: No, score=-8.3 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI,
	RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

Some new drivers, such as the Intel QAT driver, drivers/crypto/qat, require
that a specific node be assigned to the device in order to achieve maximum
performance for the device, and will fail to load if the device has
NUMA_NO_NODE.  Users can in some cases, with additional information provided
by vendor support, determine what the correct NUMA node is supposed to be.
In these cases a quick hack of the driver results in a functional QAT
device.

In theory, it should be possible to map a PCI device to a PCI root bridge
and from there to a specific node; in practice it is not possible.  Nodes
may have multiple PCI root bridges, may share multiple PCI root bridges, or
may not have an active root bridge assigned.  Hardware manufacturers may
specifically have designed systems without NUMA node to PCI root bridge
mappings.  Without assistance from some hardware reporting mechanism
(SMBIOS, ACPI, etc.) there is no reliable way to determine the NUMA node
for a PCI bridge or device.

Typically this NUMA mapping is done via the ACPI _PXM values in the ACPI
tables.  However, there are many systems out there that do not populate
the ACPI _PXM entries and therefore do not have correct PCI device
numa_node values.  Hardware vendors accept bug reports for the ACPI _PXM
entries, but production fixes typically take six months to a year and, in
some past cases, never arrive.

This patch introduces a mechanism to allow a user who knows the correct
value of the NUMA node to set it via sysfs.

As suggested by Alexander and Bjorn, setting the value issues a loud FW_BUG
message and taints the kernel (TAINT_FIRMWARE_WORKAROUND) to notify the
user that the issue really is a firmware bug.

To use this, one can do

	echo 3 > /sys/devices/pci0000:ff/0000:03:1f.3/numa_node

to set the NUMA node for PCI device 0000:03:1f.3.

Cc: Myron Stowe
Cc: Bjorn Helgaas
Cc: Alexander Ducyk
Cc: Jiang Liu
Cc: linux-pci@vger.kernel.org
Signed-off-by: Prarit Bhargava

[v2]: add warning about broken BIOS, rework message after attempting to
determine NUMA node on a wide number of broken systems, add Documentation.
---
 Documentation/ABI/testing/sysfs-bus-pci | 13 +++++++++++++
 drivers/pci/pci-sysfs.c                 | 29 ++++++++++++++++++++++++++++-
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index ee6c040..76007b3 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -281,3 +281,16 @@ Description:
 		opt-out of driver binding using a driver_override name such as
 		"none".  Only a single driver may be specified in the override,
 		there is no support for parsing delimiters.
+
+What:		/sys/bus/pci/devices/.../numa_node
+Date:		Oct 2014
+Contact:	Prarit Bhargava
+Description:
+		This file contains the value of the NUMA node that the PCI
+		device is attached to, or -1 if the device is attached to
+		multiple nodes.  The file can be written to override the
+		value if the user determines that the system's firmware has
+		provided an incorrect value.  If this file is written to,
+		the user should report a firmware bug to the system vendor.
+		Writing to this file taints the kernel with
+		TAINT_FIRMWARE_WORKAROUND.
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 92b6d9a..e5a4664 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -221,12 +221,39 @@ static ssize_t enabled_show(struct device *dev, struct device_attribute *attr,
 static DEVICE_ATTR_RW(enabled);
 
 #ifdef CONFIG_NUMA
+static ssize_t numa_node_store(struct device *dev,
+			       struct device_attribute *attr,
+			       const char *buf, size_t count)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	int node, ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	ret = kstrtoint(buf, 0, &node);
+	if (ret)
+		return ret;
+
+	if (node < 0 || node >= MAX_NUMNODES || !node_online(node))
+		return -EINVAL;
+
+	add_taint(TAINT_FIRMWARE_WORKAROUND, LOCKDEP_STILL_OK);
+	dev_alert(&pdev->dev,
+		  FW_BUG "Overriding NUMA node to %d.  Contact your vendor for updates.",
+		  node);
+
+	dev->numa_node = node;
+
+	return count;
+}
+
 static ssize_t numa_node_show(struct device *dev, struct device_attribute *attr,
 			      char *buf)
 {
 	return sprintf(buf, "%d\n", dev->numa_node);
 }
-static DEVICE_ATTR_RO(numa_node);
+static DEVICE_ATTR_RW(numa_node);
 #endif
 
 static ssize_t dma_mask_bits_show(struct device *dev,
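
For anyone wanting to exercise the new attribute from a program rather than
with echo, the following is a minimal userspace sketch. It is not part of the
patch: numa_node_write() and numa_node_read() are hypothetical helper names,
and the only assumption about the kernel side is that the attribute accepts a
decimal node number and reports it back as a decimal integer, as the store and
show functions above do.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): write a node number to a
 * numa_node sysfs attribute.  Returns 0 on success, negative errno on error.
 */
int numa_node_write(const char *path, int node)
{
	FILE *f = fopen(path, "w");
	int err = 0;

	if (!f)
		return -errno;

	if (fprintf(f, "%d\n", node) < 0)
		err = -EIO;

	/*
	 * stdio buffers the write, so an error returned by the kernel's
	 * store function typically surfaces when the stream is flushed
	 * at fclose().
	 */
	if (fclose(f) != 0 && !err)
		err = -errno;

	return err;
}

/*
 * Hypothetical helper: read the current node number back.
 * Returns 0 on success, negative errno on error.
 */
int numa_node_read(const char *path, int *node)
{
	FILE *f = fopen(path, "r");
	int err = 0;

	if (!f)
		return -errno;

	if (fscanf(f, "%d", node) != 1)
		err = -EINVAL;

	fclose(f);
	return err;
}
```

On a kernel carrying this patch, calling
numa_node_write("/sys/devices/pci0000:ff/0000:03:1f.3/numa_node", 3) with
CAP_SYS_ADMIN would be expected to behave like the echo example in the commit
message, and a request for an offline node should come back as -EINVAL.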