From patchwork Mon May 15 11:35:33 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: "M K, Muralidhara" X-Patchwork-Id: 13241291 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 999E1C77B7D for ; Mon, 15 May 2023 11:40:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S241349AbjEOLkC (ORCPT ); Mon, 15 May 2023 07:40:02 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41736 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S241421AbjEOLiA (ORCPT ); Mon, 15 May 2023 07:38:00 -0400 Received: from NAM12-DM6-obe.outbound.protection.outlook.com (mail-dm6nam12on2051.outbound.protection.outlook.com [40.107.243.51]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1809A1FEA; Mon, 15 May 2023 04:36:02 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=B6CBTETQ27yBgZ0pTOXSjqP/ctNiItV0laByrxriEWQDJ3qMD9v4eVBDqHXRXjVdAnabU4eiOi+l+i0pfYfsrUTj0RH+znO36gNyGK/gpUcUWXGVTxS1WZbyLNyd6gh+PKE7IyyAC8TRGUfmChraRKZRfjj8MbmRpHmlqGQjxwL/alck/DZUMgqPXzUdTNODjzcv6Ft2Ql2H3xqwDAFaEkhuhiC6lyYTQfRJIQj9F8LMFTCaKTjXUv4KxkJFFfjSAv5f/3UfGwYL6DAa2CDw6llhfvieSx/f+AhTo60lW1omliV3gdQtP+Z9wzEnvagdWK7F+4knhIPmcSk70aYncQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=hyX2vk4DIeH85+YwVNj4EKx5H+HbzveaWJthZjp1SrM=; 
b=MD8+3hkTl+gc7t97nSerx02aJJoVxHxsYepcE5sPiQUjQxhjMgqX+SErNDJWKnhoQE1m7DBeu2S3br9SxIAgrjmItyTP87TPUmqh5XEWxjA9AC6R0px+WKfgykyltKr+a7rpItZ4uPlYLivJqhrRc3+38JXYcpltbcipiI0KO+2L6ZQSgdceU7uFxnDGKEserSFL8baeMGw7DBDpwEPl5hoaNbcfIGmWBOyI5XxNfT6WSkudIyN3gOXnU8CbkZ8UdH3zHFR+UmGy5BQDi94C+318JXHy8JUGLeuDenH1E8I6R19huzWgZ1ApU0DK76xvTSrFZktA6k23a7qpBNIIFA== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 165.204.84.17) smtp.rcpttodomain=vger.kernel.org smtp.mailfrom=amd.com; dmarc=pass (p=quarantine sp=quarantine pct=100) action=none header.from=amd.com; dkim=none (message not signed); arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amd.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=hyX2vk4DIeH85+YwVNj4EKx5H+HbzveaWJthZjp1SrM=; b=0CEB3DzbXwiyzPYDlGpiIleQMuat3mHMVeLOQHNjTTFx2yIAAg+2xM9IMeBO5nJbYWo8e29vBopgqyAAzMDdYVZoyt4dG5i9IWAgfoiU5O1zXwcTso9aXFdSk61p503icPQRxre6TrL/4+kGMpLTC96SJ9eGxW+bUq7uypbCl+A= Received: from SJ0PR05CA0043.namprd05.prod.outlook.com (2603:10b6:a03:33f::18) by LV2PR12MB5919.namprd12.prod.outlook.com (2603:10b6:408:173::17) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.6387.29; Mon, 15 May 2023 11:36:00 +0000 Received: from DM6NAM11FT014.eop-nam11.prod.protection.outlook.com (2603:10b6:a03:33f:cafe::d8) by SJ0PR05CA0043.outlook.office365.com (2603:10b6:a03:33f::18) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.6411.14 via Frontend Transport; Mon, 15 May 2023 11:36:00 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 165.204.84.17) smtp.mailfrom=amd.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=amd.com; Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=SATLEXMB04.amd.com; pr=C 
Received: from SATLEXMB04.amd.com (165.204.84.17) by DM6NAM11FT014.mail.protection.outlook.com (10.13.173.132) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.20.6411.14 via Frontend Transport; Mon, 15 May 2023 11:35:59 +0000 Received: from amd.amd.com (10.180.168.240) by SATLEXMB04.amd.com (10.181.40.145) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2375.34; Mon, 15 May 2023 06:35:56 -0500 From: Muralidhara M K To: , CC: , , , , , Muralidhara M K Subject: [PATCH 1/5] x86/amd_nb: Add MI200 PCI IDs Date: Mon, 15 May 2023 11:35:33 +0000 Message-ID: <20230515113537.1052146-2-muralimk@amd.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20230515113537.1052146-1-muralimk@amd.com> References: <20230515113537.1052146-1-muralimk@amd.com> MIME-Version: 1.0 X-Originating-IP: [10.180.168.240] X-ClientProxiedBy: SATLEXMB03.amd.com (10.181.40.144) To SATLEXMB04.amd.com (10.181.40.145) X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: DM6NAM11FT014:EE_|LV2PR12MB5919:EE_ X-MS-Office365-Filtering-Correlation-Id: 34b6fcc0-6485-4b05-bf2a-08db55388e6f X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: 
bai6wU4+SewpNXxpfCS6cqTuYsSBOOKMlD7cAG2cpbPNNKKjuAU5FZHnH3AbcTZd0bAhK7eN8nKVNhOXdzbTuWkXTHPNA1X+K6SC9CF5Pp6+6UcJyNx54oOrmROdih/23wZupVQdNXWaeP+j0AOBH1cs3b7N5FYnRb4ZeiAJH/gwwMUNvhPI+r1oCmckhkNHSeYVGYI/aal6f13zCHKeFnx8O0PT0UfEofhJpbkUn+U9+QKxie1csjLWPkbe536lwm2DhJM2XSeKriwHOkclb0KbTpXZaMq0SLaW0EFQunQ0coeV0FpAIt0iZUwRumqUzX06cX9gkH/e0QcyuXHn4DypOPOkYJKf0aWlal2z00ZQuTQW+cAsa48gT5K4XCjMFoelzSA+RekSCdLqqmgbFdSOCNMeM6QQlB+FoHkz1/kvyoygDqI5NVpcraM/L8WyGmE3wvKiNyNPPE4pZTZ0x9reXGosLZ5UWbF+YsFm5DXXo7yV0injExV70c8/6Ogy0gUWCk94lC4GYgX4g9svdQ6TyJOZhwNCgVLjMEGKTiB6BaScbJLPqHguukP1gSWmFgwQWM3/+OWwxKmY0vxhthtzOtjwLkXM7GFkUaSZyIEE9n3DaijFQmVWK/Qc1yXJOKkidTzC3JCXQM0HFC9RXWn8XJej+Dc0Mhp0WoCNbMbDVlzGT6XYlVsk6dt5d7AqsDHS3J2fb4OhlB1ktnYl8tHg/sFp+Y1Jg1hrcKjo7vykurRu9gqITVolApVzkSQOGMzVge1NzGQjNRRhdFK6rQ== X-Forefront-Antispam-Report: CIP:165.204.84.17;CTRY:US;LANG:en;SCL:1;SRV:;IPV:CAL;SFV:NSPM;H:SATLEXMB04.amd.com;PTR:InfoDomainNonexistent;CAT:NONE;SFS:(13230028)(4636009)(396003)(376002)(136003)(39860400002)(346002)(451199021)(40470700004)(46966006)(36840700001)(356005)(81166007)(82310400005)(82740400003)(36756003)(40460700003)(41300700001)(4326008)(316002)(5660300002)(8936002)(8676002)(54906003)(26005)(186003)(1076003)(16526019)(40480700001)(47076005)(336012)(426003)(2616005)(2906002)(70586007)(70206006)(7696005)(110136005)(478600001)(6666004)(83380400001)(36860700001)(36900700001);DIR:OUT;SFP:1101; X-OriginatorOrg: amd.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 15 May 2023 11:35:59.8493 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: 34b6fcc0-6485-4b05-bf2a-08db55388e6f X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[SATLEXMB04.amd.com] X-MS-Exchange-CrossTenant-AuthSource: DM6NAM11FT014.eop-nam11.prod.protection.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: 
HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: LV2PR12MB5919 Precedence: bulk List-ID: X-Mailing-List: linux-edac@vger.kernel.org

From: Yazen Ghannam

The AMD Instinct™ MI200 series accelerators are data center GPUs.
The MI200 (Aldebaran) series of accelerator devices includes Unified
Memory Controllers and a data fabric similar to those used in AMD x86 CPU
products. The memory controllers report errors using MCA, though these
errors are generally handled through GPU drivers that directly manage the
accelerator device.

In some configurations, memory errors from these devices will be reported
through MCA and managed by x86 CPUs. The OS is expected to handle these
errors in a similar fashion to MCA errors originating from memory
controllers on x86 CPUs. In Linux, this flow includes passing MCA errors
to a notifier chain with handlers in the EDAC subsystem.

The AMD64 EDAC module requires information from the memory controllers
and data fabric in order to provide detailed decoding of memory errors.
The information is read from hardware registers accessed through
interfaces in the data fabric.

The accelerator data fabrics are visible to the host x86 CPUs as PCI
devices, just as x86 CPU data fabrics already are. However, the
accelerator fabrics have new and unique PCI IDs.

Add PCI IDs for the MI200 (Aldebaran) series of accelerator devices in
order to enable EDAC support. The data fabrics of the accelerator devices
will be enumerated as any other fabric already supported. System-specific
implementation details will be handled within the AMD64 EDAC module.

Signed-off-by: Yazen Ghannam
Co-developed-by: Muralidhara M K
Signed-off-by: Muralidhara M K
Signed-off-by: Borislav Petkov (AMD)
---
 arch/x86/kernel/amd_nb.c | 5 +++++
 include/linux/pci_ids.h  | 1 +
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/kernel/amd_nb.c b/arch/x86/kernel/amd_nb.c
index 7e331e8f3692..8fd955414b08 100644
--- a/arch/x86/kernel/amd_nb.c
+++ b/arch/x86/kernel/amd_nb.c
@@ -23,6 +23,7 @@
 #define PCI_DEVICE_ID_AMD_19H_M10H_ROOT	0x14a4
 #define PCI_DEVICE_ID_AMD_19H_M60H_ROOT	0x14d8
 #define PCI_DEVICE_ID_AMD_19H_M70H_ROOT	0x14e8
+#define PCI_DEVICE_ID_AMD_MI200_ROOT	0x14bb
 #define PCI_DEVICE_ID_AMD_17H_DF_F4	0x1464
 #define PCI_DEVICE_ID_AMD_17H_M10H_DF_F4	0x15ec
 #define PCI_DEVICE_ID_AMD_17H_M30H_DF_F4	0x1494
@@ -37,6 +38,7 @@
 #define PCI_DEVICE_ID_AMD_19H_M60H_DF_F4	0x14e4
 #define PCI_DEVICE_ID_AMD_19H_M70H_DF_F4	0x14f4
 #define PCI_DEVICE_ID_AMD_19H_M78H_DF_F4	0x12fc
+#define PCI_DEVICE_ID_AMD_MI200_DF_F4	0x14d4
 
 /* Protect the PCI config register pairs used for SMN. */
 static DEFINE_MUTEX(smn_mutex);
@@ -53,6 +55,7 @@ static const struct pci_device_id amd_root_ids[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M40H_ROOT) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M60H_ROOT) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M70H_ROOT) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_ROOT) },
 	{}
 };
 
@@ -81,6 +84,7 @@ static const struct pci_device_id amd_nb_misc_ids[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M60H_DF_F3) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M70H_DF_F3) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M78H_DF_F3) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_DF_F3) },
 	{}
 };
 
@@ -101,6 +105,7 @@ static const struct pci_device_id amd_nb_link_ids[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M40H_DF_F4) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M50H_DF_F4) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_CNB17H_F4) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_DF_F4) },
 	{}
 };
 
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index 95f33dadb2be..a99b1fcfc617 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -568,6 +568,7 @@
 #define PCI_DEVICE_ID_AMD_19H_M60H_DF_F3	0x14e3
 #define PCI_DEVICE_ID_AMD_19H_M70H_DF_F3	0x14f3
 #define PCI_DEVICE_ID_AMD_19H_M78H_DF_F3	0x12fb
+#define PCI_DEVICE_ID_AMD_MI200_DF_F3	0x14d3
 #define PCI_DEVICE_ID_AMD_CNB17H_F3	0x1703
 #define PCI_DEVICE_ID_AMD_LANCE		0x2000
 #define PCI_DEVICE_ID_AMD_LANCE_HOME	0x2001

From patchwork Mon May 15 11:35:34 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "M K, Muralidhara" X-Patchwork-Id: 13241292 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org 
(Postfix) with ESMTP id 7349DC77B75 for ; Mon, 15 May 2023 11:40:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S241312AbjEOLkO (ORCPT ); Mon, 15 May 2023 07:40:14 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41344 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S241313AbjEOLiL (ORCPT ); Mon, 15 May 2023 07:38:11 -0400 Received: from NAM04-MW2-obe.outbound.protection.outlook.com (mail-mw2nam04on2055.outbound.protection.outlook.com [40.107.101.55]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9C5FA1FED; Mon, 15 May 2023 04:36:04 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=U+kdIOZHg2w8eEdZs20PYoxL69xalMwPq+wCXuxF/XAkfZOS2HsPk3x4OPF2zFy4ucdeD5+UinHTT1Rr4sMuMA+i/skRBv75VLnwHx1qMeZ2LW6I4CKSeAt9MP6lddelyTiWkmx2gueyw4oQ4ANUPiPyWDGh6b5L9P56nbsUJNoW59eRTJ+J6aVfijxsRVYYBeINJXrLOcqN+4igLIShyI551+x5msJoCytFYFPSVWTXG8SzLY6z1ESrOJfbPhZ2xmwBAG4DSoWP+2d5FI4YlD93NPrMEuvSS42K1f5kWvJQrBTECsfyFllcIv8mb5ghDkQQIl+u/GQNm3PcjjytXQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=zxo+4t2JTZ3v8Vol6QECHiP0JqOkA6QErxaO9nzbiHc=; b=Nhx1NqdGilKDDfPAvOyqgN66SL95utvfPDb1zL6lYYf6A2FS4Qcs8vHlddTwoB71tvfdgCX6NVpIY6sz/bacLDZyKBr4W4ZOpMysk+rsKl19cFxcTWvq4UvjBPIdd1Y3xEU6pyBdTzjjC8wfwE00zR1Fo053tvwpb8k8aGr5pZzsmM4iKM2jZYw+ZNcCPaWi+avLCBKCeOZV9BH21gadrdS03XQNGs75PGCQyIaBW9Fopq3IqyvdqHOoNDKn282W1R830brSKcJLgBcG4K+0UJG+yL0jrLixSdWhHey5ujA3nA8fpbMXnfPvXDXUulGXBwv44d/O8t75sADmfX0t5A== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 165.204.84.17) smtp.rcpttodomain=vger.kernel.org smtp.mailfrom=amd.com; dmarc=pass (p=quarantine sp=quarantine pct=100) action=none header.from=amd.com; dkim=none (message 
not signed); arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amd.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=zxo+4t2JTZ3v8Vol6QECHiP0JqOkA6QErxaO9nzbiHc=; b=k63xtQyLTBEy1h+Al3YvPV6HFgYdhPOUy2t4HWNoAIPBuPEDdTQRUR0EG/l/4cS8EHN3r/gsp6hZBKzJRU4oeVBBin00bPyMSTPrzNuMg4h7tVEVr7Vm9haQpneBHbyU6c8nR2BTCe920flMZFBOXCcLONLGpzu0GtRrb0qvqPI= Received: from DM6PR21CA0009.namprd21.prod.outlook.com (2603:10b6:5:174::19) by DS0PR12MB8454.namprd12.prod.outlook.com (2603:10b6:8:15e::10) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.6387.30; Mon, 15 May 2023 11:36:02 +0000 Received: from DM6NAM11FT011.eop-nam11.prod.protection.outlook.com (2603:10b6:5:174:cafe::a6) by DM6PR21CA0009.outlook.office365.com (2603:10b6:5:174::19) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.6433.3 via Frontend Transport; Mon, 15 May 2023 11:36:02 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 165.204.84.17) smtp.mailfrom=amd.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=amd.com; Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=SATLEXMB04.amd.com; pr=C Received: from SATLEXMB04.amd.com (165.204.84.17) by DM6NAM11FT011.mail.protection.outlook.com (10.13.172.108) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.20.6411.14 via Frontend Transport; Mon, 15 May 2023 11:36:02 +0000 Received: from amd.amd.com (10.180.168.240) by SATLEXMB04.amd.com (10.181.40.145) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2375.34; Mon, 15 May 2023 06:35:59 -0500 From: Muralidhara M K To: , CC: , , , , , Muralidhara M K Subject: [PATCH 2/5] x86/MCE/AMD, EDAC/mce_amd: 
Decode UMC_V2 ECC errors Date: Mon, 15 May 2023 11:35:34 +0000 Message-ID: <20230515113537.1052146-3-muralimk@amd.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20230515113537.1052146-1-muralimk@amd.com> References: <20230515113537.1052146-1-muralimk@amd.com> MIME-Version: 1.0 X-Originating-IP: [10.180.168.240] X-ClientProxiedBy: SATLEXMB03.amd.com (10.181.40.144) To SATLEXMB04.amd.com (10.181.40.145) X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: DM6NAM11FT011:EE_|DS0PR12MB8454:EE_ X-MS-Office365-Filtering-Correlation-Id: 50f0b32f-dcea-4dd1-4a88-08db55388fdc X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: 9SO781xg+iISbAMAbV1PjUCO1Q/cy9Nt9W5aodseYC2+E43x6+fQKtdP+i55tWl17UbJNoeKE5WXE4r7fuCTBTc7NGPqIoxjYW26cqTQR/MagIAot7eQPBuY0sHG03IRG8RP8R5VOsdAV/bi/Llt6I/DkIEtC0DxLllM50CXKSJAstxDAwn9kOaIRcA6o502x49VOdenGJKqCD1L+/SKnGZYtFxnKg7cX0+CPaBpLUdv5XcuqqcoC66gdpSp4c2OO6wuDia+HKpgrfGibL3g1J722U8RQmvO7Fv+si1RtigNCreFfuescNFtH26DdtFgOSNgIeet7/N4jo79+74jN2NSdB0AVAjE1zVhWwqLkF+z8Y8lJmztTX9sbhl+vwfGbGLjrn1gUl3oQED42uQ+y59SELGHKrdiAps409l7SkvCvUfD2NKilwHdyDqdNY/WtPYNfqinxTY+rhT4ZHAAMFqEFAsAd9UsSvjixftDNRkrd456doZ5x2cK/lRnoLgUGgUvfuBfefwWxmOtC8l5gPe17/3+Ey2l+CMfNKLAB1ys0025TdVZFvSkoeLQSybsbMTaGpicc9oN3QPdiNGtdSybRIOrQbn2AwLHotW1MYQs6Z3Hv28yPYmWeNoYYPpv3/do1TD1MQEhA/tuylnmbO93Uu3lWKqjm3araGZLAInEHAxEIz4NaL6wMy2N/KiArAJnHOoB7OEDLesXRo529+RfqECcBohnBTEPePR3OLnXBpEaunr+xs0r5GfMFmxS6YImlE/dLcQ43A0LBjc6jQ== X-Forefront-Antispam-Report: 
CIP:165.204.84.17;CTRY:US;LANG:en;SCL:1;SRV:;IPV:CAL;SFV:NSPM;H:SATLEXMB04.amd.com;PTR:InfoDomainNonexistent;CAT:NONE;SFS:(13230028)(4636009)(396003)(376002)(39860400002)(136003)(346002)(451199021)(46966006)(36840700001)(40470700004)(36756003)(54906003)(110136005)(316002)(4326008)(70206006)(70586007)(478600001)(7696005)(82310400005)(40480700001)(5660300002)(8676002)(8936002)(6666004)(2906002)(81166007)(356005)(82740400003)(41300700001)(16526019)(2616005)(1076003)(186003)(26005)(336012)(36860700001)(83380400001)(426003)(47076005)(40460700003)(36900700001);DIR:OUT;SFP:1101; X-OriginatorOrg: amd.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 15 May 2023 11:36:02.2384 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: 50f0b32f-dcea-4dd1-4a88-08db55388fdc X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[SATLEXMB04.amd.com] X-MS-Exchange-CrossTenant-AuthSource: DM6NAM11FT011.eop-nam11.prod.protection.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: DS0PR12MB8454 Precedence: bulk List-ID: X-Mailing-List: linux-edac@vger.kernel.org

From: Yazen Ghannam

The MI200 (Aldebaran) series of devices introduced a new SMCA bank type
for Unified Memory Controllers. The MCE subsystem already has support for
this new type. The MCE decoder module will decode the common MCA error
information for the new bank type, but it will not pass the information
to the AMD64 EDAC module for detailed memory error decoding.

Have the MCE decoder module recognize the new bank type as an SMCA UMC
memory error and pass the MCA information to AMD64 EDAC.

Signed-off-by: Yazen Ghannam
Co-developed-by: Muralidhara M K
Signed-off-by: Muralidhara M K
---
 arch/x86/kernel/cpu/mce/amd.c | 6 ++++--
 drivers/edac/mce_amd.c        | 3 ++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 0b971f974096..5e74610b39e7 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -715,11 +715,13 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
 
 bool amd_mce_is_memory_error(struct mce *m)
 {
+	enum smca_bank_types bank_type;
 	/* ErrCodeExt[20:16] */
 	u8 xec = (m->status >> 16) & 0x1f;
 
+	bank_type = smca_get_bank_type(m->extcpu, m->bank);
 	if (mce_flags.smca)
-		return smca_get_bank_type(m->extcpu, m->bank) == SMCA_UMC && xec == 0x0;
+		return (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) && xec == 0x0;
 
 	return m->bank == 4 && xec == 0x8;
 }
@@ -1050,7 +1052,7 @@ static const char *get_name(unsigned int cpu, unsigned int bank, struct threshol
 	if (bank_type >= N_SMCA_BANK_TYPES)
 		return NULL;
 
-	if (b && bank_type == SMCA_UMC) {
+	if (b && (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2)) {
 		if (b->block < ARRAY_SIZE(smca_umc_block_names))
 			return smca_umc_block_names[b->block];
 		return NULL;
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index cc5c63feb26a..9215c06783df 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -1186,7 +1186,8 @@ static void decode_smca_error(struct mce *m)
 	if (xec < smca_mce_descs[bank_type].num_descs)
 		pr_cont(", %s.\n", smca_mce_descs[bank_type].descs[xec]);
 
-	if (bank_type == SMCA_UMC && xec == 0 && decode_dram_ecc)
+	if ((bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) &&
+	    xec == 0 && decode_dram_ecc)
 		decode_dram_ecc(topology_die_id(m->extcpu), m);
 }
 

From patchwork Mon May 15 11:35:35 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: "M K, Muralidhara" X-Patchwork-Id: 13241293 Return-Path: 
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C2139C77B7D for ; Mon, 15 May 2023 11:41:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S241432AbjEOLlV (ORCPT ); Mon, 15 May 2023 07:41:21 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44604 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S240413AbjEOLjU (ORCPT ); Mon, 15 May 2023 07:39:20 -0400 Received: from NAM02-BN1-obe.outbound.protection.outlook.com (mail-bn1nam02on2086.outbound.protection.outlook.com [40.107.212.86]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 156E71FE1; Mon, 15 May 2023 04:36:08 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=aJHx5nVwHcSdjf915XcC29VUbNa2x7exrS82W5H9GwtJQ6thD+xLWVKS/ZbWJbg6ZarhYft0+X297hDY7Dj+wRrwpEZrCiRMufjjXs0okNHCfOApBIDrawA1aS0vZ1fbfYjDnhdKtrRLIRS/79P2ZnXQjOp/N03KBYyQS61izLWShoJIs2HB0NTETrFV9dXmU96WszJWeKXkH4p1ZBFCJ8erAduPgYvmdqGmAp8Lk7VbappKtPhBqOIXHQxBvQQJbQD2nnoiMAE+u5jtxPvoCBZMoCjyqA7us3CMUnUUw9L1d7PW/uZ/CvTq7gPZrpfuARgjC989dngbeEsYKKMGoA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=F0gZhrzob57y3Ne4CCwL2DGkjPeQhG61cqBhR9rVR4w=; b=VGkC90ISAglDtqQasd0vzCz4BB20UOuj/NbK/jRIuBXg2f5lHR2jD708fNc1mh+4EguMYPY8FTIrE0iFWdWtQQb/vzqqWBimdgNyUrQefW9TLBReql1IE4aKZGJIZ12IvnC+rRG0YU1d0GVbNjsLwiYHX7bbXCvqr2kjFLmUidioM0o23idntxrS7ZNUnRi7tzF96IWn3I5ArB+SgSCUEHALdQds/SVv2diGAOq696tYKJwGOoh++YMl9YL4yBYrKF0gLRMTb5F1t5t2Ix3hcVk9d8NLukvizJ86BATjeOzMwb3Iygxw1Gp6OQ238kxR9BXVAMKsTe+r1A5TXaK7+Q== ARC-Authentication-Results: i=1; mx.microsoft.com 1; 
spf=pass (sender ip is 165.204.84.17) smtp.rcpttodomain=vger.kernel.org smtp.mailfrom=amd.com; dmarc=pass (p=quarantine sp=quarantine pct=100) action=none header.from=amd.com; dkim=none (message not signed); arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amd.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=F0gZhrzob57y3Ne4CCwL2DGkjPeQhG61cqBhR9rVR4w=; b=YGe82VKOolcXkh2oGNsrhxcidP7Dvsr9nBqMcUFCFrJGEHZ0E+VcOYn3dewTA+2gSRsP5IGkZQOS1gunbWkzudcyc7AGuNZ+UgkiZZBXXSGZy5kMqPVRFHjRkDEDaQl8hnPr1siiFmsO1C73+hY17jUnPW5A5EVjpi0lKnaRRrk= Received: from SJ0PR03CA0024.namprd03.prod.outlook.com (2603:10b6:a03:33a::29) by MN0PR12MB5811.namprd12.prod.outlook.com (2603:10b6:208:377::12) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.6387.30; Mon, 15 May 2023 11:36:05 +0000 Received: from DM6NAM11FT069.eop-nam11.prod.protection.outlook.com (2603:10b6:a03:33a:cafe::20) by SJ0PR03CA0024.outlook.office365.com (2603:10b6:a03:33a::29) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.6387.30 via Frontend Transport; Mon, 15 May 2023 11:36:05 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 165.204.84.17) smtp.mailfrom=amd.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=amd.com; Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=SATLEXMB04.amd.com; pr=C Received: from SATLEXMB04.amd.com (165.204.84.17) by DM6NAM11FT069.mail.protection.outlook.com (10.13.173.202) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.20.6411.14 via Frontend Transport; Mon, 15 May 2023 11:36:05 +0000 Received: from amd.amd.com (10.180.168.240) by SATLEXMB04.amd.com (10.181.40.145) with Microsoft SMTP Server (version=TLS1_2, 
cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2375.34; Mon, 15 May 2023 06:36:02 -0500 From: Muralidhara M K To: , CC: , , , , , Muralidhara M K , Naveen Krishna Chatradhi Subject: [PATCH 3/5] EDAC/amd64: Document heterogeneous system enumeration Date: Mon, 15 May 2023 11:35:35 +0000 Message-ID: <20230515113537.1052146-4-muralimk@amd.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20230515113537.1052146-1-muralimk@amd.com> References: <20230515113537.1052146-1-muralimk@amd.com> MIME-Version: 1.0 X-Originating-IP: [10.180.168.240] X-ClientProxiedBy: SATLEXMB03.amd.com (10.181.40.144) To SATLEXMB04.amd.com (10.181.40.145) X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: DM6NAM11FT069:EE_|MN0PR12MB5811:EE_ X-MS-Office365-Filtering-Correlation-Id: d5f644d7-fa50-4ad9-3c22-08db55389196 X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: HA/XuJdhwZtlIhqLpciW5qp8cdV6GjIx63MCHk+qUjTvUs4PejfdDO1nZpZyfWoGt7n6tltfZTKmEdp/8PsOKCAqCBu7rxevZe4Tv60jS3PkvnWVCsKOOaOvJ8OoreeuxJ4Ydgc6XEV3GYHB9lmMtuOC0FHVgrL4w6zjc6NaoFQOZ7W0HrQ/M5ITtOdEvWHQVQLFriw1m2neeI88knIDpUnqTD4ZdAecyF4itOtleMk/V44zpLirutRnjCgF4pntNyS8yiQBnaCYyxDykvUJ524jjye95qOqfP1AogIOtspESZjlkyO0XYiyJ5nyz72kHUwNcMaj2pILauk3Y8ObzZzP/1gI6aJwrv7gtxuvbpUbCjeSchD/WsOy5p+t9I2VqTtCehhtKKtU87Z4UW3Bv1RimKnma9hd5xn0HX7zyVvFoD4voMG4bnk/fv1btBkZh50zlnUqGvKFxYN9yaRwo7kcVYS1DOLBuGvDflPmYirHb15NlLapTJD7x3Dd7iqTHmNw0jNwYfOxO5oOjh9gJyOTResSNlrWYHyTBMzfh252jYmXQGy55teSnELboGmlFyW+9HvKTBOMgfJ6yurl5QBiS4u72DngjzqZRhiKDFi5EQ3LRC3rU6eMDCgCeoA91exBjUG+pYQH5eFlb5V1D0TDvMdDT2tF4VEQVuEHpvFlROjHHhje5WxqZlyRdHubgYBu7PkJzSqiuQtDP3E2A7p/GK5OCFJMScFL5BEmD5DUqGOhXHhF/y/K5qhuX/7t/DUh2pWMAO3QyevVXKNuDQ== X-Forefront-Antispam-Report: 
CIP:165.204.84.17;CTRY:US;LANG:en;SCL:1;SRV:;IPV:CAL;SFV:NSPM;H:SATLEXMB04.amd.com;PTR:InfoDomainNonexistent;CAT:NONE;SFS:(13230028)(4636009)(39860400002)(376002)(396003)(346002)(136003)(451199021)(36840700001)(46966006)(40470700004)(36860700001)(47076005)(186003)(16526019)(2616005)(41300700001)(6666004)(7696005)(426003)(336012)(83380400001)(1076003)(26005)(40460700003)(478600001)(54906003)(110136005)(70206006)(70586007)(4326008)(82740400003)(81166007)(40480700001)(316002)(356005)(5660300002)(8936002)(8676002)(2906002)(36756003)(82310400005)(36900700001);DIR:OUT;SFP:1101; X-OriginatorOrg: amd.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 15 May 2023 11:36:05.1347 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: d5f644d7-fa50-4ad9-3c22-08db55389196 X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[SATLEXMB04.amd.com] X-MS-Exchange-CrossTenant-AuthSource: DM6NAM11FT069.eop-nam11.prod.protection.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: MN0PR12MB5811 Precedence: bulk List-ID: X-Mailing-List: linux-edac@vger.kernel.org From: Muralidhara M K Document High Bandwidth Memory (HBM) and AMD heterogeneous system topology and enumeration. [ bp: Simplify and de-marketize, unify, massage. 
] Signed-off-by: Muralidhara M K Co-developed-by: Naveen Krishna Chatradhi Signed-off-by: Naveen Krishna Chatradhi Signed-off-by: Yazen Ghannam Signed-off-by: Borislav Petkov (AMD) --- Documentation/driver-api/edac.rst | 120 ++++++++++++++++++++++++++++++ 1 file changed, 120 insertions(+) diff --git a/Documentation/driver-api/edac.rst b/Documentation/driver-api/edac.rst index b8c742aa0a71..f4f044b95c4f 100644 --- a/Documentation/driver-api/edac.rst +++ b/Documentation/driver-api/edac.rst @@ -106,6 +106,16 @@ will occupy those chip-select rows. This term is avoided because it is unclear when needing to distinguish between chip-select rows and socket sets. +* High Bandwidth Memory (HBM) + +HBM is a new memory type with low power consumption and ultra-wide +communication lanes. It uses vertically stacked memory chips (DRAM dies) +interconnected by microscopic wires called "through-silicon vias," or +TSVs. + +Several stacks of HBM chips connect to the CPU or GPU through an ultra-fast +interconnect called the "interposer". Therefore, HBM's characteristics +are nearly indistinguishable from on-chip integrated RAM. Memory Controllers ------------------ @@ -176,3 +186,113 @@ nodes:: the L1 and L2 directories would be "edac_device_block's" .. kernel-doc:: drivers/edac/edac_device.h + + +Heterogeneous system support +---------------------------- + +An AMD heterogeneous system is built by connecting the data fabrics of +both CPUs and GPUs via custom xGMI links. Thus, the data fabric on the +GPU nodes can be accessed the same way as the data fabric on CPU nodes. + +The MI200 accelerators are data center GPUs. They have 2 data fabrics, +and each GPU data fabric contains four Unified Memory Controllers (UMC). +Each UMC contains eight channels. Each UMC channel controls one 128-bit +HBM2e (2GB) channel (equivalent to 8 X 2GB ranks). This creates a total +of 4096-bits of DRAM data bus. 
+ +While the UMC is interfacing a 16GB (8high X 2GB DRAM) HBM stack, each UMC +channel is interfacing 2GB of DRAM (represented as rank). + +Memory controllers on AMD GPU nodes can be represented in EDAC thusly: + + GPU DF / GPU Node -> EDAC MC + GPU UMC -> EDAC CSROW + GPU UMC channel -> EDAC CHANNEL + +For example: a heterogeneous system with 1 AMD CPU is connected to +4 MI200 (Aldebaran) GPUs using xGMI. + +Some more heterogeneous hardware details: + +- The CPU UMC (Unified Memory Controller) is mostly the same as the GPU UMC. + They have chip selects (csrows) and channels. However, the layouts are different + for performance, physical layout, or other reasons. +- CPU UMCs use 1 channel, In this case UMC = EDAC channel. This follows the + marketing speak. CPU has X memory channels, etc. +- CPU UMCs use up to 4 chip selects, So UMC chip select = EDAC CSROW. +- GPU UMCs use 1 chip select, So UMC = EDAC CSROW. +- GPU UMCs use 8 channels, So UMC channel = EDAC channel. + +The EDAC subsystem provides a mechanism to handle AMD heterogeneous +systems by calling system specific ops for both CPUs and GPUs. + +AMD GPU nodes are enumerated in sequential order based on the PCI +hierarchy, and the first GPU node is assumed to have a Node ID value +following those of the CPU nodes after latter are fully populated:: + + $ ls /sys/devices/system/edac/mc/ + mc0 - CPU MC node 0 + mc1 | + mc2 |- GPU card[0] => node 0(mc1), node 1(mc2) + mc3 | + mc4 |- GPU card[1] => node 0(mc3), node 1(mc4) + mc5 | + mc6 |- GPU card[2] => node 0(mc5), node 1(mc6) + mc7 | + mc8 |- GPU card[3] => node 0(mc7), node 1(mc8) + +For example, a heterogeneous system with one AMD CPU is connected to +four MI200 (Aldebaran) GPUs using xGMI. This topology can be represented +via the following sysfs entries:: + + /sys/devices/system/edac/mc/.. 
+
+	CPU                             # CPU node
+	├── mc 0
+
+	GPU Nodes are enumerated sequentially after CPU nodes have been populated
+	GPU card 1                      # Each MI200 GPU has 2 nodes/mcs
+	├── mc 1                        # GPU node 0 == mc1, Each MC node has 4 UMCs/CSROWs
+	│   ├── csrow 0                 # UMC 0
+	│   │   ├── channel 0           # Each UMC has 8 channels
+	│   │   ├── channel 1           # size of each channel is 2 GB, so each UMC has 16 GB
+	│   │   ├── channel 2
+	│   │   ├── channel 3
+	│   │   ├── channel 4
+	│   │   ├── channel 5
+	│   │   ├── channel 6
+	│   │   ├── channel 7
+	│   ├── csrow 1                 # UMC 1
+	│   │   ├── channel 0
+	│   │   ├── ..
+	│   │   ├── channel 7
+	│   ├── .. ..
+	│   ├── csrow 3                 # UMC 3
+	│   │   ├── channel 0
+	│   │   ├── ..
+	│   │   ├── channel 7
+	│   ├── rank 0
+	│   ├── .. ..
+	│   ├── rank 31                 # total 32 ranks/dimms from 4 UMCs
+	├
+	├── mc 2                        # GPU node 1 == mc2
+	│   ├── ..                      # each GPU has total 64 GB
+
+	GPU card 2
+	├── mc 3
+	│   ├── ..
+	├── mc 4
+	│   ├── ..
+
+	GPU card 3
+	├── mc 5
+	│   ├── ..
+	├── mc 6
+	│   ├── ..
+
+	GPU card 4
+	├── mc 7
+	│   ├── ..
+	├── mc 8
+	│   ├── ..
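
The enumeration documented in this patch (CPU nodes first, then two EDAC MCs per MI200 card, four UMCs per node, eight 2 GB channels per UMC) can be sketched as a small user-space helper. This is only an illustration of the arithmetic, not part of the patch: the names `gpu_mc_location`, `mc_to_gpu_location()` and `gpu_node_size_gb()` are hypothetical, and the constants are the MI200 values quoted in the documentation text.

```c
#include <stddef.h>

/* MI200 values quoted in the documentation added by this patch. */
#define NODES_PER_GPU_CARD	2	/* each MI200 card exposes two nodes/MCs */
#define UMCS_PER_NODE		4	/* GPU UMC == EDAC CSROW */
#define CHANNELS_PER_UMC	8	/* GPU UMC channel == EDAC channel */
#define GB_PER_CHANNEL		2

/* Hypothetical helper type, for illustration only. */
struct gpu_mc_location {
	int card;	/* zero-based GPU card index */
	int node;	/* node within the card: 0 or 1 */
};

/*
 * Map an EDAC MC index to a GPU card and node, assuming GPU nodes are
 * numbered directly after nr_cpu_nodes CPU nodes. Returns 0 on success and
 * -1 if the MC belongs to a CPU node or the arguments are invalid.
 */
int mc_to_gpu_location(int mc, int nr_cpu_nodes, struct gpu_mc_location *loc)
{
	int gpu_idx;

	if (loc == NULL || nr_cpu_nodes < 0 || mc < nr_cpu_nodes)
		return -1;

	gpu_idx = mc - nr_cpu_nodes;
	loc->card = gpu_idx / NODES_PER_GPU_CARD;
	loc->node = gpu_idx % NODES_PER_GPU_CARD;

	return 0;
}

/* Memory behind one GPU node (one EDAC MC), in GB. */
int gpu_node_size_gb(void)
{
	return UMCS_PER_NODE * CHANNELS_PER_UMC * GB_PER_CHANNEL;
}
```

With one CPU node, mc1 and mc2 land on card 0 and mc7 and mc8 on card 3 (zero-based), which matches the `ls` listing in the documentation; the per-node size works out to the 64 GB noted in the tree.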
From patchwork Mon May 15 11:35:36 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "M K, Muralidhara"
X-Patchwork-Id: 13241294
From: Muralidhara M K
CC: Muralidhara M K, Naveen Krishna Chatradhi
Subject: [PATCH 4/5] EDAC/amd64: Add support for AMD heterogeneous Family 19h Model 30h-3Fh
Date: Mon, 15 May 2023 11:35:36 +0000
Message-ID: <20230515113537.1052146-5-muralimk@amd.com>
In-Reply-To: <20230515113537.1052146-1-muralimk@amd.com>
References: <20230515113537.1052146-1-muralimk@amd.com>
Precedence: bulk
X-Mailing-List: linux-edac@vger.kernel.org

From: Muralidhara M K

AMD Family 19h Model 30h-3Fh systems can be connected to AMD MI200
accelerator/GPU devices such that the CPU and GPU data fabrics are
connected together. In this configuration, the CPU manages error logging
and reporting for MCA banks located on the GPUs. This includes HBM memory
errors reported from Unified Memory Controllers (UMCs) on the GPUs.

The GPU memory errors are handled like CPU memory errors. AMD CPU UMC
support in EDAC can be re-used for GPU UMC support. However, keeping them
separate means drastic changes in one path (e.g. to support newer
products) should have less impact on the other path. Also, simplify the
"gpu_" helper functions where possible. GPU product configuration, like
memory type and channel count, is fixed compared to CPU products.

GPU UMCs each have four physical connections (phys) connected to eight
channels. There is a single "chip select". This differs from CPUs where
each UMC has one physical connection connected to one channel, and each
channel has up to four "chip selects".

Enumerate each UMC "phy" as an EDAC CSROW, since there is only a single
chip select for each physical connection. This is similar to how a CPU
UMC "phy" is enumerated as an EDAC CHANNEL, since there is only a single
channel for each physical connection.
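
As a rough illustration of the enumeration described above, the following user-space sketch mirrors how the patch sizes the two EDAC layers in init_one_instance(): on GPU nodes the chip-select layer is sized by the UMC count and the channel layer by the per-UMC count, and the roles are swapped on CPU nodes. The types `layer`, `node_cfg` and the function `fill_layers()` are hypothetical stand-ins, not kernel structures; the field comments note which values come from the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for EDAC_MC_LAYER_CHIP_SELECT / EDAC_MC_LAYER_CHANNEL. */
enum layer_type { LAYER_CHIP_SELECT, LAYER_CHANNEL };

struct layer {
	enum layer_type type;
	unsigned int size;
};

/* Simplified view of the amd64_pvt fields the patch consults. */
struct node_cfg {
	bool is_gpu;		/* true when the F3 device is the MI200 data fabric */
	unsigned int max_mcs;	/* UMC count: 4 on MI200, 8 on F19h M30h CPUs */
	unsigned int cs_count;	/* csels[0].b_cnt: 8 on MI200 in this patch */
};

/*
 * Size the two layers the way the patch does: GPU nodes use the UMC count
 * for the chip-select layer and the per-UMC count for the channel layer,
 * while CPU nodes keep the original assignment.
 */
void fill_layers(const struct node_cfg *cfg, struct layer layers[2])
{
	layers[0].type = LAYER_CHIP_SELECT;
	layers[0].size = cfg->is_gpu ? cfg->max_mcs : cfg->cs_count;

	layers[1].type = LAYER_CHANNEL;
	layers[1].size = cfg->is_gpu ? cfg->cs_count : cfg->max_mcs;
}
```

For an MI200 node this yields 4 csrows with 8 channels each. A CPU node with eight UMCs and, as an assumed example value, four chip selects per UMC also yields 4 csrows and 8 channels, but with the opposite hardware meaning behind each layer.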

Signed-off-by: Muralidhara M K
Co-developed-by: Naveen Krishna Chatradhi
Signed-off-by: Naveen Krishna Chatradhi
Co-developed-by: Yazen Ghannam
Signed-off-by: Yazen Ghannam
---
 drivers/edac/amd64_edac.c | 310 ++++++++++++++++++++++++++++++++++----
 1 file changed, 279 insertions(+), 31 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 5c4292e65b96..28155b01f144 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -1426,12 +1426,47 @@ static int umc_get_cs_mode(int dimm, u8 ctrl, struct amd64_pvt *pvt)
 	return cs_mode;
 }
 
+static int __addr_mask_to_cs_size(u32 addr_mask_orig, unsigned int cs_mode,
+				  int csrow_nr, int dimm)
+{
+	u32 msb, weight, num_zero_bits;
+	u32 addr_mask_deinterleaved;
+	int size = 0;
+
+	/*
+	 * The number of zero bits in the mask is equal to the number of bits
+	 * in a full mask minus the number of bits in the current mask.
+	 *
+	 * The MSB is the number of bits in the full mask because BIT[0] is
+	 * always 0.
+	 *
+	 * In the special 3 Rank interleaving case, a single bit is flipped
+	 * without swapping with the most significant bit. This can be handled
+	 * by keeping the MSB where it is and ignoring the single zero bit.
+	 */
+	msb = fls(addr_mask_orig) - 1;
+	weight = hweight_long(addr_mask_orig);
+	num_zero_bits = msb - weight - !!(cs_mode & CS_3R_INTERLEAVE);
+
+	/* Take the number of zero bits off from the top of the mask. */
+	addr_mask_deinterleaved = GENMASK_ULL(msb - num_zero_bits, 1);
+
+	edac_dbg(1, "CS%d DIMM%d AddrMasks:\n", csrow_nr, dimm);
+	edac_dbg(1, " Original AddrMask: 0x%x\n", addr_mask_orig);
+	edac_dbg(1, " Deinterleaved AddrMask: 0x%x\n", addr_mask_deinterleaved);
+
+	/* Register [31:1] = Address [39:9]. Size is in kBs here. */
+	size = (addr_mask_deinterleaved >> 2) + 1;
+
+	/* Return size in MBs. */
+	return size >> 10;
+}
+
 static int umc_addr_mask_to_cs_size(struct amd64_pvt *pvt, u8 umc,
 				    unsigned int cs_mode, int csrow_nr)
 {
-	u32 addr_mask_orig, addr_mask_deinterleaved;
-	u32 msb, weight, num_zero_bits;
 	int cs_mask_nr = csrow_nr;
+	u32 addr_mask_orig;
 	int dimm, size = 0;
 
 	/* No Chip Selects are enabled. */
@@ -1475,33 +1510,7 @@ static int umc_addr_mask_to_cs_size(struct amd64_pvt *pvt, u8 umc,
 	else
 		addr_mask_orig = pvt->csels[umc].csmasks[cs_mask_nr];
 
-	/*
-	 * The number of zero bits in the mask is equal to the number of bits
-	 * in a full mask minus the number of bits in the current mask.
-	 *
-	 * The MSB is the number of bits in the full mask because BIT[0] is
-	 * always 0.
-	 *
-	 * In the special 3 Rank interleaving case, a single bit is flipped
-	 * without swapping with the most significant bit. This can be handled
-	 * by keeping the MSB where it is and ignoring the single zero bit.
-	 */
-	msb = fls(addr_mask_orig) - 1;
-	weight = hweight_long(addr_mask_orig);
-	num_zero_bits = msb - weight - !!(cs_mode & CS_3R_INTERLEAVE);
-
-	/* Take the number of zero bits off from the top of the mask. */
-	addr_mask_deinterleaved = GENMASK_ULL(msb - num_zero_bits, 1);
-
-	edac_dbg(1, "CS%d DIMM%d AddrMasks:\n", csrow_nr, dimm);
-	edac_dbg(1, " Original AddrMask: 0x%x\n", addr_mask_orig);
-	edac_dbg(1, " Deinterleaved AddrMask: 0x%x\n", addr_mask_deinterleaved);
-
-	/* Register [31:1] = Address [39:9]. Size is in kBs here. */
-	size = (addr_mask_deinterleaved >> 2) + 1;
-
-	/* Return size in MBs. */
-	return size >> 10;
+	return __addr_mask_to_cs_size(addr_mask_orig, cs_mode, csrow_nr, dimm);
 }
 
 static void umc_debug_display_dimm_sizes(struct amd64_pvt *pvt, u8 ctrl)
@@ -3675,6 +3684,221 @@ static int umc_hw_info_get(struct amd64_pvt *pvt)
 	return 0;
 }
 
+/*
+ * The CPUs have one channel per UMC, so UMC number is equivalent to a
+ * channel number. The GPUs have 8 channels per UMC, so the UMC number no
+ * longer works as a channel number.
+ *
+ * The channel number within a GPU UMC is given in MCA_IPID[15:12].
+ * However, the IDs are split such that two UMC values go to one UMC, and
+ * the channel numbers are split in two groups of four.
+ *
+ * Refer to comment on gpu_get_umc_base().
+ *
+ * For example,
+ * UMC0 CH[3:0] = 0x0005[3:0]000
+ * UMC0 CH[7:4] = 0x0015[3:0]000
+ * UMC1 CH[3:0] = 0x0025[3:0]000
+ * UMC1 CH[7:4] = 0x0035[3:0]000
+ */
+static void gpu_get_err_info(struct mce *m, struct err_info *err)
+{
+	u8 ch = (m->ipid & GENMASK(31, 0)) >> 20;
+	u8 phy = ((m->ipid >> 12) & 0xf);
+
+	err->channel = ch % 2 ? phy + 4 : phy;
+	err->csrow = phy;
+}
+
+static int gpu_addr_mask_to_cs_size(struct amd64_pvt *pvt, u8 umc,
+				    unsigned int cs_mode, int csrow_nr)
+{
+	u32 addr_mask_orig = pvt->csels[umc].csmasks[csrow_nr];
+
+	return __addr_mask_to_cs_size(addr_mask_orig, cs_mode, csrow_nr, csrow_nr >> 1);
+}
+
+static void gpu_debug_display_dimm_sizes(struct amd64_pvt *pvt, u8 ctrl)
+{
+	int size, cs_mode, cs = 0;
+
+	edac_printk(KERN_DEBUG, EDAC_MC, "UMC%d chip selects:\n", ctrl);
+
+	cs_mode = CS_EVEN_PRIMARY | CS_ODD_PRIMARY;
+
+	for_each_chip_select(cs, ctrl, pvt) {
+		size = gpu_addr_mask_to_cs_size(pvt, ctrl, cs_mode, cs);
+		amd64_info(EDAC_MC ": %d: %5dMB\n", cs, size);
+	}
+}
+
+static void gpu_dump_misc_regs(struct amd64_pvt *pvt)
+{
+	struct amd64_umc *umc;
+	u32 i;
+
+	for_each_umc(i) {
+		umc = &pvt->umc[i];
+
+		edac_dbg(1, "UMC%d UMC cfg: 0x%x\n", i, umc->umc_cfg);
+		edac_dbg(1, "UMC%d SDP ctrl: 0x%x\n", i, umc->sdp_ctrl);
+		edac_dbg(1, "UMC%d ECC ctrl: 0x%x\n", i, umc->ecc_ctrl);
+		edac_dbg(1, "UMC%d All HBMs support ECC: yes\n", i);
+
+		gpu_debug_display_dimm_sizes(pvt, i);
+	}
+}
+
+static u32 gpu_get_csrow_nr_pages(struct amd64_pvt *pvt, u8 dct, int csrow_nr)
+{
+	u32 nr_pages;
+	int cs_mode = CS_EVEN_PRIMARY | CS_ODD_PRIMARY;
+
+	nr_pages = gpu_addr_mask_to_cs_size(pvt, dct, cs_mode, csrow_nr);
+	nr_pages <<= 20 - PAGE_SHIFT;
+
+	edac_dbg(0, "csrow: %d, channel: %d\n", csrow_nr, dct);
+	edac_dbg(0, "nr_pages/channel: %u\n", nr_pages);
+
+	return nr_pages;
+}
+
+static void gpu_init_csrows(struct mem_ctl_info *mci)
+{
+	struct amd64_pvt *pvt = mci->pvt_info;
+	struct dimm_info *dimm;
+	u8 umc, cs;
+
+	for_each_umc(umc) {
+		for_each_chip_select(cs, umc, pvt) {
+			if (!csrow_enabled(cs, umc, pvt))
+				continue;
+
+			dimm = mci->csrows[umc]->channels[cs]->dimm;
+
+			edac_dbg(1, "MC node: %d, csrow: %d\n",
+				 pvt->mc_node_id, cs);
+
+			dimm->nr_pages = gpu_get_csrow_nr_pages(pvt, umc, cs);
+			dimm->edac_mode = EDAC_SECDED;
+			dimm->mtype = MEM_HBM2;
+			dimm->dtype = DEV_X16;
+			dimm->grain = 64;
+		}
+	}
+}
+
+static void gpu_setup_mci_misc_attrs(struct mem_ctl_info *mci)
+{
+	struct amd64_pvt *pvt = mci->pvt_info;
+
+	mci->mtype_cap = MEM_FLAG_HBM2;
+	mci->edac_ctl_cap = EDAC_FLAG_SECDED;
+
+	mci->edac_cap = EDAC_FLAG_EC;
+	mci->mod_name = EDAC_MOD_STR;
+	mci->ctl_name = pvt->ctl_name;
+	mci->dev_name = pci_name(pvt->F3);
+	mci->ctl_page_to_phys = NULL;
+
+	gpu_init_csrows(mci);
+}
+
+/* ECC is enabled by default on GPU nodes */
+static bool gpu_ecc_enabled(struct amd64_pvt *pvt)
+{
+	return true;
+}
+
+static inline u32 gpu_get_umc_base(u8 umc, u8 channel)
+{
+	/*
+	 * On CPUs, there is one channel per UMC, so UMC numbering equals
+	 * channel numbering. On GPUs, there are eight channels per UMC,
+	 * so the channel numbering is different from UMC numbering.
+	 *
+	 * On CPU nodes channels are selected in 6th nibble
+	 * UMC chY[3:0]= [(chY*2 + 1) : (chY*2)]50000;
+	 *
+	 * On GPU nodes channels are selected in 3rd nibble
+	 * HBM chX[3:0]= [Y ]5X[3:0]000;
+	 * HBM chX[7:4]= [Y+1]5X[3:0]000
+	 */
+	umc *= 2;
+
+	if (channel >= 4)
+		umc++;
+
+	return 0x50000 + (umc << 20) + ((channel % 4) << 12);
+}
+
+static void gpu_read_mc_regs(struct amd64_pvt *pvt)
+{
+	u8 nid = pvt->mc_node_id;
+	struct amd64_umc *umc;
+	u32 i, umc_base;
+
+	/* Read registers from each UMC */
+	for_each_umc(i) {
+		umc_base = gpu_get_umc_base(i, 0);
+		umc = &pvt->umc[i];
+
+		amd_smn_read(nid, umc_base + UMCCH_UMC_CFG, &umc->umc_cfg);
+		amd_smn_read(nid, umc_base + UMCCH_SDP_CTRL, &umc->sdp_ctrl);
+		amd_smn_read(nid, umc_base + UMCCH_ECC_CTRL, &umc->ecc_ctrl);
+	}
+}
+
+static void gpu_read_base_mask(struct amd64_pvt *pvt)
+{
+	u32 base_reg, mask_reg;
+	u32 *base, *mask;
+	int umc, cs;
+
+	for_each_umc(umc) {
+		for_each_chip_select(cs, umc, pvt) {
+			base_reg = gpu_get_umc_base(umc, cs) + UMCCH_BASE_ADDR;
+			base = &pvt->csels[umc].csbases[cs];
+
+			if (!amd_smn_read(pvt->mc_node_id, base_reg, base)) {
+				edac_dbg(0, " DCSB%d[%d]=0x%08x reg: 0x%x\n",
+					 umc, cs, *base, base_reg);
+			}
+
+			mask_reg = gpu_get_umc_base(umc, cs) + UMCCH_ADDR_MASK;
+			mask = &pvt->csels[umc].csmasks[cs];
+
+			if (!amd_smn_read(pvt->mc_node_id, mask_reg, mask)) {
+				edac_dbg(0, " DCSM%d[%d]=0x%08x reg: 0x%x\n",
+					 umc, cs, *mask, mask_reg);
+			}
+		}
+	}
+}
+
+static void gpu_prep_chip_selects(struct amd64_pvt *pvt)
+{
+	int umc;
+
+	for_each_umc(umc) {
+		pvt->csels[umc].b_cnt = 8;
+		pvt->csels[umc].m_cnt = 8;
+	}
+}
+
+static int gpu_hw_info_get(struct amd64_pvt *pvt)
+{
+	pvt->umc = kcalloc(pvt->max_mcs, sizeof(struct amd64_umc), GFP_KERNEL);
+	if (!pvt->umc)
+		return -ENOMEM;
+
+	gpu_prep_chip_selects(pvt);
+	gpu_read_base_mask(pvt);
+	gpu_read_mc_regs(pvt);
+
+	return 0;
+}
+
 static void hw_info_put(struct amd64_pvt *pvt)
 {
 	pci_dev_put(pvt->F1);
@@ -3690,6 +3914,14 @@ static struct low_ops umc_ops = {
 	.get_err_info = umc_get_err_info,
 };
 
+static struct low_ops gpu_ops = {
+	.hw_info_get = gpu_hw_info_get,
+	.ecc_enabled = gpu_ecc_enabled,
+	.setup_mci_misc_attrs = gpu_setup_mci_misc_attrs,
+	.dump_misc_regs = gpu_dump_misc_regs,
+	.get_err_info = gpu_get_err_info,
+};
+
 /* Use Family 16h versions for defaults and adjust as needed below. */
 static struct low_ops dct_ops = {
 	.map_sysaddr_to_csrow = f1x_map_sysaddr_to_csrow,
@@ -3813,6 +4045,16 @@ static int per_family_init(struct amd64_pvt *pvt)
 		case 0x20 ... 0x2f:
 			pvt->ctl_name = "F19h_M20h";
 			break;
+		case 0x30 ... 0x3f:
+			if (pvt->F3->device == PCI_DEVICE_ID_AMD_MI200_DF_F3) {
+				pvt->ctl_name = "MI200";
+				pvt->max_mcs = 4;
+				pvt->ops = &gpu_ops;
+			} else {
+				pvt->ctl_name = "F19h_M30h";
+				pvt->max_mcs = 8;
+			}
+			break;
 		case 0x50 ... 0x5f:
 			pvt->ctl_name = "F19h_M50h";
 			break;
@@ -3846,11 +4088,17 @@ static int init_one_instance(struct amd64_pvt *pvt)
 	struct edac_mc_layer layers[2];
 	int ret = -ENOMEM;
 
+	/*
+	 * For Heterogeneous family EDAC CHIP_SELECT and CHANNEL layers should
+	 * be swapped to fit into the layers.
+	 */
 	layers[0].type = EDAC_MC_LAYER_CHIP_SELECT;
-	layers[0].size = pvt->csels[0].b_cnt;
+	layers[0].size = (pvt->F3->device == PCI_DEVICE_ID_AMD_MI200_DF_F3) ?
+			 pvt->max_mcs : pvt->csels[0].b_cnt;
 	layers[0].is_virt_csrow = true;
 	layers[1].type = EDAC_MC_LAYER_CHANNEL;
-	layers[1].size = pvt->max_mcs;
+	layers[1].size = (pvt->F3->device == PCI_DEVICE_ID_AMD_MI200_DF_F3) ?
+			 pvt->csels[0].b_cnt : pvt->max_mcs;
 	layers[1].is_virt_csrow = false;
 
 	mci = edac_mc_alloc(pvt->mc_node_id, ARRAY_SIZE(layers), layers, 0);

From patchwork Mon May 15 11:35:37 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "M K, Muralidhara"
X-Patchwork-Id: 13241295
From: Muralidhara M K
CC: Muralidhara M K
Subject: [PATCH 5/5] EDAC/amd64: Cache and use GPU node map
Date: Mon, 15 May 2023 11:35:37 +0000
Message-ID: <20230515113537.1052146-6-muralimk@amd.com>
In-Reply-To: <20230515113537.1052146-1-muralimk@amd.com>
References: <20230515113537.1052146-1-muralimk@amd.com>
Precedence: bulk
X-Mailing-List: linux-edac@vger.kernel.org

From: Yazen Ghannam

AMD systems have historically provided an "AMD Node ID" that is a unique
identifier for each die in a multi-die package. This was associated with
a unique instance of the AMD Northbridge on legacy systems, and it is now
associated with a unique instance of the AMD Data Fabric on modern
systems. Each instance is referred to as a "Node"; this is an
AMD-specific term not to be confused with NUMA nodes.

The data fabric provides a number of interfaces accessible through a set
of functions in a single PCI device. There is one PCI device per Data
Fabric (AMD Node), and multi-die systems will see multiple such PCI
devices. The AMD Node ID matches a Node's position in the PCI hierarchy.
For example, Node 0 is accessed using the first PCI device, Node 1 is
accessed using the second, and so on. A logical CPU can find its AMD Node
ID using CPUID. Furthermore, the AMD Node ID is used within the hardware
fabric, so it is not purely a logical value.

Heterogeneous AMD systems, with a CPU Data Fabric connected to GPU data
fabrics, follow a similar convention. Each CPU and GPU die has a unique
AMD Node ID value, and each Node ID corresponds to PCI devices in
sequential order.

However, there are two caveats:

1) GPUs are not x86, and they don't have CPUID to read their AMD Node ID
   like on CPUs. This means the value is more implicit and based on PCI
   enumeration and hardware-specifics.

2) There is a gap in the hardware values for AMD Node IDs. Values 0-7
   are for CPUs and values 8-15 are for GPUs.

For example, a system with one CPU die and two GPU dies will have the
following values:

  CPU0 -> AMD Node 0
  GPU0 -> AMD Node 8
  GPU1 -> AMD Node 9

EDAC is the only subsystem where this has a practical effect. Memory
errors on AMD systems are commonly reported through MCA to a CPU on the
local AMD Node.
The error information is passed along to EDAC where the AMD EDAC modules
use the AMD Node ID of the reporting logical CPU to access AMD Node
information.

However, memory errors from a GPU die will be reported to the CPU die.
Therefore, the logical CPU's AMD Node ID can't be used since it won't
match the AMD Node ID of the GPU die. The AMD Node ID of the GPU die is
provided as part of the MCA information, and the value will match the
hardware enumeration (e.g. 8-15).

Handle this situation by discovering GPU dies the same way as CPU dies
in the AMD NB code. But do a "node id" fixup in AMD64 EDAC where it's
needed.

The GPU data fabrics provide a register with the base AMD Node ID for
their local "type", i.e. GPU data fabric. This value is the same for all
fabrics of the same type in a system.

Read and cache the base AMD Node ID from one of the GPU devices during
module initialization. Use this to fixup the "node id" when reporting
memory errors at runtime.

Signed-off-by: Yazen Ghannam
Co-developed-by: Muralidhara M K
Signed-off-by: Muralidhara M K
---
 drivers/edac/amd64_edac.c | 76 +++++++++++++++++++++++++++++++++++++++
 drivers/edac/amd64_edac.h |  1 +
 2 files changed, 77 insertions(+)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 28155b01f144..ef3e50f4d66a 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -975,6 +975,74 @@ static int sys_addr_to_csrow(struct mem_ctl_info *mci, u64 sys_addr)
 	return csrow;
 }
 
+/*
+ * See AMD PPR DF::LclNodeTypeMap
+ *
+ * This register gives information for nodes of the same type within a system.
+ *
+ * Reading this register from a GPU node will tell how many GPU nodes are in the
+ * system and what the lowest AMD Node ID value is for the GPU nodes. Use this
+ * info to fixup the Linux logical "Node ID" value set in the AMD NB code and EDAC.
+ */
+struct local_node_map {
+	u16 node_count;
+	u16 base_node_id;
+} gpu_node_map;
+
+#define PCI_DEVICE_ID_AMD_MI200_DF_F1 0x14d1
+#define REG_LOCAL_NODE_TYPE_MAP 0x144
+
+/* Local Node Type Map (LNTM) fields */
+#define LNTM_NODE_COUNT GENMASK(27, 16)
+#define LNTM_BASE_NODE_ID GENMASK(11, 0)
+
+static int gpu_get_node_map(void)
+{
+	struct pci_dev *pdev;
+	int ret;
+	u32 tmp;
+
+	/*
+	 * Node ID 0 is reserved for CPUs.
+	 * Therefore, a non-zero Node ID means we've already cached the values.
+	 */
+	if (gpu_node_map.base_node_id)
+		return 0;
+
+	pdev = pci_get_device(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_DF_F1, NULL);
+	if (!pdev) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	ret = pci_read_config_dword(pdev, REG_LOCAL_NODE_TYPE_MAP, &tmp);
+	if (ret)
+		goto out;
+
+	gpu_node_map.node_count = FIELD_GET(LNTM_NODE_COUNT, tmp);
+	gpu_node_map.base_node_id = FIELD_GET(LNTM_BASE_NODE_ID, tmp);
+
+out:
+	pci_dev_put(pdev);
+	return ret;
+}
+
+static int fixup_node_id(int node_id, struct mce *m)
+{
+	/* MCA_IPID[InstanceIdHi] gives the AMD Node ID for the bank. */
+	u8 nid = (m->ipid >> 44) & 0xF;
+
+	if (smca_get_bank_type(m->extcpu, m->bank) != SMCA_UMC_V2)
+		return node_id;
+
+	/* Nodes below the GPU base node are CPU nodes and don't need a fixup. */
+	if (nid < gpu_node_map.base_node_id)
+		return node_id;
+
+	/* Convert the hardware-provided AMD Node ID to a Linux logical one. */
+	return nid - gpu_node_map.base_node_id + 1;
+}
+
 /* Protect the PCI config register pairs used for DF indirect access. */
 static DEFINE_MUTEX(df_indirect_mutex);
 
@@ -3001,6 +3069,8 @@ static void decode_umc_error(int node_id, struct mce *m)
 	struct err_info err;
 	u64 sys_addr;
 
+	node_id = fixup_node_id(node_id, m);
+
 	mci = edac_mc_find(node_id);
 	if (!mci)
 		return;
@@ -3888,6 +3958,12 @@ static void gpu_prep_chip_selects(struct amd64_pvt *pvt)
 
 static int gpu_hw_info_get(struct amd64_pvt *pvt)
 {
+	int ret;
+
+	ret = gpu_get_node_map();
+	if (ret)
+		return ret;
+
 	pvt->umc = kcalloc(pvt->max_mcs, sizeof(struct amd64_umc), GFP_KERNEL);
 	if (!pvt->umc)
 		return -ENOMEM;
diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
index e84fe0d4120a..a9d62907a7a0 100644
--- a/drivers/edac/amd64_edac.h
+++ b/drivers/edac/amd64_edac.h
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include "edac_module.h"
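
The node ID fixup in this patch can be tried out in user space with the sketch below. It decodes a Local Node Type Map register value using the field positions the patch defines (node count in bits 27:16, base node ID in bits 11:0) and then converts a hardware AMD Node ID taken from MCA_IPID bits 47:44 into a logical one. The function names `decode_lntm()` and `fixup_gpu_node_id()` are hypothetical, the SMCA bank-type check from fixup_node_id() is left out, and the register value used in the usage note is an assumed example rather than a value read from hardware.

```c
#include <stdint.h>

/* Same layout as the struct added by the patch. */
struct local_node_map {
	uint16_t node_count;
	uint16_t base_node_id;
};

/*
 * Decode a Local Node Type Map register value using the field positions
 * from the patch: node count in bits 27:16, base node ID in bits 11:0.
 */
struct local_node_map decode_lntm(uint32_t reg)
{
	struct local_node_map map;

	map.node_count = (uint16_t)((reg >> 16) & 0xfff);
	map.base_node_id = (uint16_t)(reg & 0xfff);

	return map;
}

/*
 * Mirror of the arithmetic in fixup_node_id(), minus the bank-type check.
 * The hardware node ID is read from MCA_IPID bits 47:44. IDs below the GPU
 * base are CPU nodes and keep the caller's node ID; GPU IDs are rebased so
 * the first GPU node becomes logical node 1. As in the patch, the "+ 1"
 * appears to assume a single CPU node ahead of the GPU nodes.
 */
int fixup_gpu_node_id(int cpu_node_id, uint64_t ipid,
		      const struct local_node_map *map)
{
	int nid = (int)((ipid >> 44) & 0xf);

	if (nid < map->base_node_id)
		return cpu_node_id;

	return nid - map->base_node_id + 1;
}
```

With an assumed register value of 0x00020008 (two GPU nodes, base node ID 8), hardware node 8 maps to logical node 1 and hardware node 9 to logical node 2, in line with the CPU0/GPU0/GPU1 example in the commit message.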