[1/2] x86/mce: Simplify AMD severity grading logic

Message ID	20220405183212.354606-2-carlos.bilbao@amd.com (mailing list archive)
State	New, archived
Series	x86/mce: Simplify AMD MCEs severity grading and include messages for panic cases
	[0/2] x86/mce: Simplify AMD MCEs severity grading and include messages for panic cases
	[1/2] x86/mce: Simplify AMD severity grading logic
	[2/2] x86/mce: Add messages for panic errors in AMD's MCE grading

Headers

Received-SPF: Pass (protection.outlook.com: domain of amd.com designates
 165.204.84.17 as permitted sender) receiver=protection.outlook.com;
 client-ip=165.204.84.17; helo=SATLEXMB04.amd.com;
From: Carlos Bilbao <carlos.bilbao@amd.com>
To: <bp@alien8.de>, <yazen.ghannam@amd.com>
CC: <tglx@linutronix.de>, <mingo@redhat.com>,
        <dave.hansen@linux.intel.com>, <x86@kernel.org>,
        <linux-kernel@vger.kernel.org>, <linux-edac@vger.kernel.org>,
        <bilbao@vt.edu>, Carlos Bilbao <carlos.bilbao@amd.com>
Subject: [PATCH 1/2] x86/mce: Simplify AMD severity grading logic
Date: Tue, 5 Apr 2022 13:32:13 -0500
Message-ID: <20220405183212.354606-2-carlos.bilbao@amd.com>
In-Reply-To: <20220405183212.354606-1-carlos.bilbao@amd.com>
References: <20220405183212.354606-1-carlos.bilbao@amd.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 05 Apr 2022 18:32:49.2543
 (UTC)
X-MS-Exchange-CrossTenant-Network-Message-Id: 
 6b059b13-5bde-4a8a-f748-08da1732afe4
X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d
X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: 
 TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[SATLEXMB04.amd.com]
X-MS-Exchange-CrossTenant-AuthSource: 
 BN8NAM11FT065.eop-nam11.prod.protection.outlook.com
X-MS-Exchange-CrossTenant-AuthAs: Anonymous
X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem
X-MS-Exchange-Transport-CrossTenantHeadersStamped: DM8PR12MB5477
Precedence: bulk
List-ID: <linux-edac.vger.kernel.org>
X-Mailing-List: linux-edac@vger.kernel.org

Commit Message

Carlos Bilbao April 5, 2022, 6:32 p.m. UTC

The MCE handler needs to understand the severity of machine errors in order
to act accordingly. Simplify the AMD grading logic so that it closely follows
the descriptions in the public PPR documents. This will make it easier to
add more fine-grained grading of errors in the future.

Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
---
 arch/x86/kernel/cpu/mce/severity.c | 104 +++++++++++------------------
 1 file changed, 39 insertions(+), 65 deletions(-)

Comments

Yazen Ghannam April 10, 2022, 1:04 p.m. UTC | #1

On Tue, Apr 05, 2022 at 01:32:13PM -0500, Carlos Bilbao wrote:

...

>  /*
> - * See AMD Error Scope Hierarchy table in a newer BKDG. For example
> - * 49125_15h_Models_30h-3Fh_BKDG.pdf, section "RAS Features"
> + * See AMD PPR(s) section Machine Check Error Handling
>   */

This is now a single-line comment, so the /* */ should be adjusted. This is a
minor issue, so please wait for further review by others before sending
another revision, if needed.

Otherwise, the patch looks good to me.

Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>

Thanks!

-Yazen

Patch

diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index 1add86935349..25aec5a27899 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -301,85 +301,59 @@  static noinstr int error_context(struct mce *m, struct pt_regs *regs)
 	}
 }
 
-static __always_inline int mce_severity_amd_smca(struct mce *m, enum context err_ctx)
-{
-	u64 mcx_cfg;
-
-	/*
-	 * We need to look at the following bits:
-	 * - "succor" bit (data poisoning support), and
-	 * - TCC bit (Task Context Corrupt)
-	 * in MCi_STATUS to determine error severity.
-	 */
-	if (!mce_flags.succor)
-		return MCE_PANIC_SEVERITY;
-
-	mcx_cfg = mce_rdmsrl(MSR_AMD64_SMCA_MCx_CONFIG(m->bank));
-
-	/* TCC (Task context corrupt). If set and if IN_KERNEL, panic. */
-	if ((mcx_cfg & MCI_CONFIG_MCAX) &&
-	    (m->status & MCI_STATUS_TCC) &&
-	    (err_ctx == IN_KERNEL))
-		return MCE_PANIC_SEVERITY;
-
-	 /* ...otherwise invoke hwpoison handler. */
-	return MCE_AR_SEVERITY;
-}
-
 /*
- * See AMD Error Scope Hierarchy table in a newer BKDG. For example
- * 49125_15h_Models_30h-3Fh_BKDG.pdf, section "RAS Features"
+ * See AMD PPR(s) section Machine Check Error Handling
  */
 static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
 {
-	enum context ctx = error_context(m, regs);
-
-	/* Processor Context Corrupt, no need to fumble too much, die! */
-	if (m->status & MCI_STATUS_PCC)
-		return MCE_PANIC_SEVERITY;
+	int ret;
 
-	if (m->status & MCI_STATUS_UC) {
+	/*
+	 * Default return value: Action required, the error must be handled
+	 * immediately.
+	 */
+	ret = MCE_AR_SEVERITY;
 
-		if (ctx == IN_KERNEL)
-			return MCE_PANIC_SEVERITY;
+	/* Processor Context Corrupt, no need to fumble too much, die! */
+	if (m->status & MCI_STATUS_PCC) {
+		ret = MCE_PANIC_SEVERITY;
+		goto out_amd_severity;
+	}
 
-		/*
-		 * On older systems where overflow_recov flag is not present, we
-		 * should simply panic if an error overflow occurs. If
-		 * overflow_recov flag is present and set, then software can try
-		 * to at least kill process to prolong system operation.
-		 */
-		if (mce_flags.overflow_recov) {
-			if (mce_flags.smca)
-				return mce_severity_amd_smca(m, ctx);
-
-			/* kill current process */
-			return MCE_AR_SEVERITY;
-		} else {
-			/* at least one error was not logged */
-			if (m->status & MCI_STATUS_OVER)
-				return MCE_PANIC_SEVERITY;
-		}
-
-		/*
-		 * For any other case, return MCE_UC_SEVERITY so that we log the
-		 * error and exit #MC handler.
-		 */
-		return MCE_UC_SEVERITY;
+	if (m->status & MCI_STATUS_DEFERRED) {
+		ret = MCE_DEFERRED_SEVERITY;
+		goto out_amd_severity;
 	}
 
 	/*
-	 * deferred error: poll handler catches these and adds to mce_ring so
-	 * memory-failure can take recovery actions.
+	 * If the UC bit is not set, the system either corrected or deferred
+	 * the error. No action will be required after logging the error.
 	 */
-	if (m->status & MCI_STATUS_DEFERRED)
-		return MCE_DEFERRED_SEVERITY;
+	if (!(m->status & MCI_STATUS_UC)) {
+		ret = MCE_KEEP_SEVERITY;
+		goto out_amd_severity;
+	}
 
 	/*
-	 * corrected error: poll handler catches these and passes responsibility
-	 * of decoding the error to EDAC
+	 * On MCA Overflow, without the MCA Overflow recovery feature the
+	 * system will not be able to recover.
 	 */
-	return MCE_KEEP_SEVERITY;
+	if ((m->status & MCI_STATUS_OVER) && !mce_flags.overflow_recov) {
+		ret = MCE_PANIC_SEVERITY;
+		goto out_amd_severity;
+	}
+
+	if (!mce_flags.succor) {
+		ret = MCE_PANIC_SEVERITY;
+		goto out_amd_severity;
+	}
+
+	if (error_context(m, regs) == IN_KERNEL)
+		ret = MCE_PANIC_SEVERITY;
+
+out_amd_severity:
+
+	return ret;
 }
 
 static noinstr int mce_severity_intel(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
