From patchwork Tue Oct 15 07:22:35 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Zhuo, Qiuxu"
X-Patchwork-Id: 13835822
From: Qiuxu Zhuo
To: Tony Luck
Cc: Qiuxu Zhuo, Borislav Petkov, James Morse, Mauro Carvalho Chehab,
 Robert Richter, Diego Garcia Rodriguez, linux-edac@vger.kernel.org,
 linux-kernel@vger.kernel.org
Subject: [PATCH 1/2] EDAC/skx_common: Distinguish the memory error source
Date: Tue, 15 Oct 2024 15:22:35 +0800
Message-Id: <20241015072236.24543-2-qiuxu.zhuo@intel.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20241015072236.24543-1-qiuxu.zhuo@intel.com>
References: <20241015072236.24543-1-qiuxu.zhuo@intel.com>
Precedence: bulk
X-Mailing-List: linux-edac@vger.kernel.org

The current skx_common determines whether the memory error source is the
near memory of the 2LM system and then retrieves the decoded error results
from the ADXL components (near-memory vs. non-near-memory) accordingly.

However, some memory controllers may have limitations in correctly
reporting the memory error source, leading to the retrieval of incorrect
decoded parts from the ADXL.

To address these limitations, instead of simply determining whether the
memory error is from the near memory of the 2LM system, it is necessary to
distinguish the memory error source details as follows:

  Memory error from the near memory of the 2LM system.
  Memory error from the far memory of the 2LM system.
  Memory error from the 1LM system.
  Not a memory error.

This will enable the i10nm_edac driver to take appropriate actions for
those memory controllers that have limitations in reporting the memory
error source.

Fixes: ba987eaaabf9 ("EDAC/i10nm: Add Intel Granite Rapids server support")
Tested-by: Diego Garcia Rodriguez
Signed-off-by: Qiuxu Zhuo
---
 drivers/edac/skx_common.c | 34 ++++++++++++++++------------------
 drivers/edac/skx_common.h |  7 +++++++
 2 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c
index 85713646957b..52b462899870 100644
--- a/drivers/edac/skx_common.c
+++ b/drivers/edac/skx_common.c
@@ -119,7 +119,7 @@ void skx_adxl_put(void)
 }
 EXPORT_SYMBOL_GPL(skx_adxl_put);
 
-static bool skx_adxl_decode(struct decoded_addr *res, bool error_in_1st_level_mem)
+static bool skx_adxl_decode(struct decoded_addr *res, enum error_source err_src)
 {
 	struct skx_dev *d;
 	int i, len = 0;
@@ -136,7 +136,7 @@ static bool skx_adxl_decode(struct decoded_addr *res, bool error_in_1st_level_me
 	}
 
 	res->socket = (int)adxl_values[component_indices[INDEX_SOCKET]];
-	if (error_in_1st_level_mem) {
+	if (err_src == ERR_SRC_2LM_NM) {
 		res->imc = (adxl_nm_bitmap & BIT_NM_MEMCTRL) ?
 			   (int)adxl_values[component_indices[INDEX_NM_MEMCTRL]] : -1;
 		res->channel = (adxl_nm_bitmap & BIT_NM_CHANNEL) ?
@@ -620,31 +620,27 @@ static void skx_mce_output_error(struct mem_ctl_info *mci,
 			     optype, skx_msg);
 }
 
-static bool skx_error_in_1st_level_mem(const struct mce *m)
+static enum error_source skx_error_source(const struct mce *m)
 {
-	u32 errcode;
+	u32 errcode = GET_BITFIELD(m->status, 0, 15) & MCACOD_MEM_ERR_MASK;
+
+	if (errcode != MCACOD_MEM_CTL_ERR && errcode != MCACOD_EXT_MEM_ERR)
+		return ERR_SRC_NOT_MEMORY;
 
 	if (!skx_mem_cfg_2lm)
-		return false;
+		return ERR_SRC_1LM;
 
-	errcode = GET_BITFIELD(m->status, 0, 15) & MCACOD_MEM_ERR_MASK;
+	if (errcode == MCACOD_EXT_MEM_ERR)
+		return ERR_SRC_2LM_NM;
 
-	return errcode == MCACOD_EXT_MEM_ERR;
-}
-
-static bool skx_error_in_mem(const struct mce *m)
-{
-	u32 errcode;
-
-	errcode = GET_BITFIELD(m->status, 0, 15) & MCACOD_MEM_ERR_MASK;
-
-	return (errcode == MCACOD_MEM_CTL_ERR || errcode == MCACOD_EXT_MEM_ERR);
+	return ERR_SRC_2LM_FM;
 }
 
 int skx_mce_check_error(struct notifier_block *nb, unsigned long val,
 			void *data)
 {
 	struct mce *mce = (struct mce *)data;
+	enum error_source err_src;
 	struct decoded_addr res;
 	struct mem_ctl_info *mci;
 	char *type;
@@ -652,8 +648,10 @@ int skx_mce_check_error(struct notifier_block *nb, unsigned long val,
 	if (mce->kflags & MCE_HANDLED_CEC)
 		return NOTIFY_DONE;
 
+	err_src = skx_error_source(mce);
+
 	/* Ignore unless this is memory related with an address */
-	if (!skx_error_in_mem(mce) || !(mce->status & MCI_STATUS_ADDRV))
+	if (err_src == ERR_SRC_NOT_MEMORY || !(mce->status & MCI_STATUS_ADDRV))
 		return NOTIFY_DONE;
 
 	memset(&res, 0, sizeof(res));
@@ -667,7 +665,7 @@ int skx_mce_check_error(struct notifier_block *nb, unsigned long val,
 	/* Try driver decoder first */
 	if (!(driver_decode && driver_decode(&res))) {
 		/* Then try firmware decoder (ACPI DSM methods) */
-		if (!(adxl_component_count && skx_adxl_decode(&res, skx_error_in_1st_level_mem(mce))))
+		if (!(adxl_component_count && skx_adxl_decode(&res, err_src)))
 			return NOTIFY_DONE;
 	}
 
diff --git a/drivers/edac/skx_common.h b/drivers/edac/skx_common.h
index f945c1bf5ca4..cd47f8186831 100644
--- a/drivers/edac/skx_common.h
+++ b/drivers/edac/skx_common.h
@@ -146,6 +146,13 @@ enum {
 	INDEX_MAX
 };
 
+enum error_source {
+	ERR_SRC_1LM,
+	ERR_SRC_2LM_NM,
+	ERR_SRC_2LM_FM,
+	ERR_SRC_NOT_MEMORY,
+};
+
 #define BIT_NM_MEMCTRL	BIT_ULL(INDEX_NM_MEMCTRL)
 #define BIT_NM_CHANNEL	BIT_ULL(INDEX_NM_CHANNEL)
 #define BIT_NM_DIMM	BIT_ULL(INDEX_NM_DIMM)

From patchwork Tue Oct 15 07:22:36 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Zhuo, Qiuxu"
X-Patchwork-Id: 13835823
From: Qiuxu Zhuo
To: Tony Luck
Cc: Qiuxu Zhuo, Borislav Petkov, James Morse, Mauro Carvalho Chehab,
 Robert Richter, Diego Garcia Rodriguez, linux-edac@vger.kernel.org,
 linux-kernel@vger.kernel.org
Subject: [PATCH 2/2] EDAC/{skx_common,i10nm}: Fix incorrect far-memory error source indicator
Date: Tue, 15 Oct 2024 15:22:36 +0800
Message-Id: <20241015072236.24543-3-qiuxu.zhuo@intel.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To:
 <20241015072236.24543-1-qiuxu.zhuo@intel.com>
References: <20241015072236.24543-1-qiuxu.zhuo@intel.com>
Precedence: bulk
X-Mailing-List: linux-edac@vger.kernel.org

The Granite Rapids CPUs with Flat2LM memory configurations may mistakenly
report near-memory errors as far-memory errors, resulting in the invalid
decoded ADXL results:

  EDAC skx: Bad imc -1

Fix this incorrect far-memory error source indicator by prefetching the
decoded far-memory controller ID, and adjust the error source indicator
to near-memory if the far-memory controller ID is invalid.

Fixes: ba987eaaabf9 ("EDAC/i10nm: Add Intel Granite Rapids server support")
Tested-by: Diego Garcia Rodriguez
Signed-off-by: Qiuxu Zhuo
---
 drivers/edac/i10nm_base.c |  1 +
 drivers/edac/skx_common.c | 23 +++++++++++++++++++++++
 drivers/edac/skx_common.h |  1 +
 3 files changed, 25 insertions(+)

diff --git a/drivers/edac/i10nm_base.c b/drivers/edac/i10nm_base.c
index e2a954de913b..51556c72a967 100644
--- a/drivers/edac/i10nm_base.c
+++ b/drivers/edac/i10nm_base.c
@@ -1036,6 +1036,7 @@ static int __init i10nm_init(void)
 		return -ENODEV;
 
 	cfg = (struct res_config *)id->driver_data;
+	skx_set_res_cfg(cfg);
 	res_cfg = cfg;
 
 	rc = skx_get_hi_lo(0x09a2, off, &tolm, &tohm);
diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c
index 52b462899870..6cf17af7d911 100644
--- a/drivers/edac/skx_common.c
+++ b/drivers/edac/skx_common.c
@@ -47,6 +47,7 @@ static skx_show_retry_log_f skx_show_retry_rd_err_log;
 static u64 skx_tolm, skx_tohm;
 static LIST_HEAD(dev_edac_list);
 static bool skx_mem_cfg_2lm;
+static struct res_config *skx_res_cfg;
 
 int skx_adxl_get(void)
 {
@@ -135,6 +136,22 @@ static bool skx_adxl_decode(struct decoded_addr *res, enum error_source err_src)
 		return false;
 	}
 
+	/*
+	 * GNR with a Flat2LM memory configuration may mistakenly classify
+	 * a near-memory error(DDR5) as a far-memory error(CXL), resulting
+	 * in the incorrect selection of decoded ADXL components.
+	 * To address this, prefetch the decoded far-memory controller ID
+	 * and adjust the error source to near-memory if the far-memory
+	 * controller ID is invalid.
+	 */
+	if (skx_res_cfg && skx_res_cfg->type == GNR && err_src == ERR_SRC_2LM_FM) {
+		res->imc = (int)adxl_values[component_indices[INDEX_MEMCTRL]];
+		if (res->imc == -1) {
+			err_src = ERR_SRC_2LM_NM;
+			edac_dbg(0, "Adjust the error source to near-memory.\n");
+		}
+	}
+
 	res->socket = (int)adxl_values[component_indices[INDEX_SOCKET]];
 	if (err_src == ERR_SRC_2LM_NM) {
 		res->imc = (adxl_nm_bitmap & BIT_NM_MEMCTRL) ?
@@ -191,6 +208,12 @@ void skx_set_mem_cfg(bool mem_cfg_2lm)
 }
 EXPORT_SYMBOL_GPL(skx_set_mem_cfg);
 
+void skx_set_res_cfg(struct res_config *cfg)
+{
+	skx_res_cfg = cfg;
+}
+EXPORT_SYMBOL_GPL(skx_set_res_cfg);
+
 void skx_set_decode(skx_decode_f decode, skx_show_retry_log_f show_retry_log)
 {
 	driver_decode = decode;
diff --git a/drivers/edac/skx_common.h b/drivers/edac/skx_common.h
index cd47f8186831..54bba8a62f72 100644
--- a/drivers/edac/skx_common.h
+++ b/drivers/edac/skx_common.h
@@ -241,6 +241,7 @@ int skx_adxl_get(void);
 void skx_adxl_put(void);
 void skx_set_decode(skx_decode_f decode, skx_show_retry_log_f show_retry_log);
 void skx_set_mem_cfg(bool mem_cfg_2lm);
+void skx_set_res_cfg(struct res_config *cfg);
 int skx_get_src_id(struct skx_dev *d, int off, u8 *id);
 int skx_get_node_id(struct skx_dev *d, u8 *id);
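
Illustration only, not part of the series: the decision order introduced by
skx_error_source() in patch 1/2 and the Granite Rapids fallback added to
skx_adxl_decode() in patch 2/2 can be exercised in user space with a small
sketch like the one below. The function names classify_error_source(),
adjust_gnr_error_source() and self_check() are hypothetical, and the numeric
values of the MCACOD_* constants are assumptions, since neither patch shows
their definitions. Only the branch order and the enum come from the diffs
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ASSUMED values for illustration; the patches do not show these defines. */
#define MCACOD_MEM_ERR_MASK	0xef80u
#define MCACOD_MEM_CTL_ERR	0x0080u
#define MCACOD_EXT_MEM_ERR	0x0280u

/* Same enumerators as the enum added to skx_common.h in patch 1/2. */
enum error_source {
	ERR_SRC_1LM,
	ERR_SRC_2LM_NM,
	ERR_SRC_2LM_FM,
	ERR_SRC_NOT_MEMORY,
};

/*
 * Hypothetical stand-in for skx_error_source(): takes the raw MCi_STATUS
 * value and the 2LM flag as parameters instead of reading driver state.
 */
enum error_source classify_error_source(uint64_t status, bool mem_cfg_2lm)
{
	/* Bits 15:0 of the status word hold the MCA error code. */
	uint32_t errcode = (uint32_t)(status & 0xffffu) & MCACOD_MEM_ERR_MASK;

	if (errcode != MCACOD_MEM_CTL_ERR && errcode != MCACOD_EXT_MEM_ERR)
		return ERR_SRC_NOT_MEMORY;

	if (!mem_cfg_2lm)
		return ERR_SRC_1LM;

	if (errcode == MCACOD_EXT_MEM_ERR)
		return ERR_SRC_2LM_NM;

	return ERR_SRC_2LM_FM;
}

/*
 * Hypothetical stand-in for the GNR check in skx_adxl_decode(): a reported
 * far-memory error whose decoded far-memory controller ID is -1 is treated
 * as a near-memory error instead.
 */
enum error_source adjust_gnr_error_source(enum error_source src, bool is_gnr,
					  int fm_imc)
{
	if (is_gnr && src == ERR_SRC_2LM_FM && fm_imc == -1)
		return ERR_SRC_2LM_NM;

	return src;
}

/* A few spot checks that double as a usage example. */
void self_check(void)
{
	assert(classify_error_source(0x0000, true) == ERR_SRC_NOT_MEMORY);
	assert(classify_error_source(0x0080, false) == ERR_SRC_1LM);
	assert(classify_error_source(0x0080, true) == ERR_SRC_2LM_FM);
	assert(classify_error_source(0x0280, true) == ERR_SRC_2LM_NM);
	assert(adjust_gnr_error_source(ERR_SRC_2LM_FM, true, -1) == ERR_SRC_2LM_NM);
	assert(adjust_gnr_error_source(ERR_SRC_2LM_FM, true, 2) == ERR_SRC_2LM_FM);
}
```

Passing the 2LM flag and the decoded controller ID as parameters keeps the
sketch free of kernel state. In the driver these come from skx_mem_cfg_2lm,
skx_res_cfg and adxl_values[], as shown in the diffs.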