diff mbox series

[RFC,6/6] cxl: Add mce notifier to emit aliased address for extended linear cache

Message ID 20240927142108.1156362-7-dave.jiang@intel.com (mailing list archive)
State Handled Elsewhere, archived
Headers show
Series acpi/hmat / cxl: Add exclusive caching enumeration and RAS support | expand

Commit Message

Dave Jiang Sept. 27, 2024, 2:16 p.m. UTC
Below is a setup with extended linear cache configuration with an example
layout of of memory region shown below presented as a single memory region
consists of 256G memory where there's 128G of DRAM and 128G of CXL memory.
The kernel sees a region of total 256G of system memory.

              128G DRAM                          128G CXL memory
|-----------------------------------|-------------------------------------|

Data resides in either DRAM or far memory (FM) with no replication. Hot data
is swapped into DRAM by the hardware behind the scenes. When error is detected
in one location, it is possible that error also resides in the aliased
location. Therefore when a memory location that is flagged by MCE is part of
the special region, the aliased memory location needs to be offlined as well.

Add an mce notify callback to identify if the MCE address location is part of
an extended linear cache region and handle accordingly.

Added symbol export to set_mce_nospec() in x86 code in order to call
set_mce_nospec() from the CXL MCE notify callback.

Link: https://lore.kernel.org/linux-cxl/668333b17e4b2_5639294fd@dwillia2-xfh.jf.intel.com.notmuch/
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 arch/x86/include/asm/mce.h   |  1 +
 arch/x86/mm/pat/set_memory.c |  1 +
 drivers/cxl/core/mbox.c      | 45 ++++++++++++++++++++++++++++++++++++
 drivers/cxl/core/region.c    | 25 ++++++++++++++++++++
 drivers/cxl/cxl.h            |  6 +++++
 drivers/cxl/cxlmem.h         |  2 ++
 6 files changed, 80 insertions(+)

Comments

Jonathan Cameron Oct. 17, 2024, 4:40 p.m. UTC | #1
On Fri, 27 Sep 2024 07:16:58 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> Below is a setup with extended linear cache configuration with an example
> layout of of memory region shown below presented as a single memory region
> consists of 256G memory where there's 128G of DRAM and 128G of CXL memory.
> The kernel sees a region of total 256G of system memory.
> 
>               128G DRAM                          128G CXL memory
> |-----------------------------------|-------------------------------------|
> 
> Data resides in either DRAM or far memory (FM) with no replication. Hot data
> is swapped into DRAM by the hardware behind the scenes. When error is detected
> in one location, it is possible that error also resides in the aliased
> location. Therefore when a memory location that is flagged by MCE is part of
> the special region, the aliased memory location needs to be offlined as well.
> 
> Add an mce notify callback to identify if the MCE address location is part of
> an extended linear cache region and handle accordingly.
> 
> Added symbol export to set_mce_nospec() in x86 code in order to call
> set_mce_nospec() from the CXL MCE notify callback.

Whilst not commenting on whether any other implementation might exist,
this code should be written to be arch independent at some level.

> 
> Link: https://lore.kernel.org/linux-cxl/668333b17e4b2_5639294fd@dwillia2-xfh.jf.intel.com.notmuch/
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Dave Jiang Oct. 30, 2024, 11:37 p.m. UTC | #2
On 10/17/24 9:40 AM, Jonathan Cameron wrote:
> On Fri, 27 Sep 2024 07:16:58 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
> 
>> Below is a setup with extended linear cache configuration with an example
>> layout of of memory region shown below presented as a single memory region
>> consists of 256G memory where there's 128G of DRAM and 128G of CXL memory.
>> The kernel sees a region of total 256G of system memory.
>>
>>               128G DRAM                          128G CXL memory
>> |-----------------------------------|-------------------------------------|
>>
>> Data resides in either DRAM or far memory (FM) with no replication. Hot data
>> is swapped into DRAM by the hardware behind the scenes. When error is detected
>> in one location, it is possible that error also resides in the aliased
>> location. Therefore when a memory location that is flagged by MCE is part of
>> the special region, the aliased memory location needs to be offlined as well.
>>
>> Add an mce notify callback to identify if the MCE address location is part of
>> an extended linear cache region and handle accordingly.
>>
>> Added symbol export to set_mce_nospec() in x86 code in order to call
>> set_mce_nospec() from the CXL MCE notify callback.
> 
> Whilst not commenting on whether any other implementation might exist,
> this code should be written to be arch independent at some level.

I did get a 0-day report on this with mce bits. But with asm/mce.h included, it seems to make other archs happy as well AFAICT.
> 
>>
>> Link: https://lore.kernel.org/linux-cxl/668333b17e4b2_5639294fd@dwillia2-xfh.jf.intel.com.notmuch/
>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>
Dave Jiang Oct. 31, 2024, 9:12 p.m. UTC | #3
On 10/30/24 4:37 PM, Dave Jiang wrote:
> 
> 
> On 10/17/24 9:40 AM, Jonathan Cameron wrote:
>> On Fri, 27 Sep 2024 07:16:58 -0700
>> Dave Jiang <dave.jiang@intel.com> wrote:
>>
>>> Below is a setup with extended linear cache configuration with an example
>>> layout of of memory region shown below presented as a single memory region
>>> consists of 256G memory where there's 128G of DRAM and 128G of CXL memory.
>>> The kernel sees a region of total 256G of system memory.
>>>
>>>               128G DRAM                          128G CXL memory
>>> |-----------------------------------|-------------------------------------|
>>>
>>> Data resides in either DRAM or far memory (FM) with no replication. Hot data
>>> is swapped into DRAM by the hardware behind the scenes. When error is detected
>>> in one location, it is possible that error also resides in the aliased
>>> location. Therefore when a memory location that is flagged by MCE is part of
>>> the special region, the aliased memory location needs to be offlined as well.
>>>
>>> Add an mce notify callback to identify if the MCE address location is part of
>>> an extended linear cache region and handle accordingly.
>>>
>>> Added symbol export to set_mce_nospec() in x86 code in order to call
>>> set_mce_nospec() from the CXL MCE notify callback.
>>
>> Whilst not commenting on whether any other implementation might exist,
>> this code should be written to be arch independent at some level.
> 
> I did get a 0-day report on this with mce bits. But with asm/mce.h included, it seems to make other archs happy as well AFAICT.

Ok I was wrong. Arch wrappers needed to deal with MCE bits only exists for x86. 

>>
>>>
>>> Link: https://lore.kernel.org/linux-cxl/668333b17e4b2_5639294fd@dwillia2-xfh.jf.intel.com.notmuch/
>>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>>
> 
>
diff mbox series

Patch

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 3ad29b128943..5da45e870858 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -182,6 +182,7 @@  enum mce_notifier_prios {
 	MCE_PRIO_NFIT,
 	MCE_PRIO_EXTLOG,
 	MCE_PRIO_UC,
+	MCE_PRIO_CXL,
 	MCE_PRIO_EARLY,
 	MCE_PRIO_CEC,
 	MCE_PRIO_HIGHEST = MCE_PRIO_CEC
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 44f7b2ea6a07..1f85c29e118e 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2083,6 +2083,7 @@  int set_mce_nospec(unsigned long pfn)
 		pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
 	return rc;
 }
+EXPORT_SYMBOL_GPL(set_mce_nospec);
 
 /* Restore full speculative operation to the pfn. */
 int clear_mce_nospec(unsigned long pfn)
diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
index ac170fd85a1a..4488f30abc64 100644
--- a/drivers/cxl/core/mbox.c
+++ b/drivers/cxl/core/mbox.c
@@ -4,6 +4,9 @@ 
 #include <linux/debugfs.h>
 #include <linux/ktime.h>
 #include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/set_memory.h>
+#include <asm/mce.h>
 #include <asm/unaligned.h>
 #include <cxlpci.h>
 #include <cxlmem.h>
@@ -1444,6 +1447,44 @@  int cxl_poison_state_init(struct cxl_memdev_state *mds)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_poison_state_init, CXL);
 
+static int cxl_handle_mce(struct notifier_block *nb, unsigned long val,
+			  void *data)
+{
+	struct cxl_memdev_state *mds = container_of(nb, struct cxl_memdev_state,
+						    mce_notifier);
+	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
+	struct cxl_port *endpoint = cxlmd->endpoint;
+	struct mce *mce = (struct mce *)data;
+	u64 spa, spa_alias;
+	unsigned long pfn;
+
+	if (!mce || !mce_usable_address(mce))
+		return NOTIFY_DONE;
+
+	spa = mce->addr & MCI_ADDR_PHYSADDR;
+
+	pfn = spa >> PAGE_SHIFT;
+	if (!pfn_valid(pfn))
+		return NOTIFY_DONE;
+
+	spa_alias = cxl_port_get_spa_cache_alias(endpoint, spa);
+	if (!spa_alias)
+		return NOTIFY_DONE;
+
+	pfn = spa_alias >> PAGE_SHIFT;
+
+	/*
+	 * Take down the aliased memory page. The original memory page flagged
+	 * by the MCE will be taken cared of by the standard MCE handler.
+	 */
+	dev_emerg(mds->cxlds.dev, "Offlining aliased SPA address: %#llx\n",
+		  spa_alias);
+	if (!memory_failure(pfn, 0))
+		set_mce_nospec(pfn);
+
+	return NOTIFY_OK;
+}
+
 struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
 {
 	struct cxl_memdev_state *mds;
@@ -1463,6 +1504,10 @@  struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
 	mds->ram_perf.qos_class = CXL_QOS_CLASS_INVALID;
 	mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID;
 
+	mds->mce_notifier.notifier_call = cxl_handle_mce;
+	mds->mce_notifier.priority = MCE_PRIO_CXL;
+	mce_register_decode_chain(&mds->mce_notifier);
+
 	return mds;
 }
 EXPORT_SYMBOL_NS_GPL(cxl_memdev_state_create, CXL);
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index c19bbbf8079d..a60af9763a95 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -3420,6 +3420,31 @@  int cxl_add_to_region(struct cxl_port *root, struct cxl_endpoint_decoder *cxled)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_add_to_region, CXL);
 
+u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint, u64 spa)
+{
+	struct cxl_region_ref *iter;
+	unsigned long index;
+
+	guard(rwsem_write)(&cxl_region_rwsem);
+
+	xa_for_each(&endpoint->regions, index, iter) {
+		struct cxl_region_params *p = &iter->region->params;
+
+		if (p->res->start <= spa && spa <= p->res->end) {
+			if (!p->cache_size)
+				return 0;
+
+			if (spa > p->res->start + p->cache_size)
+				return spa - p->cache_size;
+
+			return spa + p->cache_size;
+		}
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_port_get_spa_cache_alias, CXL);
+
 static int is_system_ram(struct resource *res, void *arg)
 {
 	struct cxl_region *cxlr = arg;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index d8d715090779..8516f6da620c 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -864,6 +864,7 @@  struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev);
 int cxl_add_to_region(struct cxl_port *root,
 		      struct cxl_endpoint_decoder *cxled);
 struct cxl_dax_region *to_cxl_dax_region(struct device *dev);
+u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint, u64 spa);
 #else
 static inline bool is_cxl_pmem_region(struct device *dev)
 {
@@ -882,6 +883,11 @@  static inline struct cxl_dax_region *to_cxl_dax_region(struct device *dev)
 {
 	return NULL;
 }
+static inline u64 cxl_port_get_spa_cache_alias(struct cxl_port *endpoint,
+					       u64 spa)
+{
+	return 0;
+}
 #endif
 
 void cxl_endpoint_parse_cdat(struct cxl_port *port);
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index afb53d058d62..46515d2a49cb 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -470,6 +470,7 @@  struct cxl_dev_state {
  * @poison: poison driver state info
  * @security: security driver state info
  * @fw: firmware upload / activation state
+ * @mce_notifier: MCE notifier
  * @mbox_wait: RCU wait for mbox send completely
  * @mbox_send: @dev specific transport for transmitting mailbox commands
  *
@@ -500,6 +501,7 @@  struct cxl_memdev_state {
 	struct cxl_poison_state poison;
 	struct cxl_security_state security;
 	struct cxl_fw_state fw;
+	struct notifier_block mce_notifier;
 
 	struct rcuwait mbox_wait;
 	int (*mbox_send)(struct cxl_memdev_state *mds,