diff mbox series

[v7,3/7] x86/sgx: Initial poison handling for dirty and free pages

Message ID 20210927213452.212470-4-tony.luck@intel.com (mailing list archive)
State New, archived
Headers show
Series Basic recovery for machine checks inside SGX | expand

Commit Message

Luck, Tony Sept. 27, 2021, 9:34 p.m. UTC
A memory controller patrol scrubber can report poison in a page
that isn't currently being used.

Add "poison" field in the sgx_epc_page that can be set for an
sgx_epc_page. Check for it:
1) When sanitizing dirty pages
2) When freeing epc pages

Poison is a new field separated from flags to avoid having to make
all updates to flags atomic, or integrate poison state changes into
some other locking scheme to protect flags.

In both cases place the poisoned page on a list of poisoned epc pages
to make sure it will not be reallocated.

Add debugfs files /sys/kernel/debug/sgx/poison_page_list so that system
administrators get a list of those pages that have been dropped because
of poison.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 31 ++++++++++++++++++++++++++++++-
 arch/x86/kernel/cpu/sgx/sgx.h  |  3 ++-
 2 files changed, 32 insertions(+), 2 deletions(-)

Comments

Jarkko Sakkinen Sept. 28, 2021, 2:46 a.m. UTC | #1
On Mon, 2021-09-27 at 14:34 -0700, Tony Luck wrote:
> A memory controller patrol scrubber can report poison in a page
> that isn't currently being used.
> 
> Add "poison" field in the sgx_epc_page that can be set for an
> sgx_epc_page. Check for it:
> 1) When sanitizing dirty pages
> 2) When freeing epc pages
> 
> Poison is a new field separated from flags to avoid having to make
> all updates to flags atomic, or integrate poison state changes into
> some other locking scheme to protect flags.
> 
> In both cases place the poisoned page on a list of poisoned epc pages
> to make sure it will not be reallocated.
> 
> Add debugfs files /sys/kernel/debug/sgx/poison_page_list so that system
> administrators get a list of those pages that have been dropped because
> of poison.

So, what would a sysadmin do with that detailed information?

I would decrease the granularity a bit rather add something like
this for each node:

	/sys/devices/system/node/node[0-9]*/sgx/poisoned_size

which would give the total amount of poisoned memory in bytes
for that node.

See the series that I've recently posted:

https://lore.kernel.org/linux-sgx/20210914030422.377601-1-jarkko@kernel.org/T/#t

/Jarkko
Luck, Tony Sept. 28, 2021, 3:41 p.m. UTC | #2
>> Add debugfs files /sys/kernel/debug/sgx/poison_page_list so that system
>> administrators get a list of those pages that have been dropped because
>> of poison.
>
> So, what would a sysadmin do with that detailed information?

It's going to be a rare case that there are any poisoned pages on that list
(a large enough cluster will have some systems that have uncorrected
recoverable errors in SGX EPC memory).

Even when there are some poisoned pages, there will only be a few. Systems
that have thousands of pages with uncorrected memory errors will surely crash
because one of those errors is going to either trigger an error marked as fatal,
or the error won’t be recoverable by Linux because it is in kernel memory.

A sysadmin might add a script to run during system shutdown (or periodically
during run-time) to save the poison page list. Then at startup run:

for addr in `cat saved_sgx_poison_page_list`
do
	echo $addr > /sys/devices/system/memory/hard_offline_page
done

to make poison persistent across reboots.

-Tony
Jarkko Sakkinen Sept. 28, 2021, 8:11 p.m. UTC | #3
On Tue, 2021-09-28 at 15:41 +0000, Luck, Tony wrote:
> > > Add debugfs files /sys/kernel/debug/sgx/poison_page_list so that system
> > > administrators get a list of those pages that have been dropped because
> > > of poison.
> > 
> > So, what would a sysadmin do with that detailed information?
> 
> It's going to be a rare case that there are any poisoned pages on that list
> (a large enough cluster will have some systems that have uncorrected
> recoverable errors in SGX EPC memory).
> 
> Even when there are some poisoned pages, there will only be a few. Systems
> that have thousands of pages with uncorrected memory errors will surely crash
> because one of those errors is going to either trigger an error marked as fatal,
> or the error won’t be recoverable by Linux because it is in kernel memory.
> 
> A sysadmin might add a script to run during system shutdown (or periodically
> during run-time) to save the poison page list. Then at startup run:
> 
> for addr in `cat saved_sgx_poison_page_list`
> do
> 	echo $addr > /sys/devices/system/memory/hard_offline_page
> done
> 
> to make poison persistent across reboots.
> 
> -Tony

Couldn't it be a blob with 8 bytes for each address?

/Jarkko
Luck, Tony Sept. 28, 2021, 8:53 p.m. UTC | #4
On Tue, Sep 28, 2021 at 11:11:30PM +0300, Jarkko Sakkinen wrote:
> On Tue, 2021-09-28 at 15:41 +0000, Luck, Tony wrote:
> > > > Add debugfs files /sys/kernel/debug/sgx/poison_page_list so that system
> > > > administrators get a list of those pages that have been dropped because
> > > > of poison.
> > > 
> > > So, what would a sysadmin do with that detailed information?
> > 
> > It's going to be a rare case that there are any poisoned pages on that list
> > (a large enough cluster will have some systems that have uncorrected
> > recoverable errors in SGX EPC memory).
> > 
> > Even when there are some poisoned pages, there will only be a few. Systems
> > that have thousands of pages with uncorrected memory errors will surely crash
> > because one of those errors is going to either trigger an error marked as fatal,
> > or the error won’t be recoverable by Linux because it is in kernel memory.
> > 
> > A sysadmin might add a script to run during system shutdown (or periodically
> > during run-time) to save the poison page list. Then at startup run:
> > 
> > for addr in `cat saved_sgx_poison_page_list`
> > do
> > 	echo $addr > /sys/devices/system/memory/hard_offline_page
> > done
> > 
> > to make poison persistent across reboots.
> > 
> > -Tony
> 
> Couldn't it be a blob with 8 bytes for each address?

It could be a blob. But that would require some perl/python
instead of simple shell to do the above persistence trick.

Or I could just drop the debugfs interface from this patch,
waiting until some use case for the data is fleshed out so
that it can be done in the most sensible way for that use case.

Untested updated patch below.

-Tony

From 551fbc5822e8faf93ff53f0a2b2448b0b98f1dde Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Mon, 27 Sep 2021 13:26:06 -0700
Subject: [PATCH] x86/sgx: Initial poison handling for dirty and free pages

A memory controller patrol scrubber can report poison in a page
that isn't currently being used.

Add "poison" field in the sgx_epc_page that can be set for an
sgx_epc_page. Check for it:
1) When sanitizing dirty pages
2) When freeing epc pages

Poison is a new field separated from flags to avoid having to make
all updates to flags atomic, or integrate poison state changes into
some other locking scheme to protect flags.

In both cases place the poisoned page on a list of poisoned epc pages
to make sure it will not be reallocated.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 14 +++++++++++++-
 arch/x86/kernel/cpu/sgx/sgx.h  |  3 ++-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 09fa42690ff2..653bace26100 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -43,6 +43,7 @@ static nodemask_t sgx_numa_mask;
 static struct sgx_numa_node *sgx_numa_nodes;
 
 static LIST_HEAD(sgx_dirty_page_list);
+static LIST_HEAD(sgx_poison_page_list);
 
 /*
  * Reset post-kexec EPC pages to the uninitialized state. The pages are removed
@@ -62,6 +63,12 @@ static void __sgx_sanitize_pages(struct list_head *dirty_page_list)
 
 		page = list_first_entry(dirty_page_list, struct sgx_epc_page, list);
 
+		if (page->poison) {
+			list_del(&page->list);
+			list_add(&page->list, &sgx_poison_page_list);
+			continue;
+		}
+
 		ret = __eremove(sgx_get_epc_virt_addr(page));
 		if (!ret) {
 			/*
@@ -626,7 +633,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 
 	spin_lock(&node->lock);
 
-	list_add_tail(&page->list, &node->free_page_list);
+	page->owner = NULL;
+	if (page->poison)
+		list_add(&page->list, &sgx_poison_page_list);
+	else
+		list_add_tail(&page->list, &node->free_page_list);
 	sgx_nr_free_pages++;
 	page->flags = 0;
 
@@ -658,6 +669,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 		section->pages[i].section = index;
 		section->pages[i].flags = SGX_EPC_PAGE_IN_USE;
 		section->pages[i].owner = NULL;
+		section->pages[i].poison = 0;
 		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index f9202d3d6278..a990a4c9a00f 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -31,7 +31,8 @@
 
 struct sgx_epc_page {
 	unsigned int section;
-	unsigned int flags;
+	u16 flags;
+	u16 poison;
 	struct sgx_encl_page *owner;
 	struct list_head list;
 };
Jarkko Sakkinen Sept. 30, 2021, 2:40 p.m. UTC | #5
On Tue, 2021-09-28 at 13:53 -0700, Luck, Tony wrote:
> On Tue, Sep 28, 2021 at 11:11:30PM +0300, Jarkko Sakkinen wrote:
> > On Tue, 2021-09-28 at 15:41 +0000, Luck, Tony wrote:
> > > > > Add debugfs files /sys/kernel/debug/sgx/poison_page_list so that system
> > > > > administrators get a list of those pages that have been dropped because
> > > > > of poison.
> > > > 
> > > > So, what would a sysadmin do with that detailed information?
> > > 
> > > It's going to be a rare case that there are any poisoned pages on that list
> > > (a large enough cluster will have some systems that have uncorrected
> > > recoverable errors in SGX EPC memory).
> > > 
> > > Even when there are some poisoned pages, there will only be a few. Systems
> > > that have thousands of pages with uncorrected memory errors will surely crash
> > > because one of those errors is going to either trigger an error marked as fatal,
> > > or the error won’t be recoverable by Linux because it is in kernel memory.
> > > 
> > > A sysadmin might add a script to run during system shutdown (or periodically
> > > during run-time) to save the poison page list. Then at startup run:
> > > 
> > > for addr in `cat saved_sgx_poison_page_list`
> > > do
> > > 	echo $addr > /sys/devices/system/memory/hard_offline_page
> > > done
> > > 
> > > to make poison persistent across reboots.
> > > 
> > > -Tony
> > 
> > Couldn't it be a blob with 8 bytes for each address?
> 
> It could be a blob. But that would require some perl/python
> instead of simple shell to do the above persistence trick.

The way I've understood it, a list of values breaks sysfs conventions.
There can be only single value per attribute. Even, if the blob is
interpreted as a list of integers, it is still a value, as far as sysfs
is concerned.

I'd also consider programs written with C, or perhaps Rust, when we
(ever) add any new sysfs for SGX. In my opinion, it makes sense to make
any uapi things we add accesible to as many tools as we can.

Such a trivially constructed blob is not enormously hard to parse in any
language, but at least I don't enjoy parsing list of strings in C code,
whereas loading a blob is effortless.

This kind of shows why the current sysfs conventions make sense in the
first place: they enforce to design attributes in the manner that they
are as reachable as possible. That's why I would follow the conventions
in a strict manner.

Finally, I would make a proper sysfs attribute out of this (and a separate
patch), which would be available per node.

/Jarkko
Luck, Tony Sept. 30, 2021, 6:02 p.m. UTC | #6
On Thu, Sep 30, 2021 at 05:40:18PM +0300, Jarkko Sakkinen wrote:
> On Tue, 2021-09-28 at 13:53 -0700, Luck, Tony wrote:
> > > Couldn't it be a blob with 8 bytes for each address?
> > 
> > It could be a blob. But that would require some perl/python
> > instead of simple shell to do the above persistence trick.
> 
> The way I've understood it, a list of values breaks sysfs conventions.
> There can be only single value per attribute. Even, if the blob is
> interpreted as a list of integers, it is still a value, as far as sysfs
> is concerned.
> 
> I'd also consider programs written with C, or perhaps Rust, when we
> (ever) add any new sysfs for SGX. In my opinion, it makes sense to make
> any uapi things we add accesible to as many tools as we can.
> 
> Such a trivially constructed blob is not enormously hard to parse in any
> language, but at least I don't enjoy parsing list of strings in C code,
> whereas loading a blob is effortless.
> 
> This kind of shows why the current sysfs conventions make sense in the
> first place: they enforce to design attributes in the manner that they
> are as reachable as possible. That's why I would follow the conventions
> in a strict manner.
> 
> Finally, I would make a proper sysfs attribute out of this (and a separate
> patch), which would be available per node.

Those are all good points. I'm going to drop any interface from this
series (because that's above and beyond the goal of "basic machine check
support"). We can spend some time to come up with the right interface
and add that in a future series.

-Tony
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 09fa42690ff2..b558c9a80af4 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -1,6 +1,7 @@ 
 // SPDX-License-Identifier: GPL-2.0
 /*  Copyright(c) 2016-20 Intel Corporation. */
 
+#include <linux/debugfs.h>
 #include <linux/file.h>
 #include <linux/freezer.h>
 #include <linux/highmem.h>
@@ -43,6 +44,7 @@  static nodemask_t sgx_numa_mask;
 static struct sgx_numa_node *sgx_numa_nodes;
 
 static LIST_HEAD(sgx_dirty_page_list);
+static LIST_HEAD(sgx_poison_page_list);
 
 /*
  * Reset post-kexec EPC pages to the uninitialized state. The pages are removed
@@ -62,6 +64,12 @@  static void __sgx_sanitize_pages(struct list_head *dirty_page_list)
 
 		page = list_first_entry(dirty_page_list, struct sgx_epc_page, list);
 
+		if (page->poison) {
+			list_del(&page->list);
+			list_add(&page->list, &sgx_poison_page_list);
+			continue;
+		}
+
 		ret = __eremove(sgx_get_epc_virt_addr(page));
 		if (!ret) {
 			/*
@@ -626,7 +634,11 @@  void sgx_free_epc_page(struct sgx_epc_page *page)
 
 	spin_lock(&node->lock);
 
-	list_add_tail(&page->list, &node->free_page_list);
+	page->owner = NULL;
+	if (page->poison)
+		list_add(&page->list, &sgx_poison_page_list);
+	else
+		list_add_tail(&page->list, &node->free_page_list);
 	sgx_nr_free_pages++;
 	page->flags = 0;
 
@@ -658,6 +670,7 @@  static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 		section->pages[i].section = index;
 		section->pages[i].flags = SGX_EPC_PAGE_IN_USE;
 		section->pages[i].owner = NULL;
+		section->pages[i].poison = 0;
 		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
 
@@ -801,8 +814,21 @@  int sgx_set_attribute(unsigned long *allowed_attributes,
 }
 EXPORT_SYMBOL_GPL(sgx_set_attribute);
 
+static int poison_list_show(struct seq_file *m, void *private)
+{
+	struct sgx_epc_page *page;
+
+	list_for_each_entry(page, &sgx_poison_page_list, list)
+		seq_printf(m, "0x%lx\n", sgx_get_epc_phys_addr(page));
+
+	return 0;
+}
+
+DEFINE_SHOW_ATTRIBUTE(poison_list);
+
 static int __init sgx_init(void)
 {
+	struct dentry *dir;
 	int ret;
 	int i;
 
@@ -834,6 +860,9 @@  static int __init sgx_init(void)
 	if (sgx_vepc_init() && ret)
 		goto err_provision;
 
+	dir = debugfs_create_dir("sgx", arch_debugfs_dir);
+	debugfs_create_file("poison_page_list", 0400, dir, NULL, &poison_list_fops);
+
 	return 0;
 
 err_provision:
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index f9202d3d6278..a990a4c9a00f 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -31,7 +31,8 @@ 
 
 struct sgx_epc_page {
 	unsigned int section;
-	unsigned int flags;
+	u16 flags;
+	u16 poison;
 	struct sgx_encl_page *owner;
 	struct list_head list;
 };