x86/sgx: Fix deadloop in __sgx_alloc_epc_page()

Message ID	20240829023800.1671210-1-aaron.lu@intel.com (mailing list archive)
State	New, archived
Headers	show Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.9]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 80378610B; Thu, 29 Aug 2024 02:38:08 +0000 (UTC) From: Aaron Lu <aaron.lu@intel.com> To: Jarkko Sakkinen <jarkko@kernel.org>, Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org, linux-sgx@vger.kernel.org, linux-kernel@vger.kernel.org, Zhimin Luo <zhimin.luo@intel.com> Subject: [PATCH] x86/sgx: Fix deadloop in __sgx_alloc_epc_page() Date: Thu, 29 Aug 2024 10:38:00 +0800 Message-ID: <20240829023800.1671210-1-aaron.lu@intel.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	x86/sgx: Fix deadloop in __sgx_alloc_epc_page() \| expand x86/sgx: Fix deadloop in __sgx_alloc_epc_page()

Aaron Lu Aug. 29, 2024, 2:38 a.m. UTC

When current node doesn't have a EPC section configured by firmware and
all other EPC sections memory are used up, CPU can stuck inside the
while loop in __sgx_alloc_epc_page() forever and soft lockup will happen.
Note how nid_of_current will never equal to nid in that while loop because
nid_of_current is not set in sgx_numa_mask.

Also worth mentioning is that it's perfectly fine for firmware to not
seup an EPC section on a node. Setting an EPC section on each node can
be good for performance but that's not a requirement functionality wise.

Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
Reported-by: Zhimin Luo <zhimin.luo@intel.com>
Tested-by: Zhimin Luo <zhimin.luo@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
This issue is found by Zhimin when doing internal testing and no
external bug report has been sent out so there is no Closes: tag.

 arch/x86/kernel/cpu/sgx/main.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)


base-commit: a85536e1bce722cb184abbac98068217874bdd6e

Huang, Kai Aug. 29, 2024, 7:47 a.m. UTC | #1

On Thu, 2024-08-29 at 10:38 +0800, Aaron Lu wrote:
> When current node doesn't have a EPC section configured by firmware and
> all other EPC sections memory are used up, CPU can stuck inside the
> while loop in __sgx_alloc_epc_page() forever and soft lockup will happen.
> Note how nid_of_current will never equal to nid in that while loop because
> nid_of_current is not set in sgx_numa_mask.
> 
> Also worth mentioning is that it's perfectly fine for firmware to not
> seup an EPC section on a node. Setting an EPC section on each node can

seup -> set up.

> be good for performance but that's not a requirement functionality wise.
> 
> Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
> Reported-by: Zhimin Luo <zhimin.luo@intel.com>
> Tested-by: Zhimin Luo <zhimin.luo@intel.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>

Reviewed-by: Kai Huang <kai.huang@intel.com>

Huang, Kai Aug. 29, 2024, 7:56 a.m. UTC | #2

Actually run spell check this time ...

On Thu, 2024-08-29 at 10:38 +0800, Aaron Lu wrote:
> When current node doesn't have a EPC section configured by firmware and

"current node" -> "the current node"

"a EPC section" -> "an EPC section"

> all other EPC sections memory are used up, CPU can stuck inside the

"EPC sections memory" -> "EPC sections"

"can stuck" -> "can get stuck"

> while loop in __sgx_alloc_epc_page() forever and soft lockup will happen.
> Note how nid_of_current will never equal to nid in that while loop because
> nid_of_current is not set in sgx_numa_mask.
> 
> Also worth mentioning is that it's perfectly fine for firmware to not
> seup an EPC section on a node. Setting an EPC section on each node can
> be good for performance but that's not a requirement functionality wise.
> 

[...]

Aaron Lu Aug. 29, 2024, 1:22 p.m. UTC | #3

On Thu, Aug 29, 2024 at 03:56:39PM +0800, Huang, Kai wrote:
> Actually run spell check this time ...
> 
> On Thu, 2024-08-29 at 10:38 +0800, Aaron Lu wrote:
> > When current node doesn't have a EPC section configured by firmware and
> 
> "current node" -> "the current node"
> 
> "a EPC section" -> "an EPC section"
> 
> > all other EPC sections memory are used up, CPU can stuck inside the
> 
> "EPC sections memory" -> "EPC sections"
> 
> "can stuck" -> "can get stuck"
> 
> > while loop in __sgx_alloc_epc_page() forever and soft lockup will happen.
> > Note how nid_of_current will never equal to nid in that while loop because
> > nid_of_current is not set in sgx_numa_mask.
> > 
> > Also worth mentioning is that it's perfectly fine for firmware to not
> > seup an EPC section on a node. Setting an EPC section on each node can
> > be good for performance but that's not a requirement functionality wise.
> > 
> 
> [...]

Thank you Kai for your detailed review, will reword the changelog
according to your suggestions when sending the next version.

Dave Hansen Aug. 29, 2024, 3:17 p.m. UTC | #4

Generally, I think it's a bad idea to refer to function names in
subjects.  This, for instance would be much more informative:

	x86/sgx: Fix deadlock in SGX NUMA node search

On 8/28/24 19:38, Aaron Lu wrote:
> When current node doesn't have a EPC section configured by firmware and
> all other EPC sections memory are used up, CPU can stuck inside the
> while loop in __sgx_alloc_epc_page() forever and soft lockup will happen.
> Note how nid_of_current will never equal to nid in that while loop because
> nid_of_current is not set in sgx_numa_mask.
> 
> Also worth mentioning is that it's perfectly fine for firmware to not
> seup an EPC section on a node. Setting an EPC section on each node can
> be good for performance but that's not a requirement functionality wise.

The changelog is a little rough, but I think Kai gave some good
suggestions.  The other thing you can do is dump the text in chatgpt (or
whatever) and have it fix your grammar.  It actually does a pretty
decent job.

Also, you didn't say _how_ you fixed this.  That needs to be in here.
Something along the lines of:

	Rework the loop to start and end on *a* node that has SGX
	memory.  This avoids the deadlock looking for the current SGX-
	lacking node to show up in the loop when it never will.

The code looks fine, so feel free to add:

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

Also, I do think we should probably add some kind of sanity warning to
the SGX code in another patch.  If a node on an SGX system has CPUs and
memory, it's very likely it will also have some EPC.  It can be
something soft like a pr_info(), but I think it would be nice to have.

Jarkko Sakkinen Aug. 29, 2024, 4:44 p.m. UTC | #5

On Thu Aug 29, 2024 at 5:38 AM EEST, Aaron Lu wrote:
> When current node doesn't have a EPC section configured by firmware and
> all other EPC sections memory are used up, CPU can stuck inside the
> while loop in __sgx_alloc_epc_page() forever and soft lockup will happen.
> Note how nid_of_current will never equal to nid in that while loop because
                                                     ~~~~

Oh *that* while loop ;-) Please be more specific.

> nid_of_current is not set in sgx_numa_mask.
>
> Also worth mentioning is that it's perfectly fine for firmware to not
> seup an EPC section on a node. Setting an EPC section on each node can
> be good for performance but that's not a requirement functionality wise.

This lacks any description of what is done to __sgx_alloc_epc_page().

>
> Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
> Reported-by: Zhimin Luo <zhimin.luo@intel.com>
> Tested-by: Zhimin Luo <zhimin.luo@intel.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>

BR, Jarkko

Aaron Lu Aug. 30, 2024, 6:02 a.m. UTC | #6

On Thu, Aug 29, 2024 at 08:17:53AM -0700, Dave Hansen wrote:
> Generally, I think it's a bad idea to refer to function names in
> subjects.  This, for instance would be much more informative:
> 
> 	x86/sgx: Fix deadlock in SGX NUMA node search

Indeed, will use this as subject, thanks.

> On 8/28/24 19:38, Aaron Lu wrote:
> > When current node doesn't have a EPC section configured by firmware and
> > all other EPC sections memory are used up, CPU can stuck inside the
> > while loop in __sgx_alloc_epc_page() forever and soft lockup will happen.
> > Note how nid_of_current will never equal to nid in that while loop because
> > nid_of_current is not set in sgx_numa_mask.
> > 
> > Also worth mentioning is that it's perfectly fine for firmware to not
> > seup an EPC section on a node. Setting an EPC section on each node can
> > be good for performance but that's not a requirement functionality wise.
> 
> The changelog is a little rough, but I think Kai gave some good
> suggestions.  The other thing you can do is dump the text in chatgpt (or
> whatever) and have it fix your grammar.  It actually does a pretty
> decent job.

Thanks for the suggestion.

> 
> Also, you didn't say _how_ you fixed this.  That needs to be in here.
> Something along the lines of:
> 
> 	Rework the loop to start and end on *a* node that has SGX
> 	memory.  This avoids the deadlock looking for the current SGX-
> 	lacking node to show up in the loop when it never will.

Will add this to the changelog, thanks for the write-up.

> 
> The code looks fine, so feel free to add:
> 
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

Thanks.

> 
> Also, I do think we should probably add some kind of sanity warning to
> the SGX code in another patch.  If a node on an SGX system has CPUs and
> memory, it's very likely it will also have some EPC.  It can be
> something soft like a pr_info(), but I think it would be nice to have.

I think there are systems with valid reason to not setup an EPC section
per node, e.g. a 8 sockets system with SNC=2, there would be a total of
16 nodes and it's not possible to have one EPC section per node because
the upper limit of EPC sections is 8. I'm not sure a warning is
appropriate here, what do you think?

Aaron Lu Aug. 30, 2024, 6:14 a.m. UTC | #7

On Thu, Aug 29, 2024 at 07:44:13PM +0300, Jarkko Sakkinen wrote:
> On Thu Aug 29, 2024 at 5:38 AM EEST, Aaron Lu wrote:
> > When current node doesn't have a EPC section configured by firmware and
> > all other EPC sections memory are used up, CPU can stuck inside the
> > while loop in __sgx_alloc_epc_page() forever and soft lockup will happen.
> > Note how nid_of_current will never equal to nid in that while loop because
>                                                      ~~~~
> 
> Oh *that* while loop ;-) Please be more specific.

What about:
Note how nid_of_current will never be equal to nid in the while loop that
searches an available EPC page from remote nodes because nid_of_current is
not set in sgx_numa_mask.

> > nid_of_current is not set in sgx_numa_mask.
> >
> > Also worth mentioning is that it's perfectly fine for firmware to not
> > seup an EPC section on a node. Setting an EPC section on each node can
> > be good for performance but that's not a requirement functionality wise.
> 
> This lacks any description of what is done to __sgx_alloc_epc_page().

Will add what Dave suggested on how the problem is fixed to the changelog.

> >
> > Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
> > Reported-by: Zhimin Luo <zhimin.luo@intel.com>
> > Tested-by: Zhimin Luo <zhimin.luo@intel.com>
> > Signed-off-by: Aaron Lu <aaron.lu@intel.com>

Thanks,
Aaron

Dave Hansen Aug. 30, 2024, 2:03 p.m. UTC | #8

On 8/29/24 23:02, Aaron Lu wrote:
>> Also, I do think we should probably add some kind of sanity warning to
>> the SGX code in another patch.  If a node on an SGX system has CPUs and
>> memory, it's very likely it will also have some EPC.  It can be
>> something soft like a pr_info(), but I think it would be nice to have.
> I think there are systems with valid reason to not setup an EPC section
> per node, e.g. a 8 sockets system with SNC=2, there would be a total of
> 16 nodes and it's not possible to have one EPC section per node because
> the upper limit of EPC sections is 8. I'm not sure a warning is
> appropriate here, what do you think?

While possible, those systems are pretty rare.  I don't think a
softly-worded pr_info() will scare anyone too much.

Aaron Lu Sept. 2, 2024, 7:57 a.m. UTC | #9

On Fri, Aug 30, 2024 at 07:03:33AM -0700, Dave Hansen wrote:
> On 8/29/24 23:02, Aaron Lu wrote:
> >> Also, I do think we should probably add some kind of sanity warning to
> >> the SGX code in another patch.  If a node on an SGX system has CPUs and
> >> memory, it's very likely it will also have some EPC.  It can be
> >> something soft like a pr_info(), but I think it would be nice to have.
> > I think there are systems with valid reason to not setup an EPC section
> > per node, e.g. a 8 sockets system with SNC=2, there would be a total of
> > 16 nodes and it's not possible to have one EPC section per node because
> > the upper limit of EPC sections is 8. I'm not sure a warning is
> > appropriate here, what do you think?
> 
> While possible, those systems are pretty rare.  I don't think a
> softly-worded pr_info() will scare anyone too much.

Understood.

Maybe something like below?

From e49a78f27956b3d62c5ef962320e63dc3eeb897c Mon Sep 17 00:00:00 2001
From: Aaron Lu <aaron.lu@intel.com>
Date: Mon, 2 Sep 2024 11:46:07 +0800
Subject: [PATCH] x86/sgx: Log information when a node lacks an EPC section

For optimized performance, firmware typically distributes EPC sections
evenly across different NUMA nodes. However, there are scenarios where
a node may have both CPUs and memory but no EPC section configured. For
example, in an 8-socket system with a Sub-Numa-Cluster=2 setup, there
are a total of 16 nodes. Given that the maximum number of supported EPC
sections is 8, it is simply not feasible to assign one EPC section to
each node. This configuration is not incorrect - SGX will still operate
correctly; it is just not optimized from a NUMA standpoint.

For this reason, log a message when a node with both CPUs and memory
lacks an EPC section. This will provide users with a hint as to why they
might be experiencing less-than-ideal performance when running SGX
enclaves.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 694fcf7a5e3a..3a79105455f1 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -848,6 +848,13 @@ static bool __init sgx_page_cache_init(void)
 		return false;
 	}

+	for_each_online_node(nid) {
+		if (!node_isset(nid, sgx_numa_mask) &&
+		    node_state(nid, N_MEMORY) && node_state(nid, N_CPU))
+			pr_info("node%d has both CPUs and memory but doesn't have an EPC section\n",
+				nid);
+	}
+
 	return true;
 }

Jarkko Sakkinen Sept. 3, 2024, 4:05 p.m. UTC | #10

On Fri Aug 30, 2024 at 9:14 AM EEST, Aaron Lu wrote:
> On Thu, Aug 29, 2024 at 07:44:13PM +0300, Jarkko Sakkinen wrote:
> > On Thu Aug 29, 2024 at 5:38 AM EEST, Aaron Lu wrote:
> > > When current node doesn't have a EPC section configured by firmware and
> > > all other EPC sections memory are used up, CPU can stuck inside the
> > > while loop in __sgx_alloc_epc_page() forever and soft lockup will happen.
> > > Note how nid_of_current will never equal to nid in that while loop because
> >                                                      ~~~~
> > 
> > Oh *that* while loop ;-) Please be more specific.
>
> What about:
> Note how nid_of_current will never be equal to nid in the while loop that
> searches an available EPC page from remote nodes because nid_of_current is
> not set in sgx_numa_mask.

That would work I think!

>
> > > nid_of_current is not set in sgx_numa_mask.
> > >
> > > Also worth mentioning is that it's perfectly fine for firmware to not
> > > seup an EPC section on a node. Setting an EPC section on each node can
> > > be good for performance but that's not a requirement functionality wise.
> > 
> > This lacks any description of what is done to __sgx_alloc_epc_page().
>
> Will add what Dave suggested on how the problem is fixed to the changelog.

Great. I think the code change is correct reflecting these additions.
I'll look the next version as a whole but with high probability I can
ack that as long as the commit message has these updates.

>
> > >
> > > Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
> > > Reported-by: Zhimin Luo <zhimin.luo@intel.com>
> > > Tested-by: Zhimin Luo <zhimin.luo@intel.com>
> > > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
>
> Thanks,
> Aaron

BR, Jarkko

Aaron Lu Sept. 4, 2024, 1:39 a.m. UTC | #11

On Tue, Sep 03, 2024 at 07:05:40PM +0300, Jarkko Sakkinen wrote:
> On Fri Aug 30, 2024 at 9:14 AM EEST, Aaron Lu wrote:
> > On Thu, Aug 29, 2024 at 07:44:13PM +0300, Jarkko Sakkinen wrote:
> > > On Thu Aug 29, 2024 at 5:38 AM EEST, Aaron Lu wrote:
> > > > When current node doesn't have a EPC section configured by firmware and
> > > > all other EPC sections memory are used up, CPU can stuck inside the
> > > > while loop in __sgx_alloc_epc_page() forever and soft lockup will happen.
> > > > Note how nid_of_current will never equal to nid in that while loop because
> > >                                                      ~~~~
> > > 
> > > Oh *that* while loop ;-) Please be more specific.
> >
> > What about:
> > Note how nid_of_current will never be equal to nid in the while loop that
> > searches an available EPC page from remote nodes because nid_of_current is
> > not set in sgx_numa_mask.
> 
> That would work I think!

While rewriting the changelog, I find it more natural to explain this
"while loop" when I first mentioned it, i.e.

    When the current node doesn't have an EPC section configured by firmware
    and all other EPC sections are used up, CPU can get stuck inside the
    while loop that looks for an available EPC page from remote nodes
    indefinitely, leading to a soft lockup. Note how nid_of_current will
    never be equal to nid in that while loop because nid_of_current is not
    set in sgx_numa_mask.

I hope this looks fine to you.

> >
> > > > nid_of_current is not set in sgx_numa_mask.
> > > >
> > > > Also worth mentioning is that it's perfectly fine for firmware to not
> > > > seup an EPC section on a node. Setting an EPC section on each node can
> > > > be good for performance but that's not a requirement functionality wise.
> > > 
> > > This lacks any description of what is done to __sgx_alloc_epc_page().
> >
> > Will add what Dave suggested on how the problem is fixed to the changelog.
> 
> Great. I think the code change is correct reflecting these additions.
> I'll look the next version as a whole but with high probability I can
> ack that as long as the commit message has these updates.

Thanks.

> >
> > > >
> > > > Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
> > > > Reported-by: Zhimin Luo <zhimin.luo@intel.com>
> > > > Tested-by: Zhimin Luo <zhimin.luo@intel.com>
> > > > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> >
> > Thanks,
> > Aaron
> 
> BR, Jarkko

Jarkko Sakkinen Sept. 4, 2024, 2:17 p.m. UTC | #12

On Wed Sep 4, 2024 at 4:39 AM EEST, Aaron Lu wrote:
> On Tue, Sep 03, 2024 at 07:05:40PM +0300, Jarkko Sakkinen wrote:
> > On Fri Aug 30, 2024 at 9:14 AM EEST, Aaron Lu wrote:
> > > On Thu, Aug 29, 2024 at 07:44:13PM +0300, Jarkko Sakkinen wrote:
> > > > On Thu Aug 29, 2024 at 5:38 AM EEST, Aaron Lu wrote:
> > > > > When current node doesn't have a EPC section configured by firmware and
> > > > > all other EPC sections memory are used up, CPU can stuck inside the
> > > > > while loop in __sgx_alloc_epc_page() forever and soft lockup will happen.
> > > > > Note how nid_of_current will never equal to nid in that while loop because
> > > >                                                      ~~~~
> > > > 
> > > > Oh *that* while loop ;-) Please be more specific.
> > >
> > > What about:
> > > Note how nid_of_current will never be equal to nid in the while loop that
> > > searches an available EPC page from remote nodes because nid_of_current is
> > > not set in sgx_numa_mask.
> > 
> > That would work I think!
>
> While rewriting the changelog, I find it more natural to explain this
> "while loop" when I first mentioned it, i.e.
>
>     When the current node doesn't have an EPC section configured by firmware
>     and all other EPC sections are used up, CPU can get stuck inside the
>     while loop that looks for an available EPC page from remote nodes
>     indefinitely, leading to a soft lockup. Note how nid_of_current will
>     never be equal to nid in that while loop because nid_of_current is not
>     set in sgx_numa_mask.
>
> I hope this looks fine to you.

Yes, it is. I just want the storyline to the commit message as
a reminder why we did this, that's all. It helps a lot later on
when having to look into commit history.

BR, Jarkko

x86/sgx: Fix deadloop in __sgx_alloc_epc_page()

Commit Message

Comments

Patch