[RFC] mm: Disable NUMA_BALANCING_DEFAULT_ENABLED and TRANSPARENT_HUGEPAGE on PREEMPT_RT

Message ID 20211026165100.ahz5bkx44lrrw5pt@linutronix.de (mailing list archive)
State New

Commit Message

Sebastian Andrzej Siewior Oct. 26, 2021, 4:51 p.m. UTC
In https://lore.kernel.org/all/20200304091159.GN3818@techsingularity.net/
Mel wrote:

| While I ack'd this, an RT application using THP is playing with fire,
| I know the RT extension for SLE explicitly disables it from being enabled
| at kernel config time. At minimum the critical regions should be mlocked
| followed by prctl to disable future THP faults that are non-deterministic,
| both from an allocation point of view, and a TLB access point of view. It's
| still reasonable to expect a smaller TLB reach for huge pages than
| base pages.

With TRANSPARENT_HUGEPAGE enabled I haven't seen spikes > 100us
in cyclictest. I did have mlockall() enabled but nothing else.
PR_SET_THP_DISABLE was left unchanged (THP stayed enabled). Is there
anything that would stress this to be sure, or is mlockall() enough to keep
THP enabled but have it leave the mlock()ed applications alone?
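
For reference, the sequence Mel describes (lock the memory, then use prctl
to disable future THP faults) could look roughly like the sketch below. The
helper name rt_prepare_memory() and struct rt_mem_status are made up for
illustration; only mlockall() and prctl(PR_SET_THP_DISABLE) are real
interfaces. Failures are reported rather than treated as fatal because
mlockall() depends on CAP_IPC_LOCK or RLIMIT_MEMLOCK.

```c
#include <stdio.h>
#include <sys/mman.h>
#include <sys/prctl.h>

/* Hypothetical helper type: records which of the two steps succeeded. */
struct rt_mem_status {
	int locked;		/* 1 if mlockall() succeeded */
	int thp_disabled;	/* 1 if PR_SET_THP_DISABLE succeeded */
};

struct rt_mem_status rt_prepare_memory(void)
{
	struct rt_mem_status st = { 0, 0 };

	/* Lock all current and future mappings of this process. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) == 0)
		st.locked = 1;
	else
		perror("mlockall");

	/* Set the per-process "THP disable" flag (Linux 3.15 and later). */
	if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) == 0)
		st.thp_disabled = 1;
	else
		perror("prctl(PR_SET_THP_DISABLE)");

	return st;
}
```

An application would call this once during start-up, before any RT thread
is created, and check both fields before continuing.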

Then Mel continued with:

| It's a similar hazard with NUMA balancing, an RT application should either
| disable balancing globally or set a memory policy that forces it to be
| ignored. They should be doing this anyway to avoid non-deterministic
| memory access costs due to NUMA artifacts but it wouldn't surprise me
| if some applications got it wrong.

Usually (often) RT applications are pinned. I would assume that on a
bigger box the RT tasks are at least pinned to a node. How bad can this
get in the worst case? cyclictest pins every thread to a CPU. I could remove
this for testing. What would be a good test to push this to its limit?
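
For anyone reproducing this: "disable balancing globally" corresponds to
the kernel.numa_balancing sysctl. A minimal sketch for checking its state
from a test program is below. The function names are made up for
illustration, and it assumes procfs is mounted at /proc; the file is absent
on kernels built without NUMA_BALANCING.

```c
#include <stdio.h>

/*
 * Read a single integer from a file. Returns -1 if the file cannot be
 * opened or does not start with an integer.
 */
int read_int_file(const char *path)
{
	FILE *f = fopen(path, "r");
	int val = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/* 0: balancing is off, > 0: balancing is on, -1: sysctl not available. */
int numa_balancing_state(void)
{
	return read_int_file("/proc/sys/kernel/numa_balancing");
}
```

A latency test could refuse to start, or at least print a warning, when
numa_balancing_state() returns a value greater than zero.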

Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 init/Kconfig | 2 +-
 mm/Kconfig   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Comments

Mel Gorman Oct. 27, 2021, 9:12 a.m. UTC | #1
Somewhat tentative but

Acked-by: Mel Gorman <mgorman@techsingularity.net>

It's tentative because NUMA Balancing is disabled by default on PREEMPT_RT
but can still be enabled, whereas THP is disabled entirely
and can never be enabled. This is a little inconsistent and it would be
preferable that they match, either by disabling NUMA_BALANCING entirely or
by forbidding TRANSPARENT_HUGEPAGE_ALWAYS && PREEMPT_RT. I'm ok with either.

There is the possibility that an RT application could use THP safely by
using madvise() and mlock(). That way, THP is available but only if an
application has explicit knowledge of THP and is smart enough to do so only
during the initialisation phase, with

diff --git a/mm/Kconfig b/mm/Kconfig
index d16ba9249bc5..d6ccca216028 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -393,6 +393,7 @@ choice
 
 	config TRANSPARENT_HUGEPAGE_ALWAYS
 		bool "always"
+		depends on !PREEMPT_RT
 	help
 	  Enabling Transparent Hugepage always, can increase the
 	  memory footprint of applications without a guaranteed

There is the slight caveat that even then THP can have inconsistent
latencies if it has a split THP with separate entries for base and huge
pages. The responsibility would be on the person deploying the application
to ensure a platform was suitable for both RT and using huge pages.
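
As a rough illustration of the madvise() plus mlock() approach described
above, the initialisation phase of such an application could look like the
sketch below. The names rt_region_init() and struct rt_region are
hypothetical, and the 2 MiB huge page size is an assumption (it is the PMD
size on x86-64; other architectures differ). Whether the kernel actually
backs the region with a huge page is not guaranteed by this code.

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Assumption: 2 MiB huge pages (x86-64 PMD size). */
#define HPAGE_SIZE (2UL * 1024 * 1024)

struct rt_region {
	void *addr;
	size_t len;
	int thp_advised;	/* madvise(MADV_HUGEPAGE) was accepted */
	int locked;		/* mlock() succeeded */
};

/*
 * Allocate nr_hpages * HPAGE_SIZE bytes aligned to HPAGE_SIZE, ask for
 * THP on that range only, fault it in and lock it.
 * Returns 0 if memory was obtained, -1 otherwise.
 */
int rt_region_init(struct rt_region *r, size_t nr_hpages)
{
	void *p = NULL;
	size_t len = nr_hpages * HPAGE_SIZE;

	if (nr_hpages == 0 || posix_memalign(&p, HPAGE_SIZE, len) != 0)
		return -1;

	r->addr = p;
	r->len = len;
	r->thp_advised = 0;
#ifdef MADV_HUGEPAGE
	r->thp_advised = (madvise(p, len, MADV_HUGEPAGE) == 0);
#endif
	memset(p, 0, len);	/* fault in now, not in the RT path */
	r->locked = (mlock(p, len) == 0);
	return 0;
}
```

The region is sized and aligned in whole huge pages so that no unrelated
buffer ends up inside the same hugepage-aligned range.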
Sebastian Andrzej Siewior Oct. 28, 2021, 12:04 p.m. UTC | #2
On 2021-10-27 10:12:12 [+0100], Mel Gorman wrote:
> Somewhat tentative but
> 
> Acked-by: Mel Gorman <mgorman@techsingularity.net>
> 
> It's tentative because NUMA Balancing is disabled by default on PREEMPT_RT
> but can still be enabled, whereas THP is disabled entirely
> and can never be enabled. This is a little inconsistent and it would be
> preferable that they match, either by disabling NUMA_BALANCING entirely or
> by forbidding TRANSPARENT_HUGEPAGE_ALWAYS && PREEMPT_RT. I'm ok with either.

Oh. I can go either way depending on the input ;)

> There is the possibility that an RT application could use THP safely by
> using madvise() and mlock(). That way, THP is available but only if an
> application has explicit knowledge of THP and is smart enough to do so only
> during the initialisation phase, with

Yes that was my question. So if you have "always", do mlockall() in the
application and then have other threads of that same application doing
malloc/free of memory that the RT thread is not touching, then bad
things can still happen, right?
My understanding is that all threads can be blocked in a page fault if
there is some THP operation going on.

You suggest that the application uses THP by setting madvise() on the
relevant area and mlock()ing it afterwards, and then nothing bad can
happen. No defrag or other optimisation happens later. The memory area
either uses hugepages after the madvise() or it does not.
If so, then this sounds good.
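
To tell which of the "always", "madvise" or "never" modes a running system
is in, an application can read the THP sysfs file, where the active mode is
the one in square brackets. A minimal sketch follows; the function names
are made up for illustration and the sysfs path assumes sysfs is mounted at
/sys.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed token of e.g. "always [madvise] never" into out.
 * Returns 0 on success, -1 if there is no bracketed token or it does
 * not fit into out.
 */
int thp_parse_mode(const char *line, char *out, size_t outsz)
{
	const char *lb = strchr(line, '[');
	const char *rb = lb ? strchr(lb, ']') : NULL;
	size_t n;

	if (!rb)
		return -1;
	n = (size_t)(rb - lb - 1);
	if (n == 0 || n >= outsz)
		return -1;
	memcpy(out, lb + 1, n);
	out[n] = '\0';
	return 0;
}

/* Returns -1 if the file is missing, e.g. THP is not built in. */
int thp_current_mode(char *out, size_t outsz)
{
	char line[128];
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(line, sizeof(line), f))
		ret = thp_parse_mode(line, out, outsz);
	fclose(f);
	return ret;
}
```

An RT application that relies on the madvise() approach could check at
start-up that the reported mode is "madvise" rather than "always".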

> diff --git a/mm/Kconfig b/mm/Kconfig
> index d16ba9249bc5..d6ccca216028 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -393,6 +393,7 @@ choice
>  
>  	config TRANSPARENT_HUGEPAGE_ALWAYS
>  		bool "always"
> +		depends on !PREEMPT_RT
>  	help
>  	  Enabling Transparent Hugepage always, can increase the
>  	  memory footprint of applications without a guaranteed
> 
> There is the slight caveat that even then THP can have inconsistent
> latencies if it has a split THP with separate entries for base and huge
> pages. The responsibility would be on the person deploying the application
> to ensure a platform was suitable for both RT and using huge pages.

split THP? You mean latencies differ when accessing the memory
depending on whether it is reached via the THP entry or one of the many
4KiB entries?
I'm more worried about mmap_lock being held while the THP operation is in
progress, so that a fault from the RT application has to wait until the
THP operation is done.

Sebastian
Mel Gorman Oct. 28, 2021, 12:52 p.m. UTC | #3
On Thu, Oct 28, 2021 at 02:04:29PM +0200, Sebastian Andrzej Siewior wrote:
> Yes that was my question. So if you have "always", do mlockall() in the
> application and then have other threads of that same application doing
> malloc/free of memory that the RT thread is not touching, then bad
> things can still happen, right?
> My understanding is that all threads can be blocked in a page fault if
> there is some THP operation going on.
> 

Hmm, it could happen if all the memory used by the RT thread was not
hugepage-aligned and potentially khugepaged could interfere. khugepaged
can be disabled if tuned properly but the alignment requirement would be
tricky. Probably safer to just disable it like it has been historically.
For consistency, force NUMA_BALANCING to be disabled too because it
introduces non-deterministic latencies even if memory regions are locked
and bound.

> > There is the slight caveat that even then THP can have inconsistent
> > latencies if it has a split THP with separate entries for base and huge
> > pages. The responsibility would be on the person deploying the application
> > to ensure a platform was suitable for both RT and using huge pages.
> 
> split THP?

Sorry, "split TLB" where part of the TLB only handles base pages and
another part handles huge pages.
Sebastian Andrzej Siewior Oct. 28, 2021, 1:56 p.m. UTC | #4
On 2021-10-28 13:52:24 [+0100], Mel Gorman wrote:
> > Yes that was my question. So if you have "always", do mlockall() in the
> > application and then have other threads of that same application doing
> > malloc/free of memory that the RT thread is not touching, then bad
> > things can still happen, right?
> > My understanding is that all threads can be blocked in a page fault if
> > there is some THP operation going on.
> > 
> 
> Hmm, it could happen if all the memory used by the RT thread was not
> hugepage-aligned and potentially khugepaged could interfere. khugepaged
> can be disabled if tuned properly but the alignment requirement would be
> tricky. Probably safer to just disable it like it has been historically.
> For consistency, force NUMA_BALANCING to be disabled too because it
> introduces non-deterministic latencies even if memory regions are locked
> and bound.

Okay. I don't mind disabling it or keeping it enabled under some
restrictions. I just need to document it so people are aware why it
is disabled, so if they want to enable it they know which areas need
attention.

THP: disabled due to alignment issues and potential defragmentation by
khugepaged. Understood. Workaround: Use hugepages.

NUMA_BALANCING. It looks like it replaces the physical page while
keeping the virtual address. This kind of page migration does not look
good if it happens for everyone since it involves mmap_lock.
Let me write that up and post properly.

Thank you.

> > > There is the slight caveat that even then THP can have inconsistent
> > > latencies if it has a split THP with separate entries for base and huge
> > > pages. The responsibility would be on the person deploying the application
> > > to ensure a platform was suitable for both RT and using huge pages.
> > 
> > split THP?
> 
> Sorry, "split TLB" where part of the TLB only handles base pages and
> another part handles huge pages.

ah okay.

Sebastian
Mel Gorman Oct. 28, 2021, 2:14 p.m. UTC | #5
On Thu, Oct 28, 2021 at 03:56:47PM +0200, Sebastian Andrzej Siewior wrote:
> Okay. I don't mind disabling it or keeping it enabled under some
> restrictions. I just need to document it so people are aware why it
> is disabled, so if they want to enable it they know which areas need
> attention.
> 
> THP: disabled due to alignment issues and potential defragmentation by
> khugepaged. Understood. Workaround: Use hugepages.
> 
> NUMA_BALANCING. It looks like it replaces the physical page while
> keeping the virtual address. This kind of page migration does not look
> good if it happens for everyone since it involves mmap_lock.
> Let me write that up and post properly.
> 

In case it helps;

TRANSPARENT_HUGEPAGE: There are potential non-deterministic delays to an
	RT thread if a critical memory region is not THP-aligned and a
	non-RT buffer is located in the same hugepage-aligned region. It's
	also possible for an unrelated thread to migrate pages belonging
	to an RT task incurring unexpected page faults due to memory
	defragmentation even if khugepaged is disabled.

NUMA_BALANCING: There is a non-deterministic delay to mark PTEs PROT_NONE
	to gather NUMA fault samples, increased page faults of regions
	even if mlocked and non-deterministic delays when migrating pages.
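
The alignment condition in the TRANSPARENT_HUGEPAGE paragraph can be made
concrete with a little address arithmetic. The sketch below checks whether
a region is hugepage-aligned and whether two buffers fall into the same
hugepage-aligned range. The function names are hypothetical and the 2 MiB
huge page size is an assumption (x86-64 PMD size).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 2 MiB huge pages (x86-64 PMD size). */
#define HPAGE_SIZE (2UL * 1024 * 1024)

/* Index of the hugepage-sized frame containing the first byte. */
static uintptr_t hpage_first(uintptr_t addr)
{
	return addr / HPAGE_SIZE;
}

/* Index of the hugepage-sized frame containing the last byte. */
static uintptr_t hpage_last(uintptr_t addr, size_t len)
{
	return (addr + len - 1) / HPAGE_SIZE;
}

/* 1 if [addr, addr + len) starts and ends on a huge page boundary. */
int region_is_hpage_aligned(uintptr_t addr, size_t len)
{
	return len != 0 && addr % HPAGE_SIZE == 0 && len % HPAGE_SIZE == 0;
}

/* 1 if the two non-empty regions touch a common hugepage-sized frame. */
int regions_share_hpage(uintptr_t a, size_t alen, uintptr_t b, size_t blen)
{
	if (alen == 0 || blen == 0)
		return 0;
	return hpage_first(a) <= hpage_last(b, blen) &&
	       hpage_first(b) <= hpage_last(a, alen);
}
```

A critical RT region for which region_is_hpage_aligned() is false, or which
shares a frame with a non-RT buffer, is the case described above.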
Sebastian Andrzej Siewior Oct. 28, 2021, 2:34 p.m. UTC | #6
On 2021-10-28 15:14:52 [+0100], Mel Gorman wrote:
> In case it helps;

Yes. Thank you.

Sebastian
Patch

diff --git a/init/Kconfig b/init/Kconfig
index edc0a0228f143..8e96817d507c3 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -922,7 +922,7 @@  config NUMA_BALANCING
 config NUMA_BALANCING_DEFAULT_ENABLED
 	bool "Automatically enable NUMA aware memory/task placement"
 	default y
-	depends on NUMA_BALANCING
+	depends on NUMA_BALANCING && !PREEMPT_RT
 	help
 	  If set, automatic NUMA balancing will be enabled if running on a NUMA
 	  machine.
diff --git a/mm/Kconfig b/mm/Kconfig
index c150a0c6fce2c..5c5508fafcec5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -374,7 +374,7 @@  config NOMMU_INITIAL_TRIM_EXCESS
 
 config TRANSPARENT_HUGEPAGE
 	bool "Transparent Hugepage Support"
-	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
+	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT
 	select COMPACTION
 	select XARRAY_MULTI
 	help