[v2] mm: let kswapd work again for node that used to be hopeless but may not now

Message ID 20240604072323.10886-1-byungchul@sk.com (mailing list archive)
State New

Commit Message

Byungchul Park June 4, 2024, 7:23 a.m. UTC
Changes from v1:
	1. Don't allow kswapd to resume if the system is under memory
	   pressure that might affect direct reclaim by any chance, e.g.
	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
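
The condition in the changelog entry above can be sketched as a standalone check. This is a minimal illustration, not the code from the patch: `struct zone_model`, `may_resume_kswapd()` and the field names are hypothetical, and only the threshold (low wmark + min wmark)/2 is taken from the changelog.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-zone values the changelog refers to. */
struct zone_model {
	unsigned long nr_free_pages;	/* models NR_FREE_PAGES */
	unsigned long wmark_min;	/* min watermark, in pages */
	unsigned long wmark_low;	/* low watermark, in pages */
};

/*
 * Return true if free memory is at or above the midpoint between the min
 * and low watermarks, i.e. memory is not so tight that resuming kswapd
 * might get in the way of direct reclaim.
 */
static bool may_resume_kswapd(const struct zone_model *z)
{
	unsigned long threshold;

	assert(z->wmark_min <= z->wmark_low);	/* sanity check for the model */
	threshold = (z->wmark_low + z->wmark_min) / 2;

	return z->nr_free_pages >= threshold;
}
```

With min = 100 and low = 200 pages, the threshold is 150 pages: 149 free pages would keep kswapd stopped, while 150 or more would allow it to resume.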

--->8---
From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
From: Byungchul Park <byungchul@sk.com>
Date: Tue, 4 Jun 2024 15:27:56 +0900
Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now

A system should run with kswapd running in background when under memory
pressure, such as when the available memory level is below the low water
mark and there are reclaimable folios.

However, the current code lets the system run with kswapd stopped once
kswapd has given up after MAX_RECLAIM_RETRIES failures, until direct
reclaim makes progress and revives it, even if there are reclaimable
folios that kswapd could reclaim.  This case was observed in the
following scenario:

   CONFIG_NUMA_BALANCING enabled
   sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
   numa node0 (500GB local DRAM, 128 CPUs)
   numa node1 (100GB CXL memory, no CPUs)
   swap off

   1) Run a workload with big anon pages e.g. mmap(200GB).
   2) Continue adding the same workload to the system.
   3) The anon pages are placed in node0 by promotion/demotion.
   4) kswapd0 stops because of the unreclaimable anon pages in node0.
   5) Kill the memory hoggers to restore the system.

After restoring the system at 5), the system starts to run without
kswapd.  Even worse, the tiering mechanism is no longer able to work since
it relies on kswapd for demotion.

However, node0 has pages newly allocated after 5) that might or
might not be reclaimable.  Since those are potentially reclaimable, it's
worth trying reclaim by allowing kswapd to work again.

Signed-off-by: Byungchul Park <byungchul@sk.com>
---
 include/linux/mmzone.h |  4 ++++
 mm/page_alloc.c        | 12 ++++++++++
 mm/vmscan.c            | 52 ++++++++++++++++++++++++++++++++++++++----
 3 files changed, 63 insertions(+), 5 deletions(-)
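
The behaviour that the commit message describes can be modelled in a few lines. This is a simplified sketch, not kernel code: `struct node_model` and the helper names are hypothetical, and the value 16 for MAX_RECLAIM_RETRIES is assumed from mm/internal.h. It only shows why a node stays "hopeless" until something other than kswapd reports reclaim progress.

```c
#include <stdbool.h>

/* Assumed value; the kernel defines MAX_RECLAIM_RETRIES in mm/internal.h. */
#define MAX_RECLAIM_RETRIES 16

/* Hypothetical, simplified stand-in for the per-node reclaim state. */
struct node_model {
	int kswapd_failures;
};

/* kswapd treats the node as hopeless once the counter reaches the limit. */
static bool kswapd_is_hopeless(const struct node_model *n)
{
	return n->kswapd_failures >= MAX_RECLAIM_RETRIES;
}

/* One kswapd wakeup: a pass that reclaims nothing counts as a failure. */
static void kswapd_pass(struct node_model *n, unsigned long nr_reclaimed)
{
	if (kswapd_is_hopeless(n))
		return;			/* wakeup ignored, kswapd stays asleep */
	if (nr_reclaimed)
		n->kswapd_failures = 0;
	else
		n->kswapd_failures++;
}

/* Reclaim progress made outside kswapd, e.g. by direct reclaim. */
static void direct_reclaim_progress(struct node_model *n)
{
	n->kswapd_failures = 0;		/* revives a dormant kswapd */
}
```

In this model, once `kswapd_pass()` has failed MAX_RECLAIM_RETRIES times in a row, further wakeups have no effect, even if the node has become reclaimable in the meantime; only `direct_reclaim_progress()` brings kswapd back.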

Comments

Huang, Ying June 4, 2024, 7:57 a.m. UTC | #1
Byungchul Park <byungchul@sk.com> writes:

> Changes from v1:
> 	1. Don't allow to resume kswapd if the system is under memory
> 	   pressure that might affect direct reclaim by any chance, like
> 	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
>
> --->8---
> From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
> From: Byungchul Park <byungchul@sk.com>
> Date: Tue, 4 Jun 2024 15:27:56 +0900
> Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
>
> A system should run with kswapd running in background when under memory
> pressure, such as when the available memory level is below the low water
> mark and there are reclaimable folios.
>
> However, the current code let the system run with kswapd stopped if
> kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
> until direct reclaim will do for that, even if there are reclaimable
> folios that can be reclaimed by kswapd.  This case was observed in the
> following scenario:
>
>    CONFIG_NUMA_BALANCING enabled
>    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
>    numa node0 (500GB local DRAM, 128 CPUs)
>    numa node1 (100GB CXL memory, no CPUs)
>    swap off
>
>    1) Run a workload with big anon pages e.g. mmap(200GB).
>    2) Continue adding the same workload to the system.
>    3) The anon pages are placed in node0 by promotion/demotion.
>    4) kswapd0 stops because of the unreclaimable anon pages in node0.
>    5) Kill the memory hoggers to restore the system.
>
> After restoring the system at 5), the system starts to run without
> kswapd.  Even worse, tiering mechanism is no longer able to work since
> the mechanism relies on kswapd for demotion.

We have run into the situation where kswapd is kept in the failure state
for a long time in a multi-tier system.  I think that your solution is too
limited, because OOM killing may not happen, while the access pattern of
the workloads may change.  We have a preliminary and simple solution for
this as follows,

https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866

where we try to wake up kswapd every 10 seconds while it is in the failure
state.  This is another possible solution.
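
The periodic retry described above might look roughly like the following. This is a hedged sketch and not the code behind the link: the structure, the names, and the use of plain seconds as a time base are assumptions made for illustration; only the 10-second interval and the failure-state condition come from the description.

```c
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES	16	/* assumed, as in mm/internal.h */
#define KSWAPD_RETRY_INTERVAL	10	/* seconds; hypothetical name */

/* Hypothetical per-node state for this illustration. */
struct node_model {
	int kswapd_failures;
	unsigned long last_retry;	/* time of the last forced wakeup, in seconds */
};

/*
 * Decide whether to wake kswapd even though it is in the failure state.
 * Returns true at most once per KSWAPD_RETRY_INTERVAL seconds.
 */
static bool should_retry_kswapd(struct node_model *n, unsigned long now)
{
	if (n->kswapd_failures < MAX_RECLAIM_RETRIES)
		return false;	/* not in the failure state; normal wakeups apply */
	if (now - n->last_retry < KSWAPD_RETRY_INTERVAL)
		return false;	/* retried too recently */

	n->last_retry = now;
	return true;
}
```

Unlike a check based on newly allocated pages, a timer of this kind does not need to know why the node might have become reclaimable again, which is why it also covers a change in the workload's access pattern.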

> However, the node0 has pages newly allocated after 5), that might or
> might not be reclaimable.  Since those are potentially reclaimable, it's
> worth hopefully trying reclaim by allowing kswapd to work again.
>

[snip]

--
Best Regards,
Huang, Ying
Byungchul Park June 4, 2024, 8:45 a.m. UTC | #2
On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
> Byungchul Park <byungchul@sk.com> writes:
> 
> > Changes from v1:
> > 	1. Don't allow to resume kswapd if the system is under memory
> > 	   pressure that might affect direct reclaim by any chance, like
> > 	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
> >
> > --->8---
> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
> > From: Byungchul Park <byungchul@sk.com>
> > Date: Tue, 4 Jun 2024 15:27:56 +0900
> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
> >
> > A system should run with kswapd running in background when under memory
> > pressure, such as when the available memory level is below the low water
> > mark and there are reclaimable folios.
> >
> > However, the current code let the system run with kswapd stopped if
> > kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
> > until direct reclaim will do for that, even if there are reclaimable
> > folios that can be reclaimed by kswapd.  This case was observed in the
> > following scenario:
> >
> >    CONFIG_NUMA_BALANCING enabled
> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
> >    numa node0 (500GB local DRAM, 128 CPUs)
> >    numa node1 (100GB CXL memory, no CPUs)
> >    swap off
> >
> >    1) Run a workload with big anon pages e.g. mmap(200GB).
> >    2) Continue adding the same workload to the system.
> >    3) The anon pages are placed in node0 by promotion/demotion.
> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
> >    5) Kill the memory hoggers to restore the system.
> >
> > After restoring the system at 5), the system starts to run without
> > kswapd.  Even worse, tiering mechanism is no longer able to work since
> > the mechanism relies on kswapd for demotion.
> 
> We have run into the situation that kswapd is kept in failure state for
> long in a multiple tiers system.  I think that your solution is too

My solution just gives kswapd a chance to work again even if
kswapd_failures >= MAX_RECLAIM_RETRIES, if there are potentially
reclaimable folios.  That's it.

> limited, because OOM killing may not happen, while the access pattern of

I don't get this.  OOM will happen as is, through direct reclaim.

> the workloads may change.  We have a preliminary and simple solution for
> this as follows,
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866

Whether tiering is involved or not, the same problem can arise if
kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.

	Byungchul

> where we will try to wake up kswapd to check every 10 seconds if kswapd
> is in failure state.  This is another possible solution.
> 
> > However, the node0 has pages newly allocated after 5), that might or
> > might not be reclaimable.  Since those are potentially reclaimable, it's
> > worth hopefully trying reclaim by allowing kswapd to work again.
> >
> 
> [snip]
> 
> --
> Best Regards,
> Huang, Ying
Huang, Ying June 4, 2024, 8:57 a.m. UTC | #3
Byungchul Park <byungchul@sk.com> writes:

> On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
>> Byungchul Park <byungchul@sk.com> writes:
>> 
>> > Changes from v1:
>> > 	1. Don't allow to resume kswapd if the system is under memory
>> > 	   pressure that might affect direct reclaim by any chance, like
>> > 	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
>> >
>> > --->8---
>> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
>> > From: Byungchul Park <byungchul@sk.com>
>> > Date: Tue, 4 Jun 2024 15:27:56 +0900
>> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
>> >
>> > A system should run with kswapd running in background when under memory
>> > pressure, such as when the available memory level is below the low water
>> > mark and there are reclaimable folios.
>> >
>> > However, the current code let the system run with kswapd stopped if
>> > kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
>> > until direct reclaim will do for that, even if there are reclaimable
>> > folios that can be reclaimed by kswapd.  This case was observed in the
>> > following scenario:
>> >
>> >    CONFIG_NUMA_BALANCING enabled
>> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
>> >    numa node0 (500GB local DRAM, 128 CPUs)
>> >    numa node1 (100GB CXL memory, no CPUs)
>> >    swap off
>> >
>> >    1) Run a workload with big anon pages e.g. mmap(200GB).
>> >    2) Continue adding the same workload to the system.
>> >    3) The anon pages are placed in node0 by promotion/demotion.
>> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
>> >    5) Kill the memory hoggers to restore the system.
>> >
>> > After restoring the system at 5), the system starts to run without
>> > kswapd.  Even worse, tiering mechanism is no longer able to work since
>> > the mechanism relies on kswapd for demotion.
>> 
>> We have run into the situation that kswapd is kept in failure state for
>> long in a multiple tiers system.  I think that your solution is too
>
> My solution just gives a chance for kswapd to work again even if
> kswapd_failures >= MAX_RECLAIM_RETRIES, if there are potential
> reclaimable folios.  That's it.
>
>> limited, because OOM killing may not happen, while the access pattern of
>
> I don't get this.  OOM will happen as is, through direct reclaim.

A system that fails to reclaim via kswapd may succeed in reclaiming via
direct reclaim, because more CPUs are used for scanning the page tables.

In a system with NUMA balancing based page promotion and page demotion
enabled, page promotion will wake up kswapd, but kswapd may fail in some
situations.  But page promotion will not trigger direct reclaim or OOM.

>> the workloads may change.  We have a preliminary and simple solution for
>> this as follows,
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866
>
> Whether tiering is involved or not, the same problem can arise if
> kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.

Your description is about tiering too.  Can you describe a situation
without tiering?

--
Best Regards,
Huang, Ying

> 	Byungchul
>
>> where we will try to wake up kswapd to check every 10 seconds if kswapd
>> is in failure state.  This is another possible solution.
>> 
>> > However, the node0 has pages newly allocated after 5), that might or
>> > might not be reclaimable.  Since those are potentially reclaimable, it's
>> > worth hopefully trying reclaim by allowing kswapd to work again.
>> >
>> 
>> [snip]
>> 
>> --
>> Best Regards,
>> Huang, Ying
Byungchul Park June 4, 2024, 9:12 a.m. UTC | #4
On Tue, Jun 04, 2024 at 04:57:17PM +0800, Huang, Ying wrote:
> Byungchul Park <byungchul@sk.com> writes:
> 
> > On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
> >> Byungchul Park <byungchul@sk.com> writes:
> >> 
> >> > Changes from v1:
> >> > 	1. Don't allow to resume kswapd if the system is under memory
> >> > 	   pressure that might affect direct reclaim by any chance, like
> >> > 	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
> >> >
> >> > --->8---
> >> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
> >> > From: Byungchul Park <byungchul@sk.com>
> >> > Date: Tue, 4 Jun 2024 15:27:56 +0900
> >> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
> >> >
> >> > A system should run with kswapd running in background when under memory
> >> > pressure, such as when the available memory level is below the low water
> >> > mark and there are reclaimable folios.
> >> >
> >> > However, the current code let the system run with kswapd stopped if
> >> > kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
> >> > until direct reclaim will do for that, even if there are reclaimable
> >> > folios that can be reclaimed by kswapd.  This case was observed in the
> >> > following scenario:
> >> >
> >> >    CONFIG_NUMA_BALANCING enabled
> >> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
> >> >    numa node0 (500GB local DRAM, 128 CPUs)
> >> >    numa node1 (100GB CXL memory, no CPUs)
> >> >    swap off
> >> >
> >> >    1) Run a workload with big anon pages e.g. mmap(200GB).
> >> >    2) Continue adding the same workload to the system.
> >> >    3) The anon pages are placed in node0 by promotion/demotion.
> >> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
> >> >    5) Kill the memory hoggers to restore the system.
> >> >
> >> > After restoring the system at 5), the system starts to run without
> >> > kswapd.  Even worse, tiering mechanism is no longer able to work since
> >> > the mechanism relies on kswapd for demotion.
> >> 
> >> We have run into the situation that kswapd is kept in failure state for
> >> long in a multiple tiers system.  I think that your solution is too
> >
> > My solution just gives a chance for kswapd to work again even if
> > kswapd_failures >= MAX_RECLAIM_RETRIES, if there are potential
> > reclaimable folios.  That's it.
> >
> >> limited, because OOM killing may not happen, while the access pattern of
> >
> > I don't get this.  OOM will happen as is, through direct reclaim.
> 
> A system that fails to reclaim via kswapd may succeed to reclaim via
> direct reclaim, because more CPUs are used to scanning the page tables.
> 
> In a system with NUMA balancing based page promotion and page demotion
> enabled, page promotion will wake up kswapd, but kswapd may fail in some
> situations.  But page promotion will no trigger direct reclaim or OOM.
> 
> >> the workloads may change.  We have a preliminary and simple solution for
> >> this as follows,
> >> 
> >> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866
> >
> > Whether tiering is involved or not, the same problem can arise if
> > kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.
> 
> Your description is about tiering too.  Can you describe a situation

I mentioned "tiering" when describing how to reproduce the problem because
I ran into it while testing with a tiering system, but I don't think
it's a necessary condition.

Let me ask you back: why was the logic to stop kswapd introduced in the
first place?  Because the problem had already been observed,
whether tiering is involved or not.  The same problem will arise once
kswapd stops.

	Byungchul

> without tiering?
> 
> --
> Best Regards,
> Huang, Ying
> 
> > 	Byungchul
> >
> >> where we will try to wake up kswapd to check every 10 seconds if kswapd
> >> is in failure state.  This is another possible solution.
> >> 
> >> > However, the node0 has pages newly allocated after 5), that might or
> >> > might not be reclaimable.  Since those are potentially reclaimable, it's
> >> > worth hopefully trying reclaim by allowing kswapd to work again.
> >> >
> >> 
> >> [snip]
> >> 
> >> --
> >> Best Regards,
> >> Huang, Ying
Byungchul Park June 4, 2024, 10:25 a.m. UTC | #5
On Tue, Jun 04, 2024 at 06:12:22PM +0900, Byungchul Park wrote:
> On Tue, Jun 04, 2024 at 04:57:17PM +0800, Huang, Ying wrote:
> > Byungchul Park <byungchul@sk.com> writes:
> > 
> > > On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
> > >> Byungchul Park <byungchul@sk.com> writes:
> > >> 
> > >> > Changes from v1:
> > >> > 	1. Don't allow to resume kswapd if the system is under memory
> > >> > 	   pressure that might affect direct reclaim by any chance, like
> > >> > 	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
> > >> >
> > >> > --->8---
> > >> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
> > >> > From: Byungchul Park <byungchul@sk.com>
> > >> > Date: Tue, 4 Jun 2024 15:27:56 +0900
> > >> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
> > >> >
> > >> > A system should run with kswapd running in background when under memory
> > >> > pressure, such as when the available memory level is below the low water
> > >> > mark and there are reclaimable folios.
> > >> >
> > >> > However, the current code let the system run with kswapd stopped if
> > >> > kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
> > >> > until direct reclaim will do for that, even if there are reclaimable
> > >> > folios that can be reclaimed by kswapd.  This case was observed in the
> > >> > following scenario:
> > >> >
> > >> >    CONFIG_NUMA_BALANCING enabled
> > >> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
> > >> >    numa node0 (500GB local DRAM, 128 CPUs)
> > >> >    numa node1 (100GB CXL memory, no CPUs)
> > >> >    swap off
> > >> >
> > >> >    1) Run a workload with big anon pages e.g. mmap(200GB).
> > >> >    2) Continue adding the same workload to the system.
> > >> >    3) The anon pages are placed in node0 by promotion/demotion.
> > >> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
> > >> >    5) Kill the memory hoggers to restore the system.
> > >> >
> > >> > After restoring the system at 5), the system starts to run without
> > >> > kswapd.  Even worse, tiering mechanism is no longer able to work since
> > >> > the mechanism relies on kswapd for demotion.
> > >> 
> > >> We have run into the situation that kswapd is kept in failure state for
> > >> long in a multiple tiers system.  I think that your solution is too
> > >
> > > My solution just gives a chance for kswapd to work again even if
> > > kswapd_failures >= MAX_RECLAIM_RETRIES, if there are potential
> > > reclaimable folios.  That's it.
> > >
> > >> limited, because OOM killing may not happen, while the access pattern of
> > >
> > > I don't get this.  OOM will happen as is, through direct reclaim.
> > 
> > A system that fails to reclaim via kswapd may succeed to reclaim via
> > direct reclaim, because more CPUs are used to scanning the page tables.
> > 
> > In a system with NUMA balancing based page promotion and page demotion
> > enabled, page promotion will wake up kswapd, but kswapd may fail in some
> > situations.  But page promotion will no trigger direct reclaim or OOM.
> > 
> > >> the workloads may change.  We have a preliminary and simple solution for
> > >> this as follows,
> > >> 
> > >> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866
> > >
> > > Whether tiering is involved or not, the same problem can arise if
> > > kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.
> > 
> > Your description is about tiering too.  Can you describe a situation
> 
> I mentioned "tiering" while I described how to reproduce because I ran
> into the situation while testing with tiering system but I don't think
> it's the necessary condition.
> 
> Let me ask you back, why the logic to stop kswapd was considered in the
> first place?  That's because the problem was already observed anyway

To be clear..

The problem, kswapd_failures >= MAX_RECLAIM_RETRIES, can happen whether
tiering is involved or not.  Once kswapd stops, the system keeps running
without kswapd even after it has recovered, e.g. by killing the hoggers.
*Even worse*, the tiering mechanism doesn't work in this situation.

I hope what I meant has been delivered.

	Byungchul

> whether tiering is involved or not.  The same problem will arise once
> kswapd stops.
> 
> 	Byungchul
> 
> > without tiering?
> > 
> > --
> > Best Regards,
> > Huang, Ying
> > 
> > > 	Byungchul
> > >
> > >> where we will try to wake up kswapd to check every 10 seconds if kswapd
> > >> is in failure state.  This is another possible solution.
> > >> 
> > >> > However, the node0 has pages newly allocated after 5), that might or
> > >> > might not be reclaimable.  Since those are potentially reclaimable, it's
> > >> > worth hopefully trying reclaim by allowing kswapd to work again.
> > >> >
> > >> 
> > >> [snip]
> > >> 
> > >> --
> > >> Best Regards,
> > >> Huang, Ying
Johannes Weiner June 4, 2024, 12:29 p.m. UTC | #6
On Tue, Jun 04, 2024 at 07:25:16PM +0900, Byungchul Park wrote:
> On Tue, Jun 04, 2024 at 06:12:22PM +0900, Byungchul Park wrote:
> > On Tue, Jun 04, 2024 at 04:57:17PM +0800, Huang, Ying wrote:
> > > Byungchul Park <byungchul@sk.com> writes:
> > > 
> > > > On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
> > > >> Byungchul Park <byungchul@sk.com> writes:
> > > >> 
> > > >> > Changes from v1:
> > > >> > 	1. Don't allow to resume kswapd if the system is under memory
> > > >> > 	   pressure that might affect direct reclaim by any chance, like
> > > >> > 	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
> > > >> >
> > > >> > --->8---
> > > >> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
> > > >> > From: Byungchul Park <byungchul@sk.com>
> > > >> > Date: Tue, 4 Jun 2024 15:27:56 +0900
> > > >> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
> > > >> >
> > > >> > A system should run with kswapd running in background when under memory
> > > >> > pressure, such as when the available memory level is below the low water
> > > >> > mark and there are reclaimable folios.
> > > >> >
> > > >> > However, the current code let the system run with kswapd stopped if
> > > >> > kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
> > > >> > until direct reclaim will do for that, even if there are reclaimable
> > > >> > folios that can be reclaimed by kswapd.  This case was observed in the
> > > >> > following scenario:
> > > >> >
> > > >> >    CONFIG_NUMA_BALANCING enabled
> > > >> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
> > > >> >    numa node0 (500GB local DRAM, 128 CPUs)
> > > >> >    numa node1 (100GB CXL memory, no CPUs)
> > > >> >    swap off
> > > >> >
> > > >> >    1) Run a workload with big anon pages e.g. mmap(200GB).
> > > >> >    2) Continue adding the same workload to the system.
> > > >> >    3) The anon pages are placed in node0 by promotion/demotion.
> > > >> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
> > > >> >    5) Kill the memory hoggers to restore the system.
> > > >> >
> > > >> > After restoring the system at 5), the system starts to run without
> > > >> > kswapd.  Even worse, tiering mechanism is no longer able to work since
> > > >> > the mechanism relies on kswapd for demotion.
> > > >> 
> > > >> We have run into the situation that kswapd is kept in failure state for
> > > >> long in a multiple tiers system.  I think that your solution is too
> > > >
> > > > My solution just gives a chance for kswapd to work again even if
> > > > kswapd_failures >= MAX_RECLAIM_RETRIES, if there are potential
> > > > reclaimable folios.  That's it.
> > > >
> > > >> limited, because OOM killing may not happen, while the access pattern of
> > > >
> > > > I don't get this.  OOM will happen as is, through direct reclaim.
> > > 
> > > A system that fails to reclaim via kswapd may succeed to reclaim via
> > > direct reclaim, because more CPUs are used to scanning the page tables.
> > > 
> > > In a system with NUMA balancing based page promotion and page demotion
> > > enabled, page promotion will wake up kswapd, but kswapd may fail in some
> > > situations.  But page promotion will no trigger direct reclaim or OOM.
> > > 
> > > >> the workloads may change.  We have a preliminary and simple solution for
> > > >> this as follows,
> > > >> 
> > > >> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866
> > > >
> > > > Whether tiering is involved or not, the same problem can arise if
> > > > kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.
> > > 
> > > Your description is about tiering too.  Can you describe a situation
> > 
> > I mentioned "tiering" while I described how to reproduce because I ran
> > into the situation while testing with tiering system but I don't think
> > it's the necessary condition.
> > 
> > Let me ask you back, why the logic to stop kswapd was considered in the
> > first place?  That's because the problem was already observed anyway
> 
> To be clear..
> 
> The problem, kswapd_failures >= MAX_RECLAIM_RETRIES, can happen whether
> tiering is involved not not.  Once kswapd stops, the system should run
> without kswapd even after recovered e.g. by killing the hoggers.  *Even
> worse*, tiering mechanism doesn't work in this situation.

But like Ying said, in other situations it's direct reclaim that kicks
in and clears the flag.

The failure-sleep and direct reclaim triggered recovery have been in
place since 2017. Both parties who observed an issue with it recently
did so in tiering scenarios. IMO a tiering-specific solution makes the
most sense.
Byungchul Park June 5, 2024, 12:21 a.m. UTC | #7
On Tue, Jun 04, 2024 at 08:29:27AM -0400, Johannes Weiner wrote:
> On Tue, Jun 04, 2024 at 07:25:16PM +0900, Byungchul Park wrote:
> > On Tue, Jun 04, 2024 at 06:12:22PM +0900, Byungchul Park wrote:
> > > On Tue, Jun 04, 2024 at 04:57:17PM +0800, Huang, Ying wrote:
> > > > Byungchul Park <byungchul@sk.com> writes:
> > > > 
> > > > > On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
> > > > >> Byungchul Park <byungchul@sk.com> writes:
> > > > >> 
> > > > >> > Changes from v1:
> > > > >> > 	1. Don't allow to resume kswapd if the system is under memory
> > > > >> > 	   pressure that might affect direct reclaim by any chance, like
> > > > >> > 	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
> > > > >> >
> > > > >> > --->8---
> > > > >> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
> > > > >> > From: Byungchul Park <byungchul@sk.com>
> > > > >> > Date: Tue, 4 Jun 2024 15:27:56 +0900
> > > > >> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
> > > > >> >
> > > > >> > A system should run with kswapd running in background when under memory
> > > > >> > pressure, such as when the available memory level is below the low water
> > > > >> > mark and there are reclaimable folios.
> > > > >> >
> > > > >> > However, the current code let the system run with kswapd stopped if
> > > > >> > kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
> > > > >> > until direct reclaim will do for that, even if there are reclaimable
> > > > >> > folios that can be reclaimed by kswapd.  This case was observed in the
> > > > >> > following scenario:
> > > > >> >
> > > > >> >    CONFIG_NUMA_BALANCING enabled
> > > > >> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
> > > > >> >    numa node0 (500GB local DRAM, 128 CPUs)
> > > > >> >    numa node1 (100GB CXL memory, no CPUs)
> > > > >> >    swap off
> > > > >> >
> > > > >> >    1) Run a workload with big anon pages e.g. mmap(200GB).
> > > > >> >    2) Continue adding the same workload to the system.
> > > > >> >    3) The anon pages are placed in node0 by promotion/demotion.
> > > > >> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
> > > > >> >    5) Kill the memory hoggers to restore the system.
> > > > >> >
> > > > >> > After restoring the system at 5), the system starts to run without
> > > > >> > kswapd.  Even worse, tiering mechanism is no longer able to work since
> > > > >> > the mechanism relies on kswapd for demotion.
> > > > >> 
> > > > >> We have run into the situation that kswapd is kept in failure state for
> > > > >> long in a multiple tiers system.  I think that your solution is too
> > > > >
> > > > > My solution just gives a chance for kswapd to work again even if
> > > > > kswapd_failures >= MAX_RECLAIM_RETRIES, if there are potential
> > > > > reclaimable folios.  That's it.
> > > > >
> > > > >> limited, because OOM killing may not happen, while the access pattern of
> > > > >
> > > > > I don't get this.  OOM will happen as is, through direct reclaim.
> > > > 
> > > > A system that fails to reclaim via kswapd may succeed to reclaim via
> > > > direct reclaim, because more CPUs are used to scanning the page tables.
> > > > 
> > > > In a system with NUMA balancing based page promotion and page demotion
> > > > enabled, page promotion will wake up kswapd, but kswapd may fail in some
> > > > situations.  But page promotion will no trigger direct reclaim or OOM.
> > > > 
> > > > >> the workloads may change.  We have a preliminary and simple solution for
> > > > >> this as follows,
> > > > >> 
> > > > >> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866
> > > > >
> > > > > Whether tiering is involved or not, the same problem can arise if
> > > > > kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.
> > > > 
> > > > Your description is about tiering too.  Can you describe a situation
> > > 
> > > I mentioned "tiering" while I described how to reproduce because I ran
> > > into the situation while testing with tiering system but I don't think
> > > it's the necessary condition.
> > > 
> > > Let me ask you back, why the logic to stop kswapd was considered in the
> > > first place?  That's because the problem was already observed anyway
> > 
> > To be clear..
> > 
> > The problem, kswapd_failures >= MAX_RECLAIM_RETRIES, can happen whether
> > tiering is involved not not.  Once kswapd stops, the system should run
> > without kswapd even after recovered e.g. by killing the hoggers.  *Even
> > worse*, tiering mechanism doesn't work in this situation.
> 
> But like Ying said, in other situations it's direct reclaim that kicks
> in and clears the flag.

I already described it in the commit message.

> The failure-sleep and direct reclaim triggered recovery have been in

Sure.  It's better than nothing.

> place since 2017. Both parties who observed an issue with it recently
> did so in tiering scenarios. IMO a tiering-specific solution makes the
> most sense.

So..  Is the following situation in a non-tiering system okay?  Really?

   A system runs with kswapd disabled unless hitting min water mark,
   even if there might be something that kswapd can work on.

I don't understand why it's okay.  Could you explain more?  Then why do
we use kswapd in background?
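
To make the situation above concrete, here is a rough model of who reclaims at a given free-memory level.  It is a simplification for illustration only: the function and enum names are hypothetical, and the real allocator has more steps than this.

```c
#include <stdbool.h>

enum reclaim_action {
	RECLAIM_NONE,		/* nobody reclaims */
	RECLAIM_KSWAPD,		/* background reclaim by kswapd */
	RECLAIM_DIRECT,		/* the allocating task reclaims by itself */
};

/*
 * Simplified model: below the min watermark the allocating task falls
 * back to direct reclaim; between min and low the system relies on
 * kswapd, which does nothing once it has been marked hopeless.
 */
static enum reclaim_action pick_reclaim(unsigned long nr_free,
					unsigned long wmark_min,
					unsigned long wmark_low,
					bool kswapd_hopeless)
{
	if (nr_free < wmark_min)
		return RECLAIM_DIRECT;
	if (nr_free < wmark_low && !kswapd_hopeless)
		return RECLAIM_KSWAPD;
	return RECLAIM_NONE;
}
```

With min = 100 and low = 200 pages, 150 free pages gives RECLAIM_KSWAPD normally but RECLAIM_NONE once kswapd is hopeless; nothing reclaims until the free pages drop below 100.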

	Byungchul
Huang, Ying June 5, 2024, 12:59 a.m. UTC | #8
Byungchul Park <byungchul@sk.com> writes:

> On Tue, Jun 04, 2024 at 08:29:27AM -0400, Johannes Weiner wrote:
>> On Tue, Jun 04, 2024 at 07:25:16PM +0900, Byungchul Park wrote:
>> > On Tue, Jun 04, 2024 at 06:12:22PM +0900, Byungchul Park wrote:
>> > > On Tue, Jun 04, 2024 at 04:57:17PM +0800, Huang, Ying wrote:
>> > > > Byungchul Park <byungchul@sk.com> writes:
>> > > > 
>> > > > > On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
>> > > > >> Byungchul Park <byungchul@sk.com> writes:
>> > > > >> 
>> > > > >> > Changes from v1:
>> > > > >> > 	1. Don't allow to resume kswapd if the system is under memory
>> > > > >> > 	   pressure that might affect direct reclaim by any chance, like
>> > > > >> > 	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
>> > > > >> >
>> > > > >> > --->8---
>> > > > >> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
>> > > > >> > From: Byungchul Park <byungchul@sk.com>
>> > > > >> > Date: Tue, 4 Jun 2024 15:27:56 +0900
>> > > > >> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
>> > > > >> >
>> > > > >> > A system should run with kswapd running in background when under memory
>> > > > >> > pressure, such as when the available memory level is below the low water
>> > > > >> > mark and there are reclaimable folios.
>> > > > >> >
>> > > > >> > However, the current code let the system run with kswapd stopped if
>> > > > >> > kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
>> > > > >> > until direct reclaim will do for that, even if there are reclaimable
>> > > > >> > folios that can be reclaimed by kswapd.  This case was observed in the
>> > > > >> > following scenario:
>> > > > >> >
>> > > > >> >    CONFIG_NUMA_BALANCING enabled
>> > > > >> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
>> > > > >> >    numa node0 (500GB local DRAM, 128 CPUs)
>> > > > >> >    numa node1 (100GB CXL memory, no CPUs)
>> > > > >> >    swap off
>> > > > >> >
>> > > > >> >    1) Run a workload with big anon pages e.g. mmap(200GB).
>> > > > >> >    2) Continue adding the same workload to the system.
>> > > > >> >    3) The anon pages are placed in node0 by promotion/demotion.
>> > > > >> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
>> > > > >> >    5) Kill the memory hoggers to restore the system.
>> > > > >> >
>> > > > >> > After restoring the system at 5), the system starts to run without
>> > > > >> > kswapd.  Even worse, tiering mechanism is no longer able to work since
>> > > > >> > the mechanism relies on kswapd for demotion.
>> > > > >> 
>> > > > >> We have run into the situation that kswapd is kept in failure state for
>> > > > >> long in a multiple tiers system.  I think that your solution is too
>> > > > >
>> > > > > My solution just gives a chance for kswapd to work again even if
>> > > > > kswapd_failures >= MAX_RECLAIM_RETRIES, if there are potential
>> > > > > reclaimable folios.  That's it.
>> > > > >
>> > > > >> limited, because OOM killing may not happen, while the access pattern of
>> > > > >
>> > > > > I don't get this.  OOM will happen as is, through direct reclaim.
>> > > > 
>> > > > A system that fails to reclaim via kswapd may succeed to reclaim via
>> > > > direct reclaim, because more CPUs are used to scanning the page tables.
>> > > > 
>> > > > In a system with NUMA balancing based page promotion and page demotion
>> > > > enabled, page promotion will wake up kswapd, but kswapd may fail in some
>> > > > situations.  But page promotion will no trigger direct reclaim or OOM.
>> > > > 
>> > > > >> the workloads may change.  We have a preliminary and simple solution for
>> > > > >> this as follows,
>> > > > >> 
>> > > > >> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866
>> > > > >
>> > > > > Whether tiering is involved or not, the same problem can arise if
>> > > > > kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.
>> > > > 
>> > > > Your description is about tiering too.  Can you describe a situation
>> > > 
>> > > I mentioned "tiering" while I described how to reproduce because I ran
>> > > into the situation while testing with tiering system but I don't think
>> > > it's the necessary condition.
>> > > 
>> > > Let me ask you back, why the logic to stop kswapd was considered in the
>> > > first place?  That's because the problem was already observed anyway
>> > 
>> > To be clear..
>> > 
>> > The problem, kswapd_failures >= MAX_RECLAIM_RETRIES, can happen whether
>> > tiering is involved not not.  Once kswapd stops, the system should run
>> > without kswapd even after recovered e.g. by killing the hoggers.  *Even
>> > worse*, tiering mechanism doesn't work in this situation.
>> 
>> But like Ying said, in other situations it's direct reclaim that kicks
>> in and clears the flag.
>
> I already described it in the commit message.
>
>> The failure-sleep and direct reclaim triggered recovery have been in
>
> Sure.  It's better than nothing.
>
>> place since 2017. Both parties who observed an issue with it recently
>> did so in tiering scenarios. IMO a tiering-specific solution makes the
>> most sense.
>
> So..  Is the follow situation in a non-tiering system okay?  Really?
>
>    A system runs with kswapd disabled unless hitting min water mark,
>    even if there might be something that kswapd can work on.
>
> I don't undertand why it's okay.  Could you explain more?  Then why do
> we use kswapd in background?

IIUC, it's okay.  One direct reclaim will be triggered, and then kswapd
reclaim will be recovered.  So, the performance will not be
influenced much.
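
To make the mechanism under discussion concrete, here is a minimal
user-space sketch of the failure counter.  It is a toy model, not kernel
code: struct node_model and the model_*() helpers are hypothetical names,
and only the counter kswapd_failures and the limit MAX_RECLAIM_RETRIES
(16 in mm/internal.h) come from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Same value as the kernel's MAX_RECLAIM_RETRIES in mm/internal.h. */
#define MAX_RECLAIM_RETRIES 16

/* Hypothetical stand-in for the per-node state (pg_data_t in the kernel). */
struct node_model {
	int kswapd_failures;
};

/* A wakeup request is ignored once the node is considered hopeless. */
static bool model_wakeup_kswapd(const struct node_model *node)
{
	return node->kswapd_failures < MAX_RECLAIM_RETRIES;
}

/* One kswapd run: a run that reclaims nothing bumps the counter. */
static void model_kswapd_run(struct node_model *node, bool made_progress)
{
	assert(node->kswapd_failures >= 0);
	if (made_progress)
		node->kswapd_failures = 0;
	else
		node->kswapd_failures++;
}

/* Direct reclaim that frees anything resets the counter. */
static void model_direct_reclaim(struct node_model *node, bool reclaimed)
{
	if (reclaimed)
		node->kswapd_failures = 0;
}
```

In this model, 16 failed runs make model_wakeup_kswapd() return false,
and it stays false until model_direct_reclaim() succeeds once.  That
window is what the two sides of this thread weigh differently.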

Do you think that this will impact performance?  If so, please try to
prove it with test results.

--
Best Regards,
Huang, Ying
Byungchul Park June 5, 2024, 1:24 a.m. UTC | #9
On Wed, Jun 05, 2024 at 08:59:01AM +0800, Huang, Ying wrote:
> Byungchul Park <byungchul@sk.com> writes:
> 
> > On Tue, Jun 04, 2024 at 08:29:27AM -0400, Johannes Weiner wrote:
> >> On Tue, Jun 04, 2024 at 07:25:16PM +0900, Byungchul Park wrote:
> >> > On Tue, Jun 04, 2024 at 06:12:22PM +0900, Byungchul Park wrote:
> >> > > On Tue, Jun 04, 2024 at 04:57:17PM +0800, Huang, Ying wrote:
> >> > > > Byungchul Park <byungchul@sk.com> writes:
> >> > > > 
> >> > > > > On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
> >> > > > >> Byungchul Park <byungchul@sk.com> writes:
> >> > > > >> 
> >> > > > >> > Changes from v1:
> >> > > > >> > 	1. Don't allow to resume kswapd if the system is under memory
> >> > > > >> > 	   pressure that might affect direct reclaim by any chance, like
> >> > > > >> > 	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
> >> > > > >> >
> >> > > > >> > --->8---
> >> > > > >> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
> >> > > > >> > From: Byungchul Park <byungchul@sk.com>
> >> > > > >> > Date: Tue, 4 Jun 2024 15:27:56 +0900
> >> > > > >> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
> >> > > > >> >
> >> > > > >> > A system should run with kswapd running in background when under memory
> >> > > > >> > pressure, such as when the available memory level is below the low water
> >> > > > >> > mark and there are reclaimable folios.
> >> > > > >> >
> >> > > > >> > However, the current code let the system run with kswapd stopped if
> >> > > > >> > kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
> >> > > > >> > until direct reclaim will do for that, even if there are reclaimable
> >> > > > >> > folios that can be reclaimed by kswapd.  This case was observed in the
> >> > > > >> > following scenario:
> >> > > > >> >
> >> > > > >> >    CONFIG_NUMA_BALANCING enabled
> >> > > > >> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
> >> > > > >> >    numa node0 (500GB local DRAM, 128 CPUs)
> >> > > > >> >    numa node1 (100GB CXL memory, no CPUs)
> >> > > > >> >    swap off
> >> > > > >> >
> >> > > > >> >    1) Run a workload with big anon pages e.g. mmap(200GB).
> >> > > > >> >    2) Continue adding the same workload to the system.
> >> > > > >> >    3) The anon pages are placed in node0 by promotion/demotion.
> >> > > > >> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
> >> > > > >> >    5) Kill the memory hoggers to restore the system.
> >> > > > >> >
> >> > > > >> > After restoring the system at 5), the system starts to run without
> >> > > > >> > kswapd.  Even worse, tiering mechanism is no longer able to work since
> >> > > > >> > the mechanism relies on kswapd for demotion.
> >> > > > >> 
> >> > > > >> We have run into the situation that kswapd is kept in failure state for
> >> > > > >> long in a multiple tiers system.  I think that your solution is too
> >> > > > >
> >> > > > > My solution just gives a chance for kswapd to work again even if
> >> > > > > kswapd_failures >= MAX_RECLAIM_RETRIES, if there are potential
> >> > > > > reclaimable folios.  That's it.
> >> > > > >
> >> > > > >> limited, because OOM killing may not happen, while the access pattern of
> >> > > > >
> >> > > > > I don't get this.  OOM will happen as is, through direct reclaim.
> >> > > > 
> >> > > > A system that fails to reclaim via kswapd may succeed to reclaim via
> >> > > > direct reclaim, because more CPUs are used to scanning the page tables.
> >> > > > 
> >> > > > In a system with NUMA balancing based page promotion and page demotion
> >> > > > enabled, page promotion will wake up kswapd, but kswapd may fail in some
> >> > > > situations.  But page promotion will no trigger direct reclaim or OOM.
> >> > > > 
> >> > > > >> the workloads may change.  We have a preliminary and simple solution for
> >> > > > >> this as follows,
> >> > > > >> 
> >> > > > >> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866
> >> > > > >
> >> > > > > Whether tiering is involved or not, the same problem can arise if
> >> > > > > kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.
> >> > > > 
> >> > > > Your description is about tiering too.  Can you describe a situation
> >> > > 
> >> > > I mentioned "tiering" while I described how to reproduce because I ran
> >> > > into the situation while testing with tiering system but I don't think
> >> > > it's the necessary condition.
> >> > > 
> >> > > Let me ask you back, why the logic to stop kswapd was considered in the
> >> > > first place?  That's because the problem was already observed anyway
> >> > 
> >> > To be clear..
> >> > 
> >> > The problem, kswapd_failures >= MAX_RECLAIM_RETRIES, can happen whether
> >> > tiering is involved not not.  Once kswapd stops, the system should run
> >> > without kswapd even after recovered e.g. by killing the hoggers.  *Even
> >> > worse*, tiering mechanism doesn't work in this situation.
> >> 
> >> But like Ying said, in other situations it's direct reclaim that kicks
> >> in and clears the flag.
> >
> > I already described it in the commit message.
> >
> >> The failure-sleep and direct reclaim triggered recovery have been in
> >
> > Sure.  It's better than nothing.
> >
> >> place since 2017. Both parties who observed an issue with it recently
> >> did so in tiering scenarios. IMO a tiering-specific solution makes the
> >> most sense.
> >
> > So..  Is the follow situation in a non-tiering system okay?  Really?
> >
> >    A system runs with kswapd disabled unless hitting min water mark,
> >    even if there might be something that kswapd can work on.
> >
> > I don't undertand why it's okay.  Could you explain more?  Then why do
> > we use kswapd in background?
> 
> IIUC, it's okey.  One direct reclaiming will be triggered, then kswapd
> reclaiming will be recovered.  So, the performance will not be
> influenced much.

So is it because the performance will not be influenced much?  Hm..  the
system would get impacted at the moment when direct reclaim gets
triggered, even though kswapd can mitigate the impact proactively.

However, I don't want to insist strongly if you all consider it's okay.

Changing the topic to tiering, which one looks better between two
approaches to solve the issue that tiering doesn't work once the failures
hit MAX_RECLAIM_RETRIES:

   1) periodically run kswapd
   2) run kswapd if there might be reclaimable folios as this patch does

For 2), this patch should be modified a lil bit tho.
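
As a rough illustration of 2), the check could look like the sketch
below.  This is only my reading of the v2 changelog, not the code in the
patch: model_may_resume_kswapd() and its parameters are hypothetical, and
the only detail taken from the changelog is the guard that refuses to
resume kswapd when free pages are below (low wmark + min wmark)/2.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a node whose kswapd has given up
 * may get another try.  All arguments are page counts.
 */
static bool model_may_resume_kswapd(unsigned long nr_free,
				    unsigned long nr_maybe_reclaimable,
				    unsigned long wmark_min,
				    unsigned long wmark_low)
{
	unsigned long guard = (wmark_low + wmark_min) / 2;

	assert(wmark_min <= wmark_low);

	/* Too close to the min watermark: leave it to direct reclaim. */
	if (nr_free < guard)
		return false;

	/* Otherwise retry only if something might be reclaimable. */
	return nr_maybe_reclaimable > 0;
}
```

With wmark_min = 100 and wmark_low = 200 the guard is 150 pages, so a
node with 149 free pages is left to direct reclaim, while one with 150
free pages and at least one candidate folio gets kswapd back.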

	Byungchul

> Do you think that this will impact performance?  If so, please try to
> prove it with test results.
> 
> --
> Best Regards,
> Huang, Ying
Byungchul Park June 5, 2024, 1:50 a.m. UTC | #10
On Tue, Jun 04, 2024 at 04:57:17PM +0800, Huang, Ying wrote:
> Byungchul Park <byungchul@sk.com> writes:
> 
> > On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
> >> Byungchul Park <byungchul@sk.com> writes:
> >> 
> >> > Changes from v1:
> >> > 	1. Don't allow to resume kswapd if the system is under memory
> >> > 	   pressure that might affect direct reclaim by any chance, like
> >> > 	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
> >> >
> >> > --->8---
> >> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
> >> > From: Byungchul Park <byungchul@sk.com>
> >> > Date: Tue, 4 Jun 2024 15:27:56 +0900
> >> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
> >> >
> >> > A system should run with kswapd running in background when under memory
> >> > pressure, such as when the available memory level is below the low water
> >> > mark and there are reclaimable folios.
> >> >
> >> > However, the current code let the system run with kswapd stopped if
> >> > kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
> >> > until direct reclaim will do for that, even if there are reclaimable
> >> > folios that can be reclaimed by kswapd.  This case was observed in the
> >> > following scenario:
> >> >
> >> >    CONFIG_NUMA_BALANCING enabled
> >> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
> >> >    numa node0 (500GB local DRAM, 128 CPUs)
> >> >    numa node1 (100GB CXL memory, no CPUs)
> >> >    swap off
> >> >
> >> >    1) Run a workload with big anon pages e.g. mmap(200GB).
> >> >    2) Continue adding the same workload to the system.
> >> >    3) The anon pages are placed in node0 by promotion/demotion.
> >> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
> >> >    5) Kill the memory hoggers to restore the system.
> >> >
> >> > After restoring the system at 5), the system starts to run without
> >> > kswapd.  Even worse, tiering mechanism is no longer able to work since
> >> > the mechanism relies on kswapd for demotion.
> >> 
> >> We have run into the situation that kswapd is kept in failure state for
> >> long in a multiple tiers system.  I think that your solution is too
> >
> > My solution just gives a chance for kswapd to work again even if
> > kswapd_failures >= MAX_RECLAIM_RETRIES, if there are potential
> > reclaimable folios.  That's it.
> >
> >> limited, because OOM killing may not happen, while the access pattern of
> >
> > I don't get this.  OOM will happen as is, through direct reclaim.
> 
> A system that fails to reclaim via kswapd may succeed to reclaim via
> direct reclaim, because more CPUs are used to scanning the page tables.

Honestly, I don't think so with this description.

The fact that the system hit MAX_RECLAIM_RETRIES means the system is
currently hopeless unless reclaiming folios in a stronger way by *direct
reclaim*.  The solution for this situation should not be about letting
more CPUs participate in reclaiming, again, *at least in this situation*.

What you described here is true only in a normal state where the more
CPUs work on reclaiming, the more reclaimable folios can be reclaimed.
kswapd can be a helper *only* when there are kswapd-reclaimable folios.

	Byungchul

> In a system with NUMA balancing based page promotion and page demotion
> enabled, page promotion will wake up kswapd, but kswapd may fail in some
> situations.  But page promotion will no trigger direct reclaim or OOM.
> 
> >> the workloads may change.  We have a preliminary and simple solution for
> >> this as follows,
> >> 
> >> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866
> >
> > Whether tiering is involved or not, the same problem can arise if
> > kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.
> 
> Your description is about tiering too.  Can you describe a situation
> without tiering?
> 
> --
> Best Regards,
> Huang, Ying
> 
> > 	Byungchul
> >
> >> where we will try to wake up kswapd to check every 10 seconds if kswapd
> >> is in failure state.  This is another possible solution.
> >> 
> >> > However, the node0 has pages newly allocated after 5), that might or
> >> > might not be reclaimable.  Since those are potentially reclaimable, it's
> >> > worth hopefully trying reclaim by allowing kswapd to work again.
> >> >
> >> 
> >> [snip]
> >> 
> >> --
> >> Best Regards,
> >> Huang, Ying
Huang, Ying June 5, 2024, 2:02 a.m. UTC | #11
Byungchul Park <byungchul@sk.com> writes:

> On Tue, Jun 04, 2024 at 04:57:17PM +0800, Huang, Ying wrote:
>> Byungchul Park <byungchul@sk.com> writes:
>> 
>> > On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
>> >> Byungchul Park <byungchul@sk.com> writes:
>> >> 
>> >> > Changes from v1:
>> >> > 	1. Don't allow to resume kswapd if the system is under memory
>> >> > 	   pressure that might affect direct reclaim by any chance, like
>> >> > 	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
>> >> >
>> >> > --->8---
>> >> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
>> >> > From: Byungchul Park <byungchul@sk.com>
>> >> > Date: Tue, 4 Jun 2024 15:27:56 +0900
>> >> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
>> >> >
>> >> > A system should run with kswapd running in background when under memory
>> >> > pressure, such as when the available memory level is below the low water
>> >> > mark and there are reclaimable folios.
>> >> >
>> >> > However, the current code let the system run with kswapd stopped if
>> >> > kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
>> >> > until direct reclaim will do for that, even if there are reclaimable
>> >> > folios that can be reclaimed by kswapd.  This case was observed in the
>> >> > following scenario:
>> >> >
>> >> >    CONFIG_NUMA_BALANCING enabled
>> >> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
>> >> >    numa node0 (500GB local DRAM, 128 CPUs)
>> >> >    numa node1 (100GB CXL memory, no CPUs)
>> >> >    swap off
>> >> >
>> >> >    1) Run a workload with big anon pages e.g. mmap(200GB).
>> >> >    2) Continue adding the same workload to the system.
>> >> >    3) The anon pages are placed in node0 by promotion/demotion.
>> >> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
>> >> >    5) Kill the memory hoggers to restore the system.
>> >> >
>> >> > After restoring the system at 5), the system starts to run without
>> >> > kswapd.  Even worse, tiering mechanism is no longer able to work since
>> >> > the mechanism relies on kswapd for demotion.
>> >> 
>> >> We have run into the situation that kswapd is kept in failure state for
>> >> long in a multiple tiers system.  I think that your solution is too
>> >
>> > My solution just gives a chance for kswapd to work again even if
>> > kswapd_failures >= MAX_RECLAIM_RETRIES, if there are potential
>> > reclaimable folios.  That's it.
>> >
>> >> limited, because OOM killing may not happen, while the access pattern of
>> >
>> > I don't get this.  OOM will happen as is, through direct reclaim.
>> 
>> A system that fails to reclaim via kswapd may succeed to reclaim via
>> direct reclaim, because more CPUs are used to scanning the page tables.
>
> Honestly, I don't think so with this description.
>
> The fact that the system hit MAX_RECLAIM_RETRIES means the system is
> currently hopeless unless reclaiming folios in a stronger way by *direct
> reclaim*.  The solution for this situation should not be about letting
> more CPUs particiated in reclaiming, again, *at least in this situation*.
>
> What you described here is true only in a normal state where the more
> CPUs work on reclaiming, the more reclaimable folios can be reclaimed.
> kswapd can be a helper *only* when there are kswapd-reclaimable folios.

Sometimes, we cannot reclaim just because we don't scan fast enough, so
the Accessed-bit is set again during scanning.  With more CPUs, we can
scan faster, and so make some progress.  But, yes, this only covers one
situation; there are other situations too.
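
A back-of-the-envelope model of that one situation, as a sketch.  It
assumes a scanner that clears the Accessed bit on one visit and can
reclaim the folio only if the bit is still clear on the next visit, and a
workload that touches every folio once per touch_interval ticks.  All
names and the uniform access pattern are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Ticks between two visits to the same folio, rounded up. */
static unsigned long model_revisit_gap(unsigned long nr_folios,
				       unsigned long folios_per_tick_per_cpu,
				       unsigned long nr_cpus)
{
	unsigned long rate = folios_per_tick_per_cpu * nr_cpus;

	assert(rate > 0);
	return (nr_folios + rate - 1) / rate;
}

/*
 * The scan makes progress only if it comes back to a folio before the
 * workload touches it again and sets the Accessed bit.
 */
static bool model_scan_makes_progress(unsigned long nr_folios,
				      unsigned long folios_per_tick_per_cpu,
				      unsigned long nr_cpus,
				      unsigned long touch_interval)
{
	return model_revisit_gap(nr_folios, folios_per_tick_per_cpu,
				 nr_cpus) < touch_interval;
}
```

With 1000 folios, 10 folios per tick per CPU and a touch interval of 50
ticks, one CPU needs 100 ticks per pass and never wins the race; four
CPUs need 25 ticks and do.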

--
Best Regards,
Huang, Ying

> 	Byungchul
>
>> In a system with NUMA balancing based page promotion and page demotion
>> enabled, page promotion will wake up kswapd, but kswapd may fail in some
>> situations.  But page promotion will no trigger direct reclaim or OOM.
>> 
>> >> the workloads may change.  We have a preliminary and simple solution for
>> >> this as follows,
>> >> 
>> >> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866
>> >
>> > Whether tiering is involved or not, the same problem can arise if
>> > kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.
>> 
>> Your description is about tiering too.  Can you describe a situation
>> without tiering?
>> 
>> --
>> Best Regards,
>> Huang, Ying
>> 
>> > 	Byungchul
>> >
>> >> where we will try to wake up kswapd to check every 10 seconds if kswapd
>> >> is in failure state.  This is another possible solution.
>> >> 
>> >> > However, the node0 has pages newly allocated after 5), that might or
>> >> > might not be reclaimable.  Since those are potentially reclaimable, it's
>> >> > worth hopefully trying reclaim by allowing kswapd to work again.
>> >> >
>> >> 
>> >> [snip]
>> >> 
>> >> --
>> >> Best Regards,
>> >> Huang, Ying
Huang, Ying June 5, 2024, 2:14 a.m. UTC | #12
Byungchul Park <byungchul@sk.com> writes:

> On Wed, Jun 05, 2024 at 08:59:01AM +0800, Huang, Ying wrote:
>> Byungchul Park <byungchul@sk.com> writes:
>> 
>> > On Tue, Jun 04, 2024 at 08:29:27AM -0400, Johannes Weiner wrote:
>> >> On Tue, Jun 04, 2024 at 07:25:16PM +0900, Byungchul Park wrote:
>> >> > On Tue, Jun 04, 2024 at 06:12:22PM +0900, Byungchul Park wrote:
>> >> > > On Tue, Jun 04, 2024 at 04:57:17PM +0800, Huang, Ying wrote:
>> >> > > > Byungchul Park <byungchul@sk.com> writes:
>> >> > > > 
>> >> > > > > On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
>> >> > > > >> Byungchul Park <byungchul@sk.com> writes:
>> >> > > > >> 
>> >> > > > >> > Changes from v1:
>> >> > > > >> > 	1. Don't allow to resume kswapd if the system is under memory
>> >> > > > >> > 	   pressure that might affect direct reclaim by any chance, like
>> >> > > > >> > 	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
>> >> > > > >> >
>> >> > > > >> > --->8---
>> >> > > > >> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
>> >> > > > >> > From: Byungchul Park <byungchul@sk.com>
>> >> > > > >> > Date: Tue, 4 Jun 2024 15:27:56 +0900
>> >> > > > >> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
>> >> > > > >> >
>> >> > > > >> > A system should run with kswapd running in background when under memory
>> >> > > > >> > pressure, such as when the available memory level is below the low water
>> >> > > > >> > mark and there are reclaimable folios.
>> >> > > > >> >
>> >> > > > >> > However, the current code let the system run with kswapd stopped if
>> >> > > > >> > kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
>> >> > > > >> > until direct reclaim will do for that, even if there are reclaimable
>> >> > > > >> > folios that can be reclaimed by kswapd.  This case was observed in the
>> >> > > > >> > following scenario:
>> >> > > > >> >
>> >> > > > >> >    CONFIG_NUMA_BALANCING enabled
>> >> > > > >> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
>> >> > > > >> >    numa node0 (500GB local DRAM, 128 CPUs)
>> >> > > > >> >    numa node1 (100GB CXL memory, no CPUs)
>> >> > > > >> >    swap off
>> >> > > > >> >
>> >> > > > >> >    1) Run a workload with big anon pages e.g. mmap(200GB).
>> >> > > > >> >    2) Continue adding the same workload to the system.
>> >> > > > >> >    3) The anon pages are placed in node0 by promotion/demotion.
>> >> > > > >> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
>> >> > > > >> >    5) Kill the memory hoggers to restore the system.
>> >> > > > >> >
>> >> > > > >> > After restoring the system at 5), the system starts to run without
>> >> > > > >> > kswapd.  Even worse, tiering mechanism is no longer able to work since
>> >> > > > >> > the mechanism relies on kswapd for demotion.
>> >> > > > >> 
>> >> > > > >> We have run into the situation that kswapd is kept in failure state for
>> >> > > > >> long in a multiple tiers system.  I think that your solution is too
>> >> > > > >
>> >> > > > > My solution just gives a chance for kswapd to work again even if
>> >> > > > > kswapd_failures >= MAX_RECLAIM_RETRIES, if there are potential
>> >> > > > > reclaimable folios.  That's it.
>> >> > > > >
>> >> > > > >> limited, because OOM killing may not happen, while the access pattern of
>> >> > > > >
>> >> > > > > I don't get this.  OOM will happen as is, through direct reclaim.
>> >> > > > 
>> >> > > > A system that fails to reclaim via kswapd may succeed to reclaim via
>> >> > > > direct reclaim, because more CPUs are used to scanning the page tables.
>> >> > > > 
>> >> > > > In a system with NUMA balancing based page promotion and page demotion
>> >> > > > enabled, page promotion will wake up kswapd, but kswapd may fail in some
>> >> > > > situations.  But page promotion will no trigger direct reclaim or OOM.
>> >> > > > 
>> >> > > > >> the workloads may change.  We have a preliminary and simple solution for
>> >> > > > >> this as follows,
>> >> > > > >> 
>> >> > > > >> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866
>> >> > > > >
>> >> > > > > Whether tiering is involved or not, the same problem can arise if
>> >> > > > > kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.
>> >> > > > 
>> >> > > > Your description is about tiering too.  Can you describe a situation
>> >> > > 
>> >> > > I mentioned "tiering" while I described how to reproduce because I ran
>> >> > > into the situation while testing with tiering system but I don't think
>> >> > > it's the necessary condition.
>> >> > > 
>> >> > > Let me ask you back, why the logic to stop kswapd was considered in the
>> >> > > first place?  That's because the problem was already observed anyway
>> >> > 
>> >> > To be clear..
>> >> > 
>> >> > The problem, kswapd_failures >= MAX_RECLAIM_RETRIES, can happen whether
>> >> > tiering is involved not not.  Once kswapd stops, the system should run
>> >> > without kswapd even after recovered e.g. by killing the hoggers.  *Even
>> >> > worse*, tiering mechanism doesn't work in this situation.
>> >> 
>> >> But like Ying said, in other situations it's direct reclaim that kicks
>> >> in and clears the flag.
>> >
>> > I already described it in the commit message.
>> >
>> >> The failure-sleep and direct reclaim triggered recovery have been in
>> >
>> > Sure.  It's better than nothing.
>> >
>> >> place since 2017. Both parties who observed an issue with it recently
>> >> did so in tiering scenarios. IMO a tiering-specific solution makes the
>> >> most sense.
>> >
>> > So..  Is the follow situation in a non-tiering system okay?  Really?
>> >
>> >    A system runs with kswapd disabled unless hitting min water mark,
>> >    even if there might be something that kswapd can work on.
>> >
>> > I don't undertand why it's okay.  Could you explain more?  Then why do
>> > we use kswapd in background?
>> 
>> IIUC, it's okey.  One direct reclaiming will be triggered, then kswapd
>> reclaiming will be recovered.  So, the performance will not be
>> influenced much.
>
> So is it because the performance will not be influenced much?  Hm..  the
> system would get impacted at the moment when direct reclaim gets
> triggerred, even though kswapd can mitigate the impact proactively.
>
> However, I don't want to insist strongly if you all consider it's okay.
>
> Changing the topic to tiering, which one looks better between two
> appoaches to solve the issue that tiering doens't work once the failures
> hit MAX_RECLAIM_RETRIES:
>
>    1) periodically run kswapd
>    2) run kswapd if there might be reclaimable folios as this patch does

It's hard to capture all situations.  Folios may become reclaimable in
various ways: some folios become cold, swap devices are added, munlock,
folio freeing/allocation, etc.  It's hard to detect them all without
trying.

It may not be necessary to run kswapd periodically.  So, another
possibility is to allow waking up kswapd after some timeout even if
kswapd_failures >= MAX_RECLAIM_RETRIES.
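
A minimal sketch of that idea, assuming the node records when kswapd last
gave up.  struct timed_node_model, its last_failure field and the
retry_timeout parameter are hypothetical; nothing like them is claimed to
exist in the kernel today.

```c
#include <assert.h>
#include <stdbool.h>

/* Same value as the kernel's MAX_RECLAIM_RETRIES in mm/internal.h. */
#define MAX_RECLAIM_RETRIES 16

/* Hypothetical per-node state with a timestamp of the last failure. */
struct timed_node_model {
	int kswapd_failures;
	unsigned long last_failure;	/* in ticks */
};

/*
 * Allow a wakeup if the node is not in the failure state, or if it is
 * but retry_timeout ticks have passed since the last failure.
 */
static bool model_wakeup_allowed(const struct timed_node_model *node,
				 unsigned long now,
				 unsigned long retry_timeout)
{
	assert(now >= node->last_failure);

	if (node->kswapd_failures < MAX_RECLAIM_RETRIES)
		return true;

	return now - node->last_failure >= retry_timeout;
}
```

Unlike a periodic timer, nothing runs while the node is idle.  The check
is evaluated only when someone asks to wake kswapd, so the cost is one
comparison on the wakeup path.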

--
Best Regards,
Huang, Ying

> For 2), this patch should be modified a lil bit tho.
>
> 	Byungchul
>
>> Do you think that this will impact performance?  If so, please try to
>> prove it with test results.
>> 
>> --
>> Best Regards,
>> Huang, Ying
Byungchul Park June 5, 2024, 2:19 a.m. UTC | #13
On Wed, Jun 05, 2024 at 10:02:07AM +0800, Huang, Ying wrote:
> Byungchul Park <byungchul@sk.com> writes:
> 
> > On Tue, Jun 04, 2024 at 04:57:17PM +0800, Huang, Ying wrote:
> >> Byungchul Park <byungchul@sk.com> writes:
> >> 
> >> > On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
> >> >> Byungchul Park <byungchul@sk.com> writes:
> >> >> 
> >> >> > Changes from v1:
> >> >> > 	1. Don't allow to resume kswapd if the system is under memory
> >> >> > 	   pressure that might affect direct reclaim by any chance, like
> >> >> > 	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
> >> >> >
> >> >> > --->8---
> >> >> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
> >> >> > From: Byungchul Park <byungchul@sk.com>
> >> >> > Date: Tue, 4 Jun 2024 15:27:56 +0900
> >> >> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
> >> >> >
> >> >> > A system should run with kswapd running in background when under memory
> >> >> > pressure, such as when the available memory level is below the low water
> >> >> > mark and there are reclaimable folios.
> >> >> >
> >> >> > However, the current code let the system run with kswapd stopped if
> >> >> > kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
> >> >> > until direct reclaim will do for that, even if there are reclaimable
> >> >> > folios that can be reclaimed by kswapd.  This case was observed in the
> >> >> > following scenario:
> >> >> >
> >> >> >    CONFIG_NUMA_BALANCING enabled
> >> >> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
> >> >> >    numa node0 (500GB local DRAM, 128 CPUs)
> >> >> >    numa node1 (100GB CXL memory, no CPUs)
> >> >> >    swap off
> >> >> >
> >> >> >    1) Run a workload with big anon pages e.g. mmap(200GB).
> >> >> >    2) Continue adding the same workload to the system.
> >> >> >    3) The anon pages are placed in node0 by promotion/demotion.
> >> >> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
> >> >> >    5) Kill the memory hoggers to restore the system.
> >> >> >
> >> >> > After restoring the system at 5), the system starts to run without
> >> >> > kswapd.  Even worse, tiering mechanism is no longer able to work since
> >> >> > the mechanism relies on kswapd for demotion.
> >> >> 
> >> >> We have run into the situation that kswapd is kept in failure state for
> >> >> long in a multiple tiers system.  I think that your solution is too
> >> >
> >> > My solution just gives a chance for kswapd to work again even if
> >> > kswapd_failures >= MAX_RECLAIM_RETRIES, if there are potential
> >> > reclaimable folios.  That's it.
> >> >
> >> >> limited, because OOM killing may not happen, while the access pattern of
> >> >
> >> > I don't get this.  OOM will happen as is, through direct reclaim.
> >> 
> >> A system that fails to reclaim via kswapd may succeed to reclaim via
> >> direct reclaim, because more CPUs are used to scanning the page tables.
> >
> > Honestly, I don't think so with this description.
> >
> > The fact that the system hit MAX_RECLAIM_RETRIES means the system is
> > currently hopeless unless reclaiming folios in a stronger way by *direct
> > reclaim*.  The solution for this situation should not be about letting
> > more CPUs particiated in reclaiming, again, *at least in this situation*.
> >
> > What you described here is true only in a normal state where the more
> > CPUs work on reclaiming, the more reclaimable folios can be reclaimed.
> > kswapd can be a helper *only* when there are kswapd-reclaimable folios.
> 
> Sometimes, we cannot reclaim just because we doesn't scan fast enough so
> the Accessed-bit is set again during scanning.  With more CPUs, we can
> scan faster, so make some progress.  But, yes, this only cover one
> situation, there are other situations too.

What I mean is *the issue we try to solve* is not the situation that
can be solved by letting more CPUs participate in reclaiming.

	Byungchul

> --
> Best Regards,
> Huang, Ying
> 
> > 	Byungchul
> >
> >> In a system with NUMA balancing based page promotion and page demotion
> >> enabled, page promotion will wake up kswapd, but kswapd may fail in some
> >> situations.  But page promotion will no trigger direct reclaim or OOM.
> >> 
> >> >> the workloads may change.  We have a preliminary and simple solution for
> >> >> this as follows,
> >> >> 
> >> >> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866
> >> >
> >> > Whether tiering is involved or not, the same problem can arise if
> >> > kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.
> >> 
> >> Your description is about tiering too.  Can you describe a situation
> >> without tiering?
> >> 
> >> --
> >> Best Regards,
> >> Huang, Ying
> >> 
> >> > 	Byungchul
> >> >
> >> >> where we will try to wake up kswapd to check every 10 seconds if kswapd
> >> >> is in failure state.  This is another possible solution.
> >> >> 
> >> >> > However, the node0 has pages newly allocated after 5), that might or
> >> >> > might not be reclaimable.  Since those are potentially reclaimable, it's
> >> >> > worth hopefully trying reclaim by allowing kswapd to work again.
> >> >> >
> >> >> 
> >> >> [snip]
> >> >> 
> >> >> --
> >> >> Best Regards,
> >> >> Huang, Ying
Byungchul Park June 7, 2024, 7:12 a.m. UTC | #14
On Wed, Jun 05, 2024 at 11:19:02AM +0900, Byungchul Park wrote:
> On Wed, Jun 05, 2024 at 10:02:07AM +0800, Huang, Ying wrote:
> > Byungchul Park <byungchul@sk.com> writes:
> > 
> > > On Tue, Jun 04, 2024 at 04:57:17PM +0800, Huang, Ying wrote:
> > >> Byungchul Park <byungchul@sk.com> writes:
> > >> 
> > >> > On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
> > >> >> Byungchul Park <byungchul@sk.com> writes:
> > >> >> 
> > >> >> > Changes from v1:
> > >> >> > 	1. Don't allow to resume kswapd if the system is under memory
> > >> >> > 	   pressure that might affect direct reclaim by any chance, like
> > >> >> > 	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
> > >> >> >
> > >> >> > --->8---
> > >> >> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
> > >> >> > From: Byungchul Park <byungchul@sk.com>
> > >> >> > Date: Tue, 4 Jun 2024 15:27:56 +0900
> > >> >> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
> > >> >> >
> > >> >> > A system should run with kswapd running in background when under memory
> > >> >> > pressure, such as when the available memory level is below the low water
> > >> >> > mark and there are reclaimable folios.
> > >> >> >
> > >> >> > However, the current code let the system run with kswapd stopped if
> > >> >> > kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
> > >> >> > until direct reclaim will do for that, even if there are reclaimable
> > >> >> > folios that can be reclaimed by kswapd.  This case was observed in the
> > >> >> > following scenario:
> > >> >> >
> > >> >> >    CONFIG_NUMA_BALANCING enabled
> > >> >> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
> > >> >> >    numa node0 (500GB local DRAM, 128 CPUs)
> > >> >> >    numa node1 (100GB CXL memory, no CPUs)
> > >> >> >    swap off
> > >> >> >
> > >> >> >    1) Run a workload with big anon pages e.g. mmap(200GB).
> > >> >> >    2) Continue adding the same workload to the system.
> > >> >> >    3) The anon pages are placed in node0 by promotion/demotion.
> > >> >> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
> > >> >> >    5) Kill the memory hoggers to restore the system.
> > >> >> >
> > >> >> > After restoring the system at 5), the system starts to run without
> > >> >> > kswapd.  Even worse, tiering mechanism is no longer able to work since
> > >> >> > the mechanism relies on kswapd for demotion.
> > >> >> 
> > >> >> We have run into the situation that kswapd is kept in failure state for
> > >> >> long in a multiple tiers system.  I think that your solution is too
> > >> >
> > >> > My solution just gives a chance for kswapd to work again even if
> > >> > kswapd_failures >= MAX_RECLAIM_RETRIES, if there are potential
> > >> > reclaimable folios.  That's it.
> > >> >
> > >> >> limited, because OOM killing may not happen, while the access pattern of
> > >> >
> > >> > I don't get this.  OOM will happen as is, through direct reclaim.
> > >> 
> > >> A system that fails to reclaim via kswapd may succeed to reclaim via
> > >> direct reclaim, because more CPUs are used to scanning the page tables.
> > >
> > > Honestly, I don't think so with this description.
> > >
> > > The fact that the system hit MAX_RECLAIM_RETRIES means the system is
> > > currently hopeless unless reclaiming folios in a stronger way by *direct
> > > reclaim*.  The solution for this situation should not be about letting
> > > more CPUs particiated in reclaiming, again, *at least in this situation*.
> > >
> > > What you described here is true only in a normal state where the more
> > > CPUs work on reclaiming, the more reclaimable folios can be reclaimed.
> > > kswapd can be a helper *only* when there are kswapd-reclaimable folios.
> > 
> > Sometimes, we cannot reclaim just because we doesn't scan fast enough so
> > the Accessed-bit is set again during scanning.  With more CPUs, we can
> > scan faster, so make some progress.  But, yes, this only cover one
> > situation, there are other situations too.
> 
> What I mean is *the issue we try to solve* is not the situation that
> can be solved by letting more CPUs participate in reclaiming.

Again, in the situation where kswapd has hit MAX_RECLAIM_RETRIES
failures, i.e. the node is considered hopeless, I don't think it makes
sense to wake up kswapd every 10 seconds.  It'd be more sensible to wake
up kswapd only if there are *at least potentially* reclaimable folios.

As Ying said, there's no way to precisely track whether folios are
reclaimable, but it's worth trying once reclaim has a chance of making
progress again, which looks more reasonable to me.  Thoughts?
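
To make the proposed condition concrete, here is a minimal userspace
sketch of the gating in the patch.  The struct and function names are
hypothetical stand-ins and not kernel API.  It assumes a single zone and
replaces zone_watermark_ok_safe() with a plain free-page comparison, so
lowmem reserves and the order-specific checks are ignored.

```c
#include <stdbool.h>

/* Value of MAX_RECLAIM_RETRIES in mm/internal.h at the time of writing. */
#define MAX_RECLAIM_RETRIES 16

/* Hypothetical, simplified stand-in for one zone of a node. */
struct zone_model {
	unsigned long free_pages;
	unsigned long min_wmark;
	unsigned long low_wmark;
};

/* Hypothetical stand-in for the pg_data_t fields the patch touches. */
struct node_model {
	int kswapd_failures;
	long nr_may_reclaimable;	/* pages allocated since "hopeless" */
	struct zone_model zone;		/* assumption: one zone per node */
};

/*
 * Simplified model of may_reclaimable(): enough pages were allocated
 * since the node was declared hopeless, and free memory is still above
 * (low + min) / 2 so that direct reclaim is not disturbed.
 */
static bool may_reclaimable_model(const struct node_model *n, int order)
{
	unsigned long mark;

	if (n->nr_may_reclaimable < (1L << order))
		return false;

	mark = (n->zone.low_wmark + n->zone.min_wmark) >> 1;
	return n->zone.free_pages > mark;
}

/* kswapd stays asleep only if the node is hopeless with no new hope. */
static bool kswapd_stays_asleep(const struct node_model *n, int order)
{
	return n->kswapd_failures >= MAX_RECLAIM_RETRIES &&
	       !may_reclaimable_model(n, order);
}
```

With min = 100 and low = 200 the threshold is 150 pages: a hopeless node
with 1000 free pages and 8 newly allocated pages would let kswapd run
for an order-0 request, but not for an order-4 request (8 < 16) and not
once free pages drop to 120.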

	Byungchul

> 	Byungchul
> 
> > --
> > Best Regards,
> > Huang, Ying
> > 
> > > 	Byungchul
> > >
> > >> In a system with NUMA balancing based page promotion and page demotion
> > >> enabled, page promotion will wake up kswapd, but kswapd may fail in some
> > >> situations.  But page promotion will no trigger direct reclaim or OOM.
> > >> 
> > >> >> the workloads may change.  We have a preliminary and simple solution for
> > >> >> this as follows,
> > >> >> 
> > >> >> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866
> > >> >
> > >> > Whether tiering is involved or not, the same problem can arise if
> > >> > kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.
> > >> 
> > >> Your description is about tiering too.  Can you describe a situation
> > >> without tiering?
> > >> 
> > >> --
> > >> Best Regards,
> > >> Huang, Ying
> > >> 
> > >> > 	Byungchul
> > >> >
> > >> >> where we will try to wake up kswapd to check every 10 seconds if kswapd
> > >> >> is in failure state.  This is another possible solution.
> > >> >> 
> > >> >> > However, the node0 has pages newly allocated after 5), that might or
> > >> >> > might not be reclaimable.  Since those are potentially reclaimable, it's
> > >> >> > worth hopefully trying reclaim by allowing kswapd to work again.
> > >> >> >
> > >> >> 
> > >> >> [snip]
> > >> >> 
> > >> >> --
> > >> >> Best Regards,
> > >> >> Huang, Ying
Byungchul Park June 13, 2024, 1:27 a.m. UTC | #15
On Fri, Jun 07, 2024 at 04:12:28PM +0900, Byungchul Park wrote:
> On Wed, Jun 05, 2024 at 11:19:02AM +0900, Byungchul Park wrote:
> > On Wed, Jun 05, 2024 at 10:02:07AM +0800, Huang, Ying wrote:
> > > Byungchul Park <byungchul@sk.com> writes:
> > > 
> > > > On Tue, Jun 04, 2024 at 04:57:17PM +0800, Huang, Ying wrote:
> > > >> Byungchul Park <byungchul@sk.com> writes:
> > > >> 
> > > >> > On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
> > > >> >> Byungchul Park <byungchul@sk.com> writes:
> > > >> >> 
> > > >> >> > Changes from v1:
> > > >> >> > 	1. Don't allow to resume kswapd if the system is under memory
> > > >> >> > 	   pressure that might affect direct reclaim by any chance, like
> > > >> >> > 	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
> > > >> >> >
> > > >> >> > --->8---
> > > >> >> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
> > > >> >> > From: Byungchul Park <byungchul@sk.com>
> > > >> >> > Date: Tue, 4 Jun 2024 15:27:56 +0900
> > > >> >> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
> > > >> >> >
> > > >> >> > A system should run with kswapd running in background when under memory
> > > >> >> > pressure, such as when the available memory level is below the low water
> > > >> >> > mark and there are reclaimable folios.
> > > >> >> >
> > > >> >> > However, the current code let the system run with kswapd stopped if
> > > >> >> > kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
> > > >> >> > until direct reclaim will do for that, even if there are reclaimable
> > > >> >> > folios that can be reclaimed by kswapd.  This case was observed in the
> > > >> >> > following scenario:
> > > >> >> >
> > > >> >> >    CONFIG_NUMA_BALANCING enabled
> > > >> >> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
> > > >> >> >    numa node0 (500GB local DRAM, 128 CPUs)
> > > >> >> >    numa node1 (100GB CXL memory, no CPUs)
> > > >> >> >    swap off
> > > >> >> >
> > > >> >> >    1) Run a workload with big anon pages e.g. mmap(200GB).
> > > >> >> >    2) Continue adding the same workload to the system.
> > > >> >> >    3) The anon pages are placed in node0 by promotion/demotion.
> > > >> >> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
> > > >> >> >    5) Kill the memory hoggers to restore the system.
> > > >> >> >
> > > >> >> > After restoring the system at 5), the system starts to run without
> > > >> >> > kswapd.  Even worse, tiering mechanism is no longer able to work since
> > > >> >> > the mechanism relies on kswapd for demotion.
> > > >> >> 
> > > >> >> We have run into the situation that kswapd is kept in failure state for
> > > >> >> long in a multiple tiers system.  I think that your solution is too
> > > >> >
> > > >> > My solution just gives a chance for kswapd to work again even if
> > > >> > kswapd_failures >= MAX_RECLAIM_RETRIES, if there are potential
> > > >> > reclaimable folios.  That's it.
> > > >> >
> > > >> >> limited, because OOM killing may not happen, while the access pattern of
> > > >> >
> > > >> > I don't get this.  OOM will happen as is, through direct reclaim.
> > > >> 
> > > >> A system that fails to reclaim via kswapd may succeed to reclaim via
> > > >> direct reclaim, because more CPUs are used to scanning the page tables.
> > > >
> > > > Honestly, I don't think so with this description.
> > > >
> > > > The fact that the system hit MAX_RECLAIM_RETRIES means the system is
> > > > currently hopeless unless reclaiming folios in a stronger way by *direct
> > > > reclaim*.  The solution for this situation should not be about letting
> > > > more CPUs particiated in reclaiming, again, *at least in this situation*.
> > > >
> > > > What you described here is true only in a normal state where the more
> > > > CPUs work on reclaiming, the more reclaimable folios can be reclaimed.
> > > > kswapd can be a helper *only* when there are kswapd-reclaimable folios.
> > > 
> > > Sometimes, we cannot reclaim just because we doesn't scan fast enough so
> > > the Accessed-bit is set again during scanning.  With more CPUs, we can
> > > scan faster, so make some progress.  But, yes, this only cover one
> > > situation, there are other situations too.
> > 
> > What I mean is *the issue we try to solve* is not the situation that
> > can be solved by letting more CPUs participate in reclaiming.
> 
> Again, in the situation where kswapd has hit MAX_RECLAIM_RETRIES
> failures, i.e. the node is considered hopeless, I don't think it makes
> sense to wake up kswapd every 10 seconds.  It'd be more sensible to wake
> up kswapd only if there are *at least potentially* reclaimable folios.

1) numa balancing tiering on

No doubt the patch should work for it since numa balancing tiering
doesn't work at all once kswapd stops.  We are already applying and
using this patch in tests for tiering.  It works perfectly.

2) numa balancing tiering off

kswapd will be resumed even without this patch if free memory hits min
wmark.  However, do we have to wait for direct reclaim to work for it?
Even though we can proactively prevent direct reclaim using kswapd?

	Byungchul

> As Ying said, there's no way to precisely track whether folios are
> reclaimable, but it's worth trying once reclaim has a chance of making
> progress again, which looks more reasonable to me.  Thoughts?
> 
> 	Byungchul
> 
> > 	Byungchul
> > 
> > > --
> > > Best Regards,
> > > Huang, Ying
> > > 
> > > > 	Byungchul
> > > >
> > > >> In a system with NUMA balancing based page promotion and page demotion
> > > >> enabled, page promotion will wake up kswapd, but kswapd may fail in some
> > > >> situations.  But page promotion will no trigger direct reclaim or OOM.
> > > >> 
> > > >> >> the workloads may change.  We have a preliminary and simple solution for
> > > >> >> this as follows,
> > > >> >> 
> > > >> >> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866
> > > >> >
> > > >> > Whether tiering is involved or not, the same problem can arise if
> > > >> > kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.
> > > >> 
> > > >> Your description is about tiering too.  Can you describe a situation
> > > >> without tiering?
> > > >> 
> > > >> --
> > > >> Best Regards,
> > > >> Huang, Ying
> > > >> 
> > > >> > 	Byungchul
> > > >> >
> > > >> >> where we will try to wake up kswapd to check every 10 seconds if kswapd
> > > >> >> is in failure state.  This is another possible solution.
> > > >> >> 
> > > >> >> > However, the node0 has pages newly allocated after 5), that might or
> > > >> >> > might not be reclaimable.  Since those are potentially reclaimable, it's
> > > >> >> > worth hopefully trying reclaim by allowing kswapd to work again.
> > > >> >> >
> > > >> >> 
> > > >> >> [snip]
> > > >> >> 
> > > >> >> --
> > > >> >> Best Regards,
> > > >> >> Huang, Ying
Huang, Ying June 13, 2024, 6:38 a.m. UTC | #16
Byungchul Park <byungchul@sk.com> writes:

> On Fri, Jun 07, 2024 at 04:12:28PM +0900, Byungchul Park wrote:
>> On Wed, Jun 05, 2024 at 11:19:02AM +0900, Byungchul Park wrote:
>> > On Wed, Jun 05, 2024 at 10:02:07AM +0800, Huang, Ying wrote:
>> > > Byungchul Park <byungchul@sk.com> writes:
>> > > 
>> > > > On Tue, Jun 04, 2024 at 04:57:17PM +0800, Huang, Ying wrote:
>> > > >> Byungchul Park <byungchul@sk.com> writes:
>> > > >> 
>> > > >> > On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
>> > > >> >> Byungchul Park <byungchul@sk.com> writes:
>> > > >> >> 
>> > > >> >> > Changes from v1:
>> > > >> >> > 	1. Don't allow to resume kswapd if the system is under memory
>> > > >> >> > 	   pressure that might affect direct reclaim by any chance, like
>> > > >> >> > 	   if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
>> > > >> >> >
>> > > >> >> > --->8---
>> > > >> >> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
>> > > >> >> > From: Byungchul Park <byungchul@sk.com>
>> > > >> >> > Date: Tue, 4 Jun 2024 15:27:56 +0900
>> > > >> >> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
>> > > >> >> >
>> > > >> >> > A system should run with kswapd running in background when under memory
>> > > >> >> > pressure, such as when the available memory level is below the low water
>> > > >> >> > mark and there are reclaimable folios.
>> > > >> >> >
>> > > >> >> > However, the current code let the system run with kswapd stopped if
>> > > >> >> > kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
>> > > >> >> > until direct reclaim will do for that, even if there are reclaimable
>> > > >> >> > folios that can be reclaimed by kswapd.  This case was observed in the
>> > > >> >> > following scenario:
>> > > >> >> >
>> > > >> >> >    CONFIG_NUMA_BALANCING enabled
>> > > >> >> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
>> > > >> >> >    numa node0 (500GB local DRAM, 128 CPUs)
>> > > >> >> >    numa node1 (100GB CXL memory, no CPUs)
>> > > >> >> >    swap off
>> > > >> >> >
>> > > >> >> >    1) Run a workload with big anon pages e.g. mmap(200GB).
>> > > >> >> >    2) Continue adding the same workload to the system.
>> > > >> >> >    3) The anon pages are placed in node0 by promotion/demotion.
>> > > >> >> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
>> > > >> >> >    5) Kill the memory hoggers to restore the system.
>> > > >> >> >
>> > > >> >> > After restoring the system at 5), the system starts to run without
>> > > >> >> > kswapd.  Even worse, tiering mechanism is no longer able to work since
>> > > >> >> > the mechanism relies on kswapd for demotion.
>> > > >> >> 
>> > > >> >> We have run into the situation that kswapd is kept in failure state for
>> > > >> >> long in a multiple tiers system.  I think that your solution is too
>> > > >> >
>> > > >> > My solution just gives a chance for kswapd to work again even if
>> > > >> > kswapd_failures >= MAX_RECLAIM_RETRIES, if there are potential
>> > > >> > reclaimable folios.  That's it.
>> > > >> >
>> > > >> >> limited, because OOM killing may not happen, while the access pattern of
>> > > >> >
>> > > >> > I don't get this.  OOM will happen as is, through direct reclaim.
>> > > >> 
>> > > >> A system that fails to reclaim via kswapd may succeed to reclaim via
>> > > >> direct reclaim, because more CPUs are used to scanning the page tables.
>> > > >
>> > > > Honestly, I don't think so with this description.
>> > > >
>> > > > The fact that the system hit MAX_RECLAIM_RETRIES means the system is
>> > > > currently hopeless unless reclaiming folios in a stronger way by *direct
>> > > > reclaim*.  The solution for this situation should not be about letting
>> > > > more CPUs particiated in reclaiming, again, *at least in this situation*.
>> > > >
>> > > > What you described here is true only in a normal state where the more
>> > > > CPUs work on reclaiming, the more reclaimable folios can be reclaimed.
>> > > > kswapd can be a helper *only* when there are kswapd-reclaimable folios.
>> > > 
>> > > Sometimes, we cannot reclaim just because we doesn't scan fast enough so
>> > > the Accessed-bit is set again during scanning.  With more CPUs, we can
>> > > scan faster, so make some progress.  But, yes, this only cover one
>> > > situation, there are other situations too.
>> > 
>> > What I mean is *the issue we try to solve* is not the situation that
>> > can be solved by letting more CPUs participate in reclaiming.
>> 
>> Again, in the situation where kswapd has hit MAX_RECLAIM_RETRIES
>> failures, i.e. the node is considered hopeless, I don't think it makes
>> sense to wake up kswapd every 10 seconds.  It'd be more sensible to wake
>> up kswapd only if there are *at least potentially* reclaimable folios.
>
> 1) numa balancing tiering on
>
> No doubt the patch should work for it since numa balancing tiering
> doesn't work at all once kswapd stops.  We are already applying and
> using this patch in tests for tiering.  It works perfectly.

If my understanding of the code is correct, it doesn't work if not many
pages are allocated after kswapd stops.  For example, if some processes
that use a lot of fast memory become idle.

> 2) numa balancing tiering off
>
> kswapd will be resumed even without this patch if free memory hits min
> wmark.  However, do we have to wait for direct reclaim to work for it?
> Even though we can proactively prevent direct reclaim using kswapd?

Please prove it with data instead of reasoning.  Your patch adds some
overhead in the hot page allocation path.  If the number of CPUs is
large, cache ping-pong may be triggered because a shared variable
(pgdat->nr_may_reclaimable) is accessed.
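
To illustrate the concern, here is a userspace sketch with hypothetical
names; it is not what the patch does.  An unbatched update writes the
shared cache line on every call, while a batched update accumulates in
a local counter and only touches the shared one every `batch` pages.
The kernel's percpu_counter follows a similar idea; whether it would
fit here is an open question.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a node-wide counter shared by all CPUs. */
struct shared_counter {
	atomic_long total;
};

/* Hypothetical per-thread (per-CPU in kernel terms) staging area. */
struct local_batch {
	long pending;
};

/* Unbatched: every call writes the shared cache line. */
static void count_unbatched(struct shared_counter *c, long n)
{
	atomic_fetch_add_explicit(&c->total, n, memory_order_relaxed);
}

/*
 * Batched: accumulate locally and fold into the shared counter only
 * once at least `batch` pages are pending.  Returns 1 when the shared
 * cache line was written, 0 otherwise.
 */
static int count_batched(struct shared_counter *c, struct local_batch *b,
			 long n, long batch)
{
	b->pending += n;
	if (b->pending < batch)
		return 0;

	atomic_fetch_add_explicit(&c->total, b->pending,
				  memory_order_relaxed);
	b->pending = 0;
	return 1;
}
```

The trade-off is that the shared total can lag behind by up to
batch - 1 pages per thread, so any threshold check against it becomes
correspondingly less precise.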

--
Best Regards,
Huang, Ying

> 	Byungchul
>
>> As Ying said, there's no way to precisely track whether folios are
>> reclaimable, but it's worth trying once reclaim has a chance of making
>> progress again, which looks more reasonable to me.  Thoughts?
>> 
>> 	Byungchul
>> 
>> > 	Byungchul
>> > 
>> > > --
>> > > Best Regards,
>> > > Huang, Ying
>> > > 
>> > > > 	Byungchul
>> > > >
>> > > >> In a system with NUMA balancing based page promotion and page demotion
>> > > >> enabled, page promotion will wake up kswapd, but kswapd may fail in some
>> > > >> situations.  But page promotion will no trigger direct reclaim or OOM.
>> > > >> 
>> > > >> >> the workloads may change.  We have a preliminary and simple solution for
>> > > >> >> this as follows,
>> > > >> >> 
>> > > >> >> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866
>> > > >> >
>> > > >> > Whether tiering is involved or not, the same problem can arise if
>> > > >> > kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.
>> > > >> 
>> > > >> Your description is about tiering too.  Can you describe a situation
>> > > >> without tiering?
>> > > >> 
>> > > >> --
>> > > >> Best Regards,
>> > > >> Huang, Ying
>> > > >> 
>> > > >> > 	Byungchul
>> > > >> >
>> > > >> >> where we will try to wake up kswapd to check every 10 seconds if kswapd
>> > > >> >> is in failure state.  This is another possible solution.
>> > > >> >> 
>> > > >> >> > However, the node0 has pages newly allocated after 5), that might or
>> > > >> >> > might not be reclaimable.  Since those are potentially reclaimable, it's
>> > > >> >> > worth hopefully trying reclaim by allowing kswapd to work again.
>> > > >> >> >
>> > > >> >> 
>> > > >> >> [snip]
>> > > >> >> 
>> > > >> >> --
>> > > >> >> Best Regards,
>> > > >> >> Huang, Ying
Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c11b7cde81ef..7c0ba90ea7b4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1331,6 +1331,10 @@  typedef struct pglist_data {
 	enum zone_type kswapd_highest_zoneidx;
 
 	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
+	int nr_may_reclaimable;		/* Number of pages that have been
+					   allocated since considered the
+					   node is hopeless due to too many
+					   kswapd_failures. */
 
 #ifdef CONFIG_COMPACTION
 	int kcompactd_max_order;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 14d39f34d336..1dd2daede014 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1538,8 +1538,20 @@  inline void post_alloc_hook(struct page *page, unsigned int order,
 static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
 							unsigned int alloc_flags)
 {
+	pg_data_t *pgdat = page_pgdat(page);
+
 	post_alloc_hook(page, order, gfp_flags);
 
+	/*
+	 * New pages might or might not be reclaimable depending on how
+	 * these pages are going to be used.  However, since these are
+	 * potentially reclaimable, it's worth hopefully trying reclaim
+	 * by allowing kswapd to work again even if there have been too
+	 * many ->kswapd_failures, if ->nr_may_reclaimable is big enough.
+	 */
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+		pgdat->nr_may_reclaimable += 1 << order;
+
 	if (order && (gfp_flags & __GFP_COMP))
 		prep_compound_page(page, order);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ef654addd44..6cf7ff164c2a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4943,6 +4943,7 @@  static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
 done:
 	/* kswapd should never fail */
 	pgdat->kswapd_failures = 0;
+	pgdat->nr_may_reclaimable = 0;
 }
 
 /******************************************************************************
@@ -5991,9 +5992,10 @@  static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * sleep. On reclaim progress, reset the failure counter. A
 	 * successful direct reclaim run will revive a dormant kswapd.
 	 */
-	if (reclaimable)
+	if (reclaimable) {
 		pgdat->kswapd_failures = 0;
-	else if (sc->cache_trim_mode)
+		pgdat->nr_may_reclaimable = 0;
+	} else if (sc->cache_trim_mode)
 		sc->cache_trim_mode_failed = 1;
 }
 
@@ -6636,6 +6638,42 @@  static void clear_pgdat_congested(pg_data_t *pgdat)
 	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
 }
 
+static bool may_reclaimable(pg_data_t *pgdat, int order,
+		int highest_zoneidx)
+{
+	int i;
+	bool may_reclaimable;
+
+	may_reclaimable = pgdat->nr_may_reclaimable >= 1 << order;
+	if (!may_reclaimable)
+		return false;
+
+	/*
+	 * Check watermarks bottom-up as lower zones are more likely to
+	 * meet watermarks.
+	 */
+	for (i = 0; i <= highest_zoneidx; i++) {
+		unsigned long mark;
+		struct zone *zone;
+
+		zone = pgdat->node_zones + i;
+		if (!managed_zone(zone))
+			continue;
+
+		/*
+		 * Don't bother the system by resuming kswapd if the
+		 * system is under memory pressure that might affect
+		 * direct reclaim by any chance.  Conservatively allow it
+		 * unless NR_FREE_PAGES is less than (low + min)/2.
+		 */
+		mark = (low_wmark_pages(zone) + min_wmark_pages(zone)) >> 1;
+		if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * Prepare kswapd for sleeping. This verifies that there are no processes
  * waiting in throttle_direct_reclaim() and that watermarks have been met.
@@ -6662,7 +6700,8 @@  static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order,
 		wake_up_all(&pgdat->pfmemalloc_wait);
 
 	/* Hopeless node, leave it to direct reclaim */
-	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES &&
+	    !may_reclaimable(pgdat, order, highest_zoneidx))
 		return true;
 
 	if (pgdat_balanced(pgdat, order, highest_zoneidx)) {
@@ -6940,8 +6979,10 @@  static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 		goto restart;
 	}
 
-	if (!sc.nr_reclaimed)
+	if (!sc.nr_reclaimed) {
 		pgdat->kswapd_failures++;
+		pgdat->nr_may_reclaimable = 0;
+	}
 
 out:
 	clear_reclaim_active(pgdat, highest_zoneidx);
@@ -7204,7 +7245,8 @@  void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
 		return;
 
 	/* Hopeless node, leave it to direct reclaim if possible */
-	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
+	if ((pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES &&
+	     !may_reclaimable(pgdat, order, highest_zoneidx)) ||
 	    (pgdat_balanced(pgdat, order, highest_zoneidx) &&
 	     !pgdat_watermark_boosted(pgdat, highest_zoneidx))) {
 		/*