[v2] mm: mglru: provide a separate list for lazyfree anon folios

Message ID 20241016033030.36990-1-21cnbao@gmail.com (mailing list archive)
State New
Series [v2] mm: mglru: provide a separate list for lazyfree anon folios

Commit Message

Barry Song Oct. 16, 2024, 3:30 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

This builds on the discussion about Gao Xu's work[1]. Significant
file refaults may occur when userspace makes extensive use of
MADV_FREE on anonymous folios, as these folios are not
positioned in an easily reclaimable area within the LRU.

According to Lokesh, MADV_FREE'd anon folios are expected to be
released earlier than file folios. One option, as implemented
by Gao Xu, is to place lazyfree anon folios at the tail of the
file's `min_seq` generation[1]. However, this approach results in
lazyfree folios being released in a LIFO manner, which conflicts
with LRU behavior, as noted by Michal.

To address this, this patch proposes maintaining a separate list
for lazyfree anon folios while keeping them classified under the
"file" LRU type to minimize code changes. These lazyfree anon
folios will still be counted as file folios and share the same
generation with regular files. In the eviction path, the lazyfree
list will be prioritized for scanning before the actual file
LRU list.
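
Since MADV_FREE clears the swap-backed flag, a lazyfree anon folio
already reports folio_is_file_lru() as true while still testing true
for folio_test_anon(); the patch combines the two to derive the third
list index. A condensed sketch of that convention (the helper name is
illustrative only; the hunks below open-code it):

static inline int lru_gen_list_index(struct folio *folio)
{
	/* lazyfree anon folios are !swapbacked, so they count as "file" */
	int type = folio_is_file_lru(folio);
	/* ...but they still test as anon, which is what identifies them */
	int lazyfree = type ? folio_test_anon(folio) : 0;

	/* 0: anon, 1: file, 2: lazyfree anon (scanned before the file list) */
	return type + lazyfree;
}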

Thanks to Gao Xu for the test results, which look quite promising:

Base version: Android V (enabling Android ART to use MADV_FREE)
Test case: 60 apps repeatedly restarted, tested for 8 hours;
The test results are as follows:
        workingset_refault_anon   workingset_refault_file
base        42016805                92010542
patch       19834873                49383572
% diff       -52.79%                  -46.33%

A comparative test was also performed on approach [1], with the
following results:
               workingset_refault_anon   workingset_refault_file
lazyfree-tail     20313395                 52203061
patch             19834873                 49383572
% diff              -2.36%                  -5.40%

[1] https://lore.kernel.org/linux-mm/f29f64e29c08427b95e3df30a5770056@honor.com/

Tested-by: Gao Xu <gaoxu2@hihonor.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 -v2:
 collected Gao Xu's test results and Tested-by tag, thanks!

 include/linux/mm_inline.h |  5 +-
 include/linux/mmzone.h    |  2 +-
 mm/vmscan.c               | 97 +++++++++++++++++++++++----------------
 3 files changed, 61 insertions(+), 43 deletions(-)

Comments

Andrew Morton Oct. 16, 2024, 10:58 p.m. UTC | #1
On Wed, 16 Oct 2024 16:30:30 +1300 Barry Song <21cnbao@gmail.com> wrote:

> To address this, this patch proposes maintaining a separate list
> for lazyfree anon folios while keeping them classified under the
> "file" LRU type to minimize code changes.

Thanks.  I'll await input from other MGLRU developers before adding
this for testing.
Barry Song Oct. 18, 2024, 5:12 a.m. UTC | #2
On Fri, Oct 18, 2024 at 6:58 AM Minchan Kim <minchan@kernel.org> wrote:
>
> On Thu, Oct 17, 2024 at 06:59:09PM +1300, Barry Song wrote:
> > On Thu, Oct 17, 2024 at 11:58 AM Andrew Morton
> > <akpm@linux-foundation.org> wrote:
> > >
> > > On Wed, 16 Oct 2024 16:30:30 +1300 Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > > To address this, this patch proposes maintaining a separate list
> > > > for lazyfree anon folios while keeping them classified under the
> > > > "file" LRU type to minimize code changes.
> > >
> > > Thanks.  I'll await input from other MGLRU developers before adding
> > > this for testing.
> >
> > Thanks!
> >
> > Hi Minchan, Yu,
> >
> > Any comments? I understand that Minchan may have a broader plan
> > to "enable the system to maintain a quickly reclaimable memory
> > pool and provide a knob for admins to control its size." While I
> > have no objection to that plan, I believe improving MADV_FREE
> > performance is a more urgent priority and a low-hanging fruit at this
> > stage.
>
> Hi Barry,
>
> I have no idea why my email didn't go through before. I sent the
> following reply on Sep 24. Hope it works this time.

Hi Minchan,

I guess not. This email of yours ended up in my Gmail spam folder, and
my oppo.com account still hasn’t received it. Any idea why?

>
> ====== &< ======
>
> My proposal involves the following:
>
> 1. Introduce an "easily reclaimable" LRU list. This list would hold pages
>    that can be quickly freed without significant overhead.

I assume you plan to keep both lazyfree anon pages and 'reclaimed'
file folios (reclaimed in the normal LRU lists but still in the easily-
reclaimable list) in this 'easily reclaimable' LRU list. However, I'm
not sure this will work, as this patch aims to help reclaim lazyfree
anon pages before file folios to reduce both file and anon refaults.
If we place 'reclaimed' file folios and lazyfree anon folios in the
same list, we may need to revisit how to reclaim lazyfree anon folios
before reclaiming the 'reclaimed' file folios.

>
> 2. Implement a parameter to control the size of this list. This allows for
>    system tuning based on available memory and performance requirements.

If we include only 'reclaimed' file folios in this 'easily reclaimable'
LRU list, the parameter makes sense. However, if we also add lazyfree
folios to the list, the parameter becomes less meaningful since we can't
predict how many lazyfree anon folios user space might have. I still feel
lazyfree anon folios are different from "reclaimed" file folios (I mean
reclaimed from the normal lists but still on the 'easily reclaimable' list).

>
> 3. Modify kswapd behavior to utilize this list. When kswapd is awakened due
>    to memory pressure, it should attempt to drop those pages first,
>    refilling free pages up to the high watermark.
>
> 4. Before kswapd goes to sleep, it should scan the tail of the LRU list and
>    move cold pages to the easily reclaimable list, unmapping them from the
>    page table.
>
> 5. Whenever a page cache hit occurs, move the page back into the evictable LRU.
>
> This approach allows the system to maintain a pool of readily available
> memory, mitigating the "aging" problem. The trade-off is the potential for
> minor page faults and LRU movement overheads if these pages in the ez_reclaimable
> LRU are accessed again.
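
Taken together, points 3-5 seem to imply a kswapd flow roughly like the
sketch below; every ez_lru_* and related helper is a hypothetical name
for illustration, not existing kernel API:

static void kswapd_refill_from_ez_lru(struct pglist_data *pgdat)
{
	/* 3. on wakeup, free the already-unmapped ez pages first */
	while (!ez_lru_empty(pgdat) && !ez_high_wmark_ok(pgdat))
		ez_lru_free_tail(pgdat);
}

static void kswapd_presort_before_sleep(struct pglist_data *pgdat)
{
	struct folio *folio;

	/* 4. unmap cold pages at the LRU tail and park them on the ez list */
	while (ez_lru_size(pgdat) < ez_lru_limit(pgdat) &&
	       (folio = lru_tail_cold_folio(pgdat)))
		ez_lru_add(pgdat, ez_unmap_folio(folio));
}

/*
 * 5. a page-cache hit on an ez folio moves it back to the normal
 *    evictable LRU, at the cost of a minor fault if it was unmapped.
 */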

I believe you're aware of an implementation from Samsung that uses
cleancache. Although it was dropped from the mainline kernel, it still
exists in the Android kernel. Samsung's rbincache, based on cleancache,
maintains a reserved memory region for holding reclaimed file folios.
Instead of LRU movement, rbincache uses memcpy to transfer data between
the pool and the page cache.
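
Schematically, the read-path lookup that implies is something like the
following; all names here are illustrative, and rbincache's actual
out-of-tree interfaces differ:

static bool clean_cache_restore(struct address_space *mapping,
				pgoff_t index, struct folio *folio)
{
	/* clean_cache_lookup() is hypothetical */
	void *src = clean_cache_lookup(mapping, index);

	if (!src)
		return false;
	/* data is copied, not moved, so the pool page stays reclaimable */
	memcpy(folio_address(folio), src, folio_size(folio));
	return true;
}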

>
> Furthermore, we could put some asynchronous writeback pages (e.g., swap-out
> or fs writeback pages) into the list, too.
> Currently, what we do is rotate those pages back to the head of the LRU
> and, once writeback is done, move the page to the tail of the LRU again.
> We can simply put the page into the ez_reclaimable LRU without rotating
> back and forth.

If this is about establishing a pool of easily reclaimable file folios, I
fully support the idea and am eager to try it, especially for Android,
where there are certainly strong use cases. However, I suspect it may
be controversial and could take months to gain acceptance. Therefore,
I’d prefer we first focus on landing a smaller change to address the
madv_free performance issue and treat that idea as a separate
incremental patch set.

My current patch specifically targets the issue of reclaiming lazyfree
anon folios before reclaiming file folios. It appears your proposal is
independent (though related) work, and I don't believe it should delay
resolving the madv_free issue. Additionally, that pool doesn’t effectively
address the reclamation priority between files and lazyfree anon folios.

In conclusion:

1. I agree that the pool is valuable, and I’d like to develop it as an
incremental patch set. However, this is a significant step that will
require considerable time.
2. It could be quite tricky to include both lazyfree anon folios and
reclaimed file folios (which are reclaimed in normal lists but not in
the 'easily-reclaimable' list) in the same LRU list. I’d prefer to
start by replacing Samsung's rbincache to reduce file folio I/O if we
decide to implement the pool.
3. I believe we should first focus on landing this fix patch for the
madv_free performance issue.

What are your thoughts? I spoke with Yu, and he would like to hear
your opinion.

Thanks
Barry
Minchan Kim Oct. 22, 2024, 8:15 p.m. UTC | #3
Hi Barry,

Sorry for slow response.

On Fri, Oct 18, 2024 at 06:12:01PM +1300, Barry Song wrote:
> On Fri, Oct 18, 2024 at 6:58 AM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Thu, Oct 17, 2024 at 06:59:09PM +1300, Barry Song wrote:
> > > On Thu, Oct 17, 2024 at 11:58 AM Andrew Morton
> > > <akpm@linux-foundation.org> wrote:
> > > >
> > > > On Wed, 16 Oct 2024 16:30:30 +1300 Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > > To address this, this patch proposes maintaining a separate list
> > > > > for lazyfree anon folios while keeping them classified under the
> > > > > "file" LRU type to minimize code changes.
> > > >
> > > > Thanks.  I'll await input from other MGLRU developers before adding
> > > > this for testing.
> > >
> > > Thanks!
> > >
> > > Hi Minchan, Yu,
> > >
> > > Any comments? I understand that Minchan may have a broader plan
> > > to "enable the system to maintain a quickly reclaimable memory
> > > pool and provide a knob for admins to control its size." While I
> > > have no objection to that plan, I believe improving MADV_FREE
> > > performance is a more urgent priority and a low-hanging fruit at this
> > > stage.
> >
> > Hi Barry,
> >
> > I have no idea why my email didn't go through before. I sent the
> > following reply on Sep 24. Hope it works this time.
> 
> Hi Minchan,
> 
> I guess not. This email of yours ended up in my Gmail spam folder, and
> my oppo.com account still hasn’t received it. Any idea why?

In the end, that's my problem, and I don't know when it can be fixed.
Anyway, I hope it works this time.

> 
> >
> > ====== &< ======
> >
> > My proposal involves the following:
> >
> > 1. Introduce an "easily reclaimable" LRU list. This list would hold pages
> >    that can be quickly freed without significant overhead.
> 
> I assume you plan to keep both lazyfree anon pages and 'reclaimed'
> file folios (reclaimed in the normal LRU lists but still in the easily-
> reclaimable list) in this 'easily reclaimable' LRU list. However, I'm
> not sure this will work, as this patch aims to help reclaim lazyfree
> anon pages before file folios to reduce both file and anon refaults.
> If we place 'reclaimed' file folios and lazyfree anon folios in the
> same list, we may need to revisit how to reclaim lazyfree anon folios
> before reclaiming the 'reclaimed' file folios.

Those reclaimed folios were already *decision-made* but just couldn't be
freed due to an *implementation issue*. So, as long as there has been no
access since then, they are stronger reclaim candidates than the others.

> 
> >
> > 2. Implement a parameter to control the size of this list. This allows for
> >    system tuning based on available memory and performance requirements.
> 
> If we include only 'reclaimed' file folios in this 'easily reclaimable'
> LRU list, the parameter makes sense. However, if we also add lazyfree
> folios to the list, the parameter becomes less meaningful since we can't
> predict how many lazyfree anon folios user space might have. I still feel
> lazyfree anon folios are different from "reclaimed" file folios (I mean
> reclaimed from the normal lists but still on the 'easily reclaimable' list).

I thought the ez-reclaimable LRU doesn't need to be accurate since we can
put other folios in later (e.g., folios that got fadvise_dontneed but
couldn't be dropped at that time).

> 
> >
> > 3. Modify kswapd behavior to utilize this list. When kswapd is awakened due
> >    to memory pressure, it should attempt to drop those pages first,
> >    refilling free pages up to the high watermark.
> >
> > 4. Before kswapd goes to sleep, it should scan the tail of the LRU list and
> >    move cold pages to the easily reclaimable list, unmapping them from the
> >    page table.
> >
> > 5. Whenever a page cache hit occurs, move the page back into the evictable LRU.
> >
> > This approach allows the system to maintain a pool of readily available
> > memory, mitigating the "aging" problem. The trade-off is the potential for
> > minor page faults and LRU movement overheads if these pages in the ez_reclaimable
> > LRU are accessed again.
> 
> I believe you're aware of an implementation from Samsung that uses
> cleancache. Although it was dropped from the mainline kernel, it still
> exists in the Android kernel. Samsung's rbincache, based on cleancache,
> maintains a reserved memory region for holding reclaimed file folios.
> Instead of LRU movement, rbincache uses memcpy to transfer data between
> the pool and the page cache.
> 
> >
> > Furthermore, we could put some asynchronous writeback pages (e.g., swap-out
> > or fs writeback pages) into the list, too.
> > Currently, what we do is rotate those pages back to the head of the LRU
> > and, once writeback is done, move the page to the tail of the LRU again.
> > We can simply put the page into the ez_reclaimable LRU without rotating
> > back and forth.
> 
> If this is about establishing a pool of easily reclaimable file folios, I
> fully support the idea and am eager to try it, especially for Android,
> where there are certainly strong use cases. However, I suspect it may
> be controversial and could take months to gain acceptance. Therefore,
> I’d prefer we first focus on landing a smaller change to address the
> madv_free performance issue and treat that idea as a separate
> incremental patch set.

I don't want to block the improvement, Barry.

The reason I suggested another LRU was actually to prevent divergence
between MGLRU and the split LRU and to show the same behavior by
introducing the additional logic in a central place.
I don't think it's desirable that a userspace hint shows different
priority depending on admin config.

Personally, I believe it would be better to introduce a knob to change
MADV_FREE's behavior for both LRU algorithms at the same time instead of
only one, even though we will see the LRU inversion issue.

> 
> My current patch specifically targets the issue of reclaiming lazyfree
> anon folios before reclaiming file folios. It appears your proposal is
> independent (though related) work, and I don't believe it should delay
> resolving the madv_free issue. Additionally, that pool doesn’t effectively
> address the reclamation priority between files and lazyfree anon folios.
> 
> In conclusion:
> 
> 1. I agree that the pool is valuable, and I’d like to develop it as an
> incremental patch set. However, this is a significant step that will
> require considerable time.
> 2. It could be quite tricky to include both lazyfree anon folios and
> reclaimed file folios (which are reclaimed in normal lists but not in
> the 'easily-reclaimable' list) in the same LRU list. I’d prefer to
> start by replacing Samsung's rbincache to reduce file folio I/O if we
> decide to implement the pool.
> 3. I believe we should first focus on landing this fix patch for the
> madv_free performance issue.
> 
> What are your thoughts? I spoke with Yu, and he would like to hear
> your opinion.

Sure, I don't want to block any improvement, but please think once more
about my concern, and just go with your ideas if no one else shares it.

Thank you.
Barry Song Oct. 28, 2024, 6:54 a.m. UTC | #4
On Wed, Oct 23, 2024 at 9:15 AM Minchan Kim <minchan.kim@gmail.com> wrote:
>
> Hi Barry,
>
> Sorry for slow response.
>
> On Fri, Oct 18, 2024 at 06:12:01PM +1300, Barry Song wrote:
> > On Fri, Oct 18, 2024 at 6:58 AM Minchan Kim <minchan@kernel.org> wrote:
> > >
> > > On Thu, Oct 17, 2024 at 06:59:09PM +1300, Barry Song wrote:
> > > > On Thu, Oct 17, 2024 at 11:58 AM Andrew Morton
> > > > <akpm@linux-foundation.org> wrote:
> > > > >
> > > > > On Wed, 16 Oct 2024 16:30:30 +1300 Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > > To address this, this patch proposes maintaining a separate list
> > > > > > for lazyfree anon folios while keeping them classified under the
> > > > > > "file" LRU type to minimize code changes.
> > > > >
> > > > > Thanks.  I'll await input from other MGLRU developers before adding
> > > > > this for testing.
> > > >
> > > > Thanks!
> > > >
> > > > Hi Minchan, Yu,
> > > >
> > > > Any comments? I understand that Minchan may have a broader plan
> > > > to "enable the system to maintain a quickly reclaimable memory
> > > > pool and provide a knob for admins to control its size." While I
> > > > have no objection to that plan, I believe improving MADV_FREE
> > > > performance is a more urgent priority and a low-hanging fruit at this
> > > > stage.
> > >
> > > Hi Barry,
> > >
> > > I have no idea why my email didn't go through before. I sent the
> > > following reply on Sep 24. Hope it works this time.
> >
> > Hi Minchan,
> >
> > I guess not. This email of yours ended up in my Gmail spam folder, and
> > my oppo.com account still hasn’t received it. Any idea why?
>
> In the end, that's my problem, and I don't know when it can be fixed.
> Anyway, I hope it works this time.
>
> >
> > >
> > > ====== &< ======
> > >
> > > My proposal involves the following:
> > >
> > > 1. Introduce an "easily reclaimable" LRU list. This list would hold pages
> > >    that can be quickly freed without significant overhead.
> >
> > I assume you plan to keep both lazyfree anon pages and 'reclaimed'
> > file folios (reclaimed in the normal LRU lists but still in the easily-
> > reclaimable list) in this 'easily reclaimable' LRU list. However, I'm
> > not sure this will work, as this patch aims to help reclaim lazyfree
> > anon pages before file folios to reduce both file and anon refaults.
> > If we place 'reclaimed' file folios and lazyfree anon folios in the
> > same list, we may need to revisit how to reclaim lazyfree anon folios
> > before reclaiming the 'reclaimed' file folios.
>

Hi Minchan,

> Those reclaimed folios were already *decision-made* but just couldn't be
> freed due to an *implementation issue*. So, as long as there has been no
> access since then, they are stronger reclaim candidates than the others.

I'm not entirely clear that placing an LRU after the inactive list or
min_gen will provide meaningful benefits in typical scenarios. Let me
give a concrete example (for simplicity, using active/inactive):

Suppose we used to have, for example, 1.5G active + 0.5G inactive file,
so the length of active + inactive = 2G.

Now we have 3 lists, for example:
1G active + 0.5G inactive + 0.5G ez_reclaimable

The total length remains 2G, which is still the size needed to keep files in
the page cache for hits, so the overall size of the LRU hasn’t changed. The
only difference is that 0.5G has been separated from the original active +
inactive lists. By moving this 0.5G out of the normal LRU, it seems that
the CPU overhead for kswapd might increase, as the reduced size of the
normal LRU could lead to more frequent scanning operations (pretty
much like more aggressive reclamation in the normal LRUs)?

On the other hand, mglru typically places items like syscall-related files
into a relatively reclaimable generation, which in a way already acts
like an ez_reclaimable generation—though not as "easily reclaimable"
compared to the dedicated ez_reclaimable list.  mglru can also compare
the relative hotness of syscall folios against mmap-ed file folios and
re-order those folios accordingly in lru.

Given that, has the value of the ez_reclaimable list diminished compared
to when we only had active and inactive lists?

>
> >
> > >
> > > 2. Implement a parameter to control the size of this list. This allows for
> > >    system tuning based on available memory and performance requirements.
> >
> > If we include only 'reclaimed' file folios in this 'easily reclaimable'
> > LRU list, the parameter makes sense. However, if we also add lazyfree
> > folios to the list, the parameter becomes less meaningful since we can't
> > predict how many lazyfree anon folios user space might have. I still feel
> > lazyfree anon folios are different from "reclaimed" file folios (I mean
> > reclaimed from the normal lists but still on the 'easily reclaimable' list).
>
> I thought the ez-reclaimable LRU doesn't need to be accurate since we can
> put other folios in later (e.g., folios that got fadvise_dontneed but
> couldn't be dropped at that time).

My point is that if we set a parameter—say, ez_reclaimable to 500MB—
and then perform a 1GB MADV_FREE, we will still need to store the
excess MADV_FREE folios in the normal LRU unless we allow the
ez_reclaimable list to grow indefinitely.

If we permit unlimited length, then once it exceeds 500MB (for example,
due to MADV_FREE), would we stop reclaiming files from the normal LRUs
into the ez_reclaimable list since it already holds enough?

However, if we stop this process, the sorting mechanism—differentiating
between easily reclaimable and less reclaimable folios—would essentially
break down.
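
As a sketch of that dilemma, assuming a hypothetical capped ez_lru (none
of these helpers exist):

static void ez_lru_add_capped(struct folio *folio)
{
	if (ez_lru_pages() + folio_nr_pages(folio) <= ez_lru_cap()) {
		ez_lru_add(folio);
		return;
	}
	/*
	 * Overflow: either spill lazyfree folios back to the normal LRU,
	 * losing the "reclaim lazyfree first" ordering, or ignore the cap
	 * and let the list grow unbounded.
	 */
	normal_lru_add(folio);
}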

Currently, Gao Xu's reported issue is that if MADV_FREE folios are
positioned in a relatively difficult-to-reclaim spot, like at the head of
the LRU, they may not be reclaimed in time, while files are being
aggressively reclaimed. Merging both types of folios into a single
ez_reclaimable list could still lead to this issue. We would still need
to decide whether to place MADV_FREE folios at the head of the
ez_reclaimable list or at the tail, even if that means ignoring LRU
inversion.

>
> >
> > >
> > > 3. Modify kswapd behavior to utilize this list. When kswapd is awakened due
> > >    to memory pressure, it should attempt to drop those pages first,
> > >    refilling free pages up to the high watermark.
> > >
> > > 4. Before kswapd goes to sleep, it should scan the tail of the LRU list and
> > >    move cold pages to the easily reclaimable list, unmapping them from the
> > >    page table.
> > >
> > > 5. Whenever a page cache hit occurs, move the page back into the evictable LRU.
> > >
> > > This approach allows the system to maintain a pool of readily available
> > > memory, mitigating the "aging" problem. The trade-off is the potential for
> > > minor page faults and LRU movement overheads if these pages in the ez_reclaimable
> > > LRU are accessed again.
> >
> > I believe you're aware of an implementation from Samsung that uses
> > cleancache. Although it was dropped from the mainline kernel, it still
> > exists in the Android kernel. Samsung's rbincache, based on cleancache,
> > maintains a reserved memory region for holding reclaimed file folios.
> > Instead of LRU movement, rbincache uses memcpy to transfer data between
> > the pool and the page cache.
> >
> > >
> > > Furthermore, we could put some asynchronous writeback pages (e.g., swap-out
> > > or fs writeback pages) into the list, too.
> > > Currently, what we do is rotate those pages back to the head of the LRU
> > > and, once writeback is done, move the page to the tail of the LRU again.
> > > We can simply put the page into the ez_reclaimable LRU without rotating
> > > back and forth.
> >
> > If this is about establishing a pool of easily reclaimable file folios, I
> > fully support the idea and am eager to try it, especially for Android,
> > where there are certainly strong use cases. However, I suspect it may
> > be controversial and could take months to gain acceptance. Therefore,
> > I’d prefer we first focus on landing a smaller change to address the
> > madv_free performance issue and treat that idea as a separate
> > incremental patch set.
>
> I don't want to block the improvement, Barry.
>
> The reason I suggested another LRU was actually to prevent divergence
> between MGLRU and the split LRU and to show the same behavior by
> introducing the additional logic in a central place.
> I don't think it's desirable that a userspace hint shows different
> priority depending on admin config.

I understand your perspective. My interest in the ez_reclaimable LRU
is primarily about providing a quick method for freeing up memory
without tying it up in reservations for specific use cases.

For instance, in Samsung's implementation, there's a reserved memory
area intended for ION and DMA-BUF operations. Certain applications
can rapidly allocate these resources, and if this process is delayed,
it can adversely affect the user interface experience. To mitigate
this issue, they have established a shared reserved memory section
known as a clean cache pool, where reclaimed folios can be copied
into.

When files are read, the read path can also check this clean cache
pool; if there's a match, folios can be copied into the page
cache. Because this clean cache can be swiftly reclaimed, its
performance closely resembles that of being fully reserved.

If this type of reserved memory can help reduce I/O operations, it would
be beneficial, especially since this memory was originally set aside. In
scenarios involving ION and DMA-BUF, this memory could otherwise
go to waste.

This raises a concern: if the ez_reclaimable memory is mainly consumed
by various user scenarios (meaning all alloc_pages can access it
indiscriminately), ION and DMA-BUF operations may find it difficult to
acquire this memory in a timely manner. This situation undermines the
potential benefits we aim to achieve for user experience in these
scenarios.

>
> Personally, I believe it would be better to introduce a knob to change
> MADV_FREE's behavior for both LRU algorithms at the same time instead of
> only one, even though we will see the LRU inversion issue.
>
> >
> > My current patch specifically targets the issue of reclaiming lazyfree
> > anon folios before reclaiming file folios. It appears your proposal is
> > independent (though related) work, and I don't believe it should delay
> > resolving the madv_free issue. Additionally, that pool doesn’t effectively
> > address the reclamation priority between files and lazyfree anon folios.
> >
> > In conclusion:
> >
> > 1. I agree that the pool is valuable, and I’d like to develop it as an
> > incremental patch set. However, this is a significant step that will
> > require considerable time.
> > 2. It could be quite tricky to include both lazyfree anon folios and
> > reclaimed file folios (which are reclaimed in normal lists but not in
> > the 'easily-reclaimable' list) in the same LRU list. I’d prefer to
> > start by replacing Samsung's rbincache to reduce file folio I/O if we
> > decide to implement the pool.
> > 3. I believe we should first focus on landing this fix patch for the
> > madv_free performance issue.
> >
> > What are your thoughts? I spoke with Yu, and he would like to hear
> > your opinion.
>
> Sure, I don't want to block any improvement, but please think once more
> about my concern, and just go with your ideas if no one else shares it.

I'm still grappling with these questions: Are we seeking ez_reclaimable
memory that can be equally utilized by all alloc_pages(), or is it primarily
intended for specific high-priority users who previously depended on
reserved memory?

If the goal is the former, I’m still not completely clear on all the
pros and cons. There seem to be many issues that need careful
consideration. For instance,
should we view moving from the ez_reclaimable list to the normal lists as
a refault? Or should we only consider reading from disk as a refault?

For each kswapd wake-up or direct reclamation, how much memory should
we reclaim from the ez_reclaimable list versus how much should we reclaim
from anonymous memory?

On the other hand, I definitely see the value in the latter approach, though
it may not be suitable for all scenarios. It could be particularly beneficial
for users who have reserved memory for specific purposes. This reserved
memory can be repurposed as page caches when those specific applications
are not running. Once those applications are launched, the page caches can
be reclaimed at zero cost.

>
> Thank you.

Thanks
barry

Patch

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 87580e8363ef..615fe80d73d0 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -226,6 +226,7 @@  static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio,
 	int gen = folio_lru_gen(folio);
 	int type = folio_is_file_lru(folio);
 	int zone = folio_zonenum(folio);
+	int lazyfree = type ? folio_test_anon(folio) : 0;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 
 	VM_WARN_ON_ONCE_FOLIO(gen != -1, folio);
@@ -265,9 +266,9 @@  static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio,
 	lru_gen_update_size(lruvec, folio, -1, gen);
 	/* for folio_rotate_reclaimable() */
 	if (reclaiming)
-		list_add_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
+		list_add_tail(&folio->lru, &lrugen->folios[gen][type + lazyfree][zone]);
 	else
-		list_add(&folio->lru, &lrugen->folios[gen][type][zone]);
+		list_add(&folio->lru, &lrugen->folios[gen][type + lazyfree][zone]);
 
 	return true;
 }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 96dea31fb211..5cb86ea324be 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -436,7 +436,7 @@  struct lru_gen_folio {
 	/* the birth time of each generation in jiffies */
 	unsigned long timestamps[MAX_NR_GENS];
 	/* the multi-gen LRU lists, lazily sorted on eviction */
-	struct list_head folios[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	struct list_head folios[MAX_NR_GENS][ANON_AND_FILE + 1][MAX_NR_ZONES];
 	/* the multi-gen LRU sizes, eventually consistent */
 	long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
 	/* the exponential moving average of refaulted */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fd3908d43b07..e2f13a9b50da 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3736,21 +3736,25 @@  static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
 
 	/* prevent cold/hot inversion if force_scan is true */
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-		struct list_head *head = &lrugen->folios[old_gen][type][zone];
+		int list_num = type ? 2 : 1;
+		struct list_head *head;
 
-		while (!list_empty(head)) {
-			struct folio *folio = lru_to_folio(head);
+		for (int i = list_num - 1; i >= 0; i--) {
+			head = &lrugen->folios[old_gen][type + i][zone];
+			while (!list_empty(head)) {
+				struct folio *folio = lru_to_folio(head);
 
-			VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
-			VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
-			VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
-			VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
+				VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
+				VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
+				VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
+				VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
 
-			new_gen = folio_inc_gen(lruvec, folio, false);
-			list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
+				new_gen = folio_inc_gen(lruvec, folio, false);
+				list_move_tail(&folio->lru, &lrugen->folios[new_gen][type + i][zone]);
 
-			if (!--remaining)
-				return false;
+				if (!--remaining)
+					return false;
+			}
 		}
 	}
 done:
@@ -4302,6 +4306,7 @@  static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	int refs = folio_lru_refs(folio);
 	int tier = lru_tier_from_refs(refs);
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
+	int lazyfree = type ? folio_test_anon(folio) : 0;
 
 	VM_WARN_ON_ONCE_FOLIO(gen >= MAX_NR_GENS, folio);
 
@@ -4317,7 +4322,7 @@  static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 
 	/* promoted */
 	if (gen != lru_gen_from_seq(lrugen->min_seq[type])) {
-		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
+		list_move(&folio->lru, &lrugen->folios[gen][type + lazyfree][zone]);
 		return true;
 	}
 
@@ -4326,7 +4331,7 @@  static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
 
 		gen = folio_inc_gen(lruvec, folio, false);
-		list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
+		list_move_tail(&folio->lru, &lrugen->folios[gen][type + lazyfree][zone]);
 
 		WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
 			   lrugen->protected[hist][type][tier - 1] + delta);
@@ -4336,7 +4341,7 @@  static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	/* ineligible */
 	if (!folio_test_lru(folio) || zone > sc->reclaim_idx) {
 		gen = folio_inc_gen(lruvec, folio, false);
-		list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
+		list_move_tail(&folio->lru, &lrugen->folios[gen][type + lazyfree][zone]);
 		return true;
 	}
 
@@ -4344,7 +4349,7 @@  static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
 	    (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
 		gen = folio_inc_gen(lruvec, folio, true);
-		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
+		list_move(&folio->lru, &lrugen->folios[gen][type + lazyfree][zone]);
 		return true;
 	}
 
@@ -4388,7 +4393,7 @@  static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
 static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 		       int type, int tier, struct list_head *list)
 {
-	int i;
+	int i, j;
 	int gen;
 	enum vm_event_item item;
 	int sorted = 0;
@@ -4410,33 +4415,38 @@  static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 		LIST_HEAD(moved);
 		int skipped_zone = 0;
 		int zone = (sc->reclaim_idx + i) % MAX_NR_ZONES;
-		struct list_head *head = &lrugen->folios[gen][type][zone];
-
-		while (!list_empty(head)) {
-			struct folio *folio = lru_to_folio(head);
-			int delta = folio_nr_pages(folio);
-
-			VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
-			VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
-			VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
-			VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
-
-			scanned += delta;
+		int list_num = type ? 2 : 1;
+		struct list_head *head;
+
+		for (j = list_num - 1; j >= 0; j--) {
+			head = &lrugen->folios[gen][type + j][zone];
+			while (!list_empty(head)) {
+				struct folio *folio = lru_to_folio(head);
+				int delta = folio_nr_pages(folio);
+
+				VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
+				VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
+				VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
+				VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
+
+				scanned += delta;
+
+				if (sort_folio(lruvec, folio, sc, tier))
+					sorted += delta;
+				else if (isolate_folio(lruvec, folio, sc)) {
+					list_add(&folio->lru, list);
+					isolated += delta;
+				} else {
+					list_move(&folio->lru, &moved);
+					skipped_zone += delta;
+				}
 
-			if (sort_folio(lruvec, folio, sc, tier))
-				sorted += delta;
-			else if (isolate_folio(lruvec, folio, sc)) {
-				list_add(&folio->lru, list);
-				isolated += delta;
-			} else {
-				list_move(&folio->lru, &moved);
-				skipped_zone += delta;
+				if (!--remaining || max(isolated, skipped_zone) >= MIN_LRU_BATCH)
+					goto isolate_done;
 			}
-
-			if (!--remaining || max(isolated, skipped_zone) >= MIN_LRU_BATCH)
-				break;
 		}
 
+isolate_done:
 		if (skipped_zone) {
 			list_splice(&moved, head);
 			__count_zid_vm_events(PGSCAN_SKIP, zone, skipped_zone);
@@ -5588,8 +5598,15 @@  void lru_gen_init_lruvec(struct lruvec *lruvec)
 	for (i = 0; i <= MIN_NR_GENS + 1; i++)
 		lrugen->timestamps[i] = jiffies;
 
-	for_each_gen_type_zone(gen, type, zone)
+	for_each_gen_type_zone(gen, type, zone) {
 		INIT_LIST_HEAD(&lrugen->folios[gen][type][zone]);
+		/*
+		 * lazyfree anon folios have a separate list while using
+		 * file as type
+		 */
+		if (type)
+			INIT_LIST_HEAD(&lrugen->folios[gen][type + 1][zone]);
+	}
 
 	if (mm_state)
 		mm_state->seq = MIN_NR_GENS;