diff mbox series

[v10] mm: vmscan: try to reclaim swapcache pages if no swap space

Message ID 20231121090624.1814733-1-liushixin2@huawei.com (mailing list archive)
State New
Headers show
Series [v10] mm: vmscan: try to reclaim swapcache pages if no swap space | expand

Commit Message

Liu Shixin Nov. 21, 2023, 9:06 a.m. UTC
When the space of swap devices is exhausted, only file pages can be
reclaimed.  But there may still be some swapcache pages on the anon LRU
list, and leaving them unreclaimed can lead to a premature out-of-memory.

The problem can be reproduced with the following steps:

 First, set up a 9MB disk swap space, then create a cgroup with a 10MB
 memory limit, then run a program that allocates about 15MB of memory.

The problem occurs only occasionally and may take about 100 runs to trigger [1].

Fix it by checking the number of swapcache pages in can_reclaim_anon_pages().
If the number is non-zero, return true and set swapcache_only to 1.
When scanning the anon LRU list in swapcache_only mode, non-swapcache pages
are skipped during isolation to improve reclaim efficiency.

However, in swapcache_only mode, the scan count is still incremented for
non-swapcache pages.  This is because non-swapcache pages greatly
outnumber the rare swapcache pages in this mode, and if the skipped
non-swapcache pages were not counted, the page scan in
isolate_lru_folios() could run for so long that it eventually triggers a
hung task, as Sachin reported [2].

Besides, since there are enough rounds of memory reclaim before OOM, there
is no need to isolate too many swapcache pages in one round.
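
For reference, the relevant part of the isolate_lru_folios() change
(condensed from the diff in this patch) is:

		scan += nr_pages;

		/*
		 * Count non-swapcache folios too: swapcache folios may be
		 * rare, and not counting the skipped non-swapcache folios
		 * would make this scan loop run far too long.
		 */
		if (unlikely(sc->swapcache_only && !is_file_lru(lru) &&
		    !folio_test_swapcache(folio)))
			goto move;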

[1]. https://lore.kernel.org/lkml/CAJD7tkZAfgncV+KbKr36=eDzMnT=9dZOT0dpMWcurHLr6Do+GA@mail.gmail.com/
[2]. https://lore.kernel.org/linux-mm/CAJD7tkafz_2XAuqE8tGLPEcpLngewhUo=5US14PAtSM9tLBUQg@mail.gmail.com/

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Tested-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
---
v9->v10: Use per-node swapcache suggested by Yu Zhao.
v8->v9: Move the swapcache check after can_demote() and refactor
	can_reclaim_anon_pages() a bit.
v7->v8: Reset swapcache_only at the beginning of can_reclaim_anon_pages().
v6->v7: Reset swapcache_only to zero after there are swap spaces.
v5->v6: Fix the NULL pointer dereference and hung task problems reported by Sachin.

 mm/vmscan.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

Comments

Michal Hocko Nov. 21, 2023, 1 p.m. UTC | #1
On Tue 21-11-23 17:06:24, Liu Shixin wrote:
> When spaces of swap devices are exhausted, only file pages can be
> reclaimed.  But there are still some swapcache pages in anon lru list.
> This can lead to a premature out-of-memory.
> 
> The problem is found with such step:
> 
>  Firstly, set a 9MB disk swap space, then create a cgroup with 10MB
>  memory limit, then runs an program to allocates about 15MB memory.
> 
> The problem occurs occasionally, which may need about 100 times [1].
> 
> Fix it by checking number of swapcache pages in can_reclaim_anon_pages().
> If the number is not zero, return true and set swapcache_only to 1.
> When scan anon lru list in swapcache_only mode, non-swapcache pages will
> be skipped to isolate in order to accelerate reclaim efficiency.
> 
> However, in swapcache_only mode, the scan count still increased when scan
> non-swapcache pages because there are large number of non-swapcache pages
> and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> is skipped and do not count, the scan of pages in isolate_lru_folios() can
> eventually lead to hung task, just as Sachin reported [2].

I find this paragraph really confusing! I guess what you meant to say is
that a real swapcache_only is problematic because it can end up not
making any progress, correct? 

AFAIU you have addressed that problem by making swapcache_only anon LRU
specific, right? That would be certainly more robust as you can still
reclaim from file LRUs. I cannot say I like that because swapcache_only
is a bit confusing and I do not think we want to grow more special
purpose reclaim types. Would it be possible/reasonable to instead put
swapcache pages on the file LRU instead?
Liu Shixin Nov. 22, 2023, 6:41 a.m. UTC | #2
On 2023/11/21 21:00, Michal Hocko wrote:
> On Tue 21-11-23 17:06:24, Liu Shixin wrote:
>
> However, in swapcache_only mode, the scan count still increased when scan
> non-swapcache pages because there are large number of non-swapcache pages
> and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> is skipped and do not count, the scan of pages in isolate_lru_folios() can
> eventually lead to hung task, just as Sachin reported [2].
> I find this paragraph really confusing! I guess what you meant to say is
> that a real swapcache_only is problematic because it can end up not
> making any progress, correct? 
This paragraph explains why swapcache_only is checked after "scan += nr_pages;".
>
> AFAIU you have addressed that problem by making swapcache_only anon LRU
> specific, right? That would be certainly more robust as you can still
> reclaim from file LRUs. I cannot say I like that because swapcache_only
> is a bit confusing and I do not think we want to grow more special
> purpose reclaim types. Would it be possible/reasonable to instead put
> swapcache pages on the file LRU instead?
It looks like a good idea, but I'm not sure if it's possible. I can try
it; is there anything I should pay attention to?

Thanks,
Yosry Ahmed Nov. 22, 2023, 6:44 a.m. UTC | #3
On Tue, Nov 21, 2023 at 10:41 PM Liu Shixin <liushixin2@huawei.com> wrote:
>
>
> On 2023/11/21 21:00, Michal Hocko wrote:
> > On Tue 21-11-23 17:06:24, Liu Shixin wrote:
> >
> > However, in swapcache_only mode, the scan count still increased when scan
> > non-swapcache pages because there are large number of non-swapcache pages
> > and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> > is skipped and do not count, the scan of pages in isolate_lru_folios() can
> > eventually lead to hung task, just as Sachin reported [2].
> > I find this paragraph really confusing! I guess what you meant to say is
> > that a real swapcache_only is problematic because it can end up not
> > making any progress, correct?
> This paragraph is going to explain why checking swapcache_only after scan += nr_pages;
> >
> > AFAIU you have addressed that problem by making swapcache_only anon LRU
> > specific, right? That would be certainly more robust as you can still
> > reclaim from file LRUs. I cannot say I like that because swapcache_only
> > is a bit confusing and I do not think we want to grow more special
> > purpose reclaim types. Would it be possible/reasonable to instead put
> > swapcache pages on the file LRU instead?
> It looks like a good idea, but I'm not sure if it's possible. I can try it, is there anything to
> pay attention to?

I think this might be more intrusive than we think. Every time a page
is added to or removed from the swap cache, we will need to move it
between LRUs. All pages on the anon LRU will need to go through the
file LRU before being reclaimed. I think this might be too big of a
change to achieve this patch's goal.

>
> Thanks,
>
Huang, Ying Nov. 22, 2023, 6:57 a.m. UTC | #4
Yosry Ahmed <yosryahmed@google.com> writes:

> On Tue, Nov 21, 2023 at 10:41 PM Liu Shixin <liushixin2@huawei.com> wrote:
>>
>>
>> On 2023/11/21 21:00, Michal Hocko wrote:
>> > On Tue 21-11-23 17:06:24, Liu Shixin wrote:
>> >
>> > However, in swapcache_only mode, the scan count still increased when scan
>> > non-swapcache pages because there are large number of non-swapcache pages
>> > and rare swapcache pages in swapcache_only mode, and if the non-swapcache
>> > is skipped and do not count, the scan of pages in isolate_lru_folios() can
>> > eventually lead to hung task, just as Sachin reported [2].
>> > I find this paragraph really confusing! I guess what you meant to say is
>> > that a real swapcache_only is problematic because it can end up not
>> > making any progress, correct?
>> This paragraph is going to explain why checking swapcache_only after scan += nr_pages;
>> >
>> > AFAIU you have addressed that problem by making swapcache_only anon LRU
>> > specific, right? That would be certainly more robust as you can still
>> > reclaim from file LRUs. I cannot say I like that because swapcache_only
>> > is a bit confusing and I do not think we want to grow more special
>> > purpose reclaim types. Would it be possible/reasonable to instead put
>> > swapcache pages on the file LRU instead?
>> It looks like a good idea, but I'm not sure if it's possible. I can try it, is there anything to
>> pay attention to?
>
> I think this might be more intrusive than we think. Every time a page
> is added to or removed from the swap cache, we will need to move it
> between LRUs. All pages on the anon LRU will need to go through the
> file LRU before being reclaimed. I think this might be too big of a
> change to achieve this patch's goal.

We would first need to identify swap cache pages on the file LRU.  That
appears hard given the current definition of the page flags:

  /* Filesystems */
  PG_checked = PG_owner_priv_1,

  /* SwapBacked */
  PG_swapcache = PG_owner_priv_1,       /* Swap page: swp_entry_t in private */

--
Best Regards,
Huang, Ying
Michal Hocko Nov. 22, 2023, 8:52 a.m. UTC | #5
On Tue 21-11-23 22:44:32, Yosry Ahmed wrote:
> On Tue, Nov 21, 2023 at 10:41 PM Liu Shixin <liushixin2@huawei.com> wrote:
> >
> >
> > On 2023/11/21 21:00, Michal Hocko wrote:
> > > On Tue 21-11-23 17:06:24, Liu Shixin wrote:
> > >
> > > However, in swapcache_only mode, the scan count still increased when scan
> > > non-swapcache pages because there are large number of non-swapcache pages
> > > and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> > > is skipped and do not count, the scan of pages in isolate_lru_folios() can
> > > eventually lead to hung task, just as Sachin reported [2].
> > > I find this paragraph really confusing! I guess what you meant to say is
> > > that a real swapcache_only is problematic because it can end up not
> > > making any progress, correct?
> > This paragraph is going to explain why checking swapcache_only after scan += nr_pages;
> > >
> > > AFAIU you have addressed that problem by making swapcache_only anon LRU
> > > specific, right? That would be certainly more robust as you can still
> > > reclaim from file LRUs. I cannot say I like that because swapcache_only
> > > is a bit confusing and I do not think we want to grow more special
> > > purpose reclaim types. Would it be possible/reasonable to instead put
> > > swapcache pages on the file LRU instead?
> > It looks like a good idea, but I'm not sure if it's possible. I can try it, is there anything to
> > pay attention to?
> 
> I think this might be more intrusive than we think. Every time a page
> is added to or removed from the swap cache, we will need to move it
> between LRUs. All pages on the anon LRU will need to go through the
> file LRU before being reclaimed. I think this might be too big of a
> change to achieve this patch's goal.

TBH I am not really sure how complex that might turn out to be.
Swapcache tends to be full of subtle issues. So you might be right but
it would be better to know _why_ this is not possible before we end up
fishing for a couple of swapcache pages on a potentially huge anon LRU to
isolate them. Think of TB-sized machines in this context.
Michal Hocko Nov. 22, 2023, 8:55 a.m. UTC | #6
On Wed 22-11-23 14:57:24, Huang, Ying wrote:
> Yosry Ahmed <yosryahmed@google.com> writes:
> 
> > On Tue, Nov 21, 2023 at 10:41 PM Liu Shixin <liushixin2@huawei.com> wrote:
> >>
> >>
> >> On 2023/11/21 21:00, Michal Hocko wrote:
> >> > On Tue 21-11-23 17:06:24, Liu Shixin wrote:
> >> >
> >> > However, in swapcache_only mode, the scan count still increased when scan
> >> > non-swapcache pages because there are large number of non-swapcache pages
> >> > and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> >> > is skipped and do not count, the scan of pages in isolate_lru_folios() can
> >> > eventually lead to hung task, just as Sachin reported [2].
> >> > I find this paragraph really confusing! I guess what you meant to say is
> >> > that a real swapcache_only is problematic because it can end up not
> >> > making any progress, correct?
> >> This paragraph is going to explain why checking swapcache_only after scan += nr_pages;
> >> >
> >> > AFAIU you have addressed that problem by making swapcache_only anon LRU
> >> > specific, right? That would be certainly more robust as you can still
> >> > reclaim from file LRUs. I cannot say I like that because swapcache_only
> >> > is a bit confusing and I do not think we want to grow more special
> >> > purpose reclaim types. Would it be possible/reasonable to instead put
> >> > swapcache pages on the file LRU instead?
> >> It looks like a good idea, but I'm not sure if it's possible. I can try it, is there anything to
> >> pay attention to?
> >
> > I think this might be more intrusive than we think. Every time a page
> > is added to or removed from the swap cache, we will need to move it
> > between LRUs. All pages on the anon LRU will need to go through the
> > file LRU before being reclaimed. I think this might be too big of a
> > change to achieve this patch's goal.
> 
> We need to identify swap cache pages on file LRU firstly.  It appears
> hard from the current definition of page flags.
> 
>   /* Filesystems */
>   PG_checked = PG_owner_priv_1,
> 
>   /* SwapBacked */
>   PG_swapcache = PG_owner_priv_1,       /* Swap page: swp_entry_t in private */

Checking along with folio_test_swapbacked would do the trick, right?
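
For reference, folio_test_swapcache() already appears to combine the two
checks; a paraphrased sketch based on include/linux/page-flags.h:

	static __always_inline bool folio_test_swapcache(struct folio *folio)
	{
		/*
		 * PG_swapcache aliases PG_owner_priv_1, so the bit is only
		 * meaningful when the folio is also swap-backed.
		 */
		return folio_test_swapbacked(folio) &&
			test_bit(PG_swapcache, folio_flags(folio, 0));
	}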
Michal Hocko Nov. 22, 2023, 10:09 a.m. UTC | #7
On Wed 22-11-23 09:52:42, Michal Hocko wrote:
> On Tue 21-11-23 22:44:32, Yosry Ahmed wrote:
> > On Tue, Nov 21, 2023 at 10:41 PM Liu Shixin <liushixin2@huawei.com> wrote:
> > >
> > >
> > > On 2023/11/21 21:00, Michal Hocko wrote:
> > > > On Tue 21-11-23 17:06:24, Liu Shixin wrote:
> > > >
> > > > However, in swapcache_only mode, the scan count still increased when scan
> > > > non-swapcache pages because there are large number of non-swapcache pages
> > > > and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> > > > is skipped and do not count, the scan of pages in isolate_lru_folios() can
> > > > eventually lead to hung task, just as Sachin reported [2].
> > > > I find this paragraph really confusing! I guess what you meant to say is
> > > > that a real swapcache_only is problematic because it can end up not
> > > > making any progress, correct?
> > > This paragraph is going to explain why checking swapcache_only after scan += nr_pages;
> > > >
> > > > AFAIU you have addressed that problem by making swapcache_only anon LRU
> > > > specific, right? That would be certainly more robust as you can still
> > > > reclaim from file LRUs. I cannot say I like that because swapcache_only
> > > > is a bit confusing and I do not think we want to grow more special
> > > > purpose reclaim types. Would it be possible/reasonable to instead put
> > > > swapcache pages on the file LRU instead?
> > > It looks like a good idea, but I'm not sure if it's possible. I can try it, is there anything to
> > > pay attention to?
> > 
> > I think this might be more intrusive than we think. Every time a page
> > is added to or removed from the swap cache, we will need to move it
> > between LRUs. All pages on the anon LRU will need to go through the
> > file LRU before being reclaimed. I think this might be too big of a
> > change to achieve this patch's goal.
> 
> TBH I am not really sure how complex that might turn out to be.
> Swapcache tends to be full of subtle issues. So you might be right but
> it would be better to know _why_ this is not possible before we end up
> phising for couple of swapcache pages on potentially huge anon LRU to
> isolate them. Think of TB sized machines in this context.

Forgot to mention that it is not really far-fetched to compare this
to MADV_FREE pages. Those are anonymous but we do not want to keep them
on the anon LRU because we want to age them independent of swap
availability, as they are just dropped during reclaim. Not too much
different from swapcache pages. There are more constraints on those but
fundamentally this is the same problem, no?
Yosry Ahmed Nov. 22, 2023, 10:39 a.m. UTC | #8
On Wed, Nov 22, 2023 at 2:09 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 22-11-23 09:52:42, Michal Hocko wrote:
> > On Tue 21-11-23 22:44:32, Yosry Ahmed wrote:
> > > On Tue, Nov 21, 2023 at 10:41 PM Liu Shixin <liushixin2@huawei.com> wrote:
> > > >
> > > >
> > > > On 2023/11/21 21:00, Michal Hocko wrote:
> > > > > On Tue 21-11-23 17:06:24, Liu Shixin wrote:
> > > > >
> > > > > However, in swapcache_only mode, the scan count still increased when scan
> > > > > non-swapcache pages because there are large number of non-swapcache pages
> > > > > and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> > > > > is skipped and do not count, the scan of pages in isolate_lru_folios() can
> > > > > eventually lead to hung task, just as Sachin reported [2].
> > > > > I find this paragraph really confusing! I guess what you meant to say is
> > > > > that a real swapcache_only is problematic because it can end up not
> > > > > making any progress, correct?
> > > > This paragraph is going to explain why checking swapcache_only after scan += nr_pages;
> > > > >
> > > > > AFAIU you have addressed that problem by making swapcache_only anon LRU
> > > > > specific, right? That would be certainly more robust as you can still
> > > > > reclaim from file LRUs. I cannot say I like that because swapcache_only
> > > > > is a bit confusing and I do not think we want to grow more special
> > > > > purpose reclaim types. Would it be possible/reasonable to instead put
> > > > > swapcache pages on the file LRU instead?
> > > > It looks like a good idea, but I'm not sure if it's possible. I can try it, is there anything to
> > > > pay attention to?
> > >
> > > I think this might be more intrusive than we think. Every time a page
> > > is added to or removed from the swap cache, we will need to move it
> > > between LRUs. All pages on the anon LRU will need to go through the
> > > file LRU before being reclaimed. I think this might be too big of a
> > > change to achieve this patch's goal.
> >
> > TBH I am not really sure how complex that might turn out to be.
> > Swapcache tends to be full of subtle issues. So you might be right but
> > it would be better to know _why_ this is not possible before we end up
> > phising for couple of swapcache pages on potentially huge anon LRU to
> > isolate them. Think of TB sized machines in this context.
>
> Forgot to mention that it is not really far fetched from comparing this
> to MADV_FREE pages. Those are anonymous but we do not want to keep them
> on anon LRU because we want to age them indepdendent on the swap
> availability as they are just dropped during reclaim. Not too much
> different from swapcache pages. There are more constrains on those but
> fundamentally this is the same problem, no?

I agree it's not a first, but swap cache pages are more complicated
because they can go back and forth, unlike MADV_FREE pages which
usually go on a one-way ticket AFAICT. Also, pages going into the swap
cache can be much more common than MADV_FREE pages for a lot of
workloads. I am not sure how different reclaim heuristics will react
to such mobility between the LRUs, and the fact that all pages will
now only get evicted through the file LRU. The anon LRU will
essentially become an LRU that feeds the file LRU. Also, the more
pages we move between LRUs, the more ordering violations we introduce,
as we may put colder pages in front of hotter pages or vice versa.

All in all, I am not saying it's a bad idea or not possible, I am just
saying it's probably more complicated than MADV_FREE, and adding more
cases where pages move between LRUs could introduce problems (or make
existing problems more visible).

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 22, 2023, 1:19 p.m. UTC | #9
On Wed 22-11-23 02:39:15, Yosry Ahmed wrote:
> On Wed, Nov 22, 2023 at 2:09 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 22-11-23 09:52:42, Michal Hocko wrote:
> > > On Tue 21-11-23 22:44:32, Yosry Ahmed wrote:
> > > > On Tue, Nov 21, 2023 at 10:41 PM Liu Shixin <liushixin2@huawei.com> wrote:
> > > > >
> > > > >
> > > > > On 2023/11/21 21:00, Michal Hocko wrote:
> > > > > > On Tue 21-11-23 17:06:24, Liu Shixin wrote:
> > > > > >
> > > > > > However, in swapcache_only mode, the scan count still increased when scan
> > > > > > non-swapcache pages because there are large number of non-swapcache pages
> > > > > > and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> > > > > > is skipped and do not count, the scan of pages in isolate_lru_folios() can
> > > > > > eventually lead to hung task, just as Sachin reported [2].
> > > > > > I find this paragraph really confusing! I guess what you meant to say is
> > > > > > that a real swapcache_only is problematic because it can end up not
> > > > > > making any progress, correct?
> > > > > This paragraph is going to explain why checking swapcache_only after scan += nr_pages;
> > > > > >
> > > > > > AFAIU you have addressed that problem by making swapcache_only anon LRU
> > > > > > specific, right? That would be certainly more robust as you can still
> > > > > > reclaim from file LRUs. I cannot say I like that because swapcache_only
> > > > > > is a bit confusing and I do not think we want to grow more special
> > > > > > purpose reclaim types. Would it be possible/reasonable to instead put
> > > > > > swapcache pages on the file LRU instead?
> > > > > It looks like a good idea, but I'm not sure if it's possible. I can try it, is there anything to
> > > > > pay attention to?
> > > >
> > > > I think this might be more intrusive than we think. Every time a page
> > > > is added to or removed from the swap cache, we will need to move it
> > > > between LRUs. All pages on the anon LRU will need to go through the
> > > > file LRU before being reclaimed. I think this might be too big of a
> > > > change to achieve this patch's goal.
> > >
> > > TBH I am not really sure how complex that might turn out to be.
> > > Swapcache tends to be full of subtle issues. So you might be right but
> > > it would be better to know _why_ this is not possible before we end up
> > > phising for couple of swapcache pages on potentially huge anon LRU to
> > > isolate them. Think of TB sized machines in this context.
> >
> > Forgot to mention that it is not really far fetched from comparing this
> > to MADV_FREE pages. Those are anonymous but we do not want to keep them
> > on anon LRU because we want to age them indepdendent on the swap
> > availability as they are just dropped during reclaim. Not too much
> > different from swapcache pages. There are more constrains on those but
> > fundamentally this is the same problem, no?
> 
> I agree it's not a first, but swap cache pages are more complicated
> because they can go back and forth, unlike MADV_FREE pages which
> usually go on a one way ticket AFAICT.

Yes swapcache pages are indeed more complicated but most of the time
they just go away as well, no? MADV_FREE can be reinitiated if they are
written as well. So fundamentally they are not that different.

> Also pages going into the swap
> cache can be much more common that MADV_FREE pages for a lot of
> workloads. I am not sure how different reclaim heuristics will react
> to such mobility between the LRUs, and the fact that all pages will
> now only get evicted through the file LRU. The anon LRU will
> essentially become an LRU that feeds the file LRU. Also, the more
> pages we move between LRUs, the more ordering violations we introduce,
> as we may put colder pages in front of hotter pages or vice versa.

Well, traditionally the file LRU has been maintaining page cache or
easily disposable pages like MADV_FREE (which can be considered a cache
as well). Swapcache is a form of a page cache as well.

> All in all, I am not saying it's a bad idea or not possible, I am just
> saying it's probably more complicated than MADV_FREE, and adding more
> cases where pages move between LRUs could introduce problems (or make
> existing problems more visible).

Do we want to start adding a filtered anon scan for a certain type of
page? Because this is the question here AFAICS. This might seem the
easier solution but I would argue that it is the less predictable one.
It is not unusual for a huge anon LRU to contain only very few swapcache
pages.

That being said, I might be missing some obvious or less obvious reasons
why this is a completely bad idea. Swapcache is indeed subtle.
Yosry Ahmed Nov. 22, 2023, 8:13 p.m. UTC | #10
On Wed, Nov 22, 2023 at 5:19 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 22-11-23 02:39:15, Yosry Ahmed wrote:
> > On Wed, Nov 22, 2023 at 2:09 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 22-11-23 09:52:42, Michal Hocko wrote:
> > > > On Tue 21-11-23 22:44:32, Yosry Ahmed wrote:
> > > > > On Tue, Nov 21, 2023 at 10:41 PM Liu Shixin <liushixin2@huawei.com> wrote:
> > > > > >
> > > > > >
> > > > > > On 2023/11/21 21:00, Michal Hocko wrote:
> > > > > > > On Tue 21-11-23 17:06:24, Liu Shixin wrote:
> > > > > > >
> > > > > > > However, in swapcache_only mode, the scan count still increased when scan
> > > > > > > non-swapcache pages because there are large number of non-swapcache pages
> > > > > > > and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> > > > > > > is skipped and do not count, the scan of pages in isolate_lru_folios() can
> > > > > > > eventually lead to hung task, just as Sachin reported [2].
> > > > > > > I find this paragraph really confusing! I guess what you meant to say is
> > > > > > > that a real swapcache_only is problematic because it can end up not
> > > > > > > making any progress, correct?
> > > > > > This paragraph is going to explain why checking swapcache_only after scan += nr_pages;
> > > > > > >
> > > > > > > AFAIU you have addressed that problem by making swapcache_only anon LRU
> > > > > > > specific, right? That would be certainly more robust as you can still
> > > > > > > reclaim from file LRUs. I cannot say I like that because swapcache_only
> > > > > > > is a bit confusing and I do not think we want to grow more special
> > > > > > > purpose reclaim types. Would it be possible/reasonable to instead put
> > > > > > > swapcache pages on the file LRU instead?
> > > > > > It looks like a good idea, but I'm not sure if it's possible. I can try it, is there anything to
> > > > > > pay attention to?
> > > > >
> > > > > I think this might be more intrusive than we think. Every time a page
> > > > > is added to or removed from the swap cache, we will need to move it
> > > > > between LRUs. All pages on the anon LRU will need to go through the
> > > > > file LRU before being reclaimed. I think this might be too big of a
> > > > > change to achieve this patch's goal.
> > > >
> > > > TBH I am not really sure how complex that might turn out to be.
> > > > Swapcache tends to be full of subtle issues. So you might be right but
> > > > it would be better to know _why_ this is not possible before we end up
> > > > phising for couple of swapcache pages on potentially huge anon LRU to
> > > > isolate them. Think of TB sized machines in this context.
> > >
> > > Forgot to mention that it is not really far fetched from comparing this
> > > to MADV_FREE pages. Those are anonymous but we do not want to keep them
> > > on anon LRU because we want to age them indepdendent on the swap
> > > availability as they are just dropped during reclaim. Not too much
> > > different from swapcache pages. There are more constrains on those but
> > > fundamentally this is the same problem, no?
> >
> > I agree it's not a first, but swap cache pages are more complicated
> > because they can go back and forth, unlike MADV_FREE pages which
> > usually go on a one way ticket AFAICT.
>
> Yes swapcache pages are indeed more complicated but most of the time
> they just go away as well, no? MADV_FREE can be reinitiated if they are
> written as well. So fundamentally they are not that different.
>
> > Also pages going into the swap
> > cache can be much more common that MADV_FREE pages for a lot of
> > workloads. I am not sure how different reclaim heuristics will react
> > to such mobility between the LRUs, and the fact that all pages will
> > now only get evicted through the file LRU. The anon LRU will
> > essentially become an LRU that feeds the file LRU. Also, the more
> > pages we move between LRUs, the more ordering violations we introduce,
> > as we may put colder pages in front of hotter pages or vice versa.
>
> Well, traditionally the file LRU has been maintaining page cache or
> easily disposable pages like MADV_FREE (which can be considered a cache
> as well). Swapcache is a form of a page cache as well.

If I understand correctly, when we move the MADV_FREE pages to the
file LRU, we don't have correct information about their relative
ordering compared to the pages that are already in the inactive file
LRU. IOW, we mess up the ordering of the inactive file LRU a little.
If we add more cases of moving pages to the file LRU (for the swap
cache), we may make it worse. I am also not sure how this works with
MGLRU generations.

Keep in mind that when a page is marked with MADV_FREE, it's always
cold. On the other hand, when a page is added to the swap cache, it
could be because it's under reclaim (cold), or it was just swapped in
(hot). I am not sure this practically matters, just something to think
about.

It also seems like all evictions will now be done from the file LRU,
so some heuristics (like workingset) may need to be updated
accordingly.

>
> > All in all, I am not saying it's a bad idea or not possible, I am just
> > saying it's probably more complicated than MADV_FREE, and adding more
> > cases where pages move between LRUs could introduce problems (or make
> > existing problems more visible).
>
> Do we want to start adding filtered anon scan for a certain type of
> pages? Because this is the question here AFAICS. This might seem an
> easier solution but I would argue that it is less predictable one.
> It is not unusual that a huge anon LRU would contain only very few LRU
> pages.

I agree that it may be a problem in some situations.

>
> That being said, I might be missing some obvious or less obvious reasons
> why this is completely bad idea. Swapcache is indeed subtle.
> --
> Michal Hocko
> SUSE Labs
Huang, Ying Nov. 23, 2023, 6:15 a.m. UTC | #11
Michal Hocko <mhocko@suse.com> writes:

> On Wed 22-11-23 02:39:15, Yosry Ahmed wrote:
>> On Wed, Nov 22, 2023 at 2:09 AM Michal Hocko <mhocko@suse.com> wrote:
>> >
>> > On Wed 22-11-23 09:52:42, Michal Hocko wrote:
>> > > On Tue 21-11-23 22:44:32, Yosry Ahmed wrote:
>> > > > On Tue, Nov 21, 2023 at 10:41 PM Liu Shixin <liushixin2@huawei.com> wrote:
>> > > > >
>> > > > >
>> > > > > On 2023/11/21 21:00, Michal Hocko wrote:
>> > > > > > On Tue 21-11-23 17:06:24, Liu Shixin wrote:
>> > > > > >
>> > > > > > However, in swapcache_only mode, the scan count still increased when scan
>> > > > > > non-swapcache pages because there are large number of non-swapcache pages
>> > > > > > and rare swapcache pages in swapcache_only mode, and if the non-swapcache
>> > > > > > is skipped and do not count, the scan of pages in isolate_lru_folios() can
>> > > > > > eventually lead to hung task, just as Sachin reported [2].
>> > > > > > I find this paragraph really confusing! I guess what you meant to say is
>> > > > > > that a real swapcache_only is problematic because it can end up not
>> > > > > > making any progress, correct?
>> > > > > This paragraph is going to explain why checking swapcache_only after scan += nr_pages;
>> > > > > >
>> > > > > > AFAIU you have addressed that problem by making swapcache_only anon LRU
>> > > > > > specific, right? That would be certainly more robust as you can still
>> > > > > > reclaim from file LRUs. I cannot say I like that because swapcache_only
>> > > > > > is a bit confusing and I do not think we want to grow more special
>> > > > > > purpose reclaim types. Would it be possible/reasonable to instead put
>> > > > > > swapcache pages on the file LRU instead?
>> > > > > It looks like a good idea, but I'm not sure if it's possible. I can try it, is there anything to
>> > > > > pay attention to?
>> > > >
>> > > > I think this might be more intrusive than we think. Every time a page
>> > > > is added to or removed from the swap cache, we will need to move it
>> > > > between LRUs. All pages on the anon LRU will need to go through the
>> > > > file LRU before being reclaimed. I think this might be too big of a
>> > > > change to achieve this patch's goal.
>> > >
>> > > TBH I am not really sure how complex that might turn out to be.
>> > > Swapcache tends to be full of subtle issues. So you might be right but
>> > > it would be better to know _why_ this is not possible before we end up
>> > > phising for couple of swapcache pages on potentially huge anon LRU to
>> > > isolate them. Think of TB sized machines in this context.
>> >
>> > Forgot to mention that it is not really far fetched from comparing this
>> > to MADV_FREE pages. Those are anonymous but we do not want to keep them
>> > on anon LRU because we want to age them indepdendent on the swap
>> > availability as they are just dropped during reclaim. Not too much
>> > different from swapcache pages. There are more constrains on those but
>> > fundamentally this is the same problem, no?
>> 
>> I agree it's not a first, but swap cache pages are more complicated
>> because they can go back and forth, unlike MADV_FREE pages which
>> usually go on a one way ticket AFAICT.
>
> Yes swapcache pages are indeed more complicated but most of the time
> they just go away as well, no?

When we swap in a page, we put it in the swapcache too.  And the page
can stay in that state for a long time if there is more than 50% free
space in the swap device.

> MADV_FREE can be reinitiated if they are
> written as well. So fundamentally they are not that different.
>

[snip]

--
Best Regards,
Huang, Ying
Chris Li Nov. 23, 2023, 5:19 p.m. UTC | #12
Hi Shixin,

On Tue, Nov 21, 2023 at 12:08 AM Liu Shixin <liushixin2@huawei.com> wrote:
>
> When spaces of swap devices are exhausted, only file pages can be
> reclaimed.  But there are still some swapcache pages in anon lru list.
> This can lead to a premature out-of-memory.
>
> The problem is found with such step:
>
>  Firstly, set a 9MB disk swap space, then create a cgroup with 10MB
>  memory limit, then runs an program to allocates about 15MB memory.
>
> The problem occurs occasionally, which may need about 100 times [1].

Just out of curiosity, in your use case, how much additional
memory in terms of pages or MB can be freed by this patch, using the
current code as the baseline?
For the swapcache pages reclaimed in swapcache_only mode, have all their
swap counts dropped to zero, and is the only reason they stay in the swap
cache to avoid a page IO write if we need to swap those pages out again?

> Fix it by checking number of swapcache pages in can_reclaim_anon_pages().
> If the number is not zero, return true and set swapcache_only to 1.
> When scan anon lru list in swapcache_only mode, non-swapcache pages will
> be skipped to isolate in order to accelerate reclaim efficiency.

Here you said non-swapcache pages will be skipped if swapcache_only == 1.

>
> However, in swapcache_only mode, the scan count still increased when scan
> non-swapcache pages because there are large number of non-swapcache pages
> and rare swapcache pages in swapcache_only mode, and if the non-swapcache

Here you suggest non-swapcache pages will also be scanned even when
swapcache_only == 1. It seems to contradict what you said above. I
feel that I am missing something here.

> is skipped and do not count, the scan of pages in isolate_lru_folios() can

Can you clarify which "scan of pages" you mean: are those pages
swapcache pages or non-swapcache pages?

> eventually lead to hung task, just as Sachin reported [2].
>
> By the way, since there are enough times of memory reclaim before OOM, it
> is not need to isolate too much swapcache pages in one times.
>
> [1]. https://lore.kernel.org/lkml/CAJD7tkZAfgncV+KbKr36=eDzMnT=9dZOT0dpMWcurHLr6Do+GA@mail.gmail.com/
> [2]. https://lore.kernel.org/linux-mm/CAJD7tkafz_2XAuqE8tGLPEcpLngewhUo=5US14PAtSM9tLBUQg@mail.gmail.com/
>
> Signed-off-by: Liu Shixin <liushixin2@huawei.com>
> Tested-by: Yosry Ahmed <yosryahmed@google.com>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> ---
> v9->v10: Use per-node swapcache suggested by Yu Zhao.
> v8->v9: Move the swapcache check after can_demote() and refector
>         can_reclaim_anon_pages() a bit.
> v7->v8: Reset swapcache_only at the beginning of can_reclaim_anon_pages().
> v6->v7: Reset swapcache_only to zero after there are swap spaces.
> v5->v6: Fix NULL pointing derefence and hung task problem reported by Sachin.
>
>  mm/vmscan.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 506f8220c5fe..1fcc94717370 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -136,6 +136,9 @@ struct scan_control {
>         /* Always discard instead of demoting to lower tier memory */
>         unsigned int no_demotion:1;
>
> +       /* Swap space is exhausted, only reclaim swapcache for anon LRU */
> +       unsigned int swapcache_only:1;
> +
>         /* Allocation order */
>         s8 order;
>
> @@ -308,10 +311,36 @@ static bool can_demote(int nid, struct scan_control *sc)
>         return true;
>  }
>
> +#ifdef CONFIG_SWAP
> +static bool can_reclaim_swapcache(struct mem_cgroup *memcg, int nid)
> +{
> +       struct pglist_data *pgdat = NODE_DATA(nid);
> +       unsigned long nr_swapcache;
> +
> +       if (!memcg) {
> +               nr_swapcache = node_page_state(pgdat, NR_SWAPCACHE);
> +       } else {
> +               struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +
> +               nr_swapcache = lruvec_page_state_local(lruvec, NR_SWAPCACHE);
> +       }
> +
> +       return nr_swapcache > 0;
> +}
> +#else
> +static bool can_reclaim_swapcache(struct mem_cgroup *memcg, int nid)
> +{
> +       return false;
> +}
> +#endif
> +
>  static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
>                                           int nid,
>                                           struct scan_control *sc)
>  {
> +       if (sc)
> +               sc->swapcache_only = 0;
> +

Minor nitpick: sc->swapcache_only is first set to 0 and then later set
to 1. Better to use a local variable and then write to sc->swapcache_only
in one go. If more than one thread accessed the scan_control, the
threads could see the flicker of the 0->1 change.  I don't think that is
the case in our current code, since sc is created on the stack. There are
other minor benefits as well: the "if (sc)" test only needs to be done
once, and there is a single store instruction.
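
A minimal sketch of that refactor (illustrative only, reusing the
helpers from the quoted hunks) could look like:

	static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
						  int nid,
						  struct scan_control *sc)
	{
		bool swapcache_only = false;
		bool can_reclaim;

		if (memcg == NULL) {
			/* For non-memcg reclaim, is there swap space? */
			can_reclaim = get_nr_swap_pages() > 0;
		} else {
			/* Is the memcg below its swap limit? */
			can_reclaim = mem_cgroup_get_nr_swap_pages(memcg) > 0;
		}

		if (!can_reclaim && can_demote(nid, sc))
			can_reclaim = true;

		if (!can_reclaim && can_reclaim_swapcache(memcg, nid)) {
			swapcache_only = true;
			can_reclaim = true;
		}

		/* One "if (sc)" test, one store, no 0->1 flicker. */
		if (sc)
			sc->swapcache_only = swapcache_only;

		return can_reclaim;
	}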

Chris

>         if (memcg == NULL) {
>                 /*
>                  * For non-memcg reclaim, is there
> @@ -330,7 +359,17 @@ static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
>          *
>          * Can it be reclaimed from this node via demotion?
>          */
> -       return can_demote(nid, sc);
> +       if (can_demote(nid, sc))
> +               return true;
> +
> +       /* Is there any swapcache pages to reclaim in this node? */
> +       if (can_reclaim_swapcache(memcg, nid)) {
> +               if (sc)
> +                       sc->swapcache_only = 1;
> +               return true;
> +       }
> +
> +       return false;
>  }
>
>  /*
> @@ -1642,6 +1681,15 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
>                  */
>                 scan += nr_pages;
>
> +               /*
> +                * Count non-swapcache too because the swapcache pages may
> +                * be rare and it takes too much times here if not count
> +                * the non-swapcache pages.
> +                */
> +               if (unlikely(sc->swapcache_only && !is_file_lru(lru) &&
> +                   !folio_test_swapcache(folio)))
> +                       goto move;
> +
>                 if (!folio_test_lru(folio))
>                         goto move;
>                 if (!sc->may_unmap && folio_mapped(folio))
> --
> 2.25.1
>
>
Chris Li Nov. 23, 2023, 5:30 p.m. UTC | #13
On Tue, Nov 21, 2023 at 5:00 AM Michal Hocko <mhocko@suse.com> wrote:
> > However, in swapcache_only mode, the scan count still increased when scan
> > non-swapcache pages because there are large number of non-swapcache pages
> > and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> > is skipped and do not count, the scan of pages in isolate_lru_folios() can
> > eventually lead to hung task, just as Sachin reported [2].
>
> I find this paragraph really confusing! I guess what you meant to say is
> that a real swapcache_only is problematic because it can end up not
> making any progress, correct?
>
> AFAIU you have addressed that problem by making swapcache_only anon LRU
> specific, right? That would be certainly more robust as you can still
> reclaim from file LRUs. I cannot say I like that because swapcache_only
> is a bit confusing and I do not think we want to grow more special

That is my feeling as well. I don't like having too many special-purpose
modes either. It makes the whole process much harder to reason about.
The comment seems to suggest it is not effective in some situations. I
am wondering if we can address that situation more directly without the
special mode. At the same time, I am not very familiar with the reclaim
code path yet. I need to learn more about this problem space to
articulate my thoughts better. I can dig in more, though I might ask a
lot of silly questions.

Chris

> purpose reclaim types. Would it be possible/reasonable to instead put
> swapcache pages on the file LRU instead?
> --
> Michal Hocko
> SUSE Labs
>
Michal Hocko Nov. 24, 2023, 4:30 p.m. UTC | #14
On Thu 23-11-23 14:15:59, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
[...]
> > Yes swapcache pages are indeed more complicated but most of the time
> > they just go away as well, no?
> 
> When we swapin a page, we will put it in swapcache too.  And the page
> can be in that state for long time if there is more than 50% free space
> in the swap device.

True, but why is that a problem? If you encounter such a swapped-in page
on the file LRU then the page should be referenced and as such should be
activated, no?
Huang, Ying Nov. 27, 2023, 2:34 a.m. UTC | #15
Michal Hocko <mhocko@suse.com> writes:

> On Thu 23-11-23 14:15:59, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
> [...]
>> > Yes swapcache pages are indeed more complicated but most of the time
>> > they just go away as well, no?
>> 
>> When we swapin a page, we will put it in swapcache too.  And the page
>> can be in that state for long time if there is more than 50% free space
>> in the swap device.
>
> True, but why is that a problem? If you encounter such a swapped in page
> on the file LRU then the page should be referened and as such should be
> activated, no?

This just means that anonymous pages on the file LRU wouldn't be
temporary or short-term.  So we need to account for that.  For example,
the original method of balancing between anonymous pages and file pages
would need to be redesigned.  Anonymous pages are considered hotter than
file pages in general.

--
Best Regards,
Huang, Ying
Chris Li Nov. 27, 2023, 7:42 a.m. UTC | #16
Hi Ying,

On Sun, Nov 26, 2023 at 6:36 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Michal Hocko <mhocko@suse.com> writes:
>
> > On Thu 23-11-23 14:15:59, Huang, Ying wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
> > [...]
> >> > Yes swapcache pages are indeed more complicated but most of the time
> >> > they just go away as well, no?
> >>
> >> When we swapin a page, we will put it in swapcache too.  And the page
> >> can be in that state for long time if there is more than 50% free space
> >> in the swap device.
> >
> > True, but why is that a problem? If you encounter such a swapped in page
> > on the file LRU then the page should be referened and as such should be
> > activated, no?
>
> This just means that anonymous pages in file LRU aren't temporary or
> short-term.  So we need to consider that.  For example, the original
> method to balance between anonymous pages and file pages need to be
> re-designed.  The anonymous pages are considered hotter than file pages
> in general.

I agree with Ying that anonymous pages typically have different page
access patterns than file pages, so we might want to treat them
differently to reclaim them effectively.
One random idea:
How about we put the anonymous pages that are in the swap cache on a
different LRU than the rest of the anonymous pages? Then shrinking
against those pages in the swap cache would be more effective. Instead
of having [anon, file] LRUs, we would have [anon not in swap cache, anon
in swap cache, file] LRUs.
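
As a rough sketch against a simplified enum lru_list (the real enum
derives its values from LRU_BASE arithmetic; the swap-cache entries here
are hypothetical):

	enum lru_list {
		LRU_INACTIVE_ANON,
		LRU_ACTIVE_ANON,
		LRU_INACTIVE_ANON_SWAPCACHE,	/* hypothetical new list */
		LRU_ACTIVE_ANON_SWAPCACHE,	/* hypothetical new list */
		LRU_INACTIVE_FILE,
		LRU_ACTIVE_FILE,
		LRU_UNEVICTABLE,
		NR_LRU_LISTS
	};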

Chris
Huang, Ying Nov. 27, 2023, 8:11 a.m. UTC | #17
Chris Li <chrisl@kernel.org> writes:

> Hi Ying,
>
> On Sun, Nov 26, 2023 at 6:36 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Michal Hocko <mhocko@suse.com> writes:
>>
>> > On Thu 23-11-23 14:15:59, Huang, Ying wrote:
>> >> Michal Hocko <mhocko@suse.com> writes:
>> > [...]
>> >> > Yes swapcache pages are indeed more complicated but most of the time
>> >> > they just go away as well, no?
>> >>
>> >> When we swapin a page, we will put it in swapcache too.  And the page
>> >> can be in that state for long time if there is more than 50% free space
>> >> in the swap device.
>> >
>> > True, but why is that a problem? If you encounter such a swapped in page
>> > on the file LRU then the page should be referened and as such should be
>> > activated, no?
>>
>> This just means that anonymous pages in file LRU aren't temporary or
>> short-term.  So we need to consider that.  For example, the original
>> method to balance between anonymous pages and file pages need to be
>> re-designed.  The anonymous pages are considered hotter than file pages
>> in general.
>
>  I agree with Ying that anonymous pages typically have different page
> access patterns than file pages, so we might want to treat them
> differently to reclaim them effectively.
> One random idea:
> How about we put the anonymous page in a swap cache in a different LRU
> than the rest of the anonymous pages. Then shrinking against those
> pages in the swap cache would be more effective.Instead of having
> [anon, file] LRU, now we have [anon not in swap cache, anon in swap
> cache, file] LRU

I don't think that is necessary.  The patch is only for a special use
case, where the swap device is used up while some pages are still in the
swap cache.  The patch will hurt performance, but it is meant only to
avoid OOM, not to improve performance.  Per my understanding, we will not
use up swap device space in most cases.  Using up swap space may happen
with ZRAM, but will we keep pages in the swap cache for long when we use
ZRAM?

--
Best Regards,
Huang, Ying
Chris Li Nov. 27, 2023, 8:22 a.m. UTC | #18
On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
> >  I agree with Ying that anonymous pages typically have different page
> > access patterns than file pages, so we might want to treat them
> > differently to reclaim them effectively.
> > One random idea:
> > How about we put the anonymous page in a swap cache in a different LRU
> > than the rest of the anonymous pages. Then shrinking against those
> > pages in the swap cache would be more effective.Instead of having
> > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
> > cache, file] LRU
>
> I don't think that it is necessary.  The patch is only for a special use
> case.  Where the swap device is used up while some pages are in swap
> cache.  The patch will kill performance, but it is used to avoid OOM
> only, not to improve performance.  Per my understanding, we will not use
> up swap device space in most cases.  This may be true for ZRAM, but will
> we keep pages in swap cache for long when we use ZRAM?

I asked the question regarding how many pages can be freed by this patch
in this email thread as well, but haven't gotten an answer from the
author yet. That is one important aspect in evaluating how valuable the
patch is.
Regarding running out of swap space, that is a good point; in server
workloads we don't typically run out of swap device space anyway.

Android uses ZRAM, so the story might be different. Adding Minchan here.

Chris
Michal Hocko Nov. 27, 2023, 9:10 a.m. UTC | #19
On Mon 27-11-23 10:34:46, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Thu 23-11-23 14:15:59, Huang, Ying wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
> > [...]
> >> > Yes swapcache pages are indeed more complicated but most of the time
> >> > they just go away as well, no?
> >> 
> >> When we swapin a page, we will put it in swapcache too.  And the page
> >> can be in that state for long time if there is more than 50% free space
> >> in the swap device.
> >
> > True, but why is that a problem? If you encounter such a swapped in page
> > on the file LRU then the page should be referened and as such should be
> > activated, no?
> 
> This just means that anonymous pages in file LRU aren't temporary or
> short-term.  So we need to consider that.

Right. On the other hand we could be more aggressive when dropping the
swapcache. Is there any actual reason why we cannot try to folio_free_swap
even when mem_cgroup_swap_full() returns false?
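
For reference, this presumably refers to the check on the activation
path of shrink_folio_list(), paraphrased from mm/vmscan.c:

	activate_locked:
		/* Not a candidate for swapping, so reclaim swap space. */
		if (folio_test_swapcache(folio) &&
		    (mem_cgroup_swap_full(folio) || folio_test_mlocked(folio)))
			folio_free_swap(folio);

Dropping the mem_cgroup_swap_full() condition there would free the swap
slot (and drop the swapcache state) more aggressively.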
Minchan Kim Nov. 27, 2023, 9:31 p.m. UTC | #20
On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
> On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
> > >  I agree with Ying that anonymous pages typically have different page
> > > access patterns than file pages, so we might want to treat them
> > > differently to reclaim them effectively.
> > > One random idea:
> > > How about we put the anonymous page in a swap cache in a different LRU
> > > than the rest of the anonymous pages. Then shrinking against those
> > > pages in the swap cache would be more effective.Instead of having
> > > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
> > > cache, file] LRU
> >
> > I don't think that it is necessary.  The patch is only for a special use
> > case.  Where the swap device is used up while some pages are in swap
> > cache.  The patch will kill performance, but it is used to avoid OOM
> > only, not to improve performance.  Per my understanding, we will not use
> > up swap device space in most cases.  This may be true for ZRAM, but will
> > we keep pages in swap cache for long when we use ZRAM?
> 
> I ask the question regarding how many pages can be freed by this patch
> in this email thread as well, but haven't got the answer from the
> author yet. That is one important aspect to evaluate how valuable is
> that patch.

Exactly. Since the swap cache has a different lifetime than the page
cache, swap cache entries are usually dropped when pages are unmapped
(unless they are shared with others, but anon pages are usually
exclusively private), so I wonder how much memory we can save.

> Regarding running out of swap space. That is a good point, in server
> workload we don't typically run out of swap device space anyway.
> 
> Android uses ZRAM, the story might be different. Adding Minchan here.

Swap is usually almost full on Android since it compacts (i.e., swaps
out) background apps aggressively.
Yosry Ahmed Nov. 27, 2023, 9:56 p.m. UTC | #21
On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
> > > >  I agree with Ying that anonymous pages typically have different page
> > > > access patterns than file pages, so we might want to treat them
> > > > differently to reclaim them effectively.
> > > > One random idea:
> > > > How about we put the anonymous page in a swap cache in a different LRU
> > > > than the rest of the anonymous pages. Then shrinking against those
> > > > pages in the swap cache would be more effective.Instead of having
> > > > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
> > > > cache, file] LRU
> > >
> > > I don't think that it is necessary.  The patch is only for a special use
> > > case.  Where the swap device is used up while some pages are in swap
> > > cache.  The patch will kill performance, but it is used to avoid OOM
> > > only, not to improve performance.  Per my understanding, we will not use
> > > up swap device space in most cases.  This may be true for ZRAM, but will
> > > we keep pages in swap cache for long when we use ZRAM?
> >
> > I ask the question regarding how many pages can be freed by this patch
> > in this email thread as well, but haven't got the answer from the
> > author yet. That is one important aspect to evaluate how valuable is
> > that patch.
>
> Exactly. Since swap cache has different life time with page cache, they
> would be usually dropped when pages are unmapped(unless they are shared
> with others but anon is usually exclusive private) so I wonder how much
> memory we can save.

I think the point of this patch is not saving memory, but rather
avoiding an OOM condition that will happen if we have no swap space
left, but some pages left in the swap cache. Of course, the OOM
avoidance will come at the cost of extra work in reclaim to swap those
pages out.

The only case where I think this might be harmful is if there are plenty
of pages to reclaim on the file LRU, and instead we opt to chase down
the few swap cache pages. So perhaps we can add a check to only set
sc->swapcache_only if the number of pages in the swap cache is more
than the number of pages on the file LRU or similar? Just make sure we
don't chase the swapcache pages down if there's plenty to scan on the
file LRU?
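
A sketch of such a gate (illustrative only; the helper name and the
threshold are made up) could be:

	/*
	 * Only chase swapcache pages when the swap cache is big enough
	 * relative to the file LRU to be worth the extra scanning.
	 */
	static bool swapcache_worth_chasing(struct lruvec *lruvec)
	{
		unsigned long nr_swapcache, nr_file;

		nr_swapcache = lruvec_page_state_local(lruvec, NR_SWAPCACHE);
		nr_file = lruvec_page_state_local(lruvec, NR_INACTIVE_FILE) +
			  lruvec_page_state_local(lruvec, NR_ACTIVE_FILE);

		return nr_swapcache > nr_file;
	}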

>
> > Regarding running out of swap space. That is a good point, in server
> > workload we don't typically run out of swap device space anyway.
> >
> > Android uses ZRAM, the story might be different. Adding Minchan here.
>
> Swap is usually almost full in Android since it compacts(i.e., swapout)
> background apps aggressively.
Huang, Ying Nov. 28, 2023, 1:31 a.m. UTC | #22
Michal Hocko <mhocko@suse.com> writes:

> On Mon 27-11-23 10:34:46, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Thu 23-11-23 14:15:59, Huang, Ying wrote:
>> >> Michal Hocko <mhocko@suse.com> writes:
>> > [...]
>> >> > Yes swapcache pages are indeed more complicated but most of the time
>> >> > they just go away as well, no?
>> >> 
>> >> When we swapin a page, we will put it in swapcache too.  And the page
>> >> can be in that state for long time if there is more than 50% free space
>> >> in the swap device.
>> >
>> > True, but why is that a problem? If you encounter such a swapped in page
>> > on the file LRU then the page should be referened and as such should be
>> > activated, no?
>> 
>> This just means that anonymous pages in file LRU aren't temporary or
>> short-term.  So we need to consider that.
>
> Right. On the other hand we could be more aggressive when dropping the
> swapcache. Is there any actual reason why we cannot try to folio_free_swap
> even when mem_cgroup_swap_full == F?

If there is plenty of free space in the swap device, why not take
advantage of it?  If the page is in the swap cache and clean, we can
discard it when it is reclaimed, or avoid allocating a swap entry for it.

--
Best Regards,
Huang, Ying
Liu Shixin Nov. 28, 2023, 1:59 a.m. UTC | #23
On 2023/11/24 1:19, Chris Li wrote:
> Hi Shixin,
>
> On Tue, Nov 21, 2023 at 12:08 AM Liu Shixin <liushixin2@huawei.com> wrote:
>> When spaces of swap devices are exhausted, only file pages can be
>> reclaimed.  But there are still some swapcache pages in anon lru list.
>> This can lead to a premature out-of-memory.
>>
>> The problem is found with such step:
>>
>>  Firstly, set a 9MB disk swap space, then create a cgroup with 10MB
>>  memory limit, then runs an program to allocates about 15MB memory.
>>
>> The problem occurs occasionally, which may need about 100 times [1].
> Just out of my curiosity, in your usage case, how much additional
> memory in terms of pages or MB can be freed by this patch, using
> current code as base line?
My test case runs in a memory cgroup with a memory limit of 10MB; the
memory that can be freed is only about 5MB.
> Does the swap cache page reclaimed in swapcache_only mode have its swap
> count dropped to zero, and is the only reason to stay in the swap cache
> to avoid the page IO write if we need to swap that page out again?
Yes.
>
>> Fix it by checking number of swapcache pages in can_reclaim_anon_pages().
>> If the number is not zero, return true and set swapcache_only to 1.
>> When scan anon lru list in swapcache_only mode, non-swapcache pages will
>> be skipped to isolate in order to accelerate reclaim efficiency.
> Here you said non-swapcache will be skipped if swapcache_only == 1
>
>> However, in swapcache_only mode, the scan count still increased when scan
>> non-swapcache pages because there are large number of non-swapcache pages
>> and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> Here you suggest non-swapcache pages will also be scanned even when
> swapcache_only == 1. It seems to contradict what you said above. I
> feel that I am missing something here.
The swapcache pages and non-swapcache pages are both on the anon LRU, so when
scanning anon pages, the non-swapcache pages will be scanned too. In
isolate_lru_folios(), if we put the non-swapcache pages on the folios_skipped
list without counting them, the scan of the anon list will keep running until
it finds enough swapcache pages, which wastes too much time. To avoid that
problem, we have to stop after scanning enough anon pages, even if we haven't
isolated enough swapcache pages.
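
To make the termination argument concrete, here is a condensed,
hypothetical view of the loop (the real hunk is quoted further down;
names simplified):

static unsigned long isolate_sketch(struct list_head *src,
                                    struct list_head *dst,
                                    unsigned long nr_to_scan,
                                    struct scan_control *sc,
                                    enum lru_list lru)
{
        unsigned long scan = 0;

        while (scan < nr_to_scan && !list_empty(src)) {
                struct folio *folio = lru_to_folio(src);

                /*
                 * Counting every scanned folio bounds the loop.  If
                 * skipped folios were not counted, an anon LRU with
                 * almost no swapcache folios would keep the loop
                 * spinning until it found enough of them -- the hung
                 * task Sachin reported.
                 */
                scan += folio_nr_pages(folio);

                if (sc->swapcache_only && !is_file_lru(lru) &&
                    !folio_test_swapcache(folio)) {
                        list_move(&folio->lru, src);    /* skip, keep on LRU */
                        continue;
                }

                list_move(&folio->lru, dst);            /* isolate */
        }

        return scan;
}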
>
>> is skipped and do not count, the scan of pages in isolate_lru_folios() can
> Can you clarify which "scan of pages", are those pages swapcache pages
> or non-swapcache pages?
I mean the scan of anon pages, including both swapcache pages and non-swapcache pages.
>
>> eventually lead to hung task, just as Sachin reported [2].
>>
>> By the way, since there are enough times of memory reclaim before OOM, it
>> is not need to isolate too much swapcache pages in one times.
>>
>> [1]. https://lore.kernel.org/lkml/CAJD7tkZAfgncV+KbKr36=eDzMnT=9dZOT0dpMWcurHLr6Do+GA@mail.gmail.com/
>> [2]. https://lore.kernel.org/linux-mm/CAJD7tkafz_2XAuqE8tGLPEcpLngewhUo=5US14PAtSM9tLBUQg@mail.gmail.com/
>>
>> Signed-off-by: Liu Shixin <liushixin2@huawei.com>
>> Tested-by: Yosry Ahmed <yosryahmed@google.com>
>> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
>> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
>> ---
>> v9->v10: Use per-node swapcache suggested by Yu Zhao.
>> v8->v9: Move the swapcache check after can_demote() and refector
>>         can_reclaim_anon_pages() a bit.
>> v7->v8: Reset swapcache_only at the beginning of can_reclaim_anon_pages().
>> v6->v7: Reset swapcache_only to zero after there are swap spaces.
>> v5->v6: Fix NULL pointing derefence and hung task problem reported by Sachin.
>>
>>  mm/vmscan.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 49 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 506f8220c5fe..1fcc94717370 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -136,6 +136,9 @@ struct scan_control {
>>         /* Always discard instead of demoting to lower tier memory */
>>         unsigned int no_demotion:1;
>>
>> +       /* Swap space is exhausted, only reclaim swapcache for anon LRU */
>> +       unsigned int swapcache_only:1;
>> +
>>         /* Allocation order */
>>         s8 order;
>>
>> @@ -308,10 +311,36 @@ static bool can_demote(int nid, struct scan_control *sc)
>>         return true;
>>  }
>>
>> +#ifdef CONFIG_SWAP
>> +static bool can_reclaim_swapcache(struct mem_cgroup *memcg, int nid)
>> +{
>> +       struct pglist_data *pgdat = NODE_DATA(nid);
>> +       unsigned long nr_swapcache;
>> +
>> +       if (!memcg) {
>> +               nr_swapcache = node_page_state(pgdat, NR_SWAPCACHE);
>> +       } else {
>> +               struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>> +
>> +               nr_swapcache = lruvec_page_state_local(lruvec, NR_SWAPCACHE);
>> +       }
>> +
>> +       return nr_swapcache > 0;
>> +}
>> +#else
>> +static bool can_reclaim_swapcache(struct mem_cgroup *memcg, int nid)
>> +{
>> +       return false;
>> +}
>> +#endif
>> +
>>  static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
>>                                           int nid,
>>                                           struct scan_control *sc)
>>  {
>> +       if (sc)
>> +               sc->swapcache_only = 0;
>> +
> Minor nitpick. The sc->swapcache_only is first set to 0 then later set
> to 1. Better use a local variable then write to sc->swapcache_only in
> one go. If the scan_control has more than one thread accessing it, the
> threads can see the flicker of 0->1 change.  I don't think that is the
> case in our current code, sc is created on stack. There are other
> minor benefits as The "if (sc) test" only needs to be done once, one
> store instruction.
>
> Chris
Thanks for your advice. If we finally decide to use this patch, I will revise it.
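A minimal sketch of that revision (untested, on top of the hunks below):
compute the mode in a local variable and store it once, so another
reader of sc never observes the transient 0.

static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
                                          int nid,
                                          struct scan_control *sc)
{
        bool swapcache_only = false;
        bool ret;

        if (memcg == NULL)
                ret = get_nr_swap_pages() > 0;
        else
                ret = mem_cgroup_get_nr_swap_pages(memcg) > 0;

        if (!ret && can_demote(nid, sc))
                ret = true;

        if (!ret && can_reclaim_swapcache(memcg, nid)) {
                swapcache_only = true;
                ret = true;
        }

        /* Single store, and the "if (sc)" test is done once. */
        if (sc)
                sc->swapcache_only = swapcache_only;

        return ret;
}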
>>         if (memcg == NULL) {
>>                 /*
>>                  * For non-memcg reclaim, is there
>> @@ -330,7 +359,17 @@ static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
>>          *
>>          * Can it be reclaimed from this node via demotion?
>>          */
>> -       return can_demote(nid, sc);
>> +       if (can_demote(nid, sc))
>> +               return true;
>> +
>> +       /* Is there any swapcache pages to reclaim in this node? */
>> +       if (can_reclaim_swapcache(memcg, nid)) {
>> +               if (sc)
>> +                       sc->swapcache_only = 1;
>> +               return true;
>> +       }
>> +
>> +       return false;
>>  }
>>
>>  /*
>> @@ -1642,6 +1681,15 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
>>                  */
>>                 scan += nr_pages;
>>
>> +               /*
>> +                * Count non-swapcache too because the swapcache pages may
>> +                * be rare and it takes too much times here if not count
>> +                * the non-swapcache pages.
>> +                */
>> +               if (unlikely(sc->swapcache_only && !is_file_lru(lru) &&
>> +                   !folio_test_swapcache(folio)))
>> +                       goto move;
>> +
>>                 if (!folio_test_lru(folio))
>>                         goto move;
>>                 if (!sc->may_unmap && folio_mapped(folio))
>> --
>> 2.25.1
>>
>>
> .
>
Huang, Ying Nov. 28, 2023, 3:19 a.m. UTC | #24
Yosry Ahmed <yosryahmed@google.com> writes:

> On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
>>
>> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
>> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
>> > > >  I agree with Ying that anonymous pages typically have different page
>> > > > access patterns than file pages, so we might want to treat them
>> > > > differently to reclaim them effectively.
>> > > > One random idea:
>> > > > How about we put the anonymous page in a swap cache in a different LRU
>> > > > than the rest of the anonymous pages. Then shrinking against those
>> > > > pages in the swap cache would be more effective.Instead of having
>> > > > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
>> > > > cache, file] LRU
>> > >
>> > > I don't think that it is necessary.  The patch is only for a special use
>> > > case.  Where the swap device is used up while some pages are in swap
>> > > cache.  The patch will kill performance, but it is used to avoid OOM
>> > > only, not to improve performance.  Per my understanding, we will not use
>> > > up swap device space in most cases.  This may be true for ZRAM, but will
>> > > we keep pages in swap cache for long when we use ZRAM?
>> >
>> > I ask the question regarding how many pages can be freed by this patch
>> > in this email thread as well, but haven't got the answer from the
>> > author yet. That is one important aspect to evaluate how valuable is
>> > that patch.
>>
>> Exactly. Since swap cache has different life time with page cache, they
>> would be usually dropped when pages are unmapped(unless they are shared
>> with others but anon is usually exclusive private) so I wonder how much
>> memory we can save.
>
> I think the point of this patch is not saving memory, but rather
> avoiding an OOM condition that will happen if we have no swap space
> left, but some pages left in the swap cache. Of course, the OOM
> avoidance will come at the cost of extra work in reclaim to swap those
> pages out.
>
> The only case where I think this might be harmful is if there's plenty
> of pages to reclaim on the file LRU, and instead we opt to chase down
> the few swap cache pages. So perhaps we can add a check to only set
> sc->swapcache_only if the number of pages in the swap cache is more
> than the number of pages on the file LRU or similar? Just make sure we
> don't chase the swapcache pages down if there's plenty to scan on the
> file LRU?

The swap cache pages can be divided into 3 groups.

- group 1: pages that have been written out, at the tail of the inactive
  LRU, but not reclaimed yet.

- group 2: pages that have been written out, but failed to be reclaimed
  (e.g., were accessed before reclaiming)

- group 3: pages that have been swapped in, but were kept in the swap
  cache.  The pages may be in the active LRU.

The main target of the original patch should be group 1.  And those
pages may be cheaper to reclaim than file pages.

Group 2 is hard to reclaim if swap_count() isn't 0.

Group 3 should be reclaimed in theory, but the overhead may be high.
And we may need to reclaim the swap entries instead of the pages if the
pages are hot.  But we can start to reclaim the swap entries before the
swap space runs out.

So, if we can count group 1, we may use that as an indicator to scan anon
pages.  And we may add code to reclaim group 3 earlier.
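
A folio-flag sketch of how group 1 might be counted (hypothetical
helper; note that group 2 and group 3 folios both look like clean,
mapped swapcache folios, so flags alone cannot tell them apart -- which
is part of why identifying group 3 is expensive):

/* Sketch only: in the swap cache, written (or being written) out,
 * and no longer mapped -- i.e. waiting at the LRU tail for reclaim. */
static bool is_group1_swapcache(struct folio *folio)
{
        return folio_test_swapcache(folio) &&
               !folio_test_dirty(folio) &&
               !folio_mapped(folio);
}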

>> > Regarding running out of swap space. That is a good point, in server
>> > workload we don't typically run out of swap device space anyway.

Think about this again.  In a server workload, if we set a swap usage
limit for a memcg, we may run out of that limit.  Is it common for a
server workload to run out of the memcg's swap usage limit?

>> > Android uses ZRAM, the story might be different. Adding Minchan here.
>>
>> Swap is usually almost full in Android since it compacts(i.e., swapout)
>> background apps aggressively.

If my understanding is correct, because ZRAM has SWP_SYNCHRONOUS_IO
set, anonymous pages will only be put in the swap cache temporarily
during swap-out.  So, the remaining swap cache pages in the anon LRU
should not be a problem for ZRAM.
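
That is, roughly this path in do_swap_page() (condensed from memory,
with the surrounding declarations elided, so details may be off): for a
SWP_SYNCHRONOUS_IO device such as ZRAM, an exclusive entry is read into
a private folio that never enters the swap cache.

        if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
            __swap_count(entry) == 1) {
                /* skip the swap cache: allocate and read synchronously */
                folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
                                        vma, vmf->address, false);
                /* ... read the entry directly into the folio ... */
        } else {
                /* swapin_readahead(): the folio goes through the swap
                 * cache and may linger there after the fault */
        }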

--
Best Regards,
Huang, Ying
Yosry Ahmed Nov. 28, 2023, 3:27 a.m. UTC | #25
On Mon, Nov 27, 2023 at 7:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
> >>
> >> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
> >> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> > > >  I agree with Ying that anonymous pages typically have different page
> >> > > > access patterns than file pages, so we might want to treat them
> >> > > > differently to reclaim them effectively.
> >> > > > One random idea:
> >> > > > How about we put the anonymous page in a swap cache in a different LRU
> >> > > > than the rest of the anonymous pages. Then shrinking against those
> >> > > > pages in the swap cache would be more effective.Instead of having
> >> > > > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
> >> > > > cache, file] LRU
> >> > >
> >> > > I don't think that it is necessary.  The patch is only for a special use
> >> > > case.  Where the swap device is used up while some pages are in swap
> >> > > cache.  The patch will kill performance, but it is used to avoid OOM
> >> > > only, not to improve performance.  Per my understanding, we will not use
> >> > > up swap device space in most cases.  This may be true for ZRAM, but will
> >> > > we keep pages in swap cache for long when we use ZRAM?
> >> >
> >> > I ask the question regarding how many pages can be freed by this patch
> >> > in this email thread as well, but haven't got the answer from the
> >> > author yet. That is one important aspect to evaluate how valuable is
> >> > that patch.
> >>
> >> Exactly. Since swap cache has different life time with page cache, they
> >> would be usually dropped when pages are unmapped(unless they are shared
> >> with others but anon is usually exclusive private) so I wonder how much
> >> memory we can save.
> >
> > I think the point of this patch is not saving memory, but rather
> > avoiding an OOM condition that will happen if we have no swap space
> > left, but some pages left in the swap cache. Of course, the OOM
> > avoidance will come at the cost of extra work in reclaim to swap those
> > pages out.
> >
> > The only case where I think this might be harmful is if there's plenty
> > of pages to reclaim on the file LRU, and instead we opt to chase down
> > the few swap cache pages. So perhaps we can add a check to only set
> > sc->swapcache_only if the number of pages in the swap cache is more
> > than the number of pages on the file LRU or similar? Just make sure we
> > don't chase the swapcache pages down if there's plenty to scan on the
> > file LRU?
>
> The swap cache pages can be divided to 3 groups.
>
> - group 1: pages have been written out, at the tail of inactive LRU, but
>   not reclaimed yet.
>
> - group 2: pages have been written out, but were failed to be reclaimed
>   (e.g., were accessed before reclaiming)
>
> - group 3: pages have been swapped in, but were kept in swap cache.  The
>   pages may be in active LRU.
>
> The main target of the original patch should be group 1.  And the pages
> may be cheaper to reclaim than file pages.
>
> Group 2 are hard to be reclaimed if swap_count() isn't 0.
>
> Group 3 should be reclaimed in theory, but the overhead may be high.
> And we may need to reclaim the swap entries instead of pages if the pages
> are hot.  But we can start to reclaim the swap entries before the swap
> space is run out.
>
> So, if we can count group 1, we may use that as indicator to scan anon
> pages.  And we may add code to reclaim group 3 earlier.
>

My point was not that reclaiming the pages in the swap cache is more
expensive than reclaiming the pages in the file LRU. In a lot of
cases, as you point out, the pages in the swap cache can just be
dropped, so they may be as cheap or cheaper to reclaim than the pages
in the file LRU.

My point was that scanning the anon LRU when swap space is exhausted
to get to the pages in the swap cache may be much more expensive,
because there may be a lot of pages on the anon LRU that are not in
the swap cache, and hence are not reclaimable, unlike pages in the
file LRU, which should mostly be reclaimable.

So what I am saying is that maybe we should not do the effort of
scanning the anon LRU in the swapcache_only case unless there aren't a
lot of pages to reclaim on the file LRU (relatively). For example, if
we have 100 pages in the swap cache out of 10000 pages in the anon
LRU, and there are 10000 pages in the file LRU, it's probably not
worth scanning the anon LRU.
Huang, Ying Nov. 28, 2023, 4:03 a.m. UTC | #26
Yosry Ahmed <yosryahmed@google.com> writes:

> On Mon, Nov 27, 2023 at 7:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
>> >>
>> >> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
>> >> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> > > >  I agree with Ying that anonymous pages typically have different page
>> >> > > > access patterns than file pages, so we might want to treat them
>> >> > > > differently to reclaim them effectively.
>> >> > > > One random idea:
>> >> > > > How about we put the anonymous page in a swap cache in a different LRU
>> >> > > > than the rest of the anonymous pages. Then shrinking against those
>> >> > > > pages in the swap cache would be more effective.Instead of having
>> >> > > > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
>> >> > > > cache, file] LRU
>> >> > >
>> >> > > I don't think that it is necessary.  The patch is only for a special use
>> >> > > case.  Where the swap device is used up while some pages are in swap
>> >> > > cache.  The patch will kill performance, but it is used to avoid OOM
>> >> > > only, not to improve performance.  Per my understanding, we will not use
>> >> > > up swap device space in most cases.  This may be true for ZRAM, but will
>> >> > > we keep pages in swap cache for long when we use ZRAM?
>> >> >
>> >> > I ask the question regarding how many pages can be freed by this patch
>> >> > in this email thread as well, but haven't got the answer from the
>> >> > author yet. That is one important aspect to evaluate how valuable is
>> >> > that patch.
>> >>
>> >> Exactly. Since swap cache has different life time with page cache, they
>> >> would be usually dropped when pages are unmapped(unless they are shared
>> >> with others but anon is usually exclusive private) so I wonder how much
>> >> memory we can save.
>> >
>> > I think the point of this patch is not saving memory, but rather
>> > avoiding an OOM condition that will happen if we have no swap space
>> > left, but some pages left in the swap cache. Of course, the OOM
>> > avoidance will come at the cost of extra work in reclaim to swap those
>> > pages out.
>> >
>> > The only case where I think this might be harmful is if there's plenty
>> > of pages to reclaim on the file LRU, and instead we opt to chase down
>> > the few swap cache pages. So perhaps we can add a check to only set
>> > sc->swapcache_only if the number of pages in the swap cache is more
>> > than the number of pages on the file LRU or similar? Just make sure we
>> > don't chase the swapcache pages down if there's plenty to scan on the
>> > file LRU?
>>
>> The swap cache pages can be divided to 3 groups.
>>
>> - group 1: pages have been written out, at the tail of inactive LRU, but
>>   not reclaimed yet.
>>
>> - group 2: pages have been written out, but were failed to be reclaimed
>>   (e.g., were accessed before reclaiming)
>>
>> - group 3: pages have been swapped in, but were kept in swap cache.  The
>>   pages may be in active LRU.
>>
>> The main target of the original patch should be group 1.  And the pages
>> may be cheaper to reclaim than file pages.
>>
>> Group 2 are hard to be reclaimed if swap_count() isn't 0.
>>
>> Group 3 should be reclaimed in theory, but the overhead may be high.
>> And we may need to reclaim the swap entries instead of pages if the pages
>> are hot.  But we can start to reclaim the swap entries before the swap
>> space is run out.
>>
>> So, if we can count group 1, we may use that as indicator to scan anon
>> pages.  And we may add code to reclaim group 3 earlier.
>>
>
> My point was not that reclaiming the pages in the swap cache is more
> expensive that reclaiming the pages in the file LRU. In a lot of
> cases, as you point out, the pages in the swap cache can just be
> dropped, so they may be as cheap or cheaper to reclaim than the pages
> in the file LRU.
>
> My point was that scanning the anon LRU when swap space is exhausted
> to get to the pages in the swap cache may be much more expensive,
> because there may be a lot of pages on the anon LRU that are not in
> the swap cache, and hence are not reclaimable, unlike pages in the
> file LRU, which should mostly be reclaimable.
>
> So what I am saying is that maybe we should not do the effort of
> scanning the anon LRU in the swapcache_only case unless there aren't a
> lot of pages to reclaim on the file LRU (relatively). For example, if
> we have a 100 pages in the swap cache out of 10000 pages in the anon
> LRU, and there are 10000 pages in the file LRU, it's probably not
> worth scanning the anon LRU.

Group 1 pages are at the tail of the anon inactive LRU, so the scan
overhead is low too.  For example, if the number of group 1 pages is
100, we just need to scan 100 pages to reclaim them.  We can choose to
stop scanning when the number of non-group-1 pages reaches some
threshold.

--
Best Regards,
Huang, Ying
Yosry Ahmed Nov. 28, 2023, 4:13 a.m. UTC | #27
On Mon, Nov 27, 2023 at 8:05 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Mon, Nov 27, 2023 at 7:21 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >>
> >> > On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
> >> >>
> >> >> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
> >> >> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> > > >  I agree with Ying that anonymous pages typically have different page
> >> >> > > > access patterns than file pages, so we might want to treat them
> >> >> > > > differently to reclaim them effectively.
> >> >> > > > One random idea:
> >> >> > > > How about we put the anonymous page in a swap cache in a different LRU
> >> >> > > > than the rest of the anonymous pages. Then shrinking against those
> >> >> > > > pages in the swap cache would be more effective.Instead of having
> >> >> > > > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
> >> >> > > > cache, file] LRU
> >> >> > >
> >> >> > > I don't think that it is necessary.  The patch is only for a special use
> >> >> > > case.  Where the swap device is used up while some pages are in swap
> >> >> > > cache.  The patch will kill performance, but it is used to avoid OOM
> >> >> > > only, not to improve performance.  Per my understanding, we will not use
> >> >> > > up swap device space in most cases.  This may be true for ZRAM, but will
> >> >> > > we keep pages in swap cache for long when we use ZRAM?
> >> >> >
> >> >> > I ask the question regarding how many pages can be freed by this patch
> >> >> > in this email thread as well, but haven't got the answer from the
> >> >> > author yet. That is one important aspect to evaluate how valuable is
> >> >> > that patch.
> >> >>
> >> >> Exactly. Since swap cache has different life time with page cache, they
> >> >> would be usually dropped when pages are unmapped(unless they are shared
> >> >> with others but anon is usually exclusive private) so I wonder how much
> >> >> memory we can save.
> >> >
> >> > I think the point of this patch is not saving memory, but rather
> >> > avoiding an OOM condition that will happen if we have no swap space
> >> > left, but some pages left in the swap cache. Of course, the OOM
> >> > avoidance will come at the cost of extra work in reclaim to swap those
> >> > pages out.
> >> >
> >> > The only case where I think this might be harmful is if there's plenty
> >> > of pages to reclaim on the file LRU, and instead we opt to chase down
> >> > the few swap cache pages. So perhaps we can add a check to only set
> >> > sc->swapcache_only if the number of pages in the swap cache is more
> >> > than the number of pages on the file LRU or similar? Just make sure we
> >> > don't chase the swapcache pages down if there's plenty to scan on the
> >> > file LRU?
> >>
> >> The swap cache pages can be divided to 3 groups.
> >>
> >> - group 1: pages have been written out, at the tail of inactive LRU, but
> >>   not reclaimed yet.
> >>
> >> - group 2: pages have been written out, but were failed to be reclaimed
> >>   (e.g., were accessed before reclaiming)
> >>
> >> - group 3: pages have been swapped in, but were kept in swap cache.  The
> >>   pages may be in active LRU.
> >>
> >> The main target of the original patch should be group 1.  And the pages
> >> may be cheaper to reclaim than file pages.
> >>
> >> Group 2 are hard to be reclaimed if swap_count() isn't 0.
> >>
> >> Group 3 should be reclaimed in theory, but the overhead may be high.
> >> And we may need to reclaim the swap entries instead of pages if the pages
> >> are hot.  But we can start to reclaim the swap entries before the swap
> >> space is run out.
> >>
> >> So, if we can count group 1, we may use that as indicator to scan anon
> >> pages.  And we may add code to reclaim group 3 earlier.
> >>
> >
> > My point was not that reclaiming the pages in the swap cache is more
> > expensive that reclaiming the pages in the file LRU. In a lot of
> > cases, as you point out, the pages in the swap cache can just be
> > dropped, so they may be as cheap or cheaper to reclaim than the pages
> > in the file LRU.
> >
> > My point was that scanning the anon LRU when swap space is exhausted
> > to get to the pages in the swap cache may be much more expensive,
> > because there may be a lot of pages on the anon LRU that are not in
> > the swap cache, and hence are not reclaimable, unlike pages in the
> > file LRU, which should mostly be reclaimable.
> >
> > So what I am saying is that maybe we should not do the effort of
> > scanning the anon LRU in the swapcache_only case unless there aren't a
> > lot of pages to reclaim on the file LRU (relatively). For example, if
> > we have a 100 pages in the swap cache out of 10000 pages in the anon
> > LRU, and there are 10000 pages in the file LRU, it's probably not
> > worth scanning the anon LRU.
>
> For group 1 pages, they are at the tail of the anon inactive LRU, so the
> scan overhead is low too.  For example, if number of group 1 pages is
> 100, we just need to scan 100 pages to reclaim them.  We can choose to
> stop scanning when the number of the non-group-1 pages reached some
> threshold.
>

We should still try to reclaim pages in groups 2 & 3 before OOMing
though. Maybe the motivation for this patch is group 1, but I don't
see why we should special-case them. Pages in groups 2 & 3 should be
roughly equally cheap to reclaim. They may have higher refault cost,
but IIUC we should still try to reclaim them before OOMing.
Huang, Ying Nov. 28, 2023, 5:37 a.m. UTC | #28
Yosry Ahmed <yosryahmed@google.com> writes:

> On Mon, Nov 27, 2023 at 8:05 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Mon, Nov 27, 2023 at 7:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >>
>> >> > On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
>> >> >>
>> >> >> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
>> >> >> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> > > >  I agree with Ying that anonymous pages typically have different page
>> >> >> > > > access patterns than file pages, so we might want to treat them
>> >> >> > > > differently to reclaim them effectively.
>> >> >> > > > One random idea:
>> >> >> > > > How about we put the anonymous page in a swap cache in a different LRU
>> >> >> > > > than the rest of the anonymous pages. Then shrinking against those
>> >> >> > > > pages in the swap cache would be more effective.Instead of having
>> >> >> > > > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
>> >> >> > > > cache, file] LRU
>> >> >> > >
>> >> >> > > I don't think that it is necessary.  The patch is only for a special use
>> >> >> > > case.  Where the swap device is used up while some pages are in swap
>> >> >> > > cache.  The patch will kill performance, but it is used to avoid OOM
>> >> >> > > only, not to improve performance.  Per my understanding, we will not use
>> >> >> > > up swap device space in most cases.  This may be true for ZRAM, but will
>> >> >> > > we keep pages in swap cache for long when we use ZRAM?
>> >> >> >
>> >> >> > I ask the question regarding how many pages can be freed by this patch
>> >> >> > in this email thread as well, but haven't got the answer from the
>> >> >> > author yet. That is one important aspect to evaluate how valuable is
>> >> >> > that patch.
>> >> >>
>> >> >> Exactly. Since swap cache has different life time with page cache, they
>> >> >> would be usually dropped when pages are unmapped(unless they are shared
>> >> >> with others but anon is usually exclusive private) so I wonder how much
>> >> >> memory we can save.
>> >> >
>> >> > I think the point of this patch is not saving memory, but rather
>> >> > avoiding an OOM condition that will happen if we have no swap space
>> >> > left, but some pages left in the swap cache. Of course, the OOM
>> >> > avoidance will come at the cost of extra work in reclaim to swap those
>> >> > pages out.
>> >> >
>> >> > The only case where I think this might be harmful is if there's plenty
>> >> > of pages to reclaim on the file LRU, and instead we opt to chase down
>> >> > the few swap cache pages. So perhaps we can add a check to only set
>> >> > sc->swapcache_only if the number of pages in the swap cache is more
>> >> > than the number of pages on the file LRU or similar? Just make sure we
>> >> > don't chase the swapcache pages down if there's plenty to scan on the
>> >> > file LRU?
>> >>
>> >> The swap cache pages can be divided to 3 groups.
>> >>
>> >> - group 1: pages have been written out, at the tail of inactive LRU, but
>> >>   not reclaimed yet.
>> >>
>> >> - group 2: pages have been written out, but were failed to be reclaimed
>> >>   (e.g., were accessed before reclaiming)
>> >>
>> >> - group 3: pages have been swapped in, but were kept in swap cache.  The
>> >>   pages may be in active LRU.
>> >>
>> >> The main target of the original patch should be group 1.  And the pages
>> >> may be cheaper to reclaim than file pages.
>> >>
>> >> Group 2 are hard to be reclaimed if swap_count() isn't 0.
>> >>
>> >> Group 3 should be reclaimed in theory, but the overhead may be high.
>> >> And we may need to reclaim the swap entries instead of pages if the pages
>> >> are hot.  But we can start to reclaim the swap entries before the swap
>> >> space is run out.
>> >>
>> >> So, if we can count group 1, we may use that as indicator to scan anon
>> >> pages.  And we may add code to reclaim group 3 earlier.
>> >>
>> >
>> > My point was not that reclaiming the pages in the swap cache is more
>> > expensive that reclaiming the pages in the file LRU. In a lot of
>> > cases, as you point out, the pages in the swap cache can just be
>> > dropped, so they may be as cheap or cheaper to reclaim than the pages
>> > in the file LRU.
>> >
>> > My point was that scanning the anon LRU when swap space is exhausted
>> > to get to the pages in the swap cache may be much more expensive,
>> > because there may be a lot of pages on the anon LRU that are not in
>> > the swap cache, and hence are not reclaimable, unlike pages in the
>> > file LRU, which should mostly be reclaimable.
>> >
>> > So what I am saying is that maybe we should not do the effort of
>> > scanning the anon LRU in the swapcache_only case unless there aren't a
>> > lot of pages to reclaim on the file LRU (relatively). For example, if
>> > we have a 100 pages in the swap cache out of 10000 pages in the anon
>> > LRU, and there are 10000 pages in the file LRU, it's probably not
>> > worth scanning the anon LRU.
>>
>> For group 1 pages, they are at the tail of the anon inactive LRU, so the
>> scan overhead is low too.  For example, if number of group 1 pages is
>> 100, we just need to scan 100 pages to reclaim them.  We can choose to
>> stop scanning when the number of the non-group-1 pages reached some
>> threshold.
>>
>
> We should still try to reclaim pages in groups 2 & 3 before OOMing
> though. Maybe the motivation for this patch is group 1, but I don't
> see why we should special case them. Pages in groups 2 & 3 should be
> roughly equally cheap to reclaim. They may have higher refault cost,
> but IIUC we should still try to reclaim them before OOMing.

The scan cost of group 3 may be high; you may need to scan all anonymous
pages to identify them.  The reclaim cost of group 2 may be high; it may
just cause thrashing (shared pages that are accessed by just one
process).  So I think that we can allow reclaiming group 1 in all cases.
Try to reclaim swap entries for group 3 during normal LRU scanning after
more than half of the swap space or limit is used.  As a last resort
before OOM, try to reclaim group 2 and group 3.  Or, limit the scan
count for group 2 and group 3.

BTW, in some situations, OOM is not the worst outcome.  For example,
thrashing may kill interaction latency, while killing the memory hog
(which may be caused by a memory leak) preserves system response time.

--
Best Regards,
Huang, Ying
Yosry Ahmed Nov. 28, 2023, 5:41 a.m. UTC | #29
On Mon, Nov 27, 2023 at 9:39 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Mon, Nov 27, 2023 at 8:05 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >>
> >> > On Mon, Nov 27, 2023 at 7:21 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >>
> >> >> > On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
> >> >> >>
> >> >> >> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
> >> >> >> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> > > >  I agree with Ying that anonymous pages typically have different page
> >> >> >> > > > access patterns than file pages, so we might want to treat them
> >> >> >> > > > differently to reclaim them effectively.
> >> >> >> > > > One random idea:
> >> >> >> > > > How about we put the anonymous page in a swap cache in a different LRU
> >> >> >> > > > than the rest of the anonymous pages. Then shrinking against those
> >> >> >> > > > pages in the swap cache would be more effective.Instead of having
> >> >> >> > > > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
> >> >> >> > > > cache, file] LRU
> >> >> >> > >
> >> >> >> > > I don't think that it is necessary.  The patch is only for a special use
> >> >> >> > > case.  Where the swap device is used up while some pages are in swap
> >> >> >> > > cache.  The patch will kill performance, but it is used to avoid OOM
> >> >> >> > > only, not to improve performance.  Per my understanding, we will not use
> >> >> >> > > up swap device space in most cases.  This may be true for ZRAM, but will
> >> >> >> > > we keep pages in swap cache for long when we use ZRAM?
> >> >> >> >
> >> >> >> > I ask the question regarding how many pages can be freed by this patch
> >> >> >> > in this email thread as well, but haven't got the answer from the
> >> >> >> > author yet. That is one important aspect to evaluate how valuable is
> >> >> >> > that patch.
> >> >> >>
> >> >> >> Exactly. Since swap cache has different life time with page cache, they
> >> >> >> would be usually dropped when pages are unmapped(unless they are shared
> >> >> >> with others but anon is usually exclusive private) so I wonder how much
> >> >> >> memory we can save.
> >> >> >
> >> >> > I think the point of this patch is not saving memory, but rather
> >> >> > avoiding an OOM condition that will happen if we have no swap space
> >> >> > left, but some pages left in the swap cache. Of course, the OOM
> >> >> > avoidance will come at the cost of extra work in reclaim to swap those
> >> >> > pages out.
> >> >> >
> >> >> > The only case where I think this might be harmful is if there's plenty
> >> >> > of pages to reclaim on the file LRU, and instead we opt to chase down
> >> >> > the few swap cache pages. So perhaps we can add a check to only set
> >> >> > sc->swapcache_only if the number of pages in the swap cache is more
> >> >> > than the number of pages on the file LRU or similar? Just make sure we
> >> >> > don't chase the swapcache pages down if there's plenty to scan on the
> >> >> > file LRU?
> >> >>
> >> >> The swap cache pages can be divided to 3 groups.
> >> >>
> >> >> - group 1: pages have been written out, at the tail of inactive LRU, but
> >> >>   not reclaimed yet.
> >> >>
> >> >> - group 2: pages have been written out, but were failed to be reclaimed
> >> >>   (e.g., were accessed before reclaiming)
> >> >>
> >> >> - group 3: pages have been swapped in, but were kept in swap cache.  The
> >> >>   pages may be in active LRU.
> >> >>
> >> >> The main target of the original patch should be group 1.  And the pages
> >> >> may be cheaper to reclaim than file pages.
> >> >>
> >> >> Group 2 are hard to be reclaimed if swap_count() isn't 0.
> >> >>
> >> >> Group 3 should be reclaimed in theory, but the overhead may be high.
> >> >> And we may need to reclaim the swap entries instead of pages if the pages
> >> >> are hot.  But we can start to reclaim the swap entries before the swap
> >> >> space is run out.
> >> >>
> >> >> So, if we can count group 1, we may use that as indicator to scan anon
> >> >> pages.  And we may add code to reclaim group 3 earlier.
> >> >>
> >> >
> >> > My point was not that reclaiming the pages in the swap cache is more
> >> > expensive that reclaiming the pages in the file LRU. In a lot of
> >> > cases, as you point out, the pages in the swap cache can just be
> >> > dropped, so they may be as cheap or cheaper to reclaim than the pages
> >> > in the file LRU.
> >> >
> >> > My point was that scanning the anon LRU when swap space is exhausted
> >> > to get to the pages in the swap cache may be much more expensive,
> >> > because there may be a lot of pages on the anon LRU that are not in
> >> > the swap cache, and hence are not reclaimable, unlike pages in the
> >> > file LRU, which should mostly be reclaimable.
> >> >
> >> > So what I am saying is that maybe we should not do the effort of
> >> > scanning the anon LRU in the swapcache_only case unless there aren't a
> >> > lot of pages to reclaim on the file LRU (relatively). For example, if
> >> > we have a 100 pages in the swap cache out of 10000 pages in the anon
> >> > LRU, and there are 10000 pages in the file LRU, it's probably not
> >> > worth scanning the anon LRU.
> >>
> >> For group 1 pages, they are at the tail of the anon inactive LRU, so the
> >> scan overhead is low too.  For example, if number of group 1 pages is
> >> 100, we just need to scan 100 pages to reclaim them.  We can choose to
> >> stop scanning when the number of the non-group-1 pages reached some
> >> threshold.
> >>
> >
> > We should still try to reclaim pages in groups 2 & 3 before OOMing
> > though. Maybe the motivation for this patch is group 1, but I don't
> > see why we should special case them. Pages in groups 2 & 3 should be
> > roughly equally cheap to reclaim. They may have higher refault cost,
> > but IIUC we should still try to reclaim them before OOMing.
>
> The scan cost of group 3 may be high, you may need to scan all anonymous
> pages to identify them.  The reclaim cost of group 2 may be high, it may
> just cause trashing (shared pages that are accessed by just one
> process).  So I think that we can allow reclaim group 1 in all cases.
> Try to reclaim swap entries for group 3 during normal LRU scanning after
> more than half of swap space of limit is used.  As a last resort before
> OOM, try to reclaim group 2 and group 3.  Or, limit scan count for group
> 2 and group 3.

It would be nice if this could be done automagically without having to
keep track of the groups separately.

>
> BTW, in some situation, OOM is not the worst situation.  For example,
> trashing may kill interaction latency, while killing the memory hog (may
> be caused by memory leak) saves system response time.

I agree that in some situations OOMs are better than thrashing; it's
not an easy problem.
Huang, Ying Nov. 28, 2023, 5:52 a.m. UTC | #30
Yosry Ahmed <yosryahmed@google.com> writes:

> On Mon, Nov 27, 2023 at 9:39 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Mon, Nov 27, 2023 at 8:05 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >>
>> >> > On Mon, Nov 27, 2023 at 7:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >> >>
>> >> >> > On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
>> >> >> >>
>> >> >> >> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
>> >> >> >> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> > > >  I agree with Ying that anonymous pages typically have different page
>> >> >> >> > > > access patterns than file pages, so we might want to treat them
>> >> >> >> > > > differently to reclaim them effectively.
>> >> >> >> > > > One random idea:
>> >> >> >> > > > How about we put the anonymous page in a swap cache in a different LRU
>> >> >> >> > > > than the rest of the anonymous pages. Then shrinking against those
>> >> >> >> > > > pages in the swap cache would be more effective.Instead of having
>> >> >> >> > > > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
>> >> >> >> > > > cache, file] LRU
>> >> >> >> > >
>> >> >> >> > > I don't think that it is necessary.  The patch is only for a special use
>> >> >> >> > > case.  Where the swap device is used up while some pages are in swap
>> >> >> >> > > cache.  The patch will kill performance, but it is used to avoid OOM
>> >> >> >> > > only, not to improve performance.  Per my understanding, we will not use
>> >> >> >> > > up swap device space in most cases.  This may be true for ZRAM, but will
>> >> >> >> > > we keep pages in swap cache for long when we use ZRAM?
>> >> >> >> >
>> >> >> >> > I ask the question regarding how many pages can be freed by this patch
>> >> >> >> > in this email thread as well, but haven't got the answer from the
>> >> >> >> > author yet. That is one important aspect to evaluate how valuable is
>> >> >> >> > that patch.
>> >> >> >>
>> >> >> >> Exactly. Since swap cache has different life time with page cache, they
>> >> >> >> would be usually dropped when pages are unmapped(unless they are shared
>> >> >> >> with others but anon is usually exclusive private) so I wonder how much
>> >> >> >> memory we can save.
>> >> >> >
>> >> >> > I think the point of this patch is not saving memory, but rather
>> >> >> > avoiding an OOM condition that will happen if we have no swap space
>> >> >> > left, but some pages left in the swap cache. Of course, the OOM
>> >> >> > avoidance will come at the cost of extra work in reclaim to swap those
>> >> >> > pages out.
>> >> >> >
>> >> >> > The only case where I think this might be harmful is if there's plenty
>> >> >> > of pages to reclaim on the file LRU, and instead we opt to chase down
>> >> >> > the few swap cache pages. So perhaps we can add a check to only set
>> >> >> > sc->swapcache_only if the number of pages in the swap cache is more
>> >> >> > than the number of pages on the file LRU or similar? Just make sure we
>> >> >> > don't chase the swapcache pages down if there's plenty to scan on the
>> >> >> > file LRU?
>> >> >>
>> >> >> The swap cache pages can be divided to 3 groups.
>> >> >>
>> >> >> - group 1: pages have been written out, at the tail of inactive LRU, but
>> >> >>   not reclaimed yet.
>> >> >>
>> >> >> - group 2: pages have been written out, but were failed to be reclaimed
>> >> >>   (e.g., were accessed before reclaiming)
>> >> >>
>> >> >> - group 3: pages have been swapped in, but were kept in swap cache.  The
>> >> >>   pages may be in active LRU.
>> >> >>
>> >> >> The main target of the original patch should be group 1.  And the pages
>> >> >> may be cheaper to reclaim than file pages.
>> >> >>
>> >> >> Group 2 are hard to be reclaimed if swap_count() isn't 0.
>> >> >>
>> >> >> Group 3 should be reclaimed in theory, but the overhead may be high.
>> >> >> And we may need to reclaim the swap entries instead of pages if the pages
>> >> >> are hot.  But we can start to reclaim the swap entries before the swap
>> >> >> space is run out.
>> >> >>
>> >> >> So, if we can count group 1, we may use that as indicator to scan anon
>> >> >> pages.  And we may add code to reclaim group 3 earlier.
>> >> >>
>> >> >
>> >> > My point was not that reclaiming the pages in the swap cache is more
>> >> > expensive that reclaiming the pages in the file LRU. In a lot of
>> >> > cases, as you point out, the pages in the swap cache can just be
>> >> > dropped, so they may be as cheap or cheaper to reclaim than the pages
>> >> > in the file LRU.
>> >> >
>> >> > My point was that scanning the anon LRU when swap space is exhausted
>> >> > to get to the pages in the swap cache may be much more expensive,
>> >> > because there may be a lot of pages on the anon LRU that are not in
>> >> > the swap cache, and hence are not reclaimable, unlike pages in the
>> >> > file LRU, which should mostly be reclaimable.
>> >> >
>> >> > So what I am saying is that maybe we should not do the effort of
>> >> > scanning the anon LRU in the swapcache_only case unless there aren't a
>> >> > lot of pages to reclaim on the file LRU (relatively). For example, if
>> >> > we have a 100 pages in the swap cache out of 10000 pages in the anon
>> >> > LRU, and there are 10000 pages in the file LRU, it's probably not
>> >> > worth scanning the anon LRU.
>> >>
>> >> For group 1 pages, they are at the tail of the anon inactive LRU, so the
>> >> scan overhead is low too.  For example, if number of group 1 pages is
>> >> 100, we just need to scan 100 pages to reclaim them.  We can choose to
>> >> stop scanning when the number of the non-group-1 pages reached some
>> >> threshold.
>> >>
>> >
>> > We should still try to reclaim pages in groups 2 & 3 before OOMing
>> > though. Maybe the motivation for this patch is group 1, but I don't
>> > see why we should special case them. Pages in groups 2 & 3 should be
>> > roughly equally cheap to reclaim. They may have higher refault cost,
>> > but IIUC we should still try to reclaim them before OOMing.
>>
>> The scan cost of group 3 may be high, you may need to scan all anonymous
>> pages to identify them.  The reclaim cost of group 2 may be high, it may
>> just cause trashing (shared pages that are accessed by just one
>> process).  So I think that we can allow reclaim group 1 in all cases.
>> Try to reclaim swap entries for group 3 during normal LRU scanning after
>> more than half of swap space of limit is used.  As a last resort before
>> OOM, try to reclaim group 2 and group 3.  Or, limit scan count for group
>> 2 and group 3.
>
> It would be nice if this can be done auto-magically without having to
> keep track of the groups separately.

A rough idea may be:

- try to scan the anon LRU if there are swap cache pages.

- if some number of pages other than group 1 are encountered, stop
  scanning the anon LRU list.

- the stopping threshold can be tuned according to whether we are about
  to OOM (see the sketch below).

We can try to reclaim swap entries for group 3 when we haven't run out
of swap space yet.
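
A rough sketch of the stopping rule (the helper name and the numbers
are entirely hypothetical):

static bool swapcache_scan_should_stop(struct scan_control *sc,
                                       unsigned long nr_non_group1)
{
        /* hypothetical knob, for illustration only */
        unsigned long limit = SWAP_CLUSTER_MAX;

        /* priority 0 is the last round before OOM: scan harder */
        if (sc->priority == 0)
                limit *= 8;

        return nr_non_group1 > limit;
}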

>>
>> BTW, in some situation, OOM is not the worst situation.  For example,
>> trashing may kill interaction latency, while killing the memory hog (may
>> be caused by memory leak) saves system response time.
>
> I agree that in some situations OOMs are better than thrashing, it's
> not an easy problem.

--
Best Regards,
Huang, Ying
Michal Hocko Nov. 28, 2023, 10:16 a.m. UTC | #31
On Tue 28-11-23 09:31:06, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
[...]
> > Right. On the other hand we could be more aggressive when dropping the
> > swapcache. Is there any actual reason why we cannot try to folio_free_swap
> > even when mem_cgroup_swap_full == F?
> 
> If there are plenty free space in swap device, why not take advantage of
> it?

Maybe a stupid question, but what is the advantage of keeping the page
around in the swap cache?
Minchan Kim Nov. 28, 2023, 10:37 p.m. UTC | #32
On Tue, Nov 28, 2023 at 11:19:20AM +0800, Huang, Ying wrote:
> Yosry Ahmed <yosryahmed@google.com> writes:
> 
> > On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
> >>
> >> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
> >> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> > > >  I agree with Ying that anonymous pages typically have different page
> >> > > > access patterns than file pages, so we might want to treat them
> >> > > > differently to reclaim them effectively.
> >> > > > One random idea:
> >> > > > How about we put the anonymous page in a swap cache in a different LRU
> >> > > > than the rest of the anonymous pages. Then shrinking against those
> >> > > > pages in the swap cache would be more effective.Instead of having
> >> > > > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
> >> > > > cache, file] LRU
> >> > >
> >> > > I don't think that it is necessary.  The patch is only for a special use
> >> > > case.  Where the swap device is used up while some pages are in swap
> >> > > cache.  The patch will kill performance, but it is used to avoid OOM
> >> > > only, not to improve performance.  Per my understanding, we will not use
> >> > > up swap device space in most cases.  This may be true for ZRAM, but will
> >> > > we keep pages in swap cache for long when we use ZRAM?
> >> >
> >> > I ask the question regarding how many pages can be freed by this patch
> >> > in this email thread as well, but haven't got the answer from the
> >> > author yet. That is one important aspect to evaluate how valuable is
> >> > that patch.
> >>
> >> Exactly. Since swap cache has different life time with page cache, they
> >> would be usually dropped when pages are unmapped(unless they are shared
> >> with others but anon is usually exclusive private) so I wonder how much
> >> memory we can save.
> >
> > I think the point of this patch is not saving memory, but rather
> > avoiding an OOM condition that will happen if we have no swap space
> > left, but some pages left in the swap cache. Of course, the OOM
> > avoidance will come at the cost of extra work in reclaim to swap those
> > pages out.
> >
> > The only case where I think this might be harmful is if there's plenty
> > of pages to reclaim on the file LRU, and instead we opt to chase down
> > the few swap cache pages. So perhaps we can add a check to only set
> > sc->swapcache_only if the number of pages in the swap cache is more
> > than the number of pages on the file LRU or similar? Just make sure we
> > don't chase the swapcache pages down if there's plenty to scan on the
> > file LRU?
> 
> The swap cache pages can be divided to 3 groups.
> 
> - group 1: pages have been written out, at the tail of inactive LRU, but
>   not reclaimed yet.
> 
> - group 2: pages have been written out, but were failed to be reclaimed
>   (e.g., were accessed before reclaiming)
> 
> - group 3: pages have been swapped in, but were kept in swap cache.  The
>   pages may be in active LRU.
> 
> The main target of the original patch should be group 1.  And the pages
> may be cheaper to reclaim than file pages.

Yeah, that's common for asynchronous swap devices, and those are popular.
Then, how about freeing that memory as soon as the writeback is done,
instead of adding more tricks to solve the issue?

https://lkml.kernel.org/linux-mm/1368411048-3753-1-git-send-email-minchan@kernel.org/

I remember it runs in softIRQ context, so there were some issues with
changing the locking rules for memcg and swap. And there was some concern
about increased softirq latency due to the page freeing, but neither was
the main obstacle to getting it fixed.

> 
> Group 2 are hard to be reclaimed if swap_count() isn't 0.

"were accessed before reclaiming" would be rare.

> 
> Group 3 should be reclaimed in theory, but the overhead may be high.
> And we may need to reclaim the swap entries instead of pages if the pages
> are hot.  But we can start to reclaim the swap entries before the swap
> space is run out.

I thought the swap-in path would reclaim the swap slots once it detects
that swap space isn't enough (e.g., vm_swap_full() or mem_cgroup_swap_full())?
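
For reference, roughly what that check looks like (condensed from
should_try_to_free_swap() in mm/memory.c, from memory, so details may
be off):

static inline bool should_try_to_free_swap(struct folio *folio,
                                           struct vm_area_struct *vma,
                                           unsigned int fault_flags)
{
        if (!folio_test_swapcache(folio))
                return false;
        /* free the slot when swap is under pressure or page is mlocked */
        if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
            folio_test_mlocked(folio))
                return true;
        /* ... or when exclusive (one map + swapcache ref) and about to
         * be dirtied by a write fault anyway */
        return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
               folio_ref_count(folio) == 2;
}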

> 
> So, if we can count group 1, we may use that as indicator to scan anon
> pages.  And we may add code to reclaim group 3 earlier.
Minchan Kim Nov. 28, 2023, 10:45 p.m. UTC | #33
On Tue, Nov 28, 2023 at 11:16:04AM +0100, Michal Hocko wrote:
> On Tue 28-11-23 09:31:06, Huang, Ying wrote:
> > Michal Hocko <mhocko@suse.com> writes:
> [...]
> > > Right. On the other hand we could be more aggressive when dropping the
> > > swapcache. Is there any actual reason why we cannot try to folio_free_swap
> > > even when mem_cgroup_swap_full == F?
> > 
> > If there are plenty free space in swap device, why not take advantage of
> > it?
> 
> Maybe a stupid question but what is the advantage of keeping around in
> the swap cache?

If the page is shared, keeping it in the swap cache avoids additional IO
to bring it back.

If the page is still clean at reclaim time since the swap-in, the VM
doesn't need to write it out to the swap disk, since the original data
is already there.
Yosry Ahmed Nov. 28, 2023, 11:05 p.m. UTC | #34
On Tue, Nov 28, 2023 at 2:45 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Tue, Nov 28, 2023 at 11:16:04AM +0100, Michal Hocko wrote:
> > On Tue 28-11-23 09:31:06, Huang, Ying wrote:
> > > Michal Hocko <mhocko@suse.com> writes:
> > [...]
> > > > Right. On the other hand we could be more aggressive when dropping the
> > > > swapcache. Is there any actual reason why we cannot try to folio_free_swap
> > > > even when mem_cgroup_swap_full == F?
> > >
> > > If there are plenty free space in swap device, why not take advantage of
> > > it?
> >
> > Maybe a stupid question but what is the advantage of keeping around in
> > the swap cache?
>
> If the page is shared, we avoids addtional IO to bring them back so
> swap cache.

I think this case is actually necessary for correctness, not just to
avoid additional IO. Otherwise subsequent swapins will create new
copies of the page, right?

>
> If the page is still clean on reclaim moment since swap-in, VM doesn't
> need to write them out to the swap disk since original data is already
> there.
Minchan Kim Nov. 28, 2023, 11:15 p.m. UTC | #35
On Tue, Nov 28, 2023 at 03:05:29PM -0800, Yosry Ahmed wrote:
> On Tue, Nov 28, 2023 at 2:45 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Tue, Nov 28, 2023 at 11:16:04AM +0100, Michal Hocko wrote:
> > > On Tue 28-11-23 09:31:06, Huang, Ying wrote:
> > > > Michal Hocko <mhocko@suse.com> writes:
> > > [...]
> > > > > Right. On the other hand we could be more aggressive when dropping the
> > > > > swapcache. Is there any actual reason why we cannot try to folio_free_swap
> > > > > even when mem_cgroup_swap_full == F?
> > > >
> > > > If there are plenty free space in swap device, why not take advantage of
> > > > it?
> > >
> > > Maybe a stupid question but what is the advantage of keeping around in
> > > the swap cache?
> >
> > If the page is shared, we avoids addtional IO to bring them back so
> > swap cache.
> 
> I think this case is actually necessary for correctness, not just to
> avoid additional IO. Otherwise subsequent swapins will create new
> copies of the page, right?

I think if the page was shared via MAP_SHARED, then yes.
I think if the page was shared via MAP_PRIVATE CoW (e.g., after fork), then no.
Chris Li Nov. 28, 2023, 11:45 p.m. UTC | #36
On Mon, Nov 27, 2023 at 1:57 PM Yosry Ahmed <yosryahmed@google.com> wrote:

> >
> > Exactly. Since swap cache has different life time with page cache, they
> > would be usually dropped when pages are unmapped(unless they are shared
> > with others but anon is usually exclusive private) so I wonder how much
> > memory we can save.
>
> I think the point of this patch is not saving memory, but rather
> avoiding an OOM condition that will happen if we have no swap space
> left, but some pages left in the swap cache. Of course, the OOM
> avoidance will come at the cost of extra work in reclaim to swap those
> pages out.

You are discussing how to use the memory that got freed, e.g., letting
other apps use it to avoid OOM.
I am asking how much memory can be freed by this patch. That number is
still useful for understanding how effective the patch is.  Does it
justify the additional complexity?

> The only case where I think this might be harmful is if there's plenty
> of pages to reclaim on the file LRU, and instead we opt to chase down
> the few swap cache pages. So perhaps we can add a check to only set
> sc->swapcache_only if the number of pages in the swap cache is more
> than the number of pages on the file LRU or similar? Just make sure we
> don't chase the swapcache pages down if there's plenty to scan on the
> file LRU?
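
One way that suggestion might look (a hypothetical sketch built on the
patch's can_reclaim_swapcache() counters; the helper name and threshold
are assumptions, not existing kernel code):

/*
 * Only bother with swapcache-only reclaim when the swapcache is at
 * least comparable in size to the file LRU, so we don't chase a
 * handful of swapcache pages while plenty of file pages are
 * reclaimable.
 */
static bool swapcache_worth_reclaiming(struct mem_cgroup *memcg, int nid)
{
	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
	unsigned long nr_swapcache, nr_file;

	nr_swapcache = lruvec_page_state_local(lruvec, NR_SWAPCACHE);
	nr_file = lruvec_page_state_local(lruvec, NR_INACTIVE_FILE) +
		  lruvec_page_state_local(lruvec, NR_ACTIVE_FILE);

	return nr_swapcache > nr_file;
}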

One idea is to measure how effective the reclaim is against the swap
cache. If we do a lot of work but are not able to reclaim much from the
swap cache, then additional reclaim work on the swap cache is not going
to be effective.
MGLRU has a PID controller to adjust the reclaim ratio between the
file and anonymous LRUs. That is one way of measuring the effectiveness
of the reclaim and using the feedback to adjust the follow-up reclaim
work. I kind of wish we could have a feedback mechanism like this for
swap cache reclaim as well. Not sure how complicated it would be to
implement one.
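
A minimal sketch of such a feedback loop (illustration only; all names
are made up, and MGLRU's actual PID controller is more involved):

struct swapcache_feedback {
	unsigned long scanned;		/* folios examined last round */
	unsigned long reclaimed;	/* folios actually freed last round */
};

/*
 * Proportional feedback: scale the next round of swapcache scanning by
 * how productive the previous round was, with a floor so scanning never
 * stops entirely while swapcache pages remain.
 */
static unsigned long next_swapcache_scan(struct swapcache_feedback *fb,
					 unsigned long base)
{
	unsigned long gain;

	if (!fb->scanned)
		return base;

	/* Reclaimed/scanned ratio, expressed as a percentage. */
	gain = fb->reclaimed * 100 / fb->scanned;

	/* Scan more when productive; back off when spinning uselessly. */
	return max(base * gain / 100, base / 8);
}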

Chris
Huang, Ying Nov. 29, 2023, 3:12 a.m. UTC | #37
Minchan Kim <minchan@kernel.org> writes:

> On Tue, Nov 28, 2023 at 11:19:20AM +0800, Huang, Ying wrote:
>> Yosry Ahmed <yosryahmed@google.com> writes:
>> 
>> > On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
>> >>
>> >> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
>> >> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> > > >  I agree with Ying that anonymous pages typically have different page
>> >> > > > access patterns than file pages, so we might want to treat them
>> >> > > > differently to reclaim them effectively.
>> >> > > > One random idea:
>> >> > > > How about we put the anonymous page in a swap cache in a different LRU
>> >> > > > than the rest of the anonymous pages. Then shrinking against those
>> >> > > > pages in the swap cache would be more effective. Instead of having
>> >> > > > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
>> >> > > > cache, file] LRU
>> >> > >
>> >> > > I don't think that it is necessary.  The patch is only for a special use
>> >> > > case, where the swap device is used up while some pages are in the swap
>> >> > > cache.  The patch will kill performance, but it is used to avoid OOM
>> >> > > only, not to improve performance.  Per my understanding, we will not use
>> >> > > up swap device space in most cases.  This may be true for ZRAM, but will
>> >> > > we keep pages in swap cache for long when we use ZRAM?
>> >> >
>> >> > I asked the question regarding how many pages can be freed by this patch
>> >> > in this email thread as well, but haven't got an answer from the
>> >> > author yet. That is one important aspect in evaluating how valuable
>> >> > that patch is.
>> >>
>> >> Exactly. Since the swap cache has a different lifetime from the page cache,
>> >> such pages would usually be dropped when they are unmapped (unless they are
>> >> shared with others, but anon pages are usually exclusively private), so I
>> >> wonder how much memory we can save.
>> >
>> > I think the point of this patch is not saving memory, but rather
>> > avoiding an OOM condition that will happen if we have no swap space
>> > left, but some pages left in the swap cache. Of course, the OOM
>> > avoidance will come at the cost of extra work in reclaim to swap those
>> > pages out.
>> >
>> > The only case where I think this might be harmful is if there's plenty
>> > of pages to reclaim on the file LRU, and instead we opt to chase down
>> > the few swap cache pages. So perhaps we can add a check to only set
>> > sc->swapcache_only if the number of pages in the swap cache is more
>> > than the number of pages on the file LRU or similar? Just make sure we
>> > don't chase the swapcache pages down if there's plenty to scan on the
>> > file LRU?
>> 
>> The swap cache pages can be divided to 3 groups.
>> 
>> - group 1: pages have been written out, at the tail of inactive LRU, but
>>   not reclaimed yet.
>> 
>> - group 2: pages have been written out, but failed to be reclaimed
>>   (e.g., were accessed before reclaiming)
>> 
>> - group 3: pages have been swapped in, but were kept in swap cache.  The
>>   pages may be in active LRU.
>> 
>> The main target of the original patch should be group 1.  And the pages
>> may be cheaper to reclaim than file pages.
>
> Yeah, that's common for asynchronous swap devices and that's popular. Then,
> how about freeing that memory as soon as the writeback is done instead of
> adding more and more tricks to solve the issue?
>
> https://lkml.kernel.org/linux-mm/1368411048-3753-1-git-send-email-minchan@kernel.org/
>
> I remember it ran in softirq context, so there were some issues with changing
> the locking rules for memcg and swap. And there was some concern about
> increased softirq latency due to page freeing, but neither was the main
> obstacle to fixing it.

Thanks for sharing.  It's good to avoid adding the pages back to the LRU
and then isolating them from the LRU again.  But I have a concern: is it
possible that too many pages get reclaimed?  For example, to reclaim a
small number of pages, many more pages may be written to disk because of
the performance difference between CPU and storage.  Originally, we still
reclaim only the requested number of pages even though many more were
written.  But with that change, we may reclaim them all.

>> 
>> Group 2 is hard to reclaim if swap_count() isn't 0.
>
> "were accessed before reclaiming" would be rare.

If the page reclaiming algorithm works well enough, that should be true.

>> 
>> Group 3 should be reclaimed in theory, but the overhead may be high.
>> And we may need to reclaim the swap entries instead of pages if the pages
>> are hot.  But we can start to reclaim the swap entries before the swap
>> space runs out.
>
> I thought the swap-in path would reclaim the swap slots once it detects
> that swap space isn't enough (e.g., vm_swap_full() or mem_cgroup_swap_full())?

Yes, you are right.  But before swap space becomes scarce, we may keep
quite a few pages in the swap cache, and these pages may become hot
later.  Then we have no opportunity to reclaim that swap space.  So, we
may need to add some code to check for this situation at appropriate
places, for example, when we scan pages in the active list or activate
pages in the inactive list.
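
As a rough illustration of checking at activation time (hypothetical
sketch; the function and its hook point are assumptions, not existing
code):

/*
 * When a swapcache folio is activated (i.e., it turned out to be hot),
 * try to release its swap slot early rather than waiting until the
 * swap device is nearly full.  folio_free_swap() only succeeds when no
 * swap PTEs still reference the entry.
 */
static void maybe_free_swap_on_activation(struct folio *folio)
{
	if (!folio_test_swapcache(folio))
		return;

	/* A hot page will stay in memory, so its swap slot is wasted. */
	if (folio_trylock(folio)) {
		folio_free_swap(folio);
		folio_unlock(folio);
	}
}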

>> 
>> So, if we can count group 1, we may use that as an indicator to scan anon
>> pages.  And we may add code to reclaim group 3 earlier.

--
Best Regards,
Huang, Ying
Michal Hocko Nov. 29, 2023, 10:17 a.m. UTC | #38
On Tue 28-11-23 15:15:08, Minchan Kim wrote:
> On Tue, Nov 28, 2023 at 03:05:29PM -0800, Yosry Ahmed wrote:
> > On Tue, Nov 28, 2023 at 2:45 PM Minchan Kim <minchan@kernel.org> wrote:
> > >
> > > On Tue, Nov 28, 2023 at 11:16:04AM +0100, Michal Hocko wrote:
> > > > On Tue 28-11-23 09:31:06, Huang, Ying wrote:
> > > > > Michal Hocko <mhocko@suse.com> writes:
> > > > [...]
> > > > > > Right. On the other hand we could be more aggressive when dropping the
> > > > > > swapcache. Is there any actual reason why we cannot try to folio_free_swap
> > > > > > even when mem_cgroup_swap_full == F?
> > > > >
> > > > > If there is plenty of free space in the swap device, why not take advantage
> > > > > of it?
> > > >
> > > > Maybe a stupid question, but what is the advantage of keeping pages around in
> > > > the swap cache?
> > >
> > > If the page is shared, we avoid additional IO to bring it back, thanks to
> > > the swap cache.
> > 
> > I think this case is actually necessary for correctness, not just to
> > avoid additional IO. Otherwise subsequent swapins will create new
> > copies of the page, right?
> 
> I think if the page was shared via MAP_SHARED, then yes.

In that case deleting from the swap cache would fail because there are
still swapped out ptes, no?

> I think if the page was shared via MAP_PRIVATE CoW (e.g., after fork), then no.

OK, but we are talking about memory pressure here and evicting
available memory. So what is the actual cost-benefit model here?
Is it really better to keep the swapcache around even under memory
pressure if the CoWed page might never be faulted in?
Michal Hocko Nov. 29, 2023, 10:22 a.m. UTC | #39
On Tue 28-11-23 11:19:20, Huang, Ying wrote:
> Yosry Ahmed <yosryahmed@google.com> writes:
> 
> > On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
> >>
> >> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
> >> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> > > >  I agree with Ying that anonymous pages typically have different page
> >> > > > access patterns than file pages, so we might want to treat them
> >> > > > differently to reclaim them effectively.
> >> > > > One random idea:
> >> > > > How about we put the anonymous page in a swap cache in a different LRU
> >> > > > than the rest of the anonymous pages. Then shrinking against those
> >> > > > pages in the swap cache would be more effective. Instead of having
> >> > > > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
> >> > > > cache, file] LRU
> >> > >
> >> > > I don't think that it is necessary.  The patch is only for a special use
> >> > > case, where the swap device is used up while some pages are in the swap
> >> > > cache.  The patch will kill performance, but it is used to avoid OOM
> >> > > only, not to improve performance.  Per my understanding, we will not use
> >> > > up swap device space in most cases.  This may be true for ZRAM, but will
> >> > > we keep pages in swap cache for long when we use ZRAM?
> >> >
> >> > I asked the question regarding how many pages can be freed by this patch
> >> > in this email thread as well, but haven't got an answer from the
> >> > author yet. That is one important aspect in evaluating how valuable
> >> > that patch is.
> >>
> >> Exactly. Since the swap cache has a different lifetime from the page cache,
> >> such pages would usually be dropped when they are unmapped (unless they are
> >> shared with others, but anon pages are usually exclusively private), so I
> >> wonder how much memory we can save.
> >
> > I think the point of this patch is not saving memory, but rather
> > avoiding an OOM condition that will happen if we have no swap space
> > left, but some pages left in the swap cache. Of course, the OOM
> > avoidance will come at the cost of extra work in reclaim to swap those
> > pages out.
> >
> > The only case where I think this might be harmful is if there's plenty
> > of pages to reclaim on the file LRU, and instead we opt to chase down
> > the few swap cache pages. So perhaps we can add a check to only set
> > sc->swapcache_only if the number of pages in the swap cache is more
> > than the number of pages on the file LRU or similar? Just make sure we
> > don't chase the swapcache pages down if there's plenty to scan on the
> > file LRU?
> 
> The swap cache pages can be divided to 3 groups.
> 
> - group 1: pages have been written out, at the tail of inactive LRU, but
>   not reclaimed yet.
> 
> - group 2: pages have been written out, but failed to be reclaimed
>   (e.g., were accessed before reclaiming)
> 
> - group 3: pages have been swapped in, but were kept in swap cache.  The
>   pages may be in active LRU.
> 
> The main target of the original patch should be group 1.  And the pages
> may be cheaper to reclaim than file pages.

Thanks, this is a really useful summary.  And it begs the question: how
are we telling these different types apart?  A vmstat counter is
certainly not sufficient, and that means we might be scanning a lot
without actually making any progress, and doing that repeatedly.
Huang, Ying Nov. 30, 2023, 8:07 a.m. UTC | #40
Michal Hocko <mhocko@suse.com> writes:

> On Tue 28-11-23 11:19:20, Huang, Ying wrote:
>> Yosry Ahmed <yosryahmed@google.com> writes:
>> 
>> > On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
>> >>
>> >> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
>> >> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> > > >  I agree with Ying that anonymous pages typically have different page
>> >> > > > access patterns than file pages, so we might want to treat them
>> >> > > > differently to reclaim them effectively.
>> >> > > > One random idea:
>> >> > > > How about we put the anonymous page in a swap cache in a different LRU
>> >> > > > than the rest of the anonymous pages. Then shrinking against those
>> >> > > > pages in the swap cache would be more effective. Instead of having
>> >> > > > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
>> >> > > > cache, file] LRU
>> >> > >
>> >> > > I don't think that it is necessary.  The patch is only for a special use
>> >> > > case, where the swap device is used up while some pages are in the swap
>> >> > > cache.  The patch will kill performance, but it is used to avoid OOM
>> >> > > only, not to improve performance.  Per my understanding, we will not use
>> >> > > up swap device space in most cases.  This may be true for ZRAM, but will
>> >> > > we keep pages in swap cache for long when we use ZRAM?
>> >> >
>> >> > I asked the question regarding how many pages can be freed by this patch
>> >> > in this email thread as well, but haven't got an answer from the
>> >> > author yet. That is one important aspect in evaluating how valuable
>> >> > that patch is.
>> >>
>> >> Exactly. Since the swap cache has a different lifetime from the page cache,
>> >> such pages would usually be dropped when they are unmapped (unless they are
>> >> shared with others, but anon pages are usually exclusively private), so I
>> >> wonder how much memory we can save.
>> >
>> > I think the point of this patch is not saving memory, but rather
>> > avoiding an OOM condition that will happen if we have no swap space
>> > left, but some pages left in the swap cache. Of course, the OOM
>> > avoidance will come at the cost of extra work in reclaim to swap those
>> > pages out.
>> >
>> > The only case where I think this might be harmful is if there's plenty
>> > of pages to reclaim on the file LRU, and instead we opt to chase down
>> > the few swap cache pages. So perhaps we can add a check to only set
>> > sc->swapcache_only if the number of pages in the swap cache is more
>> > than the number of pages on the file LRU or similar? Just make sure we
>> > don't chase the swapcache pages down if there's plenty to scan on the
>> > file LRU?
>> 
>> The swap cache pages can be divided to 3 groups.
>> 
>> - group 1: pages have been written out, at the tail of inactive LRU, but
>>   not reclaimed yet.
>> 
>> - group 2: pages have been written out, but failed to be reclaimed
>>   (e.g., were accessed before reclaiming)
>> 
>> - group 3: pages have been swapped in, but were kept in swap cache.  The
>>   pages may be in active LRU.
>> 
>> The main target of the original patch should be group 1.  And the pages
>> may be cheaper to reclaim than file pages.
>
> Thanks, this is a really useful summary.  And it begs the question: how
> are we telling these different types apart?  A vmstat counter is
> certainly not sufficient, and that means we might be scanning a lot
> without actually making any progress, and doing that repeatedly.

We don't have counters for pages in the individual groups.  Pages in
group 1 and some pages in group 2 are always at the tail of the inactive
LRU, so we can identify them relatively easily.  A simple method could
be: if there are swap cache pages, try to scan the tail of the inactive
LRU to free pages in group 1 and move pages in group 2.  If we find some
pages that aren't in the swap cache, there may be no pages left in group
1, and then we may give up scanning if the memory pressure isn't too
large.
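
A minimal sketch of that heuristic (hypothetical; the helper, its
locking context, and the stop condition are assumptions):

/*
 * Probe the tail of the inactive anon LRU for group-1 swapcache pages,
 * stopping at the first non-swapcache folio since group 1, when it
 * exists, sits at the tail.  Caller is assumed to hold the lru_lock.
 */
static unsigned long count_tail_swapcache(struct lruvec *lruvec,
					  unsigned long nr_to_probe)
{
	struct list_head *list = &lruvec->lists[LRU_INACTIVE_ANON];
	struct folio *folio;
	unsigned long found = 0;

	list_for_each_entry_reverse(folio, list, lru) {
		if (!nr_to_probe--)
			break;
		if (!folio_test_swapcache(folio))
			break;	/* likely no group-1 pages remain */
		found++;
	}
	return found;
}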

One possible issue is that some pages (map count == 1) may be swapped
out, swapped in, then deleted from the swap cache.  So, some pages not in
the swap cache may be at the end of the inactive LRU too.

--
Best Regards,
Huang, Ying
Andrew Morton Dec. 13, 2023, 11:13 p.m. UTC | #41
tl;dr, what's the viewpoint on this patch now?

I have a note here that we're hoping for input from Johannes, Vlastimil
and Mel (please).
Huang, Ying Dec. 15, 2023, 5:05 a.m. UTC | #42
Andrew Morton <akpm@linux-foundation.org> writes:

> tl;dr, what's the viewpoint on this patch now?
>
> I have a note here that we're hoping for input from Johannes, Vlastimil
> and Mel (please).

Michal has performance concerns.  So, I think that we need to stop
scanning the anon LRU if we don't find enough pages in the swap cache at
the tail of the inactive list.

--
Best Regards,
Huang, Ying
Andrew Morton Dec. 15, 2023, 7:24 p.m. UTC | #43
On Fri, 15 Dec 2023 13:05:09 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:

> Andrew Morton <akpm@linux-foundation.org> writes:
> 
> > tl;dr, what's the viewpoint on this patch now?
> >
> > I have a note here that we're hoping for input from Johannes, Vlastimil
> > and Mel (please).
> 
> Michal has performance concerns.  So, I think that we need to stop
> scanning the anon LRU if we don't find enough pages in the swap cache at
> the tail of the inactive list.

OK, thanks, I'll make the v10 patch go away for now.

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 506f8220c5fe..1fcc94717370 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -136,6 +136,9 @@  struct scan_control {
 	/* Always discard instead of demoting to lower tier memory */
 	unsigned int no_demotion:1;
 
+	/* Swap space is exhausted, only reclaim swapcache for anon LRU */
+	unsigned int swapcache_only:1;
+
 	/* Allocation order */
 	s8 order;
 
@@ -308,10 +311,36 @@  static bool can_demote(int nid, struct scan_control *sc)
 	return true;
 }
 
+#ifdef CONFIG_SWAP
+static bool can_reclaim_swapcache(struct mem_cgroup *memcg, int nid)
+{
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	unsigned long nr_swapcache;
+
+	if (!memcg) {
+		nr_swapcache = node_page_state(pgdat, NR_SWAPCACHE);
+	} else {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		nr_swapcache = lruvec_page_state_local(lruvec, NR_SWAPCACHE);
+	}
+
+	return nr_swapcache > 0;
+}
+#else
+static bool can_reclaim_swapcache(struct mem_cgroup *memcg, int nid)
+{
+	return false;
+}
+#endif
+
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
 					  int nid,
 					  struct scan_control *sc)
 {
+	if (sc)
+		sc->swapcache_only = 0;
+
 	if (memcg == NULL) {
 		/*
 		 * For non-memcg reclaim, is there
@@ -330,7 +359,17 @@  static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
 	 *
 	 * Can it be reclaimed from this node via demotion?
 	 */
-	return can_demote(nid, sc);
+	if (can_demote(nid, sc))
+		return true;
+
+	/* Are there any swapcache pages to reclaim in this node? */
+	if (can_reclaim_swapcache(memcg, nid)) {
+		if (sc)
+			sc->swapcache_only = 1;
+		return true;
+	}
+
+	return false;
 }
 
 /*
@@ -1642,6 +1681,15 @@  static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
 		 */
 		scan += nr_pages;
 
+		/*
+		 * Count skipped non-swapcache pages too: swapcache pages
+		 * may be rare, and if the skipped pages were not counted,
+		 * the scan here could take far too long.
+		 */
+		if (unlikely(sc->swapcache_only && !is_file_lru(lru) &&
+		    !folio_test_swapcache(folio)))
+			goto move;
+
 		if (!folio_test_lru(folio))
 			goto move;
 		if (!sc->may_unmap && folio_mapped(folio))