| Message ID | 20231121090624.1814733-1-liushixin2@huawei.com (mailing list archive) |
|---|---|
| State | New |
| Series | [v10] mm: vmscan: try to reclaim swapcache pages if no swap space |
On Tue 21-11-23 17:06:24, Liu Shixin wrote:
> When spaces of swap devices are exhausted, only file pages can be
> reclaimed. But there are still some swapcache pages in anon lru list.
> This can lead to a premature out-of-memory.
>
> The problem is found with such steps:
>
> Firstly, set a 9MB disk swap space, then create a cgroup with 10MB
> memory limit, then run a program that allocates about 15MB memory.
>
> The problem occurs occasionally, which may need about 100 times [1].
>
> Fix it by checking number of swapcache pages in can_reclaim_anon_pages().
> If the number is not zero, return true and set swapcache_only to 1.
> When scan anon lru list in swapcache_only mode, non-swapcache pages will
> be skipped to isolate in order to accelerate reclaim efficiency.
>
> However, in swapcache_only mode, the scan count still increased when scan
> non-swapcache pages because there are large number of non-swapcache pages
> and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> is skipped and do not count, the scan of pages in isolate_lru_folios() can
> eventually lead to hung task, just as Sachin reported [2].

I find this paragraph really confusing! I guess what you meant to say is
that a real swapcache_only is problematic because it can end up not
making any progress, correct?

AFAIU you have addressed that problem by making swapcache_only anon LRU
specific, right? That would be certainly more robust as you can still
reclaim from file LRUs. I cannot say I like that because swapcache_only
is a bit confusing and I do not think we want to grow more special
purpose reclaim types. Would it be possible/reasonable to instead put
swapcache pages on the file LRU instead?
On 2023/11/21 21:00, Michal Hocko wrote:
> On Tue 21-11-23 17:06:24, Liu Shixin wrote:
>
> However, in swapcache_only mode, the scan count still increased when scan
> non-swapcache pages because there are large number of non-swapcache pages
> and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> is skipped and do not count, the scan of pages in isolate_lru_folios() can
> eventually lead to hung task, just as Sachin reported [2].
> I find this paragraph really confusing! I guess what you meant to say is
> that a real swapcache_only is problematic because it can end up not
> making any progress, correct?
This paragraph is going to explain why checking swapcache_only after scan += nr_pages;
>
> AFAIU you have addressed that problem by making swapcache_only anon LRU
> specific, right? That would be certainly more robust as you can still
> reclaim from file LRUs. I cannot say I like that because swapcache_only
> is a bit confusing and I do not think we want to grow more special
> purpose reclaim types. Would it be possible/reasonable to instead put
> swapcache pages on the file LRU instead?
It looks like a good idea, but I'm not sure if it's possible. I can try it, is there anything to
pay attention to?

Thanks,
On Tue, Nov 21, 2023 at 10:41 PM Liu Shixin <liushixin2@huawei.com> wrote:
>
>
> On 2023/11/21 21:00, Michal Hocko wrote:
> > On Tue 21-11-23 17:06:24, Liu Shixin wrote:
> >
> > However, in swapcache_only mode, the scan count still increased when scan
> > non-swapcache pages because there are large number of non-swapcache pages
> > and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> > is skipped and do not count, the scan of pages in isolate_lru_folios() can
> > eventually lead to hung task, just as Sachin reported [2].
> > I find this paragraph really confusing! I guess what you meant to say is
> > that a real swapcache_only is problematic because it can end up not
> > making any progress, correct?
> This paragraph is going to explain why checking swapcache_only after scan += nr_pages;
> >
> > AFAIU you have addressed that problem by making swapcache_only anon LRU
> > specific, right? That would be certainly more robust as you can still
> > reclaim from file LRUs. I cannot say I like that because swapcache_only
> > is a bit confusing and I do not think we want to grow more special
> > purpose reclaim types. Would it be possible/reasonable to instead put
> > swapcache pages on the file LRU instead?
> It looks like a good idea, but I'm not sure if it's possible. I can try it, is there anything to
> pay attention to?

I think this might be more intrusive than we think. Every time a page
is added to or removed from the swap cache, we will need to move it
between LRUs. All pages on the anon LRU will need to go through the
file LRU before being reclaimed. I think this might be too big of a
change to achieve this patch's goal.

>
> Thanks,
>
Yosry Ahmed <yosryahmed@google.com> writes:

> On Tue, Nov 21, 2023 at 10:41 PM Liu Shixin <liushixin2@huawei.com> wrote:
>>
>> On 2023/11/21 21:00, Michal Hocko wrote:
>> > On Tue 21-11-23 17:06:24, Liu Shixin wrote:
[...]
>> > AFAIU you have addressed that problem by making swapcache_only anon LRU
>> > specific, right? That would be certainly more robust as you can still
>> > reclaim from file LRUs. I cannot say I like that because swapcache_only
>> > is a bit confusing and I do not think we want to grow more special
>> > purpose reclaim types. Would it be possible/reasonable to instead put
>> > swapcache pages on the file LRU instead?
>> It looks like a good idea, but I'm not sure if it's possible. I can try it, is there anything to
>> pay attention to?
>
> I think this might be more intrusive than we think. Every time a page
> is added to or removed from the swap cache, we will need to move it
> between LRUs. All pages on the anon LRU will need to go through the
> file LRU before being reclaimed. I think this might be too big of a
> change to achieve this patch's goal.

We need to identify swap cache pages on file LRU firstly. It appears
hard from the current definition of page flags.

        /* Filesystems */
        PG_checked = PG_owner_priv_1,

        /* SwapBacked */
        PG_swapcache = PG_owner_priv_1, /* Swap page: swp_entry_t in private */

--
Best Regards,
Huang, Ying
On Tue 21-11-23 22:44:32, Yosry Ahmed wrote:
> On Tue, Nov 21, 2023 at 10:41 PM Liu Shixin <liushixin2@huawei.com> wrote:
> > On 2023/11/21 21:00, Michal Hocko wrote:
> > > On Tue 21-11-23 17:06:24, Liu Shixin wrote:
[...]
> > > AFAIU you have addressed that problem by making swapcache_only anon LRU
> > > specific, right? That would be certainly more robust as you can still
> > > reclaim from file LRUs. I cannot say I like that because swapcache_only
> > > is a bit confusing and I do not think we want to grow more special
> > > purpose reclaim types. Would it be possible/reasonable to instead put
> > > swapcache pages on the file LRU instead?
> > It looks like a good idea, but I'm not sure if it's possible. I can try it, is there anything to
> > pay attention to?
>
> I think this might be more intrusive than we think. Every time a page
> is added to or removed from the swap cache, we will need to move it
> between LRUs. All pages on the anon LRU will need to go through the
> file LRU before being reclaimed. I think this might be too big of a
> change to achieve this patch's goal.

TBH I am not really sure how complex that might turn out to be.
Swapcache tends to be full of subtle issues. So you might be right but
it would be better to know _why_ this is not possible before we end up
fishing for a couple of swapcache pages on potentially huge anon LRU to
isolate them. Think of TB sized machines in this context.
On Wed 22-11-23 14:57:24, Huang, Ying wrote:
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Tue, Nov 21, 2023 at 10:41 PM Liu Shixin <liushixin2@huawei.com> wrote:
[...]
> > I think this might be more intrusive than we think. Every time a page
> > is added to or removed from the swap cache, we will need to move it
> > between LRUs. All pages on the anon LRU will need to go through the
> > file LRU before being reclaimed. I think this might be too big of a
> > change to achieve this patch's goal.
>
> We need to identify swap cache pages on file LRU firstly. It appears
> hard from the current definition of page flags.
>
> /* Filesystems */
> PG_checked = PG_owner_priv_1,
>
> /* SwapBacked */
> PG_swapcache = PG_owner_priv_1, /* Swap page: swp_entry_t in private */

Checking along with folio_test_swapbacked would do the trick, right?
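For readers following along: this is indeed how mainline already resolves the flag aliasing, which is what makes Michal's suggestion workable. A paraphrase (close to, but not guaranteed verbatim from, include/linux/page-flags.h):

```c
/*
 * PG_swapcache aliases PG_owner_priv_1, which filesystems reuse as
 * PG_checked, so the raw bit alone is ambiguous.  Only anon/shmem
 * folios set PG_swapbacked, and only those can enter the swap cache,
 * so testing both bits disambiguates swap-cache folios from
 * PG_checked filesystem folios even if they share an LRU.
 */
static __always_inline bool folio_test_swapcache(struct folio *folio)
{
        return folio_test_swapbacked(folio) &&
               test_bit(PG_swapcache, folio_flags(folio, 0));
}
```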
On Wed 22-11-23 09:52:42, Michal Hocko wrote:
> On Tue 21-11-23 22:44:32, Yosry Ahmed wrote:
> > On Tue, Nov 21, 2023 at 10:41 PM Liu Shixin <liushixin2@huawei.com> wrote:
[...]
> > I think this might be more intrusive than we think. Every time a page
> > is added to or removed from the swap cache, we will need to move it
> > between LRUs. All pages on the anon LRU will need to go through the
> > file LRU before being reclaimed. I think this might be too big of a
> > change to achieve this patch's goal.
>
> TBH I am not really sure how complex that might turn out to be.
> Swapcache tends to be full of subtle issues. So you might be right but
> it would be better to know _why_ this is not possible before we end up
> fishing for a couple of swapcache pages on potentially huge anon LRU to
> isolate them. Think of TB sized machines in this context.

Forgot to mention that it is not really far fetched from comparing this
to MADV_FREE pages. Those are anonymous but we do not want to keep them
on anon LRU because we want to age them independent of the swap
availability as they are just dropped during reclaim. Not too much
different from swapcache pages. There are more constraints on those but
fundamentally this is the same problem, no?
On Wed, Nov 22, 2023 at 2:09 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 22-11-23 09:52:42, Michal Hocko wrote:
> > On Tue 21-11-23 22:44:32, Yosry Ahmed wrote:
> > > On Tue, Nov 21, 2023 at 10:41 PM Liu Shixin <liushixin2@huawei.com> wrote:
[...]
> > > I think this might be more intrusive than we think. Every time a page
> > > is added to or removed from the swap cache, we will need to move it
> > > between LRUs. All pages on the anon LRU will need to go through the
> > > file LRU before being reclaimed. I think this might be too big of a
> > > change to achieve this patch's goal.
> >
> > TBH I am not really sure how complex that might turn out to be.
> > Swapcache tends to be full of subtle issues. So you might be right but
> > it would be better to know _why_ this is not possible before we end up
> > fishing for a couple of swapcache pages on potentially huge anon LRU to
> > isolate them. Think of TB sized machines in this context.
>
> Forgot to mention that it is not really far fetched from comparing this
> to MADV_FREE pages. Those are anonymous but we do not want to keep them
> on anon LRU because we want to age them independent of the swap
> availability as they are just dropped during reclaim. Not too much
> different from swapcache pages. There are more constraints on those but
> fundamentally this is the same problem, no?

I agree it's not a first, but swap cache pages are more complicated
because they can go back and forth, unlike MADV_FREE pages which
usually go on a one way ticket AFAICT. Also pages going into the swap
cache can be much more common than MADV_FREE pages for a lot of
workloads. I am not sure how different reclaim heuristics will react
to such mobility between the LRUs, and the fact that all pages will
now only get evicted through the file LRU. The anon LRU will
essentially become an LRU that feeds the file LRU. Also, the more
pages we move between LRUs, the more ordering violations we introduce,
as we may put colder pages in front of hotter pages or vice versa.

All in all, I am not saying it's a bad idea or not possible, I am just
saying it's probably more complicated than MADV_FREE, and adding more
cases where pages move between LRUs could introduce problems (or make
existing problems more visible).

> --
> Michal Hocko
> SUSE Labs
On Wed 22-11-23 02:39:15, Yosry Ahmed wrote:
> On Wed, Nov 22, 2023 at 2:09 AM Michal Hocko <mhocko@suse.com> wrote:
> > On Wed 22-11-23 09:52:42, Michal Hocko wrote:
[...]
> > Forgot to mention that it is not really far fetched from comparing this
> > to MADV_FREE pages. Those are anonymous but we do not want to keep them
> > on anon LRU because we want to age them independent of the swap
> > availability as they are just dropped during reclaim. Not too much
> > different from swapcache pages. There are more constraints on those but
> > fundamentally this is the same problem, no?
>
> I agree it's not a first, but swap cache pages are more complicated
> because they can go back and forth, unlike MADV_FREE pages which
> usually go on a one way ticket AFAICT.

Yes swapcache pages are indeed more complicated but most of the time
they just go away as well, no? MADV_FREE can be reinitiated if they are
written as well. So fundamentally they are not that different.

> Also pages going into the swap
> cache can be much more common than MADV_FREE pages for a lot of
> workloads. I am not sure how different reclaim heuristics will react
> to such mobility between the LRUs, and the fact that all pages will
> now only get evicted through the file LRU. The anon LRU will
> essentially become an LRU that feeds the file LRU. Also, the more
> pages we move between LRUs, the more ordering violations we introduce,
> as we may put colder pages in front of hotter pages or vice versa.

Well, traditionally the file LRU has been maintaining page cache or
easily disposable pages like MADV_FREE (which can be considered a cache
as well). Swapcache is a form of a page cache as well.

> All in all, I am not saying it's a bad idea or not possible, I am just
> saying it's probably more complicated than MADV_FREE, and adding more
> cases where pages move between LRUs could introduce problems (or make
> existing problems more visible).

Do we want to start adding filtered anon scan for a certain type of
pages? Because this is the question here AFAICS. This might seem an
easier solution but I would argue that it is less predictable one.
It is not unusual that a huge anon LRU would contain only very few
swapcache pages.

That being said, I might be missing some obvious or less obvious reasons
why this is a completely bad idea. Swapcache is indeed subtle.
On Wed, Nov 22, 2023 at 5:19 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 22-11-23 02:39:15, Yosry Ahmed wrote:
> > On Wed, Nov 22, 2023 at 2:09 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > I agree it's not a first, but swap cache pages are more complicated
> > because they can go back and forth, unlike MADV_FREE pages which
> > usually go on a one way ticket AFAICT.
>
> Yes swapcache pages are indeed more complicated but most of the time
> they just go away as well, no? MADV_FREE can be reinitiated if they are
> written as well. So fundamentally they are not that different.
>
> > Also pages going into the swap
> > cache can be much more common than MADV_FREE pages for a lot of
> > workloads. I am not sure how different reclaim heuristics will react
> > to such mobility between the LRUs, and the fact that all pages will
> > now only get evicted through the file LRU. The anon LRU will
> > essentially become an LRU that feeds the file LRU. Also, the more
> > pages we move between LRUs, the more ordering violations we introduce,
> > as we may put colder pages in front of hotter pages or vice versa.
>
> Well, traditionally the file LRU has been maintaining page cache or
> easily disposable pages like MADV_FREE (which can be considered a cache
> as well). Swapcache is a form of a page cache as well.

If I understand correctly, when we move the MADV_FREE pages to the file
LRU, we don't have correct information about their relative ordering
compared to the pages that are already in the inactive file LRU. IOW, we
mess up the ordering of the inactive file LRU a little. If we add more
cases of moving pages to the file LRU (for the swap cache), we may make
it worse. I am also not sure how this works with MGLRU generations.

Keep in mind that when a page is affected with MADV_FREE, it's always
cold. On the other hand, when a page is added to the swap cache, it
could be because it's under reclaim (cold), or it was just swapped in
(hot). I am not sure this practically matters, just something to think
about.

It also seems like all evictions will now be done from the file LRU, so
some heuristics (like workingset) may need to be updated accordingly.

> > All in all, I am not saying it's a bad idea or not possible, I am just
> > saying it's probably more complicated than MADV_FREE, and adding more
> > cases where pages move between LRUs could introduce problems (or make
> > existing problems more visible).
>
> Do we want to start adding filtered anon scan for a certain type of
> pages? Because this is the question here AFAICS. This might seem an
> easier solution but I would argue that it is less predictable one.
> It is not unusual that a huge anon LRU would contain only very few
> swapcache pages.

I agree that it may be a problem in some situations.

>
> That being said, I might be missing some obvious or less obvious reasons
> why this is a completely bad idea. Swapcache is indeed subtle.
> --
> Michal Hocko
> SUSE Labs
Michal Hocko <mhocko@suse.com> writes:

> On Wed 22-11-23 02:39:15, Yosry Ahmed wrote:
>> On Wed, Nov 22, 2023 at 2:09 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
>> I agree it's not a first, but swap cache pages are more complicated
>> because they can go back and forth, unlike MADV_FREE pages which
>> usually go on a one way ticket AFAICT.
>
> Yes swapcache pages are indeed more complicated but most of the time
> they just go away as well, no?

When we swapin a page, we will put it in swapcache too. And the page
can be in that state for long time if there is more than 50% free space
in the swap device.

> MADV_FREE can be reinitiated if they are
> written as well. So fundamentally they are not that different.

[snip]

--
Best Regards,
Huang, Ying
Hi Shixin,

On Tue, Nov 21, 2023 at 12:08 AM Liu Shixin <liushixin2@huawei.com> wrote:
>
> When spaces of swap devices are exhausted, only file pages can be
> reclaimed. But there are still some swapcache pages in anon lru list.
> This can lead to a premature out-of-memory.
>
> The problem is found with such steps:
>
> Firstly, set a 9MB disk swap space, then create a cgroup with 10MB
> memory limit, then run a program that allocates about 15MB memory.
>
> The problem occurs occasionally, which may need about 100 times [1].

Just out of my curiosity, in your usage case, how much additional
memory in terms of pages or MB can be freed by this patch, using
current code as base line?

Does the swap cache page reclaimed in swapcache_only mode, all swap
count drop to zero, and the only reason to stay in swap cache is to
avoid page IO write if we need to swap that page out again?

> Fix it by checking number of swapcache pages in can_reclaim_anon_pages().
> If the number is not zero, return true and set swapcache_only to 1.
> When scan anon lru list in swapcache_only mode, non-swapcache pages will
> be skipped to isolate in order to accelerate reclaim efficiency.

Here you said non-swapcache will be skipped if swapcache_only == 1

>
> However, in swapcache_only mode, the scan count still increased when scan
> non-swapcache pages because there are large number of non-swapcache pages
> and rare swapcache pages in swapcache_only mode, and if the non-swapcache

Here you suggest non-swapcache pages will also be scanned even when
swapcache_only == 1. It seems to contradict what you said above. I
feel that I am missing something here.

> is skipped and do not count, the scan of pages in isolate_lru_folios() can

Can you clarify which "scan of pages", are those pages swapcache pages
or non-swapcache pages?

> eventually lead to hung task, just as Sachin reported [2].
>
> By the way, since there are enough times of memory reclaim before OOM, it
> is not need to isolate too much swapcache pages in one times.
>
> [1]. https://lore.kernel.org/lkml/CAJD7tkZAfgncV+KbKr36=eDzMnT=9dZOT0dpMWcurHLr6Do+GA@mail.gmail.com/
> [2]. https://lore.kernel.org/linux-mm/CAJD7tkafz_2XAuqE8tGLPEcpLngewhUo=5US14PAtSM9tLBUQg@mail.gmail.com/
>
> Signed-off-by: Liu Shixin <liushixin2@huawei.com>
> Tested-by: Yosry Ahmed <yosryahmed@google.com>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> ---
> v9->v10: Use per-node swapcache suggested by Yu Zhao.
> v8->v9: Move the swapcache check after can_demote() and refactor
> can_reclaim_anon_pages() a bit.
> v7->v8: Reset swapcache_only at the beginning of can_reclaim_anon_pages().
> v6->v7: Reset swapcache_only to zero after there are swap spaces.
> v5->v6: Fix NULL pointer dereference and hung task problem reported by Sachin.
>
> mm/vmscan.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 506f8220c5fe..1fcc94717370 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -136,6 +136,9 @@ struct scan_control {
>         /* Always discard instead of demoting to lower tier memory */
>         unsigned int no_demotion:1;
>
> +       /* Swap space is exhausted, only reclaim swapcache for anon LRU */
> +       unsigned int swapcache_only:1;
> +
>         /* Allocation order */
>         s8 order;
>
> @@ -308,10 +311,36 @@ static bool can_demote(int nid, struct scan_control *sc)
>         return true;
> }
>
> +#ifdef CONFIG_SWAP
> +static bool can_reclaim_swapcache(struct mem_cgroup *memcg, int nid)
> +{
> +       struct pglist_data *pgdat = NODE_DATA(nid);
> +       unsigned long nr_swapcache;
> +
> +       if (!memcg) {
> +               nr_swapcache = node_page_state(pgdat, NR_SWAPCACHE);
> +       } else {
> +               struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +
> +               nr_swapcache = lruvec_page_state_local(lruvec, NR_SWAPCACHE);
> +       }
> +
> +       return nr_swapcache > 0;
> +}
> +#else
> +static bool can_reclaim_swapcache(struct mem_cgroup *memcg, int nid)
> +{
> +       return false;
> +}
> +#endif
> +
> static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
>                                           int nid,
>                                           struct scan_control *sc)
> {
> +       if (sc)
> +               sc->swapcache_only = 0;
> +

Minor nitpick. The sc->swapcache_only is first set to 0 then later set
to 1. Better use a local variable then write to sc->swapcache_only in
one go. If the scan_control has more than one thread accessing it, the
threads can see the flicker of 0->1 change. I don't think that is the
case in our current code, sc is created on stack. There are other
minor benefits as well: the "if (sc)" test only needs to be done once,
one store instruction.

Chris

>         if (memcg == NULL) {
>                 /*
>                  * For non-memcg reclaim, is there
> @@ -330,7 +359,17 @@ static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
>          *
>          * Can it be reclaimed from this node via demotion?
>          */
> -       return can_demote(nid, sc);
> +       if (can_demote(nid, sc))
> +               return true;
> +
> +       /* Is there any swapcache pages to reclaim in this node? */
> +       if (can_reclaim_swapcache(memcg, nid)) {
> +               if (sc)
> +                       sc->swapcache_only = 1;
> +               return true;
> +       }
> +
> +       return false;
> }
>
> /*
> @@ -1642,6 +1681,15 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
>          */
>         scan += nr_pages;
>
> +       /*
> +        * Count non-swapcache too because the swapcache pages may
> +        * be rare and it takes too much times here if not count
> +        * the non-swapcache pages.
> +        */
> +       if (unlikely(sc->swapcache_only && !is_file_lru(lru) &&
> +                    !folio_test_swapcache(folio)))
> +               goto move;
> +
>         if (!folio_test_lru(folio))
>                 goto move;
>         if (!sc->may_unmap && folio_mapped(folio))
> --
> 2.25.1
>
>
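For reference, a minimal sketch of the refactor Chris's nitpick describes, written against the v10 hunks quoted above; the control flow and variable names are illustrative, not code from any posted revision:

```c
/* Sketch only: compute the result into locals and publish
 * sc->swapcache_only exactly once at the end, so no reader can ever
 * observe a transient 0->1 flicker and the "if (sc)" branch is
 * evaluated a single time (one store instruction).
 */
static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
                                          int nid,
                                          struct scan_control *sc)
{
        bool swapcache_only = false;
        bool can_reclaim;

        if (memcg == NULL)
                /* For non-memcg reclaim, is there space in any swap device? */
                can_reclaim = get_nr_swap_pages() > 0;
        else
                /* Is the memcg below its swap limit? */
                can_reclaim = mem_cgroup_get_nr_swap_pages(memcg) > 0;

        if (!can_reclaim && can_demote(nid, sc)) {
                /* No swap space, but anon pages can still be demoted. */
                can_reclaim = true;
        } else if (!can_reclaim && can_reclaim_swapcache(memcg, nid)) {
                /* Only the swap cache is left to reclaim on the anon LRU. */
                can_reclaim = true;
                swapcache_only = true;
        }

        /* Publish the flag once, instead of 0-then-maybe-1. */
        if (sc)
                sc->swapcache_only = swapcache_only;

        return can_reclaim;
}
```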
On Tue, Nov 21, 2023 at 5:00 AM Michal Hocko <mhocko@suse.com> wrote:
> > However, in swapcache_only mode, the scan count still increased when scan
> > non-swapcache pages because there are large number of non-swapcache pages
> > and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> > is skipped and do not count, the scan of pages in isolate_lru_folios() can
> > eventually lead to hung task, just as Sachin reported [2].
>
> I find this paragraph really confusing! I guess what you meant to say is
> that a real swapcache_only is problematic because it can end up not
> making any progress, correct?
>
> AFAIU you have addressed that problem by making swapcache_only anon LRU
> specific, right? That would be certainly more robust as you can still
> reclaim from file LRUs. I cannot say I like that because swapcache_only
> is a bit confusing and I do not think we want to grow more special

That is my feeling as well. I don't like to have too many special
purpose modes either. It makes the whole process much harder to reason
about.

The comment seems to suggest it is not effective in some situations. I
am wondering if we can address that situation more directly without the
special mode.

At the same time I am not very familiar with the reclaim code path yet.
I need to learn more about this problem space to articulate my thoughts
better. I can dig in more; I might ask a lot of silly questions.

Chris

> purpose reclaim types. Would it be possible/reasonable to instead put
> swapcache pages on the file LRU instead?
> --
> Michal Hocko
> SUSE Labs
>
On Thu 23-11-23 14:15:59, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
[...]
> >> > Yes swapcache pages are indeed more complicated but most of the time
> >> > they just go away as well, no?
> >>
> >> When we swapin a page, we will put it in swapcache too. And the page
> >> can be in that state for long time if there is more than 50% free space
> >> in the swap device.

True, but why is that a problem? If you encounter such a swapped in page
on the file LRU then the page should be referenced and as such should be
activated, no?
Michal Hocko <mhocko@suse.com> writes:

> On Thu 23-11-23 14:15:59, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
> [...]
>> > Yes swapcache pages are indeed more complicated but most of the time
>> > they just go away as well, no?
>>
>> When we swapin a page, we will put it in swapcache too. And the page
>> can be in that state for long time if there is more than 50% free space
>> in the swap device.
>
> True, but why is that a problem? If you encounter such a swapped in page
> on the file LRU then the page should be referenced and as such should be
> activated, no?

This just means that anonymous pages in file LRU aren't temporary or
short-term. So we need to consider that. For example, the original
method to balance between anonymous pages and file pages need to be
re-designed. The anonymous pages are considered hotter than file pages
in general.

--
Best Regards,
Huang, Ying
Hi Ying,

On Sun, Nov 26, 2023 at 6:36 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Michal Hocko <mhocko@suse.com> writes:
[...]
> This just means that anonymous pages in file LRU aren't temporary or
> short-term. So we need to consider that. For example, the original
> method to balance between anonymous pages and file pages need to be
> re-designed. The anonymous pages are considered hotter than file pages
> in general.

I agree with Ying that anonymous pages typically have different page
access patterns than file pages, so we might want to treat them
differently to reclaim them effectively.

One random idea:
How about we put the anonymous page in a swap cache in a different LRU
than the rest of the anonymous pages. Then shrinking against those
pages in the swap cache would be more effective. Instead of having
[anon, file] LRU, now we have [anon not in swap cache, anon in swap
cache, file] LRU

Chris
Chris Li <chrisl@kernel.org> writes:

> Hi Ying,
>
> On Sun, Nov 26, 2023 at 6:36 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Michal Hocko <mhocko@suse.com> writes:
[...]
>> This just means that anonymous pages in file LRU aren't temporary or
>> short-term. So we need to consider that. For example, the original
>> method to balance between anonymous pages and file pages need to be
>> re-designed. The anonymous pages are considered hotter than file pages
>> in general.
>
> I agree with Ying that anonymous pages typically have different page
> access patterns than file pages, so we might want to treat them
> differently to reclaim them effectively.
>
> One random idea:
> How about we put the anonymous page in a swap cache in a different LRU
> than the rest of the anonymous pages. Then shrinking against those
> pages in the swap cache would be more effective. Instead of having
> [anon, file] LRU, now we have [anon not in swap cache, anon in swap
> cache, file] LRU

I don't think that it is necessary. The patch is only for a special use
case. Where the swap device is used up while some pages are in swap
cache. The patch will kill performance, but it is used to avoid OOM
only, not to improve performance. Per my understanding, we will not use
up swap device space in most cases. This may be true for ZRAM, but will
we keep pages in swap cache for long when we use ZRAM?

--
Best Regards,
Huang, Ying
On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
> > I agree with Ying that anonymous pages typically have different page
> > access patterns than file pages, so we might want to treat them
> > differently to reclaim them effectively.
> > One random idea:
> > How about we put the anonymous page in a swap cache in a different LRU
> > than the rest of the anonymous pages. Then shrinking against those
> > pages in the swap cache would be more effective. Instead of having
> > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
> > cache, file] LRU
>
> I don't think that it is necessary. The patch is only for a special use
> case. Where the swap device is used up while some pages are in swap
> cache. The patch will kill performance, but it is used to avoid OOM
> only, not to improve performance. Per my understanding, we will not use
> up swap device space in most cases. This may be true for ZRAM, but will
> we keep pages in swap cache for long when we use ZRAM?

I ask the question regarding how many pages can be freed by this patch
in this email thread as well, but haven't got the answer from the
author yet. That is one important aspect to evaluate how valuable is
that patch.

Regarding running out of swap space. That is a good point, in server
workload we don't typically run out of swap device space anyway.

Android uses ZRAM, the story might be different. Adding Minchan here.

Chris
On Mon 27-11-23 10:34:46, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
>
> > On Thu 23-11-23 14:15:59, Huang, Ying wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
> > [...]
> >> > Yes swapcache pages are indeed more complicated but most of the time
> >> > they just go away as well, no?
> >>
> >> When we swapin a page, we will put it in swapcache too. And the page
> >> can be in that state for long time if there is more than 50% free space
> >> in the swap device.
> >
> > True, but why is that a problem? If you encounter such a swapped in page
> > on the file LRU then the page should be referenced and as such should be
> > activated, no?
>
> This just means that anonymous pages in file LRU aren't temporary or
> short-term. So we need to consider that.

Right. On the other hand we could be more aggressive when dropping the
swapcache. Is there any actual reason why we cannot try to folio_free_swap
even when mem_cgroup_swap_full == F?
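For context, the gate being questioned here, paraphrased from memory of the reclaim path in mm/vmscan.c (shrink_folio_list(); not a verbatim quote): the swap slot of a swap-cache folio that stays resident is currently only dropped eagerly when swap is considered full:

```c
/* Paraphrased sketch of the existing condition: a swap-cache folio
 * that is kept in memory frees its swap slot only when swap space
 * itself is under pressure.  Michal is asking why folio_free_swap()
 * could not simply be attempted without the mem_cgroup_swap_full()
 * gate.
 */
if (folio_test_swapcache(folio) &&
    (mem_cgroup_swap_full(folio) || folio_test_mlocked(folio)))
        folio_free_swap(folio);
```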
On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
> On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
> > > I agree with Ying that anonymous pages typically have different page
> > > access patterns than file pages, so we might want to treat them
> > > differently to reclaim them effectively.
> > > One random idea:
> > > How about we put the anonymous page in a swap cache in a different LRU
> > > than the rest of the anonymous pages. Then shrinking against those
> > > pages in the swap cache would be more effective. Instead of having
> > > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
> > > cache, file] LRU
> >
> > I don't think that it is necessary. The patch is only for a special use
> > case. Where the swap device is used up while some pages are in swap
> > cache. The patch will kill performance, but it is used to avoid OOM
> > only, not to improve performance. Per my understanding, we will not use
> > up swap device space in most cases. This may be true for ZRAM, but will
> > we keep pages in swap cache for long when we use ZRAM?
>
> I ask the question regarding how many pages can be freed by this patch
> in this email thread as well, but haven't got the answer from the
> author yet. That is one important aspect to evaluate how valuable is
> that patch.

Exactly. Since swap cache has different life time with page cache, they
would be usually dropped when pages are unmapped (unless they are shared
with others but anon is usually exclusive private) so I wonder how much
memory we can save.

> Regarding running out of swap space. That is a good point, in server
> workload we don't typically run out of swap device space anyway.
>
> Android uses ZRAM, the story might be different. Adding Minchan here.

Swap is usually almost full in Android since it compacts (i.e., swapout)
background apps aggressively.
On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
[...]
> > I ask the question regarding how many pages can be freed by this patch
> > in this email thread as well, but haven't got the answer from the
> > author yet. That is one important aspect to evaluate how valuable is
> > that patch.
>
> Exactly. Since swap cache has different life time with page cache, they
> would be usually dropped when pages are unmapped (unless they are shared
> with others but anon is usually exclusive private) so I wonder how much
> memory we can save.

I think the point of this patch is not saving memory, but rather
avoiding an OOM condition that will happen if we have no swap space
left, but some pages left in the swap cache. Of course, the OOM
avoidance will come at the cost of extra work in reclaim to swap those
pages out.

The only case where I think this might be harmful is if there's plenty
of pages to reclaim on the file LRU, and instead we opt to chase down
the few swap cache pages. So perhaps we can add a check to only set
sc->swapcache_only if the number of pages in the swap cache is more
than the number of pages on the file LRU or similar? Just make sure we
don't chase the swapcache pages down if there's plenty to scan on the
file LRU?

> > Regarding running out of swap space. That is a good point, in server
> > workload we don't typically run out of swap device space anyway.
> >
> > Android uses ZRAM, the story might be different. Adding Minchan here.
>
> Swap is usually almost full in Android since it compacts (i.e., swapout)
> background apps aggressively.
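A sketch of the gating Yosry floats above, built on the shape of the v10 patch's can_reclaim_swapcache(); the comparison against the file LRU size is the proposal itself, not existing code, and the exact stats consulted are illustrative:

```c
/* Illustrative only: enter swapcache_only mode just when the swap
 * cache is at least comparable in size to the file LRU, so reclaim
 * does not chase a handful of swap-cache folios while plenty of file
 * pages remain reclaimable.
 */
static bool should_reclaim_swapcache(struct mem_cgroup *memcg, int nid)
{
        struct pglist_data *pgdat = NODE_DATA(nid);
        /* mem_cgroup_lruvec() falls back to the node lruvec when
         * memcg is NULL (global reclaim). */
        struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
        unsigned long nr_swapcache, nr_file;

        nr_swapcache = lruvec_page_state_local(lruvec, NR_SWAPCACHE);
        nr_file = lruvec_page_state_local(lruvec, NR_INACTIVE_FILE) +
                  lruvec_page_state_local(lruvec, NR_ACTIVE_FILE);

        return nr_swapcache > 0 && nr_swapcache >= nr_file;
}
```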
Michal Hocko <mhocko@suse.com> writes:

> On Mon 27-11-23 10:34:46, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>>
>> > On Thu 23-11-23 14:15:59, Huang, Ying wrote:
>> >> Michal Hocko <mhocko@suse.com> writes:
>> > [...]
>> >> > Yes swapcache pages are indeed more complicated but most of the time
>> >> > they just go away as well, no?
>> >>
>> >> When we swapin a page, we will put it in swapcache too. And the page
>> >> can be in that state for long time if there is more than 50% free space
>> >> in the swap device.
>> >
>> > True, but why is that a problem? If you encounter such a swapped in page
>> > on the file LRU then the page should be referenced and as such should be
>> > activated, no?
>>
>> This just means that anonymous pages in file LRU aren't temporary or
>> short-term. So we need to consider that.
>
> Right. On the other hand we could be more aggressive when dropping the
> swapcache. Is there any actual reason why we cannot try to folio_free_swap
> even when mem_cgroup_swap_full == F?

If there is plenty of free space in the swap device, why not take
advantage of it? If the page is in swap cache and clean, we can discard
it when it is reclaimed, or avoid allocating a swap entry for it.

--
Best Regards,
Huang, Ying
On 2023/11/24 1:19, Chris Li wrote:
> Hi Shixin,
>
> On Tue, Nov 21, 2023 at 12:08 AM Liu Shixin <liushixin2@huawei.com> wrote:
>> When spaces of swap devices are exhausted, only file pages can be
>> reclaimed. But there are still some swapcache pages in anon lru list.
>> This can lead to a premature out-of-memory.
>>
>> The problem is found with such steps:
>>
>> Firstly, set a 9MB disk swap space, then create a cgroup with 10MB
>> memory limit, then run a program that allocates about 15MB memory.
>>
>> The problem occurs occasionally, which may need about 100 times [1].
> Just out of my curiosity, in your usage case, how much additional
> memory in terms of pages or MB can be freed by this patch, using
> current code as base line?

My testcase is in a memory cgroup with a memory limit of 10MB; the
memory that can be freed is only about 5MB.

> Does the swap cache page reclaimed in swapcache_only mode, all swap
> count drop to zero, and the only reason to stay in swap cache is to
> avoid page IO write if we need to swap that page out again?

Yes.

>
>> Fix it by checking number of swapcache pages in can_reclaim_anon_pages().
>> If the number is not zero, return true and set swapcache_only to 1.
>> When scan anon lru list in swapcache_only mode, non-swapcache pages will
>> be skipped to isolate in order to accelerate reclaim efficiency.
> Here you said non-swapcache will be skipped if swapcache_only == 1
>
>> However, in swapcache_only mode, the scan count still increased when scan
>> non-swapcache pages because there are large number of non-swapcache pages
>> and rare swapcache pages in swapcache_only mode, and if the non-swapcache
> Here you suggest non-swapcache pages will also be scanned even when
> swapcache_only == 1. It seems to contradict what you said above. I
> feel that I am missing something here.

The swapcache pages and non-swapcache pages are both on the anon lru,
so when scanning anon pages, non-swapcache pages will also be scanned.
In isolate_lru_folios(), if we choose to put non-swapcache pages on the
folios_skipped list, the scan of the anon list will keep running until
enough swapcache pages are found, which can waste too much time. To
avoid such a problem, after we scan enough anon pages, we have to stop
even if we haven't isolated enough swapcache pages.

>
>> is skipped and do not count, the scan of pages in isolate_lru_folios() can
> Can you clarify which "scan of pages", are those pages swapcache pages
> or non-swapcache pages?

I mean the scan of anon pages, including both swapcache pages and
non-swapcache pages.

>
>> eventually lead to hung task, just as Sachin reported [2].
>>
>> By the way, since there are enough times of memory reclaim before OOM, it
>> is not need to isolate too much swapcache pages in one times.
>>
>> [1]. https://lore.kernel.org/lkml/CAJD7tkZAfgncV+KbKr36=eDzMnT=9dZOT0dpMWcurHLr6Do+GA@mail.gmail.com/
>> [2]. https://lore.kernel.org/linux-mm/CAJD7tkafz_2XAuqE8tGLPEcpLngewhUo=5US14PAtSM9tLBUQg@mail.gmail.com/
>>
>> Signed-off-by: Liu Shixin <liushixin2@huawei.com>
>> Tested-by: Yosry Ahmed <yosryahmed@google.com>
>> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
>> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
>> ---
[...]
>> static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
>>                                           int nid,
>>                                           struct scan_control *sc)
>> {
>> +       if (sc)
>> +               sc->swapcache_only = 0;
>> +
> Minor nitpick. The sc->swapcache_only is first set to 0 then later set
> to 1. Better use a local variable then write to sc->swapcache_only in
> one go. If the scan_control has more than one thread accessing it, the
> threads can see the flicker of 0->1 change. I don't think that is the
> case in our current code, sc is created on stack. There are other
> minor benefits as well: the "if (sc)" test only needs to be done once,
> one store instruction.
>
> Chris

Thanks for your advice. If we finally decide to use this patch, I will
revise it.

[...]
Yosry Ahmed <yosryahmed@google.com> writes:

> On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
>>
>> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
>> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
>> > > > I agree with Ying that anonymous pages typically have different page
>> > > > access patterns than file pages, so we might want to treat them
>> > > > differently to reclaim them effectively.
>> > > > One random idea:
>> > > > How about we put the anonymous pages in the swap cache on a different
>> > > > LRU than the rest of the anonymous pages? Then shrinking against those
>> > > > pages in the swap cache would be more effective. Instead of having
>> > > > [anon, file] LRUs, we would have [anon not in swap cache, anon in swap
>> > > > cache, file] LRUs.
>> > >
>> > > I don't think that is necessary. The patch is only for a special use
>> > > case, where the swap device is used up while some pages are in the swap
>> > > cache. The patch will kill performance, but it is used to avoid OOM
>> > > only, not to improve performance. Per my understanding, we will not use
>> > > up swap device space in most cases. This may be true for ZRAM, but will
>> > > we keep pages in swap cache for long when we use ZRAM?
>> >
>> > I asked the question regarding how many pages can be freed by this patch
>> > in this email thread as well, but haven't got the answer from the
>> > author yet. That is one important aspect for evaluating how valuable
>> > the patch is.
>>
>> Exactly. Since the swap cache has a different lifetime from the page cache,
>> its pages are usually dropped when pages are unmapped (unless they are
>> shared with others, but anon is usually exclusive private), so I wonder how
>> much memory we can save.
>
> I think the point of this patch is not saving memory, but rather
> avoiding an OOM condition that will happen if we have no swap space
> left, but some pages left in the swap cache. Of course, the OOM
> avoidance will come at the cost of extra work in reclaim to swap those
> pages out.
>
> The only case where I think this might be harmful is if there's plenty
> of pages to reclaim on the file LRU, and instead we opt to chase down
> the few swap cache pages. So perhaps we can add a check to only set
> sc->swapcache_only if the number of pages in the swap cache is more
> than the number of pages on the file LRU or similar? Just make sure we
> don't chase the swapcache pages down if there's plenty to scan on the
> file LRU?

The swap cache pages can be divided into 3 groups.

- group 1: pages that have been written out, at the tail of the inactive LRU, but not reclaimed yet.

- group 2: pages that have been written out, but failed to be reclaimed (e.g., were accessed before reclaiming).

- group 3: pages that have been swapped in, but were kept in the swap cache. These pages may be in the active LRU.

The main target of the original patch should be group 1. And those pages may be cheaper to reclaim than file pages.

Group 2 is hard to reclaim if swap_count() isn't 0.

Group 3 should be reclaimed in theory, but the overhead may be high. And we may need to reclaim the swap entries instead of the pages if the pages are hot. But we can start to reclaim the swap entries before the swap space runs out.

So, if we can count group 1, we may use that as an indicator to scan anon pages. And we may add code to reclaim group 3 earlier.

>> > Regarding running out of swap space. That is a good point; in server
>> > workloads we don't typically run out of swap device space anyway.

Thinking about this again: in a server workload, if we set a swap usage limit for a memcg, we may run into that limit. Is it common for a server workload to run out of the memcg's swap usage limit?

>> > Android uses ZRAM, the story might be different. Adding Minchan here.
>>
>> Swap is usually almost full in Android since it compacts (i.e., swaps out)
>> background apps aggressively.

If my understanding is correct, because ZRAM has SWP_SYNCHRONOUS_IO set, anonymous pages are only put in the swap cache temporarily during swap-out. So the remaining swap cache pages on the anon LRU should not be a problem for ZRAM.

--
Best Regards,
Huang, Ying
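For reference, the ZRAM behavior Ying mentions comes from the swap-in fast path: on a SWP_SYNCHRONOUS_IO device, an unshared swap entry is read synchronously into a fresh folio and never enters the swap cache. A condensed sketch of that gate (not the exact upstream code, and the helper name is invented):

/* Condensed sketch of the swap-in fast-path gate for devices like ZRAM. */
static bool swapin_can_skip_swapcache(struct swap_info_struct *si,
				      swp_entry_t entry)
{
	/* Synchronous device and no other swap-count users of the entry. */
	return data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
	       __swap_count(entry) == 1;
}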
On Mon, Nov 27, 2023 at 7:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> [...]
>
> The swap cache pages can be divided into 3 groups.
>
> - group 1: pages that have been written out, at the tail of the inactive
>   LRU, but not reclaimed yet.
>
> - group 2: pages that have been written out, but failed to be reclaimed
>   (e.g., were accessed before reclaiming).
>
> - group 3: pages that have been swapped in, but were kept in the swap
>   cache. These pages may be in the active LRU.
>
> The main target of the original patch should be group 1. And those pages
> may be cheaper to reclaim than file pages.
>
> [...]
>
> So, if we can count group 1, we may use that as an indicator to scan anon
> pages. And we may add code to reclaim group 3 earlier.

My point was not that reclaiming the pages in the swap cache is more expensive than reclaiming the pages in the file LRU. In a lot of cases, as you point out, the pages in the swap cache can just be dropped, so they may be as cheap or cheaper to reclaim than the pages in the file LRU.

My point was that scanning the anon LRU when swap space is exhausted to get to the pages in the swap cache may be much more expensive, because there may be a lot of pages on the anon LRU that are not in the swap cache, and hence are not reclaimable, unlike pages in the file LRU, which should mostly be reclaimable.

So what I am saying is that maybe we should not do the effort of scanning the anon LRU in the swapcache_only case unless there aren't a lot of pages to reclaim on the file LRU (relatively). For example, if we have 100 pages in the swap cache out of 10000 pages in the anon LRU, and there are 10000 pages in the file LRU, it's probably not worth scanning the anon LRU.
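A hedged sketch of the gate Yosry proposes: only enter swapcache_only scanning when the swap cache is large relative to the file LRU. The helper name and the simple ">" threshold are invented for illustration:

static bool swapcache_worth_scanning(struct mem_cgroup *memcg, int nid)
{
	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
	unsigned long nr_swapcache, nr_file;

	nr_swapcache = lruvec_page_state_local(lruvec, NR_SWAPCACHE);
	nr_file = lruvec_page_state(lruvec, NR_INACTIVE_FILE) +
		  lruvec_page_state(lruvec, NR_ACTIVE_FILE);

	/* Skip the anon LRU while the file LRU alone has plenty to offer. */
	return nr_swapcache > nr_file;
}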
Yosry Ahmed <yosryahmed@google.com> writes:

> On Mon, Nov 27, 2023 at 7:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> [...]
>
> My point was that scanning the anon LRU when swap space is exhausted
> to get to the pages in the swap cache may be much more expensive,
> because there may be a lot of pages on the anon LRU that are not in
> the swap cache, and hence are not reclaimable, unlike pages in the
> file LRU, which should mostly be reclaimable.
>
> So what I am saying is that maybe we should not do the effort of
> scanning the anon LRU in the swapcache_only case unless there aren't a
> lot of pages to reclaim on the file LRU (relatively). For example, if
> we have 100 pages in the swap cache out of 10000 pages in the anon
> LRU, and there are 10000 pages in the file LRU, it's probably not
> worth scanning the anon LRU.

For group 1 pages, they are at the tail of the anon inactive LRU, so the scan overhead is low too. For example, if the number of group 1 pages is 100, we just need to scan 100 pages to reclaim them. We can choose to stop scanning when the number of non-group-1 pages reaches some threshold.

--
Best Regards,
Huang, Ying
On Mon, Nov 27, 2023 at 8:05 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> [...]
>
> For group 1 pages, they are at the tail of the anon inactive LRU, so the
> scan overhead is low too. For example, if the number of group 1 pages is
> 100, we just need to scan 100 pages to reclaim them. We can choose to
> stop scanning when the number of non-group-1 pages reaches some
> threshold.

We should still try to reclaim pages in groups 2 & 3 before OOMing though. Maybe the motivation for this patch is group 1, but I don't see why we should special-case it. Pages in groups 2 & 3 should be roughly equally cheap to reclaim. They may have a higher refault cost, but IIUC we should still try to reclaim them before OOMing.
Yosry Ahmed <yosryahmed@google.com> writes:

> On Mon, Nov 27, 2023 at 8:05 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> [...]
>
> We should still try to reclaim pages in groups 2 & 3 before OOMing
> though. Maybe the motivation for this patch is group 1, but I don't
> see why we should special-case it. Pages in groups 2 & 3 should be
> roughly equally cheap to reclaim. They may have a higher refault cost,
> but IIUC we should still try to reclaim them before OOMing.

The scan cost of group 3 may be high; you may need to scan all anonymous pages to identify them. The reclaim cost of group 2 may be high; it may just cause thrashing (shared pages that are accessed by just one process). So I think that we can allow reclaiming group 1 in all cases, try to reclaim the swap entries of group 3 during normal LRU scanning after more than half of the swap space or limit is used, and, as a last resort before OOM, try to reclaim groups 2 and 3. Or, limit the scan count for groups 2 and 3.

BTW, in some situations OOM is not the worst outcome. For example, thrashing may kill interaction latency, while killing the memory hog (which may be caused by a memory leak) saves system response time.

--
Best Regards,
Huang, Ying
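Ying's staged policy could be summarized roughly as in the following sketch; the enum, the helper, and the stage boundaries are all hypothetical, not anything posted in the thread:

enum swap_reclaim_stage {
	STAGE_NORMAL,		/* plenty of swap space/limit left */
	STAGE_SWAP_TIGHT,	/* more than half of swap space/limit used */
	STAGE_LAST_RESORT,	/* close to OOM */
};

/* Which swapcache groups (1/2/3 as defined above) may be reclaimed now? */
static bool may_reclaim_swapcache_group(enum swap_reclaim_stage stage,
					int group)
{
	switch (stage) {
	case STAGE_NORMAL:
		/* Clean, already written out: always fair game. */
		return group == 1;
	case STAGE_SWAP_TIGHT:
		/* Also start releasing swap entries kept after swap-in. */
		return group == 1 || group == 3;
	case STAGE_LAST_RESORT:
		/* Try everything, with a limited scan count. */
		return true;
	}
	return false;
}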
On Mon, Nov 27, 2023 at 9:39 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> [...]
>
> The scan cost of group 3 may be high; you may need to scan all anonymous
> pages to identify them. The reclaim cost of group 2 may be high; it may
> just cause thrashing (shared pages that are accessed by just one
> process). So I think that we can allow reclaiming group 1 in all cases,
> try to reclaim the swap entries of group 3 during normal LRU scanning
> after more than half of the swap space or limit is used, and, as a last
> resort before OOM, try to reclaim groups 2 and 3. Or, limit the scan
> count for groups 2 and 3.

It would be nice if this could be done auto-magically without having to keep track of the groups separately.

> BTW, in some situations OOM is not the worst outcome. For example,
> thrashing may kill interaction latency, while killing the memory hog
> (which may be caused by a memory leak) saves system response time.

I agree that in some situations OOMs are better than thrashing; it's not an easy problem.
Yosry Ahmed <yosryahmed@google.com> writes:

> On Mon, Nov 27, 2023 at 9:39 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> [...]
>
> It would be nice if this could be done auto-magically without having to
> keep track of the groups separately.

A rough idea may be:

- try to scan the anon LRU if there are swap cache pages;

- if some number of pages other than group 1 are encountered, stop scanning the anon LRU list;

- the stopping threshold can be tuned according to whether we are about to OOM.

We can try to reclaim the swap entries of group 3 when we haven't run out of swap space yet.

>> BTW, in some situations OOM is not the worst outcome. For example,
>> thrashing may kill interaction latency, while killing the memory hog
>> (which may be caused by a memory leak) saves system response time.
>
> I agree that in some situations OOMs are better than thrashing; it's
> not an easy problem.

--
Best Regards,
Huang, Ying
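A sketch of the early-stop loop the bullets above describe. The helper and the limits are illustrative only, and real isolation would additionally need the LRU lock and folio reference handling, as in isolate_lru_folios():

static unsigned long isolate_swapcache_folios(struct list_head *src,
					      struct list_head *dst,
					      unsigned long nr_to_isolate,
					      bool near_oom)
{
	/* Loosen the give-up budget when OOM is imminent, per the bullets. */
	unsigned long skip_limit = near_oom ? 4096 : 512;	/* illustrative */
	unsigned long nr_isolated = 0, nr_skipped = 0;
	struct folio *folio, *next;

	/* Walk from the tail, where group 1 folios are expected to sit. */
	list_for_each_entry_safe_reverse(folio, next, src, lru) {
		if (nr_isolated >= nr_to_isolate || nr_skipped >= skip_limit)
			break;

		if (!folio_test_swapcache(folio)) {
			/* A non-group-1 folio: charge it to the budget. */
			nr_skipped += folio_nr_pages(folio);
			continue;
		}

		list_move(&folio->lru, dst);
		nr_isolated += folio_nr_pages(folio);
	}

	return nr_isolated;
}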
On Tue 28-11-23 09:31:06, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
[...]
> > Right. On the other hand we could be more aggressive when dropping the
> > swapcache. Is there any actual reason why we cannot try folio_free_swap
> > even when mem_cgroup_swap_full == F?
>
> If there is plenty of free space in the swap device, why not take
> advantage of it?

Maybe a stupid question, but what is the advantage of keeping it around in the swap cache?
On Tue, Nov 28, 2023 at 11:19:20AM +0800, Huang, Ying wrote:
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> [...]
>
> The swap cache pages can be divided into 3 groups.
>
> [...]
>
> The main target of the original patch should be group 1. And those pages
> may be cheaper to reclaim than file pages.

Yeah, that's common for asynchronous swap devices, and those are popular. Then how about freeing that memory as soon as the writeback is done, instead of adding more tricks to solve the issue?

https://lkml.kernel.org/linux-mm/1368411048-3753-1-git-send-email-minchan@kernel.org/

I remember it runs in softIRQ context, so there were some issues with changing the locking rules for memcg and swap. And there was some concern about increasing softirq latency due to the page freeing, but neither was the main obstacle to getting it fixed.

> Group 2 is hard to reclaim if swap_count() isn't 0.

"were accessed before reclaiming" would be rare.

> Group 3 should be reclaimed in theory, but the overhead may be high.
> And we may need to reclaim the swap entries instead of the pages if the
> pages are hot. But we can start to reclaim the swap entries before the
> swap space runs out.

I thought the swap-in path would reclaim the swap slots once it detects that swap space isn't enough (e.g., vm_swap_full or mem_cgroup_swap_full)?

> So, if we can count group 1, we may use that as an indicator to scan anon
> pages. And we may add code to reclaim group 3 earlier.
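The behavior Minchan refers to looks roughly like this: a condensed sketch of the checks done around the swap-in and reclaim paths, not a verbatim copy of either (the helper name is invented):

/* After a swap-in (folio locked, still in the swap cache): drop the
 * swap slot early when swap space is getting tight. */
static void maybe_free_swap_after_swapin(struct folio *folio)
{
	if (!folio_test_swapcache(folio))
		return;

	/* Global swap more than half full, or memcg swap near its limit. */
	if (vm_swap_full() || mem_cgroup_swap_full(folio))
		folio_free_swap(folio);	/* frees the slot if unshared */
}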
On Tue, Nov 28, 2023 at 11:16:04AM +0100, Michal Hocko wrote:
> On Tue 28-11-23 09:31:06, Huang, Ying wrote:
> > Michal Hocko <mhocko@suse.com> writes:
> [...]
> > > Right. On the other hand we could be more aggressive when dropping the
> > > swapcache. Is there any actual reason why we cannot try folio_free_swap
> > > even when mem_cgroup_swap_full == F?
> >
> > If there is plenty of free space in the swap device, why not take
> > advantage of it?
>
> Maybe a stupid question, but what is the advantage of keeping it around
> in the swap cache?

If the page is shared, we avoid additional IO to bring it back in, thanks to the swap cache. If the page is still clean at reclaim time since the swap-in, the VM doesn't need to write it out to the swap disk, since the original data is already there.
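Minchan's second case is the cheap-reclaim path for group 1 style pages: a clean, unmapped swapcache folio can be detached without any IO. A heavily abridged sketch, loosely modeled on the shrink_folio_list() flow (the folio is assumed locked, and the helper name is invented):

static bool try_drop_clean_swapcache(struct folio *folio)
{
	if (!folio_test_swapcache(folio))
		return false;

	/* Dirtied since swap-in: another writeout would be needed first. */
	if (folio_test_dirty(folio) || folio_test_writeback(folio))
		return false;

	/* Still mapped: rmap unmapping has to happen before the drop. */
	if (folio_mapped(folio))
		return false;

	/* Clean and unmapped: the swap slot already holds the data. */
	delete_from_swap_cache(folio);
	return true;
}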
On Tue, Nov 28, 2023 at 2:45 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Tue, Nov 28, 2023 at 11:16:04AM +0100, Michal Hocko wrote:
> >
> > [...]
> >
> > Maybe a stupid question, but what is the advantage of keeping it around
> > in the swap cache?
>
> If the page is shared, we avoid additional IO to bring it back in,
> thanks to the swap cache.

I think this case is actually necessary for correctness, not just to avoid additional IO. Otherwise subsequent swapins will create new copies of the page, right?

> If the page is still clean at reclaim time since the swap-in, the VM
> doesn't need to write it out to the swap disk, since the original data
> is already there.
On Tue, Nov 28, 2023 at 03:05:29PM -0800, Yosry Ahmed wrote:
> On Tue, Nov 28, 2023 at 2:45 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > [...]
> >
> > If the page is shared, we avoid additional IO to bring it back in,
> > thanks to the swap cache.
>
> I think this case is actually necessary for correctness, not just to
> avoid additional IO. Otherwise subsequent swapins will create new
> copies of the page, right?

I think if the page was shared by MAP_SHARED, then yes. If the page was shared by MAP_PRIVATE but CoW (e.g., fork), then no.
On Mon, Nov 27, 2023 at 1:57 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > Exactly. Since the swap cache has a different lifetime from the page
> > cache, its pages are usually dropped when pages are unmapped (unless
> > they are shared with others, but anon is usually exclusive private),
> > so I wonder how much memory we can save.
>
> I think the point of this patch is not saving memory, but rather
> avoiding an OOM condition that will happen if we have no swap space
> left, but some pages left in the swap cache. Of course, the OOM
> avoidance will come at the cost of extra work in reclaim to swap those
> pages out.

You are discussing how the freed memory gets used, e.g. by other apps, so as to avoid OOM. I am asking how much memory can be freed by this patch. That number is still useful to understand how effective the patch is. Does it justify the additional complexity?

> The only case where I think this might be harmful is if there's plenty
> of pages to reclaim on the file LRU, and instead we opt to chase down
> the few swap cache pages. So perhaps we can add a check to only set
> sc->swapcache_only if the number of pages in the swap cache is more
> than the number of pages on the file LRU or similar? Just make sure we
> don't chase the swapcache pages down if there's plenty to scan on the
> file LRU?

One idea is that we need to measure how effective the reclaim was toward the swap cache. If we do a lot of work but are not able to reclaim much from the swap cache, then additional reclaim work on the swap cache is not going to be effective. MGLRU has a PID controller to adjust the reclaim ratio between the file and anonymous LRUs. That is one way of measuring the effectiveness of the reclaim: use the feedback to adjust the follow-up reclaim work. I kind of wish we could have a feedback mechanism like this for the swap cache reclaim as well. Not sure how complicated it would be to implement one.

Chris
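As a toy illustration of the feedback idea — everything here (the struct, the helper, the gains) is hypothetical, and a real implementation would look more like MGLRU's PID controller:

struct swapcache_feedback {
	unsigned long scanned;	  /* folios looked at in swapcache-only mode */
	unsigned long reclaimed;  /* folios actually freed from swap cache */
};

static unsigned long next_swapcache_scan_budget(struct swapcache_feedback *fb,
						unsigned long base_budget)
{
	unsigned long budget;

	if (!fb->scanned)
		return base_budget;

	/* Proportional term: scale the budget by reclaimed/scanned. */
	budget = base_budget * fb->reclaimed / fb->scanned;

	/* Clamp so the budget never collapses to zero or explodes. */
	return clamp(budget, base_budget / 8, base_budget * 4);
}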
Minchan Kim <minchan@kernel.org> writes:

> On Tue, Nov 28, 2023 at 11:19:20AM +0800, Huang, Ying wrote:
>
> [...]
>
>> The main target of the original patch should be group 1. And those
>> pages may be cheaper to reclaim than file pages.
>
> Yeah, that's common for asynchronous swap devices, and those are
> popular. Then how about freeing that memory as soon as the writeback is
> done, instead of adding more tricks to solve the issue?
>
> https://lkml.kernel.org/linux-mm/1368411048-3753-1-git-send-email-minchan@kernel.org/
>
> I remember it runs in softIRQ context, so there were some issues with
> changing the locking rules for memcg and swap. And there was some
> concern about increasing softirq latency due to the page freeing, but
> neither was the main obstacle to getting it fixed.

Thanks for sharing. It's good to avoid adding the pages back to the LRU and then isolating them from the LRU again. My concern is: is it possible that too many pages get reclaimed? For example, to reclaim a small number of pages, many more pages may have been written to disk because of the performance difference between CPU and storage. Originally, we would still only reclaim the requested number of pages even though much more was written. But with the change, we may reclaim them all.

>> Group 2 is hard to reclaim if swap_count() isn't 0.
>
> "were accessed before reclaiming" would be rare.

If the page reclaim algorithm works well enough, that should be true.

>> Group 3 should be reclaimed in theory, but the overhead may be high.
>> And we may need to reclaim the swap entries instead of the pages if the
>> pages are hot. But we can start to reclaim the swap entries before the
>> swap space runs out.
>
> I thought the swap-in path would reclaim the swap slots once it detects
> that swap space isn't enough (e.g., vm_swap_full or
> mem_cgroup_swap_full)?

Yes. You are right. But before swap space becomes insufficient, we may keep quite a few pages in the swap cache, and these pages may become hot later. Then we have no opportunity to reclaim that swap space. So we may need to add some code to check this situation at the appropriate places, for example, when we scan pages in the active list, or activate pages from the inactive list.

>> So, if we can count group 1, we may use that as an indicator to scan
>> anon pages. And we may add code to reclaim group 3 earlier.

--
Best Regards,
Huang, Ying
On Tue 28-11-23 15:15:08, Minchan Kim wrote:
> On Tue, Nov 28, 2023 at 03:05:29PM -0800, Yosry Ahmed wrote:
> >
> > [...]
> >
> > I think this case is actually necessary for correctness, not just to
> > avoid additional IO. Otherwise subsequent swapins will create new
> > copies of the page, right?
>
> I think if the page was shared by MAP_SHARED, then yes.

In that case deleting from the swap cache would fail because there are still swapped-out ptes, no?

> If the page was shared by MAP_PRIVATE but CoW (e.g., fork), then no.

OK, but we are talking about memory pressure here and evicting available memory. So what is the actual cost/benefit model here? Is it really better to keep the swapcache around even under memory pressure if the CoWed page might never be faulted in?
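For reference, the guard Michal alludes to sits in folio_free_swap(); abridged from mm/swapfile.c (the locking assertion and hibernation check are omitted here):

/* Abridged: free the swap space used by a locked swapcache folio. */
bool folio_free_swap(struct folio *folio)
{
	if (!folio_test_swapcache(folio))
		return false;
	if (folio_test_writeback(folio))
		return false;
	if (folio_swapped(folio))	/* swapped-out PTEs still reference it */
		return false;

	delete_from_swap_cache(folio);
	folio_set_dirty(folio);		/* the data now lives only in memory */
	return true;
}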
On Tue 28-11-23 11:19:20, Huang, Ying wrote:
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
> >>
> >> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
> >> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> > > > I agree with Ying that anonymous pages typically have different page
> >> > > > access patterns than file pages, so we might want to treat them
> >> > > > differently to reclaim them effectively.
> >> > > > One random idea:
> >> > > > How about we put the anonymous pages that are in the swap cache on a
> >> > > > different LRU than the rest of the anonymous pages? Then shrinking
> >> > > > against those pages in the swap cache would be more effective. Instead
> >> > > > of having [anon, file] LRUs, we would now have [anon not in swap cache,
> >> > > > anon in swap cache, file] LRUs.
> >> > >
> >> > > I don't think that is necessary. The patch is only for a special use
> >> > > case, where the swap device is used up while some pages are in the swap
> >> > > cache. The patch will kill performance, but it is used only to avoid
> >> > > OOM, not to improve performance. Per my understanding, we will not use
> >> > > up the swap device space in most cases. This may be true for ZRAM, but
> >> > > will we keep pages in the swap cache for long when we use ZRAM?
> >> >
> >> > I asked in this email thread as well how many pages can be freed by this
> >> > patch, but haven't got an answer from the author yet. That is one
> >> > important aspect in evaluating how valuable the patch is.
> >>
> >> Exactly. Since the swap cache has a different lifetime from the page
> >> cache, swapcache pages are usually dropped when the pages are unmapped
> >> (unless they are shared with others, but anon pages are usually exclusive
> >> and private), so I wonder how much memory we can save.
> >
> > I think the point of this patch is not saving memory, but rather
> > avoiding an OOM condition that will happen if we have no swap space
> > left, but some pages left in the swap cache. Of course, the OOM
> > avoidance will come at the cost of extra work in reclaim to swap those
> > pages out.
> >
> > The only case where I think this might be harmful is if there are plenty
> > of pages to reclaim on the file LRU, and instead we opt to chase down
> > the few swap cache pages. So perhaps we can add a check to only set
> > sc->swapcache_only if the number of pages in the swap cache is more
> > than the number of pages on the file LRU, or similar? Just make sure we
> > don't chase the swapcache pages down if there's plenty to scan on the
> > file LRU.
>
> The swap cache pages can be divided into 3 groups.
>
> - group 1: pages that have been written out and sit at the tail of the
>   inactive LRU, but have not been reclaimed yet.
>
> - group 2: pages that have been written out, but failed to be reclaimed
>   (e.g., they were accessed before reclaiming).
>
> - group 3: pages that have been swapped in, but were kept in the swap
>   cache. These pages may be in the active LRU.
>
> The main target of the original patch should be group 1. And those pages
> may be cheaper to reclaim than file pages.

Thanks, this is a really useful summary. And it begs the question: how do we
tell these different types apart? A vmstat counter is certainly not
sufficient, and that means we might be scanning a lot without actually making
any progress. And doing that repeatedly.
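To make Michal's point concrete: the state that separates Ying's three groups
lives in per-folio flags and LRU position, not in any global counter, so any
classifier has to touch each folio. The sketch below is one loose reading of
the taxonomy; the mapping from flags to groups is an interpretation for
illustration, not code from any patch in this thread.

/*
 * Loose interpretation of the three-group taxonomy, for illustration
 * only. NR_SWAPCACHE lumps all of these together, which is exactly
 * the complaint above: telling them apart means inspecting folios.
 */
static const char *swapcache_group(struct folio *folio)
{
	if (!folio_test_swapcache(folio))
		return "not swapcache";

	/* Swapped back in and still mapped: roughly group 3. */
	if (folio_mapped(folio))
		return "group 3";

	/* Written out but referenced again: roughly group 2. */
	if (folio_test_active(folio) || folio_test_referenced(folio))
		return "group 2";

	/* Written out, unmapped, cold: group 1, the cheap target. */
	return "group 1";
}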
Michal Hocko <mhocko@suse.com> writes:

> On Tue 28-11-23 11:19:20, Huang, Ying wrote:
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim <minchan@kernel.org> wrote:
>> >>
>> >> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
>> >> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> > > > I agree with Ying that anonymous pages typically have different page
>> >> > > > access patterns than file pages, so we might want to treat them
>> >> > > > differently to reclaim them effectively.
>> >> > > > One random idea:
>> >> > > > How about we put the anonymous pages that are in the swap cache on a
>> >> > > > different LRU than the rest of the anonymous pages? Then shrinking
>> >> > > > against those pages in the swap cache would be more effective. Instead
>> >> > > > of having [anon, file] LRUs, we would now have [anon not in swap cache,
>> >> > > > anon in swap cache, file] LRUs.
>> >> > >
>> >> > > I don't think that is necessary. The patch is only for a special use
>> >> > > case, where the swap device is used up while some pages are in the swap
>> >> > > cache. The patch will kill performance, but it is used only to avoid
>> >> > > OOM, not to improve performance. Per my understanding, we will not use
>> >> > > up the swap device space in most cases. This may be true for ZRAM, but
>> >> > > will we keep pages in the swap cache for long when we use ZRAM?
>> >> >
>> >> > I asked in this email thread as well how many pages can be freed by this
>> >> > patch, but haven't got an answer from the author yet. That is one
>> >> > important aspect in evaluating how valuable the patch is.
>> >>
>> >> Exactly. Since the swap cache has a different lifetime from the page
>> >> cache, swapcache pages are usually dropped when the pages are unmapped
>> >> (unless they are shared with others, but anon pages are usually exclusive
>> >> and private), so I wonder how much memory we can save.
>> >
>> > I think the point of this patch is not saving memory, but rather
>> > avoiding an OOM condition that will happen if we have no swap space
>> > left, but some pages left in the swap cache. Of course, the OOM
>> > avoidance will come at the cost of extra work in reclaim to swap those
>> > pages out.
>> >
>> > The only case where I think this might be harmful is if there are plenty
>> > of pages to reclaim on the file LRU, and instead we opt to chase down
>> > the few swap cache pages. So perhaps we can add a check to only set
>> > sc->swapcache_only if the number of pages in the swap cache is more
>> > than the number of pages on the file LRU, or similar? Just make sure we
>> > don't chase the swapcache pages down if there's plenty to scan on the
>> > file LRU.
>>
>> The swap cache pages can be divided into 3 groups.
>>
>> - group 1: pages that have been written out and sit at the tail of the
>>   inactive LRU, but have not been reclaimed yet.
>>
>> - group 2: pages that have been written out, but failed to be reclaimed
>>   (e.g., they were accessed before reclaiming).
>>
>> - group 3: pages that have been swapped in, but were kept in the swap
>>   cache. These pages may be in the active LRU.
>>
>> The main target of the original patch should be group 1. And those pages
>> may be cheaper to reclaim than file pages.
>
> Thanks, this is a really useful summary. And it begs the question: how do we
> tell these different types apart? A vmstat counter is certainly not
> sufficient, and that means we might be scanning a lot without actually
> making any progress. And doing that repeatedly.
We don't have counters for pages in the individual groups. Pages in group 1,
and some pages in group 2, are always at the tail of the inactive LRU, so we
can identify them relatively easily.

So a simple method could be: if there are pages in the swap cache, try
scanning the tail of the inactive LRU to free pages in group 1 and move pages
in group 2. If we find some pages that aren't in the swap cache, there may be
no pages left in group 1; then we may give up scanning if the memory pressure
isn't too high.

One possible issue is that some pages (map count == 1) may be swapped out,
swapped in, and then deleted from the swap cache. So some pages not in the
swap cache may be at the tail of the inactive LRU too.

--
Best Regards,
Huang, Ying
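A rough sketch of the tail-scan heuristic Ying describes, with invented names:
the batch bound and the give-up rule are assumptions, and LRU locking plus
proper isolation accounting are omitted for brevity, so this is an outline of
the idea rather than a working implementation.

/*
 * Sketch only: walk a bounded batch at the tail of the inactive anon
 * LRU. Clean swapcache folios (group 1) become reclaim candidates,
 * referenced ones (group 2) are rotated away from the tail, and the
 * first non-swapcache folio is taken as a hint that group 1 is empty.
 * Caller must hold the lru lock; isolation accounting is omitted.
 */
static unsigned long scan_swapcache_tail(struct lruvec *lruvec,
					 struct list_head *reclaim_list)
{
	struct list_head *head = &lruvec->lists[LRU_INACTIVE_ANON];
	struct folio *folio, *next;
	unsigned long nr_candidates = 0;
	int nr_checked = 0;

	list_for_each_entry_safe_reverse(folio, next, head, lru) {
		/* Bound the tail scan so we never walk the whole list. */
		if (++nr_checked > SWAP_CLUSTER_MAX)
			break;

		/*
		 * Give up: likely no group 1 folios left (modulo the
		 * swapped-out/swapped-in/deleted case noted above).
		 */
		if (!folio_test_swapcache(folio))
			break;

		/* Group 2: accessed again, rotate it back toward the head. */
		if (folio_test_referenced(folio)) {
			list_move(&folio->lru, head);
			continue;
		}

		/* Group 1: written out and cold, queue it for reclaim. */
		list_move(&folio->lru, reclaim_list);
		nr_candidates++;
	}

	return nr_candidates;
}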
tl;dr, what's the viewpoint on this patch now?

I have a note here that we're hoping for input from Johannes, Vlastimil and
Mel (please).
Andrew Morton <akpm@linux-foundation.org> writes:

> tl;dr, what's the viewpoint on this patch now?
>
> I have a note here that we're hoping for input from Johannes, Vlastimil and
> Mel (please).

Michal has performance concerns. So, I think that we need to stop scanning
the anon LRU if we don't find enough pages in the swap cache at the tail of
the inactive list.

--
Best Regards,
Huang, Ying
On Fri, 15 Dec 2023 13:05:09 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:

> Andrew Morton <akpm@linux-foundation.org> writes:
>
> > tl;dr, what's the viewpoint on this patch now?
> >
> > I have a note here that we're hoping for input from Johannes, Vlastimil
> > and Mel (please).
>
> Michal has performance concerns. So, I think that we need to stop scanning
> the anon LRU if we don't find enough pages in the swap cache at the tail of
> the inactive list.

OK, thanks, I'll make the v10 patch go away for now.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 506f8220c5fe..1fcc94717370 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -136,6 +136,9 @@ struct scan_control {
 	/* Always discard instead of demoting to lower tier memory */
 	unsigned int no_demotion:1;
 
+	/* Swap space is exhausted, only reclaim swapcache for anon LRU */
+	unsigned int swapcache_only:1;
+
 	/* Allocation order */
 	s8 order;
 
@@ -308,10 +311,36 @@ static bool can_demote(int nid, struct scan_control *sc)
 	return true;
 }
 
+#ifdef CONFIG_SWAP
+static bool can_reclaim_swapcache(struct mem_cgroup *memcg, int nid)
+{
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	unsigned long nr_swapcache;
+
+	if (!memcg) {
+		nr_swapcache = node_page_state(pgdat, NR_SWAPCACHE);
+	} else {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		nr_swapcache = lruvec_page_state_local(lruvec, NR_SWAPCACHE);
+	}
+
+	return nr_swapcache > 0;
+}
+#else
+static bool can_reclaim_swapcache(struct mem_cgroup *memcg, int nid)
+{
+	return false;
+}
+#endif
+
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
 					  int nid,
 					  struct scan_control *sc)
 {
+	if (sc)
+		sc->swapcache_only = 0;
+
 	if (memcg == NULL) {
 		/*
 		 * For non-memcg reclaim, is there
@@ -330,7 +359,17 @@ static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
 	 *
 	 * Can it be reclaimed from this node via demotion?
 	 */
-	return can_demote(nid, sc);
+	if (can_demote(nid, sc))
+		return true;
+
+	/* Are there any swapcache pages to reclaim on this node? */
+	if (can_reclaim_swapcache(memcg, nid)) {
+		if (sc)
+			sc->swapcache_only = 1;
+		return true;
+	}
+
+	return false;
 }
 
 /*
@@ -1642,6 +1681,15 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
 		 */
 		scan += nr_pages;
 
+		/*
+		 * Count non-swapcache pages too, because swapcache pages
+		 * may be rare and it would take too long here if the
+		 * non-swapcache pages were not counted.
+		 */
+		if (unlikely(sc->swapcache_only && !is_file_lru(lru) &&
+			     !folio_test_swapcache(folio)))
+			goto move;
+
 		if (!folio_test_lru(folio))
 			goto move;
 		if (!sc->may_unmap && folio_mapped(folio))
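One note on why the cheap counter test in can_reclaim_swapcache() above is
meaningful: NR_SWAPCACHE is maintained per node and per memcg whenever folios
enter or leave the swap cache. The wrapper below is an invented helper for
illustration; the two accounting calls are paraphrased from the swap cache
add path in mm/swap_state.c and may differ slightly by kernel version.

/*
 * Invented helper, for illustration: the real updates live in the
 * swap cache add/delete paths (mm/swap_state.c). Swapcache folios are
 * accounted as file pages plus a dedicated NR_SWAPCACHE count, which
 * is what the nr_swapcache > 0 test above reads.
 */
static void account_swapcache(struct folio *folio, long nr)
{
	__node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
	__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
}

Because the counter only says that some swapcache folios exist somewhere on
the node or in the memcg, the isolate_lru_folios() hunk still has to skip
(but count) every non-swapcache folio it walks past, which is the scanning
overhead Michal objected to above.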