Message ID | 1557505420-21809-1-git-send-email-yang.shi@linux.alibaba.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] mm: vmscan: correct nr_reclaimed for THP | expand |
On Sat 11-05-19 00:23:40, Yang Shi wrote: > Since commit bd4c82c22c36 ("mm, THP, swap: delay splitting THP after > swapped out"), THP can be swapped out in a whole. But, nr_reclaimed > still gets inc'ed by one even though a whole THP (512 pages) gets > swapped out. > > This doesn't make too much sense to memory reclaim. For example, direct > reclaim may just need reclaim SWAP_CLUSTER_MAX pages, reclaiming one THP > could fulfill it. But, if nr_reclaimed is not increased correctly, > direct reclaim may just waste time to reclaim more pages, > SWAP_CLUSTER_MAX * 512 pages in worst case. You are technically right here. This has been a known issue for a while. I am wondering whether somebody actually noticed some misbehavior. Swapping out is a rare event and you have to have a considerable number of THPs to notice. > This change may result in more reclaimed pages than scanned pages showed > by /proc/vmstat since scanning one head page would reclaim 512 base pages. This is quite nasty and confusing. I am worried that having those two unsynced begs for subtle issues. Can we account THP as scanning 512 base pages as well? > Cc: "Huang, Ying" <ying.huang@intel.com> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Michal Hocko <mhocko@suse.com> > Cc: Mel Gorman <mgorman@suse.de> > Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> > Cc: Hugh Dickins <hughd@google.com> > Reviewed-by: Shakeel Butt <shakeelb@google.com> > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> > --- > v2: Added Shakeel's Reviewed-by > Use hpage_nr_pages instead of compound_order per Huang Ying and William Kucharski > > mm/vmscan.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index fd9de50..4226d6b 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1446,7 +1446,11 @@ static unsigned long shrink_page_list(struct list_head *page_list, > > unlock_page(page); > free_it: > - nr_reclaimed++; > + /* > + * THP may get swapped out in a whole, need account > + * all base pages. > + */ > + nr_reclaimed += hpage_nr_pages(page); > > /* > * Is there need to periodically free_page_list? It would > -- > 1.8.3.1 >
On Mon 13-05-19 14:09:59, Yang Shi wrote: [...] > I think we can just account 512 base pages for nr_scanned for > isolate_lru_pages() to make the counters sane since PGSCAN_KSWAPD/DIRECT > just use it. > > And, sc->nr_scanned should be accounted as 512 base pages too otherwise we > may have nr_scanned < nr_to_reclaim all the time to result in false-negative > for priority raise and something else wrong (e.g. wrong vmpressure). Be careful. nr_scanned is used as a pressure indicator to slab shrinking AFAIR. Maybe this is ok but it really begs for much more explaining than "it should be fine". This should have happened when THP swap out was implemented...
On Mon, May 13, 2019 at 2:45 PM Michal Hocko <mhocko@kernel.org> wrote: > > On Mon 13-05-19 14:09:59, Yang Shi wrote: > [...] > > I think we can just account 512 base pages for nr_scanned for > > isolate_lru_pages() to make the counters sane since PGSCAN_KSWAPD/DIRECT > > just use it. > > > > And, sc->nr_scanned should be accounted as 512 base pages too otherwise we > > may have nr_scanned < nr_to_reclaim all the time to result in false-negative > > for priority raise and something else wrong (e.g. wrong vmpressure). > > Be careful. nr_scanned is used as a pressure indicator to slab shrinking > AFAIR. Maybe this is ok but it really begs for much more explaining I don't know why my company mailbox didn't receive this email, so I replied with my personal email. It is not used to double slab pressure any more since commit 9092c71bb724 ("mm: use sc->priority for slab shrink targets"). It uses sc->priority to determine the pressure for slab shrinking now. So, I think we can just remove that "double slab pressure" code. It is not used actually and looks confusing now. Actually, the "double slab pressure" does something opposite. The extra inc to sc->nr_scanned just prevents from raising sc->priority. > than "it should be fine". This should have happened when THP swap out > was implemented... > > -- > Michal Hocko > SUSE Labs >
On Mon 13-05-19 21:36:59, Yang Shi wrote: > On Mon, May 13, 2019 at 2:45 PM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Mon 13-05-19 14:09:59, Yang Shi wrote: > > [...] > > > I think we can just account 512 base pages for nr_scanned for > > > isolate_lru_pages() to make the counters sane since PGSCAN_KSWAPD/DIRECT > > > just use it. > > > > > > And, sc->nr_scanned should be accounted as 512 base pages too otherwise we > > > may have nr_scanned < nr_to_reclaim all the time to result in false-negative > > > for priority raise and something else wrong (e.g. wrong vmpressure). > > > > Be careful. nr_scanned is used as a pressure indicator to slab shrinking > > AFAIR. Maybe this is ok but it really begs for much more explaining > > I don't know why my company mailbox didn't receive this email, so I > replied with my personal email. > > It is not used to double slab pressure any more since commit > 9092c71bb724 ("mm: use sc->priority for slab shrink targets"). It uses > sc->priority to determine the pressure for slab shrinking now. > > So, I think we can just remove that "double slab pressure" code. It is > not used actually and looks confusing now. Actually, the "double slab > pressure" does something opposite. The extra inc to sc->nr_scanned > just prevents from raising sc->priority. I have to get in sync with the recent changes. I am aware there were some patches floating around but I didn't get to review them. I was trying to point out that nr_scanned used to have a side effect to be careful about. If it doesn't have anymore then this is getting much more easier of course. Please document everything in the changelog.
On 5/13/19 11:20 PM, Michal Hocko wrote: > On Mon 13-05-19 21:36:59, Yang Shi wrote: >> On Mon, May 13, 2019 at 2:45 PM Michal Hocko <mhocko@kernel.org> wrote: >>> On Mon 13-05-19 14:09:59, Yang Shi wrote: >>> [...] >>>> I think we can just account 512 base pages for nr_scanned for >>>> isolate_lru_pages() to make the counters sane since PGSCAN_KSWAPD/DIRECT >>>> just use it. >>>> >>>> And, sc->nr_scanned should be accounted as 512 base pages too otherwise we >>>> may have nr_scanned < nr_to_reclaim all the time to result in false-negative >>>> for priority raise and something else wrong (e.g. wrong vmpressure). >>> Be careful. nr_scanned is used as a pressure indicator to slab shrinking >>> AFAIR. Maybe this is ok but it really begs for much more explaining >> I don't know why my company mailbox didn't receive this email, so I >> replied with my personal email. >> >> It is not used to double slab pressure any more since commit >> 9092c71bb724 ("mm: use sc->priority for slab shrink targets"). It uses >> sc->priority to determine the pressure for slab shrinking now. >> >> So, I think we can just remove that "double slab pressure" code. It is >> not used actually and looks confusing now. Actually, the "double slab >> pressure" does something opposite. The extra inc to sc->nr_scanned >> just prevents from raising sc->priority. > I have to get in sync with the recent changes. I am aware there were > some patches floating around but I didn't get to review them. I was > trying to point out that nr_scanned used to have a side effect to be > careful about. If it doesn't have anymore then this is getting much more > easier of course. Please document everything in the changelog. Thanks for reminding. Yes, I remembered nr_scanned would double slab pressure. But, when I inspected into the code yesterday, it turns out it is not true anymore. I will run some test to make sure it doesn't introduce regression. BTW, I noticed the counter of memory reclaim is not correct with THP swap on vanilla kernel, please see the below: pgsteal_kswapd 21435 pgsteal_direct 26573329 pgscan_kswapd 3514 pgscan_direct 14417775 pgsteal is always greater than pgscan, my patch could fix the problem. Anyway, I will elaborate these in the commit log.
On Tue, May 14, 2019 at 01:44:35PM -0700, Yang Shi wrote: > On 5/13/19 11:20 PM, Michal Hocko wrote: > > On Mon 13-05-19 21:36:59, Yang Shi wrote: > > > On Mon, May 13, 2019 at 2:45 PM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Mon 13-05-19 14:09:59, Yang Shi wrote: > > > > [...] > > > > > I think we can just account 512 base pages for nr_scanned for > > > > > isolate_lru_pages() to make the counters sane since PGSCAN_KSWAPD/DIRECT > > > > > just use it. > > > > > > > > > > And, sc->nr_scanned should be accounted as 512 base pages too otherwise we > > > > > may have nr_scanned < nr_to_reclaim all the time to result in false-negative > > > > > for priority raise and something else wrong (e.g. wrong vmpressure). > > > > Be careful. nr_scanned is used as a pressure indicator to slab shrinking > > > > AFAIR. Maybe this is ok but it really begs for much more explaining > > > I don't know why my company mailbox didn't receive this email, so I > > > replied with my personal email. > > > > > > It is not used to double slab pressure any more since commit > > > 9092c71bb724 ("mm: use sc->priority for slab shrink targets"). It uses > > > sc->priority to determine the pressure for slab shrinking now. > > > > > > So, I think we can just remove that "double slab pressure" code. It is > > > not used actually and looks confusing now. Actually, the "double slab > > > pressure" does something opposite. The extra inc to sc->nr_scanned > > > just prevents from raising sc->priority. > > I have to get in sync with the recent changes. I am aware there were > > some patches floating around but I didn't get to review them. I was > > trying to point out that nr_scanned used to have a side effect to be > > careful about. If it doesn't have anymore then this is getting much more > > easier of course. Please document everything in the changelog. > > Thanks for reminding. Yes, I remembered nr_scanned would double slab > pressure. But, when I inspected into the code yesterday, it turns out it is > not true anymore. I will run some test to make sure it doesn't introduce > regression. Yeah, sc->nr_scanned is used for three things right now: 1. vmpressure - this looks at the scanned/reclaimed ratio so it won't change semantics as long as scanned & reclaimed are fixed in parallel 2. compaction/reclaim - this is broken. Compaction wants a certain number of physical pages freed up before going back to compacting. Without Yang Shi's fix, we can overreclaim by a factor of 512. 3. kswapd priority raising - this is broken. kswapd raises priority if we scan fewer pages than the reclaim target (which itself is obviously expressed in order-0 pages). As a result, kswapd can falsely raise its aggressiveness even when it's making great progress. Both sc->nr_scanned & sc->nr_reclaimed should be fixed. > BTW, I noticed the counter of memory reclaim is not correct with THP swap on > vanilla kernel, please see the below: > > pgsteal_kswapd 21435 > pgsteal_direct 26573329 > pgscan_kswapd 3514 > pgscan_direct 14417775 > > pgsteal is always greater than pgscan, my patch could fix the problem. Ouch, how is that possible with the current code? I think it happens when isolate_lru_pages() counts 1 nr_scanned for a THP, then shrink_page_list() splits the THP and we reclaim tail pages one by one. This goes all the way back to the initial THP patch! isolate_lru_pages() needs to be fixed. Its return value, nr_taken, is correct, but its *nr_scanned parameter is wrong, which causes issues: 1. The trace point, as Yang Shi pointed out, will underreport the number of pages scanned, as it reports it along with nr_to_scan (base pages) and nr_taken (base pages) 2. vmstat and memory.stat count 'struct page' operations rather than base pages, which makes zero sense to neither user nor kernel developers (I routinely multiply these counters by 4096 to get a sense of work performed). All of isolate_lru_pages()'s accounting should be in base pages, which includes nr_scanned and PGSCAN_SKIPPED. That should also simplify the code; e.g.: for (total_scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan && !list_empty(src); total_scan++) { scan < nr_to_scan && nr_taken >= nr_to_scan is a weird condition that does not make sense in page reclaim imo. Reclaim cares about physical memory - freeing one THP is as much progress for reclaim as freeing 512 order-0 pages. IMO *all* '++' in vmscan.c are suspicious and should be reviewed: nr_scanned, nr_reclaimed, nr_dirty, nr_unqueued_dirty, nr_congested, nr_immediate, nr_writeback, nr_ref_keep, nr_unmap_fail, pgactivate, total_scan & scan, nr_skipped. Yang Shi, it would be nice if you could convert all of these to base page accounting in one patch, as it's a single logical fix for the initial introduction of THP that had huge pages show up on the LRUs. [ check_move_unevictable_pages() seems weird. It gets a pagevec from find_get_entries(), which, if I understand the THP page cache code correctly, might contain the same compound page over and over. It'll be !unevictable after the first iteration, so will only run once. So it produces incorrect numbers now, but it is probably best to ignore it until we figure out THP cache. Maybe add an XXX comment. ]
On 5/16/19 11:10 PM, Johannes Weiner wrote: > On Tue, May 14, 2019 at 01:44:35PM -0700, Yang Shi wrote: >> On 5/13/19 11:20 PM, Michal Hocko wrote: >>> On Mon 13-05-19 21:36:59, Yang Shi wrote: >>>> On Mon, May 13, 2019 at 2:45 PM Michal Hocko <mhocko@kernel.org> wrote: >>>>> On Mon 13-05-19 14:09:59, Yang Shi wrote: >>>>> [...] >>>>>> I think we can just account 512 base pages for nr_scanned for >>>>>> isolate_lru_pages() to make the counters sane since PGSCAN_KSWAPD/DIRECT >>>>>> just use it. >>>>>> >>>>>> And, sc->nr_scanned should be accounted as 512 base pages too otherwise we >>>>>> may have nr_scanned < nr_to_reclaim all the time to result in false-negative >>>>>> for priority raise and something else wrong (e.g. wrong vmpressure). >>>>> Be careful. nr_scanned is used as a pressure indicator to slab shrinking >>>>> AFAIR. Maybe this is ok but it really begs for much more explaining >>>> I don't know why my company mailbox didn't receive this email, so I >>>> replied with my personal email. >>>> >>>> It is not used to double slab pressure any more since commit >>>> 9092c71bb724 ("mm: use sc->priority for slab shrink targets"). It uses >>>> sc->priority to determine the pressure for slab shrinking now. >>>> >>>> So, I think we can just remove that "double slab pressure" code. It is >>>> not used actually and looks confusing now. Actually, the "double slab >>>> pressure" does something opposite. The extra inc to sc->nr_scanned >>>> just prevents from raising sc->priority. >>> I have to get in sync with the recent changes. I am aware there were >>> some patches floating around but I didn't get to review them. I was >>> trying to point out that nr_scanned used to have a side effect to be >>> careful about. If it doesn't have anymore then this is getting much more >>> easier of course. Please document everything in the changelog. >> Thanks for reminding. Yes, I remembered nr_scanned would double slab >> pressure. But, when I inspected into the code yesterday, it turns out it is >> not true anymore. I will run some test to make sure it doesn't introduce >> regression. > Yeah, sc->nr_scanned is used for three things right now: > > 1. vmpressure - this looks at the scanned/reclaimed ratio so it won't > change semantics as long as scanned & reclaimed are fixed in parallel > > 2. compaction/reclaim - this is broken. Compaction wants a certain > number of physical pages freed up before going back to compacting. > Without Yang Shi's fix, we can overreclaim by a factor of 512. > > 3. kswapd priority raising - this is broken. kswapd raises priority if > we scan fewer pages than the reclaim target (which itself is obviously > expressed in order-0 pages). As a result, kswapd can falsely raise its > aggressiveness even when it's making great progress. > > Both sc->nr_scanned & sc->nr_reclaimed should be fixed. Yes, v3 patch (sit in my local repo now) did fix both. > >> BTW, I noticed the counter of memory reclaim is not correct with THP swap on >> vanilla kernel, please see the below: >> >> pgsteal_kswapd 21435 >> pgsteal_direct 26573329 >> pgscan_kswapd 3514 >> pgscan_direct 14417775 >> >> pgsteal is always greater than pgscan, my patch could fix the problem. > Ouch, how is that possible with the current code? > > I think it happens when isolate_lru_pages() counts 1 nr_scanned for a > THP, then shrink_page_list() splits the THP and we reclaim tail pages > one by one. This goes all the way back to the initial THP patch! I think so. It does make sense. But, the weird thing is I just see this with synchronous swap device (some THPs got swapped out in a whole, some got split), but I've never seen this with rotate swap device (all THPs got split). I haven't figured out why. > > isolate_lru_pages() needs to be fixed. Its return value, nr_taken, is > correct, but its *nr_scanned parameter is wrong, which causes issues: > > 1. The trace point, as Yang Shi pointed out, will underreport the > number of pages scanned, as it reports it along with nr_to_scan (base > pages) and nr_taken (base pages) > > 2. vmstat and memory.stat count 'struct page' operations rather than > base pages, which makes zero sense to neither user nor kernel > developers (I routinely multiply these counters by 4096 to get a sense > of work performed). > > All of isolate_lru_pages()'s accounting should be in base pages, which > includes nr_scanned and PGSCAN_SKIPPED. > > That should also simplify the code; e.g.: > > for (total_scan = 0; > scan < nr_to_scan && nr_taken < nr_to_scan && !list_empty(src); > total_scan++) { > > scan < nr_to_scan && nr_taken >= nr_to_scan is a weird condition that > does not make sense in page reclaim imo. Reclaim cares about physical > memory - freeing one THP is as much progress for reclaim as freeing > 512 order-0 pages. Yes, I do agree. The v3 patch did this. > > IMO *all* '++' in vmscan.c are suspicious and should be reviewed: > nr_scanned, nr_reclaimed, nr_dirty, nr_unqueued_dirty, nr_congested, > nr_immediate, nr_writeback, nr_ref_keep, nr_unmap_fail, pgactivate, > total_scan & scan, nr_skipped. Some of them should be fine but I'm not sure the side effect. IMHO, let's fix the most obvious problem first. > > Yang Shi, it would be nice if you could convert all of these to base > page accounting in one patch, as it's a single logical fix for the > initial introduction of THP that had huge pages show up on the LRUs.\ Yes, sure. > > [ check_move_unevictable_pages() seems weird. It gets a pagevec from > find_get_entries(), which, if I understand the THP page cache code > correctly, might contain the same compound page over and over. It'll > be !unevictable after the first iteration, so will only run once. So > it produces incorrect numbers now, but it is probably best to ignore > it until we figure out THP cache. Maybe add an XXX comment. ]
On 5/20/19 5:43 PM, Yang Shi wrote: > > > On 5/16/19 11:10 PM, Johannes Weiner wrote: >> On Tue, May 14, 2019 at 01:44:35PM -0700, Yang Shi wrote: >>> On 5/13/19 11:20 PM, Michal Hocko wrote: >>>> On Mon 13-05-19 21:36:59, Yang Shi wrote: >>>>> On Mon, May 13, 2019 at 2:45 PM Michal Hocko <mhocko@kernel.org> >>>>> wrote: >>>>>> On Mon 13-05-19 14:09:59, Yang Shi wrote: >>>>>> [...] >>>>>>> I think we can just account 512 base pages for nr_scanned for >>>>>>> isolate_lru_pages() to make the counters sane since >>>>>>> PGSCAN_KSWAPD/DIRECT >>>>>>> just use it. >>>>>>> >>>>>>> And, sc->nr_scanned should be accounted as 512 base pages too >>>>>>> otherwise we >>>>>>> may have nr_scanned < nr_to_reclaim all the time to result in >>>>>>> false-negative >>>>>>> for priority raise and something else wrong (e.g. wrong >>>>>>> vmpressure). >>>>>> Be careful. nr_scanned is used as a pressure indicator to slab >>>>>> shrinking >>>>>> AFAIR. Maybe this is ok but it really begs for much more explaining >>>>> I don't know why my company mailbox didn't receive this email, so I >>>>> replied with my personal email. >>>>> >>>>> It is not used to double slab pressure any more since commit >>>>> 9092c71bb724 ("mm: use sc->priority for slab shrink targets"). It >>>>> uses >>>>> sc->priority to determine the pressure for slab shrinking now. >>>>> >>>>> So, I think we can just remove that "double slab pressure" code. >>>>> It is >>>>> not used actually and looks confusing now. Actually, the "double slab >>>>> pressure" does something opposite. The extra inc to sc->nr_scanned >>>>> just prevents from raising sc->priority. >>>> I have to get in sync with the recent changes. I am aware there were >>>> some patches floating around but I didn't get to review them. I was >>>> trying to point out that nr_scanned used to have a side effect to be >>>> careful about. If it doesn't have anymore then this is getting much >>>> more >>>> easier of course. Please document everything in the changelog. >>> Thanks for reminding. Yes, I remembered nr_scanned would double slab >>> pressure. But, when I inspected into the code yesterday, it turns >>> out it is >>> not true anymore. I will run some test to make sure it doesn't >>> introduce >>> regression. >> Yeah, sc->nr_scanned is used for three things right now: >> >> 1. vmpressure - this looks at the scanned/reclaimed ratio so it won't >> change semantics as long as scanned & reclaimed are fixed in parallel >> >> 2. compaction/reclaim - this is broken. Compaction wants a certain >> number of physical pages freed up before going back to compacting. >> Without Yang Shi's fix, we can overreclaim by a factor of 512. >> >> 3. kswapd priority raising - this is broken. kswapd raises priority if >> we scan fewer pages than the reclaim target (which itself is obviously >> expressed in order-0 pages). As a result, kswapd can falsely raise its >> aggressiveness even when it's making great progress. >> >> Both sc->nr_scanned & sc->nr_reclaimed should be fixed. > > Yes, v3 patch (sit in my local repo now) did fix both. > >> >>> BTW, I noticed the counter of memory reclaim is not correct with THP >>> swap on >>> vanilla kernel, please see the below: >>> >>> pgsteal_kswapd 21435 >>> pgsteal_direct 26573329 >>> pgscan_kswapd 3514 >>> pgscan_direct 14417775 >>> >>> pgsteal is always greater than pgscan, my patch could fix the problem. >> Ouch, how is that possible with the current code? >> >> I think it happens when isolate_lru_pages() counts 1 nr_scanned for a >> THP, then shrink_page_list() splits the THP and we reclaim tail pages >> one by one. This goes all the way back to the initial THP patch! > > I think so. It does make sense. But, the weird thing is I just see > this with synchronous swap device (some THPs got swapped out in a > whole, some got split), but I've never seen this with rotate swap > device (all THPs got split). > > I haven't figured out why. > >> >> isolate_lru_pages() needs to be fixed. Its return value, nr_taken, is >> correct, but its *nr_scanned parameter is wrong, which causes issues: >> >> 1. The trace point, as Yang Shi pointed out, will underreport the >> number of pages scanned, as it reports it along with nr_to_scan (base >> pages) and nr_taken (base pages) >> >> 2. vmstat and memory.stat count 'struct page' operations rather than >> base pages, which makes zero sense to neither user nor kernel >> developers (I routinely multiply these counters by 4096 to get a sense >> of work performed). >> >> All of isolate_lru_pages()'s accounting should be in base pages, which >> includes nr_scanned and PGSCAN_SKIPPED. >> >> That should also simplify the code; e.g.: >> >> for (total_scan = 0; >> scan < nr_to_scan && nr_taken < nr_to_scan && !list_empty(src); >> total_scan++) { >> >> scan < nr_to_scan && nr_taken >= nr_to_scan is a weird condition that >> does not make sense in page reclaim imo. Reclaim cares about physical >> memory - freeing one THP is as much progress for reclaim as freeing >> 512 order-0 pages. > > Yes, I do agree. The v3 patch did this. > >> >> IMO *all* '++' in vmscan.c are suspicious and should be reviewed: >> nr_scanned, nr_reclaimed, nr_dirty, nr_unqueued_dirty, nr_congested, >> nr_immediate, nr_writeback, nr_ref_keep, nr_unmap_fail, pgactivate, >> total_scan & scan, nr_skipped. > > Some of them should be fine but I'm not sure the side effect. IMHO, > let's fix the most obvious problem first. A quick review shows we should correct nr_scanned, nr_reclaimed, pgactivate, nr_skipped, nr_ref_keep and nr_unmap_fail since they are user visible (via cgroup, /proc/vmstat or trace point) and the wrong number may confuse and mislead the users. nr_dirty, nr_unqueued_dirty, nr_congested and nr_writeback are used by file cache, so they are not impacted by THP swap. > >> >> Yang Shi, it would be nice if you could convert all of these to base >> page accounting in one patch, as it's a single logical fix for the >> initial introduction of THP that had huge pages show up on the LRUs.\ > > Yes, sure. > >> >> [ check_move_unevictable_pages() seems weird. It gets a pagevec from >> find_get_entries(), which, if I understand the THP page cache code >> correctly, might contain the same compound page over and over. It'll >> be !unevictable after the first iteration, so will only run once. So >> it produces incorrect numbers now, but it is probably best to ignore >> it until we figure out THP cache. Maybe add an XXX comment. ] >
> [ check_move_unevictable_pages() seems weird. It gets a pagevec from > find_get_entries(), which, if I understand the THP page cache code > correctly, might contain the same compound page over and over. It'll > be !unevictable after the first iteration, so will only run once. So > it produces incorrect numbers now, but it is probably best to ignore > it until we figure out THP cache. Maybe add an XXX comment. ] The commit 5fd4ca2d84b2 ("mm: page cache: store only head pages in i_pages") changed how THP is stored in page cache, but find_get_entries() would return base page by calling find_subpage(), so check_move_unevictable_pages() should just returns the number of base pages.
diff --git a/mm/vmscan.c b/mm/vmscan.c index fd9de50..4226d6b 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1446,7 +1446,11 @@ static unsigned long shrink_page_list(struct list_head *page_list, unlock_page(page); free_it: - nr_reclaimed++; + /* + * THP may get swapped out in a whole, need account + * all base pages. + */ + nr_reclaimed += hpage_nr_pages(page); /* * Is there need to periodically free_page_list? It would