[v4,0/5] batched remove rmap in try_to_unmap_one()

Message ID: 20230313124526.1207490-1-fengwei.yin@intel.com

Message

Yin Fengwei March 13, 2023, 12:45 p.m. UTC
This series brings batched rmap removal to try_to_unmap_one().
Batched rmap removal is expected to give a performance gain over
removing the rmap one page at a time.

This series reconstructs try_to_unmap_one() from:
  loop:
     clear and update PTE
     unmap one page
     goto loop
to:
  loop:
     clear and update PTE
     goto loop
  unmap the range of folio in one call
It is one step toward always mapping/unmapping the entire folio in one
call, which can simplify the folio mapcount handling by avoiding
dealing with each page map/unmap individually.
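
To make the restructuring more concrete, here is a greatly simplified
sketch of the reworked flow. The helper names (try_to_unmap_one_page()
from patch 2, folio_remove_rmap_range() from patch 4) come from this
series, but the argument lists and the surrounding details (TLB flush,
swap entry setup, mlock and error handling) are illustrative only; the
real code is in patch 5:

/*
 * Greatly simplified sketch; the helper arguments are illustrative and
 * the walk is assumed to cover one contiguous range of subpages in
 * this vma.
 */
static bool try_to_unmap_one_sketch(struct folio *folio,
                struct vm_area_struct *vma, unsigned long address, void *arg)
{
        DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
        struct page *start = NULL;
        unsigned int nr_pages = 0;

        while (page_vma_mapped_walk(&pvmw)) {
                struct page *subpage = folio_page(folio,
                                pte_pfn(*pvmw.pte) - folio_pfn(folio));

                /* Clear and update the PTE for this subpage (patch 2 helper). */
                try_to_unmap_one_page(folio, vma, &pvmw, address);

                if (!start)
                        start = subpage;
                nr_pages++;
        }

        /* Remove the rmap for the whole range in one call (patch 4). */
        if (nr_pages)
                folio_remove_rmap_range(folio, start, nr_pages, vma);

        return true;
}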


The changes are organized as follows:
Patches 1 and 2 move the hugetlb and normal-page unmap into dedicated
functions to make the try_to_unmap_one() logic clearer and easier to
extend with batched rmap removal. To make code review easier, there is
no functional change.

Patch 3 cleans up try_to_unmap_one_page() and removes some duplicated
function calls.

Patch 4 adds folio_remove_rmap_range(), which removes the rmap for a
range of pages in one batched call (a sketch of the interface follows
the patch list).

Patch 5 makes try_to_unmap_one() remove the rmap in a batched way.
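
For reference, the interface added by patch 4 looks roughly like the
declaration below; treat it as a sketch of the intent rather than the
exact signature (that is in the patch itself):

/*
 * Remove the rmap for nr_pages consecutive subpages of folio, starting
 * at page and mapped into vma.  Mapcount and lruvec stat updates are
 * done once for the whole range instead of once per page.
 */
void folio_remove_rmap_range(struct folio *folio, struct page *page,
                             int nr_pages, struct vm_area_struct *vma);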

Functional testing was done with the V3 patchset in a qemu guest
with 4G memory:
  - kernel mm selftests, to trigger the vmscan() path and finally hit
    try_to_unmap_one().
  - hwpoison injection into a hugetlb page, to trigger the
    try_to_unmap_one() call against hugetlb (one way to do such an
    injection is sketched after this list).
  - 8 hours of stress testing: Firefox + kernel mm selftests + kernel
    build.
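
For the hwpoison test mentioned above, the injection was done from user
space; a minimal sketch of one way to do it is below. It assumes
CONFIG_MEMORY_FAILURE, root privileges and reserved hugetlb pages (e.g.
via /proc/sys/vm/nr_hugepages), and is not the exact tool used for the
testing:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2UL * 1024 * 1024;         /* one 2M hugetlb page */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(p, 1, len);      /* fault the hugetlb page in */

        /* Soft-inject poison; the kernel then unmaps the page via try_to_unmap(). */
        if (madvise(p, getpagesize(), MADV_HWPOISON))
                perror("madvise(MADV_HWPOISON)");

        munmap(p, len);
        return 0;
}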

To demonstrate the performance gain, MADV_PAGEOUT was changed not to
split large page cache folios, and a micro benchmark was created,
mainly as follows:

        #define FILESIZE (2 * 1024 * 1024)

        /*
         * fd refers to one of the 96 per-instance files, pgsize is the
         * base page size and count is reported when the 1 second run ends.
         */
        long pgsize = getpagesize();
        unsigned long count = 0;
        volatile char cc;
        char *c = mmap(NULL, FILESIZE, PROT_READ|PROT_WRITE,
                       MAP_PRIVATE, fd, 0);

        while (1) {
                unsigned long i;

                /* Touch every page to fault the file mapping back in... */
                for (i = 0; i < FILESIZE; i += pgsize)
                        cc = *(volatile char *)(c + i);

                /* ...then ask the kernel to reclaim it again. */
                madvise(c, FILESIZE, MADV_PAGEOUT);
                count++;
        }
        munmap(c, FILESIZE);

Run it with 96 instances + 96 files on an xfs file system for 1
second. The test platform was Ice Lake with 48C/96T + 192G memory.

The test result (loop count) shows around a 7% improvement
(58865 -> 63247) with this patch series. perf shows the following:

Without this series:
18.26%--try_to_unmap_one
        |          
        |--10.71%--page_remove_rmap
        |          |          
        |           --9.81%--__mod_lruvec_page_state
        |                     |          
        |                     |--1.36%--__mod_memcg_lruvec_state
        |                     |          |          
        |                     |           --0.80%--cgroup_rstat_updated
        |                     |          
        |                      --0.67%--__mod_lruvec_state
        |                                |          
        |                                 --0.59%--__mod_node_page_state
        |          
        |--5.41%--ptep_clear_flush
        |          |          
        |           --4.64%--flush_tlb_mm_range
        |                     |          
        |                      --3.88%--flush_tlb_func
        |                                |          
        |                                 --3.56%--native_flush_tlb_one_user
        |          
        |--0.75%--percpu_counter_add_batch
        |          
         --0.53%--PageHeadHuge

With this series:
9.87%--try_to_unmap_one
        |          
        |--7.14%--try_to_unmap_one_page.constprop.0.isra.0
        |          |          
        |          |--5.21%--ptep_clear_flush
        |          |          |          
        |          |           --4.36%--flush_tlb_mm_range
        |          |                     |          
        |          |                      --3.54%--flush_tlb_func
        |          |                                |          
        |          |                                 --3.17%--native_flush_tlb_one_user
        |          |          
        |           --0.82%--percpu_counter_add_batch
        |          
        |--1.18%--folio_remove_rmap_and_update_count.part.0
        |          |          
        |           --1.11%--folio_remove_rmap_range
        |                     |          
        |                      --0.53%--__mod_lruvec_page_state
        |          
         --0.57%--PageHeadHuge

As expected, the cost of __mod_lruvec_page_state() is reduced
significantly with the batched folio_remove_rmap_range(). The page
reclaim path is expected to get the same benefit as well.


This series is based on next-20230310.

Changes from v3:
  - General
    - Rebase to next-20230310
    - Add performance testing result

  - Patch1
    - Fixed incorrect comments as Mike Kravetz pointed out
    - Use huge_pte_dirty() as Mike Kravetz suggested
    - Use true instead of folio_test_hugetlb() in
      try_to_unmap_one_hugetlb(), as it is surely a hugetlb page,
      as Mike Kravetz suggested

Changes from v2:
  - General
    - Rebase the patch to next-20230303
    - Update cover letter about the preparation to unmap
      the entire folio in one call
    - No code change compared with V2, but fixed the patch-apply
      conflict caused by the wrong patch order in V2.

Changes from v1:
  - General
    - Rebase the patch to next-20230228

  - Patch1
    - Removed the if (PageHWPoison(page) && !(flags & TTU_HWPOISON))
      check, as suggested by Mike Kravetz and HORIGUCHI NAOYA
    - Removed the mlock_drain_local() call, as suggested by Mike Kravetz
    - Removed the comments about the mm counter change, as suggested
      by Mike Kravetz

Yin Fengwei (5):
  rmap: move hugetlb try_to_unmap to dedicated function
  rmap: move page unmap operation to dedicated function
  rmap: cleanup exit path of try_to_unmap_one_page()
  rmap: add folio_remove_rmap_range()
  try_to_unmap_one: batched remove rmap, update folio refcount

 include/linux/rmap.h |   5 +
 mm/page_vma_mapped.c |  30 +++
 mm/rmap.c            | 623 +++++++++++++++++++++++++------------------
 3 files changed, 398 insertions(+), 260 deletions(-)

Comments

Andrew Morton March 13, 2023, 6:49 p.m. UTC | #1
On Mon, 13 Mar 2023 20:45:21 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:

> This series is trying to bring the batched rmap removing to
> try_to_unmap_one(). It's expected that the batched rmap
> removing bring performance gain than remove rmap per page.
> 
> This series reconstruct the try_to_unmap_one() from:
>   loop:
>      clear and update PTE
>      unmap one page
>      goto loop
> to:
>   loop:
>      clear and update PTE
>      goto loop
>   unmap the range of folio in one call
> It is one step to always map/unmap the entire folio in one call.
> Which can simplify the folio mapcount handling by avoid dealing
> with each page map/unmap.
>
> ...
> 
> For performance gain demonstration, changed the MADV_PAGEOUT not
> to split the large folio for page cache and created a micro
> benchmark mainly as following:

Please remind me why it's necessary to patch the kernel to actually
performance test this?  And why it's proving so hard to demonstrate
benefits in real-world workloads?

(Yes, this was touched on in earlier discussion, but I do think these
considerations should be spelled out in the [0/N] changelog).

Thanks.
Yin Fengwei March 14, 2023, 3:09 a.m. UTC | #2
On 3/14/23 02:49, Andrew Morton wrote:
> On Mon, 13 Mar 2023 20:45:21 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
> 
>> This series is trying to bring the batched rmap removing to
>> try_to_unmap_one(). It's expected that the batched rmap
>> removing bring performance gain than remove rmap per page.
>>
>> This series reconstruct the try_to_unmap_one() from:
>>    loop:
>>       clear and update PTE
>>       unmap one page
>>       goto loop
>> to:
>>    loop:
>>       clear and update PTE
>>       goto loop
>>    unmap the range of folio in one call
>> It is one step to always map/unmap the entire folio in one call.
>> Which can simplify the folio mapcount handling by avoid dealing
>> with each page map/unmap.
>>
>> ...
>>
>> For performance gain demonstration, changed the MADV_PAGEOUT not
>> to split the large folio for page cache and created a micro
>> benchmark mainly as following:
> 
> Please remind me why it's necessary to patch the kernel to actually
> performance test this?  And why it's proving so hard to demonstrate
> benefits in real-world workloads?
> 
> (Yes, this was touched on in earlier discussion, but I do think these
> considerations should be spelled out in the [0/N] changelog).
OK. What about adding the following to the cover letter:
"
  The performance gain of this series can be demonstrated with large
  folio reclaim. In the current kernel, the vmscan() path will benefit
  from the changes. But there is no workload/benchmark that can show
  the exact performance gain for the vmscan() path, as far as I am
  aware.

  Another way to demonstrate the performance benefit is using
  MADV_PAGEOUT, which can also trigger page reclaim. The problem is
  that MADV_PAGEOUT always splits the large folio because it's not
  aware of large folios for the page cache currently. To show the
  performance benefit, MADV_PAGEOUT is updated not to split the large
  folio.

  For the long term, with wider adoption of large folios in the kernel
  (like large folios for anonymous pages), MADV_PAGEOUT needs to be
  updated to handle the large folio as a whole to avoid always
  splitting it.
"


Regards
Yin, Fengwei

> 
> Thanks.
David Hildenbrand March 14, 2023, 9:16 a.m. UTC | #3
On 14.03.23 04:09, Yin Fengwei wrote:
> On 3/14/23 02:49, Andrew Morton wrote:
>> On Mon, 13 Mar 2023 20:45:21 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
>>
>>> This series is trying to bring the batched rmap removing to
>>> try_to_unmap_one(). It's expected that the batched rmap
>>> removing bring performance gain than remove rmap per page.
>>>
>>> This series reconstruct the try_to_unmap_one() from:
>>>     loop:
>>>        clear and update PTE
>>>        unmap one page
>>>        goto loop
>>> to:
>>>     loop:
>>>        clear and update PTE
>>>        goto loop
>>>     unmap the range of folio in one call
>>> It is one step to always map/unmap the entire folio in one call.
>>> Which can simplify the folio mapcount handling by avoid dealing
>>> with each page map/unmap.
>>>
>>> ...
>>>
>>> For performance gain demonstration, changed the MADV_PAGEOUT not
>>> to split the large folio for page cache and created a micro
>>> benchmark mainly as following:
>>
>> Please remind me why it's necessary to patch the kernel to actually
>> performance test this?  And why it's proving so hard to demonstrate
>> benefits in real-world workloads?
>>
>> (Yes, this was touched on in earlier discussion, but I do think these
>> considerations should be spelled out in the [0/N] changelog).
> OK. What about add following in cover letter:
> "
>    The performance gain of this series can be demonstrated with large
>    folio reclaim. In current kernel, vmscan() path will be benefited by
>    the changes. But there is no workload/benchmark can show the exact
>    performance gain for vmscan() path as far as I am aware.
> 
>    Another way to demonstrate the performance benefit is using
>    MADV_PAGEOUT which can trigger page reclaim also. The problem is that
>    MADV_PAGEOUT always split the large folio because it's not aware of
>    large folio for page cache currently. To show the performance benefit,
>    MADV_PAGEOUT is updated not to split the large folio.
> 
>    For long term with wider adoption of large folio in kernel (like large
>    folio for anonymous page), MADV_PAGEOUT needs be updated to handle
>    large folio as whole to avoid splitting it always.

Just curious what the last sentence implies. Large folios are supposed 
to be a transparent optimization. So why should we pageout all 
surrounding subpages simply because a single subpage was requested to be 
paged out? That might harm performance of some workloads ... more than 
the actual split.

So it's not immediately obvious to me why "avoid splitting" is the 
correct answer to the problem at hand.
Matthew Wilcox March 14, 2023, 9:48 a.m. UTC | #4
On Tue, Mar 14, 2023 at 10:16:09AM +0100, David Hildenbrand wrote:
> On 14.03.23 04:09, Yin Fengwei wrote:
> >    For long term with wider adoption of large folio in kernel (like large
> >    folio for anonymous page), MADV_PAGEOUT needs be updated to handle
> >    large folio as whole to avoid splitting it always.
> 
> Just curious what the last sentence implies. Large folios are supposed to be
> a transparent optimization. So why should we pageout all surrounding
> subpages simply because a single subpage was requested to be paged out? That
> might harm performance of some workloads ... more than the actual split.
> 
> So it's not immediately obvious to me why "avoid splitting" is the correct
> answer to the problem at hand.

Even if your madvise() call says to pageout all pages covered by a
folio, the current code will split it.  That's what needs to be fixed.

At least for anonymous pages, using large folios is an attempt to treat
all pages in a particular range the same way.  If the user says to only
page out some of them, that's a big clue that these pages are different
from the other pages, and so we should split a folio where the madvise
call does not cover every page in the folio.

I'm less convinced that argument holds for page cache pages.
David Hildenbrand March 14, 2023, 9:50 a.m. UTC | #5
On 14.03.23 10:48, Matthew Wilcox wrote:
> On Tue, Mar 14, 2023 at 10:16:09AM +0100, David Hildenbrand wrote:
>> On 14.03.23 04:09, Yin Fengwei wrote:
>>>     For long term with wider adoption of large folio in kernel (like large
>>>     folio for anonymous page), MADV_PAGEOUT needs be updated to handle
>>>     large folio as whole to avoid splitting it always.
>>
>> Just curious what the last sentence implies. Large folios are supposed to be
>> a transparent optimization. So why should we pageout all surrounding
>> subpages simply because a single subpage was requested to be paged out? That
>> might harm performance of some workloads ... more than the actual split.
>>
>> So it's not immediately obvious to me why "avoid splitting" is the correct
>> answer to the problem at hand.
> 
> Even if your madvise() call says to pageout all pages covered by a
> folio, the current code will split it.  That's what needs to be fixed.

Agreed, if possible in the future (swap handling ...).

> 
> At least for anonymous pages, using large folios is an attempt to treat
> all pages in a particular range the same way.  If the user says to only
> page out some of them, that's a big clue that these pages are different
> from the other pages, and so we should split a folio where the madvise
> call does not cover every page in the folio.

Agreed.
Yin Fengwei March 14, 2023, 2:50 p.m. UTC | #6
On 3/14/2023 5:48 PM, Matthew Wilcox wrote:
> On Tue, Mar 14, 2023 at 10:16:09AM +0100, David Hildenbrand wrote:
>> On 14.03.23 04:09, Yin Fengwei wrote:
>>>    For long term with wider adoption of large folio in kernel (like large
>>>    folio for anonymous page), MADV_PAGEOUT needs be updated to handle
>>>    large folio as whole to avoid splitting it always.
>>
>> Just curious what the last sentence implies. Large folios are supposed to be
>> a transparent optimization. So why should we pageout all surrounding
>> subpages simply because a single subpage was requested to be paged out? That
>> might harm performance of some workloads ... more than the actual split.
>>
>> So it's not immediately obvious to me why "avoid splitting" is the correct
>> answer to the problem at hand.
> 
> Even if your madvise() call says to pageout all pages covered by a
> folio, the current code will split it.  That's what needs to be fixed.
Yes. This is my understanding.

> 
> At least for anonymous pages, using large folios is an attempt to treat
> all pages in a particular range the same way.  If the user says to only
> page out some of them, that's a big clue that these pages are different
> from the other pages, and so we should split a folio where the madvise
> call does not cover every page in the folio.
Yes. This is my understanding also. :).

> 
> I'm less convinced that argument holds for page cache pages.
Can you explain more about this? My understanding is that if we need
to reclaim the large folio for page cache, it's better to reclaim the
whole folio.


Regards
Yin, Fengwei
Matthew Wilcox March 14, 2023, 3:01 p.m. UTC | #7
On Tue, Mar 14, 2023 at 10:50:36PM +0800, Yin, Fengwei wrote:
> On 3/14/2023 5:48 PM, Matthew Wilcox wrote:
> > On Tue, Mar 14, 2023 at 10:16:09AM +0100, David Hildenbrand wrote:
> >> Just curious what the last sentence implies. Large folios are supposed to be
> >> a transparent optimization. So why should we pageout all surrounding
> >> subpages simply because a single subpage was requested to be paged out? That
> >> might harm performance of some workloads ... more than the actual split.
> >>
> >> So it's not immediately obvious to me why "avoid splitting" is the correct
> >> answer to the problem at hand.
> >
> > At least for anonymous pages, using large folios is an attempt to treat
> > all pages in a particular range the same way.  If the user says to only
> > page out some of them, that's a big clue that these pages are different
> > from the other pages, and so we should split a folio where the madvise
> > call does not cover every page in the folio.
>
> Yes. This is my understanding also. :).
> 
> > I'm less convinced that argument holds for page cache pages.
>
> Can you explain more about this? My understanding is that if we need
> to reclaim the large folio for page cache, it's better to reclaim the
> whole folio.

Pagecache is a shared resource.  To determine how best to handle all
the memory used to cache a file (ie the correct folio size), ideally
we would take into account how all the users of a particular file are
using it.  If we just listen to the most recent advice from one user,
we risk making a decision that's bad for potentially many other users.

Of course, we don't have any framework for deciding the correct folio size
used for pagecache yet.  We have the initial guess based on readahead
and we have various paths that will split back to individual pages.
But it's something I know we'll want to do at some point.
Yin Fengwei March 15, 2023, 2:17 a.m. UTC | #8
On 3/14/23 23:01, Matthew Wilcox wrote:
> On Tue, Mar 14, 2023 at 10:50:36PM +0800, Yin, Fengwei wrote:
>> On 3/14/2023 5:48 PM, Matthew Wilcox wrote:
>>> On Tue, Mar 14, 2023 at 10:16:09AM +0100, David Hildenbrand wrote:
>>>> Just curious what the last sentence implies. Large folios are supposed to be
>>>> a transparent optimization. So why should we pageout all surrounding
>>>> subpages simply because a single subpage was requested to be paged out? That
>>>> might harm performance of some workloads ... more than the actual split.
>>>>
>>>> So it's not immediately obvious to me why "avoid splitting" is the correct
>>>> answer to the problem at hand.
>>>
>>> At least for anonymous pages, using large folios is an attempt to treat
>>> all pages in a particular range the same way.  If the user says to only
>>> page out some of them, that's a big clue that these pages are different
>>> from the other pages, and so we should split a folio where the madvise
>>> call does not cover every page in the folio.
>>
>> Yes. This is my understanding also. :).
>>
>>> I'm less convinced that argument holds for page cache pages.
>>
>> Can you explain more about this? My understanding is that if we need
>> to reclaim the large folio for page cache, it's better to reclaim the
>> whole folio.
> 
> Pagecache is a shared resource.  To determine how best to handle all
> the memory used to cache a file (ie the correct folio size), ideally
> we would take into account how all the users of a particular file are
> using it.  If we just listen to the most recent advice from one user,
> we risk making a decision that's bad for potentially many other users.
> 
> Of course, we don't have any framework for deciding the correct folio size
> used for pagecache yet.  We have the initial guess based on readahead
> and we have various paths that will split back to individual pages.
> But it's something I know we'll want to do at some point.
Thanks a lot for the detailed explanation.


Regards
Yin, Fengwei
Yin Fengwei March 20, 2023, 1:47 p.m. UTC | #9
Hi Andrew, David,

On 3/14/2023 11:09 AM, Yin Fengwei wrote:
> On 3/14/23 02:49, Andrew Morton wrote:
>> On Mon, 13 Mar 2023 20:45:21 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
>>
>>> This series is trying to bring the batched rmap removing to
>>> try_to_unmap_one(). It's expected that the batched rmap
>>> removing bring performance gain than remove rmap per page.
>>>
>>> This series reconstruct the try_to_unmap_one() from:
>>>    loop:
>>>       clear and update PTE
>>>       unmap one page
>>>       goto loop
>>> to:
>>>    loop:
>>>       clear and update PTE
>>>       goto loop
>>>    unmap the range of folio in one call
>>> It is one step to always map/unmap the entire folio in one call.
>>> Which can simplify the folio mapcount handling by avoid dealing
>>> with each page map/unmap.
>>>
>>> ...
>>>
>>> For performance gain demonstration, changed the MADV_PAGEOUT not
>>> to split the large folio for page cache and created a micro
>>> benchmark mainly as following:
>>
>> Please remind me why it's necessary to patch the kernel to actually
>> performance test this?  And why it's proving so hard to demonstrate
>> benefits in real-world workloads?
>>
>> (Yes, this was touched on in earlier discussion, but I do think these
>> considerations should be spelled out in the [0/N] changelog).
> OK. What about add following in cover letter:
> "
>  The performance gain of this series can be demonstrated with large
>  folio reclaim. In current kernel, vmscan() path will be benefited by
>  the changes. But there is no workload/benchmark can show the exact
>  performance gain for vmscan() path as far as I am aware.
> 
>  Another way to demonstrate the performance benefit is using
>  MADV_PAGEOUT which can trigger page reclaim also. The problem is that
>  MADV_PAGEOUT always split the large folio because it's not aware of
>  large folio for page cache currently. To show the performance benefit,
>  MADV_PAGEOUT is updated not to split the large folio.
> 
>  For long term with wider adoption of large folio in kernel (like large
>  folio for anonymous page), MADV_PAGEOUT needs be updated to handle
>  large folio as whole to avoid splitting it always.
> "
I just want to check how I can move this work forward. Is it enough to
add the above message, or does some other work need to be done first? Thanks.


Regards
Yin, Fengwei

> 
> 
> Regards
> Yin, Fengwei
> 
>>
>> Thanks.
>
David Hildenbrand March 21, 2023, 2:17 p.m. UTC | #10
On 20.03.23 14:47, Yin, Fengwei wrote:
> Hi Andrew, David,
> 
> On 3/14/2023 11:09 AM, Yin Fengwei wrote:
>> On 3/14/23 02:49, Andrew Morton wrote:
>>> On Mon, 13 Mar 2023 20:45:21 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
>>>
>>>> This series is trying to bring the batched rmap removing to
>>>> try_to_unmap_one(). It's expected that the batched rmap
>>>> removing bring performance gain than remove rmap per page.
>>>>
>>>> This series reconstruct the try_to_unmap_one() from:
>>>>     loop:
>>>>        clear and update PTE
>>>>        unmap one page
>>>>        goto loop
>>>> to:
>>>>     loop:
>>>>        clear and update PTE
>>>>        goto loop
>>>>     unmap the range of folio in one call
>>>> It is one step to always map/unmap the entire folio in one call.
>>>> Which can simplify the folio mapcount handling by avoid dealing
>>>> with each page map/unmap.
>>>>
>>>> ...
>>>>
>>>> For performance gain demonstration, changed the MADV_PAGEOUT not
>>>> to split the large folio for page cache and created a micro
>>>> benchmark mainly as following:
>>>
>>> Please remind me why it's necessary to patch the kernel to actually
>>> performance test this?  And why it's proving so hard to demonstrate
>>> benefits in real-world workloads?
>>>
>>> (Yes, this was touched on in earlier discussion, but I do think these
>>> considerations should be spelled out in the [0/N] changelog).
>> OK. What about add following in cover letter:
>> "
>>   The performance gain of this series can be demonstrated with large
>>   folio reclaim. In current kernel, vmscan() path will be benefited by
>>   the changes. But there is no workload/benchmark can show the exact
>>   performance gain for vmscan() path as far as I am aware.
>>
>>   Another way to demonstrate the performance benefit is using
>>   MADV_PAGEOUT which can trigger page reclaim also. The problem is that
>>   MADV_PAGEOUT always split the large folio because it's not aware of
>>   large folio for page cache currently. To show the performance benefit,
>>   MADV_PAGEOUT is updated not to split the large folio.
>>
>>   For long term with wider adoption of large folio in kernel (like large
>>   folio for anonymous page), MADV_PAGEOUT needs be updated to handle
>>   large folio as whole to avoid splitting it always.
>> "
> I just want to check how I can move this work forward. Is it enough
> by adding above message? Or still need some other work be done first? Thanks.

I think Andrew can add that, no need to resend. But we should see more 
review (I'm fairly busy ...).
Yin Fengwei March 22, 2023, 1:31 a.m. UTC | #11
On 3/21/23 22:17, David Hildenbrand wrote:
> On 20.03.23 14:47, Yin, Fengwei wrote:
>> Hi Andrew, David,
>>
>> On 3/14/2023 11:09 AM, Yin Fengwei wrote:
>>> On 3/14/23 02:49, Andrew Morton wrote:
>>>> On Mon, 13 Mar 2023 20:45:21 +0800 Yin Fengwei 
>>>> <fengwei.yin@intel.com> wrote:
>>>>
>>>>> This series is trying to bring the batched rmap removing to
>>>>> try_to_unmap_one(). It's expected that the batched rmap
>>>>> removing bring performance gain than remove rmap per page.
>>>>>
>>>>> This series reconstruct the try_to_unmap_one() from:
>>>>>     loop:
>>>>>        clear and update PTE
>>>>>        unmap one page
>>>>>        goto loop
>>>>> to:
>>>>>     loop:
>>>>>        clear and update PTE
>>>>>        goto loop
>>>>>     unmap the range of folio in one call
>>>>> It is one step to always map/unmap the entire folio in one call.
>>>>> Which can simplify the folio mapcount handling by avoid dealing
>>>>> with each page map/unmap.
>>>>>
>>>>> ...
>>>>>
>>>>> For performance gain demonstration, changed the MADV_PAGEOUT not
>>>>> to split the large folio for page cache and created a micro
>>>>> benchmark mainly as following:
>>>>
>>>> Please remind me why it's necessary to patch the kernel to actually
>>>> performance test this?  And why it's proving so hard to demonstrate
>>>> benefits in real-world workloads?
>>>>
>>>> (Yes, this was touched on in earlier discussion, but I do think these
>>>> considerations should be spelled out in the [0/N] changelog).
>>> OK. What about add following in cover letter:
>>> "
>>>   The performance gain of this series can be demonstrated with large
>>>   folio reclaim. In current kernel, vmscan() path will be benefited by
>>>   the changes. But there is no workload/benchmark can show the exact
>>>   performance gain for vmscan() path as far as I am aware.
>>>
>>>   Another way to demonstrate the performance benefit is using
>>>   MADV_PAGEOUT which can trigger page reclaim also. The problem is that
>>>   MADV_PAGEOUT always split the large folio because it's not aware of
>>>   large folio for page cache currently. To show the performance benefit,
>>>   MADV_PAGEOUT is updated not to split the large folio.
>>>
>>>   For long term with wider adoption of large folio in kernel (like large
>>>   folio for anonymous page), MADV_PAGEOUT needs be updated to handle
>>>   large folio as whole to avoid splitting it always.
>>> "
>> I just want to check how I can move this work forward. Is it enough
>> by adding above message? Or still need some other work be done first? 
>> Thanks.
> 
> I think Andrew can add that, no need to resend. But we should see more 
> review (I'm fairly busy ...).
OK.

Regards
Yin, Fengwei

>