Message ID | 20230228122308.2972219-1-fengwei.yin@intel.com
---|---
Series | batched remove rmap in try_to_unmap_one()
On Tue, 28 Feb 2023 20:23:03 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:

> Testing done with the V2 patchset in a qemu guest
> with 4G mem + 512M zram:
> - kernel mm selftest to trigger vmscan() and finally hit
>   try_to_unmap_one().
> - Inject hwpoison to a hugetlb page to trigger the
>   try_to_unmap_one() call against hugetlb.
> - 8 hours stress testing: Firefox + kernel mm selftest + kernel
>   build.

Was any performance testing done with these changes?
On Tue, 2023-02-28 at 12:28 -0800, Andrew Morton wrote:
> On Tue, 28 Feb 2023 20:23:03 +0800 Yin Fengwei
> <fengwei.yin@intel.com> wrote:
>
> > [...]
>
> Was any performance testing done with these changes?
I tried to collect performance data, but found that it's not easy to
trigger the try_to_unmap_one() path (the only way I noticed is to
trigger page cache reclaim), and I am not aware of a workload that can
show it. Do you have some workloads you would suggest running? Thanks.

Regards
Yin, Fengwei
On 01.03.23 02:44, Yin, Fengwei wrote:
> On Tue, 2023-02-28 at 12:28 -0800, Andrew Morton wrote:
>> [...]
>>
>> Was any performance testing done with these changes?
> I tried to collect performance data, but found that it's not easy to
> trigger the try_to_unmap_one() path (the only way I noticed is to
> trigger page cache reclaim), and I am not aware of a workload that can
> show it. Do you have some workloads you would suggest running? Thanks.

If this path is barely ever hit, why care about performance and take a
"398 insertions(+), 260 deletions(-)" diffstat?
On 3/2/2023 6:04 PM, David Hildenbrand wrote:
> On 01.03.23 02:44, Yin, Fengwei wrote:
>> [...]
>> I tried to collect performance data, but found that it's not easy to
>> trigger the try_to_unmap_one() path (the only way I noticed is to
>> trigger page cache reclaim), and I am not aware of a workload that can
>> show it. Do you have some workloads you would suggest running? Thanks.
>
> If this path is barely ever hit, why care about performance and take a
> "398 insertions(+), 260 deletions(-)" diffstat?
I mean I can't find a workload that triggers page cache reclaim so that
I can measure its performance. We can do "echo 1 >
/proc/sys/vm/drop_caches" to reclaim the page cache, but there is no
obvious indicator that shows the advantage of this patchset. Maybe I
could try eBPF to capture some statistics from try_to_unmap_one()?

Regards
Yin, Fengwei
On 02.03.23 14:32, Yin, Fengwei wrote:
> On 3/2/2023 6:04 PM, David Hildenbrand wrote:
>> [...]
>> If this path is barely ever hit, why care about performance and take a
>> "398 insertions(+), 260 deletions(-)" diffstat?
> I mean I can't find a workload that triggers page cache reclaim so that
> I can measure its performance. We can do "echo 1 >
> /proc/sys/vm/drop_caches" to reclaim the page cache, but there is no
> obvious indicator that shows the advantage of this patchset. Maybe I
> could try eBPF to capture some statistics from try_to_unmap_one()?

If no workload/benchmark is affected (or only corner cases where nobody
cares about performance), I hope you understand that it's hard to argue
why we should care about such an optimization.

I briefly thought that page migration could benefit, but it always uses
try_to_migrate(). So I guess we're fairly limited to vmscan (memory
failure is a corner case).

I recall that there are some performance-sensitive swap-to-nvdimm test
cases. As an alternative, one could eventually write a microbenchmark
that measures MADV_PAGEOUT performance -- it should also end up
triggering vmscan, but only if the page is mapped exactly once (in which
case, I assume batch removal doesn't really help?).
On Thu, Mar 02, 2023 at 03:23:46PM +0100, David Hildenbrand wrote:
> If no workload/benchmark is affected (or only corner cases where nobody
> cares about performance), I hope you understand that it's hard to argue
> why we should care about such an optimization.

In order to solve the mapcount problem, we're going to want to unmap
the entire folio in one call, instead of unmapping each page in it
individually and checking each time whether there are any remaining
pages from this folio still mapped.
On 02.03.23 15:33, Matthew Wilcox wrote:
> On Thu, Mar 02, 2023 at 03:23:46PM +0100, David Hildenbrand wrote:
>> [...]
>
> In order to solve the mapcount problem, we're going to want to unmap
> the entire folio in one call, instead of unmapping each page in it
> individually and checking each time whether there are any remaining
> pages from this folio still mapped.

Okay, thanks. That should be added to the cover letter, ideally with
more details on the mapcount goal and on how this patch set helps in
that regard. So far the cover letter only talks about eventual
performance gains.
On 3/2/2023 10:23 PM, David Hildenbrand wrote:
> On 02.03.23 14:32, Yin, Fengwei wrote:
>> [...]
>
> If no workload/benchmark is affected (or only corner cases where nobody
> cares about performance), I hope you understand that it's hard to argue
> why we should care about such an optimization.
Yes. I understand this.

> I briefly thought that page migration could benefit, but it always uses
> try_to_migrate().
Yes. try_to_migrate() shares very similar logic with try_to_unmap_one(),
so the same batched operation applies to try_to_migrate() as well.

> So I guess we're fairly limited to vmscan (memory failure is a corner
> case).
Agree.

> I recall that there are some performance-sensitive swap-to-nvdimm test
> cases. As an alternative, one could eventually write a microbenchmark
> that measures MADV_PAGEOUT performance -- it should also end up
> triggering vmscan, but only if the page is mapped exactly once (in which
> case, I assume batch removal doesn't really help?).
Yes. MADV_PAGEOUT can trigger vmscan. My understanding is that even a
page mapped only once could benefit from the batched operation. Let me
try to write a microbenchmark based on MADV_PAGEOUT and see what we can
get. Thanks.

Regards
Yin, Fengwei
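For illustration, a minimal userspace sketch of the kind of
MADV_PAGEOUT microbenchmark discussed above might look like the
following (this is not from the thread; the buffer size is an arbitrary
choice, and MADV_PAGEOUT requires Linux 5.4 or newer):

/* Sketch: map anonymous memory, fault it in, then time
 * madvise(MADV_PAGEOUT), which pushes the pages through page reclaim
 * and hence through try_to_unmap_one(). */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SZ (256UL << 20)	/* 256 MiB, arbitrary */

int main(void)
{
	struct timespec t0, t1;
	char *buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 1, SZ);	/* fault every page in */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (madvise(buf, SZ, MADV_PAGEOUT)) {
		perror("madvise(MADV_PAGEOUT)");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("MADV_PAGEOUT of %lu MiB took %.3f ms\n", SZ >> 20,
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	return 0;
}

Note that, as Yin finds later in the thread, MADV_PAGEOUT split large
folios at the time, so a benchmark like this would not exercise the
batched removal path for large folios without further kernel changes.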
On 3/2/2023 10:55 PM, David Hildenbrand wrote:
> On 02.03.23 15:33, Matthew Wilcox wrote:
>> In order to solve the mapcount problem, we're going to want to unmap
>> the entire folio in one call, instead of unmapping each page in it
>> individually and checking each time whether there are any remaining
>> pages from this folio still mapped.
This is a good point.

> Okay, thanks. That should be added to the cover letter, ideally with
> more details on the mapcount goal and on how this patch set helps in
> that regard. So far the cover letter only talks about eventual
> performance gains.
This patchset restructures try_to_unmap_one() from:

  loop:
          clear and update PTE
          unmap one page
          goto loop

to:

  loop:
          clear and update PTE
          goto loop
  unmap the mapped range of the folio in one call

So if the entire folio is mapped, it is also unmapped in a single call.
I will add this to the cover letter. Thanks.

Regards
Yin, Fengwei
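To make the restructuring concrete, a simplified C sketch of the two
loop shapes follows. page_vma_mapped_walk(), ptep_clear_flush() and
page_remove_rmap() are real kernel APIs of that era, while
folio_remove_rmap_range() stands in for the batched helper the series
introduces and is named here only for illustration:

/* Before: rmap is removed one page at a time inside the PTE walk. */
while (page_vma_mapped_walk(&pvmw)) {
	pteval = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
	/* ... swap/migration-entry handling elided ... */
	page_remove_rmap(subpage, vma, false);
}

/* After: only clear PTEs inside the walk, count the pages, then drop
 * the rmap for the whole mapped range of the folio in one call. */
while (page_vma_mapped_walk(&pvmw)) {
	pteval = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
	nr_pages++;
}
folio_remove_rmap_range(folio, start, nr_pages, vma);	/* illustrative */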
On 3/3/23 10:26, Yin, Fengwei wrote:
> On 3/2/2023 10:23 PM, David Hildenbrand wrote:
>> [...]
>> I recall that there are some performance-sensitive swap-to-nvdimm test
>> cases. As an alternative, one could eventually write a microbenchmark
>> that measures MADV_PAGEOUT performance -- it should also end up
>> triggering vmscan, but only if the page is mapped exactly once (in
>> which case, I assume batch removal doesn't really help?).
> Yes. MADV_PAGEOUT can trigger vmscan. My understanding is that even a
> page mapped only once could benefit from the batched operation. Let me
> try to write a microbenchmark based on MADV_PAGEOUT and see what we can
> get. Thanks.
I checked MADV_PAGEOUT: it can't benefit from this series because the
large folio is split before reclaim. I suppose we will later update
MADV_PAGEOUT to support reclaiming large folios without splitting them.

Regards
Yin, Fengwei