[v3,0/5] batched remove rmap in try_to_unmap_one()

Message ID: 20230306092259.3507807-1-fengwei.yin@intel.com

Message

Yin Fengwei March 6, 2023, 9:22 a.m. UTC
This series brings batched rmap removal to try_to_unmap_one().
Removing the rmap for a whole range at once is expected to yield a
performance gain over removing it one page at a time.

This series restructures try_to_unmap_one() from:
  loop:
     clear and update PTE
     unmap one page
     goto loop
to:
  loop:
     clear and update PTE
     goto loop
  unmap the range of folio in one call
This is one step toward always mapping/unmapping the entire folio in
one call, which can simplify the folio mapcount handling by avoiding
per-page map/unmap accounting.
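
For illustration only, a minimal sketch of this control-flow change is
below. It is not the code in this series: clear_and_update_pte() is a
placeholder for the real PTE-clearing logic, and the exact
folio_remove_rmap_range() prototype shown is an assumption based on the
description above.

  /* Before: the rmap is removed once per page, inside the walk loop. */
  while (page_vma_mapped_walk(&pvmw)) {
          /* clear and update PTE for the current page */
          clear_and_update_pte(folio, vma, &pvmw);       /* placeholder */
          page_remove_rmap(subpage, vma, false);         /* per page */
  }

  /* After: PTEs are cleared in the loop, rmap removed once per range. */
  while (page_vma_mapped_walk(&pvmw)) {
          /* clear and update PTE for the current page */
          clear_and_update_pte(folio, vma, &pvmw);       /* placeholder */
          nr_pages++;
  }
  /* assumed prototype: folio, first mapped page, number of pages, vma */
  folio_remove_rmap_range(folio, start_page, nr_pages, vma);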


The changes are organized as follows:
Patch 1/2 move the hugetlb and normal page unmap to dedicated
functions, to make the try_to_unmap_one() logic clearer and easier to
extend with batched rmap removal. To make code review easier, there is
no functional change.

Patch 3 cleans up try_to_unmap_one_page() and removes some duplicated
function calls.

Patch 4 adds folio_remove_rmap_range(), which removes rmap in batches.

Patch 5 makes try_to_unmap_one() do batched rmap removal.

Testing was done with the v3 patchset in a qemu guest
with 4G of memory:
  - kernel mm selftests, to trigger page reclaim (vmscan) and
    eventually hit try_to_unmap_one().
  - hwpoison injection into a hugetlb page, to trigger the
    try_to_unmap_one() call against hugetlb (see the sketch after
    this list).
  - 8 hours of stress testing: Firefox + kernel mm selftests + kernel
    build.
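
The sketch below shows one way hwpoison can be injected into a hugetlb
page from userspace, via madvise(MADV_HWPOISON) (CONFIG_MEMORY_FAILURE
and CAP_SYS_ADMIN are required, and hugetlb pages must be reserved in
/proc/sys/vm/nr_hugepages). It is an illustrative example of this kind
of trigger, not necessarily the exact tool used for the testing above.

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 2UL << 20;         /* one 2MB hugetlb page */
          char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

          if (p == MAP_FAILED) {
                  perror("mmap(MAP_HUGETLB)");
                  return 1;
          }

          /* Fault the hugetlb page in so it is actually mapped. */
          memset(p, 0, len);

          /*
           * Poison the page.  The memory failure handler unmaps it via
           * try_to_unmap(); the process may get SIGBUS if it touches
           * the range afterwards.
           */
          if (madvise(p, len, MADV_HWPOISON))
                  perror("madvise(MADV_HWPOISON)");

          munmap(p, len);
          return 0;
  }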

This series is based on next-20230303.

Changes from v2:
  - General
    - Rebase the patch to next-20230303
    - Update the cover letter to describe the preparation for
      unmapping the entire folio in one call
    - No code change compared to v2, but fix the patch application
      conflict caused by the wrong patch order in v2.

Changes from v1:
  - General
    - Rebase the patch to next-20230228

  - Patch1
    - Removed the if (PageHWPoison(page) && !(flags & TTU_HWPOISON))
      check, as suggested by Mike Kravetz and HORIGUCHI NAOYA
    - Removed the mlock_drain_local() call, as suggested by Mike Kravetz
    - Removed the comments about the mm counter change, as suggested
      by Mike Kravetz

Yin Fengwei (5):
  rmap: move hugetlb try_to_unmap to dedicated function
  rmap: move page unmap operation to dedicated function
  rmap: cleanup exit path of try_to_unmap_one_page()
  rmap: add folio_remove_rmap_range()
  try_to_unmap_one: batched remove rmap, update folio refcount

 include/linux/rmap.h |   5 +
 mm/page_vma_mapped.c |  30 +++
 mm/rmap.c            | 623 +++++++++++++++++++++++++------------------
 3 files changed, 398 insertions(+), 260 deletions(-)

Comments

Andrew Morton March 6, 2023, 9:12 p.m. UTC | #1
On Mon,  6 Mar 2023 17:22:54 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:

> This series brings batched rmap removal to try_to_unmap_one().
> Removing the rmap for a whole range at once is expected to yield a
> performance gain over removing it one page at a time.
> 
> ...
>
>  include/linux/rmap.h |   5 +
>  mm/page_vma_mapped.c |  30 +++
>  mm/rmap.c            | 623 +++++++++++++++++++++++++------------------
>  3 files changed, 398 insertions(+), 260 deletions(-)

As was discussed in v2's review, if no performance benefit has been
demonstrated, why make this change?
Yin Fengwei March 7, 2023, 2:44 a.m. UTC | #2
On Mon, 2023-03-06 at 13:12 -0800, Andrew Morton wrote:
> On Mon,  6 Mar 2023 17:22:54 +0800 Yin Fengwei
> <fengwei.yin@intel.com> wrote:
> 
> > This series brings batched rmap removal to try_to_unmap_one().
> > Removing the rmap for a whole range at once is expected to yield a
> > performance gain over removing it one page at a time.
> > 
> > ...
> > 
> >  include/linux/rmap.h |   5 +
> >  mm/page_vma_mapped.c |  30 +++
> >  mm/rmap.c            | 623 +++++++++++++++++++++++++------------------
> >  3 files changed, 398 insertions(+), 260 deletions(-)
> 
> As was discussed in v2's review, if no performance benefit has been
> demonstrated, why make this change?
OK. Let me check whether there is a way to demonstrate the performance
gain. Thanks.


Regards
Yin, Fengwei

Yin Fengwei March 9, 2023, 1:56 p.m. UTC | #3
On 3/7/2023 5:12 AM, Andrew Morton wrote:
> On Mon,  6 Mar 2023 17:22:54 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
> 
>> This series brings batched rmap removal to try_to_unmap_one().
>> Removing the rmap for a whole range at once is expected to yield a
>> performance gain over removing it one page at a time.
>>
>> ...
>>
>>  include/linux/rmap.h |   5 +
>>  mm/page_vma_mapped.c |  30 +++
>>  mm/rmap.c            | 623 +++++++++++++++++++++++++------------------
>>  3 files changed, 398 insertions(+), 260 deletions(-)
> 
> As was discussed in v2's review, if no performance benefit has been
> demonstrated, why make this change?
> 
I changed MADV_PAGEOUT not to split large folios for the page cache
and created a micro-benchmark that mainly does the following:

        char *c = mmap(NULL, FILESIZE, PROT_READ|PROT_WRITE,
                       MAP_PRIVATE, fd, 0);
        count = 0;
        while (1) {
                unsigned long i;

                for (i = 0; i < FILESIZE; i += pgsize) {
                        cc = *(volatile char *)(c + i);
                }
                madvise(c, FILESIZE, MADV_PAGEOUT);
                count++;
        }
        munmap(c, FILESIZE);

I ran it with 96 instances and 96 files for 1 second. The test platform
was an Ice Lake server with 48C/96T and 192G of memory.
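
For reference, a self-contained version of this micro-benchmark could
look roughly like the sketch below. The FILESIZE value, the
command-line file path, and the SIGALRM-based reporting of the loop
count are assumptions; only the inner loop above is from the original
test, and the 10% result also depends on the MADV_PAGEOUT change
mentioned above.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <signal.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define FILESIZE (256UL << 20)          /* assumed per-file size */

  static volatile sig_atomic_t stop;

  static void on_alarm(int sig)
  {
          stop = 1;
  }

  int main(int argc, char **argv)
  {
          unsigned long pgsize = sysconf(_SC_PAGESIZE);
          unsigned long count = 0;
          char *c, cc;
          int fd;

          if (argc < 2)
                  return 1;

          fd = open(argv[1], O_RDWR);     /* one data file per instance */
          if (fd < 0) {
                  perror("open");
                  return 1;
          }

          c = mmap(NULL, FILESIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);
          if (c == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }

          signal(SIGALRM, on_alarm);
          alarm(1);                       /* run for 1 second */

          while (!stop) {
                  unsigned long i;

                  /* Touch every page, then ask the kernel to reclaim. */
                  for (i = 0; i < FILESIZE; i += pgsize)
                          cc = *(volatile char *)(c + i);
                  madvise(c, FILESIZE, MADV_PAGEOUT);
                  count++;
          }

          (void)cc;                       /* silence unused warning */
          printf("%lu iterations\n", count);
          munmap(c, FILESIZE);
          close(fd);
          return 0;
  }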

The test result (the loop count) improved by 10% with this patch series,
and perf shows the following:

Before the patch:
--19.97%--try_to_unmap_one
          |          
          |--12.35%--page_remove_rmap
          |          |          
          |           --11.39%--__mod_lruvec_page_state
          |                     |          
          |                     |--1.51%--__mod_memcg_lruvec_state
          |                     |          |          
          |                     |           --0.91%--cgroup_rstat_updated
          |                     |          
          |                      --0.70%--__mod_lruvec_state
          |                                |          
          |                                 --0.63%--__mod_node_page_state
          |          
          |--5.41%--ptep_clear_flush
          |          |          
          |           --4.65%--flush_tlb_mm_range
          |                     |          
          |                      --3.83%--flush_tlb_func
          |                                |          
          |                                 --3.51%--native_flush_tlb_one_user
          |          
          |--0.75%--percpu_counter_add_batch
          |          
           --0.55%--PageHeadHuge

After the patch:
--9.50%--try_to_unmap_one
          |          
          |--6.94%--try_to_unmap_one_page.constprop.0.isra.0
          |          |          
          |          |--5.07%--ptep_clear_flush
          |          |          |          
          |          |           --4.25%--flush_tlb_mm_range
          |          |                     |          
          |          |                      --3.44%--flush_tlb_func
          |          |                                |          
          |          |                                 --3.05%--native_flush_tlb_one_user
          |          |          
          |           --0.80%--percpu_counter_add_batch
          |          
          |--1.22%--folio_remove_rmap_and_update_count.part.0
          |          |          
          |           --1.16%--folio_remove_rmap_range
          |                     |          
          |                      --0.62%--__mod_lruvec_page_state
          |          
           --0.56%--PageHeadHuge

As expected, the cost of __mod_lruvec_page_state is reduced a lot with batched
folio_remove_rmap_range.

I believe the same benefit applies to the page reclaim path as well. Thanks.


Regards
Yin, Fengwei