Message ID | 20230306092259.3507807-1-fengwei.yin@intel.com (mailing list archive) |
---|---|
Series | batched remove rmap in try_to_unmap_one() |
On Mon, 6 Mar 2023 17:22:54 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:

> This series is trying to bring the batched rmap removing to
> try_to_unmap_one(). It's expected that the batched rmap
> removing bring performance gain than remove rmap per page.
>
> ...
>
>  include/linux/rmap.h |   5 +
>  mm/page_vma_mapped.c |  30 +++
>  mm/rmap.c            | 623 +++++++++++++++++++++++++------------------
>  3 files changed, 398 insertions(+), 260 deletions(-)

As was discussed in v2's review, if no performance benefit has been
demonstrated, why make this change?
On Mon, 2023-03-06 at 13:12 -0800, Andrew Morton wrote:
> On Mon, 6 Mar 2023 17:22:54 +0800 Yin Fengwei
> <fengwei.yin@intel.com> wrote:
>
> > This series is trying to bring the batched rmap removing to
> > try_to_unmap_one(). It's expected that the batched rmap
> > removing bring performance gain than remove rmap per page.
> >
> > ...
> >
> >  include/linux/rmap.h |   5 +
> >  mm/page_vma_mapped.c |  30 +++
> >  mm/rmap.c            | 623 +++++++++++++++++++++++++------------------
> >  3 files changed, 398 insertions(+), 260 deletions(-)
>
> As was discussed in v2's review, if no performance benefit has been
> demonstrated, why make this change?

OK. Let me check whether there is a way to demonstrate the performance
gain. Thanks.

Regards
Yin, Fengwei
On 3/7/2023 5:12 AM, Andrew Morton wrote:
> On Mon, 6 Mar 2023 17:22:54 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
>
>> This series is trying to bring the batched rmap removing to
>> try_to_unmap_one(). It's expected that the batched rmap
>> removing bring performance gain than remove rmap per page.
>>
>> ...
>>
>>  include/linux/rmap.h |   5 +
>>  mm/page_vma_mapped.c |  30 +++
>>  mm/rmap.c            | 623 +++++++++++++++++++++++++------------------
>>  3 files changed, 398 insertions(+), 260 deletions(-)
>
> As was discussed in v2's review, if no performance benefit has been
> demonstrated, why make this change?

I changed MADV_PAGEOUT so that it does not split large folios for the
page cache, and created a micro benchmark whose core is as follows:

        char *c = mmap(NULL, FILESIZE, PROT_READ|PROT_WRITE,
                       MAP_PRIVATE, fd, 0);

        count = 0;
        while (1) {
                unsigned long i;

                for (i = 0; i < FILESIZE; i += pgsize) {
                        cc = *(volatile char *)(c + i);
                }
                madvise(c, FILESIZE, MADV_PAGEOUT);
                count++;
        }
        munmap(c, FILESIZE);

I ran it with 96 instances + 96 files for 1 second. The test platform
was an Ice Lake system with 48C/96T and 192G of memory.

The test result (the loop count) improved by 10% with this patch
series, and perf shows the following.

Before the patch:

    --19.97%--try_to_unmap_one
              |
              |--12.35%--page_remove_rmap
              |          |
              |           --11.39%--__mod_lruvec_page_state
              |                     |
              |                     |--1.51%--__mod_memcg_lruvec_state
              |                     |          |
              |                     |           --0.91%--cgroup_rstat_updated
              |                     |
              |                      --0.70%--__mod_lruvec_state
              |                                |
              |                                 --0.63%--__mod_node_page_state
              |
              |--5.41%--ptep_clear_flush
              |          |
              |           --4.65%--flush_tlb_mm_range
              |                     |
              |                      --3.83%--flush_tlb_func
              |                                |
              |                                 --3.51%--native_flush_tlb_one_user
              |
              |--0.75%--percpu_counter_add_batch
              |
               --0.55%--PageHeadHuge

After the patch:

    --9.50%--try_to_unmap_one
              |
              |--6.94%--try_to_unmap_one_page.constprop.0.isra.0
              |          |
              |          |--5.07%--ptep_clear_flush
              |          |          |
              |          |           --4.25%--flush_tlb_mm_range
              |          |                     |
              |          |                      --3.44%--flush_tlb_func
              |          |                                |
              |          |                                 --3.05%--native_flush_tlb_one_user
              |          |
              |           --0.80%--percpu_counter_add_batch
              |
              |--1.22%--folio_remove_rmap_and_update_count.part.0
              |          |
              |           --1.16%--folio_remove_rmap_range
              |                     |
              |                      --0.62%--__mod_lruvec_page_state
              |
               --0.56%--PageHeadHuge

As expected, the cost of __mod_lruvec_page_state is reduced
significantly with the batched folio_remove_rmap_range. I believe the
same benefit applies to the page reclaim path as well. Thanks.

Regards
Yin, Fengwei
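For reference, below is a minimal, self-contained sketch of the userspace side of one benchmark instance as described above. The FILESIZE value, the backing file path, and the SIGALRM-based one-second cutoff are illustrative assumptions, not the author's exact harness; the original run used 96 such instances against 96 files, and also depended on a kernel-side change to MADV_PAGEOUT (no large-folio split for page cache) that a userspace program cannot reproduce. MADV_PAGEOUT itself requires Linux 5.4 or later.

    /*
     * Sketch of one benchmark instance: repeatedly touch every page of a
     * privately mapped file, then ask the kernel to reclaim the range with
     * MADV_PAGEOUT, counting how many loops complete in ~1 second.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define FILESIZE (256UL * 1024 * 1024)   /* assumed file size */

    static volatile sig_atomic_t stop;

    static void on_alarm(int sig)
    {
            (void)sig;
            stop = 1;
    }

    int main(int argc, char **argv)
    {
            /* Path is an assumption; the original used 96 separate files. */
            const char *path = argc > 1 ? argv[1] : "testfile";
            long pgsize = sysconf(_SC_PAGESIZE);
            unsigned long count = 0;
            volatile char cc;
            char *c;
            int fd;

            fd = open(path, O_RDONLY);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            c = mmap(NULL, FILESIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE, fd, 0);
            if (c == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            /* Stop the touch/pageout loop after roughly one second. */
            signal(SIGALRM, on_alarm);
            alarm(1);

            while (!stop) {
                    unsigned long i;

                    /* Fault every page in, then ask the kernel to page it out. */
                    for (i = 0; i < FILESIZE; i += pgsize)
                            cc = *(volatile char *)(c + i);
                    madvise(c, FILESIZE, MADV_PAGEOUT);
                    count++;
            }

            munmap(c, FILESIZE);
            close(fd);
            printf("loops: %lu\n", count);
            return 0;
    }

The loop count printed at the end corresponds to the "number count" metric quoted in the 10% result above; comparing the aggregate count across all instances before and after the series is what exercises the try_to_unmap_one() path shown in the perf profiles.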