Message ID | 20230228122308.2972219-1-fengwei.yin@intel.com
---|---
Series | batched remove rmap in try_to_unmap_one()
On Tue, 28 Feb 2023 20:23:03 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:

> Testing done with the V2 patchset in a qemu guest
> with 4G mem + 512M zram:
> - kernel mm selftest to trigger vmscan() and finally hit
>   try_to_unmap_one().
> - Inject hwpoison to a hugetlb page to trigger the
>   try_to_unmap_one() call against hugetlb.
> - 8 hours stress testing: Firefox + kernel mm selftest + kernel
>   build.

Was any performance testing done with these changes?
On Tue, 2023-02-28 at 12:28 -0800, Andrew Morton wrote:
> On Tue, 28 Feb 2023 20:23:03 +0800 Yin Fengwei
> <fengwei.yin@intel.com> wrote:
>
> > [...]
>
> Was any performance testing done with these changes?
I tried to collect performance data, but found that it's not easy to
trigger the try_to_unmap_one() path (the only way I noticed is to
trigger page cache reclaim), and I am not aware of a workload that can
show it. Do you have some workloads you would suggest running? Thanks.

Regards
Yin, Fengwei
On 01.03.23 02:44, Yin, Fengwei wrote:
> On Tue, 2023-02-28 at 12:28 -0800, Andrew Morton wrote:
>> [...]
>>
>> Was any performance testing done with these changes?
> I tried to collect performance data, but found that it's not easy to
> trigger the try_to_unmap_one() path (the only way I noticed is to
> trigger page cache reclaim), and I am not aware of a workload that can
> show it. Do you have some workloads you would suggest running? Thanks.

If this path is barely ever hit, why care about performance and take a
"398 insertions(+), 260 deletions(-)" diffstat?
On 3/2/2023 6:04 PM, David Hildenbrand wrote:
> On 01.03.23 02:44, Yin, Fengwei wrote:
>> [...]
>> I tried to collect performance data, but found that it's not easy to
>> trigger the try_to_unmap_one() path (the only way I noticed is to
>> trigger page cache reclaim), and I am not aware of a workload that can
>> show it. Do you have some workloads you would suggest running? Thanks.
>
> If this path is barely ever hit, why care about performance and take a
> "398 insertions(+), 260 deletions(-)" diffstat?
I mean I can't find a workload that triggers page cache reclaim so that
I can measure its performance. We can do "echo 1 >
/proc/sys/vm/drop_caches" to reclaim the page cache, but there is no
obvious indicator that shows the advantage of this patchset. Maybe I
could try eBPF to capture some statistics from try_to_unmap_one()?

Regards
Yin, Fengwei
On 02.03.23 14:32, Yin, Fengwei wrote:
> On 3/2/2023 6:04 PM, David Hildenbrand wrote:
>> [...]
>> If this path is barely ever hit, why care about performance and take a
>> "398 insertions(+), 260 deletions(-)" diffstat?
> I mean I can't find a workload that triggers page cache reclaim so that
> I can measure its performance. We can do "echo 1 >
> /proc/sys/vm/drop_caches" to reclaim the page cache, but there is no
> obvious indicator that shows the advantage of this patchset. Maybe I
> could try eBPF to capture some statistics from try_to_unmap_one()?

If no workload/benchmark is affected (or only corner cases where nobody
cares about performance), I hope you understand that it's hard to argue
why we should care about such an optimization.

I briefly thought that page migration could benefit, but it always uses
try_to_migrate(). So I guess we're fairly limited to vmscan (memory
failure is a corner case).

I recall that there are some performance-sensitive swap-to-nvdimm test
cases. As an alternative, one could eventually write a microbenchmark
that measures MADV_PAGEOUT performance -- it should also end up
triggering vmscan, but only if the page is mapped exactly once (in which
case, I assume batch removal doesn't really help?).
On Thu, Mar 02, 2023 at 03:23:46PM +0100, David Hildenbrand wrote:
> If no workload/benchmark is affected (or only corner cases where nobody
> cares about performance), I hope you understand that it's hard to argue
> why we should care about such an optimization.

In order to solve the mapcount problem, we're going to want to unmap
the entire folio in one call, instead of unmapping each page in it
individually and checking each time whether there are any remaining
pages from this folio still mapped.
On 02.03.23 15:33, Matthew Wilcox wrote:
> On Thu, Mar 02, 2023 at 03:23:46PM +0100, David Hildenbrand wrote:
>> [...]
>
> In order to solve the mapcount problem, we're going to want to unmap
> the entire folio in one call, instead of unmapping each page in it
> individually and checking each time whether there are any remaining
> pages from this folio still mapped.

Okay, thanks. That should be added to the cover letter, ideally with
more details on the mapcount goal and on how this patch set helps in
that regard. So far the cover letter only talks about eventual
performance gains.
On 3/2/2023 10:23 PM, David Hildenbrand wrote:
> On 02.03.23 14:32, Yin, Fengwei wrote:
>> [...]
>
> If no workload/benchmark is affected (or only corner cases where nobody
> cares about performance), I hope you understand that it's hard to argue
> why we should care about such an optimization.
Yes. I understand this.

> I briefly thought that page migration could benefit, but it always uses
> try_to_migrate().
Yes. try_to_migrate() shares very similar logic with try_to_unmap_one(),
so the same batched operation applies to try_to_migrate() as well.

> So I guess we're fairly limited to vmscan (memory failure is a corner
> case).
Agree.

> I recall that there are some performance-sensitive swap-to-nvdimm test
> cases. As an alternative, one could eventually write a microbenchmark
> that measures MADV_PAGEOUT performance -- it should also end up
> triggering vmscan, but only if the page is mapped exactly once (in which
> case, I assume batch removal doesn't really help?).
Yes. MADV_PAGEOUT can trigger vmscan. My understanding is that even a
page mapped only once could benefit from the batched operation. Let me
try to write a microbenchmark based on MADV_PAGEOUT and see what we can
get. Thanks.

Regards
Yin, Fengwei
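For illustration, a minimal userspace sketch of the kind of
MADV_PAGEOUT microbenchmark discussed above might look like the
following (this is not from the thread; the buffer size is an arbitrary
choice, and MADV_PAGEOUT requires Linux 5.4 or newer):

/* Sketch: map anonymous memory, fault it in, then time
 * madvise(MADV_PAGEOUT), which pushes the pages through page reclaim
 * and hence through try_to_unmap_one(). */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SZ (256UL << 20)	/* 256 MiB, arbitrary */

int main(void)
{
	struct timespec t0, t1;
	char *buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 1, SZ);	/* fault every page in */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (madvise(buf, SZ, MADV_PAGEOUT)) {
		perror("madvise(MADV_PAGEOUT)");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("MADV_PAGEOUT of %lu MiB took %.3f ms\n", SZ >> 20,
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	return 0;
}

Note that, as Yin finds later in the thread, MADV_PAGEOUT split large
folios at the time, so a benchmark like this would not exercise the
batched removal path for large folios without further kernel changes.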
On 3/2/2023 10:55 PM, David Hildenbrand wrote:
> On 02.03.23 15:33, Matthew Wilcox wrote:
>> In order to solve the mapcount problem, we're going to want to unmap
>> the entire folio in one call, instead of unmapping each page in it
>> individually and checking each time whether there are any remaining
>> pages from this folio still mapped.
This is a good point.

> Okay, thanks. That should be added to the cover letter, ideally with
> more details on the mapcount goal and on how this patch set helps in
> that regard. So far the cover letter only talks about eventual
> performance gains.
This patchset restructures try_to_unmap_one() from:

  loop:
          clear and update PTE
          unmap one page
          goto loop

to:

  loop:
          clear and update PTE
          goto loop
  unmap the mapped range of the folio in one call

So if the entire folio is mapped, it is also unmapped in a single call.
I will add this to the cover letter. Thanks.

Regards
Yin, Fengwei
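To make the restructuring concrete, a simplified C sketch of the two
loop shapes follows. page_vma_mapped_walk(), ptep_clear_flush() and
page_remove_rmap() are real kernel APIs of that era, while
folio_remove_rmap_range() stands in for the batched helper the series
introduces and is named here only for illustration:

/* Before: rmap is removed one page at a time inside the PTE walk. */
while (page_vma_mapped_walk(&pvmw)) {
	pteval = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
	/* ... swap/migration-entry handling elided ... */
	page_remove_rmap(subpage, vma, false);
}

/* After: only clear PTEs inside the walk, count the pages, then drop
 * the rmap for the whole mapped range of the folio in one call. */
while (page_vma_mapped_walk(&pvmw)) {
	pteval = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
	nr_pages++;
}
folio_remove_rmap_range(folio, start, nr_pages, vma);	/* illustrative */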
On 3/3/23 10:26, Yin, Fengwei wrote:
> On 3/2/2023 10:23 PM, David Hildenbrand wrote:
>> [...]
>> I recall that there are some performance-sensitive swap-to-nvdimm test
>> cases. As an alternative, one could eventually write a microbenchmark
>> that measures MADV_PAGEOUT performance -- it should also end up
>> triggering vmscan, but only if the page is mapped exactly once (in
>> which case, I assume batch removal doesn't really help?).
> Yes. MADV_PAGEOUT can trigger vmscan. My understanding is that even a
> page mapped only once could benefit from the batched operation. Let me
> try to write a microbenchmark based on MADV_PAGEOUT and see what we can
> get. Thanks.
I checked MADV_PAGEOUT: it can't benefit from this series because the
large folio is split before reclaim. I suppose we will later update
MADV_PAGEOUT to support reclaiming large folios without splitting them.

Regards
Yin, Fengwei