Message ID | 20230410075224.827740-1-ying.huang@intel.com (mailing list archive) |
---|---|
State | New |
Series | mm,unmap: avoid flushing TLB in batch if PTE is inaccessible |
> On Apr 10, 2023, at 12:52 AM, Huang Ying <ying.huang@intel.com> wrote:
>
> 0Day/LKP reported a performance regression for commit 7e12beb8ca2a
> ("migrate_pages: batch flushing TLB"). In that commit, the TLB flushing
> during page migration is batched, so in try_to_migrate_one(),
> ptep_clear_flush() is replaced with set_tlb_ubc_flush_pending().
> Further investigation found that the TLB flushing can be avoided in
> ptep_clear_flush() if the PTE is inaccessible. In fact, we can optimize
> the batched TLB flushing in a similar way to improve performance.
>
> So in this patch, we check pte_accessible() before
> set_tlb_ubc_flush_pending() in try_to_unmap/migrate_one(). Tests show
> that the benchmark score of the anon-cow-rand-mt test case of the
> vm-scalability test suite can improve by up to 2.1% with the patch on
> an Intel server machine, and the number of TLB flushing IPIs can be
> reduced by up to 44.3%.

LGTM.

I know it’s meaningless for x86 (but perhaps ARM would use this infra
too): do we need smp_mb__after_atomic() after ptep_get_and_clear() and
before pte_accessible()?

In addition, if this goes into stable (based on the Fixes tag), consider
breaking it into 2 patches, where only one would be backported.
Hi, Amit,

Thank you very much for the review!

Nadav Amit <namit@vmware.com> writes:

>> On Apr 10, 2023, at 12:52 AM, Huang Ying <ying.huang@intel.com> wrote:
>>
>> [snip]
>
> LGTM.

Thanks!

> I know it’s meaningless for x86 (but perhaps ARM would use this infra
> too): do we need smp_mb__after_atomic() after ptep_get_and_clear() and
> before pte_accessible()?

Why do we need the memory barrier? IIUC, the PTL is locked, so the PTE
value will not be changed under us. Anything else?

> In addition, if this goes into stable (based on the Fixes tag), consider
> breaking it into 2 patches, where only one would be backported.

The fixed commit (7e12beb8ca2a ("migrate_pages: batch flushing TLB")) was
merged in v6.3-rc1, so this patch will only be backported to v6.3 and
later. Is that OK?

Best Regards,
Huang, Ying
> On Apr 10, 2023, at 6:31 PM, Huang, Ying <ying.huang@intel.com> wrote:
>
> [snip]
>
>> I know it’s meaningless for x86 (but perhaps ARM would use this infra
>> too): do we need smp_mb__after_atomic() after ptep_get_and_clear() and
>> before pte_accessible()?
>
> Why do we need the memory barrier? IIUC, the PTL is locked, so the PTE
> value will not be changed under us. Anything else?

I was thinking about the ordering with respect to
atomic_read(&mm->tlb_flush_pending), which is not protected by the PTL.
I guess you can correctly argue that because of other control-flow
dependencies, the barrier is not necessary.

>> In addition, if this goes into stable (based on the Fixes tag), consider
>> breaking it into 2 patches, where only one would be backported.
>
> The fixed commit (7e12beb8ca2a ("migrate_pages: batch flushing TLB")) was
> merged in v6.3-rc1, so this patch will only be backported to v6.3 and
> later. Is that OK?

Of course. I wasn’t sure when the bug was introduced.
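For context on the question above: on x86 the atomic_read(&mm->tlb_flush_pending)
that Nadav refers to happens inside pte_accessible() itself. A paraphrase of the
x86 helper (based on arch/x86/include/asm/pgtable.h around this kernel version;
not a verbatim copy) looks roughly like this:

/* Paraphrase of x86's pte_accessible(), for illustration only. */
static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
{
	/* A present PTE may be cached in some TLB, so it needs a flush. */
	if (pte_flags(a) & _PAGE_PRESENT)
		return true;

	/*
	 * A PROT_NONE PTE only counts as accessible while another thread
	 * has a TLB flush pending for this mm; this is the unlocked
	 * atomic_read(&mm->tlb_flush_pending) the ordering question is
	 * about.
	 */
	if ((pte_flags(a) & _PAGE_PROTNONE) &&
	    atomic_read(&mm->tlb_flush_pending))
		return true;

	return false;
}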
Nadav Amit <namit@vmware.com> writes:

>> On Apr 10, 2023, at 6:31 PM, Huang, Ying <ying.huang@intel.com> wrote:
>>
>> [snip]
>>
>>> I know it’s meaningless for x86 (but perhaps ARM would use this infra
>>> too): do we need smp_mb__after_atomic() after ptep_get_and_clear() and
>>> before pte_accessible()?
>>
>> Why do we need the memory barrier? IIUC, the PTL is locked, so the PTE
>> value will not be changed under us. Anything else?
>
> I was thinking about the ordering with respect to
> atomic_read(&mm->tlb_flush_pending), which is not protected by the PTL.
> I guess you can correctly argue that because of other control-flow
> dependencies, the barrier is not necessary.

For ordering between ptep_get_and_clear() and
atomic_read(&mm->tlb_flush_pending), I think the PTL already provides
the necessary protection. The code path that writes
mm->tlb_flush_pending is:

  tlb_gather_mmu
    inc_tlb_flush_pending     a)
  lock PTL
  change PTE                  b)
  unlock PTL
  tlb_finish_mmu
    dec_tlb_flush_pending     c)

while the code path of try_to_unmap/migrate_one() is:

  lock PTL
  read and change PTE         d)
  read mm->tlb_flush_pending  e)
  unlock PTL

Even if e) occurs before d), they cannot occur at the same time as b).
Do I miss anything?

Best Regards,
Huang, Ying

[snip]
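Rendered as a C-style sketch (illustrative only; the real code lives in
mm/mmu_gather.c, mm/memory.c and mm/rmap.c, and set_tlb_ubc_flush_pending()
is static there), the scheme described above is roughly:

/* Writer of mm->tlb_flush_pending, e.g. the munmap/zap path: */
static void unmap_side_sketch(struct mm_struct *mm, spinlock_t *ptl)
{
	inc_tlb_flush_pending(mm);		/* a) raised before touching PTEs */
	spin_lock(ptl);
	/* ... clear or change PTEs ... */	/* b) */
	spin_unlock(ptl);
	/* ... flush the TLB for the range ... */
	dec_tlb_flush_pending(mm);		/* c) dropped only after the flush */
}

/* Reader, i.e. try_to_unmap_one()/try_to_migrate_one() with batching: */
static void rmap_side_sketch(struct mm_struct *mm, unsigned long address,
			     pte_t *pte, spinlock_t *ptl)
{
	pte_t pteval;

	spin_lock(ptl);
	pteval = ptep_get_and_clear(mm, address, pte);	/* d) */
	/* e) pte_accessible() may read mm->tlb_flush_pending, still under PTL */
	if (pte_accessible(mm, pteval))
		set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
	spin_unlock(ptl);
}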
> On Apr 11, 2023, at 6:50 PM, Huang, Ying <ying.huang@intel.com> wrote:
>
> [snip]
>
> Even if e) occurs before d), they cannot occur at the same time as b).
> Do I miss anything?

You didn’t miss anything. I went over the comment on
inc_tlb_flush_pending() and you follow the scheme.
Nadav Amit <namit@vmware.com> writes:

>> On Apr 11, 2023, at 6:50 PM, Huang, Ying <ying.huang@intel.com> wrote:
>>
>> [snip]
>>
>> Even if e) occurs before d), they cannot occur at the same time as b).
>> Do I miss anything?
>
> You didn’t miss anything. I went over the comment on
> inc_tlb_flush_pending() and you follow the scheme.

Thanks! Can I get your Acked-by or Reviewed-by for this patch?

Best Regards,
Huang, Ying
> On Apr 17, 2023, at 8:17 PM, Huang, Ying <ying.huang@intel.com> wrote:
>
> Nadav Amit <namit@vmware.com> writes:
>
>> You didn’t miss anything. I went over the comment on
>> inc_tlb_flush_pending() and you follow the scheme.
>
> Thanks! Can I get your Acked-by or Reviewed-by for this patch?

Reviewed-by: Nadav Amit <namit@vmware.com>

Thanks,
Nadav
On 2023/4/10 at 3:52 PM, Huang Ying wrote:
> 0Day/LKP reported a performance regression for commit 7e12beb8ca2a
> ("migrate_pages: batch flushing TLB").
>
> [snip]
>
> @@ -1963,7 +1964,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>                  */
>                 pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>
> -               set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> +               if (pte_accessible(mm, pteval))
> +                       set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));

Just a suggestion: can you move the pte_accessible() check into
set_tlb_ubc_flush_pending(), just like ptep_clear_flush() does? Then
there is no need to add "if (pte_accessible())" at every place that
calls set_tlb_ubc_flush_pending().

>         } else {
>                 pteval = ptep_clear_flush(vma, address, pvmw.pte);
>         }
haoxin <xhao@linux.alibaba.com> writes:

> On 2023/4/10 at 3:52 PM, Huang Ying wrote:
>
> [snip]
>
> Just a suggestion: can you move the pte_accessible() check into
> set_tlb_ubc_flush_pending(), just like ptep_clear_flush() does? Then
> there is no need to add "if (pte_accessible())" at every place that
> calls set_tlb_ubc_flush_pending().

Sounds reasonable to me; I will do that in the next version. Thanks for
the suggestion.

Best Regards,
Huang, Ying
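For illustration, one way the suggested refactoring could look (a sketch
only, not the actual next version of the patch; it assumes the helper is
changed to take the whole PTE value so it can derive both accessibility
and the dirty bit itself):

/* Sketch of the suggested refactoring in mm/rmap.c; illustrative only. */
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval)
{
	/* An inaccessible PTE cannot be cached in any TLB, so skip the batch. */
	if (!pte_accessible(mm, pteval))
		return;

	/*
	 * ... existing body, unchanged: record mm in the per-task batch,
	 * track pte_dirty(pteval), and mark a deferred flush as pending ...
	 */
}

/* The callers in try_to_unmap_one()/try_to_migrate_one() then stay flat: */
	pteval = ptep_get_and_clear(mm, address, pvmw.pte);
	set_tlb_ubc_flush_pending(mm, pteval);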
diff --git a/mm/rmap.c b/mm/rmap.c
index 8632e02661ac..3c7c43642d7c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1582,7 +1582,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
                 */
                pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-               set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+               if (pte_accessible(mm, pteval))
+                       set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
        } else {
                pteval = ptep_clear_flush(vma, address, pvmw.pte);
        }
@@ -1963,7 +1964,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
                 */
                pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-               set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+               if (pte_accessible(mm, pteval))
+                       set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
        } else {
                pteval = ptep_clear_flush(vma, address, pvmw.pte);
        }
0Day/LKP reported a performance regression for commit 7e12beb8ca2a
("migrate_pages: batch flushing TLB"). In that commit, the TLB flushing
during page migration is batched, so in try_to_migrate_one(),
ptep_clear_flush() is replaced with set_tlb_ubc_flush_pending().
Further investigation found that the TLB flushing can be avoided in
ptep_clear_flush() if the PTE is inaccessible. In fact, we can optimize
the batched TLB flushing in a similar way to improve performance.

So in this patch, we check pte_accessible() before
set_tlb_ubc_flush_pending() in try_to_unmap/migrate_one(). Tests show
that the benchmark score of the anon-cow-rand-mt test case of the
vm-scalability test suite can improve by up to 2.1% with the patch on
an Intel server machine, and the number of TLB flushing IPIs can be
reduced by up to 44.3%.

Link: https://lore.kernel.org/oe-lkp/202303192325.ecbaf968-yujie.liu@intel.com
Link: https://lore.kernel.org/oe-lkp/ab92aaddf1b52ede15e2c608696c36765a2602c1.camel@intel.com/
Fixes: 7e12beb8ca2a ("migrate_pages: batch flushing TLB")
Reported-by: kernel test robot <yujie.liu@intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
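For comparison, the non-batched path already works this way: the generic
ptep_clear_flush() only flushes when the old PTE was accessible. A
paraphrase (based on mm/pgtable-generic.c; not a verbatim copy):

/* Paraphrase of the generic ptep_clear_flush(), for illustration only. */
pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
		       pte_t *ptep)
{
	struct mm_struct *mm = vma->vm_mm;
	pte_t pte;

	pte = ptep_get_and_clear(mm, address, ptep);
	if (pte_accessible(mm, pte))
		/* Only a PTE that could be cached in a TLB needs a flush. */
		flush_tlb_page(vma, address);
	return pte;
}

The patch applies the same pte_accessible() filter to the deferred
(batched) flush path in try_to_unmap_one()/try_to_migrate_one().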