Message ID | 20220921084302.43631-3-yangyicong@huawei.com (mailing list archive) |
---|---|
State | New |
Series | mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH |
[...]

On 9/21/22 14:13, Yicong Yang wrote:
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> +	/* for small systems with small number of CPUs, TLB shootdown is cheap */
> +	if (num_online_cpus() <= 4)

It would be great to have some more input from others on whether 4 (which should
be codified into a macro, e.g. ARM64_NR_CPU_DEFERRED_TLB or something similar)
is optimal for a wide range of arm64 platforms.

> +		return false;
> +
> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
> +	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
> +		return false;
> +#endif
> +
> +	return true;
> +}
> +

[...]
On 2022/9/27 14:16, Anshuman Khandual wrote:
> [...]
>
> On 9/21/22 14:13, Yicong Yang wrote:
>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>> +{
>> +	/* for small systems with small number of CPUs, TLB shootdown is cheap */
>> +	if (num_online_cpus() <= 4)
>
> It would be great to have some more input from others on whether 4 (which should
> be codified into a macro, e.g. ARM64_NR_CPU_DEFERRED_TLB or something similar)
> is optimal for a wide range of arm64 platforms.
>

Do you prefer this macro to be static, or configurable through Kconfig so that
different platforms can choose based on their own situation? It may be hard to
test on all arm64 platforms.

Thanks.

>> +		return false;
>> +
>> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
>> +	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
>> +		return false;
>> +#endif
>> +
>> +	return true;
>> +}
>> +
>
> [...]
On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyicong@huawei.com> wrote:
>
> On 2022/9/27 14:16, Anshuman Khandual wrote:
>> [...]
>>
>> On 9/21/22 14:13, Yicong Yang wrote:
>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>> +{
>>> +	/* for small systems with small number of CPUs, TLB shootdown is cheap */
>>> +	if (num_online_cpus() <= 4)
>>
>> It would be great to have some more input from others on whether 4 (which should
>> be codified into a macro, e.g. ARM64_NR_CPU_DEFERRED_TLB or something similar)
>> is optimal for a wide range of arm64 platforms.
>>

I have tested it on a 4-CPU and an 8-CPU machine, but I have no machine with
5, 6, or 7 cores. I saw an improvement on the 8-CPU machine, and I found that
the 4-CPU machine doesn't need this patch.

So it seems safe to have
	if (num_online_cpus() < 8)

> Do you prefer this macro to be static, or configurable through Kconfig so that
> different platforms can choose based on their own situation? It may be hard to
> test on all arm64 platforms.

Maybe we can enable this by default on machines with 8 or more CPUs and
provide a tlbflush_batched = on or off option to let users enable or disable
it according to their hardware and products. A similar example: rodata=on or off.

Hi Anshuman, Will, Catalin, Andrew,
what do you think about this approach?

BTW, haoxin mentioned another important user scenario for TLB batching on arm64:
https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/

I do believe we need it, given the expensive cost of TLB shootdown on arm64,
even with hardware broadcast.

> Thanks.
>
>>> +		return false;
>>> +
>>> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
>>> +	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
>>> +		return false;
>>> +#endif
>>> +
>>> +	return true;
>>> +}
>>> +
>>
>> [...]

Thanks
Barry
On 9/28/22 05:53, Barry Song wrote:
> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyicong@huawei.com> wrote:
>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>>> [...]
>
> I have tested it on a 4-CPU and an 8-CPU machine, but I have no machine with
> 5, 6, or 7 cores. I saw an improvement on the 8-CPU machine, and I found that
> the 4-CPU machine doesn't need this patch.
>
> So it seems safe to have
> 	if (num_online_cpus() < 8)
>
> [...]
>
> Maybe we can enable this by default on machines with 8 or more CPUs and
> provide a tlbflush_batched = on or off option to let users enable or disable
> it according to their hardware and products. A similar example: rodata=on or off.

No, that sounds a bit excessive. Kernel command line options should not be
added for every possible runtime switch.

> Hi Anshuman, Will, Catalin, Andrew,
> what do you think about this approach?
>
> BTW, haoxin mentioned another important user scenario for TLB batching on arm64:
> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/
>
> I do believe we need it, given the expensive cost of TLB shootdown on arm64,
> even with hardware broadcast.

Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively,
with CONFIG_EXPERT and for num_online_cpus() > 8?
[ Apologies for chiming in late in the conversation ]

Anshuman Khandual <anshuman.khandual@arm.com> writes:

> On 9/28/22 05:53, Barry Song wrote:
>> [...]
>>
>> I do believe we need it, given the expensive cost of TLB shootdown on arm64,
>> even with hardware broadcast.
>
> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively,
> with CONFIG_EXPERT and for num_online_cpus() > 8?

When running the test program from the commit in a VM, I saw benefits from
the patches at all sizes tested: 2, 4, 8, and 32 vCPUs. On the test machine,
ptep_clear_flush() went from ~1% in the unpatched version to not showing
up.

Yicong mentioned that he didn't see any benefit for <= 4 CPUs, but is
there any overhead? I am wondering what the downsides of enabling
the config by default are.

Thanks,
Punit
On Fri, Oct 28, 2022 at 3:19 AM Punit Agrawal <punit.agrawal@bytedance.com> wrote:
>
> [ Apologies for chiming in late in the conversation ]
>
> [...]
>
> When running the test program from the commit in a VM, I saw benefits from
> the patches at all sizes tested: 2, 4, 8, and 32 vCPUs. On the test machine,
> ptep_clear_flush() went from ~1% in the unpatched version to not showing
> up.
>
> Yicong mentioned that he didn't see any benefit for <= 4 CPUs, but is
> there any overhead? I am wondering what the downsides of enabling
> the config by default are.

Because we defer the TLB flush, any path that modifies a VMA while flushes
for it are still deferred has to synchronize first. mprotect() and madvise()
do this by calling flush_tlb_batched_pending(), which makes sure they see the
flushed result. If nobody calls mprotect(), madvise(), etc. during the
deferred period, the overhead is zero.

> Thanks,
> Punit

Thanks
Barry
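The cost described in the reply above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct mm_model` and every function name below are hypothetical stand-ins, not kernel APIs, and the model covers only the window in which a deferred flush has not yet been issued. It shows that deferring costs nothing by itself, and that the first synchronizing caller pays for one whole-mm flush.

```c
#include <stdbool.h>

/* Toy stand-in for per-mm state; all names here are hypothetical. */
struct mm_model {
    bool flush_pending;        /* a flush has been deferred for this mm */
    unsigned int page_flushes; /* immediate single-page flushes issued */
    unsigned int full_flushes; /* whole-mm flushes issued by the sync path */
};

/* Unmap without batching: flush the single page right away. */
static void unmap_page_immediate(struct mm_model *mm)
{
    mm->page_flushes++;
}

/* Unmap with batching: only record that a flush is owed. */
static void unmap_page_deferred(struct mm_model *mm)
{
    mm->flush_pending = true;
}

/*
 * Models the synchronization that mprotect()/madvise() perform through
 * flush_tlb_batched_pending(): if a flush is owed, pay for a whole-mm
 * flush now; otherwise this is free.
 */
static void sync_before_vma_change(struct mm_model *mm)
{
    if (mm->flush_pending) {
        mm->full_flushes++;
        mm->flush_pending = false;
    }
}
```

Calling `unmap_page_deferred()` any number of times leaves both counters untouched; only a later `sync_before_vma_change()` issues a flush, and only one per pending window. `unmap_page_immediate()` is included for comparison with the unbatched path.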
On Thu, Oct 27, 2022 at 11:42 PM Anshuman Khandual <anshuman.khandual@arm.com> wrote:
>
> On 9/28/22 05:53, Barry Song wrote:
>> [...]
>>
>> I do believe we need it, given the expensive cost of TLB shootdown on arm64,
>> even with hardware broadcast.
>
> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively,
> with CONFIG_EXPERT and for num_online_cpus() > 8?

Sounds good to me. It is a good start for bringing up batched TLB flush on
arm64. Later on, we might want to see it in both memory reclamation and
migration.

Thanks
Barry
On 2022/10/27 22:19, Punit Agrawal wrote:
>
> [ Apologies for chiming in late in the conversation ]
>
> [...]
>
> When running the test program from the commit in a VM, I saw benefits from
> the patches at all sizes tested: 2, 4, 8, and 32 vCPUs. On the test machine,
> ptep_clear_flush() went from ~1% in the unpatched version to not showing
> up.
>

Maybe you're booting the VM on a server with more than 32 cores, while Barry
tested on his 4-CPU embedded platform. I guess a 4-CPU VM is not fully
equivalent to a 4-CPU real machine, as the tlbi and dsb in the VM may
influence the host as well.

> Yicong mentioned that he didn't see any benefit for <= 4 CPUs, but is
> there any overhead? I am wondering what the downsides of enabling
> the config by default are.
>
> Thanks,
> Punit
On 10/28/22 03:37, Barry Song wrote:
> On Thu, Oct 27, 2022 at 11:42 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>> [...]
>>
>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively,
>> with CONFIG_EXPERT and for num_online_cpus() > 8?
>
> Sounds good to me. It is a good start for bringing up batched TLB flush on
> arm64. Later on, we might want to see it in both memory reclamation and
> migration.

Right, that is the idea. CONFIG_EXPERT gives a way to test this out for some
time on various platforms, and later it can be dropped.

The num_online_cpus() threshold of 8, at which batched TLB flushing would
potentially give a benefit, should be defined as a macro, e.g.
NR_CPUS_FOR_BATCHED_TLB, or as an internal (non-user-selectable) config, with
a proper in-code comment explaining the rationale.
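The macro suggestion above could look roughly like the following userspace sketch. It is an illustration under assumptions, not the posted patch: `NR_CPUS_FOR_BATCHED_TLB` is the name floated in the review, the value 8 and the `<` comparison follow the `num_online_cpus() < 8` proposal earlier in the thread, and the kernel helpers `num_online_cpus()` and `this_cpu_has_cap()` are replaced by plain parameters so the logic compiles outside the kernel.

```c
#include <stdbool.h>

/*
 * Hypothetical macro name from the review discussion. The value 8 is the
 * threshold proposed in the thread from tests on 4-CPU and 8-CPU machines;
 * it is not a measured optimum for all arm64 platforms.
 */
#define NR_CPUS_FOR_BATCHED_TLB 8

/*
 * Userspace model of arch_tlbbatch_should_defer(). Both inputs are plain
 * parameters standing in for num_online_cpus() and the
 * ARM64_WORKAROUND_REPEAT_TLBI capability check.
 */
static bool tlbbatch_should_defer_model(unsigned int online_cpus,
                                        bool has_repeat_tlbi_workaround)
{
    /* On small systems a TLB shootdown is cheap, so do not defer. */
    if (online_cpus < NR_CPUS_FOR_BATCHED_TLB)
        return false;

    /* CPUs needing the repeat-TLBI workaround never defer. */
    if (has_repeat_tlbi_workaround)
        return false;

    return true;
}
```

With these assumptions, a 4-CPU system never defers, an 8-CPU system defers unless the workaround flag is set, and changing the policy means editing one commented constant rather than a bare number in the condition.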
On 10/28/22 03:25, Barry Song wrote:
> On Fri, Oct 28, 2022 at 3:19 AM Punit Agrawal
> <punit.agrawal@bytedance.com> wrote:
>> [...]
>>
>> Yicong mentioned that he didn't see any benefit for <= 4 CPUs, but is
>> there any overhead? I am wondering what the downsides of enabling
>> the config by default are.
>
> Because we defer the TLB flush, any path that modifies a VMA while flushes
> for it are still deferred has to synchronize first. mprotect() and madvise()
> do this by calling flush_tlb_batched_pending(), which makes sure they see the
> flushed result. If nobody calls mprotect(), madvise(), etc. during the
> deferred period, the overhead is zero.

Right, it is difficult to justify this overhead on smaller systems, which
for sure would not benefit from this batched TLB framework.
Yicong Yang <yangyicong@huawei.com> writes:

> On 2022/10/27 22:19, Punit Agrawal wrote:
>> [...]
>>
>> When running the test program from the commit in a VM, I saw benefits from
>> the patches at all sizes tested: 2, 4, 8, and 32 vCPUs. On the test machine,
>> ptep_clear_flush() went from ~1% in the unpatched version to not showing
>> up.
>>
>
> Maybe you're booting the VM on a server with more than 32 cores, while Barry
> tested on his 4-CPU embedded platform. I guess a 4-CPU VM is not fully
> equivalent to a 4-CPU real machine, as the tlbi and dsb in the VM may
> influence the host as well.

Yeah, I also wondered about this.

I was able to test on a 6-core RK3399-based system. There, ptep_clear_flush()
was only 0.10% of the overall execution time. The hardware seems to do a
pretty good job of keeping the TLB flushing overhead low.

[...]
Anshuman Khandual <anshuman.khandual@arm.com> writes:

> On 10/28/22 03:25, Barry Song wrote:
>> [...]
>>
>> Because we defer the TLB flush, any path that modifies a VMA while flushes
>> for it are still deferred has to synchronize first. mprotect() and madvise()
>> do this by calling flush_tlb_batched_pending(), which makes sure they see the
>> flushed result. If nobody calls mprotect(), madvise(), etc. during the
>> deferred period, the overhead is zero.
>
> Right, it is difficult to justify this overhead on smaller systems, which
> for sure would not benefit from this batched TLB framework.

Thank you for the pointers to the overhead.

Having looked at this more closely, I also see that
flush_tlb_batched_pending() flushes the TLB entries for the entire mm, rather
than just the page being unmapped (as is done with ptep_clear_flush()).
On Sat, Oct 29, 2022 at 2:11 AM Punit Agrawal <punit.agrawal@bytedance.com> wrote:
>
> Yicong Yang <yangyicong@huawei.com> writes:
>
>> [...]
>>
>> Maybe you're booting the VM on a server with more than 32 cores, while Barry
>> tested on his 4-CPU embedded platform. I guess a 4-CPU VM is not fully
>> equivalent to a 4-CPU real machine, as the tlbi and dsb in the VM may
>> influence the host as well.
>
> Yeah, I also wondered about this.
>
> I was able to test on a 6-core RK3399-based system. There, ptep_clear_flush()
> was only 0.10% of the overall execution time. The hardware seems to do a
> pretty good job of keeping the TLB flushing overhead low.

The RK3399 has a dual-core ARM Cortex-A72 MPCore processor and a quad-core
ARM Cortex-A53 MPCore processor. You will probably see different
ptep_clear_flush() overhead depending on which cores you bind the
micro-benchmark to.

> [...]

Thanks
Barry
Barry Song <21cnbao@gmail.com> writes:

> On Sat, Oct 29, 2022 at 2:11 AM Punit Agrawal
> <punit.agrawal@bytedance.com> wrote:
>>
>> Yicong Yang <yangyicong@huawei.com> writes:
>>
>> > On 2022/10/27 22:19, Punit Agrawal wrote:
>> >>
>> >> [ Apologies for chiming in late in the conversation ]
>> >>
>> >> Anshuman Khandual <anshuman.khandual@arm.com> writes:
>> >>
>> >>> On 9/28/22 05:53, Barry Song wrote:
>> >>>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <yangyicong@huawei.com> wrote:
>> >>>>>
>> >>>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>> >>>>>> [...]
>> >>>>>>
>> >>>>>> On 9/21/22 14:13, Yicong Yang wrote:
>> >>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>> >>>>>>> +{
>> >>>>>>> +	/* for small systems with small number of CPUs, TLB shootdown is cheap */
>> >>>>>>> +	if (num_online_cpus() <= 4)
>> >>>>>>
>> >>>>>> It would be great to have some more inputs from others, whether 4 (which should
>> >>>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
>> >>>>>> is optimal for an wide range of arm64 platforms.
>> >>>>>>
>> >>>>
>> >>>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
>> >>>> with 5,6,7
>> >>>> cores.
>> >>>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
>> >>>> this patch.
>> >>>>
>> >>>> so it seems safe to have
>> >>>> if (num_online_cpus() < 8)
>> >>>>
>> >>>>>
>> >>>>> Do you prefer this macro to be static or make it configurable through kconfig then
>> >>>>> different platforms can make choice based on their own situations? It maybe hard to
>> >>>>> test on all the arm64 platforms.
>> >>>>
>> >>>> Maybe we can have this default enabled on machines with 8 and more cpus and
>> >>>> provide a tlbflush_batched = on or off to allow users enable or
>> >>>> disable it according
>> >>>> to their hardware and products. Similar example: rodata=on or off.
>> >>>
>> >>> No, sounds bit excessive. Kernel command line options should not be added
>> >>> for every possible run time switch options.
>> >>>
>> >>>>
>> >>>> Hi Anshuman, Will, Catalin, Andrew,
>> >>>> what do you think about this approach?
>> >>>>
>> >>>> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
>> >>>> https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/
>> >>>>
>> >>>> I do believe we need it based on the expensive cost of tlb shootdown in arm64
>> >>>> even by hardware broadcast.
>> >>>
>> >>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
>> >>> with CONFIG_EXPERT and for num_online_cpus() > 8 ?
>> >>
>> >> When running the test program in the commit in a VM, I saw benefits from
>> >> the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
>> >> ptep_clear_flush() went from ~1% in the unpatched version to not showing
>> >> up.
>> >>
>> >
>> > Maybe you're booting VM on a server with more than 32 cores and Barry tested
>> > on his 4 CPUs embedded platform. I guess a 4 CPU VM is not fully equivalent to
>> > a 4 CPU real machine as the tbli and dsb in the VM may influence the host
>> > as well.
>>
>> Yeah, I also wondered about this.
>>
>> I was able to test on a 6-core RK3399 based system - there the
>> ptep_clear_flush() was only 0.10% of the overall execution time. The
>> hardware seems to do a pretty good job of keeping the TLB flushing
>> overhead low.

I found a problem with my measurements (missing volatile). Correcting
that increased the overhead somewhat - more below.

> RK3399 has Dual-core ARM Cortex-A72 MPCore processor and
> Quad-core ARM Cortex-A53 MPCore processor. you are probably
> going to see different overhead of ptep_clear_flush() when you
> bind the micro-benchmark on different cores.

Indeed - binding the code on the A53 shows half the overhead from
ptep_clear_flush() compared to the A72.

On the A53 -

    $ perf report --stdio -i perf.vanilla.a53.data | grep ptep_clear_flush
         0.63%  pageout  [kernel.kallsyms]  [k] ptep_clear_flush

On the A72

    $ perf report --stdio -i perf.vanilla.a72.data | grep ptep_clear_flush
         1.34%  pageout  [kernel.kallsyms]  [k] ptep_clear_flush

[...]
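For anyone wanting to repeat the per-core comparison above, here is a minimal user-space sketch of the two ingredients mentioned in this exchange: binding the benchmark to one CPU and keeping the memory accesses from being optimised away. It is not the test program from the commit. The helper names (`first_allowed_cpu`, `pin_to_cpu`, `touch_pages`) are made up for illustration, and the use of a `volatile` pointer is only one plausible reading of the "missing volatile" remark. Which CPU numbers map to the A53 or A72 cluster depends on the board and has to be checked on the target.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>
#include <stdint.h>

/* Return the lowest CPU number this thread may run on, or -1 on error. */
static int first_allowed_cpu(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}

/* Restrict the calling thread to a single CPU; 0 on success, -1 on error. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}

/*
 * Write and read back one byte per page. The volatile qualifier forces
 * the compiler to emit every access, so the pages really are touched.
 * Returns the number of pages touched.
 */
static uint64_t touch_pages(volatile unsigned char *buf, size_t len,
			    size_t page_size)
{
	uint64_t sum = 0;

	if (page_size == 0)
		return 0;
	for (size_t off = 0; off < len; off += page_size) {
		buf[off] = 1;
		sum += buf[off];
	}
	return sum;
}
```

A benchmark would call `pin_to_cpu()` once at start-up with a CPU from the cluster under test, then run its map/touch/reclaim loop and be profiled with `perf` as shown above.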
diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
index 039e4e91ada3..2caf815d7c6c 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
     |       alpha: | TODO |
     |         arc: | TODO |
     |         arm: | TODO |
-    |       arm64: | N/A  |
+    |       arm64: |  ok  |
     |        csky: | TODO |
     |     hexagon: | TODO |
     |        ia64: | TODO |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1ce7685ad5de..40da6984f303 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -93,6 +93,7 @@ config ARM64
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
 	select ARCH_WANT_DEFAULT_BPF_JIT
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
new file mode 100644
index 000000000000..fedb0b87b8db
--- /dev/null
+++ b/arch/arm64/include/asm/tlbbatch.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ARCH_ARM64_TLBBATCH_H
+#define _ARCH_ARM64_TLBBATCH_H
+
+struct arch_tlbflush_unmap_batch {
+	/*
+	 * For arm64, HW can do tlb shootdown, so we don't
+	 * need to record cpumask for sending IPI
+	 */
+};
+
+#endif /* _ARCH_ARM64_TLBBATCH_H */
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 412a3b9a3c25..1b4df0352960 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	dsb(ish);
 }
 
-static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+
+static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
 					 unsigned long uaddr)
 {
 	unsigned long addr;
 
 	dsb(ishst);
-	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
+	addr = __TLBI_VADDR(uaddr, ASID(mm));
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
 }
 
+static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+					 unsigned long uaddr)
+{
+	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
+}
+
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
@@ -272,6 +279,32 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 	dsb(ish);
 }
 
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+	/* for small systems with small number of CPUs, TLB shootdown is cheap */
+	if (num_online_cpus() <= 4)
+		return false;
+
+#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
+	if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
+		return false;
+#endif
+
+	return true;
+}
+
+static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
+					struct mm_struct *mm,
+					unsigned long uaddr)
+{
+	__flush_tlb_page_nosync(mm, uaddr);
+}
+
+static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+{
+	dsb(ish);
+}
+
 /*
  * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
  * necessarily a performance improvement.
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8a497d902c16..5bd78ae55cd4 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -264,7 +264,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 }
 
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
-					struct mm_struct *mm)
+					struct mm_struct *mm,
+					unsigned long uaddr)
 {
 	inc_mm_tlb_gen(mm);
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
diff --git a/mm/rmap.c b/mm/rmap.c
index cd8cf5cb0b01..e060cc0187cd 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -645,12 +645,13 @@ void try_to_unmap_flush_dirty(void)
 #define TLB_FLUSH_BATCH_PENDING_LARGE			\
 	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
 
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+				      unsigned long uaddr)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 	int batch, nbatch;
 
-	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
+	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, uaddr);
 	tlb_ubc->flush_required = true;
 
 	/*
@@ -728,7 +729,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
 	}
 }
 #else
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+				      unsigned long uaddr)
 {
 }
 
@@ -1590,7 +1592,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), address);
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
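
To make the effect of the patch easier to see outside the kernel, the following is a small user-space model, not kernel code. It only counts operations. `tlbi_issued` stands in for the per-page `__tlbi(vale1is, ...)` and `dsb_ish_issued` for the completion barrier `dsb(ish)`. The macro name `ARM64_NR_CPU_DEFERRED_TLB` is the one floated in review and is not part of the posted patch. `struct tlb_model`, `model_should_defer()` and `model_unmap()` are hypothetical names, and the CPU count and erratum state are passed in as plain arguments instead of being read from the system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Name suggested in review for the bare constant 4; not in the patch. */
#define ARM64_NR_CPU_DEFERRED_TLB 4

/* Counters standing in for the real instructions; nothing is flushed. */
struct tlb_model {
	unsigned long tlbi_issued;	/* models __tlbi(vale1is, addr) */
	unsigned long dsb_ish_issued;	/* models dsb(ish) */
};

/* Model of arch_tlbbatch_should_defer() with its inputs made explicit. */
static bool model_should_defer(unsigned int online_cpus,
			       bool repeat_tlbi_erratum)
{
	if (online_cpus <= ARM64_NR_CPU_DEFERRED_TLB)
		return false;
	if (repeat_tlbi_erratum)
		return false;
	return true;
}

/*
 * Unmap nr_pages pages. Without deferral every page waits for its own
 * barrier (the ptep_clear_flush() path). With deferral the invalidations
 * are issued per page and a single barrier completes them all (the
 * arch_tlbbatch_add_mm() / arch_tlbbatch_flush() path).
 */
static void model_unmap(struct tlb_model *m, size_t nr_pages, bool defer)
{
	assert(m != NULL);

	for (size_t i = 0; i < nr_pages; i++) {
		m->tlbi_issued++;
		if (!defer)
			m->dsb_ish_issued++;
	}
	if (defer && nr_pages > 0)
		m->dsb_ish_issued++;
}
```

In this model, unmapping 32 pages costs 32 invalidations and 32 barriers without deferral, and 32 invalidations and 1 barrier with it. The model says nothing about how expensive each barrier is on a given platform, which is the open question in the thread about where the CPU-count threshold should sit.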