Message ID | 20240229232211.161961-9-samuel.holland@sifive.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | riscv: ASID-related and UP-related TLB flush enhancements | expand |
Hi Samuel, On Fri, Mar 1, 2024 at 7:22 AM Samuel Holland <samuel.holland@sifive.com> wrote: > > Since implementations affected by SiFive errata CIP-1200 always use the > global variant of the sfence.vma instruction, they only need to execute > the instruction once. The range-based loop only hurts performance. > > Signed-off-by: Samuel Holland <samuel.holland@sifive.com> > --- > > (no changes since v4) > > Changes in v4: > - Only set tlb_flush_all_threshold when CONFIG_MMU=y. > > Changes in v3: > - New patch for v3 > > arch/riscv/errata/sifive/errata.c | 5 +++++ > arch/riscv/include/asm/tlbflush.h | 2 ++ > arch/riscv/mm/tlbflush.c | 2 +- > 3 files changed, 8 insertions(+), 1 deletion(-) > > diff --git a/arch/riscv/errata/sifive/errata.c b/arch/riscv/errata/sifive/errata.c > index 3d9a32d791f7..716cfedad3a2 100644 > --- a/arch/riscv/errata/sifive/errata.c > +++ b/arch/riscv/errata/sifive/errata.c > @@ -42,6 +42,11 @@ static bool errata_cip_1200_check_func(unsigned long arch_id, unsigned long imp > return false; > if ((impid & 0xffffff) > 0x200630 || impid == 0x1200626) > return false; > + > +#ifdef CONFIG_MMU > + tlb_flush_all_threshold = 0; > +#endif > + > return true; > } > > diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h > index 463b615d7728..8e329721375b 100644 > --- a/arch/riscv/include/asm/tlbflush.h > +++ b/arch/riscv/include/asm/tlbflush.h > @@ -66,6 +66,8 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch, > unsigned long uaddr); > void arch_flush_tlb_batched_pending(struct mm_struct *mm); > void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch); > + > +extern unsigned long tlb_flush_all_threshold; > #else /* CONFIG_MMU */ > #define local_flush_tlb_all() do { } while (0) > #endif /* CONFIG_MMU */ > diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c > index 365e0a0e4725..22870f213188 100644 > --- a/arch/riscv/mm/tlbflush.c > +++ b/arch/riscv/mm/tlbflush.c > @@ -11,7 +11,7 @@ > * Flush entire TLB if number of entries to be flushed is greater > * than the threshold below. > */ > -static unsigned long tlb_flush_all_threshold __read_mostly = 64; > +unsigned long tlb_flush_all_threshold __read_mostly = 64; > > static void local_flush_tlb_range_threshold_asid(unsigned long start, > unsigned long size, > -- > 2.43.1 > If local_flush_tlb_all_asid() is used every time, more PTWs will be generated. Will such modifications definitely improve the overall performance? Hi Alex, Samuel, The relationship between flush_xx_range_asid() and nr_ptes is basically linear growth (y=kx +b), while flush_all_asid() has nothing to do with nr_ptes (y=c). Some TLBs may do some optimization. The operation of flush all itself requires very few cycles, but there is a certain delay between consecutive flush all. The intersection of the two straight lines is the optimal solution of tlb_flush_all_threshold. In actual situations, continuous flush_all_asid will not occur. One problem caused by flush_all_asid() is that multiple flush entries require PTW, which causes greater latency. Therefore, the value of tlb_flush_all_threshold needs to be considered or quantified. Maybe doing local_flush_tlb_page_asid() based on the actual nr_ptes_in_range would give better overall performance. What do you think? Thanks, Yunhui
Hi Yunhui, On 2024-02-29 8:48 PM, yunhui cui wrote: > Hi Samuel, > > On Fri, Mar 1, 2024 at 7:22 AM Samuel Holland <samuel.holland@sifive.com> wrote: >> >> Since implementations affected by SiFive errata CIP-1200 always use the >> global variant of the sfence.vma instruction, they only need to execute >> the instruction once. The range-based loop only hurts performance. >> >> Signed-off-by: Samuel Holland <samuel.holland@sifive.com> >> --- >> >> (no changes since v4) >> >> Changes in v4: >> - Only set tlb_flush_all_threshold when CONFIG_MMU=y. >> >> Changes in v3: >> - New patch for v3 >> >> arch/riscv/errata/sifive/errata.c | 5 +++++ >> arch/riscv/include/asm/tlbflush.h | 2 ++ >> arch/riscv/mm/tlbflush.c | 2 +- >> 3 files changed, 8 insertions(+), 1 deletion(-) >> >> diff --git a/arch/riscv/errata/sifive/errata.c b/arch/riscv/errata/sifive/errata.c >> index 3d9a32d791f7..716cfedad3a2 100644 >> --- a/arch/riscv/errata/sifive/errata.c >> +++ b/arch/riscv/errata/sifive/errata.c >> @@ -42,6 +42,11 @@ static bool errata_cip_1200_check_func(unsigned long arch_id, unsigned long imp >> return false; >> if ((impid & 0xffffff) > 0x200630 || impid == 0x1200626) >> return false; >> + >> +#ifdef CONFIG_MMU >> + tlb_flush_all_threshold = 0; >> +#endif >> + >> return true; >> } >> >> diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h >> index 463b615d7728..8e329721375b 100644 >> --- a/arch/riscv/include/asm/tlbflush.h >> +++ b/arch/riscv/include/asm/tlbflush.h >> @@ -66,6 +66,8 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch, >> unsigned long uaddr); >> void arch_flush_tlb_batched_pending(struct mm_struct *mm); >> void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch); >> + >> +extern unsigned long tlb_flush_all_threshold; >> #else /* CONFIG_MMU */ >> #define local_flush_tlb_all() do { } while (0) >> #endif /* CONFIG_MMU */ >> diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c >> index 365e0a0e4725..22870f213188 100644 >> --- a/arch/riscv/mm/tlbflush.c >> +++ b/arch/riscv/mm/tlbflush.c >> @@ -11,7 +11,7 @@ >> * Flush entire TLB if number of entries to be flushed is greater >> * than the threshold below. >> */ >> -static unsigned long tlb_flush_all_threshold __read_mostly = 64; >> +unsigned long tlb_flush_all_threshold __read_mostly = 64; >> >> static void local_flush_tlb_range_threshold_asid(unsigned long start, >> unsigned long size, >> -- >> 2.43.1 >> > > If local_flush_tlb_all_asid() is used every time, more PTWs will be > generated. Will such modifications definitely improve the overall > performance? This change in this commit specifically applies to older SiFive SoCs with a bug making single-page sfence.vma instructions unsafe to use. In this case, a single call to local_flush_tlb_all_asid() is optimal, yes. > Hi Alex, Samuel, > The relationship between flush_xx_range_asid() and nr_ptes is > basically linear growth (y=kx +b), while flush_all_asid() has nothing > to do with nr_ptes (y=c). > Some TLBs may do some optimization. The operation of flush all itself > requires very few cycles, but there is a certain delay between > consecutive flush all. > The intersection of the two straight lines is the optimal solution of > tlb_flush_all_threshold. In actual situations, continuous > flush_all_asid will not occur. One problem caused by flush_all_asid() > is that multiple flush entries require PTW, which causes greater > latency. > Therefore, the value of tlb_flush_all_threshold needs to be considered > or quantified. Maybe doing local_flush_tlb_page_asid() based on the > actual nr_ptes_in_range would give better overall performance. > What do you think? Yes, this was something Alex brought up when adding this threshold, that it should be tuned for various scenarios. That still needs to be done. This patch just covers one specific case where we know the optimal answer due to an erratum. Regards, Samuel
H Samuel, On Tue, Mar 12, 2024 at 8:36 AM Samuel Holland <samuel.holland@sifive.com> wrote: > > Hi Yunhui, > > On 2024-02-29 8:48 PM, yunhui cui wrote: > > Hi Samuel, > > > > On Fri, Mar 1, 2024 at 7:22 AM Samuel Holland <samuel.holland@sifive.com> wrote: > >> > >> Since implementations affected by SiFive errata CIP-1200 always use the > >> global variant of the sfence.vma instruction, they only need to execute > >> the instruction once. The range-based loop only hurts performance. > >> > >> Signed-off-by: Samuel Holland <samuel.holland@sifive.com> > >> --- > >> > >> (no changes since v4) > >> > >> Changes in v4: > >> - Only set tlb_flush_all_threshold when CONFIG_MMU=y. > >> > >> Changes in v3: > >> - New patch for v3 > >> > >> arch/riscv/errata/sifive/errata.c | 5 +++++ > >> arch/riscv/include/asm/tlbflush.h | 2 ++ > >> arch/riscv/mm/tlbflush.c | 2 +- > >> 3 files changed, 8 insertions(+), 1 deletion(-) > >> > >> diff --git a/arch/riscv/errata/sifive/errata.c b/arch/riscv/errata/sifive/errata.c > >> index 3d9a32d791f7..716cfedad3a2 100644 > >> --- a/arch/riscv/errata/sifive/errata.c > >> +++ b/arch/riscv/errata/sifive/errata.c > >> @@ -42,6 +42,11 @@ static bool errata_cip_1200_check_func(unsigned long arch_id, unsigned long imp > >> return false; > >> if ((impid & 0xffffff) > 0x200630 || impid == 0x1200626) > >> return false; > >> + > >> +#ifdef CONFIG_MMU > >> + tlb_flush_all_threshold = 0; > >> +#endif > >> + > >> return true; > >> } > >> > >> diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h > >> index 463b615d7728..8e329721375b 100644 > >> --- a/arch/riscv/include/asm/tlbflush.h > >> +++ b/arch/riscv/include/asm/tlbflush.h > >> @@ -66,6 +66,8 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch, > >> unsigned long uaddr); > >> void arch_flush_tlb_batched_pending(struct mm_struct *mm); > >> void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch); > >> + > >> +extern unsigned long tlb_flush_all_threshold; > >> #else /* CONFIG_MMU */ > >> #define local_flush_tlb_all() do { } while (0) > >> #endif /* CONFIG_MMU */ > >> diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c > >> index 365e0a0e4725..22870f213188 100644 > >> --- a/arch/riscv/mm/tlbflush.c > >> +++ b/arch/riscv/mm/tlbflush.c > >> @@ -11,7 +11,7 @@ > >> * Flush entire TLB if number of entries to be flushed is greater > >> * than the threshold below. > >> */ > >> -static unsigned long tlb_flush_all_threshold __read_mostly = 64; > >> +unsigned long tlb_flush_all_threshold __read_mostly = 64; > >> > >> static void local_flush_tlb_range_threshold_asid(unsigned long start, > >> unsigned long size, > >> -- > >> 2.43.1 > >> > > > > If local_flush_tlb_all_asid() is used every time, more PTWs will be > > generated. Will such modifications definitely improve the overall > > performance? > > This change in this commit specifically applies to older SiFive SoCs with a bug > making single-page sfence.vma instructions unsafe to use. In this case, a single > call to local_flush_tlb_all_asid() is optimal, yes. Would it be more clear to add this content to the git commit description appropriately? > > > Hi Alex, Samuel, > > The relationship between flush_xx_range_asid() and nr_ptes is > > basically linear growth (y=kx +b), while flush_all_asid() has nothing > > to do with nr_ptes (y=c). > > Some TLBs may do some optimization. The operation of flush all itself > > requires very few cycles, but there is a certain delay between > > consecutive flush all. > > The intersection of the two straight lines is the optimal solution of > > tlb_flush_all_threshold. In actual situations, continuous > > flush_all_asid will not occur. One problem caused by flush_all_asid() > > is that multiple flush entries require PTW, which causes greater > > latency. > > Therefore, the value of tlb_flush_all_threshold needs to be considered > > or quantified. Maybe doing local_flush_tlb_page_asid() based on the > > actual nr_ptes_in_range would give better overall performance. > > What do you think? > > Yes, this was something Alex brought up when adding this threshold, that it > should be tuned for various scenarios. That still needs to be done. This patch > just covers one specific case where we know the optimal answer due to an erratum. > > Regards, > Samuel > Thanks, Yunhui
diff --git a/arch/riscv/errata/sifive/errata.c b/arch/riscv/errata/sifive/errata.c index 3d9a32d791f7..716cfedad3a2 100644 --- a/arch/riscv/errata/sifive/errata.c +++ b/arch/riscv/errata/sifive/errata.c @@ -42,6 +42,11 @@ static bool errata_cip_1200_check_func(unsigned long arch_id, unsigned long imp return false; if ((impid & 0xffffff) > 0x200630 || impid == 0x1200626) return false; + +#ifdef CONFIG_MMU + tlb_flush_all_threshold = 0; +#endif + return true; } diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h index 463b615d7728..8e329721375b 100644 --- a/arch/riscv/include/asm/tlbflush.h +++ b/arch/riscv/include/asm/tlbflush.h @@ -66,6 +66,8 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch, unsigned long uaddr); void arch_flush_tlb_batched_pending(struct mm_struct *mm); void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch); + +extern unsigned long tlb_flush_all_threshold; #else /* CONFIG_MMU */ #define local_flush_tlb_all() do { } while (0) #endif /* CONFIG_MMU */ diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c index 365e0a0e4725..22870f213188 100644 --- a/arch/riscv/mm/tlbflush.c +++ b/arch/riscv/mm/tlbflush.c @@ -11,7 +11,7 @@ * Flush entire TLB if number of entries to be flushed is greater * than the threshold below. */ -static unsigned long tlb_flush_all_threshold __read_mostly = 64; +unsigned long tlb_flush_all_threshold __read_mostly = 64; static void local_flush_tlb_range_threshold_asid(unsigned long start, unsigned long size,
Since implementations affected by SiFive errata CIP-1200 always use the global variant of the sfence.vma instruction, they only need to execute the instruction once. The range-based loop only hurts performance. Signed-off-by: Samuel Holland <samuel.holland@sifive.com> --- (no changes since v4) Changes in v4: - Only set tlb_flush_all_threshold when CONFIG_MMU=y. Changes in v3: - New patch for v3 arch/riscv/errata/sifive/errata.c | 5 +++++ arch/riscv/include/asm/tlbflush.h | 2 ++ arch/riscv/mm/tlbflush.c | 2 +- 3 files changed, 8 insertions(+), 1 deletion(-)