diff mbox series

[v5,08/13] riscv: Avoid TLB flush loops when affected by SiFive CIP-1200

Message ID 20240229232211.161961-9-samuel.holland@sifive.com (mailing list archive)
State Superseded
Headers show
Series riscv: ASID-related and UP-related TLB flush enhancements | expand

Checks

Context Check Description
conchuod/vmtest-fixes-PR fail merge-conflict
conchuod/vmtest-for-next-PR fail PR summary
conchuod/patch-8-test-1 fail .github/scripts/patches/tests/build_rv32_defconfig.sh
conchuod/patch-8-test-2 success .github/scripts/patches/tests/build_rv64_clang_allmodconfig.sh
conchuod/patch-8-test-3 success .github/scripts/patches/tests/build_rv64_gcc_allmodconfig.sh
conchuod/patch-8-test-4 success .github/scripts/patches/tests/build_rv64_nommu_k210_defconfig.sh
conchuod/patch-8-test-5 success .github/scripts/patches/tests/build_rv64_nommu_virt_defconfig.sh
conchuod/patch-8-test-6 success .github/scripts/patches/tests/checkpatch.sh
conchuod/patch-8-test-7 success .github/scripts/patches/tests/dtb_warn_rv64.sh
conchuod/patch-8-test-8 success .github/scripts/patches/tests/header_inline.sh
conchuod/patch-8-test-9 success .github/scripts/patches/tests/kdoc.sh
conchuod/patch-8-test-10 success .github/scripts/patches/tests/module_param.sh
conchuod/patch-8-test-11 success .github/scripts/patches/tests/verify_fixes.sh
conchuod/patch-8-test-12 success .github/scripts/patches/tests/verify_signedoff.sh

Commit Message

Samuel Holland Feb. 29, 2024, 11:21 p.m. UTC
Since implementations affected by SiFive errata CIP-1200 always use the
global variant of the sfence.vma instruction, they only need to execute
the instruction once. The range-based loop only hurts performance.

Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
---

(no changes since v4)

Changes in v4:
 - Only set tlb_flush_all_threshold when CONFIG_MMU=y.

Changes in v3:
 - New patch for v3

 arch/riscv/errata/sifive/errata.c | 5 +++++
 arch/riscv/include/asm/tlbflush.h | 2 ++
 arch/riscv/mm/tlbflush.c          | 2 +-
 3 files changed, 8 insertions(+), 1 deletion(-)

Comments

yunhui cui March 1, 2024, 2:48 a.m. UTC | #1
Hi Samuel,

On Fri, Mar 1, 2024 at 7:22 AM Samuel Holland <samuel.holland@sifive.com> wrote:
>
> Since implementations affected by SiFive errata CIP-1200 always use the
> global variant of the sfence.vma instruction, they only need to execute
> the instruction once. The range-based loop only hurts performance.
>
> Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
> ---
>
> (no changes since v4)
>
> Changes in v4:
>  - Only set tlb_flush_all_threshold when CONFIG_MMU=y.
>
> Changes in v3:
>  - New patch for v3
>
>  arch/riscv/errata/sifive/errata.c | 5 +++++
>  arch/riscv/include/asm/tlbflush.h | 2 ++
>  arch/riscv/mm/tlbflush.c          | 2 +-
>  3 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/riscv/errata/sifive/errata.c b/arch/riscv/errata/sifive/errata.c
> index 3d9a32d791f7..716cfedad3a2 100644
> --- a/arch/riscv/errata/sifive/errata.c
> +++ b/arch/riscv/errata/sifive/errata.c
> @@ -42,6 +42,11 @@ static bool errata_cip_1200_check_func(unsigned long  arch_id, unsigned long imp
>                 return false;
>         if ((impid & 0xffffff) > 0x200630 || impid == 0x1200626)
>                 return false;
> +
> +#ifdef CONFIG_MMU
> +       tlb_flush_all_threshold = 0;
> +#endif
> +
>         return true;
>  }
>
> diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
> index 463b615d7728..8e329721375b 100644
> --- a/arch/riscv/include/asm/tlbflush.h
> +++ b/arch/riscv/include/asm/tlbflush.h
> @@ -66,6 +66,8 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
>                                unsigned long uaddr);
>  void arch_flush_tlb_batched_pending(struct mm_struct *mm);
>  void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
> +
> +extern unsigned long tlb_flush_all_threshold;
>  #else /* CONFIG_MMU */
>  #define local_flush_tlb_all()                  do { } while (0)
>  #endif /* CONFIG_MMU */
> diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
> index 365e0a0e4725..22870f213188 100644
> --- a/arch/riscv/mm/tlbflush.c
> +++ b/arch/riscv/mm/tlbflush.c
> @@ -11,7 +11,7 @@
>   * Flush entire TLB if number of entries to be flushed is greater
>   * than the threshold below.
>   */
> -static unsigned long tlb_flush_all_threshold __read_mostly = 64;
> +unsigned long tlb_flush_all_threshold __read_mostly = 64;
>
>  static void local_flush_tlb_range_threshold_asid(unsigned long start,
>                                                  unsigned long size,
> --
> 2.43.1
>

If local_flush_tlb_all_asid() is used every time, more PTWs will be
generated. Will such modifications definitely improve the overall
performance?

Hi Alex, Samuel,
The relationship between flush_xx_range_asid() and nr_ptes is
basically linear growth (y=kx +b), while flush_all_asid() has nothing
to do with nr_ptes (y=c).
Some TLBs may do some optimization. The operation of flush all itself
requires very few cycles, but there is a certain delay between
consecutive flush all.
The intersection of the two straight lines is the optimal solution of
tlb_flush_all_threshold. In actual situations, continuous
flush_all_asid will not occur. One problem caused by flush_all_asid()
is that multiple flush entries require PTW, which causes greater
latency.
Therefore, the value of tlb_flush_all_threshold needs to be considered
or quantified. Maybe doing local_flush_tlb_page_asid() based on the
actual nr_ptes_in_range would give better overall performance.
What do you think?

Thanks,
Yunhui
Samuel Holland March 12, 2024, 12:35 a.m. UTC | #2
Hi Yunhui,

On 2024-02-29 8:48 PM, yunhui cui wrote:
> Hi Samuel,
> 
> On Fri, Mar 1, 2024 at 7:22 AM Samuel Holland <samuel.holland@sifive.com> wrote:
>>
>> Since implementations affected by SiFive errata CIP-1200 always use the
>> global variant of the sfence.vma instruction, they only need to execute
>> the instruction once. The range-based loop only hurts performance.
>>
>> Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
>> ---
>>
>> (no changes since v4)
>>
>> Changes in v4:
>>  - Only set tlb_flush_all_threshold when CONFIG_MMU=y.
>>
>> Changes in v3:
>>  - New patch for v3
>>
>>  arch/riscv/errata/sifive/errata.c | 5 +++++
>>  arch/riscv/include/asm/tlbflush.h | 2 ++
>>  arch/riscv/mm/tlbflush.c          | 2 +-
>>  3 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/riscv/errata/sifive/errata.c b/arch/riscv/errata/sifive/errata.c
>> index 3d9a32d791f7..716cfedad3a2 100644
>> --- a/arch/riscv/errata/sifive/errata.c
>> +++ b/arch/riscv/errata/sifive/errata.c
>> @@ -42,6 +42,11 @@ static bool errata_cip_1200_check_func(unsigned long  arch_id, unsigned long imp
>>                 return false;
>>         if ((impid & 0xffffff) > 0x200630 || impid == 0x1200626)
>>                 return false;
>> +
>> +#ifdef CONFIG_MMU
>> +       tlb_flush_all_threshold = 0;
>> +#endif
>> +
>>         return true;
>>  }
>>
>> diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
>> index 463b615d7728..8e329721375b 100644
>> --- a/arch/riscv/include/asm/tlbflush.h
>> +++ b/arch/riscv/include/asm/tlbflush.h
>> @@ -66,6 +66,8 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
>>                                unsigned long uaddr);
>>  void arch_flush_tlb_batched_pending(struct mm_struct *mm);
>>  void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
>> +
>> +extern unsigned long tlb_flush_all_threshold;
>>  #else /* CONFIG_MMU */
>>  #define local_flush_tlb_all()                  do { } while (0)
>>  #endif /* CONFIG_MMU */
>> diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
>> index 365e0a0e4725..22870f213188 100644
>> --- a/arch/riscv/mm/tlbflush.c
>> +++ b/arch/riscv/mm/tlbflush.c
>> @@ -11,7 +11,7 @@
>>   * Flush entire TLB if number of entries to be flushed is greater
>>   * than the threshold below.
>>   */
>> -static unsigned long tlb_flush_all_threshold __read_mostly = 64;
>> +unsigned long tlb_flush_all_threshold __read_mostly = 64;
>>
>>  static void local_flush_tlb_range_threshold_asid(unsigned long start,
>>                                                  unsigned long size,
>> --
>> 2.43.1
>>
> 
> If local_flush_tlb_all_asid() is used every time, more PTWs will be
> generated. Will such modifications definitely improve the overall
> performance?

This change in this commit specifically applies to older SiFive SoCs with a bug
making single-page sfence.vma instructions unsafe to use. In this case, a single
call to local_flush_tlb_all_asid() is optimal, yes.

> Hi Alex, Samuel,
> The relationship between flush_xx_range_asid() and nr_ptes is
> basically linear growth (y=kx +b), while flush_all_asid() has nothing
> to do with nr_ptes (y=c).
> Some TLBs may do some optimization. The operation of flush all itself
> requires very few cycles, but there is a certain delay between
> consecutive flush all.
> The intersection of the two straight lines is the optimal solution of
> tlb_flush_all_threshold. In actual situations, continuous
> flush_all_asid will not occur. One problem caused by flush_all_asid()
> is that multiple flush entries require PTW, which causes greater
> latency.
> Therefore, the value of tlb_flush_all_threshold needs to be considered
> or quantified. Maybe doing local_flush_tlb_page_asid() based on the
> actual nr_ptes_in_range would give better overall performance.
> What do you think?

Yes, this was something Alex brought up when adding this threshold, that it
should be tuned for various scenarios. That still needs to be done. This patch
just covers one specific case where we know the optimal answer due to an erratum.

Regards,
Samuel
yunhui cui March 12, 2024, 1:51 a.m. UTC | #3
H Samuel,

On Tue, Mar 12, 2024 at 8:36 AM Samuel Holland
<samuel.holland@sifive.com> wrote:
>
> Hi Yunhui,
>
> On 2024-02-29 8:48 PM, yunhui cui wrote:
> > Hi Samuel,
> >
> > On Fri, Mar 1, 2024 at 7:22 AM Samuel Holland <samuel.holland@sifive.com> wrote:
> >>
> >> Since implementations affected by SiFive errata CIP-1200 always use the
> >> global variant of the sfence.vma instruction, they only need to execute
> >> the instruction once. The range-based loop only hurts performance.
> >>
> >> Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
> >> ---
> >>
> >> (no changes since v4)
> >>
> >> Changes in v4:
> >>  - Only set tlb_flush_all_threshold when CONFIG_MMU=y.
> >>
> >> Changes in v3:
> >>  - New patch for v3
> >>
> >>  arch/riscv/errata/sifive/errata.c | 5 +++++
> >>  arch/riscv/include/asm/tlbflush.h | 2 ++
> >>  arch/riscv/mm/tlbflush.c          | 2 +-
> >>  3 files changed, 8 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/riscv/errata/sifive/errata.c b/arch/riscv/errata/sifive/errata.c
> >> index 3d9a32d791f7..716cfedad3a2 100644
> >> --- a/arch/riscv/errata/sifive/errata.c
> >> +++ b/arch/riscv/errata/sifive/errata.c
> >> @@ -42,6 +42,11 @@ static bool errata_cip_1200_check_func(unsigned long  arch_id, unsigned long imp
> >>                 return false;
> >>         if ((impid & 0xffffff) > 0x200630 || impid == 0x1200626)
> >>                 return false;
> >> +
> >> +#ifdef CONFIG_MMU
> >> +       tlb_flush_all_threshold = 0;
> >> +#endif
> >> +
> >>         return true;
> >>  }
> >>
> >> diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
> >> index 463b615d7728..8e329721375b 100644
> >> --- a/arch/riscv/include/asm/tlbflush.h
> >> +++ b/arch/riscv/include/asm/tlbflush.h
> >> @@ -66,6 +66,8 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> >>                                unsigned long uaddr);
> >>  void arch_flush_tlb_batched_pending(struct mm_struct *mm);
> >>  void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
> >> +
> >> +extern unsigned long tlb_flush_all_threshold;
> >>  #else /* CONFIG_MMU */
> >>  #define local_flush_tlb_all()                  do { } while (0)
> >>  #endif /* CONFIG_MMU */
> >> diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
> >> index 365e0a0e4725..22870f213188 100644
> >> --- a/arch/riscv/mm/tlbflush.c
> >> +++ b/arch/riscv/mm/tlbflush.c
> >> @@ -11,7 +11,7 @@
> >>   * Flush entire TLB if number of entries to be flushed is greater
> >>   * than the threshold below.
> >>   */
> >> -static unsigned long tlb_flush_all_threshold __read_mostly = 64;
> >> +unsigned long tlb_flush_all_threshold __read_mostly = 64;
> >>
> >>  static void local_flush_tlb_range_threshold_asid(unsigned long start,
> >>                                                  unsigned long size,
> >> --
> >> 2.43.1
> >>
> >
> > If local_flush_tlb_all_asid() is used every time, more PTWs will be
> > generated. Will such modifications definitely improve the overall
> > performance?
>
> This change in this commit specifically applies to older SiFive SoCs with a bug
> making single-page sfence.vma instructions unsafe to use. In this case, a single
> call to local_flush_tlb_all_asid() is optimal, yes.
Would it be more clear to add this content to the git commit
description appropriately?

>
> > Hi Alex, Samuel,
> > The relationship between flush_xx_range_asid() and nr_ptes is
> > basically linear growth (y=kx +b), while flush_all_asid() has nothing
> > to do with nr_ptes (y=c).
> > Some TLBs may do some optimization. The operation of flush all itself
> > requires very few cycles, but there is a certain delay between
> > consecutive flush all.
> > The intersection of the two straight lines is the optimal solution of
> > tlb_flush_all_threshold. In actual situations, continuous
> > flush_all_asid will not occur. One problem caused by flush_all_asid()
> > is that multiple flush entries require PTW, which causes greater
> > latency.
> > Therefore, the value of tlb_flush_all_threshold needs to be considered
> > or quantified. Maybe doing local_flush_tlb_page_asid() based on the
> > actual nr_ptes_in_range would give better overall performance.
> > What do you think?
>
> Yes, this was something Alex brought up when adding this threshold, that it
> should be tuned for various scenarios. That still needs to be done. This patch
> just covers one specific case where we know the optimal answer due to an erratum.
>
> Regards,
> Samuel
>

Thanks,
Yunhui
diff mbox series

Patch

diff --git a/arch/riscv/errata/sifive/errata.c b/arch/riscv/errata/sifive/errata.c
index 3d9a32d791f7..716cfedad3a2 100644
--- a/arch/riscv/errata/sifive/errata.c
+++ b/arch/riscv/errata/sifive/errata.c
@@ -42,6 +42,11 @@  static bool errata_cip_1200_check_func(unsigned long  arch_id, unsigned long imp
 		return false;
 	if ((impid & 0xffffff) > 0x200630 || impid == 0x1200626)
 		return false;
+
+#ifdef CONFIG_MMU
+	tlb_flush_all_threshold = 0;
+#endif
+
 	return true;
 }
 
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index 463b615d7728..8e329721375b 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -66,6 +66,8 @@  void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
 			       unsigned long uaddr);
 void arch_flush_tlb_batched_pending(struct mm_struct *mm);
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+
+extern unsigned long tlb_flush_all_threshold;
 #else /* CONFIG_MMU */
 #define local_flush_tlb_all()			do { } while (0)
 #endif /* CONFIG_MMU */
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 365e0a0e4725..22870f213188 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -11,7 +11,7 @@ 
  * Flush entire TLB if number of entries to be flushed is greater
  * than the threshold below.
  */
-static unsigned long tlb_flush_all_threshold __read_mostly = 64;
+unsigned long tlb_flush_all_threshold __read_mostly = 64;
 
 static void local_flush_tlb_range_threshold_asid(unsigned long start,
 						 unsigned long size,