[-fixes] riscv: Implement flush_cache_vmap()

Message ID	20230725132246.817726-1-alexghiti@rivosinc.com (mailing list archive)
State	Accepted
Commit	7e3811521dc3934e2ecae8458676fc4a1f62bf9f
Headers	show Return-Path: <linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org> From: Alexandre Ghiti <alexghiti@rivosinc.com> To: Paul Walmsley <paul.walmsley@sifive.com>, Palmer Dabbelt <palmer@dabbelt.com>, Albert Ou <aou@eecs.berkeley.edu>, =?utf-8?b?QmrDtnJuIFTDtnBlbA==?= <bjorn@rivosinc.com>, Alexandre Ghiti <alexghiti@rivosinc.com>, Guo Ren <guoren@kernel.org>, linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org Subject: [PATCH -fixes] riscv: Implement flush_cache_vmap() Date: Tue, 25 Jul 2023 15:22:46 +0200 Message-Id: <20230725132246.817726-1-alexghiti@rivosinc.com> MIME-Version: 1.0 Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-riscv" <linux-riscv-bounces@lists.infradead.org> Errors-To: linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org
Series	[-fixes] riscv: Implement flush_cache_vmap() \| expand [-fixes] riscv: Implement flush_cache_vmap()

Context	Check	Description
conchuod/cover_letter	success	Single patches do not need cover letters
conchuod/tree_selection	success	Guessed tree name to be fixes at HEAD ab2dbc7acced
conchuod/fixes_present	success	Fixes tag present in non-next series
conchuod/maintainers_pattern	success	MAINTAINERS pattern errors before the patch: 4 and now 4
conchuod/verify_signedoff	success	Signed-off-by tag matches author and committer
conchuod/kdoc	success	Errors and warnings before: 0 this patch: 0
conchuod/build_rv64_clang_allmodconfig	success	Errors and warnings before: 2290 this patch: 2290
conchuod/module_param	success	Was 0 now: 0
conchuod/build_rv64_gcc_allmodconfig	success	Errors and warnings before: 12756 this patch: 12756
conchuod/build_rv32_defconfig	success	Build OK
conchuod/dtb_warn_rv64	success	Errors and warnings before: 3 this patch: 3
conchuod/header_inline	success	No static functions without inline keyword in header files
conchuod/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 10 lines checked
conchuod/build_rv64_nommu_k210_defconfig	success	Build OK
conchuod/verify_fixes	success	Fixes tag looks correct
conchuod/build_rv64_nommu_virt_defconfig	success	Build OK

Alexandre Ghiti July 25, 2023, 1:22 p.m. UTC

The RISC-V kernel needs a sfence.vma after a page table modification: we
used to rely on the vmalloc fault handling to emit an sfence.vma, but
commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
need to explicitly emit a sfence.vma in flush_cache_vmap().

Note that we don't need to implement flush_cache_vunmap() as the generic
code should emit a flush tlb after unmapping a vmalloc region.

Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area")
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 arch/riscv/include/asm/cacheflush.h | 4 ++++
 1 file changed, 4 insertions(+)

Guo Ren July 30, 2023, 5:08 a.m. UTC | #1

On Tue, Jul 25, 2023 at 9:22 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> The RISC-V kernel needs a sfence.vma after a page table modification: we
> used to rely on the vmalloc fault handling to emit an sfence.vma, but
> commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
> vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
> need to explicitly emit a sfence.vma in flush_cache_vmap().
>
> Note that we don't need to implement flush_cache_vunmap() as the generic
> code should emit a flush tlb after unmapping a vmalloc region.
>
> Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area")
> Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> ---
>  arch/riscv/include/asm/cacheflush.h | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> index 8091b8bf4883..b93ffddf8a61 100644
> --- a/arch/riscv/include/asm/cacheflush.h
> +++ b/arch/riscv/include/asm/cacheflush.h
> @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
>  #define flush_icache_user_page(vma, pg, addr, len) \
>         flush_icache_mm(vma->vm_mm, 0)
>
> +#ifdef CONFIG_64BIT
> +#define flush_cache_vmap(start, end)   flush_tlb_kernel_range(start, end)
Sorry, I couldn't agree with the above in a PIPT cache machine. It's
not worth for.

It would reduce the performance of vmap_pages_range,
ioremap_page_range ... API, which may cause some drivers' performance
issues when they install/uninstall memory frequently.

> +#endif
> +
>  #ifndef CONFIG_SMP
>
>  #define flush_icache_all() local_flush_icache_all()
> --
> 2.39.2
>

Dylan Jhong Aug. 3, 2023, 9:14 a.m. UTC | #2

On Sun, Jul 30, 2023 at 01:08:17AM -0400, Guo Ren wrote:
> On Tue, Jul 25, 2023 at 9:22 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> >
> > The RISC-V kernel needs a sfence.vma after a page table modification: we
> > used to rely on the vmalloc fault handling to emit an sfence.vma, but
> > commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
> > vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
> > need to explicitly emit a sfence.vma in flush_cache_vmap().
> >
> > Note that we don't need to implement flush_cache_vunmap() as the generic
> > code should emit a flush tlb after unmapping a vmalloc region.
> >
> > Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area")
> > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > ---
> >  arch/riscv/include/asm/cacheflush.h | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > index 8091b8bf4883..b93ffddf8a61 100644
> > --- a/arch/riscv/include/asm/cacheflush.h
> > +++ b/arch/riscv/include/asm/cacheflush.h
> > @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
> >  #define flush_icache_user_page(vma, pg, addr, len) \
> >         flush_icache_mm(vma->vm_mm, 0)
> >
> > +#ifdef CONFIG_64BIT
> > +#define flush_cache_vmap(start, end)   flush_tlb_kernel_range(start, end)
> Sorry, I couldn't agree with the above in a PIPT cache machine. It's
> not worth for.
> 
> It would reduce the performance of vmap_pages_range,
> ioremap_page_range ... API, which may cause some drivers' performance
> issues when they install/uninstall memory frequently.
> 

Hi All,

I think functional correctness should be more important than system performance
in this case. The "preventive" SFENCE.VMA became necessary due to the RISC-V
specification allowing invalidation entries to be cached in the TLB.

The problem[1] we are currently encountering is caused by not updating the TLB
after the page table is created, and the solution to this problem can only be
solved by updating the TLB immediately after the page table is created.

There are currently two possible approaches to flush TLB:
1. Flush TLB in flush_cache_vmap()
2. Flush TLB in arch_sync_kernel_mappings()

But I'm not quite sure if it's a good idea to operate on the TLB inside flush_cache_vmap().
The name of this function indicates that it should be related to cache operations, maybe
it would be more appropriate to do TLB flush in arch_sync_kernel_mappings()?

[1]: http://lists.infradead.org/pipermail/linux-riscv/2023-August/037503.html

Best regards,
Dylan

> > +#endif
> > +
> >  #ifndef CONFIG_SMP
> >
> >  #define flush_icache_all() local_flush_icache_all()
> > --
> > 2.39.2
> >
> 
> 
> -- 
> Best Regards
>  Guo Ren
> 
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv

Conor Dooley Aug. 3, 2023, 9:24 a.m. UTC | #3

On Thu, Aug 03, 2023 at 05:14:15PM +0800, dylan wrote:
> On Sun, Jul 30, 2023 at 01:08:17AM -0400, Guo Ren wrote:
> > On Tue, Jul 25, 2023 at 9:22 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > >
> > > The RISC-V kernel needs a sfence.vma after a page table modification: we
> > > used to rely on the vmalloc fault handling to emit an sfence.vma, but
> > > commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
> > > vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
> > > need to explicitly emit a sfence.vma in flush_cache_vmap().
> > >
> > > Note that we don't need to implement flush_cache_vunmap() as the generic
> > > code should emit a flush tlb after unmapping a vmalloc region.
> > >
> > > Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area")
> > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > ---
> > >  arch/riscv/include/asm/cacheflush.h | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > >
> > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > > index 8091b8bf4883..b93ffddf8a61 100644
> > > --- a/arch/riscv/include/asm/cacheflush.h
> > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
> > >  #define flush_icache_user_page(vma, pg, addr, len) \
> > >         flush_icache_mm(vma->vm_mm, 0)
> > >
> > > +#ifdef CONFIG_64BIT
> > > +#define flush_cache_vmap(start, end)   flush_tlb_kernel_range(start, end)
> > Sorry, I couldn't agree with the above in a PIPT cache machine. It's
> > not worth for.
> > 
> > It would reduce the performance of vmap_pages_range,
> > ioremap_page_range ... API, which may cause some drivers' performance
> > issues when they install/uninstall memory frequently.
> > 
> 
> Hi All,
> 
> I think functional correctness should be more important than system performance
> in this case. The "preventive" SFENCE.VMA became necessary due to the RISC-V
> specification allowing invalidation entries to be cached in the TLB.

We are at -rc4 and this stuff is broken. Taking the bigger hammer, which
can be reverted later when a more targeted fix shows up, to make sure
that v6.5 doesn't end up broken, sounds rather prudent. Otherwise, the
original commit should probably be reverted.

> The problem[1] we are currently encountering is caused by not updating the TLB
> after the page table is created, and the solution to this problem can only be
> solved by updating the TLB immediately after the page table is created.
> 
> There are currently two possible approaches to flush TLB:
> 1. Flush TLB in flush_cache_vmap()
> 2. Flush TLB in arch_sync_kernel_mappings()
> 
> But I'm not quite sure if it's a good idea to operate on the TLB inside flush_cache_vmap().
> The name of this function indicates that it should be related to cache operations, maybe
> it would be more appropriate to do TLB flush in arch_sync_kernel_mappings()?
> 
> [1]: http://lists.infradead.org/pipermail/linux-riscv/2023-August/037503.html

Alexandre Ghiti Aug. 3, 2023, 9:48 a.m. UTC | #4

On Thu, Aug 3, 2023 at 11:25 AM Conor Dooley <conor.dooley@microchip.com> wrote:
>
> On Thu, Aug 03, 2023 at 05:14:15PM +0800, dylan wrote:
> > On Sun, Jul 30, 2023 at 01:08:17AM -0400, Guo Ren wrote:
> > > On Tue, Jul 25, 2023 at 9:22 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > >
> > > > The RISC-V kernel needs a sfence.vma after a page table modification: we
> > > > used to rely on the vmalloc fault handling to emit an sfence.vma, but
> > > > commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
> > > > vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
> > > > need to explicitly emit a sfence.vma in flush_cache_vmap().
> > > >
> > > > Note that we don't need to implement flush_cache_vunmap() as the generic
> > > > code should emit a flush tlb after unmapping a vmalloc region.
> > > >
> > > > Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area")
> > > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > > ---
> > > >  arch/riscv/include/asm/cacheflush.h | 4 ++++
> > > >  1 file changed, 4 insertions(+)
> > > >
> > > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > > > index 8091b8bf4883..b93ffddf8a61 100644
> > > > --- a/arch/riscv/include/asm/cacheflush.h
> > > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > > @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
> > > >  #define flush_icache_user_page(vma, pg, addr, len) \
> > > >         flush_icache_mm(vma->vm_mm, 0)
> > > >
> > > > +#ifdef CONFIG_64BIT
> > > > +#define flush_cache_vmap(start, end)   flush_tlb_kernel_range(start, end)
> > > Sorry, I couldn't agree with the above in a PIPT cache machine. It's
> > > not worth for.
> > >
> > > It would reduce the performance of vmap_pages_range,
> > > ioremap_page_range ... API, which may cause some drivers' performance
> > > issues when they install/uninstall memory frequently.
> > >
> >
> > Hi All,
> >
> > I think functional correctness should be more important than system performance
> > in this case. The "preventive" SFENCE.VMA became necessary due to the RISC-V
> > specification allowing invalidation entries to be cached in the TLB.
>
> We are at -rc4 and this stuff is broken. Taking the bigger hammer, which
> can be reverted later when a more targeted fix shows up, to make sure
> that v6.5 doesn't end up broken, sounds rather prudent. Otherwise, the
> original commit should probably be reverted.

The original commit that removed vmalloc_fault() is required, handling
vmalloc faults in the page fault path is not possible (see the links
in the description of 7d3332be011e and the example that I gave in the
thread https://lore.kernel.org/linux-riscv/dc26625b-6658-c078-76d2-7e975a04b1d4@ghiti.fr/).

I totally agree with Dylan that we'll work (I'm currently working on
that) on the performance side of the problem in the next release, we
need correctness and for that we need a preventive global sfence.vma
as we have no means (for now) to distinguish between uarch that cache
or not invalid entries.

>
> > The problem[1] we are currently encountering is caused by not updating the TLB
> > after the page table is created, and the solution to this problem can only be
> > solved by updating the TLB immediately after the page table is created.
> >
> > There are currently two possible approaches to flush TLB:
> > 1. Flush TLB in flush_cache_vmap()
> > 2. Flush TLB in arch_sync_kernel_mappings()
> >
> > But I'm not quite sure if it's a good idea to operate on the TLB inside flush_cache_vmap().
> > The name of this function indicates that it should be related to cache operations, maybe
> > it would be more appropriate to do TLB flush in arch_sync_kernel_mappings()?

TLDR: The downsides to implementing arch_sync_kernel_mappings()
instead of flush_cache_vmap():

- 2 global flushes for vunmap instead of 1 for flush_cache_vmap()
- flushes the tlb in the noflush suffixed functions so it prevents any
flush optimization (ie: a loop of vmap_range_noflush() without flush
and then a final flush afterwards)

So I'd favour the flush_cache_vmap() implementation which seems
lighter. powerpc does that
https://elixir.bootlin.com/linux/latest/source/arch/powerpc/include/asm/cacheflush.h#L27
(but admits that it may not be the right place)

Here is the long story (my raw notes):

* arch_sync_kernel_mappings() is called from:
- _apply_to_page_range(): would only emit global sfence.vma if vmalloc
addresses, I guess that's ok.
- __vunmap_range_noflush(): it is noted here
https://elixir.bootlin.com/linux/latest/source/mm/vmalloc.c#L406 that
any caller must call flush_tlb_kernel_range(). Then the implementation
of arch_sync_kernel_mappings() would result in 2 global tlb flushes.
- vmap_range_noflush(): does not fit well with the noflush() suffix.

* flush_cache_vmap() is called from:
- kasan_populate_vmalloc(): legit since it bypasses vmap api (but
called right a apply_to_page_range() so your patch would work here)
- kmsan_vunmap_range_noflush(): called twice for the mappings kmsan
establishes and flush_tlb_kernel_range() must be called afterwards =>
3 global tlb flushes but the 3 are needed as they target different
addresses. Implementing only arch_sync_kernel_mappings() would result
in way more global flushes (see the loop here
https://elixir.bootlin.com/linux/latest/source/mm/kmsan/hooks.c#L151
where  __vmap_pages_range_noflush() would result in more
flush_tlb_all())
- kmsan_vmap_pages_range_noflush(): here we would flush twice, but
same thing for the arch_sync_kernel_mappings() implementation.
- ioremap_page_range(): legit, same as arch_sync_kernel_mappings()
implementation.
- vmap_pages_range(): legit, same as arch_sync_kernel_mappings() implementation.

Let me know what you think!

Alex

> >
> > [1]: http://lists.infradead.org/pipermail/linux-riscv/2023-August/037503.html
>

Alexandre Ghiti Aug. 3, 2023, 9:58 a.m. UTC | #5

On Thu, Aug 3, 2023 at 11:48 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> On Thu, Aug 3, 2023 at 11:25 AM Conor Dooley <conor.dooley@microchip.com> wrote:
> >
> > On Thu, Aug 03, 2023 at 05:14:15PM +0800, dylan wrote:
> > > On Sun, Jul 30, 2023 at 01:08:17AM -0400, Guo Ren wrote:
> > > > On Tue, Jul 25, 2023 at 9:22 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > >
> > > > > The RISC-V kernel needs a sfence.vma after a page table modification: we
> > > > > used to rely on the vmalloc fault handling to emit an sfence.vma, but
> > > > > commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
> > > > > vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
> > > > > need to explicitly emit a sfence.vma in flush_cache_vmap().
> > > > >
> > > > > Note that we don't need to implement flush_cache_vunmap() as the generic
> > > > > code should emit a flush tlb after unmapping a vmalloc region.
> > > > >
> > > > > Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area")
> > > > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > > > ---
> > > > >  arch/riscv/include/asm/cacheflush.h | 4 ++++
> > > > >  1 file changed, 4 insertions(+)
> > > > >
> > > > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > > > > index 8091b8bf4883..b93ffddf8a61 100644
> > > > > --- a/arch/riscv/include/asm/cacheflush.h
> > > > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > > > @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
> > > > >  #define flush_icache_user_page(vma, pg, addr, len) \
> > > > >         flush_icache_mm(vma->vm_mm, 0)
> > > > >
> > > > > +#ifdef CONFIG_64BIT
> > > > > +#define flush_cache_vmap(start, end)   flush_tlb_kernel_range(start, end)
> > > > Sorry, I couldn't agree with the above in a PIPT cache machine. It's
> > > > not worth for.
> > > >
> > > > It would reduce the performance of vmap_pages_range,
> > > > ioremap_page_range ... API, which may cause some drivers' performance
> > > > issues when they install/uninstall memory frequently.
> > > >
> > >
> > > Hi All,
> > >
> > > I think functional correctness should be more important than system performance
> > > in this case. The "preventive" SFENCE.VMA became necessary due to the RISC-V
> > > specification allowing invalidation entries to be cached in the TLB.
> >
> > We are at -rc4 and this stuff is broken. Taking the bigger hammer, which
> > can be reverted later when a more targeted fix shows up, to make sure
> > that v6.5 doesn't end up broken, sounds rather prudent. Otherwise, the
> > original commit should probably be reverted.
>
> The original commit that removed vmalloc_fault() is required, handling
> vmalloc faults in the page fault path is not possible (see the links
> in the description of 7d3332be011e and the example that I gave in the
> thread https://lore.kernel.org/linux-riscv/dc26625b-6658-c078-76d2-7e975a04b1d4@ghiti.fr/).

Actually the problem was there before this commit (Dylan has the issue
in 6.1), so no faulty commit to revert :)

>
> I totally agree with Dylan that we'll work (I'm currently working on
> that) on the performance side of the problem in the next release, we
> need correctness and for that we need a preventive global sfence.vma
> as we have no means (for now) to distinguish between uarch that cache
> or not invalid entries.
>
> >
> > > The problem[1] we are currently encountering is caused by not updating the TLB
> > > after the page table is created, and the solution to this problem can only be
> > > solved by updating the TLB immediately after the page table is created.
> > >
> > > There are currently two possible approaches to flush TLB:
> > > 1. Flush TLB in flush_cache_vmap()
> > > 2. Flush TLB in arch_sync_kernel_mappings()
> > >
> > > But I'm not quite sure if it's a good idea to operate on the TLB inside flush_cache_vmap().
> > > The name of this function indicates that it should be related to cache operations, maybe
> > > it would be more appropriate to do TLB flush in arch_sync_kernel_mappings()?
>
> TLDR: The downsides to implementing arch_sync_kernel_mappings()
> instead of flush_cache_vmap():
>
> - 2 global flushes for vunmap instead of 1 for flush_cache_vmap()
> - flushes the tlb in the noflush suffixed functions so it prevents any
> flush optimization (ie: a loop of vmap_range_noflush() without flush
> and then a final flush afterwards)
>
> So I'd favour the flush_cache_vmap() implementation which seems
> lighter. powerpc does that
> https://elixir.bootlin.com/linux/latest/source/arch/powerpc/include/asm/cacheflush.h#L27
> (but admits that it may not be the right place)
>
> Here is the long story (my raw notes):
>
> * arch_sync_kernel_mappings() is called from:
> - _apply_to_page_range(): would only emit global sfence.vma if vmalloc
> addresses, I guess that's ok.
> - __vunmap_range_noflush(): it is noted here
> https://elixir.bootlin.com/linux/latest/source/mm/vmalloc.c#L406 that
> any caller must call flush_tlb_kernel_range(). Then the implementation
> of arch_sync_kernel_mappings() would result in 2 global tlb flushes.
> - vmap_range_noflush(): does not fit well with the noflush() suffix.
>
> * flush_cache_vmap() is called from:
> - kasan_populate_vmalloc(): legit since it bypasses vmap api (but
> called right a apply_to_page_range() so your patch would work here)
> - kmsan_vunmap_range_noflush(): called twice for the mappings kmsan
> establishes and flush_tlb_kernel_range() must be called afterwards =>
> 3 global tlb flushes but the 3 are needed as they target different
> addresses. Implementing only arch_sync_kernel_mappings() would result
> in way more global flushes (see the loop here
> https://elixir.bootlin.com/linux/latest/source/mm/kmsan/hooks.c#L151
> where  __vmap_pages_range_noflush() would result in more
> flush_tlb_all())
> - kmsan_vmap_pages_range_noflush(): here we would flush twice, but
> same thing for the arch_sync_kernel_mappings() implementation.
> - ioremap_page_range(): legit, same as arch_sync_kernel_mappings()
> implementation.
> - vmap_pages_range(): legit, same as arch_sync_kernel_mappings() implementation.
>
> Let me know what you think!
>
> Alex
>
> > >
> > > [1]: http://lists.infradead.org/pipermail/linux-riscv/2023-August/037503.html
> >

Conor Dooley Aug. 3, 2023, 10:05 a.m. UTC | #6

On Thu, Aug 03, 2023 at 11:58:41AM +0200, Alexandre Ghiti wrote:
> On Thu, Aug 3, 2023 at 11:48 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> >
> > On Thu, Aug 3, 2023 at 11:25 AM Conor Dooley <conor.dooley@microchip.com> wrote:
> > >
> > > On Thu, Aug 03, 2023 at 05:14:15PM +0800, dylan wrote:
> > > > On Sun, Jul 30, 2023 at 01:08:17AM -0400, Guo Ren wrote:
> > > > > On Tue, Jul 25, 2023 at 9:22 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > > >
> > > > > > The RISC-V kernel needs a sfence.vma after a page table modification: we
> > > > > > used to rely on the vmalloc fault handling to emit an sfence.vma, but
> > > > > > commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
> > > > > > vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
> > > > > > need to explicitly emit a sfence.vma in flush_cache_vmap().
> > > > > >
> > > > > > Note that we don't need to implement flush_cache_vunmap() as the generic
> > > > > > code should emit a flush tlb after unmapping a vmalloc region.
> > > > > >
> > > > > > Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area")
> > > > > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > > > > ---
> > > > > >  arch/riscv/include/asm/cacheflush.h | 4 ++++
> > > > > >  1 file changed, 4 insertions(+)
> > > > > >
> > > > > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > > > > > index 8091b8bf4883..b93ffddf8a61 100644
> > > > > > --- a/arch/riscv/include/asm/cacheflush.h
> > > > > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > > > > @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
> > > > > >  #define flush_icache_user_page(vma, pg, addr, len) \
> > > > > >         flush_icache_mm(vma->vm_mm, 0)
> > > > > >
> > > > > > +#ifdef CONFIG_64BIT
> > > > > > +#define flush_cache_vmap(start, end)   flush_tlb_kernel_range(start, end)
> > > > > Sorry, I couldn't agree with the above in a PIPT cache machine. It's
> > > > > not worth for.
> > > > >
> > > > > It would reduce the performance of vmap_pages_range,
> > > > > ioremap_page_range ... API, which may cause some drivers' performance
> > > > > issues when they install/uninstall memory frequently.
> > > > >
> > > >
> > > > Hi All,
> > > >
> > > > I think functional correctness should be more important than system performance
> > > > in this case. The "preventive" SFENCE.VMA became necessary due to the RISC-V
> > > > specification allowing invalidation entries to be cached in the TLB.
> > >
> > > We are at -rc4 and this stuff is broken. Taking the bigger hammer, which
> > > can be reverted later when a more targeted fix shows up, to make sure
> > > that v6.5 doesn't end up broken, sounds rather prudent. Otherwise, the
> > > original commit should probably be reverted.
> >
> > The original commit that removed vmalloc_fault() is required, handling
> > vmalloc faults in the page fault path is not possible (see the links
> > in the description of 7d3332be011e and the example that I gave in the
> > thread https://lore.kernel.org/linux-riscv/dc26625b-6658-c078-76d2-7e975a04b1d4@ghiti.fr/).
> 
> Actually the problem was there before this commit (Dylan has the issue
> in 6.1), so no faulty commit to revert :)

Ah, my bad - I was focusing on 7d3332be011e ("riscv: mm: Pre-allocate PGD
entries for vmalloc/modules area") from your commit message. Big hammer,
followed by tweaks or a replacement sounds like the way forward to me
so!

Cheers,
Conor.

> > I totally agree with Dylan that we'll work (I'm currently working on
> > that) on the performance side of the problem in the next release, we
> > need correctness and for that we need a preventive global sfence.vma
> > as we have no means (for now) to distinguish between uarch that cache
> > or not invalid entries.
> >
> > >
> > > > The problem[1] we are currently encountering is caused by not updating the TLB
> > > > after the page table is created, and the solution to this problem can only be
> > > > solved by updating the TLB immediately after the page table is created.
> > > >
> > > > There are currently two possible approaches to flush TLB:
> > > > 1. Flush TLB in flush_cache_vmap()
> > > > 2. Flush TLB in arch_sync_kernel_mappings()
> > > >
> > > > But I'm not quite sure if it's a good idea to operate on the TLB inside flush_cache_vmap().
> > > > The name of this function indicates that it should be related to cache operations, maybe
> > > > it would be more appropriate to do TLB flush in arch_sync_kernel_mappings()?
> >
> > TLDR: The downsides to implementing arch_sync_kernel_mappings()
> > instead of flush_cache_vmap():
> >
> > - 2 global flushes for vunmap instead of 1 for flush_cache_vmap()
> > - flushes the tlb in the noflush suffixed functions so it prevents any
> > flush optimization (ie: a loop of vmap_range_noflush() without flush
> > and then a final flush afterwards)
> >
> > So I'd favour the flush_cache_vmap() implementation which seems
> > lighter. powerpc does that
> > https://elixir.bootlin.com/linux/latest/source/arch/powerpc/include/asm/cacheflush.h#L27
> > (but admits that it may not be the right place)
> >
> > Here is the long story (my raw notes):
> >
> > * arch_sync_kernel_mappings() is called from:
> > - _apply_to_page_range(): would only emit global sfence.vma if vmalloc
> > addresses, I guess that's ok.
> > - __vunmap_range_noflush(): it is noted here
> > https://elixir.bootlin.com/linux/latest/source/mm/vmalloc.c#L406 that
> > any caller must call flush_tlb_kernel_range(). Then the implementation
> > of arch_sync_kernel_mappings() would result in 2 global tlb flushes.
> > - vmap_range_noflush(): does not fit well with the noflush() suffix.
> >
> > * flush_cache_vmap() is called from:
> > - kasan_populate_vmalloc(): legit since it bypasses vmap api (but
> > called right a apply_to_page_range() so your patch would work here)
> > - kmsan_vunmap_range_noflush(): called twice for the mappings kmsan
> > establishes and flush_tlb_kernel_range() must be called afterwards =>
> > 3 global tlb flushes but the 3 are needed as they target different
> > addresses. Implementing only arch_sync_kernel_mappings() would result
> > in way more global flushes (see the loop here
> > https://elixir.bootlin.com/linux/latest/source/mm/kmsan/hooks.c#L151
> > where  __vmap_pages_range_noflush() would result in more
> > flush_tlb_all())
> > - kmsan_vmap_pages_range_noflush(): here we would flush twice, but
> > same thing for the arch_sync_kernel_mappings() implementation.
> > - ioremap_page_range(): legit, same as arch_sync_kernel_mappings()
> > implementation.
> > - vmap_pages_range(): legit, same as arch_sync_kernel_mappings() implementation.
> >
> > Let me know what you think!
> >
> > Alex
> >
> > > >
> > > > [1]: http://lists.infradead.org/pipermail/linux-riscv/2023-August/037503.html
> > >

Dylan Jhong Aug. 4, 2023, 7:48 a.m. UTC | #7

On Thu, Aug 03, 2023 at 11:48:36AM +0200, Alexandre Ghiti wrote:
> On Thu, Aug 3, 2023 at 11:25 AM Conor Dooley <conor.dooley@microchip.com> wrote:
> >
> > On Thu, Aug 03, 2023 at 05:14:15PM +0800, dylan wrote:
> > > On Sun, Jul 30, 2023 at 01:08:17AM -0400, Guo Ren wrote:
> > > > On Tue, Jul 25, 2023 at 9:22 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > >
> > > > > The RISC-V kernel needs a sfence.vma after a page table modification: we
> > > > > used to rely on the vmalloc fault handling to emit an sfence.vma, but
> > > > > commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
> > > > > vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
> > > > > need to explicitly emit a sfence.vma in flush_cache_vmap().
> > > > >
> > > > > Note that we don't need to implement flush_cache_vunmap() as the generic
> > > > > code should emit a flush tlb after unmapping a vmalloc region.
> > > > >
> > > > > Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area")
> > > > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > > > ---
> > > > >  arch/riscv/include/asm/cacheflush.h | 4 ++++
> > > > >  1 file changed, 4 insertions(+)
> > > > >
> > > > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > > > > index 8091b8bf4883..b93ffddf8a61 100644
> > > > > --- a/arch/riscv/include/asm/cacheflush.h
> > > > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > > > @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
> > > > >  #define flush_icache_user_page(vma, pg, addr, len) \
> > > > >         flush_icache_mm(vma->vm_mm, 0)
> > > > >
> > > > > +#ifdef CONFIG_64BIT
> > > > > +#define flush_cache_vmap(start, end)   flush_tlb_kernel_range(start, end)
> > > > Sorry, I couldn't agree with the above in a PIPT cache machine. It's
> > > > not worth for.
> > > >
> > > > It would reduce the performance of vmap_pages_range,
> > > > ioremap_page_range ... API, which may cause some drivers' performance
> > > > issues when they install/uninstall memory frequently.
> > > >
> > >
> > > Hi All,
> > >
> > > I think functional correctness should be more important than system performance
> > > in this case. The "preventive" SFENCE.VMA became necessary due to the RISC-V
> > > specification allowing invalidation entries to be cached in the TLB.
> >
> > We are at -rc4 and this stuff is broken. Taking the bigger hammer, which
> > can be reverted later when a more targeted fix shows up, to make sure
> > that v6.5 doesn't end up broken, sounds rather prudent. Otherwise, the
> > original commit should probably be reverted.
> 
> The original commit that removed vmalloc_fault() is required, handling
> vmalloc faults in the page fault path is not possible (see the links
> in the description of 7d3332be011e and the example that I gave in the
> thread https://lore.kernel.org/linux-riscv/dc26625b-6658-c078-76d2-7e975a04b1d4@ghiti.fr/).
> 
> I totally agree with Dylan that we'll work (I'm currently working on
> that) on the performance side of the problem in the next release, we
> need correctness and for that we need a preventive global sfence.vma
> as we have no means (for now) to distinguish between uarch that cache
> or not invalid entries.
> 
> >
> > > The problem[1] we are currently encountering is caused by not updating the TLB
> > > after the page table is created, and the solution to this problem can only be
> > > solved by updating the TLB immediately after the page table is created.
> > >
> > > There are currently two possible approaches to flush TLB:
> > > 1. Flush TLB in flush_cache_vmap()
> > > 2. Flush TLB in arch_sync_kernel_mappings()
> > >
> > > But I'm not quite sure if it's a good idea to operate on the TLB inside flush_cache_vmap().
> > > The name of this function indicates that it should be related to cache operations, maybe
> > > it would be more appropriate to do TLB flush in arch_sync_kernel_mappings()?
> 
> TLDR: The downsides to implementing arch_sync_kernel_mappings()
> instead of flush_cache_vmap():
> 
> - 2 global flushes for vunmap instead of 1 for flush_cache_vmap()
> - flushes the tlb in the noflush suffixed functions so it prevents any
> flush optimization (ie: a loop of vmap_range_noflush() without flush
> and then a final flush afterwards)
> 
> So I'd favour the flush_cache_vmap() implementation which seems
> lighter. powerpc does that
> https://elixir.bootlin.com/linux/latest/source/arch/powerpc/include/asm/cacheflush.h#L27
> (but admits that it may not be the right place)
> 
> Here is the long story (my raw notes):
> 
> * arch_sync_kernel_mappings() is called from:
> - _apply_to_page_range(): would only emit global sfence.vma if vmalloc
> addresses, I guess that's ok.
> - __vunmap_range_noflush(): it is noted here
> https://elixir.bootlin.com/linux/latest/source/mm/vmalloc.c#L406 that
> any caller must call flush_tlb_kernel_range(). Then the implementation
> of arch_sync_kernel_mappings() would result in 2 global tlb flushes.
> - vmap_range_noflush(): does not fit well with the noflush() suffix.
> 
> * flush_cache_vmap() is called from:
> - kasan_populate_vmalloc(): legit since it bypasses vmap api (but
> called right a apply_to_page_range() so your patch would work here)
> - kmsan_vunmap_range_noflush(): called twice for the mappings kmsan
> establishes and flush_tlb_kernel_range() must be called afterwards =>
> 3 global tlb flushes but the 3 are needed as they target different
> addresses. Implementing only arch_sync_kernel_mappings() would result
> in way more global flushes (see the loop here
> https://elixir.bootlin.com/linux/latest/source/mm/kmsan/hooks.c#L151
> where  __vmap_pages_range_noflush() would result in more
> flush_tlb_all())
> - kmsan_vmap_pages_range_noflush(): here we would flush twice, but
> same thing for the arch_sync_kernel_mappings() implementation.
> - ioremap_page_range(): legit, same as arch_sync_kernel_mappings()
> implementation.
> - vmap_pages_range(): legit, same as arch_sync_kernel_mappings() implementation.
> 
> Let me know what you think!
> 
> Alex
> 
Hi Alex,

Thank you for the detailed explanation. It is indeed undeniable that in certain
situations, there might be a possibility of repeated flushing TLB. But I think
there are some potential problem in flush_cache_vmap().

In most case, vmap_range_noflush() and flush_cache_vmap() will appear at the same
time, so it should be no problem to choose one of them to do the TLB flush. But
flush_cache_vmap() does not cover all the places where apply_to_page_range()
appears (please correct me if I'm wrong), such as vmap_pfn()[1].

The function you mentioned here, each will eventually call:
    vmap_range_noflush() -> arch_sync_kernel_mappings() -> TLB Flush

As for the performance, because the current parameter of flush_tlb_page() needs to
pass *vma, we cannot pass in this parameter so we can only choose flush_tlb_all().
If it can be changed to flush_tlb_page() in the future, the performance should be improved.

[1]: https://elixir.bootlin.com/linux/v6.5-rc4/source/mm/vmalloc.c#L2977

Best regards,
Dylan Jhong

> > >
> > > [1]: http://lists.infradead.org/pipermail/linux-riscv/2023-August/037503.html
> >

Alexandre Ghiti Aug. 8, 2023, 11:23 a.m. UTC | #8

Hey Dylan,

Sorry I was busy debugging 6.5 failing to boot on the Unmatched (also
TLB related, crazy everything converges to TLB issues lately :))

On Fri, Aug 4, 2023 at 9:48 AM Dylan Jhong <dylan@andestech.com> wrote:
>
> On Thu, Aug 03, 2023 at 11:48:36AM +0200, Alexandre Ghiti wrote:
> > On Thu, Aug 3, 2023 at 11:25 AM Conor Dooley <conor.dooley@microchip.com> wrote:
> > >
> > > On Thu, Aug 03, 2023 at 05:14:15PM +0800, dylan wrote:
> > > > On Sun, Jul 30, 2023 at 01:08:17AM -0400, Guo Ren wrote:
> > > > > On Tue, Jul 25, 2023 at 9:22 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > > >
> > > > > > The RISC-V kernel needs a sfence.vma after a page table modification: we
> > > > > > used to rely on the vmalloc fault handling to emit an sfence.vma, but
> > > > > > commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
> > > > > > vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
> > > > > > need to explicitly emit a sfence.vma in flush_cache_vmap().
> > > > > >
> > > > > > Note that we don't need to implement flush_cache_vunmap() as the generic
> > > > > > code should emit a flush tlb after unmapping a vmalloc region.
> > > > > >
> > > > > > Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area")
> > > > > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > > > > ---
> > > > > >  arch/riscv/include/asm/cacheflush.h | 4 ++++
> > > > > >  1 file changed, 4 insertions(+)
> > > > > >
> > > > > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > > > > > index 8091b8bf4883..b93ffddf8a61 100644
> > > > > > --- a/arch/riscv/include/asm/cacheflush.h
> > > > > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > > > > @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
> > > > > >  #define flush_icache_user_page(vma, pg, addr, len) \
> > > > > >         flush_icache_mm(vma->vm_mm, 0)
> > > > > >
> > > > > > +#ifdef CONFIG_64BIT
> > > > > > +#define flush_cache_vmap(start, end)   flush_tlb_kernel_range(start, end)
> > > > > Sorry, I couldn't agree with the above in a PIPT cache machine. It's
> > > > > not worth for.
> > > > >
> > > > > It would reduce the performance of vmap_pages_range,
> > > > > ioremap_page_range ... API, which may cause some drivers' performance
> > > > > issues when they install/uninstall memory frequently.
> > > > >
> > > >
> > > > Hi All,
> > > >
> > > > I think functional correctness should be more important than system performance
> > > > in this case. The "preventive" SFENCE.VMA became necessary due to the RISC-V
> > > > specification allowing invalidation entries to be cached in the TLB.
> > >
> > > We are at -rc4 and this stuff is broken. Taking the bigger hammer, which
> > > can be reverted later when a more targeted fix shows up, to make sure
> > > that v6.5 doesn't end up broken, sounds rather prudent. Otherwise, the
> > > original commit should probably be reverted.
> >
> > The original commit that removed vmalloc_fault() is required, handling
> > vmalloc faults in the page fault path is not possible (see the links
> > in the description of 7d3332be011e and the example that I gave in the
> > thread https://lore.kernel.org/linux-riscv/dc26625b-6658-c078-76d2-7e975a04b1d4@ghiti.fr/).
> >
> > I totally agree with Dylan that we'll work (I'm currently working on
> > that) on the performance side of the problem in the next release, we
> > need correctness and for that we need a preventive global sfence.vma
> > as we have no means (for now) to distinguish between uarch that cache
> > or not invalid entries.
> >
> > >
> > > > The problem[1] we are currently encountering is caused by not updating the TLB
> > > > after the page table is created, and the solution to this problem can only be
> > > > solved by updating the TLB immediately after the page table is created.
> > > >
> > > > There are currently two possible approaches to flush TLB:
> > > > 1. Flush TLB in flush_cache_vmap()
> > > > 2. Flush TLB in arch_sync_kernel_mappings()
> > > >
> > > > But I'm not quite sure if it's a good idea to operate on the TLB inside flush_cache_vmap().
> > > > The name of this function indicates that it should be related to cache operations, maybe
> > > > it would be more appropriate to do TLB flush in arch_sync_kernel_mappings()?
> >
> > TLDR: The downsides to implementing arch_sync_kernel_mappings()
> > instead of flush_cache_vmap():
> >
> > - 2 global flushes for vunmap instead of 1 for flush_cache_vmap()
> > - flushes the tlb in the noflush suffixed functions so it prevents any
> > flush optimization (ie: a loop of vmap_range_noflush() without flush
> > and then a final flush afterwards)
> >
> > So I'd favour the flush_cache_vmap() implementation which seems
> > lighter. powerpc does that
> > https://elixir.bootlin.com/linux/latest/source/arch/powerpc/include/asm/cacheflush.h#L27
> > (but admits that it may not be the right place)
> >
> > Here is the long story (my raw notes):
> >
> > * arch_sync_kernel_mappings() is called from:
> > - _apply_to_page_range(): would only emit global sfence.vma if vmalloc
> > addresses, I guess that's ok.
> > - __vunmap_range_noflush(): it is noted here
> > https://elixir.bootlin.com/linux/latest/source/mm/vmalloc.c#L406 that
> > any caller must call flush_tlb_kernel_range(). Then the implementation
> > of arch_sync_kernel_mappings() would result in 2 global tlb flushes.
> > - vmap_range_noflush(): does not fit well with the noflush() suffix.
> >
> > * flush_cache_vmap() is called from:
> > - kasan_populate_vmalloc(): legit since it bypasses vmap api (but
> > called right a apply_to_page_range() so your patch would work here)
> > - kmsan_vunmap_range_noflush(): called twice for the mappings kmsan
> > establishes and flush_tlb_kernel_range() must be called afterwards =>
> > 3 global tlb flushes but the 3 are needed as they target different
> > addresses. Implementing only arch_sync_kernel_mappings() would result
> > in way more global flushes (see the loop here
> > https://elixir.bootlin.com/linux/latest/source/mm/kmsan/hooks.c#L151
> > where  __vmap_pages_range_noflush() would result in more
> > flush_tlb_all())
> > - kmsan_vmap_pages_range_noflush(): here we would flush twice, but
> > same thing for the arch_sync_kernel_mappings() implementation.
> > - ioremap_page_range(): legit, same as arch_sync_kernel_mappings()
> > implementation.
> > - vmap_pages_range(): legit, same as arch_sync_kernel_mappings() implementation.
> >
> > Let me know what you think!
> >
> > Alex
> >
> Hi Alex,
>
> Thank you for the detailed explanation. It is indeed undeniable that in certain
> situations, there might be a possibility of repeated flushing TLB. But I think
> there are some potential problem in flush_cache_vmap().
>
> In most case, vmap_range_noflush() and flush_cache_vmap() will appear at the same
> time, so it should be no problem to choose one of them to do the TLB flush. But
> flush_cache_vmap() does not cover all the places where apply_to_page_range()
> appears (please correct me if I'm wrong), such as vmap_pfn()[1].

That's a good catch, but shouldn't there be a flush_cache_vmap() in
vmap_pfn()? What happens to architectures that implement
flush_cache_vmap() and not arch_sync_kernel_mappings() (most of them)?

>
> The function you mentioned here, each will eventually call:
>     vmap_range_noflush() -> arch_sync_kernel_mappings() -> TLB Flush
>
> As for the performance, because the current parameter of flush_tlb_page() needs to
> pass *vma, we cannot pass in this parameter so we can only choose flush_tlb_all().
> If it can be changed to flush_tlb_page() in the future, the performance should be improved.

We should call only flush_tlb_kernel_range() on kernel addresses, so
that won't be a problem.

>
> [1]: https://elixir.bootlin.com/linux/v6.5-rc4/source/mm/vmalloc.c#L2977
>
> Best regards,
> Dylan Jhong
>
> > > >
> > > > [1]: http://lists.infradead.org/pipermail/linux-riscv/2023-August/037503.html
> > >

Guo Ren Aug. 8, 2023, 11:47 p.m. UTC | #9

On Thu, Aug 3, 2023 at 5:14 PM dylan <dylan@andestech.com> wrote:
>
> On Sun, Jul 30, 2023 at 01:08:17AM -0400, Guo Ren wrote:
> > On Tue, Jul 25, 2023 at 9:22 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > >
> > > The RISC-V kernel needs a sfence.vma after a page table modification: we
> > > used to rely on the vmalloc fault handling to emit an sfence.vma, but
> > > commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
> > > vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
> > > need to explicitly emit a sfence.vma in flush_cache_vmap().
> > >
> > > Note that we don't need to implement flush_cache_vunmap() as the generic
> > > code should emit a flush tlb after unmapping a vmalloc region.
> > >
> > > Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area")
> > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > ---
> > >  arch/riscv/include/asm/cacheflush.h | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > >
> > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > > index 8091b8bf4883..b93ffddf8a61 100644
> > > --- a/arch/riscv/include/asm/cacheflush.h
> > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
> > >  #define flush_icache_user_page(vma, pg, addr, len) \
> > >         flush_icache_mm(vma->vm_mm, 0)
> > >
> > > +#ifdef CONFIG_64BIT
> > > +#define flush_cache_vmap(start, end)   flush_tlb_kernel_range(start, end)
> > Sorry, I couldn't agree with the above in a PIPT cache machine. It's
> > not worth for.
> >
> > It would reduce the performance of vmap_pages_range,
> > ioremap_page_range ... API, which may cause some drivers' performance
> > issues when they install/uninstall memory frequently.
> >
>
> Hi All,
>
> I think functional correctness should be more important than system performance
> in this case. The "preventive" SFENCE.VMA became necessary due to the RISC-V
> specification allowing invalidation entries to be cached in the TLB.
>
> The problem[1] we are currently encountering is caused by not updating the TLB
> after the page table is created, and the solution to this problem can only be
> solved by updating the TLB immediately after the page table is created.
>
> There are currently two possible approaches to flush TLB:
> 1. Flush TLB in flush_cache_vmap()
> 2. Flush TLB in arch_sync_kernel_mappings()
>
> But I'm not quite sure if it's a good idea to operate on the TLB inside flush_cache_vmap().
> The name of this function indicates that it should be related to cache operations, maybe
> it would be more appropriate to do TLB flush in arch_sync_kernel_mappings()?
>
> [1]: http://lists.infradead.org/pipermail/linux-riscv/2023-August/037503.html
Not all machines need it, and some CPUs prevent PTE.V=0 into TLB
during PTW, which is stronger than ISA's requirement.

So could we put an errata alternative here?

>
> Best regards,
> Dylan
>
> > > +#endif
> > > +
> > >  #ifndef CONFIG_SMP
> > >
> > >  #define flush_icache_all() local_flush_icache_all()
> > > --
> > > 2.39.2
> > >
> >
> >
> > --
> > Best Regards
> >  Guo Ren
> >
> > _______________________________________________
> > linux-riscv mailing list
> > linux-riscv@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-riscv

Guo Ren Aug. 8, 2023, 11:55 p.m. UTC | #10

On Tue, Aug 8, 2023 at 7:23 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> Hey Dylan,
>
> Sorry I was busy debugging 6.5 failing to boot on the Unmatched (also
> TLB related, crazy everything converges to TLB issues lately :))
>
> On Fri, Aug 4, 2023 at 9:48 AM Dylan Jhong <dylan@andestech.com> wrote:
> >
> > On Thu, Aug 03, 2023 at 11:48:36AM +0200, Alexandre Ghiti wrote:
> > > On Thu, Aug 3, 2023 at 11:25 AM Conor Dooley <conor.dooley@microchip.com> wrote:
> > > >
> > > > On Thu, Aug 03, 2023 at 05:14:15PM +0800, dylan wrote:
> > > > > On Sun, Jul 30, 2023 at 01:08:17AM -0400, Guo Ren wrote:
> > > > > > On Tue, Jul 25, 2023 at 9:22 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > > > >
> > > > > > > The RISC-V kernel needs a sfence.vma after a page table modification: we
> > > > > > > used to rely on the vmalloc fault handling to emit an sfence.vma, but
> > > > > > > commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
> > > > > > > vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
> > > > > > > need to explicitly emit a sfence.vma in flush_cache_vmap().
> > > > > > >
> > > > > > > Note that we don't need to implement flush_cache_vunmap() as the generic
> > > > > > > code should emit a flush tlb after unmapping a vmalloc region.
> > > > > > >
> > > > > > > Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area")
> > > > > > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > > > > > ---
> > > > > > >  arch/riscv/include/asm/cacheflush.h | 4 ++++
> > > > > > >  1 file changed, 4 insertions(+)
> > > > > > >
> > > > > > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > > > > > > index 8091b8bf4883..b93ffddf8a61 100644
> > > > > > > --- a/arch/riscv/include/asm/cacheflush.h
> > > > > > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > > > > > @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
> > > > > > >  #define flush_icache_user_page(vma, pg, addr, len) \
> > > > > > >         flush_icache_mm(vma->vm_mm, 0)
> > > > > > >
> > > > > > > +#ifdef CONFIG_64BIT
> > > > > > > +#define flush_cache_vmap(start, end)   flush_tlb_kernel_range(start, end)
> > > > > > Sorry, I couldn't agree with the above in a PIPT cache machine. It's
> > > > > > not worth for.
> > > > > >
> > > > > > It would reduce the performance of vmap_pages_range,
> > > > > > ioremap_page_range ... API, which may cause some drivers' performance
> > > > > > issues when they install/uninstall memory frequently.
> > > > > >
> > > > >
> > > > > Hi All,
> > > > >
> > > > > I think functional correctness should be more important than system performance
> > > > > in this case. The "preventive" SFENCE.VMA became necessary due to the RISC-V
> > > > > specification allowing invalidation entries to be cached in the TLB.
> > > >
> > > > We are at -rc4 and this stuff is broken. Taking the bigger hammer, which
> > > > can be reverted later when a more targeted fix shows up, to make sure
> > > > that v6.5 doesn't end up broken, sounds rather prudent. Otherwise, the
> > > > original commit should probably be reverted.
> > >
> > > The original commit that removed vmalloc_fault() is required, handling
> > > vmalloc faults in the page fault path is not possible (see the links
> > > in the description of 7d3332be011e and the example that I gave in the
> > > thread https://lore.kernel.org/linux-riscv/dc26625b-6658-c078-76d2-7e975a04b1d4@ghiti.fr/).
> > >
> > > I totally agree with Dylan that we'll work (I'm currently working on
> > > that) on the performance side of the problem in the next release, we
> > > need correctness and for that we need a preventive global sfence.vma
> > > as we have no means (for now) to distinguish between uarch that cache
> > > or not invalid entries.
> > >
> > > >
> > > > > The problem[1] we are currently encountering is caused by not updating the TLB
> > > > > after the page table is created, and the solution to this problem can only be
> > > > > solved by updating the TLB immediately after the page table is created.
> > > > >
> > > > > There are currently two possible approaches to flush TLB:
> > > > > 1. Flush TLB in flush_cache_vmap()
> > > > > 2. Flush TLB in arch_sync_kernel_mappings()
> > > > >
> > > > > But I'm not quite sure if it's a good idea to operate on the TLB inside flush_cache_vmap().
> > > > > The name of this function indicates that it should be related to cache operations, maybe
> > > > > it would be more appropriate to do TLB flush in arch_sync_kernel_mappings()?
> > >
> > > TLDR: The downsides to implementing arch_sync_kernel_mappings()
> > > instead of flush_cache_vmap():
> > >
> > > - 2 global flushes for vunmap instead of 1 for flush_cache_vmap()
> > > - flushes the tlb in the noflush suffixed functions so it prevents any
> > > flush optimization (ie: a loop of vmap_range_noflush() without flush
> > > and then a final flush afterwards)
> > >
> > > So I'd favour the flush_cache_vmap() implementation which seems
> > > lighter. powerpc does that
> > > https://elixir.bootlin.com/linux/latest/source/arch/powerpc/include/asm/cacheflush.h#L27
> > > (but admits that it may not be the right place)
> > >
> > > Here is the long story (my raw notes):
> > >
> > > * arch_sync_kernel_mappings() is called from:
> > > - _apply_to_page_range(): would only emit global sfence.vma if vmalloc
> > > addresses, I guess that's ok.
> > > - __vunmap_range_noflush(): it is noted here
> > > https://elixir.bootlin.com/linux/latest/source/mm/vmalloc.c#L406 that
> > > any caller must call flush_tlb_kernel_range(). Then the implementation
> > > of arch_sync_kernel_mappings() would result in 2 global tlb flushes.
> > > - vmap_range_noflush(): does not fit well with the noflush() suffix.
> > >
> > > * flush_cache_vmap() is called from:
> > > - kasan_populate_vmalloc(): legit since it bypasses vmap api (but
> > > called right a apply_to_page_range() so your patch would work here)
> > > - kmsan_vunmap_range_noflush(): called twice for the mappings kmsan
> > > establishes and flush_tlb_kernel_range() must be called afterwards =>
> > > 3 global tlb flushes but the 3 are needed as they target different
> > > addresses. Implementing only arch_sync_kernel_mappings() would result
> > > in way more global flushes (see the loop here
> > > https://elixir.bootlin.com/linux/latest/source/mm/kmsan/hooks.c#L151
> > > where  __vmap_pages_range_noflush() would result in more
> > > flush_tlb_all())
> > > - kmsan_vmap_pages_range_noflush(): here we would flush twice, but
> > > same thing for the arch_sync_kernel_mappings() implementation.
> > > - ioremap_page_range(): legit, same as arch_sync_kernel_mappings()
> > > implementation.
> > > - vmap_pages_range(): legit, same as arch_sync_kernel_mappings() implementation.
> > >
> > > Let me know what you think!
> > >
> > > Alex
> > >
> > Hi Alex,
> >
> > Thank you for the detailed explanation. It is indeed undeniable that in certain
> > situations, there might be a possibility of repeated flushing TLB. But I think
> > there are some potential problem in flush_cache_vmap().
> >
> > In most case, vmap_range_noflush() and flush_cache_vmap() will appear at the same
> > time, so it should be no problem to choose one of them to do the TLB flush. But
> > flush_cache_vmap() does not cover all the places where apply_to_page_range()
> > appears (please correct me if I'm wrong), such as vmap_pfn()[1].
>
> That's a good catch, but shouldn't there be a flush_cache_vmap() in
> vmap_pfn()? What happens to architectures that implement
> flush_cache_vmap() and not arch_sync_kernel_mappings() (most of them)?
>
> >
> > The function you mentioned here, each will eventually call:
> >     vmap_range_noflush() -> arch_sync_kernel_mappings() -> TLB Flush
> >
> > As for the performance, because the current parameter of flush_tlb_page() needs to
> > pass *vma, we cannot pass in this parameter so we can only choose flush_tlb_all().
> > If it can be changed to flush_tlb_page() in the future, the performance should be improved.
>
> We should call only flush_tlb_kernel_range() on kernel addresses, so
> that won't be a problem.
Another idea for reference:
 1. keep vmalloc_fault()
 2. flush_tlb_page() for vmap_stack when creating kernel thread.

>
> >
> > [1]: https://elixir.bootlin.com/linux/v6.5-rc4/source/mm/vmalloc.c#L2977
> >
> > Best regards,
> > Dylan Jhong
> >
> > > > >
> > > > > [1]: http://lists.infradead.org/pipermail/linux-riscv/2023-August/037503.html
> > > >

Palmer Dabbelt Aug. 10, 2023, 4 p.m. UTC | #11

On Tue, 08 Aug 2023 16:55:12 PDT (-0700), guoren@kernel.org wrote:
> On Tue, Aug 8, 2023 at 7:23 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>>
>> Hey Dylan,
>>
>> Sorry I was busy debugging 6.5 failing to boot on the Unmatched (also
>> TLB related, crazy everything converges to TLB issues lately :))
>>
>> On Fri, Aug 4, 2023 at 9:48 AM Dylan Jhong <dylan@andestech.com> wrote:
>> >
>> > On Thu, Aug 03, 2023 at 11:48:36AM +0200, Alexandre Ghiti wrote:
>> > > On Thu, Aug 3, 2023 at 11:25 AM Conor Dooley <conor.dooley@microchip.com> wrote:
>> > > >
>> > > > On Thu, Aug 03, 2023 at 05:14:15PM +0800, dylan wrote:
>> > > > > On Sun, Jul 30, 2023 at 01:08:17AM -0400, Guo Ren wrote:
>> > > > > > On Tue, Jul 25, 2023 at 9:22 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>> > > > > > >
>> > > > > > > The RISC-V kernel needs a sfence.vma after a page table modification: we
>> > > > > > > used to rely on the vmalloc fault handling to emit an sfence.vma, but
>> > > > > > > commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
>> > > > > > > vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
>> > > > > > > need to explicitly emit a sfence.vma in flush_cache_vmap().
>> > > > > > >
>> > > > > > > Note that we don't need to implement flush_cache_vunmap() as the generic
>> > > > > > > code should emit a flush tlb after unmapping a vmalloc region.
>> > > > > > >
>> > > > > > > Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area")
>> > > > > > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
>> > > > > > > ---
>> > > > > > >  arch/riscv/include/asm/cacheflush.h | 4 ++++
>> > > > > > >  1 file changed, 4 insertions(+)
>> > > > > > >
>> > > > > > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
>> > > > > > > index 8091b8bf4883..b93ffddf8a61 100644
>> > > > > > > --- a/arch/riscv/include/asm/cacheflush.h
>> > > > > > > +++ b/arch/riscv/include/asm/cacheflush.h
>> > > > > > > @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
>> > > > > > >  #define flush_icache_user_page(vma, pg, addr, len) \
>> > > > > > >         flush_icache_mm(vma->vm_mm, 0)
>> > > > > > >
>> > > > > > > +#ifdef CONFIG_64BIT
>> > > > > > > +#define flush_cache_vmap(start, end)   flush_tlb_kernel_range(start, end)
>> > > > > > Sorry, I couldn't agree with the above in a PIPT cache machine. It's
>> > > > > > not worth for.
>> > > > > >
>> > > > > > It would reduce the performance of vmap_pages_range,
>> > > > > > ioremap_page_range ... API, which may cause some drivers' performance
>> > > > > > issues when they install/uninstall memory frequently.
>> > > > > >
>> > > > >
>> > > > > Hi All,
>> > > > >
>> > > > > I think functional correctness should be more important than system performance
>> > > > > in this case. The "preventive" SFENCE.VMA became necessary due to the RISC-V
>> > > > > specification allowing invalidation entries to be cached in the TLB.
>> > > >
>> > > > We are at -rc4 and this stuff is broken. Taking the bigger hammer, which
>> > > > can be reverted later when a more targeted fix shows up, to make sure
>> > > > that v6.5 doesn't end up broken, sounds rather prudent. Otherwise, the
>> > > > original commit should probably be reverted.
>> > >
>> > > The original commit that removed vmalloc_fault() is required, handling
>> > > vmalloc faults in the page fault path is not possible (see the links
>> > > in the description of 7d3332be011e and the example that I gave in the
>> > > thread https://lore.kernel.org/linux-riscv/dc26625b-6658-c078-76d2-7e975a04b1d4@ghiti.fr/).
>> > >
>> > > I totally agree with Dylan that we'll work (I'm currently working on
>> > > that) on the performance side of the problem in the next release, we
>> > > need correctness and for that we need a preventive global sfence.vma
>> > > as we have no means (for now) to distinguish between uarch that cache
>> > > or not invalid entries.
>> > >
>> > > >
>> > > > > The problem[1] we are currently encountering is caused by not updating the TLB
>> > > > > after the page table is created, and the solution to this problem can only be
>> > > > > solved by updating the TLB immediately after the page table is created.
>> > > > >
>> > > > > There are currently two possible approaches to flush TLB:
>> > > > > 1. Flush TLB in flush_cache_vmap()
>> > > > > 2. Flush TLB in arch_sync_kernel_mappings()
>> > > > >
>> > > > > But I'm not quite sure if it's a good idea to operate on the TLB inside flush_cache_vmap().
>> > > > > The name of this function indicates that it should be related to cache operations, maybe
>> > > > > it would be more appropriate to do TLB flush in arch_sync_kernel_mappings()?
>> > >
>> > > TLDR: The downsides to implementing arch_sync_kernel_mappings()
>> > > instead of flush_cache_vmap():
>> > >
>> > > - 2 global flushes for vunmap instead of 1 for flush_cache_vmap()
>> > > - flushes the tlb in the noflush suffixed functions so it prevents any
>> > > flush optimization (ie: a loop of vmap_range_noflush() without flush
>> > > and then a final flush afterwards)
>> > >
>> > > So I'd favour the flush_cache_vmap() implementation which seems
>> > > lighter. powerpc does that
>> > > https://elixir.bootlin.com/linux/latest/source/arch/powerpc/include/asm/cacheflush.h#L27
>> > > (but admits that it may not be the right place)
>> > >
>> > > Here is the long story (my raw notes):
>> > >
>> > > * arch_sync_kernel_mappings() is called from:
>> > > - _apply_to_page_range(): would only emit global sfence.vma if vmalloc
>> > > addresses, I guess that's ok.
>> > > - __vunmap_range_noflush(): it is noted here
>> > > https://elixir.bootlin.com/linux/latest/source/mm/vmalloc.c#L406 that
>> > > any caller must call flush_tlb_kernel_range(). Then the implementation
>> > > of arch_sync_kernel_mappings() would result in 2 global tlb flushes.
>> > > - vmap_range_noflush(): does not fit well with the noflush() suffix.
>> > >
>> > > * flush_cache_vmap() is called from:
>> > > - kasan_populate_vmalloc(): legit since it bypasses vmap api (but
>> > > called right a apply_to_page_range() so your patch would work here)
>> > > - kmsan_vunmap_range_noflush(): called twice for the mappings kmsan
>> > > establishes and flush_tlb_kernel_range() must be called afterwards =>
>> > > 3 global tlb flushes but the 3 are needed as they target different
>> > > addresses. Implementing only arch_sync_kernel_mappings() would result
>> > > in way more global flushes (see the loop here
>> > > https://elixir.bootlin.com/linux/latest/source/mm/kmsan/hooks.c#L151
>> > > where  __vmap_pages_range_noflush() would result in more
>> > > flush_tlb_all())
>> > > - kmsan_vmap_pages_range_noflush(): here we would flush twice, but
>> > > same thing for the arch_sync_kernel_mappings() implementation.
>> > > - ioremap_page_range(): legit, same as arch_sync_kernel_mappings()
>> > > implementation.
>> > > - vmap_pages_range(): legit, same as arch_sync_kernel_mappings() implementation.
>> > >
>> > > Let me know what you think!
>> > >
>> > > Alex
>> > >
>> > Hi Alex,
>> >
>> > Thank you for the detailed explanation. It is indeed undeniable that in certain
>> > situations, there might be a possibility of repeated flushing TLB. But I think
>> > there are some potential problem in flush_cache_vmap().
>> >
>> > In most case, vmap_range_noflush() and flush_cache_vmap() will appear at the same
>> > time, so it should be no problem to choose one of them to do the TLB flush. But
>> > flush_cache_vmap() does not cover all the places where apply_to_page_range()
>> > appears (please correct me if I'm wrong), such as vmap_pfn()[1].
>>
>> That's a good catch, but shouldn't there be a flush_cache_vmap() in
>> vmap_pfn()? What happens to architectures that implement
>> flush_cache_vmap() and not arch_sync_kernel_mappings() (most of them)?
>>
>> >
>> > The function you mentioned here, each will eventually call:
>> >     vmap_range_noflush() -> arch_sync_kernel_mappings() -> TLB Flush
>> >
>> > As for the performance, because the current parameter of flush_tlb_page() needs to
>> > pass *vma, we cannot pass in this parameter so we can only choose flush_tlb_all().
>> > If it can be changed to flush_tlb_page() in the future, the performance should be improved.
>>
>> We should call only flush_tlb_kernel_range() on kernel addresses, so
>> that won't be a problem.
> Another idea for reference:
>  1. keep vmalloc_fault()
>  2. flush_tlb_page() for vmap_stack when creating kernel thread.
>

Sorry, I replied to the wrong thread over here 
<https://lore.kernel.org/all/mhng-3d3afb21-bd40-4095-ba62-41cf40b782ca@palmer-ri-x1c9/#t>.  
I agree we'll likely have performance issues, but at least this fixes 
the bugs we're seeing.  If there's performance issues on real workloads 
on real HW then I'm happy to look at something more complicated, but 
that's probably too much for this release.

So I've queue this up, it'll end up on fixes when it passes the tests.

Thanks!

>>
>> >
>> > [1]: https://elixir.bootlin.com/linux/v6.5-rc4/source/mm/vmalloc.c#L2977
>> >
>> > Best regards,
>> > Dylan Jhong
>> >
>> > > > >
>> > > > > [1]: http://lists.infradead.org/pipermail/linux-riscv/2023-August/037503.html
>> > > >
>
>
>
> -- 
> Best Regards
>  Guo Ren

Dylan Jhong Aug. 10, 2023, 4:22 p.m. UTC | #12

On Thu, Aug 10, 2023 at 09:00:52AM -0700, Palmer Dabbelt wrote:
> On Tue, 08 Aug 2023 16:55:12 PDT (-0700), guoren@kernel.org wrote:
> > On Tue, Aug 8, 2023 at 7:23 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > 
> > > Hey Dylan,
> > > 
> > > Sorry I was busy debugging 6.5 failing to boot on the Unmatched (also
> > > TLB related, crazy everything converges to TLB issues lately :))
> > > 
> > > On Fri, Aug 4, 2023 at 9:48 AM Dylan Jhong <dylan@andestech.com> wrote:
> > > >
> > > > On Thu, Aug 03, 2023 at 11:48:36AM +0200, Alexandre Ghiti wrote:
> > > > > On Thu, Aug 3, 2023 at 11:25 AM Conor Dooley <conor.dooley@microchip.com> wrote:
> > > > > >
> > > > > > On Thu, Aug 03, 2023 at 05:14:15PM +0800, dylan wrote:
> > > > > > > On Sun, Jul 30, 2023 at 01:08:17AM -0400, Guo Ren wrote:
> > > > > > > > On Tue, Jul 25, 2023 at 9:22 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > > > > > >
> > > > > > > > > The RISC-V kernel needs a sfence.vma after a page table modification: we
> > > > > > > > > used to rely on the vmalloc fault handling to emit an sfence.vma, but
> > > > > > > > > commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
> > > > > > > > > vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
> > > > > > > > > need to explicitly emit a sfence.vma in flush_cache_vmap().
> > > > > > > > >
> > > > > > > > > Note that we don't need to implement flush_cache_vunmap() as the generic
> > > > > > > > > code should emit a flush tlb after unmapping a vmalloc region.
> > > > > > > > >
> > > > > > > > > Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area")
> > > > > > > > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > > > > > > > ---
> > > > > > > > >  arch/riscv/include/asm/cacheflush.h | 4 ++++
> > > > > > > > >  1 file changed, 4 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > > > > > > > > index 8091b8bf4883..b93ffddf8a61 100644
> > > > > > > > > --- a/arch/riscv/include/asm/cacheflush.h
> > > > > > > > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > > > > > > > @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
> > > > > > > > >  #define flush_icache_user_page(vma, pg, addr, len) \
> > > > > > > > >         flush_icache_mm(vma->vm_mm, 0)
> > > > > > > > >
> > > > > > > > > +#ifdef CONFIG_64BIT
> > > > > > > > > +#define flush_cache_vmap(start, end)   flush_tlb_kernel_range(start, end)
> > > > > > > > Sorry, I couldn't agree with the above in a PIPT cache machine. It's
> > > > > > > > not worth for.
> > > > > > > >
> > > > > > > > It would reduce the performance of vmap_pages_range,
> > > > > > > > ioremap_page_range ... API, which may cause some drivers' performance
> > > > > > > > issues when they install/uninstall memory frequently.
> > > > > > > >
> > > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > I think functional correctness should be more important than system performance
> > > > > > > in this case. The "preventive" SFENCE.VMA became necessary due to the RISC-V
> > > > > > > specification allowing invalidation entries to be cached in the TLB.
> > > > > >
> > > > > > We are at -rc4 and this stuff is broken. Taking the bigger hammer, which
> > > > > > can be reverted later when a more targeted fix shows up, to make sure
> > > > > > that v6.5 doesn't end up broken, sounds rather prudent. Otherwise, the
> > > > > > original commit should probably be reverted.
> > > > >
> > > > > The original commit that removed vmalloc_fault() is required, handling
> > > > > vmalloc faults in the page fault path is not possible (see the links
> > > > > in the description of 7d3332be011e and the example that I gave in the
> > > > > thread https://lore.kernel.org/linux-riscv/dc26625b-6658-c078-76d2-7e975a04b1d4@ghiti.fr/).
> > > > >
> > > > > I totally agree with Dylan that we'll work (I'm currently working on
> > > > > that) on the performance side of the problem in the next release, we
> > > > > need correctness and for that we need a preventive global sfence.vma
> > > > > as we have no means (for now) to distinguish between uarch that cache
> > > > > or not invalid entries.
> > > > >
> > > > > >
> > > > > > > The problem[1] we are currently encountering is caused by not updating the TLB
> > > > > > > after the page table is created, and the solution to this problem can only be
> > > > > > > solved by updating the TLB immediately after the page table is created.
> > > > > > >
> > > > > > > There are currently two possible approaches to flush TLB:
> > > > > > > 1. Flush TLB in flush_cache_vmap()
> > > > > > > 2. Flush TLB in arch_sync_kernel_mappings()
> > > > > > >
> > > > > > > But I'm not quite sure if it's a good idea to operate on the TLB inside flush_cache_vmap().
> > > > > > > The name of this function indicates that it should be related to cache operations, maybe
> > > > > > > it would be more appropriate to do TLB flush in arch_sync_kernel_mappings()?
> > > > >
> > > > > TLDR: The downsides to implementing arch_sync_kernel_mappings()
> > > > > instead of flush_cache_vmap():
> > > > >
> > > > > - 2 global flushes for vunmap instead of 1 for flush_cache_vmap()
> > > > > - flushes the tlb in the noflush suffixed functions so it prevents any
> > > > > flush optimization (ie: a loop of vmap_range_noflush() without flush
> > > > > and then a final flush afterwards)
> > > > >
> > > > > So I'd favour the flush_cache_vmap() implementation which seems
> > > > > lighter. powerpc does that
> > > > > https://elixir.bootlin.com/linux/latest/source/arch/powerpc/include/asm/cacheflush.h#L27
> > > > > (but admits that it may not be the right place)
> > > > >
> > > > > Here is the long story (my raw notes):
> > > > >
> > > > > * arch_sync_kernel_mappings() is called from:
> > > > > - _apply_to_page_range(): would only emit global sfence.vma if vmalloc
> > > > > addresses, I guess that's ok.
> > > > > - __vunmap_range_noflush(): it is noted here
> > > > > https://elixir.bootlin.com/linux/latest/source/mm/vmalloc.c#L406 that
> > > > > any caller must call flush_tlb_kernel_range(). Then the implementation
> > > > > of arch_sync_kernel_mappings() would result in 2 global tlb flushes.
> > > > > - vmap_range_noflush(): does not fit well with the noflush() suffix.
> > > > >
> > > > > * flush_cache_vmap() is called from:
> > > > > - kasan_populate_vmalloc(): legit since it bypasses vmap api (but
> > > > > called right a apply_to_page_range() so your patch would work here)
> > > > > - kmsan_vunmap_range_noflush(): called twice for the mappings kmsan
> > > > > establishes and flush_tlb_kernel_range() must be called afterwards =>
> > > > > 3 global tlb flushes but the 3 are needed as they target different
> > > > > addresses. Implementing only arch_sync_kernel_mappings() would result
> > > > > in way more global flushes (see the loop here
> > > > > https://elixir.bootlin.com/linux/latest/source/mm/kmsan/hooks.c#L151
> > > > > where  __vmap_pages_range_noflush() would result in more
> > > > > flush_tlb_all())
> > > > > - kmsan_vmap_pages_range_noflush(): here we would flush twice, but
> > > > > same thing for the arch_sync_kernel_mappings() implementation.
> > > > > - ioremap_page_range(): legit, same as arch_sync_kernel_mappings()
> > > > > implementation.
> > > > > - vmap_pages_range(): legit, same as arch_sync_kernel_mappings() implementation.
> > > > >
> > > > > Let me know what you think!
> > > > >
> > > > > Alex
> > > > >
> > > > Hi Alex,
> > > >
> > > > Thank you for the detailed explanation. It is indeed undeniable that in certain
> > > > situations, there might be a possibility of repeated flushing TLB. But I think
> > > > there are some potential problem in flush_cache_vmap().
> > > >
> > > > In most case, vmap_range_noflush() and flush_cache_vmap() will appear at the same
> > > > time, so it should be no problem to choose one of them to do the TLB flush. But
> > > > flush_cache_vmap() does not cover all the places where apply_to_page_range()
> > > > appears (please correct me if I'm wrong), such as vmap_pfn()[1].
> > > 
> > > That's a good catch, but shouldn't there be a flush_cache_vmap() in
> > > vmap_pfn()? What happens to architectures that implement
> > > flush_cache_vmap() and not arch_sync_kernel_mappings() (most of them)?
> > > 
> > > >
> > > > The function you mentioned here, each will eventually call:
> > > >     vmap_range_noflush() -> arch_sync_kernel_mappings() -> TLB Flush
> > > >
> > > > As for the performance, because the current parameter of flush_tlb_page() needs to
> > > > pass *vma, we cannot pass in this parameter so we can only choose flush_tlb_all().
> > > > If it can be changed to flush_tlb_page() in the future, the performance should be improved.
> > > 
> > > We should call only flush_tlb_kernel_range() on kernel addresses, so
> > > that won't be a problem.
> > Another idea for reference:
> >  1. keep vmalloc_fault()
> >  2. flush_tlb_page() for vmap_stack when creating kernel thread.
> > 
> 
> Sorry, I replied to the wrong thread over here <https://lore.kernel.org/all/mhng-3d3afb21-bd40-4095-ba62-41cf40b782ca@palmer-ri-x1c9/#t>.
> I agree we'll likely have performance issues, but at least this fixes the
> bugs we're seeing.  If there's performance issues on real workloads on real
> HW then I'm happy to look at something more complicated, but that's probably
> too much for this release.
> 
> So I've queue this up, it'll end up on fixes when it passes the tests.
> 
> Thanks!

Hi Palmer,

Reply to your post on another thread.

Yes, we need the implementation of flush_cache_vmap() under the RISC-V architecture to
solve our problem. Alexandre proposed another solution to deal with the TLB
updating issue in another thread [1], I think that patch should be able to better
ensure performance and correctness. But before that, we need this patch as a
workaround.

BTW, since there is no flush_cache_vmap() in vmap_pfn(), there may be a risk
that the TLB is not updated. I think this patch[2] should be pulled back to the
RISC-V port.

[1]: http://lists.infradead.org/pipermail/linux-riscv/2023-August/038079.html
[2]: https://lore.kernel.org/all/mhng-3d3afb21-bd40-4095-ba62-41cf40b782ca@palmer-ri-x1c9/#t

> 
> > > 
> > > >
> > > > [1]: https://elixir.bootlin.com/linux/v6.5-rc4/source/mm/vmalloc.c#L2977
> > > >
> > > > Best regards,
> > > > Dylan Jhong
> > > >
> > > > > > >
> > > > > > > [1]: http://lists.infradead.org/pipermail/linux-riscv/2023-August/037503.html
> > > > > >
> > 
> > 
> > 
> > -- 
> > Best Regards
> >  Guo Ren

Palmer Dabbelt Aug. 10, 2023, 7:39 p.m. UTC | #13

On Thu, 10 Aug 2023 09:22:30 PDT (-0700), dylan@andestech.com wrote:
> On Thu, Aug 10, 2023 at 09:00:52AM -0700, Palmer Dabbelt wrote:
>> On Tue, 08 Aug 2023 16:55:12 PDT (-0700), guoren@kernel.org wrote:
>> > On Tue, Aug 8, 2023 at 7:23 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>> > >
>> > > Hey Dylan,
>> > >
>> > > Sorry I was busy debugging 6.5 failing to boot on the Unmatched (also
>> > > TLB related, crazy everything converges to TLB issues lately :))
>> > >
>> > > On Fri, Aug 4, 2023 at 9:48 AM Dylan Jhong <dylan@andestech.com> wrote:
>> > > >
>> > > > On Thu, Aug 03, 2023 at 11:48:36AM +0200, Alexandre Ghiti wrote:
>> > > > > On Thu, Aug 3, 2023 at 11:25 AM Conor Dooley <conor.dooley@microchip.com> wrote:
>> > > > > >
>> > > > > > On Thu, Aug 03, 2023 at 05:14:15PM +0800, dylan wrote:
>> > > > > > > On Sun, Jul 30, 2023 at 01:08:17AM -0400, Guo Ren wrote:
>> > > > > > > > On Tue, Jul 25, 2023 at 9:22 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>> > > > > > > > >
>> > > > > > > > > The RISC-V kernel needs a sfence.vma after a page table modification: we
>> > > > > > > > > used to rely on the vmalloc fault handling to emit an sfence.vma, but
>> > > > > > > > > commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
>> > > > > > > > > vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
>> > > > > > > > > need to explicitly emit a sfence.vma in flush_cache_vmap().
>> > > > > > > > >
>> > > > > > > > > Note that we don't need to implement flush_cache_vunmap() as the generic
>> > > > > > > > > code should emit a flush tlb after unmapping a vmalloc region.
>> > > > > > > > >
>> > > > > > > > > Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area")
>> > > > > > > > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
>> > > > > > > > > ---
>> > > > > > > > >  arch/riscv/include/asm/cacheflush.h | 4 ++++
>> > > > > > > > >  1 file changed, 4 insertions(+)
>> > > > > > > > >
>> > > > > > > > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
>> > > > > > > > > index 8091b8bf4883..b93ffddf8a61 100644
>> > > > > > > > > --- a/arch/riscv/include/asm/cacheflush.h
>> > > > > > > > > +++ b/arch/riscv/include/asm/cacheflush.h
>> > > > > > > > > @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
>> > > > > > > > >  #define flush_icache_user_page(vma, pg, addr, len) \
>> > > > > > > > >         flush_icache_mm(vma->vm_mm, 0)
>> > > > > > > > >
>> > > > > > > > > +#ifdef CONFIG_64BIT
>> > > > > > > > > +#define flush_cache_vmap(start, end)   flush_tlb_kernel_range(start, end)
>> > > > > > > > Sorry, I couldn't agree with the above in a PIPT cache machine. It's
>> > > > > > > > not worth for.
>> > > > > > > >
>> > > > > > > > It would reduce the performance of vmap_pages_range,
>> > > > > > > > ioremap_page_range ... API, which may cause some drivers' performance
>> > > > > > > > issues when they install/uninstall memory frequently.
>> > > > > > > >
>> > > > > > >
>> > > > > > > Hi All,
>> > > > > > >
>> > > > > > > I think functional correctness should be more important than system performance
>> > > > > > > in this case. The "preventive" SFENCE.VMA became necessary due to the RISC-V
>> > > > > > > specification allowing invalidation entries to be cached in the TLB.
>> > > > > >
>> > > > > > We are at -rc4 and this stuff is broken. Taking the bigger hammer, which
>> > > > > > can be reverted later when a more targeted fix shows up, to make sure
>> > > > > > that v6.5 doesn't end up broken, sounds rather prudent. Otherwise, the
>> > > > > > original commit should probably be reverted.
>> > > > >
>> > > > > The original commit that removed vmalloc_fault() is required, handling
>> > > > > vmalloc faults in the page fault path is not possible (see the links
>> > > > > in the description of 7d3332be011e and the example that I gave in the
>> > > > > thread https://lore.kernel.org/linux-riscv/dc26625b-6658-c078-76d2-7e975a04b1d4@ghiti.fr/).
>> > > > >
>> > > > > I totally agree with Dylan that we'll work (I'm currently working on
>> > > > > that) on the performance side of the problem in the next release, we
>> > > > > need correctness and for that we need a preventive global sfence.vma
>> > > > > as we have no means (for now) to distinguish between uarch that cache
>> > > > > or not invalid entries.
>> > > > >
>> > > > > >
>> > > > > > > The problem[1] we are currently encountering is caused by not updating the TLB
>> > > > > > > after the page table is created, and the solution to this problem can only be
>> > > > > > > solved by updating the TLB immediately after the page table is created.
>> > > > > > >
>> > > > > > > There are currently two possible approaches to flush TLB:
>> > > > > > > 1. Flush TLB in flush_cache_vmap()
>> > > > > > > 2. Flush TLB in arch_sync_kernel_mappings()
>> > > > > > >
>> > > > > > > But I'm not quite sure if it's a good idea to operate on the TLB inside flush_cache_vmap().
>> > > > > > > The name of this function indicates that it should be related to cache operations, maybe
>> > > > > > > it would be more appropriate to do TLB flush in arch_sync_kernel_mappings()?
>> > > > >
>> > > > > TLDR: The downsides to implementing arch_sync_kernel_mappings()
>> > > > > instead of flush_cache_vmap():
>> > > > >
>> > > > > - 2 global flushes for vunmap instead of 1 for flush_cache_vmap()
>> > > > > - flushes the tlb in the noflush suffixed functions so it prevents any
>> > > > > flush optimization (ie: a loop of vmap_range_noflush() without flush
>> > > > > and then a final flush afterwards)
>> > > > >
>> > > > > So I'd favour the flush_cache_vmap() implementation which seems
>> > > > > lighter. powerpc does that
>> > > > > https://elixir.bootlin.com/linux/latest/source/arch/powerpc/include/asm/cacheflush.h#L27
>> > > > > (but admits that it may not be the right place)
>> > > > >
>> > > > > Here is the long story (my raw notes):
>> > > > >
>> > > > > * arch_sync_kernel_mappings() is called from:
>> > > > > - _apply_to_page_range(): would only emit global sfence.vma if vmalloc
>> > > > > addresses, I guess that's ok.
>> > > > > - __vunmap_range_noflush(): it is noted here
>> > > > > https://elixir.bootlin.com/linux/latest/source/mm/vmalloc.c#L406 that
>> > > > > any caller must call flush_tlb_kernel_range(). Then the implementation
>> > > > > of arch_sync_kernel_mappings() would result in 2 global tlb flushes.
>> > > > > - vmap_range_noflush(): does not fit well with the noflush() suffix.
>> > > > >
>> > > > > * flush_cache_vmap() is called from:
>> > > > > - kasan_populate_vmalloc(): legit since it bypasses vmap api (but
>> > > > > called right a apply_to_page_range() so your patch would work here)
>> > > > > - kmsan_vunmap_range_noflush(): called twice for the mappings kmsan
>> > > > > establishes and flush_tlb_kernel_range() must be called afterwards =>
>> > > > > 3 global tlb flushes but the 3 are needed as they target different
>> > > > > addresses. Implementing only arch_sync_kernel_mappings() would result
>> > > > > in way more global flushes (see the loop here
>> > > > > https://elixir.bootlin.com/linux/latest/source/mm/kmsan/hooks.c#L151
>> > > > > where  __vmap_pages_range_noflush() would result in more
>> > > > > flush_tlb_all())
>> > > > > - kmsan_vmap_pages_range_noflush(): here we would flush twice, but
>> > > > > same thing for the arch_sync_kernel_mappings() implementation.
>> > > > > - ioremap_page_range(): legit, same as arch_sync_kernel_mappings()
>> > > > > implementation.
>> > > > > - vmap_pages_range(): legit, same as arch_sync_kernel_mappings() implementation.
>> > > > >
>> > > > > Let me know what you think!
>> > > > >
>> > > > > Alex
>> > > > >
>> > > > Hi Alex,
>> > > >
>> > > > Thank you for the detailed explanation. It is indeed undeniable that in certain
>> > > > situations, there might be a possibility of repeated flushing TLB. But I think
>> > > > there are some potential problem in flush_cache_vmap().
>> > > >
>> > > > In most case, vmap_range_noflush() and flush_cache_vmap() will appear at the same
>> > > > time, so it should be no problem to choose one of them to do the TLB flush. But
>> > > > flush_cache_vmap() does not cover all the places where apply_to_page_range()
>> > > > appears (please correct me if I'm wrong), such as vmap_pfn()[1].
>> > >
>> > > That's a good catch, but shouldn't there be a flush_cache_vmap() in
>> > > vmap_pfn()? What happens to architectures that implement
>> > > flush_cache_vmap() and not arch_sync_kernel_mappings() (most of them)?
>> > >
>> > > >
>> > > > The function you mentioned here, each will eventually call:
>> > > >     vmap_range_noflush() -> arch_sync_kernel_mappings() -> TLB Flush
>> > > >
>> > > > As for the performance, because the current parameter of flush_tlb_page() needs to
>> > > > pass *vma, we cannot pass in this parameter so we can only choose flush_tlb_all().
>> > > > If it can be changed to flush_tlb_page() in the future, the performance should be improved.
>> > >
>> > > We should call only flush_tlb_kernel_range() on kernel addresses, so
>> > > that won't be a problem.
>> > Another idea for reference:
>> >  1. keep vmalloc_fault()
>> >  2. flush_tlb_page() for vmap_stack when creating kernel thread.
>> >
>>
>> Sorry, I replied to the wrong thread over here <https://lore.kernel.org/all/mhng-3d3afb21-bd40-4095-ba62-41cf40b782ca@palmer-ri-x1c9/#t>.
>> I agree we'll likely have performance issues, but at least this fixes the
>> bugs we're seeing.  If there's performance issues on real workloads on real
>> HW then I'm happy to look at something more complicated, but that's probably
>> too much for this release.
>>
>> So I've queue this up, it'll end up on fixes when it passes the tests.
>>
>> Thanks!
>
> Hi Palmer,
>
> Reply to your post on another thread.
>
> Yes, we need the implementation of flush_cache_vmap() under the RISC-V architecture to
> solve our problem. Alexandre proposed another solution to deal with the TLB
> updating issue in another thread [1], I think that patch should be able to better
> ensure performance and correctness. But before that, we need this patch as a
> workaround.

Awesome, thanks.

> BTW, since there is no flush_cache_vmap() in vmap_pfn(), there may be a risk
> that the TLB is not updated. I think this patch[2] should be pulled back to the
> RISC-V port.

That one's for the MM tree, though -- it should still end up in stable, 
so we should be fine from the RISC-V side of things.

>
> [1]: http://lists.infradead.org/pipermail/linux-riscv/2023-August/038079.html
> [2]: https://lore.kernel.org/all/mhng-3d3afb21-bd40-4095-ba62-41cf40b782ca@palmer-ri-x1c9/#t
>
>>
>> > >
>> > > >
>> > > > [1]: https://elixir.bootlin.com/linux/v6.5-rc4/source/mm/vmalloc.c#L2977
>> > > >
>> > > > Best regards,
>> > > > Dylan Jhong
>> > > >
>> > > > > > >
>> > > > > > > [1]: http://lists.infradead.org/pipermail/linux-riscv/2023-August/037503.html
>> > > > > >
>> >
>> >
>> >
>> > --
>> > Best Regards
>> >  Guo Ren

patchwork-bot+linux-riscv@kernel.org Aug. 10, 2023, 7:50 p.m. UTC | #14

Hello:

This patch was applied to riscv/linux.git (fixes)
by Palmer Dabbelt <palmer@rivosinc.com>:

On Tue, 25 Jul 2023 15:22:46 +0200 you wrote:
> The RISC-V kernel needs a sfence.vma after a page table modification: we
> used to rely on the vmalloc fault handling to emit an sfence.vma, but
> commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
> vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
> need to explicitly emit a sfence.vma in flush_cache_vmap().
> 
> Note that we don't need to implement flush_cache_vunmap() as the generic
> code should emit a flush tlb after unmapping a vmalloc region.
> 
> [...]

Here is the summary with links:
  - [-fixes] riscv: Implement flush_cache_vmap()
    https://git.kernel.org/riscv/c/7e3811521dc3

You are awesome, thank you!

[-fixes] riscv: Implement flush_cache_vmap()

Checks

Commit Message

Comments

Patch