
[v9,01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional

Message ID 20250206044346.3810242-2-riel@surriel.com (mailing list archive)
State New
Series AMD broadcast TLB invalidation

Commit Message

Rik van Riel Feb. 6, 2025, 4:43 a.m. UTC
Currently x86 uses CONFIG_MMU_GATHER_RCU_TABLE_FREE when using
paravirt, and not when running on bare metal.

There is no good reason to do things differently for
each setup. Make them all the same.

Currently get_user_pages_fast synchronizes against page table
freeing in two different ways:
- on bare metal, by blocking IRQs, which block TLB flush IPIs
- on paravirt, with MMU_GATHER_RCU_TABLE_FREE

This is done because some paravirt TLB flush implementations
handle the TLB flush in the hypervisor, and will do the flush
even when the target CPU has interrupts disabled.

Always handle page table freeing with MMU_GATHER_RCU_TABLE_FREE.
Using RCU synchronization between page table freeing and get_user_pages_fast()
allows bare metal to also do TLB flushing while interrupts are disabled.

Various places in the mm do still block IRQs or disable preemption
as an implicit way to block RCU frees.

That makes it safe to use INVLPGB on AMD CPUs.
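To illustrate the ordering (a simplified sketch with invented helper
names, not the actual mm/gup.c or mm/mmu_gather.c code):

/*
 * Freeing side: with MMU_GATHER_RCU_TABLE_FREE, page table pages are
 * queued and only handed back to the allocator after an RCU grace period.
 */
static void sketch_table_free_rcu(struct rcu_head *head)
{
	struct mmu_table_batch *batch =
		container_of(head, struct mmu_table_batch, rcu);

	sketch_free_tables(batch);			/* invented helper */
}

static void sketch_remove_table(struct mmu_gather *tlb, void *table)
{
	sketch_queue_table(tlb, table);			/* invented helper */
	call_rcu(&tlb->batch->rcu, sketch_table_free_rcu);
}

/*
 * Walker side: get_user_pages_fast() walks the page tables locklessly
 * with IRQs off.  The grace period cannot complete while this runs, so
 * the tables queued above cannot be freed out from under the walk,
 * regardless of whether the TLB flush itself was an IPI, a hypercall,
 * or an INVLPGB broadcast.
 */
static long sketch_lockless_walk(struct mm_struct *mm, unsigned long start,
				 unsigned long end, struct page **pages)
{
	unsigned long flags;
	long nr;

	local_irq_save(flags);
	nr = sketch_walk_tables(mm, start, end, pages);	/* invented helper */
	local_irq_restore(flags);

	return nr;
}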

Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
---
 arch/x86/Kconfig           |  2 +-
 arch/x86/kernel/paravirt.c |  7 +------
 arch/x86/mm/pgtable.c      | 16 ++++------------
 3 files changed, 6 insertions(+), 19 deletions(-)

Comments

Brendan Jackman Feb. 7, 2025, 2:28 p.m. UTC | #1
Hi Rik,

On Thu, 6 Feb 2025 at 05:45, Rik van Riel <riel@surriel.com> wrote:
> This is done because some paravirt TLB flush implementations
> handle the TLB flush in the hypervisor, and will do the flush
> even when the target CPU has interrupts disabled.
>
> Always handle page table freeing with MMU_GATHER_RCU_TABLE_FREE.
> Using RCU synchronization between page table freeing and get_user_pages_fast()
> allows bare metal to also do TLB flushing while interrupts are disabled.
>
> Various places in the mm do still block IRQs or disable preemption
> as an implicit way to block RCU frees.
>
> That makes it safe to use INVLPGB on AMD CPUs.

It would be nice to update the CONFIG_MMU_GATHER_RCU_TABLE_FREE
comment in mm/mmu_gather.c to mention INVLPGB alongside "Architectures
that do not have this (PPC)" - and while that's being updated it would
also be useful to note down the paravirt thing you explained above,
IMO it's pretty helpful to have more examples of the concrete usecases
for this logic.
Peter Zijlstra Feb. 11, 2025, 11:07 a.m. UTC | #2
On Fri, Feb 07, 2025 at 03:28:31PM +0100, Brendan Jackman wrote:
> Hi Rik,
> 
> On Thu, 6 Feb 2025 at 05:45, Rik van Riel <riel@surriel.com> wrote:
> > This is done because some paravirt TLB flush implementations
> > handle the TLB flush in the hypervisor, and will do the flush
> > even when the target CPU has interrupts disabled.
> >
> > Always handle page table freeing with MMU_GATHER_RCU_TABLE_FREE.
> > Using RCU synchronization between page table freeing and get_user_pages_fast()
> > allows bare metal to also do TLB flushing while interrupts are disabled.
> >
> > Various places in the mm do still block IRQs or disable preemption
> > as an implicit way to block RCU frees.
> >
> > That makes it safe to use INVLPGB on AMD CPUs.
> 
> It would be nice to update the CONFIG_MMU_GATHER_RCU_TABLE_FREE
> comment in mm/mmu_gather.c to mention INVLPGB alongside "Architectures
> that do not have this (PPC)" 

Why? This is just one more architecture that does broadcast.

> - and while that's being updated it would
> also be useful to note down the paravirt thing you explained above,
> IMO it's pretty helpful to have more examples of the concrete usecases
> for this logic.

Look at kvm_flush_tlb_multi() if you're interested. The notable detail
is that it avoids flushing TLB for vCPUs that are preempted, and instead
lets the vCPU resume code do the invalidate.
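
Roughly (a sketch from memory, not the exact arch/x86/kernel/kvm.c code),
the guest side does something like:

static void sketch_pv_flush_tlb_multi(struct cpumask *flushmask,
				      const struct flush_tlb_info *info)
{
	int cpu;

	for_each_cpu(cpu, flushmask) {
		struct kvm_steal_time *st = &per_cpu(steal_time, cpu);
		u8 state = READ_ONCE(st->preempted);

		/*
		 * A preempted vCPU isn't running, so don't IPI it.  Mark
		 * KVM_VCPU_FLUSH_TLB in its steal-time area instead; the
		 * host flushes for it when the vCPU gets scheduled back in.
		 */
		if ((state & KVM_VCPU_PREEMPTED) &&
		    try_cmpxchg(&st->preempted, &state,
				state | KVM_VCPU_FLUSH_TLB))
			__cpumask_clear_cpu(cpu, flushmask);
	}

	/* IPI-flush whatever is still running. */
	native_flush_tlb_multi(flushmask, info);
}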
Brendan Jackman Feb. 11, 2025, 12:10 p.m. UTC | #3
On Tue, 11 Feb 2025 at 12:07, Peter Zijlstra <peterz@infradead.org> wrote:
> > It would be nice to update the CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > comment in mm/mmu_gather.c to mention INVLPG alongside "Architectures
> > that do not have this (PPC)"
>
> Why? This is just one more architecture that does broadcast.

Hmm yeah, I didn't really make the leap from "doesn't do an IPI" to
"that just means it uses broadcast TLB flush". In that light I can see
how it's "just another architecture".

I do think it would make sense to be more explicit about that, even
though it seems obvious now you point it out. But it's not really
relevant to this patchset.

> > - and while that's being updated it would
> > also be useful to note down the paravirt thing you explained above,
> > IMO it's pretty helpful to have more examples of the concrete usecases
> > for this logic.
>
> Look at kvm_flush_tlb_multi() if you're interested. The notable detail
> is that it avoids flushing TLB for vCPUs that are preempted, and instead
> lets the vCPU resume code do the invalidate.

Oh that actually looks like a slightly different case from what Rik mentioned?

> some paravirt TLB flush implementations
> handle the TLB flush in the hypervisor, and will do the flush
> even when the target CPU has interrupts disabled.

Do we have

a) Systems where the flush gets completely pushed into the hypervisor
- xen_flush_tlb_multi()?

b) Systems where the guest coordinates with the hypervisor to avoid
IPI-ing preempted vCPUs?

Maybe I can send a separate patch to note this in the commentary, it's
interesting and useful to know.
Rik van Riel Feb. 11, 2025, 8:23 p.m. UTC | #4
On Tue, 2025-02-11 at 13:10 +0100, Brendan Jackman wrote:
> 
> Maybe I can send a separate patch to note this in the commentary,
> it's interesting and useful to know.
> 

I added a similar comment to this patch series
for v10 yesterday. I'll send it out after a few
more tests have finished running.

I'm not particularly invested in which version
of the comment gets merged :)

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9d7bd0ae48c4..e8743f8c9fd0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -274,7 +274,7 @@  config X86
 	select HAVE_PCI
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
-	select MMU_GATHER_RCU_TABLE_FREE	if PARAVIRT
+	select MMU_GATHER_RCU_TABLE_FREE
 	select MMU_GATHER_MERGE_VMAS
 	select HAVE_POSIX_CPU_TIMERS_TASK_WORK
 	select HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index fec381533555..2b78a6b466ed 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -59,11 +59,6 @@  void __init native_pv_lock_init(void)
 		static_branch_enable(&virt_spin_lock_key);
 }
 
-static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
-	tlb_remove_page(tlb, table);
-}
-
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
 
@@ -191,7 +186,7 @@  struct paravirt_patch_template pv_ops = {
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
 	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
 	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
-	.mmu.tlb_remove_table	= native_tlb_remove_table,
+	.mmu.tlb_remove_table	= tlb_remove_table,
 
 	.mmu.exit_mmap		= paravirt_nop,
 	.mmu.notify_page_enc_status_changed	= paravirt_nop,
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 5745a354a241..3dc4af1f7868 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -18,14 +18,6 @@  EXPORT_SYMBOL(physical_mask);
 #define PGTABLE_HIGHMEM 0
 #endif
 
-#ifndef CONFIG_PARAVIRT
-static inline
-void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
-	tlb_remove_page(tlb, table);
-}
-#endif
-
 gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;
 
 pgtable_t pte_alloc_one(struct mm_struct *mm)
@@ -54,7 +46,7 @@  void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 {
 	pagetable_pte_dtor(page_ptdesc(pte));
 	paravirt_release_pte(page_to_pfn(pte));
-	paravirt_tlb_remove_table(tlb, pte);
+	tlb_remove_table(tlb, pte);
 }
 
 #if CONFIG_PGTABLE_LEVELS > 2
@@ -70,7 +62,7 @@  void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 	tlb->need_flush_all = 1;
 #endif
 	pagetable_pmd_dtor(ptdesc);
-	paravirt_tlb_remove_table(tlb, ptdesc_page(ptdesc));
+	tlb_remove_table(tlb, ptdesc_page(ptdesc));
 }
 
 #if CONFIG_PGTABLE_LEVELS > 3
@@ -80,14 +72,14 @@  void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 
 	pagetable_pud_dtor(ptdesc);
 	paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
-	paravirt_tlb_remove_table(tlb, virt_to_page(pud));
+	tlb_remove_table(tlb, virt_to_page(pud));
 }
 
 #if CONFIG_PGTABLE_LEVELS > 4
 void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
 {
 	paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
-	paravirt_tlb_remove_table(tlb, virt_to_page(p4d));
+	tlb_remove_table(tlb, virt_to_page(p4d));
 }
 #endif	/* CONFIG_PGTABLE_LEVELS > 4 */
 #endif	/* CONFIG_PGTABLE_LEVELS > 3 */