[00/12] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64

Message ID 1535645747-9823-1-git-send-email-will.deacon@arm.com

Message

Will Deacon Aug. 30, 2018, 4:15 p.m. UTC
Hello again,

This is v1 of the RFC I previously posted here:

  http://lists.infradead.org/pipermail/linux-arm-kernel/2018-August/597821.html

The main changes include:

  * Rewrite the comment in tlbflush.h to explain the various functions
    and justify the barrier semantics

  * Fix the "flush entire ASID" heuristic to work with !4K page sizes

  * Fix the build on sh (well, it fails somewhere else that isn't my fault)

  * Report PxD_SHIFT instead of PxD_SIZE via tlb_get_unmap_shift()

It's also had a lot more testing, but has held up nicely so far on arm64.
I haven't figured out how to merge this yet, but I'll probably end up pulling
the core changes out onto a separate branch.

Cheers,

Will

--->8

Peter Zijlstra (1):
  asm-generic/tlb: Track freeing of page-table directories in struct
    mmu_gather

Will Deacon (11):
  arm64: tlb: Use last-level invalidation in flush_tlb_kernel_range()
  arm64: tlb: Add DSB ISHST prior to TLBI in
    __flush_tlb_[kernel_]pgtable()
  arm64: pgtable: Implement p[mu]d_valid() and check in set_p[mu]d()
  arm64: tlb: Justify non-leaf invalidation in flush_tlb_range()
  arm64: tlbflush: Allow stride to be specified for __flush_tlb_range()
  arm64: tlb: Remove redundant !CONFIG_HAVE_RCU_TABLE_FREE code
  asm-generic/tlb: Guard with #ifdef CONFIG_MMU
  asm-generic/tlb: Track which levels of the page tables have been
    cleared
  arm64: tlb: Adjust stride and type of TLBI according to mmu_gather
  arm64: tlb: Avoid synchronous TLBIs when freeing page tables
  arm64: tlb: Rewrite stale comment in asm/tlbflush.h

 arch/arm64/Kconfig                |   1 +
 arch/arm64/include/asm/pgtable.h  |  10 +++-
 arch/arm64/include/asm/tlb.h      |  34 +++++-------
 arch/arm64/include/asm/tlbflush.h | 112 ++++++++++++++++++++++++--------------
 include/asm-generic/tlb.h         |  85 +++++++++++++++++++++++++----
 mm/memory.c                       |   4 +-
 6 files changed, 168 insertions(+), 78 deletions(-)
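
To make the series' central idea concrete: the unmap path records the lowest
page-table level it cleared, and tlb_get_unmap_shift() reports that level's
shift so the arch flush can step through the range at that granule (and, when
no tables were freed, stick to leaf-only invalidation). The following is a
minimal sketch of that helper; the bitfield layout and the 4K-page, four-level
shift values are assumptions made for illustration, not code taken from the
patches.

/* Shift values assumed for a 4K-page, four-level configuration. */
#define PAGE_SHIFT	12
#define PMD_SHIFT	21
#define PUD_SHIFT	30
#define P4D_SHIFT	39

/* Illustrative stand-in for the level-tracking bits in struct mmu_gather. */
struct mmu_gather_sketch {
	unsigned int cleared_ptes : 1;	/* last-level entries were cleared */
	unsigned int cleared_pmds : 1;	/* PMD entries (or 2M blocks) were cleared */
	unsigned int cleared_puds : 1;	/* PUD entries (or 1G blocks) were cleared */
	unsigned int cleared_p4ds : 1;	/* P4D entries were cleared */
};

/*
 * Report the shift of the smallest granule that was unmapped, so the flush
 * can use (1UL << shift) as its stride instead of always invalidating page
 * by page.
 */
static inline unsigned int tlb_get_unmap_shift(const struct mmu_gather_sketch *tlb)
{
	if (tlb->cleared_ptes)
		return PAGE_SHIFT;
	if (tlb->cleared_pmds)
		return PMD_SHIFT;
	if (tlb->cleared_puds)
		return PUD_SHIFT;
	return P4D_SHIFT;
}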

Comments

Linus Torvalds Aug. 30, 2018, 4:39 p.m. UTC | #1
On Thu, Aug 30, 2018 at 9:15 AM Will Deacon <will.deacon@arm.com> wrote:
>
> It's also had a lot more testing, but has held up nicely so far on arm64.
> I haven't figured out how to merge this yet, but I'll probably end up pulling
> the core changes out onto a separate branch.

This looks fine, and I'm actually ok getting the core changes through
the arm64 branch, since this has been discussed across architectures,
and I think "whoever does the work gets to drive the car".

After all, a lot of the core changes originally came from x86 people
(pretty much all of it, historically). No reason why arm64 can't get
some of that too.

But with the glory comes the blame when something breaks ;)

                Linus
Peter Zijlstra Aug. 30, 2018, 5:11 p.m. UTC | #2
On Thu, Aug 30, 2018 at 05:15:34PM +0100, Will Deacon wrote:
> Peter Zijlstra (1):
>   asm-generic/tlb: Track freeing of page-table directories in struct
>     mmu_gather
> 
> Will Deacon (11):
>   arm64: tlb: Use last-level invalidation in flush_tlb_kernel_range()
>   arm64: tlb: Add DSB ISHST prior to TLBI in
>     __flush_tlb_[kernel_]pgtable()
>   arm64: pgtable: Implement p[mu]d_valid() and check in set_p[mu]d()
>   arm64: tlb: Justify non-leaf invalidation in flush_tlb_range()
>   arm64: tlbflush: Allow stride to be specified for __flush_tlb_range()
>   arm64: tlb: Remove redundant !CONFIG_HAVE_RCU_TABLE_FREE code
>   asm-generic/tlb: Guard with #ifdef CONFIG_MMU
>   asm-generic/tlb: Track which levels of the page tables have been
>     cleared
>   arm64: tlb: Adjust stride and type of TLBI according to mmu_gather
>   arm64: tlb: Avoid synchronous TLBIs when freeing page tables
>   arm64: tlb: Rewrite stale comment in asm/tlbflush.h
> 
>  arch/arm64/Kconfig                |   1 +
>  arch/arm64/include/asm/pgtable.h  |  10 +++-
>  arch/arm64/include/asm/tlb.h      |  34 +++++-------
>  arch/arm64/include/asm/tlbflush.h | 112 ++++++++++++++++++++++++--------------
>  include/asm-generic/tlb.h         |  85 +++++++++++++++++++++++++----
>  mm/memory.c                       |   4 +-
>  6 files changed, 168 insertions(+), 78 deletions(-)

These patches look good to me, thanks!

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Nicholas Piggin Aug. 31, 2018, 1 a.m. UTC | #3
On Thu, 30 Aug 2018 09:39:38 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Aug 30, 2018 at 9:15 AM Will Deacon <will.deacon@arm.com> wrote:
> >
> > It's also had a lot more testing, but has held up nicely so far on arm64.
> > I haven't figured out how to merge this yet, but I'll probably end up pulling
> > the core changes out onto a separate branch.  
> 
> This looks fine, and I'm actually ok getting the core changes through
> the arm64 branch, since this has been discussed across architectures,
> and I think "whoever does the work gets to drive the car".

Well it would help if powerpc say wanted to start using them without a
merge cycle lag. Not a huge issue because powerpc already does
reasonably well here and there's other work that can be done.

I will try to review the core changes carefully next week.

Thanks,
Nick
Linus Torvalds Aug. 31, 2018, 1:04 a.m. UTC | #4
On Thu, Aug 30, 2018 at 6:01 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Well it would help if powerpc say wanted to start using them without a
> merge cycle lag. Not a huge issue because powerpc already does
> reasonably well here and there's other work that can be done.

Sure. If somebody wants to send the generic changes I can just take
them directly to make it easier for people to work on this.

               Linus
Will Deacon Aug. 31, 2018, 9:54 a.m. UTC | #5
On Thu, Aug 30, 2018 at 06:04:12PM -0700, Linus Torvalds wrote:
> On Thu, Aug 30, 2018 at 6:01 PM Nicholas Piggin <npiggin@gmail.com> wrote:
> >
> > Well it would help if powerpc say wanted to start using them without a
> > merge cycle lag. Not a huge issue because powerpc already does
> > reasonably well here and there's other work that can be done.
> 
> Sure. If somebody wants to send the generic changes I can just take
> them directly to make it easier for people to work on this.

Tell you what: how about I stick the following patches (with Nick's and
Peter's acks) on a separate, stable branch:

  asm-generic/tlb: Track which levels of the page tables have been cleared
  asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather
  asm-generic/tlb: Guard with #ifdef CONFIG_MMU

and then anybody who needs them can just pull that in for the merge window?

Also, how would people feel about adding a MAINTAINERS entry for all the
tlb.h files? A big part of the recent "fun" was us figuring out what the
code is actually doing ("It used to do foo() but that may have changed"),
and it certainly took me the best part of a day to figure things out again.
If we're trying to do more in the generic code and less in the arch code,
it would help if we're on top of the changes in this area.

Proposal below (omitted Linus because that seems to be the pattern elsewhere
in the file and he's not going to shout at himself when things break :)
Anybody I've missed?

Will

--->8

diff --git a/MAINTAINERS b/MAINTAINERS
index a5b256b25905..7224b5618883 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9681,6 +9681,15 @@ S:	Maintained
 F:	arch/arm/boot/dts/mmp*
 F:	arch/arm/mach-mmp/
 
+MMU GATHER AND TLB INVALIDATION
+M:	Will Deacon <will.deacon@arm.com>
+M:	Nick Piggin <npiggin@gmail.com>
+M:	Peter Zijlstra <peterz@infradead.org>
+L:	linux-arch@vger.kernel.org
+S:	Maintained
+F:	include/asm-generic/tlb.h
+F:	arch/*/include/asm/tlb.h
+
 MN88472 MEDIA DRIVER
 M:	Antti Palosaari <crope@iki.fi>
 L:	linux-media@vger.kernel.org
Peter Zijlstra Aug. 31, 2018, 10:10 a.m. UTC | #6
On Fri, Aug 31, 2018 at 10:54:18AM +0100, Will Deacon wrote:

> Proposal below (omitted Linus because that seems to be the pattern elsewhere
> in the file and he's not going to shout at himself when things break :)
> Anybody I've missed?
> 
> Will
> 
> --->8
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index a5b256b25905..7224b5618883 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -9681,6 +9681,15 @@ S:	Maintained
>  F:	arch/arm/boot/dts/mmp*
>  F:	arch/arm/mach-mmp/
>  
> +MMU GATHER AND TLB INVALIDATION
> +M:	Will Deacon <will.deacon@arm.com>
> +M:	Nick Piggin <npiggin@gmail.com>
> +M:	Peter Zijlstra <peterz@infradead.org>
> +L:	linux-arch@vger.kernel.org
> +S:	Maintained
> +F:	include/asm-generic/tlb.h
> +F:	arch/*/include/asm/tlb.h
> +
>  MN88472 MEDIA DRIVER
>  M:	Antti Palosaari <crope@iki.fi>
>  L:	linux-media@vger.kernel.org

If we're going to do that (and I'm not opposed); it might make sense to
do something like the below and add:

 F:  mm/mmu_gather.c

---
 b/mm/mmu_gather.c         |  250 ++++++++++++++++++++++++++++++++++++++++++++++
 include/asm-generic/tlb.h |    2 
 mm/Makefile               |    2 
 mm/memory.c               |  247 ---------------------------------------------
 4 files changed, 253 insertions(+), 248 deletions(-)

--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -149,6 +149,8 @@ static inline void tlb_flush_mmu_tlbonly
 	__tlb_reset_range(tlb);
 }
 
+extern void tlb_flush_mmu_free(struct mmu_gather *tlb);
+
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
 					struct page *page, int page_size)
 {
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -22,7 +22,7 @@ KCOV_INSTRUMENT_mmzone.o := n
 KCOV_INSTRUMENT_vmstat.o := n
 
 mmu-y			:= nommu.o
-mmu-$(CONFIG_MMU)	:= gup.o highmem.o memory.o mincore.o \
+mmu-$(CONFIG_MMU)	:= gup.o highmem.o memory.o mmu_gather.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o \
 			   page_vma_mapped.o pagewalk.o pgtable-generic.o \
 			   rmap.o vmalloc.o
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -186,253 +186,6 @@ static void check_sync_rss_stat(struct t
 
 #endif /* SPLIT_RSS_COUNTING */
 
-#ifdef HAVE_GENERIC_MMU_GATHER
-
-static bool tlb_next_batch(struct mmu_gather *tlb)
-{
-	struct mmu_gather_batch *batch;
-
-	batch = tlb->active;
-	if (batch->next) {
-		tlb->active = batch->next;
-		return true;
-	}
-
-	if (tlb->batch_count == MAX_GATHER_BATCH_COUNT)
-		return false;
-
-	batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
-	if (!batch)
-		return false;
-
-	tlb->batch_count++;
-	batch->next = NULL;
-	batch->nr   = 0;
-	batch->max  = MAX_GATHER_BATCH;
-
-	tlb->active->next = batch;
-	tlb->active = batch;
-
-	return true;
-}
-
-void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
-				unsigned long start, unsigned long end)
-{
-	tlb->mm = mm;
-
-	/* Is it from 0 to ~0? */
-	tlb->fullmm     = !(start | (end+1));
-	tlb->need_flush_all = 0;
-	tlb->local.next = NULL;
-	tlb->local.nr   = 0;
-	tlb->local.max  = ARRAY_SIZE(tlb->__pages);
-	tlb->active     = &tlb->local;
-	tlb->batch_count = 0;
-
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
-	tlb->batch = NULL;
-#endif
-	tlb->page_size = 0;
-
-	__tlb_reset_range(tlb);
-}
-
-static void tlb_flush_mmu_free(struct mmu_gather *tlb)
-{
-	struct mmu_gather_batch *batch;
-
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
-	tlb_table_flush(tlb);
-#endif
-	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
-		free_pages_and_swap_cache(batch->pages, batch->nr);
-		batch->nr = 0;
-	}
-	tlb->active = &tlb->local;
-}
-
-void tlb_flush_mmu(struct mmu_gather *tlb)
-{
-	tlb_flush_mmu_tlbonly(tlb);
-	tlb_flush_mmu_free(tlb);
-}
-
-/* tlb_finish_mmu
- *	Called at the end of the shootdown operation to free up any resources
- *	that were required.
- */
-void arch_tlb_finish_mmu(struct mmu_gather *tlb,
-		unsigned long start, unsigned long end, bool force)
-{
-	struct mmu_gather_batch *batch, *next;
-
-	if (force)
-		__tlb_adjust_range(tlb, start, end - start);
-
-	tlb_flush_mmu(tlb);
-
-	/* keep the page table cache within bounds */
-	check_pgt_cache();
-
-	for (batch = tlb->local.next; batch; batch = next) {
-		next = batch->next;
-		free_pages((unsigned long)batch, 0);
-	}
-	tlb->local.next = NULL;
-}
-
-/* __tlb_remove_page
- *	Must perform the equivalent to __free_pte(pte_get_and_clear(ptep)), while
- *	handling the additional races in SMP caused by other CPUs caching valid
- *	mappings in their TLBs. Returns the number of free page slots left.
- *	When out of page slots we must call tlb_flush_mmu().
- *returns true if the caller should flush.
- */
-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size)
-{
-	struct mmu_gather_batch *batch;
-
-	VM_BUG_ON(!tlb->end);
-	VM_WARN_ON(tlb->page_size != page_size);
-
-	batch = tlb->active;
-	/*
-	 * Add the page and check if we are full. If so
-	 * force a flush.
-	 */
-	batch->pages[batch->nr++] = page;
-	if (batch->nr == batch->max) {
-		if (!tlb_next_batch(tlb))
-			return true;
-		batch = tlb->active;
-	}
-	VM_BUG_ON_PAGE(batch->nr > batch->max, page);
-
-	return false;
-}
-
-#endif /* HAVE_GENERIC_MMU_GATHER */
-
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
-
-/*
- * See the comment near struct mmu_table_batch.
- */
-
-/*
- * If we want tlb_remove_table() to imply TLB invalidates.
- */
-static inline void tlb_table_invalidate(struct mmu_gather *tlb)
-{
-#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
-	/*
-	 * Invalidate page-table caches used by hardware walkers. Then we still
-	 * need to RCU-sched wait while freeing the pages because software
-	 * walkers can still be in-flight.
-	 */
-	tlb_flush_mmu_tlbonly(tlb);
-#endif
-}
-
-static void tlb_remove_table_smp_sync(void *arg)
-{
-	/* Simply deliver the interrupt */
-}
-
-static void tlb_remove_table_one(void *table)
-{
-	/*
-	 * This isn't an RCU grace period and hence the page-tables cannot be
-	 * assumed to be actually RCU-freed.
-	 *
-	 * It is however sufficient for software page-table walkers that rely on
-	 * IRQ disabling. See the comment near struct mmu_table_batch.
-	 */
-	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
-	__tlb_remove_table(table);
-}
-
-static void tlb_remove_table_rcu(struct rcu_head *head)
-{
-	struct mmu_table_batch *batch;
-	int i;
-
-	batch = container_of(head, struct mmu_table_batch, rcu);
-
-	for (i = 0; i < batch->nr; i++)
-		__tlb_remove_table(batch->tables[i]);
-
-	free_page((unsigned long)batch);
-}
-
-void tlb_table_flush(struct mmu_gather *tlb)
-{
-	struct mmu_table_batch **batch = &tlb->batch;
-
-	if (*batch) {
-		tlb_table_invalidate(tlb);
-		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
-		*batch = NULL;
-	}
-}
-
-void tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
-	struct mmu_table_batch **batch = &tlb->batch;
-
-	if (*batch == NULL) {
-		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
-		if (*batch == NULL) {
-			tlb_table_invalidate(tlb);
-			tlb_remove_table_one(table);
-			return;
-		}
-		(*batch)->nr = 0;
-	}
-
-	(*batch)->tables[(*batch)->nr++] = table;
-	if ((*batch)->nr == MAX_TABLE_BATCH)
-		tlb_table_flush(tlb);
-}
-
-#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
-
-/**
- * tlb_gather_mmu - initialize an mmu_gather structure for page-table tear-down
- * @tlb: the mmu_gather structure to initialize
- * @mm: the mm_struct of the target address space
- * @start: start of the region that will be removed from the page-table
- * @end: end of the region that will be removed from the page-table
- *
- * Called to initialize an (on-stack) mmu_gather structure for page-table
- * tear-down from @mm. The @start and @end are set to 0 and -1
- * respectively when @mm is without users and we're going to destroy
- * the full address space (exit/execve).
- */
-void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
-			unsigned long start, unsigned long end)
-{
-	arch_tlb_gather_mmu(tlb, mm, start, end);
-	inc_tlb_flush_pending(tlb->mm);
-}
-
-void tlb_finish_mmu(struct mmu_gather *tlb,
-		unsigned long start, unsigned long end)
-{
-	/*
-	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
-	 */
-	bool force = mm_tlb_flush_nested(tlb->mm);
-
-	arch_tlb_finish_mmu(tlb, start, end, force);
-	dec_tlb_flush_pending(tlb->mm);
-}
-
 /*
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
--- /dev/null
+++ b/mm/mmu_gather.c
@@ -0,0 +1,250 @@
+#include <linux/smp.h>
+#include "asm/tlb.h"
+
+#ifdef HAVE_GENERIC_MMU_GATHER
+
+static bool tlb_next_batch(struct mmu_gather *tlb)
+{
+	struct mmu_gather_batch *batch;
+
+	batch = tlb->active;
+	if (batch->next) {
+		tlb->active = batch->next;
+		return true;
+	}
+
+	if (tlb->batch_count == MAX_GATHER_BATCH_COUNT)
+		return false;
+
+	batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
+	if (!batch)
+		return false;
+
+	tlb->batch_count++;
+	batch->next = NULL;
+	batch->nr   = 0;
+	batch->max  = MAX_GATHER_BATCH;
+
+	tlb->active->next = batch;
+	tlb->active = batch;
+
+	return true;
+}
+
+void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+				unsigned long start, unsigned long end)
+{
+	tlb->mm = mm;
+
+	/* Is it from 0 to ~0? */
+	tlb->fullmm     = !(start | (end+1));
+	tlb->need_flush_all = 0;
+	tlb->local.next = NULL;
+	tlb->local.nr   = 0;
+	tlb->local.max  = ARRAY_SIZE(tlb->__pages);
+	tlb->active     = &tlb->local;
+	tlb->batch_count = 0;
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb->batch = NULL;
+#endif
+	tlb->page_size = 0;
+
+	__tlb_reset_range(tlb);
+}
+
+void tlb_flush_mmu_free(struct mmu_gather *tlb)
+{
+	struct mmu_gather_batch *batch;
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb_table_flush(tlb);
+#endif
+	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
+		free_pages_and_swap_cache(batch->pages, batch->nr);
+		batch->nr = 0;
+	}
+	tlb->active = &tlb->local;
+}
+
+void tlb_flush_mmu(struct mmu_gather *tlb)
+{
+	tlb_flush_mmu_tlbonly(tlb);
+	tlb_flush_mmu_free(tlb);
+}
+
+/* tlb_finish_mmu
+ *	Called at the end of the shootdown operation to free up any resources
+ *	that were required.
+ */
+void arch_tlb_finish_mmu(struct mmu_gather *tlb,
+		unsigned long start, unsigned long end, bool force)
+{
+	struct mmu_gather_batch *batch, *next;
+
+	if (force)
+		__tlb_adjust_range(tlb, start, end - start);
+
+	tlb_flush_mmu(tlb);
+
+	/* keep the page table cache within bounds */
+	check_pgt_cache();
+
+	for (batch = tlb->local.next; batch; batch = next) {
+		next = batch->next;
+		free_pages((unsigned long)batch, 0);
+	}
+	tlb->local.next = NULL;
+}
+
+/* __tlb_remove_page
+ *	Must perform the equivalent to __free_pte(pte_get_and_clear(ptep)), while
+ *	handling the additional races in SMP caused by other CPUs caching valid
+ *	mappings in their TLBs. Returns the number of free page slots left.
+ *	When out of page slots we must call tlb_flush_mmu().
+ *returns true if the caller should flush.
+ */
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size)
+{
+	struct mmu_gather_batch *batch;
+
+	VM_BUG_ON(!tlb->end);
+	VM_WARN_ON(tlb->page_size != page_size);
+
+	batch = tlb->active;
+	/*
+	 * Add the page and check if we are full. If so
+	 * force a flush.
+	 */
+	batch->pages[batch->nr++] = page;
+	if (batch->nr == batch->max) {
+		if (!tlb_next_batch(tlb))
+			return true;
+		batch = tlb->active;
+	}
+	VM_BUG_ON_PAGE(batch->nr > batch->max, page);
+
+	return false;
+}
+
+#endif /* HAVE_GENERIC_MMU_GATHER */
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+
+/*
+ * See the comment near struct mmu_table_batch.
+ */
+
+/*
+ * If we want tlb_remove_table() to imply TLB invalidates.
+ */
+static inline void tlb_table_invalidate(struct mmu_gather *tlb)
+{
+#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
+	/*
+	 * Invalidate page-table caches used by hardware walkers. Then we still
+	 * need to RCU-sched wait while freeing the pages because software
+	 * walkers can still be in-flight.
+	 */
+	tlb_flush_mmu_tlbonly(tlb);
+#endif
+}
+
+static void tlb_remove_table_smp_sync(void *arg)
+{
+	/* Simply deliver the interrupt */
+}
+
+static void tlb_remove_table_one(void *table)
+{
+	/*
+	 * This isn't an RCU grace period and hence the page-tables cannot be
+	 * assumed to be actually RCU-freed.
+	 *
+	 * It is however sufficient for software page-table walkers that rely on
+	 * IRQ disabling. See the comment near struct mmu_table_batch.
+	 */
+	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+	__tlb_remove_table(table);
+}
+
+static void tlb_remove_table_rcu(struct rcu_head *head)
+{
+	struct mmu_table_batch *batch;
+	int i;
+
+	batch = container_of(head, struct mmu_table_batch, rcu);
+
+	for (i = 0; i < batch->nr; i++)
+		__tlb_remove_table(batch->tables[i]);
+
+	free_page((unsigned long)batch);
+}
+
+void tlb_table_flush(struct mmu_gather *tlb)
+{
+	struct mmu_table_batch **batch = &tlb->batch;
+
+	if (*batch) {
+		tlb_table_invalidate(tlb);
+		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
+		*batch = NULL;
+	}
+}
+
+void tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	struct mmu_table_batch **batch = &tlb->batch;
+
+	if (*batch == NULL) {
+		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
+		if (*batch == NULL) {
+			tlb_table_invalidate(tlb);
+			tlb_remove_table_one(table);
+			return;
+		}
+		(*batch)->nr = 0;
+	}
+
+	(*batch)->tables[(*batch)->nr++] = table;
+	if ((*batch)->nr == MAX_TABLE_BATCH)
+		tlb_table_flush(tlb);
+}
+
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+
+/**
+ * tlb_gather_mmu - initialize an mmu_gather structure for page-table tear-down
+ * @tlb: the mmu_gather structure to initialize
+ * @mm: the mm_struct of the target address space
+ * @start: start of the region that will be removed from the page-table
+ * @end: end of the region that will be removed from the page-table
+ *
+ * Called to initialize an (on-stack) mmu_gather structure for page-table
+ * tear-down from @mm. The @start and @end are set to 0 and -1
+ * respectively when @mm is without users and we're going to destroy
+ * the full address space (exit/execve).
+ */
+void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+			unsigned long start, unsigned long end)
+{
+	arch_tlb_gather_mmu(tlb, mm, start, end);
+	inc_tlb_flush_pending(tlb->mm);
+}
+
+void tlb_finish_mmu(struct mmu_gather *tlb,
+		unsigned long start, unsigned long end)
+{
+	/*
+	 * If there are parallel threads are doing PTE changes on same range
+	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
+	 * flush by batching, a thread has stable TLB entry can fail to flush
+	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
+	 * forcefully if we detect parallel PTE batching threads.
+	 */
+	bool force = mm_tlb_flush_nested(tlb->mm);
+
+	arch_tlb_finish_mmu(tlb, start, end, force);
+	dec_tlb_flush_pending(tlb->mm);
+}
+
Nicholas Piggin Aug. 31, 2018, 10:32 a.m. UTC | #7
On Fri, 31 Aug 2018 12:10:14 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Aug 31, 2018 at 10:54:18AM +0100, Will Deacon wrote:
> 
> > Proposal below (omitted Linus because that seems to be the pattern elsewhere
> > in the file and he's not going to shout at himself when things break :)
> > Anybody I've missed?
> > 
> > Will
> >   
> > --->8  
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index a5b256b25905..7224b5618883 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -9681,6 +9681,15 @@ S:	Maintained
> >  F:	arch/arm/boot/dts/mmp*
> >  F:	arch/arm/mach-mmp/
> >  
> > +MMU GATHER AND TLB INVALIDATION
> > +M:	Will Deacon <will.deacon@arm.com>
> > +M:	Nick Piggin <npiggin@gmail.com>

Oh gee, I suppose. powerpc hash is kind of interesting because it's
crazy, Aneesh knows that code a lot better than I do. radix modulo
some minor details of exact instructions is fairly like x86 (he 
wrote a lot of that code too AFAIK).

> > +M:	Peter Zijlstra <peterz@infradead.org>
> > +L:	linux-arch@vger.kernel.org

Maybe put linux-mm as well? Or should there just be one list?

> > +S:	Maintained
> > +F:	include/asm-generic/tlb.h
> > +F:	arch/*/include/asm/tlb.h
> > +
> >  MN88472 MEDIA DRIVER
> >  M:	Antti Palosaari <crope@iki.fi>
> >  L:	linux-media@vger.kernel.org  
> 
> If we're going to do that (and I'm not opposed); it might make sense to
> do something like the below and add:
> 
>  F:  mm/mmu_gather.c

I think that is a good idea regardless. How do you feel about calling it
tlb.c? Easier to type and autocompletes sooner.

> 
> ---
>  b/mm/mmu_gather.c         |  250 ++++++++++++++++++++++++++++++++++++++++++++++
>  include/asm-generic/tlb.h |    2 
>  mm/Makefile               |    2 
>  mm/memory.c               |  247 ---------------------------------------------
Peter Zijlstra Aug. 31, 2018, 10:49 a.m. UTC | #8
On Fri, Aug 31, 2018 at 08:32:34PM +1000, Nicholas Piggin wrote:
> Oh gee, I suppose. powerpc hash is kind of interesting because it's
> crazy, Aneesh knows that code a lot better than I do. radix modulo
> some minor details of exact instructions is fairly like x86 

The whole TLB broadcast vs explicit IPIs is a fairly big difference in
my book.

Anyway, have you guys tried the explicit IPI approach? Depending on how
IPIs are routed vs broadcasts it might save a little bus traffic. No
point in getting all CPUs to process the TLBI when there's only a handful
that really need it.

OTOH, I suppose the broadcast thing has been optimized to death on the
hardware side, so who knows..
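
For reference, a rough sketch (in the spirit of the x86 model) of what such an
explicit-IPI shootdown looks like: interrupt only the CPUs that have run the
mm and have each of them invalidate locally. This is purely illustrative and
not code from this series; local_flush_tlb_mm() stands in for whatever
per-CPU invalidation a given architecture provides.

#include <linux/smp.h>
#include <linux/mm_types.h>
#include <asm/tlbflush.h>	/* assumed to supply local_flush_tlb_mm() */

static void ipi_flush_one(void *info)
{
	struct mm_struct *mm = info;

	/* Runs on each interrupted CPU: drop its local TLB entries for this mm. */
	local_flush_tlb_mm(mm);
}

static void shootdown_by_ipi(struct mm_struct *mm)
{
	/*
	 * Only CPUs in mm_cpumask() can be caching translations for this mm,
	 * so nothing else on the machine needs to be disturbed. A broadcast
	 * TLBI, by contrast, is observed by every agent in the shareability
	 * domain, which is also what lets it reach SMMUs/GPUs that an IPI
	 * cannot.
	 */
	on_each_cpu_mask(mm_cpumask(mm), ipi_flush_one, mm, true);
}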
Will Deacon Aug. 31, 2018, 11:12 a.m. UTC | #9
On Fri, Aug 31, 2018 at 12:49:45PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 31, 2018 at 08:32:34PM +1000, Nicholas Piggin wrote:
> > Oh gee, I suppose. powerpc hash is kind of interesting because it's
> > crazy, Aneesh knows that code a lot better than I do. radix modulo
> > some minor details of exact instructions is fairly like x86 
> 
> The whole TLB broadcast vs explicit IPIs is a fairly big difference in
> my book.
> 
> Anyway, have you guys tried the explicit IPI approach? Depending on how
> IPIs are routed vs broadcasts it might save a little bus traffic. No
> point in getting all CPUs to process the TLBI when there's only a hand
> full that really need it.
> 
> OTOH, I suppose the broadcast thing has been optimized to death on the
> hardware side, so who knows..

You also can't IPI an IOMMU or a GPU ;)

Will
Peter Zijlstra Aug. 31, 2018, 11:20 a.m. UTC | #10
On Fri, Aug 31, 2018 at 12:12:48PM +0100, Will Deacon wrote:
> On Fri, Aug 31, 2018 at 12:49:45PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 31, 2018 at 08:32:34PM +1000, Nicholas Piggin wrote:
> > > Oh gee, I suppose. powerpc hash is kind of interesting because it's
> > > crazy, Aneesh knows that code a lot better than I do. radix modulo
> > > some minor details of exact instructions is fairly like x86 
> > 
> > The whole TLB broadcast vs explicit IPIs is a fairly big difference in
> > my book.
> > 
> > Anyway, have you guys tried the explicit IPI approach? Depending on how
> > IPIs are routed vs broadcasts it might save a little bus traffic. No
> > point in getting all CPUs to process the TLBI when there's only a hand
> > full that really need it.
> > 
> > OTOH, I suppose the broadcast thing has been optimized to death on the
> > hardware side, so who knows..
> 
> You also can't IPI an IOMMU or a GPU ;)

Oh, right you are. I suppose that is why x86-iommu is using those mmu_notifiers.
Nicholas Piggin Aug. 31, 2018, 11:50 a.m. UTC | #11
On Fri, 31 Aug 2018 12:49:45 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Aug 31, 2018 at 08:32:34PM +1000, Nicholas Piggin wrote:
> > Oh gee, I suppose. powerpc hash is kind of interesting because it's
> > crazy, Aneesh knows that code a lot better than I do. radix modulo
> > some minor details of exact instructions is fairly like x86   
> 
> The whole TLB broadcast vs explicit IPIs is a fairly big difference in
> my book.

That's true I guess. Maybe arm64 is closer.

> Anyway, have you guys tried the explicit IPI approach? Depending on how
> IPIs are routed vs broadcasts it might save a little bus traffic. No
> point in getting all CPUs to process the TLBI when there's only a hand
> full that really need it.

It has been looked at now and again; there are a lot of variables to
weigh. Things are also sized and spec'd to cover various
hypervisors, OSes, hash and radix, etc. This is something we need to
evaluate on radix a bit better.

> 
> OTOH, I suppose the broadcast thing has been optimized to death on the
> hardware side, so who knows..

There are some advantages to doing it in hardware, and some to doing
IPIs, too. The "problem" is that Linux is actually well optimised, so it
can be hard to notice much impact until you get to big systems. At
least I don't know of any problem workloads outside micro-benchmarks or
stress tests.

Thanks,
Nick
Will Deacon Sept. 3, 2018, 12:52 p.m. UTC | #12
On Fri, Aug 31, 2018 at 08:32:34PM +1000, Nicholas Piggin wrote:
> On Fri, 31 Aug 2018 12:10:14 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, Aug 31, 2018 at 10:54:18AM +0100, Will Deacon wrote:
> > 
> > > Proposal below (omitted Linus because that seems to be the pattern elsewhere
> > > in the file and he's not going to shout at himself when things break :)
> > > Anybody I've missed?
> > > 
> > > Will
> > >   
> > > --->8  
> > > 
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index a5b256b25905..7224b5618883 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -9681,6 +9681,15 @@ S:	Maintained
> > >  F:	arch/arm/boot/dts/mmp*
> > >  F:	arch/arm/mach-mmp/
> > >  
> > > +MMU GATHER AND TLB INVALIDATION
> > > +M:	Will Deacon <will.deacon@arm.com>
> > > +M:	Nick Piggin <npiggin@gmail.com>
> 
> Oh gee, I suppose. powerpc hash is kind of interesting because it's
> crazy, Aneesh knows that code a lot better than I do. radix modulo
> some minor details of exact instructions is fairly like x86 (he 
> wrote a lot of that code too AFAIK).

Sure, as long as we have Power represented here. Would you rather add Aneesh
instead of yourself, or shall we just add you both?

> 
> > > +M:	Peter Zijlstra <peterz@infradead.org>
> > > +L:	linux-arch@vger.kernel.org
> 
> Maybe put linux-mm as well? Or should there just be one list?

If we do the landgrab on mmu_gather (which I think makes sense), then adding
both lists makes sense to me. I'll spin this as a proper patch, along with
Peter's code move.

> > > +S:	Maintained
> > > +F:	include/asm-generic/tlb.h
> > > +F:	arch/*/include/asm/tlb.h
> > > +
> > >  MN88472 MEDIA DRIVER
> > >  M:	Antti Palosaari <crope@iki.fi>
> > >  L:	linux-media@vger.kernel.org  
> > 
> > If we're going to do that (and I'm not opposed); it might make sense to
> > do something like the below and add:
> > 
> >  F:  mm/mmu_gather.c
> 
> I think that is a good idea regardless. How do feel about calling it
> tlb.c? Easier to type and autocompletes sooner.

No strong opinion on name, but I slightly prefer mmu_gather.c so that it
avoids any remote possibility of confusion with tlb.c vs hugetlb.c.

Will