[RFC,00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64

Message ID: 1535125966-7666-1-git-send-email-will.deacon@arm.com

Message

Will Deacon Aug. 24, 2018, 3:52 p.m. UTC
Hi all,

I hacked up this RFC on the back of the recent changes to the mmu_gather
stuff in mainline. It's had a bit of testing and it looks pretty good so
far.

The main changes in the series are:

  - Avoid emitting a DSB barrier after clearing each page-table entry.
    Instead, we can have a single DSB prior to the actual TLB invalidation.

  - Batch last-level TLB invalidation until the end of the VMA, and use
    last-level-only invalidation instructions

  - Batch intermediate TLB invalidation until the end of the gather, and
    use all-level invalidation instructions

  - Adjust the stride of TLB invalidation based upon the smallest unflushed
    granule in the gather (see the sketch below)
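
For concreteness, the last-level and stride changes give the arm64 range
flush roughly the shape below. This is only a sketch, not the exact patch:
it assumes the usual ASID(), __TLBI_VADDR(), __tlbi() and dsb() helpers
from the arm64 headers, and it leaves out the KPTI user-variant TLBIs and
the fall-back to a full flush for very large ranges.

/*
 * Sketch: invalidate [start, end) by VA, stepping by 'stride' (PAGE_SIZE,
 * PMD_SIZE, ...) and using leaf-only TLBI instructions when no page-table
 * pages were freed. A single DSB is enough to order all of the earlier
 * PTE clears against the invalidation.
 */
static inline void __flush_tlb_range(struct vm_area_struct *vma,
				     unsigned long start, unsigned long end,
				     unsigned long stride, bool last_level)
{
	unsigned long asid = ASID(vma->vm_mm);
	unsigned long addr;

	start = round_down(start, stride);
	end = round_up(end, stride);

	/* TLBI operands encode the VA in units of 4k pages */
	stride >>= 12;
	start = __TLBI_VADDR(start, asid);
	end = __TLBI_VADDR(end, asid);

	dsb(ishst);
	for (addr = start; addr < end; addr += stride) {
		if (last_level)
			__tlbi(vale1is, addr);	/* last level only */
		else
			__tlbi(vae1is, addr);	/* all levels + walk cache */
	}
	dsb(ish);
}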

As a really stupid benchmark, unmapping a populated mapping of
0x4_3333_3000 bytes using munmap() takes around 20% of the time it took
before.

The core changes now track which page-table levels have been visited
by the mmu_gather since the last flush. It may be possible to use the
page_size field instead if we decide to resurrect that from its current
"debug" status, but I think I'd prefer to track the levels explicitly.

Anyway, I wanted to post this before disappearing for the long weekend
(Monday is a holiday in the UK) in the hope that it helps some of the
ongoing discussions.

Cheers,

Will

--->8

Peter Zijlstra (1):
  asm-generic/tlb: Track freeing of page-table directories in struct
    mmu_gather

Will Deacon (10):
  arm64: tlb: Use last-level invalidation in flush_tlb_kernel_range()
  arm64: tlb: Add DSB ISHST prior to TLBI in
    __flush_tlb_[kernel_]pgtable()
  arm64: pgtable: Implement p[mu]d_valid() and check in set_p[mu]d()
  arm64: tlb: Justify non-leaf invalidation in flush_tlb_range()
  arm64: tlbflush: Allow stride to be specified for __flush_tlb_range()
  arm64: tlb: Remove redundant !CONFIG_HAVE_RCU_TABLE_FREE code
  asm-generic/tlb: Guard with #ifdef CONFIG_MMU
  asm-generic/tlb: Track which levels of the page tables have been
    cleared
  arm64: tlb: Adjust stride and type of TLBI according to mmu_gather
  arm64: tlb: Avoid synchronous TLBIs when freeing page tables

 arch/arm64/Kconfig                |  1 +
 arch/arm64/include/asm/pgtable.h  | 10 ++++-
 arch/arm64/include/asm/tlb.h      | 34 +++++++----------
 arch/arm64/include/asm/tlbflush.h | 28 +++++++-------
 include/asm-generic/tlb.h         | 79 +++++++++++++++++++++++++++++++++------
 mm/memory.c                       |  4 +-
 6 files changed, 105 insertions(+), 51 deletions(-)

Comments

Linus Torvalds Aug. 24, 2018, 4:20 p.m. UTC | #1
On Fri, Aug 24, 2018 at 8:52 AM Will Deacon <will.deacon@arm.com> wrote:
>
> I hacked up this RFC on the back of the recent changes to the mmu_gather
> stuff in mainline. It's had a bit of testing and it looks pretty good so
> far.

Looks good to me.

Apart from the arm64-specific question I had, I wonder whether we need
to have that single "freed_tables" bit at all, since you wanted to
have the four individual bits for the different levels.

Even if somebody doesn't care about the individual bits, it's
generally exactly as cheap to test four bits as it is to test one, so
it seems unnecessary to have that summary bit.

             Linus
Peter Zijlstra Aug. 26, 2018, 10:56 a.m. UTC | #2
On Fri, Aug 24, 2018 at 09:20:00AM -0700, Linus Torvalds wrote:
> On Fri, Aug 24, 2018 at 8:52 AM Will Deacon <will.deacon@arm.com> wrote:
> >
> > I hacked up this RFC on the back of the recent changes to the mmu_gather
> > stuff in mainline. It's had a bit of testing and it looks pretty good so
> > far.
> 
> Looks good to me.
> 
> Apart from the arm64-specific question I had, I wonder whether we need
> to have that single "freed_tables" bit at all, since you wanted to
> have the four individual bits for the different levels.

I think so, because he also sets those size bits for things like hugetlb
and THP user-page frees, not only for page-table frees, so they're not
exactly the same.

And I think x86 could use this too; if we know we only freed 2M
pages, we can use that in flush_tlb_mm_range() to range flush in 2M
increments instead of 4K.

Something a little like so..

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index cb0a1f470980..cb0898fe9d37 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -8,10 +8,15 @@
 
 #define tlb_flush(tlb)							\
 {									\
-	if (!tlb->fullmm && !tlb->need_flush_all) 			\
-		flush_tlb_mm_range(tlb->mm, tlb->start, tlb->end, 0UL);	\
-	else								\
-		flush_tlb_mm_range(tlb->mm, 0UL, TLB_FLUSH_ALL, 0UL);	\
+	unsigned long start = 0UL, end = TLB_FLUSH_ALL;			\
+	unsigned int invl_shift = tlb_get_unmap_shift(tlb);		\
+									\
+	if (!tlb->fullmm && !tlb->need_flush_all) {			\
+		start = tlb->start;					\
+		end = tlb->end;						\
+	}								\
+									\
+	flush_tlb_mm_range(tlb->mm, start, end, invl_shift);		\
 }
 
 #include <asm-generic/tlb.h>
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 511bf5fae8b8..8ac1cac34f63 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -491,23 +491,25 @@ struct flush_tlb_info {
 	unsigned long		start;
 	unsigned long		end;
 	u64			new_tlb_gen;
+	unsigned int		invl_shift;
 };
 
 #define local_flush_tlb() __flush_tlb()
 
 #define flush_tlb_mm(mm)	flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL)
 
-#define flush_tlb_range(vma, start, end)	\
-		flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)
+#define flush_tlb_range(vma, start, end)			\
+		flush_tlb_mm_range(vma->vm_mm, start, end,	\
+				vma->vm_flags & VM_HUGETLB ? PMD_SHIFT : PAGE_SHIFT)
 
 extern void flush_tlb_all(void);
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
-				unsigned long end, unsigned long vmflag);
+				unsigned long end, unsigned int invl_shift);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
-	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, VM_NONE);
+	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT);
 }
 
 void native_flush_tlb_others(const struct cpumask *cpumask,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 752dbf4e0e50..806aa74a8fb4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -537,12 +537,12 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	    f->new_tlb_gen == mm_tlb_gen) {
 		/* Partial flush */
 		unsigned long addr;
-		unsigned long nr_pages = (f->end - f->start) >> PAGE_SHIFT;
+		unsigned long nr_pages = (f->end - f->start) >> f->invl_shift;
 
 		addr = f->start;
 		while (addr < f->end) {
 			__flush_tlb_one_user(addr);
-			addr += PAGE_SIZE;
+			addr += 1UL << f->invl_shift;
 		}
 		if (local)
 			count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, nr_pages);
@@ -653,12 +653,13 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
-				unsigned long end, unsigned long vmflag)
+				unsigned long end, unsigned int invl_shift)
 {
 	int cpu;
 
 	struct flush_tlb_info info __aligned(SMP_CACHE_BYTES) = {
 		.mm = mm,
+		.invl_shift = invl_shift,
 	};
 
 	cpu = get_cpu();
@@ -668,8 +669,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 
 	/* Should we flush just the requested range? */
 	if ((end != TLB_FLUSH_ALL) &&
-	    !(vmflag & VM_HUGETLB) &&
-	    ((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) {
+	    ((end - start) >> invl_shift) <= tlb_single_page_flush_ceiling) {
 		info.start = start;
 		info.end = end;
 	} else {
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e811ef7b8350..cdde0cdb23e7 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -175,6 +200,25 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 }
 #endif
 
+static inline unsigned long tlb_get_unmap_shift(struct mmu_gather *tlb)
+{
+	if (tlb->cleared_ptes)
+		return PAGE_SHIFT;
+	if (tlb->cleared_pmds)
+		return PMD_SHIFT;
+	if (tlb->cleared_puds)
+		return PUD_SHIFT;
+	if (tlb->cleared_p4ds)
+		return P4D_SHIFT;
+
+	return PAGE_SHIFT;
+}
+
+static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
+{
+	return 1ULL << tlb_get_unmap_shift(tlb);
+}
+
 /*
  * In the case of tlb vma handling, we can optimise these away in the
  * case where we're doing a full MM flush.  When we're doing a munmap,
Jon Masters Sept. 4, 2018, 6:38 p.m. UTC | #3
On 08/24/2018 11:52 AM, Will Deacon wrote:

> I hacked up this RFC on the back of the recent changes to the mmu_gather
> stuff in mainline. It's had a bit of testing and it looks pretty good so
> far.

I will request the server folks go and test this. You'll probably
remember a couple of parts we've seen where aggressive walker caches
ended up (correctly) seeing stale page table entries and we had all
manner of horrifically hard to debug problems. We have some fairly nice
reproducers that were able to find this last time that we can test.

Jon.
Will Deacon Sept. 5, 2018, 12:28 p.m. UTC | #4
On Tue, Sep 04, 2018 at 02:38:02PM -0400, Jon Masters wrote:
> On 08/24/2018 11:52 AM, Will Deacon wrote:
> 
> > I hacked up this RFC on the back of the recent changes to the mmu_gather
> > stuff in mainline. It's had a bit of testing and it looks pretty good so
> > far.
> 
> I will request the server folks go and test this. You'll probably
> remember a couple of parts we've seen where aggressive walker caches
> ended up (correctly) seeing stale page table entries and we had all
> manner of horrifically hard to debug problems. We have some fairly nice
> reproducers that were able to find this last time that we can test.

Cheers, Jon, that would be very helpful. You're probably best off using
my (rebasing) tlb branch rather than picking the RFC:

  git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git tlb

Let me know if you'd prefer something stable (I can tag it with a date).

Will
Jon Masters Sept. 7, 2018, 6:36 a.m. UTC | #5
On 09/05/2018 08:28 AM, Will Deacon wrote:
> On Tue, Sep 04, 2018 at 02:38:02PM -0400, Jon Masters wrote:
>> On 08/24/2018 11:52 AM, Will Deacon wrote:
>>
>>> I hacked up this RFC on the back of the recent changes to the mmu_gather
>>> stuff in mainline. It's had a bit of testing and it looks pretty good so
>>> far.
>>
>> I will request the server folks go and test this. You'll probably
>> remember a couple of parts we've seen where aggressive walker caches
>> ended up (correctly) seeing stale page table entries and we had all
>> manner of horrifically hard to debug problems. We have some fairly nice
>> reproducers that were able to find this last time that we can test.
> 
> Cheers, Jon, that would be very helpful. You're probably best off using
> my (rebasing) tlb branch rather than picking the RFC:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git tlb
> 
> Let me know if you'd prefer something stable (I can tag it with a date).

That would be useful. I've prodded each of the Arm server SoC vendors I
work with via our weekly call to have them each specifically check this.
A tag would be helpful to that effort I expect. They all claim to be
watching this thread now, so we'll see if they see cabbages here.

Jon.
Will Deacon Sept. 13, 2018, 3:53 p.m. UTC | #6
On Fri, Sep 07, 2018 at 02:36:08AM -0400, Jon Masters wrote:
> On 09/05/2018 08:28 AM, Will Deacon wrote:
> > On Tue, Sep 04, 2018 at 02:38:02PM -0400, Jon Masters wrote:
> >> On 08/24/2018 11:52 AM, Will Deacon wrote:
> >>
> >>> I hacked up this RFC on the back of the recent changes to the mmu_gather
> >>> stuff in mainline. It's had a bit of testing and it looks pretty good so
> >>> far.
> >>
> >> I will request the server folks go and test this. You'll probably
> >> remember a couple of parts we've seen where aggressive walker caches
> >> ended up (correctly) seeing stale page table entries and we had all
> >> manner of horrifically hard to debug problems. We have some fairly nice
> >> reproducers that were able to find this last time that we can test.
> > 
> > Cheers, Jon, that would be very helpful. You're probably best off using
> > my (rebasing) tlb branch rather than picking the RFC:
> > 
> >   git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git tlb
> > 
> > Let me know if you'd prefer something stable (I can tag it with a date).
> 
> That would be useful. I've prodded each of the Arm server SoC vendors I
> work with via our weekly call to have them each specifically check this.
> A tag would be helpful to that effort I expect. They all claim to be
> watching this thread now, so we'll see if they see cabbages here.

This is now all queued up in the (stable) arm64 for-next/core branch:

  https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/core

so that's the best place to grab the patches.

Will
Jon Masters Sept. 13, 2018, 4:53 p.m. UTC | #7
Tnx