
[v6,0/3] mm/mprotect: avoid unnecessary TLB flushes

Message ID 20220401180821.1986781-1-namit@vmware.com (mailing list archive)


Nadav Amit April 1, 2022, 6:08 p.m. UTC
From: Nadav Amit <namit@vmware.com>

This patch-set is intended to remove unnecessary TLB flushes during
mprotect() syscalls. Once this patch-set makes it through, similar
and further optimizations for MADV_COLD and userfaultfd would be
possible.

Basically, there are 3 optimizations in this patch-set:
1. Use TLB batching infrastructure to batch flushes across VMAs and
   do better/fewer flushes. This would also be handy for later
   userfaultfd enhancements.
2. Avoid unnecessary TLB flushes. This optimization is the one that
   provides most of the performance benefits. Unlike previous versions,
   we now only avoid flushes that would not result in spurious
   page-faults.
3. Avoiding TLB flushes on change_huge_pmd() that are only needed to
   prevent the A/D bits from changing.

Andrew asked for some benchmark numbers. I do not have a deterministic
macrobenchmark that easily shows the benefit, so I instead ran a
microbenchmark: a loop that does the following on anonymous memory,
just as a sanity check to see that time is saved by avoiding TLB
flushes. The loop goes:

	mprotect(p, PAGE_SIZE, PROT_READ)
	mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
	*p = 0; // write to the now-writable page

The test was run in KVM guest with 1 or 2 threads (the second thread
was busy-looping). I measured the time (cycles) of each operation:

		1 thread		2 threads
		mmots	+patch		mmots	+patch
PROT_READ	3494	2725 (-22%)	8630	7788 (-10%)
PROT_READ|WRITE	3952	2724 (-31%)	9075	2865 (-68%)

[ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]

The exact numbers are really meaningless, but the benefit is clear.
There are 2 interesting results though. 

(1) mprotect(PROT_READ) is cheaper, although one might expect it not to
be affected. This is presumably due to the TLB miss that is saved.

(2) Without the memory access (*p = 0), the speedup from the patches is
even greater. In that scenario mprotect(PROT_READ) also avoids the TLB
flush. As a result both operations on the patched kernel take roughly
~1500 cycles (with either 1 or 2 threads), whereas on mmots their cost
is as high as presented in the table.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org

--

v5 -> v6:
* The wrong patch 2 was sent in v5

v4 -> v5:
* Avoid only TLB flushes that would not result in spurious PF [Dave]
* Better comments, names in pte_flags_need_flush() [Dave]

v3 -> v4:
* Remove KNL-related stuff [Dave]
* Check error code sanity on every PF [Dave]
* Reduce nesting, simplify access_error() changes [Dave] 
* Remove redundant present->non-present check
* Use break instead of goto in do_mprotect_pkey()
* Add missing change_prot_numa() chunk

v2 -> v3:
* Fix order of patches (the previous order could lead to breakage)
* Better comments
* Clearer KNL detection [Dave]
* Assertion on PF error-code [Dave]
* Comments, code, function names improvements [PeterZ]
* Flush on access-bit clearing on PMD changes to follow the way
  flushing on x86 is done today in the kernel.

v1 -> v2:
* Fix wrong detection of permission demotion [Andrea]
* Better comments [Andrea]
* Handle THP [Andrea]
* Batching across VMAs [Peter Xu]
* Avoid open-coding PTE analysis
* Fix wrong use of the mmu_gather()



Nadav Amit (3):
  mm/mprotect: use mmu_gather
  mm/mprotect: do not flush when not required architecturally
  mm: avoid unnecessary flush on change_huge_pmd()

 arch/x86/include/asm/pgtable.h       |   5 ++
 arch/x86/include/asm/pgtable_types.h |   2 +
 arch/x86/include/asm/tlbflush.h      | 121 +++++++++++++++++++++++++++
 arch/x86/mm/pgtable.c                |  10 +++
 fs/exec.c                            |   6 +-
 include/asm-generic/tlb.h            |  14 ++++
 include/linux/huge_mm.h              |   5 +-
 include/linux/mm.h                   |   5 +-
 include/linux/pgtable.h              |  20 +++++
 mm/huge_memory.c                     |  19 +++--
 mm/mempolicy.c                       |   9 +-
 mm/mprotect.c                        |  93 ++++++++++----------
 mm/pgtable-generic.c                 |   8 ++
 mm/userfaultfd.c                     |   6 +-
 14 files changed, 268 insertions(+), 55 deletions(-)