
[RFC,15/20] mm: detect deferred TLB flushes in vma granularity

Message ID 20210131001132.3368247-16-namit@vmware.com (mailing list archive)
State New, archived
Series TLB batching consolidation and enhancements

Commit Message

Nadav Amit Jan. 31, 2021, 12:11 a.m. UTC
From: Nadav Amit <namit@vmware.com>

Currently, deferred TLB flushes are detected at mm granularity: if there is
any deferred TLB flush in the entire address space due to NUMA migration,
pte_accessible() on x86 returns true, and ptep_clear_flush() requires a TLB
flush. This happens even if the PTE resides in a completely different vma.
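
For reference, the current x86 check is roughly the following (a
paraphrased sketch, not the exact upstream code): a PROT_NONE PTE, as used
for NUMA migration, counts as accessible whenever any flush is pending
anywhere in the mm.

/* Paraphrased sketch of the current mm-wide check on x86 */
static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
{
	if (pte_flags(a) & _PAGE_PRESENT)
		return true;

	/* Any pending flush anywhere in the mm forces the flush path */
	if ((pte_flags(a) & _PAGE_PROTNONE) && mm_tlb_flush_pending(mm))
		return true;

	return false;
}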

Recent code changes and possible future enhancements might require
detecting deferred TLB flushes at a finer granularity. Finer-grained
detection can also enable more aggressive deferring of TLB flushes in the
future.

Record in each vma the mm's TLB generation after the last deferred PTE/PMD
change, while the page-table lock is still held. Increase the mm's
generation before recording it, to indicate that a TLB flush is pending.
Record in the mmu_gather struct the mm's TLB generation at the time the
last TLB flush was deferred.

Once the deferred TLB flush eventually takes place, use the deferred TLB
generation that was recorded in the mmu_gather. A deferred TLB flush is
detected as still pending by checking whether the mm's completed TLB
generation is lower than or equal to the recorded deferred generation.
Architectures that use the TLB generation logic are required to perform a
full TLB flush if they detect that a new TLB flush request "skips" a
generation (as the x86 code already does).
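
A simplified sketch of that check at flush time, ignoring the freed-tables
and PUD/P4D exceptions described below (the helper name is illustrative;
the actual logic is in the x86 tlb_flush() change in the diff):

/*
 * Illustrative only: decide whether the flush deferred in this mmu_gather
 * still needs to be performed.  tlb->defer_gen holds the mm's TLB
 * generation recorded when the flush was deferred.
 */
static inline bool deferred_flush_still_needed(struct mmu_gather *tlb)
{
	/*
	 * If the completed generation already moved past the deferred one,
	 * a later (full) flush has covered it and this one can be skipped.
	 */
	return atomic64_read(&tlb->mm->tlb_gen_completed) <= tlb->defer_gen;
}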

To indicate that a TLB flush was deferred, increase the mm's TLB generation
after updating the PTEs. However, avoid increasing the mm's generation
again on subsequent PTE updates, as increasing it again would lead to a
full TLB flush once the deferred TLB flushes are performed (due to the
"skipped" TLB generation). Therefore, if the mm's generation did not change
since the previous PTE update, reuse the previously recorded generation.

As the vma's generation can be updated concurrently, use atomic operations
to ensure that the TLB generation recorded in the vma is always the most
recent one.
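
Keeping only the newest generation under such races boils down to a
compare-and-exchange loop. A minimal sketch of what the update could look
like (the actual tlb_update_generation() helper is introduced in an earlier
patch of the series, so this is only an assumption of its shape):

/* Sketch: advance *gen to new_gen, but never move it backwards */
static inline void tlb_update_generation(atomic64_t *gen, u64 new_gen)
{
	u64 cur = atomic64_read(gen);

	while (cur < new_gen) {
		u64 old = atomic64_cmpxchg(gen, cur, new_gen);

		if (old == cur)		/* we won the race */
			break;
		cur = old;		/* retry, or stop if old is newer */
	}
}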

By the time a deferred TLB flush is eventually performed, it might already
be redundant: another TLB flush may have covered it (by performing a full
TLB flush upon detecting the "skipped" generation). This case can be
detected by checking whether the deferred TLB generation recorded in the
mmu_gather has already been completed. However, deferred PUD/P4D flushes
are not recorded, and freeing page tables also requires a flush on cores in
lazy TLB mode. In such cases a TLB flush is needed even if the mm's
completed TLB generation indicates the flush was already "performed".

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: x86@kernel.org
---
 arch/x86/include/asm/tlb.h      |  18 ++++--
 arch/x86/include/asm/tlbflush.h |   5 ++
 arch/x86/mm/tlb.c               |  14 ++++-
 include/asm-generic/tlb.h       | 104 ++++++++++++++++++++++++++++++--
 include/linux/mm_types.h        |  19 ++++++
 mm/mmap.c                       |   1 +
 mm/mmu_gather.c                 |   3 +
 7 files changed, 150 insertions(+), 14 deletions(-)

Comments

Nadav Amit Feb. 1, 2021, 10:04 p.m. UTC | #1
> On Jan 30, 2021, at 4:11 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> From: Nadav Amit <namit@vmware.com>
> 
> Currently, deferred TLB flushes are detected in the mm granularity: if
> there is any deferred TLB flush in the entire address space due to NUMA
> migration, pte_accessible() in x86 would return true, and
> ptep_clear_flush() would require a TLB flush. This would happen even if
> the PTE resides in a completely different vma.

[ snip ]

> +static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
> +{
> +	struct mm_struct *mm = tlb->mm;
> +	u64 mm_gen;
> +
> +	/*
> +	 * Any change of PTE before calling __track_deferred_tlb_flush() must be
> +	 * performed using RMW atomic operation that provides a memory barriers,
> +	 * such as ptep_modify_prot_start().  The barrier ensure the PTEs are
> +	 * written before the current generation is read, synchronizing
> +	 * (implicitly) with flush_tlb_mm_range().
> +	 */
> +	smp_mb__after_atomic();
> +
> +	mm_gen = atomic64_read(&mm->tlb_gen);
> +
> +	/*
> +	 * This condition checks for both first deferred TLB flush and for other
> +	 * TLB pending or executed TLB flushes after the last table that we
> +	 * updated. In the latter case, we are going to skip a generation, which
> +	 * would lead to a full TLB flush. This should therefore not cause
> +	 * correctness issues, and should not induce overheads, since anyhow in
> +	 * TLB storms it is better to perform full TLB flush.
> +	 */
> +	if (mm_gen != tlb->defer_gen) {
> +		VM_BUG_ON(mm_gen < tlb->defer_gen);
> +
> +		tlb->defer_gen = inc_mm_tlb_gen(mm);
> +	}
> +}

Andy’s comments managed to make me realize this code is wrong. We must
call inc_mm_tlb_gen(mm) every time.

Otherwise, consider a CPU that saw the old tlb_gen and updated it in its
local cpu_tlbstate on a context-switch: if the process was not running when
the TLB flush was issued, no IPI will be sent to that CPU, so a later
switch_mm_irqs_off() back to the process will not flush the local TLB.

I need to think whether there is a better solution. Multiple calls to
inc_mm_tlb_gen() during deferred flushes would trigger a full TLB flush,
instead of one that is specific to the ranges, once the flush actually
takes place. On x86 it’s practically a non-issue, since any update of more
than ~33 entries would cause a full TLB flush anyhow, but this is still
ugly.
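
A sketch of what the helper looks like with the generation advanced
unconditionally, as described above (illustrative only; the downside is
that batched deferred flushes degrade into a full flush at flush time):

static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
{
	struct mm_struct *mm = tlb->mm;

	/* Order the PTE updates before advertising a new pending generation */
	smp_mb__after_atomic();

	/*
	 * Always bump the generation: a CPU that already cached the previous
	 * tlb_gen in its cpu_tlbstate on a context-switch must not miss the
	 * newly deferred flush when it later runs switch_mm_irqs_off().
	 */
	tlb->defer_gen = inc_mm_tlb_gen(mm);
}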
Andy Lutomirski Feb. 2, 2021, 12:14 a.m. UTC | #2
> On Feb 1, 2021, at 2:04 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> 
>> 
>> On Jan 30, 2021, at 4:11 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>> 
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Currently, deferred TLB flushes are detected in the mm granularity: if
>> there is any deferred TLB flush in the entire address space due to NUMA
>> migration, pte_accessible() in x86 would return true, and
>> ptep_clear_flush() would require a TLB flush. This would happen even if
>> the PTE resides in a completely different vma.
> 
> [ snip ]
> 
>> +static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
>> +{
>> +    struct mm_struct *mm = tlb->mm;
>> +    u64 mm_gen;
>> +
>> +    /*
>> +     * Any change of PTE before calling __track_deferred_tlb_flush() must be
>> +     * performed using RMW atomic operation that provides a memory barriers,
>> +     * such as ptep_modify_prot_start().  The barrier ensure the PTEs are
>> +     * written before the current generation is read, synchronizing
>> +     * (implicitly) with flush_tlb_mm_range().
>> +     */
>> +    smp_mb__after_atomic();
>> +
>> +    mm_gen = atomic64_read(&mm->tlb_gen);
>> +
>> +    /*
>> +     * This condition checks for both first deferred TLB flush and for other
>> +     * TLB pending or executed TLB flushes after the last table that we
>> +     * updated. In the latter case, we are going to skip a generation, which
>> +     * would lead to a full TLB flush. This should therefore not cause
>> +     * correctness issues, and should not induce overheads, since anyhow in
>> +     * TLB storms it is better to perform full TLB flush.
>> +     */
>> +    if (mm_gen != tlb->defer_gen) {
>> +        VM_BUG_ON(mm_gen < tlb->defer_gen);
>> +
>> +        tlb->defer_gen = inc_mm_tlb_gen(mm);
>> +    }
>> +}
> 
> Andy’s comments managed to make me realize this code is wrong. We must
> call inc_mm_tlb_gen(mm) every time.
> 
> Otherwise, a CPU that saw the old tlb_gen and updated it in its local
> cpu_tlbstate on a context-switch. If the process was not running when the
> TLB flush was issued, no IPI will be sent to the CPU. Therefore, later
> switch_mm_irqs_off() back to the process will not flush the local TLB.
> 
> I need to think if there is a better solution. Multiple calls to
> inc_mm_tlb_gen() during deferred flushes would trigger a full TLB flush
> instead of one that is specific to the ranges, once the flush actually takes
> place. On x86 it’s practically a non-issue, since anyhow any update of more
> than 33-entries or so would cause a full TLB flush, but this is still ugly.
> 

What if we had a per-mm ring buffer of flushes?  When starting a flush, we would stick the range in the ring buffer and, when flushing, we would read the ring buffer to catch up.  This would mostly replace the flush_tlb_info struct, and it would let us process multiple partial flushes together.
Nadav Amit Feb. 2, 2021, 8:51 p.m. UTC | #3
> On Feb 1, 2021, at 4:14 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
>> On Feb 1, 2021, at 2:04 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>> 
>> Andy’s comments managed to make me realize this code is wrong. We must
>> call inc_mm_tlb_gen(mm) every time.
>> 
>> Otherwise, a CPU that saw the old tlb_gen and updated it in its local
>> cpu_tlbstate on a context-switch. If the process was not running when the
>> TLB flush was issued, no IPI will be sent to the CPU. Therefore, later
>> switch_mm_irqs_off() back to the process will not flush the local TLB.
>> 
>> I need to think if there is a better solution. Multiple calls to
>> inc_mm_tlb_gen() during deferred flushes would trigger a full TLB flush
>> instead of one that is specific to the ranges, once the flush actually takes
>> place. On x86 it’s practically a non-issue, since anyhow any update of more
>> than 33-entries or so would cause a full TLB flush, but this is still ugly.
> 
> What if we had a per-mm ring buffer of flushes?  When starting a flush, we would stick the range in the ring buffer and, when flushing, we would read the ring buffer to catch up.  This would mostly replace the flush_tlb_info struct, and it would let us process multiple partial flushes together.

I wanted to sleep on it, and went back and forth on whether it is the right
direction, hence the late response.

I think that what you say makes sense. I think I even tried to do something
similar once for some reason, but my memory plays tricks on me.

So tell me what you think of this ring-based solution. As you said, you
keep a per-mm ring of flush_tlb_info. When you queue an entry, you do
something like:

#define RING_ENTRY_INVALID (0)

  gen = inc_mm_tlb_gen(mm);
  struct flush_tlb_info *info = &mm->ring[gen % RING_SIZE];
  spin_lock(&mm->ring_lock);
  WRITE_ONCE(info->new_tlb_gen, RING_ENTRY_INVALID);
  smp_wmb();
  info->start = start;
  info->end = end;
  info->stride_shift = stride_shift;
  info->freed_tables = freed_tables;
  smp_store_release(&info->new_tlb_gen, gen);
  spin_unlock(&mm->ring_lock);
  
When you flush you use the entry generation as a sequence lock. On overflow
of the ring (i.e., sequence number mismatch) you perform a full flush:

  for (gen = mm->tlb_gen_completed; gen < mm->tlb_gen; gen++) {
	struct flush_tlb_info *info = &mm->ring[gen % RING_SIZE];

	// detect overflow and invalid entries
	if (smp_load_acquire(&info->new_tlb_gen) != gen)
		goto full_flush;

	start = min(start, info->start);
	end = max(end, info->end);
	stride_shift = min(stride_shift, info->stride_shift);
	freed_tables |= info->freed_tables;
	smp_rmb();

	// seqlock-like check that the information was not updated 
	if (READ_ONCE(info->new_tlb_gen) != gen)
		goto full_flush;
  }

On x86 I suspect that performing a full TLB flush would anyhow be the best
thing to do if there is more than a single entry. I am also not sure that it
makes sense to check the ring from flush_tlb_func_common() (i.e., in each
IPI handler) as it might cause cache thrashing.

Instead it may be better to do so from flush_tlb_mm_range(), when the
flushes are initiated, and use an aggregated flush_tlb_info for the flush.

It may also be better to make the ring arch-independent, so it would
resemble mmu_gather more closely (the parts about the TLB flush
information, without the freed-pages stuff).

We can detect deferred TLB flushes either by storing “deferred_gen” in the
page-tables/VMA (as I did) or by going over the ring, from tlb_gen_completed
to tlb_gen, and checking for an overlap. I think the page-tables approach
would be the most efficient/scalable, but going over the ring might make
the logic easier to understand (see the sketch below).

Makes sense? Thoughts?
Andy Lutomirski Feb. 4, 2021, 4:35 a.m. UTC | #4
> On Feb 2, 2021, at 12:52 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> 
>> 
>>> On Feb 1, 2021, at 4:14 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> 
>>> 
>>>> On Feb 1, 2021, at 2:04 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>>> 
>>> Andy’s comments managed to make me realize this code is wrong. We must
>>> call inc_mm_tlb_gen(mm) every time.
>>> 
>>> Otherwise, a CPU that saw the old tlb_gen and updated it in its local
>>> cpu_tlbstate on a context-switch. If the process was not running when the
>>> TLB flush was issued, no IPI will be sent to the CPU. Therefore, later
>>> switch_mm_irqs_off() back to the process will not flush the local TLB.
>>> 
>>> I need to think if there is a better solution. Multiple calls to
>>> inc_mm_tlb_gen() during deferred flushes would trigger a full TLB flush
>>> instead of one that is specific to the ranges, once the flush actually takes
>>> place. On x86 it’s practically a non-issue, since anyhow any update of more
>>> than 33-entries or so would cause a full TLB flush, but this is still ugly.
>> 
>> What if we had a per-mm ring buffer of flushes?  When starting a flush, we would stick the range in the ring buffer and, when flushing, we would read the ring buffer to catch up.  This would mostly replace the flush_tlb_info struct, and it would let us process multiple partial flushes together.
> 
> I wanted to sleep on it, and went back and forth on whether it is the right
> direction, hence the late response.
> 
> I think that what you say make sense. I think that I even tried to do once
> something similar for some reason, but my memory plays tricks on me.
> 
> So tell me what you think on this ring-based solution. As you said, you keep
> per-mm ring of flush_tlb_info. When you queue an entry, you do something
> like:
> 
> #define RING_ENTRY_INVALID (0)
> 
>  gen = inc_mm_tlb_gen(mm);
>  struct flush_tlb_info *info = mm->ring[gen % RING_SIZE];
>  spin_lock(&mm->ring_lock);

Once you are holding the lock, you should presumably check that the ring didn’t overflow while you were getting the lock — if new_tlb_gen > gen, abort.

>  WRITE_ONCE(info->new_tlb_gen, RING_ENTRY_INVALID);
>  smp_wmb();
>  info->start = start;
>  info->end = end;
>  info->stride_shift = stride_shift;
>  info->freed_tables = freed_tables;
>  smp_store_release(&info->new_tlb_gen, gen);
>  spin_unlock(&mm->ring_lock);
> 

Seems reasonable.  I’m curious how this ends up getting used.

> When you flush you use the entry generation as a sequence lock. On overflow
> of the ring (i.e., sequence number mismatch) you perform a full flush:
> 
>  for (gen = mm->tlb_gen_completed; gen < mm->tlb_gen; gen++) {
>    struct flush_tlb_info *info = &mm->ring[gen % RING_SIZE];
> 
>    // detect overflow and invalid entries
>    if (smp_load_acquire(info->new_tlb_gen) != gen)
>        goto full_flush;
> 
>    start = min(start, info->start);
>    end = max(end, info->end);
>    stride_shift = min(stride_shift, info->stride_shift);
>    freed_tables |= info.freed_tables;
>    smp_rmb();
> 
>    // seqlock-like check that the information was not updated 
>    if (READ_ONCE(info->new_tlb_gen) != gen)
>        goto full_flush;
>  }
> 
> On x86 I suspect that performing a full TLB flush would anyhow be the best
> thing to do if there is more than a single entry. I am also not sure that it
> makes sense to check the ring from flush_tlb_func_common() (i.e., in each
> IPI handler) as it might cause cache thrashing.
> 
> Instead it may be better to do so from flush_tlb_mm_range(), when the
> flushes are initiated, and use an aggregated flush_tlb_info for the flush.
> 
> It may also be better to have the ring arch-independent, so it would
> resemble more of mmu_gather (the parts about the TLB flush information,
> without the freed pages stuff).
> 
> We can detect deferred TLB flushes either by storing “deferred_gen” in the
> page-tables/VMA (as I did) or by going over the ring, from tlb_gen_completed
> to tlb_gen, and checking for an overlap. I think page-tables would be most
> efficient/scalable, but perhaps going over the ring would be easier to
> understand logic.
> 
> Makes sense? Thoughts?
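
A sketch of the enqueue path with the overflow check suggested in #4 folded
in (again assuming the hypothetical ring fields, RING_ENTRY_INVALID and
mm->ring_lock from the proposal above):

/* Queue a deferred flush; false means the ring wrapped (full flush needed) */
static bool tlb_ring_enqueue(struct mm_struct *mm, unsigned long start,
			     unsigned long end, unsigned int stride_shift,
			     bool freed_tables)
{
	u64 gen = inc_mm_tlb_gen(mm);
	struct flush_tlb_info *info = &mm->ring[gen % RING_SIZE];

	spin_lock(&mm->ring_lock);

	/* The ring wrapped past this slot while we waited for the lock */
	if (READ_ONCE(info->new_tlb_gen) > gen) {
		spin_unlock(&mm->ring_lock);
		return false;
	}

	WRITE_ONCE(info->new_tlb_gen, RING_ENTRY_INVALID);
	smp_wmb();
	info->start = start;
	info->end = end;
	info->stride_shift = stride_shift;
	info->freed_tables = freed_tables;
	smp_store_release(&info->new_tlb_gen, gen);

	spin_unlock(&mm->ring_lock);
	return true;
}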

Patch

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 580636cdc257..ecf538e6c6d5 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -9,15 +9,23 @@  static inline void tlb_flush(struct mmu_gather *tlb);
 
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
-	unsigned long start = 0UL, end = TLB_FLUSH_ALL;
 	unsigned int stride_shift = tlb_get_unmap_shift(tlb);
 
-	if (!tlb->fullmm && !tlb->need_flush_all) {
-		start = tlb->start;
-		end = tlb->end;
+	/* Perform full flush when needed */
+	if (tlb->fullmm || tlb->need_flush_all) {
+		flush_tlb_mm_range(tlb->mm, 0, TLB_FLUSH_ALL, stride_shift,
+				   tlb->freed_tables);
+		return;
 	}
 
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	/* Check if flush was already performed */
+	if (!tlb->freed_tables && !tlb->cleared_puds &&
+	    !tlb->cleared_p4ds &&
+	    atomic64_read(&tlb->mm->tlb_gen_completed) > tlb->defer_gen)
+		return;
+
+	flush_tlb_mm_range_gen(tlb->mm, tlb->start, tlb->end, stride_shift,
+			       tlb->freed_tables, tlb->defer_gen);
 }
 
 /*
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 2110b98026a7..296a00545056 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -225,6 +225,11 @@  void flush_tlb_others(const struct cpumask *cpumask,
 				: PAGE_SHIFT, false)
 
 extern void flush_tlb_all(void);
+
+extern void flush_tlb_mm_range_gen(struct mm_struct *mm, unsigned long start,
+				unsigned long end, unsigned int stride_shift,
+				bool freed_tables, u64 gen);
+
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d17b5575531e..48f4b56fc4a7 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -883,12 +883,11 @@  static inline void put_flush_tlb_info(void)
 #endif
 }
 
-void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
+void flush_tlb_mm_range_gen(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
-				bool freed_tables)
+				bool freed_tables, u64 new_tlb_gen)
 {
 	struct flush_tlb_info *info;
-	u64 new_tlb_gen;
 	int cpu;
 
 	cpu = get_cpu();
@@ -923,6 +922,15 @@  void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	put_cpu();
 }
 
+void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
+				unsigned long end, unsigned int stride_shift,
+				bool freed_tables)
+{
+	u64 new_tlb_gen = inc_mm_tlb_gen(mm);
+
+	flush_tlb_mm_range_gen(mm, start, end, stride_shift, freed_tables,
+			       new_tlb_gen);
+}
 
 static void do_flush_tlb_all(void *info)
 {
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 10690763090a..f25d2d955076 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -295,6 +295,11 @@  struct mmu_gather {
 	unsigned int		cleared_puds : 1;
 	unsigned int		cleared_p4ds : 1;
 
+	/*
+	 * Whether a TLB flush was needed for PTEs in the current table
+	 */
+	unsigned int		cleared_ptes_in_table : 1;
+
 	unsigned int		batch_count;
 
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
@@ -305,6 +310,10 @@  struct mmu_gather {
 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
 	unsigned int page_size;
 #endif
+
+#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
+	u64			defer_gen;
+#endif
 #endif
 };
 
@@ -381,7 +390,8 @@  static inline void tlb_flush(struct mmu_gather *tlb)
 #endif
 
 #if __is_defined(tlb_flush) ||						\
-	IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING)
+	IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING) ||	\
+	IS_ENABLED(CONFIG_ARCH_HAS_TLB_GENERATIONS)
 
 static inline void
 tlb_update_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
@@ -472,7 +482,8 @@  static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
  */
 static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
-	if (tlb->fullmm)
+	if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING) &&
+		       tlb->fullmm)
 		return;
 
 	tlb_update_vma(tlb, vma);
@@ -530,16 +541,87 @@  static inline void mark_mm_tlb_gen_done(struct mm_struct *mm, u64 gen)
 	tlb_update_generation(&mm->tlb_gen_completed, gen);
 }
 
-#endif /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
+static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
+{
+	struct mm_struct *mm = tlb->mm;
+	u64 mm_gen;
+
+	/*
+	 * Any change of PTE before calling __track_deferred_tlb_flush() must be
+	 * performed using an RMW atomic operation that provides a memory barrier,
+	 * such as ptep_modify_prot_start().  The barrier ensures the PTEs are
+	 * written before the current generation is read, synchronizing
+	 * (implicitly) with flush_tlb_mm_range().
+	 */
+	smp_mb__after_atomic();
+
+	mm_gen = atomic64_read(&mm->tlb_gen);
+
+	/*
+	 * This condition checks for both first deferred TLB flush and for other
+	 * TLB pending or executed TLB flushes after the last table that we
+	 * updated. In the latter case, we are going to skip a generation, which
+	 * would lead to a full TLB flush. This should therefore not cause
+	 * correctness issues, and should not induce overheads, since anyhow in
+	 * TLB storms it is better to perform full TLB flush.
+	 */
+	if (mm_gen != tlb->defer_gen) {
+		VM_BUG_ON(mm_gen < tlb->defer_gen);
+
+		tlb->defer_gen = inc_mm_tlb_gen(mm);
+	}
+}
+
+/*
+ * Store the deferred TLB generation in the VMA
+ */
+static inline void store_deferred_tlb_gen(struct mmu_gather *tlb)
+{
+	tlb_update_generation(&tlb->vma->defer_tlb_gen, tlb->defer_gen);
+}
+
+/*
+ * Track deferred TLB flushes for PTEs and PMDs to allow fine granularity checks
+ * whether a PTE is accessible. The TLB generation after the PTE is flushed is
+ * saved in the mmu_gather struct. Once a flush is performed, the generation is
+ * advanced.
+ */
+static inline void track_defer_tlb_flush(struct mmu_gather *tlb)
+{
+	if (tlb->fullmm)
+		return;
+
+	BUG_ON(!tlb->vma);
+
+	read_defer_tlb_flush_gen(tlb);
+	store_deferred_tlb_gen(tlb);
+}
+
+#define init_vma_tlb_generation(vma)				\
+	atomic64_set(&(vma)->defer_tlb_gen, 0)
+#else
+static inline void init_vma_tlb_generation(struct vm_area_struct *vma) { }
+#endif
 
 #define tlb_start_ptes(tlb)						\
 	do {								\
 		struct mmu_gather *_tlb = (tlb);			\
 									\
 		flush_tlb_batched_pending(_tlb->mm);			\
+		if (IS_ENABLED(CONFIG_ARCH_HAS_TLB_GENERATIONS))	\
+			_tlb->cleared_ptes_in_table = 0;		\
 	} while (0)
 
-static inline void tlb_end_ptes(struct mmu_gather *tlb) { }
+static inline void tlb_end_ptes(struct mmu_gather *tlb)
+{
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_TLB_GENERATIONS))
+		return;
+
+	if (tlb->cleared_ptes_in_table)
+		track_defer_tlb_flush(tlb);
+
+	tlb->cleared_ptes_in_table = 0;
+}
 
 /*
  * tlb_flush_{pte|pmd|pud|p4d}_range() adjust the tlb->start and tlb->end,
@@ -550,15 +632,25 @@  static inline void tlb_flush_pte_range(struct mmu_gather *tlb,
 {
 	__tlb_adjust_range(tlb, address, size);
 	tlb->cleared_ptes = 1;
+
+	if (IS_ENABLED(CONFIG_ARCH_HAS_TLB_GENERATIONS))
+		tlb->cleared_ptes_in_table = 1;
 }
 
-static inline void tlb_flush_pmd_range(struct mmu_gather *tlb,
+static inline void __tlb_flush_pmd_range(struct mmu_gather *tlb,
 				     unsigned long address, unsigned long size)
 {
 	__tlb_adjust_range(tlb, address, size);
 	tlb->cleared_pmds = 1;
 }
 
+static inline void tlb_flush_pmd_range(struct mmu_gather *tlb,
+				     unsigned long address, unsigned long size)
+{
+	__tlb_flush_pmd_range(tlb, address, size);
+	track_defer_tlb_flush(tlb);
+}
+
 static inline void tlb_flush_pud_range(struct mmu_gather *tlb,
 				     unsigned long address, unsigned long size)
 {
@@ -649,7 +741,7 @@  static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 #ifndef pte_free_tlb
 #define pte_free_tlb(tlb, ptep, address)			\
 	do {							\
-		tlb_flush_pmd_range(tlb, address, PAGE_SIZE);	\
+		__tlb_flush_pmd_range(tlb, address, PAGE_SIZE);	\
 		tlb->freed_tables = 1;				\
 		__pte_free_tlb(tlb, ptep, address);		\
 	} while (0)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 676795dfd5d4..bbe5d4a422f7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -367,6 +367,9 @@  struct vm_area_struct {
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+#endif
+#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
+	atomic64_t defer_tlb_gen;	/* Deferred TLB flushes generation */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 } __randomize_layout;
@@ -628,6 +631,21 @@  static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
 	return atomic_read(&mm->tlb_flush_pending);
 }
 
+#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
+static inline bool pte_tlb_flush_pending(struct vm_area_struct *vma, pte_t *pte)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	return atomic64_read(&vma->defer_tlb_gen) < atomic64_read(&mm->tlb_gen_completed);
+}
+
+static inline bool pmd_tlb_flush_pending(struct vm_area_struct *vma, pmd_t *pmd)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	return atomic64_read(&vma->defer_tlb_gen) < atomic64_read(&mm->tlb_gen_completed);
+}
+#else /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
 static inline bool pte_tlb_flush_pending(struct vm_area_struct *vma, pte_t *pte)
 {
 	return mm_tlb_flush_pending(vma->vm_mm);
@@ -637,6 +655,7 @@  static inline bool pmd_tlb_flush_pending(struct vm_area_struct *vma, pmd_t *pmd)
 {
 	return mm_tlb_flush_pending(vma->vm_mm);
 }
+#endif /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
 
 static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 {
diff --git a/mm/mmap.c b/mm/mmap.c
index 90673febce6a..a81ef902e296 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3337,6 +3337,7 @@  struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			get_file(new_vma->vm_file);
 		if (new_vma->vm_ops && new_vma->vm_ops->open)
 			new_vma->vm_ops->open(new_vma);
+		init_vma_tlb_generation(new_vma);
 		vma_link(mm, new_vma, prev, rb_link, rb_parent);
 		*need_rmap_locks = false;
 	}
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 13338c096cc6..0d554f2f92ac 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -329,6 +329,9 @@  static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 #endif
 
 	tlb_table_init(tlb);
+#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
+	tlb->defer_gen = 0;
+#endif
 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
 	tlb->page_size = 0;
 #endif