
[RFC,v3,13/15] context_tracking,x86: Add infrastructure to defer kernel TLBI

Message ID 20241119153502.41361-14-vschneid@redhat.com (mailing list archive)
State RFC
Series context_tracking,x86: Defer some IPIs until a user->kernel transition


Commit Message

Valentin Schneider Nov. 19, 2024, 3:35 p.m. UTC
Kernel TLB invalidation IPIs are a common source of interference on
NOHZ_FULL CPUs. Given that NOHZ_FULL CPUs executing in userspace are not
accessing any kernel addresses, these invalidations do not need to happen
immediately and can be deferred until the next user->kernel transition.

Add a minimal, noinstr-compliant variant of __flush_tlb_all() that doesn't
try to leverage INVPCID. To achieve this:
o Add a noinstr variant of native_write_cr4(), keeping native_write_cr4() as
  "only" __no_profile.
o Make invalidate_user_asid() __always_inline

XXX: what about paravirt? Should these be instead new operations made
available through pv_ops.mmu.*? x86_64_start_kernel() uses
__native_tlb_flush_global() regardless of paravirt, so I'm thinking the
paravirt variants are more optimizations than hard requirements?

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  4 +++
 arch/x86/include/asm/special_insns.h         |  1 +
 arch/x86/include/asm/tlbflush.h              | 16 ++++++++++--
 arch/x86/kernel/cpu/common.c                 |  6 ++++-
 arch/x86/mm/tlb.c                            | 26 ++++++++++++++++++--
 include/linux/context_tracking_work.h        |  2 ++
 6 files changed, 50 insertions(+), 5 deletions(-)
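
For orientation, here is a rough sketch of the sender side this is building
towards, i.e. how a kernel TLB flush could defer the IPI for CPUs running in
userspace. None of this is in the patch below: flush_tlb_kernel_deferrable()
is a hypothetical name, ct_set_cpu_work() comes from the deferral
infrastructure added earlier in the series (its signature is approximated
here), and the actual flush_tlb_kernel_range() wiring lands in a later patch.

/*
 * Hedged sketch: tag deferred TLBI work on CPUs that context tracking
 * reports as being in userspace, and fall back to the usual IPI-based
 * flush for everybody else.
 */
static void flush_tlb_kernel_deferrable(const struct cpumask *cpus)
{
	cpumask_var_t ipi_mask;
	int cpu;

	if (!zalloc_cpumask_var(&ipi_mask, GFP_ATOMIC))
		return;	/* error handling elided for brevity */

	for_each_cpu(cpu, cpus) {
		/*
		 * ct_set_cpu_work() (approximated) returns true iff @cpu is
		 * in a context where the work can be deferred to its next
		 * kernel entry, in which case no IPI is needed.
		 */
		if (ct_set_cpu_work(cpu, CONTEXT_WORK_TLBI))
			continue;

		cpumask_set_cpu(cpu, ipi_mask);
	}

	/* Flush immediately on the CPUs that could not defer. */
	on_each_cpu_mask(ipi_mask, do_flush_tlb_all, NULL, true);

	free_cpumask_var(ipi_mask);
}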

Comments

Peter Zijlstra Nov. 20, 2024, 3:22 p.m. UTC | #1
On Tue, Nov 19, 2024 at 04:35:00PM +0100, Valentin Schneider wrote:

> +void noinstr __flush_tlb_all_noinstr(void)
> +{
> +	/*
> +	 * This is for invocation in early entry code that cannot be
> +	 * instrumented. A RMW to CR4 works for most cases, but relies on
> +	 * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
> +	 * would require also resetting CR3.PCID, so just try with CR4.PGE, else
> +	 * do the CR3 write.
> +	 *
> +	 * XXX: this gives paravirt the finger.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_PGE))
> +		__native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
> +	else
> +		native_flush_tlb_local_noinstr();
> +}

Urgh, so that's a lot of ugleh, and cr4 has that pinning stuff and gah.

Why not always just do the CR3 write and call it a day? That should also
work for paravirt, no? Just make the whole write_cr3 thing noinstr and
voila.
Peter Zijlstra Nov. 20, 2024, 3:32 p.m. UTC | #2
On Wed, Nov 20, 2024 at 04:22:16PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 19, 2024 at 04:35:00PM +0100, Valentin Schneider wrote:
> 
> > +void noinstr __flush_tlb_all_noinstr(void)
> > +{
> > +	/*
> > +	 * This is for invocation in early entry code that cannot be
> > +	 * instrumented. A RMW to CR4 works for most cases, but relies on
> > +	 * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
> > +	 * would require also resetting CR3.PCID, so just try with CR4.PGE, else
> > +	 * do the CR3 write.
> > +	 *
> > +	 * XXX: this gives paravirt the finger.
> > +	 */
> > +	if (cpu_feature_enabled(X86_FEATURE_PGE))
> > +		__native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
> > +	else
> > +		native_flush_tlb_local_noinstr();
> > +}
> 
> Urgh, so that's a lot of ugleh, and cr4 has that pinning stuff and gah.
> 
> Why not always just do the CR3 write and call it a day? That should also
> work for paravirt, no? Just make the whole write_cr3 thing noinstr and
> voila.

Oh gawd, just having looked at xen_write_cr3() this might not be
entirely trivial to mark noinstr :/
Valentin Schneider Nov. 20, 2024, 5:24 p.m. UTC | #3
On 20/11/24 16:32, Peter Zijlstra wrote:
> On Wed, Nov 20, 2024 at 04:22:16PM +0100, Peter Zijlstra wrote:
>> On Tue, Nov 19, 2024 at 04:35:00PM +0100, Valentin Schneider wrote:
>> 
>> > +void noinstr __flush_tlb_all_noinstr(void)
>> > +{
>> > +	/*
>> > +	 * This is for invocation in early entry code that cannot be
>> > +	 * instrumented. A RMW to CR4 works for most cases, but relies on
>> > +	 * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
>> > +	 * would require also resetting CR3.PCID, so just try with CR4.PGE, else
>> > +	 * do the CR3 write.
>> > +	 *
>> > +	 * XXX: this gives paravirt the finger.
>> > +	 */
>> > +	if (cpu_feature_enabled(X86_FEATURE_PGE))
>> > +		__native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
>> > +	else
>> > +		native_flush_tlb_local_noinstr();
>> > +}
>> 
>> Urgh, so that's a lot of ugleh, and cr4 has that pinning stuff and gah.
>> 
>> Why not always just do the CR3 write and call it a day? That should also
>> work for paravirt, no? Just make the whole write_cr3 thing noinstr and
>> voila.
>
> Oh gawd, just having looked at xen_write_cr3() this might not be
> entirely trivial to mark noinstr :/

... I hadn't even seen that.

AIUI the CR3 RMW is not "enough" if we have PGE enabled, because then
global pages aren't flushed.

The question becomes: what is held in global pages and do we care about
that when it comes to vmalloc()? I'm starting to think no, but this is x86,
I don't know what surprises are waiting for me.

I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
and it correctly uses the non-deferrable flush_tlb_kernel_range().
Peter Zijlstra Nov. 21, 2024, 11:12 a.m. UTC | #4
On Wed, Nov 20, 2024 at 06:24:56PM +0100, Valentin Schneider wrote:

> > Oh gawd, just having looked at xen_write_cr3() this might not be
> > entirely trivial to mark noinstr :/
> 
> ... I hadn't even seen that.
> 
> AIUI the CR3 RMW is not "enough" if we have PGE enabled, because then
> global pages aren't flushed.
> 
> The question becomes: what is held in global pages and do we care about
> that when it comes to vmalloc()? I'm starting to think no, but this is x86,
> I don't know what surprises are waiting for me.
> 
> I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
> and it correctly uses the non-deferrable flush_tlb_kernel_range().


I always forget what we use global pages for, dhansen might know, but
let me try and have a look.

I *think* we only have GLOBAL on kernel text, and that only sometimes.
Dave Hansen Nov. 21, 2024, 3:07 p.m. UTC | #5
On 11/21/24 03:12, Peter Zijlstra wrote:
>> I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
>> and it correctly uses the non-deferrable flush_tlb_kernel_range().
> 
> I always forget what we use global pages for, dhansen might know, but
> let me try and have a look.
> 
> I *think* we only have GLOBAL on kernel text, and that only sometimes.

I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.

Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings.
Before PCIDs, global mappings let the kernel TLB entries live across CR3
writes. When PCIDs are in play, global mappings let two different ASIDs
share TLB entries.

When KPTI is around, the kernel writes CR3 at user/kernel switches to
make sure secrets are unmapped and can't be leaked by Meltdown. But
unmapping those secrets doesn't do squat if they were mapped globally
since they'll still be in the TLB and quite usable. There, we're more
judicious and only mark performance-sensitive things that are not secret
to be global, like kernel text.
Peter Zijlstra Nov. 21, 2024, 3:30 p.m. UTC | #6
On Thu, Nov 21, 2024 at 07:07:44AM -0800, Dave Hansen wrote:
> On 11/21/24 03:12, Peter Zijlstra wrote:
> >> I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
> >> and it correctly uses the non-deferrable flush_tlb_kernel_range().
> > 
> > I always forget what we use global pages for, dhansen might know, but
> > let me try and have a look.
> > 
> > I *think* we only have GLOBAL on kernel text, and that only sometimes.
> 
> I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.

Yah, I suppose I am. That was the last time I had a good look at this
stuff :-)

> Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings.
> Before PCIDs, global mappings let the kernel TLB entries live across CR3
> writes. When PCIDs are in play, global mappings let two different ASIDs
> share TLB entries.

Hurmph.. bah. That means we do need that horrible CR4 dance :/
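
To make the trade-off concrete, here is roughly what the two options being
weighed boil down to, mirroring the existing native_flush_tlb_local() and
__native_tlb_flush_global(); this is an illustrative sketch, not the exact
kernel code:

/*
 * A plain CR3 reload only flushes non-global TLB entries (for the current
 * PCID when PCIDs are enabled); with CR4.PGE set, kernel mappings marked
 * _PAGE_GLOBAL survive it.
 */
static inline void sketch_flush_tlb_local(void)
{
	unsigned long cr3;

	asm volatile("mov %%cr3, %0" : "=r" (cr3));
	asm volatile("mov %0, %%cr3" : : "r" (cr3) : "memory");
}

/*
 * Toggling CR4.PGE off and back on invalidates all TLB entries, global
 * ones included -- the "CR4 dance", which is what makes the noinstr /
 * CR4-pinning discussion above necessary.
 */
static inline void sketch_flush_tlb_global(unsigned long cr4)
{
	asm volatile("mov %0, %%cr4" : : "r" (cr4 ^ X86_CR4_PGE) : "memory");
	asm volatile("mov %0, %%cr4" : : "r" (cr4) : "memory");
}
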
Peter Zijlstra Nov. 21, 2024, 4:26 p.m. UTC | #7
On Tue, Nov 19, 2024 at 04:35:00PM +0100, Valentin Schneider wrote:

> @@ -418,9 +419,20 @@ static inline void cpu_tlbstate_update_lam(unsigned long lam, u64 untag_mask)
>  #endif
>  #endif /* !MODULE */
>  
> +#define __NATIVE_TLB_FLUSH_GLOBAL(suffix, cr4)		\
> +	native_write_cr4##suffix(cr4 ^ X86_CR4_PGE);	\
> +	native_write_cr4##suffix(cr4)
> +#define NATIVE_TLB_FLUSH_GLOBAL(cr4)         __NATIVE_TLB_FLUSH_GLOBAL(, cr4)
> +#define NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4) __NATIVE_TLB_FLUSH_GLOBAL(_noinstr, cr4)
> +
>  static inline void __native_tlb_flush_global(unsigned long cr4)
>  {
> -	native_write_cr4(cr4 ^ X86_CR4_PGE);
> -	native_write_cr4(cr4);
> +	NATIVE_TLB_FLUSH_GLOBAL(cr4);
>  }
> +
> +static inline void __native_tlb_flush_global_noinstr(unsigned long cr4)
> +{
> +	NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4);
> +}

How about something like this instead? I've only compile tested the
tlb.c bit, but it should get __flush_tlb_global() to be noinstr I think,
including the Xen bit (unless I missed something but then objtool should
complain).

---
diff --git a/arch/x86/include/asm/invpcid.h b/arch/x86/include/asm/invpcid.h
index 734482afbf81..ff26136fcd9c 100644
--- a/arch/x86/include/asm/invpcid.h
+++ b/arch/x86/include/asm/invpcid.h
@@ -2,7 +2,7 @@
 #ifndef _ASM_X86_INVPCID
 #define _ASM_X86_INVPCID
 
-static inline void __invpcid(unsigned long pcid, unsigned long addr,
+static __always_inline void __invpcid(unsigned long pcid, unsigned long addr,
 			     unsigned long type)
 {
 	struct { u64 d[2]; } desc = { { pcid, addr } };
@@ -13,7 +13,7 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr,
 	 * mappings, we don't want the compiler to reorder any subsequent
 	 * memory accesses before the TLB flush.
 	 */
-	asm volatile("invpcid %[desc], %[type]"
+	asm_inline volatile("invpcid %[desc], %[type]"
 		     :: [desc] "m" (desc), [type] "r" (type) : "memory");
 }
 
@@ -23,26 +23,25 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr,
 #define INVPCID_TYPE_ALL_NON_GLOBAL	3
 
 /* Flush all mappings for a given pcid and addr, not including globals. */
-static inline void invpcid_flush_one(unsigned long pcid,
-				     unsigned long addr)
+static __always_inline void invpcid_flush_one(unsigned long pcid, unsigned long addr)
 {
 	__invpcid(pcid, addr, INVPCID_TYPE_INDIV_ADDR);
 }
 
 /* Flush all mappings for a given PCID, not including globals. */
-static inline void invpcid_flush_single_context(unsigned long pcid)
+static __always_inline void invpcid_flush_single_context(unsigned long pcid)
 {
 	__invpcid(pcid, 0, INVPCID_TYPE_SINGLE_CTXT);
 }
 
 /* Flush all mappings, including globals, for all PCIDs. */
-static inline void invpcid_flush_all(void)
+static __always_inline void invpcid_flush_all(void)
 {
 	__invpcid(0, 0, INVPCID_TYPE_ALL_INCL_GLOBAL);
 }
 
 /* Flush all mappings for all PCIDs except globals. */
-static inline void invpcid_flush_all_nonglobals(void)
+static __always_inline void invpcid_flush_all_nonglobals(void)
 {
 	__invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d4eb9e1d61b8..b3daee3d4667 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -75,7 +75,7 @@ static inline void __flush_tlb_local(void)
 	PVOP_VCALL0(mmu.flush_tlb_user);
 }
 
-static inline void __flush_tlb_global(void)
+static __always_inline void __flush_tlb_global(void)
 {
 	PVOP_VCALL0(mmu.flush_tlb_kernel);
 }
diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index a2dd24947eb8..b4c635b20538 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -357,8 +357,8 @@ MULTI_mmu_update(struct multicall_entry *mcl, struct mmu_update *req,
 	trace_xen_mc_entry(mcl, 4);
 }
 
-static inline void
-MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
+static __always_inline void
+__MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
 		int *success_count, domid_t domid)
 {
 	mcl->op = __HYPERVISOR_mmuext_op;
@@ -366,6 +366,13 @@ MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
 	mcl->args[1] = count;
 	mcl->args[2] = (unsigned long)success_count;
 	mcl->args[3] = domid;
+}
+
+static inline void
+MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
+		int *success_count, domid_t domid)
+{
+	__MULTI_mmuext_op(mcl, op, count, success_count, domid);
 
 	trace_xen_mc_entry(mcl, 4);
 }
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b0d5a644fc84..0cfc00a34b7e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1168,9 +1168,10 @@ void flush_tlb_one_user(unsigned long addr)
 /*
  * Flush everything
  */
-STATIC_NOPV void native_flush_tlb_global(void)
+STATIC_NOPV noinstr void native_flush_tlb_global(void)
 {
 	unsigned long flags;
+	unsigned long cr4;
 
 	if (static_cpu_has(X86_FEATURE_INVPCID)) {
 		/*
@@ -1189,9 +1190,15 @@ STATIC_NOPV void native_flush_tlb_global(void)
 	 * be called from deep inside debugging code.)
 	 */
 	raw_local_irq_save(flags);
-
-	__native_tlb_flush_global(this_cpu_read(cpu_tlbstate.cr4));
-
+	cr4 = this_cpu_read(cpu_tlbstate.cr4);
+	asm volatile("mov %0,%%cr4": : "r" (cr4 ^ X86_CR4_PGE) : "memory");
+	asm volatile("mov %0,%%cr4": : "r" (cr4) : "memory");
+	/*
+	 * In lieu of not having the pinning crap, hard fail if CR4 doesn't
+	 * match the expected value. This ensures that anybody doing dodgy gets
+	 * the fallthrough check.
+	 */
+	BUG_ON(cr4 != this_cpu_read(cpu_tlbstate.cr4));
 	raw_local_irq_restore(flags);
 }
 
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 55a4996d0c04..4eb265eb867a 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1231,22 +1231,22 @@ static noinstr void xen_write_cr2(unsigned long cr2)
 	this_cpu_read(xen_vcpu)->arch.cr2 = cr2;
 }
 
-static noinline void xen_flush_tlb(void)
+static noinline noinstr void xen_flush_tlb(void)
 {
 	struct mmuext_op *op;
 	struct multicall_space mcs;
 
-	preempt_disable();
+	preempt_disable_notrace();
 
 	mcs = xen_mc_entry(sizeof(*op));
 
 	op = mcs.args;
 	op->cmd = MMUEXT_TLB_FLUSH_LOCAL;
-	MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
+	__MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
 
-	xen_mc_issue(XEN_LAZY_MMU);
+	__xen_mc_issue(XEN_LAZY_MMU);
 
-	preempt_enable();
+	preempt_enable_notrace();
 }
 
 static void xen_flush_tlb_one_user(unsigned long addr)
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index e1b782e823e6..31eddca45c27 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -235,15 +235,19 @@ static inline struct multicall_space xen_mc_entry(size_t args)
 void xen_mc_flush(void);
 
 /* Issue a multicall if we're not in a lazy mode */
-static inline void xen_mc_issue(unsigned mode)
+static __always_inline void __xen_mc_issue(unsigned mode)
 {
-	trace_xen_mc_issue(mode);
-
 	if ((xen_get_lazy_mode() & mode) == 0)
 		xen_mc_flush();
 
 	/* restore flags saved in xen_mc_batch */
-	local_irq_restore(this_cpu_read(xen_mc_irq_flags));
+	raw_local_irq_restore(this_cpu_read(xen_mc_irq_flags));
+}
+
+static inline void xen_mc_issue(unsigned mode)
+{
+	trace_xen_mc_issue(mode);
+	__xen_mc_issue(mode);
 }
 
 /* Set up a callback to be called when the current batch is flushed */

Patch

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 2c66687ce00e2..9d4f021b5a45b 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@ 
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
 #include <asm/sync_core.h>
+#include <asm/tlbflush.h>
 
 static __always_inline void arch_context_tracking_work(int work)
 {
@@ -10,6 +11,9 @@  static __always_inline void arch_context_tracking_work(int work)
 	case CONTEXT_WORK_SYNC:
 		sync_core();
 		break;
+	case CONTEXT_WORK_TLBI:
+		__flush_tlb_all_noinstr();
+		break;
 	}
 }
 
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index aec6e2d3aa1d5..b97157a69d48e 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -74,6 +74,7 @@  static inline unsigned long native_read_cr4(void)
 	return val;
 }
 
+void native_write_cr4_noinstr(unsigned long val);
 void native_write_cr4(unsigned long val);
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 69e79fff41b80..a653b5f47f0e6 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -18,6 +18,7 @@ 
 DECLARE_PER_CPU(u64, tlbstate_untag_mask);
 
 void __flush_tlb_all(void);
+void noinstr __flush_tlb_all_noinstr(void);
 
 #define TLB_FLUSH_ALL	-1UL
 #define TLB_GENERATION_INVALID	0
@@ -418,9 +419,20 @@  static inline void cpu_tlbstate_update_lam(unsigned long lam, u64 untag_mask)
 #endif
 #endif /* !MODULE */
 
+#define __NATIVE_TLB_FLUSH_GLOBAL(suffix, cr4)		\
+	native_write_cr4##suffix(cr4 ^ X86_CR4_PGE);	\
+	native_write_cr4##suffix(cr4)
+#define NATIVE_TLB_FLUSH_GLOBAL(cr4)         __NATIVE_TLB_FLUSH_GLOBAL(, cr4)
+#define NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4) __NATIVE_TLB_FLUSH_GLOBAL(_noinstr, cr4)
+
 static inline void __native_tlb_flush_global(unsigned long cr4)
 {
-	native_write_cr4(cr4 ^ X86_CR4_PGE);
-	native_write_cr4(cr4);
+	NATIVE_TLB_FLUSH_GLOBAL(cr4);
 }
+
+static inline void __native_tlb_flush_global_noinstr(unsigned long cr4)
+{
+	NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4);
+}
+
 #endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a5f221ea56888..a84bb8511650b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -424,7 +424,7 @@  void native_write_cr0(unsigned long val)
 }
 EXPORT_SYMBOL(native_write_cr0);
 
-void __no_profile native_write_cr4(unsigned long val)
+noinstr void native_write_cr4_noinstr(unsigned long val)
 {
 	unsigned long bits_changed = 0;
 
@@ -442,6 +442,10 @@  void __no_profile native_write_cr4(unsigned long val)
 			  bits_changed);
 	}
 }
+void native_write_cr4(unsigned long val)
+{
+	native_write_cr4_noinstr(val);
+}
 #if IS_MODULE(CONFIG_LKDTM)
 EXPORT_SYMBOL_GPL(native_write_cr4);
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 86593d1b787d8..973a4ab3f53b3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -256,7 +256,7 @@  static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
  *
  * See SWITCH_TO_USER_CR3.
  */
-static inline void invalidate_user_asid(u16 asid)
+static __always_inline void invalidate_user_asid(u16 asid)
 {
 	/* There is no user ASID if address space separation is off */
 	if (!IS_ENABLED(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION))
@@ -1198,7 +1198,7 @@  STATIC_NOPV void native_flush_tlb_global(void)
 /*
  * Flush the entire current user mapping
  */
-STATIC_NOPV void native_flush_tlb_local(void)
+static noinstr void native_flush_tlb_local_noinstr(void)
 {
 	/*
 	 * Preemption or interrupts must be disabled to protect the access
@@ -1213,6 +1213,11 @@  STATIC_NOPV void native_flush_tlb_local(void)
 	native_write_cr3(__native_read_cr3());
 }
 
+STATIC_NOPV void native_flush_tlb_local(void)
+{
+	native_flush_tlb_local_noinstr();
+}
+
 void flush_tlb_local(void)
 {
 	__flush_tlb_local();
@@ -1240,6 +1245,23 @@  void __flush_tlb_all(void)
 }
 EXPORT_SYMBOL_GPL(__flush_tlb_all);
 
+void noinstr __flush_tlb_all_noinstr(void)
+{
+	/*
+	 * This is for invocation in early entry code that cannot be
+	 * instrumented. A RMW to CR4 works for most cases, but relies on
+	 * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
+	 * would require also resetting CR3.PCID, so just try with CR4.PGE, else
+	 * do the CR3 write.
+	 *
+	 * XXX: this gives paravirt the finger.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_PGE))
+		__native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
+	else
+		native_flush_tlb_local_noinstr();
+}
+
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
 	struct flush_tlb_info *info;
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 13fc97b395030..47d5ced39a43a 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -6,11 +6,13 @@ 
 
 enum {
 	CONTEXT_WORK_SYNC_OFFSET,
+	CONTEXT_WORK_TLBI_OFFSET,
 	CONTEXT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
 	CONTEXT_WORK_SYNC     = BIT(CONTEXT_WORK_SYNC_OFFSET),
+	CONTEXT_WORK_TLBI     = BIT(CONTEXT_WORK_TLBI_OFFSET),
 	CONTEXT_WORK_MAX      = BIT(CONTEXT_WORK_MAX_OFFSET)
 };
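
For completeness, a rough sketch of how the deferred bits above get consumed
on the next user->kernel transition, following the deferral infrastructure
from earlier patches in this series (names and exact shape approximated, not
part of this patch):

/*
 * Hedged sketch: the context tracking entry path walks the deferred-work
 * bits and dispatches each one through arch_context_tracking_work(), so a
 * deferred TLBI runs before the task touches any kernel addresses beyond
 * the (noinstr) entry code itself.
 */
static __always_inline void ct_work_flush_sketch(unsigned long work)
{
	int bit;

	/* Handle each pending deferred operation exactly once. */
	for_each_set_bit(bit, &work, CONTEXT_WORK_MAX_OFFSET)
		arch_context_tracking_work(BIT(bit));
}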