diff mbox series

x86/tlb: Flush global mappings when KAISER is disabled

Message ID 20210325102959.GD31322@zn.tnic (mailing list archive)
State New, archived
Headers show
Series x86/tlb: Flush global mappings when KAISER is disabled | expand

Commit Message

Borislav Petkov March 25, 2021, 10:29 a.m. UTC
Ok,

I tried to be as specific as possible in the commit message so that we
don't forget. Please lemme know if I've missed something.

Babu, Jim, I'd appreciate it if you ran this to confirm.

Thx.

---
From: Borislav Petkov <bp@suse.de>
Date: Thu, 25 Mar 2021 11:02:31 +0100

Jim Mattson reported that Debian 9 guests using a 4.9-stable kernel
are exploding during alternatives patching:

  kernel BUG at /build/linux-dqnRSc/linux-4.9.228/arch/x86/kernel/alternative.c:709!
  invalid opcode: 0000 [#1] SMP
  Modules linked in:
  CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.9.0-13-amd64 #1 Debian 4.9.228-1
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Call Trace:
   swap_entry_free
   swap_entry_free
   text_poke_bp
   swap_entry_free
   arch_jump_label_transform
   set_debug_rodata
   __jump_label_update
   static_key_slow_inc
   frontswap_register_ops
   init_zswap
   init_frontswap
   do_one_initcall
   set_debug_rodata
   kernel_init_freeable
   rest_init
   kernel_init
   ret_from_fork

triggering the BUG_ON in text_poke() which verifies whether patched
instruction bytes have actually landed at the destination.

Further debugging showed that the TLB flush before that check is
insufficient because there could be global mappings left in the TLB,
leading to a stale mapping getting used.

I say "global mappings" because the hardware configuration is a new one:
machine is an AMD, which means, KAISER/PTI doesn't need to be enabled
there, which also means there's no user/kernel pagetables split and
therefore the TLB can have global mappings.

And the configuration is new one for a second reason: because that AMD
machine supports PCID and INVPCID, which leads the CPU detection code to
set the synthetic X86_FEATURE_INVPCID_SINGLE flag.

Now, __native_flush_tlb_single() does invalidate global mappings when
X86_FEATURE_INVPCID_SINGLE is *not* set and returns.

When X86_FEATURE_INVPCID_SINGLE is set, however, it invalidates the
requested address from both PCIDs in the KAISER-enabled case. But if
KAISER is not enabled and the machine has global mappings in the TLB,
then those global mappings do not get invalidated, which would lead to
the above mismatch from using a stale TLB entry.

So make sure to flush those global mappings in the KAISER disabled case.

Co-debugged by Babu Moger <babu.moger@amd.com>.

Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/CALMp9eRDSW66%2BXvbHVF4ohL7XhThoPoT0BrB0TcS0cgk=dkcBg@mail.gmail.com
---
 arch/x86/include/asm/tlbflush.h | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

Comments

Paolo Bonzini March 25, 2021, 10:52 a.m. UTC | #1
On 25/03/21 11:29, Borislav Petkov wrote:
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index f5ca15622dc9..2bfa4deb8cae 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -245,12 +245,15 @@ static inline void __native_flush_tlb_single(unsigned long addr)
>   	 * ASID.  But, userspace flushes are probably much more
>   	 * important performance-wise.
>   	 *
> -	 * Make sure to do only a single invpcid when KAISER is
> -	 * disabled and we have only a single ASID.
> +	 * In the KAISER disabled case, do an INVLPG to make sure
> +	 * the mapping is flushed in case it is a global one.
>   	 */
> -	if (kaiser_enabled)
> +	if (kaiser_enabled) {
>   		invpcid_flush_one(X86_CR3_PCID_ASID_USER, addr);
> -	invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr);
> +		invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr);
> +	} else {
> +		asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
> +	}
>   }
>   
>   static inline void __flush_tlb_all(void)
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Babu Moger March 25, 2021, 3:13 p.m. UTC | #2
On 3/25/21 5:29 AM, Borislav Petkov wrote:
> Ok,
> 
> I tried to be as specific as possible in the commit message so that we
> don't forget. Please lemme know if I've missed something.
> 
> Babu, Jim, I'd appreciate it if you ran this to confirm.
> 
> Thx.
> 
> ---
> From: Borislav Petkov <bp@suse.de>
> Date: Thu, 25 Mar 2021 11:02:31 +0100
> 
> Jim Mattson reported that Debian 9 guests using a 4.9-stable kernel
> are exploding during alternatives patching:
> 
>   kernel BUG at /build/linux-dqnRSc/linux-4.9.228/arch/x86/kernel/alternative.c:709!
>   invalid opcode: 0000 [#1] SMP
>   Modules linked in:
>   CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.9.0-13-amd64 #1 Debian 4.9.228-1
>   Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>   Call Trace:
>    swap_entry_free
>    swap_entry_free
>    text_poke_bp
>    swap_entry_free
>    arch_jump_label_transform
>    set_debug_rodata
>    __jump_label_update
>    static_key_slow_inc
>    frontswap_register_ops
>    init_zswap
>    init_frontswap
>    do_one_initcall
>    set_debug_rodata
>    kernel_init_freeable
>    rest_init
>    kernel_init
>    ret_from_fork
> 
> triggering the BUG_ON in text_poke() which verifies whether patched
> instruction bytes have actually landed at the destination.
> 
> Further debugging showed that the TLB flush before that check is
> insufficient because there could be global mappings left in the TLB,
> leading to a stale mapping getting used.
> 
> I say "global mappings" because the hardware configuration is a new one:
> machine is an AMD, which means, KAISER/PTI doesn't need to be enabled
> there, which also means there's no user/kernel pagetables split and
> therefore the TLB can have global mappings.
> 
> And the configuration is new one for a second reason: because that AMD
> machine supports PCID and INVPCID, which leads the CPU detection code to
> set the synthetic X86_FEATURE_INVPCID_SINGLE flag.
> 
> Now, __native_flush_tlb_single() does invalidate global mappings when
> X86_FEATURE_INVPCID_SINGLE is *not* set and returns.
> 
> When X86_FEATURE_INVPCID_SINGLE is set, however, it invalidates the
> requested address from both PCIDs in the KAISER-enabled case. But if
> KAISER is not enabled and the machine has global mappings in the TLB,
> then those global mappings do not get invalidated, which would lead to
> the above mismatch from using a stale TLB entry.
> 
> So make sure to flush those global mappings in the KAISER disabled case.
> 
> Co-debugged by Babu Moger <babu.moger@amd.com>.
> 
> Reported-by: Jim Mattson <jmattson@google.com>
> Signed-off-by: Borislav Petkov <bp@suse.de>
> Link: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2FCALMp9eRDSW66%252BXvbHVF4ohL7XhThoPoT0BrB0TcS0cgk%3DdkcBg%40mail.gmail.com&amp;data=04%7C01%7Cbabu.moger%40amd.com%7Cf4e0aacf81744dc8be4408d8ef78f2cf%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637522650066097649%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=1c4MQ9I9KrLxWLqghGCI%2BC%2Bvs0c9vYaNC5d%2FiYL0oMA%3D&amp;reserved=0
> ---
>  arch/x86/include/asm/tlbflush.h | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index f5ca15622dc9..2bfa4deb8cae 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -245,12 +245,15 @@ static inline void __native_flush_tlb_single(unsigned long addr)
>  	 * ASID.  But, userspace flushes are probably much more
>  	 * important performance-wise.
>  	 *
> -	 * Make sure to do only a single invpcid when KAISER is
> -	 * disabled and we have only a single ASID.
> +	 * In the KAISER disabled case, do an INVLPG to make sure
> +	 * the mapping is flushed in case it is a global one.
>  	 */
> -	if (kaiser_enabled)
> +	if (kaiser_enabled) {
>  		invpcid_flush_one(X86_CR3_PCID_ASID_USER, addr);
> -	invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr);
> +		invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr);
> +	} else {
> +		asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
> +	}
>  }
>  
>  static inline void __flush_tlb_all(void)
> 

Thanks Boris. As you updated the patch little bit since yesterday, I
retested them again both on host and guest kernel. They are looking good.

Tested-by: Babu Moger <babu.moger@amd.com>
Hugh Dickins March 25, 2021, 4:33 p.m. UTC | #3
On Thu, 25 Mar 2021, Borislav Petkov wrote:

> Ok,
> 
> I tried to be as specific as possible in the commit message so that we
> don't forget. Please lemme know if I've missed something.
> 
> Babu, Jim, I'd appreciate it if you ran this to confirm.
> 
> Thx.
> 
> ---
> From: Borislav Petkov <bp@suse.de>
> Date: Thu, 25 Mar 2021 11:02:31 +0100
> 
> Jim Mattson reported that Debian 9 guests using a 4.9-stable kernel
> are exploding during alternatives patching:
> 
>   kernel BUG at /build/linux-dqnRSc/linux-4.9.228/arch/x86/kernel/alternative.c:709!
>   invalid opcode: 0000 [#1] SMP
>   Modules linked in:
>   CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.9.0-13-amd64 #1 Debian 4.9.228-1
>   Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>   Call Trace:
>    swap_entry_free
>    swap_entry_free
>    text_poke_bp
>    swap_entry_free
>    arch_jump_label_transform
>    set_debug_rodata
>    __jump_label_update
>    static_key_slow_inc
>    frontswap_register_ops
>    init_zswap
>    init_frontswap
>    do_one_initcall
>    set_debug_rodata
>    kernel_init_freeable
>    rest_init
>    kernel_init
>    ret_from_fork
> 
> triggering the BUG_ON in text_poke() which verifies whether patched
> instruction bytes have actually landed at the destination.
> 
> Further debugging showed that the TLB flush before that check is
> insufficient because there could be global mappings left in the TLB,
> leading to a stale mapping getting used.
> 
> I say "global mappings" because the hardware configuration is a new one:
> machine is an AMD, which means, KAISER/PTI doesn't need to be enabled
> there, which also means there's no user/kernel pagetables split and
> therefore the TLB can have global mappings.
> 
> And the configuration is new one for a second reason: because that AMD
> machine supports PCID and INVPCID, which leads the CPU detection code to
> set the synthetic X86_FEATURE_INVPCID_SINGLE flag.
> 
> Now, __native_flush_tlb_single() does invalidate global mappings when
> X86_FEATURE_INVPCID_SINGLE is *not* set and returns.
> 
> When X86_FEATURE_INVPCID_SINGLE is set, however, it invalidates the
> requested address from both PCIDs in the KAISER-enabled case. But if
> KAISER is not enabled and the machine has global mappings in the TLB,
> then those global mappings do not get invalidated, which would lead to
> the above mismatch from using a stale TLB entry.
> 
> So make sure to flush those global mappings in the KAISER disabled case.
> 
> Co-debugged by Babu Moger <babu.moger@amd.com>.
> 
> Reported-by: Jim Mattson <jmattson@google.com>
> Signed-off-by: Borislav Petkov <bp@suse.de>
> Link: https://lkml.kernel.org/r/CALMp9eRDSW66%2BXvbHVF4ohL7XhThoPoT0BrB0TcS0cgk=dkcBg@mail.gmail.com

Acked-by: Hugh Dickins <hughd@google.com>

Great write-up too: many thanks.

> ---
>  arch/x86/include/asm/tlbflush.h | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index f5ca15622dc9..2bfa4deb8cae 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -245,12 +245,15 @@ static inline void __native_flush_tlb_single(unsigned long addr)
>  	 * ASID.  But, userspace flushes are probably much more
>  	 * important performance-wise.
>  	 *
> -	 * Make sure to do only a single invpcid when KAISER is
> -	 * disabled and we have only a single ASID.
> +	 * In the KAISER disabled case, do an INVLPG to make sure
> +	 * the mapping is flushed in case it is a global one.
>  	 */
> -	if (kaiser_enabled)
> +	if (kaiser_enabled) {
>  		invpcid_flush_one(X86_CR3_PCID_ASID_USER, addr);
> -	invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr);
> +		invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr);
> +	} else {
> +		asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
> +	}
>  }
>  
>  static inline void __flush_tlb_all(void)
> -- 
> 2.29.2
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Jim Mattson March 25, 2021, 7 p.m. UTC | #4
On Thu, Mar 25, 2021 at 9:33 AM Hugh Dickins <hughd@google.com> wrote:
>
> On Thu, 25 Mar 2021, Borislav Petkov wrote:
>
> > Ok,
> >
> > I tried to be as specific as possible in the commit message so that we
> > don't forget. Please lemme know if I've missed something.
> >
> > Babu, Jim, I'd appreciate it if you ran this to confirm.
Tested-by: Jim Mattson <jmattson@google.com>
diff mbox series

Patch

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index f5ca15622dc9..2bfa4deb8cae 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -245,12 +245,15 @@  static inline void __native_flush_tlb_single(unsigned long addr)
 	 * ASID.  But, userspace flushes are probably much more
 	 * important performance-wise.
 	 *
-	 * Make sure to do only a single invpcid when KAISER is
-	 * disabled and we have only a single ASID.
+	 * In the KAISER disabled case, do an INVLPG to make sure
+	 * the mapping is flushed in case it is a global one.
 	 */
-	if (kaiser_enabled)
+	if (kaiser_enabled) {
 		invpcid_flush_one(X86_CR3_PCID_ASID_USER, addr);
-	invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr);
+		invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr);
+	} else {
+		asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+	}
 }
 
 static inline void __flush_tlb_all(void)