x86/tsx: fix KVM guest live migration for tsx=on

Message ID	20220411180131.5054-1-jon@nutanix.com (mailing list archive)
State	New, archived
From: Jon Kohler <jon@nutanix.com>
To: Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>, Tony Luck <tony.luck@intel.com>, Jon Kohler <jon@nutanix.com>, Andi Kleen <ak@linux.intel.com>, Pawan Gupta <pawan.kumar.gupta@linux.intel.com>, linux-kernel@vger.kernel.org
Cc: Borislav Petkov <bp@suse.de>, Neelima Krishnan <neelima.krishnan@intel.com>, kvm@vger.kernel.org
Subject: [PATCH] x86/tsx: fix KVM guest live migration for tsx=on
Date: Mon, 11 Apr 2022 14:01:29 -0400

Jon Kohler April 11, 2022, 6:01 p.m. UTC

Move automatic disablement for TSX microcode deprecation from tsx_init() to
x86_get_tsx_auto_mode(), such that systems with tsx=on will continue to
see the TSX CPU features (HLE, RTM) even on updated microcode.

KVM live migration can break on kernels 5.14+ because of commit 293649307ef9
("x86/tsx: Clear CPUID bits when TSX always force aborts"). Consider the
following scenario:

1. KVM hosts clustered in a live migration capable setup.
2. KVM guests have TSX CPU features HLE and/or RTM presented.
3. One of three maintenance events occurs:
3a. An existing host running kernel >= 5.14 in the pool is updated with the
    new microcode.
3b. A new host running kernel >= 5.14 is commissioned that already has the
    microcode update preloaded.
3c. All hosts are running kernel < 5.14 with microcode update already
    loaded and one existing host gets updated to kernel >= 5.14.
4. After the maintenance event, the impacted host will not have HLE and RTM
   exposed, and guests with TSX features might fail to live migrate to
   it.
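
The failure in step 4 comes down to a feature subset check: a guest can
only land on a host that still offers every CPUID bit the guest was
started with. A minimal sketch of that check follows. The helper name
can_migrate() is hypothetical and is not taken from KVM, qemu, or
libvirt; the HLE and RTM bit positions are the CPUID.(EAX=7,ECX=0):EBX
ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* TSX feature bits in CPUID.(EAX=7,ECX=0):EBX. */
#define CPUID7_EBX_HLE (UINT32_C(1) << 4)
#define CPUID7_EBX_RTM (UINT32_C(1) << 11)

/*
 * Hypothetical helper, for illustration only: migration is allowed when
 * the guest uses no feature bit that the destination host lacks.
 */
bool can_migrate(uint32_t guest_ebx, uint32_t dest_host_ebx)
{
	return (guest_ebx & ~dest_host_ebx) == 0;
}
```

With this model, a guest started with HLE and RTM cannot move to a host
whose kernel cleared both bits, while a guest that never saw them can
move in either direction.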

Users using tsx=on or CONFIG_X86_INTEL_TSX_MODE_ON should always see
HLE and RTM on capable Intel SKUs, even if microcode has been clubbed to
prevent functionality.

Users using tsx=auto or CONFIG_X86_INTEL_TSX_MODE_AUTO get to roll the
dice with whatever the kernel believes the appropriate default is, which
includes the feature disappearing after a kernel and/or microcode update.
These users should consider masking HLE and RTM at a higher control plane
level, e.g. qemu or libvirt, such that guests on TSX enabled systems do not
see HLE/RTM and therefore do not enable TAA mitigation.

Fixes: 293649307ef9 ("x86/tsx: Clear CPUID bits when TSX always force aborts")

Signed-off-by: Jon Kohler <jon@nutanix.com>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Neelima Krishnan <neelima.krishnan@intel.com>
Cc: kvm@vger.kernel.org <kvm@vger.kernel.org>
---
 arch/x86/kernel/cpu/tsx.c | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

Dave Hansen April 11, 2022, 7:26 p.m. UTC | #1

On 4/11/22 11:01, Jon Kohler wrote:
>  static enum tsx_ctrl_states x86_get_tsx_auto_mode(void)
>  {
> +	/*
> +	 * Hardware will always abort a TSX transaction if both CPUID bits
> +	 * RTM_ALWAYS_ABORT and TSX_FORCE_ABORT are set. In this case, it is
> +	 * better not to enumerate CPUID.RTM and CPUID.HLE bits. Clear them
> +	 * here.
> +	 */
> +	if (boot_cpu_has(X86_FEATURE_RTM_ALWAYS_ABORT) &&
> +	    boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)) {
> +		tsx_clear_cpuid();
> +		setup_clear_cpu_cap(X86_FEATURE_RTM);
> +		setup_clear_cpu_cap(X86_FEATURE_HLE);
> +		return TSX_CTRL_RTM_ALWAYS_ABORT;
> +	}

I don't really like hiding the setup_clear_cpu_cap() like this.  Right
now, all of the setup_clear_cpu_cap()'s are in a single function and
they are pretty easy to figure out.

This seems like logic that deserves to be appended down to the last if()
block of code in tsx_init() instead of squirreled away in a "get mode"
function.  Does this work?

	if (tsx_ctrl_state == TSX_CTRL_DISABLE) {
		...
	} else if (tsx_ctrl_state == TSX_CTRL_ENABLE) {
		...
	} else if (tsx_ctrl_state == TSX_CTRL_RTM_ALWAYS_ABORT) {
		tsx_clear_cpuid();

		setup_clear_cpu_cap(X86_FEATURE_RTM);
		setup_clear_cpu_cap(X86_FEATURE_HLE);
	}
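
For readers following along outside the kernel tree, the split being
discussed (a "get mode" step that only picks a state, and an init step
that owns all the feature clearing) can be modelled in userspace roughly
as below. This is a simplified sketch, not the kernel code: struct
cpu_model, enum tsx_cmdline, and all three functions are hypothetical
stand-ins, and the MSR writes are left out entirely.

```c
#include <stdbool.h>

/* State names as used in this thread; the logic is a simplified model. */
enum tsx_ctrl_states {
	TSX_CTRL_ENABLE,
	TSX_CTRL_DISABLE,
	TSX_CTRL_RTM_ALWAYS_ABORT,
};

/* Hypothetical stand-in for the tsx= command line choice. */
enum tsx_cmdline { TSX_CMDLINE_ON, TSX_CMDLINE_OFF, TSX_CMDLINE_AUTO };

/* Hypothetical stand-in for the boot CPU's feature and bug flags. */
struct cpu_model {
	bool rtm_always_abort;	/* models X86_FEATURE_RTM_ALWAYS_ABORT */
	bool tsx_force_abort;	/* models X86_FEATURE_TSX_FORCE_ABORT */
	bool bug_taa;		/* models X86_BUG_TAA */
	bool rtm;		/* RTM still enumerated */
	bool hle;		/* HLE still enumerated */
};

/* "Get mode" step: picks a state and clears nothing. */
enum tsx_ctrl_states tsx_pick_state(enum tsx_cmdline arg,
				    const struct cpu_model *cpu)
{
	if (arg == TSX_CMDLINE_ON)
		return TSX_CTRL_ENABLE;
	if (arg == TSX_CMDLINE_OFF)
		return TSX_CTRL_DISABLE;
	/* auto: only here does the microcode deprecation take effect */
	if (cpu->rtm_always_abort && cpu->tsx_force_abort)
		return TSX_CTRL_RTM_ALWAYS_ABORT;
	return cpu->bug_taa ? TSX_CTRL_DISABLE : TSX_CTRL_ENABLE;
}

/* Init step: the single place where feature bits are cleared. */
void tsx_apply_state(enum tsx_ctrl_states state, struct cpu_model *cpu)
{
	if (state == TSX_CTRL_DISABLE || state == TSX_CTRL_RTM_ALWAYS_ABORT) {
		cpu->rtm = false;
		cpu->hle = false;
	}
}

/* Convenience wrapper: is RTM still enumerated after init? */
bool rtm_visible_after_init(enum tsx_cmdline arg, bool abort_ucode,
			    bool bug_taa)
{
	struct cpu_model cpu = {
		.rtm_always_abort = abort_ucode,
		.tsx_force_abort = abort_ucode,
		.bug_taa = bug_taa,
		.rtm = true,
		.hle = true,
	};

	tsx_apply_state(tsx_pick_state(arg, &cpu), &cpu);
	return cpu.rtm;
}
```

In this model tsx=on keeps RTM enumerated even with the updated
microcode, while tsx=auto on the same CPU loses it, which is the
behavior the commit message asks for.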

Jon Kohler April 11, 2022, 7:35 p.m. UTC | #2

> On Apr 11, 2022, at 3:26 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 4/11/22 11:01, Jon Kohler wrote:
>> static enum tsx_ctrl_states x86_get_tsx_auto_mode(void)
>> {
>> +	/*
>> +	 * Hardware will always abort a TSX transaction if both CPUID bits
>> +	 * RTM_ALWAYS_ABORT and TSX_FORCE_ABORT are set. In this case, it is
>> +	 * better not to enumerate CPUID.RTM and CPUID.HLE bits. Clear them
>> +	 * here.
>> +	 */
>> +	if (boot_cpu_has(X86_FEATURE_RTM_ALWAYS_ABORT) &&
>> +	    boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)) {
>> +		tsx_clear_cpuid();
>> +		setup_clear_cpu_cap(X86_FEATURE_RTM);
>> +		setup_clear_cpu_cap(X86_FEATURE_HLE);
>> +		return TSX_CTRL_RTM_ALWAYS_ABORT;
>> +	}
> 
> I don't really like hiding the setup_clear_cpu_cap() like this.  Right
> now, all of the setup_clear_cpu_cap()'s are in a single function and
> they are pretty easy to figure out.
> 
> This seems like logic that deserves to be appended down to the last if()
> block of code in tsx_init() instead of squirreled away in a "get mode"
> function.  Does this work?

Thanks for the review, Dave. Was trying to make the change simple
with just a cut-n-paste of existing code from one place to the other,
but I see what you’re saying. Yea, I can rework the logic as you
suggested, I’ll send out a v2 patch.

Also, while I’ve got you, I’d also like to send out a patch to simply
force abort all transactions even when tsx=on, and just be done with
TSX. Now that we’ve had the patch that introduced this functionality
I’m patching for roughly a year, combined with the microcode going
out, it seems like TSX’s numbered days have come to an end. 

That could greatly simplify the kernel's handling of TAA on systems
that have ARCH_CAP_TSX_CTRL_MSR.

Thoughts?

>        if (tsx_ctrl_state == TSX_CTRL_DISABLE) {
> 		...
>        } else if (tsx_ctrl_state == TSX_CTRL_ENABLE) {
> 		...	
>        } else if (tsx_ctrl_state == TSX_CTRL_RTM_ALWAYS_ABORT) {
> 		tsx_clear_cpuid();
> 
> 		setup_clear_cpu_cap(X86_FEATURE_RTM);
> 		setup_clear_cpu_cap(X86_FEATURE_HLE);
> 	}
>

Dave Hansen April 11, 2022, 11:45 p.m. UTC | #3

On 4/11/22 12:35, Jon Kohler wrote:
> Also, while I’ve got you, I’d also like to send out a patch to simply
> force abort all transactions even when tsx=on, and just be done with
> TSX. Now that we’ve had the patch that introduced this functionality
> I’m patching for roughly a year, combined with the microcode going
> out, it seems like TSX’s numbered days have come to an end. 

Could you elaborate a little more here?  Why would we ever want to force
abort transactions that don't need to be aborted for some reason?

Jon Kohler April 12, 2022, 1:36 p.m. UTC | #4

> On Apr 11, 2022, at 7:45 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 4/11/22 12:35, Jon Kohler wrote:
>> Also, while I’ve got you, I’d also like to send out a patch to simply
>> force abort all transactions even when tsx=on, and just be done with
>> TSX. Now that we’ve had the patch that introduced this functionality
>> I’m patching for roughly a year, combined with the microcode going
>> out, it seems like TSX’s numbered days have come to an end. 
> 
> Could you elaborate a little more here?  Why would we ever want to force
> abort transactions that don't need to be aborted for some reason?

Sure, I'm talking specifically about users running tsx=on (or
CONFIG_X86_INTEL_TSX_MODE_ON) on X86_BUG_TAA CPU SKUs. In this situation,
TSX features are enabled, as are TAA mitigations. Using our own use case
as an example, we only do this because of legacy live migration reasons.

This is fine on Skylake (because we're signed up for MDS mitigation anyhow)
and fine on Ice Lake because TAA_NO=1; however this is wicked painful on
Cascade Lake, because MDS_NO=1 and TAA_NO=0, so we're still signed up for
TAA mitigation by default. On CLX, this hits us on host syscalls as well as
vmexits with the mds clear on every one :(
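
The three cases above can be written down as a small truth table. The
sketch below is only a model of the reasoning in this email, not the
kernel's mitigation selection code; both function names are
hypothetical, and it assumes buffer clearing is needed whenever the CPU
is MDS affected, or TAA affected with TSX left enabled.

```c
#include <stdbool.h>

/* True when buffer clearing (VERW) is needed at all, for MDS or TAA. */
bool verw_needed(bool mds_no, bool taa_no, bool tsx_enabled)
{
	bool mds_affected = !mds_no;
	bool taa_affected = !taa_no && tsx_enabled;

	return mds_affected || taa_affected;
}

/* True when leaving TSX enabled is the only reason clearing is needed. */
bool tsx_adds_cost(bool mds_no, bool taa_no, bool tsx_enabled)
{
	return verw_needed(mds_no, taa_no, tsx_enabled) &&
	       !verw_needed(mds_no, taa_no, false);
}
```

Under these assumptions only the Cascade Lake combination (MDS_NO=1,
TAA_NO=0, TSX on) makes tsx_adds_cost() return true; on Skylake the
clearing is already paid for by MDS, and with TAA_NO=1 there is nothing
to mitigate.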

So tsx=on is this oddball for us, because if we switch to auto, we'll break
live migration for some of our customers (but TAA overhead is gone), but
if we leave tsx=on, we keep the feature enabled (but no one likely uses it)
and still have to pay the TAA tax even if a customer doesn't use it.

So my theory here is to extend the logical effort of the microcode driven
automatic disablement as well as the tsx=auto automatic disablement and
have tsx=on force abort all transactions on X86_BUG_TAA SKUs, but leave
the CPU features enumerated to maintain live migration.

This would still leave TSX totally good on Ice Lake / non-buggy systems.

If it would help, I'm working up an RFC patch, and we could discuss there?

In the meantime, I did send out a v2 patch for this series addressing your
comments.

Thanks again,
Jon

Dave Hansen April 12, 2022, 3:54 p.m. UTC | #5

On 4/12/22 06:36, Jon Kohler wrote:
> So my theory here is to extend the logical effort of the microcode driven
> automatic disablement as well as the tsx=auto automatic disablement and
> have tsx=on force abort all transactions on X86_BUG_TAA SKUs, but leave
> the CPU features enumerated to maintain live migration.
> 
> This would still leave TSX totally good on Ice Lake / non-buggy systems.
> 
> If it would help, I'm working up an RFC patch, and we could discuss there?

Sure.  But, it sounds like you really want a new tsx=something rather
than to muck with tsx=on behavior.  Surely someone else will come along
and complain that we broke their TSX setup if we change its behavior.

Maybe you should just pay the one-time cost and move your whole fleet
over to tsx=off if you truly believe nobody is using it.

Jon Kohler April 12, 2022, 4:08 p.m. UTC | #6

> On Apr 12, 2022, at 11:54 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 4/12/22 06:36, Jon Kohler wrote:
>> So my theory here is to extend the logical effort of the microcode driven
>> automatic disablement as well as the tsx=auto automatic disablement and
>> have tsx=on force abort all transactions on X86_BUG_TAA SKUs, but leave
>> the CPU features enumerated to maintain live migration.
>> 
>> This would still leave TSX totally good on Ice Lake / non-buggy systems.
>> 
>> If it would help, I'm working up an RFC patch, and we could discuss there?
> 
> Sure.  But, it sounds like you really want a new tsx=something rather
> than to muck with tsx=on behavior.  Surely someone else will come along
> and complain that we broke their TSX setup if we change its behavior.

Good point, there will always be a squeaky wheel. I’ll work that into the RFC,
I’ll do something like tsx=compat and see how it shapes up. 

To be fair though, this commit I’m patching with this series would break
setups as they apply 5.14+ and the microcode update, but you have a 
good point for certain.

> 
> Maybe you should just pay the one-time cost and move your whole fleet
> over to tsx=off if you truly believe nobody is using it.
> 

Trust me, I’d love to do that; however:
We’ve thousands of hosts across thousands of unique customers,
which aren't managed as a centralized service (customers manage them directly),
so doing that would require each individual customer to organize a full power
cycle for all of their VMs prior to an upgrade to tsx=off hosts.

That said, we are marching in that direction, we're shipping a control plane
update that will mask HLE and RTM after power cycles, but that requires
customers to apply that control plane update, then power cycle everything. Just
means that we've begun the feature deprecation now, but it will take years to
fully bleed off without requiring customers to micromanage full power cycles.

Pawan Gupta April 12, 2022, 6:04 p.m. UTC | #7

On Tue, Apr 12, 2022 at 04:08:32PM +0000, Jon Kohler wrote:
>
>
>> On Apr 12, 2022, at 11:54 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 4/12/22 06:36, Jon Kohler wrote:
>>> So my theory here is to extend the logical effort of the microcode driven
>>> automatic disablement as well as the tsx=auto automatic disablement and
>>> have tsx=on force abort all transactions on X86_BUG_TAA SKUs, but leave
>>> the CPU features enumerated to maintain live migration.
>>>
>>> This would still leave TSX totally good on Ice Lake / non-buggy systems.
>>>
>>> If it would help, I'm working up an RFC patch, and we could discuss there?
>>
>> Sure.  But, it sounds like you really want a new tsx=something rather
>> than to muck with tsx=on behavior.  Surely someone else will come along
>> and complain that we broke their TSX setup if we change its behavior.
>
>Good point, there will always be a squeaky wheel. I’ll work that into the RFC,
>I’ll do something like tsx=compat and see how it shapes up.

FYI, the original series had tsx=fake, that would have taken care of
this breakage.

   https://lore.kernel.org/lkml/de6b97a567e273adff1f5268998692bad548aa10.1623272033.git-series.pawan.kumar.gupta@linux.intel.com/

For the lack of real world use-cases at that time, this patch was dropped.

Thanks,
Pawan

Jon Kohler April 12, 2022, 6:12 p.m. UTC | #8

> On Apr 12, 2022, at 2:04 PM, Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> 
> On Tue, Apr 12, 2022 at 04:08:32PM +0000, Jon Kohler wrote:
>> 
>> 
>>> On Apr 12, 2022, at 11:54 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>> 
>>> On 4/12/22 06:36, Jon Kohler wrote:
>>>> So my theory here is to extend the logical effort of the microcode driven
>>>> automatic disablement as well as the tsx=auto automatic disablement and
>>>> have tsx=on force abort all transactions on X86_BUG_TAA SKUs, but leave
>>>> the CPU features enumerated to maintain live migration.
>>>> 
>>>> This would still leave TSX totally good on Ice Lake / non-buggy systems.
>>>> 
>>>> If it would help, I'm working up an RFC patch, and we could discuss there?
>>> 
>>> Sure.  But, it sounds like you really want a new tsx=something rather
>>> than to muck with tsx=on behavior.  Surely someone else will come along
>>> and complain that we broke their TSX setup if we change its behavior.
>> 
>> Good point, there will always be a squeaky wheel. I’ll work that into the RFC,
>> I’ll do something like tsx=compat and see how it shapes up.
> 
> FYI, the original series had tsx=fake, that would have taken care of
> this breakage.

Fake sounds way better than compat, which is what I had :) 

My RFC code looks similar to your patch, I’ll combine the
approaches and send it out shortly, almost done

> 
>  https://lore.kernel.org/lkml/de6b97a567e273adff1f5268998692bad548aa10.1623272033.git-series.pawan.kumar.gupta@linux.intel.com/
>
> For the lack of real world use-cases at that time, this patch was dropped.
> 
> Thanks,
> Pawan

Pawan Gupta April 12, 2022, 8:40 p.m. UTC | #9

On Tue, Apr 12, 2022 at 01:36:20PM +0000, Jon Kohler wrote:
>
>
>> On Apr 11, 2022, at 7:45 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 4/11/22 12:35, Jon Kohler wrote:
>>> Also, while I’ve got you, I’d also like to send out a patch to simply
>>> force abort all transactions even when tsx=on, and just be done with
>>> TSX. Now that we’ve had the patch that introduced this functionality
>>> I’m patching for roughly a year, combined with the microcode going
>>> out, it seems like TSX’s numbered days have come to an end.
>>
>> Could you elaborate a little more here?  Why would we ever want to force
>> abort transactions that don't need to be aborted for some reason?
>
>Sure, I'm talking specifically about users running tsx=on (or
>CONFIG_X86_INTEL_TSX_MODE_ON) on X86_BUG_TAA CPU SKUs. In this situation,
>TSX features are enabled, as are TAA mitigations. Using our own use case
>as an example, we only do this because of legacy live migration reasons.
>
>This is fine on Skylake (because we're signed up for MDS mitigation anyhow)
>and fine on Ice Lake because TAA_NO=1; however this is wicked painful on
>Cascade Lake, because MDS_NO=1 and TAA_NO=0, so we're still signed up for
>TAA mitigation by default. On CLX, this hits us on host syscalls as well as
>vmexits with the mds clear on every one :(
>
>So tsx=on is this oddball for us, because if we switch to auto, we'll break
>live migration for some of our customers (but TAA overhead is gone), but
>if we leave tsx=on, we keep the feature enabled (but no one likely uses it)
>and still have to pay the TAA tax even if a customer doesn't use it.
>
>So my theory here is to extend the logical effort of the microcode driven
>automatic disablement as well as the tsx=auto automatic disablement and
>have tsx=on force abort all transactions on X86_BUG_TAA SKUs, but leave
>the CPU features enumerated to maintain live migration.

This won't help on CLX as server parts did not get the microcode driven
automatic disablement. On CLX CPUID.RTM_ALWAYS_ABORT will not be set.

What could work on CLX is TSX_CTRL_RTM_DISABLE=1 and
TSX_CTRL_CPUID_CLEAR=0. This can be done for tsx=auto or with a new mode
tsx=fake|compat. IMO, adding a new mode would be better, otherwise
tsx=auto behavior will differ depending on the kernel version.
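
As a rough illustration of that combination, the sketch below computes
the IA32_TSX_CTRL value for such a mode: RTM disabled, CPUID
enumeration left alone. The function name tsx_ctrl_fake_value() is
hypothetical and only does the bit arithmetic in userspace; reading and
writing the MSR on each CPU is deliberately left out.

```c
#include <stdint.h>

/* IA32_TSX_CTRL MSR and its two control bits. */
#define MSR_IA32_TSX_CTRL	0x00000122
#define TSX_CTRL_RTM_DISABLE	(UINT64_C(1) << 0)
#define TSX_CTRL_CPUID_CLEAR	(UINT64_C(1) << 1)

/*
 * Hypothetical helper: given the current MSR value, return the value a
 * "fake"/"compat" mode would write. Transactions always abort, but the
 * HLE/RTM CPUID bits stay enumerated. All other bits are preserved.
 */
uint64_t tsx_ctrl_fake_value(uint64_t cur)
{
	uint64_t val = cur;

	val |= TSX_CTRL_RTM_DISABLE;	/* RTM_DISABLE = 1 */
	val &= ~TSX_CTRL_CPUID_CLEAR;	/* CPUID_CLEAR = 0 */
	return val;
}
```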

Provided that software using TSX is following below guidance [*]:

   When Intel TSX is disabled at runtime using TSX_CTRL, but the CPUID
   enumeration of Intel TSX is not cleared, existing software using RTM may
   see aborts for every transaction. The abort will always return a 0
   status code in EAX after XBEGIN. When the software does a number of
   transaction retries, it should never retry for a 0 status value, but go
   to the nontransactional fall back path immediately.
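
A retry loop that follows this guidance might look like the sketch
below. To keep it runnable on machines without TSX, the transaction
begin is passed in as a function pointer rather than calling XBEGIN
directly; try_transaction(), count_attempts(), and the two fake_* stubs
are hypothetical names for illustration. The constants use the same
values as _XBEGIN_STARTED and _XABORT_RETRY in <immintrin.h>.

```c
#include <stdbool.h>
#include <stddef.h>

#define XBEGIN_STARTED	(~0u)		/* same value as _XBEGIN_STARTED */
#define XABORT_RETRY	(1u << 1)	/* same bit as _XABORT_RETRY */

/* Stand-in for XBEGIN: returns XBEGIN_STARTED or an abort status. */
typedef unsigned int (*xbegin_fn)(void);

/*
 * Returns true if a transaction started. Returns false if the caller
 * should take the nontransactional fallback path. A zero status is
 * never retried; other aborts are retried only if the hardware set the
 * retry hint. *attempts is incremented once per begin call.
 */
bool try_transaction(xbegin_fn begin, int max_retries, int *attempts)
{
	for (int i = 0; i <= max_retries; i++) {
		unsigned int status = begin();

		(*attempts)++;
		if (status == XBEGIN_STARTED)
			return true;
		if (status == 0 || !(status & XABORT_RETRY))
			break;
	}
	return false;
}

/* Models TSX disabled via TSX_CTRL: every begin aborts with status 0. */
unsigned int fake_xbegin_disabled(void)
{
	return 0;
}

/* Models a transient conflict: aborts with the retry hint set. */
unsigned int fake_xbegin_transient(void)
{
	return XABORT_RETRY;
}

/* Helper for experiments: how many begins did the policy issue? */
int count_attempts(xbegin_fn begin, int max_retries)
{
	int attempts = 0;

	(void)try_transaction(begin, max_retries, &attempts);
	return attempts;
}
```

With the disabled stub the loop gives up after a single attempt no
matter how large max_retries is, which is the behavior the guidance
asks for.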

Thanks,
Pawan

[*] TAA document: section -> Implications on Intel TSX software
     https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-tsx-asynchronous-abort.html

Jon Kohler April 13, 2022, 12:43 p.m. UTC | #10

> On Apr 12, 2022, at 4:40 PM, Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> 
> On Tue, Apr 12, 2022 at 01:36:20PM +0000, Jon Kohler wrote:
>> 
>> 
>>> On Apr 11, 2022, at 7:45 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>> 
>>> On 4/11/22 12:35, Jon Kohler wrote:
>>>> Also, while I’ve got you, I’d also like to send out a patch to simply
>>>> force abort all transactions even when tsx=on, and just be done with
>>>> TSX. Now that we’ve had the patch that introduced this functionality
>>>> I’m patching for roughly a year, combined with the microcode going
>>>> out, it seems like TSX’s numbered days have come to an end.
>>> 
>>> Could you elaborate a little more here?  Why would we ever want to force
>>> abort transactions that don't need to be aborted for some reason?
>> 
>> Sure, I'm talking specifically about users running tsx=on (or
>> CONFIG_X86_INTEL_TSX_MODE_ON) on X86_BUG_TAA CPU SKUs. In this situation,
>> TSX features are enabled, as are TAA mitigations. Using our own use case
>> as an example, we only do this because of legacy live migration reasons.
>> 
>> This is fine on Skylake (because we're signed up for MDS mitigation anyhow)
>> and fine on Ice Lake because TAA_NO=1; however this is wicked painful on
>> Cascade Lake, because MDS_NO=1 and TAA_NO=0, so we're still signed up for
>> TAA mitigation by default. On CLX, this hits us on host syscalls as well as
>> vmexits with the mds clear on every one :(
>> 
>> So tsx=on is this oddball for us, because if we switch to auto, we'll break
>> live migration for some of our customers (but TAA overhead is gone), but
>> if we leave tsx=on, we keep the feature enabled (but no one likely uses it)
>> and still have to pay the TAA tax even if a customer doesn't use it.
>> 
>> So my theory here is to extend the logical effort of the microcode driven
>> automatic disablement as well as the tsx=auto automatic disablement and
>> have tsx=on force abort all transactions on X86_BUG_TAA SKUs, but leave
>> the CPU features enumerated to maintain live migration.
> 
> This won't help on CLX as server parts did not get the microcode driven
> automatic disablement. On CLX CPUID.RTM_ALWAYS_ABORT will not be set.
> 
> What could work on CLX is TSX_CTRL_RTM_DISABLE=1 and
> TSX_CTRL_CPUID_CLEAR=0. This can be done for tsx=auto or with a new mode
> tsx=fake|compat. IMO, adding a new mode would be better, otherwise
> tsx=auto behavior will differ depending on the kernel version.

Thanks for the guidance, Pawan, I appreciate it. This is exactly the
approach my other patch is taking. I need to do a bit more review and
testing, and I'll get the RFC out.

> 
> Provided that software using TSX is following below guidance [*]:
> 
>  When Intel TSX is disabled at runtime using TSX_CTRL, but the CPUID
>  enumeration of Intel TSX is not cleared, existing software using RTM may
>  see aborts for every transaction. The abort will always return a 0
>  status code in EAX after XBEGIN. When the software does a number of
>  transaction retries, it should never retry for a 0 status value, but go
>  to the nontransactional fall back path immediately.
> 
> Thanks,
> Pawan
> 
> [*] TAA document: section -> Implications on Intel TSX software
>    https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-tsx-asynchronous-abort.html
