Message ID | 20220411180131.5054-1-jon@nutanix.com (mailing list archive) |
---|---|
State | New, archived |
Series | x86/tsx: fix KVM guest live migration for tsx=on |
On 4/11/22 11:01, Jon Kohler wrote:
>  static enum tsx_ctrl_states x86_get_tsx_auto_mode(void)
>  {
> +	/*
> +	 * Hardware will always abort a TSX transaction if both CPUID bits
> +	 * RTM_ALWAYS_ABORT and TSX_FORCE_ABORT are set. In this case, it is
> +	 * better not to enumerate CPUID.RTM and CPUID.HLE bits. Clear them
> +	 * here.
> +	 */
> +	if (boot_cpu_has(X86_FEATURE_RTM_ALWAYS_ABORT) &&
> +	    boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)) {
> +		tsx_clear_cpuid();
> +		setup_clear_cpu_cap(X86_FEATURE_RTM);
> +		setup_clear_cpu_cap(X86_FEATURE_HLE);
> +		return TSX_CTRL_RTM_ALWAYS_ABORT;
> +	}

I don't really like hiding the setup_clear_cpu_cap() like this.  Right
now, all of the setup_clear_cpu_cap()'s are in a single function and
they are pretty easy to figure out.

This seems like logic that deserves to be appended down to the last if()
block of code in tsx_init() instead of squirreled away in a "get mode"
function.  Does this work?

	if (tsx_ctrl_state == TSX_CTRL_DISABLE) {
		...
	} else if (tsx_ctrl_state == TSX_CTRL_ENABLE) {
		...
	} else if (tsx_ctrl_state == TSX_CTRL_RTM_ALWAYS_ABORT) {
		tsx_clear_cpuid();

		setup_clear_cpu_cap(X86_FEATURE_RTM);
		setup_clear_cpu_cap(X86_FEATURE_HLE);
	}
> On Apr 11, 2022, at 3:26 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/11/22 11:01, Jon Kohler wrote:
>>  static enum tsx_ctrl_states x86_get_tsx_auto_mode(void)
>>  {
>> +	/*
>> +	 * Hardware will always abort a TSX transaction if both CPUID bits
>> +	 * RTM_ALWAYS_ABORT and TSX_FORCE_ABORT are set. In this case, it is
>> +	 * better not to enumerate CPUID.RTM and CPUID.HLE bits. Clear them
>> +	 * here.
>> +	 */
>> +	if (boot_cpu_has(X86_FEATURE_RTM_ALWAYS_ABORT) &&
>> +	    boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)) {
>> +		tsx_clear_cpuid();
>> +		setup_clear_cpu_cap(X86_FEATURE_RTM);
>> +		setup_clear_cpu_cap(X86_FEATURE_HLE);
>> +		return TSX_CTRL_RTM_ALWAYS_ABORT;
>> +	}
>
> I don't really like hiding the setup_clear_cpu_cap() like this.  Right
> now, all of the setup_clear_cpu_cap()'s are in a single function and
> they are pretty easy to figure out.
>
> This seems like logic that deserves to be appended down to the last if()
> block of code in tsx_init() instead of squirreled away in a "get mode"
> function.  Does this work?

Thanks for the review, Dave. Was trying to make the change simple with
just a cut-n-paste of existing code from one place to the other, but I
see what you’re saying.

Yea, I can rework the logic as you suggested, I’ll send out a v2 patch.

Also, while I’ve got you, I’d also like to send out a patch to simply
force abort all transactions even when tsx=on, and just be done with
TSX. Now that we’ve had the patch that introduced this functionality
I’m patching for roughly a year, combined with the microcode going
out, it seems like TSX’s numbered days have come to an end.

That could greatly simplify the kernel’s handling of TAA on systems that
have ARCH_CAP_TSX_CTRL_MSR. Thoughts?

>	if (tsx_ctrl_state == TSX_CTRL_DISABLE) {
>		...
>	} else if (tsx_ctrl_state == TSX_CTRL_ENABLE) {
>		...
>	} else if (tsx_ctrl_state == TSX_CTRL_RTM_ALWAYS_ABORT) {
>		tsx_clear_cpuid();
>
>		setup_clear_cpu_cap(X86_FEATURE_RTM);
>		setup_clear_cpu_cap(X86_FEATURE_HLE);
>	}
>
On 4/11/22 12:35, Jon Kohler wrote:
> Also, while I’ve got you, I’d also like to send out a patch to simply
> force abort all transactions even when tsx=on, and just be done with
> TSX. Now that we’ve had the patch that introduced this functionality
> I’m patching for roughly a year, combined with the microcode going
> out, it seems like TSX’s numbered days have come to an end.

Could you elaborate a little more here?  Why would we ever want to force
abort transactions that don't need to be aborted for some reason?
> On Apr 11, 2022, at 7:45 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/11/22 12:35, Jon Kohler wrote:
>> Also, while I’ve got you, I’d also like to send out a patch to simply
>> force abort all transactions even when tsx=on, and just be done with
>> TSX. Now that we’ve had the patch that introduced this functionality
>> I’m patching for roughly a year, combined with the microcode going
>> out, it seems like TSX’s numbered days have come to an end.
>
> Could you elaborate a little more here?  Why would we ever want to force
> abort transactions that don't need to be aborted for some reason?

Sure, I'm talking specifically about when users of tsx=on (or
CONFIG_X86_INTEL_TSX_MODE_ON) on X86_BUG_TAA CPU SKUs. In this situation,
TSX features are enabled, as are TAA mitigations. Using our own use case
as an example, we only do this because of legacy live migration reasons.

This is fine on Skylake (because we're signed up for MDS mitigation anyhow)
and fine on Ice Lake because TAA_NO=1; however this is wicked painful on
Cascade Lake, because MDS_NO=1 and TAA_NO=0, so we're still signed up for
TAA mitigation by default. On CLX, this hits us on host syscalls as well as
vmexits with the mds clear on every one :(

So tsx=on is this oddball for us, because if we switch to auto, we'll break
live migration for some of our customers (but TAA overhead is gone), but
if we leave tsx=on, we keep the feature enabled (but no one likely uses it)
and still have to pay the TAA tax even if a customer doesn't use it.

So my theory here is to extend the logical effort of the microcode driven
automatic disablement as well as the tsx=auto automatic disablement and
have tsx=on force abort all transactions on X86_BUG_TAA SKUs, but leave
the CPU features enumerated to maintain live migration.

This would still leave TSX totally good on Ice Lake / non-buggy systems.

If it would help, I'm working up an RFC patch, and we could discuss there?

In the meantime, I did send out a v2 patch for this series addressing
your comments.

Thanks again,
Jon
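As a rough illustration of the cost matrix Jon describes above, the sketch below models when buffer clearing is owed only because TSX is left enabled. It is a user-space toy, not kernel code: `struct cpu_caps` and `tsx_adds_clearing_cost()` are hypothetical names, and the Skylake `taa_no` and Ice Lake `mds_no` values are assumptions that do not affect the result.

```c
#include <stdbool.h>

/* Hypothetical summary of the two IA32_ARCH_CAPABILITIES bits discussed above. */
struct cpu_caps {
	bool mds_no;	/* MDS_NO=1: part is not affected by MDS */
	bool taa_no;	/* TAA_NO=1: part is not affected by TAA */
};

/* Buffer clearing is already owed for MDS when MDS_NO=0. */
static bool clears_for_mds(struct cpu_caps c)
{
	return !c.mds_no;
}

/* Buffer clearing is owed for TAA when TAA_NO=0 and TSX stays enabled. */
static bool clears_for_taa(struct cpu_caps c, bool tsx_enabled)
{
	return !c.taa_no && tsx_enabled;
}

/* True when leaving TSX enabled is the only reason clearing is paid. */
static bool tsx_adds_clearing_cost(struct cpu_caps c, bool tsx_enabled)
{
	return clears_for_taa(c, tsx_enabled) && !clears_for_mds(c);
}

/* The three parts named in the mail. */
static const struct cpu_caps skylake      = { .mds_no = false, .taa_no = false }; /* taa_no assumed */
static const struct cpu_caps cascade_lake = { .mds_no = true,  .taa_no = false };
static const struct cpu_caps ice_lake     = { .mds_no = true,  .taa_no = true  }; /* mds_no assumed */
```

Under this model only `cascade_lake` with TSX enabled reports an added cost, which matches the "wicked painful on Cascade Lake" remark.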
On 4/12/22 06:36, Jon Kohler wrote:
> So my theory here is to extend the logical effort of the microcode driven
> automatic disablement as well as the tsx=auto automatic disablement and
> have tsx=on force abort all transactions on X86_BUG_TAA SKUs, but leave
> the CPU features enumerated to maintain live migration.
>
> This would still leave TSX totally good on Ice Lake / non-buggy systems.
>
> If it would help, I'm working up an RFC patch, and we could discuss there?

Sure.  But, it sounds like you really want a new tsx=something rather
than to muck with tsx=on behavior.  Surely someone else will come along
and complain that we broke their TSX setup if we change its behavior.

Maybe you should just pay the one-time cost and move your whole fleet
over to tsx=off if you truly believe nobody is using it.
> On Apr 12, 2022, at 11:54 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/12/22 06:36, Jon Kohler wrote:
>> So my theory here is to extend the logical effort of the microcode driven
>> automatic disablement as well as the tsx=auto automatic disablement and
>> have tsx=on force abort all transactions on X86_BUG_TAA SKUs, but leave
>> the CPU features enumerated to maintain live migration.
>>
>> This would still leave TSX totally good on Ice Lake / non-buggy systems.
>>
>> If it would help, I'm working up an RFC patch, and we could discuss there?
>
> Sure.  But, it sounds like you really want a new tsx=something rather
> than to muck with tsx=on behavior.  Surely someone else will come along
> and complain that we broke their TSX setup if we change its behavior.

Good point, there will always be a squeaky wheel. I’ll work that into the
RFC, I’ll do something like tsx=compat and see how it shapes up.

To be fair though, this commit I’m patching with this series would break
setups as they apply 5.14+ and the microcode update, but you have a good
point for certain.

>
> Maybe you should just pay the one-time cost and move your whole fleet
> over to tsx=off if you truly believe nobody is using it.
>

Trust me, I’d love to do that; however:

We’ve thousands of hosts across thousands of unique customers, which aren't
managed as a centralized service (customers manage them directly), so doing
that would require each individual customer to organize a full power cycle
for all of their VMs prior to an upgrade to tsx=off hosts.

That said, we are marching in that direction, we're shipping a control plane
update that will mask HLE and RTM after power cycles, but that requires
customers to apply that control plane update, then power cycle everything.

It just means that although we've begun the feature deprecation now, it will
take years to fully bleed off unless customers micromanage full power cycles.
On Tue, Apr 12, 2022 at 04:08:32PM +0000, Jon Kohler wrote:
>
>
>> On Apr 12, 2022, at 11:54 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 4/12/22 06:36, Jon Kohler wrote:
>>> So my theory here is to extend the logical effort of the microcode driven
>>> automatic disablement as well as the tsx=auto automatic disablement and
>>> have tsx=on force abort all transactions on X86_BUG_TAA SKUs, but leave
>>> the CPU features enumerated to maintain live migration.
>>>
>>> This would still leave TSX totally good on Ice Lake / non-buggy systems.
>>>
>>> If it would help, I'm working up an RFC patch, and we could discuss there?
>>
>> Sure.  But, it sounds like you really want a new tsx=something rather
>> than to muck with tsx=on behavior.  Surely someone else will come along
>> and complain that we broke their TSX setup if we change its behavior.
>
> Good point, there will always be a squeaky wheel. I’ll work that into the RFC,
> I’ll do something like tsx=compat and see how it shapes up.

FYI, the original series had tsx=fake, that would have taken care of
this breakage.

https://lore.kernel.org/lkml/de6b97a567e273adff1f5268998692bad548aa10.1623272033.git-series.pawan.kumar.gupta@linux.intel.com/

For the lack of real world use-cases at that time, this patch was dropped.

Thanks,
Pawan
> On Apr 12, 2022, at 2:04 PM, Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
>
> On Tue, Apr 12, 2022 at 04:08:32PM +0000, Jon Kohler wrote:
>>
>>
>>> On Apr 12, 2022, at 11:54 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>>
>>> On 4/12/22 06:36, Jon Kohler wrote:
>>>> So my theory here is to extend the logical effort of the microcode driven
>>>> automatic disablement as well as the tsx=auto automatic disablement and
>>>> have tsx=on force abort all transactions on X86_BUG_TAA SKUs, but leave
>>>> the CPU features enumerated to maintain live migration.
>>>>
>>>> This would still leave TSX totally good on Ice Lake / non-buggy systems.
>>>>
>>>> If it would help, I'm working up an RFC patch, and we could discuss there?
>>>
>>> Sure.  But, it sounds like you really want a new tsx=something rather
>>> than to muck with tsx=on behavior.  Surely someone else will come along
>>> and complain that we broke their TSX setup if we change its behavior.
>>
>> Good point, there will always be a squeaky wheel. I’ll work that into the RFC,
>> I’ll do something like tsx=compat and see how it shapes up.
>
> FYI, the original series had tsx=fake, that would have taken care of
> this breakage.

Fake sounds way better than compat, which is what I had :)

My RFC code looks similar to your patch, I’ll combine the approaches and
send it out shortly, almost done.

>
> https://lore.kernel.org/lkml/de6b97a567e273adff1f5268998692bad548aa10.1623272033.git-series.pawan.kumar.gupta@linux.intel.com/
>
> For the lack of real world use-cases at that time, this patch was dropped.
>
> Thanks,
> Pawan
On Tue, Apr 12, 2022 at 01:36:20PM +0000, Jon Kohler wrote:
>
>
>> On Apr 11, 2022, at 7:45 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 4/11/22 12:35, Jon Kohler wrote:
>>> Also, while I’ve got you, I’d also like to send out a patch to simply
>>> force abort all transactions even when tsx=on, and just be done with
>>> TSX. Now that we’ve had the patch that introduced this functionality
>>> I’m patching for roughly a year, combined with the microcode going
>>> out, it seems like TSX’s numbered days have come to an end.
>>
>> Could you elaborate a little more here?  Why would we ever want to force
>> abort transactions that don't need to be aborted for some reason?
>
> Sure, I'm talking specifically about when users of tsx=on (or
> CONFIG_X86_INTEL_TSX_MODE_ON) on X86_BUG_TAA CPU SKUs. In this situation,
> TSX features are enabled, as are TAA mitigations. Using our own use case
> as an example, we only do this because of legacy live migration reasons.
>
> This is fine on Skylake (because we're signed up for MDS mitigation anyhow)
> and fine on Ice Lake because TAA_NO=1; however this is wicked painful on
> Cascade Lake, because MDS_NO=1 and TAA_NO=0, so we're still signed up for
> TAA mitigation by default. On CLX, this hits us on host syscalls as well as
> vmexits with the mds clear on every one :(
>
> So tsx=on is this oddball for us, because if we switch to auto, we'll break
> live migration for some of our customers (but TAA overhead is gone), but
> if we leave tsx=on, we keep the feature enabled (but no one likely uses it)
> and still have to pay the TAA tax even if a customer doesn't use it.
>
> So my theory here is to extend the logical effort of the microcode driven
> automatic disablement as well as the tsx=auto automatic disablement and
> have tsx=on force abort all transactions on X86_BUG_TAA SKUs, but leave
> the CPU features enumerated to maintain live migration.

This won't help on CLX as server parts did not get the microcode driven
automatic disablement. On CLX CPUID.RTM_ALWAYS_ABORT will not be set.

What could work on CLX is TSX_CTRL_RTM_DISABLE=1 and
TSX_CTRL_CPUID_CLEAR=0. This can be done for tsx=auto or with a new mode
tsx=fake|compat. IMO, adding a new mode would be better, otherwise
tsx=auto behavior will differ depending on the kernel version.

Provided that software using TSX is following below guidance [*]:

  When Intel TSX is disabled at runtime using TSX_CTRL, but the CPUID
  enumeration of Intel TSX is not cleared, existing software using RTM may
  see aborts for every transaction. The abort will always return a 0
  status code in EAX after XBEGIN. When the software does a number of
  transaction retries, it should never retry for a 0 status value, but go
  to the nontransactional fall back path immediately.

Thanks,
Pawan

[*] TAA document: section -> Implications on Intel TSX software
    https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-tsx-asynchronous-abort.html
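The retry guidance quoted above can be sketched in portable C. This is a simulation under stated assumptions, not real RTM code: `TXN_COMMITTED`, `txn_attempt_fn`, `run_with_fallback()` and the demo helpers are hypothetical names, and `ABORT_RETRY` merely mirrors the "may succeed on retry" bit of the RTM abort status. A real implementation would call the compiler's RTM intrinsics in place of the `attempt` callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TXN_COMMITTED	(~0u)		/* hypothetical sentinel: the attempt committed */
#define ABORT_RETRY	(1u << 1)	/* models the "retry may succeed" abort status bit */

/* One transactional attempt: returns TXN_COMMITTED or an abort status. */
typedef unsigned int (*txn_attempt_fn)(void *ctx);

/* Never retry on a zero status; otherwise retry only if the hint bit is set. */
static bool should_retry(unsigned int status, int attempts_left)
{
	if (status == 0)
		return false;
	return (status & ABORT_RETRY) && attempts_left > 0;
}

/* Returns true if the work committed transactionally, false if the fallback ran. */
static bool run_with_fallback(txn_attempt_fn attempt, void (*fallback)(void *),
			      void *ctx, int max_retries)
{
	assert(attempt != NULL && fallback != NULL && max_retries >= 0);

	for (int left = max_retries; ; left--) {
		unsigned int status = attempt(ctx);

		if (status == TXN_COMMITTED)
			return true;
		if (!should_retry(status, left))
			break;
	}
	fallback(ctx);
	return false;
}

/* Demo helpers: simulate a CPU where every transaction aborts with status 0. */
struct demo_ctx {
	int attempts;
	int fallbacks;
};

static unsigned int always_abort_zero(void *ctx)
{
	((struct demo_ctx *)ctx)->attempts++;
	return 0;
}

static void take_lock_path(void *ctx)
{
	((struct demo_ctx *)ctx)->fallbacks++;
}
```

With `always_abort_zero` as the attempt, `run_with_fallback()` makes one attempt and then takes the lock path regardless of `max_retries`. Software written this way should keep working when RTM is disabled but still enumerated.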
> On Apr 12, 2022, at 4:40 PM, Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
>
> On Tue, Apr 12, 2022 at 01:36:20PM +0000, Jon Kohler wrote:
>>
>>
>>> On Apr 11, 2022, at 7:45 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>>
>>> On 4/11/22 12:35, Jon Kohler wrote:
>>>> Also, while I’ve got you, I’d also like to send out a patch to simply
>>>> force abort all transactions even when tsx=on, and just be done with
>>>> TSX. Now that we’ve had the patch that introduced this functionality
>>>> I’m patching for roughly a year, combined with the microcode going
>>>> out, it seems like TSX’s numbered days have come to an end.
>>>
>>> Could you elaborate a little more here?  Why would we ever want to force
>>> abort transactions that don't need to be aborted for some reason?
>>
>> Sure, I'm talking specifically about when users of tsx=on (or
>> CONFIG_X86_INTEL_TSX_MODE_ON) on X86_BUG_TAA CPU SKUs. In this situation,
>> TSX features are enabled, as are TAA mitigations. Using our own use case
>> as an example, we only do this because of legacy live migration reasons.
>>
>> This is fine on Skylake (because we're signed up for MDS mitigation anyhow)
>> and fine on Ice Lake because TAA_NO=1; however this is wicked painful on
>> Cascade Lake, because MDS_NO=1 and TAA_NO=0, so we're still signed up for
>> TAA mitigation by default. On CLX, this hits us on host syscalls as well as
>> vmexits with the mds clear on every one :(
>>
>> So tsx=on is this oddball for us, because if we switch to auto, we'll break
>> live migration for some of our customers (but TAA overhead is gone), but
>> if we leave tsx=on, we keep the feature enabled (but no one likely uses it)
>> and still have to pay the TAA tax even if a customer doesn't use it.
>>
>> So my theory here is to extend the logical effort of the microcode driven
>> automatic disablement as well as the tsx=auto automatic disablement and
>> have tsx=on force abort all transactions on X86_BUG_TAA SKUs, but leave
>> the CPU features enumerated to maintain live migration.
>
> This won't help on CLX as server parts did not get the microcode driven
> automatic disablement. On CLX CPUID.RTM_ALWAYS_ABORT will not be set.
>
> What could work on CLX is TSX_CTRL_RTM_DISABLE=1 and
> TSX_CTRL_CPUID_CLEAR=0. This can be done for tsx=auto or with a new mode
> tsx=fake|compat. IMO, adding a new mode would be better, otherwise
> tsx=auto behavior will differ depending on the kernel version.

Thanks for the guidance, Pawan, I appreciate it. This is exactly the
approach my other patch is taking. Need to do a bit more review and
testing and I’ll get the RFC out.

>
> Provided that software using TSX is following below guidance [*]:
>
>   When Intel TSX is disabled at runtime using TSX_CTRL, but the CPUID
>   enumeration of Intel TSX is not cleared, existing software using RTM may
>   see aborts for every transaction. The abort will always return a 0
>   status code in EAX after XBEGIN. When the software does a number of
>   transaction retries, it should never retry for a 0 status value, but go
>   to the nontransactional fall back path immediately.
>
> Thanks,
> Pawan
>
> [*] TAA document: section -> Implications on Intel TSX software
>     https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-tsx-asynchronous-abort.html
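For readers following along, the TSX_CTRL setting discussed above (RTM_DISABLE=1, CPUID_CLEAR=0) could be composed roughly as below. This is only a sketch: `tsx_ctrl_compose()` is a hypothetical helper, the bit positions are the ones the kernel's msr-index.h uses for IA32_TSX_CTRL, and the actual MSR read and write are left out because they need ring 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* IA32_TSX_CTRL bit definitions, as in the kernel's msr-index.h. */
#define TSX_CTRL_RTM_DISABLE	(1ULL << 0)	/* force-abort RTM transactions */
#define TSX_CTRL_CPUID_CLEAR	(1ULL << 1)	/* hide RTM/HLE in CPUID */

/*
 * Hypothetical helper: compute a new TSX_CTRL value from the current one,
 * touching only the two bits discussed in the thread.
 */
static uint64_t tsx_ctrl_compose(uint64_t current, bool rtm_disable, bool cpuid_clear)
{
	uint64_t val = current & ~(TSX_CTRL_RTM_DISABLE | TSX_CTRL_CPUID_CLEAR);

	if (rtm_disable)
		val |= TSX_CTRL_RTM_DISABLE;
	if (cpuid_clear)
		val |= TSX_CTRL_CPUID_CLEAR;
	return val;
}
```

A "fake" or "compat" mode as proposed would correspond to `tsx_ctrl_compose(current, true, false)`: transactions abort, yet guests continue to see RTM and HLE in CPUID.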
diff --git a/arch/x86/kernel/cpu/tsx.c b/arch/x86/kernel/cpu/tsx.c
index 9c7a5f049292..a24e5e471e3f 100644
--- a/arch/x86/kernel/cpu/tsx.c
+++ b/arch/x86/kernel/cpu/tsx.c
@@ -78,6 +78,20 @@ static bool __init tsx_ctrl_is_supported(void)
 
 static enum tsx_ctrl_states x86_get_tsx_auto_mode(void)
 {
+	/*
+	 * Hardware will always abort a TSX transaction if both CPUID bits
+	 * RTM_ALWAYS_ABORT and TSX_FORCE_ABORT are set. In this case, it is
+	 * better not to enumerate CPUID.RTM and CPUID.HLE bits. Clear them
+	 * here.
+	 */
+	if (boot_cpu_has(X86_FEATURE_RTM_ALWAYS_ABORT) &&
+	    boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)) {
+		tsx_clear_cpuid();
+		setup_clear_cpu_cap(X86_FEATURE_RTM);
+		setup_clear_cpu_cap(X86_FEATURE_HLE);
+		return TSX_CTRL_RTM_ALWAYS_ABORT;
+	}
+
 	if (boot_cpu_has_bug(X86_BUG_TAA))
 		return TSX_CTRL_DISABLE;
 
@@ -105,21 +119,6 @@ void __init tsx_init(void)
 	char arg[5] = {};
 	int ret;
 
-	/*
-	 * Hardware will always abort a TSX transaction if both CPUID bits
-	 * RTM_ALWAYS_ABORT and TSX_FORCE_ABORT are set. In this case, it is
-	 * better not to enumerate CPUID.RTM and CPUID.HLE bits. Clear them
-	 * here.
-	 */
-	if (boot_cpu_has(X86_FEATURE_RTM_ALWAYS_ABORT) &&
-	    boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)) {
-		tsx_ctrl_state = TSX_CTRL_RTM_ALWAYS_ABORT;
-		tsx_clear_cpuid();
-		setup_clear_cpu_cap(X86_FEATURE_RTM);
-		setup_clear_cpu_cap(X86_FEATURE_HLE);
-		return;
-	}
-
 	if (!tsx_ctrl_is_supported()) {
 		tsx_ctrl_state = TSX_CTRL_NOT_SUPPORTED;
 		return;
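The effect of the hunk move can be approximated with a small user-space model. It is a simplification resting on assumptions not visible in the diff: the `tsx=` argument dispatch is taken to map "on" to enable, "auto" to the auto-mode helper and anything else to disable, TSX_CTRL is assumed to be supported, and auto mode is assumed to fall through to enable when no condition matches. `struct cpu_facts` and `resolve_tsx()` are hypothetical names.

```c
#include <stdbool.h>
#include <string.h>

enum tsx_ctrl_states {
	TSX_CTRL_ENABLE,
	TSX_CTRL_DISABLE,
	TSX_CTRL_RTM_ALWAYS_ABORT,
};

/* Hypothetical stand-in for the boot_cpu_has()/boot_cpu_has_bug() checks. */
struct cpu_facts {
	bool rtm_always_abort;	/* X86_FEATURE_RTM_ALWAYS_ABORT */
	bool tsx_force_abort;	/* X86_FEATURE_TSX_FORCE_ABORT */
	bool bug_taa;		/* X86_BUG_TAA */
};

/* Models x86_get_tsx_auto_mode() after the patch: always-abort check first. */
static enum tsx_ctrl_states auto_mode(const struct cpu_facts *c)
{
	if (c->rtm_always_abort && c->tsx_force_abort)
		return TSX_CTRL_RTM_ALWAYS_ABORT;
	if (c->bug_taa)
		return TSX_CTRL_DISABLE;
	return TSX_CTRL_ENABLE;	/* assumed fall-through, not shown in the diff */
}

/* Assumed dispatch on the tsx= argument; arg must be a non-NULL string. */
static enum tsx_ctrl_states resolve_tsx(const char *arg, const struct cpu_facts *c)
{
	if (strcmp(arg, "on") == 0)
		return TSX_CTRL_ENABLE;
	if (strcmp(arg, "auto") == 0)
		return auto_mode(c);
	return TSX_CTRL_DISABLE;	/* "off" and unrecognised values */
}
```

In this model, a host with updated microcode (both abort bits set) resolves `tsx=on` to `TSX_CTRL_ENABLE`, so RTM and HLE stay enumerated, and only `tsx=auto` reaches `TSX_CTRL_RTM_ALWAYS_ABORT`.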
Move automatic disablement for TSX microcode deprecation from tsx_init() to
x86_get_tsx_auto_mode(), such that systems with tsx=on will continue to
see the TSX CPU features (HLE, RTM) even on updated microcode.

KVM live migration could possibly be broken in 5.14+ by commit 293649307ef9
("x86/tsx: Clear CPUID bits when TSX always force aborts"). Consider the
following scenario:

1. KVM hosts clustered in a live migration capable setup.
2. KVM guests have TSX CPU features HLE and/or RTM presented.
3. One of the three maintenance events occurs:
3a. An existing host running kernel >= 5.14 in the pool is updated with the
    new microcode.
3b. A new host running kernel >= 5.14 is commissioned that already has the
    microcode update preloaded.
3c. All hosts are running kernel < 5.14 with microcode update already
    loaded and one existing host gets updated to kernel >= 5.14.
4. After the maintenance event, the impacted host will not have HLE and RTM
   exposed, and live migrations of guests with TSX features might fail.

Users using tsx=on or CONFIG_X86_INTEL_TSX_MODE_ON should always see HLE
and RTM on capable Intel SKUs, even if microcode has been clubbed to
prevent functionality.

Users using tsx=auto or CONFIG_X86_INTEL_TSX_MODE_AUTO get to roll the dice
with whatever the kernel believes the appropriate default is, which includes
the feature disappearing after a kernel and/or microcode update. These users
should consider masking HLE and RTM at a higher control plane level, e.g.
qemu or libvirt, such that guests on TSX enabled systems do not see HLE/RTM
and therefore do not enable TAA mitigation.

Fixes: 293649307ef9 ("x86/tsx: Clear CPUID bits when TSX always force aborts")
Signed-off-by: Jon Kohler <jon@nutanix.com>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Neelima Krishnan <neelima.krishnan@intel.com>
Cc: kvm@vger.kernel.org <kvm@vger.kernel.org>
---
 arch/x86/kernel/cpu/tsx.c | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)
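The migration failure in step 4 of the scenario comes down to a feature-subset check. The sketch below is an assumption-laden toy rather than how any particular VMM implements it: `host_can_accept_guest()` is a hypothetical helper, and only the CPUID.(EAX=7,ECX=0):EBX leaf that carries HLE (bit 4) and RTM (bit 11) is modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0):EBX feature bits for the two TSX features. */
#define CPUID7_EBX_HLE	(1u << 4)
#define CPUID7_EBX_RTM	(1u << 11)

/*
 * Hypothetical check: a destination host can take a guest only if every
 * feature bit the guest was shown is also available on that host.
 */
static bool host_can_accept_guest(uint32_t guest_ebx, uint32_t host_ebx)
{
	return (guest_ebx & ~host_ebx) == 0;
}
```

A guest started with HLE and RTM exposed passes this check on a host that still enumerates both bits and fails on a host where they were cleared, which is the situation the commit message describes for tsx=on hosts after the kernel or microcode update.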