diff mbox series

[6/6] svm/nestedvm: Introduce nested capabilities bit

Message ID 20240206012051.3564035-7-george.dunlap@cloud.com (mailing list archive)
State Superseded
Headers show
Series AMD Nested Virt Preparation | expand

Commit Message

George Dunlap Feb. 6, 2024, 1:20 a.m. UTC
In order to make implementation and testing tractable, we will require
specific host functionality.  Add a nested_virt bit to hvm_funcs.caps,
and return an error if a domain is created with nested virt and this
bit isn't set.

For VMX, start with always enabling it if HAP is present; this
shouldn't change current behvior.

For SVM, require some basic functionality, adding a document
explaining the rationale.

NB that only SVM CPUID bits 0-7 have been considered.  Bits 10-16 may
be considered in a follow-up patch.

Signed-off-by: George Dunlap <george.dunlap@cloud.com>
---
CC: Andrew Cooper <andrew.cooper3@citrix.com>
CC: George Dunlap <george.dunlap@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Julien Grall <julien@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Wei Liu <wl@xen.org>
CC: "Roger Pau Monné" <roger.pau@citrix.com>
CC: Jun Nakajima <jun.nakajima@intel.com>
CC: Kevin Tian <kevin.tian@intel.com>
---
 docs/designs/nested-svm-cpu-features.md | 110 ++++++++++++++++++++++++
 xen/arch/x86/domain.c                   |   6 ++
 xen/arch/x86/hvm/svm/nestedhvm.h        |   1 +
 xen/arch/x86/hvm/svm/nestedsvm.c        |  14 +++
 xen/arch/x86/hvm/svm/svm.c              |   2 +
 xen/arch/x86/hvm/vmx/vmx.c              |   3 +
 xen/arch/x86/include/asm/hvm/hvm.h      |  11 ++-
 7 files changed, 146 insertions(+), 1 deletion(-)

Comments

George Dunlap Feb. 6, 2024, 1:34 a.m. UTC | #1
On Tue, Feb 6, 2024 at 9:20 AM George Dunlap <george.dunlap@cloud.com> wrote:

> @@ -655,6 +658,12 @@ static inline bool hvm_altp2m_supported(void)
>      return hvm_funcs.caps.altp2m;
>  }
>
> +/* Returns true if we have the minimum hardware requirements for nested virt */
> +static inline bool hvm_nested_virt_supported(void)
> +{
> +    return hvm_funcs.caps.nested_virt;
> +}

NB this is missing the !CONFIG_HVM version of this function; I'll
include it in v2.

 -George
Jan Beulich Feb. 19, 2024, 4:25 p.m. UTC | #2
On 06.02.2024 02:20, George Dunlap wrote:
> --- /dev/null
> +++ b/docs/designs/nested-svm-cpu-features.md
> @@ -0,0 +1,110 @@
> +# Nested SVM (AMD) CPUID requirements
> +
> +The first step in making nested SVM production-ready is to make sure
> +that all features are implemented and well-tested.  To make this
> +tractable, we will initially be limiting the "supported" range of
> +nested virt to a specific subset of host and guest features.  This
> +document describes the criteria for deciding on features, and the
> +rationale behind each feature.
> +
> +For AMD, all virtualization-related features can be found in CPUID
> +leaf 8000000A:edx
> +
> +# Criteria
> +
> +- Processor support: At a minimum we want to support processors from
> +  the last 5 years.  All things being equal, older processors are
> +  better.

Nit: Perhaps missing "covering"? Generally I hope newer processors are
"better".

>  Bits 0:7 were available in the very earliest processors;
> +  and even through bit 15 we should be pretty good support-wise.
> +
> +- Faithfulness to hardware: We need the behavior of the "virtual cpu"
> +  from the L1 hypervisor's perspective to be as close as possible to
> +  the original hardware.  In particular, the behavior of the hardware
> +  on error paths 1) is not easy to understand or test, 2) can be the
> +  source of surprising vulnerabiliies.  (See XSA-7 for an example of a
> +  case where subtle error-handling differences can open up a privilege
> +  escalation.)  We should avoid emulating any bit of the hardware with
> +  complex error paths if we can at all help it.
> +
> +- Cost of implementation: We want to minimize the cost of
> +  implementation (where this includes bringing an existing sub-par
> +  implementation up to speed).  All things being equal, we'll favor a
> +  configuration which does not require any new implementation.
> +
> +- Performance: All things being equal, we'd prefer to choose a set of
> +  L0 / L1 CPUID bits that are faster than slower.
> +
> +
> +# Bits
> +
> +- 0 `NP` *Nested Paging*: Required both for L0 and L1.
> +
> +  Based primarily on faithfulness and performance, as well as
> +  potential cost of implementation.  Available on earliest hardware,
> +  so no compatibility issues.
> +
> +- 1 `LbrVirt` *LBR / debugging virtualization*: Require for L0 and L1.
> +
> +  For L0 this is required for performance: There's no way to tell the
> +  guests not to use the LBR-related registers; and if the guest does,
> +  then you have to save and restore all LBR-related registers on
> +  context switch, which is prohibitive.

"prohibitive" is too strong imo; maybe "undesirable"?

> --- a/xen/arch/x86/hvm/svm/nestedhvm.h
> +++ b/xen/arch/x86/hvm/svm/nestedhvm.h
> @@ -35,6 +35,7 @@ enum nestedhvm_vmexits
>  nestedsvm_check_intercepts(struct vcpu *v, struct cpu_user_regs *regs,
>      uint64_t exitcode);
>  void svm_nested_features_on_efer_update(struct vcpu *v);
> +void __init start_nested_svm(struct hvm_function_table *svm_function_table);

No section placement attributes on declarations, please.

> --- a/xen/arch/x86/hvm/svm/nestedsvm.c
> +++ b/xen/arch/x86/hvm/svm/nestedsvm.c
> @@ -1666,3 +1666,17 @@ void svm_nested_features_on_efer_update(struct vcpu *v)
>          }
>      }
>  }
> +
> +void __init start_nested_svm(struct hvm_function_table *svm_function_table)
> +{
> +    /* 
> +     * Required host functionality to support nested virt.  See
> +     * docs/designs/nested-svm-cpu-features.md for rationale.
> +     */
> +    svm_function_table->caps.nested_virt =
> +        cpu_has_svm_nrips &&
> +        cpu_has_svm_lbrv &&
> +        cpu_has_svm_nrips &&

nrips twice? Was the earlier one meant to be npt?

> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -3021,6 +3021,9 @@ const struct hvm_function_table * __init start_vmx(void)
>      if ( cpu_has_vmx_tsc_scaling )
>          vmx_function_table.tsc_scaling.ratio_frac_bits = 48;
>  
> +    /* TODO: Require hardware support before enabling nested virt */
> +    vmx_function_table.caps.nested_virt = vmx_function_table.caps.hap;

This won't have the intended effect if hap_supported() ends up clearing
the bit (used as input here) due to a command line option override. I
wonder if instead this wants doing e.g. in a new hook to be called from
nestedhvm_setup(). On the SVM side the hook function would then be the
start_nested_svm() that you already introduce, with a caps.hap check
added.

Since you leave the other nested-related if() in place in
arch_sanitise_domain_config(), all ought to be well, but I think that
other if() really wants replacing by the one you presently add.

Jan
George Dunlap Feb. 22, 2024, 2:33 p.m. UTC | #3
On Tue, Feb 20, 2024 at 12:25 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 06.02.2024 02:20, George Dunlap wrote:
> > --- /dev/null
> > +++ b/docs/designs/nested-svm-cpu-features.md
> > @@ -0,0 +1,110 @@
> > +# Nested SVM (AMD) CPUID requirements
> > +
> > +The first step in making nested SVM production-ready is to make sure
> > +that all features are implemented and well-tested.  To make this
> > +tractable, we will initially be limiting the "supported" range of
> > +nested virt to a specific subset of host and guest features.  This
> > +document describes the criteria for deciding on features, and the
> > +rationale behind each feature.
> > +
> > +For AMD, all virtualization-related features can be found in CPUID
> > +leaf 8000000A:edx
> > +
> > +# Criteria
> > +
> > +- Processor support: At a minimum we want to support processors from
> > +  the last 5 years.  All things being equal, older processors are
> > +  better.
>
> Nit: Perhaps missing "covering"? Generally I hope newer processors are
> "better".

I've reworded it.

> > +  For L0 this is required for performance: There's no way to tell the
> > +  guests not to use the LBR-related registers; and if the guest does,
> > +  then you have to save and restore all LBR-related registers on
> > +  context switch, which is prohibitive.
>
> "prohibitive" is too strong imo; maybe "undesirable"?

Prohibitive sounds strong, but I don't think it really is; here's the
dictionary definition:

"(of a price or charge) so high as to prevent something being done or bought."

The cost of saving the registers on every context switch is so high as
to prevent us from wanting to do it.

I could change it to "too expensive".

> > --- a/xen/arch/x86/hvm/svm/nestedhvm.h
> > +++ b/xen/arch/x86/hvm/svm/nestedhvm.h
> > @@ -35,6 +35,7 @@ enum nestedhvm_vmexits
> >  nestedsvm_check_intercepts(struct vcpu *v, struct cpu_user_regs *regs,
> >      uint64_t exitcode);
> >  void svm_nested_features_on_efer_update(struct vcpu *v);
> > +void __init start_nested_svm(struct hvm_function_table *svm_function_table);
>
> No section placement attributes on declarations, please.

Ack.

> > --- a/xen/arch/x86/hvm/svm/nestedsvm.c
> > +++ b/xen/arch/x86/hvm/svm/nestedsvm.c
> > @@ -1666,3 +1666,17 @@ void svm_nested_features_on_efer_update(struct vcpu *v)
> >          }
> >      }
> >  }
> > +
> > +void __init start_nested_svm(struct hvm_function_table *svm_function_table)
> > +{
> > +    /*
> > +     * Required host functionality to support nested virt.  See
> > +     * docs/designs/nested-svm-cpu-features.md for rationale.
> > +     */
> > +    svm_function_table->caps.nested_virt =
> > +        cpu_has_svm_nrips &&
> > +        cpu_has_svm_lbrv &&
> > +        cpu_has_svm_nrips &&
>
> nrips twice? Was the earlier one meant to be npt?

Er, yes exactly.



> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -3021,6 +3021,9 @@ const struct hvm_function_table * __init start_vmx(void)
> >      if ( cpu_has_vmx_tsc_scaling )
> >          vmx_function_table.tsc_scaling.ratio_frac_bits = 48;
> >
> > +    /* TODO: Require hardware support before enabling nested virt */
> > +    vmx_function_table.caps.nested_virt = vmx_function_table.caps.hap;
>
> This won't have the intended effect if hap_supported() ends up clearing
> the bit (used as input here) due to a command line option override. I
> wonder if instead this wants doing e.g. in a new hook to be called from
> nestedhvm_setup(). On the SVM side the hook function would then be the
> start_nested_svm() that you already introduce, with a caps.hap check
> added.

I take it presmp_initcall()'s are guaranteed to run before __initcall()'s?

If so that's probably a good idea.

> Since you leave the other nested-related if() in place in
> arch_sanitise_domain_config(), all ought to be well, but I think that
> other if() really wants replacing by the one you presently add.

Ack.

I'll probably check in patches 1,2,3, and 5, and resend the other two,
unless you'd like to see all the changes again...

 -George




 -George
Jan Beulich Feb. 22, 2024, 3:05 p.m. UTC | #4
On 22.02.2024 15:33, George Dunlap wrote:
> On Tue, Feb 20, 2024 at 12:25 AM Jan Beulich <jbeulich@suse.com> wrote:
>> On 06.02.2024 02:20, George Dunlap wrote:
>>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>>> @@ -3021,6 +3021,9 @@ const struct hvm_function_table * __init start_vmx(void)
>>>      if ( cpu_has_vmx_tsc_scaling )
>>>          vmx_function_table.tsc_scaling.ratio_frac_bits = 48;
>>>
>>> +    /* TODO: Require hardware support before enabling nested virt */
>>> +    vmx_function_table.caps.nested_virt = vmx_function_table.caps.hap;
>>
>> This won't have the intended effect if hap_supported() ends up clearing
>> the bit (used as input here) due to a command line option override. I
>> wonder if instead this wants doing e.g. in a new hook to be called from
>> nestedhvm_setup(). On the SVM side the hook function would then be the
>> start_nested_svm() that you already introduce, with a caps.hap check
>> added.
> 
> I take it presmp_initcall()'s are guaranteed to run before __initcall()'s?

Yes - the former happen ahead of AP bringup, the latter after.

>> Since you leave the other nested-related if() in place in
>> arch_sanitise_domain_config(), all ought to be well, but I think that
>> other if() really wants replacing by the one you presently add.
> 
> Ack.
> 
> I'll probably check in patches 1,2,3, and 5, and resend the other two,
> unless you'd like to see all the changes again...

No need imo to re-post anything that was agreed upon.

Jan
diff mbox series

Patch

diff --git a/docs/designs/nested-svm-cpu-features.md b/docs/designs/nested-svm-cpu-features.md
new file mode 100644
index 0000000000..7ffb8daefd
--- /dev/null
+++ b/docs/designs/nested-svm-cpu-features.md
@@ -0,0 +1,110 @@ 
+# Nested SVM (AMD) CPUID requirements
+
+The first step in making nested SVM production-ready is to make sure
+that all features are implemented and well-tested.  To make this
+tractable, we will initially be limiting the "supported" range of
+nested virt to a specific subset of host and guest features.  This
+document describes the criteria for deciding on features, and the
+rationale behind each feature.
+
+For AMD, all virtualization-related features can be found in CPUID
+leaf 8000000A:edx
+
+# Criteria
+
+- Processor support: At a minimum we want to support processors from
+  the last 5 years.  All things being equal, older processors are
+  better.  Bits 0:7 were available in the very earliest processors;
+  and even through bit 15 we should be pretty good support-wise.
+
+- Faithfulness to hardware: We need the behavior of the "virtual cpu"
+  from the L1 hypervisor's perspective to be as close as possible to
+  the original hardware.  In particular, the behavior of the hardware
+  on error paths 1) is not easy to understand or test, 2) can be the
+  source of surprising vulnerabiliies.  (See XSA-7 for an example of a
+  case where subtle error-handling differences can open up a privilege
+  escalation.)  We should avoid emulating any bit of the hardware with
+  complex error paths if we can at all help it.
+
+- Cost of implementation: We want to minimize the cost of
+  implementation (where this includes bringing an existing sub-par
+  implementation up to speed).  All things being equal, we'll favor a
+  configuration which does not require any new implementation.
+
+- Performance: All things being equal, we'd prefer to choose a set of
+  L0 / L1 CPUID bits that are faster than slower.
+
+
+# Bits
+
+- 0 `NP` *Nested Paging*: Required both for L0 and L1.
+
+  Based primarily on faithfulness and performance, as well as
+  potential cost of implementation.  Available on earliest hardware,
+  so no compatibility issues.
+
+- 1 `LbrVirt` *LBR / debugging virtualization*: Require for L0 and L1.
+
+  For L0 this is required for performance: There's no way to tell the
+  guests not to use the LBR-related registers; and if the guest does,
+  then you have to save and restore all LBR-related registers on
+  context switch, which is prohibitive.  Furthermore, the additional
+  emulation risks a security-relevant difference to come up.
+
+  Providing it to L1 when we have it in L0 is basically free, and
+  already implemented.
+
+  Just require it and provide it.
+
+- 2 `SVML` *SVM Lock*: Not required for L0, not provided to L1
+
+  Seems to be aboult enabling an operating system to prevent "blue
+  pill" attacks against itself.
+
+  Xen doesn't use it, nor provide it; so it would need to be
+  implementend.  The best way to protect a guest OS is to leave nested
+  virt disabled in the tools.
+
+- 3 `NRIPS` NRIP Save: Require for both L0 and L1
+
+  If NRIPS is not present, the software interrupt injection
+  functionality can't be used; and Xen has to emulate it.  That's
+  another source of potential security issues.  If hardware supports
+  it, then providing it to guest is basically free.
+
+- 4 `TscRateMsr`: Not required by L0, not provided to L1
+
+  The main putative use for this would be trying to maintain an
+  invariant TSC across cores with different clock speeds, or after a
+  migrate.  Unlike others, this doesn't have an error path to worry
+  about compatibility-wise; and according to tests done when nestedSVM
+  was first implemented, it's actually faster to emliate TscRateMSR in
+  the L0 hypervisor than for L1 to attempt to emulate it itself.
+
+  However, using this properly in L0 will take some implementation
+  effort; and composing it properly with L1 will take even more
+  effort.  Just leave it off for now.
+
+ - 5 `VmcbClean`: VMCB Clean Bits: Not required by L0, provide to L1
+
+  This is a pure optimization, both on the side of the L0 and L1.  The
+  implementaiton for L1 is entirely Xen-side, so can be provided even
+  on hardware that doesn't provide it.  And it's purely an
+  optimization, so could be "implemented" by ignoring the bits
+  entirely.
+
+  As such, we don't need to require it for L0; and as it's already
+  implemented, no reason not to provide it to L1.  Before this feature
+  was available those bits were marked SBZ ("should be zero"); setting
+  them was already advertised to cause unpredictable behavior.
+
+- 6 `FlushByAsid`: Require for L0, provide to L1
+
+  This is cheap and easy to use for L0 and to provide to the L1;
+  there's no reson not to just pass it through.
+
+- 7 `DecodeAssists`: Require for L0, provide to L1
+
+  Using it in L0 reduces the chance that we'll make some sort of error
+  in the decode path.  And if hardware supports it, it's easy enough
+  to provide to the L1.
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index bda853e3c9..a25f498265 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -673,6 +673,12 @@  int arch_sanitise_domain_config(struct xen_domctl_createdomain *config)
          */
         config->flags |= XEN_DOMCTL_CDF_oos_off;
 
+    if ( nested_virt && !hvm_nested_virt_supported() )
+    {
+        dprintk(XENLOG_INFO, "Nested virt requested but not available\n");
+        return -EINVAL;        
+    }
+
     if ( nested_virt && !hap )
     {
         dprintk(XENLOG_INFO, "Nested virt not supported without HAP\n");
diff --git a/xen/arch/x86/hvm/svm/nestedhvm.h b/xen/arch/x86/hvm/svm/nestedhvm.h
index 43245e13de..31cf2af8e4 100644
--- a/xen/arch/x86/hvm/svm/nestedhvm.h
+++ b/xen/arch/x86/hvm/svm/nestedhvm.h
@@ -35,6 +35,7 @@  enum nestedhvm_vmexits
 nestedsvm_check_intercepts(struct vcpu *v, struct cpu_user_regs *regs,
     uint64_t exitcode);
 void svm_nested_features_on_efer_update(struct vcpu *v);
+void __init start_nested_svm(struct hvm_function_table *svm_function_table);
 
 /* Interface methods */
 void cf_check nsvm_vcpu_destroy(struct vcpu *v);
diff --git a/xen/arch/x86/hvm/svm/nestedsvm.c b/xen/arch/x86/hvm/svm/nestedsvm.c
index a5319ab729..92b063daa5 100644
--- a/xen/arch/x86/hvm/svm/nestedsvm.c
+++ b/xen/arch/x86/hvm/svm/nestedsvm.c
@@ -1666,3 +1666,17 @@  void svm_nested_features_on_efer_update(struct vcpu *v)
         }
     }
 }
+
+void __init start_nested_svm(struct hvm_function_table *svm_function_table)
+{
+    /* 
+     * Required host functionality to support nested virt.  See
+     * docs/designs/nested-svm-cpu-features.md for rationale.
+     */
+    svm_function_table->caps.nested_virt =
+        cpu_has_svm_nrips &&
+        cpu_has_svm_lbrv &&
+        cpu_has_svm_nrips &&
+        cpu_has_svm_flushbyasid &&
+        cpu_has_svm_decode;
+}
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 34b9f603bc..5c2e171777 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -2527,6 +2527,8 @@  const struct hvm_function_table * __init start_svm(void)
     svm_function_table.caps.hap_superpage_2mb = true;
     svm_function_table.caps.hap_superpage_1gb = cpu_has_page1gb;
 
+    start_nested_svm(&svm_function_table);
+
     return &svm_function_table;
 }
 
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 4bcf436d2c..6b5ad4a509 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -3021,6 +3021,9 @@  const struct hvm_function_table * __init start_vmx(void)
     if ( cpu_has_vmx_tsc_scaling )
         vmx_function_table.tsc_scaling.ratio_frac_bits = 48;
 
+    /* TODO: Require hardware support before enabling nested virt */
+    vmx_function_table.caps.nested_virt = vmx_function_table.caps.hap;
+
     model_specific_lbr = get_model_specific_lbr();
     lbr_tsx_fixup_check();
     ler_to_fixup_check();
diff --git a/xen/arch/x86/include/asm/hvm/hvm.h b/xen/arch/x86/include/asm/hvm/hvm.h
index bbd83a8275..8a3df0eca7 100644
--- a/xen/arch/x86/include/asm/hvm/hvm.h
+++ b/xen/arch/x86/include/asm/hvm/hvm.h
@@ -97,7 +97,10 @@  struct hvm_function_table {
             singlestep:1,
             
             /* Hardware virtual interrupt delivery enable? */
-            virtual_intr_delivery;
+            virtual_intr_delivery,
+
+            /* Nested virt capabilities */
+            nested_virt:1;
 
     } caps;
 
@@ -655,6 +658,12 @@  static inline bool hvm_altp2m_supported(void)
     return hvm_funcs.caps.altp2m;
 }
 
+/* Returns true if we have the minimum hardware requirements for nested virt */
+static inline bool hvm_nested_virt_supported(void)
+{
+    return hvm_funcs.caps.nested_virt;
+}
+
 /* updates the current hardware p2m */
 static inline void altp2m_vcpu_update_p2m(struct vcpu *v)
 {