[v3,1/2,4.15] x86/PV: conditionally avoid raising #GP for early guest MSR reads

Message ID 90f87aa8-09da-1453-bd82-c722465c2881@suse.com (mailing list archive)
State Superseded
Series x86: guest MSR access handling tweaks

Commit Message

Jan Beulich March 12, 2021, 7:54 a.m. UTC
Prior to 4.15, Linux, when running in PV mode, did not install a #GP
handler early enough to cover, for example, the rdmsrl_safe() of
MSR_K8_TSEG_ADDR in bsp_init_amd() (not to mention the unguarded read
of MSR_K7_HWCR later in the same function). The respective change
(42b3a4cb5609 "x86/xen: Support early interrupts in xen pv guests") was
backported to 4.14, but no further - presumably because it wasn't
straightforward due to other dependencies.

Therefore, to prevent our change in the handling of guest MSR accesses
from rendering PV Linux 4.13 and older unusable on at least AMD
systems, make the raising of #GP on these paths conditional upon the
guest having installed a handler, provided of course the MSR can be
read in the first place (we would have raised #GP in that case even
before). Producing zero for reads isn't necessarily correct and may
trip up code trying to detect the presence of MSRs early, but since
such detection logic won't work without a #GP handler anyway, this
ought to be a fair workaround.
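
The Linux-side pattern in question looks roughly like this (paraphrased;
the exact code differs between kernel versions):

    /* arch/x86/kernel/cpu/amd.c:bsp_init_amd(), approximately: */
    u64 tseg, hwcr;

    /* Guarded read - but the "safe" fixup relies on the #GP handler. */
    if (!rdmsrl_safe(MSR_K8_TSEG_ADDR, &tseg)) {
        /* ... adjust the direct mapping around the TSEG SMM area ... */
    }

    /* Unguarded read - only tolerable if no #GP is raised at all. */
    rdmsrl(MSR_K7_HWCR, hwcr);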

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Release-Acked-by: Ian Jackson <iwj@xenproject.org>
---
(projected v4: re-base over Roger's change)
v3: Use temporary variable for probing. Document the behavior (in a
    public header, for the lack of a better place).
v2: Probe MSR read. Exclude hypervisor range. Avoid issuing two log
    messages (in debug builds). Don't alter WRMSR behavior.
---
While I didn't myself observe or find similar WRMSR side issues, I'm
nevertheless not convinced we can get away without also making the WRMSR
path somewhat more permissive again, e.g. tolerating attempts to set
bits which are already set. But of course this would require keeping in
sync for which MSRs we "fake" reads, as then a kernel attempt to set a
bit may also appear as an attempt to clear others (because of the zero
value that we gave it for the read). Roger validly points out that
making behavior dependent upon MSR values has its own downsides, so
simply depending on MSR readability is another option (with, in turn,
its own undesirable effects, e.g. for write-only MSRs).
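
For illustration only - this is not part of the patch, and it merely
mirrors the structure (and the tmp/curr variables) of the RDMSR hunk
below rather than any actual WRMSR code - a tolerant write_msr() tail
along those lines might look roughly like:

 done:
    if ( ret != X86EMUL_OKAY && !curr->arch.pv.trap_ctxt[X86_EXC_GP].address &&
         (reg >> 16) != 0x4000 && !rdmsr_safe(reg, tmp) &&
         (val & ~tmp) == 0 )
    {
        /*
         * No new bit would be set; treat the write as a no-op.  A real
         * version would additionally need to remember for which MSRs reads
         * were "faked", since an apparent attempt to clear bits may merely
         * result from the zero value produced for such a read.
         */
        gprintk(XENLOG_WARNING, "ignoring WRMSR 0x%08x\n", reg);
        x86_emul_reset_event(ctxt);
        ret = X86EMUL_OKAY;
    }

    return ret;

Whether such a fallback should key off the current MSR value (as here)
or purely off readability is exactly the trade-off noted above.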

Comments

Roger Pau Monne March 12, 2021, 9:08 a.m. UTC | #1
On Fri, Mar 12, 2021 at 08:54:46AM +0100, Jan Beulich wrote:
> Prior to 4.15, Linux, when running in PV mode, did not install a #GP
> handler early enough to cover, for example, the rdmsrl_safe() of
> MSR_K8_TSEG_ADDR in bsp_init_amd() (not to mention the unguarded read
> of MSR_K7_HWCR later in the same function). The respective change
> (42b3a4cb5609 "x86/xen: Support early interrupts in xen pv guests") was
> backported to 4.14, but no further - presumably because it wasn't
> straightforward due to other dependencies.
> 
> Therefore, to prevent our change in the handling of guest MSR accesses
> from rendering PV Linux 4.13 and older unusable on at least AMD
> systems, make the raising of #GP on these paths conditional upon the
> guest having installed a handler, provided of course the MSR can be
> read in the first place (we would have raised #GP in that case even
> before). Producing zero for reads isn't necessarily correct and may
> trip up code trying to detect the presence of MSRs early, but since
> such detection logic won't work without a #GP handler anyway, this
> ought to be a fair workaround.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Release-Acked-by: Ian Jackson <iwj@xenproject.org>

I think the approach is fine:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Some comments below.

> ---
> (projected v4: re-base over Roger's change)
> v3: Use temporary variable for probing. Document the behavior (in a
>     public header, for the lack of a better place).
> v2: Probe MSR read. Exclude hypervisor range. Avoid issuing two log
>     messages (in debug builds). Don't alter WRMSR behavior.
> ---
> While I didn't myself observe or find similar WRMSR side issues, I'm
> nevertheless not convinced we can get away without also making the WRMSR
> path somewhat more permissive again, e.g. tolerating attempts to set
> bits which are already set. But of course this would require keeping in
> sync for which MSRs we "fake" reads, as then a kernel attempt to set a
> bit may also appear as an attempt to clear others (because of the zero
> value that we gave it for the read). Roger validly points out that
> making behavior dependent upon MSR values has its own downsides, so
> simply depending on MSR readability is another option (with, in turn,
> its own undesirable effects, e.g. for write-only MSRs).
> 
> --- a/xen/arch/x86/pv/emul-priv-op.c
> +++ b/xen/arch/x86/pv/emul-priv-op.c
> @@ -874,7 +874,8 @@ static int read_msr(unsigned int reg, ui
>      struct vcpu *curr = current;
>      const struct domain *currd = curr->domain;
>      const struct cpuid_policy *cp = currd->arch.cpuid;
> -    bool vpmu_msr = false;
> +    bool vpmu_msr = false, warn = false;
> +    uint64_t tmp;
>      int ret;
>  
>      if ( (ret = guest_rdmsr(curr, reg, val)) != X86EMUL_UNHANDLEABLE )
> @@ -882,7 +883,7 @@ static int read_msr(unsigned int reg, ui
>          if ( ret == X86EMUL_EXCEPTION )
>              x86_emul_hw_exception(TRAP_gp_fault, 0, ctxt);

You might want to move the injection of the exception to the done
label?

So that we can avoid the call to x86_emul_reset_event.

>  
> -        return ret;
> +        goto done;
>      }
>  
>      switch ( reg )
> @@ -986,7 +987,7 @@ static int read_msr(unsigned int reg, ui
>          }
>          /* fall through */
>      default:
> -        gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);
> +        warn = true;
>          break;
>  
>      normal:
> @@ -995,7 +996,19 @@ static int read_msr(unsigned int reg, ui
>          return X86EMUL_OKAY;
>      }
>  
> -    return X86EMUL_UNHANDLEABLE;
> + done:
> +    if ( ret != X86EMUL_OKAY && !curr->arch.pv.trap_ctxt[X86_EXC_GP].address &&
> +         (reg >> 16) != 0x4000 && !rdmsr_safe(reg, tmp) )
> +    {
> +        gprintk(XENLOG_WARNING, "faking RDMSR 0x%08x\n", reg);
> +        *val = 0;
> +        x86_emul_reset_event(ctxt);
> +        ret = X86EMUL_OKAY;
> +    }
> +    else if ( warn )
> +        gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);

I think you could add:

if ( rc == X86EMUL_EXCEPTION )
    x86_emul_hw_exception(TRAP_gp_fault, 0, ctxt);

> +
> +    return ret;
>  }
>  
>  static int write_msr(unsigned int reg, uint64_t val,
> --- a/xen/include/public/arch-x86/xen.h
> +++ b/xen/include/public/arch-x86/xen.h
> @@ -143,6 +143,12 @@ typedef unsigned long xen_ulong_t;
>   *  Level == 1: Kernel may enter
>   *  Level == 2: Kernel may enter
>   *  Level == 3: Everyone may enter
> + *
> + * Note: For compatibility with kernels not setting up exception handlers
> + *       early enough, Xen will avoid trying to inject #GP (and hence crash
> + *       the domain) when an RDMSR would require this, but no handler was
> + *       set yet. The precise conditions are implementation specific, and

You can drop the 'yet' here I think? As even if a handler has been set
and then removed we would still prevent injecting a #GP AFAICT. Not a
native speaker anyway, so I might be wrong on that one.

> + *       new code shouldn't rely on such behavior anyway.

I would use a stronger mustn't here instead of shouldn't.

Thanks, Roger.
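
For illustration, the tail of read_msr() with both of the above
suggestions applied would look roughly like this (a sketch only, not the
code that was ultimately committed; Roger's "rc" corresponds to the
function's "ret"):

    if ( (ret = guest_rdmsr(curr, reg, val)) != X86EMUL_UNHANDLEABLE )
        goto done;    /* any exception is now injected at the label */

    ...

 done:
    if ( ret != X86EMUL_OKAY && !curr->arch.pv.trap_ctxt[X86_EXC_GP].address &&
         (reg >> 16) != 0x4000 && !rdmsr_safe(reg, tmp) )
    {
        gprintk(XENLOG_WARNING, "faking RDMSR 0x%08x\n", reg);
        *val = 0;
        ret = X86EMUL_OKAY;    /* no event was raised, nothing to reset */
    }
    else if ( warn )
        gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);

    if ( ret == X86EMUL_EXCEPTION )
        x86_emul_hw_exception(TRAP_gp_fault, 0, ctxt);

    return ret;

As Jan notes in his reply below, keeping an unconditional
x86_emul_reset_event() arguably remains the more robust option.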
Jan Beulich March 12, 2021, 9:32 a.m. UTC | #2
On 12.03.2021 10:08, Roger Pau Monné wrote:
> On Fri, Mar 12, 2021 at 08:54:46AM +0100, Jan Beulich wrote:
>> Prior to 4.15, Linux, when running in PV mode, did not install a #GP
>> handler early enough to cover, for example, the rdmsrl_safe() of
>> MSR_K8_TSEG_ADDR in bsp_init_amd() (not to mention the unguarded read
>> of MSR_K7_HWCR later in the same function). The respective change
>> (42b3a4cb5609 "x86/xen: Support early interrupts in xen pv guests") was
>> backported to 4.14, but no further - presumably because it wasn't
>> straightforward due to other dependencies.
>>
>> Therefore, to prevent our change in the handling of guest MSR accesses
>> from rendering PV Linux 4.13 and older unusable on at least AMD
>> systems, make the raising of #GP on these paths conditional upon the
>> guest having installed a handler, provided of course the MSR can be
>> read in the first place (we would have raised #GP in that case even
>> before). Producing zero for reads isn't necessarily correct and may
>> trip up code trying to detect the presence of MSRs early, but since
>> such detection logic won't work without a #GP handler anyway, this
>> ought to be a fair workaround.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
> 
> I think the approach is fine:
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> --- a/xen/arch/x86/pv/emul-priv-op.c
>> +++ b/xen/arch/x86/pv/emul-priv-op.c
>> @@ -874,7 +874,8 @@ static int read_msr(unsigned int reg, ui
>>      struct vcpu *curr = current;
>>      const struct domain *currd = curr->domain;
>>      const struct cpuid_policy *cp = currd->arch.cpuid;
>> -    bool vpmu_msr = false;
>> +    bool vpmu_msr = false, warn = false;
>> +    uint64_t tmp;
>>      int ret;
>>  
>>      if ( (ret = guest_rdmsr(curr, reg, val)) != X86EMUL_UNHANDLEABLE )
>> @@ -882,7 +883,7 @@ static int read_msr(unsigned int reg, ui
>>          if ( ret == X86EMUL_EXCEPTION )
>>              x86_emul_hw_exception(TRAP_gp_fault, 0, ctxt);
> 
> You might want to move the injection of the exception to the done
> label?
> 
> So that we can avoid the call to x86_emul_reset_event.

At the expense of slightly more code churn, yes, perhaps. I have to
admit, though, that having an unconditional x86_emul_reset_event() on
that path feels to me less prone to future issues.

>> --- a/xen/include/public/arch-x86/xen.h
>> +++ b/xen/include/public/arch-x86/xen.h
>> @@ -143,6 +143,12 @@ typedef unsigned long xen_ulong_t;
>>   *  Level == 1: Kernel may enter
>>   *  Level == 2: Kernel may enter
>>   *  Level == 3: Everyone may enter
>> + *
>> + * Note: For compatibility with kernels not setting up exception handlers
>> + *       early enough, Xen will avoid trying to inject #GP (and hence crash
>> + *       the domain) when an RDMSR would require this, but no handler was
>> + *       set yet. The precise conditions are implementation specific, and
> 
> You can drop the 'yet' here I think? As even if a handler has been set
> and then removed we would still prevent injecting a #GP AFAICT. Not a
> native speaker anyway, so I might be wrong on that one.

Well, I've put it there intentionally to leave room (effectively
trying to further emphasize "implementation specific") for us to
indeed only behave this way if no handler was ever set (as
opposed to a handler having got set and then zapped again).

>> + *       new code shouldn't rely on such behavior anyway.
> 
> I would use a stronger mustn't here instead of shouldn't.

Fine with me. I've been using "mustn't" in a number of cases in
the past and was told I'm using it too often, sounding sort of
impolite. I guess I'll switch to "may not", which was suggested
to me as the better replacement.

Jan
Andrew Cooper March 12, 2021, 11:19 a.m. UTC | #3
On 12/03/2021 07:54, Jan Beulich wrote:
> Prior to 4.15, Linux, when running in PV mode, did not install a #GP
> handler early enough to cover, for example, the rdmsrl_safe() of
> MSR_K8_TSEG_ADDR in bsp_init_amd() (not to mention the unguarded read
> of MSR_K7_HWCR later in the same function). The respective change
> (42b3a4cb5609 "x86/xen: Support early interrupts in xen pv guests") was
> backported to 4.14, but no further - presumably because it wasn't
> straightforward due to other dependencies.
>
> Therefore, to prevent our change in the handling of guest MSR accesses
> from rendering PV Linux 4.13 and older unusable on at least AMD
> systems, make the raising of #GP on these paths conditional upon the
> guest having installed a handler, provided of course the MSR can be
> read in the first place (we would have raised #GP in that case even
> before). Producing zero for reads isn't necessarily correct and may
> trip up code trying to detect the presence of MSRs early, but since
> such detection logic won't work without a #GP handler anyway, this
> ought to be a fair workaround.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Release-Acked-by: Ian Jackson <iwj@xenproject.org>

I am still of the firm belief that this is the wrong course of action.

It is deliberately papering over error handling bugs which, in the
NetBSD case, literally created memory corruption scenarios.  (Yes - that
was a WRMSR not RDMSR, but the general point still stands, particularly
in light of your expectation to do the same to the WRMSR).

It is one thing to not realise that we have a trainwreck here.  It's
totally another to take wilful action to keep current and all future
guests broken in the same way.

The *only* case where it is acceptable to skip error handling is if the
VM admin has specifically signed their life away to say that they'll
accept the now-discovered potential memory-corruption consequences.

Roger's patch already does this.

~Andrew
Jan Beulich March 12, 2021, 1:34 p.m. UTC | #4
On 12.03.2021 12:19, Andrew Cooper wrote:
> On 12/03/2021 07:54, Jan Beulich wrote:
>> Prior to 4.15, Linux, when running in PV mode, did not install a #GP
>> handler early enough to cover, for example, the rdmsrl_safe() of
>> MSR_K8_TSEG_ADDR in bsp_init_amd() (not to mention the unguarded read
>> of MSR_K7_HWCR later in the same function). The respective change
>> (42b3a4cb5609 "x86/xen: Support early interrupts in xen pv guests") was
>> backported to 4.14, but no further - presumably because it wasn't
>> straightforward due to other dependencies.
>>
>> Therefore, to prevent our change in the handling of guest MSR accesses
>> from rendering PV Linux 4.13 and older unusable on at least AMD
>> systems, make the raising of #GP on these paths conditional upon the
>> guest having installed a handler, provided of course the MSR can be
>> read in the first place (we would have raised #GP in that case even
>> before). Producing zero for reads isn't necessarily correct and may
>> trip up code trying to detect the presence of MSRs early, but since
>> such detection logic won't work without a #GP handler anyway, this
>> ought to be a fair workaround.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
> 
> I am still of the firm belief that this is the wrong course of action.
> 
> It is deliberately papering over error handling bugs which, in the
> NetBSD case, literally created memory corruption scenarios.  (Yes - that
> was a WRMSR not RDMSR, but the general point still stands, particularly
> in light of your expectation to do the same to the WRMSR).
> 
> It is one thing to not realise that we have a trainwreck here.  It's
> totally another to take wilful action to keep current and all future
> guests broken in the same way.
> 
> The *only* case where it is acceptable to skip error handling is if the
> VM admin has specifically signed their life away to say that they'll
> accept the now-discovered potential memory-corruption consequences.
> 
> Roger's patch already does this.

With _much_ bigger impact - it requires changing the behavior for the
entire lifetime of the domain, rather than just very early boot. And
as you may have seen, despite my fear that it may not be enough, Roger
and I have agreed to leave the WRMSR path alone.

Jan

Patch

--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -874,7 +874,8 @@  static int read_msr(unsigned int reg, ui
     struct vcpu *curr = current;
     const struct domain *currd = curr->domain;
     const struct cpuid_policy *cp = currd->arch.cpuid;
-    bool vpmu_msr = false;
+    bool vpmu_msr = false, warn = false;
+    uint64_t tmp;
     int ret;
 
     if ( (ret = guest_rdmsr(curr, reg, val)) != X86EMUL_UNHANDLEABLE )
@@ -882,7 +883,7 @@  static int read_msr(unsigned int reg, ui
         if ( ret == X86EMUL_EXCEPTION )
             x86_emul_hw_exception(TRAP_gp_fault, 0, ctxt);
 
-        return ret;
+        goto done;
     }
 
     switch ( reg )
@@ -986,7 +987,7 @@  static int read_msr(unsigned int reg, ui
         }
         /* fall through */
     default:
-        gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);
+        warn = true;
         break;
 
     normal:
@@ -995,7 +996,19 @@  static int read_msr(unsigned int reg, ui
         return X86EMUL_OKAY;
     }
 
-    return X86EMUL_UNHANDLEABLE;
+ done:
+    if ( ret != X86EMUL_OKAY && !curr->arch.pv.trap_ctxt[X86_EXC_GP].address &&
+         (reg >> 16) != 0x4000 && !rdmsr_safe(reg, tmp) )
+    {
+        gprintk(XENLOG_WARNING, "faking RDMSR 0x%08x\n", reg);
+        *val = 0;
+        x86_emul_reset_event(ctxt);
+        ret = X86EMUL_OKAY;
+    }
+    else if ( warn )
+        gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);
+
+    return ret;
 }
 
 static int write_msr(unsigned int reg, uint64_t val,
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -143,6 +143,12 @@  typedef unsigned long xen_ulong_t;
  *  Level == 1: Kernel may enter
  *  Level == 2: Kernel may enter
  *  Level == 3: Everyone may enter
+ *
+ * Note: For compatibility with kernels not setting up exception handlers
+ *       early enough, Xen will avoid trying to inject #GP (and hence crash
+ *       the domain) when an RDMSR would require this, but no handler was
+ *       set yet. The precise conditions are implementation specific, and
+ *       new code shouldn't rely on such behavior anyway.
  */
 #define TI_GET_DPL(_ti)      ((_ti)->flags & 3)
 #define TI_GET_IF(_ti)       ((_ti)->flags & 4)