diff mbox series

[v3,for-4.15] x86/msr: introduce an option for compatible MSR behavior selection

Message ID 20210309105634.7200-1-roger.pau@citrix.com (mailing list archive)
State Superseded
Headers show
Series [v3,for-4.15] x86/msr: introduce an option for compatible MSR behavior selection | expand

Commit Message

Roger Pau Monne March 9, 2021, 10:56 a.m. UTC
Introduce an option to allow selecting a behavior similar to the pre
Xen 4.15 one for accesses to MSRs not explicitly handled. Since commit
84e848fd7a162f669 and 322ec7c89f6640e accesses to MSRs not explicitly
handled by Xen result in the injection of a #GP to the guest. This
is a behavior change since previously a #GP was only injected if
accessing the MSR on the real hardware would also trigger a #GP, or if
the attempted to be set bits wouldn't match the hardware values (for
PV).

This seems to be problematic for some guests, so introduce an option
to fallback to this kind of legacy behavior without leaking the
underlying MSR values to the guest.

When the option is set, for both PV and HVM don't inject a #GP to the
guest on MSR read if reading the underlying MSR doesn't result in a
#GP, do the same for writes and simply discard the value to be written
on that case.

Note that for guests restored or migrated from previous Xen versions
the option is enabled by default, in order to keep a compatible
MSR behavior. Such compatibility is done at the libxl layer, to avoid
higher-level toolstacks from having to know the details about this flag.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
Boris, could you please test with Solaris to see if this fixes the
issue?
---
Changes since v2:
 - Apply the option to both HVM and PV guest.
 - Handle both reads and writes.
 - Rename to msr_relaxed.
 - Introduce a Xen commnad line option to set it for dom0.
 - Set it unconditionally for guests being restored or migrated from
   previous Xen versions.
 - Provide the option to dom0.
 - Attempt at adding the ocaml helpers (build tested only).

Changes since v1:
 - Only apply the option to HVM guests.
 - Only apply the special handling to MSR reads.
 - Sanitize the newly introduced flags field.
 - Print a warning message when the option is used.
---
 docs/man/xl.cfg.5.pod.in            | 19 +++++++++++++++++++
 docs/misc/xen-command-line.pandoc   | 17 ++++++++++++++++-
 tools/include/libxl.h               |  7 +++++++
 tools/libs/light/libxl_arch.h       |  5 +++++
 tools/libs/light/libxl_arm.c        |  6 ++++++
 tools/libs/light/libxl_create.c     |  7 +++++++
 tools/libs/light/libxl_internal.c   |  3 +++
 tools/libs/light/libxl_types.idl    |  2 ++
 tools/libs/light/libxl_x86.c        | 20 ++++++++++++++++++++
 tools/ocaml/libs/xc/xenctrl.ml      |  4 ++++
 tools/ocaml/libs/xc/xenctrl.mli     |  4 ++++
 tools/ocaml/libs/xc/xenctrl_stubs.c | 10 ++++++++++
 tools/xl/xl_parse.c                 |  7 +++++++
 xen/arch/x86/dom0_build.c           |  3 +++
 xen/arch/x86/domain.c               |  9 +++++++++
 xen/arch/x86/hvm/svm/svm.c          | 10 ++++++++++
 xen/arch/x86/hvm/vmx/vmx.c          | 10 ++++++++++
 xen/arch/x86/pv/emul-priv-op.c      | 10 ++++++++++
 xen/arch/x86/setup.c                |  1 +
 xen/include/asm-x86/domain.h        |  3 +++
 xen/include/asm-x86/setup.h         |  1 +
 xen/include/public/arch-x86/xen.h   |  8 ++++++++
 22 files changed, 165 insertions(+), 1 deletion(-)

Comments

Christian Lindig March 9, 2021, 11:06 a.m. UTC | #1
The impact on the OCaml side is minimal and looks good to me.

Acked-by: Christian Lindig <christian.lindig@citrix.com>
Jan Beulich March 9, 2021, 11:36 a.m. UTC | #2
On 09.03.2021 11:56, Roger Pau Monne wrote:
> Introduce an option to allow selecting a behavior similar to the pre
> Xen 4.15 one for accesses to MSRs not explicitly handled. Since commit
> 84e848fd7a162f669 and 322ec7c89f6640e accesses to MSRs not explicitly
> handled by Xen result in the injection of a #GP to the guest. This
> is a behavior change since previously a #GP was only injected if
> accessing the MSR on the real hardware would also trigger a #GP, or if
> the attempted to be set bits wouldn't match the hardware values (for
> PV).
> 
> This seems to be problematic for some guests, so introduce an option
> to fallback to this kind of legacy behavior without leaking the
> underlying MSR values to the guest.
> 
> When the option is set, for both PV and HVM don't inject a #GP to the
> guest on MSR read if reading the underlying MSR doesn't result in a
> #GP, do the same for writes and simply discard the value to be written
> on that case.
> 
> Note that for guests restored or migrated from previous Xen versions
> the option is enabled by default, in order to keep a compatible
> MSR behavior. Such compatibility is done at the libxl layer, to avoid
> higher-level toolstacks from having to know the details about this flag.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

I'm generally okay with this approach, but wouldn't want to give it
my R-b until it's at least clear it's not entirely unacceptable to
anyone else (Andrew in particular). Couple of remarks:

> Changes since v2:
>  - Apply the option to both HVM and PV guest.
>  - Handle both reads and writes.

I take it that it's intentional that you didn't allow separate read
and write control?

> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -1795,6 +1795,7 @@ static int svm_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
>      const struct domain *d = v->domain;
>      struct vmcb_struct *vmcb = v->arch.hvm.svm.vmcb;
>      const struct nestedsvm *nsvm = &vcpu_nestedsvm(v);
> +    uint64_t tmp;
>  
>      switch ( msr )
>      {

While to some degree a matter of taste, I think such variables needed
only inside a switch() and not needing an initializer would better
live in the respective switch()'s scope.

> --- a/xen/arch/x86/pv/emul-priv-op.c
> +++ b/xen/arch/x86/pv/emul-priv-op.c
> @@ -875,6 +875,7 @@ static int read_msr(unsigned int reg, uint64_t *val,
>      const struct domain *currd = curr->domain;
>      const struct cpuid_policy *cp = currd->arch.cpuid;
>      bool vpmu_msr = false;
> +    uint64_t tmp;
>      int ret;
>  
>      if ( (ret = guest_rdmsr(curr, reg, val)) != X86EMUL_UNHANDLEABLE )
> @@ -986,6 +987,12 @@ static int read_msr(unsigned int reg, uint64_t *val,
>          }
>          /* fall through */
>      default:
> +        if ( currd->arch.msr_relaxed && !rdmsr_safe(reg, tmp) )
> +        {
> +            *val = 0;
> +            return X86EMUL_OKAY;
> +        }
> +
>          gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);
>          break;
>  
> @@ -1148,6 +1155,9 @@ static int write_msr(unsigned int reg, uint64_t val,
>          }
>          /* fall through */
>      default:
> +        if ( currd->arch.msr_relaxed && !rdmsr_safe(reg, val) )
> +            return X86EMUL_OKAY;
> +
>          gdprintk(XENLOG_WARNING,
>                   "WRMSR 0x%08x val 0x%016"PRIx64" unimplemented\n",
>                   reg, val);

So what are your thoughts wrt my change to this file? Drop it
altogether and require people to use this new option? Or do you
see both coexist? In the latter case, since you had suggested
that I drop the write side of my change - does your changing of
the write path indicate you've changed your mind?

Jan
Roger Pau Monne March 9, 2021, 1:53 p.m. UTC | #3
On Tue, Mar 09, 2021 at 12:36:39PM +0100, Jan Beulich wrote:
> On 09.03.2021 11:56, Roger Pau Monne wrote:
> > Introduce an option to allow selecting a behavior similar to the pre
> > Xen 4.15 one for accesses to MSRs not explicitly handled. Since commit
> > 84e848fd7a162f669 and 322ec7c89f6640e accesses to MSRs not explicitly
> > handled by Xen result in the injection of a #GP to the guest. This
> > is a behavior change since previously a #GP was only injected if
> > accessing the MSR on the real hardware would also trigger a #GP, or if
> > the attempted to be set bits wouldn't match the hardware values (for
> > PV).
> > 
> > This seems to be problematic for some guests, so introduce an option
> > to fallback to this kind of legacy behavior without leaking the
> > underlying MSR values to the guest.
> > 
> > When the option is set, for both PV and HVM don't inject a #GP to the
> > guest on MSR read if reading the underlying MSR doesn't result in a
> > #GP, do the same for writes and simply discard the value to be written
> > on that case.
> > 
> > Note that for guests restored or migrated from previous Xen versions
> > the option is enabled by default, in order to keep a compatible
> > MSR behavior. Such compatibility is done at the libxl layer, to avoid
> > higher-level toolstacks from having to know the details about this flag.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> I'm generally okay with this approach, but wouldn't want to give it
> my R-b until it's at least clear it's not entirely unacceptable to
> anyone else (Andrew in particular). Couple of remarks:

Thanks.

> > Changes since v2:
> >  - Apply the option to both HVM and PV guest.
> >  - Handle both reads and writes.
> 
> I take it that it's intentional that you didn't allow separate read
> and write control?

Yes, I don't have a strong opinion, but I think just having a single
option is better: guests requiring the read side bodge are also likely
to require the same adjustment on the write side. It's also better
from a user perspective, as it's likely people would enable them in
tandem anyway.

> 
> > --- a/xen/arch/x86/hvm/svm/svm.c
> > +++ b/xen/arch/x86/hvm/svm/svm.c
> > @@ -1795,6 +1795,7 @@ static int svm_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
> >      const struct domain *d = v->domain;
> >      struct vmcb_struct *vmcb = v->arch.hvm.svm.vmcb;
> >      const struct nestedsvm *nsvm = &vcpu_nestedsvm(v);
> > +    uint64_t tmp;
> >  
> >      switch ( msr )
> >      {
> 
> While to some degree a matter of taste, I think such variables needed
> only inside a switch() and not needing an initializer would better
> live in the respective switch()'s scope.

I can indeed define them inside the switch statement.

> > --- a/xen/arch/x86/pv/emul-priv-op.c
> > +++ b/xen/arch/x86/pv/emul-priv-op.c
> > @@ -875,6 +875,7 @@ static int read_msr(unsigned int reg, uint64_t *val,
> >      const struct domain *currd = curr->domain;
> >      const struct cpuid_policy *cp = currd->arch.cpuid;
> >      bool vpmu_msr = false;
> > +    uint64_t tmp;
> >      int ret;
> >  
> >      if ( (ret = guest_rdmsr(curr, reg, val)) != X86EMUL_UNHANDLEABLE )
> > @@ -986,6 +987,12 @@ static int read_msr(unsigned int reg, uint64_t *val,
> >          }
> >          /* fall through */
> >      default:
> > +        if ( currd->arch.msr_relaxed && !rdmsr_safe(reg, tmp) )
> > +        {
> > +            *val = 0;
> > +            return X86EMUL_OKAY;
> > +        }
> > +
> >          gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);
> >          break;
> >  
> > @@ -1148,6 +1155,9 @@ static int write_msr(unsigned int reg, uint64_t val,
> >          }
> >          /* fall through */
> >      default:
> > +        if ( currd->arch.msr_relaxed && !rdmsr_safe(reg, val) )
> > +            return X86EMUL_OKAY;
> > +
> >          gdprintk(XENLOG_WARNING,
> >                   "WRMSR 0x%08x val 0x%016"PRIx64" unimplemented\n",
> >                   reg, val);
> 
> So what are your thoughts wrt my change to this file? Drop it
> altogether and require people to use this new option? Or do you
> see both coexist?

I wouldn't be opposed to have both changes co-exist, as long as the PV
one is made part of the PV ABI, that is have it properly described in
the public headers as part of the PV behavior. I've replied with some
details along those lines in your patch.

> In the latter case, since you had suggested
> that I drop the write side of my change - does your changing of
> the write path indicate you've changed your mind?

Yes, I think we need to provide an option to allow users to revert
back to an MSR behavior as close as possible to the previous one for
compatibility reasons, and that should include the write side even if
we don't know any users requiring it right now.

We would be in a bad position if that use-case gets discovered after
the release, so it's IMO best to provide an option that covers both
read and write side straight away.

Thanks, Roger.
Jan Beulich March 9, 2021, 2:26 p.m. UTC | #4
On 09.03.2021 14:53, Roger Pau Monné wrote:
> On Tue, Mar 09, 2021 at 12:36:39PM +0100, Jan Beulich wrote:
>> On 09.03.2021 11:56, Roger Pau Monne wrote:
>>> Changes since v2:
>>>  - Apply the option to both HVM and PV guest.
>>>  - Handle both reads and writes.
>>
>> I take it that it's intentional that you didn't allow separate read
>> and write control?
> 
> Yes, I don't have a strong opinion, but I think just having a single
> option is better: guests requiring the read side bodge are also likely
> to require the same adjustment on the write side.

I'm not convinced of this - there are many MSRs which merely
need reading to discover a certain piece of (configuration)
information. Note in e.g. how the problem I did run into was
affecting RDMSR only.

> It's also better
> from a user perspective, as it's likely people would enable them in
> tandem anyway.

This part I agree with; in fact I did mention this earlier on.

>>> --- a/xen/arch/x86/pv/emul-priv-op.c
>>> +++ b/xen/arch/x86/pv/emul-priv-op.c
>>> @@ -875,6 +875,7 @@ static int read_msr(unsigned int reg, uint64_t *val,
>>>      const struct domain *currd = curr->domain;
>>>      const struct cpuid_policy *cp = currd->arch.cpuid;
>>>      bool vpmu_msr = false;
>>> +    uint64_t tmp;
>>>      int ret;
>>>  
>>>      if ( (ret = guest_rdmsr(curr, reg, val)) != X86EMUL_UNHANDLEABLE )
>>> @@ -986,6 +987,12 @@ static int read_msr(unsigned int reg, uint64_t *val,
>>>          }
>>>          /* fall through */
>>>      default:
>>> +        if ( currd->arch.msr_relaxed && !rdmsr_safe(reg, tmp) )
>>> +        {
>>> +            *val = 0;
>>> +            return X86EMUL_OKAY;
>>> +        }
>>> +
>>>          gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);
>>>          break;
>>>  
>>> @@ -1148,6 +1155,9 @@ static int write_msr(unsigned int reg, uint64_t val,
>>>          }
>>>          /* fall through */
>>>      default:
>>> +        if ( currd->arch.msr_relaxed && !rdmsr_safe(reg, val) )
>>> +            return X86EMUL_OKAY;
>>> +
>>>          gdprintk(XENLOG_WARNING,
>>>                   "WRMSR 0x%08x val 0x%016"PRIx64" unimplemented\n",
>>>                   reg, val);
>>
>> So what are your thoughts wrt my change to this file? Drop it
>> altogether and require people to use this new option? Or do you
>> see both coexist?
> 
> I wouldn't be opposed to have both changes co-exist, as long as the PV
> one is made part of the PV ABI, that is have it properly described in
> the public headers as part of the PV behavior. I've replied with some
> details along those lines in your patch.
> 
>> In the latter case, since you had suggested
>> that I drop the write side of my change - does your changing of
>> the write path indicate you've changed your mind?
> 
> Yes, I think we need to provide an option to allow users to revert
> back to an MSR behavior as close as possible to the previous one for
> compatibility reasons, and that should include the write side even if
> we don't know any users requiring it right now.
> 
> We would be in a bad position if that use-case gets discovered after
> the release, so it's IMO best to provide an option that covers both
> read and write side straight away.

Well, for your change it's indeed "an option". For my change it's
not optional behavior (and we also don't mean it to be). Hence I'm
not sure what I should read out of your reply.

Jan
Roger Pau Monne March 9, 2021, 2:53 p.m. UTC | #5
On Tue, Mar 09, 2021 at 03:26:07PM +0100, Jan Beulich wrote:
> On 09.03.2021 14:53, Roger Pau Monné wrote:
> > On Tue, Mar 09, 2021 at 12:36:39PM +0100, Jan Beulich wrote:
> >> On 09.03.2021 11:56, Roger Pau Monne wrote:
> >>> @@ -1148,6 +1155,9 @@ static int write_msr(unsigned int reg, uint64_t val,
> >>>          }
> >>>          /* fall through */
> >>>      default:
> >>> +        if ( currd->arch.msr_relaxed && !rdmsr_safe(reg, val) )
> >>> +            return X86EMUL_OKAY;
> >>> +
> >>>          gdprintk(XENLOG_WARNING,
> >>>                   "WRMSR 0x%08x val 0x%016"PRIx64" unimplemented\n",
> >>>                   reg, val);
> >>
> >> So what are your thoughts wrt my change to this file? Drop it
> >> altogether and require people to use this new option? Or do you
> >> see both coexist?
> > 
> > I wouldn't be opposed to have both changes co-exist, as long as the PV
> > one is made part of the PV ABI, that is have it properly described in
> > the public headers as part of the PV behavior. I've replied with some
> > details along those lines in your patch.
> > 
> >> In the latter case, since you had suggested
> >> that I drop the write side of my change - does your changing of
> >> the write path indicate you've changed your mind?
> > 
> > Yes, I think we need to provide an option to allow users to revert
> > back to an MSR behavior as close as possible to the previous one for
> > compatibility reasons, and that should include the write side even if
> > we don't know any users requiring it right now.
> > 
> > We would be in a bad position if that use-case gets discovered after
> > the release, so it's IMO best to provide an option that covers both
> > read and write side straight away.
> 
> Well, for your change it's indeed "an option". For my change it's
> not optional behavior (and we also don't mean it to be). Hence I'm
> not sure what I should read out of your reply.

Sorry, maybe my reply wasn't clear. The part of the quote above was my
reply to me re-adding the write side of the change. The reply to
whether I think your PV change is required was in the chunk above.

To clarify:

 - I do think we need the write side of this change, just for the sake
   of providing a behavior as close as possible to the previous
   release.
 - I don't have a strong opinion whether we need two options
   ({rd,wr}msr:relaxed) or just a single one (msr_relaxed). I favor
   a single one because it's likely users will enable both in tandem
   anyway (like you mentioned).
 - I'm fine with your change to PV as long as it's documented in the
   public headers as part of the PV ABI, since it will be enabled
   unconditionally (more about that in a reply [0] to your patch).
 - I think your change to PV should cover the write side also.

Hope that's less confusing.

Thanks, Roger.

[0] https://lore.kernel.org/xen-devel/YEd6GTXJqRIjijl6@Air-de-Roger/
Ian Jackson March 9, 2021, 5:01 p.m. UTC | #6
Roger Pau Monne writes ("[PATCH v3 for-4.15] x86/msr: introduce an option for compatible MSR behavior selection"):
> Introduce an option to allow selecting a behavior similar to the pre
> Xen 4.15 one for accesses to MSRs not explicitly handled. Since commit
> 84e848fd7a162f669 and 322ec7c89f6640e accesses to MSRs not explicitly
> handled by Xen result in the injection of a #GP to the guest. This
> is a behavior change since previously a #GP was only injected if
> accessing the MSR on the real hardware would also trigger a #GP, or if
> the attempted to be set bits wouldn't match the hardware values (for
> PV).
> 
> This seems to be problematic for some guests, so introduce an option
> to fallback to this kind of legacy behavior without leaking the
> underlying MSR values to the guest.
> 
> When the option is set, for both PV and HVM don't inject a #GP to the
> guest on MSR read if reading the underlying MSR doesn't result in a
> #GP, do the same for writes and simply discard the value to be written
> on that case.
> 
> Note that for guests restored or migrated from previous Xen versions
> the option is enabled by default, in order to keep a compatible
> MSR behavior. Such compatibility is done at the libxl layer, to avoid
> higher-level toolstacks from having to know the details about this flag.

For the tools parts

Reviewed-by: Ian Jackson <iwj@xenproject.org>
Andrew Cooper March 9, 2021, 7:13 p.m. UTC | #7
On 09/03/2021 10:56, Roger Pau Monne wrote:
> Introduce an option to allow selecting a behavior similar to the pre
> Xen 4.15 one for accesses to MSRs not explicitly handled. Since commit
> 84e848fd7a162f669 and 322ec7c89f6640e accesses to MSRs not explicitly
> handled by Xen result in the injection of a #GP to the guest. This
> is a behavior change since previously a #GP was only injected if
> accessing the MSR on the real hardware would also trigger a #GP, or if
> the attempted to be set bits wouldn't match the hardware values (for
> PV).
>
> This seems to be problematic for some guests, so introduce an option
> to fallback to this kind of legacy behavior without leaking the
> underlying MSR values to the guest.
>
> When the option is set, for both PV and HVM don't inject a #GP to the
> guest on MSR read if reading the underlying MSR doesn't result in a
> #GP, do the same for writes and simply discard the value to be written
> on that case.
>
> Note that for guests restored or migrated from previous Xen versions
> the option is enabled by default, in order to keep a compatible
> MSR behavior. Such compatibility is done at the libxl layer, to avoid
> higher-level toolstacks from having to know the details about this flag.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Thankyou for doing this.  By and large, Reviewed-by: Andrew Cooper
<andrew.cooper3@citrix.com>, subject to some pandoc syntax fixes below.

However, I think it is worth saying explicitly that the reasons behind
the changes in 84e848fd7a162f669 and 322ec7c89f6640e (not leaking
hardware MSRs, and not breaking #GP-probing) are still legitimate, and
this influences the change in behaviour between msr_relaxed and 4.14
(i.e. read-as-zero rather than leaking).

> diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
> index 4737c92bfe..6cf61a5c57 100644
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -740,7 +740,7 @@ Specify the bit width of the DMA heap.
>  
>  ### dom0
>      = List of [ pv | pvh, shadow=<bool>, verbose=<bool>,
> -                cpuid-faulting=<bool> ]
> +                cpuid-faulting=<bool>, msr-relaxed=<bool> ]
>  
>      Applicability: x86
>  
> @@ -789,6 +789,21 @@ Controls for how dom0 is constructed on x86 systems.
>      restore the pre-4.13 behaviour.  If specifying `no-cpuid-faulting` fixes
>      an issue in dom0, please report a bug.
>  
> +*   msr-relaxed: Select whether to use a relaxed behavior for accesses to MSRs
> +    not explicitly handled by Xen instead of injecting a #GP to dom0.
> +    Such access mode will force Xen to replicate the behavior from the hardware
> +    it's currently running on in order to decide whether a #GP is injected to
> +    dom0 for MSR reads.  Note that dom0 is never allowed to read the value of
> +    unhandled MSRs, this option only changes whether a #GP might be injected
> +    or not.  For writes a #GP won't be injected as long as reading the
> +    underlying MSR doesn't result in a #GP.

I don't find this overly clear to follow, and it also misses stating the
default explicitly.  How about:

---8<---
The `msr-relaxed` boolean is an interim option, and defaults to false.

In Xen 4.15, the default behaviour for unhandled MSRs has been changed,
to avoid leaking host data into guests, and to avoid breaking guest
logic which uses \#GP probing to identify the availability of MSRs.

However, this new stricter behaviour has the possibility to break
guests, and a more 4.14-like behaviour can be selected by specifying
`dom0=msr-relaxed`.

If using this option is necessary to fix an issue, please report a bug.
---8<---

Other pending Sphinx work is drawing together the "how to contact us"
information, so the various "please report a bug"s through this document
will turn into links.

> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index 23bbb6e8c1..d217c223b0 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -749,6 +749,7 @@ static struct domain *__init create_dom0(const module_t *image,
>          .max_grant_frames = -1,
>          .max_maptrack_frames = -1,
>          .max_vcpus = dom0_max_vcpus(),
> +        .arch.domain_flags = opt_dom0_msr_relaxed ? XEN_X86_MSR_RELAXED : 0,

Can I request

.arch = {
    .domain_flags = opt_dom0_msr_relaxed ? XEN_X86_MSR_RELAXED : 0,
},

please, to reduce the patch complexity of further additions inside .arch.

> diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
> index 629cb2ba40..f9d0e33b94 100644
> --- a/xen/include/public/arch-x86/xen.h
> +++ b/xen/include/public/arch-x86/xen.h
> @@ -304,6 +304,14 @@ struct xen_arch_domainconfig {
>                                       XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ |\
>                                       XEN_X86_EMU_VPCI)
>      uint32_t emulation_flags;
> +
> +/*
> + * Select whether to use a relaxed behavior for accesses to MSRs not explicitly
> + * handled by Xen instead of injecting a #GP to the guest. Note this option
> + * doesn't allow the guest to read or write to the underlying MSR.
> + */
> +#define XEN_X86_MSR_RELAXED (1u << 0)
> +    uint32_t domain_flags;

The domain prefix is somewhat redundant, given the name of the structure
or the hypercall it is used for.  OTOH, 'flags' on its own probably
isn't ok.  Thoughts on misc_flags?

~Andrew
Andrew Cooper March 9, 2021, 7:57 p.m. UTC | #8
On 09/03/2021 11:36, Jan Beulich wrote:
> On 09.03.2021 11:56, Roger Pau Monne wrote:
>> Introduce an option to allow selecting a behavior similar to the pre
>> Xen 4.15 one for accesses to MSRs not explicitly handled. Since commit
>> 84e848fd7a162f669 and 322ec7c89f6640e accesses to MSRs not explicitly
>> handled by Xen result in the injection of a #GP to the guest. This
>> is a behavior change since previously a #GP was only injected if
>> accessing the MSR on the real hardware would also trigger a #GP, or if
>> the attempted to be set bits wouldn't match the hardware values (for
>> PV).
>>
>> This seems to be problematic for some guests, so introduce an option
>> to fallback to this kind of legacy behavior without leaking the
>> underlying MSR values to the guest.
>>
>> When the option is set, for both PV and HVM don't inject a #GP to the
>> guest on MSR read if reading the underlying MSR doesn't result in a
>> #GP, do the same for writes and simply discard the value to be written
>> on that case.
>>
>> Note that for guests restored or migrated from previous Xen versions
>> the option is enabled by default, in order to keep a compatible
>> MSR behavior. Such compatibility is done at the libxl layer, to avoid
>> higher-level toolstacks from having to know the details about this flag.
>>
>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> I'm generally okay with this approach, but wouldn't want to give it
> my R-b until it's at least clear it's not entirely unacceptable to
> anyone else (Andrew in particular).

I'm broadly happy with this approach.  Some feedback just sent.

>  Couple of remarks:
>
>> Changes since v2:
>>  - Apply the option to both HVM and PV guest.
>>  - Handle both reads and writes.
> I take it that it's intentional that you didn't allow separate read
> and write control?

I think v3 is the correct approach.

This is strictly a backwards compatibility option.  Splitting read and
write just doubles the complexity the admin has wrestle with to recover
a legacy guest.

As we explicitly intend to retire the option in due course, this is very
much a "make my guest work until my upstream (either Xen or kernel) can
fix the bug".

>
>> --- a/xen/arch/x86/hvm/svm/svm.c
>> +++ b/xen/arch/x86/hvm/svm/svm.c
>> @@ -1795,6 +1795,7 @@ static int svm_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
>>      const struct domain *d = v->domain;
>>      struct vmcb_struct *vmcb = v->arch.hvm.svm.vmcb;
>>      const struct nestedsvm *nsvm = &vcpu_nestedsvm(v);
>> +    uint64_t tmp;
>>  
>>      switch ( msr )
>>      {
> While to some degree a matter of taste, I think such variables needed
> only inside a switch() and not needing an initializer would better
> live in the respective switch()'s scope.

Actually, I was hoping to make a CODING_SYTLE change removing this as a
permitted construct.

Recent compilers have hardening features to automatically initialise all
stack variables, because even if your code isn't architecturally buggy,
an attacker can launch speculative attacks the stack rubble.

However, because of the way the compiler transform works, it cannot
tolerate this specific construct in a switch statement, as there is no
available execution to cope with the implicit "=0" or "=POISON".

Even within Xen, we don't have many examples of this construct AFAICT,
and there is a concrete security advantage to being able to support the
compiler hardening.

>> --- a/xen/arch/x86/pv/emul-priv-op.c
>> +++ b/xen/arch/x86/pv/emul-priv-op.c
>> @@ -875,6 +875,7 @@ static int read_msr(unsigned int reg, uint64_t *val,
>>      const struct domain *currd = curr->domain;
>>      const struct cpuid_policy *cp = currd->arch.cpuid;
>>      bool vpmu_msr = false;
>> +    uint64_t tmp;
>>      int ret;
>>  
>>      if ( (ret = guest_rdmsr(curr, reg, val)) != X86EMUL_UNHANDLEABLE )
>> @@ -986,6 +987,12 @@ static int read_msr(unsigned int reg, uint64_t *val,
>>          }
>>          /* fall through */
>>      default:
>> +        if ( currd->arch.msr_relaxed && !rdmsr_safe(reg, tmp) )
>> +        {
>> +            *val = 0;
>> +            return X86EMUL_OKAY;
>> +        }
>> +
>>          gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);
>>          break;
>>  
>> @@ -1148,6 +1155,9 @@ static int write_msr(unsigned int reg, uint64_t val,
>>          }
>>          /* fall through */
>>      default:
>> +        if ( currd->arch.msr_relaxed && !rdmsr_safe(reg, val) )
>> +            return X86EMUL_OKAY;
>> +
>>          gdprintk(XENLOG_WARNING,
>>                   "WRMSR 0x%08x val 0x%016"PRIx64" unimplemented\n",
>>                   reg, val);
> So what are your thoughts wrt my change to this file? Drop it
> altogether and require people to use this new option? Or do you
> see both coexist? In the latter case, since you had suggested
> that I drop the write side of my change - does your changing of
> the write path indicate you've changed your mind?

I don't think we should legitimise buggy PV behaviour, either by
codifying in the ABI, or providing a knob beyond this one.

A guest blindly stumbling forward with a missed #GP could very well
worse than crashing hard.

Case in point - the 4.15 behaviour spotted a very serious bug in NetBSD
where it tried writing MSR_PAT with its own choice (which wasn't the
same as Xen's choice).  The consequence of this bug is getting cache
attributes in pagetables wrong.

It is unfortunate that multiple bugs have combined to make this mess,
but every instance needs investigating and fixing.  Continuing to paper
over the hole doesn't help anyone in the long run.

~Andrew
Roger Pau Monne March 10, 2021, 9:33 a.m. UTC | #9
On Tue, Mar 09, 2021 at 07:13:26PM +0000, Andrew Cooper wrote:
> On 09/03/2021 10:56, Roger Pau Monne wrote:
> > Introduce an option to allow selecting a behavior similar to the pre
> > Xen 4.15 one for accesses to MSRs not explicitly handled. Since commit
> > 84e848fd7a162f669 and 322ec7c89f6640e accesses to MSRs not explicitly
> > handled by Xen result in the injection of a #GP to the guest. This
> > is a behavior change since previously a #GP was only injected if
> > accessing the MSR on the real hardware would also trigger a #GP, or if
> > the attempted to be set bits wouldn't match the hardware values (for
> > PV).
> >
> > This seems to be problematic for some guests, so introduce an option
> > to fallback to this kind of legacy behavior without leaking the
> > underlying MSR values to the guest.
> >
> > When the option is set, for both PV and HVM don't inject a #GP to the
> > guest on MSR read if reading the underlying MSR doesn't result in a
> > #GP, do the same for writes and simply discard the value to be written
> > on that case.
> >
> > Note that for guests restored or migrated from previous Xen versions
> > the option is enabled by default, in order to keep a compatible
> > MSR behavior. Such compatibility is done at the libxl layer, to avoid
> > higher-level toolstacks from having to know the details about this flag.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thankyou for doing this.  By and large, Reviewed-by: Andrew Cooper
> <andrew.cooper3@citrix.com>, subject to some pandoc syntax fixes below.

Thanks.

> However, I think it is worth saying explicitly that the reasons behind
> the changes in 84e848fd7a162f669 and 322ec7c89f6640e (not leaking
> hardware MSRs, and not breaking #GP-probing) are still legitimate, and
> this influences the change in behaviour between msr_relaxed and 4.14
> (i.e. read-as-zero rather than leaking).

Right, I tried to convey this fact by using "compatible" behavior
instead of "legacy" one, as the behavior provided by msr_relaxed is
not exactly the same as what you would get on 4.14.

I've added the following at the end of the first paragraph:

"The reasons for not leaking hardware MSR values and injecting a
#GP are fully valid, so the solution proposed here should be
considered a temporary workaround until all the required MSRs are
properly handled."

> > diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
> > index 4737c92bfe..6cf61a5c57 100644
> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -740,7 +740,7 @@ Specify the bit width of the DMA heap.
> >  
> >  ### dom0
> >      = List of [ pv | pvh, shadow=<bool>, verbose=<bool>,
> > -                cpuid-faulting=<bool> ]
> > +                cpuid-faulting=<bool>, msr-relaxed=<bool> ]
> >  
> >      Applicability: x86
> >  
> > @@ -789,6 +789,21 @@ Controls for how dom0 is constructed on x86 systems.
> >      restore the pre-4.13 behaviour.  If specifying `no-cpuid-faulting` fixes
> >      an issue in dom0, please report a bug.
> >  
> > +*   msr-relaxed: Select whether to use a relaxed behavior for accesses to MSRs
> > +    not explicitly handled by Xen instead of injecting a #GP to dom0.
> > +    Such access mode will force Xen to replicate the behavior from the hardware
> > +    it's currently running on in order to decide whether a #GP is injected to
> > +    dom0 for MSR reads.  Note that dom0 is never allowed to read the value of
> > +    unhandled MSRs, this option only changes whether a #GP might be injected
> > +    or not.  For writes a #GP won't be injected as long as reading the
> > +    underlying MSR doesn't result in a #GP.
> 
> I don't find this overly clear to follow, and it also misses stating the
> default explicitly.  How about:
> 
> ---8<---
> The `msr-relaxed` boolean is an interim option, and defaults to false.
> 
> In Xen 4.15, the default behaviour for unhandled MSRs has been changed,
> to avoid leaking host data into guests, and to avoid breaking guest
> logic which uses \#GP probing to identify the availability of MSRs.
> 
> However, this new stricter behaviour has the possibility to break
> guests, and a more 4.14-like behaviour can be selected by specifying
> `dom0=msr-relaxed`.
> 
> If using this option is necessary to fix an issue, please report a bug.
> ---8<---

OK, this seems to be less verbose that what I previously had, but I'm
fine with it. I assume this should also be changed in xl.cfg then?

> Other pending Sphinx work is drawing together the "how to contact us"
> information, so the various "please report a bug"s through this document
> will turn into links.
> 
> > diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> > index 23bbb6e8c1..d217c223b0 100644
> > --- a/xen/arch/x86/setup.c
> > +++ b/xen/arch/x86/setup.c
> > @@ -749,6 +749,7 @@ static struct domain *__init create_dom0(const module_t *image,
> >          .max_grant_frames = -1,
> >          .max_maptrack_frames = -1,
> >          .max_vcpus = dom0_max_vcpus(),
> > +        .arch.domain_flags = opt_dom0_msr_relaxed ? XEN_X86_MSR_RELAXED : 0,
> 
> Can I request
> 
> .arch = {
>     .domain_flags = opt_dom0_msr_relaxed ? XEN_X86_MSR_RELAXED : 0,
> },
> 
> please, to reduce the patch complexity of further additions inside .arch.

Sure.

> > diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
> > index 629cb2ba40..f9d0e33b94 100644
> > --- a/xen/include/public/arch-x86/xen.h
> > +++ b/xen/include/public/arch-x86/xen.h
> > @@ -304,6 +304,14 @@ struct xen_arch_domainconfig {
> >                                       XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ |\
> >                                       XEN_X86_EMU_VPCI)
> >      uint32_t emulation_flags;
> > +
> > +/*
> > + * Select whether to use a relaxed behavior for accesses to MSRs not explicitly
> > + * handled by Xen instead of injecting a #GP to the guest. Note this option
> > + * doesn't allow the guest to read or write to the underlying MSR.
> > + */
> > +#define XEN_X86_MSR_RELAXED (1u << 0)
> > +    uint32_t domain_flags;
> 
> The domain prefix is somewhat redundant, given the name of the structure
> or the hypercall it is used for.  OTOH, 'flags' on its own probably
> isn't ok.  Thoughts on misc_flags?

I'm fine with it, will change unless Jan objects to the name.

Thanks, Roger.
Jan Beulich March 10, 2021, 10:20 a.m. UTC | #10
On 10.03.2021 10:33, Roger Pau Monné wrote:
> On Tue, Mar 09, 2021 at 07:13:26PM +0000, Andrew Cooper wrote:
>> On 09/03/2021 10:56, Roger Pau Monne wrote:
>>> --- a/xen/include/public/arch-x86/xen.h
>>> +++ b/xen/include/public/arch-x86/xen.h
>>> @@ -304,6 +304,14 @@ struct xen_arch_domainconfig {
>>>                                       XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ |\
>>>                                       XEN_X86_EMU_VPCI)
>>>      uint32_t emulation_flags;
>>> +
>>> +/*
>>> + * Select whether to use a relaxed behavior for accesses to MSRs not explicitly
>>> + * handled by Xen instead of injecting a #GP to the guest. Note this option
>>> + * doesn't allow the guest to read or write to the underlying MSR.
>>> + */
>>> +#define XEN_X86_MSR_RELAXED (1u << 0)
>>> +    uint32_t domain_flags;
>>
>> The domain prefix is somewhat redundant, given the name of the structure
>> or the hypercall it is used for.  OTOH, 'flags' on its own probably
>> isn't ok.  Thoughts on misc_flags?
> 
> I'm fine with it, will change unless Jan objects to the name.

I'm fine with the suggestion.

Jan
Roger Pau Monne March 10, 2021, 11:13 a.m. UTC | #11
On Tue, Mar 09, 2021 at 07:57:42PM +0000, Andrew Cooper wrote:
> On 09/03/2021 11:36, Jan Beulich wrote:
> > On 09.03.2021 11:56, Roger Pau Monne wrote:
> >> --- a/xen/arch/x86/pv/emul-priv-op.c
> >> +++ b/xen/arch/x86/pv/emul-priv-op.c
> >> @@ -875,6 +875,7 @@ static int read_msr(unsigned int reg, uint64_t *val,
> >>      const struct domain *currd = curr->domain;
> >>      const struct cpuid_policy *cp = currd->arch.cpuid;
> >>      bool vpmu_msr = false;
> >> +    uint64_t tmp;
> >>      int ret;
> >>  
> >>      if ( (ret = guest_rdmsr(curr, reg, val)) != X86EMUL_UNHANDLEABLE )
> >> @@ -986,6 +987,12 @@ static int read_msr(unsigned int reg, uint64_t *val,
> >>          }
> >>          /* fall through */
> >>      default:
> >> +        if ( currd->arch.msr_relaxed && !rdmsr_safe(reg, tmp) )
> >> +        {
> >> +            *val = 0;
> >> +            return X86EMUL_OKAY;
> >> +        }
> >> +
> >>          gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);
> >>          break;
> >>  
> >> @@ -1148,6 +1155,9 @@ static int write_msr(unsigned int reg, uint64_t val,
> >>          }
> >>          /* fall through */
> >>      default:
> >> +        if ( currd->arch.msr_relaxed && !rdmsr_safe(reg, val) )
> >> +            return X86EMUL_OKAY;
> >> +
> >>          gdprintk(XENLOG_WARNING,
> >>                   "WRMSR 0x%08x val 0x%016"PRIx64" unimplemented\n",
> >>                   reg, val);
> > So what are your thoughts wrt my change to this file? Drop it
> > altogether and require people to use this new option? Or do you
> > see both coexist? In the latter case, since you had suggested
> > that I drop the write side of my change - does your changing of
> > the write path indicate you've changed your mind?
> 
> I don't think we should legitimise buggy PV behaviour, either by
> codifying in the ABI, or providing a knob beyond this one.

In that case - can we try to figure out which MSRs is PV Linux trying
to access without having the #GP handler setup?

Maybe we can try to handle the ones we know the buggy Linux versions
will try to access without having #GP setup?

I know it's not possible to test all possible Linux versions, but we
could at least try to get the ones we know about fixed. Is the only
offending one we know about MSR_K8_HWCR?

> A guest blindly stumbling forward with a missed #GP could very well
> worse than crashing hard.
> 
> Case in point - the 4.15 behaviour spotted a very serious bug in NetBSD
> where it tried writing MSR_PAT with its own choice (which wasn't the
> same as Xen's choice).  The consequence of this bug is getting cache
> attributes in pagetables wrong.

Was that write by NetBSD done without the #GP handler setup?

Because the patch by Jan affects only that specific scenario.

Thanks, Roger.
Jan Beulich March 10, 2021, 11:44 a.m. UTC | #12
On 10.03.2021 12:13, Roger Pau Monné wrote:
> On Tue, Mar 09, 2021 at 07:57:42PM +0000, Andrew Cooper wrote:
>> On 09/03/2021 11:36, Jan Beulich wrote:
>>> So what are your thoughts wrt my change to this file? Drop it
>>> altogether and require people to use this new option? Or do you
>>> see both coexist? In the latter case, since you had suggested
>>> that I drop the write side of my change - does your changing of
>>> the write path indicate you've changed your mind?
>>
>> I don't think we should legitimise buggy PV behaviour, either by
>> codifying in the ABI, or providing a knob beyond this one.
> 
> In that case - can we try to figure out which MSRs is PV Linux trying
> to access without having the #GP handler setup?
> 
> Maybe we can try to handle the ones we know the buggy Linux versions
> will try to access without having #GP setup?
> 
> I know it's not possible to test all possible Linux versions, but we
> could at least try to get the ones we know about fixed. Is the only
> offending one we know about MSR_K8_HWCR?

No, that was the secondary observation. The crash was when reading
MSR_K8_TSEG_ADDR. See my patch'es description.

Handling this on an MSR-by-MSR basis is what Andrew suggested when
I first posted that patch. I continue to not agree, because even if
we limited our auditing to just a single Linux version, how could
we be sure to catch all cases (including rarely taken paths)? IOW I
think auditing isn't a workable approach, and waiting for bug
reports isn't one either.

Jan
Jan Beulich March 10, 2021, 1:42 p.m. UTC | #13
On 09.03.2021 20:57, Andrew Cooper wrote:
> On 09/03/2021 11:36, Jan Beulich wrote:
>> On 09.03.2021 11:56, Roger Pau Monne wrote:
>>> --- a/xen/arch/x86/hvm/svm/svm.c
>>> +++ b/xen/arch/x86/hvm/svm/svm.c
>>> @@ -1795,6 +1795,7 @@ static int svm_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
>>>      const struct domain *d = v->domain;
>>>      struct vmcb_struct *vmcb = v->arch.hvm.svm.vmcb;
>>>      const struct nestedsvm *nsvm = &vcpu_nestedsvm(v);
>>> +    uint64_t tmp;
>>>  
>>>      switch ( msr )
>>>      {
>> While to some degree a matter of taste, I think such variables needed
>> only inside a switch() and not needing an initializer would better
>> live in the respective switch()'s scope.
> 
> Actually, I was hoping to make a CODING_SYTLE change removing this as a
> permitted construct.
> 
> Recent compilers have hardening features to automatically initialise all
> stack variables, because even if your code isn't architecturally buggy,
> an attacker can launch speculative attacks the stack rubble.
> 
> However, because of the way the compiler transform works, it cannot
> tolerate this specific construct in a switch statement, as there is no
> available execution to cope with the implicit "=0" or "=POISON".

While an entirely orthogonal issue, since you bring this up here
I'd like to point out that this then is a compiler implementation
issue, not something one ought to change source code for. The
compiler can very well put its initialization at a suitable place,
which - for internal handling purposes - could go as far as
introducing and artificial block and hence making code structure
as if it was

    {
        int tmp;

        switch ( ... )
        {
        case ...: ...
        }
    }

Trying to limit variable scope as much as possible shouldn't be
crippled by incompletely thought through new hardening options.

>>> --- a/xen/arch/x86/pv/emul-priv-op.c
>>> +++ b/xen/arch/x86/pv/emul-priv-op.c
>>> @@ -875,6 +875,7 @@ static int read_msr(unsigned int reg, uint64_t *val,
>>>      const struct domain *currd = curr->domain;
>>>      const struct cpuid_policy *cp = currd->arch.cpuid;
>>>      bool vpmu_msr = false;
>>> +    uint64_t tmp;
>>>      int ret;
>>>  
>>>      if ( (ret = guest_rdmsr(curr, reg, val)) != X86EMUL_UNHANDLEABLE )
>>> @@ -986,6 +987,12 @@ static int read_msr(unsigned int reg, uint64_t *val,
>>>          }
>>>          /* fall through */
>>>      default:
>>> +        if ( currd->arch.msr_relaxed && !rdmsr_safe(reg, tmp) )
>>> +        {
>>> +            *val = 0;
>>> +            return X86EMUL_OKAY;
>>> +        }
>>> +
>>>          gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);
>>>          break;
>>>  
>>> @@ -1148,6 +1155,9 @@ static int write_msr(unsigned int reg, uint64_t val,
>>>          }
>>>          /* fall through */
>>>      default:
>>> +        if ( currd->arch.msr_relaxed && !rdmsr_safe(reg, val) )
>>> +            return X86EMUL_OKAY;
>>> +
>>>          gdprintk(XENLOG_WARNING,
>>>                   "WRMSR 0x%08x val 0x%016"PRIx64" unimplemented\n",
>>>                   reg, val);
>> So what are your thoughts wrt my change to this file? Drop it
>> altogether and require people to use this new option? Or do you
>> see both coexist? In the latter case, since you had suggested
>> that I drop the write side of my change - does your changing of
>> the write path indicate you've changed your mind?
> 
> I don't think we should legitimise buggy PV behaviour, either by
> codifying in the ABI, or providing a knob beyond this one.
> 
> A guest blindly stumbling forward with a missed #GP could very well
> worse than crashing hard.

There's a fundamental missing piece in your reply here: Do you
mean this just as an argument against extending my change to the
write side, or as one against my change as a whole? In the
latter case, if instead one had to use Roger's new option, the
same missing #GP would cause the same possible problems, plus -
once the guest has installed a handler - further #GP-s may end
up getting suppressed.

Jan

> Case in point - the 4.15 behaviour spotted a very serious bug in NetBSD
> where it tried writing MSR_PAT with its own choice (which wasn't the
> same as Xen's choice).  The consequence of this bug is getting cache
> attributes in pagetables wrong.
> 
> It is unfortunate that multiple bugs have combined to make this mess,
> but every instance needs investigating and fixing.  Continuing to paper
> over the hole doesn't help anyone in the long run.
> 
> ~Andrew
>
diff mbox series

Patch

diff --git a/docs/man/xl.cfg.5.pod.in b/docs/man/xl.cfg.5.pod.in
index 040374dcd6..72b7927483 100644
--- a/docs/man/xl.cfg.5.pod.in
+++ b/docs/man/xl.cfg.5.pod.in
@@ -2861,6 +2861,25 @@  No MCA capabilities in above list are enabled.
 
 =back
 
+=item B<msr_relaxed=BOOLEAN>
+
+Select whether to use a relaxed behavior for accesses to MSRs not explicitly
+handled by Xen instead of injecting a #GP to the guest.  Such access mode will
+force Xen to replicate the behavior from the hardware it's currently running
+on in order to decide whether a #GP is injected to the guest for MSR reads.
+Note that the guest is never allowed to read the value of unhandled MSRs, this
+option only changes whether a #GP might be injected or not.  For writes a #GP
+won't be injected as long as reading the underlying MSR doesn't result in a
+#GP.
+
+This option will be removed in future releases once we are certain the default
+MSR access policy has been properly tested by a wide variety of guests.  If you
+need to use this option please send a bug report to
+xen-devel@lists.xenproject.org with the details of the guests you are running
+that require it.
+
+=back
+
 =back
 
 =head1 SEE ALSO
diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 4737c92bfe..6cf61a5c57 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -740,7 +740,7 @@  Specify the bit width of the DMA heap.
 
 ### dom0
     = List of [ pv | pvh, shadow=<bool>, verbose=<bool>,
-                cpuid-faulting=<bool> ]
+                cpuid-faulting=<bool>, msr-relaxed=<bool> ]
 
     Applicability: x86
 
@@ -789,6 +789,21 @@  Controls for how dom0 is constructed on x86 systems.
     restore the pre-4.13 behaviour.  If specifying `no-cpuid-faulting` fixes
     an issue in dom0, please report a bug.
 
+*   msr-relaxed: Select whether to use a relaxed behavior for accesses to MSRs
+    not explicitly handled by Xen instead of injecting a #GP to dom0.
+    Such access mode will force Xen to replicate the behavior from the hardware
+    it's currently running on in order to decide whether a #GP is injected to
+    dom0 for MSR reads.  Note that dom0 is never allowed to read the value of
+    unhandled MSRs, this option only changes whether a #GP might be injected
+    or not.  For writes a #GP won't be injected as long as reading the
+    underlying MSR doesn't result in a #GP.
+
+    This option will be removed in future releases once we are certain the
+    default MSR access policy has been properly tested by a wide variety of
+    guests.  If you need to use this option please send a bug report to
+    xen-devel@lists.xenproject.org with the details of the guests you are
+    running that require it.
+
 ### dom0-iommu
     = List of [ passthrough=<bool>, strict=<bool>, map-inclusive=<bool>,
                 map-reserved=<bool>, none ]
diff --git a/tools/include/libxl.h b/tools/include/libxl.h
index a7b673e89d..ae7fe27c1f 100644
--- a/tools/include/libxl.h
+++ b/tools/include/libxl.h
@@ -495,6 +495,13 @@ 
  */
 #define LIBXL_HAVE_VMTRACE_BUF_KB 1
 
+/*
+ * LIBXL_HAVE_X86_MSR_RELAXED indicates the toolstack has support for switching
+ * the MSR access handling in the hypervisor to relaxed mode. This is done by
+ * setting the libxl_domain_build_info arch_x86.msr_relaxed field.
+ */
+#define LIBXL_HAVE_X86_MSR_RELAXED 1
+
 /*
  * libxl ABI compatibility
  *
diff --git a/tools/libs/light/libxl_arch.h b/tools/libs/light/libxl_arch.h
index c305d704b1..8527fc5c6c 100644
--- a/tools/libs/light/libxl_arch.h
+++ b/tools/libs/light/libxl_arch.h
@@ -85,6 +85,11 @@  int libxl__arch_extra_memory(libxl__gc *gc,
                              const libxl_domain_build_info *info,
                              uint64_t *out);
 
+_hidden
+void libxl__arch_update_domain_config(libxl__gc *gc,
+                                      libxl_domain_config *dst,
+                                      const libxl_domain_config *src);
+
 #if defined(__i386__) || defined(__x86_64__)
 
 #define LAPIC_BASE_ADDRESS  0xfee00000
diff --git a/tools/libs/light/libxl_arm.c b/tools/libs/light/libxl_arm.c
index 5e2a209a8b..e2901f13b7 100644
--- a/tools/libs/light/libxl_arm.c
+++ b/tools/libs/light/libxl_arm.c
@@ -1222,6 +1222,12 @@  int libxl__arch_passthrough_mode_setdefault(libxl__gc *gc,
     return rc;
 }
 
+void libxl__arch_update_domain_config(libxl__gc *gc,
+                                      libxl_domain_config *dst,
+                                      const libxl_domain_config *src)
+{
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libs/light/libxl_create.c b/tools/libs/light/libxl_create.c
index 46f68da697..1131b2a733 100644
--- a/tools/libs/light/libxl_create.c
+++ b/tools/libs/light/libxl_create.c
@@ -2287,6 +2287,13 @@  int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
         unset_disk_colo_restore(d_config);
     }
 
+    /*
+     * When restoring (either from a save file or for a migration domain) set
+     * the MSR relaxed mode for compatibility with older Xen versions if the
+     * option is not set as part of the original configuration.
+     */
+    libxl_defbool_setdefault(&d_config->b_info.arch_x86.msr_relaxed, true);
+
     return do_domain_create(ctx, d_config, domid, restore_fd, send_back_fd,
                             params, ao_how, aop_console_how);
 }
diff --git a/tools/libs/light/libxl_internal.c b/tools/libs/light/libxl_internal.c
index d93a75533f..86556b6113 100644
--- a/tools/libs/light/libxl_internal.c
+++ b/tools/libs/light/libxl_internal.c
@@ -16,6 +16,7 @@ 
 #include "libxl_osdeps.h" /* must come before any other headers */
 
 #include "libxl_internal.h"
+#include "libxl_arch.h"
 
 void libxl__alloc_failed(libxl_ctx *ctx, const char *func,
                          size_t nmemb, size_t size) {
@@ -594,6 +595,8 @@  void libxl__update_domain_configuration(libxl__gc *gc,
 
     /* video ram */
     dst->b_info.video_memkb = src->b_info.video_memkb;
+
+    libxl__arch_update_domain_config(gc, dst, src);
 }
 
 static void ev_slowlock_init_internal(libxl__ev_slowlock *lock,
diff --git a/tools/libs/light/libxl_types.idl b/tools/libs/light/libxl_types.idl
index 5b85a7419f..f45adddab0 100644
--- a/tools/libs/light/libxl_types.idl
+++ b/tools/libs/light/libxl_types.idl
@@ -644,6 +644,8 @@  libxl_domain_build_info = Struct("domain_build_info",[
     ("arch_arm", Struct(None, [("gic_version", libxl_gic_version),
                                ("vuart", libxl_vuart_type),
                               ])),
+    ("arch_x86", Struct(None, [("msr_relaxed", libxl_defbool),
+                              ])),
     # Alternate p2m is not bound to any architecture or guest type, as it is
     # supported by x86 HVM and ARM support is planned.
     ("altp2m", libxl_altp2m_mode),
diff --git a/tools/libs/light/libxl_x86.c b/tools/libs/light/libxl_x86.c
index 58187ed760..1656f2c730 100644
--- a/tools/libs/light/libxl_x86.c
+++ b/tools/libs/light/libxl_x86.c
@@ -19,6 +19,10 @@  int libxl__arch_domain_prepare_config(libxl__gc *gc,
         abort();
     }
 
+    config->arch.domain_flags = 0;
+    if (libxl_defbool_val(d_config->b_info.arch_x86.msr_relaxed))
+        config->arch.domain_flags |= XEN_X86_MSR_RELAXED;
+
     return 0;
 }
 
@@ -809,6 +813,7 @@  void libxl__arch_domain_build_info_setdefault(libxl__gc *gc,
                                               libxl_domain_build_info *b_info)
 {
     libxl_defbool_setdefault(&b_info->acpi, true);
+    libxl_defbool_setdefault(&b_info->arch_x86.msr_relaxed, false);
 }
 
 int libxl__arch_passthrough_mode_setdefault(libxl__gc *gc,
@@ -851,6 +856,21 @@  int libxl__arch_passthrough_mode_setdefault(libxl__gc *gc,
     return rc;
 }
 
+void libxl__arch_update_domain_config(libxl__gc *gc,
+                                      libxl_domain_config *dst,
+                                      const libxl_domain_config *src)
+{
+    /*
+     * Force MSR relaxed to be set (either to true or false) so it's part of
+     * the domain configuration when saving or performing a live-migration.
+     *
+     * Doing so allows the recovery side to figure out whether the flag should
+     * be set to true in order to keep backwards compatibility with already
+     * started domains.
+     */
+    libxl_defbool_setdefault(&dst->b_info.arch_x86.msr_relaxed,
+                    libxl_defbool_val(src->b_info.arch_x86.msr_relaxed));
+}
 
 /*
  * Local variables:
diff --git a/tools/ocaml/libs/xc/xenctrl.ml b/tools/ocaml/libs/xc/xenctrl.ml
index a02e26b27f..4789e42e91 100644
--- a/tools/ocaml/libs/xc/xenctrl.ml
+++ b/tools/ocaml/libs/xc/xenctrl.ml
@@ -48,9 +48,13 @@  type x86_arch_emulation_flags =
 	| X86_EMU_USE_PIRQ
 	| X86_EMU_VPCI
 
+type x86_arch_domain_flags =
+	| X86_MSR_RELAXED
+
 type xen_x86_arch_domainconfig =
 {
 	emulation_flags: x86_arch_emulation_flags list;
+	domain_flags: x86_arch_domain_flags list;
 }
 
 type arch_domainconfig =
diff --git a/tools/ocaml/libs/xc/xenctrl.mli b/tools/ocaml/libs/xc/xenctrl.mli
index d2a312e273..566c12cb26 100644
--- a/tools/ocaml/libs/xc/xenctrl.mli
+++ b/tools/ocaml/libs/xc/xenctrl.mli
@@ -42,8 +42,12 @@  type x86_arch_emulation_flags =
   | X86_EMU_USE_PIRQ
   | X86_EMU_VPCI
 
+type x86_arch_domain_flags =
+  | X86_MSR_RELAXED
+
 type xen_x86_arch_domainconfig = {
   emulation_flags: x86_arch_emulation_flags list;
+  domain_flags: x86_arch_domain_flags list;
 }
 
 type arch_domainconfig =
diff --git a/tools/ocaml/libs/xc/xenctrl_stubs.c b/tools/ocaml/libs/xc/xenctrl_stubs.c
index 9a8dbe5579..2f658a6418 100644
--- a/tools/ocaml/libs/xc/xenctrl_stubs.c
+++ b/tools/ocaml/libs/xc/xenctrl_stubs.c
@@ -233,6 +233,16 @@  CAMLprim value stub_xc_domain_create(value xch, value wanted_domid, value config
 
 #undef VAL_EMUL_FLAGS
 
+#define VAL_DOMAIN_FLAGS        Field(arch_domconfig, 1)
+
+		cfg.arch.domain_flags = ocaml_list_to_c_bitmap
+			/* ! x86_arch_domain_flags X86_ none */
+			/* ! XEN_X86_ XEN_X86_MSR_RELAXED all */
+
+			(VAL_DOMAIN_FLAGS);
+
+#undef VAL_DOMAIN_FLAGS
+
 #else
 		caml_failwith("Unhandled: x86");
 #endif
diff --git a/tools/xl/xl_parse.c b/tools/xl/xl_parse.c
index 1893cfc086..9fb0791429 100644
--- a/tools/xl/xl_parse.c
+++ b/tools/xl/xl_parse.c
@@ -2741,6 +2741,13 @@  skip_usbdev:
     xlu_cfg_get_defbool(config, "xend_suspend_evtchn_compat",
                         &c_info->xend_suspend_evtchn_compat, 0);
 
+    if (!xlu_cfg_get_defbool(config, "msr_relaxed",
+                             &b_info->arch_x86.msr_relaxed, 0))
+            fprintf(stderr,
+                    "WARNING: msr_relaxed will be removed in future versions.\n"
+                    "If it fixes an issue you are having please report to "
+                    "xen-devel@lists.xenproject.org.\n");
+
     xlu_cfg_destroy(config);
 }
 
diff --git a/xen/arch/x86/dom0_build.c b/xen/arch/x86/dom0_build.c
index 0ce29e91a3..74b443e509 100644
--- a/xen/arch/x86/dom0_build.c
+++ b/xen/arch/x86/dom0_build.c
@@ -256,6 +256,7 @@  bool __initdata opt_dom0_shadow;
 #endif
 bool __initdata opt_dom0_pvh = !IS_ENABLED(CONFIG_PV);
 bool __initdata opt_dom0_verbose = IS_ENABLED(CONFIG_VERBOSE_DEBUG);
+bool __initdata opt_dom0_msr_relaxed;
 
 static int __init parse_dom0_param(const char *s)
 {
@@ -282,6 +283,8 @@  static int __init parse_dom0_param(const char *s)
         else if ( IS_ENABLED(CONFIG_PV) &&
                   (val = parse_boolean("cpuid-faulting", s, ss)) >= 0 )
             opt_dom0_cpuid_faulting = val;
+        else if ( (val = parse_boolean("msr-relaxed", s, ss)) >= 0 )
+            opt_dom0_msr_relaxed = val;
         else
             rc = -EINVAL;
 
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 5e3c94d3fa..86e5b506ce 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -683,6 +683,13 @@  int arch_sanitise_domain_config(struct xen_domctl_createdomain *config)
         }
     }
 
+    if ( config->arch.domain_flags & ~XEN_X86_MSR_RELAXED )
+    {
+        dprintk(XENLOG_INFO, "Invalid arch domain flags %#x\n",
+                config->arch.domain_flags);
+        return -EINVAL;
+    }
+
     return 0;
 }
 
@@ -852,6 +859,8 @@  int arch_domain_create(struct domain *d,
 
     domain_cpu_policy_changed(d);
 
+    d->arch.msr_relaxed = config->arch.domain_flags & XEN_X86_MSR_RELAXED;
+
     return 0;
 
  fail:
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index b819897a4a..4585efe1f8 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -1795,6 +1795,7 @@  static int svm_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
     const struct domain *d = v->domain;
     struct vmcb_struct *vmcb = v->arch.hvm.svm.vmcb;
     const struct nestedsvm *nsvm = &vcpu_nestedsvm(v);
+    uint64_t tmp;
 
     switch ( msr )
     {
@@ -1965,6 +1966,12 @@  static int svm_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
         break;
 
     default:
+        if ( d->arch.msr_relaxed && !rdmsr_safe(msr, tmp) )
+        {
+            *msr_content = 0;
+            break;
+        }
+
         gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", msr);
         goto gpf;
     }
@@ -2151,6 +2158,9 @@  static int svm_msr_write_intercept(unsigned int msr, uint64_t msr_content)
         break;
 
     default:
+        if ( d->arch.msr_relaxed && !rdmsr_safe(msr, msr_content) )
+            break;
+
         gdprintk(XENLOG_WARNING,
                  "WRMSR 0x%08x val 0x%016"PRIx64" unimplemented\n",
                  msr, msr_content);
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index bfea1b0f8a..b52824677e 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -3123,6 +3123,7 @@  static int is_last_branch_msr(u32 ecx)
 static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
 {
     struct vcpu *curr = current;
+    uint64_t tmp;
 
     HVM_DBG_LOG(DBG_LEVEL_MSR, "ecx=%#x", msr);
 
@@ -3204,6 +3205,12 @@  static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
             break;
         }
 
+        if ( curr->domain->arch.msr_relaxed && !rdmsr_safe(msr, tmp) )
+        {
+            *msr_content = 0;
+            break;
+        }
+
         gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", msr);
         goto gp_fault;
     }
@@ -3505,6 +3512,9 @@  static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content)
              is_last_branch_msr(msr) )
             break;
 
+        if ( v->domain->arch.msr_relaxed && !rdmsr_safe(msr, msr_content) )
+            break;
+
         gdprintk(XENLOG_WARNING,
                  "WRMSR 0x%08x val 0x%016"PRIx64" unimplemented\n",
                  msr, msr_content);
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index e5a22b9347..74e71403ff 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -875,6 +875,7 @@  static int read_msr(unsigned int reg, uint64_t *val,
     const struct domain *currd = curr->domain;
     const struct cpuid_policy *cp = currd->arch.cpuid;
     bool vpmu_msr = false;
+    uint64_t tmp;
     int ret;
 
     if ( (ret = guest_rdmsr(curr, reg, val)) != X86EMUL_UNHANDLEABLE )
@@ -986,6 +987,12 @@  static int read_msr(unsigned int reg, uint64_t *val,
         }
         /* fall through */
     default:
+        if ( currd->arch.msr_relaxed && !rdmsr_safe(reg, tmp) )
+        {
+            *val = 0;
+            return X86EMUL_OKAY;
+        }
+
         gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);
         break;
 
@@ -1148,6 +1155,9 @@  static int write_msr(unsigned int reg, uint64_t val,
         }
         /* fall through */
     default:
+        if ( currd->arch.msr_relaxed && !rdmsr_safe(reg, val) )
+            return X86EMUL_OKAY;
+
         gdprintk(XENLOG_WARNING,
                  "WRMSR 0x%08x val 0x%016"PRIx64" unimplemented\n",
                  reg, val);
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 23bbb6e8c1..d217c223b0 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -749,6 +749,7 @@  static struct domain *__init create_dom0(const module_t *image,
         .max_grant_frames = -1,
         .max_maptrack_frames = -1,
         .max_vcpus = dom0_max_vcpus(),
+        .arch.domain_flags = opt_dom0_msr_relaxed ? XEN_X86_MSR_RELAXED : 0,
     };
     struct domain *d;
     char *cmdline;
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 3900d7b48b..7213d184b0 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -437,6 +437,9 @@  struct arch_domain
     /* Mem_access emulation control */
     bool_t mem_access_emulate_each_rep;
 
+    /* Don't unconditionally inject #GP for unhandled MSRs. */
+    bool msr_relaxed;
+
     /* Emulated devices enabled bitmap. */
     uint32_t emulation_flags;
 } __cacheline_aligned;
diff --git a/xen/include/asm-x86/setup.h b/xen/include/asm-x86/setup.h
index 642a5e8460..24be46115d 100644
--- a/xen/include/asm-x86/setup.h
+++ b/xen/include/asm-x86/setup.h
@@ -65,6 +65,7 @@  extern bool opt_dom0_shadow;
 extern bool opt_dom0_pvh;
 extern bool opt_dom0_verbose;
 extern bool opt_dom0_cpuid_faulting;
+extern bool opt_dom0_msr_relaxed;
 
 #define max_init_domid (0)
 
diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
index 629cb2ba40..f9d0e33b94 100644
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -304,6 +304,14 @@  struct xen_arch_domainconfig {
                                      XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ |\
                                      XEN_X86_EMU_VPCI)
     uint32_t emulation_flags;
+
+/*
+ * Select whether to use a relaxed behavior for accesses to MSRs not explicitly
+ * handled by Xen instead of injecting a #GP to the guest. Note this option
+ * doesn't allow the guest to read or write to the underlying MSR.
+ */
+#define XEN_X86_MSR_RELAXED (1u << 0)
+    uint32_t domain_flags;
 };
 
 /* Location of online VCPU bitmap. */