Message ID | 20240726152206.28411-4-roger.pau@citrix.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | x86: adventures in Address Space Isolation | expand |
On Fri, Jul 26, 2024 at 05:21:47PM +0200, Roger Pau Monne wrote: > The PVH dom0 builder doesn't switch page tables and has no need to run with > SMAP disabled. This should be reworded as: "The PVH dom0 builder doesn't build guest page-tables, because PVH is started in 32bit protected mode, hence has no need to run with SMAP disabled." Thanks, Roger.
On 26.07.2024 17:21, Roger Pau Monne wrote: > The PVH dom0 builder doesn't switch page tables and has no need to run with > SMAP disabled. > > Put the SMAP disabling close to the code region where it's necessary, as it > then becomes obvious why switch_cr3_cr4() is required instead of > write_ptbase(). > > Note removing SMAP from cr4_pv32_mask is not required, as we never jump into > guest context, and hence updating the value of cr4_pv32_mask is not relevant. I'm okay-ish with that being dropped, but iirc the goal was to keep the variable in sync with CPU state. > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
On 29/07/2024 12:53 pm, Jan Beulich wrote: > On 26.07.2024 17:21, Roger Pau Monne wrote: >> The PVH dom0 builder doesn't switch page tables and has no need to run with >> SMAP disabled. >> >> Put the SMAP disabling close to the code region where it's necessary, as it >> then becomes obvious why switch_cr3_cr4() is required instead of >> write_ptbase(). >> >> Note removing SMAP from cr4_pv32_mask is not required, as we never jump into >> guest context, and hence updating the value of cr4_pv32_mask is not relevant. > I'm okay-ish with that being dropped, but iirc the goal was to keep the > variable in sync with CPU state. Removing SMAP from cr4_pv32_mask is necessary. Otherwise IST vectors will reactive SMAP behind the back of the dombuilder. This will probably only manifest in practice in a CONFIG_PV32=y build, and with a poorly timed NMI. ~Andrew
On 26/07/2024 4:21 pm, Roger Pau Monne wrote: > diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c > index eee20bb1753c..bc387d96b519 100644 > --- a/xen/arch/x86/setup.c > +++ b/xen/arch/x86/setup.c > @@ -955,26 +955,9 @@ static struct domain *__init create_dom0(const module_t *image, > } > } > > - /* > - * Temporarily clear SMAP in CR4 to allow user-accesses in construct_dom0(). > - * This saves a large number of corner cases interactions with > - * copy_from_user(). > - */ > - if ( cpu_has_smap ) > - { > - cr4_pv32_mask &= ~X86_CR4_SMAP; > - write_cr4(read_cr4() & ~X86_CR4_SMAP); > - } > - > if ( construct_dom0(d, image, headroom, initrd, cmdline) != 0 ) > panic("Could not construct domain 0\n"); > > - if ( cpu_has_smap ) > - { > - write_cr4(read_cr4() | X86_CR4_SMAP); > - cr4_pv32_mask |= X86_CR4_SMAP; > - } > - Hang on. Isn't this (preexistingly) broken given the distinction between cpu_has_smap and X86_FEATURE_XEN_SMAP ? I'm very tempted to use this as a justification to remove opt_smap. ~Andrew
On Mon, Jul 29, 2024 at 04:52:22PM +0100, Andrew Cooper wrote: > On 29/07/2024 12:53 pm, Jan Beulich wrote: > > On 26.07.2024 17:21, Roger Pau Monne wrote: > >> The PVH dom0 builder doesn't switch page tables and has no need to run with > >> SMAP disabled. > >> > >> Put the SMAP disabling close to the code region where it's necessary, as it > >> then becomes obvious why switch_cr3_cr4() is required instead of > >> write_ptbase(). > >> > >> Note removing SMAP from cr4_pv32_mask is not required, as we never jump into > >> guest context, and hence updating the value of cr4_pv32_mask is not relevant. > > I'm okay-ish with that being dropped, but iirc the goal was to keep the > > variable in sync with CPU state. > > Removing SMAP from cr4_pv32_mask is necessary. > > Otherwise IST vectors will reactive SMAP behind the back of the dombuilder. > > This will probably only manifest in practice in a CONFIG_PV32=y build, Sorry, I'm possibly missing some context here. When running the dom0 builder we switch to the guest page-tables, but not to the guest vCPU, (iow: current == idle) and hence the context is always the Xen context. Why would the return path of the IST use cr4_pv32_mask when the context in which the IST happened was the Xen one, and the current vCPU is the idle one (a 64bit PV guest from Xen's PoV). My understanding is that cr4_pv32_mask should only be used when the current context is running a 32bit PV vCPU. Thanks, Roger.
On Mon, Jul 29, 2024 at 04:59:09PM +0100, Andrew Cooper wrote: > On 26/07/2024 4:21 pm, Roger Pau Monne wrote: > > diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c > > index eee20bb1753c..bc387d96b519 100644 > > --- a/xen/arch/x86/setup.c > > +++ b/xen/arch/x86/setup.c > > @@ -955,26 +955,9 @@ static struct domain *__init create_dom0(const module_t *image, > > } > > } > > > > - /* > > - * Temporarily clear SMAP in CR4 to allow user-accesses in construct_dom0(). > > - * This saves a large number of corner cases interactions with > > - * copy_from_user(). > > - */ > > - if ( cpu_has_smap ) > > - { > > - cr4_pv32_mask &= ~X86_CR4_SMAP; > > - write_cr4(read_cr4() & ~X86_CR4_SMAP); > > - } > > - > > if ( construct_dom0(d, image, headroom, initrd, cmdline) != 0 ) > > panic("Could not construct domain 0\n"); > > > > - if ( cpu_has_smap ) > > - { > > - write_cr4(read_cr4() | X86_CR4_SMAP); > > - cr4_pv32_mask |= X86_CR4_SMAP; > > - } > > - > > Hang on. Isn't this (preexistingly) broken given the distinction > between cpu_has_smap and X86_FEATURE_XEN_SMAP ? I see, looks like Xen will unconditionally enable SMAP if the user has requested SMP for HVM only. Forcefully disabling SMAP for both PV and HVM will result in the CPUID bit getting cleared, and hence cpu_has_smap == false. > I'm very tempted to use this as a justification to remove opt_smap. Oh, so my change fixes that bug by caching the previous cr4 instead of using cpu_has_smap. It seems like opt_smap is useful for the PV shim, as it caused some unnecessary performance degradation on AMD hardware due to AMD not allowing to selectively trap accesses to CR4, so on pvshim mode it gets disabled: b05ec9263e56 x86/sm{e, a}p: do not enable SMEP/SMAP in PV shim by default on AMD Thanks, Roger.
On 29/07/2024 5:18 pm, Roger Pau Monné wrote: > On Mon, Jul 29, 2024 at 04:52:22PM +0100, Andrew Cooper wrote: >> On 29/07/2024 12:53 pm, Jan Beulich wrote: >>> On 26.07.2024 17:21, Roger Pau Monne wrote: >>>> The PVH dom0 builder doesn't switch page tables and has no need to run with >>>> SMAP disabled. >>>> >>>> Put the SMAP disabling close to the code region where it's necessary, as it >>>> then becomes obvious why switch_cr3_cr4() is required instead of >>>> write_ptbase(). >>>> >>>> Note removing SMAP from cr4_pv32_mask is not required, as we never jump into >>>> guest context, and hence updating the value of cr4_pv32_mask is not relevant. >>> I'm okay-ish with that being dropped, but iirc the goal was to keep the >>> variable in sync with CPU state. >> Removing SMAP from cr4_pv32_mask is necessary. >> >> Otherwise IST vectors will reactive SMAP behind the back of the dombuilder. >> >> This will probably only manifest in practice in a CONFIG_PV32=y build, > Sorry, I'm possibly missing some context here. When running the dom0 > builder we switch to the guest page-tables, but not to the guest vCPU, > (iow: current == idle) and hence the context is always the Xen > context. Correct. > Why would the return path of the IST use cr4_pv32_mask when the > context in which the IST happened was the Xen one, and the current > vCPU is the idle one (a 64bit PV guest from Xen's PoV). > > My understanding is that cr4_pv32_mask should only be used when the > current context is running a 32bit PV vCPU. This logic is evil to follow, because you need to look at both cr4_pv32_mask and XEN_CR4_PV32_BITS to see both halves of it. Notice how cr4_pv32_restore() only ever OR's cr4_pv32_mask into %cr4? CR4_PV32_RESTORE is called from every entry path which *might* have come from a 32bit PV guest, and it always results in Xen having SMEP/SMAP active (as applicable). This includes NMI. The change is only undone in compat_restore_all_guest(), where XEN_CR4_PV32_BITS is cleared from %cr4 iff returning to Ring1/2. This is logic cunningly disguised in the use of the Parity flag. Because the NMI handler does reactive SMEP/SMAP (based on the value in cr4_pv32_mask), and returning to Xen does not pass through compat_restore_all_guest(), taking an NMI in the middle of of the dombuilder will reactive SMAP behind your back. ~Andrew
On Mon, Jul 29, 2024 at 06:51:31PM +0100, Andrew Cooper wrote: > On 29/07/2024 5:18 pm, Roger Pau Monné wrote: > > On Mon, Jul 29, 2024 at 04:52:22PM +0100, Andrew Cooper wrote: > >> On 29/07/2024 12:53 pm, Jan Beulich wrote: > >>> On 26.07.2024 17:21, Roger Pau Monne wrote: > >>>> The PVH dom0 builder doesn't switch page tables and has no need to run with > >>>> SMAP disabled. > >>>> > >>>> Put the SMAP disabling close to the code region where it's necessary, as it > >>>> then becomes obvious why switch_cr3_cr4() is required instead of > >>>> write_ptbase(). > >>>> > >>>> Note removing SMAP from cr4_pv32_mask is not required, as we never jump into > >>>> guest context, and hence updating the value of cr4_pv32_mask is not relevant. > >>> I'm okay-ish with that being dropped, but iirc the goal was to keep the > >>> variable in sync with CPU state. > >> Removing SMAP from cr4_pv32_mask is necessary. > >> > >> Otherwise IST vectors will reactive SMAP behind the back of the dombuilder. > >> > >> This will probably only manifest in practice in a CONFIG_PV32=y build, > > Sorry, I'm possibly missing some context here. When running the dom0 > > builder we switch to the guest page-tables, but not to the guest vCPU, > > (iow: current == idle) and hence the context is always the Xen > > context. > > Correct. > > > Why would the return path of the IST use cr4_pv32_mask when the > > context in which the IST happened was the Xen one, and the current > > vCPU is the idle one (a 64bit PV guest from Xen's PoV). > > > > My understanding is that cr4_pv32_mask should only be used when the > > current context is running a 32bit PV vCPU. > > This logic is evil to follow, because you need to look at both > cr4_pv32_mask and XEN_CR4_PV32_BITS to see both halves of it. > > Notice how cr4_pv32_restore() only ever OR's cr4_pv32_mask into %cr4? > > CR4_PV32_RESTORE is called from every entry path which *might* have come > from a 32bit PV guest, and it always results in Xen having SMEP/SMAP > active (as applicable). This includes NMI. > > The change is only undone in compat_restore_all_guest(), where > XEN_CR4_PV32_BITS is cleared from %cr4 iff returning to Ring1/2. This > is logic cunningly disguised in the use of the Parity flag. > > > Because the NMI handler does reactive SMEP/SMAP (based on the value in > cr4_pv32_mask), and returning to Xen does not pass through > compat_restore_all_guest(), taking an NMI in the middle of of the > dombuilder will reactive SMAP behind your back. After further conversations with Andrew we believe the current disabling of X86_CR4_SMAP in %cr4 during dom0 build is not safe. Regardless of whether cr4_pv32_mask is properly adjusted return to Xen context from interrupt would be done with SMAP enabled if X86_FEATURE_XEN_SMAP is set. I will send a new patch that uses stac/clac in order to disable SMAP (if required) around the dom0 builder code that switches to the guest page-tables. Thanks, Roger.
On 30/07/2024 11:55 am, Roger Pau Monné wrote: > On Mon, Jul 29, 2024 at 06:51:31PM +0100, Andrew Cooper wrote: >> On 29/07/2024 5:18 pm, Roger Pau Monné wrote: >>> On Mon, Jul 29, 2024 at 04:52:22PM +0100, Andrew Cooper wrote: >>>> On 29/07/2024 12:53 pm, Jan Beulich wrote: >>>>> On 26.07.2024 17:21, Roger Pau Monne wrote: >>>>>> The PVH dom0 builder doesn't switch page tables and has no need to run with >>>>>> SMAP disabled. >>>>>> >>>>>> Put the SMAP disabling close to the code region where it's necessary, as it >>>>>> then becomes obvious why switch_cr3_cr4() is required instead of >>>>>> write_ptbase(). >>>>>> >>>>>> Note removing SMAP from cr4_pv32_mask is not required, as we never jump into >>>>>> guest context, and hence updating the value of cr4_pv32_mask is not relevant. >>>>> I'm okay-ish with that being dropped, but iirc the goal was to keep the >>>>> variable in sync with CPU state. >>>> Removing SMAP from cr4_pv32_mask is necessary. >>>> >>>> Otherwise IST vectors will reactive SMAP behind the back of the dombuilder. >>>> >>>> This will probably only manifest in practice in a CONFIG_PV32=y build, >>> Sorry, I'm possibly missing some context here. When running the dom0 >>> builder we switch to the guest page-tables, but not to the guest vCPU, >>> (iow: current == idle) and hence the context is always the Xen >>> context. >> Correct. >> >>> Why would the return path of the IST use cr4_pv32_mask when the >>> context in which the IST happened was the Xen one, and the current >>> vCPU is the idle one (a 64bit PV guest from Xen's PoV). >>> >>> My understanding is that cr4_pv32_mask should only be used when the >>> current context is running a 32bit PV vCPU. >> This logic is evil to follow, because you need to look at both >> cr4_pv32_mask and XEN_CR4_PV32_BITS to see both halves of it. >> >> Notice how cr4_pv32_restore() only ever OR's cr4_pv32_mask into %cr4? >> >> CR4_PV32_RESTORE is called from every entry path which *might* have come >> from a 32bit PV guest, and it always results in Xen having SMEP/SMAP >> active (as applicable). This includes NMI. >> >> The change is only undone in compat_restore_all_guest(), where >> XEN_CR4_PV32_BITS is cleared from %cr4 iff returning to Ring1/2. This >> is logic cunningly disguised in the use of the Parity flag. >> >> >> Because the NMI handler does reactive SMEP/SMAP (based on the value in >> cr4_pv32_mask), and returning to Xen does not pass through >> compat_restore_all_guest(), taking an NMI in the middle of of the >> dombuilder will reactive SMAP behind your back. > After further conversations with Andrew we believe the current > disabling of X86_CR4_SMAP in %cr4 during dom0 build is not safe. > > Regardless of whether cr4_pv32_mask is properly adjusted return to Xen > context from interrupt would be done with SMAP enabled if > X86_FEATURE_XEN_SMAP is set. Sorry - that's not what I intended to convey. The logic prior to this patch is safe. SMAP is cleared from cr4_pv32_mask before clearing CR4.SMAP, and reinstated in the opposite order. Therefore, an NMI hitting the region won't reactivate SMAP because it's not (instantaniously) set in cr4_pv32_mask. Arguably it wants some barrier()'s for clarity, and an explanation of why this works. The problem your patch has is that by not clearing SMAP from cr4_pv32_mask, it becomes unsafe iff an NMI/#MC/#DB hits the region. > I will send a new patch that uses stac/clac in order to disable SMAP > (if required) around the dom0 builder code that switches to the guest > page-tables. Either way - this is a much cleaner solution. ~Andrew
On Tue, Jul 30, 2024 at 12:06:45PM +0100, Andrew Cooper wrote: > On 30/07/2024 11:55 am, Roger Pau Monné wrote: > > On Mon, Jul 29, 2024 at 06:51:31PM +0100, Andrew Cooper wrote: > >> On 29/07/2024 5:18 pm, Roger Pau Monné wrote: > >>> On Mon, Jul 29, 2024 at 04:52:22PM +0100, Andrew Cooper wrote: > >>>> On 29/07/2024 12:53 pm, Jan Beulich wrote: > >>>>> On 26.07.2024 17:21, Roger Pau Monne wrote: > >>>>>> The PVH dom0 builder doesn't switch page tables and has no need to run with > >>>>>> SMAP disabled. > >>>>>> > >>>>>> Put the SMAP disabling close to the code region where it's necessary, as it > >>>>>> then becomes obvious why switch_cr3_cr4() is required instead of > >>>>>> write_ptbase(). > >>>>>> > >>>>>> Note removing SMAP from cr4_pv32_mask is not required, as we never jump into > >>>>>> guest context, and hence updating the value of cr4_pv32_mask is not relevant. > >>>>> I'm okay-ish with that being dropped, but iirc the goal was to keep the > >>>>> variable in sync with CPU state. > >>>> Removing SMAP from cr4_pv32_mask is necessary. > >>>> > >>>> Otherwise IST vectors will reactive SMAP behind the back of the dombuilder. > >>>> > >>>> This will probably only manifest in practice in a CONFIG_PV32=y build, > >>> Sorry, I'm possibly missing some context here. When running the dom0 > >>> builder we switch to the guest page-tables, but not to the guest vCPU, > >>> (iow: current == idle) and hence the context is always the Xen > >>> context. > >> Correct. > >> > >>> Why would the return path of the IST use cr4_pv32_mask when the > >>> context in which the IST happened was the Xen one, and the current > >>> vCPU is the idle one (a 64bit PV guest from Xen's PoV). > >>> > >>> My understanding is that cr4_pv32_mask should only be used when the > >>> current context is running a 32bit PV vCPU. > >> This logic is evil to follow, because you need to look at both > >> cr4_pv32_mask and XEN_CR4_PV32_BITS to see both halves of it. > >> > >> Notice how cr4_pv32_restore() only ever OR's cr4_pv32_mask into %cr4? > >> > >> CR4_PV32_RESTORE is called from every entry path which *might* have come > >> from a 32bit PV guest, and it always results in Xen having SMEP/SMAP > >> active (as applicable). This includes NMI. > >> > >> The change is only undone in compat_restore_all_guest(), where > >> XEN_CR4_PV32_BITS is cleared from %cr4 iff returning to Ring1/2. This > >> is logic cunningly disguised in the use of the Parity flag. > >> > >> > >> Because the NMI handler does reactive SMEP/SMAP (based on the value in > >> cr4_pv32_mask), and returning to Xen does not pass through > >> compat_restore_all_guest(), taking an NMI in the middle of of the > >> dombuilder will reactive SMAP behind your back. > > After further conversations with Andrew we believe the current > > disabling of X86_CR4_SMAP in %cr4 during dom0 build is not safe. > > > > Regardless of whether cr4_pv32_mask is properly adjusted return to Xen > > context from interrupt would be done with SMAP enabled if > > X86_FEATURE_XEN_SMAP is set. > > Sorry - that's not what I intended to convey. > > The logic prior to this patch is safe. SMAP is cleared from > cr4_pv32_mask before clearing CR4.SMAP, and reinstated in the opposite > order. Therefore, an NMI hitting the region won't reactivate SMAP > because it's not (instantaniously) set in cr4_pv32_mask. My bad, I was getting confused with the `clac` instructions in the event entry points. I think with my proposed change we would also hit the BUG in cr4_pv32_restore on debug builds if Xen got an interrupt in the middle of the SMAP disabled dom0 build region if SMEP is enabled, as cr4_pv32_mask would differ from the current %cr4 value (not all bits intended would be actually set). > Arguably it wants some barrier()'s for clarity, and an explanation of > why this works. > > The problem your patch has is that by not clearing SMAP from > cr4_pv32_mask, it becomes unsafe iff an NMI/#MC/#DB hits the region. Won't such issue also affect common_interrupt, and hence not be limited to NMI/#MC/#DB only? Regards, Roger.
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c index d8043fa58a27..41772dbe80bf 100644 --- a/xen/arch/x86/pv/dom0_build.c +++ b/xen/arch/x86/pv/dom0_build.c @@ -370,6 +370,7 @@ int __init dom0_construct_pv(struct domain *d, unsigned long alloc_epfn; unsigned long initrd_pfn = -1, initrd_mfn = 0; unsigned long count; + unsigned long cr4; struct page_info *page = NULL; unsigned int flush_flags = 0; start_info_t *si; @@ -814,8 +815,14 @@ int __init dom0_construct_pv(struct domain *d, /* Set up CR3 value for switch_cr3_cr4(). */ update_cr3(v); + /* + * Temporarily clear SMAP in CR4 to allow user-accesses when running with + * the dom0 page-tables. Cache the value of CR4 so it can be restored. + */ + cr4 = read_cr4(); + /* We run on dom0's page tables for the final part of the build process. */ - switch_cr3_cr4(cr3_pa(v->arch.cr3), read_cr4()); + switch_cr3_cr4(cr3_pa(v->arch.cr3), cr4 & ~X86_CR4_SMAP); mapcache_override_current(v); /* Copy the OS image and free temporary buffer. */ @@ -836,7 +843,7 @@ int __init dom0_construct_pv(struct domain *d, (parms.virt_hypercall >= v_end) ) { mapcache_override_current(NULL); - switch_cr3_cr4(current->arch.cr3, read_cr4()); + switch_cr3_cr4(current->arch.cr3, cr4); printk("Invalid HYPERCALL_PAGE field in ELF notes.\n"); return -EINVAL; } @@ -978,7 +985,7 @@ int __init dom0_construct_pv(struct domain *d, /* Return to idle domain's page tables. */ mapcache_override_current(NULL); - switch_cr3_cr4(current->arch.cr3, read_cr4()); + switch_cr3_cr4(current->arch.cr3, cr4); update_domain_wallclock_time(d); diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c index eee20bb1753c..bc387d96b519 100644 --- a/xen/arch/x86/setup.c +++ b/xen/arch/x86/setup.c @@ -955,26 +955,9 @@ static struct domain *__init create_dom0(const module_t *image, } } - /* - * Temporarily clear SMAP in CR4 to allow user-accesses in construct_dom0(). - * This saves a large number of corner cases interactions with - * copy_from_user(). - */ - if ( cpu_has_smap ) - { - cr4_pv32_mask &= ~X86_CR4_SMAP; - write_cr4(read_cr4() & ~X86_CR4_SMAP); - } - if ( construct_dom0(d, image, headroom, initrd, cmdline) != 0 ) panic("Could not construct domain 0\n"); - if ( cpu_has_smap ) - { - write_cr4(read_cr4() | X86_CR4_SMAP); - cr4_pv32_mask |= X86_CR4_SMAP; - } - return d; }
The PVH dom0 builder doesn't switch page tables and has no need to run with SMAP disabled. Put the SMAP disabling close to the code region where it's necessary, as it then becomes obvious why switch_cr3_cr4() is required instead of write_ptbase(). Note removing SMAP from cr4_pv32_mask is not required, as we never jump into guest context, and hence updating the value of cr4_pv32_mask is not relevant. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/pv/dom0_build.c | 13 ++++++++++--- xen/arch/x86/setup.c | 17 ----------------- 2 files changed, 10 insertions(+), 20 deletions(-)