| Message ID | 20200518153820.18170-1-andrew.cooper3@citrix.com (mailing list archive) |
|---|---|
| State | Superseded |
| Series | x86/traps: Rework #PF[Rsvd] bit handling |
On 18/05/2020 16:38, Andrew Cooper wrote:
> The reserved_bit_page_fault() paths effectively turn reserved bit faults into a warning, but in the light of L1TF, the real impact is far more serious.
>
> Xen does not have any reserved bits set in its pagetables, nor do we permit PV guests to write any. An HVM shadow guest may have reserved bits via the MMIO fastpath, but those faults are handled in the VMExit #PF intercept, rather than Xen's #PF handler.
>
> There is no need to disable interrupts (in spurious_page_fault()) for __page_fault_type() to look at the rsvd bit, nor should extable fixup be tolerated.
>
> Make #PF[Rsvd] a hard error, irrespective of mode. Any new panic() caused by this constitutes an L1TF gadget needing fixing.
>
> Additionally, drop the comment for do_page_fault(). It is inaccurate (bit 0 being set isn't always a protection violation) and stale (missing bits 5,6,15,31).
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Wei Liu <wl@xen.org>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/arch/x86/traps.c | 39 +++++++++++++--------------------------
>  1 file changed, 13 insertions(+), 26 deletions(-)
>
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index 73c6218660..4f8e3c7a32 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -1137,15 +1137,6 @@ void do_int3(struct cpu_user_regs *regs)
>          pv_inject_hw_exception(TRAP_int3, X86_EVENT_NO_EC);
>  }
>
> -static void reserved_bit_page_fault(unsigned long addr,
> -                                    struct cpu_user_regs *regs)
> -{
> -    printk("%pv: reserved bit in page table (ec=%04X)\n",
> -           current, regs->error_code);
> -    show_page_walk(addr);
> -    show_execution_state(regs);
> -}
> -
>  #ifdef CONFIG_PV
>  static int handle_ldt_mapping_fault(unsigned int offset,
>                                      struct cpu_user_regs *regs)
> @@ -1248,10 +1239,6 @@ static enum pf_type __page_fault_type(unsigned long addr,
>      if ( in_irq() )
>          return real_fault;
>
> -    /* Reserved bit violations are never spurious faults. */
> -    if ( error_code & PFEC_reserved_bit )
> -        return real_fault;
> -
>      required_flags = _PAGE_PRESENT;
>      if ( error_code & PFEC_write_access )
>          required_flags |= _PAGE_RW;
> @@ -1413,14 +1400,6 @@ static int fixup_page_fault(unsigned long addr, struct cpu_user_regs *regs)
>      return 0;
>  }
>
> -/*
> - * #PF error code:
> - *  Bit 0: Protection violation (=1) ; Page not present (=0)
> - *  Bit 1: Write access
> - *  Bit 2: User mode (=1) ; Supervisor mode (=0)
> - *  Bit 3: Reserved bit violation
> - *  Bit 4: Instruction fetch
> - */
>  void do_page_fault(struct cpu_user_regs *regs)
>  {
>      unsigned long addr, fixup;
> @@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
>      if ( unlikely(fixup_page_fault(addr, regs) != 0) )
>          return;
>
> +    /*
> +     * Xen have reserved bits in its pagetables, nor do we permit PV guests to

This should read "Xen doesn't have"

~Andrew

> +     * write any. Such entries would be vulnerable to the L1TF sidechannel.
> +     *
> +     * The only logic which intentionally sets reserved bits is the shadow
> +     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
> +     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
> +     * than here.
> +     */
> +    if ( error_code & PFEC_reserved_bit )
> +        goto fatal;
> +
>      if ( unlikely(!guest_mode(regs)) )
>      {
>          enum pf_type pf_type = spurious_page_fault(addr, regs);
> @@ -1457,13 +1448,12 @@ void do_page_fault(struct cpu_user_regs *regs)
>          if ( likely((fixup = search_exception_table(regs)) != 0) )
>          {
>              perfc_incr(copy_user_faults);
> -            if ( unlikely(regs->error_code & PFEC_reserved_bit) )
> -                reserved_bit_page_fault(addr, regs);
>              this_cpu(last_extable_addr) = regs->rip;
>              regs->rip = fixup;
>              return;
>          }
>
> + fatal:
>          if ( debugger_trap_fatal(TRAP_page_fault, regs) )
>              return;
>
> @@ -1475,9 +1465,6 @@ void do_page_fault(struct cpu_user_regs *regs)
>                error_code, _p(addr));
>      }
>
> -    if ( unlikely(regs->error_code & PFEC_reserved_bit) )
> -        reserved_bit_page_fault(addr, regs);
> -
>      pv_inject_page_fault(regs->error_code, addr);
>  }
>

On 18.05.2020 17:38, Andrew Cooper wrote:
> The reserved_bit_page_fault() paths effectively turn reserved bit faults into a warning, but in the light of L1TF, the real impact is far more serious.
>
> Xen does not have any reserved bits set in its pagetables, nor do we permit PV guests to write any. An HVM shadow guest may have reserved bits via the MMIO fastpath, but those faults are handled in the VMExit #PF intercept, rather than Xen's #PF handler.
>
> There is no need to disable interrupts (in spurious_page_fault()) for __page_fault_type() to look at the rsvd bit, nor should extable fixup be tolerated.

I'm afraid I don't understand the connection of the first half of this to the patch - you don't alter spurious_page_fault() in this regard (at all, actually).

As to extable fixup, I'm not sure: If a reserved bit ends up slipping into the non-Xen parts of the page tables, and if guest accessors then become able to trip a corresponding #PF, the bug will need an XSA with the proposed change, while - afaict - it won't if the exception gets recovered from. (There may then still be log spam issue, I admit.)

> @@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
>      if ( unlikely(fixup_page_fault(addr, regs) != 0) )
>          return;
>
> +    /*
> +     * Xen have reserved bits in its pagetables, nor do we permit PV guests to
> +     * write any. Such entries would be vulnerable to the L1TF sidechannel.
> +     *
> +     * The only logic which intentionally sets reserved bits is the shadow
> +     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
> +     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
> +     * than here.
> +     */
> +    if ( error_code & PFEC_reserved_bit )
> +        goto fatal;

Judging from the description, wouldn't this then better go even further up, ahead of the fixup_page_fault() invocation? In fact the function has two PFEC_reserved_bit checks to _avoid_ taking action, which look like they could then be dropped.

Jan

On 18.05.2020 17:38, Andrew Cooper wrote:
> @@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
>      if ( unlikely(fixup_page_fault(addr, regs) != 0) )
>          return;
>
> +    /*
> +     * Xen have reserved bits in its pagetables, nor do we permit PV guests to
> +     * write any. Such entries would be vulnerable to the L1TF sidechannel.
> +     *
> +     * The only logic which intentionally sets reserved bits is the shadow
> +     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
> +     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
> +     * than here.

What about SH_L1E_MAGIC and sh_l1e_gnp()? The latter gets used by _sh_propagate() without visible restriction to HVM.

And of course every time I look at this code I wonder how we can get away with (quoting a comment) "We store 28 bits of GFN in bits 4:32 of the entry." Do we have a hidden restriction somewhere guaranteeing that guests won't have (emulated MMIO) GFNs above 1Tb when run in shadow mode?

Jan

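For reference, the encoding idea under discussion can be sketched in isolation. This is a minimal, self-contained illustration only: the constants, GFN_SHIFT, and helper names below are assumptions made for explanation, not Xen's actual SH_L1E_MMIO_* definitions. The point is that a shadow l1e can be made to fault with #PF[Rsvd] by forcing high, never-cacheable address bits set while a GFN is packed into low bits for the fault handler to recover, which is why only 28 bits of GFN (1TiB of guest-physical space) fit.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed, illustrative values (not Xen's): */
#define MAGIC_HIGH_BITS  0xffffffff00000000ULL  /* high bits forced set: aliases non-cacheable space */
#define MAGIC_PRESENT    0x1ULL                 /* present bit set, so the fault is Rsvd, not not-present */
#define GFN_SHIFT        4                      /* GFN assumed stored in bits 4:32 */
#define GFN_MASK         0x00000000fffffff0ULL  /* 28 bits of GFN */

/* Build a "magic" MMIO l1e carrying a GFN in otherwise-reserved bits. */
static inline uint64_t mmio_magic_l1e(uint64_t gfn)
{
    return MAGIC_HIGH_BITS | ((gfn << GFN_SHIFT) & GFN_MASK) | MAGIC_PRESENT;
}

/* Recover the GFN from such an entry in the fault handler. */
static inline uint64_t mmio_magic_to_gfn(uint64_t l1e)
{
    return (l1e & GFN_MASK) >> GFN_SHIFT;
}

int main(void)
{
    uint64_t gfn = 0xfeb00;                     /* some emulated-MMIO frame below 1TiB */
    uint64_t l1e = mmio_magic_l1e(gfn);

    printf("l1e = %#018llx, gfn recovered = %#llx\n",
           (unsigned long long)l1e, (unsigned long long)mmio_magic_to_gfn(l1e));
    return 0;
}
```
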
On 19/05/2020 09:34, Jan Beulich wrote:
> On 18.05.2020 17:38, Andrew Cooper wrote:
>> @@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
>>      if ( unlikely(fixup_page_fault(addr, regs) != 0) )
>>          return;
>>
>> +    /*
>> +     * Xen have reserved bits in its pagetables, nor do we permit PV guests to
>> +     * write any. Such entries would be vulnerable to the L1TF sidechannel.
>> +     *
>> +     * The only logic which intentionally sets reserved bits is the shadow
>> +     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
>> +     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
>> +     * than here.
> What about SH_L1E_MAGIC and sh_l1e_gnp()? The latter gets used by _sh_propagate() without visible restriction to HVM.

SH_L1E_MAGIC looks to be redundant with SH_L1E_MMIO_MAGIC. sh_l1e_mmio() is the only path which ever creates an entry like that.

sh_l1e_gnp() is a very well hidden use of reserved bits, but surely can't be used for PV guests, as there doesn't appear to be anything to turn the resulting fault back into a plain not-present.

> And of course every time I look at this code I wonder how we can get away with (quoting a comment) "We store 28 bits of GFN in bits 4:32 of the entry." Do we have a hidden restriction somewhere guaranteeing that guests won't have (emulated MMIO) GFNs above 1Tb when run in shadow mode?

I've raised that several times before. Its broken.

Given that shadow frames are limited to 44 bits anyway (and not yet levelled safely in the migration stream), my suggestion for fixing this was just to use one extra nibble for the extra 4 bits and call it done.

~Andrew

On 19/05/2020 09:14, Jan Beulich wrote:
> On 18.05.2020 17:38, Andrew Cooper wrote:
>> The reserved_bit_page_fault() paths effectively turn reserved bit faults into a warning, but in the light of L1TF, the real impact is far more serious.
>>
>> Xen does not have any reserved bits set in its pagetables, nor do we permit PV guests to write any. An HVM shadow guest may have reserved bits via the MMIO fastpath, but those faults are handled in the VMExit #PF intercept, rather than Xen's #PF handler.
>>
>> There is no need to disable interrupts (in spurious_page_fault()) for __page_fault_type() to look at the rsvd bit, nor should extable fixup be tolerated.
> I'm afraid I don't understand the connection of the first half of this to the patch - you don't alter spurious_page_fault() in this regard (at all, actually).

The disabling interrupts is in spurious_page_fault(). But the point is that there is no need to enter this logic at all for a reserved page fault.

> As to extable fixup, I'm not sure: If a reserved bit ends up slipping into the non-Xen parts of the page tables, and if guest accessors then become able to trip a corresponding #PF, the bug will need an XSA with the proposed change, while - afaict - it won't if the exception gets recovered from. (There may then still be log spam issue, I admit.)

We need to issue an XSA anyway because such a construct would be an L1TF gadget.

What this change does is make it substantially more obvious, and turns an information leak into a DoS.

>> @@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
>>      if ( unlikely(fixup_page_fault(addr, regs) != 0) )
>>          return;
>>
>> +    /*
>> +     * Xen have reserved bits in its pagetables, nor do we permit PV guests to
>> +     * write any. Such entries would be vulnerable to the L1TF sidechannel.
>> +     *
>> +     * The only logic which intentionally sets reserved bits is the shadow
>> +     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
>> +     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
>> +     * than here.
>> +     */
>> +    if ( error_code & PFEC_reserved_bit )
>> +        goto fatal;
> Judging from the description, wouldn't this then better go even further up, ahead of the fixup_page_fault() invocation? In fact the function has two PFEC_reserved_bit checks to _avoid_ taking action, which look like they could then be dropped.

Only for certain Xen-only fixup. The path into paging_fault() is not guarded. Depending on whether GNP is actually used for PV guests, this is where it would be fixed up.

~Andrew

On 19.05.2020 16:11, Andrew Cooper wrote:
> On 19/05/2020 09:34, Jan Beulich wrote:
>> On 18.05.2020 17:38, Andrew Cooper wrote:
>>> @@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
>>>      if ( unlikely(fixup_page_fault(addr, regs) != 0) )
>>>          return;
>>>
>>> +    /*
>>> +     * Xen have reserved bits in its pagetables, nor do we permit PV guests to
>>> +     * write any. Such entries would be vulnerable to the L1TF sidechannel.
>>> +     *
>>> +     * The only logic which intentionally sets reserved bits is the shadow
>>> +     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
>>> +     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
>>> +     * than here.
>> What about SH_L1E_MAGIC and sh_l1e_gnp()? The latter gets used by _sh_propagate() without visible restriction to HVM.
>
> SH_L1E_MAGIC looks to be redundant with SH_L1E_MMIO_MAGIC. sh_l1e_mmio() is the only path which ever creates an entry like that.
>
> sh_l1e_gnp() is a very well hidden use of reserved bits, but surely can't be used for PV guests, as there doesn't appear to be anything to turn the resulting fault back into a plain not-present.

Well, in this case the implied question remains: How does this fit with what _sh_propagate() does?

>> And of course every time I look at this code I wonder how we can get away with (quoting a comment) "We store 28 bits of GFN in bits 4:32 of the entry." Do we have a hidden restriction somewhere guaranteeing that guests won't have (emulated MMIO) GFNs above 1Tb when run in shadow mode?
>
> I've raised that several times before. Its broken.
>
> Given that shadow frames are limited to 44 bits anyway (and not yet levelled safely in the migration stream), my suggestion for fixing this was just to use one extra nibble for the extra 4 bits and call it done.

Would you remind(?) me of where this 44-bit restriction is coming from?

Jan

On 19.05.2020 16:29, Andrew Cooper wrote:
> On 19/05/2020 09:14, Jan Beulich wrote:
>> On 18.05.2020 17:38, Andrew Cooper wrote:
>>> The reserved_bit_page_fault() paths effectively turn reserved bit faults into a warning, but in the light of L1TF, the real impact is far more serious.
>>>
>>> Xen does not have any reserved bits set in its pagetables, nor do we permit PV guests to write any. An HVM shadow guest may have reserved bits via the MMIO fastpath, but those faults are handled in the VMExit #PF intercept, rather than Xen's #PF handler.
>>>
>>> There is no need to disable interrupts (in spurious_page_fault()) for __page_fault_type() to look at the rsvd bit, nor should extable fixup be tolerated.
>> I'm afraid I don't understand the connection of the first half of this to the patch - you don't alter spurious_page_fault() in this regard (at all, actually).
> The disabling interrupts is in spurious_page_fault(). But the point is that there is no need to enter this logic at all for a reserved page fault.
>
>> As to extable fixup, I'm not sure: If a reserved bit ends up slipping into the non-Xen parts of the page tables, and if guest accessors then become able to trip a corresponding #PF, the bug will need an XSA with the proposed change, while - afaict - it won't if the exception gets recovered from. (There may then still be log spam issue, I admit.)
> We need to issue an XSA anyway because such a construct would be an L1TF gadget.
>
> What this change does is make it substantially more obvious, and turns an information leak into a DoS.

For L1TF-affected hardware. For unaffected hardware it turns a possible (but not guaranteed) log spam DoS into a reliable crash.

>>> @@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
>>>      if ( unlikely(fixup_page_fault(addr, regs) != 0) )
>>>          return;
>>>
>>> +    /*
>>> +     * Xen have reserved bits in its pagetables, nor do we permit PV guests to
>>> +     * write any. Such entries would be vulnerable to the L1TF sidechannel.
>>> +     *
>>> +     * The only logic which intentionally sets reserved bits is the shadow
>>> +     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
>>> +     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
>>> +     * than here.
>>> +     */
>>> +    if ( error_code & PFEC_reserved_bit )
>>> +        goto fatal;
>> Judging from the description, wouldn't this then better go even further up, ahead of the fixup_page_fault() invocation? In fact the function has two PFEC_reserved_bit checks to _avoid_ taking action, which look like they could then be dropped.
>
> Only for certain Xen-only fixup. The path into paging_fault() is not guarded.

Hmm, yes indeed.

Jan

On 19/05/2020 15:48, Jan Beulich wrote:
> On 19.05.2020 16:11, Andrew Cooper wrote:
>> On 19/05/2020 09:34, Jan Beulich wrote:
>>> On 18.05.2020 17:38, Andrew Cooper wrote:
>>>> @@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
>>>>      if ( unlikely(fixup_page_fault(addr, regs) != 0) )
>>>>          return;
>>>>
>>>> +    /*
>>>> +     * Xen have reserved bits in its pagetables, nor do we permit PV guests to
>>>> +     * write any. Such entries would be vulnerable to the L1TF sidechannel.
>>>> +     *
>>>> +     * The only logic which intentionally sets reserved bits is the shadow
>>>> +     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
>>>> +     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
>>>> +     * than here.
>>> What about SH_L1E_MAGIC and sh_l1e_gnp()? The latter gets used by _sh_propagate() without visible restriction to HVM.
>> SH_L1E_MAGIC looks to be redundant with SH_L1E_MMIO_MAGIC. sh_l1e_mmio() is the only path which ever creates an entry like that.
>>
>> sh_l1e_gnp() is a very well hidden use of reserved bits, but surely can't be used for PV guests, as there doesn't appear to be anything to turn the resulting fault back into a plain not-present.
> Well, in this case the implied question remains: How does this fit with what _sh_propagate() does?

I'm in the process of investigating.

>>> And of course every time I look at this code I wonder how we can get away with (quoting a comment) "We store 28 bits of GFN in bits 4:32 of the entry." Do we have a hidden restriction somewhere guaranteeing that guests won't have (emulated MMIO) GFNs above 1Tb when run in shadow mode?
>> I've raised that several times before. Its broken.
>>
>> Given that shadow frames are limited to 44 bits anyway (and not yet levelled safely in the migration stream), my suggestion for fixing this was just to use one extra nibble for the extra 4 bits and call it done.
> Would you remind(?) me of where this 44-bit restriction is coming from?

From paging_max_paddr_bits(),

    /* Shadowed superpages store GFNs in 32-bit page_info fields. */

~Andrew

On 19/05/2020 15:55, Jan Beulich wrote:
> On 19.05.2020 16:29, Andrew Cooper wrote:
>> On 19/05/2020 09:14, Jan Beulich wrote:
>>> On 18.05.2020 17:38, Andrew Cooper wrote:
>>>> The reserved_bit_page_fault() paths effectively turn reserved bit faults into a warning, but in the light of L1TF, the real impact is far more serious.
>>>>
>>>> Xen does not have any reserved bits set in its pagetables, nor do we permit PV guests to write any. An HVM shadow guest may have reserved bits via the MMIO fastpath, but those faults are handled in the VMExit #PF intercept, rather than Xen's #PF handler.
>>>>
>>>> There is no need to disable interrupts (in spurious_page_fault()) for __page_fault_type() to look at the rsvd bit, nor should extable fixup be tolerated.
>>> I'm afraid I don't understand the connection of the first half of this to the patch - you don't alter spurious_page_fault() in this regard (at all, actually).
>> The disabling interrupts is in spurious_page_fault(). But the point is that there is no need to enter this logic at all for a reserved page fault.
>>
>>> As to extable fixup, I'm not sure: If a reserved bit ends up slipping into the non-Xen parts of the page tables, and if guest accessors then become able to trip a corresponding #PF, the bug will need an XSA with the proposed change, while - afaict - it won't if the exception gets recovered from. (There may then still be log spam issue, I admit.)
>> We need to issue an XSA anyway because such a construct would be an L1TF gadget.
>>
>> What this change does is make it substantially more obvious, and turns an information leak into a DoS.
> For L1TF-affected hardware. For unaffected hardware it turns a possible (but not guaranteed) log spam DoS into a reliable crash.

It represents unexpected corruption of our most critical security resource in the processor. Obviously we need to account for any legitimate uses Xen has of reserved bits (so far maybe GNP for PV guests), but BUG()-like behaviour *is* the response appropriate to the severity of finding corrupt PTEs.

~Andrew

On 19.05.2020 17:33, Andrew Cooper wrote:
> On 19/05/2020 15:48, Jan Beulich wrote:
>> On 19.05.2020 16:11, Andrew Cooper wrote:
>>> Given that shadow frames are limited to 44 bits anyway (and not yet levelled safely in the migration stream), my suggestion for fixing this was just to use one extra nibble for the extra 4 bits and call it done.
>> Would you remind(?) me of where this 44-bit restriction is coming from?
>
> From paging_max_paddr_bits(),
>
>     /* Shadowed superpages store GFNs in 32-bit page_info fields. */

Ah, that's an abuse of the backlink field. After some looking around I first thought the up field could be used to store the GFN instead, as it's supposedly used for single-page shadows only. Then however I found

static inline int sh_type_has_up_pointer(struct domain *d, unsigned int t)
{
    /* Multi-page shadows don't have up-pointers */
    if ( t == SH_type_l1_32_shadow
         || t == SH_type_fl1_32_shadow
         || t == SH_type_l2_32_shadow )
        return 0;
    /* Pinnable shadows don't have up-pointers either */
    return !sh_type_is_pinnable(d, t);
}

It's unclear to me in which way SH_type_l1_32_shadow and SH_type_l2_32_shadow are "multi-page" shadows; I'd rather have expected all three SH_type_fl1_*_shadow to be. Tim?

In any event there would be 12 bits to reclaim from the up pointer - it being a physical address, there'll not be more than 52 significant bits.

Jan

On 19/05/2020 17:09, Jan Beulich wrote:
> On 19.05.2020 17:33, Andrew Cooper wrote:
>> On 19/05/2020 15:48, Jan Beulich wrote:
>>> On 19.05.2020 16:11, Andrew Cooper wrote:
>>>> Given that shadow frames are limited to 44 bits anyway (and not yet levelled safely in the migration stream), my suggestion for fixing this was just to use one extra nibble for the extra 4 bits and call it done.
>>> Would you remind(?) me of where this 44-bit restriction is coming from?
>> From paging_max_paddr_bits(),
>>
>>     /* Shadowed superpages store GFNs in 32-bit page_info fields. */
> Ah, that's an abuse of the backlink field. After some looking around I first thought the up field could be used to store the GFN instead, as it's supposedly used for single-page shadows only. Then however I found
>
> static inline int sh_type_has_up_pointer(struct domain *d, unsigned int t)
> {
>     /* Multi-page shadows don't have up-pointers */
>     if ( t == SH_type_l1_32_shadow
>          || t == SH_type_fl1_32_shadow
>          || t == SH_type_l2_32_shadow )
>         return 0;
>     /* Pinnable shadows don't have up-pointers either */
>     return !sh_type_is_pinnable(d, t);
> }
>
> It's unclear to me in which way SH_type_l1_32_shadow and SH_type_l2_32_shadow are "multi-page" shadows; I'd rather have expected all three SH_type_fl1_*_shadow to be. Tim?

I suspect the comment is incomplete, and should include "4k shadows don't have up-pointers".

> In any event there would be 12 bits to reclaim from the up pointer - it being a physical address, there'll not be more than 52 significant bits.

Right, but for L1TF safety, the address bits in the PTE must not be cacheable.

Currently, on fully populated multi-socket servers, the MMIO fastpath relies on the top 4G of address space not being cacheable, which is the safest we can reasonably manage. Extending this by a nibble takes us to 16G which is not meaningfully less safe.

~Andrew

At 18:09 +0200 on 19 May (1589911795), Jan Beulich wrote:
> static inline int sh_type_has_up_pointer(struct domain *d, unsigned int t)
> {
>     /* Multi-page shadows don't have up-pointers */
>     if ( t == SH_type_l1_32_shadow
>          || t == SH_type_fl1_32_shadow
>          || t == SH_type_l2_32_shadow )
>         return 0;
>     /* Pinnable shadows don't have up-pointers either */
>     return !sh_type_is_pinnable(d, t);
> }
>
> It's unclear to me in which way SH_type_l1_32_shadow and SH_type_l2_32_shadow are "multi-page" shadows; I'd rather have expected all three SH_type_fl1_*_shadow to be. Tim?

They are multi-page in the sense that the shadow itself is more than a page long (because it needs to have 1024 8-byte entries). The FL1 shadows are the same size as their equivalent L1s.

Tim.

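Tim's sizing argument can be checked with a little arithmetic. A minimal sketch, assuming 4 KiB pages and 8-byte shadow entries; the macro names are illustrative, not Xen's:

```c
#include <assert.h>

/*
 * A 2-level 32-bit guest pagetable page holds 1024 entries.  Shadowing it
 * with 8-byte (PAE/64-bit format) entries therefore needs 8 KiB, i.e. the
 * shadow spans two pages, which is why the l1_32/fl1_32/l2_32 shadow types
 * are "multi-page" and cannot keep a single up-pointer.
 */
#define PAGE_SIZE            4096
#define GUEST_32BIT_ENTRIES  1024   /* entries per 32-bit guest pagetable page */
#define SHADOW_ENTRY_SIZE    8      /* shadow entries are always 8 bytes */

static_assert(GUEST_32BIT_ENTRIES * SHADOW_ENTRY_SIZE == 2 * PAGE_SIZE,
              "a 32-bit l1/l2 (or fl1) shadow occupies two pages");

int main(void) { return 0; }
```
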
On 19.05.2020 20:00, Andrew Cooper wrote:
> On 19/05/2020 17:09, Jan Beulich wrote:
>> In any event there would be 12 bits to reclaim from the up pointer - it being a physical address, there'll not be more than 52 significant bits.
>
> Right, but for L1TF safety, the address bits in the PTE must not be cacheable.

So if I understand this right, your response was only indirectly related to what I said: You mean that no matter whether we find a way to store full-width GFNs, SH_L1E_MMIO_MAGIC can't have arbitrarily many set bits dropped. On L1TF vulnerable hardware, that is (i.e. in principle the constant could become a variable to be determined at boot).

> Currently, on fully populated multi-socket servers, the MMIO fastpath relies on the top 4G of address space not being cacheable, which is the safest we can reasonably manage. Extending this by a nibble takes us to 16G which is not meaningfully less safe.

That's 64G (36 address bits), isn't it? Looking at l1tf_calculations(), I'd be worried in particular Penryn / Dunnington might not support more than 36 address bits (I don't think I have anywhere to check). Even if it was 38, 39, or 40 bits, 64G becomes a not insignificant part of the overall 256G / 512G / 1T address space. Then again the top quarter assumption in l1tf_calculations() would still be met in this latter case.

Jan

On 20/05/2020 08:48, Jan Beulich wrote:
> On 19.05.2020 20:00, Andrew Cooper wrote:
>> On 19/05/2020 17:09, Jan Beulich wrote:
>>> In any event there would be 12 bits to reclaim from the up pointer - it being a physical address, there'll not be more than 52 significant bits.
>> Right, but for L1TF safety, the address bits in the PTE must not be cacheable.
> So if I understand this right, your response was only indirectly related to what I said: You mean that no matter whether we find a way to store full-width GFNs, SH_L1E_MMIO_MAGIC can't have arbitrarily many set bits dropped.

Yes

> On L1TF vulnerable hardware, that is (i.e. in principle the constant could become a variable to be determined at boot).

The only thing which can usefully be done at runtime disable the fastpath. If cacheable memory overlaps with the used address bits, there are no safe values to use.

>> Currently, on fully populated multi-socket servers, the MMIO fastpath relies on the top 4G of address space not being cacheable, which is the safest we can reasonably manage. Extending this by a nibble takes us to 16G which is not meaningfully less safe.
> That's 64G (36 address bits), isn't it?

Yes it is. I can't count.

> Looking at l1tf_calculations(), I'd be worried in particular Penryn / Dunnington might not support more than 36 address bits (I don't think I have anywhere to check). Even if it was 38, 39, or 40 bits, 64G becomes a not insignificant part of the overall 256G / 512G / 1T address space. Then again the top quarter assumption in l1tf_calculations() would still be met in this latter case.

I'm honestly not too worried. Intel has ceased supporting anything older than SandyBridge, and there are other unfixed speculative security issues. Anyone using these processors has bigger problems.

~Andrew

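For reference, the figures traded back and forth above work out as follows. This is a small standalone sketch; the bit positions are the assumed layout taken from the "28 bits of GFN in bits 4:32" comment quoted earlier, not actual Xen definitions:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t gfn_bits_now = 28;   /* GFN in PTE bits 4..31, magic bits forced set from bit 32 */
    uint64_t gfn_bits_new = 32;   /* one extra nibble, as suggested in the thread */

    /* Highest guest-physical address representable: GFN count * 4 KiB pages. */
    printf("GFN limit now:  %llu GiB\n",
           (unsigned long long)(((1ULL << gfn_bits_now) * 4096) >> 30));   /* 1024 GiB = 1 TiB */
    printf("GFN limit new:  %llu GiB\n",
           (unsigned long long)(((1ULL << gfn_bits_new) * 4096) >> 30));   /* 16384 GiB = 16 TiB */

    /*
     * L1TF safety relies on the forced-set high bits aliasing addresses which
     * can never be cacheable RAM: with the magic starting at bit 32 that is
     * the top 4 GiB of the usable range; starting at bit 36 it becomes the
     * top 64 GiB, hence the correction from 16G to 64G above.
     */
    printf("alias span now: %llu GiB\n", (unsigned long long)((1ULL << 32) >> 30));  /* 4 */
    printf("alias span new: %llu GiB\n", (unsigned long long)((1ULL << 36) >> 30));  /* 64 */
    return 0;
}
```
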
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 73c6218660..4f8e3c7a32 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1137,15 +1137,6 @@ void do_int3(struct cpu_user_regs *regs)
         pv_inject_hw_exception(TRAP_int3, X86_EVENT_NO_EC);
 }
 
-static void reserved_bit_page_fault(unsigned long addr,
-                                    struct cpu_user_regs *regs)
-{
-    printk("%pv: reserved bit in page table (ec=%04X)\n",
-           current, regs->error_code);
-    show_page_walk(addr);
-    show_execution_state(regs);
-}
-
 #ifdef CONFIG_PV
 static int handle_ldt_mapping_fault(unsigned int offset,
                                     struct cpu_user_regs *regs)
@@ -1248,10 +1239,6 @@ static enum pf_type __page_fault_type(unsigned long addr,
     if ( in_irq() )
         return real_fault;
 
-    /* Reserved bit violations are never spurious faults. */
-    if ( error_code & PFEC_reserved_bit )
-        return real_fault;
-
     required_flags = _PAGE_PRESENT;
     if ( error_code & PFEC_write_access )
         required_flags |= _PAGE_RW;
@@ -1413,14 +1400,6 @@ static int fixup_page_fault(unsigned long addr, struct cpu_user_regs *regs)
     return 0;
 }
 
-/*
- * #PF error code:
- *  Bit 0: Protection violation (=1) ; Page not present (=0)
- *  Bit 1: Write access
- *  Bit 2: User mode (=1) ; Supervisor mode (=0)
- *  Bit 3: Reserved bit violation
- *  Bit 4: Instruction fetch
- */
 void do_page_fault(struct cpu_user_regs *regs)
 {
     unsigned long addr, fixup;
@@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
     if ( unlikely(fixup_page_fault(addr, regs) != 0) )
         return;
 
+    /*
+     * Xen have reserved bits in its pagetables, nor do we permit PV guests to
+     * write any. Such entries would be vulnerable to the L1TF sidechannel.
+     *
+     * The only logic which intentionally sets reserved bits is the shadow
+     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
+     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
+     * than here.
+     */
+    if ( error_code & PFEC_reserved_bit )
+        goto fatal;
+
     if ( unlikely(!guest_mode(regs)) )
     {
         enum pf_type pf_type = spurious_page_fault(addr, regs);
@@ -1457,13 +1448,12 @@ void do_page_fault(struct cpu_user_regs *regs)
         if ( likely((fixup = search_exception_table(regs)) != 0) )
         {
             perfc_incr(copy_user_faults);
-            if ( unlikely(regs->error_code & PFEC_reserved_bit) )
-                reserved_bit_page_fault(addr, regs);
             this_cpu(last_extable_addr) = regs->rip;
             regs->rip = fixup;
             return;
         }
 
+ fatal:
         if ( debugger_trap_fatal(TRAP_page_fault, regs) )
             return;
 
@@ -1475,9 +1465,6 @@ void do_page_fault(struct cpu_user_regs *regs)
               error_code, _p(addr));
     }
 
-    if ( unlikely(regs->error_code & PFEC_reserved_bit) )
-        reserved_bit_page_fault(addr, regs);
-
     pv_inject_page_fault(regs->error_code, addr);
 }

The reserved_bit_page_fault() paths effectively turn reserved bit faults into a warning, but in the light of L1TF, the real impact is far more serious.

Xen does not have any reserved bits set in its pagetables, nor do we permit PV guests to write any. An HVM shadow guest may have reserved bits via the MMIO fastpath, but those faults are handled in the VMExit #PF intercept, rather than Xen's #PF handler.

There is no need to disable interrupts (in spurious_page_fault()) for __page_fault_type() to look at the rsvd bit, nor should extable fixup be tolerated.

Make #PF[Rsvd] a hard error, irrespective of mode. Any new panic() caused by this constitutes an L1TF gadget needing fixing.

Additionally, drop the comment for do_page_fault(). It is inaccurate (bit 0 being set isn't always a protection violation) and stale (missing bits 5,6,15,31).

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/traps.c | 39 +++++++++++++--------------------------
 1 file changed, 13 insertions(+), 26 deletions(-)