| Message ID | 20200518153820.18170-1-andrew.cooper3@citrix.com (mailing list archive) |
|---|---|
| State | Superseded |
| Series | x86/traps: Rework #PF[Rsvd] bit handling |
On 18/05/2020 16:38, Andrew Cooper wrote:
> The reserved_bit_page_fault() paths effectively turn reserved bit faults into a warning, but in the light of L1TF, the real impact is far more serious.
>
> Xen does not have any reserved bits set in its pagetables, nor do we permit PV guests to write any. An HVM shadow guest may have reserved bits via the MMIO fastpath, but those faults are handled in the VMExit #PF intercept, rather than Xen's #PF handler.
>
> There is no need to disable interrupts (in spurious_page_fault()) for __page_fault_type() to look at the rsvd bit, nor should extable fixup be tolerated.
>
> Make #PF[Rsvd] a hard error, irrespective of mode. Any new panic() caused by this constitutes an L1TF gadget needing fixing.
>
> Additionally, drop the comment for do_page_fault(). It is inaccurate (bit 0 being set isn't always a protection violation) and stale (missing bits 5,6,15,31).
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Wei Liu <wl@xen.org>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/arch/x86/traps.c | 39 +++++++++++++--------------------------
>  1 file changed, 13 insertions(+), 26 deletions(-)
>
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index 73c6218660..4f8e3c7a32 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -1137,15 +1137,6 @@ void do_int3(struct cpu_user_regs *regs)
>          pv_inject_hw_exception(TRAP_int3, X86_EVENT_NO_EC);
>  }
>
> -static void reserved_bit_page_fault(unsigned long addr,
> -                                    struct cpu_user_regs *regs)
> -{
> -    printk("%pv: reserved bit in page table (ec=%04X)\n",
> -           current, regs->error_code);
> -    show_page_walk(addr);
> -    show_execution_state(regs);
> -}
> -
>  #ifdef CONFIG_PV
>  static int handle_ldt_mapping_fault(unsigned int offset,
>                                      struct cpu_user_regs *regs)
> @@ -1248,10 +1239,6 @@ static enum pf_type __page_fault_type(unsigned long addr,
>      if ( in_irq() )
>          return real_fault;
>
> -    /* Reserved bit violations are never spurious faults. */
> -    if ( error_code & PFEC_reserved_bit )
> -        return real_fault;
> -
>      required_flags = _PAGE_PRESENT;
>      if ( error_code & PFEC_write_access )
>          required_flags |= _PAGE_RW;
> @@ -1413,14 +1400,6 @@ static int fixup_page_fault(unsigned long addr, struct cpu_user_regs *regs)
>      return 0;
>  }
>
> -/*
> - * #PF error code:
> - *  Bit 0: Protection violation (=1) ; Page not present (=0)
> - *  Bit 1: Write access
> - *  Bit 2: User mode (=1) ; Supervisor mode (=0)
> - *  Bit 3: Reserved bit violation
> - *  Bit 4: Instruction fetch
> - */
>  void do_page_fault(struct cpu_user_regs *regs)
>  {
>      unsigned long addr, fixup;
> @@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
>      if ( unlikely(fixup_page_fault(addr, regs) != 0) )
>          return;
>
> +    /*
> +     * Xen have reserved bits in its pagetables, nor do we permit PV guests to

This should read "Xen doesn't have"

~Andrew

> +     * write any. Such entries would be vulnerable to the L1TF sidechannel.
> +     *
> +     * The only logic which intentionally sets reserved bits is the shadow
> +     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
> +     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
> +     * than here.
> +     */
> +    if ( error_code & PFEC_reserved_bit )
> +        goto fatal;
> +
>      if ( unlikely(!guest_mode(regs)) )
>      {
>          enum pf_type pf_type = spurious_page_fault(addr, regs);
> @@ -1457,13 +1448,12 @@ void do_page_fault(struct cpu_user_regs *regs)
>          if ( likely((fixup = search_exception_table(regs)) != 0) )
>          {
>              perfc_incr(copy_user_faults);
> -            if ( unlikely(regs->error_code & PFEC_reserved_bit) )
> -                reserved_bit_page_fault(addr, regs);
>              this_cpu(last_extable_addr) = regs->rip;
>              regs->rip = fixup;
>              return;
>          }
>
> + fatal:
>          if ( debugger_trap_fatal(TRAP_page_fault, regs) )
>              return;
>
> @@ -1475,9 +1465,6 @@ void do_page_fault(struct cpu_user_regs *regs)
>                error_code, _p(addr));
>      }
>
> -    if ( unlikely(regs->error_code & PFEC_reserved_bit) )
> -        reserved_bit_page_fault(addr, regs);
> -
>      pv_inject_page_fault(regs->error_code, addr);
>  }
>

On 18.05.2020 17:38, Andrew Cooper wrote:
> The reserved_bit_page_fault() paths effectively turn reserved bit faults into a warning, but in the light of L1TF, the real impact is far more serious.
>
> Xen does not have any reserved bits set in its pagetables, nor do we permit PV guests to write any. An HVM shadow guest may have reserved bits via the MMIO fastpath, but those faults are handled in the VMExit #PF intercept, rather than Xen's #PF handler.
>
> There is no need to disable interrupts (in spurious_page_fault()) for __page_fault_type() to look at the rsvd bit, nor should extable fixup be tolerated.

I'm afraid I don't understand the connection of the first half of this to the patch - you don't alter spurious_page_fault() in this regard (at all, actually).

As to extable fixup, I'm not sure: If a reserved bit ends up slipping into the non-Xen parts of the page tables, and if guest accessors then become able to trip a corresponding #PF, the bug will need an XSA with the proposed change, while - afaict - it won't if the exception gets recovered from. (There may then still be log spam issue, I admit.)

> @@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
>      if ( unlikely(fixup_page_fault(addr, regs) != 0) )
>          return;
>
> +    /*
> +     * Xen have reserved bits in its pagetables, nor do we permit PV guests to
> +     * write any. Such entries would be vulnerable to the L1TF sidechannel.
> +     *
> +     * The only logic which intentionally sets reserved bits is the shadow
> +     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
> +     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
> +     * than here.
> +     */
> +    if ( error_code & PFEC_reserved_bit )
> +        goto fatal;

Judging from the description, wouldn't this then better go even further up, ahead of the fixup_page_fault() invocation? In fact the function has two PFEC_reserved_bit checks to _avoid_ taking action, which look like they could then be dropped.

Jan

On 18.05.2020 17:38, Andrew Cooper wrote:
> @@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
>      if ( unlikely(fixup_page_fault(addr, regs) != 0) )
>          return;
>
> +    /*
> +     * Xen have reserved bits in its pagetables, nor do we permit PV guests to
> +     * write any. Such entries would be vulnerable to the L1TF sidechannel.
> +     *
> +     * The only logic which intentionally sets reserved bits is the shadow
> +     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
> +     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
> +     * than here.

What about SH_L1E_MAGIC and sh_l1e_gnp()? The latter gets used by _sh_propagate() without visible restriction to HVM.

And of course every time I look at this code I wonder how we can get away with (quoting a comment) "We store 28 bits of GFN in bits 4:32 of the entry." Do we have a hidden restriction somewhere guaranteeing that guests won't have (emulated MMIO) GFNs above 1Tb when run in shadow mode?

Jan

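For reference, the encoding idea under discussion can be sketched in isolation. This is a minimal, self-contained illustration only: the constants, GFN_SHIFT, and helper names below are assumptions made for explanation, not Xen's actual SH_L1E_MMIO_* definitions. The point is that a shadow l1e can be made to fault with #PF[Rsvd] by forcing high, never-cacheable address bits set while a GFN is packed into low bits for the fault handler to recover, which is why only 28 bits of GFN (1TiB of guest-physical space) fit.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed, illustrative values (not Xen's): */
#define MAGIC_HIGH_BITS  0xffffffff00000000ULL  /* high bits forced set: aliases non-cacheable space */
#define MAGIC_PRESENT    0x1ULL                 /* present bit set, so the fault is Rsvd, not not-present */
#define GFN_SHIFT        4                      /* GFN assumed stored in bits 4:32 */
#define GFN_MASK         0x00000000fffffff0ULL  /* 28 bits of GFN */

/* Build a "magic" MMIO l1e carrying a GFN in otherwise-reserved bits. */
static inline uint64_t mmio_magic_l1e(uint64_t gfn)
{
    return MAGIC_HIGH_BITS | ((gfn << GFN_SHIFT) & GFN_MASK) | MAGIC_PRESENT;
}

/* Recover the GFN from such an entry in the fault handler. */
static inline uint64_t mmio_magic_to_gfn(uint64_t l1e)
{
    return (l1e & GFN_MASK) >> GFN_SHIFT;
}

int main(void)
{
    uint64_t gfn = 0xfeb00;                     /* some emulated-MMIO frame below 1TiB */
    uint64_t l1e = mmio_magic_l1e(gfn);

    printf("l1e = %#018llx, gfn recovered = %#llx\n",
           (unsigned long long)l1e, (unsigned long long)mmio_magic_to_gfn(l1e));
    return 0;
}
```
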
On 19/05/2020 09:34, Jan Beulich wrote:
> On 18.05.2020 17:38, Andrew Cooper wrote:
>> @@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
>>      if ( unlikely(fixup_page_fault(addr, regs) != 0) )
>>          return;
>>
>> +    /*
>> +     * Xen have reserved bits in its pagetables, nor do we permit PV guests to
>> +     * write any. Such entries would be vulnerable to the L1TF sidechannel.
>> +     *
>> +     * The only logic which intentionally sets reserved bits is the shadow
>> +     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
>> +     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
>> +     * than here.
> What about SH_L1E_MAGIC and sh_l1e_gnp()? The latter gets used by _sh_propagate() without visible restriction to HVM.

SH_L1E_MAGIC looks to be redundant with SH_L1E_MMIO_MAGIC. sh_l1e_mmio() is the only path which ever creates an entry like that.

sh_l1e_gnp() is a very well hidden use of reserved bits, but surely can't be used for PV guests, as there doesn't appear to be anything to turn the resulting fault back into a plain not-present.

> And of course every time I look at this code I wonder how we can get away with (quoting a comment) "We store 28 bits of GFN in bits 4:32 of the entry." Do we have a hidden restriction somewhere guaranteeing that guests won't have (emulated MMIO) GFNs above 1Tb when run in shadow mode?

I've raised that several times before. Its broken.

Given that shadow frames are limited to 44 bits anyway (and not yet levelled safely in the migration stream), my suggestion for fixing this was just to use one extra nibble for the extra 4 bits and call it done.

~Andrew

On 19/05/2020 09:14, Jan Beulich wrote:
> On 18.05.2020 17:38, Andrew Cooper wrote:
>> The reserved_bit_page_fault() paths effectively turn reserved bit faults into a warning, but in the light of L1TF, the real impact is far more serious.
>>
>> Xen does not have any reserved bits set in its pagetables, nor do we permit PV guests to write any. An HVM shadow guest may have reserved bits via the MMIO fastpath, but those faults are handled in the VMExit #PF intercept, rather than Xen's #PF handler.
>>
>> There is no need to disable interrupts (in spurious_page_fault()) for __page_fault_type() to look at the rsvd bit, nor should extable fixup be tolerated.
> I'm afraid I don't understand the connection of the first half of this to the patch - you don't alter spurious_page_fault() in this regard (at all, actually).

The disabling interrupts is in spurious_page_fault(). But the point is that there is no need to enter this logic at all for a reserved page fault.

> As to extable fixup, I'm not sure: If a reserved bit ends up slipping into the non-Xen parts of the page tables, and if guest accessors then become able to trip a corresponding #PF, the bug will need an XSA with the proposed change, while - afaict - it won't if the exception gets recovered from. (There may then still be log spam issue, I admit.)

We need to issue an XSA anyway because such a construct would be an L1TF gadget.

What this change does is make it substantially more obvious, and turns an information leak into a DoS.

>> @@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
>>      if ( unlikely(fixup_page_fault(addr, regs) != 0) )
>>          return;
>>
>> +    /*
>> +     * Xen have reserved bits in its pagetables, nor do we permit PV guests to
>> +     * write any. Such entries would be vulnerable to the L1TF sidechannel.
>> +     *
>> +     * The only logic which intentionally sets reserved bits is the shadow
>> +     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
>> +     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
>> +     * than here.
>> +     */
>> +    if ( error_code & PFEC_reserved_bit )
>> +        goto fatal;
> Judging from the description, wouldn't this then better go even further up, ahead of the fixup_page_fault() invocation? In fact the function has two PFEC_reserved_bit checks to _avoid_ taking action, which look like they could then be dropped.

Only for certain Xen-only fixup. The path into paging_fault() is not guarded. Depending on whether GNP is actually used for PV guests, this is where it would be fixed up.

~Andrew

On 19.05.2020 16:11, Andrew Cooper wrote:
> On 19/05/2020 09:34, Jan Beulich wrote:
>> On 18.05.2020 17:38, Andrew Cooper wrote:
>>> @@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
>>>      if ( unlikely(fixup_page_fault(addr, regs) != 0) )
>>>          return;
>>>
>>> +    /*
>>> +     * Xen have reserved bits in its pagetables, nor do we permit PV guests to
>>> +     * write any. Such entries would be vulnerable to the L1TF sidechannel.
>>> +     *
>>> +     * The only logic which intentionally sets reserved bits is the shadow
>>> +     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
>>> +     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
>>> +     * than here.
>> What about SH_L1E_MAGIC and sh_l1e_gnp()? The latter gets used by _sh_propagate() without visible restriction to HVM.
>
> SH_L1E_MAGIC looks to be redundant with SH_L1E_MMIO_MAGIC. sh_l1e_mmio() is the only path which ever creates an entry like that.
>
> sh_l1e_gnp() is a very well hidden use of reserved bits, but surely can't be used for PV guests, as there doesn't appear to be anything to turn the resulting fault back into a plain not-present.

Well, in this case the implied question remains: How does this fit with what _sh_propagate() does?

>> And of course every time I look at this code I wonder how we can get away with (quoting a comment) "We store 28 bits of GFN in bits 4:32 of the entry." Do we have a hidden restriction somewhere guaranteeing that guests won't have (emulated MMIO) GFNs above 1Tb when run in shadow mode?
>
> I've raised that several times before. Its broken.
>
> Given that shadow frames are limited to 44 bits anyway (and not yet levelled safely in the migration stream), my suggestion for fixing this was just to use one extra nibble for the extra 4 bits and call it done.

Would you remind(?) me of where this 44-bit restriction is coming from?

Jan

On 19.05.2020 16:29, Andrew Cooper wrote:
> On 19/05/2020 09:14, Jan Beulich wrote:
>> On 18.05.2020 17:38, Andrew Cooper wrote:
>>> The reserved_bit_page_fault() paths effectively turn reserved bit faults into a warning, but in the light of L1TF, the real impact is far more serious.
>>>
>>> Xen does not have any reserved bits set in its pagetables, nor do we permit PV guests to write any. An HVM shadow guest may have reserved bits via the MMIO fastpath, but those faults are handled in the VMExit #PF intercept, rather than Xen's #PF handler.
>>>
>>> There is no need to disable interrupts (in spurious_page_fault()) for __page_fault_type() to look at the rsvd bit, nor should extable fixup be tolerated.
>> I'm afraid I don't understand the connection of the first half of this to the patch - you don't alter spurious_page_fault() in this regard (at all, actually).
> The disabling interrupts is in spurious_page_fault(). But the point is that there is no need to enter this logic at all for a reserved page fault.
>
>> As to extable fixup, I'm not sure: If a reserved bit ends up slipping into the non-Xen parts of the page tables, and if guest accessors then become able to trip a corresponding #PF, the bug will need an XSA with the proposed change, while - afaict - it won't if the exception gets recovered from. (There may then still be log spam issue, I admit.)
> We need to issue an XSA anyway because such a construct would be an L1TF gadget.
>
> What this change does is make it substantially more obvious, and turns an information leak into a DoS.

For L1TF-affected hardware. For unaffected hardware it turns a possible (but not guaranteed) log spam DoS into a reliable crash.

>>> @@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
>>>      if ( unlikely(fixup_page_fault(addr, regs) != 0) )
>>>          return;
>>>
>>> +    /*
>>> +     * Xen have reserved bits in its pagetables, nor do we permit PV guests to
>>> +     * write any. Such entries would be vulnerable to the L1TF sidechannel.
>>> +     *
>>> +     * The only logic which intentionally sets reserved bits is the shadow
>>> +     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
>>> +     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
>>> +     * than here.
>>> +     */
>>> +    if ( error_code & PFEC_reserved_bit )
>>> +        goto fatal;
>> Judging from the description, wouldn't this then better go even further up, ahead of the fixup_page_fault() invocation? In fact the function has two PFEC_reserved_bit checks to _avoid_ taking action, which look like they could then be dropped.
>
> Only for certain Xen-only fixup. The path into paging_fault() is not guarded.

Hmm, yes indeed.

Jan

On 19/05/2020 15:48, Jan Beulich wrote:
> On 19.05.2020 16:11, Andrew Cooper wrote:
>> On 19/05/2020 09:34, Jan Beulich wrote:
>>> On 18.05.2020 17:38, Andrew Cooper wrote:
>>>> @@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
>>>>      if ( unlikely(fixup_page_fault(addr, regs) != 0) )
>>>>          return;
>>>>
>>>> +    /*
>>>> +     * Xen have reserved bits in its pagetables, nor do we permit PV guests to
>>>> +     * write any. Such entries would be vulnerable to the L1TF sidechannel.
>>>> +     *
>>>> +     * The only logic which intentionally sets reserved bits is the shadow
>>>> +     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
>>>> +     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
>>>> +     * than here.
>>> What about SH_L1E_MAGIC and sh_l1e_gnp()? The latter gets used by _sh_propagate() without visible restriction to HVM.
>> SH_L1E_MAGIC looks to be redundant with SH_L1E_MMIO_MAGIC. sh_l1e_mmio() is the only path which ever creates an entry like that.
>>
>> sh_l1e_gnp() is a very well hidden use of reserved bits, but surely can't be used for PV guests, as there doesn't appear to be anything to turn the resulting fault back into a plain not-present.
> Well, in this case the implied question remains: How does this fit with what _sh_propagate() does?

I'm in the process of investigating.

>>> And of course every time I look at this code I wonder how we can get away with (quoting a comment) "We store 28 bits of GFN in bits 4:32 of the entry." Do we have a hidden restriction somewhere guaranteeing that guests won't have (emulated MMIO) GFNs above 1Tb when run in shadow mode?
>> I've raised that several times before. Its broken.
>>
>> Given that shadow frames are limited to 44 bits anyway (and not yet levelled safely in the migration stream), my suggestion for fixing this was just to use one extra nibble for the extra 4 bits and call it done.
> Would you remind(?) me of where this 44-bit restriction is coming from?

From paging_max_paddr_bits(),

    /* Shadowed superpages store GFNs in 32-bit page_info fields. */

~Andrew

On 19/05/2020 15:55, Jan Beulich wrote:
> On 19.05.2020 16:29, Andrew Cooper wrote:
>> On 19/05/2020 09:14, Jan Beulich wrote:
>>> On 18.05.2020 17:38, Andrew Cooper wrote:
>>>> The reserved_bit_page_fault() paths effectively turn reserved bit faults into a warning, but in the light of L1TF, the real impact is far more serious.
>>>>
>>>> Xen does not have any reserved bits set in its pagetables, nor do we permit PV guests to write any. An HVM shadow guest may have reserved bits via the MMIO fastpath, but those faults are handled in the VMExit #PF intercept, rather than Xen's #PF handler.
>>>>
>>>> There is no need to disable interrupts (in spurious_page_fault()) for __page_fault_type() to look at the rsvd bit, nor should extable fixup be tolerated.
>>> I'm afraid I don't understand the connection of the first half of this to the patch - you don't alter spurious_page_fault() in this regard (at all, actually).
>> The disabling interrupts is in spurious_page_fault(). But the point is that there is no need to enter this logic at all for a reserved page fault.
>>
>>> As to extable fixup, I'm not sure: If a reserved bit ends up slipping into the non-Xen parts of the page tables, and if guest accessors then become able to trip a corresponding #PF, the bug will need an XSA with the proposed change, while - afaict - it won't if the exception gets recovered from. (There may then still be log spam issue, I admit.)
>> We need to issue an XSA anyway because such a construct would be an L1TF gadget.
>>
>> What this change does is make it substantially more obvious, and turns an information leak into a DoS.
> For L1TF-affected hardware. For unaffected hardware it turns a possible (but not guaranteed) log spam DoS into a reliable crash.

It represents unexpected corruption of our most critical security resource in the processor. Obviously we need to account for any legitimate uses Xen has of reserved bits (so far maybe GNP for PV guests), but BUG()-like behaviour *is* the response appropriate to the severity of finding corrupt PTEs.

~Andrew

On 19.05.2020 17:33, Andrew Cooper wrote:
> On 19/05/2020 15:48, Jan Beulich wrote:
>> On 19.05.2020 16:11, Andrew Cooper wrote:
>>> Given that shadow frames are limited to 44 bits anyway (and not yet levelled safely in the migration stream), my suggestion for fixing this was just to use one extra nibble for the extra 4 bits and call it done.
>> Would you remind(?) me of where this 44-bit restriction is coming from?
>
> From paging_max_paddr_bits(),
>
>     /* Shadowed superpages store GFNs in 32-bit page_info fields. */

Ah, that's an abuse of the backlink field. After some looking around I first thought the up field could be used to store the GFN instead, as it's supposedly used for single-page shadows only. Then however I found

static inline int sh_type_has_up_pointer(struct domain *d, unsigned int t)
{
    /* Multi-page shadows don't have up-pointers */
    if ( t == SH_type_l1_32_shadow
         || t == SH_type_fl1_32_shadow
         || t == SH_type_l2_32_shadow )
        return 0;
    /* Pinnable shadows don't have up-pointers either */
    return !sh_type_is_pinnable(d, t);
}

It's unclear to me in which way SH_type_l1_32_shadow and SH_type_l2_32_shadow are "multi-page" shadows; I'd rather have expected all three SH_type_fl1_*_shadow to be. Tim?

In any event there would be 12 bits to reclaim from the up pointer - it being a physical address, there'll not be more than 52 significant bits.

Jan

On 19/05/2020 17:09, Jan Beulich wrote:
> On 19.05.2020 17:33, Andrew Cooper wrote:
>> On 19/05/2020 15:48, Jan Beulich wrote:
>>> On 19.05.2020 16:11, Andrew Cooper wrote:
>>>> Given that shadow frames are limited to 44 bits anyway (and not yet levelled safely in the migration stream), my suggestion for fixing this was just to use one extra nibble for the extra 4 bits and call it done.
>>> Would you remind(?) me of where this 44-bit restriction is coming from?
>> From paging_max_paddr_bits(),
>>
>>     /* Shadowed superpages store GFNs in 32-bit page_info fields. */
> Ah, that's an abuse of the backlink field. After some looking around I first thought the up field could be used to store the GFN instead, as it's supposedly used for single-page shadows only. Then however I found
>
> static inline int sh_type_has_up_pointer(struct domain *d, unsigned int t)
> {
>     /* Multi-page shadows don't have up-pointers */
>     if ( t == SH_type_l1_32_shadow
>          || t == SH_type_fl1_32_shadow
>          || t == SH_type_l2_32_shadow )
>         return 0;
>     /* Pinnable shadows don't have up-pointers either */
>     return !sh_type_is_pinnable(d, t);
> }
>
> It's unclear to me in which way SH_type_l1_32_shadow and SH_type_l2_32_shadow are "multi-page" shadows; I'd rather have expected all three SH_type_fl1_*_shadow to be. Tim?

I suspect the comment is incomplete, and should include "4k shadows don't have up-pointers".

> In any event there would be 12 bits to reclaim from the up pointer - it being a physical address, there'll not be more than 52 significant bits.

Right, but for L1TF safety, the address bits in the PTE must not be cacheable.

Currently, on fully populated multi-socket servers, the MMIO fastpath relies on the top 4G of address space not being cacheable, which is the safest we can reasonably manage. Extending this by a nibble takes us to 16G which is not meaningfully less safe.

~Andrew

At 18:09 +0200 on 19 May (1589911795), Jan Beulich wrote:
> static inline int sh_type_has_up_pointer(struct domain *d, unsigned int t)
> {
>     /* Multi-page shadows don't have up-pointers */
>     if ( t == SH_type_l1_32_shadow
>          || t == SH_type_fl1_32_shadow
>          || t == SH_type_l2_32_shadow )
>         return 0;
>     /* Pinnable shadows don't have up-pointers either */
>     return !sh_type_is_pinnable(d, t);
> }
>
> It's unclear to me in which way SH_type_l1_32_shadow and SH_type_l2_32_shadow are "multi-page" shadows; I'd rather have expected all three SH_type_fl1_*_shadow to be. Tim?

They are multi-page in the sense that the shadow itself is more than a page long (because it needs to have 1024 8-byte entries). The FL1 shadows are the same size as their equivalent L1s.

Tim.

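Tim's sizing argument can be checked with a little arithmetic. A minimal sketch, assuming 4 KiB pages and 8-byte shadow entries; the macro names are illustrative, not Xen's:

```c
#include <assert.h>

/*
 * A 2-level 32-bit guest pagetable page holds 1024 entries.  Shadowing it
 * with 8-byte (PAE/64-bit format) entries therefore needs 8 KiB, i.e. the
 * shadow spans two pages, which is why the l1_32/fl1_32/l2_32 shadow types
 * are "multi-page" and cannot keep a single up-pointer.
 */
#define PAGE_SIZE            4096
#define GUEST_32BIT_ENTRIES  1024   /* entries per 32-bit guest pagetable page */
#define SHADOW_ENTRY_SIZE    8      /* shadow entries are always 8 bytes */

static_assert(GUEST_32BIT_ENTRIES * SHADOW_ENTRY_SIZE == 2 * PAGE_SIZE,
              "a 32-bit l1/l2 (or fl1) shadow occupies two pages");

int main(void) { return 0; }
```
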
On 19.05.2020 20:00, Andrew Cooper wrote:
> On 19/05/2020 17:09, Jan Beulich wrote:
>> In any event there would be 12 bits to reclaim from the up pointer - it being a physical address, there'll not be more than 52 significant bits.
>
> Right, but for L1TF safety, the address bits in the PTE must not be cacheable.

So if I understand this right, your response was only indirectly related to what I said: You mean that no matter whether we find a way to store full-width GFNs, SH_L1E_MMIO_MAGIC can't have arbitrarily many set bits dropped. On L1TF vulnerable hardware, that is (i.e. in principle the constant could become a variable to be determined at boot).

> Currently, on fully populated multi-socket servers, the MMIO fastpath relies on the top 4G of address space not being cacheable, which is the safest we can reasonably manage. Extending this by a nibble takes us to 16G which is not meaningfully less safe.

That's 64G (36 address bits), isn't it? Looking at l1tf_calculations(), I'd be worried in particular Penryn / Dunnington might not support more than 36 address bits (I don't think I have anywhere to check). Even if it was 38, 39, or 40 bits, 64G becomes a not insignificant part of the overall 256G / 512G / 1T address space. Then again the top quarter assumption in l1tf_calculations() would still be met in this latter case.

Jan

On 20/05/2020 08:48, Jan Beulich wrote:
> On 19.05.2020 20:00, Andrew Cooper wrote:
>> On 19/05/2020 17:09, Jan Beulich wrote:
>>> In any event there would be 12 bits to reclaim from the up pointer - it being a physical address, there'll not be more than 52 significant bits.
>> Right, but for L1TF safety, the address bits in the PTE must not be cacheable.
> So if I understand this right, your response was only indirectly related to what I said: You mean that no matter whether we find a way to store full-width GFNs, SH_L1E_MMIO_MAGIC can't have arbitrarily many set bits dropped.

Yes

> On L1TF vulnerable hardware, that is (i.e. in principle the constant could become a variable to be determined at boot).

The only thing which can usefully be done at runtime disable the fastpath. If cacheable memory overlaps with the used address bits, there are no safe values to use.

>> Currently, on fully populated multi-socket servers, the MMIO fastpath relies on the top 4G of address space not being cacheable, which is the safest we can reasonably manage. Extending this by a nibble takes us to 16G which is not meaningfully less safe.
> That's 64G (36 address bits), isn't it?

Yes it is. I can't count.

> Looking at l1tf_calculations(), I'd be worried in particular Penryn / Dunnington might not support more than 36 address bits (I don't think I have anywhere to check). Even if it was 38, 39, or 40 bits, 64G becomes a not insignificant part of the overall 256G / 512G / 1T address space. Then again the top quarter assumption in l1tf_calculations() would still be met in this latter case.

I'm honestly not too worried. Intel has ceased supporting anything older than SandyBridge, and there are other unfixed speculative security issues. Anyone using these processors has bigger problems.

~Andrew

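For reference, the figures traded back and forth above work out as follows. This is a small standalone sketch; the bit positions are the assumed layout taken from the "28 bits of GFN in bits 4:32" comment quoted earlier, not actual Xen definitions:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t gfn_bits_now = 28;   /* GFN in PTE bits 4..31, magic bits forced set from bit 32 */
    uint64_t gfn_bits_new = 32;   /* one extra nibble, as suggested in the thread */

    /* Highest guest-physical address representable: GFN count * 4 KiB pages. */
    printf("GFN limit now:  %llu GiB\n",
           (unsigned long long)(((1ULL << gfn_bits_now) * 4096) >> 30));   /* 1024 GiB = 1 TiB */
    printf("GFN limit new:  %llu GiB\n",
           (unsigned long long)(((1ULL << gfn_bits_new) * 4096) >> 30));   /* 16384 GiB = 16 TiB */

    /*
     * L1TF safety relies on the forced-set high bits aliasing addresses which
     * can never be cacheable RAM: with the magic starting at bit 32 that is
     * the top 4 GiB of the usable range; starting at bit 36 it becomes the
     * top 64 GiB, hence the correction from 16G to 64G above.
     */
    printf("alias span now: %llu GiB\n", (unsigned long long)((1ULL << 32) >> 30));  /* 4 */
    printf("alias span new: %llu GiB\n", (unsigned long long)((1ULL << 36) >> 30));  /* 64 */
    return 0;
}
```
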
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 73c6218660..4f8e3c7a32 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1137,15 +1137,6 @@ void do_int3(struct cpu_user_regs *regs)
         pv_inject_hw_exception(TRAP_int3, X86_EVENT_NO_EC);
 }
 
-static void reserved_bit_page_fault(unsigned long addr,
-                                    struct cpu_user_regs *regs)
-{
-    printk("%pv: reserved bit in page table (ec=%04X)\n",
-           current, regs->error_code);
-    show_page_walk(addr);
-    show_execution_state(regs);
-}
-
 #ifdef CONFIG_PV
 static int handle_ldt_mapping_fault(unsigned int offset,
                                     struct cpu_user_regs *regs)
@@ -1248,10 +1239,6 @@ static enum pf_type __page_fault_type(unsigned long addr,
     if ( in_irq() )
         return real_fault;
 
-    /* Reserved bit violations are never spurious faults. */
-    if ( error_code & PFEC_reserved_bit )
-        return real_fault;
-
     required_flags = _PAGE_PRESENT;
     if ( error_code & PFEC_write_access )
         required_flags |= _PAGE_RW;
@@ -1413,14 +1400,6 @@ static int fixup_page_fault(unsigned long addr, struct cpu_user_regs *regs)
     return 0;
 }
 
-/*
- * #PF error code:
- *  Bit 0: Protection violation (=1) ; Page not present (=0)
- *  Bit 1: Write access
- *  Bit 2: User mode (=1) ; Supervisor mode (=0)
- *  Bit 3: Reserved bit violation
- *  Bit 4: Instruction fetch
- */
 void do_page_fault(struct cpu_user_regs *regs)
 {
     unsigned long addr, fixup;
@@ -1439,6 +1418,18 @@ void do_page_fault(struct cpu_user_regs *regs)
     if ( unlikely(fixup_page_fault(addr, regs) != 0) )
         return;
 
+    /*
+     * Xen have reserved bits in its pagetables, nor do we permit PV guests to
+     * write any. Such entries would be vulnerable to the L1TF sidechannel.
+     *
+     * The only logic which intentionally sets reserved bits is the shadow
+     * MMIO fastpath (SH_L1E_MMIO_*), which is careful not to be
+     * L1TF-vulnerable, and handled via the VMExit #PF intercept path, rather
+     * than here.
+     */
+    if ( error_code & PFEC_reserved_bit )
+        goto fatal;
+
     if ( unlikely(!guest_mode(regs)) )
     {
         enum pf_type pf_type = spurious_page_fault(addr, regs);
@@ -1457,13 +1448,12 @@ void do_page_fault(struct cpu_user_regs *regs)
         if ( likely((fixup = search_exception_table(regs)) != 0) )
         {
             perfc_incr(copy_user_faults);
-            if ( unlikely(regs->error_code & PFEC_reserved_bit) )
-                reserved_bit_page_fault(addr, regs);
             this_cpu(last_extable_addr) = regs->rip;
             regs->rip = fixup;
             return;
         }
 
+ fatal:
         if ( debugger_trap_fatal(TRAP_page_fault, regs) )
             return;
 
@@ -1475,9 +1465,6 @@ void do_page_fault(struct cpu_user_regs *regs)
               error_code, _p(addr));
     }
 
-    if ( unlikely(regs->error_code & PFEC_reserved_bit) )
-        reserved_bit_page_fault(addr, regs);
-
     pv_inject_page_fault(regs->error_code, addr);
 }

The reserved_bit_page_fault() paths effectively turn reserved bit faults into a warning, but in the light of L1TF, the real impact is far more serious.

Xen does not have any reserved bits set in its pagetables, nor do we permit PV guests to write any. An HVM shadow guest may have reserved bits via the MMIO fastpath, but those faults are handled in the VMExit #PF intercept, rather than Xen's #PF handler.

There is no need to disable interrupts (in spurious_page_fault()) for __page_fault_type() to look at the rsvd bit, nor should extable fixup be tolerated.

Make #PF[Rsvd] a hard error, irrespective of mode. Any new panic() caused by this constitutes an L1TF gadget needing fixing.

Additionally, drop the comment for do_page_fault(). It is inaccurate (bit 0 being set isn't always a protection violation) and stale (missing bits 5,6,15,31).

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/traps.c | 39 +++++++++++++--------------------------
 1 file changed, 13 insertions(+), 26 deletions(-)