[v2,2/2] x86/xpti: Don't leak TSS-adjacent percpu data via Meltdown

Message ID 20190726203222.4833-3-andrew.cooper3@citrix.com (mailing list archive)
State Superseded
Series x86/xpti: Don't leak TSS-adjacent percpu data via Meltdown

Commit Message

Andrew Cooper July 26, 2019, 8:32 p.m. UTC
The XPTI work restricted the visibility of most of memory, but missed a few
aspects when it came to the TSS.

Given that the TSS is just an object in percpu data, the 4k mapping for it
created in setup_cpu_root_pgt() maps adjacent percpu data, making it all
leakable via Meltdown, even when XPTI is in use.

Furthermore, no care is taken to check that the TSS doesn't cross a page
boundary.  As it turns out, struct tss_struct is aligned on its size which
does prevent it straddling a page boundary, but this will cease to be true
once CET and Shadow Stack support is added to Xen.

Move the TSS into the page aligned percpu area, so no adjacent data can be
leaked.  Move the definition from setup.c to traps.c, which is a more
appropriate place for it to live.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * Rebase over changes to include __aligned() within
   DEFINE_PER_CPU_PAGE_ALIGNED()
 * Drop now-unused xen/percpu.h from setup.c
---
 xen/arch/x86/setup.c            | 3 ---
 xen/arch/x86/traps.c            | 6 ++++++
 xen/arch/x86/xen.lds.S          | 2 ++
 xen/include/asm-x86/processor.h | 4 ++--
 4 files changed, 10 insertions(+), 5 deletions(-)
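
In sketch form, the core of the change is simply where the per-cpu object
gets defined; consumers keep reaching it through the usual per-cpu
accessors.  (Illustrative only, based on the hunks below and on patch 1 of
the series; example_use() is a made-up caller.)

    /* New home of the definition (traps.c hunk below): the macro from
     * patch 1 places the object in .bss.percpu.page_aligned with
     * PAGE_SIZE alignment. */
    DEFINE_PER_CPU_PAGE_ALIGNED(struct tss_struct, init_tss);

    static void example_use(unsigned long stack_top)
    {
        /* Existing consumers are untouched; they keep using the normal
         * per-cpu accessors.  (Field name taken from the structure in
         * the patch below.) */
        struct tss_struct *tss = &this_cpu(init_tss);

        tss->rsp0 = stack_top;
    }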

Comments

Roger Pau Monné July 29, 2019, 8:53 a.m. UTC | #1
On Fri, Jul 26, 2019 at 09:32:22PM +0100, Andrew Cooper wrote:
> The XPTI work restricted the visibility of most of memory, but missed a few
> aspects when it came to the TSS.
> 
> Given that the TSS is just an object in percpu data, the 4k mapping for it
> created in setup_cpu_root_pgt() maps adjacent percpu data, making it all
> leakable via Meltdown, even when XPTI is in use.
> 
> Furthermore, no care is taken to check that the TSS doesn't cross a page
> boundary.  As it turns out, struct tss_struct is aligned on its size which
> does prevent it straddling a page boundary, but this will cease to be true
> once CET and Shadow Stack support is added to Xen.
> 
> Move the TSS into the page aligned percpu area, so no adjacent data can be
> leaked.  Move the definition from setup.c to traps.c, which is a more
> appropriate place for it to live.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.
Jan Beulich July 29, 2019, 1:51 p.m. UTC | #2
On 26.07.2019 22:32, Andrew Cooper wrote:
> The XPTI work restricted the visibility of most of memory, but missed a few
> aspects when it came to the TSS.

None of these were "missed" afair - we'd been aware, and accepted things
to be the way they are now for the first step. Remember that at the time
XPTI was called "XPTI light", in anticipation of this being just a
temporary solution.

> Given that the TSS is just an object in percpu data, the 4k mapping for it
> created in setup_cpu_root_pgt() maps adjacent percpu data, making it all
> leakable via Meltdown, even when XPTI is in use.
> 
> Furthermore, no care is taken to check that the TSS doesn't cross a page
> boundary.  As it turns out, struct tss_struct is aligned on its size which
> does prevent it straddling a page boundary, but this will cease to be true
> once CET and Shadow Stack support is added to Xen.

Please can you point me at the CET aspect in documentation here? Aiui
it's only task switches which are affected, and hence only 32-bit TSSes
which would grow (and even then not enough to exceed 128 bytes). For
the purposes 64-bit has, there are MSRs to load SSP from.

> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -411,7 +411,7 @@ static always_inline void __mwait(unsigned long eax, unsigned long ecx)
>   #define IOBMP_BYTES             8192
>   #define IOBMP_INVALID_OFFSET    0x8000
>   
> -struct __packed __cacheline_aligned tss_struct {
> +struct __packed tss_struct {
>       uint32_t :32;
>       uint64_t rsp0, rsp1, rsp2;
>       uint64_t :64;
> @@ -425,6 +425,7 @@ struct __packed __cacheline_aligned tss_struct {
>       /* Pads the TSS to be cacheline-aligned (total size is 0x80). */
>       uint8_t __cacheline_filler[24];
>   };
> +DECLARE_PER_CPU(struct tss_struct, init_tss);

Taking patch 1 this expands to

     __DEFINE_PER_CPU(__section(".bss.percpu.page_aligned") \
                      __aligned(PAGE_SIZE), struct tss_struct, _init_tss);

and then

     __section(".bss.percpu.page_aligned") __aligned(PAGE_SIZE)
     __typeof__(struct tss_struct) per_cpu__init_tss;

which is not what you want: You have an object of size
sizeof(struct tss_struct) which is PAGE_SIZE aligned. Afaict you
therefore still leak everything that follows in the same page. There
was a reason for __cacheline_aligned's original placement, albeit I
agree that it was/is against the intention of having the struct
define an interface to the hardware (which doesn't have such an
alignment requirement). Perhaps the solution is a two-layer
approach:

struct __aligned(PAGE_SIZE) xen_tss {
     struct __packed tss_struct {
         ...
     };
};

where the inner structure describes the hardware interface and the
containing one our own requirement(s). But personally I also
wouldn't mind putting the __aligned(PAGE_SIZE) right on struct
tss_struct, where __cacheline_aligned has been sitting.

Of course either approach goes against the idea of avoiding usage
mistakes (as pointed out by others in the v1 discussion, iirc):
There better wouldn't be a need to get the two "page aligned"
attributes in sync, i.e. the instantiation of the structure
would better enforce the requested alignment. I've not thought
through whether there's trickery to actually make this work, but
I'd hope we could at the very least detect things not being in
sync at compile time.

Jan
Andrew Cooper July 29, 2019, 3:55 p.m. UTC | #3
On 29/07/2019 14:51, Jan Beulich wrote:
> On 26.07.2019 22:32, Andrew Cooper wrote:
>> The XPTI work restricted the visibility of most of memory, but missed a few
>> aspects when it came to the TSS.
> None of these were "missed" afair - we'd been aware, and accepted things
> to be the way they are now for the first step. Remember that at the time
> XPTI was called "XPTI light", in anticipation of this being just a
> temporary solution.

Did the term "XPTI light" survive past the first RFC posting?

Sure - we did things in an incremental way because it was a technically
complex change and Meltdown was out in the wild at the time.

However, I would have fixed this at the same time as .entry.text if I
had noticed, because the purpose of that series was identical to this
series - avoid leaking things we don't absolutely need to leak.

>> Given that the TSS is just an object in percpu data, the 4k mapping for it
>> created in setup_cpu_root_pgt() maps adjacent percpu data, making it all
>> leakable via Meltdown, even when XPTI is in use.
>>
>> Furthermore, no care is taken to check that the TSS doesn't cross a page
>> boundary.  As it turns out, struct tss_struct is aligned on its size which
>> does prevent it straddling a page boundary, but this will cease to be true
>> once CET and Shadow Stack support is added to Xen.
> Please can you point me at the CET aspect in documentation here? Aiui
> it's only task switches which are affected, and hence only 32-bit TSSes
> which would grow (and even then not enough to exceed 128 bytes). For
> the purposes 64-bit has, there are MSRs to load SSP from.

Ah - it was v1 of the CET spec.  I see v3 no longer has the shadow stack
pointer in the TSS.

I'll drop this part of the message.

>> --- a/xen/include/asm-x86/processor.h
>> +++ b/xen/include/asm-x86/processor.h
>> @@ -411,7 +411,7 @@ static always_inline void __mwait(unsigned long eax, unsigned long ecx)
>>   #define IOBMP_BYTES             8192
>>   #define IOBMP_INVALID_OFFSET    0x8000
>>   
>> -struct __packed __cacheline_aligned tss_struct {
>> +struct __packed tss_struct {
>>       uint32_t :32;
>>       uint64_t rsp0, rsp1, rsp2;
>>       uint64_t :64;
>> @@ -425,6 +425,7 @@ struct __packed __cacheline_aligned tss_struct {
>>       /* Pads the TSS to be cacheline-aligned (total size is 0x80). */
>>       uint8_t __cacheline_filler[24];
>>   };
>> +DECLARE_PER_CPU(struct tss_struct, init_tss);
> Taking patch 1 this expands to
>
>      __DEFINE_PER_CPU(__section(".bss.percpu.page_aligned") \
>                       __aligned(PAGE_SIZE), struct tss_struct, _init_tss);
>
> and then
>
>      __section(".bss.percpu.page_aligned") __aligned(PAGE_SIZE)
>      __typeof__(struct tss_struct) per_cpu__init_tss;
>
> which is not what you want: You have an object of size
> sizeof(struct tss_struct) which is PAGE_SIZE aligned. Afaict you
> therefore still leak everything that follows in the same page.

What data might this be?

Every object put into this section is suitably aligned, so nothing will
sit in the slack between the TSS and the end of the page.

> There was a reason for __cacheline_aligned's original placement, albeit I
> agree that it was/is against the intention of having the struct
> define an interface to the hardware (which doesn't have such an
> alignment requirement).

There is a hard requirement to have the first 104 bytes be physically
contiguous, because on a task switch, some CPUs translate the TSS base
and offset directly from there.

I expect that is where the __cacheline_aligned(), being 128, comes in.

However, the manual also makes it clear that this is only on a task
switch, which is inapplicable for us.
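
For reference, those 104 bytes are the architectural 64-bit TSS, which the
structure in the patch mirrors field for field:

    /* Architectural 64-bit TSS layout, 104 (0x68) bytes in total:
     *   0x00  reserved                 4 bytes
     *   0x04  RSP0, RSP1, RSP2         3 x 8 bytes
     *   0x1c  reserved                 8 bytes
     *   0x24  IST1..IST7               7 x 8 bytes
     *   0x5c  reserved                 10 bytes
     *   0x66  I/O map base address     2 bytes
     * Padding that out to 0x80 is what __cacheline_filler[24] (and
     * previously the __cacheline_aligned attribute) provides. */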

Finally, were we to put a structure like this on the stack (e.g. like
hvm_task_switch() does with tss32), we specifically wouldn't want any
unnecessary alignment.

> Perhaps the solution is a two-layer approach:
>
> struct __aligned(PAGE_SIZE) xen_tss {
>      struct __packed tss_struct {
>          ...
>      };
> };
>
> where the inner structure describes the hardware interface and the
> containing one our own requirement(s). But personally I also
> wouldn't mind putting the __aligned(PAGE_SIZE) right on struct
> tss_struct, where __cacheline_aligned has been sitting.

The only way that would make things more robust is if xen_tss was a
union with char[4096] to extend its size.
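
For concreteness, the kind of union meant here might look roughly like
this (a sketch; the xen_tss name and its pairing with the per-cpu macro
are illustrative, not something the patch introduces):

    union xen_tss {
        struct tss_struct tss;
        char pad[PAGE_SIZE];   /* forces sizeof(union xen_tss) == PAGE_SIZE */
    };
    DEFINE_PER_CPU_PAGE_ALIGNED(union xen_tss, init_tss);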

However, I think this is overkill, given the internals of
DEFINE_PER_CPU_PAGE_ALIGNED()

> Of course either approach goes against the idea of avoiding usage
> mistakes (as pointed out by others in the v1 discussion, iirc):
> There better wouldn't be a need to get the two "page aligned"
> attributes in sync, i.e. the instantiation of the structure
> would better enforce the requested alignment. I've not thought
> through whether there's trickery to actually make this work, but
> I'd hope we could at the very least detect things not being in
> sync at compile time.

There is a reason why I put in a linker assertion to catch the TSS being
non-aligned.

~Andrew
Andrew Cooper July 29, 2019, 4:02 p.m. UTC | #4
On 29/07/2019 16:55, Andrew Cooper wrote:
>>> --- a/xen/include/asm-x86/processor.h
>>> +++ b/xen/include/asm-x86/processor.h
>>> @@ -411,7 +411,7 @@ static always_inline void __mwait(unsigned long eax, unsigned long ecx)
>>>   #define IOBMP_BYTES             8192
>>>   #define IOBMP_INVALID_OFFSET    0x8000
>>>   
>>> -struct __packed __cacheline_aligned tss_struct {
>>> +struct __packed tss_struct {
>>>       uint32_t :32;
>>>       uint64_t rsp0, rsp1, rsp2;
>>>       uint64_t :64;
>>> @@ -425,6 +425,7 @@ struct __packed __cacheline_aligned tss_struct {
>>>       /* Pads the TSS to be cacheline-aligned (total size is 0x80). */
>>>       uint8_t __cacheline_filler[24];
>>>   };
>>> +DECLARE_PER_CPU(struct tss_struct, init_tss);
>> Taking patch 1 this expands to
>>
>>      __DEFINE_PER_CPU(__section(".bss.percpu.page_aligned") \
>>                       __aligned(PAGE_SIZE), struct tss_struct, _init_tss);
>>
>> and then
>>
>>      __section(".bss.percpu.page_aligned") __aligned(PAGE_SIZE)
>>      __typeof__(struct tss_struct) per_cpu__init_tss;
>>
>> which is not what you want: You have an object of size
>> sizeof(struct tss_struct) which is PAGE_SIZE aligned. Afaict you
>> therefore still leak everything that follows in the same page.
> What data might this be?
>
> Every object put into this section is suitably aligned, so nothing will
> sit in the slack between the TSS and the end of the page.

And of course, I should have actually checked...

ffff82d08092e000 B per_cpu__init_tss
ffff82d08092e000 B __per_cpu_start
ffff82d08092e080 B per_cpu__cpupool
ffff82d08092e088 b per_cpu__continue_info

This needs fixing with suitable alignment in the linker script.
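
One possible shape for that (sketch only - the exact contents and ordering
of the .bss output section in xen.lds.S are assumed here, not copied from
the tree):

    . = ALIGN(PAGE_SIZE);
    __per_cpu_start = .;
    *(.bss.percpu.page_aligned)
    . = ALIGN(PAGE_SIZE);      /* keep .bss.percpu out of the TSS page */
    *(.bss.percpu)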

~Andrew
Jan Beulich July 29, 2019, 4:07 p.m. UTC | #5
On 29.07.2019 17:55, Andrew Cooper wrote:
> On 29/07/2019 14:51, Jan Beulich wrote:
>> On 26.07.2019 22:32, Andrew Cooper wrote:
>>> --- a/xen/include/asm-x86/processor.h
>>> +++ b/xen/include/asm-x86/processor.h
>>> @@ -411,7 +411,7 @@ static always_inline void __mwait(unsigned long eax, unsigned long ecx)
>>>    #define IOBMP_BYTES             8192
>>>    #define IOBMP_INVALID_OFFSET    0x8000
>>>    
>>> -struct __packed __cacheline_aligned tss_struct {
>>> +struct __packed tss_struct {
>>>        uint32_t :32;
>>>        uint64_t rsp0, rsp1, rsp2;
>>>        uint64_t :64;
>>> @@ -425,6 +425,7 @@ struct __packed __cacheline_aligned tss_struct {
>>>        /* Pads the TSS to be cacheline-aligned (total size is 0x80). */
>>>        uint8_t __cacheline_filler[24];
>>>    };
>>> +DECLARE_PER_CPU(struct tss_struct, init_tss);
>> Taking patch 1 this expands to
>>
>>       __DEFINE_PER_CPU(__section(".bss.percpu.page_aligned") \
>>                        __aligned(PAGE_SIZE), struct tss_struct, _init_tss);
>>
>> and then
>>
>>       __section(".bss.percpu.page_aligned") __aligned(PAGE_SIZE)
>>       __typeof__(struct tss_struct) per_cpu__init_tss;
>>
>> which is not what you want: You have an object of size
>> sizeof(struct tss_struct) which is PAGE_SIZE aligned. Afaict you
>> therefore still leak everything that follows in the same page.
> 
> What data might this be?

Whatever sits early enough in .bss.percpu. There's no page size
alignment enforced between the two sections, ...

> Every object put into this section is suitably aligned, so nothing will
> sit in the slack between the TSS and the end of the page.

... so for the object being last in .bss.percpu.page_aligned it
is its size that matters, not its alignment.

>> Perhaps the solution is a two-layer approach:
>>
>> struct __aligned(PAGE_SIZE) xen_tss {
>>       struct __packed tss_struct {
>>           ...
>>       };
>> };
>>
>> where the inner structure describes the hardware interface and the
>> containing one our own requirement(s). But personally I also
>> wouldn't mind putting the __aligned(PAGE_SIZE) right on struct
>> tss_struct, where __cacheline_aligned has been sitting.
> 
> The only way that would make things more robust is if xen_tss was a
> union with char[4096] to extend its size.
> 
> However, I think this is overkill, given the internals of
> DEFINE_PER_CPU_PAGE_ALIGNED()
> 
>> Of course either approach goes against the idea of avoiding usage
>> mistakes (as pointed out by others in the v1 discussion, iirc):
>> There better wouldn't be a need to get the two "page aligned"
>> attributes in sync, i.e. the instantiation of the structure
>> would better enforce the requested alignment. I've not thought
>> through whether there's trickery to actually make this work, but
>> I'd hope we could at the very least detect things not being in
>> sync at compile time.
> 
> There is a reason why I put in a linker assertion to catch the TSS being
> non-aligned.

Hmm, I indeed didn't have that one on my radar when writing the reply.
However, a compile time diagnostic was what I would prefer: Having to
add linker assertions for each and every object we put in this new
section wouldn't really be nice, even if right now we certainly hope
for this to be the only such object.
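
A minimal sketch of what such a diagnostic could look like, assuming C11
_Static_assert() is acceptable at file scope and reusing the macro shape
from patch 1 (the particular condition is only an example of the kind of
property one might enforce):

    #define DEFINE_PER_CPU_PAGE_ALIGNED(type, name)                       \
        _Static_assert(sizeof(type) <= PAGE_SIZE,                         \
                       #type " does not fit in a single page");           \
        __DEFINE_PER_CPU(__section(".bss.percpu.page_aligned")            \
                         __aligned(PAGE_SIZE), type, _ ## name)

That way every object placed in the section gets checked once, at its
definition, rather than needing a dedicated linker assertion per object.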

Jan

Patch

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index d2011910fa..f9d38155d3 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -16,7 +16,6 @@ 
 #include <xen/domain_page.h>
 #include <xen/version.h>
 #include <xen/gdbstub.h>
-#include <xen/percpu.h>
 #include <xen/hypercall.h>
 #include <xen/keyhandler.h>
 #include <xen/numa.h>
@@ -100,8 +99,6 @@  unsigned long __read_mostly xen_phys_start;
 
 unsigned long __read_mostly xen_virt_end;
 
-DEFINE_PER_CPU(struct tss_struct, init_tss);
-
 char __section(".bss.stack_aligned") __aligned(STACK_SIZE)
     cpu0_stack[STACK_SIZE];
 
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 38d12013db..de3ac135f5 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -108,6 +108,12 @@  idt_entry_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
 /* Pointer to the IDT of every CPU. */
 idt_entry_t *idt_tables[NR_CPUS] __read_mostly;
 
+/*
+ * The TSS is smaller than a page, but we give it a full page to avoid
+ * adjacent per-cpu data leaking via Meltdown when XPTI is in use.
+ */
+DEFINE_PER_CPU_PAGE_ALIGNED(struct tss_struct, init_tss);
+
 bool (*ioemul_handle_quirk)(
     u8 opcode, char *io_emul_stub, struct cpu_user_regs *regs);
 
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index b8a2ea4259..c82e1e504a 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -368,6 +368,8 @@  ASSERT(IS_ALIGNED(__2M_rwdata_end,   SECTION_ALIGN), "__2M_rwdata_end misaligned
 
 ASSERT(IS_ALIGNED(cpu0_stack, STACK_SIZE), "cpu0_stack misaligned")
 
+ASSERT(IS_ALIGNED(per_cpu__init_tss, PAGE_SIZE), "per_cpu(init_tss) misaligned")
+
 ASSERT(IS_ALIGNED(__init_begin, PAGE_SIZE), "__init_begin misaligned")
 ASSERT(IS_ALIGNED(__init_end,   PAGE_SIZE), "__init_end misaligned")
 
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 2862321eee..b5bee94931 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -411,7 +411,7 @@  static always_inline void __mwait(unsigned long eax, unsigned long ecx)
 #define IOBMP_BYTES             8192
 #define IOBMP_INVALID_OFFSET    0x8000
 
-struct __packed __cacheline_aligned tss_struct {
+struct __packed tss_struct {
     uint32_t :32;
     uint64_t rsp0, rsp1, rsp2;
     uint64_t :64;
@@ -425,6 +425,7 @@  struct __packed __cacheline_aligned tss_struct {
     /* Pads the TSS to be cacheline-aligned (total size is 0x80). */
     uint8_t __cacheline_filler[24];
 };
+DECLARE_PER_CPU(struct tss_struct, init_tss);
 
 #define IST_NONE 0UL
 #define IST_DF   1UL
@@ -463,7 +464,6 @@  static inline void disable_each_ist(idt_entry_t *idt)
 extern idt_entry_t idt_table[];
 extern idt_entry_t *idt_tables[];
 
-DECLARE_PER_CPU(struct tss_struct, init_tss);
 DECLARE_PER_CPU(root_pgentry_t *, root_pgt);
 
 extern void write_ptbase(struct vcpu *v);