| Message ID | 20190726203222.4833-3-andrew.cooper3@citrix.com |
|---|---|
| State | Superseded |
| Series | x86/xpti: Don't leak TSS-adjacent percpu data via Meltdown |
On Fri, Jul 26, 2019 at 09:32:22PM +0100, Andrew Cooper wrote:
> The XPTI work restricted the visibility of most of memory, but missed a few
> aspects when it came to the TSS.
>
> Given that the TSS is just an object in percpu data, the 4k mapping for it
> created in setup_cpu_root_pgt() maps adjacent percpu data, making it all
> leakable via Meltdown, even when XPTI is in use.
>
> Furthermore, no care is taken to check that the TSS doesn't cross a page
> boundary. As it turns out, struct tss_struct is aligned on its size which
> does prevent it straddling a page boundary, but this will cease to be true
> once CET and Shadow Stack support is added to Xen.
>
> Move the TSS into the page aligned percpu area, so no adjacent data can be
> leaked. Move the definition from setup.c to traps.c, which is a more
> appropriate place for it to live.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.
On 26.07.2019 22:32, Andrew Cooper wrote:
> The XPTI work restricted the visibility of most of memory, but missed a few
> aspects when it came to the TSS.

None of these were "missed" afair - we'd been aware, and accepted things
to be the way they are now for the first step. Remember that at the time
XPTI was called "XPTI light", in anticipation for this to just be a
temporary solution.

> Given that the TSS is just an object in percpu data, the 4k mapping for it
> created in setup_cpu_root_pgt() maps adjacent percpu data, making it all
> leakable via Meltdown, even when XPTI is in use.
>
> Furthermore, no care is taken to check that the TSS doesn't cross a page
> boundary. As it turns out, struct tss_struct is aligned on its size which
> does prevent it straddling a page boundary, but this will cease to be true
> once CET and Shadow Stack support is added to Xen.

Please can you point me at the CET aspect in documentation here? Aiui
it's only task switches which are affected, and hence only 32-bit TSSes
which would grow (and even then not enough to exceed 128 bytes). For
the purposes 64-bit has there are MSRs to load SSP from.

> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -411,7 +411,7 @@ static always_inline void __mwait(unsigned long eax, unsigned long ecx)
>  #define IOBMP_BYTES 8192
>  #define IOBMP_INVALID_OFFSET 0x8000
>
> -struct __packed __cacheline_aligned tss_struct {
> +struct __packed tss_struct {
>      uint32_t :32;
>      uint64_t rsp0, rsp1, rsp2;
>      uint64_t :64;
> @@ -425,6 +425,7 @@ struct __packed __cacheline_aligned tss_struct {
>      /* Pads the TSS to be cacheline-aligned (total size is 0x80). */
>      uint8_t __cacheline_filler[24];
>  };
> +DECLARE_PER_CPU(struct tss_struct, init_tss);

Taking patch 1 this expands to

__DEFINE_PER_CPU(__section(".bss.percpu.page_aligned") \
                 __aligned(PAGE_SIZE), struct tss_struct, _init_tss);

and then

__section(".bss.percpu.page_aligned") __aligned(PAGE_SIZE)
__typeof__(struct tss_struct) per_cpu__init_tss;

which is not what you want: You have an object of size
sizeof(struct tss_struct) which is PAGE_SIZE aligned. Afaict you
therefore still leak everything that follows in the same page.

There was a reason for __cacheline_aligned's original placement, albeit I
agree that it was/is against the intention of having the struct
define an interface to the hardware (which doesn't have such an
alignment requirement).

Perhaps the solution is a two-layer approach:

struct __aligned(PAGE_SIZE) xen_tss {
    struct __packed tss_struct {
        ...
    };
};

where the inner structure describes the hardware interface and the
containing one our own requirement(s). But personally I also
wouldn't mind putting the __aligned(PAGE_SIZE) right on struct
tss_struct, where __cacheline_aligned has been sitting.

Of course either approach goes against the idea of avoiding usage
mistakes (as pointed out by others in the v1 discussion, iirc):
There better wouldn't be a need to get the two "page aligned"
attributes in sync, i.e. the instantiation of the structure
would better enforce the requested alignment. I've not thought
through whether there's trickery to actually make this work, but
I'd hope we could at the very least detect things not being in
sync at compile time.

Jan
On 29/07/2019 14:51, Jan Beulich wrote:
> On 26.07.2019 22:32, Andrew Cooper wrote:
>> The XPTI work restricted the visibility of most of memory, but missed a few
>> aspects when it came to the TSS.
> None of these were "missed" afair - we'd been aware, and accepted things
> to be the way they are now for the first step. Remember that at the time
> XPTI was called "XPTI light", in anticipation for this to just be a
> temporary solution.

Did the term "XPTI light" survive past the first RFC posting?

Sure - we did things in an incremental way because it was a technically
complex change and Meltdown was out in the wild at the time. However, I
would have fixed this at the same time as .entry.text if I had noticed,
because the purpose of that series was identical to this series - avoid
leaking things we don't absolutely need to leak.

>> Given that the TSS is just an object in percpu data, the 4k mapping for it
>> created in setup_cpu_root_pgt() maps adjacent percpu data, making it all
>> leakable via Meltdown, even when XPTI is in use.
>>
>> Furthermore, no care is taken to check that the TSS doesn't cross a page
>> boundary. As it turns out, struct tss_struct is aligned on its size which
>> does prevent it straddling a page boundary, but this will cease to be true
>> once CET and Shadow Stack support is added to Xen.
> Please can you point me at the CET aspect in documentation here? Aiui
> it's only task switches which are affected, and hence only 32-bit TSSes
> which would grow (and even then not enough to exceed 128 bytes). For
> the purposes 64-bit has there are MSRs to load SSP from.

Ah - it was v1 of the CET spec. I see v3 no longer has the shadow stack
pointer in the TSS. I'll drop this part of the message.

>> --- a/xen/include/asm-x86/processor.h
>> +++ b/xen/include/asm-x86/processor.h
>> @@ -411,7 +411,7 @@ static always_inline void __mwait(unsigned long eax, unsigned long ecx)
>>  #define IOBMP_BYTES 8192
>>  #define IOBMP_INVALID_OFFSET 0x8000
>>
>> -struct __packed __cacheline_aligned tss_struct {
>> +struct __packed tss_struct {
>>      uint32_t :32;
>>      uint64_t rsp0, rsp1, rsp2;
>>      uint64_t :64;
>> @@ -425,6 +425,7 @@ struct __packed __cacheline_aligned tss_struct {
>>      /* Pads the TSS to be cacheline-aligned (total size is 0x80). */
>>      uint8_t __cacheline_filler[24];
>>  };
>> +DECLARE_PER_CPU(struct tss_struct, init_tss);
> Taking patch 1 this expands to
>
> __DEFINE_PER_CPU(__section(".bss.percpu.page_aligned") \
>                  __aligned(PAGE_SIZE), struct tss_struct, _init_tss);
>
> and then
>
> __section(".bss.percpu.page_aligned") __aligned(PAGE_SIZE)
> __typeof__(struct tss_struct) per_cpu__init_tss;
>
> which is not what you want: You have an object of size
> sizeof(struct tss_struct) which is PAGE_SIZE aligned. Afaict you
> therefore still leak everything that follows in the same page.

What data might this be?

Every object put into this section is suitably aligned, so nothing will
sit in the slack between the TSS and the end of the page.

> There was a reason for __cacheline_aligned's original placement, albeit I
> agree that it was/is against the intention of having the struct
> define an interface to the hardware (which doesn't have such an
> alignment requirement).

There is a hard requirement to have the first 104 bytes be physically
contiguous, because on a task switch, some CPUs translate the TSS base and
offset directly from there. I expect that is where the
__cacheline_aligned(), being 128, comes in.

However, the manual also makes it clear that this is only on a task switch,
which is inapplicable for us.

Finally, were we to put a structure like this on the stack (e.g. like
hvm_task_switch() does with tss32), we specifically wouldn't want any
unnecessary alignment.

> Perhaps the solution is a two-layer approach:
>
> struct __aligned(PAGE_SIZE) xen_tss {
>     struct __packed tss_struct {
>         ...
>     };
> };
>
> where the inner structure describes the hardware interface and the
> containing one our own requirement(s). But personally I also
> wouldn't mind putting the __aligned(PAGE_SIZE) right on struct
> tss_struct, where __cacheline_aligned has been sitting.

The only way that would make things more robust is if xen_tss was a
union with char[4096] to extend its size.

However, I think this is overkill, given the internals of
DEFINE_PER_CPU_PAGE_ALIGNED().

> Of course either approach goes against the idea of avoiding usage
> mistakes (as pointed out by others in the v1 discussion, iirc):
> There better wouldn't be a need to get the two "page aligned"
> attributes in sync, i.e. the instantiation of the structure
> would better enforce the requested alignment. I've not thought
> through whether there's trickery to actually make this work, but
> I'd hope we could at the very least detect things not being in
> sync at compile time.

There is a reason why I put in a linker assertion for the TSS being
non-aligned.

~Andrew
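To make the structural argument above concrete, here is a minimal, self-contained sketch of the size-extending union Andrew describes. The wrapper name xen_tss, the pad member, and the standalone PAGE_SIZE definition are illustrative only; the posted series instead relies on DEFINE_PER_CPU_PAGE_ALIGNED() plus a linker assertion. GCC-style attributes and C11 _Static_assert are assumed:

#include <stdint.h>

#define PAGE_SIZE 4096

/* Hardware-defined 64-bit TSS layout, mirroring the processor.h hunk above. */
struct __attribute__((__packed__)) tss_struct {
    uint32_t :32;
    uint64_t rsp0, rsp1, rsp2;
    uint64_t :64;
    uint64_t ist[7];                /* IST1..IST7 */
    uint64_t :64;
    uint16_t :16, bitmap;
    uint8_t __cacheline_filler[24]; /* pads the TSS to 0x80 bytes */
};

/*
 * Size-extending wrapper: both the alignment and the size of the object
 * become a full page, so no unrelated data can ever share the TSS's page,
 * independent of how the per-cpu linker sections happen to be laid out.
 */
union __attribute__((__aligned__(PAGE_SIZE))) xen_tss {
    struct tss_struct tss;          /* hardware view */
    char pad[PAGE_SIZE];            /* forces sizeof() == PAGE_SIZE */
};

_Static_assert(sizeof(struct tss_struct) == 0x80, "TSS layout changed");
_Static_assert(sizeof(union xen_tss) == PAGE_SIZE, "wrapper must fill a page");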
On 29/07/2019 16:55, Andrew Cooper wrote:
>>> --- a/xen/include/asm-x86/processor.h
>>> +++ b/xen/include/asm-x86/processor.h
>>> @@ -411,7 +411,7 @@ static always_inline void __mwait(unsigned long eax, unsigned long ecx)
>>>  #define IOBMP_BYTES 8192
>>>  #define IOBMP_INVALID_OFFSET 0x8000
>>>
>>> -struct __packed __cacheline_aligned tss_struct {
>>> +struct __packed tss_struct {
>>>      uint32_t :32;
>>>      uint64_t rsp0, rsp1, rsp2;
>>>      uint64_t :64;
>>> @@ -425,6 +425,7 @@ struct __packed __cacheline_aligned tss_struct {
>>>      /* Pads the TSS to be cacheline-aligned (total size is 0x80). */
>>>      uint8_t __cacheline_filler[24];
>>>  };
>>> +DECLARE_PER_CPU(struct tss_struct, init_tss);
>> Taking patch 1 this expands to
>>
>> __DEFINE_PER_CPU(__section(".bss.percpu.page_aligned") \
>>                  __aligned(PAGE_SIZE), struct tss_struct, _init_tss);
>>
>> and then
>>
>> __section(".bss.percpu.page_aligned") __aligned(PAGE_SIZE)
>> __typeof__(struct tss_struct) per_cpu__init_tss;
>>
>> which is not what you want: You have an object of size
>> sizeof(struct tss_struct) which is PAGE_SIZE aligned. Afaict you
>> therefore still leak everything that follows in the same page.
> What data might this be?
>
> Every object put into this section is suitably aligned, so nothing will
> sit in the slack between the TSS and the end of the page.

And of course, I should have actually checked...

ffff82d08092e000 B per_cpu__init_tss
ffff82d08092e000 B __per_cpu_start
ffff82d08092e080 B per_cpu__cpupool
ffff82d08092e088 b per_cpu__continue_info

This needs fixing with suitable alignment in the linker script.

~Andrew
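For concreteness, here is a sketch of what "suitable alignment in the linker script" could look like, assuming the per-cpu section ordering implied by the nm output above. The symbol and section names follow xen.lds.S conventions, but the excerpt is illustrative only, not the fix that was actually posted:

/* Illustrative .bss excerpt only -- not the exact xen.lds.S statement. */
. = ALIGN(PAGE_SIZE);
__per_cpu_start = .;
*(.bss.percpu.page_aligned)
. = ALIGN(PAGE_SIZE);        /* keep ordinary .bss.percpu data off the TSS page */
*(.bss.percpu)
. = ALIGN(SMP_CACHE_BYTES);
*(.bss.percpu.read_mostly)
. = ALIGN(SMP_CACHE_BYTES);
__per_cpu_data_end = .;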
On 29.07.2019 17:55, Andrew Cooper wrote:
> On 29/07/2019 14:51, Jan Beulich wrote:
>> On 26.07.2019 22:32, Andrew Cooper wrote:
>>> --- a/xen/include/asm-x86/processor.h
>>> +++ b/xen/include/asm-x86/processor.h
>>> @@ -411,7 +411,7 @@ static always_inline void __mwait(unsigned long eax, unsigned long ecx)
>>>  #define IOBMP_BYTES 8192
>>>  #define IOBMP_INVALID_OFFSET 0x8000
>>>
>>> -struct __packed __cacheline_aligned tss_struct {
>>> +struct __packed tss_struct {
>>>      uint32_t :32;
>>>      uint64_t rsp0, rsp1, rsp2;
>>>      uint64_t :64;
>>> @@ -425,6 +425,7 @@ struct __packed __cacheline_aligned tss_struct {
>>>      /* Pads the TSS to be cacheline-aligned (total size is 0x80). */
>>>      uint8_t __cacheline_filler[24];
>>>  };
>>> +DECLARE_PER_CPU(struct tss_struct, init_tss);
>> Taking patch 1 this expands to
>>
>> __DEFINE_PER_CPU(__section(".bss.percpu.page_aligned") \
>>                  __aligned(PAGE_SIZE), struct tss_struct, _init_tss);
>>
>> and then
>>
>> __section(".bss.percpu.page_aligned") __aligned(PAGE_SIZE)
>> __typeof__(struct tss_struct) per_cpu__init_tss;
>>
>> which is not what you want: You have an object of size
>> sizeof(struct tss_struct) which is PAGE_SIZE aligned. Afaict you
>> therefore still leak everything that follows in the same page.
>
> What data might this be?

Whatever sits early enough in .bss.percpu. There's no page size alignment
enforced between the two sections, ...

> Every object put into this section is suitably aligned, so nothing will
> sit in the slack between the TSS and the end of the page.

... so for the object being last in .bss.percpu.page_aligned it is its size
that matters, not its alignment.

>> Perhaps the solution is a two-layer approach:
>>
>> struct __aligned(PAGE_SIZE) xen_tss {
>>     struct __packed tss_struct {
>>         ...
>>     };
>> };
>>
>> where the inner structure describes the hardware interface and the
>> containing one our own requirement(s). But personally I also
>> wouldn't mind putting the __aligned(PAGE_SIZE) right on struct
>> tss_struct, where __cacheline_aligned has been sitting.
>
> The only way that would make things more robust is if xen_tss was a
> union with char[4096] to extend its size.
>
> However, I think this is overkill, given the internals of
> DEFINE_PER_CPU_PAGE_ALIGNED()
>
>> Of course either approach goes against the idea of avoiding usage
>> mistakes (as pointed out by others in the v1 discussion, iirc):
>> There better wouldn't be a need to get the two "page aligned"
>> attributes in sync, i.e. the instantiation of the structure
>> would better enforce the requested alignment. I've not thought
>> through whether there's trickery to actually make this work, but
>> I'd hope we could at the very least detect things not being in
>> sync at compile time.
>
> There is a reason why I put in a linker assertion for the TSS being
> non-aligned.

Hmm, I indeed didn't have that one on my radar when writing the reply.
However, a compile time diagnostic was what I would prefer: Having to add
linker assertions for each and every object we put in this new section
wouldn't really be nice, even if right now we certainly hope for this to be
the only such object.

Jan
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index d2011910fa..f9d38155d3 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -16,7 +16,6 @@
 #include <xen/domain_page.h>
 #include <xen/version.h>
 #include <xen/gdbstub.h>
-#include <xen/percpu.h>
 #include <xen/hypercall.h>
 #include <xen/keyhandler.h>
 #include <xen/numa.h>
@@ -100,8 +99,6 @@ unsigned long __read_mostly xen_phys_start;
 
 unsigned long __read_mostly xen_virt_end;
 
-DEFINE_PER_CPU(struct tss_struct, init_tss);
-
 char __section(".bss.stack_aligned") __aligned(STACK_SIZE)
     cpu0_stack[STACK_SIZE];
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 38d12013db..de3ac135f5 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -108,6 +108,12 @@ idt_entry_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
 /* Pointer to the IDT of every CPU. */
 idt_entry_t *idt_tables[NR_CPUS] __read_mostly;
 
+/*
+ * The TSS is smaller than a page, but we give it a full page to avoid
+ * adjacent per-cpu data leaking via Meltdown when XPTI is in use.
+ */
+DEFINE_PER_CPU_PAGE_ALIGNED(struct tss_struct, init_tss);
+
 bool (*ioemul_handle_quirk)(
     u8 opcode, char *io_emul_stub, struct cpu_user_regs *regs);
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index b8a2ea4259..c82e1e504a 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -368,6 +368,8 @@ ASSERT(IS_ALIGNED(__2M_rwdata_end, SECTION_ALIGN), "__2M_rwdata_end misaligned
 
 ASSERT(IS_ALIGNED(cpu0_stack, STACK_SIZE), "cpu0_stack misaligned")
 
+ASSERT(IS_ALIGNED(per_cpu__init_tss, PAGE_SIZE), "per_cpu(init_tss) misaligned")
+
 ASSERT(IS_ALIGNED(__init_begin, PAGE_SIZE), "__init_begin misaligned")
 ASSERT(IS_ALIGNED(__init_end, PAGE_SIZE), "__init_end misaligned")
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 2862321eee..b5bee94931 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -411,7 +411,7 @@ static always_inline void __mwait(unsigned long eax, unsigned long ecx)
 #define IOBMP_BYTES 8192
 #define IOBMP_INVALID_OFFSET 0x8000
 
-struct __packed __cacheline_aligned tss_struct {
+struct __packed tss_struct {
     uint32_t :32;
     uint64_t rsp0, rsp1, rsp2;
     uint64_t :64;
@@ -425,6 +425,7 @@ struct __packed __cacheline_aligned tss_struct {
     /* Pads the TSS to be cacheline-aligned (total size is 0x80). */
     uint8_t __cacheline_filler[24];
 };
+DECLARE_PER_CPU(struct tss_struct, init_tss);
 
 #define IST_NONE 0UL
 #define IST_DF 1UL
@@ -463,7 +464,6 @@ static inline void disable_each_ist(idt_entry_t *idt)
 extern idt_entry_t idt_table[];
 extern idt_entry_t *idt_tables[];
 
-DECLARE_PER_CPU(struct tss_struct, init_tss);
 DECLARE_PER_CPU(root_pgentry_t *, root_pgt);
 
 extern void write_ptbase(struct vcpu *v);
The XPTI work restricted the visibility of most of memory, but missed a few
aspects when it came to the TSS.

Given that the TSS is just an object in percpu data, the 4k mapping for it
created in setup_cpu_root_pgt() maps adjacent percpu data, making it all
leakable via Meltdown, even when XPTI is in use.

Furthermore, no care is taken to check that the TSS doesn't cross a page
boundary. As it turns out, struct tss_struct is aligned on its size which
does prevent it straddling a page boundary, but this will cease to be true
once CET and Shadow Stack support is added to Xen.

Move the TSS into the page aligned percpu area, so no adjacent data can be
leaked. Move the definition from setup.c to traps.c, which is a more
appropriate place for it to live.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>

v2:
 * Rebase over changes to include __aligned() within
   DEFINE_PER_CPU_PAGE_ALIGNED()
 * Drop now-unused xen/percpu.h from setup.c
---
 xen/arch/x86/setup.c            | 3 ---
 xen/arch/x86/traps.c            | 6 ++++++
 xen/arch/x86/xen.lds.S          | 2 ++
 xen/include/asm-x86/processor.h | 4 ++--
 4 files changed, 10 insertions(+), 5 deletions(-)