Message ID | 20190704175732.5943-1-andrew.cooper3@citrix.com (mailing list archive) |
---|---|
State | Superseded |
Series | x86/ctxt-switch: Document and improve GDT handling |
On 04.07.19 19:57, Andrew Cooper wrote:
> write_full_gdt_ptes() has a latent bug. Using virt_to_mfn() and iterating
> with (mfn + i) is wrong, because of PDX compression. The context switch path
> only functions correctly because NR_RESERVED_GDT_PAGES is 1.
> 
> As this is exceedingly unlikely to change moving forward, drop the loop
> rather than inserting a BUILD_BUG_ON(NR_RESERVED_GDT_PAGES != 1).
> 
> With the loop dropped, write_full_gdt_ptes() becomes more obviously a poor
> name, so rename it to update_xen_slot_in_full_gdt().
> 
> Furthermore, calling virt_to_mfn() in the context switch path is a lot of
> wasted cycles for a result which is constant after boot.
> 
> Begin by documenting how Xen handles the GDTs across context switch.
> 
> From this, we observe that load_full_gdt() is completely independent of the
> current CPU, and load_default_gdt() only gets passed the current CPU regular
> GDT.
> 
> Add two extra per-cpu variables which cache the L1e for the regular and compat
> GDT, calculated in cpu_smpboot_alloc()/trap_init() as appropriate, so
> update_xen_slot_in_full_gdt() doesn't need to waste time performing the same
> calculation on every context switch.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

I did a small performance test with this patch: on an 8 cpu system I
started 2 mini-os domains (1 vcpu each) doing a busy loop sending events
to dom0. On dom0 I did a build of the hypervisor via "make -j 8" and
measured the time for that build, then took the average of 5 such builds
(doing a make clean in between).

            elapsed     user    system
Unpatched     66.51   232.93    109.21
Patched       57.00   225.47    105.47

This is a very clear performance win!

Tested-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen

> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Wei Liu <wl@xen.org>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Juergen Gross <jgross@suse.com>
> 
> Slightly RFC.
> 
> I'm fairly confident this is better, but Juergen says that some of his
> scheduling perf tests notice large differences from subtle changes in
> __context_switch(), so it would be useful to get some numbers from this
> change.
> 
> The delta from this change is:
> 
>   add/remove: 2/0 grow/shrink: 1/1 up/down: 320/-127 (193)
>   Function                                     old     new   delta
>   cpu_smpboot_callback                        1152    1456    +304
>   per_cpu__gdt_table_l1e                         -       8      +8
>   per_cpu__compat_gdt_table_l1e                  -       8      +8
>   __context_switch                            1238    1111    -127
>   Total: Before=3339227, After=3339420, chg +0.01%
> 
> I'm not overly happy about the special case in trap_init() but I can't think
> of a better place to put this.
> 
> Also, it should now be very obvious to people that Xen's current GDT handling
> for non-PV vcpus is a recipe for subtle bugs, if we ever manage to execute a stray
> mov/pop %sreg instruction. We really ought to have Xen's regular GDT in an
> area where slots 0-13 are either mapped to the zero page, or not present, so
> we don't risk loading a non-faulting garbage selector.
> ---
>  xen/arch/x86/domain.c      | 52 ++++++++++++++++++++++++++++++----------------
>  xen/arch/x86/smpboot.c     |  4 ++++
>  xen/arch/x86/traps.c       | 10 +++++++++
>  xen/include/asm-x86/desc.h |  2 ++
>  4 files changed, 50 insertions(+), 18 deletions(-)
> 
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index 84cafbe558..147f96a09e 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1635,23 +1635,42 @@ static void _update_runstate_area(struct vcpu *v)
>          v->arch.pv.need_update_runstate_area = 1;
>  }
>  
> +/*
> + * Overview of Xen's GDTs.
> + *
> + * Xen maintains per-CPU compat and regular GDTs which are both a single page
> + * in size. Some content is specific to each CPU (the TSS, the per-CPU marker
> + * for #DF handling, and optionally the LDT). The compat and regular GDTs
> + * differ by the layout and content of the guest accessible selectors.
> + *
> + * The Xen selectors live from 0xe000 (slot 14 of 16), and need to always
> + * appear in this position for interrupt/exception handling to work.
> + *
> + * A PV guest may specify GDT frames of their own (slots 0 to 13). Room for a
> + * full GDT exists in the per-domain mappings.
> + *
> + * To schedule a PV vcpu, we point slot 14 of the guest's full GDT at the
> + * current CPU's compat or regular (as appropriate) GDT frame. This is so
> + * that the per-CPU parts still work correctly after switching pagetables and
> + * loading the guests full GDT into GDTR.
> + *
> + * To schedule Idle or HVM vcpus, we load a GDT base address which causes the
> + * regular per-CPU GDT frame to appear with selectors at the appropriate
> + * offset.
> + */
>  static inline bool need_full_gdt(const struct domain *d)
>  {
>      return is_pv_domain(d) && !is_idle_domain(d);
>  }
>  
> -static void write_full_gdt_ptes(seg_desc_t *gdt, const struct vcpu *v)
> +static void update_xen_slot_in_full_gdt(const struct vcpu *v, unsigned int cpu)
>  {
> -    unsigned long mfn = virt_to_mfn(gdt);
> -    l1_pgentry_t *pl1e = pv_gdt_ptes(v);
> -    unsigned int i;
> -
> -    for ( i = 0; i < NR_RESERVED_GDT_PAGES; i++ )
> -        l1e_write(pl1e + FIRST_RESERVED_GDT_PAGE + i,
> -                  l1e_from_pfn(mfn + i, __PAGE_HYPERVISOR_RW));
> +    l1e_write(pv_gdt_ptes(v) + FIRST_RESERVED_GDT_PAGE,
> +              !is_pv_32bit_vcpu(v) ? per_cpu(gdt_table_l1e, cpu)
> +                                   : per_cpu(compat_gdt_table_l1e, cpu));
>  }
>  
> -static void load_full_gdt(const struct vcpu *v, unsigned int cpu)
> +static void load_full_gdt(const struct vcpu *v)
>  {
>      struct desc_ptr gdt_desc = {
>          .limit = LAST_RESERVED_GDT_BYTE,
> @@ -1661,11 +1680,12 @@ static void load_full_gdt(const struct vcpu *v, unsigned int cpu)
>      lgdt(&gdt_desc);
>  }
>  
> -static void load_default_gdt(const seg_desc_t *gdt, unsigned int cpu)
> +static void load_default_gdt(unsigned int cpu)
>  {
>      struct desc_ptr gdt_desc = {
>          .limit = LAST_RESERVED_GDT_BYTE,
> -        .base  = (unsigned long)(gdt - FIRST_RESERVED_GDT_ENTRY),
> +        .base  = (unsigned long)(per_cpu(gdt_table, cpu) -
> +                                 FIRST_RESERVED_GDT_ENTRY),
>      };
>  
>      lgdt(&gdt_desc);
> @@ -1678,7 +1698,6 @@ static void __context_switch(void)
>      struct vcpu          *p = per_cpu(curr_vcpu, cpu);
>      struct vcpu          *n = current;
>      struct domain        *pd = p->domain, *nd = n->domain;
> -    seg_desc_t           *gdt;
>  
>      ASSERT(p != n);
>      ASSERT(!vcpu_cpu_dirty(n));
> @@ -1718,15 +1737,12 @@ static void __context_switch(void)
>  
>      psr_ctxt_switch_to(nd);
>  
> -    gdt = !is_pv_32bit_domain(nd) ? per_cpu(gdt_table, cpu) :
> -                                    per_cpu(compat_gdt_table, cpu);
> -
>      if ( need_full_gdt(nd) )
> -        write_full_gdt_ptes(gdt, n);
> +        update_xen_slot_in_full_gdt(n, cpu);
>  
>      if ( need_full_gdt(pd) &&
>           ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(nd)) )
> -        load_default_gdt(gdt, cpu);
> +        load_default_gdt(cpu);
>  
>      write_ptbase(n);
>  
> @@ -1739,7 +1755,7 @@ static void __context_switch(void)
>  
>      if ( need_full_gdt(nd) &&
>           ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(pd)) )
> -        load_full_gdt(n, cpu);
> +        load_full_gdt(n);
>  
>      if ( pd != nd )
>          cpumask_clear_cpu(cpu, pd->dirty_cpumask);
> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> index 730fe141fa..004285d14c 100644
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -985,6 +985,8 @@ static int cpu_smpboot_alloc(unsigned int cpu)
>      if ( gdt == NULL )
>          goto out;
>      per_cpu(gdt_table, cpu) = gdt;
> +    per_cpu(gdt_table_l1e, cpu) =
> +        l1e_from_pfn(virt_to_mfn(gdt), __PAGE_HYPERVISOR_RW);
>      memcpy(gdt, boot_cpu_gdt_table, NR_RESERVED_GDT_PAGES * PAGE_SIZE);
>      BUILD_BUG_ON(NR_CPUS > 0x10000);
>      gdt[PER_CPU_GDT_ENTRY - FIRST_RESERVED_GDT_ENTRY].a = cpu;
> @@ -992,6 +994,8 @@ static int cpu_smpboot_alloc(unsigned int cpu)
>      per_cpu(compat_gdt_table, cpu) = gdt = alloc_xenheap_pages(order, memflags);
>      if ( gdt == NULL )
>          goto out;
> +    per_cpu(compat_gdt_table_l1e, cpu) =
> +        l1e_from_pfn(virt_to_mfn(gdt), __PAGE_HYPERVISOR_RW);
>      memcpy(gdt, boot_cpu_compat_gdt_table, NR_RESERVED_GDT_PAGES * PAGE_SIZE);
>      gdt[PER_CPU_GDT_ENTRY - FIRST_RESERVED_GDT_ENTRY].a = cpu;
>  
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index 8097ef3bf5..25b4b47e5e 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -97,7 +97,9 @@ DEFINE_PER_CPU(uint64_t, efer);
>  static DEFINE_PER_CPU(unsigned long, last_extable_addr);
>  
>  DEFINE_PER_CPU_READ_MOSTLY(seg_desc_t *, gdt_table);
> +DEFINE_PER_CPU_READ_MOSTLY(l1_pgentry_t, gdt_table_l1e);
>  DEFINE_PER_CPU_READ_MOSTLY(seg_desc_t *, compat_gdt_table);
> +DEFINE_PER_CPU_READ_MOSTLY(l1_pgentry_t, compat_gdt_table_l1e);
>  
>  /* Master table, used by CPU0. */
>  idt_entry_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
> @@ -2059,6 +2061,14 @@ void __init trap_init(void)
>          }
>      }
>  
> +    /* Cache {,compat_}gdt_table_l1e now that physically relocation is done. */
> +    this_cpu(gdt_table_l1e) =
> +        l1e_from_pfn(virt_to_mfn(boot_cpu_gdt_table),
> +                     __PAGE_HYPERVISOR_RW);
> +    this_cpu(compat_gdt_table_l1e) =
> +        l1e_from_pfn(virt_to_mfn(boot_cpu_compat_gdt_table),
> +                     __PAGE_HYPERVISOR_RW);
> +
>      percpu_traps_init();
>  
>      cpu_init();
> diff --git a/xen/include/asm-x86/desc.h b/xen/include/asm-x86/desc.h
> index 85e83bcefb..e565727dc0 100644
> --- a/xen/include/asm-x86/desc.h
> +++ b/xen/include/asm-x86/desc.h
> @@ -206,8 +206,10 @@ struct __packed desc_ptr {
>  
>  extern seg_desc_t boot_cpu_gdt_table[];
>  DECLARE_PER_CPU(seg_desc_t *, gdt_table);
> +DECLARE_PER_CPU(l1_pgentry_t, gdt_table_l1e);
>  extern seg_desc_t boot_cpu_compat_gdt_table[];
>  DECLARE_PER_CPU(seg_desc_t *, compat_gdt_table);
> +DECLARE_PER_CPU(l1_pgentry_t, compat_gdt_table_l1e);
>  
>  extern void load_TR(void);
> 
On 04.07.2019 19:57, Andrew Cooper wrote:
> write_full_gdt_ptes() has a latent bug. Using virt_to_mfn() and iterating
> with (mfn + i) is wrong, because of PDX compression. The context switch path
> only functions correctly because NR_RESERVED_GDT_PAGES is 1.

Whether this is a (latent) bug depends on how the allocation gets
done. As long as it's a single alloc_xenheap_pages(), this is
perfectly fine. There are no individual allocations which can span
a PDX compression hole (or else MFN or struct page pointer
arithmetic wouldn't work either, independent of the involvement of
a virtual address).

> Also, it should now be very obvious to people that Xen's current GDT handling
> for non-PV vcpus is a recipe for subtle bugs, if we ever manage to execute a stray
> mov/pop %sreg instruction. We really ought to have Xen's regular GDT in an
> area where slots 0-13 are either mapped to the zero page, or not present, so
> we don't risk loading a non-faulting garbage selector.

Well, there's certainly room for improvement, but loading a stray
selector seems a pretty unlikely event to happen, and so does the
respective code having slipped in without anyone noticing.
Other than in context switching code I don't think there are many
places at all where we write to the selector registers.

> @@ -1718,15 +1737,12 @@ static void __context_switch(void)
> 
>      psr_ctxt_switch_to(nd);
> 
> -    gdt = !is_pv_32bit_domain(nd) ? per_cpu(gdt_table, cpu) :
> -                                    per_cpu(compat_gdt_table, cpu);
> -
>      if ( need_full_gdt(nd) )
> -        write_full_gdt_ptes(gdt, n);
> +        update_xen_slot_in_full_gdt(n, cpu);
> 
>      if ( need_full_gdt(pd) &&
>           ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(nd)) )
> -        load_default_gdt(gdt, cpu);
> +        load_default_gdt(cpu);

From looking at this transformation I cannot see how, as said in
the description and as expressed by removing the gdt parameter
from load_default_gdt(), the gdt having got passed in here would
always have been per_cpu(gdt_table, cpu). It pretty clearly was
the compat one for nd being 32-bit PV. What am I missing?
Or is the description perhaps instead meaning to say that it doesn't
_need_ to be the compat one that we load here, as in case it is
the subsequent load_full_gdt() will replace it again anyway?

> @@ -2059,6 +2061,14 @@ void __init trap_init(void)
>          }
>      }
> 
> +    /* Cache {,compat_}gdt_table_l1e now that physically relocation is done. */

"physical relocation" or "physically relocating"?

Jan
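For comparison, a form of the removed loop that would stay correct even across a PDX
compression hole is to translate each page's virtual address individually instead of
assuming MFN-contiguity. This is a minimal sketch only, not part of the posted patch,
reusing the local names from the old write_full_gdt_ptes(); it only matters if
NR_RESERVED_GDT_PAGES ever grows and the pages stop coming from a single allocation.

    /* Illustrative sketch: translate each page separately rather than relying
     * on (mfn + i) naming the next page, which PDX compression does not
     * guarantee in general. */
    for ( i = 0; i < NR_RESERVED_GDT_PAGES; i++ )
        l1e_write(pl1e + FIRST_RESERVED_GDT_PAGE + i,
                  l1e_from_pfn(virt_to_mfn((const char *)gdt + i * PAGE_SIZE),
                               __PAGE_HYPERVISOR_RW));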
On 05/07/2019 11:00, Jan Beulich wrote:
> On 04.07.2019 19:57, Andrew Cooper wrote:
>> write_full_gdt_ptes() has a latent bug. Using virt_to_mfn() and iterating
>> with (mfn + i) is wrong, because of PDX compression. The context switch path
>> only functions correctly because NR_RESERVED_GDT_PAGES is 1.
> Whether this is a (latent) bug depends on how the allocation gets
> done. As long as it's a single alloc_xenheap_pages(), this is
> perfectly fine. There are no individual allocations which can span
> a PDX compression hole (or else MFN or struct page pointer
> arithmetic wouldn't work either, independent of the involvement of
> a virtual address).

Hmm - it's still very deceptive code.

>> Also, it should now be very obvious to people that Xen's current GDT handling
>> for non-PV vcpus is a recipe for subtle bugs, if we ever manage to execute a stray
>> mov/pop %sreg instruction. We really ought to have Xen's regular GDT in an
>> area where slots 0-13 are either mapped to the zero page, or not present, so
>> we don't risk loading a non-faulting garbage selector.
> Well, there's certainly room for improvement, but loading a stray
> selector seems a pretty unlikely event to happen, and so does the
> respective code having slipped in without anyone noticing.
> Other than in context switching code I don't think there are many
> places at all where we write to the selector registers.

There are however many places where we write some bytes into a stub and
then execute them.

This isn't a security issue. There aren't any legitimate codepaths for
which this is a problem, but there are plenty of cascade failures where
this is liable to make a bad situation worse in weird hard-to-debug ways.

Not to mention that for security hardening purposes, we should be using
a RO mapping to combat sgdt or fixed-ABI knowledge from an attacker.

And on that note... nothing really updates the full GDT via the
perdomain mappings, so I think that can already move to being RO. This
does depend on the fact that no one has used segmented virtual memory
since long before Xen was a thing. We can trap and emulate the setting
of A bits, and I bet that path will never get hit even with old PV guests.

>> @@ -1718,15 +1737,12 @@ static void __context_switch(void)
>> 
>>      psr_ctxt_switch_to(nd);
>> 
>> -    gdt = !is_pv_32bit_domain(nd) ? per_cpu(gdt_table, cpu) :
>> -                                    per_cpu(compat_gdt_table, cpu);
>> -
>>      if ( need_full_gdt(nd) )
>> -        write_full_gdt_ptes(gdt, n);
>> +        update_xen_slot_in_full_gdt(n, cpu);
>> 
>>      if ( need_full_gdt(pd) &&
>>           ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(nd)) )
>> -        load_default_gdt(gdt, cpu);
>> +        load_default_gdt(cpu);
> From looking at this transformation I cannot see how, as said in
> the description and as expressed by removing the gdt parameter
> from load_default_gdt(), the gdt having got passed in here would
> always have been per_cpu(gdt_table, cpu). It pretty clearly was
> the compat one for nd being 32-bit PV. What am I missing?

To be perfectly honest, I wrote "how it {does,should} logically work",
then adjusted the code.

> Or is the description perhaps instead meaning to say that it doesn't
> _need_ to be the compat one that we load here, as in case it is
> the subsequent load_full_gdt() will replace it again anyway?

lgdt is an expensive operation. I hadn't even spotted that we are doing
it twice on that path. There is surely some room for improvement here
as well.

I wonder if caching the last gdt base address per cpu would be a better
option, and only doing a "lazy" lgdt. It would certainly simplify the
"when should I lgdt?" logic.

>> @@ -2059,6 +2061,14 @@ void __init trap_init(void)
>>          }
>>      }
>> 
>> +    /* Cache {,compat_}gdt_table_l1e now that physically relocation is done. */
> "physical relocation" or "physically relocating"?

Oops. I'll go with the former.

~Andrew
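The "lazy" lgdt idea floated above could look roughly like the sketch below. This is
hypothetical and not part of the patch: the per-cpu cache variable and helper name are
made up, and any real version would still have to honour the constraint, raised later
in the thread, that the currently loaded GDT stays mapped across write_ptbase().

    /* Hypothetical sketch of a lazily-loaded GDT base. */
    static DEFINE_PER_CPU(unsigned long, cached_gdt_base);

    static void lazy_lgdt(unsigned long base)
    {
        struct desc_ptr gdt_desc = {
            .limit = LAST_RESERVED_GDT_BYTE,
            .base  = base,
        };

        /* Skip the expensive lgdt when GDTR already points at this base. */
        if ( this_cpu(cached_gdt_base) == base )
            return;

        this_cpu(cached_gdt_base) = base;
        lgdt(&gdt_desc);
    }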
On 05.07.2019 15:36, Andrew Cooper wrote:
> On 05/07/2019 11:00, Jan Beulich wrote:
>> On 04.07.2019 19:57, Andrew Cooper wrote:
>>> Also, it should now be very obvious to people that Xen's current GDT handling
>>> for non-PV vcpus is a recipe for subtle bugs, if we ever manage to execute a stray
>>> mov/pop %sreg instruction. We really ought to have Xen's regular GDT in an
>>> area where slots 0-13 are either mapped to the zero page, or not present, so
>>> we don't risk loading a non-faulting garbage selector.
>> Well, there's certainly room for improvement, but loading a stray
>> selector seems a pretty unlikely event to happen, and so does the
>> respective code having slipped in without anyone noticing.
>> Other than in context switching code I don't think there are many
>> places at all where we write to the selector registers.
> 
> There are however many places where we write some bytes into a stub and
> then execute them.
> 
> This isn't a security issue. There aren't any legitimate codepaths for
> which this is a problem, but there are plenty of cascade failures where
> this is liable to make a bad situation worse in weird hard-to-debug ways.
> 
> Not to mention that for security hardening purposes, we should be using
> a RO mapping to combat sgdt or fixed-ABI knowledge from an attacker.

Okay, I can see how SGDT can reveal undue knowledge to an attacker,
even if they can't use the address directly, but only to infer further
things (hence UMIP's existence). But I'm having trouble seeing how a
r/o mapped GDT adds much security. Could you expand on this?

> And on that note... nothing really updates the full GDT via the
> perdomain mappings, so I think that can already move to being RO. This
> does depend on the fact that no one has used segmented virtual memory
> since long before Xen was a thing. We can trap and emulate the setting
> of A bits, and I bet that path will never get hit even with old PV guests.

Well, you could go the first step here and map the Xen page r/o right
away. There we don't even need to fear A bits needing to get set, since
we control the contents.

>>> @@ -1718,15 +1737,12 @@ static void __context_switch(void)
>>> 
>>>      psr_ctxt_switch_to(nd);
>>> 
>>> -    gdt = !is_pv_32bit_domain(nd) ? per_cpu(gdt_table, cpu) :
>>> -                                    per_cpu(compat_gdt_table, cpu);
>>> -
>>>      if ( need_full_gdt(nd) )
>>> -        write_full_gdt_ptes(gdt, n);
>>> +        update_xen_slot_in_full_gdt(n, cpu);
>>> 
>>>      if ( need_full_gdt(pd) &&
>>>           ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(nd)) )
>>> -        load_default_gdt(gdt, cpu);
>>> +        load_default_gdt(cpu);
>> From looking at this transformation I cannot see how, as said in
>> the description and as expressed by removing the gdt parameter
>> from load_default_gdt(), the gdt having got passed in here would
>> always have been per_cpu(gdt_table, cpu). It pretty clearly was
>> the compat one for nd being 32-bit PV. What am I missing?
> 
> To be perfectly honest, I wrote "how it {does,should} logically work",
> then adjusted the code.
> 
>> Or is the description perhaps instead meaning to say that it doesn't
>> _need_ to be the compat one that we load here, as in case it is
>> the subsequent load_full_gdt() will replace it again anyway?
> 
> lgdt is an expensive operation. I hadn't even spotted that we are doing
> it twice on that path. There is surely some room for improvement here
> as well.

Well, the double load had to be added for a very simple reason:
While we're switching page tables, the GDT mapping has to remain
intact. Hence the dependency upon the two vCPU IDs (not) matching.
IOW I don't currently see room for improvement here.

> I wonder if caching the last gdt base address per cpu would be a better
> option, and only doing a "lazy" lgdt. It would certainly simplify the
> "when should I lgdt?" logic.

At first glance I'm not convinced this would simplify things, but I'd
have to see actual code.

Jan
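Mapping just Xen's GDT frame read-only in the guests' full GDTs, as suggested above,
would essentially amount to caching the l1e without write permission. A sketch against
cpu_smpboot_alloc(), assuming the existing __PAGE_HYPERVISOR_RO attribute; Xen's own
xenheap mapping of the frame would stay writable, only the per-domain view changes.

    /* Sketch only, not part of the posted patch: make the Xen slot of the
     * per-domain full GDT view read-only, while Xen's own mapping of the
     * frame remains read-write. */
    per_cpu(gdt_table_l1e, cpu) =
        l1e_from_pfn(virt_to_mfn(gdt), __PAGE_HYPERVISOR_RO);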
On 05/07/2019 14:56, Jan Beulich wrote:
> On 05.07.2019 15:36, Andrew Cooper wrote:
>> On 05/07/2019 11:00, Jan Beulich wrote:
>>> On 04.07.2019 19:57, Andrew Cooper wrote:
>>>> Also, it should now be very obvious to people that Xen's current GDT handling
>>>> for non-PV vcpus is a recipe for subtle bugs, if we ever manage to execute a stray
>>>> mov/pop %sreg instruction. We really ought to have Xen's regular GDT in an
>>>> area where slots 0-13 are either mapped to the zero page, or not present, so
>>>> we don't risk loading a non-faulting garbage selector.
>>> Well, there's certainly room for improvement, but loading a stray
>>> selector seems a pretty unlikely event to happen, and so does the
>>> respective code having slipped in without anyone noticing.
>>> Other than in context switching code I don't think there are many
>>> places at all where we write to the selector registers.
>> There are however many places where we write some bytes into a stub and
>> then execute them.
>>
>> This isn't a security issue. There aren't any legitimate codepaths for
>> which this is a problem, but there are plenty of cascade failures where
>> this is liable to make a bad situation worse in weird hard-to-debug ways.
>>
>> Not to mention that for security hardening purposes, we should be using
>> a RO mapping to combat sgdt or fixed-ABI knowledge from an attacker.
> Okay, I can see how SGDT can reveal undue knowledge to an attacker,
> even if they can't use the address directly, but only to infer further
> things (hence UMIP's existence). But I'm having trouble seeing how a
> r/o mapped GDT adds much security. Could you expand on this?

If an attacker has some kind of write-gadget already in Xen, knowing
where a writeable mapping to the GDT is makes their life very easy in
terms of getting code execution. RO mappings are a defence in depth
measure.

> 
>> And on that note... nothing really updates the full GDT via the
>> perdomain mappings, so I think that can already move to being RO. This
>> does depend on the fact that no one has used segmented virtual memory
>> since long before Xen was a thing. We can trap and emulate the setting
>> of A bits, and I bet that path will never get hit even with old PV guests.
> Well, you could go the first step here and map the Xen page r/o right
> away. There we don't even need to fear A bits needing to get set, since
> we control the contents.
> 
>>>> @@ -1718,15 +1737,12 @@ static void __context_switch(void)
>>>> 
>>>>      psr_ctxt_switch_to(nd);
>>>> 
>>>> -    gdt = !is_pv_32bit_domain(nd) ? per_cpu(gdt_table, cpu) :
>>>> -                                    per_cpu(compat_gdt_table, cpu);
>>>> -
>>>>      if ( need_full_gdt(nd) )
>>>> -        write_full_gdt_ptes(gdt, n);
>>>> +        update_xen_slot_in_full_gdt(n, cpu);
>>>> 
>>>>      if ( need_full_gdt(pd) &&
>>>>           ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(nd)) )
>>>> -        load_default_gdt(gdt, cpu);
>>>> +        load_default_gdt(cpu);
>>> From looking at this transformation I cannot see how, as said in
>>> the description and as expressed by removing the gdt parameter
>>> from load_default_gdt(), the gdt having got passed in here would
>>> always have been per_cpu(gdt_table, cpu). It pretty clearly was
>>> the compat one for nd being 32-bit PV. What am I missing?
>> To be perfectly honest, I wrote "how it {does,should} logically work",
>> then adjusted the code.
>>
>>> Or is the description perhaps instead meaning to say that it doesn't
>>> _need_ to be the compat one that we load here, as in case it is
>>> the subsequent load_full_gdt() will replace it again anyway?
>> lgdt is an expensive operation. I hadn't even spotted that we are doing
>> it twice on that path. There is surely some room for improvement here
>> as well.
> Well, the double load had to be added for a very simple reason:
> While we're switching page tables, the GDT mapping has to remain
> intact. Hence the dependency upon the two vCPU IDs (not) matching.
> IOW I don't currently see room for improvement here.

Urgh. That is a nasty corner case. We really don't want to be doing two
lgdt's, but it is currently unavoidable when we're switching with
p->vcpu_id > nd->max_vcpus, as the current gdtr over p's full GDT will
become not present when n has fewer vcpus.

For now, I'm tempted to go with an answer of "it doesn't need to be the
compat GDT".

This is a rabbit hole I really don't have time to follow atm.

(Bring on the non-flat address space days, where all of this complexity
can disappear totally.)

~Andrew
On 05.07.2019 16:40, Andrew Cooper wrote:
> For now, I'm tempted to go with an answer of "it doesn't need to be the
> compat GDT".

Which is fine with me.

> This is a rabbit hole I really don't have time to follow atm.

Fully understood. From the very beginning I was only after some commit
message adjustments here (and the one trivial comment correction). It's
up to you whether to fold in the r/o mapping of the Xen GDT page, since
you touch that code anyway.

Jan