Message ID | 4647a019c7b42d40d3c2f5b0a3685954bea7f982.1595948219.git.bertrand.marquis@arm.com
---|---
State | Superseded
Series | [v2] xen/arm: Convert runstate address during hypcall
On Tue, 28 Jul 2020, Bertrand Marquis wrote: > At the moment on Arm, a Linux guest running with KTPI enabled will > cause the following error when a context switch happens in user mode: > (XEN) p2m.c:1890: d1v0: Failed to walk page-table va 0xffffff837ebe0cd0 > > The error is caused by the virtual address for the runstate area > registered by the guest only being accessible when the guest is running > in kernel space when KPTI is enabled. > > To solve this issue, this patch is doing the translation from virtual > address to physical address during the hypercall and mapping the > required pages using vmap. This is removing the conversion from virtual > to physical address during the context switch which is solving the > problem with KPTI. > > This is done only on arm architecture, the behaviour on x86 is not > modified by this patch and the address conversion is done as before > during each context switch. > > This is introducing several limitations in comparison to the previous > behaviour (on arm only): > - if the guest is remapping the area at a different physical address Xen > will continue to update the area at the previous physical address. As > the area is in kernel space and usually defined as a global variable this > is something which is believed not to happen. If this is required by a > guest, it will have to call the hypercall with the new area (even if it > is at the same virtual address). > - the area needs to be mapped during the hypercall. For the same reasons > as for the previous case, even if the area is registered for a different > vcpu. It is believed that registering an area using a virtual address > unmapped is not something done. > > inline functions in headers could not be used as the architecture > domain.h is included before the global domain.h making it impossible > to use the struct vcpu inside the architecture header. > This should not have any performance impact as the hypercall is only > called once per vcpu usually. > > Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com> > --- > Changes in v2 > - use vmap to map the pages during the hypercall. > - reintroduce initial copy during hypercall. > > --- > xen/arch/arm/domain.c | 160 +++++++++++++++++++++++++++++++---- > xen/arch/x86/domain.c | 30 ++++++- > xen/arch/x86/x86_64/domain.c | 4 +- > xen/common/domain.c | 19 ++--- > xen/include/asm-arm/domain.h | 9 ++ > xen/include/asm-x86/domain.h | 16 ++++ > xen/include/xen/domain.h | 5 ++ > xen/include/xen/sched.h | 16 +--- > 8 files changed, 207 insertions(+), 52 deletions(-) > > diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c > index 31169326b2..c595438bd9 100644 > --- a/xen/arch/arm/domain.c > +++ b/xen/arch/arm/domain.c > @@ -19,6 +19,7 @@ > #include <xen/sched.h> > #include <xen/softirq.h> > #include <xen/wait.h> > +#include <xen/vmap.h> > > #include <asm/alternative.h> > #include <asm/cpuerrata.h> > @@ -275,36 +276,157 @@ static void ctxt_switch_to(struct vcpu *n) > virt_timer_restore(n); > } > > -/* Update per-VCPU guest runstate shared memory area (if registered). 
*/ > -static void update_runstate_area(struct vcpu *v) > +static void cleanup_runstate_vcpu_locked(struct vcpu *v) > +{ > + if ( v->arch.runstate_guest ) > + { > + vunmap((void *)((unsigned long)v->arch.runstate_guest & PAGE_MASK)); > + > + put_page(v->arch.runstate_guest_page[0]); > + > + if ( v->arch.runstate_guest_page[1] ) > + { > + put_page(v->arch.runstate_guest_page[1]); > + } > + v->arch.runstate_guest = NULL; > + } > +} > + > +void arch_vcpu_cleanup_runstate(struct vcpu *v) > { > - void __user *guest_handle = NULL; > + spin_lock(&v->arch.runstate_guest_lock); > + > + cleanup_runstate_vcpu_locked(v); > + > + spin_unlock(&v->arch.runstate_guest_lock); > +} > + > +static int setup_runstate_vcpu_locked(struct vcpu *v, vaddr_t vaddr) > +{ > + unsigned int offset; > + mfn_t mfn[2]; > + struct page_info *page; > + unsigned int numpages; > struct vcpu_runstate_info runstate; > + void *p; > > - if ( guest_handle_is_null(runstate_guest(v)) ) > - return; > + /* user can pass a NULL address to unregister a previous area */ > + if ( vaddr == 0 ) > + return 0; > > - memcpy(&runstate, &v->runstate, sizeof(runstate)); > + offset = vaddr & ~PAGE_MASK; > > - if ( VM_ASSIST(v->domain, runstate_update_flag) ) > + /* provided address must be aligned to a 64bit */ > + if ( offset % alignof(struct vcpu_runstate_info) ) > { > - guest_handle = &v->runstate_guest.p->state_entry_time + 1; > - guest_handle--; > - runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; > - __raw_copy_to_guest(guest_handle, > - (void *)(&runstate.state_entry_time + 1) - 1, 1); > - smp_wmb(); > + gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: " > + "Invalid offset\n", vaddr); > + return -EINVAL; > + } > + > + page = get_page_from_gva(v, vaddr, GV2M_WRITE); > + if ( !page ) > + { > + gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: " > + "Page is not mapped\n", vaddr); > + return -EINVAL; > + } > + mfn[0] = page_to_mfn(page); > + v->arch.runstate_guest_page[0] = page; > + > + if ( offset > (PAGE_SIZE - sizeof(struct vcpu_runstate_info)) ) > + { > + /* guest area is crossing pages */ > + page = get_page_from_gva(v, vaddr + PAGE_SIZE, GV2M_WRITE); > + if ( !page ) > + { > + put_page(v->arch.runstate_guest_page[0]); > + gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: " > + "2nd Page is not mapped\n", vaddr); > + return -EINVAL; > + } > + mfn[1] = page_to_mfn(page); > + v->arch.runstate_guest_page[1] = page; > + numpages = 2; > + } > + else > + { > + v->arch.runstate_guest_page[1] = NULL; > + numpages = 1; > + } > + > + p = vmap(mfn, numpages); > + if ( !p ) > + { > + put_page(v->arch.runstate_guest_page[0]); > + if ( numpages == 2 ) > + put_page(v->arch.runstate_guest_page[1]); > + > + gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: " > + "vmap error\n", vaddr); > + return -EINVAL; > } > > - __copy_to_guest(runstate_guest(v), &runstate, 1); > + v->arch.runstate_guest = p + offset; > > - if ( guest_handle ) > + if (v == current) > { > - runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE; > + memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate)); > + } > + else > + { > + vcpu_runstate_get(v, &runstate); > + memcpy(v->arch.runstate_guest, &runstate, sizeof(v->runstate)); > + } > + > + return 0; > +} The arm32 build breaks with: domain.c: In function 'setup_runstate_vcpu_locked': domain.c:322:9: error: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'vaddr_t' [-Werror=format=] gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 
0x%lx: " ^ domain.c:330:9: error: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'vaddr_t' [-Werror=format=] gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: " ^ domain.c:344:13: error: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'vaddr_t' [-Werror=format=] gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: " ^ domain.c:365:9: error: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'vaddr_t' [-Werror=format=] gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: " ^ cc1: all warnings being treated as errors > +int arch_vcpu_setup_runstate(struct vcpu *v, > + struct vcpu_register_runstate_memory_area area) > +{ > + int rc; > + > + spin_lock(&v->arch.runstate_guest_lock); > + > + /* cleanup if we are recalled */ > + cleanup_runstate_vcpu_locked(v); > + > + rc = setup_runstate_vcpu_locked(v, (vaddr_t)area.addr.v); > + > + spin_unlock(&v->arch.runstate_guest_lock); > + > + return rc; > +} > + > + > +/* Update per-VCPU guest runstate shared memory area (if registered). */ > +static void update_runstate_area(struct vcpu *v) > +{ > + spin_lock(&v->arch.runstate_guest_lock); > + > + if ( v->arch.runstate_guest ) > + { > + if ( VM_ASSIST(v->domain, runstate_update_flag) ) > + { > + v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; > + v->arch.runstate_guest->state_entry_time |= XEN_RUNSTATE_UPDATE; Please use write_atomic (as suggested by Julien here: https://marc.info/?l=xen-devel&m=159225391619240) > + smp_wmb(); > + } > + > + memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate)); > + > + /* copy must be done before switching the bit */ > smp_wmb(); > - __raw_copy_to_guest(guest_handle, > - (void *)(&runstate.state_entry_time + 1) - 1, 1); > + > + if ( VM_ASSIST(v->domain, runstate_update_flag) ) > + { > + v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE; > + v->arch.runstate_guest->state_entry_time &= ~XEN_RUNSTATE_UPDATE; and here too The rest looks OK to me. > + } > } > + > + spin_unlock(&v->arch.runstate_guest_lock); > } > > static void schedule_tail(struct vcpu *prev)
On 28.07.2020 17:52, Bertrand Marquis wrote: > At the moment on Arm, a Linux guest running with KTPI enabled will > cause the following error when a context switch happens in user mode: > (XEN) p2m.c:1890: d1v0: Failed to walk page-table va 0xffffff837ebe0cd0 > > The error is caused by the virtual address for the runstate area > registered by the guest only being accessible when the guest is running > in kernel space when KPTI is enabled. > > To solve this issue, this patch is doing the translation from virtual > address to physical address during the hypercall and mapping the > required pages using vmap. This is removing the conversion from virtual > to physical address during the context switch which is solving the > problem with KPTI. > > This is done only on arm architecture, the behaviour on x86 is not > modified by this patch and the address conversion is done as before > during each context switch. > > This is introducing several limitations in comparison to the previous > behaviour (on arm only): > - if the guest is remapping the area at a different physical address Xen > will continue to update the area at the previous physical address. As > the area is in kernel space and usually defined as a global variable this > is something which is believed not to happen. If this is required by a > guest, it will have to call the hypercall with the new area (even if it > is at the same virtual address). > - the area needs to be mapped during the hypercall. For the same reasons > as for the previous case, even if the area is registered for a different > vcpu. It is believed that registering an area using a virtual address > unmapped is not something done. Beside me thinking that an in-use and stable ABI can't be changed like this, no matter what is "believed" kernel code may or may not do, I also don't think having arch-es diverge in behavior here is a good idea. Use of commonly available interfaces shouldn't lead to head aches or surprises when porting code from one arch to another. I'm pretty sure it was suggested before: Why don't you simply introduce a physical address based hypercall (and then also on x86 at the same time, keeping functional parity)? I even seem to recall giving a suggestion how to fit this into a future "physical addresses only" model, as long as we can settle on the basic principles of that conversion path that we want to go sooner or later anyway (as I understand). > --- a/xen/arch/x86/domain.c > +++ b/xen/arch/x86/domain.c > @@ -1642,6 +1642,30 @@ void paravirt_ctxt_switch_to(struct vcpu *v) > wrmsr_tsc_aux(v->arch.msrs->tsc_aux); > } > > +int arch_vcpu_setup_runstate(struct vcpu *v, > + struct vcpu_register_runstate_memory_area area) > +{ > + struct vcpu_runstate_info runstate; > + > + runstate_guest(v) = area.addr.h; > + > + if ( v == current ) > + { > + __copy_to_guest(runstate_guest(v), &v->runstate, 1); > + } Pointless braces (and I think there are more instances). > + else > + { > + vcpu_runstate_get(v, &runstate); > + __copy_to_guest(runstate_guest(v), &runstate, 1); > + } > + return 0; Missing blank line before main "return". Jan
> On 28 Jul 2020, at 21:04, Stefano Stabellini <sstabellini@kernel.org> wrote: > > On Tue, 28 Jul 2020, Bertrand Marquis wrote: >> At the moment on Arm, a Linux guest running with KTPI enabled will >> cause the following error when a context switch happens in user mode: >> (XEN) p2m.c:1890: d1v0: Failed to walk page-table va 0xffffff837ebe0cd0 >> >> The error is caused by the virtual address for the runstate area >> registered by the guest only being accessible when the guest is running >> in kernel space when KPTI is enabled. >> >> To solve this issue, this patch is doing the translation from virtual >> address to physical address during the hypercall and mapping the >> required pages using vmap. This is removing the conversion from virtual >> to physical address during the context switch which is solving the >> problem with KPTI. >> >> This is done only on arm architecture, the behaviour on x86 is not >> modified by this patch and the address conversion is done as before >> during each context switch. >> >> This is introducing several limitations in comparison to the previous >> behaviour (on arm only): >> - if the guest is remapping the area at a different physical address Xen >> will continue to update the area at the previous physical address. As >> the area is in kernel space and usually defined as a global variable this >> is something which is believed not to happen. If this is required by a >> guest, it will have to call the hypercall with the new area (even if it >> is at the same virtual address). >> - the area needs to be mapped during the hypercall. For the same reasons >> as for the previous case, even if the area is registered for a different >> vcpu. It is believed that registering an area using a virtual address >> unmapped is not something done. >> >> inline functions in headers could not be used as the architecture >> domain.h is included before the global domain.h making it impossible >> to use the struct vcpu inside the architecture header. >> This should not have any performance impact as the hypercall is only >> called once per vcpu usually. >> >> Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com> >> --- >> Changes in v2 >> - use vmap to map the pages during the hypercall. >> - reintroduce initial copy during hypercall. >> >> --- >> xen/arch/arm/domain.c | 160 +++++++++++++++++++++++++++++++---- >> xen/arch/x86/domain.c | 30 ++++++- >> xen/arch/x86/x86_64/domain.c | 4 +- >> xen/common/domain.c | 19 ++--- >> xen/include/asm-arm/domain.h | 9 ++ >> xen/include/asm-x86/domain.h | 16 ++++ >> xen/include/xen/domain.h | 5 ++ >> xen/include/xen/sched.h | 16 +--- >> 8 files changed, 207 insertions(+), 52 deletions(-) >> >> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c >> index 31169326b2..c595438bd9 100644 >> --- a/xen/arch/arm/domain.c >> +++ b/xen/arch/arm/domain.c >> @@ -19,6 +19,7 @@ >> #include <xen/sched.h> >> #include <xen/softirq.h> >> #include <xen/wait.h> >> +#include <xen/vmap.h> >> >> #include <asm/alternative.h> >> #include <asm/cpuerrata.h> >> @@ -275,36 +276,157 @@ static void ctxt_switch_to(struct vcpu *n) >> virt_timer_restore(n); >> } >> >> -/* Update per-VCPU guest runstate shared memory area (if registered). 
*/ >> -static void update_runstate_area(struct vcpu *v) >> +static void cleanup_runstate_vcpu_locked(struct vcpu *v) >> +{ >> + if ( v->arch.runstate_guest ) >> + { >> + vunmap((void *)((unsigned long)v->arch.runstate_guest & PAGE_MASK)); >> + >> + put_page(v->arch.runstate_guest_page[0]); >> + >> + if ( v->arch.runstate_guest_page[1] ) >> + { >> + put_page(v->arch.runstate_guest_page[1]); >> + } >> + v->arch.runstate_guest = NULL; >> + } >> +} >> + >> +void arch_vcpu_cleanup_runstate(struct vcpu *v) >> { >> - void __user *guest_handle = NULL; >> + spin_lock(&v->arch.runstate_guest_lock); >> + >> + cleanup_runstate_vcpu_locked(v); >> + >> + spin_unlock(&v->arch.runstate_guest_lock); >> +} >> + >> +static int setup_runstate_vcpu_locked(struct vcpu *v, vaddr_t vaddr) >> +{ >> + unsigned int offset; >> + mfn_t mfn[2]; >> + struct page_info *page; >> + unsigned int numpages; >> struct vcpu_runstate_info runstate; >> + void *p; >> >> - if ( guest_handle_is_null(runstate_guest(v)) ) >> - return; >> + /* user can pass a NULL address to unregister a previous area */ >> + if ( vaddr == 0 ) >> + return 0; >> >> - memcpy(&runstate, &v->runstate, sizeof(runstate)); >> + offset = vaddr & ~PAGE_MASK; >> >> - if ( VM_ASSIST(v->domain, runstate_update_flag) ) >> + /* provided address must be aligned to a 64bit */ >> + if ( offset % alignof(struct vcpu_runstate_info) ) >> { >> - guest_handle = &v->runstate_guest.p->state_entry_time + 1; >> - guest_handle--; >> - runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; >> - __raw_copy_to_guest(guest_handle, >> - (void *)(&runstate.state_entry_time + 1) - 1, 1); >> - smp_wmb(); >> + gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: " >> + "Invalid offset\n", vaddr); >> + return -EINVAL; >> + } >> + >> + page = get_page_from_gva(v, vaddr, GV2M_WRITE); >> + if ( !page ) >> + { >> + gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: " >> + "Page is not mapped\n", vaddr); >> + return -EINVAL; >> + } >> + mfn[0] = page_to_mfn(page); >> + v->arch.runstate_guest_page[0] = page; >> + >> + if ( offset > (PAGE_SIZE - sizeof(struct vcpu_runstate_info)) ) >> + { >> + /* guest area is crossing pages */ >> + page = get_page_from_gva(v, vaddr + PAGE_SIZE, GV2M_WRITE); >> + if ( !page ) >> + { >> + put_page(v->arch.runstate_guest_page[0]); >> + gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: " >> + "2nd Page is not mapped\n", vaddr); >> + return -EINVAL; >> + } >> + mfn[1] = page_to_mfn(page); >> + v->arch.runstate_guest_page[1] = page; >> + numpages = 2; >> + } >> + else >> + { >> + v->arch.runstate_guest_page[1] = NULL; >> + numpages = 1; >> + } >> + >> + p = vmap(mfn, numpages); >> + if ( !p ) >> + { >> + put_page(v->arch.runstate_guest_page[0]); >> + if ( numpages == 2 ) >> + put_page(v->arch.runstate_guest_page[1]); >> + >> + gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: " >> + "vmap error\n", vaddr); >> + return -EINVAL; >> } >> >> - __copy_to_guest(runstate_guest(v), &runstate, 1); >> + v->arch.runstate_guest = p + offset; >> >> - if ( guest_handle ) >> + if (v == current) >> { >> - runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE; >> + memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate)); >> + } >> + else >> + { >> + vcpu_runstate_get(v, &runstate); >> + memcpy(v->arch.runstate_guest, &runstate, sizeof(v->runstate)); >> + } >> + >> + return 0; >> +} > > > The arm32 build breaks with: > > domain.c: In function 'setup_runstate_vcpu_locked': > domain.c:322:9: error: format '%lx' expects argument of type 
'long unsigned int', but argument 3 has type 'vaddr_t' [-Werror=format=] > gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: " > ^ > domain.c:330:9: error: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'vaddr_t' [-Werror=format=] > gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: " > ^ > domain.c:344:13: error: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'vaddr_t' [-Werror=format=] > gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: " > ^ > domain.c:365:9: error: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'vaddr_t' [-Werror=format=] > gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: " > ^ > cc1: all warnings being treated as errors My bad. I tested x86 and arm64 build but forgot the arm32. I will fix that. > > >> +int arch_vcpu_setup_runstate(struct vcpu *v, >> + struct vcpu_register_runstate_memory_area area) >> +{ >> + int rc; >> + >> + spin_lock(&v->arch.runstate_guest_lock); >> + >> + /* cleanup if we are recalled */ >> + cleanup_runstate_vcpu_locked(v); >> + >> + rc = setup_runstate_vcpu_locked(v, (vaddr_t)area.addr.v); >> + >> + spin_unlock(&v->arch.runstate_guest_lock); >> + >> + return rc; >> +} >> + >> + >> +/* Update per-VCPU guest runstate shared memory area (if registered). */ >> +static void update_runstate_area(struct vcpu *v) >> +{ >> + spin_lock(&v->arch.runstate_guest_lock); >> + >> + if ( v->arch.runstate_guest ) >> + { >> + if ( VM_ASSIST(v->domain, runstate_update_flag) ) >> + { >> + v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE; >> + v->arch.runstate_guest->state_entry_time |= XEN_RUNSTATE_UPDATE; > > Please use write_atomic (as suggested by Julien here: > https://marc.info/?l=xen-devel&m=159225391619240) I will do that. > > >> + smp_wmb(); >> + } >> + >> + memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate)); >> + >> + /* copy must be done before switching the bit */ >> smp_wmb(); >> - __raw_copy_to_guest(guest_handle, >> - (void *)(&runstate.state_entry_time + 1) - 1, 1); >> + >> + if ( VM_ASSIST(v->domain, runstate_update_flag) ) >> + { >> + v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE; >> + v->arch.runstate_guest->state_entry_time &= ~XEN_RUNSTATE_UPDATE; > > and here too > > The rest looks OK to me. Thanks for the review. Regards Bertrand > > >> + } >> } >> + >> + spin_unlock(&v->arch.runstate_guest_lock); >> } >> >> static void schedule_tail(struct vcpu *prev)
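A sketch of the write_atomic() change requested above (and by Julien in the linked thread), applied to update_runstate_area(), assuming write_atomic() supports 64-bit stores on both arm32 and arm64:

    if ( v->arch.runstate_guest )
    {
        if ( VM_ASSIST(v->domain, runstate_update_flag) )
        {
            v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
            /* single atomic store instead of a plain read-modify-write */
            write_atomic(&v->arch.runstate_guest->state_entry_time,
                         v->runstate.state_entry_time);
            smp_wmb();
        }

        memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate));

        /* the copy must be visible before the update bit is cleared */
        smp_wmb();

        if ( VM_ASSIST(v->domain, runstate_update_flag) )
        {
            v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
            write_atomic(&v->arch.runstate_guest->state_entry_time,
                         v->runstate.state_entry_time);
        }
    }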
> On 28 Jul 2020, at 21:54, Jan Beulich <jbeulich@suse.com> wrote:
>
> On 28.07.2020 17:52, Bertrand Marquis wrote:
>> At the moment on Arm, a Linux guest running with KPTI enabled will
>> cause the following error when a context switch happens in user mode:
>> (XEN) p2m.c:1890: d1v0: Failed to walk page-table va 0xffffff837ebe0cd0
>>
>> The error is caused by the virtual address for the runstate area
>> registered by the guest only being accessible when the guest is running
>> in kernel space when KPTI is enabled.
>>
>> To solve this issue, this patch is doing the translation from virtual
>> address to physical address during the hypercall and mapping the
>> required pages using vmap. This is removing the conversion from virtual
>> to physical address during the context switch which is solving the
>> problem with KPTI.
>>
>> This is done only on arm architecture, the behaviour on x86 is not
>> modified by this patch and the address conversion is done as before
>> during each context switch.
>>
>> This is introducing several limitations in comparison to the previous
>> behaviour (on arm only):
>> - if the guest is remapping the area at a different physical address Xen
>>   will continue to update the area at the previous physical address. As
>>   the area is in kernel space and usually defined as a global variable
>>   this is something which is believed not to happen. If this is required
>>   by a guest, it will have to call the hypercall with the new area (even
>>   if it is at the same virtual address).
>> - the area needs to be mapped during the hypercall. For the same reasons
>>   as for the previous case, even if the area is registered for a different
>>   vcpu. It is believed that registering an area using a virtual address
>>   unmapped is not something done.
>
> Beside me thinking that an in-use and stable ABI can't be changed like
> this, no matter what is "believed" kernel code may or may not do, I
> also don't think having arch-es diverge in behavior here is a good
> idea. Use of commonly available interfaces shouldn't lead to head
> aches or surprises when porting code from one arch to another. I'm
> pretty sure it was suggested before: Why don't you simply introduce
> a physical address based hypercall (and then also on x86 at the same
> time, keeping functional parity)? I even seem to recall giving a
> suggestion how to fit this into a future "physical addresses only"
> model, as long as we can settle on the basic principles of that
> conversion path that we want to go sooner or later anyway (as I
> understand).

I fully agree with the "physical address only" model and I think it must
be done. Introducing a new hypercall taking a physical address as
parameter is the long term solution (and I would even volunteer to do it
in a new patchset). But this would not solve the issue here unless Linux
is modified. So I do see this patch as a "bug fix".

>
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -1642,6 +1642,30 @@ void paravirt_ctxt_switch_to(struct vcpu *v)
>>      wrmsr_tsc_aux(v->arch.msrs->tsc_aux);
>>  }
>> +int arch_vcpu_setup_runstate(struct vcpu *v,
>> +                             struct vcpu_register_runstate_memory_area area)
>> +{
>> +    struct vcpu_runstate_info runstate;
>> +
>> +    runstate_guest(v) = area.addr.h;
>> +
>> +    if ( v == current )
>> +    {
>> +        __copy_to_guest(runstate_guest(v), &v->runstate, 1);
>> +    }
>
> Pointless braces (and I think there are more instances).

So:

    if cond
        instruction
    else
    {
        xxx
    }

is something that should be done in Xen?

Sorry if I make those kinds of mistakes in the future, as I am more used
to a model where having no braces is an absolute no-go. I will try to
remember this.

>
>> +    else
>> +    {
>> +        vcpu_runstate_get(v, &runstate);
>> +        __copy_to_guest(runstate_guest(v), &runstate, 1);
>> +    }
>> +    return 0;
>
> Missing blank line before main "return".

I will fix it.

Bertrand
On 29.07.2020 09:08, Bertrand Marquis wrote: > > >> On 28 Jul 2020, at 21:54, Jan Beulich <jbeulich@suse.com> wrote: >> >> On 28.07.2020 17:52, Bertrand Marquis wrote: >>> At the moment on Arm, a Linux guest running with KTPI enabled will >>> cause the following error when a context switch happens in user mode: >>> (XEN) p2m.c:1890: d1v0: Failed to walk page-table va 0xffffff837ebe0cd0 >>> The error is caused by the virtual address for the runstate area >>> registered by the guest only being accessible when the guest is running >>> in kernel space when KPTI is enabled. >>> To solve this issue, this patch is doing the translation from virtual >>> address to physical address during the hypercall and mapping the >>> required pages using vmap. This is removing the conversion from virtual >>> to physical address during the context switch which is solving the >>> problem with KPTI. >>> This is done only on arm architecture, the behaviour on x86 is not >>> modified by this patch and the address conversion is done as before >>> during each context switch. >>> This is introducing several limitations in comparison to the previous >>> behaviour (on arm only): >>> - if the guest is remapping the area at a different physical address Xen >>> will continue to update the area at the previous physical address. As >>> the area is in kernel space and usually defined as a global variable this >>> is something which is believed not to happen. If this is required by a >>> guest, it will have to call the hypercall with the new area (even if it >>> is at the same virtual address). >>> - the area needs to be mapped during the hypercall. For the same reasons >>> as for the previous case, even if the area is registered for a different >>> vcpu. It is believed that registering an area using a virtual address >>> unmapped is not something done. >> >> Beside me thinking that an in-use and stable ABI can't be changed like >> this, no matter what is "believed" kernel code may or may not do, I >> also don't think having arch-es diverge in behavior here is a good >> idea. Use of commonly available interfaces shouldn't lead to head >> aches or surprises when porting code from one arch to another. I'm >> pretty sure it was suggested before: Why don't you simply introduce >> a physical address based hypercall (and then also on x86 at the same >> time, keeping functional parity)? I even seem to recall giving a >> suggestion how to fit this into a future "physical addresses only" >> model, as long as we can settle on the basic principles of that >> conversion path that we want to go sooner or later anyway (as I >> understand). > > I fully agree with the “physical address only” model and i think it must be > done. Introducing a new hypercall taking a physical address as parameter > is the long term solution (and I would even volunteer to do it in a new patchset). > But this would not solve the issue here unless linux is modified. > So I do see this patch as a “bug fix”. Well, it is sort of implied by my previous reply that we won't get away without an OS side change here. The prereq to get away without would be that it is okay to change the behavior of a hypercall like you do, and that it is okay to make the behavior diverge between arch-es. I think I've made pretty clear that I don't think either is really an option. 
>>> --- a/xen/arch/x86/domain.c >>> +++ b/xen/arch/x86/domain.c >>> @@ -1642,6 +1642,30 @@ void paravirt_ctxt_switch_to(struct vcpu *v) >>> wrmsr_tsc_aux(v->arch.msrs->tsc_aux); >>> } >>> +int arch_vcpu_setup_runstate(struct vcpu *v, >>> + struct vcpu_register_runstate_memory_area area) >>> +{ >>> + struct vcpu_runstate_info runstate; >>> + >>> + runstate_guest(v) = area.addr.h; >>> + >>> + if ( v == current ) >>> + { >>> + __copy_to_guest(runstate_guest(v), &v->runstate, 1); >>> + } >> >> Pointless braces (and I think there are more instances). > > So: > if cond > instruction > else > { > xxx > } > > is something that should be done in Xen ? Yes. Especially in coding styles placing opening braces on their own lines this is, afaik, quite common. Jan
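Applied to the hunk under discussion, the convention Jan describes (plus the blank line he asks for before the final return) would read roughly as follows:

    int arch_vcpu_setup_runstate(struct vcpu *v,
                                 struct vcpu_register_runstate_memory_area area)
    {
        struct vcpu_runstate_info runstate;

        runstate_guest(v) = area.addr.h;

        /* single statement: no braces in the Xen coding style */
        if ( v == current )
            __copy_to_guest(runstate_guest(v), &v->runstate, 1);
        else
        {
            /* multiple statements: braces, opening one on its own line */
            vcpu_runstate_get(v, &runstate);
            __copy_to_guest(runstate_guest(v), &runstate, 1);
        }

        return 0;
    }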
On Wed, 29 Jul 2020, Jan Beulich wrote: > On 29.07.2020 09:08, Bertrand Marquis wrote: > > > On 28 Jul 2020, at 21:54, Jan Beulich <jbeulich@suse.com> wrote: > > > > > > On 28.07.2020 17:52, Bertrand Marquis wrote: > > > > At the moment on Arm, a Linux guest running with KTPI enabled will > > > > cause the following error when a context switch happens in user mode: > > > > (XEN) p2m.c:1890: d1v0: Failed to walk page-table va 0xffffff837ebe0cd0 > > > > The error is caused by the virtual address for the runstate area > > > > registered by the guest only being accessible when the guest is running > > > > in kernel space when KPTI is enabled. > > > > To solve this issue, this patch is doing the translation from virtual > > > > address to physical address during the hypercall and mapping the > > > > required pages using vmap. This is removing the conversion from virtual > > > > to physical address during the context switch which is solving the > > > > problem with KPTI. > > > > This is done only on arm architecture, the behaviour on x86 is not > > > > modified by this patch and the address conversion is done as before > > > > during each context switch. > > > > This is introducing several limitations in comparison to the previous > > > > behaviour (on arm only): > > > > - if the guest is remapping the area at a different physical address Xen > > > > will continue to update the area at the previous physical address. As > > > > the area is in kernel space and usually defined as a global variable > > > > this > > > > is something which is believed not to happen. If this is required by a > > > > guest, it will have to call the hypercall with the new area (even if it > > > > is at the same virtual address). > > > > - the area needs to be mapped during the hypercall. For the same reasons > > > > as for the previous case, even if the area is registered for a different > > > > vcpu. It is believed that registering an area using a virtual address > > > > unmapped is not something done. > > > > > > Beside me thinking that an in-use and stable ABI can't be changed like > > > this, no matter what is "believed" kernel code may or may not do, I > > > also don't think having arch-es diverge in behavior here is a good > > > idea. Use of commonly available interfaces shouldn't lead to head > > > aches or surprises when porting code from one arch to another. I'm > > > pretty sure it was suggested before: Why don't you simply introduce > > > a physical address based hypercall (and then also on x86 at the same > > > time, keeping functional parity)? I even seem to recall giving a > > > suggestion how to fit this into a future "physical addresses only" > > > model, as long as we can settle on the basic principles of that > > > conversion path that we want to go sooner or later anyway (as I > > > understand). > > > > I fully agree with the “physical address only” model and i think it must be > > done. Introducing a new hypercall taking a physical address as parameter > > is the long term solution (and I would even volunteer to do it in a new > > patchset). > > But this would not solve the issue here unless linux is modified. > > So I do see this patch as a “bug fix”. > > Well, it is sort of implied by my previous reply that we won't get away > without an OS side change here. The prereq to get away without would be > that it is okay to change the behavior of a hypercall like you do, and > that it is okay to make the behavior diverge between arch-es. 
I think > I've made pretty clear that I don't think either is really an option. Hi Jan, This is a difficult problem to solve and the current situation honestly sucks: there is no way to solve the problem without making compromises. The new hypercall is good-to-have in any case (it is a better interface) but it is not a full solution. If we introduce a new hypercall we fix new guests but don't fix existing guests. If we change Linux in any way, we are still going to have problems with all already-released kernel binaries. Leaving the issue unfixed is not an option either because the problem can't be ignored. There is no simple way out of this situation. After careful considerations and many discussions (most of them on xen-devel [1][2]) we came to the realization that changing the behavior for this hypercall on ARM is the best we can do: 1) it solves the problem for existing guests 2) it has no ill effects as far as we know Are we happy about it? We are not. It is the best we can do given the constraints. The rule to never change hypercalls is a good one to have and I fully support it, but in this unique circumstance it is best to make an exception. The idea is to minimize the impact to x86 so that's why Bertrand's patch is introducing a divergence between Arm and x86 on this hypercall. In summary, please consider this patch together with its context. This has been the only time to my memory when we had to do this -- it is certainly a unique situation. [1] https://marc.info/?l=xen-devel&m=158946660216159 [2] https://marc.info/?l=xen-devel&m=159067967432381
On 30.07.2020 03:30, Stefano Stabellini wrote: > On Wed, 29 Jul 2020, Jan Beulich wrote: >> On 29.07.2020 09:08, Bertrand Marquis wrote: >>>> On 28 Jul 2020, at 21:54, Jan Beulich <jbeulich@suse.com> wrote: >>>> On 28.07.2020 17:52, Bertrand Marquis wrote: >>>>> At the moment on Arm, a Linux guest running with KTPI enabled will >>>>> cause the following error when a context switch happens in user mode: >>>>> (XEN) p2m.c:1890: d1v0: Failed to walk page-table va 0xffffff837ebe0cd0 >>>>> The error is caused by the virtual address for the runstate area >>>>> registered by the guest only being accessible when the guest is running >>>>> in kernel space when KPTI is enabled. >>>>> To solve this issue, this patch is doing the translation from virtual >>>>> address to physical address during the hypercall and mapping the >>>>> required pages using vmap. This is removing the conversion from virtual >>>>> to physical address during the context switch which is solving the >>>>> problem with KPTI. >>>>> This is done only on arm architecture, the behaviour on x86 is not >>>>> modified by this patch and the address conversion is done as before >>>>> during each context switch. >>>>> This is introducing several limitations in comparison to the previous >>>>> behaviour (on arm only): >>>>> - if the guest is remapping the area at a different physical address Xen >>>>> will continue to update the area at the previous physical address. As >>>>> the area is in kernel space and usually defined as a global variable >>>>> this >>>>> is something which is believed not to happen. If this is required by a >>>>> guest, it will have to call the hypercall with the new area (even if it >>>>> is at the same virtual address). >>>>> - the area needs to be mapped during the hypercall. For the same reasons >>>>> as for the previous case, even if the area is registered for a different >>>>> vcpu. It is believed that registering an area using a virtual address >>>>> unmapped is not something done. >>>> >>>> Beside me thinking that an in-use and stable ABI can't be changed like >>>> this, no matter what is "believed" kernel code may or may not do, I >>>> also don't think having arch-es diverge in behavior here is a good >>>> idea. Use of commonly available interfaces shouldn't lead to head >>>> aches or surprises when porting code from one arch to another. I'm >>>> pretty sure it was suggested before: Why don't you simply introduce >>>> a physical address based hypercall (and then also on x86 at the same >>>> time, keeping functional parity)? I even seem to recall giving a >>>> suggestion how to fit this into a future "physical addresses only" >>>> model, as long as we can settle on the basic principles of that >>>> conversion path that we want to go sooner or later anyway (as I >>>> understand). >>> >>> I fully agree with the “physical address only” model and i think it must be >>> done. Introducing a new hypercall taking a physical address as parameter >>> is the long term solution (and I would even volunteer to do it in a new >>> patchset). >>> But this would not solve the issue here unless linux is modified. >>> So I do see this patch as a “bug fix”. >> >> Well, it is sort of implied by my previous reply that we won't get away >> without an OS side change here. The prereq to get away without would be >> that it is okay to change the behavior of a hypercall like you do, and >> that it is okay to make the behavior diverge between arch-es. I think >> I've made pretty clear that I don't think either is really an option. 
> > This is a difficult problem to solve and the current situation honestly > sucks: there is no way to solve the problem without making compromises. > > The new hypercall is good-to-have in any case (it is a better interface) > but it is not a full solution. If we introduce a new hypercall we fix > new guests but don't fix existing guests. If we change Linux in any way, > we are still going to have problems with all already-released kernel > binaries. Leaving the issue unfixed is not an option either because the > problem can't be ignored. We're fixing other issues without breaking the ABI. Where's the problem of backporting the kernel side change (which I anticipate to not be overly involved)? If the plan remains to be to make an ABI breaking change, then I think this will need an explicit vote. Jan
Hi Jan,

On 31/07/2020 07:39, Jan Beulich wrote:
> We're fixing other issues without breaking the ABI. Where's the
> problem of backporting the kernel side change (which I anticipate
> to not be overly involved)?

This means you can't take advantage of the runstate on existing Linux
without any modification.

> If the plan remains to be to make an ABI breaking change,

From a theoretical PoV, this is an ABI breakage. However, I fail to see
how the restrictions added would affect OSes, at least on Arm.

In particular, you can't change the VA -> PA mapping on Arm without going
through an invalid mapping. So I wouldn't expect this to happen for the
runstate.

The only part that *may* be an issue is if the guest is registering the
runstate with an initially invalid VA. Although, I have yet to see that
in practice. Maybe you know?

> then I
> think this will need an explicit vote.

I was under the impression that the two Arm maintainers (Stefano and I)
already agreed with the approach here. Therefore, given the ABI breakage
is only affecting Arm, why would we need a vote?

Cheers,
On 31.07.2020 12:12, Julien Grall wrote:
> On 31/07/2020 07:39, Jan Beulich wrote:
>> We're fixing other issues without breaking the ABI. Where's the
>> problem of backporting the kernel side change (which I anticipate
>> to not be overly involved)?
> This means you can't take advantage of the runstate on existing Linux
> without any modification.
>
>> If the plan remains to be to make an ABI breaking change,
>
> From a theoretical PoV, this is an ABI breakage. However, I fail to see
> how the restrictions added would affect OSes, at least on Arm.

"OSes" covering what? Just Linux?

> In particular, you can't change the VA -> PA mapping on Arm without going
> through an invalid mapping. So I wouldn't expect this to happen for the
> runstate.
>
> The only part that *may* be an issue is if the guest is registering the
> runstate with an initially invalid VA. Although, I have yet to see that
> in practice. Maybe you know?

I'm unaware of any such use, but this means close to nothing.

>> then I
>> think this will need an explicit vote.
>
> I was under the impression that the two Arm maintainers (Stefano and I)
> already agreed with the approach here. Therefore, given the ABI breakage
> is only affecting Arm, why would we need a vote?

The problem here is of a conceptual nature: you're planning to make the
behavior of a common hypercall diverge between architectures, and in a
retroactive fashion. Imo that's nothing we should do even for new
hypercalls, if _at all_ avoidable. If we allow this here, we'll have a
precedent that people later may (and based on my experience will, sooner
or later) reference to get their own change justified.

Jan
> On 31 Jul 2020, at 12:18, Jan Beulich <jbeulich@suse.com> wrote:
>
> On 31.07.2020 12:12, Julien Grall wrote:
>> On 31/07/2020 07:39, Jan Beulich wrote:
>>> We're fixing other issues without breaking the ABI. Where's the
>>> problem of backporting the kernel side change (which I anticipate
>>> to not be overly involved)?
>> This means you can't take advantage of the runstate on existing Linux
>> without any modification.
>>
>>> If the plan remains to be to make an ABI breaking change,
>>
>> From a theoretical PoV, this is an ABI breakage. However, I fail to see
>> how the restrictions added would affect OSes, at least on Arm.
>
> "OSes" covering what? Just Linux?
>
>> In particular, you can't change the VA -> PA mapping on Arm without going
>> through an invalid mapping. So I wouldn't expect this to happen for the
>> runstate.
>>
>> The only part that *may* be an issue is if the guest is registering the
>> runstate with an initially invalid VA. Although, I have yet to see that
>> in practice. Maybe you know?
>
> I'm unaware of any such use, but this means close to nothing.
>
>>> then I
>>> think this will need an explicit vote.
>>
>> I was under the impression that the two Arm maintainers (Stefano and I)
>> already agreed with the approach here. Therefore, given the ABI breakage
>> is only affecting Arm, why would we need a vote?
>
> The problem here is of a conceptual nature: you're planning to make the
> behavior of a common hypercall diverge between architectures, and in a
> retroactive fashion. Imo that's nothing we should do even for new
> hypercalls, if _at all_ avoidable. If we allow this here, we'll have a
> precedent that people later may (and based on my experience will, sooner
> or later) reference to get their own change justified.

After a discussion with Jan, he is proposing to have a guest config setting
to turn on or off the translation of the address during the hypercall, and
to add a global Xen command line parameter to set the global default
behaviour. Once this was done on Arm, it could also be done on x86; the
current behaviour would be kept by default, but it would be possible to
modify it by configuration.

@Jan: please correct me if I said something wrong
@others: what is your view on this solution?

Bertrand
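No concrete interface was posted for this proposal. Purely as an illustration, the global command line default could be wired up as below; the parameter name, variable and helpers are hypothetical, and this assumes Xen's boolean_param() registration macro:

    /*
     * Hypothetical global default: when true, the runstate address is
     * translated and mapped once at registration time; when false, the
     * existing per-context-switch translation is kept.
     */
    static bool __read_mostly opt_runstate_map;
    boolean_param("runstate-map", opt_runstate_map);

    int arch_vcpu_setup_runstate(struct vcpu *v,
                                 struct vcpu_register_runstate_memory_area area)
    {
        /* a per-guest config setting would override the global default */
        if ( !opt_runstate_map )
            return setup_runstate_legacy(v, area);  /* hypothetical helper */

        return setup_runstate_mapped(v, area);      /* hypothetical helper */
    }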
On Fri, 31 Jul 2020, Bertrand Marquis wrote: > > On 31 Jul 2020, at 12:18, Jan Beulich <jbeulich@suse.com> wrote: > > > > On 31.07.2020 12:12, Julien Grall wrote: > >> On 31/07/2020 07:39, Jan Beulich wrote: > >>> We're fixing other issues without breaking the ABI. Where's the > >>> problem of backporting the kernel side change (which I anticipate > >>> to not be overly involved)? > >> This means you can't take advantage of the runstage on existing Linux > >> without any modification. > >> > >>> If the plan remains to be to make an ABI breaking change, > >> > >> For a theoritical PoV, this is a ABI breakage. However, I fail to see > >> how the restrictions added would affect OSes at least on Arm. > > > > "OSes" covering what? Just Linux? > > > >> In particular, you can't change the VA -> PA on Arm without going > >> through an invalid mapping. So I wouldn't expect this to happen for the > >> runstate. > >> > >> The only part that *may* be an issue is if the guest is registering the > >> runstate with an initially invalid VA. Although, I have yet to see that > >> in practice. Maybe you know? > > > > I'm unaware of any such use, but this means close to nothing. > > > >>> then I > >>> think this will need an explicit vote. > >> > >> I was under the impression that the two Arm maintainers (Stefano and I) > >> already agreed with the approach here. Therefore, given the ABI breakage > >> is only affecting Arm, why would we need a vote? > > > > The problem here is of conceptual nature: You're planning to > > make the behavior of a common hypercall diverge between > > architectures, and in a retroactive fashion. Imo that's nothing > > we should do even for new hypercalls, if _at all_ avoidable. If > > we allow this here, we'll have a precedent that people later > > may (and based on my experience will, sooner or later) reference, > > to get their own change justified. Please let's avoid "slippery slope" arguments (https://en.wikipedia.org/wiki/Slippery_slope) We shouldn't consider this instance as the first in a long series of bad decisions on hypercall compatibility. Each new case, if there will be any, will have to be considered based on its own merits. Also, let's keep in mind that there have been no other cases in the last 8 years. (I would like to repeat my support for hypercall ABI compatibility.) I would also kindly ask not to put the discussion on a "conceptual" level: there is no way to fix all guests and also keep compatibility. From a conceptual point of view, it is already game over :-) > After a discussion with Jan, he is proposing to have a guest config setting to > turn on or off the translation of the address during the hypercall and add a > global Xen command line parameter to set the global default behaviour. > With this was done on arm could be done on x86 and the current behaviour > would be kept by default but possible to modify by configuration. > > @Jan: please correct me if i said something wrong > @others: what is your view on this solution ? Having options to turn on or off the new behavior could be good-to-have if we find a guest that actually requires the old behavior. Today we don't know of any such cases. We have strong reasons to believe that there aren't any on ARM (see Julien's explanation in regards to the temporary invalid mappings.) In fact, it is one of the factors that led us to think this patch is the right approach. That said, I am also OK with adding such a parameter now, but we need to choose the default value carefully. 
We need the new behavior as default on ARM because we need the fix to work for all guests. I don't think we want to explain how you always need to set config_foobar otherwise things don't work. It has to work out of the box. It would be nice if we had the same default on x86 too, although I understand if Jan and Andrew don't want to make the same change on x86, at least initially.
> On 1 Aug 2020, at 00:03, Stefano Stabellini <sstabellini@kernel.org> wrote: > > On Fri, 31 Jul 2020, Bertrand Marquis wrote: >>> On 31 Jul 2020, at 12:18, Jan Beulich <jbeulich@suse.com> wrote: >>> >>> On 31.07.2020 12:12, Julien Grall wrote: >>>> On 31/07/2020 07:39, Jan Beulich wrote: >>>>> We're fixing other issues without breaking the ABI. Where's the >>>>> problem of backporting the kernel side change (which I anticipate >>>>> to not be overly involved)? >>>> This means you can't take advantage of the runstage on existing Linux >>>> without any modification. >>>> >>>>> If the plan remains to be to make an ABI breaking change, >>>> >>>> For a theoritical PoV, this is a ABI breakage. However, I fail to see >>>> how the restrictions added would affect OSes at least on Arm. >>> >>> "OSes" covering what? Just Linux? >>> >>>> In particular, you can't change the VA -> PA on Arm without going >>>> through an invalid mapping. So I wouldn't expect this to happen for the >>>> runstate. >>>> >>>> The only part that *may* be an issue is if the guest is registering the >>>> runstate with an initially invalid VA. Although, I have yet to see that >>>> in practice. Maybe you know? >>> >>> I'm unaware of any such use, but this means close to nothing. >>> >>>>> then I >>>>> think this will need an explicit vote. >>>> >>>> I was under the impression that the two Arm maintainers (Stefano and I) >>>> already agreed with the approach here. Therefore, given the ABI breakage >>>> is only affecting Arm, why would we need a vote? >>> >>> The problem here is of conceptual nature: You're planning to >>> make the behavior of a common hypercall diverge between >>> architectures, and in a retroactive fashion. Imo that's nothing >>> we should do even for new hypercalls, if _at all_ avoidable. If >>> we allow this here, we'll have a precedent that people later >>> may (and based on my experience will, sooner or later) reference, >>> to get their own change justified. > > Please let's avoid "slippery slope" arguments > (https://en.wikipedia.org/wiki/Slippery_slope) > > We shouldn't consider this instance as the first in a long series of bad > decisions on hypercall compatibility. Each new case, if there will be > any, will have to be considered based on its own merits. Also, let's > keep in mind that there have been no other cases in the last 8 years. (I > would like to repeat my support for hypercall ABI compatibility.) > > > I would also kindly ask not to put the discussion on a "conceptual" > level: there is no way to fix all guests and also keep compatibility. > From a conceptual point of view, it is already game over :-) > > >> After a discussion with Jan, he is proposing to have a guest config setting to >> turn on or off the translation of the address during the hypercall and add a >> global Xen command line parameter to set the global default behaviour. >> With this was done on arm could be done on x86 and the current behaviour >> would be kept by default but possible to modify by configuration. >> >> @Jan: please correct me if i said something wrong >> @others: what is your view on this solution ? > > Having options to turn on or off the new behavior could be good-to-have > if we find a guest that actually requires the old behavior. Today we > don't know of any such cases. We have strong reasons to believe that > there aren't any on ARM (see Julien's explanation in regards to the > temporary invalid mappings.) In fact, it is one of the factors that led > us to think this patch is the right approach. 
>
> That said, I am also OK with adding such a parameter now, but we need to
> choose the default value carefully.

This would also mean keeping support in the code for the old and the new
behaviour, which might make the code bigger and more complex.

>
> We need the new behavior as default on ARM because we need the fix to
> work for all guests. I don't think we want to explain how you always
> need to set config_foobar otherwise things don't work. It has to work
> out of the box.
>
> It would be nice if we had the same default on x86 too, although I
> understand if Jan and Andrew don't want to make the same change on x86,
> at least initially.

So you mean here adding a parameter, but only on Arm?
Should it be a command line parameter? A configuration parameter? Both?

It seems that with this patch I touched some kind of sensitive area.
Should I just abandon it and see later to work on adding a new hypercall
using a physical address as parameter?

Cheers
Bertrand
Hi, Sorry for the late answer. On 14/08/2020 10:25, Bertrand Marquis wrote: > > >> On 1 Aug 2020, at 00:03, Stefano Stabellini <sstabellini@kernel.org> wrote: >> >> On Fri, 31 Jul 2020, Bertrand Marquis wrote: >>>> On 31 Jul 2020, at 12:18, Jan Beulich <jbeulich@suse.com> wrote: >>>> >>>> On 31.07.2020 12:12, Julien Grall wrote: >>>>> On 31/07/2020 07:39, Jan Beulich wrote: >>>>>> We're fixing other issues without breaking the ABI. Where's the >>>>>> problem of backporting the kernel side change (which I anticipate >>>>>> to not be overly involved)? >>>>> This means you can't take advantage of the runstage on existing Linux >>>>> without any modification. >>>>> >>>>>> If the plan remains to be to make an ABI breaking change, >>>>> >>>>> For a theoritical PoV, this is a ABI breakage. However, I fail to see >>>>> how the restrictions added would affect OSes at least on Arm. >>>> >>>> "OSes" covering what? Just Linux? >>>> >>>>> In particular, you can't change the VA -> PA on Arm without going >>>>> through an invalid mapping. So I wouldn't expect this to happen for the >>>>> runstate. >>>>> >>>>> The only part that *may* be an issue is if the guest is registering the >>>>> runstate with an initially invalid VA. Although, I have yet to see that >>>>> in practice. Maybe you know? >>>> >>>> I'm unaware of any such use, but this means close to nothing. >>>> >>>>>> then I >>>>>> think this will need an explicit vote. >>>>> >>>>> I was under the impression that the two Arm maintainers (Stefano and I) >>>>> already agreed with the approach here. Therefore, given the ABI breakage >>>>> is only affecting Arm, why would we need a vote? >>>> >>>> The problem here is of conceptual nature: You're planning to >>>> make the behavior of a common hypercall diverge between >>>> architectures, and in a retroactive fashion. Imo that's nothing >>>> we should do even for new hypercalls, if _at all_ avoidable. If >>>> we allow this here, we'll have a precedent that people later >>>> may (and based on my experience will, sooner or later) reference, >>>> to get their own change justified. >> >> Please let's avoid "slippery slope" arguments >> (https://en.wikipedia.org/wiki/Slippery_slope) >> >> We shouldn't consider this instance as the first in a long series of bad >> decisions on hypercall compatibility. Each new case, if there will be >> any, will have to be considered based on its own merits. Also, let's >> keep in mind that there have been no other cases in the last 8 years. (I >> would like to repeat my support for hypercall ABI compatibility.) >> >> >> I would also kindly ask not to put the discussion on a "conceptual" >> level: there is no way to fix all guests and also keep compatibility. >> From a conceptual point of view, it is already game over :-) >> >> >>> After a discussion with Jan, he is proposing to have a guest config setting to >>> turn on or off the translation of the address during the hypercall and add a >>> global Xen command line parameter to set the global default behaviour. >>> With this was done on arm could be done on x86 and the current behaviour >>> would be kept by default but possible to modify by configuration. >>> >>> @Jan: please correct me if i said something wrong >>> @others: what is your view on this solution ? >> >> Having options to turn on or off the new behavior could be good-to-have >> if we find a guest that actually requires the old behavior. Today we >> don't know of any such cases. 
>> We have strong reasons to believe that
>> there aren't any on ARM (see Julien's explanation in regards to the
>> temporary invalid mappings.) In fact, it is one of the factors that led
>> us to think this patch is the right approach.
>>
>> That said, I am also OK with adding such a parameter now, but we need to
>> choose the default value carefully.

I agree with that :).

> This would also mean keeping support in the code for the old and the new
> behaviour, which might make the code bigger and more complex.

I am concerned with that as well. However, this concern is also going to
be true if we introduce a hypercall using a physical address as parameter.
Indeed, the old hypercall will not go away.

If we introduce a second hypercall, you will also have to think about the
interactions between the two. For instance:
 - The firmware may register the runstate using the old hypercall, while
   the OS may register using the new hypercall.
 - Can an OS use a mix of the two hypercalls?

For more details, you can have a look at the original attempt for a new
hypercall (see [1]).

The approach you discussed with Jan has the advantage of not requiring any
change in the guest software stack. So this would be my preference over a
new hypercall.

>>
>> We need the new behavior as default on ARM because we need the fix to
>> work for all guests. I don't think we want to explain how you always
>> need to set config_foobar otherwise things don't work. It has to work
>> out of the box.
>>
>> It would be nice if we had the same default on x86 too, although I
>> understand if Jan and Andrew don't want to make the same change on x86,
>> at least initially.
>
> So you mean here adding a parameter, but only on Arm?
> Should it be a command line parameter? A configuration parameter? Both?
>
> It seems that with this patch I touched some kind of sensitive area.
> Should I just abandon it and see later to work on adding a new hypercall
> using a physical address as parameter?

I would suggest mentioning the thread in the next community call.

Cheers,

[1] <1558721577-13958-3-git-send-email-andrii.anisov@gmail.com>
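For context, the earlier attempt referenced in [1] introduced a guest-physical variant of the registration hypercall. A rough sketch of the shape such an interface could take; the op number, structure and names here are illustrative rather than the ABI from that series:

    /*
     * Hypothetical guest-physical counterpart to
     * VCPUOP_register_runstate_memory_area; the number is a placeholder.
     */
    #define VCPUOP_register_runstate_phys_area 40

    struct vcpu_register_runstate_phys_area {
        uint64_t addr; /* guest physical address of the runstate area */
    };

Even with such an interface, the interaction rules raised above (for example, which registration wins when firmware used the virtual variant and the OS later uses the physical one) would still need to be specified.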
diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 31169326b2..c595438bd9 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -19,6 +19,7 @@
 #include <xen/sched.h>
 #include <xen/softirq.h>
 #include <xen/wait.h>
+#include <xen/vmap.h>
 
 #include <asm/alternative.h>
 #include <asm/cpuerrata.h>
@@ -275,36 +276,157 @@ static void ctxt_switch_to(struct vcpu *n)
     virt_timer_restore(n);
 }
 
-/* Update per-VCPU guest runstate shared memory area (if registered). */
-static void update_runstate_area(struct vcpu *v)
+static void cleanup_runstate_vcpu_locked(struct vcpu *v)
+{
+    if ( v->arch.runstate_guest )
+    {
+        vunmap((void *)((unsigned long)v->arch.runstate_guest & PAGE_MASK));
+
+        put_page(v->arch.runstate_guest_page[0]);
+
+        if ( v->arch.runstate_guest_page[1] )
+        {
+            put_page(v->arch.runstate_guest_page[1]);
+        }
+        v->arch.runstate_guest = NULL;
+    }
+}
+
+void arch_vcpu_cleanup_runstate(struct vcpu *v)
 {
-    void __user *guest_handle = NULL;
+    spin_lock(&v->arch.runstate_guest_lock);
+
+    cleanup_runstate_vcpu_locked(v);
+
+    spin_unlock(&v->arch.runstate_guest_lock);
+}
+
+static int setup_runstate_vcpu_locked(struct vcpu *v, vaddr_t vaddr)
+{
+    unsigned int offset;
+    mfn_t mfn[2];
+    struct page_info *page;
+    unsigned int numpages;
     struct vcpu_runstate_info runstate;
+    void *p;
 
-    if ( guest_handle_is_null(runstate_guest(v)) )
-        return;
+    /* user can pass a NULL address to unregister a previous area */
+    if ( vaddr == 0 )
+        return 0;
 
-    memcpy(&runstate, &v->runstate, sizeof(runstate));
+    offset = vaddr & ~PAGE_MASK;
 
-    if ( VM_ASSIST(v->domain, runstate_update_flag) )
+    /* provided address must be aligned to a 64bit */
+    if ( offset % alignof(struct vcpu_runstate_info) )
     {
-        guest_handle = &v->runstate_guest.p->state_entry_time + 1;
-        guest_handle--;
-        runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
-        __raw_copy_to_guest(guest_handle,
-                            (void *)(&runstate.state_entry_time + 1) - 1, 1);
-        smp_wmb();
+        gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: "
+                "Invalid offset\n", vaddr);
+        return -EINVAL;
+    }
+
+    page = get_page_from_gva(v, vaddr, GV2M_WRITE);
+    if ( !page )
+    {
+        gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: "
+                "Page is not mapped\n", vaddr);
+        return -EINVAL;
+    }
+    mfn[0] = page_to_mfn(page);
+    v->arch.runstate_guest_page[0] = page;
+
+    if ( offset > (PAGE_SIZE - sizeof(struct vcpu_runstate_info)) )
+    {
+        /* guest area is crossing pages */
+        page = get_page_from_gva(v, vaddr + PAGE_SIZE, GV2M_WRITE);
+        if ( !page )
+        {
+            put_page(v->arch.runstate_guest_page[0]);
+            gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: "
+                    "2nd Page is not mapped\n", vaddr);
+            return -EINVAL;
+        }
+        mfn[1] = page_to_mfn(page);
+        v->arch.runstate_guest_page[1] = page;
+        numpages = 2;
+    }
+    else
+    {
+        v->arch.runstate_guest_page[1] = NULL;
+        numpages = 1;
+    }
+
+    p = vmap(mfn, numpages);
+    if ( !p )
+    {
+        put_page(v->arch.runstate_guest_page[0]);
+        if ( numpages == 2 )
+            put_page(v->arch.runstate_guest_page[1]);
+
+        gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%lx: "
+                "vmap error\n", vaddr);
+        return -EINVAL;
     }
 
-    __copy_to_guest(runstate_guest(v), &runstate, 1);
+    v->arch.runstate_guest = p + offset;
 
-    if ( guest_handle )
+    if (v == current)
     {
-        runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
+        memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate));
+    }
+    else
+    {
+        vcpu_runstate_get(v, &runstate);
+        memcpy(v->arch.runstate_guest, &runstate, sizeof(v->runstate));
+    }
+
+    return 0;
+}
+
+int arch_vcpu_setup_runstate(struct vcpu *v,
+                             struct vcpu_register_runstate_memory_area area)
+{
+    int rc;
+
+    spin_lock(&v->arch.runstate_guest_lock);
+
+    /* cleanup if we are recalled */
+    cleanup_runstate_vcpu_locked(v);
+
+    rc = setup_runstate_vcpu_locked(v, (vaddr_t)area.addr.v);
+
+    spin_unlock(&v->arch.runstate_guest_lock);
+
+    return rc;
+}
+
+
+/* Update per-VCPU guest runstate shared memory area (if registered). */
+static void update_runstate_area(struct vcpu *v)
+{
+    spin_lock(&v->arch.runstate_guest_lock);
+
+    if ( v->arch.runstate_guest )
+    {
+        if ( VM_ASSIST(v->domain, runstate_update_flag) )
+        {
+            v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
+            v->arch.runstate_guest->state_entry_time |= XEN_RUNSTATE_UPDATE;
+            smp_wmb();
+        }
+
+        memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate));
+
+        /* copy must be done before switching the bit */
         smp_wmb();
-        __raw_copy_to_guest(guest_handle,
-                            (void *)(&runstate.state_entry_time + 1) - 1, 1);
+
+        if ( VM_ASSIST(v->domain, runstate_update_flag) )
+        {
+            v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
+            v->arch.runstate_guest->state_entry_time &= ~XEN_RUNSTATE_UPDATE;
+        }
     }
+
+    spin_unlock(&v->arch.runstate_guest_lock);
 }
 
 static void schedule_tail(struct vcpu *prev)
@@ -560,6 +682,8 @@ int arch_vcpu_create(struct vcpu *v)
     v->arch.saved_context.sp = (register_t)v->arch.cpu_info;
     v->arch.saved_context.pc = (register_t)continue_new_vcpu;
 
+    spin_lock_init(&v->arch.runstate_guest_lock);
+
     /* Idle VCPUs don't need the rest of this setup */
     if ( is_idle_vcpu(v) )
         return rc;
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index fee6c3931a..98910b7cf1 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1642,6 +1642,30 @@ void paravirt_ctxt_switch_to(struct vcpu *v)
     wrmsr_tsc_aux(v->arch.msrs->tsc_aux);
 }
 
+int arch_vcpu_setup_runstate(struct vcpu *v,
+                             struct vcpu_register_runstate_memory_area area)
+{
+    struct vcpu_runstate_info runstate;
+
+    runstate_guest(v) = area.addr.h;
+
+    if ( v == current )
+    {
+        __copy_to_guest(runstate_guest(v), &v->runstate, 1);
+    }
+    else
+    {
+        vcpu_runstate_get(v, &runstate);
+        __copy_to_guest(runstate_guest(v), &runstate, 1);
+    }
+    return 0;
+}
+
+void arch_vcpu_cleanup_runstate(struct vcpu *v)
+{
+    set_xen_guest_handle(runstate_guest(v), NULL);
+}
+
 /* Update per-VCPU guest runstate shared memory area (if registered). */
 bool update_runstate_area(struct vcpu *v)
 {
@@ -1660,8 +1684,8 @@ bool update_runstate_area(struct vcpu *v)
     if ( VM_ASSIST(v->domain, runstate_update_flag) )
     {
         guest_handle = has_32bit_shinfo(v->domain)
-            ? &v->runstate_guest.compat.p->state_entry_time + 1
-            : &v->runstate_guest.native.p->state_entry_time + 1;
+            ? &v->arch.runstate_guest.compat.p->state_entry_time + 1
+            : &v->arch.runstate_guest.native.p->state_entry_time + 1;
         guest_handle--;
         runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
         __raw_copy_to_guest(guest_handle,
@@ -1674,7 +1698,7 @@ bool update_runstate_area(struct vcpu *v)
         struct compat_vcpu_runstate_info info;
 
         XLAT_vcpu_runstate_info(&info, &runstate);
-        __copy_to_guest(v->runstate_guest.compat, &info, 1);
+        __copy_to_guest(v->arch.runstate_guest.compat, &info, 1);
         rc = true;
     }
     else
diff --git a/xen/arch/x86/x86_64/domain.c b/xen/arch/x86/x86_64/domain.c
index c46dccc25a..b879e8dd2c 100644
--- a/xen/arch/x86/x86_64/domain.c
+++ b/xen/arch/x86/x86_64/domain.c
@@ -36,7 +36,7 @@ arch_compat_vcpu_op(
             break;
 
         rc = 0;
-        guest_from_compat_handle(v->runstate_guest.compat, area.addr.h);
+        guest_from_compat_handle(v->arch.runstate_guest.compat, area.addr.h);
 
         if ( v == current )
         {
@@ -49,7 +49,7 @@ arch_compat_vcpu_op(
             vcpu_runstate_get(v, &runstate);
             XLAT_vcpu_runstate_info(&info, &runstate);
         }
-        __copy_to_guest(v->runstate_guest.compat, &info, 1);
+        __copy_to_guest(v->arch.runstate_guest.compat, &info, 1);
 
         break;
     }
diff --git a/xen/common/domain.c b/xen/common/domain.c
index e9be05f1d0..cd8595b186 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -727,7 +727,10 @@ int domain_kill(struct domain *d)
         if ( cpupool_move_domain(d, cpupool0) )
            return -ERESTART;
         for_each_vcpu ( d, v )
+        {
+            arch_vcpu_cleanup_runstate(v);
             unmap_vcpu_info(v);
+        }
         d->is_dying = DOMDYING_dead;
         /* Mem event cleanup has to go here because the rings
          * have to be put before we call put_domain. */
@@ -1167,7 +1170,7 @@ int domain_soft_reset(struct domain *d)
 
     for_each_vcpu ( d, v )
     {
-        set_xen_guest_handle(runstate_guest(v), NULL);
+        arch_vcpu_cleanup_runstate(v);
         unmap_vcpu_info(v);
     }
 
@@ -1494,7 +1497,6 @@ long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
     case VCPUOP_register_runstate_memory_area:
     {
         struct vcpu_register_runstate_memory_area area;
-        struct vcpu_runstate_info runstate;
 
         rc = -EFAULT;
         if ( copy_from_guest(&area, arg, 1) )
@@ -1503,18 +1505,7 @@ long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
         if ( !guest_handle_okay(area.addr.h, 1) )
             break;
 
-        rc = 0;
-        runstate_guest(v) = area.addr.h;
-
-        if ( v == current )
-        {
-            __copy_to_guest(runstate_guest(v), &v->runstate, 1);
-        }
-        else
-        {
-            vcpu_runstate_get(v, &runstate);
-            __copy_to_guest(runstate_guest(v), &runstate, 1);
-        }
+        rc = arch_vcpu_setup_runstate(v, area);
 
         break;
     }
diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h
index 4e2f582006..9df4d10abb 100644
--- a/xen/include/asm-arm/domain.h
+++ b/xen/include/asm-arm/domain.h
@@ -206,6 +206,15 @@ struct arch_vcpu
      */
     bool need_flush_to_ram;
 
+    /* runstate guest lock */
+    spinlock_t runstate_guest_lock;
+
+    /* runstate guest info */
+    struct vcpu_runstate_info *runstate_guest;
+
+    /* runstate pages mapped for runstate_guest */
+    struct page_info *runstate_guest_page[2];
+
 }  __cacheline_aligned;
 
 void vcpu_show_execution_state(struct vcpu *);
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 6fd94c2e14..c369d22ccc 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -11,6 +11,11 @@
 #include <asm/x86_emulate.h>
 #include <public/vcpu.h>
 #include <public/hvm/hvm_info_table.h>
+#ifdef CONFIG_COMPAT
+#include <compat/vcpu.h>
+DEFINE_XEN_GUEST_HANDLE(vcpu_runstate_info_compat_t);
+#endif
+
 
 #define has_32bit_shinfo(d)    ((d)->arch.has_32bit_shinfo)
 
@@ -627,6 +632,17 @@ struct arch_vcpu
     struct {
         bool next_interrupt_enabled;
     } monitor;
+
+#ifndef CONFIG_COMPAT
+# define runstate_guest(v) ((v)->arch.runstate_guest)
+    XEN_GUEST_HANDLE(vcpu_runstate_info_t) runstate_guest; /* guest address */
+#else
+# define runstate_guest(v) ((v)->arch.runstate_guest.native)
+    union {
+        XEN_GUEST_HANDLE(vcpu_runstate_info_t) native;
+        XEN_GUEST_HANDLE(vcpu_runstate_info_compat_t) compat;
+    } runstate_guest;
+#endif
 };
 
 struct guest_memory_policy
diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
index 7e51d361de..5e8cbba31d 100644
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -5,6 +5,7 @@
 
 #include <xen/types.h>
 #include <public/xen.h>
+#include <public/vcpu.h>
 #include <asm/domain.h>
 #include <asm/numa.h>
 
@@ -63,6 +64,10 @@ void arch_vcpu_destroy(struct vcpu *v);
 int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned offset);
 void unmap_vcpu_info(struct vcpu *v);
 
+int arch_vcpu_setup_runstate(struct vcpu *v,
+                             struct vcpu_register_runstate_memory_area area);
+void arch_vcpu_cleanup_runstate(struct vcpu *v);
+
 int arch_domain_create(struct domain *d,
                        struct xen_domctl_createdomain *config);
 
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index ac53519d7f..fac030fb83 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -29,11 +29,6 @@
 #include <public/vcpu.h>
 #include <public/event_channel.h>
 
-#ifdef CONFIG_COMPAT
-#include <compat/vcpu.h>
-DEFINE_XEN_GUEST_HANDLE(vcpu_runstate_info_compat_t);
-#endif
-
 /*
  * Stats
  *
@@ -166,16 +161,7 @@ struct vcpu
     struct sched_unit *sched_unit;
 
     struct vcpu_runstate_info runstate;
-#ifndef CONFIG_COMPAT
-# define runstate_guest(v) ((v)->runstate_guest)
-    XEN_GUEST_HANDLE(vcpu_runstate_info_t) runstate_guest; /* guest address */
-#else
-# define runstate_guest(v) ((v)->runstate_guest.native)
-    union {
-        XEN_GUEST_HANDLE(vcpu_runstate_info_t) native;
-        XEN_GUEST_HANDLE(vcpu_runstate_info_compat_t) compat;
-    } runstate_guest; /* guest address */
-#endif
+
     unsigned int new_state;
 
     /* Has the FPU been initialised? */
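A note on the protocol implemented by the rewritten update_runstate_area() above: when the runstate_update_flag VM assist is enabled, Xen sets XEN_RUNSTATE_UPDATE (the top bit of state_entry_time) in both its internal copy and the mapped guest area, copies the runstate body, and clears the bit again, with write barriers ordering the three steps. A guest reader is expected to retry until it observes a stable snapshot, along the lines of Linux's xen_get_runstate_snapshot(). A minimal sketch, assuming smp_rmb() and the definitions from Xen's public vcpu.h are available to the guest:

    #include <stdint.h>
    #include <string.h>
    /* struct vcpu_runstate_info and XEN_RUNSTATE_UPDATE come from Xen's
     * public vcpu.h; smp_rmb() is the guest's read barrier. */

    /* Take a consistent copy of a shared runstate area published by Xen. */
    static void runstate_snapshot(const struct vcpu_runstate_info *area,
                                  struct vcpu_runstate_info *res)
    {
        uint64_t entry_time;

        do {
            entry_time = area->state_entry_time;
            smp_rmb();             /* read the flag before the body */
            memcpy(res, area, sizeof(*res));
            smp_rmb();             /* read the body before re-checking */
        } while ( (res->state_entry_time & XEN_RUNSTATE_UPDATE) ||
                  area->state_entry_time != entry_time );

        /* Mask the update flag out of the returned entry time. */
        res->state_entry_time &= ~XEN_RUNSTATE_UPDATE;
    }

Setting the bit in v->runstate as well as in the mapped area means the memcpy()'d body itself carries the flag, so a reader can never observe a torn copy with the flag clear.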