Message ID | 1401724216-26486-5-git-send-email-mihai.caraman@freescale.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 06/02/2014 05:50 PM, Mihai Caraman wrote: > On book3e, KVM uses load external pid (lwepx) dedicated instruction to read > guest last instruction on the exit path. lwepx exceptions (DTLB_MISS, DSI > and LRAT), generated by loading a guest address, needs to be handled by KVM. > These exceptions are generated in a substituted guest translation context > (EPLC[EGS] = 1) from host context (MSR[GS] = 0). > > Currently, KVM hooks only interrupts generated from guest context (MSR[GS] = 1), > doing minimal checks on the fast path to avoid host performance degradation. > lwepx exceptions originate from host state (MSR[GS] = 0) which implies > additional checks in DO_KVM macro (beside the current MSR[GS] = 1) by looking > at the Exception Syndrome Register (ESR[EPID]) and the External PID Load Context > Register (EPLC[EGS]). Doing this on each Data TLB miss exception is obvious > too intrusive for the host. > > Read guest last instruction from kvmppc_load_last_inst() by searching for the > physical address and kmap it. This address the TODO for TLB eviction and > execute-but-not-read entries, and allow us to get rid of lwepx until we are > able to handle failures. > > A simple stress benchmark shows a 1% sys performance degradation compared with > previous approach (lwepx without failure handling): > > time for i in `seq 1 10000`; do /bin/echo > /dev/null; done > > real 0m 8.85s > user 0m 4.34s > sys 0m 4.48s > > vs > > real 0m 8.84s > user 0m 4.36s > sys 0m 4.44s > > An alternative solution, to handle lwepx exceptions in KVM, is to temporary > highjack the interrupt vector from host. Some cores share host IVOR registers > between hardware threads, which is the case of FSL e6500, which impose additional > synchronization logic for this solution to work. This optimized solution can > be developed later on top of this patch. > > Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> > --- > v3: > - reworked patch description > - use unaltered kmap addr for kunmap > - get last instruction before beeing preempted > > v2: > - reworked patch description > - used pr_* functions > - addressed cosmetic feedback > > arch/powerpc/kvm/booke.c | 32 ++++++++++++ > arch/powerpc/kvm/bookehv_interrupts.S | 37 ++++---------- > arch/powerpc/kvm/e500_mmu_host.c | 93 +++++++++++++++++++++++++++++++++++ > 3 files changed, 134 insertions(+), 28 deletions(-) > > diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c > index 34a42b9..4ef52a8 100644 > --- a/arch/powerpc/kvm/booke.c > +++ b/arch/powerpc/kvm/booke.c > @@ -880,6 +880,8 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, > int r = RESUME_HOST; > int s; > int idx; > + u32 last_inst = KVM_INST_FETCH_FAILED; > + enum emulation_result emulated = EMULATE_DONE; > > /* update before a new last_exit_type is rewritten */ > kvmppc_update_timing_stats(vcpu); > @@ -887,6 +889,15 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, > /* restart interrupts if they were meant for the host */ > kvmppc_restart_interrupt(vcpu, exit_nr); > > + /* > + * get last instruction before beeing preempted > + * TODO: for e6500 check also BOOKE_INTERRUPT_LRAT_ERROR & ESR_DATA > + */ > + if (exit_nr == BOOKE_INTERRUPT_DATA_STORAGE || > + exit_nr == BOOKE_INTERRUPT_DTLB_MISS || > + exit_nr == BOOKE_INTERRUPT_HV_PRIV) Please make this a switch() - that's easier to read. > + emulated = kvmppc_get_last_inst(vcpu, false, &last_inst); > + > local_irq_enable(); > > trace_kvm_exit(exit_nr, vcpu); > @@ -895,6 +906,26 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, > run->exit_reason = KVM_EXIT_UNKNOWN; > run->ready_for_interrupt_injection = 1; > > + switch (emulated) { > + case EMULATE_AGAIN: > + r = RESUME_GUEST; > + goto out; > + > + case EMULATE_FAIL: > + pr_debug("%s: emulation at %lx failed (%08x)\n", > + __func__, vcpu->arch.pc, last_inst); > + /* For debugging, encode the failing instruction and > + * report it to userspace. */ > + run->hw.hardware_exit_reason = ~0ULL << 32; > + run->hw.hardware_exit_reason |= last_inst; > + kvmppc_core_queue_program(vcpu, ESR_PIL); > + r = RESUME_HOST; > + goto out; > + > + default: > + break; > + } I think you can just put this into a function. Scott, I think the patch overall looks quite good. Can you please check as well and if you agree give it your reviewed-by? Mike, when Scott gives you a reviewed-by, please include it for the next version. Alex > + > switch (exit_nr) { > case BOOKE_INTERRUPT_MACHINE_CHECK: > printk("MACHINE CHECK: %lx\n", mfspr(SPRN_MCSR)); > @@ -1184,6 +1215,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, > BUG(); > } > > +out: > /* > * To avoid clobbering exit_reason, only check for signals if we > * aren't already exiting to userspace for some other reason. > diff --git a/arch/powerpc/kvm/bookehv_interrupts.S b/arch/powerpc/kvm/bookehv_interrupts.S > index 6ff4480..e000b39 100644 > --- a/arch/powerpc/kvm/bookehv_interrupts.S > +++ b/arch/powerpc/kvm/bookehv_interrupts.S > @@ -121,38 +121,14 @@ > 1: > > .if \flags & NEED_EMU > - /* > - * This assumes you have external PID support. > - * To support a bookehv CPU without external PID, you'll > - * need to look up the TLB entry and create a temporary mapping. > - * > - * FIXME: we don't currently handle if the lwepx faults. PR-mode > - * booke doesn't handle it either. Since Linux doesn't use > - * broadcast tlbivax anymore, the only way this should happen is > - * if the guest maps its memory execute-but-not-read, or if we > - * somehow take a TLB miss in the middle of this entry code and > - * evict the relevant entry. On e500mc, all kernel lowmem is > - * bolted into TLB1 large page mappings, and we don't use > - * broadcast invalidates, so we should not take a TLB miss here. > - * > - * Later we'll need to deal with faults here. Disallowing guest > - * mappings that are execute-but-not-read could be an option on > - * e500mc, but not on chips with an LRAT if it is used. > - */ > - > - mfspr r3, SPRN_EPLC /* will already have correct ELPID and EGS */ > PPC_STL r15, VCPU_GPR(R15)(r4) > PPC_STL r16, VCPU_GPR(R16)(r4) > PPC_STL r17, VCPU_GPR(R17)(r4) > PPC_STL r18, VCPU_GPR(R18)(r4) > PPC_STL r19, VCPU_GPR(R19)(r4) > - mr r8, r3 > PPC_STL r20, VCPU_GPR(R20)(r4) > - rlwimi r8, r6, EPC_EAS_SHIFT - MSR_IR_LG, EPC_EAS > PPC_STL r21, VCPU_GPR(R21)(r4) > - rlwimi r8, r6, EPC_EPR_SHIFT - MSR_PR_LG, EPC_EPR > PPC_STL r22, VCPU_GPR(R22)(r4) > - rlwimi r8, r10, EPC_EPID_SHIFT, EPC_EPID > PPC_STL r23, VCPU_GPR(R23)(r4) > PPC_STL r24, VCPU_GPR(R24)(r4) > PPC_STL r25, VCPU_GPR(R25)(r4) > @@ -162,10 +138,15 @@ > PPC_STL r29, VCPU_GPR(R29)(r4) > PPC_STL r30, VCPU_GPR(R30)(r4) > PPC_STL r31, VCPU_GPR(R31)(r4) > - mtspr SPRN_EPLC, r8 > - isync > - lwepx r9, 0, r5 > - mtspr SPRN_EPLC, r3 > + > + /* > + * We don't use external PID support. lwepx faults would need to be > + * handled by KVM and this implies aditional code in DO_KVM (for > + * DTB_MISS, DSI and LRAT) to check ESR[EPID] and EPLC[EGS] which > + * is too intrusive for the host. Get last instuction in > + * kvmppc_get_last_inst(). > + */ > + li r9, KVM_INST_FETCH_FAILED > stw r9, VCPU_LAST_INST(r4) > .endif > > diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c > index f692c12..0528fe5 100644 > --- a/arch/powerpc/kvm/e500_mmu_host.c > +++ b/arch/powerpc/kvm/e500_mmu_host.c > @@ -606,10 +606,103 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr, > } > } > > +#ifdef CONFIG_KVM_BOOKE_HV > int kvmppc_load_last_inst(struct kvm_vcpu *vcpu, bool prev, u32 *instr) > { > + gva_t geaddr; > + hpa_t addr; > + hfn_t pfn; > + hva_t eaddr; > + u32 mas0, mas1, mas2, mas3; > + u64 mas7_mas3; > + struct page *page; > + unsigned int addr_space, psize_shift; > + bool pr; > + unsigned long flags; > + > + WARN_ON_ONCE(prev); > + > + /* Search TLB for guest pc to get the real address */ > + geaddr = kvmppc_get_pc(vcpu); > + > + addr_space = (vcpu->arch.shared->msr & MSR_IS) >> MSR_IR_LG; > + > + local_irq_save(flags); > + mtspr(SPRN_MAS6, (vcpu->arch.pid << MAS6_SPID_SHIFT) | addr_space); > + mtspr(SPRN_MAS5, MAS5_SGS | vcpu->kvm->arch.lpid); > + asm volatile("tlbsx 0, %[geaddr]\n" : : > + [geaddr] "r" (geaddr)); > + mtspr(SPRN_MAS5, 0); > + mtspr(SPRN_MAS8, 0); > + mas0 = mfspr(SPRN_MAS0); > + mas1 = mfspr(SPRN_MAS1); > + mas2 = mfspr(SPRN_MAS2); > + mas3 = mfspr(SPRN_MAS3); > + mas7_mas3 = (((u64) mfspr(SPRN_MAS7)) << 32) | mas3; > + local_irq_restore(flags); > + > + /* > + * If the TLB entry for guest pc was evicted, return to the guest. > + * There are high chances to find a valid TLB entry next time. > + */ > + if (!(mas1 & MAS1_VALID)) > + return EMULATE_AGAIN; > + > + /* > + * Another thread may rewrite the TLB entry in parallel, don't > + * execute from the address if the execute permission is not set > + */ > + pr = vcpu->arch.shared->msr & MSR_PR; > + if ((pr && !(mas3 & MAS3_UX)) || (!pr && !(mas3 & MAS3_SX))) { > + pr_debug("Instuction emulation from a guest page\n" > + "withot execute permission\n"); > + return EMULATE_FAIL; > + } > + > + /* > + * We will map the real address through a cacheable page, so we will > + * not support cache-inhibited guest pages. Fortunately emulated > + * instructions should not live there. > + */ > + if (mas2 & MAS2_I) { > + pr_debug("Instuction emulation from cache-inhibited\n" > + "guest pages is not supported\n"); > + return EMULATE_FAIL; > + } > + > + /* Get page size */ > + psize_shift = MAS1_GET_TSIZE(mas1) + 10; > + > + /* Map a page and get guest's instruction */ > + addr = (mas7_mas3 & (~0ULL << psize_shift)) | > + (geaddr & ((1ULL << psize_shift) - 1ULL)); > + pfn = addr >> PAGE_SHIFT; > + > + /* Guard us against emulation from devices area */ > + if (unlikely(!page_is_ram(pfn))) { > + pr_debug("Instruction emulation from non-RAM host\n" > + "pages is not supported\n"); > + return EMULATE_FAIL; > + } > + > + if (unlikely(!pfn_valid(pfn))) { > + pr_debug("Invalid frame number\n"); > + return EMULATE_FAIL; > + } > + > + page = pfn_to_page(pfn); > + eaddr = (unsigned long)kmap_atomic(page); > + *instr = *(u32 *)(eaddr | (addr & ~PAGE_MASK)); > + kunmap_atomic((u32 *)eaddr); > + > + return EMULATE_DONE; > +} > +#else > +int kvmppc_load_last_inst(struct kvm_vcpu *vcpu, u32 *instr) > +{ > return EMULATE_FAIL; > } > +#endif > > /************* MMU Notifiers *************/ > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2014-06-12 at 18:04 +0200, Alexander Graf wrote: > On 06/02/2014 05:50 PM, Mihai Caraman wrote: > > On book3e, KVM uses load external pid (lwepx) dedicated instruction to read > > guest last instruction on the exit path. lwepx exceptions (DTLB_MISS, DSI > > and LRAT), generated by loading a guest address, needs to be handled by KVM. > > These exceptions are generated in a substituted guest translation context > > (EPLC[EGS] = 1) from host context (MSR[GS] = 0). > > > > Currently, KVM hooks only interrupts generated from guest context (MSR[GS] = 1), > > doing minimal checks on the fast path to avoid host performance degradation. > > lwepx exceptions originate from host state (MSR[GS] = 0) which implies > > additional checks in DO_KVM macro (beside the current MSR[GS] = 1) by looking > > at the Exception Syndrome Register (ESR[EPID]) and the External PID Load Context > > Register (EPLC[EGS]). Doing this on each Data TLB miss exception is obvious > > too intrusive for the host. > > > > Read guest last instruction from kvmppc_load_last_inst() by searching for the > > physical address and kmap it. This address the TODO for TLB eviction and > > execute-but-not-read entries, and allow us to get rid of lwepx until we are > > able to handle failures. > > > > A simple stress benchmark shows a 1% sys performance degradation compared with > > previous approach (lwepx without failure handling): > > > > time for i in `seq 1 10000`; do /bin/echo > /dev/null; done > > > > real 0m 8.85s > > user 0m 4.34s > > sys 0m 4.48s > > > > vs > > > > real 0m 8.84s > > user 0m 4.36s > > sys 0m 4.44s > > > > An alternative solution, to handle lwepx exceptions in KVM, is to temporary > > highjack the interrupt vector from host. Some cores share host IVOR registers > > between hardware threads, which is the case of FSL e6500, which impose additional > > synchronization logic for this solution to work. This optimized solution can > > be developed later on top of this patch. > > > > Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> > > --- > > v3: > > - reworked patch description > > - use unaltered kmap addr for kunmap > > - get last instruction before beeing preempted > > > > v2: > > - reworked patch description > > - used pr_* functions > > - addressed cosmetic feedback > > > > arch/powerpc/kvm/booke.c | 32 ++++++++++++ > > arch/powerpc/kvm/bookehv_interrupts.S | 37 ++++---------- > > arch/powerpc/kvm/e500_mmu_host.c | 93 +++++++++++++++++++++++++++++++++++ > > 3 files changed, 134 insertions(+), 28 deletions(-) > > > > diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c > > index 34a42b9..4ef52a8 100644 > > --- a/arch/powerpc/kvm/booke.c > > +++ b/arch/powerpc/kvm/booke.c > > @@ -880,6 +880,8 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, > > int r = RESUME_HOST; > > int s; > > int idx; > > + u32 last_inst = KVM_INST_FETCH_FAILED; > > + enum emulation_result emulated = EMULATE_DONE; > > > > /* update before a new last_exit_type is rewritten */ > > kvmppc_update_timing_stats(vcpu); > > @@ -887,6 +889,15 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, > > /* restart interrupts if they were meant for the host */ > > kvmppc_restart_interrupt(vcpu, exit_nr); > > > > + /* > > + * get last instruction before beeing preempted > > + * TODO: for e6500 check also BOOKE_INTERRUPT_LRAT_ERROR & ESR_DATA > > + */ > > + if (exit_nr == BOOKE_INTERRUPT_DATA_STORAGE || > > + exit_nr == BOOKE_INTERRUPT_DTLB_MISS || > > + exit_nr == BOOKE_INTERRUPT_HV_PRIV) > > Please make this a switch() - that's easier to read. > > > + emulated = kvmppc_get_last_inst(vcpu, false, &last_inst); > > + > > local_irq_enable(); > > > > trace_kvm_exit(exit_nr, vcpu); > > @@ -895,6 +906,26 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, > > run->exit_reason = KVM_EXIT_UNKNOWN; > > run->ready_for_interrupt_injection = 1; > > > > + switch (emulated) { > > + case EMULATE_AGAIN: > > + r = RESUME_GUEST; > > + goto out; > > + > > + case EMULATE_FAIL: > > + pr_debug("%s: emulation at %lx failed (%08x)\n", > > + __func__, vcpu->arch.pc, last_inst); > > + /* For debugging, encode the failing instruction and > > + * report it to userspace. */ > > + run->hw.hardware_exit_reason = ~0ULL << 32; > > + run->hw.hardware_exit_reason |= last_inst; > > + kvmppc_core_queue_program(vcpu, ESR_PIL); > > + r = RESUME_HOST; > > + goto out; > > + > > + default: > > + break; > > + } > > I think you can just put this into a function. > > Scott, I think the patch overall looks quite good. Can you please check > as well and if you agree give it your reviewed-by? Mike, when Scott > gives you a reviewed-by, please include it for the next version. > > > Alex > > > + > > switch (exit_nr) { > > case BOOKE_INTERRUPT_MACHINE_CHECK: > > printk("MACHINE CHECK: %lx\n", mfspr(SPRN_MCSR)); > > @@ -1184,6 +1215,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, > > BUG(); > > } > > > > +out: > > /* > > * To avoid clobbering exit_reason, only check for signals if we > > * aren't already exiting to userspace for some other reason. > > diff --git a/arch/powerpc/kvm/bookehv_interrupts.S b/arch/powerpc/kvm/bookehv_interrupts.S > > index 6ff4480..e000b39 100644 > > --- a/arch/powerpc/kvm/bookehv_interrupts.S > > +++ b/arch/powerpc/kvm/bookehv_interrupts.S > > @@ -121,38 +121,14 @@ > > 1: > > > > .if \flags & NEED_EMU > > - /* > > - * This assumes you have external PID support. > > - * To support a bookehv CPU without external PID, you'll > > - * need to look up the TLB entry and create a temporary mapping. > > - * > > - * FIXME: we don't currently handle if the lwepx faults. PR-mode > > - * booke doesn't handle it either. Since Linux doesn't use > > - * broadcast tlbivax anymore, the only way this should happen is > > - * if the guest maps its memory execute-but-not-read, or if we > > - * somehow take a TLB miss in the middle of this entry code and > > - * evict the relevant entry. On e500mc, all kernel lowmem is > > - * bolted into TLB1 large page mappings, and we don't use > > - * broadcast invalidates, so we should not take a TLB miss here. > > - * > > - * Later we'll need to deal with faults here. Disallowing guest > > - * mappings that are execute-but-not-read could be an option on > > - * e500mc, but not on chips with an LRAT if it is used. > > - */ > > - > > - mfspr r3, SPRN_EPLC /* will already have correct ELPID and EGS */ > > PPC_STL r15, VCPU_GPR(R15)(r4) > > PPC_STL r16, VCPU_GPR(R16)(r4) > > PPC_STL r17, VCPU_GPR(R17)(r4) > > PPC_STL r18, VCPU_GPR(R18)(r4) > > PPC_STL r19, VCPU_GPR(R19)(r4) > > - mr r8, r3 > > PPC_STL r20, VCPU_GPR(R20)(r4) > > - rlwimi r8, r6, EPC_EAS_SHIFT - MSR_IR_LG, EPC_EAS > > PPC_STL r21, VCPU_GPR(R21)(r4) > > - rlwimi r8, r6, EPC_EPR_SHIFT - MSR_PR_LG, EPC_EPR > > PPC_STL r22, VCPU_GPR(R22)(r4) > > - rlwimi r8, r10, EPC_EPID_SHIFT, EPC_EPID > > PPC_STL r23, VCPU_GPR(R23)(r4) > > PPC_STL r24, VCPU_GPR(R24)(r4) > > PPC_STL r25, VCPU_GPR(R25)(r4) > > @@ -162,10 +138,15 @@ > > PPC_STL r29, VCPU_GPR(R29)(r4) > > PPC_STL r30, VCPU_GPR(R30)(r4) > > PPC_STL r31, VCPU_GPR(R31)(r4) > > - mtspr SPRN_EPLC, r8 > > - isync > > - lwepx r9, 0, r5 > > - mtspr SPRN_EPLC, r3 > > + > > + /* > > + * We don't use external PID support. lwepx faults would need to be > > + * handled by KVM and this implies aditional code in DO_KVM (for > > + * DTB_MISS, DSI and LRAT) to check ESR[EPID] and EPLC[EGS] which > > + * is too intrusive for the host. Get last instuction in > > + * kvmppc_get_last_inst(). > > + */ > > + li r9, KVM_INST_FETCH_FAILED > > stw r9, VCPU_LAST_INST(r4) > > .endif > > > > diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c > > index f692c12..0528fe5 100644 > > --- a/arch/powerpc/kvm/e500_mmu_host.c > > +++ b/arch/powerpc/kvm/e500_mmu_host.c > > @@ -606,10 +606,103 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr, > > } > > } > > > > +#ifdef CONFIG_KVM_BOOKE_HV > > int kvmppc_load_last_inst(struct kvm_vcpu *vcpu, bool prev, u32 *instr) > > { > > + gva_t geaddr; > > + hpa_t addr; > > + hfn_t pfn; > > + hva_t eaddr; > > + u32 mas0, mas1, mas2, mas3; > > + u64 mas7_mas3; > > + struct page *page; > > + unsigned int addr_space, psize_shift; > > + bool pr; > > + unsigned long flags; > > + > > + WARN_ON_ONCE(prev); What are the semantics of "prev"? > > + /* Search TLB for guest pc to get the real address */ > > + geaddr = kvmppc_get_pc(vcpu); > > + > > + addr_space = (vcpu->arch.shared->msr & MSR_IS) >> MSR_IR_LG; > > + > > + local_irq_save(flags); > > + mtspr(SPRN_MAS6, (vcpu->arch.pid << MAS6_SPID_SHIFT) | addr_space); > > + mtspr(SPRN_MAS5, MAS5_SGS | vcpu->kvm->arch.lpid); > > + asm volatile("tlbsx 0, %[geaddr]\n" : : > > + [geaddr] "r" (geaddr)); > > + mtspr(SPRN_MAS5, 0); > > + mtspr(SPRN_MAS8, 0); > > + mas0 = mfspr(SPRN_MAS0); > > + mas1 = mfspr(SPRN_MAS1); > > + mas2 = mfspr(SPRN_MAS2); > > + mas3 = mfspr(SPRN_MAS3); > > + mas7_mas3 = (((u64) mfspr(SPRN_MAS7)) << 32) | mas3; > > + local_irq_restore(flags); Where is mas0 used? On 64-bit you could read directly from SPRN_MAS7_MAS3. > > + /* > > + * If the TLB entry for guest pc was evicted, return to the guest. > > + * There are high chances to find a valid TLB entry next time. > > + */ > > + if (!(mas1 & MAS1_VALID)) > > + return EMULATE_AGAIN; > > + > > + /* > > + * Another thread may rewrite the TLB entry in parallel, don't > > + * execute from the address if the execute permission is not set > > + */ > > + pr = vcpu->arch.shared->msr & MSR_PR; > > + if ((pr && !(mas3 & MAS3_UX)) || (!pr && !(mas3 & MAS3_SX))) { > > + pr_debug("Instuction emulation from a guest page\n" > > + "withot execute permission\n"); Don't split user-visible strings. s/withot/without/ Give more context in the message, such as __func__ and the address. > > + return EMULATE_FAIL; > > + } > > + > > + /* > > + * We will map the real address through a cacheable page, so we will > > + * not support cache-inhibited guest pages. Fortunately emulated > > + * instructions should not live there. > > + */ > > + if (mas2 & MAS2_I) { > > + pr_debug("Instuction emulation from cache-inhibited\n" > > + "guest pages is not supported\n"); > > + return EMULATE_FAIL; > > + } Mismatches in M or W are bad as well, but shouldn't the page_is_ram() protect us? Except when LRAT is used, but there we can't prevent mismatches anyway. > > + /* Get page size */ > > + psize_shift = MAS1_GET_TSIZE(mas1) + 10; > > + > > + /* Map a page and get guest's instruction */ > > + addr = (mas7_mas3 & (~0ULL << psize_shift)) | > > + (geaddr & ((1ULL << psize_shift) - 1ULL)); > > + pfn = addr >> PAGE_SHIFT; > > + > > + /* Guard us against emulation from devices area */ > > + if (unlikely(!page_is_ram(pfn))) { > > + pr_debug("Instruction emulation from non-RAM host\n" > > + "pages is not supported\n"); > > + return EMULATE_FAIL; > > + } > > + > > + if (unlikely(!pfn_valid(pfn))) { > > + pr_debug("Invalid frame number\n"); > > + return EMULATE_FAIL; > > + } Can we ever have page_is_ram(pfn) && !pfn_valid(pfn)? -Scott -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index 34a42b9..4ef52a8 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -880,6 +880,8 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, int r = RESUME_HOST; int s; int idx; + u32 last_inst = KVM_INST_FETCH_FAILED; + enum emulation_result emulated = EMULATE_DONE; /* update before a new last_exit_type is rewritten */ kvmppc_update_timing_stats(vcpu); @@ -887,6 +889,15 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, /* restart interrupts if they were meant for the host */ kvmppc_restart_interrupt(vcpu, exit_nr); + /* + * get last instruction before beeing preempted + * TODO: for e6500 check also BOOKE_INTERRUPT_LRAT_ERROR & ESR_DATA + */ + if (exit_nr == BOOKE_INTERRUPT_DATA_STORAGE || + exit_nr == BOOKE_INTERRUPT_DTLB_MISS || + exit_nr == BOOKE_INTERRUPT_HV_PRIV) + emulated = kvmppc_get_last_inst(vcpu, false, &last_inst); + local_irq_enable(); trace_kvm_exit(exit_nr, vcpu); @@ -895,6 +906,26 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, run->exit_reason = KVM_EXIT_UNKNOWN; run->ready_for_interrupt_injection = 1; + switch (emulated) { + case EMULATE_AGAIN: + r = RESUME_GUEST; + goto out; + + case EMULATE_FAIL: + pr_debug("%s: emulation at %lx failed (%08x)\n", + __func__, vcpu->arch.pc, last_inst); + /* For debugging, encode the failing instruction and + * report it to userspace. */ + run->hw.hardware_exit_reason = ~0ULL << 32; + run->hw.hardware_exit_reason |= last_inst; + kvmppc_core_queue_program(vcpu, ESR_PIL); + r = RESUME_HOST; + goto out; + + default: + break; + } + switch (exit_nr) { case BOOKE_INTERRUPT_MACHINE_CHECK: printk("MACHINE CHECK: %lx\n", mfspr(SPRN_MCSR)); @@ -1184,6 +1215,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, BUG(); } +out: /* * To avoid clobbering exit_reason, only check for signals if we * aren't already exiting to userspace for some other reason. diff --git a/arch/powerpc/kvm/bookehv_interrupts.S b/arch/powerpc/kvm/bookehv_interrupts.S index 6ff4480..e000b39 100644 --- a/arch/powerpc/kvm/bookehv_interrupts.S +++ b/arch/powerpc/kvm/bookehv_interrupts.S @@ -121,38 +121,14 @@ 1: .if \flags & NEED_EMU - /* - * This assumes you have external PID support. - * To support a bookehv CPU without external PID, you'll - * need to look up the TLB entry and create a temporary mapping. - * - * FIXME: we don't currently handle if the lwepx faults. PR-mode - * booke doesn't handle it either. Since Linux doesn't use - * broadcast tlbivax anymore, the only way this should happen is - * if the guest maps its memory execute-but-not-read, or if we - * somehow take a TLB miss in the middle of this entry code and - * evict the relevant entry. On e500mc, all kernel lowmem is - * bolted into TLB1 large page mappings, and we don't use - * broadcast invalidates, so we should not take a TLB miss here. - * - * Later we'll need to deal with faults here. Disallowing guest - * mappings that are execute-but-not-read could be an option on - * e500mc, but not on chips with an LRAT if it is used. - */ - - mfspr r3, SPRN_EPLC /* will already have correct ELPID and EGS */ PPC_STL r15, VCPU_GPR(R15)(r4) PPC_STL r16, VCPU_GPR(R16)(r4) PPC_STL r17, VCPU_GPR(R17)(r4) PPC_STL r18, VCPU_GPR(R18)(r4) PPC_STL r19, VCPU_GPR(R19)(r4) - mr r8, r3 PPC_STL r20, VCPU_GPR(R20)(r4) - rlwimi r8, r6, EPC_EAS_SHIFT - MSR_IR_LG, EPC_EAS PPC_STL r21, VCPU_GPR(R21)(r4) - rlwimi r8, r6, EPC_EPR_SHIFT - MSR_PR_LG, EPC_EPR PPC_STL r22, VCPU_GPR(R22)(r4) - rlwimi r8, r10, EPC_EPID_SHIFT, EPC_EPID PPC_STL r23, VCPU_GPR(R23)(r4) PPC_STL r24, VCPU_GPR(R24)(r4) PPC_STL r25, VCPU_GPR(R25)(r4) @@ -162,10 +138,15 @@ PPC_STL r29, VCPU_GPR(R29)(r4) PPC_STL r30, VCPU_GPR(R30)(r4) PPC_STL r31, VCPU_GPR(R31)(r4) - mtspr SPRN_EPLC, r8 - isync - lwepx r9, 0, r5 - mtspr SPRN_EPLC, r3 + + /* + * We don't use external PID support. lwepx faults would need to be + * handled by KVM and this implies aditional code in DO_KVM (for + * DTB_MISS, DSI and LRAT) to check ESR[EPID] and EPLC[EGS] which + * is too intrusive for the host. Get last instuction in + * kvmppc_get_last_inst(). + */ + li r9, KVM_INST_FETCH_FAILED stw r9, VCPU_LAST_INST(r4) .endif diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c index f692c12..0528fe5 100644 --- a/arch/powerpc/kvm/e500_mmu_host.c +++ b/arch/powerpc/kvm/e500_mmu_host.c @@ -606,10 +606,103 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr, } } +#ifdef CONFIG_KVM_BOOKE_HV int kvmppc_load_last_inst(struct kvm_vcpu *vcpu, bool prev, u32 *instr) { + gva_t geaddr; + hpa_t addr; + hfn_t pfn; + hva_t eaddr; + u32 mas0, mas1, mas2, mas3; + u64 mas7_mas3; + struct page *page; + unsigned int addr_space, psize_shift; + bool pr; + unsigned long flags; + + WARN_ON_ONCE(prev); + + /* Search TLB for guest pc to get the real address */ + geaddr = kvmppc_get_pc(vcpu); + + addr_space = (vcpu->arch.shared->msr & MSR_IS) >> MSR_IR_LG; + + local_irq_save(flags); + mtspr(SPRN_MAS6, (vcpu->arch.pid << MAS6_SPID_SHIFT) | addr_space); + mtspr(SPRN_MAS5, MAS5_SGS | vcpu->kvm->arch.lpid); + asm volatile("tlbsx 0, %[geaddr]\n" : : + [geaddr] "r" (geaddr)); + mtspr(SPRN_MAS5, 0); + mtspr(SPRN_MAS8, 0); + mas0 = mfspr(SPRN_MAS0); + mas1 = mfspr(SPRN_MAS1); + mas2 = mfspr(SPRN_MAS2); + mas3 = mfspr(SPRN_MAS3); + mas7_mas3 = (((u64) mfspr(SPRN_MAS7)) << 32) | mas3; + local_irq_restore(flags); + + /* + * If the TLB entry for guest pc was evicted, return to the guest. + * There are high chances to find a valid TLB entry next time. + */ + if (!(mas1 & MAS1_VALID)) + return EMULATE_AGAIN; + + /* + * Another thread may rewrite the TLB entry in parallel, don't + * execute from the address if the execute permission is not set + */ + pr = vcpu->arch.shared->msr & MSR_PR; + if ((pr && !(mas3 & MAS3_UX)) || (!pr && !(mas3 & MAS3_SX))) { + pr_debug("Instuction emulation from a guest page\n" + "withot execute permission\n"); + return EMULATE_FAIL; + } + + /* + * We will map the real address through a cacheable page, so we will + * not support cache-inhibited guest pages. Fortunately emulated + * instructions should not live there. + */ + if (mas2 & MAS2_I) { + pr_debug("Instuction emulation from cache-inhibited\n" + "guest pages is not supported\n"); + return EMULATE_FAIL; + } + + /* Get page size */ + psize_shift = MAS1_GET_TSIZE(mas1) + 10; + + /* Map a page and get guest's instruction */ + addr = (mas7_mas3 & (~0ULL << psize_shift)) | + (geaddr & ((1ULL << psize_shift) - 1ULL)); + pfn = addr >> PAGE_SHIFT; + + /* Guard us against emulation from devices area */ + if (unlikely(!page_is_ram(pfn))) { + pr_debug("Instruction emulation from non-RAM host\n" + "pages is not supported\n"); + return EMULATE_FAIL; + } + + if (unlikely(!pfn_valid(pfn))) { + pr_debug("Invalid frame number\n"); + return EMULATE_FAIL; + } + + page = pfn_to_page(pfn); + eaddr = (unsigned long)kmap_atomic(page); + *instr = *(u32 *)(eaddr | (addr & ~PAGE_MASK)); + kunmap_atomic((u32 *)eaddr); + + return EMULATE_DONE; +} +#else +int kvmppc_load_last_inst(struct kvm_vcpu *vcpu, u32 *instr) +{ return EMULATE_FAIL; } +#endif /************* MMU Notifiers *************/
On book3e, KVM uses load external pid (lwepx) dedicated instruction to read guest last instruction on the exit path. lwepx exceptions (DTLB_MISS, DSI and LRAT), generated by loading a guest address, needs to be handled by KVM. These exceptions are generated in a substituted guest translation context (EPLC[EGS] = 1) from host context (MSR[GS] = 0). Currently, KVM hooks only interrupts generated from guest context (MSR[GS] = 1), doing minimal checks on the fast path to avoid host performance degradation. lwepx exceptions originate from host state (MSR[GS] = 0) which implies additional checks in DO_KVM macro (beside the current MSR[GS] = 1) by looking at the Exception Syndrome Register (ESR[EPID]) and the External PID Load Context Register (EPLC[EGS]). Doing this on each Data TLB miss exception is obvious too intrusive for the host. Read guest last instruction from kvmppc_load_last_inst() by searching for the physical address and kmap it. This address the TODO for TLB eviction and execute-but-not-read entries, and allow us to get rid of lwepx until we are able to handle failures. A simple stress benchmark shows a 1% sys performance degradation compared with previous approach (lwepx without failure handling): time for i in `seq 1 10000`; do /bin/echo > /dev/null; done real 0m 8.85s user 0m 4.34s sys 0m 4.48s vs real 0m 8.84s user 0m 4.36s sys 0m 4.44s An alternative solution, to handle lwepx exceptions in KVM, is to temporary highjack the interrupt vector from host. Some cores share host IVOR registers between hardware threads, which is the case of FSL e6500, which impose additional synchronization logic for this solution to work. This optimized solution can be developed later on top of this patch. Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> --- v3: - reworked patch description - use unaltered kmap addr for kunmap - get last instruction before beeing preempted v2: - reworked patch description - used pr_* functions - addressed cosmetic feedback arch/powerpc/kvm/booke.c | 32 ++++++++++++ arch/powerpc/kvm/bookehv_interrupts.S | 37 ++++---------- arch/powerpc/kvm/e500_mmu_host.c | 93 +++++++++++++++++++++++++++++++++++ 3 files changed, 134 insertions(+), 28 deletions(-)