Message ID: 1490361899-18303-1-git-send-email-rcojocaru@bitdefender.com (mailing list archive)
State: New, archived
The real CMPXCHG hvmemul_cmpxchg() implementation works as expected as
far as the race conditions go, and returning RETRY for failed writebacks
seems to work without issue for regular Xen emulation.

However, when the patch is used with introspection, I've had a BCCode:
101 BSOD and rare (but several) occasions when the guest becomes
unresponsive (can't close Firefox or have the Windows start menu show up
when clicking the "Start" button, but the guest is otherwise running).
This I believe is due to the do { } while ( rc == X86EMUL_RETRY ); loop
in hvm_emulate_one_vm_event() (see the simplified sketch after this
message): I am basically now looping behind the guest's back until the
CMPXCHG succeeds, which can theoretically be a very long time to execute
a CMPXCHG instruction, and most likely not what the guest OS is
expecting.

The alternative (and the current default) is to do nothing on
X86EMUL_RETRY and just allow the guest to re-enter at the same place,
which should trigger the same page fault vm_event, which will hopefully
now be able to emulate the current instruction. However, in the best
case scenario this just complicates the above loop, since the current
instruction will still be unable to complete until emulation succeeds,
but this time with VMEXITs. And in the worst case scenario (which is
what happens in my tests) this adds an additional factor of
unpredictability, since the guest quickly BSODs (or, rarely, just plain
hangs).

I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
failed CMPXCHG should happen just once, with the proper registers and ZF
set. The guest surely expects neither that the instruction resume until
it succeeds, nor that some hidden loop goes on for an indeterminate
amount of time until a CMPXCHG succeeds.

The picture is further complicated by the two-part handling of
instructions in x86_emulate(). For example, for CMPXCHG we have:

    case X86EMUL_OPC(0x0f, 0xb0):
    case X86EMUL_OPC(0x0f, 0xb1): /* cmpxchg */
        /* Save real source value, then compare EAX against destination. */
        src.orig_val = src.val;
        src.val = _regs.r(ax);
        /* cmp: %%eax - dst ==> dst and src swapped for macro invocation */
        emulate_2op_SrcV("cmp", dst, src, _regs.eflags);
        if ( _regs.eflags & X86_EFLAGS_ZF )
        {
            /* Success: write back to memory. */
            dst.val = src.orig_val;
        }
        else
        {
            /* Failure: write the value we saw to EAX. */
            dst.type = OP_REG;
            dst.reg = (unsigned long *)&_regs.r(ax);
        }
        break;

This is the only part that sets the proper registers and ZF, and it's
only the comparison. If this succeeds, x86_emulate() currently assumes
that the writeback part cannot fail. But then the writeback part is:

    case OP_MEM:
        if ( !(d & Mov) && (dst.orig_val == dst.val) &&
             !ctxt->force_writeback )
            /* nothing to do */;
        else if ( lock_prefix )
        {
            fail_if(!ops->cmpxchg);
            rc = ops->cmpxchg(
                dst.mem.seg, dst.mem.off, &dst.orig_val,
                &dst.val, dst.bytes, ctxt);
        }
        else
        {
            fail_if(!ops->write);
            rc = ops->write(dst.mem.seg, dst.mem.off,
                            !state->simd_size ? &dst.val : (void *)mmvalp,
                            dst.bytes, ctxt);
            if ( sfence )
                asm volatile ( "sfence" ::: "memory" );
        }
        if ( rc != 0 )
            goto done;
    default:
        break;

I now see that this is why using a spinlock only around the writeback
part did not solve the issue: both the compare part and the writeback
part need to be part of the same atomic operation - the lock needs to be
acquired before the cmp / ZF part.

Opinions and suggestions are appreciated. If I'm not mistaken, it looks
like the smp_lock design is the better solution to the problem.

Thanks,
Razvan
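For reference, the loop in question - as introduced by the patch at the
end of this thread - has roughly this shape; a simplified sketch with
the UNHANDLEABLE/EXCEPTION handling elided, not the exact code:

    int rc;

    do {
        struct hvm_emulate_ctxt ctx = {{ 0 }};

        hvm_emulate_init_once(&ctx, NULL, guest_cpu_user_regs());
        rc = hvm_emulate_one(&ctx); /* X86EMUL_RETRY on a failed
                                       CMPXCHG writeback */

        /* ... UNHANDLEABLE / EXCEPTION handling elided ... */

        hvm_emulate_writeback(&ctx);
    } while ( rc == X86EMUL_RETRY );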
>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
> failed CMPXCHG should happen just once, with the proper registers and ZF
> set. The guest surely expects neither that the instruction resume until
> it succeeds, nor that some hidden loop goes on for an indeterminate
> amount of time until a CMPXCHG succeeds.

The guest doesn't observe the CMPXCHG failing - RETRY leads to the
instruction being restarted instead of completed.

There is a severe downside to MMIO accesses with this model though:
other than for RAM, we shouldn't be reading (and even less so writing)
the memory location more than once.

> The picture is further complicated by the two-part handling of
> instructions in x86_emulate(). For example, for CMPXCHG we have:
> [...]
> I now see that this is why using a spinlock only around the writeback
> part did not solve the issue: both the compare part and the writeback
> part need to be part of the same atomic operation - the lock needs to
> be acquired before the cmp / ZF part.

Not exactly: RMW insns require the lock to be acquired ahead of the
memory read, and dropped after the memory write.

> Opinions and suggestions are appreciated. If I'm not mistaken, it looks
> like the smp_lock design is the better solution to the problem.

Andrew has told me that the re-work of how we do memory accesses in the
emulator is pretty high on his priority list now. Whether we'd want to
introduce an interim solution therefore depends on the time scale to
possibly reach the only ultimately correct solution (no longer acting on
intermediate copies of the data coming from guest memory, which is only
correct for plain reads and plain writes).

Jan
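To pin down the rule stated above - lock taken before the memory read,
released only after the memory write - here is a minimal, self-contained
sketch in plain C with pthreads; all names are illustrative and nothing
here is actual Xen code:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>

    static pthread_mutex_t emul_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Emulated CMPXCHG: returns true on success (guest ZF set). */
    static bool emul_cmpxchg(uint64_t *dst, uint64_t *rax, uint64_t src)
    {
        bool zf;

        pthread_mutex_lock(&emul_lock);   /* acquire BEFORE the read   */
        zf = (*dst == *rax);              /* compare part - decides ZF */
        if ( zf )
            *dst = src;                   /* success: writeback part   */
        else
            *rax = *dst;                  /* failure: old value -> EAX */
        pthread_mutex_unlock(&emul_lock); /* drop only AFTER the write */

        return zf;
    }

Locking only around the writeback, as tried earlier in the thread,
leaves a window between the compare and the write where another vCPU can
change the destination.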
On 28/03/17 11:03, Jan Beulich wrote:
>>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
>> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
>> failed CMPXCHG should happen just once, with the proper registers and ZF
>> set. The guest surely expects neither that the instruction resume until
>> it succeeds, nor that some hidden loop goes on for an indeterminate
>> amount of time until a CMPXCHG succeeds.
> The guest doesn't observe the CMPXCHG failing - RETRY leads to
> the instruction being restarted instead of completed.

And this probably is the root of the problem. CMPXCHG on a contended
location should be observed to fail from the guest's point of view.

> There is a severe downside to MMIO accesses with this model
> though: other than for RAM, we shouldn't be reading (and even
> less so writing) the memory location more than once.

With MMIO, the hardware analogy is sending a single PCIe transaction
with the atomic flag set.

However, atomic operations on non-RAM locations tend not to actually be
atomic on real hardware anyway, so as long as we don't make multiple
reads/writes, the lack of atomicity won't be a problem in practice.

>> The picture is further complicated by the two-part handling of
>> instructions in x86_emulate(). For example, for CMPXCHG we have:
>> [...]
>> I now see that this is why using a spinlock only around the writeback
>> part did not solve the issue: both the compare part and the writeback
>> part need to be part of the same atomic operation - the lock needs to
>> be acquired before the cmp / ZF part.
> Not exactly: RMW insns require the lock to be acquired ahead of
> the memory read, and dropped after the memory write.
>
>> Opinions and suggestions are appreciated. If I'm not mistaken, it looks
>> like the smp_lock design is the better solution to the problem.
> Andrew has told me that the re-work of how we do memory accesses
> in the emulator is pretty high on his priority list now. Whether we'd
> want to introduce an interim solution therefore depends on the time
> scale to possibly reach the only ultimately correct solution (no longer
> acting on intermediate copies of the data coming from guest memory,
> which is only correct for plain reads and plain writes).

This reinforces my opinion that mapping the destination and pointing a
stub at the mapping is the only viable option.

~Andrew
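As a rough illustration of the map-and-stub approach: once the guest
destination is mapped into the hypervisor, a real locked CMPXCHG can be
executed against the mapping, so the compare and the write are atomic by
hardware. This is not existing Xen code - the compiler builtin below
stands in for a hand-written stub, and the mapping is assumed to have
been established elsewhere:

    #include <stdbool.h>
    #include <stdint.h>

    /* 'map' stands for a hypervisor mapping of the guest destination. */
    static bool stub_cmpxchg8(volatile uint64_t *map, uint64_t *rax,
                              uint64_t new_val)
    {
        uint64_t expected = *rax;

        /* Emits a real "lock cmpxchg" on x86, acting on the mapping. */
        bool zf = __atomic_compare_exchange_n(map, &expected, new_val,
                                              false, __ATOMIC_SEQ_CST,
                                              __ATOMIC_SEQ_CST);
        if ( !zf )
            *rax = expected; /* guest sees the value actually there */

        return zf;           /* caller reflects this into guest ZF  */
    }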
On 03/28/2017 01:03 PM, Jan Beulich wrote:
>>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
>> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
>> failed CMPXCHG should happen just once, with the proper registers and ZF
>> set. The guest surely expects neither that the instruction resume until
>> it succeeds, nor that some hidden loop goes on for an indeterminate
>> amount of time until a CMPXCHG succeeds.
>
> The guest doesn't observe the CMPXCHG failing - RETRY leads to
> the instruction being restarted instead of completed.

Indeed, but it works differently with hvm_emulate_one_vm_event(), where
RETRY currently would have the instruction be re-executed (properly
re-executed, not just re-emulated) by the guest.

> There is a severe downside to MMIO accesses with this model
> though: other than for RAM, we shouldn't be reading (and even
> less so writing) the memory location more than once.
>
>> The picture is further complicated by the two-part handling of
>> instructions in x86_emulate(). For example, for CMPXCHG we have:
>> [...]
>> I now see that this is why using a spinlock only around the writeback
>> part did not solve the issue: both the compare part and the writeback
>> part need to be part of the same atomic operation - the lock needs to
>> be acquired before the cmp / ZF part.
>
> Not exactly: RMW insns require the lock to be acquired ahead of
> the memory read, and dropped after the memory write.

Indeed, that's what I meant. I should have been more precise.

>> Opinions and suggestions are appreciated. If I'm not mistaken, it looks
>> like the smp_lock design is the better solution to the problem.
>
> Andrew has told me that the re-work of how we do memory accesses
> in the emulator is pretty high on his priority list now. Whether we'd
> want to introduce an interim solution therefore depends on the time
> scale to possibly reach the only ultimately correct solution (no longer
> acting on intermediate copies of the data coming from guest memory,
> which is only correct for plain reads and plain writes).

Great! I understand, I'll switch to the smp_lock() patch as soon as the
schedule allows, and we'll see how that fits into the timeframe.

Thanks,
Razvan
>>> On 28.03.17 at 12:25, <andrew.cooper3@citrix.com> wrote:
> On 28/03/17 11:03, Jan Beulich wrote:
>>>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
>>> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
>>> failed CMPXCHG should happen just once, with the proper registers and ZF
>>> set. The guest surely expects neither that the instruction resume until
>>> it succeeds, nor that some hidden loop goes on for an indeterminate
>>> amount of time until a CMPXCHG succeeds.
>> The guest doesn't observe the CMPXCHG failing - RETRY leads to
>> the instruction being restarted instead of completed.
>
> And this probably is the root of the problem. CMPXCHG on a contended
> location should be observed to fail from the guest's point of view.

Oops - my reply was imprecise: a _guest_ CMPXCHG would of course be
observed to have failed by the guest. Us using CMPXCHG to carry out some
other atomic op wouldn't.

>>> The picture is further complicated by the two-part handling of
>>> instructions in x86_emulate(). For example, for CMPXCHG we have:
>>> [...]
>>> I now see that this is why using a spinlock only around the writeback
>>> part did not solve the issue: both the compare part and the writeback
>>> part need to be part of the same atomic operation - the lock needs to
>>> be acquired before the cmp / ZF part.
>> Not exactly: RMW insns require the lock to be acquired ahead of
>> the memory read, and dropped after the memory write.
>>
>>> Opinions and suggestions are appreciated. If I'm not mistaken, it looks
>>> like the smp_lock design is the better solution to the problem.
>> Andrew has told me that the re-work of how we do memory accesses
>> in the emulator is pretty high on his priority list now. Whether we'd
>> want to introduce an interim solution therefore depends on the time
>> scale to possibly reach the only ultimately correct solution (no longer
>> acting on intermediate copies of the data coming from guest memory,
>> which is only correct for plain reads and plain writes).
>
> This reinforces my opinion that mapping the destination and pointing a
> stub at the mapping is the only viable option.

Well, you re-state what I was saying.

Jan
>>> On 28.03.17 at 12:27, <rcojocaru@bitdefender.com> wrote:
> On 03/28/2017 01:03 PM, Jan Beulich wrote:
>>>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
>>> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
>>> failed CMPXCHG should happen just once, with the proper registers and ZF
>>> set. The guest surely expects neither that the instruction resume until
>>> it succeeds, nor that some hidden loop goes on for an indeterminate
>>> amount of time until a CMPXCHG succeeds.
>>
>> The guest doesn't observe the CMPXCHG failing - RETRY leads to
>> the instruction being restarted instead of completed.
>
> Indeed, but it works differently with hvm_emulate_one_vm_event(), where
> RETRY currently would have the instruction be re-executed (properly
> re-executed, not just re-emulated) by the guest.

Right - see my other reply to Andrew: the function likely would need to
tell apart guest uses of CMPXCHG from us using the insn to carry out a
write on behalf of some other one. That may involve adjustments to the
memory write logic in x86_emulate() itself, as the late failure of the
comparison would then also need to be communicated back (via ZF clear)
to the guest.

Jan
On 03/28/2017 01:47 PM, Jan Beulich wrote:
>>>> On 28.03.17 at 12:27, <rcojocaru@bitdefender.com> wrote:
>> On 03/28/2017 01:03 PM, Jan Beulich wrote:
>>>>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
>>>> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
>>>> failed CMPXCHG should happen just once, with the proper registers and ZF
>>>> set. The guest surely expects neither that the instruction resume until
>>>> it succeeds, nor that some hidden loop goes on for an indeterminate
>>>> amount of time until a CMPXCHG succeeds.
>>>
>>> The guest doesn't observe the CMPXCHG failing - RETRY leads to
>>> the instruction being restarted instead of completed.
>>
>> Indeed, but it works differently with hvm_emulate_one_vm_event(), where
>> RETRY currently would have the instruction be re-executed (properly
>> re-executed, not just re-emulated) by the guest.
>
> Right - see my other reply to Andrew: the function likely would need
> to tell apart guest uses of CMPXCHG from us using the insn to carry
> out a write on behalf of some other one. That may involve adjustments
> to the memory write logic in x86_emulate() itself, as the late failure
> of the comparison would then also need to be communicated back (via ZF
> clear) to the guest.

Exactly, it would require quite some reworking of x86_emulate().

Thanks,
Razvan
>>> On 28.03.17 at 12:50, <rcojocaru@bitdefender.com> wrote:
> On 03/28/2017 01:47 PM, Jan Beulich wrote:
>>>>> On 28.03.17 at 12:27, <rcojocaru@bitdefender.com> wrote:
>>> On 03/28/2017 01:03 PM, Jan Beulich wrote:
>>>>>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
>>>>> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
>>>>> failed CMPXCHG should happen just once, with the proper registers and ZF
>>>>> set. The guest surely expects neither that the instruction resume until
>>>>> it succeeds, nor that some hidden loop goes on for an indeterminate
>>>>> amount of time until a CMPXCHG succeeds.
>>>>
>>>> The guest doesn't observe the CMPXCHG failing - RETRY leads to
>>>> the instruction being restarted instead of completed.
>>>
>>> Indeed, but it works differently with hvm_emulate_one_vm_event(), where
>>> RETRY currently would have the instruction be re-executed (properly
>>> re-executed, not just re-emulated) by the guest.
>>
>> Right - see my other reply to Andrew: the function likely would need
>> to tell apart guest uses of CMPXCHG from us using the insn to carry
>> out a write on behalf of some other one. That may involve adjustments
>> to the memory write logic in x86_emulate() itself, as the late failure
>> of the comparison would then also need to be communicated back (via ZF
>> clear) to the guest.
>
> Exactly, it would require quite some reworking of x86_emulate().

I don't think so, no. It would merely require a special-case step
following ->cmpxchg() to deal with it being CMPXCHG we're emulating,
plus matching code in CMPXCHG{8,16}B emulation.

Jan
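Such a special-case step in the writeback path might look roughly like
the fragment below. This is a purely hypothetical shape, not actual
emulator code: X86EMUL_CMPXCHG_FAILED and the is_cmpxchg flag do not
exist in the tree at this point, and the sketch assumes the ->cmpxchg()
hook returns the value it observed via dst.orig_val:

    else if ( lock_prefix )
    {
        fail_if(!ops->cmpxchg);
        rc = ops->cmpxchg(dst.mem.seg, dst.mem.off, &dst.orig_val,
                          &dst.val, dst.bytes, ctxt);
        if ( rc == X86EMUL_CMPXCHG_FAILED && is_cmpxchg )
        {
            /*
             * Late comparison failure: complete the insn as a failed
             * CMPXCHG instead of restarting it - clear ZF and hand the
             * value observed in memory back to the guest in EAX.
             */
            _regs.eflags &= ~X86_EFLAGS_ZF;
            _regs.r(ax) = dst.orig_val;
            rc = X86EMUL_OKAY;
        }
    }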
>>> On 28.03.17 at 12:25, <andrew.cooper3@citrix.com> wrote:
> On 28/03/17 11:03, Jan Beulich wrote:
>> There is a severe downside to MMIO accesses with this model
>> though: other than for RAM, we shouldn't be reading (and even
>> less so writing) the memory location more than once.
>
> With MMIO, the hardware analogy is sending a single PCIe transaction
> with the atomic flag set.
>
> However, atomic operations on non-RAM locations tend not to actually be
> atomic on real hardware anyway, so as long as we don't make multiple
> reads/writes, the lack of atomicity won't be a problem in practice.

Are you sure about this? Before caches were introduced, all LOCKed insns
used a bus lock; I would very much assume that this model is being
followed, for backwards compatibility, for all LOCKed memory accesses
not going through the cache.

Furthermore, I've been thinking a little more about the proposed new
memory write approach: that won't work for the original purpose the
emulator has - emulating insns in order to send data to / get data from
qemu. In that case there simply is no page to map, and this then also
explains the current ->write() mechanism. Multiple reads in this case
are being avoided by holding the data in struct hvm_mmio_cache. Multiple
writes are _currently_ being avoided by hvmemul_cmpxchg() forwarding to
hvmemul_write(). This property would need to be retained for any
adjustments we make.

As to the CMPXCHG insn (mis-)handling, I think it'll be best if I come
forward with the outlined x86_emulate() adjustment. Whether that'll be
acceptable for 4.9 we'll have to see.

Jan
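For context, the hvmemul_cmpxchg() forwarding mentioned here is the
current (pre-patch) implementation - exactly the lines the patch below
removes:

    static int hvmemul_cmpxchg(
        enum x86_segment seg,
        unsigned long offset,
        void *p_old,
        void *p_new,
        unsigned int bytes,
        struct x86_emulate_ctxt *ctxt)
    {
        /* Fix this in case the guest is really relying on r-m-w atomicity. */
        return hvmemul_write(seg, offset, p_new, bytes, ctxt);
    }

Because this simply writes the new value, it never reads the location a
second time - the property Jan notes must survive any rework.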
On 03/29/2017 08:59 AM, Jan Beulich wrote:
> As to the CMPXCHG insn (mis-)handling, I think it'll be best if I
> come forward with the outlined x86_emulate() adjustment.
> Whether that'll be acceptable for 4.9 we'll have to see.

Thanks! I'll test it with the CMPXCHG changes in my patch when it
becomes available.

Razvan
diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index 2d92957..b5754a1 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -20,6 +20,7 @@
 #include <asm/hvm/emulate.h>
 #include <asm/hvm/hvm.h>
 #include <asm/hvm/ioreq.h>
+#include <asm/hvm/nestedhvm.h>
 #include <asm/hvm/trace.h>
 #include <asm/hvm/support.h>
 #include <asm/hvm/svm/svm.h>
@@ -1029,6 +1030,77 @@ static int hvmemul_wbinvd_discard(
     return X86EMUL_OKAY;
 }
 
+static int hvmemul_vaddr_to_mfn(
+    unsigned long addr,
+    mfn_t *mfn,
+    uint32_t pfec,
+    struct x86_emulate_ctxt *ctxt)
+{
+    paddr_t gpa = addr & ~PAGE_MASK;
+    struct page_info *page;
+    p2m_type_t p2mt;
+    unsigned long gfn;
+    struct vcpu *curr = current;
+    struct hvm_emulate_ctxt *hvmemul_ctxt =
+        container_of(ctxt, struct hvm_emulate_ctxt, ctxt);
+
+    gfn = paging_gva_to_gfn(curr, addr, &pfec);
+
+    if ( gfn == gfn_x(INVALID_GFN) )
+    {
+        pagefault_info_t pfinfo = {};
+
+        if ( ( pfec & PFEC_page_paged ) || ( pfec & PFEC_page_shared ) )
+            return X86EMUL_RETRY;
+
+        pfinfo.linear = addr;
+        pfinfo.ec = pfec;
+
+        x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
+        return X86EMUL_EXCEPTION;
+    }
+
+    gpa |= (paddr_t)gfn << PAGE_SHIFT;
+
+    /*
+     * No need to do the P2M lookup for internally handled MMIO, benefiting
+     * - 32-bit WinXP (& older Windows) on AMD CPUs for LAPIC accesses,
+     * - newer Windows (like Server 2012) for HPET accesses.
+     */
+    if ( !nestedhvm_vcpu_in_guestmode(curr) && hvm_mmio_internal(gpa) )
+        return X86EMUL_UNHANDLEABLE;
+
+    page = get_page_from_gfn(curr->domain, gfn, &p2mt, P2M_UNSHARE);
+
+    if ( !page )
+        return X86EMUL_UNHANDLEABLE;
+
+    if ( p2m_is_paging(p2mt) )
+    {
+        put_page(page);
+        p2m_mem_paging_populate(curr->domain, gfn);
+        return X86EMUL_RETRY;
+    }
+
+    if ( p2m_is_shared(p2mt) )
+    {
+        put_page(page);
+        return X86EMUL_RETRY;
+    }
+
+    if ( p2m_is_grant(p2mt) )
+    {
+        put_page(page);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    *mfn = _mfn(page_to_mfn(page));
+
+    put_page(page);
+
+    return X86EMUL_OKAY;
+}
+
 static int hvmemul_cmpxchg(
     enum x86_segment seg,
     unsigned long offset,
@@ -1037,8 +1109,70 @@ static int hvmemul_cmpxchg(
     unsigned int bytes,
     struct x86_emulate_ctxt *ctxt)
 {
-    /* Fix this in case the guest is really relying on r-m-w atomicity. */
-    return hvmemul_write(seg, offset, p_new, bytes, ctxt);
+    unsigned long addr, reps = 1;
+    int rc = X86EMUL_OKAY;
+    unsigned long old = 0, new = 0;
+    uint32_t pfec = PFEC_page_present | PFEC_write_access;
+    struct hvm_emulate_ctxt *hvmemul_ctxt =
+        container_of(ctxt, struct hvm_emulate_ctxt, ctxt);
+    mfn_t mfn[2];
+    void *map = NULL;
+    struct domain *currd = current->domain;
+
+    if ( is_x86_system_segment(seg) )
+        pfec |= PFEC_implicit;
+    else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].attr.fields.dpl == 3 )
+        pfec |= PFEC_user_mode;
+
+    rc = hvmemul_virtual_to_linear(
+        seg, offset, bytes, &reps, hvm_access_write, hvmemul_ctxt, &addr);
+
+    if ( rc != X86EMUL_OKAY || !bytes )
+        return rc;
+
+    rc = hvmemul_vaddr_to_mfn(addr, &mfn[0], pfec, ctxt);
+
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    if ( likely(((addr + bytes - 1) & PAGE_MASK) == (addr & PAGE_MASK)) )
+    {
+        /* Whole write fits on a single page. */
+        mfn[1] = INVALID_MFN;
+        map = map_domain_page(mfn[0]);
+    }
+    else
+    {
+        rc = hvmemul_vaddr_to_mfn((addr + bytes - 1) & PAGE_MASK, &mfn[1],
+                                  pfec, ctxt);
+        if ( rc != X86EMUL_OKAY )
+            return rc;
+
+        map = vmap(mfn, 2);
+    }
+
+    if ( !map )
+        return X86EMUL_UNHANDLEABLE;
+
+    map += (addr & ~PAGE_MASK);
+
+    memcpy(&old, p_old, bytes);
+    memcpy(&new, p_new, bytes);
+
+    if ( __cmpxchg(map, old, new, bytes) != old )
+        rc = X86EMUL_RETRY;
+
+    paging_mark_dirty(currd, mfn[0]);
+
+    if ( unlikely(mfn_valid(mfn[1])) )
+    {
+        paging_mark_dirty(currd, mfn[1]);
+        vunmap((void *)((unsigned long)map & PAGE_MASK));
+    }
+    else
+        unmap_domain_page(map);
+
+    return rc;
 }
 
 static int hvmemul_validate(
@@ -1961,59 +2095,57 @@ int hvm_emulate_one_mmio(unsigned long mfn, unsigned long gla)
 void hvm_emulate_one_vm_event(enum emul_kind kind, unsigned int trapnr,
     unsigned int errcode)
 {
-    struct hvm_emulate_ctxt ctx = {{ 0 }};
-    int rc;
+    int rc = X86EMUL_OKAY;
 
-    hvm_emulate_init_once(&ctx, NULL, guest_cpu_user_regs());
+    do {
+        struct hvm_emulate_ctxt ctx = {{ 0 }};
 
-    switch ( kind )
-    {
-    case EMUL_KIND_NOWRITE:
-        rc = _hvm_emulate_one(&ctx, &hvm_emulate_ops_no_write);
-        break;
-    case EMUL_KIND_SET_CONTEXT_INSN: {
-        struct vcpu *curr = current;
-        struct hvm_vcpu_io *vio = &curr->arch.hvm_vcpu.hvm_io;
+        hvm_emulate_init_once(&ctx, NULL, guest_cpu_user_regs());
 
-        BUILD_BUG_ON(sizeof(vio->mmio_insn) !=
-                     sizeof(curr->arch.vm_event->emul.insn.data));
-        ASSERT(!vio->mmio_insn_bytes);
+        switch ( kind )
+        {
+        case EMUL_KIND_NOWRITE:
+            rc = _hvm_emulate_one(&ctx, &hvm_emulate_ops_no_write);
+            break;
+        case EMUL_KIND_SET_CONTEXT_INSN: {
+            struct vcpu *curr = current;
+            struct hvm_vcpu_io *vio = &curr->arch.hvm_vcpu.hvm_io;
+
+            BUILD_BUG_ON(sizeof(vio->mmio_insn) !=
+                         sizeof(curr->arch.vm_event->emul.insn.data));
+            ASSERT(!vio->mmio_insn_bytes);
+
+            /*
+             * Stash insn buffer into mmio buffer here instead of ctx
+             * to avoid having to add more logic to hvm_emulate_one.
+             */
+            vio->mmio_insn_bytes = sizeof(vio->mmio_insn);
+            memcpy(vio->mmio_insn, curr->arch.vm_event->emul.insn.data,
+                   vio->mmio_insn_bytes);
+        }
+        /* Fall-through */
+        default:
+            ctx.set_context = (kind == EMUL_KIND_SET_CONTEXT_DATA);
+            rc = hvm_emulate_one(&ctx);
+        }
 
-        /*
-         * Stash insn buffer into mmio buffer here instead of ctx
-         * to avoid having to add more logic to hvm_emulate_one.
-         */
-        vio->mmio_insn_bytes = sizeof(vio->mmio_insn);
-        memcpy(vio->mmio_insn, curr->arch.vm_event->emul.insn.data,
-               vio->mmio_insn_bytes);
-    }
-    /* Fall-through */
-    default:
-        ctx.set_context = (kind == EMUL_KIND_SET_CONTEXT_DATA);
-        rc = hvm_emulate_one(&ctx);
-    }
+        switch ( rc )
+        {
+        case X86EMUL_RETRY:
+            break;
+        case X86EMUL_UNHANDLEABLE:
+            hvm_dump_emulation_state(XENLOG_G_DEBUG "Mem event", &ctx);
+            hvm_inject_hw_exception(trapnr, errcode);
+            break;
+        case X86EMUL_EXCEPTION:
+            if ( ctx.ctxt.event_pending )
+                hvm_inject_event(&ctx.ctxt.event);
+            break;
+        }
 
-    switch ( rc )
-    {
-    case X86EMUL_RETRY:
-        /*
-         * This function is called when handling an EPT-related vm_event
-         * reply. As such, nothing else needs to be done here, since simply
-         * returning makes the current instruction cause a page fault again,
-         * consistent with X86EMUL_RETRY.
-         */
-        return;
-    case X86EMUL_UNHANDLEABLE:
-        hvm_dump_emulation_state(XENLOG_G_DEBUG "Mem event", &ctx);
-        hvm_inject_hw_exception(trapnr, errcode);
-        break;
-    case X86EMUL_EXCEPTION:
-        if ( ctx.ctxt.event_pending )
-            hvm_inject_event(&ctx.ctxt.event);
-        break;
-    }
+        hvm_emulate_writeback(&ctx);
 
-    hvm_emulate_writeback(&ctx);
+    } while( rc == X86EMUL_RETRY );
 }
 
 void hvm_emulate_init_once(
hvmemul_cmpxchg() is not presently SMP-safe, as it ends up doing a
memcpy(). Use an actual CMPXCHG instruction in the implementation.

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>

---
Questions:

- I've used __cmpxchg(), as it's already there, but it always locks. In
  my tests and in x86_emulate()'s context, the lock is already taken
  anyway, so it doesn't appear to make a difference at present. Have I
  missed a case where it does, or should we add a lock prefix anyway for
  the future? If so, should I use another version of cmpxchg() that we
  already have in the source code but I've missed, modify this one, or
  bring another implementation in? (An illustrative locked-cmpxchg
  sketch follows after the diffstat below.)

- Is there anything I've missed in the implementation that could crash
  the host / guest / cause trouble?

- In the introspection context, before this change a RETRY return in
  hvm_emulate_one_vm_event() would do nothing - when emulating because
  of a mem_access vm_event, this would mean that the VCPU would then
  simply try to execute the instruction again, which would trigger
  another mem_access event (which would hopefully this time trigger a
  successful insn emulation). However, with these changes, this would
  mean that, in an SMP scenario where a VCPU failed its CMPXCHG, the
  instruction would be re-executed in the guest, which I've found leads
  to bad things happening (mostly BSODs). I've mitigated this in this
  patch by trying to emulate the failed instruction in a loop until it
  succeeds, but I'm not convinced it is a good idea. What are the
  alternatives?
---
 xen/arch/x86/hvm/emulate.c | 230 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 181 insertions(+), 49 deletions(-)
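As context for the lock-prefix question above: a minimal sketch of a
4-byte compare-and-exchange with an explicit LOCK prefix in GCC inline
assembly. Illustrative only - the patch itself relies on Xen's existing
__cmpxchg(), and this standalone helper is not part of the tree:

    #include <stdint.h>

    static inline uint32_t cmpxchg32_locked(volatile uint32_t *ptr,
                                            uint32_t old, uint32_t new_val)
    {
        uint32_t prev;

        /* LOCK makes the compare-and-write atomic across CPUs. */
        asm volatile ( "lock; cmpxchgl %2, %1"
                       : "=a" (prev), "+m" (*ptr)
                       : "r" (new_val), "0" (old)
                       : "memory" );
        return prev; /* equals 'old' iff the exchange happened (ZF set) */
    }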