[RFC] x86/emulate: implement hvmemul_cmpxchg() with an actual CMPXCHG

Message ID 1490361899-18303-1-git-send-email-rcojocaru@bitdefender.com (mailing list archive)
State New, archived

Commit Message

Razvan Cojocaru March 24, 2017, 1:24 p.m. UTC
hvmemul_cmpxchg() is not presently SMP-safe, as it ends up doing
a memcpy(). Use an actual CMPXCHG instruction in the implementation.

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>

---
Questions:
 - I've used __cmpxchg(), as it's already there, but it always locks.
   In my tests and in x86_emulate()'s context, the lock is already taken
   anyway, so it doesn't appear to make a difference at present. Have I
   missed a case where it does, or should we add a lock prefix anyway
   for the future? If so, should I use another version of cmpxchg()
   that we already have in the source code but I've missed, modify this
   one, or bring in another implementation?
 - Is there anything I've missed in the implementation that could crash
   the host / guest / cause trouble?
 - In the introspection context, before this change a RETRY return in
   hvm_emulate_one_vm_event() would do nothing - when emulating because
   of a mem_access vm_event, this would mean that the VCPU would then
   simply try to execute the instruction again, which would trigger
   another mem_access event (which would hopefully this time trigger
   a successful insn emulation). However, with these changes, this
   would mean that, in an SMP scenario where a VCPU failed its
   CMPXCHG, the instruction would be re-executed in the guest, which
   I've found leads to bad things happening (mostly BSODs). I've
   mitigated this in this patch by trying to emulate the failed
   instruction in a loop until it succeeds, but I'm not convinced it
   is a good idea. What are the alternatives?
---
 xen/arch/x86/hvm/emulate.c | 230 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 181 insertions(+), 49 deletions(-)

Comments

Razvan Cojocaru March 28, 2017, 9:14 a.m. UTC | #1
The real CMPXCHG hvmemul_cmpxchg() implementation works as expected as
far as the race conditions go, and returning RETRY for failed writebacks
seems to work without issue for regular Xen emulation.

However, when the patch is used with introspection, I've had a BCCode:
101 BSOD and rare (but several) occasions when the guest becomes
unresponsive (can't close Firefox or have the Windows start menu show up
when clicking the "Start" button, but the guest is otherwise running).

This I believe is due to the do { } while ( rc == X86EMUL_RETRY ); loop
in hvm_emulate_one_vm_event(): I am basically now looping behind the
guest's back until the CMPXCHG succeeds, which can theoretically mean a
very long time spent executing a single CMPXCHG instruction, and is most
likely not what the guest OS is expecting.
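
For reference, stripped of the EMUL_KIND_* handling and exception
injection, the loop this patch adds boils down to:

do {
    struct hvm_emulate_ctxt ctx = {{ 0 }};

    hvm_emulate_init_once(&ctx, NULL, guest_cpu_user_regs());
    rc = hvm_emulate_one(&ctx); /* or the no-write / set-context variants */
    /* ... X86EMUL_UNHANDLEABLE / X86EMUL_EXCEPTION handling ... */
    hvm_emulate_writeback(&ctx);
} while ( rc == X86EMUL_RETRY );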

The alternative (and the current default) is to do nothing on
X86EMUL_RETRY and just allow the guest to re-enter at the same place,
which should trigger the same page fault vm_event, whose handler will
hopefully now be able to emulate the current instruction. However, in
the best-case scenario this just complicates the above loop, since the
current instruction will still be unable to complete until emulation
succeeds, only now with VMEXITs in between. And in the worst-case
scenario (which is what happens in my tests) it adds an additional
factor of unpredictability, since the guest quickly BSODs (or, more
rarely, just plain hangs).

I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
failed CMPXCHG should happen just once, with the proper registers and ZF
set. The guest surely expects neither that the instruction resume until
it succeeds, nor that some hidden loop goes on for an indeterminate
amount of time until a CMPXCHG succeeds.
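
For reference, the architecturally visible behaviour of CMPXCHG that the
guest relies on is, in pseudo-code (the accumulator being AL/AX/EAX/RAX,
depending on operand size):

if ( accumulator == dest )
{
    ZF = 1;
    dest = src;         /* the (possibly LOCKed) write happens exactly once */
}
else
{
    ZF = 0;
    accumulator = dest; /* the guest sees the value actually in memory */
}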

The picture is further complicated by the two-part handling of
instructions in x86_emulate(). For example, for CMPXCHG we have:

case X86EMUL_OPC(0x0f, 0xb0): case X86EMUL_OPC(0x0f, 0xb1): /* cmpxchg */
    /* Save real source value, then compare EAX against destination. */
    src.orig_val = src.val;
    src.val = _regs.r(ax);
    /* cmp: %%eax - dst ==> dst and src swapped for macro invocation */
    emulate_2op_SrcV("cmp", dst, src, _regs.eflags);
    if ( _regs.eflags & X86_EFLAGS_ZF )
    {
        /* Success: write back to memory. */
        dst.val = src.orig_val;
    }
    else
    {
        /* Failure: write the value we saw to EAX. */
        dst.type = OP_REG;
        dst.reg  = (unsigned long *)&_regs.r(ax);
    }
    break;

This is the only part that sets the proper registers and ZF, and it's
only the comparison. If this succeeds, x86_emulate() currently assumes
that the writeback part cannot fail. But then the writeback part is:

case OP_MEM:
    if ( !(d & Mov) && (dst.orig_val == dst.val) &&
         !ctxt->force_writeback )
         /* nothing to do */;
    else if ( lock_prefix )
    {
        fail_if(!ops->cmpxchg);
        rc = ops->cmpxchg(
            dst.mem.seg, dst.mem.off, &dst.orig_val,
             &dst.val, dst.bytes, ctxt);
    }
    else
    {
        fail_if(!ops->write);
        rc = ops->write(dst.mem.seg, dst.mem.off,
                        !state->simd_size ? &dst.val : (void *)mmvalp,
                        dst.bytes, ctxt);
        if ( sfence )
            asm volatile ( "sfence" ::: "memory" );
    }
    if ( rc != 0 )
        goto done;
default:
    break;

I now see that this is why using a spinlock only around the writeback
part did not solve the issue: both the compare part and the writeback
part need to be part of the same atomic operation - lock needs to be
acquired before the cmp / ZF part.

Opinions and suggestions are appreciated. If I'm not mistaken, it looks
like the smp_lock design is the better solution to the problem.


Thanks,
Razvan
Jan Beulich March 28, 2017, 10:03 a.m. UTC | #2
>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
> failed CMPXCHG should happen just once, with the proper registers and ZF
> set. The guest surely expects neither that the instruction resume until
> it succeeds, nor that some hidden loop goes on for an indeterminate
> amount of time until a CMPXCHG succeeds.

The guest doesn't observe the CMPXCHG failing - RETRY leads to
the instruction being restarted instead of completed.

There is a severe downside to MMIO accesses with this model
though: Other than for RAM, we shouldn't be reading (and even
less so writing) the memory location more than once.

> The picture is further complicated by the two-part handling of
> instructions in x86_emulate(). For example, for CMPXCHG we have:
>[...]
> I now see that this is why using a spinlock only around the writeback
> part did not solve the issue: both the compare part and the writeback
> part need to be part of the same atomic operation - lock needs to be
> acquired before the cmp / ZF part.

Not exactly: RMW insns require the lock to be acquired ahead of
the memory read, and dropped after the memory write.
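
I.e., in rough pseudo-code (lock()/unlock() being placeholders here, not
existing functions):

lock();                                    /* before the memory read      */
rc = ops->read(seg, offset, &val, bytes, ctxt);
/* ... apply the insn's operation to val ... */
rc = ops->write(seg, offset, &val, bytes, ctxt);
unlock();                                  /* only after the memory write */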

> Opinions and suggestions are appreciated. If I'm not mistaken, it looks
> like the smp_lock design is the better solution to the problem.

Andrew has told me that the re-work of how we do memory accesses
in the emulator is pretty high on his priority list now. Whether we'd
want to introduce an interim solution therefore depends on the time
scale to possibly reach the only ultimately correct solution (no longer
acting on intermediate copies of the data coming from guest memory,
which is only correct for plain reads and plain writes).

Jan
Andrew Cooper March 28, 2017, 10:25 a.m. UTC | #3
On 28/03/17 11:03, Jan Beulich wrote:
>>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
>> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
>> failed CMPXCHG should happen just once, with the proper registers and ZF
>> set. The guest surely expects neither that the instruction resume until
>> it succeeds, nor that some hidden loop goes on for an indeterminate
>> amount of time until a CMPXCHG succeeds.
> The guest doesn't observe the CMPXCHG failing - RETRY leads to
> the instruction being restarted instead of completed.

And this probably is the root of the problem.  CMPXCHG on a contended
location should be observed to fail from the guest's point of view.

>
> There is a severe downside to MMIO accesses with this model
> though: Other than for RAM, we shouldn't be reading (and even
> less so writing) the memory location more than once.

With MMIO, the hardware analogy is sending a single PCIe transaction
with the atomic flag set.

However, atomic operations on non-RAM locations tend not to actually be
atomic on real hardware anyway, so as long as we don't make multiple
reads/writes, the lack of atomicity won't be a problem in practice.

>
>> The picture is further complicated by the two-part handling of
>> instructions in x86_emulate(). For example, for CMPXCHG we have:
>> [...]
>> I now see that this is why using a spinlock only around the writeback
>> part did not solve the issue: both the compare part and the writeback
>> part need to be part of the same atomic operation - lock needs to be
>> acquired before the cmp / ZF part.
> Not exactly: RMW insns require the lock to be acquired ahead of
> the memory read, and dropped after the memory write.
>
>> Opinions and suggestions are appreciated. If I'm not mistaken, it looks
>> like the smp_lock design is the better solution to the problem.
> Andrew has told me that the re-work of how we do memory accesses
> in the emulator is pretty high on his priority list now. Whether we'd
> want to introduce an interim solution therefore depends on the time
> scale to possibly reach the only ultimately correct solution (no longer
> acting on intermediate copies of the data coming from guest memory,
> which is only correct for plain reads and plain writes).

This reinforces my opinion that mapping the destination and pointing a
stub at the mapping is the only viable option.
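
Conceptually (a very rough sketch - no such stub infrastructure exists
yet):

void *mapping = map_domain_page(mfn);   /* map the destination page */
/* Execute a stub doing e.g. "lock cmpxchg" directly against
 * mapping + offset, rather than against a local copy of the data. */
unmap_domain_page(mapping);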

~Andrew
Razvan Cojocaru March 28, 2017, 10:27 a.m. UTC | #4
On 03/28/2017 01:03 PM, Jan Beulich wrote:
>>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
>> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
>> failed CMPXCHG should happen just once, with the proper registers and ZF
>> set. The guest surely expects neither that the instruction resume until
>> it succeeds, nor that some hidden loop goes on for an indeterminate
>> amount of time until a CMPXCHG succeeds.
> 
> The guest doesn't observe the CMPXCHG failing - RETRY leads to
> the instruction being restarted instead of completed.

Indeed, but it works differently with hvm_emulate_one_vm_event() where
RETRY currently would have the instruction be re-executed (properly
re-executed, not just re-emulated) by the guest.

> There is a severe downside to MMIO accesses with this model
> though: Other than for RAM, we shouldn't be reading (and even
> less so writing) the memory location more than once.
> 
>> The picture is further complicated by the two-part handling of
>> instructions in x86_emulate(). For example, for CMPXCHG we have:
>> [...]
>> I now see that this is why using a spinlock only around the writeback
>> part did not solve the issue: both the compare part and the writeback
>> part need to be part of the same atomic operation - lock needs to be
>> acquired before the cmp / ZF part.
> 
> Not exactly: RMW insns require the lock to be acquired ahead of
> the memory read, and dropped after the memory write.

Indeed, that's what I meant. I should have been more precise.

>> Opinions and suggestions are appreciated. If I'm not mistaken, it looks
>> like the smp_lock design is the better solution to the problem.
> 
> Andrew has told me that the re-work of how we do memory accesses
> in the emulator is pretty high on his priority list now. Whether we'd
> want to introduce an interim solution therefore depends on the time
> scale to possibly reach the only ultimately correct solution (no longer
> acting on intermediate copies of the data coming from guest memory,
> which is only correct for plain reads and plain writes).

Great! I understand, I'll switch to the smp_lock() patch as soon as the
schedule allows, and we'll see how that fits into the timeframe.


Thanks,
Razvan
Jan Beulich March 28, 2017, 10:44 a.m. UTC | #5
>>> On 28.03.17 at 12:25, <andrew.cooper3@citrix.com> wrote:
> On 28/03/17 11:03, Jan Beulich wrote:
>>>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
>>> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
>>> failed CMPXCHG should happen just once, with the proper registers and ZF
>>> set. The guest surely expects neither that the instruction resume until
>>> it succeeds, nor that some hidden loop goes on for an indeterminate
>>> amount of time until a CMPXCHG succeeds.
>> The guest doesn't observe the CMPXCHG failing - RETRY leads to
>> the instruction being restarted instead of completed.
> 
> And this probably is the root of the problem.  CMPXCHG on a contended
> location should be observed to fail from the guests point of view.

Oops - my reply was imprecise: A _guest_ CMPXCHG would of
course be observed to have failed by the guest. Us using
CMPXCHG to carry out some other atomic op wouldn't.

>>> The picture is further complicated by the two-part handling of
>>> instructions in x86_emulate(). For example, for CMPXCHG we have:
>>> [...]
>>> I now see that this is why using a spinlock only around the writeback
>>> part did not solve the issue: both the compare part and the writeback
>>> part need to be part of the same atomic operation - lock needs to be
>>> acquired before the cmp / ZF part.
>> Not exactly: RMW insns require the lock to be acquired ahead of
>> the memory read, and dropped after the memory write.
>>
>>> Opinions and suggestions are appreciated. If I'm not mistaken, it looks
>>> like the smp_lock design is the better solution to the problem.
>> Andrew has told me that the re-work of how we do memory accesses
>> in the emulator is pretty high on his priority list now. Whether we'd
>> want to introduce an interim solution therefore depends on the time
>> scale to possibly reach the only ultimately correct solution (no longer
>> acting on intermediate copies of the data coming from guest memory,
>> which is only correct for plain reads and plain writes).
> 
> This reinforces my opinion that mapping the destination and pointing a
> stub at the mapping is the only viable option.

Well, you re-state what I was saying.

Jan
Jan Beulich March 28, 2017, 10:47 a.m. UTC | #6
>>> On 28.03.17 at 12:27, <rcojocaru@bitdefender.com> wrote:
> On 03/28/2017 01:03 PM, Jan Beulich wrote:
>>>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
>>> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
>>> failed CMPXCHG should happen just once, with the proper registers and ZF
>>> set. The guest surely expects neither that the instruction resume until
>>> it succeeds, nor that some hidden loop goes on for an indeterminate
>>> amount of time until a CMPXCHG succeeds.
>> 
>> The guest doesn't observe the CMPXCHG failing - RETRY leads to
>> the instruction being restarted instead of completed.
> 
> Indeed, but it works differently with hvm_emulate_one_vm_event() where
> RETRY currently would have the instruction be re-executed (properly
> re-executed, not just re-emulated) by the guest.

Right - see my other reply to Andrew: The function likely would
need to tell apart guest CMPXCHG uses from us using the insn to
carry out the write by some other one. That may involve
adjustments to the memory write logic in x86_emulate() itself, as
the late failure of the comparison then would also need to be
communicated back (via ZF clear) to the guest.

Jan
Razvan Cojocaru March 28, 2017, 10:50 a.m. UTC | #7
On 03/28/2017 01:47 PM, Jan Beulich wrote:
>>>> On 28.03.17 at 12:27, <rcojocaru@bitdefender.com> wrote:
>> On 03/28/2017 01:03 PM, Jan Beulich wrote:
>>>>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
>>>> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
>>>> failed CMPXCHG should happen just once, with the proper registers and ZF
>>>> set. The guest surely expects neither that the instruction resume until
>>>> it succeeds, nor that some hidden loop goes on for an indeterminate
>>>> amount of time until a CMPXCHG succeeds.
>>>
>>> The guest doesn't observe the CMPXCHG failing - RETRY leads to
>>> the instruction being restarted instead of completed.
>>
>> Indeed, but it works differently with hvm_emulate_one_vm_event() where
>> RETRY currently would have the instruction be re-executed (properly
>> re-executed, not just re-emulated) by the guest.
> 
> Right - see my other reply to Andrew: The function likely would
> need to tell apart guest CMPXCHG uses from us using the insn to
> carry out the write by some other one. That may involve
> adjustments to the memory write logic in x86_emulate() itself, as
> the late failure of the comparison then would also need to be
> communicated back (via ZF clear) to the guest.

Exactly, it would require quite some reworking of x86_emulate().


Thanks,
Razvan
Jan Beulich March 28, 2017, 11:32 a.m. UTC | #8
>>> On 28.03.17 at 12:50, <rcojocaru@bitdefender.com> wrote:
> On 03/28/2017 01:47 PM, Jan Beulich wrote:
>>>>> On 28.03.17 at 12:27, <rcojocaru@bitdefender.com> wrote:
>>> On 03/28/2017 01:03 PM, Jan Beulich wrote:
>>>>>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
>>>>> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
>>>>> failed CMPXCHG should happen just once, with the proper registers and ZF
>>>>> set. The guest surely expects neither that the instruction resume until
>>>>> it succeeds, nor that some hidden loop goes on for an indeterminate
>>>>> amount of time until a CMPXCHG succeeds.
>>>>
>>>> The guest doesn't observe the CMPXCHG failing - RETRY leads to
>>>> the instruction being restarted instead of completed.
>>>
>>> Indeed, but it works differently with hvm_emulate_one_vm_event() where
>>> RETRY currently would have the instruction be re-executed (properly
>>> re-executed, not just re-emulated) by the guest.
>> 
>> Right - see my other reply to Andrew: The function likely would
>> need to tell apart guest CMPXCHG uses from us using the insn to
>> carry out the write by some other one. That may involve
>> adjustments to the memory write logic in x86_emulate() itself, as
>> the late failure of the comparison then would also need to be
>> communicated back (via ZF clear) to the guest.
> 
> Exactly, it would require quite some reworking of x86_emulate().

I don't think so, no. It would merely require a special case step
following ->cmpxchg() to deal with it being CMPXCHG we're
emulating. Plus matching code in CMPXCHG{8,16}B emulation.
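
Something along these lines in the writeback path (purely illustrative -
the return value name is made up, and it assumes the hook stores the
value it actually observed back through its "old" argument):

rc = ops->cmpxchg(dst.mem.seg, dst.mem.off, &dst.orig_val,
                  &dst.val, dst.bytes, ctxt);
if ( rc == X86EMUL_CMPXCHG_FAILED ) /* hypothetical return value */
{
    /*
     * Late comparison failure: clear ZF and route the observed value
     * (now in dst.orig_val) into rAX, instead of retrying the whole
     * instruction.
     */
    _regs.eflags &= ~X86_EFLAGS_ZF;
    dst.val = dst.orig_val;
    /* ... write dst.val back to the rAX register ... */
    rc = X86EMUL_OKAY;
}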

Jan
Jan Beulich March 29, 2017, 5:59 a.m. UTC | #9
>>> On 28.03.17 at 12:25, <andrew.cooper3@citrix.com> wrote:
> On 28/03/17 11:03, Jan Beulich wrote:
>> There is a severe downside to MMIO accesses with this model
>> though: Other than for RAM, we shouldn't be reading (and even
>> less so writing) the memory location more than once.
> 
> With MMIO, the hardware analogy is sending a single PCIe transaction
> with the atomic flag set.
> 
> However, atomic operations on non-RAM locations tend not to actually be
> atomic on real hardware anyway, so as long as we don't make multiple
> reads/writes, the lack of atomicity won't be a problem in practice.

Are you sure about this? Before caches were introduced, all LOCKed
insns used a bus lock; I would very much assume that this model is
being followed for backwards compatibility for all LOCKed memory
accesses not going through the cache.

Furthermore I've been thinking a little more about the proposed
new memory write approach: That won't work for the original
purpose the emulator has - emulating insns in order to send/get
data to/from qemu. In that case there simply is no page to map,
and this then also explains the current ->write() mechanism.
Multiple reads in this case are being avoided by holding the data
in struct hvm_mmio_cache. Multiple writes
_currently_ are being avoided by hvmemul_cmpxchg() forwarding
to hvmemul_write(). This property would need to be retained
for any adjustments we make.

As to the CMPXCHG insn (mis-)handling, I think it'll be best if I
come forward with the outlined x86_emulate() adjustment.
Whether that'll be acceptable for 4.9 we'll have to see.

Jan
Razvan Cojocaru March 29, 2017, 8:14 a.m. UTC | #10
On 03/29/2017 08:59 AM, Jan Beulich wrote:
> As to the CMPXCHG insn (mis-)handling, I think it'll be best if I
> come forward with the outlined x86_emulate() adjustment.
> Whether that'll be acceptable for 4.9 we'll have to see.

Thanks! I'll test it with the CMPXCHG changes in my patch when it
becomes available.


Razvan

Patch

diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index 2d92957..b5754a1 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -20,6 +20,7 @@ 
 #include <asm/hvm/emulate.h>
 #include <asm/hvm/hvm.h>
 #include <asm/hvm/ioreq.h>
+#include <asm/hvm/nestedhvm.h>
 #include <asm/hvm/trace.h>
 #include <asm/hvm/support.h>
 #include <asm/hvm/svm/svm.h>
@@ -1029,6 +1030,77 @@  static int hvmemul_wbinvd_discard(
     return X86EMUL_OKAY;
 }
 
+static int hvmemul_vaddr_to_mfn(
+    unsigned long addr,
+    mfn_t *mfn,
+    uint32_t pfec,
+    struct x86_emulate_ctxt *ctxt)
+{
+    paddr_t gpa = addr & ~PAGE_MASK;
+    struct page_info *page;
+    p2m_type_t p2mt;
+    unsigned long gfn;
+    struct vcpu *curr = current;
+    struct hvm_emulate_ctxt *hvmemul_ctxt =
+        container_of(ctxt, struct hvm_emulate_ctxt, ctxt);
+
+    gfn = paging_gva_to_gfn(curr, addr, &pfec);
+
+    if ( gfn == gfn_x(INVALID_GFN) )
+    {
+        pagefault_info_t pfinfo = {};
+
+        if ( ( pfec & PFEC_page_paged ) || ( pfec & PFEC_page_shared ) )
+            return X86EMUL_RETRY;
+
+        pfinfo.linear = addr;
+        pfinfo.ec = pfec;
+
+        x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
+        return X86EMUL_EXCEPTION;
+    }
+
+    gpa |= (paddr_t)gfn << PAGE_SHIFT;
+
+    /*
+     * No need to do the P2M lookup for internally handled MMIO, benefiting
+     * - 32-bit WinXP (& older Windows) on AMD CPUs for LAPIC accesses,
+     * - newer Windows (like Server 2012) for HPET accesses.
+     */
+    if ( !nestedhvm_vcpu_in_guestmode(curr) && hvm_mmio_internal(gpa) )
+        return X86EMUL_UNHANDLEABLE;
+
+    page = get_page_from_gfn(curr->domain, gfn, &p2mt, P2M_UNSHARE);
+
+    if ( !page )
+        return X86EMUL_UNHANDLEABLE;
+
+    if ( p2m_is_paging(p2mt) )
+    {
+        put_page(page);
+        p2m_mem_paging_populate(curr->domain, gfn);
+        return X86EMUL_RETRY;
+    }
+
+    if ( p2m_is_shared(p2mt) )
+    {
+        put_page(page);
+        return X86EMUL_RETRY;
+    }
+
+    if ( p2m_is_grant(p2mt) )
+    {
+        put_page(page);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    *mfn = _mfn(page_to_mfn(page));
+
+    put_page(page);
+
+    return X86EMUL_OKAY;
+}
+
 static int hvmemul_cmpxchg(
     enum x86_segment seg,
     unsigned long offset,
@@ -1037,8 +1109,70 @@  static int hvmemul_cmpxchg(
     unsigned int bytes,
     struct x86_emulate_ctxt *ctxt)
 {
-    /* Fix this in case the guest is really relying on r-m-w atomicity. */
-    return hvmemul_write(seg, offset, p_new, bytes, ctxt);
+    unsigned long addr, reps = 1;
+    int rc = X86EMUL_OKAY;
+    unsigned long old = 0, new = 0;
+    uint32_t pfec = PFEC_page_present | PFEC_write_access;
+    struct hvm_emulate_ctxt *hvmemul_ctxt =
+        container_of(ctxt, struct hvm_emulate_ctxt, ctxt);
+    mfn_t mfn[2];
+    void *map = NULL;
+    struct domain *currd = current->domain;
+
+    if ( is_x86_system_segment(seg) )
+        pfec |= PFEC_implicit;
+    else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].attr.fields.dpl == 3 )
+        pfec |= PFEC_user_mode;
+
+    rc = hvmemul_virtual_to_linear(
+        seg, offset, bytes, &reps, hvm_access_write, hvmemul_ctxt, &addr);
+
+    if ( rc != X86EMUL_OKAY || !bytes )
+        return rc;
+
+    rc = hvmemul_vaddr_to_mfn(addr, &mfn[0], pfec, ctxt);
+
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    if ( likely(((addr + bytes - 1) & PAGE_MASK) == (addr & PAGE_MASK)) )
+    {
+        /* Whole write fits on a single page. */
+        mfn[1] = INVALID_MFN;
+        map = map_domain_page(mfn[0]);
+    }
+    else
+    {
+        rc = hvmemul_vaddr_to_mfn((addr + bytes - 1) & PAGE_MASK, &mfn[1],
+                                  pfec, ctxt);
+        if ( rc != X86EMUL_OKAY )
+            return rc;
+
+        map = vmap(mfn, 2);
+    }
+
+    if ( !map )
+        return X86EMUL_UNHANDLEABLE;
+
+    map += (addr & ~PAGE_MASK);
+
+    memcpy(&old, p_old, bytes);
+    memcpy(&new, p_new, bytes);
+
+    if ( __cmpxchg(map, old, new, bytes) != old )
+        rc = X86EMUL_RETRY;
+
+    paging_mark_dirty(currd, mfn[0]);
+
+    if ( unlikely(mfn_valid(mfn[1])) )
+    {
+        paging_mark_dirty(currd, mfn[1]);
+        vunmap((void *)((unsigned long)map & PAGE_MASK));
+    }
+    else
+        unmap_domain_page(map);
+
+    return rc;
 }
 
 static int hvmemul_validate(
@@ -1961,59 +2095,57 @@  int hvm_emulate_one_mmio(unsigned long mfn, unsigned long gla)
 void hvm_emulate_one_vm_event(enum emul_kind kind, unsigned int trapnr,
     unsigned int errcode)
 {
-    struct hvm_emulate_ctxt ctx = {{ 0 }};
-    int rc;
+    int rc = X86EMUL_OKAY;
 
-    hvm_emulate_init_once(&ctx, NULL, guest_cpu_user_regs());
+    do {
+        struct hvm_emulate_ctxt ctx = {{ 0 }};
 
-    switch ( kind )
-    {
-    case EMUL_KIND_NOWRITE:
-        rc = _hvm_emulate_one(&ctx, &hvm_emulate_ops_no_write);
-        break;
-    case EMUL_KIND_SET_CONTEXT_INSN: {
-        struct vcpu *curr = current;
-        struct hvm_vcpu_io *vio = &curr->arch.hvm_vcpu.hvm_io;
+        hvm_emulate_init_once(&ctx, NULL, guest_cpu_user_regs());
 
-        BUILD_BUG_ON(sizeof(vio->mmio_insn) !=
-                     sizeof(curr->arch.vm_event->emul.insn.data));
-        ASSERT(!vio->mmio_insn_bytes);
+        switch ( kind )
+        {
+        case EMUL_KIND_NOWRITE:
+            rc = _hvm_emulate_one(&ctx, &hvm_emulate_ops_no_write);
+            break;
+        case EMUL_KIND_SET_CONTEXT_INSN: {
+            struct vcpu *curr = current;
+            struct hvm_vcpu_io *vio = &curr->arch.hvm_vcpu.hvm_io;
+
+            BUILD_BUG_ON(sizeof(vio->mmio_insn) !=
+                         sizeof(curr->arch.vm_event->emul.insn.data));
+            ASSERT(!vio->mmio_insn_bytes);
+
+            /*
+             * Stash insn buffer into mmio buffer here instead of ctx
+             * to avoid having to add more logic to hvm_emulate_one.
+             */
+            vio->mmio_insn_bytes = sizeof(vio->mmio_insn);
+            memcpy(vio->mmio_insn, curr->arch.vm_event->emul.insn.data,
+                   vio->mmio_insn_bytes);
+        }
+        /* Fall-through */
+        default:
+            ctx.set_context = (kind == EMUL_KIND_SET_CONTEXT_DATA);
+            rc = hvm_emulate_one(&ctx);
+        }
 
-        /*
-         * Stash insn buffer into mmio buffer here instead of ctx
-         * to avoid having to add more logic to hvm_emulate_one.
-         */
-        vio->mmio_insn_bytes = sizeof(vio->mmio_insn);
-        memcpy(vio->mmio_insn, curr->arch.vm_event->emul.insn.data,
-               vio->mmio_insn_bytes);
-    }
-    /* Fall-through */
-    default:
-        ctx.set_context = (kind == EMUL_KIND_SET_CONTEXT_DATA);
-        rc = hvm_emulate_one(&ctx);
-    }
+        switch ( rc )
+        {
+        case X86EMUL_RETRY:
+            break;
+        case X86EMUL_UNHANDLEABLE:
+            hvm_dump_emulation_state(XENLOG_G_DEBUG "Mem event", &ctx);
+            hvm_inject_hw_exception(trapnr, errcode);
+            break;
+        case X86EMUL_EXCEPTION:
+            if ( ctx.ctxt.event_pending )
+                hvm_inject_event(&ctx.ctxt.event);
+            break;
+        }
 
-    switch ( rc )
-    {
-    case X86EMUL_RETRY:
-        /*
-         * This function is called when handling an EPT-related vm_event
-         * reply. As such, nothing else needs to be done here, since simply
-         * returning makes the current instruction cause a page fault again,
-         * consistent with X86EMUL_RETRY.
-         */
-        return;
-    case X86EMUL_UNHANDLEABLE:
-        hvm_dump_emulation_state(XENLOG_G_DEBUG "Mem event", &ctx);
-        hvm_inject_hw_exception(trapnr, errcode);
-        break;
-    case X86EMUL_EXCEPTION:
-        if ( ctx.ctxt.event_pending )
-            hvm_inject_event(&ctx.ctxt.event);
-        break;
-    }
+        hvm_emulate_writeback(&ctx);
 
-    hvm_emulate_writeback(&ctx);
+    } while( rc == X86EMUL_RETRY );
 }
 
 void hvm_emulate_init_once(