diff mbox

[RFC] x86/emulate: implement hvmemul_cmpxchg() with an actual CMPXCHG

Message ID 185bbccb-0156-07fe-d060-6135cae07caf@bitdefender.com (mailing list archive)
State New, archived
Headers show

Commit Message

Razvan Cojocaru March 29, 2017, 3:04 p.m. UTC
On 03/29/2017 05:00 PM, Razvan Cojocaru wrote:
> On 03/29/2017 04:55 PM, Jan Beulich wrote:
>>>>> On 28.03.17 at 12:50, <rcojocaru@bitdefender.com> wrote:
>>> On 03/28/2017 01:47 PM, Jan Beulich wrote:
>>>>>>> On 28.03.17 at 12:27, <rcojocaru@bitdefender.com> wrote:
>>>>> On 03/28/2017 01:03 PM, Jan Beulich wrote:
>>>>>>>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
>>>>>>> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
>>>>>>> failed CMPXCHG should happen just once, with the proper registers and ZF
>>>>>>> set. The guest surely expects neither that the instruction resume until
>>>>>>> it succeeds, nor that some hidden loop goes on for an undeterminate
>>>>>>> ammount of time until a CMPXCHG succeeds.
>>>>>>
>>>>>> The guest doesn't observe the CMPXCHG failing - RETRY leads to
>>>>>> the instruction being restarted instead of completed.
>>>>>
>>>>> Indeed, but it works differently with hvm_emulate_one_vm_event() where
>>>>> RETRY currently would have the instruction be re-executed (properly
>>>>> re-executed, not just re-emulated) by the guest.
>>>>
>>>> Right - see my other reply to Andrew: The function likely would
>>>> need to tell apart guest CMPXCHG uses from us using the insn to
>>>> carry out the write by some other one. That may involve
>>>> adjustments to the memory write logic in x86_emulate() itself, as
>>>> the late failure of the comparison then would also need to be
>>>> communicated back (via ZF clear) to the guest.
>>>
>>> Exactly, it would require quite some reworking of x86_emulate().
>>
>> I had imagined it to be less intrusive (outside of x86_emulate()),
>> but I've now learned why Andrew was able to get rid of
>> X86EMUL_CMPXCHG_FAILED - the apparently intended behavior
>> was never implemented. Attached a first take at it, which has
>> seen smoke testing, but nothing more. The way it ends up being
>> I don't think this can reasonably be considered for 4.9 at this
>> point in time. (Also Cc-ing Tim for the shadow code changes,
>> even if this isn't really a proper patch submission.)
> 
> Thanks! I'll give a spin with a modified version of my CMPXCHG patch as
> soon as possible.

With the attached patch with hvmemul_cmpxchg() now returning
X86EMUL_CMPXCHG_FAILED if __cmpxchg() fails my (32-bit) Windows 7 guest
gets stuck at the "Starting Windows" screen. It's state appears to be:

# ./xenctx -a 3
cs:eip: 0008:8bcd85d6
flags: 00200246 cid i z p
ss:esp: 0010:82736b9c
eax: 00000000   ebx: 84f3a678   ecx: 84ee2610   edx: 001eb615
esi: 40008000   edi: 82739d20   ebp: 82736c20
 ds:     0023    es:     0023    fs:     0030    gs:     0000

cr0: 8001003b
cr2: 8fd94000
cr3: 00185000
cr4: 000406f9

dr0: 00000000
dr1: 00000000
dr2: 00000000
dr3: 00000000
dr6: fffe0ff0
dr7: 00000400
Code (instr addr 8bcd85d6)
47 fc 83 c7 14 4e 75 ef 5f 5e c3 cc cc cc cc cc cc 8b ff fb f4 <c3> cc
cc cc cc cc 8b ff 55 8b ec

# ./xenctx -a 3
cs:eip: 0008:8bcd85d6
flags: 00200246 cid i z p
ss:esp: 0010:82736b9c
eax: 00000000   ebx: 84f3a678   ecx: 84ee2610   edx: 002ca60d
esi: 40008000   edi: 82739d20   ebp: 82736c20
 ds:     0023    es:     0023    fs:     0030    gs:     0000

cr0: 8001003b
cr2: 8fd94000
cr3: 00185000
cr4: 000406f9

dr0: 00000000
dr1: 00000000
dr2: 00000000
dr3: 00000000
dr6: fffe0ff0
dr7: 00000400
Code (instr addr 8bcd85d6)
47 fc 83 c7 14 4e 75 ef 5f 5e c3 cc cc cc cc cc cc 8b ff fb f4 <c3> cc
cc cc cc cc 8b ff 55 8b ec

This only happens in SMP scenarios (my guest had 10 VCPUs for easy
reproduction). With a single VCPU, the guest booted fine. So something
somehow is still not right when a CMPXCHG fails in a race-type situation
(unless something's obviously wrong with my patch, but I don't see it).


Thanks,
Razvan

Comments

Razvan Cojocaru March 29, 2017, 3:49 p.m. UTC | #1
On 03/29/2017 06:04 PM, Razvan Cojocaru wrote:
> On 03/29/2017 05:00 PM, Razvan Cojocaru wrote:
>> On 03/29/2017 04:55 PM, Jan Beulich wrote:
>>>>>> On 28.03.17 at 12:50, <rcojocaru@bitdefender.com> wrote:
>>>> On 03/28/2017 01:47 PM, Jan Beulich wrote:
>>>>>>>> On 28.03.17 at 12:27, <rcojocaru@bitdefender.com> wrote:
>>>>>> On 03/28/2017 01:03 PM, Jan Beulich wrote:
>>>>>>>>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
>>>>>>>> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
>>>>>>>> failed CMPXCHG should happen just once, with the proper registers and ZF
>>>>>>>> set. The guest surely expects neither that the instruction resume until
>>>>>>>> it succeeds, nor that some hidden loop goes on for an undeterminate
>>>>>>>> ammount of time until a CMPXCHG succeeds.
>>>>>>>
>>>>>>> The guest doesn't observe the CMPXCHG failing - RETRY leads to
>>>>>>> the instruction being restarted instead of completed.
>>>>>>
>>>>>> Indeed, but it works differently with hvm_emulate_one_vm_event() where
>>>>>> RETRY currently would have the instruction be re-executed (properly
>>>>>> re-executed, not just re-emulated) by the guest.
>>>>>
>>>>> Right - see my other reply to Andrew: The function likely would
>>>>> need to tell apart guest CMPXCHG uses from us using the insn to
>>>>> carry out the write by some other one. That may involve
>>>>> adjustments to the memory write logic in x86_emulate() itself, as
>>>>> the late failure of the comparison then would also need to be
>>>>> communicated back (via ZF clear) to the guest.
>>>>
>>>> Exactly, it would require quite some reworking of x86_emulate().
>>>
>>> I had imagined it to be less intrusive (outside of x86_emulate()),
>>> but I've now learned why Andrew was able to get rid of
>>> X86EMUL_CMPXCHG_FAILED - the apparently intended behavior
>>> was never implemented. Attached a first take at it, which has
>>> seen smoke testing, but nothing more. The way it ends up being
>>> I don't think this can reasonably be considered for 4.9 at this
>>> point in time. (Also Cc-ing Tim for the shadow code changes,
>>> even if this isn't really a proper patch submission.)
>>
>> Thanks! I'll give a spin with a modified version of my CMPXCHG patch as
>> soon as possible.
> 
> With the attached patch with hvmemul_cmpxchg() now returning
> X86EMUL_CMPXCHG_FAILED if __cmpxchg() fails my (32-bit) Windows 7 guest
> gets stuck at the "Starting Windows" screen.

And again this change:

1162     if ( __cmpxchg(map, old, new, bytes) != old )
1163     {
1164         memcpy(p_old, map, bytes);
1165         rc = X86EMUL_CMPXCHG_FAILED;
1166     }

i.e. doing the accumulator <- destination part of a failed CMPXCHG which
might be missing from your patch leads me again to BSODs. I'm not sure
if __cmpxchg() should work differently and do this atomically, or if
this should be done in x86_emulate() and it's not, or if it is done
there somewhere I've missed in the first patch.


Thanks,
Razvan
Jan Beulich March 30, 2017, 12:05 p.m. UTC | #2
>>> On 29.03.17 at 17:49, <rcojocaru@bitdefender.com> wrote:
> On 03/29/2017 06:04 PM, Razvan Cojocaru wrote:
>> On 03/29/2017 05:00 PM, Razvan Cojocaru wrote:
>>> On 03/29/2017 04:55 PM, Jan Beulich wrote:
>>>>>>> On 28.03.17 at 12:50, <rcojocaru@bitdefender.com> wrote:
>>>>> On 03/28/2017 01:47 PM, Jan Beulich wrote:
>>>>>>>>> On 28.03.17 at 12:27, <rcojocaru@bitdefender.com> wrote:
>>>>>>> On 03/28/2017 01:03 PM, Jan Beulich wrote:
>>>>>>>>>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
>>>>>>>>> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
>>>>>>>>> failed CMPXCHG should happen just once, with the proper registers and ZF
>>>>>>>>> set. The guest surely expects neither that the instruction resume until
>>>>>>>>> it succeeds, nor that some hidden loop goes on for an undeterminate
>>>>>>>>> ammount of time until a CMPXCHG succeeds.
>>>>>>>>
>>>>>>>> The guest doesn't observe the CMPXCHG failing - RETRY leads to
>>>>>>>> the instruction being restarted instead of completed.
>>>>>>>
>>>>>>> Indeed, but it works differently with hvm_emulate_one_vm_event() where
>>>>>>> RETRY currently would have the instruction be re-executed (properly
>>>>>>> re-executed, not just re-emulated) by the guest.
>>>>>>
>>>>>> Right - see my other reply to Andrew: The function likely would
>>>>>> need to tell apart guest CMPXCHG uses from us using the insn to
>>>>>> carry out the write by some other one. That may involve
>>>>>> adjustments to the memory write logic in x86_emulate() itself, as
>>>>>> the late failure of the comparison then would also need to be
>>>>>> communicated back (via ZF clear) to the guest.
>>>>>
>>>>> Exactly, it would require quite some reworking of x86_emulate().
>>>>
>>>> I had imagined it to be less intrusive (outside of x86_emulate()),
>>>> but I've now learned why Andrew was able to get rid of
>>>> X86EMUL_CMPXCHG_FAILED - the apparently intended behavior
>>>> was never implemented. Attached a first take at it, which has
>>>> seen smoke testing, but nothing more. The way it ends up being
>>>> I don't think this can reasonably be considered for 4.9 at this
>>>> point in time. (Also Cc-ing Tim for the shadow code changes,
>>>> even if this isn't really a proper patch submission.)
>>>
>>> Thanks! I'll give a spin with a modified version of my CMPXCHG patch as
>>> soon as possible.
>> 
>> With the attached patch with hvmemul_cmpxchg() now returning
>> X86EMUL_CMPXCHG_FAILED if __cmpxchg() fails my (32-bit) Windows 7 guest
>> gets stuck at the "Starting Windows" screen.

That's with or without monitoring in use? I specifically did try a
32-bit Win7 guest, and I didn't have an issue. But then again a
single run may not mean much.

> And again this change:
> 
> 1162     if ( __cmpxchg(map, old, new, bytes) != old )
> 1163     {
> 1164         memcpy(p_old, map, bytes);
> 1165         rc = X86EMUL_CMPXCHG_FAILED;
> 1166     }
> 
> i.e. doing the accumulator <- destination part of a failed CMPXCHG which
> might be missing from your patch leads me again to BSODs.

Missing from my patch? Why and/or where? It's not clear to me which
function the above code fragment is supposed to go into. I might
guess hvmemul_cmpxchg(), but then my patch doesn't alter its
behavior (from forwarding to hvmeml_write()), and hence I don't
see why my patch would need to do such an adjustment.

What I do note though is that you don't copy back the value
__cmpxchg() returns, yet that's what is needed. *map may
have changed again already.

Jan
Razvan Cojocaru March 30, 2017, 12:25 p.m. UTC | #3
On 03/30/2017 03:05 PM, Jan Beulich wrote:
>>>> On 29.03.17 at 17:49, <rcojocaru@bitdefender.com> wrote:
>> On 03/29/2017 06:04 PM, Razvan Cojocaru wrote:
>>> On 03/29/2017 05:00 PM, Razvan Cojocaru wrote:
>>>> On 03/29/2017 04:55 PM, Jan Beulich wrote:
>>>>>>>> On 28.03.17 at 12:50, <rcojocaru@bitdefender.com> wrote:
>>>>>> On 03/28/2017 01:47 PM, Jan Beulich wrote:
>>>>>>>>>> On 28.03.17 at 12:27, <rcojocaru@bitdefender.com> wrote:
>>>>>>>> On 03/28/2017 01:03 PM, Jan Beulich wrote:
>>>>>>>>>>>> On 28.03.17 at 11:14, <rcojocaru@bitdefender.com> wrote:
>>>>>>>>>> I'm not sure that the RETRY model is what the guest OS expects. AFAIK, a
>>>>>>>>>> failed CMPXCHG should happen just once, with the proper registers and ZF
>>>>>>>>>> set. The guest surely expects neither that the instruction resume until
>>>>>>>>>> it succeeds, nor that some hidden loop goes on for an undeterminate
>>>>>>>>>> ammount of time until a CMPXCHG succeeds.
>>>>>>>>>
>>>>>>>>> The guest doesn't observe the CMPXCHG failing - RETRY leads to
>>>>>>>>> the instruction being restarted instead of completed.
>>>>>>>>
>>>>>>>> Indeed, but it works differently with hvm_emulate_one_vm_event() where
>>>>>>>> RETRY currently would have the instruction be re-executed (properly
>>>>>>>> re-executed, not just re-emulated) by the guest.
>>>>>>>
>>>>>>> Right - see my other reply to Andrew: The function likely would
>>>>>>> need to tell apart guest CMPXCHG uses from us using the insn to
>>>>>>> carry out the write by some other one. That may involve
>>>>>>> adjustments to the memory write logic in x86_emulate() itself, as
>>>>>>> the late failure of the comparison then would also need to be
>>>>>>> communicated back (via ZF clear) to the guest.
>>>>>>
>>>>>> Exactly, it would require quite some reworking of x86_emulate().
>>>>>
>>>>> I had imagined it to be less intrusive (outside of x86_emulate()),
>>>>> but I've now learned why Andrew was able to get rid of
>>>>> X86EMUL_CMPXCHG_FAILED - the apparently intended behavior
>>>>> was never implemented. Attached a first take at it, which has
>>>>> seen smoke testing, but nothing more. The way it ends up being
>>>>> I don't think this can reasonably be considered for 4.9 at this
>>>>> point in time. (Also Cc-ing Tim for the shadow code changes,
>>>>> even if this isn't really a proper patch submission.)
>>>>
>>>> Thanks! I'll give a spin with a modified version of my CMPXCHG patch as
>>>> soon as possible.
>>>
>>> With the attached patch with hvmemul_cmpxchg() now returning
>>> X86EMUL_CMPXCHG_FAILED if __cmpxchg() fails my (32-bit) Windows 7 guest
>>> gets stuck at the "Starting Windows" screen.
> 
> That's with or without monitoring in use? I specifically did try a
> 32-bit Win7 guest, and I didn't have an issue. But then again a
> single run may not mean much.

With monitoring in use - specifically using hvm_emulate_one_vm_event().
Sorry for the ommision.

>> And again this change:
>>
>> 1162     if ( __cmpxchg(map, old, new, bytes) != old )
>> 1163     {
>> 1164         memcpy(p_old, map, bytes);
>> 1165         rc = X86EMUL_CMPXCHG_FAILED;
>> 1166     }
>>
>> i.e. doing the accumulator <- destination part of a failed CMPXCHG which
>> might be missing from your patch leads me again to BSODs.
> 
> Missing from my patch? Why and/or where? It's not clear to me which
> function the above code fragment is supposed to go into. I might
> guess hvmemul_cmpxchg(), but then my patch doesn't alter its
> behavior (from forwarding to hvmeml_write()), and hence I don't
> see why my patch would need to do such an adjustment.

Right, I was thinking about this bit:

6704         if ( _regs.eflags & X86_EFLAGS_ZF )
6705             dst.type = OP_NONE;
6706         else
6707         {
6708             /* Failure: write the value we saw to EAX. */
6709             dst.type = OP_REG;
6710             dst.reg  = (unsigned long *)&_regs.r(ax);
6711         }

For some reason I had missed it, but I now see it does the writeback. My
mistake.

> What I do note though is that you don't copy back the value
> __cmpxchg() returns, yet that's what is needed. *map may
> have changed again already.

True, I'll update my tests.


Thanks,
Razvan
Razvan Cojocaru March 30, 2017, 12:56 p.m. UTC | #4
On 03/30/2017 03:05 PM, Jan Beulich wrote:
> What I do note though is that you don't copy back the value
> __cmpxchg() returns, yet that's what is needed. *map may
> have changed again already.

Changing the code to:

1162     ret = __cmpxchg(map, old, new, bytes);
1163
1164     if ( ret != old )
1165     {
1166         memcpy(p_old, &ret, bytes);
1167         rc = X86EMUL_CMPXCHG_FAILED;
1168     }

where ret is an unsigned long still triggers BSODs when I add my patch
to yours. I'll need to dig deeper.


Thanks,
Razvan
Razvan Cojocaru March 30, 2017, 2:08 p.m. UTC | #5
On 03/30/2017 03:56 PM, Razvan Cojocaru wrote:
> On 03/30/2017 03:05 PM, Jan Beulich wrote:
>> What I do note though is that you don't copy back the value
>> __cmpxchg() returns, yet that's what is needed. *map may
>> have changed again already.
> 
> Changing the code to:
> 
> 1162     ret = __cmpxchg(map, old, new, bytes);
> 1163
> 1164     if ( ret != old )
> 1165     {
> 1166         memcpy(p_old, &ret, bytes);
> 1167         rc = X86EMUL_CMPXCHG_FAILED;
> 1168     }
> 
> where ret is an unsigned long still triggers BSODs when I add my patch
> to yours. I'll need to dig deeper.

Nevermind, I've found the culprit: hvm_emulate_one_vm_event()'s code
needs to be wrapped in a loop that checks for X86EMUL_RETRY again, since
hvmemul_cmpxchg() may return RETRY even for some mapping problems, in
which case we again end up with the guest trying to re-execute an
emulable CMPXCHG.

However, this gets me back to my original problem when I "solved" it in
the same manner (looping until emulation succeeds) back when
hvmemul_cmpxchg() failures were reported with RETRY: eventually the
guest BSODs with code 101. RETRY failures are still possible coming from
the hvmemul_vaddr_to_mfn() code in my patch.

I wonder if I should just return X86EMUL_CMPXCHG_FAILED for all those as
well and just never end up returning RETRY from hvmemul_cmpxchg().


Thanks,
Razvan
Jan Beulich March 30, 2017, 2:21 p.m. UTC | #6
>>> On 30.03.17 at 16:08, <rcojocaru@bitdefender.com> wrote:
> On 03/30/2017 03:56 PM, Razvan Cojocaru wrote:
>> On 03/30/2017 03:05 PM, Jan Beulich wrote:
>>> What I do note though is that you don't copy back the value
>>> __cmpxchg() returns, yet that's what is needed. *map may
>>> have changed again already.
>> 
>> Changing the code to:
>> 
>> 1162     ret = __cmpxchg(map, old, new, bytes);
>> 1163
>> 1164     if ( ret != old )
>> 1165     {
>> 1166         memcpy(p_old, &ret, bytes);
>> 1167         rc = X86EMUL_CMPXCHG_FAILED;
>> 1168     }
>> 
>> where ret is an unsigned long still triggers BSODs when I add my patch
>> to yours. I'll need to dig deeper.
> 
> Nevermind, I've found the culprit: hvm_emulate_one_vm_event()'s code
> needs to be wrapped in a loop that checks for X86EMUL_RETRY again, since
> hvmemul_cmpxchg() may return RETRY even for some mapping problems, in
> which case we again end up with the guest trying to re-execute an
> emulable CMPXCHG.

This seems wrong to me - note how my patch changes behavior
regarding the return value from paging_cmpxchg_guest_entry()
in ptwr_emulated_update().

> However, this gets me back to my original problem when I "solved" it in
> the same manner (looping until emulation succeeds) back when
> hvmemul_cmpxchg() failures were reported with RETRY: eventually the
> guest BSODs with code 101. RETRY failures are still possible coming from
> the hvmemul_vaddr_to_mfn() code in my patch.
> 
> I wonder if I should just return X86EMUL_CMPXCHG_FAILED for all those as
> well and just never end up returning RETRY from hvmemul_cmpxchg().

That would seem similarly wrong to me - it ought to be
UNHANDLEABLE, I would think. In any event, never returning
RETRY would also be wrong in case you hit emulated MMIO, but
that's really the only case where RETRY should be passed back.

Jan
Razvan Cojocaru March 30, 2017, 3:05 p.m. UTC | #7
On 03/30/2017 05:21 PM, Jan Beulich wrote:
>>>> On 30.03.17 at 16:08, <rcojocaru@bitdefender.com> wrote:
>> On 03/30/2017 03:56 PM, Razvan Cojocaru wrote:
>>> On 03/30/2017 03:05 PM, Jan Beulich wrote:
>>>> What I do note though is that you don't copy back the value
>>>> __cmpxchg() returns, yet that's what is needed. *map may
>>>> have changed again already.
>>>
>>> Changing the code to:
>>>
>>> 1162     ret = __cmpxchg(map, old, new, bytes);
>>> 1163
>>> 1164     if ( ret != old )
>>> 1165     {
>>> 1166         memcpy(p_old, &ret, bytes);
>>> 1167         rc = X86EMUL_CMPXCHG_FAILED;
>>> 1168     }
>>>
>>> where ret is an unsigned long still triggers BSODs when I add my patch
>>> to yours. I'll need to dig deeper.
>>
>> Nevermind, I've found the culprit: hvm_emulate_one_vm_event()'s code
>> needs to be wrapped in a loop that checks for X86EMUL_RETRY again, since
>> hvmemul_cmpxchg() may return RETRY even for some mapping problems, in
>> which case we again end up with the guest trying to re-execute an
>> emulable CMPXCHG.
> 
> This seems wrong to me - note how my patch changes behavior
> regarding the return value from paging_cmpxchg_guest_entry()
> in ptwr_emulated_update().
> 
>> However, this gets me back to my original problem when I "solved" it in
>> the same manner (looping until emulation succeeds) back when
>> hvmemul_cmpxchg() failures were reported with RETRY: eventually the
>> guest BSODs with code 101. RETRY failures are still possible coming from
>> the hvmemul_vaddr_to_mfn() code in my patch.
>>
>> I wonder if I should just return X86EMUL_CMPXCHG_FAILED for all those as
>> well and just never end up returning RETRY from hvmemul_cmpxchg().
> 
> That would seem similarly wrong to me - it ought to be
> UNHANDLEABLE, I would think. In any event, never returning
> RETRY would also be wrong in case you hit emulated MMIO, but
> that's really the only case where RETRY should be passed back.

Sorry, I don't follow: hvm_emulate_one_vm_event() calls
hvm_emulate_one() and then does this with the return value:

switch ( rc )
{
case X86EMUL_RETRY:
    /*
     * This function is called when handling an EPT-related vm_event
     * reply. As such, nothing else needs to be done here, since simply
     * returning makes the current instruction cause a page fault again,
     * consistent with X86EMUL_RETRY.
     */
    return;
case X86EMUL_UNHANDLEABLE:
    hvm_dump_emulation_state(XENLOG_G_DEBUG "Mem event", &ctx);
    hvm_inject_hw_exception(trapnr, errcode);
    break;
case X86EMUL_EXCEPTION:
    if ( ctx.ctxt.event_pending )
        hvm_inject_event(&ctx.ctxt.event);
    break;
}

hvm_emulate_writeback(&ctx);

If I'd return UNHANDLEABLE from hvmemul_cmpxchg() where I'm now
returning RETRY from hvmemul_vaddr_to_mfn(), that would inject
TRAP_invalid_op in the guest, so it seems rather harsh. And returning
RETRY here does nothing, so the guest will resume at the same IP
(re-trying CMPXCHG, causing a page fault (or crashing the guest), and
re-trying the emulation (if it hasn't)). Please see hvm_do_resume() in
arch/x86/hvm/hvm.c.

Speaking of emulated MMIO, I've got this when the guest was crashing
immediately (pre RETRY loop):

 MMIO emulation failed: d3v8 32bit @ 0008:82679f3c -> f0 0f ba 30 00 72
07 8b cb e8 da 4b ff ff 8b 45

Again, this is all with guest monitoring enabled.


Thanks,
Razvan
Jan Beulich March 30, 2017, 3:47 p.m. UTC | #8
>>> On 30.03.17 at 17:05, <rcojocaru@bitdefender.com> wrote:
> On 03/30/2017 05:21 PM, Jan Beulich wrote:
>>>>> On 30.03.17 at 16:08, <rcojocaru@bitdefender.com> wrote:
>>> On 03/30/2017 03:56 PM, Razvan Cojocaru wrote:
>>>> On 03/30/2017 03:05 PM, Jan Beulich wrote:
>>>>> What I do note though is that you don't copy back the value
>>>>> __cmpxchg() returns, yet that's what is needed. *map may
>>>>> have changed again already.
>>>>
>>>> Changing the code to:
>>>>
>>>> 1162     ret = __cmpxchg(map, old, new, bytes);
>>>> 1163
>>>> 1164     if ( ret != old )
>>>> 1165     {
>>>> 1166         memcpy(p_old, &ret, bytes);
>>>> 1167         rc = X86EMUL_CMPXCHG_FAILED;
>>>> 1168     }
>>>>
>>>> where ret is an unsigned long still triggers BSODs when I add my patch
>>>> to yours. I'll need to dig deeper.
>>>
>>> Nevermind, I've found the culprit: hvm_emulate_one_vm_event()'s code
>>> needs to be wrapped in a loop that checks for X86EMUL_RETRY again, since
>>> hvmemul_cmpxchg() may return RETRY even for some mapping problems, in
>>> which case we again end up with the guest trying to re-execute an
>>> emulable CMPXCHG.
>> 
>> This seems wrong to me - note how my patch changes behavior
>> regarding the return value from paging_cmpxchg_guest_entry()
>> in ptwr_emulated_update().
>> 
>>> However, this gets me back to my original problem when I "solved" it in
>>> the same manner (looping until emulation succeeds) back when
>>> hvmemul_cmpxchg() failures were reported with RETRY: eventually the
>>> guest BSODs with code 101. RETRY failures are still possible coming from
>>> the hvmemul_vaddr_to_mfn() code in my patch.
>>>
>>> I wonder if I should just return X86EMUL_CMPXCHG_FAILED for all those as
>>> well and just never end up returning RETRY from hvmemul_cmpxchg().
>> 
>> That would seem similarly wrong to me - it ought to be
>> UNHANDLEABLE, I would think. In any event, never returning
>> RETRY would also be wrong in case you hit emulated MMIO, but
>> that's really the only case where RETRY should be passed back.
> 
> Sorry, I don't follow: hvm_emulate_one_vm_event() calls
> hvm_emulate_one() and then does this with the return value:
> 
> switch ( rc )
> {
> case X86EMUL_RETRY:
>     /*
>      * This function is called when handling an EPT-related vm_event
>      * reply. As such, nothing else needs to be done here, since simply
>      * returning makes the current instruction cause a page fault again,
>      * consistent with X86EMUL_RETRY.
>      */
>     return;
> case X86EMUL_UNHANDLEABLE:
>     hvm_dump_emulation_state(XENLOG_G_DEBUG "Mem event", &ctx);
>     hvm_inject_hw_exception(trapnr, errcode);
>     break;
> case X86EMUL_EXCEPTION:
>     if ( ctx.ctxt.event_pending )
>         hvm_inject_event(&ctx.ctxt.event);
>     break;
> }
> 
> hvm_emulate_writeback(&ctx);
> 
> If I'd return UNHANDLEABLE from hvmemul_cmpxchg() where I'm now
> returning RETRY from hvmemul_vaddr_to_mfn(), that would inject
> TRAP_invalid_op in the guest, so it seems rather harsh.

Well, my comment was for normal execution of the guest. In
oder to make it into the emulator, the virtual->physical
translation in the guest must have worked, so if there is a
mapping failure, this indicates something fishy going on inside
the guest.

> Speaking of emulated MMIO, I've got this when the guest was crashing
> immediately (pre RETRY loop):
> 
>  MMIO emulation failed: d3v8 32bit @ 0008:82679f3c -> f0 0f ba 30 00 72
> 07 8b cb e8 da 4b ff ff 8b 45

That's a BTR, which we should be emulating fine. More information
would need to be collected to have a chance to understand what
might be going one (first of all the virtual and physical memory
address this was trying to act on).

Jan
Razvan Cojocaru March 31, 2017, 6:17 a.m. UTC | #9
On 03/30/2017 06:47 PM, Jan Beulich wrote:
>> Speaking of emulated MMIO, I've got this when the guest was crashing
>> immediately (pre RETRY loop):
>>
>>  MMIO emulation failed: d3v8 32bit @ 0008:82679f3c -> f0 0f ba 30 00 72
>> 07 8b cb e8 da 4b ff ff 8b 45
> 
> That's a BTR, which we should be emulating fine. More information
> would need to be collected to have a chance to understand what
> might be going one (first of all the virtual and physical memory
> address this was trying to act on).

Right, the BTR part should be fine, but I think the LOCK part is what's
causing the issue. I've done a few more test runs to see what return
RETRY (dumping the instruction with an "(r)" prefix to distinguish from
the UNHANDLEABLE dump), and a couple of instructions return RETRY (BTR
and XADD, both LOCK-prefixed, which means they now involve CMPXCHG
handler, which presumably now fails - possibly simply because it's
always LOCKed in my patch):

# grep "Mem event" /var/log/xen/console/hypervisor.log | sort | uniq
(XEN) (r) Mem event emulation failed: d3v1 32bit @ 0008:8267f1aa -> f0
0f ba 28 07 72 d5 8d 45 f4 50 33 ff 56 47 53
(XEN) (r) Mem event emulation failed: d3v5 32bit @ 0008:8267f1aa -> f0
0f ba 28 07 72 d5 8d 45 f4 50 33 ff 56 47 53
(XEN) (r) Mem event emulation failed: d3v5 32bit @ 0008:826ebc7c -> f0
0f c1 08 85 c9 74 1f f6 c1 02 75 1a 41 8d 41
(XEN) (r) Mem event emulation failed: d3v6 32bit @ 0008:8267f1aa -> f0
0f ba 28 07 72 d5 8d 45 f4 50 33 ff 56 47 53
(XEN) (r) Mem event emulation failed: d3v6 32bit @ 0008:826eb861 -> f0
0f ba 30 00 72 07 8b cb e8 da 4b ff ff 8b 45
(XEN) (r) Mem event emulation failed: d3v6 32bit @ 0008:826ebc7c -> f0
0f c1 08 85 c9 74 1f f6 c1 02 75 1a 41 8d 41
(XEN) (r) Mem event emulation failed: d3v6 32bit @ 0008:826ebce6 -> f0
0f c1 01 8b 7d fc c1 ef 09 81 e7 f8 ff 7f 00
(XEN) (r) Mem event emulation failed: d3v7 32bit @ 0008:8267f1aa -> f0
0f ba 28 07 72 d5 8d 45 f4 50 33 ff 56 47 53
(XEN) (r) Mem event emulation failed: d3v7 32bit @ 0008:826eb861 -> f0
0f ba 30 00 72 07 8b cb e8 da 4b ff ff 8b 45
(XEN) (r) Mem event emulation failed: d3v7 32bit @ 0008:826ebc7c -> f0
0f c1 08 85 c9 74 1f f6 c1 02 75 1a 41 8d 41
(XEN) (r) Mem event emulation failed: d3v7 32bit @ 0008:826ebce6 -> f0
0f c1 01 8b 7d fc c1 ef 09 81 e7 f8 ff 7f 00
(XEN) (r) Mem event emulation failed: d3v7 32bit @ 0008:826ec59a -> f0
0f ba 31 00 72 09 e8 a3 3e ff ff 8b 44 24 18
(XEN) (r) Mem event emulation failed: d3v7 32bit @ 0008:826f6276 -> f0
0f ba 28 07 72 cc 39 53 04 75 5e 8d 43 08 8b
(XEN) (r) Mem event emulation failed: d3v8 32bit @ 0008:8267f1aa -> f0
0f ba 28 07 72 d5 8d 45 f4 50 33 ff 56 47 53
(XEN) (r) Mem event emulation failed: d3v8 32bit @ 0008:826eb861 -> f0
0f ba 30 00 72 07 8b cb e8 da 4b ff ff 8b 45
(XEN) (r) Mem event emulation failed: d3v9 32bit @ 0008:8267f1aa -> f0
0f ba 28 07 72 d5 8d 45 f4 50 33 ff 56 47 53
(XEN) (r) Mem event emulation failed: d3v9 32bit @ 0008:826eb861 -> f0
0f ba 30 00 72 07 8b cb e8 da 4b ff ff 8b 45
(XEN) (r) Mem event emulation failed: d3v9 32bit @ 0008:826ebce6 -> f0
0f c1 01 8b 7d fc c1 ef 09 81 e7 f8 ff 7f 00
(XEN) (r) Mem event emulation failed: d3v9 32bit @ 0008:826ec583 -> f0
0f c1 01 64 a1 24 01 00 00 66 ff 88 86 00 00


Thanks,
Razvan
Jan Beulich March 31, 2017, 7:34 a.m. UTC | #10
>>> On 31.03.17 at 08:17, <rcojocaru@bitdefender.com> wrote:
> On 03/30/2017 06:47 PM, Jan Beulich wrote:
>>> Speaking of emulated MMIO, I've got this when the guest was crashing
>>> immediately (pre RETRY loop):
>>>
>>>  MMIO emulation failed: d3v8 32bit @ 0008:82679f3c -> f0 0f ba 30 00 72
>>> 07 8b cb e8 da 4b ff ff 8b 45
>> 
>> That's a BTR, which we should be emulating fine. More information
>> would need to be collected to have a chance to understand what
>> might be going one (first of all the virtual and physical memory
>> address this was trying to act on).
> 
> Right, the BTR part should be fine, but I think the LOCK part is what's
> causing the issue. I've done a few more test runs to see what return
> RETRY (dumping the instruction with an "(r)" prefix to distinguish from
> the UNHANDLEABLE dump), and a couple of instructions return RETRY (BTR
> and XADD, both LOCK-prefixed, which means they now involve CMPXCHG
> handler, which presumably now fails - possibly simply because it's
> always LOCKed in my patch):

Well, all of that looks to be expected behavior. I'm afraid I don't see
how this information helps understanding the MMIO emulation failure
above.

Jan
Razvan Cojocaru March 31, 2017, 9:56 a.m. UTC | #11
On 03/31/2017 10:34 AM, Jan Beulich wrote:
>>>> On 31.03.17 at 08:17, <rcojocaru@bitdefender.com> wrote:
>> On 03/30/2017 06:47 PM, Jan Beulich wrote:
>>>> Speaking of emulated MMIO, I've got this when the guest was crashing
>>>> immediately (pre RETRY loop):
>>>>
>>>>  MMIO emulation failed: d3v8 32bit @ 0008:82679f3c -> f0 0f ba 30 00 72
>>>> 07 8b cb e8 da 4b ff ff 8b 45
>>>
>>> That's a BTR, which we should be emulating fine. More information
>>> would need to be collected to have a chance to understand what
>>> might be going one (first of all the virtual and physical memory
>>> address this was trying to act on).
>>
>> Right, the BTR part should be fine, but I think the LOCK part is what's
>> causing the issue. I've done a few more test runs to see what return
>> RETRY (dumping the instruction with an "(r)" prefix to distinguish from
>> the UNHANDLEABLE dump), and a couple of instructions return RETRY (BTR
>> and XADD, both LOCK-prefixed, which means they now involve CMPXCHG
>> handler, which presumably now fails - possibly simply because it's
>> always LOCKed in my patch):
> 
> Well, all of that looks to be expected behavior. I'm afraid I don't see
> how this information helps understanding the MMIO emulation failure
> above.

I've managed to obtain this log of emulation errors:
https://pastebin.com/Esy1SkHx

The "virtual address" lines that are not followed by any "Mem event"
line correspond to CMXCHG_FAILED return codes.

The very last line is a MMIO emulation failed.

It's probably important that this happens with the model where
hvm_emulate_one_vm_event() does _not_ re-try the emulation until it
succeeds. The other model allows me to go further with the guest, but
eventually I get timeout-related BSODs or the guest becomes unresponsive.


Thanks,
Razvan
Jan Beulich March 31, 2017, 2:46 p.m. UTC | #12
>>> On 31.03.17 at 11:56, <rcojocaru@bitdefender.com> wrote:
> On 03/31/2017 10:34 AM, Jan Beulich wrote:
>>>>> On 31.03.17 at 08:17, <rcojocaru@bitdefender.com> wrote:
>>> On 03/30/2017 06:47 PM, Jan Beulich wrote:
>>>>> Speaking of emulated MMIO, I've got this when the guest was crashing
>>>>> immediately (pre RETRY loop):
>>>>>
>>>>>  MMIO emulation failed: d3v8 32bit @ 0008:82679f3c -> f0 0f ba 30 00 72
>>>>> 07 8b cb e8 da 4b ff ff 8b 45
>>>>
>>>> That's a BTR, which we should be emulating fine. More information
>>>> would need to be collected to have a chance to understand what
>>>> might be going one (first of all the virtual and physical memory
>>>> address this was trying to act on).
>>>
>>> Right, the BTR part should be fine, but I think the LOCK part is what's
>>> causing the issue. I've done a few more test runs to see what return
>>> RETRY (dumping the instruction with an "(r)" prefix to distinguish from
>>> the UNHANDLEABLE dump), and a couple of instructions return RETRY (BTR
>>> and XADD, both LOCK-prefixed, which means they now involve CMPXCHG
>>> handler, which presumably now fails - possibly simply because it's
>>> always LOCKed in my patch):
>> 
>> Well, all of that looks to be expected behavior. I'm afraid I don't see
>> how this information helps understanding the MMIO emulation failure
>> above.
> 
> I've managed to obtain this log of emulation errors:
> https://pastebin.com/Esy1SkHx 
> 
> The "virtual address" lines that are not followed by any "Mem event"
> line correspond to CMXCHG_FAILED return codes.
> 
> The very last line is a MMIO emulation failed.
> 
> It's probably important that this happens with the model where
> hvm_emulate_one_vm_event() does _not_ re-try the emulation until it
> succeeds. The other model allows me to go further with the guest, but
> eventually I get timeout-related BSODs or the guest becomes unresponsive.

Interesting. You didn't clarify what the printed "offset" values are,
and it doesn't look like these have any correlation with the underlying
(guest) physical address, which we would also want to see. And then
it strikes me as odd that in these last lines

(XEN) Mem event (RETRY) emulation failed: d5v8 32bit @ 0008:826bb861 -> f0 0f ba 30 00 72 07 8b cb e8 da 4b ff ff 8b 45
(XEN) virtual address: 0xffd080f0, offset: 4291854576
(XEN) MMIO emulation failed: d5v8 32bit @ 0008:82655f3c -> f0 0f ba 30 00 72 07 8b cb e8 da 4b ff ff 8b 45

the instruction pointers and virtual addresses are different, but the
code bytes are exactly the same. This doesn't seem very likely, so I
wonder whether there's an issue with us wrongly re-using previously
fetched insn bytes. (Of course I'd be happy to be proven wrong with
this guessing, by you checking the involved binary/ies.)

Jan
Razvan Cojocaru March 31, 2017, 3:01 p.m. UTC | #13
On 03/31/2017 05:46 PM, Jan Beulich wrote:
>>>> On 31.03.17 at 11:56, <rcojocaru@bitdefender.com> wrote:
>> On 03/31/2017 10:34 AM, Jan Beulich wrote:
>>>>>> On 31.03.17 at 08:17, <rcojocaru@bitdefender.com> wrote:
>>>> On 03/30/2017 06:47 PM, Jan Beulich wrote:
>>>>>> Speaking of emulated MMIO, I've got this when the guest was crashing
>>>>>> immediately (pre RETRY loop):
>>>>>>
>>>>>>  MMIO emulation failed: d3v8 32bit @ 0008:82679f3c -> f0 0f ba 30 00 72
>>>>>> 07 8b cb e8 da 4b ff ff 8b 45
>>>>>
>>>>> That's a BTR, which we should be emulating fine. More information
>>>>> would need to be collected to have a chance to understand what
>>>>> might be going one (first of all the virtual and physical memory
>>>>> address this was trying to act on).
>>>>
>>>> Right, the BTR part should be fine, but I think the LOCK part is what's
>>>> causing the issue. I've done a few more test runs to see what return
>>>> RETRY (dumping the instruction with an "(r)" prefix to distinguish from
>>>> the UNHANDLEABLE dump), and a couple of instructions return RETRY (BTR
>>>> and XADD, both LOCK-prefixed, which means they now involve CMPXCHG
>>>> handler, which presumably now fails - possibly simply because it's
>>>> always LOCKed in my patch):
>>>
>>> Well, all of that looks to be expected behavior. I'm afraid I don't see
>>> how this information helps understanding the MMIO emulation failure
>>> above.
>>
>> I've managed to obtain this log of emulation errors:
>> https://pastebin.com/Esy1SkHx 
>>
>> The "virtual address" lines that are not followed by any "Mem event"
>> line correspond to CMXCHG_FAILED return codes.
>>
>> The very last line is a MMIO emulation failed.
>>
>> It's probably important that this happens with the model where
>> hvm_emulate_one_vm_event() does _not_ re-try the emulation until it
>> succeeds. The other model allows me to go further with the guest, but
>> eventually I get timeout-related BSODs or the guest becomes unresponsive.
> 
> Interesting. You didn't clarify what the printed "offset" values are,
> and it doesn't look like these have any correlation with the underlying
> (guest) physical address, which we would also want to see. And then
> it strikes me as odd that in these last lines
> 
> (XEN) Mem event (RETRY) emulation failed: d5v8 32bit @ 0008:826bb861 -> f0 0f ba 30 00 72 07 8b cb e8 da 4b ff ff 8b 45
> (XEN) virtual address: 0xffd080f0, offset: 4291854576
> (XEN) MMIO emulation failed: d5v8 32bit @ 0008:82655f3c -> f0 0f ba 30 00 72 07 8b cb e8 da 4b ff ff 8b 45
> 
> the instruction pointers and virtual addresses are different, but the
> code bytes are exactly the same. This doesn't seem very likely, so I
> wonder whether there's an issue with us wrongly re-using previously
> fetched insn bytes. (Of course I'd be happy to be proven wrong with
> this guessing, by you checking the involved binary/ies.)

Offset is the actual value of the "offset" parameter of
hvmemul_cmpxchg(). I'll output more info and recheck.


Thanks,
Razvan
Jan Beulich March 31, 2017, 3:04 p.m. UTC | #14
>>> On 31.03.17 at 17:01, <rcojocaru@bitdefender.com> wrote:
> On 03/31/2017 05:46 PM, Jan Beulich wrote:
>>>>> On 31.03.17 at 11:56, <rcojocaru@bitdefender.com> wrote:
>>> On 03/31/2017 10:34 AM, Jan Beulich wrote:
>>>>>>> On 31.03.17 at 08:17, <rcojocaru@bitdefender.com> wrote:
>>>>> On 03/30/2017 06:47 PM, Jan Beulich wrote:
>>>>>>> Speaking of emulated MMIO, I've got this when the guest was crashing
>>>>>>> immediately (pre RETRY loop):
>>>>>>>
>>>>>>>  MMIO emulation failed: d3v8 32bit @ 0008:82679f3c -> f0 0f ba 30 00 72
>>>>>>> 07 8b cb e8 da 4b ff ff 8b 45
>>>>>>
>>>>>> That's a BTR, which we should be emulating fine. More information
>>>>>> would need to be collected to have a chance to understand what
>>>>>> might be going one (first of all the virtual and physical memory
>>>>>> address this was trying to act on).
>>>>>
>>>>> Right, the BTR part should be fine, but I think the LOCK part is what's
>>>>> causing the issue. I've done a few more test runs to see what return
>>>>> RETRY (dumping the instruction with an "(r)" prefix to distinguish from
>>>>> the UNHANDLEABLE dump), and a couple of instructions return RETRY (BTR
>>>>> and XADD, both LOCK-prefixed, which means they now involve CMPXCHG
>>>>> handler, which presumably now fails - possibly simply because it's
>>>>> always LOCKed in my patch):
>>>>
>>>> Well, all of that looks to be expected behavior. I'm afraid I don't see
>>>> how this information helps understanding the MMIO emulation failure
>>>> above.
>>>
>>> I've managed to obtain this log of emulation errors:
>>> https://pastebin.com/Esy1SkHx 
>>>
>>> The "virtual address" lines that are not followed by any "Mem event"
>>> line correspond to CMXCHG_FAILED return codes.
>>>
>>> The very last line is a MMIO emulation failed.
>>>
>>> It's probably important that this happens with the model where
>>> hvm_emulate_one_vm_event() does _not_ re-try the emulation until it
>>> succeeds. The other model allows me to go further with the guest, but
>>> eventually I get timeout-related BSODs or the guest becomes unresponsive.
>> 
>> Interesting. You didn't clarify what the printed "offset" values are,
>> and it doesn't look like these have any correlation with the underlying
>> (guest) physical address, which we would also want to see. And then
>> it strikes me as odd that in these last lines
>> 
>> (XEN) Mem event (RETRY) emulation failed: d5v8 32bit @ 0008:826bb861 -> f0 0f 
> ba 30 00 72 07 8b cb e8 da 4b ff ff 8b 45
>> (XEN) virtual address: 0xffd080f0, offset: 4291854576
>> (XEN) MMIO emulation failed: d5v8 32bit @ 0008:82655f3c -> f0 0f ba 30 00 72 
> 07 8b cb e8 da 4b ff ff 8b 45
>> 
>> the instruction pointers and virtual addresses are different, but the
>> code bytes are exactly the same. This doesn't seem very likely, so I
>> wonder whether there's an issue with us wrongly re-using previously
>> fetched insn bytes. (Of course I'd be happy to be proven wrong with
>> this guessing, by you checking the involved binary/ies.)
> 
> Offset is the actual value of the "offset" parameter of
> hvmemul_cmpxchg().

That's not very useful then, as for flat segments "offset" ==
"virtual address" (i.e. you merely re-print in decimal what you've
already printed in hex).

Jan
diff mbox

Patch

diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index 2d92957..b946ef7 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -20,6 +20,7 @@ 
 #include <asm/hvm/emulate.h>
 #include <asm/hvm/hvm.h>
 #include <asm/hvm/ioreq.h>
+#include <asm/hvm/nestedhvm.h>
 #include <asm/hvm/trace.h>
 #include <asm/hvm/support.h>
 #include <asm/hvm/svm/svm.h>
@@ -1029,6 +1030,77 @@  static int hvmemul_wbinvd_discard(
     return X86EMUL_OKAY;
 }
 
+static int hvmemul_vaddr_to_mfn(
+    unsigned long addr,
+    mfn_t *mfn,
+    uint32_t pfec,
+    struct x86_emulate_ctxt *ctxt)
+{
+    paddr_t gpa = addr & ~PAGE_MASK;
+    struct page_info *page;
+    p2m_type_t p2mt;
+    unsigned long gfn;
+    struct vcpu *curr = current;
+    struct hvm_emulate_ctxt *hvmemul_ctxt =
+        container_of(ctxt, struct hvm_emulate_ctxt, ctxt);
+
+    gfn = paging_gva_to_gfn(curr, addr, &pfec);
+
+    if ( gfn == gfn_x(INVALID_GFN) )
+    {
+        pagefault_info_t pfinfo = {};
+
+        if ( ( pfec & PFEC_page_paged ) || ( pfec & PFEC_page_shared ) )
+            return X86EMUL_RETRY;
+
+        pfinfo.linear = addr;
+        pfinfo.ec = pfec;
+
+        x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
+        return X86EMUL_EXCEPTION;
+    }
+
+    gpa |= (paddr_t)gfn << PAGE_SHIFT;
+
+    /*
+     * No need to do the P2M lookup for internally handled MMIO, benefiting
+     * - 32-bit WinXP (& older Windows) on AMD CPUs for LAPIC accesses,
+     * - newer Windows (like Server 2012) for HPET accesses.
+     */
+    if ( !nestedhvm_vcpu_in_guestmode(curr) && hvm_mmio_internal(gpa) )
+        return X86EMUL_UNHANDLEABLE;
+
+    page = get_page_from_gfn(curr->domain, gfn, &p2mt, P2M_UNSHARE);
+
+    if ( !page )
+        return X86EMUL_UNHANDLEABLE;
+
+    if ( p2m_is_paging(p2mt) )
+    {
+        put_page(page);
+        p2m_mem_paging_populate(curr->domain, gfn);
+        return X86EMUL_RETRY;
+    }
+
+    if ( p2m_is_shared(p2mt) )
+    {
+        put_page(page);
+        return X86EMUL_RETRY;
+    }
+
+    if ( p2m_is_grant(p2mt) )
+    {
+        put_page(page);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    *mfn = _mfn(page_to_mfn(page));
+
+    put_page(page);
+
+    return X86EMUL_OKAY;
+}
+
 static int hvmemul_cmpxchg(
     enum x86_segment seg,
     unsigned long offset,
@@ -1037,8 +1109,70 @@  static int hvmemul_cmpxchg(
     unsigned int bytes,
     struct x86_emulate_ctxt *ctxt)
 {
-    /* Fix this in case the guest is really relying on r-m-w atomicity. */
-    return hvmemul_write(seg, offset, p_new, bytes, ctxt);
+    unsigned long addr, reps = 1;
+    int rc = X86EMUL_OKAY;
+    unsigned long old = 0, new = 0;
+    uint32_t pfec = PFEC_page_present | PFEC_write_access;
+    struct hvm_emulate_ctxt *hvmemul_ctxt =
+        container_of(ctxt, struct hvm_emulate_ctxt, ctxt);
+    mfn_t mfn[2];
+    void *map = NULL;
+    struct domain *currd = current->domain;
+
+    if ( is_x86_system_segment(seg) )
+        pfec |= PFEC_implicit;
+    else if ( hvmemul_ctxt->seg_reg[x86_seg_ss].attr.fields.dpl == 3 )
+        pfec |= PFEC_user_mode;
+
+    rc = hvmemul_virtual_to_linear(
+        seg, offset, bytes, &reps, hvm_access_write, hvmemul_ctxt, &addr);
+
+    if ( rc != X86EMUL_OKAY || !bytes )
+        return rc;
+
+    rc = hvmemul_vaddr_to_mfn(addr, &mfn[0], pfec, ctxt);
+
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    if ( likely(((addr + bytes - 1) & PAGE_MASK) == (addr & PAGE_MASK)) )
+    {
+        /* Whole write fits on a single page. */
+        mfn[1] = INVALID_MFN;
+        map = map_domain_page(mfn[0]);
+    }
+    else
+    {
+        rc = hvmemul_vaddr_to_mfn((addr + bytes - 1) & PAGE_MASK, &mfn[1],
+                                  pfec, ctxt);
+        if ( rc != X86EMUL_OKAY )
+            return rc;
+
+        map = vmap(mfn, 2);
+    }
+
+    if ( !map )
+        return X86EMUL_UNHANDLEABLE;
+
+    map += (addr & ~PAGE_MASK);
+
+    memcpy(&old, p_old, bytes);
+    memcpy(&new, p_new, bytes);
+
+    if ( __cmpxchg(map, old, new, bytes) != old )
+        rc = X86EMUL_CMPXCHG_FAILED;
+
+    paging_mark_dirty(currd, mfn[0]);
+
+    if ( unlikely(mfn_valid(mfn[1])) )
+    {
+        paging_mark_dirty(currd, mfn[1]);
+        vunmap((void *)((unsigned long)map & PAGE_MASK));
+    }
+    else
+        unmap_domain_page(map);
+
+    return rc;
 }
 
 static int hvmemul_validate(