[v2,4/4] x86/hvm: Implement hvmemul_write() using real mappings

Message ID 1504886736-1823-5-git-send-email-aisaila@bitdefender.com (mailing list archive)
State New, archived

Commit Message

Alexandru Stefan ISAILA Sept. 8, 2017, 4:05 p.m. UTC
From: Andrew Cooper <andrew.cooper3@citrix.com>

An access which crosses a page boundary is performed atomically by x86
hardware, albeit with a severe performance penalty.  An important corner case
is when a straddled access hits two pages which differ in whether a
translation exists, or in net access rights.

The use of hvm_copy*() in hvmemul_write() is problematic, because it performs
a translation then completes the partial write, before moving on to the next
translation.

If an individual emulated write straddles two pages, the first of which is
writable, and the second of which is not, the first half of the write will
complete before #PF is raised from the second half.

This results in guest state corruption as a side effect of emulation, which
has been observed to cause Windows to crash while under introspection.

Introduce the hvmemul_{,un}map_linear_addr() helpers, which translate the
entire contents of a linear access, and vmap() the underlying frames to
provide a contiguous virtual mapping for the emulator to use.  This is the
same mechanism as used by the shadow emulation code.

This will catch any translation issues and abort the emulation before any
modifications occur.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Paul Durrant <paul.durrant@citrix.com>
CC: Razvan Cojocaru <rcojocaru@bitdefender.com>
CC: Mihai Donțu <mdontu@bitdefender.com>

Changes since V1:
	- Moved the ASSERT to the beginning of the loop
	- Corrected the decrement on mfn in the while statement
	- Modified the comment to PAGE_SIZE+1

While the maximum size of a linear mapping is capped at 1 page, the logic in
the helpers is written to work properly as hvmemul_ctxt->mfn[] gets longer,
specifically with XSAVE instruction emulation in mind.

This has only had light testing so far.
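
As an illustration of the frame arithmetic the helpers rely on, here is a
standalone sketch (not part of the patch; it assumes the usual 4K x86 page
constants and hypothetical values):

    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    int main(void)
    {
        /* A 2-byte write whose first byte is the last byte of a page. */
        unsigned long linear = 5 * PAGE_SIZE + (PAGE_SIZE - 1);
        unsigned int bytes = 2;

        unsigned long first = linear >> PAGE_SHIFT;
        unsigned long final = (linear + bytes - !!bytes) >> PAGE_SHIFT;

        /*
         * first != final means the access straddles a page boundary, so
         * both translations must succeed before any byte may be written.
         *
         * With N entries in mfn[], the largest write guaranteed to fit
         * regardless of alignment is (N - 1) * PAGE_SIZE + 1 bytes, which
         * is where the PAGE_SIZE+1 note for mfn[2] comes from.
         */
        printf("first=%lu final=%lu frames=%lu\n",
               first, final, final - first + 1);
        return 0;
    }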
---
 xen/arch/x86/hvm/emulate.c        | 179 ++++++++++++++++++++++++++++++++++----
 xen/include/asm-x86/hvm/emulate.h |   7 ++
 2 files changed, 169 insertions(+), 17 deletions(-)

Comments

Paul Durrant Sept. 12, 2017, 2:32 p.m. UTC | #1
> -----Original Message-----
> From: Alexandru Isaila [mailto:aisaila@bitdefender.com]
> Sent: 08 September 2017 09:06
> Subject: [PATCH v2 4/4] x86/hvm: Implement hvmemul_write() using real
> mappings
> [...]
> +    /* First and final gfns which need mapping. */
> +    unsigned long frame = linear >> PAGE_SHIFT, first = frame;
> +    unsigned long final = (linear + bytes - !!bytes) >> PAGE_SHIFT;
> +
> +    /*
> +     * The caller has no legitimate reason for trying a zero-byte write, but
> +     * final is calculate to fail safe in release builds.
> +     *
> +     * The maximum write size depends on the number of adjacent mfns[] which
> +     * can be vmap()'d, accouting for possible misalignment within the region.
> +     * The higher level emulation callers are responsible for ensuring that
> +     * mfns[] is large enough for the requested write size.
> +     */
> +    if ( bytes == 0 ||
> +         final - first > ARRAY_SIZE(hvmemul_ctxt->mfn) - 1 )


Should there be an overflow check here to make sure final > first? Your do-while below uses frame < final as its test.
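
(i.e. something along the lines of

    if ( bytes == 0 || final < first ||
         final - first > ARRAY_SIZE(hvmemul_ctxt->mfn) - 1 )

assuming linear + bytes can actually wrap here.)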

> +    {
> +        ASSERT_UNREACHABLE();
> +        goto unhandleable;
> +    }
> +
> +    do {
> [...]
> +        *mfn++ = _mfn(page_to_mfn(page));
> +        frame++;
> +
> +        if ( p2m_is_discard_write(p2mt) )
> +        {
> +            err = ERR_PTR(~(long)X86EMUL_OKAY);
> +            goto out;
> +        }
> +
> +    } while ( frame < final );


while ( ++frame < final ), and lose the increment above?

> [...]
> +static void hvmemul_unmap_linear_addr(
> +    void *mapping, unsigned long linear, unsigned int bytes,
> +    struct hvm_emulate_ctxt *hvmemul_ctxt)
> +{
> [...]
> +    do
> +    {
> +        ASSERT(mfn_valid(*mfn));
> +        paging_mark_dirty(currd, *mfn);
> +        put_page(mfn_to_page(mfn_x(*mfn)));
> +
> +        frame++;
> +        *mfn++ = _mfn(0); /* Clean slot for map()'s error checking. */
> +
> +    } while ( frame < final );


Again, the increment could be folded into the test.

  Paul

Andrew Cooper Sept. 12, 2017, 2:37 p.m. UTC | #2
On 12/09/17 15:32, Paul Durrant wrote:
>
>> +    {
>> +        ASSERT_UNREACHABLE();
>> +        goto unhandleable;
>> +    }
>> +
>> +    do {
>> +        enum hvm_translation_result res;
>> +        struct page_info *page;
>> +        pagefault_info_t pfinfo;
>> +        p2m_type_t p2mt;
>> +
>> +        /* Error checking.  Confirm that the current slot is clean. */
>> +        ASSERT(mfn_x(*mfn) == 0);
>> +
>> +        res = hvm_translate_get_page(curr, frame << PAGE_SHIFT, true, pfec,
>> +                                     &pfinfo, &page, NULL, &p2mt);
>> +
>> +        switch ( res )
>> +        {
>> +        case HVMTRANS_okay:
>> +            break;
>> +
>> +        case HVMTRANS_bad_linear_to_gfn:
>> +            x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
>> +            err = ERR_PTR(~(long)X86EMUL_EXCEPTION);
>> +            goto out;
>> +
>> +        case HVMTRANS_bad_gfn_to_mfn:
>> +            err = NULL;
>> +            goto out;
>> +
>> +        case HVMTRANS_gfn_paged_out:
>> +        case HVMTRANS_gfn_shared:
>> +            err = ERR_PTR(~(long)X86EMUL_RETRY);
>> +            goto out;
>> +
>> +        default:
>> +            goto unhandleable;
>> +        }
>> +
>> +        *mfn++ = _mfn(page_to_mfn(page));
>> +        frame++;
>> +
>> +        if ( p2m_is_discard_write(p2mt) )
>> +        {
>> +            err = ERR_PTR(~(long)X86EMUL_OKAY);
>> +            goto out;
>> +        }
>> +
>> +    } while ( frame < final );
> while ( ++frame < final ), and lose the increment above?

I deliberately wrote it this way, to avoid adding to the cognitive load
of trying to work out what is going on.

-1 to the suggestion.

~Andrew
Jan Beulich Sept. 12, 2017, 3:05 p.m. UTC | #3
>>> On 12.09.17 at 16:37, <andrew.cooper3@citrix.com> wrote:
> On 12/09/17 15:32, Paul Durrant wrote:
>>
>>> +    {
>>> +        ASSERT_UNREACHABLE();
>>> +        goto unhandleable;
>>> +    }
>>> +
>>> +    do {
>>> +        enum hvm_translation_result res;
>>> +        struct page_info *page;
>>> +        pagefault_info_t pfinfo;
>>> +        p2m_type_t p2mt;
>>> +
>>> +        /* Error checking.  Confirm that the current slot is clean. */
>>> +        ASSERT(mfn_x(*mfn) == 0);
>>> +
>>> +        res = hvm_translate_get_page(curr, frame << PAGE_SHIFT, true, pfec,
>>> +                                     &pfinfo, &page, NULL, &p2mt);
>>> +
>>> +        switch ( res )
>>> +        {
>>> +        case HVMTRANS_okay:
>>> +            break;
>>> +
>>> +        case HVMTRANS_bad_linear_to_gfn:
>>> +            x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
>>> +            err = ERR_PTR(~(long)X86EMUL_EXCEPTION);
>>> +            goto out;
>>> +
>>> +        case HVMTRANS_bad_gfn_to_mfn:
>>> +            err = NULL;
>>> +            goto out;
>>> +
>>> +        case HVMTRANS_gfn_paged_out:
>>> +        case HVMTRANS_gfn_shared:
>>> +            err = ERR_PTR(~(long)X86EMUL_RETRY);
>>> +            goto out;
>>> +
>>> +        default:
>>> +            goto unhandleable;
>>> +        }
>>> +
>>> +        *mfn++ = _mfn(page_to_mfn(page));
>>> +        frame++;
>>> +
>>> +        if ( p2m_is_discard_write(p2mt) )
>>> +        {
>>> +            err = ERR_PTR(~(long)X86EMUL_OKAY);
>>> +            goto out;
>>> +        }
>>> +
>>> +    } while ( frame < final );
>> while ( ++frame < final ), and lose the increment above?
> 
> I deliberately wrote it this way, to avoid adding to the cognitive load
> of trying to work out what is going on.

It's a matter of taste as well as what you're used to whether the
increment inside the while() is helpful or hindering. I'm generally
of the opinion that things that belong together should go together
(just like for(;;) enforces by wanting everything on the same line),
and the increment and loop exit check do belong together.
Separating them like above would be advisable only if the
intermediate loop exit really requires the value to still be
un-incremented.
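
I.e., purely as an illustration of the two styles being discussed:

    do {
        ...
        frame++;
    } while ( frame < final );      /* increment separate from the test */

    do {
        ...
    } while ( ++frame < final );    /* increment and exit check together */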

> -1 to the suggestion.

+1 from me, that is.

Jan
Paul Durrant Sept. 12, 2017, 3:06 p.m. UTC | #4
> -----Original Message-----
> From: Andrew Cooper
> Sent: 12 September 2017 07:38
> Subject: Re: [PATCH v2 4/4] x86/hvm: Implement hvmemul_write() using
> real mappings
>
> On 12/09/17 15:32, Paul Durrant wrote:
> > [...]
> >> +    } while ( frame < final );
> > while ( ++frame < final ), and lose the increment above?
>
> I deliberately wrote it this way, to avoid adding to the cognitive load
> of trying to work out what is going on.


IMO there is more cognitive load in separating the increment from the test, but I'm not that fussed.

  Paul

Andrew Cooper Sept. 12, 2017, 3:09 p.m. UTC | #5
On 08/09/17 17:05, Alexandru Isaila wrote:
> +    } while ( frame < final );
> +
> +    /* Entire access within a single frame? */
> +    if ( first == final )
> +        mapping = map_domain_page(hvmemul_ctxt->mfn[0]) + (linear & ~PAGE_MASK);
> +    /* Multiple frames? Need to vmap(). */
> +    else if ( (mapping = vmap(hvmemul_ctxt->mfn,
> +                              mfn - hvmemul_ctxt->mfn)) == NULL )
> +        goto unhandleable;
> +
> +#ifndef NDEBUG /* Poision unused mfn[]s with INVALID_MFN. */
> +    while ( mfn < hvmemul_ctxt->mfn + ARRAY_SIZE(hvmemul_ctxt->mfn) )
> +    {
> +        ASSERT(mfn_x(*mfn) == 0);
> +        *mfn++ = INVALID_MFN;
> +    }
> +#endif
> +
> +    return mapping;

/sigh - it's sad what you notice when looking over your own code somewhat
later...

the + (linear & ~PAGE_MASK) needs removing from the map_domain_page()
call, and adding to this return statement, because it also needs to
happen for the vmap() case.
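
I.e. something along these lines (untested sketch):

    /* Entire access within a single frame? */
    if ( first == final )
        mapping = map_domain_page(hvmemul_ctxt->mfn[0]);
    /* Multiple frames? Need to vmap(). */
    else if ( (mapping = vmap(hvmemul_ctxt->mfn,
                              mfn - hvmemul_ctxt->mfn)) == NULL )
        goto unhandleable;

    ...

    return mapping + (linear & ~PAGE_MASK);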

~Andrew
Paul Durrant Sept. 12, 2017, 3:12 p.m. UTC | #6
> -----Original Message-----
> From: Andrew Cooper
> Sent: 12 September 2017 08:10
> Subject: Re: [PATCH v2 4/4] x86/hvm: Implement hvmemul_write() using
> real mappings
>
> On 08/09/17 17:05, Alexandru Isaila wrote:
> > +    /* Entire access within a single frame? */
> > +    if ( first == final )
> > +        mapping = map_domain_page(hvmemul_ctxt->mfn[0]) + (linear & ~PAGE_MASK);
> > +    /* Multiple frames? Need to vmap(). */
> > +    else if ( (mapping = vmap(hvmemul_ctxt->mfn,
> > +                              mfn - hvmemul_ctxt->mfn)) == NULL )
> > +        goto unhandleable;
> > [...]
> > +    return mapping;
>
> /sigh - it's sad what you notice when looking over your own code somewhat
> later...
>
> the + (linear & ~PAGE_MASK) needs removing from the map_domain_page()
> call, and adding to this return statement, because it also needs to
> happen for the vmap() case.


Should vmap call through to map_domain_page() in the case of a single page, I wonder?

  Paul

Jan Beulich Sept. 18, 2017, 1:43 p.m. UTC | #7
>>> On 08.09.17 at 18:05, <aisaila@bitdefender.com> wrote:
> Changes since V1:
> 	- Moved ASSERT to the begining of the loop
> 	- Corrected the decrement on mfn int the while statement
> 	- Modified the comment to PAGE_SIZE+1

While several of my v1 comments were taken care of verbally, some
haven't been addressed here or during the discussion.

> While the maximum size of linear mapping is capped at 1 page, the logic
> in the helpers is written to work properly as hvmemul_ctxt->mfn[] gets 
> longer,
> specifically with XSAVE instruction emulation in mind.
> 
> This has only had light testing so far.

Has this changed in the meantime?

> +static void *hvmemul_map_linear_addr(
> +    unsigned long linear, unsigned int bytes, uint32_t pfec,
> +    struct hvm_emulate_ctxt *hvmemul_ctxt)
> +{
> +    struct vcpu *curr = current;
> +    void *err, *mapping;
> +
> +    /* First and final gfns which need mapping. */
> +    unsigned long frame = linear >> PAGE_SHIFT, first = frame;
> +    unsigned long final = (linear + bytes - !!bytes) >> PAGE_SHIFT;
> +
> +    /*
> +     * mfn points to the next free slot.  All used slots have a page reference
> +     * held on them.
> +     */
> +    mfn_t *mfn = &hvmemul_ctxt->mfn[0];
> +
> +    /*
> +     * The caller has no legitimate reason for trying a zero-byte write, but
> +     * final is calculate to fail safe in release builds.
> +     *
> +     * The maximum write size depends on the number of adjacent mfns[] which
> +     * can be vmap()'d, accouting for possible misalignment within the region.
> +     * The higher level emulation callers are responsible for ensuring that
> +     * mfns[] is large enough for the requested write size.
> +     */
> +    if ( bytes == 0 ||
> +         final - first > ARRAY_SIZE(hvmemul_ctxt->mfn) - 1 )
> +    {
> +        ASSERT_UNREACHABLE();
> +        goto unhandleable;
> +    }
> +
> +    do {
> +        enum hvm_translation_result res;
> +        struct page_info *page;
> +        pagefault_info_t pfinfo;
> +        p2m_type_t p2mt;
> +
> +        /* Error checking.  Confirm that the current slot is clean. */
> +        ASSERT(mfn_x(*mfn) == 0);
> +
> +        res = hvm_translate_get_page(curr, frame << PAGE_SHIFT, true, pfec,
> +                                     &pfinfo, &page, NULL, &p2mt);
> +
> +        switch ( res )
> +        {
> +        case HVMTRANS_okay:
> +            break;
> +
> +        case HVMTRANS_bad_linear_to_gfn:
> +            x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
> +            err = ERR_PTR(~(long)X86EMUL_EXCEPTION);

Why the casts to long here and further down?

> +            goto out;
> +
> +        case HVMTRANS_bad_gfn_to_mfn:
> +            err = NULL;
> +            goto out;
> +
> +        case HVMTRANS_gfn_paged_out:
> +        case HVMTRANS_gfn_shared:
> +            err = ERR_PTR(~(long)X86EMUL_RETRY);
> +            goto out;
> +
> +        default:
> +            goto unhandleable;
> +        }
> +
> +        *mfn++ = _mfn(page_to_mfn(page));
> +        frame++;
> +
> +        if ( p2m_is_discard_write(p2mt) )
> +        {
> +            err = ERR_PTR(~(long)X86EMUL_OKAY);
> +            goto out;
> +        }
> +
> +    } while ( frame < final );
> +
> +    /* Entire access within a single frame? */
> +    if ( first == final )
> +        mapping = map_domain_page(hvmemul_ctxt->mfn[0]) + (linear & ~PAGE_MASK);
> +    /* Multiple frames? Need to vmap(). */
> +    else if ( (mapping = vmap(hvmemul_ctxt->mfn,
> +                              mfn - hvmemul_ctxt->mfn)) == NULL )

v1 comment was "final - first + 1 would likely yield better code."

> +        goto unhandleable;
> +
> +#ifndef NDEBUG /* Poision unused mfn[]s with INVALID_MFN. */
> +    while ( mfn < hvmemul_ctxt->mfn + ARRAY_SIZE(hvmemul_ctxt->mfn) )
> +    {
> +        ASSERT(mfn_x(*mfn) == 0);
> +        *mfn++ = INVALID_MFN;
> +    }
> +#endif
> +
> +    return mapping;
> +
> + unhandleable:
> +    err = ERR_PTR(~(long)X86EMUL_UNHANDLEABLE);
> +
> + out:
> +    /* Drop all held references. */
> +    while ( mfn-- > hvmemul_ctxt->mfn )
> +        put_page(mfn_to_page(mfn_x(*mfn)));
> +
> +    return err;
> +}
> +
> +static void hvmemul_unmap_linear_addr(
> +    void *mapping, unsigned long linear, unsigned int bytes,

While this was discussed in response to v1, I still think "mapping"
should be const void *, or a prereq patch (which I would object
to) should be submitted to drop the const from vunmap() and
unmap_domain_page().
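
I.e. (a sketch of just the suggested signature change):

    static void hvmemul_unmap_linear_addr(
        const void *mapping, unsigned long linear, unsigned int bytes,
        struct hvm_emulate_ctxt *hvmemul_ctxt)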

> @@ -1007,23 +1160,15 @@ static int hvmemul_write(
>           (vio->mmio_gla == (addr & PAGE_MASK)) )
>          return hvmemul_linear_mmio_write(addr, bytes, p_data, pfec, hvmemul_ctxt, 1);
>  
> -    rc = hvm_copy_to_guest_linear(addr, p_data, bytes, pfec, &pfinfo);
> -
> -    switch ( rc )
> -    {
> -    case HVMTRANS_okay:
> -        break;
> -    case HVMTRANS_bad_linear_to_gfn:
> -        x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
> -        return X86EMUL_EXCEPTION;
> -    case HVMTRANS_bad_gfn_to_mfn:
> +    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
> +    if ( IS_ERR(mapping) )
> +        return ~PTR_ERR(mapping);
> +    else if ( !mapping )

v1 comment: 'Pointless "else".'

Jan
Alexandru Stefan ISAILA Sept. 18, 2017, 2:16 p.m. UTC | #8
On Mon, 2017-09-18 at 07:43 -0600, Jan Beulich wrote:
> While several of my v1 comments were taken care of verbally, some
> haven't been addressed here or during the discussion.

Sorry about that, I must have lost some emails or some have not been
indexed. I'll address all of them from now on.

> > While the maximum size of linear mapping is capped at 1 page, the
> > logic in the helpers is written to work properly as
> > hvmemul_ctxt->mfn[] gets longer, specifically with XSAVE instruction
> > emulation in mind.
> >
> > This has only had light testing so far.
>
> Has this changed in the meantime?

This has not changed so far.

> > +    /* Multiple frames? Need to vmap(). */
> > +    else if ( (mapping = vmap(hvmemul_ctxt->mfn,
> > +                              mfn - hvmemul_ctxt->mfn)) == NULL )
>
> v1 comment was "final - first + 1 would likely yield better code."

Will do.

> > +static void hvmemul_unmap_linear_addr(
> > +    void *mapping, unsigned long linear, unsigned int bytes,
>
> While this was discussed in response to v1, I still think "mapping"
> should be const void *, or a prereq patch (which I would object
> to) should be submitted to drop the const from vunmap() and
> unmap_domain_page().

I'll wait for Andrew's opinion on this issue.

> > +    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
> > +    if ( IS_ERR(mapping) )
> > +        return ~PTR_ERR(mapping);
> > +    else if ( !mapping )
>
> v1 comment: 'Pointless "else".'

Agreed.

Regards,
Alex


Patch

diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index c871cb3..196a77c 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -498,6 +498,159 @@  static int hvmemul_do_mmio_addr(paddr_t mmio_gpa,
 }
 
 /*
+ * Map the frame(s) covering an individual linear access, for writeable
+ * access.  May return NULL for MMIO, or ERR_PTR(~X86EMUL_*) for other errors
+ * including ERR_PTR(~X86EMUL_OKAY) for write-discard mappings.
+ *
+ * In debug builds, map() checks that each slot in hvmemul_ctxt->mfn[] is
+ * clean before use, and poisons unused slots with INVALID_MFN.
+ */
+static void *hvmemul_map_linear_addr(
+    unsigned long linear, unsigned int bytes, uint32_t pfec,
+    struct hvm_emulate_ctxt *hvmemul_ctxt)
+{
+    struct vcpu *curr = current;
+    void *err, *mapping;
+
+    /* First and final gfns which need mapping. */
+    unsigned long frame = linear >> PAGE_SHIFT, first = frame;
+    unsigned long final = (linear + bytes - !!bytes) >> PAGE_SHIFT;
+
+    /*
+     * mfn points to the next free slot.  All used slots have a page reference
+     * held on them.
+     */
+    mfn_t *mfn = &hvmemul_ctxt->mfn[0];
+
+    /*
+     * The caller has no legitimate reason for trying a zero-byte write, but
+     * final is calculated to fail safe in release builds.
+     *
+     * The maximum write size depends on the number of adjacent mfns[] which
+     * can be vmap()'d, accounting for possible misalignment within the region.
+     * The higher level emulation callers are responsible for ensuring that
+     * mfns[] is large enough for the requested write size.
+     */
+    if ( bytes == 0 ||
+         final - first > ARRAY_SIZE(hvmemul_ctxt->mfn) - 1 )
+    {
+        ASSERT_UNREACHABLE();
+        goto unhandleable;
+    }
+
+    do {
+        enum hvm_translation_result res;
+        struct page_info *page;
+        pagefault_info_t pfinfo;
+        p2m_type_t p2mt;
+
+        /* Error checking.  Confirm that the current slot is clean. */
+        ASSERT(mfn_x(*mfn) == 0);
+
+        res = hvm_translate_get_page(curr, frame << PAGE_SHIFT, true, pfec,
+                                     &pfinfo, &page, NULL, &p2mt);
+
+        switch ( res )
+        {
+        case HVMTRANS_okay:
+            break;
+
+        case HVMTRANS_bad_linear_to_gfn:
+            x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
+            err = ERR_PTR(~(long)X86EMUL_EXCEPTION);
+            goto out;
+
+        case HVMTRANS_bad_gfn_to_mfn:
+            err = NULL;
+            goto out;
+
+        case HVMTRANS_gfn_paged_out:
+        case HVMTRANS_gfn_shared:
+            err = ERR_PTR(~(long)X86EMUL_RETRY);
+            goto out;
+
+        default:
+            goto unhandleable;
+        }
+
+        *mfn++ = _mfn(page_to_mfn(page));
+        frame++;
+
+        if ( p2m_is_discard_write(p2mt) )
+        {
+            err = ERR_PTR(~(long)X86EMUL_OKAY);
+            goto out;
+        }
+
+    } while ( frame < final );
+
+    /* Entire access within a single frame? */
+    if ( first == final )
+        mapping = map_domain_page(hvmemul_ctxt->mfn[0]) + (linear & ~PAGE_MASK);
+    /* Multiple frames? Need to vmap(). */
+    else if ( (mapping = vmap(hvmemul_ctxt->mfn,
+                              mfn - hvmemul_ctxt->mfn)) == NULL )
+        goto unhandleable;
+
+#ifndef NDEBUG /* Poison unused mfn[]s with INVALID_MFN. */
+    while ( mfn < hvmemul_ctxt->mfn + ARRAY_SIZE(hvmemul_ctxt->mfn) )
+    {
+        ASSERT(mfn_x(*mfn) == 0);
+        *mfn++ = INVALID_MFN;
+    }
+#endif
+
+    return mapping;
+
+ unhandleable:
+    err = ERR_PTR(~(long)X86EMUL_UNHANDLEABLE);
+
+ out:
+    /* Drop all held references. */
+    while ( mfn-- > hvmemul_ctxt->mfn )
+        put_page(mfn_to_page(mfn_x(*mfn)));
+
+    return err;
+}
+
+static void hvmemul_unmap_linear_addr(
+    void *mapping, unsigned long linear, unsigned int bytes,
+    struct hvm_emulate_ctxt *hvmemul_ctxt)
+{
+    struct domain *currd = current->domain;
+    unsigned long frame = linear >> PAGE_SHIFT;
+    unsigned long final = (linear + bytes - !!bytes) >> PAGE_SHIFT;
+    mfn_t *mfn = &hvmemul_ctxt->mfn[0];
+
+    ASSERT(bytes > 0);
+
+    if ( frame == final )
+        unmap_domain_page(mapping);
+    else
+        vunmap(mapping);
+
+    do
+    {
+        ASSERT(mfn_valid(*mfn));
+        paging_mark_dirty(currd, *mfn);
+        put_page(mfn_to_page(mfn_x(*mfn)));
+
+        frame++;
+        *mfn++ = _mfn(0); /* Clean slot for map()'s error checking. */
+
+    } while ( frame < final );
+
+
+#ifndef NDEBUG /* Check (and clean) all unused mfns. */
+    while ( mfn < hvmemul_ctxt->mfn + ARRAY_SIZE(hvmemul_ctxt->mfn) )
+    {
+        ASSERT(mfn_eq(*mfn, INVALID_MFN));
+        *mfn++ = _mfn(0);
+    }
+#endif
+}
+
+/*
  * Convert addr from linear to physical form, valid over the range
  * [addr, addr + *reps * bytes_per_rep]. *reps is adjusted according to
  * the valid computed range. It is always >0 when X86EMUL_OKAY is returned.
@@ -987,11 +1140,11 @@  static int hvmemul_write(
     struct hvm_emulate_ctxt *hvmemul_ctxt =
         container_of(ctxt, struct hvm_emulate_ctxt, ctxt);
     struct vcpu *curr = current;
-    pagefault_info_t pfinfo;
     unsigned long addr, reps = 1;
     uint32_t pfec = PFEC_page_present | PFEC_write_access;
     struct hvm_vcpu_io *vio = &curr->arch.hvm_vcpu.hvm_io;
     int rc;
+    void *mapping;
 
     if ( is_x86_system_segment(seg) )
         pfec |= PFEC_implicit;
@@ -1007,23 +1160,15 @@  static int hvmemul_write(
          (vio->mmio_gla == (addr & PAGE_MASK)) )
         return hvmemul_linear_mmio_write(addr, bytes, p_data, pfec, hvmemul_ctxt, 1);
 
-    rc = hvm_copy_to_guest_linear(addr, p_data, bytes, pfec, &pfinfo);
-
-    switch ( rc )
-    {
-    case HVMTRANS_okay:
-        break;
-    case HVMTRANS_bad_linear_to_gfn:
-        x86_emul_pagefault(pfinfo.ec, pfinfo.linear, &hvmemul_ctxt->ctxt);
-        return X86EMUL_EXCEPTION;
-    case HVMTRANS_bad_gfn_to_mfn:
+    mapping = hvmemul_map_linear_addr(addr, bytes, pfec, hvmemul_ctxt);
+    if ( IS_ERR(mapping) )
+        return ~PTR_ERR(mapping);
+    else if ( !mapping )
         return hvmemul_linear_mmio_write(addr, bytes, p_data, pfec, hvmemul_ctxt, 0);
-    case HVMTRANS_gfn_paged_out:
-    case HVMTRANS_gfn_shared:
-        return X86EMUL_RETRY;
-    default:
-        return X86EMUL_UNHANDLEABLE;
-    }
+
+    memcpy(mapping, p_data, bytes);
+
+    hvmemul_unmap_linear_addr(mapping, addr, bytes, hvmemul_ctxt);
 
     return X86EMUL_OKAY;
 }
diff --git a/xen/include/asm-x86/hvm/emulate.h b/xen/include/asm-x86/hvm/emulate.h
index 8864775..d379a4a 100644
--- a/xen/include/asm-x86/hvm/emulate.h
+++ b/xen/include/asm-x86/hvm/emulate.h
@@ -37,6 +37,13 @@  struct hvm_emulate_ctxt {
     unsigned long seg_reg_accessed;
     unsigned long seg_reg_dirty;
 
+    /*
+     * MFNs behind temporary mappings in the write callback.  The length is
+     * arbitrary, and can be increased if writes longer than PAGE_SIZE+1 are
+     * needed.
+     */
+    mfn_t mfn[2];
+
     uint32_t intr_shadow;
 
     bool_t set_context;