diff mbox

[v2,3/3] x86/ioreq server: Add HVMOP to map guest ram with p2m_ioreq_server to an ioreq server

Message ID 1459421618-5991-4-git-send-email-yu.c.zhang@linux.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Yu Zhang March 31, 2016, 10:53 a.m. UTC
A new HVMOP - HVMOP_map_mem_type_to_ioreq_server, is added to
let one ioreq server claim/disclaim its responsibility for the
handling of guest pages with p2m type p2m_ioreq_server. Users
of this HVMOP can specify whether the p2m_ioreq_server is supposed
to handle write accesses or read ones or both in a parameter named
flags. For now, we only support one ioreq server for this p2m type,
so once an ioreq server has claimed its ownership, subsequent calls
of the HVMOP_map_mem_type_to_ioreq_server will fail. Users can also
disclaim the ownership of guest ram pages with this p2m type, by
triggering this new HVMOP, with ioreq server id set to the current
owner's and flags parameter set to 0.

For now, both HVMOP_map_mem_type_to_ioreq_server and p2m_ioreq_server
are only supported for HVMs with HAP enabled.

Note that flags parameter(if not 0) of this HVMOP only indicates
which kind of memory accesses are to be forwarded to an ioreq server,
it has impact on the access rights of guest ram pages, but are not
the same. Due to hardware limitations, if only write operations are
to be forwarded, read ones will be performed at full speed, with
no hypervisor intervention. But if read ones are to be forwarded to
an ioreq server, writes will inevitably be trapped into hypervisor,
which means significant performance impact.

Also note that this HVMOP_map_mem_type_to_ioreq_server will not
change the p2m type of any guest ram page, until HVMOP_set_mem_type
is triggered. So normally the steps should be the backend driver
first claims its ownership of guest ram pages with p2m_ioreq_server
type, and then sets the memory type to p2m_ioreq_server for specified
guest ram pages.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Cc: Keir Fraser <keir@xen.org>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Tim Deegan <tim@xen.org>
---
 xen/arch/x86/hvm/emulate.c       | 125 +++++++++++++++++++++++++++++++++++++--
 xen/arch/x86/hvm/hvm.c           |  95 +++++++++++++++++++++++++++--
 xen/arch/x86/mm/hap/nested_hap.c |   2 +-
 xen/arch/x86/mm/p2m-ept.c        |  14 ++++-
 xen/arch/x86/mm/p2m-pt.c         |  25 +++++---
 xen/arch/x86/mm/p2m.c            |  82 +++++++++++++++++++++++++
 xen/arch/x86/mm/shadow/multi.c   |   3 +-
 xen/include/asm-x86/p2m.h        |  36 +++++++++--
 xen/include/public/hvm/hvm_op.h  |  37 ++++++++++++
 9 files changed, 395 insertions(+), 24 deletions(-)

Comments

Yu Zhang April 5, 2016, 6:01 a.m. UTC | #1
Thanks, Tim.

On 4/4/2016 4:25 PM, Tim Deegan wrote:
> At 18:53 +0800 on 31 Mar (1459450418), Yu Zhang wrote:
>> A new HVMOP - HVMOP_map_mem_type_to_ioreq_server, is added to
>> let one ioreq server claim/disclaim its responsibility for the
>> handling of guest pages with p2m type p2m_ioreq_server. Users
>> of this HVMOP can specify whether the p2m_ioreq_server is supposed
>> to handle write accesses or read ones or both in a parameter named
>> flags. For now, we only support one ioreq server for this p2m type,
>> so once an ioreq server has claimed its ownership, subsequent calls
>> of the HVMOP_map_mem_type_to_ioreq_server will fail. Users can also
>> disclaim the ownership of guest ram pages with this p2m type, by
>> triggering this new HVMOP, with ioreq server id set to the current
>> owner's and flags parameter set to 0.
>>
>> For now, both HVMOP_map_mem_type_to_ioreq_server and p2m_ioreq_server
>> are only supported for HVMs with HAP enabled.
>
> "For now", meaning you intend to make this work on shadow pagetables? :P
> If not, please just say it's HAP-only.
>

Well, I have no clear plan in the near future to support this on shadow
mode, and I'll change the commit message in next version. :)

>> diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
>> index c81302a..2e0d258 100644
>> --- a/xen/arch/x86/mm/shadow/multi.c
>> +++ b/xen/arch/x86/mm/shadow/multi.c
>> @@ -3224,8 +3224,7 @@ static int sh_page_fault(struct vcpu *v,
>>       }
>>
>>       /* Need to hand off device-model MMIO to the device model */
>> -    if ( p2mt == p2m_mmio_dm
>> -         || (p2mt == p2m_ioreq_server && ft == ft_demand_write) )
>> +    if ( p2mt == p2m_mmio_dm )
>>       {
>>           gpa = guest_walk_to_gpa(&gw);
>>           goto mmio;
>
> Acked-by: Tim Deegan <tim@xen.org>
>
>

Regards.
Yu
George Dunlap April 6, 2016, 5:13 p.m. UTC | #2
On Thu, Mar 31, 2016 at 11:53 AM, Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> A new HVMOP - HVMOP_map_mem_type_to_ioreq_server, is added to
> let one ioreq server claim/disclaim its responsibility for the
> handling of guest pages with p2m type p2m_ioreq_server. Users
> of this HVMOP can specify whether the p2m_ioreq_server is supposed
> to handle write accesses or read ones or both in a parameter named
> flags. For now, we only support one ioreq server for this p2m type,
> so once an ioreq server has claimed its ownership, subsequent calls
> of the HVMOP_map_mem_type_to_ioreq_server will fail. Users can also
> disclaim the ownership of guest ram pages with this p2m type, by
> triggering this new HVMOP, with ioreq server id set to the current
> owner's and flags parameter set to 0.
>
> For now, both HVMOP_map_mem_type_to_ioreq_server and p2m_ioreq_server
> are only supported for HVMs with HAP enabled.
>
> Note that flags parameter(if not 0) of this HVMOP only indicates
> which kind of memory accesses are to be forwarded to an ioreq server,
> it has impact on the access rights of guest ram pages, but are not
> the same. Due to hardware limitations, if only write operations are
> to be forwarded, read ones will be performed at full speed, with
> no hypervisor intervention. But if read ones are to be forwarded to
> an ioreq server, writes will inevitably be trapped into hypervisor,
> which means significant performance impact.
>
> Also note that this HVMOP_map_mem_type_to_ioreq_server will not
> change the p2m type of any guest ram page, until HVMOP_set_mem_type
> is triggered. So normally the steps should be the backend driver
> first claims its ownership of guest ram pages with p2m_ioreq_server
> type, and then sets the memory type to p2m_ioreq_server for specified
> guest ram pages.
>
> Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>

And again, review of this patch was significantly delayed because you
didn't provide any description of the changes you made between v1 and
v2 or why.

Overall looks good.  Just a few questions...

> +static int hvmop_map_mem_type_to_ioreq_server(
> +    XEN_GUEST_HANDLE_PARAM(xen_hvm_map_mem_type_to_ioreq_server_t) uop)
> +{
> +    xen_hvm_map_mem_type_to_ioreq_server_t op;
> +    struct domain *d;
> +    int rc;
> +
> +    if ( copy_from_guest(&op, uop, 1) )
> +        return -EFAULT;
> +
> +    rc = rcu_lock_remote_domain_by_id(op.domid, &d);
> +    if ( rc != 0 )
> +        return rc;
> +
> +    rc = -EINVAL;
> +    if ( !is_hvm_domain(d) )
> +        goto out;
> +
> +    /* For now, only support for HAP enabled hvm */
> +    if ( !hap_enabled(d) )
> +        goto out;

So before I suggested that this be restricted to HAP because you were
using p2m_memory_type_changed(), which was only implemented on EPT.
But since then you've switched that code to use
p2m_change_entry_type_global() instead, which is implemented by both;
and you implement the type in p2m_type_to_flags().  Is there any
reason to keep this restriction?

> +    /*
> +     * Each time we map/unmap an ioreq server to/from p2m_ioreq_server,
> +     * we mark the p2m table to be recalculated, so that gfns which were
> +     * previously marked with p2m_ioreq_server can be resynced.
> +     */
> +    p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);

This comment doesn't seem to be accurate (or if it is it's a bit
confusing).  Would it be more accurate to say something like the
following:

"Each time we map / unmap in ioreq server to/from p2m_ioreq_server, we
reset all memory currently marked p2m_ioreq_server to p2m_ram_rw."

But of course that raises another question: is there ever any risk
that an ioreq server will change some other ram type (say, p2m_ram_ro)
to p2m_ioreq_server, and then have it changed back to p2m_ram_rw when
it detaches?

>  /* Types that can be subject to bulk transitions. */
>  #define P2M_CHANGEABLE_TYPES (p2m_to_mask(p2m_ram_rw) \
> -                              | p2m_to_mask(p2m_ram_logdirty) )
> +                              | p2m_to_mask(p2m_ram_logdirty) \
> +                              | p2m_to_mask(p2m_ioreq_server) )

It's probably worth a comment here noting that you can do a mass
change *from* p2m_ioreq_server, but you can't do a mass change *to*
p2m_ioreq_server.  (And doing so would need to change a number of
places in the code where it's assumed that the result of such a
recalculation is either p2m_logdirty or p2m_ram_rw -- e.g.,
p2m-ept.c:553, p2m-pt.c:452, &c.

Really getting down to the fine details here -- thanks for all your
work on this.

 -George
Yu Zhang April 7, 2016, 7:01 a.m. UTC | #3
Thanks for your advices and good questions. :)

On 4/7/2016 1:13 AM, George Dunlap wrote:
> On Thu, Mar 31, 2016 at 11:53 AM, Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
>> A new HVMOP - HVMOP_map_mem_type_to_ioreq_server, is added to
>> let one ioreq server claim/disclaim its responsibility for the
>> handling of guest pages with p2m type p2m_ioreq_server. Users
>> of this HVMOP can specify whether the p2m_ioreq_server is supposed
>> to handle write accesses or read ones or both in a parameter named
>> flags. For now, we only support one ioreq server for this p2m type,
>> so once an ioreq server has claimed its ownership, subsequent calls
>> of the HVMOP_map_mem_type_to_ioreq_server will fail. Users can also
>> disclaim the ownership of guest ram pages with this p2m type, by
>> triggering this new HVMOP, with ioreq server id set to the current
>> owner's and flags parameter set to 0.
>>
>> For now, both HVMOP_map_mem_type_to_ioreq_server and p2m_ioreq_server
>> are only supported for HVMs with HAP enabled.
>>
>> Note that flags parameter(if not 0) of this HVMOP only indicates
>> which kind of memory accesses are to be forwarded to an ioreq server,
>> it has impact on the access rights of guest ram pages, but are not
>> the same. Due to hardware limitations, if only write operations are
>> to be forwarded, read ones will be performed at full speed, with
>> no hypervisor intervention. But if read ones are to be forwarded to
>> an ioreq server, writes will inevitably be trapped into hypervisor,
>> which means significant performance impact.
>>
>> Also note that this HVMOP_map_mem_type_to_ioreq_server will not
>> change the p2m type of any guest ram page, until HVMOP_set_mem_type
>> is triggered. So normally the steps should be the backend driver
>> first claims its ownership of guest ram pages with p2m_ioreq_server
>> type, and then sets the memory type to p2m_ioreq_server for specified
>> guest ram pages.
>>
>> Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
>> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
>
> And again, review of this patch was significantly delayed because you
> didn't provide any description of the changes you made between v1 and
> v2 or why.

Sorry about the inconvenience, will change in next version.

>
> Overall looks good.  Just a few questions...
>
>> +static int hvmop_map_mem_type_to_ioreq_server(
>> +    XEN_GUEST_HANDLE_PARAM(xen_hvm_map_mem_type_to_ioreq_server_t) uop)
>> +{
>> +    xen_hvm_map_mem_type_to_ioreq_server_t op;
>> +    struct domain *d;
>> +    int rc;
>> +
>> +    if ( copy_from_guest(&op, uop, 1) )
>> +        return -EFAULT;
>> +
>> +    rc = rcu_lock_remote_domain_by_id(op.domid, &d);
>> +    if ( rc != 0 )
>> +        return rc;
>> +
>> +    rc = -EINVAL;
>> +    if ( !is_hvm_domain(d) )
>> +        goto out;
>> +
>> +    /* For now, only support for HAP enabled hvm */
>> +    if ( !hap_enabled(d) )
>> +        goto out;
>
> So before I suggested that this be restricted to HAP because you were
> using p2m_memory_type_changed(), which was only implemented on EPT.
> But since then you've switched that code to use
> p2m_change_entry_type_global() instead, which is implemented by both;
> and you implement the type in p2m_type_to_flags().  Is there any
> reason to keep this restriction?
>

Yes. And this is a change which was not explained clearly. Sorry.

Reason I've chosen p2m_change_entry_type_global() instead:
p2m_memory_type_changed() will only trigger the resynchronization for
the ept memory types in resolve_misconfig(). Yet it is the p2m type we
wanna to be recalculated, so here comes p2m_change_entry_type_global().

Reasons I restricting the code in HAP mode:
Well, I guess p2m_change_entry_type_global() was only called by HAP code
like hap_[en|dis]able_log_dirty() etc, which were registered during
hap_domain_init(). As to shadow mode, it is sh_[en|dis]able_log_dirty(),
which do not use p2m_change_entry_type_global().

Since my intention is to resync the outdated p2m_ioreq_server pages
back to p2m_ram_rw, calling p2m_change_entry_global() directly should
be much more convenient(and correct) for me than inventing another
wrapper to cover both the HAP and shadow mode(which xengt does not use
by now).


>> +    /*
>> +     * Each time we map/unmap an ioreq server to/from p2m_ioreq_server,
>> +     * we mark the p2m table to be recalculated, so that gfns which were
>> +     * previously marked with p2m_ioreq_server can be resynced.
>> +     */
>> +    p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
>
> This comment doesn't seem to be accurate (or if it is it's a bit
> confusing).  Would it be more accurate to say something like the
> following:
>
> "Each time we map / unmap in ioreq server to/from p2m_ioreq_server, we
> reset all memory currently marked p2m_ioreq_server to p2m_ram_rw."
>
Well, I agree this comment is not quite accurate. Like you said in your
comment, the purpose here, calling p2m_change_entry_type_global() is to
"reset all memory currently marked p2m_ioreq_server to p2m_ram_rw". But
the recalculation is asynchronous. So how about:

"Each time we map/unmap an ioreq server to/from p2m_ioreq_server, we
mark the p2m table to be recalculated, so all memory currently marked
p2m_ioreq_server can be reset to p2m_ram_rw later."?

> But of course that raises another question: is there ever any risk
> that an ioreq server will change some other ram type (say, p2m_ram_ro)
> to p2m_ioreq_server, and then have it changed back to p2m_ram_rw when
> it detaches?
>
Good question. And unfortunately, yes there is. :)
Maybe we should insist only p2m_ram_rw pages can be changed to
p2m_ioreq_server, and vice versa.

>>   /* Types that can be subject to bulk transitions. */
>>   #define P2M_CHANGEABLE_TYPES (p2m_to_mask(p2m_ram_rw) \
>> -                              | p2m_to_mask(p2m_ram_logdirty) )
>> +                              | p2m_to_mask(p2m_ram_logdirty) \
>> +                              | p2m_to_mask(p2m_ioreq_server) )
>
> It's probably worth a comment here noting that you can do a mass
> change *from* p2m_ioreq_server, but you can't do a mass change *to*
> p2m_ioreq_server.  (And doing so would need to change a number of
> places in the code where it's assumed that the result of such a
> recalculation is either p2m_logdirty or p2m_ram_rw -- e.g.,
> p2m-ept.c:553, p2m-pt.c:452, &c.
>
I agree with adding a note here.
But adding extra code in resolve_miconfig()/do_recalc()? Is this
necessary? IIUC, current code already guarantees there will be no mass
change *to* the p2m_ioreq_server.


> Really getting down to the fine details here -- thanks for all your
> work on this.
>
>   -George
>

B.R.
Yu
George Dunlap April 8, 2016, 10:39 a.m. UTC | #4
Oops -- forgot to CC the list...

On Thu, Apr 7, 2016 at 10:55 AM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
> On Thu, Apr 7, 2016 at 8:01 AM, Yu, Zhang <yu.c.zhang@linux.intel.com> wrote:
>>>> +    /* For now, only support for HAP enabled hvm */
>>>> +    if ( !hap_enabled(d) )
>>>> +        goto out;
>>>
>>>
>>> So before I suggested that this be restricted to HAP because you were
>>> using p2m_memory_type_changed(), which was only implemented on EPT.
>>> But since then you've switched that code to use
>>> p2m_change_entry_type_global() instead, which is implemented by both;
>>> and you implement the type in p2m_type_to_flags().  Is there any
>>> reason to keep this restriction?
>>>
>>
>> Yes. And this is a change which was not explained clearly. Sorry.
>>
>> Reason I've chosen p2m_change_entry_type_global() instead:
>> p2m_memory_type_changed() will only trigger the resynchronization for
>> the ept memory types in resolve_misconfig(). Yet it is the p2m type we
>> wanna to be recalculated, so here comes p2m_change_entry_type_global().
>>
>> Reasons I restricting the code in HAP mode:
>> Well, I guess p2m_change_entry_type_global() was only called by HAP code
>> like hap_[en|dis]able_log_dirty() etc, which were registered during
>> hap_domain_init(). As to shadow mode, it is sh_[en|dis]able_log_dirty(),
>> which do not use p2m_change_entry_type_global().
>
> Oh, right -- yes, there's an ASSERT(hap_enabled()) right at the top of
> p2m_pt_change_entry_type_global().
>
> Yes, if that functionality is not already implemented for shadow,
> there's no need for you to implement it; and restricting it to
> HAP-only is the obvious solution.
>
>>>> +    /*
>>>> +     * Each time we map/unmap an ioreq server to/from p2m_ioreq_server,
>>>> +     * we mark the p2m table to be recalculated, so that gfns which were
>>>> +     * previously marked with p2m_ioreq_server can be resynced.
>>>> +     */
>>>> +    p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
>>>
>>>
>>> This comment doesn't seem to be accurate (or if it is it's a bit
>>> confusing).  Would it be more accurate to say something like the
>>> following:
>>>
>>> "Each time we map / unmap in ioreq server to/from p2m_ioreq_server, we
>>> reset all memory currently marked p2m_ioreq_server to p2m_ram_rw."
>>>
>> Well, I agree this comment is not quite accurate. Like you said in your
>> comment, the purpose here, calling p2m_change_entry_type_global() is to
>> "reset all memory currently marked p2m_ioreq_server to p2m_ram_rw". But
>> the recalculation is asynchronous. So how about:
>>
>> "Each time we map/unmap an ioreq server to/from p2m_ioreq_server, we
>> mark the p2m table to be recalculated, so all memory currently marked
>> p2m_ioreq_server can be reset to p2m_ram_rw later."?
>
> I think we're emphasizing different things -- I'm emphasizing what the
> change will be, and you're emphasizing when the change will happen.
> :-)
>
> I think from the point of view of someone reading this code, it
> doesn't matter when the physical p2m entries get updated; *logically*
> they are updated now, since from this call onward anything that reads
> the p2m table will get the new type.  The fact that we do it lazily is
> an implementation detail -- we could change the function behind the
> scenes to do it all at once, and the semantics would be the same (it
> would just cause a long change all at once).
>
> If you really want to include when the change will happen, what about this:
>
> "Each time we map / unmap in ioreq server to/from p2m_ioreq_server, we
> reset all memory currently marked p2m_ioreq_server to p2m_ram_rw.
> (This happens lazily as the p2m entries are accessed.)"
>
> BTW, I think this also means that for the interface at the moment, you
> can't change the kinds of accesses you want to intercept very easily
> -- if you want to change from intercepting only writes to intercepting
> both reads and writes, you have to detach from the ioreq_server type
> completely (which will make all your currently-marked ioreq_server
> pages go back to ram_rw), then attach again, and re-mark all the pages
> you were watching.
>
> Which is perhaps not perfect, but I suppose it will do.  It should
> even be possible to change this in the future -- ioreq servers that
> want to change the access mode can try just changing it directly, and
> if they get -EBUSY, do it the hard way.
>
> (Just to be clear, I'm thinking out loud here in the last two
> paragraphs -- no action required unless you feel like it.)
>
>>> But of course that raises another question: is there ever any risk
>>> that an ioreq server will change some other ram type (say, p2m_ram_ro)
>>> to p2m_ioreq_server, and then have it changed back to p2m_ram_rw when
>>> it detaches?
>>>
>> Good question. And unfortunately, yes there is. :)
>> Maybe we should insist only p2m_ram_rw pages can be changed to
>> p2m_ioreq_server, and vice versa.
>
> Well I think if you only allow p2m_ram_rw pages to be changed to
> p2m_ioreq_server, that should be sufficient.  If that works for you it
> seems like it could be a reasonable approach.
>
>>
>>>>   /* Types that can be subject to bulk transitions. */
>>>>   #define P2M_CHANGEABLE_TYPES (p2m_to_mask(p2m_ram_rw) \
>>>> -                              | p2m_to_mask(p2m_ram_logdirty) )
>>>> +                              | p2m_to_mask(p2m_ram_logdirty) \
>>>> +                              | p2m_to_mask(p2m_ioreq_server) )
>>>
>>>
>>> It's probably worth a comment here noting that you can do a mass
>>> change *from* p2m_ioreq_server, but you can't do a mass change *to*
>>> p2m_ioreq_server.  (And doing so would need to change a number of
>>> places in the code where it's assumed that the result of such a
>>> recalculation is either p2m_logdirty or p2m_ram_rw -- e.g.,
>>> p2m-ept.c:553, p2m-pt.c:452, &c.
>>>
>> I agree with adding a note here.
>> But adding extra code in resolve_miconfig()/do_recalc()? Is this
>> necessary? IIUC, current code already guarantees there will be no mass
>> change *to* the p2m_ioreq_server.
>
> Sorry -- in this context, when I said "And doing so would need...", I
> meant, "And also note in your comment that if someone wanted to do a
> mass change to p2m_ioreq_server, they would need to change..."  That
> is, was suggesting you write a comment giving a hint to people in the
> future about the limitations, not add any code yourself. :-)
>
> But maybe it would actually be better to add an ASSERT(nt !=
> p2m_ioreq_server) to p2m_change_entry_type_global(), with a comment
> saying that much of the lazy recalc code assumes being changed into
> only ram_logdirty or ram_rw?
>
> And then of course there's the p2m_ioreq_server -> p2m_ram_logdirty
> transition -- I assume that live migration is incompatible with this
> functionality?  Is there anything that prevents a live migration from
> being started when there are outstanding p2m_ioreq_server entries?
>
>  -George
George Dunlap April 8, 2016, 11:01 a.m. UTC | #5
On 08/04/16 11:10, Yu, Zhang wrote:
[snip]
> BTW, I noticed your reply has not be CCed to mailing list, and I also
> wonder if we should raise this last question in community?

Oops -- that was a mistake on my part.  :-)  I appreciate the
discretion; just so you know in the future, if I'm purposely changing
the CC list (removing xen-devel and/or adding extra people), I'll almost
always say so at the top of the mail.

>> And then of course there's the p2m_ioreq_server -> p2m_ram_logdirty
>> transition -- I assume that live migration is incompatible with this
>> functionality?  Is there anything that prevents a live migration from
>> being started when there are outstanding p2m_ioreq_server entries?
>>
> 
> Another good question, and the answer is unfortunately yes. :-)
> 
> If live migration happens during the normal emulation process, entries
> marked with p2m_ioreq_server will be changed to p2m_log_dirty in
> resolve_misconfig(), and later write operations will change them to
> p2m_ram_rw, thereafter these pages can not be forwarded to device model.
> From this point of view, this functionality is incompatible with live
> migration.
> 
> But for XenGT, I think this is acceptable, because, if live migration
> is to be supported in the future, intervention from backend device
> model will be necessary. At that time, we can guarantee from the device
> model side that there's no outdated p2m_ioreq_server entries, hence no
> need to reset the p2m type back to p2m_ram_rw(and do not include
> p2m_ioreq_server in the P2M_CHANGEABLE_TYPES). By "outdated", I mean
> when an ioreq server is detached from p2m_ioreq_server, or before an
> ioreq server is attached to this type, entries marked with
> p2m_ioreq_server should be regarded as outdated.
> 
> Is this acceptible to you? Any suggestions?

So the question is, as of this series, what happens if someone tries to
initiate a live migration while there are outstanding p2m_ioreq_server
entries?

If the answer is "the ioreq server suddenly loses all control of the
memory", that's something that needs to be changed.

If the answer is, "everything just works", that's perfect.

If the answer is, "Before logdirty mode is set, the ioreq server has the
opportunity to detach, removing the p2m_ioreq_server entries, and
operating without that functionality", that's good too.

If the answer is, "the live migration request fails and the guest
continues to run", that's also acceptable.  If you want this series to
be checked in today (the last day for 4.7), this is probably your best bet.

 -George
Andrew Cooper April 8, 2016, 1:33 p.m. UTC | #6
On 31/03/16 11:53, Yu Zhang wrote:
> A new HVMOP - HVMOP_map_mem_type_to_ioreq_server, is added to
> let one ioreq server claim/disclaim its responsibility for the
> handling of guest pages with p2m type p2m_ioreq_server. Users
> of this HVMOP can specify whether the p2m_ioreq_server is supposed
> to handle write accesses or read ones or both in a parameter named
> flags. For now, we only support one ioreq server for this p2m type,
> so once an ioreq server has claimed its ownership, subsequent calls
> of the HVMOP_map_mem_type_to_ioreq_server will fail. Users can also
> disclaim the ownership of guest ram pages with this p2m type, by
> triggering this new HVMOP, with ioreq server id set to the current
> owner's and flags parameter set to 0.
>
> For now, both HVMOP_map_mem_type_to_ioreq_server and p2m_ioreq_server
> are only supported for HVMs with HAP enabled.
>
> Note that flags parameter(if not 0) of this HVMOP only indicates
> which kind of memory accesses are to be forwarded to an ioreq server,
> it has impact on the access rights of guest ram pages, but are not
> the same. Due to hardware limitations, if only write operations are
> to be forwarded, read ones will be performed at full speed, with
> no hypervisor intervention. But if read ones are to be forwarded to
> an ioreq server, writes will inevitably be trapped into hypervisor,
> which means significant performance impact.
>
> Also note that this HVMOP_map_mem_type_to_ioreq_server will not
> change the p2m type of any guest ram page, until HVMOP_set_mem_type
> is triggered. So normally the steps should be the backend driver
> first claims its ownership of guest ram pages with p2m_ioreq_server
> type, and then sets the memory type to p2m_ioreq_server for specified
> guest ram pages.
>
> Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Cc: Keir Fraser <keir@xen.org>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Jun Nakajima <jun.nakajima@intel.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> Cc: Tim Deegan <tim@xen.org>
> ---
>  xen/arch/x86/hvm/emulate.c       | 125 +++++++++++++++++++++++++++++++++++++--
>  xen/arch/x86/hvm/hvm.c           |  95 +++++++++++++++++++++++++++--
>  xen/arch/x86/mm/hap/nested_hap.c |   2 +-
>  xen/arch/x86/mm/p2m-ept.c        |  14 ++++-
>  xen/arch/x86/mm/p2m-pt.c         |  25 +++++---
>  xen/arch/x86/mm/p2m.c            |  82 +++++++++++++++++++++++++
>  xen/arch/x86/mm/shadow/multi.c   |   3 +-
>  xen/include/asm-x86/p2m.h        |  36 +++++++++--
>  xen/include/public/hvm/hvm_op.h  |  37 ++++++++++++
>  9 files changed, 395 insertions(+), 24 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
> index ddc8007..77a4793 100644
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -94,11 +94,69 @@ static const struct hvm_io_handler null_handler = {
>      .ops = &null_ops
>  };
>  
> +static int mem_read(const struct hvm_io_handler *io_handler,
> +                    uint64_t addr,
> +                    uint32_t size,
> +                    uint64_t *data)
> +{
> +    struct domain *currd = current->domain;
> +    unsigned long gmfn = paddr_to_pfn(addr);
> +    unsigned long offset = addr & ~PAGE_MASK;
> +    struct page_info *page = get_page_from_gfn(currd, gmfn, NULL, P2M_UNSHARE);
> +    uint8_t *p;
> +
> +    if ( !page )
> +        return X86EMUL_UNHANDLEABLE;
> +
> +    p = __map_domain_page(page);
> +    p += offset;
> +    memcpy(data, p, size);

What happens when offset + size crosses the page boundary?


> diff --git a/xen/include/public/hvm/hvm_op.h b/xen/include/public/hvm/hvm_op.h
> index a1eae52..d46f186 100644
> --- a/xen/include/public/hvm/hvm_op.h
> +++ b/xen/include/public/hvm/hvm_op.h
> @@ -489,6 +489,43 @@ struct xen_hvm_altp2m_op {
>  typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t;
>  DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t);
>  
> +#if defined(__XEN__) || defined(__XEN_TOOLS__)
> +
> +/*
> + * HVMOP_map_mem_type_to_ioreq_server : map or unmap the IOREQ Server <id>
> + *                                      to specific memroy type <type>
> + *                                      for specific accesses <flags>
> + *
> + * Note that if only write operations are to be forwarded to an ioreq server,
> + * read operations will be performed with no hypervisor intervention. But if
> + * flags indicates that read operations are to be forwarded to an ioreq server,
> + * write operations will inevitably be trapped into hypervisor, whether they
> + * are emulated by hypervisor or forwarded to ioreq server depends on the flags
> + * setting. This situation means significant performance impact.
> + */
> +#define HVMOP_map_mem_type_to_ioreq_server 26
> +struct xen_hvm_map_mem_type_to_ioreq_server {
> +    domid_t domid;      /* IN - domain to be serviced */
> +    ioservid_t id;      /* IN - ioreq server id */
> +    hvmmem_type_t type; /* IN - memory type */

hvmmem_type_t is an enum and doesn't have a fixed width.  It can't be
used in the public API.

You also have some implicit padding holes as a result of the layout.

~Andrew

> +    uint32_t flags;     /* IN - types of accesses to be forwarded to the
> +                           ioreq server. flags with 0 means to unmap the
> +                           ioreq server */
> +#define _HVMOP_IOREQ_MEM_ACCESS_READ 0
> +#define HVMOP_IOREQ_MEM_ACCESS_READ \
> +    (1u << _HVMOP_IOREQ_MEM_ACCESS_READ)
> +
> +#define _HVMOP_IOREQ_MEM_ACCESS_WRITE 1
> +#define HVMOP_IOREQ_MEM_ACCESS_WRITE \
> +    (1u << _HVMOP_IOREQ_MEM_ACCESS_WRITE)
> +};
> +typedef struct xen_hvm_map_mem_type_to_ioreq_server
> +    xen_hvm_map_mem_type_to_ioreq_server_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_hvm_map_mem_type_to_ioreq_server_t);
> +
> +#endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
> +
> +
>  #endif /* __XEN_PUBLIC_HVM_HVM_OP_H__ */
>  
>  /*
Jan Beulich April 8, 2016, 10:28 p.m. UTC | #7
>>> On 31.03.16 at 12:53, <yu.c.zhang@linux.intel.com> wrote:
> +static int mem_write(const struct hvm_io_handler *handler,
> +                     uint64_t addr,
> +                     uint32_t size,
> +                     uint64_t data)
> +{
> +    struct domain *currd = current->domain;
> +    unsigned long gmfn = paddr_to_pfn(addr);
> +    unsigned long offset = addr & ~PAGE_MASK;
> +    struct page_info *page = get_page_from_gfn(currd, gmfn, NULL, P2M_UNSHARE);
> +    uint8_t *p;
> +
> +    if ( !page )
> +        return X86EMUL_UNHANDLEABLE;
> +
> +    p = __map_domain_page(page);
> +    p += offset;
> +    memcpy(p, &data, size);

What if the page is a r/o one? Not having found an ioreq server, I'm
not sure assumptions on the page being writable can validly be made.

> @@ -168,13 +226,72 @@ static int hvmemul_do_io(
>          break;
>      case X86EMUL_UNHANDLEABLE:
>      {
> -        struct hvm_ioreq_server *s =
> -            hvm_select_ioreq_server(curr->domain, &p);
> +        struct hvm_ioreq_server *s;
> +        p2m_type_t p2mt;
> +
> +        if ( is_mmio )
> +        {
> +            unsigned long gmfn = paddr_to_pfn(addr);
> +
> +            (void) get_gfn_query_unlocked(currd, gmfn, &p2mt);
> +
> +            switch ( p2mt )
> +            {
> +                case p2m_ioreq_server:
> +                {
> +                    unsigned long flags;
> +
> +                    p2m_get_ioreq_server(currd, &flags, &s);

As the function apparently returns no value right now, please avoid
the indirection on both values you're after - one of the two
(presumably s) can be the function's return value.

> +                    if ( !s )
> +                        break;
> +
> +                    if ( (dir == IOREQ_READ &&
> +                          !(flags & P2M_IOREQ_HANDLE_READ_ACCESS)) ||
> +                         (dir == IOREQ_WRITE &&
> +                          !(flags & P2M_IOREQ_HANDLE_WRITE_ACCESS)) )

I think this would be easier to read using a conditional expression
with the condition being dir == IOREQ_<one-of-the-two>, just
selecting either of the two possible bit masks.

> +                        s = NULL;
> +
> +                    break;
> +                }
> +                case p2m_ram_rw:

Blank line above here please.

>          /* If there is no suitable backing DM, just ignore accesses */
>          if ( !s )
>          {
> -            rc = hvm_process_io_intercept(&null_handler, &p);
> +            switch ( p2mt )
> +            {
> +            case p2m_ioreq_server:
> +            /*
> +             * Race conditions may exist when access to a gfn with
> +             * p2m_ioreq_server is intercepted by hypervisor, during
> +             * which time p2m type of this gfn is recalculated back
> +             * to p2m_ram_rw. mem_handler is used to handle this
> +             * corner case.
> +             */

Now if there is such a race condition, the race could also be with a
page changing first to ram_rw and then immediately further to e.g.
ram_ro. See the earlier comment about assuming the page to be
writable.

> +            case p2m_ram_rw:
> +                rc = hvm_process_io_intercept(&mem_handler, &p);
> +                break;
> +
> +            default:
> +                rc = hvm_process_io_intercept(&null_handler, &p);

Along with the above, I doubt it is correct to have e.g. ram_ro come
here.

> +static int hvm_map_mem_type_to_ioreq_server(struct domain *d,
> +                                            ioservid_t id,
> +                                            hvmmem_type_t type,
> +                                            uint32_t flags)
> +{
> +    struct hvm_ioreq_server *s;
> +    int rc;
> +
> +    /* For now, only HVMMEM_ioreq_server is supported */
> +    if ( type != HVMMEM_ioreq_server )
> +        return -EINVAL;
> +
> +    if ( flags & ~(HVMOP_IOREQ_MEM_ACCESS_READ |
> +                   HVMOP_IOREQ_MEM_ACCESS_WRITE) )
> +        return -EINVAL;
> +
> +    spin_lock(&d->arch.hvm_domain.ioreq_server.lock);
> +
> +    rc = -ENOENT;
> +    list_for_each_entry ( s,
> +                          &d->arch.hvm_domain.ioreq_server.list,
> +                          list_entry )
> +    {
> +        if ( s == d->arch.hvm_domain.default_ioreq_server )
> +            continue;
> +
> +        if ( s->id == id )
> +        {
> +            rc = p2m_set_ioreq_server(d, flags, s);
> +            if ( rc == 0 )
> +                gdprintk(XENLOG_DEBUG, "%u %s type HVMMEM_ioreq_server.\n",
> +                         s->id, (flags != 0) ? "mapped to" : "unmapped from");

Why gdprintk()? I don't think the current domain is of much
interest here. What would be of interest is the subject domain.

> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -132,6 +132,19 @@ static void ept_p2m_type_to_flags(struct p2m_domain 
> *p2m, ept_entry_t *entry,
>              entry->r = entry->w = entry->x = 1;
>              entry->a = entry->d = !!cpu_has_vmx_ept_ad;
>              break;
> +        case p2m_ioreq_server:
> +            entry->r = !(p2m->ioreq.flags & P2M_IOREQ_HANDLE_READ_ACCESS);
> +	    /*
> +	     * write access right is disabled when entry->r is 0, but whether
> +	     * write accesses are emulated by hypervisor or forwarded to an
> +	     * ioreq server depends on the setting of p2m->ioreq.flags.
> +	     */
> +            entry->w = (entry->r &&
> +                        !(p2m->ioreq.flags & P2M_IOREQ_HANDLE_WRITE_ACCESS));
> +            entry->x = entry->r;

Why would we want to allow instruction execution from such pages?
And with all three bits now possibly being clear, aren't we risking the
entries to be mis-treated as not-present ones?

> --- a/xen/arch/x86/mm/p2m-pt.c
> +++ b/xen/arch/x86/mm/p2m-pt.c
> @@ -72,8 +72,8 @@ static const unsigned long pgt[] = {
>      PGT_l3_page_table
>  };
>  
> -static unsigned long p2m_type_to_flags(p2m_type_t t, mfn_t mfn,
> -                                       unsigned int level)
> +static unsigned long p2m_type_to_flags(struct p2m_domain *p2m, p2m_type_t t,

const

> +int p2m_set_ioreq_server(struct domain *d,
> +                         unsigned long flags,
> +                         struct hvm_ioreq_server *s)
> +{
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +    int rc;
> +
> +    spin_lock(&p2m->ioreq.lock);
> +
> +    rc = -EBUSY;
> +    if ( (flags != 0) && (p2m->ioreq.server != NULL) )
> +        goto out;
> +
> +    rc = -EINVAL;
> +    /* unmap ioreq server from p2m type by passing flags with 0 */

Comment style (also elsewhere).

> +    if ( (flags == 0) && (p2m->ioreq.server != s) )
> +        goto out;

The two flags checks above are redundant with ...

> +    if ( flags == 0 )
> +    {
> +        p2m->ioreq.server = NULL;
> +        p2m->ioreq.flags = 0;
> +    }
> +    else
> +    {
> +        p2m->ioreq.server = s;
> +        p2m->ioreq.flags = flags;
> +    }

... this - I think the earlier ones should be folded into this.

> +    /*
> +     * Each time we map/unmap an ioreq server to/from p2m_ioreq_server,
> +     * we mark the p2m table to be recalculated, so that gfns which were
> +     * previously marked with p2m_ioreq_server can be resynced.
> +     */
> +    p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);

What does "resynced" here mean? I.e. I can see why this is wanted
when unmapping a server, but when mapping a server there shouldn't
be any such pages in the first place.

> +    rc = 0;
> +
> +out:

Labels indented by at least one space please.

> @@ -320,6 +321,27 @@ struct p2m_domain {
>          struct ept_data ept;
>          /* NPT-equivalent structure could be added here. */
>      };
> +
> +    struct {
> +        spinlock_t lock;
> +        /*
> +         * ioreq server who's responsible for the emulation of
> +         * gfns with specific p2m type(for now, p2m_ioreq_server).
> +         * Behaviors of gfns with p2m_ioreq_server set but no
> +         * ioreq server mapped in advance should be the same as
> +         * p2m_ram_rw.
> +         */
> +        struct hvm_ioreq_server *server;
> +        /*
> +         * flags specifies whether read, write or both operations
> +         * are to be emulated by an ioreq server.
> +         */
> +        unsigned long flags;

unsigned int

> --- a/xen/include/public/hvm/hvm_op.h
> +++ b/xen/include/public/hvm/hvm_op.h
> @@ -489,6 +489,43 @@ struct xen_hvm_altp2m_op {
>  typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t;
>  DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t);
>  
> +#if defined(__XEN__) || defined(__XEN_TOOLS__)

Instead of adding yet another such section, couldn't this be added
to an already existing one?

> +struct xen_hvm_map_mem_type_to_ioreq_server {
> +    domid_t domid;      /* IN - domain to be serviced */
> +    ioservid_t id;      /* IN - ioreq server id */
> +    hvmmem_type_t type; /* IN - memory type */

You can't use this type for public interface structure fields - this
must be uintXX_t.

Jan
Yu Zhang April 11, 2016, 11:14 a.m. UTC | #8
On 4/8/2016 9:33 PM, Andrew Cooper wrote:
> On 31/03/16 11:53, Yu Zhang wrote:
>> A new HVMOP - HVMOP_map_mem_type_to_ioreq_server, is added to
>> let one ioreq server claim/disclaim its responsibility for the
>> handling of guest pages with p2m type p2m_ioreq_server. Users
>> of this HVMOP can specify whether the p2m_ioreq_server is supposed
>> to handle write accesses or read ones or both in a parameter named
>> flags. For now, we only support one ioreq server for this p2m type,
>> so once an ioreq server has claimed its ownership, subsequent calls
>> of the HVMOP_map_mem_type_to_ioreq_server will fail. Users can also
>> disclaim the ownership of guest ram pages with this p2m type, by
>> triggering this new HVMOP, with ioreq server id set to the current
>> owner's and flags parameter set to 0.
>>
>> For now, both HVMOP_map_mem_type_to_ioreq_server and p2m_ioreq_server
>> are only supported for HVMs with HAP enabled.
>>
>> Note that flags parameter(if not 0) of this HVMOP only indicates
>> which kind of memory accesses are to be forwarded to an ioreq server,
>> it has impact on the access rights of guest ram pages, but are not
>> the same. Due to hardware limitations, if only write operations are
>> to be forwarded, read ones will be performed at full speed, with
>> no hypervisor intervention. But if read ones are to be forwarded to
>> an ioreq server, writes will inevitably be trapped into hypervisor,
>> which means significant performance impact.
>>
>> Also note that this HVMOP_map_mem_type_to_ioreq_server will not
>> change the p2m type of any guest ram page, until HVMOP_set_mem_type
>> is triggered. So normally the steps should be the backend driver
>> first claims its ownership of guest ram pages with p2m_ioreq_server
>> type, and then sets the memory type to p2m_ioreq_server for specified
>> guest ram pages.
>>
>> Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
>> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
>> Cc: Keir Fraser <keir@xen.org>
>> Cc: Jan Beulich <jbeulich@suse.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: George Dunlap <george.dunlap@eu.citrix.com>
>> Cc: Jun Nakajima <jun.nakajima@intel.com>
>> Cc: Kevin Tian <kevin.tian@intel.com>
>> Cc: Tim Deegan <tim@xen.org>
>> ---
>>   xen/arch/x86/hvm/emulate.c       | 125 +++++++++++++++++++++++++++++++++++++--
>>   xen/arch/x86/hvm/hvm.c           |  95 +++++++++++++++++++++++++++--
>>   xen/arch/x86/mm/hap/nested_hap.c |   2 +-
>>   xen/arch/x86/mm/p2m-ept.c        |  14 ++++-
>>   xen/arch/x86/mm/p2m-pt.c         |  25 +++++---
>>   xen/arch/x86/mm/p2m.c            |  82 +++++++++++++++++++++++++
>>   xen/arch/x86/mm/shadow/multi.c   |   3 +-
>>   xen/include/asm-x86/p2m.h        |  36 +++++++++--
>>   xen/include/public/hvm/hvm_op.h  |  37 ++++++++++++
>>   9 files changed, 395 insertions(+), 24 deletions(-)
>>
>> diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
>> index ddc8007..77a4793 100644
>> --- a/xen/arch/x86/hvm/emulate.c
>> +++ b/xen/arch/x86/hvm/emulate.c
>> @@ -94,11 +94,69 @@ static const struct hvm_io_handler null_handler = {
>>       .ops = &null_ops
>>   };
>>
>> +static int mem_read(const struct hvm_io_handler *io_handler,
>> +                    uint64_t addr,
>> +                    uint32_t size,
>> +                    uint64_t *data)
>> +{
>> +    struct domain *currd = current->domain;
>> +    unsigned long gmfn = paddr_to_pfn(addr);
>> +    unsigned long offset = addr & ~PAGE_MASK;
>> +    struct page_info *page = get_page_from_gfn(currd, gmfn, NULL, P2M_UNSHARE);
>> +    uint8_t *p;
>> +
>> +    if ( !page )
>> +        return X86EMUL_UNHANDLEABLE;
>> +
>> +    p = __map_domain_page(page);
>> +    p += offset;
>> +    memcpy(data, p, size);
>
> What happens when offset + size crosses the page boundary?
>

The 'size' is set in hvmemul_linear_mmio_access(), to insure offset + 
size will not cross the page boundary.

>
>> diff --git a/xen/include/public/hvm/hvm_op.h b/xen/include/public/hvm/hvm_op.h
>> index a1eae52..d46f186 100644
>> --- a/xen/include/public/hvm/hvm_op.h
>> +++ b/xen/include/public/hvm/hvm_op.h
>> @@ -489,6 +489,43 @@ struct xen_hvm_altp2m_op {
>>   typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t;
>>   DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t);
>>
>> +#if defined(__XEN__) || defined(__XEN_TOOLS__)
>> +
>> +/*
>> + * HVMOP_map_mem_type_to_ioreq_server : map or unmap the IOREQ Server <id>
>> + *                                      to specific memroy type <type>
>> + *                                      for specific accesses <flags>
>> + *
>> + * Note that if only write operations are to be forwarded to an ioreq server,
>> + * read operations will be performed with no hypervisor intervention. But if
>> + * flags indicates that read operations are to be forwarded to an ioreq server,
>> + * write operations will inevitably be trapped into hypervisor, whether they
>> + * are emulated by hypervisor or forwarded to ioreq server depends on the flags
>> + * setting. This situation means significant performance impact.
>> + */
>> +#define HVMOP_map_mem_type_to_ioreq_server 26
>> +struct xen_hvm_map_mem_type_to_ioreq_server {
>> +    domid_t domid;      /* IN - domain to be serviced */
>> +    ioservid_t id;      /* IN - ioreq server id */
>> +    hvmmem_type_t type; /* IN - memory type */
>
> hvmmem_type_t is an enum and doesn't have a fixed width.  It can't be
> used in the public API.
>
> You also have some implicit padding holes as a result of the layout.
>
Oh, guess I should use uint16_t hvmmem_type(any maybe some pad if
necessary).
Thanks for pointing this out, Andrew. :)

> ~Andrew
>
>> +    uint32_t flags;     /* IN - types of accesses to be forwarded to the
>> +                           ioreq server. flags with 0 means to unmap the
>> +                           ioreq server */
>> +#define _HVMOP_IOREQ_MEM_ACCESS_READ 0
>> +#define HVMOP_IOREQ_MEM_ACCESS_READ \
>> +    (1u << _HVMOP_IOREQ_MEM_ACCESS_READ)
>> +
>> +#define _HVMOP_IOREQ_MEM_ACCESS_WRITE 1
>> +#define HVMOP_IOREQ_MEM_ACCESS_WRITE \
>> +    (1u << _HVMOP_IOREQ_MEM_ACCESS_WRITE)
>> +};
>> +typedef struct xen_hvm_map_mem_type_to_ioreq_server
>> +    xen_hvm_map_mem_type_to_ioreq_server_t;
>> +DEFINE_XEN_GUEST_HANDLE(xen_hvm_map_mem_type_to_ioreq_server_t);
>> +
>> +#endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
>> +
>> +
>>   #endif /* __XEN_PUBLIC_HVM_HVM_OP_H__ */
>>
>>   /*
>
>

B.R.
Yu
Yu Zhang April 11, 2016, 11:14 a.m. UTC | #9
On 4/9/2016 6:28 AM, Jan Beulich wrote:
>>>> On 31.03.16 at 12:53, <yu.c.zhang@linux.intel.com> wrote:
>> +static int mem_write(const struct hvm_io_handler *handler,
>> +                     uint64_t addr,
>> +                     uint32_t size,
>> +                     uint64_t data)
>> +{
>> +    struct domain *currd = current->domain;
>> +    unsigned long gmfn = paddr_to_pfn(addr);
>> +    unsigned long offset = addr & ~PAGE_MASK;
>> +    struct page_info *page = get_page_from_gfn(currd, gmfn, NULL, P2M_UNSHARE);
>> +    uint8_t *p;
>> +
>> +    if ( !page )
>> +        return X86EMUL_UNHANDLEABLE;
>> +
>> +    p = __map_domain_page(page);
>> +    p += offset;
>> +    memcpy(p, &data, size);
>
> What if the page is a r/o one? Not having found an ioreq server, I'm
> not sure assumptions on the page being writable can validly be made.
>
>> @@ -168,13 +226,72 @@ static int hvmemul_do_io(
>>           break;
>>       case X86EMUL_UNHANDLEABLE:
>>       {
>> -        struct hvm_ioreq_server *s =
>> -            hvm_select_ioreq_server(curr->domain, &p);
>> +        struct hvm_ioreq_server *s;
>> +        p2m_type_t p2mt;
>> +
>> +        if ( is_mmio )
>> +        {
>> +            unsigned long gmfn = paddr_to_pfn(addr);
>> +
>> +            (void) get_gfn_query_unlocked(currd, gmfn, &p2mt);
>> +
>> +            switch ( p2mt )
>> +            {
>> +                case p2m_ioreq_server:
>> +                {
>> +                    unsigned long flags;
>> +
>> +                    p2m_get_ioreq_server(currd, &flags, &s);
>
> As the function apparently returns no value right now, please avoid
> the indirection on both values you're after - one of the two
> (presumably s) can be the function's return value.
>

Well, current implementation of p2m_get_ioreq_server() has spin_lock/
spin_unlock surrounding the reading of flags and the s, but I believe
we can also use the s as return value.

>> +                    if ( !s )
>> +                        break;
>> +
>> +                    if ( (dir == IOREQ_READ &&
>> +                          !(flags & P2M_IOREQ_HANDLE_READ_ACCESS)) ||
>> +                         (dir == IOREQ_WRITE &&
>> +                          !(flags & P2M_IOREQ_HANDLE_WRITE_ACCESS)) )
>
> I think this would be easier to read using a conditional expression
> with the condition being dir == IOREQ_<one-of-the-two>, just
> selecting either of the two possible bit masks.
>
>> +                        s = NULL;
>> +
>> +                    break;
>> +                }
>> +                case p2m_ram_rw:
>
> Blank line above here please.
>
>>           /* If there is no suitable backing DM, just ignore accesses */
>>           if ( !s )
>>           {
>> -            rc = hvm_process_io_intercept(&null_handler, &p);
>> +            switch ( p2mt )
>> +            {
>> +            case p2m_ioreq_server:
>> +            /*
>> +             * Race conditions may exist when access to a gfn with
>> +             * p2m_ioreq_server is intercepted by hypervisor, during
>> +             * which time p2m type of this gfn is recalculated back
>> +             * to p2m_ram_rw. mem_handler is used to handle this
>> +             * corner case.
>> +             */
>
> Now if there is such a race condition, the race could also be with a
> page changing first to ram_rw and then immediately further to e.g.
> ram_ro. See the earlier comment about assuming the page to be
> writable.
>

Thanks, Jan. After rechecking the code, I suppose the race condition
will not happen. In hvmemul_do_io(), get_gfn_query_unlocked() is used
to peek the p2mt for the gfn, but get_gfn_type_access() is called inside
hvm_hap_nested_page_fault(), and this will guarantee no p2m change shall
occur during the emulation.
Is this understanding correct?


>> +            case p2m_ram_rw:
>> +                rc = hvm_process_io_intercept(&mem_handler, &p);
>> +                break;
>> +
>> +            default:
>> +                rc = hvm_process_io_intercept(&null_handler, &p);
>
> Along with the above, I doubt it is correct to have e.g. ram_ro come
> here.
>

So if no race condition happens, no need to treat p2m_ram_rw specially.

>> +static int hvm_map_mem_type_to_ioreq_server(struct domain *d,
>> +                                            ioservid_t id,
>> +                                            hvmmem_type_t type,
>> +                                            uint32_t flags)
>> +{
>> +    struct hvm_ioreq_server *s;
>> +    int rc;
>> +
>> +    /* For now, only HVMMEM_ioreq_server is supported */
>> +    if ( type != HVMMEM_ioreq_server )
>> +        return -EINVAL;
>> +
>> +    if ( flags & ~(HVMOP_IOREQ_MEM_ACCESS_READ |
>> +                   HVMOP_IOREQ_MEM_ACCESS_WRITE) )
>> +        return -EINVAL;
>> +
>> +    spin_lock(&d->arch.hvm_domain.ioreq_server.lock);
>> +
>> +    rc = -ENOENT;
>> +    list_for_each_entry ( s,
>> +                          &d->arch.hvm_domain.ioreq_server.list,
>> +                          list_entry )
>> +    {
>> +        if ( s == d->arch.hvm_domain.default_ioreq_server )
>> +            continue;
>> +
>> +        if ( s->id == id )
>> +        {
>> +            rc = p2m_set_ioreq_server(d, flags, s);
>> +            if ( rc == 0 )
>> +                gdprintk(XENLOG_DEBUG, "%u %s type HVMMEM_ioreq_server.\n",
>> +                         s->id, (flags != 0) ? "mapped to" : "unmapped from");
>
> Why gdprintk()? I don't think the current domain is of much
> interest here. What would be of interest is the subject domain.
>

s->id is not the domain_id, but id of the ioreq server.

>> --- a/xen/arch/x86/mm/p2m-ept.c
>> +++ b/xen/arch/x86/mm/p2m-ept.c
>> @@ -132,6 +132,19 @@ static void ept_p2m_type_to_flags(struct p2m_domain
>> *p2m, ept_entry_t *entry,
>>               entry->r = entry->w = entry->x = 1;
>>               entry->a = entry->d = !!cpu_has_vmx_ept_ad;
>>               break;
>> +        case p2m_ioreq_server:
>> +            entry->r = !(p2m->ioreq.flags & P2M_IOREQ_HANDLE_READ_ACCESS);
>> +	    /*
>> +	     * write access right is disabled when entry->r is 0, but whether
>> +	     * write accesses are emulated by hypervisor or forwarded to an
>> +	     * ioreq server depends on the setting of p2m->ioreq.flags.
>> +	     */
>> +            entry->w = (entry->r &&
>> +                        !(p2m->ioreq.flags & P2M_IOREQ_HANDLE_WRITE_ACCESS));
>> +            entry->x = entry->r;
>
> Why would we want to allow instruction execution from such pages?
> And with all three bits now possibly being clear, aren't we risking the
> entries to be mis-treated as not-present ones?
>

Hah. You got me. Thanks! :)
Now I realized it would be difficult if we wanna to emulate the read
operations for HVM. According to Intel mannual, entry->r is to be
cleared, so should entry->w if we do not want ept misconfig. And
with both read and write permissions being forbidden, entry->x can be
set only on processors with EXECUTE_ONLY capability.
To avoid any entry to be mis-treated as not-present. We have several
solutions:
a> do not support the read emulation for now - we have no such usage
case;
b> add the check of p2m_t against p2m_ioreq_server in is_epte_present -
a bit weird to me.
Which one do you prefer? or any other suggestions?

>> --- a/xen/arch/x86/mm/p2m-pt.c
>> +++ b/xen/arch/x86/mm/p2m-pt.c
>> @@ -72,8 +72,8 @@ static const unsigned long pgt[] = {
>>       PGT_l3_page_table
>>   };
>>
>> -static unsigned long p2m_type_to_flags(p2m_type_t t, mfn_t mfn,
>> -                                       unsigned int level)
>> +static unsigned long p2m_type_to_flags(struct p2m_domain *p2m, p2m_type_t t,
>
> const
>
>> +int p2m_set_ioreq_server(struct domain *d,
>> +                         unsigned long flags,
>> +                         struct hvm_ioreq_server *s)
>> +{
>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +    int rc;
>> +
>> +    spin_lock(&p2m->ioreq.lock);
>> +
>> +    rc = -EBUSY;
>> +    if ( (flags != 0) && (p2m->ioreq.server != NULL) )
>> +        goto out;
>> +
>> +    rc = -EINVAL;
>> +    /* unmap ioreq server from p2m type by passing flags with 0 */
>
> Comment style (also elsewhere).
>
>> +    if ( (flags == 0) && (p2m->ioreq.server != s) )
>> +        goto out;
>
> The two flags checks above are redundant with ...
>
>> +    if ( flags == 0 )
>> +    {
>> +        p2m->ioreq.server = NULL;
>> +        p2m->ioreq.flags = 0;
>> +    }
>> +    else
>> +    {
>> +        p2m->ioreq.server = s;
>> +        p2m->ioreq.flags = flags;
>> +    }
>
> ... this - I think the earlier ones should be folded into this.
>

Ok. I'll have a try. :)

>> +    /*
>> +     * Each time we map/unmap an ioreq server to/from p2m_ioreq_server,
>> +     * we mark the p2m table to be recalculated, so that gfns which were
>> +     * previously marked with p2m_ioreq_server can be resynced.
>> +     */
>> +    p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
>
> What does "resynced" here mean? I.e. I can see why this is wanted
> when unmapping a server, but when mapping a server there shouldn't
> be any such pages in the first place.
>

There shouldn't be. But if there is(misbehavior from the device model
side), it can be recalculated back to p2m_ram_rw(which is not quite
necessary as the unmapping case).

>> +    rc = 0;
>> +
>> +out:
>
> Labels indented by at least one space please.
>
>> @@ -320,6 +321,27 @@ struct p2m_domain {
>>           struct ept_data ept;
>>           /* NPT-equivalent structure could be added here. */
>>       };
>> +
>> +    struct {
>> +        spinlock_t lock;
>> +        /*
>> +         * ioreq server who's responsible for the emulation of
>> +         * gfns with specific p2m type(for now, p2m_ioreq_server).
>> +         * Behaviors of gfns with p2m_ioreq_server set but no
>> +         * ioreq server mapped in advance should be the same as
>> +         * p2m_ram_rw.
>> +         */
>> +        struct hvm_ioreq_server *server;
>> +        /*
>> +         * flags specifies whether read, write or both operations
>> +         * are to be emulated by an ioreq server.
>> +         */
>> +        unsigned long flags;
>
> unsigned int
>
>> --- a/xen/include/public/hvm/hvm_op.h
>> +++ b/xen/include/public/hvm/hvm_op.h
>> @@ -489,6 +489,43 @@ struct xen_hvm_altp2m_op {
>>   typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t;
>>   DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t);
>>
>> +#if defined(__XEN__) || defined(__XEN_TOOLS__)
>
> Instead of adding yet another such section, couldn't this be added
> to an already existing one?
>

Yes.

>> +struct xen_hvm_map_mem_type_to_ioreq_server {
>> +    domid_t domid;      /* IN - domain to be serviced */
>> +    ioservid_t id;      /* IN - ioreq server id */
>> +    hvmmem_type_t type; /* IN - memory type */
>
> You can't use this type for public interface structure fields - this
> must be uintXX_t.
>

Got it.


Thanks
Yu
Yu Zhang April 11, 2016, 11:15 a.m. UTC | #10
On 4/8/2016 7:01 PM, George Dunlap wrote:
> On 08/04/16 11:10, Yu, Zhang wrote:
> [snip]
>> BTW, I noticed your reply has not be CCed to mailing list, and I also
>> wonder if we should raise this last question in community?
>
> Oops -- that was a mistake on my part.  :-)  I appreciate the
> discretion; just so you know in the future, if I'm purposely changing
> the CC list (removing xen-devel and/or adding extra people), I'll almost
> always say so at the top of the mail.
>
>>> And then of course there's the p2m_ioreq_server -> p2m_ram_logdirty
>>> transition -- I assume that live migration is incompatible with this
>>> functionality?  Is there anything that prevents a live migration from
>>> being started when there are outstanding p2m_ioreq_server entries?
>>>
>>
>> Another good question, and the answer is unfortunately yes. :-)
>>
>> If live migration happens during the normal emulation process, entries
>> marked with p2m_ioreq_server will be changed to p2m_log_dirty in
>> resolve_misconfig(), and later write operations will change them to
>> p2m_ram_rw, thereafter these pages can not be forwarded to device model.
>>  From this point of view, this functionality is incompatible with live
>> migration.
>>
>> But for XenGT, I think this is acceptable, because, if live migration
>> is to be supported in the future, intervention from backend device
>> model will be necessary. At that time, we can guarantee from the device
>> model side that there's no outdated p2m_ioreq_server entries, hence no
>> need to reset the p2m type back to p2m_ram_rw(and do not include
>> p2m_ioreq_server in the P2M_CHANGEABLE_TYPES). By "outdated", I mean
>> when an ioreq server is detached from p2m_ioreq_server, or before an
>> ioreq server is attached to this type, entries marked with
>> p2m_ioreq_server should be regarded as outdated.
>>
>> Is this acceptible to you? Any suggestions?
>
> So the question is, as of this series, what happens if someone tries to
> initiate a live migration while there are outstanding p2m_ioreq_server
> entries?
>
> If the answer is "the ioreq server suddenly loses all control of the
> memory", that's something that needs to be changed.
>

Sorry, for this patch series, I'm afraid the above description is the
answer.

Besides, I find it's hard to change current code to both support the
deferred resetting of p2m_ioreq_server and the live migration at the
same time. One reason is that a page with p2m_ioreq_server behaves
differently in different situations.

My assumption of XenGT is that, for live migration to work, the device
model should guarantee there's no outstanding p2m_ioreq_server pages
in hypervisor(no need to use the deferred recalculation), and it is our
device model who should be responsible for the copying of the write
protected guest pages later.

And another solution I can think of: when unmapping the ioreq server,
we walk the p2m table and reset entries with p2m_ioreq_server back
directly, instead of deferring the reset. And of course, this means
performance impact. But since the mapping and unmapping of an ioreq
server is not a frequent one, the performance penalty may be acceptable.
How do you think about this approach?

> If the answer is, "everything just works", that's perfect.
>
> If the answer is, "Before logdirty mode is set, the ioreq server has the
> opportunity to detach, removing the p2m_ioreq_server entries, and
> operating without that functionality", that's good too.
>
> If the answer is, "the live migration request fails and the guest
> continues to run", that's also acceptable.  If you want this series to
> be checked in today (the last day for 4.7), this is probably your best bet.
>
>   -George
>
>
>

B.R.
Yu
Andrew Cooper April 11, 2016, 12:20 p.m. UTC | #11
On 11/04/16 12:14, Yu, Zhang wrote:
>
>
> On 4/8/2016 9:33 PM, Andrew Cooper wrote:
>> On 31/03/16 11:53, Yu Zhang wrote:
>>> A new HVMOP - HVMOP_map_mem_type_to_ioreq_server, is added to
>>> let one ioreq server claim/disclaim its responsibility for the
>>> handling of guest pages with p2m type p2m_ioreq_server. Users
>>> of this HVMOP can specify whether the p2m_ioreq_server is supposed
>>> to handle write accesses or read ones or both in a parameter named
>>> flags. For now, we only support one ioreq server for this p2m type,
>>> so once an ioreq server has claimed its ownership, subsequent calls
>>> of the HVMOP_map_mem_type_to_ioreq_server will fail. Users can also
>>> disclaim the ownership of guest ram pages with this p2m type, by
>>> triggering this new HVMOP, with ioreq server id set to the current
>>> owner's and flags parameter set to 0.
>>>
>>> For now, both HVMOP_map_mem_type_to_ioreq_server and p2m_ioreq_server
>>> are only supported for HVMs with HAP enabled.
>>>
>>> Note that flags parameter(if not 0) of this HVMOP only indicates
>>> which kind of memory accesses are to be forwarded to an ioreq server,
>>> it has impact on the access rights of guest ram pages, but are not
>>> the same. Due to hardware limitations, if only write operations are
>>> to be forwarded, read ones will be performed at full speed, with
>>> no hypervisor intervention. But if read ones are to be forwarded to
>>> an ioreq server, writes will inevitably be trapped into hypervisor,
>>> which means significant performance impact.
>>>
>>> Also note that this HVMOP_map_mem_type_to_ioreq_server will not
>>> change the p2m type of any guest ram page, until HVMOP_set_mem_type
>>> is triggered. So normally the steps should be the backend driver
>>> first claims its ownership of guest ram pages with p2m_ioreq_server
>>> type, and then sets the memory type to p2m_ioreq_server for specified
>>> guest ram pages.
>>>
>>> Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
>>> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
>>> Cc: Keir Fraser <keir@xen.org>
>>> Cc: Jan Beulich <jbeulich@suse.com>
>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Cc: George Dunlap <george.dunlap@eu.citrix.com>
>>> Cc: Jun Nakajima <jun.nakajima@intel.com>
>>> Cc: Kevin Tian <kevin.tian@intel.com>
>>> Cc: Tim Deegan <tim@xen.org>
>>> ---
>>>   xen/arch/x86/hvm/emulate.c       | 125
>>> +++++++++++++++++++++++++++++++++++++--
>>>   xen/arch/x86/hvm/hvm.c           |  95 +++++++++++++++++++++++++++--
>>>   xen/arch/x86/mm/hap/nested_hap.c |   2 +-
>>>   xen/arch/x86/mm/p2m-ept.c        |  14 ++++-
>>>   xen/arch/x86/mm/p2m-pt.c         |  25 +++++---
>>>   xen/arch/x86/mm/p2m.c            |  82 +++++++++++++++++++++++++
>>>   xen/arch/x86/mm/shadow/multi.c   |   3 +-
>>>   xen/include/asm-x86/p2m.h        |  36 +++++++++--
>>>   xen/include/public/hvm/hvm_op.h  |  37 ++++++++++++
>>>   9 files changed, 395 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
>>> index ddc8007..77a4793 100644
>>> --- a/xen/arch/x86/hvm/emulate.c
>>> +++ b/xen/arch/x86/hvm/emulate.c
>>> @@ -94,11 +94,69 @@ static const struct hvm_io_handler null_handler = {
>>>       .ops = &null_ops
>>>   };
>>>
>>> +static int mem_read(const struct hvm_io_handler *io_handler,
>>> +                    uint64_t addr,
>>> +                    uint32_t size,
>>> +                    uint64_t *data)
>>> +{
>>> +    struct domain *currd = current->domain;
>>> +    unsigned long gmfn = paddr_to_pfn(addr);
>>> +    unsigned long offset = addr & ~PAGE_MASK;
>>> +    struct page_info *page = get_page_from_gfn(currd, gmfn, NULL,
>>> P2M_UNSHARE);
>>> +    uint8_t *p;
>>> +
>>> +    if ( !page )
>>> +        return X86EMUL_UNHANDLEABLE;
>>> +
>>> +    p = __map_domain_page(page);
>>> +    p += offset;
>>> +    memcpy(data, p, size);
>>
>> What happens when offset + size crosses the page boundary?
>>
>
> The 'size' is set in hvmemul_linear_mmio_access(), to insure offset +
> size will not cross the page boundary.

Ok, in which case please

ASSERT(offset + size < PAGE_SIZE)

So it is absolutely clear in this function that the caller is required
to honour this restriction.

~Andrew
Jan Beulich April 11, 2016, 4:25 p.m. UTC | #12
>>> On 11.04.16 at 14:20, <andrew.cooper3@citrix.com> wrote:
> On 11/04/16 12:14, Yu, Zhang wrote:
>>
>>
>> On 4/8/2016 9:33 PM, Andrew Cooper wrote:
>>> On 31/03/16 11:53, Yu Zhang wrote:
>>>> A new HVMOP - HVMOP_map_mem_type_to_ioreq_server, is added to
>>>> let one ioreq server claim/disclaim its responsibility for the
>>>> handling of guest pages with p2m type p2m_ioreq_server. Users
>>>> of this HVMOP can specify whether the p2m_ioreq_server is supposed
>>>> to handle write accesses or read ones or both in a parameter named
>>>> flags. For now, we only support one ioreq server for this p2m type,
>>>> so once an ioreq server has claimed its ownership, subsequent calls
>>>> of the HVMOP_map_mem_type_to_ioreq_server will fail. Users can also
>>>> disclaim the ownership of guest ram pages with this p2m type, by
>>>> triggering this new HVMOP, with ioreq server id set to the current
>>>> owner's and flags parameter set to 0.
>>>>
>>>> For now, both HVMOP_map_mem_type_to_ioreq_server and p2m_ioreq_server
>>>> are only supported for HVMs with HAP enabled.
>>>>
>>>> Note that flags parameter(if not 0) of this HVMOP only indicates
>>>> which kind of memory accesses are to be forwarded to an ioreq server,
>>>> it has impact on the access rights of guest ram pages, but are not
>>>> the same. Due to hardware limitations, if only write operations are
>>>> to be forwarded, read ones will be performed at full speed, with
>>>> no hypervisor intervention. But if read ones are to be forwarded to
>>>> an ioreq server, writes will inevitably be trapped into hypervisor,
>>>> which means significant performance impact.
>>>>
>>>> Also note that this HVMOP_map_mem_type_to_ioreq_server will not
>>>> change the p2m type of any guest ram page, until HVMOP_set_mem_type
>>>> is triggered. So normally the steps should be the backend driver
>>>> first claims its ownership of guest ram pages with p2m_ioreq_server
>>>> type, and then sets the memory type to p2m_ioreq_server for specified
>>>> guest ram pages.
>>>>
>>>> Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
>>>> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
>>>> Cc: Keir Fraser <keir@xen.org>
>>>> Cc: Jan Beulich <jbeulich@suse.com>
>>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> Cc: George Dunlap <george.dunlap@eu.citrix.com>
>>>> Cc: Jun Nakajima <jun.nakajima@intel.com>
>>>> Cc: Kevin Tian <kevin.tian@intel.com>
>>>> Cc: Tim Deegan <tim@xen.org>
>>>> ---
>>>>   xen/arch/x86/hvm/emulate.c       | 125
>>>> +++++++++++++++++++++++++++++++++++++--
>>>>   xen/arch/x86/hvm/hvm.c           |  95 +++++++++++++++++++++++++++--
>>>>   xen/arch/x86/mm/hap/nested_hap.c |   2 +-
>>>>   xen/arch/x86/mm/p2m-ept.c        |  14 ++++-
>>>>   xen/arch/x86/mm/p2m-pt.c         |  25 +++++---
>>>>   xen/arch/x86/mm/p2m.c            |  82 +++++++++++++++++++++++++
>>>>   xen/arch/x86/mm/shadow/multi.c   |   3 +-
>>>>   xen/include/asm-x86/p2m.h        |  36 +++++++++--
>>>>   xen/include/public/hvm/hvm_op.h  |  37 ++++++++++++
>>>>   9 files changed, 395 insertions(+), 24 deletions(-)
>>>>
>>>> diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
>>>> index ddc8007..77a4793 100644
>>>> --- a/xen/arch/x86/hvm/emulate.c
>>>> +++ b/xen/arch/x86/hvm/emulate.c
>>>> @@ -94,11 +94,69 @@ static const struct hvm_io_handler null_handler = {
>>>>       .ops = &null_ops
>>>>   };
>>>>
>>>> +static int mem_read(const struct hvm_io_handler *io_handler,
>>>> +                    uint64_t addr,
>>>> +                    uint32_t size,
>>>> +                    uint64_t *data)
>>>> +{
>>>> +    struct domain *currd = current->domain;
>>>> +    unsigned long gmfn = paddr_to_pfn(addr);
>>>> +    unsigned long offset = addr & ~PAGE_MASK;
>>>> +    struct page_info *page = get_page_from_gfn(currd, gmfn, NULL,
>>>> P2M_UNSHARE);
>>>> +    uint8_t *p;
>>>> +
>>>> +    if ( !page )
>>>> +        return X86EMUL_UNHANDLEABLE;
>>>> +
>>>> +    p = __map_domain_page(page);
>>>> +    p += offset;
>>>> +    memcpy(data, p, size);
>>>
>>> What happens when offset + size crosses the page boundary?
>>>
>>
>> The 'size' is set in hvmemul_linear_mmio_access(), to insure offset +
>> size will not cross the page boundary.
> 
> Ok, in which case please
> 
> ASSERT(offset + size < PAGE_SIZE)

<=

Jan
Jan Beulich April 11, 2016, 4:31 p.m. UTC | #13
>>> On 11.04.16 at 13:14, <yu.c.zhang@linux.intel.com> wrote:
> On 4/9/2016 6:28 AM, Jan Beulich wrote:
>>>>> On 31.03.16 at 12:53, <yu.c.zhang@linux.intel.com> wrote:
>>> @@ -168,13 +226,72 @@ static int hvmemul_do_io(
>>>           break;
>>>       case X86EMUL_UNHANDLEABLE:
>>>       {
>>> -        struct hvm_ioreq_server *s =
>>> -            hvm_select_ioreq_server(curr->domain, &p);
>>> +        struct hvm_ioreq_server *s;
>>> +        p2m_type_t p2mt;
>>> +
>>> +        if ( is_mmio )
>>> +        {
>>> +            unsigned long gmfn = paddr_to_pfn(addr);
>>> +
>>> +            (void) get_gfn_query_unlocked(currd, gmfn, &p2mt);
>>> +
>>> +            switch ( p2mt )
>>> +            {
>>> +                case p2m_ioreq_server:
>>> +                {
>>> +                    unsigned long flags;
>>> +
>>> +                    p2m_get_ioreq_server(currd, &flags, &s);
>>
>> As the function apparently returns no value right now, please avoid
>> the indirection on both values you're after - one of the two
>> (presumably s) can be the function's return value.
> 
> Well, current implementation of p2m_get_ioreq_server() has spin_lock/
> spin_unlock surrounding the reading of flags and the s, but I believe
> we can also use the s as return value.

The use of a lock inside the function has nothing to do with how it
returns values to the caller.

>>>           /* If there is no suitable backing DM, just ignore accesses */
>>>           if ( !s )
>>>           {
>>> -            rc = hvm_process_io_intercept(&null_handler, &p);
>>> +            switch ( p2mt )
>>> +            {
>>> +            case p2m_ioreq_server:
>>> +            /*
>>> +             * Race conditions may exist when access to a gfn with
>>> +             * p2m_ioreq_server is intercepted by hypervisor, during
>>> +             * which time p2m type of this gfn is recalculated back
>>> +             * to p2m_ram_rw. mem_handler is used to handle this
>>> +             * corner case.
>>> +             */
>>
>> Now if there is such a race condition, the race could also be with a
>> page changing first to ram_rw and then immediately further to e.g.
>> ram_ro. See the earlier comment about assuming the page to be
>> writable.
>>
> 
> Thanks, Jan. After rechecking the code, I suppose the race condition
> will not happen. In hvmemul_do_io(), get_gfn_query_unlocked() is used
> to peek the p2mt for the gfn, but get_gfn_type_access() is called inside
> hvm_hap_nested_page_fault(), and this will guarantee no p2m change shall
> occur during the emulation.
> Is this understanding correct?

Ah, yes, I think so. So the comment is misleading.

>>> +static int hvm_map_mem_type_to_ioreq_server(struct domain *d,
>>> +                                            ioservid_t id,
>>> +                                            hvmmem_type_t type,
>>> +                                            uint32_t flags)
>>> +{
>>> +    struct hvm_ioreq_server *s;
>>> +    int rc;
>>> +
>>> +    /* For now, only HVMMEM_ioreq_server is supported */
>>> +    if ( type != HVMMEM_ioreq_server )
>>> +        return -EINVAL;
>>> +
>>> +    if ( flags & ~(HVMOP_IOREQ_MEM_ACCESS_READ |
>>> +                   HVMOP_IOREQ_MEM_ACCESS_WRITE) )
>>> +        return -EINVAL;
>>> +
>>> +    spin_lock(&d->arch.hvm_domain.ioreq_server.lock);
>>> +
>>> +    rc = -ENOENT;
>>> +    list_for_each_entry ( s,
>>> +                          &d->arch.hvm_domain.ioreq_server.list,
>>> +                          list_entry )
>>> +    {
>>> +        if ( s == d->arch.hvm_domain.default_ioreq_server )
>>> +            continue;
>>> +
>>> +        if ( s->id == id )
>>> +        {
>>> +            rc = p2m_set_ioreq_server(d, flags, s);
>>> +            if ( rc == 0 )
>>> +                gdprintk(XENLOG_DEBUG, "%u %s type HVMMEM_ioreq_server.\n",
>>> +                         s->id, (flags != 0) ? "mapped to" : "unmapped from");
>>
>> Why gdprintk()? I don't think the current domain is of much
>> interest here. What would be of interest is the subject domain.
>>
> 
> s->id is not the domain_id, but id of the ioreq server.

That's understood. But gdprintk() itself logs the current domain,
which isn't as useful as the subject one.

>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>> @@ -132,6 +132,19 @@ static void ept_p2m_type_to_flags(struct p2m_domain
>>> *p2m, ept_entry_t *entry,
>>>               entry->r = entry->w = entry->x = 1;
>>>               entry->a = entry->d = !!cpu_has_vmx_ept_ad;
>>>               break;
>>> +        case p2m_ioreq_server:
>>> +            entry->r = !(p2m->ioreq.flags & P2M_IOREQ_HANDLE_READ_ACCESS);
>>> +	    /*
>>> +	     * write access right is disabled when entry->r is 0, but whether
>>> +	     * write accesses are emulated by hypervisor or forwarded to an
>>> +	     * ioreq server depends on the setting of p2m->ioreq.flags.
>>> +	     */
>>> +            entry->w = (entry->r &&
>>> +                        !(p2m->ioreq.flags & P2M_IOREQ_HANDLE_WRITE_ACCESS));
>>> +            entry->x = entry->r;
>>
>> Why would we want to allow instruction execution from such pages?
>> And with all three bits now possibly being clear, aren't we risking the
>> entries to be mis-treated as not-present ones?
>>
> 
> Hah. You got me. Thanks! :)
> Now I realized it would be difficult if we wanna to emulate the read
> operations for HVM. According to Intel mannual, entry->r is to be
> cleared, so should entry->w if we do not want ept misconfig. And
> with both read and write permissions being forbidden, entry->x can be
> set only on processors with EXECUTE_ONLY capability.
> To avoid any entry to be mis-treated as not-present. We have several
> solutions:
> a> do not support the read emulation for now - we have no such usage
> case;
> b> add the check of p2m_t against p2m_ioreq_server in is_epte_present -
> a bit weird to me.
> Which one do you prefer? or any other suggestions?

That question would also need to be asked to others who had
suggested supporting both. I'd be fine with a, but I also don't view
b as too awkward.

>>> +    /*
>>> +     * Each time we map/unmap an ioreq server to/from p2m_ioreq_server,
>>> +     * we mark the p2m table to be recalculated, so that gfns which were
>>> +     * previously marked with p2m_ioreq_server can be resynced.
>>> +     */
>>> +    p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
>>
>> What does "resynced" here mean? I.e. I can see why this is wanted
>> when unmapping a server, but when mapping a server there shouldn't
>> be any such pages in the first place.
>>
> 
> There shouldn't be. But if there is(misbehavior from the device model
> side), it can be recalculated back to p2m_ram_rw(which is not quite
> necessary as the unmapping case).

DM misbehavior should not result in such a problem - the hypervisor
should refuse any bad requests.

Jan
Yu Zhang April 12, 2016, 9:37 a.m. UTC | #14
On 4/12/2016 12:31 AM, Jan Beulich wrote:
>>>> On 11.04.16 at 13:14, <yu.c.zhang@linux.intel.com> wrote:
>> On 4/9/2016 6:28 AM, Jan Beulich wrote:
>>>>>> On 31.03.16 at 12:53, <yu.c.zhang@linux.intel.com> wrote:
>>>> @@ -168,13 +226,72 @@ static int hvmemul_do_io(
>>>>            break;
>>>>        case X86EMUL_UNHANDLEABLE:
>>>>        {
>>>> -        struct hvm_ioreq_server *s =
>>>> -            hvm_select_ioreq_server(curr->domain, &p);
>>>> +        struct hvm_ioreq_server *s;
>>>> +        p2m_type_t p2mt;
>>>> +
>>>> +        if ( is_mmio )
>>>> +        {
>>>> +            unsigned long gmfn = paddr_to_pfn(addr);
>>>> +
>>>> +            (void) get_gfn_query_unlocked(currd, gmfn, &p2mt);
>>>> +
>>>> +            switch ( p2mt )
>>>> +            {
>>>> +                case p2m_ioreq_server:
>>>> +                {
>>>> +                    unsigned long flags;
>>>> +
>>>> +                    p2m_get_ioreq_server(currd, &flags, &s);
>>>
>>> As the function apparently returns no value right now, please avoid
>>> the indirection on both values you're after - one of the two
>>> (presumably s) can be the function's return value.
>>
>> Well, current implementation of p2m_get_ioreq_server() has spin_lock/
>> spin_unlock surrounding the reading of flags and the s, but I believe
>> we can also use the s as return value.
>
> The use of a lock inside the function has nothing to do with how it
> returns values to the caller.
>

Agree. I'll use s as return value then.

>>>>            /* If there is no suitable backing DM, just ignore accesses */
>>>>            if ( !s )
>>>>            {
>>>> -            rc = hvm_process_io_intercept(&null_handler, &p);
>>>> +            switch ( p2mt )
>>>> +            {
>>>> +            case p2m_ioreq_server:
>>>> +            /*
>>>> +             * Race conditions may exist when access to a gfn with
>>>> +             * p2m_ioreq_server is intercepted by hypervisor, during
>>>> +             * which time p2m type of this gfn is recalculated back
>>>> +             * to p2m_ram_rw. mem_handler is used to handle this
>>>> +             * corner case.
>>>> +             */
>>>
>>> Now if there is such a race condition, the race could also be with a
>>> page changing first to ram_rw and then immediately further to e.g.
>>> ram_ro. See the earlier comment about assuming the page to be
>>> writable.
>>>
>>
>> Thanks, Jan. After rechecking the code, I suppose the race condition
>> will not happen. In hvmemul_do_io(), get_gfn_query_unlocked() is used
>> to peek the p2mt for the gfn, but get_gfn_type_access() is called inside
>> hvm_hap_nested_page_fault(), and this will guarantee no p2m change shall
>> occur during the emulation.
>> Is this understanding correct?
>
> Ah, yes, I think so. So the comment is misleading.
>

I'll remove the comment, together with the p2m_ram_rw case. Thanks. :)

>>>> +static int hvm_map_mem_type_to_ioreq_server(struct domain *d,
>>>> +                                            ioservid_t id,
>>>> +                                            hvmmem_type_t type,
>>>> +                                            uint32_t flags)
>>>> +{
>>>> +    struct hvm_ioreq_server *s;
>>>> +    int rc;
>>>> +
>>>> +    /* For now, only HVMMEM_ioreq_server is supported */
>>>> +    if ( type != HVMMEM_ioreq_server )
>>>> +        return -EINVAL;
>>>> +
>>>> +    if ( flags & ~(HVMOP_IOREQ_MEM_ACCESS_READ |
>>>> +                   HVMOP_IOREQ_MEM_ACCESS_WRITE) )
>>>> +        return -EINVAL;
>>>> +
>>>> +    spin_lock(&d->arch.hvm_domain.ioreq_server.lock);
>>>> +
>>>> +    rc = -ENOENT;
>>>> +    list_for_each_entry ( s,
>>>> +                          &d->arch.hvm_domain.ioreq_server.list,
>>>> +                          list_entry )
>>>> +    {
>>>> +        if ( s == d->arch.hvm_domain.default_ioreq_server )
>>>> +            continue;
>>>> +
>>>> +        if ( s->id == id )
>>>> +        {
>>>> +            rc = p2m_set_ioreq_server(d, flags, s);
>>>> +            if ( rc == 0 )
>>>> +                gdprintk(XENLOG_DEBUG, "%u %s type HVMMEM_ioreq_server.\n",
>>>> +                         s->id, (flags != 0) ? "mapped to" : "unmapped from");
>>>
>>> Why gdprintk()? I don't think the current domain is of much
>>> interest here. What would be of interest is the subject domain.
>>>
>>
>> s->id is not the domain_id, but id of the ioreq server.
>
> That's understood. But gdprintk() itself logs the current domain,
> which isn't as useful as the subject one.
>

Oh, I see. So the correct routine here should be dprintk(), right?

>>>> --- a/xen/arch/x86/mm/p2m-ept.c
>>>> +++ b/xen/arch/x86/mm/p2m-ept.c
>>>> @@ -132,6 +132,19 @@ static void ept_p2m_type_to_flags(struct p2m_domain
>>>> *p2m, ept_entry_t *entry,
>>>>                entry->r = entry->w = entry->x = 1;
>>>>                entry->a = entry->d = !!cpu_has_vmx_ept_ad;
>>>>                break;
>>>> +        case p2m_ioreq_server:
>>>> +            entry->r = !(p2m->ioreq.flags & P2M_IOREQ_HANDLE_READ_ACCESS);
>>>> +	    /*
>>>> +	     * write access right is disabled when entry->r is 0, but whether
>>>> +	     * write accesses are emulated by hypervisor or forwarded to an
>>>> +	     * ioreq server depends on the setting of p2m->ioreq.flags.
>>>> +	     */
>>>> +            entry->w = (entry->r &&
>>>> +                        !(p2m->ioreq.flags & P2M_IOREQ_HANDLE_WRITE_ACCESS));
>>>> +            entry->x = entry->r;
>>>
>>> Why would we want to allow instruction execution from such pages?
>>> And with all three bits now possibly being clear, aren't we risking the
>>> entries to be mis-treated as not-present ones?
>>>
>>
>> Hah. You got me. Thanks! :)
>> Now I realized it would be difficult if we wanna to emulate the read
>> operations for HVM. According to Intel mannual, entry->r is to be
>> cleared, so should entry->w if we do not want ept misconfig. And
>> with both read and write permissions being forbidden, entry->x can be
>> set only on processors with EXECUTE_ONLY capability.
>> To avoid any entry to be mis-treated as not-present. We have several
>> solutions:
>> a> do not support the read emulation for now - we have no such usage
>> case;
>> b> add the check of p2m_t against p2m_ioreq_server in is_epte_present -
>> a bit weird to me.
>> Which one do you prefer? or any other suggestions?
>
> That question would also need to be asked to others who had
> suggested supporting both. I'd be fine with a, but I also don't view
> b as too awkward.
>

According to Intel mannual, an entry is regarded as not present, if
bit0:2 is 0. So adding a p2m type check in is_epte_present() means we
will change its semantics, if this is acceptable(with no hurt to
hypervisor). I'd prefer option b>

Does anyone else have different suggestions other than b> ?


>>>> +    /*
>>>> +     * Each time we map/unmap an ioreq server to/from p2m_ioreq_server,
>>>> +     * we mark the p2m table to be recalculated, so that gfns which were
>>>> +     * previously marked with p2m_ioreq_server can be resynced.
>>>> +     */
>>>> +    p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
>>>
>>> What does "resynced" here mean? I.e. I can see why this is wanted
>>> when unmapping a server, but when mapping a server there shouldn't
>>> be any such pages in the first place.
>>>
>>
>> There shouldn't be. But if there is(misbehavior from the device model
>> side), it can be recalculated back to p2m_ram_rw(which is not quite
>> necessary as the unmapping case).
>
> DM misbehavior should not result in such a problem - the hypervisor
> should refuse any bad requests.
>
OK. I can add code to guarantee from hypervisor side no entries changed
to p2m_ioreq_server before the mapping happens.

B.R.
Yu
Jan Beulich April 12, 2016, 3:08 p.m. UTC | #15
>>> "Yu, Zhang" <yu.c.zhang@linux.intel.com> 04/12/16 11:47 AM >>>
>On 4/12/2016 12:31 AM, Jan Beulich wrote:
>>>>> On 11.04.16 at 13:14, <yu.c.zhang@linux.intel.com> wrote:
>>> On 4/9/2016 6:28 AM, Jan Beulich wrote:
>>>>>>> On 31.03.16 at 12:53, <yu.c.zhang@linux.intel.com> wrote:
>>>>> +        if ( s->id == id )
>>>>> +        {
>>>>> +            rc = p2m_set_ioreq_server(d, flags, s);
>>>>> +            if ( rc == 0 )
>>>>> +                gdprintk(XENLOG_DEBUG, "%u %s type HVMMEM_ioreq_server.\n",
>>>>> +                         s->id, (flags != 0) ? "mapped to" : "unmapped from");
>>>>
>>>> Why gdprintk()? I don't think the current domain is of much
>>>> interest here. What would be of interest is the subject domain.
>>>>
>>>
>>> s->id is not the domain_id, but id of the ioreq server.
>>
>> That's understood. But gdprintk() itself logs the current domain,
>> which isn't as useful as the subject one.
>
>Oh, I see. So the correct routine here should be dprintk(), right?

Yes.

>>>> And with all three bits now possibly being clear, aren't we risking the
>>>> entries to be mis-treated as not-present ones?
>>>
>>> Hah. You got me. Thanks! :)
>>> Now I realized it would be difficult if we wanna to emulate the read
>>> operations for HVM. According to Intel mannual, entry->r is to be
>>> cleared, so should entry->w if we do not want ept misconfig. And
>>> with both read and write permissions being forbidden, entry->x can be
>>> set only on processors with EXECUTE_ONLY capability.
>>> To avoid any entry to be mis-treated as not-present. We have several
>>> solutions:
>>> a> do not support the read emulation for now - we have no such usage
>>> case;
>>> b> add the check of p2m_t against p2m_ioreq_server in is_epte_present -
>>> a bit weird to me.
>>> Which one do you prefer? or any other suggestions?
>>
>> That question would also need to be asked to others who had
>> suggested supporting both. I'd be fine with a, but I also don't view
>> b as too awkward.
>
>According to Intel mannual, an entry is regarded as not present, if
>bit0:2 is 0. So adding a p2m type check in is_epte_present() means we
>will change its semantics, if this is acceptable(with no hurt to
>hypervisor). I'd prefer option b>

Perhaps time for the VMX maintainers to chime in - such a change is acceptable
only if it doesn't result in changed hardware behavior. I can't think of any such off
the top of my head, but this really should be confirmed by the maintainers before
deciding to go such a route.

Jan
Yu Zhang April 14, 2016, 9:56 a.m. UTC | #16
On 4/12/2016 11:08 PM, Jan Beulich wrote:
>>>> "Yu, Zhang" <yu.c.zhang@linux.intel.com> 04/12/16 11:47 AM >>>
>> On 4/12/2016 12:31 AM, Jan Beulich wrote:
>>>>>> On 11.04.16 at 13:14, <yu.c.zhang@linux.intel.com> wrote:
>>>> On 4/9/2016 6:28 AM, Jan Beulich wrote:
>>>>>>>> On 31.03.16 at 12:53, <yu.c.zhang@linux.intel.com> wrote:
>>>>>> +        if ( s->id == id )
>>>>>> +        {
>>>>>> +            rc = p2m_set_ioreq_server(d, flags, s);
>>>>>> +            if ( rc == 0 )
>>>>>> +                gdprintk(XENLOG_DEBUG, "%u %s type HVMMEM_ioreq_server.\n",
>>>>>> +                         s->id, (flags != 0) ? "mapped to" : "unmapped from");
>>>>>
>>>>> Why gdprintk()? I don't think the current domain is of much
>>>>> interest here. What would be of interest is the subject domain.
>>>>>
>>>>
>>>> s->id is not the domain_id, but id of the ioreq server.
>>>
>>> That's understood. But gdprintk() itself logs the current domain,
>>> which isn't as useful as the subject one.
>>
>> Oh, I see. So the correct routine here should be dprintk(), right?
>
> Yes.
>
>>>>> And with all three bits now possibly being clear, aren't we risking the
>>>>> entries to be mis-treated as not-present ones?
>>>>
>>>> Hah. You got me. Thanks! :)
>>>> Now I realized it would be difficult if we wanna to emulate the read
>>>> operations for HVM. According to Intel mannual, entry->r is to be
>>>> cleared, so should entry->w if we do not want ept misconfig. And
>>>> with both read and write permissions being forbidden, entry->x can be
>>>> set only on processors with EXECUTE_ONLY capability.
>>>> To avoid any entry to be mis-treated as not-present. We have several
>>>> solutions:
>>>> a> do not support the read emulation for now - we have no such usage
>>>> case;
>>>> b> add the check of p2m_t against p2m_ioreq_server in is_epte_present -
>>>> a bit weird to me.
>>>> Which one do you prefer? or any other suggestions?
>>>
>>> That question would also need to be asked to others who had
>>> suggested supporting both. I'd be fine with a, but I also don't view
>>> b as too awkward.
>>
>> According to Intel mannual, an entry is regarded as not present, if
>> bit0:2 is 0. So adding a p2m type check in is_epte_present() means we
>> will change its semantics, if this is acceptable(with no hurt to
>> hypervisor). I'd prefer option b>
>
> Perhaps time for the VMX maintainers to chime in - such a change is acceptable
> only if it doesn't result in changed hardware behavior. I can't think of any such off
> the top of my head, but this really should be confirmed by the maintainers before
> deciding to go such a route.
>

Thanks, Jan. :)
Jun & Kevin, any suggestions?

Yu
Yu Zhang April 14, 2016, 10:45 a.m. UTC | #17
On 4/11/2016 7:15 PM, Yu, Zhang wrote:
>
>
> On 4/8/2016 7:01 PM, George Dunlap wrote:
>> On 08/04/16 11:10, Yu, Zhang wrote:
>> [snip]
>>> BTW, I noticed your reply has not be CCed to mailing list, and I also
>>> wonder if we should raise this last question in community?
>>
>> Oops -- that was a mistake on my part.  :-)  I appreciate the
>> discretion; just so you know in the future, if I'm purposely changing
>> the CC list (removing xen-devel and/or adding extra people), I'll almost
>> always say so at the top of the mail.
>>
>>>> And then of course there's the p2m_ioreq_server -> p2m_ram_logdirty
>>>> transition -- I assume that live migration is incompatible with this
>>>> functionality?  Is there anything that prevents a live migration from
>>>> being started when there are outstanding p2m_ioreq_server entries?
>>>>
>>>
>>> Another good question, and the answer is unfortunately yes. :-)
>>>
>>> If live migration happens during the normal emulation process, entries
>>> marked with p2m_ioreq_server will be changed to p2m_log_dirty in
>>> resolve_misconfig(), and later write operations will change them to
>>> p2m_ram_rw, thereafter these pages can not be forwarded to device model.
>>>  From this point of view, this functionality is incompatible with live
>>> migration.
>>>
>>> But for XenGT, I think this is acceptable, because, if live migration
>>> is to be supported in the future, intervention from backend device
>>> model will be necessary. At that time, we can guarantee from the device
>>> model side that there's no outdated p2m_ioreq_server entries, hence no
>>> need to reset the p2m type back to p2m_ram_rw(and do not include
>>> p2m_ioreq_server in the P2M_CHANGEABLE_TYPES). By "outdated", I mean
>>> when an ioreq server is detached from p2m_ioreq_server, or before an
>>> ioreq server is attached to this type, entries marked with
>>> p2m_ioreq_server should be regarded as outdated.
>>>
>>> Is this acceptible to you? Any suggestions?
>>
>> So the question is, as of this series, what happens if someone tries to
>> initiate a live migration while there are outstanding p2m_ioreq_server
>> entries?
>>
>> If the answer is "the ioreq server suddenly loses all control of the
>> memory", that's something that needs to be changed.
>>
>
> Sorry, for this patch series, I'm afraid the above description is the
> answer.
>
> Besides, I find it's hard to change current code to both support the
> deferred resetting of p2m_ioreq_server and the live migration at the
> same time. One reason is that a page with p2m_ioreq_server behaves
> differently in different situations.
>
> My assumption of XenGT is that, for live migration to work, the device
> model should guarantee there's no outstanding p2m_ioreq_server pages
> in hypervisor(no need to use the deferred recalculation), and it is our
> device model who should be responsible for the copying of the write
> protected guest pages later.
>
> And another solution I can think of: when unmapping the ioreq server,
> we walk the p2m table and reset entries with p2m_ioreq_server back
> directly, instead of deferring the reset. And of course, this means
> performance impact. But since the mapping and unmapping of an ioreq
> server is not a frequent one, the performance penalty may be acceptable.
> How do you think about this approach?
>

George, sorry to bother you. Any comments on above option? :)

Another choice might be to let live migration fail if there's
outstanding p2m_ioreq_server entries. But I'm not quite inclined to do
so, because:
1> I'd still like to keep live migration feature for XenGT.
2> Not easy to know if there's outstanding p2m_ioreq_server entries. I
mean, since p2m type change is not only triggered by hypercall, to keep
a counter for remaining p2m_ioreq_server entries means a lot code
changes;

Besides, I wonder whether the requirement to reset the p2m_ioreq_server
is indispensable, could we let the device model side to be responsible
for this? The worst case I can imagine for device model failing to do
so is that operations of a gfn might be delivered to a wrong device
model. I'm not clear what kind of damage would this cause to the
hypervisor or other VM.

Does any other maintainers have any suggestions?
Thanks in advance! :)
>> If the answer is, "everything just works", that's perfect.
>>
>> If the answer is, "Before logdirty mode is set, the ioreq server has the
>> opportunity to detach, removing the p2m_ioreq_server entries, and
>> operating without that functionality", that's good too.
>>
>> If the answer is, "the live migration request fails and the guest
>> continues to run", that's also acceptable.  If you want this series to
>> be checked in today (the last day for 4.7), this is probably your best
>> bet.
>>
>>   -George
>>
>>
>>
>

Regards
Yu
Paul Durrant April 18, 2016, 3:57 p.m. UTC | #18
> -----Original Message-----

> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]

> Sent: 14 April 2016 11:45

> To: George Dunlap; Paul Durrant; xen-devel@lists.xen.org

> Cc: Jan Beulich; Kevin Tian; Andrew Cooper; Lv, Zhiyuan; Tim (Xen.org);

> jun.nakajima@intel.com

> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to

> map guest ram with p2m_ioreq_server to an ioreq server

> 

> On 4/11/2016 7:15 PM, Yu, Zhang wrote:

> >

> >

> > On 4/8/2016 7:01 PM, George Dunlap wrote:

> >> On 08/04/16 11:10, Yu, Zhang wrote:

> >> [snip]

> >>> BTW, I noticed your reply has not be CCed to mailing list, and I also

> >>> wonder if we should raise this last question in community?

> >>

> >> Oops -- that was a mistake on my part.  :-)  I appreciate the

> >> discretion; just so you know in the future, if I'm purposely changing

> >> the CC list (removing xen-devel and/or adding extra people), I'll almost

> >> always say so at the top of the mail.

> >>

> >>>> And then of course there's the p2m_ioreq_server ->

> p2m_ram_logdirty

> >>>> transition -- I assume that live migration is incompatible with this

> >>>> functionality?  Is there anything that prevents a live migration from

> >>>> being started when there are outstanding p2m_ioreq_server entries?

> >>>>

> >>>

> >>> Another good question, and the answer is unfortunately yes. :-)

> >>>

> >>> If live migration happens during the normal emulation process, entries

> >>> marked with p2m_ioreq_server will be changed to p2m_log_dirty in

> >>> resolve_misconfig(), and later write operations will change them to

> >>> p2m_ram_rw, thereafter these pages can not be forwarded to device

> model.

> >>>  From this point of view, this functionality is incompatible with live

> >>> migration.

> >>>

> >>> But for XenGT, I think this is acceptable, because, if live migration

> >>> is to be supported in the future, intervention from backend device

> >>> model will be necessary. At that time, we can guarantee from the device

> >>> model side that there's no outdated p2m_ioreq_server entries, hence

> no

> >>> need to reset the p2m type back to p2m_ram_rw(and do not include

> >>> p2m_ioreq_server in the P2M_CHANGEABLE_TYPES). By "outdated", I

> mean

> >>> when an ioreq server is detached from p2m_ioreq_server, or before an

> >>> ioreq server is attached to this type, entries marked with

> >>> p2m_ioreq_server should be regarded as outdated.

> >>>

> >>> Is this acceptible to you? Any suggestions?

> >>

> >> So the question is, as of this series, what happens if someone tries to

> >> initiate a live migration while there are outstanding p2m_ioreq_server

> >> entries?

> >>

> >> If the answer is "the ioreq server suddenly loses all control of the

> >> memory", that's something that needs to be changed.

> >>

> >

> > Sorry, for this patch series, I'm afraid the above description is the

> > answer.

> >

> > Besides, I find it's hard to change current code to both support the

> > deferred resetting of p2m_ioreq_server and the live migration at the

> > same time. One reason is that a page with p2m_ioreq_server behaves

> > differently in different situations.

> >

> > My assumption of XenGT is that, for live migration to work, the device

> > model should guarantee there's no outstanding p2m_ioreq_server pages

> > in hypervisor(no need to use the deferred recalculation), and it is our

> > device model who should be responsible for the copying of the write

> > protected guest pages later.

> >

> > And another solution I can think of: when unmapping the ioreq server,

> > we walk the p2m table and reset entries with p2m_ioreq_server back

> > directly, instead of deferring the reset. And of course, this means

> > performance impact. But since the mapping and unmapping of an ioreq

> > server is not a frequent one, the performance penalty may be acceptable.

> > How do you think about this approach?

> >

> 

> George, sorry to bother you. Any comments on above option? :)

> 

> Another choice might be to let live migration fail if there's

> outstanding p2m_ioreq_server entries. But I'm not quite inclined to do

> so, because:

> 1> I'd still like to keep live migration feature for XenGT.

> 2> Not easy to know if there's outstanding p2m_ioreq_server entries. I

> mean, since p2m type change is not only triggered by hypercall, to keep

> a counter for remaining p2m_ioreq_server entries means a lot code

> changes;

> 

> Besides, I wonder whether the requirement to reset the p2m_ioreq_server

> is indispensable, could we let the device model side to be responsible

> for this? The worst case I can imagine for device model failing to do

> so is that operations of a gfn might be delivered to a wrong device

> model. I'm not clear what kind of damage would this cause to the

> hypervisor or other VM.

> 

> Does any other maintainers have any suggestions?


Note that it is a requirement that an ioreq server be disabled before VM suspend. That means ioreq server pages essentially have to go back to ram_rw semantics.

  Paul

> Thanks in advance! :)

> >> If the answer is, "everything just works", that's perfect.

> >>

> >> If the answer is, "Before logdirty mode is set, the ioreq server has the

> >> opportunity to detach, removing the p2m_ioreq_server entries, and

> >> operating without that functionality", that's good too.

> >>

> >> If the answer is, "the live migration request fails and the guest

> >> continues to run", that's also acceptable.  If you want this series to

> >> be checked in today (the last day for 4.7), this is probably your best

> >> bet.

> >>

> >>   -George

> >>

> >>

> >>

> >

> 

> Regards

> Yu
Tian, Kevin April 19, 2016, 4:37 a.m. UTC | #19
> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]

> Sent: Thursday, April 14, 2016 6:45 PM

> 

> On 4/11/2016 7:15 PM, Yu, Zhang wrote:

> >

> >

> > On 4/8/2016 7:01 PM, George Dunlap wrote:

> >> On 08/04/16 11:10, Yu, Zhang wrote:

> >> [snip]

> >>> BTW, I noticed your reply has not be CCed to mailing list, and I also

> >>> wonder if we should raise this last question in community?

> >>

> >> Oops -- that was a mistake on my part.  :-)  I appreciate the

> >> discretion; just so you know in the future, if I'm purposely changing

> >> the CC list (removing xen-devel and/or adding extra people), I'll almost

> >> always say so at the top of the mail.

> >>

> >>>> And then of course there's the p2m_ioreq_server -> p2m_ram_logdirty

> >>>> transition -- I assume that live migration is incompatible with this

> >>>> functionality?  Is there anything that prevents a live migration from

> >>>> being started when there are outstanding p2m_ioreq_server entries?

> >>>>

> >>>

> >>> Another good question, and the answer is unfortunately yes. :-)

> >>>

> >>> If live migration happens during the normal emulation process, entries

> >>> marked with p2m_ioreq_server will be changed to p2m_log_dirty in

> >>> resolve_misconfig(), and later write operations will change them to

> >>> p2m_ram_rw, thereafter these pages can not be forwarded to device model.

> >>>  From this point of view, this functionality is incompatible with live

> >>> migration.

> >>>

> >>> But for XenGT, I think this is acceptable, because, if live migration

> >>> is to be supported in the future, intervention from backend device

> >>> model will be necessary. At that time, we can guarantee from the device

> >>> model side that there's no outdated p2m_ioreq_server entries, hence no

> >>> need to reset the p2m type back to p2m_ram_rw(and do not include

> >>> p2m_ioreq_server in the P2M_CHANGEABLE_TYPES). By "outdated", I mean

> >>> when an ioreq server is detached from p2m_ioreq_server, or before an

> >>> ioreq server is attached to this type, entries marked with

> >>> p2m_ioreq_server should be regarded as outdated.

> >>>

> >>> Is this acceptible to you? Any suggestions?

> >>

> >> So the question is, as of this series, what happens if someone tries to

> >> initiate a live migration while there are outstanding p2m_ioreq_server

> >> entries?

> >>

> >> If the answer is "the ioreq server suddenly loses all control of the

> >> memory", that's something that needs to be changed.

> >>

> >

> > Sorry, for this patch series, I'm afraid the above description is the

> > answer.

> >

> > Besides, I find it's hard to change current code to both support the

> > deferred resetting of p2m_ioreq_server and the live migration at the

> > same time. One reason is that a page with p2m_ioreq_server behaves

> > differently in different situations.

> >

> > My assumption of XenGT is that, for live migration to work, the device

> > model should guarantee there's no outstanding p2m_ioreq_server pages

> > in hypervisor(no need to use the deferred recalculation), and it is our

> > device model who should be responsible for the copying of the write

> > protected guest pages later.

> >

> > And another solution I can think of: when unmapping the ioreq server,

> > we walk the p2m table and reset entries with p2m_ioreq_server back

> > directly, instead of deferring the reset. And of course, this means

> > performance impact. But since the mapping and unmapping of an ioreq

> > server is not a frequent one, the performance penalty may be acceptable.

> > How do you think about this approach?

> >

> 

> George, sorry to bother you. Any comments on above option? :)

> 

> Another choice might be to let live migration fail if there's

> outstanding p2m_ioreq_server entries. But I'm not quite inclined to do

> so, because:

> 1> I'd still like to keep live migration feature for XenGT.

> 2> Not easy to know if there's outstanding p2m_ioreq_server entries. I

> mean, since p2m type change is not only triggered by hypercall, to keep

> a counter for remaining p2m_ioreq_server entries means a lot code

> changes;

> 

> Besides, I wonder whether the requirement to reset the p2m_ioreq_server

> is indispensable, could we let the device model side to be responsible

> for this? The worst case I can imagine for device model failing to do

> so is that operations of a gfn might be delivered to a wrong device

> model. I'm not clear what kind of damage would this cause to the

> hypervisor or other VM.

> 

> Does any other maintainers have any suggestions?

> Thanks in advance! :)


I'm not sure how above is working. In pre-copy phase (where logdirty
is concerned), the device model is still actively serving requests from
guest, including initiating new write-protection requests. How can you
guarantee draining of outstanding p2m_ioreq_server entries w/o 
actually freezing device model (while freezing device model means guest 
driver might be blocked with random errors)?

Thanks
Kevin
Tian, Kevin April 19, 2016, 4:50 a.m. UTC | #20
> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
> Sent: Thursday, April 14, 2016 5:57 PM
> >>>>> And with all three bits now possibly being clear, aren't we risking the
> >>>>> entries to be mis-treated as not-present ones?
> >>>>
> >>>> Hah. You got me. Thanks! :)
> >>>> Now I realized it would be difficult if we wanna to emulate the read
> >>>> operations for HVM. According to Intel mannual, entry->r is to be
> >>>> cleared, so should entry->w if we do not want ept misconfig. And
> >>>> with both read and write permissions being forbidden, entry->x can be
> >>>> set only on processors with EXECUTE_ONLY capability.
> >>>> To avoid any entry to be mis-treated as not-present. We have several
> >>>> solutions:
> >>>> a> do not support the read emulation for now - we have no such usage
> >>>> case;
> >>>> b> add the check of p2m_t against p2m_ioreq_server in is_epte_present -
> >>>> a bit weird to me.
> >>>> Which one do you prefer? or any other suggestions?
> >>>
> >>> That question would also need to be asked to others who had
> >>> suggested supporting both. I'd be fine with a, but I also don't view
> >>> b as too awkward.
> >>
> >> According to Intel mannual, an entry is regarded as not present, if
> >> bit0:2 is 0. So adding a p2m type check in is_epte_present() means we
> >> will change its semantics, if this is acceptable(with no hurt to
> >> hypervisor). I'd prefer option b>
> >
> > Perhaps time for the VMX maintainers to chime in - such a change is acceptable
> > only if it doesn't result in changed hardware behavior. I can't think of any such off
> > the top of my head, but this really should be confirmed by the maintainers before
> > deciding to go such a route.
> >
> 
> Thanks, Jan. :)
> Jun & Kevin, any suggestions?
> 

I'm a bit worried about this change, since it's a fundamental EPT
interface. Can we avoid supporting read emulation now?

Thanks
Kevin
Paul Durrant April 19, 2016, 8:46 a.m. UTC | #21
> -----Original Message-----
> From: Tian, Kevin [mailto:kevin.tian@intel.com]
> Sent: 19 April 2016 05:51
> To: Yu, Zhang; Jan Beulich; Paul Durrant; Nakajima, Jun
> Cc: Andrew Cooper; George Dunlap; Lv, Zhiyuan; xen-devel@lists.xen.org;
> Keir (Xen.org); Tim (Xen.org)
> Subject: RE: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to
> map guest ram with p2m_ioreq_server to an ioreq server
> 
> > From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
> > Sent: Thursday, April 14, 2016 5:57 PM
> > >>>>> And with all three bits now possibly being clear, aren't we risking the
> > >>>>> entries to be mis-treated as not-present ones?
> > >>>>
> > >>>> Hah. You got me. Thanks! :)
> > >>>> Now I realized it would be difficult if we wanna to emulate the read
> > >>>> operations for HVM. According to Intel mannual, entry->r is to be
> > >>>> cleared, so should entry->w if we do not want ept misconfig. And
> > >>>> with both read and write permissions being forbidden, entry->x can
> be
> > >>>> set only on processors with EXECUTE_ONLY capability.
> > >>>> To avoid any entry to be mis-treated as not-present. We have several
> > >>>> solutions:
> > >>>> a> do not support the read emulation for now - we have no such
> usage
> > >>>> case;
> > >>>> b> add the check of p2m_t against p2m_ioreq_server in
> is_epte_present -
> > >>>> a bit weird to me.
> > >>>> Which one do you prefer? or any other suggestions?
> > >>>
> > >>> That question would also need to be asked to others who had
> > >>> suggested supporting both. I'd be fine with a, but I also don't view
> > >>> b as too awkward.
> > >>
> > >> According to Intel mannual, an entry is regarded as not present, if
> > >> bit0:2 is 0. So adding a p2m type check in is_epte_present() means we
> > >> will change its semantics, if this is acceptable(with no hurt to
> > >> hypervisor). I'd prefer option b>
> > >
> > > Perhaps time for the VMX maintainers to chime in - such a change is
> acceptable
> > > only if it doesn't result in changed hardware behavior. I can't think of any
> such off
> > > the top of my head, but this really should be confirmed by the
> maintainers before
> > > deciding to go such a route.
> > >
> >
> > Thanks, Jan. :)
> > Jun & Kevin, any suggestions?
> >
> 
> I'm a bit worried about this change, since it's a fundamental EPT
> interface. Can we avoid supporting read emulation now?
> 

I'm happy to drop the implementation of read emulation for the moment to keep things simple, as long as it is kept in the interface.

  Paul

> Thanks
> Kevin
Yu Zhang April 19, 2016, 9:11 a.m. UTC | #22
On 4/18/2016 11:57 PM, Paul Durrant wrote:
>> -----Original Message-----
>> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
>> Sent: 14 April 2016 11:45
>> To: George Dunlap; Paul Durrant; xen-devel@lists.xen.org
>> Cc: Jan Beulich; Kevin Tian; Andrew Cooper; Lv, Zhiyuan; Tim (Xen.org);
>> jun.nakajima@intel.com
>> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to
>> map guest ram with p2m_ioreq_server to an ioreq server
>>
>> On 4/11/2016 7:15 PM, Yu, Zhang wrote:
>>>
>>>
>>> On 4/8/2016 7:01 PM, George Dunlap wrote:
>>>> On 08/04/16 11:10, Yu, Zhang wrote:
>>>> [snip]
>>>>> BTW, I noticed your reply has not be CCed to mailing list, and I also
>>>>> wonder if we should raise this last question in community?
>>>>
>>>> Oops -- that was a mistake on my part.  :-)  I appreciate the
>>>> discretion; just so you know in the future, if I'm purposely changing
>>>> the CC list (removing xen-devel and/or adding extra people), I'll almost
>>>> always say so at the top of the mail.
>>>>
>>>>>> And then of course there's the p2m_ioreq_server ->
>> p2m_ram_logdirty
>>>>>> transition -- I assume that live migration is incompatible with this
>>>>>> functionality?  Is there anything that prevents a live migration from
>>>>>> being started when there are outstanding p2m_ioreq_server entries?
>>>>>>
>>>>>
>>>>> Another good question, and the answer is unfortunately yes. :-)
>>>>>
>>>>> If live migration happens during the normal emulation process, entries
>>>>> marked with p2m_ioreq_server will be changed to p2m_log_dirty in
>>>>> resolve_misconfig(), and later write operations will change them to
>>>>> p2m_ram_rw, thereafter these pages can not be forwarded to device
>> model.
>>>>>   From this point of view, this functionality is incompatible with live
>>>>> migration.
>>>>>
>>>>> But for XenGT, I think this is acceptable, because, if live migration
>>>>> is to be supported in the future, intervention from backend device
>>>>> model will be necessary. At that time, we can guarantee from the device
>>>>> model side that there's no outdated p2m_ioreq_server entries, hence
>> no
>>>>> need to reset the p2m type back to p2m_ram_rw(and do not include
>>>>> p2m_ioreq_server in the P2M_CHANGEABLE_TYPES). By "outdated", I
>> mean
>>>>> when an ioreq server is detached from p2m_ioreq_server, or before an
>>>>> ioreq server is attached to this type, entries marked with
>>>>> p2m_ioreq_server should be regarded as outdated.
>>>>>
>>>>> Is this acceptible to you? Any suggestions?
>>>>
>>>> So the question is, as of this series, what happens if someone tries to
>>>> initiate a live migration while there are outstanding p2m_ioreq_server
>>>> entries?
>>>>
>>>> If the answer is "the ioreq server suddenly loses all control of the
>>>> memory", that's something that needs to be changed.
>>>>
>>>
>>> Sorry, for this patch series, I'm afraid the above description is the
>>> answer.
>>>
>>> Besides, I find it's hard to change current code to both support the
>>> deferred resetting of p2m_ioreq_server and the live migration at the
>>> same time. One reason is that a page with p2m_ioreq_server behaves
>>> differently in different situations.
>>>
>>> My assumption of XenGT is that, for live migration to work, the device
>>> model should guarantee there's no outstanding p2m_ioreq_server pages
>>> in hypervisor(no need to use the deferred recalculation), and it is our
>>> device model who should be responsible for the copying of the write
>>> protected guest pages later.
>>>
>>> And another solution I can think of: when unmapping the ioreq server,
>>> we walk the p2m table and reset entries with p2m_ioreq_server back
>>> directly, instead of deferring the reset. And of course, this means
>>> performance impact. But since the mapping and unmapping of an ioreq
>>> server is not a frequent one, the performance penalty may be acceptable.
>>> How do you think about this approach?
>>>
>>
>> George, sorry to bother you. Any comments on above option? :)
>>
>> Another choice might be to let live migration fail if there's
>> outstanding p2m_ioreq_server entries. But I'm not quite inclined to do
>> so, because:
>> 1> I'd still like to keep live migration feature for XenGT.
>> 2> Not easy to know if there's outstanding p2m_ioreq_server entries. I
>> mean, since p2m type change is not only triggered by hypercall, to keep
>> a counter for remaining p2m_ioreq_server entries means a lot code
>> changes;
>>
>> Besides, I wonder whether the requirement to reset the p2m_ioreq_server
>> is indispensable, could we let the device model side to be responsible
>> for this? The worst case I can imagine for device model failing to do
>> so is that operations of a gfn might be delivered to a wrong device
>> model. I'm not clear what kind of damage would this cause to the
>> hypervisor or other VM.
>>
>> Does any other maintainers have any suggestions?
>
> Note that it is a requirement that an ioreq server be disabled before VM suspend. That means ioreq server pages essentially have to go back to ram_rw semantics.
>
>    Paul
>

OK. So it should be hypervisor's responsibility to do the resetting.
Now we probably have 2 choices:
1> we reset the p2m type synchronously when ioreq server unmapping
happens, instead of deferring to the misconfig handling part. This
means performance impact to traverse the p2m table.

Or
2> we just disallow live migration when p2m->ioreq.server is not NULL.
This is not quite accurate, because having p2m->ioreq.server mapped
to p2m_ioreq_server does not necessarily means there would be such
outstanding entries. To be more accurate, we can add some other rough
check, e.g. both check if p2m->ioreq.server against NULL and check if
the hvmop_set_mem_type has ever been triggered once for the
p2m_ioreq_server type.

Both choice seems suboptimal for me. And I wonder if we have any
better solutions?

Thanks
Yu

>> Thanks in advance! :)
>>>> If the answer is, "everything just works", that's perfect.
>>>>
>>>> If the answer is, "Before logdirty mode is set, the ioreq server has the
>>>> opportunity to detach, removing the p2m_ioreq_server entries, and
>>>> operating without that functionality", that's good too.
>>>>
>>>> If the answer is, "the live migration request fails and the guest
>>>> continues to run", that's also acceptable.  If you want this series to
>>>> be checked in today (the last day for 4.7), this is probably your best
>>>> bet.
>>>>
>>>>    -George
>>>>
>>>>
>>>>
>>>
>>
>> Regards
>> Yu
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
Yu Zhang April 19, 2016, 9:15 a.m. UTC | #23
On 4/19/2016 12:50 PM, Tian, Kevin wrote:
>> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
>> Sent: Thursday, April 14, 2016 5:57 PM
>>>>>>> And with all three bits now possibly being clear, aren't we risking the
>>>>>>> entries to be mis-treated as not-present ones?
>>>>>>
>>>>>> Hah. You got me. Thanks! :)
>>>>>> Now I realized it would be difficult if we wanna to emulate the read
>>>>>> operations for HVM. According to Intel mannual, entry->r is to be
>>>>>> cleared, so should entry->w if we do not want ept misconfig. And
>>>>>> with both read and write permissions being forbidden, entry->x can be
>>>>>> set only on processors with EXECUTE_ONLY capability.
>>>>>> To avoid any entry to be mis-treated as not-present. We have several
>>>>>> solutions:
>>>>>> a> do not support the read emulation for now - we have no such usage
>>>>>> case;
>>>>>> b> add the check of p2m_t against p2m_ioreq_server in is_epte_present -
>>>>>> a bit weird to me.
>>>>>> Which one do you prefer? or any other suggestions?
>>>>>
>>>>> That question would also need to be asked to others who had
>>>>> suggested supporting both. I'd be fine with a, but I also don't view
>>>>> b as too awkward.
>>>>
>>>> According to Intel mannual, an entry is regarded as not present, if
>>>> bit0:2 is 0. So adding a p2m type check in is_epte_present() means we
>>>> will change its semantics, if this is acceptable(with no hurt to
>>>> hypervisor). I'd prefer option b>
>>>
>>> Perhaps time for the VMX maintainers to chime in - such a change is acceptable
>>> only if it doesn't result in changed hardware behavior. I can't think of any such off
>>> the top of my head, but this really should be confirmed by the maintainers before
>>> deciding to go such a route.
>>>
>>
>> Thanks, Jan. :)
>> Jun & Kevin, any suggestions?
>>
>
> I'm a bit worried about this change, since it's a fundamental EPT
> interface. Can we avoid supporting read emulation now?
>

Thanks for your reply, Kevin.
For XenGT, either waY would be fine. But I do not think changing
is_epte_present will introduce any bug.

Paul, do you have any preference?

Yu
Paul Durrant April 19, 2016, 9:21 a.m. UTC | #24
> -----Original Message-----

[snip]
> >> Does any other maintainers have any suggestions?

> >

> > Note that it is a requirement that an ioreq server be disabled before VM

> suspend. That means ioreq server pages essentially have to go back to

> ram_rw semantics.

> >

> >    Paul

> >

> 

> OK. So it should be hypervisor's responsibility to do the resetting.

> Now we probably have 2 choices:

> 1> we reset the p2m type synchronously when ioreq server unmapping

> happens, instead of deferring to the misconfig handling part. This

> means performance impact to traverse the p2m table.

> 


Do we need to reset at all. The p2m type does need to be transferred, it will just be unclaimed on the far end (i.e. the pages are treated as r/w ram) until the emulator starts up there. If that cannot be done without creating yet another p2m type to handle logdirty (which seems a suboptimal way of dealing with it) then I think migration needs to be disallowed on any domain than contains any ioreq_server type pages at this stage.

  Paul

> Or

> 2> we just disallow live migration when p2m->ioreq.server is not NULL.

> This is not quite accurate, because having p2m->ioreq.server mapped

> to p2m_ioreq_server does not necessarily means there would be such

> outstanding entries. To be more accurate, we can add some other rough

> check, e.g. both check if p2m->ioreq.server against NULL and check if

> the hvmop_set_mem_type has ever been triggered once for the

> p2m_ioreq_server type.

> 

> Both choice seems suboptimal for me. And I wonder if we have any

> better solutions?

> 

> Thanks

> Yu

> 

> >> Thanks in advance! :)

> >>>> If the answer is, "everything just works", that's perfect.

> >>>>

> >>>> If the answer is, "Before logdirty mode is set, the ioreq server has the

> >>>> opportunity to detach, removing the p2m_ioreq_server entries, and

> >>>> operating without that functionality", that's good too.

> >>>>

> >>>> If the answer is, "the live migration request fails and the guest

> >>>> continues to run", that's also acceptable.  If you want this series to

> >>>> be checked in today (the last day for 4.7), this is probably your best

> >>>> bet.

> >>>>

> >>>>    -George

> >>>>

> >>>>

> >>>>

> >>>

> >>

> >> Regards

> >> Yu

> > _______________________________________________

> > Xen-devel mailing list

> > Xen-devel@lists.xen.org

> > http://lists.xen.org/xen-devel

> >
Yu Zhang April 19, 2016, 9:21 a.m. UTC | #25
On 4/19/2016 12:37 PM, Tian, Kevin wrote:
>> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
>> Sent: Thursday, April 14, 2016 6:45 PM
>>
>> On 4/11/2016 7:15 PM, Yu, Zhang wrote:
>>>
>>>
>>> On 4/8/2016 7:01 PM, George Dunlap wrote:
>>>> On 08/04/16 11:10, Yu, Zhang wrote:
>>>> [snip]
>>>>> BTW, I noticed your reply has not be CCed to mailing list, and I also
>>>>> wonder if we should raise this last question in community?
>>>>
>>>> Oops -- that was a mistake on my part.  :-)  I appreciate the
>>>> discretion; just so you know in the future, if I'm purposely changing
>>>> the CC list (removing xen-devel and/or adding extra people), I'll almost
>>>> always say so at the top of the mail.
>>>>
>>>>>> And then of course there's the p2m_ioreq_server -> p2m_ram_logdirty
>>>>>> transition -- I assume that live migration is incompatible with this
>>>>>> functionality?  Is there anything that prevents a live migration from
>>>>>> being started when there are outstanding p2m_ioreq_server entries?
>>>>>>
>>>>>
>>>>> Another good question, and the answer is unfortunately yes. :-)
>>>>>
>>>>> If live migration happens during the normal emulation process, entries
>>>>> marked with p2m_ioreq_server will be changed to p2m_log_dirty in
>>>>> resolve_misconfig(), and later write operations will change them to
>>>>> p2m_ram_rw, thereafter these pages can not be forwarded to device model.
>>>>>   From this point of view, this functionality is incompatible with live
>>>>> migration.
>>>>>
>>>>> But for XenGT, I think this is acceptable, because, if live migration
>>>>> is to be supported in the future, intervention from backend device
>>>>> model will be necessary. At that time, we can guarantee from the device
>>>>> model side that there's no outdated p2m_ioreq_server entries, hence no
>>>>> need to reset the p2m type back to p2m_ram_rw(and do not include
>>>>> p2m_ioreq_server in the P2M_CHANGEABLE_TYPES). By "outdated", I mean
>>>>> when an ioreq server is detached from p2m_ioreq_server, or before an
>>>>> ioreq server is attached to this type, entries marked with
>>>>> p2m_ioreq_server should be regarded as outdated.
>>>>>
>>>>> Is this acceptible to you? Any suggestions?
>>>>
>>>> So the question is, as of this series, what happens if someone tries to
>>>> initiate a live migration while there are outstanding p2m_ioreq_server
>>>> entries?
>>>>
>>>> If the answer is "the ioreq server suddenly loses all control of the
>>>> memory", that's something that needs to be changed.
>>>>
>>>
>>> Sorry, for this patch series, I'm afraid the above description is the
>>> answer.
>>>
>>> Besides, I find it's hard to change current code to both support the
>>> deferred resetting of p2m_ioreq_server and the live migration at the
>>> same time. One reason is that a page with p2m_ioreq_server behaves
>>> differently in different situations.
>>>
>>> My assumption of XenGT is that, for live migration to work, the device
>>> model should guarantee there's no outstanding p2m_ioreq_server pages
>>> in hypervisor(no need to use the deferred recalculation), and it is our
>>> device model who should be responsible for the copying of the write
>>> protected guest pages later.
>>>
>>> And another solution I can think of: when unmapping the ioreq server,
>>> we walk the p2m table and reset entries with p2m_ioreq_server back
>>> directly, instead of deferring the reset. And of course, this means
>>> performance impact. But since the mapping and unmapping of an ioreq
>>> server is not a frequent one, the performance penalty may be acceptable.
>>> How do you think about this approach?
>>>
>>
>> George, sorry to bother you. Any comments on above option? :)
>>
>> Another choice might be to let live migration fail if there's
>> outstanding p2m_ioreq_server entries. But I'm not quite inclined to do
>> so, because:
>> 1> I'd still like to keep live migration feature for XenGT.
>> 2> Not easy to know if there's outstanding p2m_ioreq_server entries. I
>> mean, since p2m type change is not only triggered by hypercall, to keep
>> a counter for remaining p2m_ioreq_server entries means a lot code
>> changes;
>>
>> Besides, I wonder whether the requirement to reset the p2m_ioreq_server
>> is indispensable, could we let the device model side to be responsible
>> for this? The worst case I can imagine for device model failing to do
>> so is that operations of a gfn might be delivered to a wrong device
>> model. I'm not clear what kind of damage would this cause to the
>> hypervisor or other VM.
>>
>> Does any other maintainers have any suggestions?
>> Thanks in advance! :)
>
> I'm not sure how above is working. In pre-copy phase (where logdirty
> is concerned), the device model is still actively serving requests from
> guest, including initiating new write-protection requests. How can you
> guarantee draining of outstanding p2m_ioreq_server entries w/o
> actually freezing device model (while freezing device model means guest
> driver might be blocked with random errors)?
>

You are right, and I'm not suggesting to clear the p2m_ioreq_server
entries when live migration happens. My suggestion is that either we
guarantee there is no outstanding p2m_ioreq_server entries right after
the ioreq server is unbounded, or do not support live migration for
now.  :)

B.R.
Yu
Paul Durrant April 19, 2016, 9:23 a.m. UTC | #26
> -----Original Message-----

> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]

> Sent: 19 April 2016 10:15

> To: Kevin Tian; Jan Beulich; Paul Durrant; Nakajima, Jun

> Cc: Keir (Xen.org); George Dunlap; Andrew Cooper; Tim (Xen.org); xen-

> devel@lists.xen.org; Lv, Zhiyuan

> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to

> map guest ram with p2m_ioreq_server to an ioreq server

> 

> 

> 

> On 4/19/2016 12:50 PM, Tian, Kevin wrote:

> >> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]

> >> Sent: Thursday, April 14, 2016 5:57 PM

> >>>>>>> And with all three bits now possibly being clear, aren't we risking

> the

> >>>>>>> entries to be mis-treated as not-present ones?

> >>>>>>

> >>>>>> Hah. You got me. Thanks! :)

> >>>>>> Now I realized it would be difficult if we wanna to emulate the read

> >>>>>> operations for HVM. According to Intel mannual, entry->r is to be

> >>>>>> cleared, so should entry->w if we do not want ept misconfig. And

> >>>>>> with both read and write permissions being forbidden, entry->x can

> be

> >>>>>> set only on processors with EXECUTE_ONLY capability.

> >>>>>> To avoid any entry to be mis-treated as not-present. We have

> several

> >>>>>> solutions:

> >>>>>> a> do not support the read emulation for now - we have no such

> usage

> >>>>>> case;

> >>>>>> b> add the check of p2m_t against p2m_ioreq_server in

> is_epte_present -

> >>>>>> a bit weird to me.

> >>>>>> Which one do you prefer? or any other suggestions?

> >>>>>

> >>>>> That question would also need to be asked to others who had

> >>>>> suggested supporting both. I'd be fine with a, but I also don't view

> >>>>> b as too awkward.

> >>>>

> >>>> According to Intel mannual, an entry is regarded as not present, if

> >>>> bit0:2 is 0. So adding a p2m type check in is_epte_present() means we

> >>>> will change its semantics, if this is acceptable(with no hurt to

> >>>> hypervisor). I'd prefer option b>

> >>>

> >>> Perhaps time for the VMX maintainers to chime in - such a change is

> acceptable

> >>> only if it doesn't result in changed hardware behavior. I can't think of any

> such off

> >>> the top of my head, but this really should be confirmed by the

> maintainers before

> >>> deciding to go such a route.

> >>>

> >>

> >> Thanks, Jan. :)

> >> Jun & Kevin, any suggestions?

> >>

> >

> > I'm a bit worried about this change, since it's a fundamental EPT

> > interface. Can we avoid supporting read emulation now?

> >

> 

> Thanks for your reply, Kevin.

> For XenGT, either waY would be fine. But I do not think changing

> is_epte_present will introduce any bug.

> 

> Paul, do you have any preference?

>


Unless the code is nailed down in the next couple of days then I don't think any implementation is going to make it into 4.7 so I'd prefer to go with the simplest option, and not supporting read emulation makes things a lot simpler.

  Paul
 
> Yu
Yu Zhang April 19, 2016, 9:27 a.m. UTC | #27
On 4/19/2016 4:46 PM, Paul Durrant wrote:
>> -----Original Message-----
>> From: Tian, Kevin [mailto:kevin.tian@intel.com]
>> Sent: 19 April 2016 05:51
>> To: Yu, Zhang; Jan Beulich; Paul Durrant; Nakajima, Jun
>> Cc: Andrew Cooper; George Dunlap; Lv, Zhiyuan; xen-devel@lists.xen.org;
>> Keir (Xen.org); Tim (Xen.org)
>> Subject: RE: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to
>> map guest ram with p2m_ioreq_server to an ioreq server
>>
>>> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
>>> Sent: Thursday, April 14, 2016 5:57 PM
>>>>>>>> And with all three bits now possibly being clear, aren't we risking the
>>>>>>>> entries to be mis-treated as not-present ones?
>>>>>>>
>>>>>>> Hah. You got me. Thanks! :)
>>>>>>> Now I realized it would be difficult if we wanna to emulate the read
>>>>>>> operations for HVM. According to Intel mannual, entry->r is to be
>>>>>>> cleared, so should entry->w if we do not want ept misconfig. And
>>>>>>> with both read and write permissions being forbidden, entry->x can
>> be
>>>>>>> set only on processors with EXECUTE_ONLY capability.
>>>>>>> To avoid any entry to be mis-treated as not-present. We have several
>>>>>>> solutions:
>>>>>>> a> do not support the read emulation for now - we have no such
>> usage
>>>>>>> case;
>>>>>>> b> add the check of p2m_t against p2m_ioreq_server in
>> is_epte_present -
>>>>>>> a bit weird to me.
>>>>>>> Which one do you prefer? or any other suggestions?
>>>>>>
>>>>>> That question would also need to be asked to others who had
>>>>>> suggested supporting both. I'd be fine with a, but I also don't view
>>>>>> b as too awkward.
>>>>>
>>>>> According to Intel mannual, an entry is regarded as not present, if
>>>>> bit0:2 is 0. So adding a p2m type check in is_epte_present() means we
>>>>> will change its semantics, if this is acceptable(with no hurt to
>>>>> hypervisor). I'd prefer option b>
>>>>
>>>> Perhaps time for the VMX maintainers to chime in - such a change is
>> acceptable
>>>> only if it doesn't result in changed hardware behavior. I can't think of any
>> such off
>>>> the top of my head, but this really should be confirmed by the
>> maintainers before
>>>> deciding to go such a route.
>>>>
>>>
>>> Thanks, Jan. :)
>>> Jun & Kevin, any suggestions?
>>>
>>
>> I'm a bit worried about this change, since it's a fundamental EPT
>> interface. Can we avoid supporting read emulation now?
>>
>
> I'm happy to drop the implementation of read emulation for the moment to keep things simple, as long as it is kept in the interface.
>

Thanks, Paul.
So what's the benefit to keep the read in the interface, if we can
not support its emulation?

It could be easier to change the is_epte_present definition than
to remove the read emulation code, but either way would not be
difficult.

Yu
Paul Durrant April 19, 2016, 9:40 a.m. UTC | #28
> -----Original Message-----

> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]

> Sent: 19 April 2016 10:27

> To: Paul Durrant; Kevin Tian; Jan Beulich; Nakajima, Jun

> Cc: Keir (Xen.org); Andrew Cooper; Tim (Xen.org); George Dunlap; xen-

> devel@lists.xen.org; Lv, Zhiyuan

> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to

> map guest ram with p2m_ioreq_server to an ioreq server

> 

> 

> 

> On 4/19/2016 4:46 PM, Paul Durrant wrote:

> >> -----Original Message-----

> >> From: Tian, Kevin [mailto:kevin.tian@intel.com]

> >> Sent: 19 April 2016 05:51

> >> To: Yu, Zhang; Jan Beulich; Paul Durrant; Nakajima, Jun

> >> Cc: Andrew Cooper; George Dunlap; Lv, Zhiyuan; xen-

> devel@lists.xen.org;

> >> Keir (Xen.org); Tim (Xen.org)

> >> Subject: RE: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to

> >> map guest ram with p2m_ioreq_server to an ioreq server

> >>

> >>> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]

> >>> Sent: Thursday, April 14, 2016 5:57 PM

> >>>>>>>> And with all three bits now possibly being clear, aren't we risking

> the

> >>>>>>>> entries to be mis-treated as not-present ones?

> >>>>>>>

> >>>>>>> Hah. You got me. Thanks! :)

> >>>>>>> Now I realized it would be difficult if we wanna to emulate the read

> >>>>>>> operations for HVM. According to Intel mannual, entry->r is to be

> >>>>>>> cleared, so should entry->w if we do not want ept misconfig. And

> >>>>>>> with both read and write permissions being forbidden, entry->x

> can

> >> be

> >>>>>>> set only on processors with EXECUTE_ONLY capability.

> >>>>>>> To avoid any entry to be mis-treated as not-present. We have

> several

> >>>>>>> solutions:

> >>>>>>> a> do not support the read emulation for now - we have no such

> >> usage

> >>>>>>> case;

> >>>>>>> b> add the check of p2m_t against p2m_ioreq_server in

> >> is_epte_present -

> >>>>>>> a bit weird to me.

> >>>>>>> Which one do you prefer? or any other suggestions?

> >>>>>>

> >>>>>> That question would also need to be asked to others who had

> >>>>>> suggested supporting both. I'd be fine with a, but I also don't view

> >>>>>> b as too awkward.

> >>>>>

> >>>>> According to Intel mannual, an entry is regarded as not present, if

> >>>>> bit0:2 is 0. So adding a p2m type check in is_epte_present() means we

> >>>>> will change its semantics, if this is acceptable(with no hurt to

> >>>>> hypervisor). I'd prefer option b>

> >>>>

> >>>> Perhaps time for the VMX maintainers to chime in - such a change is

> >> acceptable

> >>>> only if it doesn't result in changed hardware behavior. I can't think of

> any

> >> such off

> >>>> the top of my head, but this really should be confirmed by the

> >> maintainers before

> >>>> deciding to go such a route.

> >>>>

> >>>

> >>> Thanks, Jan. :)

> >>> Jun & Kevin, any suggestions?

> >>>

> >>

> >> I'm a bit worried about this change, since it's a fundamental EPT

> >> interface. Can we avoid supporting read emulation now?

> >>

> >

> > I'm happy to drop the implementation of read emulation for the moment

> to keep things simple, as long as it is kept in the interface.

> >

> 

> Thanks, Paul.

> So what's the benefit to keep the read in the interface, if we can

> not support its emulation?

>


Well, if it's in the interface then support can be added in future. If it's not in the interface then it needs to be added later and that makes things more awkward compatibility-wise.

  Paul
 
> It could be easier to change the is_epte_present definition than

> to remove the read emulation code, but either way would not be

> difficult.

> 

> Yu
Yu Zhang April 19, 2016, 9:44 a.m. UTC | #29
On 4/19/2016 5:21 PM, Paul Durrant wrote:
>> -----Original Message-----
> [snip]
>>>> Does any other maintainers have any suggestions?
>>>
>>> Note that it is a requirement that an ioreq server be disabled before VM
>> suspend. That means ioreq server pages essentially have to go back to
>> ram_rw semantics.
>>>
>>>     Paul
>>>
>>
>> OK. So it should be hypervisor's responsibility to do the resetting.
>> Now we probably have 2 choices:
>> 1> we reset the p2m type synchronously when ioreq server unmapping
>> happens, instead of deferring to the misconfig handling part. This
>> means performance impact to traverse the p2m table.
>>
>
> Do we need to reset at all. The p2m type does need to be transferred, it will just be unclaimed on the far end (i.e. the pages are treated as r/w ram) until the emulator starts up there. If that cannot be done without creating yet another p2m type to handle logdirty (which seems a suboptimal way of dealing with it) then I think migration needs to be disallowed on any domain than contains any ioreq_server type pages at this stage.
>
>    Paul
>

Yes. We need - either the device model or hypervisor should grantee
there's no p2m_ioreq_server pages left after an ioreq server is unmapped
from this type (which is write protected in such senario), otherwise
its emulation might be forwarded to other unexpected device models which
claims the p2m_ioreq_server later.

So I guess approach 2> is your suggestion now.

Besides, previously, Jan also questioned the necessity of resetting the
p2m type when an ioreq server is mapping to the p2m_ioreq_server. His
argument is that we should only allow such p2m transition after an
ioreq server has already mapped to this p2m_ioreq_server. I think his 
point sounds also reasonable.

Thanks
Yu

>> Or
>> 2> we just disallow live migration when p2m->ioreq.server is not NULL.
>> This is not quite accurate, because having p2m->ioreq.server mapped
>> to p2m_ioreq_server does not necessarily means there would be such
>> outstanding entries. To be more accurate, we can add some other rough
>> check, e.g. both check if p2m->ioreq.server against NULL and check if
>> the hvmop_set_mem_type has ever been triggered once for the
>> p2m_ioreq_server type.
>>
>> Both choice seems suboptimal for me. And I wonder if we have any
>> better solutions?
>>
>> Thanks
>> Yu
>>
>>>> Thanks in advance! :)
>>>>>> If the answer is, "everything just works", that's perfect.
>>>>>>
>>>>>> If the answer is, "Before logdirty mode is set, the ioreq server has the
>>>>>> opportunity to detach, removing the p2m_ioreq_server entries, and
>>>>>> operating without that functionality", that's good too.
>>>>>>
>>>>>> If the answer is, "the live migration request fails and the guest
>>>>>> continues to run", that's also acceptable.  If you want this series to
>>>>>> be checked in today (the last day for 4.7), this is probably your best
>>>>>> bet.
>>>>>>
>>>>>>     -George
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> Regards
>>>> Yu
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xen.org
>>> http://lists.xen.org/xen-devel
>>>
Yu Zhang April 19, 2016, 9:49 a.m. UTC | #30
On 4/19/2016 5:40 PM, Paul Durrant wrote:
>> -----Original Message-----
>> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
>> Sent: 19 April 2016 10:27
>> To: Paul Durrant; Kevin Tian; Jan Beulich; Nakajima, Jun
>> Cc: Keir (Xen.org); Andrew Cooper; Tim (Xen.org); George Dunlap; xen-
>> devel@lists.xen.org; Lv, Zhiyuan
>> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to
>> map guest ram with p2m_ioreq_server to an ioreq server
>>
>>
>>
>> On 4/19/2016 4:46 PM, Paul Durrant wrote:
>>>> -----Original Message-----
>>>> From: Tian, Kevin [mailto:kevin.tian@intel.com]
>>>> Sent: 19 April 2016 05:51
>>>> To: Yu, Zhang; Jan Beulich; Paul Durrant; Nakajima, Jun
>>>> Cc: Andrew Cooper; George Dunlap; Lv, Zhiyuan; xen-
>> devel@lists.xen.org;
>>>> Keir (Xen.org); Tim (Xen.org)
>>>> Subject: RE: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to
>>>> map guest ram with p2m_ioreq_server to an ioreq server
>>>>
>>>>> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
>>>>> Sent: Thursday, April 14, 2016 5:57 PM
>>>>>>>>>> And with all three bits now possibly being clear, aren't we risking
>> the
>>>>>>>>>> entries to be mis-treated as not-present ones?
>>>>>>>>>
>>>>>>>>> Hah. You got me. Thanks! :)
>>>>>>>>> Now I realized it would be difficult if we wanna to emulate the read
>>>>>>>>> operations for HVM. According to Intel mannual, entry->r is to be
>>>>>>>>> cleared, so should entry->w if we do not want ept misconfig. And
>>>>>>>>> with both read and write permissions being forbidden, entry->x
>> can
>>>> be
>>>>>>>>> set only on processors with EXECUTE_ONLY capability.
>>>>>>>>> To avoid any entry to be mis-treated as not-present. We have
>> several
>>>>>>>>> solutions:
>>>>>>>>> a> do not support the read emulation for now - we have no such
>>>> usage
>>>>>>>>> case;
>>>>>>>>> b> add the check of p2m_t against p2m_ioreq_server in
>>>> is_epte_present -
>>>>>>>>> a bit weird to me.
>>>>>>>>> Which one do you prefer? or any other suggestions?
>>>>>>>>
>>>>>>>> That question would also need to be asked to others who had
>>>>>>>> suggested supporting both. I'd be fine with a, but I also don't view
>>>>>>>> b as too awkward.
>>>>>>>
>>>>>>> According to Intel mannual, an entry is regarded as not present, if
>>>>>>> bit0:2 is 0. So adding a p2m type check in is_epte_present() means we
>>>>>>> will change its semantics, if this is acceptable(with no hurt to
>>>>>>> hypervisor). I'd prefer option b>
>>>>>>
>>>>>> Perhaps time for the VMX maintainers to chime in - such a change is
>>>> acceptable
>>>>>> only if it doesn't result in changed hardware behavior. I can't think of
>> any
>>>> such off
>>>>>> the top of my head, but this really should be confirmed by the
>>>> maintainers before
>>>>>> deciding to go such a route.
>>>>>>
>>>>>
>>>>> Thanks, Jan. :)
>>>>> Jun & Kevin, any suggestions?
>>>>>
>>>>
>>>> I'm a bit worried about this change, since it's a fundamental EPT
>>>> interface. Can we avoid supporting read emulation now?
>>>>
>>>
>>> I'm happy to drop the implementation of read emulation for the moment
>> to keep things simple, as long as it is kept in the interface.
>>>
>>
>> Thanks, Paul.
>> So what's the benefit to keep the read in the interface, if we can
>> not support its emulation?
>>
>
> Well, if it's in the interface then support can be added in future. If it's not in the interface then it needs to be added later and that makes things more awkward compatibility-wise.
>
>    Paul
>

But if we need to support read emulation in the future, we'll have to 
clear bit0:2 in ept pte, which is the same situation we are facing
now. So why do we not support it now?

Sorry for seeming so stubborn, I just want to have some more convincing
explanations. :)

Yu

>> It could be easier to change the is_epte_present definition than
>> to remove the read emulation code, but either way would not be
>> difficult.
>>
>> Yu
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
Yu Zhang April 19, 2016, 9:54 a.m. UTC | #31
On 4/19/2016 6:01 PM, Paul Durrant wrote:
>> -----Original Message-----
>> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
>> Sent: 19 April 2016 10:49
>> To: Paul Durrant; Kevin Tian; Jan Beulich; Nakajima, Jun
>> Cc: Keir (Xen.org); Andrew Cooper; Tim (Xen.org); George Dunlap; xen-
>> devel@lists.xen.org; Lv, Zhiyuan
>> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to
>> map guest ram with p2m_ioreq_server to an ioreq server
>>
>>
>>
>> On 4/19/2016 5:40 PM, Paul Durrant wrote:
>>>> -----Original Message-----
>>>> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
>>>> Sent: 19 April 2016 10:27
>>>> To: Paul Durrant; Kevin Tian; Jan Beulich; Nakajima, Jun
>>>> Cc: Keir (Xen.org); Andrew Cooper; Tim (Xen.org); George Dunlap; xen-
>>>> devel@lists.xen.org; Lv, Zhiyuan
>>>> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to
>>>> map guest ram with p2m_ioreq_server to an ioreq server
>>>>
>>>>
>>>>
>>>> On 4/19/2016 4:46 PM, Paul Durrant wrote:
>>>>>> -----Original Message-----
>>>>>> From: Tian, Kevin [mailto:kevin.tian@intel.com]
>>>>>> Sent: 19 April 2016 05:51
>>>>>> To: Yu, Zhang; Jan Beulich; Paul Durrant; Nakajima, Jun
>>>>>> Cc: Andrew Cooper; George Dunlap; Lv, Zhiyuan; xen-
>>>> devel@lists.xen.org;
>>>>>> Keir (Xen.org); Tim (Xen.org)
>>>>>> Subject: RE: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP
>> to
>>>>>> map guest ram with p2m_ioreq_server to an ioreq server
>>>>>>
>>>>>>> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
>>>>>>> Sent: Thursday, April 14, 2016 5:57 PM
>>>>>>>>>>>> And with all three bits now possibly being clear, aren't we
>> risking
>>>> the
>>>>>>>>>>>> entries to be mis-treated as not-present ones?
>>>>>>>>>>>
>>>>>>>>>>> Hah. You got me. Thanks! :)
>>>>>>>>>>> Now I realized it would be difficult if we wanna to emulate the
>> read
>>>>>>>>>>> operations for HVM. According to Intel mannual, entry->r is to
>> be
>>>>>>>>>>> cleared, so should entry->w if we do not want ept misconfig.
>> And
>>>>>>>>>>> with both read and write permissions being forbidden, entry->x
>>>> can
>>>>>> be
>>>>>>>>>>> set only on processors with EXECUTE_ONLY capability.
>>>>>>>>>>> To avoid any entry to be mis-treated as not-present. We have
>>>> several
>>>>>>>>>>> solutions:
>>>>>>>>>>> a> do not support the read emulation for now - we have no such
>>>>>> usage
>>>>>>>>>>> case;
>>>>>>>>>>> b> add the check of p2m_t against p2m_ioreq_server in
>>>>>> is_epte_present -
>>>>>>>>>>> a bit weird to me.
>>>>>>>>>>> Which one do you prefer? or any other suggestions?
>>>>>>>>>>
>>>>>>>>>> That question would also need to be asked to others who had
>>>>>>>>>> suggested supporting both. I'd be fine with a, but I also don't view
>>>>>>>>>> b as too awkward.
>>>>>>>>>
>>>>>>>>> According to Intel mannual, an entry is regarded as not present, if
>>>>>>>>> bit0:2 is 0. So adding a p2m type check in is_epte_present() means
>> we
>>>>>>>>> will change its semantics, if this is acceptable(with no hurt to
>>>>>>>>> hypervisor). I'd prefer option b>
>>>>>>>>
>>>>>>>> Perhaps time for the VMX maintainers to chime in - such a change is
>>>>>> acceptable
>>>>>>>> only if it doesn't result in changed hardware behavior. I can't think of
>>>> any
>>>>>> such off
>>>>>>>> the top of my head, but this really should be confirmed by the
>>>>>> maintainers before
>>>>>>>> deciding to go such a route.
>>>>>>>>
>>>>>>>
>>>>>>> Thanks, Jan. :)
>>>>>>> Jun & Kevin, any suggestions?
>>>>>>>
>>>>>>
>>>>>> I'm a bit worried about this change, since it's a fundamental EPT
>>>>>> interface. Can we avoid supporting read emulation now?
>>>>>>
>>>>>
>>>>> I'm happy to drop the implementation of read emulation for the
>> moment
>>>> to keep things simple, as long as it is kept in the interface.
>>>>>
>>>>
>>>> Thanks, Paul.
>>>> So what's the benefit to keep the read in the interface, if we can
>>>> not support its emulation?
>>>>
>>>
>>> Well, if it's in the interface then support can be added in future. If it's not in
>> the interface then it needs to be added later and that makes things more
>> awkward compatibility-wise.
>>>
>>>     Paul
>>>
>>
>> But if we need to support read emulation in the future, we'll have to
>> clear bit0:2 in ept pte, which is the same situation we are facing
>> now. So why do we not support it now?
>>
>> Sorry for seeming so stubborn, I just want to have some more convincing
>> explanations. :)
>>
>
> That's an argument you need to have with Kevin. If he thinks the risk of making the change now is small then go ahead. All I'm saying is that 4.7 is about to go out of the door so whatever happens now needs to be as low risk and as minimally disruptive as possible.
>

Got it. Thank you, Paul. :)

Yu

>    Paul
>
>> Yu
>>
>>>> It could be easier to change the is_epte_present definition than
>>>> to remove the read emulation code, but either way would not be
>>>> difficult.
>>>>
>>>> Yu
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xen.org
>>> http://lists.xen.org/xen-devel
>>>
Paul Durrant April 19, 2016, 10:01 a.m. UTC | #32
> -----Original Message-----

> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]

> Sent: 19 April 2016 10:49

> To: Paul Durrant; Kevin Tian; Jan Beulich; Nakajima, Jun

> Cc: Keir (Xen.org); Andrew Cooper; Tim (Xen.org); George Dunlap; xen-

> devel@lists.xen.org; Lv, Zhiyuan

> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to

> map guest ram with p2m_ioreq_server to an ioreq server

> 

> 

> 

> On 4/19/2016 5:40 PM, Paul Durrant wrote:

> >> -----Original Message-----

> >> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]

> >> Sent: 19 April 2016 10:27

> >> To: Paul Durrant; Kevin Tian; Jan Beulich; Nakajima, Jun

> >> Cc: Keir (Xen.org); Andrew Cooper; Tim (Xen.org); George Dunlap; xen-

> >> devel@lists.xen.org; Lv, Zhiyuan

> >> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to

> >> map guest ram with p2m_ioreq_server to an ioreq server

> >>

> >>

> >>

> >> On 4/19/2016 4:46 PM, Paul Durrant wrote:

> >>>> -----Original Message-----

> >>>> From: Tian, Kevin [mailto:kevin.tian@intel.com]

> >>>> Sent: 19 April 2016 05:51

> >>>> To: Yu, Zhang; Jan Beulich; Paul Durrant; Nakajima, Jun

> >>>> Cc: Andrew Cooper; George Dunlap; Lv, Zhiyuan; xen-

> >> devel@lists.xen.org;

> >>>> Keir (Xen.org); Tim (Xen.org)

> >>>> Subject: RE: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP

> to

> >>>> map guest ram with p2m_ioreq_server to an ioreq server

> >>>>

> >>>>> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]

> >>>>> Sent: Thursday, April 14, 2016 5:57 PM

> >>>>>>>>>> And with all three bits now possibly being clear, aren't we

> risking

> >> the

> >>>>>>>>>> entries to be mis-treated as not-present ones?

> >>>>>>>>>

> >>>>>>>>> Hah. You got me. Thanks! :)

> >>>>>>>>> Now I realized it would be difficult if we wanna to emulate the

> read

> >>>>>>>>> operations for HVM. According to Intel mannual, entry->r is to

> be

> >>>>>>>>> cleared, so should entry->w if we do not want ept misconfig.

> And

> >>>>>>>>> with both read and write permissions being forbidden, entry->x

> >> can

> >>>> be

> >>>>>>>>> set only on processors with EXECUTE_ONLY capability.

> >>>>>>>>> To avoid any entry to be mis-treated as not-present. We have

> >> several

> >>>>>>>>> solutions:

> >>>>>>>>> a> do not support the read emulation for now - we have no such

> >>>> usage

> >>>>>>>>> case;

> >>>>>>>>> b> add the check of p2m_t against p2m_ioreq_server in

> >>>> is_epte_present -

> >>>>>>>>> a bit weird to me.

> >>>>>>>>> Which one do you prefer? or any other suggestions?

> >>>>>>>>

> >>>>>>>> That question would also need to be asked to others who had

> >>>>>>>> suggested supporting both. I'd be fine with a, but I also don't view

> >>>>>>>> b as too awkward.

> >>>>>>>

> >>>>>>> According to Intel mannual, an entry is regarded as not present, if

> >>>>>>> bit0:2 is 0. So adding a p2m type check in is_epte_present() means

> we

> >>>>>>> will change its semantics, if this is acceptable(with no hurt to

> >>>>>>> hypervisor). I'd prefer option b>

> >>>>>>

> >>>>>> Perhaps time for the VMX maintainers to chime in - such a change is

> >>>> acceptable

> >>>>>> only if it doesn't result in changed hardware behavior. I can't think of

> >> any

> >>>> such off

> >>>>>> the top of my head, but this really should be confirmed by the

> >>>> maintainers before

> >>>>>> deciding to go such a route.

> >>>>>>

> >>>>>

> >>>>> Thanks, Jan. :)

> >>>>> Jun & Kevin, any suggestions?

> >>>>>

> >>>>

> >>>> I'm a bit worried about this change, since it's a fundamental EPT

> >>>> interface. Can we avoid supporting read emulation now?

> >>>>

> >>>

> >>> I'm happy to drop the implementation of read emulation for the

> moment

> >> to keep things simple, as long as it is kept in the interface.

> >>>

> >>

> >> Thanks, Paul.

> >> So what's the benefit to keep the read in the interface, if we can

> >> not support its emulation?

> >>

> >

> > Well, if it's in the interface then support can be added in future. If it's not in

> the interface then it needs to be added later and that makes things more

> awkward compatibility-wise.

> >

> >    Paul

> >

> 

> But if we need to support read emulation in the future, we'll have to

> clear bit0:2 in ept pte, which is the same situation we are facing

> now. So why do we not support it now?

> 

> Sorry for seeming so stubborn, I just want to have some more convincing

> explanations. :)

> 


That's an argument you need to have with Kevin. If he thinks the risk of making the change now is small then go ahead. All I'm saying is that 4.7 is about to go out of the door so whatever happens now needs to be as low risk and as minimally disruptive as possible.

  Paul

> Yu

> 

> >> It could be easier to change the is_epte_present definition than

> >> to remove the read emulation code, but either way would not be

> >> difficult.

> >>

> >> Yu

> > _______________________________________________

> > Xen-devel mailing list

> > Xen-devel@lists.xen.org

> > http://lists.xen.org/xen-devel

> >
Paul Durrant April 19, 2016, 10:05 a.m. UTC | #33
> -----Original Message-----

> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]

> Sent: 19 April 2016 10:44

> To: Paul Durrant; George Dunlap; xen-devel@lists.xen.org

> Cc: Kevin Tian; Jan Beulich; Andrew Cooper; Tim (Xen.org); Lv, Zhiyuan;

> jun.nakajima@intel.com

> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to

> map guest ram with p2m_ioreq_server to an ioreq server

> 

> 

> 

> On 4/19/2016 5:21 PM, Paul Durrant wrote:

> >> -----Original Message-----

> > [snip]

> >>>> Does any other maintainers have any suggestions?

> >>>

> >>> Note that it is a requirement that an ioreq server be disabled before VM

> >> suspend. That means ioreq server pages essentially have to go back to

> >> ram_rw semantics.

> >>>

> >>>     Paul

> >>>

> >>

> >> OK. So it should be hypervisor's responsibility to do the resetting.

> >> Now we probably have 2 choices:

> >> 1> we reset the p2m type synchronously when ioreq server unmapping

> >> happens, instead of deferring to the misconfig handling part. This

> >> means performance impact to traverse the p2m table.

> >>

> >

> > Do we need to reset at all. The p2m type does need to be transferred, it

> will just be unclaimed on the far end (i.e. the pages are treated as r/w ram)

> until the emulator starts up there. If that cannot be done without creating

> yet another p2m type to handle logdirty (which seems a suboptimal way of

> dealing with it) then I think migration needs to be disallowed on any domain

> than contains any ioreq_server type pages at this stage.

> >

> >    Paul

> >

> 

> Yes. We need - either the device model or hypervisor should grantee

> there's no p2m_ioreq_server pages left after an ioreq server is unmapped

> from this type (which is write protected in such senario), otherwise

> its emulation might be forwarded to other unexpected device models which

> claims the p2m_ioreq_server later.


That should be for the device model to guarantee IMO. If the 'wrong' emulator claims the ioreq server type then I don't think that's Xen's problem.

> 

> So I guess approach 2> is your suggestion now.

> 

> Besides, previously, Jan also questioned the necessity of resetting the

> p2m type when an ioreq server is mapping to the p2m_ioreq_server. His

> argument is that we should only allow such p2m transition after an

> ioreq server has already mapped to this p2m_ioreq_server. I think his

> point sounds also reasonable.

> 


I was kind of hoping to avoid that ordering dependency but if it makes things simpler then so be it.

  Paul

> Thanks

> Yu

> 

> >> Or

> >> 2> we just disallow live migration when p2m->ioreq.server is not NULL.

> >> This is not quite accurate, because having p2m->ioreq.server mapped

> >> to p2m_ioreq_server does not necessarily means there would be such

> >> outstanding entries. To be more accurate, we can add some other rough

> >> check, e.g. both check if p2m->ioreq.server against NULL and check if

> >> the hvmop_set_mem_type has ever been triggered once for the

> >> p2m_ioreq_server type.

> >>

> >> Both choice seems suboptimal for me. And I wonder if we have any

> >> better solutions?

> >>

> >> Thanks

> >> Yu

> >>

> >>>> Thanks in advance! :)

> >>>>>> If the answer is, "everything just works", that's perfect.

> >>>>>>

> >>>>>> If the answer is, "Before logdirty mode is set, the ioreq server has

> the

> >>>>>> opportunity to detach, removing the p2m_ioreq_server entries, and

> >>>>>> operating without that functionality", that's good too.

> >>>>>>

> >>>>>> If the answer is, "the live migration request fails and the guest

> >>>>>> continues to run", that's also acceptable.  If you want this series to

> >>>>>> be checked in today (the last day for 4.7), this is probably your best

> >>>>>> bet.

> >>>>>>

> >>>>>>     -George

> >>>>>>

> >>>>>>

> >>>>>>

> >>>>>

> >>>>

> >>>> Regards

> >>>> Yu

> >>> _______________________________________________

> >>> Xen-devel mailing list

> >>> Xen-devel@lists.xen.org

> >>> http://lists.xen.org/xen-devel

> >>>
Yu Zhang April 19, 2016, 11:17 a.m. UTC | #34
On 4/19/2016 6:05 PM, Paul Durrant wrote:
>> -----Original Message-----
>> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
>> Sent: 19 April 2016 10:44
>> To: Paul Durrant; George Dunlap; xen-devel@lists.xen.org
>> Cc: Kevin Tian; Jan Beulich; Andrew Cooper; Tim (Xen.org); Lv, Zhiyuan;
>> jun.nakajima@intel.com
>> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to
>> map guest ram with p2m_ioreq_server to an ioreq server
>>
>>
>>
>> On 4/19/2016 5:21 PM, Paul Durrant wrote:
>>>> -----Original Message-----
>>> [snip]
>>>>>> Does any other maintainers have any suggestions?
>>>>>
>>>>> Note that it is a requirement that an ioreq server be disabled before VM
>>>> suspend. That means ioreq server pages essentially have to go back to
>>>> ram_rw semantics.
>>>>>
>>>>>      Paul
>>>>>
>>>>
>>>> OK. So it should be hypervisor's responsibility to do the resetting.
>>>> Now we probably have 2 choices:
>>>> 1> we reset the p2m type synchronously when ioreq server unmapping
>>>> happens, instead of deferring to the misconfig handling part. This
>>>> means performance impact to traverse the p2m table.
>>>>
>>>
>>> Do we need to reset at all. The p2m type does need to be transferred, it
>> will just be unclaimed on the far end (i.e. the pages are treated as r/w ram)
>> until the emulator starts up there. If that cannot be done without creating
>> yet another p2m type to handle logdirty (which seems a suboptimal way of
>> dealing with it) then I think migration needs to be disallowed on any domain
>> than contains any ioreq_server type pages at this stage.
>>>
>>>     Paul
>>>
>>
>> Yes. We need - either the device model or hypervisor should grantee
>> there's no p2m_ioreq_server pages left after an ioreq server is unmapped
>> from this type (which is write protected in such senario), otherwise
>> its emulation might be forwarded to other unexpected device models which
>> claims the p2m_ioreq_server later.
>
> That should be for the device model to guarantee IMO. If the 'wrong' emulator claims the ioreq server type then I don't think that's Xen's problem.
>

Thanks, Paul.

So what about the VM suspend case you mentioned above? Will that trigger
the unmapping of ioreq server? Could the device model also take the role
to change the p2m type back in such case?

It would be much simpler if hypervisor side does not need to provide
the p2m resetting logic, and we can support live migration at the same
time then. :)


B.R.
Yu

>>
>> So I guess approach 2> is your suggestion now.
>>
>> Besides, previously, Jan also questioned the necessity of resetting the
>> p2m type when an ioreq server is mapping to the p2m_ioreq_server. His
>> argument is that we should only allow such p2m transition after an
>> ioreq server has already mapped to this p2m_ioreq_server. I think his
>> point sounds also reasonable.
>>
>
> I was kind of hoping to avoid that ordering dependency but if it makes things simpler then so be it.
>
>    Paul
>
>> Thanks
>> Yu
>>
>>>> Or
>>>> 2> we just disallow live migration when p2m->ioreq.server is not NULL.
>>>> This is not quite accurate, because having p2m->ioreq.server mapped
>>>> to p2m_ioreq_server does not necessarily means there would be such
>>>> outstanding entries. To be more accurate, we can add some other rough
>>>> check, e.g. both check if p2m->ioreq.server against NULL and check if
>>>> the hvmop_set_mem_type has ever been triggered once for the
>>>> p2m_ioreq_server type.
>>>>
>>>> Both choice seems suboptimal for me. And I wonder if we have any
>>>> better solutions?
>>>>
>>>> Thanks
>>>> Yu
>>>>
>>>>>> Thanks in advance! :)
>>>>>>>> If the answer is, "everything just works", that's perfect.
>>>>>>>>
>>>>>>>> If the answer is, "Before logdirty mode is set, the ioreq server has
>> the
>>>>>>>> opportunity to detach, removing the p2m_ioreq_server entries, and
>>>>>>>> operating without that functionality", that's good too.
>>>>>>>>
>>>>>>>> If the answer is, "the live migration request fails and the guest
>>>>>>>> continues to run", that's also acceptable.  If you want this series to
>>>>>>>> be checked in today (the last day for 4.7), this is probably your best
>>>>>>>> bet.
>>>>>>>>
>>>>>>>>      -George
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Yu
>>>>> _______________________________________________
>>>>> Xen-devel mailing list
>>>>> Xen-devel@lists.xen.org
>>>>> http://lists.xen.org/xen-devel
>>>>>
Paul Durrant April 19, 2016, 11:47 a.m. UTC | #35
> -----Original Message-----

> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]

> Sent: 19 April 2016 12:18

> To: Paul Durrant; George Dunlap; xen-devel@lists.xen.org

> Cc: Kevin Tian; Jan Beulich; Andrew Cooper; Tim (Xen.org); Lv, Zhiyuan;

> jun.nakajima@intel.com

> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to

> map guest ram with p2m_ioreq_server to an ioreq server

> 

> 

> 

> On 4/19/2016 6:05 PM, Paul Durrant wrote:

> >> -----Original Message-----

> >> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]

> >> Sent: 19 April 2016 10:44

> >> To: Paul Durrant; George Dunlap; xen-devel@lists.xen.org

> >> Cc: Kevin Tian; Jan Beulich; Andrew Cooper; Tim (Xen.org); Lv, Zhiyuan;

> >> jun.nakajima@intel.com

> >> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to

> >> map guest ram with p2m_ioreq_server to an ioreq server

> >>

> >>

> >>

> >> On 4/19/2016 5:21 PM, Paul Durrant wrote:

> >>>> -----Original Message-----

> >>> [snip]

> >>>>>> Does any other maintainers have any suggestions?

> >>>>>

> >>>>> Note that it is a requirement that an ioreq server be disabled before

> VM

> >>>> suspend. That means ioreq server pages essentially have to go back to

> >>>> ram_rw semantics.

> >>>>>

> >>>>>      Paul

> >>>>>

> >>>>

> >>>> OK. So it should be hypervisor's responsibility to do the resetting.

> >>>> Now we probably have 2 choices:

> >>>> 1> we reset the p2m type synchronously when ioreq server unmapping

> >>>> happens, instead of deferring to the misconfig handling part. This

> >>>> means performance impact to traverse the p2m table.

> >>>>

> >>>

> >>> Do we need to reset at all. The p2m type does need to be transferred, it

> >> will just be unclaimed on the far end (i.e. the pages are treated as r/w

> ram)

> >> until the emulator starts up there. If that cannot be done without creating

> >> yet another p2m type to handle logdirty (which seems a suboptimal way

> of

> >> dealing with it) then I think migration needs to be disallowed on any

> domain

> >> than contains any ioreq_server type pages at this stage.

> >>>

> >>>     Paul

> >>>

> >>

> >> Yes. We need - either the device model or hypervisor should grantee

> >> there's no p2m_ioreq_server pages left after an ioreq server is

> unmapped

> >> from this type (which is write protected in such senario), otherwise

> >> its emulation might be forwarded to other unexpected device models

> which

> >> claims the p2m_ioreq_server later.

> >

> > That should be for the device model to guarantee IMO. If the 'wrong'

> emulator claims the ioreq server type then I don't think that's Xen's problem.

> >

> 

> Thanks, Paul.

> 

> So what about the VM suspend case you mentioned above? Will that trigger

> the unmapping of ioreq server? Could the device model also take the role

> to change the p2m type back in such case?


Yes. The device model has to be told by the toolstack that the VM is suspending, otherwise it can't disable the ioreq server which puts the shared ioreq pages back into the guest p2m. If that's not done then the pages will be leaked.

> 

> It would be much simpler if hypervisor side does not need to provide

> the p2m resetting logic, and we can support live migration at the same

> time then. :)

> 


That really should not be hypervisor's job.

  Paul

> 

> B.R.

> Yu

> 

> >>

> >> So I guess approach 2> is your suggestion now.

> >>

> >> Besides, previously, Jan also questioned the necessity of resetting the

> >> p2m type when an ioreq server is mapping to the p2m_ioreq_server. His

> >> argument is that we should only allow such p2m transition after an

> >> ioreq server has already mapped to this p2m_ioreq_server. I think his

> >> point sounds also reasonable.

> >>

> >

> > I was kind of hoping to avoid that ordering dependency but if it makes

> things simpler then so be it.

> >

> >    Paul

> >

> >> Thanks

> >> Yu

> >>

> >>>> Or

> >>>> 2> we just disallow live migration when p2m->ioreq.server is not NULL.

> >>>> This is not quite accurate, because having p2m->ioreq.server mapped

> >>>> to p2m_ioreq_server does not necessarily means there would be such

> >>>> outstanding entries. To be more accurate, we can add some other

> rough

> >>>> check, e.g. both check if p2m->ioreq.server against NULL and check if

> >>>> the hvmop_set_mem_type has ever been triggered once for the

> >>>> p2m_ioreq_server type.

> >>>>

> >>>> Both choice seems suboptimal for me. And I wonder if we have any

> >>>> better solutions?

> >>>>

> >>>> Thanks

> >>>> Yu

> >>>>

> >>>>>> Thanks in advance! :)

> >>>>>>>> If the answer is, "everything just works", that's perfect.

> >>>>>>>>

> >>>>>>>> If the answer is, "Before logdirty mode is set, the ioreq server has

> >> the

> >>>>>>>> opportunity to detach, removing the p2m_ioreq_server entries,

> and

> >>>>>>>> operating without that functionality", that's good too.

> >>>>>>>>

> >>>>>>>> If the answer is, "the live migration request fails and the guest

> >>>>>>>> continues to run", that's also acceptable.  If you want this series to

> >>>>>>>> be checked in today (the last day for 4.7), this is probably your

> best

> >>>>>>>> bet.

> >>>>>>>>

> >>>>>>>>      -George

> >>>>>>>>

> >>>>>>>>

> >>>>>>>>

> >>>>>>>

> >>>>>>

> >>>>>> Regards

> >>>>>> Yu

> >>>>> _______________________________________________

> >>>>> Xen-devel mailing list

> >>>>> Xen-devel@lists.xen.org

> >>>>> http://lists.xen.org/xen-devel

> >>>>>
Yu Zhang April 19, 2016, 11:59 a.m. UTC | #36
On 4/19/2016 7:47 PM, Paul Durrant wrote:
>> -----Original Message-----
>> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
>> Sent: 19 April 2016 12:18
>> To: Paul Durrant; George Dunlap; xen-devel@lists.xen.org
>> Cc: Kevin Tian; Jan Beulich; Andrew Cooper; Tim (Xen.org); Lv, Zhiyuan;
>> jun.nakajima@intel.com
>> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to
>> map guest ram with p2m_ioreq_server to an ioreq server
>>
>>
>>
>> On 4/19/2016 6:05 PM, Paul Durrant wrote:
>>>> -----Original Message-----
>>>> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
>>>> Sent: 19 April 2016 10:44
>>>> To: Paul Durrant; George Dunlap; xen-devel@lists.xen.org
>>>> Cc: Kevin Tian; Jan Beulich; Andrew Cooper; Tim (Xen.org); Lv, Zhiyuan;
>>>> jun.nakajima@intel.com
>>>> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to
>>>> map guest ram with p2m_ioreq_server to an ioreq server
>>>>
>>>>
>>>>
>>>> On 4/19/2016 5:21 PM, Paul Durrant wrote:
>>>>>> -----Original Message-----
>>>>> [snip]
>>>>>>>> Does any other maintainers have any suggestions?
>>>>>>>
>>>>>>> Note that it is a requirement that an ioreq server be disabled before
>> VM
>>>>>> suspend. That means ioreq server pages essentially have to go back to
>>>>>> ram_rw semantics.
>>>>>>>
>>>>>>>       Paul
>>>>>>>
>>>>>>
>>>>>> OK. So it should be hypervisor's responsibility to do the resetting.
>>>>>> Now we probably have 2 choices:
>>>>>> 1> we reset the p2m type synchronously when ioreq server unmapping
>>>>>> happens, instead of deferring to the misconfig handling part. This
>>>>>> means performance impact to traverse the p2m table.
>>>>>>
>>>>>
>>>>> Do we need to reset at all. The p2m type does need to be transferred, it
>>>> will just be unclaimed on the far end (i.e. the pages are treated as r/w
>> ram)
>>>> until the emulator starts up there. If that cannot be done without creating
>>>> yet another p2m type to handle logdirty (which seems a suboptimal way
>> of
>>>> dealing with it) then I think migration needs to be disallowed on any
>> domain
>>>> than contains any ioreq_server type pages at this stage.
>>>>>
>>>>>      Paul
>>>>>
>>>>
>>>> Yes. We need - either the device model or hypervisor should grantee
>>>> there's no p2m_ioreq_server pages left after an ioreq server is
>> unmapped
>>>> from this type (which is write protected in such senario), otherwise
>>>> its emulation might be forwarded to other unexpected device models
>> which
>>>> claims the p2m_ioreq_server later.
>>>
>>> That should be for the device model to guarantee IMO. If the 'wrong'
>> emulator claims the ioreq server type then I don't think that's Xen's problem.
>>>
>>
>> Thanks, Paul.
>>
>> So what about the VM suspend case you mentioned above? Will that trigger
>> the unmapping of ioreq server? Could the device model also take the role
>> to change the p2m type back in such case?
>
> Yes. The device model has to be told by the toolstack that the VM is suspending, otherwise it can't disable the ioreq server which puts the shared ioreq pages back into the guest p2m. If that's not done then the pages will be leaked.
>
>>
>> It would be much simpler if hypervisor side does not need to provide
>> the p2m resetting logic, and we can support live migration at the same
>> time then. :)
>>
>
> That really should not be hypervisor's job.
>
>    Paul
>

Oh. So let's just remove the p2m type recalculation code from this
patch, no need to call p2m_change_entry_type_global, and no need to
worry about the log dirty part.

George, do you think this acceptable?

BTW, if no need to call p2m_change_entry_type_global, which is not
used for shadow mode, we can keep this p2m type in shadow code, right?

Thanks
Yu

>>
>> B.R.
>> Yu
>>
>>>>
>>>> So I guess approach 2> is your suggestion now.
>>>>
>>>> Besides, previously, Jan also questioned the necessity of resetting the
>>>> p2m type when an ioreq server is mapping to the p2m_ioreq_server. His
>>>> argument is that we should only allow such p2m transition after an
>>>> ioreq server has already mapped to this p2m_ioreq_server. I think his
>>>> point sounds also reasonable.
>>>>
>>>
>>> I was kind of hoping to avoid that ordering dependency but if it makes
>> things simpler then so be it.
>>>
>>>     Paul
>>>
>>>> Thanks
>>>> Yu
>>>>
>>>>>> Or
>>>>>> 2> we just disallow live migration when p2m->ioreq.server is not NULL.
>>>>>> This is not quite accurate, because having p2m->ioreq.server mapped
>>>>>> to p2m_ioreq_server does not necessarily means there would be such
>>>>>> outstanding entries. To be more accurate, we can add some other
>> rough
>>>>>> check, e.g. both check if p2m->ioreq.server against NULL and check if
>>>>>> the hvmop_set_mem_type has ever been triggered once for the
>>>>>> p2m_ioreq_server type.
>>>>>>
>>>>>> Both choice seems suboptimal for me. And I wonder if we have any
>>>>>> better solutions?
>>>>>>
>>>>>> Thanks
>>>>>> Yu
>>>>>>
>>>>>>>> Thanks in advance! :)
>>>>>>>>>> If the answer is, "everything just works", that's perfect.
>>>>>>>>>>
>>>>>>>>>> If the answer is, "Before logdirty mode is set, the ioreq server has
>>>> the
>>>>>>>>>> opportunity to detach, removing the p2m_ioreq_server entries,
>> and
>>>>>>>>>> operating without that functionality", that's good too.
>>>>>>>>>>
>>>>>>>>>> If the answer is, "the live migration request fails and the guest
>>>>>>>>>> continues to run", that's also acceptable.  If you want this series to
>>>>>>>>>> be checked in today (the last day for 4.7), this is probably your
>> best
>>>>>>>>>> bet.
>>>>>>>>>>
>>>>>>>>>>       -George
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Yu
>>>>>>> _______________________________________________
>>>>>>> Xen-devel mailing list
>>>>>>> Xen-devel@lists.xen.org
>>>>>>> http://lists.xen.org/xen-devel
>>>>>>>
George Dunlap April 20, 2016, 2:50 p.m. UTC | #37
On Tue, Apr 19, 2016 at 12:59 PM, Yu, Zhang <yu.c.zhang@linux.intel.com> wrote:
>>> So what about the VM suspend case you mentioned above? Will that trigger
>>> the unmapping of ioreq server? Could the device model also take the role
>>> to change the p2m type back in such case?
>>
>>
>> Yes. The device model has to be told by the toolstack that the VM is
>> suspending, otherwise it can't disable the ioreq server which puts the
>> shared ioreq pages back into the guest p2m. If that's not done then the
>> pages will be leaked.
>>
>>>
>>> It would be much simpler if hypervisor side does not need to provide
>>> the p2m resetting logic, and we can support live migration at the same
>>> time then. :)
>>>
>>
>> That really should not be hypervisor's job.
>>
>>    Paul
>>
>
> Oh. So let's just remove the p2m type recalculation code from this
> patch, no need to call p2m_change_entry_type_global, and no need to
> worry about the log dirty part.
>
> George, do you think this acceptable?

Sorry, just to make sure I understand your intentions:

1. The ioreq servers will be responsible for changing all the
p2m_ioreq_server types back to p2m_ram_rw before detaching
2. Xen will refuse to set logdirty mode if there are any attached ioreq servers

The one problem with #1 is what to do if the ioreq server is buggy /
killed, and forgets / is unable to clean up its ioreq_server entries?

There's a certain symmetry with the idea of having the ioreq server
responsible both for mapping and unmapping; and it would be nice to
have this stuff work for shadow mode as well.  But the automatic type
change seems like a more robust option.  (Still open to persuasion on
this.)

On another note: the hard freeze is long past the already-extended
deadline, so I was assuming this was 4.8 material at this point.

 -George
Paul Durrant April 20, 2016, 2:57 p.m. UTC | #38
> -----Original Message-----

> From: George Dunlap

> Sent: 20 April 2016 15:50

> To: Yu, Zhang

> Cc: Paul Durrant; xen-devel@lists.xen.org; Kevin Tian; Jan Beulich; Andrew

> Cooper; Tim (Xen.org); Lv, Zhiyuan; jun.nakajima@intel.com

> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to

> map guest ram with p2m_ioreq_server to an ioreq server

> 

> On Tue, Apr 19, 2016 at 12:59 PM, Yu, Zhang <yu.c.zhang@linux.intel.com>

> wrote:

> >>> So what about the VM suspend case you mentioned above? Will that

> trigger

> >>> the unmapping of ioreq server? Could the device model also take the

> role

> >>> to change the p2m type back in such case?

> >>

> >>

> >> Yes. The device model has to be told by the toolstack that the VM is

> >> suspending, otherwise it can't disable the ioreq server which puts the

> >> shared ioreq pages back into the guest p2m. If that's not done then the

> >> pages will be leaked.

> >>

> >>>

> >>> It would be much simpler if hypervisor side does not need to provide

> >>> the p2m resetting logic, and we can support live migration at the same

> >>> time then. :)

> >>>

> >>

> >> That really should not be hypervisor's job.

> >>

> >>    Paul

> >>

> >

> > Oh. So let's just remove the p2m type recalculation code from this

> > patch, no need to call p2m_change_entry_type_global, and no need to

> > worry about the log dirty part.

> >

> > George, do you think this acceptable?

> 

> Sorry, just to make sure I understand your intentions:

> 

> 1. The ioreq servers will be responsible for changing all the

> p2m_ioreq_server types back to p2m_ram_rw before detaching

> 2. Xen will refuse to set logdirty mode if there are any attached ioreq servers

> 

> The one problem with #1 is what to do if the ioreq server is buggy /

> killed, and forgets / is unable to clean up its ioreq_server entries?

> 


If we ever want to meaningfully migrate a VM with XenGT enabled then I think it has to be possible to migrate the p2m_ioreq_server page types as-is. I'd expect a new instance of the xengt emulator to claim them on the far and any emulator state be transferred in a similar way to which qemu state is transferred such that vgpu emulation can be continued after the migration.
After all, it's not like we magically reset the type of mmio-dm pages on migration. It is just expected that a new instance of QEMU is spawned to take care of them on the far end when the VM is unpaused.

  Paul

> There's a certain symmetry with the idea of having the ioreq server

> responsible both for mapping and unmapping; and it would be nice to

> have this stuff work for shadow mode as well.  But the automatic type

> change seems like a more robust option.  (Still open to persuasion on

> this.)

> 

> On another note: the hard freeze is long past the already-extended

> deadline, so I was assuming this was 4.8 material at this point.

> 

>  -George
George Dunlap April 20, 2016, 3:37 p.m. UTC | #39
On Wed, Apr 20, 2016 at 3:57 PM, Paul Durrant <Paul.Durrant@citrix.com> wrote:
>> -----Original Message-----
>> From: George Dunlap
>> Sent: 20 April 2016 15:50
>> To: Yu, Zhang
>> Cc: Paul Durrant; xen-devel@lists.xen.org; Kevin Tian; Jan Beulich; Andrew
>> Cooper; Tim (Xen.org); Lv, Zhiyuan; jun.nakajima@intel.com
>> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to
>> map guest ram with p2m_ioreq_server to an ioreq server
>>
>> On Tue, Apr 19, 2016 at 12:59 PM, Yu, Zhang <yu.c.zhang@linux.intel.com>
>> wrote:
>> >>> So what about the VM suspend case you mentioned above? Will that
>> trigger
>> >>> the unmapping of ioreq server? Could the device model also take the
>> role
>> >>> to change the p2m type back in such case?
>> >>
>> >>
>> >> Yes. The device model has to be told by the toolstack that the VM is
>> >> suspending, otherwise it can't disable the ioreq server which puts the
>> >> shared ioreq pages back into the guest p2m. If that's not done then the
>> >> pages will be leaked.
>> >>
>> >>>
>> >>> It would be much simpler if hypervisor side does not need to provide
>> >>> the p2m resetting logic, and we can support live migration at the same
>> >>> time then. :)
>> >>>
>> >>
>> >> That really should not be hypervisor's job.
>> >>
>> >>    Paul
>> >>
>> >
>> > Oh. So let's just remove the p2m type recalculation code from this
>> > patch, no need to call p2m_change_entry_type_global, and no need to
>> > worry about the log dirty part.
>> >
>> > George, do you think this acceptable?
>>
>> Sorry, just to make sure I understand your intentions:
>>
>> 1. The ioreq servers will be responsible for changing all the
>> p2m_ioreq_server types back to p2m_ram_rw before detaching
>> 2. Xen will refuse to set logdirty mode if there are any attached ioreq servers
>>
>> The one problem with #1 is what to do if the ioreq server is buggy /
>> killed, and forgets / is unable to clean up its ioreq_server entries?
>>
>
> If we ever want to meaningfully migrate a VM with XenGT enabled then I think it has to be possible to migrate the p2m_ioreq_server page types as-is. I'd expect a new instance of the xengt emulator to claim them on the far and any emulator state be transferred in a similar way to which qemu state is transferred such that vgpu emulation can be continued after the migration.
> After all, it's not like we magically reset the type of mmio-dm pages on migration. It is just expected that a new instance of QEMU is spawned to take care of them on the far end when the VM is unpaused.

I don't see the implication you're trying to make regarding the
attach/detach interface.

In any case, is it really not the case that the devicemodels on the
far side have to re-register which IO ranges they're interested in?
Shouldn't the devicemodel on the far side re-register the pfns it
wants to watch as well?  That again seems like the most robust
solution, rather than assuming that the pfns you expect to have been
marked are already marked.  To do otherwise risks bugs where 1) pages
you think were being watched aren't 2) pages being watched that you
don't care about (and you don't think to un-watch them because you
didn't think you'd watched them in the first place).

On a slightly different topic: come to think of it, there's no need to
switch the p2m type for ioreq_server when we turn on logdirty mode; we
just have to call paging_mark_dirty() if it's a write, don't we?

 -George
Paul Durrant April 20, 2016, 4:30 p.m. UTC | #40
> -----Original Message-----

> From: George Dunlap

> Sent: 20 April 2016 16:38

> To: Paul Durrant

> Cc: Yu, Zhang; Kevin Tian; jun.nakajima@intel.com; Andrew Cooper; Tim

> (Xen.org); xen-devel@lists.xen.org; Lv, Zhiyuan; Jan Beulich

> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to

> map guest ram with p2m_ioreq_server to an ioreq server

> 

> On Wed, Apr 20, 2016 at 3:57 PM, Paul Durrant <Paul.Durrant@citrix.com>

> wrote:

> >> -----Original Message-----

> >> From: George Dunlap

> >> Sent: 20 April 2016 15:50

> >> To: Yu, Zhang

> >> Cc: Paul Durrant; xen-devel@lists.xen.org; Kevin Tian; Jan Beulich;

> Andrew

> >> Cooper; Tim (Xen.org); Lv, Zhiyuan; jun.nakajima@intel.com

> >> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to

> >> map guest ram with p2m_ioreq_server to an ioreq server

> >>

> >> On Tue, Apr 19, 2016 at 12:59 PM, Yu, Zhang

> <yu.c.zhang@linux.intel.com>

> >> wrote:

> >> >>> So what about the VM suspend case you mentioned above? Will that

> >> trigger

> >> >>> the unmapping of ioreq server? Could the device model also take the

> >> role

> >> >>> to change the p2m type back in such case?

> >> >>

> >> >>

> >> >> Yes. The device model has to be told by the toolstack that the VM is

> >> >> suspending, otherwise it can't disable the ioreq server which puts the

> >> >> shared ioreq pages back into the guest p2m. If that's not done then

> the

> >> >> pages will be leaked.

> >> >>

> >> >>>

> >> >>> It would be much simpler if hypervisor side does not need to provide

> >> >>> the p2m resetting logic, and we can support live migration at the

> same

> >> >>> time then. :)

> >> >>>

> >> >>

> >> >> That really should not be hypervisor's job.

> >> >>

> >> >>    Paul

> >> >>

> >> >

> >> > Oh. So let's just remove the p2m type recalculation code from this

> >> > patch, no need to call p2m_change_entry_type_global, and no need to

> >> > worry about the log dirty part.

> >> >

> >> > George, do you think this acceptable?

> >>

> >> Sorry, just to make sure I understand your intentions:

> >>

> >> 1. The ioreq servers will be responsible for changing all the

> >> p2m_ioreq_server types back to p2m_ram_rw before detaching

> >> 2. Xen will refuse to set logdirty mode if there are any attached ioreq

> servers

> >>

> >> The one problem with #1 is what to do if the ioreq server is buggy /

> >> killed, and forgets / is unable to clean up its ioreq_server entries?

> >>

> >

> > If we ever want to meaningfully migrate a VM with XenGT enabled then I

> think it has to be possible to migrate the p2m_ioreq_server page types as-is.

> I'd expect a new instance of the xengt emulator to claim them on the far and

> any emulator state be transferred in a similar way to which qemu state is

> transferred such that vgpu emulation can be continued after the migration.

> > After all, it's not like we magically reset the type of mmio-dm pages on

> migration. It is just expected that a new instance of QEMU is spawned to

> take care of them on the far end when the VM is unpaused.

> 

> I don't see the implication you're trying to make regarding the

> attach/detach interface.

> 

> In any case, is it really not the case that the devicemodels on the

> far side have to re-register which IO ranges they're interested in?

> Shouldn't the devicemodel on the far side re-register the pfns it

> wants to watch as well?  That again seems like the most robust

> solution, rather than assuming that the pfns you expect to have been

> marked are already marked.  To do otherwise risks bugs where 1) pages

> you think were being watched aren't 2) pages being watched that you

> don't care about (and you don't think to un-watch them because you

> didn't think you'd watched them in the first place).

> 


That sounds a little inefficient when there could be thousands of pages in play. If the p2m types were faithfully reproduced in the receiving domain then the new incarnation of the emulator just needs to claim the type rather than making thousands of set-mem-type hypercalls as well. I agree that the information would need to be transferred though, otherwise the new incarnation wouldn't be able reset the page types when it was done with them.

> On a slightly different topic: come to think of it, there's no need to

> switch the p2m type for ioreq_server when we turn on logdirty mode; we

> just have to call paging_mark_dirty() if it's a write, don't we?

> 


That sounds right to me.

  Paul

>  -George
George Dunlap April 20, 2016, 4:58 p.m. UTC | #41
On Wed, Apr 20, 2016 at 5:30 PM, Paul Durrant <Paul.Durrant@citrix.com> wrote:
>> In any case, is it really not the case that the devicemodels on the
>> far side have to re-register which IO ranges they're interested in?
>> Shouldn't the devicemodel on the far side re-register the pfns it
>> wants to watch as well?  That again seems like the most robust
>> solution, rather than assuming that the pfns you expect to have been
>> marked are already marked.  To do otherwise risks bugs where 1) pages
>> you think were being watched aren't 2) pages being watched that you
>> don't care about (and you don't think to un-watch them because you
>> didn't think you'd watched them in the first place).
>>
>
> That sounds a little inefficient when there could be thousands of pages in play. If the p2m types were faithfully reproduced in the receiving domain then the new incarnation of the emulator just needs to claim the type rather than making thousands of set-mem-type hypercalls as well. I agree that the information would need to be transferred though, otherwise the new incarnation wouldn't be able reset the page types when it was done with them.

The emulator could presumably go through and re-shadow the GPTs as
they were used, rather than all at once, right?

I still think that flushing all types on disconnect and having the
ioreq server on the far side re-protect things as it wants to is a
more robust solution.  But ultimately you're probably going to be the
one who's going to have to deal with any bugs related to this, so I'd
leave it between you and Yu Zhang.

 -George
Yu Zhang April 21, 2016, 1:21 p.m. UTC | #42
Thank you, George.

On 4/20/2016 10:50 PM, George Dunlap wrote:
> On Tue, Apr 19, 2016 at 12:59 PM, Yu, Zhang <yu.c.zhang@linux.intel.com> wrote:
>>>> So what about the VM suspend case you mentioned above? Will that trigger
>>>> the unmapping of ioreq server? Could the device model also take the role
>>>> to change the p2m type back in such case?
>>>
>>>
>>> Yes. The device model has to be told by the toolstack that the VM is
>>> suspending, otherwise it can't disable the ioreq server which puts the
>>> shared ioreq pages back into the guest p2m. If that's not done then the
>>> pages will be leaked.
>>>
>>>>
>>>> It would be much simpler if hypervisor side does not need to provide
>>>> the p2m resetting logic, and we can support live migration at the same
>>>> time then. :)
>>>>
>>>
>>> That really should not be hypervisor's job.
>>>
>>>    Paul
>>>
>>
>> Oh. So let's just remove the p2m type recalculation code from this
>> patch, no need to call p2m_change_entry_type_global, and no need to
>> worry about the log dirty part.
>>
>> George, do you think this acceptable?
>
> Sorry, just to make sure I understand your intentions:
>
> 1. The ioreq servers will be responsible for changing all the
> p2m_ioreq_server types back to p2m_ram_rw before detaching

In this version, this is done by calling p2m_change_entry_type_global()
*after* the ioreq server detaching.

But Paul & I now agreed not to reset p2m type in hypervisor, it should
be the device model's responsibility to do so. Even if buggy/malicious
device model fails to do so, the only affected one is the HVM being
serviced, neither hypervisor nor other VMs will be hurt.

> 2. Xen will refuse to set logdirty mode if there are any attached ioreq servers
>

In this version, I included p2m_ioreq_server to P2M_CHANGEABLE_TYPES,
so that we can use p2m_change_entry_global() for the auto resetting of
p2m_ioreq_server, yet later realized that if live migration happens
during a normal emulation process(before the detachment of the ioreq
server), entries with p2m_ioreq_server will be changed to p2m_log_dirty
(later to p2m_ram_rw in ept violation), therefore write operations will
not be able to forward to the device model.

I also realized it impossible to both keep the p2m auto resetting for
ioreq server detachment, and also support the live migration. So I once
suggested this option #2.

But it is not necessary now that we do not need the p2m resetting code.

> The one problem with #1 is what to do if the ioreq server is buggy /
> killed, and forgets / is unable to clean up its ioreq_server entries?
>
> There's a certain symmetry with the idea of having the ioreq server
> responsible both for mapping and unmapping; and it would be nice to
> have this stuff work for shadow mode as well.  But the automatic type

About shadow mode, it is not supported in this version, because routine
p2m_change_entry_global() is only supported on HAP mode. But I still
would like to keep shadow mode out in next version, because we have no
such usage model and if there's any bug it would be hard for us to find.

> change seems like a more robust option.  (Still open to persuasion on
> this.)
>
> On another note: the hard freeze is long past the already-extended
> deadline, so I was assuming this was 4.8 material at this point.
>

It is a pity. And I'm the one to be blamed for. Thank you very much
for all the advice and help. I'll continue to work on this feature
till no matter whichever version it is accepted to. :)

B.R.
Yu
Yu Zhang April 21, 2016, 1:28 p.m. UTC | #43
On 4/21/2016 12:30 AM, Paul Durrant wrote:
>> -----Original Message-----
>> From: George Dunlap
>> Sent: 20 April 2016 16:38
>> To: Paul Durrant
>> Cc: Yu, Zhang; Kevin Tian; jun.nakajima@intel.com; Andrew Cooper; Tim
>> (Xen.org); xen-devel@lists.xen.org; Lv, Zhiyuan; Jan Beulich
>> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to
>> map guest ram with p2m_ioreq_server to an ioreq server
>>
>> On Wed, Apr 20, 2016 at 3:57 PM, Paul Durrant <Paul.Durrant@citrix.com>
>> wrote:
>>>> -----Original Message-----
>>>> From: George Dunlap
>>>> Sent: 20 April 2016 15:50
>>>> To: Yu, Zhang
>>>> Cc: Paul Durrant; xen-devel@lists.xen.org; Kevin Tian; Jan Beulich;
>> Andrew
>>>> Cooper; Tim (Xen.org); Lv, Zhiyuan; jun.nakajima@intel.com
>>>> Subject: Re: [Xen-devel] [PATCH v2 3/3] x86/ioreq server: Add HVMOP to
>>>> map guest ram with p2m_ioreq_server to an ioreq server
>>>>
>>>> On Tue, Apr 19, 2016 at 12:59 PM, Yu, Zhang
>> <yu.c.zhang@linux.intel.com>
>>>> wrote:
>>>>>>> So what about the VM suspend case you mentioned above? Will that
>>>> trigger
>>>>>>> the unmapping of ioreq server? Could the device model also take the
>>>> role
>>>>>>> to change the p2m type back in such case?
>>>>>>
>>>>>>
>>>>>> Yes. The device model has to be told by the toolstack that the VM is
>>>>>> suspending, otherwise it can't disable the ioreq server which puts the
>>>>>> shared ioreq pages back into the guest p2m. If that's not done then
>> the
>>>>>> pages will be leaked.
>>>>>>
>>>>>>>
>>>>>>> It would be much simpler if hypervisor side does not need to provide
>>>>>>> the p2m resetting logic, and we can support live migration at the
>> same
>>>>>>> time then. :)
>>>>>>>
>>>>>>
>>>>>> That really should not be hypervisor's job.
>>>>>>
>>>>>>    Paul
>>>>>>
>>>>>
>>>>> Oh. So let's just remove the p2m type recalculation code from this
>>>>> patch, no need to call p2m_change_entry_type_global, and no need to
>>>>> worry about the log dirty part.
>>>>>
>>>>> George, do you think this acceptable?
>>>>
>>>> Sorry, just to make sure I understand your intentions:
>>>>
>>>> 1. The ioreq servers will be responsible for changing all the
>>>> p2m_ioreq_server types back to p2m_ram_rw before detaching
>>>> 2. Xen will refuse to set logdirty mode if there are any attached ioreq
>> servers
>>>>
>>>> The one problem with #1 is what to do if the ioreq server is buggy /
>>>> killed, and forgets / is unable to clean up its ioreq_server entries?
>>>>
>>>
>>> If we ever want to meaningfully migrate a VM with XenGT enabled then I
>> think it has to be possible to migrate the p2m_ioreq_server page types as-is.
>> I'd expect a new instance of the xengt emulator to claim them on the far and
>> any emulator state be transferred in a similar way to which qemu state is
>> transferred such that vgpu emulation can be continued after the migration.
>>> After all, it's not like we magically reset the type of mmio-dm pages on
>> migration. It is just expected that a new instance of QEMU is spawned to
>> take care of them on the far end when the VM is unpaused.
>>
>> I don't see the implication you're trying to make regarding the
>> attach/detach interface.
>>
>> In any case, is it really not the case that the devicemodels on the
>> far side have to re-register which IO ranges they're interested in?
>> Shouldn't the devicemodel on the far side re-register the pfns it
>> wants to watch as well?  That again seems like the most robust
>> solution, rather than assuming that the pfns you expect to have been
>> marked are already marked.  To do otherwise risks bugs where 1) pages
>> you think were being watched aren't 2) pages being watched that you
>> don't care about (and you don't think to un-watch them because you
>> didn't think you'd watched them in the first place).
>>
>
> That sounds a little inefficient when there could be thousands of pages in play. If the p2m types were faithfully reproduced in the receiving domain then the new incarnation of the emulator just needs to claim the type rather than making thousands of set-mem-type hypercalls as well. I agree that the information would need to be transferred though, otherwise the new incarnation wouldn't be able reset the page types when it was done with them.
>

I agree. For XenGT to live migrate, device model may probably actively
take role like this.

>> On a slightly different topic: come to think of it, there's no need to
>> switch the p2m type for ioreq_server when we turn on logdirty mode; we
>> just have to call paging_mark_dirty() if it's a write, don't we?
>>

Yes. Unlike the log dirty pages, which will be changed back to
p2m_ram_rw after the first ept violation(to keep it simple, no
intel PML considered), pages with p2m_ioreq_server will continue
to be trapped. Since this patch series is only for the basic
functionality of XenGT, I suggest we do not include this until
we plan to support XenGT live migration in the future. And at
that time, some kind optimization may be necessary(i.e. how to
remove the redundant paging_mark_dirty for write protected pages).


>
> That sounds right to me.
>
>   Paul
>
>>  -George

B.R.
Yu
Wei Liu April 22, 2016, 11:27 a.m. UTC | #44
On Thu, Apr 21, 2016 at 09:21:17PM +0800, Yu, Zhang wrote:
[...]
> It is a pity. And I'm the one to be blamed for. Thank you very much

Please don't blame yourself. I don't think anyone should be blamed here.

I can only speak for myself, but I feel grateful when other people want
to collaborate with upstream projects.  We just need to realise
collaboration takes time and effort and things don't always come out as
expected.  I've been on both sides, both as contributor and maintainer
so I understand the frustration.

I want to publicly thank you (and all other engineers at Intel) for your
persistent effort for pushing things upstream and I believe things will
work out in the end.

Wei.
George Dunlap April 22, 2016, 11:30 a.m. UTC | #45
On 22/04/16 12:27, Wei Liu wrote:
> On Thu, Apr 21, 2016 at 09:21:17PM +0800, Yu, Zhang wrote:
> [...]
>> It is a pity. And I'm the one to be blamed for. Thank you very much
> 
> Please don't blame yourself. I don't think anyone should be blamed here.

There are lots of things that might have led to a different outcome:

1. Yu Zhang may have noticed the issues earlier
2. Paul or I may have noticed the issues earlier
3. One of the x86 maintainers might have given a review earlier and
caught the issues
4. Yu Zhang, Paul, or I may have pinged one of the x86 maintainers to
get the review earlier

We all tried, but we all could perhaps have done a bit better.  It is a
shame it missed it by so little, but that's part of the reason we're
shortening the release cycles. :-)

 -George
diff mbox

Patch

diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index ddc8007..77a4793 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -94,11 +94,69 @@  static const struct hvm_io_handler null_handler = {
     .ops = &null_ops
 };
 
+static int mem_read(const struct hvm_io_handler *io_handler,
+                    uint64_t addr,
+                    uint32_t size,
+                    uint64_t *data)
+{
+    struct domain *currd = current->domain;
+    unsigned long gmfn = paddr_to_pfn(addr);
+    unsigned long offset = addr & ~PAGE_MASK;
+    struct page_info *page = get_page_from_gfn(currd, gmfn, NULL, P2M_UNSHARE);
+    uint8_t *p;
+
+    if ( !page )
+        return X86EMUL_UNHANDLEABLE;
+
+    p = __map_domain_page(page);
+    p += offset;
+    memcpy(data, p, size);
+
+    unmap_domain_page(p);
+    put_page(page);
+
+    return X86EMUL_OKAY;
+}
+
+static int mem_write(const struct hvm_io_handler *handler,
+                     uint64_t addr,
+                     uint32_t size,
+                     uint64_t data)
+{
+    struct domain *currd = current->domain;
+    unsigned long gmfn = paddr_to_pfn(addr);
+    unsigned long offset = addr & ~PAGE_MASK;
+    struct page_info *page = get_page_from_gfn(currd, gmfn, NULL, P2M_UNSHARE);
+    uint8_t *p;
+
+    if ( !page )
+        return X86EMUL_UNHANDLEABLE;
+
+    p = __map_domain_page(page);
+    p += offset;
+    memcpy(p, &data, size);
+
+    unmap_domain_page(p);
+    put_page(page);
+
+    return X86EMUL_OKAY;
+}
+
+static const struct hvm_io_ops mem_ops = {
+    .read = mem_read,
+    .write = mem_write
+};
+
+static const struct hvm_io_handler mem_handler = {
+    .ops = &mem_ops
+};
+
 static int hvmemul_do_io(
     bool_t is_mmio, paddr_t addr, unsigned long reps, unsigned int size,
     uint8_t dir, bool_t df, bool_t data_is_addr, uintptr_t data)
 {
     struct vcpu *curr = current;
+    struct domain *currd = curr->domain;
     struct hvm_vcpu_io *vio = &curr->arch.hvm_vcpu.hvm_io;
     ioreq_t p = {
         .type = is_mmio ? IOREQ_TYPE_COPY : IOREQ_TYPE_PIO,
@@ -140,7 +198,7 @@  static int hvmemul_do_io(
              (p.dir != dir) ||
              (p.df != df) ||
              (p.data_is_ptr != data_is_addr) )
-            domain_crash(curr->domain);
+            domain_crash(currd);
 
         if ( data_is_addr )
             return X86EMUL_UNHANDLEABLE;
@@ -168,13 +226,72 @@  static int hvmemul_do_io(
         break;
     case X86EMUL_UNHANDLEABLE:
     {
-        struct hvm_ioreq_server *s =
-            hvm_select_ioreq_server(curr->domain, &p);
+        struct hvm_ioreq_server *s;
+        p2m_type_t p2mt;
+
+        if ( is_mmio )
+        {
+            unsigned long gmfn = paddr_to_pfn(addr);
+
+            (void) get_gfn_query_unlocked(currd, gmfn, &p2mt);
+
+            switch ( p2mt )
+            {
+                case p2m_ioreq_server:
+                {
+                    unsigned long flags;
+
+                    p2m_get_ioreq_server(currd, &flags, &s);
+
+                    if ( !s )
+                        break;
+
+                    if ( (dir == IOREQ_READ &&
+                          !(flags & P2M_IOREQ_HANDLE_READ_ACCESS)) ||
+                         (dir == IOREQ_WRITE &&
+                          !(flags & P2M_IOREQ_HANDLE_WRITE_ACCESS)) )
+                        s = NULL;
+
+                    break;
+                }
+                case p2m_ram_rw:
+                    s = NULL;
+                    break;
+
+                default:
+                    s = hvm_select_ioreq_server(currd, &p);
+                    break;
+            }
+        }
+        else
+        {
+            p2mt = p2m_invalid;
+
+            s = hvm_select_ioreq_server(currd, &p);
+        }
 
         /* If there is no suitable backing DM, just ignore accesses */
         if ( !s )
         {
-            rc = hvm_process_io_intercept(&null_handler, &p);
+            switch ( p2mt )
+            {
+            case p2m_ioreq_server:
+            /*
+             * Race conditions may exist when access to a gfn with
+             * p2m_ioreq_server is intercepted by hypervisor, during
+             * which time p2m type of this gfn is recalculated back
+             * to p2m_ram_rw. mem_handler is used to handle this
+             * corner case.
+             */
+            case p2m_ram_rw:
+                rc = hvm_process_io_intercept(&mem_handler, &p);
+                break;
+
+            default:
+                rc = hvm_process_io_intercept(&null_handler, &p);
+                break;
+            }
+
             vio->io_req.state = STATE_IOREQ_NONE;
         }
         else
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index bec6a8a..ba1de00 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1252,6 +1252,8 @@  static int hvm_destroy_ioreq_server(struct domain *d, ioservid_t id)
 
         domain_pause(d);
 
+        p2m_destroy_ioreq_server(d, s);
+
         hvm_ioreq_server_disable(s, 0);
 
         list_del(&s->list_entry);
@@ -1411,6 +1413,47 @@  static int hvm_unmap_io_range_from_ioreq_server(struct domain *d, ioservid_t id,
     return rc;
 }
 
+static int hvm_map_mem_type_to_ioreq_server(struct domain *d,
+                                            ioservid_t id,
+                                            hvmmem_type_t type,
+                                            uint32_t flags)
+{
+    struct hvm_ioreq_server *s;
+    int rc;
+
+    /* For now, only HVMMEM_ioreq_server is supported */
+    if ( type != HVMMEM_ioreq_server )
+        return -EINVAL;
+
+    if ( flags & ~(HVMOP_IOREQ_MEM_ACCESS_READ |
+                   HVMOP_IOREQ_MEM_ACCESS_WRITE) )
+        return -EINVAL;
+
+    spin_lock(&d->arch.hvm_domain.ioreq_server.lock);
+
+    rc = -ENOENT;
+    list_for_each_entry ( s,
+                          &d->arch.hvm_domain.ioreq_server.list,
+                          list_entry )
+    {
+        if ( s == d->arch.hvm_domain.default_ioreq_server )
+            continue;
+
+        if ( s->id == id )
+        {
+            rc = p2m_set_ioreq_server(d, flags, s);
+            if ( rc == 0 )
+                gdprintk(XENLOG_DEBUG, "%u %s type HVMMEM_ioreq_server.\n",
+                         s->id, (flags != 0) ? "mapped to" : "unmapped from");
+
+            break;
+        }
+    }
+
+    spin_unlock(&d->arch.hvm_domain.ioreq_server.lock);
+    return rc;
+}
+
 static int hvm_set_ioreq_server_state(struct domain *d, ioservid_t id,
                                       bool_t enabled)
 {
@@ -3164,9 +3207,9 @@  int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
      * If this GFN is emulated MMIO or marked as read-only, pass the fault
      * to the mmio handler.
      */
-    if ( (p2mt == p2m_mmio_dm) || 
-         (npfec.write_access &&
-          (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server))) )
+    if ( (p2mt == p2m_mmio_dm) ||
+         (p2mt == p2m_ioreq_server) ||
+         (npfec.write_access && p2m_is_discard_write(p2mt)) )
     {
         __put_gfn(p2m, gfn);
         if ( ap2m_active )
@@ -5979,6 +6022,40 @@  static int hvmop_unmap_io_range_from_ioreq_server(
     return rc;
 }
 
+static int hvmop_map_mem_type_to_ioreq_server(
+    XEN_GUEST_HANDLE_PARAM(xen_hvm_map_mem_type_to_ioreq_server_t) uop)
+{
+    xen_hvm_map_mem_type_to_ioreq_server_t op;
+    struct domain *d;
+    int rc;
+
+    if ( copy_from_guest(&op, uop, 1) )
+        return -EFAULT;
+
+    rc = rcu_lock_remote_domain_by_id(op.domid, &d);
+    if ( rc != 0 )
+        return rc;
+
+    rc = -EINVAL;
+    if ( !is_hvm_domain(d) )
+        goto out;
+
+    /* For now, only support for HAP enabled hvm */
+    if ( !hap_enabled(d) )
+        goto out;
+
+    rc = xsm_hvm_ioreq_server(XSM_DM_PRIV, d,
+                              HVMOP_map_mem_type_to_ioreq_server);
+    if ( rc != 0 )
+        goto out;
+
+    rc = hvm_map_mem_type_to_ioreq_server(d, op.id, op.type, op.flags);
+
+ out:
+    rcu_unlock_domain(d);
+    return rc;
+}
+
 static int hvmop_set_ioreq_server_state(
     XEN_GUEST_HANDLE_PARAM(xen_hvm_set_ioreq_server_state_t) uop)
 {
@@ -6613,8 +6690,7 @@  static int hvmop_get_mem_type(
 static bool_t hvm_allow_p2m_type_change(p2m_type_t old, p2m_type_t new)
 {
     if ( p2m_is_ram(old) ||
-         (p2m_is_hole(old) && new == p2m_mmio_dm) ||
-         (old == p2m_ioreq_server && new == p2m_ram_rw) )
+         (p2m_is_hole(old) && new == p2m_mmio_dm) )
         return 1;
 
     return 0;
@@ -6648,6 +6724,10 @@  static int hvmop_set_mem_type(
     if ( !is_hvm_domain(d) )
         goto out;
 
+    /* For now, HVMMEM_ioreq_server is only supported for HAP enabled hvm. */
+    if ( a.hvmmem_type == HVMMEM_ioreq_server && !hap_enabled(d) )
+        goto out;
+
     rc = xsm_hvm_control(XSM_DM_PRIV, d, HVMOP_set_mem_type);
     if ( rc )
         goto out;
@@ -6748,6 +6828,11 @@  long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
             guest_handle_cast(arg, xen_hvm_io_range_t));
         break;
 
+    case HVMOP_map_mem_type_to_ioreq_server:
+        rc = hvmop_map_mem_type_to_ioreq_server(
+            guest_handle_cast(arg, xen_hvm_map_mem_type_to_ioreq_server_t));
+        break;
+
     case HVMOP_set_ioreq_server_state:
         rc = hvmop_set_ioreq_server_state(
             guest_handle_cast(arg, xen_hvm_set_ioreq_server_state_t));
diff --git a/xen/arch/x86/mm/hap/nested_hap.c b/xen/arch/x86/mm/hap/nested_hap.c
index 9cee5a0..bbb6d85 100644
--- a/xen/arch/x86/mm/hap/nested_hap.c
+++ b/xen/arch/x86/mm/hap/nested_hap.c
@@ -174,7 +174,7 @@  nestedhap_walk_L0_p2m(struct p2m_domain *p2m, paddr_t L1_gpa, paddr_t *L0_gpa,
     if ( *p2mt == p2m_mmio_direct )
         goto direct_mmio_out;
     rc = NESTEDHVM_PAGEFAULT_MMIO;
-    if ( *p2mt == p2m_mmio_dm )
+    if ( *p2mt == p2m_mmio_dm || *p2mt == p2m_ioreq_server )
         goto out;
 
     rc = NESTEDHVM_PAGEFAULT_L0_ERROR;
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 380ec25..854e158 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -132,6 +132,19 @@  static void ept_p2m_type_to_flags(struct p2m_domain *p2m, ept_entry_t *entry,
             entry->r = entry->w = entry->x = 1;
             entry->a = entry->d = !!cpu_has_vmx_ept_ad;
             break;
+        case p2m_ioreq_server:
+            entry->r = !(p2m->ioreq.flags & P2M_IOREQ_HANDLE_READ_ACCESS);
+	    /*
+	     * write access right is disabled when entry->r is 0, but whether
+	     * write accesses are emulated by hypervisor or forwarded to an
+	     * ioreq server depends on the setting of p2m->ioreq.flags.
+	     */
+            entry->w = (entry->r &&
+                        !(p2m->ioreq.flags & P2M_IOREQ_HANDLE_WRITE_ACCESS));
+            entry->x = entry->r;
+            entry->a = !!cpu_has_vmx_ept_ad;
+            entry->d = entry->w && cpu_has_vmx_ept_ad;
+            break;
         case p2m_mmio_direct:
             entry->r = entry->x = 1;
             entry->w = !rangeset_contains_singleton(mmio_ro_ranges,
@@ -171,7 +184,6 @@  static void ept_p2m_type_to_flags(struct p2m_domain *p2m, ept_entry_t *entry,
             entry->a = entry->d = !!cpu_has_vmx_ept_ad;
             break;
         case p2m_grant_map_ro:
-        case p2m_ioreq_server:
             entry->r = 1;
             entry->w = entry->x = 0;
             entry->a = !!cpu_has_vmx_ept_ad;
diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index eabd2e3..7a0ddb8 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -72,8 +72,8 @@  static const unsigned long pgt[] = {
     PGT_l3_page_table
 };
 
-static unsigned long p2m_type_to_flags(p2m_type_t t, mfn_t mfn,
-                                       unsigned int level)
+static unsigned long p2m_type_to_flags(struct p2m_domain *p2m, p2m_type_t t,
+                                       mfn_t mfn, unsigned int level)
 {
     unsigned long flags;
     /*
@@ -94,8 +94,18 @@  static unsigned long p2m_type_to_flags(p2m_type_t t, mfn_t mfn,
     default:
         return flags | _PAGE_NX_BIT;
     case p2m_grant_map_ro:
-    case p2m_ioreq_server:
         return flags | P2M_BASE_FLAGS | _PAGE_NX_BIT;
+    case p2m_ioreq_server:
+    {
+        flags |= P2M_BASE_FLAGS | _PAGE_RW;
+
+        if ( p2m->ioreq.flags & P2M_IOREQ_HANDLE_READ_ACCESS )
+            return flags & ~(_PAGE_PRESENT | _PAGE_RW);
+        else if ( p2m->ioreq.flags & P2M_IOREQ_HANDLE_WRITE_ACCESS )
+            return flags & ~_PAGE_RW;
+        else
+            return flags;
+    }
     case p2m_ram_ro:
     case p2m_ram_logdirty:
     case p2m_ram_shared:
@@ -442,7 +452,8 @@  static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
             p2m_type_t p2mt = p2m_is_logdirty_range(p2m, gfn & mask, gfn | ~mask)
                               ? p2m_ram_logdirty : p2m_ram_rw;
             unsigned long mfn = l1e_get_pfn(e);
-            unsigned long flags = p2m_type_to_flags(p2mt, _mfn(mfn), level);
+            unsigned long flags = p2m_type_to_flags(p2m, p2mt,
+                                                    _mfn(mfn), level);
 
             if ( level )
             {
@@ -579,7 +590,7 @@  p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
         ASSERT(!mfn_valid(mfn) || p2mt != p2m_mmio_direct);
         l3e_content = mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt)
             ? l3e_from_pfn(mfn_x(mfn),
-                           p2m_type_to_flags(p2mt, mfn, 2) | _PAGE_PSE)
+                           p2m_type_to_flags(p2m, p2mt, mfn, 2) | _PAGE_PSE)
             : l3e_empty();
         entry_content.l1 = l3e_content.l3;
 
@@ -615,7 +626,7 @@  p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
 
         if ( mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt) )
             entry_content = p2m_l1e_from_pfn(mfn_x(mfn),
-                                             p2m_type_to_flags(p2mt, mfn, 0));
+                                         p2m_type_to_flags(p2m, p2mt, mfn, 0));
         else
             entry_content = l1e_empty();
 
@@ -651,7 +662,7 @@  p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
         ASSERT(!mfn_valid(mfn) || p2mt != p2m_mmio_direct);
         if ( mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt) )
             l2e_content = l2e_from_pfn(mfn_x(mfn),
-                                       p2m_type_to_flags(p2mt, mfn, 1) |
+                                       p2m_type_to_flags(p2m, p2mt, mfn, 1) |
                                        _PAGE_PSE);
         else
             l2e_content = l2e_empty();
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index b3fce1b..f7d2f60 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -83,6 +83,8 @@  static int p2m_initialise(struct domain *d, struct p2m_domain *p2m)
     else
         p2m_pt_init(p2m);
 
+    spin_lock_init(&p2m->ioreq.lock);
+
     return ret;
 }
 
@@ -289,6 +291,86 @@  void p2m_memory_type_changed(struct domain *d)
     }
 }
 
+int p2m_set_ioreq_server(struct domain *d,
+                         unsigned long flags,
+                         struct hvm_ioreq_server *s)
+{
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    int rc;
+
+    spin_lock(&p2m->ioreq.lock);
+
+    rc = -EBUSY;
+    if ( (flags != 0) && (p2m->ioreq.server != NULL) )
+        goto out;
+
+    rc = -EINVAL;
+    /* unmap ioreq server from p2m type by passing flags with 0 */
+    if ( (flags == 0) && (p2m->ioreq.server != s) )
+        goto out;
+
+    if ( flags == 0 )
+    {
+        p2m->ioreq.server = NULL;
+        p2m->ioreq.flags = 0;
+    }
+    else
+    {
+        p2m->ioreq.server = s;
+        p2m->ioreq.flags = flags;
+    }
+
+    /*
+     * Each time we map/unmap an ioreq server to/from p2m_ioreq_server,
+     * we mark the p2m table to be recalculated, so that gfns which were
+     * previously marked with p2m_ioreq_server can be resynced.
+     */
+    p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
+
+    rc = 0;
+
+out:
+    spin_unlock(&p2m->ioreq.lock);
+
+    return rc;
+}
+
+void p2m_get_ioreq_server(struct domain *d,
+                          unsigned long *flags,
+                          struct hvm_ioreq_server **s)
+{
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+    spin_lock(&p2m->ioreq.lock);
+
+    *s = p2m->ioreq.server;
+    *flags = p2m->ioreq.flags;
+
+    spin_unlock(&p2m->ioreq.lock);
+}
+
+void p2m_destroy_ioreq_server(struct domain *d,
+                              struct hvm_ioreq_server *s)
+{
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+    spin_lock(&p2m->ioreq.lock);
+
+    if ( p2m->ioreq.server == s )
+    {
+        p2m->ioreq.server = NULL;
+        p2m->ioreq.flags = 0;
+
+        /*
+         * Mark p2m table to be recalculated, so that gfns which were
+         * previously marked with p2m_ioreq_server can be resynced.
+         */
+        p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
+    }
+
+    spin_unlock(&p2m->ioreq.lock);
+}
+
 void p2m_enable_hardware_log_dirty(struct domain *d)
 {
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index c81302a..2e0d258 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3224,8 +3224,7 @@  static int sh_page_fault(struct vcpu *v,
     }
 
     /* Need to hand off device-model MMIO to the device model */
-    if ( p2mt == p2m_mmio_dm
-         || (p2mt == p2m_ioreq_server && ft == ft_demand_write) )
+    if ( p2mt == p2m_mmio_dm )
     {
         gpa = guest_walk_to_gpa(&gw);
         goto mmio;
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index ee2ea9c..8f925ac 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -89,7 +89,8 @@  typedef unsigned int p2m_query_t;
                        | p2m_to_mask(p2m_ram_paging_out)      \
                        | p2m_to_mask(p2m_ram_paged)           \
                        | p2m_to_mask(p2m_ram_paging_in)       \
-                       | p2m_to_mask(p2m_ram_shared))
+                       | p2m_to_mask(p2m_ram_shared)          \
+                       | p2m_to_mask(p2m_ioreq_server))
 
 /* Types that represent a physmap hole that is ok to replace with a shared
  * entry */
@@ -111,8 +112,7 @@  typedef unsigned int p2m_query_t;
 #define P2M_RO_TYPES (p2m_to_mask(p2m_ram_logdirty)     \
                       | p2m_to_mask(p2m_ram_ro)         \
                       | p2m_to_mask(p2m_grant_map_ro)   \
-                      | p2m_to_mask(p2m_ram_shared)     \
-                      | p2m_to_mask(p2m_ioreq_server))
+                      | p2m_to_mask(p2m_ram_shared))
 
 /* Write-discard types, which should discard the write operations */
 #define P2M_DISCARD_WRITE_TYPES (p2m_to_mask(p2m_ram_ro)     \
@@ -120,7 +120,8 @@  typedef unsigned int p2m_query_t;
 
 /* Types that can be subject to bulk transitions. */
 #define P2M_CHANGEABLE_TYPES (p2m_to_mask(p2m_ram_rw) \
-                              | p2m_to_mask(p2m_ram_logdirty) )
+                              | p2m_to_mask(p2m_ram_logdirty) \
+                              | p2m_to_mask(p2m_ioreq_server) )
 
 #define P2M_POD_TYPES (p2m_to_mask(p2m_populate_on_demand))
 
@@ -320,6 +321,27 @@  struct p2m_domain {
         struct ept_data ept;
         /* NPT-equivalent structure could be added here. */
     };
+
+    struct {
+        spinlock_t lock;
+        /*
+         * ioreq server who's responsible for the emulation of
+         * gfns with specific p2m type(for now, p2m_ioreq_server).
+         * Behaviors of gfns with p2m_ioreq_server set but no
+         * ioreq server mapped in advance should be the same as
+         * p2m_ram_rw.
+         */
+        struct hvm_ioreq_server *server;
+        /*
+         * flags specifies whether read, write or both operations
+         * are to be emulated by an ioreq server.
+         */
+        unsigned long flags;
+
+#define P2M_IOREQ_HANDLE_WRITE_ACCESS HVMOP_IOREQ_MEM_ACCESS_WRITE
+#define P2M_IOREQ_HANDLE_READ_ACCESS  HVMOP_IOREQ_MEM_ACCESS_READ
+
+    } ioreq;
 };
 
 /* get host p2m table */
@@ -821,6 +843,12 @@  static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt)
     return flags;
 }
 
+int p2m_set_ioreq_server(struct domain *d, unsigned long flags,
+                         struct hvm_ioreq_server *s);
+void p2m_get_ioreq_server(struct domain *d, unsigned long *flags,
+                         struct hvm_ioreq_server **s);
+void p2m_destroy_ioreq_server(struct domain *d, struct hvm_ioreq_server *s);
+
 #endif /* _XEN_P2M_H */
 
 /*
diff --git a/xen/include/public/hvm/hvm_op.h b/xen/include/public/hvm/hvm_op.h
index a1eae52..d46f186 100644
--- a/xen/include/public/hvm/hvm_op.h
+++ b/xen/include/public/hvm/hvm_op.h
@@ -489,6 +489,43 @@  struct xen_hvm_altp2m_op {
 typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t;
 DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t);
 
+#if defined(__XEN__) || defined(__XEN_TOOLS__)
+
+/*
+ * HVMOP_map_mem_type_to_ioreq_server : map or unmap the IOREQ Server <id>
+ *                                      to specific memroy type <type>
+ *                                      for specific accesses <flags>
+ *
+ * Note that if only write operations are to be forwarded to an ioreq server,
+ * read operations will be performed with no hypervisor intervention. But if
+ * flags indicates that read operations are to be forwarded to an ioreq server,
+ * write operations will inevitably be trapped into hypervisor, whether they
+ * are emulated by hypervisor or forwarded to ioreq server depends on the flags
+ * setting. This situation means significant performance impact.
+ */
+#define HVMOP_map_mem_type_to_ioreq_server 26
+struct xen_hvm_map_mem_type_to_ioreq_server {
+    domid_t domid;      /* IN - domain to be serviced */
+    ioservid_t id;      /* IN - ioreq server id */
+    hvmmem_type_t type; /* IN - memory type */
+    uint32_t flags;     /* IN - types of accesses to be forwarded to the
+                           ioreq server. flags with 0 means to unmap the
+                           ioreq server */
+#define _HVMOP_IOREQ_MEM_ACCESS_READ 0
+#define HVMOP_IOREQ_MEM_ACCESS_READ \
+    (1u << _HVMOP_IOREQ_MEM_ACCESS_READ)
+
+#define _HVMOP_IOREQ_MEM_ACCESS_WRITE 1
+#define HVMOP_IOREQ_MEM_ACCESS_WRITE \
+    (1u << _HVMOP_IOREQ_MEM_ACCESS_WRITE)
+};
+typedef struct xen_hvm_map_mem_type_to_ioreq_server
+    xen_hvm_map_mem_type_to_ioreq_server_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_map_mem_type_to_ioreq_server_t);
+
+#endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
+
+
 #endif /* __XEN_PUBLIC_HVM_HVM_OP_H__ */
 
 /*