
[v10,5/6] x86/ioreq server: Asynchronously reset outstanding p2m_ioreq_server entries.

Message ID 1491135867-623-6-git-send-email-yu.c.zhang@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Yu Zhang April 2, 2017, 12:24 p.m. UTC
After an ioreq server has been unmapped, the remaining p2m_ioreq_server
entries need to be reset back to p2m_ram_rw. This patch does so
asynchronously, using the current p2m_change_entry_type_global()
interface.
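
For readers less familiar with the lazy type-change machinery, a rough
sketch of the intended flow follows. It is illustration only, pieced
together from the hunks below, and not new code in this patch:

  /* 1. The device model unmaps the ioreq server by passing flags == 0: */
  hvm_map_mem_type_to_ioreq_server(d, id, HVMMEM_ioreq_server, 0);

  /* 2. ... which clears p2m->ioreq.server and, if entry_count != 0, calls: */
  p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);

  /*
   * 3. p2m_change_entry_type_global() only marks entries for
   *    recalculation; the actual p2m_ioreq_server -> p2m_ram_rw
   *    conversion happens lazily, entry by entry, in resolve_misconfig()
   *    (EPT) or do_recalc() (p2m-pt), where entry_count is decremented.
   */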

A new field, entry_count, is introduced in struct p2m_domain to record
the number of p2m_ioreq_server p2m page table entries. One property of
these entries is that they only point to 4K-sized page frames, because
all p2m_ioreq_server entries originate from p2m_ram_rw ones in
p2m_change_type_one(). We therefore do not need to worry about counting
for 2M/1G-sized pages.
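
As a reminder (existing logic, unchanged by this patch), the type change
in p2m_change_type_one() is always applied at 4K granularity, roughly:

  /* Shown approximately: the entry is always (re)written with
   * PAGE_ORDER_4K, so p2m_ioreq_server can only appear in 4K entries. */
  rc = likely(pt == ot)
       ? p2m_set_entry(p2m, gfn, mfn, PAGE_ORDER_4K, nt,
                       p2m->default_access)
       : -EBUSY;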

This patch disallows mapping of an ioreq server while p2m_ioreq_server
entries are still left, in case another mapping occurs right after the
current one has been unmapped and released its lock, with the p2m table
not yet synced.

This patch also disallows live migration while p2m_ioreq_server entries
remain in the p2m table. The core reason is that our current
implementation of p2m_change_entry_type_global() lacks the information
to resync p2m_ioreq_server entries correctly if global_logdirty is on.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
---
Cc: Paul Durrant <paul.durrant@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>

changes in v5: 
  - According to comments from Jan: use unsigned long for entry_count;
  - According to comments from Jan: refuse the mapping request when
    there are p2m_ioreq_server entries remaining in the p2m table.
  - Added "Reviewed-by: Paul Durrant <paul.durrant@citrix.com>"

changes in v4: 
  - According to comments from Jan: use ASSERT() instead of 'if'
    condition in p2m_change_type_one().
  - According to comments from Jan: commit message changes to mention
    that p2m_ioreq_server entries are all based on 4K-sized pages.

changes in v3: 
  - Move the synchronous resetting logic into patch 5.
  - According to comments from Jan: introduce p2m_check_changeable()
    to clarify the p2m type change code.
  - According to comments from George: use locks in the same order
    to avoid deadlock, call p2m_change_entry_type_global() after unmap
    of the ioreq server is finished.

changes in v2: 
  - Move the calculation of the ioreq server page entry_count into
    p2m_change_type_one() so that we do not need a separate lock.
    Note: entry_count is also calculated in resolve_misconfig()/
    do_recalc(); fortunately, callers of both routines already hold
    the p2m lock.
  - Simplify logic in hvmop_set_mem_type().
  - Introduce routine p2m_finish_type_change() to walk the p2m 
    table and do the p2m reset.
---
 xen/arch/x86/hvm/ioreq.c  |  8 ++++++++
 xen/arch/x86/mm/hap/hap.c |  9 +++++++++
 xen/arch/x86/mm/p2m-ept.c |  8 +++++++-
 xen/arch/x86/mm/p2m-pt.c  | 13 +++++++++++--
 xen/arch/x86/mm/p2m.c     | 29 +++++++++++++++++++++++++++++
 xen/include/asm-x86/p2m.h |  9 ++++++++-
 6 files changed, 72 insertions(+), 4 deletions(-)

Comments

Jan Beulich April 3, 2017, 2:36 p.m. UTC | #1
>>> On 02.04.17 at 14:24, <yu.c.zhang@linux.intel.com> wrote:
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -317,6 +317,15 @@ int p2m_set_ioreq_server(struct domain *d,
>          if ( p2m->ioreq.server != NULL )
>              goto out;
>  
> +        /*
> +         * It is possible that an ioreq server has just been unmapped,
> +         * released the spin lock, with some p2m_ioreq_server entries
> +         * in p2m table remained. We shall refuse another ioreq server
> +         * mapping request in such case.
> +         */
> +        if ( read_atomic(&p2m->ioreq.entry_count) )
> +            goto out;

So this produces the same -EINVAL as the earlier check in context
above. I think it would be nice if neither did - -EINUSE for the first
(which we don't have, so -EOPNOTSUPP would seem the second
best option there) and -EBUSY for the second would seem more
appropriate. If you agree, respective adjustments could be done
while committing, if no other reason for a v11 arises.

Jan
Jan Beulich April 3, 2017, 2:38 p.m. UTC | #2
>>> On 03.04.17 at 16:36, <JBeulich@suse.com> wrote:
>>>> On 02.04.17 at 14:24, <yu.c.zhang@linux.intel.com> wrote:
>> --- a/xen/arch/x86/mm/p2m.c
>> +++ b/xen/arch/x86/mm/p2m.c
>> @@ -317,6 +317,15 @@ int p2m_set_ioreq_server(struct domain *d,
>>          if ( p2m->ioreq.server != NULL )
>>              goto out;
>>  
>> +        /*
>> +         * It is possible that an ioreq server has just been unmapped,
>> +         * released the spin lock, with some p2m_ioreq_server entries
>> +         * in p2m table remained. We shall refuse another ioreq server
>> +         * mapping request in such case.
>> +         */
>> +        if ( read_atomic(&p2m->ioreq.entry_count) )
>> +            goto out;
> 
> So this produces the same -EINVAL as the earlier check in context
> above. I think it would be nice if neither did - -EINUSE for the first
> (which we don't have, so -EOPNOTSUPP would seem the second
> bets option there) and -EBUSY for the second would seem more
> appropriate. If you agree, respective adjustments could be done
> while committing, if no other reason for a v11 arises.

Oh, and with that
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan
Yu Zhang April 5, 2017, 7:18 a.m. UTC | #3
On 4/3/2017 10:36 PM, Jan Beulich wrote:
> So this produces the same -EINVAL as the earlier check in context
> above. I think it would be nice if neither did - -EINUSE for the first
> (which we don't have, so -EOPNOTSUPP would seem the second
> bets option there) and -EBUSY for the second would seem more
> appropriate. If you agree, respective adjustments could be done
> while committing, if no other reason for a v11 arises.

Thanks Jan.
But my code shows both checks return -EBUSY, instead of -EINVAL, for a
mapping request:
    /* Unmap ioreq server from p2m type by passing flags with 0. */
    if ( flags == 0 )
    {
        rc = -EINVAL;
        if ( p2m->ioreq.server != s )
            goto out;

        p2m->ioreq.server = NULL;
        p2m->ioreq.flags = 0;
    }
    else
    {
        rc = -EBUSY;
        if ( p2m->ioreq.server != NULL )
            goto out;

        /*
         * It is possible that an ioreq server has just been unmapped,
         * released the spin lock, with some p2m_ioreq_server entries
         * in p2m table remained. We shall refuse another ioreq server
         * mapping request in such case.
         */
        if ( read_atomic(&p2m->ioreq.entry_count) )
            goto out;

        p2m->ioreq.server = s;
        p2m->ioreq.flags = flags;
    }

Do you really want to use -EOPNOTSUPP when p2m->ioreq.server is not NULL
in the mapping code?

For the unmapping code, -EINVAL is used when the ioreq server to be 
unmapped is not the designated one.
But for this one, I am not sure which one is better: -EINVAL or -EBUSY?

Yu
Jan Beulich April 5, 2017, 8:11 a.m. UTC | #4
>>> On 05.04.17 at 09:18, <yu.c.zhang@linux.intel.com> wrote:
> On 4/3/2017 10:36 PM, Jan Beulich wrote:
>> So this produces the same -EINVAL as the earlier check in context
>> above. I think it would be nice if neither did - -EINUSE for the first
>> (which we don't have, so -EOPNOTSUPP would seem the second
>> bets option there) and -EBUSY for the second would seem more
>> appropriate. If you agree, respective adjustments could be done
>> while committing, if no other reason for a v11 arises.
> 
> Thanks Jan.

First of all - deleting irrelevant context from replies is always
appreciated, but I think you went too far here. It is no longer
possible to re-associate your comments with the change you've
made (not visible from the code fragment below).

> But my code shows both will return -EBUSY, instead of -EINVAL for the 
> mapping requirement:
>        /* Unmap ioreq server from p2m type by passing flags with 0. */
>        if ( flags == 0 )
>        {
>            rc = -EINVAL;
>            if ( p2m->ioreq.server != s )
>                goto out;
> 
>            p2m->ioreq.server = NULL;
>            p2m->ioreq.flags = 0;
>        }
>        else
>        {
>            rc = -EBUSY;
>            if ( p2m->ioreq.server != NULL )
>                goto out;
> 
>            /*
>             * It is possible that an ioreq server has just been unmapped,
>             * released the spin lock, with some p2m_ioreq_server entries
>             * in p2m table remained. We shall refuse another ioreq server
>             * mapping request in such case.
>             */
>            if ( read_atomic(&p2m->ioreq.entry_count) )
>                goto out;

Oh, I see: I've mistakenly assumed the addition was to the if()
branch. I clearly didn't look closely enough, I'm sorry.

>            p2m->ioreq.server = s;
>            p2m->ioreq.flags = flags;
>        }
> 
> Are you really wanna use -EOPNOTSUPP when p2m->ioreq.server is not NULL 
> for the mapping code?

As per above, I really did refer to the wrong piece of code, so all
is fine the way the code is.

Jan
George Dunlap April 5, 2017, 2:41 p.m. UTC | #5
On Sun, Apr 2, 2017 at 1:24 PM, Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> After an ioreq server has unmapped, the remaining p2m_ioreq_server
> entries need to be reset back to p2m_ram_rw. This patch does this
> asynchronously with the current p2m_change_entry_type_global()
> interface.
>
> New field entry_count is introduced in struct p2m_domain, to record
> the number of p2m_ioreq_server p2m page table entries. One nature of
> these entries is that they only point to 4K sized page frames, because
> all p2m_ioreq_server entries are originated from p2m_ram_rw ones in
> p2m_change_type_one(). We do not need to worry about the counting for
> 2M/1G sized pages.

Assuming that all p2m_ioreq_server entries are *created* by
p2m_change_type_one() may be valid, but can you assume that they are only
ever *removed* by p2m_change_type_one() (or recalculation)?

What happens, for instance, if a guest balloons out one of the ram
pages?  I don't immediately see anything preventing a p2m_ioreq_server
page from being ballooned out, nor anything on the
decrease_reservation() path decreasing p2m->ioreq.entry_count.  Or did
I miss something?

Other than that, only one minor comment...

> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> index a57b385..6ec950a 100644
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -187,6 +187,15 @@ out:
>   */
>  static int hap_enable_log_dirty(struct domain *d, bool_t log_global)
>  {
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +
> +    /*
> +     * Refuse to turn on global log-dirty mode if
> +     * there's outstanding p2m_ioreq_server pages.

Grammar nit: if *there are* outstanding pages.

 -George
Yu Zhang April 5, 2017, 4:22 p.m. UTC | #6
On 4/5/2017 10:41 PM, George Dunlap wrote:
> On Sun, Apr 2, 2017 at 1:24 PM, Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
>> After an ioreq server has unmapped, the remaining p2m_ioreq_server
>> entries need to be reset back to p2m_ram_rw. This patch does this
>> asynchronously with the current p2m_change_entry_type_global()
>> interface.
>>
>> New field entry_count is introduced in struct p2m_domain, to record
>> the number of p2m_ioreq_server p2m page table entries. One nature of
>> these entries is that they only point to 4K sized page frames, because
>> all p2m_ioreq_server entries are originated from p2m_ram_rw ones in
>> p2m_change_type_one(). We do not need to worry about the counting for
>> 2M/1G sized pages.
> Assuming that all p2m_ioreq_server entries are *created* by
> p2m_change_type_one() may valid, but can you assume that they are only
> ever *removed* by p2m_change_type_one() (or recalculation)?
>
> What happens, for instance, if a guest balloons out one of the ram
> pages?  I don't immediately see anything preventing a p2m_ioreq_server
> page from being ballooned out, nor anything on the
> decrease_reservation() path decreasing p2m->ioreq.entry_count.  Or did
> I miss something?
>
> Other than that, only one minor comment...

Thanks for your thorough consideration, George. But I do not think we need to worry about this:

If the emulation is in progress, the balloon driver cannot get a p2m_ioreq_server page, because
it is already allocated.

And even when emulation is finished and the balloon driver successfully gets this page and triggers
decrease_reservation, the purpose is to remove the current mapping relation between the gfn
and mfn in the p2m. So IIUC, p2m_remove_page() will be triggered if everything goes fine, and then
p2m_set_entry(), which will eventually trigger the recalc logic, either in ept_set_entry() or
p2m_pt_set_entry(). The entry_count will then be updated in the recalc logic.

>> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
>> index a57b385..6ec950a 100644
>> --- a/xen/arch/x86/mm/hap/hap.c
>> +++ b/xen/arch/x86/mm/hap/hap.c
>> @@ -187,6 +187,15 @@ out:
>>    */
>>   static int hap_enable_log_dirty(struct domain *d, bool_t log_global)
>>   {
>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +
>> +    /*
>> +     * Refuse to turn on global log-dirty mode if
>> +     * there's outstanding p2m_ioreq_server pages.
> Grammar nit: if *there are* outstanding pages.

Oh, right. Thanks

B.R.
Yu
>   -George
>
Yu Zhang April 5, 2017, 4:32 p.m. UTC | #7
On 4/6/2017 12:35 AM, George Dunlap wrote:
> On 05/04/17 17:22, Yu Zhang wrote:
>>
>> On 4/5/2017 10:41 PM, George Dunlap wrote:
>>> On Sun, Apr 2, 2017 at 1:24 PM, Yu Zhang <yu.c.zhang@linux.intel.com>
>>> wrote:
>>>> After an ioreq server has unmapped, the remaining p2m_ioreq_server
>>>> entries need to be reset back to p2m_ram_rw. This patch does this
>>>> asynchronously with the current p2m_change_entry_type_global()
>>>> interface.
>>>>
>>>> New field entry_count is introduced in struct p2m_domain, to record
>>>> the number of p2m_ioreq_server p2m page table entries. One nature of
>>>> these entries is that they only point to 4K sized page frames, because
>>>> all p2m_ioreq_server entries are originated from p2m_ram_rw ones in
>>>> p2m_change_type_one(). We do not need to worry about the counting for
>>>> 2M/1G sized pages.
>>> Assuming that all p2m_ioreq_server entries are *created* by
>>> p2m_change_type_one() may valid, but can you assume that they are only
>>> ever *removed* by p2m_change_type_one() (or recalculation)?
>>>
>>> What happens, for instance, if a guest balloons out one of the ram
>>> pages?  I don't immediately see anything preventing a p2m_ioreq_server
>>> page from being ballooned out, nor anything on the
>>> decrease_reservation() path decreasing p2m->ioreq.entry_count.  Or did
>>> I miss something?
>>>
>>> Other than that, only one minor comment...
>> Thanks for your thorough consideration, George. But I do not think we
>> need to worry about this:
>>
>> If the emulation is in process, the balloon driver cannot get a
>> p2m_ioreq_server page - because
>> it is already allocated.
> In theory, yes, the guest *shouldn't* do this.  But what if the guest OS
> makes a mistake?  Or, what if the ioreq server makes a mistake and
> places a watch on a page that *isn't* allocated by the device driver, or
> forgets to change a page type back to ram when the device driver frees
> it back to the guest kernel?

Then the lazy p2m change code will be triggered, and this page is reset 
to p2m_ram_rw
before being set to p2m_invalid, just like the normal path. Will this be 
a problem?

> It's the hypervisor's job to do the right thing even if the guest and
> the device model don't.
>
>> And even when emulation is finished, the balloon driver successfully get
>> this page, and triggers
>> decrease_reservation, the purpose is to remove the current mapping
>> relation between the gfn
>> and mfn in p2m. So IIUC, p2m_remove_page() will be triggered if
>> everything is goes fine, and then
>> p2m_set_entry(), which will trigger the recalc logic eventually, either
>> in ept_set_entry() or
>> p2m_pt_set_entry(). Then the entry_count will be updated in the recalc
>> logic.
> Yes, once the lazy type change has been made, we can rely on the
> recalculation logic to make sure that the types are changed appropriately.

Yep. So my understanding is that as long as p2m_set_entry() is used to 
change the p2m,
we do not need to worry about this.

B.R.
Yu
>
>   -George
>
>
George Dunlap April 5, 2017, 4:35 p.m. UTC | #8
On 05/04/17 17:22, Yu Zhang wrote:
> 
> 
> On 4/5/2017 10:41 PM, George Dunlap wrote:
>> On Sun, Apr 2, 2017 at 1:24 PM, Yu Zhang <yu.c.zhang@linux.intel.com>
>> wrote:
>>> After an ioreq server has unmapped, the remaining p2m_ioreq_server
>>> entries need to be reset back to p2m_ram_rw. This patch does this
>>> asynchronously with the current p2m_change_entry_type_global()
>>> interface.
>>>
>>> New field entry_count is introduced in struct p2m_domain, to record
>>> the number of p2m_ioreq_server p2m page table entries. One nature of
>>> these entries is that they only point to 4K sized page frames, because
>>> all p2m_ioreq_server entries are originated from p2m_ram_rw ones in
>>> p2m_change_type_one(). We do not need to worry about the counting for
>>> 2M/1G sized pages.
>> Assuming that all p2m_ioreq_server entries are *created* by
>> p2m_change_type_one() may valid, but can you assume that they are only
>> ever *removed* by p2m_change_type_one() (or recalculation)?
>>
>> What happens, for instance, if a guest balloons out one of the ram
>> pages?  I don't immediately see anything preventing a p2m_ioreq_server
>> page from being ballooned out, nor anything on the
>> decrease_reservation() path decreasing p2m->ioreq.entry_count.  Or did
>> I miss something?
>>
>> Other than that, only one minor comment...
> 
> Thanks for your thorough consideration, George. But I do not think we
> need to worry about this:
> 
> If the emulation is in process, the balloon driver cannot get a
> p2m_ioreq_server page - because
> it is already allocated.

In theory, yes, the guest *shouldn't* do this.  But what if the guest OS
makes a mistake?  Or, what if the ioreq server makes a mistake and
places a watch on a page that *isn't* allocated by the device driver, or
forgets to change a page type back to ram when the device driver frees
it back to the guest kernel?

It's the hypervisor's job to do the right thing even if the guest and
the device model don't.

> And even when emulation is finished, the balloon driver successfully get
> this page, and triggers
> decrease_reservation, the purpose is to remove the current mapping
> relation between the gfn
> and mfn in p2m. So IIUC, p2m_remove_page() will be triggered if
> everything is goes fine, and then
> p2m_set_entry(), which will trigger the recalc logic eventually, either
> in ept_set_entry() or
> p2m_pt_set_entry(). Then the entry_count will be updated in the recalc
> logic.

Yes, once the lazy type change has been made, we can rely on the
recalculation logic to make sure that the types are changed appropriately.

 -George
George Dunlap April 5, 2017, 5:01 p.m. UTC | #9
On 05/04/17 17:32, Yu Zhang wrote:
> 
> 
> On 4/6/2017 12:35 AM, George Dunlap wrote:
>> On 05/04/17 17:22, Yu Zhang wrote:
>>>
>>> On 4/5/2017 10:41 PM, George Dunlap wrote:
>>>> On Sun, Apr 2, 2017 at 1:24 PM, Yu Zhang <yu.c.zhang@linux.intel.com>
>>>> wrote:
>>>>> After an ioreq server has unmapped, the remaining p2m_ioreq_server
>>>>> entries need to be reset back to p2m_ram_rw. This patch does this
>>>>> asynchronously with the current p2m_change_entry_type_global()
>>>>> interface.
>>>>>
>>>>> New field entry_count is introduced in struct p2m_domain, to record
>>>>> the number of p2m_ioreq_server p2m page table entries. One nature of
>>>>> these entries is that they only point to 4K sized page frames, because
>>>>> all p2m_ioreq_server entries are originated from p2m_ram_rw ones in
>>>>> p2m_change_type_one(). We do not need to worry about the counting for
>>>>> 2M/1G sized pages.
>>>> Assuming that all p2m_ioreq_server entries are *created* by
>>>> p2m_change_type_one() may valid, but can you assume that they are only
>>>> ever *removed* by p2m_change_type_one() (or recalculation)?
>>>>
>>>> What happens, for instance, if a guest balloons out one of the ram
>>>> pages?  I don't immediately see anything preventing a p2m_ioreq_server
>>>> page from being ballooned out, nor anything on the
>>>> decrease_reservation() path decreasing p2m->ioreq.entry_count.  Or did
>>>> I miss something?
>>>>
>>>> Other than that, only one minor comment...
>>> Thanks for your thorough consideration, George. But I do not think we
>>> need to worry about this:
>>>
>>> If the emulation is in process, the balloon driver cannot get a
>>> p2m_ioreq_server page - because
>>> it is already allocated.
>> In theory, yes, the guest *shouldn't* do this.  But what if the guest OS
>> makes a mistake?  Or, what if the ioreq server makes a mistake and
>> places a watch on a page that *isn't* allocated by the device driver, or
>> forgets to change a page type back to ram when the device driver frees
>> it back to the guest kernel?
> 
> Then the lazy p2m change code will be triggered, and this page is reset
> to p2m_ram_rw
> before being set to p2m_invalid, just like the normal path. Will this be
> a problem?

No, I'm talking about before the ioreq server detaches.

Scenario 1: Bug in driver
1. Guest driver allocates page A
2. dm marks A as p2m_ioreq_server
3. Guest driver accidentally frees A to the kernel
4. guest kernel balloons out page A; ioreq.entry_count is wrong

Scenario 2: Bug in the kernel
1. Guest driver allocates page A
2. dm marks A as p2m_ioreq_server
3. Guest kernel tries to balloon out page B, but makes a calculation
mistake and balloons out A instead; now ioreq.entry_count is wrong

Scenario 3: Off-by-one bug in devicemodel
1. Guest driver allocates pages A-D
2. dm makes a mistake and marks pages A-E as p2m_ioreq_server (one extra
page)
3. guest kernel balloons out page E; now ioreq.entry_count is wrong

Scenario 4: "Leak" in devicemodel
1. Guest driver allocates page A
2. dm marks A as p2m_ioreq_server
3. Guest driver is done with page A, but DM forgets to reset it to
p2m_ram_rw
4. Guest driver frees A to guest kernel
5. Guest kernel balloons out page A; now ioreq.entry_count is wrong

I could keep going on; there are *lots* of bugs in the driver, the
kernel, or the devicemodel which could cause pages marked
p2m_ioreq_server to end up being ballooned out; which under the current
code would make ioreq.entry_count wrong.

It's the hypervisor's job to do the right thing even when other
components have bugs in them.  This is why I initially suggested keeping
count in atomic_write_ept_entry() -- no matter how the entry is changed,
we always know exactly how many entries of type p2m_ioreq_server we have.

 -George
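
For illustration, a rough sketch of the alternative George describes
might look like the following. It assumes atomic_write_ept_entry() were
extended to take the p2m (an interface change, as Yu notes later in the
thread); it is not actual Xen code:

  static int atomic_write_ept_entry(struct p2m_domain *p2m,
                                    ept_entry_t *entryptr, ept_entry_t new,
                                    int level)
  {
      /* ... existing foreign-mapping/grant reference handling elided ... */

      if ( level == 0 )
      {
          /*
           * Count every transition to/from p2m_ioreq_server at the point
           * the entry is actually written, no matter which path changed
           * it (type change, recalculation, ballooning, ...).
           */
          if ( is_epte_valid(entryptr) &&
               entryptr->sa_p2mt == p2m_ioreq_server &&
               new.sa_p2mt != p2m_ioreq_server )
          {
              ASSERT(p2m->ioreq.entry_count > 0);
              p2m->ioreq.entry_count--;
          }
          else if ( new.sa_p2mt == p2m_ioreq_server &&
                    (!is_epte_valid(entryptr) ||
                     entryptr->sa_p2mt != p2m_ioreq_server) )
              p2m->ioreq.entry_count++;
      }

      write_atomic(&entryptr->epte, new.epte);

      return 0;
  }

Counting at this level would keep entry_count exact regardless of bugs
elsewhere, at the cost of touching a hot path and, as Yu points out
below, only covering the EPT case.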
Yu Zhang April 5, 2017, 5:18 p.m. UTC | #10
On 4/6/2017 1:01 AM, George Dunlap wrote:
> On 05/04/17 17:32, Yu Zhang wrote:
>>
>> On 4/6/2017 12:35 AM, George Dunlap wrote:
>>> On 05/04/17 17:22, Yu Zhang wrote:
>>>> On 4/5/2017 10:41 PM, George Dunlap wrote:
>>>>> On Sun, Apr 2, 2017 at 1:24 PM, Yu Zhang <yu.c.zhang@linux.intel.com>
>>>>> wrote:
>>>>>> After an ioreq server has unmapped, the remaining p2m_ioreq_server
>>>>>> entries need to be reset back to p2m_ram_rw. This patch does this
>>>>>> asynchronously with the current p2m_change_entry_type_global()
>>>>>> interface.
>>>>>>
>>>>>> New field entry_count is introduced in struct p2m_domain, to record
>>>>>> the number of p2m_ioreq_server p2m page table entries. One nature of
>>>>>> these entries is that they only point to 4K sized page frames, because
>>>>>> all p2m_ioreq_server entries are originated from p2m_ram_rw ones in
>>>>>> p2m_change_type_one(). We do not need to worry about the counting for
>>>>>> 2M/1G sized pages.
>>>>> Assuming that all p2m_ioreq_server entries are *created* by
>>>>> p2m_change_type_one() may valid, but can you assume that they are only
>>>>> ever *removed* by p2m_change_type_one() (or recalculation)?
>>>>>
>>>>> What happens, for instance, if a guest balloons out one of the ram
>>>>> pages?  I don't immediately see anything preventing a p2m_ioreq_server
>>>>> page from being ballooned out, nor anything on the
>>>>> decrease_reservation() path decreasing p2m->ioreq.entry_count.  Or did
>>>>> I miss something?
>>>>>
>>>>> Other than that, only one minor comment...
>>>> Thanks for your thorough consideration, George. But I do not think we
>>>> need to worry about this:
>>>>
>>>> If the emulation is in process, the balloon driver cannot get a
>>>> p2m_ioreq_server page - because
>>>> it is already allocated.
>>> In theory, yes, the guest *shouldn't* do this.  But what if the guest OS
>>> makes a mistake?  Or, what if the ioreq server makes a mistake and
>>> places a watch on a page that *isn't* allocated by the device driver, or
>>> forgets to change a page type back to ram when the device driver frees
>>> it back to the guest kernel?
>> Then the lazy p2m change code will be triggered, and this page is reset
>> to p2m_ram_rw
>> before being set to p2m_invalid, just like the normal path. Will this be
>> a problem?
> No, I'm talking about before the ioreq server detaches.
Sorry, I do not get it. Take scenario 1 for example:
> Scenario 1: Bug in driver
> 1. Guest driver allocates page A
> 2. dm marks A as p2m_ioreq_server
Here in step 2, the ioreq.entry_count increases;
> 3. Guest driver accidentally frees A to the kernel
> 4. guest kernel balloons out page A; ioreq.entry_count is wrong

Here in step 4, the ioreq.entry_count decreases.
Isn't this what we are expecting?

Yu
>
> Scenario 2: Bug in the kernel
> 1. Guest driver allocates page A
> 2. dm marks A as p2m_ioreq_server
> 3. Guest kernel tries to balloon out page B, but makes a calculation
> mistake and balloons out A instead; now ioreq.entry_count is wrong
>
> Scenario 3: Off-by-one bug in devicemodel
> 1. Guest driver allocates pages A-D
> 2. dm makes a mistake and marks pages A-E as p2m_ioreq_server (one extra
> page)
> 3. guest kernel balloons out page E; now ioreq.entry_count is wrong
>
> Scenario 4: "Leak" in devicemodel
> 1. Guest driver allocates page A
> 2. dm marks A as p2m_ioreq_server
> 3. Guest driver is done with page A, but DM forgets to reset it to
> p2m_ram_rw
> 4. Guest driver frees A to guest kernel
> 5. Guest kernel balloons out page A; now ioreq.entry_count is wrong
>
> I could keep going on; there are *lots* of bugs in the driver, the
> kernel, or the devicemodel which could cause pages marked
> p2m_ioreq_server to end up being ballooned out; which under the current
> code would make ioreq.entry_count wrong.
>
> It's the hypervisor's job to do the right thing even when other
> components have bugs in them.  This is why I initially suggested keeping
> count in atomic_write_ept_entry() -- no matter how the entry is changed,
> we always know exactly how many entries of type p2m_ioreq_server we have.
>
>   -George
>
>
Yu Zhang April 5, 2017, 5:28 p.m. UTC | #11
On 4/6/2017 1:18 AM, Yu Zhang wrote:
>
>
> On 4/6/2017 1:01 AM, George Dunlap wrote:
>> On 05/04/17 17:32, Yu Zhang wrote:
>>>
>>> On 4/6/2017 12:35 AM, George Dunlap wrote:
>>>> On 05/04/17 17:22, Yu Zhang wrote:
>>>>> On 4/5/2017 10:41 PM, George Dunlap wrote:
>>>>>> On Sun, Apr 2, 2017 at 1:24 PM, Yu Zhang 
>>>>>> <yu.c.zhang@linux.intel.com>
>>>>>> wrote:
>>>>>>> After an ioreq server has unmapped, the remaining p2m_ioreq_server
>>>>>>> entries need to be reset back to p2m_ram_rw. This patch does this
>>>>>>> asynchronously with the current p2m_change_entry_type_global()
>>>>>>> interface.
>>>>>>>
>>>>>>> New field entry_count is introduced in struct p2m_domain, to record
>>>>>>> the number of p2m_ioreq_server p2m page table entries. One 
>>>>>>> nature of
>>>>>>> these entries is that they only point to 4K sized page frames, 
>>>>>>> because
>>>>>>> all p2m_ioreq_server entries are originated from p2m_ram_rw ones in
>>>>>>> p2m_change_type_one(). We do not need to worry about the 
>>>>>>> counting for
>>>>>>> 2M/1G sized pages.
>>>>>> Assuming that all p2m_ioreq_server entries are *created* by
>>>>>> p2m_change_type_one() may valid, but can you assume that they are 
>>>>>> only
>>>>>> ever *removed* by p2m_change_type_one() (or recalculation)?
>>>>>>
>>>>>> What happens, for instance, if a guest balloons out one of the ram
>>>>>> pages?  I don't immediately see anything preventing a 
>>>>>> p2m_ioreq_server
>>>>>> page from being ballooned out, nor anything on the
>>>>>> decrease_reservation() path decreasing p2m->ioreq.entry_count.  
>>>>>> Or did
>>>>>> I miss something?
>>>>>>
>>>>>> Other than that, only one minor comment...
>>>>> Thanks for your thorough consideration, George. But I do not think we
>>>>> need to worry about this:
>>>>>
>>>>> If the emulation is in process, the balloon driver cannot get a
>>>>> p2m_ioreq_server page - because
>>>>> it is already allocated.
>>>> In theory, yes, the guest *shouldn't* do this.  But what if the 
>>>> guest OS
>>>> makes a mistake?  Or, what if the ioreq server makes a mistake and
>>>> places a watch on a page that *isn't* allocated by the device 
>>>> driver, or
>>>> forgets to change a page type back to ram when the device driver frees
>>>> it back to the guest kernel?
>>> Then the lazy p2m change code will be triggered, and this page is reset
>>> to p2m_ram_rw
>>> before being set to p2m_invalid, just like the normal path. Will 
>>> this be
>>> a problem?
>> No, I'm talking about before the ioreq server detaches.
> Sorry, I do not get it. Take scenario 1 for example:
>> Scenario 1: Bug in driver
>> 1. Guest driver allocates page A
>> 2. dm marks A as p2m_ioreq_server
> Here in step 2. the ioreq.entry_count increases;
>> 3. Guest driver accidentally frees A to the kernel
>> 4. guest kernel balloons out page A; ioreq.entry_count is wrong
>
> Here in step 4. the ioreq.entry_count decreases.

Oh, I figured it out. This entry is not invalidated yet if the ioreq
server is not unmapped. Sorry.

> Isn't this what we are expecting?
>
> Yu
>>
>> Scenario 2: Bug in the kernel
>> 1. Guest driver allocates page A
>> 2. dm marks A as p2m_ioreq_server
>> 3. Guest kernel tries to balloon out page B, but makes a calculation
>> mistake and balloons out A instead; now ioreq.entry_count is wrong
>>
>> Scenario 3: Off-by-one bug in devicemodel
>> 1. Guest driver allocates pages A-D
>> 2. dm makes a mistake and marks pages A-E as p2m_ioreq_server (one extra
>> page)
>> 3. guest kernel balloons out page E; now ioreq.entry_count is wrong
>>
>> Scenario 4: "Leak" in devicemodel
>> 1. Guest driver allocates page A
>> 2. dm marks A as p2m_ioreq_server
>> 3. Guest driver is done with page A, but DM forgets to reset it to
>> p2m_ram_rw
>> 4. Guest driver frees A to guest kernel
>> 5. Guest kernel balloons out page A; now ioreq.entry_count is wrong
>>
>> I could keep going on; there are *lots* of bugs in the driver, the
>> kernel, or the devicemodel which could cause pages marked
>> p2m_ioreq_server to end up being ballooned out; which under the current
>> code would make ioreq.entry_count wrong.
>>
>> It's the hypervisor's job to do the right thing even when other
>> components have bugs in them.  This is why I initially suggested keeping
>> count in atomic_write_ept_entry() -- no matter how the entry is changed,
>> we always know exactly how many entries of type p2m_ioreq_server we 
>> have.
>>

Well, counting in atomic_write_ept_entry() only works for EPT. Besides,
it requires interface changes - we need to pass the p2m.
Another thought is: in XenGT, PoD is already disabled to make sure
gfn->mfn does not change. So how about we disable ballooning if
ioreq.entry_count is not 0?

Yu
>>   -George
>>
>>
>
>
George Dunlap April 5, 2017, 5:29 p.m. UTC | #12
On 05/04/17 18:18, Yu Zhang wrote:
> 
> 
> On 4/6/2017 1:01 AM, George Dunlap wrote:
>> On 05/04/17 17:32, Yu Zhang wrote:
>>>
>>> On 4/6/2017 12:35 AM, George Dunlap wrote:
>>>> On 05/04/17 17:22, Yu Zhang wrote:
>>>>> On 4/5/2017 10:41 PM, George Dunlap wrote:
>>>>>> On Sun, Apr 2, 2017 at 1:24 PM, Yu Zhang <yu.c.zhang@linux.intel.com>
>>>>>> wrote:
>>>>>>> After an ioreq server has unmapped, the remaining p2m_ioreq_server
>>>>>>> entries need to be reset back to p2m_ram_rw. This patch does this
>>>>>>> asynchronously with the current p2m_change_entry_type_global()
>>>>>>> interface.
>>>>>>>
>>>>>>> New field entry_count is introduced in struct p2m_domain, to record
>>>>>>> the number of p2m_ioreq_server p2m page table entries. One nature of
>>>>>>> these entries is that they only point to 4K sized page frames,
>>>>>>> because
>>>>>>> all p2m_ioreq_server entries are originated from p2m_ram_rw ones in
>>>>>>> p2m_change_type_one(). We do not need to worry about the counting
>>>>>>> for
>>>>>>> 2M/1G sized pages.
>>>>>> Assuming that all p2m_ioreq_server entries are *created* by
>>>>>> p2m_change_type_one() may valid, but can you assume that they are
>>>>>> only
>>>>>> ever *removed* by p2m_change_type_one() (or recalculation)?
>>>>>>
>>>>>> What happens, for instance, if a guest balloons out one of the ram
>>>>>> pages?  I don't immediately see anything preventing a
>>>>>> p2m_ioreq_server
>>>>>> page from being ballooned out, nor anything on the
>>>>>> decrease_reservation() path decreasing p2m->ioreq.entry_count.  Or
>>>>>> did
>>>>>> I miss something?
>>>>>>
>>>>>> Other than that, only one minor comment...
>>>>> Thanks for your thorough consideration, George. But I do not think we
>>>>> need to worry about this:
>>>>>
>>>>> If the emulation is in process, the balloon driver cannot get a
>>>>> p2m_ioreq_server page - because
>>>>> it is already allocated.
>>>> In theory, yes, the guest *shouldn't* do this.  But what if the
>>>> guest OS
>>>> makes a mistake?  Or, what if the ioreq server makes a mistake and
>>>> places a watch on a page that *isn't* allocated by the device
>>>> driver, or
>>>> forgets to change a page type back to ram when the device driver frees
>>>> it back to the guest kernel?
>>> Then the lazy p2m change code will be triggered, and this page is reset
>>> to p2m_ram_rw
>>> before being set to p2m_invalid, just like the normal path. Will this be
>>> a problem?
>> No, I'm talking about before the ioreq server detaches.
> Sorry, I do not get it. Take scenario 1 for example:
>> Scenario 1: Bug in driver
>> 1. Guest driver allocates page A
>> 2. dm marks A as p2m_ioreq_server
> Here in step 2. the ioreq.entry_count increases;
>> 3. Guest driver accidentally frees A to the kernel
>> 4. guest kernel balloons out page A; ioreq.entry_count is wrong
> 
> Here in step 4. the ioreq.entry_count decreases.
> Isn't this what we are expecting?

So step 4 happens via xen/common/memory.c:decrease_reservation().  Where
along that path will ioreq.entry_count be decreased for the page freed?

Maybe it is, but I didn't see it when I looked.  As far as I can tell,
after step 4, ioreq.entry_count would still be 1, but the actual number
of p2m_ioreq_server entries would be 0.

 -George
Yu Zhang April 5, 2017, 6:02 p.m. UTC | #13
On 4/6/2017 1:28 AM, Yu Zhang wrote:
>
>
> On 4/6/2017 1:18 AM, Yu Zhang wrote:
>>
>>
>> On 4/6/2017 1:01 AM, George Dunlap wrote:
>>> On 05/04/17 17:32, Yu Zhang wrote:
>>>>
>>>> On 4/6/2017 12:35 AM, George Dunlap wrote:
>>>>> On 05/04/17 17:22, Yu Zhang wrote:
>>>>>> On 4/5/2017 10:41 PM, George Dunlap wrote:
>>>>>>> On Sun, Apr 2, 2017 at 1:24 PM, Yu Zhang 
>>>>>>> <yu.c.zhang@linux.intel.com>
>>>>>>> wrote:
>>>>>>>> After an ioreq server has unmapped, the remaining p2m_ioreq_server
>>>>>>>> entries need to be reset back to p2m_ram_rw. This patch does this
>>>>>>>> asynchronously with the current p2m_change_entry_type_global()
>>>>>>>> interface.
>>>>>>>>
>>>>>>>> New field entry_count is introduced in struct p2m_domain, to 
>>>>>>>> record
>>>>>>>> the number of p2m_ioreq_server p2m page table entries. One 
>>>>>>>> nature of
>>>>>>>> these entries is that they only point to 4K sized page frames, 
>>>>>>>> because
>>>>>>>> all p2m_ioreq_server entries are originated from p2m_ram_rw 
>>>>>>>> ones in
>>>>>>>> p2m_change_type_one(). We do not need to worry about the 
>>>>>>>> counting for
>>>>>>>> 2M/1G sized pages.
>>>>>>> Assuming that all p2m_ioreq_server entries are *created* by
>>>>>>> p2m_change_type_one() may valid, but can you assume that they 
>>>>>>> are only
>>>>>>> ever *removed* by p2m_change_type_one() (or recalculation)?
>>>>>>>
>>>>>>> What happens, for instance, if a guest balloons out one of the ram
>>>>>>> pages?  I don't immediately see anything preventing a 
>>>>>>> p2m_ioreq_server
>>>>>>> page from being ballooned out, nor anything on the
>>>>>>> decrease_reservation() path decreasing p2m->ioreq.entry_count.  
>>>>>>> Or did
>>>>>>> I miss something?
>>>>>>>
>>>>>>> Other than that, only one minor comment...
>>>>>> Thanks for your thorough consideration, George. But I do not 
>>>>>> think we
>>>>>> need to worry about this:
>>>>>>
>>>>>> If the emulation is in process, the balloon driver cannot get a
>>>>>> p2m_ioreq_server page - because
>>>>>> it is already allocated.
>>>>> In theory, yes, the guest *shouldn't* do this.  But what if the 
>>>>> guest OS
>>>>> makes a mistake?  Or, what if the ioreq server makes a mistake and
>>>>> places a watch on a page that *isn't* allocated by the device 
>>>>> driver, or
>>>>> forgets to change a page type back to ram when the device driver 
>>>>> frees
>>>>> it back to the guest kernel?
>>>> Then the lazy p2m change code will be triggered, and this page is 
>>>> reset
>>>> to p2m_ram_rw
>>>> before being set to p2m_invalid, just like the normal path. Will 
>>>> this be
>>>> a problem?
>>> No, I'm talking about before the ioreq server detaches.
>> Sorry, I do not get it. Take scenario 1 for example:
>>> Scenario 1: Bug in driver
>>> 1. Guest driver allocates page A
>>> 2. dm marks A as p2m_ioreq_server
>> Here in step 2. the ioreq.entry_count increases;
>>> 3. Guest driver accidentally frees A to the kernel
>>> 4. guest kernel balloons out page A; ioreq.entry_count is wrong
>>
>> Here in step 4. the ioreq.entry_count decreases.
>
> Oh. I figured out. This entry is not invalidated yet if ioreq is not 
> unmapped. Sorry.
>
>> Isn't this what we are expecting?
>>
>> Yu
>>>
>>> Scenario 2: Bug in the kernel
>>> 1. Guest driver allocates page A
>>> 2. dm marks A as p2m_ioreq_server
>>> 3. Guest kernel tries to balloon out page B, but makes a calculation
>>> mistake and balloons out A instead; now ioreq.entry_count is wrong
>>>
>>> Scenario 3: Off-by-one bug in devicemodel
>>> 1. Guest driver allocates pages A-D
>>> 2. dm makes a mistake and marks pages A-E as p2m_ioreq_server (one 
>>> extra
>>> page)
>>> 3. guest kernel balloons out page E; now ioreq.entry_count is wrong
>>>
>>> Scenario 4: "Leak" in devicemodel
>>> 1. Guest driver allocates page A
>>> 2. dm marks A as p2m_ioreq_server
>>> 3. Guest driver is done with page A, but DM forgets to reset it to
>>> p2m_ram_rw
>>> 4. Guest driver frees A to guest kernel
>>> 5. Guest kernel balloons out page A; now ioreq.entry_count is wrong
>>>
>>> I could keep going on; there are *lots* of bugs in the driver, the
>>> kernel, or the devicemodel which could cause pages marked
>>> p2m_ioreq_server to end up being ballooned out; which under the current
>>> code would make ioreq.entry_count wrong.
>>>
>>> It's the hypervisor's job to do the right thing even when other
>>> components have bugs in them.  This is why I initially suggested 
>>> keeping
>>> count in atomic_write_ept_entry() -- no matter how the entry is 
>>> changed,
>>> we always know exactly how many entries of type p2m_ioreq_server we 
>>> have.
>>>
>
> Well, count in atomic_write_ept_entry() only works for ept. Besides, 
> it requires
> interface changes - need to pass the p2m.
> Another thought is - now in XenGT, PoD is disabled to make sure 
> gfn->mfn does
> not change. So how about we disable ballooning if ioreq.entry_count is 
> not 0?

Or maybe just change p2m_ioreq_server back to p2m_ram_rw before the
entry is set to p2m_invalid? Like the code below:

diff --git a/xen/common/memory.c b/xen/common/memory.c
index 7dbddda..40e5f63 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -288,6 +288,10 @@ int guest_remove_page(struct domain *d, unsigned long gmfn)
         put_gfn(d, gmfn);
         return 1;
     }
+
+    if ( unlikely(p2mt == p2m_ioreq_server) )
+        p2m_change_type_one(d, gmfn,
+                            p2m_ioreq_server, p2m_ram_rw);
 #else
     mfn = gfn_to_mfn(d, _gfn(gmfn));
 #endif
Yu

>
> Yu
>>>   -George
>>>
>>>
>>
>>
Jan Beulich April 6, 2017, 7:43 a.m. UTC | #14
>>> On 05.04.17 at 19:28, <yu.c.zhang@linux.intel.com> wrote:
> Well, count in atomic_write_ept_entry() only works for ept. Besides, it 
> requires
> interface changes - need to pass the p2m.
> Another thought is - now in XenGT, PoD is disabled to make sure gfn->mfn 
> does
> not change. So how about we disable ballooning if ioreq.entry_count is 
> not 0?

You can't disable ballooning - the guest is free to make decisions
regarding what memory to return to the hypervisor all by itself.
Tool stack directions (via xenstore) are merely a hint for well-behaved
guests.

Jan

Patch

diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
index 5bf3b6d..07a6c26 100644
--- a/xen/arch/x86/hvm/ioreq.c
+++ b/xen/arch/x86/hvm/ioreq.c
@@ -955,6 +955,14 @@  int hvm_map_mem_type_to_ioreq_server(struct domain *d, ioservid_t id,
 
     spin_unlock_recursive(&d->arch.hvm_domain.ioreq_server.lock);
 
+    if ( rc == 0 && flags == 0 )
+    {
+        struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+        if ( read_atomic(&p2m->ioreq.entry_count) )
+            p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
+    }
+
     return rc;
 }
 
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index a57b385..6ec950a 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -187,6 +187,15 @@  out:
  */
 static int hap_enable_log_dirty(struct domain *d, bool_t log_global)
 {
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+    /*
+     * Refuse to turn on global log-dirty mode if
+     * there's outstanding p2m_ioreq_server pages.
+     */
+    if ( log_global && read_atomic(&p2m->ioreq.entry_count) )
+        return -EBUSY;
+
     /* turn on PG_log_dirty bit in paging mode */
     paging_lock(d);
     d->arch.paging.mode |= PG_log_dirty;
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index cc1eb21..62e9ddf 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -544,6 +544,12 @@  static int resolve_misconfig(struct p2m_domain *p2m, unsigned long gfn)
                     e.ipat = ipat;
                     if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
                     {
+                         if ( e.sa_p2mt == p2m_ioreq_server )
+                         {
+                             ASSERT(p2m->ioreq.entry_count > 0);
+                             p2m->ioreq.entry_count--;
+                         }
+
                          e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
                                      ? p2m_ram_logdirty : p2m_ram_rw;
                          ept_p2m_type_to_flags(p2m, &e, e.sa_p2mt, e.access);
@@ -965,7 +971,7 @@  static mfn_t ept_get_entry(struct p2m_domain *p2m,
     if ( is_epte_valid(ept_entry) )
     {
         if ( (recalc || ept_entry->recalc) &&
-             p2m_is_changeable(ept_entry->sa_p2mt) )
+             p2m_check_changeable(ept_entry->sa_p2mt) )
             *t = p2m_is_logdirty_range(p2m, gfn, gfn) ? p2m_ram_logdirty
                                                       : p2m_ram_rw;
         else
diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index c0055f3..0a70dc5 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -436,11 +436,13 @@  static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
          needs_recalc(l1, *pent) )
     {
         l1_pgentry_t e = *pent;
+        p2m_type_t p2mt_old;
 
         if ( !valid_recalc(l1, e) )
             P2M_DEBUG("bogus recalc leaf at d%d:%lx:%u\n",
                       p2m->domain->domain_id, gfn, level);
-        if ( p2m_is_changeable(p2m_flags_to_type(l1e_get_flags(e))) )
+        p2mt_old = p2m_flags_to_type(l1e_get_flags(e));
+        if ( p2m_is_changeable(p2mt_old) )
         {
             unsigned long mask = ~0UL << (level * PAGETABLE_ORDER);
             p2m_type_t p2mt = p2m_is_logdirty_range(p2m, gfn & mask, gfn | ~mask)
@@ -460,6 +462,13 @@  static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
                      mfn &= ~((unsigned long)_PAGE_PSE_PAT >> PAGE_SHIFT);
                 flags |= _PAGE_PSE;
             }
+
+            if ( p2mt_old == p2m_ioreq_server )
+            {
+                ASSERT(p2m->ioreq.entry_count > 0);
+                p2m->ioreq.entry_count--;
+            }
+
             e = l1e_from_pfn(mfn, flags);
             p2m_add_iommu_flags(&e, level,
                                 (p2mt == p2m_ram_rw)
@@ -729,7 +738,7 @@  p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
 static inline p2m_type_t recalc_type(bool_t recalc, p2m_type_t t,
                                      struct p2m_domain *p2m, unsigned long gfn)
 {
-    if ( !recalc || !p2m_is_changeable(t) )
+    if ( !recalc || !p2m_check_changeable(t) )
         return t;
     return p2m_is_logdirty_range(p2m, gfn, gfn) ? p2m_ram_logdirty
                                                 : p2m_ram_rw;
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index b84add0..7a1bdd8 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -317,6 +317,15 @@  int p2m_set_ioreq_server(struct domain *d,
         if ( p2m->ioreq.server != NULL )
             goto out;
 
+        /*
+         * It is possible that an ioreq server has just been unmapped,
+         * released the spin lock, with some p2m_ioreq_server entries
+         * in p2m table remained. We shall refuse another ioreq server
+         * mapping request in such case.
+         */
+        if ( read_atomic(&p2m->ioreq.entry_count) )
+            goto out;
+
         p2m->ioreq.server = s;
         p2m->ioreq.flags = flags;
     }
@@ -938,6 +947,26 @@  int p2m_change_type_one(struct domain *d, unsigned long gfn,
                          p2m->default_access)
          : -EBUSY;
 
+    if ( !rc )
+    {
+        switch ( nt )
+        {
+        case p2m_ram_rw:
+            if ( ot == p2m_ioreq_server )
+            {
+                ASSERT(p2m->ioreq.entry_count > 0);
+                p2m->ioreq.entry_count--;
+            }
+            break;
+        case p2m_ioreq_server:
+            ASSERT(ot == p2m_ram_rw);
+            p2m->ioreq.entry_count++;
+            break;
+        default:
+            break;
+        }
+    }
+
     gfn_unlock(p2m, gfn, 0);
 
     return rc;
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 4521620..e7e390d 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -120,7 +120,10 @@  typedef unsigned int p2m_query_t;
 
 /* Types that can be subject to bulk transitions. */
 #define P2M_CHANGEABLE_TYPES (p2m_to_mask(p2m_ram_rw) \
-                              | p2m_to_mask(p2m_ram_logdirty) )
+                              | p2m_to_mask(p2m_ram_logdirty) \
+                              | p2m_to_mask(p2m_ioreq_server) )
+
+#define P2M_IOREQ_TYPES (p2m_to_mask(p2m_ioreq_server))
 
 #define P2M_POD_TYPES (p2m_to_mask(p2m_populate_on_demand))
 
@@ -157,6 +160,7 @@  typedef unsigned int p2m_query_t;
 #define p2m_is_readonly(_t) (p2m_to_mask(_t) & P2M_RO_TYPES)
 #define p2m_is_discard_write(_t) (p2m_to_mask(_t) & P2M_DISCARD_WRITE_TYPES)
 #define p2m_is_changeable(_t) (p2m_to_mask(_t) & P2M_CHANGEABLE_TYPES)
+#define p2m_is_ioreq(_t) (p2m_to_mask(_t) & P2M_IOREQ_TYPES)
 #define p2m_is_pod(_t) (p2m_to_mask(_t) & P2M_POD_TYPES)
 #define p2m_is_grant(_t) (p2m_to_mask(_t) & P2M_GRANT_TYPES)
 /* Grant types are *not* considered valid, because they can be
@@ -178,6 +182,8 @@  typedef unsigned int p2m_query_t;
 
 #define p2m_allows_invalid_mfn(t) (p2m_to_mask(t) & P2M_INVALID_MFN_TYPES)
 
+#define p2m_check_changeable(t) (p2m_is_changeable(t) && !p2m_is_ioreq(t))
+
 typedef enum {
     p2m_host,
     p2m_nested,
@@ -349,6 +355,7 @@  struct p2m_domain {
           * are to be emulated by an ioreq server.
           */
          unsigned int flags;
+         unsigned long entry_count;
      } ioreq;
 };