[v7,4/5] x86/ioreq server: Asynchronously reset outstanding p2m_ioreq_server entries.

Message ID 1488987232-12349-5-git-send-email-yu.c.zhang@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Yu Zhang March 8, 2017, 3:33 p.m. UTC
After an ioreq server has been unmapped, the remaining p2m_ioreq_server
entries need to be reset back to p2m_ram_rw. This patch does so
asynchronously, using the current p2m_change_entry_type_global()
interface.

This patch also disallows live migration while any outstanding
p2m_ioreq_server entries are left. The core reason is that the current
implementation of p2m_change_entry_type_global() cannot tell the state
of p2m_ioreq_server entries (it cannot decide whether an entry is still
to be emulated or is to be resynced).
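
(Note: the entry counting relies on p2m_ioreq_server entries always being
4K leaf entries, since they are only created via p2m_change_type_one()
with PAGE_ORDER_4K -- see the discussion with Jan below.)

Condensed from the diff further down, the two new paths look roughly like
this (sketch only; error handling and locking as in the full diff):

    /* hvm_map_mem_type_to_ioreq_server(): after a successful unmap
     * (flags == 0), lazily reset any leftover p2m_ioreq_server entries. */
    if ( rc == 0 && flags == 0 )
    {
        struct p2m_domain *p2m = p2m_get_hostp2m(d);

        if ( read_atomic(&p2m->ioreq.entry_count) )
            p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
    }

    /* hap_enable_log_dirty(): refuse global log-dirty mode (and hence
     * live migration) while such entries are still outstanding. */
    if ( log_global && read_atomic(&p2m->ioreq.entry_count) )
        return -EBUSY;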

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
---
Cc: Paul Durrant <paul.durrant@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>

changes in v3: 
  - Move the synchronously resetting logic into patch 5.
  - According to comments from Jan: introduce p2m_check_changeable()
    to clarify the p2m type change code.
  - According to comments from George: use locks in the same order
    to avoid deadlock, call p2m_change_entry_type_global() after unmap
    of the ioreq server is finished.

changes in v2: 
  - Move the calculation of the ioreq server page entry_count into 
    p2m_change_type_one() so that we do not need a separate lock.
    Note: entry_count is also updated in resolve_misconfig()/
    do_recalc(); fortunately the callers of both routines already
    hold the p2m lock.
  - Simplify logic in hvmop_set_mem_type().
  - Introduce routine p2m_finish_type_change() to walk the p2m
    table and do the p2m reset (see the sketch after this list).
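
p2m_finish_type_change() itself is not part of this patch's diff (the
synchronous reset was moved into patch 5); as a rough sketch, such a walk
could look like the following -- the signature and exact helpers here are
illustrative, not the final interface:

    /* Sketch only: reset remaining "ot" entries in [first_gfn, first_gfn
     * + max_nr) back to "nt", one 4K entry at a time. */
    void p2m_finish_type_change(struct domain *d, unsigned long first_gfn,
                                unsigned long max_nr,
                                p2m_type_t ot, p2m_type_t nt)
    {
        struct p2m_domain *p2m = p2m_get_hostp2m(d);
        unsigned long gfn = first_gfn;
        unsigned long last_gfn = min(first_gfn + max_nr - 1,
                                     p2m->max_mapped_pfn);
        p2m_type_t t;

        p2m_lock(p2m);

        for ( ; gfn <= last_gfn; gfn++ )
        {
            get_gfn_query_unlocked(d, gfn, &t);

            /* Only entries of the old type are converted. */
            if ( t == ot )
                p2m_change_type_one(d, gfn, t, nt);
        }

        p2m_unlock(p2m);
    }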
---
 xen/arch/x86/hvm/ioreq.c  |  8 ++++++++
 xen/arch/x86/mm/hap/hap.c |  9 +++++++++
 xen/arch/x86/mm/p2m-ept.c |  8 +++++++-
 xen/arch/x86/mm/p2m-pt.c  | 13 +++++++++++--
 xen/arch/x86/mm/p2m.c     | 20 ++++++++++++++++++++
 xen/include/asm-x86/p2m.h |  9 ++++++++-
 6 files changed, 63 insertions(+), 4 deletions(-)

Comments

Jan Beulich March 10, 2017, 4:03 p.m. UTC | #1
>>> On 08.03.17 at 16:33, <yu.c.zhang@linux.intel.com> wrote:
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -954,6 +954,26 @@ int p2m_change_type_one(struct domain *d, unsigned long gfn,
>                           p2m->default_access)
>           : -EBUSY;
>  
> +    if ( !rc )
> +    {
> +        switch ( nt )
> +        {
> +        case p2m_ram_rw:
> +            if ( ot == p2m_ioreq_server )
> +            {
> +                p2m->ioreq.entry_count--;
> +                ASSERT(p2m->ioreq.entry_count >= 0);
> +            }
> +            break;
> +        case p2m_ioreq_server:
> +            if ( ot == p2m_ram_rw )
> +                p2m->ioreq.entry_count++;

I don't think the conditional is needed here. If anything it should be
an ASSERT() imo, as other transitions to p2m_ioreq_server should
not be allowed.

But there's a wider understanding issue I'm having here: What is
an "entry" here? Commonly I would assume this to refer to an
individual (4k) page, but it looks like you really mean table entry,
i.e. possibly representing a 2M or 1G page.

Jan
Yu Zhang March 11, 2017, 8:42 a.m. UTC | #2
On 3/11/2017 12:03 AM, Jan Beulich wrote:
>>>> On 08.03.17 at 16:33, <yu.c.zhang@linux.intel.com> wrote:
>> --- a/xen/arch/x86/mm/p2m.c
>> +++ b/xen/arch/x86/mm/p2m.c
>> @@ -954,6 +954,26 @@ int p2m_change_type_one(struct domain *d, unsigned long gfn,
>>                            p2m->default_access)
>>            : -EBUSY;
>>   
>> +    if ( !rc )
>> +    {
>> +        switch ( nt )
>> +        {
>> +        case p2m_ram_rw:
>> +            if ( ot == p2m_ioreq_server )
>> +            {
>> +                p2m->ioreq.entry_count--;
>> +                ASSERT(p2m->ioreq.entry_count >= 0);
>> +            }
>> +            break;
>> +        case p2m_ioreq_server:
>> +            if ( ot == p2m_ram_rw )
>> +                p2m->ioreq.entry_count++;
> I don't think the conditional is needed here. If anything it should be
> an ASSERT() imo, as other transitions to p2m_ioreq_server should
> not be allowed.

Thanks, Jan. I'll use ASSERT.

> But there's a wider understanding issue I'm having here: What is
> an "entry" here? Commonly I would assume this to refer to an
> individual (4k) page, but it looks like you really mean table entry,
> i.e. possibly representing a 2M or 1G page.

Well, it should be an entry pointing to a 4K page only. For
p2m_ioreq_server we will never see a huge page, because such entries are
changed from p2m_ram_rw pages in set_mem_type() -> p2m_change_type_one(),
which calls p2m_set_entry() with PAGE_ORDER_4K specified.


Thanks
Yu
> Jan
>
>
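
For reference, the change agreed above (replace the conditional on the
increment path with an ASSERT()) would look roughly like this -- a sketch
of the intent, not the final diff:

        case p2m_ioreq_server:
            /* The only transition to p2m_ioreq_server allowed here is
             * from p2m_ram_rw; anything else is a bug. */
            ASSERT(ot == p2m_ram_rw);
            p2m->ioreq.entry_count++;
            break;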
Jan Beulich March 13, 2017, 11:24 a.m. UTC | #3
>>> On 11.03.17 at 09:42, <yu.c.zhang@linux.intel.com> wrote:
> On 3/11/2017 12:03 AM, Jan Beulich wrote:
>> But there's a wider understanding issue I'm having here: What is
>> an "entry" here? Commonly I would assume this to refer to an
>> individual (4k) page, but it looks like you really mean table entry,
>> i.e. possibly representing a 2M or 1G page.
> 
> Well, it should be an entry pointing to a 4K page(only).
> For p2m_ioreq_server, we shall not meet huge page. Because they are 
> changed from p2m_ram_rw pages
> in set_mem_type() -> p2m_change_type_one(), which calls p2m_set_entry() 
> with PAGE_ORDER_4K specified.

And recombination of large pages won't ever end up hitting these?

Jan
Yu Zhang March 14, 2017, 7:42 a.m. UTC | #4
On 3/13/2017 7:24 PM, Jan Beulich wrote:
>>>> On 11.03.17 at 09:42, <yu.c.zhang@linux.intel.com> wrote:
>> On 3/11/2017 12:03 AM, Jan Beulich wrote:
>>> But there's a wider understanding issue I'm having here: What is
>>> an "entry" here? Commonly I would assume this to refer to an
>>> individual (4k) page, but it looks like you really mean table entry,
>>> i.e. possibly representing a 2M or 1G page.
>> Well, it should be an entry pointing to a 4K page(only).
>> For p2m_ioreq_server, we shall not meet huge page. Because they are
>> changed from p2m_ram_rw pages
>> in set_mem_type() -> p2m_change_type_one(), which calls p2m_set_entry()
>> with PAGE_ORDER_4K specified.
> And recombination of large pages won't ever end up hitting these?

Well, by recombination I guess you refer to the PoD pages? I do not think
p2m_ioreq_server pages will be recombined for now, which means we do not
need to worry about recounting the p2m_ioreq_server entries when a split
happens.

And as to a type change from p2m_ram_rw to p2m_ioreq_server, even if it
is done on a large page, p2m_change_type_one() will split the page and
only mark one EPT entry (which maps to a 4K page) as p2m_ioreq_server;
the other 511 entries remain p2m_ram_rw. So I still believe counting
p2m_ioreq_server entries here is correct.

Besides, looking at it from the XenGT requirement side, it is guest
graphics page tables that we are trying to write-protect, and those are
4K in size.

Thanks
Yu

> Jan
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
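
To make the split behaviour described above concrete, a minimal
illustration (the gfn and surrounding mapping are hypothetical):

    /* gfn currently lies inside a 2M p2m_ram_rw superpage mapping. */
    p2m_change_type_one(d, gfn, p2m_ram_rw, p2m_ioreq_server);
    /*
     * p2m_change_type_one() calls p2m_set_entry() with PAGE_ORDER_4K,
     * which splits the 2M mapping: 511 leaf entries stay p2m_ram_rw and
     * exactly one leaf becomes p2m_ioreq_server, so ioreq.entry_count is
     * bumped by exactly one.
     */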
Jan Beulich March 14, 2017, 10:49 a.m. UTC | #5
>>> On 14.03.17 at 08:42, <yu.c.zhang@linux.intel.com> wrote:
> On 3/13/2017 7:24 PM, Jan Beulich wrote:
>>>>> On 11.03.17 at 09:42, <yu.c.zhang@linux.intel.com> wrote:
>>> On 3/11/2017 12:03 AM, Jan Beulich wrote:
>>>> But there's a wider understanding issue I'm having here: What is
>>>> an "entry" here? Commonly I would assume this to refer to an
>>>> individual (4k) page, but it looks like you really mean table entry,
>>>> i.e. possibly representing a 2M or 1G page.
>>> Well, it should be an entry pointing to a 4K page(only).
>>> For p2m_ioreq_server, we shall not meet huge page. Because they are
>>> changed from p2m_ram_rw pages
>>> in set_mem_type() -> p2m_change_type_one(), which calls p2m_set_entry()
>>> with PAGE_ORDER_4K specified.
>> And recombination of large pages won't ever end up hitting these?
> 
> Well, by recombination I guess you refer to the POD pages? I do not 
> think p2m_ioreq_server
> pages will be combined now, which means we do not need to worry about 
> recounting the
> p2m_ioreq_server entries while a split happens.

No, I didn't think about PoD here. But indeed I was misremembering:
We don't currently make any attempt to transparently recreate large
pages.

Jan
Yu Zhang March 14, 2017, 12:18 p.m. UTC | #6
On 3/14/2017 6:49 PM, Jan Beulich wrote:
>>>> On 14.03.17 at 08:42, <yu.c.zhang@linux.intel.com> wrote:
>> On 3/13/2017 7:24 PM, Jan Beulich wrote:
>>>>>> On 11.03.17 at 09:42, <yu.c.zhang@linux.intel.com> wrote:
>>>> On 3/11/2017 12:03 AM, Jan Beulich wrote:
>>>>> But there's a wider understanding issue I'm having here: What is
>>>>> an "entry" here? Commonly I would assume this to refer to an
>>>>> individual (4k) page, but it looks like you really mean table entry,
>>>>> i.e. possibly representing a 2M or 1G page.
>>>> Well, it should be an entry pointing to a 4K page(only).
>>>> For p2m_ioreq_server, we shall not meet huge page. Because they are
>>>> changed from p2m_ram_rw pages
>>>> in set_mem_type() -> p2m_change_type_one(), which calls p2m_set_entry()
>>>> with PAGE_ORDER_4K specified.
>>> And recombination of large pages won't ever end up hitting these?
>> Well, by recombination I guess you refer to the POD pages? I do not
>> think p2m_ioreq_server
>> pages will be combined now, which means we do not need to worry about
>> recounting the
>> p2m_ioreq_server entries while a split happens.
> No, I didn't think about PoD here. But indeed I was misremembering:
> We don't currently make any attempt to transparently recreate large
> pages.

Thanks, Jan. So do you now think the current counting logic for
p2m_ioreq_server is correct? :-)

Yu
> Jan
>
>
Jan Beulich March 14, 2017, 1:11 p.m. UTC | #7
>>> On 14.03.17 at 13:18, <yu.c.zhang@linux.intel.com> wrote:

> 
> On 3/14/2017 6:49 PM, Jan Beulich wrote:
>>>>> On 14.03.17 at 08:42, <yu.c.zhang@linux.intel.com> wrote:
>>> On 3/13/2017 7:24 PM, Jan Beulich wrote:
>>>>>>> On 11.03.17 at 09:42, <yu.c.zhang@linux.intel.com> wrote:
>>>>> On 3/11/2017 12:03 AM, Jan Beulich wrote:
>>>>>> But there's a wider understanding issue I'm having here: What is
>>>>>> an "entry" here? Commonly I would assume this to refer to an
>>>>>> individual (4k) page, but it looks like you really mean table entry,
>>>>>> i.e. possibly representing a 2M or 1G page.
>>>>> Well, it should be an entry pointing to a 4K page(only).
>>>>> For p2m_ioreq_server, we shall not meet huge page. Because they are
>>>>> changed from p2m_ram_rw pages
>>>>> in set_mem_type() -> p2m_change_type_one(), which calls p2m_set_entry()
>>>>> with PAGE_ORDER_4K specified.
>>>> And recombination of large pages won't ever end up hitting these?
>>> Well, by recombination I guess you refer to the POD pages? I do not
>>> think p2m_ioreq_server
>>> pages will be combined now, which means we do not need to worry about
>>> recounting the
>>> p2m_ioreq_server entries while a split happens.
>> No, I didn't think about PoD here. But indeed I was misremembering:
>> We don't currently make any attempt to transparently recreate large
>> pages.
> 
> Thanks Jan. So do you now think the current counting logic for 
> p2m_ioreq_server is correct? :-)

Yes. But please make sure you mention the 4k page dependency
at least in the commit message.

Jan
Yu Zhang March 14, 2017, 1:29 p.m. UTC | #8
On 3/14/2017 9:11 PM, Jan Beulich wrote:
>>>> On 14.03.17 at 13:18, <yu.c.zhang@linux.intel.com> wrote:
>> On 3/14/2017 6:49 PM, Jan Beulich wrote:
>>>>>> On 14.03.17 at 08:42, <yu.c.zhang@linux.intel.com> wrote:
>>>> On 3/13/2017 7:24 PM, Jan Beulich wrote:
>>>>>>>> On 11.03.17 at 09:42, <yu.c.zhang@linux.intel.com> wrote:
>>>>>> On 3/11/2017 12:03 AM, Jan Beulich wrote:
>>>>>>> But there's a wider understanding issue I'm having here: What is
>>>>>>> an "entry" here? Commonly I would assume this to refer to an
>>>>>>> individual (4k) page, but it looks like you really mean table entry,
>>>>>>> i.e. possibly representing a 2M or 1G page.
>>>>>> Well, it should be an entry pointing to a 4K page(only).
>>>>>> For p2m_ioreq_server, we shall not meet huge page. Because they are
>>>>>> changed from p2m_ram_rw pages
>>>>>> in set_mem_type() -> p2m_change_type_one(), which calls p2m_set_entry()
>>>>>> with PAGE_ORDER_4K specified.
>>>>> And recombination of large pages won't ever end up hitting these?
>>>> Well, by recombination I guess you refer to the POD pages? I do not
>>>> think p2m_ioreq_server
>>>> pages will be combined now, which means we do not need to worry about
>>>> recounting the
>>>> p2m_ioreq_server entries while a split happens.
>>> No, I didn't think about PoD here. But indeed I was misremembering:
>>> We don't currently make any attempt to transparently recreate large
>>> pages.
>> Thanks Jan. So do you now think the current counting logic for
>> p2m_ioreq_server is correct? :-)
> Yes. But please make sure you mention the 4k page dependency
> at least in the commit message.

OK. Will do.

Thanks
Yu
> Jan
>
>

Patch

diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
index fcb9f38..c129eb4 100644
--- a/xen/arch/x86/hvm/ioreq.c
+++ b/xen/arch/x86/hvm/ioreq.c
@@ -949,6 +949,14 @@  int hvm_map_mem_type_to_ioreq_server(struct domain *d, ioservid_t id,
 
     spin_unlock_recursive(&d->arch.hvm_domain.ioreq_server.lock);
 
+    if ( rc == 0 && flags == 0 )
+    {
+        struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+        if ( read_atomic(&p2m->ioreq.entry_count) )
+            p2m_change_entry_type_global(d, p2m_ioreq_server, p2m_ram_rw);
+    }
+
     return rc;
 }
 
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index a57b385..f27a56f 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -187,6 +187,15 @@  out:
  */
 static int hap_enable_log_dirty(struct domain *d, bool_t log_global)
 {
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+    /*
+    * Refuse to turn on global log-dirty mode if
+    * there's outstanding p2m_ioreq_server pages.
+    */
+    if ( log_global && read_atomic(&p2m->ioreq.entry_count) )
+        return -EBUSY;
+
     /* turn on PG_log_dirty bit in paging mode */
     paging_lock(d);
     d->arch.paging.mode |= PG_log_dirty;
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index cc1eb21..1df3d09 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -544,6 +544,12 @@  static int resolve_misconfig(struct p2m_domain *p2m, unsigned long gfn)
                     e.ipat = ipat;
                     if ( e.recalc && p2m_is_changeable(e.sa_p2mt) )
                     {
+                         if ( e.sa_p2mt == p2m_ioreq_server )
+                         {
+                             p2m->ioreq.entry_count--;
+                             ASSERT(p2m->ioreq.entry_count >= 0);
+                         }
+
                          e.sa_p2mt = p2m_is_logdirty_range(p2m, gfn + i, gfn + i)
                                      ? p2m_ram_logdirty : p2m_ram_rw;
                          ept_p2m_type_to_flags(p2m, &e, e.sa_p2mt, e.access);
@@ -965,7 +971,7 @@  static mfn_t ept_get_entry(struct p2m_domain *p2m,
     if ( is_epte_valid(ept_entry) )
     {
         if ( (recalc || ept_entry->recalc) &&
-             p2m_is_changeable(ept_entry->sa_p2mt) )
+             p2m_check_changeable(ept_entry->sa_p2mt) )
             *t = p2m_is_logdirty_range(p2m, gfn, gfn) ? p2m_ram_logdirty
                                                       : p2m_ram_rw;
         else
diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index 97dc25d..d9a7b76 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -437,11 +437,13 @@  static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
          needs_recalc(l1, *pent) )
     {
         l1_pgentry_t e = *pent;
+        p2m_type_t p2mt_old;
 
         if ( !valid_recalc(l1, e) )
             P2M_DEBUG("bogus recalc leaf at d%d:%lx:%u\n",
                       p2m->domain->domain_id, gfn, level);
-        if ( p2m_is_changeable(p2m_flags_to_type(l1e_get_flags(e))) )
+        p2mt_old = p2m_flags_to_type(l1e_get_flags(e));
+        if ( p2m_is_changeable(p2mt_old) )
         {
             unsigned long mask = ~0UL << (level * PAGETABLE_ORDER);
             p2m_type_t p2mt = p2m_is_logdirty_range(p2m, gfn & mask, gfn | ~mask)
@@ -461,6 +463,13 @@  static int do_recalc(struct p2m_domain *p2m, unsigned long gfn)
                      mfn &= ~(_PAGE_PSE_PAT >> PAGE_SHIFT);
                 flags |= _PAGE_PSE;
             }
+
+            if ( p2mt_old == p2m_ioreq_server )
+            {
+                p2m->ioreq.entry_count--;
+                ASSERT(p2m->ioreq.entry_count >= 0);
+            }
+
             e = l1e_from_pfn(mfn, flags);
             p2m_add_iommu_flags(&e, level,
                                 (p2mt == p2m_ram_rw)
@@ -730,7 +739,7 @@  p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
 static inline p2m_type_t recalc_type(bool_t recalc, p2m_type_t t,
                                      struct p2m_domain *p2m, unsigned long gfn)
 {
-    if ( !recalc || !p2m_is_changeable(t) )
+    if ( !recalc || !p2m_check_changeable(t) )
         return t;
     return p2m_is_logdirty_range(p2m, gfn, gfn) ? p2m_ram_logdirty
                                                 : p2m_ram_rw;
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 0edfc61..94d7141 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -954,6 +954,26 @@  int p2m_change_type_one(struct domain *d, unsigned long gfn,
                          p2m->default_access)
          : -EBUSY;
 
+    if ( !rc )
+    {
+        switch ( nt )
+        {
+        case p2m_ram_rw:
+            if ( ot == p2m_ioreq_server )
+            {
+                p2m->ioreq.entry_count--;
+                ASSERT(p2m->ioreq.entry_count >= 0);
+            }
+            break;
+        case p2m_ioreq_server:
+            if ( ot == p2m_ram_rw )
+                p2m->ioreq.entry_count++;
+            break;
+        default:
+            break;
+        }
+    }
+
     gfn_unlock(p2m, gfn, 0);
 
     return rc;
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 3786680..395f125 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -120,7 +120,10 @@  typedef unsigned int p2m_query_t;
 
 /* Types that can be subject to bulk transitions. */
 #define P2M_CHANGEABLE_TYPES (p2m_to_mask(p2m_ram_rw) \
-                              | p2m_to_mask(p2m_ram_logdirty) )
+                              | p2m_to_mask(p2m_ram_logdirty) \
+                              | p2m_to_mask(p2m_ioreq_server) )
+
+#define P2M_IOREQ_TYPES (p2m_to_mask(p2m_ioreq_server))
 
 #define P2M_POD_TYPES (p2m_to_mask(p2m_populate_on_demand))
 
@@ -157,6 +160,7 @@  typedef unsigned int p2m_query_t;
 #define p2m_is_readonly(_t) (p2m_to_mask(_t) & P2M_RO_TYPES)
 #define p2m_is_discard_write(_t) (p2m_to_mask(_t) & P2M_DISCARD_WRITE_TYPES)
 #define p2m_is_changeable(_t) (p2m_to_mask(_t) & P2M_CHANGEABLE_TYPES)
+#define p2m_is_ioreq(_t) (p2m_to_mask(_t) & P2M_IOREQ_TYPES)
 #define p2m_is_pod(_t) (p2m_to_mask(_t) & P2M_POD_TYPES)
 #define p2m_is_grant(_t) (p2m_to_mask(_t) & P2M_GRANT_TYPES)
 /* Grant types are *not* considered valid, because they can be
@@ -178,6 +182,8 @@  typedef unsigned int p2m_query_t;
 
 #define p2m_allows_invalid_mfn(t) (p2m_to_mask(t) & P2M_INVALID_MFN_TYPES)
 
+#define p2m_check_changeable(t) (p2m_is_changeable(t) && !p2m_is_ioreq(t))
+
 typedef enum {
     p2m_host,
     p2m_nested,
@@ -349,6 +355,7 @@  struct p2m_domain {
           * are to be emulated by an ioreq server.
           */
          unsigned int flags;
+         long entry_count;
      } ioreq;
 };