
[2/3] introduce GFN notification for translated domains

Message ID 7045df66-009d-6c9f-8e8d-cfd058c29131@suse.com
State Superseded
Series AMD/IOMMU: re-work mode updating

Commit Message

Jan Beulich Nov. 6, 2019, 3:19 p.m. UTC
In order for individual IOMMU drivers (and from an abstract pov also
architectures) to be able to adjust their data structures ahead of time
when they might cover only a sub-range of all possible GFNs, introduce
a notification call used by various code paths potentially installing a
fresh mapping of a never used GFN (for a particular domain).

Note that in gnttab_transfer() the notification and lock re-acquire
handling is best effort only (the guest may not be able to make use of
the new page in case of failure, but that's in line with the lack of a
return value check of guest_physmap_add_page() itself).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
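
As a condensed illustration of the call pattern being introduced, derived
from the pvh_populate_memory_range() hunk in the diff below (a simplified
sketch, not a verbatim excerpt; error handling differs per call site):

    /*
     * Give the IOMMU driver a chance to extend its data structures
     * before a translation for a potentially never-used GFN gets
     * installed.  The highest GFN of the range being populated is
     * what gets passed.
     */
    rc = notify_gfn(d, _gfn(gfn + (1UL << order) - 1)) ?:
         guest_physmap_add_page(d, _gfn(gfn), mfn, order);
    if ( rc )
        return rc; /* Fail the whole operation; nothing was mapped yet. */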

Comments

George Dunlap Nov. 7, 2019, 11:35 a.m. UTC | #1
On 11/6/19 3:19 PM, Jan Beulich wrote:
> In order for individual IOMMU drivers (and from an abstract pov also
> architectures) to be able to adjust their data structures ahead of time
> when they might cover only a sub-range of all possible GFNs, introduce
> a notification call used by various code paths potentially installing a
> fresh mapping of a never used GFN (for a particular domain).

So trying to reverse engineer what's going on here, you mean to say
something like this:

---
Individual IOMMU drivers contain adjuct data structures for gfn ranges
contained in the main p2m.  For efficiency, these adjuct data structures
often cover only a subset of the gfn range.  Installing a fresh mapping
of a never-used gfn may require these ranges to be expanded.  Doing this
when the p2m entry is first updated may be problematic because <reasons>.

To fix this, implement notify_gfn(), to be called when Xen first becomes
aware that a potentially new gfn may be about to be used.  This will
notify the IOMMU driver about the new gfn, allowing it to expand the
data structures.  It may return -ERESTART (?) for long-running
operations, in which case the operation should be restarted or a
different error if the expansion of the data structure is not possible.
 In the latter case, the entire operation should fail.
---

Is that about right?  Note I've had to make a lot of guesses here about
the functionality and intent.

> Note that in gnttab_transfer() the notification and lock re-acquire
> handling is best effort only (the guest may not be able to make use of
> the new page in case of failure, but that's in line with the lack of a
> return value check of guest_physmap_add_page() itself).

Is there a reason we can't just return an error to the caller?

 -George
Jan Beulich Nov. 7, 2019, 11:47 a.m. UTC | #2
On 07.11.2019 12:35, George Dunlap wrote:
> On 11/6/19 3:19 PM, Jan Beulich wrote:
>> In order for individual IOMMU drivers (and from an abstract pov also
>> architectures) to be able to adjust their data structures ahead of time
>> when they might cover only a sub-range of all possible GFNs, introduce
>> a notification call used by various code paths potentially installing a
>> fresh mapping of a never used GFN (for a particular domain).
> 
> So trying to reverse engineer what's going on here, you mean to say
> something like this:
> 
> ---
> Individual IOMMU drivers contain adjuct data structures for gfn ranges
> contained in the main p2m.  For efficiency, these adjuct data structures
> often cover only a subset of the gfn range.  Installing a fresh mapping
> of a never-used gfn may require these ranges to be expanded.  Doing this
> when the p2m entry is first updated may be problematic because <reasons>.
> 
> To fix this, implement notify_gfn(), to be called when Xen first becomes
> aware that a potentially new gfn may be about to be used.  This will
> notify the IOMMU driver about the new gfn, allowing it to expand the
> data structures.  It may return -ERESTART (?) for long-running
> operations, in which case the operation should be restarted or a
> different error if the expansion of the data structure is not possible.
>  In the latter case, the entire operation should fail.
> ---
> 
> Is that about right?

With the exception of the -ERESTART / long running operations aspect,
yes. Plus assuming you mean "adjunct" (not "adjuct", which my
dictionary doesn't know about).

>  Note I've had to make a lot of guesses here about
> the functionality and intent.

Well, even after seeing your longer description, I don't see what mine
doesn't say.

>> Note that in gnttab_transfer() the notification and lock re-acquire
>> handling is best effort only (the guest may not be able to make use of
>> the new page in case of failure, but that's in line with the lack of a
>> return value check of guest_physmap_add_page() itself).
> 
> Is there a reason we can't just return an error to the caller?

Rolling back what has been done by that point would seem rather
difficult, which I guess is the reason why the code was written the
way it is (prior to my change).

Jan
George Dunlap Nov. 7, 2019, 12:10 p.m. UTC | #3
On 11/7/19 11:47 AM, Jan Beulich wrote:
> On 07.11.2019 12:35, George Dunlap wrote:
>> On 11/6/19 3:19 PM, Jan Beulich wrote:
>>> In order for individual IOMMU drivers (and from an abstract pov also
>>> architectures) to be able to adjust their data structures ahead of time
>>> when they might cover only a sub-range of all possible GFNs, introduce
>>> a notification call used by various code paths potentially installing a
>>> fresh mapping of a never used GFN (for a particular domain).
>>
>> So trying to reverse engineer what's going on here, you mean to say
>> something like this:
>>
>> ---
>> Individual IOMMU drivers contain adjuct data structures for gfn ranges
>> contained in the main p2m.  For efficiency, these adjuct data structures
>> often cover only a subset of the gfn range.  Installing a fresh mapping
>> of a never-used gfn may require these ranges to be expanded.  Doing this
>> when the p2m entry is first updated may be problematic because <reasons>.
>>
>> To fix this, implement notify_gfn(), to be called when Xen first becomes
>> aware that a potentially new gfn may be about to be used.  This will
>> notify the IOMMU driver about the new gfn, allowing it to expand the
>> data structures.  It may return -ERESTART (?) for long-running
>> operations, in which case the operation should be restarted or a
>> different error if the expansion of the data structure is not possible.
>>  In the latter case, the entire operation should fail.
>> ---
>>
>> Is that about right?
> 
> With the exception of the -ERESTART / long running operations aspect,
> yes. Plus assuming you mean "adjunct" (not "adjuct", which my
> dictionary doesn't know about).
> 
>>  Note I've had to make a lot of guesses here about
>> the functionality and intent.
> 
> Well, even after seeing your longer description, I don't see what mine
> doesn't say.

* "Ahead of time" -- ahead of what?

* Why do things need to be done ahead of time, rather than at the time
(for whatever it is)?  (I couldn't even really guess at this, which is
why I put "<reasons>".)

* To me "notify" doesn't in any way imply that the operation can fail.
Most modern notifications are FYI only, with no opportunity to prevent
the thing from happening.  (That's not to say that notify is an
inappropriate name -- just that by itself it doesn't imply the ability
to cancel, which seems like a major factor to understanding the intent
of the patch.)

>>> Note that in gnttab_transfer() the notification and lock re-acquire
>>> handling is best effort only (the guest may not be able to make use of
>>> the new page in case of failure, but that's in line with the lack of a
>>> return value check of guest_physmap_add_page() itself).
>>
>> Is there a reason we can't just return an error to the caller?
> 
> Rolling back what has been done by that point would seem rather
> difficult, which I guess is the reason why the code was written the
> way it is (prior to my change).

The phrasing made me think that you were changing it to be best-effort,
rather than following suit with existing functionality.

Maybe:

"Note that before this patch, in gnttab_transfer(), once <condition>
happens, further errors modifying the physmap are ignored (presumably
because it would be too complicated to try to roll back at that point).
 This patch follows suit by ignoring failed notify_gfn()s, simply
printing out a warning that the gfn may not be accessible due to the
failure."

 -George
Jan Beulich Nov. 7, 2019, 12:45 p.m. UTC | #4
On 07.11.2019 13:10, George Dunlap wrote:
> On 11/7/19 11:47 AM, Jan Beulich wrote:
>> On 07.11.2019 12:35, George Dunlap wrote:
>>> On 11/6/19 3:19 PM, Jan Beulich wrote:
>>>> In order for individual IOMMU drivers (and from an abstract pov also
>>>> architectures) to be able to adjust their data structures ahead of time
>>>> when they might cover only a sub-range of all possible GFNs, introduce
>>>> a notification call used by various code paths potentially installing a
>>>> fresh mapping of a never used GFN (for a particular domain).
>>>
>>> So trying to reverse engineer what's going on here, you mean to say
>>> something like this:
>>>
>>> ---
>>> Individual IOMMU drivers contain adjuct data structures for gfn ranges
>>> contained in the main p2m.  For efficiency, these adjuct data structures
>>> often cover only a subset of the gfn range.  Installing a fresh mapping
>>> of a never-used gfn may require these ranges to be expanded.  Doing this
>>> when the p2m entry is first updated may be problematic because <reasons>.
>>>
>>> To fix this, implement notify_gfn(), to be called when Xen first becomes
>>> aware that a potentially new gfn may be about to be used.  This will
>>> notify the IOMMU driver about the new gfn, allowing it to expand the
>>> data structures.  It may return -ERESTART (?) for long-running
>>> operations, in which case the operation should be restarted or a
>>> different error if the expansion of the data structure is not possible.
>>>  In the latter case, the entire operation should fail.
>>> ---
>>>
>>> Is that about right?
>>
>> With the exception of the -ERESTART / long running operations aspect,
>> yes. Plus assuming you mean "adjunct" (not "adjuct", which my
>> dictionary doesn't know about).
>>
>>>  Note I've had to make a lot of guesses here about
>>> the functionality and intent.
>>
>> Well, even after seeing your longer description, I don't see what mine
>> doesn't say.
> 
> * "Ahead of time" -- ahead of what?

I replaced "time" by "actual mapping requests", realizing that I'm
implying too much here of what is the subject of the next patch.

> * Why do things need to be done ahead of time, rather than at the time
> (for whatever it is)?  (I couldn't even really guess at this, which is
> why I put "<reasons>".)

This "why" imo really is the subject of the next patch, and hence
gets explained there.

> * To me "notify" doesn't in any way imply that the operation can fail.
> Most modern notifications are FYI only, with no opportunity to prevent
> the thing from happening.  (That's not to say that notify is an
> inappropriate name -- just that by itself it doesn't imply the ability
> to cancel, which seems like a major factor to understanding the intent
> of the patch.)

I'm up for different names; "notify" is what I could think of. It
being able to fail is in line with our more abstract notifier
infrastructure (inherited from Linux) also allowing for NOTIFY_BAD.
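
For reference, a callback in that infrastructure can veto the event it is
being told about through its return value. A schematic example in the style
of xen/include/xen/notifier.h, not code from this patch (alloc_percpu_data()
is an invented helper):

    static int cpu_callback(struct notifier_block *nfb,
                            unsigned long action, void *hcpu)
    {
        unsigned int cpu = (unsigned long)hcpu;

        switch ( action )
        {
        case CPU_UP_PREPARE:
            /* A non-zero return here aborts the CPU bring-up. */
            if ( alloc_percpu_data(cpu) )
                return NOTIFY_BAD;
            break;
        }

        return NOTIFY_DONE;
    }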

>>>> Note that in gnttab_transfer() the notification and lock re-acquire
>>>> handling is best effort only (the guest may not be able to make use of
>>>> the new page in case of failure, but that's in line with the lack of a
>>>> return value check of guest_physmap_add_page() itself).
>>>
>>> Is there a reason we can't just return an error to the caller?
>>
>> Rolling back what has been done by that point would seem rather
>> difficult, which I guess is the reason why the code was written the
>> way it is (prior to my change).
> 
> The phrasing made me think that you were changing it to be best-effort,
> rather than following suit with existing functionality.
> 
> Maybe:
> 
> "Note that before this patch, in gnttab_transfer(), once <condition>
> happens, further errors modifying the physmap are ignored (presumably
> because it would be too complicated to try to roll back at that point).
>  This patch follows suit by ignoring failed notify_gfn()s, simply
> printing out a warning that the gfn may not be accessible due to the
> failure."

Thanks, I'll use this in a slightly extended form.

Jan

Patch

--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -173,7 +173,8 @@  static int __init pvh_populate_memory_ra
             continue;
         }
 
-        rc = guest_physmap_add_page(d, _gfn(start), page_to_mfn(page),
+        rc = notify_gfn(d, _gfn(start + (1UL << order) - 1)) ?:
+             guest_physmap_add_page(d, _gfn(start), page_to_mfn(page),
                                     order);
         if ( rc != 0 )
         {
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4286,9 +4286,17 @@  static int hvmop_set_param(
         if ( a.value > SHUTDOWN_MAX )
             rc = -EINVAL;
         break;
+
     case HVM_PARAM_IOREQ_SERVER_PFN:
-        d->arch.hvm.ioreq_gfn.base = a.value;
+        if ( d->arch.hvm.params[HVM_PARAM_NR_IOREQ_SERVER_PAGES] )
+            rc = notify_gfn(
+                     d,
+                     _gfn(a.value + d->arch.hvm.params
+                                    [HVM_PARAM_NR_IOREQ_SERVER_PAGES] - 1));
+        if ( !rc )
+             d->arch.hvm.ioreq_gfn.base = a.value;
         break;
+
     case HVM_PARAM_NR_IOREQ_SERVER_PAGES:
     {
         unsigned int i;
@@ -4299,6 +4307,9 @@  static int hvmop_set_param(
             rc = -EINVAL;
             break;
         }
+        rc = notify_gfn(d, _gfn(d->arch.hvm.ioreq_gfn.base + a.value - 1));
+        if ( rc )
+            break;
         for ( i = 0; i < a.value; i++ )
             set_bit(i, &d->arch.hvm.ioreq_gfn.mask);
 
@@ -4312,7 +4323,11 @@  static int hvmop_set_param(
         BUILD_BUG_ON(HVM_PARAM_BUFIOREQ_PFN >
                      sizeof(d->arch.hvm.ioreq_gfn.legacy_mask) * 8);
         if ( a.value )
-            set_bit(a.index, &d->arch.hvm.ioreq_gfn.legacy_mask);
+        {
+            rc = notify_gfn(d, _gfn(a.value));
+            if ( !rc )
+                set_bit(a.index, &d->arch.hvm.ioreq_gfn.legacy_mask);
+        }
         break;
 
     case HVM_PARAM_X87_FIP_WIDTH:
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -946,6 +946,16 @@  map_grant_ref(
         return;
     }
 
+    if ( paging_mode_translate(ld) /* && (op->flags & GNTMAP_host_map) */ &&
+         (rc = notify_gfn(ld, gaddr_to_gfn(op->host_addr))) )
+    {
+        gdprintk(XENLOG_INFO, "notify(%"PRI_gfn") -> %d\n",
+                 gfn_x(gaddr_to_gfn(op->host_addr)), rc);
+        op->status = GNTST_general_error;
+        return;
+        BUILD_BUG_ON(GNTST_okay);
+    }
+
     if ( unlikely((rd = rcu_lock_domain_by_id(op->dom)) == NULL) )
     {
         gdprintk(XENLOG_INFO, "Could not find domain %d\n", op->dom);
@@ -2123,6 +2133,7 @@  gnttab_transfer(
     {
         bool_t okay;
         int rc;
+        gfn_t gfn;
 
         if ( i && hypercall_preempt_check() )
             return i;
@@ -2300,21 +2311,52 @@  gnttab_transfer(
         act = active_entry_acquire(e->grant_table, gop.ref);
 
         if ( evaluate_nospec(e->grant_table->gt_version == 1) )
+            gfn = _gfn(shared_entry_v1(e->grant_table, gop.ref).frame);
+        else
+            gfn = _gfn(shared_entry_v2(e->grant_table, gop.ref).full_page.frame);
+
+        if ( paging_mode_translate(e) )
         {
-            grant_entry_v1_t *sha = &shared_entry_v1(e->grant_table, gop.ref);
+            gfn_t gfn2;
+
+            active_entry_release(act);
+            grant_read_unlock(e->grant_table);
+
+            rc = notify_gfn(e, gfn);
+            if ( rc )
+                printk(XENLOG_G_WARNING
+                       "%pd: gref %u: xfer GFN %"PRI_gfn" may be inaccessible (%d)\n",
+                       e, gop.ref, gfn_x(gfn), rc);
+
+            grant_read_lock(e->grant_table);
+            act = active_entry_acquire(e->grant_table, gop.ref);
 
-            guest_physmap_add_page(e, _gfn(sha->frame), mfn, 0);
-            if ( !paging_mode_translate(e) )
-                sha->frame = mfn_x(mfn);
+            if ( evaluate_nospec(e->grant_table->gt_version == 1) )
+                gfn2 = _gfn(shared_entry_v1(e->grant_table, gop.ref).frame);
+            else
+                gfn2 = _gfn(shared_entry_v2(e->grant_table, gop.ref).
+                    full_page.frame);
+
+            if ( !gfn_eq(gfn, gfn2) )
+            {
+                printk(XENLOG_G_WARNING
+                       "%pd: gref %u: xfer GFN went %"PRI_gfn" -> %"PRI_gfn"\n",
+                       e, gop.ref, gfn_x(gfn), gfn_x(gfn2));
+                gfn = gfn2;
+            }
         }
-        else
-        {
-            grant_entry_v2_t *sha = &shared_entry_v2(e->grant_table, gop.ref);
 
-            guest_physmap_add_page(e, _gfn(sha->full_page.frame), mfn, 0);
-            if ( !paging_mode_translate(e) )
-                sha->full_page.frame = mfn_x(mfn);
+        guest_physmap_add_page(e, gfn, mfn, 0);
+
+        if ( !paging_mode_translate(e) )
+        {
+            if ( evaluate_nospec(e->grant_table->gt_version == 1) )
+                shared_entry_v1(e->grant_table, gop.ref).frame = mfn_x(mfn);
+            else
+                shared_entry_v2(e->grant_table, gop.ref).full_page.frame =
+                    mfn_x(mfn);
         }
+
         smp_wmb();
         shared_entry_header(e->grant_table, gop.ref)->flags |=
             GTF_transfer_completed;
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -203,6 +203,10 @@  static void populate_physmap(struct memo
         if ( unlikely(__copy_from_guest_offset(&gpfn, a->extent_list, i, 1)) )
             goto out;
 
+        if ( paging_mode_translate(d) &&
+             notify_gfn(d, _gfn(gpfn + (1U << a->extent_order) - 1)) )
+            goto out;
+
         if ( a->memflags & MEMF_populate_on_demand )
         {
             /* Disallow populating PoD pages on oneself. */
@@ -745,6 +749,10 @@  static long memory_exchange(XEN_GUEST_HA
                 continue;
             }
 
+            if ( paging_mode_translate(d) )
+                rc = notify_gfn(d,
+                                _gfn(gpfn + (1U << exch.out.extent_order) - 1));
+
             mfn = page_to_mfn(page);
             guest_physmap_add_page(d, _gfn(gpfn), mfn,
                                    exch.out.extent_order);
@@ -813,12 +821,20 @@  int xenmem_add_to_physmap(struct domain
         extra.foreign_domid = DOMID_INVALID;
 
     if ( xatp->space != XENMAPSPACE_gmfn_range )
-        return xenmem_add_to_physmap_one(d, xatp->space, extra,
+        return notify_gfn(d, _gfn(xatp->gpfn)) ?:
+               xenmem_add_to_physmap_one(d, xatp->space, extra,
                                          xatp->idx, _gfn(xatp->gpfn));
 
     if ( xatp->size < start )
         return -EILSEQ;
 
+    if ( !start && xatp->size )
+    {
+        rc = notify_gfn(d, _gfn(xatp->gpfn + xatp->size - 1));
+        if ( rc )
+            return rc;
+    }
+
     xatp->idx += start;
     xatp->gpfn += start;
     xatp->size -= start;
@@ -891,7 +907,8 @@  static int xenmem_add_to_physmap_batch(s
                                                extent, 1)) )
             return -EFAULT;
 
-        rc = xenmem_add_to_physmap_one(d, xatpb->space,
+        rc = notify_gfn(d, _gfn(gpfn)) ?:
+             xenmem_add_to_physmap_one(d, xatpb->space,
                                        xatpb->u,
                                        idx, _gfn(gpfn));
 
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -530,6 +530,14 @@  void iommu_share_p2m_table(struct domain
         iommu_get_ops()->share_p2m(d);
 }
 
+int iommu_notify_gfn(struct domain *d, gfn_t gfn)
+{
+    const struct iommu_ops *ops = dom_iommu(d)->platform_ops;
+
+    return need_iommu_pt_sync(d) && ops->notify_dfn
+           ? iommu_call(ops, notify_dfn, d, _dfn(gfn_x(gfn))) : 0;
+}
+
 void iommu_crash_shutdown(void)
 {
     if ( !iommu_crash_disable )
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -237,6 +237,8 @@  struct iommu_ops {
     int __must_check (*lookup_page)(struct domain *d, dfn_t dfn, mfn_t *mfn,
                                     unsigned int *flags);
 
+    int __must_check (*notify_dfn)(struct domain *d, dfn_t dfn);
+
     void (*free_page_table)(struct page_info *);
 
 #ifdef CONFIG_X86
@@ -331,6 +333,7 @@  void iommu_crash_shutdown(void);
 int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
 
 void iommu_share_p2m_table(struct domain *d);
+int __must_check iommu_notify_gfn(struct domain *d, gfn_t gfn);
 
 #ifdef CONFIG_HAS_PCI
 int iommu_do_pci_domctl(struct xen_domctl *, struct domain *d,
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -1039,6 +1039,11 @@  static always_inline bool is_iommu_enabl
     return evaluate_nospec(d->options & XEN_DOMCTL_CDF_iommu);
 }
 
+static inline int __must_check notify_gfn(struct domain *d, gfn_t gfn)
+{
+    return /* arch_notify_gfn(d, gfn) ?: */ iommu_notify_gfn(d, gfn);
+}
+
 extern bool sched_smt_power_savings;
 extern bool sched_disable_smt_switching;
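
The notify_dfn hook itself remains without an implementation in this patch;
patch 3/3 of the series supplies the AMD one. Purely to illustrate the
intended division of labour, a handler might look roughly like the sketch
below. is_dfn_covered() and grow_page_tables() are invented names, the
locking merely mimics that of typical map/unmap handlers, and this is not
the actual AMD implementation:

    static int __must_check example_iommu_notify_dfn(struct domain *d,
                                                     dfn_t dfn)
    {
        struct domain_iommu *hd = dom_iommu(d);
        int rc = 0;

        spin_lock(&hd->arch.mapping_lock);

        /* Extend the page tables if they don't cover this DFN yet. */
        if ( !is_dfn_covered(hd, dfn) )
            rc = grow_page_tables(d, dfn); /* e.g. -ENOMEM on failure */

        spin_unlock(&hd->arch.mapping_lock);

        return rc;
    }

The driver would then advertise the handler in its struct iommu_ops
instance, i.e. .notify_dfn = example_iommu_notify_dfn, for
iommu_notify_gfn() to invoke.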