[v4,18/21] VT-d: replace all-contiguous page tables by superpage mappings

Message ID: 98553b89-6296-9e4c-4677-9201cd7cdeef@suse.com
State: Superseded
Series: IOMMU: superpage support when not sharing pagetables

Commit Message

Jan Beulich April 25, 2022, 8:43 a.m. UTC
When a page table ends up with all contiguous entries (including all
identical attributes), it can be replaced by a superpage entry at the
next higher level. The page table itself can then be scheduled for
freeing.
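
By way of illustration only -- the patch itself tracks contiguity
incrementally via pt_update_contig_markers() rather than rescanning --
a brute-force check for such a table could look like the sketch below.
The dma_pte type, the address mask, and table_is_contiguous() are
simplified assumptions, not Xen code:

#include <stdbool.h>
#include <stdint.h>

#define PTE_NUM    512
#define ADDR_MASK  0x0000fffffffff000ULL /* bits 51:12, address portion */

struct dma_pte { uint64_t val; };

static bool table_is_contiguous(const struct dma_pte *pt, unsigned int level)
{
    /* Size covered by one entry at this level: 4k << (9 * (level - 1)). */
    uint64_t entry_size = 1ULL << (12 + 9 * (level - 1));
    uint64_t base = pt[0].val & ADDR_MASK;
    uint64_t attrs = pt[0].val & ~ADDR_MASK;
    unsigned int i;

    /* The would-be superpage must be naturally aligned. */
    if ( base & (entry_size * PTE_NUM - 1) )
        return false;

    /* All 512 entries must advance the address by entry_size while
     * carrying identical attributes. */
    for ( i = 0; i < PTE_NUM; ++i )
        if ( pt[i].val != ((base + i * entry_size) | attrs) )
            return false;

    return true;
}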

The adjustment to LEVEL_MASK is merely to avoid leaving a latent trap
for whenever we (and obviously hardware) start supporting 512G mappings.
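
Concretely (a hedged illustration; level_to_offset_bits(l) is assumed
to expand to 12 + (l - 1) * LEVEL_STRIDE, as in this driver's header):
the patch adds new.val &= ~(LEVEL_MASK << level_to_offset_bits(level)),
and with the old int-typed LEVEL_MASK that would shift a 32-bit value
by 39 bits once level 4 (512G) entries exist -- undefined behaviour in
C.  Deriving the mask from PTE_NUM - 1UL keeps the arithmetic in
unsigned long:

#include <stdint.h>

#define LEVEL_STRIDE            9
#define PTE_NUM                 (1 << LEVEL_STRIDE)
#define level_to_offset_bits(l) (12 + ((l) - 1) * LEVEL_STRIDE) /* assumed */

static uint64_t clear_level_bits(uint64_t val, unsigned int level)
{
    /* With ((1 << LEVEL_STRIDE) - 1), the mask is an int, and for
     * level == 4 the shift count is 39 -- wider than int, hence UB.
     * PTE_NUM - 1UL promotes the shift (and complement) to 64 bits. */
    return val & ~((PTE_NUM - 1UL) << level_to_offset_bits(level));
}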

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
Unlike the freeing of all-empty page tables, this causes quite a bit of
back and forth for PV domains, due to their mapping/unmapping of pages
when they get converted to/from being page tables. It may therefore be
worth considering delaying re-coalescing a little, to avoid doing so
when the superpage would otherwise get split again pretty soon. But I
think this would better be the subject of a separate change anyway.

Of course this could also be helped by more "aware" kernel-side
behavior: kernels could avoid immediately mapping freed page tables
writable again, in anticipation of re-using that same page for another
page table elsewhere.
---
v4: Re-base over changes earlier in the series.
v3: New.

Comments

Roger Pau Monné May 11, 2022, 11:08 a.m. UTC | #1
On Mon, Apr 25, 2022 at 10:43:45AM +0200, Jan Beulich wrote:
> When a page table ends up with all contiguous entries (including all
> identical attributes), it can be replaced by a superpage entry at the
> next higher level. The page table itself can then be scheduled for
> freeing.
> 
> The adjustment to LEVEL_MASK is merely to avoid leaving a latent trap
> for whenever we (and obviously hardware) start supporting 512G mappings.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Like on the AMD side, I wonder whether you can get away with only
doing a cache flush for the last (highest-level) PTE, as the lower
ones won't be reachable anyway once the page table is freed.

Then the flush could be done outside of the locked region.

The rest LGTM.

Thanks, Roger.
Jan Beulich May 18, 2022, 10:44 a.m. UTC | #2
On 11.05.2022 13:08, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:43:45AM +0200, Jan Beulich wrote:
>> When a page table ends up with all contiguous entries (including all
>> identical attributes), it can be replaced by a superpage entry at the
>> next higher level. The page table itself can then be scheduled for
>> freeing.
>>
>> The adjustment to LEVEL_MASK is merely to avoid leaving a latent trap
>> for whenever we (and obviously hardware) start supporting 512G mappings.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> 
> Like on the AMD side, I wonder whether you can get away with only

FTAOD I take it you mean "like on the all-empty side", as on AMD we
don't need to do any cache flushing?

> doing a cache flush for the last (highest-level) PTE, as the lower
> ones won't be reachable anyway once the page table is freed.

But that freeing will happen only later, with a TLB flush in between.
Until then we had better make sure the IOMMU sees what was written,
even if its reading a stale value _should_ be benign.

Jan

> Then the flush could be done outside of the locked region.
> 
> The rest LGTM.
> 
> Thanks, Roger.
>
Roger Pau Monné May 20, 2022, 10:38 a.m. UTC | #3
On Wed, May 18, 2022 at 12:44:06PM +0200, Jan Beulich wrote:
> On 11.05.2022 13:08, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:43:45AM +0200, Jan Beulich wrote:
> >> When a page table ends up with all contiguous entries (including all
> >> identical attributes), it can be replaced by a superpage entry at the
> >> next higher level. The page table itself can then be scheduled for
> >> freeing.
> >>
> >> The adjustment to LEVEL_MASK is merely to avoid leaving a latent trap
> >> for whenever we (and obviously hardware) start supporting 512G mappings.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> > 
> > Like on the AMD side, I wonder whether you can get away with only
> 
> FTAOD I take it you mean "like on the all-empty side", as on AMD we
> don't need to do any cache flushing?

Heh, yes, sorry.

> > doing a cache flush for the last (highest-level) PTE, as the lower
> > ones won't be reachable anyway once the page table is freed.
> 
> But that freeing will happen only later, with a TLB flush in between.
> Until then we had better make sure the IOMMU sees what was written,
> even if its reading a stale value _should_ be benign.

Hm, but when doing the TLB flush the paging structures will already be
fully updated, and the top-level visible entry will have its cache
flushed, so the lower ones would never be reachable AFAICT.

Thanks, Roger.
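
For concreteness, a comment-only sketch of the sequence being discussed
(helper names are from the patch below; that freeing is deferred until
after the IOTLB flush is taken from Jan's reply, not verified against
the implementation):

/*
 *   spin_lock(&hd->arch.mapping_lock);
 *   ...
 *   *pte = new;                           <- install superpage entry
 *   iommu_sync_cache(pte, sizeof(*pte));  <- make the write visible to
 *                                            a non-snooping IOMMU
 *   iommu_queue_free_pgtable(hd, pg);     <- old table only queued
 *   spin_unlock(&hd->arch.mapping_lock);
 *   ...
 *   IOTLB flush, as requested via *flush_flags
 *   ...
 *   queued page tables actually freed
 *
 * Roger's suggestion: once its parent entry is rewritten, only the
 * final, highest-level PTE stays reachable, so flushing just that one
 * (possibly outside the locked region) might suffice.  Jan's concern:
 * until the IOTLB flush, the IOMMU can still walk the not-yet-freed
 * lower tables, so their cache lines get flushed too, even if reading
 * stale values should be benign.
 */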

Patch

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2216,14 +2216,35 @@  static int __must_check cf_check intel_i
      * While the (ab)use of PTE_kind_table here allows to save some work in
      * the function, the main motivation for it is that it avoids a so far
      * unexplained hang during boot (while preparing Dom0) on a Westmere
-     * based laptop.
+     * based laptop.  This also has the intended effect of terminating the
+     * loop when super pages aren't supported anymore at the next level.
      */
-    pt_update_contig_markers(&page->val,
-                             address_level_offset(dfn_to_daddr(dfn), level),
-                             level,
-                             (hd->platform_ops->page_sizes &
-                              (1UL << level_to_offset_bits(level + 1))
-                              ? PTE_kind_leaf : PTE_kind_table));
+    while ( pt_update_contig_markers(&page->val,
+                                     address_level_offset(dfn_to_daddr(dfn), level),
+                                     level,
+                                     (hd->platform_ops->page_sizes &
+                                      (1UL << level_to_offset_bits(level + 1))
+                                       ? PTE_kind_leaf : PTE_kind_table)) )
+    {
+        struct page_info *pg = maddr_to_page(pg_maddr);
+
+        unmap_vtd_domain_page(page);
+
+        new.val &= ~(LEVEL_MASK << level_to_offset_bits(level));
+        dma_set_pte_superpage(new);
+
+        pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), ++level,
+                                          flush_flags, false);
+        BUG_ON(pg_maddr < PAGE_SIZE);
+
+        page = map_vtd_domain_page(pg_maddr);
+        pte = &page[address_level_offset(dfn_to_daddr(dfn), level)];
+        *pte = new;
+        iommu_sync_cache(pte, sizeof(*pte));
+
+        *flush_flags |= IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;
+        iommu_queue_free_pgtable(hd, pg);
+    }
 
     spin_unlock(&hd->arch.mapping_lock);
     unmap_vtd_domain_page(page);
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -232,7 +232,7 @@  struct context_entry {
 
 /* page table handling */
 #define LEVEL_STRIDE       (9)
-#define LEVEL_MASK         ((1 << LEVEL_STRIDE) - 1)
+#define LEVEL_MASK         (PTE_NUM - 1UL)
 #define PTE_NUM            (1 << LEVEL_STRIDE)
 #define level_to_agaw(val) ((val) - 2)
 #define agaw_to_level(val) ((val) + 2)
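
For readability, the coalescing loop from the iommu.c hunk above, with
step-by-step comments added (declarations, the enclosing function, and
error handling omitted; the comments are one reading of the code, not
part of the patch):

while ( pt_update_contig_markers(&page->val,
                                 address_level_offset(dfn_to_daddr(dfn), level),
                                 level,
                                 (hd->platform_ops->page_sizes &
                                  (1UL << level_to_offset_bits(level + 1))
                                   ? PTE_kind_leaf : PTE_kind_table)) )
{
    /* The marker update reported the table as fully contiguous: it can
     * be replaced by a single superpage entry one level up. */
    struct page_info *pg = maddr_to_page(pg_maddr);

    unmap_vtd_domain_page(page);

    /* Align the address down to the superpage boundary and mark the
     * prospective entry as a superpage. */
    new.val &= ~(LEVEL_MASK << level_to_offset_bits(level));
    dma_set_pte_superpage(new);

    /* Move up one level and install the superpage entry there. */
    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), ++level,
                                      flush_flags, false);
    BUG_ON(pg_maddr < PAGE_SIZE);

    page = map_vtd_domain_page(pg_maddr);
    pte = &page[address_level_offset(dfn_to_daddr(dfn), level)];
    *pte = new;
    iommu_sync_cache(pte, sizeof(*pte)); /* the IOMMU may not snoop */

    /* Request an IOTLB flush; the replaced table is queued and freed
     * only after that flush has happened. */
    *flush_flags |= IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;
    iommu_queue_free_pgtable(hd, pg);

    /* Installing the superpage entry may itself have made the parent
     * table fully contiguous, in which case the loop coalesces again
     * at the next higher level. */
}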