diff mbox series

[v4,16/21] VT-d: free all-empty page tables

Message ID b9a2be8d-3bb9-3718-6e3b-f07f6dcdde20@suse.com (mailing list archive)
State Superseded
Headers show
Series IOMMU: superpage support when not sharing pagetables | expand

Commit Message

Jan Beulich April 25, 2022, 8:42 a.m. UTC
When a page table ends up with no present entries left, it can be
replaced by a non-present entry at the next higher level. The page table
itself can then be scheduled for freeing.

Note that, while its output isn't used there yet, pt_update_contig_markers()
needs to be called right away in all places where entries get updated, not
just the one where entries get cleared.

Note further that while pt_update_contig_markers() may update several PTEs
within the table, these are changes to "avail" bits only, so I do not think
that cache flushing would be needed afterwards. Such cache flushing (of
entire pages, unless adding yet more logic to be more selective) would be
quite noticeable performance-wise (very prominent during Dom0 boot).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v4: Re-base over changes earlier in the series.
v3: Properly bound loop. Re-base over changes earlier in the series.
v2: New.
---
The hang during boot on my Latitude E6410 (see the respective code
comment) occurred pretty soon after iommu_enable_translation(). No errors,
no watchdog would kick in; just sometimes the first few pixel lines of
the next log message's (XEN) prefix would have made it out to the screen
(and there's no serial there). It took a lot of experimenting until I
figured out the workaround (which I consider ugly, but halfway acceptable).
I've been trying hard to make sure the workaround wouldn't be masking a
real issue, yet I'm still wary of it possibly doing so ... My best guess
at this point is that on these old IOMMUs the ignored bits 52...61
aren't really ignored for present entries, but also aren't "reserved"
enough to trigger faults. This guess is from having tried to set other
bits in this range (unconditionally, and with the workaround here in
place), which yielded the same behavior.
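
For reference, the freeing logic added to dma_pte_clear_one() boils down to
the following (a simplified sketch of the hunk below, with locking, the
BUG_ON(), and error handling elided; as used here, a true return from
pt_update_contig_markers() for PTE_kind_null indicates the table no longer
has any present entries):

    dma_clear_pte(*pte);
    iommu_sync_cache(pte, sizeof(*pte));

    /* Walk upwards for as long as the just-modified table became entirely
     * empty: clear the referencing entry one level up and queue the now
     * empty table for freeing.  The ++level < min_pt_levels check bounds
     * how far up the collapse may go. */
    while ( pt_update_contig_markers(&page->val,
                                     address_level_offset(addr, level),
                                     level, PTE_kind_null) &&
            ++level < min_pt_levels )
    {
        struct page_info *pg = maddr_to_page(pg_maddr);

        unmap_vtd_domain_page(page);

        /* Re-obtain the page table which holds the entry to clear next. */
        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
                                          false);
        page = map_vtd_domain_page(pg_maddr);
        pte = &page[address_level_offset(addr, level)];
        dma_clear_pte(*pte);
        iommu_sync_cache(pte, sizeof(*pte));

        *flush_flags |= IOMMU_FLUSHF_all;
        iommu_queue_free_pgtable(hd, pg);
    }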

Comments

Tian, Kevin April 27, 2022, 4:09 a.m. UTC | #1
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, April 25, 2022 4:43 PM
> 
> When a page table ends up with no present entries left, it can be
> replaced by a non-present entry at the next higher level. The page table
> itself can then be scheduled for freeing.
> 
> Note that, while its output isn't used there yet, pt_update_contig_markers()
> needs to be called right away in all places where entries get updated, not
> just the one where entries get cleared.
> 
> Note further that while pt_update_contig_markers() may update several PTEs
> within the table, these are changes to "avail" bits only, so I do not think
> that cache flushing would be needed afterwards. Such cache flushing (of
> entire pages, unless adding yet more logic to be more selective) would be
> quite noticeable performance-wise (very prominent during Dom0 boot).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> ---
> v4: Re-base over changes earlier in the series.
> v3: Properly bound loop. Re-base over changes earlier in the series.
> v2: New.
> ---
> The hang during boot on my Latitude E6410 (see the respective code
> comment) occurred pretty soon after iommu_enable_translation(). No errors,
> no watchdog would kick in; just sometimes the first few pixel lines of
> the next log message's (XEN) prefix would have made it out to the screen
> (and there's no serial there). It took a lot of experimenting until I
> figured out the workaround (which I consider ugly, but halfway acceptable).
> I've been trying hard to make sure the workaround wouldn't be masking a
> real issue, yet I'm still wary of it possibly doing so ... My best guess
> at this point is that on these old IOMMUs the ignored bits 52...61
> aren't really ignored for present entries, but also aren't "reserved"
> enough to trigger faults. This guess is from having tried to set other
> bits in this range (unconditionally, and with the workaround here in
> place), which yielded the same behavior.
> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -43,6 +43,9 @@
>  #include "vtd.h"
>  #include "../ats.h"
> 
> +#define CONTIG_MASK DMA_PTE_CONTIG_MASK
> +#include <asm/pt-contig-markers.h>
> +
>  /* dom_io is used as a sentinel for quarantined devices */
>  #define QUARANTINE_SKIP(d, pgd_maddr) ((d) == dom_io && !(pgd_maddr))
>  #define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
> @@ -405,6 +408,9 @@ static uint64_t addr_to_dma_page_maddr(s
> 
>              write_atomic(&pte->val, new_pte.val);
>              iommu_sync_cache(pte, sizeof(struct dma_pte));
> +            pt_update_contig_markers(&parent->val,
> +                                     address_level_offset(addr, level),
> +                                     level, PTE_kind_table);
>          }
> 
>          if ( --level == target )
> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
> 
>      old = *pte;
>      dma_clear_pte(*pte);
> +    iommu_sync_cache(pte, sizeof(*pte));
> +
> +    while ( pt_update_contig_markers(&page->val,
> +                                     address_level_offset(addr, level),
> +                                     level, PTE_kind_null) &&
> +            ++level < min_pt_levels )
> +    {
> +        struct page_info *pg = maddr_to_page(pg_maddr);
> +
> +        unmap_vtd_domain_page(page);
> +
> +        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
> +                                          false);
> +        BUG_ON(pg_maddr < PAGE_SIZE);
> +
> +        page = map_vtd_domain_page(pg_maddr);
> +        pte = &page[address_level_offset(addr, level)];
> +        dma_clear_pte(*pte);
> +        iommu_sync_cache(pte, sizeof(*pte));
> +
> +        *flush_flags |= IOMMU_FLUSHF_all;
> +        iommu_queue_free_pgtable(hd, pg);
> +    }
> 
>      spin_unlock(&hd->arch.mapping_lock);
> -    iommu_sync_cache(pte, sizeof(struct dma_pte));
> 
>      unmap_vtd_domain_page(page);
> 
> @@ -2182,8 +2210,21 @@ static int __must_check cf_check intel_i
>      }
> 
>      *pte = new;
> -
>      iommu_sync_cache(pte, sizeof(struct dma_pte));
> +
> +    /*
> +     * While the (ab)use of PTE_kind_table here allows to save some work in
> +     * the function, the main motivation for it is that it avoids a so far
> +     * unexplained hang during boot (while preparing Dom0) on a Westmere
> +     * based laptop.
> +     */
> +    pt_update_contig_markers(&page->val,
> +                             address_level_offset(dfn_to_daddr(dfn), level),
> +                             level,
> +                             (hd->platform_ops->page_sizes &
> +                              (1UL << level_to_offset_bits(level + 1))
> +                              ? PTE_kind_leaf : PTE_kind_table));
> +
>      spin_unlock(&hd->arch.mapping_lock);
>      unmap_vtd_domain_page(page);
>
Roger Pau Monné May 10, 2022, 2:30 p.m. UTC | #2
On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
> When a page table ends up with no present entries left, it can be
> replaced by a non-present entry at the next higher level. The page table
> itself can then be scheduled for freeing.
> 
> Note that, while its output isn't used there yet, pt_update_contig_markers()
> needs to be called right away in all places where entries get updated, not
> just the one where entries get cleared.
> 
> Note further that while pt_update_contig_markers() may update several PTEs
> within the table, these are changes to "avail" bits only, so I do not think
> that cache flushing would be needed afterwards. Such cache flushing (of
> entire pages, unless adding yet more logic to be more selective) would be
> quite noticeable performance-wise (very prominent during Dom0 boot).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v4: Re-base over changes earlier in the series.
> v3: Properly bound loop. Re-base over changes earlier in the series.
> v2: New.
> ---
> The hang during boot on my Latitude E6410 (see the respective code
> comment) occurred pretty soon after iommu_enable_translation(). No errors,
> no watchdog would kick in; just sometimes the first few pixel lines of
> the next log message's (XEN) prefix would have made it out to the screen
> (and there's no serial there). It took a lot of experimenting until I
> figured out the workaround (which I consider ugly, but halfway acceptable).
> I've been trying hard to make sure the workaround wouldn't be masking a
> real issue, yet I'm still wary of it possibly doing so ... My best guess
> at this point is that on these old IOMMUs the ignored bits 52...61
> aren't really ignored for present entries, but also aren't "reserved"
> enough to trigger faults. This guess is from having tried to set other
> bits in this range (unconditionally, and with the workaround here in
> place), which yielded the same behavior.

Should we take Kevin's Reviewed-by as a heads-up that bits 52..61 on
some(?) IOMMUs are not usable?

Would be good if we could get a more formal response I think.

> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -43,6 +43,9 @@
>  #include "vtd.h"
>  #include "../ats.h"
>  
> +#define CONTIG_MASK DMA_PTE_CONTIG_MASK
> +#include <asm/pt-contig-markers.h>
> +
>  /* dom_io is used as a sentinel for quarantined devices */
>  #define QUARANTINE_SKIP(d, pgd_maddr) ((d) == dom_io && !(pgd_maddr))
>  #define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
> @@ -405,6 +408,9 @@ static uint64_t addr_to_dma_page_maddr(s
>  
>              write_atomic(&pte->val, new_pte.val);
>              iommu_sync_cache(pte, sizeof(struct dma_pte));
> +            pt_update_contig_markers(&parent->val,
> +                                     address_level_offset(addr, level),

I think (unless previous patches in the series have changed this)
there already is an 'offset' local variable that you could use.

> +                                     level, PTE_kind_table);
>          }
>  
>          if ( --level == target )
> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
>  
>      old = *pte;
>      dma_clear_pte(*pte);
> +    iommu_sync_cache(pte, sizeof(*pte));
> +
> +    while ( pt_update_contig_markers(&page->val,
> +                                     address_level_offset(addr, level),
> +                                     level, PTE_kind_null) &&
> +            ++level < min_pt_levels )
> +    {
> +        struct page_info *pg = maddr_to_page(pg_maddr);
> +
> +        unmap_vtd_domain_page(page);
> +
> +        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
> +                                          false);
> +        BUG_ON(pg_maddr < PAGE_SIZE);
> +
> +        page = map_vtd_domain_page(pg_maddr);
> +        pte = &page[address_level_offset(addr, level)];
> +        dma_clear_pte(*pte);
> +        iommu_sync_cache(pte, sizeof(*pte));
> +
> +        *flush_flags |= IOMMU_FLUSHF_all;
> +        iommu_queue_free_pgtable(hd, pg);
> +    }

I think I'm setting myself up for trouble, but do we need to sync the
cache for the lower level entries if higher level ones are to be changed?

IOW, would it be fine to just flush the highest level modified PTE?
As the lower level ones won't be reachable anyway.

>      spin_unlock(&hd->arch.mapping_lock);
> -    iommu_sync_cache(pte, sizeof(struct dma_pte));
>  
>      unmap_vtd_domain_page(page);
>  
> @@ -2182,8 +2210,21 @@ static int __must_check cf_check intel_i
>      }
>  
>      *pte = new;
> -
>      iommu_sync_cache(pte, sizeof(struct dma_pte));
> +
> +    /*
> +     * While the (ab)use of PTE_kind_table here allows to save some work in
> +     * the function, the main motivation for it is that it avoids a so far
> +     * unexplained hang during boot (while preparing Dom0) on a Westmere
> +     * based laptop.
> +     */
> +    pt_update_contig_markers(&page->val,
> +                             address_level_offset(dfn_to_daddr(dfn), level),
> +                             level,
> +                             (hd->platform_ops->page_sizes &
> +                              (1UL << level_to_offset_bits(level + 1))
> +                              ? PTE_kind_leaf : PTE_kind_table));

So this works because, on what we believe to be the affected models, the
only supported page size is 4K?

Do we want to do the same with AMD if we don't allow 512G super pages?

Thanks, Roger.
Jan Beulich May 18, 2022, 10:26 a.m. UTC | #3
On 10.05.2022 16:30, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
>> When a page table ends up with no present entries left, it can be
>> replaced by a non-present entry at the next higher level. The page table
>> itself can then be scheduled for freeing.
>>
>> Note that, while its output isn't used there yet, pt_update_contig_markers()
>> needs to be called right away in all places where entries get updated, not
>> just the one where entries get cleared.
>>
>> Note further that while pt_update_contig_markers() may update several PTEs
>> within the table, these are changes to "avail" bits only, so I do not think
>> that cache flushing would be needed afterwards. Such cache flushing (of
>> entire pages, unless adding yet more logic to be more selective) would be
>> quite noticeable performance-wise (very prominent during Dom0 boot).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> v4: Re-base over changes earlier in the series.
>> v3: Properly bound loop. Re-base over changes earlier in the series.
>> v2: New.
>> ---
>> The hang during boot on my Latitude E6410 (see the respective code
>> comment) occurred pretty soon after iommu_enable_translation(). No errors,
>> no watchdog would kick in; just sometimes the first few pixel lines of
>> the next log message's (XEN) prefix would have made it out to the screen
>> (and there's no serial there). It took a lot of experimenting until I
>> figured out the workaround (which I consider ugly, but halfway acceptable).
>> I've been trying hard to make sure the workaround wouldn't be masking a
>> real issue, yet I'm still wary of it possibly doing so ... My best guess
>> at this point is that on these old IOMMUs the ignored bits 52...61
>> aren't really ignored for present entries, but also aren't "reserved"
>> enough to trigger faults. This guess is from having tried to set other
>> bits in this range (unconditionally, and with the workaround here in
>> place), which yielded the same behavior.
> 
> Should we take Kevin's Reviewed-by as a heads-up that bits 52..61 on
> some(?) IOMMUs are not usable?
> 
> Would be good if we could get a more formal response I think.

A more formal response would be nice, but given the age of the affected
hardware I don't expect anything more will be done there by Intel.

>> @@ -405,6 +408,9 @@ static uint64_t addr_to_dma_page_maddr(s
>>  
>>              write_atomic(&pte->val, new_pte.val);
>>              iommu_sync_cache(pte, sizeof(struct dma_pte));
>> +            pt_update_contig_markers(&parent->val,
>> +                                     address_level_offset(addr, level),
> 
> I think (unless previous patches in the series have changed this)
> there already is an 'offset' local variable that you could use.

The variable is clobbered by "IOMMU/x86: prefill newly allocate page
tables".

>> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
>>  
>>      old = *pte;
>>      dma_clear_pte(*pte);
>> +    iommu_sync_cache(pte, sizeof(*pte));
>> +
>> +    while ( pt_update_contig_markers(&page->val,
>> +                                     address_level_offset(addr, level),
>> +                                     level, PTE_kind_null) &&
>> +            ++level < min_pt_levels )
>> +    {
>> +        struct page_info *pg = maddr_to_page(pg_maddr);
>> +
>> +        unmap_vtd_domain_page(page);
>> +
>> +        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
>> +                                          false);
>> +        BUG_ON(pg_maddr < PAGE_SIZE);
>> +
>> +        page = map_vtd_domain_page(pg_maddr);
>> +        pte = &page[address_level_offset(addr, level)];
>> +        dma_clear_pte(*pte);
>> +        iommu_sync_cache(pte, sizeof(*pte));
>> +
>> +        *flush_flags |= IOMMU_FLUSHF_all;
>> +        iommu_queue_free_pgtable(hd, pg);
>> +    }
> 
> I think I'm setting myself up for trouble, but do we need to sync the
> cache for the lower level entries if higher level ones are to be changed?
> 
> IOW, would it be fine to just flush the highest level modified PTE?
> As the lower level ones won't be reachable anyway.

I definitely want to err on the safe side here. If later we can
prove that some cache flush is unneeded, I'd be happy to see it
go away.

>> @@ -2182,8 +2210,21 @@ static int __must_check cf_check intel_i
>>      }
>>  
>>      *pte = new;
>> -
>>      iommu_sync_cache(pte, sizeof(struct dma_pte));
>> +
>> +    /*
>> +     * While the (ab)use of PTE_kind_table here allows to save some work in
>> +     * the function, the main motivation for it is that it avoids a so far
>> +     * unexplained hang during boot (while preparing Dom0) on a Westmere
>> +     * based laptop.
>> +     */
>> +    pt_update_contig_markers(&page->val,
>> +                             address_level_offset(dfn_to_daddr(dfn), level),
>> +                             level,
>> +                             (hd->platform_ops->page_sizes &
>> +                              (1UL << level_to_offset_bits(level + 1))
>> +                              ? PTE_kind_leaf : PTE_kind_table));
> 
> So this works because, on what we believe to be the affected models, the
> only supported page size is 4K?

Yes.
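
For illustration, the check in question asks whether the IOMMU advertises a
superpage size at the next level up.  A minimal sketch, assuming the usual
VT-d layout where level_to_offset_bits(l) evaluates to 12 + (l - 1) * 9 (the
helper name below is made up):

    static bool can_coalesce_to_next_level(const struct domain_iommu *hd,
                                           unsigned int level)
    {
        /* For level 1 (4k mappings) this tests 2M superpage support; the
         * affected Westmere-era IOMMUs advertise only 4k, so their leaf
         * entries are always recorded as PTE_kind_table. */
        return hd->platform_ops->page_sizes &
               (1UL << level_to_offset_bits(level + 1));
    }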

> Do we want to do the same with AMD if we don't allow 512G super pages?

Why? They don't have a similar flaw.

Jan
Tian, Kevin May 20, 2022, 12:38 a.m. UTC | #4
> From: Jan Beulich
> Sent: Wednesday, May 18, 2022 6:26 PM
> 
> On 10.05.2022 16:30, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
> >> When a page table ends up with no present entries left, it can be
> >> replaced by a non-present entry at the next higher level. The page table
> >> itself can then be scheduled for freeing.
> >>
> >> Note that, while its output isn't used there yet, pt_update_contig_markers()
> >> needs to be called right away in all places where entries get updated, not
> >> just the one where entries get cleared.
> >>
> >> Note further that while pt_update_contig_markers() may update several PTEs
> >> within the table, these are changes to "avail" bits only, so I do not think
> >> that cache flushing would be needed afterwards. Such cache flushing (of
> >> entire pages, unless adding yet more logic to be more selective) would be
> >> quite noticeable performance-wise (very prominent during Dom0 boot).
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> v4: Re-base over changes earlier in the series.
> >> v3: Properly bound loop. Re-base over changes earlier in the series.
> >> v2: New.
> >> ---
> >> The hang during boot on my Latitude E6410 (see the respective code
> >> comment) occurred pretty soon after iommu_enable_translation(). No errors,
> >> no watchdog would kick in; just sometimes the first few pixel lines of
> >> the next log message's (XEN) prefix would have made it out to the screen
> >> (and there's no serial there). It took a lot of experimenting until I
> >> figured out the workaround (which I consider ugly, but halfway acceptable).
> >> I've been trying hard to make sure the workaround wouldn't be masking a
> >> real issue, yet I'm still wary of it possibly doing so ... My best guess
> >> at this point is that on these old IOMMUs the ignored bits 52...61
> >> aren't really ignored for present entries, but also aren't "reserved"
> >> enough to trigger faults. This guess is from having tried to set other
> >> bits in this range (unconditionally, and with the workaround here in
> >> place), which yielded the same behavior.
> >
> > Should we take Kevin's Reviewed-by as a heads-up that bits 52..61 on
> > some(?) IOMMUs are not usable?
> >
> > Would be good if we could get a more formal response I think.
> 
> A more formal response would be nice, but given the age of the affected
> hardware I don't expect anything more will be done there by Intel.
> 

I didn't hear a response on this open question internally.
Roger Pau Monné May 20, 2022, 11:13 a.m. UTC | #5
On Wed, May 18, 2022 at 12:26:03PM +0200, Jan Beulich wrote:
> On 10.05.2022 16:30, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
> >> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
> >>  
> >>      old = *pte;
> >>      dma_clear_pte(*pte);
> >> +    iommu_sync_cache(pte, sizeof(*pte));
> >> +
> >> +    while ( pt_update_contig_markers(&page->val,
> >> +                                     address_level_offset(addr, level),
> >> +                                     level, PTE_kind_null) &&
> >> +            ++level < min_pt_levels )
> >> +    {
> >> +        struct page_info *pg = maddr_to_page(pg_maddr);
> >> +
> >> +        unmap_vtd_domain_page(page);
> >> +
> >> +        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
> >> +                                          false);
> >> +        BUG_ON(pg_maddr < PAGE_SIZE);
> >> +
> >> +        page = map_vtd_domain_page(pg_maddr);
> >> +        pte = &page[address_level_offset(addr, level)];
> >> +        dma_clear_pte(*pte);
> >> +        iommu_sync_cache(pte, sizeof(*pte));
> >> +
> >> +        *flush_flags |= IOMMU_FLUSHF_all;
> >> +        iommu_queue_free_pgtable(hd, pg);
> >> +    }
> > 
> > I think I'm setting myself up for trouble, but do we need to sync the
> > cache for the lower level entries if higher level ones are to be changed?
> > 
> > IOW, would it be fine to just flush the highest level modified PTE?
> > As the lower level ones won't be reachable anyway.
> 
> I definitely want to err on the safe side here. If later we can
> prove that some cache flush is unneeded, I'd be happy to see it
> go away.

Hm, so it's not only about adding more cache flushes, but also about
moving them inside of the locked region: previously the only cache flush
was done outside of the locked region.

I guess I can't convince myself why we would need to flush the cache for
entries that are about to be removed, and that also point to pages
scheduled to be freed.

> >> @@ -2182,8 +2210,21 @@ static int __must_check cf_check intel_i
> >>      }
> >>  
> >>      *pte = new;
> >> -
> >>      iommu_sync_cache(pte, sizeof(struct dma_pte));
> >> +
> >> +    /*
> >> +     * While the (ab)use of PTE_kind_table here allows to save some work in
> >> +     * the function, the main motivation for it is that it avoids a so far
> >> +     * unexplained hang during boot (while preparing Dom0) on a Westmere
> >> +     * based laptop.
> >> +     */
> >> +    pt_update_contig_markers(&page->val,
> >> +                             address_level_offset(dfn_to_daddr(dfn), level),
> >> +                             level,
> >> +                             (hd->platform_ops->page_sizes &
> >> +                              (1UL << level_to_offset_bits(level + 1))
> >> +                              ? PTE_kind_leaf : PTE_kind_table));
> > 
> > So this works because, on what we believe to be the affected models, the
> > only supported page size is 4K?
> 
> Yes.
> 
> > Do we want to do the same with AMD if we don't allow 512G super pages?
> 
> Why? They don't have a similar flaw.

So the question was mostly whether we should also avoid the
pt_update_contig_markers for 1G entries, because we won't coalesce
them into a 512G anyway.  IOW avoid the overhead of updating the
contig markers if we know that the resulting super-page is not
supported by ->page_sizes.
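
The kind of guard being suggested would look roughly as follows (purely
hypothetical and not part of the series; shown with the VT-d names used in
this patch, whereas the AMD code would use its own helpers):

    /* Skip the marker bookkeeping entirely when the next-level superpage
     * could not be used anyway. */
    if ( hd->platform_ops->page_sizes &
         (1UL << level_to_offset_bits(level + 1)) )
        pt_update_contig_markers(&page->val,
                                 address_level_offset(dfn_to_daddr(dfn), level),
                                 level, PTE_kind_leaf);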

Thanks, Roger.
Jan Beulich May 27, 2022, 7:40 a.m. UTC | #6
On 20.05.2022 13:13, Roger Pau Monné wrote:
> On Wed, May 18, 2022 at 12:26:03PM +0200, Jan Beulich wrote:
>> On 10.05.2022 16:30, Roger Pau Monné wrote:
>>> On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
>>>> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
>>>>  
>>>>      old = *pte;
>>>>      dma_clear_pte(*pte);
>>>> +    iommu_sync_cache(pte, sizeof(*pte));
>>>> +
>>>> +    while ( pt_update_contig_markers(&page->val,
>>>> +                                     address_level_offset(addr, level),
>>>> +                                     level, PTE_kind_null) &&
>>>> +            ++level < min_pt_levels )
>>>> +    {
>>>> +        struct page_info *pg = maddr_to_page(pg_maddr);
>>>> +
>>>> +        unmap_vtd_domain_page(page);
>>>> +
>>>> +        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
>>>> +                                          false);
>>>> +        BUG_ON(pg_maddr < PAGE_SIZE);
>>>> +
>>>> +        page = map_vtd_domain_page(pg_maddr);
>>>> +        pte = &page[address_level_offset(addr, level)];
>>>> +        dma_clear_pte(*pte);
>>>> +        iommu_sync_cache(pte, sizeof(*pte));
>>>> +
>>>> +        *flush_flags |= IOMMU_FLUSHF_all;
>>>> +        iommu_queue_free_pgtable(hd, pg);
>>>> +    }
>>>
>>> I think I'm setting myself up for trouble, but do we need to sync the
>>> cache for the lower level entries if higher level ones are to be changed?
>>>
>>> IOW, would it be fine to just flush the highest level modified PTE?
>>> As the lower level ones won't be reachable anyway.
>>
>> I definitely want to err on the safe side here. If later we can
>> prove that some cache flush is unneeded, I'd be happy to see it
>> go away.
> 
> Hm, so it's not only about adding more cache flushes, but also about
> moving them inside of the locked region: previously the only cache flush
> was done outside of the locked region.
> 
> I guess I can't convince myself why we would need to flush the cache for
> entries that are about to be removed, and that also point to pages
> scheduled to be freed.

As previously said - with a series like this I wanted to strictly be
on the safe side, maintaining the pre-existing pattern of all
modifications of live tables being accompanied by a flush (if flushes
are needed in the first place, of course). As to moving flushes into
the locked region, I don't view this as a problem, seeing in
particular that elsewhere we already have flushes with the lock held
(at the very least the _full page_ one in addr_to_dma_page_maddr(),
but also e.g. in intel_iommu_map_page(), where it could be easily
moved past the unlock).

If you (continue to) think that breaking the present pattern isn't
going to misguide future changes, I can certainly drop these not
really necessary flushes. Otoh I was actually considering,
subsequently, integrating the flushes into e.g. dma_clear_pte() to
make it virtually impossible to break that pattern. This would
imply that all page table related flushes would then occur with the
lock held.

(I won't separately reply to the similar topic on patch 18.)

>>>> @@ -2182,8 +2210,21 @@ static int __must_check cf_check intel_i
>>>>      }
>>>>  
>>>>      *pte = new;
>>>> -
>>>>      iommu_sync_cache(pte, sizeof(struct dma_pte));
>>>> +
>>>> +    /*
>>>> +     * While the (ab)use of PTE_kind_table here allows to save some work in
>>>> +     * the function, the main motivation for it is that it avoids a so far
>>>> +     * unexplained hang during boot (while preparing Dom0) on a Westmere
>>>> +     * based laptop.
>>>> +     */
>>>> +    pt_update_contig_markers(&page->val,
>>>> +                             address_level_offset(dfn_to_daddr(dfn), level),
>>>> +                             level,
>>>> +                             (hd->platform_ops->page_sizes &
>>>> +                              (1UL << level_to_offset_bits(level + 1))
>>>> +                              ? PTE_kind_leaf : PTE_kind_table));
>>>
>>> So this works because, on what we believe to be the affected models, the
>>> only supported page size is 4K?
>>
>> Yes.
>>
>>> Do we want to do the same with AMD if we don't allow 512G super pages?
>>
>> Why? They don't have a similar flaw.
> 
> So the question was mostly whether we should also avoid the
> pt_update_contig_markers for 1G entries, because we won't coalesce
> them into a 512G anyway.  IOW avoid the overhead of updating the
> contig markers if we know that the resulting super-page is not
> supported by ->page_sizes.

As the comment says, I consider this at least partly an abuse of
PTE_kind_table, so I'm wary of extending this to AMD. But if you
continue to think it's worth it, I could certainly do so there as
well.

Jan
Jan Beulich May 27, 2022, 7:53 a.m. UTC | #7
On 27.05.2022 09:40, Jan Beulich wrote:
> On 20.05.2022 13:13, Roger Pau Monné wrote:
>> On Wed, May 18, 2022 at 12:26:03PM +0200, Jan Beulich wrote:
>>> On 10.05.2022 16:30, Roger Pau Monné wrote:
>>>> On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
>>>>> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
>>>>>  
>>>>>      old = *pte;
>>>>>      dma_clear_pte(*pte);
>>>>> +    iommu_sync_cache(pte, sizeof(*pte));
>>>>> +
>>>>> +    while ( pt_update_contig_markers(&page->val,
>>>>> +                                     address_level_offset(addr, level),
>>>>> +                                     level, PTE_kind_null) &&
>>>>> +            ++level < min_pt_levels )
>>>>> +    {
>>>>> +        struct page_info *pg = maddr_to_page(pg_maddr);
>>>>> +
>>>>> +        unmap_vtd_domain_page(page);
>>>>> +
>>>>> +        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
>>>>> +                                          false);
>>>>> +        BUG_ON(pg_maddr < PAGE_SIZE);
>>>>> +
>>>>> +        page = map_vtd_domain_page(pg_maddr);
>>>>> +        pte = &page[address_level_offset(addr, level)];
>>>>> +        dma_clear_pte(*pte);
>>>>> +        iommu_sync_cache(pte, sizeof(*pte));
>>>>> +
>>>>> +        *flush_flags |= IOMMU_FLUSHF_all;
>>>>> +        iommu_queue_free_pgtable(hd, pg);
>>>>> +    }
>>>>
>>>> I think I'm setting myself up for trouble, but do we need to sync the
>>>> cache for the lower level entries if higher level ones are to be changed?
>>>>
>>>> IOW, would it be fine to just flush the highest level modified PTE?
>>>> As the lower level ones won't be reachable anyway.
>>>
>>> I definitely want to err on the safe side here. If later we can
>>> prove that some cache flush is unneeded, I'd be happy to see it
>>> go away.
>>
>> Hm, so it's not only about adding more cache flushes, but also about
>> moving them inside of the locked region: previously the only cache flush
>> was done outside of the locked region.
>>
>> I guess I can't convince myself why we would need to flush the cache for
>> entries that are about to be removed, and that also point to pages
>> scheduled to be freed.
> 
> As previously said - with a series like this I wanted to strictly be
> on the safe side, maintaining the pre-existing pattern of all
> modifications of live tables being accompanied by a flush (if flushes
> are needed in the first place, of course). As to moving flushes into
> the locked region, I don't view this as a problem, seeing in
> particular that elsewhere we already have flushes with the lock held
> (at the very least the _full page_ one in addr_to_dma_page_maddr(),
> but also e.g. in intel_iommu_map_page(), where it could be easily
> moved past the unlock).
> 
> If you (continue to) think that breaking the present pattern isn't
> going to misguide future changes, I can certainly drop these not
> really necessary flushes. Otoh I was actually considering,
> subsequently, integrating the flushes into e.g. dma_clear_pte() to
> make it virtually impossible to break that pattern. This would
> imply that all page table related flushes would then occur with the
> lock held.
> 
> (I won't separately reply to the similar topic on patch 18.)

Oh, one more (formal / minor) aspect: Changing when to (not) flush
would also invalidate Kevin's R-b which I've got already for both
this and the later, similarly affected patch.

Jan
Roger Pau Monné May 27, 2022, 9:21 a.m. UTC | #8
On Fri, May 27, 2022 at 09:53:01AM +0200, Jan Beulich wrote:
> On 27.05.2022 09:40, Jan Beulich wrote:
> > On 20.05.2022 13:13, Roger Pau Monné wrote:
> >> On Wed, May 18, 2022 at 12:26:03PM +0200, Jan Beulich wrote:
> >>> On 10.05.2022 16:30, Roger Pau Monné wrote:
> >>>> On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
> >>>>> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
> >>>>>  
> >>>>>      old = *pte;
> >>>>>      dma_clear_pte(*pte);
> >>>>> +    iommu_sync_cache(pte, sizeof(*pte));
> >>>>> +
> >>>>> +    while ( pt_update_contig_markers(&page->val,
> >>>>> +                                     address_level_offset(addr, level),
> >>>>> +                                     level, PTE_kind_null) &&
> >>>>> +            ++level < min_pt_levels )
> >>>>> +    {
> >>>>> +        struct page_info *pg = maddr_to_page(pg_maddr);
> >>>>> +
> >>>>> +        unmap_vtd_domain_page(page);
> >>>>> +
> >>>>> +        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
> >>>>> +                                          false);
> >>>>> +        BUG_ON(pg_maddr < PAGE_SIZE);
> >>>>> +
> >>>>> +        page = map_vtd_domain_page(pg_maddr);
> >>>>> +        pte = &page[address_level_offset(addr, level)];
> >>>>> +        dma_clear_pte(*pte);
> >>>>> +        iommu_sync_cache(pte, sizeof(*pte));
> >>>>> +
> >>>>> +        *flush_flags |= IOMMU_FLUSHF_all;
> >>>>> +        iommu_queue_free_pgtable(hd, pg);
> >>>>> +    }
> >>>>
> >>>> I think I'm setting myself up for trouble, but do we need to sync the
> >>>> cache for the lower level entries if higher level ones are to be changed?
> >>>>
> >>>> IOW, would it be fine to just flush the highest level modified PTE?
> >>>> As the lower level ones won't be reachable anyway.
> >>>
> >>> I definitely want to err on the safe side here. If later we can
> >>> prove that some cache flush is unneeded, I'd be happy to see it
> >>> go away.
> >>
> >> Hm, so it's not only about adding more cache flushes, but also about
> >> moving them inside of the locked region: previously the only cache flush
> >> was done outside of the locked region.
> >>
> >> I guess I can't convince myself why we would need to flush the cache for
> >> entries that are about to be removed, and that also point to pages
> >> scheduled to be freed.
> > 
> > As previously said - with a series like this I wanted to strictly be
> > on the safe side, maintaining the pre-existing pattern of all
> > modifications of live tables being accompanied by a flush (if flushes
> > are needed in the first place, of course). As to moving flushes into
> > the locked region, I don't view this as a problem, seeing in
> > particular that elsewhere we already have flushes with the lock held
> > (at the very least the _full page_ one in addr_to_dma_page_maddr(),
> > but also e.g. in intel_iommu_map_page(), where it could be easily
> > moved past the unlock).
> > 
> > If you (continue to) think that breaking the present pattern isn't
> > going to misguide future changes, I can certainly drop these not
> > really necessary flushes. Otoh I was actually considering,
> > subsequently, integrating the flushes into e.g. dma_clear_pte() to
> > make it virtually impossible to break that pattern. This would
> > imply that all page table related flushes would then occur with the
> > lock held.

Hm, while I agree it's safer to do the flush in dma_clear_pte()
itself, I wonder how much of a performance impact this has.  It
might not be relevant, in which case I would certainly be fine with
placing the flush in dma_clear_pte().
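
A minimal sketch of what folding the flush into the clearing operation could
look like (the wrapper name is made up; alternatively the flush could be put
straight into dma_clear_pte() itself):

    /* Hypothetical helper: clear a PTE and immediately sync it, so callers
     * cannot forget the flush needed for non-coherent IOMMUs. */
    static void dma_clear_pte_sync(struct dma_pte *pte)
    {
        dma_clear_pte(*pte);
        iommu_sync_cache(pte, sizeof(*pte));
    }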

> > (I won't separately reply to the similar topic on patch 18.)
> 
> Oh, one more (formal / minor) aspect: Changing when to (not) flush
> would also invalidate Kevin's R-b which I've got already for both
> this and the later, similarly affected patch.

OK, so let's go with this for now.  I don't have further comments:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.
diff mbox series

Patch

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -43,6 +43,9 @@ 
 #include "vtd.h"
 #include "../ats.h"
 
+#define CONTIG_MASK DMA_PTE_CONTIG_MASK
+#include <asm/pt-contig-markers.h>
+
 /* dom_io is used as a sentinel for quarantined devices */
 #define QUARANTINE_SKIP(d, pgd_maddr) ((d) == dom_io && !(pgd_maddr))
 #define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
@@ -405,6 +408,9 @@  static uint64_t addr_to_dma_page_maddr(s
 
             write_atomic(&pte->val, new_pte.val);
             iommu_sync_cache(pte, sizeof(struct dma_pte));
+            pt_update_contig_markers(&parent->val,
+                                     address_level_offset(addr, level),
+                                     level, PTE_kind_table);
         }
 
         if ( --level == target )
@@ -837,9 +843,31 @@  static int dma_pte_clear_one(struct doma
 
     old = *pte;
     dma_clear_pte(*pte);
+    iommu_sync_cache(pte, sizeof(*pte));
+
+    while ( pt_update_contig_markers(&page->val,
+                                     address_level_offset(addr, level),
+                                     level, PTE_kind_null) &&
+            ++level < min_pt_levels )
+    {
+        struct page_info *pg = maddr_to_page(pg_maddr);
+
+        unmap_vtd_domain_page(page);
+
+        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
+                                          false);
+        BUG_ON(pg_maddr < PAGE_SIZE);
+
+        page = map_vtd_domain_page(pg_maddr);
+        pte = &page[address_level_offset(addr, level)];
+        dma_clear_pte(*pte);
+        iommu_sync_cache(pte, sizeof(*pte));
+
+        *flush_flags |= IOMMU_FLUSHF_all;
+        iommu_queue_free_pgtable(hd, pg);
+    }
 
     spin_unlock(&hd->arch.mapping_lock);
-    iommu_sync_cache(pte, sizeof(struct dma_pte));
 
     unmap_vtd_domain_page(page);
 
@@ -2182,8 +2210,21 @@  static int __must_check cf_check intel_i
     }
 
     *pte = new;
-
     iommu_sync_cache(pte, sizeof(struct dma_pte));
+
+    /*
+     * While the (ab)use of PTE_kind_table here allows to save some work in
+     * the function, the main motivation for it is that it avoids a so far
+     * unexplained hang during boot (while preparing Dom0) on a Westmere
+     * based laptop.
+     */
+    pt_update_contig_markers(&page->val,
+                             address_level_offset(dfn_to_daddr(dfn), level),
+                             level,
+                             (hd->platform_ops->page_sizes &
+                              (1UL << level_to_offset_bits(level + 1))
+                              ? PTE_kind_leaf : PTE_kind_table));
+
     spin_unlock(&hd->arch.mapping_lock);
     unmap_vtd_domain_page(page);