
[2/4] VT-d / x86: re-arrange cache syncing

Message ID e0819175-83b4-9489-8e4b-7be4940f1d54@suse.com (mailing list archive)
State Superseded
Series: (mainly) IOMMU hook adjustments

Commit Message

Jan Beulich Dec. 1, 2021, 9:40 a.m. UTC
The actual function should always have lived in core x86 code; move it
there, replacing get_cache_line_size() by readily available (except very
early during boot; see the code comment) data.

Drop the respective IOMMU hook, (re)introducing a global boolean instead.
Replace one fully and one almost open-coded instance of
iommu_sync_cache().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Placing the function next to flush_area_local() exposes a curious
asymmetry between the SFENCE placements: sync_cache() has it after the
flush, while flush_area_local() has it before it. I think the latter one
is misplaced.
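
For reference, the two placements look roughly like this (paraphrased, not
verbatim from the tree; clflushopt() stands in for the alternative*()
constructs actually used):

    /* flush_area_local(): SFENCE first, then the flush loop */
    alternative("", "sfence", X86_FEATURE_CLFLUSHOPT);
    for ( i = 0; i < sz; i += c->x86_clflush_size )
        clflushopt(va + i);

    /* sync_cache() (moved by this patch): flush loop first, then SFENCE */
    for ( ; addr < end; addr += clflush_size )
        clflushopt(addr);
    alternative_2("", "sfence", X86_FEATURE_CLFLUSHOPT,
                      "sfence", X86_FEATURE_CLWB);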

Comments

Andrew Cooper Dec. 1, 2021, 2:39 p.m. UTC | #1
On 01/12/2021 09:40, Jan Beulich wrote:
> The actual function should always have lived in core x86 code; move it
> there, replacing get_cache_line_size() by readily available (except very
> early during boot; see the code comment) data.
>
> Drop the respective IOMMU hook, (re)introducing a global boolean instead.
> Replace one fully and one almost open-coded instance of
> iommu_sync_cache().

Coherency (or not) is a per-IOMMU property, not a global platform
property.  iGPU IOMMUs are very different to those in the uncore, and
there's no reason to presume that IOMMUs in the PCH would have the same
properties as those in the uncore.

Given how expensive sync_cache() is, we cannot afford to be using it for
coherent IOMMUs in a mixed system.

This wants to be a boolean in arch_iommu.
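
A minimal sketch of that direction (the need_sync field is hypothetical;
cache_writeback() is the rename suggested further down) might look like:

    /* xen/include/asm-x86/iommu.h */
    struct arch_iommu {
        /* ... existing members ... */
        bool need_sync;    /* some unit backing this domain lacks coherent
                              page-walk (e.g. VT-d ecap bit C clear) */
    };

    static inline void iommu_sync_cache(const struct domain *d,
                                        const void *addr, unsigned int size)
    {
        if ( dom_iommu(d)->arch.need_sync )
            cache_writeback(addr, size);
    }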


> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Placing the function next to flush_area_local() exposes a curious
> asymmetry between the SFENCE placements: sync_cache() has it after the
> flush, while flush_area_local() has it before it. I think the latter one
> is misplaced.

Wow this is a mess.

First and foremost, AMD state that on pre-CLFLUSHOPT parts, CLFLUSH is
unordered with ~anything (including SFENCE), and needs bounding with
MFENCE on both sides.  We definitely aren't doing this correctly right now.
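
As a standalone sketch of what that guidance amounts to (function name made
up; this is not what Xen does today):

    /* Plain CLFLUSH bracketed by MFENCE on both sides, per AMD's guidance
     * for pre-CLFLUSHOPT parts. */
    static void clflush_range_legacy(const void *addr, unsigned int size,
                                     unsigned int line)
    {
        const char *p = (const char *)((unsigned long)addr &
                                       -(unsigned long)line);
        const char *end = (const char *)addr + size;

        asm volatile ( "mfence" ::: "memory" );        /* fence prior stores */
        for ( ; p < end; p += line )
            asm volatile ( "clflush %0" :: "m" (*p) ); /* unordered on its own */
        asm volatile ( "mfence" ::: "memory" );        /* fence the flushes */
    }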


AMD explicitly states that SFENCE drains the store and WC buffers (i.e.
makes stuff instantaneously globally visible).  Intel does not, and
merely guarantees ordering.

A leading SFENCE would only make sense if there were WC concerns, but
both manuals say that the memory type doesn't matter, so I can't see a
justification for it.

What does matter, from the IOMMU's point of view, is that the line has
been written back (or evicted on pre-CLWB parts) before the IOTLB
invalidation occurs.  The invalidation will be a write to a different
address, which is why the trailing SFENCE is necessary, as CLFLUSHOPT
isn't ordered with respect to unaliasing writes.
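
Put differently, the ordering that matters is (sketch; the descriptor-posting
helper name is made up, cache_writeback() being the rename suggested below):

    cache_writeback(pte, sizeof(*pte));      /* CLFLUSHOPT the updated entry */
    asm volatile ( "sfence" ::: "memory" );  /* order the flush ahead of ... */
    qinval_post_descriptor(iommu, &desc);    /* ... the write (to a different
                                                line) kicking off the IOTLB
                                                invalidation */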


On a more minor note, both Intel and AMD say that CLFLUSH* are permitted
to target an execute-only code segment, so we need a fix in
hvmemul_cache_op()'s use of hvmemul_virtual_to_linear(...,
hvm_access_read, ...) which will currently #GP in this case.

>
> --- a/xen/arch/x86/flushtlb.c
> +++ b/xen/arch/x86/flushtlb.c
> @@ -11,6 +11,7 @@
>  #include <xen/sched.h>
>  #include <xen/smp.h>
>  #include <xen/softirq.h>
> +#include <asm/cache.h>
>  #include <asm/flushtlb.h>
>  #include <asm/invpcid.h>
>  #include <asm/nops.h>
> @@ -265,6 +266,57 @@ unsigned int flush_area_local(const void
>      return flags;
>  }
>  
> +void sync_cache(const void *addr, unsigned int size)

Can we name this cache_writeback()?  sync is very generic, and it really
doesn't want confusing with cache_flush().

~Andrew
Jan Beulich Dec. 2, 2021, 9:19 a.m. UTC | #2
On 01.12.2021 15:39, Andrew Cooper wrote:
> On 01/12/2021 09:40, Jan Beulich wrote:
>> The actual function should always have lived in core x86 code; move it
>> there, replacing get_cache_line_size() by readily available (except very
>> early during boot; see the code comment) data.
>>
>> Drop the respective IOMMU hook, (re)introducing a global boolean instead.
>> Replace one fully and one almost open-coded instance of
>> iommu_sync_cache().
> 
> Coherency (or not) is a per-IOMMU property, not a global platform
> property.  iGPU IOMMUs are very different to those in the uncore, and
> there's no reason to presume that IOMMUs in the PCH would have the same
> properties as those in the uncore.
> 
> Given how expensive sync_cache() is, we cannot afford to be using it for
> coherent IOMMUs in a mixed system.
> 
> This wants to be a boolean in arch_iommu.

That's a valid consideration, but may not be as easy as it may seem on
the surface. Certainly not something I could promise to find time for
soon. And definitely separate from the specific change here.

>> ---
>> Placing the function next to flush_area_local() exposes a curious
>> asymmetry between the SFENCE placements: sync_cache() has it after the
>> flush, while flush_area_local() has it before it. I think the latter one
>> is misplaced.
> 
> Wow this is a mess.
> 
> First and foremost, AMD state that on pre-CLFLUSHOPT parts, CLFLUSH is
> unordered with ~anything (including SFENCE), and needs bounding with
> MFENCE on both sides.  We definitely aren't doing this correctly right now.
> 
> 
> AMD explicitly states that SFENCE drains the store and WC buffers (i.e.
> makes stuff instantaneously globally visible).  Intel does not, and
> merely guarantees ordering.
> 
> A leading SFENCE would only make sense if there were WC concerns, but
> both manuals say that the memory type doesn't matter, so I can't see a
> justification for it.
> 
> What does matter, from the IOMMU's point of view, is that the line has
> been written back (or evicted on pre-CLWB parts) before the IOTLB
> invalidation occurs.  The invalidation will be a write to a different
> address, which is why the trailing SFENCE is necessary, as CLFLUSHOPT
> isn't ordered with respect to unaliasing writes.

IOW for the purposes of this change all is correct, and everything else
will require taking care of separately.

> On a more minor note, both Intel and AMD say that CLFLUSH* are permitted
> to target an execute-only code segment, so we need a fix in
> hvmemul_cache_op()'s use of hvmemul_virtual_to_linear(...,
> hvm_access_read, ...) which will currently #GP in this case.

Hmm, this specific case would seem to require simply using hvm_access_none
(like hvmemul_tlb_op() already does), except for CLWB.
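
Roughly (sketch only; local variable naming illustrative):

    /* In hvmemul_cache_op(): CLFLUSH/CLFLUSHOPT merely need the address to
     * translate, so hvm_access_none avoids the spurious #GP on execute-only
     * segments, while CLWB keeps requesting read access. */
    enum hvm_access_type access =
        op == x86emul_clwb ? hvm_access_read : hvm_access_none;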

But then hvm_vcpu_virtual_to_linear() also doesn't look to handle
hvm_access_insn_fetch correctly for data segments. Changing that would in
turn require first auditing all use sites, to make sure we don't break
anything.

I'll see about doing both.

>> --- a/xen/arch/x86/flushtlb.c
>> +++ b/xen/arch/x86/flushtlb.c
>> @@ -11,6 +11,7 @@
>>  #include <xen/sched.h>
>>  #include <xen/smp.h>
>>  #include <xen/softirq.h>
>> +#include <asm/cache.h>
>>  #include <asm/flushtlb.h>
>>  #include <asm/invpcid.h>
>>  #include <asm/nops.h>
>> @@ -265,6 +266,57 @@ unsigned int flush_area_local(const void
>>      return flags;
>>  }
>>  
>> +void sync_cache(const void *addr, unsigned int size)
> 
> Can we name this cache_writeback()?  sync is very generic, and it really
> doesn't want confusing with cache_flush().

Oh, sure, can do. As long as you don't insist on also changing
iommu_sync_cache(): While purely mechanical, this would bloat the
patch by quite a bit.

Bottom line: This last item is the only change to the actual patch;
everything else will require further work yielding separate patches.

Jan
Tian, Kevin Dec. 24, 2021, 7:24 a.m. UTC | #3
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Thursday, December 2, 2021 5:19 PM
> 
> On 01.12.2021 15:39, Andrew Cooper wrote:
> > On 01/12/2021 09:40, Jan Beulich wrote:
> >> The actual function should always have lived in core x86 code; move it
> >> there, replacing get_cache_line_size() by readily available (except very
> >> early during boot; see the code comment) data.
> >>
> >> Drop the respective IOMMU hook, (re)introducing a global boolean instead.
> >> Replace one fully and one almost open-coded instance of
> >> iommu_sync_cache().
> >
> > Coherency (or not) is a per-IOMMU property, not a global platform
> > property.  iGPU IOMMUs are very different to those in the uncore, and
> > there's no reason to presume that IOMMUs in the PCH would have the same
> > properties as those in the uncore.
> >
> > Given how expensive sync_cache() is, we cannot afford to be using it for
> > coherent IOMMUs in a mixed system.
> >
> > This wants to be a boolean in arch_iommu.
> 
> That's a valid consideration, but may not be as easy as it may seem on
> the surface. Certainly not something I could promise to find time for
> soon. And definitely separate from the specific change here.

I'm fine with this patch if you prefer a staged approach to improving it.
In any case this patch doesn't make things worse.

> 
> >> ---
> >> Placing the function next to flush_area_local() exposes a curious
> >> asymmetry between the SFENCE placements: sync_cache() has it after the
> >> flush, while flush_area_local() has it before it. I think the latter one
> >> is misplaced.
> >
> > Wow this is a mess.
> >
> > First and foremost, AMD state that on pre-CLFLUSHOPT parts, CLFLUSH is
> > unordered with ~anything (including SFENCE), and needs bounding with
> > MFENCE on both sides.  We definitely aren't doing this correctly right now.
> >
> >
> > AMD explicitly states that SFENCE drains the store and WC buffers (i.e.
> > makes stuff instantaneously globally visible).  Intel does not, and
> > merely guarantees ordering.
> >
> > A leading SFENCE would only make sense if there were WC concerns, but
> > both manuals say that the memory type doesn't matter, so I can't see a
> > justification for it.
> >
> > What does matter, from the IOMMU's point of view, is that the line has
> > been written back (or evicted on pre-CLWB parts) before the IOTLB
> > invalidation occurs.  The invalidation will be a write to a different
> > address, which is why the trailing SFENCE is necessary, as CLFLUSHOPT
> > isn't ordered with respect to unaliasing writes.
> 
> IOW for the purposes of this change all is correct, and everything else
> will require taking care of separately.
> 

Same for this part. Btw, Linux does MFENCE both before and after CLFLUSH.
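
For reference, the Linux pattern looks roughly like this (paraphrased from
clflush_cache_range(), simplified):

    mb();                                        /* full barrier before the loop */
    for (p = start; p < end; p += clflush_size)
            clflushopt(p);                       /* CLFLUSHOPT via alternatives */
    mb();                                        /* and again after it */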

Thanks
Kevin

Patch

--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -11,6 +11,7 @@ 
 #include <xen/sched.h>
 #include <xen/smp.h>
 #include <xen/softirq.h>
+#include <asm/cache.h>
 #include <asm/flushtlb.h>
 #include <asm/invpcid.h>
 #include <asm/nops.h>
@@ -265,6 +266,57 @@  unsigned int flush_area_local(const void
     return flags;
 }
 
+void sync_cache(const void *addr, unsigned int size)
+{
+    /*
+     * This function may be called before current_cpu_data is established.
+     * Hence a fallback is needed to prevent the loop below becoming infinite.
+     */
+    unsigned int clflush_size = current_cpu_data.x86_clflush_size ?: 16;
+    const void *end = addr + size;
+
+    addr -= (unsigned long)addr & (clflush_size - 1);
+    for ( ; addr < end; addr += clflush_size )
+    {
+/*
+ * The arguments to a macro must not include preprocessor directives. Doing so
+ * results in undefined behavior, so we have to create some defines here in
+ * order to avoid it.
+ */
+#if defined(HAVE_AS_CLWB)
+# define CLWB_ENCODING "clwb %[p]"
+#elif defined(HAVE_AS_XSAVEOPT)
+# define CLWB_ENCODING "data16 xsaveopt %[p]" /* clwb */
+#else
+# define CLWB_ENCODING ".byte 0x66, 0x0f, 0xae, 0x30" /* clwb (%%rax) */
+#endif
+
+#define BASE_INPUT(addr) [p] "m" (*(const char *)(addr))
+#if defined(HAVE_AS_CLWB) || defined(HAVE_AS_XSAVEOPT)
+# define INPUT BASE_INPUT
+#else
+# define INPUT(addr) "a" (addr), BASE_INPUT(addr)
+#endif
+        /*
+         * Note regarding the use of NOP_DS_PREFIX: it's faster to do a clflush
+         * + prefix than a clflush + nop, and hence the prefix is added instead
+         * of letting the alternative framework fill the gap by appending nops.
+         */
+        alternative_io_2(".byte " __stringify(NOP_DS_PREFIX) "; clflush %[p]",
+                         "data16 clflush %[p]", /* clflushopt */
+                         X86_FEATURE_CLFLUSHOPT,
+                         CLWB_ENCODING,
+                         X86_FEATURE_CLWB, /* no outputs */,
+                         INPUT(addr));
+#undef INPUT
+#undef BASE_INPUT
+#undef CLWB_ENCODING
+    }
+
+    alternative_2("", "sfence", X86_FEATURE_CLFLUSHOPT,
+                      "sfence", X86_FEATURE_CLWB);
+}
+
 unsigned int guest_flush_tlb_flags(const struct domain *d)
 {
     bool shadow = paging_mode_shadow(d);
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -202,54 +202,6 @@  static void check_cleanup_domid_map(stru
     }
 }
 
-static void sync_cache(const void *addr, unsigned int size)
-{
-    static unsigned long clflush_size = 0;
-    const void *end = addr + size;
-
-    if ( clflush_size == 0 )
-        clflush_size = get_cache_line_size();
-
-    addr -= (unsigned long)addr & (clflush_size - 1);
-    for ( ; addr < end; addr += clflush_size )
-/*
- * The arguments to a macro must not include preprocessor directives. Doing so
- * results in undefined behavior, so we have to create some defines here in
- * order to avoid it.
- */
-#if defined(HAVE_AS_CLWB)
-# define CLWB_ENCODING "clwb %[p]"
-#elif defined(HAVE_AS_XSAVEOPT)
-# define CLWB_ENCODING "data16 xsaveopt %[p]" /* clwb */
-#else
-# define CLWB_ENCODING ".byte 0x66, 0x0f, 0xae, 0x30" /* clwb (%%rax) */
-#endif
-
-#define BASE_INPUT(addr) [p] "m" (*(const char *)(addr))
-#if defined(HAVE_AS_CLWB) || defined(HAVE_AS_XSAVEOPT)
-# define INPUT BASE_INPUT
-#else
-# define INPUT(addr) "a" (addr), BASE_INPUT(addr)
-#endif
-        /*
-         * Note regarding the use of NOP_DS_PREFIX: it's faster to do a clflush
-         * + prefix than a clflush + nop, and hence the prefix is added instead
-         * of letting the alternative framework fill the gap by appending nops.
-         */
-        alternative_io_2(".byte " __stringify(NOP_DS_PREFIX) "; clflush %[p]",
-                         "data16 clflush %[p]", /* clflushopt */
-                         X86_FEATURE_CLFLUSHOPT,
-                         CLWB_ENCODING,
-                         X86_FEATURE_CLWB, /* no outputs */,
-                         INPUT(addr));
-#undef INPUT
-#undef BASE_INPUT
-#undef CLWB_ENCODING
-
-    alternative_2("", "sfence", X86_FEATURE_CLFLUSHOPT,
-                      "sfence", X86_FEATURE_CLWB);
-}
-
 /* Allocate page table, return its machine address */
 uint64_t alloc_pgtable_maddr(unsigned long npages, nodeid_t node)
 {
@@ -268,8 +220,7 @@  uint64_t alloc_pgtable_maddr(unsigned lo
 
         clear_page(vaddr);
 
-        if ( (iommu_ops.init ? &iommu_ops : &vtd_ops)->sync_cache )
-            sync_cache(vaddr, PAGE_SIZE);
+        iommu_sync_cache(vaddr, PAGE_SIZE);
         unmap_domain_page(vaddr);
         cur_pg++;
     }
@@ -1295,7 +1246,7 @@  int __init iommu_alloc(struct acpi_drhd_
     iommu->nr_pt_levels = agaw_to_level(agaw);
 
     if ( !ecap_coherent(iommu->ecap) )
-        vtd_ops.sync_cache = sync_cache;
+        iommu_non_coherent = true;
 
     /* allocate domain id bitmap */
     nr_dom = cap_ndoms(iommu->cap);
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -28,6 +28,7 @@ 
 
 const struct iommu_init_ops *__initdata iommu_init_ops;
 struct iommu_ops __read_mostly iommu_ops;
+bool __read_mostly iommu_non_coherent;
 
 enum iommu_intremap __read_mostly iommu_intremap = iommu_intremap_full;
 
@@ -435,8 +436,7 @@  struct page_info *iommu_alloc_pgtable(st
     p = __map_domain_page(pg);
     clear_page(p);
 
-    if ( hd->platform_ops->sync_cache )
-        iommu_vcall(hd->platform_ops, sync_cache, p, PAGE_SIZE);
+    iommu_sync_cache(p, PAGE_SIZE);
 
     unmap_domain_page(p);
 
--- a/xen/include/asm-x86/cache.h
+++ b/xen/include/asm-x86/cache.h
@@ -11,4 +11,10 @@ 
 
 #define __read_mostly __section(".data.read_mostly")
 
+#ifndef __ASSEMBLY__
+
+void sync_cache(const void *addr, unsigned int size);
+
+#endif
+
 #endif
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -19,6 +19,7 @@ 
 #include <xen/mem_access.h>
 #include <xen/spinlock.h>
 #include <asm/apicdef.h>
+#include <asm/cache.h>
 #include <asm/processor.h>
 #include <asm/hvm/vmx/vmcs.h>
 
@@ -134,12 +135,13 @@  extern bool untrusted_msi;
 int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq,
                    const uint8_t gvec);
 
-#define iommu_sync_cache(addr, size) ({                 \
-    const struct iommu_ops *ops = iommu_get_ops();      \
-                                                        \
-    if ( ops->sync_cache )                              \
-        iommu_vcall(ops, sync_cache, addr, size);       \
-})
+extern bool iommu_non_coherent;
+
+static inline void iommu_sync_cache(const void *addr, unsigned int size)
+{
+    if ( iommu_non_coherent )
+        sync_cache(addr, size);
+}
 
 int __must_check iommu_free_pgtables(struct domain *d);
 struct page_info *__must_check iommu_alloc_pgtable(struct domain *d);
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -268,7 +268,6 @@  struct iommu_ops {
     int (*setup_hpet_msi)(struct msi_desc *);
 
     int (*adjust_irq_affinities)(void);
-    void (*sync_cache)(const void *addr, unsigned int size);
     void (*clear_root_pgtable)(struct domain *d);
     int (*update_ire_from_msi)(struct msi_desc *msi_desc, struct msi_msg *msg);
 #endif /* CONFIG_X86 */
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -76,7 +76,6 @@  int __must_check qinval_device_iotlb_syn
                                           struct pci_dev *pdev,
                                           u16 did, u16 size, u64 addr);
 
-unsigned int get_cache_line_size(void);
 void flush_all_cache(void);
 
 uint64_t alloc_pgtable_maddr(unsigned long npages, nodeid_t node);
--- a/xen/drivers/passthrough/vtd/x86/vtd.c
+++ b/xen/drivers/passthrough/vtd/x86/vtd.c
@@ -47,11 +47,6 @@  void unmap_vtd_domain_page(const void *v
     unmap_domain_page(va);
 }
 
-unsigned int get_cache_line_size(void)
-{
-    return ((cpuid_ebx(1) >> 8) & 0xff) * 8;
-}
-
 void flush_all_cache()
 {
     wbinvd();