
[5/5] x86/HVM: limit cache writeback overhead

Message ID 18fcf499-a2ae-ab48-a66f-ca0499097e8a@suse.com (mailing list archive)
State Superseded
Series x86: reduce cache flushing overhead

Commit Message

Jan Beulich April 19, 2023, 10:46 a.m. UTC
There's no need to write back caches on all CPUs upon seeing a WBINVD
exit; ones that a vCPU hasn't run on since the last writeback (or since
it was started) can't hold data which may need writing back.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
With us not running AMD IOMMUs in non-coherent ways, I wonder whether
svm_wbinvd_intercept() really needs to do anything (or whether it
couldn't check iommu_snoop just like VMX does, knowing that as of
c609108b2190 ["x86/shadow: make iommu_snoop usage consistent with
HAP's"] that's always set; this would largely serve as grep fodder then,
to make sure this code is updated once / when we do away with this
global variable, and it would be the penultimate step to being able to
fold SVM's and VT-x'es functions).
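
For illustration, that folded variant might end up looking roughly like the
sketch below. This is hypothetical and not part of the patch: it merely
combines the iommu_snoop check VMX already performs with the
cache_dirty_mask handling introduced here, and it omits the VT-x-only
wbnoinvd() fallback for the !cpu_has_wbinvd_exiting case.

/* Hypothetical sketch of a common wbinvd intercept, assuming the
 * iommu_snoop check can indeed be shared between SVM and VT-x. */
static void cf_check hvm_wbinvd_intercept(void)
{
    struct vcpu *curr = current;

    /* Nothing to do for domains without devices, or when the IOMMU
     * keeps caches coherent anyway. */
    if ( !cache_flush_permitted(curr->domain) || iommu_snoop )
        return;

    /* Write back only on CPUs this vCPU may have dirtied since the last
     * writeback, then shrink the mask to just the current CPU. */
    flush_mask(curr->arch.hvm.cache_dirty_mask, FLUSH_WRITEBACK);
    cpumask_copy(curr->arch.hvm.cache_dirty_mask,
                 cpumask_of(curr->processor));
}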

Comments

Andrew Cooper April 19, 2023, 9:55 p.m. UTC | #1
On 19/04/2023 11:46 am, Jan Beulich wrote:
> There's no need to write back caches on all CPUs upon seeing a WBINVD
> exit; ones that a vCPU hasn't run on since the last writeback (or since
> it was started) can't hold data which may need writing back.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

I find it unlikely that this is an improvement in any way at all.

You're adding a memory allocation, and making the fastpath slower, for
all HVM domains, even the ~100% of them which in practice never get given
a device in the first place.

Just so you can skip the WBINVD side effect on the L1/L2 caches of the
CPUs this domain happens not to have run on since the last time they
flushed (which is already an underestimate).  Note how this does not
change the behaviour for the L3 caches, which form the overwhelming
majority of the WBINVD overhead in the first place.

So my response was going to be "definitely not without the per-domain
'reduced cacheability permitted' setting we've discussed".  And even
then, not without numbers suggesting it's a problem in the first place,
or at least a better explanation of why it might plausibly be an issue.


But, in writing this, I've realised a real bug.

Cache snoops can occur and pull lines sideways for microarchitectural
reasons.  And even if we want to hand-wave that away as being unlikely
(it is), you can't hand-wave away rogue speculation in the directmap.

By not issuing WBINVD on all cores, you've got a real chance of letting
some lines escape the attempt to evict them.

But it's worse than that - even when IPIing all cores, there's a speculation
pattern which can cause some lines to survive.  Rare, sure, but not
impossible.

Right now, I'm not sure that WBINVD can even be used safely without the
extra careful use of CR0.{CD,NW}, which provides a workaround for
native, but nothing helpful for hypervisors...

~Andrew
Jan Beulich April 20, 2023, 9:42 a.m. UTC | #2
On 19.04.2023 23:55, Andrew Cooper wrote:
> On 19/04/2023 11:46 am, Jan Beulich wrote:
>> There's no need to write back caches on all CPUs upon seeing a WBINVD
>> exit; ones that a vCPU hasn't run on since the last writeback (or since
>> it was started) can't hold data which may need writing back.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> I find it unlikely that this is an improvement in any way at all.
> 
> You're adding a memory allocation, and making the fastpath slower, for
> all HVM domains, even the ~100% of them which in practice never get given
> a device in the first place.
> 
> Just so you can skip the WBINVD side effect on the L1/L2 caches of the
> CPUs this domain happens not to have run on since the last time they
> flushed (which is already an underestimate).  Note how this does not
> change the behaviour for the L3 caches, which form the overwhelming
> majority of the WBINVD overhead in the first place.

You seem to be thinking only of single-node systems here. I view this
change as particularly relevant for node-constrained domains on NUMA
systems.

As to "making the fastpath slower": That can only be the
__cpumask_set_cpu() added to hvm_do_resume(). What do you suggest? I
can certainly add a conditional there (and then I could also make the
memory allocation conditional), but I thought a LOCK-less RMW memory
operation might better be done in straight-line code.
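
For concreteness, the conditional variant mentioned here might look
something like this (a sketch only; using cache_flush_permitted() as the
gating condition is an assumption, and the posted patch deliberately keeps
the unconditional form):

    /* Hypothetical conditional form of the hvm_do_resume() hunk: only
     * track dirty CPUs for domains that may actually flush caches. */
    if ( cache_flush_permitted(v->domain) )
        __cpumask_set_cpu(v->processor, v->arch.hvm.cache_dirty_mask);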

As an aside - after having thought of making this change, I did go and
check what KVM does. And (surprise or not) they do exactly this.

> So my response was going to be "definitely not without the per-domain
> 'reduced cacheability permitted' setting we've discussed".  And even
> then, not without numbers suggesting it's a problem in the first place,
> or at least a better explanation of why it might plausibly be an issue.
> 
> 
> But, in writing this, I've realised a real bug.
> 
> Cache snoops can occur and pull lines sideways for microarchitectural
> reasons.  And even if we want to hand-wave that away as being unlikely
> (it is), you can't hand-wave away rogue speculation in the directmap.
> 
> By not issuing WBINVD on all cores, you've got a real chance of letting
> some lines escape the attempt to evict them.
> 
> But it's worse than that - even when IPIing all cores, there's a speculation
> pattern which can cause some lines to survive.  Rare, sure, but not
> impossible.
> 
> Right now, I'm not sure that WBINVD can even be used safely without the
> extra careful use of CR0.{CD,NW}, which provides a workaround for
> native, but nothing helpful for hypervisors...

Wait: see the title and the earlier patches. We're not talking about
"evict" here anymore, but about "write-back". The few cases of "evict"
are left alone by this change. If any of those are affected by what you
say (and it looks like some might be), then fixing that is definitely
unrelated work. (You may have meant the latter part of your reply that
way, but you didn't say so in any way I would recognize.)

Jan

Patch

--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -537,6 +537,8 @@  void hvm_do_resume(struct vcpu *v)
         v->arch.hvm.inject_event.vector = HVM_EVENT_VECTOR_UNSET;
     }
 
+    __cpumask_set_cpu(v->processor, v->arch.hvm.cache_dirty_mask);
+
     if ( unlikely(v->arch.vm_event) && v->arch.monitor.next_interrupt_enabled )
     {
         struct x86_event info;
@@ -1592,6 +1594,10 @@  int hvm_vcpu_initialise(struct vcpu *v)
     if ( rc )
         goto fail6;
 
+    rc = -ENOMEM;
+    if ( !zalloc_cpumask_var(&v->arch.hvm.cache_dirty_mask) )
+        goto fail6;
+
     rc = ioreq_server_add_vcpu_all(d, v);
     if ( rc != 0 )
         goto fail6;
@@ -1621,6 +1627,7 @@  int hvm_vcpu_initialise(struct vcpu *v)
     hvm_vcpu_cacheattr_destroy(v);
  fail1:
     viridian_vcpu_deinit(v);
+    FREE_CPUMASK_VAR(v->arch.hvm.cache_dirty_mask);
     return rc;
 }
 
@@ -1628,6 +1635,8 @@  void hvm_vcpu_destroy(struct vcpu *v)
 {
     viridian_vcpu_deinit(v);
 
+    FREE_CPUMASK_VAR(v->arch.hvm.cache_dirty_mask);
+
     ioreq_server_remove_vcpu_all(v->domain, v);
 
     if ( hvm_altp2m_supported() )
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -2363,8 +2363,14 @@  static void svm_vmexit_mce_intercept(
 
 static void cf_check svm_wbinvd_intercept(void)
 {
-    if ( cache_flush_permitted(current->domain) )
-        flush_all(FLUSH_WRITEBACK);
+    struct vcpu *curr = current;
+
+    if ( !cache_flush_permitted(curr->domain) )
+        return;
+
+    flush_mask(curr->arch.hvm.cache_dirty_mask, FLUSH_WRITEBACK);
+    cpumask_copy(curr->arch.hvm.cache_dirty_mask,
+                 cpumask_of(curr->processor));
 }
 
 static void svm_vmexit_do_invalidate_cache(struct cpu_user_regs *regs,
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -3710,11 +3710,17 @@  static void vmx_do_extint(struct cpu_use
 
 static void cf_check vmx_wbinvd_intercept(void)
 {
-    if ( !cache_flush_permitted(current->domain) || iommu_snoop )
+    struct vcpu *curr = current;
+
+    if ( !cache_flush_permitted(curr->domain) || iommu_snoop )
         return;
 
     if ( cpu_has_wbinvd_exiting )
-        flush_all(FLUSH_WRITEBACK);
+    {
+        flush_mask(curr->arch.hvm.cache_dirty_mask, FLUSH_WRITEBACK);
+        cpumask_copy(curr->arch.hvm.cache_dirty_mask,
+                     cpumask_of(curr->processor));
+    }
     else
         wbnoinvd();
 }
--- a/xen/arch/x86/include/asm/hvm/vcpu.h
+++ b/xen/arch/x86/include/asm/hvm/vcpu.h
@@ -161,6 +161,8 @@  struct hvm_vcpu {
         struct svm_vcpu svm;
     };
 
+    cpumask_var_t       cache_dirty_mask;
+
     struct tasklet      assert_evtchn_irq_tasklet;
 
     struct nestedvcpu   nvcpu;