
[3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page

Message ID 20221119094659.11868-3-dwmw2@infradead.org (mailing list archive)
State New, archived
Series [1/4] MAINTAINERS: Add KVM x86/xen maintainer list

Commit Message

David Woodhouse Nov. 19, 2022, 9:46 a.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

In the case where a GPC is refreshed to a different location within the
same page, we didn't bother to update it. Mostly we don't need to, but
since the ->khva field also includes the offset within the page, that
does have to be updated.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 virt/kvm/pfncache.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Durrant, Paul Nov. 19, 2022, 11:40 a.m. UTC | #1
> -----Original Message-----
> From: David Woodhouse <dwmw2@infradead.org>
> Sent: 19 November 2022 09:47
> To: Paolo Bonzini <pbonzini@redhat.com>; Sean Christopherson
> <seanjc@google.com>
> Cc: kvm@vger.kernel.org; mhal@rbox.co
> Subject: [EXTERNAL] [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it
> moves within the same page
> 
> CAUTION: This email originated from outside of the organization. Do not
> click links or open attachments unless you can confirm the sender and know
> the content is safe.
> 
> 
> 
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> In the case where a GPC is refreshed to a different location within the
> same page, we didn't bother to update it. Mostly we don't need to, but
> since the ->khva field also includes the offset within the page, that does
> have to be updated.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

Reviewed-by: <paul@xen.org>
Sean Christopherson Nov. 22, 2022, 4:23 p.m. UTC | #2
On Sat, Nov 19, 2022, Durrant, Paul wrote:
> > -----Original Message-----
> > From: David Woodhouse <dwmw2@infradead.org>
> > Sent: 19 November 2022 09:47
> > To: Paolo Bonzini <pbonzini@redhat.com>; Sean Christopherson
> > <seanjc@google.com>
> > Cc: kvm@vger.kernel.org; mhal@rbox.co
> > Subject: [EXTERNAL] [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it
> > moves within the same page

Please use a mail client that doesn't include the header gunk in the reply.

> > CAUTION: This email originated from outside of the organization. Do not
> > click links or open attachments unless you can confirm the sender and know
> > the content is safe.
> > 
> > 
> > 
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > In the case where a GPC is refreshed to a different location within the
> > same page, we didn't bother to update it. Mostly we don't need to, but
> > since the ->khva field also includes the offset within the page, that does
> > have to be updated.
> > 
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> 
> Reviewed-by: <paul@xen.org>

Tags need your real name, not just your email address, i.e. this should be:

  Reviewed-by: Paul Durrant <paul@xen.org>
Paul Durrant Nov. 22, 2022, 4:49 p.m. UTC | #3
On 22/11/2022 16:23, Sean Christopherson wrote:
> On Sat, Nov 19, 2022, Durrant, Paul wrote:
>>> -----Original Message-----
>>> From: David Woodhouse <dwmw2@infradead.org>
>>> Sent: 19 November 2022 09:47
>>> To: Paolo Bonzini <pbonzini@redhat.com>; Sean Christopherson
>>> <seanjc@google.com>
>>> Cc: kvm@vger.kernel.org; mhal@rbox.co
>>> Subject: [EXTERNAL] [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it
>>> moves within the same page
> 
> Please use a mail client that doesn't include the header gunk in the reply.
> 

Sorry about that; it's not just a change of client unfortunately but I 
should be avoiding the problem now...

>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>>
>> Reviewed-by: <paul@xen.org>
> 
> Tags need your real name, not just your email address, i.e. this should be:
> 
>    Reviewed-by: Paul Durrant <paul@xen.org>

Yes indeed it should. Don't know how I managed to screw that up... It's 
not like I haven't typed that properly hundreds of times on Xen patch reviews.

   Paul
David Woodhouse Nov. 22, 2022, 5:02 p.m. UTC | #4
On Tue, 2022-11-22 at 16:49 +0000, Paul Durrant wrote:
> > Tags need your real name, not just your email address, i.e. this should be:
> >     Reviewed-by: Paul Durrant <paul@xen.org>
> 
> Yes indeed it should. Don't know how I managed to screw that up... It's 
> not like I haven't typed that properly hundreds of times on Xen patch reviews.

All sorted in the tree I'm curating
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/gpc-fixes

Of those, three are marked as Cc:stable and want to go into the 6.1 release:

      KVM: x86/xen: Validate port number in SCHEDOP_poll
      KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
      KVM: Update gfn_to_pfn_cache khva when it moves within the same page

The rest (including the runstate compatibility fixes) are less
critical.

Sean, given that this now includes your patch series which in turn you
took over from Michal, how would you prefer me to proceed?

David Woodhouse (7):
      KVM: x86/xen: Validate port number in SCHEDOP_poll
      KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
      KVM: x86/xen: Add CPL to Xen hypercall tracepoint
      MAINTAINERS: Add KVM x86/xen maintainer list
      KVM: x86/xen: Compatibility fixes for shared runstate area
      KVM: Update gfn_to_pfn_cache khva when it moves within the same page
      KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary

Michal Luczaj (6):
      KVM: Shorten gfn_to_pfn_cache function names
      KVM: x86: Remove unused argument in gpc_unmap_khva()
      KVM: Store immutable gfn_to_pfn_cache properties
      KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_check()
      KVM: Clean up hva_to_pfn_retry()
      KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_refresh()

Sean Christopherson (4):
      KVM: Drop KVM's API to allow temporarily unmapping gfn=>pfn cache
      KVM: Do not partially reinitialize gfn=>pfn cache during activation
      KVM: Drop @gpa from exported gfn=>pfn cache check() and refresh() helpers
      KVM: Skip unnecessary "unmap" if gpc is already valid during refresh

We can reinstate the 'immutable length' thing on top, if we pick one of
the discussed options for coping with the fact that for the runstate
area, it *isn't* immutable. I'm slightly leaning towards just setting
the length to '1' despite it being a lie.
Sean Christopherson Nov. 22, 2022, 7:09 p.m. UTC | #5
On Tue, Nov 22, 2022, David Woodhouse wrote:
> On Tue, 2022-11-22 at 16:49 +0000, Paul Durrant wrote:
> > > Tags need your real name, not just your email address, i.e. this should be:
> > >     Reviewed-by: Paul Durrant <paul@xen.org>
> > 
> > Yes indeed it should. Don't know how I managed to screw that up... It's 
> > not like I haven't typed that properly hundreds of times on Xen patch reviews.
> 
> All sorted in the tree I'm curating
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/gpc-fixes
> 
> Of those, three are marked as Cc:stable and want to go into the 6.1 release:
> 
>       KVM: x86/xen: Validate port number in SCHEDOP_poll
>       KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
>       KVM: Update gfn_to_pfn_cache khva when it moves within the same page
> 
> The rest (including the runstate compatibility fixes) are less
> critical.
> 
> Sean, given that this now includes your patch series which in turn you
> took over from Michal, how would you prefer me to proceed?

If you're ok with a slight delay for fixing the runstate stuff, I think it makes
sense to push everything else out to 6.3.  At the very least, I want
to explore using a gfn_to_hva_cache instead of two gfn_to_pfn_cache structures
for the runstate, which at this point would be a bit rushed for 6.2.

Pushing out to 6.3 will also give us time to do the more aggressive cleanup of
adding kvm_gpc_lock(), which I think will yield APIs that both of us are happy with,
i.e. the gpa+len of the cache can be immutable, but users can still validate their
individual usage.

> David Woodhouse (7):
>       KVM: x86/xen: Validate port number in SCHEDOP_poll
>       KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
>       KVM: x86/xen: Add CPL to Xen hypercall tracepoint
>       MAINTAINERS: Add KVM x86/xen maintainer list
>       KVM: x86/xen: Compatibility fixes for shared runstate area
>       KVM: Update gfn_to_pfn_cache khva when it moves within the same page
>       KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary
> 
> Michal Luczaj (6):
>       KVM: Shorten gfn_to_pfn_cache function names
>       KVM: x86: Remove unused argument in gpc_unmap_khva()
>       KVM: Store immutable gfn_to_pfn_cache properties
>       KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_check()
>       KVM: Clean up hva_to_pfn_retry()
>       KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_refresh()
> 
> Sean Christopherson (4):
>       KVM: Drop KVM's API to allow temporarily unmapping gfn=>pfn cache
>       KVM: Do not partially reinitialize gfn=>pfn cache during activation
>       KVM: Drop @gpa from exported gfn=>pfn cache check() and refresh() helpers
>       KVM: Skip unnecessary "unmap" if gpc is already valid during refresh
> 
> We can reinstate the 'immutable length' thing on top, if we pick one of
> the discussed options for coping with the fact that for the runstate
> area, it *isn't* immutable. I'm slightly leaning towards just setting
> the length to '1' despite it being a lie.
David Woodhouse Nov. 22, 2022, 11:13 p.m. UTC | #6
On Tue, 2022-11-22 at 19:09 +0000, Sean Christopherson wrote:
> On Tue, Nov 22, 2022, David Woodhouse wrote:
> > On Tue, 2022-11-22 at 16:49 +0000, Paul Durrant wrote:
> > > > Tags need your real name, not just your email address, i.e. this should be:
> > > >     Reviewed-by: Paul Durrant <paul@xen.org>
> > > 
> > > Yes indeed it should. Don't know how I managed to screw that up... It's 
> > > not like I haven't typed that properly hundreds of times on Xen patch reviews.
> > 
> > All sorted in the tree I'm curating
> > https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/gpc-fixes
> > 
> > 
> > Of those, three are marked as Cc:stable and want to go into the 6.1 release:
> > 
> >       KVM: x86/xen: Validate port number in SCHEDOP_poll
> >       KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
> >       KVM: Update gfn_to_pfn_cache khva when it moves within the same page
> > 
> > The rest (including the runstate compatibility fixes) are less
> > critical.
> > 
> > Sean, given that this now includes your patch series which in turn you
> > took over from Michal, how would you prefer me to proceed?
> 
> If you're ok with a slight delay for fixing the runstate stuff, I think it makes
> sense to push everything else out to 6.3.

I can live with that. Will post those three fixes.

>  At the very least, I want to explore using a gfn_to_hva_cache
> instead of two gfn_to_pfn_cache structures for the runstate, which at
> this point would be a bit rushed for 6.2.

As noted, I think we'd need two gfn_to_hva_caches anyway. But happy to
explore options.

> Pushing out to 6.3 will also give us time to do the more aggressive cleanup of
> adding kvm_gpc_lock(), which I think will yield APIs that both of us are happy with,
> i.e. the gpa+len of the cache can be immutable, but users can still validate their
> individual usage.

Yeah. The kvm_gpc_lock() thing is another one I just couldn't see a
clean one-size-fits-all solution for with irqsave and not-irqsave
variants, and different conditions (and other locks to drop) between
the check() failing and the refresh() occurring... but if we can make a
helper which covers the *common* case without ending up being just an
obfuscation, that's a win.
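
For reference, the open-coded pattern such a helper would wrap looks roughly
like the sketch below. It is only a sketch: kvm_gpc_lock() itself doesn't
exist yet, and the exact kvm_gpc_check()/kvm_gpc_refresh() signatures are
assumptions, since the series above is still reshuffling them.

/*
 * Sketch only: the common case of taking the gpc lock and looping on
 * check()/refresh() until the cache is valid. The non-irqsave variant
 * and caller-specific error handling are deliberately omitted.
 */
static int kvm_gpc_lock_irqsave(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
				unsigned long len, unsigned long *flags)
{
	read_lock_irqsave(&gpc->lock, *flags);
	while (!kvm_gpc_check(kvm, gpc, gpc->gpa, len)) {
		read_unlock_irqrestore(&gpc->lock, *flags);

		/* May sleep, so callers on the sched-out path can't use this. */
		if (kvm_gpc_refresh(kvm, gpc, gpc->gpa, len))
			return -EFAULT;

		read_lock_irqsave(&gpc->lock, *flags);
	}

	/* gpc->khva is now safe to use until the lock is dropped. */
	return 0;
}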
Paolo Bonzini Nov. 24, 2022, 12:31 a.m. UTC | #7
On Tue, Nov 22, 2022 at 6:25 PM David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Tue, 2022-11-22 at 16:49 +0000, Paul Durrant wrote:
> > > Tags need your real name, not just your email address, i.e. this should be:
> > >     Reviewed-by: Paul Durrant <paul@xen.org>
> >
> > Yes indeed it should. Don't know how I managed to screw that up... It's
> > not like I haven't typed that properly hundreds of times on Xen patch reviews.
>
> All sorted in the tree I'm curating
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/gpc-fixes
>
> Of those, three are marked as Cc:stable and want to go into the 6.1 release:
>
>       KVM: x86/xen: Validate port number in SCHEDOP_poll
>       KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
>       KVM: Update gfn_to_pfn_cache khva when it moves within the same page
>
> The rest (including the runstate compatibility fixes) are less
> critical.

I have picked them into both kvm/master and kvm/queue.

The gpc series probably will be left for 6.3. I had already removed Sean's
bits for the gpc and will rebase on top of your runstate compatibility fixes,
which I'm cherry-picking into kvm/queue.

But wow, is that runstate compatibility patch ugly.  Is it really worth it
having the two separate update paths, one which is ugly because of
BUILD_BUG_ON assertions and one which is ugly because of the
two-page stuff?

Like this (sorry about any word-wrapping, I'll repost it properly
after testing):

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index b246decb53a9..873a0ded3822 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -205,7 +205,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 #ifdef CONFIG_X86_64
         /*
          * Don't leak kernel memory through the padding in the 64-bit
-         * struct if we take the split-page path.
+         * struct.
          */
         memset(&rs, 0, offsetof(struct vcpu_runstate_info, state_entry_time));
 #endif
@@ -251,83 +251,6 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
         read_lock_irqsave(&gpc1->lock, flags);
     }

-    /*
-     * The common case is that it all fits on a page and we can
-     * just do it the simple way.
-     */
-    if (likely(!user_len2)) {
-        /*
-         * We use 'int *user_state' to point to the state field, and
-         * 'u64 *user_times' for runstate_entry_time. So the actual
-         * array of time[] in each state starts at user_times[1].
-         */
-        int *user_state = gpc1->khva;
-        u64 *user_times = gpc1->khva + times_ofs;
-
-        /*
-         * The XEN_RUNSTATE_UPDATE bit is the top bit of the state_entry_time
-         * field. We need to set it (and write-barrier) before writing to the
-         * the rest of the structure, and clear it last. Just as Xen does, we
-         * address the single *byte* in which it resides because it might be
-         * in a different cache line to the rest of the 64-bit word, due to
-         * the (lack of) alignment constraints.
-         */
-        BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state_entry_time) !=
-                 sizeof(uint64_t));
-        BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state_entry_time) !=
-                 sizeof(uint64_t));
-        BUILD_BUG_ON((XEN_RUNSTATE_UPDATE >> 56) != 0x80);
-
-        update_bit = ((u8 *)(&user_times[1])) - 1;
-        *update_bit = (vx->runstate_entry_time | XEN_RUNSTATE_UPDATE) >> 56;
-        smp_wmb();
-
-        /*
-         * Next, write the new runstate. This is in the *same* place
-         * for 32-bit and 64-bit guests, asserted here for paranoia.
-         */
-        BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) !=
-                 offsetof(struct compat_vcpu_runstate_info, state));
-        BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state) !=
-                 sizeof(vx->current_runstate));
-        BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state) !=
-                 sizeof(vx->current_runstate));
-        *user_state = vx->current_runstate;
-
-        /*
-         * Then the actual runstate_entry_time (with the UPDATE bit
-         * still set).
-         */
-        *user_times = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
-
-        /*
-         * Write the actual runstate times immediately after the
-         * runstate_entry_time.
-         */
-        BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
-                 offsetof(struct vcpu_runstate_info, time) - sizeof(u64));
-        BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state_entry_time) !=
-                 offsetof(struct compat_vcpu_runstate_info, time) - sizeof(u64));
-        BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
-                 sizeof_field(struct compat_vcpu_runstate_info, time));
-        BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
-                 sizeof(vx->runstate_times));
-        memcpy(user_times + 1, vx->runstate_times, sizeof(vx->runstate_times));
-
-        smp_wmb();
-
-        /*
-         * Finally, clear the 'updating' bit. Don't use &= here because
-         * the compiler may not realise that update_bit and user_times
-         * point to the same place. That's a classic pointer-aliasing
-         * problem.
-         */
-        *update_bit = vx->runstate_entry_time >> 56;
-        smp_wmb();
-
-        goto done_1;
-    }
-
     /*
      * The painful code path. It's split across two pages and we need to
      * hold and validate both GPCs simultaneously. Thankfully we can get
@@ -336,7 +259,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
      */
     read_lock(&gpc2->lock);

-    if (!kvm_gpc_check(v->kvm, gpc2, gpc2->gpa, user_len2)) {
+    if (user_len2 && !kvm_gpc_check(v->kvm, gpc2, gpc2->gpa, user_len2)) {
         read_unlock(&gpc2->lock);
         read_unlock_irqrestore(&gpc1->lock, flags);

@@ -361,6 +284,20 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
         goto retry;
     }

+    /*
+     * The XEN_RUNSTATE_UPDATE bit is the top bit of the state_entry_time
+     * field. We need to set it (and write-barrier) before writing to the
+     * the rest of the structure, and clear it last. Just as Xen does, we
+     * address the single *byte* in which it resides because it might be
+     * in a different cache line to the rest of the 64-bit word, due to
+     * the (lack of) alignment constraints.
+     */
+    BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state_entry_time) !=
+             sizeof(uint64_t));
+    BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state_entry_time) !=
+             sizeof(uint64_t));
+    BUILD_BUG_ON((XEN_RUNSTATE_UPDATE >> 56) != 0x80);
+
     /*
      * Work out where the byte containing the XEN_RUNSTATE_UPDATE bit is.
      */
@@ -370,6 +307,17 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
         update_bit = ((u8 *)gpc2->khva) + times_ofs + sizeof(u64) - 1 -
             user_len1;

+    /*
+     * Next, write the new runstate. This is in the *same* place
+     * for 32-bit and 64-bit guests, asserted here for paranoia.
+     */
+    BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) !=
+             offsetof(struct compat_vcpu_runstate_info, state));
+    BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state) !=
+             sizeof(vx->current_runstate));
+    BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state) !=
+             sizeof(vx->current_runstate));
+
     /*
      * Create a structure on our stack with everything in the right place.
      * The rs_state pointer points to the start of it, which in the case
@@ -378,29 +326,44 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
      */
     *rs_state = vx->current_runstate;
     rs.state_entry_time = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
-    memcpy(rs.time, vx->runstate_times, sizeof(vx->runstate_times));
-
     *update_bit = rs.state_entry_time >> 56;
     smp_wmb();

+    /*
+     * Write the actual runstate times immediately after the
+     * runstate_entry_time.
+     */
+    BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
+             offsetof(struct vcpu_runstate_info, time) - sizeof(u64));
+    BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state_entry_time) !=
+             offsetof(struct compat_vcpu_runstate_info, time) - sizeof(u64));
+    BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
+             sizeof_field(struct compat_vcpu_runstate_info, time));
+    BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
+             sizeof(vx->runstate_times));
+    memcpy(rs.time, vx->runstate_times, sizeof(vx->runstate_times));
+
     /*
      * Having constructed the structure starting at *rs_state, copy it
      * into the first and second pages as appropriate using user_len1
      * and user_len2.
      */
     memcpy(gpc1->khva, rs_state, user_len1);
-    memcpy(gpc2->khva, ((u8 *)rs_state) + user_len1, user_len2);
+    if (user_len2)
+        memcpy(gpc2->khva, ((u8 *)rs_state) + user_len1, user_len2);
     smp_wmb();

     /*
-     * Finally, clear the XEN_RUNSTATE_UPDATE bit.
+     * Finally, clear the 'updating' bit. Don't use &= here because
+     * the compiler may not realise that update_bit and user_times
+     * point to the same place. That's a classic pointer-aliasing
+     * problem.
      */
     *update_bit = vx->runstate_entry_time >> 56;
     smp_wmb();

     if (user_len2)
         read_unlock(&gpc2->lock);
- done_1:
     read_unlock_irqrestore(&gpc1->lock, flags);

     mark_page_dirty_in_slot(v->kvm, gpc1->memslot, gpc1->gpa >> PAGE_SHIFT);


? Yet another possibility is to introduce a

/* Copy src_len bytes from src to offset dest_ofs within the combined area
 * pointed to by dest1 (up to dest1_len bytes) and dest2 (the rest). */
void split_memcpy(void *dest1, void *dest2, size_t dest_ofs, size_t dest1_len,
                  void *src, size_t src_len)

so that the on-stack struct is not needed at all. This makes it possible to
avoid the rs_state hack as well.
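
For illustration, one possible shape for that helper, as a sketch only (not
part of any posted series):

/*
 * Sketch: copy src_len bytes from src into the combined destination area
 * at offset dest_ofs, where the first dest1_len bytes of the area live at
 * dest1 and the remainder at dest2.
 */
static void split_memcpy(void *dest1, void *dest2, size_t dest_ofs,
			 size_t dest1_len, void *src, size_t src_len)
{
	size_t len1 = 0;

	if (dest_ofs < dest1_len) {
		len1 = min(src_len, dest1_len - dest_ofs);
		memcpy(dest1 + dest_ofs, src, len1);
	}
	if (len1 < src_len)
		memcpy(dest2 + dest_ofs + len1 - dest1_len, src + len1,
		       src_len - len1);
}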

It's in kvm/queue only, so there's time to include a new version.

Paolo

> Sean, given that this now includes your patch series which in turn you
> took over from Michal, how would you prefer me to proceed?
>
> David Woodhouse (7):
>       KVM: x86/xen: Validate port number in SCHEDOP_poll
>       KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
>       KVM: x86/xen: Add CPL to Xen hypercall tracepoint
>       MAINTAINERS: Add KVM x86/xen maintainer list
>       KVM: x86/xen: Compatibility fixes for shared runstate area
>       KVM: Update gfn_to_pfn_cache khva when it moves within the same page
>       KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary
>
> Michal Luczaj (6):
>       KVM: Shorten gfn_to_pfn_cache function names
>       KVM: x86: Remove unused argument in gpc_unmap_khva()
>       KVM: Store immutable gfn_to_pfn_cache properties
>       KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_check()
>       KVM: Clean up hva_to_pfn_retry()
>       KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_refresh()
>
> Sean Christopherson (4):
>       KVM: Drop KVM's API to allow temporarily unmapping gfn=>pfn cache
>       KVM: Do not partially reinitialize gfn=>pfn cache during activation
>       KVM: Drop @gpa from exported gfn=>pfn cache check() and refresh() helpers
>       KVM: Skip unnecessary "unmap" if gpc is already valid during refresh
>
> We can reinstate the 'immutable length' thing on top, if we pick one of
> the discussed options for coping with the fact that for the runstate
> area, it *isn't* immutable. I'm slightly leaning towards just setting
> the length to '1' despite it being a lie.
David Woodhouse Nov. 24, 2022, 12:45 a.m. UTC | #8
On Thu, 2022-11-24 at 01:31 +0100, Paolo Bonzini wrote:
> I have picked them into both kvm/master and kvm/queue.

Thanks.

> The gpc series probably will be left for 6.3. I had already removed
> Sean's bits for the gpc and will rebase on top of your runstate
> compatibility fixes, which I'm cherry-picking into kvm/queue.
> 
> But wow, is that runstate compatibility patch ugly.  Is it really
> worth it having the two separate update paths, one which is ugly
> because of BUILD_BUG_ON assertions and one which is ugly because of
> the two-page stuff?

The BUILD_BUG_ON() assertions could move. I quite liked having them
there as documentation for compat handling, but I'm happy to move them.
There's plenty of *actual* comments on the compat handling too ;)

I do think it's worth the separate paths because the *common* case
doesn't have to bounce it through the kernel stack at all, and it's a
relatively fast path because it happens on each schedule in/out.

The fast path isn't the part I hate. Removing *that* doesn't really
make things much nicer.

If you want to move the BUILD_BUG_ONs into the slow path and then we
can concentrate our hate on that part and leave the fast path even
nicer, that's fine though ;)
David Woodhouse Nov. 24, 2022, 12:54 a.m. UTC | #9
On Thu, 2022-11-24 at 01:31 +0100, Paolo Bonzini wrote:
> Yet another possibility is to introduce a
> 
> /* Copy from src to dest_ofs bytes into the combined area pointed to by
>  * dest1 (up to dest1_len bytes) and dest2 (the rest). */
> void split_memcpy(void *dest1, void *dest2, size_t dest_ofs, size_t
> dest1_len, void *src, size_t src_len)
> 
> so that the on-stack struct is not needed at all. This makes it possible to
> avoid the rs_state hack as well.

FWIW I did one of those in 
https://lore.kernel.org/kvm/6acfdbd3e516ece5949e97c85e7db8dc97a3e6c6.camel@infradead.org/

Hated that too :)
Paolo Bonzini Nov. 24, 2022, 1:11 a.m. UTC | #10
On Thu, Nov 24, 2022 at 1:55 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Thu, 2022-11-24 at 01:31 +0100, Paolo Bonzini wrote:
> > Yet another possibility is to introduce a
> >
> > /* Copy from src to dest_ofs bytes into the combined area pointed to by
> >  * dest1 (up to dest1_len bytes) and dest2 (the rest). */
> > void split_memcpy(void *dest1, void *dest2, size_t dest_ofs, size_t
> > dest1_len, void *src, size_t src_len)
> >
> > so that the on-stack struct is not needed at all. This makes it possible to
> > avoid the rs_state hack as well.
>
> FWIW I did one of those in
> https://lore.kernel.org/kvm/6acfdbd3e516ece5949e97c85e7db8dc97a3e6c6.camel@infradead.org/
>
> Hated that too :)

Aha, that may be where I got the idea. Won't win a beauty contest indeed.

What if vx contained a struct vcpu_runstate_info instead of having
separate fields? That might combine the best of the on-stack version
and the best of the special-memcpy version...

Going to bed now anyway.

Paolo
David Woodhouse Nov. 24, 2022, 10:41 a.m. UTC | #11
On Thu, 2022-11-24 at 02:11 +0100, Paolo Bonzini wrote:
> 
> What if vx contained a struct vcpu_runstate_info instead of having
> separate fields? That might combine the best of the on-stack version
> and the best of the special-memcpy version...

Well... the state field is in a different place for compat vs. 64-bit
which makes that fun, and the state_entry_time field need to have the
XEN_RUNSTATE_UPDATE bit masked into it which means we don't really want
to just memcpy *that* part... and we already *are* just doing a memcpy
for the rest, which is the time[] array.

But we *can* recognise that the fast and slow paths aren't all that
different really, and combine them. How's this?

From bf9d04e1c59771c82d5694d22a86fb0508be1530 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@amazon.co.uk>
Date: Thu, 24 Nov 2022 09:49:00 +0000
Subject: [PATCH] KVM: x86/xen: Reconcile fast and slow paths for runstate
 update

Instead of having a completely separate fast path for the case where the
guest's runstate_info fits entirely in a page, recognise the similarity.

In both cases, the @update_bit pointer points to the byte containing the
XEN_RUNSTATE_UPDATE flag directly in the guest, via one of the GPCs.

In both cases, the actual guest structure (compat or not) is built up
from the fields in @vx, following a preset pointer. The only difference
is whether that pointer points to the kernel stack (in the split case)
or to guest memory directly via the GPC.

All that can be the same if we just set the pointers up accordingly for
each case. Then the only real difference is that dual memcpy, and the
whole of that original fast path can go away, disappearing into the slow
path without actually doing the slow part.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/xen.c | 249 +++++++++++++++++++++++----------------------
 1 file changed, 126 insertions(+), 123 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index d4077262c0a2..eaa7fddfbb8c 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -177,44 +177,78 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	struct gfn_to_pfn_cache *gpc2 = &vx->runstate2_cache;
 	size_t user_len, user_len1, user_len2;
 	struct vcpu_runstate_info rs;
-	int *rs_state = &rs.state;
 	unsigned long flags;
 	size_t times_ofs;
-	u8 *update_bit;
+	uint8_t *update_bit;
+	uint64_t *rs_times;
+	int *rs_state;
 
 	/*
 	 * The only difference between 32-bit and 64-bit versions of the
 	 * runstate struct is the alignment of uint64_t in 32-bit, which
 	 * means that the 64-bit version has an additional 4 bytes of
-	 * padding after the first field 'state'.
+	 * padding after the first field 'state'. Let's be really really
+	 * paranoid about that, and matching it with our internal data
+	 * structures that we memcpy into it...
 	 */
 	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) != 0);
 	BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state) != 0);
 	BUILD_BUG_ON(sizeof(struct compat_vcpu_runstate_info) != 0x2c);
 #ifdef CONFIG_X86_64
+	/*
+	 * The 64-bit structure has 4 bytes of padding before 'state_entry_time'
+	 * so each subsequent field is shifted by 4, and it's 4 bytes longer.
+	 */
 	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
 		     offsetof(struct compat_vcpu_runstate_info, state_entry_time) + 4);
 	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, time) !=
 		     offsetof(struct compat_vcpu_runstate_info, time) + 4);
+	BUILD_BUG_ON(sizeof(struct vcpu_runstate_info) != 0x2c + 4);
 #endif
+	/*
+	 * The state field is in the same place at the start of both structs,
+	 * and is the same size (int) as vx->current_runstate.
+	 */
+	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) !=
+		     offsetof(struct compat_vcpu_runstate_info, state));
+	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state) !=
+		     sizeof(vx->current_runstate));
+	BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state) !=
+		     sizeof(vx->current_runstate));
+
+	/*
+	 * The state_entry_time field is 64 bits in both versions, and the
+	 * XEN_RUNSTATE_UPDATE flag is in the top bit, which given that x86
+	 * is little-endian means that it's in the last *byte* of the word.
+	 * That detail is important later.
+	 */
+	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state_entry_time) !=
+		     sizeof(uint64_t));
+	BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state_entry_time) !=
+		     sizeof(uint64_t));
+	BUILD_BUG_ON((XEN_RUNSTATE_UPDATE >> 56) != 0x80);
+
+	/*
+	 * The time array is four 64-bit quantities in both versions, matching
+	 * the vx->runstate_times and immediately following state_entry_time.
+	 */
+	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
+		     offsetof(struct vcpu_runstate_info, time) - sizeof(u64));
+	BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state_entry_time) !=
+		     offsetof(struct compat_vcpu_runstate_info, time) - sizeof(u64));
+	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
+		     sizeof_field(struct compat_vcpu_runstate_info, time));
+	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
+		     sizeof(vx->runstate_times));
 
 	if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode) {
 		user_len = sizeof(struct vcpu_runstate_info);
 		times_ofs = offsetof(struct vcpu_runstate_info,
 				     state_entry_time);
-#ifdef CONFIG_X86_64
-		/*
-		 * Don't leak kernel memory through the padding in the 64-bit
-		 * struct if we take the split-page path.
-		 */
-		memset(&rs, 0, offsetof(struct vcpu_runstate_info, state_entry_time));
-#endif
 	} else {
 		user_len = sizeof(struct compat_vcpu_runstate_info);
 		times_ofs = offsetof(struct compat_vcpu_runstate_info,
 				     state_entry_time);
-		/* Start of struct for compat mode (qv). */
-		rs_state++;
 	}
 
 	/*
@@ -251,145 +285,114 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 		read_lock_irqsave(&gpc1->lock, flags);
 	}
 
-	/*
-	 * The common case is that it all fits on a page and we can
-	 * just do it the simple way.
-	 */
 	if (likely(!user_len2)) {
 		/*
-		 * We use 'int *user_state' to point to the state field, and
-		 * 'u64 *user_times' for runstate_entry_time. So the actual
-		 * array of time[] in each state starts at user_times[1].
+		 * Set up three pointers directly to the runstate_info
+		 * struct in the guest (via the GPC).
+		 *
+		 *  • @rs_state   → state field
+		 *  • @rs_times   → state_entry_time field.
+		 *  • @update_bit → last byte of state_entry_time, which
+		 *                  contains the XEN_RUNSTATE_UPDATE bit.
 		 */
-		int *user_state = gpc1->khva;
-		u64 *user_times = gpc1->khva + times_ofs;
-
+		rs_state = gpc1->khva;
+		rs_times = gpc1->khva + times_ofs;
+		update_bit = ((uint8_t *)(&rs_times[1])) - 1;
+	} else {
 		/*
-		 * The XEN_RUNSTATE_UPDATE bit is the top bit of the state_entry_time
-		 * field. We need to set it (and write-barrier) before writing to the
-		 * the rest of the structure, and clear it last. Just as Xen does, we
-		 * address the single *byte* in which it resides because it might be
-		 * in a different cache line to the rest of the 64-bit word, due to
-		 * the (lack of) alignment constraints.
+		 * The guest's runstate_info is split across two pages and we
+		 * need to hold and validate both GPCs simultaneously. We can
+		 * declare a lock ordering GPC1 > GPC2 because nothing else
+		 * takes them more than one at a time.
 		 */
-		BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state_entry_time) !=
-			     sizeof(uint64_t));
-		BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state_entry_time) !=
-			     sizeof(uint64_t));
-		BUILD_BUG_ON((XEN_RUNSTATE_UPDATE >> 56) != 0x80);
+		read_lock(&gpc2->lock);
 
-		update_bit = ((u8 *)(&user_times[1])) - 1;
-		*update_bit = (vx->runstate_entry_time | XEN_RUNSTATE_UPDATE) >> 56;
-		smp_wmb();
+		if (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc2, gpc2->gpa, user_len2)) {
+			read_unlock(&gpc2->lock);
+			read_unlock_irqrestore(&gpc1->lock, flags);
 
-		/*
-		 * Next, write the new runstate. This is in the *same* place
-		 * for 32-bit and 64-bit guests, asserted here for paranoia.
-		 */
-		BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) !=
-			     offsetof(struct compat_vcpu_runstate_info, state));
-		BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state) !=
-			     sizeof(vx->current_runstate));
-		BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state) !=
-			     sizeof(vx->current_runstate));
-		*user_state = vx->current_runstate;
+			/* When invoked from kvm_sched_out() we cannot sleep */
+			if (atomic)
+				return;
 
-		/*
-		 * Then the actual runstate_entry_time (with the UPDATE bit
-		 * still set).
-		 */
-		*user_times = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
+			/*
+			 * Use kvm_gpc_activate() here because if the runstate
+			 * area was configured in 32-bit mode and only extends
+			 * to the second page now because the guest changed to
+			 * 64-bit mode, the second GPC won't have been set up.
+			 */
+			if (kvm_gpc_activate(v->kvm, gpc2, NULL, KVM_HOST_USES_PFN,
+					     gpc1->gpa + user_len1, user_len2))
+				return;
+
+			/*
+			 * We dropped the lock on GPC1 so we have to go all the
+			 * way back and revalidate that too.
+			 */
+			goto retry;
+		}
 
 		/*
-		 * Write the actual runstate times immediately after the
-		 * runstate_entry_time.
+		 * In this case, the runstate_info struct will be assembled on
+		 * the kernel stack (compat or not as appropriate) and will
+		 * be copied to GPC1/GPC2 with a dual memcpy. Set up the three
+		 * rs pointers accordingly.
 		 */
-		BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
-			     offsetof(struct vcpu_runstate_info, time) - sizeof(u64));
-		BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state_entry_time) !=
-			     offsetof(struct compat_vcpu_runstate_info, time) - sizeof(u64));
-		BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
-			     sizeof_field(struct compat_vcpu_runstate_info, time));
-		BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
-			     sizeof(vx->runstate_times));
-		memcpy(user_times + 1, vx->runstate_times, sizeof(vx->runstate_times));
-
-		smp_wmb();
+		rs_times = &rs.state_entry_time;
 
 		/*
-		 * Finally, clear the 'updating' bit. Don't use &= here because
-		 * the compiler may not realise that update_bit and user_times
-		 * point to the same place. That's a classic pointer-aliasing
-		 * problem.
+		 * The rs_state pointer points to the start of what we'll
+		 * copy to the guest, which in the case of a compat guest
+		 * is the 32-bit field that the compiler thinks is padding.
 		 */
-		*update_bit = vx->runstate_entry_time >> 56;
-		smp_wmb();
-
-		goto done_1;
-	}
-
-	/*
-	 * The painful code path. It's split across two pages and we need to
-	 * hold and validate both GPCs simultaneously. Thankfully we can get
-	 * away with declaring a lock ordering GPC1 > GPC2 because nothing
-	 * else takes them more than one at a time.
-	 */
-	read_lock(&gpc2->lock);
-
-	if (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc2, gpc2->gpa, user_len2)) {
-		read_unlock(&gpc2->lock);
-		read_unlock_irqrestore(&gpc1->lock, flags);
-
-		/* When invoked from kvm_sched_out() we cannot sleep */
-		if (atomic)
-			return;
+		rs_state = ((void *)rs_times) - times_ofs;
 
 		/*
-		 * Use kvm_gpc_activate() here because if the runstate
-		 * area was configured in 32-bit mode and only extends
-		 * to the second page now because the guest changed to
-		 * 64-bit mode, the second GPC won't have been set up.
+		 * The update_bit is still directly in the guest memory,
+		 * via one GPC or the other.
 		 */
-		if (kvm_gpc_activate(v->kvm, gpc2, NULL, KVM_HOST_USES_PFN,
-				     gpc1->gpa + user_len1, user_len2))
-			return;
+		if (user_len1 >= times_ofs + sizeof(uint64_t))
+			update_bit = ((u8 *)gpc1->khva) + times_ofs +
+				sizeof(uint64_t) - 1;
+		else
+			update_bit = ((u8 *)gpc2->khva) + times_ofs +
+				sizeof(uint64_t) - 1 - user_len1;
 
+#ifdef CONFIG_X86_64
 		/*
-		 * We dropped the lock on GPC1 so we have to go all the way
-		 * back and revalidate that too.
+		 * Don't leak kernel memory through the padding in the 64-bit
+		 * version of the struct.
 		 */
-		goto retry;
+		memset(&rs, 0, offsetof(struct vcpu_runstate_info, state_entry_time));
+#endif
 	}
 
 	/*
-	 * Work out where the byte containing the XEN_RUNSTATE_UPDATE bit is.
+	 * First, set the XEN_RUNSTATE_UPDATE bit in the top bit of the
+	 * state_entry_time field, directly in the guest. We need to set
+	 * that (and write-barrier) before writing to the rest of the
+	 * structure, and clear it last. Just as Xen does, we address the
+	 * single *byte* in which it resides because it might be in a
+	 * different cache line to the rest of the 64-bit word, due to
+	 * the (lack of) alignment constraints.
 	 */
-	if (user_len1 >= times_ofs + sizeof(uint64_t))
-		update_bit = ((u8 *)gpc1->khva) + times_ofs + sizeof(u64) - 1;
-	else
-		update_bit = ((u8 *)gpc2->khva) + times_ofs + sizeof(u64) - 1 -
-			user_len1;
+	*update_bit = (vx->runstate_entry_time | XEN_RUNSTATE_UPDATE) >> 56;
+	smp_wmb();
 
 	/*
-	 * Create a structure on our stack with everything in the right place.
-	 * The rs_state pointer points to the start of it, which in the case
-	 * of a compat guest on a 64-bit host is the 32 bit field that the
-	 * compiler thinks is padding.
+	 * Now assemble the actual structure, either on our kernel stack
+	 * or directly in the guest according to how the rs_state and
+	 * rs_times pointers were set up above.
 	 */
 	*rs_state = vx->current_runstate;
-	rs.state_entry_time = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
-	memcpy(rs.time, vx->runstate_times, sizeof(vx->runstate_times));
-
-	*update_bit = rs.state_entry_time >> 56;
-	smp_wmb();
+	rs_times[0] = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
+	memcpy(rs_times + 1, vx->runstate_times, sizeof(vx->runstate_times));
 
-	/*
-	 * Having constructed the structure starting at *rs_state, copy it
-	 * into the first and second pages as appropriate using user_len1
-	 * and user_len2.
-	 */
-	memcpy(gpc1->khva, rs_state, user_len1);
-	memcpy(gpc2->khva, ((u8 *)rs_state) + user_len1, user_len2);
+	/* For the split case, we have to then copy it to the guest. */
+	if (user_len2) {
+		memcpy(gpc1->khva, rs_state, user_len1);
+		memcpy(gpc2->khva, ((u8 *)rs_state) + user_len1, user_len2);
+	}
 	smp_wmb();
 
 	/*
@@ -400,7 +403,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 
 	if (user_len2)
 		read_unlock(&gpc2->lock);
- done_1:
+
 	read_unlock_irqrestore(&gpc1->lock, flags);
 
 	mark_page_dirty_in_slot(v->kvm, gpc1->memslot, gpc1->gpa >> PAGE_SHIFT);
David Woodhouse Nov. 24, 2022, 11:30 a.m. UTC | #12
On Thu, 2022-11-24 at 10:41 +0000, David Woodhouse wrote:
> On Thu, 2022-11-24 at 02:11 +0100, Paolo Bonzini wrote:
> > 
> > What if vx contained a struct vcpu_runstate_info instead of having
> > separate fields? That might combine the best of the on-stack version
> > and the best of the special-memcpy version...
> 
> Well... the state field is in a different place for compat vs. 64-bit
> which makes that fun, and the state_entry_time field need to have the
> XEN_RUNSTATE_UPDATE bit masked into it which means we don't really want
> to just memcpy *that* part... and we already *are* just doing a memcpy
> for the rest, which is the time[] array.
> 
> But we *can* recognise that the fast and slow paths aren't all that
> different really, and combine them. How's this?

That version applies immediately after the existing 'Compatibility
fixes for shared runstate area' commit, in case you wanted to squash
them.

Alternatively if you prefer it that way, I made a few trivial cosmetic
cleanups and rebased it on top of the existing kvm/queue. I also
rebased the remaining GPC cleanups on top of that, in my tree at
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/gpc-fixes

Patch

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index bd4a46aee384..5f83321bfd2a 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -297,7 +297,12 @@  int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 	if (!gpc->valid || old_uhva != gpc->uhva) {
 		ret = hva_to_pfn_retry(kvm, gpc);
 	} else {
-		/* If the HVA→PFN mapping was already valid, don't unmap it. */
+		/*
+		 * If the HVA→PFN mapping was already valid, don't unmap it.
+		 * But do update gpc->khva because the offset within the page
+		 * may have changed.
+		 */
+		gpc->khva = old_khva + page_offset;
 		old_pfn = KVM_PFN_ERR_FAULT;
 		old_khva = NULL;
 		ret = 0;