
[2/4] KVM: x86/xen: Compatibility fixes for shared runstate area

Message ID 20221119094659.11868-2-dwmw2@infradead.org (mailing list archive)
State New, archived
Series [1/4] MAINTAINERS: Add KVM x86/xen maintainer list

Commit Message

David Woodhouse Nov. 19, 2022, 9:46 a.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

The guest runstate area can be arbitrarily byte-aligned. In fact, even
when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
fields in the structure end up being unaligned due to the fact that the
32-bit ABI only aligns them to 32 bits.
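
For reference, the two layouts end up like this (offsets as asserted by the
BUILD_BUG_ON()s in the patch below):

	struct vcpu_runstate_info {          /* 64-bit guest */
		int      state;              /* offset 0  */
		                             /* 4 bytes of implicit padding */
		uint64_t state_entry_time;   /* offset 8  */
		uint64_t time[4];            /* offset 16 */
	};

	struct compat_vcpu_runstate_info {   /* 32-bit guest */
		int      state;              /* offset 0  */
		uint64_t state_entry_time;   /* offset 4  */
		uint64_t time[4];            /* offset 12 */
	};                                   /* sizeof == 0x2c */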

So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
is buggy, because if it's unaligned then we can't update the whole field
atomically; the low bytes might be observable before the _UPDATE bit is.
Xen actually updates the *byte* containing that top bit, on its own. KVM
should do the same.

In addition, we cannot assume that the runstate area fits within a single
page. One option might be to make the gfn_to_pfn cache cope with regions
that cross a page — but getting a contiguous virtual kernel mapping of a
discontiguous set of IOMEM pages is a distinctly non-trivial exercise,
and it seems this is the *only* current use case for the GPC which would
benefit from it.

An earlier version of the runstate code did use a gfn_to_hva cache for
this purpose, but it still had the single-page restriction because it
used the uhva directly — because it needs to be able to do so atomically
when the vCPU is being scheduled out, so it used pagefault_disable()
around the accesses and didn't just use kvm_write_guest_cached() which
has a fallback path.

So... use a pair of GPCs for the first and potential second page covering
the runstate area. We can get away with locking both at once because
nothing else takes more than one GPC lock at a time so we can invent
a trivial ordering rule.

Keep the trivial fast path for the common case where it's all in the
same page, but fixed to use a byte access for the XEN_RUNSTATE_UPDATE
bit. And in the cross-page case, build the structure locally on the
stack and then copy it over with two memcpy() calls, again handling
the XEN_RUNSTATE_UPDATE bit through a single byte access.

Finally, Xen also does write the runstate area immediately when it's
configured. Flip the kvm_xen_update_runstate() and …_guest() functions
and call the latter directly when the runstate area is set. This means
that other ioctls which modify the runstate also write it immediately
to the guest when they do so, which is also intended.

Update the xen_shinfo_test to exercise the pathological case where the
XEN_RUNSTATE_UPDATE flag in the top byte of the state_entry_time is
actually in a different page to the rest of the 64-bit word.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/kvm_host.h               |   1 +
 arch/x86/kvm/xen.c                            | 367 +++++++++++++-----
 arch/x86/kvm/xen.h                            |   6 +-
 .../selftests/kvm/x86_64/xen_shinfo_test.c    |  12 +-
 4 files changed, 272 insertions(+), 114 deletions(-)

Comments

Durrant, Paul Nov. 19, 2022, 11:38 a.m. UTC | #1
> -----Original Message-----
> From: David Woodhouse <dwmw2@infradead.org>
> Sent: 19 November 2022 09:47
> To: Paolo Bonzini <pbonzini@redhat.com>; Sean Christopherson
> <seanjc@google.com>
> Cc: kvm@vger.kernel.org; mhal@rbox.co
> Subject: [EXTERNAL] [PATCH 2/4] KVM: x86/xen: Compatibility fixes for
> shared runstate area
> 
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The guest runstate area can be arbitrarily byte-aligned. In fact, even
> when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
> fields in the structure end up being unaligned due to the fact that the
> 32-bit ABI only aligns them to 32 bits.
> 
> So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
> is buggy, because if it's unaligned then we can't update the whole field
> atomically; the low bytes might be observable before the _UPDATE bit is.
> Xen actually updates the *byte* containing that top bit, on its own. KVM
> should do the same.
> 
> In addition, we cannot assume that the runstate area fits within a single
> page. One option might be to make the gfn_to_pfn cache cope with regions
> that cross a page — but getting a contiguous virtual kernel mapping of a
> discontiguous set of IOMEM pages is a distinctly non-trivial exercise, and
> it seems this is the *only* current use case for the GPC which would
> benefit from it.
> 
> An earlier version of the runstate code did use a gfn_to_hva cache for
> this purpose, but it still had the single-page restriction because it used
> the uhva directly — because it needs to be able to do so atomically when
> the vCPU is being scheduled out, so it used pagefault_disable() around the
> accesses and didn't just use kvm_write_guest_cached() which has a fallback
> path.
> 
> So... use a pair of GPCs for the first and potential second page covering
> the runstate area. We can get away with locking both at once because
> nothing else takes more than one GPC lock at a time so we can invent a
> trivial ordering rule.
> 
> Keep the trivial fast path for the common case where it's all in the same
> page, but fixed to use a byte access for the XEN_RUNSTATE_UPDATE bit. And
> in the cross-page case, build the structure locally on the stack and then
> copy it over with two memcpy() calls, again handling the
> XEN_RUNSTATE_UPDATE bit through a single byte access.
> 
> Finally, Xen also does write the runstate area immediately when it's
> configured. Flip the kvm_xen_update_runstate() and …_guest() functions and
> call the latter directly when the runstate area is set. This means that
> other ioctls which modify the runstate also write it immediately to the
> guest when they do so, which is also intended.
> 
> Update the xen_shinfo_test to exercise the pathological case where the
> XEN_RUNSTATE_UPDATE flag in the top byte of the state_entry_time is
> actually in a different page to the rest of the 64-bit word.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/include/asm/kvm_host.h               |   1 +
>  arch/x86/kvm/xen.c                            | 367 +++++++++++++-----
>  arch/x86/kvm/xen.h                            |   6 +-
>  .../selftests/kvm/x86_64/xen_shinfo_test.c    |  12 +-
>  4 files changed, 272 insertions(+), 114 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h
> b/arch/x86/include/asm/kvm_host.h index d1013c4f673c..70af7240a1d5 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -686,6 +686,7 @@ struct kvm_vcpu_xen {
>         struct gfn_to_pfn_cache vcpu_info_cache;
>         struct gfn_to_pfn_cache vcpu_time_info_cache;
>         struct gfn_to_pfn_cache runstate_cache;
> +       struct gfn_to_pfn_cache runstate2_cache;
>         u64 last_steal;
>         u64 runstate_entry_time;
>         u64 runstate_times[4];
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index
> 4b8e9628fbf5..8aa953b1f0e0 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -170,148 +170,269 @@ static void kvm_xen_init_timer(struct kvm_vcpu
> *vcpu)
>         vcpu->arch.xen.timer.function = xen_timer_callback;  }
> 
> -static void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
> +static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool
> +atomic)
>  {
>         struct kvm_vcpu_xen *vx = &v->arch.xen;
> -       u64 now = get_kvmclock_ns(v->kvm);
> -       u64 delta_ns = now - vx->runstate_entry_time;
> -       u64 run_delay = current->sched_info.run_delay;
> +       struct gfn_to_pfn_cache *gpc1 = &vx->runstate_cache;
> +       struct gfn_to_pfn_cache *gpc2 = &vx->runstate2_cache;
> +       size_t user_len, user_len1, user_len2;

I think it might make the code marginally neater to use two-element arrays... but only marginally.

> +       struct vcpu_runstate_info rs;
> +       int *rs_state = &rs.state;
> +       unsigned long flags;
> +       size_t times_ofs;
> +       u8 *update_bit;
> 
> -       if (unlikely(!vx->runstate_entry_time))
> -               vx->current_runstate = RUNSTATE_offline;
> +       /*
> +        * The only difference between 32-bit and 64-bit versions of the
> +        * runstate struct us the alignment of uint64_t in 32-bit, which
> +        * means that the 64-bit version has an additional 4 bytes of
> +        * padding after the first field 'state'.
> +        */
> +       BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) != 0);
> +       BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state) !=
> 0);
> +       BUILD_BUG_ON(sizeof(struct compat_vcpu_runstate_info) != 0x2c);
> +#ifdef CONFIG_X86_64
> +       BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time)
> !=
> +                    offsetof(struct compat_vcpu_runstate_info,
> state_entry_time) + 4);
> +       BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, time) !=
> +                    offsetof(struct compat_vcpu_runstate_info, time) +
> +4); #endif
> +
> +       if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode) {
> +               user_len = sizeof(struct vcpu_runstate_info);
> +               times_ofs = offsetof(struct vcpu_runstate_info,
> +                                    state_entry_time);
> +       } else {
> +               user_len = sizeof(struct compat_vcpu_runstate_info);
> +               times_ofs = offsetof(struct compat_vcpu_runstate_info,
> +                                    state_entry_time);
> +               rs_state++;
> +       }
> 
>         /*
> -        * Time waiting for the scheduler isn't "stolen" if the
> -        * vCPU wasn't running anyway.
> +        * There are basically no alignment constraints. The guest can set
> it
> +        * up so it crosses from one page to the next, and at arbitrary
> byte
> +        * alignment (and the 32-bit ABI doesn't align the 64-bit integers
> +        * anyway, even if the overall struct had been 64-bit aligned).
>          */
> -       if (vx->current_runstate == RUNSTATE_running) {
> -               u64 steal_ns = run_delay - vx->last_steal;
> +       if ((gpc1->gpa & ~PAGE_MASK) + user_len >= PAGE_SIZE) {
> +               user_len1 = PAGE_SIZE - (gpc1->gpa & ~PAGE_MASK);
> +               user_len2 = user_len - user_len1;
> +       } else {
> +               user_len1 = user_len;
> +               user_len2 = 0;
> +       }
> +       BUG_ON(user_len1 + user_len2 != user_len);
> 
> -               delta_ns -= steal_ns;
> + retry:
> +       /*
> +        * Attempt to obtain the GPC lock on *both* (if there are two)
> +        * gfn_to_pfn caches that cover the region.
> +        */
> +       read_lock_irqsave(&gpc1->lock, flags);
> +       while (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc1, gpc1->gpa,
> user_len1)) {
> +               read_unlock_irqrestore(&gpc1->lock, flags);
> 
> -               vx->runstate_times[RUNSTATE_runnable] += steal_ns;
> +               /* When invoked from kvm_sched_out() we cannot sleep */
> +               if (atomic)
> +                       return;
> +
> +               if (kvm_gfn_to_pfn_cache_refresh(v->kvm, gpc1, gpc1->gpa,
> user_len1))
> +                       return;
> +
> +               read_lock_irqsave(&gpc1->lock, flags);

This is a repeated pattern now, so how about a helper function?

  Paul
David Woodhouse Nov. 19, 2022, 11:46 a.m. UTC | #2
On Sat, 2022-11-19 at 11:38 +0000, Durrant, Paul wrote:
> > +       size_t user_len, user_len1, user_len2;
> 
> I think it might make the code marginally neater to use two-element arrays... but only marginally.

Not sure; I kind of prefer it like this. At least overall, this version
is somewhat nicer than some of the interim attempts that you saw... :)


> > +        * Attempt to obtain the GPC lock on *both* (if there are two)
> > +        * gfn_to_pfn caches that cover the region.
> > +        */
> > +       read_lock_irqsave(&gpc1->lock, flags);
> > +       while (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc1, gpc1->gpa, user_len1)) {
> > +               read_unlock_irqrestore(&gpc1->lock, flags);
> > 
> > -               vx->runstate_times[RUNSTATE_runnable] += steal_ns;
> > +               /* When invoked from kvm_sched_out() we cannot sleep */
> > +               if (atomic)
> > +                       return;
> > +
> > +               if (kvm_gfn_to_pfn_cache_refresh(v->kvm, gpc1, gpc1->gpa, user_len1))
> > +                       return;
> > +
> > +               read_lock_irqsave(&gpc1->lock, flags);
> 
> This is a repeated pattern now, so how about a helper function?

I've been tempted by that a few times, but although it's a repeating
pattern, the conditions for bailing out early differ each time — the
if (atomic) part above.
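
If it were parameterised, the helper would presumably end up looking
something like this (sketch only; the name is invented and it's untested):

	static bool kvm_gpc_lock_checked(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
					 unsigned long len, bool atomic,
					 unsigned long *flags)
	{
		read_lock_irqsave(&gpc->lock, *flags);
		while (!kvm_gfn_to_pfn_cache_check(kvm, gpc, gpc->gpa, len)) {
			read_unlock_irqrestore(&gpc->lock, *flags);

			/* Cannot sleep when invoked from kvm_sched_out() */
			if (atomic)
				return false;

			if (kvm_gfn_to_pfn_cache_refresh(kvm, gpc, gpc->gpa, len))
				return false;

			read_lock_irqsave(&gpc->lock, *flags);
		}
		return true;
	}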
Sean Christopherson Nov. 22, 2022, 6:39 p.m. UTC | #3
On Sat, Nov 19, 2022, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The guest runstate area can be arbitrarily byte-aligned. In fact, even
> when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
> fields in the structure end up being unaligned due to the fact that the
> 32-bit ABI only aligns them to 32 bits.
> 
> So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
> is buggy, because if it's unaligned then we can't update the whole field
> atomically; the low bytes might be observable before the _UPDATE bit is.
> Xen actually updates the *byte* containing that top bit, on its own. KVM
> should do the same.

I think we're using the wrong APIs to update the runstate.  The VMCS/VMCB pages
_need_ the host pfn, i.e. need to use a gpc (eventually).  The Xen PV stuff on the
other hand most definitely doesn't need to know the pfn.

The event channel code would be difficult to convert due to lack of uaccess
primitives, but I don't see anything in the runstate code that prevents KVM from
using a gfn_to_hva_cache.  That will naturally handle page splits by sending them
down a slow path and would yield far simpler code.

If taking the slow path is an issue, then the guest really should be fixed to not
split pages.  And if that's not an acceptable answer, the gfn_to_hva_cache code
could be updated to use the fast path if the region is contiguous in the host
virtual address space.

> +static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
>  {
>  	struct kvm_vcpu_xen *vx = &v->arch.xen;
> -	u64 now = get_kvmclock_ns(v->kvm);
> -	u64 delta_ns = now - vx->runstate_entry_time;
> -	u64 run_delay = current->sched_info.run_delay;
> +	struct gfn_to_pfn_cache *gpc1 = &vx->runstate_cache;
> +	struct gfn_to_pfn_cache *gpc2 = &vx->runstate2_cache;
> +	size_t user_len, user_len1, user_len2;
> +	struct vcpu_runstate_info rs;
> +	int *rs_state = &rs.state;
> +	unsigned long flags;
> +	size_t times_ofs;
> +	u8 *update_bit;
>  
> -	if (unlikely(!vx->runstate_entry_time))
> -		vx->current_runstate = RUNSTATE_offline;
> +	/*
> +	 * The only difference between 32-bit and 64-bit versions of the
> +	 * runstate struct us the alignment of uint64_t in 32-bit, which

s/us/is

> +	 * means that the 64-bit version has an additional 4 bytes of
> +	 * padding after the first field 'state'.
> +	 */
> +	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) != 0);
> +	BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state) != 0);
> +	BUILD_BUG_ON(sizeof(struct compat_vcpu_runstate_info) != 0x2c);
> +#ifdef CONFIG_X86_64
> +	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
> +		     offsetof(struct compat_vcpu_runstate_info, state_entry_time) + 4);
> +	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, time) !=
> +		     offsetof(struct compat_vcpu_runstate_info, time) + 4);
> +#endif
> +
> +	if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode) {
> +		user_len = sizeof(struct vcpu_runstate_info);
> +		times_ofs = offsetof(struct vcpu_runstate_info,
> +				     state_entry_time);
> +	} else {
> +		user_len = sizeof(struct compat_vcpu_runstate_info);
> +		times_ofs = offsetof(struct compat_vcpu_runstate_info,
> +				     state_entry_time);
> +		rs_state++;

...

> +	*rs_state = vx->current_runstate;
> +#ifdef CONFIG_X86_64
> +	/* Don't leak kernel memory through the padding in the 64-bit struct */
> +	if (rs_state == &rs.state)
> +		rs_state[1] = 0;

Oof, that's difficult to follow.  Rather than pointer magic, what about zeroing
the first word unconditionally?  Likely faster than a CMP+CMOV or whatever gets
generated.  Or just zero the entire struct.

	/* Blah blah blah */
	*(unsigned long *)&rs.state = 0;

> +#endif
David Woodhouse Nov. 22, 2022, 11:04 p.m. UTC | #4
On Tue, 2022-11-22 at 18:39 +0000, Sean Christopherson wrote:
> On Sat, Nov 19, 2022, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > The guest runstate area can be arbitrarily byte-aligned. In fact, even
> > when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
> > fields in the structure end up being unaligned due to the fact that the
> > 32-bit ABI only aligns them to 32 bits.
> > 
> > So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
> > is buggy, because if it's unaligned then we can't update the whole field
> > atomically; the low bytes might be observable before the _UPDATE bit is.
> > Xen actually updates the *byte* containing that top bit, on its own. KVM
> > should do the same.
> 
> I think we're using the wrong APIs to update the runstate.  The VMCS/VMCB pages
> _need_ the host pfn, i.e. need to use a gpc (eventually).  The Xen PV stuff on the
> other hand most definitely doesn't need to know the pfn.
> 
> The event channel code would be difficult to convert due to lack of uaccess
> primitives, but I don't see anything in the runstate code that prevents KVM from
> using a gfn_to_hva_cache.  That will naturally handle page splits by sending them
> down a slow path and would yield far simpler code.
> 
> If taking the slow path is an issue, then the guest really should be fixed to not
> split pages.  And if that's not an acceptable answer, the gfn_to_hva_cache code
> could be updated to use the fast path if the region is contiguous in the host
> virtual address space.
> 

Yeah, that's tempting. Going back to gfn_to_hva_cache was the first
thing I tried. There are a handful of complexifying factors, none of
which are insurmountable if we try hard enough.

 • Even if we fix the gfn_to_hva_cache to still use the fast path for
   more than the first page, it's still possible for the runstate to
   cross from one memslot to an adjacent one. We probably still need
   two of them, which is a large part of the ugliness of this patch.

 • Accessing it via the userspace HVA requires coping with the case
   where the vCPU is being scheduled out, and it can't sleep. The
   previous code before commit 34d41d19e dealt with that by using 
   pagefault_disable() and depending on the GHC fast path. But even
   so...

 • We also could end up having to touch page B, page A then page B
   again, potentially having to abort and leave the runstate with
   the XEN_RUNSTATE_UPDATE bit still set. I do kind of prefer the
   version which checks that both pages are present before it starts
   writing anything to the guest.

As I said, they're not insurmountable but that's why I ended up with
the patch you see, after first attempting to use a gfn_to_hva_cache
again. Happy to entertain suggestions.

I note we have a kvm_read_guest_atomic() and briefly pondered adding a
'write' version of same... but that doesn't seem to cope with crossing
memslots either; it also assumes its caller only uses it to access
within a single page.

> > +	*rs_state = vx->current_runstate;
> > +#ifdef CONFIG_X86_64
> > +	/* Don't leak kernel memory through the padding in the 64-bit struct */
> > +	if (rs_state == &rs.state)
> > +		rs_state[1] = 0;
> 
> Oof, that's difficult to follow.  Rather than pointer magic, what about zeroing
> the first word unconditionally?  Likely faster than a CMP+CMOV or whatever gets
> generated.  Or just zero the entire struct.
> 
> 	/* Blah blah blah */
> 	*(unsigned long *)&rs.state = 0;

That gives us pointer aliasing problems, which I hate even if we *do*
compile with -fno-strict-aliasing. I'll use a memset which ought to end
up being precisely the same in practice.

#ifdef CONFIG_X86_64
		/*
		 * Don't leak kernel memory through the padding in the 64-bit
		 * struct if we take the split-page path.
		 */
		memset(&rs, 0, offsetof(struct vcpu_runstate_info, state_entry_time));
#endif
Michal Luczaj Nov. 22, 2022, 11:46 p.m. UTC | #5
On 11/19/22 10:46, David Woodhouse wrote:
> @@ -584,23 +705,57 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
> ...
> +		if (IS_ENABLED(CONFIG_64BIT) && vcpu->kvm->arch.xen.long_mode)
> +			sz = sizeof(struct vcpu_runstate_info);
> +		else
> +			sz = sizeof(struct compat_vcpu_runstate_info);
> +
> +		/* How much fits in the (first) page? */
> +		sz1 = PAGE_SIZE - (data->u.gpa & ~PAGE_MASK);
>  		r = kvm_gpc_activate(vcpu->kvm, &vcpu->arch.xen.runstate_cache,
> -				     NULL, KVM_HOST_USES_PFN, data->u.gpa,
> -				     sizeof(struct vcpu_runstate_info));
> -		break;
> +				     NULL, KVM_HOST_USES_PFN, data->u.gpa, sz1);
> +		if (r)
> +			goto deactivate_out;
>  
> +		/* Either map the second page, or deactivate the second GPC */
> +		if (sz1 > sz) {

Out of curiosity: is there a reason behind potentially
kvm_gpc_activate()ing a "len=0" cache? (sz1==sz leads to sz2=0)
I feel I may be missing something, but shouldn't the comparison be >=?

> +			kvm_gpc_deactivate(vcpu->kvm,
> +					   &vcpu->arch.xen.runstate2_cache);
> +		} else {
> +			sz2 = sz - sz1;
> +			BUG_ON((data->u.gpa + sz1) & ~PAGE_MASK);
> +			r = kvm_gpc_activate(vcpu->kvm, &vcpu->arch.xen.runstate2_cache,
> +					     NULL, KVM_HOST_USES_PFN,
> +					     data->u.gpa + sz1, sz2);
> +			if (r)
> +				goto deactivate_out;
> +		}
thanks,
Michal
David Woodhouse Nov. 23, 2022, 12:03 a.m. UTC | #6
On Wed, 2022-11-23 at 00:46 +0100, Michal Luczaj wrote:
> 
> > +             /* Either map the second page, or deactivate the second GPC */
> > +             if (sz1 > sz) {
> 
> Out of curiosity: is there a reason behind potentially
> kvm_gpc_activate()ing a "len=0" cache? (sz1==sz leads to sz2=0)
> I feel I may be missing something, but shouldn't the comparison be >=?

No good reason. Fixed in my tree; thanks.
Sean Christopherson Nov. 23, 2022, 5:17 p.m. UTC | #7
On Tue, Nov 22, 2022, David Woodhouse wrote:
> On Tue, 2022-11-22 at 18:39 +0000, Sean Christopherson wrote:
> > On Sat, Nov 19, 2022, David Woodhouse wrote:
> > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > 
> > > The guest runstate area can be arbitrarily byte-aligned. In fact, even
> > > when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
> > > fields in the structure end up being unaligned due to the fact that the
> > > 32-bit ABI only aligns them to 32 bits.
> > > 
> > > So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
> > > is buggy, because if it's unaligned then we can't update the whole field
> > > atomically; the low bytes might be observable before the _UPDATE bit is.
> > > Xen actually updates the *byte* containing that top bit, on its own. KVM
> > > should do the same.
> > 
> > I think we're using the wrong APIs to update the runstate.  The VMCS/VMCB pages
> > _need_ the host pfn, i.e. need to use a gpc (eventually).  The Xen PV stuff on the
> > other hand most definitely doesn't need to know the pfn.
> > 
> > The event channel code would be difficult to convert due to lack of uaccess
> > primitives, but I don't see anything in the runstate code that prevents KVM from
> > using a gfn_to_hva_cache.  That will naturally handle page splits by sending them
> > down a slow path and would yield far simpler code.
> > 
> > If taking the slow path is an issue, then the guest really should be fixed to not
> > split pages.  And if that's not an acceptable answer, the gfn_to_hva_cache code
> > could be updated to use the fast path if the region is contiguous in the host
> > virtual address space.
> > 
> 
> Yeah, that's tempting. Going back to gfn_to_hva_cache was the first
> thing I tried. There are a handful of complexifying factors, none of
> which are insurmountable if we try hard enough.
> 
>  • Even if we fix the gfn_to_hva_cache to still use the fast path for
>    more than the first page, it's still possible for the runstate to
>    cross from one memslot to an adjacent one. We probably still need
>    two of them, which is a large part of the ugliness of this patch.

Hrm.  What if KVM requires that the runstate be contiguous in host virtual address
space?  That wouldn't violate Xen's ABI since it's a KVM restriction, and it can't
break backwards compatibility since KVM doesn't even handle page splits, let alone
memslot splits.  AFAIK, most (all?) userspace VMMs make guest RAM virtually
contiguous anyways.

Probably a moot point since I don't see a way around the "page got swapped out"
issue.

>  • Accessing it via the userspace HVA requires coping with the case
>    where the vCPU is being scheduled out, and it can't sleep. The
>    previous code before commit 34d41d19e dealt with that by using 
>    pagefault_disable() and depending on the GHC fast path. But even
>    so...
> 
>  • We also could end up having to touch page B, page A then page B
>    again, potentially having to abort and leave the runstate with
>    the XEN_RUNSTATE_UPDATE bit still set. I do kind of prefer the
>    version which checks that both pages are present before it starts
>    writing anything to the guest.

Ugh, and because of the in-atomic case, there's nothing KVM can do to remedy page B
getting swapped out.  Actually, it doesn't even require a B=>A=>B pattern.  Even
without a page split, the backing page could get swapped out between setting and
clearing.

> As I said, they're not insurmountable but that's why I ended up with
> the patch you see, after first attempting to use a gfn_to_hva_cache
> again. Happy to entertain suggestions.

What if we use a gfn_to_pfn_cache for the page containing XEN_RUNSTATE_UPDATE,
and then use kvm_vcpu_write_guest_atomic() if there's a page split?  That would
avoid needing to acquire multiple gpc locks, and the update could use a single
code flow by always constructing an on-stack representation.  E.g. very roughly:

	*update_bit = (vx->runstate_entry_time | XEN_RUNSTATE_UPDATE) >> 56;
	smp_wmb();

	if (khva)
		memcpy(khva, rs_state, user_len);
	else
		kvm_vcpu_write_guest_in_atomic(vcpu, gpa, rs_state, user_len);
	smp_wmb();

	*update_bit = vx->runstate_entry_time >> 56;
	smp_wmb();

where "khva" is NULL if there's a page split.

> I note we have a kvm_read_guest_atomic() and briefly pondered adding a
> 'write' version of same... but that doesn't seem to cope with crossing
> memslots either; it also assumes its caller only uses it to access
> within a single page.

We should fix the lack of page split handling.  There are no bugs because KVM
only uses the helper for cases where the accesses are naturally aligned, but that's
bound to bite someone eventually.  That would be an opportunity to dedup the code
that handles "segments" (ugh), e.g. instead of

	gfn_t gfn = gpa >> PAGE_SHIFT;
	int seg;
	int offset = offset_in_page(gpa);
	int ret;

	while ((seg = next_segment(len, offset)) != 0) {
		ret = kvm_vcpu_read_guest_page(vcpu, gfn, data, offset, seg);
		if (ret < 0)
			return ret;
		offset = 0;
		len -= seg;
		data += seg;
		++gfn;
	}
	return 0;

provide a macro that allows

	kvm_for_each_chunk(...)
		ret = kvm_vcpu_read_guest_page(vcpu, gfn, data, offset, seg);
		if (ret)
			return ret;
	}
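
One hypothetical shape for the macro (not an existing KVM API; 'data', 'len',
'gfn', 'offset' and 'seg' are assumed to be the caller's locals):

	#define kvm_for_each_chunk(gpa, data, len, gfn, offset, seg)		\
		for (gfn = (gpa) >> PAGE_SHIFT, offset = offset_in_page(gpa),	\
		     seg = next_segment(len, offset);				\
		     seg != 0;							\
		     offset = 0, len -= seg, data += seg, ++gfn,		\
		     seg = next_segment(len, offset))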
	
I also think we should rename the "atomic" helpers to be "in_atomic" (or inatomic
if we want to follow the kernel's terrible nomenclature, which I always read as
"not atomic").  I always think read_guest_atomic() means an "atomic read".
David Woodhouse Nov. 23, 2022, 5:58 p.m. UTC | #8
On Wed, 2022-11-23 at 17:17 +0000, Sean Christopherson wrote:
> On Tue, Nov 22, 2022, David Woodhouse wrote:
> > Yeah, that's tempting. Going back to gfn_to_hva_cache was the first
> > thing I tried. There are a handful of complexifying factors, none of
> > which are insurmountable if we try hard enough.
> > 
> >  • Even if we fix the gfn_to_hva_cache to still use the fast path for
> >    more than the first page, it's still possible for the runstate to
> >    cross from one memslot to an adjacent one. We probably still need
> >    two of them, which is a large part of the ugliness of this patch.
> 
> Hrm.  What if KVM requires that the runstate be contiguous in host virtual address
> space?  That wouldn't violate Xen's ABI since it's a KVM restriction, and it can't
> break backwards compatibility since KVM doesn't even handle page splits, let alone
> memslot splits.  AFAIK, most (all?) userspace VMMs make guest RAM virtually
> contiguous anyways.
> 
> Probably a moot point since I don't see a way around the "page got swapped out"
> issue.

Right.

> >  • Accessing it via the userspace HVA requires coping with the case
> >    where the vCPU is being scheduled out, and it can't sleep. The
> >    previous code before commit 34d41d19e dealt with that by using 
> >    pagefault_disable() and depending on the GHC fast path. But even
> >    so...
> > 
> >  • We also could end up having to touch page B, page A then page B
> >    again, potentially having to abort and leave the runstate with
> >    the XEN_RUNSTATE_UPDATE bit still set. I do kind of prefer the
> >    version which checks that both pages are present before it starts
> >    writing anything to the guest.
> 
> Ugh, and because of the in-atomic case, there's nothing KVM can do to remedy page B
> getting swapped out.  Actually, it doesn't even require a B=>A=>B pattern.  Even
> without a page split, the backing page could get swapped out between setting and
> clearing.

Indeed. Which is why I quite like the "lock the whole lot down and then
do the update" in my existing patch.

> > As I said, they're not insurmountable but that's why I ended up with
> > the patch you see, after first attempting to use a gfn_to_hva_cache
> > again. Happy to entertain suggestions.
> 
> What if we use a gfn_to_pfn_cache for the page containing XEN_RUNSTATE_UPDATE,
> and then use kvm_vcpu_write_guest_atomic() if there's a page split?  That would
> avoid needing to acquire mutliple gpc locks, and the update could use a single
> code flow by always constructing an on-stack representation.  E.g. very roughly:

Even in the 'split' case, I don't think we want to be doing the full
memslot lookup for kvm_vcpu_write_guest_inatomic() every time we
schedule the guest vCPU in and out. We do want the cache, and a
kvm_vcpu_write_guest_cached_inatomic() function.

So instead of using two GPCs in the split case, that would mean we end
up using a gfn_to_pfn_cache for one page and a gfn_to_hva_cache for the
other?

And with or without that cache, we can *still* end up doing a partial
update if the page goes away. The byte with the XEN_RUNSTATE_UPDATE bit
might still be accessible, but bets are off about what state the rest
of the structure is in — and those runtimes are supposed to add up, or
the guest is going to get unhappy.

I'm actually OK with locking two GPCs. It wasn't my first choice, but
it's reasonable enough IMO given that none of the alternatives jump out
as being particularly attractive either.

> > I note we have a kvm_read_guest_atomic() and briefly pondered adding a
> > 'write' version of same... but that doesn't seem to cope with crossing
> > memslots either; it also assumes its caller only uses it to access
> > within a single page.
> 
> We should fix the lack of page split handling.

Agreed.

Even if it starts with just KVM_BUG_ON(offset + len >= PAGE_SIZE) to
avoid surprises.

I'd *love* to make the GPC cope with page splits, and just give us a
virtually contiguous mapping of up to N pages for some reasonable value
of N. I just can't see how to do that in the IOMEM case.
Sean Christopherson Nov. 23, 2022, 6:16 p.m. UTC | #9
On Wed, Nov 23, 2022, David Woodhouse wrote:
> On Wed, 2022-11-23 at 17:17 +0000, Sean Christopherson wrote:
> And with or without that cache, we can *still* end up doing a partial
> update if the page goes away. The byte with the XEN_RUNSTATE_UPDATE bit
> might still be accessible, but bets are off about what state the rest
> of the structure is in — and those runtimes are supposed to add up, or
> the guest is going to get unhappy.

Ugh.  What a terrible ABI.  

> I'm actually OK with locking two GPCs. It wasn't my first choice, but
> it's reasonable enough IMO given that none of the alternatives jump out
> as being particularly attractive either.

I detest the two GPCs, but since KVM apparently needs to provide "all or nothing"
updates, I don't see a better option.
Sean Christopherson Nov. 23, 2022, 6:21 p.m. UTC | #10
On Sat, Nov 19, 2022, David Woodhouse wrote:
> +		/*
> +		 * Use kvm_gpc_activate() here because if the runstate
> +		 * area was configured in 32-bit mode and only extends
> +		 * to the second page now because the guest changed to
> +		 * 64-bit mode, the second GPC won't have been set up.
> +		 */
> +		if (kvm_gpc_activate(v->kvm, gpc2, NULL, KVM_HOST_USES_PFN,
> +				     gpc1->gpa + user_len1, user_len2))

I believe kvm_gpc_activate() needs to be converted from write_lock_irq() to
write_lock_irqsave() for this to be safe.

Side topic, why do all of these flows disable IRQs?
David Woodhouse Nov. 23, 2022, 6:48 p.m. UTC | #11
On Wed, 2022-11-23 at 18:16 +0000, Sean Christopherson wrote:
> On Wed, Nov 23, 2022, David Woodhouse wrote:
> > On Wed, 2022-11-23 at 17:17 +0000, Sean Christopherson wrote:
> > And with or without that cache, we can *still* end up doing a partial
> > update if the page goes away. The byte with the XEN_RUNSTATE_UPDATE bit
> > might still be accessible, but bets are off about what state the rest
> > of the structure is in — and those runtimes are supposed to add up, or
> > the guest is going to get unhappy.
> 
> Ugh.  What a terrible ABI.  
> 
> > I'm actually OK with locking two GPCs. It wasn't my first choice, but
> > it's reasonable enough IMO given that none of the alternatives jump out
> > as being particularly attractive either.
> 
> I detest the two GPCs, but since KVM apparently needs to provide "all or nothing"
> updates, I don't see a better option.

Actually I think it might not be so awful to do a contiguous virtual
(kernel) mapping of multiple discontiguous IOMEM PFNs. A kind of
memremapv().

We can open-code something like ioremap_prot(), doing a single call to
get_vm_area_caller() for the whole virtual size, then individually
calling ioremap_page_range() for each page.
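
Roughly like this, perhaps (sketch only; the name is made up and proper
cleanup of partial mappings is elided):

	/* Map 'nr' discontiguous IOMEM pfns into one virtually contiguous range */
	static void *memremapv(unsigned long pfns[], unsigned int nr, pgprot_t prot)
	{
		struct vm_struct *area;
		unsigned long addr;
		unsigned int i;

		area = get_vm_area_caller(nr * PAGE_SIZE, VM_IOREMAP,
					  __builtin_return_address(0));
		if (!area)
			return NULL;

		addr = (unsigned long)area->addr;
		for (i = 0; i < nr; i++) {
			if (ioremap_page_range(addr + i * PAGE_SIZE,
					       addr + (i + 1) * PAGE_SIZE,
					       PFN_PHYS(pfns[i]), prot)) {
				free_vm_area(area);
				return NULL;
			}
		}
		return area->addr;
	}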

So we *could* fix the GPC to cope with ranges which cross more than a
single page. But if the runstate area is the only user for that, as
seems to be the case so far, then it might be a bit too much additional
complexity in an area which is fun enough already.

I'll propose that we go ahead with the two-GPCs model for now, and if
we ever need to do that *again*, then we look harder at making the GPC
support multiple pages?
David Woodhouse Nov. 23, 2022, 7:22 p.m. UTC | #12
On Wed, 2022-11-23 at 18:21 +0000, Sean Christopherson wrote:
> On Sat, Nov 19, 2022, David Woodhouse wrote:
...
		/* When invoked from kvm_sched_out() we cannot sleep */
		if (atomic)
			return;
...
> > +		/*
> > +		 * Use kvm_gpc_activate() here because if the runstate
> > +		 * area was configured in 32-bit mode and only extends
> > +		 * to the second page now because the guest changed to
> > +		 * 64-bit mode, the second GPC won't have been set up.
> > +		 */
> > +		if (kvm_gpc_activate(v->kvm, gpc2, NULL, KVM_HOST_USES_PFN,
> > +				     gpc1->gpa + user_len1, user_len2))
> 
> I believe kvm_gpc_activate() needs to be converted from write_lock_irq() to
> write_lock_irqsave() for this to be safe.

Hm, not sure I concur. You're only permitted to call kvm_gpc_activate()
in a context where you can sleep. Interrupts had better not be disabled
already, and it had better not need write_lock_irqsave().

In this particular context, we do drop all locks before calling
kvm_gpc_activate(), and we don't call it at all in the scheduling-out
case when we aren't permitted to sleep.

> Side topic, why do all of these flows disable IRQs?

The gpc->lock is permitted to be taken in IRQ context by the users of
the GPC. For example we do this for the shared_info and vcpu_info when
passing interrupts directly through to the guest via
kvm_arch_set_irq_inatomic() → kvm_xen_set_evtchn_fast().

The code path is fairly convoluted but I do believe we get there
directly from the VFIO IRQ handler, via the custom wakeup handler on
the irqfd's waitqueue (irqfd_wakeup).

Now, perhaps a GPC user which knows that *it* is not going to use its
GPC from IRQ context, could refrain from bothering to disable
interrupts when taking its own gpc->lock. And that's probably true for
the runstate area.

The generic GPC code still needs to disable interrupts when taking a
gpc->lock though, unless we want to add yet another flag to control
that behaviour.
Sean Christopherson Nov. 23, 2022, 7:32 p.m. UTC | #13
On Wed, Nov 23, 2022, David Woodhouse wrote:
> On Wed, 2022-11-23 at 18:21 +0000, Sean Christopherson wrote:
> > On Sat, Nov 19, 2022, David Woodhouse wrote:
> ...
> 		/* When invoked from kvm_sched_out() we cannot sleep */
> 		if (atomic)
> 			return;
> ...
> > > +		/*
> > > +		 * Use kvm_gpc_activate() here because if the runstate
> > > +		 * area was configured in 32-bit mode and only extends
> > > +		 * to the second page now because the guest changed to
> > > +		 * 64-bit mode, the second GPC won't have been set up.
> > > +		 */
> > > +		if (kvm_gpc_activate(v->kvm, gpc2, NULL, KVM_HOST_USES_PFN,
> > > +				     gpc1->gpa + user_len1, user_len2))
> > 
> > I believe kvm_gpc_activate() needs to be converted from write_lock_irq() to
> > write_lock_irqsave() for this to be safe.
> 
> Hm, not sure I concur. You're only permitted to call kvm_gpc_activate()
> in a context where you can sleep. Interrupts had better not be disabled
> already, and it had better not need write_lock_irqsave().
> 
> In this particular context, we do drop all locks before calling
> kvm_gpc_activate(), and we don't call it at all in the scheduling-out
> case when we aren't permitted to sleep.

Oh, duh, there's an "if (atomic)" check right above this.

> > Side topic, why do all of these flows disable IRQs?
> 
> The gpc->lock is permitted to be taken in IRQ context by the users of
> the GPC. For example we do this for the shared_info and vcpu_info when
> passing interrupts directly through to the guest via
> kvm_arch_set_irq_inatomic() → kvm_xen_set_evtchn_fast().
> 
> The code path is fairly convoluted but I do believe we get there
> directly from the VFIO IRQ handler, via the custom wakeup handler on
> the irqfd's waitqueue (irqfd_wakeup).

Yeah, I remember finding that flow.

> Now, perhaps a GPC user which knows that *it* is not going to use its
> GPC from IRQ context, could refrain from bothering to disable
> interrupts when taking its own gpc->lock. And that's probably true for
> the runstate area.

This is effectively what I was asking about.

> The generic GPC code still needs to disable interrupts when taking a
> gpc->lock though, unless we want to add yet another flag to control
> that behaviour.

Right.  Might be worth adding a comment at some point to call out that disabling
IRQs may not be strictly required for all users, but it's done for simplicity.
Ah, if/when we add kvm_gpc_lock(), that would be the perfect place to document
the behavior.

Thanks!
David Woodhouse Nov. 23, 2022, 8:07 p.m. UTC | #14
On Wed, 2022-11-23 at 19:32 +0000, Sean Christopherson wrote:
> On Wed, Nov 23, 2022, David Woodhouse wrote:
> > Now, perhaps a GPC user which knows that *it* is not going to use its
> > GPC from IRQ context, could refrain from bothering to disable
> > interrupts when taking its own gpc->lock. And that's probably true for
> > the runstate area.
> 
> This is effectively what I was asking about.

So yes, I suppose we could do that. Incrementally...

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index af5ac495feec..5b93c3ed137c 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -177,7 +177,6 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	size_t user_len, user_len1, user_len2;
 	struct vcpu_runstate_info rs;
 	int *rs_state = &rs.state;
-	unsigned long flags;
 	size_t times_ofs;
 	u8 *update_bit;
 
@@ -236,9 +235,9 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	 * Attempt to obtain the GPC lock on *both* (if there are two)
 	 * gfn_to_pfn caches that cover the region.
 	 */
-	read_lock_irqsave(&gpc1->lock, flags);
+	read_lock(&gpc1->lock);
 	while (!kvm_gpc_check(gpc1, user_len1)) {
-		read_unlock_irqrestore(&gpc1->lock, flags);
+		read_unlock(&gpc1->lock);
 
 		/* When invoked from kvm_sched_out() we cannot sleep */
 		if (atomic)
@@ -247,7 +246,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 		if (kvm_gpc_refresh(gpc1, user_len1))
 			return;
 
-		read_lock_irqsave(&gpc1->lock, flags);
+		read_lock(&gpc1->lock);
 	}
 
 	/*
@@ -337,7 +336,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 
 	if (!kvm_gpc_check(gpc2, user_len2)) {
 		read_unlock(&gpc2->lock);
-		read_unlock_irqrestore(&gpc1->lock, flags);
+		read_unlock(&gpc1->lock);
 
 		/* When invoked from kvm_sched_out() we cannot sleep */
 		if (atomic)
@@ -399,7 +398,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	if (user_len2)
 		read_unlock(&gpc2->lock);
  done_1:
-	read_unlock_irqrestore(&gpc1->lock, flags);
+	read_unlock(&gpc1->lock);
 
 	mark_page_dirty_in_slot(v->kvm, gpc1->memslot, gpc1->gpa >> PAGE_SHIFT);
 	if (user_len2)

I don't know how I feel about that. I resonate with what you say below
about "not strictly required for all users, but it's done for
simplicity". This code doesn't have enough simplicity. Handling the
runstate GPC as a special case just because we can doesn't necessarily
fill me with joy for the amount of optimisation that it brings.

> > The generic GPC code still needs to disable interrupts when taking a
> > gpc->lock though, unless we want to add yet another flag to control
> > that behaviour.
> 
> Right.  Might be worth adding a comment at some point to call out that disabling
> IRQs may not be strictly required for all users, but it's done for simplicity.
> Ah, if/when we add kvm_gpc_lock(), that would be the perfect place to document
> the behavior.

Yeah. Or perhaps the kvm_gpc_lock() should go with that 'not required
for all users, but done for simplicity' angle too, and always disable
IRQs?
Sean Christopherson Nov. 23, 2022, 8:26 p.m. UTC | #15
On Wed, Nov 23, 2022, David Woodhouse wrote:
> On Wed, 2022-11-23 at 19:32 +0000, Sean Christopherson wrote:
> > Right.  Might be worth adding a comment at some point to call out that disabling
> > IRQs may not be strictly required for all users, but it's done for simplicity.
> > Ah, if/when we add kvm_gpc_lock(), that would be the perfect place to document
> > the behavior.
> 
> Yeah. Or perhaps the kvm_gpc_lock() should go with that 'not required
> for all users, but done for simplicity' angle too, and always disable
> IRQs?

I was thinking the latter (always disable IRQs in kvm_gpc_lock()).  Sorry I didn't
make that clear.  I completely agree that fewer conditionals in this code is better;
I was mostly trying to figure out if there is some edge case I was missing.
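
For what it's worth, the simplest shape is probably something like this
(sketch; kvm_gpc_lock()/kvm_gpc_unlock() don't exist yet):

	static inline unsigned long kvm_gpc_lock(struct gfn_to_pfn_cache *gpc)
	{
		unsigned long flags;

		/* Always IRQ-safe for simplicity, even for users which never
		 * take the lock from IRQ context. */
		read_lock_irqsave(&gpc->lock, flags);
		return flags;
	}

	static inline void kvm_gpc_unlock(struct gfn_to_pfn_cache *gpc,
					  unsigned long flags)
	{
		read_unlock_irqrestore(&gpc->lock, flags);
	}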

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d1013c4f673c..70af7240a1d5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -686,6 +686,7 @@  struct kvm_vcpu_xen {
 	struct gfn_to_pfn_cache vcpu_info_cache;
 	struct gfn_to_pfn_cache vcpu_time_info_cache;
 	struct gfn_to_pfn_cache runstate_cache;
+	struct gfn_to_pfn_cache runstate2_cache;
 	u64 last_steal;
 	u64 runstate_entry_time;
 	u64 runstate_times[4];
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 4b8e9628fbf5..8aa953b1f0e0 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -170,148 +170,269 @@  static void kvm_xen_init_timer(struct kvm_vcpu *vcpu)
 	vcpu->arch.xen.timer.function = xen_timer_callback;
 }
 
-static void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
+static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 {
 	struct kvm_vcpu_xen *vx = &v->arch.xen;
-	u64 now = get_kvmclock_ns(v->kvm);
-	u64 delta_ns = now - vx->runstate_entry_time;
-	u64 run_delay = current->sched_info.run_delay;
+	struct gfn_to_pfn_cache *gpc1 = &vx->runstate_cache;
+	struct gfn_to_pfn_cache *gpc2 = &vx->runstate2_cache;
+	size_t user_len, user_len1, user_len2;
+	struct vcpu_runstate_info rs;
+	int *rs_state = &rs.state;
+	unsigned long flags;
+	size_t times_ofs;
+	u8 *update_bit;
 
-	if (unlikely(!vx->runstate_entry_time))
-		vx->current_runstate = RUNSTATE_offline;
+	/*
+	 * The only difference between 32-bit and 64-bit versions of the
+	 * runstate struct us the alignment of uint64_t in 32-bit, which
+	 * means that the 64-bit version has an additional 4 bytes of
+	 * padding after the first field 'state'.
+	 */
+	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) != 0);
+	BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state) != 0);
+	BUILD_BUG_ON(sizeof(struct compat_vcpu_runstate_info) != 0x2c);
+#ifdef CONFIG_X86_64
+	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
+		     offsetof(struct compat_vcpu_runstate_info, state_entry_time) + 4);
+	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, time) !=
+		     offsetof(struct compat_vcpu_runstate_info, time) + 4);
+#endif
+
+	if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode) {
+		user_len = sizeof(struct vcpu_runstate_info);
+		times_ofs = offsetof(struct vcpu_runstate_info,
+				     state_entry_time);
+	} else {
+		user_len = sizeof(struct compat_vcpu_runstate_info);
+		times_ofs = offsetof(struct compat_vcpu_runstate_info,
+				     state_entry_time);
+		rs_state++;
+	}
 
 	/*
-	 * Time waiting for the scheduler isn't "stolen" if the
-	 * vCPU wasn't running anyway.
+	 * There are basically no alignment constraints. The guest can set it
+	 * up so it crosses from one page to the next, and at arbitrary byte
+	 * alignment (and the 32-bit ABI doesn't align the 64-bit integers
+	 * anyway, even if the overall struct had been 64-bit aligned).
 	 */
-	if (vx->current_runstate == RUNSTATE_running) {
-		u64 steal_ns = run_delay - vx->last_steal;
+	if ((gpc1->gpa & ~PAGE_MASK) + user_len >= PAGE_SIZE) {
+		user_len1 = PAGE_SIZE - (gpc1->gpa & ~PAGE_MASK);
+		user_len2 = user_len - user_len1;
+	} else {
+		user_len1 = user_len;
+		user_len2 = 0;
+	}
+	BUG_ON(user_len1 + user_len2 != user_len);
 
-		delta_ns -= steal_ns;
+ retry:
+	/*
+	 * Attempt to obtain the GPC lock on *both* (if there are two)
+	 * gfn_to_pfn caches that cover the region.
+	 */
+	read_lock_irqsave(&gpc1->lock, flags);
+	while (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc1, gpc1->gpa, user_len1)) {
+		read_unlock_irqrestore(&gpc1->lock, flags);
 
-		vx->runstate_times[RUNSTATE_runnable] += steal_ns;
+		/* When invoked from kvm_sched_out() we cannot sleep */
+		if (atomic)
+			return;
+
+		if (kvm_gfn_to_pfn_cache_refresh(v->kvm, gpc1, gpc1->gpa, user_len1))
+			return;
+
+		read_lock_irqsave(&gpc1->lock, flags);
 	}
-	vx->last_steal = run_delay;
 
-	vx->runstate_times[vx->current_runstate] += delta_ns;
-	vx->current_runstate = state;
-	vx->runstate_entry_time = now;
-}
+	/*
+	 * The common case is that it all fits on a page and we can
+	 * just do it the simple way.
+	 */
+	if (likely(!user_len2)) {
+		/*
+		 * We use 'int *user_state' to point to the state field, and
+		 * 'u64 *user_times' for runstate_entry_time. So the actual
+		 * array of time[] in each state starts at user_times[1].
+		 */
+		int *user_state = gpc1->khva;
+		u64 *user_times = gpc1->khva + times_ofs;
 
-void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, int state)
-{
-	struct kvm_vcpu_xen *vx = &v->arch.xen;
-	struct gfn_to_pfn_cache *gpc = &vx->runstate_cache;
-	uint64_t *user_times;
-	unsigned long flags;
-	size_t user_len;
-	int *user_state;
+		/*
+		 * The XEN_RUNSTATE_UPDATE bit is the top bit of the state_entry_time
+		 * field. We need to set it (and write-barrier) before writing to the
+		 * the rest of the structure, and clear it last. Just as Xen does, we
+		 * address the single *byte* in which it resides because it might be
+		 * in a different cache line to the rest of the 64-bit word, due to
+		 * the (lack of) alignment constraints.
+		 */
+		BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state_entry_time) !=
+			     sizeof(uint64_t));
+		BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state_entry_time) !=
+			     sizeof(uint64_t));
+		BUILD_BUG_ON((XEN_RUNSTATE_UPDATE >> 56) != 0x80);
 
-	kvm_xen_update_runstate(v, state);
+		update_bit = ((u8 *)(&user_times[1])) - 1;
+		*update_bit = (vx->runstate_entry_time | XEN_RUNSTATE_UPDATE) >> 56;
+		smp_wmb();
 
-	if (!vx->runstate_cache.active)
-		return;
+		/*
+		 * Next, write the new runstate. This is in the *same* place
+		 * for 32-bit and 64-bit guests, asserted here for paranoia.
+		 */
+		BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) !=
+			     offsetof(struct compat_vcpu_runstate_info, state));
+		BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state) !=
+			     sizeof(vx->current_runstate));
+		BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state) !=
+			     sizeof(vx->current_runstate));
+		*user_state = vx->current_runstate;
 
-	if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode)
-		user_len = sizeof(struct vcpu_runstate_info);
-	else
-		user_len = sizeof(struct compat_vcpu_runstate_info);
+		/*
+		 * Then the actual runstate_entry_time (with the UPDATE bit
+		 * still set).
+		 */
+		*user_times = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
 
-	read_lock_irqsave(&gpc->lock, flags);
-	while (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc, gpc->gpa,
-					   user_len)) {
-		read_unlock_irqrestore(&gpc->lock, flags);
+		/*
+		 * Write the actual runstate times immediately after the
+		 * runstate_entry_time.
+		 */
+		BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
+			     offsetof(struct vcpu_runstate_info, time) - sizeof(u64));
+		BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state_entry_time) !=
+			     offsetof(struct compat_vcpu_runstate_info, time) - sizeof(u64));
+		BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
+			     sizeof_field(struct compat_vcpu_runstate_info, time));
+		BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
+			     sizeof(vx->runstate_times));
+		memcpy(user_times + 1, vx->runstate_times, sizeof(vx->runstate_times));
+
+		smp_wmb();
+
+		/*
+		 * Finally, clear the 'updating' bit. Don't use &= here because
+		 * the compiler may not realise that update_bit and user_times
+		 * point to the same place. That's a classic pointer-aliasing
+		 * problem.
+		 */
+		*update_bit = vx->runstate_entry_time >> 56;
+		smp_wmb();
+
+		goto done_1;
+	}
+
+	/*
+	 * The painful code path. It's split across two pages and we need to
+	 * hold and validate both GPCs simultaneously. Thankfully we can get
+	 * away with declaring a lock ordering GPC1 > GPC2 because nothing
+	 * else takes them more than one at a time.
+	 */
+	read_lock(&gpc2->lock);
+
+	if (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc2, gpc2->gpa, user_len2)) {
+		read_unlock(&gpc2->lock);
+		read_unlock_irqrestore(&gpc1->lock, flags);
 
 		/* When invoked from kvm_sched_out() we cannot sleep */
-		if (state == RUNSTATE_runnable)
+		if (atomic)
 			return;
 
-		if (kvm_gfn_to_pfn_cache_refresh(v->kvm, gpc, gpc->gpa, user_len))
+		/*
+		 * Use kvm_gpc_activate() here because if the runstate
+		 * area was configured in 32-bit mode and only extends
+		 * to the second page now because the guest changed to
+		 * 64-bit mode, the second GPC won't have been set up.
+		 */
+		if (kvm_gpc_activate(v->kvm, gpc2, NULL, KVM_HOST_USES_PFN,
+				     gpc1->gpa + user_len1, user_len2))
 			return;
 
-		read_lock_irqsave(&gpc->lock, flags);
+		/*
+		 * We dropped the lock on GPC1 so we have to go all the way
+		 * back and revalidate that too.
+		 */
+		goto retry;
 	}
 
 	/*
-	 * The only difference between 32-bit and 64-bit versions of the
-	 * runstate struct us the alignment of uint64_t in 32-bit, which
-	 * means that the 64-bit version has an additional 4 bytes of
-	 * padding after the first field 'state'.
-	 *
-	 * So we use 'int __user *user_state' to point to the state field,
-	 * and 'uint64_t __user *user_times' for runstate_entry_time. So
-	 * the actual array of time[] in each state starts at user_times[1].
+	 * Work out where the byte containing the XEN_RUNSTATE_UPDATE bit is.
 	 */
-	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) != 0);
-	BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state) != 0);
-	BUILD_BUG_ON(sizeof(struct compat_vcpu_runstate_info) != 0x2c);
-#ifdef CONFIG_X86_64
-	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
-		     offsetof(struct compat_vcpu_runstate_info, state_entry_time) + 4);
-	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, time) !=
-		     offsetof(struct compat_vcpu_runstate_info, time) + 4);
-#endif
-
-	user_state = gpc->khva;
-
-	if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode)
-		user_times = gpc->khva + offsetof(struct vcpu_runstate_info,
-						  state_entry_time);
+	if (user_len1 >= times_ofs + sizeof(uint64_t))
+		update_bit = ((u8 *)gpc1->khva) + times_ofs + sizeof(u64) - 1;
 	else
-		user_times = gpc->khva + offsetof(struct compat_vcpu_runstate_info,
-						  state_entry_time);
+		update_bit = ((u8 *)gpc2->khva) + times_ofs + sizeof(u64) - 1 -
+			user_len1;
 
-	/*
-	 * First write the updated state_entry_time at the appropriate
-	 * location determined by 'offset'.
+
+	/* Create a structure on our stack with everything in the right place.
+	 * The rs_state pointer points to the start of it, which in the case
+	 * of a compat guest on a 64-bit host is the 32 bit field that the
+	 * compiler thinks is padding.
 	 */
-	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state_entry_time) !=
-		     sizeof(user_times[0]));
-	BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state_entry_time) !=
-		     sizeof(user_times[0]));
+	*rs_state = vx->current_runstate;
+#ifdef CONFIG_X86_64
+	/* Don't leak kernel memory through the padding in the 64-bit struct */
+	if (rs_state == &rs.state)
+		rs_state[1] = 0;
+#endif
+	rs.state_entry_time = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
+	memcpy(rs.time, vx->runstate_times, sizeof(vx->runstate_times));
 
-	user_times[0] = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
+	*update_bit = rs.state_entry_time >> 56;
 	smp_wmb();
 
 	/*
-	 * Next, write the new runstate. This is in the *same* place
-	 * for 32-bit and 64-bit guests, asserted here for paranoia.
+	 * Having constructed the structure, copy it into the first and
+	 * second pages as appropriate using user_len1 and user_len2.
 	 */
-	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) !=
-		     offsetof(struct compat_vcpu_runstate_info, state));
-	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state) !=
-		     sizeof(vx->current_runstate));
-	BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state) !=
-		     sizeof(vx->current_runstate));
-
-	*user_state = vx->current_runstate;
+	memcpy(gpc1->khva, rs_state, user_len1);
+	memcpy(gpc2->khva, ((u8 *)rs_state) + user_len1, user_len2);
+	smp_wmb();
 
 	/*
-	 * Write the actual runstate times immediately after the
-	 * runstate_entry_time.
+	 * Finally, clear the XEN_RUNSTATE_UPDATE bit.
 	 */
-	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
-		     offsetof(struct vcpu_runstate_info, time) - sizeof(u64));
-	BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state_entry_time) !=
-		     offsetof(struct compat_vcpu_runstate_info, time) - sizeof(u64));
-	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
-		     sizeof_field(struct compat_vcpu_runstate_info, time));
-	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
-		     sizeof(vx->runstate_times));
-
-	memcpy(user_times + 1, vx->runstate_times, sizeof(vx->runstate_times));
+	*update_bit = vx->runstate_entry_time >> 56;
 	smp_wmb();
 
+	if (user_len2)
+		read_unlock(&gpc2->lock);
+ done_1:
+	read_unlock_irqrestore(&gpc1->lock, flags);
+
+	mark_page_dirty_in_slot(v->kvm, gpc1->memslot, gpc1->gpa >> PAGE_SHIFT);
+	if (user_len2)
+		mark_page_dirty_in_slot(v->kvm, gpc2->memslot, gpc2->gpa >> PAGE_SHIFT);
+}
+
+void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
+{
+	struct kvm_vcpu_xen *vx = &v->arch.xen;
+	u64 now = get_kvmclock_ns(v->kvm);
+	u64 delta_ns = now - vx->runstate_entry_time;
+	u64 run_delay = current->sched_info.run_delay;
+
+	if (unlikely(!vx->runstate_entry_time))
+		vx->current_runstate = RUNSTATE_offline;
+
 	/*
-	 * Finally, clear the XEN_RUNSTATE_UPDATE bit in the guest's
-	 * runstate_entry_time field.
+	 * Time waiting for the scheduler isn't "stolen" if the
+	 * vCPU wasn't running anyway.
 	 */
-	user_times[0] &= ~XEN_RUNSTATE_UPDATE;
-	smp_wmb();
+	if (vx->current_runstate == RUNSTATE_running) {
+		u64 steal_ns = run_delay - vx->last_steal;
 
-	read_unlock_irqrestore(&gpc->lock, flags);
+		delta_ns -= steal_ns;
 
-	mark_page_dirty_in_slot(v->kvm, gpc->memslot, gpc->gpa >> PAGE_SHIFT);
+		vx->runstate_times[RUNSTATE_runnable] += steal_ns;
+	}
+	vx->last_steal = run_delay;
+
+	vx->runstate_times[vx->current_runstate] += delta_ns;
+	vx->current_runstate = state;
+	vx->runstate_entry_time = now;
+
+	if (vx->runstate_cache.active)
+		kvm_xen_update_runstate_guest(v, state == RUNSTATE_runnable);
 }
 
 static void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *v)
@@ -584,23 +705,57 @@  int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 		break;
 
-	case KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR:
+	case KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR: {
+		size_t sz, sz1, sz2;
+
 		if (!sched_info_on()) {
 			r = -EOPNOTSUPP;
 			break;
 		}
 		if (data->u.gpa == GPA_INVALID) {
+			r = 0;
+		deactivate_out:
 			kvm_gpc_deactivate(vcpu->kvm,
 					   &vcpu->arch.xen.runstate_cache);
-			r = 0;
+			kvm_gpc_deactivate(vcpu->kvm,
+					   &vcpu->arch.xen.runstate2_cache);
 			break;
 		}
 
+		/*
+		 * If the guest switches to 64-bit mode after setting the runstate
+		 * address, that's actually OK. kvm_xen_update_runstate_guest()
+		 * will cope.
+		 */
+		if (IS_ENABLED(CONFIG_64BIT) && vcpu->kvm->arch.xen.long_mode)
+			sz = sizeof(struct vcpu_runstate_info);
+		else
+			sz = sizeof(struct compat_vcpu_runstate_info);
+
+		/* How much fits in the (first) page? */
+		sz1 = PAGE_SIZE - (data->u.gpa & ~PAGE_MASK);
 		r = kvm_gpc_activate(vcpu->kvm, &vcpu->arch.xen.runstate_cache,
-				     NULL, KVM_HOST_USES_PFN, data->u.gpa,
-				     sizeof(struct vcpu_runstate_info));
-		break;
+				     NULL, KVM_HOST_USES_PFN, data->u.gpa, sz1);
+		if (r)
+			goto deactivate_out;
 
+		/* Either map the second page, or deactivate the second GPC */
+		if (sz1 > sz) {
+			kvm_gpc_deactivate(vcpu->kvm,
+					   &vcpu->arch.xen.runstate2_cache);
+		} else {
+			sz2 = sz - sz1;
+			BUG_ON((data->u.gpa + sz1) & ~PAGE_MASK);
+			r = kvm_gpc_activate(vcpu->kvm, &vcpu->arch.xen.runstate2_cache,
+					     NULL, KVM_HOST_USES_PFN,
+					     data->u.gpa + sz1, sz2);
+			if (r)
+				goto deactivate_out;
+		}
+
+		kvm_xen_update_runstate_guest(vcpu, false);
+		break;
+	}
 	case KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT:
 		if (!sched_info_on()) {
 			r = -EOPNOTSUPP;
@@ -1834,6 +1989,7 @@  void kvm_xen_init_vcpu(struct kvm_vcpu *vcpu)
 	timer_setup(&vcpu->arch.xen.poll_timer, cancel_evtchn_poll, 0);
 
 	kvm_gpc_init(&vcpu->arch.xen.runstate_cache);
+	kvm_gpc_init(&vcpu->arch.xen.runstate2_cache);
 	kvm_gpc_init(&vcpu->arch.xen.vcpu_info_cache);
 	kvm_gpc_init(&vcpu->arch.xen.vcpu_time_info_cache);
 }
@@ -1844,6 +2000,7 @@  void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu)
 		kvm_xen_stop_timer(vcpu);
 
 	kvm_gpc_deactivate(vcpu->kvm, &vcpu->arch.xen.runstate_cache);
+	kvm_gpc_deactivate(vcpu->kvm, &vcpu->arch.xen.runstate2_cache);
 	kvm_gpc_deactivate(vcpu->kvm, &vcpu->arch.xen.vcpu_info_cache);
 	kvm_gpc_deactivate(vcpu->kvm, &vcpu->arch.xen.vcpu_time_info_cache);
 
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index 532a535a9e99..8503d2c6891e 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -143,11 +143,11 @@  int kvm_xen_hypercall(struct kvm_vcpu *vcpu);
 #include <asm/xen/interface.h>
 #include <xen/interface/vcpu.h>
 
-void kvm_xen_update_runstate_guest(struct kvm_vcpu *vcpu, int state);
+void kvm_xen_update_runstate(struct kvm_vcpu *vcpu, int state);
 
 static inline void kvm_xen_runstate_set_running(struct kvm_vcpu *vcpu)
 {
-	kvm_xen_update_runstate_guest(vcpu, RUNSTATE_running);
+	kvm_xen_update_runstate(vcpu, RUNSTATE_running);
 }
 
 static inline void kvm_xen_runstate_set_preempted(struct kvm_vcpu *vcpu)
@@ -162,7 +162,7 @@  static inline void kvm_xen_runstate_set_preempted(struct kvm_vcpu *vcpu)
 	if (WARN_ON_ONCE(!vcpu->preempted))
 		return;
 
-	kvm_xen_update_runstate_guest(vcpu, RUNSTATE_runnable);
+	kvm_xen_update_runstate(vcpu, RUNSTATE_runnable);
 }
 
 /* 32-bit compatibility definitions, also used natively in 32-bit build */
diff --git a/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c b/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c
index 2a5727188c8d..7f39815f1772 100644
--- a/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c
+++ b/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c
@@ -26,17 +26,17 @@ 
 #define SHINFO_REGION_GPA	0xc0000000ULL
 #define SHINFO_REGION_SLOT	10
 
-#define DUMMY_REGION_GPA	(SHINFO_REGION_GPA + (2 * PAGE_SIZE))
+#define DUMMY_REGION_GPA	(SHINFO_REGION_GPA + (3 * PAGE_SIZE))
 #define DUMMY_REGION_SLOT	11
 
 #define SHINFO_ADDR	(SHINFO_REGION_GPA)
-#define PVTIME_ADDR	(SHINFO_REGION_GPA + PAGE_SIZE)
-#define RUNSTATE_ADDR	(SHINFO_REGION_GPA + PAGE_SIZE + 0x20)
 #define VCPU_INFO_ADDR	(SHINFO_REGION_GPA + 0x40)
+#define PVTIME_ADDR	(SHINFO_REGION_GPA + PAGE_SIZE)
+#define RUNSTATE_ADDR	(SHINFO_REGION_GPA + PAGE_SIZE + PAGE_SIZE - 15)
 
 #define SHINFO_VADDR	(SHINFO_REGION_GVA)
-#define RUNSTATE_VADDR	(SHINFO_REGION_GVA + PAGE_SIZE + 0x20)
 #define VCPU_INFO_VADDR	(SHINFO_REGION_GVA + 0x40)
+#define RUNSTATE_VADDR	(SHINFO_REGION_GVA + PAGE_SIZE + PAGE_SIZE - 15)
 
 #define EVTCHN_VECTOR	0x10
 
@@ -449,8 +449,8 @@  int main(int argc, char *argv[])
 
 	/* Map a region for the shared_info page */
 	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
-				    SHINFO_REGION_GPA, SHINFO_REGION_SLOT, 2, 0);
-	virt_map(vm, SHINFO_REGION_GVA, SHINFO_REGION_GPA, 2);
+				    SHINFO_REGION_GPA, SHINFO_REGION_SLOT, 3, 0);
+	virt_map(vm, SHINFO_REGION_GVA, SHINFO_REGION_GPA, 3);
 
 	struct shared_info *shinfo = addr_gpa2hva(vm, SHINFO_VADDR);