
[RFC] KVM: x86: Allow userspace exit on HLT and MWAIT, else yield on MWAIT

Message ID 1b52b557beb6606007f7ec5672eab0adf1606a34.camel@infradead.org (mailing list archive)
State New, archived
Series [RFC] KVM: x86: Allow userspace exit on HLT and MWAIT, else yield on MWAIT

Commit Message

David Woodhouse Sept. 18, 2023, 9:06 a.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

The VMM may have work to do on behalf of the guest, and it's often
desirable to use the cycles when the vCPUs are idle.

When the vCPU uses HLT this works out OK because the VMM can run its
tasks in a separate thread which gets scheduled when the in-kernel
emulation of HLT schedules away. It isn't perfect, because it doesn't
easily allow for handling both low-priority maintenance tasks when the
VMM wants to wait until the vCPU is idle, and also for higher priority
tasks where the VMM does want to preempt the vCPU. It can also lead to
noisy neighbour effects, when a host isn't necessarily sized to
expect any given VMM to suddenly be contending for many *more* pCPUs
than it has vCPUs. 

In addition, there are times when we need to expose MWAIT to a guest
for compatibility with a previous environment. And MWAIT is much harder
because it's very difficult to emulate properly.

There were attempts at doing so based on marking the target page read-
only in MONITOR and triggering the wake when it takes a minor fault,
but so far they haven't led to a working solution:
https://www.contrib.andrew.cmu.edu/~somlo/OSXKVM/mwait.html

So when a guest executes MWAIT, either we've disabled exit-on-mwait and
the guest actually sits in non-root mode hogging the pCPU, or if we do
enable exit-on-mwait the kernel just treats it as a NOP and bounces
right back into the guest to busy-wait round its idle loop.

For a start, we can stick a yield() into that busy-loop. The yield()
has fairly poorly defined semantics, but it's better than *nothing* and
does allow a VMM's thread-based I/O and maintenance tasks to run a
*little* better.

Better still, we can bounce all the way out to *userspace* on an MWAIT
exit, and let the VMM perform some of its pending work right there and
then in the vCPU thread before re-entering the vCPU. That's much nicer
than yield(). The vCPU is still runnable, since we still don't have a
*real* emulation of MWAIT, so the vCPU thread can do a *little* bit of
work and then go back into the vCPU for another turn around the loop.

And if we're going to do that kind of task processing for MWAIT-idle
guests directly from the vCPU thread, it's neater to do it for HLT-idle
guests that way too.

For HLT, the vCPU *isn't* runnable; it'll be in KVM_MP_STATE_HALTED.
The VMM can poll the mp_state and know when the vCPU should be run
again. It can't poll() on the vCPU fd, although we might want to hook
up something like that (or just a signal or eventfd) for VSM anyway.
The VMM can also just do some work and then re-enter the vCPU without
the corresponding bit set in the kvm_run struct.

So, er, what does this patch do? Add a capability, define two bits for
exiting to userspace on HLT or MWAIT — in the kvm_run struct rather
than needing a separate ioctl to turn them on or off, so that the VMM
can make the decision each time it enters the vCPU. Hook it up to
(ab?)use the existing KVM_EXIT_HLT which was previously only used when
the local APIC was emulated in userspace, and add a new KVM_EXIT_MWAIT.
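
To illustrate the intended usage, here is a rough sketch of a VMM vCPU
loop driving the proposed ABI. The vmm_has_pending_work(),
vmm_do_one_work_item() and handle_other_exit() helpers are hypothetical
placeholders for the VMM's own machinery:

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static bool vcpu_halted(int vcpu_fd)
{
	struct kvm_mp_state mp;

	if (ioctl(vcpu_fd, KVM_GET_MP_STATE, &mp) < 0)
		return false;
	return mp.mp_state == KVM_MP_STATE_HALTED;
}

/* 'run' is the mmap'ed kvm_run structure for this vCPU. */
static void vcpu_loop(int vcpu_fd, struct kvm_run *run)
{
	for (;;) {
		/* Only request idle exits while there is queued work. */
		run->userspace_exits = vmm_has_pending_work() ?
			(KVM_X86_USERSPACE_EXIT_HLT |
			 KVM_X86_USERSPACE_EXIT_MWAIT) : 0;

		if (ioctl(vcpu_fd, KVM_RUN, 0) < 0)
			break;

		switch (run->exit_reason) {
		case KVM_EXIT_MWAIT:
			/* Still runnable: do one item, then take another
			 * turn around the guest's idle loop. */
			vmm_do_one_work_item();
			break;
		case KVM_EXIT_HLT:
			/* Halted: drain work until the vCPU wakes. */
			while (vmm_has_pending_work() && vcpu_halted(vcpu_fd))
				vmm_do_one_work_item();
			break;
		default:
			handle_other_exit(run);
		}
	}
}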

Fairly much untested.

If this approach seems reasonable, of course I'll add test cases and
proper documentation before posting it for real. This is the proof of
concept before we even put it through testing to see what performance
we get out of it especially for those obnoxious MWAIT-enabled guests.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

Comments

Alexander Graf Sept. 18, 2023, 9:41 a.m. UTC | #1
On 18.09.23 11:06, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> The VMM may have work to do on behalf of the guest, and it's often
> desirable to use the cycles when the vCPUs are idle.
>
> When the vCPU uses HLT this works out OK because the VMM can run its
> tasks in a separate thread which gets scheduled when the in-kernel
> emulation of HLT schedules away. It isn't perfect, because it doesn't
> easily allow for handling both low-priority maintenance tasks when the
> VMM wants to wait until the vCPU is idle, and also for higher priority
> tasks where the VMM does want to preempt the vCPU. It can also lead to
> noisy neighbour effects, when a host isn't necessarily sized to
> expect any given VMM to suddenly be contending for many *more* pCPUs
> than it has vCPUs.
>
> In addition, there are times when we need to expose MWAIT to a guest
> for compatibility with a previous environment. And MWAIT is much harder
> because it's very difficult to emulate properly.
>
> There were attempts at doing so based on marking the target page read-
> only in MONITOR and triggering the wake when it takes a minor fault,
> but so far they haven't led to a working solution:
> https://www.contrib.andrew.cmu.edu/~somlo/OSXKVM/mwait.html
>
> So when a guest executes MWAIT, either we've disabled exit-on-mwait and
> the guest actually sits in non-root mode hogging the pCPU, or if we do
> enable exit-on-mwait the kernel just treats it as a NOP and bounces
> right back into the guest to busy-wait round its idle loop.
>
> For a start, we can stick a yield() into that busy-loop. The yield()
> has fairly poorly defined semantics, but it's better than *nothing* and
> does allow a VMM's thread-based I/O and maintenance tasks to run a
> *little* better.
>
> Better still, we can bounce all the way out to *userspace* on an MWAIT
> exit, and let the VMM perform some of its pending work right there and
> then in the vCPU thread before re-entering the vCPU. That's much nicer
> than yield(). The vCPU is still runnable, since we still don't have a
> *real* emulation of MWAIT, so the vCPU thread can do a *little* bit of
> work and then go back into the vCPU for another turn around the loop.
>
> And if we're going to do that kind of task processing for MWAIT-idle
> guests directly from the vCPU thread, it's neater to do it for HLT-idle
> guests that way too.
>
> For HLT, the vCPU *isn't* runnable; it'll be in KVM_MP_STATE_HALTED.
> The VMM can poll the mp_state and know when the vCPU should be run
> again. But not poll(), although we might want to hook up something like
> that (or just a signal or eventfd) for other reasons for VSM anyway.
> The VMM can also just do some work and then re-enter the vCPU without
> the corresponding bit set in the kvm_run struct.
>
> So, er, what does this patch do? Add a capability, define two bits for
> exiting to userspace on HLT or MWAIT — in the kvm_run struct rather
> than needing a separate ioctl to turn them on or off, so that the VMM
> can make the decision each time it enters the vCPU. Hook it up to
> (ab?)use the existing KVM_EXIT_HLT which was previously only used when
> the local APIC was emulated in userspace, and add a new KVM_EXIT_MWAIT.
>
> Fairly much untested.
>
> If this approach seems reasonable, of course I'll add test cases and
> proper documentation before posting it for real. This is the proof of
> concept before we even put it through testing to see what performance
> we get out of it especially for those obnoxious MWAIT-enabled guests.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>


IIUC you want to do work in a user space vCPU thread when the guest vCPU 
is idle. As you pointed out above, KVM can not actually do much about 
MWAIT: It basically busy loops and hogs the CPU.

The typical flow I would expect for "work in a vCPU thread" is:

0) vCPU runs. HLT/MWAIT is directly exposed to guest.
1) vCPU exits. Creates deferred work. Enables HLT/MWAIT trapping.
2) vCPU runs again
3) vCPU calls HLT/MWAIT. We exit to user space to finish work from 1
4) vCPU runs again without HLT/MWAIT trapping

That means on top (or instead?) of the bits you have below that indicate 
"Should I exit to user space?", what you really need are bits that do 
what enable_cap(KVM_CAP_X86_DISABLE_EXITS) does, but light-weight: 
disable HLT/MWAIT trapping temporarily.

Also, please keep in mind that you still would need a fallback mechanism 
to run your "deferred work" even when the guest does not call HLT/MWAIT, 
like a regular timer in your main thread.
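
For example (a minimal sketch only; the 10ms period is an arbitrary
illustration, not a recommendation):

#include <time.h>
#include <sys/timerfd.h>

/* Fallback for deferred work when no vCPU goes idle: a periodic
 * timerfd polled from the main thread's event loop. */
static int arm_deferred_work_timer(void)
{
	struct itimerspec its = {
		.it_value    = { .tv_nsec = 10 * 1000 * 1000 },
		.it_interval = { .tv_nsec = 10 * 1000 * 1000 },
	};
	int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);

	if (tfd >= 0)
		timerfd_settime(tfd, 0, &its, NULL);
	return tfd;	/* add to epoll; drain the work queue on expiry */
}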

On top of all this, I'm not sure it's more efficient to do the trap to 
the vCPU thread compared to just creating a separate real thread. Your 
main problem is the emulatability of MWAIT because that leaves "no time" 
to do deferred work. But then again, if your deferred work is so complex 
that it needs more than a few ms (which you can always steal from the 
vCPU thread, especially with yield()), you'll need to start implementing 
time slicing of that work in user space next - and basically rebuild 
your own scheduler there. Ugh.

IMHO the real core value of this idea would be in a vcpu_run bit that on 
VCPU_RUN can toggle between HLT/MWAIT intercept on and off. The actual 
trap to user space, you're most likely better off with a separate thread.


Alex


>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a6582c1fd8b9..8f931539114a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2128,9 +2128,23 @@ static int kvm_emulate_monitor_mwait(struct kvm_vcpu *vcpu, const char *insn)
>   	pr_warn_once("%s instruction emulated as NOP!\n", insn);
>   	return kvm_emulate_as_nop(vcpu);
>   }
> +
>   int kvm_emulate_mwait(struct kvm_vcpu *vcpu)
>   {
> -	return kvm_emulate_monitor_mwait(vcpu, "MWAIT");
> +	int ret = kvm_emulate_monitor_mwait(vcpu, "MWAIT");
> +
> +	if (ret && kvm_userspace_exit(vcpu, KVM_EXIT_MWAIT)) {
> +		vcpu->run->exit_reason = KVM_EXIT_MWAIT;
> +		ret = 0;
> +	} else {
> +		/*
> +		 * Calling yield() has poorly defined semantics, but the
> +		 * guest is in a busy loop and it's the best we can do
> +		 * without a full emulation of MONITOR/MWAIT.
> +		 */
> +		yield();
> +	}
> +	return ret;
>   }
>   EXPORT_SYMBOL_GPL(kvm_emulate_mwait);
>   
> @@ -4554,6 +4568,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   				r |= KVM_X86_DISABLE_EXITS_MWAIT;
>   		}
>   		break;
> +	case KVM_CAP_X86_USERSPACE_EXITS:
> +		r = KVM_X86_USERSPACE_VALID_EXITS;
> +		break;
>   	case KVM_CAP_X86_SMM:
>   		if (!IS_ENABLED(CONFIG_KVM_SMM))
>   			break;
> @@ -9643,11 +9660,11 @@ static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason)
>   	++vcpu->stat.halt_exits;
>   	if (lapic_in_kernel(vcpu)) {
>   		vcpu->arch.mp_state = state;
> -		return 1;
> -	} else {
> -		vcpu->run->exit_reason = reason;
> -		return 0;
> +		if (!kvm_userspace_exit(vcpu, reason))
> +			return 1;
>   	}
> +	vcpu->run->exit_reason = reason;
> +	return 0;
>   }
>   
>   int kvm_emulate_halt_noskip(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 1e7be1f6ab29..ce10a809151c 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -430,6 +430,19 @@ static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm)
>   	return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED;
>   }
>   
> +static inline bool kvm_userspace_exit(struct kvm_vcpu *vcpu, int reason)
> +{
> +	if (reason == KVM_EXIT_HLT &&
> +	    (vcpu->run->userspace_exits & KVM_X86_USERSPACE_EXIT_HLT))
> +		return true;
> +
> +	if (reason == KVM_EXIT_MWAIT &&
> +	    (vcpu->run->userspace_exits & KVM_X86_USERSPACE_EXIT_MWAIT))
> +		return true;
> +
> +	return false;
> +}
> +
>   enum kvm_intr_type {
>   	/* Values are arbitrary, but must be non-zero. */
>   	KVM_HANDLING_IRQ = 1,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 13065dd96132..43d94d49fc24 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -264,6 +264,7 @@ struct kvm_xen_exit {
>   #define KVM_EXIT_RISCV_SBI        35
>   #define KVM_EXIT_RISCV_CSR        36
>   #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MWAIT            38
>   
>   /* For KVM_EXIT_INTERNAL_ERROR */
>   /* Emulate instruction failed. */
> @@ -283,7 +284,8 @@ struct kvm_run {
>   	/* in */
>   	__u8 request_interrupt_window;
>   	__u8 immediate_exit;
> -	__u8 padding1[6];
> +	__u8 userspace_exits;
> +	__u8 padding1[5];
>   
>   	/* out */
>   	__u32 exit_reason;
> @@ -841,6 +843,11 @@ struct kvm_ioeventfd {
>                                                 KVM_X86_DISABLE_EXITS_PAUSE | \
>                                                 KVM_X86_DISABLE_EXITS_CSTATE)
>   
> +#define KVM_X86_USERSPACE_EXIT_MWAIT	     (1 << 0)
> +#define KVM_X86_USERSPACE_EXIT_HLT	     (1 << 1)
> +#define KVM_X86_USERSPACE_VALID_EXITS        (KVM_X86_USERSPACE_EXIT_MWAIT | \
> +                                              KVM_X86_USERSPACE_EXIT_HLT)
> +
>   /* for KVM_ENABLE_CAP */
>   struct kvm_enable_cap {
>   	/* in */
> @@ -1192,6 +1199,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_COUNTER_OFFSET 227
>   #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
>   #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
> +#define KVM_CAP_X86_USERSPACE_EXITS 230
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
>



David Woodhouse Sept. 18, 2023, 11:10 a.m. UTC | #2
On Mon, 2023-09-18 at 11:41 +0200, Alexander Graf wrote:
> 
> IIUC you want to do work in a user space vCPU thread when the guest vCPU 
> is idle. As you pointed out above, KVM can not actually do much about
> MWAIT: It basically busy loops and hogs the CPU.

Well.. I suspect what I *really* want is a decent way to emulate MWAIT
properly and let it actually sleep. Or failing that, to declare that we
can actually change the guest-visible experience when those guests are
migrated to KVM, and take away MWAIT completely.

> The typical flow I would expect for "work in a vCPU thread" is:
> 
> 0) vCPU runs. HLT/MWAIT is directly exposed to guest.
> 1) vCPU exits. Creates deferred work. Enables HLT/MWAIT trapping.

That can happen, but it may also be a separate I/O thread which
receives an eventfd notification and finds that there is now work to be
done. If that work can be fairly much instantaneous, it can be done
immediately. Else it gets deferred to what we Linux hackers might think
of as a workqueue.

If all the vCPUs are in HLT when the work queue becomes non-empty, we'd
need to prod them *all* to change their exit-on-{HLT,MWAIT} status when
work becomes available, just in case one of them becomes idle and can
process the work "for free" using idle cycles.
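
Something like this hypothetical helper, say (struct vcpu and
SIG_VCPU_KICK standing in for whatever per-vCPU bookkeeping and kick
signal the VMM already has):

#include <pthread.h>
#include <signal.h>
#include <linux/kvm.h>

struct vcpu {
	pthread_t thread;
	struct kvm_run *run;	/* mmap'ed kvm_run for this vCPU */
};

/* When the work queue goes non-empty, request idle exits on every
 * vCPU and kick any that are already blocked inside KVM_RUN.
 * SIG_VCPU_KICK is an assumed signal with a no-op handler. */
static void prod_idle_vcpus(struct vcpu *vcpus, int nr_vcpus)
{
	int i;

	for (i = 0; i < nr_vcpus; i++) {
		vcpus[i].run->userspace_exits = KVM_X86_USERSPACE_EXIT_HLT |
						KVM_X86_USERSPACE_EXIT_MWAIT;
		pthread_kill(vcpus[i].thread, SIG_VCPU_KICK);
	}
}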

> 2) vCPU runs again
> 3) vCPU calls HLT/MWAIT. We exit to user space to finish work from 1
> 4) vCPU runs again without HLT/MWAIT trapping
>
> That means on top (or instead?) of the bits you have below that indicate 
> "Should I exit to user space?", what you really need are bits that do
> what enable_cap(KVM_CAP_X86_DISABLE_EXITS) does, but light-weight:
> disable HLT/MWAIT trapping temporarily.

If I do it that way, yes. A lightweight way to enable/disable the exits
even to the kernel would be nice to have. But it's a trade-off. For HLT
you'd get lower latency re-entering the vCPU at a cost of much higher
latency processing work if the vCPU was *already* in HLT.

We probably would want to stop burning power in the MWAIT loop though,
and let the pCPU sit in the guest in MWAIT if there really is nothing
else to do.

We're experimenting with various permutations.

> Also, please keep in mind that you still would need a fallback mechanism 
> to run your "deferred work" even when the guest does not call HLT/MWAIT, 
> like a regular timer in your main thread.

Yeah. In that case I think the ideal answer is that we let the kernel
scheduler sort it out. I was thinking of a model where we have I/O (or
workqueue) threads in *addition* to the userspace exits on idle. The
separate threads own the work (and a number of them are woken according
to the queue depth), and idle vCPUs *opportunistically* process work
items on top of that.

That approach alone would work fine with the existing HLT scheduling;
it's just MWAIT which is a pain because yield() doesn't really do much
(but as noted, it's better than *nothing*).

> On top of all this, I'm not sure it's more efficient to do the trap to 
> the vCPU thread compared to just creating a separate real thread. Your 
> main problem is the emulatability of MWAIT because that leaves "no time" 
> to do deferred work. But then again, if your deferred work is so complex 
> that it needs more than a few ms (which you can always steal from the
> vCPU thread, especially with yield()), you'll need to start implementing 
> time slicing of that work in user space next - and basically rebuild 
> your own scheduler there. Ugh.
>
> IMHO the real core value of this idea would be in a vcpu_run bit that on 
> VCPU_RUN can toggle between HLT/MWAIT intercept on and off. The actual 
> trap to user space, you're most likely better off with a separate thread.

No, that's very much not the point. The problem is that yield() doesn't
work well enough — and isn't designed or guaranteed to do anything in
particular for most cases. It's better than *nothing* but we want the
opportunity to do the actual work right there in the *loop* of the
guest bouncing through MWAIT.
Alexander Graf Sept. 18, 2023, 11:59 a.m. UTC | #3
On 18.09.23 13:10, David Woodhouse wrote:
> On Mon, 2023-09-18 at 11:41 +0200, Alexander Graf wrote:
>> IIUC you want to do work in a user space vCPU thread when the guest vCPU
>> is idle. As you pointed out above, KVM can not actually do much about
>> MWAIT: It basically busy loops and hogs the CPU.
> Well.. I suspect what I *really* want is a decent way to emulate MWAIT
> properly and let it actually sleep. Or failing that, to declare that we
> can actually change the guest-visible experience when those guests are
> migrated to KVM, and take away MWAIT completely.
>
>> The typical flow I would expect for "work in a vCPU thread" is:
>>
>> 0) vCPU runs. HLT/MWAIT is directly exposed to guest.
>> 1) vCPU exits. Creates deferred work. Enables HLT/MWAIT trapping.
> That can happen, but it may also be a separate I/O thread which
> receives an eventfd notification and finds that there is now work to be
> done. If that work can be fairly much instantaneous, it can be done
> immediately. Else it gets deferred to what we Linux hackers might think
> of as a workqueue.
>
> If all the vCPUs are in HLT when the work queue becomes non-empty, we'd
> need to prod them *all* to change their exit-on-{HLT,MWAIT} status when
> work becomes available, just in case one of them becomes idle and can
> process the work "for free" using idle cycles.
>
>> 2) vCPU runs again
>> 3) vCPU calls HLT/MWAIT. We exit to user space to finish work from 1
>> 4) vCPU runs again without HLT/MWAIT trapping
>>
>> That means on top (or instead?) of the bits you have below that indicate
>> "Should I exit to user space?", what you really need are bits that do
>> what enable_cap(KVM_CAP_X86_DISABLE_EXITS) does, but light-weight:
>> disable HLT/MWAIT trapping temporarily.
> If I do it that way, yes. A lightweight way to enable/disable the exits
> even to the kernel would be nice to have. But it's a trade-off. For HLT
> you'd get lower latency re-entering the vCPU at a cost of much higher
> latency processing work if the vCPU was *already* in HLT.
>
> We probably would want to stop burning power in the MWAIT loop though,
> and let the pCPU sit in the guest in MWAIT if there really is nothing
> else to do.
>
> We're experimenting with various permutations.
>
>> Also, please keep in mind that you still would need a fallback mechanism
>> to run your "deferred work" even when the guest does not call HLT/MWAIT,
>> like a regular timer in your main thread.
> Yeah. In that case I think the ideal answer is that we let the kernel
> scheduler sort it out. I was thinking of a model where we have I/O (or
> workqueue) threads in *addition* to the userspace exits on idle. The
> separate threads own the work (and a number of them are woken according
> to the queue depth), and idle vCPUs *opportunistically* process work
> items on top of that.
>
> That approach alone would work fine with the existing HLT scheduling;
> it's just MWAIT which is a pain because yield() doesn't really do much
> (but as noted, it's better than *nothing*).
>
>> On top of all this, I'm not sure it's more efficient to do the trap to
>> the vCPU thread compared to just creating a separate real thread. Your
>> main problem is the emulatability of MWAIT because that leaves "no time"
>> to do deferred work. But then again, if your deferred work is so complex
>> that it needs more than a few ms (which you can always steal from the
>> vCPU thread, especially with yield()), you'll need to start implementing
>> time slicing of that work in user space next - and basically rebuild
>> your own scheduler there. Ugh.
>>
>> IMHO the real core value of this idea would be in a vcpu_run bit that on
>> VCPU_RUN can toggle between HLT/MWAIT intercept on and off. The actual
>> trap to user space, you're most likely better off with a separate thread.
> No, that's very much not the point. The problem is that yield() doesn't
> work well enough — and isn't designed or guaranteed to do anything in
> particular for most cases. It's better than *nothing* but we want the
> opportunity to do the actual work right there in the *loop* of the
> guest bouncing through MWAIT.


The problem with MWAIT is that you don't really know when it's done.

You could find out by making MONITOR'ed pages(!) read-only so you can 
wake up any target vCPU that's in MWAIT, but that's considerably 
expensive if you want to do it well.

You could also burn one VM/system-wide CPU that does nothing but wait 
for changes in any MONITOR'ed cache line. Doable with less power 
consumption if you use TSX I guess. But probably not what you want either.

Another alternative would be to make guests PV aware, so they understand 
you don't actually do MWAIT and give you a hypercall every time they 
modify whatever anyone would want to monitor (such as 
thread_info->flags). But that requires new guest kernels. I don't think 
you want to wait for that :).

So in a nutshell, emulating MWAIT properly is just super difficult. If 
you have even the remotest chance to get away with doing HLT instead, 
I'd take that. In that model, an I/O thread that schedules over idle 
threads becomes natural.


Alex





Peter Zijlstra Sept. 19, 2023, 9:04 a.m. UTC | #4
On Mon, Sep 18, 2023 at 01:59:50PM +0200, Alexander Graf wrote:
> The problem with MWAIT is that you don't really know when it's done.

This isn't really a problem. MWAIT is allowed (expected even) to return
early.

REP;NOP is a valid implementation of MWAIT.

MWAIT must not delay waking (much) after either:

 - write to monitored address
 - interrupt pending

But it doesn't say anything about not waking up sooner.

Now, obviously on real hardware you'd prefer MWAIT to also do the
whole C-state thing and save you some actual power, but this is virt:
real hardware is not a concern, and wakeup-timeliness not much of one
either.


IIRC the ARM64 WFE thing has a 10kHz timer or something it wakes from if
nothing else. So I suppose what I'm saying is that: nanosleep(100000)
might be a suitable MWAIT implementation.
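
Spelled out, that suggestion is roughly the following (a sketch only;
the ~100us nap matches the 10kHz figure above and is purely
illustrative):

#include <time.h>

/* Called by the VMM on each KVM_EXIT_MWAIT instead of re-entering
 * immediately and busy-waiting. */
static void mwait_exit_nap(void)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 100 * 1000 };

	nanosleep(&ts, NULL);
}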

It's virt, it sucks anyway :-)
Paolo Bonzini Sept. 22, 2023, noon UTC | #5
On Mon, Sep 18, 2023 at 11:30 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> The VMM may have work to do on behalf of the guest, and it's often
> desirable to use the cycles when the vCPUs are idle.
>
> When the vCPU uses HLT this works out OK because the VMM can run its
> tasks in a separate thread which gets scheduled when the in-kernel
> emulation of HLT schedules away. It isn't perfect, because it doesn't
> easily allow for handling both low-priority maintenance tasks when the
> VMM wants to wait until the vCPU is idle, and also for higher priority
> tasks where the VMM does want to preempt the vCPU. It can also lead to
> noisy neighbour effects, when a host isn't necessarily sized to
> expect any given VMM to suddenly be contending for many *more* pCPUs
> than it has vCPUs.
>
> In addition, there are times when we need to expose MWAIT to a guest
> for compatibility with a previous environment. And MWAIT is much harder
> because it's very difficult to emulate properly.

I don't dislike giving userspace more flexibility in deciding when to
exit on HLT and MWAIT (or even PAUSE), and kvm_run is a good place to
do this. It's an extension of request_interrupt_window and
immediate_exit. I'm not sure how it would interact with
KVM_CAP_X86_DISABLE_EXITS.  Perhaps
KVM_ENABLE_CAP(KVM_X86_DISABLE_EXITS) could be changed to do nothing
except writing to a new kvm_run field? All the kvm->arch.*_in_guest
fields would change into a kvm->arch.saved_request_userspace_exit, and
every vmentry would do something like

  if (kvm->arch.saved_request_userspace_exit !=
      kvm_run->request_userspace_exit) {
          /* tweak intercepts */
  }

To avoid races you need two flags though; there needs to be also a
kernel->userspace communication of whether the vCPU is currently in
HLT or MWAIT, using the "flags" field for example. If it was HLT only,
moving the mp_state in kvm_run would seem like a good idea; but not if
MWAIT or PAUSE are also included.

To set a kvm_run flag during MWAIT, you could reenter MWAIT with the
MWAIT-exiting bit cleared and the monitor trap flag bit (or just
EFLAGS.TF) set. On the subsequent singlestep exit, clear the flag in
kvm_run and set again the MWAIT-exiting bit. The MWAIT handler would
also check kvm_run->request_userspace_exit before reentering.

Paolo
David Woodhouse Sept. 23, 2023, 7:22 a.m. UTC | #6
On Fri, 2023-09-22 at 14:00 +0200, Paolo Bonzini wrote:
> On Mon, Sep 18, 2023 at 11:30 AM David Woodhouse <dwmw2@infradead.org> wrote:
> > 
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > The VMM may have work to do on behalf of the guest, and it's often
> > desirable to use the cycles when the vCPUs are idle.
> > 
> > When the vCPU uses HLT this works out OK because the VMM can run its
> > tasks in a separate thread which gets scheduled when the in-kernel
> > emulation of HLT schedules away. It isn't perfect, because it doesn't
> > easily allow for handling both low-priority maintenance tasks when the
> > VMM wants to wait until the vCPU is idle, and also for higher priority
> > tasks where the VMM does want to preempt the vCPU. It can also lead to
> > noisy neighbour effects, when a host isn't necessarily sized to
> > expect any given VMM to suddenly be contending for many *more* pCPUs
> > than it has vCPUs.
> > 
> > In addition, there are times when we need to expose MWAIT to a guest
> > for compatibility with a previous environment. And MWAIT is much harder
> > because it's very difficult to emulate properly.
> 
> I don't dislike giving userspace more flexibility in deciding when to
> exit on HLT and MWAIT (or even PAUSE), and kvm_run is a good place to
> do this. It's an extension of request_interrupt_window and
> immediate_exit. I'm not sure how it would interact with
> KVM_CAP_X86_DISABLE_EXITS.

Yeah, right now it doesn't interact at all. The use case is that you
*always* allow vmexits to KVM for the offending instructions, and then
it's just a question of what KVM does when that happens.

> Perhaps KVM_ENABLE_CAP(KVM_X86_DISABLE_EXITS) could be changed to do
> nothing except writing to a new kvm_run field? All the
> kvm->arch.*_in_guest fields would change into a
> kvm->arch.saved_request_userspace_exit, and every vmentry would do
> something like
> 
>   if (kvm->arch.saved_request_userspace_exit != kvm_run->request_userspace_exit) {
>      /* tweak intercepts */
>   }
> 
> To avoid races you need two flags though; there needs to be also a
> kernel->userspace communication of whether the vCPU is currently in
> HLT or MWAIT, using the "flags" field for example. If it was HLT only,
> moving the mp_state in kvm_run would seem like a good idea; but not if
> MWAIT or PAUSE are also included.

Right. When work is added to an empty workqueue, the VMM will want to
hunt for a vCPU which is currently idle and then signal it to exit.

As you say, for HLT it's simple enough to look at the mp_state, and we
can move that into kvm_run so it doesn't need an ioctl... although it
would also be nice to get an *event* on an eventfd when the vCPU
becomes runnable (as noted, we want that for VSM anyway). Or perhaps
even to be able to poll() on the vCPU fd.

But MWAIT (as currently not-really-emulated) and PAUSE are both just
transient states with nothing you can really *wait* for, which is why
they're such fun to deal with.

> To set a kvm_run flag during MWAIT, you could reenter MWAIT with the
> MWAIT-exiting bit cleared and the monitor trap flag bit (or just
> EFLAGS.TF) set. On the subsequent singlestep exit, clear the flag in
> kvm_run and set again the MWAIT-exiting bit. The MWAIT handler would
> also check kvm_run->request_userspace_exit before reentering.

Yeah, we've pondered that one. Perhaps coupled with setting the scheduling
priority as low as possible while it's actually on the MWAIT, and
putting it back again afterwards. Something along the lines of 'do not
schedule me unless you literally have *nothing* else to do on this
pCPU, for the next N µs'.
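
A minimal sketch of that idea with the existing scheduler API (note
SCHED_IDLE only approximates it; there's no time bound like the 'next
N µs' above):

#include <sched.h>
#include <stdbool.h>

/* Drop the calling vCPU thread to SCHED_IDLE while it spins on MWAIT;
 * restore SCHED_OTHER once there is real work for it again. */
static void vcpu_set_mwait_priority(bool idle)
{
	struct sched_param sp = { .sched_priority = 0 };

	sched_setscheduler(0, idle ? SCHED_IDLE : SCHED_OTHER, &sp);
}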

Not pretty, but *nothing* you do with MWAIT is going to be pretty.
Unless we can tolerate 4KiB granularity and actually get the read-only
and minor fault trick working.

Anyway, I knocked this up just for Fred to play with and see what
actually performs reasonably and what doesn't, because I never want to
post even random proof-of-concept kernel patches in private. So we'll
play with it and see what we get out of it.
Paolo Bonzini Sept. 23, 2023, 9:24 a.m. UTC | #7
On 9/23/23 09:22, David Woodhouse wrote:
> On Fri, 2023-09-22 at 14:00 +0200, Paolo Bonzini wrote:
>> To avoid races you need two flags though; there needs to be also a
>> kernel->userspace communication of whether the vCPU is currently in
>> HLT or MWAIT, using the "flags" field for example. If it was HLT only,
>> moving the mp_state in kvm_run would seem like a good idea; but not if
>> MWAIT or PAUSE are also included.
> 
> Right. When work is added to an empty workqueue, the VMM will want to
> hunt for a vCPU which is currently idle and then signal it to exit.
> 
> As you say, for HLT it's simple enough to look at the mp_state, and we
> can move that into kvm_run so it doesn't need an ioctl...

Looking at it again: not so easy because the mpstate is changed in the 
vCPU thread by vcpu_block() itself.

> although it
> would also be nice to get an *event* on an eventfd when the vCPU
> becomes runnable (as noted, we want that for VSM anyway). Or perhaps
> even to be able to poll() on the vCPU fd.

Why do you need it?  You can just use KVM_RUN to go to sleep, and if you 
get another job you kick out the vCPU with pthread_kill.  (I also didn't 
get the VSM reference).

An interesting quirk is that kvm_run->immediate_exit is processed before 
kvm_vcpu_block(), but TIF_SIGPENDING is processed afterwards.  This 
means that you can force an mpstate update with pthread_kill + KVM_RUN. 
It's not going to be a speed demon, but it's worth writing a selftest 
for it.
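
Roughly like this (a sketch, not the actual selftest; SIG_VCPU_KICK is
an assumed VMM-defined signal, blocked in the thread and unblocked
during KVM_RUN via KVM_SET_SIGNAL_MASK so that pthread_kill leaves it
pending):

#include <errno.h>
#include <signal.h>
#include <pthread.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int vcpu_refresh_mp_state(int vcpu_fd, struct kvm_mp_state *mp)
{
	/* Leave the kick signal pending for this thread. */
	pthread_kill(pthread_self(), SIG_VCPU_KICK);

	/* vcpu_block() re-evaluates pending events and updates mp_state
	 * before the pending signal bounces KVM_RUN back out, so this
	 * returns quickly with -EINTR rather than sleeping. */
	if (ioctl(vcpu_fd, KVM_RUN, 0) < 0 && errno != EINTR)
		return -1;

	return ioctl(vcpu_fd, KVM_GET_MP_STATE, mp);
}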

> But MWAIT (as currently not-really-emulated) and PAUSE are both just
> transient states with nothing you can really *wait* for, which is why
> they're such fun to deal with.

PAUSE is easier because it is just momentary and you stick it inside 
what's already a busy wait.  MWAIT is less fun because you don't really 
want to busy wait.

Paolo
Alexander Graf Sept. 23, 2023, 4:43 p.m. UTC | #8
On 23.09.23 11:24, Paolo Bonzini wrote:
>
> On 9/23/23 09:22, David Woodhouse wrote:
>> On Fri, 2023-09-22 at 14:00 +0200, Paolo Bonzini wrote:
>>> To avoid races you need two flags though; there needs to be also a
>>> kernel->userspace communication of whether the vCPU is currently in
>>> HLT or MWAIT, using the "flags" field for example. If it was HLT only,
>>> moving the mp_state in kvm_run would seem like a good idea; but not if
>>> MWAIT or PAUSE are also included.
>>
>> Right. When work is added to an empty workqueue, the VMM will want to
>> hunt for a vCPU which is currently idle and then signal it to exit.
>>
>> As you say, for HLT it's simple enough to look at the mp_state, and we
>> can move that into kvm_run so it doesn't need an ioctl...
>
> Looking at it again: not so easy because the mpstate is changed in the
> vCPU thread by vcpu_block() itself.
>
>> although it
>> would also be nice to get an *event* on an eventfd when the vCPU
>> becomes runnable (as noted, we want that for VSM anyway). Or perhaps
>> even to be able to poll() on the vCPU fd.
>
> Why do you need it?  You can just use KVM_RUN to go to sleep, and if you
> get another job you kick out the vCPU with pthread_kill.  (I also didn't
> get the VSM reference).


With the original VSM patches, we used to make a vCPU aware of the fact 
that it can morph into one of many VTLs. That approach turned out to be 
insanely intrusive and fragile and so we're currently reimplementing 
everything with VTLs as vCPUs. That allows us to move the majority of VSM 
functionality to user space. Everything we've seen so far looks as if 
there is no real performance loss with that approach.

One small problem with that is that now user space is responsible for 
switching between VTLs: It determines which VTL is currently running and 
leaves all others (read: all other vCPUs) as stopped. That means if you 
are running happily in KVM_RUN in VTL0 and VTL1 gets an interrupt, user 
space needs to stop VTL0 and unpause VTL1 until it triggers VTL_RETURN 
at which point VTL1 stops execution and VTL0 runs again.

Nicolas built a patch that exposes "interrupt on vCPU is pending" as an 
ioeventfd user space can request. That way, user space can know whenever 
a currently paused vCPU has a pending interrupt and can act accordingly. 
You could use the same mechanism if you wanted to implement HLT in user 
space, but still use an in-kernel LAPIC.


Alex




Paolo Bonzini Sept. 26, 2023, 5:20 p.m. UTC | #9
On Sat, Sep 23, 2023 at 6:44 PM Alexander Graf <graf@amazon.de> wrote:
> On 23.09.23 11:24, Paolo Bonzini wrote:
> > Why do you need it?  You can just use KVM_RUN to go to sleep, and if you
> > get another job you kick out the vCPU with pthread_kill.  (I also didn't
> > get the VSM reference).
>
> With the original VSM patches, we used to make a vCPU aware of the fact
> that it can morph into one of many VTLs. That approach turned out to be
> insanely intrusive and fragile and so we're currently reimplementing
> everything with VTLs as vCPUs. That allows us to move the majority of VSM
> functionality to user space. Everything we've seen so far looks as if
> there is no real performance loss with that approach.

Yes, that was also what I remember, sharing the FPU somehow while
having separate vCPU file descriptors.

> One small problem with that is that now user space is responsible for
> switching between VTLs: It determines which VTL is currently running and
> leaves all others (read: all other vCPUs) as stopped. That means if you
> are running happily in KVM_RUN in VTL0 and VTL1 gets an interrupt, user
> space needs to stop VTL0 and unpause VTL1 until it triggers VTL_RETURN
> at which point VTL1 stops execution and VTL0 runs again.

That's with IPIs in VTL1, right? I understand now. My idea was, since
we need a link from VTL1 to VTL0 for the FPU, to use the same link to
trigger a vmexit to userspace if source VTL > destination VTL. I am
not sure how you would handle the case where the destination vCPU is
not running; probably by detecting the IPI when VTL0 restarts on the
destination vCPU?

In any case, making vCPUs poll()-able is sensible.

Paolo

> Nicolas built a patch that exposes "interrupt on vCPU is pending" as an
> ioeventfd user space can request. That way, user space can know whenever
> a currently paused vCPU has a pending interrupt and can act accordingly.
> You could use the same mechanism if you wanted to implement HLT in user
> space, but still use an in-kernel LAPIC.
David Woodhouse Sept. 26, 2023, 5:28 p.m. UTC | #10
On 26 September 2023 19:20:24 CEST, Paolo Bonzini <pbonzini@redhat.com> wrote:
>On Sat, Sep 23, 2023 at 6:44 PM Alexander Graf <graf@amazon.de> wrote:
>> On 23.09.23 11:24, Paolo Bonzini wrote:
>> > Why do you need it?  You can just use KVM_RUN to go to sleep, and if you
>> > get another job you kick out the vCPU with pthread_kill.  (I also didn't
>> > get the VSM reference).
>>
>> With the original VSM patches, we used to make a vCPU aware of the fact
>> that it can morph into one of many VTLs. That approach turned out to be
>> insanely intrusive and fragile and so we're currently reimplementing
>> everything with VTLs as vCPUs. That allows us to move the majority of VSM
>> functionality to user space. Everything we've seen so far looks as if
>> there is no real performance loss with that approach.
>
>Yes, that was also what I remember, sharing the FPU somehow while
>having separate vCPU file descriptors.
>
>> One small problem with that is that now user space is responsible for
>> switching between VTLs: It determines which VTL is currently running and
>> leaves all others (read: all other vCPUs) as stopped. That means if you
>> are running happily in KVM_RUN in VTL0 and VTL1 gets an interrupt, user
>> space needs to stop VTL0 and unpause VTL1 until it triggers VTL_RETURN
>> at which point VTL1 stops execution and VTL0 runs again.
>
>That's with IPIs in VTL1, right? I understand now. My idea was, since
>we need a link from VTL1 to VTL0 for the FPU, to use the same link to
>trigger a vmexit to userspace if source VTL > destination VTL. I am
>not sure how you would handle the case where the destination vCPU is
>not running; probably by detecting the IPI when VTL0 restarts on the
>destination vCPU?
>
>In any case, making vCPUs poll()-able is sensible.

Thinking about this a bit more, even for HLT it probably isn't just as simple as checking for mp_state changes. If there's a REQ_EVENT outstanding for something like a timer delivery, that won't get handled and the IRQ actually delivered to the local APIC until the vCPU is actually *run*, will it?
Sean Christopherson Sept. 26, 2023, 8:29 p.m. UTC | #11
On Tue, Sep 26, 2023, David Woodhouse wrote:
> 
> 
> On 26 September 2023 19:20:24 CEST, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >On Sat, Sep 23, 2023 at 6:44 PM Alexander Graf <graf@amazon.de> wrote:
> >> On 23.09.23 11:24, Paolo Bonzini wrote:
> >> > Why do you need it?  You can just use KVM_RUN to go to sleep, and if you
> >> > get another job you kick out the vCPU with pthread_kill.  (I also didn't
> >> > get the VSM reference).
> >>
> >> With the original VSM patches, we used to make a vCPU aware of the fact
> >> that it can morph into one of many VTLs. That approach turned out to be
> >> insanely intrusive and fragile and so we're currently reimplementing
> >> everything with VTLs as vCPUs. That allows us to move the majority of VSM
> >> functionality to user space. Everything we've seen so far looks as if
> >> there is no real performance loss with that approach.
> >
> >Yes, that was also what I remember, sharing the FPU somehow while
> >having separate vCPU file descriptors.
> >
> >> One small problem with that is that now user space is responsible for
> >> switching between VTLs: It determines which VTL is currently running and
> >> leaves all others (read: all other vCPUs) as stopped. That means if you
> >> are running happily in KVM_RUN in VTL0 and VTL1 gets an interrupt, user
> >> space needs to stop VTL0 and unpause VTL1 until it triggers VTL_RETURN
> >> at which point VTL1 stops execution and VTL0 runs again.
> >
> >That's with IPIs in VTL1, right? I understand now. My idea was, since
> >we need a link from VTL1 to VTL0 for the FPU, to use the same link to
> >trigger a vmexit to userspace if source VTL > destination VTL. I am
> >not sure how you would handle the case where the destination vCPU is
> >not running; probably by detecting the IPI when VTL0 restarts on the
> >destination vCPU?
> >
> >In any case, making vCPUs poll()-able is sensible.
> 
> Thinking about this a bit more, even for HLT it probably isn't just as simple
> as checking for mp_state changes. If there's a REQ_EVENT outstanding for
> something like a timer delivery, that won't get handled and the IRQ actually
> delivered to the local APIC until the vCPU is actually *run*, will it?

I haven't been following this conversation, just reacting to seeing "HLT" and
"mp_state", which is a bit of a mess:

https://lore.kernel.org/all/ZMgIQ5m1jMSAogT4@google.com

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a6582c1fd8b9..8f931539114a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2128,9 +2128,23 @@  static int kvm_emulate_monitor_mwait(struct kvm_vcpu *vcpu, const char *insn)
 	pr_warn_once("%s instruction emulated as NOP!\n", insn);
 	return kvm_emulate_as_nop(vcpu);
 }
+
 int kvm_emulate_mwait(struct kvm_vcpu *vcpu)
 {
-	return kvm_emulate_monitor_mwait(vcpu, "MWAIT");
+	int ret = kvm_emulate_monitor_mwait(vcpu, "MWAIT");
+
+	if (ret && kvm_userspace_exit(vcpu, KVM_EXIT_MWAIT)) {
+		vcpu->run->exit_reason = KVM_EXIT_MWAIT;
+		ret = 0;
+	} else {
+		/*
+		 * Calling yield() has poorly defined semantics, but the
+		 * guest is in a busy loop and it's the best we can do
+		 * without a full emulation of MONITOR/MWAIT.
+		 */
+		yield();
+	}
+	return ret;
 }
 EXPORT_SYMBOL_GPL(kvm_emulate_mwait);
 
@@ -4554,6 +4568,9 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 				r |= KVM_X86_DISABLE_EXITS_MWAIT;
 		}
 		break;
+	case KVM_CAP_X86_USERSPACE_EXITS:
+		r = KVM_X86_USERSPACE_VALID_EXITS;
+		break;
 	case KVM_CAP_X86_SMM:
 		if (!IS_ENABLED(CONFIG_KVM_SMM))
 			break;
@@ -9643,11 +9660,11 @@  static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason)
 	++vcpu->stat.halt_exits;
 	if (lapic_in_kernel(vcpu)) {
 		vcpu->arch.mp_state = state;
-		return 1;
-	} else {
-		vcpu->run->exit_reason = reason;
-		return 0;
+		if (!kvm_userspace_exit(vcpu, reason))
+			return 1;
 	}
+	vcpu->run->exit_reason = reason;
+	return 0;
 }
 
 int kvm_emulate_halt_noskip(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 1e7be1f6ab29..ce10a809151c 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -430,6 +430,19 @@  static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm)
 	return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED;
 }
 
+static inline bool kvm_userspace_exit(struct kvm_vcpu *vcpu, int reason)
+{
+	if (reason == KVM_EXIT_HLT &&
+	    (vcpu->run->userspace_exits & KVM_X86_USERSPACE_EXIT_HLT))
+		return true;
+
+	if (reason == KVM_EXIT_MWAIT &&
+	    (vcpu->run->userspace_exits & KVM_X86_USERSPACE_EXIT_MWAIT))
+		return true;
+
+	return false;
+}
+
 enum kvm_intr_type {
 	/* Values are arbitrary, but must be non-zero. */
 	KVM_HANDLING_IRQ = 1,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 13065dd96132..43d94d49fc24 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -264,6 +264,7 @@  struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MWAIT            38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -283,7 +284,8 @@  struct kvm_run {
 	/* in */
 	__u8 request_interrupt_window;
 	__u8 immediate_exit;
-	__u8 padding1[6];
+	__u8 userspace_exits;
+	__u8 padding1[5];
 
 	/* out */
 	__u32 exit_reason;
@@ -841,6 +843,11 @@  struct kvm_ioeventfd {
                                               KVM_X86_DISABLE_EXITS_PAUSE | \
                                               KVM_X86_DISABLE_EXITS_CSTATE)
 
+#define KVM_X86_USERSPACE_EXIT_MWAIT	     (1 << 0)
+#define KVM_X86_USERSPACE_EXIT_HLT	     (1 << 1)
+#define KVM_X86_USERSPACE_VALID_EXITS        (KVM_X86_USERSPACE_EXIT_MWAIT | \
+                                              KVM_X86_USERSPACE_EXIT_HLT)
+
 /* for KVM_ENABLE_CAP */
 struct kvm_enable_cap {
 	/* in */
@@ -1192,6 +1199,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_COUNTER_OFFSET 227
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
+#define KVM_CAP_X86_USERSPACE_EXITS 230
 
 #ifdef KVM_CAP_IRQ_ROUTING