[v2,0/2] kvm: x86: Convey the exit reason to user-space on emulation failure

Message ID 20210706101207.2993686-1-david.edmondson@oracle.com (mailing list archive)

Message

David Edmondson July 6, 2021, 10:12 a.m. UTC
To help when debugging failures in the field, if instruction emulation
fails, report the VM exit reason to userspace in order that it can be
recorded.

I'm unsure whether sgx_handle_emulation_failure() needs to be adapted
to use the emulation_failure part of the exit union in struct kvm_run
- advice welcomed.

v2:
- Improve patch comments (dmatlack)
- Intel should provide the full exit reason (dmatlack)
- Pass a boolean rather than flags (dmatlack)
- Use the helper in kvm_task_switch() and kvm_handle_memory_failure()
  (dmatlack)
- Describe the exit_reason field of the emulation_failure structure
  (dmatlack)

David Edmondson (2):
  KVM: x86: Add kvm_x86_ops.get_exit_reason
  KVM: x86: On emulation failure, convey the exit reason to userspace

 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  3 +++
 arch/x86/kvm/svm/svm.c             |  6 ++++++
 arch/x86/kvm/vmx/vmx.c             | 11 +++++++----
 arch/x86/kvm/x86.c                 | 22 +++++++++++++---------
 include/uapi/linux/kvm.h           |  7 +++++++
 6 files changed, 37 insertions(+), 13 deletions(-)

Comments

David Matlack July 7, 2021, 11:20 p.m. UTC | #1
On Tue, Jul 06, 2021 at 11:12:05AM +0100, David Edmondson wrote:
> To help when debugging failures in the field, if instruction emulation
> fails, report the VM exit reason to userspace in order that it can be
> recorded.

What is the benefit of seeing the VM-exit reason that led to an
emulation failure?

David Matlack July 7, 2021, 11:22 p.m. UTC | #2
On Tue, Jul 06, 2021 at 11:12:05AM +0100, David Edmondson wrote:
> To help when debugging failures in the field, if instruction emulation
> fails, report the VM exit reason to userspace in order that it can be
> recorded.
> 
> I'm unsure whether sgx_handle_emulation_failure() needs to be adapted
> to use the emulation_failure part of the exit union in struct kvm_run
> - advice welcomed.
> 
> v2:
> - Improve patch comments (dmatlack)
> - Intel should provide the full exit reason (dmatlack)

I only asked whether Intel should provide the full exit reason; I do not
have an opinion either way. It really comes down to your use case for
wanting the exit reason. Would the full exit reason be useful, or do you
just need the basic exit number?

David Edmondson July 8, 2021, 2:17 p.m. UTC | #3
Apologies if you see two of these - I had some email problems earlier.

On Wednesday, 2021-07-07 at 23:20:04 UTC, David Matlack wrote:

> On Tue, Jul 06, 2021 at 11:12:05AM +0100, David Edmondson wrote:
>> To help when debugging failures in the field, if instruction emulation
>> fails, report the VM exit reason to userspace in order that it can be
>> recorded.
>
> What is the benefit of seeing the VM-exit reason that led to an
> emulation failure?

I can't cite an example of where this has definitively led in a
direction that helped solve a problem, but we do sometimes see emulation
failures reported in situations where we are not able to reproduce the
failures on demand and the existing information provided at the time of
failure is either insufficient or suspect.

Given that, I'm left casting about for data that can be made available
to assist in postmortem analysis of the failures.

dme.
David Edmondson July 8, 2021, 2:20 p.m. UTC | #4
On Wednesday, 2021-07-07 at 23:22:44 UTC, David Matlack wrote:

> On Tue, Jul 06, 2021 at 11:12:05AM +0100, David Edmondson wrote:
>> To help when debugging failures in the field, if instruction emulation
>> fails, report the VM exit reason to userspace in order that it can be
>> recorded.
>> 
>> I'm unsure whether sgx_handle_emulation_failure() needs to be adapted
>> to use the emulation_failure part of the exit union in struct kvm_run
>> - advice welcomed.
>> 
>> v2:
>> - Improve patch comments (dmatlack)
>> - Intel should provide the full exit reason (dmatlack)
>
> I just asked if Intel should provide the full exit reason, I do not have
> an opinion either way. It really comes down to your usecase for wanting
> the exit reason. Would the full exit reason be useful or do you just
> need the basic exit number?

Given that this is intended as a debug aid, having the full exit reason
makes sense.
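
For readers following along, the difference between the "full" exit reason and the "basic exit number" on Intel can be sketched as below. This is an illustration only, not code from the series: the bit positions follow the Intel SDM layout of the 32-bit exit reason field (basic reason in bits 15:0, enclave mode in bit 27, VM-entry failure in bit 31), while the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Layout of the 32-bit VMX exit reason field (Intel SDM). */
#define VMX_EXIT_BASIC_MASK      0xffffu     /* bits 15:0: basic exit reason */
#define VMX_EXIT_ENCLAVE_MODE    (1u << 27)  /* exit occurred in enclave mode */
#define VMX_EXIT_FAILED_VMENTRY  (1u << 31)  /* exit caused by VM-entry failure */

/* Hypothetical helper type, used only for this sketch. */
struct decoded_exit_reason {
	uint16_t basic;
	bool enclave_mode;
	bool failed_vmentry;
};

/* Split a full exit reason into the basic number and two modifier bits. */
static struct decoded_exit_reason decode_vmx_exit_reason(uint32_t full)
{
	struct decoded_exit_reason d = {
		.basic = (uint16_t)(full & VMX_EXIT_BASIC_MASK),
		.enclave_mode = (full & VMX_EXIT_ENCLAVE_MODE) != 0,
		.failed_vmentry = (full & VMX_EXIT_FAILED_VMENTRY) != 0,
	};
	return d;
}
```

Reporting only the basic number would discard the modifier bits, which is why the full value is the more useful thing to log for postmortem analysis.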

dme.
David Matlack July 8, 2021, 6:38 p.m. UTC | #5
On Thu, Jul 08, 2021 at 03:17:40PM +0100, David Edmondson wrote:
> Apologies if you see two of these - I had some email problems earlier.

I only got one! :)

> 
> On Wednesday, 2021-07-07 at 23:20:04 UTC, David Matlack wrote:
> 
> > On Tue, Jul 06, 2021 at 11:12:05AM +0100, David Edmondson wrote:
> >> To help when debugging failures in the field, if instruction emulation
> >> fails, report the VM exit reason to userspace in order that it can be
> >> recorded.
> >
> > What is the benefit of seeing the VM-exit reason that led to an
> > emulation failure?
> 
> I can't cite an example of where this has definitively led in a
> direction that helped solve a problem, but we do sometimes see emulation
> failures reported in situations where we are not able to reproduce the
> failures on demand and the existing information provided at the time of
> failure is either insufficient or suspect.
> 
> Given that, I'm left casting about for data that can be made available
> to assist in postmortem analysis of the failures.

Understood, thanks for the context. My only concern would be that
userspace APIs are difficult to change once they exist. If it turns
out knowing the exit reason does not help with debugging emulation
failures we'd still be stuck with exporting it on every emulation
failure.

My intuition is that the instruction bytes (which are now available with
Aaron's patch) and the guest register state (which is queryable through
other ioctls) should be sufficient to set up a reproduction of the
emulation failure in a kvm-unit-test and the exit reason should not
really matter. I'm curious if that's not the case?

I'm really not opposed to exporting the exit reason if it is useful, I'm
just not sure it will help.

David Edmondson July 8, 2021, 8:13 p.m. UTC | #6
On Thursday, 2021-07-08 at 18:38:18 UTC, David Matlack wrote:

> On Thu, Jul 08, 2021 at 03:17:40PM +0100, David Edmondson wrote:
>> Apologies if you see two of these - I had some email problems earlier.
>
> I only got one! :)

Phew!

>> On Wednesday, 2021-07-07 at 23:20:04 UTC, David Matlack wrote:
>> 
>> > On Tue, Jul 06, 2021 at 11:12:05AM +0100, David Edmondson wrote:
>> >> To help when debugging failures in the field, if instruction emulation
>> >> fails, report the VM exit reason to userspace in order that it can be
>> >> recorded.
>> >
>> > What is the benefit of seeing the VM-exit reason that led to an
>> > emulation failure?
>> 
>> I can't cite an example of where this has definitively led in a
>> direction that helped solve a problem, but we do sometimes see emulation
>> failures reported in situations where we are not able to reproduce the
>> failures on demand and the existing information provided at the time of
>> failure is either insufficient or suspect.
>> 
>> Given that, I'm left casting about for data that can be made available
>> to assist in postmortem analysis of the failures.
>
> Understood, thanks for the context. My only concern would be that
> userspace APIs are difficult to change once they exist.

Agreed.

> If it turns out knowing the exit reason does not help with debugging
> emulation failures we'd still be stuck with exporting it on every
> emulation failure.

We could stop setting the flag and never export it, but this would waste
space in the structure and be odd, without doubt.
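
To make the flag-gating concrete, here is a rough sketch of how userspace might consume such a field. It assumes a layout in which a flags word says which optional members are valid; the struct, the flag names and the helper are all hypothetical stand-ins written for this illustration and are not the definitions from include/uapi/linux/kvm.h or from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits: each one marks an optional field as valid. */
#define SKETCH_FLAG_INSN_BYTES   (1ULL << 0)
#define SKETCH_FLAG_EXIT_REASON  (1ULL << 1)

/* Hypothetical stand-in for the emulation_failure part of the exit union. */
struct sketch_emulation_failure {
	uint64_t flags;
	uint8_t  insn_size;
	uint8_t  insn_bytes[15];
	uint64_t exit_reason;   /* meaningful only if the flag above is set */
};

/*
 * Store the exit reason in *out and return true only when the kernel
 * marked the field as valid; otherwise leave *out untouched.
 */
static bool sketch_get_exit_reason(const struct sketch_emulation_failure *ef,
				   uint64_t *out)
{
	if (!(ef->flags & SKETCH_FLAG_EXIT_REASON))
		return false;
	*out = ef->exit_reason;
	return true;
}
```

With this shape, a kernel that stops setting the flag leaves well-behaved userspace working unchanged; the cost is the reserved space in the structure.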

> My intuition is that the instruction bytes (which are now available with
> Aaron's patch) and the guest register state (which is queryable through
> other ioctls) should be sufficient to set up a reproduction of the
> emulation failure in a kvm-unit-test and the exit reason should not
> really matter. I'm curious if that's not the case?

The instruction bytes around the reported EIP are all zeroes - the
register dump looks suspect, and doesn't correspond with the reported
behaviour of the VM at the time of the failure.

It's possible that Aaron's changes will help, indeed, given that they
report state from within the instruction emulator itself. So far I don't
have a sufficiently reproducible case to be able to see if that is the
case.

> I'm really not opposed to exporting the exit reason if it is useful, I'm
> just not sure it will help.

In the emulation failure case we are not in something I would consider a
fast path, and the overhead of acquiring and reporting the exit reason
is low.

Do you anticipate a case where it would be inappropriate or expensive to
report the reason?

dme.
David Matlack July 8, 2021, 8:35 p.m. UTC | #7
On Thu, Jul 08, 2021 at 09:13:38PM +0100, David Edmondson wrote:
> On Thursday, 2021-07-08 at 18:38:18 UTC, David Matlack wrote:
> 
> > On Thu, Jul 08, 2021 at 03:17:40PM +0100, David Edmondson wrote:
> >> Apologies if you see two of these - I had some email problems earlier.
> >
> > I only got one! :)
> 
> Phew!
> 
> >> On Wednesday, 2021-07-07 at 23:20:04 UTC, David Matlack wrote:
> >> 
> >> > On Tue, Jul 06, 2021 at 11:12:05AM +0100, David Edmondson wrote:
> >> >> To help when debugging failures in the field, if instruction emulation
> >> >> fails, report the VM exit reason to userspace in order that it can be
> >> >> recorded.
> >> >
> >> > What is the benefit of seeing the VM-exit reason that led to an
> >> > emulation failure?
> >> 
> >> I can't cite an example of where this has definitively led in a
> >> direction that helped solve a problem, but we do sometimes see emulation
> >> failures reported in situations where we are not able to reproduce the
> >> failures on demand and the existing information provided at the time of
> >> failure is either insufficient or suspect.
> >> 
> >> Given that, I'm left casting about for data that can be made available
> >> to assist in postmortem analysis of the failures.
> >
> > Understood, thanks for the context. My only concern would be that
> > userspace APIs are difficult to change once they exist.
> 
> Agreed.
> 
> > If it turns out knowing the exit reason does not help with debugging
> > emulation failures we'd still be stuck with exporting it on every
> > emulation failure.
> 
> We could stop setting the flag and never export it, but this would waste
> space in the structure and be odd, without doubt.
> 
> > My intuition is that the instruction bytes (which are now available with
> > Aaron's patch) and the guest register state (which is queryable through
> > other ioctls) should be sufficient to set up a reproduction of the
> > emulation failure in a kvm-unit-test and the exit reason should not
> > really matter. I'm curious if that's not the case?
> 
> The instruction bytes around the reported EIP are all zeroes - the
> register dump looks suspect, and doesn't correspond with the reported
> behaviour of the VM at the time of the failure.

Interesting... Nothing comes to mind but others on this list might have
a suggestion of where to look next.

> 
> It's possible that Aaron's changes will help, indeed, given that they
> report state from within the instruction emulator itself. So far I don't
> have a sufficiently reproducible case to be able to see if that is the
> case.
> 
> > I'm really not opposed to exporting the exit reason if it is useful, I'm
> > just not sure it will help.
> 
> In the emulation failure case we are not in something I would consider a
> fast path, and the overhead of acquiring and reporting the exit reason
> is low.

Agreed. I'm not worried about performance, only code complexity and
bloat in the userspace API. But as you suggested above we could always
stop setting the flag and remove the code that populates the exit reason
if it turns out to not be useful. The field in kvm_run is the only thing
that could be hard to remove in the future.

Sean Christopherson July 9, 2021, 4:08 p.m. UTC | #8
On Thu, Jul 08, 2021, David Matlack wrote:
> On Thu, Jul 08, 2021 at 09:13:38PM +0100, David Edmondson wrote:
> > On Thursday, 2021-07-08 at 18:38:18 UTC, David Matlack wrote:
> > > On Thu, Jul 08, 2021 at 03:17:40PM +0100, David Edmondson wrote:
> > >> I can't cite an example of where this has definitively led in a
> > >> direction that helped solve a problem, but we do sometimes see emulation
> > >> failures reported in situations where we are not able to reproduce the
> > >> failures on demand and the existing information provided at the time of
> > >> failure is either insufficient or suspect.
> > >> 
> > >> Given that, I'm left casting about for data that can be made available
> > >> to assist in postmortem analysis of the failures.
> > >
> > > Understood, thanks for the context. My only concern would be that
> > > userspace APIs are difficult to change once they exist.
> > 
> > Agreed.
> > 
> > > If it turns out knowing the exit reason does not help with debugging
> > > emulation failures we'd still be stuck with exporting it on every
> > > emulation failure.

I can think of multiple cases where knowing why KVM emulated in the first place
would be helpful, e.g. a failure on EPT misconfig (MMIO) exit could be a simple
"drat, KVM doesn't handle SSE instructions", whereas a failure on a descriptor
table exit (for UMIP emulation) would be a completely different mess.
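
As a rough illustration of that kind of triage, userspace could map the reported VMX exit reason to a first guess at what went wrong. The exit reason numbers below are the Intel basic exit reasons (46 and 47 for descriptor-table accesses, 49 for EPT misconfiguration); the function name and the hint strings are hypothetical, and SVM exit codes would need a separate table.

```c
#include <stdint.h>

/* Intel basic exit reasons relevant to the two cases mentioned above. */
#define EXIT_REASON_GDTR_IDTR      46
#define EXIT_REASON_LDTR_TR        47
#define EXIT_REASON_EPT_MISCONFIG  49

/* Hypothetical helper: turn a full VMX exit reason into a triage hint. */
static const char *sketch_triage_hint(uint32_t full_exit_reason)
{
	/* Only the basic exit reason (bits 15:0) selects the hint. */
	switch (full_exit_reason & 0xffffu) {
	case EXIT_REASON_EPT_MISCONFIG:
		return "MMIO access: look for an instruction the emulator lacks";
	case EXIT_REASON_GDTR_IDTR:
	case EXIT_REASON_LDTR_TR:
		return "descriptor-table access: UMIP emulation path";
	default:
		return "other: inspect instruction bytes and register state";
	}
}
```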