Message ID: 20230924124410.897646-1-mlevitsk@redhat.com (mailing list archive)
Series: KVM: x86: tracepoint updates
On Sun, Sep 24, 2023, Maxim Levitsky wrote:
> This patch series is intended to add some selected information
> to the kvm tracepoints to make it easier to gather insights about
> running nested guests.
>
> This patch series was developed together with a new x86 performance analysis tool
> that I developed recently (https://gitlab.com/maximlevitsky/kvmon)
> which aims to be a better kvm_stat, and allows you at a glance
> to see what is happening in a VM, including nesting.

Rather than add more and more tracepoints, I think we should be more thoughtful
about (a) where we place KVM's tracepoints and (b) giving userspace the necessary
hooks to write BPF programs to extract whatever data is needed at any given time.

There's simply no way we can iterate fast enough in KVM tracepoints to adapt to
userspace's debug/monitoring needs. E.g. if it turns out someone wants detailed
info on hypercalls that use memory or registers beyond ABCD, the new tracepoints
won't help them.

If all KVM tracepoints grab "struct kvm_vcpu" and force VMCS "registers" to be
cached (or decached, depending on one's viewpoint), then I think that'll serve 99%
of use cases. E.g. the vCPU gives a BPF program kvm_vcpu, vcpu_{vmx,svm}, kvm, etc.

trace_kvm_exit is a good example: despite all of the information that is captured
by KVM, it's borderline worthless for CPUID and MSR exits because their interesting
information is held in registers and not captured in the VMCS or VMCB.

There are some BTF type info issues that I've encountered, but I suspect that's
as much a PEBKAC problem as anything.
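[As a rough illustration of the BPF-piggybacking approach described above (a hedged editorial sketch, not part of the thread): a bpftrace one-liner can attach to the existing kvm:kvm_exit tracepoint and aggregate exits per reason; the `exit_reason` field name follows the trace event's format file, which varies across kernel versions, and running this requires root on a host with KVM active.]

```
# Count VM-exits per exit reason by piggybacking the existing
# kvm:kvm_exit tracepoint -- no new tracepoints needed.
bpftrace -e 'tracepoint:kvm:kvm_exit { @exits[args->exit_reason] = count(); }'
```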
On Mon, 2023-09-25 at 17:03 -0700, Sean Christopherson wrote:
> On Sun, Sep 24, 2023, Maxim Levitsky wrote:
> > This patch series is intended to add some selected information
> > to the kvm tracepoints to make it easier to gather insights about
> > running nested guests.
> >
> > This patch series was developed together with a new x86 performance analysis tool
> > that I developed recently (https://gitlab.com/maximlevitsky/kvmon)
> > which aims to be a better kvm_stat, and allows you at a glance
> > to see what is happening in a VM, including nesting.
>
> Rather than add more and more tracepoints, I think we should be more thoughtful
> about (a) where we place KVM's tracepoints and (b) giving userspace the necessary
> hooks to write BPF programs to extract whatever data is needed at any given time.
>
> There's simply no way we can iterate fast enough in KVM tracepoints to adapt to
> userspace's debug/monitoring needs. E.g. if it turns out someone wants detailed
> info on hypercalls that use memory or registers beyond ABCD, the new tracepoints
> won't help them.
>
> If all KVM tracepoints grab "struct kvm_vcpu" and force VMCS "registers" to be
> cached (or decached, depending on one's viewpoint), then I think that'll serve 99%
> of use cases. E.g. the vCPU gives a BPF program kvm_vcpu, vcpu_{vmx,svm}, kvm, etc.
>
> trace_kvm_exit is a good example: despite all of the information that is captured
> by KVM, it's borderline worthless for CPUID and MSR exits because their interesting
> information is held in registers and not captured in the VMCS or VMCB.
>
> There are some BTF type info issues that I've encountered, but I suspect that's
> as much a PEBKAC problem as anything.

While eBPF has its use cases, none of the extra tracepoints were added solely because of
the monitoring tool, and I do understand that tracepoints are a limited resource.

Each added tracepoint/info was added only when it was also found to be useful for regular
kvm tracing.

Best regards,
	Maxim Levitsky
On 9/26/23 10:28, Maxim Levitsky wrote:
>> trace_kvm_exit is a good example: despite all of the information that is captured
>> by KVM, it's borderline worthless for CPUID and MSR exits because their interesting
>> information is held in registers and not captured in the VMCS or VMCB.
>>
>> There are some BTF type info issues that I've encountered, but I suspect that's
>> as much a PEBKAC problem as anything.
>
> While eBPF has its use cases, none of the extra tracepoints were added solely because of
> the monitoring tool, and I do understand that tracepoints are a limited resource.
>
> Each added tracepoint/info was added only when it was also found to be useful for regular
> kvm tracing.

I am not sure about _all_ of them, but I agree with both of you.

On one hand, it would be pretty cool to have eBPF access to registers. On
the other hand, the specific info you're adding is generic, and I think there
are only a couple of exceptions where I am not sure it belongs in the generic
KVM tracepoints.

Paolo
On Tue, Sep 26, 2023, Paolo Bonzini wrote:
> On 9/26/23 10:28, Maxim Levitsky wrote:
> > > trace_kvm_exit is a good example: despite all of the information that is captured
> > > by KVM, it's borderline worthless for CPUID and MSR exits because their interesting
> > > information is held in registers and not captured in the VMCS or VMCB.
> > >
> > > There are some BTF type info issues that I've encountered, but I suspect that's
> > > as much a PEBKAC problem as anything.
> >
> > While eBPF has its use cases, none of the extra tracepoints were added solely because of
> > the monitoring tool, and I do understand that tracepoints are a limited resource.
> >
> > Each added tracepoint/info was added only when it was also found to be useful for regular
> > kvm tracing.
>
> I am not sure about _all_ of them, but I agree with both of you.
>
> On one hand, it would be pretty cool to have eBPF access to registers. On
> the other hand, the specific info you're adding is generic, and I think there
> are only a couple of exceptions where I am not sure it belongs in the generic
> KVM tracepoints.

I'm not saying this information isn't useful, *sometimes*. What I'm saying is that
I don't think it's sustainable/maintainable to keep expanding KVM's tracepoints.
E.g. why trace req_immediate_exit and not mmu_invalidate_seq[*]? It's practically
impossible to predict exactly what information will be useful in the field. And for
something like kvmon, IMO the onus is on userspace to do the heavy lifting.

Rather than hardcode mounds of information in KVM's tracepoints, I think we should
refactor KVM to make it as easy as possible to use BPF to piggyback tracepoints and
get at data that *might* be interesting, and then add a variety of BPF programs
(e.g. using bpftrace for simplicity) somewhere in tools/ to provide equivalent
functionality to select existing tracepoints. E.g. if we rework kvm_vcpu to place
"struct kvm_vcpu_arch arch" at offset '0', then we can get at all the GPRs and
pseudo-registers by hardcoding offsets into the struct, i.e. without needing BTF
type info. More elaborate trace programs would likely need BTF, or maybe some
clever shenanigans, but it seems very doable.

[*] https://lore.kernel.org/all/ZOaUdP46f8XjQvio@google.com