[4/4] kvm: x86: export TSC offset to user-space

Message ID	1472663145-1835-5-git-send-email-lcapitulino@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> From: Luiz Capitulino <lcapitulino@redhat.com> To: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, pbonzini@redhat.com, rkrcmar@redhat.com, rostedt@goodmis.org, mhiramat@kernel.org, mtosatti@redhat.com Subject: [PATCH 4/4] kvm: x86: export TSC offset to user-space Date: Wed, 31 Aug 2016 13:05:45 -0400 Message-Id: <1472663145-1835-5-git-send-email-lcapitulino@redhat.com> In-Reply-To: <1472663145-1835-1-git-send-email-lcapitulino@redhat.com> References: <1472663145-1835-1-git-send-email-lcapitulino@redhat.com> Sender: kvm-owner@vger.kernel.org Precedence: bulk

Luiz Capitulino Aug. 31, 2016, 5:05 p.m. UTC

We need to retrieve a VM's TSC offset in order to use
the host's TSC to merge host and guest traces. This is
explained in detail in this thread:

  [Qemu-devel] [RFC] host and guest kernel trace merging
  https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html

Today, the only way to retrieve a VM's TSC offset is
by using the kvm_write_tsc_offset tracepoint. This has
a few problems. First, the tracepoint is only emitted
when the VM boots, which requires a reboot to get it if
the VM is already running. Second, tracepoints are not
supposed to be ABIs in case they need to be consumed by
user-space tools.

This commit exports a VM's TSC offset to user-space via
debugfs. A new file called "tsc-offset" is created in
the VM's debugfs directory. For example:

  /sys/kernel/debug/kvm/51696-10/tsc-offset

This file contains one TSC offset per line, for each
vCPU. For example:

  vcpu0: 18446742405270834952
  vcpu1: 18446742405270834952
  vcpu2: 18446742405270834952
  vcpu3: 18446742405270834952

There are some important observations about this
solution:

 - While all vCPUs TSC offsets should be equal for the
   cases we care about (ie. stable TSC and no write to
   the TSC MSR), I chose to follow the spec and export
   each vCPU's TSC offset (might also be helpful for
   debugging)

 - The TSC offset is only useful after the VM has booted

 - We'll probably need to export the TSC multiplier too.
   However, I've been using only the TSC offset for now.
   So, let's get this merged first and do the TSC multiplier
   as a second step

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/svm.c              |  1 +
 arch/x86/kvm/vmx.c              |  8 ++++++++
 arch/x86/kvm/x86.c              | 30 ++++++++++++++++++++++++++++++
 4 files changed, 40 insertions(+)

Stefan Hajnoczi Sept. 2, 2016, 1:43 p.m. UTC | #1

On Wed, Aug 31, 2016 at 01:05:45PM -0400, Luiz Capitulino wrote:
> We need to retrieve a VM's TSC offset in order to use
> the host's TSC to merge host and guest traces. This is
> explained in detail in this thread:
> 
>   [Qemu-devel] [RFC] host and guest kernel trace merging
>   https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html
> 
> Today, the only way to retrieve a VM's TSC offset is
> by using the kvm_write_tsc_offset tracepoint. This has
> a few problems. First, the tracepoint is only emitted
> when the VM boots, which requires a reboot to get it if
> the VM is already running. Second, tracepoints are not
> supposed to be ABIs in case they need to be consumed by
> user-space tools.
> 
> This commit exports a VM's TSC offset to user-space via
> debugfs. A new file called "tsc-offset" is created in
> the VM's debugfs directory. For example:
> 
>   /sys/kernel/debug/kvm/51696-10/tsc-offset
> 
> This file contains one TSC offset per line, for each
> vCPU. For example:
> 
>   vcpu0: 18446742405270834952
>   vcpu1: 18446742405270834952
>   vcpu2: 18446742405270834952
>   vcpu3: 18446742405270834952
> 
> There are some important observations about this
> solution:
> 
>  - While all vCPUs TSC offsets should be equal for the
>    cases we care about (ie. stable TSC and no write to
>    the TSC MSR), I chose to follow the spec and export
>    each vCPU's TSC offset (might also be helpful for
>    debugging)
> 
>  - The TSC offset is only useful after the VM has booted
> 
>  - We'll probably need to export the TSC multiplier too.
>    However, I've been using only the TSC offset for now.
>    So, let's get this merged first and do the TSC multiplier
>    as a second step

Can TSC offset changes occur at runtime?

One example is vcpu hotplug where the tracing tool would need to fetch
the new vcpu's TSC offset after tracing has already started.

Another example is if QEMU or the guest change the TSC offset while
running.  If the tracing tool doesn't notice this then trace events will have
incorrect timestamps.

Stefan

Steven Rostedt Sept. 2, 2016, 2:15 p.m. UTC | #2

On Fri, 2 Sep 2016 09:43:01 -0400
Stefan Hajnoczi <stefanha@gmail.com> wrote:

> Can TSC offset changes occur at runtime?
> 
> One example is vcpu hotplug where the tracing tool would need to fetch
> the new vcpu's TSC offset after tracing has already started.
> 
> Another example is if QEMU or the guest change the TSC offset while
> running.  If the tracing tool doesn't notice this then trace events will have
> incorrect timestamps.

I believe there are tracepoints for these events. They would obviously
need to be enabled for the tracer to catch them.

I would also recommend that they go into their own instance to make
sure other events do not overwrite them.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Luiz Capitulino Sept. 2, 2016, 4:26 p.m. UTC | #3

On Fri, 2 Sep 2016 09:43:01 -0400
Stefan Hajnoczi <stefanha@gmail.com> wrote:

> On Wed, Aug 31, 2016 at 01:05:45PM -0400, Luiz Capitulino wrote:
> > We need to retrieve a VM's TSC offset in order to use
> > the host's TSC to merge host and guest traces. This is
> > explained in detail in this thread:
> > 
> >   [Qemu-devel] [RFC] host and guest kernel trace merging
> >   https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html
> > 
> > Today, the only way to retrieve a VM's TSC offset is
> > by using the kvm_write_tsc_offset tracepoint. This has
> > a few problems. First, the tracepoint is only emitted
> > when the VM boots, which requires a reboot to get it if
> > the VM is already running. Second, tracepoints are not
> > supposed to be ABIs in case they need to be consumed by
> > user-space tools.
> > 
> > This commit exports a VM's TSC offset to user-space via
> > debugfs. A new file called "tsc-offset" is created in
> > the VM's debugfs directory. For example:
> > 
> >   /sys/kernel/debug/kvm/51696-10/tsc-offset
> > 
> > This file contains one TSC offset per line, for each
> > vCPU. For example:
> > 
> >   vcpu0: 18446742405270834952
> >   vcpu1: 18446742405270834952
> >   vcpu2: 18446742405270834952
> >   vcpu3: 18446742405270834952
> > 
> > There are some important observations about this
> > solution:
> > 
> >  - While all vCPUs TSC offsets should be equal for the
> >    cases we care about (ie. stable TSC and no write to
> >    the TSC MSR), I chose to follow the spec and export
> >    each vCPU's TSC offset (might also be helpful for
> >    debugging)
> > 
> >  - The TSC offset is only useful after the VM has booted
> > 
> >  - We'll probably need to export the TSC multiplier too.
> >    However, I've been using only the TSC offset for now.
> >    So, let's get this merged first and do the TSC multiplier
> >    as a second step  
> 
> Can TSC offset changes occur at runtime?

Yes. IIRC, if the system has an unstable TSC, KVM will adjust the
TSC offset when migrating the vCPU to other cores (although
tracing with unstable TSC is not supported). Also, the guest can
write to the TSC MSR at any time (although I don't know if Linux
ever does this).

> One example is vcpu hotplug where the tracing tool would need to fetch
> the new vcpu's TSC offset after tracing has already started.
> 
> Another example is if QEMU or the guest change the TSC offset while
> running.  If the tracing tool doesn't notice this then trace events will have
> incorrect timestamps.

I guess that what tools like trace-cmd want to do in those cases
is to warn the user and discard the trace. A simple way of doing
this would be to re-check that the TSC offset are the same after
tracing is done. It could also use inotify, in case it works
for debugfs (never tried it myself).
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Luiz Capitulino Sept. 2, 2016, 4:29 p.m. UTC | #4

On Fri, 2 Sep 2016 12:26:55 -0400
Luiz Capitulino <lcapitulino@redhat.com> wrote:

> I guess that what tools like trace-cmd want to do in those cases
> is to warn the user and discard the trace. A simple way of doing
> this would be to re-check that the TSC offset are the same after
> tracing is done. It could also use inotify, in case it works
> for debugfs (never tried it myself).

The second idea was probably stupid, never mind me :)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Paolo Bonzini Sept. 2, 2016, 5 p.m. UTC | #5

On 31/08/2016 19:05, Luiz Capitulino wrote:
>   vcpu0: 18446742405270834952
>   vcpu1: 18446742405270834952
>   vcpu2: 18446742405270834952
>   vcpu3: 18446742405270834952
> 
>  - We'll probably need to export the TSC multiplier too.
>    However, I've been using only the TSC offset for now.
>    So, let's get this merged first and do the TSC multiplier
>    as a second step

You'll need to export the number of fractional bits in the multiplier,
too.  It's going to be a very simple patch, so please do everything now.

arch/x86/kvm/x86.c is huge; please create a new file arch/x86/kvm/debugfs.c.

> Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/svm.c              |  1 +
>  arch/x86/kvm/vmx.c              |  8 ++++++++
>  arch/x86/kvm/x86.c              | 30 ++++++++++++++++++++++++++++++
>  4 files changed, 40 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 33ae3a4..5714bbd 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -952,6 +952,7 @@ struct kvm_x86_ops {
>  	bool (*has_wbinvd_exit)(void);
>  
>  	u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
> +	u64 (*read_cached_tsc_offset)(struct kvm_vcpu *vcpu);
>  	void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
>  
>  	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index af523d8..c851477 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -5065,6 +5065,7 @@ static struct kvm_x86_ops svm_x86_ops = {
>  	.has_wbinvd_exit = svm_has_wbinvd_exit,
>  
>  	.read_tsc_offset = svm_read_tsc_offset,
> +	.read_cached_tsc_offset = svm_read_tsc_offset,
>  	.write_tsc_offset = svm_write_tsc_offset,
>  	.adjust_tsc_offset_guest = svm_adjust_tsc_offset_guest,
>  	.read_l1_tsc = svm_read_l1_tsc,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 5cede40..82dfe42 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -616,6 +616,7 @@ struct vcpu_vmx {
>  	u64 hv_deadline_tsc;
>  
>  	u64 current_tsc_ratio;
> +	u64 cached_tsc_offset;
>  
>  	bool guest_pkru_valid;
>  	u32 guest_pkru;
> @@ -2608,6 +2609,11 @@ static u64 vmx_read_tsc_offset(struct kvm_vcpu *vcpu)
>  	return vmcs_read64(TSC_OFFSET);
>  }
>  
> +static u64 vmx_read_cached_tsc_offset(struct kvm_vcpu *vcpu)
> +{
> +	return to_vmx(vcpu)->cached_tsc_offset;
> +}
> +
>  /*
>   * writes 'offset' into guest's timestamp counter offset register
>   */
> @@ -2632,6 +2638,7 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  					   vmcs_read64(TSC_OFFSET), offset);
>  		vmcs_write64(TSC_OFFSET, offset);
>  	}
> +	to_vmx(vcpu)->cached_tsc_offset = offset;
>  }
>  
>  static void vmx_adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 adjustment)
> @@ -11275,6 +11282,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
>  	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
>  
>  	.read_tsc_offset = vmx_read_tsc_offset,
> +	.read_cached_tsc_offset = vmx_read_cached_tsc_offset,
>  	.write_tsc_offset = vmx_write_tsc_offset,
>  	.adjust_tsc_offset_guest = vmx_adjust_tsc_offset_guest,
>  	.read_l1_tsc = vmx_read_l1_tsc,

You need to handle SVM as well.  So you might as well simplify the code:

- add a kvm_vcpu_write_tsc_offset wrapper for kvm_x86_ops->write_tsc_offset

- add a tsc_offset field in struct kvm_vcpu_arch

- replace kvm_x86_ops->read_tsc_offset with accesses to the new field

Then in a fifth patch export the TSC offset (and multiplier ;)) to
userspace.

I'm not very happy about having a single file for all TSC offsets.
Creating subdirectories under the PID-FD per-VM directory would be nicer
in the long run.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Luiz Capitulino Sept. 2, 2016, 5:31 p.m. UTC | #6

On Fri, 2 Sep 2016 19:00:41 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 31/08/2016 19:05, Luiz Capitulino wrote:
> >   vcpu0: 18446742405270834952
> >   vcpu1: 18446742405270834952
> >   vcpu2: 18446742405270834952
> >   vcpu3: 18446742405270834952
> > 
> >  - We'll probably need to export the TSC multiplier too.
> >    However, I've been using only the TSC offset for now.
> >    So, let's get this merged first and do the TSC multiplier
> >    as a second step  
> 
> You'll need to export the number of fractional bits in the multiplier,
> too.  It's going to be a very simple patch, so please do everything now.

I didn't want to expose the multiplier before testing our tracing
procedure with it. So far we've been only using the TSC offset (and
it works great). I don't even know if I have a machine around to
test it, so it could take a bit.

> arch/x86/kvm/x86.c is huge; please create a new file arch/x86/kvm/debugfs.c.

Will do.

> > Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  1 +
> >  arch/x86/kvm/svm.c              |  1 +
> >  arch/x86/kvm/vmx.c              |  8 ++++++++
> >  arch/x86/kvm/x86.c              | 30 ++++++++++++++++++++++++++++++
> >  4 files changed, 40 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 33ae3a4..5714bbd 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -952,6 +952,7 @@ struct kvm_x86_ops {
> >  	bool (*has_wbinvd_exit)(void);
> >  
> >  	u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
> > +	u64 (*read_cached_tsc_offset)(struct kvm_vcpu *vcpu);
> >  	void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
> >  
> >  	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc);
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index af523d8..c851477 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -5065,6 +5065,7 @@ static struct kvm_x86_ops svm_x86_ops = {
> >  	.has_wbinvd_exit = svm_has_wbinvd_exit,
> >  
> >  	.read_tsc_offset = svm_read_tsc_offset,
> > +	.read_cached_tsc_offset = svm_read_tsc_offset,
> >  	.write_tsc_offset = svm_write_tsc_offset,
> >  	.adjust_tsc_offset_guest = svm_adjust_tsc_offset_guest,
> >  	.read_l1_tsc = svm_read_l1_tsc,
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 5cede40..82dfe42 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -616,6 +616,7 @@ struct vcpu_vmx {
> >  	u64 hv_deadline_tsc;
> >  
> >  	u64 current_tsc_ratio;
> > +	u64 cached_tsc_offset;
> >  
> >  	bool guest_pkru_valid;
> >  	u32 guest_pkru;
> > @@ -2608,6 +2609,11 @@ static u64 vmx_read_tsc_offset(struct kvm_vcpu *vcpu)
> >  	return vmcs_read64(TSC_OFFSET);
> >  }
> >  
> > +static u64 vmx_read_cached_tsc_offset(struct kvm_vcpu *vcpu)
> > +{
> > +	return to_vmx(vcpu)->cached_tsc_offset;
> > +}
> > +
> >  /*
> >   * writes 'offset' into guest's timestamp counter offset register
> >   */
> > @@ -2632,6 +2638,7 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> >  					   vmcs_read64(TSC_OFFSET), offset);
> >  		vmcs_write64(TSC_OFFSET, offset);
> >  	}
> > +	to_vmx(vcpu)->cached_tsc_offset = offset;
> >  }
> >  
> >  static void vmx_adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 adjustment)
> > @@ -11275,6 +11282,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
> >  	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
> >  
> >  	.read_tsc_offset = vmx_read_tsc_offset,
> > +	.read_cached_tsc_offset = vmx_read_cached_tsc_offset,
> >  	.write_tsc_offset = vmx_write_tsc_offset,
> >  	.adjust_tsc_offset_guest = vmx_adjust_tsc_offset_guest,
> >  	.read_l1_tsc = vmx_read_l1_tsc,  
> 
> You need to handle SVM as well.  So you might as well simplify the code:

SVM is handled:

	+	.read_cached_tsc_offset = svm_read_tsc_offset,

> - add a kvm_vcpu_write_tsc_offset wrapper for kvm_x86_ops->write_tsc_offset
> 
> - add a tsc_offset field in struct kvm_vcpu_arch
> 
> - replace kvm_x86_ops->read_tsc_offset with accesses to the new field

Given that SVM is handled, you still want me to do this?

> Then in a fifth patch export the TSC offset (and multiplier ;)) to
> userspace.
> 
> I'm not very happy about having a single file for all TSC offsets.
> Creating subdirectories under the PID-FD per-VM directory would be nicer
> in the long run.

I think Steven would also prefer that, but some people raised the
concern at KVM Forum that creating per vcpu dirs in debugfs may
consume considerable memory for a system running several dozen
if not hundreds of VMs. This concern seems valid to me, but I
can do either way.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Marcelo Tosatti Sept. 2, 2016, 11:49 p.m. UTC | #7

On Fri, Sep 02, 2016 at 09:43:01AM -0400, Stefan Hajnoczi wrote:
> On Wed, Aug 31, 2016 at 01:05:45PM -0400, Luiz Capitulino wrote:
> > We need to retrieve a VM's TSC offset in order to use
> > the host's TSC to merge host and guest traces. This is
> > explained in detail in this thread:
> > 
> >   [Qemu-devel] [RFC] host and guest kernel trace merging
> >   https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html
> > 
> > Today, the only way to retrieve a VM's TSC offset is
> > by using the kvm_write_tsc_offset tracepoint. This has
> > a few problems. First, the tracepoint is only emitted
> > when the VM boots, which requires a reboot to get it if
> > the VM is already running. Second, tracepoints are not
> > supposed to be ABIs in case they need to be consumed by
> > user-space tools.
> > 
> > This commit exports a VM's TSC offset to user-space via
> > debugfs. A new file called "tsc-offset" is created in
> > the VM's debugfs directory. For example:
> > 
> >   /sys/kernel/debug/kvm/51696-10/tsc-offset
> > 
> > This file contains one TSC offset per line, for each
> > vCPU. For example:
> > 
> >   vcpu0: 18446742405270834952
> >   vcpu1: 18446742405270834952
> >   vcpu2: 18446742405270834952
> >   vcpu3: 18446742405270834952
> > 
> > There are some important observations about this
> > solution:
> > 
> >  - While all vCPUs TSC offsets should be equal for the
> >    cases we care about (ie. stable TSC and no write to
> >    the TSC MSR), I chose to follow the spec and export
> >    each vCPU's TSC offset (might also be helpful for
> >    debugging)
> > 
> >  - The TSC offset is only useful after the VM has booted
> > 
> >  - We'll probably need to export the TSC multiplier too.
> >    However, I've been using only the TSC offset for now.
> >    So, let's get this merged first and do the TSC multiplier
> >    as a second step
> 
> Can TSC offset changes occur at runtime?
> 
> One example is vcpu hotplug where the tracing tool would need to fetch
> the new vcpu's TSC offset after tracing has already started.
> 
> Another example is if QEMU or the guest change the TSC offset while
> running.  If the tracing tool doesn't notice this then trace events will have
> incorrect timestamps.
> 
> Stefan

Yes they can, and the interface should mention that "the user is
responsible for handling races of execution" (IMO).

So the workflow is:

1) User boots VM and knows the state of the VM.
2) User runs trace-cmd on the host.

Is there a need to automate gathering of traces? (that is to know the
state of reboots and so forth). I don't see one. In that case, the above
workflow is functional.

Can you add such comments to the interface Luiz (that the value
read is potentially stale).





--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Marcelo Tosatti Sept. 3, 2016, 12:23 a.m. UTC | #8

On Fri, Sep 02, 2016 at 10:15:41AM -0400, Steven Rostedt wrote:
> On Fri, 2 Sep 2016 09:43:01 -0400
> Stefan Hajnoczi <stefanha@gmail.com> wrote:
> 
> > Can TSC offset changes occur at runtime?

Yes, but Linux guests don't write to the TSC offset
after booting and unless user does manual TSC writes.

> > One example is vcpu hotplug where the tracing tool would need to fetch
> > the new vcpu's TSC offset after tracing has already started.
> > 
> > Another example is if QEMU or the guest change the TSC offset while
> > running.  If the tracing tool doesn't notice this then trace events will have
> > incorrect timestamps.

So what happens is this:

HostTSC (a variable). 
GuestTSC (variable) = GuestTSCOffset (fixed) + HostTSC (variable)

Then the algorithm processes the trace as follows:
line = for each line(guest_trace)
    line = line - GuestTSCOffset (only the timestamp of course)

So from the moment the guest writes a new TSC offset, the host 
should use the new TSC offset to subtract from the trace entries.
The trace entries are in fact:

HostTSC + GuestTSCOffset

So the guest trace should contain entries for "USE NEW TSC OFFSET,
VALUE: xxx", which can be done (hum not sure if guest entries
or host entries).

However, correct me if i am wrong, the usecase seems to be:

1) Boot guest.
2) run trace-cmd
3) run workload
4) read traces on host.

Another option is to have notifications as follows: record on a buffer 
the following:

    [ EVENT: TSC offset write, VAL: [ host tsc value, guest tsc offset ] ]
    [ EVENT: TSC offset write, VAL: [ host tsc value, guest tsc offset ] ]

Then when merging the trace entries, you do:

line = for each line(host trace)
    write_to_merged_trace(line)
    if (contains_tsc_offset_event(line)) {
        GuestTSCOffset = line.GuestTSCOffset
        if (!guest_tsc_offset_initialized) {
            process_all_guest_lines(
            line = line - GuestTSCOffset (only the timestamp of course)
        }
    }

Aha, fail: the traces on the host are not sufficient to know when 
to use which offset to subtract on the guest trace.

So the only possibility is to have the guest inform the occurrence
of the events: however the guest does not have access to the TSC offset.

So the host needs to inform the new tsc offset value and the guest needs
to inform when the event occurred on its side. So the algorithm can use
information on both traces to know which value to subtract on the
algorithm above.

Is this necessary? Or people do:
1) Boot guest.
2) run trace-cmd
3) run workload
4) read traces on host.

> I believe there are tracepoints for these events. They would obviously
> need to be enabled for the tracer to catch them.
> 
> I would also recommend that they go into their own instance to make
> sure other events do not overwrite them.
> 
> -- Steve
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Luiz Capitulino Sept. 3, 2016, 1:29 a.m. UTC | #9

On Fri, 2 Sep 2016 20:49:37 -0300
Marcelo Tosatti <mtosatti@redhat.com> wrote:

> On Fri, Sep 02, 2016 at 09:43:01AM -0400, Stefan Hajnoczi wrote:
> > On Wed, Aug 31, 2016 at 01:05:45PM -0400, Luiz Capitulino wrote:  
> > > We need to retrieve a VM's TSC offset in order to use
> > > the host's TSC to merge host and guest traces. This is
> > > explained in detail in this thread:
> > > 
> > >   [Qemu-devel] [RFC] host and guest kernel trace merging
> > >   https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html
> > > 
> > > Today, the only way to retrieve a VM's TSC offset is
> > > by using the kvm_write_tsc_offset tracepoint. This has
> > > a few problems. First, the tracepoint is only emitted
> > > when the VM boots, which requires a reboot to get it if
> > > the VM is already running. Second, tracepoints are not
> > > supposed to be ABIs in case they need to be consumed by
> > > user-space tools.
> > > 
> > > This commit exports a VM's TSC offset to user-space via
> > > debugfs. A new file called "tsc-offset" is created in
> > > the VM's debugfs directory. For example:
> > > 
> > >   /sys/kernel/debug/kvm/51696-10/tsc-offset
> > > 
> > > This file contains one TSC offset per line, for each
> > > vCPU. For example:
> > > 
> > >   vcpu0: 18446742405270834952
> > >   vcpu1: 18446742405270834952
> > >   vcpu2: 18446742405270834952
> > >   vcpu3: 18446742405270834952
> > > 
> > > There are some important observations about this
> > > solution:
> > > 
> > >  - While all vCPUs TSC offsets should be equal for the
> > >    cases we care about (ie. stable TSC and no write to
> > >    the TSC MSR), I chose to follow the spec and export
> > >    each vCPU's TSC offset (might also be helpful for
> > >    debugging)
> > > 
> > >  - The TSC offset is only useful after the VM has booted
> > > 
> > >  - We'll probably need to export the TSC multiplier too.
> > >    However, I've been using only the TSC offset for now.
> > >    So, let's get this merged first and do the TSC multiplier
> > >    as a second step  
> > 
> > Can TSC offset changes occur at runtime?
> > 
> > One example is vcpu hotplug where the tracing tool would need to fetch
> > the new vcpu's TSC offset after tracing has already started.
> > 
> > Another example is if QEMU or the guest change the TSC offset while
> > running.  If the tracing tool doesn't notice this then trace events will have
> > incorrect timestamps.
> > 
> > Stefan  
> 
> Yes they can, and the interface should mention that "the user is
> responsible for handling races of execution" (IMO).
> 
> So the workflow is:
> 
> 1) User boots VM and knows the state of the VM.
> 2) User runs trace-cmd on the host.
> 
> Is there a need to automate gathering of traces? (that is to know the
> state of reboots and so forth). I don't see one. In that case, the above
> workflow is functional.
> 
> Can you add such comments to the interface Luiz (that the value
> read is potentially stale).

Sure, no problem.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Masami Hiramatsu (Google) Sept. 3, 2016, 4:04 a.m. UTC | #10

On Fri, 2 Sep 2016 21:23:11 -0300
Marcelo Tosatti <mtosatti@redhat.com> wrote:

> On Fri, Sep 02, 2016 at 10:15:41AM -0400, Steven Rostedt wrote:
> > On Fri, 2 Sep 2016 09:43:01 -0400
> > Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > 
> > > Can TSC offset changes occur at runtime?
> 
> Yes, but Linux guests don't write to the TSC offset
> after booting and unless user does manual TSC writes.
> 
> > > One example is vcpu hotplug where the tracing tool would need to fetch
> > > the new vcpu's TSC offset after tracing has already started.
> > > 
> > > Another example is if QEMU or the guest change the TSC offset while
> > > running.  If the tracing tool doesn't notice this then trace events will have
> > > incorrect timestamps.
> 
> So what happens is this:
> 
> HostTSC (a variable). 
> GuestTSC (variable) = GuestTSCOffset (fixed) + HostTSC (variable)

The same idea has been done by Yoshihiro
http://events.linuxfoundation.org/sites/events/files/cojp13_yunomae.pdf

 
> Then the algorithm processes the trace as follows:
> line = for each line(guest_trace)
>     line = line - GuestTSCOffset (only the timestamp of course)
> 
> So from the moment the guest writes a new TSC offset, the host 
> should use the new TSC offset to subtract from the trace entries.
> The trace entries are in fact:
> 
> HostTSC + GuestTSCOffset
> 
> So the guest trace should contain entries for "USE NEW TSC OFFSET,
> VALUE: xxx", which can be done (hum not sure if guest entries
> or host entries).
> 
> However, correct me if i am wrong, the usecase seems to be:
> 
> 1) Boot guest.
> 2) run trace-cmd
> 3) run workload
> 4) read traces on host.

IIRC, previous (current?) method is to run trace-cmd at first (before
boot the guest) so that it can get tsc-offset event and
can wait on a special unix domain socket.

For above usecase, we have to have an interface to get the current
tsc offset like Luis suggested.

> 
> Another option is to have notifications as follows: record on a buffer 
> the following:
> 
>     [ EVENT: TSC offset write, VAL: [ host tsc value, guest tsc offset ] ]
>     [ EVENT: TSC offset write, VAL: [ host tsc value, guest tsc offset ] ]
> 
> Then when merging the trace entries, you do:
> 
> line = for each line(host trace)
>     write_to_merged_trace(line)
>     if (contains_tsc_offset_event(line)) {
>         GuestTSCOffset = line.GuestTSCOffset
>         if (!guest_tsc_offset_initialized) {
>             process_all_guest_lines(
>             line = line - GuestTSCOffset (only the timestamp of course)
>         }
>     }
> 
> Aha, fail: the traces on the host are not sufficient to know when 
> to use which offset to subtract on the guest trace.
> 
> So the only possibility is to have the guest inform the occurrence
> of the events: however the guest does not have access to the TSC offset.
> 
> So the host needs to inform the new tsc offset value and the guest needs
> to inform when the event occurred on its side. So the algorithm can use
> information on both traces to know which value to subtract on the
> algorithm above.
> 
> Is this necessary? Or people do:
> 1) Boot guest.
> 2) run trace-cmd
> 3) run workload
> 4) read traces on host.
> 
> > I believe there are tracepoints for these events. They would obviously
> > need to be enabled for the tracer to catch them.

Yes, Yoshihiro introduced a tracepoint for that.
http://lkml.iu.edu/hypermail/linux/kernel/1306.1/01741.html

So, we have to trace this event in host side.

> > 
> > I would also recommend that they go into their own instance to make
> > sure other events do not overwrite them.
> > 
> > -- Steve

Thanks,

Paolo Bonzini Sept. 5, 2016, 8:10 a.m. UTC | #11

On 02/09/2016 19:31, Luiz Capitulino wrote:
> On Fri, 2 Sep 2016 19:00:41 +0200
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>> On 31/08/2016 19:05, Luiz Capitulino wrote:
>>>   vcpu0: 18446742405270834952
>>>   vcpu1: 18446742405270834952
>>>   vcpu2: 18446742405270834952
>>>   vcpu3: 18446742405270834952
>>>
>>>  - We'll probably need to export the TSC multiplier too.
>>>    However, I've been using only the TSC offset for now.
>>>    So, let's get this merged first and do the TSC multiplier
>>>    as a second step  
>>
>> You'll need to export the number of fractional bits in the multiplier,
>> too.  It's going to be a very simple patch, so please do everything now.
> 
> I didn't want to expose the multiplier before testing our tracing
> procedure with it.

Exposing the multiplier should be independent of tracing.  I can test it
for you.

Paolo

> So far we've been only using the TSC offset (and
> it works great). I don't even know if I have a machine around to
> test it, so it could take a bit.
> 
>> arch/x86/kvm/x86.c is huge; please create a new file arch/x86/kvm/debugfs.c.
> 
> Will do.
> 
>>> Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
>>> ---
>>>  arch/x86/include/asm/kvm_host.h |  1 +
>>>  arch/x86/kvm/svm.c              |  1 +
>>>  arch/x86/kvm/vmx.c              |  8 ++++++++
>>>  arch/x86/kvm/x86.c              | 30 ++++++++++++++++++++++++++++++
>>>  4 files changed, 40 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>> index 33ae3a4..5714bbd 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -952,6 +952,7 @@ struct kvm_x86_ops {
>>>  	bool (*has_wbinvd_exit)(void);
>>>  
>>>  	u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
>>> +	u64 (*read_cached_tsc_offset)(struct kvm_vcpu *vcpu);
>>>  	void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
>>>  
>>>  	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc);
>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>> index af523d8..c851477 100644
>>> --- a/arch/x86/kvm/svm.c
>>> +++ b/arch/x86/kvm/svm.c
>>> @@ -5065,6 +5065,7 @@ static struct kvm_x86_ops svm_x86_ops = {
>>>  	.has_wbinvd_exit = svm_has_wbinvd_exit,
>>>  
>>>  	.read_tsc_offset = svm_read_tsc_offset,
>>> +	.read_cached_tsc_offset = svm_read_tsc_offset,
>>>  	.write_tsc_offset = svm_write_tsc_offset,
>>>  	.adjust_tsc_offset_guest = svm_adjust_tsc_offset_guest,
>>>  	.read_l1_tsc = svm_read_l1_tsc,
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 5cede40..82dfe42 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -616,6 +616,7 @@ struct vcpu_vmx {
>>>  	u64 hv_deadline_tsc;
>>>  
>>>  	u64 current_tsc_ratio;
>>> +	u64 cached_tsc_offset;
>>>  
>>>  	bool guest_pkru_valid;
>>>  	u32 guest_pkru;
>>> @@ -2608,6 +2609,11 @@ static u64 vmx_read_tsc_offset(struct kvm_vcpu *vcpu)
>>>  	return vmcs_read64(TSC_OFFSET);
>>>  }
>>>  
>>> +static u64 vmx_read_cached_tsc_offset(struct kvm_vcpu *vcpu)
>>> +{
>>> +	return to_vmx(vcpu)->cached_tsc_offset;
>>> +}
>>> +
>>>  /*
>>>   * writes 'offset' into guest's timestamp counter offset register
>>>   */
>>> @@ -2632,6 +2638,7 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>>>  					   vmcs_read64(TSC_OFFSET), offset);
>>>  		vmcs_write64(TSC_OFFSET, offset);
>>>  	}
>>> +	to_vmx(vcpu)->cached_tsc_offset = offset;
>>>  }
>>>  
>>>  static void vmx_adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 adjustment)
>>> @@ -11275,6 +11282,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
>>>  	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
>>>  
>>>  	.read_tsc_offset = vmx_read_tsc_offset,
>>> +	.read_cached_tsc_offset = vmx_read_cached_tsc_offset,
>>>  	.write_tsc_offset = vmx_write_tsc_offset,
>>>  	.adjust_tsc_offset_guest = vmx_adjust_tsc_offset_guest,
>>>  	.read_l1_tsc = vmx_read_l1_tsc,  
>>
>> You need to handle SVM as well.  So you might as well simplify the code:
> 
> SVM is handled:
> 
> 	+	.read_cached_tsc_offset = svm_read_tsc_offset,
> 
>> - add a kvm_vcpu_write_tsc_offset wrapper for kvm_x86_ops->write_tsc_offset
>>
>> - add a tsc_offset field in struct kvm_vcpu_arch
>>
>> - replace kvm_x86_ops->read_tsc_offset with accesses to the new field
> 
> Given that SVM is handled, you still want me to do this?
> 
>> Then in a fifth patch export the TSC offset (and multiplier ;)) to
>> userspace.
>>
>> I'm not very happy about having a single file for all TSC offsets.
>> Creating subdirectories under the PID-FD per-VM directory would be nicer
>> in the long run.
> 
> I think Steven would also prefer that, but some people raised the
> concern at KVM Forum that creating per vcpu dirs in debugfs may
> consume considerable memory for a system running several dozen
> if not hundreds of VMs. This concern seems valid to me, but I
> can do either way.
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[4/4] kvm: x86: export TSC offset to user-space

Commit Message

Comments

Patch