Message ID | 1472663145-1835-5-git-send-email-lcapitulino@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, Aug 31, 2016 at 01:05:45PM -0400, Luiz Capitulino wrote: > We need to retrieve a VM's TSC offset in order to use > the host's TSC to merge host and guest traces. This is > explained in detail in this thread: > > [Qemu-devel] [RFC] host and guest kernel trace merging > https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html > > Today, the only way to retrieve a VM's TSC offset is > by using the kvm_write_tsc_offset tracepoint. This has > a few problems. First, the tracepoint is only emitted > when the VM boots, which requires a reboot to get it if > the VM is already running. Second, tracepoints are not > supposed to be ABIs in case they need to be consumed by > user-space tools. > > This commit exports a VM's TSC offset to user-space via > debugfs. A new file called "tsc-offset" is created in > the VM's debugfs directory. For example: > > /sys/kernel/debug/kvm/51696-10/tsc-offset > > This file contains one TSC offset per line, for each > vCPU. For example: > > vcpu0: 18446742405270834952 > vcpu1: 18446742405270834952 > vcpu2: 18446742405270834952 > vcpu3: 18446742405270834952 > > There are some important observations about this > solution: > > - While all vCPUs TSC offsets should be equal for the > cases we care about (ie. stable TSC and no write to > the TSC MSR), I chose to follow the spec and export > each vCPU's TSC offset (might also be helpful for > debugging) > > - The TSC offset is only useful after the VM has booted > > - We'll probably need to export the TSC multiplier too. > However, I've been using only the TSC offset for now. > So, let's get this merged first and do the TSC multiplier > as a second step Can TSC offset changes occur at runtime? One example is vcpu hotplug where the tracing tool would need to fetch the new vcpu's TSC offset after tracing has already started. Another example is if QEMU or the guest change the TSC offset while running. If the tracing tool doesn't notice this then trace events will have incorrect timestamps. Stefan
On Fri, 2 Sep 2016 09:43:01 -0400 Stefan Hajnoczi <stefanha@gmail.com> wrote: > Can TSC offset changes occur at runtime? > > One example is vcpu hotplug where the tracing tool would need to fetch > the new vcpu's TSC offset after tracing has already started. > > Another example is if QEMU or the guest change the TSC offset while > running. If the tracing tool doesn't notice this then trace events will have > incorrect timestamps. I believe there are tracepoints for these events. They would obviously need to be enabled for the tracer to catch them. I would also recommend that they go into their own instance to make sure other events do not overwrite them. -- Steve -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2 Sep 2016 09:43:01 -0400 Stefan Hajnoczi <stefanha@gmail.com> wrote: > On Wed, Aug 31, 2016 at 01:05:45PM -0400, Luiz Capitulino wrote: > > We need to retrieve a VM's TSC offset in order to use > > the host's TSC to merge host and guest traces. This is > > explained in detail in this thread: > > > > [Qemu-devel] [RFC] host and guest kernel trace merging > > https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html > > > > Today, the only way to retrieve a VM's TSC offset is > > by using the kvm_write_tsc_offset tracepoint. This has > > a few problems. First, the tracepoint is only emitted > > when the VM boots, which requires a reboot to get it if > > the VM is already running. Second, tracepoints are not > > supposed to be ABIs in case they need to be consumed by > > user-space tools. > > > > This commit exports a VM's TSC offset to user-space via > > debugfs. A new file called "tsc-offset" is created in > > the VM's debugfs directory. For example: > > > > /sys/kernel/debug/kvm/51696-10/tsc-offset > > > > This file contains one TSC offset per line, for each > > vCPU. For example: > > > > vcpu0: 18446742405270834952 > > vcpu1: 18446742405270834952 > > vcpu2: 18446742405270834952 > > vcpu3: 18446742405270834952 > > > > There are some important observations about this > > solution: > > > > - While all vCPUs TSC offsets should be equal for the > > cases we care about (ie. stable TSC and no write to > > the TSC MSR), I chose to follow the spec and export > > each vCPU's TSC offset (might also be helpful for > > debugging) > > > > - The TSC offset is only useful after the VM has booted > > > > - We'll probably need to export the TSC multiplier too. > > However, I've been using only the TSC offset for now. > > So, let's get this merged first and do the TSC multiplier > > as a second step > > Can TSC offset changes occur at runtime? Yes. IIRC, if the system has an unstable TSC, KVM will adjust the TSC offset when migrating the vCPU to other cores (although tracing with unstable TSC is not supported). Also, the guest can write to the TSC MSR at any time (although I don't know if Linux ever does this). > One example is vcpu hotplug where the tracing tool would need to fetch > the new vcpu's TSC offset after tracing has already started. > > Another example is if QEMU or the guest change the TSC offset while > running. If the tracing tool doesn't notice this then trace events will have > incorrect timestamps. I guess that what tools like trace-cmd want to do in those cases is to warn the user and discard the trace. A simple way of doing this would be to re-check that the TSC offset are the same after tracing is done. It could also use inotify, in case it works for debugfs (never tried it myself). -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2 Sep 2016 12:26:55 -0400 Luiz Capitulino <lcapitulino@redhat.com> wrote: > I guess that what tools like trace-cmd want to do in those cases > is to warn the user and discard the trace. A simple way of doing > this would be to re-check that the TSC offset are the same after > tracing is done. It could also use inotify, in case it works > for debugfs (never tried it myself). The second idea was probably stupid, never mind me :) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 31/08/2016 19:05, Luiz Capitulino wrote: > vcpu0: 18446742405270834952 > vcpu1: 18446742405270834952 > vcpu2: 18446742405270834952 > vcpu3: 18446742405270834952 > > - We'll probably need to export the TSC multiplier too. > However, I've been using only the TSC offset for now. > So, let's get this merged first and do the TSC multiplier > as a second step You'll need to export the number of fractional bits in the multiplier, too. It's going to be a very simple patch, so please do everything now. arch/x86/kvm/x86.c is huge; please create a new file arch/x86/kvm/debugfs.c. > Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> > --- > arch/x86/include/asm/kvm_host.h | 1 + > arch/x86/kvm/svm.c | 1 + > arch/x86/kvm/vmx.c | 8 ++++++++ > arch/x86/kvm/x86.c | 30 ++++++++++++++++++++++++++++++ > 4 files changed, 40 insertions(+) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 33ae3a4..5714bbd 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -952,6 +952,7 @@ struct kvm_x86_ops { > bool (*has_wbinvd_exit)(void); > > u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu); > + u64 (*read_cached_tsc_offset)(struct kvm_vcpu *vcpu); > void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset); > > u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc); > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c > index af523d8..c851477 100644 > --- a/arch/x86/kvm/svm.c > +++ b/arch/x86/kvm/svm.c > @@ -5065,6 +5065,7 @@ static struct kvm_x86_ops svm_x86_ops = { > .has_wbinvd_exit = svm_has_wbinvd_exit, > > .read_tsc_offset = svm_read_tsc_offset, > + .read_cached_tsc_offset = svm_read_tsc_offset, > .write_tsc_offset = svm_write_tsc_offset, > .adjust_tsc_offset_guest = svm_adjust_tsc_offset_guest, > .read_l1_tsc = svm_read_l1_tsc, > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 5cede40..82dfe42 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -616,6 +616,7 @@ struct vcpu_vmx { > u64 hv_deadline_tsc; > > u64 current_tsc_ratio; > + u64 cached_tsc_offset; > > bool guest_pkru_valid; > u32 guest_pkru; > @@ -2608,6 +2609,11 @@ static u64 vmx_read_tsc_offset(struct kvm_vcpu *vcpu) > return vmcs_read64(TSC_OFFSET); > } > > +static u64 vmx_read_cached_tsc_offset(struct kvm_vcpu *vcpu) > +{ > + return to_vmx(vcpu)->cached_tsc_offset; > +} > + > /* > * writes 'offset' into guest's timestamp counter offset register > */ > @@ -2632,6 +2638,7 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) > vmcs_read64(TSC_OFFSET), offset); > vmcs_write64(TSC_OFFSET, offset); > } > + to_vmx(vcpu)->cached_tsc_offset = offset; > } > > static void vmx_adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 adjustment) > @@ -11275,6 +11282,7 @@ static struct kvm_x86_ops vmx_x86_ops = { > .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit, > > .read_tsc_offset = vmx_read_tsc_offset, > + .read_cached_tsc_offset = vmx_read_cached_tsc_offset, > .write_tsc_offset = vmx_write_tsc_offset, > .adjust_tsc_offset_guest = vmx_adjust_tsc_offset_guest, > .read_l1_tsc = vmx_read_l1_tsc, You need to handle SVM as well. So you might as well simplify the code: - add a kvm_vcpu_write_tsc_offset wrapper for kvm_x86_ops->write_tsc_offset - add a tsc_offset field in struct kvm_vcpu_arch - replace kvm_x86_ops->read_tsc_offset with accesses to the new field Then in a fifth patch export the TSC offset (and multiplier ;)) to userspace. I'm not very happy about having a single file for all TSC offsets. Creating subdirectories under the PID-FD per-VM directory would be nicer in the long run. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2 Sep 2016 19:00:41 +0200 Paolo Bonzini <pbonzini@redhat.com> wrote: > On 31/08/2016 19:05, Luiz Capitulino wrote: > > vcpu0: 18446742405270834952 > > vcpu1: 18446742405270834952 > > vcpu2: 18446742405270834952 > > vcpu3: 18446742405270834952 > > > > - We'll probably need to export the TSC multiplier too. > > However, I've been using only the TSC offset for now. > > So, let's get this merged first and do the TSC multiplier > > as a second step > > You'll need to export the number of fractional bits in the multiplier, > too. It's going to be a very simple patch, so please do everything now. I didn't want to expose the multiplier before testing our tracing procedure with it. So far we've been only using the TSC offset (and it works great). I don't even know if I have a machine around to test it, so it could take a bit. > arch/x86/kvm/x86.c is huge; please create a new file arch/x86/kvm/debugfs.c. Will do. > > Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> > > --- > > arch/x86/include/asm/kvm_host.h | 1 + > > arch/x86/kvm/svm.c | 1 + > > arch/x86/kvm/vmx.c | 8 ++++++++ > > arch/x86/kvm/x86.c | 30 ++++++++++++++++++++++++++++++ > > 4 files changed, 40 insertions(+) > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > > index 33ae3a4..5714bbd 100644 > > --- a/arch/x86/include/asm/kvm_host.h > > +++ b/arch/x86/include/asm/kvm_host.h > > @@ -952,6 +952,7 @@ struct kvm_x86_ops { > > bool (*has_wbinvd_exit)(void); > > > > u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu); > > + u64 (*read_cached_tsc_offset)(struct kvm_vcpu *vcpu); > > void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset); > > > > u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc); > > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c > > index af523d8..c851477 100644 > > --- a/arch/x86/kvm/svm.c > > +++ b/arch/x86/kvm/svm.c > > @@ -5065,6 +5065,7 @@ static struct kvm_x86_ops svm_x86_ops = { > > .has_wbinvd_exit = svm_has_wbinvd_exit, > > > > .read_tsc_offset = svm_read_tsc_offset, > > + .read_cached_tsc_offset = svm_read_tsc_offset, > > .write_tsc_offset = svm_write_tsc_offset, > > .adjust_tsc_offset_guest = svm_adjust_tsc_offset_guest, > > .read_l1_tsc = svm_read_l1_tsc, > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > > index 5cede40..82dfe42 100644 > > --- a/arch/x86/kvm/vmx.c > > +++ b/arch/x86/kvm/vmx.c > > @@ -616,6 +616,7 @@ struct vcpu_vmx { > > u64 hv_deadline_tsc; > > > > u64 current_tsc_ratio; > > + u64 cached_tsc_offset; > > > > bool guest_pkru_valid; > > u32 guest_pkru; > > @@ -2608,6 +2609,11 @@ static u64 vmx_read_tsc_offset(struct kvm_vcpu *vcpu) > > return vmcs_read64(TSC_OFFSET); > > } > > > > +static u64 vmx_read_cached_tsc_offset(struct kvm_vcpu *vcpu) > > +{ > > + return to_vmx(vcpu)->cached_tsc_offset; > > +} > > + > > /* > > * writes 'offset' into guest's timestamp counter offset register > > */ > > @@ -2632,6 +2638,7 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) > > vmcs_read64(TSC_OFFSET), offset); > > vmcs_write64(TSC_OFFSET, offset); > > } > > + to_vmx(vcpu)->cached_tsc_offset = offset; > > } > > > > static void vmx_adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 adjustment) > > @@ -11275,6 +11282,7 @@ static struct kvm_x86_ops vmx_x86_ops = { > > .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit, > > > > .read_tsc_offset = vmx_read_tsc_offset, > > + .read_cached_tsc_offset = vmx_read_cached_tsc_offset, > > .write_tsc_offset = vmx_write_tsc_offset, > > .adjust_tsc_offset_guest = vmx_adjust_tsc_offset_guest, > > .read_l1_tsc = vmx_read_l1_tsc, > > You need to handle SVM as well. So you might as well simplify the code: SVM is handled: + .read_cached_tsc_offset = svm_read_tsc_offset, > - add a kvm_vcpu_write_tsc_offset wrapper for kvm_x86_ops->write_tsc_offset > > - add a tsc_offset field in struct kvm_vcpu_arch > > - replace kvm_x86_ops->read_tsc_offset with accesses to the new field Given that SVM is handled, you still want me to do this? > Then in a fifth patch export the TSC offset (and multiplier ;)) to > userspace. > > I'm not very happy about having a single file for all TSC offsets. > Creating subdirectories under the PID-FD per-VM directory would be nicer > in the long run. I think Steven would also prefer that, but some people raised the concern at KVM Forum that creating per vcpu dirs in debugfs may consume considerable memory for a system running several dozen if not hundreds of VMs. This concern seems valid to me, but I can do either way. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Sep 02, 2016 at 09:43:01AM -0400, Stefan Hajnoczi wrote: > On Wed, Aug 31, 2016 at 01:05:45PM -0400, Luiz Capitulino wrote: > > We need to retrieve a VM's TSC offset in order to use > > the host's TSC to merge host and guest traces. This is > > explained in detail in this thread: > > > > [Qemu-devel] [RFC] host and guest kernel trace merging > > https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html > > > > Today, the only way to retrieve a VM's TSC offset is > > by using the kvm_write_tsc_offset tracepoint. This has > > a few problems. First, the tracepoint is only emitted > > when the VM boots, which requires a reboot to get it if > > the VM is already running. Second, tracepoints are not > > supposed to be ABIs in case they need to be consumed by > > user-space tools. > > > > This commit exports a VM's TSC offset to user-space via > > debugfs. A new file called "tsc-offset" is created in > > the VM's debugfs directory. For example: > > > > /sys/kernel/debug/kvm/51696-10/tsc-offset > > > > This file contains one TSC offset per line, for each > > vCPU. For example: > > > > vcpu0: 18446742405270834952 > > vcpu1: 18446742405270834952 > > vcpu2: 18446742405270834952 > > vcpu3: 18446742405270834952 > > > > There are some important observations about this > > solution: > > > > - While all vCPUs TSC offsets should be equal for the > > cases we care about (ie. stable TSC and no write to > > the TSC MSR), I chose to follow the spec and export > > each vCPU's TSC offset (might also be helpful for > > debugging) > > > > - The TSC offset is only useful after the VM has booted > > > > - We'll probably need to export the TSC multiplier too. > > However, I've been using only the TSC offset for now. > > So, let's get this merged first and do the TSC multiplier > > as a second step > > Can TSC offset changes occur at runtime? > > One example is vcpu hotplug where the tracing tool would need to fetch > the new vcpu's TSC offset after tracing has already started. > > Another example is if QEMU or the guest change the TSC offset while > running. If the tracing tool doesn't notice this then trace events will have > incorrect timestamps. > > Stefan Yes they can, and the interface should mention that "the user is responsible for handling races of execution" (IMO). So the workflow is: 1) User boots VM and knows the state of the VM. 2) User runs trace-cmd on the host. Is there a need to automate gathering of traces? (that is to know the state of reboots and so forth). I don't see one. In that case, the above workflow is functional. Can you add such comments to the interface Luiz (that the value read is potentially stale). -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Sep 02, 2016 at 10:15:41AM -0400, Steven Rostedt wrote: > On Fri, 2 Sep 2016 09:43:01 -0400 > Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > Can TSC offset changes occur at runtime? Yes, but Linux guests don't write to the TSC offset after booting and unless user does manual TSC writes. > > One example is vcpu hotplug where the tracing tool would need to fetch > > the new vcpu's TSC offset after tracing has already started. > > > > Another example is if QEMU or the guest change the TSC offset while > > running. If the tracing tool doesn't notice this then trace events will have > > incorrect timestamps. So what happens is this: HostTSC (a variable). GuestTSC (variable) = GuestTSCOffset (fixed) + HostTSC (variable) Then the algorithm processes the trace as follows: line = for each line(guest_trace) line = line - GuestTSCOffset (only the timestamp of course) So from the moment the guest writes a new TSC offset, the host should use the new TSC offset to subtract from the trace entries. The trace entries are in fact: HostTSC + GuestTSCOffset So the guest trace should contain entries for "USE NEW TSC OFFSET, VALUE: xxx", which can be done (hum not sure if guest entries or host entries). However, correct me if i am wrong, the usecase seems to be: 1) Boot guest. 2) run trace-cmd 3) run workload 4) read traces on host. Another option is to have notifications as follows: record on a buffer the following: [ EVENT: TSC offset write, VAL: [ host tsc value, guest tsc offset ] ] [ EVENT: TSC offset write, VAL: [ host tsc value, guest tsc offset ] ] Then when merging the trace entries, you do: line = for each line(host trace) write_to_merged_trace(line) if (contains_tsc_offset_event(line)) { GuestTSCOffset = line.GuestTSCOffset if (!guest_tsc_offset_initialized) { process_all_guest_lines( line = line - GuestTSCOffset (only the timestamp of course) } } Aha, fail: the traces on the host are not sufficient to know when to use which offset to subtract on the guest trace. So the only possibility is to have the guest inform the occurrence of the events: however the guest does not have access to the TSC offset. So the host needs to inform the new tsc offset value and the guest needs to inform when the event occurred on its side. So the algorithm can use information on both traces to know which value to subtract on the algorithm above. Is this necessary? Or people do: 1) Boot guest. 2) run trace-cmd 3) run workload 4) read traces on host. > I believe there are tracepoints for these events. They would obviously > need to be enabled for the tracer to catch them. > > I would also recommend that they go into their own instance to make > sure other events do not overwrite them. > > -- Steve -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2 Sep 2016 20:49:37 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote: > On Fri, Sep 02, 2016 at 09:43:01AM -0400, Stefan Hajnoczi wrote: > > On Wed, Aug 31, 2016 at 01:05:45PM -0400, Luiz Capitulino wrote: > > > We need to retrieve a VM's TSC offset in order to use > > > the host's TSC to merge host and guest traces. This is > > > explained in detail in this thread: > > > > > > [Qemu-devel] [RFC] host and guest kernel trace merging > > > https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html > > > > > > Today, the only way to retrieve a VM's TSC offset is > > > by using the kvm_write_tsc_offset tracepoint. This has > > > a few problems. First, the tracepoint is only emitted > > > when the VM boots, which requires a reboot to get it if > > > the VM is already running. Second, tracepoints are not > > > supposed to be ABIs in case they need to be consumed by > > > user-space tools. > > > > > > This commit exports a VM's TSC offset to user-space via > > > debugfs. A new file called "tsc-offset" is created in > > > the VM's debugfs directory. For example: > > > > > > /sys/kernel/debug/kvm/51696-10/tsc-offset > > > > > > This file contains one TSC offset per line, for each > > > vCPU. For example: > > > > > > vcpu0: 18446742405270834952 > > > vcpu1: 18446742405270834952 > > > vcpu2: 18446742405270834952 > > > vcpu3: 18446742405270834952 > > > > > > There are some important observations about this > > > solution: > > > > > > - While all vCPUs TSC offsets should be equal for the > > > cases we care about (ie. stable TSC and no write to > > > the TSC MSR), I chose to follow the spec and export > > > each vCPU's TSC offset (might also be helpful for > > > debugging) > > > > > > - The TSC offset is only useful after the VM has booted > > > > > > - We'll probably need to export the TSC multiplier too. > > > However, I've been using only the TSC offset for now. > > > So, let's get this merged first and do the TSC multiplier > > > as a second step > > > > Can TSC offset changes occur at runtime? > > > > One example is vcpu hotplug where the tracing tool would need to fetch > > the new vcpu's TSC offset after tracing has already started. > > > > Another example is if QEMU or the guest change the TSC offset while > > running. If the tracing tool doesn't notice this then trace events will have > > incorrect timestamps. > > > > Stefan > > Yes they can, and the interface should mention that "the user is > responsible for handling races of execution" (IMO). > > So the workflow is: > > 1) User boots VM and knows the state of the VM. > 2) User runs trace-cmd on the host. > > Is there a need to automate gathering of traces? (that is to know the > state of reboots and so forth). I don't see one. In that case, the above > workflow is functional. > > Can you add such comments to the interface Luiz (that the value > read is potentially stale). Sure, no problem. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2 Sep 2016 21:23:11 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote: > On Fri, Sep 02, 2016 at 10:15:41AM -0400, Steven Rostedt wrote: > > On Fri, 2 Sep 2016 09:43:01 -0400 > > Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > Can TSC offset changes occur at runtime? > > Yes, but Linux guests don't write to the TSC offset > after booting and unless user does manual TSC writes. > > > > One example is vcpu hotplug where the tracing tool would need to fetch > > > the new vcpu's TSC offset after tracing has already started. > > > > > > Another example is if QEMU or the guest change the TSC offset while > > > running. If the tracing tool doesn't notice this then trace events will have > > > incorrect timestamps. > > So what happens is this: > > HostTSC (a variable). > GuestTSC (variable) = GuestTSCOffset (fixed) + HostTSC (variable) The same idea has been done by Yoshihiro http://events.linuxfoundation.org/sites/events/files/cojp13_yunomae.pdf > Then the algorithm processes the trace as follows: > line = for each line(guest_trace) > line = line - GuestTSCOffset (only the timestamp of course) > > So from the moment the guest writes a new TSC offset, the host > should use the new TSC offset to subtract from the trace entries. > The trace entries are in fact: > > HostTSC + GuestTSCOffset > > So the guest trace should contain entries for "USE NEW TSC OFFSET, > VALUE: xxx", which can be done (hum not sure if guest entries > or host entries). > > However, correct me if i am wrong, the usecase seems to be: > > 1) Boot guest. > 2) run trace-cmd > 3) run workload > 4) read traces on host. IIRC, previous (current?) method is to run trace-cmd at first (before boot the guest) so that it can get tsc-offset event and can wait on a special unix domain socket. For above usecase, we have to have an interface to get the current tsc offset like Luis suggested. > > Another option is to have notifications as follows: record on a buffer > the following: > > [ EVENT: TSC offset write, VAL: [ host tsc value, guest tsc offset ] ] > [ EVENT: TSC offset write, VAL: [ host tsc value, guest tsc offset ] ] > > Then when merging the trace entries, you do: > > line = for each line(host trace) > write_to_merged_trace(line) > if (contains_tsc_offset_event(line)) { > GuestTSCOffset = line.GuestTSCOffset > if (!guest_tsc_offset_initialized) { > process_all_guest_lines( > line = line - GuestTSCOffset (only the timestamp of course) > } > } > > Aha, fail: the traces on the host are not sufficient to know when > to use which offset to subtract on the guest trace. > > So the only possibility is to have the guest inform the occurrence > of the events: however the guest does not have access to the TSC offset. > > So the host needs to inform the new tsc offset value and the guest needs > to inform when the event occurred on its side. So the algorithm can use > information on both traces to know which value to subtract on the > algorithm above. > > Is this necessary? Or people do: > 1) Boot guest. > 2) run trace-cmd > 3) run workload > 4) read traces on host. > > > I believe there are tracepoints for these events. They would obviously > > need to be enabled for the tracer to catch them. Yes, Yoshihiro introduced a tracepoint for that. http://lkml.iu.edu/hypermail/linux/kernel/1306.1/01741.html So, we have to trace this event in host side. > > > > I would also recommend that they go into their own instance to make > > sure other events do not overwrite them. > > > > -- Steve Thanks,
On 02/09/2016 19:31, Luiz Capitulino wrote: > On Fri, 2 Sep 2016 19:00:41 +0200 > Paolo Bonzini <pbonzini@redhat.com> wrote: > >> On 31/08/2016 19:05, Luiz Capitulino wrote: >>> vcpu0: 18446742405270834952 >>> vcpu1: 18446742405270834952 >>> vcpu2: 18446742405270834952 >>> vcpu3: 18446742405270834952 >>> >>> - We'll probably need to export the TSC multiplier too. >>> However, I've been using only the TSC offset for now. >>> So, let's get this merged first and do the TSC multiplier >>> as a second step >> >> You'll need to export the number of fractional bits in the multiplier, >> too. It's going to be a very simple patch, so please do everything now. > > I didn't want to expose the multiplier before testing our tracing > procedure with it. Exposing the multiplier should be independent of tracing. I can test it for you. Paolo > So far we've been only using the TSC offset (and > it works great). I don't even know if I have a machine around to > test it, so it could take a bit. > >> arch/x86/kvm/x86.c is huge; please create a new file arch/x86/kvm/debugfs.c. > > Will do. > >>> Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> >>> --- >>> arch/x86/include/asm/kvm_host.h | 1 + >>> arch/x86/kvm/svm.c | 1 + >>> arch/x86/kvm/vmx.c | 8 ++++++++ >>> arch/x86/kvm/x86.c | 30 ++++++++++++++++++++++++++++++ >>> 4 files changed, 40 insertions(+) >>> >>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h >>> index 33ae3a4..5714bbd 100644 >>> --- a/arch/x86/include/asm/kvm_host.h >>> +++ b/arch/x86/include/asm/kvm_host.h >>> @@ -952,6 +952,7 @@ struct kvm_x86_ops { >>> bool (*has_wbinvd_exit)(void); >>> >>> u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu); >>> + u64 (*read_cached_tsc_offset)(struct kvm_vcpu *vcpu); >>> void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset); >>> >>> u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc); >>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c >>> index af523d8..c851477 100644 >>> --- a/arch/x86/kvm/svm.c >>> +++ b/arch/x86/kvm/svm.c >>> @@ -5065,6 +5065,7 @@ static struct kvm_x86_ops svm_x86_ops = { >>> .has_wbinvd_exit = svm_has_wbinvd_exit, >>> >>> .read_tsc_offset = svm_read_tsc_offset, >>> + .read_cached_tsc_offset = svm_read_tsc_offset, >>> .write_tsc_offset = svm_write_tsc_offset, >>> .adjust_tsc_offset_guest = svm_adjust_tsc_offset_guest, >>> .read_l1_tsc = svm_read_l1_tsc, >>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c >>> index 5cede40..82dfe42 100644 >>> --- a/arch/x86/kvm/vmx.c >>> +++ b/arch/x86/kvm/vmx.c >>> @@ -616,6 +616,7 @@ struct vcpu_vmx { >>> u64 hv_deadline_tsc; >>> >>> u64 current_tsc_ratio; >>> + u64 cached_tsc_offset; >>> >>> bool guest_pkru_valid; >>> u32 guest_pkru; >>> @@ -2608,6 +2609,11 @@ static u64 vmx_read_tsc_offset(struct kvm_vcpu *vcpu) >>> return vmcs_read64(TSC_OFFSET); >>> } >>> >>> +static u64 vmx_read_cached_tsc_offset(struct kvm_vcpu *vcpu) >>> +{ >>> + return to_vmx(vcpu)->cached_tsc_offset; >>> +} >>> + >>> /* >>> * writes 'offset' into guest's timestamp counter offset register >>> */ >>> @@ -2632,6 +2638,7 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) >>> vmcs_read64(TSC_OFFSET), offset); >>> vmcs_write64(TSC_OFFSET, offset); >>> } >>> + to_vmx(vcpu)->cached_tsc_offset = offset; >>> } >>> >>> static void vmx_adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 adjustment) >>> @@ -11275,6 +11282,7 @@ static struct kvm_x86_ops vmx_x86_ops = { >>> .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit, >>> >>> .read_tsc_offset = vmx_read_tsc_offset, >>> + .read_cached_tsc_offset = vmx_read_cached_tsc_offset, >>> .write_tsc_offset = vmx_write_tsc_offset, >>> .adjust_tsc_offset_guest = vmx_adjust_tsc_offset_guest, >>> .read_l1_tsc = vmx_read_l1_tsc, >> >> You need to handle SVM as well. So you might as well simplify the code: > > SVM is handled: > > + .read_cached_tsc_offset = svm_read_tsc_offset, > >> - add a kvm_vcpu_write_tsc_offset wrapper for kvm_x86_ops->write_tsc_offset >> >> - add a tsc_offset field in struct kvm_vcpu_arch >> >> - replace kvm_x86_ops->read_tsc_offset with accesses to the new field > > Given that SVM is handled, you still want me to do this? > >> Then in a fifth patch export the TSC offset (and multiplier ;)) to >> userspace. >> >> I'm not very happy about having a single file for all TSC offsets. >> Creating subdirectories under the PID-FD per-VM directory would be nicer >> in the long run. > > I think Steven would also prefer that, but some people raised the > concern at KVM Forum that creating per vcpu dirs in debugfs may > consume considerable memory for a system running several dozen > if not hundreds of VMs. This concern seems valid to me, but I > can do either way. > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 33ae3a4..5714bbd 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -952,6 +952,7 @@ struct kvm_x86_ops { bool (*has_wbinvd_exit)(void); u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu); + u64 (*read_cached_tsc_offset)(struct kvm_vcpu *vcpu); void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset); u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index af523d8..c851477 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -5065,6 +5065,7 @@ static struct kvm_x86_ops svm_x86_ops = { .has_wbinvd_exit = svm_has_wbinvd_exit, .read_tsc_offset = svm_read_tsc_offset, + .read_cached_tsc_offset = svm_read_tsc_offset, .write_tsc_offset = svm_write_tsc_offset, .adjust_tsc_offset_guest = svm_adjust_tsc_offset_guest, .read_l1_tsc = svm_read_l1_tsc, diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 5cede40..82dfe42 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -616,6 +616,7 @@ struct vcpu_vmx { u64 hv_deadline_tsc; u64 current_tsc_ratio; + u64 cached_tsc_offset; bool guest_pkru_valid; u32 guest_pkru; @@ -2608,6 +2609,11 @@ static u64 vmx_read_tsc_offset(struct kvm_vcpu *vcpu) return vmcs_read64(TSC_OFFSET); } +static u64 vmx_read_cached_tsc_offset(struct kvm_vcpu *vcpu) +{ + return to_vmx(vcpu)->cached_tsc_offset; +} + /* * writes 'offset' into guest's timestamp counter offset register */ @@ -2632,6 +2638,7 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) vmcs_read64(TSC_OFFSET), offset); vmcs_write64(TSC_OFFSET, offset); } + to_vmx(vcpu)->cached_tsc_offset = offset; } static void vmx_adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 adjustment) @@ -11275,6 +11282,7 @@ static struct kvm_x86_ops vmx_x86_ops = { .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit, .read_tsc_offset = vmx_read_tsc_offset, + .read_cached_tsc_offset = vmx_read_cached_tsc_offset, .write_tsc_offset = vmx_write_tsc_offset, .adjust_tsc_offset_guest = vmx_adjust_tsc_offset_guest, .read_l1_tsc = vmx_read_l1_tsc, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 18dfbac..75a8e23 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -54,6 +54,7 @@ #include <linux/pvclock_gtod.h> #include <linux/kvm_irqfd.h> #include <linux/irqbypass.h> +#include <linux/debugfs.h> #include <trace/events/kvm.h> #include <asm/debugreg.h> @@ -7779,8 +7780,37 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) return 0; } +static int tsc_offset_show(struct seq_file *m, void *data) +{ + struct kvm *kvm = m->private; + struct kvm_vcpu *vcpu; + int i; + + kvm_for_each_vcpu(i, vcpu, kvm) + seq_printf(m, "vcpu%d: %llu\n", + vcpu->vcpu_id, kvm_x86_ops->read_cached_tsc_offset(vcpu)); + + return 0; +} + +static int tsc_offset_open(struct inode *inode, struct file *file) +{ + return single_open(file, tsc_offset_show, inode->i_private); +} + +static const struct file_operations tsc_offset_fops = { + .owner = THIS_MODULE, + .open = tsc_offset_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + int kvm_arch_create_vm_debugfs(struct kvm *kvm) { + if (!debugfs_create_file("tsc-offset", 0444, + kvm->debugfs_dentry, kvm, &tsc_offset_fops)) + return -ENOMEM; return 0; }
We need to retrieve a VM's TSC offset in order to use the host's TSC to merge host and guest traces. This is explained in detail in this thread: [Qemu-devel] [RFC] host and guest kernel trace merging https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html Today, the only way to retrieve a VM's TSC offset is by using the kvm_write_tsc_offset tracepoint. This has a few problems. First, the tracepoint is only emitted when the VM boots, which requires a reboot to get it if the VM is already running. Second, tracepoints are not supposed to be ABIs in case they need to be consumed by user-space tools. This commit exports a VM's TSC offset to user-space via debugfs. A new file called "tsc-offset" is created in the VM's debugfs directory. For example: /sys/kernel/debug/kvm/51696-10/tsc-offset This file contains one TSC offset per line, for each vCPU. For example: vcpu0: 18446742405270834952 vcpu1: 18446742405270834952 vcpu2: 18446742405270834952 vcpu3: 18446742405270834952 There are some important observations about this solution: - While all vCPUs TSC offsets should be equal for the cases we care about (ie. stable TSC and no write to the TSC MSR), I chose to follow the spec and export each vCPU's TSC offset (might also be helpful for debugging) - The TSC offset is only useful after the VM has booted - We'll probably need to export the TSC multiplier too. However, I've been using only the TSC offset for now. So, let's get this merged first and do the TSC multiplier as a second step Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/svm.c | 1 + arch/x86/kvm/vmx.c | 8 ++++++++ arch/x86/kvm/x86.c | 30 ++++++++++++++++++++++++++++++ 4 files changed, 40 insertions(+)