diff mbox series

[V2,02/11] perf/x86: Add support for TSC as a perf event clock

Message ID 20220214110914.268126-3-adrian.hunter@intel.com (mailing list archive)
State New, archived
Series perf intel-pt: Add perf event clocks to better support VM tracing

Commit Message

Adrian Hunter Feb. 14, 2022, 11:09 a.m. UTC
Currently, using Intel PT to trace a VM guest is limited to kernel space
because decoding requires side band events such as MMAP and CONTEXT_SWITCH.
While these events can be collected for the host, there is not yet a way to
do that for a guest. One approach would be to collect them inside the
guest, but that would require being able to synchronize with host
timestamps.

The motivation for this patch is to provide a clock that can be used within
a VM guest, and that correlates to a VM host clock. In the case of TSC, if
the hypervisor leaves rdtsc alone, the TSC value will be subject only to
the VMCS TSC Offset and Scaling. Adjusting for that would make it possible
to inject events from a guest perf.data file into a host perf.data file,
and so to collect VM guest side band for Intel PT decoding.
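
For illustration only, the adjustment mentioned above might look roughly like
the sketch below. It assumes the Intel VMX behaviour where a guest rdtsc
returns ((host TSC * multiplier) >> 48) + offset, with the multiplier being a
fixed-point value with 48 fractional bits. The function and macro names are
hypothetical and are not part of this patch, and unsigned __int128 is a
GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed number of fractional bits in the VMCS TSC multiplier (Intel VMX). */
#define VMX_TSC_MULT_FRAC_BITS 48

/*
 * Hypothetical helper: TSC value a guest would read for a given host TSC.
 * The offset is added modulo 2^64, so a "negative" offset is passed as its
 * two's complement.
 */
uint64_t host_to_guest_tsc(uint64_t host_tsc, uint64_t mult, uint64_t offset)
{
	unsigned __int128 scaled = (unsigned __int128)host_tsc * mult;

	return (uint64_t)(scaled >> VMX_TSC_MULT_FRAC_BITS) + offset;
}

/*
 * Hypothetical inverse: map a guest timestamp back onto the host TSC.
 * The division rounds down, so the result can be slightly low when the
 * multiplier is not exactly 1.0.
 */
uint64_t guest_to_host_tsc(uint64_t guest_tsc, uint64_t mult, uint64_t offset)
{
	unsigned __int128 scaled;

	assert(mult != 0);
	scaled = (unsigned __int128)(guest_tsc - offset) << VMX_TSC_MULT_FRAC_BITS;
	return (uint64_t)(scaled / mult);
}
```

A tool injecting guest events into a host perf.data file would apply something
like guest_to_host_tsc() to each guest timestamp, given the offset and
multiplier the hypervisor used for that VM.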

There are other potential benefits of TSC as a perf event clock:
	- ability to work directly with TSC
	- ability to inject non-Intel-PT-related events from a guest

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 arch/x86/events/core.c            | 16 +++++++++
 arch/x86/include/asm/perf_event.h |  3 ++
 include/uapi/linux/perf_event.h   | 12 ++++++-
 kernel/events/core.c              | 57 +++++++++++++++++++------------
 4 files changed, 65 insertions(+), 23 deletions(-)

Comments

Peter Zijlstra March 4, 2022, 12:30 p.m. UTC | #1
On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
> Currently, using Intel PT to trace a VM guest is limited to kernel space
> because decoding requires side band events such as MMAP and CONTEXT_SWITCH.
> While these events can be collected for the host, there is not a way to do
> that yet for a guest. One approach, would be to collect them inside the
> guest, but that would require being able to synchronize with host
> timestamps.
> 
> The motivation for this patch is to provide a clock that can be used within
> a VM guest, and that correlates to a VM host clock. In the case of TSC, if
> the hypervisor leaves rdtsc alone, the TSC value will be subject only to
> the VMCS TSC Offset and Scaling. Adjusting for that would make it possible
> to inject events from a guest perf.data file, into a host perf.data file.
> 
> Thus making possible the collection of VM guest side band for Intel PT
> decoding.
> 
> There are other potential benefits of TSC as a perf event clock:
> 	- ability to work directly with TSC
> 	- ability to inject non-Intel-PT-related events from a guest
> 
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
>  arch/x86/events/core.c            | 16 +++++++++
>  arch/x86/include/asm/perf_event.h |  3 ++
>  include/uapi/linux/perf_event.h   | 12 ++++++-
>  kernel/events/core.c              | 57 +++++++++++++++++++------------
>  4 files changed, 65 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index e686c5e0537b..51d5345de30a 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2728,6 +2728,17 @@ void arch_perf_update_userpage(struct perf_event *event,
>  		!!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
>  	userpg->pmc_width = x86_pmu.cntval_bits;
>  
> +	if (event->attr.use_clockid &&
> +	    event->attr.ns_clockid &&
> +	    event->attr.clockid == CLOCK_PERF_HW_CLOCK) {
> +		userpg->cap_user_time_zero = 1;
> +		userpg->time_mult = 1;
> +		userpg->time_shift = 0;
> +		userpg->time_offset = 0;
> +		userpg->time_zero = 0;
> +		return;
> +	}
> +
>  	if (!using_native_sched_clock() || !sched_clock_stable())
>  		return;

This looks the wrong way around. If TSC is found unstable, we should
never expose it.

And I'm not at all sure about the whole virt thing. Last time I looked
at pvclock it made no sense at all.
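
As a sketch of what the userpage values in the quoted hunk mean for a reader of
the mmap page: the conversion below follows the formula documented for
cap_user_time_zero in include/uapi/linux/perf_event.h. The struct is a
hypothetical local subset of struct perf_event_mmap_page, and the function name
is an assumption for illustration. With time_mult = 1, time_shift = 0 and
time_zero = 0, the conversion reduces to the identity, so perf time equals the
raw TSC value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical local subset of struct perf_event_mmap_page. */
struct user_time_params {
	uint16_t time_shift;
	uint32_t time_mult;
	uint64_t time_zero;
};

/* Convert a cycle count (e.g. from rdtsc) to a perf timestamp. */
uint64_t cycles_to_perf_time(uint64_t cyc, const struct user_time_params *p)
{
	uint64_t quot, rem;

	assert(p->time_shift < 64);	/* a shift of 64 or more is undefined */
	quot = cyc >> p->time_shift;
	rem = cyc & (((uint64_t)1 << p->time_shift) - 1);

	return p->time_zero + quot * p->time_mult +
	       ((rem * p->time_mult) >> p->time_shift);
}
```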

Peter Zijlstra March 4, 2022, 12:32 p.m. UTC | #2
On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 82858b697c05..e8617efd552b 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -290,6 +290,15 @@ enum {
>  	PERF_TXN_ABORT_SHIFT = 32,
>  };
>  
> +/*
> + * If supported, clockid value to select an architecture dependent hardware
> + * clock. Note this means the unit of time is ticks not nanoseconds.
> + * Requires ns_clockid to be set in addition to use_clockid.
> + * On x86, this clock is provided by the rdtsc instruction, and is not
> + * paravirtualized.
> + */
> +#define CLOCK_PERF_HW_CLOCK		0x10000000
> +
>  /*
>   * The format of the data returned by read() on a perf event fd,
>   * as specified by attr.read_format:
> @@ -409,7 +418,8 @@ struct perf_event_attr {
>  				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
>  				remove_on_exec :  1, /* event is removed from task on exec */
>  				sigtrap        :  1, /* send synchronous SIGTRAP on event */
> -				__reserved_1   : 26;
> +				ns_clockid     :  1, /* non-standard clockid */
> +				__reserved_1   : 25;
>  
>  	union {
>  		__u32		wakeup_events;	  /* wakeup every n events */

Thomas, do we want to gate this behind this magic flag, or can that
CLOCKID be granted unconditionally?

Peter Zijlstra March 4, 2022, 12:33 p.m. UTC | #3
On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
> +u64 perf_hw_clock(void)
> +{
> +	return rdtsc_ordered();
> +}

Why the _ordered ?

Adrian Hunter March 4, 2022, 12:41 p.m. UTC | #4
On 04/03/2022 14:33, Peter Zijlstra wrote:
> On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
>> +u64 perf_hw_clock(void)
>> +{
>> +	return rdtsc_ordered();
>> +}
> 
> Why the _ordered ?

To be on the safe side, in case it matters. trace_clock_x86_tsc() also uses the ordered variant.

Adrian Hunter March 4, 2022, 1:03 p.m. UTC | #5
On 04/03/2022 14:30, Peter Zijlstra wrote:
> On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
>> Currently, using Intel PT to trace a VM guest is limited to kernel space
>> because decoding requires side band events such as MMAP and CONTEXT_SWITCH.
>> While these events can be collected for the host, there is not a way to do
>> that yet for a guest. One approach, would be to collect them inside the
>> guest, but that would require being able to synchronize with host
>> timestamps.
>>
>> The motivation for this patch is to provide a clock that can be used within
>> a VM guest, and that correlates to a VM host clock. In the case of TSC, if
>> the hypervisor leaves rdtsc alone, the TSC value will be subject only to
>> the VMCS TSC Offset and Scaling. Adjusting for that would make it possible
>> to inject events from a guest perf.data file, into a host perf.data file.
>>
>> Thus making possible the collection of VM guest side band for Intel PT
>> decoding.
>>
>> There are other potential benefits of TSC as a perf event clock:
>> 	- ability to work directly with TSC
>> 	- ability to inject non-Intel-PT-related events from a guest
>>
>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
>> ---
>>  arch/x86/events/core.c            | 16 +++++++++
>>  arch/x86/include/asm/perf_event.h |  3 ++
>>  include/uapi/linux/perf_event.h   | 12 ++++++-
>>  kernel/events/core.c              | 57 +++++++++++++++++++------------
>>  4 files changed, 65 insertions(+), 23 deletions(-)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index e686c5e0537b..51d5345de30a 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -2728,6 +2728,17 @@ void arch_perf_update_userpage(struct perf_event *event,
>>  		!!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
>>  	userpg->pmc_width = x86_pmu.cntval_bits;
>>  
>> +	if (event->attr.use_clockid &&
>> +	    event->attr.ns_clockid &&
>> +	    event->attr.clockid == CLOCK_PERF_HW_CLOCK) {
>> +		userpg->cap_user_time_zero = 1;
>> +		userpg->time_mult = 1;
>> +		userpg->time_shift = 0;
>> +		userpg->time_offset = 0;
>> +		userpg->time_zero = 0;
>> +		return;
>> +	}
>> +
>>  	if (!using_native_sched_clock() || !sched_clock_stable())
>>  		return;
> 
> This looks the wrong way around. If TSC is found unstable, we should
> never expose it.

Intel PT traces contain TSC whether or not it is stable, and it could
still be usable in some cases e.g. short traces on a single CPU.

Ftrace seems to offer x86-tsc unconditionally as a clock.

We could add warnings to comments and documentation about its potential
pitfalls.

> 
> And I'm not at all sure about the whole virt thing. Last time I looked
> at pvclock it made no sense at all.

It is certainly not useful for synchronizing events against TSC.

Thomas Gleixner March 4, 2022, 5:51 p.m. UTC | #6
On Fri, Mar 04 2022 at 13:32, Peter Zijlstra wrote:
> On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>> index 82858b697c05..e8617efd552b 100644
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -290,6 +290,15 @@ enum {
>>  	PERF_TXN_ABORT_SHIFT = 32,
>>  };
>>  
>> +/*
>> + * If supported, clockid value to select an architecture dependent hardware
>> + * clock. Note this means the unit of time is ticks not nanoseconds.
>> + * Requires ns_clockid to be set in addition to use_clockid.
>> + * On x86, this clock is provided by the rdtsc instruction, and is not
>> + * paravirtualized.
>> + */
>> +#define CLOCK_PERF_HW_CLOCK		0x10000000
>> +
>>  /*
>>   * The format of the data returned by read() on a perf event fd,
>>   * as specified by attr.read_format:
>> @@ -409,7 +418,8 @@ struct perf_event_attr {
>>  				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
>>  				remove_on_exec :  1, /* event is removed from task on exec */
>>  				sigtrap        :  1, /* send synchronous SIGTRAP on event */
>> -				__reserved_1   : 26;
>> +				ns_clockid     :  1, /* non-standard clockid */
>> +				__reserved_1   : 25;
>>  
>>  	union {
>>  		__u32		wakeup_events;	  /* wakeup every n events */
>
> Thomas, do we want to gate this behind this magic flag, or can that
> CLOCKID be granted unconditionally?

I'm not seeing a point in that flag and please define the clock id where
the other clockids are defined. We want a proper ID range for such
magically defined clocks.

We use INT_MIN < id < 16 today. I have plans to expand the ID space past
16, so using something like the above is fine.

Thanks,

        tglx

Patch

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e686c5e0537b..51d5345de30a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2728,6 +2728,17 @@  void arch_perf_update_userpage(struct perf_event *event,
 		!!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
 	userpg->pmc_width = x86_pmu.cntval_bits;
 
+	if (event->attr.use_clockid &&
+	    event->attr.ns_clockid &&
+	    event->attr.clockid == CLOCK_PERF_HW_CLOCK) {
+		userpg->cap_user_time_zero = 1;
+		userpg->time_mult = 1;
+		userpg->time_shift = 0;
+		userpg->time_offset = 0;
+		userpg->time_zero = 0;
+		return;
+	}
+
 	if (!using_native_sched_clock() || !sched_clock_stable())
 		return;
 
@@ -2980,6 +2991,11 @@  unsigned long perf_misc_flags(struct pt_regs *regs)
 	return misc;
 }
 
+u64 perf_hw_clock(void)
+{
+	return rdtsc_ordered();
+}
+
 void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 {
 	cap->version		= x86_pmu.version;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 58d9e4b1fa0a..5288ea1ae2ba 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -451,6 +451,9 @@  extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
 extern unsigned long perf_misc_flags(struct pt_regs *regs);
 #define perf_misc_flags(regs)	perf_misc_flags(regs)
 
+extern u64 perf_hw_clock(void);
+#define perf_hw_clock		perf_hw_clock
+
 #include <asm/stacktrace.h>
 
 /*
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 82858b697c05..e8617efd552b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -290,6 +290,15 @@  enum {
 	PERF_TXN_ABORT_SHIFT = 32,
 };
 
+/*
+ * If supported, clockid value to select an architecture dependent hardware
+ * clock. Note this means the unit of time is ticks not nanoseconds.
+ * Requires ns_clockid to be set in addition to use_clockid.
+ * On x86, this clock is provided by the rdtsc instruction, and is not
+ * paravirtualized.
+ */
+#define CLOCK_PERF_HW_CLOCK		0x10000000
+
 /*
  * The format of the data returned by read() on a perf event fd,
  * as specified by attr.read_format:
@@ -409,7 +418,8 @@  struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				ns_clockid     :  1, /* non-standard clockid */
+				__reserved_1   : 25;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 57249f37c37d..15dee265a5b9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12008,35 +12008,48 @@  static void mutex_lock_double(struct mutex *a, struct mutex *b)
 	mutex_lock_nested(b, SINGLE_DEPTH_NESTING);
 }
 
-static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
+static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id, bool ns_clockid)
 {
 	bool nmi_safe = false;
 
-	switch (clk_id) {
-	case CLOCK_MONOTONIC:
-		event->clock = &ktime_get_mono_fast_ns;
-		nmi_safe = true;
-		break;
+	if (ns_clockid) {
+		switch (clk_id) {
+#ifdef perf_hw_clock
+		case CLOCK_PERF_HW_CLOCK:
+			event->clock = &perf_hw_clock;
+			nmi_safe = true;
+			break;
+#endif
+		default:
+			return -EINVAL;
+		}
+	} else {
+		switch (clk_id) {
+		case CLOCK_MONOTONIC:
+			event->clock = &ktime_get_mono_fast_ns;
+			nmi_safe = true;
+			break;
 
-	case CLOCK_MONOTONIC_RAW:
-		event->clock = &ktime_get_raw_fast_ns;
-		nmi_safe = true;
-		break;
+		case CLOCK_MONOTONIC_RAW:
+			event->clock = &ktime_get_raw_fast_ns;
+			nmi_safe = true;
+			break;
 
-	case CLOCK_REALTIME:
-		event->clock = &ktime_get_real_ns;
-		break;
+		case CLOCK_REALTIME:
+			event->clock = &ktime_get_real_ns;
+			break;
 
-	case CLOCK_BOOTTIME:
-		event->clock = &ktime_get_boottime_ns;
-		break;
+		case CLOCK_BOOTTIME:
+			event->clock = &ktime_get_boottime_ns;
+			break;
 
-	case CLOCK_TAI:
-		event->clock = &ktime_get_clocktai_ns;
-		break;
+		case CLOCK_TAI:
+			event->clock = &ktime_get_clocktai_ns;
+			break;
 
-	default:
-		return -EINVAL;
+		default:
+			return -EINVAL;
+		}
 	}
 
 	if (!nmi_safe && !(event->pmu->capabilities & PERF_PMU_CAP_NO_NMI))
@@ -12245,7 +12258,7 @@  SYSCALL_DEFINE5(perf_event_open,
 	pmu = event->pmu;
 
 	if (attr.use_clockid) {
-		err = perf_event_set_clock(event, attr.clockid);
+		err = perf_event_set_clock(event, attr.clockid, attr.ns_clockid);
 		if (err)
 			goto err_alloc;
 	}