diff mbox series

VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX

Message ID Zn9X0yFxZi_Mrlnt@google.com (mailing list archive)
State New, archived
Headers show
Series VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX | expand

Commit Message

Sean Christopherson June 29, 2024, 12:39 a.m. UTC
Forking this off to try and avoid confusion...

On Wed, Jun 12, 2024, Reinette Chatre wrote:
> +/*
> + * Possible TDCR values with matching divide count. Used to modify APIC
> + * timer frequency.
> + */
> +static struct {
> +	uint32_t tdcr;
> +	uint32_t divide_count;
> +} tdcrs[] = {
> +	{0x0, 2},
> +	{0x1, 4},
> +	{0x2, 8},
> +	{0x3, 16},
> +	{0x8, 32},
> +	{0x9, 64},
> +	{0xa, 128},
> +	{0xb, 1},
> +};

...

> +static void apic_guest_code(void)
> +{
> +	uint64_t tsc_hz = (uint64_t)tsc_khz * 1000;
> +	const uint32_t tmict = ~0u;
> +	uint64_t tsc0, tsc1, freq;
> +	uint32_t tmcct;
> +	int i;
> +
> +	apic_enable();
> +
> +	/*
> +	 * Setup one-shot timer.  The vector does not matter because the
> +	 * interrupt should not fire.
> +	 */
> +	apic_write_reg(APIC_LVTT, APIC_LVT_TIMER_ONESHOT | APIC_LVT_MASKED);
> +
> +	for (i = 0; i < ARRAY_SIZE(tdcrs); i++) {
> +
> +		apic_write_reg(APIC_TDCR, tdcrs[i].tdcr);
> +		apic_write_reg(APIC_TMICT, tmict);
> +
> +		tsc0 = rdtsc();
> +		udelay(delay_ms * 1000);
> +		tmcct = apic_read_reg(APIC_TMCCT);
> +		tsc1 = rdtsc();
> +
> +		/*
> +		 * Stop the timer _after_ reading the current, final count, as
> +		 * writing the initial counter also modifies the current count.
> +		 */
> +		apic_write_reg(APIC_TMICT, 0);
> +
> +		freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
> +		/* Check if measured frequency is within 1% of configured frequency. */
> +		GUEST_ASSERT(freq < apic_hz * 101 / 100);
> +		GUEST_ASSERT(freq > apic_hz * 99 / 100);
> +	}

This test fails on our SKX, CLX, and ICX systems due to what appears to be a CPU
bug.  It looks like something APICv related is clobbering internal VMX timer state?
Or maybe there's a tearing or truncation issue?

As mentioned ad nauseum at this point, I'm offline all of next week, so hopefully
there's enough info here to get a root cause...


A spurious VM-Exit will occur after programming a vmcs.PREEMPTION_TIMER_VALUE that
shouldn't exit.  Every observed failure occurs when bits 27:16 are zero, with a
small value in bits 15:0, e.g. VM-Enter with a timer value of 0xe0003bf7 or
0xa0006db6 will cause a near-immediate VM-Exit.

Weirdly, it doesn't always happen, e.g. I've observed rollover from 0xa000xxxx
to 0x9fffxxxx without failure.  However, the *test* failure is 100% reproducible,
i.e. it's not _that_ rare of a failure.  So maybe there's state build-up required?
E.g. in the failing cases, there are 10s of entries with a slightly larger timer
value, with no preemption timer exit (the host's tick IRQ interrupts the guest,
and then KVM reprograms the VMX timer).

Even more sketchy, the failure only occurs if APICv is enabled.  Turning off APICv
makes the problem go away (I initially suspected KVM was somehow botching the TMCCT
emulation).  I am 99.9% certain there is no KVM APICv bug that is causing problems.
Our IVB servers don't fail (even with APICv enabled), nor does my Raptor Lake client
box (with APICv enabled).  And I confirmed that the VMX timer is still getting
programmed with the same sequence that fails when APICv is enabled.

(Before I realized the pattern of values), I sanity checked the VMCS field just
before VM-Enter, and again after VM-Exit (KVM runs without the CPU save
vmcs.PREEMPTION_TIMER_VALUE on exit).

I also verified the CPU thinks the timer has expired by enabling "save on exit"
and verifying vmcs.PREEMPTION_TIMER_VALUE is indeed '0', and that KVM really did
get a PREEMPTION_TIMER exit.

Attached is the debug patch I used to get the below data (sort of; one of the
post-exit prints is without saving vmcs.PREEMPTION_TIMER_VALUE on exit).

In kvm_hypercall (ignore the name, I piggybacked a tracepoint because trying to
log to dmesg was too slow, guest literally couldn't make forward progress), a1 is
the VMX timer value programmed by KVM (0xe0003bf7).
 
 apic_bus_clock_-11419   [056] d..2.   146.771179: kvm_hypercall: nr 0x2c8a9e9cc6703c a0 0x2c8b0e9ce46c37 a1 0xe0003bf7 a2 0xe0003bf7 a3 0x7  
 apic_bus_clock_-11419   [056] d..2.   146.771242: kvm_exit: vcpu 0 reason PREEMPTION_TIMER rip 0x402065 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
 VMX timer exit, VMCS = e0003bf7, delta = e0003690

and the post-exit print with the attached patch:

  kvm_intel: VMX timer exit, EXIT_REASON = 34, VMCS = 0, delta = e00037e9

And I've been fiddling with the below hack to coerce KVM into programming VMX
timer values.  Had I more time, I would have booted kernels with the ability to
properly fuzz the timer values.

Note, with TMICT=-1, only divide_count=1 will fail, because larger divide counts
result in a timeout that doesn't fit into the 32-bit VMX timer field (don't ask
me how long it took me to realize the divide count affects the time frequency,
not the actual count, *sigh*).

Comments

Reinette Chatre July 3, 2024, 8:14 p.m. UTC | #1
Hi Sean,

On 6/28/24 5:39 PM, Sean Christopherson wrote:
> Forking this off to try and avoid confusion...
> 
> On Wed, Jun 12, 2024, Reinette Chatre wrote:

...

>> +
>> +		freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
>> +		/* Check if measured frequency is within 1% of configured frequency. */
>> +		GUEST_ASSERT(freq < apic_hz * 101 / 100);
>> +		GUEST_ASSERT(freq > apic_hz * 99 / 100);
>> +	}
> 
> This test fails on our SKX, CLX, and ICX systems due to what appears to be a CPU
> bug.  It looks like something APICv related is clobbering internal VMX timer state?
> Or maybe there's a tearing or truncation issue?

It has been a few days. Just a note to let you know that we are investigating this.
On my side I have not yet been able to reproduce this issue. I tested
kvm-x86-next-2024.06.28 on an ICX and an CLX system by running 100 iterations of
apic_bus_clock_test and they all passed. Since I have lack of experience here there are
some Intel virtualization experts helping out with this investigation and I hope that
they will be some insights from the analysis and testing that you already provided.

Reinette
Reinette Chatre July 3, 2024, 9:37 p.m. UTC | #2
(a short update ...)

On 7/3/24 1:14 PM, Reinette Chatre wrote:
> On 6/28/24 5:39 PM, Sean Christopherson wrote:
>> Forking this off to try and avoid confusion...
>>
>> On Wed, Jun 12, 2024, Reinette Chatre wrote:
> 
> ...
> 
>>> +
>>> +        freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
>>> +        /* Check if measured frequency is within 1% of configured frequency. */
>>> +        GUEST_ASSERT(freq < apic_hz * 101 / 100);
>>> +        GUEST_ASSERT(freq > apic_hz * 99 / 100);
>>> +    }
>>
>> This test fails on our SKX, CLX, and ICX systems due to what appears to be a CPU
>> bug.  It looks like something APICv related is clobbering internal VMX timer state?
>> Or maybe there's a tearing or truncation issue?
> 
> It has been a few days. Just a note to let you know that we are investigating this.
> On my side I have not yet been able to reproduce this issue. I tested
> kvm-x86-next-2024.06.28 on an ICX and an CLX system by running 100 iterations of
> apic_bus_clock_test and they all passed. Since I have lack of experience here there are
> some Intel virtualization experts helping out with this investigation and I hope that
> they will be some insights from the analysis and testing that you already provided.

I have now been able to test on SKX also and I am not yet able to reproduce. For
reference, the systems I tested on are:
SKX: https://ark.intel.com/content/www/us/en/ark/products/120507/intel-xeon-platinum-8170m-processor-35-75m-cache-2-10-ghz.html
ICX: https://ark.intel.com/content/www/us/en/ark/products/212459/intel-xeon-platinum-8360y-processor-54m-cache-2-40-ghz.html
CLX: https://ark.intel.com/content/www/us/en/ark/products/192476/intel-xeon-platinum-8260l-processor-35-75m-cache-2-40-ghz.html

Reinette
Yuan Yao July 8, 2024, 5:55 a.m. UTC | #3
On Wed, Jul 03, 2024 at 01:14:09PM -0700, Reinette Chatre wrote:
> Hi Sean,
>
> On 6/28/24 5:39 PM, Sean Christopherson wrote:
> > Forking this off to try and avoid confusion...
> >
> > On Wed, Jun 12, 2024, Reinette Chatre wrote:
>
> ...
>
> > > +
> > > +		freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
> > > +		/* Check if measured frequency is within 1% of configured frequency. */
> > > +		GUEST_ASSERT(freq < apic_hz * 101 / 100);
> > > +		GUEST_ASSERT(freq > apic_hz * 99 / 100);
> > > +	}
> >
> > This test fails on our SKX, CLX, and ICX systems due to what appears to be a CPU
> > bug.  It looks like something APICv related is clobbering internal VMX timer state?
> > Or maybe there's a tearing or truncation issue?
>
> It has been a few days. Just a note to let you know that we are investigating this.
> On my side I have not yet been able to reproduce this issue. I tested
> kvm-x86-next-2024.06.28 on an ICX and an CLX system by running 100 iterations of
> apic_bus_clock_test and they all passed. Since I have lack of experience here there are
> some Intel virtualization experts helping out with this investigation and I hope that
> they will be some insights from the analysis and testing that you already provided.
>
> Reinette
>

I can reproduce this on my side, even with apicv disabled by
'insmod $kernel_path/arch/x86/kvm/kvm-intel.ko enable_apicv=N'.
@Sean I think we're observing same issue, please see the details
below:

apic_bus_clock_test can't reproduce this may because the
preemption timer value calculation relies on apic bus
frequency/TMICT/Divide count/host TSC frequency and ratio
between preemption timer and host TSC frequency, too many
factors to generate the 'magic' value there. So I changed
KVM and added a small KVM kselftest tool to set the
preemption timer value directly from guest, this makes the
reproducing easily. The changes are attached at end of this
comment.

The trace I captured below came form host with 1.7GHz TSC,
the VM_EXIT_SAVE_VMX_PREEMPTION_TIMER is enabled to get the
cpu saved vmcs.VMX_PREEMPTION_TIMER_VALUE after VMEXIT. I
set the vmcs.VMX_PREEMPTION_TIMER_VALUE to 0x880042ad which
is the 'magic' number on this 1.7Ghz TSC machine:

preempt_test  20677.199521: kvm:kvm_vmx_debug: kvm_vmx_debug: 3, a0:0x77fd5554 a1:0x880042ad a2:0x880042ad a3:0x20462e98d9b9
  a0: The previous vmcs.VMX_PREEMPTION_TIMER_VALUE value
      saved by CPU when VMEXIT.
  a1: The new preemption timer value wrote to
      vmcs.VMX_PREEMPTION_TIMER_VALUE.
  a2: The value read back from
      vmcs.VMX_PREEMPTION_TIMER_VALUE, for double confirmation.
  a3: The host tsc at the time point, debug only.

preempt_test  20677.199579:      kvm:kvm_exit: reason PREEMPTION_TIMER rip 0x40274d info 0 0 intr 0

preempt_test  20677.199579: kvm:kvm_vmx_debug: kvm_vmx_debug: 2, a0:0x34 a1:0x0 a2:0x87fea9b0 a3:0x20462e9a5749
  a0: The VMEXIT reason, 0x34 is preemption timer VMEXIT.
  a1: The read back vmcs.VMX_PREEMPTION_TIMER_VALUE value, here 0.
  a2: The next preemption timer value should be written to
      vmcs, calculates from the (target tsc - current tsc) >>
      7. Now the preemption timer vmexit happend only after
      ~58 microseconds elapsed, it should happen after
      ~171.79 seconds but not such soon, the issue is
      reproduced.

Another more easy way to observe this symptom w/o care the
'magic' preemption timer vlaue is use the maximum preemption
timer value 0xffffffff, below log w/ 0xffffffff is captured
from same machine:

preempt_test 20530.456589: kvm:kvm_vmx_debug: kvm_vmx_debug: 3, a0:0x77fd5551 a1:0xffffffff a2:0xffffffff a3:0x200c1971ca5d

  a0: The previous vmcs.VMX_PREEMPTION_TIMER_VALUE value
      saved by CPU when VMEXIT.
  a1: The new preemption timer value wrote to
      vmcs.VMX_PREEMPTION_TIMER_VALUE.
  a2: The read back value from
      vmcs.VMX_PREEMPTION_TIMER_VALUE, double confirmation.
  a3: the host tsc at the time point, debug only.

preempt_test 20530.456690:      kvm:kvm_exit: reason VMCALL rip 0x4131a0 info 0 0 intr 0
  The preempt_test checks preemption timer state every
  100us, this VMEXIT is expected behavior.

preempt_test 20530.456691:     kvm:kvm_entry: vcpu 0, rip 0x4131a3

preempt_test 20530.456691: kvm:kvm_vmx_debug: kvm_vmx_debug: 3, a0:0x77ff82cc a1:0xfffe900b a2:0xfffe900b a3:0x200c19746d45
  a0: The previous vmcs.VMX_PREEMPTION_TIMER_VALUE value
      saved by CPU when VMEXIT. The difference value
      shouldn't be such huge number 0xffffffff - 0x77ff82cc
      = 0x88007D33 when just ~100us elapsed from previous
      VMENTRY, the issue is reproduced.

Use 0x88000000 as preemption timer value to verify this
preempt_test tool, the preemption timer VMEXIT happend after
~171.45 seconds which is expected behavior on the host
1.7Ghz TSC system:

The preemption timer VMEXIT should happen after:
2281701376 × 128 / 1700000000 = 171.79 seconds.

Attached my changes in KVM and kselftest tool for
reproducing here, based on:
https://github.com/kvm-x86/linux.git
tag:kvm-x86-next-2024.06.28

Patch 01:

From a977bf12a8cd1bbe401e68d3702c0b5aa3bf66e4 Mon Sep 17 00:00:00 2001
From: Yao Yuan <yuan.yao@intel.com>
Date: Thu, 4 Jul 2024 09:59:55 +0800
Subject: [PATCH 1/2] KVM: x86: Introudce trace_kvm_vmx_debug()

debug only common trace.

Signed-off-by: Yao Yuan <yuan.yao@intel.com>
---
 arch/x86/kvm/trace.h | 28 ++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c   |  1 +
 2 files changed, 29 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index d3aeffd6ae75..7b9eb23d71d3 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -34,6 +34,34 @@ TRACE_EVENT(kvm_entry,
 		  __entry->immediate_exit ? "[immediate exit]" : "")
 );

+TRACE_EVENT(kvm_vmx_debug,
+	    TP_PROTO(unsigned long n, unsigned long a0,
+		     unsigned long a1,
+		     unsigned long a2,
+		     unsigned long a3),
+	    TP_ARGS(n, a0, a1, a2, a3),
+
+	TP_STRUCT__entry(
+		 __field(	unsigned long,	n		)
+		__field(	unsigned long,	a0		)
+		__field(	unsigned long,	a1		)
+		__field(	unsigned long,	a2		)
+		__field(	unsigned long,	a3		)
+	),
+
+	TP_fast_assign(
+		__entry->n      = n;
+		__entry->a0		= a0;
+		__entry->a1		= a1;
+		__entry->a2		= a2;
+		__entry->a3		= a3;
+	),
+
+	TP_printk("kvm_vmx_debug: %ld, a0:0x%lx a1:0x%lx a2:0x%lx a3:0x%lx",
+		  __entry->n, __entry->a0, __entry->a1, __entry->a2, __entry->a3)
+);
+
+
 /*
  * Tracepoint for hypercall.
  */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 994743266480..6d1972d6c988 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14036,6 +14036,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_enter);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_rmp_fault);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmx_debug);

 static int __init kvm_x86_init(void)
 {
--
2.27.0

Patch 02:

From a3bdce4f2f93810372f6068396776ac702946d16 Mon Sep 17 00:00:00 2001
From: Yao Yuan <yuan.yao@intel.com>
Date: Wed, 3 Jul 2024 14:08:02 +0800
Subject: [PATCH 2/2] [DEBUG] preempt timer debug test

A specific kselftesting based program to allow set the VMX
preempt timer value from VM directly.

Introduce 2 hypercall 0x56780001/2, 01 to set the preempt
timer value, 02 to wait for the preemption time expired.

Usage:
Reload kvm applied this change, then:
$KRNEL_SRC_ROOT/tools/testing/selftests/kvm/x86_64/preempt_test -p 'preempt_timer_vale'

'preempt_timer_vale' is the preempt timer value in DEC format, HEX is not supported.

For example:

perf record -e "kvm:*" tools/testing/selftests/kvm/x86_64/preempt_test -p 2281718445

Above set the preempt value to 2281718445(0x880042AD) and
capture the trace, then check the kvm_vmx_debug in the trace
to know the preempt timer behavior.

Signed-off-by: Yao Yuan <yuan.yao@intel.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 arch/x86/kvm/vmx/vmx.h                        |   5 +
 arch/x86/kvm/vmx/vmx.c                        | 113 +++++++++++++++++-
 .../selftests/kvm/x86_64/preempt_test.c       |  82 +++++++++++++
 4 files changed, 198 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/preempt_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index ad8b5d15f2bd..957509957f80 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -129,6 +129,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/amx_test
 TEST_GEN_PROGS_x86_64 += x86_64/max_vcpuid_cap_test
 TEST_GEN_PROGS_x86_64 += x86_64/triple_fault_event_test
 TEST_GEN_PROGS_x86_64 += x86_64/recalc_apic_map_test
+TEST_GEN_PROGS_x86_64 += x86_64/preempt_test
 TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
 TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 42498fa63abb..82ea0ccc7a63 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -368,6 +368,11 @@ struct vcpu_vmx {

 	/* ve_info must be page aligned. */
 	struct vmx_ve_information *ve_info;
+
+	volatile bool debug_timer;
+	bool debug_timer_set_to_hardware;
+	u32 debug_timer_val;
+	u64 debug_timer_deadline_tsc;
 };

 struct kvm_vmx {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f18c2d8c7476..73f084c29f9a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4431,8 +4431,9 @@ static u32 vmx_vmexit_ctrl(void)
 	 * Not used by KVM and never set in vmcs01 or vmcs02, but emulated for
 	 * nested virtualization and thus allowed to be set in vmcs12.
 	 */
-	vmexit_ctrl &= ~(VM_EXIT_SAVE_IA32_PAT | VM_EXIT_SAVE_IA32_EFER |
-			 VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
+	vmexit_ctrl &= ~(VM_EXIT_SAVE_IA32_PAT | VM_EXIT_SAVE_IA32_EFER);
+	pr_info("Set VM_EXIT_SAVE_VMX_PREEMPTION_TIMER forcedly for preempt timer debug\n");
+

 	if (vmx_pt_mode_is_system())
 		vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP |
@@ -5993,11 +5994,41 @@ static int handle_pml_full(struct kvm_vcpu *vcpu)
 	return 1;
 }

+static fastpath_t handle_fastpath_debug_timer(struct kvm_vcpu *vcpu,
+					      bool force_immediate_exit)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	u64 tscl;
+	u32 delta;
+
+	tscl = rdtsc();
+
+	if (vmx->debug_timer_deadline_tsc > tscl)
+		delta = (u32)((vmx->debug_timer_deadline_tsc - tscl) >>
+			      cpu_preemption_timer_multi);
+	else
+		delta = 0;
+
+	trace_kvm_vmx_debug(2UL,
+			    (unsigned long)vmcs_read32(VM_EXIT_REASON),
+			    (unsigned long)vmcs_read32(VMX_PREEMPTION_TIMER_VALUE),
+			    (unsigned long)delta, tscl);
+
+	vmx->debug_timer = false;
+
+	return EXIT_FASTPATH_REENTER_GUEST;
+}
+
 static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu,
 						   bool force_immediate_exit)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);

+	WARN_ON(vmx->debug_timer && force_immediate_exit);
+	if (vmx->debug_timer)
+		return handle_fastpath_debug_timer(vcpu,
+						   force_immediate_exit);
+
 	/*
 	 * In the *extremely* unlikely scenario that this is a spurious VM-Exit
 	 * due to the timer expiring while it was "soft" disabled, just eat the
@@ -6096,6 +6127,60 @@ static int handle_notify(struct kvm_vcpu *vcpu)
 	return 1;
 }

+static unsigned long vmx_debug_set_preempt_timer(struct kvm_vcpu *vcpu,
+						 unsigned long a0,
+						 unsigned long a1,
+						 unsigned long a2,
+						 unsigned long a3)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->debug_timer = true;
+	vmx->debug_timer_set_to_hardware = false;
+	vmx->debug_timer_val = a0;
+	vmx->debug_timer_deadline_tsc = rdtsc() + (a0 << cpu_preemption_timer_multi);
+	pr_info("debug_timer = %u\n", (u32)a0);
+
+	return 0;
+}
+
+
+static unsigned long vmx_debug_get_preempt_timer_result(struct kvm_vcpu *vcpu,
+							unsigned long a0,
+							unsigned long a1,
+							unsigned long a2,
+							unsigned long a3)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (vmx->debug_timer)
+		return 1;
+	return 0;
+}
+
+static int vmx_emulate_hypercall(struct kvm_vcpu *vcpu)
+{
+	unsigned long nr, a0, a1, a2, a3;
+	unsigned long ret;
+
+	nr = kvm_rax_read(vcpu);
+	if (nr != 0x87650001 && nr != 0x87650002)
+		return kvm_emulate_hypercall(vcpu);
+
+	a0 = kvm_rbx_read(vcpu);
+	a1 = kvm_rcx_read(vcpu);
+	a2 = kvm_rdx_read(vcpu);
+	a3 = kvm_rsi_read(vcpu);
+
+	if (nr == 0x87650001)
+		ret = vmx_debug_set_preempt_timer(vcpu, a0, a1, a2, a3);
+	else
+		ret = vmx_debug_get_preempt_timer_result(vcpu, a0, a1, a2, a3);
+
+	kvm_rax_write(vcpu, ret);
+	return kvm_skip_emulated_instruction(vcpu);
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -6117,7 +6202,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_INVD]		      = kvm_emulate_invd,
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_RDPMC]                   = kvm_emulate_rdpmc,
-	[EXIT_REASON_VMCALL]                  = kvm_emulate_hypercall,
+	[EXIT_REASON_VMCALL]                  = vmx_emulate_hypercall,
 	[EXIT_REASON_VMCLEAR]		      = handle_vmx_instruction,
 	[EXIT_REASON_VMLAUNCH]		      = handle_vmx_instruction,
 	[EXIT_REASON_VMPTRLD]		      = handle_vmx_instruction,
@@ -7199,6 +7284,28 @@ static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit
 	if (force_immediate_exit) {
 		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);
 		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
+	} else if (vmx->debug_timer) {
+		u32 old;
+
+		tscl = rdtsc();
+
+		if (!vmx->debug_timer_set_to_hardware) {
+			delta_tsc = vmx->debug_timer_val;
+			vmx->debug_timer_set_to_hardware = true;
+		} else {
+			if (vmx->debug_timer_deadline_tsc > tscl)
+				delta_tsc = (u32)((vmx->debug_timer_deadline_tsc - tscl)
+						  >> cpu_preemption_timer_multi);
+			else
+				delta_tsc = 0;
+		}
+
+		old = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
+		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc);
+		trace_kvm_vmx_debug(3UL, old,
+				    vmcs_read32(VMX_PREEMPTION_TIMER_VALUE),
+				    delta_tsc, tscl);
+		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
 	} else if (vmx->hv_deadline_tsc != -1) {
 		tscl = rdtsc();
 		if (vmx->hv_deadline_tsc > tscl)
diff --git a/tools/testing/selftests/kvm/x86_64/preempt_test.c b/tools/testing/selftests/kvm/x86_64/preempt_test.c
new file mode 100644
index 000000000000..2e58cfee61d0
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/preempt_test.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2024 Intel Corporation
+ *
+ * Debug the preemption timer behavior
+ */
+
+#include "test_util.h"
+#include "processor.h"
+#include "ucall_common.h"
+
+uint32_t preempt_timer_val = 0x1000000;
+static void guest_code(uint64_t apic_hz, uint64_t delay_ms)
+{
+	volatile unsigned long r;
+
+	kvm_hypercall(0x87650001, preempt_timer_val, 0, 0, 0);
+	do {
+		udelay(100);
+		r = kvm_hypercall(0x87650002, 0, 0, 0, 0);
+	} while(r != 0);
+
+	GUEST_DONE();
+}
+
+static void do_test(struct kvm_vcpu *vcpu)
+{
+	bool done = false;
+	struct ucall uc;
+
+	while (!done) {
+		vcpu_run(vcpu);
+
+		switch (get_ucall(vcpu, &uc)) {
+		case UCALL_DONE:
+			done = true;
+			break;
+		case UCALL_ABORT:
+			REPORT_GUEST_ASSERT(uc);
+			break;
+		default:
+			break;
+		}
+	}
+}
+
+static void run_test(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+
+	vm = vm_create(1);
+
+	sync_global_to_guest(vm, preempt_timer_val);
+
+	vcpu = vm_vcpu_add(vm, 0, guest_code);
+
+	do_test(vcpu);
+
+	kvm_vm_free(vm);
+}
+
+
+int main(int argc, char *argv[])
+{
+	int opt;
+
+	while ((opt = getopt(argc, argv, "p:h")) != -1) {
+		switch (opt) {
+		case 'p':
+			preempt_timer_val = atoi(optarg);
+			break;
+		default:
+			exit(KSFT_SKIP);
+		}
+	}
+
+	printf("preempt timer value:%u(0x%x)\n",
+	       preempt_timer_val, preempt_timer_val);
+
+	run_test();
+}
--
2.27.0
diff mbox series

Patch

From 42da584d5e4e2dfaf4296f2ea666f3a04c82052f Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 28 Jun 2024 16:33:32 -0700
Subject: [PATCH] debug

---
 arch/x86/kvm/vmx/vmx.c | 33 +++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/vmx.h |  1 +
 arch/x86/kvm/x86.c     |  1 +
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f18c2d8c7476..e83351c690d9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4431,8 +4431,7 @@  static u32 vmx_vmexit_ctrl(void)
 	 * Not used by KVM and never set in vmcs01 or vmcs02, but emulated for
 	 * nested virtualization and thus allowed to be set in vmcs12.
 	 */
-	vmexit_ctrl &= ~(VM_EXIT_SAVE_IA32_PAT | VM_EXIT_SAVE_IA32_EFER |
-			 VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
+	vmexit_ctrl &= ~(VM_EXIT_SAVE_IA32_PAT | VM_EXIT_SAVE_IA32_EFER);
 
 	if (vmx_pt_mode_is_system())
 		vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP |
@@ -5997,6 +5996,8 @@  static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu,
 						   bool force_immediate_exit)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	u32 delta_tsc;
+	u64 tscl;
 
 	/*
 	 * In the *extremely* unlikely scenario that this is a spurious VM-Exit
@@ -6020,6 +6021,16 @@  static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu,
 	if (is_guest_mode(vcpu))
 		return EXIT_FASTPATH_NONE;
 
+	tscl = rdtsc();
+	if (vmx->hv_deadline_tsc > tscl)
+		delta_tsc = (u32)((vmx->hv_deadline_tsc - tscl) >> cpu_preemption_timer_multi);
+	else
+		delta_tsc = 0;
+
+	pr_warn("VMX timer exit, EXIT_REASON = %x, VMCS = %x, delta = %x\n",
+		vmcs_read32(VM_EXIT_REASON),
+		vmcs_read32(VMX_PREEMPTION_TIMER_VALUE), delta_tsc);
+
 	kvm_lapic_expired_hv_timer(vcpu);
 	return EXIT_FASTPATH_REENTER_GUEST;
 }
@@ -7197,6 +7208,8 @@  static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit
 	u32 delta_tsc;
 
 	if (force_immediate_exit) {
+		trace_kvm_hypercall(0, 0, 0, 0, 0);
+		vmx->preemption_timer = 0;
 		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);
 		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
 	} else if (vmx->hv_deadline_tsc != -1) {
@@ -7208,9 +7221,14 @@  static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit
 		else
 			delta_tsc = 0;
 
+		trace_kvm_hypercall(tscl, vmx->hv_deadline_tsc, delta_tsc,
+				    ((vmx->hv_deadline_tsc - tscl) >> cpu_preemption_timer_multi),
+				    cpu_preemption_timer_multi);
+		vmx->preemption_timer = delta_tsc;
 		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc);
 		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
 	} else if (!vmx->loaded_vmcs->hv_timer_soft_disabled) {
+		trace_kvm_hypercall(-1, -1, -1, -1, -1);
 		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, -1);
 		vmx->loaded_vmcs->hv_timer_soft_disabled = true;
 	}
@@ -7218,6 +7236,8 @@  static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit
 
 void noinstr vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp)
 {
+	WARN_ON(!vmx->loaded_vmcs->hv_timer_soft_disabled &&
+		vmcs_read32(VMX_PREEMPTION_TIMER_VALUE) != vmx->preemption_timer);
 	if (unlikely(host_rsp != vmx->loaded_vmcs->host_state.rsp)) {
 		vmx->loaded_vmcs->host_state.rsp = host_rsp;
 		vmcs_writel(HOST_RSP, host_rsp);
@@ -8128,7 +8148,7 @@  int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 	    delta_tsc && u64_shl_div_u64(delta_tsc,
 				kvm_caps.tsc_scaling_ratio_frac_bits,
 				vcpu->arch.l1_tsc_scaling_ratio, &delta_tsc))
-		return -ERANGE;
+		goto out_of_range;
 
 	/*
 	 * If the delta tsc can't fit in the 32 bit after the multi shift,
@@ -8137,11 +8157,16 @@  int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 	 * on every vmentry is costly so we just use an hrtimer.
 	 */
 	if (delta_tsc >> (cpu_preemption_timer_multi + 32))
-		return -ERANGE;
+		goto out_of_range;
 
+	trace_kvm_hypercall(tscl, vmx->hv_deadline_tsc, delta_tsc, -1, -1);
 	vmx->hv_deadline_tsc = tscl + delta_tsc;
 	*expired = !delta_tsc;
 	return 0;
+
+out_of_range:
+	vmx->hv_deadline_tsc = -1;
+	return -ERANGE;
 }
 
 void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 42498fa63abb..ecafbb11519d 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -341,6 +341,7 @@  struct vcpu_vmx {
 
 	/* apic deadline value in host tsc */
 	u64 hv_deadline_tsc;
+	u32 preemption_timer;
 
 	unsigned long host_debugctlmsr;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 994743266480..00847259bcc4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14024,6 +14024,7 @@  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window_update);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pml_full);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_hypercall);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pi_irte_update);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_unaccelerated_access);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_incomplete_ipi);

base-commit: 128c71b7f489d8115d29a487367c648f8acc8374
-- 
2.45.2.803.g4e1b14247a-goog