diff mbox series

KVM: LAPIC: Fix an inversion error when a negative value assigned to lapic_timer.timer_advance_ns

Message ID 20240520115334.852510-1-zhoushuling@huawei.com (mailing list archive)
State New, archived
Headers show
Series KVM: LAPIC: Fix an inversion error when a negative value assigned to lapic_timer.timer_advance_ns | expand

Commit Message

zhoushuling May 20, 2024, 11:53 a.m. UTC
From: Shuling Zhou <zhoushuling@huawei.com>

After 'commit 0e6edceb8f18 ("KVM: LAPIC: Fix lapic_timer_advance_ns
parameter overflow")',a negative value can be assigned to
lapic_timer_advance_ns, when it is '-1', the kvm_create_lapic()
will judge it and turns on adaptive tuning of timer advancement.
However, when lapic_timer_advance_ns=-2, it will be assigned to
an uint variable apic->lapic_timer.timer_advance_ns, the
apic->lapic_timer.timer_advance_ns of each vCPU will become a huge
value. When a VM is started, the VM is stuck in the
"
[    2.669717] ACPI: Core revision 20130517
[    2.672378] ACPI: All ACPI Tables successfully acquired
[    2.673309] ftrace: allocating 29651 entries in 116 pages
[    2.698797] Enabling x2apic
[    2.699431] Enabled x2apic
[    2.700160] Switched APIC routing to physical x2apic.
[    2.701644] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    2.702575] smpboot: CPU0: Intel(R) Xeon(R) Platinum 8378A CPU @ 3.00GHz (fam: 06, model: 6a, stepping: 06)
..........
"

'Fixes: 0e6edceb8f18 ("KVM: LAPIC: Fix lapic_timer_advance_ns
parameter overflow")'

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Shuling Zhou<zhoushuling@huawei.com>
---
 arch/x86/kvm/lapic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Sean Christopherson May 20, 2024, 2:28 p.m. UTC | #1
On Mon, May 20, 2024, zhoushuling@huawei.com wrote:
> From: Shuling Zhou <zhoushuling@huawei.com>
> 
> After 'commit 0e6edceb8f18 ("KVM: LAPIC: Fix lapic_timer_advance_ns
> parameter overflow")',a negative value can be assigned to
> lapic_timer_advance_ns, when it is '-1', the kvm_create_lapic()
> will judge it and turns on adaptive tuning of timer advancement.
> However, when lapic_timer_advance_ns=-2, it will be assigned to
> an uint variable apic->lapic_timer.timer_advance_ns, the
> apic->lapic_timer.timer_advance_ns of each vCPU will become a huge
> value. When a VM is started, the VM is stuck in the
> "
> [    2.669717] ACPI: Core revision 20130517
> [    2.672378] ACPI: All ACPI Tables successfully acquired
> [    2.673309] ftrace: allocating 29651 entries in 116 pages
> [    2.698797] Enabling x2apic
> [    2.699431] Enabled x2apic
> [    2.700160] Switched APIC routing to physical x2apic.
> [    2.701644] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [    2.702575] smpboot: CPU0: Intel(R) Xeon(R) Platinum 8378A CPU @ 3.00GHz (fam: 06, model: 6a, stepping: 06)
> ..........
> "
> 
> 'Fixes: 0e6edceb8f18 ("KVM: LAPIC: Fix lapic_timer_advance_ns
> parameter overflow")'

 Fixes: 0e6edceb8f18 ("KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflow")

> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Signed-off-by: Shuling Zhou<zhoushuling@huawei.com>

There should be whitespace between your name and email.

> ---
>  arch/x86/kvm/lapic.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index ebf41023be38..5feeb889ddb6 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2848,7 +2848,7 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
>  	if (timer_advance_ns == -1) {
>  		apic->lapic_timer.timer_advance_ns = LAPIC_TIMER_ADVANCE_NS_INIT;
>  		lapic_timer_advance_dynamic = true;
> -	} else {
> +	} else if (timer_advance_ns >= 0) {
>  		apic->lapic_timer.timer_advance_ns = timer_advance_ns;
>  		lapic_timer_advance_dynamic = false;
>  	}

Wouldn't it be more appropriate to treat any negative value as "dynamic"?  The
comment above the module param also needs to be updated.

Oof, and lapic_timer_advance_dynamic is a global, which yields behavior that is
nearly impossible to document.  So I think we want this, over two patches?

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index ebf41023be38..3a1bcfbe3e93 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -59,7 +59,6 @@
 #define MAX_APIC_VECTOR                        256
 #define APIC_VECTORS_PER_REG           32
 
-static bool lapic_timer_advance_dynamic __read_mostly;
 #define LAPIC_TIMER_ADVANCE_ADJUST_MIN 100     /* clock cycles */
 #define LAPIC_TIMER_ADVANCE_ADJUST_MAX 10000   /* clock cycles */
 #define LAPIC_TIMER_ADVANCE_NS_INIT    1000
@@ -1854,7 +1853,7 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
        guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
        trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
 
-       if (lapic_timer_advance_dynamic) {
+       if (apic->lapic_timer.timer_advance_dynamic) {
                adjust_lapic_timer_advance(vcpu, guest_tsc - tsc_deadline);
                /*
                 * If the timer fired early, reread the TSC to account for the
@@ -2845,12 +2844,12 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
        hrtimer_init(&apic->lapic_timer.timer, CLOCK_MONOTONIC,
                     HRTIMER_MODE_ABS_HARD);
        apic->lapic_timer.timer.function = apic_timer_fn;
-       if (timer_advance_ns == -1) {
+       if (timer_advance_ns < 0) {
                apic->lapic_timer.timer_advance_ns = LAPIC_TIMER_ADVANCE_NS_INIT;
-               lapic_timer_advance_dynamic = true;
+               apic->lapic_timer.timer_advance_dynamic = true;
        } else {
                apic->lapic_timer.timer_advance_ns = timer_advance_ns;
-               lapic_timer_advance_dynamic = false;
+               apic->lapic_timer.timer_advance_dynamic = false;
        }
 
        /*
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 0a0ea4b5dd8c..6fb3b16a2754 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -54,6 +54,7 @@ struct kvm_timer {
        u32 timer_advance_ns;
        atomic_t pending;                       /* accumulated triggered timers */
        bool hv_timer_in_use;
+       bool timer_advance_dynamic;
 };
 
 struct kvm_lapic {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a2b62169e09a..60e86607056e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -165,10 +165,11 @@ static u32 __read_mostly tsc_tolerance_ppm = 250;
 module_param(tsc_tolerance_ppm, uint, 0644);
 
 /*
- * lapic timer advance (tscdeadline mode only) in nanoseconds.  '-1' enables
- * adaptive tuning starting from default advancement of 1000ns.  '0' disables
- * advancement entirely.  Any other value is used as-is and disables adaptive
- * tuning, i.e. allows privileged userspace to set an exact advancement time.
+ * lapic timer advance (tscdeadline mode only) in nanoseconds.  Any negative
+ * value enable adaptive tuning starting from default advancement of 1000ns.
+ * '0' disables advancement entirely.  Any postive value is used as-is and
+ * disables adaptive tuning, i.e. allows privileged userspace to set an exact
+ * advancement time.
  */
 static int __read_mostly lapic_timer_advance_ns = -1;
 module_param(lapic_timer_advance_ns, int, 0644);
zhoushuling May 21, 2024, 12:33 p.m. UTC | #2
> On Mon, May 20, 2024, zhoushuling@huawei.com wrote:
>> From: Shuling Zhou <zhoushuling@huawei.com>
>>
>> After 'commit 0e6edceb8f18 ("KVM: LAPIC: Fix lapic_timer_advance_ns
>> parameter overflow")',a negative value can be assigned to
>> lapic_timer_advance_ns, when it is '-1', the kvm_create_lapic()
>> will judge it and turns on adaptive tuning of timer advancement.
>> However, when lapic_timer_advance_ns=-2, it will be assigned to
>> an uint variable apic->lapic_timer.timer_advance_ns, the
>> apic->lapic_timer.timer_advance_ns of each vCPU will become a huge
>> value. When a VM is started, the VM is stuck in the
>> "
>> [    2.669717] ACPI: Core revision 20130517
>> [    2.672378] ACPI: All ACPI Tables successfully acquired
>> [    2.673309] ftrace: allocating 29651 entries in 116 pages
>> [    2.698797] Enabling x2apic
>> [    2.699431] Enabled x2apic
>> [    2.700160] Switched APIC routing to physical x2apic.
>> [    2.701644] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
>> [    2.702575] smpboot: CPU0: Intel(R) Xeon(R) Platinum 8378A CPU @ 3.00GHz (fam: 06, model: 6a, stepping: 06)
>> ..........
>> "
>>
>> 'Fixes: 0e6edceb8f18 ("KVM: LAPIC: Fix lapic_timer_advance_ns
>> parameter overflow")'
>   Fixes: 0e6edceb8f18 ("KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflow")
>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
>> Cc: Wanpeng Li <wanpengli@tencent.com>
>> Signed-off-by: Shuling Zhou<zhoushuling@huawei.com>
> There should be whitespace between your name and email.
>
>> ---
>>   arch/x86/kvm/lapic.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index ebf41023be38..5feeb889ddb6 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -2848,7 +2848,7 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
>>   	if (timer_advance_ns == -1) {
>>   		apic->lapic_timer.timer_advance_ns = LAPIC_TIMER_ADVANCE_NS_INIT;
>>   		lapic_timer_advance_dynamic = true;
>> -	} else {
>> +	} else if (timer_advance_ns >= 0) {
>>   		apic->lapic_timer.timer_advance_ns = timer_advance_ns;
>>   		lapic_timer_advance_dynamic = false;
>>   	}
> Wouldn't it be more appropriate to treat any negative value as "dynamic"?  The
> comment above the module param also needs to be updated.


Thank for your suggestions.

I agree with "treat any negative value as dynamic".


>
> Oof, and lapic_timer_advance_dynamic is a global, which yields behavior that is
> nearly impossible to document.  So I think we want this, over two patches?
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index ebf41023be38..3a1bcfbe3e93 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -59,7 +59,6 @@
>   #define MAX_APIC_VECTOR                        256
>   #define APIC_VECTORS_PER_REG           32
>   
> -static bool lapic_timer_advance_dynamic __read_mostly;
>   #define LAPIC_TIMER_ADVANCE_ADJUST_MIN 100     /* clock cycles */
>   #define LAPIC_TIMER_ADVANCE_ADJUST_MAX 10000   /* clock cycles */
>   #define LAPIC_TIMER_ADVANCE_NS_INIT    1000
> @@ -1854,7 +1853,7 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
>          guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>          trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
>   
> -       if (lapic_timer_advance_dynamic) {
> +       if (apic->lapic_timer.timer_advance_dynamic) {
>                  adjust_lapic_timer_advance(vcpu, guest_tsc - tsc_deadline);
>                  /*
>                   * If the timer fired early, reread the TSC to account for the
> @@ -2845,12 +2844,12 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
>          hrtimer_init(&apic->lapic_timer.timer, CLOCK_MONOTONIC,
>                       HRTIMER_MODE_ABS_HARD);
>          apic->lapic_timer.timer.function = apic_timer_fn;
> -       if (timer_advance_ns == -1) {
> +       if (timer_advance_ns < 0) {
>                  apic->lapic_timer.timer_advance_ns = LAPIC_TIMER_ADVANCE_NS_INIT;
> -               lapic_timer_advance_dynamic = true;
> +               apic->lapic_timer.timer_advance_dynamic = true;
>          } else {
>                  apic->lapic_timer.timer_advance_ns = timer_advance_ns;
> -               lapic_timer_advance_dynamic = false;
> +               apic->lapic_timer.timer_advance_dynamic = false;
>          }
>   
>          /*
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 0a0ea4b5dd8c..6fb3b16a2754 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -54,6 +54,7 @@ struct kvm_timer {
>          u32 timer_advance_ns;
>          atomic_t pending;                       /* accumulated triggered timers */
>          bool hv_timer_in_use;
> +       bool timer_advance_dynamic;
>   };


However,I do not understand why the global function switch 
'lapic_timer_advance_dynamic'

is changed to a local variable in the 'struct kvm_timer'.On a host, the 
adaptive tuning

of timer advancement is global function, and 
each vcpu->apic->lapic_timer.timer_advance_dynamic

of each VM is the same, different VMs cannot be configured with 
different switches.

Is it better to keep the current implementation of the global variable 
'lapic_timer_advance_dynamic'?

and just modify like this:

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index ebf41023be38..75469e329b23 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2845,7 +2845,7 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int 
timer_advance_ns)
     hrtimer_init(&apic->lapic_timer.timer, CLOCK_MONOTONIC,
              HRTIMER_MODE_ABS_HARD);
     apic->lapic_timer.timer.function = apic_timer_fn;
-   if (timer_advance_ns == -1) {
+   if (timer_advance_ns < 0) {
         apic->lapic_timer.timer_advance_ns = LAPIC_TIMER_ADVANCE_NS_INIT;
         lapic_timer_advance_dynamic = true;
     } else {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 082ac6d95a3a..071342d56ba8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -165,10 +165,11 @@ static u32 __read_mostly tsc_tolerance_ppm = 250;
  module_param(tsc_tolerance_ppm, uint, 0644);

  /*
- * lapic timer advance (tscdeadline mode only) in nanoseconds. '-1' enables
- * adaptive tuning starting from default advancement of 1000ns.  '0' 
disables
- * advancement entirely.  Any other value is used as-is and disables 
adaptive
- * tuning, i.e. allows privileged userspace to set an exact advancement 
time.
+ * lapic timer advance (tscdeadline mode only) in nanoseconds. Any negative
+ * value enable adaptive tuning starting from default advancement of 
1000ns.
+ * '0' disables advancement entirely.  Any postive value is used as-is and
+ * disables adaptive tuning, i.e. allows privileged userspace to set an 
exact
+ * advancement time.
   */
  static int __read_mostly lapic_timer_advance_ns = -1;
  module_param(lapic_timer_advance_ns, int, 0644);

Thanks.


>   
>   struct kvm_lapic {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a2b62169e09a..60e86607056e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -165,10 +165,11 @@ static u32 __read_mostly tsc_tolerance_ppm = 250;
>   module_param(tsc_tolerance_ppm, uint, 0644);
>   
>   /*
> - * lapic timer advance (tscdeadline mode only) in nanoseconds.  '-1' enables
> - * adaptive tuning starting from default advancement of 1000ns.  '0' disables
> - * advancement entirely.  Any other value is used as-is and disables adaptive
> - * tuning, i.e. allows privileged userspace to set an exact advancement time.
> + * lapic timer advance (tscdeadline mode only) in nanoseconds.  Any negative
> + * value enable adaptive tuning starting from default advancement of 1000ns.
> + * '0' disables advancement entirely.  Any postive value is used as-is and
> + * disables adaptive tuning, i.e. allows privileged userspace to set an exact
> + * advancement time.
>    */
>   static int __read_mostly lapic_timer_advance_ns = -1;
>   module_param(lapic_timer_advance_ns, int, 0644);
>
Sean Christopherson May 21, 2024, 3:02 p.m. UTC | #3
On Tue, May 21, 2024, zhoushuling wrote:
> > On Mon, May 20, 2024, zhoushuling@huawei.com wrote:
> > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > index 0a0ea4b5dd8c..6fb3b16a2754 100644
> > --- a/arch/x86/kvm/lapic.h
> > +++ b/arch/x86/kvm/lapic.h
> > @@ -54,6 +54,7 @@ struct kvm_timer {
> >          u32 timer_advance_ns;
> >          atomic_t pending;                       /* accumulated triggered timers */
> >          bool hv_timer_in_use;
> > +       bool timer_advance_dynamic;
> >   };
> 
> 
> However,I do not understand why the global function switch
> 'lapic_timer_advance_dynamic' > is changed to a local variable in the 'struct
> kvm_timer'.  On a host, the adaptive tuning of timer advancement is global
> function, and each vcpu->apic->lapic_timer.timer_advance_dynamic of each VM
> is the same, different VMs cannot be configured with different switches.

...

>  static int __read_mostly lapic_timer_advance_ns = -1;
>  module_param(lapic_timer_advance_ns, int, 0644);

The module param is writable, i.e. can be modified while KVM is running.  E.g. if
the admin changes lapic_timer_advance_ns from a negative to a postive value, then
vCPUs that were created while lapic_timer_advance_ns<0 will have a timer_advance_ns
that was dynamically calculated, but is now static.  I doubt there's a use case
that actually does anything like that, and in practice it probably doesn't cause
real problems, but it makes for bizarre and unpredictable behavior.

Hmm, alternativately, we could make lapic_timer_advance_ns a read-only boolean.
The param is wrtiable primarily because dynamic/adaptive tuning was added much
later, i.e. getting a usable value required modifying the advancement time while
VMs were running.  But I would be very surprised if there are use cases that still
*need* to hand tune the advancement, as it's practically impossible for userspace
to do better than KVM.

The only argument I can think of for taking a raw value from userspace is if there
is an absurd delay that exceeds KVM's max advancement of 5us.  But I'm not sure
KVM should even support such values.

Let me post a patch to convert lapic_timer_advance_ns to a read-only bool.  If
there is pushback on that idea, then we can circle back to this patch, but I'm
hoping we can simplify all of this instead of hardening KVM against edge cases
that no one likely cares about.

Side topic, if we keep the module param as-is, it really should be wrapped with
READ_ONCE().
diff mbox series

Patch

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index ebf41023be38..5feeb889ddb6 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2848,7 +2848,7 @@  int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
 	if (timer_advance_ns == -1) {
 		apic->lapic_timer.timer_advance_ns = LAPIC_TIMER_ADVANCE_NS_INIT;
 		lapic_timer_advance_dynamic = true;
-	} else {
+	} else if (timer_advance_ns >= 0) {
 		apic->lapic_timer.timer_advance_ns = timer_advance_ns;
 		lapic_timer_advance_dynamic = false;
 	}