diff mbox series

[v11,19/20] x86/kvmclock: Skip kvmclock when Secure TSC is available

Message ID 20240731150811.156771-20-nikunj@amd.com (mailing list archive)
State New, archived
Headers show
Series Add Secure TSC support for SNP guests | expand

Commit Message

Nikunj A. Dadhania July 31, 2024, 3:08 p.m. UTC
For AMD SNP guests with SecureTSC enabled, kvm-clock is being picked up
momentarily instead of selecting more stable TSC clocksource.

[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[    0.000001] kvm-clock: using sched offset of 1799357702246960 cycles
[    0.001493] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.006289] tsc: Detected 1996.249 MHz processor
[    0.305123] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns
[    1.045759] clocksource: Switched to clocksource kvm-clock
[    1.141326] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns
[    1.144634] clocksource: Switched to clocksource tsc

When Secure TSC is enabled, skip using the kvmclock. The guest kernel will
fallback and use Secure TSC based clocksource.

Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Tested-by: Peter Gonda <pgonda@google.com>
---
 arch/x86/kernel/kvmclock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Tom Lendacky Sept. 13, 2024, 5:19 p.m. UTC | #1
On 7/31/24 10:08, Nikunj A Dadhania wrote:
> For AMD SNP guests with SecureTSC enabled, kvm-clock is being picked up
> momentarily instead of selecting more stable TSC clocksource.
> 
> [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [    0.000001] kvm-clock: using sched offset of 1799357702246960 cycles
> [    0.001493] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [    0.006289] tsc: Detected 1996.249 MHz processor
> [    0.305123] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns
> [    1.045759] clocksource: Switched to clocksource kvm-clock
> [    1.141326] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns
> [    1.144634] clocksource: Switched to clocksource tsc
> 
> When Secure TSC is enabled, skip using the kvmclock. The guest kernel will
> fallback and use Secure TSC based clocksource.
> 
> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
> Tested-by: Peter Gonda <pgonda@google.com>

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>

> ---
>  arch/x86/kernel/kvmclock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 5b2c15214a6b..3d03b4c937b9 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -289,7 +289,7 @@ void __init kvmclock_init(void)
>  {
>  	u8 flags;
>  
> -	if (!kvm_para_available() || !kvmclock)
> +	if (!kvm_para_available() || !kvmclock || cc_platform_has(CC_ATTR_GUEST_SECURE_TSC))
>  		return;
>  
>  	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
Sean Christopherson Sept. 13, 2024, 5:30 p.m. UTC | #2
On Wed, Jul 31, 2024, Nikunj A Dadhania wrote:
> For AMD SNP guests with SecureTSC enabled, kvm-clock is being picked up
> momentarily instead of selecting more stable TSC clocksource.
> 
> [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [    0.000001] kvm-clock: using sched offset of 1799357702246960 cycles
> [    0.001493] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [    0.006289] tsc: Detected 1996.249 MHz processor
> [    0.305123] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns
> [    1.045759] clocksource: Switched to clocksource kvm-clock
> [    1.141326] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns
> [    1.144634] clocksource: Switched to clocksource tsc
> 
> When Secure TSC is enabled, skip using the kvmclock. The guest kernel will
> fallback and use Secure TSC based clocksource.
> 
> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
> Tested-by: Peter Gonda <pgonda@google.com>
> ---
>  arch/x86/kernel/kvmclock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 5b2c15214a6b..3d03b4c937b9 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -289,7 +289,7 @@ void __init kvmclock_init(void)
>  {
>  	u8 flags;
>  
> -	if (!kvm_para_available() || !kvmclock)
> +	if (!kvm_para_available() || !kvmclock || cc_platform_has(CC_ATTR_GUEST_SECURE_TSC))

I would much prefer we solve the kvmclock vs. TSC fight in a generic way.  Unless
I've missed something, the fact that the TSC is more trusted in the SNP/TDX world
is simply what's forcing the issue, but it's not actually the reason why Linux
should prefer the TSC over kvmclock.  The underlying reason is that platforms that
support SNP/TDX are guaranteed to have a stable, always running TSC, i.e. that the
TSC is a superior timesource purely from a functionality perspective.  That it's
more secure is icing on the cake.

>  		return;
>  
>  	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
> -- 
> 2.34.1
>
Nikunj A. Dadhania Sept. 16, 2024, 3:20 p.m. UTC | #3
On 9/13/2024 11:00 PM, Sean Christopherson wrote:
> On Wed, Jul 31, 2024, Nikunj A Dadhania wrote:
>> For AMD SNP guests with SecureTSC enabled, kvm-clock is being picked up
>> momentarily instead of selecting more stable TSC clocksource.
>>
>> [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
>> [    0.000001] kvm-clock: using sched offset of 1799357702246960 cycles
>> [    0.001493] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
>> [    0.006289] tsc: Detected 1996.249 MHz processor
>> [    0.305123] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns
>> [    1.045759] clocksource: Switched to clocksource kvm-clock
>> [    1.141326] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns
>> [    1.144634] clocksource: Switched to clocksource tsc
>>
>> When Secure TSC is enabled, skip using the kvmclock. The guest kernel will
>> fallback and use Secure TSC based clocksource.
>>
>> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
>> Tested-by: Peter Gonda <pgonda@google.com>
>> ---
>>  arch/x86/kernel/kvmclock.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
>> index 5b2c15214a6b..3d03b4c937b9 100644
>> --- a/arch/x86/kernel/kvmclock.c
>> +++ b/arch/x86/kernel/kvmclock.c
>> @@ -289,7 +289,7 @@ void __init kvmclock_init(void)
>>  {
>>  	u8 flags;
>>  
>> -	if (!kvm_para_available() || !kvmclock)
>> +	if (!kvm_para_available() || !kvmclock || cc_platform_has(CC_ATTR_GUEST_SECURE_TSC))
> 
> I would much prefer we solve the kvmclock vs. TSC fight in a generic way.  Unless
> I've missed something, the fact that the TSC is more trusted in the SNP/TDX world
> is simply what's forcing the issue, but it's not actually the reason why Linux
> should prefer the TSC over kvmclock.  The underlying reason is that platforms that
> support SNP/TDX are guaranteed to have a stable, always running TSC, i.e. that the
> TSC is a superior timesource purely from a functionality perspective.  That it's
> more secure is icing on the cake.

Are you suggesting that whenever the guest is either SNP or TDX, kvmclock should be
disabled assuming that timesource is stable and always running?

Regards
Nikunj
Sean Christopherson Sept. 18, 2024, 12:07 p.m. UTC | #4
On Mon, Sep 16, 2024, Nikunj A. Dadhania wrote:
> On 9/13/2024 11:00 PM, Sean Christopherson wrote:
> >> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
> >> Tested-by: Peter Gonda <pgonda@google.com>
> >> ---
> >>  arch/x86/kernel/kvmclock.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> >> index 5b2c15214a6b..3d03b4c937b9 100644
> >> --- a/arch/x86/kernel/kvmclock.c
> >> +++ b/arch/x86/kernel/kvmclock.c
> >> @@ -289,7 +289,7 @@ void __init kvmclock_init(void)
> >>  {
> >>  	u8 flags;
> >>  
> >> -	if (!kvm_para_available() || !kvmclock)
> >> +	if (!kvm_para_available() || !kvmclock || cc_platform_has(CC_ATTR_GUEST_SECURE_TSC))
> > 
> > I would much prefer we solve the kvmclock vs. TSC fight in a generic way.  Unless
> > I've missed something, the fact that the TSC is more trusted in the SNP/TDX world
> > is simply what's forcing the issue, but it's not actually the reason why Linux
> > should prefer the TSC over kvmclock.  The underlying reason is that platforms that
> > support SNP/TDX are guaranteed to have a stable, always running TSC, i.e. that the
> > TSC is a superior timesource purely from a functionality perspective.  That it's
> > more secure is icing on the cake.
> 
> Are you suggesting that whenever the guest is either SNP or TDX, kvmclock
> should be disabled assuming that timesource is stable and always running?

No, I'm saying that the guest should prefer the raw TSC over kvmclock if the TSC
is stable, irrespective of SNP or TDX.  This is effectively already done for the
timekeeping base (see commit 7539b174aef4 ("x86: kvmguest: use TSC clocksource if
invariant TSC is exposed")), but the scheduler still uses kvmclock thanks to the
kvm_sched_clock_init() code.

The other aspect of this to consider is wallclock.  If I'm reading the code
correctly, _completely_ disabling kvmclock will case the kernel to keep using the
RTC for wallclock.  Using the RTC is an amusingly bad decision for SNP and TDX
(and regular VMs), as the RTC is a slooow emulation path and it's still very much
controlled by the untrusted host.

Unless you have a better idea for what to do with wallclock, I think the right
approach is to come up a cleaner way to prefer TSC over kvmclock for timekeeping
and the scheduler, but leave wallclock as-is.  And then for SNP and TDX, "assert"
that the TSC is being used instead of kvmclock.  Presumably, all SNP and TDX
hosts provide a stable TSC, so there's probably no reason for the guest to even
care if the TSC is "secure".

Note, past me missed the wallclock side of things[*], so I don't think hiding
kvmclock entirely is the best solution.

[*] https://lore.kernel.org/all/ZOjF2DMBgW%2FzVvL3@google.com
Nikunj A. Dadhania Sept. 20, 2024, 5:15 a.m. UTC | #5
On 9/18/2024 5:37 PM, Sean Christopherson wrote:
> On Mon, Sep 16, 2024, Nikunj A. Dadhania wrote:
>> On 9/13/2024 11:00 PM, Sean Christopherson wrote:
>>>> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
>>>> Tested-by: Peter Gonda <pgonda@google.com>
>>>> ---
>>>>  arch/x86/kernel/kvmclock.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
>>>> index 5b2c15214a6b..3d03b4c937b9 100644
>>>> --- a/arch/x86/kernel/kvmclock.c
>>>> +++ b/arch/x86/kernel/kvmclock.c
>>>> @@ -289,7 +289,7 @@ void __init kvmclock_init(void)
>>>>  {
>>>>  	u8 flags;
>>>>  
>>>> -	if (!kvm_para_available() || !kvmclock)
>>>> +	if (!kvm_para_available() || !kvmclock || cc_platform_has(CC_ATTR_GUEST_SECURE_TSC))
>>>
>>> I would much prefer we solve the kvmclock vs. TSC fight in a generic way.  Unless
>>> I've missed something, the fact that the TSC is more trusted in the SNP/TDX world
>>> is simply what's forcing the issue, but it's not actually the reason why Linux
>>> should prefer the TSC over kvmclock.  The underlying reason is that platforms that
>>> support SNP/TDX are guaranteed to have a stable, always running TSC, i.e. that the
>>> TSC is a superior timesource purely from a functionality perspective.  That it's
>>> more secure is icing on the cake.
>>
>> Are you suggesting that whenever the guest is either SNP or TDX, kvmclock
>> should be disabled assuming that timesource is stable and always running?
> 
> No, I'm saying that the guest should prefer the raw TSC over kvmclock if the TSC
> is stable, irrespective of SNP or TDX.  This is effectively already done for the
> timekeeping base (see commit 7539b174aef4 ("x86: kvmguest: use TSC clocksource if
> invariant TSC is exposed")), but the scheduler still uses kvmclock thanks to the
> kvm_sched_clock_init() code.

The kvm-clock and tsc-early both are having the rating of 299. As they are of
same rating, kvm-clock is being picked up first.

Is it fine to drop the clock rating of kvmclock to 298 ? With this tsc-early will
be picked up instead.

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fafdbf813ae3..1982cee74354 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -289,7 +289,7 @@ void __init kvmclock_init(void)
 {
 	u8 flags;
 
-	if (!kvm_para_available() || !kvmclock || cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
+	if (!kvm_para_available() || !kvmclock)
 		return;
 
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
@@ -342,7 +342,7 @@ void __init kvmclock_init(void)
 	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
 	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
 	    !check_tsc_unstable())
-		kvm_clock.rating = 299;
+		kvm_clock.rating = 298;
 
 	clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
 	pv_info.name = "KVM";



[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[    0.000001] kvm-clock: using sched offset of 6630881179920185 cycles
[    0.001266] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.005347] tsc: Detected 1996.247 MHz processor
[    0.263100] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398caa9ddcb, max_idle_ns: 881590739785 ns
[    0.980456] clocksource: Switched to clocksource tsc-early
[    1.059332] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x398caa9ddcb, max_idle_ns: 881590739785 ns
[    1.062094] clocksource: Switched to clocksource tsc

> The other aspect of this to consider is wallclock.  If I'm reading the code
> correctly, _completely_ disabling kvmclock will case the kernel to keep using the
> RTC for wallclock.  Using the RTC is an amusingly bad decision for SNP and TDX
> (and regular VMs), as the RTC is a slooow emulation path and it's still very much
> controlled by the untrusted host.

Right, this is not expected.
 
> Unless you have a better idea for what to do with wallclock, I think the right
> approach is to come up a cleaner way to prefer TSC over kvmclock for timekeeping
> and the scheduler, but leave wallclock as-is.  And then for SNP and TDX, "assert"
> that the TSC is being used instead of kvmclock.  Presumably, all SNP and TDX
> hosts provide a stable TSC, so there's probably no reason for the guest to even
> care if the TSC is "secure".
> 
> Note, past me missed the wallclock side of things[*], so I don't think hiding
> kvmclock entirely is the best solution.
> 
> [*] https://lore.kernel.org/all/ZOjF2DMBgW%2FzVvL3@google.com

Regards
Nikunj
Sean Christopherson Sept. 20, 2024, 7:21 a.m. UTC | #6
On Fri, Sep 20, 2024, Nikunj A. Dadhania wrote:
> On 9/18/2024 5:37 PM, Sean Christopherson wrote:
> > On Mon, Sep 16, 2024, Nikunj A. Dadhania wrote:
> >> On 9/13/2024 11:00 PM, Sean Christopherson wrote:
> >>>> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
> >>>> Tested-by: Peter Gonda <pgonda@google.com>
> >>>> ---
> >>>>  arch/x86/kernel/kvmclock.c | 2 +-
> >>>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> >>>> index 5b2c15214a6b..3d03b4c937b9 100644
> >>>> --- a/arch/x86/kernel/kvmclock.c
> >>>> +++ b/arch/x86/kernel/kvmclock.c
> >>>> @@ -289,7 +289,7 @@ void __init kvmclock_init(void)
> >>>>  {
> >>>>  	u8 flags;
> >>>>  
> >>>> -	if (!kvm_para_available() || !kvmclock)
> >>>> +	if (!kvm_para_available() || !kvmclock || cc_platform_has(CC_ATTR_GUEST_SECURE_TSC))
> >>>
> >>> I would much prefer we solve the kvmclock vs. TSC fight in a generic way.  Unless
> >>> I've missed something, the fact that the TSC is more trusted in the SNP/TDX world
> >>> is simply what's forcing the issue, but it's not actually the reason why Linux
> >>> should prefer the TSC over kvmclock.  The underlying reason is that platforms that
> >>> support SNP/TDX are guaranteed to have a stable, always running TSC, i.e. that the
> >>> TSC is a superior timesource purely from a functionality perspective.  That it's
> >>> more secure is icing on the cake.
> >>
> >> Are you suggesting that whenever the guest is either SNP or TDX, kvmclock
> >> should be disabled assuming that timesource is stable and always running?
> > 
> > No, I'm saying that the guest should prefer the raw TSC over kvmclock if the TSC
> > is stable, irrespective of SNP or TDX.  This is effectively already done for the
> > timekeeping base (see commit 7539b174aef4 ("x86: kvmguest: use TSC clocksource if
> > invariant TSC is exposed")), but the scheduler still uses kvmclock thanks to the
> > kvm_sched_clock_init() code.
> 
> The kvm-clock and tsc-early both are having the rating of 299. As they are of
> same rating, kvm-clock is being picked up first.
> 
> Is it fine to drop the clock rating of kvmclock to 298 ? With this tsc-early will
> be picked up instead.

IMO, it's ugly, but that's a problem with the rating system inasmuch as anything.

But the kernel will still be using kvmclock for the scheduler clock, which is
undesirable.
Nikunj A. Dadhania Sept. 20, 2024, 8:54 a.m. UTC | #7
On 9/20/2024 12:51 PM, Sean Christopherson wrote:
> On Fri, Sep 20, 2024, Nikunj A. Dadhania wrote:
>> On 9/18/2024 5:37 PM, Sean Christopherson wrote:
>>> On Mon, Sep 16, 2024, Nikunj A. Dadhania wrote:
>>>> On 9/13/2024 11:00 PM, Sean Christopherson wrote:
>>>>>> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
>>>>>> Tested-by: Peter Gonda <pgonda@google.com>
>>>>>> ---
>>>>>>  arch/x86/kernel/kvmclock.c | 2 +-
>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
>>>>>> index 5b2c15214a6b..3d03b4c937b9 100644
>>>>>> --- a/arch/x86/kernel/kvmclock.c
>>>>>> +++ b/arch/x86/kernel/kvmclock.c
>>>>>> @@ -289,7 +289,7 @@ void __init kvmclock_init(void)
>>>>>>  {
>>>>>>  	u8 flags;
>>>>>>  
>>>>>> -	if (!kvm_para_available() || !kvmclock)
>>>>>> +	if (!kvm_para_available() || !kvmclock || cc_platform_has(CC_ATTR_GUEST_SECURE_TSC))
>>>>>
>>>>> I would much prefer we solve the kvmclock vs. TSC fight in a generic way.  Unless
>>>>> I've missed something, the fact that the TSC is more trusted in the SNP/TDX world
>>>>> is simply what's forcing the issue, but it's not actually the reason why Linux
>>>>> should prefer the TSC over kvmclock.  The underlying reason is that platforms that
>>>>> support SNP/TDX are guaranteed to have a stable, always running TSC, i.e. that the
>>>>> TSC is a superior timesource purely from a functionality perspective.  That it's
>>>>> more secure is icing on the cake.
>>>>
>>>> Are you suggesting that whenever the guest is either SNP or TDX, kvmclock
>>>> should be disabled assuming that timesource is stable and always running?
>>>
>>> No, I'm saying that the guest should prefer the raw TSC over kvmclock if the TSC
>>> is stable, irrespective of SNP or TDX.  This is effectively already done for the
>>> timekeeping base (see commit 7539b174aef4 ("x86: kvmguest: use TSC clocksource if
>>> invariant TSC is exposed")), but the scheduler still uses kvmclock thanks to the
>>> kvm_sched_clock_init() code.
>>
>> The kvm-clock and tsc-early both are having the rating of 299. As they are of
>> same rating, kvm-clock is being picked up first.
>>
>> Is it fine to drop the clock rating of kvmclock to 298 ? With this tsc-early will
>> be picked up instead.
> 
> IMO, it's ugly, but that's a problem with the rating system inasmuch as anything.
>
> But the kernel will still be using kvmclock for the scheduler clock, which is
> undesirable.

Agree, kvm_sched_clock_init() is still being called. The above hunk was to use
tsc-early/tsc as the clocksource and not kvm-clock.

Regards,
Nikunj
Nikunj A. Dadhania Sept. 25, 2024, 8:53 a.m. UTC | #8
On 9/20/2024 2:24 PM, Nikunj A. Dadhania wrote:
> On 9/20/2024 12:51 PM, Sean Christopherson wrote:
>> On Fri, Sep 20, 2024, Nikunj A. Dadhania wrote:
>>> On 9/18/2024 5:37 PM, Sean Christopherson wrote:
>>>> On Mon, Sep 16, 2024, Nikunj A. Dadhania wrote:
>>>>> On 9/13/2024 11:00 PM, Sean Christopherson wrote:
>>>>>>> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
>>>>>>> Tested-by: Peter Gonda <pgonda@google.com>
>>>>>>> ---
>>>>>>>  arch/x86/kernel/kvmclock.c | 2 +-
>>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
>>>>>>> index 5b2c15214a6b..3d03b4c937b9 100644
>>>>>>> --- a/arch/x86/kernel/kvmclock.c
>>>>>>> +++ b/arch/x86/kernel/kvmclock.c
>>>>>>> @@ -289,7 +289,7 @@ void __init kvmclock_init(void)
>>>>>>>  {
>>>>>>>  	u8 flags;
>>>>>>>  
>>>>>>> -	if (!kvm_para_available() || !kvmclock)
>>>>>>> +	if (!kvm_para_available() || !kvmclock || cc_platform_has(CC_ATTR_GUEST_SECURE_TSC))
>>>>>>
>>>>>> I would much prefer we solve the kvmclock vs. TSC fight in a generic way.  Unless
>>>>>> I've missed something, the fact that the TSC is more trusted in the SNP/TDX world
>>>>>> is simply what's forcing the issue, but it's not actually the reason why Linux
>>>>>> should prefer the TSC over kvmclock.  The underlying reason is that platforms that
>>>>>> support SNP/TDX are guaranteed to have a stable, always running TSC, i.e. that the
>>>>>> TSC is a superior timesource purely from a functionality perspective.  That it's
>>>>>> more secure is icing on the cake.
>>>>>
>>>>> Are you suggesting that whenever the guest is either SNP or TDX, kvmclock
>>>>> should be disabled assuming that timesource is stable and always running?
>>>>
>>>> No, I'm saying that the guest should prefer the raw TSC over kvmclock if the TSC
>>>> is stable, irrespective of SNP or TDX.  This is effectively already done for the
>>>> timekeeping base (see commit 7539b174aef4 ("x86: kvmguest: use TSC clocksource if
>>>> invariant TSC is exposed")), but the scheduler still uses kvmclock thanks to the
>>>> kvm_sched_clock_init() code.
>>>
>>> The kvm-clock and tsc-early both are having the rating of 299. As they are of
>>> same rating, kvm-clock is being picked up first.
>>>
>>> Is it fine to drop the clock rating of kvmclock to 298 ? With this tsc-early will
>>> be picked up instead.
>>
>> IMO, it's ugly, but that's a problem with the rating system inasmuch as anything.
>>
>> But the kernel will still be using kvmclock for the scheduler clock, which is
>> undesirable.
> 
> Agree, kvm_sched_clock_init() is still being called. The above hunk was to use
> tsc-early/tsc as the clocksource and not kvm-clock.

How about the below patch:

From: Nikunj A Dadhania <nikunj@amd.com>
Date: Tue, 28 Nov 2023 18:29:56 +0530
Subject: [RFC PATCH] x86/kvmclock: Prefer invariant TSC as the clocksource and
 scheduler clock

For platforms that support stable and always running TSC, although the
kvm-clock rating is dropped to 299 to prefer TSC, the guest scheduler clock
still keeps on using the kvm-clock which is undesirable. Moreover, as the
kvm-clock and early-tsc clocksource are both registered with 299 rating,
kvm-clock is being picked up momentarily instead of selecting more stable
tsc-early clocksource.

  kvm-clock: Using msrs 4b564d01 and 4b564d00
  kvm-clock: using sched offset of 1799357702246960 cycles
  clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
  tsc: Detected 1996.249 MHz processor
  clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns
  clocksource: Switched to clocksource kvm-clock
  clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns
  clocksource: Switched to clocksource tsc

Drop the kvm-clock rating to 298, so that tsc-early is picked up before
kvm-clock and use TSC for scheduler clock as well when the TSC is invariant
and stable.

Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>

---

The issue we see here is that on bare-metal if the TSC is marked unstable,
then the sched-clock will fall back to jiffies. In the virtualization case,
do we want to fall back to kvm-clock when TSC is marked unstable?

---
 arch/x86/kernel/kvmclock.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 5b2c15214a6b..c997b2628c4b 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -317,9 +317,6 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 
-	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
-	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
-
 	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
 	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
 	x86_platform.get_wallclock = kvm_get_wallclock;
@@ -341,8 +338,12 @@ void __init kvmclock_init(void)
 	 */
 	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
 	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
-	    !check_tsc_unstable())
-		kvm_clock.rating = 299;
+	    !check_tsc_unstable()) {
+		kvm_clock.rating = 298;
+	} else {
+		flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
+		kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+	}
 
 	clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
 	pv_info.name = "KVM";
Sean Christopherson Sept. 25, 2024, 12:55 p.m. UTC | #9
On Wed, Sep 25, 2024, Nikunj A. Dadhania wrote:
> >>>>> Are you suggesting that whenever the guest is either SNP or TDX, kvmclock
> >>>>> should be disabled assuming that timesource is stable and always running?
> >>>>
> >>>> No, I'm saying that the guest should prefer the raw TSC over kvmclock if the TSC
> >>>> is stable, irrespective of SNP or TDX.  This is effectively already done for the
> >>>> timekeeping base (see commit 7539b174aef4 ("x86: kvmguest: use TSC clocksource if
> >>>> invariant TSC is exposed")), but the scheduler still uses kvmclock thanks to the
> >>>> kvm_sched_clock_init() code.
> >>>
> >>> The kvm-clock and tsc-early both are having the rating of 299. As they are of
> >>> same rating, kvm-clock is being picked up first.
> >>>
> >>> Is it fine to drop the clock rating of kvmclock to 298 ? With this tsc-early will
> >>> be picked up instead.
> >>
> >> IMO, it's ugly, but that's a problem with the rating system inasmuch as anything.
> >>
> >> But the kernel will still be using kvmclock for the scheduler clock, which is
> >> undesirable.
> > 
> > Agree, kvm_sched_clock_init() is still being called. The above hunk was to use
> > tsc-early/tsc as the clocksource and not kvm-clock.
> 
> How about the below patch:
> 
> From: Nikunj A Dadhania <nikunj@amd.com>
> Date: Tue, 28 Nov 2023 18:29:56 +0530
> Subject: [RFC PATCH] x86/kvmclock: Prefer invariant TSC as the clocksource and
>  scheduler clock
> 
> For platforms that support stable and always running TSC, although the
> kvm-clock rating is dropped to 299 to prefer TSC, the guest scheduler clock
> still keeps on using the kvm-clock which is undesirable. Moreover, as the
> kvm-clock and early-tsc clocksource are both registered with 299 rating,
> kvm-clock is being picked up momentarily instead of selecting more stable
> tsc-early clocksource.
> 
>   kvm-clock: Using msrs 4b564d01 and 4b564d00
>   kvm-clock: using sched offset of 1799357702246960 cycles
>   clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
>   tsc: Detected 1996.249 MHz processor
>   clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns
>   clocksource: Switched to clocksource kvm-clock
>   clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns
>   clocksource: Switched to clocksource tsc
> 
> Drop the kvm-clock rating to 298, so that tsc-early is picked up before
> kvm-clock and use TSC for scheduler clock as well when the TSC is invariant
> and stable.
> 
> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
> 
> ---
> 
> The issue we see here is that on bare-metal if the TSC is marked unstable,
> then the sched-clock will fall back to jiffies. In the virtualization case,
> do we want to fall back to kvm-clock when TSC is marked unstable?

In the general case, yes.  Though that might be a WARN-able offense if the TSC
is allegedly constant+nonstop.  And for SNP and TDX, it might be a "panic and do
not boot" offense, since using kvmclock undermines the security of the guest.

> ---
>  arch/x86/kernel/kvmclock.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 5b2c15214a6b..c997b2628c4b 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -317,9 +317,6 @@ void __init kvmclock_init(void)
>  	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
>  		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
>  
> -	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
> -	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
> -
>  	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
>  	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
>  	x86_platform.get_wallclock = kvm_get_wallclock;
> @@ -341,8 +338,12 @@ void __init kvmclock_init(void)
>  	 */
>  	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
>  	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> -	    !check_tsc_unstable())
> -		kvm_clock.rating = 299;
> +	    !check_tsc_unstable()) {
> +		kvm_clock.rating = 298;
> +	} else {
> +		flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
> +		kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
> +	}

I would really, really like to fix this in a centralized location, not by having
each PV clocksource muck with their clock's rating.  I'm not even sure the existing
code is entirely correct, as kvmclock_init() runs _before_ tsc_early_init().  Which
is desirable in the legacy case, as it allows calibrating the TSC using kvmclock,

  	x86_platform.calibrate_tsc = kvm_get_tsc_khz;

but on modern setups that's definitely undesirable, as it means the kernel won't
use CPUID.0x15, which every explicitly tells software the frequency of the TSC.

And I don't think we want to simply point at native_calibrate_tsc(), because that
thing is not at all correct for a VM, where checking x86_vendor and x86_vfm is at
best sketchy.  E.g. I would think it's in AMD's interest for Secure TSC to define
the TSC frequency using CPUID.0x15, even if AMD CPUs don't (yet) natively support
CPUID.0x15.

In other words, I think we need to overhaul the PV clock vs. TSC logic so that it
makes sense for modern CPUs+VMs, not just keep hacking away at kvmclock.  I don't
expect the code would be all that complex in the end, the hardest part is likely
just figuring out (and agreeing on) what exactly the kernel should be doing.

>  	clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
>  	pv_info.name = "KVM";
> -- 
> 2.34.1
>
Nikunj A. Dadhania Sept. 30, 2024, 6:27 a.m. UTC | #10
On 9/25/2024 6:25 PM, Sean Christopherson wrote:
> On Wed, Sep 25, 2024, Nikunj A. Dadhania wrote:
>>>>>>> Are you suggesting that whenever the guest is either SNP or TDX, kvmclock
>>>>>>> should be disabled assuming that timesource is stable and always running?
>>>>>>
>>>>>> No, I'm saying that the guest should prefer the raw TSC over kvmclock if the TSC
>>>>>> is stable, irrespective of SNP or TDX.  This is effectively already done for the
>>>>>> timekeeping base (see commit 7539b174aef4 ("x86: kvmguest: use TSC clocksource if
>>>>>> invariant TSC is exposed")), but the scheduler still uses kvmclock thanks to the
>>>>>> kvm_sched_clock_init() code.
>>>>>
>>>>> The kvm-clock and tsc-early both are having the rating of 299. As they are of
>>>>> same rating, kvm-clock is being picked up first.
>>>>>
>>>>> Is it fine to drop the clock rating of kvmclock to 298 ? With this tsc-early will
>>>>> be picked up instead.
>>>>
>>>> IMO, it's ugly, but that's a problem with the rating system inasmuch as anything.
>>>>
>>>> But the kernel will still be using kvmclock for the scheduler clock, which is
>>>> undesirable.
>>>
>>> Agree, kvm_sched_clock_init() is still being called. The above hunk was to use
>>> tsc-early/tsc as the clocksource and not kvm-clock.
>>
>> How about the below patch:
>>
>> From: Nikunj A Dadhania <nikunj@amd.com>
>> Date: Tue, 28 Nov 2023 18:29:56 +0530
>> Subject: [RFC PATCH] x86/kvmclock: Prefer invariant TSC as the clocksource and
>>  scheduler clock
>>
>> For platforms that support stable and always running TSC, although the
>> kvm-clock rating is dropped to 299 to prefer TSC, the guest scheduler clock
>> still keeps on using the kvm-clock which is undesirable. Moreover, as the
>> kvm-clock and early-tsc clocksource are both registered with 299 rating,
>> kvm-clock is being picked up momentarily instead of selecting more stable
>> tsc-early clocksource.
>>
>>   kvm-clock: Using msrs 4b564d01 and 4b564d00
>>   kvm-clock: using sched offset of 1799357702246960 cycles
>>   clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
>>   tsc: Detected 1996.249 MHz processor
>>   clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns
>>   clocksource: Switched to clocksource kvm-clock
>>   clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x398cadd9d93, max_idle_ns: 881590552906 ns
>>   clocksource: Switched to clocksource tsc
>>
>> Drop the kvm-clock rating to 298, so that tsc-early is picked up before
>> kvm-clock and use TSC for scheduler clock as well when the TSC is invariant
>> and stable.
>>
>> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
>>
>> ---
>>
>> The issue we see here is that on bare-metal if the TSC is marked unstable,
>> then the sched-clock will fall back to jiffies. In the virtualization case,
>> do we want to fall back to kvm-clock when TSC is marked unstable?
> 
> In the general case, yes.  Though that might be a WARN-able offense if the TSC
> is allegedly constant+nonstop.  And for SNP and TDX, it might be a "panic and do
> not boot" offense, since using kvmclock undermines the security of the guest.
> 
>> ---
>>  arch/x86/kernel/kvmclock.c | 11 ++++++-----
>>  1 file changed, 6 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
>> index 5b2c15214a6b..c997b2628c4b 100644
>> --- a/arch/x86/kernel/kvmclock.c
>> +++ b/arch/x86/kernel/kvmclock.c
>> @@ -317,9 +317,6 @@ void __init kvmclock_init(void)
>>  	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
>>  		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
>>  
>> -	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
>> -	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
>> -
>>  	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
>>  	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
>>  	x86_platform.get_wallclock = kvm_get_wallclock;
>> @@ -341,8 +338,12 @@ void __init kvmclock_init(void)
>>  	 */
>>  	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
>>  	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
>> -	    !check_tsc_unstable())
>> -		kvm_clock.rating = 299;
>> +	    !check_tsc_unstable()) {
>> +		kvm_clock.rating = 298;
>> +	} else {
>> +		flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
>> +		kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
>> +	}
> 
> I would really, really like to fix this in a centralized location, not by having
> each PV clocksource muck with their clock's rating.  

TSC Clock Rating Adjustment:
* During TSC initialization, downgrade the TSC clock rating to 200 if TSC is not
  constant/reliable, placing it below HPET.
* Ensure the kvm-clock rating is set to 299 by default in the 
  struct clocksource kvm_clock.
* Avoid changing the kvm clock rating based on the availability of reliable
  clock sources. Let the TSC clock source determine and downgrade itself.

The above will make sure that the PV clocksource rating remain
unaffected.

Clock soure selection order when the ratings match:
* Currently, clocks are registered and enqueued based on their rating.
* When clock ratings are tied, use the advertised clock frequency(freq_khz) as a
  secondary key to favor clocks with better frequency.

This approach improves the selection process by considering both rating and
frequency. Something like below:

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index d0538a75f4c6..591451ccc0fa 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -1098,6 +1098,9 @@ static void clocksource_enqueue(struct clocksource *cs)
 		/* Keep track of the place, where to insert */
 		if (tmp->rating < cs->rating)
 			break;
+		if (tmp->rating == cs->rating && tmp->freq_khz < cs->freq_khz)
+			break;
+
 		entry = &tmp->list;
 	}
 	list_add(&cs->list, entry);
@@ -1133,6 +1136,9 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
 		 * clocksource with mask >= 40-bit and f >= 4GHz. That maps to
 		 * ~ 0.06ppm granularity for NTP.
 		 */
+
+		cs->freq_khz = freq * scale / 1000;
+
 		sec = cs->mask;
 		do_div(sec, freq);
 		do_div(sec, scale);


> I'm not even sure the existing
> code is entirely correct, as kvmclock_init() runs _before_ tsc_early_init().  Which
> is desirable in the legacy case, as it allows calibrating the TSC using kvmclock,
> 
>   	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
> 
> but on modern setups that's definitely undesirable, as it means the kernel won't
> use CPUID.0x15, which every explicitly tells software the frequency of the TSC.
>
> And I don't think we want to simply point at native_calibrate_tsc(), because that
> thing is not at all correct for a VM, where checking x86_vendor and x86_vfm is at
> best sketchy.  
>
> E.g. I would think it's in AMD's interest for Secure TSC to define
> the TSC frequency using CPUID.0x15, even if AMD CPUs don't (yet) natively support
> CPUID.0x15.

For SecureTSC: GUEST_TSC_FREQ MSR (C001_0134h) provides the TSC frequency.

> In other words, I think we need to overhaul the PV clock vs. TSC logic so that it
> makes sense for modern CPUs+VMs, not just keep hacking away at kvmclock.  I don't
> expect the code would be all that complex in the end, the hardest part is likely
> just figuring out (and agreeing on) what exactly the kernel should be doing.

To summarise this thread with respect to TSC vs KVM clock, there three key questions:

1) When should kvmclock init be done?
2) How should the TSC frequency be discovered?
3) What should be the sched clock source and how should it be selected in a generic way?

○ Legacy CPU/VMs: VMs running on platforms without non-stop/constant TSC 
  + kvm-clock should be registered before tsc-early/tsc
  + Need to calibrate TSC frequency
  + Use kvmclock wallclock
  + Use kvmclock for sched clock selected dynamicaly
    (using clocksource enable()/disable() callback)

○ Modern CPU/VMs: VMs running on platforms supporting constant, non-stop and reliable TSC
  + kvm-clock should be registered before tsc-early/tsc
  + TSC Frequency:
      Intel: TSC frequency using CPUID 0x15H/ 0x16H ?

      For SecureTSC: GUEST_TSC_FREQ MSR (C001_0134h) provides the TSC frequency, other 
      AMD guests need to calibrate the TSC frequency.
  + Use kvmclock wallclock
  + Use TSC for sched clock

After reviewing the code, the current init sequence looks correct for both legacy
and modern VMs/CPUs. Let kvmclock go ahead and register itself as a clocksource, although 
registration of the sched clock can be deferred until the kvm-clock is picked up as the clock 
source. And restore the old sched clock when kvmclock is disabled. Something like the below
patch, lightly tested:

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 5b2c15214a6b..7167caa3348d 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -21,6 +21,7 @@
 #include <asm/hypervisor.h>
 #include <asm/x86_init.h>
 #include <asm/kvmclock.h>
+#include <asm/timer.h>
 
 static int kvmclock __initdata = 1;
 static int kvmclock_vsyscall __initdata = 1;
@@ -148,12 +149,41 @@ bool kvm_check_and_clear_guest_paused(void)
 	return ret;
 }
 
+static u64 (*old_pv_sched_clock)(void);
+
+static void enable_kvm_schedclock_work(struct work_struct *work)
+{
+	u8 flags;
+
+	old_pv_sched_clock = static_call_query(pv_sched_clock);
+	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
+	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+}
+
+static DECLARE_DELAYED_WORK(enable_kvm_sc, enable_kvm_schedclock_work);
+
+static void disable_kvm_schedclock_work(struct work_struct *work)
+{
+	if (old_pv_sched_clock)
+		paravirt_set_sched_clock(old_pv_sched_clock);
+}
+static DECLARE_DELAYED_WORK(disable_kvm_sc, disable_kvm_schedclock_work);
+
 static int kvm_cs_enable(struct clocksource *cs)
 {
+	u8 flags;
+
 	vclocks_set_used(VDSO_CLOCKMODE_PVCLOCK);
+	schedule_delayed_work(&enable_kvm_sc, 0);
+
 	return 0;
 }
 
+static void kvm_cs_disable(struct clocksource *cs)
+{
+	schedule_delayed_work(&disable_kvm_sc, 0);
+}
+
 static struct clocksource kvm_clock = {
 	.name	= "kvm-clock",
 	.read	= kvm_clock_get_cycles,
@@ -162,6 +192,7 @@ static struct clocksource kvm_clock = {
 	.flags	= CLOCK_SOURCE_IS_CONTINUOUS,
 	.id     = CSID_X86_KVM_CLK,
 	.enable	= kvm_cs_enable,
+	.disable = kvm_cs_disable,
 };
 
 static void kvm_register_clock(char *txt)
@@ -287,8 +318,6 @@ static int kvmclock_setup_percpu(unsigned int cpu)
 
 void __init kvmclock_init(void)
 {
-	u8 flags;
-
 	if (!kvm_para_available() || !kvmclock)
 		return;
 
@@ -317,9 +346,6 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 
-	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
-	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
-
 	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
 	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
 	x86_platform.get_wallclock = kvm_get_wallclock;

Regards
Nikunj
Thomas Gleixner Sept. 30, 2024, 9:20 p.m. UTC | #11
On Mon, Sep 30 2024 at 11:57, Nikunj A. Dadhania wrote:
> TSC Clock Rating Adjustment:
> * During TSC initialization, downgrade the TSC clock rating to 200 if TSC is not
>   constant/reliable, placing it below HPET.

Downgrading a constant TSC is a bad idea. Reliable just means that it
does not need a watchdog clocksource. If it's non-constant it's
downgraded anyway.

> * Ensure the kvm-clock rating is set to 299 by default in the 
>   struct clocksource kvm_clock.
> * Avoid changing the kvm clock rating based on the availability of reliable
>   clock sources. Let the TSC clock source determine and downgrade itself.

Why downgrade? If it's the best one you want to upgrade it so it's
preferred over the others.

> The above will make sure that the PV clocksource rating remain
> unaffected.
>
> Clock soure selection order when the ratings match:
> * Currently, clocks are registered and enqueued based on their rating.
> * When clock ratings are tied, use the advertised clock frequency(freq_khz) as a
>   secondary key to favor clocks with better frequency.
>
> This approach improves the selection process by considering both rating and
> frequency. Something like below:

What does the frequency tell us? Not really anything. It's not
necessarily the better clocksource.

Higher frequency gives you a slightly better resolution, but as all of
this is usually sub-nanosecond resolution already that's not making a
difference in practice.

So if you know you want TSC to be selected, then upgrade the rating of
both the early and the regular TSC clocksource and be done with it.

Thanks,

        tglx
Nikunj A. Dadhania Oct. 1, 2024, 4:26 a.m. UTC | #12
On 10/1/2024 2:50 AM, Thomas Gleixner wrote:
> On Mon, Sep 30 2024 at 11:57, Nikunj A. Dadhania wrote:
>> TSC Clock Rating Adjustment:
>> * During TSC initialization, downgrade the TSC clock rating to 200 if TSC is not
>>   constant/reliable, placing it below HPET.
> 
> Downgrading a constant TSC is a bad idea. Reliable just means that it
> does not need a watchdog clocksource. If it's non-constant it's
> downgraded anyway.
> 
>> * Ensure the kvm-clock rating is set to 299 by default in the 
>>   struct clocksource kvm_clock.
>> * Avoid changing the kvm clock rating based on the availability of reliable
>>   clock sources. Let the TSC clock source determine and downgrade itself.
> 
> Why downgrade? If it's the best one you want to upgrade it so it's
> preferred over the others.

Thanks for confirming that upgrading the TSC rating is fine.

> The above will make sure that the PV clocksource rating remain
>> unaffected.
>>
>> Clock soure selection order when the ratings match:
>> * Currently, clocks are registered and enqueued based on their rating.
>> * When clock ratings are tied, use the advertised clock frequency(freq_khz) as a
>>   secondary key to favor clocks with better frequency.
>>
>> This approach improves the selection process by considering both rating and
>> frequency. Something like below:
> 
> What does the frequency tell us? Not really anything. It's not
> necessarily the better clocksource.
> 
> Higher frequency gives you a slightly better resolution, but as all of
> this is usually sub-nanosecond resolution already that's not making a
> difference in practice.
> 
> So if you know you want TSC to be selected, then upgrade the rating of
> both the early and the regular TSC clocksource and be done with it.

Sure Thomas, I will modify the patch accordingly and send an RFC.

Also I realized that, for the guests, instead of rdtsc(), we should be 
calling rdtsc_ordered() to make sure that time moves forward even when
vCPUs are migrated.

Thanks,
Nikunj
Nikunj A. Dadhania Oct. 1, 2024, 2:36 p.m. UTC | #13
On 10/1/2024 9:56 AM, Nikunj A. Dadhania wrote:
 
 
> Also I realized that, for the guests, instead of rdtsc(), we should be 
> calling rdtsc_ordered() to make sure that time moves forward even when
> vCPUs are migrated.

The above is with reference to native_sched_clock() being used for VM
instead of kvm_sched_clock_read() when using TSC.

Regards
Nikunj
diff mbox series

Patch

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 5b2c15214a6b..3d03b4c937b9 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -289,7 +289,7 @@  void __init kvmclock_init(void)
 {
 	u8 flags;
 
-	if (!kvm_para_available() || !kvmclock)
+	if (!kvm_para_available() || !kvmclock || cc_platform_has(CC_ATTR_GUEST_SECURE_TSC))
 		return;
 
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {