[1/5] x86/kvm: On KVM re-enable (e.g. after suspend), update clocks

Message ID 861716d768a1da6d1fd257b7972f8df13baf7f85.1449702533.git.luto@kernel.org (mailing list archive)
State New, archived

Commit Message

Andy Lutomirski Dec. 9, 2015, 11:12 p.m. UTC
This gets rid of the "did TSC go backwards" logic and just updates
all clocks.  It should work better (no more disabling of fast
timing) and more reliably (all of the clocks are actually updated).

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kvm/x86.c | 75 +++---------------------------------------------------
 1 file changed, 3 insertions(+), 72 deletions(-)

Comments

Radim Krčmář March 16, 2016, 10:06 p.m. UTC | #1
2015-12-09 15:12-0800, Andy Lutomirski:
> This gets rid of the "did TSC go backwards" logic and just updates
> all clocks.  It should work better (no more disabling of fast
> timing) and more reliably (all of the clocks are actually updated).
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> @@ -7369,88 +7366,22 @@ int kvm_arch_hardware_enable(void)
>  	list_for_each_entry(kvm, &vm_list, vm_list) {
>  		kvm_for_each_vcpu(i, vcpu, kvm) {
> +			if (vcpu->cpu == smp_processor_id()) {

(vmm_exclusive sets vcpu->cpu to -1, so KVM_REQ_MASTERCLOCK_UPDATE might
 not run, but vmm_exclusive probably doesn't work anyway.)

>  				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> +				kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE,
> +						vcpu);
>  			}

(Requesting KVM_REQ_MASTERCLOCK_UPDATE once per VM is enough.)

> -	if (backwards_tsc) {
> -		u64 delta_cyc = max_tsc - local_tsc;
> -		backwards_tsc_observed = true;
> -		list_for_each_entry(kvm, &vm_list, vm_list) {
> -			kvm_for_each_vcpu(i, vcpu, kvm) {
> -				vcpu->arch.tsc_offset_adjustment += delta_cyc;
> -				vcpu->arch.last_host_tsc = local_tsc;

tsc_offset_adjustment was set for

  	/* Apply any externally detected TSC adjustments (due to suspend) */
  	if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
  		adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
  		vcpu->arch.tsc_offset_adjustment = 0;
  		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
  	}

Guest TSC is going to jump backward with this patch, which would make
the guest think that a lot of cycles passed.  This has no bearing on
guest timekeeping, because the guest shouldn't be using raw TSC.
If we wanted to do something though, there are at least two options:
1) Fake that TSC continued at roughly its specified rate:  compute how
   many cycles could have elapsed while the CPU was suspended (using
   host time before/after suspend and guest TSC frequency) and adjust
   guest TSC.
2) Resume guest TSC at its last cycle before suspend.
   (Roughly what KVM does now.)
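
To make the two options concrete, here is a minimal user-space sketch. It is a toy model, not KVM code: it assumes the guest observes the host TSC plus an offset, and `suspended_cycles`, `offset_for`, and all of their parameters are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Option 1 helper (hypothetical): cycles the guest TSC would have counted
 * between two host timestamps, given the guest TSC frequency in kHz.
 * cycles = kHz * ns / 10^6; the split keeps the intermediate products small.
 */
static uint64_t suspended_cycles(uint64_t ns_before, uint64_t ns_after,
				 uint32_t guest_tsc_khz)
{
	assert(ns_after >= ns_before);
	uint64_t delta_ns = ns_after - ns_before;

	return (delta_ns / 1000000) * guest_tsc_khz +
	       (delta_ns % 1000000) * guest_tsc_khz / 1000000;
}

/*
 * Offset that makes the guest read guest_tsc_target at the moment the host
 * reads host_tsc_now.  The subtraction wraps modulo 2^64, which is what an
 * additive offset needs when the host TSC restarted from a smaller value.
 */
static uint64_t offset_for(uint64_t guest_tsc_target, uint64_t host_tsc_now)
{
	return guest_tsc_target - host_tsc_now;
}
```

In this model, option 2 is `offset_for(last_guest_tsc, host_tsc_now)`, and option 1 is the same call with `suspended_cycles(...)` added to the target first.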

What are your opinions on TSC faking?

Thanks.


---
Btw. I'll be spending some days to decipher kvmclock, so I'd also fix
the masterclock+suspend issue, if you don't mind ... So far, I don't
even see a reason to update kvmclock on kvm_arch_hardware_enable().
Suspend is a condition that we want to handle, so kvm_resume would be a
better place, but we handle suspend only because TSC and timekeeping have
changed, so I think that the right place is in their event notifiers.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Lutomirski March 16, 2016, 10:15 p.m. UTC | #2
On Wed, Mar 16, 2016 at 3:06 PM, Radim Krcmar <rkrcmar@redhat.com> wrote:
> 2015-12-09 15:12-0800, Andy Lutomirski:
>> This gets rid of the "did TSC go backwards" logic and just updates
>> all clocks.  It should work better (no more disabling of fast
>> timing) and more reliably (all of the clocks are actually updated).
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> @@ -7369,88 +7366,22 @@ int kvm_arch_hardware_enable(void)
>>       list_for_each_entry(kvm, &vm_list, vm_list) {
>>               kvm_for_each_vcpu(i, vcpu, kvm) {
>> +                     if (vcpu->cpu == smp_processor_id()) {
>
> (vmm_exclusive sets vcpu->cpu to -1, so KVM_REQ_MASTERCLOCK_UPDATE might
>  not run, but vmm_exclusive probably doesn't work anyway.)
>
>>                               kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>> +                             kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE,
>> +                                             vcpu);
>>                       }
>
> (Requesting KVM_REQ_MASTERCLOCK_UPDATE once per VM is enough.)
>
>> -     if (backwards_tsc) {
>> -             u64 delta_cyc = max_tsc - local_tsc;
>> -             backwards_tsc_observed = true;
>> -             list_for_each_entry(kvm, &vm_list, vm_list) {
>> -                     kvm_for_each_vcpu(i, vcpu, kvm) {
>> -                             vcpu->arch.tsc_offset_adjustment += delta_cyc;
>> -                             vcpu->arch.last_host_tsc = local_tsc;
>
> tsc_offset_adjustment was set for
>
>         /* Apply any externally detected TSC adjustments (due to suspend) */
>         if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
>                 adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
>                 vcpu->arch.tsc_offset_adjustment = 0;
>                 kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>         }
>
> Guest TSC is going to jump backward with this patch, which would make
> the guest think that a lot of cycles passed.  This has no bearing on
> guest timekeeping, because the guest shouldn't be using raw TSC.
> If we wanted to do something though, there are at least two options:
> 1) Fake that TSC continued at roughly its specified rate:  compute how
>    many cycles could have elapsed while the CPU was suspended (using
>    host time before/after suspend and guest TSC frequency) and adjust
>    guest TSC.
> 2) Resume guest TSC at its last cycle before suspend.
>    (Roughly what KVM does now.)
>
> What are your opinions on TSC faking?

I'd suggest restarting it wherever it left off, because it's simpler.
If there was a CLOCK_BOOT_RAW, you could try to track it, but I'm not
sure that such a thing exists.

FWIW, if you ever intend to support ART ("always running timer")
passthrough, this is going to be a giant clusterfsck.  Good luck.  I
haven't gotten a straight answer as to what hardware actually supports
that thing, so even testing isn't easy.

>
> Thanks.
>
>
> ---
> Btw. I'll be spending some days to decipher kvmclock, so I'd also fix
> the masterclock+suspend issue, if you don't mind ... So far, I don't
> even see a reason to update kvmclock on kvm_arch_hardware_enable().
> Suspend is a condition that we want to handle, so kvm_resume would be a
> better place, but we handle suspend only because TSC and timekeeping have
> changed, so I think that the right place is in their event notifiers.

I'd be glad to try to review things.  Please cc me.

One of the Xen people pointed me at the MS Viridian spec for handling
TSC rate changes on migration to or from hosts that don't support TSC
scaling.  I wonder if KVM could use the same technique or even the
same API.

--Andy
Radim Krčmář March 16, 2016, 10:59 p.m. UTC | #3
2016-03-16 15:15-0700, Andy Lutomirski:
> On Wed, Mar 16, 2016 at 3:06 PM, Radim Krcmar <rkrcmar@redhat.com> wrote:
>> Guest TSC is going to jump backward with this patch, which would make
>> the guest think that a lot of cycles passed.  This has no bearing on
>> guest timekeeping, because the guest shouldn't be using raw TSC.
>> If we wanted to do something though, there are at least two options:
>> 1) Fake that TSC continued at roughly its specified rate:  compute how
>>    many cycles could have elapsed while the CPU was suspended (using
>>    host time before/after suspend and guest TSC frequency) and adjust
>>    guest TSC.
>> 2) Resume guest TSC at its last cycle before suspend.
>>    (Roughly what KVM does now.)
>>
>> What are your opinions on TSC faking?
> 
> I'd suggest restarting it wherever it left off, because it's simpler.
> If there was a CLOCK_BOOT_RAW, you could try to track it, but I'm not
> sure that such a thing exists.

CLOCK_MONOTONIC_RAW can count in suspend, so CLOCK_BOOT_RAW would be a
conditional alias and it probably doesn't exist because of that.

> FWIW, if you ever intend to support ART ("always running timer")
> passthrough, this is going to be a giant clusterfsck.  Good luck.  I
> haven't gotten a straight answer as to what hardware actually supports
> that thing, so even testing isn't easy.

Hm, AR TSC would be best handled by doing nothing ... dropping the
faking logic just became tempting.

>> ---
>> Btw. I'll be spending some days to decipher kvmclock, so I'd also fix
>> the masterclock+suspend issue, if you don't mind ... So far, I don't
>> even see a reason to update kvmclock on kvm_arch_hardware_enable().
>> Suspend is a condition that we want to handle, so kvm_resume would be a
>> better place, but we handle suspend only because TSC and timekeeping have
>> changed, so I think that the right place is in their event notifiers.
> 
> I'd be glad to try to review things.  Please cc me.

Ok.

> One of the Xen people pointed me at the MS Viridian spec for handling
> TSC rate changes on migration to or from hosts that don't support TSC
> scaling.  I wonder if KVM could use the same technique or even the
> same API.

The TSC frequency MSR is read-only in Xen, so I guess it's equivalent to
pvclock.  I'll take a deeper look, thanks for pointers.
Andy Lutomirski March 16, 2016, 11:07 p.m. UTC | #4
On Wed, Mar 16, 2016 at 3:59 PM, Radim Krcmar <rkrcmar@redhat.com> wrote:
> 2016-03-16 15:15-0700, Andy Lutomirski:
>> On Wed, Mar 16, 2016 at 3:06 PM, Radim Krcmar <rkrcmar@redhat.com> wrote:
>>> Guest TSC is going to jump backward with this patch, which would make
>>> the guest think that a lot of cycles passed.  This has no bearing on
>>> guest timekeeping, because the guest shouldn't be using raw TSC.
>>> If we wanted to do something though, there are at least two options:
>>> 1) Fake that TSC continued at roughly its specified rate:  compute how
>>>    many cycles could have elapsed while the CPU was suspended (using
>>>    host time before/after suspend and guest TSC frequency) and adjust
>>>    guest TSC.
>>> 2) Resume guest TSC at its last cycle before suspend.
>>>    (Roughly what KVM does now.)
>>>
>>> What are your opinions on TSC faking?
>>
>> I'd suggest restarting it wherever it left off, because it's simpler.
>> If there was a CLOCK_BOOT_RAW, you could try to track it, but I'm not
>> sure that such a thing exists.
>
> CLOCK_MONOTONIC_RAW can count in suspend, so CLOCK_BOOT_RAW would be a
> conditional alias and it probably doesn't exist because of that.
>
>> FWIW, if you ever intend to support ART ("always running timer")
>> passthrough, this is going to be a giant clusterfsck.  Good luck.  I
>> haven't gotten a straight answer as to what hardware actually supports
>> that thing, so even testing isn't easy.
>
> Hm, AR TSC would be best handled by doing nothing ... dropping the
> faking logic just became tempting.

As it stands, ART is screwed if you adjust the VMCS's tsc offset.  But
I think it's also screwed if you migrate to a machine with a different
ratio of guest TSC ticks to host ART ticks or a different offset,
because the host isn't going to do the rdmsr every time it tries to
access the ART, so passing it through might require a paravirt
mechanism no matter what.

ISTM that, if KVM tries to keep the guest TSC monotonic across
migration, it should probably also keep it monotonic across host
suspend/resume.  After all, host suspend/resume is kind of like
migrating from the pre-suspend host to the post-resume host.  Maybe it
could even share code.

>
>>> ---
>>> Btw. I'll be spending some days to decipher kvmclock, so I'd also fix
>>> the masterclock+suspend issue, if you don't mind ... So far, I don't
>>> even see a reason to update kvmclock on kvm_arch_hardware_enable().
>>> Suspend is a condition that we want to handle, so kvm_resume would be a
>>> better place, but we handle suspend only because TSC and timekeeping have
>>> changed, so I think that the right place is in their event notifiers.
>>
>> I'd be glad to try to review things.  Please cc me.
>
> Ok.
>
>> One of the Xen people pointed me at the MS Viridian spec for handling
>> TSC rate changes on migration to or from hosts that don't support TSC
>> scaling.  I wonder if KVM could use the same technique or even the
>> same API.
>
> The TSC frequency MSR is read-only in Xen, so I guess it's equivalent to
> pvclock.  I'll take a deeper look, thanks for pointers.
Radim Krčmář March 17, 2016, 3:10 p.m. UTC | #5
2016-03-16 16:07-0700, Andy Lutomirski:
> On Wed, Mar 16, 2016 at 3:59 PM, Radim Krcmar <rkrcmar@redhat.com> wrote:
>> 2016-03-16 15:15-0700, Andy Lutomirski:
>>> FWIW, if you ever intend to support ART ("always running timer")
>>> passthrough, this is going to be a giant clusterfsck.  Good luck.  I
>>> haven't gotten a straight answer as to what hardware actually supports
>>> that thing, so even testing isn't easy.
>>
>> Hm, AR TSC would be best handled by doing nothing ... dropping the
>> faking logic just became tempting.

ART is different from what I initially thought, it's the underlying
mechanism for invariant TSC and nothing more ...  we already forbid
migrations when the guest knows about invariant TSC, so we could do the
same and let ART be virtualized.  (Suspend has to be forbidden too.)

> As it stands, ART is screwed if you adjust the VMCS's tsc offset.  But

Luckily, assigning real hardware can prevent migration or suspend, so we
won't need to adjust the offset during runtime.  TSC is a generally
unmigratable device that just happens to live on the CPU.

(It would have been better to hide TSC capability from the guest and only
 use rdtsc for kvmclock if the guest wanted fancy features.)

> I think it's also screwed if you migrate to a machine with a different
> ratio of guest TSC ticks to host ART ticks or a different offset,
> because the host isn't going to do the rdmsr every time it tries to
> access the ART, so passing it through might require a paravirt
> mechanism no matter what.

It's almost certain that the other host will have a different offset,
which makes TSC unmigratable in software without even considering ART
or frequencies.  Well, KVM already emulates different TSC frequency, so
we could emulate ART without sinking much lower. :)

> ISTM that, if KVM tries to keep the guest TSC monotonic across
> migration, it should probably also keep it monotonic across host
> suspend/resume.

Yes, "Pausing" TSC during suspend or migration is one way of improving
the TSC estimate.  If we want to emulate ART, then the estimate is
noticeably lacking, because TSC and ART are defined by a simple
equation (SDM 2015-12, 17.14.4 Invariant Time-Keeping):
 TSC_Value = (ART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + K

where the guest thinks that CPUID and K are constant (between events
that the guest knows of), so we should give the best estimate of how
many TSC cycles have passed.  (The best estimate is still lacking.)
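
As a rough illustration of that equation, the conversion can be written in plain C. This is a sketch under assumptions, not the kernel's or the SDM's code: `art_to_tsc` and its parameter names are hypothetical, `num` stands for CPUID.15H:EBX, `den` for CPUID.15H:EAX, and `k` for the constant K.

```c
#include <assert.h>
#include <stdint.h>

/*
 * TSC = ART * num / den + K.
 * Splitting ART into quotient and remainder by den gives the same floor
 * result as (art * num) / den, but the only product that involves the full
 * 64-bit ART is quot * num; rem * num always fits in 64 bits because both
 * factors are below 2^32.
 */
static uint64_t art_to_tsc(uint64_t art, uint32_t num, uint32_t den,
			   uint64_t k)
{
	assert(den != 0);
	uint64_t quot = art / den;
	uint64_t rem = art % den;

	return quot * num + (rem * num) / den + k;
}
```

The guest treats `num`, `den`, and `k` as fixed, so any host-side change to the offset shows up as an apparent change in K.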

>                  After all, host suspend/resume is kind of like
> migrating from the pre-suspend host to the post-resume host.  Maybe it
> could even share code.

Hopefully ... host suspend/resume is driven by kernel and migration is
driven by userspace, which might complicate sharing.
Andy Lutomirski March 17, 2016, 6:22 p.m. UTC | #6
On Mar 17, 2016 8:10 AM, "Radim Krcmar" <rkrcmar@redhat.com> wrote:
>
> 2016-03-16 16:07-0700, Andy Lutomirski:
> > On Wed, Mar 16, 2016 at 3:59 PM, Radim Krcmar <rkrcmar@redhat.com> wrote:
> >> 2016-03-16 15:15-0700, Andy Lutomirski:
> >>> FWIW, if you ever intend to support ART ("always running timer")
> >>> passthrough, this is going to be a giant clusterfsck.  Good luck.  I
> >>> haven't gotten a straight answer as to what hardware actually supports
> >>> that thing, so even testing isn't easy.
> >>
> >> Hm, AR TSC would be best handled by doing nothing ... dropping the
> >> faking logic just became tempting.
>
> ART is different from what I initially thought, it's the underlying
> mechanism for invariant TSC and nothing more ...  we already forbid
> migrations when the guest knows about invariant TSC, so we could do the
> same and let ART be virtualized.  (Suspend has to be forbidden too.)

It's more than that -- it's a TSC-like clock that can be read by PCIe devices.

>
> > As it stands, ART is screwed if you adjust the VMCS's tsc offset.  But
>
> Luckily, assigning real hardware can prevent migration or suspend, so we
> won't need to adjust the offset during runtime.  TSC is a generally
> unmigratable device that just happens to live on the CPU.
>
> (It would have been better to hide TSC capability from the guest and only
>  use rdtsc for kvmclock if the guest wanted fancy features.)
>

I think that, if KVM passes through an ART-supporting NIC, it might be
rather messy to try to avoid passing through TSC as well.  But maybe a
pvclock-like structure could expose the ART-kvmclock offset and scale.

> > I think it's also screwed if you migrate to a machine with a different
> > ratio of guest TSC ticks to host ART ticks or a different offset,
> > because the host isn't going to do the rdmsr every time it tries to
> > access the ART, so passing it through might require a paravirt
> > mechanism no matter what.
>
> It's almost certain that the other host will have a different offset,
> which makes TSC unmigratable in software without even considering ART
> or frequencies.  Well, KVM already emulates different TSC frequency, so
> we could emulate ART without sinking much lower. :)
>
> > ISTM that, if KVM tries to keep the guest TSC monotonic across
> > migration, it should probably also keep it monotonic across host
> > suspend/resume.
>
> Yes, "Pausing" TSC during suspend or migration is one way of improving
> the TSC estimate.  If we want to emulate ART, then the estimate is
> noticeably lacking, because TSC and ART are defined by a simple
> equation (SDM 2015-12, 17.14.4 Invariant Time-Keeping):
>  TSC_Value = (ART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + K
>
> where the guest thinks that CPUID and K are constant (between events
> that the guest knows of), so we should give the best estimate of how
> many TSC cycles have passed.  (The best estimate is still lacking.)
>
> >                  After all, host suspend/resume is kind of like
> > migrating from the pre-suspend host to the post-resume host.  Maybe it
> > could even share code.
>
> Hopefully ... host suspend/resume is driven by kernel and migration is
> driven by userspace, which might complicate sharing.

Good point.

--Andy
Radim Krčmář March 17, 2016, 7:58 p.m. UTC | #7
2016-03-17 11:22-0700, Andy Lutomirski:
> On Mar 17, 2016 8:10 AM, "Radim Krcmar" <rkrcmar@redhat.com> wrote:
>> 2016-03-16 16:07-0700, Andy Lutomirski:
>>> On Wed, Mar 16, 2016 at 3:59 PM, Radim Krcmar <rkrcmar@redhat.com> wrote:
>>>> 2016-03-16 15:15-0700, Andy Lutomirski:
>>>>> FWIW, if you ever intend to support ART ("always running timer")
>>>>> passthrough, this is going to be a giant clusterfsck.  Good luck.  I
>>>>> haven't gotten a straight answer as to what hardware actually supports
>>>>> that thing, so even testing isn't easy.
>>>>
>>>> Hm, AR TSC would be best handled by doing nothing ... dropping the
>>>> faking logic just became tempting.
>>
>> ART is different from what I initially thought, it's the underlying
>> mechanism for invariant TSC and nothing more ...  we already forbid
>> migrations when the guest knows about invariant TSC, so we could do the
>> same and let ART be virtualized.  (Suspend has to be forbidden too.)
> 
> It's more than that -- it's a TSC-like clock that can be read by PCIe devices.

So ART is for time synchronization within the machine.  Makes sense now.

>>> As it stands, ART is screwed if you adjust the VMCS's tsc offset.  But
>>
>> Luckily, assigning real hardware can prevent migration or suspend, so we
>> won't need to adjust the offset during runtime.  TSC is a generally
>> unmigratable device that just happens to live on the CPU.
>>
>> (It would have been better to hide TSC capability from the guest and only
>>  use rdtsc for kvmclock if the guest wanted fancy features.)
>>
> 
> I think that, if KVM passes through an ART-supporting NIC, it might be
> rather messy to try to avoid passing through TSC as well.

I agree.  Migrating a guest with ART-supporting NIC is going to be hard
or impossible, so there is no big drawback in exposing TSC.

If KVM adds host TSC_ADJUST and VMCS TSC-offset to guest TSC_ADJUST,
then ART-supporting NIC should use timestamps compatible with VCPUs.

>                                                            But maybe a
> pvclock-like structure could expose the ART-kvmclock offset and scale.

I think that getting ART from kvmclock would turn out to be horrible.

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eed32283d22c..c88f91f4b1a3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -123,8 +123,6 @@  module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
 unsigned int __read_mostly lapic_timer_advance_ns = 0;
 module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
 
-static bool __read_mostly backwards_tsc_observed = false;
-
 #define KVM_NR_SHARED_MSRS 16
 
 struct kvm_shared_msrs_global {
@@ -1671,7 +1669,6 @@  static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 					&ka->master_cycle_now);
 
 	ka->use_master_clock = host_tsc_clocksource && vcpus_matched
-				&& !backwards_tsc_observed
 				&& !ka->boot_vcpu_runs_old_kvmclock;
 
 	if (ka->use_master_clock)
@@ -7369,88 +7366,22 @@  int kvm_arch_hardware_enable(void)
 	struct kvm_vcpu *vcpu;
 	int i;
 	int ret;
-	u64 local_tsc;
-	u64 max_tsc = 0;
-	bool stable, backwards_tsc = false;
 
 	kvm_shared_msr_cpu_online();
 	ret = kvm_x86_ops->hardware_enable();
 	if (ret != 0)
 		return ret;
 
-	local_tsc = rdtsc();
-	stable = !check_tsc_unstable();
 	list_for_each_entry(kvm, &vm_list, vm_list) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!stable && vcpu->cpu == smp_processor_id())
+			if (vcpu->cpu == smp_processor_id()) {
 				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-			if (stable && vcpu->arch.last_host_tsc > local_tsc) {
-				backwards_tsc = true;
-				if (vcpu->arch.last_host_tsc > max_tsc)
-					max_tsc = vcpu->arch.last_host_tsc;
+				kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE,
+						 vcpu);
 			}
 		}
 	}
 
-	/*
-	 * Sometimes, even reliable TSCs go backwards.  This happens on
-	 * platforms that reset TSC during suspend or hibernate actions, but
-	 * maintain synchronization.  We must compensate.  Fortunately, we can
-	 * detect that condition here, which happens early in CPU bringup,
-	 * before any KVM threads can be running.  Unfortunately, we can't
-	 * bring the TSCs fully up to date with real time, as we aren't yet far
-	 * enough into CPU bringup that we know how much real time has actually
-	 * elapsed; our helper function, get_kernel_ns() will be using boot
-	 * variables that haven't been updated yet.
-	 *
-	 * So we simply find the maximum observed TSC above, then record the
-	 * adjustment to TSC in each VCPU.  When the VCPU later gets loaded,
-	 * the adjustment will be applied.  Note that we accumulate
-	 * adjustments, in case multiple suspend cycles happen before some VCPU
-	 * gets a chance to run again.  In the event that no KVM threads get a
-	 * chance to run, we will miss the entire elapsed period, as we'll have
-	 * reset last_host_tsc, so VCPUs will not have the TSC adjusted and may
-	 * loose cycle time.  This isn't too big a deal, since the loss will be
-	 * uniform across all VCPUs (not to mention the scenario is extremely
-	 * unlikely). It is possible that a second hibernate recovery happens
-	 * much faster than a first, causing the observed TSC here to be
-	 * smaller; this would require additional padding adjustment, which is
-	 * why we set last_host_tsc to the local tsc observed here.
-	 *
-	 * N.B. - this code below runs only on platforms with reliable TSC,
-	 * as that is the only way backwards_tsc is set above.  Also note
-	 * that this runs for ALL vcpus, which is not a bug; all VCPUs should
-	 * have the same delta_cyc adjustment applied if backwards_tsc
-	 * is detected.  Note further, this adjustment is only done once,
-	 * as we reset last_host_tsc on all VCPUs to stop this from being
-	 * called multiple times (one for each physical CPU bringup).
-	 *
-	 * Platforms with unreliable TSCs don't have to deal with this, they
-	 * will be compensated by the logic in vcpu_load, which sets the TSC to
-	 * catchup mode.  This will catchup all VCPUs to real time, but cannot
-	 * guarantee that they stay in perfect synchronization.
-	 */
-	if (backwards_tsc) {
-		u64 delta_cyc = max_tsc - local_tsc;
-		backwards_tsc_observed = true;
-		list_for_each_entry(kvm, &vm_list, vm_list) {
-			kvm_for_each_vcpu(i, vcpu, kvm) {
-				vcpu->arch.tsc_offset_adjustment += delta_cyc;
-				vcpu->arch.last_host_tsc = local_tsc;
-				kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
-			}
-
-			/*
-			 * We have to disable TSC offset matching.. if you were
-			 * booting a VM while issuing an S4 host suspend....
-			 * you may have some problem.  Solving this issue is
-			 * left as an exercise to the reader.
-			 */
-			kvm->arch.last_tsc_nsec = 0;
-			kvm->arch.last_tsc_write = 0;
-		}
-
-	}
 	return 0;
 }