diff mbox

[v2,00/23] MIPS: KVM: Fixes and guest timer rewrite

Message ID 53874719.5070604@imgtec.com (mailing list archive)
State New, archived
Headers show

Commit Message

James Hogan May 29, 2014, 2:41 p.m. UTC
On 29/05/14 11:36, Paolo Bonzini wrote:
> Il 29/05/2014 11:16, James Hogan ha scritto:
>> Here are a range of MIPS KVM T&E fixes, preferably for v3.16 but I know
>> it's probably a bit late now. Changes are pretty minimal though since
>> v1 so please consider. They can also be found on my kvm_mips_queue
>> branch (and the kvm_mips_timer_v2 tag) here:
>> git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/kvm-mips.git
> 
> That's okay for me, but I'd like to get a look at the QEMU parts.

I'm still in the process of cleaning up my qemu patchset, but the
attached 2 patches hopefully shows the gist of it this particular change.

Thanks
James

> 
> I would also like an Acked-by for patches 2 and 14, and I have a
> question about patch 11.
> 
> Paolo

Comments

Paolo Bonzini May 29, 2014, 3:23 p.m. UTC | #1
Il 29/05/2014 16:41, James Hogan ha scritto:
> +
> +    /* If VM clock stopped then state was already saved when it was stopped */
> +    if (runstate_is_running()) {
> +        ret = kvm_mips_save_count(cs);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +    }
> +

You're expecting that calls to kvm_mips_get_cp0_registers and 
kvm_mips_put_cp0_registers are balanced and not nested.  Perhaps you 
should add an assert about it.

> +    if (!(count_ctl & KVM_REG_MIPS_COUNT_CTL_DC)) {
> +        count_ctl |= KVM_REG_MIPS_COUNT_CTL_DC;
> +        ret = kvm_mips_put_one_reg64(cs, KVM_REG_MIPS_COUNT_CTL, &count_ctl);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +    }

Would it make sense to return directly if the master disable bit is set? 
  The rest of this function is idempotent.

Also, perhaps this bit in kvm_mips_restore_count is unnecessary, and so 
is env->count_save_time in general:

> +        /* find time to resume the saved timer at */
> +        now = get_clock();
> +        count_resume = now - (cpu_get_clock_at(now) - env->count_save_time);

Is the COUNT_RESUME write necessary if the VM is running?  Does the 
master disable bit just latch the values, or does it really stop the 
timer?  (My reading of the code is the former, since writing 
COUNT_RESUME only modifies the bias: no write => no bias change => timer 
runs).

And if the VM is not running, you have timer_state.cpu_ticks_enabled == 
false, so cpu_get_clock_at() always returns timers_state.cpu_clock_offset.

So, if the COUNT_RESUME write is not necessary for a running VM, you can 
then just write get_clock() to COUNT_RESUME, which seems to make sense 
to me.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Hogan May 29, 2014, 4:27 p.m. UTC | #2
Hi Paolo,

On 29/05/14 16:23, Paolo Bonzini wrote:
> Il 29/05/2014 16:41, James Hogan ha scritto:
>> +
>> +    /* If VM clock stopped then state was already saved when it was
>> stopped */
>> +    if (runstate_is_running()) {
>> +        ret = kvm_mips_save_count(cs);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +    }
>> +
> 
> You're expecting that calls to kvm_mips_get_cp0_registers and
> kvm_mips_put_cp0_registers are balanced and not nested.  Perhaps you
> should add an assert about it.

Yes and no. If you loadvm or do an incoming migration you get an extra
put without a prior get, so it has to handle that case (ensuring that
the timer is frozen in kvm_mips_restore_count). I don't think it should
ever do two kvm_mips_get_cp0_registers calls though, but it should still
handle it okay since the timer will already be frozen so it'd just read
the same values out.

(although as it stands CP0_Count never represents the offset from the VM
clock for KVM like it does with a running Count with TCG, so the vmstate
is technically incompatible between TCG/KVM).

>> +    if (!(count_ctl & KVM_REG_MIPS_COUNT_CTL_DC)) {
>> +        count_ctl |= KVM_REG_MIPS_COUNT_CTL_DC;
>> +        ret = kvm_mips_put_one_reg64(cs, KVM_REG_MIPS_COUNT_CTL,
>> &count_ctl);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +    }
> 
> Would it make sense to return directly if the master disable bit is set?

It would probably indicate that kvm_mips_get_cp0_registers was called
twice, so yes it could do that, although it would probably be
unexpected/wrong if it did happen (maybe qemu modified the values while
the state was dirty).

>  The rest of this function is idempotent.
> 
> Also, perhaps this bit in kvm_mips_restore_count is unnecessary, and so
> is env->count_save_time in general:
> 
>> +        /* find time to resume the saved timer at */
>> +        now = get_clock();
>> +        count_resume = now - (cpu_get_clock_at(now) -
>> env->count_save_time);
> 
> Is the COUNT_RESUME write necessary if the VM is running?

Running at that instant or running continuously since the save?

At this instant the VM is always running. Either it's just been started
and other state isn't dirty, or the registers have been put while the VM
is running.

If the VM wasn't stopped since the context was saved, you're right that
it could skip modifying COUNT_RESUME, it won't have changed. It's there
for if the VM was stopped, e.g.:
stop vm - save
get regs
start vm
put regs - restore (e.g. before run vcpu)

in which case COUNT_RESUME must be altered for Count to keep a similar
offset against the vm clock (like hw/mips/cputimer.c ensures while count
is running - even when vm stopped).

> Does the
> master disable bit just latch the values, or does it really stop the
> timer?  (My reading of the code is the former, since writing
> COUNT_RESUME only modifies the bias: no write => no bias change => timer
> runs).

It appears latched in the sense that starting it again will jump Count
forward to the time it would have been had it not been disabled (with no
loss of Compare interrupt in that time).

However it does stop the timer and interrupts, so if you change Count it
will be as if you changed it exactly at the moment it was latched. If
you change COUNT_RESUME, Count will not change immediately, but the
monotonic time at which the (unchanged) Count value will appear to
resume counting will be changed. So in that sense it acts as a bias.
Increment it by 10ns and the Count will be 10ns behind when you resume
(as if it had been stopped for 10ns, so with no missed interrupts).

> And if the VM is not running, you have timer_state.cpu_ticks_enabled ==
> false, so cpu_get_clock_at() always returns timers_state.cpu_clock_offset.
> 
> So, if the COUNT_RESUME write is not necessary for a running VM, you can
> then just write get_clock() to COUNT_RESUME, which seems to make sense
> to me.

If the VM is stopped you probably wouldn't want to resume the timer yet
at all.

See above though, the VM is always running at this point
(restore_count), even if it has only just started (or may have been
stopped and started since the save).

Does that make sense? It's surprising hard to explain how it works
clearly without resorting to equations :)

Thanks
James
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini May 29, 2014, 5:03 p.m. UTC | #3
Il 29/05/2014 18:27, James Hogan ha scritto:
> (although as it stands CP0_Count never represents the offset from the VM
> clock for KVM like it does with a running Count with TCG, so the vmstate
> is technically incompatible between TCG/KVM).

That can be fixed in cpu_save/cpu_load hooks, like

     if (kvm_enabled()) {
         uint32_t TCGlike_CP0_Count = ...
         qemu_put_sbe32s(f, &TCGlike_CP0_Count);
     } else {
         qemu_put_sbe32s(f, &env->CP0_Count);
     }

...

     if (kvm_enabled()) {
         uint32_t TCGlike_CP0_Count;
         qemu_get_sbe32s(f, &TCGlike_CP0_Count);
         env->CP0_Count = ...
     } else {
         qemu_get_sbe32s(f, &env->CP0_Count);
     }

>> Also, perhaps this bit in kvm_mips_restore_count is unnecessary, and so
>> is env->count_save_time in general:
>>
>>> +        /* find time to resume the saved timer at */
>>> +        now = get_clock();
>>> +        count_resume = now - (cpu_get_clock_at(now) -
>>> env->count_save_time);
>>
>> Is the COUNT_RESUME write necessary if the VM is running?
>
> Running at that instant or running continuously since the save?
>
> At this instant the VM is always running. Either it's just been started
> and other state isn't dirty, or the registers have been put while the VM
> is running.

The possible transitions are:

running, not dirty -> stopped
	need to freeze and load the registers

stopped -> running, not dirty
	will reload the registers, need to modify COUNT_RESUME

running, dirty -> stopped
	no need to do anything

stopped -> running, dirty
	will not reload the registers until put, will need to modify
	COUNT_RESUME on the next transition to "running, not dirty"

running, not dirty -> running, dirty
	need to freeze and load the registers

running, dirty -> running, not dirty
	need to modify COUNT_RESUME if the machine had been stopped
	in the meanwhile

The questions then is, can we skip tracking count_save_time and 
modifying COUNT_RESUME in kvm_mips_restore_count?  Then you can just 
write get_clock() to COUNT_RESUME in kvm_mips_update_state, like this:

     if (!running) {
         if (!cs->kvm_vcpu_dirty) {
             save;
         }
     else {
         write get_clock() to COUNT_RESUME;
         if (!cs->kvm_vcpu_dirty) {
             restore;
         }
     }

and even drop patch 1.  COUNT_RESUME is not even ever read by QEMU nor 
stored in CPUState, so.

The difference is that the guest "loses" the time between the "running, 
not dirty -> running, dirty" and "running, dirty -> stopped" 
transitions, while "gaining" the time between "stopped -> running, 
dirty" and "running, dirty -> running, not dirty".  If this is right, I 
think the difference does not matter in practice and the new/simpler 
code even explains the definition of COUNT_RESUME better in my eyes.

>> Does the
>> master disable bit just latch the values, or does it really stop the
>> timer?  (My reading of the code is the former, since writing
>> COUNT_RESUME only modifies the bias: no write => no bias change => timer
>> runs).
>
> It appears latched in the sense that starting it again will jump Count
> forward to the time it would have been had it not been disabled (with no
> loss of Compare interrupt in that time).

Yes, this is the important part because it means that the guest clock 
does not get progressively more skewed.  It also means that it is right 
to never write COUNT_RESUME except if you go through stop/continue.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Hogan May 29, 2014, 8:44 p.m. UTC | #4
Hi Paolo,

On 29/05/14 18:03, Paolo Bonzini wrote:
>>> Also, perhaps this bit in kvm_mips_restore_count is unnecessary, and so
>>> is env->count_save_time in general:
>>>
>>>> +        /* find time to resume the saved timer at */
>>>> +        now = get_clock();
>>>> +        count_resume = now - (cpu_get_clock_at(now) -
>>>> env->count_save_time);
>>>
>>> Is the COUNT_RESUME write necessary if the VM is running?
>>
>> Running at that instant or running continuously since the save?
>>
>> At this instant the VM is always running. Either it's just been started
>> and other state isn't dirty, or the registers have been put while the VM
>> is running.
> 
> The possible transitions are:
> 
> running, not dirty -> stopped
>     need to freeze and load the registers
> 
> stopped -> running, not dirty
>     will reload the registers, need to modify COUNT_RESUME
> 
> running, dirty -> stopped
>     no need to do anything
> 
> stopped -> running, dirty
>     will not reload the registers until put, will need to modify
>     COUNT_RESUME on the next transition to "running, not dirty"
> 
> running, not dirty -> running, dirty
>     need to freeze and load the registers
> 
> running, dirty -> running, not dirty
>     need to modify COUNT_RESUME if the machine had been stopped
>     in the meanwhile
> 
> The questions then is, can we skip tracking count_save_time and
> modifying COUNT_RESUME in kvm_mips_restore_count?  Then you can just
> write get_clock() to COUNT_RESUME in kvm_mips_update_state, like this:
> 
>     if (!running) {
>         if (!cs->kvm_vcpu_dirty) {
>             save;
>         }
>     else {
>         write get_clock() to COUNT_RESUME;
>         if (!cs->kvm_vcpu_dirty) {
>             restore;
>         }
>     }
> 
> and even drop patch 1.  COUNT_RESUME is not even ever read by QEMU nor
> stored in CPUState, so.
> 
> The difference is that the guest "loses" the time between the "running,
> not dirty -> running, dirty" and "running, dirty -> stopped"
> transitions, while "gaining" the time between "stopped -> running,
> dirty" and "running, dirty -> running, not dirty".  If this is right, I
> think the difference does not matter in practice and the new/simpler
> code even explains the definition of COUNT_RESUME better in my eyes.

Yes, I agree with your analysis and had considered something like this,
although it doesn't particularly appeal to my sense of perfectionism :).
It would be race free though, and if you're stopping the VM at all you
expect to lose some time anyway.

> 
>>> Does the
>>> master disable bit just latch the values, or does it really stop the
>>> timer?  (My reading of the code is the former, since writing
>>> COUNT_RESUME only modifies the bias: no write => no bias change => timer
>>> runs).
>>
>> It appears latched in the sense that starting it again will jump Count
>> forward to the time it would have been had it not been disabled (with no
>> loss of Compare interrupt in that time).
> 
> Yes, this is the important part because it means that the guest clock
> does not get progressively more skewed.  It also means that it is right
> to never write COUNT_RESUME except if you go through stop/continue.

Yes, if the VM hasn't been stopped the value written is unchanged from
that originally read (see below), so it could skip it in that case.

save:
  count_save_time = cpu_clock_offset + COUNT_RESUME
restore:
  COUNT_RESUME = get_clock() - (cpu_clock_offset + get_clock() -
count_save_time)
		= count_save_time - cpu_clock_offset
		= COUNT_RESUME

Cheers
James
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini May 30, 2014, 7:57 a.m. UTC | #5
Il 29/05/2014 22:44, James Hogan ha scritto:
> Yes, I agree with your analysis and had considered something like this,
> although it doesn't particularly appeal to my sense of perfectionism :).

I can see that.  But I think the simplification of the code is worth it.

It is hard to explain why the invalid times-goes-backwards case can 
happen if env->count_save_time is overwritten with data from another 
machine.  I think the explanation is that (due to 
timers_state.cpu_ticks_enabled) the value of "cpu_get_clock_at(now) - 
env->count_save_time" does not depend on get_clock(), but the code 
doesn't have any comment for that.

Rather than adding comments, we might as well force it to be always zero 
and just write get_clock() to COUNT_RESUME.

Finally, having to serialize env->count_save_time makes harder to 
support migration from TCG to KVM and back.

> It would be race free though, and if you're stopping the VM at all you
> expect to lose some time anyway.

Since you mentioned perfectionism, :) your code also loses some time; 
COUNT_RESUME is written based on when the CPU state becomes clean, not 
on when the CPU was restarted.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Hogan June 16, 2014, 4:29 p.m. UTC | #6
On 30/05/14 08:57, Paolo Bonzini wrote:
> Il 29/05/2014 22:44, James Hogan ha scritto:
>> Yes, I agree with your analysis and had considered something like this,
>> although it doesn't particularly appeal to my sense of perfectionism :).
> 
> I can see that.  But I think the simplification of the code is worth it.
> 
> It is hard to explain why the invalid times-goes-backwards case can
> happen if env->count_save_time is overwritten with data from another
> machine.  I think the explanation is that (due to
> timers_state.cpu_ticks_enabled) the value of "cpu_get_clock_at(now) -
> env->count_save_time" does not depend on get_clock(), but the code
> doesn't have any comment for that.

Exactly. I think of it in terms of count_save_time being in the time
frame of the vm clock, which is also migrated.

In fact since the VM clock is stopped during migration, the
saved/restored count_save_time will likely be the same as the saved vm
clock, so the delta calculated above at restore time is the time since
the vm clock was resumed.

> Rather than adding comments, we might as well force it to be always zero
> and just write get_clock() to COUNT_RESUME.
> 
> Finally, having to serialize env->count_save_time makes harder to
> support migration from TCG to KVM and back.

Yes, I'm not keen on that bit of code.

> 
>> It would be race free though, and if you're stopping the VM at all you
>> expect to lose some time anyway.
> 
> Since you mentioned perfectionism, :) your code also loses some time;
> COUNT_RESUME is written based on when the CPU state becomes clean, not
> on when the CPU was restarted.

The offset you suggest removing is what ensures time isn't lost there.
The VM time when the cpu is restarted == the VM time when it is stopped,
so by saving vm time to count_save_time at vm stop, the timer can be
restarted at exactly the right interval into the past
(COUNT_RESUME=now-interval) so that no time is lost.

Now there is one case I believe where time will be gained which is hard
to avoid without getting a notification before VM stop rather than
after. When the VM is stopped and qemu's state is clean, the state is
read very soon after (not immediately), so the saved state will be at
slightly after the recorded vm time. I.e. the timer was allowed to
continue slightly past when the vm clock was actually stopped.

Cheers
James
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini June 16, 2014, 4:33 p.m. UTC | #7
Il 16/06/2014 18:29, James Hogan ha scritto:
>> Rather than adding comments, we might as well force it to be always zero
>> and just write get_clock() to COUNT_RESUME.
>>
>> Finally, having to serialize env->count_save_time makes harder to
>> support migration from TCG to KVM and back.
>
> Yes, I'm not keen on that bit of code.

Are you going to submit it for 2.1?  There are still a few days for 
review and inclusion.

Paolo

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

From 22c61769ec31cb222c5291c88bd6a658e23862f5 Mon Sep 17 00:00:00 2001
From: James Hogan <james.hogan@imgtec.com>
Date: Thu, 29 May 2014 14:36:20 +0100
Subject: [PATCH 2/2] target-mips: kvm: Save/restore KVM timer state

Save and restore KVM timer state properly. When it's saved (or VM clock
stopped) the VM clock is also recorded. When it's restored (or VM clock
restarted) it is resumed with the stored count at the saved VM clock
translated into monotonic time, i.e. current time - elapsed VM
nanoseconds. This therefore behaves correctly after the VM clock has
been stopped.

E.g.
sync state to QEMU
    takes snapshot of Count, Cause and VM time(now)
sync state back to KVM
    restores Count, Cause at translated VM time (unchanged)
    time continues from the same Count without losing time or interrupts

Or:
stop vm clock
    takes snapshot of Count, Cause and VM time

restart vm clock
    restores Count, Cause at now - (vm time(now) - saved vm time)
    time continues from the same Count but at a later monotonic time
    depending on how long the vm clock was stopped

The VM time at which the timer was stopped (count_save_time) is
saved/loaded by cpu_save and cpu_load so that it can be restarted
without losing time relative to the VM clock.

Signed-off-by: James Hogan <james.hogan@imgtec.com>
---
 target-mips/cpu.h     |   3 +-
 target-mips/kvm.c     | 196 +++++++++++++++++++++++++++++++++++++++++++++++---
 target-mips/machine.c |   6 ++
 3 files changed, 195 insertions(+), 10 deletions(-)

diff --git a/target-mips/cpu.h b/target-mips/cpu.h
index 67bf441..02a8da1 100644
--- a/target-mips/cpu.h
+++ b/target-mips/cpu.h
@@ -495,6 +495,7 @@  struct CPUMIPSState {
     const mips_def_t *cpu_model;
     void *irq[8];
     QEMUTimer *timer; /* Internal timer */
+    int64_t count_save_time;
 };
 
 #include "cpu-qom.h"
@@ -526,7 +527,7 @@  void mips_cpu_list (FILE *f, fprintf_function cpu_fprintf);
 extern void cpu_wrdsp(uint32_t rs, uint32_t mask_num, CPUMIPSState *env);
 extern uint32_t cpu_rddsp(uint32_t mask_num, CPUMIPSState *env);
 
-#define CPU_SAVE_VERSION 4
+#define CPU_SAVE_VERSION 5
 
 /* MMU modes definitions. We carefully match the indices with our
    hflags layout. */
diff --git a/target-mips/kvm.c b/target-mips/kvm.c
index 8765205..ee2dd1c 100644
--- a/target-mips/kvm.c
+++ b/target-mips/kvm.c
@@ -33,6 +33,8 @@  const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
     KVM_CAP_LAST_INFO
 };
 
+static void kvm_mips_update_state(void *opaque, int running, RunState state);
+
 unsigned long kvm_arch_vcpu_id(CPUState *cs)
 {
     return cs->cpu_index;
@@ -50,6 +52,9 @@  int kvm_arch_init(KVMState *s)
 int kvm_arch_init_vcpu(CPUState *cs)
 {
     int ret = 0;
+
+    qemu_add_vm_change_state_handler(kvm_mips_update_state, cs);
+
     DPRINTF("%s\n", __func__);
     return ret;
 }
@@ -224,6 +229,17 @@  int kvm_mips_set_ipi_interrupt(MIPSCPU *cpu, int irq, int level)
 #define KVM_REG_MIPS_CP0_XCONTEXT	MIPS_CP0_64(20, 0)
 #define KVM_REG_MIPS_CP0_ERROREPC	MIPS_CP0_64(30, 0)
 
+/* CP0_Count control */
+#define KVM_REG_MIPS_COUNT_CTL          (KVM_REG_MIPS | KVM_REG_SIZE_U64 | \
+                                         0x20000 | 0)
+#define KVM_REG_MIPS_COUNT_CTL_DC       0x00000001      /* master disable */
+/* CP0_Count resume monotonic nanoseconds */
+#define KVM_REG_MIPS_COUNT_RESUME       (KVM_REG_MIPS | KVM_REG_SIZE_U64 | \
+                                         0x20000 | 1)
+/* CP0_Count rate in Hz */
+#define KVM_REG_MIPS_COUNT_HZ           (KVM_REG_MIPS | KVM_REG_SIZE_U64 | \
+                                         0x20000 | 2)
+
 static inline int kvm_mips_put_one_reg(CPUState *cs, uint64_t reg_id,
                                        int32_t *addr)
 {
@@ -248,6 +264,17 @@  static inline int kvm_mips_put_one_ulreg(CPUState *cs, uint64_t reg_id,
     return kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &cp0reg);
 }
 
+static inline int kvm_mips_put_one_reg64(CPUState *cs, uint64_t reg_id,
+                                         uint64_t *addr)
+{
+    struct kvm_one_reg cp0reg = {
+        .id = reg_id,
+        .addr = (uintptr_t)addr
+    };
+
+    return kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &cp0reg);
+}
+
 static inline int kvm_mips_get_one_reg(CPUState *cs, uint64_t reg_id,
                                        int32_t *addr)
 {
@@ -286,6 +313,151 @@  static inline int kvm_mips_get_one_ulreg(CPUState *cs, uint64 reg_id,
     return ret;
 }
 
+static inline int kvm_mips_get_one_reg64(CPUState *cs, uint64 reg_id,
+                                         uint64_t *addr)
+{
+    int ret;
+    struct kvm_one_reg cp0reg = {
+        .id = reg_id,
+        .addr = (uintptr_t)addr
+    };
+
+    ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &cp0reg);
+    return ret;
+}
+
+/*
+ * We freeze the KVM timer when either the VM clock is stopped or the state is
+ * saved (the state is dirty).
+ */
+
+/*
+ * Save the state of the KVM timer when VM clock is stopped or state is synced
+ * to QEMU.
+ */
+static int kvm_mips_save_count(CPUState *cs)
+{
+    MIPSCPU *cpu = MIPS_CPU(cs);
+    CPUMIPSState *env = &cpu->env;
+    uint64_t count_ctl, count_resume;
+    int ret;
+
+    /* freeze KVM timer */
+    ret = kvm_mips_get_one_reg64(cs, KVM_REG_MIPS_COUNT_CTL, &count_ctl);
+    if (ret < 0) {
+        return ret;
+    }
+    if (!(count_ctl & KVM_REG_MIPS_COUNT_CTL_DC)) {
+        count_ctl |= KVM_REG_MIPS_COUNT_CTL_DC;
+        ret = kvm_mips_put_one_reg64(cs, KVM_REG_MIPS_COUNT_CTL, &count_ctl);
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
+    /* read CP0_Cause */
+    ret = kvm_mips_get_one_reg(cs, KVM_REG_MIPS_CP0_CAUSE, &env->CP0_Cause);
+    if (ret < 0) {
+        return ret;
+    }
+
+    /* read CP0_Count */
+    ret = kvm_mips_get_one_reg(cs, KVM_REG_MIPS_CP0_COUNT, &env->CP0_Count);
+    if (ret < 0) {
+        return ret;
+    }
+
+    /* read COUNT_RESUME */
+    ret = kvm_mips_get_one_reg64(cs, KVM_REG_MIPS_COUNT_RESUME,
+                                 &count_resume);
+    if (ret < 0) {
+        return ret;
+    }
+
+    /* translate monotonic COUNT_RESUME to VM time */
+    env->count_save_time = cpu_get_clock_at(count_resume);
+
+    return ret;
+}
+
+/*
+ * Restore the state of the KVM timer when VM clock is restarted or state is
+ * synced to KVM.
+ */
+static int kvm_mips_restore_count(CPUState *cs)
+{
+    MIPSCPU *cpu = MIPS_CPU(cs);
+    CPUMIPSState *env = &cpu->env;
+    uint64_t count_ctl, count_resume;
+    int64_t now = get_clock();
+    int ret = 0;
+
+    /* check the timer is frozen */
+    ret = kvm_mips_get_one_reg64(cs, KVM_REG_MIPS_COUNT_CTL, &count_ctl);
+    if (ret < 0) {
+        return ret;
+    }
+    if (!(count_ctl & KVM_REG_MIPS_COUNT_CTL_DC)) {
+        /* freeze timer (sets COUNT_RESUME for us) */
+        count_ctl |= KVM_REG_MIPS_COUNT_CTL_DC;
+        ret = kvm_mips_put_one_reg64(cs, KVM_REG_MIPS_COUNT_CTL, &count_ctl);
+        if (ret < 0) {
+            return ret;
+        }
+    } else {
+        /* find time to resume the saved timer at */
+        now = get_clock();
+        count_resume = now - (cpu_get_clock_at(now) - env->count_save_time);
+        ret = kvm_mips_put_one_reg64(cs, KVM_REG_MIPS_COUNT_RESUME,
+                                     &count_resume);
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
+    /* load CP0_Cause */
+    ret = kvm_mips_put_one_reg(cs, KVM_REG_MIPS_CP0_CAUSE, &env->CP0_Cause);
+    if (ret < 0) {
+        return ret;
+    }
+
+    /* load CP0_Count */
+    ret = kvm_mips_put_one_reg(cs, KVM_REG_MIPS_CP0_COUNT, &env->CP0_Count);
+    if (ret < 0) {
+        return ret;
+    }
+
+    /* resume KVM timer */
+    count_ctl &= ~KVM_REG_MIPS_COUNT_CTL_DC;
+    ret = kvm_mips_put_one_reg64(cs, KVM_REG_MIPS_COUNT_CTL, &count_ctl);
+    return ret;
+}
+
+/*
+ * Handle the VM clock being started or stopped
+ */
+static void kvm_mips_update_state(void *opaque, int running, RunState state)
+{
+    CPUState *cs = opaque;
+    int ret;
+
+    /*
+     * If state is already dirty (synced to QEMU) then the KVM timer state is
+     * already saved and can be restored when it is synced back to KVM.
+     */
+    if (!cs->kvm_vcpu_dirty) {
+        if (running) {
+            ret = kvm_mips_restore_count(cs);
+        } else {
+            ret = kvm_mips_save_count(cs);
+        }
+        if (ret < 0) {
+            fprintf(stderr, "Failed update count (running=%d)\n",
+                    running);
+        }
+    }
+}
+
 static int kvm_mips_put_cp0_registers(CPUState *cs, int level)
 {
     MIPSCPU *cpu = MIPS_CPU(cs);
@@ -322,15 +494,16 @@  static int kvm_mips_put_cp0_registers(CPUState *cs, int level)
         fprintf(stderr, "Failed to put BADVADDR\n");
         return ret;
     }
-    if (level != KVM_PUT_RUNTIME_STATE) {
-        /* don't write guest count during runtime or the clock might drift */
-        ret |= kvm_mips_put_one_reg(cs, KVM_REG_MIPS_CP0_COUNT,
-                                    &env->CP0_Count);
+
+    /* If VM clock stopped then state will be restored when it is restarted */
+    if (runstate_is_running()) {
+        ret = kvm_mips_restore_count(cs);
         if (ret < 0) {
-            fprintf(stderr, "Failed to put COMPARE\n");
+            fprintf(stderr, "Failed to put COUNT\n");
             return ret;
         }
     }
+
     ret |= kvm_mips_put_one_ulreg(cs, KVM_REG_MIPS_CP0_ENTRYHI,
                                   &env->CP0_EntryHi);
     if (ret < 0) {
@@ -397,10 +570,6 @@  static int kvm_mips_get_cp0_registers(CPUState *cs)
     if (ret < 0) {
         return ret;
     }
-    ret |= kvm_mips_get_one_reg(cs, KVM_REG_MIPS_CP0_COUNT, &env->CP0_Count);
-    if (ret < 0) {
-        return ret;
-    }
     ret |= kvm_mips_get_one_ulreg(cs, KVM_REG_MIPS_CP0_ENTRYHI,
                                   &env->CP0_EntryHi);
     if (ret < 0) {
@@ -419,6 +588,15 @@  static int kvm_mips_get_cp0_registers(CPUState *cs)
     if (ret < 0) {
         return ret;
     }
+
+    /* If VM clock stopped then state was already saved when it was stopped */
+    if (runstate_is_running()) {
+        ret = kvm_mips_save_count(cs);
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
     ret |= kvm_mips_get_one_ulreg(cs, KVM_REG_MIPS_CP0_EPC, &env->CP0_EPC);
     if (ret < 0) {
         return ret;
diff --git a/target-mips/machine.c b/target-mips/machine.c
index 0f36c9e..4916b13 100644
--- a/target-mips/machine.c
+++ b/target-mips/machine.c
@@ -150,6 +150,8 @@  void cpu_save(QEMUFile *f, void *opaque)
         save_tc(f, &env->tcs[i]);
     for (i = 0; i < MIPS_FPU_MAX; i++)
         save_fpu(f, &env->fpus[i]);
+
+    qemu_put_sbe64s(f, &env->count_save_time);
 }
 
 static void load_tc(QEMUFile *f, TCState *tc, int version_id)
@@ -310,5 +312,9 @@  int cpu_load(QEMUFile *f, void *opaque, int version_id)
 
     /* XXX: ensure compatibility for halted bit ? */
     tlb_flush(CPU(cpu), 1);
+
+    if (version_id >= 5) {
+        qemu_get_sbe64s(f, &env->count_save_time);
+    }
     return 0;
 }
-- 
1.9.3