diff mbox

[QEMU] kvmclock: advance clock by time window between vm_stop and pre_save

Message ID 20161104094322.GA16930@amt.cnet (mailing list archive)
State New, archived
Headers show

Commit Message

Marcelo Tosatti Nov. 4, 2016, 9:43 a.m. UTC
This patch, relative to pre-copy migration codepath,
measures the time between vm_stop() and pre_save(), 
which includes copying the remaining RAM to destination,
and advances the clock by that amount.

In a VM with 5 seconds downtime, this reduces the guest 
clock difference on destination from 5s to 0.2s.

Please do not apply this yet as some codepaths still need
checking, submitting early for comments.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Comments

Juan Quintela Nov. 4, 2016, 12:28 p.m. UTC | #1
Marcelo Tosatti <mtosatti@redhat.com> wrote:
> This patch, relative to pre-copy migration codepath,
> measures the time between vm_stop() and pre_save(), 
> which includes copying the remaining RAM to destination,
> and advances the clock by that amount.
>
> In a VM with 5 seconds downtime, this reduces the guest 
> clock difference on destination from 5s to 0.2s.
>
> Please do not apply this yet as some codepaths still need
> checking, submitting early for comments.
>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

You can use an optional section, and then you don't need to increase the
version number.

I believe you that the clock manipulation is right, only talking about
the migration bits.

> +static uint64_t clock_delta(struct timespec *before, struct timespec *after)
> +{
> +    if (before->tv_sec > after->tv_sec ||
> +        (before->tv_sec == after->tv_sec &&
> +         before->tv_nsec > after->tv_nsec)) {
> +        fprintf(stderr, "clock_delta failed: before=(%ld sec, %ld nsec),"
> +                        "after=(%ld sec, %ld nsec)\n", before->tv_sec,
> +                        before->tv_nsec, after->tv_sec, after->tv_nsec);
> +        abort();
> +    }
> +
> +    return (after->tv_sec - before->tv_sec) * 1000000000ULL +
> +            after->tv_nsec - before->tv_nsec;
> +}

I can't believe that we don't have a helper function already to
calculate this....


> +
> +static void kvmclock_pre_save(void *opaque)
> +{
> +    KVMClockState *s = opaque;
> +    struct timespec now;
> +    uint64_t ns;
> +
> +    if (s->t_aftervmstop.tv_sec == 0) {
> +        return;
> +    }

You have your test here.

> +
> +    clock_gettime(CLOCK_MONOTONIC, &now);
> +
> +    ns = clock_delta(&s->t_aftervmstop, &now);
> +
> +    /*
> +     * Linux guests can overflow if time jumps
> +     * forward in large increments.
> +     * Cap maximum adjustment to 10 minutes.
> +     */
> +    ns = MIN(ns, 600000000000ULL);
> +
> +    if (s->clock + ns > s->clock) {
> +        s->ns = ns;

Would it be a good idea to print an error message here?  If it has been more
than 10mins since we did the vmstop, something got wrong here.


> +    }
> +}
> +
> +static int kvmclock_post_load(void *opaque, int version_id)
> +{
> +    KVMClockState *s = opaque;
> +
> +    /* save the value from incoming migration */
> +    s->advance_clock = s->ns;
> +
> +    return 0;
> +}
> +
>  static const VMStateDescription kvmclock_vmsd = {
>      .name = "kvmclock",
> -    .version_id = 1,
> +    .version_id = 2,
>      .minimum_version_id = 1,
> +    .pre_save = kvmclock_pre_save,
> +    .post_load = kvmclock_post_load,
>      .fields = (VMStateField[]) {
>          VMSTATE_UINT64(clock, KVMClockState),
> +        VMSTATE_UINT64_V(ns, KVMClockState, 2),
>          VMSTATE_END_OF_LIST()
>      }
>  };


If you need help with the subsection stuff, just ask.

Later, Juan.
Marcelo Tosatti Nov. 4, 2016, 12:35 p.m. UTC | #2
On Fri, Nov 04, 2016 at 01:28:48PM +0100, Juan Quintela wrote:
> Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > This patch, relative to pre-copy migration codepath,
> > measures the time between vm_stop() and pre_save(), 
> > which includes copying the remaining RAM to destination,
> > and advances the clock by that amount.
> >
> > In a VM with 5 seconds downtime, this reduces the guest 
> > clock difference on destination from 5s to 0.2s.
> >
> > Please do not apply this yet as some codepaths still need
> > checking, submitting early for comments.
> >
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> You can use an optional section, and then you don't need to increase the
> version number.

Optional section is more appropriate, thanks.

> I believe you that the clock manipulation is right, only talking about
> the migration bits.
> 
> > +static uint64_t clock_delta(struct timespec *before, struct timespec *after)
> > +{
> > +    if (before->tv_sec > after->tv_sec ||
> > +        (before->tv_sec == after->tv_sec &&
> > +         before->tv_nsec > after->tv_nsec)) {
> > +        fprintf(stderr, "clock_delta failed: before=(%ld sec, %ld nsec),"
> > +                        "after=(%ld sec, %ld nsec)\n", before->tv_sec,
> > +                        before->tv_nsec, after->tv_sec, after->tv_nsec);
> > +        abort();
> > +    }
> > +
> > +    return (after->tv_sec - before->tv_sec) * 1000000000ULL +
> > +            after->tv_nsec - before->tv_nsec;
> > +}
> 
> I can't believe that we don't have a helper function already to
> calculate this....

Couldnt find any...

> > +
> > +static void kvmclock_pre_save(void *opaque)
> > +{
> > +    KVMClockState *s = opaque;
> > +    struct timespec now;
> > +    uint64_t ns;
> > +
> > +    if (s->t_aftervmstop.tv_sec == 0) {
> > +        return;
> > +    }
> 
> You have your test here.
> 
> > +
> > +    clock_gettime(CLOCK_MONOTONIC, &now);
> > +
> > +    ns = clock_delta(&s->t_aftervmstop, &now);
> > +
> > +    /*
> > +     * Linux guests can overflow if time jumps
> > +     * forward in large increments.
> > +     * Cap maximum adjustment to 10 minutes.
> > +     */
> > +    ns = MIN(ns, 600000000000ULL);
> > +
> > +    if (s->clock + ns > s->clock) {
> > +        s->ns = ns;
> 
> Would it be a good idea to print an error message here?  If it has been more
> than 10mins since we did the vmstop, something got wrong here.

Not sure... is it not possible for the user to stop migration in some 
way? 

What if network is very slow and maxdowntime very high?

> > +    }
> > +}
> > +
> > +static int kvmclock_post_load(void *opaque, int version_id)
> > +{
> > +    KVMClockState *s = opaque;
> > +
> > +    /* save the value from incoming migration */
> > +    s->advance_clock = s->ns;
> > +
> > +    return 0;
> > +}
> > +
> >  static const VMStateDescription kvmclock_vmsd = {
> >      .name = "kvmclock",
> > -    .version_id = 1,
> > +    .version_id = 2,
> >      .minimum_version_id = 1,
> > +    .pre_save = kvmclock_pre_save,
> > +    .post_load = kvmclock_post_load,
> >      .fields = (VMStateField[]) {
> >          VMSTATE_UINT64(clock, KVMClockState),
> > +        VMSTATE_UINT64_V(ns, KVMClockState, 2),
> >          VMSTATE_END_OF_LIST()
> >      }
> >  };
> 
> 
> If you need help with the subsection stuff, just ask.
> 
> Later, Juan.

Ok, i'll try to cook up an optional section and lets see what happens.

Thanks Juan.
Marcelo Tosatti Nov. 4, 2016, 2 p.m. UTC | #3
On Fri, Nov 04, 2016 at 10:35:39AM -0200, Marcelo Tosatti wrote:
> On Fri, Nov 04, 2016 at 01:28:48PM +0100, Juan Quintela wrote:
> > Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > > This patch, relative to pre-copy migration codepath,
> > > measures the time between vm_stop() and pre_save(), 
> > > which includes copying the remaining RAM to destination,
> > > and advances the clock by that amount.
> > >
> > > In a VM with 5 seconds downtime, this reduces the guest 
> > > clock difference on destination from 5s to 0.2s.
> > >
> > > Please do not apply this yet as some codepaths still need
> > > checking, submitting early for comments.
> > >
> > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > You can use an optional section, and then you don't need to increase the
> > version number.
> 
> Optional section is more appropriate, thanks.
> 
> > I believe you that the clock manipulation is right, only talking about
> > the migration bits.
> > 
> > > +static uint64_t clock_delta(struct timespec *before, struct timespec *after)
> > > +{
> > > +    if (before->tv_sec > after->tv_sec ||
> > > +        (before->tv_sec == after->tv_sec &&
> > > +         before->tv_nsec > after->tv_nsec)) {
> > > +        fprintf(stderr, "clock_delta failed: before=(%ld sec, %ld nsec),"
> > > +                        "after=(%ld sec, %ld nsec)\n", before->tv_sec,
> > > +                        before->tv_nsec, after->tv_sec, after->tv_nsec);
> > > +        abort();
> > > +    }
> > > +
> > > +    return (after->tv_sec - before->tv_sec) * 1000000000ULL +
> > > +            after->tv_nsec - before->tv_nsec;
> > > +}
> > 
> > I can't believe that we don't have a helper function already to
> > calculate this....
> 
> Couldnt find any...
> 
> > > +
> > > +static void kvmclock_pre_save(void *opaque)
> > > +{
> > > +    KVMClockState *s = opaque;
> > > +    struct timespec now;
> > > +    uint64_t ns;
> > > +
> > > +    if (s->t_aftervmstop.tv_sec == 0) {
> > > +        return;
> > > +    }
> > 
> > You have your test here.
> > 
> > > +
> > > +    clock_gettime(CLOCK_MONOTONIC, &now);
> > > +
> > > +    ns = clock_delta(&s->t_aftervmstop, &now);
> > > +
> > > +    /*
> > > +     * Linux guests can overflow if time jumps
> > > +     * forward in large increments.
> > > +     * Cap maximum adjustment to 10 minutes.
> > > +     */
> > > +    ns = MIN(ns, 600000000000ULL);
> > > +
> > > +    if (s->clock + ns > s->clock) {
> > > +        s->ns = ns;
> > 
> > Would it be a good idea to print an error message here?  If it has been more
> > than 10mins since we did the vmstop, something got wrong here.
> 
> Not sure... is it not possible for the user to stop migration in some 
> way? 
> 
> What if network is very slow and maxdowntime very high?
> 
> > > +    }
> > > +}
> > > +
> > > +static int kvmclock_post_load(void *opaque, int version_id)
> > > +{
> > > +    KVMClockState *s = opaque;
> > > +
> > > +    /* save the value from incoming migration */
> > > +    s->advance_clock = s->ns;
> > > +
> > > +    return 0;
> > > +}
> > > +
> > >  static const VMStateDescription kvmclock_vmsd = {
> > >      .name = "kvmclock",
> > > -    .version_id = 1,
> > > +    .version_id = 2,
> > >      .minimum_version_id = 1,
> > > +    .pre_save = kvmclock_pre_save,
> > > +    .post_load = kvmclock_post_load,
> > >      .fields = (VMStateField[]) {
> > >          VMSTATE_UINT64(clock, KVMClockState),
> > > +        VMSTATE_UINT64_V(ns, KVMClockState, 2),
> > >          VMSTATE_END_OF_LIST()
> > >      }
> > >  };
> > 
> > 
> > If you need help with the subsection stuff, just ask.
> > 
> > Later, Juan.
> 
> Ok, i'll try to cook up an optional section and lets see what happens.
> 
> Thanks Juan.

Ok so by "optional section" i meant a section that when sent 
to destination, could be ignored and migration would succeed. 

The alternative (what this patch has now) is to increase migration
version so that:

    1. older machine types remain compatible. 
    2. newer machine types fail to migrate.

Because the data being sent, ns, is not really optional: if kvmclock or
hyper-v time is enabled (which should be 100% of the cases) we always
want to send that data.

That is, there is no difference between:

* Writing a subsection with needed=1 always (except when 
using an older machine types).
* Using old/new machine types with particular versions.

I think i missed the patch to switch current machine
types to kvmclock v1, BTW.
Radim Krčmář Nov. 4, 2016, 3:25 p.m. UTC | #4
2016-11-04 07:43-0200, Marcelo Tosatti:
> This patch, relative to pre-copy migration codepath,
> measures the time between vm_stop() and pre_save(), 
> which includes copying the remaining RAM to destination,
> and advances the clock by that amount.
> 
> In a VM with 5 seconds downtime, this reduces the guest 
> clock difference on destination from 5s to 0.2s.
> 
> Please do not apply this yet as some codepaths still need
> checking, submitting early for comments.

The time computation looks ok.

We could make it slightly more precise by returning the CLOCK_MONOTONIC
at which KVM_GET_CLOCK is read with the IOCTL, but we don't account the
migration time anyway, so that precision would be wasted.

> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
> @@ -100,6 +106,11 @@ static void kvmclock_vm_state_change(void *opaque, int running,
>              s->clock = time_at_migration;
>          }
>  
> +        if (s->advance_clock && s->clock + s->advance_clock > s->clock) {
> +            s->clock += s->advance_clock;
> +            s->advance_clock = 0;
> +        }

Can't the advance_clock added to the migrated KVMClockState instead of
passing it as another parameter?

(It is sad that we can't just query KVMClockState in kvmclock_pre_save
 because of the Linux bug.)

> @@ -135,6 +146,18 @@ static void kvmclock_vm_state_change(void *opaque, int running,
>              abort();
>          }
>          s->clock = data.clock;
> +        /*
> +         * Transition from VM-running to VM-stopped via migration?
> +         * Record when the VM was stopped.
> +         */
> +
> +        if (state == RUN_STATE_FINISH_MIGRATE &&
> +            !migration_in_postcopy(migrate_get_current())) {
> +            clock_gettime(CLOCK_MONOTONIC, &s->t_aftervmstop);

How big (more like small) was the clock delta between here and
kvmclock_pre_save with postcopy?

Thanks.
Paolo Bonzini Nov. 4, 2016, 3:33 p.m. UTC | #5
On 04/11/2016 16:25, Radim Krčmář wrote:
>> >  
>> > +        if (s->advance_clock && s->clock + s->advance_clock > s->clock) {
>> > +            s->clock += s->advance_clock;
>> > +            s->advance_clock = 0;
>> > +        }
> Can't the advance_clock added to the migrated KVMClockState instead of
> passing it as another parameter?
> 
> (It is sad that we can't just query KVMClockState in kvmclock_pre_save
>  because of the Linux bug.)

What Linux bug?  The one that makes us use kvmclock_current_nsec?

It should work with 4.9-rc (well, once Linus applies my pull request).
4.9-rc will not return ktime_get_ns for KVM_GET_CLOCK; it will return
the raw value from the kernel timekeeper.

I'm thinking that we should add a KVM capability for this, and skip
kvmclock_current_nsec if the capability is present.  The first part is
trivial, so we can do it even during Linux rc period.

Paolo
Radim Krčmář Nov. 4, 2016, 3:48 p.m. UTC | #6
2016-11-04 16:33+0100, Paolo Bonzini:
> On 04/11/2016 16:25, Radim Krčmář wrote:
>>> >  
>>> > +        if (s->advance_clock && s->clock + s->advance_clock > s->clock) {
>>> > +            s->clock += s->advance_clock;
>>> > +            s->advance_clock = 0;
>>> > +        }
>> Can't the advance_clock added to the migrated KVMClockState instead of
>> passing it as another parameter?
>> 
>> (It is sad that we can't just query KVMClockState in kvmclock_pre_save
>>  because of the Linux bug.)
> 
> What Linux bug?  The one that makes us use kvmclock_current_nsec?

No, the one that forced Marcelo to add the 10 minute limit to the
advance_clock.  We wouldn't need this advance_clock hack if we could
just call KVM_GET_CLOCK like we did before 00f4d64ee76e ("kvmclock:
clock should count only if vm is running").

> It should work with 4.9-rc (well, once Linus applies my pull request).
> 4.9-rc will not return ktime_get_ns for KVM_GET_CLOCK; it will return
> the raw value from the kernel timekeeper.
> 
> I'm thinking that we should add a KVM capability for this, and skip
> kvmclock_current_nsec if the capability is present.  The first part is
> trivial, so we can do it even during Linux rc period.

I agree, that sounds like a nice improvement.
Paolo Bonzini Nov. 4, 2016, 3:57 p.m. UTC | #7
On 04/11/2016 16:48, Radim Krčmář wrote:
> 2016-11-04 16:33+0100, Paolo Bonzini:
>> On 04/11/2016 16:25, Radim Krčmář wrote:
>>>>>  
>>>>> +        if (s->advance_clock && s->clock + s->advance_clock > s->clock) {
>>>>> +            s->clock += s->advance_clock;
>>>>> +            s->advance_clock = 0;
>>>>> +        }
>>> Can't the advance_clock added to the migrated KVMClockState instead of
>>> passing it as another parameter?
>>>
>>> (It is sad that we can't just query KVMClockState in kvmclock_pre_save
>>>  because of the Linux bug.)
>>
>> What Linux bug?  The one that makes us use kvmclock_current_nsec?
> 
> No, the one that forced Marcelo to add the 10 minute limit to the
> advance_clock.  We wouldn't need this advance_clock hack if we could
> just call KVM_GET_CLOCK like we did before 00f4d64ee76e ("kvmclock:
> clock should count only if vm is running").

There are two cases:

- migrating a paused guest

- pausing at the end of migration

In the first case, kvmclock_vm_state_change's !running branch will see
state == RUN_STATE_FINISH_MIGRATE && s->clock_valid.  In the second
case, it will see state == RUN_STATE_FINISH_MIGRATE && !s->clock_valid.

In the second case it should do nothing.  Then kvmclock_pre_save will
see !s->clock_valid and call KVM_GET_CLOCK.  A bonus is that, if
migration fails and migration_thread() calls vm_start(), then the guest
will not see the clock drift backwards.

>> It should work with 4.9-rc (well, once Linus applies my pull request).
>> 4.9-rc will not return ktime_get_ns for KVM_GET_CLOCK; it will return
>> the raw value from the kernel timekeeper.
>>
>> I'm thinking that we should add a KVM capability for this, and skip
>> kvmclock_current_nsec if the capability is present.  The first part is
>> trivial, so we can do it even during Linux rc period.
> 
> I agree, that sounds like a nice improvement.

Ok, I'll send the KVM patch then.  It can even be the value *returned*
for KVM_CAP_ADJUST_CLOCK---right now it returns 1, we can return 3 (bit
0 for KVM_CAP_ADJUST_CLOCK_RESENT and bit 1 for returning the raw
value).  Any suggestion for the name?

Paolo
Marcelo Tosatti Nov. 4, 2016, 4:04 p.m. UTC | #8
On Fri, Nov 04, 2016 at 04:25:23PM +0100, Radim Krčmář wrote:
> 2016-11-04 07:43-0200, Marcelo Tosatti:
> > This patch, relative to pre-copy migration codepath,
> > measures the time between vm_stop() and pre_save(), 
> > which includes copying the remaining RAM to destination,
> > and advances the clock by that amount.
> > 
> > In a VM with 5 seconds downtime, this reduces the guest 
> > clock difference on destination from 5s to 0.2s.
> > 
> > Please do not apply this yet as some codepaths still need
> > checking, submitting early for comments.
> 
> The time computation looks ok.
> 
> We could make it slightly more precise by returning the CLOCK_MONOTONIC
> at which KVM_GET_CLOCK is read with the IOCTL, but we don't account the
> migration time anyway, so that precision would be wasted.

Yes, can be enhanced later... That should be about 1us? (return to
userspace).

> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
> > @@ -100,6 +106,11 @@ static void kvmclock_vm_state_change(void *opaque, int running,
> >              s->clock = time_at_migration;
> >          }
> >  
> > +        if (s->advance_clock && s->clock + s->advance_clock > s->clock) {
> > +            s->clock += s->advance_clock;
> > +            s->advance_clock = 0;
> > +        }
> 
> Can't the advance_clock added to the migrated KVMClockState instead of
> passing it as another parameter?

Don't want to be fragile to ->ns being overwritten by pre_save executing
on the destination and overwriting the value passed in from
migration... 

> (It is sad that we can't just query KVMClockState in kvmclock_pre_save
>  because of the Linux bug.)
> 
> > @@ -135,6 +146,18 @@ static void kvmclock_vm_state_change(void *opaque, int running,
> >              abort();
> >          }
> >          s->clock = data.clock;
> > +        /*
> > +         * Transition from VM-running to VM-stopped via migration?
> > +         * Record when the VM was stopped.
> > +         */
> > +
> > +        if (state == RUN_STATE_FINISH_MIGRATE &&
> > +            !migration_in_postcopy(migrate_get_current())) {
> > +            clock_gettime(CLOCK_MONOTONIC, &s->t_aftervmstop);
> 
> How big (more like small) was the clock delta between here and
> kvmclock_pre_save with postcopy?

Haven't checked, but the clock delta is now 0.2s. 
So i assume postcopy (stopping the VM, saving the device states)
should be about 0.2s.

I'll run and let you know.
Marcelo Tosatti Nov. 4, 2016, 4:24 p.m. UTC | #9
On Fri, Nov 04, 2016 at 04:48:28PM +0100, Radim Krčmář wrote:
> 2016-11-04 16:33+0100, Paolo Bonzini:
> > On 04/11/2016 16:25, Radim Krčmář wrote:
> >>> >  
> >>> > +        if (s->advance_clock && s->clock + s->advance_clock > s->clock) {
> >>> > +            s->clock += s->advance_clock;
> >>> > +            s->advance_clock = 0;
> >>> > +        }
> >> Can't the advance_clock added to the migrated KVMClockState instead of
> >> passing it as another parameter?
> >> 
> >> (It is sad that we can't just query KVMClockState in kvmclock_pre_save
> >>  because of the Linux bug.)
> > 
> > What Linux bug?  The one that makes us use kvmclock_current_nsec?
> 
> No, the one that forced Marcelo to add the 10 minute limit to the
> advance_clock. 

Overflow when reading clocksource, in the timer interrupt.

>  We wouldn't need this advance_clock hack if we could
> just call KVM_GET_CLOCK like we did before 00f4d64ee76e ("kvmclock:
> clock should count only if vm is running").

People have requested that CLOCK_MONOTONIC stops when 
vm suspends.

> > It should work with 4.9-rc (well, once Linus applies my pull request).
> > 4.9-rc will not return ktime_get_ns for KVM_GET_CLOCK; it will return
> > the raw value from the kernel timekeeper.
> > 
> > I'm thinking that we should add a KVM capability for this, and skip
> > kvmclock_current_nsec if the capability is present.  The first part is
> > trivial, so we can do it even during Linux rc period.
> 
> I agree, that sounds like a nice improvement.

I fail to see how this (KVM_GET_CLOCK using raw value (vs NTP corrected)
relates to clocksource overflow and CLOCK_MONOTONIC stop requests.

-----

Todo list (collective???):

1) Implement suggestion to Roman: 
"Subject: Re: [PATCH] x86/kvm: fix condition to update kvm master
clocks" to handle the time backwards when updating masterclock 
from kernel clock.

2) Review TSC scaling support (make sure scaled TSC 
is propagated properly everywhere).

3) Enable migration with invariant TSC.

4) Fullvirt fix for local APIC deadline timer bug
(BTW Paolo Windows is also vulnerable right).
Marcelo Tosatti Nov. 4, 2016, 5:07 p.m. UTC | #10
On Fri, Nov 04, 2016 at 04:25:23PM +0100, Radim Krčmář wrote:
> > +        /*
> > +         * Transition from VM-running to VM-stopped via migration?
> > +         * Record when the VM was stopped.
> > +         */
> > +
> > +        if (state == RUN_STATE_FINISH_MIGRATE &&
> > +            !migration_in_postcopy(migrate_get_current())) {
> > +            clock_gettime(CLOCK_MONOTONIC, &s->t_aftervmstop);
> 
> How big (more like small) was the clock delta between here and
> kvmclock_pre_save with postcopy?
> 
> Thanks.

qemu-system-x86_64: postcopy_ram_supported_by_host: userfaultfd not
available: Function not implemented

But should be about the same as precopy+this patch (guess as i don't 
know the postcopy path).
Radim Krčmář Nov. 4, 2016, 5:16 p.m. UTC | #11
2016-11-04 16:57+0100, Paolo Bonzini:
> On 04/11/2016 16:48, Radim Krčmář wrote:
>> 2016-11-04 16:33+0100, Paolo Bonzini:
>>> On 04/11/2016 16:25, Radim Krčmář wrote:
>>>>>>  
>>>>>> +        if (s->advance_clock && s->clock + s->advance_clock > s->clock) {
>>>>>> +            s->clock += s->advance_clock;
>>>>>> +            s->advance_clock = 0;
>>>>>> +        }
>>>> Can't the advance_clock added to the migrated KVMClockState instead of
>>>> passing it as another parameter?
>>>>
>>>> (It is sad that we can't just query KVMClockState in kvmclock_pre_save
>>>>  because of the Linux bug.)
>>>
>>> What Linux bug?  The one that makes us use kvmclock_current_nsec?
>> 
>> No, the one that forced Marcelo to add the 10 minute limit to the
>> advance_clock.  We wouldn't need this advance_clock hack if we could
>> just call KVM_GET_CLOCK like we did before 00f4d64ee76e ("kvmclock:
>> clock should count only if vm is running").
> 
> There are two cases:
> 
> - migrating a paused guest
> 
> - pausing at the end of migration
> 
> In the first case, kvmclock_vm_state_change's !running branch will see
> state == RUN_STATE_FINISH_MIGRATE && s->clock_valid.  In the second
> case, it will see state == RUN_STATE_FINISH_MIGRATE && !s->clock_valid.

I lift my case, marcelo's said that stopping the time is a feature ...
(*kittens die*)

> In the second case it should do nothing.  Then kvmclock_pre_save will
> see !s->clock_valid and call KVM_GET_CLOCK.  A bonus is that, if
> migration fails and migration_thread() calls vm_start(), then the guest
> will not see the clock drift backwards.

Yes, moving KVM_GET_CLOCK to kvmclock_pre_save() and doing nothing in
kvmclock_vm_state_change() to avoid the drift on failed migration would
be nicer.  We'd then also get rid of the ns migration parameter.

>>> It should work with 4.9-rc (well, once Linus applies my pull request).
>>> 4.9-rc will not return ktime_get_ns for KVM_GET_CLOCK; it will return
>>> the raw value from the kernel timekeeper.

Oh, and this does introduce a minor bug to this patch -- the time
counted by KVM_GET_CLOCK is has different frequency CLOCK_MONOTONIC.
Not accounting for that is bearable.

>>> I'm thinking that we should add a KVM capability for this, and skip
>>> kvmclock_current_nsec if the capability is present.  The first part is
>>> trivial, so we can do it even during Linux rc period.
>> 
>> I agree, that sounds like a nice improvement.
> 
> Ok, I'll send the KVM patch then.  It can even be the value *returned*
> for KVM_CAP_ADJUST_CLOCK---right now it returns 1, we can return 3 (bit
> 0 for KVM_CAP_ADJUST_CLOCK_RESENT and bit 1 for returning the raw
> value).  Any suggestion for the name?

KVM_CAP_GET_CLOCK_MASTERCLOCK?

to show that the time is counted differently when using master clock.
Another idea would be KVM_CAP_GET_CLOCK_ACTUAL, because we should now
read what the guest is seeing.
Radim Krčmář Nov. 4, 2016, 5:34 p.m. UTC | #12
2016-11-04 14:24-0200, Marcelo Tosatti:
> On Fri, Nov 04, 2016 at 04:48:28PM +0100, Radim Krčmář wrote:
>> 2016-11-04 16:33+0100, Paolo Bonzini:
>> > On 04/11/2016 16:25, Radim Krčmář wrote:
>> >>> >  
>> >>> > +        if (s->advance_clock && s->clock + s->advance_clock > s->clock) {
>> >>> > +            s->clock += s->advance_clock;
>> >>> > +            s->advance_clock = 0;
>> >>> > +        }
>> >> Can't the advance_clock added to the migrated KVMClockState instead of
>> >> passing it as another parameter?
>> >> 
>> >> (It is sad that we can't just query KVMClockState in kvmclock_pre_save
>> >>  because of the Linux bug.)
>> > 
>> > What Linux bug?  The one that makes us use kvmclock_current_nsec?
>> 
>> No, the one that forced Marcelo to add the 10 minute limit to the
>> advance_clock. 
> 
> Overflow when reading clocksource, in the timer interrupt.

Is it still there?  I wonder why 10 minutes is the limit. :)

>>  We wouldn't need this advance_clock hack if we could
>> just call KVM_GET_CLOCK like we did before 00f4d64ee76e ("kvmclock:
>> clock should count only if vm is running").
> 
> People have requested that CLOCK_MONOTONIC stops when 
> vm suspends.

Ugh, we could have done that on the OS level, but stopping kvmclock
means that we sabotaged CLOCK_BOOTTIME as it doesn't return
CLOCK_MONOTONIC + time spent in suspend anymore.

And kvmclock is defined to count nanosecond (slightly different in
practice) since some point in time, so pausing it is not really valid.

>> > It should work with 4.9-rc (well, once Linus applies my pull request).
>> > 4.9-rc will not return ktime_get_ns for KVM_GET_CLOCK; it will return
>> > the raw value from the kernel timekeeper.
>> > 
>> > I'm thinking that we should add a KVM capability for this, and skip
>> > kvmclock_current_nsec if the capability is present.  The first part is
>> > trivial, so we can do it even during Linux rc period.
>> 
>> I agree, that sounds like a nice improvement.
> 
> I fail to see how this (KVM_GET_CLOCK using raw value (vs NTP corrected)
> relates to clocksource overflow and CLOCK_MONOTONIC stop requests.

If the guest uses master clock, then kvmclock drifts away from BOOTTIME,
so returning the value of kvmclock using kernel_ns() would introduce a
drift when setting the clock -- we must read the same way that the guest
does.

> -----
> 
> Todo list (collective???):
> 
> 1) Implement suggestion to Roman: 
> "Subject: Re: [PATCH] x86/kvm: fix condition to update kvm master
> clocks" to handle the time backwards when updating masterclock 
> from kernel clock.
> 
> 2) Review TSC scaling support (make sure scaled TSC 
> is propagated properly everywhere).
> 
> 3) Enable migration with invariant TSC.
> 
> 4) Fullvirt fix for local APIC deadline timer bug

The list looks good.

I'll resume work on (4) now that rework of APIC timers is merged.
It's going to be ugly on the guest side, because linux could be using it
even when not using kvmclock for sched clock ...

> (BTW Paolo Windows is also vulnerable right).

Ugh, we can't do much more than disabling TSC deadline there until QEMU
can migrate it.
Radim Krčmář Nov. 4, 2016, 5:39 p.m. UTC | #13
2016-11-04 15:07-0200, Marcelo Tosatti:
> On Fri, Nov 04, 2016 at 04:25:23PM +0100, Radim Krčmář wrote:
>> > +        /*
>> > +         * Transition from VM-running to VM-stopped via migration?
>> > +         * Record when the VM was stopped.
>> > +         */
>> > +
>> > +        if (state == RUN_STATE_FINISH_MIGRATE &&
>> > +            !migration_in_postcopy(migrate_get_current())) {
>> > +            clock_gettime(CLOCK_MONOTONIC, &s->t_aftervmstop);
>> 
>> How big (more like small) was the clock delta between here and
>> kvmclock_pre_save with postcopy?
>> 
>> Thanks.
> 
> qemu-system-x86_64: postcopy_ram_supported_by_host: userfaultfd not
> available: Function not implemented
> 
> But should be about the same as precopy+this patch (guess as i don't 
> know the postcopy path).

I was wondering about the improvement we could achieve by not excluding
postcopy from the time fixup.  (i.e. how much time elapses between
pausing and migrating the vm in postcopy)

I would also guess that it is not significant.
Marcelo Tosatti Nov. 4, 2016, 6:29 p.m. UTC | #14
On Fri, Nov 04, 2016 at 06:34:20PM +0100, Radim Krčmář wrote:
> 2016-11-04 14:24-0200, Marcelo Tosatti:
> > On Fri, Nov 04, 2016 at 04:48:28PM +0100, Radim Krčmář wrote:
> >> 2016-11-04 16:33+0100, Paolo Bonzini:
> >> > On 04/11/2016 16:25, Radim Krčmář wrote:
> >> >>> >  
> >> >>> > +        if (s->advance_clock && s->clock + s->advance_clock > s->clock) {
> >> >>> > +            s->clock += s->advance_clock;
> >> >>> > +            s->advance_clock = 0;
> >> >>> > +        }
> >> >> Can't the advance_clock added to the migrated KVMClockState instead of
> >> >> passing it as another parameter?
> >> >> 
> >> >> (It is sad that we can't just query KVMClockState in kvmclock_pre_save
> >> >>  because of the Linux bug.)
> >> > 
> >> > What Linux bug?  The one that makes us use kvmclock_current_nsec?
> >> 
> >> No, the one that forced Marcelo to add the 10 minute limit to the
> >> advance_clock. 
> > 
> > Overflow when reading clocksource, in the timer interrupt.
> 
> Is it still there?  I wonder why 10 minutes is the limit. :)

Second question: I don't know, i guess it started as a 
"lets clamp this to values that make sense to catch
any potential offenders" but now i think it makes 
sense to say:

"because it does not make sense to search for the smaller
overflow of Linux guests and use that".

So 10minutes works for me using the sentence:
"migration should not take longer than 10 minutes".

You prefer to drop that 10 minutes check?

First question:

I suppose so: disable CLOCK_MONOTONIC counting on pause, 
pause your VM for two days, and resume.

timekeeping_resume

        cycle_now = tk->tkr_mono.read(clock);
        if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
                cycle_now > tk->tkr_mono.cycle_last) {
                u64 num, max = ULLONG_MAX;
                u32 mult = clock->mult;
                u32 shift = clock->shift;
                s64 nsec = 0;

                cycle_delta = clocksource_delta(cycle_now,
tk->tkr_mono.cycle_last,
                                                tk->tkr_mono.mask);

                /*
                 * "cycle_delta * mutl" may cause 64 bits overflow, if
                 * the
                 * suspended time is too long. In that case we need do
                 * the
                 * 64 bits math carefully
                 */
                do_div(max, mult);
                if (cycle_delta > max) {
                        num = div64_u64(cycle_delta, max);
                        nsec = (((u64) max * mult) >> shift) * num;
                        cycle_delta -= num * max;
                }
                nsec += ((u64) cycle_delta * mult) >> shift;

timekeeping_forward_now
/**
 * timekeeping_forward_now - update clock to the current time
 *
 * Forward the current clock to update its state since the last call to
 * update_wall_time(). This is useful before significant clock changes,
 * as it avoids having to deal with this time offset explicitly.
 */
static void timekeeping_forward_now(struct timekeeper *tk)
{
        struct clocksource *clock = tk->tkr_mono.clock;
        cycle_t cycle_now, delta;
        s64 nsec;

        cycle_now = tk->tkr_mono.read(clock);
        delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last,
tk->tkr_mono.mask);
        tk->tkr_mono.cycle_last = cycle_now;
        tk->tkr_raw.cycle_last  = cycle_now;

        tk->tkr_mono.xtime_nsec += delta * tk->tkr_mono.mult;

For example.

> >>  We wouldn't need this advance_clock hack if we could
> >> just call KVM_GET_CLOCK like we did before 00f4d64ee76e ("kvmclock:
> >> clock should count only if vm is running").
> > 
> > People have requested that CLOCK_MONOTONIC stops when 
> > vm suspends.
> 
> Ugh, we could have done that on the OS level, 

Done it how at the OS level ?

> but stopping kvmclock
> means that we sabotaged CLOCK_BOOTTIME as it doesn't return
> CLOCK_MONOTONIC + time spent in suspend anymore.

It does:
So this patchset introduces CLOCK_BOOTTIME, which is identical
to CLOCK_MONOTONIC, but includes any time spent in suspend.

"suspend" refers to suspend-to-RAM, not "virtual machine stopped".

CLOCK_BOOTTIME fails to work when you extend "suspend" 
to "suspend AND virtual machine pause or restoration".

(if you want that, can be fixed easily i supposed, but 
nobody ever asked for it).

> And kvmclock is defined to count nanosecond (slightly different in
> practice) since some point in time, so pausing it is not really valid.

Sorry, where is that defined? Nobody makes that assumption
(that kvmclock should be correlated to some physical time
and that it can't stop counting).

> >> > It should work with 4.9-rc (well, once Linus applies my pull request).
> >> > 4.9-rc will not return ktime_get_ns for KVM_GET_CLOCK; it will return
> >> > the raw value from the kernel timekeeper.
> >> > 
> >> > I'm thinking that we should add a KVM capability for this, and skip
> >> > kvmclock_current_nsec if the capability is present.  The first part is
> >> > trivial, so we can do it even during Linux rc period.
> >> 
> >> I agree, that sounds like a nice improvement.
> > 
> > I fail to see how this (KVM_GET_CLOCK using raw value (vs NTP corrected)
> > relates to clocksource overflow and CLOCK_MONOTONIC stop requests.
> 
> If the guest uses master clock, then kvmclock drifts away from BOOTTIME,
> so returning the value of kvmclock using kernel_ns() would introduce a
> drift when setting the clock -- we must read the same way that the guest
> does.

Does not work:

    (periodic incremental reads from TSC, using the offset from
     previous TSC read, storing to scaling to a memory location)
    and reads interpolating with TSC (what kernel does)

        != (different values than)

    (TSC scaled to nanoseconds)

Its not obvious to me this should be the case, perhaps if done 
carefully can work.

(See the thread with Roman, this was determined empirically).

> > -----
> > 
> > Todo list (collective???):
> > 
> > 1) Implement suggestion to Roman: 
> > "Subject: Re: [PATCH] x86/kvm: fix condition to update kvm master
> > clocks" to handle the time backwards when updating masterclock 
> > from kernel clock.
> > 
> > 2) Review TSC scaling support (make sure scaled TSC 
> > is propagated properly everywhere).
> > 
> > 3) Enable migration with invariant TSC.
> > 
> > 4) Fullvirt fix for local APIC deadline timer bug
> 
> The list looks good.
> 
> I'll resume work on (4) now that rework of APIC timers is merged.
> It's going to be ugly on the guest side, because linux could be using it
> even when not using kvmclock for sched clock ...

From previous discussion on the thread Eduardo started:

>> > Actually, a nicer fix would be to check the different
>> > frequencies and scale the deadline relative to the difference.
>>
>> You cannot know what exactly the guest was thinking when it set the
>> TSC
>> deadline.  Perhaps it wanted an interrupt when the TSC was exactly
>> 9876543210.

Yep, the spec just says that the timer fires when TSC >= deadline.

> You can't expect the correlation between TSC value and timer interrupt
> execution to be precise, because of the delay between HW timer
> expiration and interrupt execution.

Yes, this is valid.

> So you have to live with the fact that the TSC deadline timer can be
> late (which is the same thing as with your paravirt solution, in case
> of migration to host with faster TSC freq) (which to me renders the
> argument above invalid).
>
> Solution for old guests and new guests:
> Just save how far ahead in the future the TSC deadline timer is
> supposed
> to expire on the source (in ns), in the destination save all registers
> (but disable TSC deadline timer injection), arm a timer in QEMU
> for expiration time, reenable TSC deadline timer reinjection.

The interrupt can also fire early after migrating to a TSC with lower
frequency, which violates the definition of TSC deadline timer when an
OS could even RDTSC a value lower than the deadline in the interrupt
handler.  (An OS doesn't need to expect that.)

We already have vcpu->arch.virtual_tsc_khz that ought to remain constant
for a lifetime of the guest and KVM uses it to convert TSC delta into
correct nanoseconds.

The main problem is that QEMU changes virtual_tsc_khz when migrating
without hardware scaling, so KVM is forced to get nanoseconds wrong ...

If QEMU doesn't want to keep the TSC frequency constant, then it would
be better if it didn't expose TSC in CPUID -- guest would just use
kvmclock without being tempted by direct TSC accesses.


> > (BTW Paolo Windows is also vulnerable right).
> 
> Ugh, we can't do much more than disabling TSC deadline there until QEMU
> can migrate it.

So the idea was to convert to nanoseconds, count a timer
in nanoseconds, and when it expires inject immediatelly.
(Yes its ugly).

(btw i suppose you can talk to destination via the command interface 
on the postcopy patches now), say to get the destination tsc frequency.
Marcelo Tosatti Nov. 4, 2016, 6:31 p.m. UTC | #15
On Fri, Nov 04, 2016 at 06:39:18PM +0100, Radim Krčmář wrote:
> 2016-11-04 15:07-0200, Marcelo Tosatti:
> > On Fri, Nov 04, 2016 at 04:25:23PM +0100, Radim Krčmář wrote:
> >> > +        /*
> >> > +         * Transition from VM-running to VM-stopped via migration?
> >> > +         * Record when the VM was stopped.
> >> > +         */
> >> > +
> >> > +        if (state == RUN_STATE_FINISH_MIGRATE &&
> >> > +            !migration_in_postcopy(migrate_get_current())) {
> >> > +            clock_gettime(CLOCK_MONOTONIC, &s->t_aftervmstop);
> >> 
> >> How big (more like small) was the clock delta between here and
> >> kvmclock_pre_save with postcopy?
> >> 
> >> Thanks.
> > 
> > qemu-system-x86_64: postcopy_ram_supported_by_host: userfaultfd not
> > available: Function not implemented
> > 
> > But should be about the same as precopy+this patch (guess as i don't 
> > know the postcopy path).
> 
> I was wondering about the improvement we could achieve by not excluding
> postcopy from the time fixup.  (i.e. how much time elapses between
> pausing and migrating the vm in postcopy)
> 
> I would also guess that it is not significant.

Ideally the total should be zero because for certain workloads
any change is problematic...
Radim Krčmář Nov. 4, 2016, 8:07 p.m. UTC | #16
2016-11-04 16:29-0200, Marcelo Tosatti:
> On Fri, Nov 04, 2016 at 06:34:20PM +0100, Radim Krčmář wrote:
>> 2016-11-04 14:24-0200, Marcelo Tosatti:
>> > On Fri, Nov 04, 2016 at 04:48:28PM +0100, Radim Krčmář wrote:
>> >> 2016-11-04 16:33+0100, Paolo Bonzini:
>> >> > On 04/11/2016 16:25, Radim Krčmář wrote:
>> >> >>> >  
>> >> >>> > +        if (s->advance_clock && s->clock + s->advance_clock > s->clock) {
>> >> >>> > +            s->clock += s->advance_clock;
>> >> >>> > +            s->advance_clock = 0;
>> >> >>> > +        }
>> >> >> Can't the advance_clock added to the migrated KVMClockState instead of
>> >> >> passing it as another parameter?
>> >> >> 
>> >> >> (It is sad that we can't just query KVMClockState in kvmclock_pre_save
>> >> >>  because of the Linux bug.)
>> >> > 
>> >> > What Linux bug?  The one that makes us use kvmclock_current_nsec?
>> >> 
>> >> No, the one that forced Marcelo to add the 10 minute limit to the
>> >> advance_clock. 
>> > 
>> > Overflow when reading clocksource, in the timer interrupt.
>> 
>> Is it still there?  I wonder why 10 minutes is the limit. :)
> 
> Second question: I don't know, i guess it started as a 
> "lets clamp this to values that make sense to catch
> any potential offenders" but now i think it makes 
> sense to say:
> 
> "because it does not make sense to search for the smaller
> overflow of Linux guests and use that".
> 
> So 10minutes works for me using the sentence:
> "migration should not take longer than 10 minutes".
> 
> You prefer to drop that 10 minutes check?

I would.  The overflow should mean that the value will be off, but it
would already be, so being a little more wrong doesn't seem worth the
extra code.
It is minor and v2 has r-b, so let's just forget about it. :)

> First question:
> 
> I suppose so: disable CLOCK_MONOTONIC counting on pause, 
> pause your VM for two days, and resume.
> 
> timekeeping_resume
> 
>         cycle_now = tk->tkr_mono.read(clock);
>         if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
>                 cycle_now > tk->tkr_mono.cycle_last) {
>                 u64 num, max = ULLONG_MAX;
>                 u32 mult = clock->mult;
>                 u32 shift = clock->shift;
>                 s64 nsec = 0;
> 
>                 cycle_delta = clocksource_delta(cycle_now,
> tk->tkr_mono.cycle_last,
>                                                 tk->tkr_mono.mask);
> 
>                 /*
>                  * "cycle_delta * mutl" may cause 64 bits overflow, if
>                  * the
>                  * suspended time is too long. In that case we need do
>                  * the
>                  * 64 bits math carefully
>                  */
>                 do_div(max, mult);
>                 if (cycle_delta > max) {
>                         num = div64_u64(cycle_delta, max);
>                         nsec = (((u64) max * mult) >> shift) * num;
>                         cycle_delta -= num * max;
>                 }
>                 nsec += ((u64) cycle_delta * mult) >> shift;

This seems like it handles the overflow.  And does it tragically
ineffectively, so I can understand that they didn't want to do that in
interrupt context.  Using 128 bit multiplication (on x86_64) and shift
would get it done in a fraction of the time.

[...]
> For example.

I found the 10 minues in __clocksource_update_freq_scale():

	if (freq) {
		/*
		 * Calc the maximum number of seconds which we can run before
		 * wrapping around. For clocksources which have a mask > 32-bit
		 * we need to limit the max sleep time to have a good
		 * conversion precision. 10 minutes is still a reasonable
		 * amount. That results in a shift value of 24 for a
		 * clocksource with mask >= 40-bit and f >= 4GHz. That maps to
		 * ~ 0.06ppm granularity for NTP.
		 */
		sec = cs->mask;
		do_div(sec, freq);
		do_div(sec, scale);
		if (!sec)
			sec = 1;
		else if (sec > 600 && cs->mask > UINT_MAX)
			sec = 600;

		clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
				       NSEC_PER_SEC / scale, sec * scale);
	}

>> >>  We wouldn't need this advance_clock hack if we could
>> >> just call KVM_GET_CLOCK like we did before 00f4d64ee76e ("kvmclock:
>> >> clock should count only if vm is running").
>> > 
>> > People have requested that CLOCK_MONOTONIC stops when 
>> > vm suspends.
>> 
>> Ugh, we could have done that on the OS level, 
> 
> Done it how at the OS level ?

With the use of another interface steal-time like interface.

>> but stopping kvmclock
>> means that we sabotaged CLOCK_BOOTTIME as it doesn't return
>> CLOCK_MONOTONIC + time spent in suspend anymore.
> 
> It does:
> So this patchset introduces CLOCK_BOOTTIME, which is identical
> to CLOCK_MONOTONIC, but includes any time spent in suspend.
> 
> "suspend" refers to suspend-to-RAM, not "virtual machine stopped".
> 
> CLOCK_BOOTTIME fails to work when you extend "suspend" 
> to "suspend AND virtual machine pause or restoration".
> 
> (if you want that, can be fixed easily i supposed, but 
> nobody ever asked for it).

Yeah, it's just weird, because the reason why people wanted
CLOCK_MONOTONIC was quite stupid (to see how long it was unpaused ...)
while having a useable timesource seems quite important.

>> And kvmclock is defined to count nanosecond (slightly different in
>> practice) since some point in time, so pausing it is not really valid.
> 
> Sorry, where is that defined? Nobody makes that assumption
> (that kvmclock should be correlated to some physical time
> and that it can't stop counting).

Documentation/virtual/kvm/msr.txt

  system_time: a host notion of monotonic time, including sleep
  time at the time this structure was last updated. Unit is
  nanoseconds.

and arch/x86/include/asm/pvclock-abi.h:

  * pvclock_vcpu_time_info holds the system time and the tsc timestamp
  * of the last update. So the guest can use the tsc delta to get a
  * more precise system time.  There is one per virtual cpu.
  *
  * pvclock_wall_clock references the point in time when the system
  * time was zero (usually boot time), thus the guest calculates the
  * current wall clock by adding the system time.

>> >> > It should work with 4.9-rc (well, once Linus applies my pull request).
>> >> > 4.9-rc will not return ktime_get_ns for KVM_GET_CLOCK; it will return
>> >> > the raw value from the kernel timekeeper.
>> >> > 
>> >> > I'm thinking that we should add a KVM capability for this, and skip
>> >> > kvmclock_current_nsec if the capability is present.  The first part is
>> >> > trivial, so we can do it even during Linux rc period.
>> >> 
>> >> I agree, that sounds like a nice improvement.
>> > 
>> > I fail to see how this (KVM_GET_CLOCK using raw value (vs NTP corrected)
>> > relates to clocksource overflow and CLOCK_MONOTONIC stop requests.
>> 
>> If the guest uses master clock, then kvmclock drifts away from BOOTTIME,
>> so returning the value of kvmclock using kernel_ns() would introduce a
>> drift when setting the clock -- we must read the same way that the guest
>> does.
> 
> Does not work:
> 
>     (periodic incremental reads from TSC, using the offset from
>      previous TSC read, storing to scaling to a memory location)
>     and reads interpolating with TSC (what kernel does)
> 
>         != (different values than)
> 
>     (TSC scaled to nanoseconds)
> 
> Its not obvious to me this should be the case, perhaps if done 
> carefully can work.

This can't be equal if the TSC frequency is not constant, and it is not.
KVM master clock is using different TSC frequency than the kernel.
The current state is weird, because if we ever recompute the master
clock, we use kvm_get_time_and_clockread() for the new base, which
shifts the time. :/

We could match CLOCK_BOOTTIME with kvmclock if we wanted ...  It would
be much saner, but we'd recomputing the master clock every time the host
time gets updated.

> (See the thread with Roman, this was determined empirically).
> 
>> > -----
>> > 
>> > Todo list (collective???):
>> > 
>> > 1) Implement suggestion to Roman: 
>> > "Subject: Re: [PATCH] x86/kvm: fix condition to update kvm master
>> > clocks" to handle the time backwards when updating masterclock 
>> > from kernel clock.
>> > 
>> > 2) Review TSC scaling support (make sure scaled TSC 
>> > is propagated properly everywhere).
>> > 
>> > 3) Enable migration with invariant TSC.
>> > 
>> > 4) Fullvirt fix for local APIC deadline timer bug
>> 
>> The list looks good.
>> 
>> I'll resume work on (4) now that rework of APIC timers is merged.
>> It's going to be ugly on the guest side, because linux could be using it
>> even when not using kvmclock for sched clock ...
> 
> From previous discussion on the thread Eduardo started:
|> 
|>>> > Actually, a nicer fix would be to check the different
|>>> > frequencies and scale the deadline relative to the difference.
|>>>
|>>> You cannot know what exactly the guest was thinking when it set the
|>>> TSC
|>>> deadline.  Perhaps it wanted an interrupt when the TSC was exactly
|>>> 9876543210.
|> 
|> Yep, the spec just says that the timer fires when TSC >= deadline.
|> 
|>> You can't expect the correlation between TSC value and timer interrupt
|>> execution to be precise, because of the delay between HW timer
|>> expiration and interrupt execution.
|> 
|> Yes, this is valid.
|> 
|>> So you have to live with the fact that the TSC deadline timer can be
|>> late (which is the same thing as with your paravirt solution, in case
|>> of migration to host with faster TSC freq) (which to me renders the
|>> argument above invalid).
|>>
|>> Solution for old guests and new guests:
|>> Just save how far ahead in the future the TSC deadline timer is
|>> supposed
|>> to expire on the source (in ns), in the destination save all registers
|>> (but disable TSC deadline timer injection), arm a timer in QEMU
|>> for expiration time, reenable TSC deadline timer reinjection.
|> 
|> The interrupt can also fire early after migrating to a TSC with lower
|> frequency, which violates the definition of TSC deadline timer when an
|> OS could even RDTSC a value lower than the deadline in the interrupt
|> handler.  (An OS doesn't need to expect that.)
|> 
|> We already have vcpu->arch.virtual_tsc_khz that ought to remain constant
|> for a lifetime of the guest and KVM uses it to convert TSC delta into
|> correct nanoseconds.
|> 
|> The main problem is that QEMU changes virtual_tsc_khz when migrating
|> without hardware scaling, so KVM is forced to get nanoseconds wrong ...
|> 
|> If QEMU doesn't want to keep the TSC frequency constant, then it would
|> be better if it didn't expose TSC in CPUID -- guest would just use
|> kvmclock without being tempted by direct TSC accesses.
|> 
|> 
>> > (BTW Paolo Windows is also vulnerable right).
>> 
>> Ugh, we can't do much more than disabling TSC deadline there until QEMU
>> can migrate it.
> 
> So the idea was to convert to nanoseconds, count a timer
> in nanoseconds, and when it expires inject immediatelly.
> (Yes its ugly).

Problematic, because we can't know that the timer was used to count real
time: maybe the guest really wanted to count TSC cycles.  And it is just
wrong if we would fire the timer while the guest could rdtsc() a value
before the deadline in its handler, which might happen if we migrated to
a TSC with a lower frequency.

> (btw i suppose you can talk to destination via the command interface 
> on the postcopy patches now), say to get the destination tsc frequency.

The problem is that changing TSC frequency on the host affects all
following interrupts, because KVM is not informed of the new frequency.
The one timer armed during migration is relatively insignificant, so
Paolo's patches even ignored it.
Paolo Bonzini Nov. 4, 2016, 9:29 p.m. UTC | #17
> >> No, the one that forced Marcelo to add the 10 minute limit to the
> >> advance_clock.  We wouldn't need this advance_clock hack if we could
> >> just call KVM_GET_CLOCK like we did before 00f4d64ee76e ("kvmclock:
> >> clock should count only if vm is running").
> > 
> > There are two cases:
> > 
> > - migrating a paused guest
> > 
> > - pausing at the end of migration
> > 
> > In the first case, kvmclock_vm_state_change's !running branch will see
> > state == RUN_STATE_FINISH_MIGRATE && s->clock_valid.  In the second
> > case, it will see state == RUN_STATE_FINISH_MIGRATE && !s->clock_valid.
> 
> I lift my case, marcelo's said that stopping the time is a feature ...
> (*kittens die*)

But that's why separating the two cases brings us the best of both worlds.
If migrating a paused guest, there's no need for any adjustment, so no
advance_clock hack.  If pausing at the end of migration, there's no need
to pause kvmclock (this patch is effectively working around 00f4d64ee76e)
and if we don't do that we can just call KVM_GET_CLOCK at pre_save time.

> Oh, and this does introduce a minor bug to this patch -- the time
> counted by KVM_GET_CLOCK is has different frequency CLOCK_MONOTONIC.
> Not accounting for that is bearable.

Not really, I was going to point that out when I got to replying with
a review. :)

Paolo
Marcelo Tosatti Nov. 4, 2016, 9:47 p.m. UTC | #18
On Fri, Nov 04, 2016 at 05:29:36PM -0400, Paolo Bonzini wrote:
> 
> > >> No, the one that forced Marcelo to add the 10 minute limit to the
> > >> advance_clock.  We wouldn't need this advance_clock hack if we could
> > >> just call KVM_GET_CLOCK like we did before 00f4d64ee76e ("kvmclock:
> > >> clock should count only if vm is running").
> > > 
> > > There are two cases:
> > > 
> > > - migrating a paused guest
> > > 
> > > - pausing at the end of migration
> > > 
> > > In the first case, kvmclock_vm_state_change's !running branch will see
> > > state == RUN_STATE_FINISH_MIGRATE && s->clock_valid.  In the second
> > > case, it will see state == RUN_STATE_FINISH_MIGRATE && !s->clock_valid.
> > 
> > I lift my case, marcelo's said that stopping the time is a feature ...
> > (*kittens die*)
> 
> But that's why separating the two cases brings us the best of both worlds.
> If migrating a paused guest, there's no need for any adjustment, so no
> advance_clock hack.  If pausing at the end of migration, there's no need
> to pause kvmclock (this patch is effectively working around 00f4d64ee76e)
> and if we don't do that we can just call KVM_GET_CLOCK at pre_save time.

That was my internal v1. But then, the destination ignores s->clock
as follows:

    if (running) {
        struct kvm_clock_data data = {};
        uint64_t time_at_migration = kvmclock_current_nsec(s);

        s->clock_valid = false;

        /* We can't rely on the migrated clock value, just discard it */
        if (time_at_migration) {
            s->clock = time_at_migration;
        }

        data.clock = s->clock;
        ret = kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);

So you need to send that "ns" value (difference of two clock reads)
separately.

> 
> > Oh, and this does introduce a minor bug to this patch -- the time
> > counted by KVM_GET_CLOCK is has different frequency CLOCK_MONOTONIC.
> > Not accounting for that is bearable.
> 
> Not really, I was going to point that out when I got to replying with
> a review. :)
> 
> Paolo
Paolo Bonzini Nov. 4, 2016, 10:35 p.m. UTC | #19
> > But that's why separating the two cases brings us the best of both worlds.
> > If migrating a paused guest, there's no need for any adjustment, so no
> > advance_clock hack.  If pausing at the end of migration, there's no need
> > to pause kvmclock (this patch is effectively working around 00f4d64ee76e)
> > and if we don't do that we can just call KVM_GET_CLOCK at pre_save time.
> 
> That was my internal v1. But then, the destination ignores s->clock
> as follows:
> 
>     if (running) {
>         struct kvm_clock_data data = {};
>         uint64_t time_at_migration = kvmclock_current_nsec(s);

Right, but this is unnecessary in 4.9-rc, where KVM_GET_CLOCK *already*
returns the same as QEMU's kvmclock_current_nsec.  So this is the other
half of my suggestion, where we need to check the behavior of KVM_GET_CLOCK
via KVM_CHECK_EXTENSION.

I think it's okay to only fix the bug with master clock enabled and
for new KVM with sensible KVM_GET_CLOCK semantics.

Paolo

>         s->clock_valid = false;
> 
>         /* We can't rely on the migrated clock value, just discard it */
>         if (time_at_migration) {
>             s->clock = time_at_migration;
>         }
> 
>         data.clock = s->clock;
>         ret = kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
> 
> So you need to send that "ns" value (difference of two clock reads)
> separately.
> 
> > 
> > > Oh, and this does introduce a minor bug to this patch -- the time
> > > counted by KVM_GET_CLOCK is has different frequency CLOCK_MONOTONIC.
> > > Not accounting for that is bearable.
> > 
> > Not really, I was going to point that out when I got to replying with
> > a review. :)
> > 
> > Paolo
> 
>
Dr. David Alan Gilbert Nov. 7, 2016, 1:08 p.m. UTC | #20
* Radim Krčmář (rkrcmar@redhat.com) wrote:
> 2016-11-04 15:07-0200, Marcelo Tosatti:
> > On Fri, Nov 04, 2016 at 04:25:23PM +0100, Radim Krčmář wrote:
> >> > +        /*
> >> > +         * Transition from VM-running to VM-stopped via migration?
> >> > +         * Record when the VM was stopped.
> >> > +         */
> >> > +
> >> > +        if (state == RUN_STATE_FINISH_MIGRATE &&
> >> > +            !migration_in_postcopy(migrate_get_current())) {
> >> > +            clock_gettime(CLOCK_MONOTONIC, &s->t_aftervmstop);
> >> 
> >> How big (more like small) was the clock delta between here and
> >> kvmclock_pre_save with postcopy?
> >> 
> >> Thanks.
> > 
> > qemu-system-x86_64: postcopy_ram_supported_by_host: userfaultfd not
> > available: Function not implemented
> > 
> > But should be about the same as precopy+this patch (guess as i don't 
> > know the postcopy path).
> 
> I was wondering about the improvement we could achieve by not excluding
> postcopy from the time fixup.  (i.e. how much time elapses between
> pausing and migrating the vm in postcopy)
> 
> I would also guess that it is not significant.

It can add up, so it would be useful to be able to do this trick.
We have to calculate and send some 'discard maps' that can take
some time, especially on larger VMs, and we also have to serialise
the state of all non-RAM devices, so if you have a lot of other
emulated devices it can take a bit more time.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Roman Kagan Nov. 7, 2016, 2:31 p.m. UTC | #21
On Fri, Nov 04, 2016 at 06:16:06PM +0100, Radim Krčmář wrote:
> 2016-11-04 16:57+0100, Paolo Bonzini:
> > On 04/11/2016 16:48, Radim Krčmář wrote:
> >> 2016-11-04 16:33+0100, Paolo Bonzini:
> >>> On 04/11/2016 16:25, Radim Krčmář wrote:
> >>>>>>  
> >>>>>> +        if (s->advance_clock && s->clock + s->advance_clock > s->clock) {
> >>>>>> +            s->clock += s->advance_clock;
> >>>>>> +            s->advance_clock = 0;
> >>>>>> +        }
> >>>> Can't the advance_clock added to the migrated KVMClockState instead of
> >>>> passing it as another parameter?
> >>>>
> >>>> (It is sad that we can't just query KVMClockState in kvmclock_pre_save
> >>>>  because of the Linux bug.)
> >>>
> >>> What Linux bug?  The one that makes us use kvmclock_current_nsec?
> >> 
> >> No, the one that forced Marcelo to add the 10 minute limit to the
> >> advance_clock.  We wouldn't need this advance_clock hack if we could
> >> just call KVM_GET_CLOCK like we did before 00f4d64ee76e ("kvmclock:
> >> clock should count only if vm is running").
> > 
> > There are two cases:
> > 
> > - migrating a paused guest
> > 
> > - pausing at the end of migration
> > 
> > In the first case, kvmclock_vm_state_change's !running branch will see
> > state == RUN_STATE_FINISH_MIGRATE && s->clock_valid.  In the second
> > case, it will see state == RUN_STATE_FINISH_MIGRATE && !s->clock_valid.
> 
> I lift my case, marcelo's said that stopping the time is a feature ...
> (*kittens die*)

Sorry to chime in in the middle of the thread, but I wonder how happy
the guests are with this behavior.  Intuitively pausing or snapshotting
feels like closing the lid of a laptop, so every time I see the guest
waking up in the past after a pause I get confused.  It may also be
unexpected by Windows guests who never had this overflow problem but
now, being tied up with kvmclock, have to stop the time while in pause,
too.

Roman.
Marcelo Tosatti Nov. 7, 2016, 7:31 p.m. UTC | #22
On Mon, Nov 07, 2016 at 05:31:49PM +0300, Roman Kagan wrote:
> On Fri, Nov 04, 2016 at 06:16:06PM +0100, Radim Krčmář wrote:
> > 2016-11-04 16:57+0100, Paolo Bonzini:
> > > On 04/11/2016 16:48, Radim Krčmář wrote:
> > >> 2016-11-04 16:33+0100, Paolo Bonzini:
> > >>> On 04/11/2016 16:25, Radim Krčmář wrote:
> > >>>>>>  
> > >>>>>> +        if (s->advance_clock && s->clock + s->advance_clock > s->clock) {
> > >>>>>> +            s->clock += s->advance_clock;
> > >>>>>> +            s->advance_clock = 0;
> > >>>>>> +        }
> > >>>> Can't the advance_clock added to the migrated KVMClockState instead of
> > >>>> passing it as another parameter?
> > >>>>
> > >>>> (It is sad that we can't just query KVMClockState in kvmclock_pre_save
> > >>>>  because of the Linux bug.)
> > >>>
> > >>> What Linux bug?  The one that makes us use kvmclock_current_nsec?
> > >> 
> > >> No, the one that forced Marcelo to add the 10 minute limit to the
> > >> advance_clock.  We wouldn't need this advance_clock hack if we could
> > >> just call KVM_GET_CLOCK like we did before 00f4d64ee76e ("kvmclock:
> > >> clock should count only if vm is running").
> > > 
> > > There are two cases:
> > > 
> > > - migrating a paused guest
> > > 
> > > - pausing at the end of migration
> > > 
> > > In the first case, kvmclock_vm_state_change's !running branch will see
> > > state == RUN_STATE_FINISH_MIGRATE && s->clock_valid.  In the second
> > > case, it will see state == RUN_STATE_FINISH_MIGRATE && !s->clock_valid.
> > 
> > I lift my case, marcelo's said that stopping the time is a feature ...
> > (*kittens die*)
> 
> Sorry to chime in in the middle of the thread, but I wonder how happy
> the guests are with this behavior.  Intuitively pausing or snapshotting
> feels like closing the lid of a laptop, so every time I see the guest
> waking up in the past after a pause I get confused.  It may also be
> unexpected by Windows guests who never had this overflow problem but
> now, being tied up with kvmclock, have to stop the time while in pause,
> too.
> 
> Roman.

Waking up should be using guest-set-time QGA API: 

http://bugzilla.redhat.com/show_bug.cgi?id=1102411

Check "virsh domtime".
diff mbox

Patch

diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
index 0f75dd3..1bd8fd6 100644
--- a/hw/i386/kvm/clock.c
+++ b/hw/i386/kvm/clock.c
@@ -22,9 +22,11 @@ 
 #include "kvm_i386.h"
 #include "hw/sysbus.h"
 #include "hw/kvm/clock.h"
+#include "migration/migration.h"
 
 #include <linux/kvm.h>
 #include <linux/kvm_para.h>
+#include <time.h>
 
 #define TYPE_KVM_CLOCK "kvmclock"
 #define KVM_CLOCK(obj) OBJECT_CHECK(KVMClockState, (obj), TYPE_KVM_CLOCK)
@@ -35,7 +37,11 @@  typedef struct KVMClockState {
     /*< public >*/
 
     uint64_t clock;
+    uint64_t ns;
     bool clock_valid;
+
+    uint64_t advance_clock;
+    struct timespec t_aftervmstop;
 } KVMClockState;
 
 struct pvclock_vcpu_time_info {
@@ -100,6 +106,11 @@  static void kvmclock_vm_state_change(void *opaque, int running,
             s->clock = time_at_migration;
         }
 
+        if (s->advance_clock && s->clock + s->advance_clock > s->clock) {
+            s->clock += s->advance_clock;
+            s->advance_clock = 0;
+        }
+
         data.clock = s->clock;
         ret = kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
         if (ret < 0) {
@@ -135,6 +146,18 @@  static void kvmclock_vm_state_change(void *opaque, int running,
             abort();
         }
         s->clock = data.clock;
+        /*
+         * Transition from VM-running to VM-stopped via migration?
+         * Record when the VM was stopped.
+         */
+
+        if (state == RUN_STATE_FINISH_MIGRATE &&
+            !migration_in_postcopy(migrate_get_current())) {
+            clock_gettime(CLOCK_MONOTONIC, &s->t_aftervmstop);
+        } else {
+            s->t_aftervmstop.tv_sec = 0;
+            s->t_aftervmstop.tv_nsec = 0;
+        }
 
         /*
          * If the VM is stopped, declare the clock state valid to
@@ -152,12 +175,66 @@  static void kvmclock_realize(DeviceState *dev, Error **errp)
     qemu_add_vm_change_state_handler(kvmclock_vm_state_change, s);
 }
 
+static uint64_t clock_delta(struct timespec *before, struct timespec *after)
+{
+    if (before->tv_sec > after->tv_sec ||
+        (before->tv_sec == after->tv_sec &&
+         before->tv_nsec > after->tv_nsec)) {
+        fprintf(stderr, "clock_delta failed: before=(%ld sec, %ld nsec),"
+                        "after=(%ld sec, %ld nsec)\n", before->tv_sec,
+                        before->tv_nsec, after->tv_sec, after->tv_nsec);
+        abort();
+    }
+
+    return (after->tv_sec - before->tv_sec) * 1000000000ULL +
+            after->tv_nsec - before->tv_nsec;
+}
+
+static void kvmclock_pre_save(void *opaque)
+{
+    KVMClockState *s = opaque;
+    struct timespec now;
+    uint64_t ns;
+
+    if (s->t_aftervmstop.tv_sec == 0) {
+        return;
+    }
+
+    clock_gettime(CLOCK_MONOTONIC, &now);
+
+    ns = clock_delta(&s->t_aftervmstop, &now);
+
+    /*
+     * Linux guests can overflow if time jumps
+     * forward in large increments.
+     * Cap maximum adjustment to 10 minutes.
+     */
+    ns = MIN(ns, 600000000000ULL);
+
+    if (s->clock + ns > s->clock) {
+        s->ns = ns;
+    }
+}
+
+static int kvmclock_post_load(void *opaque, int version_id)
+{
+    KVMClockState *s = opaque;
+
+    /* save the value from incoming migration */
+    s->advance_clock = s->ns;
+
+    return 0;
+}
+
 static const VMStateDescription kvmclock_vmsd = {
     .name = "kvmclock",
-    .version_id = 1,
+    .version_id = 2,
     .minimum_version_id = 1,
+    .pre_save = kvmclock_pre_save,
+    .post_load = kvmclock_post_load,
     .fields = (VMStateField[]) {
         VMSTATE_UINT64(clock, KVMClockState),
+        VMSTATE_UINT64_V(ns, KVMClockState, 2),
         VMSTATE_END_OF_LIST()
     }
 };