Message ID | c25d440c-0a2f-4fea-bb6e-0e0b945d9cf0@default (mailing list archive) |
---|---|
State | New, archived |
On 10/20/2017 01:37 AM, Dongli Zhang wrote:
> Hi Boris,
>
> ----- boris.ostrovsky@oracle.com wrote:
>
>> On 10/19/2017 04:02 AM, Dongli Zhang wrote:
>>> After guest live migration on xen, steal time in /proc/stat
>>> (cpustat[CPUTIME_STEAL]) might decrease because the steal value returned by
>>> xen_steal_clock() might be less than this_rq()->prev_steal_time, which is
>>> derived from the previous return value of xen_steal_clock().
>>>
>>> For instance, the steal time of each vcpu is 335 before live migration.
>>>
>>> cpu  198 0 368 200064 1962 0 0 1340 0 0
>>> cpu0 38 0 81 50063 492 0 0 335 0 0
>>> cpu1 65 0 97 49763 634 0 0 335 0 0
>>> cpu2 38 0 81 50098 462 0 0 335 0 0
>>> cpu3 56 0 107 50138 374 0 0 335 0 0
>>>
>>> After live migration, the steal time is reduced to 312.
>>>
>>> cpu  200 0 370 200330 1971 0 0 1248 0 0
>>> cpu0 38 0 82 50123 500 0 0 312 0 0
>>> cpu1 65 0 97 49832 634 0 0 312 0 0
>>> cpu2 39 0 82 50167 462 0 0 312 0 0
>>> cpu3 56 0 107 50207 374 0 0 312 0 0
>>>
>>> The code in this patch is borrowed from do_stolen_accounting(), which has
>>> already been removed from the linux source code since commit ecb23dc6f2ef
>>> ("xen: add steal_clock support on x86"). The core idea of both
>>> do_stolen_accounting() and this patch is to avoid accounting the new steal
>>> clock if it is smaller than the previous old steal clock.
>>>
>>> A similar and more severe issue would impact earlier kernels, linux
>>> 4.8-4.10, as discussed by Michael Las at
>>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
>>> where the steal time would overflow and lead to 100% st usage in the top
>>> command on linux 4.8-4.10. A backport of this patch would fix that issue.
>>>
>>> References: https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
>>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>>> ---
>>>  drivers/xen/time.c | 15 ++++++++++++++-
>>>  1 file changed, 14 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/xen/time.c b/drivers/xen/time.c
>>> index ac5f23f..2b3a996 100644
>>> --- a/drivers/xen/time.c
>>> +++ b/drivers/xen/time.c
>>> @@ -19,6 +19,8 @@
>>>  /* runstate info updated by Xen */
>>>  static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
>>>
>>> +static DEFINE_PER_CPU(u64, xen_old_steal);
>>> +
>>>  /* return an consistent snapshot of 64-bit time/counter value */
>>>  static u64 get64(const u64 *p)
>>>  {
>>> @@ -83,9 +85,20 @@ bool xen_vcpu_stolen(int vcpu)
>>>  u64 xen_steal_clock(int cpu)
>>>  {
>>>  	struct vcpu_runstate_info state;
>>> +	u64 xen_new_steal;
>>> +	s64 steal_delta;
>>>
>>>  	xen_get_runstate_snapshot_cpu(&state, cpu);
>>> -	return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
>>> +	xen_new_steal = state.time[RUNSTATE_runnable]
>>> +			+ state.time[RUNSTATE_offline];
>>> +	steal_delta = xen_new_steal - per_cpu(xen_old_steal, cpu);
>>> +
>>> +	if (steal_delta < 0)
>>> +		xen_new_steal = per_cpu(xen_old_steal, cpu);
>>> +	else
>>> +		per_cpu(xen_old_steal, cpu) = xen_new_steal;
>>> +
>>> +	return xen_new_steal;
>>>  }
>>>
>>>  void xen_setup_runstate_info(int cpu)
>>
>> Can we stash state.time[] during suspend and then add the stashed values
>> inside xen_get_runstate_snapshot_cpu()?
>
> Would you like to stash state.time[] during do_suspend() (or xen_suspend()),
> or is the code below what is expected:
>
> -------------------------------------------------
>
> --- a/drivers/xen/time.c
> +++ b/drivers/xen/time.c
> @@ -19,6 +19,8 @@
>  /* runstate info updated by Xen */
>  static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
>
> +static DEFINE_PER_CPU(u64[4], old_runstate_time);
> +
>  /* return an consistent snapshot of 64-bit time/counter value */
>  static u64 get64(const u64 *p)
>  {
> @@ -52,6 +54,8 @@ static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
>  {
>  	u64 state_time;
>  	struct vcpu_runstate_info *state;
> +	int i;
> +	s64 time_delta;
>
>  	BUG_ON(preemptible());
>
> @@ -64,6 +68,17 @@ static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
>  		rmb(); /* Hypervisor might update data. */
>  	} while (get64(&state->state_entry_time) != state_time ||
>  		 (state_time & XEN_RUNSTATE_UPDATE));
> +
> +	for (i = 0; i < 4; i++) {
> +		if (i == RUNSTATE_runnable || i == RUNSTATE_offline) {
> +			time_delta = res->time[i] - per_cpu(old_runstate_time, cpu)[i];
> +
> +			if (unlikely(time_delta < 0))
> +				res->time[i] = per_cpu(old_runstate_time, cpu)[i];
> +			else
> +				per_cpu(old_runstate_time, cpu)[i] = res->time[i];
> +		}
> +	}
>  }

I was thinking more along the lines of

	for (i = 0; i < 4; i++)
		res->time[i] += per_cpu(old_runstate_time, cpu)[i];

The time[] array is cumulative and AFAIUI is cleared during resume by the
hypervisor, so we are effectively restoring the values here.

No?

-boris
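A rough sketch of the direction Boris suggests (stash the cumulative time[] counters on the suspend path and fold them back in when xen_get_runstate_snapshot_cpu() builds a snapshot) could look like the code below. The helper names xen_save_runstate_time() and xen_add_saved_runstate_time(), as well as the exact suspend-path call site, are assumptions made for illustration, not code proposed in the thread.

	/*
	 * Sketch only: assumes these additions live in drivers/xen/time.c
	 * next to xen_get_runstate_snapshot_cpu(); all names are hypothetical.
	 */
	static DEFINE_PER_CPU(u64[4], old_runstate_time);

	/*
	 * Hypothetical helper for the suspend path (called with preemption
	 * disabled, e.g. from the suspend code): remember the cumulative
	 * runstate times of every possible vCPU before migration.
	 */
	void xen_save_runstate_time(void)
	{
		struct vcpu_runstate_info state;
		int cpu, i;

		for_each_possible_cpu(cpu) {
			xen_get_runstate_snapshot_cpu(&state, cpu);
			for (i = 0; i < 4; i++)
				per_cpu(old_runstate_time, cpu)[i] = state.time[i];
		}
	}

	/*
	 * Hypothetical helper to be called at the end of
	 * xen_get_runstate_snapshot_cpu(): per the discussion above, the
	 * counters reported by the hypervisor restart after resume, so
	 * adding the stashed pre-migration totals keeps time[] (and hence
	 * the steal clock) monotonic across live migration.
	 */
	static void xen_add_saved_runstate_time(struct vcpu_runstate_info *res,
						int cpu)
	{
		int i;

		for (i = 0; i < 4; i++)
			res->time[i] += per_cpu(old_runstate_time, cpu)[i];
	}

Because the snapshot would already include previously stashed values, re-running xen_save_runstate_time() on a later migration keeps accumulating correctly; compared with the original patch, which clamps only the aggregated steal clock, every runstate counter would stay monotonic.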