
[1/1] sched/cputime: do not decrease steal time after live migration on xen

Message ID 1507626848-24148-1-git-send-email-dongli.zhang@oracle.com (mailing list archive)
State New, archived

Commit Message

Dongli Zhang Oct. 10, 2017, 9:14 a.m. UTC
After guest live migration on xen, steal time in /proc/stat
(cpustat[CPUTIME_STEAL]) might decrease because steal returned by
paravirt_steal_clock() might be less than this_rq()->prev_steal_time.

For instance, steal time of each vcpu is 335 before live migration.

cpu  198 0 368 200064 1962 0 0 1340 0 0
cpu0 38 0 81 50063 492 0 0 335 0 0
cpu1 65 0 97 49763 634 0 0 335 0 0
cpu2 38 0 81 50098 462 0 0 335 0 0
cpu3 56 0 107 50138 374 0 0 335 0 0

After live migration, steal time is reduced to 312.

cpu  200 0 370 200330 1971 0 0 1248 0 0
cpu0 38 0 82 50123 500 0 0 312 0 0
cpu1 65 0 97 49832 634 0 0 312 0 0
cpu2 39 0 82 50167 462 0 0 312 0 0
cpu3 56 0 107 50207 374 0 0 312 0 0
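(As an aside, a reproducible way to pull out the column in question: the steal value shown above is the ninth whitespace-separated field of each cpu line. The snippet below feeds the pre-migration sample through awk so it runs anywhere; on a live guest one would read /proc/stat directly instead of the printf.)

```shell
# Print the per-cpu steal column (9th field) from a /proc/stat snapshot.
# On a live guest: awk '/^cpu[0-9]/ {print $1, $9}' /proc/stat
printf '%s\n' \
  'cpu0 38 0 81 50063 492 0 0 335 0 0' \
  'cpu1 65 0 97 49763 634 0 0 335 0 0' |
awk '/^cpu[0-9]/ {print $1, $9}'
# prints:
# cpu0 335
# cpu1 335
```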

The code in this patch is borrowed from do_stolen_accounting(), which
was removed from the Linux source tree by commit ecb23dc6 ("xen: add
steal_clock support on x86").

A similar but more severe issue impacts Linux 4.8-4.10, as discussed
by Michael Las at
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest.
Unlike that issue, which overflows steal time and leads to 100% st usage
in the top command on Linux 4.8-4.10, on Linux 4.11+ steal time only
decreases, but does not overflow, after live migration.

References: https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 kernel/sched/cputime.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

Comments

Ingo Molnar Oct. 10, 2017, 10:59 a.m. UTC | #1
(Cc:-ed more gents involved in kernel/sched/cputime.c work. Full patch quoted 
below.)

* Dongli Zhang <dongli.zhang@oracle.com> wrote:

> After guest live migration on xen, steal time in /proc/stat
> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
> paravirt_steal_clock() might be less than this_rq()->prev_steal_time.
> 
> For instance, steal time of each vcpu is 335 before live migration.
> 
> cpu  198 0 368 200064 1962 0 0 1340 0 0
> cpu0 38 0 81 50063 492 0 0 335 0 0
> cpu1 65 0 97 49763 634 0 0 335 0 0
> cpu2 38 0 81 50098 462 0 0 335 0 0
> cpu3 56 0 107 50138 374 0 0 335 0 0
> 
> After live migration, steal time is reduced to 312.
> 
> cpu  200 0 370 200330 1971 0 0 1248 0 0
> cpu0 38 0 82 50123 500 0 0 312 0 0
> cpu1 65 0 97 49832 634 0 0 312 0 0
> cpu2 39 0 82 50167 462 0 0 312 0 0
> cpu3 56 0 107 50207 374 0 0 312 0 0
> 
> The code in this patch is borrowed from do_stolen_accounting() which has
> already been removed from linux source code since commit ecb23dc6 ("xen:
> add steal_clock support on x86").
> 
> Similar and more severe issue would impact prior linux 4.8-4.10 as
> discussed by Michael Las at
> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest.
> Unlike the issue discussed by Michael Las which would overflow steal time
> and lead to 100% st usage in top command for linux 4.8-4.10, the issue for
> linux 4.11+ would only decrease but not overflow steal time after live
> migration.
> 
> References: https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  kernel/sched/cputime.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 14d2dbf..57d09cab 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -238,10 +238,17 @@ static __always_inline u64 steal_account_process_time(u64 maxtime)
>  {
>  #ifdef CONFIG_PARAVIRT
>  	if (static_key_false(&paravirt_steal_enabled)) {
> -		u64 steal;
> +		u64 steal, steal_time;
> +		s64 steal_delta;
> +
> +		steal_time = paravirt_steal_clock(smp_processor_id());
> +		steal = steal_delta = steal_time - this_rq()->prev_steal_time;
> +
> +		if (unlikely(steal_delta < 0)) {
> +			this_rq()->prev_steal_time = steal_time;
> +			return 0;
> +		}
>  
> -		steal = paravirt_steal_clock(smp_processor_id());
> -		steal -= this_rq()->prev_steal_time;
>  		steal = min(steal, maxtime);
>  		account_steal_time(steal);
>  		this_rq()->prev_steal_time += steal;
> -- 
> 2.7.4
>
Peter Zijlstra Oct. 10, 2017, 11:58 a.m. UTC | #2
On Tue, Oct 10, 2017 at 05:14:08PM +0800, Dongli Zhang wrote:
> After guest live migration on xen, steal time in /proc/stat
> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
> paravirt_steal_clock() might be less than this_rq()->prev_steal_time.

So why not fix paravirt_steal_clock() to not be broken?
Stanislaw Gruszka Oct. 10, 2017, 12:42 p.m. UTC | #3
On Tue, Oct 10, 2017 at 12:59:26PM +0200, Ingo Molnar wrote:
> 
> (Cc:-ed more gents involved in kernel/sched/cputime.c work. Full patch quoted 
> below.)
> 
> * Dongli Zhang <dongli.zhang@oracle.com> wrote:
> 
> > After guest live migration on xen, steal time in /proc/stat
> > (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
> > paravirt_steal_clock() might be less than this_rq()->prev_steal_time.
> > 
> > For instance, steal time of each vcpu is 335 before live migration.
> > 
> > cpu  198 0 368 200064 1962 0 0 1340 0 0
> > cpu0 38 0 81 50063 492 0 0 335 0 0
> > cpu1 65 0 97 49763 634 0 0 335 0 0
> > cpu2 38 0 81 50098 462 0 0 335 0 0
> > cpu3 56 0 107 50138 374 0 0 335 0 0
> > 
> > After live migration, steal time is reduced to 312.
> > 
> > cpu  200 0 370 200330 1971 0 0 1248 0 0
> > cpu0 38 0 82 50123 500 0 0 312 0 0
> > cpu1 65 0 97 49832 634 0 0 312 0 0
> > cpu2 39 0 82 50167 462 0 0 312 0 0
> > cpu3 56 0 107 50207 374 0 0 312 0 0
> > 
> > The code in this patch is borrowed from do_stolen_accounting() which has
> > already been removed from linux source code since commit ecb23dc6 ("xen:
> > add steal_clock support on x86").
> > 
> > Similar and more severe issue would impact prior linux 4.8-4.10 as
> > discussed by Michael Las at
> > https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest.
> > Unlike the issue discussed by Michael Las which would overflow steal time
> > and lead to 100% st usage in top command for linux 4.8-4.10, the issue for
> > linux 4.11+ would only decrease but not overflow steal time after live
> > migration.
> > 
> > References: https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
> > Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> > ---
> >  kernel/sched/cputime.c | 13 ++++++++++---
> >  1 file changed, 10 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > index 14d2dbf..57d09cab 100644
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -238,10 +238,17 @@ static __always_inline u64 steal_account_process_time(u64 maxtime)
> >  {
> >  #ifdef CONFIG_PARAVIRT
> >  	if (static_key_false(&paravirt_steal_enabled)) {
> > -		u64 steal;
> > +		u64 steal, steal_time;
> > +		s64 steal_delta;
> > +
> > +		steal_time = paravirt_steal_clock(smp_processor_id());
> > +		steal = steal_delta = steal_time - this_rq()->prev_steal_time;
> > +
> > +		if (unlikely(steal_delta < 0)) {
> > +			this_rq()->prev_steal_time = steal_time;

I don't think setting prev_steal_time to a smaller value is the right
thing to do.

Besides, I don't think we need to check for an overflow condition on
cputime variables (it would happen only after 279 years :-). So instead
of introducing the signed steal_delta variable I would just add the
check below, which should be sufficient to fix the problem:

	if (unlikely(steal <= this_rq()->prev_steal_time))
		return 0;

Thanks
Stanislaw
Peter Zijlstra Oct. 10, 2017, 12:48 p.m. UTC | #4
On Tue, Oct 10, 2017 at 02:42:01PM +0200, Stanislaw Gruszka wrote:
> > > +		u64 steal, steal_time;
> > > +		s64 steal_delta;
> > > +
> > > +		steal_time = paravirt_steal_clock(smp_processor_id());
> > > +		steal = steal_delta = steal_time - this_rq()->prev_steal_time;
> > > +
> > > +		if (unlikely(steal_delta < 0)) {
> > > +			this_rq()->prev_steal_time = steal_time;
> 
> I don't think setting prev_steal_time to smaller value is right
> thing to do. 
> 
> Beside, I don't think we need to check for overflow condition for
> cputime variables (it will happen after 279 years :-). So instead
> of introducing signed steal_delta variable I would just add
> below check, which should be sufficient to fix the problem:
> 
> 	if (unlikely(steal <= this_rq()->prev_steal_time))
> 		return 0;

How about you just fix up paravirt_steal_time() on migration and not
muck with the users?
Rik van Riel Oct. 10, 2017, 2:01 p.m. UTC | #5
On Tue, 2017-10-10 at 14:48 +0200, Peter Zijlstra wrote:
> On Tue, Oct 10, 2017 at 02:42:01PM +0200, Stanislaw Gruszka wrote:
> > > > +		u64 steal, steal_time;
> > > > +		s64 steal_delta;
> > > > +
> > > > +		steal_time =
> > > > paravirt_steal_clock(smp_processor_id());
> > > > +		steal = steal_delta = steal_time - this_rq()-
> > > > >prev_steal_time;
> > > > +
> > > > +		if (unlikely(steal_delta < 0)) {
> > > > +			this_rq()->prev_steal_time =
> > > > steal_time;
> > 
> > I don't think setting prev_steal_time to smaller value is right
> > thing to do. 
> > 
> > Beside, I don't think we need to check for overflow condition for
> > cputime variables (it will happen after 279 years :-). So instead
> > of introducing signed steal_delta variable I would just add
> > below check, which should be sufficient to fix the problem:
> > 
> > 	if (unlikely(steal <= this_rq()->prev_steal_time))
> > 		return 0;
> 
> How about you just fix up paravirt_steal_time() on migration and not
> muck with the users ?

Not just migration, either. CPU hotplug is another time to fix up
the steal time.
Dongli Zhang Oct. 11, 2017, 7:29 a.m. UTC | #6
Hi Stanislaw and Peter,

On 10/10/2017 08:42 PM, Stanislaw Gruszka wrote:
> On Tue, Oct 10, 2017 at 12:59:26PM +0200, Ingo Molnar wrote:
>>
>> (Cc:-ed more gents involved in kernel/sched/cputime.c work. Full patch quoted 
>> below.)
>>
>> * Dongli Zhang <dongli.zhang@oracle.com> wrote:
>>
>>> After guest live migration on xen, steal time in /proc/stat
>>> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
>>> paravirt_steal_clock() might be less than this_rq()->prev_steal_time.
>>>
>>> For instance, steal time of each vcpu is 335 before live migration.
>>>
>>> cpu  198 0 368 200064 1962 0 0 1340 0 0
>>> cpu0 38 0 81 50063 492 0 0 335 0 0
>>> cpu1 65 0 97 49763 634 0 0 335 0 0
>>> cpu2 38 0 81 50098 462 0 0 335 0 0
>>> cpu3 56 0 107 50138 374 0 0 335 0 0
>>>
>>> After live migration, steal time is reduced to 312.
>>>
>>> cpu  200 0 370 200330 1971 0 0 1248 0 0
>>> cpu0 38 0 82 50123 500 0 0 312 0 0
>>> cpu1 65 0 97 49832 634 0 0 312 0 0
>>> cpu2 39 0 82 50167 462 0 0 312 0 0
>>> cpu3 56 0 107 50207 374 0 0 312 0 0
>>>
>>> The code in this patch is borrowed from do_stolen_accounting() which has
>>> already been removed from linux source code since commit ecb23dc6 ("xen:
>>> add steal_clock support on x86").
>>>
>>> Similar and more severe issue would impact prior linux 4.8-4.10 as
>>> discussed by Michael Las at
>>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest.
>>> Unlike the issue discussed by Michael Las which would overflow steal time
>>> and lead to 100% st usage in top command for linux 4.8-4.10, the issue for
>>> linux 4.11+ would only decrease but not overflow steal time after live
>>> migration.
>>>
>>> References: https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
>>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>>> ---
>>>  kernel/sched/cputime.c | 13 ++++++++++---
>>>  1 file changed, 10 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>>> index 14d2dbf..57d09cab 100644
>>> --- a/kernel/sched/cputime.c
>>> +++ b/kernel/sched/cputime.c
>>> @@ -238,10 +238,17 @@ static __always_inline u64 steal_account_process_time(u64 maxtime)
>>>  {
>>>  #ifdef CONFIG_PARAVIRT
>>>  	if (static_key_false(&paravirt_steal_enabled)) {
>>> -		u64 steal;
>>> +		u64 steal, steal_time;
>>> +		s64 steal_delta;
>>> +
>>> +		steal_time = paravirt_steal_clock(smp_processor_id());
>>> +		steal = steal_delta = steal_time - this_rq()->prev_steal_time;
>>> +
>>> +		if (unlikely(steal_delta < 0)) {
>>> +			this_rq()->prev_steal_time = steal_time;
> 
> I don't think setting prev_steal_time to smaller value is right
> thing to do.

If we do not set prev_steal_time to the smaller steal value (obtained from
paravirt_steal_clock()), it will take a while for the new steal value to
catch up with this_rq()->prev_steal_time, and cpustat[CPUTIME_STEAL] will
stay unchanged until steal exceeds this_rq()->prev_steal_time again. Do you
think that is fine?

If that is fine, I will try to limit the fix to Xen-specific code in
drivers/xen/time.c so that we do not touch kernel/sched/cputime.c, as Peter
has asked why not just fix up paravirt_steal_time() on migration.

Thank you very much!

Dongli Zhang

> 
> Beside, I don't think we need to check for overflow condition for
> cputime variables (it will happen after 279 years :-). So instead
> of introducing signed steal_delta variable I would just add
> below check, which should be sufficient to fix the problem:
> 
> 	if (unlikely(steal <= this_rq()->prev_steal_time))
> 		return 0;
> 
> Thanks
> Stanislaw
>
Dongli Zhang Oct. 11, 2017, 7:47 a.m. UTC | #7
Hi Rik,

On 10/10/2017 10:01 PM, Rik van Riel wrote:
> On Tue, 2017-10-10 at 14:48 +0200, Peter Zijlstra wrote:
>> On Tue, Oct 10, 2017 at 02:42:01PM +0200, Stanislaw Gruszka wrote:
>>>>> +		u64 steal, steal_time;
>>>>> +		s64 steal_delta;
>>>>> +
>>>>> +		steal_time =
>>>>> paravirt_steal_clock(smp_processor_id());
>>>>> +		steal = steal_delta = steal_time - this_rq()-
>>>>>> prev_steal_time;
>>>>> +
>>>>> +		if (unlikely(steal_delta < 0)) {
>>>>> +			this_rq()->prev_steal_time =
>>>>> steal_time;
>>>
>>> I don't think setting prev_steal_time to smaller value is right
>>> thing to do. 
>>>
>>> Beside, I don't think we need to check for overflow condition for
>>> cputime variables (it will happen after 279 years :-). So instead
>>> of introducing signed steal_delta variable I would just add
>>> below check, which should be sufficient to fix the problem:
>>>
>>> 	if (unlikely(steal <= this_rq()->prev_steal_time))
>>> 		return 0;
>>
>> How about you just fix up paravirt_steal_time() on migration and not
>> muck with the users ?
> 
> Not just migration, either. CPU hotplug is another time to fix up
> the steal time.

I think this issue might also be hit when we add and online a vcpu a very
long time after boot (or after the vcpu was last offlined). Please correct
me if I am wrong.

Thank you very much!

Dongli Zhang

>

Patch

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 14d2dbf..57d09cab 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -238,10 +238,17 @@  static __always_inline u64 steal_account_process_time(u64 maxtime)
 {
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
-		u64 steal;
+		u64 steal, steal_time;
+		s64 steal_delta;
+
+		steal_time = paravirt_steal_clock(smp_processor_id());
+		steal = steal_delta = steal_time - this_rq()->prev_steal_time;
+
+		if (unlikely(steal_delta < 0)) {
+			this_rq()->prev_steal_time = steal_time;
+			return 0;
+		}
 
-		steal = paravirt_steal_clock(smp_processor_id());
-		steal -= this_rq()->prev_steal_time;
 		steal = min(steal, maxtime);
 		account_steal_time(steal);
 		this_rq()->prev_steal_time += steal;