cpus: reset throttle_thread_scheduled after sleep

Message ID	1495229390-18909-1-git-send-email-felipe@nutanix.com (mailing list archive)
State	New, archived
Headers
Return-Path: <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>
From: Felipe Franciosi <felipe@nutanix.com>
To: Paolo Bonzini <pbonzini@redhat.com>, "Jason J. Herne" <jjherne@linux.vnet.ibm.com>, Malcolm Crossley <malcolm@nutanix.com>
Date: Fri, 19 May 2017 22:29:50 +0100
Message-Id: <1495229390-18909-1-git-send-email-felipe@nutanix.com>
Subject: [Qemu-devel] [PATCH] cpus: reset throttle_thread_scheduled after sleep
Precedence: list
Cc: "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>, Felipe Franciosi <felipe@nutanix.com>
Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org
Sender: "Qemu-devel" <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>

Felipe Franciosi May 19, 2017, 9:29 p.m. UTC

Currently, the throttle_thread_scheduled flag is reset back to 0 before
sleeping (as part of the throttling logic). Given that throttle_timer
(well, any timer) may tick with a slight delay, it so happens that under
heavy throttling (i.e. close to or at CPU_THROTTLE_PCT_MAX) the tick may
schedule a further cpu_throttle_thread() work item after the flag reset,
but before the previous sleep completed. This results in the vCPU thread
sleeping continuously for potentially several seconds in a row.

The chances of that happening can be drastically minimised by resetting
the flag after the sleep.

Signed-off-by: Felipe Franciosi <felipe@nutanix.com>
Signed-off-by: Malcolm Crossley <malcolm@nutanix.com>
---
 cpus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
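
The race described in the commit message can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions, not QEMU code: `vcpu_model`, `model_timer_tick()`, `model_throttle_work()` and `model_consecutive_sleeps()` are hypothetical names, and a "late tick" is modelled as a timer tick that fires while the vCPU is still asleep.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one vCPU's throttle state (not QEMU's CPUState). */
struct vcpu_model {
    int scheduled; /* stands in for cpu->throttle_thread_scheduled */
    int queued;    /* throttle work items waiting to run */
    int sleeps;    /* sleeps performed so far, with no guest code in between */
};

/* Timer tick: queue a work item unless one is already marked as scheduled. */
static void model_timer_tick(struct vcpu_model *m)
{
    if (!m->scheduled) {
        m->scheduled = 1;
        m->queued++;
    }
}

/* Run one work item; ticks_during_sleep ticks fire while it is asleep. */
static void model_throttle_work(struct vcpu_model *m, bool reset_before_sleep,
                                int ticks_during_sleep)
{
    assert(m->queued > 0);
    m->queued--;
    if (reset_before_sleep) {
        m->scheduled = 0;          /* old ordering: flag cleared, then sleep */
    }
    for (int i = 0; i < ticks_during_sleep; i++) {
        model_timer_tick(m);       /* a tick landing inside the sleep */
    }
    m->sleeps++;
    if (!reset_before_sleep) {
        m->scheduled = 0;          /* patched ordering: sleep, then clear */
    }
}

/* Count back-to-back sleeps when each of the first late_ticks sleeps is
 * interrupted by exactly one tick. */
int model_consecutive_sleeps(bool reset_before_sleep, int late_ticks)
{
    struct vcpu_model m = { 0, 0, 0 };

    model_timer_tick(&m);
    while (m.queued > 0) {
        int ticks = (m.sleeps < late_ticks) ? 1 : 0;
        model_throttle_work(&m, reset_before_sleep, ticks);
    }
    return m.sleeps;
}
```

In this model, with the old ordering every tick that lands inside a sleep queues another sleep, so the vCPU never gets to run in between. With the patched ordering such ticks are dropped, which is also why a tick can occasionally be lost altogether.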

Jason J. Herne May 22, 2017, 1:01 p.m. UTC | #1

On 05/19/2017 05:29 PM, Felipe Franciosi wrote:
> Currently, the throttle_thread_scheduled flag is reset back to 0 before
> sleeping (as part of the throttling logic). Given that throttle_timer
> (well, any timer) may tick with a slight delay, it so happens that under
> heavy throttling (i.e. close to or at CPU_THROTTLE_PCT_MAX) the tick may
> schedule a further cpu_throttle_thread() work item after the flag reset,
> but before the previous sleep completed. This results in the vCPU thread
> sleeping continuously for potentially several seconds in a row.
>
> The chances of that happening can be drastically minimised by resetting
> the flag after the sleep.
>
> Signed-off-by: Felipe Franciosi <felipe@nutanix.com>
> Signed-off-by: Malcolm Crossley <malcolm@nutanix.com>
> ---
>  cpus.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/cpus.c b/cpus.c
> index 516e5cb..f42eebd 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -677,9 +677,9 @@ static void cpu_throttle_thread(CPUState *cpu, run_on_cpu_data opaque)
>      sleeptime_ns = (long)(throttle_ratio * CPU_THROTTLE_TIMESLICE_NS);
>
>      qemu_mutex_unlock_iothread();
> -    atomic_set(&cpu->throttle_thread_scheduled, 0);
>      g_usleep(sleeptime_ns / 1000); /* Convert ns to us for usleep call */
>      qemu_mutex_lock_iothread();
> +    atomic_set(&cpu->throttle_thread_scheduled, 0);
>  }
>
>  static void cpu_throttle_timer_tick(void *opaque)
>

This seems to make sense to me.

Acked-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>

I'm CC'ing Juan, Amit and David as they are all active in the migration
area and may have opinions on this. Juan and David were also reviewers
for the original series.

Paolo Bonzini May 25, 2017, 3:52 p.m. UTC | #2

On 19/05/2017 23:29, Felipe Franciosi wrote:
> Currently, the throttle_thread_scheduled flag is reset back to 0 before
> sleeping (as part of the throttling logic). Given that throttle_timer
> (well, any timer) may tick with a slight delay, it so happens that under
> heavy throttling (i.e. close to or at CPU_THROTTLE_PCT_MAX) the tick may
> schedule a further cpu_throttle_thread() work item after the flag reset,
> but before the previous sleep completed. This results in the vCPU thread
> sleeping continuously for potentially several seconds in a row.
> 
> The chances of that happening can be drastically minimised by resetting
> the flag after the sleep.

True, on the other hand this may also increase the chance of not
sleeping at all.

How overcommitted was the host system?

Paolo

> Signed-off-by: Felipe Franciosi <felipe@nutanix.com>
> Signed-off-by: Malcolm Crossley <malcolm@nutanix.com>

Felipe Franciosi May 25, 2017, 4:25 p.m. UTC | #3

> On 25 May 2017, at 16:52, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> 
> 
> On 19/05/2017 23:29, Felipe Franciosi wrote:
>> Currently, the throttle_thread_scheduled flag is reset back to 0 before
>> sleeping (as part of the throttling logic). Given that throttle_timer
>> (well, any timer) may tick with a slight delay, it so happens that under
>> heavy throttling (i.e. close to or at CPU_THROTTLE_PCT_MAX) the tick may
>> schedule a further cpu_throttle_thread() work item after the flag reset,
>> but before the previous sleep completed. This results in the vCPU thread
>> sleeping continuously for potentially several seconds in a row.
>> 
>> The chances of that happening can be drastically minimised by resetting
>> the flag after the sleep.
> 
> True, on the other hand this may also increase the chance of not
> sleeping at all.

The perfect solution (for this throttling strategy) would probably be a per-cpu timer. In the meantime, I think avoiding massive sleeps is a win. We observed stalls in excess of 70 secs at 99% throttle.
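
To put the 99% figure in perspective, here is a rough sketch of the sleep-time arithmetic. The `sleeptime_ns` expression comes from the patch context; the definition of `throttle_ratio` as pct / (1 - pct) and the values of `CPU_THROTTLE_PCT_MAX` (99) and `CPU_THROTTLE_TIMESLICE_NS` (10 ms) are assumptions, since this thread names them without stating them. `throttle_sleep_ns()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: the thread names these macros but never states them. */
#define CPU_THROTTLE_PCT_MAX 99
#define CPU_THROTTLE_TIMESLICE_NS 10000000 /* 10 ms */

/* Sleep time per timeslice for a given throttle percentage.
 * throttle_ratio = pct / (1 - pct) is an assumption about how the
 * ratio used in the patch context is derived. */
int64_t throttle_sleep_ns(int pct)
{
    assert(pct >= 1 && pct <= CPU_THROTTLE_PCT_MAX);

    double frac = pct / 100.0;
    double throttle_ratio = frac / (1 - frac);

    return (int64_t)(throttle_ratio * CPU_THROTTLE_TIMESLICE_NS);
}
```

Under these assumptions a single sleep at 99% lasts roughly 990 ms for every 10 ms of guest run time, so a 70 second stall would correspond to about 70 sleeps running back-to-back.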

> How overcommitted was the host system?

Not overcommitted at all. And it's quite easy to reproduce. All you need is a workload heavy enough to prevent the migration from converging (or a slow network which you can emulate with a qdisc).

With a Linux guest, you should quickly see soft lockups being reported. With Windows, probably BSODs.

Thanks,
Felipe

> 
> Paolo
> 
>> Signed-off-by: Felipe Franciosi <felipe@nutanix.com>
>> Signed-off-by: Malcolm Crossley <malcolm@nutanix.com>

Paolo Bonzini May 25, 2017, 4:34 p.m. UTC | #4

On 25/05/2017 18:25, Felipe Franciosi wrote:
> The perfect solution (for this throttling strategy) would probably be
> a per-cpu timer. In the meantime, I think avoiding massive sleeps is
> a win. We observed stalls in excess of 70 secs at 99% throttle.

Ah, so the issue is not overcommit, it's too high throttling.  Then it
makes sense.

Thanks,

Paolo

>> How overcommitted was the host system?
> Not overcommitted at all. And it's quite easy to reproduce. All you
> need is a workload heavy enough to prevent the migration from
> converging (or a slow network which you can emulate with a qdisc).
> 
> With a Linux guest, you should quickly see soft lockups being
> reported. With Windows, probably BSODs.

Dr. David Alan Gilbert June 1, 2017, 2:36 p.m. UTC | #5

* Jason J. Herne (jjherne@linux.vnet.ibm.com) wrote:
> This seems to make sense to me.
> 
> Acked-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
> 
> I'm CC'ing Juan, Amit and David as they are all active in the migration area
> and may have
> opinions on this. Juan and David were also reviewers for the original
> series.

The description is interesting and sounds reasonable; it'll be
interesting to see what difference it makes to the autoconverge
behaviour for those workloads that need this level of throttle.

Dave

> -- 
> -- Jason J. Herne (jjherne@linux.vnet.ibm.com)
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Felipe Franciosi June 1, 2017, 3:02 p.m. UTC | #6

> On 1 Jun 2017, at 15:36, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> 
> The description is interesting and sounds reasonable; it'll be
> interesting to see what difference it makes to the autoconverge
> behaviour for those workloads that need this level of throttle.

To get some hard data, we wrote a little application that:
1) spawns multiple threads (one per vCPU)
2) each thread mmap()s+mlock()s a certain working set (e.g. 30GB/#threads for a 32GB VM)
3) each thread writes a word to the beginning of every page in a tight loop
4) the parent thread periodically reports the number of dirtied pages

Even on a dedicated 10G link, that is pretty much guaranteed to require 99% throttle to converge.
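
A minimal sketch of such a page-dirtying application is shown below. It is not the program we actually ran: the names (`dirtier_args`, `dirtier_thread()`, `run_dirtier()`) are made up for illustration, each thread performs a fixed number of passes instead of looping forever, mlock() is treated as best effort, and the parent returns the final count instead of reporting periodically.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_DIRTIER_THREADS 64

/* Total number of page writes performed by all threads. */
static atomic_long pages_dirtied;

struct dirtier_args {
    size_t bytes; /* working set size per thread */
    int passes;   /* how many times to walk the working set */
};

static void *dirtier_thread(void *opaque)
{
    struct dirtier_args *args = opaque;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *buf = mmap(NULL, args->bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED) {
        return NULL;
    }
    /* Best effort: may fail if RLIMIT_MEMLOCK is too low. */
    (void)mlock(buf, args->bytes);

    for (int pass = 0; pass < args->passes; pass++) {
        for (size_t off = 0; off + page <= args->bytes; off += page) {
            /* Write one word at the start of each page to dirty it. */
            *(volatile long *)(buf + off) = pass;
            atomic_fetch_add(&pages_dirtied, 1);
        }
    }
    munmap(buf, args->bytes);
    return NULL;
}

/* Spawn nthreads dirtiers (one per vCPU in the real test) and return
 * the total number of page writes once they have all finished. */
long run_dirtier(int nthreads, size_t bytes_per_thread, int passes)
{
    pthread_t tids[MAX_DIRTIER_THREADS];
    struct dirtier_args args = { bytes_per_thread, passes };
    int started = 0;

    assert(nthreads > 0 && nthreads <= MAX_DIRTIER_THREADS);
    atomic_store(&pages_dirtied, 0);

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[started], NULL, dirtier_thread, &args) == 0) {
            started++;
        }
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
    }
    return atomic_load(&pages_dirtied);
}
```

For a real reproduction the inner loops would run indefinitely and the parent would print the counter every second or so, which is what makes multi-second gaps in guest progress visible.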

Before the patch, QEMU migrates the VM (described above) fairly quickly (~40s) after reaching 99% throttle. The application reported lockups of a few seconds at a time, which we initially thought was just that thread not running between QEMU-induced vCPU sleeps (and later attributed to the reported bug).

Then we used a 1G link. This time, the migration had to run for a lot longer even at 99%. That made the bug more likely to happen and we observed soft lockups (reported by the guest's kernel on the console) of 70+ seconds.

Using the patch, and back on a 10G link, the migration completes after a few more iterations than before (it took just under 2 minutes after reaching 99%). If you want further validation of the bug, you could instrument cpus-common.c:process_queued_cpu_work() to show that cpu_throttle_thread() is running back-to-back in these cases.
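
One way such instrumentation could look is sketched below. This is a hypothetical, standalone helper and not part of QEMU: `run_tracker`, `tracker_init()` and `tracker_note_run()` are invented names, and the caller is assumed to supply start/end timestamps (in ns) around each cpu_throttle_thread() invocation together with a gap threshold below which two runs count as back-to-back.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tracker for detecting back-to-back throttle work items. */
struct run_tracker {
    int64_t last_end_ns; /* when the previous run finished, -1 if none */
    int streak;          /* length of the current back-to-back streak */
    int max_streak;      /* longest streak seen so far */
};

void tracker_init(struct run_tracker *t)
{
    t->last_end_ns = -1;
    t->streak = 0;
    t->max_streak = 0;
}

/* Record one run. A run extends the current streak if it started no more
 * than max_gap_ns after the previous run ended; otherwise it starts a
 * new streak of length 1. */
void tracker_note_run(struct run_tracker *t, int64_t start_ns,
                      int64_t end_ns, int64_t max_gap_ns)
{
    assert(end_ns >= start_ns);

    if (t->last_end_ns >= 0 && start_ns - t->last_end_ns <= max_gap_ns) {
        t->streak++;
    } else {
        t->streak = 1;
    }
    if (t->streak > t->max_streak) {
        t->max_streak = t->streak;
    }
    t->last_end_ns = end_ns;
}
```

A `max_streak` well above 1 while the guest reports soft lockups would support the explanation given in the commit message.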

In summary, we believe this patch is needed immediately to prevent the lockups. A more elaborate throttling solution should be considered as future work: perhaps a per-vCPU timer that throttles more precisely, or a new convergence design altogether.

Thanks,
Felipe

> 
> Dave
> 
>> -- 
>> -- Jason J. Herne (jjherne@linux.vnet.ibm.com)
>> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Dr. David Alan Gilbert June 1, 2017, 3:08 p.m. UTC | #7

* Felipe Franciosi (felipe@nutanix.com) wrote:
> 
> > On 1 Jun 2017, at 15:36, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > 
> > The description is interesting and sounds reasonable; it'll be
> > interesting to see what difference it makes to the autoconverge
> > behaviour for those workloads that need this level of throttle.
> 
> To get some hard data, we wrote a little application that:
> 1) spawns multiple threads (one per vCPU)
> 2) each thread mmap()s+mlock()s a certain workset (eg. 30GB/#threads for a 32GB VM)
> 3) each thread writes a word to the beginning of every page in a tight loop
> 4) the parent thread periodically reports the number of dirtied pages
> 
> Even on a dedicated 10G link, that is pretty much guaranteed to require 99% throttle to converge.
> 
> Before the patch, Qemu migrates the VM (depicted above) fairly quickly (~40s) after reaching 99% throttle. The application reported a few seconds at a time with lockups which we initially thought was just that thread not running between Qemu-induced vCPU sleeps (and later attributed it to the reported bug).
> 
> Then we used a 1G link. This time, the migration had to run for a lot longer even at 99%. That made the bug more likely to happen and we observed soft lockups (reported by the guest's kernel on the console) of 70+ seconds.
> 
> Using the patch, and back on a 10G link, the migration completes after a few more iterations than before (took just under 2mins after reaching 99%). If you want further validation of the bug, instrumenting cpus-common.c:process_queued_cpu_work() could be done to show that cpu_throttle_thread() is running back-to-back under these cases.

OK, that's reasonable.

> In summary we believe this patch is immediately required to prevent the lockups.

Yes, agreed.

> A more elaborate throttling solution should be considered as future work. Perhaps a per-vCPU timer which throttles more precisely or a new convergence design altogether.

Dave

> 
> Thanks,
> Felipe
> 
> > 
> > Dave
> > 
> >> -- 
> >> -- Jason J. Herne (jjherne@linux.vnet.ibm.com)
> >> 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Juan Quintela June 7, 2017, 4:26 p.m. UTC | #8

Felipe Franciosi <felipe@nutanix.com> wrote:
> Currently, the throttle_thread_scheduled flag is reset back to 0 before
> sleeping (as part of the throttling logic). Given that throttle_timer
> (well, any timer) may tick with a slight delay, it so happens that under
> heavy throttling (i.e. close to or at CPU_THROTTLE_PCT_MAX) the tick may
> schedule a further cpu_throttle_thread() work item after the flag reset,
> but before the previous sleep completed. This results in the vCPU thread
> sleeping continuously for potentially several seconds in a row.
>
> The chances of that happening can be drastically minimised by resetting
> the flag after the sleep.
>
> Signed-off-by: Felipe Franciosi <felipe@nutanix.com>
> Signed-off-by: Malcolm Crossley <malcolm@nutanix.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

Paolo, I think that the analysis makes sense.

Should you pull this patch, or do you want me to pull it?

Thanks, Juan.

Paolo Bonzini June 7, 2017, 4:58 p.m. UTC | #9

On 07/06/2017 18:26, Juan Quintela wrote:
> Felipe Franciosi <felipe@nutanix.com> wrote:
>> Currently, the throttle_thread_scheduled flag is reset back to 0 before
>> sleeping (as part of the throttling logic). Given that throttle_timer
>> (well, any timer) may tick with a slight delay, it so happens that under
>> heavy throttling (i.e. close to or at CPU_THROTTLE_PCT_MAX) the tick may
>> schedule a further cpu_throttle_thread() work item after the flag reset,
>> but before the previous sleep completed. This results in the vCPU thread
>> sleeping continuously for potentially several seconds in a row.
>>
>> The chances of that happening can be drastically minimised by resetting
>> the flag after the sleep.
>>
>> Signed-off-by: Felipe Franciosi <felipe@nutanix.com>
>> Signed-off-by: Malcolm Crossley <malcolm@nutanix.com>
> 
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> 
> Paolo, I think that the analysis makes sense.
> 
> Should you pull this patch, or do you want me to pull it?

I've already included it in my jinxed (now at v6) pull request.

Paolo
