[v3,34/47] xen/sched: add fall back to idle vcpu when scheduling unit

Message ID: 20190914085251.18816-35-jgross@suse.com
State: Superseded
Series: xen: add core scheduling support

Commit Message

Jürgen Groß Sept. 14, 2019, 8:52 a.m. UTC
When scheduling a unit with multiple vcpus there is no guarantee that
all vcpus are available (e.g. a vcpu index beyond the domain's
max_vcpus, or an offline vcpu). Fall back to the idle vcpu of the
current cpu in that case. This requires storing the correct
schedule_unit pointer in the idle vcpu for as long as it is used as
the fallback vcpu.
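
For illustration, the fallback selection (condensed from the
sched_unit2vcpu_cpu() hunk in xen/common/schedule.c below) amounts to:

    static inline struct vcpu *sched_unit2vcpu_cpu(struct sched_unit *unit,
                                                   unsigned int cpu)
    {
        struct vcpu *v = unit2vcpu_cpu(unit, cpu);

        /* No vcpu for this cpu, or it isn't to run: take the idle vcpu. */
        return (v && v->new_state == RUNSTATE_running) ? v : idle_vcpu[cpu];
    }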

In order to modify the runstates of the correct vcpus when switching
schedule units, merge sched_unit_runstate_change() into
sched_switch_units() and loop over the affected physical cpus instead
of the unit's vcpus. This in turn requires an accessor for the
"current" variable of other cpus.
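
The accessor differs per architecture (both definitions are part of
the patch below): on x86 the vcpu pointer is read from the target
cpu's stack area, while on Arm "current" already lives in a per-cpu
variable:

    /* x86 (asm-x86/smp.h): */
    #define get_cpu_current(cpu) \
        (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)

    /* Arm (asm-arm/current.h): */
    #define get_cpu_current(cpu)  (per_cpu(curr_vcpu, cpu))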

Today context_saved() is called when the previous and next vcpus
differ during a context switch. With an idle vcpu being able to
substitute for an offline vcpu this is problematic when switching to
an idle scheduling unit: an idle previous vcpu leaves us in doubt
which schedule unit was active before, so save the previous unit
pointer in the per-scheduling-resource area. If it is NULL the unit
has not changed and we don't have to mark the previous unit as not
running.
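
Condensed from the unit_context_saved() hunk below, the resulting
logic is:

    static void unit_context_saved(struct sched_resource *sd)
    {
        struct sched_unit *unit = sd->prev;

        if ( !unit )
            return;                  /* Unit unchanged, nothing to do. */

        unit->is_running = 0;
        unit->state_entry_time = NOW();
        sd->prev = NULL;

        /* Check for migration request /after/ clearing running flag. */
        smp_mb();

        sched_context_saved(unit_scheduler(unit), unit);
        /* Migration handling follows in the full hunk below. */
    }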

When running an idle vcpu in a non-idle scheduling unit use a
specific guest idle loop not performing any tasklets or livepatching,
in order to avoid populating the cpu caches with memory used by other
domains (as far as possible). Softirqs are considered to be safe.
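
The new loop (guest_idle_loop() in the patch below) restricts itself
to softirq handling and scrubbing of free pages:

    static void guest_idle_loop(void)
    {
        unsigned int cpu = smp_processor_id();

        for ( ; ; )
        {
            ASSERT(!cpu_is_offline(cpu));

            /* Re-check for softirqs raised by scrubbing before sleeping. */
            if ( !softirq_pending(cpu) && !scrub_free_pages() &&
                 !softirq_pending(cpu) )
                sched_guest_idle(pm_idle, cpu);
            do_softirq();
        }
    }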

In order to avoid livepatching when going to guest idle, another
variant of reset_stack_and_jump() not calling check_for_livepatch_work
is needed.
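
Both variants are thin wrappers around a common
switch_stack_and_jump() (see the asm-x86/current.h hunk below):

    #define reset_stack_and_jump(fn)                                        \
        switch_stack_and_jump(fn, CHECK_FOR_LIVEPATCH_WORK)

    #define reset_stack_and_jump_nolp(fn)                                   \
        switch_stack_and_jump(fn, "")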

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <julien.grall@arm.com>
---
RFC V2:
- new patch (Andrew Cooper)

V1:
- use urgent_count to select correct idle routine (Jan Beulich)

V2:
- set vcpu->is_running in context_saved()
- introduce reset_stack_and_jump_nolp() (Jan Beulich)
- re-add scrubbing (Jan Beulich, Andrew Cooper)
- get_cpu_current() _NOT_ moved to include/asm-x86/current.h as the
  needed reference to stack_base[] results in #include hell

V3:
- split context_saved() into unit_context_saved() and vcpu_context_saved()
---
 xen/arch/x86/domain.c         |  23 ++++++
 xen/common/schedule.c         | 180 +++++++++++++++++++++++++++++-------------
 xen/include/asm-arm/current.h |   1 +
 xen/include/asm-x86/current.h |  19 ++++-
 xen/include/asm-x86/smp.h     |   3 +
 xen/include/xen/sched-if.h    |   4 +-
 xen/include/xen/sched.h       |   1 +
 7 files changed, 170 insertions(+), 61 deletions(-)

Comments

Jan Beulich Sept. 24, 2019, 10:53 a.m. UTC | #1
On 14.09.2019 10:52, Juergen Gross wrote:
> When scheduling a unit with multiple vcpus there is no guarantee that
> all vcpus are available (e.g. a vcpu index beyond the domain's
> max_vcpus, or an offline vcpu). Fall back to the idle vcpu of the
> current cpu in that case. This requires storing the correct
> schedule_unit pointer in the idle vcpu for as long as it is used as
> the fallback vcpu.
> 
> In order to modify the runstates of the correct vcpus when switching
> schedule units, merge sched_unit_runstate_change() into
> sched_switch_units() and loop over the affected physical cpus instead
> of the unit's vcpus. This in turn requires an accessor for the
> "current" variable of other cpus.
> 
> Today context_saved() is called when the previous and next vcpus
> differ during a context switch. With an idle vcpu being able to
> substitute for an offline vcpu this is problematic when switching to
> an idle scheduling unit: an idle previous vcpu leaves us in doubt
> which schedule unit was active before, so save the previous unit
> pointer in the per-scheduling-resource area. If it is NULL the unit
> has not changed and we don't have to mark the previous unit as not
> running.
> 
> When running an idle vcpu in a non-idle scheduling unit use a
> specific guest idle loop not performing any tasklets or livepatching,
> in order to avoid populating the cpu caches with memory used by other
> domains (as far as possible). Softirqs are considered to be safe.

Aiui "tasklets" here means only ones run in (idle) vCPU context, whereas
"softirqs" includes tasklets run in softirq context. I think it would
help if the description made this distinction. With this I then wonder
whether the cache related argumentation above still holds: VT-d's IOMMU
fault handling tasklet runs in softirq context, for example, and
hvm_assert_evtchn_irq(), being handed a struct vcpu *, does too. Of
course this could be considered covered by "(as far as possible)" ...

> @@ -172,6 +191,10 @@ void startup_cpu_idle_loop(void)
>  
>  static void noreturn continue_idle_domain(struct vcpu *v)
>  {
> +    /* Idle vcpus might be attached to non-idle units! */
> +    if ( !is_idle_domain(v->sched_unit->domain) )
> +        reset_stack_and_jump_nolp(guest_idle_loop);

The construct and comment make me wonder - did you audit all uses
of is_idle_{domain,vcpu}() for being in line with this new usage
mode?

> @@ -1767,33 +1774,66 @@ static void sched_switch_units(struct sched_resource *sd,
>                                 struct sched_unit *next, struct sched_unit *prev,
>                                 s_time_t now)
>  {
> -    sd->curr = next;
> -
> -    TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, prev->unit_id,
> -             now - prev->state_entry_time);
> -    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, next->unit_id,
> -             (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
> -             (now - next->state_entry_time) : 0, prev->next_time);
> +    int cpu;

unsigned int?

>      ASSERT(unit_running(prev));
>  
> -    TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
> -             next->domain->domain_id, next->unit_id);
> +    if ( prev != next )
> +    {
> +        sd->curr = next;
> +        sd->prev = prev;
>  
> -    sched_unit_runstate_change(prev, false, now);
> +        TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id,
> +                 prev->unit_id, now - prev->state_entry_time);
> +        TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id,
> +                 next->unit_id,
> +                 (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
> +                 (now - next->state_entry_time) : 0, prev->next_time);
> +        TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
> +                 next->domain->domain_id, next->unit_id);
>  
> -    ASSERT(!unit_running(next));
> -    sched_unit_runstate_change(next, true, now);
> +        ASSERT(!unit_running(next));
>  
> -    /*
> -     * NB. Don't add any trace records from here until the actual context
> -     * switch, else lost_records resume will not work properly.
> -     */
> +        /*
> +         * NB. Don't add any trace records from here until the actual context
> +         * switch, else lost_records resume will not work properly.
> +         */
> +
> +        ASSERT(!next->is_running);
> +        next->is_running = 1;
> +        next->state_entry_time = now;
> +
> +        if ( is_idle_unit(prev) )
> +        {
> +            prev->runstate_cnt[RUNSTATE_running] = 0;
> +            prev->runstate_cnt[RUNSTATE_runnable] = sched_granularity;
> +        }
> +        if ( is_idle_unit(next) )
> +        {
> +            next->runstate_cnt[RUNSTATE_running] = sched_granularity;
> +            next->runstate_cnt[RUNSTATE_runnable] = 0;
> +        }

Is this correct when some of the vCPU-s are substitute idle ones?

> @@ -1849,29 +1889,39 @@ static struct sched_unit *do_schedule(struct sched_unit *prev, s_time_t now,
>      if ( prev->next_time >= 0 ) /* -ve means no limit */
>          set_timer(&sd->s_timer, now + prev->next_time);
>  
> -    if ( likely(prev != next) )
> -        sched_switch_units(sd, next, prev, now);
> +    sched_switch_units(sd, next, prev, now);
>  
>      return next;
>  }
>  
> -static void context_saved(struct vcpu *prev)
> +static void vcpu_context_saved(struct vcpu *vprev, struct vcpu *vnext)
>  {
> -    struct sched_unit *unit = prev->sched_unit;
> -
>      /* Clear running flag /after/ writing context to memory. */
>      smp_wmb();
>  
> -    prev->is_running = 0;
> +    if ( vprev != vnext )
> +        vprev->is_running = 0;
> +}

With this "vnext" could be const qualified as it seems. And "false"
instead of "0" perhaps, as you touch this anyway?

> @@ -2015,7 +2079,8 @@ static void sched_slave(void)
>  
>      pcpu_schedule_unlock_irq(lock, cpu);
>  
> -    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu), now);
> +    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu),
> +                         is_idle_unit(next) && !is_idle_unit(prev), now);
>  }
>  
>  /*
> @@ -2075,7 +2140,8 @@ static void schedule(void)
>      pcpu_schedule_unlock_irq(lock, cpu);
>  
>      vnext = sched_unit2vcpu_cpu(next, cpu);
> -    sched_context_switch(vprev, vnext, now);
> +    sched_context_switch(vprev, vnext,
> +                         !is_idle_unit(prev) && is_idle_unit(next), now);
>  }

As a minor remark, I think between such constructs it would be good
if there was no difference, unless there's a reason to have one. Yet
if there was a reason, it surely would want to be spelled out.

> @@ -124,16 +129,22 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>  # define CHECK_FOR_LIVEPATCH_WORK ""
>  #endif
>  
> -#define reset_stack_and_jump(__fn)                                      \
> +#define switch_stack_and_jump(fn, instr)                                \

Is there any dependency on "instr" to just be a single instruction?
I'm inclined to ask for it to be named "extra" or some such.

> --- a/xen/include/asm-x86/smp.h
> +++ b/xen/include/asm-x86/smp.h
> @@ -76,6 +76,9 @@ void set_nr_sockets(void);
>  /* Representing HT and core siblings in each socket. */
>  extern cpumask_t **socket_cpumask;
>  
> +#define get_cpu_current(cpu) \
> +    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)

I don't think this can go without a comment clarifying under what
(pretty narrow I think) conditions this is legitimate to use.

Jan
Jürgen Groß Sept. 25, 2019, 12:58 p.m. UTC | #2
On 24.09.19 12:53, Jan Beulich wrote:
> On 14.09.2019 10:52, Juergen Gross wrote:
>> When scheduling a unit with multiple vcpus there is no guarantee that
>> all vcpus are available (e.g. a vcpu index beyond the domain's
>> max_vcpus, or an offline vcpu). Fall back to the idle vcpu of the
>> current cpu in that case. This requires storing the correct
>> schedule_unit pointer in the idle vcpu for as long as it is used as
>> the fallback vcpu.
>>
>> In order to modify the runstates of the correct vcpus when switching
>> schedule units, merge sched_unit_runstate_change() into
>> sched_switch_units() and loop over the affected physical cpus instead
>> of the unit's vcpus. This in turn requires an accessor for the
>> "current" variable of other cpus.
>>
>> Today context_saved() is called when the previous and next vcpus
>> differ during a context switch. With an idle vcpu being able to
>> substitute for an offline vcpu this is problematic when switching to
>> an idle scheduling unit: an idle previous vcpu leaves us in doubt
>> which schedule unit was active before, so save the previous unit
>> pointer in the per-scheduling-resource area. If it is NULL the unit
>> has not changed and we don't have to mark the previous unit as not
>> running.
>>
>> When running an idle vcpu in a non-idle scheduling unit use a
>> specific guest idle loop not performing any tasklets or livepatching,
>> in order to avoid populating the cpu caches with memory used by other
>> domains (as far as possible). Softirqs are considered to be safe.
> 
> Aiui "tasklets" here means only ones run in (idle) vCPU context, whereas
> "softirqs" includes tasklets run in softirq context. I think it would
> help if the description made this distinction. With this I then wonder
> whether the cache related argumentation above still holds: VT-d's IOMMU
> fault handling tasklet runs in softirq context, for example, and
> hvm_assert_evtchn_irq(), being handed a struct vcpu *, does too. Of
> course this could be considered covered by "(as far as possible)" ...

I'll write "... not performing any non-softirq tasklets ...".

> 
>> @@ -172,6 +191,10 @@ void startup_cpu_idle_loop(void)
>>   
>>   static void noreturn continue_idle_domain(struct vcpu *v)
>>   {
>> +    /* Idle vcpus might be attached to non-idle units! */
>> +    if ( !is_idle_domain(v->sched_unit->domain) )
>> +        reset_stack_and_jump_nolp(guest_idle_loop);
> 
> The construct and comment make me wonder - did you audit all uses
> of is_idle_{domain,vcpu}() for being in line with this new usage
> mode?

Yes.

> 
>> @@ -1767,33 +1774,66 @@ static void sched_switch_units(struct sched_resource *sd,
>>                                  struct sched_unit *next, struct sched_unit *prev,
>>                                  s_time_t now)
>>   {
>> -    sd->curr = next;
>> -
>> -    TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, prev->unit_id,
>> -             now - prev->state_entry_time);
>> -    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, next->unit_id,
>> -             (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
>> -             (now - next->state_entry_time) : 0, prev->next_time);
>> +    int cpu;
> 
> unsigned int?

Okay.

> 
>>       ASSERT(unit_running(prev));
>>   
>> -    TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
>> -             next->domain->domain_id, next->unit_id);
>> +    if ( prev != next )
>> +    {
>> +        sd->curr = next;
>> +        sd->prev = prev;
>>   
>> -    sched_unit_runstate_change(prev, false, now);
>> +        TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id,
>> +                 prev->unit_id, now - prev->state_entry_time);
>> +        TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id,
>> +                 next->unit_id,
>> +                 (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
>> +                 (now - next->state_entry_time) : 0, prev->next_time);
>> +        TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
>> +                 next->domain->domain_id, next->unit_id);
>>   
>> -    ASSERT(!unit_running(next));
>> -    sched_unit_runstate_change(next, true, now);
>> +        ASSERT(!unit_running(next));
>>   
>> -    /*
>> -     * NB. Don't add any trace records from here until the actual context
>> -     * switch, else lost_records resume will not work properly.
>> -     */
>> +        /*
>> +         * NB. Don't add any trace records from here until the actual context
>> +         * switch, else lost_records resume will not work properly.
>> +         */
>> +
>> +        ASSERT(!next->is_running);
>> +        next->is_running = 1;
>> +        next->state_entry_time = now;
>> +
>> +        if ( is_idle_unit(prev) )
>> +        {
>> +            prev->runstate_cnt[RUNSTATE_running] = 0;
>> +            prev->runstate_cnt[RUNSTATE_runnable] = sched_granularity;
>> +        }
>> +        if ( is_idle_unit(next) )
>> +        {
>> +            next->runstate_cnt[RUNSTATE_running] = sched_granularity;
>> +            next->runstate_cnt[RUNSTATE_runnable] = 0;
>> +        }
> 
> Is this correct when some of the vCPU-s are substitute idle ones?

Yes, as this affects idle units only. An idle vcpu running as a
substitute will _not_ result in the related idle unit's runstate_cnt
being updated.
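
(The guard responsible for this is visible in the
vcpu_runstate_change() hunk of the patch:

    if ( !is_idle_vcpu(v) )
    {
        unit->runstate_cnt[v->runstate.state]--;
        unit->runstate_cnt[new_state]++;
    }

so a substitute idle vcpu never modifies the non-idle unit's counts.)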

> 
>> @@ -1849,29 +1889,39 @@ static struct sched_unit *do_schedule(struct sched_unit *prev, s_time_t now,
>>       if ( prev->next_time >= 0 ) /* -ve means no limit */
>>           set_timer(&sd->s_timer, now + prev->next_time);
>>   
>> -    if ( likely(prev != next) )
>> -        sched_switch_units(sd, next, prev, now);
>> +    sched_switch_units(sd, next, prev, now);
>>   
>>       return next;
>>   }
>>   
>> -static void context_saved(struct vcpu *prev)
>> +static void vcpu_context_saved(struct vcpu *vprev, struct vcpu *vnext)
>>   {
>> -    struct sched_unit *unit = prev->sched_unit;
>> -
>>       /* Clear running flag /after/ writing context to memory. */
>>       smp_wmb();
>>   
>> -    prev->is_running = 0;
>> +    if ( vprev != vnext )
>> +        vprev->is_running = 0;
>> +}
> 
> With this "vnext" could be const qualified as it seems. And "false"
> instead of "0" perhaps, as you touch this anyway?

Okay.

> 
>> @@ -2015,7 +2079,8 @@ static void sched_slave(void)
>>   
>>       pcpu_schedule_unlock_irq(lock, cpu);
>>   
>> -    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu), now);
>> +    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu),
>> +                         is_idle_unit(next) && !is_idle_unit(prev), now);
>>   }
>>   
>>   /*
>> @@ -2075,7 +2140,8 @@ static void schedule(void)
>>       pcpu_schedule_unlock_irq(lock, cpu);
>>   
>>       vnext = sched_unit2vcpu_cpu(next, cpu);
>> -    sched_context_switch(vprev, vnext, now);
>> +    sched_context_switch(vprev, vnext,
>> +                         !is_idle_unit(prev) && is_idle_unit(next), now);
>>   }
> 
> As a minor remark, I think between such constructs it would be good
> if there was no difference, unless there's a reason to have one. Yet
> if there was a reason, it surely would want to be spelled out.

I guess you mean changing the parameters of sched_context_switch()? I
can do that.

> 
>> @@ -124,16 +129,22 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>>   # define CHECK_FOR_LIVEPATCH_WORK ""
>>   #endif
>>   
>> -#define reset_stack_and_jump(__fn)                                      \
>> +#define switch_stack_and_jump(fn, instr)                                \
> 
> Is there any dependency on "instr" to just be a single instruction?
> I'm inclined to ask for it to be named "extra" or some such.

Fine with me.

> 
>> --- a/xen/include/asm-x86/smp.h
>> +++ b/xen/include/asm-x86/smp.h
>> @@ -76,6 +76,9 @@ void set_nr_sockets(void);
>>   /* Representing HT and core siblings in each socket. */
>>   extern cpumask_t **socket_cpumask;
>>   
>> +#define get_cpu_current(cpu) \
>> +    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)
> 
> I don't think this can go without a comment clarifying under what
> (pretty narrow I think) conditions this is legitimate to use.

Okay. I'll add a comment like: "to be used only while no context switch
can occur on the cpu".


Juergen
Jan Beulich Sept. 25, 2019, 1:11 p.m. UTC | #3
On 25.09.2019 14:58, Jürgen Groß wrote:
> On 24.09.19 12:53, Jan Beulich wrote:
>> On 14.09.2019 10:52, Juergen Gross wrote:
>>> @@ -2015,7 +2079,8 @@ static void sched_slave(void)
>>>   
>>>       pcpu_schedule_unlock_irq(lock, cpu);
>>>   
>>> -    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu), now);
>>> +    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu),
>>> +                         is_idle_unit(next) && !is_idle_unit(prev), now);
>>>   }
>>>   
>>>   /*
>>> @@ -2075,7 +2140,8 @@ static void schedule(void)
>>>       pcpu_schedule_unlock_irq(lock, cpu);
>>>   
>>>       vnext = sched_unit2vcpu_cpu(next, cpu);
>>> -    sched_context_switch(vprev, vnext, now);
>>> +    sched_context_switch(vprev, vnext,
>>> +                         !is_idle_unit(prev) && is_idle_unit(next), now);
>>>   }
>>
>> As a minor remark, I think between such constructs it would be good
>> if there was no difference, unless there's a reason to have one. Yet
>> if there was a reason, it surely would want to be spelled out.
> 
> I guess you mean changing the parameters of sched_context_switch()? I
> can do that.

Well, yes, the two sides of the && in one of them.

>>> --- a/xen/include/asm-x86/smp.h
>>> +++ b/xen/include/asm-x86/smp.h
>>> @@ -76,6 +76,9 @@ void set_nr_sockets(void);
>>>   /* Representing HT and core siblings in each socket. */
>>>   extern cpumask_t **socket_cpumask;
>>>   
>>> +#define get_cpu_current(cpu) \
>>> +    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)
>>
>> I don't think this can go without a comment clarifying under what
>> (pretty narrow I think) conditions this is legitimate to use.
> 
> Okay. I'll add a comment like: "to be used only while no context switch
> can occur on the cpu".

To be crystal clear, I'd append ", i.e. by certain scheduling code only".

Jan
Jürgen Groß Sept. 25, 2019, 1:15 p.m. UTC | #4
On 25.09.19 15:11, Jan Beulich wrote:
> On 25.09.2019 14:58, Jürgen Groß wrote:
>> On 24.09.19 12:53, Jan Beulich wrote:
>>> On 14.09.2019 10:52, Juergen Gross wrote:
>>>> @@ -2015,7 +2079,8 @@ static void sched_slave(void)
>>>>    
>>>>        pcpu_schedule_unlock_irq(lock, cpu);
>>>>    
>>>> -    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu), now);
>>>> +    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu),
>>>> +                         is_idle_unit(next) && !is_idle_unit(prev), now);
>>>>    }
>>>>    
>>>>    /*
>>>> @@ -2075,7 +2140,8 @@ static void schedule(void)
>>>>        pcpu_schedule_unlock_irq(lock, cpu);
>>>>    
>>>>        vnext = sched_unit2vcpu_cpu(next, cpu);
>>>> -    sched_context_switch(vprev, vnext, now);
>>>> +    sched_context_switch(vprev, vnext,
>>>> +                         !is_idle_unit(prev) && is_idle_unit(next), now);
>>>>    }
>>>
>>> As a minor remark, I think between such constructs it would be good
>>> if there was no difference, unless there's a reason to have one. Yet
>>> if there was a reason, it surely would want to be spelled out.
>>
>> I guess you mean changing the parameters of sched_context_switch()? I
>> can do that.
> 
> Well, yes, the two sides of the && in one of them.

Ah, okay.

> 
>>>> --- a/xen/include/asm-x86/smp.h
>>>> +++ b/xen/include/asm-x86/smp.h
>>>> @@ -76,6 +76,9 @@ void set_nr_sockets(void);
>>>>    /* Representing HT and core siblings in each socket. */
>>>>    extern cpumask_t **socket_cpumask;
>>>>    
>>>> +#define get_cpu_current(cpu) \
>>>> +    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)
>>>
>>> I don't think this can go without a comment clarifying under what
>>> (pretty narrow I think) conditions this is legitimate to use.
>>
>> Okay. I'll add a comment like: "to be used only while no context switch
>> can occur on the cpu".
> 
> To be crystal clear, I'd append ", i.e. by certain scheduling code only".

Fine with me.


Juergen

Patch

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 6f3132682d..8d430d38c2 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -159,6 +159,25 @@  static void idle_loop(void)
     }
 }
 
+/*
+ * Idle loop for siblings in active schedule units.
+ * We don't do any standard idle work like tasklets or livepatching.
+ */
+static void guest_idle_loop(void)
+{
+    unsigned int cpu = smp_processor_id();
+
+    for ( ; ; )
+    {
+        ASSERT(!cpu_is_offline(cpu));
+
+        if ( !softirq_pending(cpu) && !scrub_free_pages() &&
             !softirq_pending(cpu) )
+            sched_guest_idle(pm_idle, cpu);
+        do_softirq();
+    }
+}
+
 void startup_cpu_idle_loop(void)
 {
     struct vcpu *v = current;
@@ -172,6 +191,10 @@  void startup_cpu_idle_loop(void)
 
 static void noreturn continue_idle_domain(struct vcpu *v)
 {
+    /* Idle vcpus might be attached to non-idle units! */
+    if ( !is_idle_domain(v->sched_unit->domain) )
+        reset_stack_and_jump_nolp(guest_idle_loop);
+
     reset_stack_and_jump(idle_loop);
 }
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 246ad38c7d..d53c60b966 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -136,10 +136,21 @@  static struct scheduler sched_idle_ops = {
     .switch_sched   = sched_idle_switch_sched,
 };
 
+static inline struct vcpu *unit2vcpu_cpu(struct sched_unit *unit,
+                                         unsigned int cpu)
+{
+    unsigned int idx = unit->unit_id + per_cpu(sched_res_idx, cpu);
+    const struct domain *d = unit->domain;
+
+    return (idx < d->max_vcpus) ? d->vcpu[idx] : NULL;
+}
+
 static inline struct vcpu *sched_unit2vcpu_cpu(struct sched_unit *unit,
                                                unsigned int cpu)
 {
-    return unit->domain->vcpu[unit->unit_id + per_cpu(sched_res_idx, cpu)];
+    struct vcpu *v = unit2vcpu_cpu(unit, cpu);
+
+    return (v && v->new_state == RUNSTATE_running) ? v : idle_vcpu[cpu];
 }
 
 static inline struct scheduler *dom_scheduler(const struct domain *d)
@@ -259,8 +270,11 @@  static inline void vcpu_runstate_change(
 
     trace_runstate_change(v, new_state);
 
-    unit->runstate_cnt[v->runstate.state]--;
-    unit->runstate_cnt[new_state]++;
+    if ( !is_idle_vcpu(v) )
+    {
+        unit->runstate_cnt[v->runstate.state]--;
+        unit->runstate_cnt[new_state]++;
+    }
 
     delta = new_entry_time - v->runstate.state_entry_time;
     if ( delta > 0 )
@@ -272,19 +286,11 @@  static inline void vcpu_runstate_change(
     v->runstate.state = new_state;
 }
 
-static inline void sched_unit_runstate_change(struct sched_unit *unit,
-    bool running, s_time_t new_entry_time)
+void sched_guest_idle(void (*idle) (void), unsigned int cpu)
 {
-    struct vcpu *v;
-
-    for_each_sched_unit_vcpu ( unit, v )
-        if ( running )
-            vcpu_runstate_change(v, v->new_state, new_entry_time);
-        else
-            vcpu_runstate_change(v,
-                ((v->pause_flags & VPF_blocked) ? RUNSTATE_blocked :
-                 (vcpu_runnable(v) ? RUNSTATE_runnable : RUNSTATE_offline)),
-                new_entry_time);
+    atomic_inc(&per_cpu(sched_urgent_count, cpu));
+    idle();
+    atomic_dec(&per_cpu(sched_urgent_count, cpu));
 }
 
 void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate)
@@ -523,6 +529,7 @@  int sched_init_vcpu(struct vcpu *v)
     if ( is_idle_domain(d) )
     {
         get_sched_res(v->processor)->curr = unit;
+        get_sched_res(v->processor)->sched_unit_idle = unit;
         v->is_running = 1;
         unit->is_running = 1;
         unit->state_entry_time = NOW();
@@ -855,7 +862,7 @@  static void sched_unit_move_locked(struct sched_unit *unit,
  *
  * sched_unit_migrate_finish() will do the work now if it can, or simply
  * return if it can't (because unit is still running); in that case
- * sched_unit_migrate_finish() will be called by context_saved().
+ * sched_unit_migrate_finish() will be called by unit_context_saved().
  */
 static void sched_unit_migrate_start(struct sched_unit *unit)
 {
@@ -878,7 +885,7 @@  static void sched_unit_migrate_finish(struct sched_unit *unit)
 
     /*
      * If the unit is currently running, this will be handled by
-     * context_saved(); and in any case, if the bit is cleared, then
+     * unit_context_saved(); and in any case, if the bit is cleared, then
      * someone else has already done the work so we don't need to.
      */
     if ( unit->is_running )
@@ -1767,33 +1774,66 @@  static void sched_switch_units(struct sched_resource *sd,
                                struct sched_unit *next, struct sched_unit *prev,
                                s_time_t now)
 {
-    sd->curr = next;
-
-    TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, prev->unit_id,
-             now - prev->state_entry_time);
-    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, next->unit_id,
-             (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
-             (now - next->state_entry_time) : 0, prev->next_time);
+    int cpu;
 
     ASSERT(unit_running(prev));
 
-    TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
-             next->domain->domain_id, next->unit_id);
+    if ( prev != next )
+    {
+        sd->curr = next;
+        sd->prev = prev;
 
-    sched_unit_runstate_change(prev, false, now);
+        TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id,
+                 prev->unit_id, now - prev->state_entry_time);
+        TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id,
+                 next->unit_id,
+                 (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
+                 (now - next->state_entry_time) : 0, prev->next_time);
+        TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
+                 next->domain->domain_id, next->unit_id);
 
-    ASSERT(!unit_running(next));
-    sched_unit_runstate_change(next, true, now);
+        ASSERT(!unit_running(next));
 
-    /*
-     * NB. Don't add any trace records from here until the actual context
-     * switch, else lost_records resume will not work properly.
-     */
+        /*
+         * NB. Don't add any trace records from here until the actual context
+         * switch, else lost_records resume will not work properly.
+         */
+
+        ASSERT(!next->is_running);
+        next->is_running = 1;
+        next->state_entry_time = now;
+
+        if ( is_idle_unit(prev) )
+        {
+            prev->runstate_cnt[RUNSTATE_running] = 0;
+            prev->runstate_cnt[RUNSTATE_runnable] = sched_granularity;
+        }
+        if ( is_idle_unit(next) )
+        {
+            next->runstate_cnt[RUNSTATE_running] = sched_granularity;
+            next->runstate_cnt[RUNSTATE_runnable] = 0;
+        }
+    }
+
+    for_each_cpu ( cpu, sd->cpus )
+    {
+        struct vcpu *vprev = get_cpu_current(cpu);
+        struct vcpu *vnext = sched_unit2vcpu_cpu(next, cpu);
+
+        if ( vprev != vnext || vprev->runstate.state != vnext->new_state )
+        {
+            vcpu_runstate_change(vprev,
+                ((vprev->pause_flags & VPF_blocked) ? RUNSTATE_blocked :
+                 (vcpu_runnable(vprev) ? RUNSTATE_runnable : RUNSTATE_offline)),
+                now);
+            vcpu_runstate_change(vnext, vnext->new_state, now);
+        }
 
-    ASSERT(!next->is_running);
-    next->vcpu_list->is_running = 1;
-    next->is_running = 1;
-    next->state_entry_time = now;
+        vnext->is_running = 1;
+
+        if ( is_idle_vcpu(vnext) )
+            vnext->sched_unit = next;
+    }
 }
 
 static bool sched_tasklet_check_cpu(unsigned int cpu)
@@ -1849,29 +1889,39 @@  static struct sched_unit *do_schedule(struct sched_unit *prev, s_time_t now,
     if ( prev->next_time >= 0 ) /* -ve means no limit */
         set_timer(&sd->s_timer, now + prev->next_time);
 
-    if ( likely(prev != next) )
-        sched_switch_units(sd, next, prev, now);
+    sched_switch_units(sd, next, prev, now);
 
     return next;
 }
 
-static void context_saved(struct vcpu *prev)
+static void vcpu_context_saved(struct vcpu *vprev, struct vcpu *vnext)
 {
-    struct sched_unit *unit = prev->sched_unit;
-
     /* Clear running flag /after/ writing context to memory. */
     smp_wmb();
 
-    prev->is_running = 0;
+    if ( vprev != vnext )
+        vprev->is_running = 0;
+}
+
+static void unit_context_saved(struct sched_resource *sd)
+{
+    struct sched_unit *unit = sd->prev;
+
+    if ( !unit )
+        return;
+
     unit->is_running = 0;
     unit->state_entry_time = NOW();
+    sd->prev = NULL;
 
     /* Check for migration request /after/ clearing running flag. */
     smp_mb();
 
-    sched_context_saved(vcpu_scheduler(prev), unit);
+    sched_context_saved(unit_scheduler(unit), unit);
 
-    sched_unit_migrate_finish(unit);
+    /* Idle never migrates and idle vcpus might belong to other units. */
+    if ( !is_idle_unit(unit) )
+        sched_unit_migrate_finish(unit);
 }
 
 /*
@@ -1881,35 +1931,44 @@  static void context_saved(struct vcpu *prev)
  * The counter will be 0 in case no rendezvous is needed. For the rendezvous
  * case it is initialised to the number of cpus to rendezvous plus 1. Each
  * member entering decrements the counter. The last one will decrement it to
- * 1 and perform the final needed action in that case (call of context_saved()
- * if vcpu was switched), and then set the counter to zero. The other members
+ * 1 and perform the final needed action in that case (call of
+ * unit_context_saved()), and then set the counter to zero. The other members
  * will wait until the counter becomes zero until they proceed.
  */
 void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
 {
     struct sched_unit *next = vnext->sched_unit;
+    struct sched_resource *sd = get_sched_res(smp_processor_id());
 
     if ( atomic_read(&next->rendezvous_out_cnt) )
     {
         int cnt = atomic_dec_return(&next->rendezvous_out_cnt);
 
-        /* Call context_saved() before releasing other waiters. */
+        vcpu_context_saved(vprev, vnext);
+
+        /* Call unit_context_saved() before releasing other waiters. */
         if ( cnt == 1 )
         {
-            if ( vprev != vnext )
-                context_saved(vprev);
+            unit_context_saved(sd);
             atomic_set(&next->rendezvous_out_cnt, 0);
         }
         else
             while ( atomic_read(&next->rendezvous_out_cnt) )
                 cpu_relax();
     }
-    else if ( vprev != vnext && sched_granularity == 1 )
-        context_saved(vprev);
+    else
+    {
+        vcpu_context_saved(vprev, vnext);
+        if ( sched_granularity == 1 )
+            unit_context_saved(sd);
+    }
+
+    if ( is_idle_vcpu(vprev) && vprev != vnext )
+        vprev->sched_unit = sd->sched_unit_idle;
 }
 
 static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
-                                 s_time_t now)
+                                 bool reset_idle_unit, s_time_t now)
 {
     if ( unlikely(vprev == vnext) )
     {
@@ -1918,6 +1977,11 @@  static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
                  now - vprev->runstate.state_entry_time,
                  vprev->sched_unit->next_time);
         sched_context_switched(vprev, vnext);
+
+        if ( reset_idle_unit )
+            vnext->sched_unit =
+                get_sched_res(smp_processor_id())->sched_unit_idle;
+
         trace_continue_running(vnext);
         return continue_running(vprev);
     }
@@ -1976,7 +2040,7 @@  static struct sched_unit *sched_wait_rendezvous_in(struct sched_unit *prev,
             pcpu_schedule_unlock_irq(*lock, cpu);
 
             raise_softirq(SCHED_SLAVE_SOFTIRQ);
-            sched_context_switch(vprev, vprev, now);
+            sched_context_switch(vprev, vprev, false, now);
         }
 
         pcpu_schedule_unlock_irq(*lock, cpu);
@@ -2015,7 +2079,8 @@  static void sched_slave(void)
 
     pcpu_schedule_unlock_irq(lock, cpu);
 
-    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu), now);
+    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu),
+                         is_idle_unit(next) && !is_idle_unit(prev), now);
 }
 
 /*
@@ -2075,7 +2140,8 @@  static void schedule(void)
     pcpu_schedule_unlock_irq(lock, cpu);
 
     vnext = sched_unit2vcpu_cpu(next, cpu);
-    sched_context_switch(vprev, vnext, now);
+    sched_context_switch(vprev, vnext,
+                         !is_idle_unit(prev) && is_idle_unit(next), now);
 }
 
 /* The scheduler timer: force a run through the scheduler */
@@ -2146,6 +2212,7 @@  static int cpu_schedule_up(unsigned int cpu)
      */
 
     sd->curr = idle_vcpu[cpu]->sched_unit;
+    sd->sched_unit_idle = idle_vcpu[cpu]->sched_unit;
 
     sd->sched_priv = NULL;
 
@@ -2315,6 +2382,7 @@  void __init scheduler_init(void)
     if ( vcpu_create(idle_domain, 0) == NULL )
         BUG();
     get_sched_res(0)->curr = idle_vcpu[0]->sched_unit;
+    get_sched_res(0)->sched_unit_idle = idle_vcpu[0]->sched_unit;
 }
 
 /*
diff --git a/xen/include/asm-arm/current.h b/xen/include/asm-arm/current.h
index 1653e89d30..88beb4645a 100644
--- a/xen/include/asm-arm/current.h
+++ b/xen/include/asm-arm/current.h
@@ -18,6 +18,7 @@  DECLARE_PER_CPU(struct vcpu *, curr_vcpu);
 
 #define current            (this_cpu(curr_vcpu))
 #define set_current(vcpu)  do { current = (vcpu); } while (0)
+#define get_cpu_current(cpu)  (per_cpu(curr_vcpu, cpu))
 
 /* Per-VCPU state that lives at the top of the stack */
 struct cpu_info {
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index f3508c3c08..0b47485337 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -77,6 +77,11 @@  struct cpu_info {
     /* get_stack_bottom() must be 16-byte aligned */
 };
 
+static inline struct cpu_info *get_cpu_info_from_stack(unsigned long sp)
+{
+    return (struct cpu_info *)((sp | (STACK_SIZE - 1)) + 1) - 1;
+}
+
 static inline struct cpu_info *get_cpu_info(void)
 {
 #ifdef __clang__
@@ -87,7 +92,7 @@  static inline struct cpu_info *get_cpu_info(void)
     register unsigned long sp asm("rsp");
 #endif
 
-    return (struct cpu_info *)((sp | (STACK_SIZE - 1)) + 1) - 1;
+    return get_cpu_info_from_stack(sp);
 }
 
 #define get_current()         (get_cpu_info()->current_vcpu)
@@ -124,16 +129,22 @@  unsigned long get_stack_dump_bottom (unsigned long sp);
 # define CHECK_FOR_LIVEPATCH_WORK ""
 #endif
 
-#define reset_stack_and_jump(__fn)                                      \
+#define switch_stack_and_jump(fn, instr)                                \
     ({                                                                  \
         __asm__ __volatile__ (                                          \
             "mov %0,%%"__OP"sp;"                                        \
-            CHECK_FOR_LIVEPATCH_WORK                                      \
+            instr                                                       \
              "jmp %c1"                                                  \
-            : : "r" (guest_cpu_user_regs()), "i" (__fn) : "memory" );   \
+            : : "r" (guest_cpu_user_regs()), "i" (fn) : "memory" );     \
         unreachable();                                                  \
     })
 
+#define reset_stack_and_jump(fn)                                        \
+    switch_stack_and_jump(fn, CHECK_FOR_LIVEPATCH_WORK)
+
+#define reset_stack_and_jump_nolp(fn)                                   \
+    switch_stack_and_jump(fn, "")
+
 /*
  * Which VCPU's state is currently running on each CPU?
 * This is not necessarily the same as 'current' as a CPU may be
diff --git a/xen/include/asm-x86/smp.h b/xen/include/asm-x86/smp.h
index 9f533f9072..51a31ab00a 100644
--- a/xen/include/asm-x86/smp.h
+++ b/xen/include/asm-x86/smp.h
@@ -76,6 +76,9 @@  void set_nr_sockets(void);
 /* Representing HT and core siblings in each socket. */
 extern cpumask_t **socket_cpumask;
 
+#define get_cpu_current(cpu) \
+    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)
+
 #endif /* !__ASSEMBLY__ */
 
 #endif
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 84d0658578..2929154d35 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -39,6 +39,8 @@  struct sched_resource {
     spinlock_t         *schedule_lock,
                        _lock;
     struct sched_unit  *curr;
+    struct sched_unit  *sched_unit_idle;
+    struct sched_unit  *prev;
     void               *sched_priv;
     struct timer        s_timer;        /* scheduling timer                */
 
@@ -180,7 +182,7 @@  static inline void sched_clear_pause_flags_atomic(struct sched_unit *unit,
 
 static inline struct sched_unit *sched_idle_unit(unsigned int cpu)
 {
-    return idle_vcpu[cpu]->sched_unit;
+    return get_sched_res(cpu)->sched_unit_idle;
 }
 
 static inline unsigned int sched_get_resource_cpu(unsigned int cpu)
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 5b805eac58..144d353447 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -929,6 +929,7 @@  void restore_vcpu_affinity(struct domain *d);
 
 void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate);
 uint64_t get_cpu_idle_time(unsigned int cpu);
+void sched_guest_idle(void (*idle) (void), unsigned int cpu);
 
 /*
  * Used by idle loop to decide whether there is work to do: