[2/5] xen: credit2: never consider CPUs outside of our cpupool.

Message ID 148467400670.27920.10444838852821010432.stgit@Solace.fritz.box (mailing list archive)
State New, archived

Commit Message

Dario Faggioli Jan. 17, 2017, 5:26 p.m. UTC
In fact, relying on the mask of what pCPUs belong to
which Credit2 runqueue is not enough. If we only do that,
when Credit2 is the boot scheduler, we may ASSERT() or
panic when moving a pCPU from Pool-0 to another cpupool.

This is because pCPUs outside of any pool are considered
part of cpupool0. This puts us at risk of a crash when
those same pCPUs are added to another pool and something
other than the idle domain is found to be running on them.

Note that, even if we prevent the above from happening
(which is the purpose of this patch), things are still
pretty bad. In fact, when we remove a pCPU from Pool-0:
- in Credit1, we do *not* update prv->ncpus and
  prv->credit, which means we're considering the wrong
  total credits when doing accounting;
- in Credit2, the pCPU remains part of one runqueue,
  and is hence at least considered during load balancing,
  even if no vCPU should really run there.

In Credit1, this "only" causes skewed accounting and
no crashes because there is a lot of `cpumask_and`ing
going on with the cpumask of the domains' cpupool
(which, BTW, comes at a price).

A quick and not too involved (and easily backportable)
solution for Credit2 is to do exactly the same.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Jan Beulich <jbeulich@suse.com>
---
This is a bugfix, and should be backported to 4.8.
---
The proper solution would mean calling deinit_pdata() when removing a pCPU from
cpupool0, as well as a bit more code reshuffling.

And, although worth doing, it certainly will take more work, more time, and
will probably be hard (well, surely harder than this) to backport.

Therefore, I'd argue in favor of both taking and backporting this change, which
at least enables using Credit2 as default scheduler without risking a crash
when creating a second cpupool.

Afterwards, a proper solution would be proposed for Xen 4.9.

Finally, given the wide number of issues similar to this that I've found and
fixed in the last release cycle, I think it would be good to take a stab at
whether the interface between cpupools and the schedulers could not be
simplified. :-O

Regards,
Dario
---
 xen/common/sched_credit2.c |   59 ++++++++++++++++++++++++++++----------------
 1 file changed, 38 insertions(+), 21 deletions(-)
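
For readers skimming the thread: the core of the fix (full patch at the end of this page) is to AND a vCPU's hard affinity with its cpupool's CPU mask before every CPU-selection decision, so a pCPU that has left the pool can never be picked. The toy model below is not Xen code; all names in it are invented for the sketch, and it only illustrates why that extra AND matters.

    /* Toy model, not Xen code: one bit per pCPU. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t cpumask_t;

    /*
     * Pick a fallback CPU for a vCPU: only CPUs that are both in the
     * vCPU's hard affinity *and* in its cpupool qualify (this AND is
     * what the patch adds via cpumask_and()/cpupool_domain_cpumask()).
     */
    static int pick_fallback_cpu(cpumask_t hard_affinity, cpumask_t pool_cpus)
    {
        cpumask_t ok = hard_affinity & pool_cpus;

        return ok ? __builtin_ctzll(ok) : -1; /* first allowed CPU, or "none" */
    }

    int main(void)
    {
        cpumask_t affinity = 0x0F; /* vCPU allowed on CPUs 0-3 */
        cpumask_t pool     = 0x03; /* but its cpupool now only owns CPUs 0-1 */

        /* Without the AND, CPU 2 or 3 (now outside the pool) could be picked. */
        printf("fallback cpu = %d\n", pick_fallback_cpu(affinity, pool));
        return 0;
    }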

Comments

Jürgen Groß Jan. 19, 2017, 8:08 a.m. UTC | #1
On 17/01/17 18:26, Dario Faggioli wrote:
> In fact, relying on the mask of what pCPUs belong to
> which Credit2 runqueue is not enough. If we only do that,
> when Credit2 is the boot scheduler, we may ASSERT() or
> panic when moving a pCPU from Pool-0 to another cpupool.
> 
> This is because pCPUs outside of any pool are considered
> part of cpupool0. This puts us at risk of crash when those
> same pCPUs are added to another pool and something
> different than the idle domain is found to be running
> on them.
> 
> Note that, even if we prevent the above to happen (which
> is the purpose of this patch), this is still pretty bad,
> in fact, when we remove a pCPU from Pool-0:
> - in Credit1, as we do *not* update prv->ncpus and
>   prv->credit, which means we're considering the wrong
>   total credits when doing accounting;
> - in Credit2, the pCPU remains part of one runqueue,
>   and is hence at least considered during load balancing,
>   even if no vCPU should really run there.
> 
> In Credit1, this "only" causes skewed accounting and
> no crashes because there is a lot of `cpumask_and`ing
> going on with the cpumask of the domains' cpupool
> (which, BTW, comes at a price).
> 
> A quick and not to involved (and easily backportable)
> solution for Credit2, is to do exactly the same.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> ---
> This is a bugfix, and should be backported to 4.8.
> ---
> The proper solution would mean calling deinit_pdata() when removing a pCPU from
> cpupool0, as well as a bit more of code reshuffling.
> 
> And, although worth doing, it certainly will take more work, more time, and
> will probably be hard (well, surely harder than this) to backport.
> 
> Therefore, I'd argue in favor of both taking and backporting this change, which
> at least enables using Credit2 as default scheduler without risking a crash
> when creating a second cpupool.
> 
> Afterwards, a proper solution would be proposed for Xen 4.9.
> 
> Finally, given the wide number of issues similar to this that I've found and
> fixed in the last release cycle, I think it would be good to take a stab at
> whether the interface between cpupools and the schedulers could not be
> simplified. :-O

The first patch version for support of cpupools I sent used an own
scheduler instance for the free cpus. Keir merged this instance with
the one for Pool-0.

I think it might be a good idea to check whether some of the problems
are going away with the free cpu scheduler idea again. Maybe it would
even be a good idea to add a "null-scheduler" for this purpose
supporting only one vcpu per pcpu. This scheduler could be used for
high performance domains, too.
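
A rough, hypothetical sketch of the "null scheduler" policy described above (this is not Xen's scheduler-ops interface; every name below is made up for illustration): each pCPU gets at most one vCPU statically assigned, so the scheduling decision degenerates to returning that vCPU, or the idle vCPU when the slot is empty.

    /* Hypothetical sketch of a "null" policy; not Xen's sched_ops API. */
    #include <stddef.h>

    #define NR_PCPUS 8

    struct vcpu;                              /* opaque for this sketch */

    static struct vcpu *assigned[NR_PCPUS];   /* at most one vCPU per pCPU */
    static struct vcpu *idle_vcpu[NR_PCPUS];  /* per-pCPU idle vCPU */

    /* Bind a vCPU to the first free pCPU; -1 if every pCPU is taken. */
    static int null_vcpu_insert(struct vcpu *v)
    {
        for ( int cpu = 0; cpu < NR_PCPUS; cpu++ )
            if ( assigned[cpu] == NULL )
            {
                assigned[cpu] = v;
                return cpu;
            }
        return -1;
    }

    /* The whole scheduling decision: no runqueues, no load balancing. */
    static struct vcpu *null_schedule(int cpu)
    {
        return assigned[cpu] ? assigned[cpu] : idle_vcpu[cpu];
    }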


Juergen
Dario Faggioli Jan. 19, 2017, 8:22 a.m. UTC | #2
[Adding Anshul, who also had to deal with something similar to this
 recently]

On Thu, 2017-01-19 at 09:08 +0100, Juergen Gross wrote:
> On 17/01/17 18:26, Dario Faggioli wrote:
> > Finally, given the wide number of issues similar to this that I've
> > found and
> > fixed in the last release cycle, I think it would be good to take a
> > stab at
> > whether the interface between cpupools and the schedulers could not
> > be
> > simplified. :-O
> 
> The first patch version for support of cpupools I sent used an own
> scheduler instance for the free cpus. Keir merged this instance with
> the one for Pool-0.
> 
Really? Because I more than once thought that something like that would
be a good idea. I even started to implement it once... I then dropped
it, because I was able to deal with that problem in another way, and
had more pressing priorities, but I always liked the concept.

> I think it might be a good idea to check whether some of the problems
> are going away with the free cpu scheduler idea again. 
>
I think some of them will.

AFAICT, off the top of my head, that alone won't avoid having to
modify when and how a couple of scheduling hooks are called (the pCPU
data allocation and initialization ones), but most likely will simplify
making that change.

> Maybe it would
> even be a good idea to add a "null-scheduler" for this purpose
> supporting only one vcpu per pcpu. This scheduler could be used for
> high performance domains, too.
> 
Oh, I guess it can. At first, I was actually planning for such
scheduler to always and only select the appropriate idle vCPU... But it
can be extended to do something different, if wanted.

Let's see if others want to chime in. If not, I'll dust off the half-
baked patches I have for this and see if I can make them work.

Dario
George Dunlap Jan. 23, 2017, 2:40 p.m. UTC | #3
On Thu, Jan 19, 2017 at 8:08 AM, Juergen Gross <jgross@suse.com> wrote:
> On 17/01/17 18:26, Dario Faggioli wrote:
>> In fact, relying on the mask of what pCPUs belong to
>> which Credit2 runqueue is not enough. If we only do that,
>> when Credit2 is the boot scheduler, we may ASSERT() or
>> panic when moving a pCPU from Pool-0 to another cpupool.
>>
>> This is because pCPUs outside of any pool are considered
>> part of cpupool0. This puts us at risk of crash when those
>> same pCPUs are added to another pool and something
>> different than the idle domain is found to be running
>> on them.
>>
>> Note that, even if we prevent the above to happen (which
>> is the purpose of this patch), this is still pretty bad,
>> in fact, when we remove a pCPU from Pool-0:
>> - in Credit1, as we do *not* update prv->ncpus and
>>   prv->credit, which means we're considering the wrong
>>   total credits when doing accounting;
>> - in Credit2, the pCPU remains part of one runqueue,
>>   and is hence at least considered during load balancing,
>>   even if no vCPU should really run there.
>>
>> In Credit1, this "only" causes skewed accounting and
>> no crashes because there is a lot of `cpumask_and`ing
>> going on with the cpumask of the domains' cpupool
>> (which, BTW, comes at a price).
>>
>> A quick and not to involved (and easily backportable)
>> solution for Credit2, is to do exactly the same.
>>
>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com
>> ---
>> Cc: George Dunlap <george.dunlap@eu.citrix.com>
>> Cc: Juergen Gross <jgross@suse.com>
>> Cc: Jan Beulich <jbeulich@suse.com>
>> ---
>> This is a bugfix, and should be backported to 4.8.
>> ---
>> The proper solution would mean calling deinit_pdata() when removing a pCPU from
>> cpupool0, as well as a bit more of code reshuffling.
>>
>> And, although worth doing, it certainly will take more work, more time, and
>> will probably be hard (well, surely harder than this) to backport.
>>
>> Therefore, I'd argue in favor of both taking and backporting this change, which
>> at least enables using Credit2 as default scheduler without risking a crash
>> when creating a second cpupool.
>>
>> Afterwards, a proper solution would be proposed for Xen 4.9.
>>
>> Finally, given the wide number of issues similar to this that I've found and
>> fixed in the last release cycle, I think it would be good to take a stab at
>> whether the interface between cpupools and the schedulers could not be
>> simplified. :-O
>
> The first patch version for support of cpupools I sent used an own
> scheduler instance for the free cpus. Keir merged this instance with
> the one for Pool-0.

You mean he just made the change unilaterally without posting it to
the list?  I just went back and skimmed through the old cpupools
threads and didn't see this discussed.

Having a "cpupool-remove" operation that doesn't actually remove the
cpu from the pool is a bit mad...

 -George
George Dunlap Jan. 23, 2017, 3:20 p.m. UTC | #4
On Tue, Jan 17, 2017 at 5:26 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> In fact, relying on the mask of what pCPUs belong to
> which Credit2 runqueue is not enough. If we only do that,
> when Credit2 is the boot scheduler, we may ASSERT() or
> panic when moving a pCPU from Pool-0 to another cpupool.
>
> This is because pCPUs outside of any pool are considered
> part of cpupool0. This puts us at risk of crash when those
> same pCPUs are added to another pool and something
> different than the idle domain is found to be running
> on them.
>
> Note that, even if we prevent the above to happen (which
> is the purpose of this patch), this is still pretty bad,
> in fact, when we remove a pCPU from Pool-0:
> - in Credit1, as we do *not* update prv->ncpus and
>   prv->credit, which means we're considering the wrong
>   total credits when doing accounting;
> - in Credit2, the pCPU remains part of one runqueue,
>   and is hence at least considered during load balancing,
>   even if no vCPU should really run there.
>
> In Credit1, this "only" causes skewed accounting and
> no crashes because there is a lot of `cpumask_and`ing
> going on with the cpumask of the domains' cpupool
> (which, BTW, comes at a price).
>
> A quick and not to involved (and easily backportable)
> solution for Credit2, is to do exactly the same.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com

Blech.  But I agree we need a fix we can backport:

Acked-by: George Dunlap <george.dunlap@citrix.com>

> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> ---
> This is a bugfix, and should be backported to 4.8.
> ---
> The proper solution would mean calling deinit_pdata() when removing a pCPU from
> cpupool0, as well as a bit more of code reshuffling.
>
> And, although worth doing, it certainly will take more work, more time, and
> will probably be hard (well, surely harder than this) to backport.
>
> Therefore, I'd argue in favor of both taking and backporting this change, which
> at least enables using Credit2 as default scheduler without risking a crash
> when creating a second cpupool.
>
> Afterwards, a proper solution would be proposed for Xen 4.9.
>
> Finally, given the wide number of issues similar to this that I've found and
> fixed in the last release cycle, I think it would be good to take a stab at
> whether the interface between cpupools and the schedulers could not be
> simplified. :-O
>
> Regards,
> Dario
> ---
>  xen/common/sched_credit2.c |   59 ++++++++++++++++++++++++++++----------------
>  1 file changed, 38 insertions(+), 21 deletions(-)
>
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 523922e..ce0e146 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -510,19 +510,22 @@ void smt_idle_mask_clear(unsigned int cpu, cpumask_t *mask)
>   */
>  static int get_fallback_cpu(struct csched2_vcpu *svc)
>  {
> -    int fallback_cpu, cpu = svc->vcpu->processor;
> +    struct vcpu *v = svc->vcpu;
> +    int cpu = v->processor;
>
> -    if ( likely(cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity)) )
> -        return cpu;
> +    cpumask_and(cpumask_scratch_cpu(cpu), v->cpu_hard_affinity,
> +                cpupool_domain_cpumask(v->domain));
>
> -    cpumask_and(cpumask_scratch_cpu(cpu), svc->vcpu->cpu_hard_affinity,
> -                &svc->rqd->active);
> -    fallback_cpu = cpumask_first(cpumask_scratch_cpu(cpu));
> -    if ( likely(fallback_cpu < nr_cpu_ids) )
> -        return fallback_cpu;
> +    if ( likely(cpumask_test_cpu(cpu, cpumask_scratch_cpu(cpu))) )
> +        return cpu;
>
> -    cpumask_and(cpumask_scratch, svc->vcpu->cpu_hard_affinity,
> -                cpupool_domain_cpumask(svc->vcpu->domain));
> +    if ( likely(cpumask_intersects(cpumask_scratch_cpu(cpu),
> +                                   &svc->rqd->active)) )
> +    {
> +        cpumask_and(cpumask_scratch_cpu(cpu), &svc->rqd->active,
> +                    cpumask_scratch_cpu(cpu));
> +        return cpumask_first(cpumask_scratch_cpu(cpu));
> +    }
>
>      ASSERT(!cpumask_empty(cpumask_scratch_cpu(cpu)));
>
> @@ -940,6 +943,9 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
>                      (unsigned char *)&d);
>      }
>
> +    cpumask_and(cpumask_scratch_cpu(cpu), new->vcpu->cpu_hard_affinity,
> +                cpupool_domain_cpumask(new->vcpu->domain));
> +
>      /*
>       * First of all, consider idle cpus, checking if we can just
>       * re-use the pcpu where we were running before.
> @@ -952,7 +958,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
>          cpumask_andnot(&mask, &rqd->idle, &rqd->smt_idle);
>      else
>          cpumask_copy(&mask, &rqd->smt_idle);
> -    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
> +    cpumask_and(&mask, &mask, cpumask_scratch_cpu(cpu));
>      i = cpumask_test_or_cycle(cpu, &mask);
>      if ( i < nr_cpu_ids )
>      {
> @@ -967,7 +973,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
>       * gone through the scheduler yet.
>       */
>      cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
> -    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
> +    cpumask_and(&mask, &mask, cpumask_scratch_cpu(cpu));
>      i = cpumask_test_or_cycle(cpu, &mask);
>      if ( i < nr_cpu_ids )
>      {
> @@ -983,7 +989,7 @@ runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
>       */
>      cpumask_andnot(&mask, &rqd->active, &rqd->idle);
>      cpumask_andnot(&mask, &mask, &rqd->tickled);
> -    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
> +    cpumask_and(&mask, &mask, cpumask_scratch_cpu(cpu));
>      if ( cpumask_test_cpu(cpu, &mask) )
>      {
>          cur = CSCHED2_VCPU(curr_on_cpu(cpu));
> @@ -1525,6 +1531,9 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
>          goto out;
>      }
>
> +    cpumask_and(cpumask_scratch_cpu(cpu), vc->cpu_hard_affinity,
> +                cpupool_domain_cpumask(vc->domain));
> +
>      /*
>       * First check to see if we're here because someone else suggested a place
>       * for us to move.
> @@ -1536,13 +1545,13 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
>              printk(XENLOG_WARNING "%s: target runqueue disappeared!\n",
>                     __func__);
>          }
> -        else
> +        else if ( cpumask_intersects(cpumask_scratch_cpu(cpu),
> +                                     &svc->migrate_rqd->active) )
>          {
> -            cpumask_and(cpumask_scratch_cpu(cpu), vc->cpu_hard_affinity,
> +            cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
>                          &svc->migrate_rqd->active);
>              new_cpu = cpumask_any(cpumask_scratch_cpu(cpu));
> -            if ( new_cpu < nr_cpu_ids )
> -                goto out_up;
> +            goto out_up;
>          }
>          /* Fall-through to normal cpu pick */
>      }
> @@ -1570,12 +1579,12 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
>           */
>          if ( rqd == svc->rqd )
>          {
> -            if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
> +            if ( cpumask_intersects(cpumask_scratch_cpu(cpu), &rqd->active) )
>                  rqd_avgload = max_t(s_time_t, rqd->b_avgload - svc->avgload, 0);
>          }
>          else if ( spin_trylock(&rqd->lock) )
>          {
> -            if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
> +            if ( cpumask_intersects(cpumask_scratch_cpu(cpu), &rqd->active) )
>                  rqd_avgload = rqd->b_avgload;
>
>              spin_unlock(&rqd->lock);
> @@ -1597,7 +1606,7 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
>          goto out_up;
>      }
>
> -    cpumask_and(cpumask_scratch_cpu(cpu), vc->cpu_hard_affinity,
> +    cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
>                  &prv->rqd[min_rqi].active);
>      new_cpu = cpumask_any(cpumask_scratch_cpu(cpu));
>      BUG_ON(new_cpu >= nr_cpu_ids);
> @@ -1713,6 +1722,8 @@ static void migrate(const struct scheduler *ops,
>          __runq_deassign(svc);
>
>          cpumask_and(cpumask_scratch_cpu(cpu), svc->vcpu->cpu_hard_affinity,
> +                    cpupool_domain_cpumask(svc->vcpu->domain));
> +        cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
>                      &trqd->active);
>          svc->vcpu->processor = cpumask_any(cpumask_scratch_cpu(cpu));
>          ASSERT(svc->vcpu->processor < nr_cpu_ids);
> @@ -1738,8 +1749,14 @@ static void migrate(const struct scheduler *ops,
>  static bool_t vcpu_is_migrateable(struct csched2_vcpu *svc,
>                                    struct csched2_runqueue_data *rqd)
>  {
> +    struct vcpu *v = svc->vcpu;
> +    int cpu = svc->vcpu->processor;
> +
> +    cpumask_and(cpumask_scratch_cpu(cpu), v->cpu_hard_affinity,
> +                cpupool_domain_cpumask(v->domain));
> +
>      return !(svc->flags & CSFLAG_runq_migrate_request) &&
> -           cpumask_intersects(svc->vcpu->cpu_hard_affinity, &rqd->active);
> +           cpumask_intersects(cpumask_scratch_cpu(cpu), &rqd->active);
>  }
>
>  static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
Jürgen Groß Jan. 24, 2017, 12:35 p.m. UTC | #5
On 23/01/17 15:40, George Dunlap wrote:
> On Thu, Jan 19, 2017 at 8:08 AM, Juergen Gross <jgross@suse.com> wrote:
>> On 17/01/17 18:26, Dario Faggioli wrote:
>>> In fact, relying on the mask of what pCPUs belong to
>>> which Credit2 runqueue is not enough. If we only do that,
>>> when Credit2 is the boot scheduler, we may ASSERT() or
>>> panic when moving a pCPU from Pool-0 to another cpupool.
>>>
>>> This is because pCPUs outside of any pool are considered
>>> part of cpupool0. This puts us at risk of crash when those
>>> same pCPUs are added to another pool and something
>>> different than the idle domain is found to be running
>>> on them.
>>>
>>> Note that, even if we prevent the above to happen (which
>>> is the purpose of this patch), this is still pretty bad,
>>> in fact, when we remove a pCPU from Pool-0:
>>> - in Credit1, as we do *not* update prv->ncpus and
>>>   prv->credit, which means we're considering the wrong
>>>   total credits when doing accounting;
>>> - in Credit2, the pCPU remains part of one runqueue,
>>>   and is hence at least considered during load balancing,
>>>   even if no vCPU should really run there.
>>>
>>> In Credit1, this "only" causes skewed accounting and
>>> no crashes because there is a lot of `cpumask_and`ing
>>> going on with the cpumask of the domains' cpupool
>>> (which, BTW, comes at a price).
>>>
>>> A quick and not to involved (and easily backportable)
>>> solution for Credit2, is to do exactly the same.
>>>
>>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com
>>> ---
>>> Cc: George Dunlap <george.dunlap@eu.citrix.com>
>>> Cc: Juergen Gross <jgross@suse.com>
>>> Cc: Jan Beulich <jbeulich@suse.com>
>>> ---
>>> This is a bugfix, and should be backported to 4.8.
>>> ---
>>> The proper solution would mean calling deinit_pdata() when removing a pCPU from
>>> cpupool0, as well as a bit more of code reshuffling.
>>>
>>> And, although worth doing, it certainly will take more work, more time, and
>>> will probably be hard (well, surely harder than this) to backport.
>>>
>>> Therefore, I'd argue in favor of both taking and backporting this change, which
>>> at least enables using Credit2 as default scheduler without risking a crash
>>> when creating a second cpupool.
>>>
>>> Afterwards, a proper solution would be proposed for Xen 4.9.
>>>
>>> Finally, given the wide number of issues similar to this that I've found and
>>> fixed in the last release cycle, I think it would be good to take a stab at
>>> whether the interface between cpupools and the schedulers could not be
>>> simplified. :-O
>>
>> The first patch version for support of cpupools I sent used an own
>> scheduler instance for the free cpus. Keir merged this instance with
>> the one for Pool-0.
> 
> You mean he just made the change unilaterally without posting it to
> the list?  I just went back and skimmed through the old cpupools
> threads and didn't see this discussed.

I'm not sure I remember everything correctly. Unfortunately I don't
have a copy of all emails back from then as this work was done for
another employer.

Looking into the archives it seems the switch was done already in the
final version I sent to the list. I believe Keir did some testing based
on my first series and suggested some modifications which I happily
accepted.

> Having a "cpupool-remove" operation that doesn't actually remove the
> cpu from the pool is a bit mad...

Logically it does remove the cpu from Pool-0. It is just that there is
no other scheduler entity involved for doing so.


Juergen
Dario Faggioli Jan. 24, 2017, 12:49 p.m. UTC | #6
On Tue, 2017-01-24 at 13:35 +0100, Juergen Gross wrote:
> On 23/01/17 15:40, George Dunlap wrote:
> > 
> > Having a "cpupool-remove" operation that doesn't actually remove
> > the
> > cpu from the pool is a bit mad...
> 
> Logically it does remove the cpu from Pool-0. It is just that there
> is
> no other scheduler entity involved for doing so.
> 
Yes, it's removed from the pool, *but* it basically remains part of the
pool's scheduler; that's the issue.

As you say, there's no other scheduler to which we can attach it...
So, as of now, I see two options:
1) create such (dummy) scheduler;
2) go all the way down toward deallocating the scheduler-related data
   of the cpu (and reallocate them when it is re-added).

BTW, Anshul also hit this problem while also doing some work on
Credit2, and told me he'd be giving some thinking to it, and try to
figure some ideas out (as soon as he'd be free of a couple of other
burdens :-D).

Let's see what he'll come up with. :-)

Regards,
Dario
George Dunlap Jan. 24, 2017, 4:37 p.m. UTC | #7
> On Jan 24, 2017, at 12:49 PM, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> 
> On Tue, 2017-01-24 at 13:35 +0100, Juergen Gross wrote:
>> On 23/01/17 15:40, George Dunlap wrote:
>>>  
>>> Having a "cpupool-remove" operation that doesn't actually remove
>>> the
>>> cpu from the pool is a bit mad...
>> 
>> Logically it does remove the cpu from Pool-0. It is just that there
>> is
>> no other scheduler entity involved for doing so.
>> 
> Yes, it's removed from the pool, *but* it basically remains part of the
> pool's scheduler, that's the issue.

And in particular for credit2, it remains in the runqueue cpu mask (which is what this patch is mostly about).

 -George
Jan Beulich Feb. 3, 2017, 8:41 a.m. UTC | #8
>>> On 17.01.17 at 18:26, <dario.faggioli@citrix.com> wrote:
> In fact, relying on the mask of what pCPUs belong to
> which Credit2 runqueue is not enough. If we only do that,
> when Credit2 is the boot scheduler, we may ASSERT() or
> panic when moving a pCPU from Pool-0 to another cpupool.
> 
> This is because pCPUs outside of any pool are considered
> part of cpupool0. This puts us at risk of crash when those
> same pCPUs are added to another pool and something
> different than the idle domain is found to be running
> on them.
> 
> Note that, even if we prevent the above to happen (which
> is the purpose of this patch), this is still pretty bad,
> in fact, when we remove a pCPU from Pool-0:
> - in Credit1, as we do *not* update prv->ncpus and
>   prv->credit, which means we're considering the wrong
>   total credits when doing accounting;
> - in Credit2, the pCPU remains part of one runqueue,
>   and is hence at least considered during load balancing,
>   even if no vCPU should really run there.
> 
> In Credit1, this "only" causes skewed accounting and
> no crashes because there is a lot of `cpumask_and`ing
> going on with the cpumask of the domains' cpupool
> (which, BTW, comes at a price).
> 
> A quick and not to involved (and easily backportable)
> solution for Credit2, is to do exactly the same.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com 
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> ---
> This is a bugfix, and should be backported to 4.8.

While I did manage to backport "xen: credit2: use the correct scratch
cpumask" and "xen: credit2: fix shutdown/suspend when playing with
cpupools" also to 4.7, this one is even worse for me to deal with
(purely by its subject I'd assume it's applicable): Its first hunk applies
fine, but then everything breaks.

Jan
Dario Faggioli Feb. 3, 2017, 3:27 p.m. UTC | #9
On Fri, 2017-02-03 at 01:41 -0700, Jan Beulich wrote:
> > > > On 17.01.17 at 18:26, <dario.faggioli@citrix.com> wrote:
> > ---
> > This is a bugfix, and should be backported to 4.8.
> 
> While I did manage to backport "xen: credit2: use the correct scratch
> cpumask" and "xen: credit2: fix shutdown/suspend when playing with
> cpupools" also to 4.7, this one is even worse for me to deal with
> (purely by its subject I'd assume it's applicable): Its first hunk
> applies
> fine, but then everything breaks.
> 
I see. Sorry for that.

Is it ok if I give it a try myself and, if I manage, send you the
backported patch?

Regards,
Dario
Jan Beulich Feb. 3, 2017, 3:40 p.m. UTC | #10
>>> On 03.02.17 at 16:27, <dario.faggioli@citrix.com> wrote:
> On Fri, 2017-02-03 at 01:41 -0700, Jan Beulich wrote:
>> > > > On 17.01.17 at 18:26, <dario.faggioli@citrix.com> wrote:
>> > ---
>> > This is a bugfix, and should be backported to 4.8.
>> 
>> While I did manage to backport "xen: credit2: use the correct scratch
>> cpumask" and "xen: credit2: fix shutdown/suspend when playing with
>> cpupools" also to 4.7, this one is even worse for me to deal with
>> (purely by its subject I'd assume it's applicable): Its first hunk
>> applies
>> fine, but then everything breaks.
>> 
> I see. Sorry for that.
> 
> Is it ok if I give it a try myself and, if I manage, send you the
> backported patch?

Yes, I should probably have phrased my earlier reply that way.

Jan
Dario Faggioli Feb. 8, 2017, 4:48 p.m. UTC | #11
On Fri, 2017-02-03 at 08:40 -0700, Jan Beulich wrote:
> > > > > > On 17.01.17 at 18:26, <dario.faggioli@citrix.com> wrote:
> > > While I did manage to backport "xen: credit2: use the correct
> > > scratch
> > > cpumask" and "xen: credit2: fix shutdown/suspend when playing
> > > with
> > > cpupools" also to 4.7, this one is even worse for me to deal with
> > > (purely by its subject I'd assume it's applicable): Its first
> > > hunk
> > > applies
> > > fine, but then everything breaks.
> > > 
> > I see. Sorry for that.
> > 
> > Is it ok if I give it a try myself and, if I manage, send you the
> > backported patch?
> 
> Yes, I should probably have phrased my earlier reply that way.
> 
Well, I only wanted to double check whether you were meaning "please
provide backport" or "let's give up". :-)

But now that I'm looking into this, I need to double check something
again. So, as far as I've understood, you want the following 3 patches
to be backported to both 4.8 and 4.7:

 xen: credit2: use the correct scratch cpumask
 xen: credit2: fix shutdown/suspend when playing with cpupools
 xen: credit2: never consider CPUs outside of our cpupool

And you're saying you already have done some of them, but I don't seem
to be able to find _any_ of them in any of the following branches:

  staging-4.8 http://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/staging-4.8
  stable-4.8 http://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/stable-4.8
  staging-4.7 http://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/staging-4.7
  stable-4.7 http://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/stable-4.7

Am I looking in the wrong places / misunderstood anything?

Thanks and Regards,
Dario
Jan Beulich Feb. 8, 2017, 5:02 p.m. UTC | #12
>>> On 08.02.17 at 17:48, <dario.faggioli@citrix.com> wrote:
> On Fri, 2017-02-03 at 08:40 -0700, Jan Beulich wrote:
>> > > > > > On 17.01.17 at 18:26, <dario.faggioli@citrix.com> wrote:
>> > > While I did manage to backport "xen: credit2: use the correct
>> > > scratch
>> > > cpumask" and "xen: credit2: fix shutdown/suspend when playing
>> > > with
>> > > cpupools" also to 4.7, this one is even worse for me to deal with
>> > > (purely by its subject I'd assume it's applicable): Its first
>> > > hunk
>> > > applies
>> > > fine, but then everything breaks.
>> > > 
>> > I see. Sorry for that.
>> > 
>> > Is it ok if I give it a try myself and, if I manage, send you the
>> > backported patch?
>> 
>> Yes, I should probably have phrased my earlier reply that way.
>> 
> Well, I only wanted to double check whether you were meaning "please
> provide backport" or "let's give up". :-)
> 
> But now that I'm looking into this, I need to double check something
> again. So, as far as I've understood, you want the following 3 patches
> to be backported to both 4.8 and 4.7:
> 
>  xen: credit2: use the correct scratch cpumask
>  xen: credit2: fix shutdown/suspend when playing with cpupools
>  xen: credit2: never consider CPUs outside of our cpupool
> 
> And you're saying you already have done some of them, but I don't seem
> to be able to find _any_ of them in any of the following branches:
> 
>   staging-4.8 
> http://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/staging-4.8 
>   stable-4.8 
> http://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/stable-4.8 
>   staging-4.7 
> http://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/staging-4.7 
>   stable-4.7 
> http://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/stable-4.7 
> 
> Am I looking in the wrong places / misunderstood anything?

Well, me having done the backports didn't imply me committing them
right away; I wanted them to first undergo some testing. I hope to
get to that tomorrow. If you want to wait for what I have to appear
there, that's of course fine.

Jan
Dario Faggioli Feb. 8, 2017, 6:55 p.m. UTC | #13
On Wed, 2017-02-08 at 10:02 -0700, Jan Beulich wrote:
> > > > On 08.02.17 at 17:48, <dario.faggioli@citrix.com> wrote:
> > Am I looking in the wrong places / misunderstood anything?
> 
> Well, me having done the backports didn't imply me committing them
> right away; I wanted them to first undergo some testing. 
>
Ok, right, I see it now. Thanks for clarifying (again).

> I hope to
> get to that tomorrow. If you want to wait for what I have to appear
> there, that's of course fine.
> 
So, the patches apply cleanly to 4.8, as you've probably seen already.

As per 4.7, I've prepared a branch for you:

 git://xenbits.xen.org/people/dariof/xen.git for-jan/staging-4.7/credit2-cpupool-fixes
 http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/for-jan/staging-4.7/credit2-cpupool-fixes
 https://travis-ci.org/fdario/xen/builds/199709757

I've only quickly tested it so far, and I have to leave now. I'll play
with it a bit more tomorrow morning and let you know how it goes.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Jan Beulich Feb. 9, 2017, 9:17 a.m. UTC | #14
>>> On 08.02.17 at 19:55, <dario.faggioli@citrix.com> wrote:
> As per 4.7, I've prepared a branch for you:

Thanks.

>  git://xenbits.xen.org/people/dariof/xen.git 
> for-jan/staging-4.7/credit2-cpupool-fixes
>  
> http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/head 
> s/for-jan/staging-4.7/credit2-cpupool-fixes
>  https://travis-ci.org/fdario/xen/builds/199709757 
> 
> I've only quickly tested it so far, and I have to leave now. I'll play
> with it a bit more tomorrow morning and let you know how it goes.

Well, looking at the topmost commit, you've clearly missed the
follow-up fix (ad5808d905), which I did fold into that backport
right away. I'm going to commit what I have later today, and
I'll pull in the one extra backport next time round.

Jan
Dario Faggioli Feb. 9, 2017, 9:25 a.m. UTC | #15
On Thu, 2017-02-09 at 02:17 -0700, Jan Beulich wrote:
> > > > On 08.02.17 at 19:55, <dario.faggioli@citrix.com> wrote:
> > I've only quickly tested it so far, and I have to leave now. I'll
> > play
> > with it a bit more tomorrow morning and let you know how it goes.
> 
> Well, looking at the topmost commit, you've clearly missed the
> follow-up fix (ad5808d905), which I did fold into that backport
> right away. 
>
Yes, in fact, I've just added it now (will re-push to the branch after
testing).

> I'm going to commit what I have later today, and
> I'll pull in the one extra backport next time round.
> 
Ok, sure.

Dario

Patch

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 523922e..ce0e146 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -510,19 +510,22 @@  void smt_idle_mask_clear(unsigned int cpu, cpumask_t *mask)
  */
 static int get_fallback_cpu(struct csched2_vcpu *svc)
 {
-    int fallback_cpu, cpu = svc->vcpu->processor;
+    struct vcpu *v = svc->vcpu;
+    int cpu = v->processor;
 
-    if ( likely(cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity)) )
-        return cpu;
+    cpumask_and(cpumask_scratch_cpu(cpu), v->cpu_hard_affinity,
+                cpupool_domain_cpumask(v->domain));
 
-    cpumask_and(cpumask_scratch_cpu(cpu), svc->vcpu->cpu_hard_affinity,
-                &svc->rqd->active);
-    fallback_cpu = cpumask_first(cpumask_scratch_cpu(cpu));
-    if ( likely(fallback_cpu < nr_cpu_ids) )
-        return fallback_cpu;
+    if ( likely(cpumask_test_cpu(cpu, cpumask_scratch_cpu(cpu))) )
+        return cpu;
 
-    cpumask_and(cpumask_scratch, svc->vcpu->cpu_hard_affinity,
-                cpupool_domain_cpumask(svc->vcpu->domain));
+    if ( likely(cpumask_intersects(cpumask_scratch_cpu(cpu),
+                                   &svc->rqd->active)) )
+    {
+        cpumask_and(cpumask_scratch_cpu(cpu), &svc->rqd->active,
+                    cpumask_scratch_cpu(cpu));
+        return cpumask_first(cpumask_scratch_cpu(cpu));
+    }
 
     ASSERT(!cpumask_empty(cpumask_scratch_cpu(cpu)));
 
@@ -940,6 +943,9 @@  runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
                     (unsigned char *)&d);
     }
 
+    cpumask_and(cpumask_scratch_cpu(cpu), new->vcpu->cpu_hard_affinity,
+                cpupool_domain_cpumask(new->vcpu->domain));
+
     /*
      * First of all, consider idle cpus, checking if we can just
      * re-use the pcpu where we were running before.
@@ -952,7 +958,7 @@  runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
         cpumask_andnot(&mask, &rqd->idle, &rqd->smt_idle);
     else
         cpumask_copy(&mask, &rqd->smt_idle);
-    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
+    cpumask_and(&mask, &mask, cpumask_scratch_cpu(cpu));
     i = cpumask_test_or_cycle(cpu, &mask);
     if ( i < nr_cpu_ids )
     {
@@ -967,7 +973,7 @@  runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
      * gone through the scheduler yet.
      */
     cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
-    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
+    cpumask_and(&mask, &mask, cpumask_scratch_cpu(cpu));
     i = cpumask_test_or_cycle(cpu, &mask);
     if ( i < nr_cpu_ids )
     {
@@ -983,7 +989,7 @@  runq_tickle(const struct scheduler *ops, struct csched2_vcpu *new, s_time_t now)
      */
     cpumask_andnot(&mask, &rqd->active, &rqd->idle);
     cpumask_andnot(&mask, &mask, &rqd->tickled);
-    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
+    cpumask_and(&mask, &mask, cpumask_scratch_cpu(cpu));
     if ( cpumask_test_cpu(cpu, &mask) )
     {
         cur = CSCHED2_VCPU(curr_on_cpu(cpu));
@@ -1525,6 +1531,9 @@  csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
         goto out;
     }
 
+    cpumask_and(cpumask_scratch_cpu(cpu), vc->cpu_hard_affinity,
+                cpupool_domain_cpumask(vc->domain));
+
     /*
      * First check to see if we're here because someone else suggested a place
      * for us to move.
@@ -1536,13 +1545,13 @@  csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
             printk(XENLOG_WARNING "%s: target runqueue disappeared!\n",
                    __func__);
         }
-        else
+        else if ( cpumask_intersects(cpumask_scratch_cpu(cpu),
+                                     &svc->migrate_rqd->active) )
         {
-            cpumask_and(cpumask_scratch_cpu(cpu), vc->cpu_hard_affinity,
+            cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                         &svc->migrate_rqd->active);
             new_cpu = cpumask_any(cpumask_scratch_cpu(cpu));
-            if ( new_cpu < nr_cpu_ids )
-                goto out_up;
+            goto out_up;
         }
         /* Fall-through to normal cpu pick */
     }
@@ -1570,12 +1579,12 @@  csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
          */
         if ( rqd == svc->rqd )
         {
-            if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
+            if ( cpumask_intersects(cpumask_scratch_cpu(cpu), &rqd->active) )
                 rqd_avgload = max_t(s_time_t, rqd->b_avgload - svc->avgload, 0);
         }
         else if ( spin_trylock(&rqd->lock) )
         {
-            if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
+            if ( cpumask_intersects(cpumask_scratch_cpu(cpu), &rqd->active) )
                 rqd_avgload = rqd->b_avgload;
 
             spin_unlock(&rqd->lock);
@@ -1597,7 +1606,7 @@  csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
         goto out_up;
     }
 
-    cpumask_and(cpumask_scratch_cpu(cpu), vc->cpu_hard_affinity,
+    cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                 &prv->rqd[min_rqi].active);
     new_cpu = cpumask_any(cpumask_scratch_cpu(cpu));
     BUG_ON(new_cpu >= nr_cpu_ids);
@@ -1713,6 +1722,8 @@  static void migrate(const struct scheduler *ops,
         __runq_deassign(svc);
 
         cpumask_and(cpumask_scratch_cpu(cpu), svc->vcpu->cpu_hard_affinity,
+                    cpupool_domain_cpumask(svc->vcpu->domain));
+        cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                     &trqd->active);
         svc->vcpu->processor = cpumask_any(cpumask_scratch_cpu(cpu));
         ASSERT(svc->vcpu->processor < nr_cpu_ids);
@@ -1738,8 +1749,14 @@  static void migrate(const struct scheduler *ops,
 static bool_t vcpu_is_migrateable(struct csched2_vcpu *svc,
                                   struct csched2_runqueue_data *rqd)
 {
+    struct vcpu *v = svc->vcpu;
+    int cpu = svc->vcpu->processor;
+
+    cpumask_and(cpumask_scratch_cpu(cpu), v->cpu_hard_affinity,
+                cpupool_domain_cpumask(v->domain));
+
     return !(svc->flags & CSFLAG_runq_migrate_request) &&
-           cpumask_intersects(svc->vcpu->cpu_hard_affinity, &rqd->active);
+           cpumask_intersects(cpumask_scratch_cpu(cpu), &rqd->active);
 }
 
 static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)