[2/2] xen: credit2: limit the max number of CPUs in a runqueue

Message ID 158818179558.24327.11334680191217289878.stgit@Palanthas (mailing list archive)
State Superseded
Series xen: credit2: limit the number of CPUs per runqueue

Commit Message

Dario Faggioli April 29, 2020, 5:36 p.m. UTC
In Credit2 CPUs (can) share runqueues, depending on the topology. For
instance, with per-socket runqueues (the default) all the CPUs that are
part of the same socket share a runqueue.

On platforms with a huge number of CPUs per socket, that could be a
problem. An example is AMD EPYC2 servers, where we can have up to 128
CPUs in a socket.

It is of course possible to define other, still topology-based, runqueue
arrangements (e.g., per-LLC, per-DIE, etc). But that may still result in
runqueues with too many CPUs on other/future platforms.

Therefore, let's set a limit to the max number of CPUs that can share a
Credit2 runqueue. The actual value is configurable (at boot time), the
default being 16. If, for instance, there are more than 16 CPUs in a
socket, they'll be split among two (or more) runqueues.
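
For example, booting Xen with:

    sched_credit2_max_cpus_runqueue=8

(the parameter introduced by this patch) caps every Credit2 runqueue at
8 CPUs, so a 128 CPU EPYC2 socket would end up split across (roughly) 16
runqueues instead of a single one.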

Note: with core scheduling enabled, this parameter sets the max number
of *scheduling resources* that can share a runqueue. Therefore, with
granularity set to core (and assuming 2 threads per core), we will have
at most 16 cores per runqueue, which corresponds to 32 threads. But that
is fine, considering how core scheduling works.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Juergen Gross <jgross@suse.com>
---
 xen/common/sched/cpupool.c |    2 -
 xen/common/sched/credit2.c |  104 ++++++++++++++++++++++++++++++++++++++++++--
 xen/common/sched/private.h |    2 +
 3 files changed, 103 insertions(+), 5 deletions(-)

Comments

Jan Beulich April 30, 2020, 6:45 a.m. UTC | #1
On 29.04.2020 19:36, Dario Faggioli wrote:
> @@ -852,14 +862,61 @@ cpu_runqueue_match(const struct csched2_runqueue_data *rqd, unsigned int cpu)
>             (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu));
>  }
>  
> +/* Additional checks, to avoid separating siblings in different runqueues. */
> +static bool
> +cpu_runqueue_smt_match(const struct csched2_runqueue_data *rqd, unsigned int cpu)
> +{
> +    unsigned int nr_sibl = cpumask_weight(per_cpu(cpu_sibling_mask, cpu));
> +    unsigned int rcpu, nr_smts = 0;
> +
> +    /*
> +     * If we put the CPU in this runqueue, we must be sure that there will
> +     * be enough room for accepting its hyperthread sibling(s) here as well.
> +     */
> +    cpumask_clear(cpumask_scratch_cpu(cpu));
> +    for_each_cpu ( rcpu, &rqd->active )
> +    {
> +        ASSERT(rcpu != cpu);
> +        if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) )
> +        {
> +            /*
> +             * For each CPU already in the runqueue, account for it and for
> +             * its sibling(s), independently from whether such sibling(s) are
> +             * in the runqueue already or not.
> +             *
> +             * Of course, if there are sibling CPUs in the runqueue already,
> +             * only count them once.
> +             */
> +            cpumask_or(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
> +                       per_cpu(cpu_sibling_mask, rcpu));
> +            nr_smts += nr_sibl;

This being common code, is it appropriate to assume all CPUs having
the same number of siblings? Even beyond that, iirc the sibling mask
represents the online or parked siblings, but not offline ones. For
the purpose here, don't you rather care about the full set?

What about HT vs AMD Fam15's CUs? Do you want both to be treated
the same here?

Also could you outline the intentions with this logic in the
description, to be able to match the goal with what gets done?

> +        }
> +    }
> +    /*
> +     * We know that neither the CPU, nor any of its sibling are here,
> +     * or we wouldn't even have entered the function.
> +     */
> +    ASSERT(!cpumask_intersects(cpumask_scratch_cpu(cpu),
> +                               per_cpu(cpu_sibling_mask, cpu)));
> +
> +    /* Try adding CPU and its sibling(s) to the count and check... */
> +    nr_smts += nr_sibl;
> +
> +    if ( nr_smts <= opt_max_cpus_runqueue )
> +        return true;
> +
> +    return false;

Fold these into

    return nr_smts <= opt_max_cpus_runqueue;

?

> @@ -873,11 +930,44 @@ cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu)
>          if ( !rqi_unused && rqd->id > rqi )
>              rqi_unused = true;
>  
> -        if ( cpu_runqueue_match(rqd, cpu) )
> +        /*
> +         * Check whether the CPU should (according to the topology) and also
> +         * can (if we there aren't too many already) go in this runqueue.

Nit: Stray "we"?

> +         */
> +        if ( rqd->refcnt < opt_max_cpus_runqueue &&
> +             cpu_runqueue_match(rqd, cpu) )
>          {
> -            rqd_valid = true;
> -            break;
> +            cpumask_t *siblings = per_cpu(cpu_sibling_mask, cpu);
> +
> +            dprintk(XENLOG_DEBUG, "CPU %d matches runq %d, cpus={%*pbl} (max %d)\n",
> +                    cpu, rqd->id, CPUMASK_PR(&rqd->active),
> +                    opt_max_cpus_runqueue);
> +
> +            /*
> +             * If we're using core (or socket!) scheduling, or we don't have
> +             * hyperthreading, no need to do any further checking.
> +             *
> +             * If no (to both), but our sibling is already in this runqueue,
> +             * then it's also ok for the CPU to stay in this runqueue..

Nit: Stray full 2nd stop?

> +             * Otherwise, do some more checks, to better account for SMT.
> +             */
> +            if ( opt_sched_granularity != SCHED_GRAN_cpu ||
> +                 cpumask_weight(siblings) <= 1 ||
> +                 cpumask_intersects(&rqd->active, siblings) )
> +            {
> +                dprintk(XENLOG_DEBUG, "runq %d selected\n", rqd->id);
> +                rqd_valid = rqd;
> +                break;
> +            }
> +            else if ( cpu_runqueue_smt_match(rqd, cpu) )
> +            {
> +                dprintk(XENLOG_DEBUG, "considering runq %d...\n", rqd->id);
> +                rqd_valid = rqd;
> +            }
>          }
> +	else

Hard tab slipped in.

> @@ -900,6 +990,12 @@ cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu)
>          rqd->pick_bias = cpu;
>          rqd->id = rqi;
>      }
> +    else
> +        rqd = rqd_valid;
> +
> +    printk(XENLOG_INFO "CPU %d (sibling={%*pbl}) will go to runqueue %d with {%*pbl}\n",
> +           cpu, CPUMASK_PR(per_cpu(cpu_sibling_mask, cpu)), rqd->id,
> +           CPUMASK_PR(&rqd->active));

Iirc there's one per-CPU printk() already. On large systems this isn't
very nice, so I'd like to ask that their total number at least not get
further grown. Ideally there would be a less verbose summary after all
CPUs have been brought up at boot, with per-CPU info be logged only
during CPU hot online.

Jan
Jürgen Groß April 30, 2020, 7:35 a.m. UTC | #2
On 29.04.20 19:36, Dario Faggioli wrote:
> In Credit2 CPUs (can) share runqueues, depending on the topology. For
> instance, with per-socket runqueues (the default) all the CPUs that are
> part of the same socket share a runqueue.
> 
> On platforms with a huge number of CPUs per socket, that could be a
> problem. An example is AMD EPYC2 servers, where we can have up to 128
> CPUs in a socket.
> 
> It is of course possible to define other, still topology-based, runqueue
> arrangements (e.g., per-LLC, per-DIE, etc). But that may still result in
> runqueues with too many CPUs on other/future platforms.
> 
> Therefore, let's set a limit to the max number of CPUs that can share a
> Credit2 runqueue. The actual value is configurable (at boot time), the
> default being 16. If, for instance,  there are more than 16 CPUs in a
> socket, they'll be split among two (or more) runqueues.

Did you think about balancing the runqueues regarding the number of
cpus? E.g. in case of max being 16 and having 20 cpus to put 10 in each
runqueue? I know this will need more logic as cpus are added one by one,
but the result would be much better IMO.

> 
> Note: with core scheduling enabled, this parameter sets the max number
> of *scheduling resources* that can share a runqueue. Therefore, with
> granularity set to core (and assuming 2 threads per core), we will have
> at most 16 cores per runqueue, which corresponds to 32 threads. But that
> is fine, considering how core scheduling works.
> 
> Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
> ---
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Juergen Gross <jgross@suse.com>
> ---
>   xen/common/sched/cpupool.c |    2 -
>   xen/common/sched/credit2.c |  104 ++++++++++++++++++++++++++++++++++++++++++--
>   xen/common/sched/private.h |    2 +
>   3 files changed, 103 insertions(+), 5 deletions(-)
> 
> diff --git a/xen/common/sched/cpupool.c b/xen/common/sched/cpupool.c
> index d40345b585..0227457285 100644
> --- a/xen/common/sched/cpupool.c
> +++ b/xen/common/sched/cpupool.c
> @@ -37,7 +37,7 @@ static cpumask_t cpupool_locked_cpus;
>   
>   static DEFINE_SPINLOCK(cpupool_lock);
>   
> -static enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu;
> +enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu;

Please don't use the global option value, but the per-cpupool one.

>   static unsigned int __read_mostly sched_granularity = 1;
>   
>   #ifdef CONFIG_HAS_SCHED_GRANULARITY
> diff --git a/xen/common/sched/credit2.c b/xen/common/sched/credit2.c
> index 697c9f917d..abe4d048c8 100644
> --- a/xen/common/sched/credit2.c
> +++ b/xen/common/sched/credit2.c
> @@ -471,6 +471,16 @@ static int __init parse_credit2_runqueue(const char *s)
>   }
>   custom_param("credit2_runqueue", parse_credit2_runqueue);
>   
> +/*
> + * How many CPUs will be put, at most, in the same runqueue.
> + * Runqueues are still arranged according to the host topology (and
> + * according to the value of the 'credit2_runqueue' parameter). But
> + * we also have a cap to the number of CPUs that share runqueues.
> + * As soon as we reach the limit, a new runqueue will be created.
> + */
> +static unsigned int __read_mostly opt_max_cpus_runqueue = 16;
> +integer_param("sched_credit2_max_cpus_runqueue", opt_max_cpus_runqueue);
> +
>   /*
>    * Per-runqueue data
>    */
> @@ -852,14 +862,61 @@ cpu_runqueue_match(const struct csched2_runqueue_data *rqd, unsigned int cpu)
>              (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu));
>   }
>   
> +/* Additional checks, to avoid separating siblings in different runqueues. */
> +static bool
> +cpu_runqueue_smt_match(const struct csched2_runqueue_data *rqd, unsigned int cpu)
> +{
> +    unsigned int nr_sibl = cpumask_weight(per_cpu(cpu_sibling_mask, cpu));

Shouldn't you mask away siblings not in the cpupool?

> +    unsigned int rcpu, nr_smts = 0;
> +
> +    /*
> +     * If we put the CPU in this runqueue, we must be sure that there will
> +     * be enough room for accepting its hyperthread sibling(s) here as well.
> +     */
> +    cpumask_clear(cpumask_scratch_cpu(cpu));
> +    for_each_cpu ( rcpu, &rqd->active )
> +    {
> +        ASSERT(rcpu != cpu);
> +        if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) )
> +        {
> +            /*
> +             * For each CPU already in the runqueue, account for it and for
> +             * its sibling(s), independently from whether such sibling(s) are
> +             * in the runqueue already or not.
> +             *
> +             * Of course, if there are sibling CPUs in the runqueue already,
> +             * only count them once.
> +             */
> +            cpumask_or(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
> +                       per_cpu(cpu_sibling_mask, rcpu));

Again, local cpupool only!


Juergen
Dario Faggioli April 30, 2020, 12:28 p.m. UTC | #3
On Thu, 2020-04-30 at 09:35 +0200, Jürgen Groß wrote:
> On 29.04.20 19:36, Dario Faggioli wrote:
> > 
> > Therefore, let's set a limit to the max number of CPUs that can
> > share a
> > Credit2 runqueue. The actual value is configurable (at boot time),
> > the
> > default being 16. If, for instance,  there are more than 16 CPUs in
> > a
> > socket, they'll be split among two (or more) runqueues.
> 
> Did you think about balancing the runqueues regarding the number of
> cpus? E.g. in case of max being 16 and having 20 cpus to put 10 in
> each
> runqueue? I know this will need more logic as cpus are added one by
> one,
> but the result would be much better IMO.
>
I know. Point is, CPUs not only are added one by one, but they can,
once the system is running, be offlined/onlined or moved among
cpupools.

Therefore, if we have 20 CPUs, even if we put 10 in each runqueue at
boot, if the admin removes 4 CPUs that happened to be all in the same
runqueue, we end up in an unbalanced (6 vs 10) situation again. So we'd
indeed need full runqueue online rebalancing logic, which will probably
end up being quite complex and I'm not sure it's worth it.

That being said, I can try to make things a bit more fair, when CPUs
come up and are added to the pool. Something along the lines of adding
them to the runqueue with the least number of CPUs in it (among the
suitable ones, of course).

With that, when the user removes 4 CPUs, we will have the 6 vs 10
situation. But we would make sure that, when she adds them back, we
will go back to 10 vs. 10, instead than, say, 6 vs 14 or something like
that.

Was something like this that you had in mind? And in any case, what do
you think about it?
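
To make the idea concrete, here is a rough sketch (the loop construct
and the exact structure are illustrative only, not code from an actual
patch):

    /*
     * Instead of stopping at the first runqueue that matches the
     * topology, keep scanning and remember the suitable runqueue
     * with the fewest CPUs in it (rqd->refcnt counts them).
     */
    for_each_runqueue ( prv, rqd )            /* hypothetical iterator */
    {
        if ( rqd->refcnt < opt_max_cpus_runqueue &&
             cpu_runqueue_match(rqd, cpu) &&
             (!rqd_valid || rqd->refcnt < rqd_valid->refcnt) )
            rqd_valid = rqd;
    }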

> > --- a/xen/common/sched/cpupool.c
> > +++ b/xen/common/sched/cpupool.c
> > @@ -37,7 +37,7 @@ static cpumask_t cpupool_locked_cpus;
> >   
> >   static DEFINE_SPINLOCK(cpupool_lock);
> >   
> > -static enum sched_gran __read_mostly opt_sched_granularity =
> > SCHED_GRAN_cpu;
> > +enum sched_gran __read_mostly opt_sched_granularity =
> > SCHED_GRAN_cpu;
> 
> Please don't use the global option value, but the per-cpupool one.
> 
Yep, you're right. Will do.

> > +/* Additional checks, to avoid separating siblings in different
> > runqueues. */
> > +static bool
> > +cpu_runqueue_smt_match(const struct csched2_runqueue_data *rqd,
> > unsigned int cpu)
> > +{
> > +    unsigned int nr_sibl =
> > cpumask_weight(per_cpu(cpu_sibling_mask, cpu));
> 
> Shouldn't you mask away siblings not in the cpupool?
> 
So, point here is: if I have Pool-0 and Pool-1, each with 2 runqueues
and CPU 1 is in Pool-1, when I add CPU 0 --which is CPU 1's sibling--
to Pool-0, I always want to make sure that there is room for both CPUs
0 and 1 in the runqueue of Pool-0 where I'm putting it (CPU 0). Even if
CPU 1 is currently in another pool.

This way if, in future, CPU 1 is removed from Pool-1 and added to
Pool-0, I am sure it can go in the same runqueue where CPU 0 is. If I
don't consider CPUs which currently are in another pool, we risk that
when/if they're added to this very pool, they'll end up in a different
runqueue.

And we don't want that.

Makes sense?

Thanks and Regards
Jürgen Groß April 30, 2020, 12:52 p.m. UTC | #4
On 30.04.20 14:28, Dario Faggioli wrote:
> On Thu, 2020-04-30 at 09:35 +0200, Jürgen Groß wrote:
>> On 29.04.20 19:36, Dario Faggioli wrote:
>>>
>>> Therefore, let's set a limit to the max number of CPUs that can
>>> share a
>>> Credit2 runqueue. The actual value is configurable (at boot time),
>>> the
>>> default being 16. If, for instance,  there are more than 16 CPUs in
>>> a
>>> socket, they'll be split among two (or more) runqueues.
>>
>> Did you think about balancing the runqueues regarding the number of
>> cpus? E.g. in case of max being 16 and having 20 cpus to put 10 in
>> each
>> runqueue? I know this will need more logic as cpus are added one by
>> one,
>> but the result would be much better IMO.
>>
> I know. Point is, CPUs not only are added one by one, but they can,
> once the system is running, be offlined/onlined or moved among
> cpupools.
> 
> Therefore, if we have 20 CPUs, even if we put 10 in each runqueue at
> boot, if the admin removes 4 CPUs that happened to be all in the same
> runqueue, we end up in an unbalanced (6 vs 10) situation again. So we'd
> indeed need full runqueue online rebalancing logic, which will probably
> end up being quite complex and I'm not sure it's worth it.
> 
> That being said, I can try to make things a bit more fair, when CPUs
> come up and are added to the pool. Something around the line of adding
> them to the runqueue with the least number of CPUs in it (among the
> suitable ones, of course).
> 
> With that, when the user removes 4 CPUs, we will have the 6 vs 10
> situation. But we would make sure that, when she adds them back, we
> will go back to 10 vs. 10, instead than, say, 6 vs 14 or something like
> that.
> 
> Was something like this that you had in mind? And in any case, what do
> you think about it?

Yes, this would be better already.

> 
>>> --- a/xen/common/sched/cpupool.c
>>> +++ b/xen/common/sched/cpupool.c
>>> @@ -37,7 +37,7 @@ static cpumask_t cpupool_locked_cpus;
>>>    
>>>    static DEFINE_SPINLOCK(cpupool_lock);
>>>    
>>> -static enum sched_gran __read_mostly opt_sched_granularity =
>>> SCHED_GRAN_cpu;
>>> +enum sched_gran __read_mostly opt_sched_granularity =
>>> SCHED_GRAN_cpu;
>>
>> Please don't use the global option value, but the per-cpupool one.
>>
> Yep, you're right. Will do.
> 
>>> +/* Additional checks, to avoid separating siblings in different
>>> runqueues. */
>>> +static bool
>>> +cpu_runqueue_smt_match(const struct csched2_runqueue_data *rqd,
>>> unsigned int cpu)
>>> +{
>>> +    unsigned int nr_sibl =
>>> cpumask_weight(per_cpu(cpu_sibling_mask, cpu));
>>
>> Shouldn't you mask away siblings not in the cpupool?
>>
> So, point here is: if I have Pool-0 and Pool-1, each with 2 runqueues
> and CPU 1 is in Pool-1, when I add CPU 0 --which is CPU 1's sibling--
> to Pool-0, I always want to make sure that there is room for both CPUs
> 0 and 1 in the runqueue of Pool-0 where I'm putting it (CPU 0). Even if
> CPU 1 is currently in another pool.
> 
> This way if, in future, CPU 1 is removed from Pool-1 and added to
> Pool-0, I am sure it can go in the same runqueue where CPU 0 is. If I
> don't consider CPUs which currently are in another pool, we risk that
> when/if they're added to this very pool, they'll end up in a different
> runqueue.
> 
> And we don't want that.
> 
> Makes sense?

Yes.

You should add a comment in this regard.

And you should either reject the case of less cpus per queue than
siblings per core, or you should handle this situation. Otherwise you
won't ever find a suitable run-queue. :-)


Juergen
Dario Faggioli April 30, 2020, 2:01 p.m. UTC | #5
On Thu, 2020-04-30 at 14:52 +0200, Jürgen Groß wrote:
> On 30.04.20 14:28, Dario Faggioli wrote:
> > On Thu, 2020-04-30 at 09:35 +0200, Jürgen Groß wrote:
> > > 
> > With that, when the user removes 4 CPUs, we will have the 6 vs 10
> > situation. But we would make sure that, when she adds them back, we
> > will go back to 10 vs. 10, instead than, say, 6 vs 14 or something
> > like
> > that.
> > 
> > Was something like this that you had in mind? And in any case, what
> > do
> > you think about it?
> 
> Yes, this would be better already.
> 
Right, I'll give a try at this, and let's see how it ends up looking
like.

> > This way if, in future, CPU 1 is removed from Pool-1 and added to
> > Pool-0, I am sure it can go in the same runqueue where CPU 0 is. If
> > I
> > don't consider CPUs which currently are in another pool, we risk
> > that
> > when/if they're added to this very pool, they'll end up in a
> > different
> > runqueue.
> > 
> > And we don't want that.
> > 
> Yes.
> 
Cool. :-)

> You should add a comment in this regard.
> 
Sure.

> And you should either reject the case of less cpus per queue than
> siblings per core, or you should handle this situation. Otherwise you
> won't ever find a suitable run-queue. :-)
> 
Right, and in fact I had a check for that, rejecting smaller
max-cpus-per-runqueue than siblings, where we validate also the other
scheduling parameters. But indeed it's not there now, so I guess I
removed it by mistake before sending.
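
Something like the following, presumably (where exactly the check lives
and how it reacts are assumptions of this sketch; the field shown is the
x86 one):

    /* Don't allow a cap smaller than the number of threads per core. */
    if ( opt_max_cpus_runqueue < boot_cpu_data.x86_num_siblings )
    {
        printk(XENLOG_WARNING
               "max CPUs per runqueue below threads per core, adjusting\n");
        opt_max_cpus_runqueue = boot_cpu_data.x86_num_siblings;
    }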

And now that I double check, I'm also missing documenting the new
parameters.

Thanks and Regards
Dario Faggioli May 26, 2020, 9:18 p.m. UTC | #6
On Thu, 2020-04-30 at 14:52 +0200, Jürgen Groß wrote:
> On 30.04.20 14:28, Dario Faggioli wrote:
> > That being said, I can try to make things a bit more fair, when
> > CPUs
> > come up and are added to the pool. Something around the line of
> > adding
> > them to the runqueue with the least number of CPUs in it (among the
> > suitable ones, of course).
> > 
> > With that, when the user removes 4 CPUs, we will have the 6 vs 10
> > situation. But we would make sure that, when she adds them back, we
> > will go back to 10 vs. 10, instead than, say, 6 vs 14 or something
> > like
> > that.
> > 
> > Was something like this that you had in mind? And in any case, what
> > do
> > you think about it?
> 
> Yes, this would be better already.
> 
So, a couple of thoughts. Doing something like what I tried to describe
above is not too bad, and I have it pretty much ready.

With that, on an Intel system with 96 CPUs on two sockets, and
max_cpus_per_runqueue set to 16, I got, after boot, instead of just 2
giant runqueues with 48 CPUs in each:

 - 96 CPUs online, split in 6 runqueues (3 for each socket) with 16
   CPUs in each of them

I can also "tweak" it in such a way that, if one for instance boots
with "smt=no", we get to a point where we have:

 - 48 CPUs online, split in 6 runqueues, with 8 CPUs in each

Now, I think this is good, and actually better than the current
situation where on such a system, we only have two very big runqueues
(and let me repeat that introducing a per-LLC runqueue arrangement, on
which I'm also working, won't help in this case, as NUMA node == LLC).

So, problem is that if one starts to fiddle with cpupools and cpu on
and offlining, things can get pretty unbalanced. E.g., I can end up
with 2 runqueues on a socket, one with 16 CPUs and the other with just
a couple of them.

Now, this is all possible as of now (I mean, without this patch)
already, although at a different level. In fact, I can very well remove
all-but-1 CPUs of node 1 from Pool-0, and end up again with a runqueue
with a lot of CPUs and another with just one.

It looks like we need a way to rebalance the runqueues, which should be
doable... But despite having spent a couple of days trying to come up
with something decent, that I could include in v2 of this series, I
couldn't get it to work sensibly.

So, this looks to me like an improvement that would need to be
improved further, but I'm not sure we have the time for it right now
(Cc-ing Paul). Should we still go for what's ready? I think yes, but
I'd be interested in opinions...

Also, if anyone has any clever ideas on how to implement a mechanism
that rebalances the CPUs among the runqueues, I'm all ears and am up for
trying to implement it. :-)

Regards
Dario Faggioli May 26, 2020, 10 p.m. UTC | #7
Hey,

thanks for the review, and sorry for replying late... I was busy with
something and then was trying to implement a better balancing logic, as
discussed with Juergen, but with only partial success...

On Thu, 2020-04-30 at 08:45 +0200, Jan Beulich wrote:
> On 29.04.2020 19:36, Dario Faggioli wrote:
> > @@ -852,14 +862,61 @@ cpu_runqueue_match(const struct 
> > [...]
> > +        ASSERT(rcpu != cpu);
> > +        if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) )
> > +        {
> > +            /*
> > +             * For each CPU already in the runqueue, account for
> > it and for
> > +             * its sibling(s), independently from whether such
> > sibling(s) are
> > +             * in the runqueue already or not.
> > +             *
> > +             * Of course, if there are sibling CPUs in the
> > runqueue already,
> > +             * only count them once.
> > +             */
> > +            cpumask_or(cpumask_scratch_cpu(cpu),
> > cpumask_scratch_cpu(cpu),
> > +                       per_cpu(cpu_sibling_mask, rcpu));
> > +            nr_smts += nr_sibl;
> 
> This being common code, is it appropriate to assume all CPUs having
> the same number of siblings? 
>
You mention common code because you are thinking of differences between
x86 and ARM? In ARM --although there might be (I'm not sure)-- chips
that have SMT, or that we may want to identify and treat as if it were
SMT, we currently have no support for that, so I don't think it is a
problem.

On x86, I'm not sure I am aware of cases where the number of threads is
different among cores or sockets... are there any?

Besides, we have some SMT specific code around (especially in
scheduling) already.

> Even beyond that, iirc the sibling mask
> represents the online or parked siblings, but not offline ones. For
> the purpose here, don't you rather care about the full set?
> 
This is actually a good point. I indeed care about the number of
siblings a thread has, in general, not only about the ones that are
currently online.

In v2, I'll be using boot_cpu_data.x86_num_siblings, of course wrapped
in a helper that just returns 1 for ARM. What do you think, is this
better?
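
For illustration, a minimal sketch of such a wrapper (the helper name is
made up here):

    static inline unsigned int cpu_nr_siblings(unsigned int cpu)
    {
    #ifdef CONFIG_X86
        /* cpu_data[cpu] rather than boot_cpu_data, per Jan's reply below. */
        return cpu_data[cpu].x86_num_siblings;
    #else
        return 1;
    #endif
    }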

> What about HT vs AMD Fam15's CUs? Do you want both to be treated
> the same here?
> 
Are you referring to the cores that, AFAIUI, share the L1i cache? If
yes, I thought about it, and ended up _not_ dealing with them here, but
I'm still a bit unsure.

Cache oriented runqueue organization will be the subject of another
patch series, and that's why I kept them out. However, that's a rather
special case with a lot in common to SMT... Just in case, is there a
way to identify them easily, like with a mask or something, in the code
already?

> Also could you outline the intentions with this logic in the
> description, to be able to match the goal with what gets done?
> 
Sure, I will try to state it more clearly.

> > @@ -900,6 +990,12 @@ cpu_add_to_runqueue(struct csched2_private
> > *prv, unsigned int cpu)
> >          rqd->pick_bias = cpu;
> >          rqd->id = rqi;
> >      }
> > +    else
> > +        rqd = rqd_valid;
> > +
> > +    printk(XENLOG_INFO "CPU %d (sibling={%*pbl}) will go to
> > runqueue %d with {%*pbl}\n",
> > +           cpu, CPUMASK_PR(per_cpu(cpu_sibling_mask, cpu)), rqd-
> > >id,
> > +           CPUMASK_PR(&rqd->active));
> 
> Iirc there's one per-CPU printk() already. On large systems this
> isn't
> very nice, so I'd like to ask that their total number at least not
> get
> further grown. Ideally there would be a less verbose summary after
> all
> CPUs have been brought up at boot, with per-CPU info be logged only
> during CPU hot online.
> 
Understood. Problem is that, here in the scheduling code, I don't see
an easy way to tell when we have finished bringing up CPUs... And it's
probably not worth looking too hard (even less adding logic) only for
the sake of printing this message.

So I think I will demote this printk to a XENLOG_DEBUG one (and will
also get rid of others that were already DEBUG, but not super useful,
after some more thinking).

The idea is that, after all, exactly on which runqueue a CPU ends up is
not information that should matter much to a user/administrator.
For now, it will be possible to know where they ended up via debug
keys. And even if we feel like making it more visible, that is better
achieved via some toolstack command that queries and prints the
configuration of the scheduler, rather than a line like this in the
boot log.

Regards
Jürgen Groß May 27, 2020, 4:26 a.m. UTC | #8
On 27.05.20 00:00, Dario Faggioli wrote:
> Hey,
> 
> thanks for the review, and sorry for replying late... I was busy with
> something and then was trying to implement a better balancing logic, as
> discussed with Juergen, but with only partial success...
> 
> On Thu, 2020-04-30 at 08:45 +0200, Jan Beulich wrote:
>> On 29.04.2020 19:36, Dario Faggioli wrote:
>>> @@ -852,14 +862,61 @@ cpu_runqueue_match(const struct
>>> [...]
>>> +        ASSERT(rcpu != cpu);
>>> +        if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) )
>>> +        {
>>> +            /*
>>> +             * For each CPU already in the runqueue, account for
>>> it and for
>>> +             * its sibling(s), independently from whether such
>>> sibling(s) are
>>> +             * in the runqueue already or not.
>>> +             *
>>> +             * Of course, if there are sibling CPUs in the
>>> runqueue already,
>>> +             * only count them once.
>>> +             */
>>> +            cpumask_or(cpumask_scratch_cpu(cpu),
>>> cpumask_scratch_cpu(cpu),
>>> +                       per_cpu(cpu_sibling_mask, rcpu));
>>> +            nr_smts += nr_sibl;
>>
>> This being common code, is it appropriate to assume all CPUs having
>> the same number of siblings?
>>
> You mention common code because you are thinking of differences between
> x86 and ARM? In ARM --although there might be (I'm not sure)-- chips
> that have SMT, or that we may want to identify and treat as if it were
> SMT, we currently have no support for that, so I don't think it is a
> problem.
> 
> On x86, I'm not sure I am aware of cases where the number of threads is
> different among cores or sockets... are there any?
> 
> Besides, we have some SMT specific code around (especially in
> scheduling) already.
> 
>> Even beyond that, iirc the sibling mask
>> represents the online or parked siblings, but not offline ones. For
>> the purpose here, don't you rather care about the full set?
>>
> This is actually a good point. I indeed care about the number of
> siblings a thread has, in general, not only about the ones that are
> currently online.
> 
> In v2, I'll be using boot_cpu_data.x86_num_siblings, of course wrapped
> in an helper that just returns 1 for ARM. What do you think, is this
> better?
> 
>> What about HT vs AMD Fam15's CUs? Do you want both to be treated
>> the same here?
>>
> Are you referring to the cores that, AFAIUI, share the L1i cache? If
> yes, I thought about it, and ended up _not_ dealing with them here, but
> I'm still a bit unsure.
> 
> Cache oriented runqueue organization will be the subject of another
> patch series, and that's why I kept them out. However, that's a rather
> special case with a lot in common to SMT... Just in case, is there a
> way to identify them easily, like with a mask or something, in the code
> already?
> 
>> Also could you outline the intentions with this logic in the
>> description, to be able to match the goal with what gets done?
>>
> Sure, I will try state it more clearly.
> 
>>> @@ -900,6 +990,12 @@ cpu_add_to_runqueue(struct csched2_private
>>> *prv, unsigned int cpu)
>>>           rqd->pick_bias = cpu;
>>>           rqd->id = rqi;
>>>       }
>>> +    else
>>> +        rqd = rqd_valid;
>>> +
>>> +    printk(XENLOG_INFO "CPU %d (sibling={%*pbl}) will go to
>>> runqueue %d with {%*pbl}\n",
>>> +           cpu, CPUMASK_PR(per_cpu(cpu_sibling_mask, cpu)), rqd-
>>>> id,
>>> +           CPUMASK_PR(&rqd->active));
>>
>> Iirc there's one per-CPU printk() already. On large systems this
>> isn't
>> very nice, so I'd like to ask that their total number at least not
>> get
>> further grown. Ideally there would be a less verbose summary after
>> all
>> CPUs have been brought up at boot, with per-CPU info be logged only
>> during CPU hot online.
>>
> Understood. Problem is that, here in the scheduling code, I don't see
> an easy way to tell when we have finished bringing up CPUs... And it's
> probably not worth looking too hard (even less adding logic) only for
> the sake of printing this message.

cpupool_init() is the perfect place for that.


Juergen
Jan Beulich May 27, 2020, 6:17 a.m. UTC | #9
On 27.05.2020 00:00, Dario Faggioli wrote:
> On Thu, 2020-04-30 at 08:45 +0200, Jan Beulich wrote:
>> On 29.04.2020 19:36, Dario Faggioli wrote:
>>> @@ -852,14 +862,61 @@ cpu_runqueue_match(const struct 
>>> [...]
>>> +        ASSERT(rcpu != cpu);
>>> +        if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) )
>>> +        {
>>> +            /*
>>> +             * For each CPU already in the runqueue, account for
>>> it and for
>>> +             * its sibling(s), independently from whether such
>>> sibling(s) are
>>> +             * in the runqueue already or not.
>>> +             *
>>> +             * Of course, if there are sibling CPUs in the
>>> runqueue already,
>>> +             * only count them once.
>>> +             */
>>> +            cpumask_or(cpumask_scratch_cpu(cpu),
>>> cpumask_scratch_cpu(cpu),
>>> +                       per_cpu(cpu_sibling_mask, rcpu));
>>> +            nr_smts += nr_sibl;
>>
>> This being common code, is it appropriate to assume all CPUs having
>> the same number of siblings? 
>>
> You mention common code because you are thinking of differences between
> x86 and ARM? In ARM --although there might be (I'm not sure)-- chips
> that have SMT, or that we may want to identify and treat as if it were
> SMT, we currently have no support for that, so I don't think it is a
> problem.
> 
> On x86, I'm not sure I am aware of cases where the number of threads is
> different among cores or sockets... are there any?

I'm not aware of any either, but in common code it should also
matter what might be, not just what there is. Unless you wrap
things in e.g. CONFIG_X86.

>> Even beyond that, iirc the sibling mask
>> represents the online or parked siblings, but not offline ones. For
>> the purpose here, don't you rather care about the full set?
>>
> This is actually a good point. I indeed care about the number of
> siblings a thread has, in general, not only about the ones that are
> currently online.
> 
> In v2, I'll be using boot_cpu_data.x86_num_siblings, of course wrapped
> in an helper that just returns 1 for ARM. What do you think, is this
> better?

As per above, cpu_data[rcpu] then please.

>> What about HT vs AMD Fam15's CUs? Do you want both to be treated
>> the same here?
>>
> Are you referring to the cores that, AFAIUI, share the L1i cache? If
> yes, I thought about it, and ended up _not_ dealing with them here, but
> I'm still a bit unsure.
> 
> Cache oriented runqueue organization will be the subject of another
> patch series, and that's why I kept them out. However, that's a rather
> special case with a lot in common to SMT...

I didn't think of cache sharing in particular, but about the
concept of compute units vs hyperthreads in general.

> Just in case, is there a
> way to identify them easily, like with a mask or something, in the code
> already?

cpu_sibling_mask still gets used for both, so there's no mask
to use. As per set_cpu_sibling_map() you can look at
cpu_data[].compute_unit_id to tell, but that's of course x86-
specific (as is the entire compute unit concept).
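
(For illustration, a minimal x86-only sketch of such a check; cpu1/cpu2
and the implicit same-socket assumption are this example's, not from the
thread:)

    /* Same-socket CPUs share a compute unit iff their IDs match. */
    bool same_cu = cpu_data[cpu1].compute_unit_id ==
                   cpu_data[cpu2].compute_unit_id;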

>>> @@ -900,6 +990,12 @@ cpu_add_to_runqueue(struct csched2_private
>>> *prv, unsigned int cpu)
>>>          rqd->pick_bias = cpu;
>>>          rqd->id = rqi;
>>>      }
>>> +    else
>>> +        rqd = rqd_valid;
>>> +
>>> +    printk(XENLOG_INFO "CPU %d (sibling={%*pbl}) will go to
>>> runqueue %d with {%*pbl}\n",
>>> +           cpu, CPUMASK_PR(per_cpu(cpu_sibling_mask, cpu)), rqd-
>>>> id,
>>> +           CPUMASK_PR(&rqd->active));
>>
>> Iirc there's one per-CPU printk() already. On large systems this
>> isn't
>> very nice, so I'd like to ask that their total number at least not
>> get
>> further grown. Ideally there would be a less verbose summary after
>> all
>> CPUs have been brought up at boot, with per-CPU info be logged only
>> during CPU hot online.
>>
> Understood. Problem is that, here in the scheduling code, I don't see
> an easy way to tell when we have finished bringing up CPUs... And it's
> probably not worth looking too hard (even less adding logic) only for
> the sake of printing this message.
> 
> So I think I will demote this printk as a XENLOG_DEBUG one (and will
> also get rid of others that were already DEBUG, but not super useful,
> after some more thinking).

Having seen Jürgen's reply as well as what you further wrote below,
I'd still like to point out that even XENLOG_DEBUG isn't quiet
enough: There may be some value to such a debugging message to you,
but it may be mainly spam to e.g. me, who still has a need to
run with loglvl=all in the common case. Let's not forget, the
context in which the underlying topic came up in was pretty-many-
core AMD CPUs.

We had this issue elsewhere as well, and as CPU counts grew we
started hiding such messages behind a command line option, besides
a log level qualification. Similarly we hide e.g. details of the
IOMMU arrangements behind a command line option.

> The idea is that, after all, exactly on which runqueue a CPU ends up is
> not an information that should matter much from an user/administrator.
> For now, it will be possible to know where they ended up via debug
> keys. And even if we feel like making it more visible, that is better
> achieved via some toolstack command that queries and prints the
> configuration of the scheduler, rather than a line like this in the
> boot log.

Good.

Jan
Jan Beulich May 27, 2020, 6:22 a.m. UTC | #10
On 26.05.2020 23:18, Dario Faggioli wrote:
> On Thu, 2020-04-30 at 14:52 +0200, Jürgen Groß wrote:
>> On 30.04.20 14:28, Dario Faggioli wrote:
>>> That being said, I can try to make things a bit more fair, when
>>> CPUs
>>> come up and are added to the pool. Something around the line of
>>> adding
>>> them to the runqueue with the least number of CPUs in it (among the
>>> suitable ones, of course).
>>>
>>> With that, when the user removes 4 CPUs, we will have the 6 vs 10
>>> situation. But we would make sure that, when she adds them back, we
>>> will go back to 10 vs. 10, instead than, say, 6 vs 14 or something
>>> like
>>> that.
>>>
>>> Was something like this that you had in mind? And in any case, what
>>> do
>>> you think about it?
>>
>> Yes, this would be better already.
>>
> So, a couple of thoughts. Doing something like what I tried to describe
> above is not too bad, and I have it pretty much ready.
> 
> With that, on an Intel system with 96 CPUs on two sockets, and
> max_cpus_per_runqueue set to 16, I got, after boot, instead of just 2
> giant runqueues with 48 CPUs in each:
> 
>  - 96 CPUs online, split in 6 runqueues (3 for each socket) with 16
>    CPUs in each of them
> 
> I can also "tweak" it in such a way that, if one for instance boots
> with "smt=no", we get to a point where we have:
> 
>  - 48 CPUs online, split in 6 runqueues, with 8 CPUs in each
> 
> Now, I think this is good, and actually better than the current
> situation where on such a system, we only have two very big runqueues
> (and let me repeat that introducing a per-LLC runqueue arrangement, on
> which I'm also working, won't help in this case, as NUMA node == LLC).
> 
> So, problem is that if one starts to fiddle with cpupools and cpu on
> and offlining, things can get pretty unbalanced. E.g., I can end up
> with 2 runqueues on a socket, one with 16 CPUs and the other with just
> a couple of them.
> 
> Now, this is all possible as of now (I mean, without this patch)
> already, although at a different level. In fact, I can very well remove
> all-but-1 CPUs of node 1 from Pool-0, and end up again with a runqueue
> with a lot of CPUs and another with just one.
> 
> It looks like we need a way to rebalance the runqueues, which should be
> doable... But despite having spent a couple of days trying to come up
> with something decent, that I could include in v2 of this series, I
> couldn't get it to work sensibly.

CPU on-/offlining may not need considering here at all imo. I think it
would be quite reasonable as a first step to have a static model where
from system topology (and perhaps a command line option) one can
predict which runqueue a particular CPU will end up in, no matter when
it gets brought online.
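
A rough sketch of what such a static, topology-derived mapping could
look like (purely illustrative; cpu_to_socket()/cpu_to_core() are the
existing topology helpers, while the formula, cpus_per_socket and
rqs_per_socket are made up for this example):

    /* Fixed runqueue index from topology, independent of onlining order. */
    unsigned int rqs_per_socket = DIV_ROUND_UP(cpus_per_socket,
                                               opt_max_cpus_runqueue);
    unsigned int rqi = cpu_to_socket(cpu) * rqs_per_socket +
                       cpu_to_core(cpu) % rqs_per_socket;

(Threads of the same core share cpu_to_core(), so a mapping like this
would also keep siblings together.)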

Jan
Dario Faggioli May 28, 2020, 9:32 a.m. UTC | #11
On Wed, 2020-05-27 at 06:26 +0200, Jürgen Groß wrote:
> On 27.05.20 00:00, Dario Faggioli wrote:
> > 
> > Understood. Problem is that, here in the scheduling code, I don't
> > see
> > an easy way to tell when we have finished bringing up CPUs... And
> > it's
> > probably not worth looking too hard (even less adding logic) only
> > for
> > the sake of printing this message.
> 
> cpupool_init() is the perfect place for that.
> 
Yes, at least for boot time, it is indeed, so I'll go for it.

OTOH, when for instance one creates a new cpupool (or adds a bunch of
CPUs to one, with `xl cpupool-add Pool-X 7,3-14-22,18`), CPUs are added
one by one, and in Xen we don't really know which one will be the last,
so we can't easily print a summary at that point.

But yeah, I'll go for cpupool_init() and get rid of the rest, for now.

Thanks and Regards
Dario Faggioli May 28, 2020, 9:36 a.m. UTC | #12
On Wed, 2020-05-27 at 08:22 +0200, Jan Beulich wrote:
> On 26.05.2020 23:18, Dario Faggioli wrote:
> > 
> > It looks like we need a way to rebalance the runqueues, which
> > should be
> > doable... But despite having spent a couple of days trying to come
> > up
> > with something decent, that I could include in v2 of this series, I
> > couldn't get it to work sensibly.
> 
> CPU on-/offlining may not need considering here at all imo. I think
> it
> would be quite reasonable as a first step to have a static model
> where
> from system topology (and perhaps a command line option) one can
> predict which runqueue a particular CPU will end up in, no matter
> when
> it gets brought online.
> 
Right.

IAC, just FYI, after talking to Juergen --who suggested a nice solution
to overcome the problem where I was stuck-- I have now a patch that
successfully implement dynamic online rebalancing of runqueues.

I'm polishing up the comments and changelog and will send it ASAP, as
I'd really like for this series to go in... :-)

Thanks and Regards
Dario Faggioli May 28, 2020, 2:55 p.m. UTC | #13
On Wed, 2020-05-27 at 08:17 +0200, Jan Beulich wrote:
> On 27.05.2020 00:00, Dario Faggioli wrote:
> > 
> > Cache oriented runqueue organization will be the subject of another
> > patch series, and that's why I kept them out. However, that's a
> > rather
> > special case with a lot in common to SMT...
> 
> I didn't think of cache sharing in particular, but about the
> concept of compute units vs hyperthreads in general.
> 
Ok.

> > Just in case, is there a
> > way to identify them easily, like with a mask or something, in the
> > code
> > already?
> 
> cpu_sibling_mask still gets used for both, so there's no mask
> to use. As per set_cpu_sibling_map() you can look at
> cpu_data[].compute_unit_id to tell, but that's of course x86-
> specific (as is the entire compute unit concept).
> 
Right. And thanks for the pointers.

But then, what I am understanding by having a look there is that I
indeed can use (again, appropriately wrapped) x86_num_siblings, for
telling, in this function, whether a CPU has any, and if yes how many,
HT (Intel) or CU (AMD) siblings in total, although some of them may
currently be offline.

Which means I will be treating HTs and CUs the same which, thinking
more about it (and thinking actually to CUs, rather than to any cache
sharing relationship), does make sense for this feature.

Does this make sense, or am I missing or misinterpreting anything?

> > So I think I will demote this printk as a XENLOG_DEBUG one (and
> > will
> > also get rid of others that were already DEBUG, but not super
> > useful,
> > after some more thinking).
> 
> Having seen Jürgen's reply as well as what you further wrote below,
> I'd still like to point out that even XENLOG_DEBUG isn't quiet
> enough: There may be some value to such a debugging message to you,
> but it may be mainly spam to e.g. me, who I still have a need to
> run with loglvl=all in the common case. Let's not forget, the
> context in which the underlying topic came up in was pretty-many-
> core AMD CPUs.
> 
Good point indeed about DEBUG potentially being an issue as well. So
yes, as announced in my reply to Juergen, I was going with the recap in
cpupool_init().

However, that looks like it requires a new hook in the scheduler's
interface, as the information is scheduler specific but at the same
time I don't think we want to have the exact same information from
either dump_settings() or dump_cpu_state(), which is all we have... :-/

I'll see about it.

Thanks and Regards
Jan Beulich May 29, 2020, 9:58 a.m. UTC | #14
On 28.05.2020 16:55, Dario Faggioli wrote:
> On Wed, 2020-05-27 at 08:17 +0200, Jan Beulich wrote:
>> On 27.05.2020 00:00, Dario Faggioli wrote:
>>> Just in case, is there a
>>> way to identify them easily, like with a mask or something, in the
>>> code
>>> already?
>>
>> cpu_sibling_mask still gets used for both, so there's no mask
>> to use. As per set_cpu_sibling_map() you can look at
>> cpu_data[].compute_unit_id to tell, but that's of course x86-
>> specific (as is the entire compute unit concept).
>>
> Right. And thanks for the pointers.
> 
> But then, what I am understanding by having a look there is that I
> indeed can use (again, appropriately wrapped) x86_num_siblings, for
> telling, in this function, whether a CPU has any, and if yes how many,
> HT (Intel) or CU (AMD) siblings in total, although some of them may
> currently be offline.
> 
> Which means I will be treating HTs and CUs the same which, thinking
> more about it (and thinking actually to CUs, rather than to any cache
> sharing relationship), does make sense for this feature.
> 
> Does this make sense, or am I missing or misinterpreting anything?

Well, it effectively answers the question I had raised: "What about HT
vs AMD Fam15's CUs? Do you want both to be treated the same here?"

Jan
Dario Faggioli May 29, 2020, 10:19 a.m. UTC | #15
On Fri, 2020-05-29 at 11:58 +0200, Jan Beulich wrote:
> On 28.05.2020 16:55, Dario Faggioli wrote:
> > 
> > Which means I will be treating HTs and CUs the same which, thinking
> > more about it (and thinking actually to CUs, rather than to any
> > cache
> > sharing relationship), does make sense for this feature.
> > 
> > Does this make sense, or am I missing or misinterpreting anything?
> 
> Well, it effectively answers the question I had raised: "What about
> HT
> vs AMD Fam15's CUs? Do you want both to be treated the same here?"
> 
Exactly! :-)

And, in fact, the answer is "Yes, I want them being treated the same,
and the patches are actually doing that".

Sorry for not getting it right away, and just replying like that in the
first place.

At least, this led to better (more generic) comments and variable names
in v2 of this patch.

Regards

Patch

diff --git a/xen/common/sched/cpupool.c b/xen/common/sched/cpupool.c
index d40345b585..0227457285 100644
--- a/xen/common/sched/cpupool.c
+++ b/xen/common/sched/cpupool.c
@@ -37,7 +37,7 @@  static cpumask_t cpupool_locked_cpus;
 
 static DEFINE_SPINLOCK(cpupool_lock);
 
-static enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu;
+enum sched_gran __read_mostly opt_sched_granularity = SCHED_GRAN_cpu;
 static unsigned int __read_mostly sched_granularity = 1;
 
 #ifdef CONFIG_HAS_SCHED_GRANULARITY
diff --git a/xen/common/sched/credit2.c b/xen/common/sched/credit2.c
index 697c9f917d..abe4d048c8 100644
--- a/xen/common/sched/credit2.c
+++ b/xen/common/sched/credit2.c
@@ -471,6 +471,16 @@  static int __init parse_credit2_runqueue(const char *s)
 }
 custom_param("credit2_runqueue", parse_credit2_runqueue);
 
+/*
+ * How many CPUs will be put, at most, in the same runqueue.
+ * Runqueues are still arranged according to the host topology (and
+ * according to the value of the 'credit2_runqueue' parameter). But
+ * we also have a cap to the number of CPUs that share runqueues.
+ * As soon as we reach the limit, a new runqueue will be created.
+ */
+static unsigned int __read_mostly opt_max_cpus_runqueue = 16;
+integer_param("sched_credit2_max_cpus_runqueue", opt_max_cpus_runqueue);
+
 /*
  * Per-runqueue data
  */
@@ -852,14 +862,61 @@  cpu_runqueue_match(const struct csched2_runqueue_data *rqd, unsigned int cpu)
            (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu));
 }
 
+/* Additional checks, to avoid separating siblings in different runqueues. */
+static bool
+cpu_runqueue_smt_match(const struct csched2_runqueue_data *rqd, unsigned int cpu)
+{
+    unsigned int nr_sibl = cpumask_weight(per_cpu(cpu_sibling_mask, cpu));
+    unsigned int rcpu, nr_smts = 0;
+
+    /*
+     * If we put the CPU in this runqueue, we must be sure that there will
+     * be enough room for accepting its hyperthread sibling(s) here as well.
+     */
+    cpumask_clear(cpumask_scratch_cpu(cpu));
+    for_each_cpu ( rcpu, &rqd->active )
+    {
+        ASSERT(rcpu != cpu);
+        if ( !cpumask_test_cpu(rcpu, cpumask_scratch_cpu(cpu)) )
+        {
+            /*
+             * For each CPU already in the runqueue, account for it and for
+             * its sibling(s), independently from whether such sibling(s) are
+             * in the runqueue already or not.
+             *
+             * Of course, if there are sibling CPUs in the runqueue already,
+             * only count them once.
+             */
+            cpumask_or(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
+                       per_cpu(cpu_sibling_mask, rcpu));
+            nr_smts += nr_sibl;
+        }
+    }
+    /*
+     * We know that neither the CPU, nor any of its sibling are here,
+     * or we wouldn't even have entered the function.
+     */
+    ASSERT(!cpumask_intersects(cpumask_scratch_cpu(cpu),
+                               per_cpu(cpu_sibling_mask, cpu)));
+
+    /* Try adding CPU and its sibling(s) to the count and check... */
+    nr_smts += nr_sibl;
+
+    if ( nr_smts <= opt_max_cpus_runqueue )
+        return true;
+
+    return false;
+}
+
 static struct csched2_runqueue_data *
 cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu)
 {
     struct csched2_runqueue_data *rqd, *rqd_new;
+    struct csched2_runqueue_data *rqd_valid = NULL;
     struct list_head *rqd_ins;
     unsigned long flags;
     int rqi = 0;
-    bool rqi_unused = false, rqd_valid = false;
+    bool rqi_unused = false;
 
     /* Prealloc in case we need it - not allowed with interrupts off. */
     rqd_new = xzalloc(struct csched2_runqueue_data);
@@ -873,11 +930,44 @@  cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu)
         if ( !rqi_unused && rqd->id > rqi )
             rqi_unused = true;
 
-        if ( cpu_runqueue_match(rqd, cpu) )
+        /*
+         * Check whether the CPU should (according to the topology) and also
+         * can (if we there aren't too many already) go in this runqueue.
+         */
+        if ( rqd->refcnt < opt_max_cpus_runqueue &&
+             cpu_runqueue_match(rqd, cpu) )
         {
-            rqd_valid = true;
-            break;
+            cpumask_t *siblings = per_cpu(cpu_sibling_mask, cpu);
+
+            dprintk(XENLOG_DEBUG, "CPU %d matches runq %d, cpus={%*pbl} (max %d)\n",
+                    cpu, rqd->id, CPUMASK_PR(&rqd->active),
+                    opt_max_cpus_runqueue);
+
+            /*
+             * If we're using core (or socket!) scheduling, or we don't have
+             * hyperthreading, no need to do any further checking.
+             *
+             * If no (to both), but our sibling is already in this runqueue,
+             * then it's also ok for the CPU to stay in this runqueue..
+             *
+             * Otherwise, do some more checks, to better account for SMT.
+             */
+            if ( opt_sched_granularity != SCHED_GRAN_cpu ||
+                 cpumask_weight(siblings) <= 1 ||
+                 cpumask_intersects(&rqd->active, siblings) )
+            {
+                dprintk(XENLOG_DEBUG, "runq %d selected\n", rqd->id);
+                rqd_valid = rqd;
+                break;
+            }
+            else if ( cpu_runqueue_smt_match(rqd, cpu) )
+            {
+                dprintk(XENLOG_DEBUG, "considering runq %d...\n", rqd->id);
+                rqd_valid = rqd;
+            }
         }
+	else
+            dprintk(XENLOG_DEBUG, "ignoring runq %d\n", rqd->id);
 
         if ( !rqi_unused )
         {
@@ -900,6 +990,12 @@  cpu_add_to_runqueue(struct csched2_private *prv, unsigned int cpu)
         rqd->pick_bias = cpu;
         rqd->id = rqi;
     }
+    else
+        rqd = rqd_valid;
+
+    printk(XENLOG_INFO "CPU %d (sibling={%*pbl}) will go to runqueue %d with {%*pbl}\n",
+           cpu, CPUMASK_PR(per_cpu(cpu_sibling_mask, cpu)), rqd->id,
+           CPUMASK_PR(&rqd->active));
 
     rqd->refcnt++;
 
diff --git a/xen/common/sched/private.h b/xen/common/sched/private.h
index 367811a12f..e964e3f407 100644
--- a/xen/common/sched/private.h
+++ b/xen/common/sched/private.h
@@ -30,6 +30,8 @@  enum sched_gran {
     SCHED_GRAN_socket
 };
 
+extern enum sched_gran opt_sched_granularity;
+
 /*
  * In order to allow a scheduler to remap the lock->cpu mapping,
  * we have a per-cpu pointer, along with a pre-allocated set of