
xen/sched: fix credit2 smt idle handling

Message ID 20190322090431.28112-1-jgross@suse.com (mailing list archive)
State Superseded
Series xen/sched: fix credit2 smt idle handling

Commit Message

Jürgen Groß March 22, 2019, 9:04 a.m. UTC
Credit2's smt_idle_mask_set() and smt_idle_mask_clear() are used to
identify idle cores to which vcpus can be moved. A core is considered
idle when all siblings are known to have the idle vcpu running on
them.

Unfortunately the information about which vcpu is running on a cpu is
kept per runqueue. So in case not all siblings are in the same runqueue,
a core will never be regarded as idle, as a sibling outside the runqueue
is never known to run the idle vcpu.

This problem can be solved by and-ing the core's sibling cpumask with
the runqueue's active mask before doing the idle test.

In order to avoid having to allocate another cpumask, the interfaces of
smt_idle_mask_set() and smt_idle_mask_clear() are modified to take the
runqueue data pointer instead of a mask, as those functions are always
called with the same masks as parameters.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/common/sched_credit2.c | 35 ++++++++++++++++-------------------
 1 file changed, 16 insertions(+), 19 deletions(-)

Comments

Dario Faggioli March 23, 2019, 2:32 a.m. UTC | #1
On Fri, 2019-03-22 at 10:04 +0100, Juergen Gross wrote:
> Credit2's smt_idle_mask_set() and smt_idle_mask_clear() are used to
> identify idle cores to which vcpus can be moved. A core is considered
> idle when all siblings are known to have the idle vcpu running on
> them.
> 
> Unfortunately the information about which vcpu is running on a cpu is
> kept per runqueue. So in case not all siblings are in the same runqueue,
> a core will never be regarded as idle, as a sibling outside the runqueue
> is never known to run the idle vcpu.
> 
Good catch.

I apparently forgot to take care of this when I introduced the
possibility of having one runqueue per single CPU (which, in an SMT
enabled system, means per-thread runqueues).

> This problem can be solved by and-ing the core's sibling cpumask with
> the runqueue's active mask before doing the idle test.
> 
Right. There's one thing, though. Using one runq per CPU, in this
scheduler, is a really poor choice, and I basically would recommend it
only for testing or debugging (and this should probably be highlighted
a lot better in the docs).

Therefore, I'm a bit reluctant to add another cpumask bitwise
operation in hot paths just to take care of it.

Note that this also applies to cpupools, i.e., I also consider putting
two sibling hyperthreads in different pools a very poor choice. If you
recall, I even sent a patch to forbid doing that (which is still
blocked on a series of yours for passing parameters from the tools to
the hypervisor).

The only case I care about is a CPU being off-lined.

So, one thing that we could do is to put credit2_runqueue=cpu inside
such #ifdef-s too (and I can prepare a patch to that effect myself, if
you want).

To properly deal with offline CPUs, I think we can change the logic a
little, i.e., we initialize the smt_idle mask to all-1 (all CPUs idle),
and we also make sure that we set the CPU bit (instead of clearing it)
in smt_idle, when we remove the CPU from the scheduler.
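
To make this a bit more concrete, here is a minimal sketch of what I
have in mind (the function names are made up for illustration, the real
hook points in sched_credit2.c may differ, and this is untested):

static void runq_init_smt_idle(struct csched2_runqueue_data *rqd)
{
    /* Start from "all CPUs idle": a CPU that never joins this runqueue
     * (because it is offline, or in another pool) then cannot veto the
     * idleness of its core. */
    cpumask_setall(&rqd->smt_idle);
}

static void runq_remove_cpu(struct csched2_runqueue_data *rqd,
                            unsigned int cpu)
{
    __cpumask_clear_cpu(cpu, &rqd->active);
    /* Set the bit instead of clearing it: a removed CPU counts as idle. */
    __cpumask_set_cpu(cpu, &rqd->smt_idle);
}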

Does doing these things (negatively) affect the other patch series
you're working on?

> In order to avoid having to allocate another cpumask, the interfaces
> of smt_idle_mask_set() and smt_idle_mask_clear() are modified to take
> the runqueue data pointer instead of a mask, as those functions are
> always called with the same masks as parameters.
> 
The interface change, I'm happy with it.

Oh, and regarding the other patches you sent: as you know, I've been
travelling this week, so I couldn't look at them.

I will let you have my comments ASAP next week.

Regards,
Dario
Dario Faggioli March 23, 2019, 12:34 p.m. UTC | #2
On Sat, 2019-03-23 at 03:32 +0100, Dario Faggioli wrote:
> On Fri, 2019-03-22 at 10:04 +0100, Juergen Gross wrote:
> > This problem can be solved by and-ing the core's sibling cpumask
> > with the runqueue's active mask before doing the idle test.
> > 
> 
> Therefore, I'm a bit reluctant to add another cpumask bitwise
> operation in hot paths just to take care of it.
> 
> Note that this also applies to cpupools, i.e., I also consider putting
> two sibling hyperthreads in different pools a very poor choice. If you
> recall, I even sent a patch to forbid doing that (which is still
> blocked on a series of yours for passing parameters from the tools to
> the hypervisor).
> 
> The only case I care about is a CPU being off-lined.
> 
> So, one thing that we could do is to put credit2_runqueue=cpu inside
> such #ifdef-s too (and I can prepare a patch to that effect myself,
> if you want).
> 
> To properly deal with offline CPUs, I think we can change the logic a
> little, i.e., we initialize the smt_idle mask to all-1 (all CPUs
> idle), and we also make sure that we set the CPU bit (instead of
> clearing it)
> in smt_idle, when we remove the CPU from the scheduler.
> 
Which, thinking more about this, should also serve as a solution for
the issue addressed by this patch.

We need to check whether all the other places where smt_idle is used
would be ok with that (I think they are, as we pretty much always end
up &&-ing it with online or active anyway), or if they need some
reshuffling (and if yes, what that would look like).
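
For instance, a consumer of smt_idle would then look something like
this (a sketch only; pick_smt_idle_cpu() is a hypothetical helper, and
real code would use a scratch mask rather than one on the stack):

static int pick_smt_idle_cpu(const struct csched2_runqueue_data *rqd,
                             const cpumask_t *online, unsigned int cpu)
{
    cpumask_t mask;

    /* Bits of offline/foreign CPUs may be stale (or all-ones, with the
     * proposed initialization), so restrict to CPUs we schedule on. */
    cpumask_and(&mask, &rqd->smt_idle, online);

    /* Pick a CPU on a fully idle core, if there is one. */
    return cpumask_empty(&mask) ? -1 : cpumask_cycle(cpu, &mask);
}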

I can check this on Monday.

Regards,
Dario
Jürgen Groß March 23, 2019, 1:22 p.m. UTC | #3
On 23/03/2019 03:32, Dario Faggioli wrote:
> On Fri, 2019-03-22 at 10:04 +0100, Juergen Gross wrote:
>> Credit2's smt_idle_mask_set() and smt_idle_mask_clear() are used to
>> identify idle cores to which vcpus can be moved. A core is considered
>> idle when all siblings are known to have the idle vcpu running on
>> them.
>>
>> Unfortunately the information about which vcpu is running on a cpu is
>> kept per runqueue. So in case not all siblings are in the same runqueue,
>> a core will never be regarded as idle, as a sibling outside the runqueue
>> is never known to run the idle vcpu.
>>
> Good catch.
> 
> I apparently forgot to take care of this when I introduced the
> possibility of having one runqueue per single CPU (which, in an SMT
> enabled system, means per-thread runqueues).
> 
>> This problem can be solved by and-ing the core's sibling cpumask with
>> the runqueue's active mask before doing the idle test.
>>
> Right. There's one thing, though. Using one runq per CPU, in this
> scheduler, is a really poor choice, and I basically would recommend it
> only for testing or debugging (and this should probably be highlighted
> a lot better in the docs).
> 
> Therefore, I'm a bit reluctant to add another cpumask bitwise
> operation in hot paths just to take care of it.
> 
> Note that this also applies to cpupools, i.e., I also consider putting
> two sibling hyperthreads in different pools a very poor choice. If you
> recall, I even sent a patch to forbid doing that (which is still
> blocked on a series of yours for passing parameters from the tools to
> the hypervisor).
> 
> The only case I care about is a CPU being off-lined.

In my core scheduling solution we only ever have one active sibling per
core.

> So, one thing that we could do is to put credit2_runqueue=cpu inside
> such #ifdef-s too (and I can prepare a patch to that effect myself, if
> you want).
> 
> To properly deal with offline CPUs, I think we can change the logic a
> little, i.e., we initialize the smt_idle mask to all-1 (all CPUs idle),
> and we also make sure that we set the CPU bit (instead of clearing it)
> in smt_idle, when we remove the CPU from the scheduler.

How does that help?

Only if all siblings are marked as idle in rqd->idle do we set any bits
in rqd->smt_idle (all siblings).

Or did you mean rqd->idle instead?

This might be problematic in case of runqueue per cpu, though.

Another idea: we could introduce a credit2 pcpu data cpumask pointer
as a replacement for the cpu_sibling_mask. For runqueue per cpu it
would point to cpumask_of(cpu), for the "normal case" it would point to
the correct cpu_sibling_mask, and for special cases we could allocate a
mask if needed.
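
Roughly like the following (just a sketch; the sibling_mask field and
the init function below are illustrative only, not the actual V2
patch):

struct csched2_pcpu {
    int runq_id;
    /* Used in place of per_cpu(cpu_sibling_mask, cpu) in the hot paths. */
    const cpumask_t *sibling_mask;
};

static void init_sibling_mask(struct csched2_pcpu *spc, unsigned int cpu)
{
    if ( opt_runqueue == OPT_RUNQUEUE_CPU )
        /* Runqueue per cpu: the cpu is its only relevant sibling. */
        spc->sibling_mask = cpumask_of(cpu);
    else
        /* Normal case: point at the topology-provided sibling mask. */
        spc->sibling_mask = per_cpu(cpu_sibling_mask, cpu);
}

smt_idle_mask_set() and smt_idle_mask_clear() could then use
spc->sibling_mask directly and avoid the extra cpumask_and() in the
hot paths.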


Juergen
Dario Faggioli March 25, 2019, 10:53 a.m. UTC | #4
On Sat, 2019-03-23 at 14:22 +0100, Juergen Gross wrote:
> On 23/03/2019 03:32, Dario Faggioli wrote:
> > To properly deal with offline CPUs, I think we can change the logic
> > a little, i.e., we initialize the smt_idle mask to all-1 (all CPUs
> > idle), and we also make sure that we set the CPU bit (instead of
> > clearing it) in smt_idle, when we remove the CPU from the scheduler.
> 
> How does that help?
> 
> Only if all siblings are marked as idle in rqd->idle do we set any bits
> in rqd->smt_idle (all siblings).
> 
> Or did you mean rqd->idle instead?
> 
Yep, indeed I was thinking of rqd->idle, sorry for the confusion. :-)

> This might be problematic in case of runqueue per cpu, though.
> 
Mmm... So, my point here is, when a CPU does not belong to a scheduler
(and, in case of Credit2, even when it does not belong to a runqueue)
we "consider" it as being either idle or busy, depending on whether we
set or clear the relevant bit.

But truth is, we don't have a way to know if it is really idle or not,
so we really shouldn't rely on such info.

Therefore, we should:
- make sure that we actually don't, or it's a bug (which is the point
of this patch ;-P)
- decide whether to set or clear the bit based on what's more
convenient (e.g., because it saves some cpumask operation).

Anyway...

> Another idea: we could introduce a credit2 pcpu data cpumask pointer
> as a replacement for the cpu_sibling_mask. For runqueue per cpu it
> would point to cpumask_of(cpu), for the "normal case" it would point
> to the correct cpu_sibling_mask, and for special cases we could
> allocate a mask if needed.
> 
... I think I am fine with this idea.

Dario
Jürgen Groß March 25, 2019, 11:42 a.m. UTC | #5
On 25/03/2019 11:53, Dario Faggioli wrote:
> On Sat, 2019-03-23 at 14:22 +0100, Juergen Gross wrote:
>> On 23/03/2019 03:32, Dario Faggioli wrote:
>>> To properly deal with offline CPUs, I think we can change the logic
>>> a little, i.e., we initialize the smt_idle mask to all-1 (all CPUs
>>> idle), and we also make sure that we set the CPU bit (instead of
>>> clearing it) in smt_idle, when we remove the CPU from the scheduler.
>>
>> How does that help?
>>
>> Only if all siblings are marked as idle in rqd->idle do we set any bits
>> in rqd->smt_idle (all siblings).
>>
>> Or did you mean rqd->idle instead?
>>
> Yep, indeed I was thinking of rqd->idle, sorry for the confusion. :-)
> 
>> This might be problematic in case of runqueue per cpu, though.
>>
> Mmm... So, my point here is, when a CPU does not belong to a scheduler
> (and, in case of Credit2, even when it does not belong to a runqueue)
> we "consider" it as being either idle or busy, depending on whether we
> set or clear the relevant bit.
> 
> But truth is, we don't have a way to know if it is really idle or not,
> so we really shouldn't rely on such info.
> 
> Therefore, we should:
> - make sure that we actually don't, or it's a bug (which is the point
> of this patch ;-P)
> - decide whether to set or clear the bit based on what's more
> convenient (e.g., because it saves some cpumask operation).
> 
> Anyway...
> 
>> Another idea: we could introduce a credit2 pcpu data cpumask pointer
>> as a replacement for the cpu_sibling_mask. For runqueue per cpu it
>> would point to cpumask_of(cpu), for the "normal case" it would point
>> to the correct cpu_sibling_mask, and for special cases we could
>> allocate a mask if needed.
>>
> ... I think I am fine with this idea.

Preparing V2 of the patch...


Juergen

Patch

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 543dc3664d..ab50e7ad23 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -638,7 +638,8 @@  static inline bool has_cap(const struct csched2_vcpu *svc)
 
 /*
  * If all the siblings of cpu (including cpu itself) are both idle and
- * untickled, set all their bits in mask.
+ * untickled, set all their bits in mask. Note that only siblings handled
+ * by the rqd can be taken into account.
  *
  * NB that rqd->smt_idle is different than rqd->idle.  rqd->idle
 * records pcpus that are merely idle (i.e., at the moment do not
@@ -653,25 +654,23 @@  static inline bool has_cap(const struct csched2_vcpu *svc)
  * changes.
  */
 static inline
-void smt_idle_mask_set(unsigned int cpu, const cpumask_t *idlers,
-                       cpumask_t *mask)
+void smt_idle_mask_set(unsigned int cpu, struct csched2_runqueue_data *rqd)
 {
-    const cpumask_t *cpu_siblings = per_cpu(cpu_sibling_mask, cpu);
-
-    if ( cpumask_subset(cpu_siblings, idlers) )
-        cpumask_or(mask, mask, cpu_siblings);
+    cpumask_and(cpumask_scratch, per_cpu(cpu_sibling_mask, cpu), &rqd->active);
+    if ( cpumask_subset(cpumask_scratch, &rqd->idle) &&
+         !cpumask_intersects(cpumask_scratch, &rqd->tickled) )
+        cpumask_or(&rqd->smt_idle, &rqd->smt_idle, cpumask_scratch);
 }
 
 /*
  * Clear the bits of all the siblings of cpu from mask (if necessary).
  */
 static inline
-void smt_idle_mask_clear(unsigned int cpu, cpumask_t *mask)
+void smt_idle_mask_clear(unsigned int cpu, struct csched2_runqueue_data *rqd)
 {
-    const cpumask_t *cpu_siblings = per_cpu(cpu_sibling_mask, cpu);
-
-    if ( cpumask_subset(cpu_siblings, mask) )
-        cpumask_andnot(mask, mask, per_cpu(cpu_sibling_mask, cpu));
+    cpumask_and(cpumask_scratch, per_cpu(cpu_sibling_mask, cpu), &rqd->active);
+    if ( cpumask_subset(cpumask_scratch, &rqd->smt_idle) )
+        cpumask_andnot(&rqd->smt_idle, &rqd->smt_idle, cpumask_scratch);
 }
 
 /*
@@ -1323,7 +1322,7 @@  static inline void
 tickle_cpu(unsigned int cpu, struct csched2_runqueue_data *rqd)
 {
     __cpumask_set_cpu(cpu, &rqd->tickled);
-    smt_idle_mask_clear(cpu, &rqd->smt_idle);
+    smt_idle_mask_clear(cpu, rqd);
     cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
 }
 
@@ -3468,8 +3467,7 @@  csched2_schedule(
     if ( tickled )
     {
         __cpumask_clear_cpu(cpu, &rqd->tickled);
-        cpumask_andnot(cpumask_scratch, &rqd->idle, &rqd->tickled);
-        smt_idle_mask_set(cpu, cpumask_scratch, &rqd->smt_idle);
+        smt_idle_mask_set(cpu, rqd);
     }
 
     if ( unlikely(tb_init_done) )
@@ -3553,7 +3551,7 @@  csched2_schedule(
         if ( cpumask_test_cpu(cpu, &rqd->idle) )
         {
             __cpumask_clear_cpu(cpu, &rqd->idle);
-            smt_idle_mask_clear(cpu, &rqd->smt_idle);
+            smt_idle_mask_clear(cpu, rqd);
         }
 
         /*
@@ -3599,14 +3597,13 @@  csched2_schedule(
             if ( cpumask_test_cpu(cpu, &rqd->idle) )
             {
                 __cpumask_clear_cpu(cpu, &rqd->idle);
-                smt_idle_mask_clear(cpu, &rqd->smt_idle);
+                smt_idle_mask_clear(cpu, rqd);
             }
         }
         else if ( !cpumask_test_cpu(cpu, &rqd->idle) )
         {
             __cpumask_set_cpu(cpu, &rqd->idle);
-            cpumask_andnot(cpumask_scratch, &rqd->idle, &rqd->tickled);
-            smt_idle_mask_set(cpu, cpumask_scratch, &rqd->smt_idle);
+            smt_idle_mask_set(cpu, rqd);
         }
         /* Make sure avgload gets updated periodically even
          * if there's no activity */