Message ID | 20190322090431.28112-1-jgross@suse.com (mailing list archive)
---|---
State | Superseded
Series | xen/sched: fix credit2 smt idle handling
On Fri, 2019-03-22 at 10:04 +0100, Juergen Gross wrote:
> Credit2's smt_idle_mask_set() and smt_idle_mask_clear() are used to
> identify idle cores where vcpus can be moved to. A core is thought to
> be idle when all siblings are known to have the idle vcpu running on
> them.
>
> Unfortunately the information of a vcpu running on a cpu is per
> runqueue. So in case not all siblings are in the same runqueue a core
> will never be regarded to be idle, as the sibling not in the runqueue
> is never known to run the idle vcpu.
>
Good catch.

I apparently forgot to take care of this when I introduced the
possibility of having one runqueue per single CPU (which, in an SMT
enabled system, means per-thread runqueues).

> This problem can be solved by and-ing the core's sibling cpumask with
> the runqueue's active mask before doing the idle test.
>
Right. There's one thing, though. Using one runq per CPU in this
scheduler is a really poor choice, and I would basically recommend it
only for testing or debugging (and this should probably be highlighted
a lot better in the docs).

Therefore, I'm a bit reluctant to add another cpumask bitwise
operation in hot paths just to take care of it.

Note that this also applies to cpupools, i.e., I also consider it a
very poor choice to put two sibling hyperthreads in different pools.
If you recall, I even sent a patch to forbid doing that (which is
still blocked on a series of yours for passing parameters from the
tools to the hypervisor).

The only case I care about is a CPU being off-lined.

So, one thing that we could do is to put credit2_runqueue=cpu inside
such #ifdef-s too (and I can prepare a patch to that effect myself, if
you want).

To properly deal with offline CPUs, I think we can change the logic a
little, i.e., we initialize the smt_idle mask to all-1 (all CPUs
idle), and we also make sure that we set the CPU bit (instead of
clearing it) in smt_idle when we remove the CPU from the scheduler.

Does doing these things (negatively) affect the other patch series
you're working on?

> In order for not having to allocate another cpumask the interfaces of
> smt_idle_mask_set() and smt_idle_mask_clear() are modified to not
> take a mask as input, but the runqueue data pointer, as those
> functions are always called with the same masks as parameters.
>
The interface change, I'm happy with it.

Oh, and about the other patches you sent: as you know, I've been
travelling this week, so I couldn't look at them. I will let you have
my comments ASAP next week.

Regards,
Dario
On Sat, 2019-03-23 at 03:32 +0100, Dario Faggioli wrote:
> On Fri, 2019-03-22 at 10:04 +0100, Juergen Gross wrote:
> > This problem can be solved by and-ing the core's sibling cpumask
> > with the runqueue's active mask before doing the idle test.
> >
> Therefore, I'm a bit reluctant to add another cpumask bitwise
> operation in hot paths just to take care of it.
>
> Note that this also applies to cpupools, i.e., I also consider it a
> very poor choice to put two sibling hyperthreads in different pools.
> If you recall, I even sent a patch to forbid doing that (which is
> still blocked on a series of yours for passing parameters from the
> tools to the hypervisor).
>
> The only case I care about is a CPU being off-lined.
>
> So, one thing that we could do is to put credit2_runqueue=cpu inside
> such #ifdef-s too (and I can prepare a patch to that effect myself,
> if you want).
>
> To properly deal with offline CPUs, I think we can change the logic a
> little, i.e., we initialize the smt_idle mask to all-1 (all CPUs
> idle), and we also make sure that we set the CPU bit (instead of
> clearing it) in smt_idle, when we remove the CPU from the scheduler.
>
Which, thinking more about this, should also serve as a solution for
the issue addressed by this patch.

We need to check whether all the other places where smt_idle is used
would be ok with that (I think they are, as we pretty much always end
up &&-ing it with online or active anyway), or if they need some
reshuffling (and if so, what that looks like).

I can check this on Monday.

Regards,
Dario
On 23/03/2019 03:32, Dario Faggioli wrote:
> On Fri, 2019-03-22 at 10:04 +0100, Juergen Gross wrote:
>> Credit2's smt_idle_mask_set() and smt_idle_mask_clear() are used to
>> identify idle cores where vcpus can be moved to. A core is thought to
>> be idle when all siblings are known to have the idle vcpu running on
>> them.
>>
>> Unfortunately the information of a vcpu running on a cpu is per
>> runqueue. So in case not all siblings are in the same runqueue a core
>> will never be regarded to be idle, as the sibling not in the runqueue
>> is never known to run the idle vcpu.
>>
> Good catch.
>
> I apparently forgot to take care of this when I introduced the
> possibility of having one runqueue per single CPU (which, in an SMT
> enabled system, means per-thread runqueues).
>
>> This problem can be solved by and-ing the core's sibling cpumask with
>> the runqueue's active mask before doing the idle test.
>>
> Right. There's one thing, though. Using one runq per CPU in this
> scheduler is a really poor choice, and I would basically recommend it
> only for testing or debugging (and this should probably be highlighted
> a lot better in the docs).
>
> Therefore, I'm a bit reluctant to add another cpumask bitwise
> operation in hot paths just to take care of it.
>
> Note that this also applies to cpupools, i.e., I also consider it a
> very poor choice to put two sibling hyperthreads in different pools.
> If you recall, I even sent a patch to forbid doing that (which is
> still blocked on a series of yours for passing parameters from the
> tools to the hypervisor).
>
> The only case I care about is a CPU being off-lined.

In my core scheduling solution we only ever have one active sibling
per core.

> So, one thing that we could do is to put credit2_runqueue=cpu inside
> such #ifdef-s too (and I can prepare a patch to that effect myself,
> if you want).
>
> To properly deal with offline CPUs, I think we can change the logic a
> little, i.e., we initialize the smt_idle mask to all-1 (all CPUs
> idle), and we also make sure that we set the CPU bit (instead of
> clearing it) in smt_idle, when we remove the CPU from the scheduler.

How does that help?

Only if all siblings are marked as idle in rqd->idle do we set any
bits in rqd->smt_idle (all siblings).

Or did you mean rqd->idle instead?

This might be problematic in case of runqueue per cpu, though.

Another idea: we could introduce a credit2 pcpu data cpumask pointer
as a replacement for cpu_sibling_mask. For runqueue per cpu it would
point to cpumask_of(cpu), for the "normal" case it would point to the
correct cpu_sibling_mask, and for special cases we could allocate a
mask if needed.


Juergen
On Sat, 2019-03-23 at 14:22 +0100, Juergen Gross wrote:
> On 23/03/2019 03:32, Dario Faggioli wrote:
> > To properly deal with offline CPUs, I think we can change the logic
> > a little, i.e., we initialize the smt_idle mask to all-1 (all CPUs
> > idle), and we also make sure that we set the CPU bit (instead of
> > clearing it) in smt_idle, when we remove the CPU from the
> > scheduler.
>
> How does that help?
>
> Only if all siblings are marked as idle in rqd->idle do we set any
> bits in rqd->smt_idle (all siblings).
>
> Or did you mean rqd->idle instead?
>
Yep, indeed I was thinking of rqd->idle, sorry for the confusion. :-)

> This might be problematic in case of runqueue per cpu, though.
>
Mmm... So, my point here is: when a CPU does not belong to a scheduler
(and, in the case of Credit2, even when it does not belong to a
runqueue) we "consider" it as being either idle or busy, depending on
whether we set or clear the relevant bit.

But the truth is, we don't have a way to know whether it is really
idle or not, so we really shouldn't rely on such info.

Therefore, we should:
- make sure that we actually don't, or it's a bug (which is the point
  of this patch ;-P)
- decide whether to set or clear the bit based on what's more
  convenient (e.g., because it saves some cpumask operation).

Anyway...

> Another idea: we could introduce a credit2 pcpu data cpumask pointer
> as a replacement for cpu_sibling_mask. For runqueue per cpu it would
> point to cpumask_of(cpu), for the "normal" case it would point to
> the correct cpu_sibling_mask, and for special cases we could
> allocate a mask if needed.
>
... I think I am fine with this idea.

Dario
On 25/03/2019 11:53, Dario Faggioli wrote:
> On Sat, 2019-03-23 at 14:22 +0100, Juergen Gross wrote:
>> Another idea: we could introduce a credit2 pcpu data cpumask pointer
>> as a replacement for cpu_sibling_mask. For runqueue per cpu it would
>> point to cpumask_of(cpu), for the "normal" case it would point to
>> the correct cpu_sibling_mask, and for special cases we could
>> allocate a mask if needed.
>>
> ... I think I am fine with this idea.

Preparing V2 of the patch...


Juergen
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 543dc3664d..ab50e7ad23 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -638,7 +638,8 @@ static inline bool has_cap(const struct csched2_vcpu *svc)
 
 /*
  * If all the siblings of cpu (including cpu itself) are both idle and
- * untickled, set all their bits in mask.
+ * untickled, set all their bits in mask. Note that only siblings handled
+ * by the rqd can be taken into account.
  *
  * NB that rqd->smt_idle is different than rqd->idle. rqd->idle
  * records pcpus that are merely idle (i.e., at the moment do not
@@ -653,25 +654,23 @@ static inline bool has_cap(const struct csched2_vcpu *svc)
  * changes.
  */
 static inline
-void smt_idle_mask_set(unsigned int cpu, const cpumask_t *idlers,
-                       cpumask_t *mask)
+void smt_idle_mask_set(unsigned int cpu, struct csched2_runqueue_data *rqd)
 {
-    const cpumask_t *cpu_siblings = per_cpu(cpu_sibling_mask, cpu);
-
-    if ( cpumask_subset(cpu_siblings, idlers) )
-        cpumask_or(mask, mask, cpu_siblings);
+    cpumask_and(cpumask_scratch, per_cpu(cpu_sibling_mask, cpu), &rqd->active);
+    if ( cpumask_subset(cpumask_scratch, &rqd->idle) &&
+         !cpumask_intersects(cpumask_scratch, &rqd->tickled) )
+        cpumask_or(&rqd->smt_idle, &rqd->smt_idle, cpumask_scratch);
 }
 
 /*
  * Clear the bits of all the siblings of cpu from mask (if necessary).
  */
 static inline
-void smt_idle_mask_clear(unsigned int cpu, cpumask_t *mask)
+void smt_idle_mask_clear(unsigned int cpu, struct csched2_runqueue_data *rqd)
 {
-    const cpumask_t *cpu_siblings = per_cpu(cpu_sibling_mask, cpu);
-
-    if ( cpumask_subset(cpu_siblings, mask) )
-        cpumask_andnot(mask, mask, per_cpu(cpu_sibling_mask, cpu));
+    cpumask_and(cpumask_scratch, per_cpu(cpu_sibling_mask, cpu), &rqd->active);
+    if ( cpumask_subset(cpumask_scratch, &rqd->smt_idle) )
+        cpumask_andnot(&rqd->smt_idle, &rqd->smt_idle, cpumask_scratch);
 }
 
 /*
@@ -1323,7 +1322,7 @@ static inline
 void tickle_cpu(unsigned int cpu, struct csched2_runqueue_data *rqd)
 {
     __cpumask_set_cpu(cpu, &rqd->tickled);
-    smt_idle_mask_clear(cpu, &rqd->smt_idle);
+    smt_idle_mask_clear(cpu, rqd);
     cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
 }
 
@@ -3468,8 +3467,7 @@ csched2_schedule(
     if ( tickled )
     {
         __cpumask_clear_cpu(cpu, &rqd->tickled);
-        cpumask_andnot(cpumask_scratch, &rqd->idle, &rqd->tickled);
-        smt_idle_mask_set(cpu, cpumask_scratch, &rqd->smt_idle);
+        smt_idle_mask_set(cpu, rqd);
     }
 
     if ( unlikely(tb_init_done) )
@@ -3553,7 +3551,7 @@ csched2_schedule(
         if ( cpumask_test_cpu(cpu, &rqd->idle) )
         {
             __cpumask_clear_cpu(cpu, &rqd->idle);
-            smt_idle_mask_clear(cpu, &rqd->smt_idle);
+            smt_idle_mask_clear(cpu, rqd);
         }
 
         /*
@@ -3599,14 +3597,13 @@ csched2_schedule(
             if ( cpumask_test_cpu(cpu, &rqd->idle) )
             {
                 __cpumask_clear_cpu(cpu, &rqd->idle);
-                smt_idle_mask_clear(cpu, &rqd->smt_idle);
+                smt_idle_mask_clear(cpu, rqd);
             }
         }
         else if ( !cpumask_test_cpu(cpu, &rqd->idle) )
         {
             __cpumask_set_cpu(cpu, &rqd->idle);
-            cpumask_andnot(cpumask_scratch, &rqd->idle, &rqd->tickled);
-            smt_idle_mask_set(cpu, cpumask_scratch, &rqd->smt_idle);
+            smt_idle_mask_set(cpu, rqd);
         }
 
         /* Make sure avgload gets updated periodically even
          * if there's no activity */
Credit2's smt_idle_mask_set() and smt_idle_mask_clear() are used to
identify idle cores where vcpus can be moved to. A core is thought to
be idle when all siblings are known to have the idle vcpu running on
them.

Unfortunately the information of a vcpu running on a cpu is per
runqueue. So in case not all siblings are in the same runqueue a core
will never be regarded to be idle, as the sibling not in the runqueue
is never known to run the idle vcpu.

This problem can be solved by and-ing the core's sibling cpumask with
the runqueue's active mask before doing the idle test.

In order not to have to allocate another cpumask, the interfaces of
smt_idle_mask_set() and smt_idle_mask_clear() are modified to not take
a mask as input, but the runqueue data pointer, as those functions are
always called with the same masks as parameters.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/common/sched_credit2.c | 35 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 19 deletions(-)