Message ID: 158818022727.24327.14309662489731832234.stgit@Palanthas (mailing list archive)
Series: xen: credit2: limit the number of CPUs per runqueue
So, I felt like providing some additional thoughts about this series, from a release point of view (adding Paul).

Timing is *beyond tight*, so if this series, entirely or partly, has any chance to go in, it would be through some form of exception, which of course comes with some risks, etc.

I did work hard to submit the full series, because I wanted people to be able to see the complete solution. However, I think the series itself can be logically split in two parts.

Basically, if we just consider patches 1 and 4, we will end up, right after boot, with a system that has smaller runqueues. They will most likely be balanced in terms of how many CPUs each one has, so a good setup. This will likely (actual differences seem to depend *quite a bit* on the actual workload) be an improvement for very large systems. This is a path that will get a decent amount of testing in OSSTests from now until the day of the release, I think, because booting with the default CPU configuration and setup is what most (all?) OSSTests jobs do.

If the user starts to create pools, we can get to a point where the different runqueues are unbalanced, i.e., each one has a different number of CPUs in it, with respect to the others. This, however:
* can happen already, as of today, without these patches. Whether these patches make things "better" or "worse", from this point of view, is impossible to tell, because it actually depends on which CPUs the user moves among pools, puts offline, etc.
* means that the scheduler has to deal with unbalanced runqueues anyway; if it doesn't, that is a bug, and, again, it seems to me that these patches don't make things particularly better or worse.

So, again, for patches 1 and 4 alone, it looks to me that we'd get improvements, at least in some cases; the codepath will get some testing; and they do not introduce additional or different issues compared to what we already have right now.
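[A purely illustrative sketch of the idea behind the series, in Python rather than the actual Xen C code: the CPUs of each socket are split into runqueues of at most a fixed number of CPUs, kept as balanced as possible. The cap of 16 and the per-socket split reflect the discussion in this thread; the function names and exact splitting strategy here are my own, hypothetical.]

```python
import math

def split_socket_into_runqueues(cpus, max_cpus_per_runqueue=16):
    """Split the CPUs of one socket into balanced runqueues,
    each holding at most max_cpus_per_runqueue CPUs.
    (Illustrative only; not the actual Xen implementation.)"""
    nr_runqueues = math.ceil(len(cpus) / max_cpus_per_runqueue)
    base, extra = divmod(len(cpus), nr_runqueues)
    runqueues, start = [], 0
    for i in range(nr_runqueues):
        # Spread any remainder one CPU at a time across the first queues.
        size = base + (1 if i < extra else 0)
        runqueues.append(cpus[start:start + size])
        start += size
    return runqueues

# A 2-socket, 96-CPU system: each socket's 48 CPUs become 3 runqueues of 16,
# giving 6 balanced runqueues in total right after boot.
sockets = [list(range(0, 48)), list(range(48, 96))]
runqueues = [rq for s in sockets for rq in split_socket_into_runqueues(s)]
```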
They are also at their second iteration, as the original patch series was comprised of exactly those two patches. So, I think it would be interesting if these two patches could be given a chance, even just to get some reviews... and I would be fine going through the formal process necessary to make that happen myself.

Then, there's the rest: the runqueue rebalancing thing. As said above, I personally would love it if we'd release with it, but I see one rather big issue. In fact, such a mechanism is triggered and stressed by the dynamic creation and manipulation of cpupools (and CPU on/off-lining), and we don't have an OSSTests test for that. Therefore, we are not in the best position for catching issues it may have introduced. I can commit to doing some testing myself, but it's not the same thing as having them in our CI, I know that very well.

So, I'd be interested in hearing what others think about these other patches as well, and I am happy to do my best to make sure that they work fine, if we decide to try to include them too, but I do see this as much more of a risk myself.

So, any thoughts? :-)

Thanks and Regards
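[To make concrete what "rebalancing" has to cope with: a toy, hypothetical sketch (Python, not the Xen implementation) of the kind of invariant a rebalancer maintains after cpupool manipulation leaves runqueues uneven: move CPUs from the largest queue to the smallest until sizes differ by at most one.]

```python
def rebalance(runqueues):
    """Illustrative only: even out runqueue sizes by moving one CPU
    at a time from the largest queue to the smallest, until the
    size difference is at most one CPU."""
    while True:
        big = max(runqueues, key=len)
        small = min(runqueues, key=len)
        if len(big) - len(small) <= 1:
            return runqueues
        small.append(big.pop())

# A CPU was moved to another cpupool, leaving the queues unbalanced (4/2/4).
queues = [[0, 1, 2, 3], [4, 5], [6, 7, 8, 9]]
rebalance(queues)  # sizes end up differing by at most one
```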
On Fri, 2020-05-29 at 13:46 +0200, Dario Faggioli wrote:
> Basically, if we just consider patches 1 and 4 we will end up, right
> after boot, with a system that has smaller runqueues.
>
Actually, to be fully precise, given how I reorganized the series, it's not patches 1 and 4: it's patches 1, 3 and 4. Hopefully, that is not a big deal, but if patch 3 really is a problem, I can re-arrange patch 4 to work without it.

Apart from this, and to add more information: on a system with 96 CPUs in 2 sockets, this is how the runqueues look with these patches:

(XEN) Online Cpus: 0-95
(XEN) Cpupool 0:
(XEN) Cpus: 0-95
(XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
(XEN) Scheduler: SMP Credit Scheduler rev2 (credit2)
(XEN) Active queues: 6
(XEN) default-weight = 256
(XEN) Runqueue 0:
(XEN) ncpus = 16
(XEN) cpus = 0-15
(XEN) max_weight = 256
(XEN) pick_bias = 0
(XEN) instload = 0
(XEN) aveload = 1223 (~0%)
(XEN) idlers: 00000000,00000000,00000000,00000000,00000000,00000000,0000ffff
(XEN) tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN) fully idle cores: 00000000,00000000,00000000,00000000,00000000,00000000,0000ffff
(XEN) Runqueue 1:
(XEN) ncpus = 16
(XEN) cpus = 16-31
(XEN) max_weight = 256
(XEN) pick_bias = 16
(XEN) instload = 0
(XEN) aveload = 3324 (~1%)
(XEN) idlers: 00000000,00000000,00000000,00000000,00000000,00000000,ffff0000
(XEN) tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN) fully idle cores: 00000000,00000000,00000000,00000000,00000000,00000000,ffff0000
(XEN) Runqueue 2:
(XEN) ncpus = 16
(XEN) cpus = 32-47
(XEN) max_weight = 256
(XEN) pick_bias = 32
(XEN) instload = 1
(XEN) aveload = 8996 (~3%)
(XEN) idlers: 00000000,00000000,00000000,00000000,00000000,0000feff,00000000
(XEN) tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN) fully idle cores: 00000000,00000000,00000000,00000000,00000000,0000fcff,00000000
(XEN) Runqueue 3:
(XEN) ncpus = 16
(XEN) cpus = 48-63
(XEN) max_weight = 256
(XEN) pick_bias = 48
(XEN) instload = 0
(XEN) aveload = 2424 (~0%)
(XEN) idlers: 00000000,00000000,00000000,00000000,00000000,ffff0000,00000000
(XEN) tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN) fully idle cores: 00000000,00000000,00000000,00000000,00000000,ffff0000,00000000
(XEN) Runqueue 4:
(XEN) ncpus = 16
(XEN) cpus = 64-79
(XEN) max_weight = 256
(XEN) pick_bias = 66
(XEN) instload = 0
(XEN) aveload = 1070 (~0%)
(XEN) idlers: 00000000,00000000,00000000,00000000,0000ffff,00000000,00000000
(XEN) tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN) fully idle cores: 00000000,00000000,00000000,00000000,0000ffff,00000000,00000000
(XEN) Runqueue 5:
(XEN) ncpus = 16
(XEN) cpus = 80-95
(XEN) max_weight = 256
(XEN) pick_bias = 82
(XEN) instload = 0
(XEN) aveload = 425 (~0%)
(XEN) idlers: 00000000,00000000,00000000,00000000,ffff0000,00000000,00000000
(XEN) tickled: 00000000,00000000,00000000,00000000,00000000,00000000,00000000
(XEN) fully idle cores: 00000000,00000000,00000000,00000000,ffff0000,00000000,00000000

Without the patches, there would be just 2 of them (one with CPUs 0-47 and another with CPUs 48-95).
On a system with "just" 16 CPUs, in 2 sockets, they look like this:

(XEN) Online Cpus: 0-15
(XEN) Cpupool 0:
(XEN) Cpus: 0-15
(XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
(XEN) Scheduler: SMP Credit Scheduler rev2 (credit2)
(XEN) Active queues: 2
(XEN) default-weight = 256
(XEN) Runqueue 0:
(XEN) ncpus = 8
(XEN) cpus = 0-7
(XEN) max_weight = 256
(XEN) pick_bias = 0
(XEN) instload = 0
(XEN) aveload = 7077 (~2%)
(XEN) idlers: 00000000,000000ff
(XEN) tickled: 00000000,00000000
(XEN) fully idle cores: 00000000,000000ff
(XEN) Runqueue 1:
(XEN) ncpus = 8
(XEN) cpus = 8-15
(XEN) max_weight = 256
(XEN) pick_bias = 8
(XEN) instload = 1
(XEN) aveload = 11848 (~4%)
(XEN) idlers: 00000000,0000fe00
(XEN) tickled: 00000000,00000000
(XEN) fully idle cores: 00000000,0000fc00

There are still 2 runqueues, because there are 2 sockets, and we still honor the topology (and 8 CPUs in a runqueue is fine, because it is lower than 16).

I'll share the same output on a 256 CPU system, as soon as I finish installing it.

Regards
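[Both layouts follow directly from the topology; a small illustrative Python check, assuming the per-socket split with a cap of 16 CPUs per runqueue as described above (the function name is hypothetical, not Xen code):]

```python
import math

def runqueues_per_system(nr_cpus, nr_sockets, cap=16):
    """Runqueues are per-socket, and each socket's CPUs are further
    split so that no runqueue exceeds `cap` CPUs."""
    per_socket = nr_cpus // nr_sockets
    return nr_sockets * math.ceil(per_socket / cap)

print(runqueues_per_system(96, 2))   # 6 runqueues (of 16 CPUs each)
print(runqueues_per_system(16, 2))   # 2 runqueues (of 8 CPUs each)
```

For the 256-CPU box mentioned above, the resulting count would likewise depend on how many sockets it has.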
> On May 29, 2020, at 12:46 PM, Dario Faggioli <dfaggioli@suse.com> wrote:
>
> So,
>
> I felt like providing some additional thoughts about this series, from
> a release point of view (adding Paul).
>
> Timing is *beyond tight* so if this series, entirely or partly, has any
> chance to go in, it would be through some form of exception, which of
> course comes with some risks, etc.
>
> I did work hard to submit the full series, because I wanted people to
> be able to see the complete solution. However, I think the series
> itself can be logically split in two parts.
>
> Basically, if we just consider patches 1 and 4 we will end up, right
> after boot, with a system that has smaller runqueues. They will most
> likely be balanced in terms of how many CPUs each one has, so a good
> setup. This will likely (actual differences seems to depend *quite a
> bit* on the actual workload) be an improvement for very large systems.

Fundamentally, I feel like the reason we have the feature freeze is exactly to avoid questions like this.

Something very much like patch 4 was posted before the last posting date; patches 1-4 received R-b's before the feature freeze. I think they should probably go in.

The rebalancing patches, I'm inclined to say, should wait until they've had a bit more time to be thought about.

-George