[v3,00/47] xen: add core scheduling support

Message ID 20190914085251.18816-1-jgross@suse.com (mailing list archive)


Jürgen Groß Sept. 14, 2019, 8:52 a.m. UTC
Add support for core- and socket-scheduling in the Xen hypervisor.

Via the boot parameter sched-gran=core (or sched-gran=socket) it is
possible to change the scheduling granularity from cpu (the default)
to whole cores or even whole sockets.

All logical cpus (threads) of the core or socket are always scheduled
together. This means that only vcpus of the same domain will ever be
active on a core at any one time, and those vcpus will always be
scheduled at the same time.

This is achieved by switching the scheduler to no longer see vcpus as
the primary object to schedule, but "schedule units". Each schedule
unit consists of as many vcpus as each core has threads on the current
system. The vcpu->unit relation is fixed.

I have done some very basic performance testing: on a 4 cpu system
(2 cores with 2 threads each) I did a "make -j 4" build of the Xen
hypervisor. This test has been run in dom0, once with no other guest
active and once with another guest with 4 vcpus running the same
test. The results are (always elapsed time, system time, user time):

sched-gran=cpu,    no other guest: 116.10 177.65 207.84
sched-gran=core,   no other guest: 114.04 175.47 207.45
sched-gran=cpu,    other guest:    202.30 334.21 384.63
sched-gran=core,   other guest:    207.24 293.04 371.37

The performance tests have been performed with credit2; the other
schedulers have been tested only briefly, enough to verify that a
domain can be created in a cpupool with each of them.

Cpupools have been moderately tested (cpu add/remove, create, destroy,
move domain).
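
The cpupool operations exercised correspond to the standard xl
subcommands, roughly as in this sketch (pool and domain names are made
up; cpupool-create takes cpupool-config-file syntax, hence the escaped
quotes):

```sh
xl cpupool-cpu-remove Pool-0 3                     # free a cpu from the default pool
xl cpupool-create name=\"pool1\" sched=\"credit2\" # create an empty second pool
xl cpupool-cpu-add pool1 3                         # give it the freed cpu
xl cpupool-migrate mydomu pool1                    # move a domain into it
xl cpupool-destroy pool1                           # tear it down again
```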

Cpu on-/offlining has been moderately tested, too.

The series is based on the series:
"xen/sched: use new idle scheduler for free cpus"
which has been split off from V1 and on the patch:
"xen/sched: rework and rename vcpu_force_reschedule()"
which has been split off from V2.

The complete patch series is available under:

  git://github.com/jgross1/xen/ sched-v3

Changes in V3:
- comments addressed
- former patch 26 carved out and sent separately
- some minor bugs fixed

Changes in V2:
- comments addressed
- some patches merged into one
- idle scheduler related patches split off to own series
- some patches are already applied
- some bugs fixed (e.g. crashes when powering off)

Changes in V1:
- cpupools are working now
- cpu on-/offlining working now
- all schedulers working now
- renamed "items" to "units"
- introduction of "idle scheduler"
- several new patches (see individual patches, mostly splits of
  former patches or cpupool and cpu on-/offlining support)
- all review comments addressed
- some minor changes (see individual patches)

Changes in RFC V2:
- ARM is building now
- HVM domains are working now
- idling will always be done with idle_vcpu active
- other small changes see individual patches

Juergen Gross (47):
  xen/sched: use new sched_unit instead of vcpu in scheduler interfaces
  xen/sched: move per-vcpu scheduler private data pointer to sched_unit
  xen/sched: build a linked list of struct sched_unit
  xen/sched: introduce struct sched_resource
  xen/sched: let pick_cpu return a scheduler resource
  xen/sched: switch schedule_data.curr to point at sched_unit
  xen/sched: move per cpu scheduler private data into struct
    sched_resource
  xen/sched: switch vcpu_schedule_lock to unit_schedule_lock
  xen/sched: move some per-vcpu items to struct sched_unit
  xen/sched: add scheduler helpers hiding vcpu
  xen/sched: rename scheduler related perf counters
  xen/sched: switch struct task_slice from vcpu to sched_unit
  xen/sched: add is_running indicator to struct sched_unit
  xen/sched: make null scheduler vcpu agnostic.
  xen/sched: make rt scheduler vcpu agnostic.
  xen/sched: make credit scheduler vcpu agnostic.
  xen/sched: make credit2 scheduler vcpu agnostic.
  xen/sched: make arinc653 scheduler vcpu agnostic.
  xen: add sched_unit_pause_nosync() and sched_unit_unpause()
  xen: let vcpu_create() select processor
  xen/sched: use sched_resource cpu instead smp_processor_id in
    schedulers
  xen/sched: switch schedule() from vcpus to sched_units
  xen/sched: switch sched_move_irqs() to take sched_unit as parameter
  xen: switch from for_each_vcpu() to for_each_sched_unit()
  xen/sched: add runstate counters to struct sched_unit
  xen/sched: Change vcpu_migrate_*() to operate on schedule unit
  xen/sched: move struct task_slice into struct sched_unit
  xen/sched: add code to sync scheduling of all vcpus of a sched unit
  xen/sched: introduce unit_runnable_state()
  xen/sched: add support for multiple vcpus per sched unit where missing
  xen/sched: modify cpupool_domain_cpumask() to be an unit mask
  xen/sched: support allocating multiple vcpus into one sched unit
  xen/sched: add a percpu resource index
  xen/sched: add fall back to idle vcpu when scheduling unit
  xen/sched: make vcpu_wake() and vcpu_sleep() core scheduling aware
  xen/sched: carve out freeing sched_unit memory into dedicated function
  xen/sched: move per-cpu variable scheduler to struct sched_resource
  xen/sched: move per-cpu variable cpupool to struct sched_resource
  xen/sched: reject switching smt on/off with core scheduling active
  xen/sched: prepare per-cpupool scheduling granularity
  xen/sched: split schedule_cpu_switch()
  xen/sched: protect scheduling resource via rcu
  xen/sched: support multiple cpus per scheduling resource
  xen/sched: support differing granularity in schedule_cpu_[add/rm]()
  xen/sched: support core scheduling for moving cpus to/from cpupools
  xen/sched: disable scheduling when entering ACPI deep sleep states
  xen/sched: add scheduling granularity enum

 xen/arch/arm/domain.c            |    2 +-
 xen/arch/arm/domain_build.c      |   13 +-
 xen/arch/x86/Kconfig             |    1 +
 xen/arch/x86/acpi/power.c        |    4 +
 xen/arch/x86/dom0_build.c        |   10 +-
 xen/arch/x86/domain.c            |   26 +-
 xen/arch/x86/hvm/dom0_build.c    |    9 +-
 xen/arch/x86/pv/dom0_build.c     |   10 +-
 xen/arch/x86/sysctl.c            |    5 +
 xen/common/Kconfig               |    3 +
 xen/common/cpupool.c             |  155 +++-
 xen/common/domain.c              |   29 +-
 xen/common/domctl.c              |   23 +-
 xen/common/keyhandler.c          |   58 +-
 xen/common/sched_arinc653.c      |  257 +++---
 xen/common/sched_credit.c        |  762 ++++++++--------
 xen/common/sched_credit2.c       | 1121 ++++++++++++------------
 xen/common/sched_null.c          |  472 +++++-----
 xen/common/sched_rt.c            |  545 ++++++------
 xen/common/schedule.c            | 1787 +++++++++++++++++++++++++++++++-------
 xen/common/softirq.c             |    6 +-
 xen/include/asm-arm/current.h    |    1 +
 xen/include/asm-x86/cpuidle.h    |   11 -
 xen/include/asm-x86/current.h    |   19 +-
 xen/include/asm-x86/dom0_build.h |    3 +-
 xen/include/asm-x86/smp.h        |    3 +
 xen/include/xen/domain.h         |    3 +-
 xen/include/xen/perfc_defn.h     |   32 +-
 xen/include/xen/sched-if.h       |  343 ++++++--
 xen/include/xen/sched.h          |   99 ++-
 xen/include/xen/softirq.h        |    1 +
 31 files changed, 3649 insertions(+), 2164 deletions(-)

Comments

Jan Beulich Sept. 20, 2019, 4:14 p.m. UTC | #1
On 14.09.2019 10:52, Juergen Gross wrote:
> This is achieved by switching the scheduler to no longer see vcpus as
> the primary object to schedule, but "schedule units". Each schedule
> unit consists of as many vcpus as each core has threads on the current
> system. The vcpu->unit relation is fixed.

There's another aspect here that, while perhaps obvious, I didn't
realize so far: Iirc right now schedulers try to place vCPU-s on
different cores, as long as there aren't more runnable vCPU-s than
there are cores. This is to improve overall throughput, since
vCPU-s on sibling hyperthreads would compete for execution
resources. With a fixed relation this is going to be impossible.
Otoh I can of course see how, once we have proper virtual
topology, this allows better scheduling decisions inside the
guest, in particular if - under the right circumstances - it is
actually wanted to run two entities on sibling threads.

Jan
Dario Faggioli Sept. 24, 2019, 10:36 a.m. UTC | #2
On Fri, 2019-09-20 at 18:14 +0200, Jan Beulich wrote:
> On 14.09.2019 10:52, Juergen Gross wrote:
> > This is achieved by switching the scheduler to no longer see vcpus
> > as
> > the primary object to schedule, but "schedule units". Each schedule
> > unit consists of as many vcpus as each core has threads on the
> > current
> > system. The vcpu->unit relation is fixed.
> 
> There's another aspect here that, while perhaps obvious, I didn't
> realize so far: Iirc right now schedulers try to place vCPU-s on
> different cores, as long as there aren't more runnable vCPU-s than
> there are cores. 
>
Indeed they do.

> This is to improve overall throughput, since
> vCPU-s on sibling hyperthreads would compete for execution
> resources. With a fixed relation this is going to be impossible.
>
It is. And that is the reason why my benchmarks show rather bad
performance for 4-vCPU VMs on an 8-CPU (4 cores with
hyperthreading) host. In fact, as Juergen showed during his Xen Summit
talk, in such a case core-scheduling achieves much worse performance
than "regular" cpu-scheduling, both when hyperthreading is enabled and
disabled.

It's an intrinsic characteristic of this solution that we have decided
to go for (i.e., introducing the 'virtual core' and 'scheduling
resource' concepts, and act almost entirely at the schedule.c level).

> Otoh I can of course see how, once we have proper virtual
> topology, this allows better scheduling decisions inside the
> guest, in particular if - under the right circumstances - it is
> actually wanted to run two entities on sibling threads.
> 
Yes, this is indeed one aspect. There is also the fact that, currently,
as soon as you have one more runnable vCPU than there are cores, e.g.
coming from another VM, the guest that had each of its vCPUs running on
its own core experiences a slowdown. While, with core-scheduling enabled
from the beginning, performance stays consistent.

In any case, this all happens with core-scheduling actually enabled.
With these patches applied, but cpu-scheduling selected at boot, fully
idle cores are still preferred, and the vCPUs will still be spread
among them (as soon as there's any available).

Regards
Sergey Dyasli Sept. 24, 2019, 11:15 a.m. UTC | #3
Hi Juergen,

After extensive testing of your jgross1/sched-v3 branch in XenRT,
I'm happy to say that we've found no functional regressions so far
when running in the default (thread/cpu) mode.

Hopefully this gives some level of confidence in this series and the
plan to include it in 4.13 [1]

[1] RFC: Criteria for checking in core scheduling series
    https://lore.kernel.org/xen-devel/97e1bfe4-3383-ff1d-bf61-48b8aa63bb2c@citrix.com/

Thanks,
Sergey
Jürgen Groß Sept. 24, 2019, 11:17 a.m. UTC | #4
On 24.09.19 13:15, Sergey Dyasli wrote:
> Hi Juergen,
> 
> After an extensive testing of your jgross1/sched-v3 branch in XenRT,
> I'm happy to say that we've found no functional regressions so far
> when running in the default (thread/cpu) mode.
> 
> Hopefully this gives some level of confidence to this series and the
> plan about including it into 4.13 [1]
> 
> [1] RFC: Criteria for checking in core scheduling series
>      https://lore.kernel.org/xen-devel/97e1bfe4-3383-ff1d-bf61-48b8aa63bb2c@citrix.com/

Thank you very much for your efforts and the confirmation!


Juergen
Dario Faggioli Sept. 24, 2019, 5:29 p.m. UTC | #5
On Tue, 2019-09-24 at 12:15 +0100, Sergey Dyasli wrote:
> Hi Juergen,
> 
> After an extensive testing of your jgross1/sched-v3 branch in XenRT,
> I'm happy to say that we've found no functional regressions so far
> when running in the default (thread/cpu) mode.
> 
> Hopefully this gives some level of confidence to this series and the
> plan about including it into 4.13 [1]
> 
Thanks a lot for doing this, and for letting us know.

Can I ask whether the tests were done using Credit2 (i.e., the
upstream default) or Credit1 as the scheduler?

Thanks again and Regards
Igor Druzhinin Sept. 24, 2019, 5:42 p.m. UTC | #6
On 24/09/2019 18:29, Dario Faggioli wrote:
> On Tue, 2019-09-24 at 12:15 +0100, Sergey Dyasli wrote:
>> Hi Juergen,
>>
>> After an extensive testing of your jgross1/sched-v3 branch in XenRT,
>> I'm happy to say that we've found no functional regressions so far
>> when running in the default (thread/cpu) mode.
>>
>> Hopefully this gives some level of confidence to this series and the
>> plan about including it into 4.13 [1]
>>
> Thanks a lot for doing this, and for letting us know.

Additionally, we've got performance test results today and they showed
no noticeable regressions in thread mode against 4.13 without
core-scheduling patches.

> Can I ask whether the tests were done using Credit2 (i.e., upstream
> default) or Credit1, as scheduler?
>
That was Credit1, which we use in the product.

Igor