Message ID | 20190329150934.17694-1-jgross@suse.com
---|---
Series | xen: add core scheduling support
On 29/03/2019 16:08, Juergen Gross wrote:
> This series is very RFC!!!!
>
> Add support for core- and socket-scheduling in the Xen hypervisor.
>
> Via boot parameter sched_granularity=core (or sched_granularity=socket)
> it is possible to change the scheduling granularity from thread (the
> default) to either whole cores or even sockets.
>
> All logical cpus (threads) of the core or socket are always scheduled
> together. This means that on a core always vcpus of the same domain
> will be active, and those vcpus will always be scheduled at the same
> time.
>
> This is achieved by switching the scheduler to no longer see vcpus as
> the primary object to schedule, but "schedule items". Each schedule
> item consists of as many vcpus as each core has threads on the current
> system. The vcpu->item relation is fixed.
>
> I have done some very basic performance testing: on a 4 cpu system
> (2 cores with 2 threads each) I did a "make -j 4" for building the Xen
> hypervisor. This test has been run in dom0, once with no other guest
> active and once with another guest with 4 vcpus running the same test.
> The results are (always elapsed time, system time, user time):
>
> sched_granularity=thread, no other guest: 116.10 177.65 207.84
> sched_granularity=core, no other guest:   114.04 175.47 207.45
> sched_granularity=thread, other guest:    202.30 334.21 384.63
> sched_granularity=core, other guest:      207.24 293.04 371.37
>
> All tests have been performed with credit2, the other schedulers are
> untested up to now.
>
> Cpupools are not yet working, as moving cpus between cpupools needs
> more work.
>
> HVM domains do not work yet, there is a doublefault in Xen at the
> end of Seabios. I'm currently investigating this issue.
>
> This is x86-only for the moment. ARM doesn't even build with the
> series applied. For full ARM support I might need some help with the
> ARM specific context switch handling.
>
> The first 7 patches have been sent to xen-devel already, I'm just
> adding them here for convenience as they are prerequisites.
>
> I'm especially looking for feedback regarding the overall idea and
> design.

I have put the patches in a repository:

github.com/jgross1/xen.git sched-rfc


Juergen
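To make the "schedule item" idea from the cover letter concrete, here is a minimal, standalone C sketch of the grouping it describes, assuming a 2-thread-per-core system; the names (sched_item, vcpu_to_item, THREADS_PER_CORE) are illustrative guesses for this example only, not the symbols used in the actual series.

```c
/*
 * Standalone sketch of the "schedule item" idea: vCPUs are grouped into
 * items of threads_per_core members, and the vcpu -> item mapping is fixed.
 * All names here are made up for illustration.
 */
#include <stdio.h>

#define THREADS_PER_CORE 2   /* sched_granularity=core on an SMT-2 system */

struct sched_item {
    unsigned int item_id;
    unsigned int vcpu_ids[THREADS_PER_CORE]; /* always co-scheduled */
};

/* Fixed relation: a vCPU always belongs to the same item. */
static unsigned int vcpu_to_item(unsigned int vcpu_id)
{
    return vcpu_id / THREADS_PER_CORE;
}

int main(void)
{
    unsigned int nr_vcpus = 4;
    struct sched_item items[2];

    for (unsigned int v = 0; v < nr_vcpus; v++) {
        unsigned int it = vcpu_to_item(v);

        items[it].item_id = it;
        items[it].vcpu_ids[v % THREADS_PER_CORE] = v;
        printf("vCPU %u -> schedule item %u\n", v, it);
    }

    /*
     * vCPUs 0,1 form item 0 and vCPUs 2,3 form item 1; the scheduler then
     * picks items rather than individual vCPUs, so both members of an item
     * run on the two threads of one core at the same time.
     */
    return 0;
}
```

With sched_granularity=socket the same grouping applies, with the item size becoming the number of threads per socket instead of per core.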
>>> On 29.03.19 at 16:08, <jgross@suse.com> wrote:
> Via boot parameter sched_granularity=core (or sched_granularity=socket)
> it is possible to change the scheduling granularity from thread (the
> default) to either whole cores or even sockets.
>
> All logical cpus (threads) of the core or socket are always scheduled
> together. This means that on a core always vcpus of the same domain
> will be active, and those vcpus will always be scheduled at the same
> time.
>
> This is achieved by switching the scheduler to no longer see vcpus as
> the primary object to schedule, but "schedule items". Each schedule
> item consists of as many vcpus as each core has threads on the current
> system. The vcpu->item relation is fixed.

Hmm, I find this surprising: A typical guest would have more vCPU-s
than there are threads per core. So if two of them want to run, but
each is associated with a different core, you'd need two cores instead
of one to actually fulfill the request? I could see this necessarily
being the case if you arranged vCPU-s into virtual threads, cores,
sockets, and nodes, but at least from the patch titles it doesn't look
as if you did in this series. Are there other reasons to make this a
fixed relationship?

As a minor cosmetic request visible from this cover letter right away:
Could the command line option please become "sched-granularity=" or
even "sched-gran="?

Jan
On 29/03/2019 16:39, Jan Beulich wrote:
>>>> On 29.03.19 at 16:08, <jgross@suse.com> wrote:
>> This is achieved by switching the scheduler to no longer see vcpus as
>> the primary object to schedule, but "schedule items". Each schedule
>> item consists of as many vcpus as each core has threads on the current
>> system. The vcpu->item relation is fixed.
>
> Hmm, I find this surprising: A typical guest would have more vCPU-s
> than there are threads per core. So if two of them want to run, but
> each is associated with a different core, you'd need two cores instead
> of one to actually fulfill the request?

Correct.

> I could see this necessarily being the case if you arranged vCPU-s
> into virtual threads, cores, sockets, and nodes, but at least from the
> patch titles it doesn't look as if you did in this series. Are there
> other reasons to make this a fixed relationship?

In fact I'm doing it, but only implicitly and without adapting the
cpuid related information. The idea is to pass the topology information
at least below the scheduling granularity to the guest later.

Not having the fixed relationship would result in something like the
co-scheduling series Dario already sent, which would need more than
mechanical changes in each scheduler.

> As a minor cosmetic request visible from this cover letter right away:
> Could the command line option please become "sched-granularity=" or
> even "sched-gran="?

Of course!


Juergen
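A small worked example of the trade-off being discussed here: under the fixed vcpu->item relation, two runnable vCPUs that happen to live in different items occupy two cores, whereas a flexible pairing (as in the per-scheduler co-scheduling approach mentioned above) could put them on one. This is a standalone sketch with made-up helper names, not code from either series.

```c
/* How many cores does a runnable vCPU set occupy on a 2-thread/core host? */
#include <stdio.h>

#define THREADS_PER_CORE 2

/* Fixed mapping: count the distinct schedule items in the runnable set. */
static unsigned int cores_needed_fixed(const unsigned int *vcpus, unsigned int n)
{
    unsigned long seen = 0;     /* small bitmap of item ids, enough for a demo */
    unsigned int cores = 0;

    for (unsigned int i = 0; i < n; i++) {
        unsigned int item = vcpus[i] / THREADS_PER_CORE;

        if (!(seen & (1UL << item))) {
            seen |= 1UL << item;
            cores++;
        }
    }
    return cores;
}

/* Flexible pairing: any two runnable vCPUs of the guest may share a core. */
static unsigned int cores_needed_flexible(unsigned int n)
{
    return (n + THREADS_PER_CORE - 1) / THREADS_PER_CORE;
}

int main(void)
{
    /* Jan's scenario: two runnable vCPUs that live in different items. */
    unsigned int runnable[] = { 0, 2 };

    printf("fixed mapping:    %u core(s)\n", cores_needed_fixed(runnable, 2));
    printf("flexible pairing: %u core(s)\n", cores_needed_flexible(2));
    return 0;
}
```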
On Fri, 2019-03-29 at 16:46 +0100, Juergen Gross wrote:
> On 29/03/2019 16:39, Jan Beulich wrote:
> > > > > On 29.03.19 at 16:08, <jgross@suse.com> wrote:
> > > > This is achieved by switching the scheduler to no longer see
> > > > vcpus as the primary object to schedule, but "schedule items".
> > > > Each schedule item consists of as many vcpus as each core has
> > > > threads on the current system. The vcpu->item relation is fixed.
> >
> > I could see this necessarily being the case if you arranged vCPU-s
> > into virtual threads, cores, sockets, and nodes, but at least from
> > the patch titles it doesn't look as if you did in this series. Are
> > there other reasons to make this a fixed relationship?
>
> In fact I'm doing it, but only implicitly and without adapting the
> cpuid related information. The idea is to pass the topology
> information at least below the scheduling granularity to the guest
> later.
>
> Not having the fixed relationship would result in something like the
> co-scheduling series Dario already sent, which would need more than
> mechanical changes in each scheduler.
>
Yep. So, just for the records, those series are, this one for Credit1:
https://lists.xenproject.org/archives/html/xen-devel/2018-08/msg02164.html

And this one for Credit2:
https://lists.xenproject.org/archives/html/xen-devel/2018-10/msg01113.html

Both are RFC, but the Credit2 one was much, much better (more complete,
more tested, more stable, achieving better fairness, etc).

In these series, the "relationship" being discussed here is not fixed.
Not right now, at least, but it can become so (I didn't do it as we
currently lack the info for doing that properly).

It is/was, IMO, a good thing that everything works both with and
without topology enlightenment (even once we'll have it, in case one,
for whatever reason, doesn't care).

As said by Juergen, the two approaches (and hence the structure of the
series) are quite different. This series is more generic, acts on the
common scheduler code and logic. It's quite intrusive, as we can see
:-D, but enables the feature for all the schedulers all at once (well,
they all need changes, but mostly mechanical).

My series, OTOH, act on each scheduler specifically (and in fact there
is one for Credit and one for Credit2, and there would need to be one
for RTDS, if wanted, etc). They're much more self contained, but less
generic; and the changes necessary within each scheduler are specific
to the scheduler itself, and non-mechanical.

Regards,
Dario
On 29/03/2019 17:56, Dario Faggioli wrote:
> As said by Juergen, the two approaches (and hence the structure of the
> series) are quite different. This series is more generic, acts on the
> common scheduler code and logic. It's quite intrusive, as we can see
> :-D, but enables the feature for all the schedulers all at once (well,
> they all need changes, but mostly mechanical).
>
> My series, OTOH, act on each scheduler specifically (and in fact there
> is one for Credit and one for Credit2, and there would need to be one
> for RTDS, if wanted, etc). They're much more self contained, but less
> generic; and the changes necessary within each scheduler are specific
> to the scheduler itself, and non-mechanical.

Another line of thought: in case we want core scheduling for security
reasons (to ensure always vcpus of the same guest are sharing a core)
the same might apply to the guest itself: it might want to ensure only
threads of the same process are sharing a core. This would be quite
easy with my series, but impossible for Dario's solution without the
fixed relationship between guest siblings.


Juergen
On Fri, 2019-03-29 at 18:00 +0100, Juergen Gross wrote:
> Another line of thought: in case we want core scheduling for security
> reasons (to ensure always vcpus of the same guest are sharing a core)
> the same might apply to the guest itself: it might want to ensure
> only threads of the same process are sharing a core.
>
Sure, as soon as we'll manage to "passthrough" to it the necessary
topology information.

> This would be quite easy with my series, but impossible for Dario's
> solution without the fixed relationship between guest siblings.
>
Well, not "impossible". :-) As said above, that's not there, but it can
be added/implemented.

Anyway... Lemme go back looking at the patches, and preparing for
running benchmarks. :-D :-D

Dario
Out of curiosity, has there been any research done on whether or not it
makes more sense to just disable CPU threading with respect to overall
performance? In some of the testing that we did with OpenXT, we noticed
in some of our tests a performance increase when hyperthreading was
disabled. I would be curious what other research has been done in this
regard.

Either way, if threading is enabled, grouping up threads makes a lot of
sense WRT some of the recent security issues that have come up with
Intel CPUs.

On Fri, Mar 29, 2019 at 11:03 AM Juergen Gross <jgross@suse.com> wrote:
>
> Another line of thought: in case we want core scheduling for security
> reasons (to ensure always vcpus of the same guest are sharing a core)
> the same might apply to the guest itself: it might want to ensure
> only threads of the same process are sharing a core. This would be
> quite easy with my series, but impossible for Dario's solution without
> the fixed relationship between guest siblings.
>
> Juergen
On 29/03/2019 17:39, Rian Quinn wrote:
> Out of curiosity, has there been any research done on whether or not
> it makes more sense to just disable CPU threading with respect to
> overall performance? In some of the testing that we did with OpenXT,
> we noticed in some of our tests a performance increase when
> hyperthreading was disabled. I would be curious what other research
> has been done in this regard.
>
> Either way, if threading is enabled, grouping up threads makes a lot
> of sense WRT some of the recent security issues that have come up with
> Intel CPUs.

There has been plenty of academic research done, and there are real
usecases where disabling HT improves performance.

However, there are plenty where it doesn't. During L1TF testing,
XenServer measured one typical usecase (aggregate small packet IO
throughput, which is representative of a load of webserver VMs) which
took a 60% perf hit.

10% of this was the raw L1D_FLUSH hit, while 50% of it was actually due
to the increased IO latency of halving the number of vcpus which could
be run concurrently.

As for core aware scheduling, even if nothing else, grouping things up
will get you better cache sharing from the VM's point of view.

As you can probably tell, the answer is far too workload dependent to
come up with a general rule, but at least having the options available
will let people experiment.

~Andrew
Even if I've only skimmed through it... cool series! :-D

On Fri, 2019-03-29 at 16:08 +0100, Juergen Gross wrote:
>
> I have done some very basic performance testing: on a 4 cpu system
> (2 cores with 2 threads each) I did a "make -j 4" for building the Xen
> hypervisor. This test has been run in dom0, once with no other guest
> active and once with another guest with 4 vcpus running the same test.
> The results are (always elapsed time, system time, user time):
>
> sched_granularity=thread, no other guest: 116.10 177.65 207.84
> sched_granularity=core, no other guest:   114.04 175.47 207.45
> sched_granularity=thread, other guest:    202.30 334.21 384.63
> sched_granularity=core, other guest:      207.24 293.04 371.37
>
So, just to be sure I'm reading this properly,
"sched_granularity=thread" means no co-scheduling of any sort is in
effect, right? Basically the patch series is applied, but "not used",
correct?

If yes, these are interesting, and promising, numbers. :-)

> All tests have been performed with credit2, the other schedulers are
> untested up to now.
>
Just as a heads up for people (as Juergen knows this already :-D), I'm
planning to run some performance evaluation of these patches.

I've got an 8 CPUs system (4 cores, 2 threads each, no-NUMA) and a 16
CPUs system (2 sockets/NUMA nodes, 4 cores each, 2 threads each) on
which I should be able to get some bench suite running relatively
easily and (hopefully) quickly.

I'm planning to evaluate:
- vanilla (i.e., without this series), SMT enabled in BIOS
- vanilla (i.e., without this series), SMT disabled in BIOS
- patched (i.e., with this series), granularity=thread
- patched (i.e., with this series), granularity=core

I'll start with no overcommitment, and then move to 2x overcommitment
(as you did above).

And I'll also be focusing on Credit2 only.

Everyone else who also wants to do some stress and performance testing
and share the results, that's very much appreciated. :-)

Regards,
Dario
Makes sense. The reason I ask is that we currently have to disable HT
due to L1TF until a scheduler change is made to address the issue, and
the #1 question everyone asks is what that will do to performance. So
any info on that topic, and on how a patch like this will address the
L1TF issue, is most helpful.

On Fri, Mar 29, 2019 at 11:49 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> As you can probably tell, the answer is far too workload dependent to
> come up with a general rule, but at least having the options available
> will let people experiment.
>
> ~Andrew
On 29/03/2019 19:16, Dario Faggioli wrote:
> Even if I've only skimmed through it... cool series! :-D
>
> On Fri, 2019-03-29 at 16:08 +0100, Juergen Gross wrote:
>> sched_granularity=thread, no other guest: 116.10 177.65 207.84
>> sched_granularity=core, no other guest:   114.04 175.47 207.45
>> sched_granularity=thread, other guest:    202.30 334.21 384.63
>> sched_granularity=core, other guest:      207.24 293.04 371.37
>>
> So, just to be sure I'm reading this properly,
> "sched_granularity=thread" means no co-scheduling of any sort is in
> effect, right? Basically the patch series is applied, but "not used",
> correct?

Yes.

> If yes, these are interesting, and promising, numbers. :-)
>
> Just as a heads up for people (as Juergen knows this already :-D), I'm
> planning to run some performance evaluation of these patches.
>
> I'm planning to evaluate:
> - vanilla (i.e., without this series), SMT enabled in BIOS
> - vanilla (i.e., without this series), SMT disabled in BIOS
> - patched (i.e., with this series), granularity=thread
> - patched (i.e., with this series), granularity=core
>
> I'll start with no overcommitment, and then move to 2x overcommitment
> (as you did above).

Thanks, I appreciate that!


Juergen
>>> On 29.03.19 at 16:08, <jgross@suse.com> wrote:
> Via boot parameter sched_granularity=core (or sched_granularity=socket)
> it is possible to change the scheduling granularity from thread (the
> default) to either whole cores or even sockets.

One further general question came to mind: How about also having
"sched-granularity=thread" (or "...=none") to retain current behavior,
at least to have an easy way to compare effects if wanted? But perhaps
also to allow dealing with potentially resource-wasting configurations,
like having mostly VMs with e.g. an odd number of vCPU-s.

The other question of course is whether the terms thread, core, and
socket are generic enough to be used in architecture independent code.
Even on x86 it is already unclear where / how e.g. AMD's compute units
would be classified. I don't have any good suggestion for abstraction,
so possibly the terms used may want to become arch-specific.

Jan
On 01/04/2019 08:41, Jan Beulich wrote:
>>>> On 29.03.19 at 16:08, <jgross@suse.com> wrote:
>> Via boot parameter sched_granularity=core (or sched_granularity=socket)
>> it is possible to change the scheduling granularity from thread (the
>> default) to either whole cores or even sockets.
>
> One further general question came to mind: How about also having
> "sched-granularity=thread" (or "...=none") to retain current behavior,
> at least to have an easy way to compare effects if wanted? But perhaps
> also to allow dealing with potentially resource-wasting configurations,
> like having mostly VMs with e.g. an odd number of vCPU-s.

Fine with me.

> The other question of course is whether the terms thread, core, and
> socket are generic enough to be used in architecture independent code.
> Even on x86 it is already unclear where / how e.g. AMD's compute units
> would be classified. I don't have any good suggestion for abstraction,
> so possibly the terms used may want to become arch-specific.

I followed the already known terms from the credit2_runqueue parameter.
I think they should match. Which would call for "sched-granularity=cpu"
instead of "thread".


Juergen
On Mon, 2019-04-01 at 08:49 +0200, Juergen Gross wrote:
> On 01/04/2019 08:41, Jan Beulich wrote:
> > One further general question came to mind: How about also having
> > "sched-granularity=thread" (or "...=none") to retain current
> > behavior, at least to have an easy way to compare effects if wanted?
> > But perhaps also to allow dealing with potentially resource-wasting
> > configurations, like having mostly VMs with e.g. an odd number of
> > vCPU-s.
>
> Fine with me.
>
Mmm... I'm still in the process of looking at the patches, so there
might be something I'm missing, but, from the descriptions and from
talking to you (Juergen), I was assuming that to be the case already...
isn't it so?

> > The other question of course is whether the terms thread, core, and
> > socket are generic enough to be used in architecture independent
> > code. Even on x86 it is already unclear where / how e.g. AMD's
> > compute units would be classified. I don't have any good suggestion
> > for abstraction, so possibly the terms used may want to become
> > arch-specific.
>
> I followed the already known terms from the credit2_runqueue
> parameter. I think they should match. Which would call for
> "sched-granularity=cpu" instead of "thread".
>
Yep, I'd go for cpu. Both for, as you said, consistency, and also
because I can envision "granularity=thread" being mistaken/interpreted
as a form of "thread aware co-scheduling" (i.e., what
"granularity=core" actually does! :-O)

Regards,
Dario
>>> On 01.04.19 at 08:49, <jgross@suse.com> wrote:
> On 01/04/2019 08:41, Jan Beulich wrote:
>> The other question of course is whether the terms thread, core, and
>> socket are generic enough to be used in architecture independent code.
>> Even on x86 it is already unclear where / how e.g. AMD's compute units
>> would be classified. I don't have any good suggestion for abstraction,
>> so possibly the terms used may want to become arch-specific.
>
> I followed the already known terms from the credit2_runqueue parameter.
> I think they should match. Which would call for "sched-granularity=cpu"
> instead of "thread".

"cpu" is fine of course. I wonder though whether the other two were a
good choice for "credit2_runqueue".

Stefano, Julien - is this terminology at least half way suitable for
Arm?

Jan
On 01/04/2019 09:10, Dario Faggioli wrote:
> On Mon, 2019-04-01 at 08:49 +0200, Juergen Gross wrote:
>> On 01/04/2019 08:41, Jan Beulich wrote:
>>> One further general question came to mind: How about also having
>>> "sched-granularity=thread" (or "...=none") to retain current
>>> behavior, at least to have an easy way to compare effects if wanted?
>>> But perhaps also to allow dealing with potentially resource-wasting
>>> configurations, like having mostly VMs with e.g. an odd number of
>>> vCPU-s.
>>
>> Fine with me.
>>
> Mmm... I'm still in the process of looking at the patches, so there
> might be something I'm missing, but, from the descriptions and from
> talking to you (Juergen), I was assuming that to be the case
> already... isn't it so?

Yes, it is. I understood Jan to ask for a special parameter value for
that.


Juergen
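Pulling together the option names discussed in this subthread (a "sched-gran="/"sched-granularity=" parameter taking cpu, core or socket, with cpu keeping today's behaviour), the parsing logic might look roughly like the standalone sketch below. It deliberately avoids Xen's real boot-parameter plumbing, so the enum, variable and function names are assumptions for illustration only, not the series' actual code.

```c
/* Standalone sketch of parsing the discussed "sched-gran=" values. */
#include <stdio.h>
#include <string.h>

enum sched_gran {
    SCHED_GRAN_CPU,     /* current behaviour: schedule single threads */
    SCHED_GRAN_CORE,    /* co-schedule all threads of a core */
    SCHED_GRAN_SOCKET,  /* co-schedule all threads of a socket */
};

static enum sched_gran opt_sched_gran = SCHED_GRAN_CPU;

static int parse_sched_gran(const char *s)
{
    if (!strcmp(s, "cpu"))
        opt_sched_gran = SCHED_GRAN_CPU;
    else if (!strcmp(s, "core"))
        opt_sched_gran = SCHED_GRAN_CORE;
    else if (!strcmp(s, "socket"))
        opt_sched_gran = SCHED_GRAN_SOCKET;
    else
        return -1;      /* unknown value: reject it */
    return 0;
}

int main(void)
{
    const char *args[] = { "cpu", "core", "socket", "node" };

    for (unsigned int i = 0; i < sizeof(args) / sizeof(args[0]); i++)
        printf("sched-gran=%s -> %s\n", args[i],
               parse_sched_gran(args[i]) ? "rejected" : "accepted");

    return 0;
}
```

Keeping "cpu" as the accepted spelling (rather than "thread") matches the terminology already used by the credit2_runqueue parameter, as noted above.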
On Fri, 2019-03-29 at 19:16 +0100, Dario Faggioli wrote:
> On Fri, 2019-03-29 at 16:08 +0100, Juergen Gross wrote:
> > I have done some very basic performance testing: on a 4 cpu system
> > (2 cores with 2 threads each) I did a "make -j 4" for building the
> > Xen hypervisor. This test has been run in dom0, once with no other
> > guest active and once with another guest with 4 vcpus running the
> > same test.
>
> I'm planning to evaluate:
> - vanilla (i.e., without this series), SMT enabled in BIOS
> - vanilla (i.e., without this series), SMT disabled in BIOS
> - patched (i.e., with this series), granularity=thread
> - patched (i.e., with this series), granularity=core
>
> I'll start with no overcommitment, and then move to 2x overcommitment
> (as you did above).
>
I've got the first set of results. It's fewer than I wanted/expected to
have at this point in time, but still...

Also, it's Phoronix again. I don't especially love it, but I'm still
working on convincing our own internal automated benchmarking tool
(which I like a lot more :-) ) to be a good friend of Xen. :-P

It's a not too big set of tests, done in the following conditions:
- hardware: Intel Xeon E5620; 2 NUMA nodes, 4 cores and 2 threads each
- slow disk (old rotational HDD)
- benchmarks run in dom0
- CPU, memory and some disk IO benchmarks
- all Spec&Melt mitigations disabled both at Xen and dom0 kernel level
- cpufreq governor = performance, max_cstate = C1
- *non* debug hypervisor

In just one sentence, what I'd say is "So far so good". :-D

https://openbenchmarking.org/result/1904105-SP-1904100DA38

1) 'Xen dom0, SMT On, vanilla' is staging *without* this series even
   applied
2) 'Xen dom0, SMT on, patched, sched_granularity=thread' is with this
   series applied, but scheduler behavior as right now
3) 'Xen dom0, SMT on, patched, sched_granularity=core' is with this
   series applied, and core-scheduling enabled
4) 'Xen dom0, SMT Off, vanilla' is staging *without* this series
   applied, and SMT turned off in BIOS (i.e., we only have 8 CPUs)

So, comparing 1 and 4, we see, for each specific benchmark, what the
cost of disabling SMT is (or, vice versa, the gain of using SMT).
Comparing 1 and 2, we see the overhead introduced by this series, when
it is not used to achieve core-scheduling. Comparing 1 and 3, we see
the differences between what we have right now and what we'll have with
core-scheduling enabled, as it is implemented in this series.

Some of the things we can see from the results:

- disabling SMT (i.e., 1 vs 4) is not always bad, but it is bad
  overall, i.e., if you look at how many tests are better and at how
  many are slower with SMT off (and also by how much). Of course, this
  can be considered true only for these specific benchmarks, on this
  specific hardware and with this configuration

- the overhead introduced by this series is, overall, pretty small,
  apart from not more than a couple of exceptions (e.g., Stream Triad
  or zstd compression). OTOH, there seem to be cases where this series
  improves performance (e.g., Stress-NG Socket Activity)

- the performance we achieve with core-scheduling is more than
  acceptable

- between core-scheduling and disabling SMT, core-scheduling wins, and
  I wouldn't even call it a match :-P

Of course, other thoughts, comments, and alternative analyses are
welcome.

As said above, this is less than what I wanted to have, and in fact I'm
running more stuff. I have a much more comprehensive set of benchmarks
running in these days. It being "much more comprehensive", however,
also means it takes more time.

I have a newer and faster (both CPU and disk) machine, but I need to
re-purpose it for benchmarking purposes. At least now that the old Xeon
NUMA box is done with this first round, I can use it for:
- running the tests inside a "regular" PV domain
- running the tests inside more than one PV domain, i.e. with some
  degree of overcommitment

I'll push out results as soon as I have them.

Regards
On 11/04/2019 02:34, Dario Faggioli wrote:
> I've got the first set of results. It's fewer than I wanted/expected
> to have at this point in time, but still...
>
> Also, it's Phoronix again. I don't especially love it, but I'm still
> working on convincing our own internal automated benchmarking tool
> (which I like a lot more :-) ) to be a good friend of Xen. :-P

I think the Phoronix tests as such are not that bad, it's the way they
are used by Phoronix which is completely idiotic.

> It's a not too big set of tests, done in the following conditions:
> - hardware: Intel Xeon E5620; 2 NUMA nodes, 4 cores and 2 threads each
> - slow disk (old rotational HDD)
> - benchmarks run in dom0
> - CPU, memory and some disk IO benchmarks
> - all Spec&Melt mitigations disabled both at Xen and dom0 kernel level
> - cpufreq governor = performance, max_cstate = C1
> - *non* debug hypervisor
>
> In just one sentence, what I'd say is "So far so good". :-D
>
> https://openbenchmarking.org/result/1904105-SP-1904100DA38

Thanks for doing that!


Juergen
On Thu, 2019-04-11 at 09:16 +0200, Juergen Gross wrote:
> On 11/04/2019 02:34, Dario Faggioli wrote:
> > Also, it's Phoronix again. I don't especially love it, but I'm
> > still working on convincing our own internal automated benchmarking
> > tool (which I like a lot more :-) ) to be a good friend of Xen. :-P
>
> I think the Phoronix tests as such are not that bad, it's the way they
> are used by Phoronix which is completely idiotic.
>
Sure, that is the main problem.

About the suite itself, the fact that it is kind of a black box can be
a very good thing, but also a not so good one. Opaqueness is, AFAIUI,
among its design goals, so I can't possibly complain about that. And in
fact, that is what makes it so easy and quick to play with it. :-)

If you want to tweak the configuration of a benchmark, or change how
they're run, beyond the config options that are pre-defined for each
benchmark (e.g., do stuff like adding `numactl blabla` "in front" of
some), that is a lot less obvious or easy. And yes, this is somewhat
the case for most, if not all, benchmarking suites, but I find Phoronix
makes this _particularly_ tricky.

Anyway... :-D :-D

Regards