mbox series

[RFC,0/3] Series short description

Message ID 154219299016.19470.9372139354280787961.stgit@wayrath (mailing list archive)
Headers show
Series Series short description | expand

Message

Dario Faggioli Nov. 14, 2018, 10:56 a.m. UTC
The following series implements...

---

Dario Faggioli (3):
      i386: add properties for customizing L2 and L3 caches size
      i386: custom cache size in CPUID2 and CPUID4 descriptors
      i386: custom cache size in AMD's CPUID descriptors too


 0 files changed

--
Signature

Comments

Dario Faggioli Nov. 14, 2018, 11:08 a.m. UTC | #1
Wow... Mmm, not sure what went wrong... Anyway, this is the cover
letter I thought I had sent. Sorry :-/
--
Hello everyone,

This is Dario, from SUSE, and this is the first time I touch QEMU. :-D

So, basically, while playing with an AMD EPYC box, we came across a weird
performance regression between host and guest. It was happening with the
STREAM benchmark, and we tracked it down to non-temporal stores _not_ being
used, inside the guest.

More specifically, this was because the glibc version we were dealing with had
heuristics for deciding whether or not to use NT instructions. Basically, it
was checking is how big the L2 and L3 caches are, as compared to how many
threads are actually sharing such caches.

Currently, as far as cache layout and size are concerned, we only have the
following options:
- no L3 cache,
- emulated L3 cache, which means the default cache layout for the chosen CPU
  is used,
- host L3 cache info, which means the cache layout of the host is used.

Now, in our case, 'host-cache-info' made sense, because we were pinning vcpus
as well as doing other optimizations. However, as the VM had _less_ vcpus than
the host had pcpus, the result of the heuristics was to avoid non-temporal
stores, causing the unexpectedly high drop in performance. And, as you can
imagine, we could not fix things by using 'l3-cache=on' either.

This made us think this could be a general problem, and not only an issue for
our benchmarks, and here it comes this series. :-)

Basically, while we can, of course, control the number of vcpus a guest has
already --as well as how they are arranged within the guest topology-- we can't
control how big are the caches the guest sees. And this is what this series
tries to implement: giving the user the ability to tweak the actual size of the
L2 and L3 caches, to deal with all those cases when the guest OS or userspace
do check that, and behave differently depending on what they see.

Yes, this is not at all that common, but happens, and hece the feature can
be considered useful, IMO. And yes, it is definitely something meant for those
cases where one is carefully tuning and highly optimizing, with things like
vcpu pinning, etc.

I've tested with many CPU models, and the cahce info from inside the guest
looks consistent. I haven't re-run the benchmarks that triggered all this work,
as I don't have the proper hardware handy right now, but I'm planning to
(although, as said, this looks like a general problem to me).

I've got libvirt patches for exposing these new properties in the works, but
of course they only make sense if/when this series is accepted.

As I said, it's my first submission, and it's RFC because there are a couple
of things that I'm not sure I got right (details in the single patches).

Any comment or advice more than welcome. :-)

Thanks and Regards,
Dario
---
Dario Faggioli (3):
      i386: add properties for customizing L2 and L3 cache sizes
      i386: custom cache size in CPUID2 and CPUID4 descriptors
      i386: custom cache size in AMD's CPUID descriptors too

 include/hw/i386/pc.h |    8 ++++++++
 target/i386/cpu.c    |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 target/i386/cpu.h    |    3 +++
 3 files changed, 61 insertions(+)
Daniel P. Berrangé Nov. 14, 2018, 11:29 a.m. UTC | #2
On Wed, Nov 14, 2018 at 12:08:42PM +0100, Dario Faggioli wrote:
> Wow... Mmm, not sure what went wrong... Anyway, this is the cover
> letter I thought I had sent. Sorry :-/

No problem !

If you have not come across it before, "git-publish" is a great addon
tool for git to make sending patch series more pain-free

   https://github.com/stefanha/git-publish

QEMU git provides a config file for it, so you can just run "git-publish"
with no args and it will send a series containing all patches on your
currently checked out branch, automatically reading the MAINTAINERS file
to figure out who should be CC'd on the series.

The only global setup is for your $HOME/.gitconfig to contain a setting
for "smtpserver" under the "[sendemail]" group for outbound SMTP relay.

> --
> Hello everyone,
> 
> This is Dario, from SUSE, and this is the first time I touch QEMU. :-D

Welcome & thanks for your first patch(es) to QEMU.

I'll let others comment on the actual code you've sent...

> So, basically, while playing with an AMD EPYC box, we came across a weird
> performance regression between host and guest. It was happening with the
> STREAM benchmark, and we tracked it down to non-temporal stores _not_ being
> used, inside the guest.
> 
> More specifically, this was because the glibc version we were dealing with had
> heuristics for deciding whether or not to use NT instructions. Basically, it
> was checking is how big the L2 and L3 caches are, as compared to how many
> threads are actually sharing such caches.
> 
> Currently, as far as cache layout and size are concerned, we only have the
> following options:
> - no L3 cache,
> - emulated L3 cache, which means the default cache layout for the chosen CPU
>   is used,
> - host L3 cache info, which means the cache layout of the host is used.
> 
> Now, in our case, 'host-cache-info' made sense, because we were pinning vcpus
> as well as doing other optimizations. However, as the VM had _less_ vcpus than
> the host had pcpus, the result of the heuristics was to avoid non-temporal
> stores, causing the unexpectedly high drop in performance. And, as you can
> imagine, we could not fix things by using 'l3-cache=on' either.
> 
> This made us think this could be a general problem, and not only an issue for
> our benchmarks, and here it comes this series. :-)
> 
> Basically, while we can, of course, control the number of vcpus a guest has
> already --as well as how they are arranged within the guest topology-- we can't
> control how big are the caches the guest sees. And this is what this series
> tries to implement: giving the user the ability to tweak the actual size of the
> L2 and L3 caches, to deal with all those cases when the guest OS or userspace
> do check that, and behave differently depending on what they see.
> 
> Yes, this is not at all that common, but happens, and hece the feature can
> be considered useful, IMO. And yes, it is definitely something meant for those
> cases where one is carefully tuning and highly optimizing, with things like
> vcpu pinning, etc.
> 
> I've tested with many CPU models, and the cahce info from inside the guest
> looks consistent. I haven't re-run the benchmarks that triggered all this work,
> as I don't have the proper hardware handy right now, but I'm planning to
> (although, as said, this looks like a general problem to me).
> 
> I've got libvirt patches for exposing these new properties in the works, but
> of course they only make sense if/when this series is accepted.
> 
> As I said, it's my first submission, and it's RFC because there are a couple
> of things that I'm not sure I got right (details in the single patches).
> 
> Any comment or advice more than welcome. :-)
> 
> Thanks and Regards,
> Dario
> ---
> Dario Faggioli (3):
>       i386: add properties for customizing L2 and L3 cache sizes
>       i386: custom cache size in CPUID2 and CPUID4 descriptors
>       i386: custom cache size in AMD's CPUID descriptors too
> 
>  include/hw/i386/pc.h |    8 ++++++++
>  target/i386/cpu.c    |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  target/i386/cpu.h    |    3 +++
>  3 files changed, 61 insertions(+)
> 
> -- 
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Software Engineer @ SUSE https://www.suse.com/



Regards,
Daniel
Dario Faggioli Nov. 14, 2018, 3:24 p.m. UTC | #3
On Wed, 2018-11-14 at 11:29 +0000, Daniel P. Berrangé wrote:
> On Wed, Nov 14, 2018 at 12:08:42PM +0100, Dario Faggioli wrote:
> > Wow... Mmm, not sure what went wrong... Anyway, this is the cover
> > letter I thought I had sent. Sorry :-/
> 
> No problem !
> 
Hello,

> If you have not come across it before, "git-publish" is a great addon
> tool for git to make sending patch series more pain-free
> 
>    https://github.com/stefanha/git-publish
> 
> [...]
> 
Yes, I've heard of it. I'm already planning to check if it works well
with stgit, which I also use (and wish to continue to :-) ).

If it does, I'll definitely start using it.

> > --
> > Hello everyone,
> > 
> > This is Dario, from SUSE, and this is the first time I touch QEMU.
> > :-D
> 
> Welcome & thanks for your first patch(es) to QEMU.
> 
Thanks for the warm welcome. :-D

Regards,
Dario
no-reply@patchew.org Nov. 14, 2018, 6:47 p.m. UTC | #4
Hi,

This series seems to have some coding style problems. See output below for
more information:

Message-id: 154219299016.19470.9372139354280787961.stgit@wayrath
Type: series
Subject: [Qemu-devel] [RFC PATCH 0/3] Series short description

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
28cb7a9 i386: custom cache size in AMD's CPUID descriptors too
1708b37 i386: custom cache size in CPUID2 and CPUID4 descriptors
ed1b0e0 i386: add properties for customizing L2 and L3 caches size

=== OUTPUT BEGIN ===
Checking PATCH 1/3: i386: add properties for customizing L2 and L3 caches size...
Checking PATCH 2/3: i386: custom cache size in CPUID2 and CPUID4 descriptors...
ERROR: braces {} are necessary for all arms of this statement
#38: FILE: target/i386/cpu.c:440:
+        if (sets == 0)
[...]

total: 1 errors, 0 warnings, 56 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 3/3: i386: custom cache size in AMD's CPUID descriptors too...
=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@redhat.com