
[RFC,0/7] hostmem: NUMA-aware memory preallocation using ThreadContext

Message ID 20220721120732.118133-1-david@redhat.com

Message

David Hildenbrand July 21, 2022, 12:07 p.m. UTC
This is a follow-up on "util: NUMA aware memory preallocation" [1] by
Michal.

Setting the CPU affinity of threads from inside QEMU usually isn't
easily possible, because we don't want QEMU -- once started and running
guest code -- to be able to mess up the system. QEMU disallows relevant
syscalls using seccomp, such that any such invocation will fail.
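
For illustration only (this is not QEMU's actual sandbox code): a minimal
libseccomp sketch of why such a syscall fails once a deny rule is installed,
roughly in the spirit of -sandbox resourcecontrol=deny:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <sched.h>
    #include <seccomp.h>        /* link with -lseccomp */
    #include <stdio.h>

    int main(void)
    {
        /* Allow everything except the one "resource control" syscall we
         * care about here; QEMU's real filter covers a whole group. */
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
        seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
                         SCMP_SYS(sched_setaffinity), 0);
        seccomp_load(ctx);

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);

        /* Fails with EPERM now that the filter is loaded. */
        if (sched_setaffinity(0, sizeof(set), &set) < 0) {
            perror("sched_setaffinity");
        }
        return 0;
    }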

Especially for memory preallocation in memory backends, setting the CPU
affinity can significantly reduce guest startup time, for example, when
running large VMs backed by huge/gigantic pages, because of NUMA effects.
For NUMA-aware preallocation, however, we have to set the CPU affinity, and:

(1) Once preallocation threads are created during preallocation, management
    tools cannot intervene anymore to change their affinity. These threads
    are created automatically on demand.
(2) QEMU cannot easily set the CPU affinity itself.
(3) The CPU affinity derived from the NUMA bindings of the memory backend
    might not necessarily be exactly the CPUs we actually want to use
    (e.g., CPU-less NUMA nodes, CPUs that are pinned/used for other VMs).

There is an easy "workaround". If we have a thread with the right CPU
affinity, we can simply create new threads on demand via that prepared
context. So, all we have to do is set up and create such a context ahead
of time, and then configure preallocation to create new threads via that
environment.
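
A minimal pthread sketch of that trick (not QEMU's ThreadContext code, just
the underlying mechanism): a new thread inherits the CPU affinity of the
thread that created it, so pinning the context thread once is enough:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Worker created *by the context thread*: it inherits the affinity. */
    static void *worker(void *arg)
    {
        cpu_set_t set;
        pthread_getaffinity_np(pthread_self(), sizeof(set), &set);
        printf("worker restricted to %d CPUs\n", CPU_COUNT(&set));
        return NULL;
    }

    /* The "context thread": pinned once ahead of time, afterwards only
     * used to spawn new threads on demand. */
    static void *context_thread(void *arg)
    {
        cpu_set_t *set = arg;
        pthread_t t;

        pthread_setaffinity_np(pthread_self(), sizeof(*set), set);
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);
        return NULL;
    }

    int main(void)
    {
        static cpu_set_t set;
        pthread_t ctx;

        CPU_ZERO(&set);
        CPU_SET(0, &set);       /* e.g., the CPUs of one NUMA node */

        pthread_create(&ctx, NULL, context_thread, &set);
        pthread_join(ctx, NULL);
        return 0;               /* compile with -pthread */
    }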

So, let's introduce a user-creatable "thread-context" object that
essentially consists of a context thread used to create new threads.
QEMU can either try setting the CPU affinity itself ("cpu-affinity",
"node-affinity" properties), or upper layers can extract the thread id
("thread-id" property) to configure it externally.

Make memory-backends consume a thread-context object
(via the "prealloc-context" property) and use it when preallocating to
create new threads with the desired CPU affinity. Further, to make it
easier to use, allow creation of "thread-context" objects, including
setting the CPU affinity directly from QEMU, *before* enabling the
sandbox option.


Quick test on a system with 2 NUMA nodes:

Without CPU affinity:
    time qemu-system-x86_64 \
        -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
        -nographic -monitor stdio

    real    0m5.383s
    real    0m3.499s
    real    0m5.129s
    real    0m4.232s
    real    0m5.220s
    real    0m4.288s
    real    0m3.582s
    real    0m4.305s
    real    0m5.421s
    real    0m4.502s

    -> It heavily depends on the scheduler CPU selection

With CPU affinity:
    time qemu-system-x86_64 \
        -object thread-context,id=tc1,node-affinity=0 \
        -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
        -sandbox enable=on,resourcecontrol=deny \
        -nographic -monitor stdio

    real    0m1.959s
    real    0m1.942s
    real    0m1.943s
    real    0m1.941s
    real    0m1.948s
    real    0m1.964s
    real    0m1.949s
    real    0m1.948s
    real    0m1.941s
    real    0m1.937s

On reasonably large VMs, the speedup can be quite significant.

While this concept is currently only used for short-lived preallocation
threads, nothing major speaks against reusing the concept for other
threads that are harder to identify/configure -- except that
we need additional (idle) context threads that are otherwise left unused.

[1] https://lkml.kernel.org/r/ffdcd118d59b379ede2b64745144165a40f6a813.1652165704.git.mprivozn@redhat.com

Cc: Michal Privoznik <mprivozn@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: Eduardo Habkost <eduardo@habkost.net>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Stefan Weil <sw@weilnetz.de>

David Hildenbrand (7):
  util: Cleanup and rename os_mem_prealloc()
  util: Introduce qemu_thread_set_affinity() and
    qemu_thread_get_affinity()
  util: Introduce ThreadContext user-creatable object
  util: Add write-only "node-affinity" property for ThreadContext
  util: Make qemu_prealloc_mem() optionally consume a ThreadContext
  hostmem: Allow for specifying a ThreadContext for preallocation
  vl: Allow ThreadContext objects to be created before the sandbox
    option

 backends/hostmem.c            |  13 +-
 hw/virtio/virtio-mem.c        |   2 +-
 include/qemu/osdep.h          |  19 +-
 include/qemu/thread-context.h |  58 ++++++
 include/qemu/thread.h         |   4 +
 include/sysemu/hostmem.h      |   2 +
 meson.build                   |  16 ++
 qapi/qom.json                 |  25 +++
 softmmu/cpus.c                |   2 +-
 softmmu/vl.c                  |  30 ++-
 util/meson.build              |   1 +
 util/oslib-posix.c            |  39 ++--
 util/oslib-win32.c            |   8 +-
 util/qemu-thread-posix.c      |  70 +++++++
 util/qemu-thread-win32.c      |  12 ++
 util/thread-context.c         | 363 ++++++++++++++++++++++++++++++++++
 16 files changed, 637 insertions(+), 27 deletions(-)
 create mode 100644 include/qemu/thread-context.h
 create mode 100644 util/thread-context.c

Comments

Michal Privoznik July 25, 2022, 1:59 p.m. UTC | #1
On 7/21/22 14:07, David Hildenbrand wrote:
> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
> Michal.

I've skimmed through the patches and haven't spotted anything obviously
wrong. I'll test these more once I write libvirt support for them (which
I plan to do soon).

Michal
Michal Privoznik Aug. 5, 2022, 11:01 a.m. UTC | #2
On 7/21/22 14:07, David Hildenbrand wrote:
> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
> Michal.
> 
> Setting the CPU affinity of threads from inside QEMU usually isn't
> easily possible, because we don't want QEMU -- once started and running
> guest code -- to be able to mess up the system. QEMU disallows relevant
> syscalls using seccomp, such that any such invocation will fail.
> 
> Especially for memory preallocation in memory backends, the CPU affinity
> can significantly increase guest startup time, for example, when running
> large VMs backed by huge/gigantic pages, because of NUMA effects. For
> NUMA-aware preallocation, we have to set the CPU affinity, however:
> 
> (1) Once preallocation threads are created during preallocation, management
>     tools cannot intercept anymore to change the affinity. These threads
>     are created automatically on demand.
> (2) QEMU cannot easily set the CPU affinity itself.
> (3) The CPU affinity derived from the NUMA bindings of the memory backend
>     might not necessarily be exactly the CPUs we actually want to use
>     (e.g., CPU-less NUMA nodes, CPUs that are pinned/used for other VMs).
> 
> There is an easy "workaround". If we have a thread with the right CPU
> affinity, we can simply create new threads on demand via that prepared
> context. So, all we have to do is setup and create such a context ahead
> of time, to then configure preallocation to create new threads via that
> environment.
> 
> So, let's introduce a user-creatable "thread-context" object that
> essentially consists of a context thread used to create new threads.
> QEMU can either try setting the CPU affinity itself ("cpu-affinity",
> "node-affinity" property), or upper layers can extract the thread id
> ("thread-id" property) to configure it externally.
> 
> Make memory-backends consume a thread-context object
> (via the "prealloc-context" property) and use it when preallocating to
> create new threads with the desired CPU affinity. Further, to make it
> easier to use, allow creation of "thread-context" objects, including
> setting the CPU affinity directly from QEMU, *before* enabling the
> sandbox option.
> 
> 
> Quick test on a system with 2 NUMA nodes:
> 
> Without CPU affinity:
>     time qemu-system-x86_64 \
>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
>         -nographic -monitor stdio
> 
>     real    0m5.383s
>     real    0m3.499s
>     real    0m5.129s
>     real    0m4.232s
>     real    0m5.220s
>     real    0m4.288s
>     real    0m3.582s
>     real    0m4.305s
>     real    0m5.421s
>     real    0m4.502s
> 
>     -> It heavily depends on the scheduler CPU selection
> 
> With CPU affinity:
>     time qemu-system-x86_64 \
>         -object thread-context,id=tc1,node-affinity=0 \
>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
>         -sandbox enable=on,resourcecontrol=deny \
>         -nographic -monitor stdio
> 
>     real    0m1.959s
>     real    0m1.942s
>     real    0m1.943s
>     real    0m1.941s
>     real    0m1.948s
>     real    0m1.964s
>     real    0m1.949s
>     real    0m1.948s
>     real    0m1.941s
>     real    0m1.937s
> 
> On reasonably large VMs, the speedup can be quite significant.
> 

I've timed 'virsh start' with a guest that has 47GB worth of 1GB
hugepages and seen the startup time basically halved (from 10.5s to
5.6s). The host has 4 NUMA nodes and I'm pinning the guest onto two of them.

I've written the libvirt counterpart (which I'll post as soon as these are
merged). The way it works is that whenever .prealloc-threads= is to be
used AND QEMU is capable of thread-context, a thread-context object is
generated before every memory-backend-*, like this:

-object '{"qom-type":"thread-context","id":"tc-ram-node0","node-affinity":[2]}' \
-object '{"qom-type":"memory-backend-memfd","id":"ram-node0","hugetlb":true,"hugetlbsize":1073741824,"share":true,"prealloc":true,"prealloc-threads":16,"size":21474836480,"host-nodes":[2],"policy":"bind","prealloc-context":"tc-ram-node0"}' \
-numa node,nodeid=0,cpus=0,cpus=2,memdev=ram-node0 \
-object '{"qom-type":"thread-context","id":"tc-ram-node1","node-affinity":[3]}' \
-object '{"qom-type":"memory-backend-memfd","id":"ram-node1","hugetlb":true,"hugetlbsize":1073741824,"share":true,"prealloc":true,"prealloc-threads":16,"size":28991029248,"host-nodes":[3],"policy":"bind","prealloc-context":"tc-ram-node1"}' \


Now, it's not visible in this snippet, but my code does not reuse
thread-context objects. So if there's another memfd, it'll get its own TC:

-object '{"qom-type":"thread-context","id":"tc-memdimm0","node-affinity":[1]}' \
-object '{"qom-type":"memory-backend-memfd","id":"memdimm0","hugetlb":true,"hugetlbsize":1073741824,"share":true,"prealloc":true,"prealloc-threads":16,"size":1073741824,"host-nodes":[1],"policy":"bind","prealloc-context":"tc-memdimm0"}' \

The reason is that the logic generating memory backends is very complex,
and separating out parts of it so that thread-context objects can be
generated first and reused by those backends would inevitably lead to
regressions. I guess my question is whether it's a problem that libvirt
would leave one additional thread, sleeping on a semaphore, for each
memory backend (iff prealloc-threads are used).

Although, if I read the code correctly, a thread-context object can be
specified AFTER the memory backends, because thread-context objects are
parsed and created before the backends anyway. Well, something to think
about over the weekend.


> While this concept is currently only used for short-lived preallocation
> threads, nothing major speaks against reusing the concept for other
> threads that are harder to identify/configure -- except that
> we need additional (idle) context threads that are otherwise left unused.
> 
> [1] https://lkml.kernel.org/r/ffdcd118d59b379ede2b64745144165a40f6a813.1652165704.git.mprivozn@redhat.com
> 
> Cc: Michal Privoznik <mprivozn@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Daniel P. Berrangé" <berrange@redhat.com>
> Cc: Eduardo Habkost <eduardo@habkost.net>
> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Stefan Weil <sw@weilnetz.de>
> 
> David Hildenbrand (7):
>   util: Cleanup and rename os_mem_prealloc()
>   util: Introduce qemu_thread_set_affinity() and
>     qemu_thread_get_affinity()
>   util: Introduce ThreadContext user-creatable object
>   util: Add write-only "node-affinity" property for ThreadContext
>   util: Make qemu_prealloc_mem() optionally consume a ThreadContext
>   hostmem: Allow for specifying a ThreadContext for preallocation
>   vl: Allow ThreadContext objects to be created before the sandbox
>     option
> 
>  backends/hostmem.c            |  13 +-
>  hw/virtio/virtio-mem.c        |   2 +-
>  include/qemu/osdep.h          |  19 +-
>  include/qemu/thread-context.h |  58 ++++++
>  include/qemu/thread.h         |   4 +
>  include/sysemu/hostmem.h      |   2 +
>  meson.build                   |  16 ++
>  qapi/qom.json                 |  25 +++
>  softmmu/cpus.c                |   2 +-
>  softmmu/vl.c                  |  30 ++-
>  util/meson.build              |   1 +
>  util/oslib-posix.c            |  39 ++--
>  util/oslib-win32.c            |   8 +-
>  util/qemu-thread-posix.c      |  70 +++++++
>  util/qemu-thread-win32.c      |  12 ++
>  util/thread-context.c         | 363 ++++++++++++++++++++++++++++++++++
>  16 files changed, 637 insertions(+), 27 deletions(-)
>  create mode 100644 include/qemu/thread-context.h
>  create mode 100644 util/thread-context.c
> 

Reviewed-by: Michal Privoznik <mprivozn@redhat.com>

Michal
David Hildenbrand Aug. 5, 2022, 3:47 p.m. UTC | #3
> 
> I've timed 'virsh start' with a guest that has 47GB worth of 1GB
> hugepages and seen the startup time halved basically (from 10.5s to
> 5.6s). The host has 4 NUMA nodes and I'm pinning the guest onto two nodes.
> 
> I've written libvirt counterpart (which I'll post as soon as these are
> merged). The way it works is the whenever .prealloc-threads= is to be
> used AND qemu is capable of thread-context the thread-context object is
> generated before every memory-backend-*, like this:

One interesting corner case might be CPU-less NUMA nodes. Setting the
node-affinity would fail because there are no CPUs. Libvirt could
figure that out by testing whether the selected node(s) have CPUs.
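
For instance, a rough libnuma sketch of such a check (libvirt would likely
use its own NUMA helpers instead):

    #include <numa.h>           /* link with -lnuma */
    #include <stdbool.h>

    /* Return true if the given host NUMA node has any CPUs assigned,
     * i.e. whether "node-affinity" can work for it at all. */
    static bool node_has_cpus(int node)
    {
        struct bitmask *cpus = numa_allocate_cpumask();
        bool ret = false;

        if (numa_node_to_cpus(node, cpus) == 0) {
            ret = numa_bitmask_weight(cpus) > 0;
        }
        numa_free_cpumask(cpus);
        return ret;
    }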

> 
> -object
> '{"qom-type":"thread-context","id":"tc-ram-node0","node-affinity":[2]}' \
> -object
> '{"qom-type":"memory-backend-memfd","id":"ram-node0","hugetlb":true,"hugetlbsize":1073741824,"share":true,"prealloc":true,"prealloc-threads":16,"size":21474836480,"host-nodes":[2],"policy":"bind","prealloc-context":"tc-ram-node0"}'
> \
> -numa node,nodeid=0,cpus=0,cpus=2,memdev=ram-node0 \
> -object
> '{"qom-type":"thread-context","id":"tc-ram-node1","node-affinity":[3]}' \
> -object
> '{"qom-type":"memory-backend-memfd","id":"ram-node1","hugetlb":true,"hugetlbsize":1073741824,"share":true,"prealloc":true,"prealloc-threads":16,"size":28991029248,"host-nodes":[3],"policy":"bind","prealloc-context":"tc-ram-node1"}'
> \
> 
> 
> Now, it's not visible in this snippet, but my code does not reuse
> thread-context objects. So if there's another memfd, it'll get its own TC:
> 
> -object
> '{"qom-type":"thread-context","id":"tc-memdimm0","node-affinity":[1]}' \
> -object
> '{"qom-type":"memory-backend-memfd","id":"memdimm0","hugetlb":true,"hugetlbsize":1073741824,"share":true,"prealloc":true,"prealloc-threads":16,"size":1073741824,"host-nodes":[1],"policy":"bind","prealloc-context":"tc-memdimm0"}'
> \
> 
> The reason is that logic generating memory-backends is very complex and
> separating out parts of it so that thread-context objects can be
> generated first and reused by those backends would inevitably lead to

Sounds like something we can work on later.

> regression. I guess my question is, whether it's a problem that libvirt
> would leave one additional thread, sleeping in a semaphore, for each
> memory-backend (iff prealloc-threads are used).

I guess in most setups we just don't care. Of course, with 256 DIMMs or
an endless number of nodes, we *might* care.


One optimization for some ordinary setups (not caring about NUMA-aware
preallocation during DIMM hotplug) would be to assign some dummy thread
context once prealloc finished (e.g., once QEMU initialized after
prealloc) and delete the original thread context along with the thread.

> 
> Although, if I read the code correctly, thread-context object can be
> specified AFTER memory backends, because they are parsed and created
> before backends anyway. Well, something to think over the weekend.

Yes, the command line order does not matter.

[...]

> 
> Reviewed-by: Michal Privoznik <mprivozn@redhat.com>

Thanks!
Joao Martins Aug. 9, 2022, 10:56 a.m. UTC | #4
On 7/21/22 13:07, David Hildenbrand wrote:
> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
> Michal.
> 
> Setting the CPU affinity of threads from inside QEMU usually isn't
> easily possible, because we don't want QEMU -- once started and running
> guest code -- to be able to mess up the system. QEMU disallows relevant
> syscalls using seccomp, such that any such invocation will fail.
> 
> Especially for memory preallocation in memory backends, the CPU affinity
> can significantly increase guest startup time, for example, when running
> large VMs backed by huge/gigantic pages, because of NUMA effects. For
> NUMA-aware preallocation, we have to set the CPU affinity, however:
> 
> (1) Once preallocation threads are created during preallocation, management
>     tools cannot intercept anymore to change the affinity. These threads
>     are created automatically on demand.
> (2) QEMU cannot easily set the CPU affinity itself.
> (3) The CPU affinity derived from the NUMA bindings of the memory backend
>     might not necessarily be exactly the CPUs we actually want to use
>     (e.g., CPU-less NUMA nodes, CPUs that are pinned/used for other VMs).
> 
> There is an easy "workaround". If we have a thread with the right CPU
> affinity, we can simply create new threads on demand via that prepared
> context. So, all we have to do is setup and create such a context ahead
> of time, to then configure preallocation to create new threads via that
> environment.
> 
> So, let's introduce a user-creatable "thread-context" object that
> essentially consists of a context thread used to create new threads.
> QEMU can either try setting the CPU affinity itself ("cpu-affinity",
> "node-affinity" property), or upper layers can extract the thread id
> ("thread-id" property) to configure it externally.
> 
> Make memory-backends consume a thread-context object
> (via the "prealloc-context" property) and use it when preallocating to
> create new threads with the desired CPU affinity. Further, to make it
> easier to use, allow creation of "thread-context" objects, including
> setting the CPU affinity directly from QEMU, *before* enabling the
> sandbox option.
> 
> 
> Quick test on a system with 2 NUMA nodes:
> 
> Without CPU affinity:
>     time qemu-system-x86_64 \
>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
>         -nographic -monitor stdio
> 
>     real    0m5.383s
>     real    0m3.499s
>     real    0m5.129s
>     real    0m4.232s
>     real    0m5.220s
>     real    0m4.288s
>     real    0m3.582s
>     real    0m4.305s
>     real    0m5.421s
>     real    0m4.502s
> 
>     -> It heavily depends on the scheduler CPU selection
> 
> With CPU affinity:
>     time qemu-system-x86_64 \
>         -object thread-context,id=tc1,node-affinity=0 \
>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
>         -sandbox enable=on,resourcecontrol=deny \
>         -nographic -monitor stdio
> 
>     real    0m1.959s
>     real    0m1.942s
>     real    0m1.943s
>     real    0m1.941s
>     real    0m1.948s
>     real    0m1.964s
>     real    0m1.949s
>     real    0m1.948s
>     real    0m1.941s
>     real    0m1.937s
> 
> On reasonably large VMs, the speedup can be quite significant.
> 
Really awesome work!

I am not sure I picked this up well while reading the series, but it seems to me that
prealloc is still serialized per memory backend when solely configured via the
command line, right?

Meaning, when we start prealloc we wait until the memory backend's thread-context action
is completed (per memory backend), even if other to-be-configured memory backends will
use a thread-context on a separate set of pinned CPUs on another node ... and wouldn't,
in theory, "need" to wait until the former prealloc finishes?

Unless, as you alluded to in one of the last patches, we can pass these thread-contexts
with prealloc=off (and prealloc-context=NNN) while QEMU is paused (-S) and have different
QMP clients set prealloc=on, so that prealloc would happen concurrently per node?

We were thinking of extending this to leverage per-socket bandwidth, essentially to
parallelize this even further (we saw improvements with something like that but haven't
tried this series yet). Likely this is already possible with your work and I just didn't
pick up on it, hence making sure this is the case :)
David Hildenbrand Aug. 9, 2022, 6:06 p.m. UTC | #5
On 09.08.22 12:56, Joao Martins wrote:
> On 7/21/22 13:07, David Hildenbrand wrote:
>> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
>> Michal.
>>
>> Setting the CPU affinity of threads from inside QEMU usually isn't
>> easily possible, because we don't want QEMU -- once started and running
>> guest code -- to be able to mess up the system. QEMU disallows relevant
>> syscalls using seccomp, such that any such invocation will fail.
>>
>> Especially for memory preallocation in memory backends, the CPU affinity
>> can significantly increase guest startup time, for example, when running
>> large VMs backed by huge/gigantic pages, because of NUMA effects. For
>> NUMA-aware preallocation, we have to set the CPU affinity, however:
>>
>> (1) Once preallocation threads are created during preallocation, management
>>     tools cannot intercept anymore to change the affinity. These threads
>>     are created automatically on demand.
>> (2) QEMU cannot easily set the CPU affinity itself.
>> (3) The CPU affinity derived from the NUMA bindings of the memory backend
>>     might not necessarily be exactly the CPUs we actually want to use
>>     (e.g., CPU-less NUMA nodes, CPUs that are pinned/used for other VMs).
>>
>> There is an easy "workaround". If we have a thread with the right CPU
>> affinity, we can simply create new threads on demand via that prepared
>> context. So, all we have to do is setup and create such a context ahead
>> of time, to then configure preallocation to create new threads via that
>> environment.
>>
>> So, let's introduce a user-creatable "thread-context" object that
>> essentially consists of a context thread used to create new threads.
>> QEMU can either try setting the CPU affinity itself ("cpu-affinity",
>> "node-affinity" property), or upper layers can extract the thread id
>> ("thread-id" property) to configure it externally.
>>
>> Make memory-backends consume a thread-context object
>> (via the "prealloc-context" property) and use it when preallocating to
>> create new threads with the desired CPU affinity. Further, to make it
>> easier to use, allow creation of "thread-context" objects, including
>> setting the CPU affinity directly from QEMU, *before* enabling the
>> sandbox option.
>>
>>
>> Quick test on a system with 2 NUMA nodes:
>>
>> Without CPU affinity:
>>     time qemu-system-x86_64 \
>>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
>>         -nographic -monitor stdio
>>
>>     real    0m5.383s
>>     real    0m3.499s
>>     real    0m5.129s
>>     real    0m4.232s
>>     real    0m5.220s
>>     real    0m4.288s
>>     real    0m3.582s
>>     real    0m4.305s
>>     real    0m5.421s
>>     real    0m4.502s
>>
>>     -> It heavily depends on the scheduler CPU selection
>>
>> With CPU affinity:
>>     time qemu-system-x86_64 \
>>         -object thread-context,id=tc1,node-affinity=0 \
>>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
>>         -sandbox enable=on,resourcecontrol=deny \
>>         -nographic -monitor stdio
>>
>>     real    0m1.959s
>>     real    0m1.942s
>>     real    0m1.943s
>>     real    0m1.941s
>>     real    0m1.948s
>>     real    0m1.964s
>>     real    0m1.949s
>>     real    0m1.948s
>>     real    0m1.941s
>>     real    0m1.937s
>>
>> On reasonably large VMs, the speedup can be quite significant.
>>
> Really awesome work!

Thanks!

> 
> I am not sure I picked up this well while reading the series, but it seems to me that
> prealloc is still serialized on per memory-backend when solely configured by command-line
> right?

I think it's serialized in any case, even when preallocation is
triggered manually using prealloc=on. I might be wrong, but any kind of
object creation or property changes should be serialized by the BQL.

In theory, we can "easily" preallocate in our helper --
qemu_prealloc_mem() -- concurrently when we don't have to bother about
handling SIGBUS -- that is, when the kernel supports
MADV_POPULATE_WRITE. Without MADV_POPULATE_WRITE on older kernels, we'll
serialize in there as well.
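
To illustrate why the MADV_POPULATE_WRITE path would be easy to parallelize
(a sketch only, not qemu_prealloc_mem(); it assumes a Linux 5.14+ kernel and
omits error checking for mmap): each thread populates its own slice and gets
errors reported via the return value, with no SIGBUS coordination needed:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MADV_POPULATE_WRITE
    #define MADV_POPULATE_WRITE 23      /* Linux 5.14+ */
    #endif

    struct slice {
        char *addr;
        size_t size;
    };

    /* Each preallocation thread populates its own part of the mapping. */
    static void *populate(void *arg)
    {
        struct slice *s = arg;

        if (madvise(s->addr, s->size, MADV_POPULATE_WRITE) < 0) {
            perror("madvise(MADV_POPULATE_WRITE)");
        }
        return NULL;
    }

    int main(void)
    {
        size_t size = 64UL << 20;       /* 64 MiB demo mapping */
        char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        pthread_t t[4];
        struct slice s[4];

        for (int i = 0; i < 4; i++) {
            s[i] = (struct slice){ mem + i * (size / 4), size / 4 };
            pthread_create(&t[i], NULL, populate, &s[i]);
        }
        for (int i = 0; i < 4; i++) {
            pthread_join(t[i], NULL);
        }
        return 0;
    }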

> 
> Meaning when we start prealloc we wait until the memory-backend thread-context action is
> completed (per-memory-backend) even if other to-be-configured memory-backends will use a
> thread-context on a separate set of pinned CPUs on another node ... and wouldn't in theory
> "need" to wait until the former prealloc finishes?

Yes. This series only takes care of NUMA-aware preallocation, but
doesn't preallocate multiple memory backends in parallel.

In theory, it would be quite easy to preallocate concurrently: simply
create the memory backend objects passed on the QEMU cmdline
concurrently from multiple threads.

In practice, we have to be careful with the BQL, I think. But it doesn't
sound horribly complicated to achieve. We could perform everything
synchronized under the BQL and only trigger the actual expensive
preallocation (-> qemu_prealloc_mem()), which we know is MT-safe, with
the BQL released.

> 
> Unless as you alluded in one of the last patches: we can pass these thread-contexts with
> prealloc=off (and prealloc-context=NNN) while qemu is paused (-S) and have different QMP
> clients set prealloc=on, and thus prealloc would happen concurrently per node?

I think we will serialize in any case when modifying properties. Can you
give it a shot and see if it would work as of now? I doubt it, but I
might be wrong.

> 
> We were thinking to extend it to leverage per socket bandwidth essentially to parallel
> this even further (we saw improvements with something like that but haven't tried this
> series yet). Likely this is already possible with your work and I didn't pick up on it,
> hence just making sure this is the case :)

With this series, you can essentially tell QEMU which physical CPUs to
use for preallocating a given memory backend. But memory backends are
not created+preallocated concurrently yet.
Michal Privoznik Aug. 10, 2022, 6:55 a.m. UTC | #6
On 8/9/22 20:06, David Hildenbrand wrote:
> On 09.08.22 12:56, Joao Martins wrote:
>> On 7/21/22 13:07, David Hildenbrand wrote:
>>> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
>>> Michal.
>>>
>>> Setting the CPU affinity of threads from inside QEMU usually isn't
>>> easily possible, because we don't want QEMU -- once started and running
>>> guest code -- to be able to mess up the system. QEMU disallows relevant
>>> syscalls using seccomp, such that any such invocation will fail.
>>>
>>> Especially for memory preallocation in memory backends, the CPU affinity
>>> can significantly increase guest startup time, for example, when running
>>> large VMs backed by huge/gigantic pages, because of NUMA effects. For
>>> NUMA-aware preallocation, we have to set the CPU affinity, however:
>>>
>>> (1) Once preallocation threads are created during preallocation, management
>>>     tools cannot intercept anymore to change the affinity. These threads
>>>     are created automatically on demand.
>>> (2) QEMU cannot easily set the CPU affinity itself.
>>> (3) The CPU affinity derived from the NUMA bindings of the memory backend
>>>     might not necessarily be exactly the CPUs we actually want to use
>>>     (e.g., CPU-less NUMA nodes, CPUs that are pinned/used for other VMs).
>>>
>>> There is an easy "workaround". If we have a thread with the right CPU
>>> affinity, we can simply create new threads on demand via that prepared
>>> context. So, all we have to do is setup and create such a context ahead
>>> of time, to then configure preallocation to create new threads via that
>>> environment.
>>>
>>> So, let's introduce a user-creatable "thread-context" object that
>>> essentially consists of a context thread used to create new threads.
>>> QEMU can either try setting the CPU affinity itself ("cpu-affinity",
>>> "node-affinity" property), or upper layers can extract the thread id
>>> ("thread-id" property) to configure it externally.
>>>
>>> Make memory-backends consume a thread-context object
>>> (via the "prealloc-context" property) and use it when preallocating to
>>> create new threads with the desired CPU affinity. Further, to make it
>>> easier to use, allow creation of "thread-context" objects, including
>>> setting the CPU affinity directly from QEMU, *before* enabling the
>>> sandbox option.
>>>
>>>
>>> Quick test on a system with 2 NUMA nodes:
>>>
>>> Without CPU affinity:
>>>     time qemu-system-x86_64 \
>>>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
>>>         -nographic -monitor stdio
>>>
>>>     real    0m5.383s
>>>     real    0m3.499s
>>>     real    0m5.129s
>>>     real    0m4.232s
>>>     real    0m5.220s
>>>     real    0m4.288s
>>>     real    0m3.582s
>>>     real    0m4.305s
>>>     real    0m5.421s
>>>     real    0m4.502s
>>>
>>>     -> It heavily depends on the scheduler CPU selection
>>>
>>> With CPU affinity:
>>>     time qemu-system-x86_64 \
>>>         -object thread-context,id=tc1,node-affinity=0 \
>>>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
>>>         -sandbox enable=on,resourcecontrol=deny \
>>>         -nographic -monitor stdio
>>>
>>>     real    0m1.959s
>>>     real    0m1.942s
>>>     real    0m1.943s
>>>     real    0m1.941s
>>>     real    0m1.948s
>>>     real    0m1.964s
>>>     real    0m1.949s
>>>     real    0m1.948s
>>>     real    0m1.941s
>>>     real    0m1.937s
>>>
>>> On reasonably large VMs, the speedup can be quite significant.
>>>
>> Really awesome work!
> 
> Thanks!
> 
>>
>> I am not sure I picked up this well while reading the series, but it seems to me that
>> prealloc is still serialized on per memory-backend when solely configured by command-line
>> right?
> 
> I think it's serialized in any case, even when preallocation is
> triggered manually using prealloc=on. I might be wrong, but any kind of
> object creation or property changes should be serialized by the BQL.
> 
> In theory, we can "easily" preallocate in our helper --
> qemu_prealloc_mem() -- concurrently when we don't have to bother about
> handling SIGBUS -- that is, when the kernel supports
> MADV_POPULATE_WRITE. Without MADV_POPULATE_WRITE on older kernels, we'll
> serialize in there as well.
> 
>>
>> Meaning when we start prealloc we wait until the memory-backend thread-context action is
>> completed (per-memory-backend) even if other to-be-configured memory-backends will use a
>> thread-context on a separate set of pinned CPUs on another node ... and wouldn't in theory
>> "need" to wait until the former prealloc finishes?
> 
> Yes. This series only takes care of NUMA-aware preallocation, but
> doesn't preallocate multiple memory backends in parallel.
> 
> In theory, it would be quite easy to preallocate concurrently: simply
> create the memory backend objects passed on the QEMU cmdline
> concurrently from multiple threads.
> 
> In practice, we have to be careful I think with the BQL. But it doesn't
> sound horribly complicated to achieve that. We can perform all
> synchronized under the BQL and only trigger actual expensive
> preallocation (-> qemu_prealloc_mem()) , which we know is MT-safe, with
> released BQL.
> 
>>
>> Unless as you alluded in one of the last patches: we can pass these thread-contexts with
>> prealloc=off (and prealloc-context=NNN) while qemu is paused (-S) and have different QMP
>> clients set prealloc=on, and thus prealloc would happen concurrently per node?
> 
> I think we will serialize in any case when modifying properties. Can you
> give it a shot and see if it would work as of now? I doubt it, but I
> might be wrong.

Disclaimer: I don't know the QEMU internals that well, so I might be wrong,
but even if libvirt went with -preconfig, wouldn't the monitor be stuck
for every 'set prealloc=on' call? I mean, in the end, the same set of
functions (e.g. touch_all_pages()) is called as if the configuration was
provided via the cmd line. So I don't see why there should be any
difference between the cmd line and -preconfig.



<offtopic>
In the near future, as the number of cmd line arguments that libvirt
generates grows, libvirt might need to switch to -preconfig; or it might
need to query some values first and generate the configuration based on
those. But for now, there are no plans.
</offtopic>

Michal
Joao Martins Aug. 11, 2022, 10:50 a.m. UTC | #7
On 8/9/22 19:06, David Hildenbrand wrote:
> On 09.08.22 12:56, Joao Martins wrote:
>> On 7/21/22 13:07, David Hildenbrand wrote:
>>> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
>>> Michal.
>>>
>>> Setting the CPU affinity of threads from inside QEMU usually isn't
>>> easily possible, because we don't want QEMU -- once started and running
>>> guest code -- to be able to mess up the system. QEMU disallows relevant
>>> syscalls using seccomp, such that any such invocation will fail.
>>>
>>> Especially for memory preallocation in memory backends, the CPU affinity
>>> can significantly increase guest startup time, for example, when running
>>> large VMs backed by huge/gigantic pages, because of NUMA effects. For
>>> NUMA-aware preallocation, we have to set the CPU affinity, however:
>>>
>>> (1) Once preallocation threads are created during preallocation, management
>>>     tools cannot intercept anymore to change the affinity. These threads
>>>     are created automatically on demand.
>>> (2) QEMU cannot easily set the CPU affinity itself.
>>> (3) The CPU affinity derived from the NUMA bindings of the memory backend
>>>     might not necessarily be exactly the CPUs we actually want to use
>>>     (e.g., CPU-less NUMA nodes, CPUs that are pinned/used for other VMs).
>>>
>>> There is an easy "workaround". If we have a thread with the right CPU
>>> affinity, we can simply create new threads on demand via that prepared
>>> context. So, all we have to do is setup and create such a context ahead
>>> of time, to then configure preallocation to create new threads via that
>>> environment.
>>>
>>> So, let's introduce a user-creatable "thread-context" object that
>>> essentially consists of a context thread used to create new threads.
>>> QEMU can either try setting the CPU affinity itself ("cpu-affinity",
>>> "node-affinity" property), or upper layers can extract the thread id
>>> ("thread-id" property) to configure it externally.
>>>
>>> Make memory-backends consume a thread-context object
>>> (via the "prealloc-context" property) and use it when preallocating to
>>> create new threads with the desired CPU affinity. Further, to make it
>>> easier to use, allow creation of "thread-context" objects, including
>>> setting the CPU affinity directly from QEMU, *before* enabling the
>>> sandbox option.
>>>
>>>
>>> Quick test on a system with 2 NUMA nodes:
>>>
>>> Without CPU affinity:
>>>     time qemu-system-x86_64 \
>>>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
>>>         -nographic -monitor stdio
>>>
>>>     real    0m5.383s
>>>     real    0m3.499s
>>>     real    0m5.129s
>>>     real    0m4.232s
>>>     real    0m5.220s
>>>     real    0m4.288s
>>>     real    0m3.582s
>>>     real    0m4.305s
>>>     real    0m5.421s
>>>     real    0m4.502s
>>>
>>>     -> It heavily depends on the scheduler CPU selection
>>>
>>> With CPU affinity:
>>>     time qemu-system-x86_64 \
>>>         -object thread-context,id=tc1,node-affinity=0 \
>>>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
>>>         -sandbox enable=on,resourcecontrol=deny \
>>>         -nographic -monitor stdio
>>>
>>>     real    0m1.959s
>>>     real    0m1.942s
>>>     real    0m1.943s
>>>     real    0m1.941s
>>>     real    0m1.948s
>>>     real    0m1.964s
>>>     real    0m1.949s
>>>     real    0m1.948s
>>>     real    0m1.941s
>>>     real    0m1.937s
>>>
>>> On reasonably large VMs, the speedup can be quite significant.
>>>
>> Really awesome work!
> 
> Thanks!
> 
>>
>> I am not sure I picked up this well while reading the series, but it seems to me that
>> prealloc is still serialized on per memory-backend when solely configured by command-line
>> right?
> 
> I think it's serialized in any case, even when preallocation is
> triggered manually using prealloc=on. I might be wrong, but any kind of
> object creation or property changes should be serialized by the BQL.
> 
> In theory, we can "easily" preallocate in our helper --
> qemu_prealloc_mem() -- concurrently when we don't have to bother about
> handling SIGBUS -- that is, when the kernel supports
> MADV_POPULATE_WRITE. Without MADV_POPULATE_WRITE on older kernels, we'll
> serialize in there as well.
> 
/me nods, matches my understanding

>>
>> Meaning when we start prealloc we wait until the memory-backend thread-context action is
>> completed (per-memory-backend) even if other to-be-configured memory-backends will use a
>> thread-context on a separate set of pinned CPUs on another node ... and wouldn't in theory
>> "need" to wait until the former prealloc finishes?
> 
> Yes. This series only takes care of NUMA-aware preallocation, but
> doesn't preallocate multiple memory backends in parallel.
> 
> In theory, it would be quite easy to preallocate concurrently: simply
> create the memory backend objects passed on the QEMU cmdline
> concurrently from multiple threads.
> 
Right

> In practice, we have to be careful I think with the BQL. But it doesn't
> sound horribly complicated to achieve that. We can perform all
> synchronized under the BQL and only trigger actual expensive
> preallocation (-> qemu_prealloc_mem()) , which we know is MT-safe, with
> released BQL.
> 
Right.

The small bit to take care of (AFAIU from the code) is to defer waiting for all the
memset threads to finish. The problem, on the command line at least, is that you start
the memsetting but then wait for all the threads to finish. And because the context
passed to memset is allocated on the stack, we must wait, as we would otherwise lose
that state. So it's mainly about making the tracking global and deferring the point at
which we wait to join all threads. With MADV_POPULATE_WRITE we know we are OK, but I
wonder if SIGBUS could be made to work too, by registering the handler only once and
having the handler look up the thread based on the address range it is handling, given
the just-MCEd address. We would then also unregister the SIGBUS handler only once,
after all prealloc threads have finished.
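
A very rough sketch of that idea (hypothetical, not what QEMU does today):
one global SIGBUS handler, registered once, that maps si_addr back to the
preallocation range it belongs to:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct prealloc_range {
        char *start;
        size_t size;
        volatile sig_atomic_t failed;   /* set on hugepage shortage */
    };

    #define MAX_RANGES 64
    static struct prealloc_range ranges[MAX_RANGES];
    static int nr_ranges;

    /* One handler for all prealloc threads: find the range the faulting
     * address belongs to and flag it.  Real code would additionally have
     * to abort that thread's touching loop, e.g. via siglongjmp(). */
    static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
    {
        uintptr_t addr = (uintptr_t)info->si_addr;

        for (int i = 0; i < nr_ranges; i++) {
            uintptr_t start = (uintptr_t)ranges[i].start;

            if (addr >= start && addr < start + ranges[i].size) {
                ranges[i].failed = 1;
                break;
            }
        }
    }

    /* Registered once before the first prealloc thread starts and
     * unregistered once after the last one has been joined. */
    static void register_sigbus_once(void)
    {
        struct sigaction act;

        memset(&act, 0, sizeof(act));
        act.sa_sigaction = sigbus_handler;
        act.sa_flags = SA_SIGINFO;
        sigaction(SIGBUS, &act, NULL);
    }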

Via QMP, I am not sure the BQL is the only "problem"; there might be some monitor lock
there too, or some sort of request-handling serialization such that only one thread
processes and dispatches QMP requests. Simply releasing the BQL prior to prealloc doesn't
do much, though it may help with doing other work while that is happening.

>>
>> Unless as you alluded in one of the last patches: we can pass these thread-contexts with
>> prealloc=off (and prealloc-context=NNN) while qemu is paused (-S) and have different QMP
>> clients set prealloc=on, and thus prealloc would happen concurrently per node?
> 
> I think we will serialize in any case when modifying properties. Can you
> give it a shot and see if it would work as of now? I doubt it, but I
> might be wrong.
> 

In a quick experiment with two monitors, each attempting to prealloc one node in
parallel, it takes the same 7 secs (on a small 2-node 128G test) regardless. Your
expectation indeed looks correct.

>>
>> We were thinking to extend it to leverage per socket bandwidth essentially to parallel
>> this even further (we saw improvements with something like that but haven't tried this
>> series yet). Likely this is already possible with your work and I didn't pick up on it,
>> hence just making sure this is the case :)
> 
> With this series, you can essentially tell QEMU which physical CPUs to
> use for preallocating a given memory backend. But memory backends are
> not created+preallocated concurrently yet.
> 
Yeap, thanks for the context/info.
Michal Privoznik Sept. 21, 2022, 2:44 p.m. UTC | #8
On 7/21/22 14:07, David Hildenbrand wrote:
>

Ping? Is there any plan how to move forward? I have libvirt patches
ready to consume this and I'd like to prune my old local branches :-)

Michal
David Hildenbrand Sept. 21, 2022, 2:54 p.m. UTC | #9
On 21.09.22 16:44, Michal Prívozník wrote:
> On 7/21/22 14:07, David Hildenbrand wrote:
>>
> 
> Ping? Is there any plan how to move forward? I have libvirt patches
> ready to consume this and I'd like to prune my old local branches :-)

Heh, I was thinking about this series just today. I was distracted with
all kinds of other stuff.

I'll move forward with this series later this week/early next week.

Thanks!
Michal Privoznik Sept. 22, 2022, 6:45 a.m. UTC | #10
On 9/21/22 16:54, David Hildenbrand wrote:
> On 21.09.22 16:44, Michal Prívozník wrote:
>> On 7/21/22 14:07, David Hildenbrand wrote:
>>>
>>
>> Ping? Is there any plan how to move forward? I have libvirt patches
>> ready to consume this and I'd like to prune my old local branches :-)
> 
> Heh, I was thinking about this series just today. I was distracted with
> all other kind of stuff.
> 
> I'll move forward with this series later this week/early next week.

No rush, it's only that I don't want this to fall into the void. Let me know
if I can help somehow. Meanwhile, here's my aforementioned branch:

https://gitlab.com/MichalPrivoznik/libvirt/-/tree/qemu_thread_context

I've made it so that a ThreadContext is generated whenever
.prealloc-threads AND .host-nodes are used (i.e. no XML-visible config
knob). And I'm generating ThreadContext objects for each memory backend
separately even though they could be reused, but IMO that's an
optimization that can be done later.

Michal