
util: NUMA aware memory preallocation

Message ID ffdcd118d59b379ede2b64745144165a40f6a813.1652165704.git.mprivozn@redhat.com (mailing list archive)
State New, archived
Series: util: NUMA aware memory preallocation

Commit Message

Michal Privoznik May 10, 2022, 6:55 a.m. UTC
When allocating large amounts of memory the task is offloaded
onto threads. These threads then use various techniques to
allocate the memory fully (madvise(), writing into the memory).
However, these threads are free to run on any CPU, which becomes
problematic on NUMA machines because it may happen that a thread
is running on a distant node.

Ideally, this is something that a management application would
resolve, but we are not anywhere close to that. Firstly, memory
allocation happens before the monitor socket is even available. But
okay, that's what -preconfig is for. But then the problem is that
'object-add' would not return until all memory is preallocated.

Long story short, a management application has no way of learning the
TIDs of the allocator threads, so it can't make them run NUMA-aware.

But what we can do is propagate the 'host-nodes' attribute of the
MemoryBackend object down to where the preallocation threads are
created and set their affinity according to that attribute.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2074000
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 backends/hostmem.c     |  6 ++--
 hw/virtio/virtio-mem.c |  2 +-
 include/qemu/osdep.h   |  2 ++
 util/meson.build       |  2 +-
 util/oslib-posix.c     | 74 ++++++++++++++++++++++++++++++++++++++++--
 util/oslib-win32.c     |  2 ++
 6 files changed, 82 insertions(+), 6 deletions(-)
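
The core mechanic is small: each preallocation worker restricts itself to
the CPUs of the requested host NUMA nodes before it starts touching pages.
A minimal standalone sketch of that idea, assuming libnuma is available;
the names and the choice of node 0 here are purely illustrative, this is
not the patch itself:

/* Minimal sketch: a preallocation worker that binds itself to the CPUs
 * of the requested host NUMA nodes before faulting pages in.
 * Build with: gcc prealloc-sketch.c -lnuma -lpthread */
#include <numa.h>
#include <pthread.h>
#include <stdio.h>

static void *prealloc_worker(void *opaque)
{
    struct bitmask *nodes = opaque;   /* e.g. derived from 'host-nodes' */

    /* Constrain this thread to the nodes' CPUs; pages it touches are
     * then allocated node-locally under the default kernel policy. */
    if (numa_run_on_node_mask(nodes) < 0) {
        perror("numa_run_on_node_mask");
    }

    /* ... write one byte per page, or madvise(MADV_POPULATE_WRITE) ... */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    struct bitmask *nodes;

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this host\n");
        return 1;
    }
    nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, 0);    /* node 0, purely for illustration */

    pthread_create(&tid, NULL, prealloc_worker, nodes);
    pthread_join(tid, NULL);
    numa_bitmask_free(nodes);
    return 0;
}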

Comments

Daniel P. Berrangé May 10, 2022, 9:12 a.m. UTC | #1
On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote:
> When allocating large amounts of memory the task is offloaded
> onto threads. These threads then use various techniques to
> allocate the memory fully (madvise(), writing into the memory).
> However, these threads are free to run on any CPU, which becomes
> problematic on NUMA machines because it may happen that a thread
> is running on a distant node.
> 
> Ideally, this is something that a management application would
> resolve, but we are not anywhere close to that, Firstly, memory
> allocation happens before monitor socket is even available. But
> okay, that's what -preconfig is for. But then the problem is that
> 'object-add' would not return until all memory is preallocated.
> 
> Long story short, management application has no way of learning
> TIDs of allocator threads so it can't make them run NUMA aware.

So I'm wondering what the impact of this problem is for various
scenarios.

The default config for a KVM guest with libvirt is no CPU pinning
at all. The kernel auto-places CPUs and decides where RAM is to
be allocated. So in this case, whether or not libvirt can talk to
QMP in time to query threads is largely irrelevant, as we don't
want to do placement in any case.

In theory the kernel should allocate RAM on the node local to
where the process is currently executing. So as long as the
guest RAM fits in available free RAM on the local node, RAM
should be allocated from the node that matches the CPU running
the QEMU main thread.

The challenge is that if we spawn N more threads to do pre-alloc,
these can be spread onto other nodes. I wonder if the kernel
has any preference for keeping threads within a process on
the same NUMA node?

Overall, if libvirt is not applying pinning to the QEMU guest,
then we're 100% reliant on the kernel to do something sensible,
both for normal QEMU execution and for prealloc. Since we're
not doing placement of QEMU RAM or CPUs, the logic in this
patch won't do anything either.


If the guest has more RAM than can fit on the local NUMA node,
then we're doomed no matter what; even ignoring prealloc, there
will be cross-node traffic. This scenario requires the admin to
set up proper CPU/memory pinning for QEMU in libvirt.

If libvirt is doing CPU pinning (as instructed by the mgmt app
above us), then when we first start QEMU, the process thread
leader will be given affinity by libvirt prior to exec. This
affinity will be the union of the affinities of all CPUs that
will be configured later.

The typical case for CPU pinning is that everything fits in
one NUMA node, and so in this case we don't need to do anything
more. The prealloc threads will already be constrained to the
right place by the affinity of the QEMU thread leader, so the
logic in this patch will run, but it won't do anything that
was not already done.
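
A quick standalone probe illustrating the inheritance relied on here; this
is not QEMU code, and it assumes Linux with glibc, where a thread created
via pthread_create() gets a copy of the creator's CPU affinity mask:

/* Build with: gcc affinity-probe.c -lpthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *show_affinity(void *opaque)
{
    cpu_set_t set;

    /* Report which CPUs the newly created thread is allowed to run on. */
    pthread_getaffinity_np(pthread_self(), sizeof(set), &set);
    for (int i = 0; i < CPU_SETSIZE; i++) {
        if (CPU_ISSET(i, &set)) {
            printf("child may run on CPU %d\n", i);
        }
    }
    return NULL;
}

int main(void)
{
    cpu_set_t set;
    pthread_t tid;

    CPU_ZERO(&set);
    CPU_SET(0, &set);                                /* pin "leader" to CPU 0 */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    pthread_create(&tid, NULL, show_affinity, NULL); /* inherits the mask */
    pthread_join(tid, NULL);
    return 0;
}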

So we're left with the hardest case, where the guest is explicitly
spread across multiple NUMA nodes. In this case the thread leader
affinity will span many NUMA nodes, and so the prealloc threads
will freely be placed across any CPU that is in the union of CPUs
the guest is placed on. Just as with the non-pinned case, the
prealloc will be at the mercy of the kernel making sensible
placement decisions.

The very last case is the only one where this patch can potentially
be beneficial. The problem is that because libvirt is in charge of
enforcing CPU affinity, IIRC, we explicitly block QEMU from doing
anything with CPU affinity. So AFAICT, this patch should result in
an error from sched_setaffinity when run under libvirt.

> But what we can do is to propagate the 'host-nodes' attribute of
> MemoryBackend object down to where preallocation threads are
> created and set their affinity according to the attribute.
> 
> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2074000
> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> ---
>  backends/hostmem.c     |  6 ++--
>  hw/virtio/virtio-mem.c |  2 +-
>  include/qemu/osdep.h   |  2 ++
>  util/meson.build       |  2 +-
>  util/oslib-posix.c     | 74 ++++++++++++++++++++++++++++++++++++++++--
>  util/oslib-win32.c     |  2 ++
>  6 files changed, 82 insertions(+), 6 deletions(-)
> 
> diff --git a/backends/hostmem.c b/backends/hostmem.c
> index a7bae3d713..7373472c7e 100644
> --- a/backends/hostmem.c
> +++ b/backends/hostmem.c
> @@ -232,7 +232,8 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
>          void *ptr = memory_region_get_ram_ptr(&backend->mr);
>          uint64_t sz = memory_region_size(&backend->mr);
>  
> -        os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads, &local_err);
> +        os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads,
> +                        backend->host_nodes, MAX_NODES, &local_err);
>          if (local_err) {
>              error_propagate(errp, local_err);
>              return;
> @@ -394,7 +395,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
>           */
>          if (backend->prealloc) {
>              os_mem_prealloc(memory_region_get_fd(&backend->mr), ptr, sz,
> -                            backend->prealloc_threads, &local_err);
> +                            backend->prealloc_threads, backend->host_nodes,
> +                            MAX_NODES, &local_err);
>              if (local_err) {
>                  goto out;
>              }
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index 5aca408726..48b104cdf6 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -467,7 +467,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa,
>              int fd = memory_region_get_fd(&vmem->memdev->mr);
>              Error *local_err = NULL;
>  
> -            os_mem_prealloc(fd, area, size, 1, &local_err);
> +            os_mem_prealloc(fd, area, size, 1, NULL, 0, &local_err);
>              if (local_err) {
>                  static bool warned;
>  
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index 1c1e7eca98..474cbf3b86 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -577,6 +577,8 @@ unsigned long qemu_getauxval(unsigned long type);
>  void qemu_set_tty_echo(int fd, bool echo);
>  
>  void os_mem_prealloc(int fd, char *area, size_t sz, int smp_cpus,
> +                     const unsigned long *host_nodes,
> +                     unsigned long max_node,
>                       Error **errp);
>  
>  /**
> diff --git a/util/meson.build b/util/meson.build
> index 8f16018cd4..393ff74570 100644
> --- a/util/meson.build
> +++ b/util/meson.build
> @@ -15,7 +15,7 @@ freebsd_dep = []
>  if targetos == 'freebsd'
>    freebsd_dep = util
>  endif
> -util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep])
> +util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep, numa])
>  util_ss.add(when: 'CONFIG_POSIX', if_true: files('qemu-thread-posix.c'))
>  util_ss.add(when: 'CONFIG_POSIX', if_true: files('memfd.c'))
>  util_ss.add(when: 'CONFIG_WIN32', if_true: files('aio-win32.c'))
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index 477990f39b..1572b9b178 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -73,6 +73,10 @@
>  #include "qemu/error-report.h"
>  #endif
>  
> +#ifdef CONFIG_NUMA
> +#include <numa.h>
> +#endif
> +
>  #define MAX_MEM_PREALLOC_THREAD_COUNT 16
>  
>  struct MemsetThread;
> @@ -82,6 +86,9 @@ typedef struct MemsetContext {
>      bool any_thread_failed;
>      struct MemsetThread *threads;
>      int num_threads;
> +#ifdef CONFIG_NUMA
> +    struct bitmask *nodemask;
> +#endif
>  } MemsetContext;
>  
>  struct MemsetThread {
> @@ -420,6 +427,12 @@ static void *do_touch_pages(void *arg)
>      }
>      qemu_mutex_unlock(&page_mutex);
>  
> +#ifdef CONFIG_NUMA
> +    if (memset_args->context->nodemask) {
> +        numa_run_on_node_mask(memset_args->context->nodemask);
> +    }
> +#endif
> +
>      /* unblock SIGBUS */
>      sigemptyset(&set);
>      sigaddset(&set, SIGBUS);
> @@ -463,6 +476,12 @@ static void *do_madv_populate_write_pages(void *arg)
>      }
>      qemu_mutex_unlock(&page_mutex);
>  
> +#ifdef CONFIG_NUMA
> +    if (memset_args->context->nodemask) {
> +        numa_run_on_node_mask(memset_args->context->nodemask);
> +    }
> +#endif

Ok, so this is where affinity is set, and I believe this ought to be
failing when run under libvirt with an EPERM from sched_setaffinity.
This code is ignoring errors, so I guess this won't be noticed unless
libnuma is printing an error message?
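
For reference, a checked variant at this spot might look roughly like the
following; only a sketch against the patch's context, reusing QEMU's
existing warn_report() helper (already available via qemu/error-report.h):

#ifdef CONFIG_NUMA
    if (memset_args->context->nodemask) {
        if (numa_run_on_node_mask(memset_args->context->nodemask) < 0) {
            /* e.g. EPERM when a seccomp sandbox denies sched_setaffinity */
            warn_report("failed to pin prealloc thread to NUMA nodes: %s",
                        strerror(errno));
        }
    }
#endif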

> +
>      if (size && qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE)) {
>          ret = -errno;
>      }
> @@ -489,7 +508,9 @@ static inline int get_memset_num_threads(size_t hpagesize, size_t numpages,
>  }
>  
>  static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
> -                           int smp_cpus, bool use_madv_populate_write)
> +                           int smp_cpus, const unsigned long *host_nodes,
> +                           unsigned long max_node,
> +                           bool use_madv_populate_write)
>  {
>      static gsize initialized = 0;
>      MemsetContext context = {
> @@ -499,6 +520,7 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>      void *(*touch_fn)(void *);
>      int ret = 0, i = 0;
>      char *addr = area;
> +    unsigned long value = max_node;
>  
>      if (g_once_init_enter(&initialized)) {
>          qemu_mutex_init(&page_mutex);
> @@ -520,6 +542,48 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>          touch_fn = do_touch_pages;
>      }
>  
> +    if (host_nodes) {
> +        value = find_first_bit(host_nodes, max_node);
> +    }
> +    if (value != max_node) {
> +#ifdef CONFIG_NUMA
> +        struct bitmask *cpus = numa_allocate_cpumask();
> +        g_autofree unsigned long *zerocpumask;
> +        size_t zerocpumasklen;
> +        g_autofree unsigned long *zeronodemask;
> +        size_t zeronodemasklen;
> +
> +        context.nodemask = numa_bitmask_alloc(max_node);
> +
> +        zerocpumasklen = cpus->size / sizeof(unsigned long);
> +        zerocpumask = g_new0(unsigned long, zerocpumasklen);
> +
> +        for (; value != max_node;
> +             value = find_next_bit(host_nodes, max_node, value + 1)) {
> +            if (numa_node_to_cpus(value, cpus) ||
> +                memcmp(cpus->maskp, zerocpumask, zerocpumasklen) == 0)
> +                continue;
> +
> +            /* If given NUMA node has CPUs run threads on them. */
> +            numa_bitmask_setbit(context.nodemask, value);
> +        }
> +
> +        numa_bitmask_free(cpus);
> +
> +        zeronodemasklen = max_node / sizeof(unsigned long);
> +        zeronodemask = g_new0(unsigned long, zeronodemasklen);
> +
> +        if (memcmp(context.nodemask->maskp,
> +                   zeronodemask, zeronodemasklen) == 0) {
> +            /* If no NUMA has a CPU available, then don't pin threads. */
> +            g_clear_pointer(&context.nodemask, numa_bitmask_free);
> +        }
> +#else
> +        errno = -EINVAL;
> +        return -1;
> +#endif
> +    }
> +
>      context.threads = g_new0(MemsetThread, context.num_threads);
>      numpages_per_thread = numpages / context.num_threads;
>      leftover = numpages % context.num_threads;
> @@ -554,6 +618,10 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>      if (!use_madv_populate_write) {
>          sigbus_memset_context = NULL;
>      }
> +
> +#ifdef CONFIG_NUMA
> +    g_clear_pointer(&context.nodemask, numa_bitmask_free);
> +#endif
>      g_free(context.threads);
>  
>      return ret;
> @@ -566,6 +634,8 @@ static bool madv_populate_write_possible(char *area, size_t pagesize)
>  }
>  
>  void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
> +                     const unsigned long *host_nodes,
> +                     unsigned long max_node,
>                       Error **errp)
>  {
>      static gsize initialized;
> @@ -608,7 +678,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
>  
>      /* touch pages simultaneously */
>      ret = touch_all_pages(area, hpagesize, numpages, smp_cpus,
> -                          use_madv_populate_write);
> +                          host_nodes, max_node, use_madv_populate_write);
>      if (ret) {
>          error_setg_errno(errp, -ret,
>                           "os_mem_prealloc: preallocating memory failed");
> diff --git a/util/oslib-win32.c b/util/oslib-win32.c
> index dafef4f157..6efd912355 100644
> --- a/util/oslib-win32.c
> +++ b/util/oslib-win32.c
> @@ -314,6 +314,8 @@ int getpagesize(void)
>  }
>  
>  void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
> +                     const unsigned long *host_nodes,
> +                     unsigned long max_node,
>                       Error **errp)
>  {
>      int i;
> -- 
> 2.35.1
> 
> 

With regards,
Daniel
Dr. David Alan Gilbert May 10, 2022, 10:27 a.m. UTC | #2
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote:
> > When allocating large amounts of memory the task is offloaded
> > onto threads. These threads then use various techniques to
> > allocate the memory fully (madvise(), writing into the memory).
> > However, these threads are free to run on any CPU, which becomes
> > problematic on NUMA machines because it may happen that a thread
> > is running on a distant node.
> > 
> > Ideally, this is something that a management application would
> > resolve, but we are not anywhere close to that, Firstly, memory
> > allocation happens before monitor socket is even available. But
> > okay, that's what -preconfig is for. But then the problem is that
> > 'object-add' would not return until all memory is preallocated.
> > 
> > Long story short, management application has no way of learning
> > TIDs of allocator threads so it can't make them run NUMA aware.
> 
> So I'm wondering what the impact of this problem is for various
> scenarios.
> 
> The default config for a KVM guest with libvirt is no CPU pinning
> at all. The kernel auto-places CPUs and decides on where RAM is to
> be allocated. So in this case, whether or not libvirt can talk to
> QMP in time to query threads is largely irrelevant, as we don't
> want todo placement in any case.
> 
> In theory the kernel should allocate RAM on the node local to
> where the process is currently executing. So as long as the
> guest RAM fits in available free RAM on the local node, RAM
> should be allocated from the node that matches the CPU running
> the QEMU main thread.
> 
> The challenge is if we spawn N more threads to do pre-alloc,
> these can be spread onto other nodes. I wonder if the kernel
> huas any preference for keeping threads within a process on
> the same NUMA node ?
> 
> Overall, if libvirt is not applying pinning to the QEMU guest,
> then we're 100% reliant on the kernel todo something sensible,
> both for normal QEMU execution and for prealloc. Since we're
> not doing placement of QEMU RAM or CPUs, the logic in this
> patch won't do anything either.
> 
> 
> If the guest has more RAM than can fit on the local NUMA node,
> then we're doomed no matter what, even ignoring prealloc, there
> will be cross-node traffic. This scenario requires the admin to
> setup proper CPU /memory pinning for QEMU in libvirt.
> 
> If libvirt is doing CPU pinning (as instructed by the mgmt app
> above us), then when we first start QEMU, the process thread
> leader will get given affinity by libvirt prior to exec. This
> affinity will be the union of affinity for all CPUs that will
> be later configured.
> 
> The typical case for CPU pinning, is that everything fits in
> one NUMA node, and so in this case, we don't need todo anything
> more. The prealloc threads will already be constrained to the
> right place by the affinity of the QEMU thread leader, so the
> logic in this patch will run, but it won't do anything that
> was not already done.
> 
> So we're left with the hardest case, where the guest is explicitly
> spread across multiple NUMA nodes. In this case the thread leader
> affinity will span many NUMA nodes, and so the prealloc threads
> will freely be placed across any CPU that is in the union of CPUs
> the guest is placed on. Just as with thue non-pinned case, the
> prealloc will be at the mercy of the kernel making sensible
> placement decisions.
> 
> The very last cases is the only one where this patch can potentially
> be beneficial. The problem is that because libvirt is in charge of
> enforcing CPU affinity, IIRC, we explicitly block QEMU from doing
> anything with CPU affinity. So AFAICT, this patch should result in
> an error from sched_setaffinity when run under libvirt.

This problem crops up in a few places though; potentially the same trick
could be done with multifd threads in migration, if we can make some
multifd threads handle specific NUMA nodes rather than shotgunning
the work across all the threads.
[Bonus: Make multifd route different NUMA nodes down NICs that are
node local]
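
Illustrative only: what pinning a hypothetical per-node multifd thread to
one host node's CPUs could look like; the helper name is made up and
nothing like it exists in the migration code today.

#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: restrict 'thread' to the CPUs of host NUMA 'node'. */
static int pin_thread_to_node(pthread_t thread, int node)
{
    struct bitmask *cpus = numa_allocate_cpumask();
    cpu_set_t set;
    int i, ret;

    CPU_ZERO(&set);
    if (numa_node_to_cpus(node, cpus) < 0) {
        numa_bitmask_free(cpus);
        return -1;
    }
    /* Copy the node's CPU bitmap into a cpu_set_t for pthread affinity. */
    for (i = 0; i < numa_num_possible_cpus() && i < CPU_SETSIZE; i++) {
        if (numa_bitmask_isbitset(cpus, i)) {
            CPU_SET(i, &set);
        }
    }
    numa_bitmask_free(cpus);

    ret = pthread_setaffinity_np(thread, sizeof(set), &set);
    return ret ? -1 : 0;
}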

Dave

> > But what we can do is to propagate the 'host-nodes' attribute of
> > MemoryBackend object down to where preallocation threads are
> > created and set their affinity according to the attribute.
> > 
> > Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2074000
> > Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> > ---
> >  backends/hostmem.c     |  6 ++--
> >  hw/virtio/virtio-mem.c |  2 +-
> >  include/qemu/osdep.h   |  2 ++
> >  util/meson.build       |  2 +-
> >  util/oslib-posix.c     | 74 ++++++++++++++++++++++++++++++++++++++++--
> >  util/oslib-win32.c     |  2 ++
> >  6 files changed, 82 insertions(+), 6 deletions(-)
> > 
> > diff --git a/backends/hostmem.c b/backends/hostmem.c
> > index a7bae3d713..7373472c7e 100644
> > --- a/backends/hostmem.c
> > +++ b/backends/hostmem.c
> > @@ -232,7 +232,8 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
> >          void *ptr = memory_region_get_ram_ptr(&backend->mr);
> >          uint64_t sz = memory_region_size(&backend->mr);
> >  
> > -        os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads, &local_err);
> > +        os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads,
> > +                        backend->host_nodes, MAX_NODES, &local_err);
> >          if (local_err) {
> >              error_propagate(errp, local_err);
> >              return;
> > @@ -394,7 +395,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
> >           */
> >          if (backend->prealloc) {
> >              os_mem_prealloc(memory_region_get_fd(&backend->mr), ptr, sz,
> > -                            backend->prealloc_threads, &local_err);
> > +                            backend->prealloc_threads, backend->host_nodes,
> > +                            MAX_NODES, &local_err);
> >              if (local_err) {
> >                  goto out;
> >              }
> > diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> > index 5aca408726..48b104cdf6 100644
> > --- a/hw/virtio/virtio-mem.c
> > +++ b/hw/virtio/virtio-mem.c
> > @@ -467,7 +467,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa,
> >              int fd = memory_region_get_fd(&vmem->memdev->mr);
> >              Error *local_err = NULL;
> >  
> > -            os_mem_prealloc(fd, area, size, 1, &local_err);
> > +            os_mem_prealloc(fd, area, size, 1, NULL, 0, &local_err);
> >              if (local_err) {
> >                  static bool warned;
> >  
> > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> > index 1c1e7eca98..474cbf3b86 100644
> > --- a/include/qemu/osdep.h
> > +++ b/include/qemu/osdep.h
> > @@ -577,6 +577,8 @@ unsigned long qemu_getauxval(unsigned long type);
> >  void qemu_set_tty_echo(int fd, bool echo);
> >  
> >  void os_mem_prealloc(int fd, char *area, size_t sz, int smp_cpus,
> > +                     const unsigned long *host_nodes,
> > +                     unsigned long max_node,
> >                       Error **errp);
> >  
> >  /**
> > diff --git a/util/meson.build b/util/meson.build
> > index 8f16018cd4..393ff74570 100644
> > --- a/util/meson.build
> > +++ b/util/meson.build
> > @@ -15,7 +15,7 @@ freebsd_dep = []
> >  if targetos == 'freebsd'
> >    freebsd_dep = util
> >  endif
> > -util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep])
> > +util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep, numa])
> >  util_ss.add(when: 'CONFIG_POSIX', if_true: files('qemu-thread-posix.c'))
> >  util_ss.add(when: 'CONFIG_POSIX', if_true: files('memfd.c'))
> >  util_ss.add(when: 'CONFIG_WIN32', if_true: files('aio-win32.c'))
> > diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> > index 477990f39b..1572b9b178 100644
> > --- a/util/oslib-posix.c
> > +++ b/util/oslib-posix.c
> > @@ -73,6 +73,10 @@
> >  #include "qemu/error-report.h"
> >  #endif
> >  
> > +#ifdef CONFIG_NUMA
> > +#include <numa.h>
> > +#endif
> > +
> >  #define MAX_MEM_PREALLOC_THREAD_COUNT 16
> >  
> >  struct MemsetThread;
> > @@ -82,6 +86,9 @@ typedef struct MemsetContext {
> >      bool any_thread_failed;
> >      struct MemsetThread *threads;
> >      int num_threads;
> > +#ifdef CONFIG_NUMA
> > +    struct bitmask *nodemask;
> > +#endif
> >  } MemsetContext;
> >  
> >  struct MemsetThread {
> > @@ -420,6 +427,12 @@ static void *do_touch_pages(void *arg)
> >      }
> >      qemu_mutex_unlock(&page_mutex);
> >  
> > +#ifdef CONFIG_NUMA
> > +    if (memset_args->context->nodemask) {
> > +        numa_run_on_node_mask(memset_args->context->nodemask);
> > +    }
> > +#endif
> > +
> >      /* unblock SIGBUS */
> >      sigemptyset(&set);
> >      sigaddset(&set, SIGBUS);
> > @@ -463,6 +476,12 @@ static void *do_madv_populate_write_pages(void *arg)
> >      }
> >      qemu_mutex_unlock(&page_mutex);
> >  
> > +#ifdef CONFIG_NUMA
> > +    if (memset_args->context->nodemask) {
> > +        numa_run_on_node_mask(memset_args->context->nodemask);
> > +    }
> > +#endif
> 
> Ok, so this is where affinity is set, and I believe this ought to be
> failing when run under libvirt with an EPERM from sched_setaffinity.
> This code is ignoring errors so I guess this won't be noticed unless
> libnuma is printing an error message ?
> 
> > +
> >      if (size && qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE)) {
> >          ret = -errno;
> >      }
> > @@ -489,7 +508,9 @@ static inline int get_memset_num_threads(size_t hpagesize, size_t numpages,
> >  }
> >  
> >  static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
> > -                           int smp_cpus, bool use_madv_populate_write)
> > +                           int smp_cpus, const unsigned long *host_nodes,
> > +                           unsigned long max_node,
> > +                           bool use_madv_populate_write)
> >  {
> >      static gsize initialized = 0;
> >      MemsetContext context = {
> > @@ -499,6 +520,7 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
> >      void *(*touch_fn)(void *);
> >      int ret = 0, i = 0;
> >      char *addr = area;
> > +    unsigned long value = max_node;
> >  
> >      if (g_once_init_enter(&initialized)) {
> >          qemu_mutex_init(&page_mutex);
> > @@ -520,6 +542,48 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
> >          touch_fn = do_touch_pages;
> >      }
> >  
> > +    if (host_nodes) {
> > +        value = find_first_bit(host_nodes, max_node);
> > +    }
> > +    if (value != max_node) {
> > +#ifdef CONFIG_NUMA
> > +        struct bitmask *cpus = numa_allocate_cpumask();
> > +        g_autofree unsigned long *zerocpumask;
> > +        size_t zerocpumasklen;
> > +        g_autofree unsigned long *zeronodemask;
> > +        size_t zeronodemasklen;
> > +
> > +        context.nodemask = numa_bitmask_alloc(max_node);
> > +
> > +        zerocpumasklen = cpus->size / sizeof(unsigned long);
> > +        zerocpumask = g_new0(unsigned long, zerocpumasklen);
> > +
> > +        for (; value != max_node;
> > +             value = find_next_bit(host_nodes, max_node, value + 1)) {
> > +            if (numa_node_to_cpus(value, cpus) ||
> > +                memcmp(cpus->maskp, zerocpumask, zerocpumasklen) == 0)
> > +                continue;
> > +
> > +            /* If given NUMA node has CPUs run threads on them. */
> > +            numa_bitmask_setbit(context.nodemask, value);
> > +        }
> > +
> > +        numa_bitmask_free(cpus);
> > +
> > +        zeronodemasklen = max_node / sizeof(unsigned long);
> > +        zeronodemask = g_new0(unsigned long, zeronodemasklen);
> > +
> > +        if (memcmp(context.nodemask->maskp,
> > +                   zeronodemask, zeronodemasklen) == 0) {
> > +            /* If no NUMA has a CPU available, then don't pin threads. */
> > +            g_clear_pointer(&context.nodemask, numa_bitmask_free);
> > +        }
> > +#else
> > +        errno = -EINVAL;
> > +        return -1;
> > +#endif
> > +    }
> > +
> >      context.threads = g_new0(MemsetThread, context.num_threads);
> >      numpages_per_thread = numpages / context.num_threads;
> >      leftover = numpages % context.num_threads;
> > @@ -554,6 +618,10 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
> >      if (!use_madv_populate_write) {
> >          sigbus_memset_context = NULL;
> >      }
> > +
> > +#ifdef CONFIG_NUMA
> > +    g_clear_pointer(&context.nodemask, numa_bitmask_free);
> > +#endif
> >      g_free(context.threads);
> >  
> >      return ret;
> > @@ -566,6 +634,8 @@ static bool madv_populate_write_possible(char *area, size_t pagesize)
> >  }
> >  
> >  void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
> > +                     const unsigned long *host_nodes,
> > +                     unsigned long max_node,
> >                       Error **errp)
> >  {
> >      static gsize initialized;
> > @@ -608,7 +678,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
> >  
> >      /* touch pages simultaneously */
> >      ret = touch_all_pages(area, hpagesize, numpages, smp_cpus,
> > -                          use_madv_populate_write);
> > +                          host_nodes, max_node, use_madv_populate_write);
> >      if (ret) {
> >          error_setg_errno(errp, -ret,
> >                           "os_mem_prealloc: preallocating memory failed");
> > diff --git a/util/oslib-win32.c b/util/oslib-win32.c
> > index dafef4f157..6efd912355 100644
> > --- a/util/oslib-win32.c
> > +++ b/util/oslib-win32.c
> > @@ -314,6 +314,8 @@ int getpagesize(void)
> >  }
> >  
> >  void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
> > +                     const unsigned long *host_nodes,
> > +                     unsigned long max_node,
> >                       Error **errp)
> >  {
> >      int i;
> > -- 
> > 2.35.1
> > 
> > 
> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
>
Dr. David Alan Gilbert May 11, 2022, 8:34 a.m. UTC | #3
* Michal Privoznik (mprivozn@redhat.com) wrote:
> When allocating large amounts of memory the task is offloaded
> onto threads. These threads then use various techniques to
> allocate the memory fully (madvise(), writing into the memory).
> However, these threads are free to run on any CPU, which becomes
> problematic on NUMA machines because it may happen that a thread
> is running on a distant node.
> 
> Ideally, this is something that a management application would
> resolve, but we are not anywhere close to that, Firstly, memory
> allocation happens before monitor socket is even available. But
> okay, that's what -preconfig is for. But then the problem is that
> 'object-add' would not return until all memory is preallocated.
> 
> Long story short, management application has no way of learning
> TIDs of allocator threads so it can't make them run NUMA aware.
> 
> But what we can do is to propagate the 'host-nodes' attribute of
> MemoryBackend object down to where preallocation threads are
> created and set their affinity according to the attribute.

Joe (cc'd) sent me some numbers for this which emphasise how useful it
is:
 | On systems with 4 physical numa nodes and 2-6 Tb of memory, this numa-aware
 |preallocation provided about a 25% speedup in touching the pages.
 |The speedup gets larger as the numa node count and memory sizes grow.
....
 | In a simple parallel 1Gb page-zeroing test on a very large system (32-numa
 | nodes and 47Tb of memory), the numa-aware preallocation was 2.3X faster
 | than letting the threads float wherever.
 | We're working with someone whose large guest normally takes 4.5 hours to
 | boot.  With Michal P's initial patch to parallelize the preallocation, that
 | time dropped to about 1 hour.  Including this numa-aware preallocation
 | would reduce the guest boot time to less than 1/2 hour.

so chopping *half an hour* off the startup time seems a worthy
optimisation (even if most of us aren't fortunate enough to have 47T of
ram).

Dave

> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2074000
> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> ---
>  backends/hostmem.c     |  6 ++--
>  hw/virtio/virtio-mem.c |  2 +-
>  include/qemu/osdep.h   |  2 ++
>  util/meson.build       |  2 +-
>  util/oslib-posix.c     | 74 ++++++++++++++++++++++++++++++++++++++++--
>  util/oslib-win32.c     |  2 ++
>  6 files changed, 82 insertions(+), 6 deletions(-)
> 
> diff --git a/backends/hostmem.c b/backends/hostmem.c
> index a7bae3d713..7373472c7e 100644
> --- a/backends/hostmem.c
> +++ b/backends/hostmem.c
> @@ -232,7 +232,8 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
>          void *ptr = memory_region_get_ram_ptr(&backend->mr);
>          uint64_t sz = memory_region_size(&backend->mr);
>  
> -        os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads, &local_err);
> +        os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads,
> +                        backend->host_nodes, MAX_NODES, &local_err);
>          if (local_err) {
>              error_propagate(errp, local_err);
>              return;
> @@ -394,7 +395,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
>           */
>          if (backend->prealloc) {
>              os_mem_prealloc(memory_region_get_fd(&backend->mr), ptr, sz,
> -                            backend->prealloc_threads, &local_err);
> +                            backend->prealloc_threads, backend->host_nodes,
> +                            MAX_NODES, &local_err);
>              if (local_err) {
>                  goto out;
>              }
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index 5aca408726..48b104cdf6 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -467,7 +467,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa,
>              int fd = memory_region_get_fd(&vmem->memdev->mr);
>              Error *local_err = NULL;
>  
> -            os_mem_prealloc(fd, area, size, 1, &local_err);
> +            os_mem_prealloc(fd, area, size, 1, NULL, 0, &local_err);
>              if (local_err) {
>                  static bool warned;
>  
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index 1c1e7eca98..474cbf3b86 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -577,6 +577,8 @@ unsigned long qemu_getauxval(unsigned long type);
>  void qemu_set_tty_echo(int fd, bool echo);
>  
>  void os_mem_prealloc(int fd, char *area, size_t sz, int smp_cpus,
> +                     const unsigned long *host_nodes,
> +                     unsigned long max_node,
>                       Error **errp);
>  
>  /**
> diff --git a/util/meson.build b/util/meson.build
> index 8f16018cd4..393ff74570 100644
> --- a/util/meson.build
> +++ b/util/meson.build
> @@ -15,7 +15,7 @@ freebsd_dep = []
>  if targetos == 'freebsd'
>    freebsd_dep = util
>  endif
> -util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep])
> +util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep, numa])
>  util_ss.add(when: 'CONFIG_POSIX', if_true: files('qemu-thread-posix.c'))
>  util_ss.add(when: 'CONFIG_POSIX', if_true: files('memfd.c'))
>  util_ss.add(when: 'CONFIG_WIN32', if_true: files('aio-win32.c'))
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index 477990f39b..1572b9b178 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -73,6 +73,10 @@
>  #include "qemu/error-report.h"
>  #endif
>  
> +#ifdef CONFIG_NUMA
> +#include <numa.h>
> +#endif
> +
>  #define MAX_MEM_PREALLOC_THREAD_COUNT 16
>  
>  struct MemsetThread;
> @@ -82,6 +86,9 @@ typedef struct MemsetContext {
>      bool any_thread_failed;
>      struct MemsetThread *threads;
>      int num_threads;
> +#ifdef CONFIG_NUMA
> +    struct bitmask *nodemask;
> +#endif
>  } MemsetContext;
>  
>  struct MemsetThread {
> @@ -420,6 +427,12 @@ static void *do_touch_pages(void *arg)
>      }
>      qemu_mutex_unlock(&page_mutex);
>  
> +#ifdef CONFIG_NUMA
> +    if (memset_args->context->nodemask) {
> +        numa_run_on_node_mask(memset_args->context->nodemask);
> +    }
> +#endif
> +
>      /* unblock SIGBUS */
>      sigemptyset(&set);
>      sigaddset(&set, SIGBUS);
> @@ -463,6 +476,12 @@ static void *do_madv_populate_write_pages(void *arg)
>      }
>      qemu_mutex_unlock(&page_mutex);
>  
> +#ifdef CONFIG_NUMA
> +    if (memset_args->context->nodemask) {
> +        numa_run_on_node_mask(memset_args->context->nodemask);
> +    }
> +#endif
> +
>      if (size && qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE)) {
>          ret = -errno;
>      }
> @@ -489,7 +508,9 @@ static inline int get_memset_num_threads(size_t hpagesize, size_t numpages,
>  }
>  
>  static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
> -                           int smp_cpus, bool use_madv_populate_write)
> +                           int smp_cpus, const unsigned long *host_nodes,
> +                           unsigned long max_node,
> +                           bool use_madv_populate_write)
>  {
>      static gsize initialized = 0;
>      MemsetContext context = {
> @@ -499,6 +520,7 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>      void *(*touch_fn)(void *);
>      int ret = 0, i = 0;
>      char *addr = area;
> +    unsigned long value = max_node;
>  
>      if (g_once_init_enter(&initialized)) {
>          qemu_mutex_init(&page_mutex);
> @@ -520,6 +542,48 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>          touch_fn = do_touch_pages;
>      }
>  
> +    if (host_nodes) {
> +        value = find_first_bit(host_nodes, max_node);
> +    }
> +    if (value != max_node) {
> +#ifdef CONFIG_NUMA
> +        struct bitmask *cpus = numa_allocate_cpumask();
> +        g_autofree unsigned long *zerocpumask;
> +        size_t zerocpumasklen;
> +        g_autofree unsigned long *zeronodemask;
> +        size_t zeronodemasklen;
> +
> +        context.nodemask = numa_bitmask_alloc(max_node);
> +
> +        zerocpumasklen = cpus->size / sizeof(unsigned long);
> +        zerocpumask = g_new0(unsigned long, zerocpumasklen);
> +
> +        for (; value != max_node;
> +             value = find_next_bit(host_nodes, max_node, value + 1)) {
> +            if (numa_node_to_cpus(value, cpus) ||
> +                memcmp(cpus->maskp, zerocpumask, zerocpumasklen) == 0)
> +                continue;
> +
> +            /* If given NUMA node has CPUs run threads on them. */
> +            numa_bitmask_setbit(context.nodemask, value);
> +        }
> +
> +        numa_bitmask_free(cpus);
> +
> +        zeronodemasklen = max_node / sizeof(unsigned long);
> +        zeronodemask = g_new0(unsigned long, zeronodemasklen);
> +
> +        if (memcmp(context.nodemask->maskp,
> +                   zeronodemask, zeronodemasklen) == 0) {
> +            /* If no NUMA has a CPU available, then don't pin threads. */
> +            g_clear_pointer(&context.nodemask, numa_bitmask_free);
> +        }
> +#else
> +        errno = -EINVAL;
> +        return -1;
> +#endif
> +    }
> +
>      context.threads = g_new0(MemsetThread, context.num_threads);
>      numpages_per_thread = numpages / context.num_threads;
>      leftover = numpages % context.num_threads;
> @@ -554,6 +618,10 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>      if (!use_madv_populate_write) {
>          sigbus_memset_context = NULL;
>      }
> +
> +#ifdef CONFIG_NUMA
> +    g_clear_pointer(&context.nodemask, numa_bitmask_free);
> +#endif
>      g_free(context.threads);
>  
>      return ret;
> @@ -566,6 +634,8 @@ static bool madv_populate_write_possible(char *area, size_t pagesize)
>  }
>  
>  void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
> +                     const unsigned long *host_nodes,
> +                     unsigned long max_node,
>                       Error **errp)
>  {
>      static gsize initialized;
> @@ -608,7 +678,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
>  
>      /* touch pages simultaneously */
>      ret = touch_all_pages(area, hpagesize, numpages, smp_cpus,
> -                          use_madv_populate_write);
> +                          host_nodes, max_node, use_madv_populate_write);
>      if (ret) {
>          error_setg_errno(errp, -ret,
>                           "os_mem_prealloc: preallocating memory failed");
> diff --git a/util/oslib-win32.c b/util/oslib-win32.c
> index dafef4f157..6efd912355 100644
> --- a/util/oslib-win32.c
> +++ b/util/oslib-win32.c
> @@ -314,6 +314,8 @@ int getpagesize(void)
>  }
>  
>  void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
> +                     const unsigned long *host_nodes,
> +                     unsigned long max_node,
>                       Error **errp)
>  {
>      int i;
> -- 
> 2.35.1
> 
>
Daniel P. Berrangé May 11, 2022, 9:19 a.m. UTC | #4
On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote:
> When allocating large amounts of memory the task is offloaded
> onto threads. These threads then use various techniques to
> allocate the memory fully (madvise(), writing into the memory).
> However, these threads are free to run on any CPU, which becomes
> problematic on NUMA machines because it may happen that a thread
> is running on a distant node.
> 
> Ideally, this is something that a management application would
> resolve, but we are not anywhere close to that, Firstly, memory
> allocation happens before monitor socket is even available. But
> okay, that's what -preconfig is for. But then the problem is that
> 'object-add' would not return until all memory is preallocated.

Is the delay to 'object-add' actually a problem?

Currently we're cold-plugging the memory backends, so prealloc
happens before QMP is available. So we have a delay immediately
at startup. Switching to -preconfig plus 'object-add' would not
make the delay worse, merely move it ever so slightly later.

From the POV of an application using libvirt, this is the same.
virDomainCreate takes 1 hour, regardless of whether the 1 hour
allocation delay is before QMP or in -preconfig object-add
execution.

> Long story short, management application has no way of learning
> TIDs of allocator threads so it can't make them run NUMA aware.

This feels like the key issue. The preallocation threads are
invisible to libvirt, regardless of whether we're doing coldplug
or hotplug of memory-backends. Indeed the threads are invisible
to all of QEMU, except the memory backend code.

Conceptually we need 1 or more explicit worker threads that we
can assign CPU affinity to, and then QEMU can place jobs on them.
I/O threads serve this role, but are limited to blockdev work. We
need a generalization of I/O threads, for arbitrary jobs that
QEMU might want to farm out to specific NUMA nodes.

In a guest spanning multiple host NUMA nodes, libvirt would
have to configure 1 or more worker threads for QEMU, learn
their TIDs, then add the memory backends in -preconfig, which
would farm out preallocation to the worker threads, with
job placement matching each worker's affinity.
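
Roughly, the shape such a per-node worker could take; purely an
illustrative sketch, none of these names exist in QEMU, and the affinity
could just as well be set from outside once the management application
learns the worker's TID:

#include <numa.h>
#include <pthread.h>

typedef struct NumaWorker {
    pthread_t thread;
    int node;                    /* host NUMA node this worker serves */
} NumaWorker;

static void *numa_worker_main(void *opaque)
{
    NumaWorker *w = opaque;

    /* Constrain the worker to its node's CPUs once, up front
     * (error handling omitted for brevity). */
    numa_run_on_node(w->node);

    /*
     * ... block on a job queue; jobs (such as preallocating a range of
     * guest RAM) and any threads they spawn then inherit this affinity ...
     */
    return NULL;
}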


With regards,
Daniel
Daniel P. Berrangé May 11, 2022, 9:20 a.m. UTC | #5
On Wed, May 11, 2022 at 09:34:07AM +0100, Dr. David Alan Gilbert wrote:
> * Michal Privoznik (mprivozn@redhat.com) wrote:
> > When allocating large amounts of memory the task is offloaded
> > onto threads. These threads then use various techniques to
> > allocate the memory fully (madvise(), writing into the memory).
> > However, these threads are free to run on any CPU, which becomes
> > problematic on NUMA machines because it may happen that a thread
> > is running on a distant node.
> > 
> > Ideally, this is something that a management application would
> > resolve, but we are not anywhere close to that, Firstly, memory
> > allocation happens before monitor socket is even available. But
> > okay, that's what -preconfig is for. But then the problem is that
> > 'object-add' would not return until all memory is preallocated.
> > 
> > Long story short, management application has no way of learning
> > TIDs of allocator threads so it can't make them run NUMA aware.
> > 
> > But what we can do is to propagate the 'host-nodes' attribute of
> > MemoryBackend object down to where preallocation threads are
> > created and set their affinity according to the attribute.
> 
> Joe (cc'd) sent me some numbers for this which emphasise how useful it
> is:
>  | On systems with 4 physical numa nodes and 2-6 Tb of memory, this numa-aware
>  |preallocation provided about a 25% speedup in touching the pages.
>  |The speedup gets larger as the numa node count and memory sizes grow.
> ....
>  | In a simple parallel 1Gb page-zeroing test on a very large system (32-numa
>  | nodes and 47Tb of memory), the numa-aware preallocation was 2.3X faster
>  | than letting the threads float wherever.
>  | We're working with someone whose large guest normally takes 4.5 hours to
>  | boot.  With Michal P's initial patch to parallelize the preallocation, that
>  | time dropped to about 1 hour.  Including this numa-aware preallocation
>  | would reduce the guest boot time to less than 1/2 hour.
> 
> so chopping *half an hour* off the startup time seems a worthy
> optimisation (even if most of us aren't fortunate enough to have 47T of
> ram).

I presume this test was done with bare QEMU though, not libvirt-managed
QEMU, as IIUC the latter would not be able to set its affinity and so
would never see this benefit.


With regards,
Daniel
David Hildenbrand May 11, 2022, 9:31 a.m. UTC | #6
>> Long story short, management application has no way of learning
>> TIDs of allocator threads so it can't make them run NUMA aware.
> 
> This feels like the key issue. The preallocation threads are
> invisible to libvirt, regardless of whether we're doing coldplug
> or hotplug of memory-backends. Indeed the threads are invisible
> to all of QEMU, except the memory backend code.
> 
> Conceptually we need 1 or more explicit worker threads, that we
> can assign CPU affinity to, and then QEMU can place jobs on them.
> I/O threads serve this role, but limited to blockdev work. We
> need a generalization of I/O threads, for arbitrary jobs that
> QEMU might want to farm out to specific numa nodes.

At least the "-object iothread" thingy can already be used for actions
outside of blockdev. virtio-balloon uses one for free page hinting.
Daniel P. Berrangé May 11, 2022, 9:34 a.m. UTC | #7
On Wed, May 11, 2022 at 11:31:23AM +0200, David Hildenbrand wrote:
> >> Long story short, management application has no way of learning
> >> TIDs of allocator threads so it can't make them run NUMA aware.
> > 
> > This feels like the key issue. The preallocation threads are
> > invisible to libvirt, regardless of whether we're doing coldplug
> > or hotplug of memory-backends. Indeed the threads are invisible
> > to all of QEMU, except the memory backend code.
> > 
> > Conceptually we need 1 or more explicit worker threads, that we
> > can assign CPU affinity to, and then QEMU can place jobs on them.
> > I/O threads serve this role, but limited to blockdev work. We
> > need a generalization of I/O threads, for arbitrary jobs that
> > QEMU might want to farm out to specific numa nodes.
> 
> At least the "-object iothread" thingy can already be used for actions
> outside of blockdev. virtio-balloon uses one for free page hinting.

Ah that's good to know, so my idea probably isn't so much work as
I thought it might be.


With regards,
Daniel
David Hildenbrand May 11, 2022, 10:03 a.m. UTC | #8
On 11.05.22 11:34, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 11:31:23AM +0200, David Hildenbrand wrote:
>>>> Long story short, management application has no way of learning
>>>> TIDs of allocator threads so it can't make them run NUMA aware.
>>>
>>> This feels like the key issue. The preallocation threads are
>>> invisible to libvirt, regardless of whether we're doing coldplug
>>> or hotplug of memory-backends. Indeed the threads are invisible
>>> to all of QEMU, except the memory backend code.
>>>
>>> Conceptually we need 1 or more explicit worker threads, that we
>>> can assign CPU affinity to, and then QEMU can place jobs on them.
>>> I/O threads serve this role, but limited to blockdev work. We
>>> need a generalization of I/O threads, for arbitrary jobs that
>>> QEMU might want to farm out to specific numa nodes.
>>
>> At least the "-object iothread" thingy can already be used for actions
>> outside of blockdev. virtio-balloon uses one for free page hinting.
> 
> Ah that's good to know, so my idea probably isn't so much work as
> I thought it might be.

I guess we'd have to create a bunch of iothreads on the command line and
then feed them as an array to the memory backend we want to create. We
could then forward the threads to a new variant of os_mem_prealloc().

We could

a) Allocate new iothreads for each memory backend we create. Hm, that
might be suboptimal, we could end up with many iothreads.

b) Reuse iothreads and have separate sets of iothreads per NUMA node.
Assign them to a node once.

c) Reuse iothreads and reassign them to NUMA nodes on demand.

However, I'm not sure what the semantics are when having multiple
backends referencing the iothreads ...


Regarding c), an alternative I raised is that we could simply
preallocate the threads in QEMU internally on machine creation, and let
libvirt reassign them to the fitting NUMA node whenever
coldplugging/hotplugging a new memory backend.
Daniel P. Berrangé May 11, 2022, 10:10 a.m. UTC | #9
On Wed, May 11, 2022 at 12:03:24PM +0200, David Hildenbrand wrote:
> On 11.05.22 11:34, Daniel P. Berrangé wrote:
> > On Wed, May 11, 2022 at 11:31:23AM +0200, David Hildenbrand wrote:
> >>>> Long story short, management application has no way of learning
> >>>> TIDs of allocator threads so it can't make them run NUMA aware.
> >>>
> >>> This feels like the key issue. The preallocation threads are
> >>> invisible to libvirt, regardless of whether we're doing coldplug
> >>> or hotplug of memory-backends. Indeed the threads are invisible
> >>> to all of QEMU, except the memory backend code.
> >>>
> >>> Conceptually we need 1 or more explicit worker threads, that we
> >>> can assign CPU affinity to, and then QEMU can place jobs on them.
> >>> I/O threads serve this role, but limited to blockdev work. We
> >>> need a generalization of I/O threads, for arbitrary jobs that
> >>> QEMU might want to farm out to specific numa nodes.
> >>
> >> At least the "-object iothread" thingy can already be used for actions
> >> outside of blockdev. virtio-balloon uses one for free page hinting.
> > 
> > Ah that's good to know, so my idea probably isn't so much work as
> > I thought it might be.
> 
> I guess we'd have to create a bunch of iothreads on the command line and
> then feed them as an array to the memory backend we want to create. We
> could then forward the threads to a new variant of os_mem_prealloc().
> 
> We could
> 
> a) Allocate new iothreads for each memory backend we create. Hm, that
> might be suboptimal, we could end up with many iothreads.
> 
> b) Reuse iothreads and have separate sets of iothreads per NUMA node.
> Assign them to a node once.
> 
> c) Reuse iothreads and reassign them to NUMA nodes on demand.

If all we need is NUMA affinity, not CPU affinity, then it would
be sufficient to create 1 I/O thread per host NUMA node that the
VM needs to use. The job running in the I/O thread can spawn further
threads and inherit the NUMA affinity. This might be more clever
than is needed, though.

I expect creating/deleting I/O threads is cheap in comparison to
the work done for preallocation. If libvirt is using -preconfig
and object-add to create the memory backend, then we could have the
option of creating the I/O threads dynamically in -preconfig mode,
creating the memory backend, and then deleting the I/O threads again.

> However, I'm not sure what the semantics are when having multiple
> backends referencing the iothreads ...

Yep, we don't especially need an "ownership" relationship for what
we want to do with preallocation, especially because it is a one-off,
point-in-time usage, not continuous usage as with block devices.

With regards,
Daniel
Paolo Bonzini May 11, 2022, 11:07 a.m. UTC | #10
On 5/11/22 12:10, Daniel P. Berrangé wrote:
> If all we needs is NUMA affinity, not CPU affinity, then it would
> be sufficient to create 1 I/O thread per host NUMA node that the
> VM needs to use. The job running in the I/O can spawn further
> threads and inherit the NUMA affinity.  This might be more clever
> than it is needed though.
> 
> I expect creating/deleting I/O threads is cheap in comparison to
> the work done for preallocation. If libvirt is using -preconfig
> and object-add to create the memory backend, then we could have
> option of creating the I/O threads dynamically in -preconfig mode,
> create the memory backend, and then delete the I/O threads again.

I think this is very overengineered.  Michal's patch is doing the 
obvious thing and if it doesn't work that's because Libvirt is trying to 
micromanage QEMU.

As mentioned on IRC, if the reason is to prevent moving around threads 
in realtime (SCHED_FIFO, SCHED_RR) classes, that should be fixed at the 
kernel level.

Paolo
Michal Privoznik May 11, 2022, 1:16 p.m. UTC | #11
On 5/10/22 11:12, Daniel P. Berrangé wrote:
> On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote:
>> When allocating large amounts of memory the task is offloaded
>> onto threads. These threads then use various techniques to
>> allocate the memory fully (madvise(), writing into the memory).
>> However, these threads are free to run on any CPU, which becomes
>> problematic on NUMA machines because it may happen that a thread
>> is running on a distant node.
>>
>> Ideally, this is something that a management application would
>> resolve, but we are not anywhere close to that, Firstly, memory
>> allocation happens before monitor socket is even available. But
>> okay, that's what -preconfig is for. But then the problem is that
>> 'object-add' would not return until all memory is preallocated.
>>
>> Long story short, management application has no way of learning
>> TIDs of allocator threads so it can't make them run NUMA aware.
> 
> So I'm wondering what the impact of this problem is for various
> scenarios.

The scenario I tested this with was no <emulatorpin/>, but using
'virsh emulatorpin' afterwards to pin the emulator thread somewhere. For
those who are unfamiliar with libvirt, this is about placing the main
QEMU TID (the one with the main event loop) into a cgroup that restricts
what CPUs it can run on.

> 
> The default config for a KVM guest with libvirt is no CPU pinning
> at all. The kernel auto-places CPUs and decides on where RAM is to
> be allocated. So in this case, whether or not libvirt can talk to
> QMP in time to query threads is largely irrelevant, as we don't
> want todo placement in any case.
> 
> In theory the kernel should allocate RAM on the node local to
> where the process is currently executing. So as long as the
> guest RAM fits in available free RAM on the local node, RAM
> should be allocated from the node that matches the CPU running
> the QEMU main thread.
> 
> The challenge is if we spawn N more threads to do pre-alloc,
> these can be spread onto other nodes. I wonder if the kernel
> huas any preference for keeping threads within a process on
> the same NUMA node ?

That's not exactly what I saw. I too would have thought that initially
the prealloc thread could be spawned just anywhere, but that after a few
iterations the scheduler would realize what NUMA node the thread is close
to and automatically schedule it to run there. Well, it didn't happen.

> 
> Overall, if libvirt is not applying pinning to the QEMU guest,
> then we're 100% reliant on the kernel todo something sensible,
> both for normal QEMU execution and for prealloc. Since we're
> not doing placement of QEMU RAM or CPUs, the logic in this
> patch won't do anything either.
> 
> 
> If the guest has more RAM than can fit on the local NUMA node,
> then we're doomed no matter what, even ignoring prealloc, there
> will be cross-node traffic. This scenario requires the admin to
> setup proper CPU /memory pinning for QEMU in libvirt.
> 
> If libvirt is doing CPU pinning (as instructed by the mgmt app
> above us), then when we first start QEMU, the process thread
> leader will get given affinity by libvirt prior to exec. This
> affinity will be the union of affinity for all CPUs that will
> be later configured.
> 
> The typical case for CPU pinning, is that everything fits in
> one NUMA node, and so in this case, we don't need todo anything
> more. The prealloc threads will already be constrained to the
> right place by the affinity of the QEMU thread leader, so the
> logic in this patch will run, but it won't do anything that
> was not already done.
> 
> So we're left with the hardest case, where the guest is explicitly
> spread across multiple NUMA nodes. In this case the thread leader
> affinity will span many NUMA nodes, and so the prealloc threads
> will freely be placed across any CPU that is in the union of CPUs
> the guest is placed on. Just as with thue non-pinned case, the
> prealloc will be at the mercy of the kernel making sensible
> placement decisions.

Indeed, but it's at least somewhat restricted. NB, in a real scenario
users will map guest NUMA nodes onto host ones with a 1:1 relationship.
And each guest NUMA node will have its own memdev=, i.e. its own set of
threads, so in the end prealloc threads won't jump between host NUMA
nodes but stay local to the node they are allocating memory on.
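
For anyone who hasn't seen such a setup, a rough sketch of what it looks
like on the QEMU command line (IDs, sizes, paths and CPU ranges are
made-up example values; host-nodes=/policy=bind assume QEMU built with
libnuma support):

  -object memory-backend-file,id=mem0,mem-path=/dev/hugepages,size=8G,host-nodes=0,policy=bind,prealloc=on,prealloc-threads=8 \
  -object memory-backend-file,id=mem1,mem-path=/dev/hugepages,size=8G,host-nodes=1,policy=bind,prealloc=on,prealloc-threads=8 \
  -numa node,nodeid=0,cpus=0-7,memdev=mem0 \
  -numa node,nodeid=1,cpus=8-15,memdev=mem1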

> 
> The very last cases is the only one where this patch can potentially
> be beneficial. The problem is that because libvirt is in charge of
> enforcing CPU affinity, IIRC, we explicitly block QEMU from doing
> anything with CPU affinity. So AFAICT, this patch should result in
> an error from sched_setaffinity when run under libvirt.

Yes, I had to disable capability dropping in qemu.conf.

After all, I think maybe the right place to fix this is the kernel? I
mean, why don't the prealloc threads converge to the nodes they are
working with?

Michal
David Hildenbrand May 11, 2022, 2:50 p.m. UTC | #12
>>
>> The very last cases is the only one where this patch can potentially
>> be beneficial. The problem is that because libvirt is in charge of
>> enforcing CPU affinity, IIRC, we explicitly block QEMU from doing
>> anything with CPU affinity. So AFAICT, this patch should result in
>> an error from sched_setaffinity when run under libvirt.
> 
> Yes, I had to disable capability dropping in qemu.conf.
> 
> After all, I think maybe the right place to fix this is kernel? I mean,
> why don't prealloc threads converge to the nodes they are working with?

Good question, I think the whole machinery doesn't support hugetlb,
including recording remote faults and migrating the thread to a
different node.
Daniel P. Berrangé May 11, 2022, 3:08 p.m. UTC | #13
On Wed, May 11, 2022 at 03:16:55PM +0200, Michal Prívozník wrote:
> On 5/10/22 11:12, Daniel P. Berrangé wrote:
> > On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote:
> >> When allocating large amounts of memory the task is offloaded
> >> onto threads. These threads then use various techniques to
> >> allocate the memory fully (madvise(), writing into the memory).
> >> However, these threads are free to run on any CPU, which becomes
> >> problematic on NUMA machines because it may happen that a thread
> >> is running on a distant node.
> >>
> >> Ideally, this is something that a management application would
> >> resolve, but we are not anywhere close to that, Firstly, memory
> >> allocation happens before monitor socket is even available. But
> >> okay, that's what -preconfig is for. But then the problem is that
> >> 'object-add' would not return until all memory is preallocated.
> >>
> >> Long story short, management application has no way of learning
> >> TIDs of allocator threads so it can't make them run NUMA aware.
> > 
> > So I'm wondering what the impact of this problem is for various
> > scenarios.
> 
> The scenario which I tested this with was no <emulatorpin/> but using
> 'virsh emulatorpin' afterwards to pin emulator thread somewhere. For
> those which are unfamiliar with libvirt, this is about placing the main
> qemu TID (with the main eventloop) into a CGroup that restricts on what
> CPUs it can run.
> 
> > 
> > The default config for a KVM guest with libvirt is no CPU pinning
> > at all. The kernel auto-places CPUs and decides on where RAM is to
> > be allocated. So in this case, whether or not libvirt can talk to
> > QMP in time to query threads is largely irrelevant, as we don't
> > want todo placement in any case.
> > 
> > In theory the kernel should allocate RAM on the node local to
> > where the process is currently executing. So as long as the
> > guest RAM fits in available free RAM on the local node, RAM
> > should be allocated from the node that matches the CPU running
> > the QEMU main thread.
> > 
> > The challenge is if we spawn N more threads to do pre-alloc,
> > these can be spread onto other nodes. I wonder if the kernel
> > huas any preference for keeping threads within a process on
> > the same NUMA node ?
> 
> That's not exactly what I saw. I would have thought too that initially
> the prealloc thread could be spawned just anywhere but after few
> iterations the scheduler realized what NUMA node the thread is close to
> and automatically schedule it to run there. Well, it didn't happen.

Thinking about it, this does make sense to some extent. When a
thread is first spawned, how can the kernel know what region of
memory it is about to start touching? So at the very least the
kernel scheduler can get it wrong initially. It would need something
to watch memory access patterns to determine whether the initial
decision was right or wrong, and fine-tune it later.

Seems like the kernel typically tries to do the opposite of what
we thought: instead of moving threads between CPUs, it has ways
to move the memory instead.

https://www.kernel.org/doc/html/latest/vm/page_migration.html
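
Side note: whether that automatic page migration (automatic NUMA
balancing) is enabled on a given host can be checked directly, assuming
a kernel built with CONFIG_NUMA_BALANCING:

  cat /proc/sys/kernel/numa_balancing   # 1 = balancing on, 0 = off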

> > Overall, if libvirt is not applying pinning to the QEMU guest,
> > then we're 100% reliant on the kernel todo something sensible,
> > both for normal QEMU execution and for prealloc. Since we're
> > not doing placement of QEMU RAM or CPUs, the logic in this
> > patch won't do anything either.
> > 
> > 
> > If the guest has more RAM than can fit on the local NUMA node,
> > then we're doomed no matter what, even ignoring prealloc, there
> > will be cross-node traffic. This scenario requires the admin to
> > setup proper CPU /memory pinning for QEMU in libvirt.
> > 
> > If libvirt is doing CPU pinning (as instructed by the mgmt app
> > above us), then when we first start QEMU, the process thread
> > leader will get given affinity by libvirt prior to exec. This
> > affinity will be the union of affinity for all CPUs that will
> > be later configured.
> > 
> > The typical case for CPU pinning, is that everything fits in
> > one NUMA node, and so in this case, we don't need todo anything
> > more. The prealloc threads will already be constrained to the
> > right place by the affinity of the QEMU thread leader, so the
> > logic in this patch will run, but it won't do anything that
> > was not already done.
> > 
> > So we're left with the hardest case, where the guest is explicitly
> > spread across multiple NUMA nodes. In this case the thread leader
> > affinity will span many NUMA nodes, and so the prealloc threads
> > will freely be placed across any CPU that is in the union of CPUs
> > the guest is placed on. Just as with thue non-pinned case, the
> > prealloc will be at the mercy of the kernel making sensible
> > placement decisions.
> 
> Indeed, but it's at least somewhat restricted. NB, in real scenario
> users will map guest NUMA nodes onto host ones with 1:1 relationship.
> And each guest NUMA node will have its own memdev=, i.e. its own set of
> threads, so in the end, prealloc threads won't jump between host NUMA
> nodes but stay local to the node they are allocating memory on.

Thinking about this from a completely different QEMU angle.

Right now the preallocation happens when we create the memory
device, and takes place in threads spawned from the main
QEMU thread.

We are doing memory placement so that specific blocks of
virtual RAM are co-located with specific virtual CPUs. IOW, we
already have some threads (the vCPU threads) whose locality will
match that of the RAM.

We are doing memory pre-allocation to give predictable
performance once the VM starts, and to have a guarantee
that the memory is actually available  for use.

We don't actually need the memory pre-allocation to take
place so early in object creation, as we have it right now.
It just needs to be done before VM creation is done and
vCPUs start guest code.


From the POV of controlling QEMU VM resource usage, we don't
really want memory pre-allocation to consume more host CPUs
than we've assigned to the VM for its vCPUs.

So what if instead of creating throwaway threads for memory
allocation early, we ran the preallocation in the vCPU
threads before they start executing guest code? This is
still early enough to achieve our goals for preallocation.

These vCPU threads already have the right affinity setup. This
ensures that the CPU burn for preallocation doesn't exceed
what we've allowed for guest CPU usage in general, so resource
limits will be naturally enforced.

Is this kind of approach possible?

With regards,
Daniel
David Hildenbrand May 11, 2022, 4:41 p.m. UTC | #14
On 11.05.22 17:08, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 03:16:55PM +0200, Michal Prívozník wrote:
>> On 5/10/22 11:12, Daniel P. Berrangé wrote:
>>> On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote:
>>>> When allocating large amounts of memory the task is offloaded
>>>> onto threads. These threads then use various techniques to
>>>> allocate the memory fully (madvise(), writing into the memory).
>>>> However, these threads are free to run on any CPU, which becomes
>>>> problematic on NUMA machines because it may happen that a thread
>>>> is running on a distant node.
>>>>
>>>> Ideally, this is something that a management application would
>>>> resolve, but we are not anywhere close to that, Firstly, memory
>>>> allocation happens before monitor socket is even available. But
>>>> okay, that's what -preconfig is for. But then the problem is that
>>>> 'object-add' would not return until all memory is preallocated.
>>>>
>>>> Long story short, management application has no way of learning
>>>> TIDs of allocator threads so it can't make them run NUMA aware.
>>>
>>> So I'm wondering what the impact of this problem is for various
>>> scenarios.
>>
>> The scenario which I tested this with was no <emulatorpin/> but using
>> 'virsh emulatorpin' afterwards to pin emulator thread somewhere. For
>> those which are unfamiliar with libvirt, this is about placing the main
>> qemu TID (with the main eventloop) into a CGroup that restricts on what
>> CPUs it can run.
>>
>>>
>>> The default config for a KVM guest with libvirt is no CPU pinning
>>> at all. The kernel auto-places CPUs and decides on where RAM is to
>>> be allocated. So in this case, whether or not libvirt can talk to
>>> QMP in time to query threads is largely irrelevant, as we don't
>>> want todo placement in any case.
>>>
>>> In theory the kernel should allocate RAM on the node local to
>>> where the process is currently executing. So as long as the
>>> guest RAM fits in available free RAM on the local node, RAM
>>> should be allocated from the node that matches the CPU running
>>> the QEMU main thread.
>>>
>>> The challenge is if we spawn N more threads to do pre-alloc,
>>> these can be spread onto other nodes. I wonder if the kernel
>>> huas any preference for keeping threads within a process on
>>> the same NUMA node ?
>>
>> That's not exactly what I saw. I would have thought too that initially
>> the prealloc thread could be spawned just anywhere but after few
>> iterations the scheduler realized what NUMA node the thread is close to
>> and automatically schedule it to run there. Well, it didn't happen.
> 
> Thinking about it, this does make sense to some extent. When a
> thread is first spawned, how can the kernel know what region of
> memory it is about to start touching ? So at the very least the
> kernel schedular can get it wrong initially. It would need something
> to watch memory acces patterns to determine whether the initial
> decision was right or wrong, and fine tune it later.
> 
> Seems like the kernel typically tries todo the opposite to what
> we thought, and instead of moving CPUs, has ways to move the
> memory instead.
> 
> https://www.kernel.org/doc/html/latest/vm/page_migration.html
> 
>>> Overall, if libvirt is not applying pinning to the QEMU guest,
>>> then we're 100% reliant on the kernel todo something sensible,
>>> both for normal QEMU execution and for prealloc. Since we're
>>> not doing placement of QEMU RAM or CPUs, the logic in this
>>> patch won't do anything either.
>>>
>>>
>>> If the guest has more RAM than can fit on the local NUMA node,
>>> then we're doomed no matter what, even ignoring prealloc, there
>>> will be cross-node traffic. This scenario requires the admin to
>>> setup proper CPU /memory pinning for QEMU in libvirt.
>>>
>>> If libvirt is doing CPU pinning (as instructed by the mgmt app
>>> above us), then when we first start QEMU, the process thread
>>> leader will get given affinity by libvirt prior to exec. This
>>> affinity will be the union of affinity for all CPUs that will
>>> be later configured.
>>>
>>> The typical case for CPU pinning, is that everything fits in
>>> one NUMA node, and so in this case, we don't need todo anything
>>> more. The prealloc threads will already be constrained to the
>>> right place by the affinity of the QEMU thread leader, so the
>>> logic in this patch will run, but it won't do anything that
>>> was not already done.
>>>
>>> So we're left with the hardest case, where the guest is explicitly
>>> spread across multiple NUMA nodes. In this case the thread leader
>>> affinity will span many NUMA nodes, and so the prealloc threads
>>> will freely be placed across any CPU that is in the union of CPUs
>>> the guest is placed on. Just as with thue non-pinned case, the
>>> prealloc will be at the mercy of the kernel making sensible
>>> placement decisions.
>>
>> Indeed, but it's at least somewhat restricted. NB, in real scenario
>> users will map guest NUMA nodes onto host ones with 1:1 relationship.
>> And each guest NUMA node will have its own memdev=, i.e. its own set of
>> threads, so in the end, prealloc threads won't jump between host NUMA
>> nodes but stay local to the node they are allocating memory on.
> 
> Thinking about this from a completely different QEMU angle.
> 
> Right now the preallocation happens when we create the memory
> device, and takes place in threads spawned from the main
> QEMU thread.
> 
> We are doing memory placement in order that specific blocks of
> virtual RAM are co-located with specific virtual CPUs. IOW we
> know we already have some threads that will match locality of
> the RAM we have.
> 
> We are doing memory pre-allocation to give predictable
> performance once the VM starts, and to have a guarantee
> that the memory is actually available  for use.
> 
> We don't actually need the memory pre-allocation to take
> place so early in object creation, as we have it right now.
> It just needs to be done before VM creation is done and
> vCPUs start guest code.
> 
> 
> From the POV of controlling QEMU VM resource usage, we don't
> really want memory pre-allocation to consume more host CPUs
> than we've assigned to the VM for its vCPUs.
> 
> So what if instead of creating throwaway threads for memory
> allocation early, we ran the preallocation in the vCPU
> threads before they start executing guest code ? This is
> still early enough to achieve our goals for preallocation.
> 
> These vCPU threads already have the right affinity setup. This
> ensures that the CPU burn for preallocation doesn't exceed
> what we've allowed for guest CPU usage in general, so resource
> limits will be naturally enforced.
> 
> Is this kind of approach possible ?

That sounds quite hackish. IMHO, we should make sure that any approach
we introduce is able to cope with both coldplugged and hotplugged
memory backends.

I agree with Paolo's comment that libvirt is trying to micromanage QEMU
here, which is the root of the issue IMHO.

For applicable VMs (!realtime?), Libvirt could simply allow QEMU to set
the affinity in cases where no real harm can be done. I'm not sure what
other interfaces the kernel could provide to let Libvirt restrict the
affinity to only a subset of nodes/CPUs, so that any harm QEMU could do
is limited to that subset.
Daniel P. Berrangé May 11, 2022, 4:54 p.m. UTC | #15
On Wed, May 11, 2022 at 01:07:47PM +0200, Paolo Bonzini wrote:
> On 5/11/22 12:10, Daniel P. Berrangé wrote:
> > If all we needs is NUMA affinity, not CPU affinity, then it would
> > be sufficient to create 1 I/O thread per host NUMA node that the
> > VM needs to use. The job running in the I/O can spawn further
> > threads and inherit the NUMA affinity.  This might be more clever
> > than it is needed though.
> > 
> > I expect creating/deleting I/O threads is cheap in comparison to
> > the work done for preallocation. If libvirt is using -preconfig
> > and object-add to create the memory backend, then we could have
> > option of creating the I/O threads dynamically in -preconfig mode,
> > create the memory backend, and then delete the I/O threads again.
> 
> I think this is very overengineered.  Michal's patch is doing the obvious
> thing and if it doesn't work that's because Libvirt is trying to micromanage
> QEMU.

Calling it micromanaging puts a very negative connotation on
this. What we're trying to do is enforce a host resource policy for
QEMU, in a way that a compromised QEMU can't escape, which is a
valuable protection.

> As mentioned on IRC, if the reason is to prevent moving around threads in
> realtime (SCHED_FIFO, SCHED_RR) classes, that should be fixed at the kernel
> level.

We use cgroups where they are available to us, but we don't always have
the freedom that we'd like. Sometimes the deployment scenario of
libvirt means that we're stuck with whatever cgroup libvirtd was
launched in, and sched_setaffinity is our only way to confine QEMU,
so wanting to prevent use of sched_setaffinity is reasonable IMHO.
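
For context, that confinement mechanism is roughly the following
minimal sketch (not libvirt's actual code; the CPU list is an arbitrary
example). The affinity applied between fork() and exec() survives the
exec and is inherited by every thread QEMU later creates:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      cpu_set_t set;
      pid_t child;

      /* Confine the guest to host CPUs 0-3 (example values). */
      CPU_ZERO(&set);
      for (int cpu = 0; cpu < 4; cpu++) {
          CPU_SET(cpu, &set);
      }

      child = fork();
      if (child == 0) {
          /* Pin the thread leader before exec(); QEMU and all threads
           * it spawns (including prealloc threads) inherit this mask
           * unless something calls sched_setaffinity() again. */
          if (sched_setaffinity(0, sizeof(set), &set) < 0) {
              perror("sched_setaffinity");
              _exit(1);
          }
          execlp("qemu-system-x86_64", "qemu-system-x86_64",
                 "-display", "none", (char *)NULL);
          perror("execlp");
          _exit(1);
      }
      return child > 0 ? 0 : 1;
  }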

With regards,
Daniel
Paolo Bonzini May 12, 2022, 7:41 a.m. UTC | #16
On 5/11/22 18:54, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 01:07:47PM +0200, Paolo Bonzini wrote:
>> On 5/11/22 12:10, Daniel P. Berrangé wrote:
>>> I expect creating/deleting I/O threads is cheap in comparison to
>>> the work done for preallocation. If libvirt is using -preconfig
>>> and object-add to create the memory backend, then we could have
>>> option of creating the I/O threads dynamically in -preconfig mode,
>>> create the memory backend, and then delete the I/O threads again.
>>
>> I think this is very overengineered.  Michal's patch is doing the obvious
>> thing and if it doesn't work that's because Libvirt is trying to micromanage
>> QEMU.
> 
> Calling it micromanaging is putting a very negative connotation on
> this. What we're trying todo is enforce a host resource policy for
> QEMU, in a way that a compromised QEMU can't escape, which is a
> valuable protection.

I'm sorry if that was a bit exaggerated, but the negative connotation 
was intentional.

>> As mentioned on IRC, if the reason is to prevent moving around threads in
>> realtime (SCHED_FIFO, SCHED_RR) classes, that should be fixed at the kernel
>> level.
> 
> We use cgroups where it is available to us, but we don't always have
> the freedom that we'd like.

I understand.  I'm thinking of a new flag to sched_setscheduler that 
fixes the CPU affinity and policy of the thread and prevents changing it 
in case QEMU is compromised later.  The seccomp/SELinux sandboxes can 
prevent setting the SCHED_FIFO class without this flag.

In addition, my hunch is that this works only because the RT setup of 
QEMU is not safe against priority inversion.  IIRC the iothreads are set 
with a non-realtime priority, but actually they should have a _higher_ 
priority than the CPU threads, and the thread pool I/O bound workers 
should have an even higher priority; otherwise you have a priority 
inversion situation where an interrupt is pending that would wake up the 
CPU, but the iothreads cannot process it because they have a lower 
priority than the CPU.

So the iothread and the associated util/thread-pool.c thread pool are 
the wrong tools to solve Michal's issue; they are not meant for 
background CPU-bound work, even though they _might_ work due to their 
incorrect RT setup.

Paolo
Daniel P. Berrangé May 12, 2022, 8:15 a.m. UTC | #17
On Thu, May 12, 2022 at 09:41:29AM +0200, Paolo Bonzini wrote:
> On 5/11/22 18:54, Daniel P. Berrangé wrote:
> > On Wed, May 11, 2022 at 01:07:47PM +0200, Paolo Bonzini wrote:
> > > On 5/11/22 12:10, Daniel P. Berrangé wrote:
> > > > I expect creating/deleting I/O threads is cheap in comparison to
> > > > the work done for preallocation. If libvirt is using -preconfig
> > > > and object-add to create the memory backend, then we could have
> > > > option of creating the I/O threads dynamically in -preconfig mode,
> > > > create the memory backend, and then delete the I/O threads again.
> > > 
> > > I think this is very overengineered.  Michal's patch is doing the obvious
> > > thing and if it doesn't work that's because Libvirt is trying to micromanage
> > > QEMU.
> > 
> > Calling it micromanaging is putting a very negative connotation on
> > this. What we're trying todo is enforce a host resource policy for
> > QEMU, in a way that a compromised QEMU can't escape, which is a
> > valuable protection.
> 
> I'm sorry if that was a bit exaggerated, but the negative connotation was
> intentional.
> 
> > > As mentioned on IRC, if the reason is to prevent moving around threads in
> > > realtime (SCHED_FIFO, SCHED_RR) classes, that should be fixed at the kernel
> > > level.
> > 
> > We use cgroups where it is available to us, but we don't always have
> > the freedom that we'd like.
> 
> I understand.  I'm thinking of a new flag to sched_setscheduler that fixes
> the CPU affinity and policy of the thread and prevents changing it in case
> QEMU is compromised later.  The seccomp/SELinux sandboxes can prevent
> setting the SCHED_FIFO class without this flag.
> 
> In addition, my hunch is that this works only because the RT setup of QEMU
> is not safe against priority inversion.  IIRC the iothreads are set with a
> non-realtime priority, but actually they should have a _higher_ priority
> than the CPU threads, and the thread pool I/O bound workers should have an
> even higher priority; otherwise you have a priority inversion situation
> where an interrupt is pending that would wake up the CPU, but the iothreads
> cannot process it because they have a lower priority than the CPU.

At least for RHEL deployments of KVM-RT, IIRC the expectation is that
the VCPUs with RT priority never do I/O, and that there is at least 1
additional non-RT vCPU from which the OS performs I/O. IOW, the RT
VCPU works in a completely self-contained manner with no interaction
with any other QEMU threads. If that's not the case, then you would
have to make sure those other threads have priority / scheduler
adjustments to avoid priority inversion.

With regards,
Daniel
David Hildenbrand June 8, 2022, 10:34 a.m. UTC | #18
On 11.05.22 11:34, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 11:31:23AM +0200, David Hildenbrand wrote:
>>>> Long story short, management application has no way of learning
>>>> TIDs of allocator threads so it can't make them run NUMA aware.
>>>
>>> This feels like the key issue. The preallocation threads are
>>> invisible to libvirt, regardless of whether we're doing coldplug
>>> or hotplug of memory-backends. Indeed the threads are invisible
>>> to all of QEMU, except the memory backend code.
>>>
>>> Conceptually we need 1 or more explicit worker threads, that we
>>> can assign CPU affinity to, and then QEMU can place jobs on them.
>>> I/O threads serve this role, but limited to blockdev work. We
>>> need a generalization of I/O threads, for arbitrary jobs that
>>> QEMU might want to farm out to specific numa nodes.
>>
>> At least the "-object iothread" thingy can already be used for actions
>> outside of blockdev. virtio-balloon uses one for free page hinting.
> 
> Ah that's good to know, so my idea probably isn't so much work as
> I thought it might be.

Looking into the details, iothreads are the wrong tool to use here.

I can imagine use cases where you'd want to perform preallocation
* Only on some specific CPUs of a NUMA node (especially, not the ones
  where we pinned VCPUs)
* On CPUs that are on a different NUMA node than the actual backend
  memory ... just thinking about all of the CPU-less nodes for PMEM and
  CXL that we're starting to see.


So ideally, we'd let the user configure either
* a set of physical CPUs to use (low-hanging fruit)
* a set of NUMA nodes to use (harder)
and allow libvirt to easily configure it similarly by pinning threads.

As CPU affinity is inherited when creating a new thread, here is one
IMHO reasonably simple thing to get the job done and allow for such
flexibility.
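
To illustrate the inheritance this builds on: a standalone pthreads
sketch (not QEMU code, CPU numbers are arbitrary) showing that a thread
created via pthread_create() starts out with the affinity mask of the
thread that created it.

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>

  static void *worker(void *arg)
  {
      cpu_set_t set;

      /* New threads start with the affinity of the creating thread. */
      pthread_getaffinity_np(pthread_self(), sizeof(set), &set);
      printf("worker may run on %d CPU(s)\n", CPU_COUNT(&set));
      return NULL;
  }

  int main(void)
  {
      cpu_set_t set;
      pthread_t tid;

      /* Restrict the "context" thread (here: main) to CPUs 0 and 1... */
      CPU_ZERO(&set);
      CPU_SET(0, &set);
      CPU_SET(1, &set);
      pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

      /* ...and every thread it spawns inherits that restriction. */
      pthread_create(&tid, NULL, worker, NULL);
      pthread_join(tid, NULL);
      return 0;
  }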


Introduce a "thread-context" user-creatable object that is used to
configure CPU affinity.

Internally, thread-context creates exactly one thread called "TC $id".
That thread serves no purpose besides spawning new threads that inherit the
affinity.

Internal users (like preallocation, but we could reuse the same concept for other
non-io threads, such as userfaultfd, and some other potential non-io thread
users) simply use that thread context to spawn new threads.


Spawned threads get called "TC $id/$threadname", whereby $threadname is the
ordinary name supplied by the internal user. This could be used to identify
long-running threads if needed in the future.

It's worth noting that this is not a thread pool.


a) Ordinary cmdline usage:

-object thread-context,id="tc1",cpus=0-9,cpus=12

QEMU tries setting the CPU affinity and fails if that's impossible.

-object memory-backend-file,...,prealloc=on,prealloc-threads=16,prealloc-thread-context=tc1

When a context is set, preallocation code will use the thread-context to spawn threads.


b) Libvirt, ..., usage:

-object thread-context,id="tc1"

Then, libvirt identifies and sets the affinity for the "TC tc1" thread.

-object memory-backend-file,...,prealloc=on,prealloc-threads=16,prealloc-thread-context=tc1



thread-context can be reused for successive preallocation etc., obviously.

Patch

diff --git a/backends/hostmem.c b/backends/hostmem.c
index a7bae3d713..7373472c7e 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -232,7 +232,8 @@  static void host_memory_backend_set_prealloc(Object *obj, bool value,
         void *ptr = memory_region_get_ram_ptr(&backend->mr);
         uint64_t sz = memory_region_size(&backend->mr);
 
-        os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads, &local_err);
+        os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads,
+                        backend->host_nodes, MAX_NODES, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
@@ -394,7 +395,8 @@  host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
          */
         if (backend->prealloc) {
             os_mem_prealloc(memory_region_get_fd(&backend->mr), ptr, sz,
-                            backend->prealloc_threads, &local_err);
+                            backend->prealloc_threads, backend->host_nodes,
+                            MAX_NODES, &local_err);
             if (local_err) {
                 goto out;
             }
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 5aca408726..48b104cdf6 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -467,7 +467,7 @@  static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa,
             int fd = memory_region_get_fd(&vmem->memdev->mr);
             Error *local_err = NULL;
 
-            os_mem_prealloc(fd, area, size, 1, &local_err);
+            os_mem_prealloc(fd, area, size, 1, NULL, 0, &local_err);
             if (local_err) {
                 static bool warned;
 
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 1c1e7eca98..474cbf3b86 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -577,6 +577,8 @@  unsigned long qemu_getauxval(unsigned long type);
 void qemu_set_tty_echo(int fd, bool echo);
 
 void os_mem_prealloc(int fd, char *area, size_t sz, int smp_cpus,
+                     const unsigned long *host_nodes,
+                     unsigned long max_node,
                      Error **errp);
 
 /**
diff --git a/util/meson.build b/util/meson.build
index 8f16018cd4..393ff74570 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -15,7 +15,7 @@  freebsd_dep = []
 if targetos == 'freebsd'
   freebsd_dep = util
 endif
-util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep])
+util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep, numa])
 util_ss.add(when: 'CONFIG_POSIX', if_true: files('qemu-thread-posix.c'))
 util_ss.add(when: 'CONFIG_POSIX', if_true: files('memfd.c'))
 util_ss.add(when: 'CONFIG_WIN32', if_true: files('aio-win32.c'))
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 477990f39b..1572b9b178 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -73,6 +73,10 @@ 
 #include "qemu/error-report.h"
 #endif
 
+#ifdef CONFIG_NUMA
+#include <numa.h>
+#endif
+
 #define MAX_MEM_PREALLOC_THREAD_COUNT 16
 
 struct MemsetThread;
@@ -82,6 +86,9 @@  typedef struct MemsetContext {
     bool any_thread_failed;
     struct MemsetThread *threads;
     int num_threads;
+#ifdef CONFIG_NUMA
+    struct bitmask *nodemask;
+#endif
 } MemsetContext;
 
 struct MemsetThread {
@@ -420,6 +427,12 @@  static void *do_touch_pages(void *arg)
     }
     qemu_mutex_unlock(&page_mutex);
 
+#ifdef CONFIG_NUMA
+    if (memset_args->context->nodemask) {
+        numa_run_on_node_mask(memset_args->context->nodemask);
+    }
+#endif
+
     /* unblock SIGBUS */
     sigemptyset(&set);
     sigaddset(&set, SIGBUS);
@@ -463,6 +476,12 @@  static void *do_madv_populate_write_pages(void *arg)
     }
     qemu_mutex_unlock(&page_mutex);
 
+#ifdef CONFIG_NUMA
+    if (memset_args->context->nodemask) {
+        numa_run_on_node_mask(memset_args->context->nodemask);
+    }
+#endif
+
     if (size && qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE)) {
         ret = -errno;
     }
@@ -489,7 +508,9 @@  static inline int get_memset_num_threads(size_t hpagesize, size_t numpages,
 }
 
 static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
-                           int smp_cpus, bool use_madv_populate_write)
+                           int smp_cpus, const unsigned long *host_nodes,
+                           unsigned long max_node,
+                           bool use_madv_populate_write)
 {
     static gsize initialized = 0;
     MemsetContext context = {
@@ -499,6 +520,7 @@  static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
     void *(*touch_fn)(void *);
     int ret = 0, i = 0;
     char *addr = area;
+    unsigned long value = max_node;
 
     if (g_once_init_enter(&initialized)) {
         qemu_mutex_init(&page_mutex);
@@ -520,6 +542,48 @@  static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
         touch_fn = do_touch_pages;
     }
 
+    if (host_nodes) {
+        value = find_first_bit(host_nodes, max_node);
+    }
+    if (value != max_node) {
+#ifdef CONFIG_NUMA
+        struct bitmask *cpus = numa_allocate_cpumask();
+        g_autofree unsigned long *zerocpumask;
+        size_t zerocpumasklen;
+        g_autofree unsigned long *zeronodemask;
+        size_t zeronodemasklen;
+
+        context.nodemask = numa_bitmask_alloc(max_node);
+
+        zerocpumasklen = cpus->size / sizeof(unsigned long);
+        zerocpumask = g_new0(unsigned long, zerocpumasklen);
+
+        for (; value != max_node;
+             value = find_next_bit(host_nodes, max_node, value + 1)) {
+            if (numa_node_to_cpus(value, cpus) ||
+                memcmp(cpus->maskp, zerocpumask, zerocpumasklen) == 0)
+                continue;
+
+            /* If given NUMA node has CPUs run threads on them. */
+            numa_bitmask_setbit(context.nodemask, value);
+        }
+
+        numa_bitmask_free(cpus);
+
+        zeronodemasklen = max_node / sizeof(unsigned long);
+        zeronodemask = g_new0(unsigned long, zeronodemasklen);
+
+        if (memcmp(context.nodemask->maskp,
+                   zeronodemask, zeronodemasklen) == 0) {
+            /* If no NUMA has a CPU available, then don't pin threads. */
+            g_clear_pointer(&context.nodemask, numa_bitmask_free);
+        }
+#else
+        errno = -EINVAL;
+        return -1;
+#endif
+    }
+
     context.threads = g_new0(MemsetThread, context.num_threads);
     numpages_per_thread = numpages / context.num_threads;
     leftover = numpages % context.num_threads;
@@ -554,6 +618,10 @@  static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
     if (!use_madv_populate_write) {
         sigbus_memset_context = NULL;
     }
+
+#ifdef CONFIG_NUMA
+    g_clear_pointer(&context.nodemask, numa_bitmask_free);
+#endif
     g_free(context.threads);
 
     return ret;
@@ -566,6 +634,8 @@  static bool madv_populate_write_possible(char *area, size_t pagesize)
 }
 
 void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
+                     const unsigned long *host_nodes,
+                     unsigned long max_node,
                      Error **errp)
 {
     static gsize initialized;
@@ -608,7 +678,7 @@  void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
 
     /* touch pages simultaneously */
     ret = touch_all_pages(area, hpagesize, numpages, smp_cpus,
-                          use_madv_populate_write);
+                          host_nodes, max_node, use_madv_populate_write);
     if (ret) {
         error_setg_errno(errp, -ret,
                          "os_mem_prealloc: preallocating memory failed");
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index dafef4f157..6efd912355 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -314,6 +314,8 @@  int getpagesize(void)
 }
 
 void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
+                     const unsigned long *host_nodes,
+                     unsigned long max_node,
                      Error **errp)
 {
     int i;