Message ID | ffdcd118d59b379ede2b64745144165a40f6a813.1652165704.git.mprivozn@redhat.com
---|---
State | New, archived
Series | util: NUMA aware memory preallocation
On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote: > When allocating large amounts of memory the task is offloaded > onto threads. These threads then use various techniques to > allocate the memory fully (madvise(), writing into the memory). > However, these threads are free to run on any CPU, which becomes > problematic on NUMA machines because it may happen that a thread > is running on a distant node. > > Ideally, this is something that a management application would > resolve, but we are not anywhere close to that, Firstly, memory > allocation happens before monitor socket is even available. But > okay, that's what -preconfig is for. But then the problem is that > 'object-add' would not return until all memory is preallocated. > > Long story short, management application has no way of learning > TIDs of allocator threads so it can't make them run NUMA aware.

So I'm wondering what the impact of this problem is for various scenarios.

The default config for a KVM guest with libvirt is no CPU pinning at all. The kernel auto-places CPUs and decides on where RAM is to be allocated. So in this case, whether or not libvirt can talk to QMP in time to query threads is largely irrelevant, as we don't want to do placement in any case.

In theory the kernel should allocate RAM on the node local to where the process is currently executing. So as long as the guest RAM fits in available free RAM on the local node, RAM should be allocated from the node that matches the CPU running the QEMU main thread.

The challenge is if we spawn N more threads to do pre-alloc, these can be spread onto other nodes. I wonder if the kernel has any preference for keeping threads within a process on the same NUMA node?

Overall, if libvirt is not applying pinning to the QEMU guest, then we're 100% reliant on the kernel to do something sensible, both for normal QEMU execution and for prealloc.
Since we're not doing placement of QEMU RAM or CPUs, the logic in this patch won't do anything either.

If the guest has more RAM than can fit on the local NUMA node, then we're doomed no matter what: even ignoring prealloc, there will be cross-node traffic. This scenario requires the admin to set up proper CPU/memory pinning for QEMU in libvirt.

If libvirt is doing CPU pinning (as instructed by the mgmt app above us), then when we first start QEMU, the process thread leader will be given affinity by libvirt prior to exec. This affinity will be the union of affinity for all CPUs that will be later configured.

The typical case for CPU pinning is that everything fits in one NUMA node, and so in this case we don't need to do anything more. The prealloc threads will already be constrained to the right place by the affinity of the QEMU thread leader, so the logic in this patch will run, but it won't do anything that was not already done.

So we're left with the hardest case, where the guest is explicitly spread across multiple NUMA nodes. In this case the thread leader affinity will span many NUMA nodes, and so the prealloc threads will freely be placed across any CPU that is in the union of CPUs the guest is placed on. Just as with the non-pinned case, the prealloc will be at the mercy of the kernel making sensible placement decisions.

This very last case is the only one where this patch can potentially be beneficial. The problem is that because libvirt is in charge of enforcing CPU affinity, IIRC, we explicitly block QEMU from doing anything with CPU affinity. So AFAICT, this patch should result in an error from sched_setaffinity when run under libvirt.

> But what we can do is to propagate the 'host-nodes' attribute of > MemoryBackend object down to where preallocation threads are > created and set their affinity according to the attribute.
> > Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2074000 > Signed-off-by: Michal Privoznik <mprivozn@redhat.com> > --- > backends/hostmem.c | 6 ++-- > hw/virtio/virtio-mem.c | 2 +- > include/qemu/osdep.h | 2 ++ > util/meson.build | 2 +- > util/oslib-posix.c | 74 ++++++++++++++++++++++++++++++++++++++++-- > util/oslib-win32.c | 2 ++ > 6 files changed, 82 insertions(+), 6 deletions(-) > > diff --git a/backends/hostmem.c b/backends/hostmem.c > index a7bae3d713..7373472c7e 100644 > --- a/backends/hostmem.c > +++ b/backends/hostmem.c > @@ -232,7 +232,8 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value, > void *ptr = memory_region_get_ram_ptr(&backend->mr); > uint64_t sz = memory_region_size(&backend->mr); > > - os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads, &local_err); > + os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads, > + backend->host_nodes, MAX_NODES, &local_err); > if (local_err) { > error_propagate(errp, local_err); > return; > @@ -394,7 +395,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp) > */ > if (backend->prealloc) { > os_mem_prealloc(memory_region_get_fd(&backend->mr), ptr, sz, > - backend->prealloc_threads, &local_err); > + backend->prealloc_threads, backend->host_nodes, > + MAX_NODES, &local_err); > if (local_err) { > goto out; > } > diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c > index 5aca408726..48b104cdf6 100644 > --- a/hw/virtio/virtio-mem.c > +++ b/hw/virtio/virtio-mem.c > @@ -467,7 +467,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa, > int fd = memory_region_get_fd(&vmem->memdev->mr); > Error *local_err = NULL; > > - os_mem_prealloc(fd, area, size, 1, &local_err); > + os_mem_prealloc(fd, area, size, 1, NULL, 0, &local_err); > if (local_err) { > static bool warned; > > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h > index 1c1e7eca98..474cbf3b86 100644 > --- a/include/qemu/osdep.h > +++ b/include/qemu/osdep.h > @@ -577,6 
+577,8 @@ unsigned long qemu_getauxval(unsigned long type); > void qemu_set_tty_echo(int fd, bool echo); > > void os_mem_prealloc(int fd, char *area, size_t sz, int smp_cpus, > + const unsigned long *host_nodes, > + unsigned long max_node, > Error **errp); > > /** > diff --git a/util/meson.build b/util/meson.build > index 8f16018cd4..393ff74570 100644 > --- a/util/meson.build > +++ b/util/meson.build > @@ -15,7 +15,7 @@ freebsd_dep = [] > if targetos == 'freebsd' > freebsd_dep = util > endif > -util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep]) > +util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep, numa]) > util_ss.add(when: 'CONFIG_POSIX', if_true: files('qemu-thread-posix.c')) > util_ss.add(when: 'CONFIG_POSIX', if_true: files('memfd.c')) > util_ss.add(when: 'CONFIG_WIN32', if_true: files('aio-win32.c')) > diff --git a/util/oslib-posix.c b/util/oslib-posix.c > index 477990f39b..1572b9b178 100644 > --- a/util/oslib-posix.c > +++ b/util/oslib-posix.c > @@ -73,6 +73,10 @@ > #include "qemu/error-report.h" > #endif > > +#ifdef CONFIG_NUMA > +#include <numa.h> > +#endif > + > #define MAX_MEM_PREALLOC_THREAD_COUNT 16 > > struct MemsetThread; > @@ -82,6 +86,9 @@ typedef struct MemsetContext { > bool any_thread_failed; > struct MemsetThread *threads; > int num_threads; > +#ifdef CONFIG_NUMA > + struct bitmask *nodemask; > +#endif > } MemsetContext; > > struct MemsetThread { > @@ -420,6 +427,12 @@ static void *do_touch_pages(void *arg) > } > qemu_mutex_unlock(&page_mutex); > > +#ifdef CONFIG_NUMA > + if (memset_args->context->nodemask) { > + numa_run_on_node_mask(memset_args->context->nodemask); > + } > +#endif > + > /* unblock SIGBUS */ > sigemptyset(&set); > sigaddset(&set, SIGBUS); > @@ -463,6 +476,12 @@ static void *do_madv_populate_write_pages(void *arg) > } > qemu_mutex_unlock(&page_mutex); > > +#ifdef CONFIG_NUMA > + if (memset_args->context->nodemask) { > + 
numa_run_on_node_mask(memset_args->context->nodemask); > + } > +#endif Ok, so this is where affinity is set, and I believe this ought to be failing when run under libvirt with an EPERM from sched_setaffinity. This code is ignoring errors so I guess this won't be noticed unless libnuma is printing an error message ? > + > if (size && qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE)) { > ret = -errno; > } > @@ -489,7 +508,9 @@ static inline int get_memset_num_threads(size_t hpagesize, size_t numpages, > } > > static int touch_all_pages(char *area, size_t hpagesize, size_t numpages, > - int smp_cpus, bool use_madv_populate_write) > + int smp_cpus, const unsigned long *host_nodes, > + unsigned long max_node, > + bool use_madv_populate_write) > { > static gsize initialized = 0; > MemsetContext context = { > @@ -499,6 +520,7 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages, > void *(*touch_fn)(void *); > int ret = 0, i = 0; > char *addr = area; > + unsigned long value = max_node; > > if (g_once_init_enter(&initialized)) { > qemu_mutex_init(&page_mutex); > @@ -520,6 +542,48 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages, > touch_fn = do_touch_pages; > } > > + if (host_nodes) { > + value = find_first_bit(host_nodes, max_node); > + } > + if (value != max_node) { > +#ifdef CONFIG_NUMA > + struct bitmask *cpus = numa_allocate_cpumask(); > + g_autofree unsigned long *zerocpumask; > + size_t zerocpumasklen; > + g_autofree unsigned long *zeronodemask; > + size_t zeronodemasklen; > + > + context.nodemask = numa_bitmask_alloc(max_node); > + > + zerocpumasklen = cpus->size / sizeof(unsigned long); > + zerocpumask = g_new0(unsigned long, zerocpumasklen); > + > + for (; value != max_node; > + value = find_next_bit(host_nodes, max_node, value + 1)) { > + if (numa_node_to_cpus(value, cpus) || > + memcmp(cpus->maskp, zerocpumask, zerocpumasklen) == 0) > + continue; > + > + /* If given NUMA node has CPUs run threads on them. 
*/ > + numa_bitmask_setbit(context.nodemask, value); > + } > + > + numa_bitmask_free(cpus); > + > + zeronodemasklen = max_node / sizeof(unsigned long); > + zeronodemask = g_new0(unsigned long, zeronodemasklen); > + > + if (memcmp(context.nodemask->maskp, > + zeronodemask, zeronodemasklen) == 0) { > + /* If no NUMA has a CPU available, then don't pin threads. */ > + g_clear_pointer(&context.nodemask, numa_bitmask_free); > + } > +#else > + errno = -EINVAL; > + return -1; > +#endif > + } > + > context.threads = g_new0(MemsetThread, context.num_threads); > numpages_per_thread = numpages / context.num_threads; > leftover = numpages % context.num_threads; > @@ -554,6 +618,10 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages, > if (!use_madv_populate_write) { > sigbus_memset_context = NULL; > } > + > +#ifdef CONFIG_NUMA > + g_clear_pointer(&context.nodemask, numa_bitmask_free); > +#endif > g_free(context.threads); > > return ret; > @@ -566,6 +634,8 @@ static bool madv_populate_write_possible(char *area, size_t pagesize) > } > > void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus, > + const unsigned long *host_nodes, > + unsigned long max_node, > Error **errp) > { > static gsize initialized; > @@ -608,7 +678,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus, > > /* touch pages simultaneously */ > ret = touch_all_pages(area, hpagesize, numpages, smp_cpus, > - use_madv_populate_write); > + host_nodes, max_node, use_madv_populate_write); > if (ret) { > error_setg_errno(errp, -ret, > "os_mem_prealloc: preallocating memory failed"); > diff --git a/util/oslib-win32.c b/util/oslib-win32.c > index dafef4f157..6efd912355 100644 > --- a/util/oslib-win32.c > +++ b/util/oslib-win32.c > @@ -314,6 +314,8 @@ int getpagesize(void) > } > > void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus, > + const unsigned long *host_nodes, > + unsigned long max_node, > Error **errp) > { > int i; > -- > 2.35.1 > > With 
regards, Daniel
* Daniel P. Berrangé (berrange@redhat.com) wrote: > On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote: > > When allocating large amounts of memory the task is offloaded > > onto threads. These threads then use various techniques to > > allocate the memory fully (madvise(), writing into the memory). > > However, these threads are free to run on any CPU, which becomes > > problematic on NUMA machines because it may happen that a thread > > is running on a distant node. > > > > Ideally, this is something that a management application would > > resolve, but we are not anywhere close to that, Firstly, memory > > allocation happens before monitor socket is even available. But > > okay, that's what -preconfig is for. But then the problem is that > > 'object-add' would not return until all memory is preallocated. > > > > Long story short, management application has no way of learning > > TIDs of allocator threads so it can't make them run NUMA aware. > > So I'm wondering what the impact of this problem is for various > scenarios. > > The default config for a KVM guest with libvirt is no CPU pinning > at all. The kernel auto-places CPUs and decides on where RAM is to > be allocated. So in this case, whether or not libvirt can talk to > QMP in time to query threads is largely irrelevant, as we don't > want todo placement in any case. > > In theory the kernel should allocate RAM on the node local to > where the process is currently executing. So as long as the > guest RAM fits in available free RAM on the local node, RAM > should be allocated from the node that matches the CPU running > the QEMU main thread. > > The challenge is if we spawn N more threads to do pre-alloc, > these can be spread onto other nodes. I wonder if the kernel > huas any preference for keeping threads within a process on > the same NUMA node ? 
> > Overall, if libvirt is not applying pinning to the QEMU guest, > then we're 100% reliant on the kernel todo something sensible, > both for normal QEMU execution and for prealloc. Since we're > not doing placement of QEMU RAM or CPUs, the logic in this > patch won't do anything either. > > > If the guest has more RAM than can fit on the local NUMA node, > then we're doomed no matter what, even ignoring prealloc, there > will be cross-node traffic. This scenario requires the admin to > setup proper CPU /memory pinning for QEMU in libvirt. > > If libvirt is doing CPU pinning (as instructed by the mgmt app > above us), then when we first start QEMU, the process thread > leader will get given affinity by libvirt prior to exec. This > affinity will be the union of affinity for all CPUs that will > be later configured. > > The typical case for CPU pinning, is that everything fits in > one NUMA node, and so in this case, we don't need todo anything > more. The prealloc threads will already be constrained to the > right place by the affinity of the QEMU thread leader, so the > logic in this patch will run, but it won't do anything that > was not already done. > > So we're left with the hardest case, where the guest is explicitly > spread across multiple NUMA nodes. In this case the thread leader > affinity will span many NUMA nodes, and so the prealloc threads > will freely be placed across any CPU that is in the union of CPUs > the guest is placed on. Just as with thue non-pinned case, the > prealloc will be at the mercy of the kernel making sensible > placement decisions. > > The very last cases is the only one where this patch can potentially > be beneficial. The problem is that because libvirt is in charge of > enforcing CPU affinity, IIRC, we explicitly block QEMU from doing > anything with CPU affinity. So AFAICT, this patch should result in > an error from sched_setaffinity when run under libvirt. 
This problem crops up in a few places though; potentially the same trick could be done with multifd threads in migration, if we can make some multifd threads handle some specific NUMA nodes rather than shotgunning them across the threads. [Bonus: Make multifd route different NUMA nodes down NICs that are node local] Dave > > But what we can do is to propagate the 'host-nodes' attribute of > > MemoryBackend object down to where preallocation threads are > > created and set their affinity according to the attribute. > > > > Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2074000 > > Signed-off-by: Michal Privoznik <mprivozn@redhat.com> > > --- > > backends/hostmem.c | 6 ++-- > > hw/virtio/virtio-mem.c | 2 +- > > include/qemu/osdep.h | 2 ++ > > util/meson.build | 2 +- > > util/oslib-posix.c | 74 ++++++++++++++++++++++++++++++++++++++++-- > > util/oslib-win32.c | 2 ++ > > 6 files changed, 82 insertions(+), 6 deletions(-) > > > > diff --git a/backends/hostmem.c b/backends/hostmem.c > > index a7bae3d713..7373472c7e 100644 > > --- a/backends/hostmem.c > > +++ b/backends/hostmem.c > > @@ -232,7 +232,8 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value, > > void *ptr = memory_region_get_ram_ptr(&backend->mr); > > uint64_t sz = memory_region_size(&backend->mr); > > > > - os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads, &local_err); > > + os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads, > > + backend->host_nodes, MAX_NODES, &local_err); > > if (local_err) { > > error_propagate(errp, local_err); > > return; > > @@ -394,7 +395,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp) > > */ > > if (backend->prealloc) { > > os_mem_prealloc(memory_region_get_fd(&backend->mr), ptr, sz, > > - backend->prealloc_threads, &local_err); > > + backend->prealloc_threads, backend->host_nodes, > > + MAX_NODES, &local_err); > > if (local_err) { > > goto out; > > } > > diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c > > 
index 5aca408726..48b104cdf6 100644 > > --- a/hw/virtio/virtio-mem.c > > +++ b/hw/virtio/virtio-mem.c > > @@ -467,7 +467,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa, > > int fd = memory_region_get_fd(&vmem->memdev->mr); > > Error *local_err = NULL; > > > > - os_mem_prealloc(fd, area, size, 1, &local_err); > > + os_mem_prealloc(fd, area, size, 1, NULL, 0, &local_err); > > if (local_err) { > > static bool warned; > > > > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h > > index 1c1e7eca98..474cbf3b86 100644 > > --- a/include/qemu/osdep.h > > +++ b/include/qemu/osdep.h > > @@ -577,6 +577,8 @@ unsigned long qemu_getauxval(unsigned long type); > > void qemu_set_tty_echo(int fd, bool echo); > > > > void os_mem_prealloc(int fd, char *area, size_t sz, int smp_cpus, > > + const unsigned long *host_nodes, > > + unsigned long max_node, > > Error **errp); > > > > /** > > diff --git a/util/meson.build b/util/meson.build > > index 8f16018cd4..393ff74570 100644 > > --- a/util/meson.build > > +++ b/util/meson.build > > @@ -15,7 +15,7 @@ freebsd_dep = [] > > if targetos == 'freebsd' > > freebsd_dep = util > > endif > > -util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep]) > > +util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep, numa]) > > util_ss.add(when: 'CONFIG_POSIX', if_true: files('qemu-thread-posix.c')) > > util_ss.add(when: 'CONFIG_POSIX', if_true: files('memfd.c')) > > util_ss.add(when: 'CONFIG_WIN32', if_true: files('aio-win32.c')) > > diff --git a/util/oslib-posix.c b/util/oslib-posix.c > > index 477990f39b..1572b9b178 100644 > > --- a/util/oslib-posix.c > > +++ b/util/oslib-posix.c > > @@ -73,6 +73,10 @@ > > #include "qemu/error-report.h" > > #endif > > > > +#ifdef CONFIG_NUMA > > +#include <numa.h> > > +#endif > > + > > #define MAX_MEM_PREALLOC_THREAD_COUNT 16 > > > > struct MemsetThread; > > @@ -82,6 +86,9 @@ typedef struct MemsetContext { > > bool any_thread_failed; 
> > struct MemsetThread *threads; > > int num_threads; > > +#ifdef CONFIG_NUMA > > + struct bitmask *nodemask; > > +#endif > > } MemsetContext; > > > > struct MemsetThread { > > @@ -420,6 +427,12 @@ static void *do_touch_pages(void *arg) > > } > > qemu_mutex_unlock(&page_mutex); > > > > +#ifdef CONFIG_NUMA > > + if (memset_args->context->nodemask) { > > + numa_run_on_node_mask(memset_args->context->nodemask); > > + } > > +#endif > > + > > /* unblock SIGBUS */ > > sigemptyset(&set); > > sigaddset(&set, SIGBUS); > > @@ -463,6 +476,12 @@ static void *do_madv_populate_write_pages(void *arg) > > } > > qemu_mutex_unlock(&page_mutex); > > > > +#ifdef CONFIG_NUMA > > + if (memset_args->context->nodemask) { > > + numa_run_on_node_mask(memset_args->context->nodemask); > > + } > > +#endif > > Ok, so this is where affinity is set, and I believe this ought to be > failing when run under libvirt with an EPERM from sched_setaffinity. > This code is ignoring errors so I guess this won't be noticed unless > libnuma is printing an error message ? 
> > > + > > if (size && qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE)) { > > ret = -errno; > > } > > @@ -489,7 +508,9 @@ static inline int get_memset_num_threads(size_t hpagesize, size_t numpages, > > } > > > > static int touch_all_pages(char *area, size_t hpagesize, size_t numpages, > > - int smp_cpus, bool use_madv_populate_write) > > + int smp_cpus, const unsigned long *host_nodes, > > + unsigned long max_node, > > + bool use_madv_populate_write) > > { > > static gsize initialized = 0; > > MemsetContext context = { > > @@ -499,6 +520,7 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages, > > void *(*touch_fn)(void *); > > int ret = 0, i = 0; > > char *addr = area; > > + unsigned long value = max_node; > > > > if (g_once_init_enter(&initialized)) { > > qemu_mutex_init(&page_mutex); > > @@ -520,6 +542,48 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages, > > touch_fn = do_touch_pages; > > } > > > > + if (host_nodes) { > > + value = find_first_bit(host_nodes, max_node); > > + } > > + if (value != max_node) { > > +#ifdef CONFIG_NUMA > > + struct bitmask *cpus = numa_allocate_cpumask(); > > + g_autofree unsigned long *zerocpumask; > > + size_t zerocpumasklen; > > + g_autofree unsigned long *zeronodemask; > > + size_t zeronodemasklen; > > + > > + context.nodemask = numa_bitmask_alloc(max_node); > > + > > + zerocpumasklen = cpus->size / sizeof(unsigned long); > > + zerocpumask = g_new0(unsigned long, zerocpumasklen); > > + > > + for (; value != max_node; > > + value = find_next_bit(host_nodes, max_node, value + 1)) { > > + if (numa_node_to_cpus(value, cpus) || > > + memcmp(cpus->maskp, zerocpumask, zerocpumasklen) == 0) > > + continue; > > + > > + /* If given NUMA node has CPUs run threads on them. 
*/ > > + numa_bitmask_setbit(context.nodemask, value); > > + } > > + > > + numa_bitmask_free(cpus); > > + > > + zeronodemasklen = max_node / sizeof(unsigned long); > > + zeronodemask = g_new0(unsigned long, zeronodemasklen); > > + > > + if (memcmp(context.nodemask->maskp, > > + zeronodemask, zeronodemasklen) == 0) { > > + /* If no NUMA has a CPU available, then don't pin threads. */ > > + g_clear_pointer(&context.nodemask, numa_bitmask_free); > > + } > > +#else > > + errno = -EINVAL; > > + return -1; > > +#endif > > + } > > + > > context.threads = g_new0(MemsetThread, context.num_threads); > > numpages_per_thread = numpages / context.num_threads; > > leftover = numpages % context.num_threads; > > @@ -554,6 +618,10 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages, > > if (!use_madv_populate_write) { > > sigbus_memset_context = NULL; > > } > > + > > +#ifdef CONFIG_NUMA > > + g_clear_pointer(&context.nodemask, numa_bitmask_free); > > +#endif > > g_free(context.threads); > > > > return ret; > > @@ -566,6 +634,8 @@ static bool madv_populate_write_possible(char *area, size_t pagesize) > > } > > > > void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus, > > + const unsigned long *host_nodes, > > + unsigned long max_node, > > Error **errp) > > { > > static gsize initialized; > > @@ -608,7 +678,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus, > > > > /* touch pages simultaneously */ > > ret = touch_all_pages(area, hpagesize, numpages, smp_cpus, > > - use_madv_populate_write); > > + host_nodes, max_node, use_madv_populate_write); > > if (ret) { > > error_setg_errno(errp, -ret, > > "os_mem_prealloc: preallocating memory failed"); > > diff --git a/util/oslib-win32.c b/util/oslib-win32.c > > index dafef4f157..6efd912355 100644 > > --- a/util/oslib-win32.c > > +++ b/util/oslib-win32.c > > @@ -314,6 +314,8 @@ int getpagesize(void) > > } > > > > void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus, 
> > + const unsigned long *host_nodes, > > + unsigned long max_node, > > Error **errp) > > { > > int i; > > -- > > 2.35.1 > > > > > > With regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| > >
* Michal Privoznik (mprivozn@redhat.com) wrote: > When allocating large amounts of memory the task is offloaded > onto threads. These threads then use various techniques to > allocate the memory fully (madvise(), writing into the memory). > However, these threads are free to run on any CPU, which becomes > problematic on NUMA machines because it may happen that a thread > is running on a distant node. > > Ideally, this is something that a management application would > resolve, but we are not anywhere close to that, Firstly, memory > allocation happens before monitor socket is even available. But > okay, that's what -preconfig is for. But then the problem is that > 'object-add' would not return until all memory is preallocated. > > Long story short, management application has no way of learning > TIDs of allocator threads so it can't make them run NUMA aware. > > But what we can do is to propagate the 'host-nodes' attribute of > MemoryBackend object down to where preallocation threads are > created and set their affinity according to the attribute.

Joe (cc'd) sent me some numbers for this which emphasise how useful it is:

| On systems with 4 physical numa nodes and 2-6 Tb of memory, this numa-aware
| preallocation provided about a 25% speedup in touching the pages.
| The speedup gets larger as the numa node count and memory sizes grow.
....
| In a simple parallel 1Gb page-zeroing test on a very large system (32-numa
| nodes and 47Tb of memory), the numa-aware preallocation was 2.3X faster
| than letting the threads float wherever.

| We're working with someone whose large guest normally takes 4.5 hours to
| boot. With Michal P's initial patch to parallelize the preallocation, that
| time dropped to about 1 hour. Including this numa-aware preallocation
| would reduce the guest boot time to less than 1/2 hour.

so chopping *half an hour* off the startup time seems a worthy optimisation (even if most of us aren't fortunate enough to have 47T of ram).
Dave > Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2074000 > Signed-off-by: Michal Privoznik <mprivozn@redhat.com> > --- > backends/hostmem.c | 6 ++-- > hw/virtio/virtio-mem.c | 2 +- > include/qemu/osdep.h | 2 ++ > util/meson.build | 2 +- > util/oslib-posix.c | 74 ++++++++++++++++++++++++++++++++++++++++-- > util/oslib-win32.c | 2 ++ > 6 files changed, 82 insertions(+), 6 deletions(-) > > diff --git a/backends/hostmem.c b/backends/hostmem.c > index a7bae3d713..7373472c7e 100644 > --- a/backends/hostmem.c > +++ b/backends/hostmem.c > @@ -232,7 +232,8 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value, > void *ptr = memory_region_get_ram_ptr(&backend->mr); > uint64_t sz = memory_region_size(&backend->mr); > > - os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads, &local_err); > + os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads, > + backend->host_nodes, MAX_NODES, &local_err); > if (local_err) { > error_propagate(errp, local_err); > return; > @@ -394,7 +395,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp) > */ > if (backend->prealloc) { > os_mem_prealloc(memory_region_get_fd(&backend->mr), ptr, sz, > - backend->prealloc_threads, &local_err); > + backend->prealloc_threads, backend->host_nodes, > + MAX_NODES, &local_err); > if (local_err) { > goto out; > } > diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c > index 5aca408726..48b104cdf6 100644 > --- a/hw/virtio/virtio-mem.c > +++ b/hw/virtio/virtio-mem.c > @@ -467,7 +467,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa, > int fd = memory_region_get_fd(&vmem->memdev->mr); > Error *local_err = NULL; > > - os_mem_prealloc(fd, area, size, 1, &local_err); > + os_mem_prealloc(fd, area, size, 1, NULL, 0, &local_err); > if (local_err) { > static bool warned; > > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h > index 1c1e7eca98..474cbf3b86 100644 > --- a/include/qemu/osdep.h > +++ b/include/qemu/osdep.h > @@ 
> @@ -577,6 +577,8 @@ unsigned long qemu_getauxval(unsigned long type);
>  void qemu_set_tty_echo(int fd, bool echo);
> 
>  void os_mem_prealloc(int fd, char *area, size_t sz, int smp_cpus,
> +                     const unsigned long *host_nodes,
> +                     unsigned long max_node,
>                       Error **errp);
> 
>  /**
> diff --git a/util/meson.build b/util/meson.build
> index 8f16018cd4..393ff74570 100644
> --- a/util/meson.build
> +++ b/util/meson.build
> @@ -15,7 +15,7 @@ freebsd_dep = []
>  if targetos == 'freebsd'
>    freebsd_dep = util
>  endif
> -util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep])
> +util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep, numa])
>  util_ss.add(when: 'CONFIG_POSIX', if_true: files('qemu-thread-posix.c'))
>  util_ss.add(when: 'CONFIG_POSIX', if_true: files('memfd.c'))
>  util_ss.add(when: 'CONFIG_WIN32', if_true: files('aio-win32.c'))
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index 477990f39b..1572b9b178 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -73,6 +73,10 @@
>  #include "qemu/error-report.h"
>  #endif
> 
> +#ifdef CONFIG_NUMA
> +#include <numa.h>
> +#endif
> +
>  #define MAX_MEM_PREALLOC_THREAD_COUNT 16
> 
>  struct MemsetThread;
> @@ -82,6 +86,9 @@ typedef struct MemsetContext {
>      bool any_thread_failed;
>      struct MemsetThread *threads;
>      int num_threads;
> +#ifdef CONFIG_NUMA
> +    struct bitmask *nodemask;
> +#endif
>  } MemsetContext;
> 
>  struct MemsetThread {
> @@ -420,6 +427,12 @@ static void *do_touch_pages(void *arg)
>      }
>      qemu_mutex_unlock(&page_mutex);
> 
> +#ifdef CONFIG_NUMA
> +    if (memset_args->context->nodemask) {
> +        numa_run_on_node_mask(memset_args->context->nodemask);
> +    }
> +#endif
> +
>      /* unblock SIGBUS */
>      sigemptyset(&set);
>      sigaddset(&set, SIGBUS);
> @@ -463,6 +476,12 @@ static void *do_madv_populate_write_pages(void *arg)
>      }
>      qemu_mutex_unlock(&page_mutex);
> 
> +#ifdef CONFIG_NUMA
> +    if (memset_args->context->nodemask) {
> +        numa_run_on_node_mask(memset_args->context->nodemask);
> +    }
> +#endif
> +
>      if (size && qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE)) {
>          ret = -errno;
>      }
> @@ -489,7 +508,9 @@ static inline int get_memset_num_threads(size_t hpagesize, size_t numpages,
>  }
> 
>  static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
> -                           int smp_cpus, bool use_madv_populate_write)
> +                           int smp_cpus, const unsigned long *host_nodes,
> +                           unsigned long max_node,
> +                           bool use_madv_populate_write)
>  {
>      static gsize initialized = 0;
>      MemsetContext context = {
> @@ -499,6 +520,7 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>      void *(*touch_fn)(void *);
>      int ret = 0, i = 0;
>      char *addr = area;
> +    unsigned long value = max_node;
> 
>      if (g_once_init_enter(&initialized)) {
>          qemu_mutex_init(&page_mutex);
> @@ -520,6 +542,48 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>          touch_fn = do_touch_pages;
>      }
> 
> +    if (host_nodes) {
> +        value = find_first_bit(host_nodes, max_node);
> +    }
> +    if (value != max_node) {
> +#ifdef CONFIG_NUMA
> +        struct bitmask *cpus = numa_allocate_cpumask();
> +        g_autofree unsigned long *zerocpumask;
> +        size_t zerocpumasklen;
> +        g_autofree unsigned long *zeronodemask;
> +        size_t zeronodemasklen;
> +
> +        context.nodemask = numa_bitmask_alloc(max_node);
> +
> +        zerocpumasklen = cpus->size / sizeof(unsigned long);
> +        zerocpumask = g_new0(unsigned long, zerocpumasklen);
> +
> +        for (; value != max_node;
> +             value = find_next_bit(host_nodes, max_node, value + 1)) {
> +            if (numa_node_to_cpus(value, cpus) ||
> +                memcmp(cpus->maskp, zerocpumask, zerocpumasklen) == 0)
> +                continue;
> +
> +            /* If given NUMA node has CPUs run threads on them. */
> +            numa_bitmask_setbit(context.nodemask, value);
> +        }
> +
> +        numa_bitmask_free(cpus);
> +
> +        zeronodemasklen = max_node / sizeof(unsigned long);
> +        zeronodemask = g_new0(unsigned long, zeronodemasklen);
> +
> +        if (memcmp(context.nodemask->maskp,
> +                   zeronodemask, zeronodemasklen) == 0) {
> +            /* If no NUMA has a CPU available, then don't pin threads. */
> +            g_clear_pointer(&context.nodemask, numa_bitmask_free);
> +        }
> +#else
> +        errno = -EINVAL;
> +        return -1;
> +#endif
> +    }
> +
>      context.threads = g_new0(MemsetThread, context.num_threads);
>      numpages_per_thread = numpages / context.num_threads;
>      leftover = numpages % context.num_threads;
> @@ -554,6 +618,10 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>      if (!use_madv_populate_write) {
>          sigbus_memset_context = NULL;
>      }
> +
> +#ifdef CONFIG_NUMA
> +    g_clear_pointer(&context.nodemask, numa_bitmask_free);
> +#endif
>      g_free(context.threads);
> 
>      return ret;
> @@ -566,6 +634,8 @@ static bool madv_populate_write_possible(char *area, size_t pagesize)
>  }
> 
>  void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
> +                     const unsigned long *host_nodes,
> +                     unsigned long max_node,
>                       Error **errp)
>  {
>      static gsize initialized;
> @@ -608,7 +678,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
> 
>      /* touch pages simultaneously */
>      ret = touch_all_pages(area, hpagesize, numpages, smp_cpus,
> -                          use_madv_populate_write);
> +                          host_nodes, max_node, use_madv_populate_write);
>      if (ret) {
>          error_setg_errno(errp, -ret,
>                           "os_mem_prealloc: preallocating memory failed");
> diff --git a/util/oslib-win32.c b/util/oslib-win32.c
> index dafef4f157..6efd912355 100644
> --- a/util/oslib-win32.c
> +++ b/util/oslib-win32.c
> @@ -314,6 +314,8 @@ int getpagesize(void)
>  }
> 
>  void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
> +                     const unsigned long *host_nodes,
> +                     unsigned long max_node,
>                       Error **errp)
>  {
>      int i;
> -- 
> 2.35.1
On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote: > When allocating large amounts of memory the task is offloaded > onto threads. These threads then use various techniques to > allocate the memory fully (madvise(), writing into the memory). > However, these threads are free to run on any CPU, which becomes > problematic on NUMA machines because it may happen that a thread > is running on a distant node. > > Ideally, this is something that a management application would > resolve, but we are not anywhere close to that, Firstly, memory > allocation happens before monitor socket is even available. But > okay, that's what -preconfig is for. But then the problem is that > 'object-add' would not return until all memory is preallocated. Is the delay to 'object-add' actually a problem ? Currently we're cold plugging the memory backends, so prealloc happens before QMP is available. So we have a delay immediately at startup. Switching to -preconfig plus 'object-add' would not be making the delay worse, merely moving it ever so slightly later. From the POV of an application using libvirt, this is the same. virDomainCreate takes 1 hour, regardless of whether the 1 hour allocation delay is before QMP or in -preconfig object-add execution.
In a guest spanning multiple host NUMA nodes, libvirt would have to configure 1 or more worker threads for QEMU, learn their TIDs, then add the memory backends in -preconfig, which would farm out preallocation to the worker threads, with job placement matching the worker's affinity. With regards, Daniel
On Wed, May 11, 2022 at 09:34:07AM +0100, Dr. David Alan Gilbert wrote: > * Michal Privoznik (mprivozn@redhat.com) wrote: > > When allocating large amounts of memory the task is offloaded > > onto threads. These threads then use various techniques to > > allocate the memory fully (madvise(), writing into the memory). > > However, these threads are free to run on any CPU, which becomes > > problematic on NUMA machines because it may happen that a thread > > is running on a distant node. > > > > Ideally, this is something that a management application would > > resolve, but we are not anywhere close to that, Firstly, memory > > allocation happens before monitor socket is even available. But > > okay, that's what -preconfig is for. But then the problem is that > > 'object-add' would not return until all memory is preallocated. > > > > Long story short, management application has no way of learning > > TIDs of allocator threads so it can't make them run NUMA aware. > > > > But what we can do is to propagate the 'host-nodes' attribute of > > MemoryBackend object down to where preallocation threads are > > created and set their affinity according to the attribute. > > Joe (cc'd) sent me some numbers for this which emphasise how useful it > is: > | On systems with 4 physical numa nodes and 2-6 Tb of memory, this numa-aware > |preallocation provided about a 25% speedup in touching the pages. > |The speedup gets larger as the numa node count and memory sizes grow. > .... > | In a simple parallel 1Gb page-zeroing test on a very large system (32-numa > | nodes and 47Tb of memory), the numa-aware preallocation was 2.3X faster > | than letting the threads float wherever. > | We're working with someone whose large guest normally takes 4.5 hours to > | boot. With Michal P's initial patch to parallelize the preallocation, that > | time dropped to about 1 hour. Including this numa-aware preallocation > | would reduce the guest boot time to less than 1/2 hour. 
> > so chopping *half an hour* off the startup time seems a worthy > optimisation (even if most of us aren't fortunate enough to have 47T of > ram). I presume this test was done with bare QEMU though, not libvirt managed QEMU, as IIUC, the latter would not be able to set its affinity and so never see this benefit. With regards, Daniel
>> Long story short, management application has no way of learning >> TIDs of allocator threads so it can't make them run NUMA aware. > > This feels like the key issue. The preallocation threads are > invisible to libvirt, regardless of whether we're doing coldplug > or hotplug of memory-backends. Indeed the threads are invisible > to all of QEMU, except the memory backend code. > > Conceptually we need 1 or more explicit worker threads, that we > can assign CPU affinity to, and then QEMU can place jobs on them. > I/O threads serve this role, but limited to blockdev work. We > need a generalization of I/O threads, for arbitrary jobs that > QEMU might want to farm out to specific numa nodes. At least the "-object iothread" thingy can already be used for actions outside of blockdev. virtio-balloon uses one for free page hinting.
On Wed, May 11, 2022 at 11:31:23AM +0200, David Hildenbrand wrote: > >> Long story short, management application has no way of learning > >> TIDs of allocator threads so it can't make them run NUMA aware. > > > > This feels like the key issue. The preallocation threads are > > invisible to libvirt, regardless of whether we're doing coldplug > > or hotplug of memory-backends. Indeed the threads are invisible > > to all of QEMU, except the memory backend code. > > > > Conceptually we need 1 or more explicit worker threads, that we > > can assign CPU affinity to, and then QEMU can place jobs on them. > > I/O threads serve this role, but limited to blockdev work. We > > need a generalization of I/O threads, for arbitrary jobs that > > QEMU might want to farm out to specific numa nodes. > > At least the "-object iothread" thingy can already be used for actions > outside of blockdev. virtio-balloon uses one for free page hinting. Ah that's good to know, so my idea probably isn't so much work as I thought it might be. With regards, Daniel
On 11.05.22 11:34, Daniel P. Berrangé wrote: > On Wed, May 11, 2022 at 11:31:23AM +0200, David Hildenbrand wrote: >>>> Long story short, management application has no way of learning >>>> TIDs of allocator threads so it can't make them run NUMA aware. >>> >>> This feels like the key issue. The preallocation threads are >>> invisible to libvirt, regardless of whether we're doing coldplug >>> or hotplug of memory-backends. Indeed the threads are invisible >>> to all of QEMU, except the memory backend code. >>> >>> Conceptually we need 1 or more explicit worker threads, that we >>> can assign CPU affinity to, and then QEMU can place jobs on them. >>> I/O threads serve this role, but limited to blockdev work. We >>> need a generalization of I/O threads, for arbitrary jobs that >>> QEMU might want to farm out to specific numa nodes. >> >> At least the "-object iothread" thingy can already be used for actions >> outside of blockdev. virtio-balloon uses one for free page hinting. > > Ah that's good to know, so my idea probably isn't so much work as > I thought it might be. I guess we'd have to create a bunch of iothreads on the command line and then feed them as an array to the memory backend we want to create. We could then forward the threads to a new variant of os_mem_prealloc(). We could a) Allocate new iothreads for each memory backend we create. Hm, that might be suboptimal, we could end up with many iothreads. b) Reuse iothreads and have separate sets of iothreads per NUMA node. Assign them to a node once. c) Reuse iothreads and reassign them to NUMA nodes on demand. However, I'm not sure what the semantics are when having multiple backends referencing the iothreads ... Regarding c), an alternative I raised is that we could simply preallocate the threads in QEMU internally on machine creation, and let libvirt reassign them to the fitting NUMA node whenever coldplugging/hotplugging a new memory backend.
On Wed, May 11, 2022 at 12:03:24PM +0200, David Hildenbrand wrote: > On 11.05.22 11:34, Daniel P. Berrangé wrote: > > On Wed, May 11, 2022 at 11:31:23AM +0200, David Hildenbrand wrote: > >>>> Long story short, management application has no way of learning > >>>> TIDs of allocator threads so it can't make them run NUMA aware. > >>> > >>> This feels like the key issue. The preallocation threads are > >>> invisible to libvirt, regardless of whether we're doing coldplug > >>> or hotplug of memory-backends. Indeed the threads are invisible > >>> to all of QEMU, except the memory backend code. > >>> > >>> Conceptually we need 1 or more explicit worker threads, that we > >>> can assign CPU affinity to, and then QEMU can place jobs on them. > >>> I/O threads serve this role, but limited to blockdev work. We > >>> need a generalization of I/O threads, for arbitrary jobs that > >>> QEMU might want to farm out to specific numa nodes. > >> > >> At least the "-object iothread" thingy can already be used for actions > >> outside of blockdev. virtio-balloon uses one for free page hinting. > > > > Ah that's good to know, so my idea probably isn't so much work as > > I thought it might be. > > I guess we'd have to create a bunch of iothreads on the command line and > then feed them as an array to the memory backend we want to create. We > could then forward the threads to a new variant of os_mem_prealloc(). > > We could > > a) Allocate new iothreads for each memory backend we create. Hm, that > might be suboptimal, we could end up with many iothreads. > > b) Reuse iothreads and have separate sets of iothreads per NUMA node. > Assign them to a node once. > > c) Reuse iothreads and reassign them to NUMA nodes on demand. If all we need is NUMA affinity, not CPU affinity, then it would be sufficient to create 1 I/O thread per host NUMA node that the VM needs to use. The job running in the I/O thread can spawn further threads and inherit the NUMA affinity.
This might be more clever than is needed though. I expect creating/deleting I/O threads is cheap in comparison to the work done for preallocation. If libvirt is using -preconfig and object-add to create the memory backend, then we could have the option of creating the I/O threads dynamically in -preconfig mode, create the memory backend, and then delete the I/O threads again. > However, I'm not sure what the semantics are when having multiple > backends referencing the iothreads ... Yep, we don't especially need an "ownership" relationship for what we want to do with preallocation, especially because it is a one-off point-in-time usage, not continuous usage as with block devices. With regards, Daniel
On 5/11/22 12:10, Daniel P. Berrangé wrote: > If all we needs is NUMA affinity, not CPU affinity, then it would > be sufficient to create 1 I/O thread per host NUMA node that the > VM needs to use. The job running in the I/O can spawn further > threads and inherit the NUMA affinity. This might be more clever > than it is needed though. > > I expect creating/deleting I/O threads is cheap in comparison to > the work done for preallocation. If libvirt is using -preconfig > and object-add to create the memory backend, then we could have > option of creating the I/O threads dynamically in -preconfig mode, > create the memory backend, and then delete the I/O threads again. I think this is very overengineered. Michal's patch is doing the obvious thing and if it doesn't work that's because Libvirt is trying to micromanage QEMU. As mentioned on IRC, if the reason is to prevent moving around threads in realtime (SCHED_FIFO, SCHED_RR) classes, that should be fixed at the kernel level. Paolo
On 5/10/22 11:12, Daniel P. Berrangé wrote: > On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote: >> When allocating large amounts of memory the task is offloaded >> onto threads. These threads then use various techniques to >> allocate the memory fully (madvise(), writing into the memory). >> However, these threads are free to run on any CPU, which becomes >> problematic on NUMA machines because it may happen that a thread >> is running on a distant node. >> >> Ideally, this is something that a management application would >> resolve, but we are not anywhere close to that, Firstly, memory >> allocation happens before monitor socket is even available. But >> okay, that's what -preconfig is for. But then the problem is that >> 'object-add' would not return until all memory is preallocated. >> >> Long story short, management application has no way of learning >> TIDs of allocator threads so it can't make them run NUMA aware. > > So I'm wondering what the impact of this problem is for various > scenarios. The scenario I tested this with was no <emulatorpin/>, but using 'virsh emulatorpin' afterwards to pin the emulator thread somewhere. For those unfamiliar with libvirt, this is about placing the main qemu TID (with the main event loop) into a CGroup that restricts which CPUs it can run on. > > The default config for a KVM guest with libvirt is no CPU pinning > at all. The kernel auto-places CPUs and decides on where RAM is to > be allocated. So in this case, whether or not libvirt can talk to > QMP in time to query threads is largely irrelevant, as we don't > want todo placement in any case. > > In theory the kernel should allocate RAM on the node local to > where the process is currently executing.
> > The challenge is if we spawn N more threads to do pre-alloc, > these can be spread onto other nodes. I wonder if the kernel > huas any preference for keeping threads within a process on > the same NUMA node ? That's not exactly what I saw. I would have thought, too, that initially the prealloc thread could be spawned just anywhere, but that after a few iterations the scheduler would realize what NUMA node the thread is close to and automatically schedule it to run there. Well, it didn't happen. > > Overall, if libvirt is not applying pinning to the QEMU guest, > then we're 100% reliant on the kernel todo something sensible, > both for normal QEMU execution and for prealloc. Since we're > not doing placement of QEMU RAM or CPUs, the logic in this > patch won't do anything either. > > > > If the guest has more RAM than can fit on the local NUMA node, > then we're doomed no matter what, even ignoring prealloc, there > will be cross-node traffic. This scenario requires the admin to > setup proper CPU /memory pinning for QEMU in libvirt.
Just as with thue non-pinned case, the > prealloc will be at the mercy of the kernel making sensible > placement decisions. Indeed, but it's at least somewhat restricted. NB, in real scenario users will map guest NUMA nodes onto host ones with 1:1 relationship. And each guest NUMA node will have its own memdev=, i.e. its own set of threads, so in the end, prealloc threads won't jump between host NUMA nodes but stay local to the node they are allocating memory on. > > The very last cases is the only one where this patch can potentially > be beneficial. The problem is that because libvirt is in charge of > enforcing CPU affinity, IIRC, we explicitly block QEMU from doing > anything with CPU affinity. So AFAICT, this patch should result in > an error from sched_setaffinity when run under libvirt. Yes, I had to disable capability dropping in qemu.conf. After all, I think maybe the right place to fix this is kernel? I mean, why don't prealloc threads converge to the nodes they are working with? Michal
>> >> The very last cases is the only one where this patch can potentially >> be beneficial. The problem is that because libvirt is in charge of >> enforcing CPU affinity, IIRC, we explicitly block QEMU from doing >> anything with CPU affinity. So AFAICT, this patch should result in >> an error from sched_setaffinity when run under libvirt. > > Yes, I had to disable capability dropping in qemu.conf. > > After all, I think maybe the right place to fix this is kernel? I mean, > why don't prealloc threads converge to the nodes they are working with? Good question, I think the whole machinery doesn't support hugetlb, including recording remote faults and migrating the thread to a different node.
On Wed, May 11, 2022 at 03:16:55PM +0200, Michal Prívozník wrote: > On 5/10/22 11:12, Daniel P. Berrangé wrote: > > On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote: > >> When allocating large amounts of memory the task is offloaded > >> onto threads. These threads then use various techniques to > >> allocate the memory fully (madvise(), writing into the memory). > >> However, these threads are free to run on any CPU, which becomes > >> problematic on NUMA machines because it may happen that a thread > >> is running on a distant node. > >> > >> Ideally, this is something that a management application would > >> resolve, but we are not anywhere close to that, Firstly, memory > >> allocation happens before monitor socket is even available. But > >> okay, that's what -preconfig is for. But then the problem is that > >> 'object-add' would not return until all memory is preallocated. > >> > >> Long story short, management application has no way of learning > >> TIDs of allocator threads so it can't make them run NUMA aware. > > > > So I'm wondering what the impact of this problem is for various > > scenarios. > > The scenario which I tested this with was no <emulatorpin/> but using > 'virsh emulatorpin' afterwards to pin emulator thread somewhere. For > those which are unfamiliar with libvirt, this is about placing the main > qemu TID (with the main eventloop) into a CGroup that restricts on what > CPUs it can run. > > > > > The default config for a KVM guest with libvirt is no CPU pinning > > at all. The kernel auto-places CPUs and decides on where RAM is to > > be allocated. So in this case, whether or not libvirt can talk to > > QMP in time to query threads is largely irrelevant, as we don't > > want todo placement in any case. > > > > In theory the kernel should allocate RAM on the node local to > > where the process is currently executing. 
So as long as the > > guest RAM fits in available free RAM on the local node, RAM > > should be allocated from the node that matches the CPU running > > the QEMU main thread. > > > > The challenge is if we spawn N more threads to do pre-alloc, > > these can be spread onto other nodes. I wonder if the kernel > > huas any preference for keeping threads within a process on > > the same NUMA node ? > > That's not exactly what I saw. I would have thought too that initially > the prealloc thread could be spawned just anywhere but after few > iterations the scheduler realized what NUMA node the thread is close to > and automatically schedule it to run there. Well, it didn't happen. Thinking about it, this does make sense to some extent. When a thread is first spawned, how can the kernel know what region of memory it is about to start touching ? So at the very least the kernel scheduler can get it wrong initially. It would need something to watch memory access patterns to determine whether the initial decision was right or wrong, and fine-tune it later. Seems like the kernel typically tries to do the opposite of what we thought, and instead of moving CPUs, has ways to move the memory instead. https://www.kernel.org/doc/html/latest/vm/page_migration.html > > Overall, if libvirt is not applying pinning to the QEMU guest, > > then we're 100% reliant on the kernel todo something sensible, > > both for normal QEMU execution and for prealloc. Since we're > > not doing placement of QEMU RAM or CPUs, the logic in this > > patch won't do anything either. > > > > > > If the guest has more RAM than can fit on the local NUMA node, > > then we're doomed no matter what, even ignoring prealloc, there > > will be cross-node traffic. This scenario requires the admin to > > setup proper CPU /memory pinning for QEMU in libvirt.
> > > > If libvirt is doing CPU pinning (as instructed by the mgmt app > > above us), then when we first start QEMU, the process thread > > leader will get given affinity by libvirt prior to exec. This > > affinity will be the union of affinity for all CPUs that will > > be later configured. > > > > The typical case for CPU pinning, is that everything fits in > > one NUMA node, and so in this case, we don't need todo anything > > more. The prealloc threads will already be constrained to the > > right place by the affinity of the QEMU thread leader, so the > > logic in this patch will run, but it won't do anything that > > was not already done. > > > > So we're left with the hardest case, where the guest is explicitly > > spread across multiple NUMA nodes. In this case the thread leader > > affinity will span many NUMA nodes, and so the prealloc threads > > will freely be placed across any CPU that is in the union of CPUs > > the guest is placed on. Just as with thue non-pinned case, the > > prealloc will be at the mercy of the kernel making sensible > > placement decisions. > > Indeed, but it's at least somewhat restricted. NB, in real scenario > users will map guest NUMA nodes onto host ones with 1:1 relationship. > And each guest NUMA node will have its own memdev=, i.e. its own set of > threads, so in the end, prealloc threads won't jump between host NUMA > nodes but stay local to the node they are allocating memory on. Thinking about this from a completely different QEMU angle. Right now the preallocation happens when we create the memory device, and takes place in threads spawned from the main QEMU thread. We are doing memory placement in order that specific blocks of virtual RAM are co-located with specific virtual CPUs. IOW we know we already have some threads that will match locality of the RAM we have. 
We are doing memory pre-allocation to give predictable performance once the VM starts, and to have a guarantee that the memory is actually available for use. We don't actually need the memory pre-allocation to take place so early in object creation, as we have it right now. It just needs to be done before VM creation is done and vCPUs start guest code. From the POV of controlling QEMU VM resource usage, we don't really want memory pre-allocation to consume more host CPUs than we've assigned to the VM for its vCPUs. So what if instead of creating throwaway threads for memory allocation early, we ran the preallocation in the vCPU threads before they start executing guest code ? This is still early enough to achieve our goals for preallocation. These vCPU threads already have the right affinity setup. This ensures that the CPU burn for preallocation doesn't exceed what we've allowed for guest CPU usage in general, so resource limits will be naturally enforced. Is this kind of approach possible ? With regards, Daniel
On 11.05.22 17:08, Daniel P. Berrangé wrote: > On Wed, May 11, 2022 at 03:16:55PM +0200, Michal Prívozník wrote: >> On 5/10/22 11:12, Daniel P. Berrangé wrote: >>> On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote: >>>> When allocating large amounts of memory the task is offloaded >>>> onto threads. These threads then use various techniques to >>>> allocate the memory fully (madvise(), writing into the memory). >>>> However, these threads are free to run on any CPU, which becomes >>>> problematic on NUMA machines because it may happen that a thread >>>> is running on a distant node. >>>> >>>> Ideally, this is something that a management application would >>>> resolve, but we are not anywhere close to that, Firstly, memory >>>> allocation happens before monitor socket is even available. But >>>> okay, that's what -preconfig is for. But then the problem is that >>>> 'object-add' would not return until all memory is preallocated. >>>> >>>> Long story short, management application has no way of learning >>>> TIDs of allocator threads so it can't make them run NUMA aware. >>> >>> So I'm wondering what the impact of this problem is for various >>> scenarios. >> >> The scenario which I tested this with was no <emulatorpin/> but using >> 'virsh emulatorpin' afterwards to pin emulator thread somewhere. For >> those which are unfamiliar with libvirt, this is about placing the main >> qemu TID (with the main eventloop) into a CGroup that restricts on what >> CPUs it can run. >> >>> >>> The default config for a KVM guest with libvirt is no CPU pinning >>> at all. The kernel auto-places CPUs and decides on where RAM is to >>> be allocated. So in this case, whether or not libvirt can talk to >>> QMP in time to query threads is largely irrelevant, as we don't >>> want todo placement in any case. >>> >>> In theory the kernel should allocate RAM on the node local to >>> where the process is currently executing. 
So as long as the >>> guest RAM fits in available free RAM on the local node, RAM >>> should be allocated from the node that matches the CPU running >>> the QEMU main thread. >>> >>> The challenge is if we spawn N more threads to do pre-alloc, >>> these can be spread onto other nodes. I wonder if the kernel >>> huas any preference for keeping threads within a process on >>> the same NUMA node ? >> >> That's not exactly what I saw. I would have thought too that initially >> the prealloc thread could be spawned just anywhere but after few >> iterations the scheduler realized what NUMA node the thread is close to >> and automatically schedule it to run there. Well, it didn't happen. > > Thinking about it, this does make sense to some extent. When a > thread is first spawned, how can the kernel know what region of > memory it is about to start touching ? So at the very least the > kernel schedular can get it wrong initially. It would need something > to watch memory acces patterns to determine whether the initial > decision was right or wrong, and fine tune it later. > > Seems like the kernel typically tries todo the opposite to what > we thought, and instead of moving CPUs, has ways to move the > memory instead. > > https://www.kernel.org/doc/html/latest/vm/page_migration.html > >>> Overall, if libvirt is not applying pinning to the QEMU guest, >>> then we're 100% reliant on the kernel todo something sensible, >>> both for normal QEMU execution and for prealloc. Since we're >>> not doing placement of QEMU RAM or CPUs, the logic in this >>> patch won't do anything either. >>> >>> >>> If the guest has more RAM than can fit on the local NUMA node, >>> then we're doomed no matter what, even ignoring prealloc, there >>> will be cross-node traffic. This scenario requires the admin to >>> setup proper CPU /memory pinning for QEMU in libvirt. 
>>> >>> If libvirt is doing CPU pinning (as instructed by the mgmt app >>> above us), then when we first start QEMU, the process thread >>> leader will get given affinity by libvirt prior to exec. This >>> affinity will be the union of affinity for all CPUs that will >>> be later configured. >>> >>> The typical case for CPU pinning, is that everything fits in >>> one NUMA node, and so in this case, we don't need todo anything >>> more. The prealloc threads will already be constrained to the >>> right place by the affinity of the QEMU thread leader, so the >>> logic in this patch will run, but it won't do anything that >>> was not already done. >>> >>> So we're left with the hardest case, where the guest is explicitly >>> spread across multiple NUMA nodes. In this case the thread leader >>> affinity will span many NUMA nodes, and so the prealloc threads >>> will freely be placed across any CPU that is in the union of CPUs >>> the guest is placed on. Just as with thue non-pinned case, the >>> prealloc will be at the mercy of the kernel making sensible >>> placement decisions. >> >> Indeed, but it's at least somewhat restricted. NB, in real scenario >> users will map guest NUMA nodes onto host ones with 1:1 relationship. >> And each guest NUMA node will have its own memdev=, i.e. its own set of >> threads, so in the end, prealloc threads won't jump between host NUMA >> nodes but stay local to the node they are allocating memory on. > > Thinking about this from a completely different QEMU angle. > > Right now the preallocation happens when we create the memory > device, and takes place in threads spawned from the main > QEMU thread. > > We are doing memory placement in order that specific blocks of > virtual RAM are co-located with specific virtual CPUs. IOW we > know we already have some threads that will match locality of > the RAM we have. 
> > We are doing memory pre-allocation to give predictable > performance once the VM starts, and to have a guarantee > that the memory is actually available for use. > > We don't actually need the memory pre-allocation to take > place so early in object creation, as we have it right now. > It just needs to be done before VM creation is done and > vCPUs start guest code. > > > From the POV of controlling QEMU VM resource usage, we don't > really want memory pre-allocation to consume more host CPUs > than we've assigned to the VM for its vCPUs. > > So what if instead of creating throwaway threads for memory > allocation early, we ran the preallocation in the vCPU > threads before they start executing guest code ? This is > still early enough to achieve our goals for preallocation. > > These vCPU threads already have the right affinity setup. This > ensures that the CPU burn for preallocation doesn't exceed > what we've allowed for guest CPU usage in general, so resource > limits will be naturally enforced. > > Is this kind of approach possible ? That sounds quite hackish. IMHO, we should make sure that any approach we introduce is able to cope with both coldplugged and hotplugged memory backends. I agree with Paolo's comment that libvirt is trying to micromanage QEMU here, which is the root of the issue IMHO. For applicable VMs (!realtime?), Libvirt could simply allow QEMU to set the affinity in cases where no real harm can be done. I'm not sure what other interfaces the kernel could provide to allow Libvirt to restrict the affinity to only some subset of nodes/cpus, so QEMU's harm could be limited to that subset.
On Wed, May 11, 2022 at 01:07:47PM +0200, Paolo Bonzini wrote:
> On 5/11/22 12:10, Daniel P. Berrangé wrote:
> > If all we need is NUMA affinity, not CPU affinity, then it would
> > be sufficient to create 1 I/O thread per host NUMA node that the
> > VM needs to use. The job running in the I/O thread can spawn further
> > threads and inherit the NUMA affinity. This might be more clever
> > than it is needed though.
> >
> > I expect creating/deleting I/O threads is cheap in comparison to
> > the work done for preallocation. If libvirt is using -preconfig
> > and object-add to create the memory backend, then we could have
> > the option of creating the I/O threads dynamically in -preconfig
> > mode, create the memory backend, and then delete the I/O threads
> > again.
>
> I think this is very overengineered. Michal's patch is doing the
> obvious thing and if it doesn't work that's because Libvirt is trying
> to micromanage QEMU.

Calling it micromanaging is putting a very negative connotation on
this. What we're trying to do is enforce a host resource policy for
QEMU, in a way that a compromised QEMU can't escape, which is a
valuable protection.

> As mentioned on IRC, if the reason is to prevent moving around
> threads in realtime (SCHED_FIFO, SCHED_RR) classes, that should be
> fixed at the kernel level.

We use cgroups where it is available to us, but we don't always have
the freedom that we'd like. Sometimes the deployment scenario of
libvirt means that we're stuck with whatever cgroup libvirtd is
launched in, and sched_setaffinity is our only way to confine QEMU, so
wanting to prevent use of sched_setaffinity is reasonable IMHO.

With regards,
Daniel
On 5/11/22 18:54, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 01:07:47PM +0200, Paolo Bonzini wrote:
>> On 5/11/22 12:10, Daniel P. Berrangé wrote:
>>> I expect creating/deleting I/O threads is cheap in comparison to
>>> the work done for preallocation. If libvirt is using -preconfig
>>> and object-add to create the memory backend, then we could have
>>> the option of creating the I/O threads dynamically in -preconfig
>>> mode, create the memory backend, and then delete the I/O threads
>>> again.
>>
>> I think this is very overengineered. Michal's patch is doing the
>> obvious thing and if it doesn't work that's because Libvirt is
>> trying to micromanage QEMU.
>
> Calling it micromanaging is putting a very negative connotation on
> this. What we're trying to do is enforce a host resource policy for
> QEMU, in a way that a compromised QEMU can't escape, which is a
> valuable protection.

I'm sorry if that was a bit exaggerated, but the negative connotation
was intentional.

>> As mentioned on IRC, if the reason is to prevent moving around
>> threads in realtime (SCHED_FIFO, SCHED_RR) classes, that should be
>> fixed at the kernel level.
>
> We use cgroups where it is available to us, but we don't always have
> the freedom that we'd like.

I understand. I'm thinking of a new flag to sched_setscheduler that
fixes the CPU affinity and policy of the thread and prevents changing
it in case QEMU is compromised later. The seccomp/SELinux sandboxes can
prevent setting the SCHED_FIFO class without this flag.

In addition, my hunch is that this works only because the RT setup of
QEMU is not safe against priority inversion.
IIRC the iothreads are set with a non-realtime priority, but actually
they should have a _higher_ priority than the CPU threads, and the
thread pool I/O bound workers should have an even higher priority;
otherwise you have a priority inversion situation where an interrupt is
pending that would wake up the CPU, but the iothreads cannot process it
because they have a lower priority than the CPU.

So the iothread and the associated util/thread-pool.c thread pool are
the wrong tools to solve Michal's issue; they are not meant for
background CPU-bound work, even though they _might_ work due to their
incorrect RT setup.

Paolo
On Thu, May 12, 2022 at 09:41:29AM +0200, Paolo Bonzini wrote:
> On 5/11/22 18:54, Daniel P. Berrangé wrote:
> > On Wed, May 11, 2022 at 01:07:47PM +0200, Paolo Bonzini wrote:
> > > On 5/11/22 12:10, Daniel P. Berrangé wrote:
> > > > I expect creating/deleting I/O threads is cheap in comparison
> > > > to the work done for preallocation. If libvirt is using
> > > > -preconfig and object-add to create the memory backend, then we
> > > > could have the option of creating the I/O threads dynamically
> > > > in -preconfig mode, create the memory backend, and then delete
> > > > the I/O threads again.
> > >
> > > I think this is very overengineered. Michal's patch is doing the
> > > obvious thing and if it doesn't work that's because Libvirt is
> > > trying to micromanage QEMU.
> >
> > Calling it micromanaging is putting a very negative connotation on
> > this. What we're trying to do is enforce a host resource policy for
> > QEMU, in a way that a compromised QEMU can't escape, which is a
> > valuable protection.
>
> I'm sorry if that was a bit exaggerated, but the negative connotation
> was intentional.
>
> > > As mentioned on IRC, if the reason is to prevent moving around
> > > threads in realtime (SCHED_FIFO, SCHED_RR) classes, that should
> > > be fixed at the kernel level.
> >
> > We use cgroups where it is available to us, but we don't always
> > have the freedom that we'd like.
>
> I understand. I'm thinking of a new flag to sched_setscheduler that
> fixes the CPU affinity and policy of the thread and prevents changing
> it in case QEMU is compromised later. The seccomp/SELinux sandboxes
> can prevent setting the SCHED_FIFO class without this flag.
>
> In addition, my hunch is that this works only because the RT setup of
> QEMU is not safe against priority inversion.
> IIRC the iothreads are set with a non-realtime priority, but actually
> they should have a _higher_ priority than the CPU threads, and the
> thread pool I/O bound workers should have an even higher priority;
> otherwise you have a priority inversion situation where an interrupt
> is pending that would wake up the CPU, but the iothreads cannot
> process it because they have a lower priority than the CPU.

At least for RHEL deployments of KVM-RT, IIRC the expectation is that
the vCPUs with RT priority never do I/O, and that there is at least 1
additional non-RT vCPU from which the OS performs I/O. IOW, the RT vCPU
works in a completely self-contained manner with no interaction with
any other QEMU threads. If that's not the case, then you would have to
make sure those other threads have priority / scheduler adjustments to
avoid priority inversion.

With regards,
Daniel
On 11.05.22 11:34, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 11:31:23AM +0200, David Hildenbrand wrote:
>>>> Long story short, management application has no way of learning
>>>> TIDs of allocator threads so it can't make them run NUMA aware.
>>>
>>> This feels like the key issue. The preallocation threads are
>>> invisible to libvirt, regardless of whether we're doing coldplug
>>> or hotplug of memory-backends. Indeed the threads are invisible
>>> to all of QEMU, except the memory backend code.
>>>
>>> Conceptually we need 1 or more explicit worker threads, that we
>>> can assign CPU affinity to, and then QEMU can place jobs on them.
>>> I/O threads serve this role, but limited to blockdev work. We
>>> need a generalization of I/O threads, for arbitrary jobs that
>>> QEMU might want to farm out to specific NUMA nodes.
>>
>> At least the "-object iothread" thingy can already be used for
>> actions outside of blockdev. virtio-balloon uses one for free page
>> hinting.
>
> Ah that's good to know, so my idea probably isn't so much work as
> I thought it might be.

Looking into the details, iothreads are the wrong tool to use here. I
can imagine use cases where you'd want to perform preallocation

* Only on some specific CPUs of a NUMA node (especially, not ones
  where we pinned vCPUs)
* On CPUs that are on a different NUMA node than the actual backend
  memory

... just thinking about all of the CPU-less nodes for PMEM and CXL
that we're starting to see.

So ideally, we'd let the user configure either

* Set of physical CPUs to use (low-hanging fruit)
* Set of NUMA nodes to use (harder)

and allow libvirt to easily configure it similarly by pinning threads.

CPU affinity is inherited when creating a new thread, so here is one
IMHO reasonably simple way to get the job done and allow for such
flexibility:

Introduce a "thread-context" user-creatable object that is used to
configure CPU affinity. Internally, thread-context creates exactly one
thread called "TC $id".
That thread serves no purpose besides spawning new threads that inherit
the affinity. Internal users (like preallocation, but we could reuse
the same concept for other non-I/O threads, such as userfaultfd, and
some other potential non-I/O thread users) simply use that thread
context to spawn new threads. Spawned threads get called
"TC $id/$threadname", whereby $threadname is the ordinary name supplied
by the internal user. This could be used to identify long-running
threads if needed in the future. It's worth noting that this is not a
thread pool.

a) Ordinary cmdline usage:

-object thread-context,id="tc1",cpus=0-9,cpus=12

QEMU tries setting the CPU affinity and fails if that's impossible.

-object memory-backend-file,...,prealloc=on,prealloc-threads=16,prealloc-thread-context=tc1

When a context is set, preallocation code will use the thread-context
to spawn threads.

b) Libvirt, ..., usage:

-object thread-context,id="tc1"

Then, libvirt identifies and sets the affinity for the "TC tc1" thread.

-object memory-backend-file,...,prealloc=on,prealloc-threads=16,prealloc-thread-context=tc1

thread-context can be reused for successive preallocation etc.,
obviously.
diff --git a/backends/hostmem.c b/backends/hostmem.c
index a7bae3d713..7373472c7e 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -232,7 +232,8 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
         void *ptr = memory_region_get_ram_ptr(&backend->mr);
         uint64_t sz = memory_region_size(&backend->mr);
 
-        os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads, &local_err);
+        os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads,
+                        backend->host_nodes, MAX_NODES, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
@@ -394,7 +395,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
          */
         if (backend->prealloc) {
             os_mem_prealloc(memory_region_get_fd(&backend->mr), ptr, sz,
-                            backend->prealloc_threads, &local_err);
+                            backend->prealloc_threads, backend->host_nodes,
+                            MAX_NODES, &local_err);
             if (local_err) {
                 goto out;
             }
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 5aca408726..48b104cdf6 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -467,7 +467,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa,
         int fd = memory_region_get_fd(&vmem->memdev->mr);
         Error *local_err = NULL;
 
-        os_mem_prealloc(fd, area, size, 1, &local_err);
+        os_mem_prealloc(fd, area, size, 1, NULL, 0, &local_err);
         if (local_err) {
             static bool warned;
 
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 1c1e7eca98..474cbf3b86 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -577,6 +577,8 @@ unsigned long qemu_getauxval(unsigned long type);
 void qemu_set_tty_echo(int fd, bool echo);
 
 void os_mem_prealloc(int fd, char *area, size_t sz, int smp_cpus,
+                     const unsigned long *host_nodes,
+                     unsigned long max_node,
                      Error **errp);
 
 /**
diff --git a/util/meson.build b/util/meson.build
index 8f16018cd4..393ff74570 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -15,7 +15,7 @@ freebsd_dep = []
 if targetos == 'freebsd'
   freebsd_dep = util
 endif
-util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep])
+util_ss.add(when: 'CONFIG_POSIX', if_true: [files('oslib-posix.c'), freebsd_dep, numa])
 util_ss.add(when: 'CONFIG_POSIX', if_true: files('qemu-thread-posix.c'))
 util_ss.add(when: 'CONFIG_POSIX', if_true: files('memfd.c'))
 util_ss.add(when: 'CONFIG_WIN32', if_true: files('aio-win32.c'))
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 477990f39b..1572b9b178 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -73,6 +73,10 @@
 #include "qemu/error-report.h"
 #endif
 
+#ifdef CONFIG_NUMA
+#include <numa.h>
+#endif
+
 #define MAX_MEM_PREALLOC_THREAD_COUNT 16
 
 struct MemsetThread;
@@ -82,6 +86,9 @@ typedef struct MemsetContext {
     bool any_thread_failed;
     struct MemsetThread *threads;
     int num_threads;
+#ifdef CONFIG_NUMA
+    struct bitmask *nodemask;
+#endif
 } MemsetContext;
 
 struct MemsetThread {
@@ -420,6 +427,12 @@ static void *do_touch_pages(void *arg)
     }
     qemu_mutex_unlock(&page_mutex);
 
+#ifdef CONFIG_NUMA
+    if (memset_args->context->nodemask) {
+        numa_run_on_node_mask(memset_args->context->nodemask);
+    }
+#endif
+
     /* unblock SIGBUS */
     sigemptyset(&set);
     sigaddset(&set, SIGBUS);
@@ -463,6 +476,12 @@ static void *do_madv_populate_write_pages(void *arg)
     }
     qemu_mutex_unlock(&page_mutex);
 
+#ifdef CONFIG_NUMA
+    if (memset_args->context->nodemask) {
+        numa_run_on_node_mask(memset_args->context->nodemask);
+    }
+#endif
+
     if (size && qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE)) {
         ret = -errno;
     }
@@ -489,7 +508,9 @@ static inline int get_memset_num_threads(size_t hpagesize, size_t numpages,
 }
 
 static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
-                           int smp_cpus, bool use_madv_populate_write)
+                           int smp_cpus, const unsigned long *host_nodes,
+                           unsigned long max_node,
+                           bool use_madv_populate_write)
 {
     static gsize initialized = 0;
     MemsetContext context = {
@@ -499,6 +520,7 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
     void *(*touch_fn)(void *);
     int ret = 0, i = 0;
     char *addr = area;
+    unsigned long value = max_node;
 
     if (g_once_init_enter(&initialized)) {
         qemu_mutex_init(&page_mutex);
@@ -520,6 +542,48 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
         touch_fn = do_touch_pages;
     }
 
+    if (host_nodes) {
+        value = find_first_bit(host_nodes, max_node);
+    }
+    if (value != max_node) {
+#ifdef CONFIG_NUMA
+        struct bitmask *cpus = numa_allocate_cpumask();
+        g_autofree unsigned long *zerocpumask;
+        size_t zerocpumasklen;
+        g_autofree unsigned long *zeronodemask;
+        size_t zeronodemasklen;
+
+        context.nodemask = numa_bitmask_alloc(max_node);
+
+        zerocpumasklen = cpus->size / sizeof(unsigned long);
+        zerocpumask = g_new0(unsigned long, zerocpumasklen);
+
+        for (; value != max_node;
+             value = find_next_bit(host_nodes, max_node, value + 1)) {
+            if (numa_node_to_cpus(value, cpus) ||
+                memcmp(cpus->maskp, zerocpumask, zerocpumasklen) == 0)
+                continue;
+
+            /* If given NUMA node has CPUs run threads on them. */
+            numa_bitmask_setbit(context.nodemask, value);
+        }
+
+        numa_bitmask_free(cpus);
+
+        zeronodemasklen = max_node / sizeof(unsigned long);
+        zeronodemask = g_new0(unsigned long, zeronodemasklen);
+
+        if (memcmp(context.nodemask->maskp,
+                   zeronodemask, zeronodemasklen) == 0) {
+            /* If no NUMA has a CPU available, then don't pin threads. */
+            g_clear_pointer(&context.nodemask, numa_bitmask_free);
+        }
+#else
+        errno = -EINVAL;
+        return -1;
+#endif
+    }
+
     context.threads = g_new0(MemsetThread, context.num_threads);
     numpages_per_thread = numpages / context.num_threads;
     leftover = numpages % context.num_threads;
@@ -554,6 +618,10 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
     if (!use_madv_populate_write) {
         sigbus_memset_context = NULL;
     }
+
+#ifdef CONFIG_NUMA
+    g_clear_pointer(&context.nodemask, numa_bitmask_free);
+#endif
     g_free(context.threads);
 
     return ret;
@@ -566,6 +634,8 @@ static bool madv_populate_write_possible(char *area, size_t pagesize)
 }
 
 void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
+                     const unsigned long *host_nodes,
+                     unsigned long max_node,
                      Error **errp)
 {
     static gsize initialized;
@@ -608,7 +678,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
 
     /* touch pages simultaneously */
     ret = touch_all_pages(area, hpagesize, numpages, smp_cpus,
-                          use_madv_populate_write);
+                          host_nodes, max_node, use_madv_populate_write);
     if (ret) {
         error_setg_errno(errp, -ret,
                          "os_mem_prealloc: preallocating memory failed");
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index dafef4f157..6efd912355 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -314,6 +314,8 @@ int getpagesize(void)
 }
 
 void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
+                     const unsigned long *host_nodes,
+                     unsigned long max_node,
                      Error **errp)
 {
     int i;
When allocating large amounts of memory the task is offloaded onto
threads. These threads then use various techniques to allocate the
memory fully (madvise(), writing into the memory). However, these
threads are free to run on any CPU, which becomes problematic on NUMA
machines because it may happen that a thread is running on a distant
node.

Ideally, this is something that a management application would resolve,
but we are not anywhere close to that. Firstly, memory allocation
happens before the monitor socket is even available. But okay, that's
what -preconfig is for. But then the problem is that 'object-add' would
not return until all memory is preallocated.

Long story short, the management application has no way of learning the
TIDs of allocator threads, so it can't make them run NUMA aware.

But what we can do is to propagate the 'host-nodes' attribute of the
MemoryBackend object down to where preallocation threads are created
and set their affinity according to the attribute.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2074000
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 backends/hostmem.c     |  6 ++--
 hw/virtio/virtio-mem.c |  2 +-
 include/qemu/osdep.h   |  2 ++
 util/meson.build       |  2 +-
 util/oslib-posix.c     | 74 ++++++++++++++++++++++++++++++++++++++++--
 util/oslib-win32.c     |  2 ++
 6 files changed, 82 insertions(+), 6 deletions(-)