diff mbox series

hostmem: Honour multiple preferred nodes if possible

Message ID ba02465fc48807eddea9ad646fca7cc92f929ae7.1670603308.git.mprivozn@redhat.com (mailing list archive)
State New, archived
Headers show
Series hostmem: Honour multiple preferred nodes if possible | expand

Commit Message

Michal Privoznik Dec. 9, 2022, 4:29 p.m. UTC
If a memory-backend is configured with mode
HOST_MEM_POLICY_PREFERRED then
host_memory_backend_memory_complete() calls mbind() as:

  mbind(..., MPOL_PREFERRED, nodemask, ...);

Here, 'nodemask' is a bitmap of host NUMA nodes and corresponds
to the .host-nodes attribute. Therefore, there can be multiple
nodes specified. However, the documentation to MPOL_PREFERRED
says:

  MPOL_PREFERRED
    This mode sets the preferred node for allocation. ...
    If nodemask specifies more than one node ID, the first node
    in the mask will be selected as the preferred node.

Therefore, only the first node is honoured and the rest is
silently ignored. Well, with recent changes to the kernel and
numactl we can do better.

Firstly, new mode - MPOL_PREFERRED_MANY - was introduced to
kernel (v5.15-rc1~107^2~21) which now accepts multiple NUMA
nodes.

Then, numa_has_preferred_many() API was introduced to numactl
(v2.0.15~26) allowing applications to query kernel support.

Wiring this all together, we can pass MPOL_PREFERRED_MANY to the
mbind() call instead and stop ignoring multiple nodes, silently.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 backends/hostmem.c | 28 ++++++++++++++++++++++++++++
 meson.build        |  5 +++++
 2 files changed, 33 insertions(+)

Comments

David Hildenbrand Dec. 14, 2022, 10:54 a.m. UTC | #1
On 09.12.22 17:29, Michal Privoznik wrote:
> If a memory-backend is configured with mode
> HOST_MEM_POLICY_PREFERRED then
> host_memory_backend_memory_complete() calls mbind() as:
> 
>    mbind(..., MPOL_PREFERRED, nodemask, ...);
> 
> Here, 'nodemask' is a bitmap of host NUMA nodes and corresponds
> to the .host-nodes attribute. Therefore, there can be multiple
> nodes specified. However, the documentation to MPOL_PREFERRED
> says:
> 
>    MPOL_PREFERRED
>      This mode sets the preferred node for allocation. ...
>      If nodemask specifies more than one node ID, the first node
>      in the mask will be selected as the preferred node.
> 
> Therefore, only the first node is honoured and the rest is

s/honoured/honored/

> silently ignored. Well, with recent changes to the kernel and
> numactl we can do better.

Yeah, I think this "silent" ignoring was part of the design for both, 
the kernel feature and the QEMU feature. Yes, we can do better now.

> 
> Firstly, new mode - MPOL_PREFERRED_MANY - was introduced to
> kernel (v5.15-rc1~107^2~21) which now accepts multiple NUMA
> nodes.

Maybe give the kernel commit instead


"The Linux kernel added in v5.15 via commit cfcaa66f8032 ("") support 
for MPOL_PREFERRED_MANY, which accepts multiple preferred NUMA nodes 
instead.

> 
> Then, numa_has_preferred_many() API was introduced to numactl
> (v2.0.15~26) allowing applications to query kernel support.
> 
> Wiring this all together, we can pass MPOL_PREFERRED_MANY to the
> mbind() call instead and stop ignoring multiple nodes, silently.
> 
> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> ---
>   backends/hostmem.c | 28 ++++++++++++++++++++++++++++
>   meson.build        |  5 +++++
>   2 files changed, 33 insertions(+)
> 
> diff --git a/backends/hostmem.c b/backends/hostmem.c
> index 8640294c10..e0d6cb6c8a 100644
> --- a/backends/hostmem.c
> +++ b/backends/hostmem.c
> @@ -23,10 +23,22 @@
>   
>   #ifdef CONFIG_NUMA
>   #include <numaif.h>
> +#include <numa.h>
>   QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_DEFAULT != MPOL_DEFAULT);
> +/*
> + * HOST_MEM_POLICY_PREFERRED may some time also by MPOL_PREFERRED_MANY, see
> + * below.

I failed to parse that sentence. :)

"
HOST_MEM_POLICY_PREFERRED may either transalte to MPOL_PREFERRED or 
MPOL_PREFERRED_MANY, see comments further below.
"

?

> + */
>   QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_PREFERRED != MPOL_PREFERRED);
>   QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_BIND != MPOL_BIND);
>   QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_INTERLEAVE != MPOL_INTERLEAVE);
> +
> +/*
> + * -1 for uninitialized,
> + *  0 for MPOL_PREFERRED_MANY unsupported,
> + *  1 for supported.
> + */
> +static int has_preferred_many = -1;

maybe "has_mpol_preferred_many" or "supports_mpol_preferred_many" instead.

... but why do we have to cache that value at all? ...

>   #endif
>   
>   char *
> @@ -346,6 +358,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
>            * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
>            * this doesn't catch hugepage case. */
>           unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
> +        int mode = backend->policy;
>   
>           /* check for invalid host-nodes and policies and give more verbose
>            * error messages than mbind(). */
> @@ -369,6 +382,21 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
>                  BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
>           assert(maxnode <= MAX_NODES);
>   
> +#ifdef HAVE_NUMA_SET_PREFERRED_MANY
> +        if (has_preferred_many < 0) {
> +            /* Check, whether kernel supports MPOL_PREFERRED_MANY. */
> +            has_preferred_many = numa_has_preferred_many() > 0 ? 1 : 0;
> +        }
> +
> +        if (mode == MPOL_PREFERRED && has_preferred_many > 0) {
> +            /*
> +             * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
> +             * silently picks the first node.
> +             */
> +            mode = MPOL_PREFERRED_MANY;
> +        }
> +#endif

... maybe simply not cache the value?


#ifdef HAVE_NUMA_SET_PREFERRED_MANY
	if (mode == MPOL_PREFERRED && numa_has_preferred_many() > 0) {
		/* ... */
		mode = MPOL_PREFERRED_MANY;
	}
#endif

> +
>           if (maxnode &&
>               mbind(ptr, sz, backend->policy, backend->host_nodes, maxnode + 1,
>                     flags)) {
> diff --git a/meson.build b/meson.build
> index 5c6b5a1c75..ebbff7a8ea 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1858,6 +1858,11 @@ config_host_data.set('CONFIG_LINUX_AIO', libaio.found())
>   config_host_data.set('CONFIG_LINUX_IO_URING', linux_io_uring.found())
>   config_host_data.set('CONFIG_LIBPMEM', libpmem.found())
>   config_host_data.set('CONFIG_NUMA', numa.found())
> +if numa.found()
> +  config_host_data.set('HAVE_NUMA_SET_PREFERRED_MANY',
> +                       cc.has_function('numa_set_preferred_many',
> +                                       dependencies: numa))

You're using numa_has_preferred_many(), so better check for that and use 
HAVE_NUMA_HAS_PREFERRED_MANY?

Thanks!
diff mbox series

Patch

diff --git a/backends/hostmem.c b/backends/hostmem.c
index 8640294c10..e0d6cb6c8a 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -23,10 +23,22 @@ 
 
 #ifdef CONFIG_NUMA
 #include <numaif.h>
+#include <numa.h>
 QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_DEFAULT != MPOL_DEFAULT);
+/*
+ * HOST_MEM_POLICY_PREFERRED may some time also by MPOL_PREFERRED_MANY, see
+ * below.
+ */
 QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_PREFERRED != MPOL_PREFERRED);
 QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_BIND != MPOL_BIND);
 QEMU_BUILD_BUG_ON(HOST_MEM_POLICY_INTERLEAVE != MPOL_INTERLEAVE);
+
+/*
+ * -1 for uninitialized,
+ *  0 for MPOL_PREFERRED_MANY unsupported,
+ *  1 for supported.
+ */
+static int has_preferred_many = -1;
 #endif
 
 char *
@@ -346,6 +358,7 @@  host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
          * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
          * this doesn't catch hugepage case. */
         unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
+        int mode = backend->policy;
 
         /* check for invalid host-nodes and policies and give more verbose
          * error messages than mbind(). */
@@ -369,6 +382,21 @@  host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
                BITS_TO_LONGS(MAX_NODES + 1) * sizeof(unsigned long));
         assert(maxnode <= MAX_NODES);
 
+#ifdef HAVE_NUMA_SET_PREFERRED_MANY
+        if (has_preferred_many < 0) {
+            /* Check, whether kernel supports MPOL_PREFERRED_MANY. */
+            has_preferred_many = numa_has_preferred_many() > 0 ? 1 : 0;
+        }
+
+        if (mode == MPOL_PREFERRED && has_preferred_many > 0) {
+            /*
+             * Replace with MPOL_PREFERRED_MANY otherwise the mbind() below
+             * silently picks the first node.
+             */
+            mode = MPOL_PREFERRED_MANY;
+        }
+#endif
+
         if (maxnode &&
             mbind(ptr, sz, backend->policy, backend->host_nodes, maxnode + 1,
                   flags)) {
diff --git a/meson.build b/meson.build
index 5c6b5a1c75..ebbff7a8ea 100644
--- a/meson.build
+++ b/meson.build
@@ -1858,6 +1858,11 @@  config_host_data.set('CONFIG_LINUX_AIO', libaio.found())
 config_host_data.set('CONFIG_LINUX_IO_URING', linux_io_uring.found())
 config_host_data.set('CONFIG_LIBPMEM', libpmem.found())
 config_host_data.set('CONFIG_NUMA', numa.found())
+if numa.found()
+  config_host_data.set('HAVE_NUMA_SET_PREFERRED_MANY',
+                       cc.has_function('numa_set_preferred_many',
+                                       dependencies: numa))
+endif
 config_host_data.set('CONFIG_OPENGL', opengl.found())
 config_host_data.set('CONFIG_PROFILER', get_option('profiler'))
 config_host_data.set('CONFIG_RBD', rbd.found())