diff mbox series

[V3,01/16] machine: anon-alloc option

Message ID 1730468875-249970-2-git-send-email-steven.sistare@oracle.com (mailing list archive)
State New
Headers show
Series Live update: cpr-transfer | expand

Commit Message

Steven Sistare Nov. 1, 2024, 1:47 p.m. UTC
Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
on the value of the anon-alloc machine property.  This option applies to
memory allocated as a side effect of creating various devices. It does
not apply to memory-backend-objects, whether explicitly specified on
the command line, or implicitly created by the -m command line option.

The memfd option is intended to support new migration modes, in which the
memory region can be transferred in place to a new QEMU process, by sending
the memfd file descriptor to the process.  Memory contents are preserved,
and if the mode also transfers device descriptors, then pages that are
locked in memory for DMA remain locked.  This behavior is a pre-requisite
for supporting vfio, vdpa, and iommufd devices with the new modes.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/core/machine.c   | 19 +++++++++++++++++++
 include/hw/boards.h |  1 +
 qapi/machine.json   | 14 ++++++++++++++
 qemu-options.hx     | 11 +++++++++++
 system/physmem.c    | 35 +++++++++++++++++++++++++++++++++++
 system/trace-events |  3 +++
 6 files changed, 83 insertions(+)

Comments

Peter Xu Nov. 1, 2024, 2:06 p.m. UTC | #1
On Fri, Nov 01, 2024 at 06:47:40AM -0700, Steve Sistare wrote:
> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>                  qemu_mutex_unlock_ramlist();
>                  return;
>              }
> +
> +        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
> +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
> +                                        TYPE_MEMORY_BACKEND)) {

Steve,

I think I'll postpone a few days on reading the whole series for other
things.. as I think this will miss 9.2 anyway (but we can see whether we
can still get it in early 10.0).

Said that, I want to mention this early that there was concern raised on
using block->mr->parent_obj.parent here to detect mem backends, which can
be error prone.  Please have a look at the discussion in v2 for that.

https://lore.kernel.org/qemu-devel/Zv7C7MeVP2X8bEJU@x1n/
David Hildenbrand Nov. 4, 2024, 10:39 a.m. UTC | #2
On 01.11.24 14:47, Steve Sistare wrote:
> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> on the value of the anon-alloc machine property.  This option applies to
> memory allocated as a side effect of creating various devices. It does
> not apply to memory-backend-objects, whether explicitly specified on
> the command line, or implicitly created by the -m command line option.
> 
> The memfd option is intended to support new migration modes, in which the
> memory region can be transferred in place to a new QEMU process, by sending
> the memfd file descriptor to the process.  Memory contents are preserved,
> and if the mode also transfers device descriptors, then pages that are
> locked in memory for DMA remain locked.  This behavior is a pre-requisite
> for supporting vfio, vdpa, and iommufd devices with the new modes.

A more portable, non-Linux specific variant of this will be using shm,
similar to backends/hostmem-shm.c.

Likely we should be using that instead of memfd, or try hiding the
details. See below.

[...]

> @@ -69,6 +70,8 @@
>   
>   #include "qemu/pmem.h"
>   
> +#include "qapi/qapi-types-migration.h"
> +#include "migration/options.h"
>   #include "migration/vmstate.h"
>   
>   #include "qemu/range.h"
> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>                   qemu_mutex_unlock_ramlist();
>                   return;
>               }
> +
> +        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
> +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
> +                                        TYPE_MEMORY_BACKEND)) {

This looks a bit and hackish, and I don't think ram_block_add() is the right
place where this should be. It should likely happen in the caller.

We already do have two ways of allocating "shared anonymous memory":

(1) memory-backend-ram,share=on
(2) memory-backend-shm

(2) gives us an fd as it uses shm_open(), (1) doesn't give us an fd as it
uses MAP_ANON|MAP_SHARED. (1) is really only a corner case use case [1].

[there is also Linux specific memfd, which gives us more flexibility with
hugetlb etc, but for the purpose here shm should likely be sufficient?]

So why not make (1) behave like (2) and move that handling into
qemu_ram_alloc_internal(), from where we can easily enable it using a
new RMA_SHARED flag? So as a first step, something like:

 From 4b7b760c6e54cf05addca6728edc19adbec1588a Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Mon, 4 Nov 2024 11:29:22 +0100
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  backends/hostmem-shm.c | 56 ++++----------------------------
  system/physmem.c       | 73 ++++++++++++++++++++++++++++++++++++++++--
  2 files changed, 76 insertions(+), 53 deletions(-)

diff --git a/backends/hostmem-shm.c b/backends/hostmem-shm.c
index 374edc3db8..0f33b35e9c 100644
--- a/backends/hostmem-shm.c
+++ b/backends/hostmem-shm.c
@@ -25,11 +25,8 @@ struct HostMemoryBackendShm {
  static bool
  shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
  {
-    g_autoptr(GString) shm_name = g_string_new(NULL);
-    g_autofree char *backend_name = NULL;
+    g_autofree char *name = NULL;
      uint32_t ram_flags;
-    int fd, oflag;
-    mode_t mode;
  
      if (!backend->size) {
          error_setg(errp, "can't create shm backend with size 0");
@@ -41,54 +38,13 @@ shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
          return false;
      }
  
-    /*
-     * Let's use `mode = 0` because we don't want other processes to open our
-     * memory unless we share the file descriptor with them.
-     */
-    mode = 0;
-    oflag = O_RDWR | O_CREAT | O_EXCL;
-    backend_name = host_memory_backend_get_name(backend);
-
-    /*
-     * Some operating systems allow creating anonymous POSIX shared memory
-     * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
-     * defined by POSIX, so let's create a unique name.
-     *
-     * From Linux's shm_open(3) man-page:
-     *   For  portable  use,  a shared  memory  object should be identified
-     *   by a name of the form /somename;"
-     */
-    g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(),
-                    backend_name);
-
-    fd = shm_open(shm_name->str, oflag, mode);
-    if (fd < 0) {
-        error_setg_errno(errp, errno,
-                         "failed to create POSIX shared memory");
-        return false;
-    }
-
-    /*
-     * We have the file descriptor, so we no longer need to expose the
-     * POSIX shared memory object. However it will remain allocated as long as
-     * there are file descriptors pointing to it.
-     */
-    shm_unlink(shm_name->str);
-
-    if (ftruncate(fd, backend->size) == -1) {
-        error_setg_errno(errp, errno,
-                         "failed to resize POSIX shared memory to %" PRIu64,
-                         backend->size);
-        close(fd);
-        return false;
-    }
-
+    /* Let's do the same as memory-backend-ram,share=on would do. */
+    name = host_memory_backend_get_name(backend);
      ram_flags = RAM_SHARED;
      ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
-
-    return memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend),
-                                              backend_name, backend->size,
-                                              ram_flags, fd, 0, errp);
+    return memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend),
+                                                  name, backend->size,
+                                                  ram_flags, errp);
  }
  
  static void
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..4d331b3828 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2057,6 +2057,59 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
  }
  #endif
  
+static int qemu_shm_alloc(size_t size, Error **errp)
+{
+    g_autoptr(GString) shm_name = g_string_new(NULL);
+    int fd, oflag, cur_sequence;
+    static int sequence;
+    mode_t mode;
+
+    cur_sequence = qatomic_fetch_inc(&sequence);
+
+    /*
+     * Let's use `mode = 0` because we don't want other processes to open our
+     * memory unless we share the file descriptor with them.
+     */
+    mode = 0;
+    oflag = O_RDWR | O_CREAT | O_EXCL;
+
+    /*
+     * Some operating systems allow creating anonymous POSIX shared memory
+     * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
+     * defined by POSIX, so let's create a unique name.
+     *
+     * From Linux's shm_open(3) man-page:
+     *   For  portable  use,  a shared  memory  object should be identified
+     *   by a name of the form /somename;"
+     */
+    g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%d", getpid(),
+                    cur_sequence);
+
+    fd = shm_open(shm_name->str, oflag, mode);
+    if (fd < 0) {
+        error_setg_errno(errp, errno,
+                         "failed to create POSIX shared memory");
+        return false;
+    }
+
+    /*
+     * We have the file descriptor, so we no longer need to expose the
+     * POSIX shared memory object. However it will remain allocated as long as
+     * there are file descriptors pointing to it.
+     */
+    shm_unlink(shm_name->str);
+
+    if (ftruncate(fd, size) == -1) {
+        error_setg_errno(errp, errno,
+                         "failed to resize POSIX shared memory to %" PRIu64,
+                         size);
+        close(fd);
+        return false;
+    }
+
+    return fd;
+}
+
  static
  RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
                                    void (*resized)(const char*,
@@ -2084,12 +2137,26 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
      new_block->used_length = size;
      new_block->max_length = max_size;
      assert(max_size >= size);
-    new_block->fd = -1;
+
      new_block->guest_memfd = -1;
      new_block->page_size = qemu_real_host_page_size();
-    new_block->host = host;
      new_block->flags = ram_flags;
-    ram_block_add(new_block, &local_err);
+        new_block->host = host;
+
+    if ((ram_flags & RAM_PREALLOC) || !(ram_flags & RAM_SHARED)) {
+        new_block->fd = -1;
+    } else {
+        /*
+         * We want anonymous shared memory, similar to MAP_SHARED|MAP_ANON; but
+         * some users want the fd. So let's allocate shm explicitly, which will
+         * give us the fd.
+         */
+        assert(!host);
+        new_block->fd = qemu_shm_alloc(new_block->max_length, &local_err);
+    }
+    if (!local_err) {
+        ram_block_add(new_block, &local_err);
+    }
      if (local_err) {
          g_free(new_block);
          error_propagate(errp, local_err);
David Hildenbrand Nov. 4, 2024, 10:45 a.m. UTC | #3
On 04.11.24 11:39, David Hildenbrand wrote:
> On 01.11.24 14:47, Steve Sistare wrote:
>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>> on the value of the anon-alloc machine property.  This option applies to
>> memory allocated as a side effect of creating various devices. It does
>> not apply to memory-backend-objects, whether explicitly specified on
>> the command line, or implicitly created by the -m command line option.
>>
>> The memfd option is intended to support new migration modes, in which the
>> memory region can be transferred in place to a new QEMU process, by sending
>> the memfd file descriptor to the process.  Memory contents are preserved,
>> and if the mode also transfers device descriptors, then pages that are
>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>> for supporting vfio, vdpa, and iommufd devices with the new modes.
> 
> A more portable, non-Linux specific variant of this will be using shm,
> similar to backends/hostmem-shm.c.
> 
> Likely we should be using that instead of memfd, or try hiding the
> details. See below.
> 
> [...]
> 
>> @@ -69,6 +70,8 @@
>>    
>>    #include "qemu/pmem.h"
>>    
>> +#include "qapi/qapi-types-migration.h"
>> +#include "migration/options.h"
>>    #include "migration/vmstate.h"
>>    
>>    #include "qemu/range.h"
>> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>                    qemu_mutex_unlock_ramlist();
>>                    return;
>>                }
>> +
>> +        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
>> +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
>> +                                        TYPE_MEMORY_BACKEND)) {
> 
> This looks a bit and hackish, and I don't think ram_block_add() is the right
> place where this should be. It should likely happen in the caller.
> 
> We already do have two ways of allocating "shared anonymous memory":
> 
> (1) memory-backend-ram,share=on
> (2) memory-backend-shm
> 
> (2) gives us an fd as it uses shm_open(), (1) doesn't give us an fd as it
> uses MAP_ANON|MAP_SHARED. (1) is really only a corner case use case [1].
> 
> [there is also Linux specific memfd, which gives us more flexibility with
> hugetlb etc, but for the purpose here shm should likely be sufficient?]
> 
> So why not make (1) behave like (2) and move that handling into
> qemu_ram_alloc_internal(), from where we can easily enable it using a
> new RMA_SHARED flag? So as a first step, something like:
> 
>   From 4b7b760c6e54cf05addca6728edc19adbec1588a Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Mon, 4 Nov 2024 11:29:22 +0100
> Subject: [PATCH] tmp
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>    backends/hostmem-shm.c | 56 ++++----------------------------
>    system/physmem.c       | 73 ++++++++++++++++++++++++++++++++++++++++--
>    2 files changed, 76 insertions(+), 53 deletions(-)
> 
> diff --git a/backends/hostmem-shm.c b/backends/hostmem-shm.c
> index 374edc3db8..0f33b35e9c 100644
> --- a/backends/hostmem-shm.c
> +++ b/backends/hostmem-shm.c
> @@ -25,11 +25,8 @@ struct HostMemoryBackendShm {
>    static bool
>    shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>    {
> -    g_autoptr(GString) shm_name = g_string_new(NULL);
> -    g_autofree char *backend_name = NULL;
> +    g_autofree char *name = NULL;
>        uint32_t ram_flags;
> -    int fd, oflag;
> -    mode_t mode;
>    
>        if (!backend->size) {
>            error_setg(errp, "can't create shm backend with size 0");
> @@ -41,54 +38,13 @@ shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>            return false;
>        }
>    
> -    /*
> -     * Let's use `mode = 0` because we don't want other processes to open our
> -     * memory unless we share the file descriptor with them.
> -     */
> -    mode = 0;
> -    oflag = O_RDWR | O_CREAT | O_EXCL;
> -    backend_name = host_memory_backend_get_name(backend);
> -
> -    /*
> -     * Some operating systems allow creating anonymous POSIX shared memory
> -     * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
> -     * defined by POSIX, so let's create a unique name.
> -     *
> -     * From Linux's shm_open(3) man-page:
> -     *   For  portable  use,  a shared  memory  object should be identified
> -     *   by a name of the form /somename;"
> -     */
> -    g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(),
> -                    backend_name);
> -
> -    fd = shm_open(shm_name->str, oflag, mode);
> -    if (fd < 0) {
> -        error_setg_errno(errp, errno,
> -                         "failed to create POSIX shared memory");
> -        return false;
> -    }
> -
> -    /*
> -     * We have the file descriptor, so we no longer need to expose the
> -     * POSIX shared memory object. However it will remain allocated as long as
> -     * there are file descriptors pointing to it.
> -     */
> -    shm_unlink(shm_name->str);
> -
> -    if (ftruncate(fd, backend->size) == -1) {
> -        error_setg_errno(errp, errno,
> -                         "failed to resize POSIX shared memory to %" PRIu64,
> -                         backend->size);
> -        close(fd);
> -        return false;
> -    }
> -
> +    /* Let's do the same as memory-backend-ram,share=on would do. */
> +    name = host_memory_backend_get_name(backend);
>        ram_flags = RAM_SHARED;
>        ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
> -
> -    return memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend),
> -                                              backend_name, backend->size,
> -                                              ram_flags, fd, 0, errp);
> +    return memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend),
> +                                                  name, backend->size,
> +                                                  ram_flags, errp);
>    }
>    
>    static void
> diff --git a/system/physmem.c b/system/physmem.c
> index dc1db3a384..4d331b3828 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2057,6 +2057,59 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
>    }
>    #endif
>    
> +static int qemu_shm_alloc(size_t size, Error **errp)
> +{
> +    g_autoptr(GString) shm_name = g_string_new(NULL);
> +    int fd, oflag, cur_sequence;
> +    static int sequence;
> +    mode_t mode;
> +
> +    cur_sequence = qatomic_fetch_inc(&sequence);
> +
> +    /*
> +     * Let's use `mode = 0` because we don't want other processes to open our
> +     * memory unless we share the file descriptor with them.
> +     */
> +    mode = 0;
> +    oflag = O_RDWR | O_CREAT | O_EXCL;
> +
> +    /*
> +     * Some operating systems allow creating anonymous POSIX shared memory
> +     * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
> +     * defined by POSIX, so let's create a unique name.
> +     *
> +     * From Linux's shm_open(3) man-page:
> +     *   For  portable  use,  a shared  memory  object should be identified
> +     *   by a name of the form /somename;"
> +     */
> +    g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%d", getpid(),
> +                    cur_sequence);
> +
> +    fd = shm_open(shm_name->str, oflag, mode);
> +    if (fd < 0) {
> +        error_setg_errno(errp, errno,
> +                         "failed to create POSIX shared memory");
> +        return false;
> +    }
> +
> +    /*
> +     * We have the file descriptor, so we no longer need to expose the
> +     * POSIX shared memory object. However it will remain allocated as long as
> +     * there are file descriptors pointing to it.
> +     */
> +    shm_unlink(shm_name->str);
> +
> +    if (ftruncate(fd, size) == -1) {
> +        error_setg_errno(errp, errno,
> +                         "failed to resize POSIX shared memory to %" PRIu64,
> +                         size);
> +        close(fd);
> +        return false;
> +    }
> +
> +    return fd;
> +}
> +
>    static
>    RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
>                                      void (*resized)(const char*,
> @@ -2084,12 +2137,26 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
>        new_block->used_length = size;
>        new_block->max_length = max_size;
>        assert(max_size >= size);
> -    new_block->fd = -1;
> +
>        new_block->guest_memfd = -1;
>        new_block->page_size = qemu_real_host_page_size();
> -    new_block->host = host;
>        new_block->flags = ram_flags;
> -    ram_block_add(new_block, &local_err);
> +        new_block->host = host;
> +
> +    if ((ram_flags & RAM_PREALLOC) || !(ram_flags & RAM_SHARED)) {
> +        new_block->fd = -1;
> +    } else {
> +        /*
> +         * We want anonymous shared memory, similar to MAP_SHARED|MAP_ANON; but
> +         * some users want the fd. So let's allocate shm explicitly, which will
> +         * give us the fd.
> +         */
> +        assert(!host);
> +        new_block->fd = qemu_shm_alloc(new_block->max_length, &local_err);

Note: completely untested.

Likely a file_ram_alloc() call is missing here.
Steven Sistare Nov. 4, 2024, 5:38 p.m. UTC | #4
On 11/4/2024 5:39 AM, David Hildenbrand wrote:
> On 01.11.24 14:47, Steve Sistare wrote:
>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>> on the value of the anon-alloc machine property.  This option applies to
>> memory allocated as a side effect of creating various devices. It does
>> not apply to memory-backend-objects, whether explicitly specified on
>> the command line, or implicitly created by the -m command line option.
>>
>> The memfd option is intended to support new migration modes, in which the
>> memory region can be transferred in place to a new QEMU process, by sending
>> the memfd file descriptor to the process.  Memory contents are preserved,
>> and if the mode also transfers device descriptors, then pages that are
>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>> for supporting vfio, vdpa, and iommufd devices with the new modes.
> 
> A more portable, non-Linux specific variant of this will be using shm,
> similar to backends/hostmem-shm.c.
> 
> Likely we should be using that instead of memfd, or try hiding the
> details. See below.

For this series I would prefer to use memfd and hide the details.  It's a
concise (and well tested) solution albeit linux only.  The code you supply
for posix shm would be a good follow on patch to support other unices.

We could drop
   -machine anon-alloc=mmap|memfd
and define
   -machine anon-shared

as you suggest at the end.

> [...]
> 
>> @@ -69,6 +70,8 @@
>>   #include "qemu/pmem.h"
>> +#include "qapi/qapi-types-migration.h"
>> +#include "migration/options.h"
>>   #include "migration/vmstate.h"
>>   #include "qemu/range.h"
>> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>                   qemu_mutex_unlock_ramlist();
>>                   return;
>>               }
>> +
>> +        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
>> +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
>> +                                        TYPE_MEMORY_BACKEND)) {
> 
> This looks a bit and hackish, 

OK. I can revert parts of the previous version which passed in RAM_SHARED from
various call sites to request anonymous shared memory:
   https://lore.kernel.org/qemu-devel/1714406135-451286-18-git-send-email-steven.sistare@oracle.com
See the various sites that do
     uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
Does that look OK to you?

> and I don't think ram_block_add() is the right
> place where this should be. It should likely happen in the caller.

I agree, but I received no feedback when I proposed to refactor allocation
vs ram_block_add, so I dropped them to simplify the live update review.
These refactor but do not change functionality.  Are you OK with something
like this?  Is this overkill?

https://lore.kernel.org/qemu-devel/1714406135-451286-1-git-send-email-steven.sistare@oracle.com/
   physmem: ram_block_create
   physmem: hoist guest_memfd creation
   physmem: hoist host memory allocation

> We already do have two ways of allocating "shared anonymous memory":
> 
> (1) memory-backend-ram,share=on
> (2) memory-backend-shm
> 
> (2) gives us an fd as it uses shm_open(), (1) doesn't give us an fd as it
> uses MAP_ANON|MAP_SHARED. (1) is really only a corner case use case [1].
> 
> [there is also Linux specific memfd, which gives us more flexibility with
> hugetlb etc, but for the purpose here shm should likely be sufficient?]
> 
> So why not make (1) behave like (2) and move that handling into
> qemu_ram_alloc_internal(), from where we can easily enable it using a
> new RMA_SHARED flag? So as a first step, something like:

I prefer that, and an earlier version did so, but only if anon-alloc==memfd.

To be clear, do you propose that memory-backend-ram,shared=on unconditionally
mmap fd-based shared memory, independently of the setting of anon-alloc?
And drop the MAP_ANON|MAP_SHARED possibility?

Or, do you propose that for memory-backend-ram,shared=on:
   if anon-shared
     mmap fd
   else
      MAP_ANON|MAP_SHARED

The former is simpler from a user documentation point of view, but either
works for me.  I could stop listing memory-backend-ram  as an exception in
the docs, which currently state:
   #     Memory-backend objects must have the share=on attribute, but
   #     memory-backend-epc and memory-backend-ram are not supported.

[...]
>
> Then, you only need a machine option to say "anon-shared", to make all
> anonymous memory sharable between processes. All it would do is setting
> the RAM_SHARED flag in qemu_ram_alloc_internal() when reasonable
> (!(ram_flags & RAM_PREALLOC)).
> 
> To handle "memory-backend-ram,share=off", can we find a way to bail out if
> memory-backend-ram,share=off was used while the machine option "anon-shared"
> would be active? 

In later patches I install migration blockers for various conditions, including
when a ram block does not support CPR.

> Or just document that the "anon-shared" will win?

IMO a blocker is sufficient.

I think you are also suggesting that an unadorned "memory-backend-ram"
specification (with implicit shared=off), plus anon-shared, should cause
shared anon to be allocated:
   "you only need a machine option to say "anon-shared", to make all anonymous
    memory sharable"

I did that previously, and Peter objected, saying the explicit anon-shared
should not override the implicit shared=off.

But perhaps I misinterpret someone.

- Steve

> Alternatives might be a RAM_PFORCE_PRIVATE flag, set by the memory backend.
> 
> 
> With above change, we could drop the "bool share" flag from,
> qemu_anon_ram_alloc(), as it would be unused.
> 
> [1] https://patchwork.kernel.org/project/qemu-devel/patch/20180201205511.19198-2-marcel@redhat.com/
David Hildenbrand Nov. 4, 2024, 7:51 p.m. UTC | #5
On 04.11.24 18:38, Steven Sistare wrote:
> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>> On 01.11.24 14:47, Steve Sistare wrote:
>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>> on the value of the anon-alloc machine property.  This option applies to
>>> memory allocated as a side effect of creating various devices. It does
>>> not apply to memory-backend-objects, whether explicitly specified on
>>> the command line, or implicitly created by the -m command line option.
>>>
>>> The memfd option is intended to support new migration modes, in which the
>>> memory region can be transferred in place to a new QEMU process, by sending
>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>> and if the mode also transfers device descriptors, then pages that are
>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>
>> A more portable, non-Linux specific variant of this will be using shm,
>> similar to backends/hostmem-shm.c.
>>
>> Likely we should be using that instead of memfd, or try hiding the
>> details. See below.
> 
> For this series I would prefer to use memfd and hide the details.  It's a
> concise (and well tested) solution albeit linux only.  The code you supply
> for posix shm would be a good follow on patch to support other unices.

Unless there is reason to use memfd we should start with the more 
generic POSIX variant that is available even on systems without memfd. 
Factoring stuff out as I drafted does look quite compelling.

I can help with the rework, and send it out separately, so you can focus 
on the "machine toggle" as part of this series.

Of course, if we find out we need the memfd internally instead under 
Linux for whatever reason later, we can use that instead.

But IIUC, the main selling point for memfd are additional features 
(hugetlb, memory sealing) that you aren't even using.

> 
> We could drop
>     -machine anon-alloc=mmap|memfd

Right, the memfd here might be an unnecessary detail. Especially, 
because all things here are mmap'ed ... so I don't quite like this 
interface :)


> and define
>     -machine anon-shared
> 
> as you suggest at the end.

Likely we should remove the "anon" part from the interface as well ... 
hmm ...

We want to instruct QEMU: "all guest RAM that is not explicitly 
specified should be sharable with another process".

"internal-ram-share=true"

"default-ram-share=true"

Maybe we can come up with something even better. But getting rid of the 
"anon" would be great. I think I prefer the latter (below).

> 
>> [...]
>>
>>> @@ -69,6 +70,8 @@
>>>    #include "qemu/pmem.h"
>>> +#include "qapi/qapi-types-migration.h"
>>> +#include "migration/options.h"
>>>    #include "migration/vmstate.h"
>>>    #include "qemu/range.h"
>>> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>                    qemu_mutex_unlock_ramlist();
>>>                    return;
>>>                }
>>> +
>>> +        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
>>> +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
>>> +                                        TYPE_MEMORY_BACKEND)) {
>>
>> This looks a bit and hackish,
> 
> OK. I can revert parts of the previous version which passed in RAM_SHARED from
> various call sites to request anonymous shared memory:
>     https://lore.kernel.org/qemu-devel/1714406135-451286-18-git-send-email-steven.sistare@oracle.com
> See the various sites that do
>       uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
> Does that look OK to you?

That's one option, or we just handle it in qemu_ram_alloc_internal() as 
I drafted below.

Or we simply have another interface to allocate this "default RAM that 
does not come from a memory backend and is subject to the global 
toggle", and hide that detail (conditionally setting RAM_SHARED) in there.

> 
>> and I don't think ram_block_add() is the right
>> place where this should be. It should likely happen in the caller.
> 
> I agree, but I received no feedback when I proposed to refactor allocation
> vs ram_block_add, so I dropped them to simplify the live update review.
> These refactor but do not change functionality.  Are you OK with something
> like this?  Is this overkill?
> 

Probably overkill :)

> https://lore.kernel.org/qemu-devel/1714406135-451286-1-git-send-email-steven.sistare@oracle.com/
>     physmem: ram_block_create
>     physmem: hoist guest_memfd creation
>     physmem: hoist host memory allocation
> 
>> We already do have two ways of allocating "shared anonymous memory":
>>
>> (1) memory-backend-ram,share=on
>> (2) memory-backend-shm
>>
>> (2) gives us an fd as it uses shm_open(), (1) doesn't give us an fd as it
>> uses MAP_ANON|MAP_SHARED. (1) is really only a corner case use case [1].
>>
>> [there is also Linux specific memfd, which gives us more flexibility with
>> hugetlb etc, but for the purpose here shm should likely be sufficient?]
>>
>> So why not make (1) behave like (2) and move that handling into
>> qemu_ram_alloc_internal(), from where we can easily enable it using a
>> new RMA_SHARED flag? So as a first step, something like:
> 
> I prefer that, and an earlier version did so, but only if anon-alloc==memfd.
> 
> To be clear, do you propose that memory-backend-ram,shared=on unconditionally
> mmap fd-based shared memory, independently of the setting of anon-alloc?
> And drop the MAP_ANON|MAP_SHARED possibility?

Yes, as done in my draft patch. MAP_ANON|MAP_SHARED was primarily a hack 
to make this RDMA thingy fly that could not deal with anonymous memory, 
and we didn't have

memory-backend-ram,share=on was added via 
06329ccecfa022494fdba288b3ab5bcb8dff4159 before
memory-backend-memfd was added via dbb9e0f40d7d561dcffcf7e41ac9f6a5ec90e5b5

Both ended up in the same QEMU release.

So likely memory-backend-ram,share=on could just have used 
memory-backend-memfd if it would have been available earlier, at least 
on Linux ...


But, it looks like the use case for memory-backend-ram,share=on does no 
longer even exist, because

commit 1dfd42c4264bbf47415a9e73f0d6b4e6a7cd7393
Author: Philippe Mathieu-Daudé <philmd@linaro.org>
Date:   Thu Mar 28 12:53:00 2024 +0100

     hw/rdma: Remove deprecated pvrdma device and rdmacm-mux helper

Removed that mremap() from the code base.

So we can change how memory-backend-ram,share=on is implemented 
internally, as long as it keeps on working in a similar way.

If "memory-backend-ram,share=on" will be the same as specifying 
"default-ram-share=on", that would actually be quite nice ... no need to 
bring in memfds at all as long as we only want some memory with an fd to 
share with other processes.

> 
> Or, do you propose that for memory-backend-ram,shared=on:
>     if anon-shared
>       mmap fd
>     else
>        MAP_ANON|MAP_SHARED


My suggestion would be to unconditionally use shm (which is available 
even on kernels without memfd support; if required use memfd first and 
fallback to shmem) as in the patch I drafted.

> 
> The former is simpler from a user documentation point of view, but either
> works for me.  I could stop listing memory-backend-ram  as an exception in
> the docs, which currently state:
>     #     Memory-backend objects must have the share=on attribute, but
>     #     memory-backend-epc and memory-backend-ram are not supported.

Likely that was never updated to document the memory-backend-ram use case.

> 
> [...]
>>
>> Then, you only need a machine option to say "anon-shared", to make all
>> anonymous memory sharable between processes. All it would do is setting
>> the RAM_SHARED flag in qemu_ram_alloc_internal() when reasonable
>> (!(ram_flags & RAM_PREALLOC)).
>>
>> To handle "memory-backend-ram,share=off", can we find a way to bail out if
>> memory-backend-ram,share=off was used while the machine option "anon-shared"
>> would be active?
> 
> In later patches I install migration blockers for various conditions, including
> when a ram block does not support CPR.

Good!

> 
>> Or just document that the "anon-shared" will win?
> 
> IMO a blocker is sufficient.
> 
> I think you are also suggesting that an unadorned "memory-backend-ram"
> specification (with implicit shared=off), plus anon-shared, should cause
> shared anon to be allocated:
>     "you only need a machine option to say "anon-shared", to make all anonymous
>      memory sharable"
> 
> I did that previously, and Peter objected, saying the explicit anon-shared
> should not override the implicit shared=off.

Yes, it's better if we can detect that somehow. There should be easy 
ways to make that work, so I wouldn't worry about that.
Peter Xu Nov. 4, 2024, 8:14 p.m. UTC | #6
On Mon, Nov 04, 2024 at 08:51:56PM +0100, David Hildenbrand wrote:
> > I did that previously, and Peter objected, saying the explicit anon-shared
> > should not override the implicit shared=off.
> 
> Yes, it's better if we can detect that somehow. There should be easy ways to
> make that work, so I wouldn't worry about that.

I still think whenever the caller is capable of passing RAM_SHARED flag
into ram_block_add(), we should always respect what's passed in from the
caller, no matter it's a shared / private request.

A major issue with that idea is when !RAM_SHARED, we don't easily know
whether it's because the caller explicitly chose share=off, or if it's
simply the type of ramblock that we don't care (e.g. ROMs).

So besides what I used to suggest on monitoring the four call sites that
can involve those, another simpler (and now I see it even cleaner..) way
could be that we explicitly introduce RAM_PRIVATE.  It means whenever we
have things like below in the callers:

    int ram_flags = shared ? RAM_SHARED : 0;

We start to switch to:

    int ram_flags = shared ? RAM_SHARED : RAM_PRIVATE;

Then in ram_block_add():

    if (!(ram_flags & (RAM_SHARED | RAM_PRIVATE))) {
        // these are the target ramblocks for cpr's whatever new machine
        // flag..
    }
David Hildenbrand Nov. 4, 2024, 8:15 p.m. UTC | #7
On 04.11.24 20:51, David Hildenbrand wrote:
> On 04.11.24 18:38, Steven Sistare wrote:
>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>> on the value of the anon-alloc machine property.  This option applies to
>>>> memory allocated as a side effect of creating various devices. It does
>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>> the command line, or implicitly created by the -m command line option.
>>>>
>>>> The memfd option is intended to support new migration modes, in which the
>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>> and if the mode also transfers device descriptors, then pages that are
>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>
>>> A more portable, non-Linux specific variant of this will be using shm,
>>> similar to backends/hostmem-shm.c.
>>>
>>> Likely we should be using that instead of memfd, or try hiding the
>>> details. See below.
>>
>> For this series I would prefer to use memfd and hide the details.  It's a
>> concise (and well tested) solution albeit linux only.  The code you supply
>> for posix shm would be a good follow on patch to support other unices.
> 
> Unless there is reason to use memfd we should start with the more
> generic POSIX variant that is available even on systems without memfd.
> Factoring stuff out as I drafted does look quite compelling.
> 
> I can help with the rework, and send it out separately, so you can focus
> on the "machine toggle" as part of this series.
> 
> Of course, if we find out we need the memfd internally instead under
> Linux for whatever reason later, we can use that instead.
> 
> But IIUC, the main selling point for memfd are additional features
> (hugetlb, memory sealing) that you aren't even using.

FWIW, I'm looking into some details, and one difference is that 
shmem_open() under Linux (glibc) seems to go to /dev/shmem and 
memfd/SYSV go to the internal tmpfs mount. There is not a big 
difference, but there can be some difference (e.g., sizing of the 
/dev/shm mount).

Regarding memory-backend-ram,share=on, I assume we can use memfd if 
available, but then fallback to shm_open().

I'm hoping we can find a way where it just all is rather intuitive, like

"default-ram-share=on": behave for internal RAM just like 
"memory-backend-ram,share=on"

"memory-backend-ram,share=on": use whatever mechanism we have to give us 
"anonymous" memory that can be shared using an fd with another process.


Thoughts?
David Hildenbrand Nov. 4, 2024, 8:17 p.m. UTC | #8
On 04.11.24 21:14, Peter Xu wrote:
> On Mon, Nov 04, 2024 at 08:51:56PM +0100, David Hildenbrand wrote:
>>> I did that previously, and Peter objected, saying the explicit anon-shared
>>> should not override the implicit shared=off.
>>
>> Yes, it's better if we can detect that somehow. There should be easy ways to
>> make that work, so I wouldn't worry about that.
> 
> I still think whenever the caller is capable of passing RAM_SHARED flag
> into ram_block_add(), we should always respect what's passed in from the
> caller, no matter it's a shared / private request.
> 
> A major issue with that idea is when !RAM_SHARED, we don't easily know
> whether it's because the caller explicitly chose share=off, or if it's
> simply the type of ramblock that we don't care (e.g. ROMs).

Agreed. But note that I think ram_block_add() is one level to deep to 
handle that.

> 
> So besides what I used to suggest on monitoring the four call sites that
> can involve those, another simpler (and now I see it even cleaner..) way
> could be that we explicitly introduce RAM_PRIVATE.  It means whenever we
> have things like below in the callers:

Yeah, I called it RAM_FORCE_PRIVATE, but it's the same idea. And simply 
calling it RAM_PRIVATE makes sense.
Peter Xu Nov. 4, 2024, 8:41 p.m. UTC | #9
On Mon, Nov 04, 2024 at 09:17:30PM +0100, David Hildenbrand wrote:
> On 04.11.24 21:14, Peter Xu wrote:
> > On Mon, Nov 04, 2024 at 08:51:56PM +0100, David Hildenbrand wrote:
> > > > I did that previously, and Peter objected, saying the explicit anon-shared
> > > > should not override the implicit shared=off.
> > > 
> > > Yes, it's better if we can detect that somehow. There should be easy ways to
> > > make that work, so I wouldn't worry about that.
> > 
> > I still think whenever the caller is capable of passing RAM_SHARED flag
> > into ram_block_add(), we should always respect what's passed in from the
> > caller, no matter it's a shared / private request.
> > 
> > A major issue with that idea is when !RAM_SHARED, we don't easily know
> > whether it's because the caller explicitly chose share=off, or if it's
> > simply the type of ramblock that we don't care (e.g. ROMs).
> 
> Agreed. But note that I think ram_block_add() is one level to deep to handle
> that.

True.. qemu_ram_alloc_internal() is probably the best place to do the trick.

Thanks,
Steven Sistare Nov. 4, 2024, 8:56 p.m. UTC | #10
On 11/4/2024 3:15 PM, David Hildenbrand wrote:
> On 04.11.24 20:51, David Hildenbrand wrote:
>> On 04.11.24 18:38, Steven Sistare wrote:
>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>> memory allocated as a side effect of creating various devices. It does
>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>> the command line, or implicitly created by the -m command line option.
>>>>>
>>>>> The memfd option is intended to support new migration modes, in which the
>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>
>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>> similar to backends/hostmem-shm.c.
>>>>
>>>> Likely we should be using that instead of memfd, or try hiding the
>>>> details. See below.
>>>
>>> For this series I would prefer to use memfd and hide the details.  It's a
>>> concise (and well tested) solution albeit linux only.  The code you supply
>>> for posix shm would be a good follow on patch to support other unices.
>>
>> Unless there is reason to use memfd we should start with the more
>> generic POSIX variant that is available even on systems without memfd.
>> Factoring stuff out as I drafted does look quite compelling.
>>
>> I can help with the rework, and send it out separately, so you can focus
>> on the "machine toggle" as part of this series.
>>
>> Of course, if we find out we need the memfd internally instead under
>> Linux for whatever reason later, we can use that instead.
>>
>> But IIUC, the main selling point for memfd are additional features
>> (hugetlb, memory sealing) that you aren't even using.
> 
> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).

Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
To do so using shm_open requires configuration on the mount.  One step harder to use.

This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
if memory-backend-ram has hogged all the memory.

> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().

Yes, and if that is a good idea, then the same should be done for internal RAM
-- memfd if available and fallback to shm_open.

> I'm hoping we can find a way where it just all is rather intuitive, like
> 
> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
> 
> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
> 
> Thoughts?

Agreed, though I thought I had already landed at the intuitive specification in my patch.
The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
of options and words to describe them.

- Steve
David Hildenbrand Nov. 4, 2024, 9:36 p.m. UTC | #11
On 04.11.24 21:56, Steven Sistare wrote:
> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>> On 04.11.24 20:51, David Hildenbrand wrote:
>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>
>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>
>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>> similar to backends/hostmem-shm.c.
>>>>>
>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>> details. See below.
>>>>
>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>> for posix shm would be a good follow on patch to support other unices.
>>>
>>> Unless there is reason to use memfd we should start with the more
>>> generic POSIX variant that is available even on systems without memfd.
>>> Factoring stuff out as I drafted does look quite compelling.
>>>
>>> I can help with the rework, and send it out separately, so you can focus
>>> on the "machine toggle" as part of this series.
>>>
>>> Of course, if we find out we need the memfd internally instead under
>>> Linux for whatever reason later, we can use that instead.
>>>
>>> But IIUC, the main selling point for memfd are additional features
>>> (hugetlb, memory sealing) that you aren't even using.
>>
>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
> 
> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
> To do so using shm_open requires configuration on the mount.  One step harder to use.

Yes.

> 
> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
> if memory-backend-ram has hogged all the memory.
> 
>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
> 
> Yes, and if that is a good idea, then the same should be done for internal RAM
> -- memfd if available and fallback to shm_open.

Yes.

> 
>> I'm hoping we can find a way where it just all is rather intuitive, like
>>
>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>
>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>
>> Thoughts?
> 
> Agreed, though I thought I had already landed at the intuitive specification in my patch.
> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
> of options and words to describe them.

Well, yes, and making it all a bit more consistent and the "machine 
option" behave just like "memory-backend-ram,share=on".
Steven Sistare Nov. 6, 2024, 8:12 p.m. UTC | #12
On 11/4/2024 4:36 PM, David Hildenbrand wrote:
> On 04.11.24 21:56, Steven Sistare wrote:
>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>
>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>
>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>> similar to backends/hostmem-shm.c.
>>>>>>
>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>> details. See below.
>>>>>
>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>
>>>> Unless there is reason to use memfd we should start with the more
>>>> generic POSIX variant that is available even on systems without memfd.
>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>
>>>> I can help with the rework, and send it out separately, so you can focus
>>>> on the "machine toggle" as part of this series.
>>>>
>>>> Of course, if we find out we need the memfd internally instead under
>>>> Linux for whatever reason later, we can use that instead.
>>>>
>>>> But IIUC, the main selling point for memfd are additional features
>>>> (hugetlb, memory sealing) that you aren't even using.
>>>
>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>
>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>> To do so using shm_open requires configuration on the mount.  One step harder to use.
> 
> Yes.
> 
>>
>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>> if memory-backend-ram has hogged all the memory.
>>
>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>
>> Yes, and if that is a good idea, then the same should be done for internal RAM
>> -- memfd if available and fallback to shm_open.
> 
> Yes.
> 
>>
>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>
>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>
>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>
>>> Thoughts?
>>
>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>> of options and words to describe them.
> 
> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".

Hi David and Peter,

I have implemented and tested the following, for both qemu_memfd_create
and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
for simplicity.

Any comments before I submit a complete patch?

----
qemu-options.hx:
     ``aux-ram-share=on|off``
         Allocate auxiliary guest RAM as an anonymous file that is
         shareable with an external process.  This option applies to
         memory allocated as a side effect of creating various devices.
         It does not apply to memory-backend-objects, whether explicitly
         specified on the command line, or implicitly created by the -m
         command line option.

         Some migration modes require aux-ram-share=on.

qapi/migration.json:
     @cpr-transfer:
          ...
          Memory-backend objects must have the share=on attribute, but
          memory-backend-epc is not supported.  The VM must be started
          with the '-machine aux-ram-share=on' option.

Define RAM_PRIVATE

Define qemu_shm_alloc(), from David's tmp patch

ram_backend_memory_alloc()
     ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
     memory_region_init_ram_flags_nomigrate(ram_flags)

qemu_ram_alloc_internal()
     ...
     if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
         new_block->flags |= RAM_SHARED;

     if (!host && (new_block->flags & RAM_SHARED)) {
         qemu_ram_alloc_shared(new_block);
     } else
         new_block->fd = -1;
         new_block->host = host;
     }
     ram_block_add(new_block);

qemu_ram_alloc_shared()
     if qemu_memfd_check()
         new_block->fd = qemu_memfd_create()
     else
         new_block->fd = qemu_shm_alloc()
     new_block->host = file_ram_alloc(new_block->fd)
----

- Steve
Peter Xu Nov. 6, 2024, 8:41 p.m. UTC | #13
On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
> 
> 
> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
> > On 04.11.24 21:56, Steven Sistare wrote:
> > > On 11/4/2024 3:15 PM, David Hildenbrand wrote:
> > > > On 04.11.24 20:51, David Hildenbrand wrote:
> > > > > On 04.11.24 18:38, Steven Sistare wrote:
> > > > > > On 11/4/2024 5:39 AM, David Hildenbrand wrote:
> > > > > > > On 01.11.24 14:47, Steve Sistare wrote:
> > > > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > > > > > > on the value of the anon-alloc machine property.  This option applies to
> > > > > > > > memory allocated as a side effect of creating various devices. It does
> > > > > > > > not apply to memory-backend-objects, whether explicitly specified on
> > > > > > > > the command line, or implicitly created by the -m command line option.
> > > > > > > > 
> > > > > > > > The memfd option is intended to support new migration modes, in which the
> > > > > > > > memory region can be transferred in place to a new QEMU process, by sending
> > > > > > > > the memfd file descriptor to the process.  Memory contents are preserved,
> > > > > > > > and if the mode also transfers device descriptors, then pages that are
> > > > > > > > locked in memory for DMA remain locked.  This behavior is a pre-requisite
> > > > > > > > for supporting vfio, vdpa, and iommufd devices with the new modes.
> > > > > > > 
> > > > > > > A more portable, non-Linux specific variant of this will be using shm,
> > > > > > > similar to backends/hostmem-shm.c.
> > > > > > > 
> > > > > > > Likely we should be using that instead of memfd, or try hiding the
> > > > > > > details. See below.
> > > > > > 
> > > > > > For this series I would prefer to use memfd and hide the details.  It's a
> > > > > > concise (and well tested) solution albeit linux only.  The code you supply
> > > > > > for posix shm would be a good follow on patch to support other unices.
> > > > > 
> > > > > Unless there is reason to use memfd we should start with the more
> > > > > generic POSIX variant that is available even on systems without memfd.
> > > > > Factoring stuff out as I drafted does look quite compelling.
> > > > > 
> > > > > I can help with the rework, and send it out separately, so you can focus
> > > > > on the "machine toggle" as part of this series.
> > > > > 
> > > > > Of course, if we find out we need the memfd internally instead under
> > > > > Linux for whatever reason later, we can use that instead.
> > > > > 
> > > > > But IIUC, the main selling point for memfd are additional features
> > > > > (hugetlb, memory sealing) that you aren't even using.
> > > > 
> > > > FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
> > > 
> > > Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
> > > To do so using shm_open requires configuration on the mount.  One step harder to use.
> > 
> > Yes.
> > 
> > > 
> > > This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
> > > if memory-backend-ram has hogged all the memory.
> > > 
> > > > Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
> > > 
> > > Yes, and if that is a good idea, then the same should be done for internal RAM
> > > -- memfd if available and fallback to shm_open.
> > 
> > Yes.
> > 
> > > 
> > > > I'm hoping we can find a way where it just all is rather intuitive, like
> > > > 
> > > > "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
> > > > 
> > > > "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
> > > > 
> > > > Thoughts?
> > > 
> > > Agreed, though I thought I had already landed at the intuitive specification in my patch.
> > > The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
> > > controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
> > > of options and words to describe them.
> > 
> > Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
> 
> Hi David and Peter,
> 
> I have implemented and tested the following, for both qemu_memfd_create
> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
> for simplicity.

I'm ok with either shm or memfd, as this feature only applies to Linux
anyway.  I'll leave that part to you and David to decide.

> 
> Any comments before I submit a complete patch?
> 
> ----
> qemu-options.hx:
>     ``aux-ram-share=on|off``
>         Allocate auxiliary guest RAM as an anonymous file that is
>         shareable with an external process.  This option applies to
>         memory allocated as a side effect of creating various devices.
>         It does not apply to memory-backend-objects, whether explicitly
>         specified on the command line, or implicitly created by the -m
>         command line option.
> 
>         Some migration modes require aux-ram-share=on.
> 
> qapi/migration.json:
>     @cpr-transfer:
>          ...
>          Memory-backend objects must have the share=on attribute, but
>          memory-backend-epc is not supported.  The VM must be started
>          with the '-machine aux-ram-share=on' option.
> 
> Define RAM_PRIVATE
> 
> Define qemu_shm_alloc(), from David's tmp patch
> 
> ram_backend_memory_alloc()
>     ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>     memory_region_init_ram_flags_nomigrate(ram_flags)

Looks all good until here.

> 
> qemu_ram_alloc_internal()
>     ...
>     if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)

Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
that's equal to RAM_PREALLOC.  Meanwhile I slightly prefer we don't touch
anything if SHARED|PRIVATE is set.  All combined, it could be:

    if (!(ram_flags & (RAM_PREALLOC | RAM_PRIVATE | RAM_SHARED))) {
        // ramblock to be allocated, with no share/private request, aka,
        // aux memory chunk...
    }

>         new_block->flags |= RAM_SHARED;
> 
>     if (!host && (new_block->flags & RAM_SHARED)) {
>         qemu_ram_alloc_shared(new_block);

I'm not sure whether this needs its own helper.  Should we fallback to
ram_block_add() below, just like a RAM_SHARED?

IIUC, we could start to create RAM_SHARED in qemu_anon_ram_alloc() and
always cache the fd (even if we don't do that before)?

>     } else
>         new_block->fd = -1;
>         new_block->host = host;
>     }
>     ram_block_add(new_block);
> 
> qemu_ram_alloc_shared()
>     if qemu_memfd_check()
>         new_block->fd = qemu_memfd_create()
>     else
>         new_block->fd = qemu_shm_alloc()
>     new_block->host = file_ram_alloc(new_block->fd)
> ----
> 
> - Steve
>
Steven Sistare Nov. 6, 2024, 8:59 p.m. UTC | #14
On 11/6/2024 3:41 PM, Peter Xu wrote:
> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>
>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>
>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>
>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>> details. See below.
>>>>>>>
>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>
>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>
>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>> on the "machine toggle" as part of this series.
>>>>>>
>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>
>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>
>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>
>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>
>>> Yes.
>>>
>>>>
>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>> if memory-backend-ram has hogged all the memory.
>>>>
>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>
>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>> -- memfd if available and fallback to shm_open.
>>>
>>> Yes.
>>>
>>>>
>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>
>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>
>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>
>>>>> Thoughts?
>>>>
>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>> of options and words to describe them.
>>>
>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>
>> Hi David and Peter,
>>
>> I have implemented and tested the following, for both qemu_memfd_create
>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>> for simplicity.
> 
> I'm ok with either shm or memfd, as this feature only applies to Linux
> anyway.  I'll leave that part to you and David to decide.
> 
>>
>> Any comments before I submit a complete patch?
>>
>> ----
>> qemu-options.hx:
>>      ``aux-ram-share=on|off``
>>          Allocate auxiliary guest RAM as an anonymous file that is
>>          shareable with an external process.  This option applies to
>>          memory allocated as a side effect of creating various devices.
>>          It does not apply to memory-backend-objects, whether explicitly
>>          specified on the command line, or implicitly created by the -m
>>          command line option.
>>
>>          Some migration modes require aux-ram-share=on.
>>
>> qapi/migration.json:
>>      @cpr-transfer:
>>           ...
>>           Memory-backend objects must have the share=on attribute, but
>>           memory-backend-epc is not supported.  The VM must be started
>>           with the '-machine aux-ram-share=on' option.
>>
>> Define RAM_PRIVATE
>>
>> Define qemu_shm_alloc(), from David's tmp patch
>>
>> ram_backend_memory_alloc()
>>      ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>      memory_region_init_ram_flags_nomigrate(ram_flags)
> 
> Looks all good until here.
> 
>>
>> qemu_ram_alloc_internal()
>>      ...
>>      if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
> 
> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
> that's equal to RAM_PREALLOC.  

IMO testing host is clearer and more future proof, regardless of how flags
are currently used.  If the caller passes host, then we should not allocate
memory here, full stop.

> Meanwhile I slightly prefer we don't touch
> anything if SHARED|PRIVATE is set.  

OK, if SHARED is already set I will not set it again.

> All combined, it could be:
> 
>      if (!(ram_flags & (RAM_PREALLOC | RAM_PRIVATE | RAM_SHARED))) {
>          // ramblock to be allocated, with no share/private request, aka,
>          // aux memory chunk...
>      }
> 
>>          new_block->flags |= RAM_SHARED;
>>
>>      if (!host && (new_block->flags & RAM_SHARED)) {
>>          qemu_ram_alloc_shared(new_block);
> 
> I'm not sure whether this needs its own helper.  

Reserve judgement until you see the full patch.  The helper is a
non-trivial subroutine and IMO it improves readability.  Also the
cpr find/save hooks are confined to the subroutine.

> Should we fallback to
> ram_block_add() below, just like a RAM_SHARED?

I thought we all discussed and agreed that the allocation should be performed
above ram_block_add.  David's suggested patch does it here also.

- Steve

> IIUC, we could start to create RAM_SHARED in qemu_anon_ram_alloc() and
> always cache the fd (even if we don't do that before)?
> 
>>      } else
>>          new_block->fd = -1;
>>          new_block->host = host;
>>      }
>>      ram_block_add(new_block);
>>
>> qemu_ram_alloc_shared()
>>      if qemu_memfd_check()
>>          new_block->fd = qemu_memfd_create()
>>      else
>>          new_block->fd = qemu_shm_alloc()
>>      new_block->host = file_ram_alloc(new_block->fd)
>> ----
>>
>> - Steve
>>
>
Peter Xu Nov. 6, 2024, 9:21 p.m. UTC | #15
On Wed, Nov 06, 2024 at 03:59:23PM -0500, Steven Sistare wrote:
> On 11/6/2024 3:41 PM, Peter Xu wrote:
> > On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
> > > On 11/4/2024 4:36 PM, David Hildenbrand wrote:
> > > > On 04.11.24 21:56, Steven Sistare wrote:
> > > > > On 11/4/2024 3:15 PM, David Hildenbrand wrote:
> > > > > > On 04.11.24 20:51, David Hildenbrand wrote:
> > > > > > > On 04.11.24 18:38, Steven Sistare wrote:
> > > > > > > > On 11/4/2024 5:39 AM, David Hildenbrand wrote:
> > > > > > > > > On 01.11.24 14:47, Steve Sistare wrote:
> > > > > > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > > > > > > > > on the value of the anon-alloc machine property.  This option applies to
> > > > > > > > > > memory allocated as a side effect of creating various devices. It does
> > > > > > > > > > not apply to memory-backend-objects, whether explicitly specified on
> > > > > > > > > > the command line, or implicitly created by the -m command line option.
> > > > > > > > > > 
> > > > > > > > > > The memfd option is intended to support new migration modes, in which the
> > > > > > > > > > memory region can be transferred in place to a new QEMU process, by sending
> > > > > > > > > > the memfd file descriptor to the process.  Memory contents are preserved,
> > > > > > > > > > and if the mode also transfers device descriptors, then pages that are
> > > > > > > > > > locked in memory for DMA remain locked.  This behavior is a pre-requisite
> > > > > > > > > > for supporting vfio, vdpa, and iommufd devices with the new modes.
> > > > > > > > > 
> > > > > > > > > A more portable, non-Linux specific variant of this will be using shm,
> > > > > > > > > similar to backends/hostmem-shm.c.
> > > > > > > > > 
> > > > > > > > > Likely we should be using that instead of memfd, or try hiding the
> > > > > > > > > details. See below.
> > > > > > > > 
> > > > > > > > For this series I would prefer to use memfd and hide the details.  It's a
> > > > > > > > concise (and well tested) solution albeit linux only.  The code you supply
> > > > > > > > for posix shm would be a good follow on patch to support other unices.
> > > > > > > 
> > > > > > > Unless there is reason to use memfd we should start with the more
> > > > > > > generic POSIX variant that is available even on systems without memfd.
> > > > > > > Factoring stuff out as I drafted does look quite compelling.
> > > > > > > 
> > > > > > > I can help with the rework, and send it out separately, so you can focus
> > > > > > > on the "machine toggle" as part of this series.
> > > > > > > 
> > > > > > > Of course, if we find out we need the memfd internally instead under
> > > > > > > Linux for whatever reason later, we can use that instead.
> > > > > > > 
> > > > > > > But IIUC, the main selling point for memfd are additional features
> > > > > > > (hugetlb, memory sealing) that you aren't even using.
> > > > > > 
> > > > > > FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
> > > > > 
> > > > > Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
> > > > > To do so using shm_open requires configuration on the mount.  One step harder to use.
> > > > 
> > > > Yes.
> > > > 
> > > > > 
> > > > > This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
> > > > > if memory-backend-ram has hogged all the memory.
> > > > > 
> > > > > > Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
> > > > > 
> > > > > Yes, and if that is a good idea, then the same should be done for internal RAM
> > > > > -- memfd if available and fallback to shm_open.
> > > > 
> > > > Yes.
> > > > 
> > > > > 
> > > > > > I'm hoping we can find a way where it just all is rather intuitive, like
> > > > > > 
> > > > > > "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
> > > > > > 
> > > > > > "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
> > > > > > 
> > > > > > Thoughts?
> > > > > 
> > > > > Agreed, though I thought I had already landed at the intuitive specification in my patch.
> > > > > The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
> > > > > controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
> > > > > of options and words to describe them.
> > > > 
> > > > Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
> > > 
> > > Hi David and Peter,
> > > 
> > > I have implemented and tested the following, for both qemu_memfd_create
> > > and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
> > > for simplicity.
> > 
> > I'm ok with either shm or memfd, as this feature only applies to Linux
> > anyway.  I'll leave that part to you and David to decide.
> > 
> > > 
> > > Any comments before I submit a complete patch?
> > > 
> > > ----
> > > qemu-options.hx:
> > >      ``aux-ram-share=on|off``
> > >          Allocate auxiliary guest RAM as an anonymous file that is
> > >          shareable with an external process.  This option applies to
> > >          memory allocated as a side effect of creating various devices.
> > >          It does not apply to memory-backend-objects, whether explicitly
> > >          specified on the command line, or implicitly created by the -m
> > >          command line option.
> > > 
> > >          Some migration modes require aux-ram-share=on.
> > > 
> > > qapi/migration.json:
> > >      @cpr-transfer:
> > >           ...
> > >           Memory-backend objects must have the share=on attribute, but
> > >           memory-backend-epc is not supported.  The VM must be started
> > >           with the '-machine aux-ram-share=on' option.
> > > 
> > > Define RAM_PRIVATE
> > > 
> > > Define qemu_shm_alloc(), from David's tmp patch
> > > 
> > > ram_backend_memory_alloc()
> > >      ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
> > >      memory_region_init_ram_flags_nomigrate(ram_flags)
> > 
> > Looks all good until here.
> > 
> > > 
> > > qemu_ram_alloc_internal()
> > >      ...
> > >      if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
> > 
> > Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
> > that's equal to RAM_PREALLOC.
> 
> IMO testing host is clearer and more future proof, regardless of how flags
> are currently used.  If the caller passes host, then we should not allocate
> memory here, full stop.
> 
> > Meanwhile I slightly prefer we don't touch
> > anything if SHARED|PRIVATE is set.
> 
> OK, if SHARED is already set I will not set it again.
> 
> > All combined, it could be:
> > 
> >      if (!(ram_flags & (RAM_PREALLOC | RAM_PRIVATE | RAM_SHARED))) {
> >          // ramblock to be allocated, with no share/private request, aka,
> >          // aux memory chunk...
> >      }
> > 
> > >          new_block->flags |= RAM_SHARED;
> > > 
> > >      if (!host && (new_block->flags & RAM_SHARED)) {
> > >          qemu_ram_alloc_shared(new_block);
> > 
> > I'm not sure whether this needs its own helper.
> 
> Reserve judgement until you see the full patch.  The helper is a
> non-trivial subroutine and IMO it improves readability.  Also the
> cpr find/save hooks are confined to the subroutine.

I thought we can use the same code path to process "aux ramblock" and all
kinds of other RAM_SHARED ramblocks.  IIUC cpr find/save should apply there
too, but maybe I missed something.

> 
> > Should we fallback to
> > ram_block_add() below, just like a RAM_SHARED?
> 
> I thought we all discussed and agreed that the allocation should be performed
> above ram_block_add.  David's suggested patch does it here also.

I was not closely followed all the discussions happened.. so I could have
missed something indeed.

One thing I want to double check is cpr will still make things like below
work, right?

  -object memory-backend-ram,share=on [1]

IIUC with the old code this won't create fd, so to make cpr work (and also
what I was trying to say in the previous email..) is we could silently
start to create memfds for these, which means we need to first teach
qemu_anon_ram_alloc() on creating memfd for RAM_SHARED and cache these fds
(which should hopefully keep the same behavior as before).

Then for aux ramblocks like ROMs, as long as it sets RAM_SHARED properly in
qemu_ram_alloc_internal() (but only when aux-share-mem=on, for sure..),
then the rest code path in ram_block_add() should be the same as when user
specified share=on in [1].

Anyway, if both of you agreed on it, I am happy to wait and read the whole
patch.

Side note: I'll still use a few days for other things, but I'll get back to
read this whole series before next week.. btw, this series does not depend
on precreate phase now, am I right?

> 
> - Steve
> 
> > IIUC, we could start to create RAM_SHARED in qemu_anon_ram_alloc() and
> > always cache the fd (even if we don't do that before)?
> > 
> > >      } else
> > >          new_block->fd = -1;
> > >          new_block->host = host;
> > >      }
> > >      ram_block_add(new_block);
> > > 
> > > qemu_ram_alloc_shared()
> > >      if qemu_memfd_check()
> > >          new_block->fd = qemu_memfd_create()
> > >      else
> > >          new_block->fd = qemu_shm_alloc()
> > >      new_block->host = file_ram_alloc(new_block->fd)
> > > ----
> > > 
> > > - Steve
> > > 
> > 
>
David Hildenbrand Nov. 7, 2024, 1:05 p.m. UTC | #16
On 06.11.24 21:59, Steven Sistare wrote:
> On 11/6/2024 3:41 PM, Peter Xu wrote:
>> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>
>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>
>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>
>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>> details. See below.
>>>>>>>>
>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>
>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>
>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>
>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>
>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>
>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>
>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>> if memory-backend-ram has hogged all the memory.
>>>>>
>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>
>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>> -- memfd if available and fallback to shm_open.
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>
>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>
>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>> of options and words to describe them.
>>>>
>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>
>>> Hi David and Peter,
>>>
>>> I have implemented and tested the following, for both qemu_memfd_create
>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>> for simplicity.
>>
>> I'm ok with either shm or memfd, as this feature only applies to Linux
>> anyway.  I'll leave that part to you and David to decide.
>>
>>>
>>> Any comments before I submit a complete patch?
>>>
>>> ----
>>> qemu-options.hx:
>>>       ``aux-ram-share=on|off``
>>>           Allocate auxiliary guest RAM as an anonymous file that is
>>>           shareable with an external process.  This option applies to
>>>           memory allocated as a side effect of creating various devices.
>>>           It does not apply to memory-backend-objects, whether explicitly
>>>           specified on the command line, or implicitly created by the -m
>>>           command line option.
>>>
>>>           Some migration modes require aux-ram-share=on.
>>>
>>> qapi/migration.json:
>>>       @cpr-transfer:
>>>            ...
>>>            Memory-backend objects must have the share=on attribute, but
>>>            memory-backend-epc is not supported.  The VM must be started
>>>            with the '-machine aux-ram-share=on' option.
>>>
>>> Define RAM_PRIVATE
>>>
>>> Define qemu_shm_alloc(), from David's tmp patch
>>>
>>> ram_backend_memory_alloc()
>>>       ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>       memory_region_init_ram_flags_nomigrate(ram_flags)
>>
>> Looks all good until here.
>>
>>>
>>> qemu_ram_alloc_internal()
>>>       ...
>>>       if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>
>> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
>> that's equal to RAM_PREALLOC.
> 
> IMO testing host is clearer and more future proof, regardless of how flags
> are currently used.  If the caller passes host, then we should not allocate
> memory here, full stop.
> 
>> Meanwhile I slightly prefer we don't touch
>> anything if SHARED|PRIVATE is set.
> 
> OK, if SHARED is already set I will not set it again.

We only have to make sure that stuff like qemu_ram_is_shared() will 
continue working as expected.

What I think we should do:

We should probably assert that nobody passes in SHARED|PRIVATE. And we 
can use PRIVATE only as a parameter to the function, but never actually 
set it on the ramblock.

If someone passes in PRIVATE, we don't include it in block->flags. 
(RMA_SHARED remains cleared)

If someone passes in SHARED, we do set it in block->flags.
If someone passes PRIVATE|SHARED, we assert.

If someone passes in nothing: we set block->flags to SHARED with 
aux_ram_share=on. Otherwise, we do nothing (RAM_SHARED remains cleared)


If that's also what you had in mind, great.
David Hildenbrand Nov. 7, 2024, 1:23 p.m. UTC | #17
On 06.11.24 21:12, Steven Sistare wrote:
> 
> 
> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>> On 04.11.24 21:56, Steven Sistare wrote:
>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>
>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>
>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>
>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>> details. See below.
>>>>>>
>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>
>>>>> Unless there is reason to use memfd we should start with the more
>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>
>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>> on the "machine toggle" as part of this series.
>>>>>
>>>>> Of course, if we find out we need the memfd internally instead under
>>>>> Linux for whatever reason later, we can use that instead.
>>>>>
>>>>> But IIUC, the main selling point for memfd are additional features
>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>
>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>
>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>
>> Yes.
>>
>>>
>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>> if memory-backend-ram has hogged all the memory.
>>>
>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>
>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>> -- memfd if available and fallback to shm_open.
>>
>> Yes.
>>
>>>
>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>
>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>
>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>
>>>> Thoughts?
>>>
>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>> of options and words to describe them.
>>
>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
> 
> Hi David and Peter,
> 
> I have implemented and tested the following, for both qemu_memfd_create
> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
> for simplicity.
> 
> Any comments before I submit a complete patch?
> 
> ----
> qemu-options.hx:
>       ``aux-ram-share=on|off``
>           Allocate auxiliary guest RAM as an anonymous file that is
>           shareable with an external process.  This option applies to
>           memory allocated as a side effect of creating various devices.
>           It does not apply to memory-backend-objects, whether explicitly
>           specified on the command line, or implicitly created by the -m
>           command line option.
> 
>           Some migration modes require aux-ram-share=on.
> 
> qapi/migration.json:
>       @cpr-transfer:
>            ...
>            Memory-backend objects must have the share=on attribute, but
>            memory-backend-epc is not supported.  The VM must be started
>            with the '-machine aux-ram-share=on' option.
> 
> Define RAM_PRIVATE
> 
> Define qemu_shm_alloc(), from David's tmp patch
> 
> ram_backend_memory_alloc()
>       ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>       memory_region_init_ram_flags_nomigrate(ram_flags)
> 
> qemu_ram_alloc_internal()
>       ...
>       if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>           new_block->flags |= RAM_SHARED;
> 
>       if (!host && (new_block->flags & RAM_SHARED)) {
>           qemu_ram_alloc_shared(new_block);
>       } else
>           new_block->fd = -1;
>           new_block->host = host;
>       }
>       ram_block_add(new_block);
> 
> qemu_ram_alloc_shared()
>       if qemu_memfd_check()
>           new_block->fd = qemu_memfd_create()
>       else
>           new_block->fd = qemu_shm_alloc()

Yes, that way "memory-backend-ram,share=on" will just mean "give me the 
best shared memory for RAM to be shared with other processes, I don't 
care about the details", and it will work on Linux kernels even before 
we had memfds.

memory-backend-ram should be available on all architectures, and under 
Windows. qemu_anon_ram_alloc() under Linux just does nothing special, 
not even bail out.

MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I 
can share only with subprocesses", but then, *there are not subprocesses 
for QEMU*. I recall there was a trick to obtain the fd under Linux for 
these regions using /proc/self/fd/, but it's very Linux specific ...

So nobody would *actually* use that shared memory and it was only a hack 
for RDMA. Now we can do better.


We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if 
no shared memory can be created (unavailable), like we do on Windows.

So maybe something like

qemu_ram_alloc_shared()
	fd = -1;

	if (qemu_memfd_avilable()) {
		fd = qemu_memfd_create();
		if (fd < 0)
			... error
	} else if (qemu_shm_available())
		fd = qemu_shm_alloc();
		if (fd < 0)
			... error
	} else {
		/*
		 * Old behavior: try fd-less shared memory. We might
		 * just end up with non-shared memory on Windows, but
		 * nobody can make sure of this shared memory either way
		 * ... should we just use non-shared memory? Or should
		 * we simply bail out? But then, if there is no shared
		 * memory nobody could possible use it.
		 */
		qemu_anon_ram_alloc(share=true)
	}
Steven Sistare Nov. 7, 2024, 2:03 p.m. UTC | #18
On 11/6/2024 4:21 PM, Peter Xu wrote:
> On Wed, Nov 06, 2024 at 03:59:23PM -0500, Steven Sistare wrote:
>> On 11/6/2024 3:41 PM, Peter Xu wrote:
>>> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>
>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>
>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>
>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>> details. See below.
>>>>>>>>>
>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>
>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>
>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>
>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>
>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>
>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>
>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>
>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>
>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>> -- memfd if available and fallback to shm_open.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>
>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>
>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>
>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>> of options and words to describe them.
>>>>>
>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>
>>>> Hi David and Peter,
>>>>
>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>> for simplicity.
>>>
>>> I'm ok with either shm or memfd, as this feature only applies to Linux
>>> anyway.  I'll leave that part to you and David to decide.
>>>
>>>>
>>>> Any comments before I submit a complete patch?
>>>>
>>>> ----
>>>> qemu-options.hx:
>>>>       ``aux-ram-share=on|off``
>>>>           Allocate auxiliary guest RAM as an anonymous file that is
>>>>           shareable with an external process.  This option applies to
>>>>           memory allocated as a side effect of creating various devices.
>>>>           It does not apply to memory-backend-objects, whether explicitly
>>>>           specified on the command line, or implicitly created by the -m
>>>>           command line option.
>>>>
>>>>           Some migration modes require aux-ram-share=on.
>>>>
>>>> qapi/migration.json:
>>>>       @cpr-transfer:
>>>>            ...
>>>>            Memory-backend objects must have the share=on attribute, but
>>>>            memory-backend-epc is not supported.  The VM must be started
>>>>            with the '-machine aux-ram-share=on' option.
>>>>
>>>> Define RAM_PRIVATE
>>>>
>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>
>>>> ram_backend_memory_alloc()
>>>>       ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>       memory_region_init_ram_flags_nomigrate(ram_flags)
>>>
>>> Looks all good until here.
>>>
>>>>
>>>> qemu_ram_alloc_internal()
>>>>       ...
>>>>       if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>
>>> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
>>> that's equal to RAM_PREALLOC.
>>
>> IMO testing host is clearer and more future proof, regardless of how flags
>> are currently used.  If the caller passes host, then we should not allocate
>> memory here, full stop.
>>
>>> Meanwhile I slightly prefer we don't touch
>>> anything if SHARED|PRIVATE is set.
>>
>> OK, if SHARED is already set I will not set it again.
>>
>>> All combined, it could be:
>>>
>>>       if (!(ram_flags & (RAM_PREALLOC | RAM_PRIVATE | RAM_SHARED))) {
>>>           // ramblock to be allocated, with no share/private request, aka,
>>>           // aux memory chunk...
>>>       }
>>>
>>>>           new_block->flags |= RAM_SHARED;
>>>>
>>>>       if (!host && (new_block->flags & RAM_SHARED)) {
>>>>           qemu_ram_alloc_shared(new_block);
>>>
>>> I'm not sure whether this needs its own helper.
>>
>> Reserve judgement until you see the full patch.  The helper is a
>> non-trivial subroutine and IMO it improves readability.  Also the
>> cpr find/save hooks are confined to the subroutine.
> 
> I thought we can use the same code path to process "aux ramblock" and all
> kinds of other RAM_SHARED ramblocks.  IIUC cpr find/save should apply there
> too, but maybe I missed something.

Yes.  qemu_ram_alloc_shared() allocates and handles CPR for aux ramblock and
memory-backend-ram,share=on.

The key change that David suggested, that allows this unification, is to
push the fd creation down from ram_backend_memory_alloc to qemu_ram_alloc_shared.

>>> Should we fallback to
>>> ram_block_add() below, just like a RAM_SHARED?
>>
>> I thought we all discussed and agreed that the allocation should be performed
>> above ram_block_add.  David's suggested patch does it here also.
> 
> I was not closely followed all the discussions happened.. so I could have
> missed something indeed.
> 
> One thing I want to double check is cpr will still make things like below
> work, right?
> 
>    -object memory-backend-ram,share=on [1]

Yes, this new patch makes that work for CPR.
V3 did not support CPR for memory-backend-ram.

> IIUC with the old code this won't create fd, so to make cpr work (and also
> what I was trying to say in the previous email..) is we could silently
> start to create memfds for these, which means we need to first teach
> qemu_anon_ram_alloc() on creating memfd for RAM_SHARED and cache these fds
> (which should hopefully keep the same behavior as before).

Now the fd is created and cached in qemu_ram_alloc_internal -> qemu_ram_alloc_shared.

> Then for aux ramblocks like ROMs, as long as it sets RAM_SHARED properly in
> qemu_ram_alloc_internal() (but only when aux-share-mem=on, for sure..),
> then the rest code path in ram_block_add() should be the same as when user
> specified share=on in [1].
> 
> Anyway, if both of you agreed on it, I am happy to wait and read the whole
> patch.
> 
> Side note: I'll still use a few days for other things, but I'll get back to
> read this whole series before next week.. btw, this series does not depend
> on precreate phase now, am I right?

Correct, this series does not depend on precreate.

- Steve

>>> IIUC, we could start to create RAM_SHARED in qemu_anon_ram_alloc() and
>>> always cache the fd (even if we don't do that before)?
>>>
>>>>       } else
>>>>           new_block->fd = -1;
>>>>           new_block->host = host;
>>>>       }
>>>>       ram_block_add(new_block);
>>>>
>>>> qemu_ram_alloc_shared()
>>>>       if qemu_memfd_check()
>>>>           new_block->fd = qemu_memfd_create()
>>>>       else
>>>>           new_block->fd = qemu_shm_alloc()
>>>>       new_block->host = file_ram_alloc(new_block->fd)
>>>> ----
>>>>
>>>> - Steve
>>>>
>>>
>>
>
Steven Sistare Nov. 7, 2024, 2:04 p.m. UTC | #19
On 11/7/2024 8:05 AM, David Hildenbrand wrote:
> On 06.11.24 21:59, Steven Sistare wrote:
>> On 11/6/2024 3:41 PM, Peter Xu wrote:
>>> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>
>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>
>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>
>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>> details. See below.
>>>>>>>>>
>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>
>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>
>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>
>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>
>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>
>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>
>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>
>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>
>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>> -- memfd if available and fallback to shm_open.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>
>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>
>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>
>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>> of options and words to describe them.
>>>>>
>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>
>>>> Hi David and Peter,
>>>>
>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>> for simplicity.
>>>
>>> I'm ok with either shm or memfd, as this feature only applies to Linux
>>> anyway.  I'll leave that part to you and David to decide.
>>>
>>>>
>>>> Any comments before I submit a complete patch?
>>>>
>>>> ----
>>>> qemu-options.hx:
>>>>       ``aux-ram-share=on|off``
>>>>           Allocate auxiliary guest RAM as an anonymous file that is
>>>>           shareable with an external process.  This option applies to
>>>>           memory allocated as a side effect of creating various devices.
>>>>           It does not apply to memory-backend-objects, whether explicitly
>>>>           specified on the command line, or implicitly created by the -m
>>>>           command line option.
>>>>
>>>>           Some migration modes require aux-ram-share=on.
>>>>
>>>> qapi/migration.json:
>>>>       @cpr-transfer:
>>>>            ...
>>>>            Memory-backend objects must have the share=on attribute, but
>>>>            memory-backend-epc is not supported.  The VM must be started
>>>>            with the '-machine aux-ram-share=on' option.
>>>>
>>>> Define RAM_PRIVATE
>>>>
>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>
>>>> ram_backend_memory_alloc()
>>>>       ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>       memory_region_init_ram_flags_nomigrate(ram_flags)
>>>
>>> Looks all good until here.
>>>
>>>>
>>>> qemu_ram_alloc_internal()
>>>>       ...
>>>>       if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>
>>> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
>>> that's equal to RAM_PREALLOC.
>>
>> IMO testing host is clearer and more future proof, regardless of how flags
>> are currently used.  If the caller passes host, then we should not allocate
>> memory here, full stop.
>>
>>> Meanwhile I slightly prefer we don't touch
>>> anything if SHARED|PRIVATE is set.
>>
>> OK, if SHARED is already set I will not set it again.
> 
> We only have to make sure that stuff like qemu_ram_is_shared() will continue working as expected.
> 
> What I think we should do:
> 
> We should probably assert that nobody passes in SHARED|PRIVATE. And we can use PRIVATE only as a parameter to the function, but never actually set it on the ramblock.
> 
> If someone passes in PRIVATE, we don't include it in block->flags. (RMA_SHARED remains cleared)
> 
> If someone passes in SHARED, we do set it in block->flags.
> If someone passes PRIVATE|SHARED, we assert.
> 
> If someone passes in nothing: we set block->flags to SHARED with aux_ram_share=on. Otherwise, we do nothing (RAM_SHARED remains cleared)
> 
> If that's also what you had in mind, great.

Yes, my patch does that, but it also sets RAM_PRIVATE on the ramblock.
I will undo the latter.

Do you plan to submit the part of your "tmp" patch that refactors
shm_backend_memory_alloc and defines qemu_shm_alloc?  If you want,
I could include it in my series, with your Signed-off-by.

Do you have any comments on my proposed name aux-ram-share, or my proposed text
for qemu-options.hx and migration.json?  Speaking now would prevent more version
churn later.

- Steve
Steven Sistare Nov. 7, 2024, 4:02 p.m. UTC | #20
On 11/7/2024 8:23 AM, David Hildenbrand wrote:
> On 06.11.24 21:12, Steven Sistare wrote:
>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>
>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>
>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>
>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>> details. See below.
>>>>>>>
>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>
>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>
>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>> on the "machine toggle" as part of this series.
>>>>>>
>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>
>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>
>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>
>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>
>>> Yes.
>>>
>>>>
>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>> if memory-backend-ram has hogged all the memory.
>>>>
>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>
>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>> -- memfd if available and fallback to shm_open.
>>>
>>> Yes.
>>>
>>>>
>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>
>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>
>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>
>>>>> Thoughts?
>>>>
>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>> of options and words to describe them.
>>>
>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>
>> Hi David and Peter,
>>
>> I have implemented and tested the following, for both qemu_memfd_create
>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>> for simplicity.
>>
>> Any comments before I submit a complete patch?
>>
>> ----
>> qemu-options.hx:
>>       ``aux-ram-share=on|off``
>>           Allocate auxiliary guest RAM as an anonymous file that is
>>           shareable with an external process.  This option applies to
>>           memory allocated as a side effect of creating various devices.
>>           It does not apply to memory-backend-objects, whether explicitly
>>           specified on the command line, or implicitly created by the -m
>>           command line option.
>>
>>           Some migration modes require aux-ram-share=on.
>>
>> qapi/migration.json:
>>       @cpr-transfer:
>>            ...
>>            Memory-backend objects must have the share=on attribute, but
>>            memory-backend-epc is not supported.  The VM must be started
>>            with the '-machine aux-ram-share=on' option.
>>
>> Define RAM_PRIVATE
>>
>> Define qemu_shm_alloc(), from David's tmp patch
>>
>> ram_backend_memory_alloc()
>>       ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>       memory_region_init_ram_flags_nomigrate(ram_flags)
>>
>> qemu_ram_alloc_internal()
>>       ...
>>       if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>           new_block->flags |= RAM_SHARED;
>>
>>       if (!host && (new_block->flags & RAM_SHARED)) {
>>           qemu_ram_alloc_shared(new_block);
>>       } else
>>           new_block->fd = -1;
>>           new_block->host = host;
>>       }
>>       ram_block_add(new_block);
>>
>> qemu_ram_alloc_shared()
>>       if qemu_memfd_check()
>>           new_block->fd = qemu_memfd_create()
>>       else
>>           new_block->fd = qemu_shm_alloc()
> 
> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
> 
> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
> 
> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
> 
> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
> 
> 
> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
> 
> So maybe something like
> 
> qemu_ram_alloc_shared()
>      fd = -1;
> 
>      if (qemu_memfd_avilable()) {
>          fd = qemu_memfd_create();
>          if (fd < 0)
>              ... error
>      } else if (qemu_shm_available())
>          fd = qemu_shm_alloc();
>          if (fd < 0)
>              ... error
>      } else {
>          /*
>           * Old behavior: try fd-less shared memory. We might
>           * just end up with non-shared memory on Windows, but
>           * nobody can make sure of this shared memory either way
>           * ... should we just use non-shared memory? Or should
>           * we simply bail out? But then, if there is no shared
>           * memory nobody could possible use it.
>           */
>          qemu_anon_ram_alloc(share=true)
>      }

Good catch.  We need that fallback for backwards compatibility.  Even with
no use case for memory-backend-ram,share=on since the demise of rdma, users
may specify it on windows, for no particular reason, but it works, and should
continue to work after this series.  CPR would be blocked.

More generally for backwards compatibility for share=on for no particular reason,
should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
options and more than half of ram is requested, it will fail, whereas current qemu
succeeds using MAP_SHARED|MAP_ANON.

- Steve
David Hildenbrand Nov. 7, 2024, 4:19 p.m. UTC | #21
On 07.11.24 15:04, Steven Sistare wrote:
> On 11/7/2024 8:05 AM, David Hildenbrand wrote:
>> On 06.11.24 21:59, Steven Sistare wrote:
>>> On 11/6/2024 3:41 PM, Peter Xu wrote:
>>>> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>
>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>
>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>
>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>> details. See below.
>>>>>>>>>>
>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>
>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>
>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>
>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>
>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>
>>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>
>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>
>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>
>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>
>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>
>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>
>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>
>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>
>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>> of options and words to describe them.
>>>>>>
>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>
>>>>> Hi David and Peter,
>>>>>
>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>> for simplicity.
>>>>
>>>> I'm ok with either shm or memfd, as this feature only applies to Linux
>>>> anyway.  I'll leave that part to you and David to decide.
>>>>
>>>>>
>>>>> Any comments before I submit a complete patch?
>>>>>
>>>>> ----
>>>>> qemu-options.hx:
>>>>>        ``aux-ram-share=on|off``
>>>>>            Allocate auxiliary guest RAM as an anonymous file that is
>>>>>            shareable with an external process.  This option applies to
>>>>>            memory allocated as a side effect of creating various devices.
>>>>>            It does not apply to memory-backend-objects, whether explicitly
>>>>>            specified on the command line, or implicitly created by the -m
>>>>>            command line option.
>>>>>
>>>>>            Some migration modes require aux-ram-share=on.
>>>>>
>>>>> qapi/migration.json:
>>>>>        @cpr-transfer:
>>>>>             ...
>>>>>             Memory-backend objects must have the share=on attribute, but
>>>>>             memory-backend-epc is not supported.  The VM must be started
>>>>>             with the '-machine aux-ram-share=on' option.
>>>>>
>>>>> Define RAM_PRIVATE
>>>>>
>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>
>>>>> ram_backend_memory_alloc()
>>>>>        ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>        memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>
>>>> Looks all good until here.
>>>>
>>>>>
>>>>> qemu_ram_alloc_internal()
>>>>>        ...
>>>>>        if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>
>>>> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
>>>> that's equal to RAM_PREALLOC.
>>>
>>> IMO testing host is clearer and more future proof, regardless of how flags
>>> are currently used.  If the caller passes host, then we should not allocate
>>> memory here, full stop.
>>>
>>>> Meanwhile I slightly prefer we don't touch
>>>> anything if SHARED|PRIVATE is set.
>>>
>>> OK, if SHARED is already set I will not set it again.
>>
>> We only have to make sure that stuff like qemu_ram_is_shared() will continue working as expected.
>>
>> What I think we should do:
>>
>> We should probably assert that nobody passes in SHARED|PRIVATE. And we can use PRIVATE only as a parameter to the function, but never actually set it on the ramblock.
>>
>> If someone passes in PRIVATE, we don't include it in block->flags. (RMA_SHARED remains cleared)
>>
>> If someone passes in SHARED, we do set it in block->flags.
>> If someone passes PRIVATE|SHARED, we assert.
>>
>> If someone passes in nothing: we set block->flags to SHARED with aux_ram_share=on. Otherwise, we do nothing (RAM_SHARED remains cleared)
>>
>> If that's also what you had in mind, great.
> 
> Yes, my patch does that, but it also sets RAM_PRIVATE on the ramblock.
> I will undo the latter.
> 
> Do you plan to submit the part of your "tmp" patch that refactors
> shm_backend_memory_alloc and defines qemu_shm_alloc?  If you want,
> I could include it in my series, with your Signed-off-by.

My patch went a bit too far I think. And would not work on win32 :)

We should probably start with this:


 From 124920aeda2756faa104bfa6e934c7c20b1fbbe9 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Mon, 4 Nov 2024 11:29:22 +0100
Subject: [PATCH] backends/hostmem-shm: factor out allocation of "anonymous
  shared memory with an fd"

Let's factor it out so we can reuse it.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  backends/hostmem-shm.c | 45 ++++-------------------------------
  include/qemu/osdep.h   |  1 +
  system/physmem.c       |  2 +-
  util/oslib-posix.c     | 53 ++++++++++++++++++++++++++++++++++++++++++
  util/oslib-win32.c     |  6 +++++
  5 files changed, 65 insertions(+), 42 deletions(-)

diff --git a/backends/hostmem-shm.c b/backends/hostmem-shm.c
index 374edc3db8..837b9f1dd4 100644
--- a/backends/hostmem-shm.c
+++ b/backends/hostmem-shm.c
@@ -25,11 +25,9 @@ struct HostMemoryBackendShm {
  static bool
  shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
  {
-    g_autoptr(GString) shm_name = g_string_new(NULL);
      g_autofree char *backend_name = NULL;
      uint32_t ram_flags;
-    int fd, oflag;
-    mode_t mode;
+    int fd;
  
      if (!backend->size) {
          error_setg(errp, "can't create shm backend with size 0");
@@ -41,48 +39,13 @@ shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
          return false;
      }
  
-    /*
-     * Let's use `mode = 0` because we don't want other processes to open our
-     * memory unless we share the file descriptor with them.
-     */
-    mode = 0;
-    oflag = O_RDWR | O_CREAT | O_EXCL;
-    backend_name = host_memory_backend_get_name(backend);
-
-    /*
-     * Some operating systems allow creating anonymous POSIX shared memory
-     * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
-     * defined by POSIX, so let's create a unique name.
-     *
-     * From Linux's shm_open(3) man-page:
-     *   For  portable  use,  a shared  memory  object should be identified
-     *   by a name of the form /somename;"
-     */
-    g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(),
-                    backend_name);
-
-    fd = shm_open(shm_name->str, oflag, mode);
+    fd = qemu_shm_alloc(backend->size, errp);
      if (fd < 0) {
-        error_setg_errno(errp, errno,
-                         "failed to create POSIX shared memory");
-        return false;
-    }
-
-    /*
-     * We have the file descriptor, so we no longer need to expose the
-     * POSIX shared memory object. However it will remain allocated as long as
-     * there are file descriptors pointing to it.
-     */
-    shm_unlink(shm_name->str);
-
-    if (ftruncate(fd, backend->size) == -1) {
-        error_setg_errno(errp, errno,
-                         "failed to resize POSIX shared memory to %" PRIu64,
-                         backend->size);
-        close(fd);
          return false;
      }
  
+    /* Let's do the same as memory-backend-ram,share=on would do. */
+    backend_name = host_memory_backend_get_name(backend);
      ram_flags = RAM_SHARED;
      ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
  
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index fe7c3c5f67..4a24f11174 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -505,6 +505,7 @@ int qemu_daemon(int nochdir, int noclose);
  void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared,
                            bool noreserve);
  void qemu_anon_ram_free(void *ptr, size_t size);
+int qemu_shm_alloc(size_t size, Error **errp);
  
  #ifdef _WIN32
  #define HAVE_CHARDEV_SERIAL 1
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..1b477fec44 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2089,7 +2089,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
      new_block->page_size = qemu_real_host_page_size();
      new_block->host = host;
      new_block->flags = ram_flags;
-    ram_block_add(new_block, &local_err);
+
      if (local_err) {
          g_free(new_block);
          error_propagate(errp, local_err);
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 11b35e48fb..bc5c28b162 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -931,3 +931,56 @@ void qemu_close_all_open_fd(const int *skip, unsigned int nskip)
          qemu_close_all_open_fd_fallback(skip, nskip, open_max);
      }
  }
+
+int qemu_shm_alloc(size_t size, Error **errp)
+{
+    g_autoptr(GString) shm_name = g_string_new(NULL);
+    int fd, oflag, cur_sequence;
+    static int sequence;
+    mode_t mode;
+
+    cur_sequence = qatomic_fetch_inc(&sequence);
+
+    /*
+     * Let's use `mode = 0` because we don't want other processes to open our
+     * memory unless we share the file descriptor with them.
+     */
+    mode = 0;
+    oflag = O_RDWR | O_CREAT | O_EXCL;
+
+    /*
+     * Some operating systems allow creating anonymous POSIX shared memory
+     * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
+     * defined by POSIX, so let's create a unique name.
+     *
+     * From Linux's shm_open(3) man-page:
+     *   For  portable  use,  a shared  memory  object should be identified
+     *   by a name of the form /somename;"
+     */
+    g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%d", getpid(),
+                    cur_sequence);
+
+    fd = shm_open(shm_name->str, oflag, mode);
+    if (fd < 0) {
+        error_setg_errno(errp, errno,
+                         "failed to create POSIX shared memory");
+        return -1;
+    }
+
+    /*
+     * We have the file descriptor, so we no longer need to expose the
+     * POSIX shared memory object. However it will remain allocated as long as
+     * there are file descriptors pointing to it.
+     */
+    shm_unlink(shm_name->str);
+
+    if (ftruncate(fd, size) == -1) {
+        error_setg_errno(errp, errno,
+                         "failed to resize POSIX shared memory to %" PRIu64,
+                         size);
+        close(fd);
+        return -1;
+    }
+
+    return fd;
+}
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index b623830d62..f79a190b78 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -877,3 +877,9 @@ void qemu_win32_map_free(void *ptr, HANDLE h, Error **errp)
      }
      CloseHandle(h);
  }
+
+int qemu_shm_alloc(size_t size, Error **errp)
+{
+    error_setg("Shared memory is not supported.");
+    return -1;
+}
David Hildenbrand Nov. 7, 2024, 4:26 p.m. UTC | #22
On 07.11.24 17:02, Steven Sistare wrote:
> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>> On 06.11.24 21:12, Steven Sistare wrote:
>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>
>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>
>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>
>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>> details. See below.
>>>>>>>>
>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>
>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>
>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>
>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>
>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>
>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>
>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>> if memory-backend-ram has hogged all the memory.
>>>>>
>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>
>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>> -- memfd if available and fallback to shm_open.
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>
>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>
>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>> of options and words to describe them.
>>>>
>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>
>>> Hi David and Peter,
>>>
>>> I have implemented and tested the following, for both qemu_memfd_create
>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>> for simplicity.
>>>
>>> Any comments before I submit a complete patch?
>>>
>>> ----
>>> qemu-options.hx:
>>>        ``aux-ram-share=on|off``
>>>            Allocate auxiliary guest RAM as an anonymous file that is
>>>            shareable with an external process.  This option applies to
>>>            memory allocated as a side effect of creating various devices.
>>>            It does not apply to memory-backend-objects, whether explicitly
>>>            specified on the command line, or implicitly created by the -m
>>>            command line option.
>>>
>>>            Some migration modes require aux-ram-share=on.
>>>
>>> qapi/migration.json:
>>>        @cpr-transfer:
>>>             ...
>>>             Memory-backend objects must have the share=on attribute, but
>>>             memory-backend-epc is not supported.  The VM must be started
>>>             with the '-machine aux-ram-share=on' option.
>>>
>>> Define RAM_PRIVATE
>>>
>>> Define qemu_shm_alloc(), from David's tmp patch
>>>
>>> ram_backend_memory_alloc()
>>>        ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>        memory_region_init_ram_flags_nomigrate(ram_flags)
>>>
>>> qemu_ram_alloc_internal()
>>>        ...
>>>        if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>            new_block->flags |= RAM_SHARED;
>>>
>>>        if (!host && (new_block->flags & RAM_SHARED)) {
>>>            qemu_ram_alloc_shared(new_block);
>>>        } else
>>>            new_block->fd = -1;
>>>            new_block->host = host;
>>>        }
>>>        ram_block_add(new_block);
>>>
>>> qemu_ram_alloc_shared()
>>>        if qemu_memfd_check()
>>>            new_block->fd = qemu_memfd_create()
>>>        else
>>>            new_block->fd = qemu_shm_alloc()
>>
>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>
>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>
>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>
>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>
>>
>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>
>> So maybe something like
>>
>> qemu_ram_alloc_shared()
>>       fd = -1;
>>
>>       if (qemu_memfd_avilable()) {
>>           fd = qemu_memfd_create();
>>           if (fd < 0)
>>               ... error
>>       } else if (qemu_shm_available())
>>           fd = qemu_shm_alloc();
>>           if (fd < 0)
>>               ... error
>>       } else {
>>           /*
>>            * Old behavior: try fd-less shared memory. We might
>>            * just end up with non-shared memory on Windows, but
>>            * nobody can make sure of this shared memory either way
>>            * ... should we just use non-shared memory? Or should
>>            * we simply bail out? But then, if there is no shared
>>            * memory nobody could possible use it.
>>            */
>>           qemu_anon_ram_alloc(share=true)
>>       }
> 
> Good catch.  We need that fallback for backwards compatibility.  Even with
> no use case for memory-backend-ram,share=on since the demise of rdma, users
> may specify it on windows, for no particular reason, but it works, and should
> continue to work after this series.  CPR would be blocked.

Yes, we should keep Windows working in the weird way it is working right 
now.

 > > More generally for backwards compatibility for share=on for no 
particular reason,
> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
> options and more than half of ram is requested, it will fail, whereas current qemu
> succeeds using MAP_SHARED|MAP_ANON.

Only on Linux without memfd, of course. Maybe we should just warn when 
qemu_shm_alloc() fails (and comment that we continue for compat reasons 
only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We 
could implement a fallback to shmget() but ... let's not go down that path.

But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if 
memfd is available and that allocating the memfd failed. Failing to 
allocate a memfd might highlight a bigger problem.
Peter Xu Nov. 7, 2024, 4:32 p.m. UTC | #23
On Thu, Nov 07, 2024 at 09:04:02AM -0500, Steven Sistare wrote:
> On 11/7/2024 8:05 AM, David Hildenbrand wrote:
> > On 06.11.24 21:59, Steven Sistare wrote:
> > > On 11/6/2024 3:41 PM, Peter Xu wrote:
> > > > On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
> > > > > On 11/4/2024 4:36 PM, David Hildenbrand wrote:
> > > > > > On 04.11.24 21:56, Steven Sistare wrote:
> > > > > > > On 11/4/2024 3:15 PM, David Hildenbrand wrote:
> > > > > > > > On 04.11.24 20:51, David Hildenbrand wrote:
> > > > > > > > > On 04.11.24 18:38, Steven Sistare wrote:
> > > > > > > > > > On 11/4/2024 5:39 AM, David Hildenbrand wrote:
> > > > > > > > > > > On 01.11.24 14:47, Steve Sistare wrote:
> > > > > > > > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > > > > > > > > > > on the value of the anon-alloc machine property.  This option applies to
> > > > > > > > > > > > memory allocated as a side effect of creating various devices. It does
> > > > > > > > > > > > not apply to memory-backend-objects, whether explicitly specified on
> > > > > > > > > > > > the command line, or implicitly created by the -m command line option.
> > > > > > > > > > > > 
> > > > > > > > > > > > The memfd option is intended to support new migration modes, in which the
> > > > > > > > > > > > memory region can be transferred in place to a new QEMU process, by sending
> > > > > > > > > > > > the memfd file descriptor to the process.  Memory contents are preserved,
> > > > > > > > > > > > and if the mode also transfers device descriptors, then pages that are
> > > > > > > > > > > > locked in memory for DMA remain locked.  This behavior is a pre-requisite
> > > > > > > > > > > > for supporting vfio, vdpa, and iommufd devices with the new modes.
> > > > > > > > > > > 
> > > > > > > > > > > A more portable, non-Linux specific variant of this will be using shm,
> > > > > > > > > > > similar to backends/hostmem-shm.c.
> > > > > > > > > > > 
> > > > > > > > > > > Likely we should be using that instead of memfd, or try hiding the
> > > > > > > > > > > details. See below.
> > > > > > > > > > 
> > > > > > > > > > For this series I would prefer to use memfd and hide the details.  It's a
> > > > > > > > > > concise (and well tested) solution albeit linux only.  The code you supply
> > > > > > > > > > for posix shm would be a good follow on patch to support other unices.
> > > > > > > > > 
> > > > > > > > > Unless there is reason to use memfd we should start with the more
> > > > > > > > > generic POSIX variant that is available even on systems without memfd.
> > > > > > > > > Factoring stuff out as I drafted does look quite compelling.
> > > > > > > > > 
> > > > > > > > > I can help with the rework, and send it out separately, so you can focus
> > > > > > > > > on the "machine toggle" as part of this series.
> > > > > > > > > 
> > > > > > > > > Of course, if we find out we need the memfd internally instead under
> > > > > > > > > Linux for whatever reason later, we can use that instead.
> > > > > > > > > 
> > > > > > > > > But IIUC, the main selling point for memfd are additional features
> > > > > > > > > (hugetlb, memory sealing) that you aren't even using.
> > > > > > > > 
> > > > > > > > FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
> > > > > > > 
> > > > > > > Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
> > > > > > > To do so using shm_open requires configuration on the mount.  One step harder to use.
> > > > > > 
> > > > > > Yes.
> > > > > > 
> > > > > > > 
> > > > > > > This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
> > > > > > > if memory-backend-ram has hogged all the memory.
> > > > > > > 
> > > > > > > > Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
> > > > > > > 
> > > > > > > Yes, and if that is a good idea, then the same should be done for internal RAM
> > > > > > > -- memfd if available and fallback to shm_open.
> > > > > > 
> > > > > > Yes.
> > > > > > 
> > > > > > > 
> > > > > > > > I'm hoping we can find a way where it just all is rather intuitive, like
> > > > > > > > 
> > > > > > > > "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
> > > > > > > > 
> > > > > > > > "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
> > > > > > > > 
> > > > > > > > Thoughts?
> > > > > > > 
> > > > > > > Agreed, though I thought I had already landed at the intuitive specification in my patch.
> > > > > > > The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
> > > > > > > controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
> > > > > > > of options and words to describe them.
> > > > > > 
> > > > > > Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
> > > > > 
> > > > > Hi David and Peter,
> > > > > 
> > > > > I have implemented and tested the following, for both qemu_memfd_create
> > > > > and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
> > > > > for simplicity.
> > > > 
> > > > I'm ok with either shm or memfd, as this feature only applies to Linux
> > > > anyway.  I'll leave that part to you and David to decide.
> > > > 
> > > > > 
> > > > > Any comments before I submit a complete patch?
> > > > > 
> > > > > ----
> > > > > qemu-options.hx:
> > > > >       ``aux-ram-share=on|off``
> > > > >           Allocate auxiliary guest RAM as an anonymous file that is
> > > > >           shareable with an external process.  This option applies to
> > > > >           memory allocated as a side effect of creating various devices.
> > > > >           It does not apply to memory-backend-objects, whether explicitly
> > > > >           specified on the command line, or implicitly created by the -m
> > > > >           command line option.
> > > > > 
> > > > >           Some migration modes require aux-ram-share=on.
> > > > > 
> > > > > qapi/migration.json:
> > > > >       @cpr-transfer:
> > > > >            ...
> > > > >            Memory-backend objects must have the share=on attribute, but
> > > > >            memory-backend-epc is not supported.  The VM must be started
> > > > >            with the '-machine aux-ram-share=on' option.
> > > > > 
> > > > > Define RAM_PRIVATE
> > > > > 
> > > > > Define qemu_shm_alloc(), from David's tmp patch
> > > > > 
> > > > > ram_backend_memory_alloc()
> > > > >       ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
> > > > >       memory_region_init_ram_flags_nomigrate(ram_flags)
> > > > 
> > > > Looks all good until here.
> > > > 
> > > > > 
> > > > > qemu_ram_alloc_internal()
> > > > >       ...
> > > > >       if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
> > > > 
> > > > Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
> > > > that's equal to RAM_PREALLOC.
> > > 
> > > IMO testing host is clearer and more future proof, regardless of how flags
> > > are currently used.  If the caller passes host, then we should not allocate
> > > memory here, full stop.
> > > 
> > > > Meanwhile I slightly prefer we don't touch
> > > > anything if SHARED|PRIVATE is set.
> > > 
> > > OK, if SHARED is already set I will not set it again.
> > 
> > We only have to make sure that stuff like qemu_ram_is_shared() will continue working as expected.
> > 
> > What I think we should do:
> > 
> > We should probably assert that nobody passes in SHARED|PRIVATE. And we can use PRIVATE only as a parameter to the function, but never actually set it on the ramblock.
> > 
> > If someone passes in PRIVATE, we don't include it in block->flags. (RMA_SHARED remains cleared)
> > 
> > If someone passes in SHARED, we do set it in block->flags.
> > If someone passes PRIVATE|SHARED, we assert.
> > 
> > If someone passes in nothing: we set block->flags to SHARED with aux_ram_share=on. Otherwise, we do nothing (RAM_SHARED remains cleared)
> > 
> > If that's also what you had in mind, great.
> 
> Yes, my patch does that, but it also sets RAM_PRIVATE on the ramblock.
> I will undo the latter.

David: why do we need to drop PRIVATE in ramblock flags?  I thought it was
pretty harmless.  I suppose things like qemu_ram_is_shared() will even keep
working as before?

It looks ok to remove it too, but it adds logics that doesn't seem
necessary to me, so just to double check if I missed something..

> 
> Do you plan to submit the part of your "tmp" patch that refactors
> shm_backend_memory_alloc and defines qemu_shm_alloc?  If you want,
> I could include it in my series, with your Signed-off-by.
> 
> Do you have any comments on my proposed name aux-ram-share, or my proposed text
> for qemu-options.hx and migration.json?  Speaking now would prevent more version
> churn later.
> 
> - Steve
> 
>
David Hildenbrand Nov. 7, 2024, 4:38 p.m. UTC | #24
On 07.11.24 17:32, Peter Xu wrote:
> On Thu, Nov 07, 2024 at 09:04:02AM -0500, Steven Sistare wrote:
>> On 11/7/2024 8:05 AM, David Hildenbrand wrote:
>>> On 06.11.24 21:59, Steven Sistare wrote:
>>>> On 11/6/2024 3:41 PM, Peter Xu wrote:
>>>>> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>>
>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>>
>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>>> details. See below.
>>>>>>>>>>>
>>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>>
>>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>>
>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>>
>>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>>
>>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>>
>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>>
>>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>>
>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>>
>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>>
>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>>
>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>>
>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>>
>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>
>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>>> of options and words to describe them.
>>>>>>>
>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>>
>>>>>> Hi David and Peter,
>>>>>>
>>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>>> for simplicity.
>>>>>
>>>>> I'm ok with either shm or memfd, as this feature only applies to Linux
>>>>> anyway.  I'll leave that part to you and David to decide.
>>>>>
>>>>>>
>>>>>> Any comments before I submit a complete patch?
>>>>>>
>>>>>> ----
>>>>>> qemu-options.hx:
>>>>>>        ``aux-ram-share=on|off``
>>>>>>            Allocate auxiliary guest RAM as an anonymous file that is
>>>>>>            shareable with an external process.  This option applies to
>>>>>>            memory allocated as a side effect of creating various devices.
>>>>>>            It does not apply to memory-backend-objects, whether explicitly
>>>>>>            specified on the command line, or implicitly created by the -m
>>>>>>            command line option.
>>>>>>
>>>>>>            Some migration modes require aux-ram-share=on.
>>>>>>
>>>>>> qapi/migration.json:
>>>>>>        @cpr-transfer:
>>>>>>             ...
>>>>>>             Memory-backend objects must have the share=on attribute, but
>>>>>>             memory-backend-epc is not supported.  The VM must be started
>>>>>>             with the '-machine aux-ram-share=on' option.
>>>>>>
>>>>>> Define RAM_PRIVATE
>>>>>>
>>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>>
>>>>>> ram_backend_memory_alloc()
>>>>>>        ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>>        memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>>
>>>>> Looks all good until here.
>>>>>
>>>>>>
>>>>>> qemu_ram_alloc_internal()
>>>>>>        ...
>>>>>>        if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>>
>>>>> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
>>>>> that's equal to RAM_PREALLOC.
>>>>
>>>> IMO testing host is clearer and more future proof, regardless of how flags
>>>> are currently used.  If the caller passes host, then we should not allocate
>>>> memory here, full stop.
>>>>
>>>>> Meanwhile I slightly prefer we don't touch
>>>>> anything if SHARED|PRIVATE is set.
>>>>
>>>> OK, if SHARED is already set I will not set it again.
>>>
>>> We only have to make sure that stuff like qemu_ram_is_shared() will continue working as expected.
>>>
>>> What I think we should do:
>>>
>>> We should probably assert that nobody passes in SHARED|PRIVATE. And we can use PRIVATE only as a parameter to the function, but never actually set it on the ramblock.
>>>
>>> If someone passes in PRIVATE, we don't include it in block->flags. (RMA_SHARED remains cleared)
>>>
>>> If someone passes in SHARED, we do set it in block->flags.
>>> If someone passes PRIVATE|SHARED, we assert.
>>>
>>> If someone passes in nothing: we set block->flags to SHARED with aux_ram_share=on. Otherwise, we do nothing (RAM_SHARED remains cleared)
>>>
>>> If that's also what you had in mind, great.
>>
>> Yes, my patch does that, but it also sets RAM_PRIVATE on the ramblock.
>> I will undo the latter.
> 
> David: why do we need to drop PRIVATE in ramblock flags?  I thought it was
> pretty harmless.  I suppose things like qemu_ram_is_shared() will even keep
> working as before?
> 
> It looks ok to remove it too, but it adds logics that doesn't seem
> necessary to me, so just to double check if I missed something..

A finished ramblock is only boolean "shared" vs. "not shared/private". A 
single flag (RAM_SHARED) can express that clearly.

Consequently there is less to get wrong when using RAM_PRIVATE only as a 
flag to the creation function (and documenting that!).

To make RAM_PRIVATE consistent we might have to tweak all other RAMBlock 
creation functions to set RAM_PRIVATE in the !RAM_SHARED case, and I 
don't think that is wroth the trouble.
Steven Sistare Nov. 7, 2024, 4:40 p.m. UTC | #25
On 11/7/2024 11:26 AM, David Hildenbrand wrote:
> On 07.11.24 17:02, Steven Sistare wrote:
>> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>>> On 06.11.24 21:12, Steven Sistare wrote:
>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>
>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>
>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>
>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>> details. See below.
>>>>>>>>>
>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>
>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>
>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>
>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>
>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>
>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>
>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>
>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>
>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>> -- memfd if available and fallback to shm_open.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>
>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>
>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>
>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>> of options and words to describe them.
>>>>>
>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>
>>>> Hi David and Peter,
>>>>
>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>> for simplicity.
>>>>
>>>> Any comments before I submit a complete patch?
>>>>
>>>> ----
>>>> qemu-options.hx:
>>>>        ``aux-ram-share=on|off``
>>>>            Allocate auxiliary guest RAM as an anonymous file that is
>>>>            shareable with an external process.  This option applies to
>>>>            memory allocated as a side effect of creating various devices.
>>>>            It does not apply to memory-backend-objects, whether explicitly
>>>>            specified on the command line, or implicitly created by the -m
>>>>            command line option.
>>>>
>>>>            Some migration modes require aux-ram-share=on.
>>>>
>>>> qapi/migration.json:
>>>>        @cpr-transfer:
>>>>             ...
>>>>             Memory-backend objects must have the share=on attribute, but
>>>>             memory-backend-epc is not supported.  The VM must be started
>>>>             with the '-machine aux-ram-share=on' option.
>>>>
>>>> Define RAM_PRIVATE
>>>>
>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>
>>>> ram_backend_memory_alloc()
>>>>        ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>        memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>
>>>> qemu_ram_alloc_internal()
>>>>        ...
>>>>        if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>            new_block->flags |= RAM_SHARED;
>>>>
>>>>        if (!host && (new_block->flags & RAM_SHARED)) {
>>>>            qemu_ram_alloc_shared(new_block);
>>>>        } else
>>>>            new_block->fd = -1;
>>>>            new_block->host = host;
>>>>        }
>>>>        ram_block_add(new_block);
>>>>
>>>> qemu_ram_alloc_shared()
>>>>        if qemu_memfd_check()
>>>>            new_block->fd = qemu_memfd_create()
>>>>        else
>>>>            new_block->fd = qemu_shm_alloc()
>>>
>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>>
>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>>
>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>>
>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>>
>>>
>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>>
>>> So maybe something like
>>>
>>> qemu_ram_alloc_shared()
>>>       fd = -1;
>>>
>>>       if (qemu_memfd_avilable()) {
>>>           fd = qemu_memfd_create();
>>>           if (fd < 0)
>>>               ... error
>>>       } else if (qemu_shm_available())
>>>           fd = qemu_shm_alloc();
>>>           if (fd < 0)
>>>               ... error
>>>       } else {
>>>           /*
>>>            * Old behavior: try fd-less shared memory. We might
>>>            * just end up with non-shared memory on Windows, but
>>>            * nobody can make sure of this shared memory either way
>>>            * ... should we just use non-shared memory? Or should
>>>            * we simply bail out? But then, if there is no shared
>>>            * memory nobody could possible use it.
>>>            */
>>>           qemu_anon_ram_alloc(share=true)
>>>       }
>>
>> Good catch.  We need that fallback for backwards compatibility.  Even with
>> no use case for memory-backend-ram,share=on since the demise of rdma, users
>> may specify it on windows, for no particular reason, but it works, and should
>> continue to work after this series.  CPR would be blocked.
> 
> Yes, we should keep Windows working in the weird way it is working right now.
> 
>  > > More generally for backwards compatibility for share=on for no particular reason,
>> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
>> options and more than half of ram is requested, it will fail, whereas current qemu
>> succeeds using MAP_SHARED|MAP_ANON.
> 
> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
> 
> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.

Agreed on all.

One more opinion from you please, if you will.

RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
set in
   ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal

None of the other backends reach qemu_ram_alloc_internal.

To be future proof, do you prefer I also set MAP_PRIVATE in the other backends,
everywhere MAP_SHARED may be set, eg:
     file_backend_memory_alloc()
           ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;

Or omit RAM_PRIVATE in those cases where it will not be checked, to minimize
exposure of this new flag?

- Steve
Peter Xu Nov. 7, 2024, 5:48 p.m. UTC | #26
On Thu, Nov 07, 2024 at 05:38:26PM +0100, David Hildenbrand wrote:
> > David: why do we need to drop PRIVATE in ramblock flags?  I thought it was
> > pretty harmless.  I suppose things like qemu_ram_is_shared() will even keep
> > working as before?
> > 
> > It looks ok to remove it too, but it adds logics that doesn't seem
> > necessary to me, so just to double check if I missed something..
> 
> A finished ramblock is only boolean "shared" vs. "not shared/private". A
> single flag (RAM_SHARED) can express that clearly.
> 
> Consequently there is less to get wrong when using RAM_PRIVATE only as a
> flag to the creation function (and documenting that!).
> 
> To make RAM_PRIVATE consistent we might have to tweak all other RAMBlock
> creation functions to set RAM_PRIVATE in the !RAM_SHARED case, and I don't
> think that is wroth the trouble.

Yeah, I actually prefer PRIVATE to be applied everywhere, and assert that
either SHARED|PRIVATE be set in all ramblocks.

But no strong opinions, if both of you like it only applied optionally, I
already lost the vote.  But yes, please in that case extremely carefully
document PRIVATE as it can be very tricky now.

PS: when I think about how to document that, I really, really hoped we
simply apply it to all..

Thanks,
Steven Sistare Nov. 7, 2024, 6:13 p.m. UTC | #27
On 11/7/2024 11:19 AM, David Hildenbrand wrote:
> On 07.11.24 15:04, Steven Sistare wrote:
>> On 11/7/2024 8:05 AM, David Hildenbrand wrote:
[...]
>> Do you plan to submit the part of your "tmp" patch that refactors
>> shm_backend_memory_alloc and defines qemu_shm_alloc?  If you want,
>> I could include it in my series, with your Signed-off-by.
> 
> My patch went a bit too far I think. And would not work on win32 :)
> 
> We should probably start with this:
> 
> From 124920aeda2756faa104bfa6e934c7c20b1fbbe9 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Mon, 4 Nov 2024 11:29:22 +0100
> Subject: [PATCH] backends/hostmem-shm: factor out allocation of "anonymous
>  shared memory with an fd"
> 
> Let's factor it out so we can reuse it.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  backends/hostmem-shm.c | 45 ++++-------------------------------
>  include/qemu/osdep.h   |  1 +
>  system/physmem.c       |  2 +-
>  util/oslib-posix.c     | 53 ++++++++++++++++++++++++++++++++++++++++++
>  util/oslib-win32.c     |  6 +++++
>  5 files changed, 65 insertions(+), 42 deletions(-) 
[...]

Thanks, I see what you fixed.

FYI I deleted this remnant from "tmp":

 > @@ -2089,7 +2089,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
 >       new_block->page_size = qemu_real_host_page_size();
 >       new_block->host = host;
 >       new_block->flags = ram_flags;
 > -    ram_block_add(new_block, &local_err);
 > +

and I added this so all programs that link with libqemuutil also link with
rt, which defines shm_open:

diff --git a/meson.build b/meson.build
index e324c49..aa54535 100644
--- a/meson.build
+++ b/meson.build
@@ -3604,9 +3604,13 @@ libqemuutil = static_library('qemuutil',
                               build_by_default: false,
                               sources: util_ss.sources() + stub_ss.sources() + genh,
                               dependencies: [util_ss.dependencies(), libm, threads, glib, socket, malloc])
+qemuutil_deps = [event_loop_base]
+if host_os != 'windows'
+  qemuutil_deps += [rt]
+endif
  qemuutil = declare_dependency(link_with: libqemuutil,
                                sources: genh + version_res,
-                              dependencies: [event_loop_base])
+                              dependencies: qemuutil_deps)

- Steve
David Hildenbrand Nov. 8, 2024, 11:31 a.m. UTC | #28
On 07.11.24 17:40, Steven Sistare wrote:
> On 11/7/2024 11:26 AM, David Hildenbrand wrote:
>> On 07.11.24 17:02, Steven Sistare wrote:
>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>>>> On 06.11.24 21:12, Steven Sistare wrote:
>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>
>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>
>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>
>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>> details. See below.
>>>>>>>>>>
>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>
>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>
>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>
>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>
>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>
>>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>
>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>
>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>
>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>
>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>
>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>
>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>
>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>
>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>> of options and words to describe them.
>>>>>>
>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>
>>>>> Hi David and Peter,
>>>>>
>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>> for simplicity.
>>>>>
>>>>> Any comments before I submit a complete patch?
>>>>>
>>>>> ----
>>>>> qemu-options.hx:
>>>>>         ``aux-ram-share=on|off``
>>>>>             Allocate auxiliary guest RAM as an anonymous file that is
>>>>>             shareable with an external process.  This option applies to
>>>>>             memory allocated as a side effect of creating various devices.
>>>>>             It does not apply to memory-backend-objects, whether explicitly
>>>>>             specified on the command line, or implicitly created by the -m
>>>>>             command line option.
>>>>>
>>>>>             Some migration modes require aux-ram-share=on.
>>>>>
>>>>> qapi/migration.json:
>>>>>         @cpr-transfer:
>>>>>              ...
>>>>>              Memory-backend objects must have the share=on attribute, but
>>>>>              memory-backend-epc is not supported.  The VM must be started
>>>>>              with the '-machine aux-ram-share=on' option.
>>>>>
>>>>> Define RAM_PRIVATE
>>>>>
>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>
>>>>> ram_backend_memory_alloc()
>>>>>         ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>         memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>>
>>>>> qemu_ram_alloc_internal()
>>>>>         ...
>>>>>         if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>>             new_block->flags |= RAM_SHARED;
>>>>>
>>>>>         if (!host && (new_block->flags & RAM_SHARED)) {
>>>>>             qemu_ram_alloc_shared(new_block);
>>>>>         } else
>>>>>             new_block->fd = -1;
>>>>>             new_block->host = host;
>>>>>         }
>>>>>         ram_block_add(new_block);
>>>>>
>>>>> qemu_ram_alloc_shared()
>>>>>         if qemu_memfd_check()
>>>>>             new_block->fd = qemu_memfd_create()
>>>>>         else
>>>>>             new_block->fd = qemu_shm_alloc()
>>>>
>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>>>
>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>>>
>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>>>
>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>>>
>>>>
>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>>>
>>>> So maybe something like
>>>>
>>>> qemu_ram_alloc_shared()
>>>>        fd = -1;
>>>>
>>>>        if (qemu_memfd_avilable()) {
>>>>            fd = qemu_memfd_create();
>>>>            if (fd < 0)
>>>>                ... error
>>>>        } else if (qemu_shm_available())
>>>>            fd = qemu_shm_alloc();
>>>>            if (fd < 0)
>>>>                ... error
>>>>        } else {
>>>>            /*
>>>>             * Old behavior: try fd-less shared memory. We might
>>>>             * just end up with non-shared memory on Windows, but
>>>>             * nobody can make sure of this shared memory either way
>>>>             * ... should we just use non-shared memory? Or should
>>>>             * we simply bail out? But then, if there is no shared
>>>>             * memory nobody could possible use it.
>>>>             */
>>>>            qemu_anon_ram_alloc(share=true)
>>>>        }
>>>
>>> Good catch.  We need that fallback for backwards compatibility.  Even with
>>> no use case for memory-backend-ram,share=on since the demise of rdma, users
>>> may specify it on windows, for no particular reason, but it works, and should
>>> continue to work after this series.  CPR would be blocked.
>>
>> Yes, we should keep Windows working in the weird way it is working right now.
>>
>>   > > More generally for backwards compatibility for share=on for no particular reason,
>>> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
>>> options and more than half of ram is requested, it will fail, whereas current qemu
>>> succeeds using MAP_SHARED|MAP_ANON.
>>
>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
>>
>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
> 
> Agreed on all.
> 
> One more opinion from you please, if you will.
> 
> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
> set in
>     ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
> 
> None of the other backends reach qemu_ram_alloc_internal.
> 
> To be future proof, do you prefer I also set MAP_PRIVATE in the other backends,
> everywhere MAP_SHARED may be set, eg:

Hm, I think then we should set RAM_PRIVATE really everywhere where we'd 
want it and relied on !RAM_SHARED doing the right thing.

Alternatively, we make our life easier and do something like

/*
  * This flag is only used while creating+allocating RAM, and
  * prevents RAM_SHARED getting set for anonymous RAM automatically in
  * some configurations.
  *
  * By default, not setting RAM_SHARED on anonymous RAM implies
  * "private anonymous RAM"; however, in some configuration we want to
  * have most of this RAM automatically be "sharable anonymous RAM",
  * except for some cases that really want "private anonymous RAM".
  *
  * This anonymous RAM *must* be private. This flag only applies to
  * "anonymous" RAM, not fd/file-backed/preallocated one.
  */
RAM_FORCE_ANON_PRIVATE	(1 << 13)


BUT maybe an even better alternative now that we have the 
"aux-ram-share" parameter, could we use

/*
  * Auxiliary RAM that was created automatically internally, instead of
  * explicitly like using memory-backend-ram or some other device on the
  * QEMU cmdline.
  */
RAM_AUX	(1 << 13)


So it will be quite clear that "aux-ram-share" only applies to RAM_AUX 
RAMBlocks.

That actually looks quite compelling to me :)
Peter Xu Nov. 8, 2024, 1:43 p.m. UTC | #29
On Fri, Nov 08, 2024 at 12:31:45PM +0100, David Hildenbrand wrote:
> On 07.11.24 17:40, Steven Sistare wrote:
> > On 11/7/2024 11:26 AM, David Hildenbrand wrote:
> > > On 07.11.24 17:02, Steven Sistare wrote:
> > > > On 11/7/2024 8:23 AM, David Hildenbrand wrote:
> > > > > On 06.11.24 21:12, Steven Sistare wrote:
> > > > > > On 11/4/2024 4:36 PM, David Hildenbrand wrote:
> > > > > > > On 04.11.24 21:56, Steven Sistare wrote:
> > > > > > > > On 11/4/2024 3:15 PM, David Hildenbrand wrote:
> > > > > > > > > On 04.11.24 20:51, David Hildenbrand wrote:
> > > > > > > > > > On 04.11.24 18:38, Steven Sistare wrote:
> > > > > > > > > > > On 11/4/2024 5:39 AM, David Hildenbrand wrote:
> > > > > > > > > > > > On 01.11.24 14:47, Steve Sistare wrote:
> > > > > > > > > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > > > > > > > > > > > on the value of the anon-alloc machine property.  This option applies to
> > > > > > > > > > > > > memory allocated as a side effect of creating various devices. It does
> > > > > > > > > > > > > not apply to memory-backend-objects, whether explicitly specified on
> > > > > > > > > > > > > the command line, or implicitly created by the -m command line option.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The memfd option is intended to support new migration modes, in which the
> > > > > > > > > > > > > memory region can be transferred in place to a new QEMU process, by sending
> > > > > > > > > > > > > the memfd file descriptor to the process.  Memory contents are preserved,
> > > > > > > > > > > > > and if the mode also transfers device descriptors, then pages that are
> > > > > > > > > > > > > locked in memory for DMA remain locked.  This behavior is a pre-requisite
> > > > > > > > > > > > > for supporting vfio, vdpa, and iommufd devices with the new modes.
> > > > > > > > > > > > 
> > > > > > > > > > > > A more portable, non-Linux specific variant of this will be using shm,
> > > > > > > > > > > > similar to backends/hostmem-shm.c.
> > > > > > > > > > > > 
> > > > > > > > > > > > Likely we should be using that instead of memfd, or try hiding the
> > > > > > > > > > > > details. See below.
> > > > > > > > > > > 
> > > > > > > > > > > For this series I would prefer to use memfd and hide the details.  It's a
> > > > > > > > > > > concise (and well tested) solution albeit linux only.  The code you supply
> > > > > > > > > > > for posix shm would be a good follow on patch to support other unices.
> > > > > > > > > > 
> > > > > > > > > > Unless there is reason to use memfd we should start with the more
> > > > > > > > > > generic POSIX variant that is available even on systems without memfd.
> > > > > > > > > > Factoring stuff out as I drafted does look quite compelling.
> > > > > > > > > > 
> > > > > > > > > > I can help with the rework, and send it out separately, so you can focus
> > > > > > > > > > on the "machine toggle" as part of this series.
> > > > > > > > > > 
> > > > > > > > > > Of course, if we find out we need the memfd internally instead under
> > > > > > > > > > Linux for whatever reason later, we can use that instead.
> > > > > > > > > > 
> > > > > > > > > > But IIUC, the main selling point for memfd are additional features
> > > > > > > > > > (hugetlb, memory sealing) that you aren't even using.
> > > > > > > > > 
> > > > > > > > > FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
> > > > > > > > 
> > > > > > > > Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
> > > > > > > > To do so using shm_open requires configuration on the mount.  One step harder to use.
> > > > > > > 
> > > > > > > Yes.
> > > > > > > 
> > > > > > > > 
> > > > > > > > This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
> > > > > > > > if memory-backend-ram has hogged all the memory.
> > > > > > > > 
> > > > > > > > > Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
> > > > > > > > 
> > > > > > > > Yes, and if that is a good idea, then the same should be done for internal RAM
> > > > > > > > -- memfd if available and fallback to shm_open.
> > > > > > > 
> > > > > > > Yes.
> > > > > > > 
> > > > > > > > 
> > > > > > > > > I'm hoping we can find a way where it just all is rather intuitive, like
> > > > > > > > > 
> > > > > > > > > "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
> > > > > > > > > 
> > > > > > > > > "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
> > > > > > > > > 
> > > > > > > > > Thoughts?
> > > > > > > > 
> > > > > > > > Agreed, though I thought I had already landed at the intuitive specification in my patch.
> > > > > > > > The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
> > > > > > > > controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
> > > > > > > > of options and words to describe them.
> > > > > > > 
> > > > > > > Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
> > > > > > 
> > > > > > Hi David and Peter,
> > > > > > 
> > > > > > I have implemented and tested the following, for both qemu_memfd_create
> > > > > > and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
> > > > > > for simplicity.
> > > > > > 
> > > > > > Any comments before I submit a complete patch?
> > > > > > 
> > > > > > ----
> > > > > > qemu-options.hx:
> > > > > >         ``aux-ram-share=on|off``
> > > > > >             Allocate auxiliary guest RAM as an anonymous file that is
> > > > > >             shareable with an external process.  This option applies to
> > > > > >             memory allocated as a side effect of creating various devices.
> > > > > >             It does not apply to memory-backend-objects, whether explicitly
> > > > > >             specified on the command line, or implicitly created by the -m
> > > > > >             command line option.
> > > > > > 
> > > > > >             Some migration modes require aux-ram-share=on.
> > > > > > 
> > > > > > qapi/migration.json:
> > > > > >         @cpr-transfer:
> > > > > >              ...
> > > > > >              Memory-backend objects must have the share=on attribute, but
> > > > > >              memory-backend-epc is not supported.  The VM must be started
> > > > > >              with the '-machine aux-ram-share=on' option.
> > > > > > 
> > > > > > Define RAM_PRIVATE
> > > > > > 
> > > > > > Define qemu_shm_alloc(), from David's tmp patch
> > > > > > 
> > > > > > ram_backend_memory_alloc()
> > > > > >         ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
> > > > > >         memory_region_init_ram_flags_nomigrate(ram_flags)
> > > > > > 
> > > > > > qemu_ram_alloc_internal()
> > > > > >         ...
> > > > > >         if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
> > > > > >             new_block->flags |= RAM_SHARED;
> > > > > > 
> > > > > >         if (!host && (new_block->flags & RAM_SHARED)) {
> > > > > >             qemu_ram_alloc_shared(new_block);
> > > > > >         } else
> > > > > >             new_block->fd = -1;
> > > > > >             new_block->host = host;
> > > > > >         }
> > > > > >         ram_block_add(new_block);
> > > > > > 
> > > > > > qemu_ram_alloc_shared()
> > > > > >         if qemu_memfd_check()
> > > > > >             new_block->fd = qemu_memfd_create()
> > > > > >         else
> > > > > >             new_block->fd = qemu_shm_alloc()
> > > > > 
> > > > > Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
> > > > > 
> > > > > memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
> > > > > 
> > > > > MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
> > > > > 
> > > > > So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
> > > > > 
> > > > > 
> > > > > We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
> > > > > 
> > > > > So maybe something like
> > > > > 
> > > > > qemu_ram_alloc_shared()
> > > > >        fd = -1;
> > > > > 
> > > > >        if (qemu_memfd_avilable()) {
> > > > >            fd = qemu_memfd_create();
> > > > >            if (fd < 0)
> > > > >                ... error
> > > > >        } else if (qemu_shm_available())
> > > > >            fd = qemu_shm_alloc();
> > > > >            if (fd < 0)
> > > > >                ... error
> > > > >        } else {
> > > > >            /*
> > > > >             * Old behavior: try fd-less shared memory. We might
> > > > >             * just end up with non-shared memory on Windows, but
> > > > >             * nobody can make sure of this shared memory either way
> > > > >             * ... should we just use non-shared memory? Or should
> > > > >             * we simply bail out? But then, if there is no shared
> > > > >             * memory nobody could possible use it.
> > > > >             */
> > > > >            qemu_anon_ram_alloc(share=true)
> > > > >        }
> > > > 
> > > > Good catch.  We need that fallback for backwards compatibility.  Even with
> > > > no use case for memory-backend-ram,share=on since the demise of rdma, users
> > > > may specify it on windows, for no particular reason, but it works, and should
> > > > continue to work after this series.  CPR would be blocked.
> > > 
> > > Yes, we should keep Windows working in the weird way it is working right now.
> > > 
> > >   > > More generally for backwards compatibility for share=on for no particular reason,
> > > > should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
> > > > options and more than half of ram is requested, it will fail, whereas current qemu
> > > > succeeds using MAP_SHARED|MAP_ANON.
> > > 
> > > Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
> > > 
> > > But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
> > 
> > Agreed on all.
> > 
> > One more opinion from you please, if you will.
> > 
> > RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
> > set in
> >     ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
> > 
> > None of the other backends reach qemu_ram_alloc_internal.
> > 
> > To be future proof, do you prefer I also set MAP_PRIVATE in the other backends,
> > everywhere MAP_SHARED may be set, eg:
> 
> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want
> it and relied on !RAM_SHARED doing the right thing.
> 
> Alternatively, we make our life easier and do something like
> 
> /*
>  * This flag is only used while creating+allocating RAM, and
>  * prevents RAM_SHARED getting set for anonymous RAM automatically in
>  * some configurations.
>  *
>  * By default, not setting RAM_SHARED on anonymous RAM implies
>  * "private anonymous RAM"; however, in some configuration we want to
>  * have most of this RAM automatically be "sharable anonymous RAM",
>  * except for some cases that really want "private anonymous RAM".
>  *
>  * This anonymous RAM *must* be private. This flag only applies to
>  * "anonymous" RAM, not fd/file-backed/preallocated one.
>  */
> RAM_FORCE_ANON_PRIVATE	(1 << 13)
> 
> 
> BUT maybe an even better alternative now that we have the "aux-ram-share"
> parameter, could we use
> 
> /*
>  * Auxiliary RAM that was created automatically internally, instead of
>  * explicitly like using memory-backend-ram or some other device on the
>  * QEMU cmdline.
>  */
> RAM_AUX	(1 << 13)
> 
> 
> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX
> RAMBlocks.
> 
> That actually looks quite compelling to me :)

Could anyone remind me why we can't simply set PRIVATE|SHARED all over the
place?

IMHO RAM_AUX is too hard for any new callers to know how to set.  It's much
easier when we already have SHARED, adding PRIVATE could be mostly natural,
then we can already avoid AUX due to checking !SHARED & !PRIVATE.

Basically, SHARED|PRIVATE then must come from an user request (QMP or
cmdline), otherwise the caller should always set none of them, implying
aux.

It still looks the best to me.

Thanks,
Steven Sistare Nov. 8, 2024, 1:56 p.m. UTC | #30
On 11/8/2024 6:31 AM, David Hildenbrand wrote:
> On 07.11.24 17:40, Steven Sistare wrote:
>> On 11/7/2024 11:26 AM, David Hildenbrand wrote:
>>> On 07.11.24 17:02, Steven Sistare wrote:
>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>>>>> On 06.11.24 21:12, Steven Sistare wrote:
>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>>
>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>>
>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>>> details. See below.
>>>>>>>>>>>
>>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>>
>>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>>
>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>>
>>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>>
>>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>>
>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>>
>>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>>
>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>>
>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>>
>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>>
>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>>
>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>>
>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>
>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>>> of options and words to describe them.
>>>>>>>
>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>>
>>>>>> Hi David and Peter,
>>>>>>
>>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>>> for simplicity.
>>>>>>
>>>>>> Any comments before I submit a complete patch?
>>>>>>
>>>>>> ----
>>>>>> qemu-options.hx:
>>>>>>         ``aux-ram-share=on|off``
>>>>>>             Allocate auxiliary guest RAM as an anonymous file that is
>>>>>>             shareable with an external process.  This option applies to
>>>>>>             memory allocated as a side effect of creating various devices.
>>>>>>             It does not apply to memory-backend-objects, whether explicitly
>>>>>>             specified on the command line, or implicitly created by the -m
>>>>>>             command line option.
>>>>>>
>>>>>>             Some migration modes require aux-ram-share=on.
>>>>>>
>>>>>> qapi/migration.json:
>>>>>>         @cpr-transfer:
>>>>>>              ...
>>>>>>              Memory-backend objects must have the share=on attribute, but
>>>>>>              memory-backend-epc is not supported.  The VM must be started
>>>>>>              with the '-machine aux-ram-share=on' option.
>>>>>>
>>>>>> Define RAM_PRIVATE
>>>>>>
>>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>>
>>>>>> ram_backend_memory_alloc()
>>>>>>         ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>>         memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>>>
>>>>>> qemu_ram_alloc_internal()
>>>>>>         ...
>>>>>>         if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>>>             new_block->flags |= RAM_SHARED;
>>>>>>
>>>>>>         if (!host && (new_block->flags & RAM_SHARED)) {
>>>>>>             qemu_ram_alloc_shared(new_block);
>>>>>>         } else
>>>>>>             new_block->fd = -1;
>>>>>>             new_block->host = host;
>>>>>>         }
>>>>>>         ram_block_add(new_block);
>>>>>>
>>>>>> qemu_ram_alloc_shared()
>>>>>>         if qemu_memfd_check()
>>>>>>             new_block->fd = qemu_memfd_create()
>>>>>>         else
>>>>>>             new_block->fd = qemu_shm_alloc()
>>>>>
>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>>>>
>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>>>>
>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>>>>
>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>>>>
>>>>>
>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>>>>
>>>>> So maybe something like
>>>>>
>>>>> qemu_ram_alloc_shared()
>>>>>        fd = -1;
>>>>>
>>>>>        if (qemu_memfd_avilable()) {
>>>>>            fd = qemu_memfd_create();
>>>>>            if (fd < 0)
>>>>>                ... error
>>>>>        } else if (qemu_shm_available())
>>>>>            fd = qemu_shm_alloc();
>>>>>            if (fd < 0)
>>>>>                ... error
>>>>>        } else {
>>>>>            /*
>>>>>             * Old behavior: try fd-less shared memory. We might
>>>>>             * just end up with non-shared memory on Windows, but
>>>>>             * nobody can make sure of this shared memory either way
>>>>>             * ... should we just use non-shared memory? Or should
>>>>>             * we simply bail out? But then, if there is no shared
>>>>>             * memory nobody could possible use it.
>>>>>             */
>>>>>            qemu_anon_ram_alloc(share=true)
>>>>>        }
>>>>
>>>> Good catch.  We need that fallback for backwards compatibility.  Even with
>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users
>>>> may specify it on windows, for no particular reason, but it works, and should
>>>> continue to work after this series.  CPR would be blocked.
>>>
>>> Yes, we should keep Windows working in the weird way it is working right now.
>>>
>>>   > > More generally for backwards compatibility for share=on for no particular reason,
>>>> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
>>>> options and more than half of ram is requested, it will fail, whereas current qemu
>>>> succeeds using MAP_SHARED|MAP_ANON.
>>>
>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
>>>
>>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
>>
>> Agreed on all.
>>
>> One more opinion from you please, if you will.
>>
>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
>> set in
>>     ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
>>
>> None of the other backends reach qemu_ram_alloc_internal.
>>
>> To be future proof, do you prefer I also set MAP_PRIVATE in the other backends,
>> everywhere MAP_SHARED may be set, eg:
> 
> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want it and relied on !RAM_SHARED doing the right thing.
> 
> Alternatively, we make our life easier and do something like
> 
> /*
>   * This flag is only used while creating+allocating RAM, and
>   * prevents RAM_SHARED getting set for anonymous RAM automatically in
>   * some configurations.
>   *
>   * By default, not setting RAM_SHARED on anonymous RAM implies
>   * "private anonymous RAM"; however, in some configuration we want to
>   * have most of this RAM automatically be "sharable anonymous RAM",
>   * except for some cases that really want "private anonymous RAM".
>   *
>   * This anonymous RAM *must* be private. This flag only applies to
>   * "anonymous" RAM, not fd/file-backed/preallocated one.
>   */
> RAM_FORCE_ANON_PRIVATE    (1 << 13)
> 
> 
> BUT maybe an even better alternative now that we have the "aux-ram-share" parameter, could we use
> 
> /*
>   * Auxiliary RAM that was created automatically internally, instead of
>   * explicitly like using memory-backend-ram or some other device on the
>   * QEMU cmdline.
>   */
> RAM_AUX    (1 << 13)
> 
> 
> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX RAMBlocks.
> 
> That actually looks quite compelling to me :)

Agreed, RAM_AUX is a clear solution.  I would set it in these functions:
   qemu_ram_alloc_resizeable
   memory_region_init_ram_nomigrate
   memory_region_init_rom_nomigrate
   memory_region_init_rom_device_nomigrate

and test it with aux_ram_share in qemu_ram_alloc_internal.
   if RAM_AUX && aux_ram_share
     flags |= RAM_SHARED

However, we could just set RAM_SHARED at those same call sites:
   flags = current_machine->aux_ram_shared ?  RAM_SHARED : 0;
which is what I did in
   [PATCH V2 01/11] machine: alloc-anon option
and test RAM_SHARED in qemu_ram_alloc_internal.
No need for RAM_PRIVATE.

RAM_AUX is nice because it declares intent more specifically.

Your preference?

- Steve
Steven Sistare Nov. 8, 2024, 2:14 p.m. UTC | #31
On 11/8/2024 8:43 AM, Peter Xu wrote:
> On Fri, Nov 08, 2024 at 12:31:45PM +0100, David Hildenbrand wrote:
>> On 07.11.24 17:40, Steven Sistare wrote:
>>> On 11/7/2024 11:26 AM, David Hildenbrand wrote:
>>>> On 07.11.24 17:02, Steven Sistare wrote:
>>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>>>>>> On 06.11.24 21:12, Steven Sistare wrote:
>>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>>>> details. See below.
>>>>>>>>>>>>
>>>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>>>
>>>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>>>
>>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>>>
>>>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>>>
>>>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>>>
>>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>>>
>>>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>>>
>>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>>>
>>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>>>
>>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>>>
>>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>>>> of options and words to describe them.
>>>>>>>>
>>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>>>
>>>>>>> Hi David and Peter,
>>>>>>>
>>>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>>>> for simplicity.
>>>>>>>
>>>>>>> Any comments before I submit a complete patch?
>>>>>>>
>>>>>>> ----
>>>>>>> qemu-options.hx:
>>>>>>>          ``aux-ram-share=on|off``
>>>>>>>              Allocate auxiliary guest RAM as an anonymous file that is
>>>>>>>              shareable with an external process.  This option applies to
>>>>>>>              memory allocated as a side effect of creating various devices.
>>>>>>>              It does not apply to memory-backend-objects, whether explicitly
>>>>>>>              specified on the command line, or implicitly created by the -m
>>>>>>>              command line option.
>>>>>>>
>>>>>>>              Some migration modes require aux-ram-share=on.
>>>>>>>
>>>>>>> qapi/migration.json:
>>>>>>>          @cpr-transfer:
>>>>>>>               ...
>>>>>>>               Memory-backend objects must have the share=on attribute, but
>>>>>>>               memory-backend-epc is not supported.  The VM must be started
>>>>>>>               with the '-machine aux-ram-share=on' option.
>>>>>>>
>>>>>>> Define RAM_PRIVATE
>>>>>>>
>>>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>>>
>>>>>>> ram_backend_memory_alloc()
>>>>>>>          ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>>>          memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>>>>
>>>>>>> qemu_ram_alloc_internal()
>>>>>>>          ...
>>>>>>>          if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>>>>              new_block->flags |= RAM_SHARED;
>>>>>>>
>>>>>>>          if (!host && (new_block->flags & RAM_SHARED)) {
>>>>>>>              qemu_ram_alloc_shared(new_block);
>>>>>>>          } else
>>>>>>>              new_block->fd = -1;
>>>>>>>              new_block->host = host;
>>>>>>>          }
>>>>>>>          ram_block_add(new_block);
>>>>>>>
>>>>>>> qemu_ram_alloc_shared()
>>>>>>>          if qemu_memfd_check()
>>>>>>>              new_block->fd = qemu_memfd_create()
>>>>>>>          else
>>>>>>>              new_block->fd = qemu_shm_alloc()
>>>>>>
>>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>>>>>
>>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>>>>>
>>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>>>>>
>>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>>>>>
>>>>>>
>>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>>>>>
>>>>>> So maybe something like
>>>>>>
>>>>>> qemu_ram_alloc_shared()
>>>>>>         fd = -1;
>>>>>>
>>>>>>         if (qemu_memfd_avilable()) {
>>>>>>             fd = qemu_memfd_create();
>>>>>>             if (fd < 0)
>>>>>>                 ... error
>>>>>>         } else if (qemu_shm_available())
>>>>>>             fd = qemu_shm_alloc();
>>>>>>             if (fd < 0)
>>>>>>                 ... error
>>>>>>         } else {
>>>>>>             /*
>>>>>>              * Old behavior: try fd-less shared memory. We might
>>>>>>              * just end up with non-shared memory on Windows, but
>>>>>>              * nobody can make sure of this shared memory either way
>>>>>>              * ... should we just use non-shared memory? Or should
>>>>>>              * we simply bail out? But then, if there is no shared
>>>>>>              * memory nobody could possible use it.
>>>>>>              */
>>>>>>             qemu_anon_ram_alloc(share=true)
>>>>>>         }
>>>>>
>>>>> Good catch.  We need that fallback for backwards compatibility.  Even with
>>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users
>>>>> may specify it on windows, for no particular reason, but it works, and should
>>>>> continue to work after this series.  CPR would be blocked.
>>>>
>>>> Yes, we should keep Windows working in the weird way it is working right now.
>>>>
>>>>    > > More generally for backwards compatibility for share=on for no particular reason,
>>>>> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
>>>>> options and more than half of ram is requested, it will fail, whereas current qemu
>>>>> succeeds using MAP_SHARED|MAP_ANON.
>>>>
>>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
>>>>
>>>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
>>>
>>> Agreed on all.
>>>
>>> One more opinion from you please, if you will.
>>>
>>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
>>> set in
>>>      ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
>>>
>>> None of the other backends reach qemu_ram_alloc_internal.
>>>
>>> To be future proof, do you prefer I also set MAP_PRIVATE in the other backends,
>>> everywhere MAP_SHARED may be set, eg:
>>
>> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want
>> it and relied on !RAM_SHARED doing the right thing.
>>
>> Alternatively, we make our life easier and do something like
>>
>> /*
>>   * This flag is only used while creating+allocating RAM, and
>>   * prevents RAM_SHARED getting set for anonymous RAM automatically in
>>   * some configurations.
>>   *
>>   * By default, not setting RAM_SHARED on anonymous RAM implies
>>   * "private anonymous RAM"; however, in some configuration we want to
>>   * have most of this RAM automatically be "sharable anonymous RAM",
>>   * except for some cases that really want "private anonymous RAM".
>>   *
>>   * This anonymous RAM *must* be private. This flag only applies to
>>   * "anonymous" RAM, not fd/file-backed/preallocated one.
>>   */
>> RAM_FORCE_ANON_PRIVATE	(1 << 13)
>>
>>
>> BUT maybe an even better alternative now that we have the "aux-ram-share"
>> parameter, could we use
>>
>> /*
>>   * Auxiliary RAM that was created automatically internally, instead of
>>   * explicitly like using memory-backend-ram or some other device on the
>>   * QEMU cmdline.
>>   */
>> RAM_AUX	(1 << 13)
>>
>>
>> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX
>> RAMBlocks.
>>
>> That actually looks quite compelling to me :)
> 
> Could anyone remind me why we can't simply set PRIVATE|SHARED all over the
> place?
> 
> IMHO RAM_AUX is too hard for any new callers to know how to set.  It's much
> easier when we already have SHARED, adding PRIVATE could be mostly natural,
> then we can already avoid AUX due to checking !SHARED & !PRIVATE.
> 
> Basically, SHARED|PRIVATE then must come from an user request (QMP or
> cmdline), otherwise the caller should always set none of them, implying
> aux.
> 
> It still looks the best to me.

Our emails crossed. We could set PRIVATE|SHARED all over the place.  Nothing
wrong with that solution. I have no preference, other than finishing.

- Steve
David Hildenbrand Nov. 8, 2024, 2:18 p.m. UTC | #32
On 08.11.24 14:43, Peter Xu wrote:
> On Fri, Nov 08, 2024 at 12:31:45PM +0100, David Hildenbrand wrote:
>> On 07.11.24 17:40, Steven Sistare wrote:
>>> On 11/7/2024 11:26 AM, David Hildenbrand wrote:
>>>> On 07.11.24 17:02, Steven Sistare wrote:
>>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>>>>>> On 06.11.24 21:12, Steven Sistare wrote:
>>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>>>> details. See below.
>>>>>>>>>>>>
>>>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>>>
>>>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>>>
>>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>>>
>>>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>>>
>>>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>>>
>>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>>>
>>>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>>>
>>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>>>
>>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>>>
>>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>>>
>>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>>>> of options and words to describe them.
>>>>>>>>
>>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>>>
>>>>>>> Hi David and Peter,
>>>>>>>
>>>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>>>> for simplicity.
>>>>>>>
>>>>>>> Any comments before I submit a complete patch?
>>>>>>>
>>>>>>> ----
>>>>>>> qemu-options.hx:
>>>>>>>          ``aux-ram-share=on|off``
>>>>>>>              Allocate auxiliary guest RAM as an anonymous file that is
>>>>>>>              shareable with an external process.  This option applies to
>>>>>>>              memory allocated as a side effect of creating various devices.
>>>>>>>              It does not apply to memory-backend-objects, whether explicitly
>>>>>>>              specified on the command line, or implicitly created by the -m
>>>>>>>              command line option.
>>>>>>>
>>>>>>>              Some migration modes require aux-ram-share=on.
>>>>>>>
>>>>>>> qapi/migration.json:
>>>>>>>          @cpr-transfer:
>>>>>>>               ...
>>>>>>>               Memory-backend objects must have the share=on attribute, but
>>>>>>>               memory-backend-epc is not supported.  The VM must be started
>>>>>>>               with the '-machine aux-ram-share=on' option.
>>>>>>>
>>>>>>> Define RAM_PRIVATE
>>>>>>>
>>>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>>>
>>>>>>> ram_backend_memory_alloc()
>>>>>>>          ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>>>          memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>>>>
>>>>>>> qemu_ram_alloc_internal()
>>>>>>>          ...
>>>>>>>          if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>>>>              new_block->flags |= RAM_SHARED;
>>>>>>>
>>>>>>>          if (!host && (new_block->flags & RAM_SHARED)) {
>>>>>>>              qemu_ram_alloc_shared(new_block);
>>>>>>>          } else
>>>>>>>              new_block->fd = -1;
>>>>>>>              new_block->host = host;
>>>>>>>          }
>>>>>>>          ram_block_add(new_block);
>>>>>>>
>>>>>>> qemu_ram_alloc_shared()
>>>>>>>          if qemu_memfd_check()
>>>>>>>              new_block->fd = qemu_memfd_create()
>>>>>>>          else
>>>>>>>              new_block->fd = qemu_shm_alloc()
>>>>>>
>>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>>>>>
>>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>>>>>
>>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>>>>>
>>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>>>>>
>>>>>>
>>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>>>>>
>>>>>> So maybe something like
>>>>>>
>>>>>> qemu_ram_alloc_shared()
>>>>>>         fd = -1;
>>>>>>
>>>>>>         if (qemu_memfd_avilable()) {
>>>>>>             fd = qemu_memfd_create();
>>>>>>             if (fd < 0)
>>>>>>                 ... error
>>>>>>         } else if (qemu_shm_available())
>>>>>>             fd = qemu_shm_alloc();
>>>>>>             if (fd < 0)
>>>>>>                 ... error
>>>>>>         } else {
>>>>>>             /*
>>>>>>              * Old behavior: try fd-less shared memory. We might
>>>>>>              * just end up with non-shared memory on Windows, but
>>>>>>              * nobody can make sure of this shared memory either way
>>>>>>              * ... should we just use non-shared memory? Or should
>>>>>>              * we simply bail out? But then, if there is no shared
>>>>>>              * memory nobody could possible use it.
>>>>>>              */
>>>>>>             qemu_anon_ram_alloc(share=true)
>>>>>>         }
>>>>>
>>>>> Good catch.  We need that fallback for backwards compatibility.  Even with
>>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users
>>>>> may specify it on windows, for no particular reason, but it works, and should
>>>>> continue to work after this series.  CPR would be blocked.
>>>>
>>>> Yes, we should keep Windows working in the weird way it is working right now.
>>>>
>>>>    > > More generally for backwards compatibility for share=on for no particular reason,
>>>>> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
>>>>> options and more than half of ram is requested, it will fail, whereas current qemu
>>>>> succeeds using MAP_SHARED|MAP_ANON.
>>>>
>>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
>>>>
>>>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
>>>
>>> Agreed on all.
>>>
>>> One more opinion from you please, if you will.
>>>
>>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
>>> set in
>>>      ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
>>>
>>> None of the other backends reach qemu_ram_alloc_internal.
>>>
>>> To be future proof, do you prefer I also set MAP_PRIVATE in the other backends,
>>> everywhere MAP_SHARED may be set, eg:
>>
>> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want
>> it and relied on !RAM_SHARED doing the right thing.
>>
>> Alternatively, we make our life easier and do something like
>>
>> /*
>>   * This flag is only used while creating+allocating RAM, and
>>   * prevents RAM_SHARED getting set for anonymous RAM automatically in
>>   * some configurations.
>>   *
>>   * By default, not setting RAM_SHARED on anonymous RAM implies
>>   * "private anonymous RAM"; however, in some configuration we want to
>>   * have most of this RAM automatically be "sharable anonymous RAM",
>>   * except for some cases that really want "private anonymous RAM".
>>   *
>>   * This anonymous RAM *must* be private. This flag only applies to
>>   * "anonymous" RAM, not fd/file-backed/preallocated one.
>>   */
>> RAM_FORCE_ANON_PRIVATE	(1 << 13)
>>
>>
>> BUT maybe an even better alternative now that we have the "aux-ram-share"
>> parameter, could we use
>>
>> /*
>>   * Auxiliary RAM that was created automatically internally, instead of
>>   * explicitly like using memory-backend-ram or some other device on the
>>   * QEMU cmdline.
>>   */
>> RAM_AUX	(1 << 13)
>>
>>
>> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX
>> RAMBlocks.
>>
>> That actually looks quite compelling to me :)
> 
> Could anyone remind me why we can't simply set PRIVATE|SHARED all over the
> place?
 > > IMHO RAM_AUX is too hard for any new callers to know how to set. 
It's much
> easier when we already have SHARED, adding PRIVATE could be mostly natural,
> then we can already avoid AUX due to checking !SHARED & !PRIVATE.

How is it clearer if you have to know whether you have to set 
RAM_PRIVATE or not for some RAM? Because you *wouldn't* set it "all over 
the place".

No strong opinion, but RAM_AUX aligns much better with what we actually 
want to achieve: making aux RAM shared. Which implies, detecting aux RAM ...
David Hildenbrand Nov. 8, 2024, 2:20 p.m. UTC | #33
On 08.11.24 14:56, Steven Sistare wrote:
> On 11/8/2024 6:31 AM, David Hildenbrand wrote:
>> On 07.11.24 17:40, Steven Sistare wrote:
>>> On 11/7/2024 11:26 AM, David Hildenbrand wrote:
>>>> On 07.11.24 17:02, Steven Sistare wrote:
>>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>>>>>> On 06.11.24 21:12, Steven Sistare wrote:
>>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>>>> details. See below.
>>>>>>>>>>>>
>>>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>>>
>>>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>>>
>>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>>>
>>>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>>>
>>>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>>>
>>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>>>
>>>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>>>
>>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>>>
>>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>>>
>>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>>>
>>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>>>> of options and words to describe them.
>>>>>>>>
>>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>>>
>>>>>>> Hi David and Peter,
>>>>>>>
>>>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>>>> for simplicity.
>>>>>>>
>>>>>>> Any comments before I submit a complete patch?
>>>>>>>
>>>>>>> ----
>>>>>>> qemu-options.hx:
>>>>>>>          ``aux-ram-share=on|off``
>>>>>>>              Allocate auxiliary guest RAM as an anonymous file that is
>>>>>>>              shareable with an external process.  This option applies to
>>>>>>>              memory allocated as a side effect of creating various devices.
>>>>>>>              It does not apply to memory-backend-objects, whether explicitly
>>>>>>>              specified on the command line, or implicitly created by the -m
>>>>>>>              command line option.
>>>>>>>
>>>>>>>              Some migration modes require aux-ram-share=on.
>>>>>>>
>>>>>>> qapi/migration.json:
>>>>>>>          @cpr-transfer:
>>>>>>>               ...
>>>>>>>               Memory-backend objects must have the share=on attribute, but
>>>>>>>               memory-backend-epc is not supported.  The VM must be started
>>>>>>>               with the '-machine aux-ram-share=on' option.
>>>>>>>
>>>>>>> Define RAM_PRIVATE
>>>>>>>
>>>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>>>
>>>>>>> ram_backend_memory_alloc()
>>>>>>>          ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>>>          memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>>>>
>>>>>>> qemu_ram_alloc_internal()
>>>>>>>          ...
>>>>>>>          if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>>>>              new_block->flags |= RAM_SHARED;
>>>>>>>
>>>>>>>          if (!host && (new_block->flags & RAM_SHARED)) {
>>>>>>>              qemu_ram_alloc_shared(new_block);
>>>>>>>          } else
>>>>>>>              new_block->fd = -1;
>>>>>>>              new_block->host = host;
>>>>>>>          }
>>>>>>>          ram_block_add(new_block);
>>>>>>>
>>>>>>> qemu_ram_alloc_shared()
>>>>>>>          if qemu_memfd_check()
>>>>>>>              new_block->fd = qemu_memfd_create()
>>>>>>>          else
>>>>>>>              new_block->fd = qemu_shm_alloc()
>>>>>>
>>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>>>>>
>>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>>>>>
>>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>>>>>
>>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>>>>>
>>>>>>
>>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>>>>>
>>>>>> So maybe something like
>>>>>>
>>>>>> qemu_ram_alloc_shared()
>>>>>>         fd = -1;
>>>>>>
>>>>>>         if (qemu_memfd_avilable()) {
>>>>>>             fd = qemu_memfd_create();
>>>>>>             if (fd < 0)
>>>>>>                 ... error
>>>>>>         } else if (qemu_shm_available())
>>>>>>             fd = qemu_shm_alloc();
>>>>>>             if (fd < 0)
>>>>>>                 ... error
>>>>>>         } else {
>>>>>>             /*
>>>>>>              * Old behavior: try fd-less shared memory. We might
>>>>>>              * just end up with non-shared memory on Windows, but
>>>>>>              * nobody can make sure of this shared memory either way
>>>>>>              * ... should we just use non-shared memory? Or should
>>>>>>              * we simply bail out? But then, if there is no shared
>>>>>>              * memory nobody could possible use it.
>>>>>>              */
>>>>>>             qemu_anon_ram_alloc(share=true)
>>>>>>         }
>>>>>
>>>>> Good catch.  We need that fallback for backwards compatibility.  Even with
>>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users
>>>>> may specify it on windows, for no particular reason, but it works, and should
>>>>> continue to work after this series.  CPR would be blocked.
>>>>
>>>> Yes, we should keep Windows working in the weird way it is working right now.
>>>>
>>>>    > > More generally for backwards compatibility for share=on for no particular reason,
>>>>> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
>>>>> options and more than half of ram is requested, it will fail, whereas current qemu
>>>>> succeeds using MAP_SHARED|MAP_ANON.
>>>>
>>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
>>>>
>>>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
>>>
>>> Agreed on all.
>>>
>>> One more opinion from you please, if you will.
>>>
>>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
>>> set in
>>>      ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
>>>
>>> None of the other backends reach qemu_ram_alloc_internal.
>>>
>>> To be future proof, do you prefer I also set MAP_PRIVATE in the other backends,
>>> everywhere MAP_SHARED may be set, eg:
>>
>> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want it and relied on !RAM_SHARED doing the right thing.
>>
>> Alternatively, we make our life easier and do something like
>>
>> /*
>>    * This flag is only used while creating+allocating RAM, and
>>    * prevents RAM_SHARED getting set for anonymous RAM automatically in
>>    * some configurations.
>>    *
>>    * By default, not setting RAM_SHARED on anonymous RAM implies
>>    * "private anonymous RAM"; however, in some configuration we want to
>>    * have most of this RAM automatically be "sharable anonymous RAM",
>>    * except for some cases that really want "private anonymous RAM".
>>    *
>>    * This anonymous RAM *must* be private. This flag only applies to
>>    * "anonymous" RAM, not fd/file-backed/preallocated one.
>>    */
>> RAM_FORCE_ANON_PRIVATE    (1 << 13)
>>
>>
>> BUT maybe an even better alternative now that we have the "aux-ram-share" parameter, could we use
>>
>> /*
>>    * Auxiliary RAM that was created automatically internally, instead of
>>    * explicitly like using memory-backend-ram or some other device on the
>>    * QEMU cmdline.
>>    */
>> RAM_AUX    (1 << 13)
>>
>>
>> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX RAMBlocks.
>>
>> That actually looks quite compelling to me :)
> 
> Agreed, RAM_AUX is a clear solution.  I would set it in these functions:
>     qemu_ram_alloc_resizeable
>     memory_region_init_ram_nomigrate
>     memory_region_init_rom_nomigrate
>     memory_region_init_rom_device_nomigrate
> 
> and test it with aux_ram_share in qemu_ram_alloc_internal.
>     if RAM_AUX && aux_ram_share
>       flags |= RAM_SHARED
> 
> However, we could just set RAM_SHARED at those same call sites:
>     flags = current_machine->aux_ram_shared ?  RAM_SHARED : 0;
> which is what I did in
>     [PATCH V2 01/11] machine: alloc-anon option
> and test RAM_SHARED in qemu_ram_alloc_internal.
> No need for RAM_PRIVATE.
> 
> RAM_AUX is nice because it declares intent more specifically.
> 
> Your preference?

My preference is either using RAM_AUX to flag AUX RAM, or the inverse, 
RAM_NON_AUX to flag non-aux RAM, such as from memory backends and likely 
ivshmem.c

Peter still seems to prefer RAM_PRIVATE. So I guess it's up to you to 
decide ;)
Peter Xu Nov. 8, 2024, 2:32 p.m. UTC | #34
On Fri, Nov 08, 2024 at 09:14:02AM -0500, Steven Sistare wrote:
> > Could anyone remind me why we can't simply set PRIVATE|SHARED all over the
> > place?
> > 
> > IMHO RAM_AUX is too hard for any new callers to know how to set.  It's much
> > easier when we already have SHARED, adding PRIVATE could be mostly natural,
> > then we can already avoid AUX due to checking !SHARED & !PRIVATE.
> > 
> > Basically, SHARED|PRIVATE then must come from an user request (QMP or
> > cmdline), otherwise the caller should always set none of them, implying
> > aux.
> > 
> > It still looks the best to me.
> 
> Our emails crossed. We could set PRIVATE|SHARED all over the place.  Nothing
> wrong with that solution. I have no preference, other than finishing.

The current AUX is exactly what I was picturing when I was replying v2, so
the four paths you listed here:

https://lore.kernel.org/qemu-devel/44b15731-0ee8-4e24-b4f5-0614bca594cb@oracle.com/

        Agreed, RAM_AUX is a clear solution.  I would set it in these
        functions:

                qemu_ram_alloc_resizeable
                memory_region_init_ram_nomigrate
                memory_region_init_rom_nomigrate
                memory_region_init_rom_device_nomigrate


That is what I listed previously:

https://lore.kernel.org/qemu-devel/Zv7C7MeVP2X8bEJU@x1n/

        I think that means below paths [1-4] are only relevant:

        qemu_ram_alloc
                memory_region_init_rom_device_nomigrate [1]
                memory_region_init_ram_flags_nomigrate
                        memory_region_init_ram_nomigrate    [2]
                        memory_region_init_rom_nomigrate    [3]
        qemu_ram_alloc_resizeable                   [4]

Except that if we don't want to risk using VM_SHARED unconditionally on
linux for aux, then we need to have a new flag, aka RAM_AUX or similar.

So I believe the list is at least correct.

I changed my mind because I noticed it will be non-trivial to know whether
one should set RAM_AUX when a new user needs to create some new ramblocks.

In that case, we need to define RAM_AUX properly.  One sane way to define
it is: "if the user didn't specify share or private property, please use
AUX", but then it'll overlap with RAM_SHARED / RAM_PRIVATE already, hence
redundant.

Yes, let's wait and see David's comment.

Thanks,
Steven Sistare Nov. 8, 2024, 2:37 p.m. UTC | #35
On 11/8/2024 9:20 AM, David Hildenbrand wrote:
> On 08.11.24 14:56, Steven Sistare wrote:
>> On 11/8/2024 6:31 AM, David Hildenbrand wrote:
>>> On 07.11.24 17:40, Steven Sistare wrote:
>>>> On 11/7/2024 11:26 AM, David Hildenbrand wrote:
>>>>> On 07.11.24 17:02, Steven Sistare wrote:
>>>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>>>>>>> On 06.11.24 21:12, Steven Sistare wrote:
>>>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>>>>> details. See below.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>>>>
>>>>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>>>>
>>>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>>>>
>>>>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>>>>
>>>>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>>>>
>>>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>>>>
>>>>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>>>>
>>>>>>>>> Yes.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>>>>
>>>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>>>>
>>>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>>>>
>>>>>>>>> Yes.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>>>>
>>>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>>>>
>>>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>>>>
>>>>>>>>>>> Thoughts?
>>>>>>>>>>
>>>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>>>>> of options and words to describe them.
>>>>>>>>>
>>>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>>>>
>>>>>>>> Hi David and Peter,
>>>>>>>>
>>>>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>>>>> for simplicity.
>>>>>>>>
>>>>>>>> Any comments before I submit a complete patch?
>>>>>>>>
>>>>>>>> ----
>>>>>>>> qemu-options.hx:
>>>>>>>>          ``aux-ram-share=on|off``
>>>>>>>>              Allocate auxiliary guest RAM as an anonymous file that is
>>>>>>>>              shareable with an external process.  This option applies to
>>>>>>>>              memory allocated as a side effect of creating various devices.
>>>>>>>>              It does not apply to memory-backend-objects, whether explicitly
>>>>>>>>              specified on the command line, or implicitly created by the -m
>>>>>>>>              command line option.
>>>>>>>>
>>>>>>>>              Some migration modes require aux-ram-share=on.
>>>>>>>>
>>>>>>>> qapi/migration.json:
>>>>>>>>          @cpr-transfer:
>>>>>>>>               ...
>>>>>>>>               Memory-backend objects must have the share=on attribute, but
>>>>>>>>               memory-backend-epc is not supported.  The VM must be started
>>>>>>>>               with the '-machine aux-ram-share=on' option.
>>>>>>>>
>>>>>>>> Define RAM_PRIVATE
>>>>>>>>
>>>>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>>>>
>>>>>>>> ram_backend_memory_alloc()
>>>>>>>>          ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>>>>          memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>>>>>
>>>>>>>> qemu_ram_alloc_internal()
>>>>>>>>          ...
>>>>>>>>          if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>>>>>              new_block->flags |= RAM_SHARED;
>>>>>>>>
>>>>>>>>          if (!host && (new_block->flags & RAM_SHARED)) {
>>>>>>>>              qemu_ram_alloc_shared(new_block);
>>>>>>>>          } else
>>>>>>>>              new_block->fd = -1;
>>>>>>>>              new_block->host = host;
>>>>>>>>          }
>>>>>>>>          ram_block_add(new_block);
>>>>>>>>
>>>>>>>> qemu_ram_alloc_shared()
>>>>>>>>          if qemu_memfd_check()
>>>>>>>>              new_block->fd = qemu_memfd_create()
>>>>>>>>          else
>>>>>>>>              new_block->fd = qemu_shm_alloc()
>>>>>>>
>>>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>>>>>>
>>>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>>>>>>
>>>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>>>>>>
>>>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>>>>>>
>>>>>>>
>>>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>>>>>>
>>>>>>> So maybe something like
>>>>>>>
>>>>>>> qemu_ram_alloc_shared()
>>>>>>>         fd = -1;
>>>>>>>
>>>>>>>         if (qemu_memfd_avilable()) {
>>>>>>>             fd = qemu_memfd_create();
>>>>>>>             if (fd < 0)
>>>>>>>                 ... error
>>>>>>>         } else if (qemu_shm_available())
>>>>>>>             fd = qemu_shm_alloc();
>>>>>>>             if (fd < 0)
>>>>>>>                 ... error
>>>>>>>         } else {
>>>>>>>             /*
>>>>>>>              * Old behavior: try fd-less shared memory. We might
>>>>>>>              * just end up with non-shared memory on Windows, but
>>>>>>>              * nobody can make sure of this shared memory either way
>>>>>>>              * ... should we just use non-shared memory? Or should
>>>>>>>              * we simply bail out? But then, if there is no shared
>>>>>>>              * memory nobody could possible use it.
>>>>>>>              */
>>>>>>>             qemu_anon_ram_alloc(share=true)
>>>>>>>         }
>>>>>>
>>>>>> Good catch.  We need that fallback for backwards compatibility.  Even with
>>>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users
>>>>>> may specify it on windows, for no particular reason, but it works, and should
>>>>>> continue to work after this series.  CPR would be blocked.
>>>>>
>>>>> Yes, we should keep Windows working in the weird way it is working right now.
>>>>>
>>>>>    > > More generally for backwards compatibility for share=on for no particular reason,
>>>>>> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
>>>>>> options and more than half of ram is requested, it will fail, whereas current qemu
>>>>>> succeeds using MAP_SHARED|MAP_ANON.
>>>>>
>>>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
>>>>>
>>>>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
>>>>
>>>> Agreed on all.
>>>>
>>>> One more opinion from you please, if you will.
>>>>
>>>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
>>>> set in
>>>>      ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
>>>>
>>>> None of the other backends reach qemu_ram_alloc_internal.
>>>>
>>>> To be future proof, do you prefer I also set MAP_PRIVATE in the other backends,
>>>> everywhere MAP_SHARED may be set, eg:
>>>
>>> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want it and relied on !RAM_SHARED doing the right thing.
>>>
>>> Alternatively, we make our life easier and do something like
>>>
>>> /*
>>>    * This flag is only used while creating+allocating RAM, and
>>>    * prevents RAM_SHARED getting set for anonymous RAM automatically in
>>>    * some configurations.
>>>    *
>>>    * By default, not setting RAM_SHARED on anonymous RAM implies
>>>    * "private anonymous RAM"; however, in some configuration we want to
>>>    * have most of this RAM automatically be "sharable anonymous RAM",
>>>    * except for some cases that really want "private anonymous RAM".
>>>    *
>>>    * This anonymous RAM *must* be private. This flag only applies to
>>>    * "anonymous" RAM, not fd/file-backed/preallocated one.
>>>    */
>>> RAM_FORCE_ANON_PRIVATE    (1 << 13)
>>>
>>>
>>> BUT maybe an even better alternative now that we have the "aux-ram-share" parameter, could we use
>>>
>>> /*
>>>    * Auxiliary RAM that was created automatically internally, instead of
>>>    * explicitly like using memory-backend-ram or some other device on the
>>>    * QEMU cmdline.
>>>    */
>>> RAM_AUX    (1 << 13)
>>>
>>>
>>> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX RAMBlocks.
>>>
>>> That actually looks quite compelling to me :)
>>
>> Agreed, RAM_AUX is a clear solution.  I would set it in these functions:
>>     qemu_ram_alloc_resizeable
>>     memory_region_init_ram_nomigrate
>>     memory_region_init_rom_nomigrate
>>     memory_region_init_rom_device_nomigrate
>>
>> and test it with aux_ram_share in qemu_ram_alloc_internal.
>>     if RAM_AUX && aux_ram_share
>>       flags |= RAM_SHARED
>>
>> However, we could just set RAM_SHARED at those same call sites:
>>     flags = current_machine->aux_ram_shared ?  RAM_SHARED : 0;
>> which is what I did in
>>     [PATCH V2 01/11] machine: alloc-anon option
>> and test RAM_SHARED in qemu_ram_alloc_internal.
>> No need for RAM_PRIVATE.
>>
>> RAM_AUX is nice because it declares intent more specifically.
>>
>> Your preference?
> 
> My preference is either using RAM_AUX to flag AUX RAM, or the inverse, RAM_NON_AUX to flag non-aux RAM, such as from memory backends and likely ivshmem.c
> 
> Peter still seems to prefer RAM_PRIVATE. So I guess it's up to you to decide ;)

I like the inverse flag RAM_NON_AUX, better name TBD.
The call sites are well defined.
That is what my V3 hack was testing (modulo ivshmem).
    object_dynamic_cast(new_block->mr->parent_obj.parent, TYPE_MEMORY_BACKEND

- Steve
David Hildenbrand Nov. 8, 2024, 2:54 p.m. UTC | #36
On 08.11.24 15:37, Steven Sistare wrote:
> On 11/8/2024 9:20 AM, David Hildenbrand wrote:
>> On 08.11.24 14:56, Steven Sistare wrote:
>>> On 11/8/2024 6:31 AM, David Hildenbrand wrote:
>>>> On 07.11.24 17:40, Steven Sistare wrote:
>>>>> On 11/7/2024 11:26 AM, David Hildenbrand wrote:
>>>>>> On 07.11.24 17:02, Steven Sistare wrote:
>>>>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>>>>>>>> On 06.11.24 21:12, Steven Sistare wrote:
>>>>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>>>>>> details. See below.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>>>>>
>>>>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>>>>>
>>>>>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>>>>>
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>>>>>
>>>>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>>>>>
>>>>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>>>>>
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>>>>>
>>>>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>>>>>
>>>>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>>>>>
>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>
>>>>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>>>>>> of options and words to describe them.
>>>>>>>>>>
>>>>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>>>>>
>>>>>>>>> Hi David and Peter,
>>>>>>>>>
>>>>>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>>>>>> for simplicity.
>>>>>>>>>
>>>>>>>>> Any comments before I submit a complete patch?
>>>>>>>>>
>>>>>>>>> ----
>>>>>>>>> qemu-options.hx:
>>>>>>>>>           ``aux-ram-share=on|off``
>>>>>>>>>               Allocate auxiliary guest RAM as an anonymous file that is
>>>>>>>>>               shareable with an external process.  This option applies to
>>>>>>>>>               memory allocated as a side effect of creating various devices.
>>>>>>>>>               It does not apply to memory-backend-objects, whether explicitly
>>>>>>>>>               specified on the command line, or implicitly created by the -m
>>>>>>>>>               command line option.
>>>>>>>>>
>>>>>>>>>               Some migration modes require aux-ram-share=on.
>>>>>>>>>
>>>>>>>>> qapi/migration.json:
>>>>>>>>>           @cpr-transfer:
>>>>>>>>>                ...
>>>>>>>>>                Memory-backend objects must have the share=on attribute, but
>>>>>>>>>                memory-backend-epc is not supported.  The VM must be started
>>>>>>>>>                with the '-machine aux-ram-share=on' option.
>>>>>>>>>
>>>>>>>>> Define RAM_PRIVATE
>>>>>>>>>
>>>>>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>>>>>
>>>>>>>>> ram_backend_memory_alloc()
>>>>>>>>>           ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>>>>>           memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>>>>>>
>>>>>>>>> qemu_ram_alloc_internal()
>>>>>>>>>           ...
>>>>>>>>>           if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>>>>>>               new_block->flags |= RAM_SHARED;
>>>>>>>>>
>>>>>>>>>           if (!host && (new_block->flags & RAM_SHARED)) {
>>>>>>>>>               qemu_ram_alloc_shared(new_block);
>>>>>>>>>           } else
>>>>>>>>>               new_block->fd = -1;
>>>>>>>>>               new_block->host = host;
>>>>>>>>>           }
>>>>>>>>>           ram_block_add(new_block);
>>>>>>>>>
>>>>>>>>> qemu_ram_alloc_shared()
>>>>>>>>>           if qemu_memfd_check()
>>>>>>>>>               new_block->fd = qemu_memfd_create()
>>>>>>>>>           else
>>>>>>>>>               new_block->fd = qemu_shm_alloc()
>>>>>>>>
>>>>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>>>>>>>
>>>>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>>>>>>>
>>>>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>>>>>>>
>>>>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>>>>>>>
>>>>>>>>
>>>>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>>>>>>>
>>>>>>>> So maybe something like
>>>>>>>>
>>>>>>>> qemu_ram_alloc_shared()
>>>>>>>>          fd = -1;
>>>>>>>>
>>>>>>>>          if (qemu_memfd_avilable()) {
>>>>>>>>              fd = qemu_memfd_create();
>>>>>>>>              if (fd < 0)
>>>>>>>>                  ... error
>>>>>>>>          } else if (qemu_shm_available())
>>>>>>>>              fd = qemu_shm_alloc();
>>>>>>>>              if (fd < 0)
>>>>>>>>                  ... error
>>>>>>>>          } else {
>>>>>>>>              /*
>>>>>>>>               * Old behavior: try fd-less shared memory. We might
>>>>>>>>               * just end up with non-shared memory on Windows, but
>>>>>>>>               * nobody can make sure of this shared memory either way
>>>>>>>>               * ... should we just use non-shared memory? Or should
>>>>>>>>               * we simply bail out? But then, if there is no shared
>>>>>>>>               * memory nobody could possible use it.
>>>>>>>>               */
>>>>>>>>              qemu_anon_ram_alloc(share=true)
>>>>>>>>          }
>>>>>>>
>>>>>>> Good catch.  We need that fallback for backwards compatibility.  Even with
>>>>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users
>>>>>>> may specify it on windows, for no particular reason, but it works, and should
>>>>>>> continue to work after this series.  CPR would be blocked.
>>>>>>
>>>>>> Yes, we should keep Windows working in the weird way it is working right now.
>>>>>>
>>>>>>     > > More generally for backwards compatibility for share=on for no particular reason,
>>>>>>> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
>>>>>>> options and more than half of ram is requested, it will fail, whereas current qemu
>>>>>>> succeeds using MAP_SHARED|MAP_ANON.
>>>>>>
>>>>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
>>>>>>
>>>>>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
>>>>>
>>>>> Agreed on all.
>>>>>
>>>>> One more opinion from you please, if you will.
>>>>>
>>>>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
>>>>> set in
>>>>>       ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
>>>>>
>>>>> None of the other backends reach qemu_ram_alloc_internal.
>>>>>
>>>>> To be future proof, do you prefer I also set MAP_PRIVATE in the other backends,
>>>>> everywhere MAP_SHARED may be set, eg:
>>>>
>>>> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want it and relied on !RAM_SHARED doing the right thing.
>>>>
>>>> Alternatively, we make our life easier and do something like
>>>>
>>>> /*
>>>>     * This flag is only used while creating+allocating RAM, and
>>>>     * prevents RAM_SHARED getting set for anonymous RAM automatically in
>>>>     * some configurations.
>>>>     *
>>>>     * By default, not setting RAM_SHARED on anonymous RAM implies
>>>>     * "private anonymous RAM"; however, in some configuration we want to
>>>>     * have most of this RAM automatically be "sharable anonymous RAM",
>>>>     * except for some cases that really want "private anonymous RAM".
>>>>     *
>>>>     * This anonymous RAM *must* be private. This flag only applies to
>>>>     * "anonymous" RAM, not fd/file-backed/preallocated one.
>>>>     */
>>>> RAM_FORCE_ANON_PRIVATE    (1 << 13)
>>>>
>>>>
>>>> BUT maybe an even better alternative now that we have the "aux-ram-share" parameter, could we use
>>>>
>>>> /*
>>>>     * Auxiliary RAM that was created automatically internally, instead of
>>>>     * explicitly like using memory-backend-ram or some other device on the
>>>>     * QEMU cmdline.
>>>>     */
>>>> RAM_AUX    (1 << 13)
>>>>
>>>>
>>>> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX RAMBlocks.
>>>>
>>>> That actually looks quite compelling to me :)
>>>
>>> Agreed, RAM_AUX is a clear solution.  I would set it in these functions:
>>>      qemu_ram_alloc_resizeable
>>>      memory_region_init_ram_nomigrate
>>>      memory_region_init_rom_nomigrate
>>>      memory_region_init_rom_device_nomigrate
>>>
>>> and test it with aux_ram_share in qemu_ram_alloc_internal.
>>>      if RAM_AUX && aux_ram_share
>>>        flags |= RAM_SHARED
>>>
>>> However, we could just set RAM_SHARED at those same call sites:
>>>      flags = current_machine->aux_ram_shared ?  RAM_SHARED : 0;
>>> which is what I did in
>>>      [PATCH V2 01/11] machine: alloc-anon option
>>> and test RAM_SHARED in qemu_ram_alloc_internal.
>>> No need for RAM_PRIVATE.
>>>
>>> RAM_AUX is nice because it declares intent more specifically.
>>>
>>> Your preference?
>>
>> My preference is either using RAM_AUX to flag AUX RAM, or the inverse, RAM_NON_AUX to flag non-aux RAM, such as from memory backends and likely ivshmem.c
>>
>> Peter still seems to prefer RAM_PRIVATE. So I guess it's up to you to decide ;)
> 
> I like the inverse flag RAM_NON_AUX, better name TBD.
> The call sites are well defined.
> That is what my V3 hack was testing (modulo ivshmem).
>      object_dynamic_cast(new_block->mr->parent_obj.parent, TYPE_MEMORY_BACKEND

Likely AUX is everything that is "neither explicitly specified by the user nor
very special RAM"

So I think hw/misc/ivshmem.c would also not count as "aux", and similarly
hw/remote/memory.c; both use memory_region_init_ram_from_fd(share=on).

memory_region_init_ram_ptr/memory_region_init_ram_device_ptr are similarly
special: we cannot possibly turn them SHARED. But that's also what your code
already handled.

So maybe, really everything is AUX ram, except
* Using memory_region_init_ram_from_fd()/
   memory_region_init_ram_from_file() users.
* Using memory_region_init_ram_ptr / memory_region_init_ram_device_ptr
* Created via memory backends


Note that hw/m68k/next-cube.c is one odd RAM_SHARED user. I don't know why
it uses RAM_SHARED to get anonymous shared RAM. Likely a mistake when that
code was introduced.

CCing Thomas.

commit 956a78118bfc7fa512b03cbe8a77b9384c6d89f4
Author: Thomas Huth <huth@tuxfamily.org>
Date:   Sat Jun 30 08:45:25 2018 +0200

     m68k: Add NeXTcube machine
     
     It is still quite incomplete (no SCSI, no floppy emulation, no network,
     etc.), but the firmware already shows up the debug monitor prompt in the
     framebuffer display, so at least the very basics are already working.
     
     This code has been taken from Bryce Lanham's GSoC 2011 NeXT branch at
     
      https://github.com/blanham/qemu-NeXT/blob/next-cube/hw/next-cube.c
     
     and altered quite a bit to fit the latest interface and coding conventions
     of the current QEMU.
Peter Xu Nov. 8, 2024, 3:01 p.m. UTC | #37
On Fri, Nov 08, 2024 at 03:18:07PM +0100, David Hildenbrand wrote:
> > 
> > Could anyone remind me why we can't simply set PRIVATE|SHARED all over the
> > place?
> > > IMHO RAM_AUX is too hard for any new callers to know how to set. It's
> much
> > easier when we already have SHARED, adding PRIVATE could be mostly natural,
> > then we can already avoid AUX due to checking !SHARED & !PRIVATE.
> 
> How is it clearer if you have to know whether you have to set RAM_PRIVATE or
> not for some RAM? Because you *wouldn't* set it "all over the place".

I think I answered that in previous reply, but exactly after the line where
it got cut-off.. :)

https://lore.kernel.org/qemu-devel/Zy4VkScMEpYayGtM@x1n/

        Basically, SHARED|PRIVATE then must come from an user request (QMP
        or cmdline), otherwise the caller should always set none of them,
        implying aux.

But I confess that's not accurate.. some caller is based on type of objects
etc. to decide mem must be SHARED.  A better version could be:

        RAM_SHARED|RAM_PRIVATE describes the share-able attribute of the
        RAM block.

        It can never be set together.  When one is set, the memory
        attribute must suffice the request.  When none is set, QEMU will
        decide how to request the memory.

        The flag should only be set if the caller has explicit requirement
        on such memory property.  For example, it can come from either a
        request from user (share=on/off), or the memory must be shared /
        private due to its own attribute (shm objects, like ivshmem, shm
        backend, or remotely shared mem, etc.).

        Otherwise, callers should leave both flags unset.

Maybe we should document this directly into the flag definitions, so what
flags to set would be clearer than before, and just to say it's not the
caller's own willingness to set SHARED | PRIVATE randomly (so as to make
cpr available as much as possible).

Thanks,
Peter Xu Nov. 8, 2024, 3:07 p.m. UTC | #38
On Fri, Nov 08, 2024 at 03:54:13PM +0100, David Hildenbrand wrote:
> Likely AUX is everything that is "neither explicitly specified by the user nor
> very special RAM"
> 
> So I think hw/misc/ivshmem.c would also not count as "aux", and similarly
> hw/remote/memory.c; both use memory_region_init_ram_from_fd(share=on).
> 
> memory_region_init_ram_ptr/memory_region_init_ram_device_ptr are similarly
> special: we cannot possibly turn them SHARED. But that's also what your code
> already handled.
> 
> So maybe, really everything is AUX ram, except
> * Using memory_region_init_ram_from_fd()/
>   memory_region_init_ram_from_file() users.
> * Using memory_region_init_ram_ptr / memory_region_init_ram_device_ptr
> * Created via memory backends
> 
> 
> Note that hw/m68k/next-cube.c is one odd RAM_SHARED user. I don't know why
> it uses RAM_SHARED to get anonymous shared RAM. Likely a mistake when that
> code was introduced.
> 
> CCing Thomas.
> 
> commit 956a78118bfc7fa512b03cbe8a77b9384c6d89f4
> Author: Thomas Huth <huth@tuxfamily.org>
> Date:   Sat Jun 30 08:45:25 2018 +0200
> 
>     m68k: Add NeXTcube machine
>     It is still quite incomplete (no SCSI, no floppy emulation, no network,
>     etc.), but the firmware already shows up the debug monitor prompt in the
>     framebuffer display, so at least the very basics are already working.
>     This code has been taken from Bryce Lanham's GSoC 2011 NeXT branch at
>      https://github.com/blanham/qemu-NeXT/blob/next-cube/hw/next-cube.c
>     and altered quite a bit to fit the latest interface and coding conventions
>     of the current QEMU.

This might also imply that our current RAM_SHARED is already not crystal
clear on when to use, not to mention RAM_AUX if to be introduced..  Please
see my other email, trying to define RAM_SHARED properly.

IIUC after we can properly define RAM_SHARED, then we don't need AUX, and
everything will be crystal clear.

Thanks,
David Hildenbrand Nov. 8, 2024, 3:09 p.m. UTC | #39
On 08.11.24 16:07, Peter Xu wrote:
> On Fri, Nov 08, 2024 at 03:54:13PM +0100, David Hildenbrand wrote:
>> Likely AUX is everything that is "neither explicitly specified by the user nor
>> very special RAM"
>>
>> So I think hw/misc/ivshmem.c would also not count as "aux", and similarly
>> hw/remote/memory.c; both use memory_region_init_ram_from_fd(share=on).
>>
>> memory_region_init_ram_ptr/memory_region_init_ram_device_ptr are similarly
>> special: we cannot possibly turn them SHARED. But that's also what your code
>> already handled.
>>
>> So maybe, really everything is AUX ram, except
>> * Using memory_region_init_ram_from_fd()/
>>    memory_region_init_ram_from_file() users.
>> * Using memory_region_init_ram_ptr / memory_region_init_ram_device_ptr
>> * Created via memory backends
>>
>>
>> Note that hw/m68k/next-cube.c is one odd RAM_SHARED user. I don't know why
>> it uses RAM_SHARED to get anonymous shared RAM. Likely a mistake when that
>> code was introduced.
>>
>> CCing Thomas.
>>
>> commit 956a78118bfc7fa512b03cbe8a77b9384c6d89f4
>> Author: Thomas Huth <huth@tuxfamily.org>
>> Date:   Sat Jun 30 08:45:25 2018 +0200
>>
>>      m68k: Add NeXTcube machine
>>      It is still quite incomplete (no SCSI, no floppy emulation, no network,
>>      etc.), but the firmware already shows up the debug monitor prompt in the
>>      framebuffer display, so at least the very basics are already working.
>>      This code has been taken from Bryce Lanham's GSoC 2011 NeXT branch at
>>       https://github.com/blanham/qemu-NeXT/blob/next-cube/hw/next-cube.c
>>      and altered quite a bit to fit the latest interface and coding conventions
>>      of the current QEMU.
> 
> This might also imply that our current RAM_SHARED is already not crystal
> clear on when to use, not to mention RAM_AUX if to be introduced..

Likely not. When the code was introduced we used magic boolean 
parameters and likely "true" was set by accident.

There are not that many RAM_SHARED users at all ...

Anyhow ....

  Please
> see my other email, trying to define RAM_SHARED properly.
> 
> IIUC after we can properly define RAM_SHARED, then we don't need AUX, and
> everything will be crystal clear.

I think I still prefer RAM_NO_AUX, but I'll leave it to you and Steven 
to figure out, it's been way to many emails at this point :)
David Hildenbrand Nov. 8, 2024, 3:15 p.m. UTC | #40
> CCing Thomas.
> 
> commit 956a78118bfc7fa512b03cbe8a77b9384c6d89f4
> Author: Thomas Huth <huth@tuxfamily.org>
> Date:   Sat Jun 30 08:45:25 2018 +0200
> 
>       m68k: Add NeXTcube machine
>       
>       It is still quite incomplete (no SCSI, no floppy emulation, no network,
>       etc.), but the firmware already shows up the debug monitor prompt in the
>       framebuffer display, so at least the very basics are already working.
>       
>       This code has been taken from Bryce Lanham's GSoC 2011 NeXT branch at
>       
>        https://github.com/blanham/qemu-NeXT/blob/next-cube/hw/next-cube.c
>       
>       and altered quite a bit to fit the latest interface and coding conventions
>       of the current QEMU.

Staring at that link, the code was

     /* MMIO */
     cpu_register_physical_memory((uint32_t)0x2000000,0xD0000,
         cpu_register_io_memory(mmio_read, mmio_write, (void *)env,DEVICE_NATIVE_ENDIAN));
     
     /* BMAP */ //acts as a catch-all for now
     cpu_register_physical_memory((uint32_t)0x2100000,0x3A7FF,
         cpu_register_io_memory(scr_read, scr_write, (void *)env,DEVICE_NATIVE_ENDIAN));

Which we converted to

     /* MMIO */
     memory_region_init_io(mmiomem, NULL, &mmio_ops, machine, "next.mmio",
                           0xD0000);
     memory_region_add_subregion(sysmem, 0x02000000, mmiomem);

     /* BMAP memory */
     memory_region_init_ram_shared_nomigrate(bmapm1, NULL, "next.bmapmem", 64,
                                             true, &error_fatal);
     memory_region_add_subregion(sysmem, 0x020c0000, bmapm1);


So likely the "true" was added by mistake.
diff mbox series

Patch

diff --git a/hw/core/machine.c b/hw/core/machine.c
index adaba17..a89a32b 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -460,6 +460,20 @@  static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
     ms->mem_merge = value;
 }
 
+static int machine_get_anon_alloc(Object *obj, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    return ms->anon_alloc;
+}
+
+static void machine_set_anon_alloc(Object *obj, int value, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    ms->anon_alloc = value;
+}
+
 static bool machine_get_usb(Object *obj, Error **errp)
 {
     MachineState *ms = MACHINE(obj);
@@ -1078,6 +1092,11 @@  static void machine_class_init(ObjectClass *oc, void *data)
     object_class_property_set_description(oc, "mem-merge",
         "Enable/disable memory merge support");
 
+    object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption",
+                                   &AnonAllocOption_lookup,
+                                   machine_get_anon_alloc,
+                                   machine_set_anon_alloc);
+
     object_class_property_add_bool(oc, "usb",
         machine_get_usb, machine_set_usb);
     object_class_property_set_description(oc, "usb",
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 5966069..5a87647 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -393,6 +393,7 @@  struct MachineState {
     bool enable_graphics;
     ConfidentialGuestSupport *cgs;
     HostMemoryBackend *memdev;
+    AnonAllocOption anon_alloc;
     /*
      * convenience alias to ram_memdev_id backend memory region
      * or to numa container memory region
diff --git a/qapi/machine.json b/qapi/machine.json
index 3cc055b..f634c40 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -1898,3 +1898,17 @@ 
 { 'command': 'x-query-interrupt-controllers',
   'returns': 'HumanReadableText',
   'features': [ 'unstable' ]}
+
+##
+# @AnonAllocOption:
+#
+# An enumeration of the options for allocating anonymous guest memory.
+#
+# @mmap: allocate using mmap MAP_ANON
+#
+# @memfd: allocate using memfd_create
+#
+# Since: 9.2
+##
+{ 'enum': 'AnonAllocOption',
+  'data': [ 'mmap', 'memfd' ] }
diff --git a/qemu-options.hx b/qemu-options.hx
index dacc979..fdd6bf2 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -38,6 +38,7 @@  DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
     "                nvdimm=on|off controls NVDIMM support (default=off)\n"
     "                memory-encryption=@var{} memory encryption object to use (default=none)\n"
     "                hmat=on|off controls ACPI HMAT support (default=off)\n"
+    "                anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n"
     "                memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n"
     "                cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n",
     QEMU_ARCH_ALL)
@@ -101,6 +102,16 @@  SRST
         Enables or disables ACPI Heterogeneous Memory Attribute Table
         (HMAT) support. The default is off.
 
+    ``anon-alloc=mmap|memfd``
+        Allocate anonymous guest RAM using mmap MAP_ANON (the default)
+        or memfd_create.  This option applies to memory allocated as a
+        side effect of creating various devices. It does not apply to
+        memory-backend-objects, whether explicitly specified on the
+        command line, or implicitly created by the -m command line
+        option.
+
+        Some migration modes require anon-alloc=memfd.
+
     ``memory-backend='id'``
         An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options.
         Allows to use a memory backend as main RAM.
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a..174f7e0 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -47,6 +47,7 @@ 
 #include "qemu/qemu-print.h"
 #include "qemu/log.h"
 #include "qemu/memalign.h"
+#include "qemu/memfd.h"
 #include "exec/memory.h"
 #include "exec/ioport.h"
 #include "sysemu/dma.h"
@@ -69,6 +70,8 @@ 
 
 #include "qemu/pmem.h"
 
+#include "qapi/qapi-types-migration.h"
+#include "migration/options.h"
 #include "migration/vmstate.h"
 
 #include "qemu/range.h"
@@ -1849,6 +1852,35 @@  static void ram_block_add(RAMBlock *new_block, Error **errp)
                 qemu_mutex_unlock_ramlist();
                 return;
             }
+
+        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
+                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
+                                        TYPE_MEMORY_BACKEND)) {
+            size_t max_length = new_block->max_length;
+            MemoryRegion *mr = new_block->mr;
+            const char *name = memory_region_name(mr);
+
+            new_block->mr->align = QEMU_VMALLOC_ALIGN;
+            new_block->flags |= RAM_SHARED;
+
+            if (new_block->fd == -1) {
+                new_block->fd = qemu_memfd_create(name, max_length + mr->align,
+                                                  0, 0, 0, errp);
+            }
+
+            if (new_block->fd >= 0) {
+                int mfd = new_block->fd;
+                qemu_set_cloexec(mfd);
+                new_block->host = file_ram_alloc(new_block, max_length, mfd,
+                                                 false, 0, errp);
+            }
+            if (!new_block->host) {
+                qemu_mutex_unlock_ramlist();
+                return;
+            }
+            memory_try_enable_merging(new_block->host, new_block->max_length);
+            free_on_error = true;
+
         } else {
             new_block->host = qemu_anon_ram_alloc(new_block->max_length,
                                                   &new_block->mr->align,
@@ -1932,6 +1964,9 @@  static void ram_block_add(RAMBlock *new_block, Error **errp)
         ram_block_notify_add(new_block->host, new_block->used_length,
                              new_block->max_length);
     }
+    trace_ram_block_add(memory_region_name(new_block->mr), new_block->flags,
+                        new_block->fd, new_block->used_length,
+                        new_block->max_length);
     return;
 
 out_free:
diff --git a/system/trace-events b/system/trace-events
index 074d001..267daca 100644
--- a/system/trace-events
+++ b/system/trace-events
@@ -47,3 +47,6 @@  dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"P
 
 # cpu-throttle.c
 cpu_throttle_set(int new_throttle_pct)  "set guest CPU throttled by %d%%"
+
+#physmem.c
+ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %zu, maxlen %zu"