
[V4,02/19] physmem: fd-based shared memory

Message ID 1733145611-62315-3-git-send-email-steven.sistare@oracle.com (mailing list archive)
State New
Series Live update: cpr-transfer

Commit Message

Steven Sistare Dec. 2, 2024, 1:19 p.m. UTC
Create MAP_SHARED RAMBlocks by mmap'ing a file descriptor rather than using
MAP_ANON, so the memory can be accessed in another process by passing and
mmap'ing the fd.  This will allow CPR to support memory-backend-ram and
memory-backend-shm objects, provided the user creates them with share=on.

Use memfd_create if available because it has no constraints.  If not, use
POSIX shm_open.  However, this may fail if the shm mount size is too small,
even if the system has free memory, so for backwards compatibility fall
back to qemu_anon_ram_alloc/MAP_ANON on shm_open failure.

For backwards compatibility on Windows, always use MAP_ANON.  share=on has
no purpose there, but the syntax is accepted, and must continue to work.

Exclude Xen.  Xen ignores RAM_SHARED and does its own allocation.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 system/physmem.c    | 85 +++++++++++++++++++++++++++++++++++++++++++++++++----
 system/trace-events |  1 +
 2 files changed, 81 insertions(+), 5 deletions(-)
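
The allocation order described in the commit message can be sketched outside
QEMU roughly as follows.  This is a minimal illustration under assumptions,
not the patch's code: alloc_shared() and visible_through_fd() are hypothetical
names, the sketch assumes Linux with glibc 2.27 or later for memfd_create(),
and it skips the intermediate shm_open() step, going straight from memfd to
MAP_SHARED|MAP_ANONYMOUS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not QEMU code: return a MAP_SHARED mapping of
 * `size` bytes.  Prefer an fd-backed mapping via memfd_create(); if that
 * cannot be set up, fall back to MAP_SHARED|MAP_ANONYMOUS and report
 * *fd_out = -1.  Returns NULL if both attempts fail.
 */
static void *alloc_shared(size_t size, int *fd_out)
{
    void *p;
    int fd;

    assert(size > 0);

    fd = memfd_create("demo-ram", MFD_CLOEXEC);
    if (fd >= 0 && ftruncate(fd, (off_t)size) == 0) {
        p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            *fd_out = fd;
            return p;
        }
    }
    if (fd >= 0) {
        close(fd);
    }

    /* Fallback: the legacy anonymous shared mapping, with no fd to pass. */
    *fd_out = -1;
    p = mmap(NULL, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Map the same fd a second time, as a receiving process would after being
 * handed the fd, and check that a write made through `first` is visible in
 * the new mapping.  Returns 1 if it is, 0 otherwise.
 */
static int visible_through_fd(int fd, void *first, size_t size)
{
    static const char msg[] = "guest ram";
    void *second;
    int same;

    if (size < sizeof(msg)) {
        return 0;
    }
    memcpy(first, msg, sizeof(msg));

    second = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
    if (second == MAP_FAILED) {
        return 0;
    }
    same = memcmp(second, msg, sizeof(msg)) == 0;
    munmap(second, size);
    return same;
}
```

A caller would treat fd >= 0 as "this block can be handed to another process"
and fd == -1 as the legacy anonymous case.  The second mmap() in
visible_through_fd() stands in for what the receiving process would do with
the passed fd.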

Comments

Peter Xu Dec. 9, 2024, 7:42 p.m. UTC | #1
On Mon, Dec 02, 2024 at 05:19:54AM -0800, Steve Sistare wrote:
> @@ -2089,13 +2154,23 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
>      new_block->page_size = qemu_real_host_page_size();
>      new_block->host = host;
>      new_block->flags = ram_flags;
> +
> +    if (!host && !xen_enabled()) {

Adding one more xen check is unnecessary.  That this patch needed it
suggests the patch can be refactored, because we already have xen checks in
both ram_block_add() and the fd allocation path.

In the meantime, see:

qemu_ram_alloc_from_fd():
    if (kvm_enabled() && !kvm_has_sync_mmu()) {
        error_setg(errp,
                   "host lacks kvm mmu notifiers, -mem-path unsupported");
        return NULL;
    }

I don't think any decent kernel could hit this, but it is another sign that
this patch duplicates some of the file allocation logic.

> +        if ((new_block->flags & RAM_SHARED) &&
> +            !qemu_ram_alloc_shared(new_block, &local_err)) {
> +            goto err;
> +        }
> +    }
> +
>      ram_block_add(new_block, &local_err);
> -    if (local_err) {
> -        g_free(new_block);
> -        error_propagate(errp, local_err);
> -        return NULL;
> +    if (!local_err) {
> +        return new_block;
>      }
> -    return new_block;
> +
> +err:
> +    g_free(new_block);
> +    error_propagate(errp, local_err);
> +    return NULL;
>  }

IIUC we only need to conditionally convert an anon-allocation into an
fd-allocation; then we can reuse qemu_ram_alloc_from_fd() instead of mostly
duplicating it.

I had a few other comments elsewhere that came up while I was reviewing.
E.g., we either shouldn't need to bother caching qemu_memfd_check()
results, or should do it in qemu_memfd_check() directly, and some more.

So I think it's easier if I provide a patch, which also shows that a
smaller change can do the same thing, with everything fixed up
(e.g. addressing the missing mmu notifier check above).  What do you think
of the below?

===8<===
From a90119131a972b0b4f15770fe0b431770456e447 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Mon, 9 Dec 2024 13:38:06 -0500
Subject: [PATCH] physmem: Try to always allocate anon and shared memory with
 fd

qemu_ram_alloc_internal() is the memory API QEMU uses to allocate anonymous
memory.  It allows RAM_SHARED too on top of anonymous.

It may always be beneficial to allocate memory with an fd attached whenever
possible, because an fd is normally more flexible than the virtual mapping
alone.  For example, CPR can pass fds between processes to share memory,
which is especially useful when the memory can be pinned.

Since there's no harm when it's possible, do it unconditionally for all
anonymous & shared memory allocations where the memory is yet to be
allocated.  Provide fallbacks for when it can fail, e.g., when no fd-based
memory allocator is available.

Two extra ERRP_GUARD()s are needed in the functions used, because we ignore
any error they report, so it's easier to allow passing NULL into them.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 system/physmem.c   | 38 ++++++++++++++++++++++++++++++++++++++
 util/memfd.c       |  2 ++
 util/oslib-posix.c |  2 ++
 3 files changed, 42 insertions(+)

diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..4e795aefa0 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -47,6 +47,7 @@
 #include "qemu/qemu-print.h"
 #include "qemu/log.h"
 #include "qemu/memalign.h"
+#include "qemu/memfd.h"
 #include "exec/memory.h"
 #include "exec/ioport.h"
 #include "sysemu/dma.h"
@@ -2057,6 +2058,24 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
 }
 #endif
 
+/*
+ * Try to allocate a zero-sized anonymous fd for shared memory allocations.
+ * Returns >=0 if succeeded, <0 otherwise.
+ *
+ * Prioritize memfd, as it doesn't have the same /dev/shm size limitation
+ * v.s. POSIX shm_open().
+ */
+static int qemu_ram_alloc_anonymous_fd(void)
+{
+    if (qemu_memfd_check(0)) {
+        return qemu_memfd_create("anon-memfd", 0, 0, 0, 0, NULL);
+    } else if (qemu_shm_available()) {
+        return qemu_shm_alloc(0, NULL);
+    } else {
+        return -1;
+    }
+}
+
 static
 RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
                                   void (*resized)(const char*,
@@ -2073,6 +2092,25 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
                           RAM_NORESERVE | RAM_GUEST_MEMFD)) == 0);
     assert(!host ^ (ram_flags & RAM_PREALLOC));
 
+    /*
+     * Try to use fd-based allocation for anonymous and shared memory,
+     * because fd is normally more flexible (e.g. on memory sharing between
+     * processes).  We can still fallback to old ways if it fails.
+     */
+    if (!host && (ram_flags & RAM_SHARED)) {
+        int fd = qemu_ram_alloc_anonymous_fd();
+
+        if (fd >= 0) {
+            new_block = qemu_ram_alloc_from_fd(size, mr, ram_flags,
+                                               fd, 0, errp);
+            if (new_block) {
+                return new_block;
+            }
+            close(fd);
+        }
+        /* Either fd or ramblock allocation failed, fallback */
+    }
+
     align = qemu_real_host_page_size();
     align = MAX(align, TARGET_PAGE_SIZE);
     size = ROUND_UP(size, align);
diff --git a/util/memfd.c b/util/memfd.c
index 8a2e906962..0dc15b2f44 100644
--- a/util/memfd.c
+++ b/util/memfd.c
@@ -52,6 +52,8 @@ int qemu_memfd_create(const char *name, size_t size, bool hugetlb,
 {
     int htsize = hugetlbsize ? ctz64(hugetlbsize) : 0;
 
+    ERRP_GUARD();
+
     if (htsize && 1ULL << htsize != hugetlbsize) {
         error_setg(errp, "Hugepage size must be a power of 2");
         return -1;
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index f8c3724e68..6ca3e994fc 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -944,6 +944,8 @@ int qemu_shm_alloc(size_t size, Error **errp)
     static int sequence;
     mode_t mode;
 
+    ERRP_GUARD();
+
     cur_sequence = qatomic_fetch_inc(&sequence);
 
     /*

Patch

diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a..b0c4b22 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -47,6 +47,7 @@ 
 #include "qemu/qemu-print.h"
 #include "qemu/log.h"
 #include "qemu/memalign.h"
+#include "qemu/memfd.h"
 #include "exec/memory.h"
 #include "exec/ioport.h"
 #include "sysemu/dma.h"
@@ -2057,6 +2058,70 @@  RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
 }
 #endif
 
+static bool qemu_memfd_available(void)
+{
+    static int has_memfd = -1;
+
+    if (has_memfd < 0) {
+        has_memfd = qemu_memfd_check(0);
+    }
+    return has_memfd;
+}
+
+/*
+ * We want anonymous shared memory, similar to MAP_SHARED|MAP_ANON, but
+ * some users want the fd.  Allocate shm explicitly to get an fd.
+ */
+static bool qemu_ram_alloc_shared(RAMBlock *new_block, Error **errp)
+{
+    size_t max_length = new_block->max_length;
+    MemoryRegion *mr = new_block->mr;
+    const char *name = memory_region_name(mr);
+    int fd;
+
+    if (qemu_memfd_available()) {
+        fd = qemu_memfd_create(name, max_length + mr->align, 0, 0, 0, errp);
+        if (fd < 0) {
+            return false;
+        }
+    } else if (!qemu_shm_available()) {
+        /*
+         * Backwards compatibility for Windows.  The user may specify a
+         * memory backend with share=on, and Windows ignores shared.
+         * Fall back to qemu_anon_ram_alloc.
+         */
+        return true;
+    } else {
+        Error *local_err = NULL;
+
+        fd = qemu_shm_alloc(max_length, &local_err);
+        if (fd < 0) {
+            /*
+             * Backwards compatibility in case the shm mount size is too small.
+             * Previous QEMU versions called qemu_anon_ram_alloc for anonymous
+             * shared memory, which could succeed.
+             */
+            error_prepend(&local_err,
+                          "Retrying using MAP_ANON|MAP_SHARED because: ");
+            warn_report_err(local_err);
+            return true;
+        }
+    }
+
+    new_block->mr->align = QEMU_VMALLOC_ALIGN;
+    new_block->host = file_ram_alloc(new_block, max_length, fd, false, 0, errp);
+
+    if (new_block->host) {
+        qemu_set_cloexec(fd);
+        new_block->fd = fd;
+        trace_qemu_ram_alloc_shared(name, max_length, fd, new_block->host);
+        return true;
+    }
+
+    close(fd);
+    return false;
+}
+
 static
 RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
                                   void (*resized)(const char*,
@@ -2089,13 +2154,23 @@  RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     new_block->page_size = qemu_real_host_page_size();
     new_block->host = host;
     new_block->flags = ram_flags;
+
+    if (!host && !xen_enabled()) {
+        if ((new_block->flags & RAM_SHARED) &&
+            !qemu_ram_alloc_shared(new_block, &local_err)) {
+            goto err;
+        }
+    }
+
     ram_block_add(new_block, &local_err);
-    if (local_err) {
-        g_free(new_block);
-        error_propagate(errp, local_err);
-        return NULL;
+    if (!local_err) {
+        return new_block;
     }
-    return new_block;
+
+err:
+    g_free(new_block);
+    error_propagate(errp, local_err);
+    return NULL;
 }
 
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
diff --git a/system/trace-events b/system/trace-events
index 5bbc3fb..831a60c 100644
--- a/system/trace-events
+++ b/system/trace-events
@@ -33,6 +33,7 @@  address_space_map(void *as, uint64_t addr, uint64_t len, bool is_write, uint32_t
 find_ram_offset(uint64_t size, uint64_t offset) "size: 0x%" PRIx64 " @ 0x%" PRIx64
 find_ram_offset_loop(uint64_t size, uint64_t candidate, uint64_t offset, uint64_t next, uint64_t mingap) "trying size: 0x%" PRIx64 " @ 0x%" PRIx64 ", offset: 0x%" PRIx64" next: 0x%" PRIx64 " mingap: 0x%" PRIx64
 ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_madvise, bool need_fallocate, int ret) "%s@%p + 0x%zx: madvise: %d fallocate: %d ret: %d"
+qemu_ram_alloc_shared(const char *name, size_t max_length, int fd, void *host) "%s size %zu fd %d host %p"
 
 # cpus.c
 vm_stop_flush_all(int ret) "ret %d"