Message ID | 1727725244-105198-2-git-send-email-steven.sistare@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Live update: cpr-transfer | expand |
On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote: > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending > on the value of the anon-alloc machine property. This option applies to > memory allocated as a side effect of creating various devices. It does > not apply to memory-backend-objects, whether explicitly specified on > the command line, or implicitly created by the -m command line option. > > The memfd option is intended to support new migration modes, in which the > memory region can be transferred in place to a new QEMU process, by sending > the memfd file descriptor to the process. Memory contents are preserved, > and if the mode also transfers device descriptors, then pages that are > locked in memory for DMA remain locked. This behavior is a pre-requisite > for supporting vfio, vdpa, and iommufd devices with the new modes. > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com> [Igor seems missing in the loop; added] > --- > hw/core/machine.c | 19 +++++++++++++++++++ > include/hw/boards.h | 1 + > qapi/machine.json | 14 ++++++++++++++ > qemu-options.hx | 11 +++++++++++ > system/physmem.c | 35 +++++++++++++++++++++++++++++++++++ > system/trace-events | 3 +++ > 6 files changed, 83 insertions(+) > > diff --git a/hw/core/machine.c b/hw/core/machine.c > index adaba17..a89a32b 100644 > --- a/hw/core/machine.c > +++ b/hw/core/machine.c > @@ -460,6 +460,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp) > ms->mem_merge = value; > } > > +static int machine_get_anon_alloc(Object *obj, Error **errp) > +{ > + MachineState *ms = MACHINE(obj); > + > + return ms->anon_alloc; > +} > + > +static void machine_set_anon_alloc(Object *obj, int value, Error **errp) > +{ > + MachineState *ms = MACHINE(obj); > + > + ms->anon_alloc = value; > +} > + > static bool machine_get_usb(Object *obj, Error **errp) > { > MachineState *ms = MACHINE(obj); > @@ -1078,6 +1092,11 @@ static void machine_class_init(ObjectClass *oc, void *data) > object_class_property_set_description(oc, "mem-merge", > "Enable/disable memory merge support"); > > + object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption", > + &AnonAllocOption_lookup, > + machine_get_anon_alloc, > + machine_set_anon_alloc); > + > object_class_property_add_bool(oc, "usb", > machine_get_usb, machine_set_usb); > object_class_property_set_description(oc, "usb", > diff --git a/include/hw/boards.h b/include/hw/boards.h > index 5966069..5a87647 100644 > --- a/include/hw/boards.h > +++ b/include/hw/boards.h > @@ -393,6 +393,7 @@ struct MachineState { > bool enable_graphics; > ConfidentialGuestSupport *cgs; > HostMemoryBackend *memdev; > + AnonAllocOption anon_alloc; > /* > * convenience alias to ram_memdev_id backend memory region > * or to numa container memory region > diff --git a/qapi/machine.json b/qapi/machine.json > index a6b8795..d4a63f5 100644 > --- a/qapi/machine.json > +++ b/qapi/machine.json > @@ -1898,3 +1898,17 @@ > { 'command': 'x-query-interrupt-controllers', > 'returns': 'HumanReadableText', > 'features': [ 'unstable' ]} > + > +## > +# @AnonAllocOption: > +# > +# An enumeration of the options for allocating anonymous guest memory. > +# > +# @mmap: allocate using mmap MAP_ANON > +# > +# @memfd: allocate using memfd_create > +# > +# Since: 9.2 > +## > +{ 'enum': 'AnonAllocOption', > + 'data': [ 'mmap', 'memfd' ] } > diff --git a/qemu-options.hx b/qemu-options.hx > index d94e2cb..90ab943 100644 > --- a/qemu-options.hx > +++ b/qemu-options.hx > @@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \ > " nvdimm=on|off controls NVDIMM support (default=off)\n" > " memory-encryption=@var{} memory encryption object to use (default=none)\n" > " hmat=on|off controls ACPI HMAT support (default=off)\n" > + " anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n" > " memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n" > " cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n", > QEMU_ARCH_ALL) > @@ -101,6 +102,16 @@ SRST > Enables or disables ACPI Heterogeneous Memory Attribute Table > (HMAT) support. The default is off. > > + ``anon-alloc=mmap|memfd`` > + Allocate anonymous guest RAM using mmap MAP_ANON (the default) > + or memfd_create. This option applies to memory allocated as a > + side effect of creating various devices. It does not apply to > + memory-backend-objects, whether explicitly specified on the > + command line, or implicitly created by the -m command line > + option. > + > + Some migration modes require anon-alloc=memfd. > + > ``memory-backend='id'`` > An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options. > Allows to use a memory backend as main RAM. > diff --git a/system/physmem.c b/system/physmem.c > index dc1db3a..174f7e0 100644 > --- a/system/physmem.c > +++ b/system/physmem.c > @@ -47,6 +47,7 @@ > #include "qemu/qemu-print.h" > #include "qemu/log.h" > #include "qemu/memalign.h" > +#include "qemu/memfd.h" > #include "exec/memory.h" > #include "exec/ioport.h" > #include "sysemu/dma.h" > @@ -69,6 +70,8 @@ > > #include "qemu/pmem.h" > > +#include "qapi/qapi-types-migration.h" > +#include "migration/options.h" > #include "migration/vmstate.h" > > #include "qemu/range.h" > @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) > qemu_mutex_unlock_ramlist(); > return; > } > + > + } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD && > + !object_dynamic_cast(new_block->mr->parent_obj.parent, > + TYPE_MEMORY_BACKEND)) { This is pretty fragile.. if someone adds yet another layer on top of memory backend objects, the ownership links can change and this might silently run into something else even without any warning.. I wished we dig into what is missing, but maybe that's too trivial. If not, we still need to make this as solid. Perhaps that can be a ram flag and let relevant callers pass in that flag explicitly. I think RAM_SHARED can actually be that flag already - I mean, in all paths that we may create anon mem (but not memory-backend-* objects), is it always safe we always switch to RAM_SHARED from anon? And that should also make sure that _if_ that path can already have an optional RAM_SHARED flag passed in, it means we did something wrong, because such change should not violate with whatever the user can specify share=on/off. It means nothing user specified would be violated. I think that means below paths [1-4] are only relevant: qemu_ram_alloc memory_region_init_rom_device_nomigrate [1] memory_region_init_ram_flags_nomigrate memory_region_init_ram_nomigrate [2] memory_region_init_rom_nomigrate [3] qemu_ram_alloc_resizeable [4] So I wonder whether we can have a patch simply switching [1-4] to constantly use VM_SHARED; I assume they're all corner case allocations: they never include major guest memory, but things like vram, etc. I feel like that's fine, I think it should even work with migration where an old QEMU with all such memory chunks being anon, be migrated to a new QEMU where such memory chunks being all memfd. Fundamentally it should work as qemu migration relies on host pointer not anything else IIRC. Worth double check, some migration test could also be useful if something obvious I overlook. Nothing yet I spot will go wrong. Then, maybe.. we don't need any new machine type property? > + size_t max_length = new_block->max_length; > + MemoryRegion *mr = new_block->mr; > + const char *name = memory_region_name(mr); > + > + new_block->mr->align = QEMU_VMALLOC_ALIGN; > + new_block->flags |= RAM_SHARED; > + > + if (new_block->fd == -1) { > + new_block->fd = qemu_memfd_create(name, max_length + mr->align, > + 0, 0, 0, errp); > + } > + > + if (new_block->fd >= 0) { > + int mfd = new_block->fd; > + qemu_set_cloexec(mfd); > + new_block->host = file_ram_alloc(new_block, max_length, mfd, > + false, 0, errp); > + } > + if (!new_block->host) { > + qemu_mutex_unlock_ramlist(); > + return; > + } > + memory_try_enable_merging(new_block->host, new_block->max_length); IIUC this can be dropped. It's destined to be SHARED here, so KSM won't work. But if you agree with VM_SHARED idea I mentioned above, this chunk of code is not needed as a whole, instead it'll be two separate patches instead: - Make above paths [1-4] constantly use VM_SHARED - Change qemu_anon_ram_alloc() path so that it'll cache the fd if VM_SHARED > + free_on_error = true; > + > } else { > new_block->host = qemu_anon_ram_alloc(new_block->max_length, > &new_block->mr->align, > @@ -1932,6 +1964,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) > ram_block_notify_add(new_block->host, new_block->used_length, > new_block->max_length); > } > + trace_ram_block_add(memory_region_name(new_block->mr), new_block->flags, > + new_block->fd, new_block->used_length, > + new_block->max_length); > return; > > out_free: > diff --git a/system/trace-events b/system/trace-events > index 074d001..4669411 100644 > --- a/system/trace-events > +++ b/system/trace-events > @@ -47,3 +47,6 @@ dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"P > > # cpu-throttle.c > cpu_throttle_set(int new_throttle_pct) "set guest CPU throttled by %d%%" > + > +#physmem.c > +ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu" > -- > 1.8.3.1 >
On 03.10.24 18:14, Peter Xu wrote: > On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote: >> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >> on the value of the anon-alloc machine property. This option applies to >> memory allocated as a side effect of creating various devices. It does >> not apply to memory-backend-objects, whether explicitly specified on >> the command line, or implicitly created by the -m command line option. >> >> The memfd option is intended to support new migration modes, in which the >> memory region can be transferred in place to a new QEMU process, by sending >> the memfd file descriptor to the process. Memory contents are preserved, >> and if the mode also transfers device descriptors, then pages that are >> locked in memory for DMA remain locked. This behavior is a pre-requisite >> for supporting vfio, vdpa, and iommufd devices with the new modes. >> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com> > > [Igor seems missing in the loop; added] > >> --- >> hw/core/machine.c | 19 +++++++++++++++++++ >> include/hw/boards.h | 1 + >> qapi/machine.json | 14 ++++++++++++++ >> qemu-options.hx | 11 +++++++++++ >> system/physmem.c | 35 +++++++++++++++++++++++++++++++++++ >> system/trace-events | 3 +++ >> 6 files changed, 83 insertions(+) >> >> diff --git a/hw/core/machine.c b/hw/core/machine.c >> index adaba17..a89a32b 100644 >> --- a/hw/core/machine.c >> +++ b/hw/core/machine.c >> @@ -460,6 +460,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp) >> ms->mem_merge = value; >> } >> >> +static int machine_get_anon_alloc(Object *obj, Error **errp) >> +{ >> + MachineState *ms = MACHINE(obj); >> + >> + return ms->anon_alloc; >> +} >> + >> +static void machine_set_anon_alloc(Object *obj, int value, Error **errp) >> +{ >> + MachineState *ms = MACHINE(obj); >> + >> + ms->anon_alloc = value; >> +} >> + >> static bool machine_get_usb(Object *obj, Error **errp) >> { >> MachineState *ms = MACHINE(obj); >> @@ -1078,6 +1092,11 @@ static void machine_class_init(ObjectClass *oc, void *data) >> object_class_property_set_description(oc, "mem-merge", >> "Enable/disable memory merge support"); >> >> + object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption", >> + &AnonAllocOption_lookup, >> + machine_get_anon_alloc, >> + machine_set_anon_alloc); >> + >> object_class_property_add_bool(oc, "usb", >> machine_get_usb, machine_set_usb); >> object_class_property_set_description(oc, "usb", >> diff --git a/include/hw/boards.h b/include/hw/boards.h >> index 5966069..5a87647 100644 >> --- a/include/hw/boards.h >> +++ b/include/hw/boards.h >> @@ -393,6 +393,7 @@ struct MachineState { >> bool enable_graphics; >> ConfidentialGuestSupport *cgs; >> HostMemoryBackend *memdev; >> + AnonAllocOption anon_alloc; >> /* >> * convenience alias to ram_memdev_id backend memory region >> * or to numa container memory region >> diff --git a/qapi/machine.json b/qapi/machine.json >> index a6b8795..d4a63f5 100644 >> --- a/qapi/machine.json >> +++ b/qapi/machine.json >> @@ -1898,3 +1898,17 @@ >> { 'command': 'x-query-interrupt-controllers', >> 'returns': 'HumanReadableText', >> 'features': [ 'unstable' ]} >> + >> +## >> +# @AnonAllocOption: >> +# >> +# An enumeration of the options for allocating anonymous guest memory. >> +# >> +# @mmap: allocate using mmap MAP_ANON >> +# >> +# @memfd: allocate using memfd_create >> +# >> +# Since: 9.2 >> +## >> +{ 'enum': 'AnonAllocOption', >> + 'data': [ 'mmap', 'memfd' ] } >> diff --git a/qemu-options.hx b/qemu-options.hx >> index d94e2cb..90ab943 100644 >> --- a/qemu-options.hx >> +++ b/qemu-options.hx >> @@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \ >> " nvdimm=on|off controls NVDIMM support (default=off)\n" >> " memory-encryption=@var{} memory encryption object to use (default=none)\n" >> " hmat=on|off controls ACPI HMAT support (default=off)\n" >> + " anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n" >> " memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n" >> " cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n", >> QEMU_ARCH_ALL) >> @@ -101,6 +102,16 @@ SRST >> Enables or disables ACPI Heterogeneous Memory Attribute Table >> (HMAT) support. The default is off. >> >> + ``anon-alloc=mmap|memfd`` >> + Allocate anonymous guest RAM using mmap MAP_ANON (the default) >> + or memfd_create. This option applies to memory allocated as a >> + side effect of creating various devices. It does not apply to >> + memory-backend-objects, whether explicitly specified on the >> + command line, or implicitly created by the -m command line >> + option. >> + >> + Some migration modes require anon-alloc=memfd. >> + >> ``memory-backend='id'`` >> An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options. >> Allows to use a memory backend as main RAM. >> diff --git a/system/physmem.c b/system/physmem.c >> index dc1db3a..174f7e0 100644 >> --- a/system/physmem.c >> +++ b/system/physmem.c >> @@ -47,6 +47,7 @@ >> #include "qemu/qemu-print.h" >> #include "qemu/log.h" >> #include "qemu/memalign.h" >> +#include "qemu/memfd.h" >> #include "exec/memory.h" >> #include "exec/ioport.h" >> #include "sysemu/dma.h" >> @@ -69,6 +70,8 @@ >> >> #include "qemu/pmem.h" >> >> +#include "qapi/qapi-types-migration.h" >> +#include "migration/options.h" >> #include "migration/vmstate.h" >> >> #include "qemu/range.h" >> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) >> qemu_mutex_unlock_ramlist(); >> return; >> } >> + >> + } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD && >> + !object_dynamic_cast(new_block->mr->parent_obj.parent, >> + TYPE_MEMORY_BACKEND)) { > > This is pretty fragile.. if someone adds yet another layer on top of memory > backend objects, the ownership links can change and this might silently run > into something else even without any warning.. > > I wished we dig into what is missing, but maybe that's too trivial. If > not, we still need to make this as solid. Perhaps that can be a ram flag > and let relevant callers pass in that flag explicitly. How would they decide whether or not we want to set the flag in the current configuration? > > I think RAM_SHARED can actually be that flag already - I mean, in all paths > that we may create anon mem (but not memory-backend-* objects), is it > always safe we always switch to RAM_SHARED from anon? Do you mean only setting the flag (-> anonymous shmem) or switching also to memfd, which is a bigger change?
On Fri, Oct 04, 2024 at 12:14:35PM +0200, David Hildenbrand wrote: > On 03.10.24 18:14, Peter Xu wrote: > > On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote: > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending > > > on the value of the anon-alloc machine property. This option applies to > > > memory allocated as a side effect of creating various devices. It does > > > not apply to memory-backend-objects, whether explicitly specified on > > > the command line, or implicitly created by the -m command line option. > > > > > > The memfd option is intended to support new migration modes, in which the > > > memory region can be transferred in place to a new QEMU process, by sending > > > the memfd file descriptor to the process. Memory contents are preserved, > > > and if the mode also transfers device descriptors, then pages that are > > > locked in memory for DMA remain locked. This behavior is a pre-requisite > > > for supporting vfio, vdpa, and iommufd devices with the new modes. > > > > > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com> > > > > [Igor seems missing in the loop; added] > > > > > --- > > > hw/core/machine.c | 19 +++++++++++++++++++ > > > include/hw/boards.h | 1 + > > > qapi/machine.json | 14 ++++++++++++++ > > > qemu-options.hx | 11 +++++++++++ > > > system/physmem.c | 35 +++++++++++++++++++++++++++++++++++ > > > system/trace-events | 3 +++ > > > 6 files changed, 83 insertions(+) > > > > > > diff --git a/hw/core/machine.c b/hw/core/machine.c > > > index adaba17..a89a32b 100644 > > > --- a/hw/core/machine.c > > > +++ b/hw/core/machine.c > > > @@ -460,6 +460,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp) > > > ms->mem_merge = value; > > > } > > > +static int machine_get_anon_alloc(Object *obj, Error **errp) > > > +{ > > > + MachineState *ms = MACHINE(obj); > > > + > > > + return ms->anon_alloc; > > > +} > > > + > > > +static void machine_set_anon_alloc(Object *obj, int value, Error **errp) > > > +{ > > > + MachineState *ms = MACHINE(obj); > > > + > > > + ms->anon_alloc = value; > > > +} > > > + > > > static bool machine_get_usb(Object *obj, Error **errp) > > > { > > > MachineState *ms = MACHINE(obj); > > > @@ -1078,6 +1092,11 @@ static void machine_class_init(ObjectClass *oc, void *data) > > > object_class_property_set_description(oc, "mem-merge", > > > "Enable/disable memory merge support"); > > > + object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption", > > > + &AnonAllocOption_lookup, > > > + machine_get_anon_alloc, > > > + machine_set_anon_alloc); > > > + > > > object_class_property_add_bool(oc, "usb", > > > machine_get_usb, machine_set_usb); > > > object_class_property_set_description(oc, "usb", > > > diff --git a/include/hw/boards.h b/include/hw/boards.h > > > index 5966069..5a87647 100644 > > > --- a/include/hw/boards.h > > > +++ b/include/hw/boards.h > > > @@ -393,6 +393,7 @@ struct MachineState { > > > bool enable_graphics; > > > ConfidentialGuestSupport *cgs; > > > HostMemoryBackend *memdev; > > > + AnonAllocOption anon_alloc; > > > /* > > > * convenience alias to ram_memdev_id backend memory region > > > * or to numa container memory region > > > diff --git a/qapi/machine.json b/qapi/machine.json > > > index a6b8795..d4a63f5 100644 > > > --- a/qapi/machine.json > > > +++ b/qapi/machine.json > > > @@ -1898,3 +1898,17 @@ > > > { 'command': 'x-query-interrupt-controllers', > > > 'returns': 'HumanReadableText', > > > 'features': [ 'unstable' ]} > > > + > > > +## > > > +# @AnonAllocOption: > > > +# > > > +# An enumeration of the options for allocating anonymous guest memory. > > > +# > > > +# @mmap: allocate using mmap MAP_ANON > > > +# > > > +# @memfd: allocate using memfd_create > > > +# > > > +# Since: 9.2 > > > +## > > > +{ 'enum': 'AnonAllocOption', > > > + 'data': [ 'mmap', 'memfd' ] } > > > diff --git a/qemu-options.hx b/qemu-options.hx > > > index d94e2cb..90ab943 100644 > > > --- a/qemu-options.hx > > > +++ b/qemu-options.hx > > > @@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \ > > > " nvdimm=on|off controls NVDIMM support (default=off)\n" > > > " memory-encryption=@var{} memory encryption object to use (default=none)\n" > > > " hmat=on|off controls ACPI HMAT support (default=off)\n" > > > + " anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n" > > > " memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n" > > > " cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n", > > > QEMU_ARCH_ALL) > > > @@ -101,6 +102,16 @@ SRST > > > Enables or disables ACPI Heterogeneous Memory Attribute Table > > > (HMAT) support. The default is off. > > > + ``anon-alloc=mmap|memfd`` > > > + Allocate anonymous guest RAM using mmap MAP_ANON (the default) > > > + or memfd_create. This option applies to memory allocated as a > > > + side effect of creating various devices. It does not apply to > > > + memory-backend-objects, whether explicitly specified on the > > > + command line, or implicitly created by the -m command line > > > + option. > > > + > > > + Some migration modes require anon-alloc=memfd. > > > + > > > ``memory-backend='id'`` > > > An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options. > > > Allows to use a memory backend as main RAM. > > > diff --git a/system/physmem.c b/system/physmem.c > > > index dc1db3a..174f7e0 100644 > > > --- a/system/physmem.c > > > +++ b/system/physmem.c > > > @@ -47,6 +47,7 @@ > > > #include "qemu/qemu-print.h" > > > #include "qemu/log.h" > > > #include "qemu/memalign.h" > > > +#include "qemu/memfd.h" > > > #include "exec/memory.h" > > > #include "exec/ioport.h" > > > #include "sysemu/dma.h" > > > @@ -69,6 +70,8 @@ > > > #include "qemu/pmem.h" > > > +#include "qapi/qapi-types-migration.h" > > > +#include "migration/options.h" > > > #include "migration/vmstate.h" > > > #include "qemu/range.h" > > > @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) > > > qemu_mutex_unlock_ramlist(); > > > return; > > > } > > > + > > > + } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD && > > > + !object_dynamic_cast(new_block->mr->parent_obj.parent, > > > + TYPE_MEMORY_BACKEND)) { > > > > This is pretty fragile.. if someone adds yet another layer on top of memory > > backend objects, the ownership links can change and this might silently run > > into something else even without any warning.. > > > > I wished we dig into what is missing, but maybe that's too trivial. If > > not, we still need to make this as solid. Perhaps that can be a ram flag > > and let relevant callers pass in that flag explicitly. > > How would they decide whether or not we want to set the flag in the current > configuration? It was in the previous email where it got cut.. I listed four paths that may need change. > > > > > I think RAM_SHARED can actually be that flag already - I mean, in all paths > > that we may create anon mem (but not memory-backend-* objects), is it > > always safe we always switch to RAM_SHARED from anon? > > Do you mean only setting the flag (-> anonymous shmem) or switching also to > memfd, which is a bigger change? Switching to memfd. I thought anon shmem (mmap(MAP_SHARED)) is mostly the same internally, if we create memfd then mmap(MAP_SHARED) on top of it, no? In this case we need the fds so we need to do the latter to cache the fds. Thanks,
On 04.10.24 14:33, Peter Xu wrote: > On Fri, Oct 04, 2024 at 12:14:35PM +0200, David Hildenbrand wrote: >> On 03.10.24 18:14, Peter Xu wrote: >>> On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote: >>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>> on the value of the anon-alloc machine property. This option applies to >>>> memory allocated as a side effect of creating various devices. It does >>>> not apply to memory-backend-objects, whether explicitly specified on >>>> the command line, or implicitly created by the -m command line option. >>>> >>>> The memfd option is intended to support new migration modes, in which the >>>> memory region can be transferred in place to a new QEMU process, by sending >>>> the memfd file descriptor to the process. Memory contents are preserved, >>>> and if the mode also transfers device descriptors, then pages that are >>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>> >>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com> >>> >>> [Igor seems missing in the loop; added] >>> >>>> --- >>>> hw/core/machine.c | 19 +++++++++++++++++++ >>>> include/hw/boards.h | 1 + >>>> qapi/machine.json | 14 ++++++++++++++ >>>> qemu-options.hx | 11 +++++++++++ >>>> system/physmem.c | 35 +++++++++++++++++++++++++++++++++++ >>>> system/trace-events | 3 +++ >>>> 6 files changed, 83 insertions(+) >>>> >>>> diff --git a/hw/core/machine.c b/hw/core/machine.c >>>> index adaba17..a89a32b 100644 >>>> --- a/hw/core/machine.c >>>> +++ b/hw/core/machine.c >>>> @@ -460,6 +460,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp) >>>> ms->mem_merge = value; >>>> } >>>> +static int machine_get_anon_alloc(Object *obj, Error **errp) >>>> +{ >>>> + MachineState *ms = MACHINE(obj); >>>> + >>>> + return ms->anon_alloc; >>>> +} >>>> + >>>> +static void machine_set_anon_alloc(Object *obj, int value, Error **errp) >>>> +{ >>>> + MachineState *ms = MACHINE(obj); >>>> + >>>> + ms->anon_alloc = value; >>>> +} >>>> + >>>> static bool machine_get_usb(Object *obj, Error **errp) >>>> { >>>> MachineState *ms = MACHINE(obj); >>>> @@ -1078,6 +1092,11 @@ static void machine_class_init(ObjectClass *oc, void *data) >>>> object_class_property_set_description(oc, "mem-merge", >>>> "Enable/disable memory merge support"); >>>> + object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption", >>>> + &AnonAllocOption_lookup, >>>> + machine_get_anon_alloc, >>>> + machine_set_anon_alloc); >>>> + >>>> object_class_property_add_bool(oc, "usb", >>>> machine_get_usb, machine_set_usb); >>>> object_class_property_set_description(oc, "usb", >>>> diff --git a/include/hw/boards.h b/include/hw/boards.h >>>> index 5966069..5a87647 100644 >>>> --- a/include/hw/boards.h >>>> +++ b/include/hw/boards.h >>>> @@ -393,6 +393,7 @@ struct MachineState { >>>> bool enable_graphics; >>>> ConfidentialGuestSupport *cgs; >>>> HostMemoryBackend *memdev; >>>> + AnonAllocOption anon_alloc; >>>> /* >>>> * convenience alias to ram_memdev_id backend memory region >>>> * or to numa container memory region >>>> diff --git a/qapi/machine.json b/qapi/machine.json >>>> index a6b8795..d4a63f5 100644 >>>> --- a/qapi/machine.json >>>> +++ b/qapi/machine.json >>>> @@ -1898,3 +1898,17 @@ >>>> { 'command': 'x-query-interrupt-controllers', >>>> 'returns': 'HumanReadableText', >>>> 'features': [ 'unstable' ]} >>>> + >>>> +## >>>> +# @AnonAllocOption: >>>> +# >>>> +# An enumeration of the options for allocating anonymous guest memory. >>>> +# >>>> +# @mmap: allocate using mmap MAP_ANON >>>> +# >>>> +# @memfd: allocate using memfd_create >>>> +# >>>> +# Since: 9.2 >>>> +## >>>> +{ 'enum': 'AnonAllocOption', >>>> + 'data': [ 'mmap', 'memfd' ] } >>>> diff --git a/qemu-options.hx b/qemu-options.hx >>>> index d94e2cb..90ab943 100644 >>>> --- a/qemu-options.hx >>>> +++ b/qemu-options.hx >>>> @@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \ >>>> " nvdimm=on|off controls NVDIMM support (default=off)\n" >>>> " memory-encryption=@var{} memory encryption object to use (default=none)\n" >>>> " hmat=on|off controls ACPI HMAT support (default=off)\n" >>>> + " anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n" >>>> " memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n" >>>> " cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n", >>>> QEMU_ARCH_ALL) >>>> @@ -101,6 +102,16 @@ SRST >>>> Enables or disables ACPI Heterogeneous Memory Attribute Table >>>> (HMAT) support. The default is off. >>>> + ``anon-alloc=mmap|memfd`` >>>> + Allocate anonymous guest RAM using mmap MAP_ANON (the default) >>>> + or memfd_create. This option applies to memory allocated as a >>>> + side effect of creating various devices. It does not apply to >>>> + memory-backend-objects, whether explicitly specified on the >>>> + command line, or implicitly created by the -m command line >>>> + option. >>>> + >>>> + Some migration modes require anon-alloc=memfd. >>>> + >>>> ``memory-backend='id'`` >>>> An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options. >>>> Allows to use a memory backend as main RAM. >>>> diff --git a/system/physmem.c b/system/physmem.c >>>> index dc1db3a..174f7e0 100644 >>>> --- a/system/physmem.c >>>> +++ b/system/physmem.c >>>> @@ -47,6 +47,7 @@ >>>> #include "qemu/qemu-print.h" >>>> #include "qemu/log.h" >>>> #include "qemu/memalign.h" >>>> +#include "qemu/memfd.h" >>>> #include "exec/memory.h" >>>> #include "exec/ioport.h" >>>> #include "sysemu/dma.h" >>>> @@ -69,6 +70,8 @@ >>>> #include "qemu/pmem.h" >>>> +#include "qapi/qapi-types-migration.h" >>>> +#include "migration/options.h" >>>> #include "migration/vmstate.h" >>>> #include "qemu/range.h" >>>> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) >>>> qemu_mutex_unlock_ramlist(); >>>> return; >>>> } >>>> + >>>> + } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD && >>>> + !object_dynamic_cast(new_block->mr->parent_obj.parent, >>>> + TYPE_MEMORY_BACKEND)) { >>> >>> This is pretty fragile.. if someone adds yet another layer on top of memory >>> backend objects, the ownership links can change and this might silently run >>> into something else even without any warning.. >>> >>> I wished we dig into what is missing, but maybe that's too trivial. If >>> not, we still need to make this as solid. Perhaps that can be a ram flag >>> and let relevant callers pass in that flag explicitly. >> >> How would they decide whether or not we want to set the flag in the current >> configuration? > > It was in the previous email where it got cut.. I listed four paths that > may need change. That's not my question. Who would decide whether we want to set MAP_SHARED in these callers or not? If you have "unconditionally" in mind, I think it's a bad idea. If there is some other toggle to perform that setting conditionally, why not. > >> >>> >>> I think RAM_SHARED can actually be that flag already - I mean, in all paths >>> that we may create anon mem (but not memory-backend-* objects), is it >>> always safe we always switch to RAM_SHARED from anon? >> >> Do you mean only setting the flag (-> anonymous shmem) or switching also to >> memfd, which is a bigger change? > > Switching to memfd. I thought anon shmem (mmap(MAP_SHARED)) is mostly the > same internally, if we create memfd then mmap(MAP_SHARED) on top of it, no? Memfd is Linux specific, keep that in mind. Apart from that there shouldn't be much difference between anon shmem and memfd (there are memory commit differences, though). Of course, there is a difference between anon memory and shmem, for example regarding what viritofsd faced (e.g., KSM) recently.
On Fri, Oct 04, 2024 at 02:54:38PM +0200, David Hildenbrand wrote: > On 04.10.24 14:33, Peter Xu wrote: > > On Fri, Oct 04, 2024 at 12:14:35PM +0200, David Hildenbrand wrote: > > > On 03.10.24 18:14, Peter Xu wrote: > > > > On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote: > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending > > > > > on the value of the anon-alloc machine property. This option applies to > > > > > memory allocated as a side effect of creating various devices. It does > > > > > not apply to memory-backend-objects, whether explicitly specified on > > > > > the command line, or implicitly created by the -m command line option. > > > > > > > > > > The memfd option is intended to support new migration modes, in which the > > > > > memory region can be transferred in place to a new QEMU process, by sending > > > > > the memfd file descriptor to the process. Memory contents are preserved, > > > > > and if the mode also transfers device descriptors, then pages that are > > > > > locked in memory for DMA remain locked. This behavior is a pre-requisite > > > > > for supporting vfio, vdpa, and iommufd devices with the new modes. > > > > > > > > > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com> > > > > > > > > [Igor seems missing in the loop; added] > > > > > > > > > --- > > > > > hw/core/machine.c | 19 +++++++++++++++++++ > > > > > include/hw/boards.h | 1 + > > > > > qapi/machine.json | 14 ++++++++++++++ > > > > > qemu-options.hx | 11 +++++++++++ > > > > > system/physmem.c | 35 +++++++++++++++++++++++++++++++++++ > > > > > system/trace-events | 3 +++ > > > > > 6 files changed, 83 insertions(+) > > > > > > > > > > diff --git a/hw/core/machine.c b/hw/core/machine.c > > > > > index adaba17..a89a32b 100644 > > > > > --- a/hw/core/machine.c > > > > > +++ b/hw/core/machine.c > > > > > @@ -460,6 +460,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp) > > > > > ms->mem_merge = value; > > > > > } > > > > > +static int machine_get_anon_alloc(Object *obj, Error **errp) > > > > > +{ > > > > > + MachineState *ms = MACHINE(obj); > > > > > + > > > > > + return ms->anon_alloc; > > > > > +} > > > > > + > > > > > +static void machine_set_anon_alloc(Object *obj, int value, Error **errp) > > > > > +{ > > > > > + MachineState *ms = MACHINE(obj); > > > > > + > > > > > + ms->anon_alloc = value; > > > > > +} > > > > > + > > > > > static bool machine_get_usb(Object *obj, Error **errp) > > > > > { > > > > > MachineState *ms = MACHINE(obj); > > > > > @@ -1078,6 +1092,11 @@ static void machine_class_init(ObjectClass *oc, void *data) > > > > > object_class_property_set_description(oc, "mem-merge", > > > > > "Enable/disable memory merge support"); > > > > > + object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption", > > > > > + &AnonAllocOption_lookup, > > > > > + machine_get_anon_alloc, > > > > > + machine_set_anon_alloc); > > > > > + > > > > > object_class_property_add_bool(oc, "usb", > > > > > machine_get_usb, machine_set_usb); > > > > > object_class_property_set_description(oc, "usb", > > > > > diff --git a/include/hw/boards.h b/include/hw/boards.h > > > > > index 5966069..5a87647 100644 > > > > > --- a/include/hw/boards.h > > > > > +++ b/include/hw/boards.h > > > > > @@ -393,6 +393,7 @@ struct MachineState { > > > > > bool enable_graphics; > > > > > ConfidentialGuestSupport *cgs; > > > > > HostMemoryBackend *memdev; > > > > > + AnonAllocOption anon_alloc; > > > > > /* > > > > > * convenience alias to ram_memdev_id backend memory region > > > > > * or to numa container memory region > > > > > diff --git a/qapi/machine.json b/qapi/machine.json > > > > > index a6b8795..d4a63f5 100644 > > > > > --- a/qapi/machine.json > > > > > +++ b/qapi/machine.json > > > > > @@ -1898,3 +1898,17 @@ > > > > > { 'command': 'x-query-interrupt-controllers', > > > > > 'returns': 'HumanReadableText', > > > > > 'features': [ 'unstable' ]} > > > > > + > > > > > +## > > > > > +# @AnonAllocOption: > > > > > +# > > > > > +# An enumeration of the options for allocating anonymous guest memory. > > > > > +# > > > > > +# @mmap: allocate using mmap MAP_ANON > > > > > +# > > > > > +# @memfd: allocate using memfd_create > > > > > +# > > > > > +# Since: 9.2 > > > > > +## > > > > > +{ 'enum': 'AnonAllocOption', > > > > > + 'data': [ 'mmap', 'memfd' ] } > > > > > diff --git a/qemu-options.hx b/qemu-options.hx > > > > > index d94e2cb..90ab943 100644 > > > > > --- a/qemu-options.hx > > > > > +++ b/qemu-options.hx > > > > > @@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \ > > > > > " nvdimm=on|off controls NVDIMM support (default=off)\n" > > > > > " memory-encryption=@var{} memory encryption object to use (default=none)\n" > > > > > " hmat=on|off controls ACPI HMAT support (default=off)\n" > > > > > + " anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n" > > > > > " memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n" > > > > > " cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n", > > > > > QEMU_ARCH_ALL) > > > > > @@ -101,6 +102,16 @@ SRST > > > > > Enables or disables ACPI Heterogeneous Memory Attribute Table > > > > > (HMAT) support. The default is off. > > > > > + ``anon-alloc=mmap|memfd`` > > > > > + Allocate anonymous guest RAM using mmap MAP_ANON (the default) > > > > > + or memfd_create. This option applies to memory allocated as a > > > > > + side effect of creating various devices. It does not apply to > > > > > + memory-backend-objects, whether explicitly specified on the > > > > > + command line, or implicitly created by the -m command line > > > > > + option. > > > > > + > > > > > + Some migration modes require anon-alloc=memfd. > > > > > + > > > > > ``memory-backend='id'`` > > > > > An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options. > > > > > Allows to use a memory backend as main RAM. > > > > > diff --git a/system/physmem.c b/system/physmem.c > > > > > index dc1db3a..174f7e0 100644 > > > > > --- a/system/physmem.c > > > > > +++ b/system/physmem.c > > > > > @@ -47,6 +47,7 @@ > > > > > #include "qemu/qemu-print.h" > > > > > #include "qemu/log.h" > > > > > #include "qemu/memalign.h" > > > > > +#include "qemu/memfd.h" > > > > > #include "exec/memory.h" > > > > > #include "exec/ioport.h" > > > > > #include "sysemu/dma.h" > > > > > @@ -69,6 +70,8 @@ > > > > > #include "qemu/pmem.h" > > > > > +#include "qapi/qapi-types-migration.h" > > > > > +#include "migration/options.h" > > > > > #include "migration/vmstate.h" > > > > > #include "qemu/range.h" > > > > > @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) > > > > > qemu_mutex_unlock_ramlist(); > > > > > return; > > > > > } > > > > > + > > > > > + } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD && > > > > > + !object_dynamic_cast(new_block->mr->parent_obj.parent, > > > > > + TYPE_MEMORY_BACKEND)) { > > > > > > > > This is pretty fragile.. if someone adds yet another layer on top of memory > > > > backend objects, the ownership links can change and this might silently run > > > > into something else even without any warning.. > > > > > > > > I wished we dig into what is missing, but maybe that's too trivial. If > > > > not, we still need to make this as solid. Perhaps that can be a ram flag > > > > and let relevant callers pass in that flag explicitly. > > > > > > How would they decide whether or not we want to set the flag in the current > > > configuration? > > > > It was in the previous email where it got cut.. I listed four paths that > > may need change. > > That's not my question. Who would decide whether we want to set MAP_SHARED > in these callers or not? > > If you have "unconditionally" in mind, I think it's a bad idea. If there is > some other toggle to perform that setting conditionally, why not. Yes I thought it could be unconditionally. We can discuss downside below, I think we can still use a new flag otherwise, but the idea would be the same, where I want the flag to be explicit in the callers not implicitly with the object type check, which I think can be hackish. > > > > > > > > > > > > > > I think RAM_SHARED can actually be that flag already - I mean, in all paths > > > > that we may create anon mem (but not memory-backend-* objects), is it > > > > always safe we always switch to RAM_SHARED from anon? > > > > > > Do you mean only setting the flag (-> anonymous shmem) or switching also to > > > memfd, which is a bigger change? > > > > Switching to memfd. I thought anon shmem (mmap(MAP_SHARED)) is mostly the > > same internally, if we create memfd then mmap(MAP_SHARED) on top of it, no? > > Memfd is Linux specific, keep that in mind. Apart from that there shouldn't > be much difference between anon shmem and memfd (there are memory commit > differences, though). Could you elaborate the memory commit difference and what does that imply to QEMU's usage? > > Of course, there is a difference between anon memory and shmem, for example > regarding what viritofsd faced (e.g., KSM) recently. The four paths shouldn't be KSM target, AFAICT. None of them are major part of guest RAM, but only very small chunks like vram or roms. Thanks,
On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote: > diff --git a/system/trace-events b/system/trace-events > index 074d001..4669411 100644 > --- a/system/trace-events > +++ b/system/trace-events > @@ -47,3 +47,6 @@ dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"P > > # cpu-throttle.c > cpu_throttle_set(int new_throttle_pct) "set guest CPU throttled by %d%%" > + > +#physmem.c > +ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu" This breaks 32bit build: ../system/trace-events: In function ‘_nocheck__trace_ram_block_add’: ../system/trace-events:52:22: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 8 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=] 52 | ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu" | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ...... ../system/trace-events:52:22: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 9 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=] 52 | ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu" | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ...... ../system/trace-events:52:22: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 5 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=] 52 | ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu" | ^~~~~~~~~~~~~~~~ ~~~~~~~~~~~ | | | size_t {aka unsigned int} ../system/trace-events:52:22: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 6 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=] 52 | ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu" | ^~~~~~~~~~~~~~~~ ~~~~~~~~~~ | | | size_t {aka unsigned int} Probably need to switch to %zu for size_t's.
On 04.10.24 15:24, Peter Xu wrote: > On Fri, Oct 04, 2024 at 02:54:38PM +0200, David Hildenbrand wrote: >> On 04.10.24 14:33, Peter Xu wrote: >>> On Fri, Oct 04, 2024 at 12:14:35PM +0200, David Hildenbrand wrote: >>>> On 03.10.24 18:14, Peter Xu wrote: >>>>> On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote: >>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>> memory allocated as a side effect of creating various devices. It does >>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>> the command line, or implicitly created by the -m command line option. >>>>>> >>>>>> The memfd option is intended to support new migration modes, in which the >>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>> >>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com> >>>>> >>>>> [Igor seems missing in the loop; added] >>>>> >>>>>> --- >>>>>> hw/core/machine.c | 19 +++++++++++++++++++ >>>>>> include/hw/boards.h | 1 + >>>>>> qapi/machine.json | 14 ++++++++++++++ >>>>>> qemu-options.hx | 11 +++++++++++ >>>>>> system/physmem.c | 35 +++++++++++++++++++++++++++++++++++ >>>>>> system/trace-events | 3 +++ >>>>>> 6 files changed, 83 insertions(+) >>>>>> >>>>>> diff --git a/hw/core/machine.c b/hw/core/machine.c >>>>>> index adaba17..a89a32b 100644 >>>>>> --- a/hw/core/machine.c >>>>>> +++ b/hw/core/machine.c >>>>>> @@ -460,6 +460,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp) >>>>>> ms->mem_merge = value; >>>>>> } >>>>>> +static int machine_get_anon_alloc(Object *obj, Error **errp) >>>>>> +{ >>>>>> + MachineState *ms = MACHINE(obj); >>>>>> + >>>>>> + return ms->anon_alloc; >>>>>> +} >>>>>> + >>>>>> +static void machine_set_anon_alloc(Object *obj, int value, Error **errp) >>>>>> +{ >>>>>> + MachineState *ms = MACHINE(obj); >>>>>> + >>>>>> + ms->anon_alloc = value; >>>>>> +} >>>>>> + >>>>>> static bool machine_get_usb(Object *obj, Error **errp) >>>>>> { >>>>>> MachineState *ms = MACHINE(obj); >>>>>> @@ -1078,6 +1092,11 @@ static void machine_class_init(ObjectClass *oc, void *data) >>>>>> object_class_property_set_description(oc, "mem-merge", >>>>>> "Enable/disable memory merge support"); >>>>>> + object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption", >>>>>> + &AnonAllocOption_lookup, >>>>>> + machine_get_anon_alloc, >>>>>> + machine_set_anon_alloc); >>>>>> + >>>>>> object_class_property_add_bool(oc, "usb", >>>>>> machine_get_usb, machine_set_usb); >>>>>> object_class_property_set_description(oc, "usb", >>>>>> diff --git a/include/hw/boards.h b/include/hw/boards.h >>>>>> index 5966069..5a87647 100644 >>>>>> --- a/include/hw/boards.h >>>>>> +++ b/include/hw/boards.h >>>>>> @@ -393,6 +393,7 @@ struct MachineState { >>>>>> bool enable_graphics; >>>>>> ConfidentialGuestSupport *cgs; >>>>>> HostMemoryBackend *memdev; >>>>>> + AnonAllocOption anon_alloc; >>>>>> /* >>>>>> * convenience alias to ram_memdev_id backend memory region >>>>>> * or to numa container memory region >>>>>> diff --git a/qapi/machine.json b/qapi/machine.json >>>>>> index a6b8795..d4a63f5 100644 >>>>>> --- a/qapi/machine.json >>>>>> +++ b/qapi/machine.json >>>>>> @@ -1898,3 +1898,17 @@ >>>>>> { 'command': 'x-query-interrupt-controllers', >>>>>> 'returns': 'HumanReadableText', >>>>>> 'features': [ 'unstable' ]} >>>>>> + >>>>>> +## >>>>>> +# @AnonAllocOption: >>>>>> +# >>>>>> +# An enumeration of the options for allocating anonymous guest memory. >>>>>> +# >>>>>> +# @mmap: allocate using mmap MAP_ANON >>>>>> +# >>>>>> +# @memfd: allocate using memfd_create >>>>>> +# >>>>>> +# Since: 9.2 >>>>>> +## >>>>>> +{ 'enum': 'AnonAllocOption', >>>>>> + 'data': [ 'mmap', 'memfd' ] } >>>>>> diff --git a/qemu-options.hx b/qemu-options.hx >>>>>> index d94e2cb..90ab943 100644 >>>>>> --- a/qemu-options.hx >>>>>> +++ b/qemu-options.hx >>>>>> @@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \ >>>>>> " nvdimm=on|off controls NVDIMM support (default=off)\n" >>>>>> " memory-encryption=@var{} memory encryption object to use (default=none)\n" >>>>>> " hmat=on|off controls ACPI HMAT support (default=off)\n" >>>>>> + " anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n" >>>>>> " memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n" >>>>>> " cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n", >>>>>> QEMU_ARCH_ALL) >>>>>> @@ -101,6 +102,16 @@ SRST >>>>>> Enables or disables ACPI Heterogeneous Memory Attribute Table >>>>>> (HMAT) support. The default is off. >>>>>> + ``anon-alloc=mmap|memfd`` >>>>>> + Allocate anonymous guest RAM using mmap MAP_ANON (the default) >>>>>> + or memfd_create. This option applies to memory allocated as a >>>>>> + side effect of creating various devices. It does not apply to >>>>>> + memory-backend-objects, whether explicitly specified on the >>>>>> + command line, or implicitly created by the -m command line >>>>>> + option. >>>>>> + >>>>>> + Some migration modes require anon-alloc=memfd. >>>>>> + >>>>>> ``memory-backend='id'`` >>>>>> An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options. >>>>>> Allows to use a memory backend as main RAM. >>>>>> diff --git a/system/physmem.c b/system/physmem.c >>>>>> index dc1db3a..174f7e0 100644 >>>>>> --- a/system/physmem.c >>>>>> +++ b/system/physmem.c >>>>>> @@ -47,6 +47,7 @@ >>>>>> #include "qemu/qemu-print.h" >>>>>> #include "qemu/log.h" >>>>>> #include "qemu/memalign.h" >>>>>> +#include "qemu/memfd.h" >>>>>> #include "exec/memory.h" >>>>>> #include "exec/ioport.h" >>>>>> #include "sysemu/dma.h" >>>>>> @@ -69,6 +70,8 @@ >>>>>> #include "qemu/pmem.h" >>>>>> +#include "qapi/qapi-types-migration.h" >>>>>> +#include "migration/options.h" >>>>>> #include "migration/vmstate.h" >>>>>> #include "qemu/range.h" >>>>>> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) >>>>>> qemu_mutex_unlock_ramlist(); >>>>>> return; >>>>>> } >>>>>> + >>>>>> + } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD && >>>>>> + !object_dynamic_cast(new_block->mr->parent_obj.parent, >>>>>> + TYPE_MEMORY_BACKEND)) { >>>>> >>>>> This is pretty fragile.. if someone adds yet another layer on top of memory >>>>> backend objects, the ownership links can change and this might silently run >>>>> into something else even without any warning.. >>>>> >>>>> I wished we dig into what is missing, but maybe that's too trivial. If >>>>> not, we still need to make this as solid. Perhaps that can be a ram flag >>>>> and let relevant callers pass in that flag explicitly. >>>> >>>> How would they decide whether or not we want to set the flag in the current >>>> configuration? >>> >>> It was in the previous email where it got cut.. I listed four paths that >>> may need change. >> >> That's not my question. Who would decide whether we want to set MAP_SHARED >> in these callers or not? >> >> If you have "unconditionally" in mind, I think it's a bad idea. If there is >> some other toggle to perform that setting conditionally, why not. > > Yes I thought it could be unconditionally. We can discuss downside below, > I think we can still use a new flag otherwise, but the idea would be the > same, where I want the flag to be explicit in the callers not implicitly > with the object type check, which I think can be hackish. I agree that the caller should specify it. But I don't think using shared memory where shared memory is not warranted is a reasonable approach. I'm quite surprise you're considering such changes with unclear impacts on other OSes besides Linux (Freedbsd? Windows that doeasn';t even support shared memory?) just to make one corner-case QEMU use case happy. But I'm sure there are valid reasons why you had that idea, so I'm happy to learn why using shared memory unconditionally here is better than providing a clean alternative path with the feature enabled and memfd actually being supported on the setup (e.g., newer Linux kernel). > >> >>> >>>> >>>>> >>>>> I think RAM_SHARED can actually be that flag already - I mean, in all paths >>>>> that we may create anon mem (but not memory-backend-* objects), is it >>>>> always safe we always switch to RAM_SHARED from anon? >>>> >>>> Do you mean only setting the flag (-> anonymous shmem) or switching also to >>>> memfd, which is a bigger change? >>> >>> Switching to memfd. I thought anon shmem (mmap(MAP_SHARED)) is mostly the >>> same internally, if we create memfd then mmap(MAP_SHARED) on top of it, no? >> >> Memfd is Linux specific, keep that in mind. Apart from that there shouldn't >> be much difference between anon shmem and memfd (there are memory commit >> differences, though). > > Could you elaborate the memory commit difference and what does that imply > to QEMU's usage? Note how memfd code passed VM_NORESERVE to shmem_file_setup() and shmem_zero_setup() effectively doesn't (unless MAP_NORESERVE was specified IIRC). Not sure if the change makes a big impact in QEMU's usage, it's just one of these differences between memfd and shared anonymous memory. (responding to your "mostly the same"). > >> >> Of course, there is a difference between anon memory and shmem, for example >> regarding what viritofsd faced (e.g., KSM) recently. > > The four paths shouldn't be KSM target, AFAICT. Do you have a good overview of what is deduplicated in practice and why these don't apply? For example, I thought these functions are also used for hosting the BIOS, and that might just be deduplciated between VMs? Anyhow, there are obviously other differences with shmem vs. anonymous (THP handling, page fault performance, userfaultfd compatibility on older kernels) at least on Linux, but I have absolutely no clue how that would differ on other host OSes. None of them are major This is probably going to result in a bigger discussion, for which I don't have any time. So my opinion on it is above. Anyhow, this sounds like one of the suggestions I wouldn't suggest Steve to actually implement.
On Mon, Oct 07, 2024 at 06:23:26PM +0200, David Hildenbrand wrote: > > Yes I thought it could be unconditionally. We can discuss downside below, > > I think we can still use a new flag otherwise, but the idea would be the > > same, where I want the flag to be explicit in the callers not implicitly > > with the object type check, which I think can be hackish. > > I agree that the caller should specify it. > > But I don't think using shared memory where shared memory is not warranted > is a reasonable approach. > > I'm quite surprise you're considering such changes with unclear impacts on > other OSes besides Linux (Freedbsd? Windows that doeasn';t even support > shared memory?) just to make one corner-case QEMU use case happy. > > But I'm sure there are valid reasons why you had that idea, so I'm happy to > learn why using shared memory unconditionally here is better than providing > a clean alternative path with the feature enabled and memfd actually being > supported on the setup (e.g., newer Linux kernel). I was thinking whether cpr-transfer can be enabled by default, so whenever people want to use it, no code reset needed. It's also easier for Libvirt to not need to add yet another machine flags if possible. Currently this parameter is the only one left that needs to be manually enabled on src. It means if we can get rid of it then any QEMU based VM on Linux can do a cpr-transfer any time as long as QEMU supports it. Without it, this new parameter will need to be manually enabled otherwise another system code reboot / live migration needs to happen first without CPR, just to enable this flag. But yeah I don't think I think it all through, so I left my pure question. I think it looks still like an option, the other option if we still want to enable it by default is, keep the option, then only enable it on new machines that is based on Linux. OS dependency is definitely an issue. AFAICT CPR is only available for Linux anyway, but I'm happy to be corrected.. IOW, those chunk of new code (if only unconditionally done..) would need proper #ifdef, so that non-Linux OSes work like before. > > > > > > > > > > > > > > > > > > > > > > > > > > > I think RAM_SHARED can actually be that flag already - I mean, in all paths > > > > > > that we may create anon mem (but not memory-backend-* objects), is it > > > > > > always safe we always switch to RAM_SHARED from anon? > > > > > > > > > > Do you mean only setting the flag (-> anonymous shmem) or switching also to > > > > > memfd, which is a bigger change? > > > > > > > > Switching to memfd. I thought anon shmem (mmap(MAP_SHARED)) is mostly the > > > > same internally, if we create memfd then mmap(MAP_SHARED) on top of it, no? > > > > > > Memfd is Linux specific, keep that in mind. Apart from that there shouldn't > > > be much difference between anon shmem and memfd (there are memory commit > > > differences, though). > > > > Could you elaborate the memory commit difference and what does that imply > > to QEMU's usage? > > Note how memfd code passed VM_NORESERVE to shmem_file_setup() and > shmem_zero_setup() effectively doesn't (unless MAP_NORESERVE was specified > IIRC). > > Not sure if the change makes a big impact in QEMU's usage, it's just one of > these differences between memfd and shared anonymous memory. (responding to > your "mostly the same"). So yeah, I hoped the memory commit won't be a problem, because I think they should be corner case MRs, and should be small. Vram can take up to 16MB, that's the max I'm aware of, but indeed I don't know all the use cases to be sure. I think it means some tens of MBs can be accounted later during fault rather than upfront to fail QEMU from boot. Ideally mgmt apps should leave enough space for these ones, but if we worry on that we can stick with the current option (but create a new flag besides RAM_SHARED). > > > > > > > > > Of course, there is a difference between anon memory and shmem, for example > > > regarding what viritofsd faced (e.g., KSM) recently. > > > > The four paths shouldn't be KSM target, AFAICT. > > Do you have a good overview of what is deduplicated in practice and why > these don't apply? For example, I thought these functions are also used for > hosting the BIOS, and that might just be deduplciated between VMs? I was thinking KSM was for merging OS/App pages (rather than BIOSs, which are normally very, very small)? Though I could be wrong. > > Anyhow, there are obviously other differences with shmem vs. anonymous (THP > handling, page fault performance, userfaultfd compatibility on older > kernels) at least on Linux, but I have absolutely no clue how that would > differ on other host OSes. > > None of them are major > > This is probably going to result in a bigger discussion, for which I don't > have any time. So my opinion on it is above. > > Anyhow, this sounds like one of the suggestions I wouldn't suggest Steve to > actually implement. We don't need to make it a bigger discussion. If there's concern with it, we can stick with a new flag. The next question is whether if with a new flag we should enable it by default sometimes (e.g. on new machine types on Linux). But when with a new option, that can be discussed later. Thanks,
On 10/7/2024 11:36 AM, Peter Xu wrote: > On Mon, Sep 30, 2024 at 12:40:32PM -0700, Steve Sistare wrote: >> diff --git a/system/trace-events b/system/trace-events >> index 074d001..4669411 100644 >> --- a/system/trace-events >> +++ b/system/trace-events >> @@ -47,3 +47,6 @@ dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"P >> >> # cpu-throttle.c >> cpu_throttle_set(int new_throttle_pct) "set guest CPU throttled by %d%%" >> + >> +#physmem.c >> +ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu" > > This breaks 32bit build: > > ../system/trace-events: In function ‘_nocheck__trace_ram_block_add’: > ../system/trace-events:52:22: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 8 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=] > 52 | ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu" > | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > ...... > ../system/trace-events:52:22: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 9 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=] > 52 | ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu" > | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > ...... > ../system/trace-events:52:22: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 5 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=] > 52 | ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu" > | ^~~~~~~~~~~~~~~~ ~~~~~~~~~~~ > | | > | size_t {aka unsigned int} > ../system/trace-events:52:22: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 6 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=] > 52 | ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu" > | ^~~~~~~~~~~~~~~~ ~~~~~~~~~~ > | | > | size_t {aka unsigned int} > > Probably need to switch to %zu for size_t's. Sorry for not building 31-bit! And thanks for the tip about %zu, that's new to me - steve
diff --git a/hw/core/machine.c b/hw/core/machine.c index adaba17..a89a32b 100644 --- a/hw/core/machine.c +++ b/hw/core/machine.c @@ -460,6 +460,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp) ms->mem_merge = value; } +static int machine_get_anon_alloc(Object *obj, Error **errp) +{ + MachineState *ms = MACHINE(obj); + + return ms->anon_alloc; +} + +static void machine_set_anon_alloc(Object *obj, int value, Error **errp) +{ + MachineState *ms = MACHINE(obj); + + ms->anon_alloc = value; +} + static bool machine_get_usb(Object *obj, Error **errp) { MachineState *ms = MACHINE(obj); @@ -1078,6 +1092,11 @@ static void machine_class_init(ObjectClass *oc, void *data) object_class_property_set_description(oc, "mem-merge", "Enable/disable memory merge support"); + object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption", + &AnonAllocOption_lookup, + machine_get_anon_alloc, + machine_set_anon_alloc); + object_class_property_add_bool(oc, "usb", machine_get_usb, machine_set_usb); object_class_property_set_description(oc, "usb", diff --git a/include/hw/boards.h b/include/hw/boards.h index 5966069..5a87647 100644 --- a/include/hw/boards.h +++ b/include/hw/boards.h @@ -393,6 +393,7 @@ struct MachineState { bool enable_graphics; ConfidentialGuestSupport *cgs; HostMemoryBackend *memdev; + AnonAllocOption anon_alloc; /* * convenience alias to ram_memdev_id backend memory region * or to numa container memory region diff --git a/qapi/machine.json b/qapi/machine.json index a6b8795..d4a63f5 100644 --- a/qapi/machine.json +++ b/qapi/machine.json @@ -1898,3 +1898,17 @@ { 'command': 'x-query-interrupt-controllers', 'returns': 'HumanReadableText', 'features': [ 'unstable' ]} + +## +# @AnonAllocOption: +# +# An enumeration of the options for allocating anonymous guest memory. +# +# @mmap: allocate using mmap MAP_ANON +# +# @memfd: allocate using memfd_create +# +# Since: 9.2 +## +{ 'enum': 'AnonAllocOption', + 'data': [ 'mmap', 'memfd' ] } diff --git a/qemu-options.hx b/qemu-options.hx index d94e2cb..90ab943 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \ " nvdimm=on|off controls NVDIMM support (default=off)\n" " memory-encryption=@var{} memory encryption object to use (default=none)\n" " hmat=on|off controls ACPI HMAT support (default=off)\n" + " anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n" " memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n" " cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n", QEMU_ARCH_ALL) @@ -101,6 +102,16 @@ SRST Enables or disables ACPI Heterogeneous Memory Attribute Table (HMAT) support. The default is off. + ``anon-alloc=mmap|memfd`` + Allocate anonymous guest RAM using mmap MAP_ANON (the default) + or memfd_create. This option applies to memory allocated as a + side effect of creating various devices. It does not apply to + memory-backend-objects, whether explicitly specified on the + command line, or implicitly created by the -m command line + option. + + Some migration modes require anon-alloc=memfd. + ``memory-backend='id'`` An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options. Allows to use a memory backend as main RAM. diff --git a/system/physmem.c b/system/physmem.c index dc1db3a..174f7e0 100644 --- a/system/physmem.c +++ b/system/physmem.c @@ -47,6 +47,7 @@ #include "qemu/qemu-print.h" #include "qemu/log.h" #include "qemu/memalign.h" +#include "qemu/memfd.h" #include "exec/memory.h" #include "exec/ioport.h" #include "sysemu/dma.h" @@ -69,6 +70,8 @@ #include "qemu/pmem.h" +#include "qapi/qapi-types-migration.h" +#include "migration/options.h" #include "migration/vmstate.h" #include "qemu/range.h" @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) qemu_mutex_unlock_ramlist(); return; } + + } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD && + !object_dynamic_cast(new_block->mr->parent_obj.parent, + TYPE_MEMORY_BACKEND)) { + size_t max_length = new_block->max_length; + MemoryRegion *mr = new_block->mr; + const char *name = memory_region_name(mr); + + new_block->mr->align = QEMU_VMALLOC_ALIGN; + new_block->flags |= RAM_SHARED; + + if (new_block->fd == -1) { + new_block->fd = qemu_memfd_create(name, max_length + mr->align, + 0, 0, 0, errp); + } + + if (new_block->fd >= 0) { + int mfd = new_block->fd; + qemu_set_cloexec(mfd); + new_block->host = file_ram_alloc(new_block, max_length, mfd, + false, 0, errp); + } + if (!new_block->host) { + qemu_mutex_unlock_ramlist(); + return; + } + memory_try_enable_merging(new_block->host, new_block->max_length); + free_on_error = true; + } else { new_block->host = qemu_anon_ram_alloc(new_block->max_length, &new_block->mr->align, @@ -1932,6 +1964,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) ram_block_notify_add(new_block->host, new_block->used_length, new_block->max_length); } + trace_ram_block_add(memory_region_name(new_block->mr), new_block->flags, + new_block->fd, new_block->used_length, + new_block->max_length); return; out_free: diff --git a/system/trace-events b/system/trace-events index 074d001..4669411 100644 --- a/system/trace-events +++ b/system/trace-events @@ -47,3 +47,6 @@ dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"P # cpu-throttle.c cpu_throttle_set(int new_throttle_pct) "set guest CPU throttled by %d%%" + +#physmem.c +ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"
Allocate anonymous memory using mmap MAP_ANON or memfd_create depending on the value of the anon-alloc machine property. This option applies to memory allocated as a side effect of creating various devices. It does not apply to memory-backend-objects, whether explicitly specified on the command line, or implicitly created by the -m command line option. The memfd option is intended to support new migration modes, in which the memory region can be transferred in place to a new QEMU process, by sending the memfd file descriptor to the process. Memory contents are preserved, and if the mode also transfers device descriptors, then pages that are locked in memory for DMA remain locked. This behavior is a pre-requisite for supporting vfio, vdpa, and iommufd devices with the new modes. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> --- hw/core/machine.c | 19 +++++++++++++++++++ include/hw/boards.h | 1 + qapi/machine.json | 14 ++++++++++++++ qemu-options.hx | 11 +++++++++++ system/physmem.c | 35 +++++++++++++++++++++++++++++++++++ system/trace-events | 3 +++ 6 files changed, 83 insertions(+)