Message ID | 20211013103330.26869-1-david@redhat.com
---|---
Series | virtio-mem: Expose device memory via separate memslots
* David Hildenbrand (david@redhat.com) wrote:
> Based-on: 20211011175346.15499-1-david@redhat.com
>
> A virtio-mem device is represented by a single large RAM memory region
> backed by a single large mmap.
>
> Right now, we map that complete memory region into guest physical address
> space, resulting in a very large memory mapping, KVM memory slot, ...
> although only a small amount of memory might actually be exposed to the VM.
>
> For example, when starting a VM with a 1 TiB virtio-mem device that only
> exposes little device memory (e.g., 1 GiB) towards the VM initially,
> in order to hotplug more memory later, we waste a lot of memory on metadata
> for KVM memory slots (> 2 GiB!) and accompanying bitmaps. Although some
> optimizations in KVM are being worked on to reduce this metadata overhead
> on x86-64 in some cases, it remains a problem with nested VMs and there are
> other reasons why we would want to reduce the total number of memory slots
> to a reasonable minimum.
>
> We want to:
> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
>    inside QEMU KVM code where possible.
> b) Not always expose all device memory to the VM, to reduce the attack
>    surface of malicious VMs without using userfaultfd.
>
> So instead, expose the RAM memory region not via a single large mapping
> (consuming one memslot) but via multiple mappings, each consuming one
> memslot. To do that, we divide the RAM memory region into separate parts
> via aliases and only map the aliases we actually need into a device
> container. We have to make sure that QEMU won't silently merge the memory
> sections corresponding to the aliases (and thereby also the memslots),
> otherwise we lose atomic updates with KVM and vhost-user, which we deeply
> care about when adding/removing memory. Further, to get memslot accounting
> right, such merging is better avoided.
>
> Within the memslots, virtio-mem can (un)plug memory dynamically at a
> smaller granularity. So memslots are purely an optimization to tackle
> a) and b) above.
>
> Memslots are currently mapped once they fall into the usable device region
> (which grows/shrinks on demand, either when requesting to hotplug more
> memory or during/after reboots). In the future, with
> VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to (un)map aliases even
> more dynamically when (un)plugging device blocks.
>
>
> Adding a 500 GiB virtio-mem device and not hotplugging any memory results in:
>   0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>     0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>
> Requesting the VM to consume 2 GiB results in (note: the usable region size
> is bigger than 2 GiB, so 3 * 1 GiB memslots are required):
>   0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>     0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>       0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
>       0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
>       00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff

I've got a vague memory that there were some devices that didn't like
doing split IO across a memory region (or something) - some virtio
devices? Do you know if that's still true and if that causes a problem?
Dave

> Requesting the VM to consume 20 GiB results in:
>   0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>     0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>       0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
>       0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
>       00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
>       0000000200000000-000000023fffffff (prio 0, ram): alias virtio-mem-memslot-3 @mem0 00000000c0000000-00000000ffffffff
>       0000000240000000-000000027fffffff (prio 0, ram): alias virtio-mem-memslot-4 @mem0 0000000100000000-000000013fffffff
>       0000000280000000-00000002bfffffff (prio 0, ram): alias virtio-mem-memslot-5 @mem0 0000000140000000-000000017fffffff
>       00000002c0000000-00000002ffffffff (prio 0, ram): alias virtio-mem-memslot-6 @mem0 0000000180000000-00000001bfffffff
>       0000000300000000-000000033fffffff (prio 0, ram): alias virtio-mem-memslot-7 @mem0 00000001c0000000-00000001ffffffff
>       0000000340000000-000000037fffffff (prio 0, ram): alias virtio-mem-memslot-8 @mem0 0000000200000000-000000023fffffff
>       0000000380000000-00000003bfffffff (prio 0, ram): alias virtio-mem-memslot-9 @mem0 0000000240000000-000000027fffffff
>       00000003c0000000-00000003ffffffff (prio 0, ram): alias virtio-mem-memslot-10 @mem0 0000000280000000-00000002bfffffff
>       0000000400000000-000000043fffffff (prio 0, ram): alias virtio-mem-memslot-11 @mem0 00000002c0000000-00000002ffffffff
>       0000000440000000-000000047fffffff (prio 0, ram): alias virtio-mem-memslot-12 @mem0 0000000300000000-000000033fffffff
>       0000000480000000-00000004bfffffff (prio 0, ram): alias virtio-mem-memslot-13 @mem0 0000000340000000-000000037fffffff
>       00000004c0000000-00000004ffffffff (prio 0, ram): alias virtio-mem-memslot-14 @mem0 0000000380000000-00000003bfffffff
>       0000000500000000-000000053fffffff (prio 0, ram): alias virtio-mem-memslot-15 @mem0 00000003c0000000-00000003ffffffff
>       0000000540000000-000000057fffffff (prio 0, ram): alias virtio-mem-memslot-16 @mem0 0000000400000000-000000043fffffff
>       0000000580000000-00000005bfffffff (prio 0, ram): alias virtio-mem-memslot-17 @mem0 0000000440000000-000000047fffffff
>       00000005c0000000-00000005ffffffff (prio 0, ram): alias virtio-mem-memslot-18 @mem0 0000000480000000-00000004bfffffff
>       0000000600000000-000000063fffffff (prio 0, ram): alias virtio-mem-memslot-19 @mem0 00000004c0000000-00000004ffffffff
>       0000000640000000-000000067fffffff (prio 0, ram): alias virtio-mem-memslot-20 @mem0 0000000500000000-000000053fffffff
>
> Requesting the VM to consume 5 GiB and rebooting (note: usable region size
> will change during reboots) results in:
>   0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>     0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>       0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
>       0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
>       00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
>       0000000200000000-000000023fffffff (prio 0, ram): alias virtio-mem-memslot-3 @mem0 00000000c0000000-00000000ffffffff
>       0000000240000000-000000027fffffff (prio 0, ram): alias virtio-mem-memslot-4 @mem0 0000000100000000-000000013fffffff
>       0000000280000000-00000002bfffffff (prio 0, ram): alias virtio-mem-memslot-5 @mem0 0000000140000000-000000017fffffff
>
>
> In addition to other factors, we limit the number of memslots to 1024 per
> device and the size of one memslot to at least 1 GiB. So only a 1 TiB
> virtio-mem device could consume 1024 memslots in the "worst" case. To
> calculate a memslot limit for a device, we use a heuristic based on all
> available memslots for memory devices and the percentage of
> "device size":"total memory device area size". Users can further limit
> the maximum number of memslots that will be used by a device by setting
> the "max-memslots" property. It's expected to be set to "0" (auto) in most
> setups.
>
> In recent setups (e.g., KVM with ~32k memslots, vhost-user with ~4k
> memslots after this series), we'll get the biggest benefit. In special
> setups (e.g., older KVM, vhost kernel with 64 memslots), we'll get some
> benefit -- the individual memslots will be bigger.
>
> Future work:
> - vhost-user and libvhost-user optimizations for handling more memslots
>   more efficiently.
>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Ani Sinha <ani@anisinha.ca>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
> Cc: kvm@vger.kernel.org
>
> David Hildenbrand (15):
>   memory: Drop mapping check from
>     memory_region_get_ram_discard_manager()
>   kvm: Return number of free memslots
>   vhost: Return number of free memslots
>   memory: Allow for marking memory region aliases unmergeable
>   vhost: Don't merge unmergeable memory sections
>   memory-device: Move memory_device_check_addable() directly into
>     memory_device_pre_plug()
>   memory-device: Generalize memory_device_used_region_size()
>   memory-device: Support memory devices that consume a variable number
>     of memslots
>   vhost: Respect reserved memslots for memory devices when realizing a
>     vhost device
>   virtio-mem: Set the RamDiscardManager for the RAM memory region
>     earlier
>   virtio-mem: Fix typo in virito_mem_intersect_memory_section() function
>     name
>   virtio-mem: Expose device memory via separate memslots
>   vhost-user: Increase VHOST_USER_MAX_RAM_SLOTS to 496 with
>     CONFIG_VIRTIO_MEM
>   libvhost-user: Increase VHOST_USER_MAX_RAM_SLOTS to 4096
>   virtio-mem: Set "max-memslots" to 0 (auto) for the 6.2 machine
>
>  accel/kvm/kvm-all.c                       |  24 ++-
>  accel/stubs/kvm-stub.c                    |   4 +-
>  hw/core/machine.c                         |   1 +
>  hw/mem/memory-device.c                    | 167 +++++++++++++++---
>  hw/virtio/vhost-stub.c                    |   2 +-
>  hw/virtio/vhost-user.c                    |   7 +-
>  hw/virtio/vhost.c                         |  17 +-
>  hw/virtio/virtio-mem-pci.c                |  22 +++
>  hw/virtio/virtio-mem.c                    | 202 ++++++++++++++++++++--
>  include/exec/memory.h                     |  23 +++
>  include/hw/mem/memory-device.h            |  32 ++++
>  include/hw/virtio/vhost.h                 |   2 +-
>  include/hw/virtio/virtio-mem.h            |  29 +++-
>  include/sysemu/kvm.h                      |   2 +-
>  softmmu/memory.c                          |  35 +++-
>  stubs/qmp_memory_device.c                 |   5 +
>  subprojects/libvhost-user/libvhost-user.h |   7 +-
>  17 files changed, 499 insertions(+), 82 deletions(-)
>
> --
> 2.31.1
>
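For illustration only (not code from the series): the splitting described in the cover letter boils down to creating fixed-size aliases into the device's RAM memory region and mapping only as many of them into the device container as the usable region requires. The sketch below uses QEMU's memory API; map_memslot_aliases(), its parameters, and the use of memory_region_set_unmergeable() for the "unmergeable" marking are assumptions for this sketch, not the series' actual helper names.

#include "qemu/osdep.h"
#include "exec/memory.h"

/*
 * Sketch: carve the virtio-mem RAM region ("mem0") into fixed-size aliases,
 * one per potential memslot, and map only the aliases covering the usable
 * region into the device container. The memory_region_set_unmergeable()
 * call stands in for the marking added by the series ("memory: Allow for
 * marking memory region aliases unmergeable"); the exact name may differ.
 */
static void map_memslot_aliases(Object *owner, MemoryRegion *container,
                                MemoryRegion *mem0, uint64_t memslot_size,
                                uint64_t usable_size, MemoryRegion **aliases)
{
    const uint64_t region_size = memory_region_size(mem0);
    const unsigned int nb_memslots = region_size / memslot_size;
    /* Round the usable region up to the memslot granularity. */
    const unsigned int nb_needed = QEMU_ALIGN_UP(usable_size, memslot_size) /
                                   memslot_size;

    for (unsigned int i = 0; i < nb_memslots; i++) {
        g_autofree char *name = g_strdup_printf("virtio-mem-memslot-%u", i);

        aliases[i] = g_new0(MemoryRegion, 1);
        memory_region_init_alias(aliases[i], owner, name, mem0,
                                 i * memslot_size, memslot_size);
        /* Keep adjacent aliases (and thus memslots) from being merged. */
        memory_region_set_unmergeable(aliases[i], true);

        if (i < nb_needed) {
            /* Each mapped alias will consume one KVM/vhost memslot. */
            memory_region_add_subregion(container, i * memslot_size,
                                        aliases[i]);
        }
    }
}

With a 1 GiB memslot_size and a usable region slightly larger than 2 GiB, this would yield the three virtio-mem-memslot-0..2 aliases shown in the 2 GiB example above.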
On 13.10.21 21:03, Dr. David Alan Gilbert wrote:
> * David Hildenbrand (david@redhat.com) wrote:
>> Based-on: 20211011175346.15499-1-david@redhat.com
>>
>> A virtio-mem device is represented by a single large RAM memory region
>> backed by a single large mmap.
>>
>> Right now, we map that complete memory region into guest physical address
>> space, resulting in a very large memory mapping, KVM memory slot, ...
>> although only a small amount of memory might actually be exposed to the VM.
>>
>> For example, when starting a VM with a 1 TiB virtio-mem device that only
>> exposes little device memory (e.g., 1 GiB) towards the VM initially,
>> in order to hotplug more memory later, we waste a lot of memory on metadata
>> for KVM memory slots (> 2 GiB!) and accompanying bitmaps. Although some
>> optimizations in KVM are being worked on to reduce this metadata overhead
>> on x86-64 in some cases, it remains a problem with nested VMs and there are
>> other reasons why we would want to reduce the total number of memory slots
>> to a reasonable minimum.
>>
>> We want to:
>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
>>    inside QEMU KVM code where possible.
>> b) Not always expose all device memory to the VM, to reduce the attack
>>    surface of malicious VMs without using userfaultfd.
>>
>> So instead, expose the RAM memory region not via a single large mapping
>> (consuming one memslot) but via multiple mappings, each consuming one
>> memslot. To do that, we divide the RAM memory region into separate parts
>> via aliases and only map the aliases we actually need into a device
>> container. We have to make sure that QEMU won't silently merge the memory
>> sections corresponding to the aliases (and thereby also the memslots),
>> otherwise we lose atomic updates with KVM and vhost-user, which we deeply
>> care about when adding/removing memory. Further, to get memslot accounting
>> right, such merging is better avoided.
>>
>> Within the memslots, virtio-mem can (un)plug memory dynamically at a
>> smaller granularity. So memslots are purely an optimization to tackle
>> a) and b) above.
>>
>> Memslots are currently mapped once they fall into the usable device region
>> (which grows/shrinks on demand, either when requesting to hotplug more
>> memory or during/after reboots). In the future, with
>> VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to (un)map aliases even
>> more dynamically when (un)plugging device blocks.
>>
>>
>> Adding a 500 GiB virtio-mem device and not hotplugging any memory results in:
>>   0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>>     0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>>
>> Requesting the VM to consume 2 GiB results in (note: the usable region size
>> is bigger than 2 GiB, so 3 * 1 GiB memslots are required):
>>   0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>>     0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>>       0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
>>       0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
>>       00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
>
> I've got a vague memory that there were some devices that didn't like
> doing split IO across a memory region (or something) - some virtio
> devices? Do you know if that's still true and if that causes a problem?
Interesting point! I am not aware of any such issues, and I'd be surprised
if we still had such buggy devices, because the layout virtio-mem now
creates is very similar to the layout we'll automatically create with
ordinary DIMMs.

If we hotplug DIMMs, they end up consecutive in guest physical address
space as well, yet still have separate memory regions and require separate
memory slots. So that is very similar to a virtio-mem device now.

Maybe the catch is that it's hard to cross memory regions that are, e.g., >= 128 MiB
aligned, because ordinary allocations (e.g., via the buddy in Linux, which
supports <= 4 MiB pages) won't cross these blocks.
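To illustrate that alignment argument: a naturally aligned power-of-two allocation of at most 4 MiB can never cross a boundary with a larger power-of-two alignment, such as a 128 MiB Linux memory section or a 1 GiB virtio-mem memslot. A minimal standalone check (not from the thread; the example address is arbitrary):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MiB (1024ULL * 1024)
#define GiB (1024 * MiB)

/* Does [start, start + len) cross a boundary of the given alignment? */
static bool crosses_boundary(uint64_t start, uint64_t len, uint64_t align)
{
    return (start / align) != ((start + len - 1) / align);
}

int main(void)
{
    /* A naturally aligned 4 MiB allocation ending exactly at 6 GiB. */
    const uint64_t alloc = 6 * GiB - 4 * MiB;

    assert(alloc % (4 * MiB) == 0);                       /* natural alignment */
    assert(!crosses_boundary(alloc, 4 * MiB, 128 * MiB)); /* memory section   */
    assert(!crosses_boundary(alloc, 4 * MiB, 1 * GiB));   /* memslot boundary */
    return 0;
}

So a single buddy allocation of that size always stays within one memslot alias.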