Message ID | 20210607195430.48228-1-david@redhat.com (mailing list archive) |
---|---|
Series | mm/memory_hotplug: "auto-movable" online policy and memory groups |
On Mon, Jun 07, 2021 at 09:54:18PM +0200, David Hildenbrand wrote:
> Hi,
>
> this series aims at improving in-kernel auto-online support. It tackles the
> fundamental problems that:

Hi David,

the idea sounds good to me, and I like that this series takes away part of the
responsibility from the user to know where the memory should go.
I think the kernel is a much better fit for that as it has all the required
information to balance things.

I also glanced over the series and besides some things here and there the
whole approach looks sane.
I plan to have a look into it in a few days, just have some high level questions
for the time being:

> 1) We can create zone imbalances when onlining all memory blindly to
>    ZONE_MOVABLE, in the worst case crashing the system. We have to know
>    upfront how much memory we are going to hotplug such that we can
>    safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
>    via "online_movable". This is far from practical and only applicable in
>    limited setups -- like inside VMs under the RHV/oVirt hypervisor which
>    will never hotplug more than 3 times the boot memory (and the
>    limitation is only in place due to the Linux limitation).

Could you give more insight about the problems created by zone imbalances (e.g.,
a lot of movable memory and little kernel memory)?

> 2) We see more setups that implement dynamic VM resizing, hot(un)plugging
>    memory to resize VM memory. In these setups, we might hotplug a lot of
>    memory, but it might happen in various small steps in both directions
>    (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
>    primary driver of this upstream right now, performing such dynamic
>    resizing NUMA-aware via multiple virtio-mem devices.
>
>    Onlining all hotplugged memory to ZONE_NORMAL means we basically have
>    no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
>    easily run into zone imbalances when growing a VM. We want a mixture,
>    and we want as much memory as reasonable/configured in ZONE_MOVABLE.
>
> 3) Memory devices consist of 1..X memory block devices, however, the
>    kernel doesn't really track the relationship. Consequently, also user
>    space has no idea. We want to make per-device decisions. As one
>    example, for memory hotunplug it doesn't make sense to use a mixture of
>    zones within a single DIMM: we want all MOVABLE if possible, otherwise
>    all !MOVABLE, because any !MOVABLE part will easily block the DIMM from
>    getting hotunplugged. As another example, virtio-mem operates on
>    individual units that span 1..X memory blocks. Similar to a DIMM, we
>    want a unit to either be all MOVABLE or !MOVABLE. Further, we want
>    as much memory of a virtio-mem device to be MOVABLE as possible.

So, a virtio-mem unit could be seen as a DIMM, right?

> 4) We want memory onlining to be done right from the kernel while adding
>    memory; for example, this is required for fast memory hotplug for
>    drivers that add individual memory blocks, like virtio-mem. We want a
>    way to configure a policy in the kernel and avoid implementing advanced
>    policies in user space.

"we want memory onlining to be done right from the kernel while adding memory"

Isn't that always the case when a driver adds memory? The user has no
interaction with that, right?

> The auto-onlining support we have in the kernel is not sufficient. All we
> have is a) online everything movable (online_movable) b) online everything
> !movable (online_kernel) c) keep zones contiguous (online). This series
> allows configuring c) to mean instead "online movable if possible according
> to the configuration, driven by a maximum MOVABLE:KERNEL ratio" -- a new
> onlining policy.
>
> This series does 3 things:
>
> 1) Introduces the "auto-movable" online policy that initially operates on
>    individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
>    to make a decision whether a memory block will be onlined to
>    ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
>    memory does not allow for more MOVABLE memory (details in the
>    patches). CMA memory is treated like MOVABLE memory.

How would a user know which ratio is sane? Could we add some info in the
Docu part that kinda sets some "basic" rules?

> 2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
>    groups and uses group information to make decisions in the
>    "auto-movable" online policy across memory blocks of a single memory
>    device (modeled as memory group).

So, the distinction being that a DIMM cannot grow larger but we can add more
memory to a virtio-mem unit? I feel I am missing some insight here.

> 3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
>    allowing ZONE_NORMAL memory within a dynamic memory group to allow for
>    more ZONE_MOVABLE memory within the same memory group. The target use
>    case is dynamic VM resizing using virtio-mem.

Sorry, I got lost in this one. Care to explain a bit more?

> The target usage will be:
>
> 1) Linux boots with "mhp_default_online_type=offline"
>
> 2) User space (e.g., systemd unit) configures memory onlining (according
>    to a config file and system properties), for example:
>    * Setting memory_hotplug.online_policy=auto-movable
>    * Setting memory_hotplug.auto_movable_ratio=301
>    * Setting memory_hotplug.auto_movable_numa_aware=true

I think we would need to document those in order to let the user know what
is best for them, e.g., when do we want to enable auto_movable_numa_aware etc.

> For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
> 301% results in the following layout:
>     Memory block 1-15:    DMA32   (early)
>     Memory block 32-47:   Normal  (early)
>     Memory block 48-79:   Movable (DIMM 0)
>     Memory block 80-111:  Movable (DIMM 1)
>     Memory block 112-143: Movable (DIMM 2)
>     Memory block 144-175: Normal  (DIMM 3)
>     Memory block 176-207: Normal  (DIMM 4)
>     ... all Normal
>     (-> hotplugged Normal memory does not allow for more Movable memory)

Uhm, I am sorry for being dense here:

On x86_64, 4GB = 32 sections (of 128MB each). Why do the memblocks span from
#1 to #47?

> For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
> will result in the following layout:
>     Memory block 1-15:    DMA32   (early)
>     Memory block 32-47:   Normal  (early)
>     Memory block 48-143:  Movable (virtio-mem, first 12 GiB)
>     Memory block 144:     Normal  (virtio-mem, next 128 MiB)
>     Memory block 145-147: Movable (virtio-mem, next 384 MiB)
>     Memory block 148:     Normal  (virtio-mem, next 128 MiB)
>     Memory block 149-151: Movable (virtio-mem, next 384 MiB)
>     ... Normal/Movable mixture as above
>     (-> hotplugged Normal memory allows for more Movable memory within
>         the same device)
>
> Which gives us maximum flexibility when dynamically growing/shrinking a
> VM in smaller steps. When shrinking, virtio-mem will prioritize unplug of
> MOVABLE memory with [1] sent last week, such that we won't accidentally
> trigger zone imbalances in more complicated setups that involve multiple
> virtio-mem devices.
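
The two layouts quoted above can be reproduced with a small model of the ratio check. This is only a sketch under simplifying assumptions, not the kernel's implementation: the function names are hypothetical, all boot memory is counted as KERNEL memory, and CMA, NUMA awareness and per-zone accounting are ignored. A unit goes to Movable only if the resulting MOVABLE:KERNEL ratio stays at or below the configured percentage; in the static case hotplugged Normal memory is not counted as KERNEL memory, in the dynamic case it is.

```python
MIB = 1 << 20
GIB = 1 << 30
BLOCK = 128 * MIB  # memory block size used in the quoted example


def online_static(kernel_early, unit_sizes, ratio_percent):
    """Static groups (e.g., DIMMs): each unit is onlined as a whole.

    Simplified model: hotplugged Normal memory is NOT added to the
    KERNEL side, so it never allows for more Movable memory.
    """
    movable = 0
    zones = []
    for size in unit_sizes:
        if (movable + size) * 100 <= kernel_early * ratio_percent:
            movable += size
            zones.append("Movable")
        else:
            zones.append("Normal")
    return zones


def online_dynamic(kernel_early, n_blocks, ratio_percent, block=BLOCK):
    """Dynamic group (e.g., one virtio-mem device), one block at a time.

    Simplified model: a block onlined to Normal IS added to the KERNEL
    side, which later allows for more Movable memory in the same device.
    """
    kernel = kernel_early
    movable = 0
    zones = []
    for _ in range(n_blocks):
        if (movable + block) * 100 <= kernel * ratio_percent:
            movable += block
            zones.append("Movable")
        else:
            kernel += block
            zones.append("Normal")
    return zones


# 4 GiB VM, ratio 301%, five 4 GiB DIMMs
print(online_static(4 * GIB, [4 * GIB] * 5, 301))

# 4 GiB VM, ratio 301%, one virtio-mem device growing block by block;
# index 0 corresponds to memory block 48 in the quoted layout
dyn = online_dynamic(4 * GIB, 104, 301)
print(dyn[94:104])
```

With a ratio of 301% and 4 GiB of boot memory, the static model onlines three DIMMs to Movable (12 GiB : 4 GiB = 300%) and every later DIMM to Normal. The dynamic model onlines the first 96 blocks (12 GiB) to Movable and then alternates one Normal block with three Movable blocks, matching the quoted virtio-mem layout.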
On 08.06.21 11:42, Oscar Salvador wrote:
> On Mon, Jun 07, 2021 at 09:54:18PM +0200, David Hildenbrand wrote:
>> Hi,
>>
>> this series aims at improving in-kernel auto-online support. It tackles the
>> fundamental problems that:
>
> Hi David,
>
> the idea sounds good to me, and I like that this series takes away part of the
> responsibility from the user to know where the memory should go.
> I think the kernel is a much better fit for that as it has all the required
> information to balance things.
>
> I also glanced over the series and besides some things here and there the
> whole approach looks sane.
> I plan to have a look into it in a few days, just have some high level questions
> for the time being:

Hi Oscar,

>
>> 1) We can create zone imbalances when onlining all memory blindly to
>>    ZONE_MOVABLE, in the worst case crashing the system. We have to know
>>    upfront how much memory we are going to hotplug such that we can
>>    safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
>>    via "online_movable". This is far from practical and only applicable in
>>    limited setups -- like inside VMs under the RHV/oVirt hypervisor which
>>    will never hotplug more than 3 times the boot memory (and the
>>    limitation is only in place due to the Linux limitation).
>
> Could you give more insight about the problems created by zone imbalances (e.g.,
> a lot of movable memory and little kernel memory)?

I just updated memory-hotplug.rst exactly for that purpose :)

https://lkml.kernel.org/r/20210525102604.8770-1-david@redhat.com

There, also safe zone ratios and "usually well known values" are given.
I can link it in the next cover letter.

>
>> 2) We see more setups that implement dynamic VM resizing, hot(un)plugging
>>    memory to resize VM memory. In these setups, we might hotplug a lot of
>>    memory, but it might happen in various small steps in both directions
>>    (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
>>    primary driver of this upstream right now, performing such dynamic
>>    resizing NUMA-aware via multiple virtio-mem devices.
>>
>>    Onlining all hotplugged memory to ZONE_NORMAL means we basically have
>>    no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
>>    easily run into zone imbalances when growing a VM. We want a mixture,
>>    and we want as much memory as reasonable/configured in ZONE_MOVABLE.
>>
>> 3) Memory devices consist of 1..X memory block devices, however, the
>>    kernel doesn't really track the relationship. Consequently, also user
>>    space has no idea. We want to make per-device decisions. As one
>>    example, for memory hotunplug it doesn't make sense to use a mixture of
>>    zones within a single DIMM: we want all MOVABLE if possible, otherwise
>>    all !MOVABLE, because any !MOVABLE part will easily block the DIMM from
>>    getting hotunplugged. As another example, virtio-mem operates on
>>    individual units that span 1..X memory blocks. Similar to a DIMM, we
>>    want a unit to either be all MOVABLE or !MOVABLE. Further, we want
>>    as much memory of a virtio-mem device to be MOVABLE as possible.
>
> So, a virtio-mem unit could be seen as a DIMM, right?

It's a bit more complicated. Each individual unit (e.g., a 128 MiB
memory block) is the smallest granularity we can add/remove of that
device. So such a unit is somewhat like a DIMM. However, all "units" of
the device can interact -- it's a single memory device.

>
>> 4) We want memory onlining to be done right from the kernel while adding
>>    memory; for example, this is required for fast memory hotplug for
>>    drivers that add individual memory blocks, like virtio-mem. We want a
>>    way to configure a policy in the kernel and avoid implementing advanced
>>    policies in user space.
>
> "we want memory onlining to be done right from the kernel while adding memory"
>
> Isn't that always the case when a driver adds memory? The user has no
> interaction with that, right?

Well, with auto-onlining in the kernel disabled, user space has to do
the onlining -- for example via udev rules right now in major distributions.

But there are also users that always want to online manually in user
space to select a zone. Most prominently standby memory on s390x, but
also in some cases dax/kmem memory. But these two are really corner
cases. In general, we want hotplugged memory to be onlined immediately.

>
>> The auto-onlining support we have in the kernel is not sufficient. All we
>> have is a) online everything movable (online_movable) b) online everything
>> !movable (online_kernel) c) keep zones contiguous (online). This series
>> allows configuring c) to mean instead "online movable if possible according
>> to the configuration, driven by a maximum MOVABLE:KERNEL ratio" -- a new
>> onlining policy.
>>
>> This series does 3 things:
>>
>> 1) Introduces the "auto-movable" online policy that initially operates on
>>    individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
>>    to make a decision whether a memory block will be onlined to
>>    ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
>>    memory does not allow for more MOVABLE memory (details in the
>>    patches). CMA memory is treated like MOVABLE memory.
>
> How would a user know which ratio is sane? Could we add some info in the
> Docu part that kinda sets some "basic" rules?

Again, that currently resides in the memory-hotplug.rst overhaul.

>
>> 2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
>>    groups and uses group information to make decisions in the
>>    "auto-movable" online policy across memory blocks of a single memory
>>    device (modeled as memory group).
>
> So, the distinction being that a DIMM cannot grow larger but we can add more
> memory to a virtio-mem unit? I feel I am missing some insight here.

Right, the relevant patch contains more info.

You either plug or unplug a DIMM (or a NUMA node which spans multiple
DIMMs) -- both are ACPI memory devices that span multiple physical
regions. You cannot unplug parts of a DIMM or grow it. "static" as also
expressed by ACPI code ("adds" and "removes" all memory device memory in
one go).

virtio-mem behaves differently, as it's a single physical memory region
in which we dynamically add or remove memory. The granularity in which
we add/remove memory from Linux is a "unit". In the simplest case, it's
just a single memory block (e.g., 128 MiB). So it's a memory device that
can grow/shrink in the given unit -- "dynamic".

>
>> 3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
>>    allowing ZONE_NORMAL memory within a dynamic memory group to allow for
>>    more ZONE_MOVABLE memory within the same memory group. The target use
>>    case is dynamic VM resizing using virtio-mem.
>
> Sorry, I got lost in this one. Care to explain a bit more?

The virtio-mem example below should make this a bit clearer (in
addition to the relevant patch), especially in contrast to static memory
devices like DIMMs.

Key is that a single virtio-mem device is a "dynamic memory group" in
which memory can get added/removed dynamically in a given unit
granularity. And we want to special case that type of device to have as
much memory of a virtio-mem device being MOVABLE as possible (and
configured).

>
>> The target usage will be:
>>
>> 1) Linux boots with "mhp_default_online_type=offline"
>>
>> 2) User space (e.g., systemd unit) configures memory onlining (according
>>    to a config file and system properties), for example:
>>    * Setting memory_hotplug.online_policy=auto-movable
>>    * Setting memory_hotplug.auto_movable_ratio=301
>>    * Setting memory_hotplug.auto_movable_numa_aware=true
>
> I think we would need to document those in order to let the user know what
> is best for them, e.g., when do we want to enable auto_movable_numa_aware etc.

Yes, as mentioned below, a memory-hotplug.rst update will follow once
the overhaul is done. The respective patch contains more information.

>
>> For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
>> 301% results in the following layout:
>>     Memory block 1-15:    DMA32   (early)
>>     Memory block 32-47:   Normal  (early)
>>     Memory block 48-79:   Movable (DIMM 0)
>>     Memory block 80-111:  Movable (DIMM 1)
>>     Memory block 112-143: Movable (DIMM 2)
>>     Memory block 144-175: Normal  (DIMM 3)
>>     Memory block 176-207: Normal  (DIMM 4)
>>     ... all Normal
>>     (-> hotplugged Normal memory does not allow for more Movable memory)
>
> Uhm, I am sorry for being dense here:
>
> On x86_64, 4GB = 32 sections (of 128MB each). Why do the memblocks span from
> #1 to #47?

Sorry, it's actually "Memory block 0-15", which gives us 0-15 and 32-47
== 32 memory blocks corresponding to boot memory. Note that the absent
memory blocks 16-31 should correspond to the PCI hole.

Thanks Oscar!
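
The block numbering in that last answer can be checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: a 128 MiB memory block size (as in the thread) and a guest physical layout with 2 GiB of RAM below a 2 GiB PCI hole and the remaining 2 GiB of boot memory starting at 4 GiB. That layout is an assumption chosen to match the quoted block numbers, and `block_range` is a hypothetical helper, not a kernel function.

```python
MIB = 1 << 20
GIB = 1 << 30
BLOCK_SIZE = 128 * MIB  # memory block size assumed in the thread


def block_range(start, size, block_size=BLOCK_SIZE):
    """Return (first, last) memory block number covering [start, start + size)."""
    first = start // block_size
    last = (start + size - 1) // block_size
    return first, last


# Assumed guest physical layout for a 4 GiB VM (not taken from the thread):
low = block_range(0, 2 * GIB)          # boot memory below the PCI hole
hole = block_range(2 * GIB, 2 * GIB)   # PCI hole, no memory blocks present
high = block_range(4 * GIB, 2 * GIB)   # boot memory above 4 GiB
dimm0 = block_range(6 * GIB, 4 * GIB)  # first hotplugged 4 GiB DIMM

boot_blocks = (low[1] - low[0] + 1) + (high[1] - high[0] + 1)

print("low:", low, "hole:", hole, "high:", high, "DIMM 0:", dimm0)
print("boot memory blocks:", boot_blocks)
```

Under these assumptions the boot memory covers blocks 0-15 and 32-47 (32 blocks, i.e. 4 GiB), the hole covers blocks 16-31, and the first DIMM starts at block 48.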