[RFC,0/4] mm: place pages to the freelist tail when onlining and undoing isolation

Message ID 20200916183411.64756-1-david@redhat.com

Message

David Hildenbrand Sept. 16, 2020, 6:34 p.m. UTC
When adding separate memory blocks via add_memory*() and onlining them
immediately, the metadata (especially the memmap) of the next block will be
placed onto one of the just added+onlined blocks. This creates a chain
of unmovable allocations: if the last memory block cannot get
offlined+removed, neither can any of the memory blocks that depend on it.
We directly have unmovable allocations all over the place.

This can be observed quite easily using virtio-mem; however, it can also
be observed when using DIMMs. The freshly onlined pages will usually be
placed at the head of the freelists, meaning they will be allocated next,
usually making the just-added memory immediately un-removable. The
fresh pages are cold, so preferring to allocate other (possibly hot) pages
first also feels like the natural thing to do.

The same applies to the Hyper-V balloon, the Xen balloon, and ppc64 dlpar:
when adding separate, successive memory blocks, each memory block will have
unmovable allocations on it - for example, gigantic pages will fail to
allocate.

While ZONE_NORMAL doesn't provide any guarantees that memory can get
offlined+removed again (any kind of fragmentation with unmovable
allocations is possible), there are many scenarios (hotplugging a lot of
memory, running a workload, hotunplugging some memory/as much as possible)
where we can offline+remove quite a lot with this patch set.
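
To make the mechanism concrete, here is a minimal sketch of the core idea
(illustrative helper only, not the exact code from the patches): everything
boils down to whether a freed page is queued via list_add() or
list_add_tail() on the per-order, per-migratetype buddy freelist - pages at
the head are handed out first, pages at the tail last.

	static void buddy_queue_page(struct zone *zone, struct page *page,
				     unsigned int order, int migratetype,
				     bool to_tail)
	{
		struct free_area *area = &zone->free_area[order];

		if (to_tail) {
			/* Hand this page out as late as possible. */
			list_add_tail(&page->lru, &area->free_list[migratetype]);
		} else {
			/* Default today: hand this page out next. */
			list_add(&page->lru, &area->free_list[migratetype]);
		}
		area->nr_free++;
	}

The patches thread such a decision (as a flag to __free_one_page()) through
the paths listed below, so that freshly onlined and un-isolated pages end up
at the tail.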

a) To visualize the problem, a very simple example:

Start a VM with 4GB and 8GB of virtio-mem memory:

	[root@localhost ~]# lsmem
	RANGE                                 SIZE  STATE REMOVABLE  BLOCK
	0x0000000000000000-0x00000000bfffffff   3G online       yes   0-23
	0x0000000100000000-0x000000033fffffff   9G online       yes 32-103
	
	Memory block size:       128M
	Total online memory:      12G
	Total offline memory:      0B

Then try to unplug as much as possible using virtio-mem. Observe which
memory blocks are still around. Without this patch set:

	[root@localhost ~]# lsmem
	RANGE                                  SIZE  STATE REMOVABLE   BLOCK
	0x0000000000000000-0x00000000bfffffff    3G online       yes    0-23
	0x0000000100000000-0x000000013fffffff    1G online       yes   32-39
	0x0000000148000000-0x000000014fffffff  128M online       yes      41
	0x0000000158000000-0x000000015fffffff  128M online       yes      43
	0x0000000168000000-0x000000016fffffff  128M online       yes      45
	0x0000000178000000-0x000000017fffffff  128M online       yes      47
	0x0000000188000000-0x0000000197ffffff  256M online       yes   49-50
	0x00000001a0000000-0x00000001a7ffffff  128M online       yes      52
	0x00000001b0000000-0x00000001b7ffffff  128M online       yes      54
	0x00000001c0000000-0x00000001c7ffffff  128M online       yes      56
	0x00000001d0000000-0x00000001d7ffffff  128M online       yes      58
	0x00000001e0000000-0x00000001e7ffffff  128M online       yes      60
	0x00000001f0000000-0x00000001f7ffffff  128M online       yes      62
	0x0000000200000000-0x0000000207ffffff  128M online       yes      64
	0x0000000210000000-0x0000000217ffffff  128M online       yes      66
	0x0000000220000000-0x0000000227ffffff  128M online       yes      68
	0x0000000230000000-0x0000000237ffffff  128M online       yes      70
	0x0000000240000000-0x0000000247ffffff  128M online       yes      72
	0x0000000250000000-0x0000000257ffffff  128M online       yes      74
	0x0000000260000000-0x0000000267ffffff  128M online       yes      76
	0x0000000270000000-0x0000000277ffffff  128M online       yes      78
	0x0000000280000000-0x0000000287ffffff  128M online       yes      80
	0x0000000290000000-0x0000000297ffffff  128M online       yes      82
	0x00000002a0000000-0x00000002a7ffffff  128M online       yes      84
	0x00000002b0000000-0x00000002b7ffffff  128M online       yes      86
	0x00000002c0000000-0x00000002c7ffffff  128M online       yes      88
	0x00000002d0000000-0x00000002d7ffffff  128M online       yes      90
	0x00000002e0000000-0x00000002e7ffffff  128M online       yes      92
	0x00000002f0000000-0x00000002f7ffffff  128M online       yes      94
	0x0000000300000000-0x0000000307ffffff  128M online       yes      96
	0x0000000310000000-0x0000000317ffffff  128M online       yes      98
	0x0000000320000000-0x0000000327ffffff  128M online       yes     100
	0x0000000330000000-0x000000033fffffff  256M online       yes 102-103
	
	Memory block size:       128M
	Total online memory:     8.1G
	Total offline memory:      0B

With this patch set:

	[root@localhost ~]# lsmem
	RANGE                                 SIZE  STATE REMOVABLE BLOCK
	0x0000000000000000-0x00000000bfffffff   3G online       yes  0-23
	0x0000000100000000-0x000000013fffffff   1G online       yes 32-39
	
	Memory block size:       128M
	Total online memory:       4G
	Total offline memory:      0B

All memory can get unplugged and all memory blocks can get removed. Of course,
no workload ran and the system was basically idle, but it highlights the
issue - the fairly deterministic chain of unmovable allocations. When a
huge page for the 2MB memmap is needed, a just-onlined 4MB page will
be split. The remaining 2MB page will be used for the memmap of the next
memory block. So one memory block will hold the memmaps of the two following
memory blocks. Finally, the pages of the last-onlined memory block will get
used for the next bigger allocations - if any such allocation is unmovable,
all dependent memory blocks cannot get unplugged and removed until that
allocation is gone.
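
(For reference, the 2MB memmap figure is simple arithmetic, assuming 4KiB
base pages and a 64-byte struct page as on x86-64 - roughly 1/64 of the
hotplugged memory, which is also why 128GB of hotplugged memory needs about
2GB of memmap:)

	#define BLOCK_SIZE	(128UL << 20)	/* 128MB memory block */
	#define BASE_PAGE_SIZE	4096UL		/* 4KiB base pages */
	#define STRUCT_PAGE_SZ	64UL		/* sizeof(struct page), x86-64 */

	/* 128MB / 4KiB * 64 bytes = 2MB of memmap per memory block. */
	static const unsigned long memmap_bytes =
		BLOCK_SIZE / BASE_PAGE_SIZE * STRUCT_PAGE_SZ;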

Note that with bigger memory blocks (e.g., 256MB), *all* memory
blocks are dependent and none can get unplugged again!

b) Experiment with memory intensive workload

I performed an experiment with an older version of this patch set
(before we used undo_isolate_page_range() in online_pages()):
I hotplugged 56GB to a VM with an initial 4GB, onlining all memory to
ZONE_NORMAL right from the kernel when adding it. I then ran various
memory-intensive workloads that consumed most system memory for a total of
45 minutes. Once finished, I tried to unplug as much memory as possible.

With this change, I am able to remove 413 out of 448 added memory blocks via
virtio-mem (adding individual 128MB memory blocks), and 380 out of 448 added
memory blocks via individual (256MB) DIMMs. (I don't have any numbers
without this patch set, but looking at the above example, it's at most half
of the 448 memory blocks for virtio-mem, and most probably none for DIMMs.)

Again, there are workloads that might behave very differently due to the
nature of ZONE_NORMAL.

c) Future work:
- I'll be looking into avoiding reporting freshly onlined pages via the
  free page reporting framework. They are unbacked in the hypervisor, so
  reporting them isn't necessary (and might actually be bad for performance
  in some future use cases in the hypervisor).
- I'll be looking into being able to tell the OS that some pages are fresh
  (e.g., via alloc_contig_range() in virtio-mem, or when freeing
  balloon-inflated memory in a ballooning driver), such that we will skip
  reporting them via free page reporting (marking them reported) and place
  them at the tail of the freelist.
- virtio-mem will soon also support ZONE_MOVABLE, however, especially
  when hotplugging a lot of memory (as in the experiment), a considerable
  amount of memory will have to remain in ZONE_NORMAL - so this change
  is relevant in any case.

I'm sending this as an RFC because, in its current form and for simplicity,
it affects not only memory onlining but also:
- Other users of undo_isolate_page_range(): Pages are always placed to the
  tail.
-- When memory offlining fails
-- When memory isolation fails after having isolated some pageblocks
-- When alloc_contig_range() either succeeds or fails
- Other users of __putback_isolated_page(): Pages are always placed to the
  tail.
-- Free page reporting
- Other users of __free_pages_core()
-- AFAIK, any memory that is getting exposed to the buddy during boot.
   IIUC, we will now usually allocate memory from lower addresses within
   a zone first (especially during boot).
- Other users of generic_online_page()
-- Hyper-V balloon

Let's see if there are concerns for these users with this approach.
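
To illustrate the shared plumbing, here is a rough sketch of what one of
these users could look like once __free_one_page() takes a flag instead of
the current "report" bool (flag name as used later in this thread; the
exact flag set and final signature are defined by the individual patches):

	/* Sketch only - see patch #2 for the real version. */
	void __putback_isolated_page(struct page *page, unsigned int order,
				     int mt)
	{
		struct zone *zone = page_zone(page);

		/* The zone lock is held by the caller, as today. */
		__free_one_page(page, page_to_pfn(page), zone, order, mt,
				FOP_TO_TAIL);
	}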

David Hildenbrand (4):
  mm/page_alloc: convert "report" flag of __free_one_page() to a proper
    flag
  mm/page_alloc: place pages to tail in __putback_isolated_page()
  mm/page_alloc: always move pages to the tail of the freelist in
    unset_migratetype_isolate()
  mm/page_alloc: place pages to tail in __free_pages_core()

 include/linux/page-isolation.h |   2 +
 mm/page_alloc.c                | 102 +++++++++++++++++++++++++--------
 mm/page_isolation.c            |   8 ++-
 3 files changed, 86 insertions(+), 26 deletions(-)

Comments

Oscar Salvador Sept. 16, 2020, 6:50 p.m. UTC | #1
On 2020-09-16 20:34, David Hildenbrand wrote:
> When adding separate memory blocks via add_memory*() and onlining them
> immediately, the metadata (especially the memmap) of the next block 
> will be
> placed onto one of the just added+onlined block. This creates a chain
> of unmovable allocations: If the last memory block cannot get
> offlined+removed() so will all dependant ones. We directly have 
> unmovable
> allocations all over the place.
> 
> This can be observed quite easily using virtio-mem, however, it can 
> also
> be observed when using DIMMs. The freshly onlined pages will usually be
> placed to the head of the freelists, meaning they will be allocated 
> next,
> turning the just-added memory usually immediately un-removable. The
> fresh pages are cold, prefering to allocate others (that might be hot)
> also feels to be the natural thing to do.
> 
> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: 
> when
> adding separate, successive memory blocks, each memory block will have
> unmovable allocations on them - for example gigantic pages will fail to
> allocate.
> 
> While the ZONE_NORMAL doesn't provide any guarantees that memory can 
> get
> offlined+removed again (any kind of fragmentation with unmovable
> allocations is possible), there are many scenarios (hotplugging a lot 
> of
> memory, running workload, hotunplug some memory/as much as possible) 
> where
> we can offline+remove quite a lot with this patchset.

Hi David,

I did not read through the patchset yet, so sorry if the question is 
nonsense, but is this not trying to fix the same issue the vmemmap 
patches did? [1]

I was about to give it a new respin now that the hwpoison stuff has been 
settled.

[1] https://patchwork.kernel.org/cover/11059175/
>
David Hildenbrand Sept. 16, 2020, 7:31 p.m. UTC | #2
> Am 16.09.2020 um 20:50 schrieb osalvador@suse.de:
> 
> On 2020-09-16 20:34, David Hildenbrand wrote:
>> When adding separate memory blocks via add_memory*() and onlining them
>> immediately, the metadata (especially the memmap) of the next block will be
>> placed onto one of the just added+onlined block. This creates a chain
>> of unmovable allocations: If the last memory block cannot get
>> offlined+removed() so will all dependant ones. We directly have unmovable
>> allocations all over the place.
>> This can be observed quite easily using virtio-mem, however, it can also
>> be observed when using DIMMs. The freshly onlined pages will usually be
>> placed to the head of the freelists, meaning they will be allocated next,
>> turning the just-added memory usually immediately un-removable. The
>> fresh pages are cold, prefering to allocate others (that might be hot)
>> also feels to be the natural thing to do.
>> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
>> adding separate, successive memory blocks, each memory block will have
>> unmovable allocations on them - for example gigantic pages will fail to
>> allocate.
>> While the ZONE_NORMAL doesn't provide any guarantees that memory can get
>> offlined+removed again (any kind of fragmentation with unmovable
>> allocations is possible), there are many scenarios (hotplugging a lot of
>> memory, running workload, hotunplug some memory/as much as possible) where
>> we can offline+remove quite a lot with this patchset.
> 
> Hi David,
> 

Hi Oscar.

> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]

Not nonsense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it's not completely ideal, especially for single memory blocks.

With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable allocations (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug: you have directly fragmented all hotplugged memory.

Of course, there might be (less extreme) dependencies due to page tables for the identity mapping, extended struct pages, and similar.

Having said that, there are other benefits to preferring other memory over just-hotplugged memory. Think about adding+onlining memory during boot (DIMMs under QEMU, virtio-mem): once the system is up, you will have most (or all) of that memory completely untouched.

So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.

Thanks!

David

> 
> I was about to give it a new respin now that thw hwpoison stuff has been settled.
> 
> [1] https://patchwork.kernel.org/cover/11059175/
>
Wei Yang Sept. 18, 2020, 2:30 a.m. UTC | #3
On Wed, Sep 16, 2020 at 09:31:21PM +0200, David Hildenbrand wrote:
>
>
>> Am 16.09.2020 um 20:50 schrieb osalvador@suse.de:
>> 
>> On 2020-09-16 20:34, David Hildenbrand wrote:
>>> When adding separate memory blocks via add_memory*() and onlining them
>>> immediately, the metadata (especially the memmap) of the next block will be
>>> placed onto one of the just added+onlined block. This creates a chain
>>> of unmovable allocations: If the last memory block cannot get
>>> offlined+removed() so will all dependant ones. We directly have unmovable
>>> allocations all over the place.
>>> This can be observed quite easily using virtio-mem, however, it can also
>>> be observed when using DIMMs. The freshly onlined pages will usually be
>>> placed to the head of the freelists, meaning they will be allocated next,
>>> turning the just-added memory usually immediately un-removable. The
>>> fresh pages are cold, prefering to allocate others (that might be hot)
>>> also feels to be the natural thing to do.
>>> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
>>> adding separate, successive memory blocks, each memory block will have
>>> unmovable allocations on them - for example gigantic pages will fail to
>>> allocate.
>>> While the ZONE_NORMAL doesn't provide any guarantees that memory can get
>>> offlined+removed again (any kind of fragmentation with unmovable
>>> allocations is possible), there are many scenarios (hotplugging a lot of
>>> memory, running workload, hotunplug some memory/as much as possible) where
>>> we can offline+remove quite a lot with this patchset.
>> 
>> Hi David,
>> 
>
>Hi Oscar.
>
>> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]
>
>Not nonesense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it‘s not completely ideal, especially for single memory blocks.
>
>With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug. You directly fragmented all hotplugged memory.
>
>Of course, there might be (less extreme) dependencies due page tables for the identity mapping, extended struct pages and similar.
>
>Having that said, there are other benefits when preferring other memory over just hotplugged memory. Think about adding+onlining memory during boot (dimms under QEMU, virtio-mem), once the system is up you will have most (all) of that memory completely untouched.
>
>So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.

Everything changes with shuffling enabled, though.

>
>Thanks!
>
>David
>
>> 
>> I was about to give it a new respin now that thw hwpoison stuff has been settled.
>> 
>> [1] https://patchwork.kernel.org/cover/11059175/
>>
David Hildenbrand Sept. 18, 2020, 7:32 a.m. UTC | #4
On 18.09.20 04:30, Wei Yang wrote:
> On Wed, Sep 16, 2020 at 09:31:21PM +0200, David Hildenbrand wrote:
>>
>>
>>> Am 16.09.2020 um 20:50 schrieb osalvador@suse.de:
>>>
>>> On 2020-09-16 20:34, David Hildenbrand wrote:
>>>> When adding separate memory blocks via add_memory*() and onlining them
>>>> immediately, the metadata (especially the memmap) of the next block will be
>>>> placed onto one of the just added+onlined block. This creates a chain
>>>> of unmovable allocations: If the last memory block cannot get
>>>> offlined+removed() so will all dependant ones. We directly have unmovable
>>>> allocations all over the place.
>>>> This can be observed quite easily using virtio-mem, however, it can also
>>>> be observed when using DIMMs. The freshly onlined pages will usually be
>>>> placed to the head of the freelists, meaning they will be allocated next,
>>>> turning the just-added memory usually immediately un-removable. The
>>>> fresh pages are cold, prefering to allocate others (that might be hot)
>>>> also feels to be the natural thing to do.
>>>> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
>>>> adding separate, successive memory blocks, each memory block will have
>>>> unmovable allocations on them - for example gigantic pages will fail to
>>>> allocate.
>>>> While the ZONE_NORMAL doesn't provide any guarantees that memory can get
>>>> offlined+removed again (any kind of fragmentation with unmovable
>>>> allocations is possible), there are many scenarios (hotplugging a lot of
>>>> memory, running workload, hotunplug some memory/as much as possible) where
>>>> we can offline+remove quite a lot with this patchset.
>>>
>>> Hi David,
>>>
>>
>> Hi Oscar.
>>
>>> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]
>>
>> Not nonesense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it‘s not completely ideal, especially for single memory blocks.
>>
>> With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug. You directly fragmented all hotplugged memory.
>>
>> Of course, there might be (less extreme) dependencies due page tables for the identity mapping, extended struct pages and similar.
>>
>> Having that said, there are other benefits when preferring other memory over just hotplugged memory. Think about adding+onlining memory during boot (dimms under QEMU, virtio-mem), once the system is up you will have most (all) of that memory completely untouched.
>>
>> So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.
> 
> While everything changes with shuffle.
> 

Right. Shuffling would naturally try to break the dependencies.

Shuffling is quite rare, though: it has to be enabled explicitly on the
cmdline and might not be of much help in virtualized environments.
Vlastimil Babka Sept. 23, 2020, 2:31 p.m. UTC | #5
On 9/16/20 9:31 PM, David Hildenbrand wrote:
> 
> 
>> Am 16.09.2020 um 20:50 schrieb osalvador@suse.de:
>> 
>> On 2020-09-16 20:34, David Hildenbrand wrote:
>>> When adding separate memory blocks via add_memory*() and onlining them
>>> immediately, the metadata (especially the memmap) of the next block will be
>>> placed onto one of the just added+onlined block. This creates a chain
>>> of unmovable allocations: If the last memory block cannot get
>>> offlined+removed() so will all dependant ones. We directly have unmovable
>>> allocations all over the place.
>>> This can be observed quite easily using virtio-mem, however, it can also
>>> be observed when using DIMMs. The freshly onlined pages will usually be
>>> placed to the head of the freelists, meaning they will be allocated next,
>>> turning the just-added memory usually immediately un-removable. The
>>> fresh pages are cold, prefering to allocate others (that might be hot)
>>> also feels to be the natural thing to do.
>>> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
>>> adding separate, successive memory blocks, each memory block will have
>>> unmovable allocations on them - for example gigantic pages will fail to
>>> allocate.
>>> While the ZONE_NORMAL doesn't provide any guarantees that memory can get
>>> offlined+removed again (any kind of fragmentation with unmovable
>>> allocations is possible), there are many scenarios (hotplugging a lot of
>>> memory, running workload, hotunplug some memory/as much as possible) where
>>> we can offline+remove quite a lot with this patchset.
>> 
>> Hi David,
>> 
> 
> Hi Oscar.
> 
>> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]
> 
> Not nonesense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it‘s not completely ideal, especially for single memory blocks.
> 
> With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug. You directly fragmented all hotplugged memory.
> 
> Of course, there might be (less extreme) dependencies due page tables for the identity mapping, extended struct pages and similar.
> 
> Having that said, there are other benefits when preferring other memory over just hotplugged memory. Think about adding+onlining memory during boot (dimms under QEMU, virtio-mem), once the system is up you will have most (all) of that memory completely untouched.
> 
> So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.

I see the point, but I don't think the head/tail mechanism is great for this. It
might sort of work, but with other interfering activity there are no guarantees
and it relies on a subtle implementation detail. There are better mechanisms
possible, I think, such as preparing a larger MIGRATE_UNMOVABLE area in the
existing memory before we allocate those long-term management structures. Or
onlining a bunch of blocks as zone_movable first and only later converting to
zone_normal in a controlled way when the existing normal zone becomes depleted?

I guess it's an issue that the e.g. 128M block onlines are so disconnected from
each other that it's hard to employ a strategy that works best for e.g. a whole
bunch of GB onlined at once. But I noticed some effort towards a new API, so
maybe that will be solved there too?

> Thanks!
> 
> David
> 
>> 
>> I was about to give it a new respin now that thw hwpoison stuff has been settled.
>> 
>> [1] https://patchwork.kernel.org/cover/11059175/
>> 
>
David Hildenbrand Sept. 23, 2020, 3:26 p.m. UTC | #6
On 23.09.20 16:31, Vlastimil Babka wrote:
> On 9/16/20 9:31 PM, David Hildenbrand wrote:
>>
>>
>>> Am 16.09.2020 um 20:50 schrieb osalvador@suse.de:
>>>
>>> On 2020-09-16 20:34, David Hildenbrand wrote:
>>>> When adding separate memory blocks via add_memory*() and onlining them
>>>> immediately, the metadata (especially the memmap) of the next block will be
>>>> placed onto one of the just added+onlined block. This creates a chain
>>>> of unmovable allocations: If the last memory block cannot get
>>>> offlined+removed() so will all dependant ones. We directly have unmovable
>>>> allocations all over the place.
>>>> This can be observed quite easily using virtio-mem, however, it can also
>>>> be observed when using DIMMs. The freshly onlined pages will usually be
>>>> placed to the head of the freelists, meaning they will be allocated next,
>>>> turning the just-added memory usually immediately un-removable. The
>>>> fresh pages are cold, prefering to allocate others (that might be hot)
>>>> also feels to be the natural thing to do.
>>>> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
>>>> adding separate, successive memory blocks, each memory block will have
>>>> unmovable allocations on them - for example gigantic pages will fail to
>>>> allocate.
>>>> While the ZONE_NORMAL doesn't provide any guarantees that memory can get
>>>> offlined+removed again (any kind of fragmentation with unmovable
>>>> allocations is possible), there are many scenarios (hotplugging a lot of
>>>> memory, running workload, hotunplug some memory/as much as possible) where
>>>> we can offline+remove quite a lot with this patchset.
>>>
>>> Hi David,
>>>
>>
>> Hi Oscar.
>>
>>> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]
>>
>> Not nonesense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it‘s not completely ideal, especially for single memory blocks.
>>
>> With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug. You directly fragmented all hotplugged memory.
>>
>> Of course, there might be (less extreme) dependencies due page tables for the identity mapping, extended struct pages and similar.
>>
>> Having that said, there are other benefits when preferring other memory over just hotplugged memory. Think about adding+onlining memory during boot (dimms under QEMU, virtio-mem), once the system is up you will have most (all) of that memory completely untouched.
>>
>> So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.
> 

Hi Vlastimil,

> I see the point, but I don't think the head/tail mechanism is great for this. It
> might sort of work, but with other interfering activity there are no guarantees
> and it relies on a subtle implementation detail. There are better mechanisms

For the specified use case of adding+onlining a whole bunch of memory
this works just fine. We don't care too much about "other interfering
activity" as you mention here, or about guarantees - this is a pure
optimization that seems to work just fine in practice.

I'm not sure about the "subtle implementation detail" - buddy merging,
and head/tail of buddy lists are a basic concept of our page allocator.
If that would ever change, the optimization here would be lost and we
would have to think of something else. Nothing would actually break -
and it's all kept directly in page_alloc.c

I'd like to stress that what I propose here is both simple and powerful.

> possible I think, such as preparing a larger MIGRATE_UNMOVABLE area in the
> existing memory before we allocate those long-term management structures. Or
> onlining a bunch of blocks as zone_movable first and only later convert to
> zone_normal in a controlled way when existing normal zone becomes depeted?

I see the following (more or less complicated) alternatives

1) Having a larger MIGRATE_UNMOVABLE area

a) Sizing it is difficult. I mean you would have to plan ahead for all
memory you might eventually hotplug later - and that could even be
impossible if you hotplug quite a lot of memory to a smaller machine.
(I've seen people in the vm/container world trying to hotplug 128GB
DIMMs to 2GB VMs ... and failing for obvious reasons)
b) not really desired. You usually want to have most memory movable, not
the opposite (just because you might hotplug memory in small chunks later).

2) smarter onlining

I have prototype patches for better auto-onlining (which I'll share at
some point), where I balance between ZONE_NORMAL and ZONE_MOVABLE in a
defined ratio. Assuming something very simple, adding separate memory
blocks and onlining them based on the current zone ratio (assuming a 1:4
normal:movable target ratio) would (without some other policies I have
in place) result in something like this for hotplugged memory (via
virtio-mem):

[N][M][M][M][M][N][M][M][M][M][N][M][M][M][M]...

(note: layout is suboptimal, just a simple example)

But even here, all [N] memory blocks would immediately be used for
allocations for the memmap of successive blocks. It doesn't solve the
dependency issues.
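
(For illustration, a very rough sketch of such a ratio-based decision -
made-up names, and the actual prototype has additional policies:)

	/* Pick the zone for the next block, targeting 1:4 normal:movable. */
	static int auto_online_pick_zone(unsigned long normal_pages,
					 unsigned long movable_pages,
					 unsigned long block_pages)
	{
		if (movable_pages + block_pages <= 4 * normal_pages)
			return ZONE_MOVABLE;
		return ZONE_NORMAL;
	}

Applied block by block, that's what produces the [N][M][M][M][M] ... layout
above.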

Now assume we would want to group [N] in a way to allow for gigantic
pages, like

[N][N][N][N][N][N][N][N][M][M][M][M] ....

we would, once again, never be able to allocate a gigantic page because
all [N] would contain a memmap.

3) conversion from MOVABLE -> NORMAL

While a conversion from MOVABLE to NORMAL would be interesting to see,
it's going to be a challenging task to actually implement (people expect
that page_zone() remains stable). Without any hacks, we'd have to

1. offline the selected (MOVABLE) memory block/chunk
2. online the selected memory block/chunk to the NORMAL zone

This is not something we can do out of random context (for example, we
need both, the device hotplug lock and the memory hotplug lock, as we
might race with user space) - so there might still be a chance of
corner-case OOMs.

(I assume there could also be quite a negative performance impact when
always relying on the conversion, and not properly planning ahead as in 2.)

> 
> I guess it's an issue that the e.g. 128M block onlines are so disconnected from
> each other it's hard to employ a strategy that works best for e.g. a whole bunch
> of GB onlined at once. But I noticed some effort towards new API, so maybe that
> will be solved there too?

While new interfaces might make it easier to identify boundaries of
separate DIMMs (e.g., to online a single DIMM either movable or
unmovable - which can partially be done right now when going via memory
resource boundaries), it doesn't help for the use case of adding
separate memory blocks.

So while having an automatic conversion from MOVABLE -> NORMAL would be
interesting, I doubt we'll see it in the foreseeable future. Are there
any similarly simple alternatives to optimize this?

Thanks!
Wei Yang Sept. 24, 2020, 1:57 a.m. UTC | #7
On Wed, Sep 23, 2020 at 04:31:25PM +0200, Vlastimil Babka wrote:
>On 9/16/20 9:31 PM, David Hildenbrand wrote:
>> 
>> 
>>> Am 16.09.2020 um 20:50 schrieb osalvador@suse.de:
>>> 
>>> On 2020-09-16 20:34, David Hildenbrand wrote:
>>>> When adding separate memory blocks via add_memory*() and onlining them
>>>> immediately, the metadata (especially the memmap) of the next block will be
>>>> placed onto one of the just added+onlined block. This creates a chain
>>>> of unmovable allocations: If the last memory block cannot get
>>>> offlined+removed() so will all dependant ones. We directly have unmovable
>>>> allocations all over the place.
>>>> This can be observed quite easily using virtio-mem, however, it can also
>>>> be observed when using DIMMs. The freshly onlined pages will usually be
>>>> placed to the head of the freelists, meaning they will be allocated next,
>>>> turning the just-added memory usually immediately un-removable. The
>>>> fresh pages are cold, prefering to allocate others (that might be hot)
>>>> also feels to be the natural thing to do.
>>>> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
>>>> adding separate, successive memory blocks, each memory block will have
>>>> unmovable allocations on them - for example gigantic pages will fail to
>>>> allocate.
>>>> While the ZONE_NORMAL doesn't provide any guarantees that memory can get
>>>> offlined+removed again (any kind of fragmentation with unmovable
>>>> allocations is possible), there are many scenarios (hotplugging a lot of
>>>> memory, running workload, hotunplug some memory/as much as possible) where
>>>> we can offline+remove quite a lot with this patchset.
>>> 
>>> Hi David,
>>> 
>> 
>> Hi Oscar.
>> 
>>> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]
>> 
>> Not nonesense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it‘s not completely ideal, especially for single memory blocks.
>> 
>> With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug. You directly fragmented all hotplugged memory.
>> 
>> Of course, there might be (less extreme) dependencies due page tables for the identity mapping, extended struct pages and similar.
>> 
>> Having that said, there are other benefits when preferring other memory over just hotplugged memory. Think about adding+onlining memory during boot (dimms under QEMU, virtio-mem), once the system is up you will have most (all) of that memory completely untouched.
>> 
>> So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.
>
>I see the point, but I don't think the head/tail mechanism is great for this. It
>might sort of work, but with other interfering activity there are no guarantees
>and it relies on a subtle implementation detail. There are better mechanisms
>possible I think, such as preparing a larger MIGRATE_UNMOVABLE area in the
>existing memory before we allocate those long-term management structures. Or
>onlining a bunch of blocks as zone_movable first and only later convert to
>zone_normal in a controlled way when existing normal zone becomes depeted?
>

To be honest, David's approach is easy to understand for me.

And I don't see any negative effect.

>I guess it's an issue that the e.g. 128M block onlines are so disconnected from
>each other it's hard to employ a strategy that works best for e.g. a whole bunch
>of GB onlined at once. But I noticed some effort towards new API, so maybe that
>will be solved there too?
>
>> Thanks!
>> 
>> David
>> 
>>> 
>>> I was about to give it a new respin now that thw hwpoison stuff has been settled.
>>> 
>>> [1] https://patchwork.kernel.org/cover/11059175/
>>> 
>>
Mel Gorman Sept. 24, 2020, 9:40 a.m. UTC | #8
On Wed, Sep 23, 2020 at 05:26:06PM +0200, David Hildenbrand wrote:
> >>> On 2020-09-16 20:34, David Hildenbrand wrote:
> >>>> When adding separate memory blocks via add_memory*() and onlining them
> >>>> immediately, the metadata (especially the memmap) of the next block will be
> >>>> placed onto one of the just added+onlined block. This creates a chain
> >>>> of unmovable allocations: If the last memory block cannot get
> >>>> offlined+removed() so will all dependant ones. We directly have unmovable
> >>>> allocations all over the place.
> >>>> This can be observed quite easily using virtio-mem, however, it can also
> >>>> be observed when using DIMMs. The freshly onlined pages will usually be
> >>>> placed to the head of the freelists, meaning they will be allocated next,
> >>>> turning the just-added memory usually immediately un-removable. The
> >>>> fresh pages are cold, prefering to allocate others (that might be hot)
> >>>> also feels to be the natural thing to do.
> >>>> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
> >>>> adding separate, successive memory blocks, each memory block will have
> >>>> unmovable allocations on them - for example gigantic pages will fail to
> >>>> allocate.
> >>>> While the ZONE_NORMAL doesn't provide any guarantees that memory can get
> >>>> offlined+removed again (any kind of fragmentation with unmovable
> >>>> allocations is possible), there are many scenarios (hotplugging a lot of
> >>>> memory, running workload, hotunplug some memory/as much as possible) where
> >>>> we can offline+remove quite a lot with this patchset.
> >>>
> >>> Hi David,
> >>>
> >>
> >> Hi Oscar.
> >>
> >>> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]
> >>
> >> Not nonesense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it's not completely ideal, especially for single memory blocks.
> >>
> >> With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug. You directly fragmented all hotplugged memory.
> >>
> >> Of course, there might be (less extreme) dependencies due page tables for the identity mapping, extended struct pages and similar.
> >>
> >> Having that said, there are other benefits when preferring other memory over just hotplugged memory. Think about adding+onlining memory during boot (dimms under QEMU, virtio-mem), once the system is up you will have most (all) of that memory completely untouched.
> >>
> >> So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.
> > 
> 
> Hi Vlastimil,
> 
> > I see the point, but I don't think the head/tail mechanism is great for this. It
> > might sort of work, but with other interfering activity there are no guarantees
> > and it relies on a subtle implementation detail. There are better mechanisms
> 
> For the specified use case of adding+onlining a whole bunch of memory
> this works just fine. We don't care too much about "other interfering
> activity" as you mention here, or about guarantees - this is a pure
> optimization that seems to work just fine in practice.
> 
> I'm not sure about the "subtle implementation detail" - buddy merging,
> and head/tail of buddy lists are a basic concept of our page allocator.
> If that would ever change, the optimization here would be lost and we
> would have to think of something else. Nothing would actually break -
> and it's all kept directly in page_alloc.c
> 

It's somewhat subtle because it relies heavily on the exact ordering
of how pages are pulled from the free lists at the moment. Let's say, for
example, that someone was brave enough to tackle the problem of the giant
zone lock and split the zone into allocation arenas (like what glibc does
to split the lock). Depending on the exact ordering of how pages are
added to and removed from the lists, that would break your approach. I'm wary of
anything that relies on the ordering of freelists for correctness because
it limits the ability to fix the zone lock (which has been overdue for
fixing for years now and is getting worse as node sizes increase).

To be robust, you'd need to do something like early memory bring-up, whereby
pages are directly allocated from one part of the DIMM (presumably the
start) and used for the metadata -- potentially all the metadata that
would be necessary to plug/unplug the entire DIMM. This would effectively
be unmovable, but if you want to guarantee that all the memory except the
metadata can be unplugged, you do not have many alternatives. Playing games
with the ordering of the freelists will simply end up as "sometimes works,
sometimes does not".

In terms of forcing ranges to be UNMOVABLE or MOVABLE (either via zones
or by implementing "sticky" pageblocks which hits complex reclaim-related
problems), you start running into problems similar to lowmem starvation
where a page cache allocation fails because unmovable metadata cannot
be allocated.

I suggest you keep it simple -- statically allocate the potential
metadata needed in the future even though it limits the maximum amount
of memory that can be unplugged. The alternative is unpredictable
plug/unplug success rates.
David Hildenbrand Sept. 24, 2020, 9:54 a.m. UTC | #9
On 24.09.20 11:40, Mel Gorman wrote:
> On Wed, Sep 23, 2020 at 05:26:06PM +0200, David Hildenbrand wrote:
>>>>> On 2020-09-16 20:34, David Hildenbrand wrote:
>>>>>> When adding separate memory blocks via add_memory*() and onlining them
>>>>>> immediately, the metadata (especially the memmap) of the next block will be
>>>>>> placed onto one of the just added+onlined block. This creates a chain
>>>>>> of unmovable allocations: If the last memory block cannot get
>>>>>> offlined+removed() so will all dependant ones. We directly have unmovable
>>>>>> allocations all over the place.
>>>>>> This can be observed quite easily using virtio-mem, however, it can also
>>>>>> be observed when using DIMMs. The freshly onlined pages will usually be
>>>>>> placed to the head of the freelists, meaning they will be allocated next,
>>>>>> turning the just-added memory usually immediately un-removable. The
>>>>>> fresh pages are cold, prefering to allocate others (that might be hot)
>>>>>> also feels to be the natural thing to do.
>>>>>> It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
>>>>>> adding separate, successive memory blocks, each memory block will have
>>>>>> unmovable allocations on them - for example gigantic pages will fail to
>>>>>> allocate.
>>>>>> While the ZONE_NORMAL doesn't provide any guarantees that memory can get
>>>>>> offlined+removed again (any kind of fragmentation with unmovable
>>>>>> allocations is possible), there are many scenarios (hotplugging a lot of
>>>>>> memory, running workload, hotunplug some memory/as much as possible) where
>>>>>> we can offline+remove quite a lot with this patchset.
>>>>>
>>>>> Hi David,
>>>>>
>>>>
>>>> Hi Oscar.
>>>>
>>>>> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]
>>>>
>>>> Not nonesense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it's not completely ideal, especially for single memory blocks.
>>>>
>>>> With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug. You directly fragmented all hotplugged memory.
>>>>
>>>> Of course, there might be (less extreme) dependencies due page tables for the identity mapping, extended struct pages and similar.
>>>>
>>>> Having that said, there are other benefits when preferring other memory over just hotplugged memory. Think about adding+onlining memory during boot (dimms under QEMU, virtio-mem), once the system is up you will have most (all) of that memory completely untouched.
>>>>
>>>> So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.
>>>
>>
>> Hi Vlastimil,
>>
>>> I see the point, but I don't think the head/tail mechanism is great for this. It
>>> might sort of work, but with other interfering activity there are no guarantees
>>> and it relies on a subtle implementation detail. There are better mechanisms
>>
>> For the specified use case of adding+onlining a whole bunch of memory
>> this works just fine. We don't care too much about "other interfering
>> activity" as you mention here, or about guarantees - this is a pure
>> optimization that seems to work just fine in practice.
>>
>> I'm not sure about the "subtle implementation detail" - buddy merging,
>> and head/tail of buddy lists are a basic concept of our page allocator.
>> If that would ever change, the optimization here would be lost and we
>> would have to think of something else. Nothing would actually break -
>> and it's all kept directly in page_alloc.c
>>

Hi Mel,

thanks for your reply.

> 
> It's somewhat subtle because it's relying heavily on the exact ordering
> of how pages are pulled from the free lists at the moment. Lets say for
> example that someone was brave enough to tackle the problem of the giant
> zone lock and split the zone into allocation arenas (like what glibc does
> to split the lock). Depending on the exact ordering of how pages are
> added and removed from the list would break your approach. I'm wary of

First of all, it would not break it (as I already said). The
optimization would be lost. Totally acceptable.

However, I assume we would apply the same technique (optimized buddy
merging - placing to head/tail, page shuffling) on these allocation
arenas. So the optimization would still mostly apply, just at a different
granularity - which would be fine.

> anything that relies on the ordering of freelists for correctness becauuse
> it limits the ability to fix the zone lock (which has been overdue for
> fixing for years now and getting worse as node sizes increase).

"for correctness" - no, this is an optimization. As I said, there are no
guarantees. Please keep that in mind.

(also, page shuffling relies on the ordering of freelists right now ...
for correctness)

> 
> To be robust, you'd need to do something like early memory bring-up whereby
> pages are directly allocated from one part of the DIMM (presumably the
> start) and use that for the metadata -- potentially all the metadata that
> would be necessary to plug/unplug the entire DIMM. This would effectively
> be unmovable but if you want to guarantee that all the memory except the
> metadata can be unplugged, you do not have much alteratives. Playing games
> with the ordering of the freelists will simply end up as "sometimes works,
> sometimes does not". 

As I already answered to Oscar, while something like that might be
feasible for DIMMs in the future (and there are still quite some issues
to be sorted out), it isn't always desirable when adding separate (small,
e.g., 128MB) memory blocks. You - again - have unmovable allocations all
over the place that won't allow you to allocate any gigantic page.

> 
> In terms of forcing ranges to be UNMOVABLE or MOVABLE (either via zones
> or by implementing "sticky" pageblocks which hits complex reclaim-related
> problems), you start running into problems similar to lowmem starvation
> where a page cache allocation fails because unmovable metadata cannot
> be allocated.

Exactly.

> 
> I suggest you keep it simple -- statically allocate the potential
> metadata needed in the future even though it limits the maximum amount
> of memory that can be unplugged. The alternative is unpredictable
> plug/unplug success rates.
> 

I'm sorry, I can't follow. How is this "simple"? Or even "simpler" than
what I suggest?

And as I said, it doesn't always work. Assume I hotplug 128GB to a 2GB
machine via virtio-mem (which works just fine, as we add+online memory
in small chunks compared to a single, huge DIMM): I would have to
pre-allocate 2GB just for the memmap - which obviously doesn't work.

Again, I'd like to stress that this is a pure optimization that I am
proposing - nothing would "break" when ripping it out again, except that
we lose the optimizations I mentioned.
Vlastimil Babka Sept. 24, 2020, 1:59 p.m. UTC | #10
On 9/23/20 5:26 PM, David Hildenbrand wrote:
> On 23.09.20 16:31, Vlastimil Babka wrote:
>> On 9/16/20 9:31 PM, David Hildenbrand wrote:
>> 
> 
> Hi Vlastimil,
> 
>> I see the point, but I don't think the head/tail mechanism is great for this. It
>> might sort of work, but with other interfering activity there are no guarantees
>> and it relies on a subtle implementation detail. There are better mechanisms
> 
> For the specified use case of adding+onlining a whole bunch of memory
> this works just fine. We don't care too much about "other interfering
> activity" as you mention here, or about guarantees - this is a pure
> optimization that seems to work just fine in practice.
> 
> I'm not sure about the "subtle implementation detail" - buddy merging,
> and head/tail of buddy lists are a basic concept of our page allocator.

Mel already explained that, so I won't repeat.

> If that would ever change, the optimization here would be lost and we
> would have to think of something else. Nothing would actually break -
> and it's all kept directly in page_alloc.c

Sure, but then it can become pointless code churn.

> I'd like to stress that what I propose here is both simple and powerful.
> 
>> possible I think, such as preparing a larger MIGRATE_UNMOVABLE area in the
>> existing memory before we allocate those long-term management structures. Or
>> onlining a bunch of blocks as zone_movable first and only later convert to
>> zone_normal in a controlled way when existing normal zone becomes depeted?
> 
> I see the following (more or less complicated) alternatives
> 
> 1) Having a larger MIGRATE_UNMOVABLE area
> 
> a) Sizing it is difficult. I mean you would have to plan ahead for all
> memory you might eventually hotplug later - and that could even be

Yeah, hence my worry about existing interfaces that work on 128MB blocks
individually without a larger strategy.

> impossible if you hotplug quite a lot of memory to a smaller machine.
> (I've seen people in the vm/container world trying to hotplug 128GB
> DIMMs to 2GB VMs ... and failing for obvious reasons)

Some planning should still be possible to maximize the contiguous area without
unmovable allocations.

> b) not really desired. You usually want to have most memory movable, not
> the opposite (just because you might hotplug memory in small chunks later).
> 
> 2) smarter onlining
> 
> I have prototype patches for better auto-onlining (which I'll share at
> some point), where I balance between ZONE_NORMAL and ZONE_MOVABLE in a
> defined ratio. Assuming something very simple, adding separate memory
> blocks and onlining them based on the current zone ratio (assuming a 1:4
> normal:movable target ratio) would (without some other policies I have
> in place) result in something like this for hotplugged memory (via
> virtio-mem):
> 
> [N][M][M][M][M][N][M][M][M][M][N][M][M][M][M]...
> 
> (note: layout is suboptimal, just a simple example)
> 
> But even here, all [N] memory blocks would immediately be use for
> allocations for the memmap of successive blocks. It doesn't solve the
> dependency issues.
> 
> Now assume we would want to group [N] in a way to allow for gigantic
> pages, like
> 
> [N][N][N][N][N][N][N][N][M][M][M][M] ....
> 
> we would, once again, never be able to allocate a gigantic page because
> all [N] would contain a memmap.

The second approach should work if you know how much you are going to online
and plan the size of the N group accordingly. If the onlined amount is several
gigabytes, then only the first block (or first X blocks) will be unusable for a
gigantic page, but the rest would be usable? Can't get much better than that.

> 3) conversion from MOVABLE -> NORMAL
> 
> While a conversion from MOVABLE to NORMAL would be interesting to see,
> it's going to be a challenging task to actually implement (people expect
> that page_zone() remains stable). Without any hacks, we'd have to
> 
> 1. offline the selected (MOVABLE) memory block/chunk
> 2. online the selected memory block/chunk to the NORMAL zone
> 
> This is not something we can do out of random context (for example, we
> need both, the device hotplug lock and the memory hotplug lock, as we
> might race with user space) - so there might still be a chance of
> corner-case OOMs.

Right, it's trickier than I thought.

> (I assume there could also be quite a negative performance impact when
> always relying on the conversion, and not properly planning ahead as in 2.)
> 
>> 
>> I guess it's an issue that the e.g. 128M block onlines are so disconnected from
>> each other it's hard to employ a strategy that works best for e.g. a whole bunch
>> of GB onlined at once. But I noticed some effort towards new API, so maybe that
>> will be solved there too?
> 
> While new interfaces might make it easier to identify boundaries of
> separate DIMMs (e.g., to online a single DIMM either movable or
> unmovable - which can partially be done right now when going via memory
> resource boundaries), it doesn't help for the use case of adding
> separate memory blocks.
> 
> So while having an automatic conversion from MOVABLE -> NORMAL would be
> interesting, I doubt we'll see it in the foreseeable future. Are there
> any similarly simple alternatives to optimize this?

I've reviewed the series and I won't block it - yes, it's an optimistic approach
that can break and leave us with code churn. But at least it's not that much
code, and the extra test in __free_one_page() shouldn't make this hotpath much
worse. But I still hope we can achieve a more robust solution one day.

> Thanks!
>
David Hildenbrand Sept. 24, 2020, 2:29 p.m. UTC | #11
>> If that would ever change, the optimization here would be lost and we
>> would have to think of something else. Nothing would actually break -
>> and it's all kept directly in page_alloc.c
> 
> Sure, but then it can become a pointless code churn.

Indeed, and if there are valid concerns that this will happen in the
near future (e.g., < 1 year), I agree that we should look into
alternatives right from the start. Otherwise it's good enough until some
of the other things I mentioned below become real (which could also take
a while ...).

> 
>> I'd like to stress that what I propose here is both simple and powerful.
>>
>>> possible I think, such as preparing a larger MIGRATE_UNMOVABLE area in the
>>> existing memory before we allocate those long-term management structures. Or
>>> onlining a bunch of blocks as zone_movable first and only later convert to
>>> zone_normal in a controlled way when existing normal zone becomes depeted?
>>
>> I see the following (more or less complicated) alternatives
>>
>> 1) Having a larger MIGRATE_UNMOVABLE area
>>
>> a) Sizing it is difficult. I mean you would have to plan ahead for all
>> memory you might eventually hotplug later - and that could even be
> 
> Yeah, hence my worry about existing interfaces that work on 128MB blocks
> individually without a larger strategy.

Yes, in the works :)

> 
>> impossible if you hotplug quite a lot of memory to a smaller machine.
>> (I've seen people in the vm/container world trying to hotplug 128GB
>> DIMMs to 2GB VMs ... and failing for obvious reasons)
> 
> Some planning should still be possible to maximize the contiguous area without
> unmovable allocations.

Indeed, optimizing that is very high on my list of things to look into ...

>>
>> we would, once again, never be able to allocate a gigantic page because
>> all [N] would contain a memmap.
> 
> The second approach should work, if you know how much you are going to online,
> and plan the size the N group accordingly, and if the onlined amount is several
> gigabytes, then only the first one (or first X) will be unusable for a gigantic
> page, but the rest would be? Can't get much better than that.

Indeed, it's the optimal case (assuming one can come up with a safe zone
balance - which is usually possible, but unfortunately, there are
exceptions one at least has to identify).

[...]

> 
> I've reviewed the series and I won't block it - yes it's an optimistic approach
> that can break and leave us with code churn. But at least it's not that much

Thanks.

I'll try to document somewhere that the behavior of FOP_TO_TAIL is a
pure optimization and might change in the future - along with the case
it tried to optimize (so people know what the use case was).

> code and the extra test in  __free_one_page() shouldn't make this hotpath too

I assume the compiler is able to completely propagate constants and
optimize that out - I haven't checked, though.
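
(A toy example of that assumption - not the kernel code: once the call is
inlined and the flag is a compile-time constant, the compiler resolves the
head/tail branch statically, so the extra test vanishes from the hot path.)

	enum placement { PLACE_HEAD, PLACE_TAIL };

	static inline int queue_index(enum placement p, int list_len)
	{
		/* Folded away when 'p' is a compile-time constant. */
		return (p == PLACE_TAIL) ? list_len : 0;
	}

	int head_index(int len) { return queue_index(PLACE_HEAD, len); } /* 0 */
	int tail_index(int len) { return queue_index(PLACE_TAIL, len); } /* len */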

> worse. But I still hope we can achieve a more robust solution one day.

I definitely agree. I'd also prefer some kind of guarantees, but I
learned that things always sound easier than they actually are when it
comes to memory management in Linux ... and they take a lot of time (for
example, Michal's/Oscar's attempts to implement vmemmap on hotadded memory).