
[v5,1/5] mm,memory_hotplug: Allocate memmap from the added memory range

Message ID 20210319092635.6214-2-osalvador@suse.de (mailing list archive)
State New, archived
Series Allocate memmap from hotadded memory (per device)

Commit Message

Oscar Salvador March 19, 2021, 9:26 a.m. UTC
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.

This has some disadvantages:
 a) existing memory is consumed for that purpose
    (eg: ~2MB per 128MB memory section on x86_64)
 b) if the whole node is movable then we have off-node struct pages
    which have performance drawbacks.
 c) It might be that there are no PMD_ALIGNED chunks, so the memmap array
    gets populated with base pages.

This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.

Vmemmap page tables can map arbitrary memory.
That means that we can simply use the beginning of each memory section and
map struct pages there.
struct pages which back the allocated space then just need to be treated
carefully.

Implementation-wise, we will reuse the vmem_altmap infrastructure to override
the default allocator used by __populate_section_memmap.
Part of the implementation also relies on memory_block structure gaining
a new field which specifies the number of vmemmap_pages at the beginning.
This comes in handy as, in {online,offline}_pages, all the isolation and
migration is done on the (buddy_start_pfn, end_pfn] range,
where buddy_start_pfn = start_pfn + nr_vmemmap_pages.

In this way, we have:

[start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
[buddy_start_pfn, end_pfn - 1]   = Initialized and sent to buddy

Hot-remove:

 We need to be careful when removing memory, as adding and
 removing memory needs to be done with the same granularity.
 To check that this assumption is not violated, we check the
 memory range we want to remove and if a) any memory block has
 vmemmap pages and b) the range spans more than a single memory
 block, we scream out loud and refuse to proceed.

 If all is good and the range was using memmap on memory (aka vmemmap pages),
 we construct an altmap structure so free_hugepage_table does the right
 thing and calls vmem_altmap_free instead of free_pagetable.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 drivers/base/memory.c          |  20 +++---
 include/linux/memory.h         |   8 ++-
 include/linux/memory_hotplug.h |  21 ++++--
 include/linux/memremap.h       |   2 +-
 include/linux/mmzone.h         |   5 ++
 mm/Kconfig                     |   5 ++
 mm/memory_hotplug.c            | 158 +++++++++++++++++++++++++++++++++++------
 mm/page_alloc.c                |   4 +-
 8 files changed, 187 insertions(+), 36 deletions(-)

Comments

David Hildenbrand March 19, 2021, 10:20 a.m. UTC | #1
Two nits:

> +bool mhp_supports_memmap_on_memory(unsigned long size)
> +{
> +	unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
> +	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
> +	unsigned long remaining_size = size - vmemmap_size;
> +
> +	/*
> +	 * Besides having arch support and the feature enabled at runtime, we
> +	 * need a few more assumptions to hold true:
> +	 *
> +	 * a) We span a single memory block: memory onlining/offlinin;g happens

s/offlinin;g/offlining;/

> +	 *    in memory block granularity. We don't want the vmemmap of online
> +	 *    memory blocks to reside on offline memory blocks. In the future,
> +	 *    we might want to support variable-sized memory blocks to make the
> +	 *    feature more versatile.
> +	 *
> +	 * b) The vmemmap pages span complete PMDs: We don't want vmemmap code
> +	 *    to populate memory from the altmap for unrelated parts (i.e.,
> +	 *    other memory blocks)
> +	 *
> +	 * c) The vmemmap pages (and thereby the pages that will be exposed to
> +	 *    the buddy) have to cover full pageblocks: memory onlining/offlining
> +	 *    code requires applicable ranges to be page-aligned, for example, to
> +	 *    set the migratetypes properly.
> +	 *
> +	 * TODO: Although we have a check here to make sure that vmemmap pages
> +	 *	 fully populate a PMD, it is not the right place to check for
> +	 *	 this. A much better solution involves improving vmemmap code
> +	 *	 to fallback to base pages when trying to populate vmemmap using
> +	 *	 altmap as an alternative source of memory, and we do not exactly
> +	 *	 populate a single PMD.
> +	 */
> +	return memmap_on_memory &&
> +	       IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
> +	       size == memory_block_size_bytes() &&
> +	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
> +	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));

IS_ALIGNED(remaining_size, pageblock_nr_pages << PAGE_SHIFT);

LGTM, thanks!

(another pair of eyes certainly wouldn't hurt :) )

Reviewed-by: David Hildenbrand <david@redhat.com>
Oscar Salvador March 19, 2021, 10:31 a.m. UTC | #2
On Fri, Mar 19, 2021 at 11:20:19AM +0100, David Hildenbrand wrote:
> > +bool mhp_supports_memmap_on_memory(unsigned long size)
> > +{
> > +	unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
> > +	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
> > +	unsigned long remaining_size = size - vmemmap_size;
> > +
> > +	/*
> > +	 * Besides having arch support and the feature enabled at runtime, we
> > +	 * need a few more assumptions to hold true:
> > +	 *
> > +	 * a) We span a single memory block: memory onlining/offlinin;g happens
> 
> s/offlinin;g/offlining;/

Bleh, terrible :-(

> IS_ALIGNED(remaining_size, pageblock_nr_pages << PAGE_SHIFT);

Yaiks

Hopefully Andrew can amend these two nits? 

> (another pair of eyes certainly wouldn't hurt :) )

definitely, but those are pricy as you may know :-D

> Reviewed-by: David Hildenbrand <david@redhat.com>

Thanks a lot for the thorough review David, highly appreciated!
David Hildenbrand March 19, 2021, 12:04 p.m. UTC | #3
> Hopefully Andrew can amend these two nits?
> 
>> (another pair of eyes certainly wouldn't hurt :) )
> 
> definitely, but those are pricy as you may know :-D

Even worse, they are even hard to find! :)
Michal Hocko March 23, 2021, 10:11 a.m. UTC | #4
[Sorry for a long overdue review. I didn't have time to follow previous
versions so I am sorry if some of my concerns have been discussed
already]

On Fri 19-03-21 10:26:31, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
> 
> This has some disadvantages:
>  a) an existing memory is consumed for that purpose
>     (eg: ~2MB per 128MB memory section on x86_64)
>  b) if the whole node is movable then we have off-node struct pages
>     which has performance drawbacks.

I was playing with movable_node and performance implications back in
2017 (unfortunately I do not have specific numbers anymore) and the
setup was a bit extreme - a single node (0) with normal zones and all
other nodes with movable memory only. So not only struct pages but any
other kernel metadata were on a remote node. I remember I could see
clear performance drop scaling with the distance from node 0, somewhere
between 5-10% on kbuild bound to a movable node.

>  c) It might be there are no PMD_ALIGNED chunks so memmap array gets
>     populated with base pages.
> 
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> 
> Vmemap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.

In fact the beginning of the memory block should be sufficient, as sections
cannot be hotremoved without the rest of the memory block.

> struct pages which back the allocated space then just need to be treated
> carefully.
> 
> Implementation wise we will reuse vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.

Here you are talking about memory block rather than section.

> This comes in handy as in {online,offline}_pages, all the isolation and
> migration is being done on (buddy_start_pfn, end_pfn] range,
> being buddy_start_pfn = start_pfn + nr_vmemmap_pages.
> 
> In this way, we have:
> 
> [start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
> [buddy_start_pfn, end_pfn - 1]       = Initialized and sent to buddy
> 
> Hot-remove:
> 
>  We need to be careful when removing memory, as adding and
>  removing memory needs to be done with the same granularity.
>  To check that this assumption is not violated, we check the
>  memory range we want to remove and if a) any memory block has
>  vmemmap pages and b) the range spans more than a single memory
>  block, we scream out loud and refuse to proceed.

Is this a real problem? If each memory block has its own vmemmap then we
should be just fine, no?
 
>  If all is good and the range was using memmap on memory (aka vmemmap pages),
>  we construct an altmap structure so free_hugepage_table does the right
>  thing and calls vmem_altmap_free instead of free_pagetable.

I would appreciate some more description of the patch itself. The above
outlines a highlevel problems and design. The patch is quite large and
it acts on several layers - physical hotplug, {on,off}lining and sysfs
layer.

Let me capture my thinking:
- from the top level 
- sysfs interfaces - memory block is extended to contain the number of
  vmemmap pages reserved from the beginning of the block for all
  memory sections belonging to the block.
- add_memory_resource is the entry point to reserve the vmemmap space
  for the block. This is an opt-in feature (MHP_MEMMAP_ON_MEMORY) and
  there is no current user at this stage.
- vmem_altmap is instructed to use the reserved vmemmap space as the
  backing storage for the vmemmap struct pages. Via arch_add_memory->
  __populate_section_memmap.
- online_pages for some reason needs to know about the reserved vmemmap
  space. Why? It already knows the initial pfn to online. Why cannot the
  caller simply alter both start pfn and nr_pages to online everything
  after the vmemmap space? This is somehow conflating the mem block
  concept deeper into onlining.
- the same applies to offlining.
- finally hotremove - which is the most tricky part. try_remove_memory
  learns about vmemmap reserved space and provides it to __remove_pages
  and eventually all the way down to remove_pagetable via altmap
  Now a question and something I have stumbled over few years back when
  looking into this. Let's say you have multi section memblock so the
  first section of the block backs vmemmaps for all other sections.
  What happens when you drop the first section's worth of memory before
  tearing down all the other vmemmaps?

Now to the specific implementation.

[...]
> @@ -185,10 +185,11 @@ memory_block_action(unsigned long start_section_nr, unsigned long action,
>  
>  	switch (action) {
>  	case MEM_ONLINE:
> -		ret = online_pages(start_pfn, nr_pages, online_type, nid);
> +		ret = online_pages(start_pfn, nr_pages, nr_vmemmap_pages,
> +				   online_type, nid);
>  		break;

I would just offset start_pfn and nr_pages.

[...]

> @@ -603,7 +606,7 @@ static int add_memory_block(unsigned long base_section_nr)
>  	if (section_count == 0)
>  		return 0;
>  	return init_memory_block(memory_block_id(base_section_nr),
> -				 MEM_ONLINE);
> +				 MEM_ONLINE, 0);

This would deserve a comment.
	/* Early init code to create memory blocks for all the memory.
	 * Backed by bootmem struct pages so no vmemmap reserved space.
	 */

[...]
> -static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
> +static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
> +			       unsigned long buddy_start_pfn)

More on that later
>  {
>  	const unsigned long end_pfn = start_pfn + nr_pages;
> -	unsigned long pfn;
> +	unsigned long pfn = buddy_start_pfn;
> +
> +	/*
> +	 * When using memmap_on_memory, the range might be unaligned as the
> +	 * first pfns are used for vmemmap pages. Align it in case we need to.
> +	 */
> +	VM_BUG_ON(!IS_ALIGNED(pfn, pageblock_nr_pages));

No this is not something VM_BUG_ON should be used for. This is perfectly
recoverable situation. Besides that this is a wrong layer to care. All
the fixup should happen up in the call chain.

>  int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> -		       int online_type, int nid)
> +		       unsigned long nr_vmemmap_pages, int online_type, int nid)
>  {
> -	unsigned long flags;
> +	unsigned long flags, buddy_start_pfn, buddy_nr_pages;
>  	struct zone *zone;
>  	int need_zonelists_rebuild = 0;
>  	int ret;

As already mentioned, I believe this would be much easier to follow if
the given pfn really denoted the first pfn to online, rather than teaching
the code about vmemmap space, which is not really interesting from the
onlining POV. Struct pages are already created. All we need is to online
them for use.
Just have a look at pfn vs. buddy_start_pfn usage. Why should
zone_for_pfn_range, node_states_check_changes_online, memory_notify use
the former rather than the latter? As mentioned above, online_pages_range
is just made more complex by doing that.

Sure there are some consistency checks which are more convenient with
the actual pfn start but I believe those shouldn't be a reason for
obfuscating the code and mixing layers.

[...]
> +bool mhp_supports_memmap_on_memory(unsigned long size)
> +{
> +	unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
> +	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
> +	unsigned long remaining_size = size - vmemmap_size;
> +
> +	/*
> +	 * Besides having arch support and the feature enabled at runtime, we
> +	 * need a few more assumptions to hold true:
> +	 *
> +	 * a) We span a single memory block: memory onlining/offlinin;g happens
> +	 *    in memory block granularity. We don't want the vmemmap of online
> +	 *    memory blocks to reside on offline memory blocks. In the future,
> +	 *    we might want to support variable-sized memory blocks to make the
> +	 *    feature more versatile.
> +	 *
> +	 * b) The vmemmap pages span complete PMDs: We don't want vmemmap code
> +	 *    to populate memory from the altmap for unrelated parts (i.e.,
> +	 *    other memory blocks)
> +	 *
> +	 * c) The vmemmap pages (and thereby the pages that will be exposed to
> +	 *    the buddy) have to cover full pageblocks: memory onlining/offlining
> +	 *    code requires applicable ranges to be page-aligned, for example, to
> +	 *    set the migratetypes properly.
> +	 *
> +	 * TODO: Although we have a check here to make sure that vmemmap pages
> +	 *	 fully populate a PMD, it is not the right place to check for
> +	 *	 this. A much better solution involves improving vmemmap code
> +	 *	 to fallback to base pages when trying to populate vmemmap using
> +	 *	 altmap as an alternative source of memory, and we do not exactly
> +	 *	 populate a single PMD.
> +	 */
> +	return memmap_on_memory &&

What is memmap_on_memory? I do not see anybody setting it anywhere.
Probably a later patch...

> +	       IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
> +	       size == memory_block_size_bytes() &&
> +	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
> +	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));

This is likely more complex than necessary. Is it ever possible that
remaining_size won't be aligned properly when vmemmap_size is PMD_SIZE
aligned?

> @@ -1563,10 +1639,11 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
>  	return 0;
>  }
>  
> -int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> +int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> +			unsigned long nr_vmemmap_pages)

same concern as online pages. Nobody should really care about vmemmap
reserved space. Maybe the accounting (count_system_ram_pages_cb) will
need some compensation but I have to say I got lost in this accounting
wrt to memory hotplugged memory. Where do we account hotadded memory to
system_ram_pages?

[...]
> @@ -1836,6 +1927,31 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
>  	if (rc)
>  		return rc;
>  
> +	/*
> +	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
> +	 * the same granularity it was added - a single memory block.
> +	 */
> +	if (memmap_on_memory) {
> +		nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
> +						      get_nr_vmemmap_pages_cb);
> +		if (nr_vmemmap_pages) {
> +			if (size != memory_block_size_bytes()) {
> +				pr_warn("Refuse to remove %#llx - %#llx,"
> +					"wrong granularity\n",
> +					 start, start + size);
> +				return -EINVAL;
> +			}
> +
> +			/*
> +			 * Let remove_pmd_table->free_hugepage_table
> +			 * do the right thing if we used vmem_altmap
> +			 * when hot-adding the range.
> +			 */
> +			mhp_altmap.alloc = nr_vmemmap_pages;
> +			altmap = &mhp_altmap;
> +		}
> +	}
> +

This made me scratch my head. I do not think this works for size
spanning multiple memory blocks. Maybe we do not allow something like
that happening. The logic seems inside out to me. I believe you want to
pull arch_remove_memory into the walk_memory_blocks callback and
handle each memory block this way.
Oscar Salvador March 24, 2021, 10:12 a.m. UTC | #5
On Tue, Mar 23, 2021 at 11:11:58AM +0100, Michal Hocko wrote:
> [Sorry for a long overdue review. I didn't have time to follow previous
> versions so I am sorry if some of my concerns have been discussed
> already]

No worries, let's go ;-)

> I was playing with movable_node and performance implications back in
> 2017 (unfortunately I do not have specific numbers anymore) and the
> setup was a bit extreme - a single node (0) with normal zones and all
> other nodes with movable memory only. So not only struct pages but any
> other kernel metadata were on a remote node. I remember I could see
> clear performance drop scaling with the distance from node 0, somewhere
> between 5-10% on kbuild bound to a movable node.

I see. Yes, it is a rather extreme case, but I think it clearly shows the impact
of having metadata structures on a non-local node.

 
> In fact beginning of the memory block should be sufficient as sections
> cannot be hotremoved without the rest of the memory block.

Sorry, I meant memory block here.

> > struct pages which back the allocated space then just need to be treated
> > carefully.
> > 
> > Implementation wise we will reuse vmem_altmap infrastructure to override
> > the default allocator used by __populate_section_memmap.
> > Part of the implementation also relies on memory_block structure gaining
> > a new field which specifies the number of vmemmap_pages at the beginning.
> 
> Here you are talking about memory block rather than section.

Yes, see above, it should have been memory block in both cases.

> > Hot-remove:
> > 
> >  We need to be careful when removing memory, as adding and
> >  removing memory needs to be done with the same granularity.
> >  To check that this assumption is not violated, we check the
> >  memory range we want to remove and if a) any memory block has
> >  vmemmap pages and b) the range spans more than a single memory
> >  block, we scream out loud and refuse to proceed.
> 
> Is this a real problem? If each memory block has its own vmemmap then we
> should be just fine, no?

Not entirely.

Assume this:

- memory_block_size = 128MB
- add_memory(256MB) : does not use altmap because size != memory_block_size
- add_memory(128MB) : uses altmap

Now, when trying to remove the memory, we should construct the altmap to let
remove_pmd_table->free_hugepage_table() know that it needs to call vmem_altmap_free()
instead of free_pagetable() for those sections that were populated using altmap.

But that becomes trickier to handle if user does remove_memory(384MB) at once.

The only reasonable way I can think of is something like:

/*
 * Try to differentiate which ranges used altmap when populating vmemmap,
 * and construct the altmap for those
 */
 loop(size / section_size)
  if (range_used altmap)
   arch_remove_memory(nid, start, size, altmap);
  else
   arch_remove_memory(nid, start, size, NULL);
   
But I do not think this is any better than making this scenario a complete
no-no, because in the end, this is asking for trouble.
And yes, normal qemu/baremetal users do not have the ability to play these
kinds of tricks, as baremetal has HW limitations and qemu creates a device for
every range you hot-add (so you are tied to that device when removing memory
as well), but other users, e.g. virtio-mem, can do that.

> I would appreciate some more description of the patch itself. The above
> outlines a highlevel problems and design. The patch is quite large and
> it acts on several layers - physical hotplug, {on,off}lining and sysfs
> layer.

Ok, will try to come up with something more complete wrt changelog.

> Let me capture my thinking:
> - from the top level 
> - sysfs interfaces - memory block is extended to contain the number of
>   vmemmap pages reserved from the beginning of the block for all
>   memory sections belonging to the block.
yes

> - add_memory_resource is the entry point to reserve the vmemmap space
>   for the block. This is an opt-in feature (MHP_MEMMAP_ON_MEMORY) and
>   there is no current user at this stage.
yes

> - vmem_altmap is instructed to use the reserved vmemmap space as the
>   backing storage for the vmemmap struct pages. Via arch_add_memory->
>   __populate_section_memmap.
yes

> - online_pages for some reason needs to know about the reserved vmemmap
>   space. Why? It already knows the initial pfn to online. Why cannot the
>   caller simply alter both start pfn and nr_pages to online everything
>   after the vmemmap space? This is somehow conflating the mem block
>   concept deeper into onlining.
> - the same applies to offlining.

Because some counters need not only the buddy_nr_pages, but the complete
range.

So, let us see what online_pages() does (offline_pages() as well, but slightly
different in some areas).

- move_pfn_range_to_zone():
  1) Resize node and zone spanned pages
     * If we were only to pass the nr_pages without the vmemmap pages,
       node/zone's spanned pages would be wrong as vmemmap pages would not
       be accounted in there.

  2) Inits struct pages by memmap_init_range() and sets their migratetype
     * If we were only to pass the nr_pages without the vmemmap pages,
       vmemmap pages would be totally uninitialized.
       We also set their migratetype to MIGRATE_UNMOVABLE.
       Previous versions initialize vmemmap pages in another place but
       there was a consensus to do it here.

 So, in this case, we have:

 if (nr_vmemmap_pages)
    move_pfn_range_to_zone(zone, pfn, nr_vmemmap_pages, NULL,
                                       MIGRATE_UNMOVABLE);
 move_pfn_range_to_zone(zone, buddy_start_pfn, buddy_nr_pages, NULL
                        MIGRATE_ISOLATE);

- Increment zone->present_pages
  * We need to account buddy_pages + vmemmap_pages here

- zone->zone_pgdat->node_present_pages
  * Same as above

- online_pages_range() (onlines the pages, __and__ the sections)
  * Here we not only need the (buddy_pages, end_pages) range, but
    (vmemmap_pages, end_pages) as well, because on one hand we do:

    online_pages_range()
    {
       for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
                (*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);

       online_mem_sections(start_pfn, end_pfn);
   }

   For the call to online_mem_sections, we need the whole range (including the vmemmap
   pages), otherwise, if a whole section only contains vmemmap pages, the section
   might be left marked as offline, and that is troublesome.


As I said, the same applies to offline_pages(), but with slight tweaks here and
there because it handles somewhat different things.

I kind of understand the reluctance to use vmemmap_pages terminology here, but
unfortunately we need to know about it.
We could rename nr_vmemmap_pages to offset_buddy_pages or something like that.


> - finally hotremove - which is the most tricky part. try_remove_memory
>   learns about vmemmap reserved space and provides it to __remove_pages
>   and eventually all the way down to remove_pagetable via altmap
>   Now a question and something I have stumbled over few years back when
>   looking into this. Let's say you have multi section memblock so the
>   first section of the block backs vmemmaps for all other sections.
>   What happens when you drop the first worth of section before tearing
>   down all other vmemmaps?

I guess you refer to the case where:

- memory_block_size: 1GB (8 sections)
[memory_block] : first 4096 pfns are for vmemmap

Nothing happens, but I see where your comment is coming from.

Back in 2017, in your prototype, there were two different things:

- try_remove_memory (I dunno how it was called back then) still worked
  with pages, not pfns
- arch_remove_memory() either did not have the altmap stuff, or we were
  not setting it properly, but I remember that in your prototype
  you were handling vmemmap pages in free_hugepage_table()->free_pagetable()
  being careful not to free them.

Back then, when removing the first section whose vmemmap backed further
sections, and then dereferencing those sections in free_pagetable(), we would
crash because the mapping was not there anymore.
This can no longer happen.

> [...]
> > @@ -185,10 +185,11 @@ memory_block_action(unsigned long start_section_nr, unsigned long action,
> >  
> >  	switch (action) {
> >  	case MEM_ONLINE:
> > -		ret = online_pages(start_pfn, nr_pages, online_type, nid);
> > +		ret = online_pages(start_pfn, nr_pages, nr_vmemmap_pages,
> > +				   online_type, nid);
> >  		break;
> 
> I would just offset start_pfn and nr_pages.

As stated above, this is not possible.

 
> > @@ -603,7 +606,7 @@ static int add_memory_block(unsigned long base_section_nr)
> >  	if (section_count == 0)
> >  		return 0;
> >  	return init_memory_block(memory_block_id(base_section_nr),
> > -				 MEM_ONLINE);
> > +				 MEM_ONLINE, 0);
> 
> This would deserve a comment.
> 	/* Early init code to create memory blocks for all the memory.
> 	 * Backed by bootmem struct pages so no vmemmap reserved space.
> 	 */

Ok, will add it.

> >  {
> >  	const unsigned long end_pfn = start_pfn + nr_pages;
> > -	unsigned long pfn;
> > +	unsigned long pfn = buddy_start_pfn;
> > +
> > +	/*
> > +	 * When using memmap_on_memory, the range might be unaligned as the
> > +	 * first pfns are used for vmemmap pages. Align it in case we need to.
> > +	 */
> > +	VM_BUG_ON(!IS_ALIGNED(pfn, pageblock_nr_pages));
> 
> No this is not something VM_BUG_ON should be used for. This is perfectly
> recoverable situation. Besides that this is a wrong layer to care. All
> the fixup should happen up in the call chain.

This should not happen anymore, as mhp_supports_memmap_on_memory() does not
allow using MHP_MEMMAP_ON_MEMORY if the range is not aligned to
pageblock_nr_pages.

So this can go.

> >  int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> > -		       int online_type, int nid)
> > +		       unsigned long nr_vmemmap_pages, int online_type, int nid)
> >  {
> > -	unsigned long flags;
> > +	unsigned long flags, buddy_start_pfn, buddy_nr_pages;
> >  	struct zone *zone;
> >  	int need_zonelists_rebuild = 0;
> >  	int ret;
> 
> As already mentioned, I believe this would be much easier to follow if
> the given pfn really denoted the first pfn to online, rather than teaching
> the code about vmemmap space, which is not really interesting from the
> onlining POV. Struct pages are already created. All we need is to online
> them for use.
> Just have a look at pfn vs. buddy_start_pfn usage. Why should
> zone_for_pfn_range, node_states_check_changes_online, memory_notify use
> the former rather than the latter? As mentioned above, online_pages_range
> is just made more complex by doing that.
> 
> Sure there are some consistency checks which are more convenient with
> the actual pfn start but I believe those shouldn't be a reason for
> obfuscating the code and mixing layers.

I think I explained this above, but let me repeat just in case.
Take into account that boot vmemmap_pages are also accounted to:
- zone's spanned_pages/present_pages
- node's spanned_pages/present_pages

And those pages are also initialized somehow, so we need to initialize the hotplug
vmemmap pages as well, and account them.

As I said, we can use a different terminology and name it differently, but we need to
- properly account them
- properly initialize them

And I __guess__ we could do it somewhere off the {online,offline}_pages() land,
but I see that as trickier and not worth it.

> > +	       IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
> > +	       size == memory_block_size_bytes() &&
> > +	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
> > +	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
> 
> This is likely more complex than necessary. Is it ever possible that
> remaining_size won't be aligned properly when vmemmap_size is PMD_SIZE
> aligned?

Yes, on arm64 with large pages, depending on HUGETLB support, this can lead to
one condition being true while the other is not.

> > @@ -1563,10 +1639,11 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
> >  	return 0;
> >  }
> >  
> > -int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> > +int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> > +			unsigned long nr_vmemmap_pages)
> 
> same concern as online pages. Nobody should really care about vmemmap
> reserved space. Maybe the accounting (count_system_ram_pages_cb) will
> need some compensation but I have to say I got lost in this accounting
> wrt to memory hotplugged memory. Where do we account hotadded memory to
> system_ram_pages?

Quick summary of account:

- online_pages()->move_pfn_range_to_zone(): Accounts for node/zone's spanned pages
- online_pages()->zone->present_pages += nr_pages;
- zone->zone_pgdat->node_present_pages += nr_pages;
- online_pages()->online_pages_range()->generic_online_page()->totalram_pages_add():
  Accounts for totalram_pages
- online_pages()->adjust_managed_page_count(): Accounts for zone->managed_pages

So, as you can see, we have a mix of spanned_pages, present_pages and
managed_pages in both {offline,online}_pages().
Vmemmap pages need to be properly accounted in spanned_pages and present_pages,
as we do for bootmem vmemmap pages, but they must not be accounted in
managed_pages.

> This made me scratch my head. I do not think this works for size
> spanning multiple memory blocks. Maybe we do not allow something like
> that happening. The logic seems inside out to me. I believe you want to
> either pull arch_remove_memory into the walk_memory_blocks callback and
> handle each memory block this way.

Here, what we fence off is the scenario I mentioned at the beginning, where
someone may call try_remove_memory() on a range whose memory blocks both do
and do not contain vmemmap pages.

So, if the user opt-in the feature, he needs to work with the same granularity
in the add and remove operations.

Well, that was good feedback indeed, and a lot of it; I hope to have clarified
some of the questions raised.
Michal Hocko March 24, 2021, 12:03 p.m. UTC | #6
On Wed 24-03-21 11:12:59, Oscar Salvador wrote:
> On Tue, Mar 23, 2021 at 11:11:58AM +0100, Michal Hocko wrote:
> > [Sorry for a long overdue review. I didn't have time to follow previous
> > versions so I am sorry if some of my concerns have been discussed
> > already]
> 
> No worries, let's go ;-)
> 
> > I was playing with movable_node and performance implications back in
> > 2017 (unfortunately I do not have specific numbers anymore) and the
> > setup was a bit extreme - a single node (0) with normal zones and all
> > other nodes with movable memory only. So not only struct pages but any
> > other kernel metadata were on a remote node. I remember I could see
> > clear performance drop scaling with the distance from node 0 somewhere
> > betweem 5-10% on kbuild bound on a movable node.
> 
> I see. Yes, it is a rather extreme case, but I think it clearly shows the impact
> of having metadata structures on a non-local node.

Well, the original intention behind that setup was to have all nodes but
node 0 movable to support the memory hotplug which the machine allows.
There are many downsides but I wouldn't be surprised if somebody still
tried to push this way.
  
[...]
> > > Hot-remove:
> > > 
> > >  We need to be careful when removing memory, as adding and
> > >  removing memory needs to be done with the same granularity.
> > >  To check that this assumption is not violated, we check the
> > >  memory range we want to remove and if a) any memory block has
> > >  vmemmap pages and b) the range spans more than a single memory
> > >  block, we scream out loud and refuse to proceed.
> > 
> > Is this a real problem? If each memory block has its own vmemmap then we
> > should be just fine, no?
> 
> Not entirely.
> 
> Assume this:
> 
> - memory_block_size = 128MB
> - add_memory(256MB) : no uses altmap because size != memory_block_size
> - add_memory(128MB) : uses altmap

256 are two memory blocks so why couldn't we split the operation and add
two altmaps each for its own memory block?

> Now, when trying to remove the memory, we should construct the altmap to let
> remove_pmd_table->free_hugepage_table() know that it needs to call vmem_altmap_free()
> instead of free_pagetable() for those sections that were populated using altmap.
> 
> But that becomes trickier to handle if user does remove_memory(384MB) at once.
> 
> The only reasonable way I can think of is something like:
> 
> /*
>  * Try to diferentiate which ranges used altmap when populating vmemmap,
>  * and construct the altmap for those
>  */
>  loop(size / section_size)

block_size I suspect

>   if (range_used altmap)
>    arch_remove_memory(nid, start, size, altmap);
>   else
>    arch_remove_memory(nid, start, size, NULL);

We should know, because we do have the memory block at hand when removing memory.

> But I do not think this is any better than make this scenario completely a NO-NO,
> because in the end, this is asking for trouble.
> And yes, normal qemu/barematal users does not have the hability to play these
> kind of tricks, as baremetal has HW limitations and qemu creates a device for
> every range you hot-add (so you are tied to that device when removing memory
> as well), but other users e.g: virtio-mem can do that.

I am not against reducing functionality for the initial version where it
makes sense. E.g. partial memory blocks. But if the overall hotplug
operation can be divided into multiple blocks then there shouldn't be
any real reason for restrictions IMHO.

[...]
> > - online_pages for some reason needs to know about the reserved vmemmap
> >   space. Why? It already knows the intial pfn to online. Why cannot
> >   caller simply alter both start pfn and nr_pages to online everything
> >   after the vmemmap space? This is somehow conflating the mem block
> >   concept deeper into onlining.
> > - the same applies to offlining.
> 
> Because some counters need not only the buddy_nr_pages, but the complete
> range.
> 
> So, let us see what online_pages() do (offline_pages() as well but slightly
> different in some areas)
> 
> - move_pfn_range_to_zone():
>   1) Resize node and zone spanned pages
>      * If we were only to pass the nr_pages without the vmemmap pages,
>        node/zone's spanned pages would be wrong as vmemmap pages would not
>        be accounted in there.

Why is that a problem? That memory is not usable anyway.

>   2) Inits struct pages by memmap_init_range() and sets its migratetype
>      * If we were only to pass the nr_pages without the vmemmap pages,
>        vmemmap pages would be totally unitialized.
>        We also set its migratetype to MIGRATE_UNMOVABLE.
>        Previous versions initialize vmemmap pages in another place but
>        there was a consensus to do it here.

Migrate type for pages backing vmemmap?

>  So on, this case, we have:
> 
>  if (nr_vmemmap_pages)
>     move_pfn_range_to_zone(zone, pfn, nr_vmemmap_pages, NULL,
>                                        MIGRATE_UNMOVABLE);
>  move_pfn_range_to_zone(zone, buddy_start_pfn, buddy_nr_pages, NULL
>                         MIGRATE_ISOLATE);
> 
> - Increment zone->present_pages
>   * We need to account buddy_pages + vmemmap_pages here

You can compensate for that by accounting present_pages where you
allocate them - when the memory is hot removed.

> - zone->zone_pgdat->node_present_pages
>   * Same as above
> 
> - online_pages_range() (onlines the pages, __and__ the sections)
>   * Here do not only need the (buddy_pages, end_pages), but (vmemmap_pages, end_pages)
>     as well, because on one hand we do:
> 
>     online_pages_range()
>     {
>        for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
>                 (*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
> 
>        online_mem_sections(start_pfn, end_pfn);
>    }
> 
>    For the call to online_mem_sections, we need to whole range (including the vmemmap
>    pages), otherwise, if a whole section only contains vmemmap pages, the section
>    might be left marked as offline, and that is troublesome.

I would like to hear much more about those troubles.
 
> As I said, the same applies to offline_pages(), but with slightly tweaks here and
> there because it handles somewhat different things.
> 
> I kind of understand to be reluctant to use vmemmap_pages terminology here, but
> unfortunately we need to know about it.
> We could rename nr_vmemmap_pages to offset_buddy_pages or something like that.

I am not convinced. It seems you are just trying to graft the new
functionality in. But I still believe that {on,off}lining shouldn't care
about where their vmemmaps come from at all. It should be a
responsibility of the code which reserves that space to compensate for
accounting. Otherwise we will end up with a hard to maintain code
because expectations would be spread at way too many places. Not to
mention different pfns that the code should care about.
 
> > - finally hotremove - which is the most tricky part. try_remove_memory
> >   learns about vmemmap reserved space and provides it to __remove_pages
> >   and eventually all the way down to remove_pagetable via altmap
> >   Now a question and something I have stumbled over few years back when
> >   looking into this. Let's say you have multi section memblock so the
> >   first section of the block backs vmemmaps for all other sections.
> >   What happens when you drop the first worth of section before tearing
> >   down all other vmemmaps?
> 
> I guess you refer to the case were:
> 
> - memory_block_size: 1GB (8 sections)
> [memory_block] : first 4096 pfns are for vmemmap
> 
> Nothing happens, but I see where your comment is comming from.
> 
> Back in 2017, in your prototype, there were two different things:
> 
> - try_remove_memory (I dunno how it was called back then) still worked
>   with pages, not pfns
> - arch_memory_memory() either did not have the altmap stuff, or we were
>   not setting it properly, but I remember that in your prototype
>   you were handling vmemmap pages in free_hugepage_table()->free_pagetable()
>   being carefull to not free them.
> 
> Back then, when removing the first vmemmap backing further sections, when
> then dereferencing those sections in free_pagetable(), we would crash because
> the mapping was not there anymore.
> This cannot longer happen.

OK, it would be great to outline that in the changelog because this is
an important detail.

[...]
> > >  	const unsigned long end_pfn = start_pfn + nr_pages;
> > > -	unsigned long pfn;
> > > +	unsigned long pfn = buddy_start_pfn;
> > > +
> > > +	/*
> > > +	 * When using memmap_on_memory, the range might be unaligned as the
> > > +	 * first pfns are used for vmemmap pages. Align it in case we need to.
> > > +	 */
> > > +	VM_BUG_ON(!IS_ALIGNED(pfn, pageblock_nr_pages));
> > 
> > No this is not something VM_BUG_ON should be used for. This is perfectly
> > recoverable situation. Besides that this is a wrong layer to care. All
> > the fixup should happen up in the call chain.
> 
> This should not happen anymore as mhp_support_memmap_on_memory() does not let
> to use MHP_MEMMAP_ON_MEMORY if range is not pageblock_nr_pages.

My point was that even if this was still needed VM_BUG_ON is not the
right way to take care of it. If you have a way to gracefully handle an
unexpected input then this should always be done.

[...]
> > > +	       IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
> > > +	       size == memory_block_size_bytes() &&
> > > +	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
> > > +	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
> > 
> > This is likely more complex than necessary. Is it ever possible that
> > remaining_size won't be aligned properly when vmemmap_size is PMD_SIZE
> > aligned?
> 
> Yes, on arm64 with large pages depending on HUGETLB support this can lead to
> one condition be true while the other not.

A comment would be helpful.

> > > @@ -1563,10 +1639,11 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
> > >  	return 0;
> > >  }
> > >  
> > > -int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> > > +int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> > > +			unsigned long nr_vmemmap_pages)
> > 
> > same concern as online pages. Nobody should really care about vmemmap
> > reserved space. Maybe the accounting (count_system_ram_pages_cb) will
> > need some compensation but I have to say I got lost in this accounting
> > wrt to memory hotplugged memory. Where do we account hotadded memory to
> > system_ram_pages?
> 
> Quick summary of account:
> 
> - online_pages()->move_pfn_range_to_zone(): Accounts for node/zone's spanned pages
> - online_pages()->zone->present_pages += nr_pages;
> - zone->zone_pgdat->node_present_pages += nr_pages;
> - online_pages()->online_pages_range()->generic_online_page()->totalram_pages_add():
>   Accounts for totalram_pages

these should account vmemmap pages as well. Although I do not see why it
would be a big problem to leave those out. Anyway, it should be quite
straightforward to account them at the time when the vmemmap space is
reserved as already mentioned above.

> - online_pages()->adjust_managed_page_count(): Accounts for zone->managed_pages

these are only managed by the allocator so vmemmap pages are off the
table.
Michal Hocko March 24, 2021, 12:10 p.m. UTC | #7
On Wed 24-03-21 13:03:29, Michal Hocko wrote:
> On Wed 24-03-21 11:12:59, Oscar Salvador wrote:
[...]

an additional remark

> > - online_pages()->move_pfn_range_to_zone(): Accounts for node/zone's spanned pages
> > - online_pages()->zone->present_pages += nr_pages;

I am pretty sure you shouldn't account vmemmap pages to the target zone
in some cases - e.g. vmemmap cannot be part of the movable zone, can it?
So this would be yet another special casing. This patch has got it wrong
unless I have missed some special casing.
David Hildenbrand March 24, 2021, 12:23 p.m. UTC | #8
On 24.03.21 13:10, Michal Hocko wrote:
> On Wed 24-03-21 13:03:29, Michal Hocko wrote:
>> On Wed 24-03-21 11:12:59, Oscar Salvador wrote:
> [...]
> 
> an additional remark
> 
>>> - online_pages()->move_pfn_range_to_zone(): Accounts for node/zone's spanned pages
>>> - online_pages()->zone->present_pages += nr_pages;
> 
> I am pretty sure you shouldn't account vmmemmap pages to the target zone
> in some cases - e.g. vmemmap cannot be part of the movable zone, can it?
> So this would be yet another special casing. This patch has got it wrong
> unless I have missed some special casing.
> 

It's a bit unfortunate that we have to discuss the very basic design 
decisions again.

@Oscar, maybe you can share the links where we discussed all this and 
add some of it to the patch description.

I think what we have right here is good enough for an initial version, 
from where on we can improve things without having to modify calling code.
Michal Hocko March 24, 2021, 12:37 p.m. UTC | #9
On Wed 24-03-21 13:23:47, David Hildenbrand wrote:
> On 24.03.21 13:10, Michal Hocko wrote:
> > On Wed 24-03-21 13:03:29, Michal Hocko wrote:
> > > On Wed 24-03-21 11:12:59, Oscar Salvador wrote:
> > [...]
> > 
> > an additional remark
> > 
> > > > - online_pages()->move_pfn_range_to_zone(): Accounts for node/zone's spanned pages
> > > > - online_pages()->zone->present_pages += nr_pages;
> > 
> > I am pretty sure you shouldn't account vmmemmap pages to the target zone
> > in some cases - e.g. vmemmap cannot be part of the movable zone, can it?
> > So this would be yet another special casing. This patch has got it wrong
> > unless I have missed some special casing.
> > 
> 
> It's a bit unfortunate that we have to discuss the very basic design
> decisions again.

It would be great to have those basic design decisions laid out in the
changelog.

> @Oscar, maybe you can share the links where we discussed all this and add
> some of it to the patch description.
> 
> I think what we have right here is good enough for an initial version, from
> where on we can improve things without having to modify calling code.

I have to say I really dislike vmemmap proliferation into
{on,off}lining. It just doesn't belong there from a layering POV. All
this code should care about is to hand over pages to the allocator and
make them visible.

Is that a sufficient concern to nack the whole thing? No, I do not think
so. But I do not see any particular rush to have this work merged ASAP.
David Hildenbrand March 24, 2021, 1:13 p.m. UTC | #10
On 24.03.21 13:37, Michal Hocko wrote:
> On Wed 24-03-21 13:23:47, David Hildenbrand wrote:
>> On 24.03.21 13:10, Michal Hocko wrote:
>>> On Wed 24-03-21 13:03:29, Michal Hocko wrote:
>>>> On Wed 24-03-21 11:12:59, Oscar Salvador wrote:
>>> [...]
>>>
>>> an additional remark
>>>
>>>>> - online_pages()->move_pfn_range_to_zone(): Accounts for node/zone's spanned pages
>>>>> - online_pages()->zone->present_pages += nr_pages;
>>>
>>> I am pretty sure you shouldn't account vmmemmap pages to the target zone
>>> in some cases - e.g. vmemmap cannot be part of the movable zone, can it?
>>> So this would be yet another special casing. This patch has got it wrong
>>> unless I have missed some special casing.
>>>
>>
>> It's a bit unfortunate that we have to discuss the very basic design
>> decisions again.
> 
> It would be great to have those basic design decisions layed out in the
> changelog.
> 
>> @Oscar, maybe you can share the links where we discussed all this and add
>> some of it to the patch description.
>>
>> I think what we have right here is good enough for an initial version, from
>> where on we can improve things without having to modify calling code.
> 
> I have to say I really dislike vmemmap proliferation into
> {on,off}lining. It just doesn't belong there from a layering POV. All
> this code should care about is to hand over pages to the allocator and
> make them visible.

Well, someone has to initialize the vmemmap of the vmemmap pages (which 
is itself :) ), and as the vmemmap does not span complete sections, things 
can get very weird as we can only set whole sections online (there was 
more to that, I think it's buried in previous discussions).

> 
> Is that a sufficient concern to nack the whole thing? No, I do not think
> so. But I do not see any particular rush to have this work needs to be
> merged ASAP.

Sure, there is no need to rush (not that I suggested that).
Oscar Salvador March 24, 2021, 1:27 p.m. UTC | #11
On Wed, Mar 24, 2021 at 01:03:29PM +0100, Michal Hocko wrote:
> > Assume this:
> > 
> > - memory_block_size = 128MB
> > - add_memory(256MB) : no uses altmap because size != memory_block_size
> > - add_memory(128MB) : uses altmap
> 
> 256 are two memory blocks so why couldn't we split the operation and add
> two altmaps each for its own memory block?

We could, but then the code just gets trickier. I find it easier to define
simpler semantics that must hold in order to opt in to the feature.

Moreover, what we get in return wrt. splitting sizes into memory_blocks is not
really worth it compared to the simple check we have in place to fence off
these kinds of situations.

So, IOW, we could do it conceptually, but I would like to keep it simple.

> > But I do not think this is any better than make this scenario completely a NO-NO,
> > because in the end, this is asking for trouble.
> > And yes, normal qemu/barematal users does not have the hability to play these
> > kind of tricks, as baremetal has HW limitations and qemu creates a device for
> > every range you hot-add (so you are tied to that device when removing memory
> > as well), but other users e.g: virtio-mem can do that.
> 
> I am not against reducing functionality for the initial version where it
> makes sense. E.g. partial memory blocks. But if the overall hotplug
> operation can be devided into multiple blocks then there shouldn't be
> any real reason for restrictions IMHO.

As I said, I think it can be done, but it would overcomplicate the picture
at this moment, and I am not sure it is really worth it.
Something to inspect for the future? Sure, and it kinda sounds interesting,
but putting that in place atm would be too much IMHO.

> [...]
> > > - online_pages for some reason needs to know about the reserved vmemmap
> > >   space. Why? It already knows the intial pfn to online. Why cannot
> > >   caller simply alter both start pfn and nr_pages to online everything
> > >   after the vmemmap space? This is somehow conflating the mem block
> > >   concept deeper into onlining.
> > > - the same applies to offlining.
> > 
> > Because some counters need not only the buddy_nr_pages, but the complete
> > range.
> > 
> > So, let us see what online_pages() do (offline_pages() as well but slightly
> > different in some areas)
> > 
> > - move_pfn_range_to_zone():
> >   1) Resize node and zone spanned pages
> >      * If we were only to pass the nr_pages without the vmemmap pages,
> >        node/zone's spanned pages would be wrong as vmemmap pages would not
> >        be accounted in there.
> 
> Why is that a problem? That memory is not usable anyway.

We account bootmem vmemmap memory as node/zone's spanned pages, so why should
hotplug vmemmap be any different? And we are talking about spanned pages here,
which do not relate to usable memory.


> >   2) Inits struct pages by memmap_init_range() and sets its migratetype
> >      * If we were only to pass the nr_pages without the vmemmap pages,
> >        vmemmap pages would be totally unitialized.
> >        We also set its migratetype to MIGRATE_UNMOVABLE.
> >        Previous versions initialize vmemmap pages in another place but
> >        there was a consensus to do it here.
> 
> Migrate type for pages backing vmemmap?

Since we initialize them, it made sense to mark them as UNMOVABLE, as that
memory is self-hosted.
More on that below.

> > - Increment zone->present_pages
> >   * We need to account buddy_pages + vmemmap_pages here
> 
> You can compensate for that by accounting present_pages where you
> allocate them - when the memory is hot removed.

Why compensate? We have the data handy, and as above, bootmem vmemmap pages
get directly accounted to node/zone's present_pages. I do not see why
hotplug vmemmap pages would be different in this regard.

> > - zone->zone_pgdat->node_present_pages
> >   * Same as above
> > 
> > - online_pages_range() (onlines the pages, __and__ the sections)
> >   * Here do not only need the (buddy_pages, end_pages), but (vmemmap_pages, end_pages)
> >     as well, because on one hand we do:
> > 
> >     online_pages_range()
> >     {
> >        for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
> >                 (*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
> > 
> >        online_mem_sections(start_pfn, end_pfn);
> >    }
> > 
> >    For the call to online_mem_sections, we need to whole range (including the vmemmap
> >    pages), otherwise, if a whole section only contains vmemmap pages, the section
> >    might be left marked as offline, and that is troublesome.
> 
> I would like to hear much more about those troubles.

In a previous discussion with David:

   " Well, if the section holding vmemmap pages is offline, would it be bad
    that pfn_to_online_page() returns NULL?
    AFAICS, callers would know that there is nothing to do with such page,
    and that is true since they only back other pages, and if
    these other pages are offline, that is it."

  "It's for example an issue with zone shrinking. But I recall it's also an
   issue when hibernating/suspending, dumping memory and so on ."

Moreover, sections containing bootmem vmemmap pages also get onlined, so we
would be special-casing something that does not need to be.

> > As I said, the same applies to offline_pages(), but with slightly tweaks here and
> > there because it handles somewhat different things.
> > 
> > I kind of understand to be reluctant to use vmemmap_pages terminology here, but
> > unfortunately we need to know about it.
> > We could rename nr_vmemmap_pages to offset_buddy_pages or something like that.
> 
> I am not convinced. It seems you are justr trying to graft the new
> functionality in. But I still believe that {on,off}lining shouldn't care
> about where their vmemmaps come from at all. It should be a
> responsibility of the code which reserves that space to compansate for
> accounting. Otherwise we will end up with a hard to maintain code
> because expectations would be spread at way too many places. Not to
> mention different pfns that the code should care about.
...
> > Back then, when removing the first vmemmap backing further sections, when
> > then dereferencing those sections in free_pagetable(), we would crash because
> > the mapping was not there anymore.
> > This cannot longer happen.
> 
> OK, it would be great to outline that in the changelog because this is
> an important detail.

Will add it.

> > This should not happen anymore as mhp_support_memmap_on_memory() does not let
> > to use MHP_MEMMAP_ON_MEMORY if range is not pageblock_nr_pages.
> 
> My point was that even if this was still needed VM_BUG_ON is not the
> right way to take care of it. If you have a way to gracefully handle an
> unexpected input then this should always be done.

Ok, I understand.

> > Yes, on arm64 with large pages depending on HUGETLB support this can lead to
> > one condition be true while the other not.
> 
> A comment would be helpful.

Sure

> > Quick summary of account:
> > 
> > - online_pages()->move_pfn_range_to_zone(): Accounts for node/zone's spanned pages
> > - online_pages()->zone->present_pages += nr_pages;
> > - zone->zone_pgdat->node_present_pages += nr_pages;
> > - online_pages()->online_pages_range()->generic_online_page()->totalram_pages_add():
> >   Accounts for totalram_pages
> 
> these should account vmemmap pages as well. Although I do not why it
> would be a big problem to leave those out. Anyway, it should be quite
> straightforward to account them at the time when the vmemmap space is
> reserved as already mentioned above.
> 
> > - online_pages()->adjust_managed_page_count(): Accounts for zone->managed_pages
> 
> these are only managed by the allocator so vmemmap pages are off the
> table.

Yes, vmemmap pages are not accounted to managed_pages.

Ok, it all boils down to:

- online_pages/offline_pages, __besides__ handing pages over to buddy or migrating them,
  take care of onlining/offlining sections and the accounting of:
  present_pages (node/zone)
  spanned_pages (node/zone)
  managed_pages  (zone)
  (remember that bootmem vmemmap pages get directly accounted in those counters,
   special casing vmemmap pages seems wrong tbh)

  __and__ the initialization of pages.

  vmemmap pages, as any other pages, need to be properly initialized.
  Since the initialization and accounting is carved into online_pages(), it makes
  sense to pass the vmemmap pages in there.
  
I guess we could have some helpers to do that and try to call those helpers
out of {online,offline}_pages(), but I think the end result would look more
weird.

As I said, vmemmap pages need to be initialized and accounted, and all that
work is done in {offline,online}_pages.

As I said, the terminology could change (not sure it is any better), but we need to
do some of the work those two functions provide, and it seems quite straightforward
to do it there.
Michal Hocko March 24, 2021, 1:40 p.m. UTC | #12
On Wed 24-03-21 14:13:31, David Hildenbrand wrote:
> On 24.03.21 13:37, Michal Hocko wrote:
> > On Wed 24-03-21 13:23:47, David Hildenbrand wrote:
> > > On 24.03.21 13:10, Michal Hocko wrote:
> > > > On Wed 24-03-21 13:03:29, Michal Hocko wrote:
> > > > > On Wed 24-03-21 11:12:59, Oscar Salvador wrote:
> > > > [...]
> > > > 
> > > > an additional remark
> > > > 
> > > > > > - online_pages()->move_pfn_range_to_zone(): Accounts for node/zone's spanned pages
> > > > > > - online_pages()->zone->present_pages += nr_pages;
> > > > 
> > > > I am pretty sure you shouldn't account vmmemmap pages to the target zone
> > > > in some cases - e.g. vmemmap cannot be part of the movable zone, can it?
> > > > So this would be yet another special casing. This patch has got it wrong
> > > > unless I have missed some special casing.
> > > > 
> > > 
> > > It's a bit unfortunate that we have to discuss the very basic design
> > > decisions again.
> > 
> > It would be great to have those basic design decisions layed out in the
> > changelog.
> > 
> > > @Oscar, maybe you can share the links where we discussed all this and add
> > > some of it to the patch description.
> > > 
> > > I think what we have right here is good enough for an initial version, from
> > > where on we can improve things without having to modify calling code.
> > 
> > I have to say I really dislike vmemmap proliferation into
> > {on,off}lining. It just doesn't belong there from a layering POV. All
> > this code should care about is to hand over pages to the allocator and
> > make them visible.
> 
> Well, someone has to initialize the vmemmap of the vmemmap pages ( which is
> itself :) ),

Yeah, and I would expect this to be done when the vmemmap space is
reserved. This is at the hotadd time and we do not know the zone but
that shouldn't really matter because their zone can be quite arbitrary
kernel zone. As mentioned previously I do not think associating those
with zone movable is a good idea as they are fundamentally not movable.
It is likely that the zone doesn't really matter for these pages anyway
and the only thing we do care about is that they are not poisoned and
there is at least something, but again it would be much better to have a
single place where all those details are done (including accounting)
rather than trying to wrap our heads around different pfns when onlining
pages and growing potential and subtle bugs there.

> and as the vemmap does not span complete sections things can
> get very weird as we can only set whole sections online (there was more to
> that, I think it's buried in previous discussions).

Yes, the section can only be onlined as a whole. This is an important "detail"
and it would deserve some more clarification in the changelog as well.
You have invested quite some energy into code consolidation and checks
to make sure that hotplugged code doesn't have holes and this work bends
those rules. vmemmap is effectively a hole in a memblock/section. I
think we should re-evaluate some of those constrains rather than try to
work them around somehow.
David Hildenbrand March 24, 2021, 2:05 p.m. UTC | #13
On 24.03.21 14:40, Michal Hocko wrote:
> On Wed 24-03-21 14:13:31, David Hildenbrand wrote:
>> On 24.03.21 13:37, Michal Hocko wrote:
>>> On Wed 24-03-21 13:23:47, David Hildenbrand wrote:
>>>> On 24.03.21 13:10, Michal Hocko wrote:
>>>>> On Wed 24-03-21 13:03:29, Michal Hocko wrote:
>>>>>> On Wed 24-03-21 11:12:59, Oscar Salvador wrote:
>>>>> [...]
>>>>>
>>>>> an additional remark
>>>>>
>>>>>>> - online_pages()->move_pfn_range_to_zone(): Accounts for node/zone's spanned pages
>>>>>>> - online_pages()->zone->present_pages += nr_pages;
>>>>>
>>>>> I am pretty sure you shouldn't account vmmemmap pages to the target zone
>>>>> in some cases - e.g. vmemmap cannot be part of the movable zone, can it?
>>>>> So this would be yet another special casing. This patch has got it wrong
>>>>> unless I have missed some special casing.
>>>>>
>>>>
>>>> It's a bit unfortunate that we have to discuss the very basic design
>>>> decisions again.
>>>
>>> It would be great to have those basic design decisions layed out in the
>>> changelog.
>>>
>>>> @Oscar, maybe you can share the links where we discussed all this and add
>>>> some of it to the patch description.
>>>>
>>>> I think what we have right here is good enough for an initial version, from
>>>> where on we can improve things without having to modify calling code.
>>>
>>> I have to say I really dislike vmemmap proliferation into
>>> {on,off}lining. It just doesn't belong there from a layering POV. All
>>> this code should care about is to hand over pages to the allocator and
>>> make them visible.
>>
>> Well, someone has to initialize the vmemmap of the vmemmap pages ( which is
>> itself :) ),
> 
> Yeah, and I would expect this to be done when the vmemmap space is
> reserved. This is at the hotadd time and we do not know the zone but
> that shouldn't really matter because their zone can be quite arbitrary
> kernel zone. As mentioned previously I do not think associating those
> with zone movable is a good idea as they are fundamentally not movable.

I don't think that's an issue. Just another special case to keep things 
simple. (and not completely fragment zones, mess with zone shrinking etc.)

> It is likely that the zone doesn't really matter for these pages anyway
> and the only think we do care about is that they are not poisoned and
> there is at least something but again it would be much better to have a
> single place where all those details are done (including accounting)
> rather than trying to wrap head around different pfns when onlining
> pages and grow potential and suble bugs there.

Exactly, as you said, the zone doesn't really matter - thus, this patch 
just handles it as simply as possible: keep them in the same zone as the 
whole memory block. No fragmented zones, no special casing, simple.

Details are actually pretty much all at a single place when 
onlining/offlining().

> 
>> and as the vemmap does not span complete sections things can
>> get very weird as we can only set whole sections online (there was more to
>> that, I think it's buried in previous discussions).
> 
> Yes the section can only online as whole. This is an important "detail"
> and it would deserve some more clarification in the changelog as well.

Indeed.

> You have invested quite some energy into code consolidation and checks
> to make sure that hotplugged code doesn't have holes and this work bends
> those rules. vmemmap is effectivelly a hole in a memblock/section. I
> think we should re-evaluate some of those constrains rather than try to
> work them around somehow.
It's an offset in the beginning, so it's a special case. And the 
question is if there is a real benefit in handling it differently, for 
example, messing with online sections, messing with zones .. I am not 
convinced that the added complexity gives us a real benefit. But I shall 
be taught otherwise.


BTW: I once thought about having 
online_memory_block(block)/offline_memory_block(block) as separate 
functions instead of the pretty generic (error prone?) 
online_pages()/offline_pages(). Then these details would just go in 
there, and the memory block + online/offline logic would simply be 
self-contained.
Michal Hocko March 24, 2021, 2:42 p.m. UTC | #14
On Wed 24-03-21 13:03:29, Michal Hocko wrote:
> On Wed 24-03-21 11:12:59, Oscar Salvador wrote:
[...]
> > I kind of understand being reluctant to use vmemmap_pages terminology here, but
> > unfortunately we need to know about it.
> > We could rename nr_vmemmap_pages to offset_buddy_pages or something like that.
> 
> I am not convinced. It seems you are just trying to graft the new
> functionality in. But I still believe that {on,off}lining shouldn't care
> about where their vmemmaps come from at all. It should be a
> responsibility of the code which reserves that space to compensate for
> accounting. Otherwise we will end up with hard to maintain code
> because expectations would be spread at way too many places. Not to
> mention different pfns that the code should care about.

The below is a quick hack on top of this patch to illustrate my
thinking. I have dug all the vmemmap pieces out of the
{on,off}lining path and hooked up the accounting at the point where the
space is reserved. This just compiles without any deeper look, so there
are likely some minor problems, but I haven't encountered any major
problems or hacks to introduce into the code. The separation seems to be
possible. The diffstat also looks promising. Am I missing something
fundamental in this?

--- 
 drivers/base/memory.c          |   8 +--
 include/linux/memory_hotplug.h |   6 +-
 mm/memory_hotplug.c            | 151 ++++++++++++++++++++---------------------
 3 files changed, 80 insertions(+), 85 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 5ea2b3fbce02..9697acfe96eb 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -181,15 +181,15 @@ memory_block_action(unsigned long start_section_nr, unsigned long action,
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
 	int ret;
 
-	start_pfn = section_nr_to_pfn(start_section_nr);
+	start_pfn = section_nr_to_pfn(start_section_nr) + nr_vmemmap_pages;
+	nr_pages -= nr_vmemmap_pages;
 
 	switch (action) {
 	case MEM_ONLINE:
-		ret = online_pages(start_pfn, nr_pages, nr_vmemmap_pages,
-				   online_type, nid);
+		ret = online_pages(start_pfn, nr_pages, online_type, nid);
 		break;
 	case MEM_OFFLINE:
-		ret = offline_pages(start_pfn, nr_pages, nr_vmemmap_pages);
+		ret = offline_pages(start_pfn, nr_pages);
 		break;
 	default:
 		WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index a85d4b7d15c2..673d2d4a8443 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -109,8 +109,7 @@ extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
 /* VM interface that may be used by firmware interface */
 extern int online_pages(unsigned long pfn, unsigned long nr_pages,
-			unsigned long nr_vmemmap_pages, int online_type,
-			int nid);
+			int online_type, int nid);
 extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 					 unsigned long end_pfn);
 extern void __offline_isolated_pages(unsigned long start_pfn,
@@ -317,8 +316,7 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
 #ifdef CONFIG_MEMORY_HOTREMOVE
 
 extern void try_offline_node(int nid);
-extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
-			 unsigned long nr_vmemmap_pages);
+extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern int remove_memory(int nid, u64 start, u64 size);
 extern void __remove_memory(int nid, u64 start, u64 size);
 extern int offline_and_remove_memory(int nid, u64 start, u64 size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0c3a98cb8cde..754026a9164d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -844,30 +844,19 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
 }
 
 int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
-		       unsigned long nr_vmemmap_pages, int online_type, int nid)
+		       int online_type, int nid)
 {
-	unsigned long flags, buddy_start_pfn, buddy_nr_pages;
+	unsigned long flags;
 	struct zone *zone;
 	int need_zonelists_rebuild = 0;
 	int ret;
 	struct memory_notify arg;
 
-	/* We can only online full sections (e.g., SECTION_IS_ONLINE) */
-	if (WARN_ON_ONCE(!nr_pages ||
-			 !IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
-		return -EINVAL;
-
-	buddy_start_pfn = pfn + nr_vmemmap_pages;
-	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
-
 	mem_hotplug_begin();
 
 	/* associate pfn range with the zone */
 	zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
-	if (nr_vmemmap_pages)
-		move_pfn_range_to_zone(zone, pfn, nr_vmemmap_pages, NULL,
-				       MIGRATE_UNMOVABLE);
-	move_pfn_range_to_zone(zone, buddy_start_pfn, buddy_nr_pages, NULL,
+	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL,
 			       MIGRATE_ISOLATE);
 
 	arg.start_pfn = pfn;
@@ -884,7 +873,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	 * onlining, such that undo_isolate_page_range() works correctly.
 	 */
 	spin_lock_irqsave(&zone->lock, flags);
-	zone->nr_isolate_pageblock += buddy_nr_pages / pageblock_nr_pages;
+	zone->nr_isolate_pageblock += nr_pages / pageblock_nr_pages;
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	/*
@@ -897,7 +886,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 		setup_zone_pageset(zone);
 	}
 
-	online_pages_range(pfn, nr_pages, buddy_start_pfn);
+	online_pages_range(pfn, nr_pages, pfn);
 	zone->present_pages += nr_pages;
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
@@ -910,9 +899,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	zone_pcp_update(zone);
 
 	/* Basic onlining is complete, allow allocation of onlined pages. */
-	undo_isolate_page_range(buddy_start_pfn,
-				buddy_start_pfn + buddy_nr_pages,
-				MIGRATE_MOVABLE);
+	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
 
 	/*
 	 * Freshly onlined pages aren't shuffled (e.g., all pages are placed to
@@ -1126,6 +1113,59 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
 	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
 }
 
+static void reserve_vmmemmap_space(int nid, unsigned long pfn, unsigned long nr_pages, struct vmem_altmap *altmap)
+{
+	struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_NORMAL];
+
+	altmap->free = nr_pages;
+	altmap->base_pfn = pfn;
+
+	/* initialize struct pages and account for this space */
+	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
+}
+
+static void unaccount_vmemmap_space(int nid, unsigned long start_pfn, unsigned long nr_pages)
+{
+	struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_NORMAL];
+	unsigned long flags;
+
+	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
+	zone->present_pages -= nr_pages;
+
+	pgdat_resize_lock(zone->zone_pgdat, &flags);
+	zone->zone_pgdat->node_present_pages -= nr_pages;
+	pgdat_resize_unlock(zone->zone_pgdat, &flags);
+
+	remove_pfn_range_from_zone(zone, start_pfn, nr_pages);
+}
+
+static int remove_memory_block_cb(struct memory_block *mem, void *arg)
+{
+	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+	struct vmem_altmap mhp_altmap = {};
+	struct vmem_altmap *altmap = NULL;
+	u64 start = PFN_PHYS(section_nr_to_pfn(mem->start_section_nr));
+	u64 size = memory_block_size_bytes();
+
+	if (!mem->nr_vmemmap_pages) {
+		arch_remove_memory(mem->nid, start, size, NULL);
+		return 0;
+	}
+
+	/*
+	 * Let remove_pmd_table->free_hugepage_table
+	 * do the right thing if we used vmem_altmap
+	 * when hot-adding the range.
+	 */
+	mhp_altmap.alloc = nr_vmemmap_pages;
+	altmap = &mhp_altmap;
+
+	unaccount_vmemmap_space(mem->nid, PHYS_PFN(start), nr_vmemmap_pages);
+	arch_remove_memory(mem->nid, start, size, altmap);
+
+	return 0;
+}
+
 /*
  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
  * and online/offline operations (triggered e.g. by sysfs).
@@ -1170,8 +1210,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 			ret = -EINVAL;
 			goto error;
 		}
-		mhp_altmap.free = PHYS_PFN(size);
-		mhp_altmap.base_pfn = PHYS_PFN(start);
+		reserve_vmmemmap_space(nid, PHYS_PFN(start), PHYS_PFN(size), &mhp_altmap);
 		params.altmap = &mhp_altmap;
 	}
 
@@ -1639,25 +1678,16 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
 	return 0;
 }
 
-int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
-			unsigned long nr_vmemmap_pages)
+int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
-	unsigned long pfn, buddy_start_pfn, buddy_nr_pages, system_ram_pages = 0;
+	unsigned long pfn, system_ram_pages = 0;
 	unsigned long flags;
 	struct zone *zone;
 	struct memory_notify arg;
 	int ret, node;
 	char *reason;
 
-	/* We can only offline full sections (e.g., SECTION_IS_ONLINE) */
-	if (WARN_ON_ONCE(!nr_pages ||
-			 !IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
-		return -EINVAL;
-
-	buddy_start_pfn = start_pfn + nr_vmemmap_pages;
-	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
-
 	mem_hotplug_begin();
 
 	/*
@@ -1693,7 +1723,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	zone_pcp_disable(zone);
 
 	/* set above range as isolated */
-	ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
+	ret = start_isolate_page_range(start_pfn, end_pfn,
 				       MIGRATE_MOVABLE,
 				       MEMORY_OFFLINE | REPORT_FAILURE);
 	if (ret) {
@@ -1713,7 +1743,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	}
 
 	do {
-		pfn = buddy_start_pfn;
+		pfn = start_pfn;
 		do {
 			if (signal_pending(current)) {
 				ret = -EINTR;
@@ -1744,18 +1774,18 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 		 * offlining actually in order to make hugetlbfs's object
 		 * counting consistent.
 		 */
-		ret = dissolve_free_huge_pages(buddy_start_pfn, end_pfn);
+		ret = dissolve_free_huge_pages(start_pfn, end_pfn);
 		if (ret) {
 			reason = "failure to dissolve huge pages";
 			goto failed_removal_isolated;
 		}
 
-		ret = test_pages_isolated(buddy_start_pfn, end_pfn, MEMORY_OFFLINE);
+		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
 
 	} while (ret);
 
 	/* Mark all sections offline and remove free pages from the buddy. */
-	__offline_isolated_pages(start_pfn, end_pfn, buddy_start_pfn);
+	__offline_isolated_pages(start_pfn, end_pfn, start_pfn);
 	pr_debug("Offlined Pages %ld\n", nr_pages);
 
 	/*
@@ -1764,13 +1794,13 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	 * of isolated pageblocks, memory onlining will properly revert this.
 	 */
 	spin_lock_irqsave(&zone->lock, flags);
-	zone->nr_isolate_pageblock -= buddy_nr_pages / pageblock_nr_pages;
+	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	zone_pcp_enable(zone);
 
 	/* removal success */
-	adjust_managed_page_count(pfn_to_page(start_pfn), -buddy_nr_pages);
+	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
 	zone->present_pages -= nr_pages;
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
@@ -1799,7 +1829,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	return 0;
 
 failed_removal_isolated:
-	undo_isolate_page_range(buddy_start_pfn, end_pfn, MIGRATE_MOVABLE);
+	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
 	memory_notify(MEM_CANCEL_OFFLINE, &arg);
 failed_removal_pcplists_disabled:
 	zone_pcp_enable(zone);
@@ -1830,14 +1860,6 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
 	return 0;
 }
 
-static int get_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
-{
-	/*
-	 * If not set, continue with the next block.
-	 */
-	return mem->nr_vmemmap_pages;
-}
-
 static int check_cpu_on_node(pg_data_t *pgdat)
 {
 	int cpu;
@@ -1912,9 +1934,6 @@ EXPORT_SYMBOL(try_offline_node);
 static int __ref try_remove_memory(int nid, u64 start, u64 size)
 {
 	int rc = 0;
-	struct vmem_altmap mhp_altmap = {};
-	struct vmem_altmap *altmap = NULL;
-	unsigned long nr_vmemmap_pages = 0;
 
 	BUG_ON(check_hotplug_memory_range(start, size));
 
@@ -1927,31 +1946,6 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 	if (rc)
 		return rc;
 
-	/*
-	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
-	 * the same granularity it was added - a single memory block.
-	 */
-	if (memmap_on_memory) {
-		nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
-						      get_nr_vmemmap_pages_cb);
-		if (nr_vmemmap_pages) {
-			if (size != memory_block_size_bytes()) {
-				pr_warn("Refuse to remove %#llx - %#llx,"
-					"wrong granularity\n",
-					 start, start + size);
-				return -EINVAL;
-			}
-
-			/*
-			 * Let remove_pmd_table->free_hugepage_table
-			 * do the right thing if we used vmem_altmap
-			 * when hot-adding the range.
-			 */
-			mhp_altmap.alloc = nr_vmemmap_pages;
-			altmap = &mhp_altmap;
-		}
-	}
-
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
 
@@ -1963,7 +1957,10 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 
 	mem_hotplug_begin();
 
-	arch_remove_memory(nid, start, size, altmap);
+	if (!memmap_on_memory)
+		arch_remove_memory(nid, start, size, NULL);
+	else
+		walk_memory_blocks(start, size, NULL, remove_memory_block_cb);
 
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
 		memblock_free(start, size);
David Hildenbrand March 24, 2021, 2:52 p.m. UTC | #15
On 24.03.21 15:42, Michal Hocko wrote:
> On Wed 24-03-21 13:03:29, Michal Hocko wrote:
>> On Wed 24-03-21 11:12:59, Oscar Salvador wrote:
> [...]
>>> I kind of understand to be reluctant to use vmemmap_pages terminology here, but
>>> unfortunately we need to know about it.
>>> We could rename nr_vmemmap_pages to offset_buddy_pages or something like that.
>>
>> I am not convinced. It seems you are just trying to graft the new
>> functionality in. But I still believe that {on,off}lining shouldn't care
>> about where their vmemmaps come from at all. It should be a
>> responsibility of the code which reserves that space to compensate for
>> accounting. Otherwise we will end up with a hard to maintain code
>> because expectations would be spread at way too many places. Not to
>> mention different pfns that the code should care about.
> 
> The below is a quick hack on top of this patch to illustrate my
> thinking. I have dug out all the vmemmap pieces out of the
> {on,off}lining and hooked all the accounting when the space is reserved.
> This just compiles without any deeper look so there are likely some
> minor problems but I haven't really encountered any major problems or
> hacks to introduce into the code. The separation seems to be possible.
> The diffstat also looks promising. Am I missing something fundamental in
> this?
> 

From a quick glimpse, this touches on two things discussed in the past:

1. If the underlying memory block is offline, all sections are offline. 
Zone shrinking code will happily skip over the vmemmap pages and you can 
end up with out-of-zone pages assigned to the zone. Can happen in corner 
cases. There is no way to know that the memmap of these pages was 
initialized and is of value.

2. You heavily fragment the zone layout, even though you might otherwise 
end up with consecutive zones (e.g., when onlining all hotplugged memory 
movable).

Michal Hocko March 24, 2021, 4:04 p.m. UTC | #16
On Wed 24-03-21 15:52:38, David Hildenbrand wrote:
> On 24.03.21 15:42, Michal Hocko wrote:
> > On Wed 24-03-21 13:03:29, Michal Hocko wrote:
> > > On Wed 24-03-21 11:12:59, Oscar Salvador wrote:
> > [...]
> > > > I kind of understand to be reluctant to use vmemmap_pages terminology here, but
> > > > unfortunately we need to know about it.
> > > > We could rename nr_vmemmap_pages to offset_buddy_pages or something like that.
> > > 
> > > I am not convinced. It seems you are just trying to graft the new
> > > functionality in. But I still believe that {on,off}lining shouldn't care
> > > about where their vmemmaps come from at all. It should be a
> > > responsibility of the code which reserves that space to compensate for
> > > accounting. Otherwise we will end up with a hard to maintain code
> > > because expectations would be spread at way too many places. Not to
> > > mention different pfns that the code should care about.
> > 
> > The below is a quick hack on top of this patch to illustrate my
> > thinking. I have dug out all the vmemmap pieces out of the
> > {on,off}lining and hooked all the accounting when the space is reserved.
> > This just compiles without any deeper look so there are likely some
> > minor problems but I haven't really encountered any major problems or
> > hacks to introduce into the code. The separation seems to be possible.
> > The diffstat also looks promising. Am I missing something fundamental in
> > this?
> > 
> 
> From a quick glimpse, this touches on two things discussed in the past:
> 
> 1. If the underlying memory block is offline, all sections are offline. Zone
> shrinking code will happily skip over the vmemmap pages and you can end up
> with out-of-zone pages assigned to the zone. Can happen in corner cases.

You are right. But do we really care? Those pages should be of no
interest to anybody iterating through zones/nodes anyway.

> There is no way to know that the memmap of these pages was initialized and
> is of value.
> 
> 2. You heavily fragment zone layout although you might end up with
> consecutive zones (e.g., online all hotplugged memory movable)

What would the consequences be?
David Hildenbrand March 24, 2021, 7:16 p.m. UTC | #17
On 24.03.21 17:04, Michal Hocko wrote:
> On Wed 24-03-21 15:52:38, David Hildenbrand wrote:
>> On 24.03.21 15:42, Michal Hocko wrote:
>>> On Wed 24-03-21 13:03:29, Michal Hocko wrote:
>>>> On Wed 24-03-21 11:12:59, Oscar Salvador wrote:
>>> [...]
>>>>> I kind of understand to be reluctant to use vmemmap_pages terminology here, but
>>>>> unfortunately we need to know about it.
>>>>> We could rename nr_vmemmap_pages to offset_buddy_pages or something like that.
>>>>
>>>> I am not convinced. It seems you are just trying to graft the new
>>>> functionality in. But I still believe that {on,off}lining shouldn't care
>>>> about where their vmemmaps come from at all. It should be a
>>>> responsibility of the code which reserves that space to compensate for
>>>> accounting. Otherwise we will end up with a hard to maintain code
>>>> because expectations would be spread at way too many places. Not to
>>>> mention different pfns that the code should care about.
>>>
>>> The below is a quick hack on top of this patch to illustrate my
>>> thinking. I have dug out all the vmemmap pieces out of the
>>> {on,off}lining and hooked all the accounting when the space is reserved.
>>> This just compiles without any deeper look so there are likely some
>>> minor problems but I haven't really encountered any major problems or
>>> hacks to introduce into the code. The separation seems to be possible.
>>> The diffstat also looks promising. Am I missing something fundamental in
>>> this?
>>>
>>
>>  From a quick glimpse, this touches on two things discussed in the past:
>>
>> 1. If the underlying memory block is offline, all sections are offline. Zone
>> shrinking code will happily skip over the vmemmap pages and you can end up
>> with out-of-zone pages assigned to the zone. Can happen in corner cases.
> 
> You are right. But do we really care? Those pages should be of no
> interest to anybody iterating through zones/nodes anyway.

Well, we were just discussing getting zone/node links + span right for 
all pages (including for special reserved pages), because it already 
resulted in BUGs. So I am not convinced that we *don't* have to care.

However, I agree that most code that cares about node/zone spans 
shouldn't care - e.g., never call set_pfnblock_flags_mask() on such blocks.

But I guess there are corner cases where we would end up with 
zone_is_empty() == true, not sure what that effect would be ... at least 
the node cannot vanish as we disallow offlining it while we have a 
memory block linked to it.


Another thing that comes to mind is that our zone shrinking code 
currently searches in PAGES_PER_SUBSECTION (2 MiB IIRC) increments. If 
our vmemmap pages covered less than that, we could accidentally 
shrink the !vmemmap part too much, as we would mis-detect the type of 
a PAGES_PER_SUBSECTION block.

IIRC, this would apply for memory block sizes < 128 MiB. Not relevant on 
x86 and arm64. Could be relevant for ppc64, if we'd ever want to support 
memmap_on_memory there. Or if we'd ever reduce the section size on some 
arch below 128 MiB. At least we would have to fence it somehow.


> 
>> There is no way to know that the memmap of these pages was initialized and
>> is of value.
>>
>> 2. You heavily fragment zone layout although you might end up with
>> consecutive zones (e.g., online all hotplugged memory movable)
> 
> What would be consequences?

IIRC, set_zone_contiguous() will leave zone->contiguous = false.

This, in turn, will force pageblock_pfn_to_page() via the slow path, 
turning page isolation a bit slower.

Not a deal breaker, but obviously something where Oscar's original patch 
can do better.


I have yet to think again about the other issues (I remember most issues 
we discussed back then were related to having the vmemmap only within the 
same memory block). I think 2) might be tolerable, although unfortunate. 
Regarding 1), we'll have to dive into more details.
Oscar Salvador March 25, 2021, 8:07 a.m. UTC | #18
On Wed, Mar 24, 2021 at 08:16:53PM +0100, David Hildenbrand wrote:
> > > 1. If the underlying memory block is offline, all sections are offline. Zone
> > > shrinking code will happily skip over the vmemmap pages and you can end up
> > > with out-of-zone pages assigned to the zone. Can happen in corner cases.
> > 
> > You are right. But do we really care? Those pages should be of no
> > interest to anybody iterating through zones/nodes anyway.
> 
> Well, we were just discussing getting zone/node links + span right for all
> pages (including for special reserved pages), because it already resulted in
> BUGs. So I am not convinced that we *don't* have to care.
> 
> However, I agree that most code that cares about node/zone spans shouldn't
> care - e.g., never call set_pfnblock_flags_mask() on such blocks.
> 
> But I guess there are corner cases where we would end up with
> zone_is_empty() == true, not sure what that effect would be ... at least the
> node cannot vanish as we disallow offlining it while we have a memory block
> linked to it.

Having quickly looked at Michal's change, I have to say that it does not
look that bad, but I think it does the initialization/accounting at
the wrong stage. Plus, I dislike placing those pages in
ZONE_NORMAL, although they are not movable.
I think the vmemmap pages should lie within the same zone as the pages
they describe; doing so simplifies things, and I do not see any outright
downside.

Back in the old days, we used to do this sort of
initialization/tearing-down of pages at the hot-add/hot-remove stage, and
we moved away from that because of the many problems it caused.
While we might not encounter those problems with vmemmap pages because of
their nature, I think that not sticking with the right place might bring
problems as David outlined.

IMHO, initialization and accounting should be done at the online/offline
stage.
Now, the problem seems to be that doing it right in
offline_pages()/online_pages() is bad practice and conflates
different concepts.

This is just an idea I have not thought through carefully, but what if we
do it in helpers right before calling online_pages()/offline_pages()
in memory_block_action()?
I have to confess that I do not really like the idea, but it would do
the right things at the right stage.

One last thing.
Let us try to avoid things like "we do not care". Hotplug code was full
of "we do not care" assumptions that bit us in the ass at some point.
Michal Hocko March 25, 2021, 9:17 a.m. UTC | #19
On Thu 25-03-21 09:07:03, Oscar Salvador wrote:
> On Wed, Mar 24, 2021 at 08:16:53PM +0100, David Hildenbrand wrote:
> > > > 1. If the underlying memory block is offline, all sections are offline. Zone
> > > > shrinking code will happily skip over the vmemmap pages and you can end up
> > > > with out-of-zone pages assigned to the zone. Can happen in corner cases.
> > > 
> > > You are right. But do we really care? Those pages should be of no
> > > interest to anybody iterating through zones/nodes anyway.
> > 
> > Well, we were just discussing getting zone/node links + span right for all
> > pages (including for special reserved pages), because it already resulted in
> > BUGs. So I am not convinced that we *don't* have to care.
> > 
> > However, I agree that most code that cares about node/zone spans shouldn't
> > care - e.g., never call set_pfnblock_flags_mask() on such blocks.
> > 
> > But I guess there are corner cases where we would end up with
> > zone_is_empty() == true, not sure what that effect would be ... at least the
> > node cannot vanish as we disallow offlining it while we have a memory block
> > linked to it.
> 
> Having quickly looked at Michal's change, I have to say that it does not
> look that bad, but I think it is doing the initialization/accounting at
> the wrong stage,

Why do you think it is wrong to initialize/account pages when they are
used? Keep in mind that offline pages are not used until they are
onlined. But vmemmap pages are used since the vmemmap is established
which happens in the hotadd stage.

> plus the fact that I dislike to place those pages in
> ZONE_NORMAL, although they are not movable.
> But I think the vmemmap pages should lay within the same zone the pages
> they describe, doing so simplifies things, and I do not see any outright
> downside.

Well, both ways likely have their pros and cons. Nevertheless, if the
vmemmap storage is independent (which is the case for normal hotplug)
then the state is consistent over hotadd, {online, offline} N times,
hotremove cycles.  Which is conceptually reasonable as vmemmap doesn't
go away on each offline.

If you are going to bind accounting to the online/offline stages then
the accounting changes each time you go through the cycle and depending
on the onlining type it would travel among zones. I find it quite
confusing as the storage for vmemmap hasn't changed any of its
properties.

[...]

> This is just an idea I did not get to think carefully, but what if we
> do it in helpers right before calling online_pages()/offline_pages()
> in memory_block_action() ?

That would result in less confusing {on,off}lining code
operating on two sets of pfns/nr_pages. But fundamentally I still
consider it suboptimal to have accounting which is detached from the
life cycle. If we really want to go that path, we should have a very good
reason for that.
Oscar Salvador March 25, 2021, 10:55 a.m. UTC | #20
On Thu, Mar 25, 2021 at 10:17:33AM +0100, Michal Hocko wrote:
> Why do you think it is wrong to initialize/account pages when they are
> used? Keep in mind that offline pages are not used until they are
> onlined. But vmemmap pages are used since the vmemmap is established
> which happens in the hotadd stage.

Yes, that is true.
vmemmap pages are used right when we populate the vmemmap space.

> > plus the fact that I dislike to place those pages in
> > ZONE_NORMAL, although they are not movable.
> > But I think the vmemmap pages should lay within the same zone the pages
> > they describe, doing so simplifies things, and I do not see any outright
> > downside.
> 
> Well, both ways likely have its pros and cons. Nevertheless, if the
> vmemmap storage is independent (which is the case for normal hotplug)
> then the state is consistent over hotadd, {online, offline} N times,
> hotremove cycles.  Which is conceptually reasonable as vmemmap doesn't
> go away on each offline.
> 
> If you are going to bind accounting to the online/offline stages then
> the accounting changes each time you go through the cycle and depending
> on the onlining type it would travel among zones. I find it quite
> confusing as the storage for vmemmap hasn't changed any of its
> properties.

That is a good point I guess.
vmemmap pages do not really go away until the memory is unplugged.

But I see some questions to raise:

- As I said, I really dislike tying vmemmap memory to ZONE_NORMAL
  unconditionally, and this might result in the problems David mentioned.
  I remember David and I discussed such problems, but the problems with
  zones not being contiguous have also been discussed in the past and,
  IIRC, we reached the conclusion that a maximal effort should be made
  to keep them that way, as otherwise other things suffer, e.g. the
  compaction code.
  So if we really want to move the initialization/accounting to the
  hot-add/hot-remove stage, I would really like to be able to set the
  proper zone there (that is, the same zone where the memory will lie).

- When moving the initialization/accounting to hot-add/hot-remove,
  the section containing the vmemmap pages will remain offline.
  It might get onlined once the pages get online in online_pages(),
  or not if the vmemmap pages span a whole section.
  I remember (but maybe David remembers better) that that was a problem
  wrt. pfn_to_online_page() and hibernation/kdump.
  So, if that is really a problem, we would have to take care of setting
  the section to the right state.

- AFAICS, doing all the above brings us back to former times when some
  initialization/accounting was done at an earlier stage, and I remember
  it was pushed hard to move those into online/offline_pages().
  Are we ok with that?
  As I said, we might have to set the right zone at the hot-add stage, as
  otherwise problems might come up.
  That being the case, would not that also be conflating different concepts
  at the wrong phase?

Do not take me wrong, I quite like Michal's idea, and from a
conceptual point of view I guess it is the right thing to do.
But when evaluating risks/difficulty, I am not really sure.

If we can pull that off while setting the right zone (and it remains to
be seen what to do about the section state), and the outcome is not ugly,
I am all for it.
Also, a middle ground might be something like I previously mentioned
(having a helper in memory_block_action() to do the right thing, so
offline/online_pages() do not get polluted).
David Hildenbrand March 25, 2021, 11:08 a.m. UTC | #21
On 25.03.21 11:55, Oscar Salvador wrote:
> On Thu, Mar 25, 2021 at 10:17:33AM +0100, Michal Hocko wrote:
>> Why do you think it is wrong to initialize/account pages when they are
>> used? Keep in mind that offline pages are not used until they are
>> onlined. But vmemmap pages are used since the vmemmap is established
>> which happens in the hotadd stage.
> 
> Yes, that is true.
> vmemmap pages are used right when we populate the vmemmap space.
> 

Note: I once heard of a corner-case use case where people offline memory 
blocks to then use the "free" memory via /dev/mem for other purposes 
("large physical memory"). Not that I encourage such use cases, but they 
would be fundamentally broken if the vmemmap ends up on offline memory 
and is supposed to keep its state ...

>>> plus the fact that I dislike to place those pages in
>>> ZONE_NORMAL, although they are not movable.
>>> But I think the vmemmap pages should lay within the same zone the pages
>>> they describe, doing so simplifies things, and I do not see any outright
>>> downside.
>>
>> Well, both ways likely have its pros and cons. Nevertheless, if the
>> vmemmap storage is independent (which is the case for normal hotplug)
>> then the state is consistent over hotadd, {online, offline} N times,
>> hotremove cycles.  Which is conceptually reasonable as vmemmap doesn't
>> go away on each offline.
>>
>> If you are going to bind accounting to the online/offline stages then
>> the accounting changes each time you go through the cycle and depending
>> on the onlining type it would travel among zones. I find it quite
>> confusing as the storage for vmemmap hasn't changed any of its
>> properties.
> 
> That is a good point I guess.
> vmemmap pages do not really go away until the memory is unplugged.
> 
> But I see some questions to raise:
> 
> - As I said, I really dislike tying vmemmap memory to ZONE_NORMAL
>    unconditionally and this might result in the problems David mentioned.
>    I remember David and I discussed such problems but the problems with
>    zones not being contiguous have also been discussed in the past and
>    IIRC, we reached the conclusion that a maximal effort should be made
>    to keep them that way, otherwise other things suffer e.g. the
>    compaction code.
>    So if we really want to move the initialization/accounting to the
>    hot-add/hot-remove stage, I would really like to be able to set the
>    proper zone in there (that is, the same zone where the memory will lie).

Determining the zone when hot-adding does not make too much sense: you 
don't know what user space might end up deciding (online_kernel, 
online_movable...).

> 
> - When moving the initialization/accounting to hot-add/hot-remove,
>    the section containing the vmemmap pages will remain offline.
>    It might get onlined once the pages get online in online_pages(),
>    or not if the vmemmap pages span a whole section.
>    I remember (but maybe David remembers better) that that was a problem
>    wrt. pfn_to_online_page() and hibernation/kdump.
>    So, if that is really a problem, we would have to take care of setting
>    the section to the right state.

Good memory. Indeed, hibernation/kdump won't save the state of the 
vmemmap, because the memory is marked as offline and, thus, logically 
without any valuable content.

> 
> - AFAICS, doing all the above brings us back to former times when some
>    initialization/accounting was done at an earlier stage, and I remember
>    it was pushed hard to move those into online/offline_pages().
>    Are we ok with that?
>    As I said, we might have to set the right zone at the hot-add stage, as
>    otherwise problems might come up.
>    That being the case, would not that also be conflating different concepts
>    at the wrong phase?
> 

I expressed my opinion already, no need to repeat. Sub-section online 
maps would make it cleaner, but I am still not convinced we want/need that.

> Do not take me wrong, I quite like Michal's idea, and from a
> conceptual point of view I guess it is the right thing to do.
> But when evaluating risks/difficulty, I am not really sure.
> 
> If we can pull that off while setting the right zone (and it remains to
> be seen what to do about the section state), and the outcome is not ugly,
> I am all for it.
> Also, a middle ground might be something like I previously mentioned
> (having a helper in memory_block_action() to do the right thing, so
> offline/online_pages() do not get polluted).

As I said, having something like 
memory_block_online()/memory_block_offline() could be one way to tackle 
it. We only support onlining/offlining of memory blocks and I ripped out 
all code that was abusing online_pages/offline_pages ...

So have memory_block_online() call online_pages() and do the accounting 
of the vmemmap, with a big fat comment that sections are actually set 
online/offline in online_pages/offline_pages(). Could be a simple 
cleanup on top of this series ...
Oscar Salvador March 25, 2021, 11:23 a.m. UTC | #22
On Thu, Mar 25, 2021 at 12:08:43PM +0100, David Hildenbrand wrote:
> As I said, having something like
> memory_block_online()/memory_block_offline() could be one way to tackle it.
> We only support onlining/offlining of memory blocks and I ripped out all
> code that was abusing online_pages/offline_pages ...
> 
> So have memory_block_online() call online_pages() and do the accounting of
> the vmemmap, with a big fat comment that sections are actually set
> online/offline in online_pages/offline_pages(). Could be a simple cleanup on
> top of this series ...

I overlooked this in your previous reply.
Yes, this looks like a middle-ground compromise and something I would
definitely pick over the other options.

If there is a consensus that that is the most straightforward way to go, I
could try to code that up to see how it looks so we all have an idea.

Thanks!
Michal Hocko March 25, 2021, 12:26 p.m. UTC | #23
On Thu 25-03-21 11:55:01, Oscar Salvador wrote:
> On Thu, Mar 25, 2021 at 10:17:33AM +0100, Michal Hocko wrote:
> > Why do you think it is wrong to initialize/account pages when they are
> > used? Keep in mind that offline pages are not used until they are
> > onlined. But vmemmap pages are used since the vmemmap is established
> > which happens in the hotadd stage.
> 
> Yes, that is true.
> vmemmap pages are used right when we populate the vmemmap space.
> 
> > > plus the fact that I dislike to place those pages in
> > > ZONE_NORMAL, although they are not movable.
> > > But I think the vmemmap pages should lay within the same zone the pages
> > > they describe, doing so simplifies things, and I do not see any outright
> > > downside.
> > 
> > Well, both ways likely have its pros and cons. Nevertheless, if the
> > vmemmap storage is independent (which is the case for normal hotplug)
> > then the state is consistent over hotadd, {online, offline} N times,
> > hotremove cycles.  Which is conceptually reasonable as vmemmap doesn't
> > go away on each offline.
> > 
> > If you are going to bind accounting to the online/offline stages then
> > the accounting changes each time you go through the cycle and depending
> > on the onlining type it would travel among zones. I find it quite
> > confusing as the storage for vmemmap hasn't changed any of its
> > properties.
> 
> That is a good point I guess.
> vmemmap pages do not really go away until the memory is unplugged.
> 
> But I see some questions to raise:
> 
> - As I said, I really dislike tying vmemmap memory to ZONE_NORMAL
>   unconditionally and this might result in the problems David mentioned.
>   I remember David and I discussed such problems but the problems with
>   zones not being contiguous have also been discussed in the past and
>   IIRC, we reached the conclusion that a maximal effort should be made
>   to keep them that way, otherwise other things suffer e.g. the
>   compaction code.

Yeah, David has raised the contiguous flag for zone already. And to be
completely honest I fail to see why we should shape a design based on an
optimization. If anything we can teach set_zone_contiguous to simply
ignore zone affiliation of vmemmap pages. I would be really curious if
that would pose any harm to the compaction code as they are reserved and
compaction should simply skip them.

>   So if we really want to move the initialization/accounting to the
>   hot-add/hot-remove stage, I would really like to be able to set the
>   proper zone in there (that is, the same zone where the memory will lie).

There is nothing like a proper zone.

> - When moving the initialization/accounting to hot-add/hot-remove,
>   the section containing the vmemmap pages will remain offline.

Yes, this sucks! I do not have a good answer for that, as the
online/offline granularity seems rather coarse here.

>   It might get onlined once the pages get online in online_pages(),
>   or not if vmemmap pages span a whole section.
>   I remember (but maybe David remembers better) that that was a problem
>   wrt. pfn_to_online_page() and hibernation/kdump.
>   So, if that is really a problem, we would have to take care of setting
>   the section to the right state.
> 
> - AFAICS, doing all the above brings us back to former times when some
>   initialization/accounting was done at an earlier stage, and I remember
>   it was pushed hard to move those into online/offline_pages().

Not sure what you are referring to, but if you mean the time prior to
f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to
zones until online"), then that was an entirely different story. Users do
care where their memory goes because that depends on the use case, but do
they care about the vmemmap?
Michal Hocko March 25, 2021, 12:35 p.m. UTC | #24
On Thu 25-03-21 12:08:43, David Hildenbrand wrote:
> On 25.03.21 11:55, Oscar Salvador wrote:
> > On Thu, Mar 25, 2021 at 10:17:33AM +0100, Michal Hocko wrote:
> > > Why do you think it is wrong to initialize/account pages when they are
> > > used? Keep in mind that offline pages are not used until they are
> > > onlined. But vmemmap pages are used since the vmemmap is established
> > > which happens in the hotadd stage.
> > 
> > Yes, that is true.
> > vmemmap pages are used right when we populate the vmemmap space.
> > 
> 
> Note: I once heard of a corner-case use case where people offline memory
> blocks to then use the "free" memory via /dev/mem for other purposes ("large
> physical memory"). Not that I encourage such use cases, but they would be
> fundamentally broken if the vmemmap ends up on offline memory and is
> supposed to keep its state ...

I am not aware of such a use case; it surely sounds, ehm, creative, but
nothing really new. Such a use case sounds quite incompatible with
this feature regardless of whether we account the vmemmap at hotadd or
hotremove, because users would need to understand that part of the memory
they have hotadded is not usable.

[...]
> > - When moving the initialization/accounting to hot-add/hot-remove,
> >    the section containing the vmemmap pages will remain offline.
> >    It might get onlined once the pages get online in online_pages(),
> >    or not if vmemmap pages span a whole section.
> >    I remember (but maybe David remembers better) that that was a problem
> >    wrt. pfn_to_online_page() and hibernation/kdump.
> >    So, if that is really a problem, we would have to take care of setting
> >    the section to the right state.
> 
> Good memory. Indeed, hibernation/kdump won't save the state of the vmemmap,
> because the memory is marked as offline and, thus, logically without any
> valuable content.

Could you point me to the respective hibernation code please? I always
get lost in that area. Anyway, we do have the same problem even if the
whole accounting is handled during {on,off}lining, no?

I am not really worried about kdump though. As the memory is offline,
losing its vmemmap is a mere annoyance.


> > - AFAICS, doing all the above brings us to former times were some
> >    initialization/accounting was done in a previous stage, and I remember
> >    it was pushed hard to move those in online/offline_pages().
> >    Are we ok with that?
> >    As I said, we might have to set the right zone in hot-add stage, as
> >    otherwise problems might come up.
> >    Being that case, would not that also be conflating different concepts
> >    at a wrong phases?
> > 
> 
> I expressed my opinion already, no need to repeat. Sub-section online maps
> would make it cleaner, but I am still not convinced we want/need that.

Nah, subsections are trickier than necessary, and if we can live
without them and keep this as just pmem weirdness, that is more than enough ;)
David Hildenbrand March 25, 2021, 12:40 p.m. UTC | #25
On 25.03.21 13:35, Michal Hocko wrote:
> On Thu 25-03-21 12:08:43, David Hildenbrand wrote:
>> On 25.03.21 11:55, Oscar Salvador wrote:
>>> On Thu, Mar 25, 2021 at 10:17:33AM +0100, Michal Hocko wrote:
>>>> Why do you think it is wrong to initialize/account pages when they are
>>>> used? Keep in mind that offline pages are not used until they are
>>>> onlined. But vmemmap pages are used since the vmemmap is established
>>>> which happens in the hotadd stage.
>>>
>>> Yes, that is true.
>>> vmemmap pages are used right when we populate the vmemmap space.
>>>
>>
>> Note: I once heard of a corner-case use case where people offline memory
>> blocks to then use the "free" memory via /dev/mem for other purposes ("large
>> physical memory"). Not that I encourage such use cases, but they would be
>> fundamentally broken if the vmemmap ends up on offline memory and is
>> supposed to keep its state ...
> 
> I am not aware of such a use case, it surely sounds, ehm creative, but
> nothing really new. But such a usecase sounds quite incompatible with
> this feature whether we account vmemmap at hotadd or hotremove because
> they would need to understand that part of the memory they have hotadded
> is not useable.

I think they can use it just fine via /dev/mem, which explicitly avoids 
any kind of "struct page" references IIRC. They would be overwriting the 
vmemmap, but they can happily read/write that part until onlining, at 
which point the vmemmap would get reinitialized and set online - from 
then on, pfn_to_online_page() would also succeed on the vmemmap itself.

> 
> [...]
>>> - When moving the initialization/accounting to hot-add/hot-remove,
>>>     the section containing the vmemmap pages will remain offline.
>>>     It might get onlined once the pages get online in online_pages(),
>>>     or not if vmemmap pages span a whole section.
>>>     I remember (but maybe David remembers better) that that was a problem
>>>     wrt. pfn_to_online_page() and hibernation/kdump.
>>>     So, if that is really a problem, we would have to take care of setting
>>>     the section to the right state.
>>
>> Good memory. Indeed, hibernation/kdump won't save the state of the vmemmap,
>> because the memory is marked as offline and, thus, logically without any
>> valuable content.
> 
> Could you point me to the respective hibernation code please? I always
> get lost in that area. Anyway, we do have the same problem even if the
> whole accounting is handled during {on,off}lining, no?

kernel/power/snapshot.c:saveable_page().

> 
> I am not really worried about kdump though. As the memory is offline
> then losing its vmemmap is a mere annoyance.

Yep, kdump was relevant in our previous discussions when we were talking 
about memory blocks having their altmap located on other memory blocks.

> 
> 
>>> - AFAICS, doing all the above brings us to former times were some
>>>     initialization/accounting was done in a previous stage, and I remember
>>>     it was pushed hard to move those in online/offline_pages().
>>>     Are we ok with that?
>>>     As I said, we might have to set the right zone in hot-add stage, as
>>>     otherwise problems might come up.
>>>     Being that case, would not that also be conflating different concepts
>>>     at a wrong phases?
>>>
>>
>> I expressed my opinion already, no need to repeat. Sub-section online maps
>> would make it cleaner, but I am still not convinced we want/need that.
> 
> Nah, subsections are more tricky than necessary and if we can live
> without them and have it just as pmem weirdness is more than enough ;)

Yes, absolutely :)
Oscar Salvador March 25, 2021, 2:02 p.m. UTC | #26
On Thu, Mar 25, 2021 at 01:26:34PM +0100, Michal Hocko wrote:
> Yeah, David has raised the contiguous flag for zone already. And to be
> completely honest I fail to see why we should shape a design based on an
> optimization. If anything we can teach set_zone_contiguous to simply
> ignore zone affiliation of vmemmap pages. I would be really curious if
> that would pose any harm to the compaction code as they are reserved and
> compaction should simply skip them.

No, compaction code is clever enough to skip over those pages as it
already does for any Reserved page.
My comment was more about keeping the zone contiguous.

I know it is an optimization, but

 commit 7cf91a98e607c2f935dbcc177d70011e95b8faff
 Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
 Date:   Tue Mar 15 14:57:51 2016 -0700
 
 mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous

talks about a 30% improvement. I am not sure whether those numbers still
hold nowadays, but it feels wrong to throw that away when we can do
better there, and IMHO, it does not overly complicate things.

> There is nothing like a proper zone.

I guess not, but for me it makes sense that vmemmap pages stay within
the same zone as the pages they describe.
Of course, this is a matter of opinions/taste.

> Not sure what you are referring to but if you have prior to f1dd2cd13c4b
> ("mm, memory_hotplug: do not associate hotadded memory to zones until
> online") then this was entirely a different story. Users do care where
> they memory goes because that depends on the usecase but do they care
> about vmemmap?

As I said, that is not what I am worried about.
Users do not really care where those pages end up, that is transparent
to them (wrt. vmemmap pages), but we (internally) kind of do.

So, as I said, I see advantages in doing it your way, but I also see
downsides:

- I would like to take the zone into account, and for that we would have
  to pull some of the functions that check for the zone into an earlier
  stage, and the mere thought sounds ugly.
- The section containing the vmemmap can remain offline, and we would
  have to come up with something to sort that out.

Of course, the big advantage is that we do things at the right time.

Now, doing things as David and I suggested (that is, creating a helper to
do the accounting/initialization of the vmemmap at the online stage
without polluting online/offline_pages()) has the downside of mixing
different concepts with different lifecycles, but it has an enormous
advantage: it keeps things fairly simple.


I do not want to be seen as pushing back just for the sake of it, and I
am more than open to exploring other means.
Actually, the code has evolved and been reshaped quite a lot since its
beginning, so nothing really speaks against that, but I think the idea of
the helper is less troublesome and simpler.
Michal Hocko March 25, 2021, 2:08 p.m. UTC | #27
On Thu 25-03-21 13:40:45, David Hildenbrand wrote:
> On 25.03.21 13:35, Michal Hocko wrote:
> > On Thu 25-03-21 12:08:43, David Hildenbrand wrote:
> > > On 25.03.21 11:55, Oscar Salvador wrote:
[...]
> > > > - When moving the initialization/accounting to hot-add/hot-remove,
> > > >     the section containing the vmemmap pages will remain offline.
> > > >     It might get onlined once the pages get online in online_pages(),
> > > >     or not if vmemmap pages span a whole section.
> > > >     I remember (but maybe David rmemeber better) that that was a problem
> > > >     wrt. pfn_to_online_page() and hybernation/kdump.
> > > >     So, if that is really a problem, we would have to care of ot setting
> > > >     the section to the right state.
> > > 
> > > Good memory. Indeed, hibernation/kdump won't save the state of the vmemmap,
> > > because the memory is marked as offline and, thus, logically without any
> > > valuable content.
> > 
> > Could you point me to the respective hibernation code please? I always
> > get lost in that area. Anyway, we do have the same problem even if the
> > whole accounting is handled during {on,off}lining, no?
> 
> kernel/power/snapshot.c:saveable_page().

Thanks! So this is as I suspected. The very same problem is present
if the memory block is marked offline. So we need a solution here
anyway. One way to go would be to consider these vmemmap pages always
online. pfn_to_online_page() would have to special-case them, but we
would need to identify them first. I used to have PageVmemmap or
something like that in my early attempt to do this.

That being said, this is not an argument for one approach or the other.
Both need fixing.
David Hildenbrand March 25, 2021, 2:09 p.m. UTC | #28
On 25.03.21 15:08, Michal Hocko wrote:
> On Thu 25-03-21 13:40:45, David Hildenbrand wrote:
>> On 25.03.21 13:35, Michal Hocko wrote:
>>> On Thu 25-03-21 12:08:43, David Hildenbrand wrote:
>>>> On 25.03.21 11:55, Oscar Salvador wrote:
> [...]
>>>>> - When moving the initialization/accounting to hot-add/hot-remove,
>>>>>      the section containing the vmemmap pages will remain offline.
>>>>>      It might get onlined once the pages get online in online_pages(),
>>>>>      or not if vmemmap pages span a whole section.
>>>>>      I remember (but maybe David rmemeber better) that that was a problem
>>>>>      wrt. pfn_to_online_page() and hybernation/kdump.
>>>>>      So, if that is really a problem, we would have to care of ot setting
>>>>>      the section to the right state.
>>>>
>>>> Good memory. Indeed, hibernation/kdump won't save the state of the vmemmap,
>>>> because the memory is marked as offline and, thus, logically without any
>>>> valuable content.
>>>
>>> Could you point me to the respective hibernation code please? I always
>>> get lost in that area. Anyway, we do have the same problem even if the
>>> whole accounting is handled during {on,off}lining, no?
>>
>> kernel/power/snapshot.c:saveable_page().
> 
> Thanks! So this is as I've suspected. The very same problem is present
> if the memory block is marked offline. So we need a solution here
> anyway. One way to go would be to consider these vmemmap pages always
> online. pfn_to_online_page would have to special case them but we would
> need to identify them first. I used to have PageVmemmap or something
> like that in my early attempt to do this.
> 
> That being said this is not an argument for one or the other aproach.
> Both need fixing.

Can you elaborate? What is the issue there? What needs fixing?
Michal Hocko March 25, 2021, 2:34 p.m. UTC | #29
On Thu 25-03-21 15:09:35, David Hildenbrand wrote:
> On 25.03.21 15:08, Michal Hocko wrote:
> > On Thu 25-03-21 13:40:45, David Hildenbrand wrote:
> > > On 25.03.21 13:35, Michal Hocko wrote:
> > > > On Thu 25-03-21 12:08:43, David Hildenbrand wrote:
> > > > > On 25.03.21 11:55, Oscar Salvador wrote:
> > [...]
> > > > > > - When moving the initialization/accounting to hot-add/hot-remove,
> > > > > >      the section containing the vmemmap pages will remain offline.
> > > > > >      It might get onlined once the pages get online in online_pages(),
> > > > > >      or not if vmemmap pages span a whole section.
> > > > > >      I remember (but maybe David rmemeber better) that that was a problem
> > > > > >      wrt. pfn_to_online_page() and hybernation/kdump.
> > > > > >      So, if that is really a problem, we would have to care of ot setting
> > > > > >      the section to the right state.
> > > > > 
> > > > > Good memory. Indeed, hibernation/kdump won't save the state of the vmemmap,
> > > > > because the memory is marked as offline and, thus, logically without any
> > > > > valuable content.

^^^^ THIS

> > > > 
> > > > Could you point me to the respective hibernation code please? I always
> > > > get lost in that area. Anyway, we do have the same problem even if the
> > > > whole accounting is handled during {on,off}lining, no?
> > > 
> > > kernel/power/snapshot.c:saveable_page().
> > 
> > Thanks! So this is as I've suspected. The very same problem is present
> > if the memory block is marked offline. So we need a solution here
> > anyway. One way to go would be to consider these vmemmap pages always
> > online. pfn_to_online_page would have to special case them but we would
> > need to identify them first. I used to have PageVmemmap or something
> > like that in my early attempt to do this.
> > 
> > That being said this is not an argument for one or the other aproach.
> > Both need fixing.
> 
> Can you elaborate? What is the issue there? What needs fixing?

The offline section containing the vmemmap will be lost during a
hibernation cycle, if I understand the above correctly.
Michal Hocko March 25, 2021, 2:40 p.m. UTC | #30
On Thu 25-03-21 15:02:26, Oscar Salvador wrote:
> On Thu, Mar 25, 2021 at 01:26:34PM +0100, Michal Hocko wrote:
> > Yeah, David has raised the contiguous flag for zone already. And to be
> > completely honest I fail to see why we should shape a design based on an
> > optimization. If anything we can teach set_zone_contiguous to simply
> > ignore zone affiliation of vmemmap pages. I would be really curious if
> > that would pose any harm to the compaction code as they are reserved and
> > compaction should simply skip them.
> 
> No, compaction code is clever enough to skip over those pages as it
> already does for any Reserved page.
> My comment was more towards having the zone contiguous.
> 
> I know it is an optimization, but
> 
>  commit 7cf91a98e607c2f935dbcc177d70011e95b8faff
>  Author: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>  Date:   Tue Mar 15 14:57:51 2016 -0700
>  
>  mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous
> 
> talks about 30% of improvment. I am not sure if those numbers would
> still hold nowawadys, but it feels wrong to drop it to the ground when
> we can do better there, and IMHO, it does not overly complicate things.

Again, do not shape the design around an optimization. If this turns out
to be a real problem then it can be handled on top.

> > THere is nothing like a proper zone.
> 
> I guess not, but for me it makes sense that vmemmap pages stay within
> the same zone as the pages they describe.

This is not the case for normal hotplug, so why should this be any
different?

> Of course, this is a matter of opinions/taste.
> 
> > Not sure what you are referring to but if you have prior to f1dd2cd13c4b
> > ("mm, memory_hotplug: do not associate hotadded memory to zones until
> > online") then this was entirely a different story. Users do care where
> > they memory goes because that depends on the usecase but do they care
> > about vmemmap?
> 
> As I said, that is not what I am worried about.
> Users do not really care where those pages end up, that is transparent
> to them (wrt. vmemmap pages), but we (internally) kind of do.
> 
> So, as I said, I see advantatges of using your way, but I see downsides
> as:
> 
> - I would like to consider zone, and for that we would have to pull
>   some of the functions that check for the zone at an aearly stage, and
>   the mere thought sounds ugly.

This is impossible and whatever kind of heuristic you come up with might
be wrong.

> - Section containing vmemmap can remain offline and would have to come
>   up to sort that out

Yes, this is indeed a problem and, as I've said in another email, it
would be a problem for your initial implementation as well if the memory
block is still offline. I suspect we need to treat these vmemmap pages
as online (via pfn_to_online_page()).
David Hildenbrand March 25, 2021, 2:46 p.m. UTC | #31
On 25.03.21 15:34, Michal Hocko wrote:
> On Thu 25-03-21 15:09:35, David Hildenbrand wrote:
>> On 25.03.21 15:08, Michal Hocko wrote:
>>> On Thu 25-03-21 13:40:45, David Hildenbrand wrote:
>>>> On 25.03.21 13:35, Michal Hocko wrote:
>>>>> On Thu 25-03-21 12:08:43, David Hildenbrand wrote:
>>>>>> On 25.03.21 11:55, Oscar Salvador wrote:
>>> [...]
>>>>>>> - When moving the initialization/accounting to hot-add/hot-remove,
>>>>>>>       the section containing the vmemmap pages will remain offline.
>>>>>>>       It might get onlined once the pages get online in online_pages(),
>>>>>>>       or not if vmemmap pages span a whole section.
>>>>>>>       I remember (but maybe David rmemeber better) that that was a problem
>>>>>>>       wrt. pfn_to_online_page() and hybernation/kdump.
>>>>>>>       So, if that is really a problem, we would have to care of ot setting
>>>>>>>       the section to the right state.
>>>>>>
>>>>>> Good memory. Indeed, hibernation/kdump won't save the state of the vmemmap,
>>>>>> because the memory is marked as offline and, thus, logically without any
>>>>>> valuable content.
> 
> ^^^^ THIS
> 
>>>>>
>>>>> Could you point me to the respective hibernation code please? I always
>>>>> get lost in that area. Anyway, we do have the same problem even if the
>>>>> whole accounting is handled during {on,off}lining, no?
>>>>
>>>> kernel/power/snapshot.c:saveable_page().
>>>
>>> Thanks! So this is as I've suspected. The very same problem is present
>>> if the memory block is marked offline. So we need a solution here
>>> anyway. One way to go would be to consider these vmemmap pages always
>>> online. pfn_to_online_page would have to special case them but we would
>>> need to identify them first. I used to have PageVmemmap or something
>>> like that in my early attempt to do this.
>>>
>>> That being said this is not an argument for one or the other aproach.
>>> Both need fixing.
>>
>> Can you elaborate? What is the issue there? What needs fixing?
> 
> offline section containing vmemmap will be lost during hibernation cycle
> IIU the above correctly.
> 

Can you tell me how that is a problem with Oscar's current patch? I only 
see this being a problem with what you propose - most probably I am 
missing something important here.

Offline memory sections don't have a valid memmap (assumption: garbage). 
On hibernation, the whole offline memory block won't be saved, including 
the vmemmap content that resides on the block. This includes the vmemmap 
of the vmemmap pages, which describes itself.

When restoring, the whole memory block will contain garbage, including 
the whole vmemmap - which is marked to be offline and to contain garbage.

Oscar's patch: works as expected. Onlining the memory block will properly 
initialize the vmemmap (including the vmemmap of the vmemmap), then mark 
the vmemmap as valid (by marking the sections as online).

Your proposal: does not work as expected. Once we online the memory 
block we don't initialize the vmemmap of the vmemmap pages. There is 
garbage, which gets exposed to the system as soon as we mark the vmemmap 
as valid (by marking the sections as online).
Michal Hocko March 25, 2021, 3:12 p.m. UTC | #32
On Thu 25-03-21 15:46:22, David Hildenbrand wrote:
> On 25.03.21 15:34, Michal Hocko wrote:
> > On Thu 25-03-21 15:09:35, David Hildenbrand wrote:
> > > On 25.03.21 15:08, Michal Hocko wrote:
> > > > On Thu 25-03-21 13:40:45, David Hildenbrand wrote:
> > > > > On 25.03.21 13:35, Michal Hocko wrote:
> > > > > > On Thu 25-03-21 12:08:43, David Hildenbrand wrote:
> > > > > > > On 25.03.21 11:55, Oscar Salvador wrote:
> > > > [...]
> > > > > > > > - When moving the initialization/accounting to hot-add/hot-remove,
> > > > > > > >       the section containing the vmemmap pages will remain offline.
> > > > > > > >       It might get onlined once the pages get online in online_pages(),
> > > > > > > >       or not if vmemmap pages span a whole section.
> > > > > > > >       I remember (but maybe David rmemeber better) that that was a problem
> > > > > > > >       wrt. pfn_to_online_page() and hybernation/kdump.
> > > > > > > >       So, if that is really a problem, we would have to care of ot setting
> > > > > > > >       the section to the right state.
> > > > > > > 
> > > > > > > Good memory. Indeed, hibernation/kdump won't save the state of the vmemmap,
> > > > > > > because the memory is marked as offline and, thus, logically without any
> > > > > > > valuable content.
> > 
> > ^^^^ THIS
> > 
> > > > > > 
> > > > > > Could you point me to the respective hibernation code please? I always
> > > > > > get lost in that area. Anyway, we do have the same problem even if the
> > > > > > whole accounting is handled during {on,off}lining, no?
> > > > > 
> > > > > kernel/power/snapshot.c:saveable_page().
> > > > 
> > > > Thanks! So this is as I've suspected. The very same problem is present
> > > > if the memory block is marked offline. So we need a solution here
> > > > anyway. One way to go would be to consider these vmemmap pages always
> > > > online. pfn_to_online_page would have to special case them but we would
> > > > need to identify them first. I used to have PageVmemmap or something
> > > > like that in my early attempt to do this.
> > > > 
> > > > That being said this is not an argument for one or the other aproach.
> > > > Both need fixing.
> > > 
> > > Can you elaborate? What is the issue there? What needs fixing?
> > 
> > offline section containing vmemmap will be lost during hibernation cycle
> > IIU the above correctly.
> > 
> 
> Can tell me how that is a problem with Oscars current patch? I only see this
> being a problem with what you propose - most probably I am missing something
> important here.
> 
> Offline memory sections don't have a valid memmap (assumption: garbage). On
> hibernation, the whole offline memory block won't be saved, including the
> vmemmap content that resides on the block. This includes the vmemmap of the
> vmemmap pages, which is itself.
> 
> When restoring, the whole memory block will contain garbage, including the
> whole vmemmap - which is marked to be offline and to contain garbage.

Hmm, so I might be misunderstanding the restoring part. But doesn't that
mean that the whole section/memory block won't get restored because it
is offline and therefore the vmemmap would be pointing to nowhere?
David Hildenbrand March 25, 2021, 3:19 p.m. UTC | #33
On 25.03.21 16:12, Michal Hocko wrote:
> On Thu 25-03-21 15:46:22, David Hildenbrand wrote:
>> On 25.03.21 15:34, Michal Hocko wrote:
>>> On Thu 25-03-21 15:09:35, David Hildenbrand wrote:
>>>> On 25.03.21 15:08, Michal Hocko wrote:
>>>>> On Thu 25-03-21 13:40:45, David Hildenbrand wrote:
>>>>>> On 25.03.21 13:35, Michal Hocko wrote:
>>>>>>> On Thu 25-03-21 12:08:43, David Hildenbrand wrote:
>>>>>>>> On 25.03.21 11:55, Oscar Salvador wrote:
>>>>> [...]
>>>>>>>>> - When moving the initialization/accounting to hot-add/hot-remove,
>>>>>>>>>        the section containing the vmemmap pages will remain offline.
>>>>>>>>>        It might get onlined once the pages get online in online_pages(),
>>>>>>>>>        or not if vmemmap pages span a whole section.
>>>>>>>>>        I remember (but maybe David rmemeber better) that that was a problem
>>>>>>>>>        wrt. pfn_to_online_page() and hybernation/kdump.
>>>>>>>>>        So, if that is really a problem, we would have to care of ot setting
>>>>>>>>>        the section to the right state.
>>>>>>>>
>>>>>>>> Good memory. Indeed, hibernation/kdump won't save the state of the vmemmap,
>>>>>>>> because the memory is marked as offline and, thus, logically without any
>>>>>>>> valuable content.
>>>
>>> ^^^^ THIS
>>>
>>>>>>>
>>>>>>> Could you point me to the respective hibernation code please? I always
>>>>>>> get lost in that area. Anyway, we do have the same problem even if the
>>>>>>> whole accounting is handled during {on,off}lining, no?
>>>>>>
>>>>>> kernel/power/snapshot.c:saveable_page().
>>>>>
>>>>> Thanks! So this is as I've suspected. The very same problem is present
>>>>> if the memory block is marked offline. So we need a solution here
>>>>> anyway. One way to go would be to consider these vmemmap pages always
>>>>> online. pfn_to_online_page would have to special case them but we would
>>>>> need to identify them first. I used to have PageVmemmap or something
>>>>> like that in my early attempt to do this.
>>>>>
>>>>> That being said this is not an argument for one or the other aproach.
>>>>> Both need fixing.
>>>>
>>>> Can you elaborate? What is the issue there? What needs fixing?
>>>
>>> offline section containing vmemmap will be lost during hibernation cycle
>>> IIU the above correctly.
>>>
>>
>> Can tell me how that is a problem with Oscars current patch? I only see this
>> being a problem with what you propose - most probably I am missing something
>> important here.
>>
>> Offline memory sections don't have a valid memmap (assumption: garbage). On
>> hibernation, the whole offline memory block won't be saved, including the
>> vmemmap content that resides on the block. This includes the vmemmap of the
>> vmemmap pages, which is itself.
>>
>> When restoring, the whole memory block will contain garbage, including the
>> whole vmemmap - which is marked to be offline and to contain garbage.
> 
> Hmm, so I might be misunderstanding the restoring part. But doesn't that
> mean that the whole section/memory block won't get restored because it
> is offline and therefore the vmemmap would be pointing to nowhere?

AFAIU, only the content of the memory block won't be restored - whatever 
memory content existed before the restore operation is kept.

The structures that define what the vmemmap should look like - the memory 
sections and the page tables used for describing the vmemmap - should 
properly get saved+restored, as these are located on online memory.

So the vmemmap layout after restoring should look just like before saving.
Michal Hocko March 25, 2021, 3:35 p.m. UTC | #34
On Thu 25-03-21 16:19:36, David Hildenbrand wrote:
> On 25.03.21 16:12, Michal Hocko wrote:
> > On Thu 25-03-21 15:46:22, David Hildenbrand wrote:
> > > On 25.03.21 15:34, Michal Hocko wrote:
> > > > On Thu 25-03-21 15:09:35, David Hildenbrand wrote:
> > > > > On 25.03.21 15:08, Michal Hocko wrote:
> > > > > > On Thu 25-03-21 13:40:45, David Hildenbrand wrote:
> > > > > > > On 25.03.21 13:35, Michal Hocko wrote:
> > > > > > > > On Thu 25-03-21 12:08:43, David Hildenbrand wrote:
> > > > > > > > > On 25.03.21 11:55, Oscar Salvador wrote:
> > > > > > [...]
> > > > > > > > > > - When moving the initialization/accounting to hot-add/hot-remove,
> > > > > > > > > >        the section containing the vmemmap pages will remain offline.
> > > > > > > > > >        It might get onlined once the pages get online in online_pages(),
> > > > > > > > > >        or not if vmemmap pages span a whole section.
> > > > > > > > > >        I remember (but maybe David rmemeber better) that that was a problem
> > > > > > > > > >        wrt. pfn_to_online_page() and hybernation/kdump.
> > > > > > > > > >        So, if that is really a problem, we would have to care of ot setting
> > > > > > > > > >        the section to the right state.
> > > > > > > > > 
> > > > > > > > > Good memory. Indeed, hibernation/kdump won't save the state of the vmemmap,
> > > > > > > > > because the memory is marked as offline and, thus, logically without any
> > > > > > > > > valuable content.
> > > > 
> > > > ^^^^ THIS
> > > > 
> > > > > > > > 
> > > > > > > > Could you point me to the respective hibernation code please? I always
> > > > > > > > get lost in that area. Anyway, we do have the same problem even if the
> > > > > > > > whole accounting is handled during {on,off}lining, no?
> > > > > > > 
> > > > > > > kernel/power/snapshot.c:saveable_page().
> > > > > > 
> > > > > > Thanks! So this is as I've suspected. The very same problem is present
> > > > > > if the memory block is marked offline. So we need a solution here
> > > > > > anyway. One way to go would be to consider these vmemmap pages always
> > > > > > online. pfn_to_online_page would have to special case them but we would
> > > > > > need to identify them first. I used to have PageVmemmap or something
> > > > > > like that in my early attempt to do this.
> > > > > > 
> > > > > > That being said this is not an argument for one or the other aproach.
> > > > > > Both need fixing.
> > > > > 
> > > > > Can you elaborate? What is the issue there? What needs fixing?
> > > > 
> > > > offline section containing vmemmap will be lost during hibernation cycle
> > > > IIU the above correctly.
> > > > 
> > > 
> > > Can tell me how that is a problem with Oscars current patch? I only see this
> > > being a problem with what you propose - most probably I am missing something
> > > important here.
> > > 
> > > Offline memory sections don't have a valid memmap (assumption: garbage). On
> > > hibernation, the whole offline memory block won't be saved, including the
> > > vmemmap content that resides on the block. This includes the vmemmap of the
> > > vmemmap pages, which is itself.
> > > 
> > > When restoring, the whole memory block will contain garbage, including the
> > > whole vmemmap - which is marked to be offline and to contain garbage.
> > 
> > Hmm, so I might be misunderstanding the restoring part. But doesn't that
> > mean that the whole section/memory block won't get restored because it
> > is offline and therefore the vmemmap would be pointing to nowhere?
> 
> AFAIU, only the content of the memory block won't be restored - whatever
> memory content existed before the restore operation is kept.
> 
> The structures that define how the vmemmap should look like - the memory
> sections and the page tables used for describing the vmemmap should properly
> get saved+restored, as these are located on online memory.
> 
> So the vmemmap layout should look after restoring just like before saving.

OK, makes sense. Thanks for the clarification. 

So there is indeed a difference. One way around that would be to mark
vmemmap pages (e.g. PageReserved && a magic value stored somewhere in the
struct page - resembling bootmem vmemmaps) or to mark sections fully
backing vmemmaps as online (ugly).
David Hildenbrand March 25, 2021, 3:40 p.m. UTC | #35
On 25.03.21 16:35, Michal Hocko wrote:
> On Thu 25-03-21 16:19:36, David Hildenbrand wrote:
>> On 25.03.21 16:12, Michal Hocko wrote:
>>> On Thu 25-03-21 15:46:22, David Hildenbrand wrote:
>>>> On 25.03.21 15:34, Michal Hocko wrote:
>>>>> On Thu 25-03-21 15:09:35, David Hildenbrand wrote:
>>>>>> On 25.03.21 15:08, Michal Hocko wrote:
>>>>>>> On Thu 25-03-21 13:40:45, David Hildenbrand wrote:
>>>>>>>> On 25.03.21 13:35, Michal Hocko wrote:
>>>>>>>>> On Thu 25-03-21 12:08:43, David Hildenbrand wrote:
>>>>>>>>>> On 25.03.21 11:55, Oscar Salvador wrote:
>>>>>>> [...]
>>>>>>>>>>> - When moving the initialization/accounting to hot-add/hot-remove,
>>>>>>>>>>>         the section containing the vmemmap pages will remain offline.
>>>>>>>>>>>         It might get onlined once the pages get online in online_pages(),
>>>>>>>>>>>         or not if vmemmap pages span a whole section.
>>>>>>>>>>>         I remember (but maybe David rmemeber better) that that was a problem
>>>>>>>>>>>         wrt. pfn_to_online_page() and hybernation/kdump.
>>>>>>>>>>>         So, if that is really a problem, we would have to care of ot setting
>>>>>>>>>>>         the section to the right state.
>>>>>>>>>>
>>>>>>>>>> Good memory. Indeed, hibernation/kdump won't save the state of the vmemmap,
>>>>>>>>>> because the memory is marked as offline and, thus, logically without any
>>>>>>>>>> valuable content.
>>>>>
>>>>> ^^^^ THIS
>>>>>
>>>>>>>>>
>>>>>>>>> Could you point me to the respective hibernation code please? I always
>>>>>>>>> get lost in that area. Anyway, we do have the same problem even if the
>>>>>>>>> whole accounting is handled during {on,off}lining, no?
>>>>>>>>
>>>>>>>> kernel/power/snapshot.c:saveable_page().
>>>>>>>
>>>>>>> Thanks! So this is as I've suspected. The very same problem is present
>>>>>>> if the memory block is marked offline. So we need a solution here
>>>>>>> anyway. One way to go would be to consider these vmemmap pages always
>>>>>>> online. pfn_to_online_page would have to special case them but we would
>>>>>>> need to identify them first. I used to have PageVmemmap or something
>>>>>>> like that in my early attempt to do this.
>>>>>>>
>>>>>>> That being said this is not an argument for one or the other aproach.
>>>>>>> Both need fixing.
>>>>>>
>>>>>> Can you elaborate? What is the issue there? What needs fixing?
>>>>>
>>>>> offline section containing vmemmap will be lost during hibernation cycle
>>>>> IIU the above correctly.
>>>>>
>>>>
>>>> Can tell me how that is a problem with Oscars current patch? I only see this
>>>> being a problem with what you propose - most probably I am missing something
>>>> important here.
>>>>
>>>> Offline memory sections don't have a valid memmap (assumption: garbage). On
>>>> hibernation, the whole offline memory block won't be saved, including the
>>>> vmemmap content that resides on the block. This includes the vmemmap of the
>>>> vmemmap pages, which is itself.
>>>>
>>>> When restoring, the whole memory block will contain garbage, including the
>>>> whole vmemmap - which is marked to be offline and to contain garbage.
>>>
>>> Hmm, so I might be misunderstanding the restoring part. But doesn't that
>>> mean that the whole section/memory block won't get restored because it
>>> is offline and therefore the vmemmap would be pointing to nowhere?
>>
>> AFAIU, only the content of the memory block won't be restored - whatever
>> memory content existed before the restore operation is kept.
>>
>> The structures that define how the vmemmap should look like - the memory
>> sections and the page tables used for describing the vmemmap should properly
>> get saved+restored, as these are located on online memory.
>>
>> So the vmemmap layout should look after restoring just like before saving.
> 
> OK, makes sense. Thanks for the clarification.
> 
> So there is indeed a difference. One way around that would be to mark
> vmemmap pages (e.g. PageReserved && magic value stored somewhere in the
> struct page - resembling bootmem vmemmaps) or mark section fully backing
> vmemmaps as online (ugly).

I'm sorry Michal, but then we are hacking around the online section size 
limitation just in another (IMHO uglier) way than what Oscar and I came 
up with when discussing this in the past.

Your first approach would require us to look at potential garbage 
(pfn_to_online_page() == NULL) and filter out what might still be useful.

The second approach exposes garbage to the rest of the system as 
initialized memmap.


I honestly cannot say that I prefer either over what we have here.
Michal Hocko March 25, 2021, 4:07 p.m. UTC | #36
On Thu 25-03-21 16:35:58, Michal Hocko wrote:
[...]
> So there is indeed a difference. One way around that would be to mark
> vmemmap pages (e.g. PageReserved && magic value stored somewhere in the
> struct page - resembling bootmem vmemmaps) or mark section fully backing
> vmemmaps as online (ugly).

I am not yet ready to give up on this. Here is a quick stab at the
pfn_to_online_page approach. It is not great but it is not really
terrible either. I think we can do better and skip
find_memory_block_by_id in most cases because nr_vmemmap_pages should be
constant with our current implementation.

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 754026a9164d..0d2bc22c29d3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -304,8 +304,20 @@ struct page *pfn_to_online_page(unsigned long pfn)
 		return NULL;
 
 	ms = __nr_to_section(nr);
-	if (!online_section(ms))
+	if (!online_section(ms)) {
+		if (memmap_on_memory) {
+			struct memory_block *mem;
+			unsigned long section_pfn = section_nr_to_pfn(nr);
+
+			mem = find_memory_block_by_id(memory_block_id(nr));
+			if (!mem || !mem->nr_vmemmap_pages || pfn - section_pfn > mem->nr_vmemmap_pages)
+				return NULL;
+
+			return pfn_to_page(pfn);
+
+		}
 		return NULL;
+	}
 
 	/*
 	 * Save some code text when online_section() +
David Hildenbrand March 25, 2021, 4:20 p.m. UTC | #37
On 25.03.21 17:07, Michal Hocko wrote:
> On Thu 25-03-21 16:35:58, Michal Hocko wrote:
> [...]
>> So there is indeed a difference. One way around that would be to mark
>> vmemmap pages (e.g. PageReserved && magic value stored somewhere in the
>> struct page - resembling bootmem vmemmaps) or mark section fully backing
>> vmemmaps as online (ugly).
> 
> I am not yet ready to give up on this. Here is a quick stab at the
> pfn_to_online_page approach. It is not great but it is not really
> terrible either. I think we can do better and skip

We both seem to have different tastes, to phrase it in a nice way :) ; 
but well, you seem to have made up your mind (just like I seem to have 
made up mine when trying to find a nice and somewhat clean way to handle 
this when discussing it in the past).

I expressed my opinion, shared my findings and expressed my concerns; 
the series has my RB and the discussion here is around something I 
consider in no way better than what we have right now. I'll let Oscar 
handle discussing this topic further (sorry Oscar! :) ), but I'll 
happily review what the outcome of that will be.
Michal Hocko March 25, 2021, 4:36 p.m. UTC | #38
On Thu 25-03-21 17:20:23, David Hildenbrand wrote:
> On 25.03.21 17:07, Michal Hocko wrote:
> > On Thu 25-03-21 16:35:58, Michal Hocko wrote:
> > [...]
> > > So there is indeed a difference. One way around that would be to mark
> > > vmemmap pages (e.g. PageReserved && magic value stored somewhere in the
> > > struct page - resembling bootmem vmemmaps) or mark section fully backing
> > > vmemmaps as online (ugly).
> > 
> > I am not yet ready to give up on this. Here is a quick stab at the
> > pfn_to_online_page approach. It is not great but it is not really
> > terrible either. I think we can do better and skip
> 
> We both seem to have a different taste, to phrase it in a nice way :) ; but
> well, you seem to have set your mind (just like I seem to have set mine when
> trying to find a nice and somewhat-clean way to handle this when discussing
> it in the past).

I definitely do not want to fight for a certain solution just for the
sake of it. I really dislike how the lifetime of the reserved space and
its accounting are completely detached. But hey, I do understand that a
solution that is worse from the design perspective can be better for
practical reasons or due to constraints.

I hadn't seen the hibernation problem before and I do recognize it is a
nasty one. If all it takes is to make pfn_to_online_page() work (and my
previous attempt is incorrect because it should consult the block rather
than the section pfn range) and there are no other downsides, then I
would still prefer to go with my proposal. If there are still other
things to plug then, well, practicality is going to win.

So before I give up on the "proper" design card, are there more
subtleties to watch for? You have certainly given this much more thought
than I have.
Michal Hocko March 25, 2021, 4:47 p.m. UTC | #39
On Thu 25-03-21 17:36:22, Michal Hocko wrote:
> If all it takes is to make pfn_to_online_page work (and my
> previous attempt is incorrect because it should consult block rather
> than section pfn range)

This should work.

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 9697acfe96eb..e50d685be8ab 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -510,6 +510,23 @@ static struct memory_block *find_memory_block_by_id(unsigned long block_id)
 	return mem;
 }
 
+struct page *is_vmemmap_page(unsigned long pfn)
+{
+	unsigned long nr = pfn_to_section_nr(pfn);
+	struct memory_block *mem;
+	unsigned long block_pfn;
+
+	mem = find_memory_block_by_id(memory_block_id(nr));
+	if (!mem || !mem->nr_vmemmap_pages)
+		return NULL;
+
+	block_pfn = section_nr_to_pfn(mem->start_section_nr);
+	if (pfn - block_pfn >= mem->nr_vmemmap_pages)
+		return NULL;
+
+	return pfn_to_page(pfn);
+}
+
 /*
  * Called under device_hotplug_lock.
  */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 754026a9164d..760bf3ad48d5 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -304,8 +304,16 @@ struct page *pfn_to_online_page(unsigned long pfn)
 		return NULL;
 
 	ms = __nr_to_section(nr);
-	if (!online_section(ms))
+	if (!online_section(ms)) {
+		/*
+		 * vmemmap reserved space can eat up a whole section which then
+		 * never gets onlined because it doesn't contain any memory to
+		 * online.
+		 */
+		if (memmap_on_memory)
+			return is_vmemmap_page(pfn);
 		return NULL;
+	}
 
 	/*
 	 * Save some code text when online_section() +
David Hildenbrand March 25, 2021, 4:55 p.m. UTC | #40
On 25.03.21 17:47, Michal Hocko wrote:
> On Thu 25-03-21 17:36:22, Michal Hocko wrote:
>> If all it takes is to make pfn_to_online_page work (and my
>> previous attempt is incorrect because it should consult block rather
>> than section pfn range)
> 
> This should work.
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 9697acfe96eb..e50d685be8ab 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -510,6 +510,23 @@ static struct memory_block *find_memory_block_by_id(unsigned long block_id)
>   	return mem;
>   }
>   
> +struct page *is_vmemmap_page(unsigned long pfn)
> +{
> +	unsigned long nr = pfn_to_section_nr(pfn);
> +	struct memory_block *mem;
> +	unsigned long block_pfn;
> +
> +	mem = find_memory_block_by_id(memory_block_id(nr));
> +	if (!mem || !mem->nr_vmemmap_pages)
> +		return NULL;
> +
> +	block_pfn = section_nr_to_pfn(mem->start_section_nr);
> +	if (pfn - block_pfn >= mem->nr_vmemmap_pages)
> +		return NULL;
> +
> +	return pfn_to_page(pfn);
> +}
> +
>   /*
>    * Called under device_hotplug_lock.
>    */
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 754026a9164d..760bf3ad48d5 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -304,8 +304,16 @@ struct page *pfn_to_online_page(unsigned long pfn)
>   		return NULL;
>   
>   	ms = __nr_to_section(nr);
> -	if (!online_section(ms))
> +	if (!online_section(ms)) {
> +		/*
> +		 * vmemmap reserved space can eat up a whole section which then
> +		 * never gets onlined because it doesn't contain any memory to
> +		 * online.
> +		 */
> +		if (memmap_on_memory)
> +			return is_vmemmap_page(pfn);
>   		return NULL;
> +	}
>   
>   	/*
>   	 * Save some code text when online_section() +
> 

It should take care of discussed zone shrinking as well, at least as 
long as the granularity is not smaller than sub-sections.
David Hildenbrand March 25, 2021, 6:08 p.m. UTC | #41
On 25.03.21 17:36, Michal Hocko wrote:
> On Thu 25-03-21 17:20:23, David Hildenbrand wrote:
>> On 25.03.21 17:07, Michal Hocko wrote:
>>> On Thu 25-03-21 16:35:58, Michal Hocko wrote:
>>> [...]
>>>> So there is indeed a difference. One way around that would be to mark
>>>> vmemmap pages (e.g. PageReserved && magic value stored somewhere in the
>>>> struct page - resembling bootmem vmemmaps) or mark section fully backing
>>>> vmemmaps as online (ugly).
>>>
>>> I am not yet ready to give up on this. Here is a quick stab at the
>>> pfn_to_online_page approach. It is not great but it is not really
>>> terrible either. I think we can do better and skip
>>
>> We both seem to have a different taste, to phrase it in a nice way :) ; but
>> well, you seem to have set your mind (just like I seem to have set mine when
>> trying to find a nice and somewhat-clean way to handle this when discussing
>> it in the past).
> 
> I definitely do not want to fight for a certain solution just for the
> sake of it. I really dislike how the lifetime of the reserved space and
> its accounting are completely detached. But hey, I do understand that
> a worse solution from the design perspective can be better due to
> practical reasons or constraints.
> 
> I haven't seen the hibernation problem before and I do recognize it is
> a nasty one. If all it takes is to make pfn_to_online_page work (and my
> previous attempt is incorrect because it should consult block rather
> than section pfn range) and there are no other downsides then I would
> still prefer to go with my proposal.  If there are still other things to
> plug then, well, practicality is going to win.
> 
> So before I give up on the "proper" design card, are there more
> subtleties to watch for? You have certainly given this much more thought
> than I have.
> 

"Just one more thing" :)

With the pfn_to_online_page() change, I think what remains is


1. The contiguous zone thingy which, as we discussed, is not a deal breaker, 
although sub-optimal and most probably not going to be optimized in the future.

2. The corner-case issue I mentioned with the /dev/mem use case and offline 
memory blocks. Existing setups (!memmap_on_memory) are not 
affected, so I guess we're fine.

3. valid_zones_show() has to be taught to only look at the !vmemmap 
part, otherwise we'll no longer indicate "movable" after onlining to the 
movable zone. Should be fairly easy.


We'll have pfn_to_online_page() succeed without SECTION_IS_ONLINE. I 
think I/we removed all such code that purely relied on that flag for 
optimizations like

if (!online_section(s))
	continue;


I can give it some more thought, it could fly. At least zone shrinking 
and hibernation should continue working as expected, which is a relief.
Oscar Salvador March 25, 2021, 10:06 p.m. UTC | #42
On Thu, Mar 25, 2021 at 05:47:30PM +0100, Michal Hocko wrote:
> On Thu 25-03-21 17:36:22, Michal Hocko wrote:
> > If all it takes is to make pfn_to_online_page work (and my
> > previous attempt is incorrect because it should consult block rather
> > than section pfn range)
> 
> This should work.

Sorry, but while this solves some of the issues with that approach, I really
think it overcomplicates things and buys us little in return.
To me it seems that we are just patching things up to make it work that
way.

To be honest, I dislike this, and I guess we can only agree to disagree
here.

I find the following much easier, cleaner, and less risky to encounter
pitfalls in the future:

(!!!It is untested and incomplete, and I would be surprised if it even
compiles, but it is enough as a PoC !!!)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 5ea2b3fbce02..799d14fc2f9b 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -169,6 +169,60 @@ int memory_notify(unsigned long val, void *v)
 	return blocking_notifier_call_chain(&memory_chain, val, v);
 }

+static int memory_block_online(unsigned long start_pfn, unsigned long nr_pages,
+			       unsigned long nr_vmemmap_pages, int online_type,
+			       int nid)
+{
+	unsigned long flags;
+	int ret;
+	/*
+	 * Despite vmemmap pages having a different lifecycle than the pages
+	 * they describe, initializing and accounting vmemmap pages at the
+	 * online/offline stage eases things a lot.
+	 * We do it out of {online,offline}_pages, so those routines only have
+	 * to deal with pages that are actual usable memory.
+	 */
+	if (nr_vmemmap_pages) {
+		struct zone *z;
+
+		z = zone_for_pfn_range(online_type, nid, start_pfn, nr_pages);
+		move_pfn_range_to_zone(z, start_pfn, nr_vmemmap_pages, NULL,
+				       MIGRATE_UNMOVABLE);
+		/*
+		 * The below could go to a helper to make it less bulky here,
+		 * so {online,offline}_pages could also use it.
+		 */
+		z->present_pages += nr_vmemmap_pages;
+		pgdat_resize_lock(z->zone_pgdat, &flags);
+		z->zone_pgdat->node_present_pages += nr_vmemmap_pages;
+		pgdat_resize_unlock(z->zone_pgdat, &flags);
+	}
+
+	ret = online_pages(start_pfn + nr_vmemmap_pages, nr_pages - nr_vmemmap_pages,
+			   online_type);
+
+	/*
+	 * In case online_pages() failed for some reason, we should cleanup vmemmap
+	 * accounting as well.
+	 */
+	return ret;
+}
+
+static int memory_block_offline(unsigned long start_pfn, unsigned long nr_pages,
+				unsigned long nr_vmemmap_pages)
+{
+	int ret;
+
+	if (nr_vmemmap_pages) {
+		/*
+		 * Do the opposite of memory_block_online
+		 */
+	}
+
+	ret = offline_pages(start_pfn, nr_pages);
+
+	return ret;
+}
+
 /*
  * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
  * OK to have direct references to sparsemem variables in here.
@@ -185,11 +239,11 @@ memory_block_action(unsigned long start_section_nr, unsigned long action,

 	switch (action) {
 	case MEM_ONLINE:
-		ret = online_pages(start_pfn, nr_pages, nr_vmemmap_pages,
-				   online_type, nid);
+		ret = memory_block_online(start_pfn, nr_pages, nr_vmemmap_pages,
+					  online_type, nid);
 		break;
 	case MEM_OFFLINE:
-		ret = offline_pages(start_pfn, nr_pages, nr_vmemmap_pages);
+		ret = memory_block_offline(start_pfn, nr_pages, nr_vmemmap_pages);
 		break;
 	default:
 		WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index a85d4b7d15c2..d2c734eaccb4 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -109,13 +109,11 @@ extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
 /* VM interface that may be used by firmware interface */
 extern int online_pages(unsigned long pfn, unsigned long nr_pages,
-			unsigned long nr_vmemmap_pages, int online_type,
-			int nid);
+			int online_type, int nid);
 extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 					 unsigned long end_pfn);
 extern void __offline_isolated_pages(unsigned long start_pfn,
-				     unsigned long end_pfn,
-				     unsigned long buddy_start_pfn);
+				     unsigned long end_pfn);

 typedef void (*online_page_callback_t)(struct page *page, unsigned int order);

@@ -317,8 +315,7 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
 #ifdef CONFIG_MEMORY_HOTREMOVE

 extern void try_offline_node(int nid);
-extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
-			 unsigned long nr_vmemmap_pages);
+extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern int remove_memory(int nid, u64 start, u64 size);
 extern void __remove_memory(int nid, u64 start, u64 size);
 extern int offline_and_remove_memory(int nid, u64 start, u64 size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 350cde69a97d..0d9ef34509bd 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -640,22 +640,10 @@ void generic_online_page(struct page *page, unsigned int order)
 }
 EXPORT_SYMBOL_GPL(generic_online_page);

-static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
-			       unsigned long buddy_start_pfn)
+static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
-	unsigned long pfn = buddy_start_pfn;
-
-	/*
-	 * When using memmap_on_memory, the range might be unaligned as the
-	 * first pfns are used for vmemmap pages. Align it in case we need to.
-	 */
-	VM_BUG_ON(!IS_ALIGNED(pfn, pageblock_nr_pages));
-
-	while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
-		(*online_page_callback)(pfn_to_page(pfn), pageblock_order);
-		pfn += pageblock_nr_pages;
-	}
+	unsigned long pfn = start_pfn;

 	/*
 	 * Online the pages in MAX_ORDER - 1 aligned chunks. The callback might
@@ -844,9 +832,9 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
 }

 int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
-		       unsigned long nr_vmemmap_pages, int online_type, int nid)
+		       int online_type, int nid)
 {
-	unsigned long flags, buddy_start_pfn, buddy_nr_pages;
+	unsigned long flags;
 	struct zone *zone;
 	int need_zonelists_rebuild = 0;
 	int ret;
@@ -857,17 +845,11 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 			 !IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
 		return -EINVAL;

-	buddy_start_pfn = pfn + nr_vmemmap_pages;
-	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
-
 	mem_hotplug_begin();

 	/* associate pfn range with the zone */
 	zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
-	if (nr_vmemmap_pages)
-		move_pfn_range_to_zone(zone, pfn, nr_vmemmap_pages, NULL,
-				       MIGRATE_UNMOVABLE);
-	move_pfn_range_to_zone(zone, buddy_start_pfn, buddy_nr_pages, NULL,
+	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL,
 			       MIGRATE_ISOLATE);

 	arg.start_pfn = pfn;
@@ -884,7 +866,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	 * onlining, such that undo_isolate_page_range() works correctly.
 	 */
 	spin_lock_irqsave(&zone->lock, flags);
-	zone->nr_isolate_pageblock += buddy_nr_pages / pageblock_nr_pages;
+	zone->nr_isolate_pageblock += nr_pages / pageblock_nr_pages;
 	spin_unlock_irqrestore(&zone->lock, flags);

 	/*
@@ -897,7 +879,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 		setup_zone_pageset(zone);
 	}

-	online_pages_range(pfn, nr_pages, buddy_start_pfn);
+	online_pages_range(pfn, nr_pages);
 	zone->present_pages += nr_pages;

 	pgdat_resize_lock(zone->zone_pgdat, &flags);
@@ -910,8 +892,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	zone_pcp_update(zone);

 	/* Basic onlining is complete, allow allocation of onlined pages. */
-	undo_isolate_page_range(buddy_start_pfn,
-				buddy_start_pfn + buddy_nr_pages,
+	undo_isolate_page_range(pfn,
+				pfn + nr_pages,
 				MIGRATE_MOVABLE);

 	/*
@@ -1639,11 +1621,10 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
 	return 0;
 }

-int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
-			unsigned long nr_vmemmap_pages)
+int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
-	unsigned long pfn, buddy_start_pfn, buddy_nr_pages, system_ram_pages = 0;
+	unsigned long pfn, system_ram_pages = 0;
 	unsigned long flags;
 	struct zone *zone;
 	struct memory_notify arg;
@@ -1655,9 +1636,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 			 !IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
 		return -EINVAL;

-	buddy_start_pfn = start_pfn + nr_vmemmap_pages;
-	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
-
 	mem_hotplug_begin();

 	/*
@@ -1693,7 +1671,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	zone_pcp_disable(zone);

 	/* set above range as isolated */
-	ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
+	ret = start_isolate_page_range(start_pfn, end_pfn,
 				       MIGRATE_MOVABLE,
 				       MEMORY_OFFLINE | REPORT_FAILURE);
 	if (ret) {
@@ -1713,7 +1691,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	}

 	do {
-		pfn = buddy_start_pfn;
+		pfn = start_pfn;
 		do {
 			if (signal_pending(current)) {
 				ret = -EINTR;
@@ -1744,18 +1722,18 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 		 * offlining actually in order to make hugetlbfs's object
 		 * counting consistent.
 		 */
-		ret = dissolve_free_huge_pages(buddy_start_pfn, end_pfn);
+		ret = dissolve_free_huge_pages(start_pfn, end_pfn);
 		if (ret) {
 			reason = "failure to dissolve huge pages";
 			goto failed_removal_isolated;
 		}

-		ret = test_pages_isolated(buddy_start_pfn, end_pfn, MEMORY_OFFLINE);
+		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);

 	} while (ret);

 	/* Mark all sections offline and remove free pages from the buddy. */
-	__offline_isolated_pages(start_pfn, end_pfn, buddy_start_pfn);
+	__offline_isolated_pages(start_pfn, end_pfn);
 	pr_debug("Offlined Pages %ld\n", nr_pages);

 	/*
@@ -1764,13 +1742,13 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	 * of isolated pageblocks, memory onlining will properly revert this.
 	 */
 	spin_lock_irqsave(&zone->lock, flags);
-	zone->nr_isolate_pageblock -= buddy_nr_pages / pageblock_nr_pages;
+	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
 	spin_unlock_irqrestore(&zone->lock, flags);

 	zone_pcp_enable(zone);

 	/* removal success */
-	adjust_managed_page_count(pfn_to_page(start_pfn), -buddy_nr_pages);
+	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
 	zone->present_pages -= nr_pages;

 	pgdat_resize_lock(zone->zone_pgdat, &flags);
@@ -1799,7 +1777,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	return 0;

 failed_removal_isolated:
-	undo_isolate_page_range(buddy_start_pfn, end_pfn, MIGRATE_MOVABLE);
+	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
 	memory_notify(MEM_CANCEL_OFFLINE, &arg);
 failed_removal_pcplists_disabled:
 	zone_pcp_enable(zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 85c478e374d7..3e4b29ee2b1e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8830,8 +8830,7 @@ void zone_pcp_reset(struct zone *zone)
  * All pages in the range must be in a single zone, must not contain holes,
  * must span full sections, and must be isolated before calling this function.
  */
-void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn,
-			      unsigned long buddy_start_pfn)
+void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 {
 	unsigned long pfn = start_pfn;
 	struct page *page;
@@ -8842,7 +8841,6 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn,
 	offline_mem_sections(pfn, end_pfn);
 	zone = page_zone(pfn_to_page(pfn));
 	spin_lock_irqsave(&zone->lock, flags);
-	pfn = buddy_start_pfn;
 	while (pfn < end_pfn) {
 		page = pfn_to_page(pfn);
 		/*
Michal Hocko March 26, 2021, 8:35 a.m. UTC | #43
On Thu 25-03-21 23:06:50, Oscar Salvador wrote:
> On Thu, Mar 25, 2021 at 05:47:30PM +0100, Michal Hocko wrote:
> > On Thu 25-03-21 17:36:22, Michal Hocko wrote:
> > > If all it takes is to make pfn_to_online_page work (and my
> > > previous attempt is incorrect because it should consult block rather
> > > than section pfn range)
> > 
> > This should work.
> 
> Sorry, but while this solves some of the issues with that approach, I really
> think that overcomplicates things and buys us not so much in return.
> To me it seems that we are just patching things to make it work that
> way.

I do agree that special casing vmemmap areas is something I do not
really like, but we are in that Schrödinger situation where this memory is
not onlineable unless it shares a memory section with onlineable
memory. There are two ways around that: either special-case it in
pfn_to_online_page or mark the vmemmap section online even though it is
not really.

> To be honest, I dislike this, and I guess we can only agree to disagree
> here.

No problem there. I will not insist on my approach unless I can convince
you that it is a better solution. It seems I have failed and I can live
with that.

> I find the following much easier, cleaner, and less risky to encounter
> pitfalls in the future:
> 
> (!!!It is untested and incomplete, and I would be surprised if it even
> compiles, but it is enough as a PoC !!!)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 5ea2b3fbce02..799d14fc2f9b 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -169,6 +169,60 @@ int memory_notify(unsigned long val, void *v)
>  	return blocking_notifier_call_chain(&memory_chain, val, v);
>  }
> 
> +static int memory_block_online(unsigned long start_pfn, unsigned long nr_pages,
> +			       unsigned long nr_vmemmap_pages, int online_type,
> +			       int nid)
> +{
> +	int ret;
> +	/*
> +	 * Despite vmemmap pages having a different lifecycle than the pages
> +	 * they describe, initializing and accounting vmemmap pages at the
> +	 * online/offline stage eases things a lot.

This requires quite some explaining.

> +	 * We do it out of {online,offline}_pages, so those routines only have
> +	 * to deal with pages that are actual usable memory.
> +	 */
> +	if (nr_vmemmap_pages) {
> +		struct zone *z;
> +
> +		z = zone_for_pfn_range(online_type, nid, start_pfn, nr_pages);
> +		move_pfn_range_to_zone(z, start_pfn, nr_vmemmap_pages, NULL,
> +				       MIGRATE_UNMOVABLE);
> +		/*
> +		 * The below could go to a helper to make it less bulky here,
> +		 * so {online,offline}_pages could also use it.
> +		 */
> +		z->present_pages += nr_pages;
> +		pgdat_resize_lock(z->zone_pgdat, &flags);
> +		z->zone_pgdat->node_present_pages += nr_pages;
> +		pgdat_resize_unlock(z->zone_pgdat, &flags);
> +	}
> +
> +	ret = online_pages(start_pfn + nr_vmemmap_pages, nr_pages - nr_vmemmap_pages,
> +			   online_type);
> +
> +	/*
> +	 * In case online_pages() failed for some reason, we should cleanup vmemmap
> +	 * accounting as well.
> +	 */
> +	return ret;
> +}

Yes this is much better! Just a minor suggestion would be to push
memory_block all the way to memory_block_online (it onlines a memory
block). I would also slightly prefer to provide 2 helpers that would make
it clear that this is to reserve/cleanup the vmemmap space (defined in
the memory_hotplug proper).

Thanks!
David Hildenbrand March 26, 2021, 8:52 a.m. UTC | #44
On 26.03.21 09:35, Michal Hocko wrote:
> On Thu 25-03-21 23:06:50, Oscar Salvador wrote:
>> On Thu, Mar 25, 2021 at 05:47:30PM +0100, Michal Hocko wrote:
>>> On Thu 25-03-21 17:36:22, Michal Hocko wrote:
>>>> If all it takes is to make pfn_to_online_page work (and my
>>>> previous attempt is incorrect because it should consult block rather
>>>> than section pfn range)
>>>
>>> This should work.
>>
>> Sorry, but while this solves some of the issues with that approach, I really
>> think that overcomplicates things and buys us not so much in return.
>> To me it seems that we are just patching things to make it work that
>> way.
> 
> I do agree that special casing vmemmap areas is something I do not
> really like but we are in that Schrödinger situation where this memory is
> not onlineable unless it shares memory section with an onlineable
> memory. There are two ways around that, either special case it on
> pfn_to_online_page or mark the vmemmap section online even though it is
> not really.
> 
>> To be honest, I dislike this, and I guess we can only agree to disagree
>> here.
> 
> No problem there. I will not insist on my approach unless I can convince
> you that it is a better solution. It seems I have failed and I can live
> with that.
> 
>> I find the following much easier, cleaner, and less risky to encounter
>> pitfalls in the future:
>>
>> (!!!It is untested and incomplete, and I would be surprised if it even
>> compiles, but it is enough as a PoC !!!)
>>
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index 5ea2b3fbce02..799d14fc2f9b 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -169,6 +169,60 @@ int memory_notify(unsigned long val, void *v)
>>   	return blocking_notifier_call_chain(&memory_chain, val, v);
>>   }
>>
>> +static int memory_block_online(unsigned long start_pfn, unsigned long nr_pages,
>> +			       unsigned long nr_vmemmap_pages, int online_type,
>> +			       int nid)
>> +{
>> +	int ret;
>> +	/*
>> +	 * Despite vmemmap pages having a different lifecycle than the pages
>> +	 * they describe, initializing and accounting vmemmap pages at the
>> +	 * online/offline stage eases things a lot.
> 
> This requires quite some explaining.
> 
>> +	 * We do it out of {online,offline}_pages, so those routines only have
>> +	 * to deal with pages that are actual usable memory.
>> +	 */
>> +	if (nr_vmemmap_pages) {
>> +		struct zone *z;
>> +
>> +		z = zone_for_pfn_range(online_type, nid, start_pfn, nr_pages);
>> +		move_pfn_range_to_zone(z, start_pfn, nr_vmemmap_pages, NULL,
>> +				       MIGRATE_UNMOVABLE);
>> +		/*
>> +		 * The below could go to a helper to make it less bulky here,
>> +		 * so {online,offline}_pages could also use it.
>> +		 */
>> +		z->present_pages += nr_pages;
>> +		pgdat_resize_lock(z->zone_pgdat, &flags);
>> +		z->zone_pgdat->node_present_pages += nr_pages;
>> +		pgdat_resize_unlock(z->zone_pgdat, &flags);

Might have to set fully spanned section online. (vmemmap >= SECTION_SIZE)

>> +	}
>> +
>> +	ret = online_pages(start_pfn + nr_vmemmap_pages, nr_pages - nr_vmemmap_pages,
>> +			   online_type);
>> +
>> +	/*
>> +	 * In case online_pages() failed for some reason, we should cleanup vmemmap
>> +	 * accounting as well.
>> +	 */
>> +	return ret;
>> +}
> 
> Yes this is much better! Just a minor suggestion would be to push
> memory_block all the way to memory_block_online (it onlines a memory
> block). I would also slightly prefer to provide 2 helpers that would make
> it clear that this is to reserve/cleanup the vmemmap space (defined in
> the memory_hotplug proper).
> 
> Thanks!
> 

Something else to note:


We'll not call the memory notifier (e.g., MEM_ONLINE) for the vmemmap. 
The result is that

1. We won't allocate extended struct pages for the range. Don't think 
this is really problematic (pages are never allocated/freed, so I guess 
we don't care - like ZONE_DEVICE code).

2. We won't allocate kasan shadow memory. We most probably have to do it 
explicitly via kasan_add_zero_shadow()/kasan_remove_zero_shadow(), see 
mm/memremap.c:pagemap_range()


Further a locking rework might be necessary. We hold the device hotplug 
lock, but not the memory hotplug lock. E.g., for get_online_mems(). 
Might have to move that out online_pages.
Oscar Salvador March 26, 2021, 8:55 a.m. UTC | #45
On Fri, Mar 26, 2021 at 09:35:03AM +0100, Michal Hocko wrote:
> No problem there. I will not insist on my approach unless I can convince
> you that it is a better solution. It seems I have failed and I can live
> with that.

Well, I am glad we got to discuss it at least.

> > +static int memory_block_online(unsigned long start_pfn, unsigned long nr_pages,
> > +			       unsigned long nr_vmemmap_pages, int online_type,
> > +			       int nid)
> > +{
> > +	int ret;
> > +	/*
> > +	 * Despite vmemmap pages having a different lifecycle than the pages
> > +	 * they describe, initializing and accounting vmemmap pages at the
> > +	 * online/offline stage eases things a lot.
> 
> This requires quite some explaining.

Definitely, I will expand on that and provide some context.

 
> Yes this is much better! Just a minor suggestion would be to push
> memory_block all the way to memory_block_online (it onlines a memory
> block). I would also slightly prefer to provide 2 helpers that would make
> it clear that this is to reserve/cleanup the vmemmap space (defined in
> the memory_hotplug proper).

Glad to hear that!
By pushing memory_block all the way to memory_block_{online,offline}, you
mean passing the memblock struct together with nr_vmemmap_pages,
online_type and nid to memory_block_{offline,online}, and derive in there
the start_pfn and nr_pages?

Wrt. to the two helpers, I agree with you.
Actually, I would find it quite disturbing to deal with zones in that code
domain.
I will add two proper helpers in memory_hotplug to deal with vmemmap.

If it comes out the way I envision, it could end up quite clean, and much
less disturbing.

Thanks Michal
Oscar Salvador March 26, 2021, 8:57 a.m. UTC | #46
On Fri, Mar 26, 2021 at 09:52:58AM +0100, David Hildenbrand wrote:
> Might have to set fully spanned section online. (vmemmap >= SECTION_SIZE)

Hi David,

could you elaborate on this a bit?

> Something else to note:
> 
> 
> We'll not call the memory notifier (e.g., MEM_ONLINE) for the vmemmap. The
> result is that
> 
> 1. We won't allocate extended struct pages for the range. Don't think this
> is really problematic (pages are never allocated/freed, so I guess we don't
> care - like ZONE_DEVICE code).
> 
> 2. We won't allocate kasan shadow memory. We most probably have to do it
> explicitly via kasan_add_zero_shadow()/kasan_remove_zero_shadow(), see
> mm/memremap.c:pagemap_range()
> 
> 
> Further a locking rework might be necessary. We hold the device hotplug
> lock, but not the memory hotplug lock. E.g., for get_online_mems(). Might
> have to move that out online_pages.

I will have a look and see how it goes.
Michal Hocko March 26, 2021, 9:11 a.m. UTC | #47
On Fri 26-03-21 09:55:03, Oscar Salvador wrote:
> On Fri, Mar 26, 2021 at 09:35:03AM +0100, Michal Hocko wrote:
[...]
> > Yes this is much better! Just a minor suggestion would be to push
> > memory_block all the way to memory_block_online (it onlines a memory
> > block). I would also slightly prefer to provide 2 helpers that would make
> > it clear that this is to reserve/cleanup the vmemmap space (defined in
> > the memory_hotplug proper).
> 
> Glad to hear that!
> By pushing memory_block all the way to memory_block_{online,offline}, you
> mean passing the memblock struct together with nr_vmemmap_pages,
> online_type and nid to memory_block_{offline,online}, and derive in there
> the start_pfn and nr_pages?

memory_block + online_type should be sufficient.
Oscar Salvador March 26, 2021, 12:15 p.m. UTC | #48
On Fri, Mar 26, 2021 at 09:57:43AM +0100, Oscar Salvador wrote:
> On Fri, Mar 26, 2021 at 09:52:58AM +0100, David Hildenbrand wrote:
> > Might have to set fully spanned section online. (vmemmap >= SECTION_SIZE)

Bleh, this morning I was in a rush and I could not think straight.
I got what you mean now.

Yes, if vmemmap pages fully span a section, we need to online that
section before calling online_pages(), otherwise we would leave that
section offline.


> > We'll not call the memory notifier (e.g., MEM_ONLINE) for the vmemmap. The

Well, since it is not actual memory we might do not need to, but

> > 1. We won't allocate extended struct pages for the range. Don't think this
> > is really problematic (pages are never allocated/freed, so I guess we don't
> > care - like ZONE_DEVICE code).

Probably not worth it for vmemmap.

> > 
> > 2. We won't allocate kasan shadow memory. We most probably have to do it
> > explicitly via kasan_add_zero_shadow()/kasan_remove_zero_shadow(), see
> > mm/memremap.c:pagemap_range()

we should be calling those kasan_{add,remove} helpers.
We can stuff that into the vmemmap helpers, so everything remains
contained.

> > 
> > 
> > Further a locking rework might be necessary. We hold the device hotplug
> > lock, but not the memory hotplug lock. E.g., for get_online_mems(). Might
> > have to move that out online_pages.

That is a good point.
I have yet to think about it further, but what about moving those
mem_hotplug_{begin,done} calls into memory_block_{offline,online}?

Something like:

 static int memory_block_online(...)
 {
         int ret;
 
         mem_hotplug_begin();
 
         if (nr_vmemmap_pages)
                 vmemmap_hotplug_init();
 
         ret = online_pages(...);
         if (ret)
 	/*
 	 * Cleanup stuff
 	 */
 
         mem_hotplug_done();
 }
	 
As you said, you finished cleaning up those users who were calling
{online,offline}_pages() directly, but is this something that we will
forbid in the future too?
My question falls within "Are we sure we will not need that locking back
in those functions because that is something we will not allow to
happen?"
David Hildenbrand March 26, 2021, 1:36 p.m. UTC | #49
>>>
>>> Further a locking rework might be necessary. We hold the device hotplug
>>> lock, but not the memory hotplug lock. E.g., for get_online_mems(). Might
>>> have to move that out online_pages.
> 
> That is a good point.
> I yet have to think about it further, but what about moving those
> mem_hotplug_{begin,done} to memory_block_{offline,online}.
> 
> Something like:
> 
>   static int memory_block_online(...)
>   {
>           int ret;
>   
>           mem_hotplug_begin();
>   
>           if (nr_vmemmap_pages)
>                   vmemmap_hotplug_init();
>   
>           ret = online_pages(...);
>           if (ret)
>   	/*
>   	 * Cleanup stuff
>   	 */
>   
>           mem_hotplug_done();
>   }
> 	
> As you said, you finished cleaning up those users who were calling
> {online,offline}_pages() directly, but is this something that we will
> forbid in the future too?

Well, I cannot tell what will happen in the future. But at least as long 
as we have memory blocks, I doubt that there are valid use cases for 
online_pages()/offline_pages(). Especially once we have things like 
memmap_on_memory that need special care.

> My question falls within "Are we sure we will not need that locking back
> in those functions because that is something we will not allow to
> happen?"

Whoever has access to online_pages()/offline_pages() also has access to 
mem_hotplug_begin()/mem_hotplug_done(). It would be nice if we could 
only use online_pages()/offline_pages() internally -- or at least add a 
comment that this is for internal purposes only / requires that locking.
Michal Hocko March 26, 2021, 2:38 p.m. UTC | #50
On Fri 26-03-21 09:52:58, David Hildenbrand wrote:
[...]
> Something else to note:
> 
> 
> We'll not call the memory notifier (e.g., MEM_ONLINE) for the vmemmap. The
> result is that
> 
> 1. We won't allocate extended struct pages for the range. Don't think this
> is really problematic (pages are never allocated/freed, so I guess we don't
> care - like ZONE_DEVICE code).

Agreed. I do not think we need them. Future might disagree but let's
handle it when we have a clear demand.

> 2. We won't allocate kasan shadow memory. We most probably have to do it
> explicitly via kasan_add_zero_shadow()/kasan_remove_zero_shadow(), see
> mm/memremap.c:pagemap_range()

I think this is similar to the above. Does kasan have to know about
memory which will never be used for anything?

> Further a locking rework might be necessary. We hold the device hotplug
> lock, but not the memory hotplug lock. E.g., for get_online_mems(). Might
> have to move that out of online_pages.

Could you be more explicit why this locking is needed? What it would
protect from for vmemmap pages?
David Hildenbrand March 26, 2021, 2:53 p.m. UTC | #51
On 26.03.21 15:38, Michal Hocko wrote:
> On Fri 26-03-21 09:52:58, David Hildenbrand wrote:
> [...]
>> Something else to note:
>>
>>
>> We'll not call the memory notifier (e.g., MEM_ONLINE) for the vmemmap. The
>> result is that
>>
>> 1. We won't allocate extended struct pages for the range. Don't think this
>> is really problematic (pages are never allocated/freed, so I guess we don't
>> care - like ZONE_DEVICE code).
> 
> Agreed. I do not think we need them. Future might disagree but let's
> handle it when we have a clear demand.
> 
>> 2. We won't allocate kasan shadow memory. We most probably have to do it
>> explicitly via kasan_add_zero_shadow()/kasan_remove_zero_shadow(), see
>> mm/memremap.c:pagemap_range()
> 
> I think this is similar to the above. Does kasan have to know about
> memory which will never be used for anything?

IIRC, kasan will track reads/writes to the vmemmap as well. So it could 
theoretically detect if we read from the vmemmap before writing 
(initializing) it IIUC.

This is also why mm/memremap.c does a kasan_add_zero_shadow() before the 
move_pfn_range_to_zone()->memmap_init_range() for the whole region, 
including altmap space.

Now, I am no expert on KASAN, so I am not sure what would happen in case 
we access non-tracked memory.

commit 0207df4fa1a869281ddbf72db6203dbf036b3e1a
Author: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Date:   Fri Aug 17 15:47:04 2018 -0700

     kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN

indicates that kasan will crash the system on "non-existent shadow memory"

> 
>> Further a locking rework might be necessary. We hold the device hotplug
>> lock, but not the memory hotplug lock. E.g., for get_online_mems(). Might
>> have to move that out of online_pages.
> 
> Could you be more explicit why this locking is needed? What it would
> protect from for vmemmap pages?
> 

One example is in mm/kmemleak.c:kmemleak_scan(), where we scan the 
vmemmap for pointers. We don't want the vmemmap to get unmapped while we 
are working on it (-> fault).
Michal Hocko March 26, 2021, 3:31 p.m. UTC | #52
On Fri 26-03-21 15:53:41, David Hildenbrand wrote:
> On 26.03.21 15:38, Michal Hocko wrote:
> > On Fri 26-03-21 09:52:58, David Hildenbrand wrote:
[...]
> > > 2. We won't allocate kasan shadow memory. We most probably have to do it
> > > explicitly via kasan_add_zero_shadow()/kasan_remove_zero_shadow(), see
> > > mm/memremap.c:pagemap_range()
> > 
> > I think this is similar to the above. Does kasan have to know about
> > memory which will never be used for anything?
> 
> IIRC, kasan will track reads/writes to the vmemmap as well. So it could
> theoretically detect if we read from the vmemmap before writing
> (initializing) it IIUC.
> This is also why mm/memremap.c does a kasan_add_zero_shadow() before the
> move_pfn_range_to_zone()->memmap_init_range() for the whole region,
> including altmap space.
> 
> Now, I am no expert on KASAN, so I am not sure what would happen in case
> we access non-tracked memory.
> 
> commit 0207df4fa1a869281ddbf72db6203dbf036b3e1a
> Author: Andrey Ryabinin <ryabinin.a.a@gmail.com>
> Date:   Fri Aug 17 15:47:04 2018 -0700
> 
>     kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN
> 
> indicates that kasan will crash the system on "non-existent shadow memory"

Interesting. Thanks for the pointer.

> > > Further a locking rework might be necessary. We hold the device hotplug
> > > lock, but not the memory hotplug lock. E.g., for get_online_mems(). Might
> > > have to move that out of online_pages.
> > 
> > Could you be more explicit why this locking is needed? What it would
> > protect from for vmemmap pages?
> > 
> 
> One example is in mm/kmemleak.c:kmemleak_scan(), where we scan the vmemmap
> for pointers. We don't want the vmemmap to get unmapped while we are working
> on it (-> fault).
 
Hmm, but they are not going away during offline. They just have a less
defined state. Or what exactly do you mean by unmapped?
David Hildenbrand March 26, 2021, 4:03 p.m. UTC | #53
On 26.03.21 16:31, Michal Hocko wrote:
> On Fri 26-03-21 15:53:41, David Hildenbrand wrote:
>> On 26.03.21 15:38, Michal Hocko wrote:
>>> On Fri 26-03-21 09:52:58, David Hildenbrand wrote:
> [...]
>>>> 2. We won't allocate kasan shadow memory. We most probably have to do it
>>>> explicitly via kasan_add_zero_shadow()/kasan_remove_zero_shadow(), see
>>>> mm/memremap.c:pagemap_range()
>>>
>>> I think this is similar to the above. Does kasan have to know about
>>> memory which will never be used for anything?
>>
>> IIRC, kasan will track reads/writes to the vmemmap as well. So it could
>> theoretically detect if we read from the vmemmap before writing
>> (initializing) it IIUC.
>> This is also why mm/memremap.c does a kasan_add_zero_shadow() before the
>> move_pfn_range_to_zone()->memmap_init_range() for the whole region,
>> including altmap space.
>>
>> Now, I am no expert on KASAN, so I am not sure what would happen in case
>> we access non-tracked memory.
>>
>> commit 0207df4fa1a869281ddbf72db6203dbf036b3e1a
>> Author: Andrey Ryabinin <ryabinin.a.a@gmail.com>
>> Date:   Fri Aug 17 15:47:04 2018 -0700
>>
>>      kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN
>>
>> indicates that kasan will crash the system on "non-existent shadow memory"
> 
> Interesting. Thanks for the pointer.
> 
>>>> Further a locking rework might be necessary. We hold the device hotplug
>>>> lock, but not the memory hotplug lock. E.g., for get_online_mems(). Might
>>>> have to move that out of online_pages.
>>>
>>> Could you be more explicit why this locking is needed? What it would
>>> protect from for vmemmap pages?
>>>
>>
>> One example is in mm/kmemleak.c:kmemleak_scan(), where we scan the vmemmap
>> for pointers. We don't want the vmemmap to get unmapped while we are working
>> on it (-> fault).
>   
> Hmm, but they are not going away during offline. They just have a less
> defined state. Or what exactly do you mean by unmapped?

Hmm, also true. We should double-check whether we're touching other code 
that ought to be synchronized with the memory hotplug lock.

Patch

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f35298425575..5ea2b3fbce02 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -175,7 +175,7 @@  int memory_notify(unsigned long val, void *v)
  */
 static int
 memory_block_action(unsigned long start_section_nr, unsigned long action,
-		    int online_type, int nid)
+		    int online_type, int nid, unsigned long nr_vmemmap_pages)
 {
 	unsigned long start_pfn;
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
@@ -185,10 +185,11 @@  memory_block_action(unsigned long start_section_nr, unsigned long action,
 
 	switch (action) {
 	case MEM_ONLINE:
-		ret = online_pages(start_pfn, nr_pages, online_type, nid);
+		ret = online_pages(start_pfn, nr_pages, nr_vmemmap_pages,
+				   online_type, nid);
 		break;
 	case MEM_OFFLINE:
-		ret = offline_pages(start_pfn, nr_pages);
+		ret = offline_pages(start_pfn, nr_pages, nr_vmemmap_pages);
 		break;
 	default:
 		WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
@@ -211,7 +212,7 @@  static int memory_block_change_state(struct memory_block *mem,
 		mem->state = MEM_GOING_OFFLINE;
 
 	ret = memory_block_action(mem->start_section_nr, to_state,
-				  mem->online_type, mem->nid);
+				  mem->online_type, mem->nid, mem->nr_vmemmap_pages);
 
 	mem->state = ret ? from_state_req : to_state;
 
@@ -567,7 +568,8 @@  int register_memory(struct memory_block *memory)
 	return ret;
 }
 
-static int init_memory_block(unsigned long block_id, unsigned long state)
+static int init_memory_block(unsigned long block_id, unsigned long state,
+			     unsigned long nr_vmemmap_pages)
 {
 	struct memory_block *mem;
 	int ret = 0;
@@ -584,6 +586,7 @@  static int init_memory_block(unsigned long block_id, unsigned long state)
 	mem->start_section_nr = block_id * sections_per_block;
 	mem->state = state;
 	mem->nid = NUMA_NO_NODE;
+	mem->nr_vmemmap_pages = nr_vmemmap_pages;
 
 	ret = register_memory(mem);
 
@@ -603,7 +606,7 @@  static int add_memory_block(unsigned long base_section_nr)
 	if (section_count == 0)
 		return 0;
 	return init_memory_block(memory_block_id(base_section_nr),
-				 MEM_ONLINE);
+				 MEM_ONLINE, 0);
 }
 
 static void unregister_memory(struct memory_block *memory)
@@ -625,7 +628,8 @@  static void unregister_memory(struct memory_block *memory)
  *
  * Called under device_hotplug_lock.
  */
-int create_memory_block_devices(unsigned long start, unsigned long size)
+int create_memory_block_devices(unsigned long start, unsigned long size,
+				unsigned long vmemmap_pages)
 {
 	const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
 	unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
@@ -638,7 +642,7 @@  int create_memory_block_devices(unsigned long start, unsigned long size)
 		return -EINVAL;
 
 	for (block_id = start_block_id; block_id != end_block_id; block_id++) {
-		ret = init_memory_block(block_id, MEM_OFFLINE);
+		ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
 		if (ret)
 			break;
 	}
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 4da95e684e20..97e92e8b556a 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -29,6 +29,11 @@  struct memory_block {
 	int online_type;		/* for passing data to online routine */
 	int nid;			/* NID for this memory block */
 	struct device dev;
+	/*
+	 * Number of vmemmap pages. These pages
+	 * lay at the beginning of the memory block.
+	 */
+	unsigned long nr_vmemmap_pages;
 };
 
 int arch_get_memory_phys_device(unsigned long start_pfn);
@@ -80,7 +85,8 @@  static inline int memory_notify(unsigned long val, void *v)
 #else
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
-int create_memory_block_devices(unsigned long start, unsigned long size);
+int create_memory_block_devices(unsigned long start, unsigned long size,
+				unsigned long vmemmap_pages);
 void remove_memory_block_devices(unsigned long start, unsigned long size);
 extern void memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7288aa5ef73b..a85d4b7d15c2 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -55,6 +55,14 @@  typedef int __bitwise mhp_t;
  */
 #define MHP_MERGE_RESOURCE	((__force mhp_t)BIT(0))
 
+/*
+ * We want memmap (struct page array) to be self contained.
+ * To do so, we will use the beginning of the hot-added range to build
+ * the page tables for the memmap array that describes the entire range.
+ * Only selected architectures support it with SPARSE_VMEMMAP.
+ */
+#define MHP_MEMMAP_ON_MEMORY   ((__force mhp_t)BIT(1))
+
 /*
  * Extended parameters for memory hotplug:
  * altmap: alternative allocator for memmap array (optional)
@@ -101,11 +109,13 @@  extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
 /* VM interface that may be used by firmware interface */
 extern int online_pages(unsigned long pfn, unsigned long nr_pages,
-			int online_type, int nid);
+			unsigned long nr_vmemmap_pages, int online_type,
+			int nid);
 extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 					 unsigned long end_pfn);
 extern void __offline_isolated_pages(unsigned long start_pfn,
-				     unsigned long end_pfn);
+				     unsigned long end_pfn,
+				     unsigned long buddy_start_pfn);
 
 typedef void (*online_page_callback_t)(struct page *page, unsigned int order);
 
@@ -307,7 +317,8 @@  static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
 #ifdef CONFIG_MEMORY_HOTREMOVE
 
 extern void try_offline_node(int nid);
-extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
+extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+			 unsigned long nr_vmemmap_pages);
 extern int remove_memory(int nid, u64 start, u64 size);
 extern void __remove_memory(int nid, u64 start, u64 size);
 extern int offline_and_remove_memory(int nid, u64 start, u64 size);
@@ -315,7 +326,8 @@  extern int offline_and_remove_memory(int nid, u64 start, u64 size);
 #else
 static inline void try_offline_node(int nid) {}
 
-static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+				unsigned long nr_vmemmap_pages)
 {
 	return -EINVAL;
 }
@@ -359,6 +371,7 @@  extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_
 extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
 				      struct mhp_params *params);
 void arch_remove_linear_mapping(u64 start, u64 size);
+extern bool mhp_supports_memmap_on_memory(unsigned long size);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index f5b464daeeca..45a79da89c5f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -17,7 +17,7 @@  struct device;
  * @alloc: track pages consumed, private to vmemmap_populate()
  */
 struct vmem_altmap {
-	const unsigned long base_pfn;
+	unsigned long base_pfn;
 	const unsigned long end_pfn;
 	const unsigned long reserve;
 	unsigned long free;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47946cec7584..747e1c21d8e2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -427,6 +427,11 @@  enum zone_type {
 	 *    techniques might use alloc_contig_range() to hide previously
 	 *    exposed pages from the buddy again (e.g., to implement some sort
 	 *    of memory unplug in virtio-mem).
+	 * 6. Memory-hotplug: when using memmap_on_memory and onlining the memory
+	 *    to the MOVABLE zone, the vmemmap pages are also placed in such
+	 *    zone. Such pages cannot be really moved around as they are
+	 *    self-stored in the range, but they are treated as movable when
+	 *    the range they describe is about to be offlined.
 	 *
 	 * In general, no unmovable allocations that degrade memory offlining
 	 * should end up in ZONE_MOVABLE. Allocators (like alloc_contig_range())
diff --git a/mm/Kconfig b/mm/Kconfig
index 24c045b24b95..febf805000f8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -183,6 +183,11 @@  config MEMORY_HOTREMOVE
 	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
 	depends on MIGRATION
 
+config MHP_MEMMAP_ON_MEMORY
+	def_bool y
+	depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP
+	depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+
 # Heavily threaded applications may benefit from splitting the mm-wide
 # page_table_lock, so that faults on different parts of the user address
 # space can be handled with less contention: split it at this NR_CPUS.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5ba51a8bdaeb..350cde69a97d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -42,6 +42,8 @@ 
 #include "internal.h"
 #include "shuffle.h"
 
+static bool memmap_on_memory;
+
 /*
  * online_page_callback contains pointer to current page onlining function.
  * Initially it is generic_online_page(). If it is required it could be
@@ -638,10 +640,22 @@  void generic_online_page(struct page *page, unsigned int order)
 }
 EXPORT_SYMBOL_GPL(generic_online_page);
 
-static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
+static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
+			       unsigned long buddy_start_pfn)
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
-	unsigned long pfn;
+	unsigned long pfn = buddy_start_pfn;
+
+	/*
+	 * When using memmap_on_memory, the range might be unaligned as the
+	 * first pfns are used for vmemmap pages. Align it in case we need to.
+	 */
+	VM_BUG_ON(!IS_ALIGNED(pfn, pageblock_nr_pages));
+
+	while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
+		(*online_page_callback)(pfn_to_page(pfn), pageblock_order);
+		pfn += pageblock_nr_pages;
+	}
 
 	/*
 	 * Online the pages in MAX_ORDER - 1 aligned chunks. The callback might
@@ -649,7 +663,7 @@  static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
 	 * later). We account all pages as being online and belonging to this
 	 * zone ("present").
 	 */
-	for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
+	for (; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
 		(*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
 
 	/* mark all involved sections as online */
@@ -830,9 +844,9 @@  struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
 }
 
 int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
-		       int online_type, int nid)
+		       unsigned long nr_vmemmap_pages, int online_type, int nid)
 {
-	unsigned long flags;
+	unsigned long flags, buddy_start_pfn, buddy_nr_pages;
 	struct zone *zone;
 	int need_zonelists_rebuild = 0;
 	int ret;
@@ -843,11 +857,18 @@  int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 			 !IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
 		return -EINVAL;
 
+	buddy_start_pfn = pfn + nr_vmemmap_pages;
+	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
+
 	mem_hotplug_begin();
 
 	/* associate pfn range with the zone */
 	zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
-	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
+	if (nr_vmemmap_pages)
+		move_pfn_range_to_zone(zone, pfn, nr_vmemmap_pages, NULL,
+				       MIGRATE_UNMOVABLE);
+	move_pfn_range_to_zone(zone, buddy_start_pfn, buddy_nr_pages, NULL,
+			       MIGRATE_ISOLATE);
 
 	arg.start_pfn = pfn;
 	arg.nr_pages = nr_pages;
@@ -863,7 +884,7 @@  int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	 * onlining, such that undo_isolate_page_range() works correctly.
 	 */
 	spin_lock_irqsave(&zone->lock, flags);
-	zone->nr_isolate_pageblock += nr_pages / pageblock_nr_pages;
+	zone->nr_isolate_pageblock += buddy_nr_pages / pageblock_nr_pages;
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	/*
@@ -876,7 +897,7 @@  int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 		setup_zone_pageset(zone);
 	}
 
-	online_pages_range(pfn, nr_pages);
+	online_pages_range(pfn, nr_pages, buddy_start_pfn);
 	zone->present_pages += nr_pages;
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
@@ -889,7 +910,9 @@  int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	zone_pcp_update(zone);
 
 	/* Basic onlining is complete, allow allocation of onlined pages. */
-	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
+	undo_isolate_page_range(buddy_start_pfn,
+				buddy_start_pfn + buddy_nr_pages,
+				MIGRATE_MOVABLE);
 
 	/*
 	 * Freshly onlined pages aren't shuffled (e.g., all pages are placed to
@@ -1064,6 +1087,45 @@  static int online_memory_block(struct memory_block *mem, void *arg)
 	return device_online(&mem->dev);
 }
 
+bool mhp_supports_memmap_on_memory(unsigned long size)
+{
+	unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
+	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
+	unsigned long remaining_size = size - vmemmap_size;
+
+	/*
+	 * Besides having arch support and the feature enabled at runtime, we
+	 * need a few more assumptions to hold true:
+	 *
+	 *    a) We span a single memory block: memory onlining/offlining happens
+	 *    in memory block granularity. We don't want the vmemmap of online
+	 *    memory blocks to reside on offline memory blocks. In the future,
+	 *    we might want to support variable-sized memory blocks to make the
+	 *    feature more versatile.
+	 *
+	 * b) The vmemmap pages span complete PMDs: We don't want vmemmap code
+	 *    to populate memory from the altmap for unrelated parts (i.e.,
+	 *    other memory blocks)
+	 *
+	 * c) The vmemmap pages (and thereby the pages that will be exposed to
+	 *    the buddy) have to cover full pageblocks: memory onlining/offlining
+	 *    code requires applicable ranges to be page-aligned, for example, to
+	 *    set the migratetypes properly.
+	 *
+	 * TODO: Although we have a check here to make sure that vmemmap pages
+	 *	 fully populate a PMD, it is not the right place to check for
+	 *	 this. A much better solution involves improving vmemmap code
+	 *	 to fallback to base pages when trying to populate vmemmap using
+	 *	 altmap as an alternative source of memory, and we do not exactly
+	 *	 populate a single PMD.
+	 */
+	return memmap_on_memory &&
+	       IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
+	       size == memory_block_size_bytes() &&
+	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
+	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
+}
+
 /*
  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
  * and online/offline operations (triggered e.g. by sysfs).
@@ -1073,6 +1135,7 @@  static int online_memory_block(struct memory_block *mem, void *arg)
 int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
 	struct mhp_params params = { .pgprot = PAGE_KERNEL };
+	struct vmem_altmap mhp_altmap = {};
 	u64 start, size;
 	bool new_node = false;
 	int ret;
@@ -1099,13 +1162,26 @@  int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		goto error;
 	new_node = ret;
 
+	/*
+	 * Self hosted memmap array
+	 */
+	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
+		if (!mhp_supports_memmap_on_memory(size)) {
+			ret = -EINVAL;
+			goto error;
+		}
+		mhp_altmap.free = PHYS_PFN(size);
+		mhp_altmap.base_pfn = PHYS_PFN(start);
+		params.altmap = &mhp_altmap;
+	}
+
 	/* call arch's memory hotadd */
 	ret = arch_add_memory(nid, start, size, &params);
 	if (ret < 0)
 		goto error;
 
 	/* create memory block devices after memory was added */
-	ret = create_memory_block_devices(start, size);
+	ret = create_memory_block_devices(start, size, mhp_altmap.alloc);
 	if (ret) {
 		arch_remove_memory(nid, start, size, NULL);
 		goto error;
@@ -1563,10 +1639,11 @@  static int count_system_ram_pages_cb(unsigned long start_pfn,
 	return 0;
 }
 
-int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+			unsigned long nr_vmemmap_pages)
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
-	unsigned long pfn, system_ram_pages = 0;
+	unsigned long pfn, buddy_start_pfn, buddy_nr_pages, system_ram_pages = 0;
 	unsigned long flags;
 	struct zone *zone;
 	struct memory_notify arg;
@@ -1578,6 +1655,9 @@  int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 			 !IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
 		return -EINVAL;
 
+	buddy_start_pfn = start_pfn + nr_vmemmap_pages;
+	buddy_nr_pages = nr_pages - nr_vmemmap_pages;
+
 	mem_hotplug_begin();
 
 	/*
@@ -1613,7 +1693,7 @@  int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	zone_pcp_disable(zone);
 
 	/* set above range as isolated */
-	ret = start_isolate_page_range(start_pfn, end_pfn,
+	ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
 				       MIGRATE_MOVABLE,
 				       MEMORY_OFFLINE | REPORT_FAILURE);
 	if (ret) {
@@ -1633,7 +1713,7 @@  int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	}
 
 	do {
-		pfn = start_pfn;
+		pfn = buddy_start_pfn;
 		do {
 			if (signal_pending(current)) {
 				ret = -EINTR;
@@ -1664,18 +1744,18 @@  int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 		 * offlining actually in order to make hugetlbfs's object
 		 * counting consistent.
 		 */
-		ret = dissolve_free_huge_pages(start_pfn, end_pfn);
+		ret = dissolve_free_huge_pages(buddy_start_pfn, end_pfn);
 		if (ret) {
 			reason = "failure to dissolve huge pages";
 			goto failed_removal_isolated;
 		}
 
-		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
+		ret = test_pages_isolated(buddy_start_pfn, end_pfn, MEMORY_OFFLINE);
 
 	} while (ret);
 
 	/* Mark all sections offline and remove free pages from the buddy. */
-	__offline_isolated_pages(start_pfn, end_pfn);
+	__offline_isolated_pages(start_pfn, end_pfn, buddy_start_pfn);
 	pr_debug("Offlined Pages %ld\n", nr_pages);
 
 	/*
@@ -1684,13 +1764,13 @@  int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	 * of isolated pageblocks, memory onlining will properly revert this.
 	 */
 	spin_lock_irqsave(&zone->lock, flags);
-	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
+	zone->nr_isolate_pageblock -= buddy_nr_pages / pageblock_nr_pages;
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	zone_pcp_enable(zone);
 
 	/* removal success */
-	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
+	adjust_managed_page_count(pfn_to_page(start_pfn), -buddy_nr_pages);
 	zone->present_pages -= nr_pages;
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
@@ -1719,7 +1799,7 @@  int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	return 0;
 
 failed_removal_isolated:
-	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
+	undo_isolate_page_range(buddy_start_pfn, end_pfn, MIGRATE_MOVABLE);
 	memory_notify(MEM_CANCEL_OFFLINE, &arg);
 failed_removal_pcplists_disabled:
 	zone_pcp_enable(zone);
@@ -1750,6 +1830,14 @@  static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
 	return 0;
 }
 
+static int get_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
+{
+	/*
+	 * If not set, continue with the next block.
+	 */
+	return mem->nr_vmemmap_pages;
+}
+
 static int check_cpu_on_node(pg_data_t *pgdat)
 {
 	int cpu;
@@ -1824,6 +1912,9 @@  EXPORT_SYMBOL(try_offline_node);
 static int __ref try_remove_memory(int nid, u64 start, u64 size)
 {
 	int rc = 0;
+	struct vmem_altmap mhp_altmap = {};
+	struct vmem_altmap *altmap = NULL;
+	unsigned long nr_vmemmap_pages = 0;
 
 	BUG_ON(check_hotplug_memory_range(start, size));
 
@@ -1836,6 +1927,31 @@  static int __ref try_remove_memory(int nid, u64 start, u64 size)
 	if (rc)
 		return rc;
 
+	/*
+	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
+	 * the same granularity it was added - a single memory block.
+	 */
+	if (memmap_on_memory) {
+		nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
+						      get_nr_vmemmap_pages_cb);
+		if (nr_vmemmap_pages) {
+			if (size != memory_block_size_bytes()) {
+				pr_warn("Refuse to remove %#llx - %#llx,"
+					"wrong granularity\n",
+					 start, start + size);
+				return -EINVAL;
+			}
+
+			/*
+			 * Let remove_pmd_table->free_hugepage_table
+			 * do the right thing if we used vmem_altmap
+			 * when hot-adding the range.
+			 */
+			mhp_altmap.alloc = nr_vmemmap_pages;
+			altmap = &mhp_altmap;
+		}
+	}
+
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
 
@@ -1847,7 +1963,7 @@  static int __ref try_remove_memory(int nid, u64 start, u64 size)
 
 	mem_hotplug_begin();
 
-	arch_remove_memory(nid, start, size, NULL);
+	arch_remove_memory(nid, start, size, altmap);
 
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
 		memblock_free(start, size);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e4b29ee2b1e..85c478e374d7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8830,7 +8830,8 @@  void zone_pcp_reset(struct zone *zone)
  * All pages in the range must be in a single zone, must not contain holes,
  * must span full sections, and must be isolated before calling this function.
  */
-void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
+void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn,
+			      unsigned long buddy_start_pfn)
 {
 	unsigned long pfn = start_pfn;
 	struct page *page;
@@ -8841,6 +8842,7 @@  void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 	offline_mem_sections(pfn, end_pfn);
 	zone = page_zone(pfn_to_page(pfn));
 	spin_lock_irqsave(&zone->lock, flags);
+	pfn = buddy_start_pfn;
 	while (pfn < end_pfn) {
 		page = pfn_to_page(pfn);
 		/*