
[v7,0/7] Add support for memmap on memory feature on ppc64

Message ID: 20230801044116.10674-1-aneesh.kumar@linux.ibm.com

Message

Aneesh Kumar K.V Aug. 1, 2023, 4:41 a.m. UTC
This patch series updates the memmap on memory feature to fall back to
memmap allocation outside the memory block if the alignment rules are
not met. This makes the feature more useful on architectures like
ppc64, where the alignment rules differ with a 64K page size.

This patch series depends on the dax vmemmap optimization series
posted here:
https://lore.kernel.org/linux-mm/20230718022934.90447-1-aneesh.kumar@linux.ibm.com/

Changes from v6:
* Update comments in the code
* Update commit message for patch 7

Changes from v5:
* Update commit message
* Move memory alloc/free to the callers in patch 6
* Address review feedback w.r.t patch 4

Changes from v4:
* Use altmap.free instead of altmap.reserve
* Address review feedback

Changes from v3:
* Extend the module parameter memmap_on_memory to force allocation even
  though we can waste hotplug memory.

Changes from v2:
* Rebase to latest linus tree
* Redo the series based on review feedback. Multiple changes to the patchset.

Changes from v1:
* Update the memory block to store vmem_altmap details. This is required
  so that when we remove the memory we can find the altmap details, which
  are needed on some architectures.
* Rebase to latest linus tree



Aneesh Kumar K.V (7):
  mm/memory_hotplug: Simplify ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE kconfig
  mm/memory_hotplug: Allow memmap on memory hotplug request to fallback
  mm/memory_hotplug: Allow architecture to override memmap on memory
    support check
  mm/memory_hotplug: Support memmap_on_memory when memmap is not aligned
    to pageblocks
  powerpc/book3s64/memhotplug: Enable memmap on memory for radix
  mm/memory_hotplug: Embed vmem_altmap details in memory block
  mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter

 .../admin-guide/mm/memory-hotplug.rst         |  12 +
 arch/arm64/Kconfig                            |   4 +-
 arch/powerpc/Kconfig                          |   1 +
 arch/powerpc/include/asm/pgtable.h            |  21 ++
 .../platforms/pseries/hotplug-memory.c        |   2 +-
 arch/x86/Kconfig                              |   4 +-
 drivers/acpi/acpi_memhotplug.c                |   3 +-
 drivers/base/memory.c                         |  27 ++-
 include/linux/memory.h                        |   8 +-
 include/linux/memory_hotplug.h                |   3 +-
 mm/Kconfig                                    |   3 +
 mm/memory_hotplug.c                           | 205 ++++++++++++++----
 12 files changed, 220 insertions(+), 73 deletions(-)

Comments

Michal Hocko Aug. 1, 2023, 8:58 a.m. UTC | #1
On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
> Allow updating memmap_on_memory mode after the kernel boot. Memory
> hotplug done after the mode update will use the new mmemap_on_memory
> value.

Well, this is a user space kABI extension and as such you should spend
more words about the usecase. Why we could live with this static and now
need dynamic?
Michal Hocko Aug. 1, 2023, 9:04 a.m. UTC | #2
On Tue 01-08-23 10:11:13, Aneesh Kumar K.V wrote:
[...]
> +	if (mode == MEMMAP_ON_MEMORY_FORCE) {
> +		unsigned long memmap_pages = memory_block_memmap_on_memory_pages();
		unsigned long wastage = memmap_pages - PFN_UP(memory_block_memmap_size());

		if (wastage)
			pr_info_once("memmap_on_memory=force will waste %ld pages in each memory block %ld\n",
				wastage, memory_block_size_bytes() >> PAGE_SHIFT);

would be more useful IMHO.

> +
> +		pr_info_once("Memory hotplug will waste %ld pages in each memory block\n",
> +			     memmap_pages - PFN_UP(memory_block_memmap_size()));
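
For readability, Michal's suggestion folded into the quoted branch would look
roughly like this (a sketch only; the helper names are the ones used in the
series, the exact message wording is up to the author):

	if (mode == MEMMAP_ON_MEMORY_FORCE) {
		unsigned long memmap_pages = memory_block_memmap_on_memory_pages();
		unsigned long wastage = memmap_pages - PFN_UP(memory_block_memmap_size());

		/* only report wastage when forcing actually wastes pages */
		if (wastage)
			pr_info_once("memmap_on_memory=force will waste %ld pages in each memory block of %ld pages\n",
				     wastage,
				     memory_block_size_bytes() >> PAGE_SHIFT);
	}
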
Michal Hocko Aug. 1, 2023, 9:07 a.m. UTC | #3
I cannot really judge the ppc specific part but other patches seem
reasonable. Patch 4 could print a more useful information about the
wastage but this is nothing really earth shattering. I am not sure about
the last patch which makes the on-memory property dynamic. This needs
much more justification and use case description IMHO.

That being said for patches 1 - 4 and 6 feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>

On Tue 01-08-23 10:11:09, Aneesh Kumar K.V wrote:
> This patch series update memmap on memory feature to fall back to
> memmap allocation outside the memory block if the alignment rules are
> not met. This makes the feature more useful on architectures like
> ppc64 where alignment rules are different with 64K page size.
> 
> This patch series is dependent on dax vmemmap optimization series
> posted here
> https://lore.kernel.org/linux-mm/20230718022934.90447-1-aneesh.kumar@linux.ibm.com/
> 
> Changes from v6:
> * Update comments in the code
> * Update commit message for patch 7
> 
> Changes from v5:
> * Update commit message
> * Move memory alloc/free to the callers in patch 6
> * Address review feedback w.r.t patch 4
> 
> Changes from v4:
> * Use altmap.free instead of altmap.reserve
> * Address review feedback
> 
> Changes from v3:
> * Extend the module parameter memmap_on_memory to force allocation even
>   though we can waste hotplug memory.
> 
> Changes from v2:
> * Rebase to latest linus tree
> * Redo the series based on review feedback. Multiple changes to the patchset.
> 
> Changes from v1:
> * update the memblock to store vmemmap_altmap details. This is required
> so that when we remove the memory we can find the altmap details which
> is needed on some architectures.
> * rebase to latest linus tree
> 
> 
> 
> Aneesh Kumar K.V (7):
>   mm/memory_hotplug: Simplify ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE kconfig
>   mm/memory_hotplug: Allow memmap on memory hotplug request to fallback
>   mm/memory_hotplug: Allow architecture to override memmap on memory
>     support check
>   mm/memory_hotplug: Support memmap_on_memory when memmap is not aligned
>     to pageblocks
>   powerpc/book3s64/memhotplug: Enable memmap on memory for radix
>   mm/memory_hotplug: Embed vmem_altmap details in memory block
>   mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter
> 
>  .../admin-guide/mm/memory-hotplug.rst         |  12 +
>  arch/arm64/Kconfig                            |   4 +-
>  arch/powerpc/Kconfig                          |   1 +
>  arch/powerpc/include/asm/pgtable.h            |  21 ++
>  .../platforms/pseries/hotplug-memory.c        |   2 +-
>  arch/x86/Kconfig                              |   4 +-
>  drivers/acpi/acpi_memhotplug.c                |   3 +-
>  drivers/base/memory.c                         |  27 ++-
>  include/linux/memory.h                        |   8 +-
>  include/linux/memory_hotplug.h                |   3 +-
>  mm/Kconfig                                    |   3 +
>  mm/memory_hotplug.c                           | 205 ++++++++++++++----
>  12 files changed, 220 insertions(+), 73 deletions(-)
> 
> -- 
> 2.41.0
Michal Hocko Aug. 1, 2023, 10:50 a.m. UTC | #4
On Tue 01-08-23 14:58:29, Aneesh Kumar K V wrote:
> On 8/1/23 2:28 PM, Michal Hocko wrote:
> > On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
> >> Allow updating memmap_on_memory mode after the kernel boot. Memory
> >> hotplug done after the mode update will use the new mmemap_on_memory
> >> value.
> > 
> > Well, this is a user space kABI extension and as such you should spend
> > more words about the usecase. Why we could live with this static and now
> > need dynamic?
> > 
> 
> This enables easy testing of memmap_on_memory feature without a kernel reboot.

Testing alone is rather weak argument to be honest.

> I also expect people wanting to use that when they find dax kmem memory online
> failing because of struct page allocation failures[1]. User could reboot back with
> memmap_on_memory=y kernel parameter. But being able to enable it via sysfs makes
> the feature much more useful.

Sure it can be useful but that holds for any feature, right. The main
question is whether this is worth maintaing. The current implementation
seems rather trivial which is an argument to have it but are there any
risks long term? Have you evaluated a potential long term maintenance
cost? There is no easy way to go back and disable it later on without
breaking some userspace.

All that should be in the changelog!
Verma, Vishal L Aug. 1, 2023, 11:10 p.m. UTC | #5
On Tue, 2023-08-01 at 10:11 +0530, Aneesh Kumar K.V wrote:
> With memmap on memory, some architecture needs more details w.r.t altmap
> such as base_pfn, end_pfn, etc to unmap vmemmap memory. Instead of
> computing them again when we remove a memory block, embed vmem_altmap
> details in struct memory_block if we are using memmap on memory block
> feature.
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  drivers/base/memory.c  | 27 +++++++++++++--------
>  include/linux/memory.h |  8 ++----
>  mm/memory_hotplug.c    | 55 ++++++++++++++++++++++++++----------------
>  3 files changed, 53 insertions(+), 37 deletions(-)
> 
<snip>

> @@ -2136,10 +2148,10 @@ EXPORT_SYMBOL(try_offline_node);
>  
>  static int __ref try_remove_memory(u64 start, u64 size)
>  {
> -       struct vmem_altmap mhp_altmap = {};
> -       struct vmem_altmap *altmap = NULL;
> -       unsigned long nr_vmemmap_pages;
> +       int ret;

Minor nit - there is already an 'int rc' below - just use that, or
rename it to 'ret' if that's better for consistency.

> +       struct memory_block *mem;
>         int rc = 0, nid = NUMA_NO_NODE;
> +       struct vmem_altmap *altmap = NULL;
>  
>         BUG_ON(check_hotplug_memory_range(start, size));
>  
> @@ -2161,25 +2173,20 @@ static int __ref try_remove_memory(u64 start, u64 size)
>          * the same granularity it was added - a single memory block.
>          */
>         if (mhp_memmap_on_memory()) {
> -               nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
> -                                                     get_nr_vmemmap_pages_cb);
> -               if (nr_vmemmap_pages) {
> +               ret = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
> +               if (ret) {
>                         if (size != memory_block_size_bytes()) {
>                                 pr_warn("Refuse to remove %#llx - %#llx,"
>                                         "wrong granularity\n",
>                                         start, start + size);
>                                 return -EINVAL;
>                         }
> -
> +                       altmap = mem->altmap;
>                         /*
> -                        * Let remove_pmd_table->free_hugepage_table do the
> -                        * right thing if we used vmem_altmap when hot-adding
> -                        * the range.
> +                        * Mark altmap NULL so that we can add a debug
> +                        * check on memblock free.
>                          */
> -                       mhp_altmap.base_pfn = PHYS_PFN(start);
> -                       mhp_altmap.free = nr_vmemmap_pages;
> -                       mhp_altmap.alloc = nr_vmemmap_pages;
> -                       altmap = &mhp_altmap;
> +                       mem->altmap = NULL;
>                 }
>         }
>
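
For reference, the test_has_altmap_cb() callback used above is not shown in
this hunk; roughly, it has to report the first block carrying an altmap and
stop the walk (a sketch only, the exact implementation is in the patch):

static int test_has_altmap_cb(struct memory_block *mem, void *arg)
{
	struct memory_block **mem_ptr = arg;

	if (mem->altmap) {
		*mem_ptr = mem;
		return 1;	/* non-zero return stops walk_memory_blocks() */
	}
	return 0;
}
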
Aneesh Kumar K.V Aug. 2, 2023, 4:45 a.m. UTC | #6
On 8/1/23 4:20 PM, Michal Hocko wrote:
> On Tue 01-08-23 14:58:29, Aneesh Kumar K V wrote:
>> On 8/1/23 2:28 PM, Michal Hocko wrote:
>>> On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
>>>> Allow updating memmap_on_memory mode after the kernel boot. Memory
>>>> hotplug done after the mode update will use the new mmemap_on_memory
>>>> value.
>>>
>>> Well, this is a user space kABI extension and as such you should spend
>>> more words about the usecase. Why we could live with this static and now
>>> need dynamic?
>>>
>>
>> This enables easy testing of memmap_on_memory feature without a kernel reboot.
> 
> Testing alone is rather weak argument to be honest.
> 
>> I also expect people wanting to use that when they find dax kmem memory online
>> failing because of struct page allocation failures[1]. User could reboot back with
>> memmap_on_memory=y kernel parameter. But being able to enable it via sysfs makes
>> the feature much more useful.
> 
> Sure it can be useful but that holds for any feature, right. The main
> question is whether this is worth maintaing. The current implementation
> seems rather trivial which is an argument to have it but are there any
> risks long term? Have you evaluated a potential long term maintenance
> cost? There is no easy way to go back and disable it later on without
> breaking some userspace.
> 
> All that should be in the changelog!

I updated it as below. 

mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter

Allow updating memmap_on_memory mode after the kernel boot. Memory
hotplug done after the mode update will use the new memmap_on_memory
value.

It is now possible to test the memmap_on_memory feature easily without
the need for a kernel reboot. Additionally, the feature may prove useful
for users who hit struct page allocation failures while onlining dax kmem
memory. Instead of rebooting with the memmap_on_memory=y kernel parameter,
they can now enable it via sysfs, which makes the feature considerably more
useful. The changes needed to support runtime updates are small and should
not get in the way of new features or cause long-term maintenance problems.
Aneesh Kumar K.V Aug. 2, 2023, 4:47 a.m. UTC | #7
On 8/2/23 4:40 AM, Verma, Vishal L wrote:
> On Tue, 2023-08-01 at 10:11 +0530, Aneesh Kumar K.V wrote:
>> With memmap on memory, some architecture needs more details w.r.t altmap
>> such as base_pfn, end_pfn, etc to unmap vmemmap memory. Instead of
>> computing them again when we remove a memory block, embed vmem_altmap
>> details in struct memory_block if we are using memmap on memory block
>> feature.
>>
>> Acked-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  drivers/base/memory.c  | 27 +++++++++++++--------
>>  include/linux/memory.h |  8 ++----
>>  mm/memory_hotplug.c    | 55 ++++++++++++++++++++++++++----------------
>>  3 files changed, 53 insertions(+), 37 deletions(-)
>>
> <snip>
> 
>> @@ -2136,10 +2148,10 @@ EXPORT_SYMBOL(try_offline_node);
>>  
>>  static int __ref try_remove_memory(u64 start, u64 size)
>>  {
>> -       struct vmem_altmap mhp_altmap = {};
>> -       struct vmem_altmap *altmap = NULL;
>> -       unsigned long nr_vmemmap_pages;
>> +       int ret;
> 
> Minor nit - there is already an 'int rc' below - just use that, or
> rename it to 'ret' if that's better for consistency.
> 


I reused the existing rc variable. 

>> +       struct memory_block *mem;
>>         int rc = 0, nid = NUMA_NO_NODE;
>> +       struct vmem_altmap *altmap = NULL;
>>  
>>         BUG_ON(check_hotplug_memory_range(start, size));
>>  
>> @@ -2161,25 +2173,20 @@ static int __ref try_remove_memory(u64 start, u64 size)
>>          * the same granularity it was added - a single memory block.
>>          */
>>         if (mhp_memmap_on_memory()) {
>> -               nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
>> -                                                     get_nr_vmemmap_pages_cb);
>> -               if (nr_vmemmap_pages) {
>> +               ret = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
>> +               if (ret) {
>>                         if (size != memory_block_size_bytes()) {
>>                                 pr_warn("Refuse to remove %#llx - %#llx,"
>>                                         "wrong granularity\n",
>>                                         start, start + size);
>>                                 return -EINVAL;
>>                         }
>> -
>> +                       altmap = mem->altmap;
>>                         /*
>> -                        * Let remove_pmd_table->free_hugepage_table do the
>> -                        * right thing if we used vmem_altmap when hot-adding
>> -                        * the range.
>> +                        * Mark altmap NULL so that we can add a debug
>> +                        * check on memblock free.
>>                          */
>> -                       mhp_altmap.base_pfn = PHYS_PFN(start);
>> -                       mhp_altmap.free = nr_vmemmap_pages;
>> -                       mhp_altmap.alloc = nr_vmemmap_pages;
>> -                       altmap = &mhp_altmap;
>> +                       mem->altmap = NULL;
>>                 }
>>         }
>>  


Thank you for taking the time to review the patch.

-aneesh
Michal Hocko Aug. 2, 2023, 3:50 p.m. UTC | #8
On Wed 02-08-23 10:15:04, Aneesh Kumar K V wrote:
> On 8/1/23 4:20 PM, Michal Hocko wrote:
> > On Tue 01-08-23 14:58:29, Aneesh Kumar K V wrote:
> >> On 8/1/23 2:28 PM, Michal Hocko wrote:
> >>> On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
> >>>> Allow updating memmap_on_memory mode after the kernel boot. Memory
> >>>> hotplug done after the mode update will use the new mmemap_on_memory
> >>>> value.
> >>>
> >>> Well, this is a user space kABI extension and as such you should spend
> >>> more words about the usecase. Why we could live with this static and now
> >>> need dynamic?
> >>>
> >>
> >> This enables easy testing of memmap_on_memory feature without a kernel reboot.
> > 
> > Testing alone is rather weak argument to be honest.
> > 
> >> I also expect people wanting to use that when they find dax kmem memory online
> >> failing because of struct page allocation failures[1]. User could reboot back with
> >> memmap_on_memory=y kernel parameter. But being able to enable it via sysfs makes
> >> the feature much more useful.
> > 
> > Sure it can be useful but that holds for any feature, right. The main
> > question is whether this is worth maintaing. The current implementation
> > seems rather trivial which is an argument to have it but are there any
> > risks long term? Have you evaluated a potential long term maintenance
> > cost? There is no easy way to go back and disable it later on without
> > breaking some userspace.
> > 
> > All that should be in the changelog!
> 
> I updated it as below. 
> 
> mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter
> 
> Allow updating memmap_on_memory mode after the kernel boot. Memory
> hotplug done after the mode update will use the new mmemap_on_memory
> value.
> 
> It is now possible to test the memmap_on_memory feature easily without
> the need for a kernel reboot. Additionally, for those encountering
> struct page allocation failures while using dax kmem memory online, this
> feature may prove useful. Instead of rebooting with the
> memmap_on_memory=y kernel parameter, users can now enable it via sysfs,
> which greatly enhances its usefulness.


I do not really see a solid argument why rebooting is really a problem
TBH. Also is the global policy knob really the right fit for existing
hotplug usecases? In other words, if we really want to make
memmap_on_memory more flexible would it make more sense to have it per
memory block property instead (the global knob being the default if
admin doesn't specify it differently).
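
To illustrate what a per memory block property would mean in sysfs terms, a
hypothetical read-only attribute could look roughly like the following (not
part of this series; it would sit next to the existing attributes of
/sys/devices/system/memory/memoryN/ in drivers/base/memory.c, and a writable
variant would additionally have to cope with the creation-order problem
discussed below):

static ssize_t memmap_on_memory_show(struct device *dev,
				     struct device_attribute *attr, char *buf)
{
	struct memory_block *mem = to_memory_block(dev);

	/* report whether this block's memmap lives in the block itself */
	return sysfs_emit(buf, "%d\n", mem->altmap ? 1 : 0);
}
static DEVICE_ATTR_RO(memmap_on_memory);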
David Hildenbrand Aug. 2, 2023, 3:54 p.m. UTC | #9
On 02.08.23 17:50, Michal Hocko wrote:
> On Wed 02-08-23 10:15:04, Aneesh Kumar K V wrote:
>> On 8/1/23 4:20 PM, Michal Hocko wrote:
>>> On Tue 01-08-23 14:58:29, Aneesh Kumar K V wrote:
>>>> On 8/1/23 2:28 PM, Michal Hocko wrote:
>>>>> On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
>>>>>> Allow updating memmap_on_memory mode after the kernel boot. Memory
>>>>>> hotplug done after the mode update will use the new mmemap_on_memory
>>>>>> value.
>>>>>
>>>>> Well, this is a user space kABI extension and as such you should spend
>>>>> more words about the usecase. Why we could live with this static and now
>>>>> need dynamic?
>>>>>
>>>>
>>>> This enables easy testing of memmap_on_memory feature without a kernel reboot.
>>>
>>> Testing alone is rather weak argument to be honest.
>>>
>>>> I also expect people wanting to use that when they find dax kmem memory online
>>>> failing because of struct page allocation failures[1]. User could reboot back with
>>>> memmap_on_memory=y kernel parameter. But being able to enable it via sysfs makes
>>>> the feature much more useful.
>>>
>>> Sure it can be useful but that holds for any feature, right. The main
>>> question is whether this is worth maintaing. The current implementation
>>> seems rather trivial which is an argument to have it but are there any
>>> risks long term? Have you evaluated a potential long term maintenance
>>> cost? There is no easy way to go back and disable it later on without
>>> breaking some userspace.
>>>
>>> All that should be in the changelog!
>>
>> I updated it as below.
>>
>> mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter
>>
>> Allow updating memmap_on_memory mode after the kernel boot. Memory
>> hotplug done after the mode update will use the new mmemap_on_memory
>> value.
>>
>> It is now possible to test the memmap_on_memory feature easily without
>> the need for a kernel reboot. Additionally, for those encountering
>> struct page allocation failures while using dax kmem memory online, this
>> feature may prove useful. Instead of rebooting with the
>> memmap_on_memory=y kernel parameter, users can now enable it via sysfs,
>> which greatly enhances its usefulness.
> 
> 
> I do not really see a solid argument why rebooting is really a problem
> TBH. Also is the global policy knob really the right fit for existing
> hotplug usecases? In other words, if we really want to make
> memmap_on_memory more flexible would it make more sense to have it per
> memory block property instead (the global knob being the default if
> admin doesn't specify it differently).

Per memory block isn't possible, due to the creation order. Also, I 
think it's not the right approach.

I thought about driver toggles. At least for dax/kmem people are looking 
into that:

https://lkml.kernel.org/r/20230801-vv-kmem_memmap-v3-2-406e9aaf5689@intel.com

Where that can also be toggled per device.
Aneesh Kumar K.V Aug. 2, 2023, 3:57 p.m. UTC | #10
On 8/2/23 9:24 PM, David Hildenbrand wrote:
> On 02.08.23 17:50, Michal Hocko wrote:
>> On Wed 02-08-23 10:15:04, Aneesh Kumar K V wrote:
>>> On 8/1/23 4:20 PM, Michal Hocko wrote:
>>>> On Tue 01-08-23 14:58:29, Aneesh Kumar K V wrote:
>>>>> On 8/1/23 2:28 PM, Michal Hocko wrote:
>>>>>> On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
>>>>>>> Allow updating memmap_on_memory mode after the kernel boot. Memory
>>>>>>> hotplug done after the mode update will use the new mmemap_on_memory
>>>>>>> value.
>>>>>>
>>>>>> Well, this is a user space kABI extension and as such you should spend
>>>>>> more words about the usecase. Why we could live with this static and now
>>>>>> need dynamic?
>>>>>>
>>>>>
>>>>> This enables easy testing of memmap_on_memory feature without a kernel reboot.
>>>>
>>>> Testing alone is rather weak argument to be honest.
>>>>
>>>>> I also expect people wanting to use that when they find dax kmem memory online
>>>>> failing because of struct page allocation failures[1]. User could reboot back with
>>>>> memmap_on_memory=y kernel parameter. But being able to enable it via sysfs makes
>>>>> the feature much more useful.
>>>>
>>>> Sure it can be useful but that holds for any feature, right. The main
>>>> question is whether this is worth maintaing. The current implementation
>>>> seems rather trivial which is an argument to have it but are there any
>>>> risks long term? Have you evaluated a potential long term maintenance
>>>> cost? There is no easy way to go back and disable it later on without
>>>> breaking some userspace.
>>>>
>>>> All that should be in the changelog!
>>>
>>> I updated it as below.
>>>
>>> mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter
>>>
>>> Allow updating memmap_on_memory mode after the kernel boot. Memory
>>> hotplug done after the mode update will use the new mmemap_on_memory
>>> value.
>>>
>>> It is now possible to test the memmap_on_memory feature easily without
>>> the need for a kernel reboot. Additionally, for those encountering
>>> struct page allocation failures while using dax kmem memory online, this
>>> feature may prove useful. Instead of rebooting with the
>>> memmap_on_memory=y kernel parameter, users can now enable it via sysfs,
>>> which greatly enhances its usefulness.
>>
>>
>> I do not really see a solid argument why rebooting is really a problem
>> TBH. Also is the global policy knob really the right fit for existing
>> hotplug usecases? In other words, if we really want to make
>> memmap_on_memory more flexible would it make more sense to have it per
>> memory block property instead (the global knob being the default if
>> admin doesn't specify it differently).
> 
> Per memory block isn't possible, due to the creation order. Also, I think it's not the right approach.
> 
> I thought about driver toggles. At least for dax/kmem people are looking into that:
> 
> https://lkml.kernel.org/r/20230801-vv-kmem_memmap-v3-2-406e9aaf5689@intel.com
> 
> Where that can also be toggled per device.
> 

That still is dependent on the global memmap_on_memory enabled right? We could make the dax facility independent of the
global feature toggle? 

-aneesh
Verma, Vishal L Aug. 2, 2023, 4:02 p.m. UTC | #11
On Wed, 2023-08-02 at 21:27 +0530, Aneesh Kumar K V wrote:
> On 8/2/23 9:24 PM, David Hildenbrand wrote:
> > On 02.08.23 17:50, Michal Hocko wrote:
> > > On Wed 02-08-23 10:15:04, Aneesh Kumar K V wrote:
> > > > On 8/1/23 4:20 PM, Michal Hocko wrote:
> > > > > On Tue 01-08-23 14:58:29, Aneesh Kumar K V wrote:
> > > > > > On 8/1/23 2:28 PM, Michal Hocko wrote:
> > > > > > > On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
> > > > > > > > Allow updating memmap_on_memory mode after the kernel boot. Memory
> > > > > > > > hotplug done after the mode update will use the new mmemap_on_memory
> > > > > > > > value.
> > > > > > > 
> > > > > > > Well, this is a user space kABI extension and as such you should spend
> > > > > > > more words about the usecase. Why we could live with this static and now
> > > > > > > need dynamic?
> > > > > > > 
> > > > > > 
> > > > > > This enables easy testing of memmap_on_memory feature without a kernel reboot.
> > > > > 
> > > > > Testing alone is rather weak argument to be honest.
> > > > > 
> > > > > > I also expect people wanting to use that when they find dax kmem memory online
> > > > > > failing because of struct page allocation failures[1]. User could reboot back with
> > > > > > memmap_on_memory=y kernel parameter. But being able to enable it via sysfs makes
> > > > > > the feature much more useful.
> > > > > 
> > > > > Sure it can be useful but that holds for any feature, right. The main
> > > > > question is whether this is worth maintaing. The current implementation
> > > > > seems rather trivial which is an argument to have it but are there any
> > > > > risks long term? Have you evaluated a potential long term maintenance
> > > > > cost? There is no easy way to go back and disable it later on without
> > > > > breaking some userspace.
> > > > > 
> > > > > All that should be in the changelog!
> > > > 
> > > > I updated it as below.
> > > > 
> > > > mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter
> > > > 
> > > > Allow updating memmap_on_memory mode after the kernel boot. Memory
> > > > hotplug done after the mode update will use the new mmemap_on_memory
> > > > value.
> > > > 
> > > > It is now possible to test the memmap_on_memory feature easily without
> > > > the need for a kernel reboot. Additionally, for those encountering
> > > > struct page allocation failures while using dax kmem memory online, this
> > > > feature may prove useful. Instead of rebooting with the
> > > > memmap_on_memory=y kernel parameter, users can now enable it via sysfs,
> > > > which greatly enhances its usefulness.
> > > 
> > > 
> > > I do not really see a solid argument why rebooting is really a problem
> > > TBH. Also is the global policy knob really the right fit for existing
> > > hotplug usecases? In other words, if we really want to make
> > > memmap_on_memory more flexible would it make more sense to have it per
> > > memory block property instead (the global knob being the default if
> > > admin doesn't specify it differently).
> > 
> > Per memory block isn't possible, due to the creation order. Also, I think it's not the right approach.
> > 
> > I thought about driver toggles. At least for dax/kmem people are looking into that:
> > 
> > https://lkml.kernel.org/r/20230801-vv-kmem_memmap-v3-2-406e9aaf5689@intel.com
> > 
> > Where that can also be toggled per device.
> > 
> 
> That still is dependent on the global memmap_on_memory enabled right? We could make the dax facility independent of the
> global feature toggle? 

Correct - the latest version of the patches David linked does depend on the
global memmap_on_memory param. Since kmem's behavior for dax devices is
set up to be turned on / off dynamically via sysfs, it would be nice to
have a similar facility for this flag.

I did try making dax independent of memmap_on_memory in my first
posting:

https://lore.kernel.org/nvdimm/20230613-vv-kmem_memmap-v1-1-f6de9c6af2c6@intel.com/

We can certainly revisit that if it looks more suitable.
Michal Hocko Aug. 2, 2023, 4:59 p.m. UTC | #12
On Wed 02-08-23 17:54:04, David Hildenbrand wrote:
> On 02.08.23 17:50, Michal Hocko wrote:
> > On Wed 02-08-23 10:15:04, Aneesh Kumar K V wrote:
> > > On 8/1/23 4:20 PM, Michal Hocko wrote:
> > > > On Tue 01-08-23 14:58:29, Aneesh Kumar K V wrote:
> > > > > On 8/1/23 2:28 PM, Michal Hocko wrote:
> > > > > > On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
> > > > > > > Allow updating memmap_on_memory mode after the kernel boot. Memory
> > > > > > > hotplug done after the mode update will use the new mmemap_on_memory
> > > > > > > value.
> > > > > > 
> > > > > > Well, this is a user space kABI extension and as such you should spend
> > > > > > more words about the usecase. Why we could live with this static and now
> > > > > > need dynamic?
> > > > > > 
> > > > > 
> > > > > This enables easy testing of memmap_on_memory feature without a kernel reboot.
> > > > 
> > > > Testing alone is rather weak argument to be honest.
> > > > 
> > > > > I also expect people wanting to use that when they find dax kmem memory online
> > > > > failing because of struct page allocation failures[1]. User could reboot back with
> > > > > memmap_on_memory=y kernel parameter. But being able to enable it via sysfs makes
> > > > > the feature much more useful.
> > > > 
> > > > Sure it can be useful but that holds for any feature, right. The main
> > > > question is whether this is worth maintaing. The current implementation
> > > > seems rather trivial which is an argument to have it but are there any
> > > > risks long term? Have you evaluated a potential long term maintenance
> > > > cost? There is no easy way to go back and disable it later on without
> > > > breaking some userspace.
> > > > 
> > > > All that should be in the changelog!
> > > 
> > > I updated it as below.
> > > 
> > > mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter
> > > 
> > > Allow updating memmap_on_memory mode after the kernel boot. Memory
> > > hotplug done after the mode update will use the new mmemap_on_memory
> > > value.
> > > 
> > > It is now possible to test the memmap_on_memory feature easily without
> > > the need for a kernel reboot. Additionally, for those encountering
> > > struct page allocation failures while using dax kmem memory online, this
> > > feature may prove useful. Instead of rebooting with the
> > > memmap_on_memory=y kernel parameter, users can now enable it via sysfs,
> > > which greatly enhances its usefulness.
> > 
> > 
> > I do not really see a solid argument why rebooting is really a problem
> > TBH. Also is the global policy knob really the right fit for existing
> > hotplug usecases? In other words, if we really want to make
> > memmap_on_memory more flexible would it make more sense to have it per
> > memory block property instead (the global knob being the default if
> > admin doesn't specify it differently).
> 
> Per memory block isn't possible, due to the creation order.

I am not sure I follow. Could you elaborate more?

> Also, I think it's not the right approach.
> 
> I thought about driver toggles. At least for dax/kmem people are looking
> into that:
> 
> https://lkml.kernel.org/r/20230801-vv-kmem_memmap-v3-2-406e9aaf5689@intel.com
> 
> Where that can also be toggled per device.

A per-device flag makes sense to me as well. But what should we do about
hotplugged memory without such a backing device (like regular RAM)? I can see
some reason to have large hotplugged memory ranges be self-vmemmap
hosted while smaller blocks are backed by external vmemmaps.
Michal Hocko Aug. 3, 2023, 8:52 a.m. UTC | #13
On Wed 02-08-23 18:59:04, Michal Hocko wrote:
> On Wed 02-08-23 17:54:04, David Hildenbrand wrote:
> > On 02.08.23 17:50, Michal Hocko wrote:
> > > On Wed 02-08-23 10:15:04, Aneesh Kumar K V wrote:
> > > > On 8/1/23 4:20 PM, Michal Hocko wrote:
> > > > > On Tue 01-08-23 14:58:29, Aneesh Kumar K V wrote:
> > > > > > On 8/1/23 2:28 PM, Michal Hocko wrote:
> > > > > > > On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
> > > > > > > > Allow updating memmap_on_memory mode after the kernel boot. Memory
> > > > > > > > hotplug done after the mode update will use the new mmemap_on_memory
> > > > > > > > value.
> > > > > > > 
> > > > > > > Well, this is a user space kABI extension and as such you should spend
> > > > > > > more words about the usecase. Why we could live with this static and now
> > > > > > > need dynamic?
> > > > > > > 
> > > > > > 
> > > > > > This enables easy testing of memmap_on_memory feature without a kernel reboot.
> > > > > 
> > > > > Testing alone is rather weak argument to be honest.
> > > > > 
> > > > > > I also expect people wanting to use that when they find dax kmem memory online
> > > > > > failing because of struct page allocation failures[1]. User could reboot back with
> > > > > > memmap_on_memory=y kernel parameter. But being able to enable it via sysfs makes
> > > > > > the feature much more useful.
> > > > > 
> > > > > Sure it can be useful but that holds for any feature, right. The main
> > > > > question is whether this is worth maintaing. The current implementation
> > > > > seems rather trivial which is an argument to have it but are there any
> > > > > risks long term? Have you evaluated a potential long term maintenance
> > > > > cost? There is no easy way to go back and disable it later on without
> > > > > breaking some userspace.
> > > > > 
> > > > > All that should be in the changelog!
> > > > 
> > > > I updated it as below.
> > > > 
> > > > mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter
> > > > 
> > > > Allow updating memmap_on_memory mode after the kernel boot. Memory
> > > > hotplug done after the mode update will use the new mmemap_on_memory
> > > > value.
> > > > 
> > > > It is now possible to test the memmap_on_memory feature easily without
> > > > the need for a kernel reboot. Additionally, for those encountering
> > > > struct page allocation failures while using dax kmem memory online, this
> > > > feature may prove useful. Instead of rebooting with the
> > > > memmap_on_memory=y kernel parameter, users can now enable it via sysfs,
> > > > which greatly enhances its usefulness.
> > > 
> > > 
> > > I do not really see a solid argument why rebooting is really a problem
> > > TBH. Also is the global policy knob really the right fit for existing
> > > hotplug usecases? In other words, if we really want to make
> > > memmap_on_memory more flexible would it make more sense to have it per
> > > memory block property instead (the global knob being the default if
> > > admin doesn't specify it differently).
> > 
> > Per memory block isn't possible, due to the creation order.
> 
> I am not sure I follow. Could you elaborate more?

Must have been a tired brain. Now I see what you mean of course. vmemmap
is allocated at the time the range is registered and therefore memory
block sysfs created. So you are right that it is too late in some sense
but I still believe that this would be doable. The memmap_on_memory file
would be readable only when the block is offline and it would reallocate
vmemmap on the change. Makes sense? Are there any risks? Maybe pfn
walkers?
David Hildenbrand Aug. 3, 2023, 9:24 a.m. UTC | #14
On 03.08.23 10:52, Michal Hocko wrote:
> On Wed 02-08-23 18:59:04, Michal Hocko wrote:
>> On Wed 02-08-23 17:54:04, David Hildenbrand wrote:
>>> On 02.08.23 17:50, Michal Hocko wrote:
>>>> On Wed 02-08-23 10:15:04, Aneesh Kumar K V wrote:
>>>>> On 8/1/23 4:20 PM, Michal Hocko wrote:
>>>>>> On Tue 01-08-23 14:58:29, Aneesh Kumar K V wrote:
>>>>>>> On 8/1/23 2:28 PM, Michal Hocko wrote:
>>>>>>>> On Tue 01-08-23 10:11:16, Aneesh Kumar K.V wrote:
>>>>>>>>> Allow updating memmap_on_memory mode after the kernel boot. Memory
>>>>>>>>> hotplug done after the mode update will use the new mmemap_on_memory
>>>>>>>>> value.
>>>>>>>>
>>>>>>>> Well, this is a user space kABI extension and as such you should spend
>>>>>>>> more words about the usecase. Why we could live with this static and now
>>>>>>>> need dynamic?
>>>>>>>>
>>>>>>>
>>>>>>> This enables easy testing of memmap_on_memory feature without a kernel reboot.
>>>>>>
>>>>>> Testing alone is rather weak argument to be honest.
>>>>>>
>>>>>>> I also expect people wanting to use that when they find dax kmem memory online
>>>>>>> failing because of struct page allocation failures[1]. User could reboot back with
>>>>>>> memmap_on_memory=y kernel parameter. But being able to enable it via sysfs makes
>>>>>>> the feature much more useful.
>>>>>>
>>>>>> Sure it can be useful but that holds for any feature, right. The main
>>>>>> question is whether this is worth maintaing. The current implementation
>>>>>> seems rather trivial which is an argument to have it but are there any
>>>>>> risks long term? Have you evaluated a potential long term maintenance
>>>>>> cost? There is no easy way to go back and disable it later on without
>>>>>> breaking some userspace.
>>>>>>
>>>>>> All that should be in the changelog!
>>>>>
>>>>> I updated it as below.
>>>>>
>>>>> mm/memory_hotplug: Enable runtime update of memmap_on_memory parameter
>>>>>
>>>>> Allow updating memmap_on_memory mode after the kernel boot. Memory
>>>>> hotplug done after the mode update will use the new mmemap_on_memory
>>>>> value.
>>>>>
>>>>> It is now possible to test the memmap_on_memory feature easily without
>>>>> the need for a kernel reboot. Additionally, for those encountering
>>>>> struct page allocation failures while using dax kmem memory online, this
>>>>> feature may prove useful. Instead of rebooting with the
>>>>> memmap_on_memory=y kernel parameter, users can now enable it via sysfs,
>>>>> which greatly enhances its usefulness.
>>>>
>>>>
>>>> I do not really see a solid argument why rebooting is really a problem
>>>> TBH. Also is the global policy knob really the right fit for existing
>>>> hotplug usecases? In other words, if we really want to make
>>>> memmap_on_memory more flexible would it make more sense to have it per
>>>> memory block property instead (the global knob being the default if
>>>> admin doesn't specify it differently).
>>>
>>> Per memory block isn't possible, due to the creation order.
>>
>> I am not sure I follow. Could you elaborate more?
> 
> Must have been a tired brain. Now I see what you mean of course. vmemmap
> is allocated at the time the range is registered and therefore memory
> block sysfs created. So you are right that it is too late in some sense
> but I still believe that this would be doable. The memmap_on_memory file

Exactly.

> would be readable only when the block is offline and it would reallocate
> vmemmap on the change. Makes sense? Are there any risks? Maybe pfn
> walkers?

The question is: is it of any real value such that it would be worth the 
cost and risk?


One of the primary reasons for memmap_on_memory is that *memory hotplug* 
succeeds even in low-memory situations (including, low on ZONE_NORMAL 
situations). So you want that behavior already when hotplugging such 
devices. While there might be value to relocate it later, I'm not sure 
if that is really worth it, and it does not solve the main use case.

For dax/kmem memory hotplug this is controlled by user space, so user 
space can easily configure it. For DIMMs it's just a hotplug event that 
triggers all that in the kernel, and one has to plan ahead (e.g., set 
the global kernel parameter).

Besides, devices like virtio-mem really cannot universally support 
memmap_on_memory.
Michal Hocko Aug. 3, 2023, 11:30 a.m. UTC | #15
On Thu 03-08-23 11:24:08, David Hildenbrand wrote:
[...]
> > would be readable only when the block is offline and it would reallocate
> > vmemmap on the change. Makes sense? Are there any risks? Maybe pfn
> > walkers?
> 
> The question is: is it of any real value such that it would be worth the
> cost and risk?
> 
> 
> One of the primary reasons for memmap_on_memory is that *memory hotplug*
> succeeds even in low-memory situations (including, low on ZONE_NORMAL
> situations).

One usecase I would have in mind is a mix of smaller and larger memory
blocks. For larger ones you want to have memmap_on_memory in general
because they do not eat memory from outside but small(er) ones might be
more tricky because now you can add a lot of blocks that would be
internally fragmented to prevent larger allocations to form.

> So you want that behavior already when hotplugging such
> devices. While there might be value to relocate it later, I'm not sure if
> that is really worth it, and it does not solve the main use case.

Is it worth it? TBH I am not sure same as I am not sure the global
default should be writable after boot. If we want to make it more
dynamic we should however talk about the proper layer this is
implemented on.
Michal Hocko Aug. 7, 2023, 12:27 p.m. UTC | #16
On Sat 05-08-23 19:54:23, Aneesh Kumar K V wrote:
[...]
> Do you see a need for firmware-managed memory to be hotplugged in with
> different memory block sizes?

In short. Yes. Slightly longer, a fixed size memory block semantic is
just standing in the way and I would even argue it is actively harmful.
Just have a look at ridicously small memory blocks on ppc. I do
understand that it makes some sense to be aligned to the memory model
(so sparsmem section aligned). In an ideal world, memory hotplug v2
interface (if we ever go that path) should be physical memory range based.
David Hildenbrand Aug. 7, 2023, 12:41 p.m. UTC | #17
On 07.08.23 14:27, Michal Hocko wrote:
> On Sat 05-08-23 19:54:23, Aneesh Kumar K V wrote:
> [...]
>> Do you see a need for firmware-managed memory to be hotplugged in with
>> different memory block sizes?
> 
> In short. Yes. Slightly longer, a fixed size memory block semantic is
> just standing in the way and I would even argue it is actively harmful.
> Just have a look at ridicously small memory blocks on ppc. I do
> understand that it makes some sense to be aligned to the memory model
> (so sparsmem section aligned). In an ideal world, memory hotplug v2
> interface (if we ever go that path) should be physical memory range based.

Yes, we discussed that a couple of times already (and so far nobody 
cared to implement any of that).

Small memory block sizes are very beneficial for use cases like PPC 
dlar, virtio-mem, hyperv-balloon, ... essentially in most virtual 
environments where you might want to add/remove memory in very small 
granularity. I don't see that changing any time soon. Rather the opposite.

Small memory block sizes are suboptimal for large machines where you 
might never end up removing such memory (boot memory), or when dealing 
with devices that can only be removed in one piece (DIMM/kmem). We 
already have memory groups in place to model that.

For the latter it might be beneficial to have memory blocks of larger 
size that correspond to the physical memory ranges. That might also make 
a memmap (re-)configuration easier.

Not sure if that is standing in any way or is harmful, though.
David Hildenbrand Aug. 7, 2023, 12:54 p.m. UTC | #18
On 03.08.23 13:30, Michal Hocko wrote:
> On Thu 03-08-23 11:24:08, David Hildenbrand wrote:
> [...]
>>> would be readable only when the block is offline and it would reallocate
>>> vmemmap on the change. Makes sense? Are there any risks? Maybe pfn
>>> walkers?
>>
>> The question is: is it of any real value such that it would be worth the
>> cost and risk?
>>
>>
>> One of the primary reasons for memmap_on_memory is that *memory hotplug*
>> succeeds even in low-memory situations (including, low on ZONE_NORMAL
>> situations).

Sorry for the late reply, I'm busy with 100 other things.

> 
> One usecase I would have in mind is a mix of smaller and larger memory
> blocks. For larger ones you want to have memmap_on_memory in general
> because they do not eat memory from outside but small(er) ones might be
> more tricky because now you can add a lot of blocks that would be
> internally fragmented to prevent larger allocations to form.

Okay, I see what you mean.

The internal fragmentation might become an issue at some point: for 
x86-64 with 128 MiB blocks / 2 MiB THP it's not a real issue right now. 
For an arm64 64k kernel with 512 MiB blocks and 512 MiB THP / hugetlb it 
could be one.
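
Spelling out the numbers (assuming a 64-byte struct page): a 128 MiB x86-64
block has 32768 4K pages, so its memmap is exactly 2 MiB and consumes
precisely one THP-sized chunk, leaving the rest of the block usable for 2 MiB
allocations. A 512 MiB arm64/64k block has 8192 pages, so its memmap is only
512 KiB; carving those 512 KiB out of the block means that block can never
back a 512 MiB THP/hugetlb page again.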

I recall discussing that with Oscar back when he added memmap_on_memory, 
where we also discussed the variable-sized memory blocks to avoid such 
internal fragmentation.

For small ones you probably want to only use memmap_on_memory when 
unavoidable: for example, when adding without memmap_on_memory would 
fail / already failed. Possibly some later memmap relocation might make 
sense in some scenarios.

> 
>> So you want that behavior already when hotplugging such
>> devices. While there might be value to relocate it later, I'm not sure if
>> that is really worth it, and it does not solve the main use case.
> 
> Is it worth it? TBH I am not sure same as I am not sure the global
> default should be writable after boot. If we want to make it more
> dynamic we should however talk about the proper layer this is
> implemented on.

Agreed.
David Hildenbrand Aug. 7, 2023, 6:35 p.m. UTC | #19
On 07.08.23 14:41, David Hildenbrand wrote:
> On 07.08.23 14:27, Michal Hocko wrote:
>> On Sat 05-08-23 19:54:23, Aneesh Kumar K V wrote:
>> [...]
>>> Do you see a need for firmware-managed memory to be hotplugged in with
>>> different memory block sizes?
>>
>> In short. Yes. Slightly longer, a fixed size memory block semantic is
>> just standing in the way and I would even argue it is actively harmful.
>> Just have a look at ridicously small memory blocks on ppc. I do
>> understand that it makes some sense to be aligned to the memory model
>> (so sparsmem section aligned). In an ideal world, memory hotplug v2
>> interface (if we ever go that path) should be physical memory range based.
> 
> Yes, we discussed that a couple of times already (and so far nobody
> cared to implement any of that).
> 
> Small memory block sizes are very beneficial for use cases like PPC
> dlar, virtio-mem, hyperv-balloon, ... essentially in most virtual
> environments where you might want to add/remove memory in very small
> granularity. I don't see that changing any time soon. Rather the opposite.
> 
> Small memory block sizes are suboptimal for large machines where you
> might never end up removing such memory (boot memory), or when dealing
> with devices that can only be removed in one piece (DIMM/kmem). We
> already have memory groups in place to model that.
> 
> For the latter it might be beneficial to have memory blocks of larger
> size that correspond to the physical memory ranges. That might also make
> a memmap (re-)configuration easier.
> 
> Not sure if that is standing in any way or is harmful, though.
> 

Just because I thought of something right now, I'll share it, maybe it 
makes sense.

Assume when we get add_memory*(MHP_MEMMAP_ON_MEMORY) and it is enabled 
by the admin:

1) We create a single altmap at the beginning of the memory

2) We create the existing fixed-size memory block devices, but flag them
    to be part of a single "altmap" unit.

3) Whenever we trigger offlining of a single such memory block, we
    offline *all* memory blocks belonging to that altmap, essentially
    using a single offline_pages() call and updating all memory block
    states accordingly.

4) Whenever we trigger onlining of a single such memory block, we
    online *all* memory blocks belonging to that altmap, using a single
    online_pages() call.

5) We fail remove_memory() if it doesn't cover the same (altmap) range.

So we can avoid having a memory block v2 (and all that comes with that 
...) for now and still get that altmap stuff sorted out. As that altmap 
behavior can be controlled by the admin, we should be fine for now.

I think all memory notifiers should already be able to handle bigger 
granularity, but it would be easy to check. Some internal things might 
require a bit of tweaking.

Just a thought.
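
To illustrate 1)/2) with a data structure (purely hypothetical, not from any
posted patch): all fixed-size blocks created by one
add_memory*(MHP_MEMMAP_ON_MEMORY) call could point at a shared descriptor,
and online/offline/remove requests would then be redirected to the whole
unit:

	/* hypothetical descriptor shared by all blocks of one hot-added range */
	struct altmap_unit {
		unsigned long start_pfn;	/* first pfn of the hot-added range */
		unsigned long nr_pages;		/* covers every block of the unit */
		struct vmem_altmap altmap;	/* the single altmap at the start */
	};

	/* hypothetical addition to struct memory_block:
	 *	struct altmap_unit *unit;	 NULL unless MHP_MEMMAP_ON_MEMORY
	 */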
Aneesh Kumar K.V Aug. 8, 2023, 6:09 a.m. UTC | #20
On 8/8/23 12:05 AM, David Hildenbrand wrote:
> On 07.08.23 14:41, David Hildenbrand wrote:
>> On 07.08.23 14:27, Michal Hocko wrote:
>>> On Sat 05-08-23 19:54:23, Aneesh Kumar K V wrote:
>>> [...]
>>>> Do you see a need for firmware-managed memory to be hotplugged in with
>>>> different memory block sizes?
>>>
>>> In short. Yes. Slightly longer, a fixed size memory block semantic is
>>> just standing in the way and I would even argue it is actively harmful.
>>> Just have a look at ridicously small memory blocks on ppc. I do
>>> understand that it makes some sense to be aligned to the memory model
>>> (so sparsmem section aligned). In an ideal world, memory hotplug v2
>>> interface (if we ever go that path) should be physical memory range based.
>>
>> Yes, we discussed that a couple of times already (and so far nobody
>> cared to implement any of that).
>>
>> Small memory block sizes are very beneficial for use cases like PPC
>> dlar, virtio-mem, hyperv-balloon, ... essentially in most virtual
>> environments where you might want to add/remove memory in very small
>> granularity. I don't see that changing any time soon. Rather the opposite.
>>
>> Small memory block sizes are suboptimal for large machines where you
>> might never end up removing such memory (boot memory), or when dealing
>> with devices that can only be removed in one piece (DIMM/kmem). We
>> already have memory groups in place to model that.
>>
>> For the latter it might be beneficial to have memory blocks of larger
>> size that correspond to the physical memory ranges. That might also make
>> a memmap (re-)configuration easier.
>>
>> Not sure if that is standing in any way or is harmful, though.
>>
> 
> Just because I thought of something right now, I'll share it, maybe it makes sense.
> 
> Assume when we get add_memory*(MHP_MEMMAP_ON_MEMORY) and it is enabled by the admin:
> 
> 1) We create a single altmap at the beginning of the memory
> 
> 2) We create the existing fixed-size memory block devices, but flag them
>    to be part of a single "altmap" unit.
> 
> 3) Whenever we trigger offlining of a single such memory block, we
>    offline *all* memory blocks belonging to that altmap, essentially
>    using a single offline_pages() call and updating all memory block
>    states accordingly.
> 
> 4) Whenever we trigger onlining of a single such memory block, we
>    online *all* memory blocks belonging to that altmap, using a single
>    online_pages() call.
> 
> 5) We fail remove_memory() if it doesn't cover the same (altmap) range.
> 
> So we can avoid having a memory block v2 (and all that comes with that ...) for now and still get that altmap stuff sorted out. As that altmap behavior can be controlled by the admin, we should be fine for now.
> 
> I think all memory notifiers should already be able to handle bigger granularity, but it would be easy to check. Some internal things might require a bit of tweaking.
> 
> Just a thought.
> 


W.r.t enabling memmap_on_memory for ppc64, I guess I will drop patch 7 and resend the series so that Andrew can pick up the rest of the patches? 

-aneesh
Aneesh Kumar K.V Aug. 8, 2023, 6:29 a.m. UTC | #21
On 8/8/23 12:05 AM, David Hildenbrand wrote:
> On 07.08.23 14:41, David Hildenbrand wrote:
>> On 07.08.23 14:27, Michal Hocko wrote:
>>> On Sat 05-08-23 19:54:23, Aneesh Kumar K V wrote:
>>> [...]
>>>> Do you see a need for firmware-managed memory to be hotplugged in with
>>>> different memory block sizes?
>>>
>>> In short. Yes. Slightly longer, a fixed size memory block semantic is
>>> just standing in the way and I would even argue it is actively harmful.
>>> Just have a look at ridicously small memory blocks on ppc. I do
>>> understand that it makes some sense to be aligned to the memory model
>>> (so sparsmem section aligned). In an ideal world, memory hotplug v2
>>> interface (if we ever go that path) should be physical memory range based.
>>
>> Yes, we discussed that a couple of times already (and so far nobody
>> cared to implement any of that).
>>
>> Small memory block sizes are very beneficial for use cases like PPC
>> dlar, virtio-mem, hyperv-balloon, ... essentially in most virtual
>> environments where you might want to add/remove memory in very small
>> granularity. I don't see that changing any time soon. Rather the opposite.
>>
>> Small memory block sizes are suboptimal for large machines where you
>> might never end up removing such memory (boot memory), or when dealing
>> with devices that can only be removed in one piece (DIMM/kmem). We
>> already have memory groups in place to model that.
>>
>> For the latter it might be beneficial to have memory blocks of larger
>> size that correspond to the physical memory ranges. That might also make
>> a memmap (re-)configuration easier.
>>
>> Not sure if that is standing in any way or is harmful, though.
>>
> 
> Just because I thought of something right now, I'll share it, maybe it makes sense.
> 
> Assume when we get add_memory*(MHP_MEMMAP_ON_MEMORY) and it is enabled by the admin:
> 
> 1) We create a single altmap at the beginning of the memory
> 
> 2) We create the existing fixed-size memory block devices, but flag them
>    to be part of a single "altmap" unit.
> 
> 3) Whenever we trigger offlining of a single such memory block, we
>    offline *all* memory blocks belonging to that altmap, essentially
>    using a single offline_pages() call and updating all memory block
>    states accordingly.
> 
> 4) Whenever we trigger onlining of a single such memory block, we
>    online *all* memory blocks belonging to that altmap, using a single
>    online_pages() call.
> 
> 5) We fail remove_memory() if it doesn't cover the same (altmap) range.
> 
> So we can avoid having a memory block v2 (and all that comes with that ...) for now and still get that altmap stuff sorted out. As that altmap behavior can be controlled by the admin, we should be fine for now.
> 
> I think all memory notifiers should already be able to handle bigger granularity, but it would be easy to check. Some internal things might require a bit of tweaking.
> 

We can look at the possibility of using the altmap space reserved for a namespace (via option -M dev) for allocating struct page memory even with dax/kmem. 

-aneesh
David Hildenbrand Aug. 8, 2023, 7:46 a.m. UTC | #22
On 08.08.23 08:29, Aneesh Kumar K V wrote:
> On 8/8/23 12:05 AM, David Hildenbrand wrote:
>> On 07.08.23 14:41, David Hildenbrand wrote:
>>> On 07.08.23 14:27, Michal Hocko wrote:
>>>> On Sat 05-08-23 19:54:23, Aneesh Kumar K V wrote:
>>>> [...]
>>>>> Do you see a need for firmware-managed memory to be hotplugged in with
>>>>> different memory block sizes?
>>>>
>>>> In short. Yes. Slightly longer, a fixed size memory block semantic is
>>>> just standing in the way and I would even argue it is actively harmful.
>>>> Just have a look at ridicously small memory blocks on ppc. I do
>>>> understand that it makes some sense to be aligned to the memory model
>>>> (so sparsmem section aligned). In an ideal world, memory hotplug v2
>>>> interface (if we ever go that path) should be physical memory range based.
>>>
>>> Yes, we discussed that a couple of times already (and so far nobody
>>> cared to implement any of that).
>>>
>>> Small memory block sizes are very beneficial for use cases like PPC
>>> dlar, virtio-mem, hyperv-balloon, ... essentially in most virtual
>>> environments where you might want to add/remove memory in very small
>>> granularity. I don't see that changing any time soon. Rather the opposite.
>>>
>>> Small memory block sizes are suboptimal for large machines where you
>>> might never end up removing such memory (boot memory), or when dealing
>>> with devices that can only be removed in one piece (DIMM/kmem). We
>>> already have memory groups in place to model that.
>>>
>>> For the latter it might be beneficial to have memory blocks of larger
>>> size that correspond to the physical memory ranges. That might also make
>>> a memmap (re-)configuration easier.
>>>
>>> Not sure if that is standing in any way or is harmful, though.
>>>
>>
>> Just because I thought of something right now, I'll share it, maybe it makes sense.
>>
>> Assume when we get add_memory*(MHP_MEMMAP_ON_MEMORY) and it is enabled by the admin:
>>
>> 1) We create a single altmap at the beginning of the memory
>>
>> 2) We create the existing fixed-size memory block devices, but flag them
>>     to be part of a single "altmap" unit.
>>
>> 3) Whenever we trigger offlining of a single such memory block, we
>>     offline *all* memory blocks belonging to that altmap, essentially
>>     using a single offline_pages() call and updating all memory block
>>     states accordingly.
>>
>> 4) Whenever we trigger onlining of a single such memory block, we
>>     online *all* memory blocks belonging to that altmap, using a single
>>     online_pages() call.
>>
>> 5) We fail remove_memory() if it doesn't cover the same (altmap) range.
>>
>> So we can avoid having a memory block v2 (and all that comes with that ...) for now and still get that altmap stuff sorted out. As that altmap behavior can be controlled by the admin, we should be fine for now.
>>
>> I think all memory notifiers should already be able to handle bigger granularity, but it would be easy to check. Some internal things might require a bit of tweaking.
>>
> 
> We can look at the possibility of using the altmap space reserved for a namespace (via option -M dev) for allocating struct page memory even with dax/kmem.

Right, an alternative would also be for the caller to pass in the 
altmap. Individual memory blocks can then get onlined/offlined as is.

One issue might be how to get that altmap considered online memory 
(e.g., pfn_to_online_page(), kdump, ...). Right now, the altmap always 
falls into an online section once the memory block is online, so 
pfn_to_online_page() and especially online_section() hold as soon as 
the altmap has reasonable content -- that is, once the memory block is 
online and has initialized the memmap. Maybe that can be worked around, 
just pointing it out.
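
For context, the pfn-walker pattern referred to above looks roughly like the
following (illustrative only; start_pfn/end_pfn stand for whatever range the
walker covers). Walkers rely on pfn_to_online_page() returning NULL for pfns
whose section is not online, which is what currently keeps them away from
memmap that may not be initialized:

	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_online_page(pfn);

		if (!page)
			continue;	/* offline section: memmap may not be initialized */
		/* ... inspect the page ... */
	}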