[RFC,8/8] HACK: mm: memory_hotplug: Drop memblock_phys_free() call in try_remove_memory()

Message ID 20240529171236.32002-9-Jonathan.Cameron@huawei.com
State New, archived
Series: arm64/memblock: Handling of CXL Fixed Memory Windows.

Commit Message

Jonathan Cameron May 29, 2024, 5:12 p.m. UTC
I'm not sure what this is balancing, but if it is necessary then the reserved
memblock approach can't be used to stash NUMA node assignments: after the
first add / remove cycle the entry is dropped, so it is not available if memory is
re-added at the same HPA.

This patch is here to hopefully spur comments on what this is there for!

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 mm/memory_hotplug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Oscar Salvador May 30, 2024, 10:07 a.m. UTC | #1
On Wed, May 29, 2024 at 06:12:36PM +0100, Jonathan Cameron wrote:
> I'm not sure what this is balancing, but it if is necessary then the reserved
> memblock approach can't be used to stash NUMA node assignments as after the
> first add / remove cycle the entry is dropped so not available if memory is
> re-added at the same HPA.

It is balancing memory that was previously allocated via
memblock_phys_alloc{_range,_try_nid}().
memblock_phys_alloc_try_nid() is the one doing the heavy lifting, and it also calls
kmemleak_alloc_phys() to make kmemleak aware of that memory.

A quick idea that came to me:
I think it should be possible to 1) create a new memblock flag
(check 'enum memblock_flags') and 2) mark the range you want with that
flag (check memblock_setclr_flag()) via a function like
memblock_reserved_mark_{yourflag}().

Then, in memblock_phys_free() (or down the path) we could check for that flag,
and refuse to proceed if it is set.
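As a rough illustration only, the sketch below shows one possible shape of that idea. The MEMBLOCK_RSRV_KEEP value and both helpers are hypothetical names invented here; they mirror how MEMBLOCK_RSRV_NOINIT / memblock_reserved_mark_noinit() are wired up in mm/memblock.c, and assume the memblock_setclr_flag() signature that takes a memblock_type:

/*
 * Hypothetical sketch, not existing kernel code. A new entry in
 * 'enum memblock_flags' in include/linux/memblock.h, e.g.
 *
 *	MEMBLOCK_RSRV_KEEP = 0x20,	keep the reserved entry across hot remove
 *
 * plus a marker helper in mm/memblock.c, mirroring how
 * MEMBLOCK_RSRV_NOINIT / memblock_reserved_mark_noinit() are wired up.
 */
int __init_memblock memblock_reserved_mark_keep(phys_addr_t base, phys_addr_t size)
{
	return memblock_setclr_flag(&memblock.reserved, base, size, 1,
				    MEMBLOCK_RSRV_KEEP);
}

/* Down the memblock_phys_free() path: refuse to drop a flagged range. */
static bool memblock_reserved_range_kept(phys_addr_t base, phys_addr_t size)
{
	struct memblock_region *r;

	for_each_reserved_mem_region(r) {
		if (r->base < base + size && base < r->base + r->size &&
		    (r->flags & MEMBLOCK_RSRV_KEEP))
			return true;
	}
	return false;
}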

Would that work?
I am not sure, but you might need to
Jonathan Cameron May 30, 2024, 12:14 p.m. UTC | #2
On Thu, 30 May 2024 12:07:37 +0200
Oscar Salvador <osalvador@suse.com> wrote:

> On Wed, May 29, 2024 at 06:12:36PM +0100, Jonathan Cameron wrote:
> > I'm not sure what this is balancing, but it if is necessary then the reserved
> > memblock approach can't be used to stash NUMA node assignments as after the
> > first add / remove cycle the entry is dropped so not available if memory is
> > re-added at the same HPA.  
> 
> It is balancing previously allocated memory which was allocated via
> memblock_phys_alloc{_range,try_nid}.
> memblock_phys_alloc_try_nid() is who does the heavy-lifting, and also calls
> kmemleak_alloc_phys() to make kmemleak aware of that memory.

Thanks Oscar,

I'm struggling to find the call path that does that for the particular range being
released.

There are some calls that allocate node data, but those are for small amounts of data,
whereas I think this call is removing the full range of the hot-removed memory.
Is the intent just to get rid of the node-data related reserved memblock if
it happens to be in the memory removed?  If so, it feels like the call should
really be targeting just that.

> 
> A quick idea that came to me is:
> I think that it should be possible 1) create a new memory_block flag
> (check 'enum memblock_flags') and 2) flag the range you want with this
> range (check memblock_setclr_flag()) with a function like
> memblock_reserved_mark_{yourflag}.
> 
> Then, in memblock_phys_free() (or down the path) we could check for that flag,
> and refuse to proceed if it is set.

A while back I tried a flag in the main memblock rather than the reserved memblock,
and that was horribly invasive, so I got a bit scared of touching those flags.
A flag used only in the reserved memblock might work to bypass the removal of
the reserved entries while still carrying out everything else that call is
intended to do. I'm still sketchy on what I'm bypassing, though!

Thanks,

Jonathan

> 
> Would that work?
> I am not sure, but you might need to 
> 
>
David Hildenbrand May 31, 2024, 7:49 a.m. UTC | #3
On 29.05.24 19:12, Jonathan Cameron wrote:
> I'm not sure what this is balancing, but it if is necessary then the reserved
> memblock approach can't be used to stash NUMA node assignments as after the
> first add / remove cycle the entry is dropped so not available if memory is
> re-added at the same HPA.
> 
> This patch is here to hopefully spur comments on what this is there for!
> 
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>   mm/memory_hotplug.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 431b1f6753c0..3d8dd4749dfc 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -2284,7 +2284,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
>   	}
>   
>   	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
> -		memblock_phys_free(start, size);
> +		//		memblock_phys_free(start, size);
>   		memblock_remove(start, size);
>   	}

memblock_phys_free() works on memblock.reserved, memblock_remove() works 
  on memblock.memory.

If you take a look at the doc at the top of memblock.c:

memblock.memory: physical memory available to the system
memblock.reserved: regions that were allocated [during boot]


memblock.memory is supposed to be a superset of memblock.reserved. Your 
"hack" here indicates that you somehow would be relying on the opposite 
being true, which indicates that you are doing the wrong thing.


memblock_remove() indeed balances against memblock_add_node() for 
hotplugged memory [add_memory_resource()]. There seems to be a case where we 
would succeed in hot-unplugging memory that was part of "memblock.reserved".

But how could that happen? I think the following way:

Once the buddy is up and running, memory allocated during early boot is 
not freed back to memblock, but usually we simply go via something like 
free_reserved_page(), not memblock_free() [because the buddy took over]. 
So one could end up unplugging memory that still resides in the 
memblock.reserved set.

So with memblock_phys_free(), we are enforcing the invariant that 
memblock.memory is a superset of memblock.reserved.

Likely, arm64 should store that node assignment elsewhere from where it 
can be queried. Or it should be using something like 
CONFIG_HAVE_MEMBLOCK_PHYS_MAP for these static windows.
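For illustration, the simplest shape of "store that node assignment elsewhere" would be an arch-side table queried through the existing phys_to_target_node() hook; everything in the sketch below apart from that hook (the struct, the table and its size) is hypothetical:

/* Hypothetical sketch, not existing kernel code. */
#include <linux/numa.h>
#include <linux/types.h>

struct fixed_window {
	phys_addr_t	base;
	phys_addr_t	size;
	int		nid;
};

/* Filled at boot from the firmware tables describing the fixed windows. */
static struct fixed_window fixed_windows[MAX_NUMNODES];
static int nr_fixed_windows;

/* Hook used by hotplug/CXL code to pick a target node for an HPA range. */
int phys_to_target_node(u64 start)
{
	int i;

	for (i = 0; i < nr_fixed_windows; i++) {
		struct fixed_window *w = &fixed_windows[i];

		if (start >= w->base && start < w->base + w->size)
			return w->nid;
	}
	return NUMA_NO_NODE;
}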
Jonathan Cameron May 31, 2024, 9:48 a.m. UTC | #4
On Fri, 31 May 2024 09:49:32 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 29.05.24 19:12, Jonathan Cameron wrote:
> > I'm not sure what this is balancing, but it if is necessary then the reserved
> > memblock approach can't be used to stash NUMA node assignments as after the
> > first add / remove cycle the entry is dropped so not available if memory is
> > re-added at the same HPA.
> > 
> > This patch is here to hopefully spur comments on what this is there for!
> > 
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > ---
> >   mm/memory_hotplug.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 431b1f6753c0..3d8dd4749dfc 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -2284,7 +2284,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
> >   	}
> >   
> >   	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
> > -		memblock_phys_free(start, size);
> > +		//		memblock_phys_free(start, size);
> >   		memblock_remove(start, size);
> >   	}  
> 
> memblock_phys_free() works on memblock.reserved, memblock_remove() works 
>   on memblock.memory.
> 
> If you take a look at the doc at the top of memblock.c:
> 
> memblock.memory: physical memory available to the system
> memblock.reserved: regions that were allocated [during boot]
> 
> 
> memblock.memory is supposed to be a superset of memblock.reserved. Your 
> "hack" here indicates that you somehow would be relying on the opposite 
> being true, which indicates that you are doing the wrong thing.
> 
> 
> memblock_remove() indeed balances against memblock_add_node() for 
> hotplugged memory [add_memory_resource()]. There seem to a case where we 
> would succeed in hotunplugging memory that was part of "memblock.reserved".
> 
> But how could that happen? I think the following way:
> 
> Once the buddy is up and running, memory allocated during early boot is 
> not freed back to memblock, but usually we simply go via something like 
> free_reserved_page(), not memblock_free() [because the buddy took over]. 
> So one could end up unplugging memory that still resides in 
> memblock.reserved set.
> 
> So with memblock_phys_free(), we are enforcing the invariant that 
> memblock.memory is a superset of memblock.reserved.
> 
> Likely, arm64 should store that node assignment elsewhere from where it 
> can be queried. Or it should be using something like 
> CONFIG_HAVE_MEMBLOCK_PHYS_MAP for these static windows.
> 

Hi David,

Thanks for the explanation and pointers.  I'd rather avoid inventing a parallel
infrastructure for this (like x86 has for other reasons, but which is also used
for this purpose). 
From a quick look, CONFIG_HAVE_MEMBLOCK_PHYS_MAP is documented in a fashion that
makes me think it's not directly appropriate (this isn't actual physical memory
available during boot), but the general approach of just adding another memblock
collection seems like it will work.

The hardest problem might be naming it.  physmem_possible perhaps?
Filling it with anything found in SRAT or CEDT should work for ACPI, but I'm not
sure whether we can fill it when neither of those is present.  Maybe we just
don't bother, as for today's use case the CEDT needs to be present.

Maybe physmem_known_possible is the way to go. I'll give this approach a spin
and see how it goes.

Jonathan
David Hildenbrand May 31, 2024, 9:55 a.m. UTC | #5
On 31.05.24 11:48, Jonathan Cameron wrote:
> On Fri, 31 May 2024 09:49:32 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 29.05.24 19:12, Jonathan Cameron wrote:
>>> I'm not sure what this is balancing, but it if is necessary then the reserved
>>> memblock approach can't be used to stash NUMA node assignments as after the
>>> first add / remove cycle the entry is dropped so not available if memory is
>>> re-added at the same HPA.
>>>
>>> This patch is here to hopefully spur comments on what this is there for!
>>>
>>> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> ---
>>>    mm/memory_hotplug.c | 2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>> index 431b1f6753c0..3d8dd4749dfc 100644
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -2284,7 +2284,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
>>>    	}
>>>    
>>>    	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
>>> -		memblock_phys_free(start, size);
>>> +		//		memblock_phys_free(start, size);
>>>    		memblock_remove(start, size);
>>>    	}
>>
>> memblock_phys_free() works on memblock.reserved, memblock_remove() works
>>    on memblock.memory.
>>
>> If you take a look at the doc at the top of memblock.c:
>>
>> memblock.memory: physical memory available to the system
>> memblock.reserved: regions that were allocated [during boot]
>>
>>
>> memblock.memory is supposed to be a superset of memblock.reserved. Your
>> "hack" here indicates that you somehow would be relying on the opposite
>> being true, which indicates that you are doing the wrong thing.
>>
>>
>> memblock_remove() indeed balances against memblock_add_node() for
>> hotplugged memory [add_memory_resource()]. There seem to a case where we
>> would succeed in hotunplugging memory that was part of "memblock.reserved".
>>
>> But how could that happen? I think the following way:
>>
>> Once the buddy is up and running, memory allocated during early boot is
>> not freed back to memblock, but usually we simply go via something like
>> free_reserved_page(), not memblock_free() [because the buddy took over].
>> So one could end up unplugging memory that still resides in
>> memblock.reserved set.
>>
>> So with memblock_phys_free(), we are enforcing the invariant that
>> memblock.memory is a superset of memblock.reserved.
>>
>> Likely, arm64 should store that node assignment elsewhere from where it
>> can be queried. Or it should be using something like
>> CONFIG_HAVE_MEMBLOCK_PHYS_MAP for these static windows.
>>
> 
> Hi David,
> 
> Thanks for the explanation and pointers.  I'd rather avoid inventing a parallel
> infrastructure for this (like x86 has for other reasons, but which is also used
> for this purpose).

Yes, although memblock feels a bit wrong, because it is targeted at 
managing actual present memory, not properties about memory that could 
become present later.

>  From a quick look CONFIG_HAVE_MEMBLOCK_PHYS_MAP is documented in a fashion that
> makes me think it's not directly appropriate (this isn't actual physical memory
> available during boot) but the general approach of just adding another memblock
> collection seems like it will work.

Yes. As an alternative, modify the description of 
CONFIG_HAVE_MEMBLOCK_PHYS_MAP.

> 
> Hardest problem might be naming it.  physmem_possible perhaps?
> Fill that with anything found in SRAT or CEDT should work for ACPI, but I'm not
> sure on whether we can fill it when neither of those is present.  Maybe we just
> don't bother as for today's usecase CEDT needs to be present.

You likely only want something for these special windows, with these 
special semantics. That makes it hard.

"physmem_possible" is a bit too generic for my taste, promising 
semantics that might not hold true (we don't want all hotpluggable areas 
to show up there).

> 
> Maybe physmem_known_possible is the way to go. I'll give this approach a spin
> and see how it goes.

That goes in a better direction, or really "physmem_fixed_windows", because 
"physmem_known_possible" also has a wrong smell to it.
Mike Rapoport June 3, 2024, 7:57 a.m. UTC | #6
On Fri, May 31, 2024 at 09:49:32AM +0200, David Hildenbrand wrote:
> On 29.05.24 19:12, Jonathan Cameron wrote:
> > I'm not sure what this is balancing, but it if is necessary then the reserved
> > memblock approach can't be used to stash NUMA node assignments as after the
> > first add / remove cycle the entry is dropped so not available if memory is
> > re-added at the same HPA.
> > 
> > This patch is here to hopefully spur comments on what this is there for!
> > 
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > ---
> >   mm/memory_hotplug.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 431b1f6753c0..3d8dd4749dfc 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -2284,7 +2284,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
> >   	}
> >   	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
> > -		memblock_phys_free(start, size);
> > +		//		memblock_phys_free(start, size);
> >   		memblock_remove(start, size);
> >   	}
> 
> memblock_phys_free() works on memblock.reserved, memblock_remove() works  on
> memblock.memory.
> 
> If you take a look at the doc at the top of memblock.c:
> 
> memblock.memory: physical memory available to the system
> memblock.reserved: regions that were allocated [during boot]
> 
> 
> memblock.memory is supposed to be a superset of memblock.reserved. Your

No, it's not.
memblock.reserved is more of "if there is memory, don't touch it".
Some regions in memblock.reserved are boot-time allocations and those are indeed a
subset of memblock.memory, but some are reservations done by firmware (e.g.
reserved memory in DT) that might not have a corresponding region in
memblock.memory. That can happen, for example, when the same firmware runs on
devices with different memory configurations but still wants to preserve
some physical addresses.

> "hack" here indicates that you somehow would be relying on the opposite
> being true, which indicates that you are doing the wrong thing.
 
I'm not sure about that, I still have to digest the patches :)
 
> memblock_remove() indeed balances against memblock_add_node() for hotplugged
> memory [add_memory_resource()]. There seem to a case where we would succeed
> in hotunplugging memory that was part of "memblock.reserved".
> 
> But how could that happen? I think the following way:
> 
> Once the buddy is up and running, memory allocated during early boot is not
> freed back to memblock, but usually we simply go via something like
> free_reserved_page(), not memblock_free() [because the buddy took over]. So
> one could end up unplugging memory that still resides in memblock.reserved
> set.
> 
> So with memblock_phys_free(), we are enforcing the invariant that
> memblock.memory is a superset of memblock.reserved.
> 
> Likely, arm64 should store that node assignment elsewhere from where it can
> be queried. Or it should be using something like
> CONFIG_HAVE_MEMBLOCK_PHYS_MAP for these static windows.
> 
> -- 
> Cheers,
> 
> David / dhildenb
>
David Hildenbrand June 3, 2024, 9:14 a.m. UTC | #7
On 03.06.24 09:57, Mike Rapoport wrote:
> On Fri, May 31, 2024 at 09:49:32AM +0200, David Hildenbrand wrote:
>> On 29.05.24 19:12, Jonathan Cameron wrote:
>>> I'm not sure what this is balancing, but it if is necessary then the reserved
>>> memblock approach can't be used to stash NUMA node assignments as after the
>>> first add / remove cycle the entry is dropped so not available if memory is
>>> re-added at the same HPA.
>>>
>>> This patch is here to hopefully spur comments on what this is there for!
>>>
>>> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> ---
>>>    mm/memory_hotplug.c | 2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>> index 431b1f6753c0..3d8dd4749dfc 100644
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -2284,7 +2284,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
>>>    	}
>>>    	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
>>> -		memblock_phys_free(start, size);
>>> +		//		memblock_phys_free(start, size);
>>>    		memblock_remove(start, size);
>>>    	}
>>
>> memblock_phys_free() works on memblock.reserved, memblock_remove() works  on
>> memblock.memory.
>>
>> If you take a look at the doc at the top of memblock.c:
>>
>> memblock.memory: physical memory available to the system
>> memblock.reserved: regions that were allocated [during boot]
>>
>>
>> memblock.memory is supposed to be a superset of memblock.reserved. Your
> 
> No it's not.
> memblock.reserved is more of "if there is memory, don't touch it".

Then we should certainly clarify that in the comments! :P

But for the memory hotunplug case, that's most likely why that code was 
added. And it only deals with ordinary system RAM, not weird 
reservations you describe below.

> Some regions in memblock.reserved are boot time allocations and they are indeed a
> subset of memblock.memory, but some are reservations done by firmware (e.g.
> reserved memory in DT) that just might not have a corresponding regions in
> memblock.memory. It can happen for example, when the same firmware runs on
> devices with different memory configuration, but still wants to preserve
> some physical addresses.

Could this happen with a good old BIOS as well? Just curious.

> 
>> "hack" here indicates that you somehow would be relying on the opposite
>> being true, which indicates that you are doing the wrong thing.
>   
> I'm not sure about that, I still have to digest the patches :)

In any case, using "reserved" to store persistent data across 
plug/unplug sounds wrong; but maybe I'm wrong :)
Mike Rapoport June 3, 2024, 10:43 a.m. UTC | #8
On Mon, Jun 03, 2024 at 11:14:00AM +0200, David Hildenbrand wrote:
> On 03.06.24 09:57, Mike Rapoport wrote:
> > On Fri, May 31, 2024 at 09:49:32AM +0200, David Hildenbrand wrote:
> > > On 29.05.24 19:12, Jonathan Cameron wrote:
> > > > I'm not sure what this is balancing, but it if is necessary then the reserved
> > > > memblock approach can't be used to stash NUMA node assignments as after the
> > > > first add / remove cycle the entry is dropped so not available if memory is
> > > > re-added at the same HPA.
> > > > 
> > > > This patch is here to hopefully spur comments on what this is there for!
> > > > 
> > > > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > > > ---
> > > >    mm/memory_hotplug.c | 2 +-
> > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > > > index 431b1f6753c0..3d8dd4749dfc 100644
> > > > --- a/mm/memory_hotplug.c
> > > > +++ b/mm/memory_hotplug.c
> > > > @@ -2284,7 +2284,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
> > > >    	}
> > > >    	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
> > > > -		memblock_phys_free(start, size);
> > > > +		//		memblock_phys_free(start, size);
> > > >    		memblock_remove(start, size);
> > > >    	}
> > > 
> > > memblock_phys_free() works on memblock.reserved, memblock_remove() works  on
> > > memblock.memory.
> > > 
> > > If you take a look at the doc at the top of memblock.c:
> > > 
> > > memblock.memory: physical memory available to the system
> > > memblock.reserved: regions that were allocated [during boot]
> > > 
> > > 
> > > memblock.memory is supposed to be a superset of memblock.reserved. Your
> > 
> > No it's not.
> > memblock.reserved is more of "if there is memory, don't touch it".
> 
> Then we should certainly clarify that in the comments! :P

You are welcome to send a patch :-P
 
> But for the memory hotunplug case, that's most likely why that code was
> added. And it only deals with ordinary system RAM, not weird reservations
> you describe below.

The commit that added memblock_free() in the first place (f9126ab9241f
("memory-hotplug: fix wrong edge when hot add a new node")) does not really
describe why that was required :(

But at a quick glance it looks completely spurious.
 
> > Some regions in memblock.reserved are boot time allocations and they are indeed a
> > subset of memblock.memory, but some are reservations done by firmware (e.g.
> > reserved memory in DT) that just might not have a corresponding regions in
> > memblock.memory. It can happen for example, when the same firmware runs on
> > devices with different memory configuration, but still wants to preserve
> > some physical addresses.
> 
> Could this happen with a good old BIOS as well? Just curious.
 
Yes. E.g. for E820_TYPE_SOFT_RESERVED.
 
> > > "hack" here indicates that you somehow would be relying on the opposite
> > > being true, which indicates that you are doing the wrong thing.
> > I'm not sure about that, I still have to digest the patches :)
> 
> In any case, using "reserved" to store persistent data across plug/unplug
> sounds wrong; but maybe I'm wrong :)
> 
> -- 
> Cheers,
> 
> David / dhildenb
>
David Hildenbrand June 3, 2024, 8:53 p.m. UTC | #9
On 03.06.24 12:43, Mike Rapoport wrote:
> On Mon, Jun 03, 2024 at 11:14:00AM +0200, David Hildenbrand wrote:
>> On 03.06.24 09:57, Mike Rapoport wrote:
>>> On Fri, May 31, 2024 at 09:49:32AM +0200, David Hildenbrand wrote:
>>>> On 29.05.24 19:12, Jonathan Cameron wrote:
>>>>> I'm not sure what this is balancing, but it if is necessary then the reserved
>>>>> memblock approach can't be used to stash NUMA node assignments as after the
>>>>> first add / remove cycle the entry is dropped so not available if memory is
>>>>> re-added at the same HPA.
>>>>>
>>>>> This patch is here to hopefully spur comments on what this is there for!
>>>>>
>>>>> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>>> ---
>>>>>     mm/memory_hotplug.c | 2 +-
>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>>>> index 431b1f6753c0..3d8dd4749dfc 100644
>>>>> --- a/mm/memory_hotplug.c
>>>>> +++ b/mm/memory_hotplug.c
>>>>> @@ -2284,7 +2284,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
>>>>>     	}
>>>>>     	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
>>>>> -		memblock_phys_free(start, size);
>>>>> +		//		memblock_phys_free(start, size);
>>>>>     		memblock_remove(start, size);
>>>>>     	}
>>>>
>>>> memblock_phys_free() works on memblock.reserved, memblock_remove() works  on
>>>> memblock.memory.
>>>>
>>>> If you take a look at the doc at the top of memblock.c:
>>>>
>>>> memblock.memory: physical memory available to the system
>>>> memblock.reserved: regions that were allocated [during boot]
>>>>
>>>>
>>>> memblock.memory is supposed to be a superset of memblock.reserved. Your
>>>
>>> No it's not.
>>> memblock.reserved is more of "if there is memory, don't touch it".
>>
>> Then we should certainly clarify that in the comments! :P
> 
> You are welcome to send a patch :-P

I'll try once I've understood what changed since you documented that 
in 2018 -- or whether we missed that detail back then already.

>   
>> But for the memory hotunplug case, that's most likely why that code was
>> added. And it only deals with ordinary system RAM, not weird reservations
>> you describe below.
> 
> The commit that added memblock_free() at the first place (f9126ab9241f
> ("memory-hotplug: fix wrong edge when hot add a new node")) does not really
> describe why that was required :(
> 
> But at a quick glance it looks completely spurious.

There are more details [1] but I also did not figure out why the 
memblock_free() was really required to resolve that issue.

[1] https://marc.info/?l=linux-kernel&m=142961156129456&w=2
Mike Rapoport June 4, 2024, 9:35 a.m. UTC | #10
On Mon, Jun 03, 2024 at 10:53:03PM +0200, David Hildenbrand wrote:
> On 03.06.24 12:43, Mike Rapoport wrote:
> > On Mon, Jun 03, 2024 at 11:14:00AM +0200, David Hildenbrand wrote:
> > > On 03.06.24 09:57, Mike Rapoport wrote:
> > > > On Fri, May 31, 2024 at 09:49:32AM +0200, David Hildenbrand wrote:
> > > > > On 29.05.24 19:12, Jonathan Cameron wrote:
> > > > > > I'm not sure what this is balancing, but it if is necessary then the reserved
> > > > > > memblock approach can't be used to stash NUMA node assignments as after the
> > > > > > first add / remove cycle the entry is dropped so not available if memory is
> > > > > > re-added at the same HPA.
> > > > > > 
> > > > > > This patch is here to hopefully spur comments on what this is there for!
> > > > > > 
> > > > > > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > > > > > ---
> > > > > >     mm/memory_hotplug.c | 2 +-
> > > > > >     1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > > > > > index 431b1f6753c0..3d8dd4749dfc 100644
> > > > > > --- a/mm/memory_hotplug.c
> > > > > > +++ b/mm/memory_hotplug.c
> > > > > > @@ -2284,7 +2284,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
> > > > > >     	}
> > > > > >     	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
> > > > > > -		memblock_phys_free(start, size);
> > > > > > +		//		memblock_phys_free(start, size);
> > > > > >     		memblock_remove(start, size);
> > > > > >     	}
> > > > > 
> > > > > memblock_phys_free() works on memblock.reserved, memblock_remove() works  on
> > > > > memblock.memory.
> > > > > 
> > > > > If you take a look at the doc at the top of memblock.c:
> > > > > 
> > > > > memblock.memory: physical memory available to the system
> > > > > memblock.reserved: regions that were allocated [during boot]
> > > > > 
> > > > > 
> > > > > memblock.memory is supposed to be a superset of memblock.reserved. Your
> > > > 
> > > > No it's not.
> > > > memblock.reserved is more of "if there is memory, don't touch it".
> > > 
> > > Then we should certainly clarify that in the comments! :P
> > 
> > You are welcome to send a patch :-P
> 
> I'll try once I understood what changed ever since you documented that in
> 2018 -- or if we missed that detail back then already.
 
> > > But for the memory hotunplug case, that's most likely why that code was
> > > added. And it only deals with ordinary system RAM, not weird reservations
> > > you describe below.
> > 
> > The commit that added memblock_free() at the first place (f9126ab9241f
> > ("memory-hotplug: fix wrong edge when hot add a new node")) does not really
> > describe why that was required :(
> > 
> > But at a quick glance it looks completely spurious.
> 
> There are more details [1] but I also did not figure out why the
> memblock_free() was really required to resolve that issue.
> 
> [1] https://marc.info/?l=linux-kernel&m=142961156129456&w=2
 
The tinkering with memblock there and in f9126ab9241f seems bogus in the
context of memory hotplug on x86.

I believe that dropping that memblock_phys_free() is the right thing to do
regardless of this series. There's no corresponding memblock_alloc(), and it
was added as part of a fix for hotunplug on x86, which anyway had memblock
discarded at that point.

> -- 
> Cheers,
> 
> David / dhildenb
>
David Hildenbrand June 4, 2024, 9:39 a.m. UTC | #11
On 04.06.24 11:35, Mike Rapoport wrote:
> On Mon, Jun 03, 2024 at 10:53:03PM +0200, David Hildenbrand wrote:
>> On 03.06.24 12:43, Mike Rapoport wrote:
>>> On Mon, Jun 03, 2024 at 11:14:00AM +0200, David Hildenbrand wrote:
>>>> On 03.06.24 09:57, Mike Rapoport wrote:
>>>>> On Fri, May 31, 2024 at 09:49:32AM +0200, David Hildenbrand wrote:
>>>>>> On 29.05.24 19:12, Jonathan Cameron wrote:
>>>>>>> I'm not sure what this is balancing, but it if is necessary then the reserved
>>>>>>> memblock approach can't be used to stash NUMA node assignments as after the
>>>>>>> first add / remove cycle the entry is dropped so not available if memory is
>>>>>>> re-added at the same HPA.
>>>>>>>
>>>>>>> This patch is here to hopefully spur comments on what this is there for!
>>>>>>>
>>>>>>> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>>>>> ---
>>>>>>>      mm/memory_hotplug.c | 2 +-
>>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>>>>>> index 431b1f6753c0..3d8dd4749dfc 100644
>>>>>>> --- a/mm/memory_hotplug.c
>>>>>>> +++ b/mm/memory_hotplug.c
>>>>>>> @@ -2284,7 +2284,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
>>>>>>>      	}
>>>>>>>      	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
>>>>>>> -		memblock_phys_free(start, size);
>>>>>>> +		//		memblock_phys_free(start, size);
>>>>>>>      		memblock_remove(start, size);
>>>>>>>      	}
>>>>>>
>>>>>> memblock_phys_free() works on memblock.reserved, memblock_remove() works  on
>>>>>> memblock.memory.
>>>>>>
>>>>>> If you take a look at the doc at the top of memblock.c:
>>>>>>
>>>>>> memblock.memory: physical memory available to the system
>>>>>> memblock.reserved: regions that were allocated [during boot]
>>>>>>
>>>>>>
>>>>>> memblock.memory is supposed to be a superset of memblock.reserved. Your
>>>>>
>>>>> No it's not.
>>>>> memblock.reserved is more of "if there is memory, don't touch it".
>>>>
>>>> Then we should certainly clarify that in the comments! :P
>>>
>>> You are welcome to send a patch :-P
>>
>> I'll try once I understood what changed ever since you documented that in
>> 2018 -- or if we missed that detail back then already.
>   
>>>> But for the memory hotunplug case, that's most likely why that code was
>>>> added. And it only deals with ordinary system RAM, not weird reservations
>>>> you describe below.
>>>
>>> The commit that added memblock_free() at the first place (f9126ab9241f
>>> ("memory-hotplug: fix wrong edge when hot add a new node")) does not really
>>> describe why that was required :(
>>>
>>> But at a quick glance it looks completely spurious.
>>
>> There are more details [1] but I also did not figure out why the
>> memblock_free() was really required to resolve that issue.
>>
>> [1] https://marc.info/?l=linux-kernel&m=142961156129456&w=2
>   
> The tinkering with memblock there and in f9126ab9241f seem bogus in the
> context of memory hotplug on x86.
> 
> I believe that dropping that memblock_phys_free() is right thing to do
> regardless of this series. There's no corresponding memblock_alloc() and it
> was added as part of a fix for hotunplug on x86 that anyway had memblock
> discarded at that point.

So when we re-add that memory, we might still have ranges marked as "reserved". 
It does sound weird, but you're the boss :)
Mike Rapoport June 5, 2024, 8 a.m. UTC | #12
On Tue, Jun 04, 2024 at 11:39:27AM +0200, David Hildenbrand wrote:
> On 04.06.24 11:35, Mike Rapoport wrote:
> > On Mon, Jun 03, 2024 at 10:53:03PM +0200, David Hildenbrand wrote:
> > > On 03.06.24 12:43, Mike Rapoport wrote:
> > > > On Mon, Jun 03, 2024 at 11:14:00AM +0200, David Hildenbrand wrote:
> > > 
> > > > The commit that added memblock_free() at the first place (f9126ab9241f
> > > > ("memory-hotplug: fix wrong edge when hot add a new node")) does not really
> > > > describe why that was required :(
> > > > 
> > > > But at a quick glance it looks completely spurious.
> > > 
> > > There are more details [1] but I also did not figure out why the
> > > memblock_free() was really required to resolve that issue.
> > > 
> > > [1] https://marc.info/?l=linux-kernel&m=142961156129456&w=2
> > The tinkering with memblock there and in f9126ab9241f seem bogus in the
> > context of memory hotplug on x86.
> > 
> > I believe that dropping that memblock_phys_free() is right thing to do
> > regardless of this series. There's no corresponding memblock_alloc() and it
> > was added as part of a fix for hotunplug on x86 that anyway had memblock
> > discarded at that point.
> 
> So when we re-add that memory, we might have still ranges as "reserved".

I don't see how anything can become reserved on the hotplug path unless
hotplug is possible before mm_core_init().
There are no memblock_reserve() calls in memory_hotplug.c, no memblock
allocations are possible after mm is initialised, and even if memblock_add()
needs to allocate memory, that will be done via slab.

> It does sound weird, but you're the boss :)

Nah, it's mm/memory_hotplug.c, so you are :)

But I can send a patch anyway :)
 
> -- 
> Cheers,
> 
> David / dhildenb
>
David Hildenbrand June 5, 2024, 8:23 a.m. UTC | #13
On 05.06.24 10:00, Mike Rapoport wrote:
> On Tue, Jun 04, 2024 at 11:39:27AM +0200, David Hildenbrand wrote:
>> On 04.06.24 11:35, Mike Rapoport wrote:
>>> On Mon, Jun 03, 2024 at 10:53:03PM +0200, David Hildenbrand wrote:
>>>> On 03.06.24 12:43, Mike Rapoport wrote:
>>>>> On Mon, Jun 03, 2024 at 11:14:00AM +0200, David Hildenbrand wrote:
>>>>
>>>>> The commit that added memblock_free() at the first place (f9126ab9241f
>>>>> ("memory-hotplug: fix wrong edge when hot add a new node")) does not really
>>>>> describe why that was required :(
>>>>>
>>>>> But at a quick glance it looks completely spurious.
>>>>
>>>> There are more details [1] but I also did not figure out why the
>>>> memblock_free() was really required to resolve that issue.
>>>>
>>>> [1] https://marc.info/?l=linux-kernel&m=142961156129456&w=2
>>> The tinkering with memblock there and in f9126ab9241f seem bogus in the
>>> context of memory hotplug on x86.
>>>
>>> I believe that dropping that memblock_phys_free() is right thing to do
>>> regardless of this series. There's no corresponding memblock_alloc() and it
>>> was added as part of a fix for hotunplug on x86 that anyway had memblock
>>> discarded at that point.
>>
>> So when we re-add that memory, we might have still ranges as "reserved".
> 
> I don't see how anything can become reserved on the hotplug path unless
> hotplug is possible before mm_core_init().
> There are no memblock_reserve() calls in memory_hotplug.c, no memblock
> allocations possible after mm is inited, and even if memblock_add() will
> need to allocate memory that will be done via slab.

I had the following in mind:

(1) DIMM is part of boot mem and some boot allocation ended up on it
(2) That boot allocation got freed after the buddy is already up
     (memblock.reserved not updated)
(3) We succeed in offlining the memory and unplugging the DIMM

Now we have some "reserved" leftover from memory that is no longer 
physically around.

(4) We re-plug a DIMM at that position, possibly with a different NUMA
     assignment.

On bare metal, this is unlikely to happen. With current QEMU, it won't 
happen because (hot-unpluggable) DIMMs are usually not part of bootmem; 
that is, e820 and friends only indicate them as a "hotpluggable but not 
present memory" range. It could be possible after kexec (for example, we 
add that memory to the e820 of the new kernel), but that's rather a 
corner case.

(3) is already unlikely to happen, so removing that memblock_phys_free() 
probably won't change anything in practice.

> 
>> It does sound weird, but you're the boss :)
> 
> Nah, it's mm/memory_hotplug.c, so you are :)
> 

Well yes :) but it's your decision whether we want to use 
memblock.reserved memory to store this persistent NUMA node assignment 
for the fixed memory windows. Essentially another user of 
memblock.reserved we should then document ;)

Removing the memblock_phys_free() call sounds like a requirement for that 
use case (although it might also make sense independently).

I'm not sure whether memblock.reserved is the right data structure for 
this purpose, but if you think it is, Jonathan would have a green light on 
the general approach in this RFC.
Mike Rapoport June 6, 2024, 3:44 p.m. UTC | #14
On Fri, May 31, 2024 at 11:55:08AM +0200, David Hildenbrand wrote:
> On 31.05.24 11:48, Jonathan Cameron wrote:
> > On Fri, 31 May 2024 09:49:32 +0200
> > David Hildenbrand <david@redhat.com> wrote:
> > 
> > > On 29.05.24 19:12, Jonathan Cameron wrote:
> > > > I'm not sure what this is balancing, but it if is necessary then the reserved
> > > > memblock approach can't be used to stash NUMA node assignments as after the
> > > > first add / remove cycle the entry is dropped so not available if memory is
> > > > re-added at the same HPA.
> > > > 
> > > > This patch is here to hopefully spur comments on what this is there for!
> > > > 
> > > > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > > > ---
> > > >    mm/memory_hotplug.c | 2 +-
> > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > > > index 431b1f6753c0..3d8dd4749dfc 100644
> > > > --- a/mm/memory_hotplug.c
> > > > +++ b/mm/memory_hotplug.c
> > > > @@ -2284,7 +2284,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
> > > >    	}
> > > >    	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
> > > > -		memblock_phys_free(start, size);
> > > > +		//		memblock_phys_free(start, size);
> > > >    		memblock_remove(start, size);
> > > >    	}
> > > 
> > > memblock_phys_free() works on memblock.reserved, memblock_remove() works
> > >    on memblock.memory.
> > > 
> > > If you take a look at the doc at the top of memblock.c:
> > > 
> > > memblock.memory: physical memory available to the system
> > > memblock.reserved: regions that were allocated [during boot]
> > > 
> > > 
> > > memblock.memory is supposed to be a superset of memblock.reserved. Your
> > > "hack" here indicates that you somehow would be relying on the opposite
> > > being true, which indicates that you are doing the wrong thing.
> > > 
> > > 
> > > memblock_remove() indeed balances against memblock_add_node() for
> > > hotplugged memory [add_memory_resource()]. There seem to a case where we
> > > would succeed in hotunplugging memory that was part of "memblock.reserved".
> > > 
> > > But how could that happen? I think the following way:
> > > 
> > > Once the buddy is up and running, memory allocated during early boot is
> > > not freed back to memblock, but usually we simply go via something like
> > > free_reserved_page(), not memblock_free() [because the buddy took over].
> > > So one could end up unplugging memory that still resides in
> > > memblock.reserved set.
> > > 
> > > So with memblock_phys_free(), we are enforcing the invariant that
> > > memblock.memory is a superset of memblock.reserved.
> > > 
> > > Likely, arm64 should store that node assignment elsewhere from where it
> > > can be queried. Or it should be using something like
> > > CONFIG_HAVE_MEMBLOCK_PHYS_MAP for these static windows.
> > > 
> > 
> > Hi David,
> > 
> > Thanks for the explanation and pointers.  I'd rather avoid inventing a parallel
> > infrastructure for this (like x86 has for other reasons, but which is also used
> > for this purpose).
> 
> Yes, although memblock feels a bit wrong, because it is targeted at managing
> actual present memory, not properties about memory that could become present
> later.

Right now memblock.reserved can have ranges that are not actually present,
but I still have doubts about using it to contain memory ranges that could
be hotplugged, and I'm leaning against it.

In general memblock as an infrastructure for dealing with memory ranges and
their properties seems to be a good fit, so either memblock.reserved or a
new memblock_type can be used to implement phys_to_target_node().

memblock.reserved may be less suitable than a new data structure because,
unlike x86::numa_reserved_meminfo, it mostly manages already present memory,
while x86::numa_reserved_meminfo contains regions that do not
overlap with RAM.
 
Now, before we continue to bikeshed about the name for the new data
structure, how about this crazy idea: merge the parts of arch/x86/numa.c
dealing with 'struct numa_meminfo' into arch_numa.c and then remove some of
the redundancy, e.g. by implementing memory_add_physaddr_to_nid() with
memblock.memory instead of numa_meminfo.
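As a rough sketch of that last idea, assuming CONFIG_NUMA and node ids recorded per region in memblock.memory (this is not an existing implementation):

/* Hypothetical sketch, not existing kernel code. */
#include <linux/memblock.h>
#include <linux/numa.h>

int memory_add_physaddr_to_nid(u64 addr)
{
	struct memblock_region *r;

	/* memblock.memory carries per-region node ids when NUMA is enabled. */
	for_each_mem_region(r) {
		if (addr >= r->base && addr < r->base + r->size)
			return memblock_get_region_node(r);
	}

	/* Fall back as the weak default does. */
	return 0;
}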

It's an involved project indeed, but it has the advantage of shared
infrastructure for NUMA initialization on ACPI systems.
AFAIU, x86 has a parallel infrastructure because they were first, but I
don't see a fundamental reason why it has to remain that way.

There also were several attempts to implement fakenuma on arm64 or riscv
(sorry, can't find lore links right now), and with numa_meminfo in
arch_numa.c we get it almost for free.

> >  From a quick look CONFIG_HAVE_MEMBLOCK_PHYS_MAP is documented in a fashion that
> > makes me think it's not directly appropriate (this isn't actual physical memory
> > available during boot) but the general approach of just adding another memblock
> > collection seems like it will work.
> 
> Yes. As an alternative, modify the description of
> CONFIG_HAVE_MEMBLOCK_PHYS_MAP.
 
I wouldn't want to touch physmem, except maybe moving it entirely to
arch/s390 and leaving it there.
 
> > Hardest problem might be naming it.  physmem_possible perhaps?
> > Fill that with anything found in SRAT or CEDT should work for ACPI, but I'm not
> > sure on whether we can fill it when neither of those is present.  Maybe we just
> > don't bother as for today's usecase CEDT needs to be present.
> 
> You likely only want something for these special windows, with these special
> semantics. That makes it hard.
> 
> "physmem_possible" is a bit too generic for my taste, promising semantics
> that might not hold true (we don't want all hotpluggable areas to show up
> there).
> 
> > 
> > Maybe physmem_known_possible is the way to go. I'll give this approach a spin
> > and see how it goes.
> 
> Goes into a better direction, or really "physmem_fixed_windows". Because
> also "physmem_known_possible" as a wrong smell to it.

physmem_potential? ;-)
 
> -- 
> Cheers,
> 
> David / dhildenb
>

Patch

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 431b1f6753c0..3d8dd4749dfc 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2284,7 +2284,7 @@  static int __ref try_remove_memory(u64 start, u64 size)
 	}
 
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
-		memblock_phys_free(start, size);
+		//		memblock_phys_free(start, size);
 		memblock_remove(start, size);
 	}