Message ID | 20200609225451.3542648-1-daniel.m.jordan@oracle.com |
---|---|
State | New, archived |
Series | [v2] x86/mm: use max memory block size on bare metal |
On Tue, Jun 09, 2020 at 06:54:51PM -0400, Daniel Jordan wrote:
> Some of our servers spend significant time at kernel boot initializing memory block sysfs directories and then creating symlinks between them and the corresponding nodes. The slowness happens because the machines get stuck with the smallest supported memory block size on x86 (128M), which results in 16,288 directories to cover the 2T of installed RAM. The search for each memory block is noticeable even with commit 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup").
>
> Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based on the end of boot memory") chooses the block size based on alignment with memory end. That addresses hotplug failures in qemu guests, but for bare metal systems whose memory end isn't aligned to even the smallest size, it leaves them at 128M.
>
> Make kernels that aren't running on a hypervisor use the largest supported size (2G) to minimize overhead on big machines. Kernel boot goes 7% faster on the aforementioned servers, shaving off half a second.
>
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Steven Sistare <steven.sistare@oracle.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---

Forgot the v1 changes:

- Thanks to David for the idea to make this conditional based on virtualization.
- Update performance numbers to account for 4fb6eabf1037 (David)
On 10.06.20 00:54, Daniel Jordan wrote:
> Some of our servers spend significant time at kernel boot initializing memory block sysfs directories and then creating symlinks between them and the corresponding nodes. The slowness happens because the machines get stuck with the smallest supported memory block size on x86 (128M), which results in 16,288 directories to cover the 2T of installed RAM. The search for each memory block is noticeable even with commit 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup").
>
> Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based on the end of boot memory") chooses the block size based on alignment with memory end. That addresses hotplug failures in qemu guests, but for bare metal systems whose memory end isn't aligned to even the smallest size, it leaves them at 128M.
>
> Make kernels that aren't running on a hypervisor use the largest supported size (2G) to minimize overhead on big machines. Kernel boot goes 7% faster on the aforementioned servers, shaving off half a second.
>
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Steven Sistare <steven.sistare@oracle.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>
> Applies to 5.7 and today's mainline
>
>  arch/x86/mm/init_64.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 8b5f73f5e207c..906fbdb060748 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -55,6 +55,7 @@
>  #include <asm/uv/uv.h>
>  #include <asm/setup.h>
>  #include <asm/ftrace.h>
> +#include <asm/hypervisor.h>
>
>  #include "mm_internal.h"
>
> @@ -1390,6 +1391,15 @@ static unsigned long probe_memory_block_size(void)
>  		goto done;
>  	}
>
> +	/*
> +	 * Use max block size to minimize overhead on bare metal, where
> +	 * alignment for memory hotplug isn't a concern.
> +	 */
> +	if (hypervisor_is_type(X86_HYPER_NATIVE)) {
> +		bz = MAX_BLOCK_SIZE;
> +		goto done;
> +	}

I'd assume that BIOSes on physical machines >= 64GB will not align bigger (>= 2GB) DIMMs to something < 2GB.

Acked-by: David Hildenbrand <david@redhat.com>

> +
>  	/* Find the largest allowed block size that aligns to memory end */
>  	for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
>  		if (IS_ALIGNED(boot_mem_end, bz))
>
On 10.06.20 09:20, David Hildenbrand wrote:
> On 10.06.20 00:54, Daniel Jordan wrote:
>> Some of our servers spend significant time at kernel boot initializing memory block sysfs directories and then creating symlinks between them and the corresponding nodes. The slowness happens because the machines get stuck with the smallest supported memory block size on x86 (128M), which results in 16,288 directories to cover the 2T of installed RAM. The search for each memory block is noticeable even with commit 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup").
>>
>> Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based on the end of boot memory") chooses the block size based on alignment with memory end. That addresses hotplug failures in qemu guests, but for bare metal systems whose memory end isn't aligned to even the smallest size, it leaves them at 128M.
>>
>> Make kernels that aren't running on a hypervisor use the largest supported size (2G) to minimize overhead on big machines. Kernel boot goes 7% faster on the aforementioned servers, shaving off half a second.
>>
>> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Steven Sistare <steven.sistare@oracle.com>
>> Cc: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
>> ---
>>
>> Applies to 5.7 and today's mainline
>>
>>  arch/x86/mm/init_64.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index 8b5f73f5e207c..906fbdb060748 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -55,6 +55,7 @@
>>  #include <asm/uv/uv.h>
>>  #include <asm/setup.h>
>>  #include <asm/ftrace.h>
>> +#include <asm/hypervisor.h>
>>
>>  #include "mm_internal.h"
>>
>> @@ -1390,6 +1391,15 @@ static unsigned long probe_memory_block_size(void)
>>  		goto done;
>>  	}
>>
>> +	/*
>> +	 * Use max block size to minimize overhead on bare metal, where
>> +	 * alignment for memory hotplug isn't a concern.
>> +	 */
>> +	if (hypervisor_is_type(X86_HYPER_NATIVE)) {
>> +		bz = MAX_BLOCK_SIZE;
>> +		goto done;
>> +	}
>
> I'd assume that BIOSes on physical machines >= 64GB will not align bigger (>= 2GB) DIMMs to something < 2GB.
>
> Acked-by: David Hildenbrand <david@redhat.com>

FWIW, setup_arch() does the init_hypervisor_platform() call. I assume that should be early enough.

We should really look into factoring out memory_block_size_bytes() into common code, turning it into a simple global variable read. Then, we should provide an interface to configure the memory block size during boot from arch code (set_memory_block_size()).
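To make the suggestion above concrete, here is a rough sketch of what such a common-code interface might look like. It is only an illustration of David's idea: set_memory_block_size() and the memory_block_size variable are hypothetical names, not anything that exists in the tree at this point.

/* Hypothetical drivers/base/memory.c refactoring along the lines suggested above. */
static unsigned long memory_block_size __ro_after_init = MIN_MEMORY_BLOCK_SIZE;

/* Arch code would call this once during early boot, before memory blocks are registered. */
void __init set_memory_block_size(unsigned long size)
{
	/* Sanity check: a power of two, at least one memory section. */
	if (WARN_ON(!is_power_of_2(size) || size < MIN_MEMORY_BLOCK_SIZE))
		return;
	memory_block_size = size;
}

/* memory_block_size_bytes() would then need no per-arch override; it becomes a plain read. */
unsigned long memory_block_size_bytes(void)
{
	return memory_block_size;
}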
On Wed, Jun 10, 2020 at 09:30:00AM +0200, David Hildenbrand wrote:
> On 10.06.20 09:20, David Hildenbrand wrote:
> > On 10.06.20 00:54, Daniel Jordan wrote:
> >> @@ -1390,6 +1391,15 @@ static unsigned long probe_memory_block_size(void)
> >>  		goto done;
> >>  	}
> >>
> >> +	/*
> >> +	 * Use max block size to minimize overhead on bare metal, where
> >> +	 * alignment for memory hotplug isn't a concern.
> >> +	 */
> >> +	if (hypervisor_is_type(X86_HYPER_NATIVE)) {
> >> +		bz = MAX_BLOCK_SIZE;
> >> +		goto done;
> >> +	}
> >
> > I'd assume that BIOSes on physical machines >= 64GB will not align bigger (>= 2GB) DIMMs to something < 2GB.
> >
> > Acked-by: David Hildenbrand <david@redhat.com>

Thanks!

> FWIW, setup_arch() does the init_hypervisor_platform() call. I assume that should be early enough.

I checked all the places where x86 uses memory_block_size_bytes(), it looks ok.
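For anyone checking the ordering themselves: x86 probes the size lazily, so the hypervisor check only has to give the right answer by the time the first memory_block_size_bytes() caller runs. Roughly, as a paraphrase from memory of the arch/x86/mm/init_64.c plumbing (not a verbatim quote of the tree):

static unsigned long memory_block_size_probed;

unsigned long memory_block_size_bytes(void)
{
	/* Probe once, on first use, and cache the result. */
	if (!memory_block_size_probed)
		memory_block_size_probed = probe_memory_block_size();

	return memory_block_size_probed;
}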
On 6/9/20 3:54 PM, Daniel Jordan wrote:
> +	/*
> +	 * Use max block size to minimize overhead on bare metal, where
> +	 * alignment for memory hotplug isn't a concern.
> +	 */
> +	if (hypervisor_is_type(X86_HYPER_NATIVE)) {
> +		bz = MAX_BLOCK_SIZE;
> +		goto done;
> +	}

What ends up being the worst case scenario? Booting a really small bare-metal x86 system, say with 64MB or 128MB of RAM? What's the overhead there?
On Thu, Jun 11, 2020 at 07:16:02AM -0700, Dave Hansen wrote:
> On 6/9/20 3:54 PM, Daniel Jordan wrote:
> > +	/*
> > +	 * Use max block size to minimize overhead on bare metal, where
> > +	 * alignment for memory hotplug isn't a concern.
> > +	 */
> > +	if (hypervisor_is_type(X86_HYPER_NATIVE)) {
> > +		bz = MAX_BLOCK_SIZE;
> > +		goto done;
> > +	}
>
> What ends up being the worst case scenario? Booting a really small bare-metal x86 system, say with 64MB or 128MB of RAM? What's the overhead there?

Might not be following you, so bear with me, but we only get to this check on a system with a physical address end of at least MEM_SIZE_FOR_LARGE_BLOCK (64G), and this would still (ever so slightly...) reduce overhead of memory block init at boot in that case.
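For readers without the source handy, here is a simplified paraphrase of the flow being described, with the patch applied. It is condensed from memory and omits other special cases in probe_memory_block_size(); it is not a verbatim quote of the function.

/* Condensed sketch of probe_memory_block_size() with the patch applied. */
static unsigned long probe_memory_block_size(void)
{
	unsigned long boot_mem_end = max_pfn << PAGE_SHIFT;
	unsigned long bz;

	/* Small machines bail out here: below 64G the size stays at 128M. */
	if (boot_mem_end < MEM_SIZE_FOR_LARGE_BLOCK) {
		bz = MIN_MEMORY_BLOCK_SIZE;
		goto done;
	}

	/* So the new check only ever runs on bare-metal machines with >= 64G. */
	if (hypervisor_is_type(X86_HYPER_NATIVE)) {
		bz = MAX_BLOCK_SIZE;
		goto done;
	}

	/* Otherwise pick the largest block size that aligns to memory end. */
	for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1)
		if (IS_ALIGNED(boot_mem_end, bz))
			break;
done:
	return bz;
}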
On 6/11/20 9:59 AM, Daniel Jordan wrote:
> On Thu, Jun 11, 2020 at 07:16:02AM -0700, Dave Hansen wrote:
>> On 6/9/20 3:54 PM, Daniel Jordan wrote:
>>> +	/*
>>> +	 * Use max block size to minimize overhead on bare metal, where
>>> +	 * alignment for memory hotplug isn't a concern.
>>> +	 */
>>> +	if (hypervisor_is_type(X86_HYPER_NATIVE)) {
>>> +		bz = MAX_BLOCK_SIZE;
>>> +		goto done;
>>> +	}
>> What ends up being the worst case scenario? Booting a really small bare-metal x86 system, say with 64MB or 128MB of RAM? What's the overhead there?
> Might not be following you, so bear with me, but we only get to this check on a system with a physical address end of at least MEM_SIZE_FOR_LARGE_BLOCK (64G), and this would still (ever so slightly...) reduce overhead of memory block init at boot in that case.

Ahh, I see now. That is just above the hunk you added, but wasn't in the diff context or mentioned in the changelog.

One other nit for this. We *do* have actual hardware hotplug, and I'm pretty sure the alignment guarantees for hardware hotplug are pretty weak. For instance, the alignment guarantees for persistent memory are still only 64MB even on modern platforms.

Let's say we're on bare metal and we see an SRAT table that has some areas that show that hotplug might happen there. Is this patch still ideal there?
On Thu, Jun 11, 2020 at 10:05:38AM -0700, Dave Hansen wrote:
> One other nit for this. We *do* have actual hardware hotplug, and I'm pretty sure the alignment guarantees for hardware hotplug are pretty weak. For instance, the alignment guarantees for persistent memory are still only 64MB even on modern platforms.
>
> Let's say we're on bare metal and we see an SRAT table that has some areas that show that hotplug might happen there. Is this patch still ideal there?

Well, not if there's concern about hardware hotplug. My assumption going in was that this wasn't a problem in practice.

078eb6aa50dc50 ("x86/mm/memory_hotplug: determine block size based on the end of boot memory") was merged in 2018 to address qemu hotplug failures and >64G systems have used a 2G block since 2014 with no complaints about alignment issues, to my knowledge anyway.
On Tue 09-06-20 18:54:51, Daniel Jordan wrote:
[...]
> @@ -1390,6 +1391,15 @@ static unsigned long probe_memory_block_size(void)
>  		goto done;
>  	}
>
> +	/*
> +	 * Use max block size to minimize overhead on bare metal, where
> +	 * alignment for memory hotplug isn't a concern.

This really begs a clarification why this is not a concern. Bare metal can see physical memory hotadd as well. I just suspect that you do not consider that to be very common so it is not a big deal? And I would tend to agree but still we are just going to wait until first user stumbles over this.

Btw. memblock interface just doesn't scale and it is a terrible interface for large machines and for the memory hotplug in general (just look at ppc and their insanely small memblocks). Most usecases I have seen simply want to either offline some portion of memory without a strong requirement of the physical memory range as long as it is from a particular node or simply offline and remove the full node.

I believe that we should think about a future interface rather than trying to ducktape the blocksize anytime it causes problems. I would be even tempted to simply add a kernel command line option

	memory_hotplug=disable,legacy,new_shiny

for disable it would simply drop all the sysfs crud and speed up boot for most users who simply do not care about memory hotplug. new_shiny would ideally provide an interface that would either export logically hotplugable memory ranges (e.g. DIMMs) or a query/action interface which accepts physical ranges as input. Having gazillions of sysfs files is simply unsustainable.
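Purely to illustrate the shape of that proposal (no such parameter exists today, and the mode names are the strawman from the mail above), the command-line side could be a small early_param; the sysfs registration path would then consult the mode before creating any per-block directories.

/* Hypothetical sketch only: parse memory_hotplug=disable|legacy|new_shiny. */
enum memhp_mode { MEMHP_DISABLE, MEMHP_LEGACY, MEMHP_NEW_SHINY };

static enum memhp_mode memhp_mode __ro_after_init = MEMHP_LEGACY;

static int __init parse_memory_hotplug(char *str)
{
	if (!str)
		return -EINVAL;

	if (!strcmp(str, "disable"))
		memhp_mode = MEMHP_DISABLE;	/* skip creating memoryXXX sysfs dirs */
	else if (!strcmp(str, "legacy"))
		memhp_mode = MEMHP_LEGACY;	/* today's per-block sysfs interface */
	else if (!strcmp(str, "new_shiny"))
		memhp_mode = MEMHP_NEW_SHINY;	/* range-based interface, still to be designed */
	else
		return -EINVAL;

	return 0;
}
early_param("memory_hotplug", parse_memory_hotplug);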
Hello Michal,

(I've been away and may be slow to respond for a little while)

On Fri, Jun 19, 2020 at 02:07:04PM +0200, Michal Hocko wrote:
> On Tue 09-06-20 18:54:51, Daniel Jordan wrote:
> [...]
> > @@ -1390,6 +1391,15 @@ static unsigned long probe_memory_block_size(void)
> >  		goto done;
> >  	}
> >
> > +	/*
> > +	 * Use max block size to minimize overhead on bare metal, where
> > +	 * alignment for memory hotplug isn't a concern.
>
> This really begs a clarification why this is not a concern. Bare metal can see physical memory hotadd as well. I just suspect that you do not consider that to be very common so it is not a big deal?

It's not only uncommon, it's also that boot_mem_end on bare metal may not align with any available memory block size. For instance, this server's boot_mem_end is only 4M aligned and FWIW my desktop's is 2M aligned. As far as I can tell, the logic that picks the size wasn't intended for bare metal.

> And I would tend to agree but still we are just going to wait until first user stumbles over this.

This isn't something new with this patch, 2G has been the default on big machines for years. This is addressing an unintended side effect of 078eb6aa50dc50, which was for qemu, by restoring the original behavior on bare metal to avoid oodles of sysfs files.

> Btw. memblock interface just doesn't scale and it is a terrible interface for large machines and for the memory hotplug in general (just look at ppc and their insanely small memblocks).

I agree that the status quo isn't ideal and is something to address going forward.

> Most usecases I have seen simply want to either offline some portion of memory without a strong requirement of the physical memory range as long as it is from a particular node or simply offline and remove the full node.

Interesting, would've thought that removing a single bad DIMM for RAS purposes would also be common relative to how often hotplug is done on real systems.

> I believe that we should think about a future interface rather than trying to ducktape the blocksize anytime it causes problems. I would be even tempted to simply add a kernel command line option
>
> 	memory_hotplug=disable,legacy,new_shiny
>
> for disable it would simply drop all the sysfs crud and speed up boot for most users who simply do not care about memory hotplug. new_shiny would ideally provide an interface that would either export logically hotplugable memory ranges (e.g. DIMMs) or a query/action interface which accepts physical ranges as input. Having gazillions of sysfs files is simply unsustainable.

So in this idea, presumably the default would start off being legacy and then later be changed to new_shiny?

If new_shiny scales well, maybe 'disable' wouldn't be needed and so using the option could be avoided most of the time. If some users really don't want it, they can build without it.
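To make the alignment point above concrete, here is a small self-contained userspace sketch of the existing fallback scan (constants and IS_ALIGNED redefined locally, and the memory-end value is made up for illustration). With a memory end that is only 4M aligned, as on the server mentioned above, the loop falls all the way through to 128M.

#include <stdio.h>

#define MIN_MEMORY_BLOCK_SIZE	(128UL << 20)		/* 128M */
#define MAX_BLOCK_SIZE		(2UL << 30)		/* 2G */
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

int main(void)
{
	/* Made-up example: a ~2T memory end that is 4M- but not 128M-aligned. */
	unsigned long boot_mem_end = (2UL << 40) + (4UL << 20);
	unsigned long bz;

	/* Same loop shape as the kernel's fallback scan. */
	for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1)
		if (IS_ALIGNED(boot_mem_end, bz))
			break;

	printf("chosen block size: %luM\n", bz >> 20);	/* prints 128 */
	return 0;
}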
On Mon 22-06-20 15:17:39, Daniel Jordan wrote:
> Hello Michal,
>
> (I've been away and may be slow to respond for a little while)
>
> On Fri, Jun 19, 2020 at 02:07:04PM +0200, Michal Hocko wrote:
> > On Tue 09-06-20 18:54:51, Daniel Jordan wrote:
> > [...]
> > > @@ -1390,6 +1391,15 @@ static unsigned long probe_memory_block_size(void)
> > >  		goto done;
> > >  	}
> > >
> > > +	/*
> > > +	 * Use max block size to minimize overhead on bare metal, where
> > > +	 * alignment for memory hotplug isn't a concern.
> >
> > This really begs a clarification why this is not a concern. Bare metal can see physical memory hotadd as well. I just suspect that you do not consider that to be very common so it is not a big deal?
>
> It's not only uncommon, it's also that boot_mem_end on bare metal may not align with any available memory block size. For instance, this server's boot_mem_end is only 4M aligned and FWIW my desktop's is 2M aligned. As far as I can tell, the logic that picks the size wasn't intended for bare metal.
>
> > And I would tend to agree but still we are just going to wait until first user stumbles over this.
>
> This isn't something new with this patch, 2G has been the default on big machines for years. This is addressing an unintended side effect of 078eb6aa50dc50, which was for qemu, by restoring the original behavior on bare metal to avoid oodles of sysfs files.

I am not really sure qemu was the target. I suspect it was just easier to test in qemu.

[...]

> > I believe that we should think about a future interface rather than trying to ducktape the blocksize anytime it causes problems. I would be even tempted to simply add a kernel command line option
> >
> > 	memory_hotplug=disable,legacy,new_shiny
> >
> > for disable it would simply drop all the sysfs crud and speed up boot for most users who simply do not care about memory hotplug. new_shiny would ideally provide an interface that would either export logically hotplugable memory ranges (e.g. DIMMs) or a query/action interface which accepts physical ranges as input. Having gazillions of sysfs files is simply unsustainable.
>
> So in this idea, presumably the default would start off being legacy and then later be changed to new_shiny?

Well it really depends. Going with disable as a default would suit most users much better because the vast majority simply doesn't use the functionality. On the other hand real users would regress unless they enable the option. Which is definitely not nice.

Another and much less intrusive change would be creating the sysfs interface on demand. So until somebody actually tries to use the interface it won't exist. I haven't tried to explore how complex that would be. I am not really familiar with sysfs to be honest. But fundamentally nothing should prevent such a solution.

Another option would be to create the sysfs interface only if there is hotplugable memory reported by the platform. But I am not sure we have a proper interface for that for all arches.
(I'm back now)

On Fri, Jun 26, 2020 at 02:47:06PM +0200, Michal Hocko wrote:
> On Mon 22-06-20 15:17:39, Daniel Jordan wrote:
> > Hello Michal,
> >
> > (I've been away and may be slow to respond for a little while)
> >
> > On Fri, Jun 19, 2020 at 02:07:04PM +0200, Michal Hocko wrote:
> > > I believe that we should think about a future interface rather than trying to ducktape the blocksize anytime it causes problems. I would be even tempted to simply add a kernel command line option
> > >
> > > 	memory_hotplug=disable,legacy,new_shiny
> > >
> > > for disable it would simply drop all the sysfs crud and speed up boot for most users who simply do not care about memory hotplug. new_shiny would ideally provide an interface that would either export logically hotplugable memory ranges (e.g. DIMMs) or a query/action interface which accepts physical ranges as input. Having gazillions of sysfs files is simply unsustainable.
> >
> > So in this idea, presumably the default would start off being legacy and then later be changed to new_shiny?
>
> Well it really depends. Going with disable as a default would suit most users much better because the vast majority simply doesn't use the functionality. On the other hand real users would regress unless they enable the option. Which is definitely not nice.

Agreed.

> Another and much less intrusive change would be creating the sysfs interface on demand. So until somebody actually tries to use the interface it won't exist. I haven't tried to explore how complex that would be. I am not really familiar with sysfs to be honest. But fundamentally nothing should prevent such a solution.

Hm, don't know sysfs internals either. It seems possible that someone (or something...) navigating around could trigger creation unintentionally--for instance, by reading the symlinks to the memory block dirs in /sys/devices/system/node/nodeN when trying to find another file to read in the same place.

> Another option would be to create the sysfs interface only if there is hotplugable memory reported by the platform. But I am not sure we have a proper interface for that for all arches.

Systems that happen to have hotpluggable ranges but don't want the overhead would still be stuck, though, it seems.

FWIW, the ideas for new_shiny sound promising.
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8b5f73f5e207c..906fbdb060748 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -55,6 +55,7 @@
 #include <asm/uv/uv.h>
 #include <asm/setup.h>
 #include <asm/ftrace.h>
+#include <asm/hypervisor.h>
 
 #include "mm_internal.h"
 
@@ -1390,6 +1391,15 @@ static unsigned long probe_memory_block_size(void)
 		goto done;
 	}
 
+	/*
+	 * Use max block size to minimize overhead on bare metal, where
+	 * alignment for memory hotplug isn't a concern.
+	 */
+	if (hypervisor_is_type(X86_HYPER_NATIVE)) {
+		bz = MAX_BLOCK_SIZE;
+		goto done;
+	}
+
 	/* Find the largest allowed block size that aligns to memory end */
 	for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
 		if (IS_ALIGNED(boot_mem_end, bz))
Some of our servers spend significant time at kernel boot initializing memory block sysfs directories and then creating symlinks between them and the corresponding nodes. The slowness happens because the machines get stuck with the smallest supported memory block size on x86 (128M), which results in 16,288 directories to cover the 2T of installed RAM. The search for each memory block is noticeable even with commit 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup").

Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based on the end of boot memory") chooses the block size based on alignment with memory end. That addresses hotplug failures in qemu guests, but for bare metal systems whose memory end isn't aligned to even the smallest size, it leaves them at 128M.

Make kernels that aren't running on a hypervisor use the largest supported size (2G) to minimize overhead on big machines. Kernel boot goes 7% faster on the aforementioned servers, shaving off half a second.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---

Applies to 5.7 and today's mainline

 arch/x86/mm/init_64.c | 10 ++++++++++
 1 file changed, 10 insertions(+)
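As a rough sanity check on the numbers in the changelog (the 16,288 figure is a little below the raw 2T/128M quotient, presumably because not every 128M range of the physical map ends up as a present, registered block), the directory-count arithmetic looks like this:

#include <stdio.h>

int main(void)
{
	unsigned long long ram   = 2ULL << 40;		/* 2T of installed RAM */
	unsigned long long small = 128ULL << 20;	/* 128M memory blocks */
	unsigned long long large = 2ULL << 30;		/* 2G memory blocks */

	/* ~16K sysfs directories today vs ~1K with the patch: a 16x reduction. */
	printf("128M blocks: %llu\n", ram / small);	/* 16384 */
	printf("2G blocks:   %llu\n", ram / large);	/* 1024 */
	return 0;
}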