Message ID | 20200604035443.3267046-1-daniel.m.jordan@oracle.com (mailing list archive)
---|---
State | New, archived |
Series | x86/mm: use max memory block size with unaligned memory end
On 04.06.20 05:54, Daniel Jordan wrote:
> Some of our servers spend 14 out of the 21 seconds of kernel boot
> initializing memory block sysfs directories and then creating symlinks
> between them and the corresponding nodes.  The slowness happens because
> the machines get stuck with the smallest supported memory block size on
> x86 (128M), which results in 16,288 directories to cover the 2T of
> installed RAM, and each of these paths does a linear search of the
> memory blocks for every block id, with atomic ops at each step.

With 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in xarray
to accelerate lookup") merged by Linus today (strange, I thought this
would be long upstream) all linear searches should be gone and at least
the performance observation in this patch no longer applies.

The memmap init should nowadays consume most time.
On Thu, Jun 04, 2020 at 09:22:03AM +0200, David Hildenbrand wrote:
> On 04.06.20 05:54, Daniel Jordan wrote:
> > Some of our servers spend 14 out of the 21 seconds of kernel boot
> > initializing memory block sysfs directories and then creating symlinks
> > between them and the corresponding nodes.  The slowness happens because
> > the machines get stuck with the smallest supported memory block size on
> > x86 (128M), which results in 16,288 directories to cover the 2T of
> > installed RAM, and each of these paths does a linear search of the
> > memory blocks for every block id, with atomic ops at each step.
>
> With 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in xarray
> to accelerate lookup") merged by Linus today (strange, I thought this
> would be long upstream)

Ah, thanks for pointing this out!  It was only posted to LKML so I missed it.

> all linear searches should be gone and at least
> the performance observation in this patch no longer applies.

The performance numbers as stated, that's certainly true, but this patch on
top still improves kernel boot by 7%.  It's a savings of half a second -- I'll
take it.

IMHO the root cause of this is really the small block size.  Building a cache
on top to avoid iterating over tons of small blocks seems like papering over
the problem, especially when one of the two affected paths in boot is a
cautious check that might be ready to be removed by now[0]:

	static int init_memory_block(struct memory_block **memory,
				     unsigned long block_id, unsigned long state)
	{
		...
		mem = find_memory_block_by_id(block_id);
		if (mem) {
			put_device(&mem->dev);
			return -EEXIST;
		}

Anyway, I guess I'll redo the changelog and post again.

> The memmap init should nowadays consume most time.

Yeah, but of course it's not as bad as it was now that it's fully parallelized.

[0] https://lore.kernel.org/linux-mm/a8e96df6-dc6d-037f-491c-92182d4ada8d@redhat.com/
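For reference, the gist of that commit is to key the memory block devices by
block id in an xarray, so the per-lookup walk over every device goes away.
Roughly (a simplified sketch, not the verbatim upstream code):

	/* Sketch of the xarray-backed lookup along the lines of 4fb6eabf1037. */
	static DEFINE_XARRAY(memory_blocks);	/* block_id -> struct memory_block * */

	static struct memory_block *find_memory_block_by_id(unsigned long block_id)
	{
		struct memory_block *mem;

		/* Indexed lookup instead of scanning the subsystem's device list. */
		mem = xa_load(&memory_blocks, block_id);
		if (mem)
			get_device(&mem->dev);	/* caller drops this with put_device() */
		return mem;
	}

With the lookup that cheap, the -EEXIST sanity check above costs little
whether or not it stays.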
On 04.06.20 19:22, Daniel Jordan wrote:
> On Thu, Jun 04, 2020 at 09:22:03AM +0200, David Hildenbrand wrote:
>> On 04.06.20 05:54, Daniel Jordan wrote:
>>> Some of our servers spend 14 out of the 21 seconds of kernel boot
>>> initializing memory block sysfs directories and then creating symlinks
>>> between them and the corresponding nodes.  The slowness happens because
>>> the machines get stuck with the smallest supported memory block size on
>>> x86 (128M), which results in 16,288 directories to cover the 2T of
>>> installed RAM, and each of these paths does a linear search of the
>>> memory blocks for every block id, with atomic ops at each step.
>>
>> With 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in xarray
>> to accelerate lookup") merged by Linus today (strange, I thought this
>> would be long upstream)
>
> Ah, thanks for pointing this out!  It was only posted to LKML so I missed it.
>
>> all linear searches should be gone and at least
>> the performance observation in this patch no longer applies.
>
> The performance numbers as stated, that's certainly true, but this patch on
> top still improves kernel boot by 7%.  It's a savings of half a second --
> I'll take it.
>
> IMHO the root cause of this is really the small block size.  Building a
> cache on top to avoid iterating over tons of small blocks seems like
> papering over the problem, especially when one of the two affected paths
> in boot is a

The memory block size dictates your memory hot(un)plug granularity.  E.g., on
powerpc that's 16MB so they have *a lot* of memory blocks.  That's why that's
not papering over the problem.  Increasing the memory block size isn't always
the answer.

(there are other, still fairly academic approaches to power down memory banks
where you also want small memory blocks instead)

> cautious check that might be ready to be removed by now[0]:

Yeah, we discussed that somewhere already.  My change only highlighted the
problem.  And now that it's cheap, it can just stay unless there is a very
good reason not to do it.

>
> static int init_memory_block(struct memory_block **memory,
>			       unsigned long block_id, unsigned long state)
> {
>	...
>	mem = find_memory_block_by_id(block_id);
>	if (mem) {
>		put_device(&mem->dev);
>		return -EEXIST;
>	}
>
> Anyway, I guess I'll redo the changelog and post again.
>
>> The memmap init should nowadays consume most time.
>
> Yeah, but of course it's not as bad as it was now that it's fully parallelized.

Right.  I also observed that computing if a zone is contiguous can be
expensive.
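For context, the expensive part is that the contiguity check walks the zone
span pageblock by pageblock.  Something along these lines (a simplified
approximation of what mm/page_alloc.c does when it sets zone->contiguous; the
helper name here is made up):

	static bool zone_looks_contiguous(struct zone *zone)
	{
		unsigned long pfn = zone->zone_start_pfn;
		unsigned long end = zone_end_pfn(zone);

		/* One pageblock at a time across the whole span -> slow on big zones. */
		for (; pfn < end; pfn += pageblock_nr_pages) {
			if (!pfn_valid(pfn))
				return false;
			if (page_zone(pfn_to_page(pfn)) != zone)
				return false;
		}
		return true;
	}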
On Thu, Jun 04, 2020 at 07:45:40PM +0200, David Hildenbrand wrote:
> On 04.06.20 19:22, Daniel Jordan wrote:
> > IMHO the root cause of this is really the small block size.  Building a
> > cache on top to avoid iterating over tons of small blocks seems like
> > papering over the problem, especially when one of the two affected paths
> > in boot is a
>
> The memory block size dictates your memory hot(un)plug granularity.

Indeed.

> E.g., on powerpc that's 16MB so they have *a lot* of memory blocks.
> That's why that's not papering over the problem.  Increasing the memory
> block size isn't always the answer.

Ok.  If you don't mind, what's the purpose of hotplugging at that granularity?
I'm simply curious.

> > cautious check that might be ready to be removed by now[0]:
>
> Yeah, we discussed that somewhere already.  My change only highlighted
> the problem.  And now that it's cheap, it can just stay unless there is a
> very good reason not to do it.

Agreed.

> > Yeah, but of course it's not as bad as it was now that it's fully
> > parallelized.
>
> Right.  I also observed that computing if a zone is contiguous can be
> expensive.

That's right, I remember that.  It's on my list :)
>> E.g., on powerpc that's 16MB so they have *a lot* of memory blocks.
>> That's why that's not papering over the problem.  Increasing the memory
>> block size isn't always the answer.
>
> Ok.  If you don't mind, what's the purpose of hotplugging at that granularity?
> I'm simply curious.

On bare metal: none with that big machines AFAIKS. :)

For VMs/partitions it gives you much more flexibility ("cloud", kata
containers, memory overcommit, ...).

Assume you have a VM with some initial memory size (e.g., 32GB).  By
hotplugging up to 256 DIMMs you can grow in small steps (e.g., 128MB, up to
64GB, 256MB, up to 96GB, ...).  And if you online all the memory blocks
MOVABLE, you can shrink in these small steps.

Regarding PowerPC, AFAIK it also gives the OS more flexibility to find memory
blocks that can be offlined and unplugged, especially without the MOVABLE
zone.  Finding some scattered 16MB blocks that can be offlined is easier than
finding one bigger (e.g., 2GB) memory block that can be offlined.  And the
history of powerpc dlpar dates back to pre-MOVABLE days (there is a paper
from 2003).

>
>>> Yeah, but of course it's not as bad as it was now that it's fully
>>> parallelized.
>>
>> Right.  I also observed that computing if a zone is contiguous can be
>> expensive.
>
> That's right, I remember that.  It's on my list :)

I do think your change mostly affects bare metal where you do not care about
hotplugging small memory blocks.  Maybe an even better check would be

	if (!in_vm()) {
		bz = MAX_BLOCK_SIZE;
		goto none;
	}

because I doubt we have bare metal machines > 64 where we want to
hot(un)plug DIMMs < 2G.  But maybe there is a use case I am not aware of ...
and I don't know an easy way to check whether we are running inside a VM or
not (like kvm_para_available() ...).
On 6/4/20 11:12 AM, Daniel Jordan wrote:
>> E.g., on powerpc that's 16MB so they have *a lot* of memory blocks.
>> That's why that's not papering over the problem.  Increasing the memory
>> block size isn't always the answer.
> Ok.  If you don't mind, what's the purpose of hotplugging at that granularity?
> I'm simply curious.

FWIW, the 128MB on x86 came from the original sparsemem/hotplug
implementation.  It was the size of the smallest DIMM that my server
system at the time would take.  ppc64's huge page size was and is 16MB
and that's also the granularity with which hypervisors did hot-add way
back then.  I'm not actually sure what they do now.

My belief at the time was that the section size would grow over time as
DIMMs and hotplug units grew.  I was young and naive. :)

I actually can't think of anything that's *keeping* it at 128MB on x86
though.  We don't, for instance, require a whole section to be
pfn_valid().
On Thu, Jun 04, 2020 at 08:55:19PM +0200, David Hildenbrand wrote:
> >> E.g., on powerpc that's 16MB so they have *a lot* of memory blocks.
> >> That's why that's not papering over the problem.  Increasing the memory
> >> block size isn't always the answer.
> >
> > Ok.  If you don't mind, what's the purpose of hotplugging at that granularity?
> > I'm simply curious.
>
> On bare metal: none with that big machines AFAIKS. :)

Sounds about right :)

> For VMs/partitions it gives you much more flexibility ("cloud", kata
> containers, memory overcommit, ...).
>
> Assume you have a VM with some initial memory size (e.g., 32GB).  By
> hotplugging up to 256 DIMMs you can grow in small steps (e.g., 128MB, up
> to 64GB, 256MB, up to 96GB, ...).  And if you online all the memory
> blocks MOVABLE, you can shrink in these small steps.

Yeah, sorry for not being clear, I meant why does powerpc hotplug at "only"
16M.

> Regarding PowerPC, AFAIK it also gives the OS more flexibility to find
> memory blocks that can be offlined and unplugged, especially without the
> MOVABLE zone.  Finding some scattered 16MB blocks that can be offlined is
> easier than finding one bigger (e.g., 2GB) memory block that can be
> offlined.  And the history of powerpc dlpar dates back to pre-MOVABLE
> days (there is a paper from 2003).

Makes sense, thanks!

> I do think your change mostly affects bare metal where you do not care
> about hotplugging small memory blocks.  Maybe an even better check would be
>
>	if (!in_vm()) {
>		bz = MAX_BLOCK_SIZE;
>		goto none;
>	}
>
> because I doubt we have bare metal machines > 64 where we want to
> hot(un)plug DIMMs < 2G.

Yeah, agreed, not these days.

> But maybe there is a use case I am not aware of
> ... and I don't know an easy way to check whether we are running inside
> a VM or not (like kvm_para_available() ...).

What about this?  Works on bare metal and kvm, so presumably all the other
HVs too.

	if (x86_hyper_type == X86_HYPER_NATIVE) {
		bz = MAX_BLOCK_SIZE;
		goto done;
	}
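In context, that check would sit near the top of probe_memory_block_size().
A condensed sketch (the existing command-line override and small-machine
paths in arch/x86/mm/init_64.c are elided, so this is illustrative only):

	#include <asm/hypervisor.h>	/* x86_hyper_type, X86_HYPER_NATIVE */

	static unsigned long probe_memory_block_size(void)
	{
		unsigned long boot_mem_end = max_pfn << PAGE_SHIFT;
		unsigned long bz;

		/* No hypervisor detected: bare metal won't hot(un)plug sub-2G DIMMs. */
		if (x86_hyper_type == X86_HYPER_NATIVE) {
			bz = MAX_BLOCK_SIZE;
			goto done;
		}

		/* Existing logic: largest allowed block size that aligns to memory end. */
		for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1)
			if (IS_ALIGNED(boot_mem_end, bz))
				break;
	done:
		return bz;
	}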
On Thu, Jun 04, 2020 at 01:00:55PM -0700, Dave Hansen wrote:
> On 6/4/20 11:12 AM, Daniel Jordan wrote:
> >> E.g., on powerpc that's 16MB so they have *a lot* of memory blocks.
> >> That's why that's not papering over the problem.  Increasing the memory
> >> block size isn't always the answer.
> > Ok.  If you don't mind, what's the purpose of hotplugging at that granularity?
> > I'm simply curious.
>
> FWIW, the 128MB on x86 came from the original sparsemem/hotplug
> implementation.  It was the size of the smallest DIMM that my server
> system at the time would take.  ppc64's huge page size was and is 16MB
> and that's also the granularity with which hypervisors did hot-add way
> back then.  I'm not actually sure what they do now.

Interesting, that tells me a lot more than the "matt - 128 is convenient
right now" comment that has always weirdly stuck out at me.

> I actually can't think of anything that's *keeping* it at 128MB on x86
> though.  We don't, for instance, require a whole section to be
> pfn_valid().

Hm, something to look into.
On 04.06.20 22:00, Dave Hansen wrote:
> On 6/4/20 11:12 AM, Daniel Jordan wrote:
>>> E.g., on powerpc that's 16MB so they have *a lot* of memory blocks.
>>> That's why that's not papering over the problem.  Increasing the memory
>>> block size isn't always the answer.
>> Ok.  If you don't mind, what's the purpose of hotplugging at that granularity?
>> I'm simply curious.
>
> FWIW, the 128MB on x86 came from the original sparsemem/hotplug
> implementation.  It was the size of the smallest DIMM that my server
> system at the time would take.  ppc64's huge page size was and is 16MB
> and that's also the granularity with which hypervisors did hot-add way
> back then.  I'm not actually sure what they do now.
>
> My belief at the time was that the section size would grow over time as
> DIMMs and hotplug units grew.  I was young and naive. :)

BTW, I recently studied your old hotplug papers and they are highly
appreciated :)

> I actually can't think of anything that's *keeping* it at 128MB on x86
> though.  We don't, for instance, require a whole section to be
> pfn_valid().

Well, sub-section hotadd is only done for vmemmap and we only use it for
!(memory block devices) stuff, a.k.a. ZONE_DEVICE.  IIRC, sub-section hotadd
works in granularity of 2M.

AFAIK:
- The lower limit for a section is MAX_ORDER - 1 / pageblock_order
- The smaller the section, the more bits are wasted to store the section
  number in page->flags for page_to_pfn() (!vmemmap IIRC)
- The smaller the section, the bigger the section array(s)
- We want to make sure the section memmap always spans full pages (IIRC, not
  always the case e.g., arm64 with 256k page size.  But arm64 is weird either
  way - 512MB (transparent) huge pages with 64k base pages ...)

Changing the section size to get rid of sub-section memory hotadd does not
seem to be easily possible.  I assume we don't want to create memory block
devices for something as small as current sub-section memory hotadd size
(e.g., 2MB).  So having significantly smaller sections might not make too
much sense and your initial section size might have been a very good,
initial pick :)
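To put rough numbers on the page->flags point for x86-64 (assuming 4-level
paging, so MAX_PHYSMEM_BITS = 46; illustrative only):

	/* Classic (non-vmemmap) sparsemem stores the section number in page->flags. */
	#define MAX_PHYSMEM_BITS	46	/* 4-level paging */
	#define SECTION_SIZE_BITS	27	/* 128M sections */
	#define SECTIONS_SHIFT		(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)	/* = 19 bits */
	/*
	 * Halving the section size to 64M (SECTION_SIZE_BITS = 26) costs one more
	 * bit of page->flags and doubles the mem_section bookkeeping, which is
	 * the trade-off against ever-smaller sections.
	 */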
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8b5f73f5e207c..d388127d1b519 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1390,6 +1390,15 @@ static unsigned long probe_memory_block_size(void)
 		goto done;
 	}
 
+	/*
+	 * Memory end isn't aligned to any allowed block size, so default to
+	 * the largest to minimize overhead on large memory systems.
+	 */
+	if (!IS_ALIGNED(boot_mem_end, MIN_MEMORY_BLOCK_SIZE)) {
+		bz = MAX_BLOCK_SIZE;
+		goto done;
+	}
+
 	/* Find the largest allowed block size that aligns to memory end */
 	for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
 		if (IS_ALIGNED(boot_mem_end, bz))
Some of our servers spend 14 out of the 21 seconds of kernel boot
initializing memory block sysfs directories and then creating symlinks
between them and the corresponding nodes.  The slowness happens because the
machines get stuck with the smallest supported memory block size on x86
(128M), which results in 16,288 directories to cover the 2T of installed
RAM, and each of these paths does a linear search of the memory blocks for
every block id, with atomic ops at each step.

Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based on
the end of boot memory") chooses the block size based on alignment with
memory end.  That addresses hotplug failures in qemu guests, but for bare
metal systems whose memory end isn't aligned to the smallest size, it leaves
them at 128M.

For such systems, use the largest supported size (2G) to minimize overhead
on big machines.  That saves nearly all of the 14 seconds so the kernel
boots 3x faster.

There are some simple ways to avoid the linear searches, but for now it
makes no difference with a 2G block.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 arch/x86/mm/init_64.c | 9 +++++++++
 1 file changed, 9 insertions(+)


base-commit: 3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162
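To make the selection change concrete, here is a small userspace
re-implementation of the old and new logic (illustrative only; the constants
mirror the x86 definitions and the example memory end is made up):

	#include <stdio.h>

	#define MIN_MEMORY_BLOCK_SIZE	(1UL << 27)	/* 128M, sparsemem section size on x86-64 */
	#define MAX_BLOCK_SIZE		(2UL << 30)	/* 2G */
	#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

	/* Old behaviour: walk down from 2G; fall back to 128M if nothing divides mem_end. */
	static unsigned long old_pick(unsigned long mem_end)
	{
		unsigned long bz;

		for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1)
			if (IS_ALIGNED(mem_end, bz))
				return bz;
		return MIN_MEMORY_BLOCK_SIZE;
	}

	/* New behaviour: if mem_end isn't even 128M-aligned, no size can match, so use 2G. */
	static unsigned long new_pick(unsigned long mem_end)
	{
		if (!IS_ALIGNED(mem_end, MIN_MEMORY_BLOCK_SIZE))
			return MAX_BLOCK_SIZE;
		return old_pick(mem_end);
	}

	int main(void)
	{
		/* Hypothetical 2T machine whose usable memory ends on an odd boundary. */
		unsigned long mem_end = (2048UL << 30) + 0x6600000;	/* 2T + 102M */

		printf("old: %luM  new: %luM\n", old_pick(mem_end) >> 20,
		       new_pick(mem_end) >> 20);	/* prints: old: 128M  new: 2048M */
		return 0;
	}

With a 2G block, a ~2T machine ends up with roughly 1,000 memory block
directories instead of ~16,000, which is where the boot-time saving comes
from; the chosen size is visible at runtime in
/sys/devices/system/memory/block_size_bytes.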