Message ID | 20200604035443.3267046-1-daniel.m.jordan@oracle.com (mailing list archive)
---|---
State | New, archived |
Series | x86/mm: use max memory block size with unaligned memory end
On 04.06.20 05:54, Daniel Jordan wrote:
> Some of our servers spend 14 out of the 21 seconds of kernel boot
> initializing memory block sysfs directories and then creating symlinks
> between them and the corresponding nodes.  The slowness happens because
> the machines get stuck with the smallest supported memory block size on
> x86 (128M), which results in 16,288 directories to cover the 2T of
> installed RAM, and each of these paths does a linear search of the
> memory blocks for every block id, with atomic ops at each step.

With 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in xarray
to accelerate lookup") merged by Linus today (strange, I thought this
would be long upstream) all linear searches should be gone and at least
the performance observation in this patch no longer applies.

The memmap init should nowadays consume most time.
On Thu, Jun 04, 2020 at 09:22:03AM +0200, David Hildenbrand wrote:
> On 04.06.20 05:54, Daniel Jordan wrote:
> > Some of our servers spend 14 out of the 21 seconds of kernel boot
> > initializing memory block sysfs directories and then creating symlinks
> > between them and the corresponding nodes.  The slowness happens because
> > the machines get stuck with the smallest supported memory block size on
> > x86 (128M), which results in 16,288 directories to cover the 2T of
> > installed RAM, and each of these paths does a linear search of the
> > memory blocks for every block id, with atomic ops at each step.
>
> With 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in xarray
> to accelerate lookup") merged by Linus today (strange, I thought this
> would be long upstream)

Ah, thanks for pointing this out!  It was only posted to LKML so I missed it.

> all linear searches should be gone and at least
> the performance observation in this patch no longer applies.

The performance numbers as stated, that's certainly true, but this patch on
top still improves kernel boot by 7%.  It's a savings of half a second -- I'll
take it.

IMHO the root cause of this is really the small block size.  Building a cache
on top to avoid iterating over tons of small blocks seems like papering over
the problem, especially when one of the two affected paths in boot is a
cautious check that might be ready to be removed by now[0]:

	static int init_memory_block(struct memory_block **memory,
				     unsigned long block_id, unsigned long state)
	{
		...
		mem = find_memory_block_by_id(block_id);
		if (mem) {
			put_device(&mem->dev);
			return -EEXIST;
		}

Anyway, I guess I'll redo the changelog and post again.

> The memmap init should nowadays consume most time.

Yeah, but of course it's not as bad as it was now that it's fully parallelized.

[0] https://lore.kernel.org/linux-mm/a8e96df6-dc6d-037f-491c-92182d4ada8d@redhat.com/
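For reference, the gist of that commit is to key the memory block devices by
block id in an xarray, so the per-lookup walk over every device goes away.
Roughly (a simplified sketch, not the verbatim upstream code):

	/* Sketch of the xarray-backed lookup along the lines of 4fb6eabf1037. */
	static DEFINE_XARRAY(memory_blocks);	/* block_id -> struct memory_block * */

	static struct memory_block *find_memory_block_by_id(unsigned long block_id)
	{
		struct memory_block *mem;

		/* Indexed lookup instead of scanning the subsystem's device list. */
		mem = xa_load(&memory_blocks, block_id);
		if (mem)
			get_device(&mem->dev);	/* caller drops this with put_device() */
		return mem;
	}

With the lookup that cheap, the -EEXIST sanity check above costs little
whether or not it stays.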
On 04.06.20 19:22, Daniel Jordan wrote:
> On Thu, Jun 04, 2020 at 09:22:03AM +0200, David Hildenbrand wrote:
>> On 04.06.20 05:54, Daniel Jordan wrote:
>>> Some of our servers spend 14 out of the 21 seconds of kernel boot
>>> initializing memory block sysfs directories and then creating symlinks
>>> between them and the corresponding nodes.  The slowness happens because
>>> the machines get stuck with the smallest supported memory block size on
>>> x86 (128M), which results in 16,288 directories to cover the 2T of
>>> installed RAM, and each of these paths does a linear search of the
>>> memory blocks for every block id, with atomic ops at each step.
>>
>> With 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in xarray
>> to accelerate lookup") merged by Linus today (strange, I thought this
>> would be long upstream)
>
> Ah, thanks for pointing this out!  It was only posted to LKML so I missed it.
>
>> all linear searches should be gone and at least
>> the performance observation in this patch no longer applies.
>
> The performance numbers as stated, that's certainly true, but this patch on
> top still improves kernel boot by 7%.  It's a savings of half a second --
> I'll take it.
>
> IMHO the root cause of this is really the small block size.  Building a
> cache on top to avoid iterating over tons of small blocks seems like
> papering over the problem, especially when one of the two affected paths
> in boot is a

The memory block size dictates your memory hot(un)plug granularity.  E.g., on
powerpc that's 16MB so they have *a lot* of memory blocks.  That's why that's
not papering over the problem.  Increasing the memory block size isn't always
the answer.

(there are other, still fairly academic approaches to power down memory banks
where you also want small memory blocks instead)

> cautious check that might be ready to be removed by now[0]:

Yeah, we discussed that somewhere already.  My change only highlighted the
problem.  And now that it's cheap, it can just stay unless there is a very
good reason not to do it.

>
> static int init_memory_block(struct memory_block **memory,
>			       unsigned long block_id, unsigned long state)
> {
>	...
>	mem = find_memory_block_by_id(block_id);
>	if (mem) {
>		put_device(&mem->dev);
>		return -EEXIST;
>	}
>
> Anyway, I guess I'll redo the changelog and post again.
>
>> The memmap init should nowadays consume most time.
>
> Yeah, but of course it's not as bad as it was now that it's fully parallelized.

Right.  I also observed that computing if a zone is contiguous can be
expensive.
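For context, the expensive part is that the contiguity check walks the zone
span pageblock by pageblock.  Something along these lines (a simplified
approximation of what mm/page_alloc.c does when it sets zone->contiguous; the
helper name here is made up):

	static bool zone_looks_contiguous(struct zone *zone)
	{
		unsigned long pfn = zone->zone_start_pfn;
		unsigned long end = zone_end_pfn(zone);

		/* One pageblock at a time across the whole span -> slow on big zones. */
		for (; pfn < end; pfn += pageblock_nr_pages) {
			if (!pfn_valid(pfn))
				return false;
			if (page_zone(pfn_to_page(pfn)) != zone)
				return false;
		}
		return true;
	}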
On Thu, Jun 04, 2020 at 07:45:40PM +0200, David Hildenbrand wrote:
> On 04.06.20 19:22, Daniel Jordan wrote:
> > IMHO the root cause of this is really the small block size.  Building a
> > cache on top to avoid iterating over tons of small blocks seems like
> > papering over the problem, especially when one of the two affected paths
> > in boot is a
>
> The memory block size dictates your memory hot(un)plug granularity.

Indeed.

> E.g., on powerpc that's 16MB so they have *a lot* of memory blocks.
> That's why that's not papering over the problem.  Increasing the memory
> block size isn't always the answer.

Ok.  If you don't mind, what's the purpose of hotplugging at that granularity?
I'm simply curious.

> > cautious check that might be ready to be removed by now[0]:
>
> Yeah, we discussed that somewhere already.  My change only highlighted
> the problem.  And now that it's cheap, it can just stay unless there is a
> very good reason not to do it.

Agreed.

> > Yeah, but of course it's not as bad as it was now that it's fully
> > parallelized.
>
> Right.  I also observed that computing if a zone is contiguous can be
> expensive.

That's right, I remember that.  It's on my list :)
>> E.g., on powerpc that's 16MB so they have *a lot* of memory blocks.
>> That's why that's not papering over the problem.  Increasing the memory
>> block size isn't always the answer.
>
> Ok.  If you don't mind, what's the purpose of hotplugging at that granularity?
> I'm simply curious.

On bare metal: none with that big machines AFAIKS. :)

For VMs/partitions it gives you much more flexibility ("cloud", kata
containers, memory overcommit, ...).

Assume you have a VM with some initial memory size (e.g., 32GB).  By
hotplugging up to 256 DIMMs you can grow in small steps (e.g., 128MB, up to
64GB, 256MB, up to 96GB, ...).  And if you online all the memory blocks
MOVABLE, you can shrink in these small steps.

Regarding PowerPC, AFAIK it also gives the OS more flexibility to find memory
blocks that can be offlined and unplugged, especially without the MOVABLE
zone.  Finding some scattered 16MB blocks that can be offlined is easier than
finding one bigger (e.g., 2GB) memory block that can be offlined.  And the
history of powerpc dlpar dates back to pre-MOVABLE days (there is a paper
from 2003).

>
>>> Yeah, but of course it's not as bad as it was now that it's fully
>>> parallelized.
>>
>> Right.  I also observed that computing if a zone is contiguous can be
>> expensive.
>
> That's right, I remember that.  It's on my list :)

I do think your change mostly affects bare metal where you do not care about
hotplugging small memory blocks.  Maybe an even better check would be

	if (!in_vm()) {
		bz = MAX_BLOCK_SIZE;
		goto none;
	}

because I doubt we have bare metal machines > 64 where we want to
hot(un)plug DIMMs < 2G.  But maybe there is a use case I am not aware of ...
and I don't know an easy way to check whether we are running inside a VM or
not (like kvm_para_available() ...).
On 6/4/20 11:12 AM, Daniel Jordan wrote:
>> E.g., on powerpc that's 16MB so they have *a lot* of memory blocks.
>> That's why that's not papering over the problem.  Increasing the memory
>> block size isn't always the answer.
> Ok.  If you don't mind, what's the purpose of hotplugging at that granularity?
> I'm simply curious.

FWIW, the 128MB on x86 came from the original sparsemem/hotplug
implementation.  It was the size of the smallest DIMM that my server
system at the time would take.  ppc64's huge page size was and is 16MB
and that's also the granularity with which hypervisors did hot-add way
back then.  I'm not actually sure what they do now.

My belief at the time was that the section size would grow over time as
DIMMs and hotplug units grew.  I was young and naive. :)

I actually can't think of anything that's *keeping* it at 128MB on x86
though.  We don't, for instance, require a whole section to be
pfn_valid().
On Thu, Jun 04, 2020 at 08:55:19PM +0200, David Hildenbrand wrote:
> >> E.g., on powerpc that's 16MB so they have *a lot* of memory blocks.
> >> That's why that's not papering over the problem.  Increasing the memory
> >> block size isn't always the answer.
> >
> > Ok.  If you don't mind, what's the purpose of hotplugging at that granularity?
> > I'm simply curious.
>
> On bare metal: none with that big machines AFAIKS. :)

Sounds about right :)

> For VMs/partitions it gives you much more flexibility ("cloud", kata
> containers, memory overcommit, ...).
>
> Assume you have a VM with some initial memory size (e.g., 32GB).  By
> hotplugging up to 256 DIMMs you can grow in small steps (e.g., 128MB, up
> to 64GB, 256MB, up to 96GB, ...).  And if you online all the memory
> blocks MOVABLE, you can shrink in these small steps.

Yeah, sorry for not being clear, I meant why does powerpc hotplug at "only"
16M.

> Regarding PowerPC, AFAIK it also gives the OS more flexibility to find
> memory blocks that can be offlined and unplugged, especially without the
> MOVABLE zone.  Finding some scattered 16MB blocks that can be offlined is
> easier than finding one bigger (e.g., 2GB) memory block that can be
> offlined.  And the history of powerpc dlpar dates back to pre-MOVABLE
> days (there is a paper from 2003).

Makes sense, thanks!

> I do think your change mostly affects bare metal where you do not care
> about hotplugging small memory blocks.  Maybe an even better check would be
>
>	if (!in_vm()) {
>		bz = MAX_BLOCK_SIZE;
>		goto none;
>	}
>
> because I doubt we have bare metal machines > 64 where we want to
> hot(un)plug DIMMs < 2G.

Yeah, agreed, not these days.

> But maybe there is a use case I am not aware of
> ... and I don't know an easy way to check whether we are running inside
> a VM or not (like kvm_para_available() ...).

What about this?  Works on bare metal and kvm, so presumably all the other
HVs too.

	if (x86_hyper_type == X86_HYPER_NATIVE) {
		bz = MAX_BLOCK_SIZE;
		goto done;
	}
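In context, that check would sit near the top of probe_memory_block_size().
A condensed sketch (the existing command-line override and small-machine
paths in arch/x86/mm/init_64.c are elided, so this is illustrative only):

	#include <asm/hypervisor.h>	/* x86_hyper_type, X86_HYPER_NATIVE */

	static unsigned long probe_memory_block_size(void)
	{
		unsigned long boot_mem_end = max_pfn << PAGE_SHIFT;
		unsigned long bz;

		/* No hypervisor detected: bare metal won't hot(un)plug sub-2G DIMMs. */
		if (x86_hyper_type == X86_HYPER_NATIVE) {
			bz = MAX_BLOCK_SIZE;
			goto done;
		}

		/* Existing logic: largest allowed block size that aligns to memory end. */
		for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1)
			if (IS_ALIGNED(boot_mem_end, bz))
				break;
	done:
		return bz;
	}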
On Thu, Jun 04, 2020 at 01:00:55PM -0700, Dave Hansen wrote:
> On 6/4/20 11:12 AM, Daniel Jordan wrote:
> >> E.g., on powerpc that's 16MB so they have *a lot* of memory blocks.
> >> That's why that's not papering over the problem.  Increasing the memory
> >> block size isn't always the answer.
> > Ok.  If you don't mind, what's the purpose of hotplugging at that granularity?
> > I'm simply curious.
>
> FWIW, the 128MB on x86 came from the original sparsemem/hotplug
> implementation.  It was the size of the smallest DIMM that my server
> system at the time would take.  ppc64's huge page size was and is 16MB
> and that's also the granularity with which hypervisors did hot-add way
> back then.  I'm not actually sure what they do now.

Interesting, that tells me a lot more than the "matt - 128 is convenient
right now" comment that has always weirdly stuck out at me.

> I actually can't think of anything that's *keeping* it at 128MB on x86
> though.  We don't, for instance, require a whole section to be
> pfn_valid().

Hm, something to look into.
On 04.06.20 22:00, Dave Hansen wrote:
> On 6/4/20 11:12 AM, Daniel Jordan wrote:
>>> E.g., on powerpc that's 16MB so they have *a lot* of memory blocks.
>>> That's why that's not papering over the problem.  Increasing the memory
>>> block size isn't always the answer.
>> Ok.  If you don't mind, what's the purpose of hotplugging at that granularity?
>> I'm simply curious.
>
> FWIW, the 128MB on x86 came from the original sparsemem/hotplug
> implementation.  It was the size of the smallest DIMM that my server
> system at the time would take.  ppc64's huge page size was and is 16MB
> and that's also the granularity with which hypervisors did hot-add way
> back then.  I'm not actually sure what they do now.
>
> My belief at the time was that the section size would grow over time as
> DIMMs and hotplug units grew.  I was young and naive. :)

BTW, I recently studied your old hotplug papers and they are highly
appreciated :)

> I actually can't think of anything that's *keeping* it at 128MB on x86
> though.  We don't, for instance, require a whole section to be
> pfn_valid().

Well, sub-section hotadd is only done for vmemmap and we only use it for
!(memory block devices) stuff, a.k.a. ZONE_DEVICE.  IIRC, sub-section hotadd
works in granularity of 2M.

AFAIK:
- The lower limit for a section is MAX_ORDER - 1 / pageblock_order
- The smaller the section, the more bits are wasted to store the section
  number in page->flags for page_to_pfn() (!vmemmap IIRC)
- The smaller the section, the bigger the section array(s)
- We want to make sure the section memmap always spans full pages (IIRC, not
  always the case e.g., arm64 with 256k page size.  But arm64 is weird either
  way - 512MB (transparent) huge pages with 64k base pages ...)

Changing the section size to get rid of sub-section memory hotadd does not
seem to be easily possible.  I assume we don't want to create memory block
devices for something as small as current sub-section memory hotadd size
(e.g., 2MB).  So having significantly smaller sections might not make too
much sense and your initial section size might have been a very good,
initial pick :)
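To put rough numbers on the page->flags point for x86-64 (assuming 4-level
paging, so MAX_PHYSMEM_BITS = 46; illustrative only):

	/* Classic (non-vmemmap) sparsemem stores the section number in page->flags. */
	#define MAX_PHYSMEM_BITS	46	/* 4-level paging */
	#define SECTION_SIZE_BITS	27	/* 128M sections */
	#define SECTIONS_SHIFT		(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)	/* = 19 bits */
	/*
	 * Halving the section size to 64M (SECTION_SIZE_BITS = 26) costs one more
	 * bit of page->flags and doubles the mem_section bookkeeping, which is
	 * the trade-off against ever-smaller sections.
	 */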
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8b5f73f5e207c..d388127d1b519 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1390,6 +1390,15 @@ static unsigned long probe_memory_block_size(void)
 		goto done;
 	}
 
+	/*
+	 * Memory end isn't aligned to any allowed block size, so default to
+	 * the largest to minimize overhead on large memory systems.
+	 */
+	if (!IS_ALIGNED(boot_mem_end, MIN_MEMORY_BLOCK_SIZE)) {
+		bz = MAX_BLOCK_SIZE;
+		goto done;
+	}
+
 	/* Find the largest allowed block size that aligns to memory end */
 	for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
 		if (IS_ALIGNED(boot_mem_end, bz))
Some of our servers spend 14 out of the 21 seconds of kernel boot
initializing memory block sysfs directories and then creating symlinks
between them and the corresponding nodes.  The slowness happens because the
machines get stuck with the smallest supported memory block size on x86
(128M), which results in 16,288 directories to cover the 2T of installed
RAM, and each of these paths does a linear search of the memory blocks for
every block id, with atomic ops at each step.

Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based on
the end of boot memory") chooses the block size based on alignment with
memory end.  That addresses hotplug failures in qemu guests, but for bare
metal systems whose memory end isn't aligned to the smallest size, it leaves
them at 128M.

For such systems, use the largest supported size (2G) to minimize overhead
on big machines.  That saves nearly all of the 14 seconds so the kernel
boots 3x faster.

There are some simple ways to avoid the linear searches, but for now it
makes no difference with a 2G block.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 arch/x86/mm/init_64.c | 9 +++++++++
 1 file changed, 9 insertions(+)


base-commit: 3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162
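To make the selection change concrete, here is a small userspace
re-implementation of the old and new logic (illustrative only; the constants
mirror the x86 definitions and the example memory end is made up):

	#include <stdio.h>

	#define MIN_MEMORY_BLOCK_SIZE	(1UL << 27)	/* 128M, sparsemem section size on x86-64 */
	#define MAX_BLOCK_SIZE		(2UL << 30)	/* 2G */
	#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

	/* Old behaviour: walk down from 2G; fall back to 128M if nothing divides mem_end. */
	static unsigned long old_pick(unsigned long mem_end)
	{
		unsigned long bz;

		for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1)
			if (IS_ALIGNED(mem_end, bz))
				return bz;
		return MIN_MEMORY_BLOCK_SIZE;
	}

	/* New behaviour: if mem_end isn't even 128M-aligned, no size can match, so use 2G. */
	static unsigned long new_pick(unsigned long mem_end)
	{
		if (!IS_ALIGNED(mem_end, MIN_MEMORY_BLOCK_SIZE))
			return MAX_BLOCK_SIZE;
		return old_pick(mem_end);
	}

	int main(void)
	{
		/* Hypothetical 2T machine whose usable memory ends on an odd boundary. */
		unsigned long mem_end = (2048UL << 30) + 0x6600000;	/* 2T + 102M */

		printf("old: %luM  new: %luM\n", old_pick(mem_end) >> 20,
		       new_pick(mem_end) >> 20);	/* prints: old: 128M  new: 2048M */
		return 0;
	}

With a 2G block, a ~2T machine ends up with roughly 1,000 memory block
directories instead of ~16,000, which is where the boot-time saving comes
from; the chosen size is visible at runtime in
/sys/devices/system/memory/block_size_bytes.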