
[1/4] mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist

Message ID 20190528120453.27374-1-npiggin@gmail.com (mailing list archive)
State New, archived

Commit Message

Nicholas Piggin May 28, 2019, 12:04 p.m. UTC
The kernel currently clamps large system hashes to MAX_ORDER when
hashdist is not set, which is rather arbitrary.

vmalloc space is limited on 32-bit machines, but this change shouldn't
consume much more of it, because the small physical memory on those
machines already limits system hash sizes.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 mm/page_alloc.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

Comments

Linus Torvalds May 31, 2019, 6:30 p.m. UTC | #1
On Tue, May 28, 2019 at 5:08 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> The kernel currently clamps large system hashes to MAX_ORDER when
> hashdist is not set, which is rather arbitrary.

I think the *really* arbitrary part here is "hashdist".

If you enable NUMA support, hashdist is just set to 1 by default on
64-bit, whether the machine actually has any numa characteristics or
not. So you take that vmalloc() TLB overhead whether you need it or
not.

So I think your series looks sane, and should help the vmalloc case
for big hash allocations, but I also think that this whole
alloc_large_system_hash() function should be smarter in general.

Yes, it's called "alloc_large_system_hash()", but it's used on small
and perfectly normal-sized systems too, and often for not all that big
hashes.

Yes, we tend to try to make some of those hashes large (dentry one in
particular), but we also use this for small stuff.

For example, on my machine I have several network hashes that have
order 6-8 sizes, none of which really make any sense to use vmalloc
space for (and which are smaller than a large page, so your patch
series wouldn't help).

So on the whole I have no issues with this series, but I do think we
should maybe fix that crazy "if (hashdist)" case. Hmm?

                   Linus
Nicholas Piggin June 3, 2019, 2:22 a.m. UTC | #2
Linus Torvalds's on June 1, 2019 4:30 am:
> On Tue, May 28, 2019 at 5:08 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> The kernel currently clamps large system hashes to MAX_ORDER when
>> hashdist is not set, which is rather arbitrary.
> 
> I think the *really* arbitrary part here is "hashdist".
> 
> If you enable NUMA support, hashdist is just set to 1 by default on
> 64-bit, whether the machine actually has any numa characteristics or
> not. So you take that vmalloc() TLB overhead whether you need it or
> not.

Yeah, that's strange; it seems to just be an oversight nobody ever
picked up. Patch 2/4 actually fixed that exactly the way you said.

> 
> So I think your series looks sane, and should help the vmalloc case
> for big hash allocations, but I also think that this whole
> alloc_large_system_hash() function should be smarter in general.
> 
> Yes, it's called "alloc_large_system_hash()", but it's used on small
> and perfectly normal-sized systems too, and often for not all that big
> hashes.
> 
> Yes, we tend to try to make some of those hashes large (dentry one in
> particular), but we also use this for small stuff.
> 
> For example, on my machine I have several network hashes that have
> order 6-8 sizes, none of which really make any sense to use vmalloc
> space for (and which are smaller than a large page, so your patch
> series wouldn't help).
> 
> So on the whole I have no issues with this series, but I do think we
> should maybe fix that crazy "if (hashdist)" case. Hmm?

Yes agreed. Even after this series with 2MB mappings it's actually a bit 
sad that we can't use the linear map for the non-NUMA case. My laptop 
has a 32MB dentry cache and 16MB inode cache so doing a bunch of name 
lookups is quite a waste of TLB entries (although at least with 2MB 
pages it doesn't blow the TLB completely).

We might be able to go a step further and use memblock allocator for
those as well, or reserve some boot CMA for that common case, or just
use the linear map for these hashes. I'll look into that.

Thanks,
Nick

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d66bc8abe0af..dd419a074141 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8029,7 +8029,7 @@  void *__init alloc_large_system_hash(const char *tablename,
 			else
 				table = memblock_alloc_raw(size,
 							   SMP_CACHE_BYTES);
-		} else if (hashdist) {
+		} else if (get_order(size) >= MAX_ORDER || hashdist) {
 			table = __vmalloc(size, gfp_flags, PAGE_KERNEL);
 		} else {
 			/*
@@ -8037,10 +8037,8 @@  void *__init alloc_large_system_hash(const char *tablename,
 			 * some pages at the end of hash table which
 			 * alloc_pages_exact() automatically does
 			 */
-			if (get_order(size) < MAX_ORDER) {
-				table = alloc_pages_exact(size, gfp_flags);
-				kmemleak_alloc(table, size, 1, gfp_flags);
-			}
+			table = alloc_pages_exact(size, gfp_flags);
+			kmemleak_alloc(table, size, 1, gfp_flags);
 		}
 	} while (!table && size > PAGE_SIZE && --log2qty);