mm: kvmalloc: make kmalloc fast path real fast path

Message ID Z-48K0OdNxZXcnkB@tiehlicka (mailing list archive)
State New

Commit Message

Michal Hocko April 3, 2025, 7:43 a.m. UTC
Some users, such as xfs, need larger allocations with NOFAIL
semantics. They do not use kvmalloc currently because the current
implementation tries too hard to allocate through the kmalloc path,
which causes a lot of direct reclaim and compaction and hurts
performance significantly (see 8dc9384b7d75 ("xfs: reduce kvmalloc overhead for
CIL shadow buffers") for more details).

kvmalloc does support __GFP_RETRY_MAYFAIL semantics to express that a
kmalloc (physically contiguous) allocation is preferred and that we
should be more aggressive to make it happen. There is currently no way
to express that kmalloc should be very lightweight. As has been argued
[1], this mode should be the default, so that kvmalloc(NOFAIL) gets a
lightweight kmalloc path. That is currently impossible to express
because __GFP_NOFAIL cannot be combined with any other reclaim modifiers.

This patch makes kmalloc attempts made by kvmalloc for sizes above
PAGE_SIZE behave like GFP_NOWAIT (no direct reclaim) unless
__GFP_RETRY_MAYFAIL is provided to kvmalloc. This makes it possible to
support both fail-fast and retry-hard behavior for physically contiguous
memory, with a vmalloc fallback.

There is a potential downside: relatively small allocations (smaller
than PAGE_ALLOC_COSTLY_ORDER) could fall back to vmalloc too easily and
cause page block fragmentation. We cannot really rule that out, but the
xlog_cil_kvmalloc usage does not indicate that this is happening.

[1] https://lore.kernel.org/all/Z-3i1wATGh6vI8x8@dread.disaster.area/T/#u
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/slub.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)
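
The flag adjustment described above can be modelled in userspace. The
following is only a sketch: the MODEL_* bit values and the model_*
function names are hypothetical stand-ins, not the kernel's definitions,
and the logic is copied from the hunk in this patch rather than from the
full kmalloc_gfp_adjust(). It contrasts the old behavior (__GFP_NORETRY)
with the proposed one (clearing __GFP_DIRECT_RECLAIM).

```c
#include <stddef.h>

/* Hypothetical stand-in bits; the real values are defined by the kernel. */
typedef unsigned int model_gfp_t;
#define MODEL_DIRECT_RECLAIM 0x01u
#define MODEL_KSWAPD_RECLAIM 0x02u
#define MODEL_RETRY_MAYFAIL  0x04u
#define MODEL_NOFAIL         0x08u
#define MODEL_NOWARN         0x10u
#define MODEL_NORETRY        0x20u
#define MODEL_GFP_KERNEL     (MODEL_DIRECT_RECLAIM | MODEL_KSWAPD_RECLAIM)
#define MODEL_PAGE_SIZE      4096u /* assumed page size */

/* Behavior before the patch: larger requests get NORETRY. */
static model_gfp_t model_adjust_old(model_gfp_t flags, size_t size)
{
	if (size > MODEL_PAGE_SIZE) {
		flags |= MODEL_NOWARN;
		if (!(flags & MODEL_RETRY_MAYFAIL))
			flags |= MODEL_NORETRY;
		/* nofail is implemented by the vmalloc fallback */
		flags &= ~MODEL_NOFAIL;
	}
	return flags;
}

/* Behavior with the patch: no direct reclaim unless RETRY_MAYFAIL. */
static model_gfp_t model_adjust_new(model_gfp_t flags, size_t size)
{
	if (size > MODEL_PAGE_SIZE) {
		flags |= MODEL_NOWARN;
		if (!(flags & MODEL_RETRY_MAYFAIL))
			flags &= ~MODEL_DIRECT_RECLAIM;
		/* nofail is implemented by the vmalloc fallback */
		flags &= ~MODEL_NOFAIL;
	}
	return flags;
}
```

In this model, a two-page GFP_KERNEL request keeps only the kswapd
wakeup bit (plus NOWARN) on the kmalloc attempt, while the same request
with RETRY_MAYFAIL keeps direct reclaim enabled. Requests of at most one
page are left untouched.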

Comments

Vlastimil Babka April 3, 2025, 8:24 a.m. UTC | #1
On 4/3/25 09:43, Michal Hocko wrote:
> There are users like xfs which need larger allocations with NOFAIL
> semantics. They are not using kvmalloc currently because the current
> implementation tries too hard to allocate through the kmalloc path
> which causes a lot of direct reclaim and compaction and that hurts
> performance a lot (see 8dc9384b7d75 ("xfs: reduce kvmalloc overhead for
> CIL shadow buffers") for more details).
> 
> kvmalloc does support __GFP_RETRY_MAYFAIL semantic to express that
> kmalloc (physically contiguous) allocation is preferred and we should go
> more aggressive to make it happen. There is currently no way to express
> that kmalloc should be very lightweight and as it has been argued [1]
> this mode should be default to support kvmalloc(NOFAIL) with a
> lightweight kmalloc path which is currently impossible to express as
> __GFP_NOFAIL cannot be combined with any other reclaim modifiers.
> 
> This patch makes all kmalloc allocations GFP_NOWAIT unless
> __GFP_RETRY_MAYFAIL is provided to kvmalloc. This allows to support both
> fail fast and retry hard on physically contiguous memory with vmalloc
> fallback.
> 
> There is a potential downside that relatively small allocations (smaller
> than PAGE_ALLOC_COSTLY_ORDER) could fallback to vmalloc too easily and
> cause page block fragmentation. We cannot really rule that out but it
> seems that xlog_cil_kvmalloc use doesn't indicate this to be happening.
> 
> [1] https://lore.kernel.org/all/Z-3i1wATGh6vI8x8@dread.disaster.area/T/#u
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Looks like a step in the right direction, but is that enough?

- to replace xlog_kvmalloc(), we need to deal with kvmalloc() passing
VM_ALLOW_HUGE_VMAP, so we don't end up with GFP_KERNEL huge allocation
anyway (in practice maybe it wouldn't happen because "size >= PMD_SIZE"
required for the huge vmalloc is never true for current xlog_kvmalloc()
users but dunno if we can rely on that).

Maybe it's a bad idea to use VM_ALLOW_HUGE_VMAP in kvmalloc() anyway? Since
we're in a vmalloc fallback which means the huge allocations failed anyway
for the kmalloc() part. Maybe there's some grey area where it makes sense,
with size much larger than PMD_SIZE, e.g. exceeding MAX_PAGE_ORDER where we
can't kmalloc() anyway so at least try to assemble the allocation from huge
vmalloc. Maybe tie it to such a size check, or require __GFP_RETRY_MAYFAIL
to activate VM_ALLOW_HUGE_VMAP?

- we're still not addressing the original issue of high kcompactd activity,
but maybe the answer is that it needs to be investigated more (why deferred
compaction doesn't limit it) instead of trying to suppress it from kvmalloc()
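
One possible reading of the VM_ALLOW_HUGE_VMAP suggestion above, written
as a userspace sketch. Everything here is an assumption for illustration:
the MODEL_* constants (4 KiB pages, 2 MiB PMD, maximum page order 10) and
the model_allow_huge_vmap() helper are hypothetical and do not exist in
the kernel, and the two conditions from the text (a size check, or
requiring __GFP_RETRY_MAYFAIL) are simply combined with a logical OR.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel constants. */
#define MODEL_PAGE_SHIFT     12          /* assumed 4 KiB pages */
#define MODEL_PMD_SIZE       (1UL << 21) /* assumed 2 MiB huge mapping */
#define MODEL_MAX_PAGE_ORDER 10          /* assumed largest buddy order */
#define MODEL_RETRY_MAYFAIL  0x04u       /* stand-in flag bit */

/*
 * Decide whether the vmalloc fallback should try huge mappings.
 * Huge vmalloc needs at least PMD_SIZE; beyond that, allow it only
 * when kmalloc could not have served the request at all, or when the
 * caller asked to try hard for contiguous memory.
 */
static bool model_allow_huge_vmap(unsigned int flags, size_t size)
{
	size_t max_kmalloc = (size_t)1 << (MODEL_MAX_PAGE_ORDER + MODEL_PAGE_SHIFT);

	if (size < MODEL_PMD_SIZE)
		return false;
	return size > max_kmalloc || (flags & MODEL_RETRY_MAYFAIL);
}
```

Under these assumed constants, a 2 MiB request without RETRY_MAYFAIL
would skip the huge mapping, while an 8 MiB request would still attempt
it because it exceeds what the page allocator can hand out in one piece.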

> ---
>  mm/slub.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index b46f87662e71..2da40c2f6478 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4972,14 +4972,16 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
>  	 * We want to attempt a large physically contiguous block first because
>  	 * it is less likely to fragment multiple larger blocks and therefore
>  	 * contribute to a long term fragmentation less than vmalloc fallback.
> -	 * However make sure that larger requests are not too disruptive - no
> -	 * OOM killer and no allocation failure warnings as we have a fallback.
> +	 * However make sure that larger requests are not too disruptive - i.e.
> +	 * do not direct reclaim unless physically continuous memory is preferred
> +	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
> +	 * working in the background but the allocation itself.
>  	 */
>  	if (size > PAGE_SIZE) {
>  		flags |= __GFP_NOWARN;
>  
>  		if (!(flags & __GFP_RETRY_MAYFAIL))
> -			flags |= __GFP_NORETRY;
> +			flags &= ~__GFP_DIRECT_RECLAIM;
>  
>  		/* nofail semantic is implemented by the vmalloc fallback */
>  		flags &= ~__GFP_NOFAIL;

Michal Hocko April 3, 2025, 8:59 a.m. UTC | #2
On Thu 03-04-25 10:24:56, Vlastimil Babka wrote:
[...]
> - to replace xlog_kvmalloc(), we need to deal with kvmalloc() passing
> VM_ALLOW_HUGE_VMAP, so we don't end up with GFP_KERNEL huge allocation
> anyway (in practice maybe it wouldn't happen because "size >= PMD_SIZE"
> required for the huge vmalloc is never true for current xlog_kvmalloc()
> users but dunno if we can rely on that).

I would just make that its own patch. Ideally with some numbers showing
there are code paths benefiting from the change.

> Maybe it's a bad idea to use VM_ALLOW_HUGE_VMAP in kvmalloc() anyway? Since
> we're in a vmalloc fallback which means the huge allocations failed anyway
> for the kmalloc() part. Maybe there's some grey area where it makes sense,
> with size much larger than PMD_SIZE, e.g. exceeding MAX_PAGE_ORDER where we
> can't kmalloc() anyway so at least try to assemble the allocation from huge
> vmalloc. Maybe tie it to such a size check, or require __GFP_RETRY_MAYFAIL
> to activate VM_ALLOW_HUGE_VMAP?

We didn't have that initially. 9becb6889130 ("kvmalloc: use vmalloc_huge
for vmalloc allocations") added it. I thought large allocations were
very optimistic (i.e. NOWAIT-like), but that doesn't seem to be the case.

As said above, I would just change that after we have any numbers to
support the removal.

> - we're still not addressing the original issue of high kcompactd activity,
> but maybe the answer is that it needs to be investigated more (why deferred
> compaction doesn't limit it) instead of trying to suppress it from kvmalloc()

Yes, this seems like something that should be investigated on the
compaction side.

Thanks!

Kees Cook April 3, 2025, 4:21 p.m. UTC | #3
On Thu, Apr 03, 2025 at 09:43:39AM +0200, Michal Hocko wrote:
> There are users like xfs which need larger allocations with NOFAIL
> semantics. They are not using kvmalloc currently because the current
> implementation tries too hard to allocate through the kmalloc path
> which causes a lot of direct reclaim and compaction and that hurts
> performance a lot (see 8dc9384b7d75 ("xfs: reduce kvmalloc overhead for
> CIL shadow buffers") for more details).
> 
> kvmalloc does support __GFP_RETRY_MAYFAIL semantic to express that
> kmalloc (physically contiguous) allocation is preferred and we should go
> more aggressive to make it happen. There is currently no way to express
> that kmalloc should be very lightweight and as it has been argued [1]
> this mode should be default to support kvmalloc(NOFAIL) with a
> lightweight kmalloc path which is currently impossible to express as
> __GFP_NOFAIL cannot be combined with any other reclaim modifiers.
> 
> This patch makes all kmalloc allocations GFP_NOWAIT unless
> __GFP_RETRY_MAYFAIL is provided to kvmalloc. This allows to support both
> fail fast and retry hard on physically contiguous memory with vmalloc
> fallback.
> 
> There is a potential downside that relatively small allocations (smaller
> than PAGE_ALLOC_COSTLY_ORDER) could fallback to vmalloc too easily and
> cause page block fragmentation. We cannot really rule that out but it
> seems that xlog_cil_kvmalloc use doesn't indicate this to be happening.
> 
> [1] https://lore.kernel.org/all/Z-3i1wATGh6vI8x8@dread.disaster.area/T/#u
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Thanks for finding a solution for this! It makes way more sense to me to
kick over to vmap by default for kvmalloc users.

> ---
>  mm/slub.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index b46f87662e71..2da40c2f6478 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4972,14 +4972,16 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
>  	 * We want to attempt a large physically contiguous block first because
>  	 * it is less likely to fragment multiple larger blocks and therefore
>  	 * contribute to a long term fragmentation less than vmalloc fallback.
> -	 * However make sure that larger requests are not too disruptive - no
> -	 * OOM killer and no allocation failure warnings as we have a fallback.
> +	 * However make sure that larger requests are not too disruptive - i.e.
> +	 * do not direct reclaim unless physically continuous memory is preferred
> +	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
> +	 * working in the background but the allocation itself.

I think a word is missing here? "...but do the allocation..." or
"...allocation itself happens" ?

Shakeel Butt April 3, 2025, 6:30 p.m. UTC | #4
On Thu, Apr 03, 2025 at 09:43:39AM +0200, Michal Hocko wrote:
> There are users like xfs which need larger allocations with NOFAIL
> semantics. They are not using kvmalloc currently because the current
> implementation tries too hard to allocate through the kmalloc path
> which causes a lot of direct reclaim and compaction and that hurts
> performance a lot (see 8dc9384b7d75 ("xfs: reduce kvmalloc overhead for
> CIL shadow buffers") for more details).
> 
> kvmalloc does support __GFP_RETRY_MAYFAIL semantic to express that
> kmalloc (physically contiguous) allocation is preferred and we should go
> more aggressive to make it happen. There is currently no way to express
> that kmalloc should be very lightweight and as it has been argued [1]
> this mode should be default to support kvmalloc(NOFAIL) with a
> lightweight kmalloc path which is currently impossible to express as
> __GFP_NOFAIL cannot be combined with any other reclaim modifiers.
> 
> This patch makes all kmalloc allocations GFP_NOWAIT unless
> __GFP_RETRY_MAYFAIL is provided to kvmalloc. This allows to support both
> fail fast and retry hard on physically contiguous memory with vmalloc
> fallback.
> 
> There is a potential downside that relatively small allocations (smaller
> than PAGE_ALLOC_COSTLY_ORDER) could fallback to vmalloc too easily and
> cause page block fragmentation. We cannot really rule that out but it
> seems that xlog_cil_kvmalloc use doesn't indicate this to be happening.
> 
> [1] https://lore.kernel.org/all/Z-3i1wATGh6vI8x8@dread.disaster.area/T/#u
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>

Michal Hocko April 3, 2025, 7:49 p.m. UTC | #5
On Thu 03-04-25 09:21:50, Kees Cook wrote:
> On Thu, Apr 03, 2025 at 09:43:39AM +0200, Michal Hocko wrote:
[...]
> >  mm/slub.c | 8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index b46f87662e71..2da40c2f6478 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4972,14 +4972,16 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
> >  	 * We want to attempt a large physically contiguous block first because
> >  	 * it is less likely to fragment multiple larger blocks and therefore
> >  	 * contribute to a long term fragmentation less than vmalloc fallback.
> > -	 * However make sure that larger requests are not too disruptive - no
> > -	 * OOM killer and no allocation failure warnings as we have a fallback.
> > +	 * However make sure that larger requests are not too disruptive - i.e.
> > +	 * do not direct reclaim unless physically continuous memory is preferred
> > +	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
> > +	 * working in the background but the allocation itself.
> 
> I think a word is missing here? "...but do the allocation..." or
> "...allocation itself happens" ?

Thinking about this some more I would just cut this short and go with
"We still kick in kswapd/kcompactd to start working in the background"

Does that sound better?

Michal Hocko April 3, 2025, 7:51 p.m. UTC | #6
Add Andrew

Also, Dave, do you want me to redirect xlog_cil_kvmalloc to kvmalloc, or
do you prefer to do that yourself?

On Thu 03-04-25 09:43:41, Michal Hocko wrote:
> There are users like xfs which need larger allocations with NOFAIL
> semantics. They are not using kvmalloc currently because the current
> implementation tries too hard to allocate through the kmalloc path
> which causes a lot of direct reclaim and compaction and that hurts
> performance a lot (see 8dc9384b7d75 ("xfs: reduce kvmalloc overhead for
> CIL shadow buffers") for more details).
> 
> kvmalloc does support __GFP_RETRY_MAYFAIL semantic to express that
> kmalloc (physically contiguous) allocation is preferred and we should go
> more aggressive to make it happen. There is currently no way to express
> that kmalloc should be very lightweight and as it has been argued [1]
> this mode should be default to support kvmalloc(NOFAIL) with a
> lightweight kmalloc path which is currently impossible to express as
> __GFP_NOFAIL cannot be combined with any other reclaim modifiers.
> 
> This patch makes all kmalloc allocations GFP_NOWAIT unless
> __GFP_RETRY_MAYFAIL is provided to kvmalloc. This allows to support both
> fail fast and retry hard on physically contiguous memory with vmalloc
> fallback.
> 
> There is a potential downside that relatively small allocations (smaller
> than PAGE_ALLOC_COSTLY_ORDER) could fallback to vmalloc too easily and
> cause page block fragmentation. We cannot really rule that out but it
> seems that xlog_cil_kvmalloc use doesn't indicate this to be happening.
> 
> [1] https://lore.kernel.org/all/Z-3i1wATGh6vI8x8@dread.disaster.area/T/#u
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/slub.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index b46f87662e71..2da40c2f6478 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4972,14 +4972,16 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
>  	 * We want to attempt a large physically contiguous block first because
>  	 * it is less likely to fragment multiple larger blocks and therefore
>  	 * contribute to a long term fragmentation less than vmalloc fallback.
> -	 * However make sure that larger requests are not too disruptive - no
> -	 * OOM killer and no allocation failure warnings as we have a fallback.
> +	 * However make sure that larger requests are not too disruptive - i.e.
> +	 * do not direct reclaim unless physically continuous memory is preferred
> +	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
> +	 * working in the background but the allocation itself.
>  	 */
>  	if (size > PAGE_SIZE) {
>  		flags |= __GFP_NOWARN;
>  
>  		if (!(flags & __GFP_RETRY_MAYFAIL))
> -			flags |= __GFP_NORETRY;
> +			flags &= ~__GFP_DIRECT_RECLAIM;
>  
>  		/* nofail semantic is implemented by the vmalloc fallback */
>  		flags &= ~__GFP_NOFAIL;
> -- 
> 2.49.0
>
Patch

diff --git a/mm/slub.c b/mm/slub.c
index b46f87662e71..2da40c2f6478 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4972,14 +4972,16 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
 	 * We want to attempt a large physically contiguous block first because
 	 * it is less likely to fragment multiple larger blocks and therefore
 	 * contribute to a long term fragmentation less than vmalloc fallback.
-	 * However make sure that larger requests are not too disruptive - no
-	 * OOM killer and no allocation failure warnings as we have a fallback.
+	 * However make sure that larger requests are not too disruptive - i.e.
+	 * do not direct reclaim unless physically continuous memory is preferred
+	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
+	 * working in the background but the allocation itself.
 	 */
 	if (size > PAGE_SIZE) {
 		flags |= __GFP_NOWARN;
 
 		if (!(flags & __GFP_RETRY_MAYFAIL))
-			flags |= __GFP_NORETRY;
+			flags &= ~__GFP_DIRECT_RECLAIM;
 
 		/* nofail semantic is implemented by the vmalloc fallback */
 		flags &= ~__GFP_NOFAIL;