diff mbox series

[bpf-next,v5,1/7] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation

Message ID 20250115021746.34691-2-alexei.starovoitov@gmail.com (mailing list archive)
State New
Headers show
Series bpf, mm: Introduce try_alloc_pages() | expand

Commit Message

Alexei Starovoitov Jan. 15, 2025, 2:17 a.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

Tracing BPF programs execute from tracepoints and kprobes where
running context is unknown, but they need to request additional
memory. The prior workarounds were using pre-allocated memory and
BPF specific freelists to satisfy such allocation requests.
Instead, introduce gfpflags_allow_spinning() condition that signals
to the allocator that running context is unknown.
Then rely on percpu free list of pages to allocate a page.
try_alloc_pages() -> get_page_from_freelist() -> rmqueue() ->
rmqueue_pcplist() will spin_trylock to grab the page from percpu
free list. If it fails (due to re-entrancy or list being empty)
then rmqueue_bulk()/rmqueue_buddy() will attempt to
spin_trylock zone->lock and grab the page from there.
spin_trylock() is not safe in RT when in NMI or in hard IRQ.
Bailout early in such case.

The support for gfpflags_allow_spinning() mode for free_page and memcg
comes in the next patches.

This is a first step towards supporting BPF requirements in SLUB
and getting rid of bpf_mem_alloc.
That goal was discussed at LSFMM: https://lwn.net/Articles/974138/

Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/gfp.h | 22 ++++++++++
 mm/internal.h       |  1 +
 mm/page_alloc.c     | 98 +++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 118 insertions(+), 3 deletions(-)

Comments

Vlastimil Babka Jan. 15, 2025, 11:19 a.m. UTC | #1
On 1/15/25 03:17, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Tracing BPF programs execute from tracepoints and kprobes where
> running context is unknown, but they need to request additional
> memory. The prior workarounds were using pre-allocated memory and
> BPF specific freelists to satisfy such allocation requests.
> Instead, introduce gfpflags_allow_spinning() condition that signals
> to the allocator that running context is unknown.
> Then rely on percpu free list of pages to allocate a page.
> try_alloc_pages() -> get_page_from_freelist() -> rmqueue() ->
> rmqueue_pcplist() will spin_trylock to grab the page from percpu
> free list. If it fails (due to re-entrancy or list being empty)
> then rmqueue_bulk()/rmqueue_buddy() will attempt to
> spin_trylock zone->lock and grab the page from there.
> spin_trylock() is not safe in RT when in NMI or in hard IRQ.
> Bailout early in such case.
> 
> The support for gfpflags_allow_spinning() mode for free_page and memcg
> comes in the next patches.
> 
> This is a first step towards supporting BPF requirements in SLUB
> and getting rid of bpf_mem_alloc.
> That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Some nits below:

> ---
>  include/linux/gfp.h | 22 ++++++++++
>  mm/internal.h       |  1 +
>  mm/page_alloc.c     | 98 +++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 118 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index b0fe9f62d15b..b41bb6e01781 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -39,6 +39,25 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
>  	return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
>  }
>  
> +static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
> +{
> +	/*
> +	 * !__GFP_DIRECT_RECLAIM -> direct claim is not allowed.
> +	 * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
> +	 * All GFP_* flags including GFP_NOWAIT use one or both flags.
> +	 * try_alloc_pages() is the only API that doesn't specify either flag.
> +	 *
> +	 * This is stronger than GFP_NOWAIT or GFP_ATOMIC because
> +	 * those are guaranteed to never block on a sleeping lock.
> +	 * Here we are enforcing that the allaaction doesn't ever spin

					  allocation

> +	 * on any locks (i.e. only trylocks). There is no highlevel
> +	 * GFP_$FOO flag for this use in try_alloc_pages() as the
> +	 * regular page allocator doesn't fully support this
> +	 * allocation mode.
> +	 */
> +	return !(gfp_flags & __GFP_RECLAIM);
> +}

This function seems unused, guess the following patches will use.

> @@ -4509,7 +4517,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>  
>  	might_alloc(gfp_mask);
>  
> -	if (should_fail_alloc_page(gfp_mask, order))
> +	if (!(*alloc_flags & ALLOC_TRYLOCK) &&
> +	    should_fail_alloc_page(gfp_mask, order))

Is it because should_fail_alloc_page() might take some lock or whatnot?
Maybe comment?

>  		return false;
>  
>  	*alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
> @@ -7023,3 +7032,86 @@ static bool __free_unaccepted(struct page *page)
>  }
>  
>  #endif /* CONFIG_UNACCEPTED_MEMORY */
> +
> +/**
> + * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
> + * @nid - node to allocate from
> + * @order - allocation order size
> + *
> + * Allocates pages of a given order from the given node. This is safe to
> + * call from any context (from atomic, NMI, and also reentrant
> + * allocator -> tracepoint -> try_alloc_pages_noprof).
> + * Allocation is best effort and to be expected to fail easily so nobody should
> + * rely on the success. Failures are not reported via warn_alloc().
> + *
> + * Return: allocated page or NULL on failure.
> + */
> +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> +{
> +	/*
> +	 * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
> +	 * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
> +	 * is not safe in arbitrary context.
> +	 *
> +	 * These two are the conditions for gfpflags_allow_spinning() being true.
> +	 *
> +	 * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
> +	 * to warn. Also warn would trigger printk() which is unsafe from
> +	 * various contexts. We cannot use printk_deferred_enter() to mitigate,
> +	 * since the running context is unknown.
> +	 *
> +	 * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
> +	 * is safe in any context. Also zeroing the page is mandatory for
> +	 * BPF use cases.
> +	 *
> +	 * Though __GFP_NOMEMALLOC is not checked in the code path below,
> +	 * specify it here to highlight that try_alloc_pages()
> +	 * doesn't want to deplete reserves.
> +	 */
> +	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
> +	unsigned int alloc_flags = ALLOC_TRYLOCK;
> +	struct alloc_context ac = { };
> +	struct page *page;
> +
> +	/*
> +	 * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
> +	 * If spin_trylock() is called from hard IRQ the current task may be
> +	 * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
> +	 * task as the owner of another rt_spin_lock which will confuse PI
> +	 * logic, so return immediately if called form hard IRQ or NMI.
> +	 *
> +	 * Note, irqs_disabled() case is ok. This function can be called
> +	 * from raw_spin_lock_irqsave region.
> +	 */
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> +		return NULL;
> +	if (!pcp_allowed_order(order))
> +		return NULL;
> +
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	/* Bailout, since try_to_accept_memory_one() needs to take a lock */
> +	if (has_unaccepted_memory())
> +		return NULL;
> +#endif
> +	/* Bailout, since _deferred_grow_zone() needs to take a lock */
> +	if (deferred_pages_enabled())
> +		return NULL;

Is it fine for BPF that things will fail to allocate until all memory is
deferred-initialized and accepted? I guess it's easy to teach those places
later to evaluate if they can take the lock.

> +
> +	if (nid == NUMA_NO_NODE)
> +		nid = numa_node_id();
> +
> +	prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> +			    &alloc_gfp, &alloc_flags);
> +
> +	/*
> +	 * Best effort allocation from percpu free list.
> +	 * If it's empty attempt to spin_trylock zone->lock.
> +	 */
> +	page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);

What about set_page_owner() from post_alloc_hook() and it's stackdepot
saving. I guess not an issue until try_alloc_pages() gets used later, so
just a mental note that it has to be resolved before. Or is it actually safe?

> +
> +	/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
> +
> +	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
> +	kmsan_alloc_page(page, order, alloc_gfp);
> +	return page;
> +}
Alexei Starovoitov Jan. 15, 2025, 11 p.m. UTC | #2
On Wed, Jan 15, 2025 at 3:19 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/15/25 03:17, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Tracing BPF programs execute from tracepoints and kprobes where
> > running context is unknown, but they need to request additional
> > memory. The prior workarounds were using pre-allocated memory and
> > BPF specific freelists to satisfy such allocation requests.
> > Instead, introduce gfpflags_allow_spinning() condition that signals
> > to the allocator that running context is unknown.
> > Then rely on percpu free list of pages to allocate a page.
> > try_alloc_pages() -> get_page_from_freelist() -> rmqueue() ->
> > rmqueue_pcplist() will spin_trylock to grab the page from percpu
> > free list. If it fails (due to re-entrancy or list being empty)
> > then rmqueue_bulk()/rmqueue_buddy() will attempt to
> > spin_trylock zone->lock and grab the page from there.
> > spin_trylock() is not safe in RT when in NMI or in hard IRQ.
> > Bailout early in such case.
> >
> > The support for gfpflags_allow_spinning() mode for free_page and memcg
> > comes in the next patches.
> >
> > This is a first step towards supporting BPF requirements in SLUB
> > and getting rid of bpf_mem_alloc.
> > That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
> >
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>
> Some nits below:
>
> > ---
> >  include/linux/gfp.h | 22 ++++++++++
> >  mm/internal.h       |  1 +
> >  mm/page_alloc.c     | 98 +++++++++++++++++++++++++++++++++++++++++++--
> >  3 files changed, 118 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index b0fe9f62d15b..b41bb6e01781 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -39,6 +39,25 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> >       return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
> >  }
> >
> > +static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
> > +{
> > +     /*
> > +      * !__GFP_DIRECT_RECLAIM -> direct claim is not allowed.
> > +      * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
> > +      * All GFP_* flags including GFP_NOWAIT use one or both flags.
> > +      * try_alloc_pages() is the only API that doesn't specify either flag.
> > +      *
> > +      * This is stronger than GFP_NOWAIT or GFP_ATOMIC because
> > +      * those are guaranteed to never block on a sleeping lock.
> > +      * Here we are enforcing that the allaaction doesn't ever spin
>
>                                           allocation

oops.

> > +      * on any locks (i.e. only trylocks). There is no highlevel
> > +      * GFP_$FOO flag for this use in try_alloc_pages() as the
> > +      * regular page allocator doesn't fully support this
> > +      * allocation mode.
> > +      */
> > +     return !(gfp_flags & __GFP_RECLAIM);
> > +}
>
> This function seems unused, guess the following patches will use.
>
> > @@ -4509,7 +4517,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> >
> >       might_alloc(gfp_mask);
> >
> > -     if (should_fail_alloc_page(gfp_mask, order))
> > +     if (!(*alloc_flags & ALLOC_TRYLOCK) &&
> > +         should_fail_alloc_page(gfp_mask, order))
>
> Is it because should_fail_alloc_page() might take some lock or whatnot?
> Maybe comment?

This is mainly because all kinds of printk-s that fail* logic does.
We've seen way too many syzbot reports because of printk from
the unsupported context.
This part of the comment:
+    * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
+    * to warn. Also warn would trigger printk() which is unsafe from
+    * various contexts. We cannot use printk_deferred_enter() to mitigate,
+    * since the running context is unknown.

and also because of get_random_u32() inside fail* logic
that grabs spin_lock.

We've seen syzbot reports about get_random() deadlocking.

Fixing printk and fail*/get_random is necessary too,
but orthogonal work for these patches.

Here I'm preventing this path from prepare_alloc_pages() to avoid
dealing with more syzbot reports than already there.

With try_alloc_pages() out of bpf attached to kprobe inside
printk core logic it would be too easy for syzbot.
And then people will yell at bpf causing problems
whereas the root cause is printk being broken.
We see this finger pointing all too often.

>
> >               return false;
> >
> >       *alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
> > @@ -7023,3 +7032,86 @@ static bool __free_unaccepted(struct page *page)
> >  }
> >
> >  #endif /* CONFIG_UNACCEPTED_MEMORY */
> > +
> > +/**
> > + * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
> > + * @nid - node to allocate from
> > + * @order - allocation order size
> > + *
> > + * Allocates pages of a given order from the given node. This is safe to
> > + * call from any context (from atomic, NMI, and also reentrant
> > + * allocator -> tracepoint -> try_alloc_pages_noprof).
> > + * Allocation is best effort and to be expected to fail easily so nobody should
> > + * rely on the success. Failures are not reported via warn_alloc().
> > + *
> > + * Return: allocated page or NULL on failure.
> > + */
> > +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> > +{
> > +     /*
> > +      * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
> > +      * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
> > +      * is not safe in arbitrary context.
> > +      *
> > +      * These two are the conditions for gfpflags_allow_spinning() being true.
> > +      *
> > +      * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
> > +      * to warn. Also warn would trigger printk() which is unsafe from
> > +      * various contexts. We cannot use printk_deferred_enter() to mitigate,
> > +      * since the running context is unknown.
> > +      *
> > +      * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
> > +      * is safe in any context. Also zeroing the page is mandatory for
> > +      * BPF use cases.
> > +      *
> > +      * Though __GFP_NOMEMALLOC is not checked in the code path below,
> > +      * specify it here to highlight that try_alloc_pages()
> > +      * doesn't want to deplete reserves.
> > +      */
> > +     gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
> > +     unsigned int alloc_flags = ALLOC_TRYLOCK;
> > +     struct alloc_context ac = { };
> > +     struct page *page;
> > +
> > +     /*
> > +      * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
> > +      * If spin_trylock() is called from hard IRQ the current task may be
> > +      * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
> > +      * task as the owner of another rt_spin_lock which will confuse PI
> > +      * logic, so return immediately if called form hard IRQ or NMI.
> > +      *
> > +      * Note, irqs_disabled() case is ok. This function can be called
> > +      * from raw_spin_lock_irqsave region.
> > +      */
> > +     if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> > +             return NULL;
> > +     if (!pcp_allowed_order(order))
> > +             return NULL;
> > +
> > +#ifdef CONFIG_UNACCEPTED_MEMORY
> > +     /* Bailout, since try_to_accept_memory_one() needs to take a lock */
> > +     if (has_unaccepted_memory())
> > +             return NULL;
> > +#endif
> > +     /* Bailout, since _deferred_grow_zone() needs to take a lock */
> > +     if (deferred_pages_enabled())
> > +             return NULL;
>
> Is it fine for BPF that things will fail to allocate until all memory is
> deferred-initialized and accepted? I guess it's easy to teach those places
> later to evaluate if they can take the lock.

Exactly. If it becomes an issue it can be addressed.
I didn't want to complicate the code just-because.

> > +
> > +     if (nid == NUMA_NO_NODE)
> > +             nid = numa_node_id();
> > +
> > +     prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> > +                         &alloc_gfp, &alloc_flags);
> > +
> > +     /*
> > +      * Best effort allocation from percpu free list.
> > +      * If it's empty attempt to spin_trylock zone->lock.
> > +      */
> > +     page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
>
> What about set_page_owner() from post_alloc_hook() and it's stackdepot
> saving. I guess not an issue until try_alloc_pages() gets used later, so
> just a mental note that it has to be resolved before. Or is it actually safe?

set_page_owner() should be fine.
save_stack() has in_page_owner recursion protection mechanism.

stack_depot_save_flags() may be problematic if there is another
path to it.
I guess I can do:

diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 245d5b416699..61772bc4b811 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -630,7 +630,7 @@ depot_stack_handle_t
stack_depot_save_flags(unsigned long *entries,
                        prealloc = page_address(page);
        }

-       if (in_nmi()) {
+       if (in_nmi() || !gfpflags_allow_spinning(alloc_flags)) {
                /* We can never allocate in NMI context. */
                WARN_ON_ONCE(can_alloc);
                /* Best effort; bail if we fail to take the lock. */
                if (!raw_spin_trylock_irqsave(&pool_lock, flags))
                        goto exit;

as part of this patch,
but not convinced whether it's necessary.
stack_depot* is effectively noinstr.
kprobe-bpf cannot be placed in there and afaict
it doesn't call any tracepoints.
So in_nmi() is the only way to reenter and that's already covered.
Shakeel Butt Jan. 15, 2025, 11:16 p.m. UTC | #3
On Wed, Jan 15, 2025 at 12:19:26PM +0100, Vlastimil Babka wrote:
> On 1/15/25 03:17, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> > 
[...]
> > +
> > +	if (nid == NUMA_NO_NODE)
> > +		nid = numa_node_id();
> > +
> > +	prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
> > +			    &alloc_gfp, &alloc_flags);
> > +
> > +	/*
> > +	 * Best effort allocation from percpu free list.
> > +	 * If it's empty attempt to spin_trylock zone->lock.
> > +	 */
> > +	page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
> 
> What about set_page_owner() from post_alloc_hook() and it's stackdepot
> saving. I guess not an issue until try_alloc_pages() gets used later, so
> just a mental note that it has to be resolved before. Or is it actually safe?
> 

stack_depot_save() does not seem safe here.

> > +
> > +	/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
> > +
> > +	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
> > +	kmsan_alloc_page(page, order, alloc_gfp);
> > +	return page;
> > +}
>
Shakeel Butt Jan. 15, 2025, 11:47 p.m. UTC | #4
On Wed, Jan 15, 2025 at 03:00:08PM -0800, Alexei Starovoitov wrote:
> On Wed, Jan 15, 2025 at 3:19 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> >

Sorry missed your response here.

> > What about set_page_owner() from post_alloc_hook() and it's stackdepot
> > saving. I guess not an issue until try_alloc_pages() gets used later, so
> > just a mental note that it has to be resolved before. Or is it actually safe?
> 
> set_page_owner() should be fine.
> save_stack() has in_page_owner recursion protection mechanism.
> 
> stack_depot_save_flags() may be problematic if there is another
> path to it.
> I guess I can do:
> 
> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> index 245d5b416699..61772bc4b811 100644
> --- a/lib/stackdepot.c
> +++ b/lib/stackdepot.c
> @@ -630,7 +630,7 @@ depot_stack_handle_t
> stack_depot_save_flags(unsigned long *entries,
>                         prealloc = page_address(page);
>         }

There is alloc_pages(gfp_nested_mask(alloc_flags)...) just couple of
lines above. How about setting can_alloc false along with the below
change for this case? Or we can set ALLOC_TRYLOCK in core alloc_pages()
for !gfpflags_allow_spinning().

> 
> -       if (in_nmi()) {
> +       if (in_nmi() || !gfpflags_allow_spinning(alloc_flags)) {
>                 /* We can never allocate in NMI context. */
>                 WARN_ON_ONCE(can_alloc);
>                 /* Best effort; bail if we fail to take the lock. */
>                 if (!raw_spin_trylock_irqsave(&pool_lock, flags))
>                         goto exit;
> 
> as part of this patch,
> but not convinced whether it's necessary.
> stack_depot* is effectively noinstr.
> kprobe-bpf cannot be placed in there and afaict
> it doesn't call any tracepoints.
> So in_nmi() is the only way to reenter and that's already covered.

Are the locks in stack_depot* only the issue for bpf programs triggered
inside stack_depot*?
Alexei Starovoitov Jan. 16, 2025, 2:44 a.m. UTC | #5
On Wed, Jan 15, 2025 at 3:47 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Jan 15, 2025 at 03:00:08PM -0800, Alexei Starovoitov wrote:
> > On Wed, Jan 15, 2025 at 3:19 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > >
>
> Sorry missed your response here.
>
> > > What about set_page_owner() from post_alloc_hook() and it's stackdepot
> > > saving. I guess not an issue until try_alloc_pages() gets used later, so
> > > just a mental note that it has to be resolved before. Or is it actually safe?
> >
> > set_page_owner() should be fine.
> > save_stack() has in_page_owner recursion protection mechanism.
> >
> > stack_depot_save_flags() may be problematic if there is another
> > path to it.
> > I guess I can do:
> >
> > diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> > index 245d5b416699..61772bc4b811 100644
> > --- a/lib/stackdepot.c
> > +++ b/lib/stackdepot.c
> > @@ -630,7 +630,7 @@ depot_stack_handle_t
> > stack_depot_save_flags(unsigned long *entries,
> >                         prealloc = page_address(page);
> >         }
>
> There is alloc_pages(gfp_nested_mask(alloc_flags)...) just couple of
> lines above. How about setting can_alloc false along with the below
> change for this case?

argh. Just noticed this code path:
free_pages_prepare
  __reset_page_owner
    save_stack(GFP_NOWAIT | __GFP_NOWARN);
       stack_depot_save(entries, nr_entries, flags);
          stack_depot_save_flags(entries, nr_entries, alloc_flags,
                                      STACK_DEPOT_FLAG_CAN_ALLOC);

bool can_alloc = depot_flags & STACK_DEPOT_FLAG_CAN_ALLOC;

if (unlikely(can_alloc && !READ_ONCE(new_pool))) {
         page = alloc_pages(gfp_nested_mask(alloc_flags),
                                 DEPOT_POOL_ORDER);

> Or we can set ALLOC_TRYLOCK in core alloc_pages()
> for !gfpflags_allow_spinning().

so gfpflags_allow_spinning() approach doesn't work out of the door,
since free_pages don't have gfp flags.
Passing FPI flags everywhere is too much churn.

I guess using the same gfp flags as try_alloc_pages()
in __reset_page_owner() should work.
That will make gfpflags_allow_spinning()==true in stack_depot.

In an earlier email I convinced myself that
current->in_page_owner recursion protection in save_stack()
is enough, but looks like it's not.
We could be hitting tracepoint somewhere inside alloc_pages() then
alloc_pages -> tracepoint -> bpf
-> free_pages_nolock -> __free_unref_page -> free_pages_prepare
-> save_stack(GFP_NOWAIT | __GFP_NOWARN);

and this save_stack will be called for the first time.
It's not recursing into itself. Then:

-> stack_depot_save_flags -> alloc_pages(GFP_NOWAIT) -> boom.

So looks like __reset_page_owner() has to use
!gfpflags_allow_spinning() flags and
stack_depot_save_flags() has
to do:
if (!gfpflags_allow_spinning(alloc_flags))
   can_alloc = false;
   raw_spin_trylock_irqsave(&pool_lock, ..)

Will code this up. Unless there are better ideas.
Sebastian Andrzej Siewior Jan. 17, 2025, 6:19 p.m. UTC | #6
On 2025-01-14 18:17:40 [-0800], Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Tracing BPF programs execute from tracepoints and kprobes where
> running context is unknown, but they need to request additional
> memory. The prior workarounds were using pre-allocated memory and
> BPF specific freelists to satisfy such allocation requests.
> Instead, introduce gfpflags_allow_spinning() condition that signals
> to the allocator that running context is unknown.
> Then rely on percpu free list of pages to allocate a page.
> try_alloc_pages() -> get_page_from_freelist() -> rmqueue() ->
> rmqueue_pcplist() will spin_trylock to grab the page from percpu
> free list. If it fails (due to re-entrancy or list being empty)
> then rmqueue_bulk()/rmqueue_buddy() will attempt to
> spin_trylock zone->lock and grab the page from there.
> spin_trylock() is not safe in RT when in NMI or in hard IRQ.
> Bailout early in such case.
> 
> The support for gfpflags_allow_spinning() mode for free_page and memcg
> comes in the next patches.
> 
> This is a first step towards supporting BPF requirements in SLUB
> and getting rid of bpf_mem_alloc.
> That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

could you…

> Signed-off-by: Alexei Starovoitov <ast@kernel.org>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1cb4b8c8886d..74c2a7af1a77 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7023,3 +7032,86 @@ static bool __free_unaccepted(struct page *page)
>  }
>  
>  #endif /* CONFIG_UNACCEPTED_MEMORY */
> +
> +/**
> + * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
> + * @nid - node to allocate from
> + * @order - allocation order size
> + *
> + * Allocates pages of a given order from the given node. This is safe to
> + * call from any context (from atomic, NMI, and also reentrant
> + * allocator -> tracepoint -> try_alloc_pages_noprof).
> + * Allocation is best effort and to be expected to fail easily so nobody should
> + * rely on the success. Failures are not reported via warn_alloc().

Could you maybe add a pointer like "See AlwaysFailRestrictions below."
or something similar to make the user aware of the comment below where
certain always-fail restrictions are mentioned. Such as PREEMPT_RT + NMI
or deferred_pages_enabled().
It might not be easy to be aware of this.

I'm curious how this turns out in the long run :)

> + *
> + * Return: allocated page or NULL on failure.
> + */
> +struct page *try_alloc_pages_noprof(int nid, unsigned int order)
> +{
> +	/*
> +	 * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
> +	 * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
> +	 * is not safe in arbitrary context.
> +	 *
> +	 * These two are the conditions for gfpflags_allow_spinning() being true.
> +	 *
> +	 * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
> +	 * to warn. Also warn would trigger printk() which is unsafe from
> +	 * various contexts. We cannot use printk_deferred_enter() to mitigate,
> +	 * since the running context is unknown.
> +	 *
> +	 * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
> +	 * is safe in any context. Also zeroing the page is mandatory for
> +	 * BPF use cases.
> +	 *
> +	 * Though __GFP_NOMEMALLOC is not checked in the code path below,
> +	 * specify it here to highlight that try_alloc_pages()
> +	 * doesn't want to deplete reserves.
> +	 */
> +	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
> +	unsigned int alloc_flags = ALLOC_TRYLOCK;
> +	struct alloc_context ac = { };
> +	struct page *page;
> +
> +	/*
> +	 * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
PREEMPT_RT please. s/may/will

> +	 * If spin_trylock() is called from hard IRQ the current task may be
> +	 * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
> +	 * task as the owner of another rt_spin_lock which will confuse PI
> +	 * logic, so return immediately if called form hard IRQ or NMI.
> +	 *
> +	 * Note, irqs_disabled() case is ok. This function can be called
> +	 * from raw_spin_lock_irqsave region.
> +	 */
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> +		return NULL;
> +	if (!pcp_allowed_order(order))
> +		return NULL;
…

Sebastian
diff mbox series

Patch

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b0fe9f62d15b..b41bb6e01781 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,6 +39,25 @@  static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
 	return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
 }
 
+static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
+{
+	/*
+	 * !__GFP_DIRECT_RECLAIM -> direct claim is not allowed.
+	 * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
+	 * All GFP_* flags including GFP_NOWAIT use one or both flags.
+	 * try_alloc_pages() is the only API that doesn't specify either flag.
+	 *
+	 * This is stronger than GFP_NOWAIT or GFP_ATOMIC because
+	 * those are guaranteed to never block on a sleeping lock.
+	 * Here we are enforcing that the allaaction doesn't ever spin
+	 * on any locks (i.e. only trylocks). There is no highlevel
+	 * GFP_$FOO flag for this use in try_alloc_pages() as the
+	 * regular page allocator doesn't fully support this
+	 * allocation mode.
+	 */
+	return !(gfp_flags & __GFP_RECLAIM);
+}
+
 #ifdef CONFIG_HIGHMEM
 #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
 #else
@@ -347,6 +366,9 @@  static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
 }
 #define alloc_page_vma(...)			alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
 
+struct page *try_alloc_pages_noprof(int nid, unsigned int order);
+#define try_alloc_pages(...)			alloc_hooks(try_alloc_pages_noprof(__VA_ARGS__))
+
 extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
 #define __get_free_pages(...)			alloc_hooks(get_free_pages_noprof(__VA_ARGS__))
 
diff --git a/mm/internal.h b/mm/internal.h
index cb8d8e8e3ffa..5454fa610aac 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1174,6 +1174,7 @@  unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_NOFRAGMENT	  0x0
 #endif
 #define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
+#define ALLOC_TRYLOCK		0x400 /* Only use spin_trylock in allocation path */
 #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
 
 /* Flags that allow allocations below the min watermark. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1cb4b8c8886d..74c2a7af1a77 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2304,7 +2304,11 @@  static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	unsigned long flags;
 	int i;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (!spin_trylock_irqsave(&zone->lock, flags)) {
+		if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+			return 0;
+		spin_lock_irqsave(&zone->lock, flags);
+	}
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
 								alloc_flags);
@@ -2904,7 +2908,11 @@  struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 
 	do {
 		page = NULL;
-		spin_lock_irqsave(&zone->lock, flags);
+		if (!spin_trylock_irqsave(&zone->lock, flags)) {
+			if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+				return NULL;
+			spin_lock_irqsave(&zone->lock, flags);
+		}
 		if (alloc_flags & ALLOC_HIGHATOMIC)
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 		if (!page) {
@@ -4509,7 +4517,8 @@  static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 
 	might_alloc(gfp_mask);
 
-	if (should_fail_alloc_page(gfp_mask, order))
+	if (!(*alloc_flags & ALLOC_TRYLOCK) &&
+	    should_fail_alloc_page(gfp_mask, order))
 		return false;
 
 	*alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
@@ -7023,3 +7032,86 @@  static bool __free_unaccepted(struct page *page)
 }
 
 #endif /* CONFIG_UNACCEPTED_MEMORY */
+
+/**
+ * try_alloc_pages_noprof - opportunistic reentrant allocation from any context
+ * @nid - node to allocate from
+ * @order - allocation order size
+ *
+ * Allocates pages of a given order from the given node. This is safe to
+ * call from any context (from atomic, NMI, and also reentrant
+ * allocator -> tracepoint -> try_alloc_pages_noprof).
+ * Allocation is best effort and to be expected to fail easily so nobody should
+ * rely on the success. Failures are not reported via warn_alloc().
+ *
+ * Return: allocated page or NULL on failure.
+ */
+struct page *try_alloc_pages_noprof(int nid, unsigned int order)
+{
+	/*
+	 * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
+	 * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
+	 * is not safe in arbitrary context.
+	 *
+	 * These two are the conditions for gfpflags_allow_spinning() being true.
+	 *
+	 * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
+	 * to warn. Also warn would trigger printk() which is unsafe from
+	 * various contexts. We cannot use printk_deferred_enter() to mitigate,
+	 * since the running context is unknown.
+	 *
+	 * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
+	 * is safe in any context. Also zeroing the page is mandatory for
+	 * BPF use cases.
+	 *
+	 * Though __GFP_NOMEMALLOC is not checked in the code path below,
+	 * specify it here to highlight that try_alloc_pages()
+	 * doesn't want to deplete reserves.
+	 */
+	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
+	unsigned int alloc_flags = ALLOC_TRYLOCK;
+	struct alloc_context ac = { };
+	struct page *page;
+
+	/*
+	 * In RT spin_trylock() may call raw_spin_lock() which is unsafe in NMI.
+	 * If spin_trylock() is called from hard IRQ the current task may be
+	 * waiting for one rt_spin_lock, but rt_spin_trylock() will mark the
+	 * task as the owner of another rt_spin_lock which will confuse PI
+	 * logic, so return immediately if called form hard IRQ or NMI.
+	 *
+	 * Note, irqs_disabled() case is ok. This function can be called
+	 * from raw_spin_lock_irqsave region.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
+		return NULL;
+	if (!pcp_allowed_order(order))
+		return NULL;
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	/* Bailout, since try_to_accept_memory_one() needs to take a lock */
+	if (has_unaccepted_memory())
+		return NULL;
+#endif
+	/* Bailout, since _deferred_grow_zone() needs to take a lock */
+	if (deferred_pages_enabled())
+		return NULL;
+
+	if (nid == NUMA_NO_NODE)
+		nid = numa_node_id();
+
+	prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
+			    &alloc_gfp, &alloc_flags);
+
+	/*
+	 * Best effort allocation from percpu free list.
+	 * If it's empty attempt to spin_trylock zone->lock.
+	 */
+	page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
+
+	/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
+
+	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
+	kmsan_alloc_page(page, order, alloc_gfp);
+	return page;
+}