Message ID | 20241210023936.46871-2-alexei.starovoitov@gmail.com (mailing list archive)
---|---
State | New
Series | bpf, mm: Introduce __GFP_TRYLOCK
On Mon, Dec 09, 2024 at 06:39:31PM -0800, Alexei Starovoitov wrote:
> +	if (preemptible() && !rcu_preempt_depth())
> +		return alloc_pages_node_noprof(nid,
> +					       GFP_NOWAIT | __GFP_ZERO,
> +					       order);
> +	return alloc_pages_node_noprof(nid,
> +				       __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO,
> +				       order);

[...]

> @@ -4009,7 +4018,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
>  	 * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
>  	 */
>  	alloc_flags |= (__force int)
> -		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
> +		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM | __GFP_TRYLOCK));

It's not quite clear to me that we need __GFP_TRYLOCK to implement this.
I was originally wondering if this wasn't a memalloc_nolock_save() /
memalloc_nolock_restore() situation (akin to memalloc_nofs_save/restore),
but I wonder if we can simply do:

	if (!preemptible() || rcu_preempt_depth())
		alloc_flags |= ALLOC_TRYLOCK;
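For readers less familiar with the pattern Matthew alludes to, here is a minimal
sketch of what a scoped memalloc_nolock_save()/memalloc_nolock_restore() pair
could look like, modeled on the existing memalloc_nofs_save()/memalloc_nofs_restore()
helpers in include/linux/sched/mm.h. The PF_MEMALLOC_NOLOCK bit and both function
names are hypothetical and not part of the posted series:

	/* Hypothetical sketch only; PF_MEMALLOC_NOLOCK does not exist upstream. */
	static inline unsigned int memalloc_nolock_save(void)
	{
		unsigned int flags = current->flags & PF_MEMALLOC_NOLOCK;

		current->flags |= PF_MEMALLOC_NOLOCK;
		return flags;
	}

	static inline void memalloc_nolock_restore(unsigned int flags)
	{
		/* Clear the bit set above and restore whatever was there before. */
		current->flags = (current->flags & ~PF_MEMALLOC_NOLOCK) | flags;
	}

The allocator would then key off current->flags instead of a new gfp bit, which
is also what would let the free path (which takes no gfp argument) see the same
context information, a point that comes up again later in the thread.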
On 2024-12-09 18:39:31 [-0800], Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Tracing BPF programs execute from tracepoints and kprobes where running
> context is unknown, but they need to request additional memory.
> The prior workarounds were using pre-allocated memory and BPF specific
> freelists to satisfy such allocation requests. Instead, introduce
> __GFP_TRYLOCK flag that makes page allocator accessible from any context.
> It relies on percpu free list of pages that rmqueue_pcplist() should be
> able to pop the page from. If it fails (due to IRQ re-entrancy or list
> being empty) then try_alloc_pages() attempts to spin_trylock zone->lock
> and refill percpu freelist as normal.
> BPF program may execute with IRQs disabled and zone->lock is sleeping in RT,
> so trylock is the only option.

The __GFP_TRYLOCK flag looks reasonable given the challenges for BPF, where it
is not known how much memory will be needed and what the calling context is.
I hope it does not spread across the kernel where people do GFP_ATOMIC in
preempt-/IRQ-off sections on PREEMPT_RT and then once they learn that this
does not work, add this flag to the mix to make it work without spending some
time on reworking it.

Side note: I am in the process of hopefully getting rid of the
preempt_disable() from trace points. What remains then is attaching BPF
programs to any code/function with a raw_spinlock_t and I am not yet sure
what to do here.

Sebastian
On Tue 10-12-24 05:31:30, Matthew Wilcox wrote:
> On Mon, Dec 09, 2024 at 06:39:31PM -0800, Alexei Starovoitov wrote:
[...]
> It's not quite clear to me that we need __GFP_TRYLOCK to implement this.
> I was originally wondering if this wasn't a memalloc_nolock_save() /
> memalloc_nolock_restore() situation (akin to memalloc_nofs_save/restore),
> but I wonder if we can simply do:
>
> 	if (!preemptible() || rcu_preempt_depth())
> 		alloc_flags |= ALLOC_TRYLOCK;

preemptible() is unusable without CONFIG_PREEMPT_COUNT, but I do agree that
__GFP_TRYLOCK is not really a preferred way to go forward, for three reasons.

First, I do not really like the name: it tells what the flag does rather than
how it should be used. This is a general pattern of many gfp flags,
unfortunately, and historically it has turned out error prone. If a gfp flag
is really needed then something like __GFP_ANY_CONTEXT should be used.
Whether the current implementation requires a trylock on zone->lock or other
changes is an implementation detail; the user should have a clear
understanding that the allocation is allowed from any context (NMI, IRQ or
otherwise atomic contexts).

Second, is there any reason why GFP_ATOMIC cannot be extended to support new
contexts? This allocation mode is already documented to be usable from atomic
contexts except from NMI and raw_spinlocks. But is it feasible to extend the
current implementation to use only trylock on zone->lock if called from
in_nmi(), to reduce unexpected failures on contention for existing users?

Third, do we even want such a strong guarantee in the generic page allocator
path and make it even more complex and harder to maintain? We already have a
precedent in the form of __alloc_pages_bulk, which is a special case allocator
mode living outside of the page allocator path. It seems that it covers most
of your requirements except the fallback to the regular allocation path
AFAICS. Is this something you could piggyback on?
On 12/10/24 03:39, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Tracing BPF programs execute from tracepoints and kprobes where running
> context is unknown, but they need to request additional memory.
> The prior workarounds were using pre-allocated memory and BPF specific
> freelists to satisfy such allocation requests. Instead, introduce
> __GFP_TRYLOCK flag that makes page allocator accessible from any context.
> It relies on percpu free list of pages that rmqueue_pcplist() should be
> able to pop the page from. If it fails (due to IRQ re-entrancy or list
> being empty) then try_alloc_pages() attempts to spin_trylock zone->lock
> and refill percpu freelist as normal.
> BPF program may execute with IRQs disabled and zone->lock is sleeping in RT,
> so trylock is the only option.
> In theory we can introduce percpu reentrance counter and increment it
> every time spin_lock_irqsave(&zone->lock, flags) is used,
> but we cannot rely on it. Even if this cpu is not in page_alloc path
> the spin_lock_irqsave() is not safe, since BPF prog might be called
> from tracepoint where preemption is disabled. So trylock only.
>
> Note, free_page and memcg are not taught about __GFP_TRYLOCK yet.
> The support comes in the next patches.
>
> This is a first step towards supporting BPF requirements in SLUB
> and getting rid of bpf_mem_alloc.
> That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

I think there might be more non-try spin_locks reachable from page allocations:

- in reserve_highatomic_pageblock() which I think is reachable unless this
  is limited to order-0
- try_to_accept_memory_one()
- as part of post_alloc_hook() in set_page_owner(), stack depot might do
  raw_spin_lock_irqsave(), is that one ok?

hope I didn't miss anything else especially in those other debugging hooks
(KASAN etc)
On Tue, Dec 10, 2024 at 10:05:22AM +0100, Michal Hocko wrote:
[...]
> Second, is there any reason why GFP_ATOMIC cannot be extended to support new

GFP_ATOMIC has access to memory reserves. I see GFP_NOWAIT as a better fit,
and if someone wants access to the reserves they can use __GFP_HIGH with
GFP_NOWAIT.

> contexts? This allocation mode is already documented to be usable from atomic
> contexts except from NMI and raw_spinlocks. But is it feasible to extend the
> current implementation to use only trylock on zone->lock if called from
> in_nmi(), to reduce unexpected failures on contention for existing users?

I think this is the question we (MM folks) need to answer, not the users.

> Third, do we even want such a strong guarantee in the generic page allocator
> path and make it even more complex and harder to maintain?

I think the alternative would be higher maintenance cost, i.e. everyone
creating their own layer/solution/caching over the page allocator, which I
think we agree we want to avoid (Vlastimil's LSFMM talk).

> We already have a precedent in the form of __alloc_pages_bulk, which is a
> special case allocator mode living outside of the page allocator path. It
> seems that it covers most of your requirements except the fallback to the
> regular allocation path AFAICS. Is this something you could piggyback on?

BPF already has bpf_mem_alloc() and IIUC this series is an effort to unify
and have a single solution.
On Mon, Dec 9, 2024 at 9:31 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Dec 09, 2024 at 06:39:31PM -0800, Alexei Starovoitov wrote:
[...]
> It's not quite clear to me that we need __GFP_TRYLOCK to implement this.
> I was originally wondering if this wasn't a memalloc_nolock_save() /
> memalloc_nolock_restore() situation (akin to memalloc_nofs_save/restore),

Interesting idea. It could be useful to pass extra flags into the free_page
path, since it doesn't have flags today and I'm adding free_pages_nolock() in
patch 2 just to pass fpi_t fpi_flags around.
A memalloc_nofs_save()-like approach makes the most sense when there are
multiple allocations and the code path attempts to be generic. For the bpf use
case it's probably overkill. I guess we might have both __GFP_TRYLOCK and a
memalloc_nolock_save() that clears many flags. Note it needs to clear
__GFP_KSWAPD_RECLAIM, which is not safe when a raw spin lock is held.

> but I wonder if we can simply do:
>
> 	if (!preemptible() || rcu_preempt_depth())
> 		alloc_flags |= ALLOC_TRYLOCK;

I don't think we can do that. It will penalize existing GFP_ATOMIC/NOWAIT
users. kmalloc from an RCU CS with GFP_NOWAIT is fine today.
On Tue, Dec 10, 2024 at 1:01 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2024-12-09 18:39:31 [-0800], Alexei Starovoitov wrote:
[...]
> The __GFP_TRYLOCK flag looks reasonable given the challenges for BPF, where
> it is not known how much memory will be needed and what the calling context
> is.

Exactly.

> I hope it does not spread across the kernel where people do GFP_ATOMIC in
> preempt-/IRQ-off sections on PREEMPT_RT and then once they learn that this
> does not work, add this flag to the mix to make it work without spending
> some time on reworking it.

We can call it __GFP_BPF to discourage any other usage, but that seems like
an odd "solution" to a code review problem.
If people start using __GFP_TRYLOCK just to shut up lockdep splats they will
soon realize that it's an _opportunistic_ allocator. bpf doesn't need more
than a page, and a single page will likely be found in the percpu free page
pool, so this opportunistic approach will work most of the time for bpf, but
unlikely for others that need order >= PAGE_ALLOC_COSTLY_ORDER (3).

> Side note: I am in the process of hopefully getting rid of the
> preempt_disable() from trace points. What remains then is attaching BPF
> programs to any code/function with a raw_spinlock_t and I am not yet sure
> what to do here.

That won't help the bpf core. There are tracepoints that are called after
preemption is disabled. The worst is trace_contention_begin(), and people
have good reasons to attach bpf progs there to collect contention stats.
In such a case the bpf prog has no idea what kind of spin_lock is contending.
It might have disabled preemption and/or irqs before getting to that
tracepoint. So removal of preempt_disable from tracepoint handling logic
doesn't help the bpf core. It's a good thing to do anyway.
On Tue, Dec 10, 2024 at 1:05 AM Michal Hocko <mhocko@suse.com> wrote:
>
[...]
> First, I do not really like the name: it tells what the flag does rather than
> how it should be used. This is a general pattern of many gfp flags,
> unfortunately, and historically it has turned out error prone. If a gfp flag
> is really needed then something like __GFP_ANY_CONTEXT should be used.
> Whether the current implementation requires a trylock on zone->lock or other
> changes is an implementation detail; the user should have a clear
> understanding that the allocation is allowed from any context (NMI, IRQ or
> otherwise atomic contexts).

__GFP_ANY_CONTEXT would make sense if we wanted to make it available for all
kernel users. In this case I agree with Sebastian. This is a bpf specific
feature, since it doesn't know the context. All other kernel users should pick
GFP_KERNEL or ATOMIC or NOWAIT. Exposing GFP_ANY_CONTEXT to all may lead to
sloppy code in drivers and elsewhere.

> Second, is there any reason why GFP_ATOMIC cannot be extended to support new
> contexts? This allocation mode is already documented to be usable from atomic
> contexts except from NMI and raw_spinlocks. But is it feasible to extend the
> current implementation to use only trylock on zone->lock if called from
> in_nmi(), to reduce unexpected failures on contention for existing users?

No, in_nmi() doesn't help. It's the lack of reentrancy of the slab and page
allocators that is the issue. The page allocator might grab zone->lock. In !RT
it will disable irqs; in RT it will stay sleepable. Both paths will be calling
other kernel code, including tracepoints, potential kprobes, etc., and a bpf
prog may be attached somewhere. If it calls alloc_page() it may deadlock on
zone->lock. The pcpu lock is thankfully a trylock already. So the
!irqs_disabled() part of preemptible() guarantees that zone->lock won't
deadlock in !RT.
And the rcu_preempt_depth() case just steers bpf into the trylock-only path in
RT, since there is no way to tell whether it's safe to call the sleepable
spin_lock(&zone->lock).

> Third, do we even want such a strong guarantee in the generic page allocator
> path and make it even more complex and harder to maintain?

I'm happy to add myself as R: or M: for the trylock bits, since that will be a
fundamental building block for bpf.

> We already have a precedent in the form of __alloc_pages_bulk, which is a
> special case allocator mode living outside of the page allocator path. It
> seems that it covers most of your requirements except the fallback to the
> regular allocation path AFAICS. Is this something you could piggyback on?

__alloc_pages_bulk() has all the same issues. It takes locks. Also it doesn't
support __GFP_ACCOUNT, which is a show stopper. All bpf allocations are going
through memcg.
On Tue, Dec 10, 2024 at 10:39 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/10/24 03:39, Alexei Starovoitov wrote:
[...]
> I think there might be more non-try spin_locks reachable from page allocations:
>
> - in reserve_highatomic_pageblock() which I think is reachable unless this
>   is limited to order-0

Good point. I missed this bit:

	if (order > 0)
		alloc_flags |= ALLOC_HIGHATOMIC;

In bpf use case it will be called with order == 0 only,
but it's better to foolproof it.
I will switch to:

	__GFP_NOMEMALLOC | __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO | __GFP_ACCOUNT

> - try_to_accept_memory_one()

when I studied the code it looked to me that there should be no
unaccepted_pages.
I think you're saying that there could be unaccepted memory
from the previous allocation and trylock attempt just got unlucky
to reach that path?
What do you think of the following:

-	cond_accept_memory(zone, order);
+	cond_accept_memory(zone, order, alloc_flags);

 	/*
 	 * Detect whether the number of free pages is below high
@@ -7024,7 +7024,8 @@ static inline bool has_unaccepted_memory(void)
 	return static_branch_unlikely(&zones_with_unaccepted_pages);
 }

-static bool cond_accept_memory(struct zone *zone, unsigned int order)
+static bool cond_accept_memory(struct zone *zone, unsigned int order,
+			       unsigned int alloc_flags)
 {
 	long to_accept;
 	bool ret = false;

@@ -7032,6 +7033,9 @@ static bool cond_accept_memory(struct zone *zone, unsigned int order)
 	if (!has_unaccepted_memory())
 		return false;

+	if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+		return false;
+

or is there a better approach? Reading from current->flags the way Matthew
proposed?

> - as part of post_alloc_hook() in set_page_owner(), stack depot might do
>   raw_spin_lock_irqsave(), is that one ok?

Well, I looked at the stack depot and was tempted to add trylock handling
there, but it looked to be a bit dodgy in general and I figured it should be
done separately from this set. Like:

	if (unlikely(can_alloc && !READ_ONCE(new_pool))) {
		page = alloc_pages(gfp_nested_mask(alloc_flags),

followed by:

	if (in_nmi()) {
		/* We can never allocate in NMI context. */
		WARN_ON_ONCE(can_alloc);

That warn is too late. If we were in_nmi and called alloc_pages, the kernel
might be misbehaving already.

> hope I didn't miss anything else especially in those other debugging hooks
> (KASAN etc)

I looked through them and could be missing something, of course.
kasan usage in alloc_page path seems fine.
But for slab I found kasan_quarantine logic which needs a special treatment.
Other slab debugging bits pose issues too.
The rough idea is to do kmalloc_nolock() / kfree_nolock() that
don't call into any pre/post hooks (including slab_free_hook,
slab_pre_alloc_hook).
kmalloc_nolock() will pretty much call __slab_alloc_node() directly
and do basic kasan poison stuff that needs no locks.
I will be going over all the paths again, of course.

Thanks for the reviews so far!
On 12/10/24 22:53, Alexei Starovoitov wrote:
> On Tue, Dec 10, 2024 at 1:01 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
>> On 2024-12-09 18:39:31 [-0800], Alexei Starovoitov wrote:
[...]
>> The __GFP_TRYLOCK flag looks reasonable given the challenges for BPF, where
>> it is not known how much memory will be needed and what the calling context
>> is.
>
> Exactly.
>
>> I hope it does not spread across the kernel where people do GFP_ATOMIC in
>> preempt-/IRQ-off sections on PREEMPT_RT and then once they learn that this
>> does not work, add this flag to the mix to make it work without spending
>> some time on reworking it.
>
> We can call it __GFP_BPF to discourage any other usage, but that seems like
> an odd "solution" to a code review problem.

Could we perhaps not expose the flag to public headers at all, and keep it
only as an internal detail of try_alloc_pages_noprof()?
On 12/10/24 23:42, Alexei Starovoitov wrote:
> On Tue, Dec 10, 2024 at 10:39 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> On 12/10/24 03:39, Alexei Starovoitov wrote:
[...]
>> I think there might be more non-try spin_locks reachable from page allocations:
>>
>> - in reserve_highatomic_pageblock() which I think is reachable unless this
>>   is limited to order-0
>
> Good point. I missed this bit:
>
> 	if (order > 0)
> 		alloc_flags |= ALLOC_HIGHATOMIC;
>
> In bpf use case it will be called with order == 0 only,
> but it's better to foolproof it.
> I will switch to:
>
> 	__GFP_NOMEMALLOC | __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO | __GFP_ACCOUNT

Should work, yeah.

>> - try_to_accept_memory_one()
>
> when I studied the code it looked to me that there should be no
> unaccepted_pages.
> I think you're saying that there could be unaccepted memory
> from the previous allocation and trylock attempt just got unlucky
> to reach that path?

The unaccepted memory can exist after boot in confidential computing guests to
speed up their boot, and then is accepted (encrypted for the guest) on demand.

> What do you think of the following:
>
> -	cond_accept_memory(zone, order);
> +	cond_accept_memory(zone, order, alloc_flags);
>
>  	/*
>  	 * Detect whether the number of free pages is below high
> @@ -7024,7 +7024,8 @@ static inline bool has_unaccepted_memory(void)
>  	return static_branch_unlikely(&zones_with_unaccepted_pages);
>  }
>
> -static bool cond_accept_memory(struct zone *zone, unsigned int order)
> +static bool cond_accept_memory(struct zone *zone, unsigned int order,
> +			       unsigned int alloc_flags)
>  {
>  	long to_accept;
>  	bool ret = false;
>
> @@ -7032,6 +7033,9 @@ static bool cond_accept_memory(struct zone *zone, unsigned int order)
>  	if (!has_unaccepted_memory())
>  		return false;
>
> +	if (unlikely(alloc_flags & ALLOC_TRYLOCK))
> +		return false;
> +

Yeah, that should be the straightforward fix.

>> hope I didn't miss anything else especially in those other debugging hooks
>> (KASAN etc)
>
> I looked through them and could be missing something, of course.
> kasan usage in alloc_page path seems fine.
> But for slab I found kasan_quarantine logic which needs a special treatment.
> Other slab debugging bits pose issues too.
> The rough idea is to do kmalloc_nolock() / kfree_nolock() that
> don't call into any pre/post hooks (including slab_free_hook,
> slab_pre_alloc_hook).
> kmalloc_nolock() will pretty much call __slab_alloc_node() directly
> and do basic kasan poison stuff that needs no locks.

You'll probably want the memcg handling though? Alloc tagging would be good
too. Both IIRC propagate the gfp flags when trying to allocate their metadata
and can deal with failure, so maybe it will work in a straightforward way.
Slab debugging will likely have to be limited, but it's also not handled in
the pre/post hooks but deeper.

> I will be going over all the paths again, of course.
>
> Thanks for the reviews so far!
On Tue 10-12-24 12:25:04, Shakeel Butt wrote:
> On Tue, Dec 10, 2024 at 10:05:22AM +0100, Michal Hocko wrote:
[...]
> > Second, is there any reason why GFP_ATOMIC cannot be extended to support new
>
> GFP_ATOMIC has access to memory reserves. I see GFP_NOWAIT as a better fit,
> and if someone wants access to the reserves they can use __GFP_HIGH with
> GFP_NOWAIT.

Right. The problem with GFP_NOWAIT is that it is very often used as an
opportunistic allocation attempt before a more costly fallback. Failing those
just because of the zone lock (or other internal locks) contention seems too
aggressive.

> > Third, do we even want such a strong guarantee in the generic page allocator
> > path and make it even more complex and harder to maintain?
>
> I think the alternative would be higher maintenance cost, i.e. everyone
> creating their own layer/solution/caching over the page allocator, which I
> think we agree we want to avoid (Vlastimil's LSFMM talk).

Yes, I do agree that we do not want to grow special case allocators. I was
merely interested in an option to reuse the existing bulk allocator for this
new purpose.
On Tue 10-12-24 14:06:32, Alexei Starovoitov wrote:
> On Tue, Dec 10, 2024 at 1:05 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > First, I do not really like the name: it tells what the flag does rather
> > than how it should be used. This is a general pattern of many gfp flags,
> > unfortunately, and historically it has turned out error prone. If a gfp
> > flag is really needed then something like __GFP_ANY_CONTEXT should be used.
> > Whether the current implementation requires a trylock on zone->lock or
> > other changes is an implementation detail; the user should have a clear
> > understanding that the allocation is allowed from any context (NMI, IRQ or
> > otherwise atomic contexts).
>
> __GFP_ANY_CONTEXT would make sense if we wanted to make it available for all
> kernel users. In this case I agree with Sebastian. This is a bpf specific
> feature, since it doesn't know the context. All other kernel users should
> pick GFP_KERNEL or ATOMIC or NOWAIT. Exposing GFP_ANY_CONTEXT to all may lead
> to sloppy code in drivers and elsewhere.

I do not think we want a single user special allocation mode. Not only is
there no way to enforce that this remains a BPF-special feature, it is also
not really a good idea to have a single user feature in the allocator.

> > Second, is there any reason why GFP_ATOMIC cannot be extended to support
> > new contexts? This allocation mode is already documented to be usable from
> > atomic contexts except from NMI and raw_spinlocks. But is it feasible to
> > extend the current implementation to use only trylock on zone->lock if
> > called from in_nmi(), to reduce unexpected failures on contention for
> > existing users?
>
> No, in_nmi() doesn't help. It's the lack of reentrancy of the slab and page
> allocators that is the issue. The page allocator might grab zone->lock. In
> !RT it will disable irqs; in RT it will stay sleepable. Both paths will be
> calling other kernel code, including tracepoints, potential kprobes, etc.,
> and a bpf prog may be attached somewhere. If it calls alloc_page() it may
> deadlock on zone->lock. The pcpu lock is thankfully a trylock already. So
> the !irqs_disabled() part of preemptible() guarantees that zone->lock won't
> deadlock in !RT.
> And the rcu_preempt_depth() case just steers bpf into the trylock-only path
> in RT, since there is no way to tell whether it's safe to call the sleepable
> spin_lock(&zone->lock).

OK I see!

> > We already have a precedent in the form of __alloc_pages_bulk, which is a
> > special case allocator mode living outside of the page allocator path. It
> > seems that it covers most of your requirements except the fallback to the
> > regular allocation path AFAICS. Is this something you could piggyback on?
>
> __alloc_pages_bulk() has all the same issues. It takes locks. Also it
> doesn't support __GFP_ACCOUNT, which is a show stopper. All bpf allocations
> are going through memcg.

OK, this requirement was not clear until I've reached later patches in the
series (now).
On Wed, Dec 11, 2024 at 12:39 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/10/24 22:53, Alexei Starovoitov wrote:
[...]
> Could we perhaps not expose the flag to public headers at all, and keep it
> only as an internal detail of try_alloc_pages_noprof()?

public headers? To pass an additional bit via gfp flags into alloc_pages,
gfp_types.h has to be touched. If you mean moving try_alloc_pages() into
mm/page_alloc.c and adding another argument to __alloc_pages_noprof, then
it's not pretty. It has a 'gfp_t gfp' argument. It should be used to pass the
intent.

We don't have to add __GFP_TRYLOCK at all if we go with the
memalloc_nolock_save() approach. So I started looking at it, but immediately
hit trouble with bits. There are 5 bits left in PF_ and 3 already used for mm
needs. That doesn't look sustainable long term.

How about we alias the nolock concept with PF_MEMALLOC_PIN?
As far as I could trace, PF_MEMALLOC_PIN clears __GFP_MOVABLE and nothing
else. The same bit plus the lack of __GFP_KSWAPD_RECLAIM in gfp flags would
mean nolock mode in alloc_pages, while PF_MEMALLOC_PIN alone would mean nolock
in free_pages and deeper inside memcg paths and such.

thoughts? too hacky?
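To make the aliasing idea above concrete, here is a rough sketch of the check
that could sit in the allocator's flag setup; this is a hypothetical reading of
the proposal, not code from the series. PF_MEMALLOC_PIN and
__GFP_KSWAPD_RECLAIM are existing symbols, while ALLOC_TRYLOCK is the internal
flag introduced by the patch below:

	/*
	 * Hypothetical: treat PF_MEMALLOC_PIN together with a missing
	 * __GFP_KSWAPD_RECLAIM as the "nolock" allocation mode.
	 */
	if ((current->flags & PF_MEMALLOC_PIN) &&
	    !(gfp_mask & __GFP_KSWAPD_RECLAIM))
		alloc_flags |= ALLOC_TRYLOCK;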
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b0fe9f62d15b..f68daa9c997b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -347,6 +347,31 @@ static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
 }
 #define alloc_page_vma(...)	alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
 
+static inline struct page *try_alloc_pages_noprof(int nid, unsigned int order)
+{
+	/*
+	 * If spin_locks are not held and interrupts are enabled, use normal
+	 * path. BPF progs run under rcu_read_lock(), so in PREEMPT_RT
+	 * rcu_preempt_depth() will be >= 1 and will use trylock path.
+	 */
+	if (preemptible() && !rcu_preempt_depth())
+		return alloc_pages_node_noprof(nid,
+					       GFP_NOWAIT | __GFP_ZERO,
+					       order);
+	/*
+	 * Best effort allocation from percpu free list.
+	 * If it's empty attempt to spin_trylock zone->lock.
+	 * Do not specify __GFP_KSWAPD_RECLAIM to avoid wakeup_kswapd
+	 * that may need to grab a lock.
+	 * Do not specify __GFP_ACCOUNT to avoid local_lock.
+	 * Do not warn either.
+	 */
+	return alloc_pages_node_noprof(nid,
+				       __GFP_TRYLOCK | __GFP_NOWARN | __GFP_ZERO,
+				       order);
+}
+#define try_alloc_pages(...)	alloc_hooks(try_alloc_pages_noprof(__VA_ARGS__))
+
 extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
 #define __get_free_pages(...)	alloc_hooks(get_free_pages_noprof(__VA_ARGS__))
 
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 65db9349f905..72b385a7888d 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -48,6 +48,7 @@ enum {
 	___GFP_THISNODE_BIT,
 	___GFP_ACCOUNT_BIT,
 	___GFP_ZEROTAGS_BIT,
+	___GFP_TRYLOCK_BIT,
 #ifdef CONFIG_KASAN_HW_TAGS
 	___GFP_SKIP_ZERO_BIT,
 	___GFP_SKIP_KASAN_BIT,
@@ -86,6 +87,7 @@ enum {
 #define ___GFP_THISNODE		BIT(___GFP_THISNODE_BIT)
 #define ___GFP_ACCOUNT		BIT(___GFP_ACCOUNT_BIT)
 #define ___GFP_ZEROTAGS		BIT(___GFP_ZEROTAGS_BIT)
+#define ___GFP_TRYLOCK		BIT(___GFP_TRYLOCK_BIT)
 #ifdef CONFIG_KASAN_HW_TAGS
 #define ___GFP_SKIP_ZERO	BIT(___GFP_SKIP_ZERO_BIT)
 #define ___GFP_SKIP_KASAN	BIT(___GFP_SKIP_KASAN_BIT)
@@ -293,6 +295,7 @@ enum {
 #define __GFP_COMP	((__force gfp_t)___GFP_COMP)
 #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
 #define __GFP_ZEROTAGS	((__force gfp_t)___GFP_ZEROTAGS)
+#define __GFP_TRYLOCK	((__force gfp_t)___GFP_TRYLOCK)
 #define __GFP_SKIP_ZERO ((__force gfp_t)___GFP_SKIP_ZERO)
 #define __GFP_SKIP_KASAN ((__force gfp_t)___GFP_SKIP_KASAN)
 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index bb8a59c6caa2..592c93ee5f35 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -50,6 +50,7 @@
 	gfpflag_string(__GFP_RECLAIM),		\
 	gfpflag_string(__GFP_DIRECT_RECLAIM),	\
 	gfpflag_string(__GFP_KSWAPD_RECLAIM),	\
+	gfpflag_string(__GFP_TRYLOCK),		\
 	gfpflag_string(__GFP_ZEROTAGS)
 
 #ifdef CONFIG_KASAN_HW_TAGS
diff --git a/mm/fail_page_alloc.c b/mm/fail_page_alloc.c
index 7647096170e9..b3b297d67909 100644
--- a/mm/fail_page_alloc.c
+++ b/mm/fail_page_alloc.c
@@ -31,6 +31,12 @@ bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 		return false;
 	if (gfp_mask & __GFP_NOFAIL)
 		return false;
+	if (gfp_mask & __GFP_TRYLOCK)
+		/*
+		 * Internals of should_fail_ex() are not compatible
+		 * with trylock concept.
+		 */
+		return false;
 	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
 		return false;
 	if (fail_page_alloc.ignore_gfp_reclaim &&
diff --git a/mm/internal.h b/mm/internal.h
index cb8d8e8e3ffa..c082b8fa1d71 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1175,6 +1175,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #endif
 #define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
 #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
+#define ALLOC_TRYLOCK		0x1000000 /* Only use spin_trylock in allocation path */
 
 /* Flags that allow allocations below the min watermark. */
 #define ALLOC_RESERVES (ALLOC_NON_BLOCK|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1cb4b8c8886d..d511e68903c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2304,7 +2304,11 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	unsigned long flags;
 	int i;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (!spin_trylock_irqsave(&zone->lock, flags)) {
+		if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+			return 0;
+		spin_lock_irqsave(&zone->lock, flags);
+	}
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
 								alloc_flags);
@@ -2904,7 +2908,11 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 
 	do {
 		page = NULL;
-		spin_lock_irqsave(&zone->lock, flags);
+		if (!spin_trylock_irqsave(&zone->lock, flags)) {
+			if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+				return NULL;
+			spin_lock_irqsave(&zone->lock, flags);
+		}
 		if (alloc_flags & ALLOC_HIGHATOMIC)
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 		if (!page) {
@@ -4001,6 +4009,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
 	 */
 	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_MIN_RESERVE);
 	BUILD_BUG_ON(__GFP_KSWAPD_RECLAIM != (__force gfp_t) ALLOC_KSWAPD);
+	BUILD_BUG_ON(__GFP_TRYLOCK != (__force gfp_t) ALLOC_TRYLOCK);
 
 	/*
 	 * The caller may dip into page reserves a bit more if the caller
@@ -4009,7 +4018,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
 	 * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
 	 */
 	alloc_flags |= (__force int)
-		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
+		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM | __GFP_TRYLOCK));
 
 	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
 		/*
@@ -4509,6 +4518,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 
 	might_alloc(gfp_mask);
 
+	*alloc_flags |= (__force int) (gfp_mask & __GFP_TRYLOCK);
+
 	if (should_fail_alloc_page(gfp_mask, order))
 		return false;
 
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index 4d8d94146f8d..1f7f4269fa10 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -682,6 +682,7 @@ static const struct {
 	{ "__GFP_RECLAIM",		"R" },
 	{ "__GFP_DIRECT_RECLAIM",	"DR" },
 	{ "__GFP_KSWAPD_RECLAIM",	"KR" },
+	{ "__GFP_TRYLOCK",		"TL" },
 };
 
 static size_t max_gfp_len;
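For reference, a minimal usage sketch of the try_alloc_pages() helper added by
the patch above; the wrapper function, the NUMA_NO_NODE choice and the error
handling are illustrative assumptions, not part of the series:

	/* Hypothetical caller: an allocation that may happen in any context. */
	static int demo_try_alloc(void)
	{
		struct page *page;

		/* Opportunistic, order-0, zeroed; can fail at any time. */
		page = try_alloc_pages(NUMA_NO_NODE, 0);
		if (!page)
			return -ENOMEM;

		/*
		 * ... use the page ...
		 * Freeing from a restricted context needs the free_pages_nolock()
		 * helper added later in the series; from a normal context a plain
		 * __free_page() is fine.
		 */
		__free_page(page);
		return 0;
	}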