| Message ID | 20250327145159.99799-1-alexei.starovoitov@gmail.com (mailing list archive) |
|---|---|
| State | New |
| Series | [GIT,PULL] Introduce try_alloc_pages for 6.15 |
On Thu, 27 Mar 2025 at 07:52, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> The pull includes work from Sebastian, Vlastimil and myself
> with a lot of help from Michal and Shakeel.
> This is a first step towards making kmalloc reentrant to get rid
> of slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc.
> These patches make page allocator safe from any context.

So I've pulled this too, since it looked generally fine.

The one reaction I had is that when you basically change

        spin_lock_irqsave(&zone->lock, flags);

into

        if (!spin_trylock_irqsave(&zone->lock, flags)) {
                if (unlikely(alloc_flags & ALLOC_TRYLOCK))
                        return NULL;
                spin_lock_irqsave(&zone->lock, flags);
        }

we've seen bad cache behavior for this kind of pattern in other situations: if the "try" fails, the subsequent "do the lock for real" case now does the wrong thing, in that it will immediately try again even if it's almost certainly just going to fail - causing extra write cache accesses.

So typically, in places that can see contention, it's better to either do

 (a) trylock followed by a slowpath that takes the fact that it was locked into account and does a read-only loop until it sees otherwise

     This is, for example, what the mutex code does with that __mutex_trylock() -> mutex_optimistic_spin() pattern, but our spinlocks end up doing similar things (ie "trylock" followed by "release irq and do the 'relax loop' thing).

or

 (b) do the trylock and lock separately, ie

        if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
                if (!spin_trylock_irqsave(&zone->lock, flags))
                        return NULL;
        } else
                spin_lock_irqsave(&zone->lock, flags);

so that you don't end up doing two cache accesses for ownership that can cause extra bouncing.

I'm not sure this matters at all in the allocation path - contention may simply not be enough of an issue, and the trylock is purely about "unlikely NMI worries", but I do worry that you might have made the normal case slower.

It's easily fixable later if it ends up being the case, so I don't worry too much about it, but I did want to mention it since going through the code made me react to it.

            Linus
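
For readers less familiar with pattern (a), here is a minimal self-contained sketch of the "trylock, then read-only spin" idea. It uses a toy test-and-set lock built on C11 atomics purely for illustration; it is not the kernel's qspinlock or mutex code, which implement the same idea with far more machinery (MCS queueing, the 'pending' bit, optimistic spinning on the owner):

        #include <stdatomic.h>
        #include <stdbool.h>

        struct toy_lock { atomic_int locked; };

        static bool toy_trylock(struct toy_lock *l)
        {
                int unlocked = 0;

                /* single atomic RMW; fails without spinning */
                return atomic_compare_exchange_strong(&l->locked, &unlocked, 1);
        }

        static void toy_lock_acquire(struct toy_lock *l)
        {
                /* fast path: the uncontended case costs exactly one atomic op */
                if (toy_trylock(l))
                        return;

                /* slow path: spin read-only so the owner keeps the cacheline */
                for (;;) {
                        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                                ;       /* cpu_relax() in real kernel code */
                        if (toy_trylock(l))
                                return;
                }
        }

The point of the read-only loop is that waiters generate shared cache accesses while the lock is held, and only retry the write (the atomic compare-and-exchange) once the lock word looks free.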
On Sun, 30 Mar 2025 at 13:42, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> The one reaction I had is that when you basically change

Oh, actually, two reactions now that I fixed up the merge build issue which forced me to look at the function naming.

That 'localtry_lock_irqsave()' naming is horrendous.

The "try" part in it makes me think it's a trylock. But no, the "localtry" just comes from the lock naming, and the trylock version is localtry_trylock_irqsave.

That's horrible. I missed that on the first read-through, and I'm not unpulling it because the code generally makes sense. But I do think that the lock name needs fixing.

"localtry_lock_t" is not a good name, and spreading that odd "localtry" into the actual (non-try) locking functions makes the naming actively insane.

If the *only* operation you could do on the lock was "trylock", then "localtry" would be fine. Then the lock literally is a "only try" thing. But as it is, the naming now ends up actively broken.

Honestly, the lock name should probably reflect the fact that it can be used from any context (with a "trylock"), not about the trylock part itself.

So maybe "nmisafe_local_lock_t" or something in that vein?

Please fix this up, There aren't *that* many users of "localtry_xyzzy", let's get this fixed before there are more of them.

            Linus
The pull request you sent on Thu, 27 Mar 2025 10:51:59 -0400:
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git tags/bpf_try_alloc_pages
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/aa918db707fba507e85217961643281ee8dfb2ed
Thank you!
On Sun, Mar 30, 2025 at 1:42 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Thu, 27 Mar 2025 at 07:52, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > The pull includes work from Sebastian, Vlastimil and myself
> > with a lot of help from Michal and Shakeel.
> > This is a first step towards making kmalloc reentrant to get rid
> > of slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc.
> > These patches make page allocator safe from any context.
>
> So I've pulled this too, since it looked generally fine.

Thanks!

> The one reaction I had is that when you basically change
>
>         spin_lock_irqsave(&zone->lock, flags);
>
> into
>
>         if (!spin_trylock_irqsave(&zone->lock, flags)) {
>                 if (unlikely(alloc_flags & ALLOC_TRYLOCK))
>                         return NULL;
>                 spin_lock_irqsave(&zone->lock, flags);
>         }
>
> we've seen bad cache behavior for this kind of pattern in other
> situations: if the "try" fails, the subsequent "do the lock for real"
> case now does the wrong thing, in that it will immediately try again
> even if it's almost certainly just going to fail - causing extra write
> cache accesses.
>
> So typically, in places that can see contention, it's better to either do
>
>  (a) trylock followed by a slowpath that takes the fact that it was
> locked into account and does a read-only loop until it sees otherwise
>
>      This is, for example, what the mutex code does with that
> __mutex_trylock() -> mutex_optimistic_spin() pattern, but our
> spinlocks end up doing similar things (ie "trylock" followed by
> "release irq and do the 'relax loop' thing).

Right,
__mutex_trylock(lock) -> mutex_optimistic_spin() pattern is equivalent to 'pending' bit spinning in qspinlock.

> or
>
>  (b) do the trylock and lock separately, ie
>
>         if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
>                 if (!spin_trylock_irqsave(&zone->lock, flags))
>                         return NULL;
>         } else
>                 spin_lock_irqsave(&zone->lock, flags);
>
> so that you don't end up doing two cache accesses for ownership that
> can cause extra bouncing.

Ok, I will switch to above.

> I'm not sure this matters at all in the allocation path - contention
> may simply not be enough of an issue, and the trylock is purely about
> "unlikely NMI worries", but I do worry that you might have made the
> normal case slower.

We actually did see zone->lock being contended in production. Last time the culprit was an inadequate per-cpu caching and these series in 6.11 fixed it:
https://lwn.net/Articles/947900/
I don't think we've seen it contended in the newer kernels.

Johannes, pls correct me if I'm wrong.

But to avoid being finger pointed, I'll switch to checking alloc_flags first. It does seem a better trade off to avoid cache bouncing because of 2nd cmpxchg. Though when I wrote it this way I convinced myself and others that it's faster to do trylock first to avoid branch misprediction.
On Sun, Mar 30, 2025 at 1:56 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> But I do think that the lock name needs fixing.
>
> "localtry_lock_t" is not a good name, and spreading that odd
> "localtry" into the actual (non-try) locking functions makes the
> naming actively insane.
>
> If the *only* operation you could do on the lock was "trylock", then
> "localtry" would be fine. Then the lock literally is a "only try"
> thing. But as it is, the naming now ends up actively broken.
>
> Honestly, the lock name should probably reflect the fact that it can
> be used from any context (with a "trylock"), not about the trylock
> part itself.
>
> So maybe "nmisafe_local_lock_t" or something in that vein?
>
> Please fix this up, There aren't *that* many users of
> "localtry_xyzzy", let's get this fixed before there are more of them.

Ok. Agree with the reasoning that the name doesn't quite fit.

nmisafe_local_lock_t name works for me, though nmisafe_local_lock_irqsave() is a bit verbose.

Don't have better name suggestions at the moment.

Sebastian, Vlastimil,
what do you prefer ?
On Sun, 30 Mar 2025 at 14:30, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> But to avoid being finger pointed, I'll switch to checking alloc_flags
> first. It does seem a better trade off to avoid cache bouncing because
> of 2nd cmpxchg. Though when I wrote it this way I convinced myself and
> others that it's faster to do trylock first to avoid branch misprediction.

Yes, the really hot paths (ie core locking) do the "trylock -> read spinning" for that reason. Then for the normal case, _only_ the trylock is in the path, and that's the best of both worlds.

And in practice, the "do two compare-and-exchange" operations actually does work fine, because the cacheline will generally be sticky enough that you don't actually get many extra cacheline bouncing.

So I'm not sure it matters in the end, but I did react to it.

            Linus
On Sun, Mar 30, 2025 at 3:08 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Sun, 30 Mar 2025 at 14:30, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > But to avoid being finger pointed, I'll switch to checking alloc_flags
> > first. It does seem a better trade off to avoid cache bouncing because
> > of 2nd cmpxchg. Though when I wrote it this way I convinced myself and
> > others that it's faster to do trylock first to avoid branch misprediction.
>
> Yes, the really hot paths (ie core locking) do the "trylock -> read
> spinning" for that reason. Then for the normal case, _only_ the
> trylock is in the path, and that's the best of both worlds.
>
> And in practice, the "do two compare-and-exchange" operations actually
> does work fine, because the cacheline will generally be sticky enough
> that you don't actually get many extra cacheline bouncing.

Right, but I also realized that in the contended case there is an unnecessary irq save/restore pair.

Posted the fix:
https://lore.kernel.org/bpf/20250331002809.94758-1-alexei.starovoitov@gmail.com/
maybe apply directly?

I'll send the renaming fix once we converge on a good name.
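
For context, the save/restore pair comes from the original ordering: a failed spin_trylock_irqsave() disables and re-enables interrupts before spin_lock_irqsave() disables them again. The actual fix is at the link above; the sketch below only restates Linus' variant (b) with comments on the irq behavior and is not the posted patch:

        if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
                /* any-context caller: one failed trylock, then bail out */
                if (!spin_trylock_irqsave(&zone->lock, flags))
                        return NULL;
        } else {
                /*
                 * regular caller: interrupts are disabled exactly once,
                 * instead of save/restore in a failed trylock followed
                 * by another save in the contended spin_lock_irqsave()
                 */
                spin_lock_irqsave(&zone->lock, flags);
        }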
On 2025-03-30 14:49:25 [-0700], Alexei Starovoitov wrote:
> On Sun, Mar 30, 2025 at 1:56 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > So maybe "nmisafe_local_lock_t" or something in that vein?
> >
> > Please fix this up, There aren't *that* many users of
> > "localtry_xyzzy", let's get this fixed before there are more of them.
>
> Ok. Agree with the reasoning that the name doesn't quite fit.
>
> nmisafe_local_lock_t name works for me,
> though nmisafe_local_lock_irqsave() is a bit verbose.
>
> Don't have better name suggestions at the moment.
>
> Sebastian, Vlastimil,
> what do you prefer ?

nmisafe_local_lock_t sounds okay assuming the "nmisafe" part does not make it look like it can be used without the trylock part in NMI context. But yeah, it sounds better than the previous one.

Sebastian
On 3/31/25 09:14, Sebastian Sewior wrote:
> On 2025-03-30 14:49:25 [-0700], Alexei Starovoitov wrote:
>> On Sun, Mar 30, 2025 at 1:56 PM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>> > So maybe "nmisafe_local_lock_t" or something in that vein?
>> >
>> > Please fix this up, There aren't *that* many users of
>> > "localtry_xyzzy", let's get this fixed before there are more of them.
>>
>> Ok. Agree with the reasoning that the name doesn't quite fit.
>>
>> nmisafe_local_lock_t name works for me,
>> though nmisafe_local_lock_irqsave() is a bit verbose.
>>
>> Don't have better name suggestions at the moment.
>>
>> Sebastian, Vlastimil,
>> what do you prefer ?
>
> nmisafe_local_lock_t sounds okay assuming the "nmisafe" part does not
> make it look like it can be used without the trylock part in NMI context.

Yes I was going to point out that e.g. "nmisafe_local_lock_irqsave()" seems rather misleading to me as this operation is not a nmisafe one?

I think it comes back to what we discussed in previous versions of the patchset. IMHO conceptually it's still a local_lock, it just supports the new trylock operations. However, to make them possible (on non-RT) if a particular lock instance is to be used with the trylock anywhere, it needs the new "acquired" field, and for the non-trylock operations to work with the field too.

So (to inform Linus) the earlier attempt [1] was to just change the existing local_lock_t, but that adds the overhead of the field and manipulating it for instances that don't use trylock.

The following attempt [2] meant there would be only a new local_trylock_t type, but the existing locking operations would remain the same, relying on _Generic() parts inside them. It was preferred to make the implementation more obvious, but then we have the naming issue for the operations in addition to the type...

[1] https://lore.kernel.org/bpf/20241218030720.1602449-4-alexei.starovoitov@gmail.com/
[2] https://lore.kernel.org/bpf/20250124035655.78899-4-alexei.starovoitov@gmail.com/

> But yeah, it sounds better than the previous one.
>
> Sebastian
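
For readers who have not seen the _Generic() trick, the toy program below shows the dispatch mechanism in isolation. All names here are made up for illustration; the kernel's actual local_lock_internal.h code is more involved, but the principle of one operation name statically selecting per-type behavior is the same:

        #include <assert.h>
        #include <stdio.h>

        typedef struct { int dummy; }    toy_lock_t;      /* stand-in for local_lock_t    */
        typedef struct { int acquired; } toy_trylock_t;   /* stand-in for local_trylock_t */

        static void toy_acquire_plain(toy_lock_t *l)      { (void)l; /* no extra state */ }
        static void toy_acquire_trylock(toy_trylock_t *l) { assert(!l->acquired); l->acquired = 1; }

        /* one "operation"; the lock's type picks the implementation at compile time,
         * and any other pointer type is a compile error (no default association)
         */
        #define toy_acquire(l)                                  \
                _Generic((l),                                   \
                        toy_trylock_t *: toy_acquire_trylock,   \
                        toy_lock_t *:    toy_acquire_plain)(l)

        int main(void)
        {
                toy_lock_t plain = { 0 };
                toy_trylock_t try = { 0 };

                toy_acquire(&plain);    /* dispatches to toy_acquire_plain   */
                toy_acquire(&try);      /* dispatches to toy_acquire_trylock */
                printf("acquired=%d\n", try.acquired);
                return 0;
        }

Only the toy_trylock_t instances pay for maintaining the "acquired" field, which mirrors the trade-off Vlastimil describes between attempts [1] and [2].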
On 3/31/25 00:08, Linus Torvalds wrote:
> On Sun, 30 Mar 2025 at 14:30, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> But to avoid being finger pointed, I'll switch to checking alloc_flags
>> first. It does seem a better trade off to avoid cache bouncing because
>> of 2nd cmpxchg. Though when I wrote it this way I convinced myself and
>> others that it's faster to do trylock first to avoid branch misprediction.
>
> Yes, the really hot paths (ie core locking) do the "trylock -> read
> spinning" for that reason. Then for the normal case, _only_ the
> trylock is in the path, and that's the best of both worlds.

I've been wondering if spin locks could expose the contended slowpath so we can trylock, and on failure do the check and then call the slowpath directly that doesn't include another trylock.

It would also be nice if the trylock part could become inline and only the slowpath would be a function call - even during normal spin_lock_*() operation. AFAIK right now everything is a function call on x86_64. Not sure how feasible would that be with the alternatives and paravirt stuff we do.

> And in practice, the "do two compare-and-exchange" operations actually
> does work fine, because the cacheline will generally be sticky enough
> that you don't actually get many extra cacheline bouncing.
>
> So I'm not sure it matters in the end, but I did react to it.
>
> Linus
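
A purely hypothetical sketch of the "exposed slowpath" idea: the spin_lock_slowpath_irqsave() name does not exist in the kernel today, it only illustrates how the allocation path could avoid repeating the trylock that just failed:

        if (!spin_trylock_irqsave(&zone->lock, flags)) {
                if (unlikely(alloc_flags & ALLOC_TRYLOCK))
                        return NULL;
                /* hypothetical API: go straight to queued spinning,
                 * skipping the redundant trylock inside spin_lock_irqsave()
                 */
                spin_lock_slowpath_irqsave(&zone->lock, flags);
        }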
On Sun, Mar 30, 2025 at 02:30:15PM -0700, Alexei Starovoitov wrote:
> On Sun, Mar 30, 2025 at 1:42 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, 27 Mar 2025 at 07:52, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > The pull includes work from Sebastian, Vlastimil and myself
> > > with a lot of help from Michal and Shakeel.
> > > This is a first step towards making kmalloc reentrant to get rid
> > > of slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc.
> > > These patches make page allocator safe from any context.
> >
> > So I've pulled this too, since it looked generally fine.
>
> Thanks!
>
> > The one reaction I had is that when you basically change
> >
> >         spin_lock_irqsave(&zone->lock, flags);
> >
> > into
> >
> >         if (!spin_trylock_irqsave(&zone->lock, flags)) {
> >                 if (unlikely(alloc_flags & ALLOC_TRYLOCK))
> >                         return NULL;
> >                 spin_lock_irqsave(&zone->lock, flags);
> >         }
> >
> > we've seen bad cache behavior for this kind of pattern in other
> > situations: if the "try" fails, the subsequent "do the lock for real"
> > case now does the wrong thing, in that it will immediately try again
> > even if it's almost certainly just going to fail - causing extra write
> > cache accesses.
> >
> > So typically, in places that can see contention, it's better to either do
> >
> >  (a) trylock followed by a slowpath that takes the fact that it was
> > locked into account and does a read-only loop until it sees otherwise
> >
> >      This is, for example, what the mutex code does with that
> > __mutex_trylock() -> mutex_optimistic_spin() pattern, but our
> > spinlocks end up doing similar things (ie "trylock" followed by
> > "release irq and do the 'relax loop' thing).
>
> Right,
> __mutex_trylock(lock) -> mutex_optimistic_spin() pattern is
> equivalent to 'pending' bit spinning in qspinlock.
>
> > or
> >
> > (b) do the trylock and lock separately, ie
> >
> >         if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
> >                 if (!spin_trylock_irqsave(&zone->lock, flags))
> >                         return NULL;
> >         } else
> >                 spin_lock_irqsave(&zone->lock, flags);
> >
> > so that you don't end up doing two cache accesses for ownership that
> > can cause extra bouncing.
>
> Ok, I will switch to above.
>
> > I'm not sure this matters at all in the allocation path - contention
> > may simply not be enough of an issue, and the trylock is purely about
> > "unlikely NMI worries", but I do worry that you might have made the
> > normal case slower.
>
> We actually did see zone->lock being contended in production.
> Last time the culprit was an inadequate per-cpu caching and
> these series in 6.11 fixed it:
> https://lwn.net/Articles/947900/
> I don't think we've seen it contended in the newer kernels.
>
> Johannes, pls correct me if I'm wrong.

Contention should indeed be rare in practice. This has become a very coarse lock, with nowadays hundreds of HW threads hitting still only one or two zones. A lot rides on the fastpath per-cpu caches, and it becomes noticeable very quickly if those are sized inappropriately.

> But to avoid being finger pointed, I'll switch to checking alloc_flags
> first. It does seem a better trade off to avoid cache bouncing because
> of 2nd cmpxchg. Though when I wrote it this way I convinced myself and
> others that it's faster to do trylock first to avoid branch misprediction.

If you haven't yet, it could be interesting to check if/where branches are generated at all, given the proximity and the heavy inlining between where you pass the flag and where it's tested.
On Mon, 31 Mar 2025 at 02:59, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Yes I was going to point out that e.g. "nmisafe_local_lock_irqsave()" seems
> rather misleading to me as this operation is not a nmisafe one?

Yeah, it's not a great name either, I'll admit.

> The following attempt [2] meant there would be only a new local_trylock_t
> type, but the existing locking operations would remain the same, relying on
> _Generic() parts inside them.

Hmm. I actually like that approach.

That avoids having the misleading operation naming. IOW, you'd not have a "localtry" when it's not a trylock, and you'd not have "nmisafe" when it's not an operation that is actually nmi-safe.

The downside of _Generic() is that it's a bit subtle and can hide the actual operation, but I think that in this situation that's the whole point.

So yes, I'd vote for the "let's just introduce the new type that has the required 'acquired' field, and then use a _Generic() model to automatically pick the right op".

            Linus
On Mon, Mar 31, 2025 at 8:35 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Mon, 31 Mar 2025 at 02:59, Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > Yes I was going to point out that e.g. "nmisafe_local_lock_irqsave()" seems
> > rather misleading to me as this operation is not a nmisafe one?
>
> Yeah, it's not a great name either, I'll admit.
>
> > The following attempt [2] meant there would be only a new local_trylock_t
> > type, but the existing locking operations would remain the same, relying on
> > _Generic() parts inside them.
>
> Hmm. I actually like that approach.
>
> That avoids having the misleading operation naming. IOW, you'd not
> have a "localtry" when it's not a trylock, and you'd not have
> "nmisafe" when it's not an operation that is actually nmi-safe.
>
> The downside of _Generic() is that it's a bit subtle and can hide the
> actual operation, but I think that in this situation that's the whole
> point.
>
> So yes, I'd vote for the "let's just introduce the new type that has
> the required 'acquired' field, and then use a _Generic() model to
> automatically pick the right op".

Here is the patch that goes back to that approach:
https://lore.kernel.org/bpf/20250401005134.14433-1-alexei.starovoitov@gmail.com/

btw the compiler error when local_lock_t (instead of local_trylock_t) is passed into local_trylock_irqsave() is imo quite readable:

../mm/memcontrol.c: In function ‘consume_stock’:
../include/linux/local_lock_internal.h:136:20: error: assignment to ‘local_trylock_t *’ from incompatible pointer type ‘local_lock_t *’ [-Werror=incompatible-pointer-types]
  136 |                 tl = this_cpu_ptr(lock);                        \
      |                    ^
../include/linux/local_lock.h:76:9: note: in expansion of macro ‘__local_trylock_irqsave’
   76 |         __local_trylock_irqsave(lock, flags)
      |         ^~~~~~~~~~~~~~~~~~~~~~~
../mm/memcontrol.c:1790:19: note: in expansion of macro ‘local_trylock_irqsave’
 1790 |         else if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags))
      |                   ^~~~~~~~~~~~~~~~~~~~~
Hi Linus,

The following changes since commit 2014c95afecee3e76ca4a56956a936e23283f05b:

  Linux 6.14-rc1 (2025-02-02 15:39:26 -0800)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git tags/bpf_try_alloc_pages

for you to fetch changes up to f90b474a35744b5d43009e4fab232e74a3024cae:

  mm: Fix the flipped condition in gfpflags_allow_spinning() (2025-03-15 11:18:19 -0700)

----------------------------------------------------------------
Please pull after main MM changes.

The pull includes work from Sebastian, Vlastimil and myself with a lot of help from Michal and Shakeel.
This is a first step towards making kmalloc reentrant to get rid of slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc.
These patches make page allocator safe from any context.

Vlastimil kicked off this effort at LSFMM 2024:
https://lwn.net/Articles/974138/
and we continued at LSFMM 2025:
https://lore.kernel.org/all/CAADnVQKfkGxudNUkcPJgwe3nTZ=xohnRshx9kLZBTmR_E1DFEg@mail.gmail.com/

Why:
SLAB wrappers bind memory to a particular subsystem making it unavailable to the rest of the kernel. Some BPF maps in production consume Gbytes of preallocated memory. Top 5 in Meta: 1.5G, 1.2G, 1.1G, 300M, 200M.
Once we have kmalloc that works in any context BPF map preallocation won't be necessary.

How:
Synchronous kmalloc/page alloc stack has multiple stages going from fast to slow:
cmpxchg16 -> slab_alloc -> new_slab -> alloc_pages -> rmqueue_pcplist -> __rmqueue.
rmqueue_pcplist was already relying on trylock. This set changes rmqueue_bulk/rmqueue_buddy to attempt a trylock and return ENOMEM if alloc_flags & ALLOC_TRYLOCK. Then it wraps this functionality into try_alloc_pages() helper. We make sure that the logic is sane in PREEMPT_RT.

End result: try_alloc_pages()/free_pages_nolock() are safe to call from any context.

try_kmalloc() for any context with similar trylock approach will follow. It will use try_alloc_pages() when slab needs a new page.

Though such try_kmalloc/page_alloc() is an opportunistic allocator, this design ensures that the probability of successful allocation of small objects (up to one page in size) is high.

Even before we have try_kmalloc(), we already use try_alloc_pages() in BPF arena implementation and it's going to be used more extensively in BPF.

Once the set was applied to bpf-next we ran into two small conflicts with the MM tree as reported by Stephen:
https://lore.kernel.org/bpf/20250311120422.1d9a8f80@canb.auug.org.au/
https://lore.kernel.org/bpf/20250312145247.380c2aa5@canb.auug.org.au/

So Andrew suggested to keep things as-is instead of moving the patchset between the trees before the merge window:
https://lore.kernel.org/all/20250317132710.fbcde1c8bb66f91f36e78c89@linux-foundation.org/

Note the "locking/local_lock: Introduce localtry_lock_t" patch is later used in Vlastimil's sheaves and in Shakeel's changes.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

----------------------------------------------------------------
Alexei Starovoitov (6):
      mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
      mm, bpf: Introduce free_pages_nolock()
      memcg: Use trylock to access memcg stock_lock.
      mm, bpf: Use memcg in try_alloc_pages().
      bpf: Use try_alloc_pages() to allocate pages for bpf needs.
      Merge branch 'bpf-mm-introduce-try_alloc_pages'

Sebastian Andrzej Siewior (1):
      locking/local_lock: Introduce localtry_lock_t

Vlastimil Babka (1):
      mm: Fix the flipped condition in gfpflags_allow_spinning()

 include/linux/bpf.h                 |   2 +-
 include/linux/gfp.h                 |  23 ++++
 include/linux/local_lock.h          |  70 +++++++++++++
 include/linux/local_lock_internal.h | 146 ++++++++++++++++++++++++++
 include/linux/mm_types.h            |   4 +
 include/linux/mmzone.h              |   3 +
 kernel/bpf/arena.c                  |   5 +-
 kernel/bpf/syscall.c                |  23 +++-
 lib/stackdepot.c                    |  10 +-
 mm/internal.h                       |   1 +
 mm/memcontrol.c                     |  53 +++++++--
 mm/page_alloc.c                     | 203 +++++++++++++++++++++++++++++++++---
 mm/page_owner.c                     |   8 +-
 13 files changed, 509 insertions(+), 42 deletions(-)
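
For context, a minimal sketch of how a caller might use the two new helpers described above, assuming the try_alloc_pages(nid, order) and free_pages_nolock(page, order) calling convention from this series; both are opportunistic, so callers must tolerate a NULL return:

        #include <linux/gfp.h>

        /* Safe to call from any context (IRQ, NMI, with locks held):
         * the allocator only ever trylocks on this path and returns
         * NULL instead of sleeping or spinning on zone->lock.
         */
        static struct page *grab_scratch_page(int nid)
        {
                return try_alloc_pages(nid, 0);         /* order-0, may fail */
        }

        static void drop_scratch_page(struct page *page)
        {
                if (page)
                        free_pages_nolock(page, 0);     /* also any-context safe */
        }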