| Message ID | 20250327145159.99799-1-alexei.starovoitov@gmail.com (mailing list archive) |
|---|---|
| State | New |
| Series | [GIT,PULL] Introduce try_alloc_pages for 6.15 |
On Thu, 27 Mar 2025 at 07:52, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> The pull includes work from Sebastian, Vlastimil and myself
> with a lot of help from Michal and Shakeel.
> This is a first step towards making kmalloc reentrant to get rid
> of slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc.
> These patches make page allocator safe from any context.

So I've pulled this too, since it looked generally fine.

The one reaction I had is that when you basically change

        spin_lock_irqsave(&zone->lock, flags);

into

        if (!spin_trylock_irqsave(&zone->lock, flags)) {
                if (unlikely(alloc_flags & ALLOC_TRYLOCK))
                        return NULL;
                spin_lock_irqsave(&zone->lock, flags);
        }

we've seen bad cache behavior for this kind of pattern in other
situations: if the "try" fails, the subsequent "do the lock for real"
case now does the wrong thing, in that it will immediately try again
even if it's almost certainly just going to fail - causing extra
write cache accesses.

So typically, in places that can see contention, it's better to
either do

 (a) trylock followed by a slowpath that takes the fact that it was
     locked into account and does a read-only loop until it sees
     otherwise

     This is, for example, what the mutex code does with that
     __mutex_trylock() -> mutex_optimistic_spin() pattern, but our
     spinlocks end up doing similar things (ie "trylock" followed by
     "release irq and do the 'relax loop'" thing).

or

 (b) do the trylock and lock separately, ie

        if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
                if (!spin_trylock_irqsave(&zone->lock, flags))
                        return NULL;
        } else
                spin_lock_irqsave(&zone->lock, flags);

so that you don't end up doing two cache accesses for ownership that
can cause extra bouncing.

I'm not sure this matters at all in the allocation path - contention
may simply not be enough of an issue, and the trylock is purely about
"unlikely NMI worries", but I do worry that you might have made the
normal case slower.

It's easily fixable later if it ends up being the case, so I don't
worry too much about it, but I did want to mention it since going
through the code made me react to it.

               Linus
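A minimal, self-contained sketch of pattern (a) may help here: it uses
C11 atomics rather than the kernel's real spinlock/mutex code, and the
toy_* names are invented for the illustration. The shape is what
matters: a single atomic RMW on the uncontended fast path, and a
read-only spin on the contended slowpath that retries the RMW only
once the lock looks free.

  /*
   * Toy illustration of pattern (a), NOT kernel code.
   * Fast path: one atomic RMW.  Slowpath: read-only spin, then retry
   * the RMW only after the lock has been observed free.
   */
  #include <stdatomic.h>
  #include <stdbool.h>

  typedef struct { atomic_int locked; } toy_lock_t;

  bool toy_trylock(toy_lock_t *l)
  {
          int expected = 0;
          /* The only write-side cache access on the uncontended path. */
          return atomic_compare_exchange_strong_explicit(
                          &l->locked, &expected, 1,
                          memory_order_acquire, memory_order_relaxed);
  }

  void toy_lock(toy_lock_t *l)
  {
          if (toy_trylock(l))
                  return;         /* uncontended: done after one RMW */

          for (;;) {
                  /* Read-only spin keeps the cacheline shared instead of bouncing it. */
                  while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                          ;
                  if (toy_trylock(l))     /* retry the RMW only when it can succeed */
                          return;
          }
  }

  void toy_unlock(toy_lock_t *l)
  {
          atomic_store_explicit(&l->locked, 0, memory_order_release);
  }

By contrast, the combined trylock-then-lock form quoted above always
pays a failed RMW before falling into the generic lock path on
contention, which is the extra write-side cache access Linus is
pointing at.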
On Sun, 30 Mar 2025 at 13:42, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The one reaction I had is that when you basically change

Oh, actually, two reactions now that I fixed up the merge build issue
which forced me to look at the function naming.

That 'localtry_lock_irqsave()' naming is horrendous.

The "try" part in it makes me think it's a trylock.

But no, the "localtry" just comes from the lock naming, and the
trylock version is localtry_trylock_irqsave.

That's horrible. I missed that on the first read-through, and I'm not
unpulling it because the code generally makes sense. But I do think
that the lock name needs fixing.

"localtry_lock_t" is not a good name, and spreading that odd
"localtry" into the actual (non-try) locking functions makes the
naming actively insane.

If the *only* operation you could do on the lock was "trylock", then
"localtry" would be fine. Then the lock literally is an "only try"
thing. But as it is, the naming now ends up actively broken.

Honestly, the lock name should probably reflect the fact that it can
be used from any context (with a "trylock"), not the trylock part
itself.

So maybe "nmisafe_local_lock_t" or something in that vein?

Please fix this up. There aren't *that* many users of
"localtry_xyzzy", let's get this fixed before there are more of them.

               Linus
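To see why the "localtry" prefix spreading into the non-try operations
reads oddly, here is a rough sketch of the API shape under discussion.
Only the localtry_lock_irqsave()/localtry_trylock_irqsave() names come
from the thread itself; the unlock name and the (lock, flags) calling
convention are assumed by analogy with the existing local_lock API,
and the per-CPU cache structure is entirely made up.

  /*
   * Rough sketch of the localtry API shape (names from the thread,
   * everything else illustrative).  The same lock type is used both
   * by ordinary code via the non-try acquire and by any-context code
   * via the trylock -- which is what makes the "try" in the type name
   * misleading.
   */
  #include <linux/local_lock.h>
  #include <linux/percpu.h>

  struct pcpu_cache_sketch {
          localtry_lock_t lock;
          unsigned int cached_objs;
  };

  static DEFINE_PER_CPU(struct pcpu_cache_sketch, pcpu_cache_sketch);

  /* Process/softirq context: plain (non-try) acquire. */
  static void refill_cache_sketch(unsigned int n)
  {
          unsigned long flags;

          localtry_lock_irqsave(&pcpu_cache_sketch.lock, flags);
          __this_cpu_add(pcpu_cache_sketch.cached_objs, n);
          localtry_unlock_irqrestore(&pcpu_cache_sketch.lock, flags);
  }

  /* Any context (including NMI): only the trylock form is safe. */
  static bool try_consume_cache_sketch(void)
  {
          unsigned long flags;

          if (!localtry_trylock_irqsave(&pcpu_cache_sketch.lock, flags))
                  return false;   /* caller must handle the failure */
          if (__this_cpu_read(pcpu_cache_sketch.cached_objs))
                  __this_cpu_dec(pcpu_cache_sketch.cached_objs);
          localtry_unlock_irqrestore(&pcpu_cache_sketch.lock, flags);
          return true;
  }

The same lock type serves both the plain acquire in known-safe context
and the trylock from arbitrary context, which is exactly why a name
describing the context guarantee fits better than one describing the
"try" operation.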
The pull request you sent on Thu, 27 Mar 2025 10:51:59 -0400:
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git tags/bpf_try_alloc_pages
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/aa918db707fba507e85217961643281ee8dfb2ed
Thank you!
On Sun, Mar 30, 2025 at 1:42 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 27 Mar 2025 at 07:52, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > The pull includes work from Sebastian, Vlastimil and myself
> > with a lot of help from Michal and Shakeel.
> > This is a first step towards making kmalloc reentrant to get rid
> > of slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc.
> > These patches make page allocator safe from any context.
>
> So I've pulled this too, since it looked generally fine.

Thanks!

> The one reaction I had is that when you basically change
>
>         spin_lock_irqsave(&zone->lock, flags);
>
> into
>
>         if (!spin_trylock_irqsave(&zone->lock, flags)) {
>                 if (unlikely(alloc_flags & ALLOC_TRYLOCK))
>                         return NULL;
>                 spin_lock_irqsave(&zone->lock, flags);
>         }
>
> we've seen bad cache behavior for this kind of pattern in other
> situations: if the "try" fails, the subsequent "do the lock for real"
> case now does the wrong thing, in that it will immediately try again
> even if it's almost certainly just going to fail - causing extra
> write cache accesses.
>
> So typically, in places that can see contention, it's better to
> either do
>
>  (a) trylock followed by a slowpath that takes the fact that it was
>      locked into account and does a read-only loop until it sees
>      otherwise
>
>      This is, for example, what the mutex code does with that
>      __mutex_trylock() -> mutex_optimistic_spin() pattern, but our
>      spinlocks end up doing similar things (ie "trylock" followed by
>      "release irq and do the 'relax loop'" thing).

Right, the __mutex_trylock(lock) -> mutex_optimistic_spin() pattern
is equivalent to the 'pending' bit spinning in qspinlock.

> or
>
>  (b) do the trylock and lock separately, ie
>
>         if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
>                 if (!spin_trylock_irqsave(&zone->lock, flags))
>                         return NULL;
>         } else
>                 spin_lock_irqsave(&zone->lock, flags);
>
> so that you don't end up doing two cache accesses for ownership that
> can cause extra bouncing.

Ok, I will switch to the above.

> I'm not sure this matters at all in the allocation path - contention
> may simply not be enough of an issue, and the trylock is purely about
> "unlikely NMI worries", but I do worry that you might have made the
> normal case slower.

We actually did see zone->lock being contended in production.
Last time the culprit was inadequate per-cpu caching and this series
in 6.11 fixed it:
https://lwn.net/Articles/947900/

I don't think we've seen it contended in the newer kernels.
Johannes, pls correct me if I'm wrong.

But to avoid any finger pointing, I'll switch to checking alloc_flags
first. It does seem a better trade-off to avoid cache bouncing because
of the 2nd cmpxchg, though when I wrote it this way I convinced myself
and others that it's faster to do the trylock first to avoid branch
misprediction.
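For completeness, the shape being agreed to here would look roughly
like the condensed sketch below, as it might sit inside
mm/page_alloc.c. The function name is invented and the body is heavily
simplified; only ALLOC_TRYLOCK and the spin_lock/spin_trylock calls
reflect the actual change under discussion.

  /*
   * Condensed sketch of variant (b) applied to the zone-lock path.
   * Not the real rmqueue code -- just the locking shape: the
   * opportunistic caller issues one trylock and bails out, while the
   * blocking caller never runs a doomed trylock first.
   */
  static struct page *rmqueue_locked_sketch(struct zone *zone, unsigned int order,
                                            int migratetype, unsigned int alloc_flags)
  {
          struct page *page;
          unsigned long flags;

          if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
                  /* Any-context caller: one trylock, tolerate failure. */
                  if (!spin_trylock_irqsave(&zone->lock, flags))
                          return NULL;
          } else {
                  /* Normal caller: no extra RMW before the real lock. */
                  spin_lock_irqsave(&zone->lock, flags);
          }

          /* The real code does the __rmqueue() work here, with retries etc. */
          page = __rmqueue(zone, order, migratetype, alloc_flags);

          spin_unlock_irqrestore(&zone->lock, flags);
          return page;
  }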
On Sun, Mar 30, 2025 at 1:56 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But I do think that the lock name needs fixing.
>
> "localtry_lock_t" is not a good name, and spreading that odd
> "localtry" into the actual (non-try) locking functions makes the
> naming actively insane.
>
> If the *only* operation you could do on the lock was "trylock", then
> "localtry" would be fine. Then the lock literally is an "only try"
> thing. But as it is, the naming now ends up actively broken.
>
> Honestly, the lock name should probably reflect the fact that it can
> be used from any context (with a "trylock"), not the trylock part
> itself.
>
> So maybe "nmisafe_local_lock_t" or something in that vein?
>
> Please fix this up. There aren't *that* many users of
> "localtry_xyzzy", let's get this fixed before there are more of them.

Ok. Agree with the reasoning that the name doesn't quite fit.
The nmisafe_local_lock_t name works for me, though
nmisafe_local_lock_irqsave() is a bit verbose.
I don't have better name suggestions at the moment.

Sebastian, Vlastimil, what do you prefer?
On Sun, 30 Mar 2025 at 14:30, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> But to avoid any finger pointing, I'll switch to checking alloc_flags
> first. It does seem a better trade-off to avoid cache bouncing because
> of the 2nd cmpxchg, though when I wrote it this way I convinced myself
> and others that it's faster to do the trylock first to avoid branch
> misprediction.

Yes, the really hot paths (ie core locking) do the "trylock -> read
spinning" for that reason. Then for the normal case, _only_ the
trylock is in the path, and that's the best of both worlds.

And in practice, doing two compare-and-exchange operations actually
does work fine, because the cacheline will generally be sticky enough
that you don't actually get much extra cacheline bouncing.

So I'm not sure it matters in the end, but I did react to it.

               Linus
Hi Linus,

The following changes since commit 2014c95afecee3e76ca4a56956a936e23283f05b:

  Linux 6.14-rc1 (2025-02-02 15:39:26 -0800)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git tags/bpf_try_alloc_pages

for you to fetch changes up to f90b474a35744b5d43009e4fab232e74a3024cae:

  mm: Fix the flipped condition in gfpflags_allow_spinning() (2025-03-15 11:18:19 -0700)

----------------------------------------------------------------
Please pull after the main MM changes.

The pull includes work from Sebastian, Vlastimil and myself,
with a lot of help from Michal and Shakeel.
This is a first step towards making kmalloc reentrant, to get rid
of slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc.
These patches make the page allocator safe from any context.

Vlastimil kicked off this effort at LSFMM 2024:
https://lwn.net/Articles/974138/
and we continued at LSFMM 2025:
https://lore.kernel.org/all/CAADnVQKfkGxudNUkcPJgwe3nTZ=xohnRshx9kLZBTmR_E1DFEg@mail.gmail.com/

Why:
SLAB wrappers bind memory to a particular subsystem, making it
unavailable to the rest of the kernel. Some BPF maps in production
consume Gbytes of preallocated memory. Top 5 in Meta: 1.5G, 1.2G,
1.1G, 300M, 200M. Once we have a kmalloc that works in any context,
BPF map preallocation won't be necessary.

How:
The synchronous kmalloc/page-alloc stack has multiple stages going
from fast to slow: cmpxchg16 -> slab_alloc -> new_slab -> alloc_pages
-> rmqueue_pcplist -> __rmqueue. rmqueue_pcplist was already relying
on trylock. This set changes rmqueue_bulk/rmqueue_buddy to attempt a
trylock and return ENOMEM if alloc_flags & ALLOC_TRYLOCK, then wraps
this functionality into the try_alloc_pages() helper. We make sure
that the logic is sane on PREEMPT_RT.

End result: try_alloc_pages()/free_pages_nolock() are safe to call
from any context (see the usage sketch after the diffstat below).

try_kmalloc() for any context, with a similar trylock approach, will
follow. It will use try_alloc_pages() when slab needs a new page.
Though such a try_kmalloc()/page_alloc() is an opportunistic
allocator, this design ensures that the probability of successful
allocation of small objects (up to one page in size) is high.

Even before we have try_kmalloc(), we already use try_alloc_pages()
in the BPF arena implementation, and it's going to be used more
extensively in BPF.

Once the set was applied to bpf-next we ran into two small conflicts
with the MM tree, as reported by Stephen:
https://lore.kernel.org/bpf/20250311120422.1d9a8f80@canb.auug.org.au/
https://lore.kernel.org/bpf/20250312145247.380c2aa5@canb.auug.org.au/

So Andrew suggested keeping things as-is instead of moving the
patchset between the trees before the merge window:
https://lore.kernel.org/all/20250317132710.fbcde1c8bb66f91f36e78c89@linux-foundation.org/

Note: the "locking/local_lock: Introduce localtry_lock_t" patch is
later used in Vlastimil's sheaves and in Shakeel's changes.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

----------------------------------------------------------------
Alexei Starovoitov (6):
      mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
      mm, bpf: Introduce free_pages_nolock()
      memcg: Use trylock to access memcg stock_lock.
      mm, bpf: Use memcg in try_alloc_pages().
      bpf: Use try_alloc_pages() to allocate pages for bpf needs.
      Merge branch 'bpf-mm-introduce-try_alloc_pages'

Sebastian Andrzej Siewior (1):
      locking/local_lock: Introduce localtry_lock_t

Vlastimil Babka (1):
      mm: Fix the flipped condition in gfpflags_allow_spinning()

 include/linux/bpf.h                 |   2 +-
 include/linux/gfp.h                 |  23 ++++
 include/linux/local_lock.h          |  70 +++++++++++
 include/linux/local_lock_internal.h | 146 ++++++++++++++++++++++++
 include/linux/mm_types.h            |   4 +
 include/linux/mmzone.h              |   3 +
 kernel/bpf/arena.c                  |   5 +-
 kernel/bpf/syscall.c                |  23 +++-
 lib/stackdepot.c                    |  10 +-
 mm/internal.h                       |   1 +
 mm/memcontrol.c                     |  53 +++++---
 mm/page_alloc.c                     | 203 +++++++++++++++++++++++++++++---
 mm/page_owner.c                     |   8 +-
 13 files changed, 509 insertions(+), 42 deletions(-)
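To make the "safe to call from any context" end result concrete, here
is a minimal, hypothetical caller of the two new helpers. The grab_/
release_ functions and the scenario are invented for illustration; the
helper signatures are assumed to be the ones this series adds, i.e.
struct page *try_alloc_pages(int nid, unsigned int order) and
void free_pages_nolock(struct page *page, unsigned int order).

  /*
   * Hypothetical caller of the new any-context page allocation
   * helpers.  The grab_/release_ functions are made up; the
   * try_alloc_pages()/free_pages_nolock() signatures are assumed to
   * match the ones introduced by this series.
   */
  #include <linux/gfp.h>
  #include <linux/mm.h>

  void *grab_scratch_page(void)
  {
          struct page *page;

          /*
           * Opportunistic: may be called from NMI/tracing/BPF context.
           * On contention or low memory it returns NULL instead of
           * spinning or sleeping, so the caller must tolerate failure.
           */
          page = try_alloc_pages(NUMA_NO_NODE, 0);
          if (!page)
                  return NULL;

          return page_address(page);
  }

  void release_scratch_page(void *addr)
  {
          /* The matching free helper, likewise safe from any context. */
          if (addr)
                  free_pages_nolock(virt_to_page(addr), 0);
  }

Failure is an expected outcome here rather than an error condition,
which matches how the cover letter describes the BPF arena usage and
the planned try_kmalloc().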